Retrieval-Augmented Generation (RAG) models have become increasingly important in Artificial Intelligence, Natural Language Processing (NLP), and Information Retrieval. Even so, data science experts caution against rushing into sophisticated RAG models before the evaluation pipeline is reliable and robust.
Researchers and practitioners are therefore encouraged to refine their evaluation procedures before investing in complex model enhancements. This is particularly critical because the efficacy of a RAG model hinges on both retrieval quality and generation capability.
When assessing RAG pipelines, it is helpful to divide the evaluation dimensions into two primary categories: retrieval dimensions and generation dimensions.
Retrieval Dimensions include:
1. Context Precision: Measures whether the chunks in the retrieved context that are relevant to the ground truth are ranked above the irrelevant ones (a minimal computation is sketched in code after this list).
2. Context Recall: Evaluates the extent to which the retrieved context covers the information in the ground-truth answer.
3. Context Relevance: Assesses how pertinent the retrieved context is to the question asked.
4. Context Entity Recall: Measures the fraction of entities in the ground truth that also appear in the retrieved context, i.e., the entities common to both divided by the entities in the ground truth (also covered in the sketch below).
5. Noise Robustness: Assesses how well the model handles noise documents that are related to the question but contain no useful information for answering it.
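To make the first and fourth of these metrics concrete, here is a minimal, framework-agnostic sketch in Python. The binary relevance judgements and the entity sets are assumed to be produced upstream (by an LLM judge, an NER model, or human annotation), and the function names are hypothetical rather than taken from any particular library:

```python
from typing import List, Set

def context_precision(relevance: List[int]) -> float:
    """Rank-aware context precision: the mean of precision@k taken at each
    rank k where the retrieved chunk is relevant (relevance[k-1] == 1).
    `relevance` holds one binary judgement per retrieved chunk, in rank order."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant rank
    return score / hits if hits else 0.0

def context_entity_recall(context_entities: Set[str],
                          truth_entities: Set[str]) -> float:
    """Fraction of ground-truth entities also present in the retrieved
    context: |context entities ∩ ground-truth entities| / |ground-truth entities|."""
    if not truth_entities:
        return 0.0
    return len(context_entities & truth_entities) / len(truth_entities)

# Relevant chunks were retrieved at ranks 1 and 3 (of four chunks):
print(context_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.83
# The context mentions Paris but misses France:
print(context_entity_recall({"Paris", "Seine"}, {"Paris", "France"}))  # 0.5
```

Ranking-aware precision rewards pipelines that surface relevant chunks early, which matters because downstream generators tend to attend more to the top of the context window.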
Generation Dimensions include:
1. Faithfulness: Assesses the factual consistency of the generated response with the retrieved context (a claim-level scoring sketch follows this list).
2. Answer Relevance: Measures how directly and appropriately the generated response addresses the given question.
3. Negative Rejection: Evaluates the model’s capability to refrain from responding when the retrieved documents don’t contain sufficient information for the query.
4. Information Integration: Assesses the model’s capacity to integrate information from multiple documents to answer complex queries.
5. Counterfactual Robustness: Measures the model’s ability to recognize and disregard known factual errors in the retrieved documents.
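As an illustration of claim-level faithfulness scoring, the sketch below computes the fraction of atomic claims from the answer that the context supports. Claim extraction and the `is_supported` judge (typically an LLM or NLI model) are assumed to exist upstream; the toy substring judge is for demonstration only:

```python
from typing import Callable, List

def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Claim-level faithfulness: the fraction of atomic claims extracted
    from the answer that the retrieved context supports. `is_supported`
    is whatever judge you trust (an LLM call, an NLI model, ...)."""
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)

# Toy judge for demonstration only: a claim counts as supported if it
# appears verbatim in the context (a real judge would check entailment).
toy_judge = lambda claim, ctx: claim.lower() in ctx.lower()

context = "The Eiffel Tower is in Paris. It opened in 1889."
claims = ["The Eiffel Tower is in Paris", "It opened in 1890"]
print(faithfulness(claims, context, toy_judge))  # 0.5: one claim unsupported
```

Decomposing the answer into atomic claims before judging keeps the score interpretable: a response that is mostly grounded but hallucinates one detail is penalized proportionally rather than all or nothing.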
Frameworks that implement these dimensions include Ragas, TruLens, ARES, DeepEval, Tonic Validate, and Langfuse. Each offers distinct features that can strengthen RAG evaluation and, in turn, the reliability and robustness of the pipeline. As a concrete example, the sketch below shows how Ragas scores several of the metrics discussed above.
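The following uses the classic Ragas interface (ragas 0.1.x); the API has changed across releases, and the LLM-judged metrics expect a judge model (an OpenAI API key by default) to be configured, so treat this as illustrative rather than definitive:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Column names follow the ragas 0.1.x convention; later releases
# introduced an EvaluationDataset abstraction instead.
data = {
    "question": ["When did the Eiffel Tower open?"],
    "answer": ["The Eiffel Tower opened in 1889."],
    "contexts": [["The Eiffel Tower, located in Paris, opened in 1889."]],
    "ground_truth": ["The Eiffel Tower opened in 1889."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores averaged over the dataset
```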
To conclude, the successful application of RAG models calls for a comprehensive assessment method comprising both retrieval and generation dimensions, built on a reliable and robust evaluation pipeline. Prioritizing the evaluation setup is essential before embarking on complex model enhancements.