Large language models (LLMs) have surged in popularity, but evaluating them remains challenging, particularly on highly specialised client tasks that require domain-specific knowledge. To address this, Amazon researchers have developed a new evaluation approach for Retrieval-Augmented Generation (RAG) systems that focuses on factual accuracy: a system's ability to retrieve and apply the correct information to answer user queries.
The method also yields insights into the factors that influence RAG performance, such as model size, retrieval mechanism, prompting technique, and fine-tuning procedure, helping users choose the best combination of components for their RAG systems.
The researchers have introduced an automated, quantitative, exam-based evaluation technique that can be scaled up or down, a departure from traditional human-in-the-loop evaluations that typically require expert annotators. In this method, an LLM generates an exam from the data corpus associated with a given task, and RAG systems are then evaluated on their ability to answer the exam's multiple-choice questions correctly. The approach aims to balance representativeness with ease of scoring, allowing the exam corpus to be improved regularly through feedback.
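To make the loop concrete, here is a minimal sketch of exam generation and scoring. The prompt wording, the `ExamItem` structure, and the `ask_rag_system` callable are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the exam-based evaluation loop (illustrative, not the authors' code).
# `ask_rag_system` is a hypothetical callable standing in for the RAG pipeline under test.
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExamItem:
    question: str
    choices: dict[str, str]   # letter -> option text, e.g. {"A": "...", ...}
    answer: str               # correct letter

def exam_prompt(passage: str) -> str:
    """Prompt asking an LLM to write one multiple-choice question from a corpus passage."""
    return (
        "Write one multiple-choice question that can be answered only from the passage below. "
        "Provide four options labelled A-D and state the correct letter.\n\nPassage:\n" + passage
    )

def score_rag(exam: list[ExamItem], ask_rag_system: Callable[[str], str]) -> float:
    """Accuracy of the RAG system on the generated multiple-choice exam."""
    correct = 0
    for item in exam:
        options = "\n".join(f"{k}) {v}" for k, v in item.choices.items())
        reply = ask_rag_system(f"{item.question}\n{options}\nAnswer with a single letter.")
        picked = re.search(r"\b([A-D])\b", reply.upper())
        correct += bool(picked and picked.group(1) == item.answer)
    return correct / len(exam)

# Toy demo: one hard-coded item and a dummy RAG system that always answers "B".
toy_exam = [ExamItem("Which AWS service stores objects?",
                     {"A": "EC2", "B": "S3", "C": "VPC", "D": "IAM"}, "B")]
print(score_rag(toy_exam, lambda prompt: "B"))   # -> 1.0
```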
Furthermore, the researchers proposed a methodological enhancement to this automated exam-generation process: they optimised the generated exams using Item Response Theory (IRT) to make them more informative about task-specific model performance. They illustrated and evaluated this technique on tasks across four distinct knowledge corpora: AWS DevOps troubleshooting manuals, arXiv abstracts, StackExchange questions, and SEC filings.
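As a rough illustration of the IRT step, the sketch below fits a two-parameter logistic (2PL) model to a hypothetical response matrix (rows are candidate RAG configurations, columns are exam items) and keeps the items with the highest Fisher information. The fitting routine, the simulated data, and the "keep the top half" rule are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of IRT-based exam refinement with a 2PL model fit by crude gradient ascent
# (illustrative only, not the authors' estimation procedure).
import numpy as np

def fit_2pl(X, steps=1000, lr=0.01):
    """Jointly estimate abilities (theta), discriminations (a) and difficulties (b)."""
    n_models, n_items = X.shape
    theta = np.zeros(n_models)
    a = np.ones(n_items)
    b = np.zeros(n_items)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # n_models x n_items
        resid = X - p
        grad_theta = (resid * a).sum(axis=1)
        grad_b = -(resid * a).sum(axis=0)
        grad_a = (resid * (theta[:, None] - b)).sum(axis=0)
        theta += lr * grad_theta
        b += lr * grad_b
        a += lr * grad_a
        theta -= theta.mean()   # pin the ability scale's location
    return theta, a, b

def item_information(a, b, theta=0.0):
    """Fisher information of each 2PL item at ability level `theta`."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

# Toy refinement loop: fit on simulated responses, then retain the most
# informative half of the items for the next exam iteration.
rng = np.random.default_rng(0)
X = (rng.random((8, 20)) < 0.6).astype(float)   # hypothetical correctness matrix
theta, a, b = fit_2pl(X)
keep = np.argsort(item_information(a, b))[-10:]
print("items retained for the refined exam:", sorted(keep.tolist()))
```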
In summary, the researchers’ primary contributions include:
1. The introduction of a comprehensive approach for automatically evaluating RAG LLM pipelines using task-specific synthetic exams.
2. The use of IRT to build reliable, interpretable evaluation metrics that give a clearer picture of model performance.
3. The proposal of a fully automated exam-generation approach, including an iterative refinement process that improves the informativeness of the exams.
4. The release of benchmarks built from publicly available datasets across several domains for evaluating RAG systems.