Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP). However, they often generate ungrounded or factually incorrect information, an issue informally known as ‘hallucination’. This is particularly noticeable in Question Answering (QA) tasks, where even the most advanced models, such as GPT-4, struggle to respond accurately, especially to questions involving rapidly changing facts or less well-known entities.
To address these problems, a collaborative team of researchers from Meta Reality Labs, Facebook AI Research (FAIR), Meta, and the Hong Kong University of Science and Technology (HKUST) has introduced CRAG (Comprehensive RAG Benchmark). The benchmark is intended to guide the development of retrieval-augmented generation (RAG) systems, which have emerged as a promising strategy for closing the knowledge gaps of LLMs.
CRAG is built around five key qualities: realism, richness, insightfulness, reliability, and longevity. It contains 4,409 QA pairs spanning five domains, covering both simple factual queries and seven types of complex questions. The questions are extensively validated and paraphrased to enhance realism and reliability. CRAG also provides mock APIs that simulate content retrieval from web pages and from mock knowledge graphs containing 2.6 million entities, so that retrieval noise can be mimicked realistically.
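To make this structure concrete, the sketch below shows how a CRAG-style QA example and a mock knowledge-graph API might look in Python. The field names, the `QAExample` and `MockKG` classes, and the toy data are illustrative assumptions, not the benchmark's actual schema or interface.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a CRAG-style QA example; the field names are
# illustrative assumptions, not the benchmark's actual schema.
@dataclass
class QAExample:
    question: str
    answer: str
    domain: str            # one of the five domains (e.g. finance, sports, ...)
    question_type: str     # "simple" or one of the complex question types
    search_results: list = field(default_factory=list)  # pre-fetched web page snippets

# Hypothetical mock knowledge-graph API: serves entity facts locally so that
# retrieval coverage and noise can be simulated without live services.
class MockKG:
    def __init__(self, entities: dict):
        self._entities = entities  # entity name -> {relation: value}

    def lookup(self, entity: str, relation: str):
        return self._entities.get(entity, {}).get(relation)

# Toy usage with invented data.
kg = MockKG({"Ledger Corp": {"ceo": "A. Example", "founded": "1999"}})
ex = QAExample(
    question="Who is the CEO of Ledger Corp?",
    answer="A. Example",
    domain="finance",
    question_type="simple",
    search_results=["Ledger Corp announced quarterly earnings on Tuesday ..."],
)
print(kg.lookup("Ledger Corp", "ceo"))  # -> "A. Example"
```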
The CRAG benchmark comprises three tasks that probe different capabilities of RAG systems. All three use the same set of question-answer pairs but differ in the types of externally retrievable data made available for augmenting response generation, as sketched below. In the authors' experiments, CRAG effectively exposed shortcomings in existing RAG solutions, providing valuable insights for future improvements.
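As a rough illustration of how one set of questions can be answered under different evidence conditions, the sketch below builds a minimal RAG prompt from whatever external data a task provides. The prompt format, the evidence lists, and the stub model call are assumptions for illustration, not CRAG's actual harness.

```python
from typing import Callable, Iterable

def answer_with_evidence(question: str,
                         evidence: Iterable[str],
                         llm: Callable[[str], str]) -> str:
    """Build a minimal RAG prompt: retrieved evidence first, then the question."""
    context = "\n".join(evidence)
    prompt = (
        "Answer using only the context below. "
        "If the context is insufficient, reply 'I don't know'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)

# The same toy question is answered under two evidence conditions, mirroring how
# the benchmark tasks vary the external data while the QA pairs stay fixed.
question = "Who is the CEO of Ledger Corp?"    # invented question, not from the benchmark
web_pages = ["Ledger Corp announced quarterly earnings on Tuesday ..."]
kg_facts = ["Ledger Corp | ceo | A. Example"]  # a textualized knowledge-graph fact

stub_llm = lambda prompt: "A. Example"  # stand-in for a real model call

print(answer_with_evidence(question, web_pages, stub_llm))             # web evidence only
print(answer_with_evidence(question, web_pages + kg_facts, stub_llm))  # web + KG evidence
```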
Preliminary results show that advanced models such as GPT-4 achieve only around 34% accuracy on CRAG, and that adding RAG in a straightforward manner raises accuracy only to 44%. Even the best industry-standard RAG solutions answered just 63% of the questions without hallucination, and they struggled most with questions involving rapidly changing facts, less popular entities, or greater complexity.
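A common way to report figures like these is to grade each response as accurate, missing (the model abstains), or hallucinated, and then aggregate the labels. The sketch below assumes that three-way grading and a simple score that penalizes hallucination; it illustrates the bookkeeping, not CRAG's official scorer.

```python
from collections import Counter

def summarize_results(labels: list) -> dict:
    """Aggregate per-question grades into accuracy, missing, and hallucination rates.

    Each label is assumed to be one of "accurate", "missing", or "hallucinated";
    this mirrors a common RAG evaluation scheme, not the benchmark's exact scorer.
    """
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts["accurate"] / n,
        "missing": counts["missing"] / n,
        "hallucination": counts["hallucinated"] / n,
        # Rewards correct answers and penalizes hallucinations, so abstaining
        # ("I don't know") is better than answering wrongly.
        "truthfulness": (counts["accurate"] - counts["hallucinated"]) / n,
    }

# Toy example: five graded responses.
print(summarize_results(["accurate", "accurate", "missing", "hallucinated", "accurate"]))
# -> {'accuracy': 0.6, 'missing': 0.2, 'hallucination': 0.2, 'truthfulness': 0.4}
```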
The researchers plan to refine and expand CRAG to include multi-lingual queries, multi-modal inputs, and multi-turn conversations, allowing it to keep pace with emerging research challenges in the fast-evolving field of RAG. By providing a solid foundation for reliable, grounded language generation, the CRAG benchmark is expected to drive further advances in this research area.