Natural language processing (NLP), a field within artificial intelligence (AI), aims to help machines understand and generate human language. It encompasses tasks such as translation, sentiment analysis, and text summarization. Progress in this field has produced Large Language Models (LLMs) capable of processing massive quantities of text, enabling complex tasks like long-context summarization and Retrieval-Augmented Generation (RAG).
Yet evaluating LLM performance remains challenging, especially on long-context tasks. Synthetic tests like "Needle-in-a-Haystack" are too simple to differentiate modern models. The scarcity of high-quality reference summaries and reliable automatic metrics makes it difficult to assess output quality: current methods largely depend on short-input, single-document settings and low-quality reference summaries, and they correlate poorly with human judgment. The lack of benchmarks for long-context models highlights the need for better evaluation methods.
To bridge this gap, researchers at Salesforce AI Research have introduced a new evaluation task called "Summary of a Haystack" (SummHay). The method synthesizes large collections of documents, called Haystacks, on specific topics, with certain insights deliberately repeated across documents. A SummHay system must process a Haystack, generate a summary that covers the relevant insights, and cite the source documents. Summaries are evaluated on two main aspects: coverage of the expected insights and quality of the citations.
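To make the scoring concrete, here is a minimal sketch of how a coverage-and-citation joint score could be computed, assuming each reference insight is annotated with its gold source documents and a judge has already matched summary bullets to insights. The helper names, the F1-based citation score, and the product-based joint score are illustrative assumptions, not the paper's official metric.

```python
# Toy sketch of a SummHay-style evaluation (illustrative assumptions only):
# each reference insight carries the set of documents that contain it, and a
# judged summary is a list of (matched_insight_id, cited_doc_ids) pairs.

def evaluate_summary(reference_insights, judged_bullets):
    """reference_insights: dict insight_id -> set of gold source doc ids.
    judged_bullets: list of (insight_id or None, set of cited doc ids)."""
    covered = {iid for iid, _ in judged_bullets if iid in reference_insights}
    coverage = len(covered) / len(reference_insights)

    citation_scores = []
    for iid, cited in judged_bullets:
        if iid not in reference_insights:
            continue
        gold = reference_insights[iid]
        precision = len(cited & gold) / len(cited) if cited else 0.0
        recall = len(cited & gold) / len(gold)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        citation_scores.append(f1)
    citation = sum(citation_scores) / len(citation_scores) if citation_scores else 0.0

    # One simple way to fold both aspects into a single joint score.
    return {"coverage": coverage, "citation": citation, "joint": coverage * citation}


if __name__ == "__main__":
    refs = {"i1": {"doc2", "doc7"}, "i2": {"doc1"}, "i3": {"doc4", "doc9"}}
    bullets = [("i1", {"doc2"}), ("i3", {"doc4", "doc9", "doc5"}), (None, {"doc3"})]
    print(evaluate_summary(refs, bullets))
```

In this toy example, a summary that covers two of three insights with partially correct citations scores well below 1.0 on the joint metric, mirroring how SummHay penalizes both missed insights and sloppy attribution.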
Studying the performance of 10 LLMs and 50 RAG systems on SummHay, the researchers found that even the best-performing systems lagged behind human performance by more than 10 points on a joint score. For instance, advanced LLMs like GPT-4o and Claude 3 Opus scored below 20% on SummHay without a retriever. The results also revealed a trade-off: RAG pipelines improved citation quality but reduced insight coverage compared to long-context models.
Comparing these systems to human annotators revealed a substantial performance gap. Even with advanced RAG components such as Cohere's Rerank3, models like Claude 3 Opus and GPT-4o reached a joint score of only around 36%, still well behind the estimated human performance of 56%. This gap underscores the need for more capable and efficient models.
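As a rough illustration of what adding a reranker to a RAG pipeline looks like, the sketch below reorders retrieved candidate documents with Cohere's Rerank 3 endpoint before they are handed to a summarizing LLM. The function name, the chosen model string, and the top_n value are assumptions for illustration, not the exact setup evaluated in the paper.

```python
# Minimal retrieve-then-rerank sketch (assumed setup, not the paper's exact pipeline).
# Requires: pip install cohere, and a COHERE_API_KEY in the environment.
import os

import cohere


def rerank_documents(query: str, documents: list[str], top_n: int = 10) -> list[str]:
    """Reorder candidate documents by relevance using a Cohere Rerank 3 model."""
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    response = co.rerank(
        model="rerank-english-v3.0",  # Rerank 3 (English); model choice assumed here
        query=query,
        documents=documents,
        top_n=top_n,
    )
    # Each result points back to the original document by index.
    return [documents[r.index] for r in response.results]


# The reranked subset would then be passed, along with the query, to an LLM
# (e.g., GPT-4o or Claude 3 Opus) to produce a summary with citations.
```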
In conclusion, Salesforce AI Research's SummHay provides a robust benchmark for evaluating the abilities of long-context LLMs and RAG systems. Despite the current gaps, the work lays a foundation for advancements that could eventually match or even exceed human performance in long-context summarization.