Natural language processing (NLP), a subfield of artificial intelligence (AI), enables machines to understand and generate human language. It encompasses tasks such as language translation, sentiment analysis, and text summarization. The advent of large language models (LLMs), capable of processing vast amounts of text, has significantly advanced these tasks and opened new opportunities in long-context summarization and retrieval-augmented generation (RAG).
However, a key issue is evaluating these LLMs' performance: many conventional tasks no longer offer enough complexity to differentiate the latest models, and judging output quality is difficult because it requires high-quality reference summaries and reliable automatic metrics. This evaluation gap makes the true capabilities of modern LLMs hard to assess.
Existing techniques for measuring summarization performance generally focus on short, single-document inputs and often rely on low-quality reference summaries. Moreover, current benchmarks for long-context models, such as Needle-in-a-Haystack and book summarization, may not adequately stress the capabilities of advanced LLMs. More thorough and reliable evaluations are therefore needed.
To address this gap, researchers at Salesforce AI Research proposed a novel task called “Summary of a Haystack” (SummHay), designed to evaluate long-context models and RAG systems more effectively. The researchers create synthetic “Haystacks” of documents in which specific insights deliberately recur across documents. A system must process a Haystack, generate an accurate summary that covers those insights, and cite the source documents, providing a comprehensive, reproducible evaluation framework.
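To make the setup concrete, here is a minimal sketch of what such a synthetic Haystack might look like as a data structure: a pool of known insights, each planted in several documents so that correct citations can later be verified. All class names, field names, and the document-generation logic are illustrative assumptions, not the authors' released format.

```python
# Sketch of a synthetic "Haystack": documents seeded with a subset of known insights.
from dataclasses import dataclass, field
import random

@dataclass
class Insight:
    insight_id: str      # e.g. "finance.1" (hypothetical identifier scheme)
    subtopic: str        # subtopic the insight belongs to
    text: str            # the insight expressed as a sentence

@dataclass
class Document:
    doc_id: str
    text: str
    insight_ids: list[str] = field(default_factory=list)  # insights planted in this document

def build_haystack(insights: list[Insight], num_docs: int = 100,
                   insights_per_doc: int = 5, seed: int = 0) -> list[Document]:
    """Distribute insights across documents so each insight recurs in several
    documents, which is what makes citation checking possible later."""
    rng = random.Random(seed)
    docs = []
    for i in range(num_docs):
        chosen = rng.sample(insights, k=min(insights_per_doc, len(insights)))
        # Real Haystack documents would paraphrase insights into fluent prose;
        # concatenation here is only a placeholder.
        body = " ".join(ins.text for ins in chosen)
        docs.append(Document(doc_id=f"doc_{i:03d}", text=body,
                             insight_ids=[ins.insight_id for ins in chosen]))
    return docs
```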
Implementing SummHay involves building Haystacks of roughly 100 documents on a given topic, each seeded with insights grouped into subtopics. The SummHay task then frames a query-focused summarization problem: systems must generate bullet-point summaries that cover the expected insights and cite their source documents accurately. These summaries are assessed for coverage of the expected insights and for citation quality.
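The scoring could be sketched roughly as follows: coverage of the reference insights combined with precision/recall of the cited documents. The paper relies on an LLM judge to decide coverage, which is abstracted away here; the joint metric shown is an illustrative assumption rather than the paper's exact formula.

```python
# Hedged sketch of SummHay-style scoring: insight coverage plus citation quality.

def citation_scores(predicted_citations: set[str], gold_citations: set[str]) -> tuple[float, float]:
    """Precision and recall of the documents cited for a single covered insight."""
    if not predicted_citations:
        return 0.0, 0.0
    true_pos = len(predicted_citations & gold_citations)
    precision = true_pos / len(predicted_citations)
    recall = true_pos / len(gold_citations) if gold_citations else 0.0
    return precision, recall

def joint_score(coverage: list[float], precisions: list[float], recalls: list[float]) -> float:
    """Illustrative joint metric: mean insight coverage (as judged, e.g., by an
    LLM judge on a 0-1 scale) scaled by mean citation F1."""
    mean_cov = sum(coverage) / len(coverage)
    f1s = [2 * p * r / (p + r) if (p + r) else 0.0 for p, r in zip(precisions, recalls)]
    mean_f1 = sum(f1s) / len(f1s)
    return mean_cov * mean_f1
```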
The team extensively tested 10 LLMs and 50 RAG systems and found that, despite the gains achieved by advanced RAG pipelines, the SummHay task remains challenging: performance still lags significantly behind human levels.
The research by Salesforce AI Research addresses the pressing need for robust methods to evaluate long-context LLMs and RAG systems. While it makes clear that current systems underperform human benchmarks, it also highlights concrete areas for improvement and sets the stage for future advances. The study is thus a significant stepping stone toward achieving, and eventually surpassing, human-level performance in long-context summarization.