Retrieval-augmented language models typically retrieve only short sections of a corpus, which limits how well they can adapt to changes in the world and incorporate wide-ranging knowledge. Most existing methods share this limitation and struggle to exploit large-scale discourse structure. The problem is most acute for thematic questions that require integrating knowledge from multiple parts of a text.
Large Language Models (LLMs) have proven effective as standalone knowledge stores, encoding facts in their parameters and refining them further through fine-tuning on downstream tasks. Keeping these models up to date with current world knowledge is difficult, however, which motivates an alternative approach: indexing text in an information retrieval system and supplying the retrieved passages to the LLM as current, domain-specific knowledge. Existing methods nevertheless retrieve only short, contiguous text chunks, which makes it hard to represent the large-scale discourse structure that is crucial for comprehensive text understanding.
In response, researchers from Stanford University have developed RAPTOR, an indexing and retrieval system designed to address these limitations. RAPTOR uses a tree structure to capture both the high-level and the fine-grained details of a text. It clusters text chunks, generates summaries of those clusters, and builds a tree from these elements. This structure allows text at different levels of abstraction to be loaded into the LLM's context, enabling question answering at varying levels of detail. The primary contribution is the use of text summarization for retrieval augmentation, which improves context representation across scales.
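To illustrate why keeping nodes at several levels of abstraction helps, here is a minimal sketch in which raw chunks and summaries compete in a single ranking. Everything in it is an assumption made for illustration: the node texts are hand-written, the bag-of-words similarity is a toy replacement for a learned embedding, and ranking all levels together is only one plausible way to query such a tree, not necessarily how RAPTOR does it.

```python
import math
from collections import Counter

# Hypothetical hand-written nodes as (level, text) pairs.
# Level 0 holds raw chunks; higher levels hold summaries of the level below.
NODES = [
    (0, "Cinderella lives with her stepmother and two stepsisters."),
    (0, "A fairy godmother gives Cinderella a gown and glass slippers."),
    (0, "The prince finds Cinderella using the lost glass slipper."),
    (1, "Cinderella is mistreated at home but receives magical help."),
    (2, "A kind girl overcomes hardship and marries the prince."),
]


def bag_of_words(text):
    """Toy embedding (assumption): lowercase word counts, not a learned vector."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, nodes, k=2):
    """Rank nodes from every level together and return the k most similar."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        nodes,
        key=lambda node: cosine(query_vec, bag_of_words(node[1])),
        reverse=True,
    )
    return ranked[:k]


detail = retrieve("Who gives Cinderella glass slippers?", NODES, k=1)
theme = retrieve("How does a kind girl overcome hardship?", NODES, k=1)
print(detail[0], theme[0])
```

With this toy data, the detail-oriented question matches a level-0 chunk, while the thematic question matches the level-2 summary, which no single raw chunk could answer.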
RAPTOR addresses semantic depth and interconnectedness by building a recursive tree structure that captures both broad themes and fine details. The process divides the retrieval corpus into chunks, embeds them with SBERT, and clusters the embeddings using Gaussian Mixture Models (GMMs), with Uniform Manifold Approximation and Projection (UMAP) used to reduce their dimensionality. Each cluster is then summarized, and the procedure repeats on the summaries to form the higher levels of the tree. The resulting tree supports efficient querying, so relevant information can be retrieved at different levels of specificity.
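The recursive cluster-then-summarize loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: `embed` is a bag-of-words stand-in for SBERT, `cluster` is a greedy nearest-neighbour grouping standing in for UMAP plus GMM clustering, and `summarize` merely truncates text where the real system would generate a summary. All function names and the `group_size` parameter are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)


def embed(text):
    """Stand-in for SBERT (assumption): lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(nodes, group_size):
    """Stand-in for UMAP + GMM (assumption): group each seed with its nearest nodes."""
    remaining = list(nodes)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        seed_vec = embed(seed.text)
        remaining.sort(key=lambda n: cosine(seed_vec, embed(n.text)), reverse=True)
        groups.append([seed] + remaining[: group_size - 1])
        remaining = remaining[group_size - 1:]
    return groups


def summarize(texts, max_words=6):
    """Stand-in for the summarization model (assumption): truncate each text."""
    return " ".join(" ".join(t.split()[:max_words]) for t in texts)


def build_tree(chunks, group_size=2):
    """Build the tree bottom-up; return the root and a flat list of all nodes."""
    if not chunks:
        raise ValueError("need at least one chunk")
    group_size = max(2, group_size)  # groups of one would never shrink the layer
    layer = [Node(text=chunk, level=0) for chunk in chunks]
    all_nodes = list(layer)
    while len(layer) > 1:
        next_level = layer[0].level + 1
        layer = [
            Node(summarize([n.text for n in group]), next_level, group)
            for group in cluster(layer, group_size)
        ]
        all_nodes.extend(layer)
    return layer[0], all_nodes


chunks = [
    "Cinderella lives with her stepmother.",
    "The stepmother makes Cinderella do all the chores.",
    "A fairy godmother gives Cinderella glass slippers.",
    "The prince finds Cinderella with the lost slipper.",
]
root, all_nodes = build_tree(chunks)
print(root.level, len(all_nodes))
```

Each pass shrinks the layer, so the loop ends at a single root whose text condenses the whole corpus. The flat `all_nodes` list keeps every level available for retrieval.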
RAPTOR outperformed baseline retrieval methods on three question-answering datasets: NarrativeQA, QASPER, and QuALITY. Paired with GPT-4, it achieved state-of-the-art results on QASPER and QuALITY, reflecting its ability to handle thematic and multi-step queries. An analysis of the tree structure's contribution confirmed that upper-level nodes are important for broad understanding and for retrieval performance.
In summary, the Stanford University researchers introduced RAPTOR, a novel tree-based retrieval system that augments LLMs' knowledge with context at different levels of abstraction. RAPTOR constructs a hierarchical tree through recursive clustering and summarization, which enables effective synthesis of information from different sections of a retrieval corpus. Controlled experiments showed that RAPTOR outperforms traditional retrieval methods and sets new performance benchmarks on question-answering tasks. It is a promising development for improving language models through richer contextual retrieval.