Large Language Models (LLMs) are known for their ability to handle a wide range of tasks and perform well across diverse applications. However, their ability to produce accurate information degrades when the relevant knowledge is under-represented in their training data. To tackle this issue, retrieval augmentation was devised: it combines information retrieval and nearest-neighbor search over a non-parametric datastore, grounding an LLM's generations in retrieved evidence.
Various methods have been explored along these lines. Retrieval Augmentation (RA) draws on external knowledge sources to bolster LLM performance on tasks that require deep comprehension, and approaches such as REALM, RAG, and Atlas go further by incorporating the retrieval component into pre-training and fine-tuning for downstream tasks.
A team of researchers from Meta's FAIR, the University of Waterloo, Carnegie Mellon University, and the University of Chicago has proposed a new technique called Nearest Neighbor Speculative Decoding (NEST). NEST is a semi-parametric language modeling method that can incorporate real-world text spans of arbitrary length into the generations of an existing LM, improving both generation quality and inference speed. NEST extends the standard kNN-LM approach, which interpolates the output distribution of an LM with a distribution over candidate next tokens retrieved from a text corpus. It adds a first-stage passage retrieval step so the system no longer has to store and search over every token in the corpus, striking a balance between search accuracy and efficiency, as illustrated in the sketch below.
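To make the interpolation concrete, here is a minimal NumPy sketch of the kNN-LM mixture that NEST builds on. The function name, the fixed mixing coefficient `lam`, and the exponential distance weighting are illustrative assumptions, not the paper's implementation; NEST replaces the fixed coefficient with an adaptive one, as described in the next paragraph.

```python
import numpy as np

def knn_lm_next_token_probs(p_lm, query_hidden, datastore_keys,
                            datastore_tokens, vocab_size, k=16, lam=0.3):
    """Minimal kNN-LM interpolation sketch (hypothetical names).

    p_lm             : (vocab_size,) LM softmax over the next token
    query_hidden     : (d,) hidden state at the current position
    datastore_keys   : (n, d) cached hidden states from the corpus
    datastore_tokens : (n,) the token that followed each cached state
    """
    # Distance from the query to every cached context representation.
    # NEST's first-stage passage retrieval would shrink this candidate
    # set before the token-level search, for efficiency.
    dists = np.linalg.norm(datastore_keys - query_hidden, axis=1)
    nearest = np.argsort(dists)[:k]

    # Turn the k nearest neighbors' (negative) distances into a
    # distribution over their continuation tokens.
    weights = np.exp(-dists[nearest])
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p_knn[datastore_tokens[idx]] += w

    # Blend the retrieval and LM distributions with a fixed coefficient;
    # NEST instead sets this weight adaptively via its RRC score.
    return lam * p_knn + (1.0 - lam) * p_lm
```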
The NEST method works in three sub-steps, sketched in the code below. First, confidence-based interpolation computes a Relative Retrieval Confidence (RRC) score that estimates the token retriever's uncertainty and uses it to weight the mixture of the two output distributions. Second, dynamic span selection picks the best token under the probability mixture and, when the token retrieval confidence exceeds a threshold, extends the prediction to the full corpus span beginning at that token. Third, relaxed speculative decoding evaluates the proposed multi-token span under the mixture probability and accepts only the prefix that the mixture deems sufficiently likely.
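The following sketch strings the three sub-steps together for one decoding step. The threshold values and the `span_score` callback (which would re-score a candidate token under the mixture given the tokens accepted so far) are hypothetical placeholders; the paper's actual acceptance rule and hyperparameters differ in detail.

```python
import numpy as np

def nest_decode_step(p_lm, p_knn, rrc, retrieved_span, span_score,
                     span_threshold=0.5, accept_threshold=0.3):
    """Hedged one-step sketch of NEST's three sub-steps.

    p_lm, p_knn    : next-token distributions from the LM and retriever
    rrc            : Relative Retrieval Confidence in [0, 1]
    retrieved_span : corpus tokens starting at the best-retrieved token
    span_score     : hypothetical callback giving the mixture probability
                     of a token given the tokens accepted so far
    """
    # 1. Confidence-based interpolation: the RRC score decides how much
    #    weight the retriever's distribution receives.
    p_mix = rrc * p_knn + (1.0 - rrc) * p_lm
    best = int(np.argmax(p_mix))

    # 2. Dynamic span selection: only when retrieval is confident enough
    #    do we propose the whole corpus span starting at that token.
    if rrc < span_threshold or len(retrieved_span) == 0 \
            or retrieved_span[0] != best:
        return [best]  # fall back to single-token decoding

    # 3. Relaxed speculative decoding: keep the longest prefix of the
    #    span whose tokens the mixture still finds sufficiently likely.
    accepted = []
    for tok in retrieved_span:
        if span_score(accepted, tok) < accept_threshold:
            break
        accepted.append(tok)
    return accepted or [best]
```

Because an accepted span emits several tokens in one step, this is also where NEST's inference speedup comes from.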
In a zero-shot setting with Llama-2-Chat models, NEST outperforms both the base LM and the standard kNN-LM on tasks such as text completion and factuality-aware generation. For example, NEST combined with Llama-2-Chat 70B yields a 42.3% improvement in ROUGE-1 on WikiText-103 and a 21.6% improvement in FactScore on Biography. Moreover, by producing multiple tokens per step, NEST speeds up long-form generation, achieving 1.8x faster inference with Llama-2-Chat 70B without hurting attribution or fluency.
However, NEST does come with limitations. Its outputs can still contain factual errors when the first-stage passage retrieval or the second-stage token retrieval is inaccurate, and because the retrieval and generation components are combined without fine-tuning, results may be sub-optimal; fine-tuning on the target tasks could improve them.
In summary, NEST is an inference-time revision method for LMs that improves their factuality and attribution through nearest-neighbor speculative decoding. Despite its limitations, it has been shown to improve validation perplexity and the quality of free-form generation across a range of tasks, contributing to the broader effort to make Large Language Models more reliable.