The paper discusses the challenge of ensuring that large language models (LLMs) generate accurate, credible, and verifiable responses. This is difficult because current methods remain prone to errors and hallucinations, which can produce incorrect or misleading information. To address this, the researchers introduce a new verification framework to improve the accuracy and reliability of LLM outputs. This matters because LLMs have become more powerful and widely used, and because understanding how their behavior changes with model size and training data is key to deploying them responsibly.
LLMs are increasingly employed for tasks that combine information retrieval with generation, where responses must be grounded in verifiable sources. Existing methods typically use retrieval-augmented generation, instructing the LLM to produce an answer together with its supporting citations in a single pass. However, these approaches struggle to maintain both answer accuracy and citation quality, owing to the difficulty of processing large volumes of retrieved text and the risk of errors propagating from earlier processing steps.
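To make the baseline concrete, here is a minimal sketch of that single-pass retrieve-then-generate setup: all retrieved passages are packed into one prompt and the model is asked to answer and cite in one call. The `call_llm` function is a hypothetical stand-in for whatever chat or completion API is used, not an interface from the paper.

```python
from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat/completion API."""
    raise NotImplementedError


def single_pass_grounded_answer(question: str, passages: List[str]) -> str:
    # Number each retrieved passage so the model can cite it as [1], [2], ...
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the documents below. "
        "Cite supporting documents inline as [1], [2], ...\n\n"
        f"Documents:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
    # A single generation call must handle both answering and citing,
    # which is where accuracy and citation quality tend to degrade.
    return call_llm(prompt)
```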
The paper proposes a solution called CaLM (Contrasting Large and Small Language Models), which leverages the complementary strengths of large and small LMs. CaLM employs a post-verification approach, in which a smaller LM validates the outputs of a larger LM. The smaller LM reads only the cited documents and checks whether they actually support the larger LM's response. If the two answers align, the larger LM's answer is considered verified; if discrepancies are found, CaLM refines the response iteratively through a feedback loop.
The verification process in CaLM cross-checks the larger LM's output against the documents it cited. The smaller LM, which is effective at processing a focused set of relevant passages, answers the question using only those cited documents and compares its answer with the larger LM's response. This leverages the smaller LM's sensitivity to input relevance: if the citations do not support the answer, the inconsistency surfaces and is fed back to the larger LM. The iterative feedback loop allows ongoing refinement of the response, improving both citation accuracy and overall answer quality.
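The loop below is a simplified sketch of this verify-and-refine procedure as described above, not the paper's actual implementation. The helper callables `large_lm_generate`, `small_lm_answer`, and `answers_consistent` are hypothetical placeholders for the two models and the consistency check.

```python
from typing import Callable, Dict, List


def calm_verify_and_refine(
    question: str,
    passages: List[str],
    large_lm_generate: Callable[[str, List[str], str], Dict],  # -> {"answer": str, "cited_ids": List[int]}
    small_lm_answer: Callable[[str, List[str]], str],
    answers_consistent: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Dict:
    feedback = ""
    draft = large_lm_generate(question, passages, feedback)
    for _ in range(max_rounds):
        # The small LM sees only the documents the large LM actually cited.
        cited = [passages[i] for i in draft["cited_ids"] if 0 <= i < len(passages)]
        check = small_lm_answer(question, cited)
        if answers_consistent(draft["answer"], check):
            return draft  # answers agree: the citations support the response
        # Otherwise, describe the mismatch and ask the large LM to revise.
        feedback = (
            f"A verifier reading only your cited documents answered: {check!r}. "
            "Revise your answer and citations so they are supported by the sources."
        )
        draft = large_lm_generate(question, passages, feedback)
    return draft  # return the latest draft if no round converged
```

A usage note on the design: because the smaller LM never sees the full retrieval pool, its answer depends entirely on the cited documents, which is what makes agreement between the two models a meaningful signal that the citations are adequate.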
CaLM was tested on three open-domain question-answering datasets: QAMPARI, ASQA, and ELI5. These experiments showed substantial gains: CaLM improved both answer correctness and citation quality, outperforming the current best methods by an average of 1.5% to 7%. It remained robust even in challenging scenarios where the retrieval system was less effective, highlighting its value for strengthening LLMs' grounded generation capabilities.
The CaLM framework addresses the problem of ensuring accurate and credible responses from LLMs by leveraging the strengths of both large and small language models. It improves the quality and dependability of LLM outputs through a post-verification approach and iterative refinement. This is a meaningful advance in language model research and contributes to a better understanding of LLMs' capabilities and limitations, knowledge that is crucial for their effective use in real-world applications.