Efficiently managing long contextual inputs in Retrieval-Augmented Generation (RAG) is a central challenge in AI research: the more retrieved documents a model has to read, the slower generation becomes. Existing context-compression techniques only partially address this, in particular because they struggle to handle multiple context documents, which is exactly what most real-world RAG scenarios require.
Addressing this challenge, researchers from the University of Amsterdam, The University of Queensland, and Naver Labs Europe have introduced COCOM (COntext COmpression Model), a new method for context compression. The model compresses long contexts into a small number of context embeddings, significantly speeding up generation without compromising answer quality. Crucially, COCOM is designed to handle multiple contexts at once, which sets it apart from previous techniques.
COCOM employs a single unified model for both context compression and answer generation, which ensures that the large language model (LLM) can make effective use of the compressed context embeddings it receives. The compression rate is adjustable, giving practitioners the flexibility to trade decoding time against answer quality.
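To make the core idea concrete, here is a minimal PyTorch sketch of the kind of compression step described above. It is not the authors' implementation: the class name `ContextCompressor`, the mean-pooling scheme, and all dimensions are illustrative assumptions. Each retrieved document's token embeddings are pooled into roughly `seq_len / compression_rate` context embeddings, which are then passed to the LLM alongside the question.

```python
# Illustrative sketch of context compression, not the paper's actual model.
import math
import torch
import torch.nn as nn


class ContextCompressor(nn.Module):
    """Compress a document's token embeddings into seq_len/rate context embeddings."""

    def __init__(self, d_model: int, compression_rate: int):
        super().__init__()
        self.rate = compression_rate
        self.proj = nn.Linear(d_model, d_model)  # map pooled chunks into the LLM's input space

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (seq_len, d_model) for one retrieved document
        seq_len, d_model = token_embs.shape
        n_ctx = math.ceil(seq_len / self.rate)           # number of context embeddings
        pad = n_ctx * self.rate - seq_len
        padded = torch.cat([token_embs, token_embs.new_zeros(pad, d_model)])
        chunks = padded.view(n_ctx, self.rate, d_model)  # group `rate` tokens per embedding
        return self.proj(chunks.mean(dim=1))             # (n_ctx, d_model)


# Toy usage: compress several retrieved documents, then prepend the compressed
# embeddings to the question embeddings before decoding with the LLM.
d_model, rate = 768, 4
compressor = ContextCompressor(d_model, rate)
docs = [torch.randn(200, d_model), torch.randn(120, d_model)]  # token embeddings of 2 documents
compressed = torch.cat([compressor(doc) for doc in docs])      # 50 + 30 = 80 context embeddings
question = torch.randn(16, d_model)                            # question token embeddings
decoder_input = torch.cat([compressed, question])              # 96 vectors instead of 336
print(decoder_input.shape)  # torch.Size([96, 768])
```

In this toy example, a compression rate of 4 shrinks 320 context tokens down to 80 context embeddings, which is where the reduction in decoding time comes from: the LLM attends over far fewer input vectors.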
The researchers trained on a mix of QA datasets, including Natural Questions, MS MARCO, HotpotQA, and WikiQA. Notably, they applied parameter-efficient LoRA tuning instead of full fine-tuning and used SPLADE-v3 for retrieval, keeping the training pipeline lightweight.
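As a hedged illustration of the parameter-efficient tuning step, this is what LoRA adaptation of a decoder-only backbone looks like with the Hugging Face peft library. The base model identifier and every hyperparameter below are placeholders rather than the paper's actual configuration, and the SPLADE-v3 retrieval stage is left out.

```python
# Sketch of parameter-efficient LoRA tuning with Hugging Face `peft`.
# The base model and all hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumed stand-in for the backbone LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```

Because only the low-rank adapter weights are updated, the backbone LLM stays frozen, which is what makes this style of tuning cheap enough to train both the compressor and the generator as one model.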
COCOM demonstrated clear improvements over previous models, achieving a speed-up of up to 5.69x in decoding time while maintaining high answer quality. It also performed strongly on Exact Match (EM) and Match (M) scores, outperforming existing methods such as AutoCompressor, ICAE, and xRAG. For instance, with a compression rate of 4, COCOM scored 0.554 on Natural Questions and 0.859 on TriviaQA.
In conclusion, COCOM addresses the critical challenge of handling long contextual inputs in RAG models both effectively and efficiently. Its ability to handle multiple contexts and its adjustable compression rates make it an important step toward more scalable RAG systems, paving the way for more efficient and user-friendly AI applications. It also has the potential to improve the practical usability of LLMs, answering a long-standing demand in AI research.