Researchers from Tsinghua University and Microsoft Corporation have unveiled a new study, LLMLingua-2, a collaborative effort that underscores the value of interdisciplinary research. The study focuses on improving the efficiency of language models, which play a pivotal role in fluent communication between humans and machines. The core challenge the researchers address is the redundancy of natural language: prompts written for people carry far more words than a model actually needs, which inflates the cost of computational processing.
In the course of their research, they identified a major stumbling block in the efficiency and transferability of language model prompts. Conventional prompt compression techniques that work well for a specific query or task often fall short when applied across different models and functions. This lack of universality drives up computational and financial costs, and overly long prompts can also degrade a model's ability to pick out the key information. Task-aware compression techniques used to date must re-compress the prompt for every new task or query, introducing further inefficiency.
To address these limitations, the research team introduced a data distillation procedure aimed at extracting the most relevant information from large language models (LLMs) without losing important details. The procedure stands out for pairing an extractive text compression dataset with a token classification model, which together ensure that compressed prompts remain faithful to the originals. Unlike older methods, this procedure deliberately preserves the most significant information, maintaining the usefulness and accuracy of the compressed prompts.
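The paper's annotation pipeline is more involved, but the core idea behind building an extractive compression dataset can be illustrated with a simplified sketch: given an original prompt and a compressed version distilled from a strong LLM, each original word is labeled "preserve" if it survives in the compression and "discard" otherwise. The greedy word-level matcher below is an illustrative simplification under that assumption, not the authors' exact algorithm.

```python
# Simplified sketch of the label-assignment step in data distillation:
# align an LLM-compressed text back to the original prompt and mark each
# original word as "preserve" (1) or "discard" (0). The real LLMLingua-2
# annotation algorithm is more robust (handling reordering, fuzzy matches,
# and subword tokens); this greedy matcher only illustrates the idea.

def assign_keep_labels(original: str, compressed: str) -> list[tuple[str, int]]:
    orig_words = original.split()
    comp_words = compressed.split()
    labels = []
    j = 0  # pointer into the compressed text
    for word in orig_words:
        if j < len(comp_words) and word.lower() == comp_words[j].lower():
            labels.append((word, 1))  # word survived compression -> preserve
            j += 1
        else:
            labels.append((word, 0))  # word was dropped -> discard
    return labels

if __name__ == "__main__":
    original = "Please kindly note that the meeting will be held on Friday at 3 pm"
    compressed = "meeting held Friday 3 pm"
    for word, label in assign_keep_labels(original, compressed):
        print(f"{word:10s} {'preserve' if label else 'discard'}")
```

Token-level labels of this kind are what allow the compressor to be trained once and reused across tasks, since the labels describe which words carry information rather than how any particular downstream task is phrased.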
The method is innovative and reliable: it frames prompt compression as a token classification problem, deciding for every token whether to preserve or discard it. Because this formulation draws on the full bidirectional context of the language, it understands and retains important information more reliably. A Transformer encoder serves as the backbone of the model, exploiting that complete context to optimize the compression, a significant departure from earlier approaches that dropped important details or could not be reused across different tasks.
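In concrete terms, this formulation looks like standard token classification with a bidirectional Transformer encoder. The sketch below uses Hugging Face's transformers library with an xlm-roberta-base encoder and a two-label (discard/preserve) head; the checkpoint name, label assignment, and threshold are illustrative assumptions rather than the released LLMLingua-2 model or its inference code, and an untrained head would produce meaningless predictions until fine-tuned on the distilled dataset described above.

```python
# Illustrative sketch of prompt compression as token classification:
# a bidirectional Transformer encoder scores every token, and tokens whose
# "preserve" probability clears a threshold are kept. Model name, label
# convention, and threshold here are assumptions for illustration only.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # stand-in encoder; fine-tune before real use
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def compress(prompt: str, keep_threshold: float = 0.5) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]        # shape: (seq_len, 2)
    keep_prob = logits.softmax(dim=-1)[:, 1]      # assume label 1 = "preserve"
    token_ids = inputs["input_ids"][0]
    kept = [
        tid for tid, p in zip(token_ids.tolist(), keep_prob.tolist())
        if p >= keep_threshold and tid not in tokenizer.all_special_ids
    ]
    return tokenizer.decode(kept)

print(compress("Please kindly note that the meeting will be held on Friday at 3 pm."))
```

Because the encoder is bidirectional, each keep-or-drop decision can take the whole prompt into account, which is the key difference from compression schemes that score tokens with a left-to-right language model.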
The efficacy of this novel approach is validated both theoretically and empirically on diverse benchmarks. Despite its comparatively small size, LLMLingua-2 showcases remarkable advancement in prompt compression. In comprehensive evaluations, the model delivered significant performance improvements on in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. It recorded a 3x-6x speedup over existing compression methods and accelerated end-to-end latency by 1.6x-2.9x, at compression ratios of 2x-5x.
This study by Tsinghua University and Microsoft presents an efficient, adaptable, and faithful compression technique that works across diverse tasks and language models. The researchers have developed a method that substantially reduces prompt size while retaining the richness of the original prompts, paving the way for more efficient and cost-effective use of language models. This progress in task-agnostic prompt compression enhances the practical utility of large language models and opens new opportunities for research and application in computational linguistics and related fields.