Large Language Models (LLMs) are valuable in many areas, especially for generating text and responding to queries. However, they face a significant challenge: during generation they consume large amounts of memory for the key-value (KV) cache, which stores the attention keys and values of previously processed tokens so the model can reuse them instead of recomputing them for every new token. As this cache grows with sequence length and batch size, it slows inference down and can even exhaust available GPU memory.
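To see why this matters, here is a rough back-of-the-envelope sketch of how KV cache memory grows with sequence length and batch size. The model dimensions (32 layers, 32 heads, head dimension 128, FP16 storage) are illustrative assumptions and are not taken from the article.

```python
# Back-of-the-envelope estimate of KV cache size for a 7B-class model.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    # 2x for keys and values, stored for every layer and attention head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# A batch of 8 sequences of 4096 tokens stored in FP16:
print(kv_cache_bytes(batch=8, seq_len=4096) / 1e9, "GB")  # ~17.2 GB
```

Even at these modest settings, the cache alone can rival the memory footprint of the model weights, which is exactly the bottleneck the article describes.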
To reduce the memory footprint of LLMs, researchers often turn to quantization, a compression technique that represents stored values with fewer bits so they occupy less space. However, most existing KV cache quantization methods require fine-tuning or calibration to work well. That extra step can be complex and time-consuming, and it becomes a hurdle for researchers and developers who want to deploy these solutions.
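As a minimal illustration of the general idea, and not any specific method discussed in the paper, the sketch below quantizes a tensor to 4-bit integers with a single scale and zero point and then reconstructs it, trading a small approximation error for roughly four times fewer bits per value.

```python
import torch

def quantize_uniform(x, n_bits=4):
    # Asymmetric uniform quantization: map the tensor's value range onto
    # integers in [0, 2^n_bits - 1], storing one scale and zero point.
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return q.to(torch.uint8), scale, zero_point

def dequantize_uniform(q, scale, zero_point):
    # Recover an approximation of the original values.
    return (q.float() - zero_point) * scale

x = torch.randn(1024)
q, s, z = quantize_uniform(x, n_bits=4)
x_hat = dequantize_uniform(q, s, z)
print("mean abs error:", (x - x_hat).abs().mean().item())
```

The catch the article points to is that pushing this idea to very low bit widths usually degrades accuracy unless the model is fine-tuned or carefully calibrated afterward.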
This is where KIVI comes in. KIVI is a purpose-built, plug-and-play 2-bit key-value (KV) cache quantization algorithm for LLMs. It compresses the keys and values stored in the cache down to 2 bits each, and it does so without any fine-tuning. Researchers and developers can therefore apply KIVI without spending time calibrating it to their specific LLMs.
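To make the 2-bit idea concrete, here is a minimal sketch of group-wise asymmetric 2-bit quantization applied to a cache tensor. The group size, the grouping layout, and the use of one uint8 per value (rather than packing four 2-bit values into a byte) are illustrative assumptions for readability, not KIVI's actual implementation.

```python
import torch

def quantize_2bit_groupwise(cache, group_size=32):
    """Quantize a (tokens, hidden_dim) cache tensor to 2 bits per value,
    with one scale and zero point per group of `group_size` values.
    Illustrative sketch only; group layout is an assumption."""
    qmax = 3  # 2 bits -> integer levels 0..3
    tokens, dim = cache.shape
    assert dim % group_size == 0
    groups = cache.reshape(tokens, dim // group_size, group_size)
    g_min = groups.min(dim=-1, keepdim=True).values
    g_max = groups.max(dim=-1, keepdim=True).values
    scale = (g_max - g_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-g_min / scale)
    q = torch.clamp(torch.round(groups / scale) + zero_point, 0, qmax)
    # A real kernel would pack four 2-bit codes per byte; uint8 keeps it simple here.
    return q.to(torch.uint8), scale, zero_point

def dequantize_2bit_groupwise(q, scale, zero_point, shape):
    groups = (q.float() - zero_point) * scale
    return groups.reshape(shape)

# Example: a toy key cache of 4096 tokens with hidden size 128.
k_cache = torch.randn(4096, 128)
q, s, z = quantize_2bit_groupwise(k_cache)
k_hat = dequantize_2bit_groupwise(q, s, z, k_cache.shape)
print("reconstruction error:", (k_cache - k_hat).abs().mean().item())
```

Because the scales and zero points are computed directly from the cached values at inference time, no training data or fine-tuning pass is needed, which is the plug-and-play property the article highlights.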
Experiments have demonstrated KIVI’s ability to considerably reduce memory usage while maintaining accuracy. It has been shown to reduce peak memory consumption by up to 2.6 times, letting LLMs that use KIVI handle larger batch sizes and deliver throughput improvements of up to 3.47 times in real serving scenarios. For instance, in tests with Mistral-v0.2, KIVI maintained close-to-baseline accuracy while reducing KV cache memory usage by 5.3 times.
To summarize, KIVI offers a streamlined and effective way to address the memory bottleneck of LLMs. This 2-bit KV cache quantization algorithm compresses the stored keys and values, decreasing memory usage without any fine-tuning. As a result, LLMs can run faster, handle larger batches, and deliver better overall throughput. Future work aims to reduce the overhead of the quantization process itself, making KIVI even more efficient and easier to use.
This research is detailed further in the paper, with an accompanying GitHub repository showcasing the work. All credit goes to the researchers behind this project.