Large Language Models (LLMs), a subset of artificial intelligence focused on understanding and generating human language, rely on the Transformer architecture. Because self-attention has quadratic time complexity in the sequence length, processing long texts remains a significant barrier to efficient performance.
To address this issue, researchers introduced the KV-Cache mechanism, which stores the keys and values computed for past tokens so they are not recomputed at every decoding step. This reduces the time complexity of generation from quadratic to linear, but the cache grows with context length and drives up GPU memory usage. Optimizing KV-Cache usage is therefore vital.
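To make the mechanism concrete, here is a minimal NumPy sketch of caching keys and values during autoregressive decoding. The `project` stand-in and the dimensions are illustrative placeholders, not any particular model's implementation.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d_model = 64
rng = np.random.default_rng(0)

def project(x):
    # Stand-in for the learned query/key/value projections of a real model.
    return x, x, x

K_cache = np.empty((0, d_model))
V_cache = np.empty((0, d_model))

for step in range(8):                      # autoregressive decoding loop
    x = rng.standard_normal(d_model)       # hidden state of the newest token
    q, k, v = project(x)
    # Only the newest key/value is appended; earlier ones are reused from the
    # cache, so each step costs O(seq_len) instead of recomputing all past
    # keys and values from scratch.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
```

The trade-off is visible in the two cache arrays: they grow by one row per generated token, which is exactly the GPU memory cost the optimizations below target.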
Scientists from Wuhan University and Shanghai Jiao Tong University in China have proposed several methods for optimizing KV-Cache space usage in LLMs. One line of work alters the model architecture during pre-training, reducing the size of the key and value vectors stored per token by up to 75%. This preserves the benefits of the attention mechanism while significantly reducing memory demands.
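One way such an architectural change can work, as in the grouped-query attention discussed below, is to let several query heads share a single cached key/value head. The sketch below uses hypothetical head counts (16 query heads, 4 KV heads, no causal masking for brevity) purely to show how fewer KV heads shrink what must be cached.

```python
import numpy as np

# Hypothetical configuration: 16 query heads share 4 key/value heads,
# a 4x reduction in cached state (the "up to 75%" figure).
n_q_heads, n_kv_heads, head_dim, seq_len = 16, 4, 64, 128
group_size = n_q_heads // n_kv_heads

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, seq_len, head_dim))
# Only n_kv_heads worth of keys and values ever need to be cached.
K = rng.standard_normal((n_kv_heads, seq_len, head_dim))
V = rng.standard_normal((n_kv_heads, seq_len, head_dim))

out = np.empty_like(Q)
for h in range(n_q_heads):
    kv = h // group_size                           # shared KV head for this group
    scores = Q[h] @ K[kv].T / np.sqrt(head_dim)    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out[h] = weights @ V[kv]

print(out.shape)  # (16, 128, 64): full query capacity, quarter-size KV cache
```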
The proposed optimizations also extend to deployment and post-training. Frameworks such as PagedAttention and DistKV-LLM page and distribute the KV-Cache across GPU memory and multiple servers, while methods such as dynamic eviction strategies and quantization compress the cache itself. Together, these techniques significantly improve memory efficiency and inference speed. For example, with grouped-query attention (GQA), the popular LLaMA2-70B model cuts its KV-Cache size by 75%, reducing memory usage to a fraction of what traditional multi-head attention requires while maintaining performance.
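As a rough illustration of the compression idea (not the paper's specific scheme), the following sketch quantizes a cached key tensor from 16-bit floats to int8 with a per-row scale, roughly halving its storage at the cost of a small reconstruction error.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-row int8 quantization: store int8 values plus one
    float scale per row, so that x ~= q * scale after dequantization."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k_cache_fp16 = rng.standard_normal((1024, 128)).astype(np.float16)  # cached keys

q, scale = quantize_int8(k_cache_fp16.astype(np.float32))
k_restored = dequantize(q, scale)

print(k_cache_fp16.nbytes, q.nbytes + scale.nbytes)  # ~2x smaller storage
print(np.abs(k_restored - k_cache_fp16).max())       # small reconstruction error
```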
The LLaMA2-70B model also showed a significant improvement in per-token memory usage, dropping from 0.5 MB to 0.125 MB per token. Models using Multi-Query Attention (MQA) and GQA displayed higher throughput and lower latency, both of which matter for real-time applications.
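The arithmetic behind such a drop can be sketched as follows. The layer and head counts here are placeholders chosen only so that a 4x cut in cached heads reproduces the reported 0.5 MB to 0.125 MB figure; they are not LLaMA2-70B's actual hyperparameters.

```python
# Illustrative per-token KV-Cache size: 2 (keys + values) x layers x
# cached heads x head dimension x bytes per element (2 for fp16).
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

full = kv_bytes_per_token(n_layers=64, n_kv_heads=16, head_dim=128)
gqa  = kv_bytes_per_token(n_layers=64, n_kv_heads=4,  head_dim=128)
print(full / 2**20, gqa / 2**20)   # 0.5 MB -> 0.125 MB per token
```

Because the per-token cost multiplies across every token in the context and every concurrent request, a 4x reduction per token translates directly into longer contexts or larger batches on the same GPU, which is where the throughput and latency gains come from.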
The research contributes a comprehensive strategy for optimizing KV-Cache in LLMs, directly addressing the problem of memory overhead. The proposed techniques enable higher efficiency and better performance, paving the way for more scalable and sustainable AI solutions. Finally, the team's findings underscore the importance of efficient memory management in the evolution of LLM technology and offer a roadmap for future progress, one that could open up more complex applications for LLMs across various sectors.