Large language models (LLMs) such as GPT-4 excel at language comprehension, but they suffer from high GPU memory usage during inference. This is a significant limitation for real-time applications such as chatbots, where scalability matters. Existing methods reduce memory by compressing the KV cache, a major memory consumer in LLMs whose footprint can rival or even exceed the model parameters: a 7-billion-parameter model uses about 14 GB for parameters but 72 GB for the KV cache.
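To see why the cache grows so large, a back-of-the-envelope calculation helps. The sketch below assumes an illustrative LLaMA-like 7B configuration (32 layers, hidden size 4096, fp16) and example batch/sequence sizes; none of these values are taken from the PyramidInfer paper itself.

```python
# Back-of-the-envelope KV cache sizing for a LLaMA-like 7B model.
# All configuration values are illustrative assumptions.

def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # 2 tensors (K and V), each of shape [batch, seq_len, hidden_size], per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem

GB = 1024 ** 3

# Assumed configuration: 32 layers, hidden size 4096, fp16 everywhere.
params_gb = 7e9 * 2 / GB                                    # ~13 GB of fp16 parameters
cache_gb = kv_cache_bytes(32, 4096, seq_len=4096, batch_size=32) / GB

print(f"parameters: ~{params_gb:.0f} GB, KV cache: ~{cache_gb:.0f} GB")
# With these assumed batch and context sizes, the cache alone reaches tens of GB,
# several times the parameter memory.
```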
A team of researchers developed PyramidInfer, a new method that makes LLM inference more efficient by compressing the KV cache. Unlike existing methods, PyramidInfer retains only the crucial context keys and values at each layer, accounting for inter-layer dependencies and the memory cost of pre-computation. Compared with existing approaches, it improves throughput by 2.2x and reduces KV cache memory by over 54%.
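The following is a minimal sketch of what layer-wise, pyramid-shaped retention could look like: each layer keeps only the context keys and values that recent tokens attend to most, and deeper layers keep a smaller fraction. The function name, ratios, and the `attn_from_recent` input are assumptions for illustration, not the authors' implementation.

```python
# Sketch of pyramid-style, layer-wise KV retention (illustrative, not the paper's code).
# attn_from_recent[l] is assumed to hold, for layer l, the average attention that
# the most recent tokens pay to each context position.
import torch

def select_kv_per_layer(attn_from_recent, keep_ratio_top=0.9, keep_ratio_bottom=0.4):
    """Return the context indices to keep at every layer.

    Deeper layers keep a smaller fraction of context KVs, giving the
    'pyramid' shape of the retained cache.
    """
    num_layers = len(attn_from_recent)
    kept = []
    for layer, scores in enumerate(attn_from_recent):
        # Linearly shrink the retention ratio from shallow to deep layers.
        ratio = keep_ratio_top + (keep_ratio_bottom - keep_ratio_top) * layer / max(num_layers - 1, 1)
        k = max(1, int(scores.numel() * ratio))
        kept.append(torch.topk(scores, k).indices)
    return kept

# Example: 4 layers, 16 context positions, random "attention" scores.
scores_per_layer = [torch.rand(16) for _ in range(4)]
for layer, idx in enumerate(select_kv_per_layer(scores_per_layer)):
    print(f"layer {layer}: keep {idx.numel()} of 16 context KVs")
```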
The researchers drew on two observations: Recent Attention Consistency (RAC), which finds that certain keys and values are consistently attended to by recent tokens, and Inference Context Redundancy (ICR), which finds that many context keys and values are redundant at inference time and only needed during training. PyramidInfer uses these insights to substantially shrink the KV cache during both the prefill and generation phases.
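The RAC intuition can be illustrated with a small check on an attention matrix: do the last few query tokens keep attending to largely the same context positions? The function, thresholds, and toy data below are assumptions for the sketch, not the paper's code.

```python
# Illustrative check of the "recent attention consistency" idea.
import torch

def consistent_context(attn: torch.Tensor, num_recent: int = 4, top_k: int = 8):
    """attn: [num_queries, num_context] attention weights for one head/layer.

    Returns the context indices that appear in the top-k set of every one of
    the last `num_recent` query rows, i.e. positions all recent tokens favor.
    """
    recent_rows = attn[-num_recent:]                        # queries = recent tokens
    top_sets = [set(torch.topk(row, top_k).indices.tolist()) for row in recent_rows]
    shared = set.intersection(*top_sets)
    return sorted(shared)

# Toy example: 12 queries over 32 context positions.
attn = torch.softmax(torch.randn(12, 32), dim=-1)
print("consistently attended context positions:", consistent_context(attn))
```

If the shared set is large and stable, most of the remaining context keys and values can be dropped with little effect on the next token, which is the redundancy PyramidInfer exploits.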
The researchers evaluated PyramidInfer across a variety of tasks and models to confirm that the method generalizes. These evaluations showed reduced GPU memory usage and increased throughput while maintaining output quality. The tests covered language modeling on WikiText-2, benchmarks such as MMLU and BBH, and models such as LLaMA 2 and Vicuna 1.5-16k.
However, PyramidInfer is not without limitations. Because it requires additional computation, its speedup is limited at small batch sizes. Furthermore, as the first method to compress the KV cache in the prefill phase, it is not yet lossless and leaves room for improvement.
In summary, PyramidInfer demonstrates a new strategy for compressing the KV cache in both the prefill and generation phases, yielding significant reductions in GPU memory usage while preserving the quality of language model outputs. It could help deploy large language models efficiently in resource-constrained environments, such as chatbot services. With room for further improvement, the adoption and adaptation of PyramidInfer could strongly influence real-time applications of LLMs.