
FastGen: Efficiently Reducing GPU Memory Expenses without Sacrificing LLM Quality

Autoregressive language models (ALMs) have become invaluable tools for machine translation, text generation, and similar tasks. Despite their success, challenges such as high computational complexity and extensive GPU memory usage persist, making cost-effective ways to operate these models an urgent need. Large language models (LLMs), which rely on a key-value (KV) cache to speed up generation, also see their memory usage grow with model size and generation length. Existing methods for improving LLM efficiency, such as token skipping and BERT-style token selection, have limitations, signaling the need to explore pruning tokens within the KV cache of autoregressive LLMs.
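To see why the KV cache becomes a memory bottleneck, a back-of-the-envelope calculation helps. The sketch below assumes Llama 1-65B-like dimensions (80 decoder layers, hidden size 8192) and fp16 storage; these numbers are illustrative assumptions, not figures from the article.

```python
# Back-of-the-envelope KV cache footprint for a decoder-only transformer.
# The dimensions in the demo below are assumptions matching Llama 1-65B
# (80 layers, hidden size 8192) with fp16 storage.

def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache the key and value tensors for `seq_len` tokens."""
    # Factor of 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    for seq_len in (512, 2048, 8192):
        gib = kv_cache_bytes(num_layers=80, hidden_size=8192, seq_len=seq_len) / 2**30
        print(f"seq_len={seq_len:5d}: ~{gib:.1f} GiB of KV cache per sequence")
```

Under these assumptions, a single 2,048-token sequence already consumes roughly 5 GiB of KV cache on top of the model weights, and the cost keeps growing linearly with generation length.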

A new technique called FastGen has been proposed to address this problem. Introduced by researchers from the University of Illinois at Urbana-Champaign and Microsoft, FastGen aims to boost the inference efficiency of LLMs without any visible loss in quality. It adopts an adaptive approach to KV cache construction, guided by lightweight attention profiling, and requires no resource-intensive fine-tuning or re-training. FastGen is thus able to reduce GPU memory usage without a noticeable drop in generation quality.

FastGen’s approach involves two stages: Prompt Encoding and Token Generation. In the Prompt Encoding stage, the attention module of the autoregressive, transformer-based LLM gathers contextual information from all of the preceding i-1 tokens for the i-th token. Once prompt encoding is complete, Token Generation takes place, with the LLM producing output one token at a time.
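The paragraph above describes the two phases generically; the sketch below shows, under simplifying assumptions, how profiling-guided cache compression can slot into them: during prompt encoding each attention head's attention map is profiled to pick a cache policy, and during token generation that head's KV entries are evicted according to the chosen policy. The policy names ("special", "local", "full"), the recovery threshold, and the toy data are illustrative only; FastGen's actual policy set and selection procedure are richer than this minimal version.

```python
import numpy as np

# Minimal sketch of profiling-guided, per-head KV cache compression.
# Policies, thresholds, and shapes are illustrative assumptions, not FastGen's API.

def select_policy(attn, window=32, n_special=4, recovery_threshold=0.95):
    """Prompt-encoding phase: given one head's row-normalized attention map
    (prompt_len x prompt_len), pick the cheapest cache policy that still
    retains at least `recovery_threshold` of the attention mass."""
    prompt_len = attn.shape[0]
    mask_local = np.zeros_like(attn, dtype=bool)
    for q in range(prompt_len):                       # recent-window mask
        mask_local[q, max(0, q - window + 1):q + 1] = True

    retained = {
        "special": attn[:, :n_special].sum() / prompt_len,  # leading tokens only
        "local": attn[mask_local].sum() / prompt_len,        # recent window only
        "full": 1.0,                                          # keep everything
    }
    for policy in ("special", "local", "full"):       # cheapest policy first
        if retained[policy] >= recovery_threshold:
            return policy
    return "full"


def compress_cache(keys, values, policy, window=32, n_special=4):
    """Token-generation phase: evict the KV entries this head's policy
    does not keep. keys/values have shape (cached_len, head_dim)."""
    n = len(keys)
    if policy == "special":
        keep = np.arange(min(n_special, n))
    elif policy == "local":
        keep = np.arange(max(0, n - window), n)
    else:
        keep = np.arange(n)
    return keys[keep], values[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt_len, head_dim = 256, 128

    # Fake a head whose attention concentrates on the first few (special) tokens.
    logits = rng.normal(size=(prompt_len, prompt_len))
    logits[:, :4] += 8.0
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    policy = select_policy(attn)
    keys = rng.normal(size=(prompt_len, head_dim)).astype(np.float32)
    values = rng.normal(size=(prompt_len, head_dim)).astype(np.float32)
    keys, values = compress_cache(keys, values, policy)
    print(f"selected policy: {policy}; cache kept {len(keys)}/{prompt_len} entries")
```

Because each head keeps only what its own attention pattern needs, heads that attend broadly retain a full cache while heads with narrow, structured attention can be compressed aggressively, which is what makes the compression adaptive rather than one-size-fits-all.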

FastGen outperforms all non-adaptive KV compression methods for 30B models. It achieves a higher KV cache reduction ratio as model size increases while maintaining model quality. For example, FastGen reached a 44.9% pruned ratio on Llama 1-65B, a significant increase over the 16.9% pruned ratio on Llama 1-7B, while achieving a 45% win rate. A sensitivity analysis indicated that varying the hyper-parameters did not visibly affect generation quality.

FastGen’s promise lies not only in improving LLM inference efficiency without a loss in quality, but also in its potential to reduce the memory footprint of generative inference thanks to its adaptive KV cache compression. Looking ahead, the researchers plan to combine FastGen with other model compression approaches, such as quantization, distillation, and grouped-query attention. This research reinforces the assertion that, with the right strategies, computational costs can be reduced without compromising the quality of LLMs.
