Researchers from MIT propose a method called Cross-Layer Attention (CLA), a modification of the Transformer architecture that shrinks the key-value (KV) cache by sharing KV activations across layers.

MIT researchers developed Cross-Layer Attention (CLA) to relieve the memory-footprint bottleneck of the key-value (KV) cache in large language models (LLMs). As more applications demand longer input sequences, the KV cache's memory requirements limit batch sizes and force costly offloading techniques. Persistently storing and retrieving KV caches to avoid redundant computation is also desirable, but their size makes this challenging.

Existing head-sharing approaches for reducing KV cache size, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), are limited in how much memory reduction they can achieve. CLA extends key and value head sharing across adjacent layers, which reduces storage overhead without undermining the model's performance.

By computing key/value projections in only a subset of layers and letting the remaining layers reuse KV activations from earlier layers, CLA achieves a significant reduction in the KV cache memory footprint. The reduction factor equals the sharing factor, or slightly less when the sharing factor does not evenly divide the number of layers. Furthermore, the technique is orthogonal to both MQA and GQA, so it can be combined with either.
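To make the mechanism concrete, here is a minimal PyTorch sketch of cross-layer KV sharing with a sharing factor of 2. The class names (`CLABlock`, `CLAStack`), the `share_factor` parameter, and the omission of residual connections, normalization, and MLP sublayers are simplifications for illustration, not the authors' implementation.

```python
# Minimal sketch of cross-layer KV sharing (illustrative; not the paper's code).
# Layers whose index is a multiple of `share_factor` compute fresh K/V projections;
# the layers in between reuse the most recent K/V activations instead.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLABlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, has_kv_proj: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.has_kv_proj = has_kv_proj          # False => reuse K/V from an earlier layer
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        if has_kv_proj:
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.has_kv_proj:
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        else:
            k, v = shared_kv                    # reuse earlier activations: no new cache entries
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), (k, v)

class CLAStack(nn.Module):
    def __init__(self, n_layers: int, d_model: int, n_heads: int, share_factor: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            CLABlock(d_model, n_heads, has_kv_proj=(i % share_factor == 0))
            for i in range(n_layers)
        )

    def forward(self, x):
        kv = None
        for layer in self.layers:
            x, kv = layer(x, shared_kv=kv)      # only layers with K/V projections refresh `kv`
        return x
```

With `share_factor=2`, only half of the layers produce K/V tensors that must be cached, which is where the roughly 2x reduction in cache size comes from.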

Benefits of CLA include a smaller memory footprint for the intermediate KV activation tensors materialized during training, compatibility with standard tensor parallelism techniques, and a reduction in the model's parameter count and in the number of floating-point operations (FLOPs) needed per forward and backward pass, while also enabling larger batch sizes and longer KV cache persistence times. However, it does not directly reduce the memory bandwidth consumed by the attention mechanism in each decoding step, nor the latency of the core attention computation during decoding.
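The batch-size benefit is easy to see with a back-of-the-envelope calculation. The configuration below (layer count, head count, head dimension, context length) is hypothetical and only illustrates the arithmetic, assuming 16-bit KV storage and a GQA baseline.

```python
# Back-of-the-envelope KV cache sizing (hypothetical numbers, not from the paper).
def kv_cache_bytes(n_layers_with_kv, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers both keys and values; 2 bytes per element for fp16/bf16.
    return 2 * n_layers_with_kv * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Assumed config: 32 layers, 8 KV heads (GQA), head dim 128, 32k context, batch size 8.
baseline = kv_cache_bytes(32, 8, 128, 32_768, 8)
cla2     = kv_cache_bytes(16, 8, 128, 32_768, 8)   # CLA2: only every other layer stores fresh K/V

print(f"GQA baseline cache: {baseline / 2**30:.1f} GiB")
print(f"GQA + CLA2 cache:   {cla2 / 2**30:.1f} GiB")  # roughly half, freeing memory for larger batches
```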

Experiments at the 1-billion- and 3-billion-parameter scales show that CLA enables favorable accuracy/memory tradeoffs compared to plain GQA or MQA. A sharing factor of 2 (CLA2) proved the most effective, and its impact was consistent across the 1B and 3B scales.

The researchers conclude that CLA reduces the KV cache memory footprint while maintaining performance, and that models serving longer sequences stand to gain the most. Future work will include end-to-end inference-efficiency evaluations of large, long-context models.
