
Researchers from MIT have proposed a modification to the Transformer architecture called Cross-Layer Attention (CLA), which shrinks the key-value (KV) cache by sharing KV activations across layers.

Serving large language models (LLMs) often runs up against the size of the key-value (KV) cache, which grows with both sequence length and batch size. Techniques such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the cache by reducing the number of key/value heads, but the memory savings they offer are limited.
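To make that scaling concrete, here is a rough back-of-envelope sizing of the KV cache under MHA, GQA, and MQA; the function name and the model dimensions, batch size, and sequence length below are illustrative assumptions, not figures from the paper.

```python
# Illustrative KV-cache sizing, assuming fp16 storage and hypothetical
# model dimensions (not taken from the paper).

def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Each layer stores one key and one value tensor per KV head:
    # 2 (K and V) * batch * seq_len * n_kv_heads * head_dim elements.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 1B-class model: 16 layers, 16 attention heads, head_dim 128.
mha = kv_cache_bytes(batch=8, seq_len=4096, n_layers=16, n_kv_heads=16, head_dim=128)
gqa = kv_cache_bytes(batch=8, seq_len=4096, n_layers=16, n_kv_heads=4,  head_dim=128)
mqa = kv_cache_bytes(batch=8, seq_len=4096, n_layers=16, n_kv_heads=1,  head_dim=128)

print(f"MHA: {mha/2**30:.2f} GiB, GQA(4): {gqa/2**30:.2f} GiB, MQA: {mqa/2**30:.2f} GiB")
```

Even after dropping to a single KV head with MQA, the cache still grows linearly with batch size, sequence length, and layer count, which is the dimension CLA targets.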

In response, researchers at the Massachusetts Institute of Technology developed a method called Cross-Layer Attention (CLA). CLA extends key/value head sharing beyond a single layer to adjacent layers: some layers compute key/value projections, while others reuse the KV activations produced by an earlier layer, so only a fraction of layers contribute entries to the KV cache. This yields a significant reduction in the KV cache memory footprint, and the method can be combined with either MQA or GQA, enabling larger batch sizes and longer KV cache persistence times.
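The sketch below illustrates the idea under simplifying assumptions: single-head attention, no residuals or normalization, and CLA2-style sharing where every other layer owns the KV projections. The class and variable names (CLABlock, produces_kv, shared_kv) are hypothetical and not from the authors' code.

```python
# A minimal PyTorch sketch of CLA2-style cross-layer KV sharing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLABlock(nn.Module):
    def __init__(self, d_model: int, produces_kv: bool):
        super().__init__()
        self.produces_kv = produces_kv
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        if produces_kv:
            # Only "producer" layers own KV projections and thus KV-cache entries.
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.produces_kv:
            k, v = self.k_proj(x), self.v_proj(x)   # computed (and cached) here
        else:
            k, v = shared_kv                        # reused from the layer below
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn), (k, v)

# CLA2: even-indexed layers produce KV; the next layer consumes the same tensors.
layers = nn.ModuleList([CLABlock(d_model=256, produces_kv=(i % 2 == 0)) for i in range(8)])

x = torch.randn(2, 16, 256)   # (batch, seq_len, d_model)
kv = None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)
```

Because the consumer layers never materialize their own keys and values, an inference KV cache built over this stack would only hold entries for half of the layers.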

When tested in transformer-based language models at the 1-billion and 3-billion parameter scales, CLA showed favorable accuracy/memory tradeoffs compared with MQA and GQA alone. In the experiments, combining MQA with CLA at a sharing factor of 2 (CLA2) was the most effective configuration: it cut the KV cache in half while causing only a mild degradation in perplexity, and in some cases even an improvement.
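A quick back-of-envelope check of that factor-of-2 reduction, reusing the same hypothetical dimensions as the earlier sizing example (16 layers, 1 KV head, head_dim 128, fp16, batch 8, sequence length 4096):

```python
# Effect of CLA2 on top of MQA: with a sharing factor of 2, only half the
# layers keep KV entries, so the cache halves. Numbers are illustrative only.
bytes_per_layer = 2 * 8 * 4096 * 1 * 128 * 2        # K and V, fp16
mqa_cache       = 16 * bytes_per_layer               # every layer caches KV
mqa_cla2_cache  = (16 // 2) * bytes_per_layer        # only producer layers cache KV

print(f"MQA: {mqa_cache/2**20:.0f} MiB -> MQA+CLA2: {mqa_cla2_cache/2**20:.0f} MiB")
```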

Finally, the researchers suggest that CLA will be especially useful for LLMs that process extremely long sequences, such as models with long-term memory or those using Landmark Attention. They conclude that CLA is a promising development for memory-constrained LLM deployments, cutting KV cache storage by a factor of 2. Future work includes inference-efficiency evaluations of large, long-context models that employ CLA.
