MIT researchers have developed a method known as Cross-Layer Attention (CLA) to alleviate the memory footprint bottleneck of the key-value (KV) cache in large language models (LLMs). As more applications demand longer input sequences, the KV cache's memory requirements limit batch sizes and necessitate costly offloading techniques. Additionally, persistently storing and retrieving KV caches to…
