Memory is a crucial component of intelligence, facilitating the recall and application of past experiences to current situations. However, both traditional Transformer models and Transformer-based Large Language Models (LLMs) struggle with context-dependent memory because of how their attention mechanisms work: the memory consumption and computation time of standard attention grow quadratically with the length of the input sequence.
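To see where the quadratic cost comes from, here is a minimal NumPy sketch of standard scaled dot-product attention (illustrative only; the shapes and names are ours, not any particular library's): for a sequence of n tokens, it materializes an (n, n) score matrix.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Vanilla scaled dot-product attention over a length-n sequence.

    The score matrix is (n, n), so both the memory to hold it and the
    time to compute it grow quadratically with sequence length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (n, d)

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = standard_attention(Q, K, V)  # materializes a 4096 x 4096 matrix
```

Doubling n quadruples the size of that score matrix, which is exactly the scaling problem described above.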
Compressive memory systems have emerged as a potential solution, offering a more efficient and scalable way to manage long sequences. Unlike classical attention, whose memory footprint expands with the length of the input sequence, a compressive memory stores and retrieves information using a fixed number of parameters, keeping storage and computation costs bounded.
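A minimal sketch of one common compressive-memory formulation, the linear-attention-style associative matrix that Infini-attention builds on; the class and names here are illustrative assumptions, not the paper's code. However many tokens are written, storage stays a fixed (d x d) matrix plus a d-vector.

```python
import numpy as np

def elu_plus_one(x):
    # Nonlinearity commonly used in linear attention; keeps features positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Fixed-size associative memory: a (d x d) matrix plus a
    d-dimensional normalization vector, regardless of how many
    tokens have been stored."""
    def __init__(self, d):
        self.M = np.zeros((d, d))   # associative key-value bindings
        self.z = np.zeros(d)        # normalization term

    def update(self, K, V):
        # Bind this segment's keys to its values and accumulate.
        sK = elu_plus_one(K)
        self.M += sK.T @ V
        self.z += sK.sum(axis=0)

    def retrieve(self, Q):
        # Read out value estimates for the given queries.
        sQ = elu_plus_one(Q)
        return (sQ @ self.M) / (sQ @ self.z + 1e-6)[:, None]
```

The key property is that `update` never grows the state: storing a million tokens costs the same memory as storing a hundred.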
Google researchers have proposed a new solution to these memory-related limitations of Transformer LLMs: an “Infini-attention” mechanism that lets models process arbitrarily long inputs while keeping memory footprint and compute under control. Infini-attention combines masked local attention and long-term linear attention in a single Transformer block, incorporating a compressive memory into the conventional attention process. This lets a model retain and recall information from long sequences with a fixed set of memory parameters, bounding both computation cost and memory consumption.
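Continuing the sketch above, one way to picture a single Infini-attention step: retrieve long-term context from the compressive memory, run ordinary masked attention over the current segment, blend the two with a learned sigmoid gate, and then write the segment into memory. This mirrors the mechanism the paper describes, but the code itself is an illustrative simplification (single head, NumPy, no projection layers).

```python
import numpy as np

def infini_attention_segment(Q, K, V, memory, beta):
    """Sketch of one Infini-attention step over a segment."""
    # 1. Retrieve long-term context accumulated from past segments.
    A_mem = memory.retrieve(Q)

    # 2. Ordinary causally masked dot-product attention within the segment.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    A_local = w @ V

    # 3. A learned scalar gate blends long-term and local context.
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * A_mem + (1.0 - g) * A_local

    # 4. Fold the current segment into memory for future segments.
    memory.update(K, V)
    return out
```

The gate lets the model learn, per attention head, how much to rely on compressed long-term context versus fresh local context.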
The researchers demonstrated the effectiveness of the Infini-attention approach on a variety of tasks: summarizing books from input sequences of 500,000 tokens, retrieving passkeys hidden in context blocks up to 1 million tokens long, and performing strongly on long-context language modeling benchmarks. LLMs ranging from 1 billion to 8 billion parameters successfully performed these tasks.
A key advantage of the Infini-attention approach is its minimal, bounded set of memory parameters, which makes the model’s memory requirements both limited and predictable. The mechanism also enables fast streaming inference, letting LLMs analyze sequential input efficiently in real time or near-real time.
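A usage sketch of that streaming property, built on the illustrative code above: an arbitrarily long token stream is consumed segment by segment while the memory stays a fixed size. The segment length and dimensions are arbitrary choices for the example.

```python
import numpy as np

# Streaming inference sketch: process a long stream in fixed-size
# segments; the compressive memory never grows.
d, seg_len, beta = 64, 256, 0.0
memory = CompressiveMemory(d)

for _ in range(1000):                 # 1000 segments = 256,000 tokens total
    Q = np.random.randn(seg_len, d)   # stand-ins for the projected
    K = np.random.randn(seg_len, d)   # queries/keys/values of the next
    V = np.random.randn(seg_len, d)   # incoming segment
    out = infini_attention_segment(Q, K, V, memory, beta)

# Peak memory is O(seg_len^2 + d^2), independent of total stream length.
```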
In addition, the Infini-attention mechanism requires only minimal changes to the standard scaled dot-product attention mechanism, so it integrates cleanly into existing Transformer architectures and supports plug-and-play long-context adaptation and continual pre-training.
In conclusion, the Infini-attention mechanism introduced by Google’s team is a significant advancement for LLMs, enabling them to handle long inputs efficiently in both computation and memory. That makes it a valuable tool for applying LLMs to large-scale, real-world data.