Google researchers have developed a technique called Infini-attention, aimed at improving the efficiency of large language models (LLMs). The technique allows LLMs to handle infinitely long text without the ballooning compute and memory requirements that limit existing models.
LLMs built on the Transformer architecture work by attending to all tokens, or pieces of data, in a text input or prompt. This process can be intensely demanding on compute resources, because the memory and processing requirements of standard attention grow quadratically with the number of tokens.
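To make that scaling concrete, here is a minimal NumPy sketch (purely illustrative, not code from the paper) of standard scaled dot-product attention: it materializes an L x L score matrix, which is where the quadratic growth in memory and compute comes from.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Q, K, V: arrays of shape (L, d). Returns an (L, d) output."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (L, L) matrix: quadratic in L
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V

# For a 4,096-token input, the score matrix alone holds 4,096 x 4,096 entries.
L, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, L, d))
out = standard_attention(Q, K, V)
```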
Normally, when the text input grows beyond the context length the model can handle, earlier information is lost. Infini-attention changes this by allowing information to be retained beyond the LLM's context window: it integrates a compressive memory with a modified attention mechanism, so that older but still relevant data is kept in a compressed form.
Rather than attempting to retain all of the older input, the process weighs and summarizes the parts deemed relevant. The technique still uses a traditional attention mechanism, but reuses the key-value (KV) states from previous segments in the model instead of discarding them, as sketched below.
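As a rough illustration of that idea, the sketch below (an interpretation of the paper's description rather than its actual code; the ELU-based feature map and function names are assumptions) folds a segment's KV states into a fixed-size associative memory matrix instead of discarding them.

```python
def elu_plus_one(x):
    """Non-negative feature map (ELU + 1), common in linear-attention work."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def update_memory(M, z, K_seg, V_seg):
    """Fold one segment's KV states into a fixed-size compressive memory.

    M: (d, d) memory matrix, z: (d,) normalization vector.
    K_seg, V_seg: (segment_len, d) keys and values for the segment.
    """
    sigma_K = elu_plus_one(K_seg)
    M = M + sigma_K.T @ V_seg        # accumulate key-value associations
    z = z + sigma_K.sum(axis=0)      # accumulate the normalizer
    return M, z
```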
As LLMs take in new data, they apply local attention to the recent input, while the modified system continually distills and compresses older data for future reference. It is this ability to hold detailed historical data in compressed form that has garnered attention, since the efficient, fixed-size memory offers a compelling alternative to maintaining large-scale stores of past data.
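Building on the helpers and tensors sketched above, a hypothetical segment step might combine local attention over recent tokens with a read from the compressed memory, mixed by a learned gate, before the segment's own KV states are distilled into the memory. Again, this is a schematic reading of the approach, not the authors' implementation; the gate and segment length shown here are illustrative assumptions.

```python
def retrieve_memory(M, z, Q_seg):
    """Read compressed history for the current segment's queries."""
    sigma_Q = elu_plus_one(Q_seg)
    return (sigma_Q @ M) / (sigma_Q @ z + 1e-6)[:, None]

def infini_segment_step(Q_seg, K_seg, V_seg, M, z, gate_logit):
    """One segment: local attention on recent tokens, a read from the
    compressed memory for older context, a gated mix of the two, and
    finally an update of the memory with this segment's KV states."""
    local = standard_attention(Q_seg, K_seg, V_seg)   # recent input
    past = retrieve_memory(M, z, Q_seg)               # compressed history
    g = 1.0 / (1.0 + np.exp(-gate_logit))             # learned scalar gate
    output = g * past + (1.0 - g) * local
    M, z = update_memory(M, z, K_seg, V_seg)          # distill this segment
    return output, M, z

# Hypothetical usage: process a long input segment by segment,
# carrying the fixed-size memory (M, z) across segments.
seg_len = 512
M = np.zeros((d, d))
z = np.zeros(d)
for start in range(0, L, seg_len):
    sl = slice(start, start + seg_len)
    out_seg, M, z = infini_segment_step(Q[sl], K[sl], V[sl], M, z, gate_logit=0.0)
```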
Google ran benchmarking tests comparing Infini-attention models with other extended-context models such as Transformer-XL and Memorizing Transformers. The Infini-Transformer performed significantly better, offering superior retention while requiring 114 times less memory.
In passkey retrieval tests, the Infini-attention models consistently found random numbers hidden in large volumes of text, whereas other models struggled when the crucial data wasn't near the end of the input. The results led the Google researchers to believe that Infini-attention could be scaled to extremely long input sequences while keeping compute and memory requirements bounded.
Infini-attention is also straightforward to integrate into the pre-training and fine-tuning of existing Transformer models, potentially extending their context windows without retraining from scratch.
The world of large language models is continually evolving, and context windows will inevitably continue to grow. Even so, Infini-attention's ability to manage significantly more data with fewer computational resources looks set to redefine how these models are trained and managed in the future.