Google AI researchers have developed a new Transformer architecture dubbed TransformerFAM (Feedback Attention Memory), aimed at enhancing performance on extremely long-context tasks. Although Transformers have proved revolutionary in deep learning, their quadratic attention complexity limits their ability to process infinitely long inputs. Existing Transformers forget information that falls outside the attention window and struggle to process extended context.
Attempts to rectify these limitations, such as sliding window attention and sparse or linear approximations of attention, often run into scaling issues. TransformerFAM circumvents these issues by introducing a feedback loop within the Transformer blocks that lets the network attend to its own latent representations. This feedback loop fosters the emergence of working memory in the Transformer while adding no new weights to the model.
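As a rough illustration of this idea, the sketch below appends feedback activations to the key/value input of an otherwise standard self-attention computation, so the memory is read with the model's existing projection weights. The function names, toy dimensions, and random weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_feedback(x, fam, w_q, w_k, w_v):
    """Self-attention over the current block plus feedback (FAM) activations.

    The FAM activations are concatenated to the keys/values, so the block is
    processed with the same projection matrices as ordinary tokens; no
    additional weights are introduced. (Hypothetical sketch, not the paper's code.)
    """
    kv_in = np.concatenate([fam, x], axis=0)   # feedback memory prepended to the block
    q = x @ w_q                                # queries come from the current block only
    k, v = kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Toy dimensions: block of 4 tokens, 2 FAM slots, model width 8.
rng = np.random.default_rng(0)
d = 8
x, fam = rng.standard_normal((4, d)), rng.standard_normal((2, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = attend_with_feedback(x, fam, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```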
The flexibility of TransformerFAM allows easy integration with pre-trained models, making it simpler for Transformers to preserve past information and handle extremely long input sequences. Importantly, TransformerFAM can reuse existing pre-trained checkpoints, and it significantly improves performance across a variety of model sizes.
Previous attempts at integrating feedback mechanisms into Transformers did not adequately consider potential representational gaps, so those models can struggle with long-context inputs. While Big Bird’s Sliding Window Attention (SWA) and the subsequent Block Sliding Window Attention (BSWA) attempted to address this issue, they still suffer from a limited receptive field. Other alternatives, such as MLP-Mixer and state space models, have their own limitations.
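To make the receptive-field limitation concrete, here is a minimal sketch of a block sliding window attention mask, assuming a simple causal setup in which each block may also look at a fixed number of past blocks; the function and parameter names are illustrative assumptions, not BSWA's actual interface.

```python
import numpy as np

def block_sliding_window_mask(seq_len, block_size, memory_blocks=1):
    """Boolean mask for block sliding window attention (True = may attend).

    Each token attends to its own block plus `memory_blocks` previous blocks,
    so information older than that horizon falls outside the receptive field.
    """
    blocks = np.arange(seq_len) // block_size
    diff = blocks[:, None] - blocks[None, :]   # how many blocks back each key lies
    causal = np.arange(seq_len)[:, None] >= np.arange(seq_len)[None, :]
    return causal & (diff >= 0) & (diff <= memory_blocks)

mask = block_sliding_window_mask(seq_len=8, block_size=2, memory_blocks=1)
print(mask.astype(int))
# The rows for tokens 6-7 contain zeros for tokens 0-3: anything beyond one
# past block is invisible, which is the limited receptive field noted above.
```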
The researchers designed FAM to overcome the shortcomings of BSWA. It incorporates feedback activations into each block, enabling dynamic propagation of global contextual information across blocks. FAM meets several crucial requirements: integrated attention, block-wise updates, information compression, and retention of global context. As a result, it propagates comprehensive contextual information and enriches representations, surpassing the restrictions posed by BSWA.
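A minimal sketch of this block-wise feedback scheme, under the assumption of a single FAM state per layer that is read and then rewritten at every block, might look as follows; the helper names and toy dimensions are assumptions for illustration, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def fam_layer(sequence, fam, w_q, w_k, w_v, block_size=4):
    """Block-wise processing with a feedback memory carried across blocks.

    For every block: (1) block queries attend over [FAM, block], so the global
    context stored in FAM enriches the block representation (integrated
    attention); (2) FAM queries attend over the same span, compressing the
    block into a fixed number of feedback activations that are handed to the
    next block (block-wise update and information compression).
    """
    outputs = []
    for start in range(0, len(sequence), block_size):
        block = sequence[start:start + block_size]
        span = np.concatenate([fam, block], axis=0)
        outputs.append(attention(block, span, w_q, w_k, w_v))  # enriched block output
        fam = attention(fam, span, w_q, w_k, w_v)              # updated feedback memory
    return np.concatenate(outputs, axis=0), fam

rng = np.random.default_rng(0)
d, fam_len = 8, 2
seq = rng.standard_normal((16, d))
fam0 = rng.standard_normal((fam_len, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, fam_final = fam_layer(seq, fam0, w_q, w_k, w_v)
print(out.shape, fam_final.shape)  # (16, 8) (2, 8)
```

Because the FAM state has a fixed size regardless of how many blocks have been processed, global context is retained indefinitely without growing the attention span.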
The protagonist’s struggle with short-term memory loss in the movie ‘Memento’ aptly captures the state of current language models, whose attention windows limit their short-term memory. TransformerFAM helps address this analogous ‘anterograde amnesia’ in language models by leveraging an attention-based working memory. This innovation offers a path toward solving the memory challenge in deep learning models, a pivotal step toward tackling more complex problems in AI research, such as reasoning.