Large Language Models (LLMs) underpin a variety of applications, from machine translation to predictive text completion, yet they face persistent challenges: capturing complex, long-range dependencies and enabling efficient, large-scale parallelisation. The attention-based models that dominate LLM architectures suffer from quadratic computational complexity in sequence length and struggle to extrapolate to sequences longer than those seen in training. State Space Models (SSMs), by contrast, offer linear computational complexity and the potential for better extrapolation, but their Markovian nature limits precise memory recall, which is a problem for retrieval-related tasks.
Hybrid models that combine SSMs with attention mechanisms do not resolve the limitations of either component: existing hybrids still fail to achieve unlimited-length extrapolation with linear time complexity. Techniques for length generalisation in attention mechanisms bring their own problems, notably quadratic computational cost and a limited ability to extrapolate beyond the training context.
Addressing these challenges, researchers from Microsoft and the University of Illinois at Urbana-Champaign have introduced SAMBA, an architecture that combines the strengths of SSMs and attention-based models and can process sequences of effectively unlimited length with linear complexity. SAMBA interleaves Mamba layers (a selective SSM), SwiGLU MLP layers, and Sliding Window Attention (SWA) layers to capture time-dependent semantics and model complex dependencies, and the design scales across a range of model sizes. The largest model they tested had 3.8 billion parameters and was pre-trained on 3.2 trillion tokens; it outperformed most other open-source language models of up to 8 billion parameters.
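To make the role of SWA concrete, here is a minimal PyTorch sketch of the masking idea behind sliding window attention: each token attends only to a fixed-size window of preceding tokens, so the work per token stays constant as the sequence grows. This is an illustrative sketch, not the authors' implementation, and for clarity it materialises the full score matrix rather than computing only the band, as an efficient linear-time kernel would.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Causal attention where each position attends only to the previous
    `window` positions (including itself). With a fixed window, the useful
    work per token is constant, so total cost grows linearly with length.

    q, k, v: tensors of shape (batch, seq_len, dim).
    """
    b, t, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # (b, t, t)

    # Banded causal mask: position i may see positions
    # max(0, i - window + 1) .. i and nothing else.
    idx = torch.arange(t)
    causal = idx[None, :] <= idx[:, None]              # lower triangle
    in_window = idx[:, None] - idx[None, :] < window   # within the band
    mask = causal & in_window                          # (t, t)

    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                 # (b, t, d)

# Example: one sequence of 16 tokens, 8-dim vectors, window of 4.
q = k = v = torch.randn(1, 16, 8)
out = sliding_window_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([1, 16, 8])
```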
The innovation in SAMBA comes from how it hybridises Mamba, SWA, and Multi-Layer Perceptron (MLP) layers. Mamba layers capture recurrent, time-dependent semantics through selective state spaces, while SWA layers model the complex, non-Markovian dependencies that a purely recurrent SSM cannot, using a fixed-size sliding window over recent tokens. The researchers compared several hybridisation strategies, including Samba, Mamba-SWA-MLP, and Mamba-MLP, with the aim of harmonising these distinct functionalities for efficient language modelling.
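The hybridisation itself amounts to choosing which sublayers to interleave and in what order. The sketch below shows one way to express those three strategies as repeating layer patterns in PyTorch; the sublayers are plain stand-ins, and the exact per-block orderings are assumptions for illustration rather than the paper's specification.

```python
import torch
import torch.nn as nn

# Stand-ins for the real sublayers. In the actual architecture "mamba" is a
# selective SSM layer, "swa" is sliding window attention, and "mlp" is a
# SwiGLU MLP; plain Linear layers are used here only so the stacking runs.
def make_sublayer(kind: str, dim: int) -> nn.Module:
    assert kind in {"mamba", "swa", "mlp"}
    return nn.Linear(dim, dim)

class HybridStack(nn.Module):
    """Stack built by repeating a per-block layer pattern, with pre-norm and
    a residual connection around every sublayer."""

    def __init__(self, dim: int, pattern: list[str], n_blocks: int):
        super().__init__()
        self.norms = nn.ModuleList()
        self.layers = nn.ModuleList()
        for _ in range(n_blocks):
            for kind in pattern:
                self.norms.append(nn.LayerNorm(dim))
                self.layers.append(make_sublayer(kind, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # pre-norm residual around each sublayer
        return x

# The three hybridisation strategies named above, written as layer patterns.
# The per-block orderings here are illustrative assumptions.
samba         = HybridStack(64, ["mamba", "mlp", "swa", "mlp"], n_blocks=2)
mamba_swa_mlp = HybridStack(64, ["mamba", "swa", "mlp"],        n_blocks=2)
mamba_mlp     = HybridStack(64, ["mamba", "mlp"],               n_blocks=2)

x = torch.randn(1, 16, 64)  # (batch, seq_len, model dim)
print(samba(x).shape)       # torch.Size([1, 16, 64])
```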
According to the paper, the 3.8-billion-parameter SAMBA model outperformed several baseline models across a range of benchmarks, excelling at tasks involving common-sense reasoning, language understanding, truthfulness, and coding. Notably, it achieved 18.1% higher accuracy than the Transformer++ model.
SAMBA therefore represents a significant advance in language modelling. By combining the strengths of attention mechanisms and SSMs in a single hybrid architecture, it excels across a variety of benchmarks, handles effectively unlimited context lengths, and shows strong memory extrapolation, making it well suited to real-world applications that require extensive context understanding. Its balance of attention and recurrent structures yields a powerful, efficient model that pushes the boundaries of language modelling and offers a promising approach to complex natural language processing tasks.