
This article presents a direct experimental comparison of 8B-parameter Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer models, each trained on up to 3.5 trillion tokens.

Transformer-based Large Language Models (LLMs) have become essential to Natural Language Processing (NLP), with their self-attention mechanism delivering impressive results across a wide range of tasks. However, self-attention struggles with long sequences: its computational cost and memory requirements grow quadratically with sequence length. Alternatives designed to optimize the self-attention layers have been proposed, but they often fall short of standard self-attention in language modeling quality.
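As a rough illustration of that scaling, the sketch below (a minimal NumPy example, not taken from the paper) materializes the n-by-n attention score matrix, which is where the quadratic cost in sequence length comes from.

```python
import numpy as np

def naive_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of n token embeddings.

    x: (n, d) input embeddings; w_q, w_k, w_v: (d, d) projection weights.
    The score matrix is (n, n), so compute and memory grow quadratically
    with the sequence length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])           # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d) outputs

rng = np.random.default_rng(0)
n, d = 1024, 64                                      # illustrative sizes
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
out = naive_self_attention(x, *w)
print(out.shape)  # (1024, 64); the intermediate score matrix was (1024, 1024)
```

Doubling n quadruples the size of that score matrix, which is why long contexts become expensive for standard transformers.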

Selective state-space models (SSMs) such as Mamba have emerged as a viable way around some of these limitations. Where transformers incur quadratic computational complexity and a memory footprint that grows with context length during inference, SSMs compute in linear time and carry only a fixed-size recurrent state. Recent research indicates that SSMs can match or even surpass transformers on language modeling tasks.
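The snippet below is a heavily simplified, hypothetical selective-SSM recurrence (a single channel with a diagonal state update, no hardware-aware scan), meant only to show why per-token inference cost stays constant with a fixed-size state; it is not Mamba's actual implementation.

```python
import numpy as np

def selective_ssm_step(h, x_t, a_t, b_t, c_t):
    """One recurrent step: the hidden state h has a fixed size, so per-token
    compute and memory do not grow with how many tokens came before."""
    h = a_t * h + b_t * x_t          # input-dependent ("selective") update
    y_t = (c_t * h).sum(axis=-1)     # readout
    return h, y_t

rng = np.random.default_rng(0)
d_state, n_tokens = 16, 1024
h = np.zeros(d_state)
outputs = []
for _ in range(n_tokens):
    x_t = rng.standard_normal()      # scalar input for this toy channel
    # In a selective SSM these parameters are computed from the input token.
    a_t = np.exp(-np.abs(rng.standard_normal(d_state)))  # keep |a| < 1 for stability
    b_t = rng.standard_normal(d_state)
    c_t = rng.standard_normal(d_state)
    h, y_t = selective_ssm_step(h, x_t, a_t, b_t, c_t)
    outputs.append(y_t)

print(len(outputs), h.shape)  # 1024 outputs; the state stays (16,) throughout
```

Unlike a transformer's key-value cache, nothing here grows with the number of tokens already generated, which is the intuition behind the inference-time advantages reported for SSMs.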

Despite these promising initial results, most studies comparing SSMs and transformers have been limited to small-scale experiments with models under 3 billion parameters and datasets under one trillion tokens. Recent work has moved beyond these limits: in an in-depth comparative study, 8-billion-parameter Mamba, Mamba-2, and Transformer models were trained on datasets of up to 3.5 trillion tokens.

The research team also proposed an 8-billion-parameter hybrid model, Mamba-2-Hybrid, composed of 43% Mamba-2 layers, 7% self-attention layers, and 50% MLP layers. This model was evaluated against a standard transformer across a range of NLP tasks. The findings showed that the SSM models matched or exceeded the transformer on several tasks.
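For intuition about that layer budget, the sketch below turns the stated percentages into a concrete layer stack. The 56-layer total and the even interleaving of layer types are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of the Mamba-2-Hybrid layer budget. The percentages come
# from the article; the total depth (56) and the even spacing of layer types
# are assumptions made for this example only.
TOTAL_LAYERS = 56  # assumed depth for an ~8B-parameter model

counts = {
    "mamba2": round(0.43 * TOTAL_LAYERS),          # ~24 Mamba-2 layers
    "self_attention": round(0.07 * TOTAL_LAYERS),  # ~4 self-attention layers
    "mlp": round(0.50 * TOTAL_LAYERS),             # ~28 MLP layers
}

def build_layer_pattern(counts):
    """Spread each layer type as evenly as possible through the stack."""
    slots = []
    for kind, n in counts.items():
        # Fractional positions in [0, 1) used to interleave the layer types.
        slots += [((i + 0.5) / n, kind) for i in range(n)]
    return [kind for _, kind in sorted(slots)]

pattern = build_layer_pattern(counts)
print(len(pattern), {k: pattern.count(k) for k in counts})
# 56 {'mamba2': 24, 'self_attention': 4, 'mlp': 28}
```

Keeping self-attention at only a small fraction of the layers preserves most of the SSM efficiency benefits while retaining some of attention's strengths on copying and in-context tasks.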

However, the pure SSM models struggled with tasks requiring substantial long-context reasoning and those demanding robust copying or in-context learning. Despite this, the Mamba-2-Hybrid model outperformed the standard transformer on all 12 standard tasks assessed, with an average improvement of 2.65 points, while generating tokens up to eight times faster during inference.

To further analyze long-context capabilities, the research team extended the Mamba-2-Hybrid and Transformer models to support sequence lengths of 16K, 32K, and 128K tokens. Across an additional 23 long-context tasks, the hybrid model continued, on average, to match or surpass the Transformer's performance.

In support of wider research, the team has released their code as part of NVIDIA's Megatron-LM project. These results point to a promising future for SSMs in NLP and motivate further studies, with greater training resources, to examine the full potential of Mamba-based models.
