Large Language Models (LLMs) such as Mistral, Gemma, and Llama have driven major advances in Natural Language Processing (NLP), but their dense architectures make them computationally heavy and expensive to run. Because every parameter is used for every token during inference, deploying them affordably and at scale remains challenging.
Conditional computation offers a way to improve efficiency by activating only the parameters relevant to a given input, avoiding unnecessary calculations. The first approach, Mixture-of-Experts (MoE), imposes structural constraints before training, routing each token to a small subset of expert feed-forward networks so that compute scales with the active experts rather than the full model. The second approach exploits the sparsity inherent in activation functions such as ReLU, which outputs zero for non-positive inputs; the corresponding neurons can then be skipped at inference time.
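To make the second idea concrete, the toy sketch below (illustrative only; the layer sizes and variable names are assumptions, not from the paper) shows how ReLU's zeros let an inference engine skip the columns of the down-projection that correspond to inactive neurons:

```python
# Minimal sketch of ReLU-induced sparsity in a feed-forward layer.
import torch

d_model, d_ff = 8, 32
x = torch.randn(d_model)
W_up = torch.randn(d_ff, d_model)
W_down = torch.randn(d_model, d_ff)

h = torch.relu(W_up @ x)          # ReLU zeroes every non-positive pre-activation
active = h.nonzero().squeeze(-1)  # indices of neurons that actually fire

# Dense path: touches all d_ff columns of W_down.
y_dense = W_down @ h
# Sparse path: only loads and multiplies the columns of the active neurons.
y_sparse = W_down[:, active] @ h[active]

assert torch.allclose(y_dense, y_sparse, atol=1e-5)
print(f"active neurons: {len(active)}/{d_ff}")
```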
However, many LLMs use activation functions such as GELU and Swish, which, despite their training advantages, do not produce enough sparsity and are therefore hard to accelerate with conditional computation. A proposed remedy, ReLUfication, swaps the existing activation function for ReLU and continues pre-training, but prior attempts have suffered from degraded performance while still falling short of the desired sparsity.
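As a rough illustration of what ReLUfication changes (a sketch assuming a standard SwiGLU-style feed-forward block; the class and parameter names are mine, not the paper's):

```python
# Hypothetical GLU feed-forward block; ReLUfication simply swaps the gate activation.
import torch.nn as nn
import torch.nn.functional as F

class GLUFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, relufied=False):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        # ReLUfication: replace the smooth SiLU (Swish) gate activation with ReLU
        self.act = F.relu if relufied else F.silu

    def forward(self, x):
        return self.down(self.act(self.gate(x)) * self.up(x))
```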
To address this, a team of Chinese researchers introduced dReLU, a new activation function that increases sparsity by zeroing out negative activations in both branches of the GLU component. Small-scale LLMs pre-trained with dReLU matched the performance of SwiGLU models while reaching close to 90% activation sparsity. The researchers further strengthened ReLUfication by mixing diverse pre-training data, including code, web content, and mathematical datasets.
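Based on that description, a minimal sketch contrasting dReLU with SwiGLU might look as follows; the shapes are arbitrary, and the ~90% figure comes from models actually pre-trained with dReLU in the paper, not from this random-weight toy:

```python
# dReLU applies ReLU to both the gate and the up projection, so the product is
# zero whenever either side is non-positive; SwiGLU is almost never exactly zero.
import torch
import torch.nn.functional as F

def swiglu(x, W_gate, W_up):
    return F.silu(x @ W_gate) * (x @ W_up)        # dense activations

def drelu(x, W_gate, W_up):
    return F.relu(x @ W_gate) * F.relu(x @ W_up)  # sparse activations

d_model, d_ff = 64, 256
x = torch.randn(16, d_model)
W_gate, W_up = torch.randn(d_model, d_ff), torch.randn(d_model, d_ff)

h = drelu(x, W_gate, W_up)
sparsity = (h == 0).float().mean().item()
# With random weights roughly 75% of entries are zero here; the paper reports
# close to 90% after pre-training with dReLU.
print(f"fraction of zero activations: {sparsity:.2%}")
```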
The researchers also analyzed the sparsity of MoE-based LLMs and found that the experts' feed-forward networks exhibit activation sparsity similar to that of dense LLMs. This suggests that combining MoE routing with ReLU-induced sparsity could yield additional efficiency gains.
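One way such an analysis could be carried out (a hypothetical sketch using forward hooks; the module-name filter, the near-zero threshold, and the `model`/`dataloader` objects are assumptions rather than the paper's actual measurement code):

```python
# Probe the fraction of near-zero activations inside each expert's activation module.
import torch

def measure_expert_sparsity(model, dataloader, threshold=1e-2):
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            near_zero = (output.abs() < threshold).float().mean().item()
            total, count = stats.get(name, (0.0, 0))
            stats[name] = (total + near_zero, count + 1)
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if "experts" in name and name.endswith("act_fn")  # assumed per-expert activation path
    ]
    with torch.no_grad():
        for batch in dataloader:
            model(**batch)
    for h in handles:
        h.remove()
    return {name: total / count for name, (total, count) in stats.items()}
```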
To validate this hypothesis, the team built TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B, applying their methodology to Mistral-7B and Mixtral-8x7B, respectively. The enhanced models consistently outperformed their original versions while reaching up to 97% activation sparsity, which substantially reduces the computation needed during inference. Combined with the PowerInfer inference engine, they achieved an average 2.83x speedup on generation tasks, improving both model quality and efficiency.
In summary, the team’s main contributions are the dReLU activation function, which raises activation sparsity, and the TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B models, which outperform their original counterparts in both quality and inference speed. Evaluations show a practical inference speedup of 2-5x, and TurboSparse-Mixtral-47B can generate up to about 10 tokens per second without a GPU.