With the increasing adoption of Large Language Models (LLMs) and the ongoing search for efficient ways to run them on consumer hardware, a promising strategy has emerged: sparse Mixture-of-Experts (MoE) architectures. These models generate tokens faster than their dense counterparts because they activate only a small subset of their parameters, the so-called "experts", for each input token. Unfortunately, housing many experts inflates the overall model size, making the latest MoE language models too large to run without high-end GPUs.
This paper dives deep into the problem of running large MoE language models on consumer hardware. The authors build upon parameter offloading algorithms and introduce a novel strategy that capitalizes on the inherent properties of MoE LLMs. The two main avenues for running these models on more affordable hardware are compressing model parameters and offloading them to cheaper storage, such as RAM or an SSD. Both strategies are aimed at inference rather than training.
The concept of parameter offloading involves keeping model parameters in cheaper memory and loading them onto the GPU just in time, as they are needed for computation. This approach works particularly well for deep learning models with a fixed layer order, since the next layer's parameters can be prefetched in the background while the current layer is still computing.
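To make the mechanics concrete, here is a minimal sketch of this just-in-time loading pattern in PyTorch. It is not the authors' implementation: the layer representation (`cpu_weights`, plain weight matrices kept in pinned CPU memory) and the `apply_layer` stand-in are simplifying assumptions, but the double-buffering of transfers on a side CUDA stream is the core idea.

```python
# Sketch of just-in-time parameter offloading with background prefetching.
# `cpu_weights` is a hypothetical list of per-layer weight tensors kept in
# pinned CPU memory; `apply_layer` stands in for a real layer's forward pass.
import torch

def apply_layer(x, w):
    # Placeholder layer: a single linear transform with a ReLU.
    return torch.relu(x @ w.T)

def offloaded_forward(cpu_weights, x, device="cuda"):
    copy_stream = torch.cuda.Stream()
    # Load the first layer's weights up front.
    current = cpu_weights[0].to(device, non_blocking=True)

    for i in range(len(cpu_weights)):
        prefetched = None
        if i + 1 < len(cpu_weights):
            # Copy the next layer's weights on a side stream so the transfer
            # overlaps with this layer's computation.
            with torch.cuda.stream(copy_stream):
                prefetched = cpu_weights[i + 1].to(device, non_blocking=True)

        x = apply_layer(x, current)

        # Make sure the prefetch finished before the next iteration uses it.
        torch.cuda.current_stream().wait_stream(copy_stream)
        current = prefetched

    return x

# Usage: three toy "layers" of pinned CPU weights for a 256-dim model.
weights = [torch.randn(256, 256).pin_memory() for _ in range(3)]
out = offloaded_forward(weights, torch.randn(1, 256, device="cuda"))
```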
The MoE architecture itself is also revisited: it builds on the older idea of training an ensemble of specialized models ("experts") together with a gating function that selects the appropriate expert for a given input. The authors introduce Expert Locality and LRU Caching to exploit the observation that the gating function tends to reuse the same experts across consecutive tokens, so recently active experts are kept in GPU memory as a "cache" for future tokens, resulting in a significant inference speedup for modern MoE models.
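The caching logic itself can be kept simple. The sketch below shows one possible LRU cache over experts, assuming a hypothetical `load_expert_to_gpu` callable that fetches an expert's weights from RAM or SSD; the paper's actual system adds per-layer bookkeeping on top of this idea.

```python
# Minimal sketch of an LRU cache for MoE experts kept in GPU memory.
from collections import OrderedDict

class ExpertLRUCache:
    def __init__(self, capacity, load_expert_to_gpu):
        self.capacity = capacity        # number of experts kept on the GPU
        self.load = load_expert_to_gpu  # callable: expert index -> GPU weights
        self.cache = OrderedDict()      # expert index -> GPU weights

    def get(self, idx):
        if idx in self.cache:
            # Cache hit: mark this expert as most recently used.
            self.cache.move_to_end(idx)
            return self.cache[idx]
        # Cache miss: evict the least recently used expert if the cache is
        # full, then load the requested expert from slower memory.
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        weights = self.load(idx)
        self.cache[idx] = weights
        return weights
```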
To further hide expert loading time, the authors propose Speculative Expert Loading. By applying the next layer's gating function to the current layer's hidden states, the system guesses which experts the next layer is likely to need and starts loading them ahead of time, significantly speeding up inference. MoE Quantization is also explored, with Half-Quadratic Quantization (HQQ) chosen for its data-free quantization capabilities, achieving better quality-size trade-offs when quantizing experts to a lower bitwidth.
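The speculative step can be pictured as follows. This is a rough sketch under stated assumptions: the router is treated as a plain linear layer (`next_router_weight`), the gate picks the top-2 experts, and the expert cache exposes a hypothetical `prefetch` method; none of these names come from the paper's code.

```python
# Sketch of speculative expert loading: use the *next* layer's router on the
# current hidden states to guess which experts to start loading early.
import torch

def speculative_prefetch(hidden_states, next_router_weight, cache, top_k=2):
    # Router logits for the next layer, computed from this layer's output.
    router_logits = hidden_states @ next_router_weight.T
    guessed = torch.topk(router_logits, k=top_k, dim=-1).indices.unique()
    for expert_idx in guessed.tolist():
        # Begin copying the guessed experts to GPU in the background; if the
        # guess is right, the next MoE layer finds them already cached.
        cache.prefetch(expert_idx)  # hypothetical cache method
```

If a guess turns out to be wrong, the correct expert is simply loaded on demand as before, so speculation can only cost some extra memory bandwidth, never correctness.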
The paper concludes with an evaluation of the proposed strategies using the Mixtral-8x7B and Mixtral-8x7B-Instruct models. Results indicate a significant increase in generation speed on consumer-grade hardware, making large MoE models more accessible for research and development.