Large Language Models (LLMs) and Large Multi-modal Models (LMMs) are effective across various domains and tasks, but scaling up these models comes with significant computational costs and inference speed limitations. Sparse Mixture of Experts (SMoE) can help overcome these challenges by enabling model scalability while reducing computational costs. However, SMoE struggles with low expert activation and limited fine-grained analytical capability on individual tokens, which affects its effectiveness and scalability.
SMoE can enhance model capacity while keeping computational demand constant, offering superior performance compared to densely activated models. Each Mixture-of-Experts (MoE) layer uses N independent Feed-Forward Networks (FFNs) as experts, along with a gating function that distributes weights over the experts' outputs. A routing mechanism selects the top-k experts, where k ≪ N, to facilitate data and expert parallelism, as illustrated in the sketch below. Higher k values typically improve model performance, but they can also reduce training efficiency.
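To make the routing concrete, here is a minimal sketch of a top-k SMoE layer in PyTorch. The class name `TopKSMoE`, the FFN expert shape, and the loop-based dispatch are illustrative assumptions for clarity, not the implementation used in the paper.

```python
# Hypothetical sketch of a top-k SMoE layer; shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        # N independent FFN experts per MoE layer
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # Gating function producing one score per expert
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.gate(x)                   # (batch, seq, N)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # weights over the k selected experts
        out = torch.zeros_like(x)
        # Route each token only through its top-k experts (k << N)
        for slot in range(self.k):
            idx = top_idx[..., slot]            # (batch, seq)
            w = weights[..., slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = (idx == e)
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

Because only k of the N experts run per token, capacity grows with N while per-token compute stays roughly constant, which is the trade-off the paragraph above describes.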
Researchers from Tsinghua University and Microsoft Research have developed the Multi-Head Mixture-of-Experts (MH-MoE). Unlike SMoE, MH-MoE employs a multi-head mechanism to divide each input token into multiple sub-tokens and distribute them across different experts. This results in denser expert activation without raising computational or parameter complexity.
The MH-MoE architecture addresses the issues of low expert activation and token ambiguity by splitting tokens into sub-tokens and directing them to different experts through a multi-head mechanism. Each MoE layer contains a set of N experts, with a multi-head layer projecting the inputs. This is followed by a token-splitting step and a gating function that directs sub-tokens to experts, with a top-k routing mechanism activating the highest-scoring experts. The Token-Splitting-Merging (TSM) operation increases the volume of data directed to specific experts, improving expert activation and fine-grained understanding, as sketched after this paragraph.
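As a rough illustration of this splitting-and-merging flow, the hedged sketch below wraps the earlier `TopKSMoE` routing: tokens are projected by a multi-head layer, split into sub-tokens, routed through the experts, and merged back. The layer names, projections, and shapes are assumptions for exposition, not the authors' code.

```python
# Hypothetical sketch of the MH-MoE token-splitting-and-merging flow.
# Reuses the TopKSMoE class from the sketch above as the routing/expert stage.
import torch.nn as nn

class MHMoE(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_experts: int, k: int = 2):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_sub = d_model // num_heads
        self.multi_head = nn.Linear(d_model, d_model)    # multi-head projection of inputs
        self.moe = TopKSMoE(self.d_sub, num_experts, k)  # experts operate on sub-tokens
        self.merge = nn.Linear(d_model, d_model)         # merge projection

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, s, d = x.shape
        x = self.multi_head(x)
        # Token splitting: each token becomes h sub-tokens of dimension d/h,
        # so the gating function sees h times more (smaller) routing units.
        sub = x.reshape(b, s * self.h, self.d_sub)
        sub = self.moe(sub)                              # sub-tokens routed to different experts
        # Token merging: sub-tokens are reassembled into the original token layout.
        x = sub.reshape(b, s, d)
        return self.merge(x)
```

In this sketch, denser expert activation comes from routing many small sub-tokens instead of few full tokens, while the sub-token dimension shrinks proportionally, which is consistent with the claim that computational and parameter complexity is not raised.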
Validation of the MH-MoE model shows lower perplexity across varying expert settings, indicating more effective learning. Perplexity decreases as the number of experts increases, suggesting enhanced representation learning capabilities. Evaluation across different pre-training tasks further validates the effectiveness of MH-MoE, which outperforms other models in English-focused language modeling, multi-lingual language modeling, and masked multi-modal modeling tasks.
This research suggests a viable method for achieving denser expert activation without introducing additional costs while enhancing fine-grained understanding ability. The proposed MH-MoE integrates smoothly with other SMoE frameworks, making it straightforward to improve their performance. The results across the three tasks validate the effectiveness of MH-MoE in achieving these objectives, and the study could represent a significant step forward in advancing AI models.