Mixture-of-experts (MoE) architectures, designed to scale model size while keeping training and inference efficient, are difficult to optimize because their routing decisions are discrete and non-differentiable. Traditional MoEs use a router network that directs input tokens to expert modules, a process that complicates training and can lead to instability and under-specialization of the experts. Recently, researchers from Princeton University and Meta AI introduced Lory, a fully differentiable MoE approach designed for autoregressive language model pre-training.
Lory relies on two major techniques: causal segment routing and similarity-based data batching. The first, causal segment routing, splits the input sequence of tokens into fixed-length segments and, for each segment, merges expert parameters using router weights computed from the preceding segment; this keeps the expert merging operation efficient while preserving the autoregressive nature of language models. Segment-level routing on its own, however, can leave experts insufficiently specialized, a challenge Lory addresses with its second technique: similarity-based data batching. This method constructs training sequences by grouping similar documents together, so that consecutive segments are topically coherent and give the router a meaningful signal for expert specialization.
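To make the causal segment routing idea concrete, here is a minimal, hedged sketch in a PyTorch style. It is not the authors' implementation: the class and function names (`SoftMergedMoE`, `causal_segment_forward`), the feed-forward expert structure, the mean-pooled router input, and the uniform merge used for the first segment are all illustrative assumptions. The sketch only shows the core mechanism described above: experts are combined by a differentiable weighted average of their parameters, and the router weights applied to segment *t* are computed from segment *t − 1*, so no token routes itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMergedMoE(nn.Module):
    """Illustrative soft MoE layer: experts are combined by weighted-averaging
    their parameters, keeping the layer fully differentiable (no hard routing)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Experts are simple feed-forward blocks stored as stacked weights,
        # so a weighted average over experts is a single einsum.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def merge(self, gate: torch.Tensor):
        # gate: (batch, n_experts) -> one merged expert per sequence in the batch.
        w_in = torch.einsum("be,edf->bdf", gate, self.w_in)
        w_out = torch.einsum("be,efd->bfd", gate, self.w_out)
        return w_in, w_out

    def forward(self, x: torch.Tensor, gate: torch.Tensor):
        # x: (batch, seg_len, d_model); gate comes from the *previous* segment.
        w_in, w_out = self.merge(gate)
        h = F.gelu(torch.einsum("bsd,bdf->bsf", x, w_in))
        return torch.einsum("bsf,bfd->bsd", h, w_out)

    def gate_from_segment(self, segment: torch.Tensor):
        # Pool the segment's hidden states, then compute soft routing weights.
        return F.softmax(self.router(segment.mean(dim=1)), dim=-1)


def causal_segment_forward(layer: SoftMergedMoE, x: torch.Tensor, seg_len: int):
    """Process segment t with expert weights merged from the router's output on
    segment t-1, so routing never depends on the tokens it is applied to."""
    batch, seq_len, _ = x.shape
    n_experts = layer.router.out_features
    # Simplification: uniform merge for the first segment, which has no
    # preceding context to route on.
    gate = torch.full((batch, n_experts), 1.0 / n_experts, device=x.device)
    outputs = []
    for start in range(0, seq_len, seg_len):
        segment = x[:, start:start + seg_len]
        outputs.append(layer(segment, gate))
        # Router weights from the current segment are used for the NEXT one.
        gate = layer.gate_from_segment(segment)
    return torch.cat(outputs, dim=1)
```

Because the merge is a smooth weighted average rather than a discrete dispatch, gradients flow through the router and the experts end to end, which is what allows this style of MoE to be trained without the usual load-balancing tricks. Similarity-based data batching then ensures that the documents filling consecutive segments are related enough for these per-segment routing weights to drive genuine expert specialization.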
Lory's methods yielded marked improvements across several dimensions. In terms of training efficiency and convergence, Lory reached an equivalent loss level with less than half of the training tokens for the 0.3B and 1.5B models, indicating substantially more efficient use of training compute. On language modeling, Lory surpassed parameter-matched dense models, achieving lower perplexity. It also delivered gains on downstream tasks spanning common sense reasoning, reading comprehension, and text classification.
Overall, Lory demonstrates that better-optimized MoE architectures can yield significant advances in autoregressive language model pre-training. Building on the success of these two techniques, future work aims to scale the model further, combine token-level and segment-level routing, and develop efficient decoding methods for Lory. Such advances hold considerable promise for the field of MoEs and for deepening our understanding of language model pre-training.