Sparse Mixture of Experts (SMoE) models have become a popular way to scale model capacity, particularly in memory-constrained settings. They are central to architectures such as the Switch Transformer and Universal Transformers, enabling efficient training and inference. However, current SMoE implementations have limitations: they map poorly onto GPU parallelism, and early TPU deployments struggled with the variable number of tokens routed to each expert, which caused memory allocation problems.
Existing solutions such as MegaBlocks and PIT address these issues by framing SMoE computation as a sparse matrix multiplication problem, which allows more efficient GPU kernels. However, this framing introduces problems of its own: the input must first be copied and then padded into the sparse format, adding memory overhead during training, and the conversion itself adds computation and obscures the intermediate representations, making the approach difficult to extend beyond SMoE MLPs.
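To make the copy-and-pad overhead concrete, here is a minimal, illustrative PyTorch sketch of the conventional dispatch step, not code from MegaBlocks or PIT. The function name `padded_expert_dispatch` and its signature are hypothetical; the point is that tokens are copied into per-expert buffers and padded to a common length before any expert computation runs.

```python
# Illustrative sketch (hypothetical helper, not MegaBlocks/PIT code): tokens are
# copied into per-expert buffers padded to the largest group, so both the copy
# and the padding consume extra memory before the expert matmuls even start.
import torch

def padded_expert_dispatch(x, expert_idx, num_experts):
    """x: (tokens, d_model); expert_idx: (tokens,) expert assignment per token."""
    counts = torch.bincount(expert_idx, minlength=num_experts)
    capacity = int(counts.max())                            # pad every expert to the largest group
    d_model = x.size(-1)
    buffers = x.new_zeros(num_experts, capacity, d_model)   # padded copy of the input
    for e in range(num_experts):
        tokens = x[expert_idx == e]                         # gather (copies the tokens)
        buffers[e, : tokens.size(0)] = tokens
    return buffers                                          # per-expert batched matmuls run on this copy

# Example: 8 tokens of width 4 routed across 3 experts.
x = torch.randn(8, 4)
expert_idx = torch.tensor([0, 2, 1, 0, 2, 2, 1, 0])
buffers = padded_expert_dispatch(x, expert_idx, num_experts=3)
print(buffers.shape)  # torch.Size([3, 3, 4]) -- padded to the largest expert group
```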
Researchers from IBM, Mila, and the University of Montreal have presented ScatterMoE, a more efficient SMoE implementation that minimizes the memory footprint. Its core primitive, ParallelLinear, performs grouped matrix operations on scattered groups of tokens, bypassing the need for extra copying and padding. Because intermediate representations are exposed as standard PyTorch tensors, the implementation is also straightforward to extend to other expert modules.
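The following sketch conveys the idea behind ParallelLinear in plain PyTorch; it is not the actual ScatterMoE kernel, which fuses this pattern into Triton grouped-GEMM code, and the function name `scattered_group_linear` is assumed for illustration. Tokens are grouped by expert via an index ordering, and each expert's matmul writes directly into a single output tensor, so no padded copy is materialized and the result is an ordinary tensor that downstream expert modules can consume.

```python
# Minimal sketch of grouped matrix operations on scattered groups (illustrative only).
import torch

def scattered_group_linear(x, expert_idx, weights):
    """x: (tokens, d_in); expert_idx: (tokens,); weights: (num_experts, d_in, d_out)."""
    order = torch.argsort(expert_idx)                  # group token indices by expert
    counts = torch.bincount(expert_idx, minlength=weights.size(0))
    out = x.new_empty(x.size(0), weights.size(-1))     # one output tensor, no padding
    start = 0
    for e, n in enumerate(counts.tolist()):
        if n == 0:
            continue
        rows = order[start : start + n]                # this expert's token positions
        out[rows] = x[rows] @ weights[e]               # scatter the expert's output in place
        start += n
    return out                                         # standard PyTorch tensor

x = torch.randn(8, 4)
expert_idx = torch.tensor([0, 2, 1, 0, 2, 2, 1, 0])
weights = torch.randn(3, 4, 6)
y = scattered_group_linear(x, expert_idx, weights)
print(y.shape)  # torch.Size([8, 6]) -- same leading size as the input, no padded buffer
```

Keeping the output as a regular tensor of the same leading size as the input is what makes it easy to compose ParallelLinear-style operations into expert modules other than MLPs, such as the Mixture of Attention discussed below.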
Compared with MegaBlocks, ScatterMoE delivers a 38.1% increase in overall throughput. Its advantage grows as granularity increases, and it remains more efficient than a dense MLP even as sparsity decreases. It also consistently outperforms MegaBlocks in the Mixture of Attention implementation, particularly in high-granularity settings.
In summary, ScatterMoE improves on existing SMoE implementations by reducing the memory footprint and speeding up both training and inference. Built on ParallelLinear, it achieves higher throughput and lower memory usage than MegaBlocks, and its design makes it easy to extend Mixture-of-Experts concepts, as demonstrated by its Mixture of Attention implementation. These properties make it a practical building block for training and serving large deep learning models.