The advancement of natural language processing (NLP) has depended, to a large extent, on the development of large language models (LLMs). Although these models deliver high performance, they demand immense computational resources and incur correspondingly high costs, making them hard to scale without substantial expense.
These challenges create a need for more resource-efficient training methods that balance computational feasibility with the capacity to handle complex NLP tasks. Traditionally, LLMs have been trained as dense models, which activate every parameter for each input token and therefore carry a considerable computational load.
Sparse models, specifically the Mixture-of-Experts (MoE), have emerged as a promising alternative. They distribute computation across many specialized sub-models, known as “experts”. Rather than activating all parameters for every input token, an MoE model routes each token to only a subset of the experts, matching or even exceeding the performance of dense models while using fewer resources.
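To make the routing idea concrete, the sketch below shows a minimal top-k MoE layer in PyTorch: a gating network scores every token against every expert, and only the top-scoring experts are run for each token. The hidden size, number of experts, and top-k value are illustrative assumptions, not Skywork-MoE's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal sketch of a Mixture-of-Experts layer with top-k routing.
    Hyperparameters here are illustrative, not Skywork-MoE's actual config."""

    def __init__(self, hidden_size=512, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )
        # The gate scores every token against every expert.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):                      # x: (num_tokens, hidden_size)
        logits = self.gate(x)                  # (num_tokens, num_experts)
        weights = F.softmax(logits, dim=-1)
        # Keep only the k highest-scoring experts per token.
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only `top_k` experts run for each token, the per-token compute stays close to that of a much smaller dense model even as the total parameter count grows with the number of experts.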
The research team at Kunlun Inc., known as Skywork, contributed to this emerging field by introducing Skywork-MoE, an efficient MoE large language model with 146 billion parameters and 16 experts. It builds on the architecture of the previously released Skywork-13B but adds two novel training techniques, gating logit normalization and adaptive auxiliary loss coefficients, to improve efficiency and performance.
Skywork-MoE was initialized from dense checkpoints of Skywork-13B and then trained on an additional 2 trillion tokens. Gating logit normalization normalizes the gating layer's outputs before the softmax is applied, which sharpens the gate's output distribution and encourages expert diversification. Adaptive auxiliary loss coefficients, in turn, let each layer tune its load-balancing loss independently, keeping the load spread across the experts so that no single expert becomes overloaded.
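One way to sketch gating logit normalization in PyTorch is to standardize the gate's logits per token and rescale them with a tunable factor before the softmax; the standardization step and the value of `scale` below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def normalized_gate_probs(logits, scale=1.0, eps=1e-6):
    """Sketch of gating logit normalization: the gate's raw logits are
    normalized (here, standardized per token) and rescaled before softmax.

    logits: (num_tokens, num_experts) raw outputs of the gating layer.
    """
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    normalized = scale * (logits - mean) / (std + eps)
    # A larger `scale` sharpens the softmax, making the gate more decisive
    # about which experts a token is routed to.
    return F.softmax(normalized, dim=-1)

# Example: the same logits with and without normalization.
raw = torch.tensor([[2.1, 2.0, 1.9, 1.8]])
print(F.softmax(raw, dim=-1))                 # nearly uniform gate weights
print(normalized_gate_probs(raw, scale=4.0))  # markedly sharper distribution
```

As the example suggests, without normalization the gate's weights can end up close to uniform, which blurs the distinction between experts; normalizing and rescaling the logits gives a sharper routing distribution, which is the effect the technique is intended to produce.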
The performance evaluation of Skywork-MoE across various benchmarks indicates robust results. Among other findings, it outperformed models such as Llama2-70B and Mixtral 8x7B on mathematical reasoning tasks and surpassed all dense models on code synthesis tasks, although it fell slightly behind the Deepseek-V2 model.
In summary, Skywork-MoE represents a significant step forward for NLP and directly addresses the challenges of resource-intensive LLM training. By combining gating logit normalization with adaptive auxiliary loss coefficients, it reduces computational demands while improving performance. The successful development of Skywork-MoE sets a new reference point for the efficiency and effectiveness of MoE models in large-scale language processing tasks.