
IEIT Systems introduces Yuan 2.0-M32, a bilingual Mixture-of-Experts (MoE) language model built on the Yuan 2.0 architecture and featuring a novel Attention Router.

A research team from IEIT Systems has recently developed Yuan 2.0-M32, a model based on the Mixture of Experts (MoE) architecture. It is built on the same foundation as Yuan 2.0-2B but employs 32 experts, only two of which are active for any given token, giving it an efficient computational structure (a sketch of this top-2 routing pattern appears below).
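To make the "32 experts, 2 active" pattern concrete, here is a minimal, hedged sketch of a top-2-of-32 MoE feed-forward layer with a classical linear gate. The hidden sizes and layer shapes are illustrative assumptions, not the actual Yuan 2.0-M32 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Illustrative top-2-of-32 MoE feed-forward layer (not the Yuan 2.0-M32 config)."""

    def __init__(self, d_model=2048, d_ff=8192, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Classical router: a single linear gate produces one score per expert.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x)                  # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run, so compute scales with top_k, not num_experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Because only two of the 32 experts execute per token, the per-token compute stays close to that of a much smaller dense model even though the total parameter count is large.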

In contrast to traditional router networks, Yuan 2.0-M32 uses a distinctive Attention Router network, which improves expert selection and overall accuracy. Although the model was trained on 2,000 billion tokens, its training computation was only 9.25% of that of a dense model at a comparable parameter scale.

Yuan 2.0-M32 demonstrates strong performance across various fields, including mathematics and coding. The model uses just 7.4 GFLOPs of forward computation per token, with 3.7 billion active parameters out of 40 billion in total, roughly 1/19th of the per-token requirement of Llama3-70B, which points to outstanding efficiency.
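The 1/19 figure can be sanity-checked with the common approximation that a dense transformer needs roughly 2 FLOPs per parameter per token in the forward pass; this is a back-of-the-envelope estimate, not a figure from the paper.

```python
# Rough sanity check: forward FLOPs/token ≈ 2 × (active) parameters.
yuan_m32_flops = 7.4e9        # reported: 7.4 GFLOPs per token
llama3_70b_flops = 2 * 70e9   # ≈ 140 GFLOPs per token for a dense 70B model

print(llama3_70b_flops / yuan_m32_flops)  # ≈ 18.9, i.e. roughly 1/19
```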

In addition, Yuan 2.0-M32 excels on benchmarks, outperforming Llama3-70B with scores of 55.89 on MATH and 95.8 on ARC-Challenge, achieved with fewer active parameters and a smaller computational footprint.

The adoption of the Attention Router is a key development in Yuan 2.0-M32: it improves precision and performance by sharpening the selection process and focusing on the most relevant experts for each token. This approach to expert selection holds potential for higher accuracy and efficiency in MoE models; a sketch of the general idea follows.
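Below is a hedged sketch of what an attention-style router can look like: instead of scoring each expert independently with a single linear gate, the token is projected into query, key, and value views over the expert axis so that the routing score for one expert can depend on the others. The class name, projection shapes, and the `d_route` dimension are illustrative assumptions, not the paper's exact formulation; consult the published paper for the precise equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRouterSketch(nn.Module):
    """Illustrative attention-style router for MoE gating (reconstruction, not the official code)."""

    def __init__(self, d_model=2048, num_experts=32, d_route=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.num_experts, self.d_route = num_experts, d_route
        # Three projections of the token into an expert-indexed routing space.
        self.wq = nn.Linear(d_model, num_experts * d_route, bias=False)
        self.wk = nn.Linear(d_model, num_experts * d_route, bias=False)
        self.wv = nn.Linear(d_model, num_experts * d_route, bias=False)

    def forward(self, x):                              # x: (tokens, d_model)
        t = x.shape[0]
        q = self.wq(x).view(t, self.num_experts, self.d_route)
        k = self.wk(x).view(t, self.num_experts, self.d_route)
        v = self.wv(x).view(t, self.num_experts, self.d_route)
        # Attention over the expert axis lets each expert's score depend on the others,
        # unlike a classical gate that scores every expert in isolation.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_route ** 0.5, dim=-1)
        expert_scores = (attn @ v).sum(dim=-1)         # (tokens, num_experts)
        top_vals, top_idx = expert_scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)
        return weights, top_idx                        # gating weights and chosen experts
```

The design intuition, as described in the article, is that modeling correlations among experts during routing leads to better expert selection than treating each expert's score independently.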

Summarizing their main contributions, the team states that they presented the Attention Router, achieving a significant improvement in accuracy over traditional routing techniques, and built the Yuan 2.0-M32 model with 40 billion total parameters, of which only 3.7 billion are active. Training uses just 1/16 of the computing power required by a dense model with a comparable number of parameters, keeping the model efficient and cost-effective both during training and in real-world use.

The full research is available in the published paper, and the Yuan 2.0-M32 model is available on GitHub. All credit for this research belongs to the researchers of this project. For project updates, you can follow them on Twitter and join their Telegram Channel, Discord Channel, and LinkedIn Group. Additional resources include their newsletter, their 43k+ ML SubReddit, and their AI Events Platform.
