Universal Transformers (UTs) are important in machine learning applications such as language modeling and image processing, but they suffer from a parameter-compute efficiency problem. Sharing parameters across layers reduces the model's parameter count, and recovering that capacity by widening the shared layers drives up the computational cost substantially. As a result, UTs are poorly suited to parameter-dominated tasks such as modern language modeling.
However, a team of researchers from Stanford University, The Swiss AI Lab IDSIA, Harvard University, and KAUST have developed a variant of the UT that addresses this problem: the Mixture-of-Experts Universal Transformer (MoEUT). It uses a mixture-of-experts architecture to improve memory and computational efficiency.
MoEUT incorporates two innovations: layer grouping and a peri-layernorm scheme. To address the UT's unfavorable parameter-to-compute ratio, it combines shared layer parameters with mixture-of-experts layers, pairing recent MoE advances with recurrently stacked groups of layers and with layer normalization applied only before linear layers that immediately precede sigmoid or softmax activations.
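The layer-grouping and peri-layernorm ideas can be illustrated with a short PyTorch sketch. This is a minimal illustration rather than the authors' implementation; the module names `SharedLayerGroup` and `PeriNormHead` and their constructor arguments are hypothetical. The first module reuses a small group of distinct layers recurrently; the second shows the peri-layernorm placement, where LayerNorm appears only in front of a linear layer that feeds a softmax (here, the output classifier).

```python
import torch.nn as nn


class SharedLayerGroup(nn.Module):
    """Hypothetical sketch of layer grouping: a small group of G distinct
    layers whose parameters are reused every time the group is repeated."""

    def __init__(self, layers: nn.ModuleList, repeats: int):
        super().__init__()
        self.layers = layers      # G distinct (MoE) transformer layers
        self.repeats = repeats    # the group is stacked `repeats` times

    def forward(self, x):
        for _ in range(self.repeats):   # recurrent stacking of the group
            for layer in self.layers:   # same parameters reused each pass
                x = layer(x)
        return x


class PeriNormHead(nn.Module):
    """Peri-layernorm placement as described above: LayerNorm only in front
    of a linear layer that precedes a softmax, e.g. the output classifier."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.proj(self.norm(x))  # logits, fed to a softmax downstream
```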
In the MoE feedforward blocks, experts are selected dynamically from per-token gating scores, and expert usage is regularized within each sequence. The MoE self-attention layers use SwitchHead, which dynamically selects experts for the value and output projections. Layer grouping reduces computation while allowing more attention heads, and the peri-layernorm scheme improves signal propagation and gradient flow.
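The expert selection in the feedforward blocks can be sketched as follows. This is a simplified PyTorch illustration rather than the authors' code: the class name `MoEFeedForward`, the sigmoid gating with top-k selection, and the einsum-based expert application are assumptions about how such a block might be written, and the within-sequence regularization and the SwitchHead attention mentioned above are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Illustrative top-k mixture-of-experts feedforward block (assumed
    structure, not the paper's implementation)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # expert scores
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = torch.sigmoid(self.gate(x))   # per-token expert scores
        topk, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):             # loop over the k chosen experts
            e = idx[..., slot]                 # (batch, seq) expert indices
            w1 = self.w_in[e]                  # gather selected expert weights
            w2 = self.w_out[e]
            h = F.relu(torch.einsum("bsd,bsdf->bsf", x, w1))
            out = out + topk[..., slot: slot + 1] * torch.einsum(
                "bsf,bsfd->bsd", h, w2)
        return out
```

In practice such a block would use batched or fused kernels rather than a Python loop over the k selected experts; the loop is kept here only for readability.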
In a series of experiments, the researchers found that MoEUT outperforms standard Transformers on several language modeling datasets while consuming fewer resources, and that it compares favorably with Sparse Universal Transformers. In ablations, the peri-layernorm scheme performed best, particularly for smaller models, and the trends suggest further gains with additional training.
In conclusion, MoEUT is a mixture-of-experts-based UT architecture that substantially reduces compute and memory requirements, making it practical for parameter-dominated tasks such as language modeling. The researchers argue that the approach could revive research on large-scale Universal Transformers, and they encourage readers to consult the published paper and the GitHub repository for further details.