In transformer architectures, the computational cost and activation memory of feedforward (FFW) layers grow linearly with their hidden layer width. This creates a significant bottleneck for scaling, especially as models become larger and more complex, and it complicates the deployment of large-scale models in real-world applications such as language modeling and other natural language processing tasks.
Mixture-of-Experts (MoE) architectures have previously been used to address these challenges. They replace a single dense FFW layer with sparsely activated expert modules, decoupling the model's size from its computational cost. However, MoEs face their own difficulties when scaling beyond a small number of experts: efficiency gains often plateau as model size grows under a fixed number of training tokens, limiting the potential of the approach.
Researchers from Google DeepMind have proposed a new approach called Parameter Efficient Expert Retrieval (PEER) to address these limitations. PEER uses the product key technique to efficiently retrieve expert knowledge from a large pool of tiny experts. This keeps retrieval computationally cheap and yields a favorable performance-compute trade-off, allowing the method to handle over a million experts and significantly extending the effectiveness of MoE models. The approach has the potential to advance AI research by enabling efficient and robust language modeling at scale.
PEER operates by mapping an input vector to a query vector and comparing it against a set of product keys to retrieve the top-k experts. The selected experts then contribute to the final output through a weighted combination based on their router scores. Because the product keys are factored into two smaller sets of sub-keys, the cost of expert retrieval grows with the square root of the number of experts rather than with the number of experts itself, which is what makes retrieval from over a million experts achievable.
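To make the mechanism concrete, here is a minimal PyTorch sketch of product-key expert retrieval in the spirit of PEER. It is not the authors' reference implementation: the class name, hyperparameters (such as the default expert count and top-k), and the single-neuron expert parameterization are illustrative assumptions based on the description above.

```python
# Hypothetical sketch of a PEER-style layer: product-key retrieval over a
# pool of tiny (single-neuron) experts, combined with softmax router weights.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class PEERLayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 256 ** 2, top_k: int = 16):
        super().__init__()
        # Experts live on a sqrt(N) x sqrt(N) grid indexed by pairs of sub-keys.
        # (The paper scales this to over a million experts; a smaller default
        # is used here so the sketch runs on modest hardware.)
        self.n_side = int(math.isqrt(num_experts))
        assert self.n_side * self.n_side == num_experts, "num_experts must be a perfect square"
        self.top_k = top_k
        d_half = d_model // 2

        # Query network and two sub-key tables, each of size sqrt(N).
        self.query = nn.Linear(d_model, d_model)
        self.sub_keys_1 = nn.Parameter(torch.randn(self.n_side, d_half) * 0.02)
        self.sub_keys_2 = nn.Parameter(torch.randn(self.n_side, d_half) * 0.02)

        # Each tiny expert is a single neuron: one down- and one up-projection vector.
        self.expert_down = nn.Embedding(num_experts, d_model)
        self.expert_up = nn.Embedding(num_experts, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        q = self.query(x)
        q1, q2 = q.chunk(2, dim=-1)

        # Score against the two sub-key sets: 2*sqrt(N) comparisons instead of N.
        s1 = q1 @ self.sub_keys_1.t()          # (batch, sqrt(N))
        s2 = q2 @ self.sub_keys_2.t()          # (batch, sqrt(N))
        v1, i1 = s1.topk(self.top_k, dim=-1)   # candidate halves from each table
        v2, i2 = s2.topk(self.top_k, dim=-1)

        # Combine the k x k candidate pairs and keep the overall top-k experts.
        cand = v1.unsqueeze(-1) + v2.unsqueeze(-2)              # (batch, k, k)
        scores, flat = cand.flatten(1).topk(self.top_k, dim=-1)
        row = torch.gather(i1, 1, flat // self.top_k)
        col = torch.gather(i2, 1, flat % self.top_k)
        expert_idx = row * self.n_side + col                    # (batch, k)

        # Retrieve the selected experts and mix them with router (softmax) weights.
        w = F.softmax(scores, dim=-1)                           # (batch, k)
        down = self.expert_down(expert_idx)                     # (batch, k, d_model)
        up = self.expert_up(expert_idx)                         # (batch, k, d_model)
        h = F.gelu(torch.einsum("bd,bkd->bk", x, down))         # per-expert activation
        return torch.einsum("bk,bk,bkd->bd", w, h, up)          # weighted combination
```

The key design point the sketch illustrates is that the full key for each expert is never materialized: scoring two sub-key tables and combining their top candidates recovers the best-matching experts at a cost proportional to the square root of the pool size.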
Experiments with the PEER layer demonstrated a substantial improvement in the performance-compute trade-off compared with dense FFW layers and coarse-grained MoEs across several language modeling datasets. PEER models reached notably lower perplexity, indicating the architecture's effectiveness in improving the scalability and performance of transformer models.
In conclusion, the PEER architecture is a significant advance: it addresses the computational challenges of scaling transformer models by efficiently routing across a vast number of tiny experts, and it demonstrates a superior performance-compute trade-off. The results further suggest that PEER can handle continuous data streams, pointing toward cutting-edge AI applications that require extensive and continual learning.