Neural network models dominate natural language processing and computer vision, yet their initialization schemes and learning rates are often chosen heuristically, which leads to inconsistencies across studies and model sizes. The µ-Parameterization (µP) seeks to address this issue by prescribing scaling rules for model parameters that allow ‘zero-shot’ hyperparameter transfer from small to large models. Despite its potential, µP has yet to be widely adopted, owing to the complexity of its implementation, the many variants involved, and its intricate theoretical foundations.
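As a rough illustration of what those scaling rules look like in practice, the sketch below shows the kind of width-dependent initialization and Adam learning-rate adjustments commonly associated with µP: hidden-weight scales shrink with width while embedding-like parameters keep Θ(1) values. The base values, parameter grouping, and helper name are illustrative assumptions, not the exact rules of any particular µP variant (the treatment of the output/readout layer, in particular, differs across variants).

```python
# Minimal sketch of µP-style width scaling for an Adam-trained transformer.
# Base values and the parameter grouping are illustrative assumptions; exact
# rules (especially for the output/readout layer) differ across µP variants.

def mup_hparams(width, base_width=256, base_lr=1e-2, base_std=0.02):
    """Per-group init std and Adam learning rate as a function of model width."""
    m = width / base_width  # width multiplier relative to the tuned base model
    return {
        # Embedding-like ("input") weights: Θ(1) init and learning rate.
        "embedding": {"init_std": base_std, "lr": base_lr},
        # Hidden matrices: init variance shrinks like 1/width, Adam lr like 1/width.
        "hidden": {"init_std": base_std / m**0.5, "lr": base_lr / m},
    }

# Hyperparameters tuned at width 256 are reused at width 4096 without re-tuning.
print(mup_hparams(width=4096))
```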
To date, empirical evidence on the effectiveness of µP at large scale is scarce, raising concerns about whether hyperparameters are actually preserved and whether µP is compatible with existing techniques such as decoupled weight decay. Recent studies using µP have raised further questions that remain unanswered. The version of µP presented in the “Tensor Programs” series has shown promise for zero-shot hyperparameter transfer, but it has also prompted doubts about its suitability for large-scale transformers.
Hyperparameter tuning with µP has been explored in a handful of recent studies, but evidence of its efficacy for large models remains limited. µ-Transfer, the µP-based recipe of tuning hyperparameters on a small proxy model and carrying them over to the large target, promises to bypass the need for large-scale hyperparameter sweeps. Suggested alternatives include learning-rate scaling laws based on compute budget and architectural adjustments, while more involved options such as Automatic Gradient Descent and hypergradients have been proposed for setting learning rates automatically.
This paper explores µP for transformers scaled along the width dimension, focusing on whether hyperparameters tuned on small models transfer to larger ones. The study also examines simplifying the attention scale and evaluates the effect of this choice on performance and transfer. Throughout, µP is presented as a systematic approach to scaling parameter initialization and learning rates with model width.
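To make the attention-scale question concrete, the sketch below contrasts the standard 1/√d_head scaling of attention logits with the 1/d_head scaling prescribed by µP; the function and argument names are illustrative assumptions rather than code from the paper.

```python
import torch

def attention_logits(q, k, head_dim, mup_scale=True):
    """Dot-product attention logits with either the µP or the standard scale.

    Standard practice scales by 1/sqrt(head_dim); µP prescribes 1/head_dim,
    which keeps logit magnitudes controlled as width grows. Overly large
    scales are the case reported to break transfer.
    """
    scale = 1.0 / head_dim if mup_scale else 1.0 / head_dim**0.5
    return torch.einsum("...qd,...kd->...qk", q, k) * scale
```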
The research investigates RMSNorm’s efficiency to test the impact of trainable scale vectors (‘gains’) on learning rate transferability under µP. Results show that there is unreliable transfer of optimal learning rates with Θ(1) scaling for gains, which can negatively affect the quality of the larger µP models. However, when zero-initialized query projections are used, transfer is enhanced and loss is slightly improved. Despite potential interference, multiplicative nonlinearities allow for transfer.
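The following is a minimal PyTorch sketch of the two architectural knobs discussed above: RMSNorm with an optional trainable gain vector, and a zero-initialized query projection. Class and function names, and the gain-free default, are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm with an optional trainable gain ('scale') vector."""

    def __init__(self, dim, eps=1e-6, trainable_gain=False):
        super().__init__()
        self.eps = eps
        # Per the finding summarized above, trainable Θ(1)-scaled gains can break
        # learning-rate transfer under µP, so gain-free RMSNorm is the safer default here.
        self.gain = nn.Parameter(torch.ones(dim)) if trainable_gain else None

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        y = x * inv_rms
        return y * self.gain if self.gain is not None else y

def make_query_proj(d_model, d_head, n_heads):
    """Query projection with zero-initialized weights (reported to aid transfer)."""
    q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
    nn.init.zeros_(q_proj.weight)  # attention starts out uniform and learns from there
    return q_proj
```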
In conclusion, µP successfully transfers learning rates in most scenarios examined, including across a range of architectural modifications and batch sizes, but it fails to do so with trainable gain parameters or excessively large attention scales. µP-parameterized transformers outperform their standard-parameterization counterparts, with µ-Transfer proving particularly effective at predicting optimal learning rates for the larger models. These findings add to the field's understanding of hyperparameter transfer and may inspire further research.