
Google DeepMind Presents WARP: A Novel Approach to Reinforcement Learning from Human Feedback (RLHF) for Aligning Large Language Models (LLMs) and Improving the KL-Reward Pareto Front.

Reinforcement Learning from Human Feedback (RLHF) aligns large language models (LLMs) by optimizing a reward model trained on human preferences. Yet this raises issues: the policy can drift too far from its pre-trained initialization and forget general knowledge, the LLM can exploit flaws in the reward model (reward hacking), and output diversity can shrink.
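For context, RLHF typically maximizes the learned reward under a KL penalty that keeps the policy close to a reference model; a standard formulation of this objective (the symbols β and π_ref are conventional notation, not quoted from the article) is:

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r(x, y) \big]
\;-\; \beta \,
\mathbb{E}_{x \sim \mathcal{D}}
\Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \;\|\; \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
```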

Researchers at Google DeepMind have proposed a method to address these issues: Weight Averaged Rewarded Policies (WARP). The method builds on weight averaging (WA), a technique in which models are merged at the weight level rather than at the prediction level. WA enhances generalization, reduces variance and memorization, and modifies the loss landscape.
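As a rough illustration, weight averaging combines checkpoints parameter by parameter rather than averaging their outputs. The sketch below assumes PyTorch-style state dicts with identically shaped tensors and is not taken from the paper.

```python
# Minimal sketch of weight averaging (WA): checkpoints are merged
# parameter-by-parameter, not by averaging their predictions.
import torch

def weight_average(state_dicts, coeffs=None):
    """Return a convex combination of identically structured state dicts."""
    n = len(state_dicts)
    coeffs = coeffs if coeffs is not None else [1.0 / n] * n  # default: uniform average
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return merged
```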

WARP applies three variants of WA at different stages of the alignment procedure. First, it uses an exponential moving average (EMA) of the policy's weights as a dynamic anchor for KL regularization. Second, it merges independently fine-tuned policies into a stronger one using spherical linear interpolation (SLERP). Finally, it linearly interpolates between the merged model and the initialization to recover features from pre-training. The whole procedure is then iterated, with each round further improving the KL-reward Pareto front.
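The sketch below illustrates how these three stages could fit together in one WARP iteration. The RL update is a toy stand-in, SLERP is applied to flat weight vectors for simplicity, and the names and hyper-parameters (ema_rate, eta, num_policies) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between two flat weight vectors."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if omega < 1e-8:                       # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def toy_rl_step(theta, anchor, lr=0.1, beta=0.1):
    """Stand-in for a KL-regularized RL update: a noisy ascent step
    pulled back toward the EMA anchor."""
    grad = np.random.randn(*theta.shape)   # pretend reward gradient
    return theta + lr * grad - beta * (theta - anchor)

def warp_iteration(theta_init, num_policies=2, num_steps=50, ema_rate=0.01, eta=0.3):
    finetuned = []
    for _ in range(num_policies):
        theta, anchor = theta_init.copy(), theta_init.copy()
        for _ in range(num_steps):
            theta = toy_rl_step(theta, anchor)                  # stage 1: KL toward the EMA anchor
            anchor = (1 - ema_rate) * anchor + ema_rate * theta
        finetuned.append(theta)

    merged = finetuned[0]                                       # stage 2: SLERP-merge the policies
    for other in finetuned[1:]:
        merged = slerp(merged, other, 0.5)

    # Stage 3: interpolate from the initialization toward the merged model to
    # recover pre-training features; eta trades reward against KL.
    return (1 - eta) * theta_init + eta * merged

theta = np.random.randn(128)               # toy "weights"
for _ in range(3):                         # repeating the procedure improves the Pareto front
    theta = warp_iteration(theta)
```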

In experiments, the Gemma 7B LLM served as the baseline and was fine-tuned with RLHF. The KL-regularized reward was optimized using the REINFORCE policy gradient, on-policy samples were generated to update the model, and SLERP was applied to each of the model's 28 layers individually.
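For the RL stage, a KL-regularized REINFORCE loss could look roughly like the following; this is a minimal sketch with illustrative tensor shapes and a simple per-sequence KL estimate, not the paper's training code.

```python
# Minimal sketch of a KL-regularized REINFORCE loss (PyTorch-style).
import torch

def reinforce_kl_loss(logprobs_policy, logprobs_anchor, rewards, beta=0.1):
    """logprobs_*: (batch, seq_len) token log-probs of sampled completions;
    rewards: (batch,) scalar reward-model scores; beta: KL coefficient."""
    seq_logprob = logprobs_policy.sum(dim=-1)
    # Per-sequence KL penalty estimated from the sampled tokens.
    kl = (logprobs_policy - logprobs_anchor.detach()).sum(dim=-1)
    regularized_reward = rewards - beta * kl
    # REINFORCE: ascend (reward - beta*KL) * log pi; the reward factor is detached.
    return -(regularized_reward.detach() * seq_logprob).mean()
```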

To validate WARP's efficacy, side-by-side comparisons were performed against the Mistral and Mixtral LLMs and preference rates were measured. The results confirmed that WARP is effective and that the resulting policies outperformed these alternative LLMs.

In summary, WARP offers a fresh approach to RLHF that addresses several long-standing issues and contributes to building more capable AI systems. The method applies model merging iteratively to push out the KL-reward Pareto front while preserving knowledge from pre-training. Looking ahead, WARP could lead to better-aligned AI systems and motivate further exploration of model merging techniques.
