Reinforcement learning from human feedback (RLHF) is a technique for aligning artificial intelligence (AI) systems, in particular large language models (LLMs), with human preferences by training them to maximize the score of a reward model learned from those preferences. However, RLHF faces several challenges: fine-tuning is typically limited to small datasets, the AI can exploit flaws in an imperfect reward model, and the resulting outputs tend to lack variety.
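For context, RLHF fine-tuning is commonly framed as maximizing the expected reward while penalizing the Kullback-Leibler (KL) divergence from a reference policy. The objective below is the standard textbook formulation, not notation taken from the paper:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \beta \,\mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
```

Here r(x, y) is the reward-model score, π_θ the policy being fine-tuned, π_ref the reference policy, and β controls how strongly the policy is kept close to the reference; varying β traces the trade-off between reward and KL that the Pareto front mentioned below refers to.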
A paper by researchers at Google DeepMind proposes a solution called Weight Averaged Rewarded Policies (WARP), which applies three distinct types of weight averaging at different stages of alignment to optimize the KL-reward Pareto front of solutions.
The first application of weight averaging in WARP uses an exponential moving average (EMA) of the policy as a dynamic anchor for the KL regularization during fine-tuning. Next, independently fine-tuned policies are merged into a single, more effective policy through spherical linear interpolation (SLERP) of their task vectors. Lastly, a linear interpolation between the merged model and the initialization recovers features from the pre-training stage. The whole procedure is iterated, with each final model serving as the starting point for the next run, which progressively improves the KL-reward Pareto front and retains more knowledge from pre-training.
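To make the three stages concrete, here is a minimal sketch in Python, using NumPy arrays as stand-ins for model weights. The function names, the two-policy merge, the per-parameter application of SLERP, and the coefficient values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ema_update(anchor, policy, decay=0.99):
    # Stage 1: the KL anchor is an exponential moving average of the trained
    # policy, nudged towards the current weights after each RL update.
    return {k: decay * anchor[k] + (1.0 - decay) * policy[k] for k in anchor}

def slerp(delta_a, delta_b, t=0.5, eps=1e-8):
    # Spherical linear interpolation between two task vectors
    # (fine-tuned weights minus the shared initialization).
    a, b = delta_a.ravel(), delta_b.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        merged = (1.0 - t) * a + t * b
    else:
        merged = (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    return merged.reshape(delta_a.shape)

def warp_merge(init, policy_a, policy_b, t=0.5, eta=0.3):
    # Stage 2: merge two independently fine-tuned policies by SLERP-ing
    # their task vectors, parameter by parameter.
    merged = {k: init[k] + slerp(policy_a[k] - init[k], policy_b[k] - init[k], t)
              for k in init}
    # Stage 3: interpolate linearly back towards the initialization; eta
    # trades reward against proximity to the pre-trained model.
    return {k: init[k] + eta * (merged[k] - init[k]) for k in init}

# Toy usage with random matrices standing in for LLM parameters.
rng = np.random.default_rng(0)
init = {"layer": rng.normal(size=(4, 4))}
policy_a = {"layer": init["layer"] + 0.1 * rng.normal(size=(4, 4))}
policy_b = {"layer": init["layer"] + 0.1 * rng.normal(size=(4, 4))}

anchor = ema_update(dict(init), policy_a)     # one EMA step (stage 1)
final = warp_merge(init, policy_a, policy_b)  # merge and interpolate (stages 2-3)
```

In the iterated procedure described above, `final` would then serve as the initialization for the next round of fine-tuning.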
The DeepMind team tested WARP by fine-tuning the Gemma 7B LLM with RLHF to improve it as a conversational AI agent, using the REINFORCE policy gradient to optimize the KL-regularized reward. Training relied on on-policy sample generation, with the model producing responses to a dataset of conversational prompts that were then scored by the reward model.
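As a rough illustration of how a KL-regularized reward can be optimized with REINFORCE, the PyTorch sketch below subtracts a scaled per-sample KL estimate (log-probability under the policy minus log-probability under the anchor) from the reward-model score and applies the standard policy-gradient loss with a mean baseline. All names and the sequence-level setup are assumptions for illustration, not the paper's code:

```python
import torch

def reinforce_kl_loss(logp_policy, logp_anchor, rewards, beta=0.1):
    # Per-sample KL estimate between the trained policy and the EMA anchor,
    # using sequence-level log-probabilities of the sampled responses.
    kl_estimate = logp_policy.detach() - logp_anchor
    # KL-regularized reward: reward-model score minus the scaled KL penalty.
    shaped_reward = rewards - beta * kl_estimate
    # Mean baseline for variance reduction, then the REINFORCE objective:
    # maximize E[advantage * log pi(y|x)], i.e. minimize its negative.
    advantage = shaped_reward - shaped_reward.mean()
    return -(advantage.detach() * logp_policy).mean()

# Toy usage with random numbers standing in for model outputs.
logp_policy = torch.randn(8, requires_grad=True)  # log pi_theta(y|x) per sample
logp_anchor = torch.randn(8)                      # log pi_anchor(y|x) per sample
rewards = torch.rand(8)                           # reward-model scores
loss = reinforce_kl_loss(logp_policy, logp_anchor, rewards)
loss.backward()                                   # gradients w.r.t. logp_policy
```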
To validate WARP, the researchers ran side-by-side comparisons of the trained policies against other LLMs, Mistral and Mixtral: each policy generated answers to a fixed set of prompts, and win rates were computed for the responses. WARP proved effective, with its policies preferred over the Mistral variants and outperforming previous Gemma 7B releases.
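For illustration only, a side-by-side evaluation of this kind reduces to a win rate over pairwise judgments, which could be computed along these lines (a hypothetical helper; ties counted as half a win):

```python
# "a" = the evaluated policy wins, "b" = the baseline wins, "tie" = no preference.
def win_rate(judgments):
    score = sum(1.0 if j == "a" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return score / len(judgments)

print(win_rate(["a", "a", "b", "tie"]))  # 0.625
```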
Overall, WARP has shown promise as a new RLHF technique for optimizing the alignment of LLMs, which could contribute to the creation of safer and more powerful AI in the future. The researchers also hope that WARP will encourage further exploration of model merging techniques. Model merging, a recent trend of combining deep models in the weight space, offers potential benefits such as improved generalization, reduced variance and memorization, and combined strengths in multi-task settings.