
Reconsidering the Function of PPO in RLHF

The development of increasingly powerful virtual assistants such as GPT-4, Claude-2, Bard, and Bing Chat has been facilitated by Reinforcement Learning from Human Feedback (RLHF). Despite these successes, the RLHF pipeline contains an inconsistency: the reward-learning stage is trained on human preference data in the form of comparisons between responses, while the reinforcement learning fine-tuning stage optimizes a single, non-comparative reward. To address this mismatch, the researchers introduce Pairwise Proximal Policy Optimization (P3O), which carries the comparative methodology of reward learning into policy optimization, creating a consistent and effective learning process.

In the RLHF pipeline, a model first goes through a supervised fine-tuning (SFT) stage, where it learns to respond to human queries. In the reward-modeling stage, the model produces pairs of answers to prompts, human labelers indicate which answer of each pair they prefer, and a reward model is trained with a comparative loss. Finally, in the RL fine-tuning stage, a reinforcement learning algorithm (commonly PPO) optimizes the policy to maximize the learned reward while limiting deviation from the initial policy.
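As a concrete illustration, the comparative loss used to train the reward model is typically a Bradley-Terry style objective over preference pairs. The PyTorch sketch below is a minimal illustration under that assumption; the `reward_model` callable and its signature are hypothetical, not the authors' code.

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry style loss: the preferred answer should score higher.

    `reward_model(prompts, responses)` is assumed to return one scalar
    reward per (prompt, response) pair, shape (batch,).
    """
    r_chosen = reward_model(prompts, chosen)      # rewards for preferred answers
    r_rejected = reward_model(prompts, rejected)  # rewards for dispreferred answers
    # -log sigmoid(r_chosen - r_rejected): widens the margin between the two rewards
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```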

P3O is derived from the commonly used vanilla policy gradient (VPG). Rather than weighting updates by the absolute magnitude of the reward, as VPG does, the underlying Pairwise Policy Gradient (PPG) weights them by the reward difference between two responses to the same prompt. To enhance performance, the researchers further incorporate a replay buffer with importance sampling and clipping, yielding P3O.
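A minimal sketch of what such a clipped, pairwise surrogate might look like is given below, assuming per-response log-probabilities and reward-model scores have already been computed. The separate-clipping form, the function and argument names, and the omission of the KL-to-reference penalty are illustrative assumptions rather than the paper's exact objective.

```python
import torch

def pairwise_clipped_loss(logp_new_1, logp_new_2,
                          logp_old_1, logp_old_2,
                          reward_1, reward_2, clip_eps=0.2):
    """Clipped surrogate on the reward *difference* of two responses
    (y1, y2) sampled for the same prompt under the old (behavior) policy.

    logp_*: summed log-probabilities of each full response, shape (batch,).
    reward_*: reward-model scores for each response, shape (batch,).
    """
    # Only the reward difference enters the update, so any prompt-level
    # offset in the reward cancels out.
    delta = reward_1 - reward_2

    # Importance ratios between the current policy and the behavior policy
    # that filled the replay buffer.
    ratio_1 = torch.exp(logp_new_1 - logp_old_1)
    ratio_2 = torch.exp(logp_new_2 - logp_old_2)
    ratio_1_clip = torch.clamp(ratio_1, 1 - clip_eps, 1 + clip_eps)
    ratio_2_clip = torch.clamp(ratio_2, 1 - clip_eps, 1 + clip_eps)

    # PPO-style pessimistic bound: push up the better response and push
    # down the worse one, clipping each ratio separately. The usual
    # KL-to-reference penalty is omitted here for brevity.
    term_1 = torch.min(ratio_1 * delta, ratio_1_clip * delta)
    term_2 = torch.min(-ratio_2 * delta, -ratio_2_clip * delta)
    return -0.5 * (term_1 + term_2).mean()
```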

To evaluate the method, the researchers tested P3O on open-ended text-generation tasks such as summarization and question answering, comparing it against PPO and DPO. They found that P3O controlled the KL-reward trade-off precisely and outperformed both PPO and DPO across a range of model sizes.
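One way to trace such a KL-reward trade-off is to pair the reward-model score of each sampled response with a Monte-Carlo estimate of its KL divergence from the frozen SFT reference policy. The sketch below shows that bookkeeping; the function name and tensor shapes are assumptions for illustration.

```python
import torch

@torch.no_grad()
def kl_reward_point(policy_logps, reference_logps, rewards):
    """Return one (KL, reward) point for the trade-off curve.

    policy_logps / reference_logps: per-token log-probs of the sampled
    responses under the trained policy and the frozen SFT reference,
    shape (batch, seq_len). rewards: reward-model scores, shape (batch,).
    """
    # Monte-Carlo estimate of sequence-level KL(pi || pi_ref) from samples
    # drawn from pi: sum the per-token log-prob gap over each response.
    kl_per_response = (policy_logps - reference_logps).sum(dim=-1)
    return kl_per_response.mean().item(), rewards.mean().item()
```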

In head-to-head comparisons on the HH dataset, P3O showed promising results. Although DPO achieved a slightly higher reward, it also incurred a considerably higher KL divergence, which can degrade generation quality. As a result, P3O outperformed DPO in the GPT-4 evaluation, indicating that P3O aligns better with human preferences than comparable methods.

In conclusion, the new P3O method shows promise for aligning large language models with human preferences through reinforcement learning. The approach unifies the fundamental principles of reward modeling and RL fine-tuning under comparative training, and it opens new possibilities for harnessing relative feedback in large language model alignment.
