Rethinking the Role of PPO in RLHF
Reinforcement Learning from Human Feedback (RLHF) typically relies on a dominant RL optimiser, Proximal Policy Optimization (PPO), a key component in the training of powerful virtual assistants such as GPT-4, Claude-2, Bard, and Bing Chat. However, current RLHF pipelines exhibit a tension between the reward learning phase, which is trained on human comparisons, and the RL fine-tuning phase, which optimises a single, non-comparative reward. This discrepancy can amplify problems, especially in language generation.
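To make the comparative nature of reward learning concrete, the reward model in RLHF is usually fit with a Bradley-Terry style objective. In standard notation (the symbols below are generic and not quoted from the paper), the probability that labellers prefer one response over another depends only on the difference between the two rewards:

```latex
% Bradley-Terry preference model used for reward learning:
% only the reward difference enters the preference probability.
P\big(y_1 \succ y_2 \mid x\big)
  = \sigma\!\big(r_\phi(x, y_1) - r_\phi(x, y_2)\big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

The RL fine-tuning phase, by contrast, consumes the absolute value of the learned reward directly, which is the source of the tension just described.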
This article presents Pairwise Proximal Policy Optimization (P3O), a method that addresses this issue by harmonising the reward learning and RL fine-tuning stages, so that RL itself operates on comparative feedback.
RLHF is a multi-stage process. In the Supervised Fine-Tuning (SFT) stage, a pre-trained model is fine-tuned on a high-quality dataset to imitate human-written responses. In the Reward Modelling stage, the SFT model is queried with prompts to produce pairs of responses; human labellers indicate which response in each pair they prefer, and this comparative feedback is used to train a reward model. Finally, in the RL Fine-Tuning stage, an RL algorithm starts from the SFT model and maximises the learned reward while limiting deviation from the initial policy. The weakness of this process is that the comparison data pins down the reward only up to a prompt-dependent shift, so the absolute reward scale for a given prompt carries no information; feeding that scale into the policy update can disrupt learning of the part that actually matters, the relative preference represented by the reward difference.
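For reference, and again in standard RLHF notation rather than anything quoted from the paper, the RL fine-tuning stage typically optimises a KL-regularised objective, while the comparison data determines the reward only up to a prompt-dependent shift:

```latex
% KL-regularised RL fine-tuning objective (standard RLHF formulation).
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[\, r_\phi(x, y) \,\big]
\;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big)

% Any shifted reward r'(x, y) = r_\phi(x, y) + \delta(x) yields the same
% preference probabilities, so only reward differences carry information.
```

A policy-gradient update driven by the raw reward is sensitive to this arbitrary per-prompt shift, which is exactly the disruption described above.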
P3O resolves this by building on the Pairwise Policy Gradient, which depends only on the difference between the rewards of two responses to the same prompt and is therefore unaffected by prompt-dependent reward translation. Enhanced with importance sampling and clipping for sample efficiency and stability, the resulting algorithm is effective both at managing the trade-off between KL divergence and reward and at aligning large language models.
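As a rough illustration of the idea, here is a minimal PyTorch sketch of a clipped pairwise surrogate in which only the reward difference between two responses to the same prompt drives the update. The function name, argument names, and the specific clipping arrangement are assumptions made for this sketch; the exact P3O surrogate and its clipping scheme are defined in the paper, not by this code.

```python
import torch

def pairwise_clipped_loss(logp_new_1, logp_old_1,
                          logp_new_2, logp_old_2,
                          reward_1, reward_2, clip_eps=0.2):
    """Illustrative clipped pairwise surrogate (a sketch, not the exact P3O loss).

    logp_new_* : summed log-probabilities of responses y1 / y2 under the
                 current policy, for the same prompt
    logp_old_* : the same quantities under the policy that sampled them
    reward_*   : scalar rewards assigned by the reward model
    """
    # Relative advantage: only the reward difference matters, so any
    # prompt-dependent shift of the reward cancels out.
    adv = 0.5 * (reward_1 - reward_2)

    # Importance ratios allow reusing sampled responses over several updates.
    ratio_1 = torch.exp(logp_new_1 - logp_old_1)
    ratio_2 = torch.exp(logp_new_2 - logp_old_2)

    # PPO-style clipping: y1 gets advantage +adv, y2 gets -adv.
    term_1 = torch.min(ratio_1 * adv,
                       torch.clamp(ratio_1, 1 - clip_eps, 1 + clip_eps) * adv)
    term_2 = torch.min(ratio_2 * (-adv),
                       torch.clamp(ratio_2, 1 - clip_eps, 1 + clip_eps) * (-adv))

    # Maximise the surrogate by minimising its negation.
    return -(term_1 + term_2).mean()
```

In a full pipeline the rewards would usually be combined with a per-token KL penalty against the SFT policy, matching the KL-reward trade-off discussed above.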
P3O was evaluated on summarisation and question-answering tasks. The results show that P3O outperforms PPO and DPO on both the KL-Reward frontier and the GPT-4 win-rate, making it a promising method for aligning large language models with human preferences.
By presenting the RLHF framework and the P3O method, this article contributes to the ongoing work on aligning large language models with human preferences using reinforcement learning. The findings are based on the paper and accompanying blog post `Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment`, posted on arXiv in 2023.