The effectiveness of Reinforcement Learning from Human Feedback (RLHF) in language model alignment depends heavily on the quality of the underlying reward model, which is meant to accurately reflect human preferences. However, modeling those preferences accurately typically comes with a high data collection cost.
Researchers from ETH Zurich, the Max Planck Institute for Intelligent Systems in Tübingen, and Google Research have proposed a method, termed West-of-N, for enhancing reward model quality. The strategy augments the training dataset with synthetic preference data, extending the Best-of-N sampling approaches previously used in language model training.
West-of-N generates synthetic preference data by sampling a pool of responses from the language model's policy for a given query and selecting the best and worst of them as a preference pair. This self-training strategy substantially improves reward model performance, with gains comparable to adding an equivalent amount of human preference data. The authors also provide theoretical guarantees on the probability that generated preference pairs are correctly labeled, and filtering steps based on model confidence and response distribution further enhance the quality of the generated data.
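The core pairing step can be illustrated with a minimal Python sketch. The helpers `policy.sample(query)` and `base_reward_model.score(query, response)`, as well as the `min_gap` threshold, are illustrative assumptions rather than details from the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    query: str
    preferred: str
    dispreferred: str

def west_of_n_pairs(policy, base_reward_model, queries, n=8, min_gap=0.5):
    """Generate synthetic preference pairs from the policy's own samples.

    For each query, draw n candidate responses, score them with the base
    reward model, and keep the (best, worst) pair. Pairs whose score gap
    falls below `min_gap` are discarded, a simple stand-in for the
    confidence-based filtering described in the paper.
    """
    pairs = []
    for query in queries:
        candidates = [policy.sample(query) for _ in range(n)]
        scores = [base_reward_model.score(query, c) for c in candidates]
        best = candidates[max(range(n), key=lambda i: scores[i])]
        worst = candidates[min(range(n), key=lambda i: scores[i])]
        if max(scores) - min(scores) >= min_gap:
            pairs.append(PreferencePair(query, best, worst))
    return pairs
```

In this sketch, the resulting synthetic pairs would be mixed with the original human preference data to continue training the reward model, which is the self-training loop the method relies on.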
In evaluations on the Reddit TL;DR summarization and Anthropic Helpful and Harmless dialogue datasets, West-of-N significantly improved reward model performance, outperforming other synthetic-data methods such as RLAIF and RLCD and surpassing the gains from adding comparable amounts of human feedback data. These results held across different types of initial preference data, indicating the strategy's effectiveness in language model alignment.
In conclusion, the West-of-N method improves reward model performance in RLHF, with experiments indicating its usefulness across different initial preference data and datasets. It showcases the potential of Best-of-N sampling and semi-supervised learning for preference modeling, and the researchers suggest further exploration of techniques such as noisy student training to push reward model performance further with West-of-N.
Overall, the work demonstrates the potential of synthetic preference generation as a practical strategy for improving reward models in RLHF.
