Recent research on large language models (LLMs) aims to align these models with human values, avoiding harmful behaviors while maximizing efficiency and applicability. Two significant methods used for alignment are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). RLHF, notably, relies on a reward model that must generalize to new prompt-response pairs. However, reward models often struggle to generalize to unseen data, a weakness the trained policy can exploit, a problem termed “overoptimization” or “reward hacking.”
The reviewed article analyzes two major strategies used in this area of research. The first, reward modeling, trains models on human preference data to guide RLHF; work here focuses on improving the quality or quantity of the preference data to obtain better reward models. The second, mitigating overoptimization in RLHF, applies techniques such as label smoothing or SFT regularization to reduce the risk of overfitting, helping models generalize beyond the specific training data.
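To make the first strategy and the label-smoothing mitigation concrete, here is a minimal sketch of a pairwise Bradley-Terry reward loss in PyTorch. The `reward_model` callable and its scalar-per-sequence output are assumptions for illustration, not the paper's implementation.

```python
# Sketch only: pairwise reward-model loss with optional label smoothing.
# Assumes `reward_model(token_ids)` returns one scalar score per sequence.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids, label_smoothing=0.0):
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    margin = r_chosen - r_rejected
    # Standard Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected).
    # Label smoothing places a small probability mass on the opposite
    # preference, which softens overconfident reward estimates.
    loss = -(1 - label_smoothing) * F.logsigmoid(margin) \
           - label_smoothing * F.logsigmoid(-margin)
    return loss.mean()
```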
Researchers from the Georgia Institute of Technology, HKUST, and the University of Illinois Urbana-Champaign proposed the “Generalizable Reward Model” (GRM), which introduces text-generation regularization on the reward model’s hidden states to bolster its performance. The study found that all categories of text-generation regularization were compatible with GRM, although SFT regularization was the most consistent and effective. GRM significantly improved the reward model’s accuracy on various out-of-distribution tasks and consistently improved RLHF performance, reducing the issue of overoptimization.
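To illustrate what text-generation regularization on hidden states could look like in practice, the sketch below pairs a scalar reward head with a language-modeling (SFT) head on top of a shared backbone. The names (`backbone`, `lm_head`, `reward_head`) and the weighting `alpha` are illustrative assumptions under a Bradley-Terry reward loss, not the authors' code.

```python
# Hedged sketch: reward learning with an SFT regularizer on shared hidden states.
# Assumes `backbone(token_ids)` returns hidden states of shape (batch, seq, hidden).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegularizedRewardModel(nn.Module):
    def __init__(self, backbone, vocab_size, hidden_size, alpha=0.01):
        super().__init__()
        self.backbone = backbone                      # causal LM trunk (assumed interface)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.reward_head = nn.Linear(hidden_size, 1)  # scalar reward from last hidden state
        self.alpha = alpha                            # weight of the SFT regularizer

    def forward(self, chosen_ids, rejected_ids):
        h_chosen = self.backbone(chosen_ids)          # (batch, seq, hidden)
        h_rejected = self.backbone(rejected_ids)

        # Reward loss: Bradley-Terry on the last-token hidden states.
        r_chosen = self.reward_head(h_chosen[:, -1]).squeeze(-1)
        r_rejected = self.reward_head(h_rejected[:, -1]).squeeze(-1)
        reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

        # SFT regularization: next-token prediction on the preferred response,
        # computed from the same hidden states the reward head uses.
        logits = self.lm_head(h_chosen[:, :-1])
        sft_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), chosen_ids[:, 1:].reshape(-1)
        )
        return reward_loss + self.alpha * sft_loss
```

The key design point is that both losses flow through the same hidden states, so the text-generation objective constrains the representation the reward head relies on.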
For this research, the Unified Feedback dataset, a substantial collection of pairwise feedback data, was used to train the reward models: models were trained on Unified Feedback subsets of 400K and 40K instances, with a separate 8K-instance hold-out serving as the evaluation set. Additionally, models were evaluated on out-of-distribution (OOD) preference data via the HHH-Alignment, MT-Bench Human Judgements, and RewardBench datasets.
The evaluation of GRM produced the following results:
1. GRM substantially improves the reward model’s generalization ability, leading to better performance on both in-distribution and out-of-distribution datasets.
2. All text-generation regularization losses enhanced generalization, with SFT regularization showing the most stable and effective improvements.
3. GRM remained robust when trained on limited data, consistently besting the baselines.
4. GRM also reduced overoptimization in both best-of-n (BoN) sampling and proximal policy optimization (PPO), and proved robust against label noise in the preference data (see the BoN sketch after this list).
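For context on the BoN setting mentioned above, the following sketch shows best-of-n sampling against a learned reward model; `generate` and `score` are hypothetical interfaces for a policy model and a reward model, not part of the paper.

```python
# Illustrative best-of-n (BoN) sampling against a reward model, the setting in
# which overoptimization is commonly measured. `generate(prompt)` samples one
# response from the policy; `score(prompt, response)` returns a scalar reward.
def best_of_n(prompt, generate, score, n=16):
    candidates = [generate(prompt) for _ in range(n)]           # sample n responses
    rewards = [score(prompt, response) for response in candidates]
    # Return the response the reward model scores highest; a proxy reward that
    # generalizes poorly is exactly what makes this step prone to reward hacking.
    best_index = max(range(n), key=lambda i: rewards[i])
    return candidates[best_index]
```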
In sum, GRM proved to be an effective means of improving the generalizability and robustness of reward learning for large language models. By introducing regularization on the hidden states of reward models, generalization to unseen data improved dramatically. Overoptimization, a recurrent issue in RLHF, was also reduced with GRM, providing a strong basis for future research on stronger, cost-effective reward models and more efficient alignment of large language models.