The standard method for aligning large language models (LLMs) is Reinforcement Learning from Human Feedback (RLHF). However, recent offline alignment methods – such as Direct Preference Optimization (DPO) – challenge RLHF's reliance on on-policy sampling. Unlike online methods, offline algorithms learn directly from existing preference datasets, making them simpler, cheaper, and often more computationally efficient.
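For context, DPO trains on fixed pairs of preferred and rejected responses rather than sampling from the policy during training. The snippet below is a minimal sketch of that objective under common assumptions; the function name, variable names, and beta value are illustrative and not taken from the study.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Illustrative DPO loss over a batch of stored preference pairs.

    Each argument is an array of summed log-probabilities of the chosen (w)
    or rejected (l) response under the policy being trained or under the
    frozen reference (e.g. SFT) model. beta scales the implicit KL penalty.
    """
    # Reference-adjusted log-likelihood margin of chosen over rejected.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Negative log-sigmoid of the scaled margin, averaged over the batch.
    return float(np.mean(np.log1p(np.exp(-beta * margin))))

# Example: random log-probabilities for a batch of 4 preference pairs.
rng = np.random.default_rng(0)
logps = rng.normal(size=(4, 4))
print(dpo_loss(*logps.T))
```

Because every quantity comes from a pre-collected dataset and a frozen reference model, no generation from the current policy is needed during training, which is what makes the method offline.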
This development has raised questions about whether online, on-policy RL is actually necessary for AI alignment. Straightforward comparisons between online and offline methods are difficult, however, because their computational demands differ; a fair comparison requires carefully calibrating the budget each method is allowed to spend.
A group of researchers from Google DeepMind decided to explore this question. Their initial experiments showed that online algorithms can outperform offline ones, prompting a closer look at the source of the gap. Further controlled experiments suggested that the quality and coverage of the offline data account for a substantial part of it.
Though offline methods excel at pairwise classification, they fall short when it comes to generation, and this gap persists regardless of the loss function used or the scale of the model. Based on these results, it seems likely that on-policy sampling is vital for effective AI alignment, a finding that also underscores the challenges inherent in offline alignment methods.
Taking a novel approach to performance comparison, the study measured each policy's KL divergence from the supervised fine-tuned (SFT) policy and used it as a common budget. Sweeping across different algorithms and budget levels allowed the researchers to identify a persistent performance difference between online and offline methodologies.
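One common way to measure such a budget is a Monte Carlo estimate of the sequence-level KL divergence, averaged over responses sampled from the current policy. The sketch below illustrates that estimate under this assumption; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def kl_from_sft(policy_logps, sft_logps):
    """Monte Carlo estimate of KL(policy || SFT policy).

    policy_logps and sft_logps are summed log-probabilities of the same
    sampled responses (drawn from the current policy) scored under the
    current policy and under the frozen SFT policy, respectively.
    """
    # E_{y ~ policy}[log policy(y|x) - log sft(y|x)]
    return float(np.mean(policy_logps - sft_logps))

# Example: 8 sampled responses scored under both models.
rng = np.random.default_rng(1)
policy_logps = rng.normal(-40.0, 5.0, size=8)
sft_logps = policy_logps - rng.normal(2.0, 0.5, size=8)  # SFT assigns lower log-prob
print(kl_from_sft(policy_logps, sft_logps))
```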
The investigation then compared online and offline RLHF algorithms head to head. A steady performance gap emerged, consistent with the earlier findings, reinforcing the view that the two approaches to AI alignment come with distinct challenges and benefits.
The study’s framework applied an IPO loss across different datasets and examined behavior under Goodhart’s law, which states that when a measure becomes a target, it ceases to be a good measure. The IPO loss increases the relative likelihood of winning responses over losing ones while regularizing toward a reference policy; online algorithms compute it on responses sampled on-policy, while offline methods compute it on a fixed dataset.
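For readers unfamiliar with IPO, the published formulation regresses the reference-adjusted log-likelihood margin between the winning and losing response toward a target set by a regularization parameter tau. The sketch below follows that formulation; the function name, variable names, and tau value are illustrative rather than the study's exact implementation.

```python
import numpy as np

def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Illustrative IPO loss over a batch of preference pairs.

    Arguments are summed log-probabilities of the winning (w) and losing (l)
    responses under the trained policy and under the frozen reference policy.
    tau controls how strongly the policy is kept close to the reference.
    """
    # Reference-adjusted log-likelihood margin of winner over loser.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # IPO regresses the margin toward 1 / (2 * tau) with a squared error.
    return float(np.mean((margin - 1.0 / (2.0 * tau)) ** 2))

# Example: a batch of 4 preference pairs with random log-probabilities.
rng = np.random.default_rng(2)
logps = rng.normal(size=(4, 4))
print(ipo_loss(*logps.T))
```

In the online setting, the pairs fed to this loss are sampled from the current policy and ranked on the fly; in the offline setting, they come from a fixed dataset, which is where Goodhart-style over-optimization of the proxy measure becomes a concern.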
In conclusion, the study sheds light on RLHF’s role and suggests that on-policy sampling is important for effectively aligning AI, while also exposing the challenges involved in offline alignment methods. The authors suggest that offline algorithms could benefit from strategies that mimic online learning, opening the door to future work on hybrid online-offline approaches and further investigation of reinforcement learning from human feedback.