Large Language Models (LLMs) have successfully replicated human-like conversational abilities and demonstrated proficiency in coding. However, they still struggle to maintain high reliability and strict adherence to ethical and safety standards. Reinforcement Learning from Human Feedback (RLHF), also known as Preference-based Reinforcement Learning (PbRL), has emerged as a promising approach for fine-tuning LLMs toward these goals.
Current RLHF methods rely on explicit or implicit reward models to drive learning. More recently, some researchers have experimented with working directly with preference probabilities to represent human preferences more faithfully. In this view, RLHF is framed as the task of finding the Nash equilibrium of a two-player constant-sum game, which has led to methods based on mirror descent and Self-Play Preference Optimization (SPO). Ideas such as Direct Nash Optimization (DNO) have also begun to gain traction.
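Concretely, this game-theoretic framing treats two policies as players whose payoff is the probability that one policy's response is preferred over the other's. The snippet below is a common way to write the target policy in this line of work (notation is ours, not quoted from the paper), where P(y ≻ y' | x) denotes the preference probability for a prompt x.

```latex
% Two-player constant-sum preference game: each player is a policy, and the
% payoff is the probability that one policy's response beats the other's.
\pi^{*} \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
  \mathbb{E}_{x \sim \mathcal{X},\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
  \bigl[\, \mathbb{P}\left(y \succ y' \mid x\right) \bigr]
% Because P(y > y' | x) + P(y' > y | x) = 1, the game is constant-sum, and the
% symmetric Nash equilibrium is the pair (pi*, pi*).
```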
Researchers from the University of California and Carnegie Mellon University now propose a robust self-play framework called Self-Play Preference Optimization (SPPO) that addresses these RLHF challenges and highlights new possibilities for language model alignment. SPPO offers theoretical guarantees for solving two-player constant-sum games and scales to large language models.
When RLHF is cast as such a game, winning hinges on identifying the Nash equilibrium policy, which consistently generates preferred responses. The researchers developed an adaptive algorithm built on a self-play mechanism, in which the policy fine-tunes itself on synthetic data annotated by the preference model.
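As a rough illustration (not the authors' released code), one round of self-play data collection might look like the sketch below; `policy.generate` and `preference_model.win_prob` are hypothetical interfaces standing in for the current LLM policy and the preference annotator.

```python
# Hypothetical sketch of one round of self-play data collection for SPPO-style training.
# `policy` is the current language-model policy; `preference_model.win_prob` returns the
# probability that response y_i is preferred over response y_j for a given prompt.

def collect_selfplay_data(policy, preference_model, prompts, num_samples=5):
    dataset = []
    for prompt in prompts:
        # Sample several candidate responses from the *current* policy (self-play).
        responses = [policy.generate(prompt) for _ in range(num_samples)]

        # Estimate each response's win probability against the policy itself by
        # averaging pairwise preference scores over the other sampled responses.
        for i, y_i in enumerate(responses):
            opponents = [y_j for j, y_j in enumerate(responses) if j != i]
            win_prob = sum(
                preference_model.win_prob(prompt, y_i, y_j) for y_j in opponents
            ) / len(opponents)
            dataset.append({"prompt": prompt, "response": y_i, "win_prob": win_prob})
    return dataset
```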
SPPO combines this self-play mechanism with an iterative framework grounded in multiplicative weight updates for solving two-player constant-sum games. The algorithm steadily moves the policy toward the Nash equilibrium, and a theoretical analysis guarantees convergence, giving SPPO an advantage over DPO and IPO in both convergence behavior and handling of data sparsity.
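Intuitively, the multiplicative weight update pushes the next policy toward one proportional to the current policy reweighted by the exponentiated win probability, and in practice this is approximated with a regression-style objective on log-probability ratios. The following is a simplified sketch of such a loss in PyTorch under our own assumptions, not the authors' implementation; `logp_current` and `logp_ref` are per-response log-probabilities under the policy being trained and the previous-iteration (reference) policy.

```python
import torch

def sppo_style_loss(logp_current, logp_ref, win_prob, eta=1.0):
    """Regression loss that nudges log pi_theta(y|x) - log pi_t(y|x) toward
    eta * (win_prob - 0.5), mimicking a multiplicative-weight update.

    logp_current: log-probabilities of responses under the policy being trained.
    logp_ref:     log-probabilities under the previous-iteration (reference) policy.
    win_prob:     estimated probability that each response beats the current policy.
    eta:          step-size / temperature hyperparameter (value here is an assumption).
    """
    log_ratio = logp_current - logp_ref
    target = eta * (win_prob - 0.5)
    return torch.mean((log_ratio - target) ** 2)
```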
Performance evaluations are conducted with GPT-4 as the judge on AlpacaEval 2.0 and MT-Bench. As the iterations progress, the SPPO models show improved performance, with SPPO Iter3 achieving the highest win rate. SPPO also has the upper hand over DPO and IPO in controlling output length as well as in overall performance.
To summarize, SPPO is a new approach for fine-tuning LLMs using human or AI feedback. It significantly outperforms earlier methods such as DPO and IPO across multiple benchmarks. By integrating a preference model and batched estimation, SPPO aligns LLMs more closely with human preferences and effectively mitigates the “length bias” form of reward hacking. The findings hold promise for broader adoption of SPPO, paving the way for further advances in aligning generative AI systems.