Proximal Policy Optimization (PPO), originally designed for continuous control tasks, is widely used in reinforcement learning (RL) applications such as fine-tuning generative models. However, PPO's effectiveness relies on a series of heuristics for stable convergence, such as value networks and clipping, which add complexity to its implementation.
Adapting PPO to optimize modern generative models with billions of parameters is challenging: it requires keeping multiple models in memory simultaneously, and its performance varies widely with seemingly trivial implementation details. This prompts the question: can simpler RL algorithms be developed?
Policy gradient (PG) methods, a central family of RL algorithms, are typically divided into two groups: REINFORCE-style methods, which rely on variance-reduction techniques to make optimization tractable, and adaptive methods, which precondition the policy gradient to ensure stability and faster convergence, as sketched below.
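To make the distinction concrete, here is a sketch of the two standard update forms (generic textbook notation, not taken from the REBEL paper): a REINFORCE-style update subtracts a baseline $b$ from the return to reduce variance, while an adaptive method such as Natural Policy Gradient preconditions the gradient with the inverse Fisher information matrix $F_\theta$.

```latex
% REINFORCE-style gradient with a baseline b for variance reduction
\nabla_\theta J(\theta) \;=\;
  \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\big(R(\tau) - b\big)\,\nabla_\theta \log \pi_\theta(\tau)\right]

% Adaptive / preconditioned update (Natural Policy Gradient),
% where F_{\theta_t} is the Fisher information matrix of the policy
\theta_{t+1} \;=\; \theta_t + \eta\, F_{\theta_t}^{-1}\, \nabla_\theta J(\theta_t)
```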
A team of researchers from Cornell, Princeton, and Carnegie Mellon University recently introduced REBEL, a streamlined RL algorithm. The method reduces policy optimization to regressing the relative rewards between pairs of responses, using a direct policy parameterization.
Their theoretical analysis shows that fundamental RL algorithms such as Natural Policy Gradient can be recovered as variants of REBEL, which matches the strongest known theoretical guarantees for convergence and sample efficiency. The method also accommodates offline data and can handle intransitive preferences.
In the contextual bandit formulation of RL, appropriate for models with deterministic transitions such as LLMs and diffusion models, prompt–response pairs are scored by a reward function that measures response quality. REBEL frames each iteration as a KL-constrained RL problem whose relative entropy structure admits a closed-form solution, allowing the reward to be expressed as a function of the policy.
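A sketch of this reduction, following the standard relative-entropy derivation (the symbols $\eta$, $\rho$, and $Z(x)$ are notation introduced here): the KL-regularized update has a closed-form solution that lets the reward be written as a scaled log-ratio of policies, up to a prompt-dependent normalizer $Z(x)$ that cancels when two responses to the same prompt are compared.

```latex
% KL-constrained (relative entropy) policy update at iteration t
\pi_{t+1} = \arg\max_{\pi}\;
  \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
  \;-\; \tfrac{1}{\eta}\,\mathbb{E}_{x \sim \rho}\,
  \mathrm{KL}\!\big(\pi(\cdot \mid x)\,\|\,\pi_t(\cdot \mid x)\big)

% Closed-form solution and the induced expression of the reward
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\, e^{\eta\, r(x, y)}
\quad\Longrightarrow\quad
r(x, y) \;=\; \tfrac{1}{\eta}\ln\frac{\pi_{t+1}(y \mid x)}{\pi_t(y \mid x)}
  \;+\; \tfrac{1}{\eta}\ln Z(x)

% Differencing two responses y, y' to the same prompt cancels Z(x),
% leaving a relative reward that depends only on policy ratios
r(x, y) - r(x, y') \;=\;
  \tfrac{1}{\eta}\!\left(\ln\frac{\pi_{t+1}(y \mid x)}{\pi_t(y \mid x)}
  - \ln\frac{\pi_{t+1}(y' \mid x)}{\pi_t(y' \mid x)}\right)
```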
Compared with other RL algorithms, REBEL achieves a higher reward-model (RM) score across all model sizes. It also attains a higher win rate against human references, underscoring the advantage of regressing relative rewards.
In essence, REBEL is a simplified RL algorithm that solves relative-reward regression problems on sequentially gathered datasets, combining theoretical rigor with practical applicability. Unlike traditional policy gradient methods, REBEL reduces policy optimization to driving down training error on a least-squares problem. This makes it a straightforward, scalable approach that has demonstrated competitive or superior performance across a variety of tasks.
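As an illustration, here is a minimal PyTorch-style sketch of that least-squares objective (the function name, tensor layout, and hyperparameter `eta` are assumptions for this example, not the authors' code): given summed log-probabilities of two responses per prompt under the current and previous policies, the loss regresses the scaled difference of policy log-ratios onto the observed reward difference.

```python
import torch

def rebel_loss(logp_new_a, logp_new_b,   # log pi_theta(y|x), log pi_theta(y'|x); shape (batch,)
               logp_old_a, logp_old_b,   # log pi_t(y|x), log pi_t(y'|x), detached; shape (batch,)
               reward_a, reward_b,       # rewards r(x, y), r(x, y'); shape (batch,)
               eta: float = 1.0) -> torch.Tensor:
    """Least-squares regression of relative rewards (sketch of the REBEL-style objective)."""
    # Scaled difference of policy log-ratios for the two responses to the same prompt
    pred = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    # Observed relative reward that the log-ratio difference should match
    target = reward_a - reward_b
    # Squared regression error, averaged over the batch
    return ((pred - target) ** 2).mean()
```

In practice the per-response log-probabilities would be summed over generated tokens, and the previous policy's log-probabilities are treated as constants (detached from the computation graph).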