
Google DeepMind’s researchers have introduced BOND: an innovative RLHF method that refines the policy through online distillation of the Best-of-N sampling distribution.

Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in ensuring the quality and safety of Large Language Models (LLMs) such as Gemini and GPT-4. However, RLHF poses significant challenges, including the risk of forgetting pre-trained knowledge and of reward hacking. A common practice for improving generation quality is to sample N candidate outputs and keep the best one, known as Best-of-N sampling. Nevertheless, this method is computationally expensive.
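For illustration, here is a minimal sketch of Best-of-N sampling in Python; the `generate` and `reward` helpers are hypothetical stand-ins for a language-model sampler and a reward model.

```python
# Minimal sketch of Best-of-N sampling (illustrative; `generate` and `reward`
# are hypothetical stand-ins for an LLM sampler and a reward model).

def best_of_n(prompt, generate, reward, n=16):
    """Draw n candidate completions and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```

The cost is apparent from the sketch: every query requires n generations plus n reward-model evaluations, which is the overhead BOND aims to avoid.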

Researchers at Google DeepMind have therefore developed a new RLHF algorithm called Best-of-N Distillation (BOND). It is designed to reproduce the advantages of Best-of-N sampling without the associated high computational cost. BOND is a distribution-matching algorithm that aligns the policy’s output distribution with the Best-of-N distribution. It balances mode-covering and mode-seeking behavior using the Jeffreys divergence, a combination of the forward and backward KL divergences, and refines the policy iteratively against a moving anchor distribution.
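As a rough sketch of the matching objective, the Jeffreys divergence blends the forward KL (mode-covering) with the backward KL (mode-seeking). The weighting convention below is illustrative rather than the paper’s exact formulation.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(pi, pi_bon, beta=0.5):
    """Weighted Jeffreys divergence between the policy and the BoN target:
    a blend of forward KL (mode-covering) and backward KL (mode-seeking)."""
    return (1.0 - beta) * kl(pi_bon, pi) + beta * kl(pi, pi_bon)
```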

BOND proceeds in two main steps. First, it derives an analytical expression for the Best-of-N (BoN) distribution. Second, it recasts the problem as distribution matching: the policy is trained to approximate the BoN distribution by minimizing the divergence between the two.
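For reference, the standard order-statistics expression for the Best-of-N distribution (assuming the reward induces no ties among sampled completions) is:

$$
\pi_{\text{BoN}}(y \mid x) = N \, \pi_{\text{ref}}(y \mid x) \, F(y \mid x)^{N-1},
\qquad
F(y \mid x) = \Pr_{y' \sim \pi_{\text{ref}}(\cdot \mid x)}\big[r(x, y') \le r(x, y)\big].
$$

Matching this target distribution is what turns Best-of-N sampling into a training objective for a single policy.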

Furthermore, a variant of BOND, named J-BOND, has been designed to refine policies with minimal sample complexity. This practical implementation distills the Best-of-2 distribution, using the Jeffreys divergence to drive iterative improvement. The method also incorporates an Exponential Moving Average (EMA) anchor, which enhances training stability and improves the trade-off between reward and KL divergence from the reference policy.
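A minimal sketch of the EMA anchor idea, with illustrative variable names and decay value:

```python
# Sketch of an Exponential Moving Average (EMA) anchor: the anchor's parameters
# slowly track the policy's parameters, giving a stable moving target.
# The decay value and parameter layout are illustrative.

def ema_update(anchor_params, policy_params, decay=0.99):
    """Update the anchor in place: anchor <- decay * anchor + (1 - decay) * policy."""
    for name, value in policy_params.items():
        anchor_params[name] = decay * anchor_params[name] + (1.0 - decay) * value
    return anchor_params
```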

The effectiveness of BOND and J-BOND has been confirmed through a series of experiments, with results indicating that both methods outperform traditional RLHF algorithms. J-BOND, for instance, delivers strong performance without requiring a fixed regularization level.

In summary, BOND offers a novel RLHF method that refines the policy by distilling the Best-of-N sampling distribution. Its practicality and efficiency are further improved by J-BOND, which incorporates Monte-Carlo quantile estimation and an iterative process with an EMA anchor. The success of BOND and J-BOND was demonstrated through experiments on abstractive summarization and Gemma models, showing their ability to improve the KL-reward Pareto front and outperform existing baselines. These results affirm their potential to replicate the benefits of the Best-of-N strategy without the associated computational overhead.
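As an illustration of Monte-Carlo quantile estimation, the sketch below approximates the probability that a reference-policy sample scores at or below a given completion; `sample_ref` and `reward` are hypothetical helpers standing in for the reference policy and reward model.

```python
import numpy as np

def mc_quantile(prompt, completion, sample_ref, reward, num_samples=32):
    """Estimate Pr[r(prompt, y') <= r(prompt, completion)] for y' drawn from the
    reference policy, using num_samples Monte-Carlo draws."""
    scores = np.array([reward(prompt, sample_ref(prompt)) for _ in range(num_samples)])
    return float(np.mean(scores <= reward(prompt, completion)))
```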
