Iterative preference optimization methods have demonstrated effectiveness in general instruction-tuning tasks but have yet to deliver comparable gains on reasoning tasks. Recently, offline techniques such as Direct Preference Optimization (DPO) have gained popularity due to their simplicity and efficiency. More recent work advocates applying such offline procedures iteratively, creating new preference relations at each round to further improve model performance. However, preference optimization remains relatively unexplored for reasoning, despite the successful integration of other iterative training methods in that area.
Iterative alignment methods, encompassing both human-in-the-loop and automated strategies, hold promise for improving these processes. Some rely on human feedback for reinforcement learning, while others optimize preference pairs independently, using the updated model to create new pairs for subsequent iterations. Strategies such as Iterative DPO and Self-Rewarding LLMs use the model itself to evaluate rewards, an approach that has succeeded in instruction following but has produced only modest gains on reasoning tasks.
Researchers from Facebook AI Research (FAIR) at Meta and New York University have developed an approach that targets iterative preference optimization specifically for reasoning tasks, building on Chain-of-Thought (CoT) reasoning. Each iteration samples many chains of reasoning steps with final answers and forms preference pairs in which the winner reaches the correct answer and the loser does not. Training uses a DPO variant that adds a negative log-likelihood (NLL) loss term, which proves crucial for improving performance. The process then generates new pairs with the updated model and retrains starting from the last trained iteration, progressively refining the model.
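To make the loss concrete, here is a minimal PyTorch-style sketch of a DPO objective augmented with an NLL term on the winning responses. The exact weighting coefficients, length normalization, and function signature are assumptions for illustration rather than the authors' released implementation; the inputs are summed token log-probabilities under the current policy and the frozen reference model.

```python
import torch.nn.functional as F

def dpo_nll_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_lengths, beta=0.1, alpha=1.0):
    """Sketch of a DPO loss plus an NLL term on the winning (correct) sequences.

    All *_logps are summed token log-probabilities per sequence in a batch.
    chosen_lengths holds token counts of the winning sequences and is used
    here to length-normalize the NLL term (an assumed choice).
    """
    # Standard DPO: reward margin between winner and loser, measured
    # relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # NLL term: keep pushing probability mass onto the correct chains.
    nll_loss = -policy_chosen_logps / chosen_lengths

    return (dpo_loss + alpha * nll_loss).mean()
```

The NLL term acts like continued supervised training on the correct chains, while the DPO term pushes them apart from the incorrect ones.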
The approach assumes a base language model and a set of training inputs whose final answers can be checked for correctness. The model generates a sequence of reasoning steps followed by a final answer, and only the correctness of that final answer is evaluated; the individual reasoning steps are not graded.
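The pair-construction step can be sketched as follows. This is a simplified illustration, not the authors' code: `model.generate` and `extract_final_answer` are hypothetical helpers standing in for whatever sampling and answer-parsing utilities a given setup provides, and the sampling and pairing budgets are arbitrary.

```python
import random

def build_preference_pairs(model, problems, num_samples=30, max_pairs_per_problem=10):
    """Sample CoT solutions per problem and pair correct vs. incorrect ones.

    Only the final answer is checked against the gold label; the reasoning
    steps themselves are never graded.
    """
    pairs = []
    for problem in problems:
        correct, incorrect = [], []
        for _ in range(num_samples):
            # Hypothetical generation call: returns CoT steps plus a final answer.
            solution = model.generate(problem["question"])
            if extract_final_answer(solution) == problem["gold_answer"]:
                correct.append(solution)
            else:
                incorrect.append(solution)

        # Each pair: a winner with the right final answer, a loser without it.
        random.shuffle(correct)
        random.shuffle(incorrect)
        for winner, loser in zip(correct[:max_pairs_per_problem],
                                 incorrect[:max_pairs_per_problem]):
            pairs.append({"prompt": problem["question"],
                          "chosen": winner,
                          "rejected": loser})
    return pairs
```

The resulting pairs feed the DPO-plus-NLL objective above; on the next iteration, the freshly trained model is used to sample a new set of pairs.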
In experiments, the method proved effective at teaching accurate reasoning. Using only examples from the training set, accuracy increased across iterations on reasoning benchmarks such as GSM8K, MATH, and ARC-Challenge, outperforming baselines that did not draw on additional datasets.
The study introduces an iterative training algorithm, Iterative Reasoning Preference Optimization, aimed at improving performance on CoT-based reasoning tasks. The method relies neither on human-in-the-loop feedback nor on extra training data, keeping it simple and efficient. The experiments showed substantial gains on several benchmark tasks compared with baselines that use the same base model and training data. These results underscore the power of iterative training for strengthening language models' reasoning capabilities, and the approach lays the groundwork for further study in this area.