Iterative preference optimization methods have proven effective for general instruction tuning tasks but have yielded smaller improvements on reasoning tasks. Recently, offline techniques such as Direct Preference Optimization (DPO) have gained popularity due to their simplicity and efficiency. More advanced approaches advocate applying such offline procedures iteratively to create new preference relations, further…
