The standard method for aligning Large Language Models (LLMs) is Reinforcement Learning from Human Feedback (RLHF). However, recent developments in offline alignment, such as Direct Preference Optimization (DPO), challenge RLHF's reliance on on-policy sampling. Unlike online methods, offline algorithms learn directly from existing preference datasets, making them simpler, cheaper, and often more…

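To make the contrast concrete, below is a minimal sketch of the DPO objective, assuming per-sequence log-probabilities (summed over tokens) have already been computed for the preferred and dispreferred responses under both the trained policy and a frozen reference model; the function name and argument names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities
    for the chosen (preferred) and rejected (dispreferred) responses,
    under the policy being trained and a fixed reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO widens the margin between the two log-ratios; beta scales
    # the implicit KL penalty toward the reference model.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Because the loss depends only on stored (chosen, rejected) pairs and a frozen reference, no sampling from the current policy is needed during training, which is exactly the offline property the paragraph above highlights.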