Stanford University researchers have developed a new method called Demonstration ITerated Task Optimization (DITTO), designed to align language model outputs directly with users’ demonstrated behaviors. The technique addresses several challenges language models (LMs) face, including the need for large training datasets, a tendency toward generic responses, and mismatches between a universal default style and application-specific preferences.
DITTO borrows ideas from online imitation learning to generate online comparison data inexpensively: users’ demonstrations are treated as preferred over outputs sampled from the LM and its intermediate training checkpoints. The method outperformed techniques such as supervised fine-tuning, few-shot prompting, and self-play by an average of 19 percentage points in win rate, offering a new and effective way to customize LMs.
Applicable across verticals such as news, emails, and blog posts, DITTO follows a three-step iterative process, sketched in code below. First, limited supervised fine-tuning is performed on the expert demonstrations; next, a comparison dataset is generated by sampling completions for each demonstration prompt, with the demonstrations ranked above the sampled outputs; finally, a reinforcement learning from human feedback (RLHF)-style preference update is applied to the policy.
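The loop can be pictured with a short structural sketch. The helper names below (sft_update, sample_completion, dpo_update) and the toy model object are illustrative placeholders rather than the authors’ implementation; the point is how each demonstration is paired against completions sampled from the current policy and earlier checkpoints before every preference update.

```python
# Structural sketch of a DITTO-style loop (illustrative only).
# sft_update, sample_completion, and dpo_update are hypothetical placeholders.
import random
from typing import List, Tuple

def sft_update(model: dict, demos: List[Tuple[str, str]]) -> dict:
    """Placeholder: fine-tune the model on (prompt, demonstration) pairs."""
    return {**model, "step": model["step"] + 1}

def sample_completion(model: dict, prompt: str) -> str:
    """Placeholder: draw a completion from the given model/checkpoint."""
    return f"completion@step{model['step']}:{prompt[:20]}"

def dpo_update(model: dict, pairs: List[Tuple[str, str, str]]) -> dict:
    """Placeholder: one preference-optimization step on (prompt, chosen, rejected)."""
    return {**model, "step": model["step"] + 1}

def ditto_like_loop(demos: List[Tuple[str, str]], iterations: int = 3) -> dict:
    model = {"step": 0}
    checkpoints = []
    # Step 1: limited supervised fine-tuning on the expert demonstrations.
    model = sft_update(model, demos)
    for _ in range(iterations):
        checkpoints.append(dict(model))
        # Step 2: build comparison data by sampling completions from the
        # current policy and earlier checkpoints; the user's demonstration
        # is always the preferred ("chosen") response.
        pairs = []
        for prompt, demo in demos:
            for ckpt in checkpoints + [model]:
                rejected = sample_completion(ckpt, prompt)
                pairs.append((prompt, demo, rejected))
        # Step 3: update the policy with an RLHF-style preference objective.
        model = dpo_update(model, random.sample(pairs, min(8, len(pairs))))
    return model

if __name__ == "__main__":
    demos = [("Write a short product update email.", "Hi team, quick update: ...")]
    final_model = ditto_like_loop(demos)
    print("finished at step", final_model["step"])
```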
The technique was evaluated with GPT-4 as an automated judge, achieving an average win rate of 77.09% across CMCC (71.67%) and CCAT50 (82.50%). This represents an average win-rate improvement of 11.7 points over the other methods. User studies likewise showed DITTO to be more effective than the alternative approaches.
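For readers unfamiliar with judge-based evaluation, here is a hedged sketch of how a GPT-4 head-to-head win rate might be computed; the judging prompt, helper names, and tie handling are assumptions made for illustration, not the paper’s exact protocol. It also notes that the overall figure is simply the mean of the two per-benchmark win rates: (71.67 + 82.50) / 2 = 77.085 ≈ 77.09.

```python
# Hedged sketch of a GPT-4-as-judge win-rate computation; prompt wording and
# helpers are illustrative assumptions, not the paper's evaluation protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_prefers_a(prompt: str, output_a: str, output_b: str) -> bool:
    """Ask GPT-4 which completion better matches the demonstrated style."""
    question = (
        "Task prompt:\n" + prompt +
        "\n\nResponse A:\n" + output_a +
        "\n\nResponse B:\n" + output_b +
        "\n\nWhich response better matches the author's style? Answer 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("A")

def win_rate(examples: list) -> float:
    """examples: list of (prompt, ditto_output, baseline_output) triples."""
    wins = sum(judge_prefers_a(p, a, b) for p, a, b in examples)
    return 100.0 * wins / len(examples)

# Averaging the per-benchmark win rates reproduces the reported overall figure:
# (71.67 + 82.50) / 2 = 77.085, rounded to 77.09.
```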
While the Stanford researchers recognized the value of demonstrations as feedback, they did not evaluate larger model sizes due to computational costs and called for further analysis of the types of preference data needed, leaving room for further advances in the field.