Machine learning researchers have developed a cost-effective reward mechanism to improve how video language models respond to questions about video content. The technique uses detailed video captions to measure the quality of responses produced by these models: the captions serve as proxies for the actual video frames, allowing a text-only language model to evaluate the factual accuracy of a response to a video-related query and to flag any fabricated content, a failure mode known as 'hallucination'.
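In outline, the reward step amounts to asking a text-only language model to grade a candidate answer against the detailed caption instead of the raw frames. The sketch below is purely illustrative: the prompt wording, the 0-to-5 scale, and the call_judge_llm helper are assumptions, not the authors' exact implementation.

```python
# Illustrative sketch: scoring a model's answer against a detailed video caption.
# The prompt wording, the 0-5 scale, and call_judge_llm are assumed placeholders.

def build_reward_prompt(caption: str, question: str, answer: str) -> str:
    return (
        "You are grading an answer about a video. You cannot see the video, "
        "but the following detailed caption describes it:\n"
        f"Caption: {caption}\n\n"
        f"Question: {question}\n"
        f"Answer to grade: {answer}\n\n"
        "Rate the factual accuracy of the answer against the caption on a scale "
        "of 0 (entirely hallucinated) to 5 (fully supported). Reply with the number only."
    )

def caption_based_reward(caption: str, question: str, answer: str, call_judge_llm) -> float:
    """Return a scalar reward by letting a text-only LLM judge the answer,
    using the caption as a proxy for the video frames."""
    reply = call_judge_llm(build_reward_prompt(caption, question, answer))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # unparsable judgments get the lowest score
```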
Reinforcement learning and direct preference optimization (DPO) have been effective at eliciting more accurate responses from language models, but they often fall short when applied to video because obtaining a reliable reward signal across many video frames is complex and expensive. The new reward mechanism sidesteps this hurdle by grounding the reward in video captions rather than the video frames themselves.
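For reference, DPO fine-tunes the policy directly on preference pairs without a separate reward model. In its standard formulation (not specific to this paper) the objective is

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
\]

where $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.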
To ensure the quality of these captions, the researchers built SHAREGPTVIDEO, a dataset of 900,000 detailed captions covering a wide range of video content, generated with a new prompting method and the GPT-4V model. The captions capture temporal dynamics, world knowledge, object attributes, and spatial relationships.
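A caption-collection step along these lines could be sketched as below. The frame count, prompt wording, model name, and client call are illustrative assumptions rather than the paper's exact recipe; only the general idea of prompting a vision-capable model with sampled frames is taken from the description above.

```python
# Illustrative sketch: prompting a vision-capable model (e.g. GPT-4V) to produce a
# detailed caption from sampled video frames. Model name and prompt are assumptions.
import base64
from openai import OpenAI

CAPTION_PROMPT = (
    "Describe this video in detail using the frames below. Cover the temporal "
    "order of events, relevant world knowledge, object attributes, and the "
    "spatial relationships between objects."
)

def caption_video(frame_paths: list[str], model: str = "gpt-4-vision-preview") -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    content = [{"type": "text", "text": CAPTION_PROMPT}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return resp.choices[0].message.content
```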
Using this dataset, the team trained a model called LLAVA-HOUND-DPO, which uses the video captions as proxy reward signals during preference optimization, and reported an 8.1% improvement in accuracy on video question-answering tasks over a counterpart trained with supervised fine-tuning (SFT) alone.
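One way to picture how the caption-based reward feeds preference optimization: sample several candidate answers from the SFT model, score each against the caption, and keep the highest- and lowest-scoring ones as a preference pair. The sketch below is a plausible reading of that setup; the sample count and the best-versus-worst pairing rule are assumptions.

```python
# Illustrative sketch: turning caption-based reward scores into DPO preference pairs.
# The number of samples and the best-vs-worst pairing rule are assumptions.
from typing import Callable

def make_preference_pair(question: str,
                         caption: str,
                         sample_answer: Callable[[str], str],
                         score_answer: Callable[[str, str, str], float],
                         n_samples: int = 4) -> dict:
    """Sample candidate answers, score each against the caption, and return a
    (prompt, chosen, rejected) record suitable for DPO training."""
    candidates = [sample_answer(question) for _ in range(n_samples)]
    ranked = sorted(candidates,
                    key=lambda ans: score_answer(caption, question, ans))
    rejected, chosen = ranked[0], ranked[-1]
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```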
The success of this method opens the door to more accurate responses from video language models and suggests that a cost-effective, caption-based reward system could be a useful tool in machine learning. The approach could also reduce the computational resources needed to analyze video data.
The researchers also noted a positive correlation between the reward system of their model and that of the more powerful but costly GPT-4V model. Their training pipeline went through several stages: caption pre-training, supervised fine-tuning, and DPO training.
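At its core, the DPO stage optimizes the objective shown earlier over such preference pairs. A minimal PyTorch sketch of that loss, in its standard form rather than the authors' exact training code, might look like this:

```python
# Minimal PyTorch sketch of the standard DPO loss over a batch of preference pairs.
# Log-probabilities are summed over response tokens; beta is the usual DPO temperature.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen response over the rejected one,
    relative to a frozen reference model."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```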
This research paves the way for more truthful and accurate responses from video language models while potentially lowering costs and computational effort. The results cautiously support the reward mechanism's applicability: its scores showed a moderate positive correlation with GPT-4V's reward scores, and the two systems agreed on more than 70% of preference judgments.