Researchers have been working to move Large Vision-Language Models (LVLMs) beyond their typically passive role so that they can participate more proactively in interactions. LVLMs are central to tasks that combine visual understanding with language processing, but they often produce detailed, confident answers even when a question is unclear or invalid, which can lead to biased or misleading responses.
Visual instruction tuning has become essential for building general-purpose LVLMs, teaching models to perform vision-language reasoning from zero-shot and few-shot textual instructions. Because LVLMs can still behave unpredictably, tools such as Llava-Guard have been created to check outputs for safety compliance and screen out harmful content.
In an effort to improve the conversational capabilities of these models, researchers from the University of Illinois Urbana-Champaign, the University of Southern California, and the University of California created MACAROON (self-iMaginAtion for ContrAstive pReference OptimizatiON). The method prompts LVLMs to imagine contrastive response pairs, guided by a task description and human-defined criteria. The resulting pairs are then used in conditional reinforcement learning, which standardises the training data and helps the models distinguish between good and bad responses. Initial results with MACAROON are positive, showing improved LVLM behaviours and more dynamic, proactive engagement.
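To make the two stages described above concrete, here is a minimal sketch, not the authors' code: the model first "imagines" a preferred and a dispreferred response from the human-defined criteria, and each pair is then rewritten into a single standardised format for conditional training. The prompt wording, the control tokens ("<good>" / "<bad>"), and the `generate` callable are illustrative assumptions.

```python
from typing import Callable, Dict, List


def self_imagine_pair(question: str,
                      task_description: str,
                      desired_criteria: List[str],
                      undesired_criteria: List[str],
                      generate: Callable[[str], str]) -> Dict[str, str]:
    """Ask the LVLM itself to imagine one preferred and one dispreferred response."""
    preferred = generate(
        f"{task_description}\nQuestion: {question}\n"
        f"Write a response that satisfies: {'; '.join(desired_criteria)}"
    )
    dispreferred = generate(
        f"{task_description}\nQuestion: {question}\n"
        f"Write a response that exhibits: {'; '.join(undesired_criteria)}"
    )
    return {"preferred": preferred, "dispreferred": dispreferred}


def to_conditional_examples(question: str, pair: Dict[str, str]) -> List[Dict[str, str]]:
    """Standardise the pair: prepend a control token so a single training
    objective can signal which behaviour each response represents."""
    return [
        {"prompt": f"<good> {question}", "target": pair["preferred"]},
        {"prompt": f"<bad> {question}", "target": pair["dispreferred"]},
    ]


if __name__ == "__main__":
    # Stub generator so the sketch runs without a real LVLM backend.
    fake_lvlm = lambda prompt: f"[model output for: {prompt[:40]}...]"
    question = "What colour is the second dog from the left?"
    pair = self_imagine_pair(
        question=question,
        task_description="The image shows a single cat, so the question is invalid.",
        desired_criteria=["point out the false premise", "ask for clarification"],
        undesired_criteria=["answer confidently despite the false premise"],
        generate=fake_lvlm,
    )
    for example in to_conditional_examples(question, pair):
        print(example)
```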
Creating a preference dataset with human annotations is challenging, costly, and slow. MACAROON instead reuses the construction method from PIE to produce six types of questions without labels. This makes the approach easy to scale: for each question type, human annotators only need to write a detailed description and define two sets of criteria specifying desired and undesired LVLM behaviours.
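As a rough illustration of how small the required annotation is, a per-question-type specification might look like the following. The field names, the question type, and the wording are hypothetical, not the paper's actual schema.

```python
# Hypothetical specification for one of the six question types: a short
# description plus two sets of behaviour criteria is all the human input needed.
question_type_spec = {
    "type": "false_premise",
    "description": "Questions whose premise contradicts the image content.",
    "desired_behaviours": [
        "identify and correct the false premise",
        "ask the user to clarify their intent",
    ],
    "undesired_behaviours": [
        "answer as if the premise were true",
        "fabricate visual details to fit the question",
    ],
}
```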
The PIE results show that LVLMs handle Tier I questions well, since these are straightforward to identify as invalid, but struggle with the more challenging Tier III questions, because existing LVLMs are primarily designed for single-turn responses without extensive interaction. This is where MACAROON stands out, demonstrating proactive engagement more successfully than other LVLMs while maintaining strong performance on general vision-language tasks.
However, these proactive engagement capabilities have their limits. For instance, the approach relies on single images from a high-quality dataset as visual context. Future research could explore models that plan actions over sequences of images, such as videos, to investigate this growing field in more depth.