Recent work by a team of Apple researchers has assessed the limitations of Vision-Language Models (VLMs). VLMs, including OpenAI’s GPT-4V, have improved substantially in recent years, showing impressive performance across a range of vision-language tasks. However, the researchers found a significant gap between the strong performance of Large Language Models (LLMs) on text-based reasoning tasks and the capability of VLMs on visual reasoning.
The researchers used Raven’s Progressive Matrices (RPMs) to evaluate VLMs, because RPMs demand advanced, multi-hop relational and deductive reasoning based solely on visual cues. The team dissected model behavior into three analytical categories: perception, inference, and hypothesis testing. According to their findings, perception is the main bottleneck for current VLMs. Strategies that effectively enhance LLMs’ capabilities did not transfer well to visual reasoning problems. The team also identified issues specific to how VLMs operate, including overconfidence, sensitivity to prompt design, and an inability to use in-context examples effectively.
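To make the three categories concrete, the sketch below shows one way an RPM query could be decomposed into perception, inference, and hypothesis-testing stages. It is a minimal illustration, not the paper’s exact procedure: `query_vlm(image, prompt)` is a hypothetical helper standing in for any VLM API call, and the prompt wording is assumed for the example.

```python
# Illustrative decomposition of an RPM query into the three stages the paper
# analyzes: perception, inference, and hypothesis testing.
# `query_vlm(image, prompt)` is a hypothetical callable that sends the image
# and prompt to a VLM and returns its text response.

def solve_rpm_staged(puzzle_image, candidate_labels, query_vlm):
    # Stage 1: perception -- describe each panel before any reasoning.
    description = query_vlm(
        puzzle_image,
        "Describe the shapes, counts, and attributes in each cell of this matrix.",
    )

    # Stage 2: inference -- infer the rule, grounded in the description.
    rule = query_vlm(
        puzzle_image,
        f"Panel descriptions:\n{description}\n"
        "What rule relates the cells across each row and column?",
    )

    # Stage 3: hypothesis testing -- check each candidate against the rule.
    answer = query_vlm(
        puzzle_image,
        f"Rule: {rule}\n"
        f"Which of the candidates {candidate_labels} completes the matrix "
        "under this rule? Answer with a single label.",
    )
    return answer
```

Separating the stages this way mirrors the paper’s diagnostic framing: if the stage-1 description is already wrong, later reasoning cannot recover, which is consistent with the finding that perception is the main bottleneck.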
To corroborate their findings, the researchers conducted extensive evaluations on three datasets: the Mensa IQ exam, IntelligenceTest, and RAVEN. They also investigated inference-time techniques commonly used with LLMs, such as self-consistency and in-context learning. In addition, they examined how prompt design affects model performance, suggesting structured prompts as a possible improvement.
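As a rough illustration of what self-consistency involves at inference time (not the paper’s exact setup), the sketch below samples a hypothetical `sample_vlm` helper several times at non-zero temperature and takes a majority vote over the short answers it returns.

```python
from collections import Counter

def self_consistency_answer(puzzle_image, prompt, sample_vlm, n_samples=5):
    """Sample the model several times and return the majority-vote answer.

    `sample_vlm(image, prompt)` is a hypothetical helper that queries a VLM
    with non-zero temperature and returns a short answer string (e.g. "C").
    """
    votes = [sample_vlm(puzzle_image, prompt).strip() for _ in range(n_samples)]
    answer, count = Counter(votes).most_common(1)[0]
    # Return the answer together with its vote share as a rough confidence.
    return answer, count / n_samples
```

Techniques like this reliably help LLMs on text benchmarks, which is what makes the reported result notable: according to the study, they did not yield comparable gains on visual reasoning problems.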
In conclusion, while VLMs have advanced significantly in recent years, the research reveals a gap in their ability to handle complex visual reasoning tasks. These findings highlight clear opportunities for further improvement.
The Apple team’s research highlights a path for future developments in the field of machine learning and AI. By identifying current limitations in VLMs, they have laid the groundwork for improving these models’ performance in complex visual tasks.