The ability of Multimodal Large Language Models (MLLMs) to tackle visual math problems is currently the subject of intense interest. While MLLMs have performed remarkably well in general visual scenarios, the extent to which they can genuinely understand and solve math problems that depend on diagrams remains unclear. Benchmarks such as GeoQA and MathVista have sought to bridge the gap between textual content and visual interpretation. Despite these efforts, a model that integrates textual analysis with accurate visual interpretation for mathematical reasoning has yet to be fully realized.
Researchers from CUHK MMLab and Shanghai Artificial Intelligence Laboratory aim to fill this gap with MATHVERSE, a benchmark designed to test how well MLLMs interpret the visual information in math problems. MATHVERSE presents MLLMs with math problems accompanied by diagrams, probing their multimodal analysis skills beyond textual comprehension. Evaluating several models revealed a striking observation: some models improved their accuracy by more than 5% when visual cues were removed, suggesting that they rely more on the text than on the diagram. GPT-4V stood out by interpreting both text and images proficiently.
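To make this text-only ablation concrete, here is a minimal sketch (not the authors' actual harness) of how one might score the same model on identical problems with and without the accompanying diagram and compare the two accuracies. The `MathProblem` record, the `query_model` callable, and the exact-match scoring are hypothetical placeholders introduced only for illustration.

```python
"""Sketch: comparing a multimodal model's accuracy with vs. without diagrams."""

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class MathProblem:
    question: str                # textual statement of the problem
    diagram_path: Optional[str]  # path to the accompanying figure, if any
    answer: str                  # ground-truth final answer


def accuracy(
    problems: Sequence[MathProblem],
    query_model: Callable[[str, Optional[str]], str],
    use_diagram: bool,
) -> float:
    """Fraction of problems answered correctly under one input condition."""
    correct = 0
    for p in problems:
        image = p.diagram_path if use_diagram else None
        prediction = query_model(p.question, image)
        # Naive exact-match scoring; a real benchmark would extract and
        # normalize the model's final answer before comparing.
        if prediction.strip() == p.answer.strip():
            correct += 1
    return correct / len(problems)


def visual_reliance_gap(
    problems: Sequence[MathProblem],
    query_model: Callable[[str, Optional[str]], str],
) -> float:
    """Positive gap: the model benefits from the diagram.
    Negative gap: the model does better without it."""
    with_vision = accuracy(problems, query_model, use_diagram=True)
    text_only = accuracy(problems, query_model, use_diagram=False)
    return with_vision - text_only
```

Under this framing, a negative gap corresponds to the pattern reported above, where removing visual cues improved some models' accuracy by more than 5%.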
The MATHVERSE results further show that while models such as Qwen-VL-Max and InternLM-XComposer2 performed better without visual inputs, GPT-4V displayed a genuine ability to integrate visual information and came close to human-level performance in text-only scenarios. In other words, most MLLMs perform more strongly on textual inputs, with GPT-4V as the notable exception thanks to its more effective visual comprehension.
The study therefore proposes MATHVERSE as a comprehensive tool for assessing the visual mathematical problem-solving capacity of MLLMs. The findings indicate that models could perform better if they truly exploited visual inputs, and they underline the need for more sophisticated math-specific vision encoders. This could steer future MLLM development toward better interpretation and problem-solving in scenarios that combine visual information with mathematical content.
The researchers look forward to uncovering more about the capabilities and limits of different MLLMs and to contributing to the ongoing discussion on how to improve these models. For more details about this study, readers are encouraged to consult the research paper and project page directly.