
MathVerse: A Comprehensive Visual Math Benchmark Designed to Fairly and Thoroughly Assess Multi-modal Large Language Models (MLLMs)

Multimodal Large Language Models (MLLMs) such as SPHINX and GPT-4V have made great strides in interpreting mathematical problems and diagrams, as reflected on benchmarks like GeoQA and MathVista, yet there remains a need for an evaluation that combines textual analysis with accurate visual interpretation. A research team from the CUHK MMLab and the Shanghai Artificial Intelligence Laboratory has developed a novel benchmark, called MATHVERSE, to address this challenge.

MATHVERSE presents a variety of mathematical problems alongside diagrams to appraise MLLMs’ ability to interpret visual information. The benchmark comprises 2,612 math problems, each with an accompanying diagram, and transforms each problem into six distinct versions with varying amounts of textual and visual content to comprehensively assess a model’s multimodal analysis skills. Notably, several models scored over 5% higher in accuracy when the visual input was removed entirely, indicating that these systems rely more heavily on the textual statement than on the diagram.
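To make this evaluation protocol concrete, the sketch below tallies per-version accuracy across the six problem formats and exposes the text-versus-vision gap described above. The dataset fields, the query_model wrapper, and the exact answer-matching rule are illustrative assumptions, not the authors’ released code.

```python
# Minimal sketch of a MATHVERSE-style evaluation loop (assumed interfaces).
from collections import defaultdict

VERSIONS = [
    "Text Dominant", "Text Lite", "Text Only",
    "Vision Intensive", "Vision Dominant", "Vision Only",
]

def query_model(question: str, image_path: str | None) -> str:
    """Hypothetical wrapper around an MLLM API; returns the model's answer string."""
    raise NotImplementedError

def evaluate(problems: list[dict]) -> dict[str, float]:
    """Compute per-version accuracy over records shaped like
    {"version": ..., "question": ..., "image": ..., "answer": ...}."""
    correct, total = defaultdict(int), defaultdict(int)
    for p in problems:
        # The Text Only version drops the diagram; all other versions keep it.
        image = None if p["version"] == "Text Only" else p["image"]
        prediction = query_model(p["question"], image)
        correct[p["version"]] += int(prediction.strip() == p["answer"].strip())
        total[p["version"]] += 1
    return {v: correct[v] / total[v] for v in VERSIONS if total[v]}

# A text-reliant model shows little or no drop (or even a gain) when the
# diagram is removed, e.g. accuracy["Text Only"] >= accuracy["Vision Dominant"].
```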

In contrast, GPT-4V demonstrated a more balanced proficiency across textual and visual modalities and, in the text-only setting, approached human-level performance. This discrepancy underlines a general dependence on text over visual cues among MLLMs, with GPT-4V standing out as an exception thanks to its stronger visual comprehension.

In conclusion, the research’s main contribution is the MATHVERSE benchmark. Assessing MLLMs’ problem-solving competence in visual mathematical contexts reveals that text-based reasoning currently dominates visual reasoning. This finding points to an urgent need for more advanced, math-specific vision encoders, suggesting a direction for future MLLM development.

The research was conducted by a joint team from the CUHK MMLab and the Shanghai Artificial Intelligence Laboratory and is described in the accompanying paper. The evaluated models include Qwen-VL-Max, InternLM-XComposer2, and GPT-4V, among others. A striking observation from the study is that some models performed better without visual inputs, indicating a stronger reliance on text. Overall, the study calls for improvements in models’ ability to interpret visual information, an area where many current architectures fall short; future work will need to focus on this capability to truly unlock the problem-solving competence of MLLMs.
