Large Language Models (LLMs) and their multi-modal counterparts (MLLMs), crucial in advancing artificial general intelligence (AGI), struggle with visual mathematical problems, especially those involving geometric figures and spatial relationships. Although vision-language integration and text-based mathematical problem-solving have both advanced, progress in the multi-modal mathematical domain has been limited.
A team of researchers from CUHK, Peking University, Shanghai AI Laboratory, and Oracle introduced MAVIS (MAthematical VISual instruction tuning). The framework addresses three key issues: vision encoders that embed math diagrams poorly, misalignment between the vision encoder's diagram representations and the LLM, and inaccurate problem-solving when visual elements are involved. It introduces two new datasets, MAVIS-Caption and MAVIS-Instruct, which cover various mathematical domains. A specialized model, MAVIS-7B, trained via a three-stage pipeline, has shown superior performance on evaluation benchmarks.
The team also introduced an innovative data engine to generate mathematical diagrams, addressing the scarcity of visual mathematics datasets. It covers plane geometry, analytic geometry, and functions. Different strategies for diagram construction were used for each type, and all diagrams were rendered using Matplotlib.
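To make the idea of a rule-based diagram engine concrete, here is a minimal sketch for the function category only. It is not the MAVIS engine: the class QuadraticSpec, the helpers sample_spec, make_caption, and render, and the caption wording are all hypothetical names and choices made for illustration. The only detail taken from the article is that diagrams are rendered with Matplotlib. The point it illustrates is that when the figure and the caption are generated from the same sampled parameters, the text stays mathematically consistent with the image.

```python
import random
from dataclasses import dataclass


@dataclass
class QuadraticSpec:
    """Hypothetical record describing one synthetic function diagram."""
    a: int
    b: int
    c: int

    def vertex(self):
        # Vertex of y = a*x^2 + b*x + c.
        x = -self.b / (2 * self.a)
        y = self.a * x * x + self.b * x + self.c
        return x, y


def sample_spec(rng):
    # Small integer coefficients; a is never 0, so the curve is a parabola.
    a = rng.choice([-2, -1, 1, 2])
    return QuadraticSpec(a=a, b=rng.randint(-4, 4), c=rng.randint(-5, 5))


def make_caption(spec):
    # The caption is derived from the same parameters as the figure,
    # so the description cannot drift from what is drawn.
    vx, vy = spec.vertex()
    direction = "upward" if spec.a > 0 else "downward"
    return (
        f"The diagram shows the function y = {spec.a}x^2 {spec.b:+d}x {spec.c:+d}. "
        f"The parabola opens {direction} and has its vertex at ({vx:.2f}, {vy:.2f})."
    )


def render(spec, path):
    # Matplotlib is imported lazily so sampling and captioning work without it.
    import matplotlib
    matplotlib.use("Agg")  # headless backend; no display required
    import matplotlib.pyplot as plt

    xs = [i / 10 for i in range(-50, 51)]
    ys = [spec.a * x * x + spec.b * x + spec.c for x in xs]
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.plot(xs, ys)
    ax.axhline(0, color="black", linewidth=0.8)  # x-axis
    ax.axvline(0, color="black", linewidth=0.8)  # y-axis
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(path, dpi=150)
    plt.close(fig)
```

With Matplotlib installed, calling render(spec, "diagram.png") on a spec from sample_spec(random.Random(0)) writes the image, and make_caption(spec) returns the paired description. A fuller engine would presumably need analogous samplers for plane and analytic geometry, which this sketch does not attempt.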
MAVIS-Caption, one of the two MAVIS datasets, is a large-scale collection of 588,000 diagram-caption pairs covering plane geometry, analytic geometry, and functions. The captions average around 61 words and give mathematically accurate descriptions of each diagram's content.
MAVIS-Instruct, a dataset of 834,000 visual math problems, is designed to enhance MLLMs’ visual mathematical problem-solving capabilities. By minimizing textual redundancy in the questions, the dataset pushes MLLMs to extract critical information from the visual input rather than from the text.
MAVIS-7B, the specialized model, shows superior performance across multiple mathematical benchmarks. On the MathVerse benchmark, it achieves the highest overall accuracy among open-source models. It also outperforms models like InternLM-XComposer2 and ShareGPT4V on domain-specific benchmarks such as GeoQA for plane geometry and FunctionQA for functions.
The MAVIS study presents a novel framework for mathematical visual instruction tuning of MLLMs, comprising high-quality datasets and a three-stage training pipeline. This approach produces a math-specific vision encoder, improves diagram-language alignment, and develops mathematical reasoning capabilities. MAVIS sets a new direction in visual mathematical problem-solving, providing a foundation for future advancements in artificial intelligence and education technology.