Large Vision-Language Models (LVLMs), which pair a large language model with a vision encoder, perform well on tasks involving real-world images. They struggle, however, with abstract visual content such as scientific figures, largely because their training data contains little domain-specific material. The gap is most visible in fields that demand abstract reasoning, such as physics and mathematics.
To address this issue, researchers have developed Multimodal ArXiv, an approach that draws on the arXiv repository and its large collection of scholarly preprints from many scientific fields.
The central component of this approach is ArXivCap, a dataset of scientific figures paired with informative captions. Unlike previous datasets, ArXivCap draws on real academic figures from a variety of scientific domains, preserves the structure of subfigures, and includes the titles of the original papers. The dataset consists of 6.4 million images and 3.9 million captions, sourced from 572,000 publications.
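To make the structure described above concrete, here is a minimal sketch of how one figure-caption record might be represented. The class and field names (`SubFigure`, `FigureCaptionRecord`, `paper_title`, and so on) are assumptions chosen for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SubFigure:
    """One image within a figure (hypothetical layout)."""
    image_path: str
    subcaption: Optional[str] = None  # not every subfigure has its own caption


@dataclass
class FigureCaptionRecord:
    """A figure, its main caption, and the title of the paper it came from."""
    paper_title: str
    caption: str
    subfigures: List[SubFigure] = field(default_factory=list)


def count_images_and_captions(records: List[FigureCaptionRecord]) -> Tuple[int, int]:
    """Count images (one per subfigure) and main captions (one per record)."""
    num_images = sum(len(r.subfigures) for r in records)
    num_captions = len(records)
    return num_images, num_captions
```

In this layout a single main caption can cover several images, so a collection of such records can hold more images than captions.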
In addition, a collection of 100,000 multiple-choice question-answer pairs has been generated with GPT-4V for figures in ArXivCap. Named ArXivQA, this dataset is designed to strengthen the scientific reasoning abilities of LVLMs by mimicking real-world scientific problem-solving settings.
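As an illustration of the multiple-choice format, the sketch below shows one way such a question-answer pair could be stored and turned into a text prompt for a model. The field names and the prompt wording are assumptions for this example, not the format used to build ArXivQA.

```python
import string
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceQA:
    """One question about a figure (hypothetical field names)."""
    image_path: str
    question: str
    options: List[str]
    answer: str  # label of the correct option, e.g. "B"


def format_prompt(qa: MultipleChoiceQA) -> str:
    """Render the question and lettered options as a plain-text prompt."""
    lines = [f"Question: {qa.question}"]
    for label, option in zip(string.ascii_uppercase, qa.options):
        lines.append(f"{label}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)
```

The image itself would be passed to the model alongside this text. How that is done depends on the model's interface and is left out here.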
Evaluations of the Multimodal ArXiv approach have focused on two aspects of performance: accuracy in answering questions and generative ability in captioning tasks. Training on ArXivQA yields significant gains, seen in higher accuracy on MathVista, a benchmark created to assess multimodal mathematical reasoning. This result underlines the positive impact of domain-specific training.
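Question-answering accuracy on a multiple-choice benchmark can be computed roughly as follows. This is a minimal sketch: the answer-extraction rule (take the first standalone option letter in the response) is an assumption made for illustration, not the scoring procedure of MathVista or of the original study.

```python
import re
from typing import List, Optional


def extract_choice(response: str, num_options: int = 4) -> Optional[str]:
    """Return the first standalone option letter in a model response.

    Heuristic only: a response that begins with the article "A" would be
    read as choosing option A, so real evaluations need stricter parsing.
    """
    valid = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_options]
    match = re.search(rf"\b([{valid}])\b", response)
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str], num_options: int = 4) -> float:
    """Fraction of responses whose extracted letter matches the gold label."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        extract_choice(r, num_options) == g for r, g in zip(responses, gold)
    )
    return correct / len(gold)
```

Responses from which no option letter can be extracted count as incorrect, which keeps the denominator fixed at the number of questions.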
ArXivCap has also made it possible to define four generative tasks that evaluate how well models comprehend and express scientific ideas. These tasks range from simple figure captioning to generating summaries and titles from figure-caption pairs. Even after training on ArXivCap, however, LVLMs still have difficulty interpreting and describing scientific figures accurately.
Manual error analysis shows that LVLMs still have problems with visual understanding and caption generation, including misinterpretation of visual context and inaccurate recognition of figure content. Nevertheless, this research sheds light on the progress made in the field and points to directions for future work on helping LVLMs understand scientific content more deeply.