
CharXiv: A Comprehensive Evaluation Suite for Assessing Realistic Chart Understanding in Advanced Multimodal Large Language Models

Multimodal large language models (MLLMs) combine the capabilities of natural language processing (NLP) and computer vision, enabling them to analyze visual and textual data together. They are particularly useful for interpreting complex charts in scientific, financial, and other documents, and the central challenge lies in getting these models to understand and interpret charts accurately. However, traditional benchmarks often fall short of measuring this ability and tend to overestimate MLLM skills. Diverse, realistic datasets that mirror real-world scenarios are essential for evaluating the true performance of these models and closing this gap.

The oversimplification of current benchmarks is a significant issue in MLLM research. Datasets such as FigureQA, DVQA, and ChartQA rely on synthetic charts or template-based questions, which lack the complexity and visual diversity of real figures. To truly test a model's ability to understand charts, benchmarks must capture real-world intricacies and move beyond template-based questions and uniform chart designs.

A team of researchers from Princeton University, the University of Wisconsin, and The University of Hong Kong introduced CharXiv, a comprehensive evaluation suite designed to provide a more realistic and challenging environment for assessing MLLM performance. CharXiv contains 2,323 charts drawn from arXiv papers across a range of topics and chart types, each paired with descriptive and reasoning questions that require in-depth visual and numerical analysis. The dataset spans eight major academic subjects and features complex, diverse charts built to probe model abilities.
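As a rough illustration of the data layout described above, the sketch below shows how one might represent and inspect a single CharXiv-style example in Python. The field names (`figure_path`, `subject`, `descriptive_qa`, `reasoning_qa`) and the sample values are illustrative assumptions, not the benchmark's actual schema; consult the official CharXiv release for the real format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class CharXivExample:
    """Illustrative record for one chart in a CharXiv-style benchmark.

    Field names are assumptions for demonstration only; the official
    release defines the actual schema.
    """
    figure_path: str                                            # chart image extracted from an arXiv paper
    subject: str                                                # one of the eight academic subjects
    descriptive_qa: List[QAPair] = field(default_factory=list)  # questions about what the chart shows directly
    reasoning_qa: List[QAPair] = field(default_factory=list)    # questions requiring visual/numerical reasoning

# A toy example with a hypothetical chart and questions.
example = CharXivExample(
    figure_path="figures/arxiv_2310.00001_fig3.png",
    subject="Computer Science",
    descriptive_qa=[QAPair("What is the label of the y-axis?", "Accuracy (%)")],
    reasoning_qa=[QAPair("Which method improves most between epoch 10 and 20?", "Method B")],
)

print(example.subject, len(example.descriptive_qa), len(example.reasoning_qa))
```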

The researchers evaluated 13 open-source and 11 proprietary models on CharXiv and found a significant performance gap. The strongest proprietary model, GPT-4o, achieved 47.1% accuracy on reasoning questions and 84.5% on descriptive questions. The leading open-source model, InternVL Chat V1.5, reached only 29.2% and 58.5% on reasoning and descriptive questions, respectively. Human evaluators scored much higher, at 80.5% accuracy on reasoning questions and 92.1% on descriptive questions. This gap underscores the need for further progress in the field, guided by more robust benchmarks like CharXiv.
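For concreteness, a minimal evaluation loop of the kind these comparisons imply might look like the following sketch. The `ask_model` callable is a hypothetical stand-in for whatever MLLM API is being benchmarked, and exact-match scoring is a simplification; the paper's actual grading of free-form answers is more careful than string equality.

```python
from typing import Callable, Iterable, Tuple

def accuracy(
    examples: Iterable[Tuple[str, str, str]],  # (image_path, question, gold_answer) triples
    ask_model: Callable[[str, str], str],      # hypothetical: (image_path, question) -> model answer
) -> float:
    """Exact-match accuracy over chart question-answer triples (simplified sketch)."""
    correct = total = 0
    for image_path, question, gold in examples:
        prediction = ask_model(image_path, question)
        correct += int(prediction.strip().lower() == gold.strip().lower())
        total += 1
    return correct / max(total, 1)

# Usage with a dummy model that always answers "Method B".
dummy = lambda image, question: "Method B"
data = [("figures/fig3.png", "Which method improves most?", "Method B")]
print(f"reasoning accuracy: {accuracy(data, dummy):.1%}")
```

Separate runs of such a loop over the descriptive and reasoning question sets would yield the two accuracy figures reported for each model.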

Results from CharXiv reveal both the strengths and weaknesses of current MLLMs. Descriptive skills appear fundamental to effective reasoning: models with stronger descriptive abilities tend to perform better on reasoning tasks. Yet MLLMs still struggle with simple compositional tasks such as counting labeled ticks on axes, a task that is easy for humans but challenging for these models.

CharXiv addresses these key weaknesses in existing benchmarks by offering a more realistic and challenging dataset, enabling a more accurate assessment of MLLM skills in interpreting intricate charts. The performance gaps highlighted in the study underscore the need for continued research and improvement, and CharXiv is positioned to drive future advances in MLLM capabilities toward more effective and reliable models for practical use.
