Are We Evaluating Large Vision-Language Models Correctly? This AI Research from China Presents MMStar: An Elite Vision-Critical Multi-Modal Benchmark.

Researchers have identified gaps in the evaluation methods for Large Vision-Language Models (LVLMs). First, many evaluation samples do not actually require the visual content, so a text-only model can answer them; second, unintentional data leakage during training can inflate scores. They also point out that single-task benchmarks cannot accurately assess the full range of LVLMs' multi-modal capabilities.

To remedy these issues, they present MMStar, a multi-modal benchmark designed to offer a more thorough and accurate evaluation of LVLMs. MMStar comprises 1,500 samples meticulously selected by human reviewers, covering six core capabilities and 18 detailed axes.

The development of MMStar entailed three main stages. The first was data curation, in which every selected evaluation sample had to satisfy three criteria: visual dependency, minimal data leakage, and the need for advanced multi-modal capabilities to solve. An automated pipeline performed initial sample filtering, followed by a human review to ensure that each sample met the prescribed criteria.
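To make the automated filtering step concrete, below is a minimal sketch of how such a pre-filter could work: if text-only LLMs can answer a question without seeing the image, the sample is likely not visually dependent (or its answer has leaked into training data) and is dropped before human review. This is not the authors' actual pipeline; the `Sample` structure, the `ask_llm` helper, and the majority-vote threshold are hypothetical assumptions for illustration.

```python
# Hypothetical sketch: drop samples that text-only LLMs can already answer.
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    choices: list[str]
    answer: str        # ground-truth option letter, e.g. "B"
    image_path: str

def ask_llm(prompt: str) -> str:
    """Assumed stub for a text-only LLM client; replace with a real API call."""
    raise NotImplementedError

def needs_visual(sample: Sample, n_queries: int = 3) -> bool:
    """Treat a sample as visually dependent only if text-only models fail it.

    If an LLM answers correctly WITHOUT the image, the answer is likely
    recoverable from the text alone (or memorized from training data),
    so the sample should be filtered out before human review.
    """
    prompt = (
        f"Question: {sample.question}\n"
        + "\n".join(sample.choices)
        + "\nAnswer with the option letter only."
    )
    correct = sum(ask_llm(prompt).strip() == sample.answer
                  for _ in range(n_queries))
    return correct <= n_queries // 2   # majority of text-only attempts fail

candidates: list[Sample] = []   # populate from existing benchmark pools
visually_dependent = [s for s in candidates if needs_visual(s)]
# Human reviewers then vet `visually_dependent` against the full criteria.
```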

The second stage focused on defining the core capabilities used to thoroughly evaluate the LVLMs' diverse multi-modal skills. This was done by defining six core capability dimensions, subdivided into eighteen detailed axes, informed by existing benchmarks.

Finally, the evaluation metrics were designed. Two new metrics assess, respectively, the actual performance gain a model derives from multi-modal training and the degree of data leakage introduced during that training.
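As a rough illustration of what such metrics could look like, the sketch below computes a gain score (accuracy with images minus accuracy without) and a leakage score (how much a model's image-free accuracy exceeds that of its underlying text-only language model). The function names, the example numbers, and the clamp at zero are assumptions for illustration; consult the MMStar paper for the exact definitions.

```python
def multimodal_gain(acc_with_image: float, acc_without_image: float) -> float:
    """Performance actually attributable to the visual input: how much
    better the LVLM scores when the image is provided."""
    return acc_with_image - acc_without_image

def multimodal_leakage(acc_without_image: float, acc_text_base: float) -> float:
    """Suspected leakage from multi-modal training: how much the LVLM beats
    its own text-only base LLM even when NO image is shown; clamped at zero."""
    return max(0.0, acc_without_image - acc_text_base)

# Illustrative numbers only, not results from the paper:
mg = multimodal_gain(acc_with_image=0.55, acc_without_image=0.30)
ml = multimodal_leakage(acc_without_image=0.30, acc_text_base=0.24)
print(f"multi-modal gain: {mg:.2f}, multi-modal leakage: {ml:.2f}")
```

A high gain with low leakage suggests a model genuinely benefits from its visual training, whereas a high leakage score suggests its image-free performance is propped up by memorized evaluation data.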

When various LVLMs were tested on MMStar, the best average scores were only slightly above 50%, indicating considerable room for improvement. This underlines the pressing need for rigorous, vision-dependent evaluation methods such as MMStar to further advance the capabilities of LVLMs.

This research was undertaken by a team from the University of Science and Technology of China, The Chinese University of Hong Kong, and Shanghai AI Laboratory. Their work on MMStar is expected to contribute significantly to the study and advancement of LVLMs.
