Researchers have noted gaps in the evaluation methods for Large Vision-Language Models (LVLMs). Primarily, they observe that evaluations overlook the possibility that visual content is unnecessary for answering many samples, as well as the risk of unintentional data leakage during training. They also point to the limitations of single-task benchmarks for accurately assessing multi-modal capabilities…
