Foundation large language models (LLMs) such as GPT-4, Gemini, and Claude have demonstrated remarkable capabilities, matching or surpassing human performance on many tasks. In this light, benchmarks are essential tools for identifying the strengths and weaknesses of different models, and transparent, standardized, reproducible evaluations are crucial for both language and multimodal models. However, the custom evaluation pipelines built by different model developers frequently undermine transparency and reproducibility, because they differ in data preparation, output post-processing, and metric calculation.
To address this issue, researchers from the LMMs-Lab Team and S-Lab, NTU, Singapore, have developed a unified and standardized multimodal benchmark framework called LMMS-EVAL. The framework evaluates more than ten multimodal models, amounting to nearly 30 variants in total, across over 50 tasks in diverse contexts. It also simplifies the integration of new models and datasets through a uniform interface and guarantees transparency and reproducibility of the evaluation process.
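The value of such a uniform interface is that a new model only needs to implement a small generation hook to become evaluable on every registered task, while data handling and scoring stay identical across models. The sketch below is a minimal, hypothetical illustration of that pattern in Python; the class and function names are assumptions for illustration, not the actual lmms-eval API.

```python
from abc import ABC, abstractmethod

# Hypothetical illustration of a uniform evaluation interface.
# Names here are illustrative assumptions, not the real lmms-eval API.
class MultimodalModel(ABC):
    @abstractmethod
    def generate(self, prompt: str, image_path: str | None = None) -> str:
        """Return the model's text response for a (prompt, image) pair."""

def exact_match(prediction: str, reference: str) -> float:
    # A simple, shared metric applied identically to every model.
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: MultimodalModel, dataset: list[dict]) -> float:
    """Run one model over one task's examples and report mean accuracy."""
    scores = [
        exact_match(model.generate(ex["question"], ex.get("image")), ex["answer"])
        for ex in dataset
    ]
    return sum(scores) / max(len(scores), 1)
```

Because every model plugs into the same generate-and-score loop, data preparation and metric computation cannot silently drift between models, which is the kind of consistency the framework is designed to guarantee.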
Benchmarking is difficult because of the trade-off among low evaluation cost, broad coverage, and freedom from data contamination, which the authors call the impossible triangle. To navigate it, the team also developed LMMS-EVAL LITE and LiveBench. LMMS-EVAL LITE spans multiple tasks but prunes redundant data instances, offering an affordable yet still comprehensive evaluation (a sketch of such pruning follows below). LiveBench, in contrast, provides a low-cost and broadly applicable way to run benchmarks by constructing test data from the latest information gathered from news sites and internet forums.
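The LITE idea can be pictured as coreset selection: keep a small subset of instances whose embeddings still cover the full benchmark. The snippet below is a minimal sketch under that assumption, clustering instance embeddings with k-means and keeping the example nearest each centroid; the actual selection algorithm used in LMMS-EVAL LITE may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_lite_subset(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Pick k representative instance indices: cluster the embeddings
    and keep the example closest to each cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    picked = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picked.append(members[np.argmin(dists)])
    return np.array(sorted(picked))

# Example: shrink a 2,000-instance task to 200 representative instances.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(2000, 128))  # stand-in for real instance embeddings
subset = select_lite_subset(fake_embeddings, k=200)
```

Evaluating only the selected subset cuts inference cost roughly in proportion to the reduction in instances, while the coverage-preserving selection keeps the resulting scores close to those on the full task.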
The project's main merits lie in three components. The first, LMMS-EVAL, ensures impartial and consistent comparisons among models by standardizing the evaluation process. The second, LMMS-EVAL LITE, removes redundant data, keeping costs down while preserving reliable results. The last, LiveBench, assesses the zero-shot generalization ability of models using up-to-date data from news and forum websites.
To sum up, robust benchmarks are crucial for AI progress: they reveal the strengths and weaknesses of models and guide future development. The introduction of LMMS-EVAL, LMMS-EVAL LITE, and LiveBench should help close the gaps in current evaluation frameworks and support the continuous advancement of AI. All credit for this work goes to the researchers, whose paper and GitHub repository are available for further information.