Large Language Models (LLMs) such as GPT-4, Gemini, and Claude have exhibited striking capabilities, but evaluating them is complex and calls for an integrated, transparent, standardized, and reproducible framework. Despite this need, no comprehensive evaluation framework currently exists, which has hampered progress in the area.
To address this gap, researchers from the LMMs-Lab Team and S-Lab at NTU, Singapore, developed LMMS-EVAL, a platform designed to assess multimodal models. The benchmark suite covers more than ten multimodal models with roughly 30 variants, evaluated on over 50 tasks spanning different domains, and offers a standardized assessment pipeline to ensure transparency and reproducibility.
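To give a sense of what a standardized assessment pipeline entails, the sketch below shows the general shape of one: every model exposes the same interface, every task brings its own data and metric, and a single loop produces directly comparable scores. The class and function names here are illustrative only and are not the actual LMMS-EVAL API.

```python
# Conceptual sketch of a standardized evaluation pipeline (illustrative,
# not the real LMMS-EVAL code): uniform model interface, per-task data
# and metrics, one shared evaluation loop.
from dataclasses import dataclass
from typing import Callable, Protocol


class MultimodalModel(Protocol):
    def generate(self, image_path: str, prompt: str) -> str:
        """Return the model's text answer for an image + prompt pair."""
        ...


@dataclass
class Task:
    name: str
    examples: list[dict]                 # each: {"image", "prompt", "answer"}
    metric: Callable[[str, str], float]  # (prediction, reference) -> score


def evaluate(model: MultimodalModel, tasks: list[Task]) -> dict[str, float]:
    """Run every task through the same loop so results are comparable."""
    results = {}
    for task in tasks:
        scores = [
            task.metric(model.generate(ex["image"], ex["prompt"]), ex["answer"])
            for ex in task.examples
        ]
        results[task.name] = sum(scores) / len(scores) if scores else 0.0
    return results


# Example task definition: a toy VQA task scored by exact match.
toy_vqa = Task(
    name="toy_vqa",
    examples=[{"image": "cat.jpg", "prompt": "What animal is this?", "answer": "cat"}],
    metric=lambda pred, ref: float(pred.strip().lower() == ref.strip().lower()),
)
```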
The key challenge in creating an effective benchmark is making it contamination-free, cost-effective, and comprehensive all at once, a trade-off often referred to as the “impossible triangle”. To navigate this hurdle, the team developed LMMS-EVAL LITE and LiveBench. LMMS-EVAL LITE offers an affordable yet broad evaluation by keeping a wide range of tasks while pruning unnecessary data instances (one way to do such pruning is sketched below). LiveBench maintains affordability and broad applicability while guarding against contamination by building its test data from the latest information available on the web.
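One common way to prune redundant evaluation instances is coreset selection over instance embeddings. The following is a minimal k-center greedy sketch, assuming the embeddings are already computed; both the function name and the choice of k-center greedy are illustrative assumptions, not necessarily the exact procedure used by LMMS-EVAL LITE.

```python
# Minimal k-center greedy coreset sketch: pick a small subset of evaluation
# instances whose embeddings cover the full set as evenly as possible.
# Assumes precomputed embeddings; an illustrative stand-in for LITE-style
# pruning, not LMMS-EVAL LITE's exact algorithm.
import numpy as np


def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Return indices of `budget` instances chosen to maximize coverage."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # start from a random instance
    # Distance from every instance to its nearest selected center.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))    # farthest from current coverage
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected


# Example: keep 500 of 5,000 instances based on 512-dimensional embeddings.
subset = k_center_greedy(np.random.rand(5000, 512), budget=500)
```

The intuition is that each newly selected instance is the one currently farthest from everything already kept, so the reduced set stays representative of the full benchmark rather than collapsing onto a few easy or near-duplicate examples.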
To summarize, the team’s main contributions are LMMS-EVAL, a standardized multimodal model evaluation suite; LMMS-EVAL LITE, an affordable version of the full evaluation suite; and LiveBench, a way to assess multimodal models’ ability to generalize to current events.
Essentially, robust, standardized, and transparent benchmarks are crucial for the growth of AI, helping to distinguish between models, identify their weaknesses, and guide future advancements. By introducing LMMS-EVAL, LMMS-EVAL LITE, and LiveBench, the researchers aim to bridge gaps in existing evaluation methodologies to help foster the ongoing evolution of AI.