The evaluation of artificial intelligence (AI) models, particularly large language models (LLMs), is a rapidly evolving research area, with growing emphasis on building more rigorous benchmarks that probe model abilities across complex tasks. Such evaluation matters because understanding where different systems succeed and fail guides decisions about how to improve and refine them. A key challenge, however, is that existing benchmarks such as the original Massive Multitask Language Understanding (MMLU) dataset fall short of a comprehensive assessment: they offer a limited number of answer choices and concentrate mostly on knowledge-recall questions that demand little reasoning.
To address these shortcomings, researchers from TIGER-Lab have released the MMLU-Pro dataset, a more comprehensive and rigorous benchmark for evaluating LLMs. Where the original MMLU offers only four answer choices per question, MMLU-Pro expands the options to ten, raising the difficulty of the evaluation, and it includes a larger share of reasoning-focused questions. The dataset was built by filtering the most challenging and relevant questions from the original MMLU and then augmenting the number of answer options per question with GPT-4. The added options are not random fillers but plausible distractors that require discriminative reasoning to rule out.
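For readers who want to inspect the data directly, a minimal sketch is shown below. It assumes the dataset is published on the Hugging Face Hub under the ID TIGER-Lab/MMLU-Pro and that each record exposes question, options, and answer fields; the split name and field names are assumptions rather than details confirmed by the release notes.

```python
# Minimal sketch for inspecting MMLU-Pro via the Hugging Face `datasets` library.
# Assumptions (not confirmed by the article): the Hub ID "TIGER-Lab/MMLU-Pro",
# the "test" split, and the field names "question", "options", and "answer".
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

example = dataset[0]
print(example["question"])       # question text
print(len(example["options"]))   # up to ten answer choices rather than four
print(example["answer"])         # label of the correct choice
```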
MMLU-Pro was deliberately constructed to reduce the payoff of random guessing while substantially increasing the difficulty of the evaluation: with ten options instead of four, the expected accuracy of blind guessing drops from 25% to 10%. Model performance on MMLU-Pro also differs markedly from performance on the original MMLU. GPT-4, for example, reached 71.49% accuracy on MMLU-Pro, a clear drop from its 88.7% score on the original MMLU. These results underscore the benchmark's greater difficulty and robustness, since it demands deeper reasoning and problem-solving from the models. MMLU-Pro therefore offers a more precise measure of current AI capabilities and represents a meaningful advance in LLM evaluation; by exposing where models still fall short, it helps direct future work on improving them.
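The effect of the larger option set on the guessing baseline can be made concrete with a short back-of-the-envelope calculation; the accuracy figures in the sketch below are simply the ones reported above, restated for side-by-side comparison.

```python
# Compare the random-guess baselines of MMLU (4 options) and MMLU-Pro (10 options)
# with the GPT-4 accuracies reported in the text above.
mmlu_choices, mmlu_pro_choices = 4, 10

guess_mmlu = 1 / mmlu_choices        # 25% expected accuracy from blind guessing
guess_pro = 1 / mmlu_pro_choices     # 10% expected accuracy from blind guessing

gpt4_mmlu, gpt4_pro = 0.887, 0.7149  # reported GPT-4 accuracies

print(f"Random-guess baseline: {guess_mmlu:.0%} (MMLU) vs {guess_pro:.0%} (MMLU-Pro)")
print(f"GPT-4 accuracy:        {gpt4_mmlu:.1%} (MMLU) vs {gpt4_pro:.2%} (MMLU-Pro)")
```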