The evaluation of artificial intelligence (AI) systems, particularly large language models (LLMs), has become a central focus of recent research. Existing benchmarks, such as the original Massive Multitask Language Understanding (MMLU) dataset, have been found to capture the true potential of AI systems inadequately, largely because they emphasize knowledge-recall questions and offer only a small number of answer options. The result is an incomplete picture of LLMs' reasoning and problem-solving abilities, highlighting the need for more challenging and discriminative benchmarks.
To address these limitations, researchers from TIGER-Lab have introduced MMLU-Pro, a more comprehensive benchmark designed specifically for evaluating LLMs. MMLU-Pro expands the number of answer options per question from four to ten and includes more reasoning-focused questions, significantly increasing the evaluation's complexity and breadth. Construction of the dataset involved filtering the original MMLU to retain only challenging and highly relevant questions, then augmenting those questions with plausible distractors generated by GPT-4 to raise the difficulty and realism of the evaluation.
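As a rough illustration of the new ten-option format, the following is a minimal sketch of loading the dataset and printing a single question with its choices. It assumes the data is published on the Hugging Face Hub under the TIGER-Lab/MMLU-Pro namespace and exposes question/options/answer-style fields; the exact identifier, split names, and field names may differ from this sketch.

```python
# Minimal sketch: load MMLU-Pro and inspect its ten-option question format.
# Assumes a Hugging Face Hub dataset named "TIGER-Lab/MMLU-Pro" with
# "question", "options", and "answer" style fields (assumed schema).
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

example = dataset[0]
print(example["question"])
for letter, option in zip("ABCDEFGHIJ", example["options"]):
    print(f"  {letter}. {option}")
print("Gold answer:", example["answer"])
```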
Each question in the MMLU-Pro dataset, together with its ten answer options, has been reviewed by a panel of more than ten experts. The dataset also draws its questions from high-quality STEM websites, theorem-based QA datasets, and college-level science exams, making it a robust benchmarking tool for AI. By spanning a wide range of disciplines and lowering the random-guessing baseline from 25% with four options to 10% with ten, the new dataset provides a more accurate and comprehensive assessment of LLMs' capabilities.
Performance evaluations reveal substantial differences between scores on MMLU-Pro and the original MMLU. For instance, GPT-4's accuracy dropped by 17.21% to 71.49% on MMLU-Pro, and models such as GPT-4-Turbo-0409 and Claude-3-Sonnet experienced similar declines. This demonstrates the increased difficulty and robustness of the new dataset, which demands stronger reasoning and problem-solving skills from AI models.
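To make the reported figures concrete, the sketch below shows one schematic way accuracy can be computed over a ten-option benchmark like MMLU-Pro. The `ask_model` callable is a hypothetical stand-in for whichever LLM is being evaluated, and the prompt format shown is an assumption, not the benchmark's official evaluation harness.

```python
# Schematic accuracy computation for a ten-option benchmark such as MMLU-Pro.
# `ask_model` is a hypothetical stand-in for the LLM under evaluation and is
# expected to return a single letter from A-J.
from typing import Callable

def evaluate(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions answered with the gold letter."""
    correct = 0
    for item in questions:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}"
            for letter, text in zip("ABCDEFGHIJ", item["options"])
        )
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(questions)

# Usage with a trivial baseline that always answers "A":
accuracy = evaluate(
    [{"question": "2 + 2 = ?", "options": ["4", "5"] + ["?"] * 8, "answer": "A"}],
    ask_model=lambda prompt: "A",
)
print(f"accuracy: {accuracy:.2%}")
```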
In conclusion, the introduction of the MMLU-Pro dataset marks a major development in AI evaluation. It provides a more demanding benchmark for LLMs, pushing them toward higher levels of complexity and reasoning. The significant performance drops of models such as GPT-4 show how effectively the dataset exposes room for improvement. As a more accurate measure of model capabilities, this evaluation tool offers substantial value for guiding future advancements in LLM performance.