Hugging Face has unveiled the Open LLM Leaderboard v2, a significant upgrade to its original leaderboard for ranking language models. The new version aims to address the challenges the original leaderboard faced, featuring refined evaluation methods, tougher benchmarks, and a fairer scoring system.
Over the last year, the original leaderboard became a pivotal tool in the machine learning community, accruing over 2 million unique users and actively engaging 300,000 users monthly. However, its success also brought benchmark saturation: high-performing models were approaching human-level performance on the benchmarks, reducing the benchmarks' usefulness for distinguishing between model capabilities.
The new leaderboard introduces six fresh benchmarks to counter these challenges: MMLU-Pro, a harder version of the MMLU dataset featuring ten answer choices rather than four, designed to demand more reasoning and reduce noise; GPQA, a challenging knowledge dataset built with contamination-prevention mechanisms; MuSR, which tests reasoning and long-range context parsing using algorithmically generated complex problems; MATH, a subset of high-school-level competition problems; IFEval, which tests models' ability to follow explicit instructions; and BBH, a BigBench subset of challenging tasks covering algorithmic reasoning, multistep arithmetic, and language understanding.
Scoring in the Open LLM Leaderboard v2 has also changed significantly, with the introduction of a fairer system that uses normalized scores for ranking models. Rather than averaging raw scores, each benchmark is rescaled so that random-chance performance maps to the bottom of the scale and a perfect score to the top, putting benchmarks with different difficulty levels and random baselines on a comparable footing. This ensures a more balanced comparison across benchmarks and prevents any single benchmark from unduly influencing the final ranking.
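As a rough illustration of this kind of normalization, the sketch below linearly rescales a raw accuracy so that the random-guess baseline maps to 0 and a perfect score to 100. The function name and baseline values are illustrative assumptions, not the leaderboard's actual implementation, which may differ in details.

```python
def normalize_score(raw_score: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw benchmark score so the random-guess baseline maps to 0
    and a perfect score maps to 100. Scores below the baseline clip to 0.

    Illustrative sketch only; the leaderboard's exact formula and baselines
    may differ.
    """
    normalized = (raw_score - random_baseline) / (max_score - random_baseline) * 100.0
    return max(normalized, 0.0)


# Example: a 4-choice multiple-choice benchmark has a 25% random baseline,
# so a raw accuracy of 62.5% normalizes to 50.
print(normalize_score(62.5, random_baseline=25.0))  # -> 50.0
```

Averaging these normalized values, rather than raw accuracies, keeps an easy benchmark with a high random baseline from dominating the final ranking.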
The platform’s reproducibility and interface have been enhanced through Hugging Face’s collaboration with EleutherAI. This includes a new logging system compatible with the leaderboard, support for delta weights, and the application of chat templates during evaluation. Extensive manual checks were also carried out to improve accuracy and consistency.
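For readers who want to reproduce scores locally, evaluations of this kind can be run through EleutherAI's lm-evaluation-harness. The snippet below is a minimal sketch assuming the harness's Python entry point (`lm_eval.simple_evaluate`), a leaderboard-style task name, and an example model; all of these should be checked against the installed harness version and the leaderboard's documented settings.

```python
# Minimal sketch: running one leaderboard-style benchmark locally with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The task name, model name, and keyword arguments below are assumptions;
# verify them against the version of the harness you have installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical example model
    tasks=["leaderboard_ifeval"],                      # assumed leaderboard task name
    batch_size=8,
    apply_chat_template=True,                          # evaluate with the model's chat template
)

print(results["results"])                              # per-task metric dictionary
```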
The community plays a crucial role in the new leaderboard, with the introduction of a “maintainer’s choice” category showcasing high-quality models from a variety of sources. A community voting system has also been implemented to manage the influx of model submissions.
Hugging Face’s Open LLM Leaderboard v2 is a landmark advance in the evaluation of language models. With tougher benchmarks, a more balanced scoring system, and improved reproducibility, it aims to push model development forward and provide more dependable insights into model capabilities. The Hugging Face team is looking forward to the continued improvement and innovation that evaluating models on this new, more rigorous leaderboard will encourage.