Stanford University’s AI Index Report 2024 outlines the rapid advancement of artificial intelligence (AI) and the growing inadequacy of traditional benchmarks that compare AI models with humans. The yearly report surveys AI developments and trends, noting that long-standing industry benchmarks pitting AI models against human capabilities are losing their usefulness as AI evolves. Benchmarks like Massive Multitask Language Understanding (MMLU), a traditional evaluation comprising multiple-choice tests across 57 subject categories, are approaching saturation: the Gemini Ultra model scored 90.04%, surpassing the previously set human baseline of 89.8%.
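To make the comparison concrete, here is a minimal sketch of how an MMLU-style score is computed and checked against a human baseline: accuracy over multiple-choice questions. The items and the `model_answer` placeholder are illustrative assumptions, not the report's or the benchmark's actual evaluation pipeline.

```python
# Minimal sketch of MMLU-style scoring: accuracy on multiple-choice
# questions, compared against a fixed human-expert baseline.
# The items and model_answer() below are hypothetical stand-ins.

HUMAN_BASELINE = 0.898  # the human baseline cited for MMLU

# Hypothetical items: (question, choices, index of the correct choice)
items = [
    ("What is the SI unit of force?",
     ["joule", "newton", "watt", "pascal"], 1),
    ("Which gas is most abundant in Earth's atmosphere?",
     ["oxygen", "carbon dioxide", "nitrogen", "argon"], 2),
]

def model_answer(question: str, choices: list[str]) -> int:
    """Placeholder for a real model call; returns a choice index."""
    return 1  # a real harness would query the model under test here

correct = sum(
    model_answer(q, choices) == answer
    for q, choices, answer in items
)
accuracy = correct / len(items)

print(f"Model accuracy: {accuracy:.2%}")
print("Above human baseline" if accuracy > HUMAN_BASELINE
      else "At or below human baseline")
```

Once a model's accuracy sits above the human baseline, as Gemini Ultra's does, the benchmark can no longer distinguish further progress, which is the saturation problem the report describes.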
Following such advancements, the AI Index Report 2024 argues that MMLU and similar benchmarks may need replacing. Models have exceeded human baselines on key benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to devise more challenging tests. One of these is the Graduate-Level Google-Proof Q&A Benchmark (GPQA), whose questions are written to challenge even high-performing individuals and to resist being answered with a simple web search.
However, despite these impressive advancements, AI systems still face limitations, including factual unreliability, difficulty with complex reasoning, and outputs whose interpretation remains inconclusive. Furthermore, the report highlights the difficulty of creating effective benchmarks for evaluating AI safety, citing transparency issues and the absence of standardised training data and methodologies.
Considering these limitations, the report notes an emerging shift away from traditional benchmark tests towards crowd-sourced human evaluations of AI performance. Under this approach, AI progress may increasingly be judged on subjective, nuanced qualities such as image aesthetics or prose style, rather than on purely score-driven rankings.
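Crowd-sourced evaluations of this kind are typically aggregated into a leaderboard from pairwise human preference votes, in the style of Elo ratings used by crowd-sourced arenas. The sketch below assumes that mechanism; the K-factor, starting ratings, and vote stream are illustrative choices, not figures from the report.

```python
# Minimal sketch of Elo-style rating updates from pairwise human votes,
# one way crowd-sourced preference data is turned into a model ranking.
# K, the starting ratings, and the votes are illustrative assumptions.

K = 32  # update step size; a tuning choice, not a fixed standard

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Return new ratings for A and B after one human preference vote."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    return r_a + K * (score_a - e_a), r_b + K * (e_a - score_a)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Hypothetical vote stream: True means model_a's response was preferred.
votes = [True, True, False, True]
for a_won in votes:
    ratings["model_a"], ratings["model_b"] = update(
        ratings["model_a"], ratings["model_b"], a_won
    )
print(ratings)
```

Because the ranking is driven entirely by which response humans prefer, it can capture subjective qualities like aesthetics or prose style that a fixed answer key cannot.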
The report suggests a future in which AI models could become smarter than humans, a situation that may render current benchmarking tools obsolete. As AI technologies continue evolving beyond human baseline capabilities, subjective sentiment may come to play a significant part in choosing which AI model to adopt, and new benchmarks reflecting this changing landscape will need to be established.