Scale AI’s Safety, Evaluations, and Alignment Lab (SEAL) has unveiled SEAL Leaderboards, a new ranking system designed to evaluate the growing field of large language models (LLMs) shaping AI development. Built to provide fair, systematic evaluations of AI models, the leaderboards highlight differences between increasingly capable LLMs and make their performance directly comparable.
The proliferation of LLMs has made it harder to judge each model’s relative performance and safety. In response, Scale AI, a trusted third-party evaluator for leading AI labs, created the SEAL Leaderboards to rank frontier LLMs using private, tamper-proof datasets. Verified domain experts conduct the evaluations, keeping the rankings impartial and reflective of how the models actually perform.
The initial release of the SEAL Leaderboards covers several key domains: Coding, Instruction Following, Math (based on GSM1k), and Multilinguality. Each area is evaluated with purpose-built prompt sets devised by domain specialists, yielding accurate, domain-specific performance assessments.
To protect the integrity of the evaluations, the datasets Scale uses remain confidential and unpublished. This confidentiality prevents the data from being misused or incorporated into model training sets. The SEAL Leaderboards accept entries only from developers who have had no prior access to the prompt sets, keeping the evaluation process unbiased. Scale also holds itself accountable by working with other trusted third-party organizations to review its findings.
Since its establishment in November, Scale’s SEAL research lab has been tackling recurring challenges in AI evaluation: keeping datasets uncontaminated, standardizing model comparisons, reporting reliable evaluation results, verifying evaluators’ domain expertise, and providing robust tools for understanding and improving evaluation outcomes.
These measures go a long way toward improving the quality, transparency, and standardization of AI model evaluations. Scale also unveiled a related initiative, Scale Evaluation, a platform that gives AI researchers, developers, and public-sector organizations the resources they need to analyze, understand, and improve AI models and applications.
Scale intends to refresh the SEAL Leaderboards several times a year, each time incorporating new prompt sets and newly released models so the rankings stay current with the latest AI progress. This ongoing effort underscores Scale’s commitment to raising evaluation standards across the AI community and aligns with its mission to accelerate AI development through rigorous, independent evaluation.