MLCommons, a joint venture of industry and academia, has built a collaborative platform to improve AI safety, efficiency, and accountability. The MLCommons AI Safety Working Group, established in late 2023, focuses on creating benchmarks for evaluating AI safety, tracking progress on safety, and encouraging safety improvements. Its members, who bring diverse expertise in technical AI, policy, and governance, aim to increase transparency and foster collective solutions for AI safety evaluation. Given the extensive use of AI in vital spheres, it is imperative to ensure safe and responsible AI development and to prevent potential harms.
Teaming up with organizations such as Stanford University and Google Research, MLCommons has rolled out version 0.5 of the AI Safety Benchmark. This version assesses the safety risks of AI systems that use chat-tuned language models. The benchmark follows a structured approach to its construction, which includes defining use cases, system types, language and context parameters, personas, tests, and grading criteria. It identifies 13 hazard categories and includes tests for seven of them, comprising 43,090 test items. Moreover, it provides an open platform and a downloadable tool, ModelBench, for gauging AI system safety against the benchmark, and it employs a principled grading system to evaluate AI systems' performance.
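To make this structure concrete, the sketch below shows one plausible way to represent benchmark test items (prompt, hazard category, persona) and to collect safe/unsafe labels for a system's responses. It is illustrative only: the names `TestItem`, `run_benchmark`, `system_under_test`, and `safety_evaluator` are hypothetical and do not correspond to ModelBench's actual API, and the automated safety evaluator is left abstract as an injected callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical representation of a v0.5-style test item: a prompt tied to
# one of the benchmark's hazard categories and a user persona.
@dataclass
class TestItem:
    prompt: str
    hazard_category: str   # e.g. one of the seven tested hazard categories
    persona: str           # "typical" | "malicious" | "vulnerable"


def run_benchmark(
    test_items: List[TestItem],
    system_under_test: Callable[[str], str],
    safety_evaluator: Callable[[str, str], bool],
) -> List[Dict]:
    """Collect responses from the system under test and label each one as
    safe or unsafe using an automated evaluator (left abstract here as a
    callable that takes the prompt and the response)."""
    records = []
    for item in test_items:
        response = system_under_test(item.prompt)
        unsafe = safety_evaluator(item.prompt, response)
        records.append({
            "hazard_category": item.hazard_category,
            "persona": item.persona,
            "unsafe": unsafe,
        })
    return records
```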
The study also discussed current and future threats posed by AI systems, emphasizing potential physical, emotional, financial, and reputational harms. It further underscored the challenges of AI safety evaluation, including complexity, socio-technical entanglement, and the difficulty of accessing relevant data. Different techniques for AI safety evaluation, categorized as algorithmic auditing, directed evaluation, and exploratory evaluation, were discussed, each with its own strengths and weaknesses. The study stressed the significance of benchmarks in driving AI safety research and innovation, and it highlighted several related AI safety evaluation projects, such as HarmBench, TrustLLM, and SafetyBench.
The benchmark mainly targets three audiences: model providers, model integrators, and AI standards makers and regulators. Model providers strive to create safer models and comply with legal standards, while integrators aim to compare models and ensure their safety. AI standards makers and regulators, in turn, seek to set industry standards and enable consistent safety evaluation across companies. Adhering to the release requirements is essential for maintaining the benchmark's integrity and ensuring an accurate safety assessment.
The paper also reports an evaluation of AI systems built on chat-tuned language models against the v0.5 benchmark across the tested hazard categories. A total of 13 models from 11 providers, released between March 2023 and February 2024, were tested. Responses were collected with controlled parameters to minimize variability. The results showed varying levels of risk across models, with systems graded as high, moderate, or moderate-low risk based on their percentages of unsafe responses. Differences in unsafe responses were also noted across user personas, with malicious and vulnerable user personas eliciting higher risk than typical users across hazard categories and systems.
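As a rough illustration of how unsafe-response percentages might be aggregated into per-persona or per-category risk grades, the following sketch reuses the record format from the earlier example. The grade labels mirror those mentioned above, but the numeric thresholds are invented for illustration and are not the benchmark's actual cut-offs.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative thresholds only; the real v0.5 grading is defined by the
# benchmark itself, not by these fixed cut-offs.
GRADE_THRESHOLDS = [
    (0.001, "low risk"),
    (0.01, "moderate-low risk"),
    (0.05, "moderate risk"),
    (0.10, "moderate-high risk"),
]


def unsafe_rate_by_group(records: List[Dict], key: str) -> Dict[str, float]:
    """Fraction of unsafe responses per group (e.g. grouped by 'persona'
    or by 'hazard_category'), given records like those produced by
    run_benchmark in the earlier sketch."""
    totals: Dict[str, int] = defaultdict(int)
    unsafe: Dict[str, int] = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        unsafe[group] += int(record["unsafe"])
    return {group: unsafe[group] / totals[group] for group in totals}


def grade(unsafe_fraction: float) -> str:
    """Map an unsafe-response fraction onto a coarse risk grade."""
    for threshold, label in GRADE_THRESHOLDS:
        if unsafe_fraction <= threshold:
            return label
    return "high risk"


# Example: grade a system per persona from collected records.
if __name__ == "__main__":
    records = [
        {"hazard_category": "violent_crimes", "persona": "typical", "unsafe": False},
        {"hazard_category": "violent_crimes", "persona": "malicious", "unsafe": True},
    ]
    for persona, rate in unsafe_rate_by_group(records, "persona").items():
        print(persona, f"{rate:.1%}", grade(rate))
```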
In conclusion, MLCommons' v0.5 release of the AI Safety Benchmark provides a structured approach to assessing the safety risks of AI systems. While v0.5 is not intended to be used for actual safety assessments, it lays a foundation for future iterations. With its openly available platform and the ModelBench tool, it encourages community feedback to refine the benchmark further. The paper, written by the researchers on this project, is available for review.