Artificial Intelligence (AI) safety has become an increasingly pressing concern as AI systems grow more powerful. This has prompted AI safety research aimed at addressing both imminent and long-term risks through benchmarks that measure safety properties such as fairness, reliability, and robustness. However, these benchmarks do not always isolate safety improvements; many reflect general AI capabilities instead, opening the door to “safetywashing,” where advances in capability are misrepresented as safety progress.
Common practice for assessing AI safety has gravitated toward benchmarks designed to evaluate attributes like fairness, reliability, and adversarial robustness. These include tests of model alignment with human preferences, bias evaluations, and calibration metrics. While useful, such benchmarks have a significant limitation: their scores are often highly associated with general AI capabilities. As a result, gains on these benchmarks frequently stem from general performance improvements rather than targeted safety work, allowing capability gains to be presented as safety advancements.
A team of researchers from institutions including the Center for AI Safety and the University of Pennsylvania introduced an empirical approach to distinguish genuine safety progress from general capability improvements. They conducted a meta-analysis of numerous AI safety benchmarks and measured how strongly each correlated with general capabilities across many models. The findings showed that many safety benchmarks were highly correlated with general capabilities, leaving them susceptible to safetywashing.
The study drew on a wide range of models, including models fine-tuned for specific tasks, and on diverse benchmarks covering alignment, bias, adversarial robustness, and calibration, to ensure robust results. The researchers applied Principal Component Analysis (PCA) to the models’ performance scores to derive a single general capabilities score, which let them identify which benchmarks measure safety properties independently of general capabilities.
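The core of this analysis can be illustrated with a short sketch. The snippet below is a minimal illustration using synthetic data, not the authors’ actual code: it standardizes hypothetical model scores on several capability benchmarks, extracts the first principal component as a general capabilities score, and then correlates a candidate safety benchmark with that score. All variable names and data here are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr

# Hypothetical data: rows are models, columns are general-capability
# benchmark scores (e.g. knowledge, reasoning, coding).
rng = np.random.default_rng(0)
capability_scores = rng.normal(size=(20, 5))      # 20 models x 5 benchmarks

# Scores of the same 20 models on one candidate safety benchmark.
safety_benchmark_scores = rng.normal(size=20)

# 1. Standardize and take the first principal component as a single
#    "general capabilities" score per model.
standardized = StandardScaler().fit_transform(capability_scores)
capabilities_score = PCA(n_components=1).fit_transform(standardized).ravel()

# 2. Correlate the safety benchmark with the capabilities score.
#    A high correlation suggests the benchmark mostly tracks general
#    capability; a low correlation suggests it measures something distinct.
corr, _ = spearmanr(safety_benchmark_scores, capabilities_score)
print(f"Capabilities correlation: {corr:.2f}")
```

With real benchmark results in place of the random data, this kind of correlation is what separates benchmarks that track capability from those that measure a distinct safety property.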
The study’s findings revealed that numerous AI safety benchmarks were highly correlated with general capabilities, indicating that improvements on them primarily reflect overall performance gains rather than targeted safety improvements. For example, the alignment benchmark MT-Bench showed a 78.7% capabilities correlation, whereas the MACHIAVELLI benchmark for ethical propensities showed negligible correlation with general capabilities, suggesting it captures a distinct safety attribute. This underscores the risk of safetywashing, where gains on safety benchmarks are mistaken for genuine safety progress when they are really capability enhancements.
In conclusion, the research sheds new light on the effectiveness of AI safety benchmarks by demonstrating that many are more closely tied to general capabilities than to genuine safety improvements. The proposed remedy is to develop benchmarks that measure safety properties accurately, ensuring that reported advances in AI safety reflect real improvements in reliability and trustworthiness rather than byproducts of capability gains. This work stands to influence AI safety research by providing a more rigorous framework for evaluating safety progress.