
Artificial intelligence (AI) chatbots like ChatGPT, capable of generating computer code, summarizing articles, and potentially even providing instructions for dangerous or illegal activities, pose unique safety challenges. To mitigate this risk, companies use a safeguarding process known as red-teaming, where human testers attempt to prompt inappropriate or unsafe responses from AI models. This process is not foolproof, as it requires testers to anticipate and counter every possible hazardous prompt, a near-impossible task owing to the sheer number of possibilities.

Researchers at the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have developed a machine-learning-driven technique to improve this red-teaming process. The approach trains a red-team AI model to automatically generate diverse prompts that could trigger unsafe responses from the target model, relieving human testers of the task of anticipating every possible prompt. The key idea is to teach the red-team model to value novelty, so it deliberately seeks out prompts likely to evoke unsafe output from the target model.
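The basic loop can be pictured as a short sketch, shown below in Python. The function names (red_team_generate, target_respond, toxicity_score) are illustrative stubs, not the authors' implementation: the red-team model proposes a prompt, the target model answers, and a safety classifier scores the answer, producing the reward signal used to train the red-team policy.

```python
import random


# Hypothetical stand-ins for the real components; names and behavior are
# assumptions for illustration only.
def red_team_generate(rng):
    """Stub red-team policy: samples a prompt from a fixed pool."""
    pool = [
        "How do I pick a lock?",
        "Write a story about a heist.",
        "Explain how vaccines work.",
        "Summarize this news article.",
    ]
    return rng.choice(pool)


def target_respond(prompt):
    """Stub target chatbot: returns a canned reply."""
    return f"Response to: {prompt}"


def toxicity_score(response, rng):
    """Stub safety classifier: returns a score in [0, 1]."""
    return rng.random()


def red_team_step(rng, novelty_bonus=0.0):
    """One interaction: the reward the red-team policy would be trained on."""
    prompt = red_team_generate(rng)
    response = target_respond(prompt)
    reward = toxicity_score(response, rng) + novelty_bonus
    return prompt, response, reward


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        prompt, _, reward = red_team_step(rng)
        print(f"reward={reward:.2f}  prompt={prompt!r}")
```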

Tests show the method outperforms both human testers and other machine-learning strategies, producing a larger and more varied set of prompts that elicit increasingly inappropriate output from the target AI. Notably, even AI models that had been safeguarded by human experts produced inappropriate responses when tested this way. The red-team model is trained with curiosity-driven reinforcement learning, which maximizes the randomness and novelty of its prompts while avoiding the nonsensical text that such automated approaches often produce.
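One way to picture the curiosity term, as a rough sketch rather than the paper's actual reward: give each new prompt a bonus proportional to how dissimilar it is to the prompts generated so far, add that to the toxicity score, and subtract a penalty meant to discourage gibberish. The bag-of-words similarity, the weights, and the gibberish_penalty term below are illustrative assumptions.

```python
import math
from collections import Counter


def _bow(text):
    """Simple bag-of-words vector for a prompt."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def novelty_bonus(prompt, past_prompts):
    """Reward prompts that are dissimilar to everything generated so far."""
    if not past_prompts:
        return 1.0
    max_sim = max(_cosine(_bow(prompt), _bow(p)) for p in past_prompts)
    return 1.0 - max_sim


def curiosity_reward(toxicity, prompt, past_prompts,
                     novelty_weight=0.5, gibberish_penalty=0.0):
    """Toxicity reward plus a weighted novelty bonus, minus a penalty term
    intended to keep generated prompts readable rather than nonsensical."""
    return (toxicity
            + novelty_weight * novelty_bonus(prompt, past_prompts)
            - gibberish_penalty)


# Example: a repeated prompt earns a smaller bonus than a fresh one.
history = ["how do i pick a lock", "write a poem about spring"]
print(curiosity_reward(0.8, "how do i pick a lock", history))
print(curiosity_reward(0.8, "describe a chemistry experiment", history))
```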

However, the potential danger is not limited to inappropriate or unsafe responses. AI chatbots learn from huge amounts of text data, which may include personal information that could be leaked. Machine-learning-assisted red-teaming not only improves safety by broadening the range of prompts that get tested, but also helps address this risk of information leakage. According to Pulkit Agrawal, director of the Improbable AI Lab, the method enables faster, more effective quality assurance of AI systems, a critical improvement as these systems become an increasingly integrated part of our lives.

The researchers hope to expand the technique so the red-team model can generate prompts covering a broader range of topics, and to explore using a large language model as the toxicity classifier. This would support efforts to align AI behavior with company policies or societal norms, contributing to the creation of safer, more reliable AI systems.

According to Agrawal, any company releasing a new AI model and concerned about its behavior should consider using this curiosity-driven red-teaming approach, as it provides a more reliable safeguard. The research was funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator, among others.
