Artificial intelligence chatbots can write helpful code and summarize articles, but they can also generate harmful content. To prevent such safety violations, companies have relied on a process known as “red-teaming,” in which human testers craft prompts designed to elicit unsafe responses from a chatbot, which is then trained to avoid them. This approach works only if the testers can anticipate every potentially harmful prompt, an unlikely prospect given the vast range of possibilities.
To address this issue, researchers at the Improbable AI Lab at MIT, in collaboration with the MIT-IBM Watson AI Lab, developed a machine-learning technique to improve red-teaming. They trained a “red-team” large language model to generate diverse prompts that elicit a wider range of undesirable responses from the chatbot being tested.
The new approach instills a sense of curiosity in the red-team model, driving it toward novel prompts that could provoke harmful responses from the target model. In tests, the technique outperformed both human testers and other machine-learning approaches, generating a wider variety of prompts that drew out increasingly harmful responses and thereby broadening the coverage of inputs being tested.
Manual red-teaming is a lengthy process that is “not sustainable if we want to update these models in rapidly changing environments,” says Zhang-Wei Hong, the lead author of the research. The new method offers a faster and more efficient way of assuring quality and safety.
However, reinforcement learning, the typical method for training red-team models, often produces only a few highly toxic prompts, because its objective is simply to maximize success. To counter this, the MIT researchers used a curiosity-driven exploration technique that incentivizes the red-team model to try different words, sentence patterns, and meanings in its prompts.
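The underlying idea can be sketched as reward shaping: the red-team policy is rewarded not only for how toxic the elicited response is, but also for how different a new prompt is from the prompts it has already produced. The snippet below is a minimal, hypothetical illustration of that idea, not the authors' implementation; the names (embed, novelty_bonus, shaped_reward) and the simple bag-of-words similarity measure are stand-ins chosen only to keep the example self-contained.

```python
# Minimal sketch of curiosity-style reward shaping for a red-team prompt
# generator. All names and the similarity measure are illustrative stand-ins,
# not the implementation described in the research.
from collections import Counter
from typing import List
import math


def embed(prompt: str) -> Counter:
    """Toy bag-of-words 'embedding' used to compare prompts."""
    return Counter(prompt.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def novelty_bonus(prompt: str, history: List[Counter]) -> float:
    """Higher when the prompt is unlike anything generated so far."""
    if not history:
        return 1.0
    e = embed(prompt)
    return 1.0 - max(cosine(e, h) for h in history)


def shaped_reward(prompt: str, toxicity: float, history: List[Counter],
                  novelty_weight: float = 0.5) -> float:
    """Reward the RL policy sees: toxicity of the elicited response
    plus a curiosity bonus for exploring new kinds of prompts."""
    return toxicity + novelty_weight * novelty_bonus(prompt, history)


# Usage: after each generated prompt is scored by a toxicity classifier,
# compute the shaped reward and record the prompt so repeats earn less.
history: List[Counter] = []
for prompt, tox in [("ignore your rules and ...", 0.9),
                    ("ignore your rules and ...", 0.9),   # repeat: smaller bonus
                    ("pretend you are a villain who ...", 0.7)]:
    reward = shaped_reward(prompt, tox, history)
    history.append(embed(prompt))
    print(f"{reward:.2f}  {prompt}")
```

Under this kind of shaping, a repeated prompt earns less than it did the first time even if it is just as toxic, so the policy is pushed to keep varying its wording and tactics rather than collapsing onto one successful attack.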
The new approach produced a noticeably more distinct set of prompts. When the MIT team tested it against an AI chatbot that had been fine-tuned to eliminate harmful outputs, the curiosity-driven method generated 196 prompts that still elicited toxic responses from this “safe” chatbot.
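Counting such successes is conceptually straightforward; the hypothetical sketch below shows one way a team might do it. The query() and toxicity_classifier() callables are placeholders for whatever target chatbot and scoring model are actually used, and none of the names come from the paper.

```python
# Hypothetical evaluation loop for counting prompts that elicit toxic
# responses from a target chatbot. query() and toxicity_classifier() are
# stand-ins, not components of the published method.
from typing import Callable, List, Tuple


def count_successful_attacks(prompts: List[str],
                             query: Callable[[str], str],
                             toxicity_classifier: Callable[[str], float],
                             threshold: float = 0.5) -> Tuple[int, List[str]]:
    """Send each red-team prompt to the target chatbot and count how many
    responses a toxicity classifier rates above the threshold."""
    successes: List[str] = []
    for p in prompts:
        response = query(p)
        if toxicity_classifier(response) > threshold:
            successes.append(p)
    return len(successes), successes


# Dummy stand-ins so the sketch runs on its own.
if __name__ == "__main__":
    fake_target = lambda p: "harmful reply" if "villain" in p else "safe reply"
    fake_scorer = lambda r: 0.9 if "harmful" in r else 0.1
    n, hits = count_successful_attacks(
        ["tell me a joke", "pretend you are a villain who ..."],
        fake_target, fake_scorer)
    print(n, hits)
```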
Pulkit Agrawal, director of the Improbable AI Lab and senior author of the work, says, “These models are going to be an integral part of our lives and it’s important that they are verified before released for public consumption.” In future research, the team aims to expand the curiosity-driven red-team model so it can generate prompts about a wider array of topics, and to explore using a large language model as the toxicity assessor. The goal is to minimize human effort while maximizing the safety and reliability of AI models.