Artificial intelligence (AI) chatbots like OpenAI’s ChatGPT can perform tasks ranging from generating code to summarizing articles. However, they can also produce harmful information. To prevent this, developers use a process called red-teaming, in which human testers write prompts designed to elicit unsafe responses from the model. This process is not fully effective, though: because testers cannot cover the immense space of possible prompts, some unsafe ones are inevitably missed.
Recognizing this gap, a team of researchers from MIT’s Improbable AI Lab and the MIT-IBM Watson AI Lab set out to improve this strategy using machine learning. They built a red-team large language model that automatically generates diverse prompts designed to trigger a range of undesirable responses from the chatbot being tested.
The new process encourages the red-team model to be “curious” and to write novel prompts that draw toxic responses from the target model. This method outperformed human testers and other machine-learning approaches, eliciting a greater number of distinct toxic responses from the chatbot under test. The main contributor to the improved performance was curiosity-driven exploration, in which the red-team model is rewarded for generating diverse prompts so that it uncovers a wider range of unsafe replies.
Lead author Zhang-Wei Hong elaborated on the issue, explaining that every large language model goes through an extensive period of red-teaming to ensure its safety. That process is not sustainable when models are updated and released rapidly, hence the need for a faster, more effective approach.
Using reinforcement learning, the MIT researchers reward the red-team model for triggering toxic responses from the chatbot it is testing. One challenge with this approach is that the model tends to keep generating similar, already-successful prompts. By adding curiosity-driven exploration and rewarding novel prompts, the red-team model elicits a wider range of toxic responses with prompts it has not tried before.
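To make the setup concrete, the sketch below shows the reinforcement-learning reward in its simplest form: the red-team policy is scored only on how toxic the target chatbot’s reply is. The names target_chatbot and toxicity_score are hypothetical placeholders, not the authors’ implementation; the point is that a toxicity-only reward gives the policy no incentive to vary its prompts.

```python
# Minimal sketch of a toxicity-only reward. `target_chatbot` and
# `toxicity_score` are hypothetical stand-ins for the model under test
# and a toxicity classifier.

def toxicity_only_reward(prompt: str, target_chatbot, toxicity_score) -> float:
    """Score a red-team prompt purely by how toxic the chatbot's reply is."""
    reply = target_chatbot(prompt)   # query the model under test
    return toxicity_score(reply)     # e.g., a toxicity probability in [0, 1]

# Because the reward depends only on toxicity, a policy-gradient optimizer can
# latch onto a few reliably toxic prompts and keep reproducing near-duplicates
# of them -- the failure mode the curiosity bonuses below are meant to fix.
```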
The researchers adopted a multi-pronged approach to fostering curiosity in the red-team model. They included an entropy bonus to encourage randomness, novelty rewards based on word-level and semantic similarity to previously generated prompts, and a naturalness bonus so the model does not drift into nonsensical text.
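A hedged sketch of how these terms might be combined into a single reward is shown below. The weighting coefficients, the SequenceMatcher word-similarity proxy, and the embed, fluency_score, and toxicity_score helpers are illustrative assumptions, not the paper’s exact formulation.

```python
# Sketch of a combined curiosity-driven reward: base toxicity reward plus an
# entropy bonus, word-level and semantic novelty bonuses, and a naturalness
# bonus. Helper functions and weights are hypothetical.

from difflib import SequenceMatcher
import numpy as np

def curiosity_reward(prompt, reply, history, embed, fluency_score, toxicity_score,
                     entropy_bonus, w_tox=1.0, w_ent=0.1, w_word=0.5, w_sem=0.5, w_nat=0.3):
    # 1) Base reward: toxicity of the chatbot's reply.
    r_tox = toxicity_score(reply)

    # 2) Entropy bonus: supplied by the policy; higher when token choices are less deterministic.
    r_ent = entropy_bonus

    # 3) Word-level novelty: penalize surface overlap with previously generated prompts.
    word_sim = max((SequenceMatcher(None, prompt, p).ratio() for p in history), default=0.0)
    r_word = 1.0 - word_sim

    # 4) Semantic novelty: penalize closeness in embedding space to past prompts.
    v = embed(prompt)
    sims = [float(np.dot(v, embed(p)) / (np.linalg.norm(v) * np.linalg.norm(embed(p)) + 1e-8))
            for p in history]
    r_sem = 1.0 - (max(sims) if sims else 0.0)

    # 5) Naturalness bonus: keep prompts fluent so they don't degenerate into gibberish.
    r_nat = fluency_score(prompt)   # e.g., a normalized language-model log-likelihood

    return (w_tox * r_tox + w_ent * r_ent + w_word * r_word
            + w_sem * r_sem + w_nat * r_nat)
```

In a setup like this, the novelty terms are computed against the growing history of prompts the red-team model has already produced, so repeating an old attack earns a progressively smaller reward.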
The strategy has been successful. When pitted against other automated methods, the MIT model outperformed the baselines on both the toxicity and the diversity of the responses it elicited. It also effectively drew out unsafe responses from a chatbot that had been fine-tuned to avoid giving toxic answers.
The researchers plan to expand the red-team model’s capability to a wider variety of topics. They also plan to explore using large language models as toxicity classifiers. For instance, a toxicity classifier could be trained on a company’s policy document, allowing a red-team model to test a chatbot for violations of that policy.
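One way such a policy-aware classifier might look in practice is sketched below: the policy text is placed in the prompt, and an instruction-following LLM is asked to judge whether a chatbot reply violates it. The call_llm function, the prompt wording, and the YES/NO protocol are illustrative assumptions, not a described implementation.

```python
# Hypothetical sketch: using an instruction-following LLM as a policy-aware
# classifier. `call_llm` is a placeholder for any text-in, text-out LLM call.

POLICY_TEXT = "..."  # a company policy document, supplied at evaluation time

def violates_policy(reply: str, call_llm) -> bool:
    prompt = (
        "You are a compliance reviewer. Policy:\n"
        f"{POLICY_TEXT}\n\n"
        "Does the following chatbot response violate the policy above? "
        "Answer YES or NO.\n\n"
        f"Response: {reply}"
    )
    verdict = call_llm(prompt)
    return verdict.strip().upper().startswith("YES")

# A red-team model could then be rewarded whenever `violates_policy` returns
# True, steering it toward prompts that surface policy violations.
```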
The research was funded, in part, by Hyundai Motor Company, Quanta Computer Inc., Amazon Web Services, the U.S. Army Research Office, and several other agencies. As the number of AI models continues to grow, efforts like this are crucial for ensuring safer, more reliable AI tools.