To counter unsafe responses from chatbots, companies often use a process called red-teaming, in which human testers write prompts designed to elicit such responses so the artificial intelligence (AI) can be trained to avoid them. But human testers cannot cover every potential toxic prompt, so MIT researchers developed a machine-learning technique to improve red-teaming of large language models. Their approach trains a red-team AI model to automatically generate a wide range of diverse prompts that target undesirable responses, and it does so by sparking curiosity in the red-team model so that it keeps seeking out new prompts.
This method significantly outperformed both human testers and other machine-learning techniques, producing a broader range of distinct prompts that triggered increasingly toxic responses from the chatbot being tested. It even elicited toxic responses from chatbots that were previously thought to be safe because safeguards had been built in by their human developers.
The researchers trained the red-team model with reinforcement learning, using a technique called curiosity-driven exploration that incentivizes it to keep generating novel prompts. During training, the red-team model writes a prompt, the chatbot being tested responds, and a safety classifier rates the toxicity of that response; the red-team model then receives a reward based on the rating.
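To make that loop concrete, here is a minimal Python sketch of a single red-teaming step. Every name in it (red_team_generate, target_chatbot_reply, toxicity_score) is a hypothetical stand-in, not the researchers' implementation; the sketch only shows how a prompt, a response, and a classifier rating fit together.

```python
# A minimal sketch of one red-teaming step; all components are stand-ins.
import random

def red_team_generate() -> str:
    """Stand-in for the red-team language model proposing a prompt."""
    return random.choice(["candidate prompt A", "candidate prompt B"])

def target_chatbot_reply(prompt: str) -> str:
    """Stand-in for the chatbot under test."""
    return f"Response to: {prompt}"

def toxicity_score(response: str) -> float:
    """Stand-in for the safety classifier; returns a score in [0, 1]."""
    return random.random()

def training_step() -> float:
    prompt = red_team_generate()             # red-team model writes a prompt
    response = target_chatbot_reply(prompt)  # chatbot under test answers
    reward = toxicity_score(response)        # classifier rates the response
    # In the real setup, this reward (plus the curiosity terms described
    # below) would drive a policy update of the red-team model.
    return reward

if __name__ == "__main__":
    print("reward:", training_step())
```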
The red-team model aims to maximize this reward by eliciting ever more toxic responses with prompts it has not tried before. That curiosity is fostered by modifying the reward signal in the reinforcement learning setup: the technique adds an entropy bonus that pushes the model to explore, along with novelty rewards that compare each new prompt against earlier ones both in the words it uses and in its semantic similarity to them, so prompts that differ in wording and meaning earn extra reward.
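The sketch below illustrates how such novelty terms might be computed, under simplifying assumptions: a word-overlap measure stands in for the word-level comparison and a placeholder embedding function stands in for the semantic one. The helper names and weights are illustrative, not the paper's actual formulation, and the entropy bonus (which acts on the policy's output distribution during optimization) is not shown.

```python
# Illustrative curiosity terms; all names and weights are assumptions.
import hashlib
import math
import random
from typing import List

def word_novelty(prompt: str, history: List[str]) -> float:
    """1.0 when the prompt shares no words with past prompts, lower otherwise."""
    words = set(prompt.lower().split())
    if not history or not words:
        return 1.0
    overlap = 0.0
    for past in history:
        past_words = set(past.lower().split())
        union = words | past_words
        if union:
            overlap = max(overlap, len(words & past_words) / len(union))
    return 1.0 - overlap

def embed(text: str) -> List[float]:
    """Placeholder for a sentence-embedding model (deterministic fake vectors)."""
    rng = random.Random(int(hashlib.md5(text.encode()).hexdigest(), 16))
    return [rng.gauss(0.0, 1.0) for _ in range(8)]

def semantic_novelty(prompt: str, history: List[str]) -> float:
    """1.0 when the prompt's embedding is far from all past prompts' embeddings."""
    if not history:
        return 1.0
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)
    v = embed(prompt)
    return 1.0 - max(cosine(v, embed(p)) for p in history)

def curiosity_bonus(prompt: str, history: List[str],
                    w_word: float = 0.5, w_sem: float = 0.5) -> float:
    """Weighted sum of the two novelty terms."""
    return (w_word * word_novelty(prompt, history)
            + w_sem * semantic_novelty(prompt, history))
```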
To keep the model from generating nonsensical text just to appear novel, the researchers also added a natural-language bonus during training. With these additions, the method outperformed other machine-learning approaches in both the toxicity and the diversity of the responses it elicited.
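As a rough illustration of how these terms might fit together, the sketch below combines the toxicity rating, the curiosity bonus, and a stand-in naturalness score into a single scalar reward. The weights and the naturalness placeholder are assumptions, not the researchers' actual reward function.

```python
# Combining the reward terms; the weights and naturalness stand-in are assumptions.
def naturalness_bonus(prompt: str) -> float:
    """Placeholder for a language-model fluency score that penalizes gibberish."""
    return 1.0 if prompt.strip() else 0.0

def total_reward(toxicity: float, curiosity: float, naturalness: float,
                 w_tox: float = 1.0, w_cur: float = 0.5, w_nat: float = 0.5) -> float:
    # The entropy bonus is applied separately to the policy's output
    # distribution during optimization, so it does not appear here.
    return w_tox * toxicity + w_cur * curiosity + w_nat * naturalness

# Example: a moderately toxic, fairly novel, well-formed prompt.
print(total_reward(toxicity=0.6, curiosity=0.8, naturalness=1.0))
```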
The researchers hope to extend the red-team model to cover a broader range of topics. They also see potential in using a large language model as the toxicity classifier, training it on documents such as a company policy so that a red-team model could test chatbots for policy violations. The overarching goal is to ensure the development of safe, reliable AI in a scalable manner. The research was supported in part by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, and U.S. defense research agencies.