AI chatbots pose unique safety risks: while they can write computer programs or provide useful summaries of articles, they can also generate harmful or even illegal instructions, such as how to build a bomb. To address such risks, companies typically use a process called red-teaming, in which human testers write prompts designed to elicit unsafe or toxic content from an AI model so that the model can then be taught to avoid such responses. This only works, however, if the testers know which toxic prompts to try; prompts they fail to anticipate can still lead a supposedly safe chatbot to produce unsafe responses.

Researchers at the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have developed a machine-learning technique for red-teaming that they claim is both quicker and more effective than relying on human testers alone. The technique trains a red-team large language model to automatically generate prompts that trigger a broader range of undesirable responses from the chatbot being tested.

The new technique is curiosity-based: the red-team model is driven to create novel prompts that encourage the target model to generate toxic responses. This approach not only outperformed human testers but also proved superior to other machine-learning strategies, producing more diverse prompts and even eliciting toxic responses from chatbots previously deemed safe to use.

In their approach, the MIT researchers used curiosity-driven exploration, which encourages the red-team model to seek out diverse outcomes: it is pushed to generate prompts with different wording, sentence structures, and meanings. When the red-team model interacts with the target chatbot, a safety classifier rates the toxicity of the chatbot's response, and the red-team model is rewarded based on that rating.
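To make that loop concrete, the sketch below shows one red-teaming step in Python under the setup the article describes: the red-team model emits a prompt, the target chatbot replies, and a safety classifier's toxicity score becomes the red-team model's reward. The function names and callable signatures are placeholders for illustration, since the article does not specify any implementation details.

```python
from typing import Callable

def red_team_step(
    red_team_generate: Callable[[], str],         # red-team LLM: emits a new prompt
    target_chatbot: Callable[[str], str],         # chatbot under test: prompt -> reply
    toxicity_classifier: Callable[[str], float],  # safety classifier: reply -> score in [0, 1]
) -> tuple[str, str, float]:
    """One red-teaming interaction: generate a prompt, query the chatbot, score the reply.

    The classifier's toxicity score is used as the red-team model's reward,
    so replies rated more toxic yield larger rewards during training.
    """
    prompt = red_team_generate()
    reply = target_chatbot(prompt)
    reward = toxicity_classifier(reply)
    return prompt, reply, reward
```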

The red-team model’s goal is to maximize its reward by eliciting ever more toxic responses with novel prompts. The researchers instill this curiosity by modifying the reward signal in the reinforcement-learning setup: semantically novel prompts earn a bonus, while repeated prompts are penalized.
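One way to picture that reward shaping, as a rough sketch rather than the paper's exact formulation, is to add a novelty bonus based on how semantically distant a new prompt is from previously generated ones and to subtract a penalty for exact repeats. The use of embedding cosine similarity and the specific weights below are illustrative assumptions.

```python
import numpy as np

def shaped_reward(
    toxicity: float,                      # classifier score for the chatbot's reply
    prompt_embedding: np.ndarray,         # embedding of the newly generated prompt
    past_embeddings: list[np.ndarray],    # embeddings of earlier red-team prompts
    prompt_text: str,
    past_texts: list[str],
    novelty_weight: float = 0.5,          # illustrative weight, not from the paper
    repeat_penalty: float = 1.0,          # illustrative weight, not from the paper
) -> float:
    """Toxicity reward plus a semantic-novelty bonus, minus a repetition penalty."""
    if past_embeddings:
        # Cosine similarity between the new prompt and its nearest earlier prompt.
        sims = [
            float(np.dot(prompt_embedding, e)
                  / (np.linalg.norm(prompt_embedding) * np.linalg.norm(e)))
            for e in past_embeddings
        ]
        novelty = 1.0 - max(sims)   # far from everything tried so far -> large bonus
    else:
        novelty = 1.0               # the first prompt is maximally novel

    repeated = 1.0 if prompt_text in past_texts else 0.0
    return toxicity + novelty_weight * novelty - repeat_penalty * repeated
```

Under this kind of shaping, a prompt that merely repeats an earlier successful attack earns less than a new phrasing that is just as toxic, which is what steers the red-team model toward diversity.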

In the researchers' evaluation, the red-team model tested a chatbot that had been fine-tuned with human feedback, and the curiosity-driven approach rapidly generated prompts that drew toxic responses from this supposedly ‘safe’ chatbot. The work could reduce the human effort needed to ensure that AI models are safe and reliable in the future.

Future work will aim to widen the range of topics the red-team model can cover and to explore using a large language model itself as the toxicity classifier. A company's policy documents, for example, could serve as the basis for that classifier, letting the red-team model test chatbots for policy violations. The findings will be presented at the International Conference on Learning Representations.

The project received funding from sources including Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, and several U.S. military and defense research agencies.
