
AI chatbots like ChatGPT are trained on vast amounts of text from billions of websites, so their potential output includes harmful or toxic material and even leaked personal information. To maintain safety standards, large language models typically undergo a process known as red-teaming, in which human testers write prompts designed to elicit unsafe outputs so they can be addressed. However, this process is slow and costly, can miss potential toxic outputs, and offers only limited coverage of the countless prompts a chatbot could receive.

Researchers at the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have proposed a solution that uses machine learning to automate and enhance the red-teaming process. In this technique, a large language model is trained to automatically generate diverse prompts that elicit a range of undesirable responses from the chatbot being tested. The 'red team' model is trained to be curious, rewarded for writing novel prompts that draw toxic responses from the model it is probing.

The MIT team found that their technique outperformed both human testers and other machine-learning approaches, generating more distinct prompts and drawing increasingly toxic responses from the tested chatbot. Their method expanded the coverage of inputs being tested and coaxed toxic responses even from a chatbot that had been safeguarded by human experts.

Reinforcement learning, a trial-and-error method, is ordinarily used to train a red-team model by rewarding it for generating prompts that yield toxic responses from the tested chatbot. The MIT researchers improved on this approach with 'curiosity-driven exploration', which encourages the red-team model to experiment with different words, sentence patterns, and meanings, promoting the creation of new prompts rather than the repetition of the most impactful ones.
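As a rough illustration of that standard reinforcement-learning loop, a single trial-and-error step might look like the sketch below. The model, chatbot, and classifier interfaces here are hypothetical placeholders, not the researchers' actual implementation.

```python
def red_team_step(red_team_model, target_chatbot, toxicity_classifier):
    """One trial-and-error step of standard red-teaming with reinforcement learning.

    All three arguments are assumed, duck-typed objects: a prompt-generating
    policy, the chatbot under test, and a classifier that scores toxicity.
    """
    prompt = red_team_model.generate()             # red-team model proposes a prompt
    response = target_chatbot.respond(prompt)      # chatbot under test replies
    reward = toxicity_classifier.score(response)   # higher score = more toxic response
    red_team_model.update(prompt, reward)          # reinforce prompts that elicit toxicity
    return prompt, response, reward
```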

The red-team model is rewarded for exploring unusual prompts that elicit increasingly toxic responses. To instill this curiosity, the MIT team modified the usual reward signal in reinforcement learning, adding an entropy bonus to encourage randomness and two novelty rewards that incentivize prompts dissimilar from those already generated.

To keep the model from generating meaningless, random text, a natural-language bonus is also included in the training objective. With these changes, the researchers found that their model outperformed other automated techniques on both toxicity and diversity metrics.
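One way to picture how these terms combine is the sketch below, which adds an entropy bonus, two simple novelty rewards, and a naturalness bonus to the toxicity score. The specific functional forms and weights are illustrative assumptions, not the formulas from the paper.

```python
import numpy as np

def curiosity_reward(toxicity, token_logprobs, prompt_tokens, prompt_embedding,
                     past_prompts_tokens, past_embeddings, naturalness,
                     w_ent=0.01, w_nov=1.0, w_nat=0.1):
    """Illustrative combination of the reward terms described above (weights assumed)."""
    # Entropy bonus: high when the policy's token choices are uncertain,
    # nudging the red-team model toward more random, exploratory prompts.
    entropy_bonus = -float(np.mean(token_logprobs))

    # Novelty reward 1: semantic dissimilarity, here one minus the cosine
    # similarity to the nearest previously generated prompt embedding.
    if past_embeddings:
        sims = [
            float(np.dot(prompt_embedding, e)
                  / (np.linalg.norm(prompt_embedding) * np.linalg.norm(e) + 1e-8))
            for e in past_embeddings
        ]
        semantic_novelty = 1.0 - max(sims)
    else:
        semantic_novelty = 1.0

    # Novelty reward 2: lexical dissimilarity, here a simple token-overlap
    # (Jaccard) penalty against earlier prompts.
    if past_prompts_tokens:
        overlaps = [
            len(set(prompt_tokens) & set(p)) / max(len(set(prompt_tokens) | set(p)), 1)
            for p in past_prompts_tokens
        ]
        lexical_novelty = 1.0 - max(overlaps)
    else:
        lexical_novelty = 1.0

    # Naturalness bonus (e.g., a language-model likelihood of the prompt) keeps
    # the policy from drifting into meaningless random text.
    return (toxicity
            + w_ent * entropy_bonus
            + w_nov * (semantic_novelty + lexical_novelty)
            + w_nat * naturalness)
```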

Looking ahead, the research team intends to extend their model to a broader array of topics and to explore using a large language model as the toxicity classifier. That would let users train the classifier on specific documents, such as company policies, so the red-team model could test a chatbot for violations of those policies. The work marks a significant step toward faster, more effective automated methods for ensuring the safety and reliability of AI models.
