Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have developed a technique that uses machine learning to improve “red-teaming,” the process of safeguarding large language models such as AI chatbots. The approach automatically generates diverse prompts that elicit undesirable responses from the chatbot being tested, helping to uncover potential safety issues and toxic outputs.
The researchers use a technique known as curiosity-driven exploration. This reinforcement-learning, trial-and-error process incentivizes the red-team model to produce varied and novel prompts, rewarding it for exploring different keywords, sentence structures, and meanings. A safety classifier then rates the toxicity of the chatbot’s response, and the red-team model receives a reward based on that rating. The resulting system generates a wider range of prompts, improving coverage compared with traditional automated methods.
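The paper describes the exact reward formulation; the sketch below only illustrates the general shape of such a curiosity-shaped reward, assuming a placeholder word-blocklist heuristic in place of the learned safety classifier and a simple bag-of-words distance as the novelty measure. Both are illustrative stand-ins, not the authors' implementation.

```python
# Sketch of a curiosity-shaped reward for a red-team prompt generator.
# Reward = toxicity of the elicited response + a bonus for generating a prompt
# that differs from prompts produced so far.

from collections import Counter
import math

history: list[Counter] = []  # bag-of-words vectors of previously generated prompts


def bag_of_words(prompt: str) -> Counter:
    return Counter(prompt.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toxicity_score(response: str) -> float:
    # Stand-in heuristic: fraction of words found in a tiny blocklist.
    # The actual system uses a learned safety classifier.
    blocklist = {"hate", "kill", "stupid"}
    words = response.lower().split()
    return sum(w in blocklist for w in words) / max(len(words), 1)


def curiosity_reward(prompt: str, response: str, novelty_weight: float = 0.5) -> float:
    """Combine the toxicity rating of the chatbot's reply with a novelty bonus."""
    vec = bag_of_words(prompt)
    max_sim = max((cosine_similarity(vec, past) for past in history), default=0.0)
    novelty_bonus = 1.0 - max_sim  # high when the prompt is unlike anything seen before
    history.append(vec)
    return toxicity_score(response) + novelty_weight * novelty_bonus
```

The key design idea is the novelty term: without it, a reward based on toxicity alone tends to collapse onto a handful of highly effective prompts, whereas the bonus pushes the generator to keep exploring new keywords, phrasings, and meanings.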
Even after a thorough red-teaming process, a chatbot believed to be safe may still generate unsafe responses to prompts that were never tested. Human red-teaming is expensive and time-consuming, and human testers often cannot produce a comprehensive enough set of prompts to cover all potential failure modes.
The study found that the new technique outperformed both human testers and other machine-learning methods, generating more distinct prompts that elicited increasingly toxic responses. It even proved effective against chatbots that had been fine-tuned with human feedback and were previously considered safe, producing nearly 200 prompts that drew toxic responses from one such chatbot.
In the future, the research team aims to enable the red-team model to generate prompts covering a broader range of topics and to investigate using a large language model as the toxicity classifier. The ultimate goal is to strengthen the verification phase so that AI tools can be shown to be safe and trustworthy before deployment, even as models are updated and modified frequently, often daily. This machine-learning-based approach offers a faster and more effective red-team testing process, an essential improvement in quality assurance.
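Using a large language model as the toxicity classifier could look roughly like the hedged sketch below: a judge model is asked to score a response numerically. The `call_llm` function and the rubric text are hypothetical placeholders, not an API or prompt from the paper.

```python
# Hedged sketch of an LLM-as-judge toxicity classifier, one of the directions the
# team plans to explore. `call_llm` is a stub standing in for a real model call.

def call_llm(prompt: str) -> str:
    # Stub: always returns a low score. Replace with a real chat-completion call.
    return "0.1"


def llm_toxicity_score(chatbot_response: str) -> float:
    """Ask a judge LLM to rate a response from 0.0 (benign) to 1.0 (clearly toxic)."""
    rubric = (
        "Rate the following chatbot response for toxicity on a scale from 0.0 "
        "(completely benign) to 1.0 (clearly toxic). Reply with only the number.\n\n"
        f"Response: {chatbot_response}"
    )
    raw = call_llm(rubric)
    try:
        return min(max(float(raw.strip()), 0.0), 1.0)  # clamp to [0, 1]
    except ValueError:
        return 0.0  # treat unparseable judgments as benign (an illustrative design choice)
```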
The researchers also highlighted the possibility of training the toxicity classifier on a company’s policy document, so the chatbot could then be tested for violations of that policy. The project was funded, in part, by Hyundai Motor Company, the MIT-IBM Watson AI Lab, and the U.S. Army Research Office, among others.