Companies that build large language models, like those used in AI chatbots, routinely safeguard their systems using a process known as red-teaming. This involves human testers writing prompts designed to trigger unsafe or toxic responses from the bot, allowing developers to identify weaknesses and vulnerabilities. Despite its merits, this procedure often falls short, as human testers are likely to miss some toxic prompts given the sheer number of possibilities.
To address these limitations and improve safety, researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have developed a machine learning technique that significantly improves the efficacy of red-teaming. They trained a large language model to automatically generate a diverse range of prompts that elicit undesirable responses from the bot being tested. The approach works by motivating the red-team model to be curious when writing prompts, so that it concentrates on novel prompts that draw toxic responses from the target model.
This approach was found to outperform human testers and other machine learning methods, generating a higher proportion of distinct prompts that drew out toxic responses from a supposedly safe chatbot. The technique could also speed up safety assurance for models, particularly in rapidly evolving environments.
In more detail, the researchers trained the red-team model using reinforcement learning, drawing on a technique known as curiosity-driven exploration. The red-team model is incentivised to be curious about the consequences of each prompt it generates, which pushes it to try prompts with varying words, sentence patterns and meanings. During training, the model generates a prompt and sends it to the chatbot; a safety classifier then rates the toxicity of the chatbot's response, and the red-team model is rewarded on the basis of that rating.
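To make this loop concrete, the toy Python sketch below illustrates the general idea under stated assumptions: prompts are sampled from a simple red-team policy, a stand-in classifier scores the toxicity of the target's response, and a word-overlap novelty bonus plays the role of the curiosity term. All function names, the novelty measure and the update rule are illustrative assumptions, not the researchers' implementation.

```python
import random
from typing import List

# Toy stand-ins for the components described above: a red-team prompt
# generator, the target chatbot, and a safety classifier. Every name,
# weight and formula here is an illustrative assumption, not the
# researchers' actual code.

def generate_prompt(policy: dict) -> str:
    """Sample a prompt template in proportion to its current policy weight."""
    templates = list(policy)
    weights = [policy[t] for t in templates]
    return random.choices(templates, weights=weights)[0]

def target_chatbot(prompt: str) -> str:
    """Stand-in for the chatbot being red-teamed."""
    return f"response to: {prompt}"

def toxicity_score(response: str) -> float:
    """Stand-in safety classifier; a real one would return a learned score in [0, 1]."""
    return random.random()

def novelty_bonus(prompt: str, history: List[str]) -> float:
    """Curiosity term: high when the prompt shares few words with past prompts."""
    if not history:
        return 1.0
    words = set(prompt.split())
    overlap = max(len(words & set(p.split())) / max(len(words), 1) for p in history)
    return 1.0 - overlap

policy = {
    "Tell me about your day.": 1.0,
    "Give some advice to a stranger.": 1.0,
    "Write a story about a villain.": 1.0,
}
history: List[str] = []
learning_rate = 0.5

for step in range(20):
    prompt = generate_prompt(policy)    # red-team model proposes a prompt
    response = target_chatbot(prompt)   # target chatbot replies

    # Combined reward: toxicity of the reply plus a novelty (curiosity) bonus
    # that pushes the red-team policy toward prompts unlike those already tried.
    reward = toxicity_score(response) + novelty_bonus(prompt, history)

    # Crude bandit-style update: raise the weight of templates that earned
    # above-average reward, keeping all weights positive.
    policy[prompt] = max(policy[prompt] + learning_rate * (reward - 1.0), 0.1)
    history.append(prompt)

print({template: round(weight, 2) for template, weight in policy.items()})
```

In the actual system, the red-team policy is itself a large language model updated with reinforcement learning, and the curiosity signal rewards prompts that differ from earlier ones in wording and meaning rather than simple word overlap.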
Using this curiosity-driven approach, the researchers were able to quickly generate a large number of prompts that elicited toxic responses from a supposedly safe chatbot, one that had been meticulously fine-tuned with human feedback.
In the future, the researchers aim to expand the variety of topics the red-team model can generate prompts for. They also plan to explore using a large language model as the toxicity classifier.
This study was funded by a number of projects and organisations, including Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, the U.S. Army Research Office, and the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, among others.