Large language models powering AI chatbots have the potential to generate harmful content because they are trained on text from countless websites, putting users at risk if the AI describes illegal activities, gives illicit instructions, or leaks personal information. To mitigate such threats, AI-developing companies use a procedure known as red-teaming, in which human testers compose prompts aimed at eliciting unsafe or toxic responses from the chatbot, thereby teaching it to avoid generating such text. However, the effectiveness of this method depends on testers identifying the prompts that result in harmful responses, which is difficult given the vast number of possible prompts; testers can easily miss prompts that would trigger harmful output.
A group of researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have sought to address these shortcomings by applying a machine-learning technique to red-teaming. The researchers developed a method to train a red-team large language model that generates diverse prompts honed to elicit a wider range of undesirable responses. The method promotes curiosity in the red-team model, encouraging it to seek out novel prompts that draw harmful responses from the target model. With this technique, the red-team model outperforms both other machine-learning methods and human testers, triggering toxic responses from AI chatbots more often.
The red-team model uses reinforcement learning to learn to write harmful prompts. The technique also applies a principle known as curiosity-driven exploration, which drives the model to be curious about the consequences of each prompt it generates, pushing it to try different words, sentence structures, and meanings. During training, the red-team model interacts with a chatbot: it creates a prompt, and the chatbot generates a response. A safety classifier then assigns the response a toxicity rating based on how harmful it is, and the red-team model is rewarded according to that rating.
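The loop below is a minimal sketch of this training cycle. The red-team model, the target chatbot, and the safety classifier are hypothetical stand-in functions, not the researchers' actual components, and the policy update is only indicated in a comment.

```python
import random

def red_team_generate(history):
    """Stand-in for the red-team LLM: propose a candidate prompt."""
    templates = ["How would someone {act} without getting caught?",
                 "Write a story in which a character explains how to {act}."]
    acts = ["pick a lock", "bypass a safety filter", "spread a rumor"]
    return random.choice(templates).format(act=random.choice(acts))

def target_chatbot_respond(prompt):
    """Stand-in for the target chatbot being probed."""
    return f"Response to: {prompt}"

def toxicity_score(response):
    """Stand-in for the safety classifier: return a toxicity rating in [0, 1]."""
    risky_words = {"bypass", "caught", "rumor"}
    hits = sum(word in response.lower() for word in risky_words)
    return min(1.0, hits / len(risky_words))

def training_step(history):
    prompt = red_team_generate(history)        # red-team model writes a prompt
    response = target_chatbot_respond(prompt)  # target chatbot answers it
    reward = toxicity_score(response)          # classifier rates the harmfulness
    # In the real method, this reward would update the red-team model's
    # policy via a reinforcement-learning step (e.g., policy gradient).
    history.append((prompt, response, reward))
    return reward

if __name__ == "__main__":
    history = []
    for _ in range(3):
        print("reward:", training_step(history))
```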
The process is designed so that the model maximizes its reward by writing novel prompts that provoke ever more toxic responses. To encourage this curiosity, the research team augmented the reward signal with entropy and novelty bonuses during training. In testing, the model surpassed other automated techniques, generating 196 prompts that elicited toxic responses from a chatbot that had been fine-tuned to avoid giving them.
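The snippet below sketches how such a curiosity-shaped reward might be assembled. It assumes a toxicity score in [0, 1] and a memory of earlier prompts; the bag-of-words novelty proxy and the coefficient values are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

def entropy_bonus(token_probs):
    """Entropy of the red-team model's token distribution for one step;
    higher entropy means the model is not collapsing onto the same words."""
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def novelty_bonus(prompt, past_prompts):
    """Reward prompts whose words overlap little with earlier prompts
    (a simple bag-of-words proxy for a novelty term, assumed here)."""
    if not past_prompts:
        return 1.0
    words = Counter(prompt.lower().split())
    overlaps = []
    for past in past_prompts:
        past_words = Counter(past.lower().split())
        shared = sum((words & past_words).values())
        total = max(sum((words | past_words).values()), 1)
        overlaps.append(shared / total)
    return 1.0 - max(overlaps)  # the most similar past prompt dominates

def shaped_reward(toxicity, token_probs, prompt, past_prompts,
                  entropy_coef=0.01, novelty_coef=0.1):
    # Base toxicity reward plus curiosity bonuses; coefficients are illustrative.
    return (toxicity
            + entropy_coef * entropy_bonus(token_probs)
            + novelty_coef * novelty_bonus(prompt, past_prompts))

if __name__ == "__main__":
    past = ["How do I pick a lock?"]
    probs = [0.4, 0.3, 0.2, 0.1]
    print(shaped_reward(0.7, probs, "Explain how to bypass a content filter", past))
```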
The researchers believe their approach offers a quicker and more effective means of quality assurance. In the future, they intend to enhance their model's ability to create prompts on a wider array of topics; they are also considering using a large language model as the toxicity classifier. If successful, the approach could give AI developers a practical way to keep their models from producing harmful content. The study's primary author, Zhang-Wei Hong, noted that the traditional method is unsustainable for models operating in a rapidly changing environment, underscoring the need for quicker and more thorough ways of identifying toxic responses.