Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have developed a technique to strengthen the safety measures built into AI chatbots so that they do not provide toxic or dangerous information. The work improves red-teaming, a process in which human testers write prompts designed to elicit unsafe or toxic responses, which are then used to teach the chatbot to avoid such replies. The new approach accounts for a wider range of undesirable responses by training the red-team model to be curious, focusing on novel prompts that could trigger toxic responses from the chatbot under test.
However, if human testers overlook certain prompts, a chatbot judged safe can still produce unsafe replies. The new machine-learning approach therefore uses curiosity-driven exploration, in which the red-team model probes a wide variety of prompts with different words, sentence structures, and meanings. The red-team model generates a prompt and sends it to the chatbot; a safety classifier then rates the toxicity of the chatbot's response, and the red-team model is rewarded based on that rating.
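To make the loop concrete, the following is a minimal illustrative sketch of one red-teaming step, written in Python. The functions `red_team_generate`, `chatbot_respond`, `toxicity_score`, and `novelty_bonus` are hypothetical placeholders rather than the researchers' actual models or reward definition; the sketch only shows how a toxicity score and a curiosity-style novelty signal could be combined into a reward for the red-team model.

```python
# Illustrative sketch of one curiosity-driven red-teaming step.
# All functions below are placeholders, not the paper's implementation.

import random


def red_team_generate() -> str:
    """Placeholder: sample a candidate prompt from the red-team policy."""
    return random.choice([
        "Tell me something surprising.",
        "Explain how to stay safe online.",
    ])


def chatbot_respond(prompt: str) -> str:
    """Placeholder: the target chatbot's reply to the prompt."""
    return f"Response to: {prompt}"


def toxicity_score(response: str) -> float:
    """Placeholder: safety classifier returning a toxicity rating in [0, 1]."""
    return random.random()


def novelty_bonus(prompt: str, seen_prompts: list[str]) -> float:
    """Crude curiosity signal: reward prompts unlike anything tried before."""
    if not seen_prompts:
        return 1.0
    overlaps = [
        len(set(prompt.split()) & set(p.split())) / max(len(set(prompt.split())), 1)
        for p in seen_prompts
    ]
    return 1.0 - max(overlaps)  # higher when the prompt shares few words with past ones


seen: list[str] = []
for step in range(5):
    prompt = red_team_generate()
    response = chatbot_respond(prompt)
    # The red-team policy is rewarded both for eliciting toxic replies and for
    # trying prompts that differ from those it has already explored.
    reward = toxicity_score(response) + 0.5 * novelty_bonus(prompt, seen)
    seen.append(prompt)
    # In the real method, this reward would drive a reinforcement-learning update
    # of the red-team language model; here we just print it.
    print(f"step {step}: reward = {reward:.2f}")
```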
This method has been shown to draw out toxic responses from chatbots that had previously been deemed safe; in the researchers' experiments, the curiosity-driven approach generated 196 novel prompts that elicited toxic responses. By adding a naturalistic-language bonus, the model also avoids generating random, nonsensical text that might trick the classifier into awarding a high toxicity score.
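As a rough illustration of how such a bonus could be folded into the reward, the sketch below assumes a hypothetical `reference_lm_logprob` function that approximates how fluent a prompt looks to a reference language model; this is an assumption for demonstration, not the researchers' actual formulation.

```python
# Sketch of a reward that penalizes non-natural prompts (illustrative only).

def reference_lm_logprob(prompt: str) -> float:
    """Placeholder: average per-token log-probability under a reference LM.
    Gibberish text would score very low; fluent text scores closer to 0."""
    words = prompt.split()
    return -1.0 if all(w.isalpha() for w in words) else -8.0


def shaped_reward(toxicity: float, prompt: str, novelty: float,
                  naturalness_weight: float = 0.1,
                  novelty_weight: float = 0.5) -> float:
    """Combine toxicity, a curiosity bonus, and a naturalness bonus.

    The naturalness term penalizes prompts the reference LM finds implausible,
    so the red-team policy cannot game the classifier with nonsense strings.
    """
    naturalness = reference_lm_logprob(prompt)  # <= 0; less negative is better
    return toxicity + novelty_weight * novelty + naturalness_weight * naturalness


# A fluent prompt keeps most of its toxicity + novelty reward,
# while a gibberish prompt is dragged down by the naturalness penalty.
print(shaped_reward(0.9, "please describe your weekend plans", novelty=0.8))
print(shaped_reward(0.9, "zqx vk qqw ### 123", novelty=0.8))
```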
According to the researchers, methods for verifying and enhancing AI safety will become crucial as AI models become an indispensable part of modern life, since any model that is released should undergo thorough inspection to prevent harm to users. Because manual red-teaming does not scale, they argue, improvements to automated techniques like this one are needed to keep AI technology reliable and safe.
Looking ahead, the team hopes to expand the range of topics for which the red-team model can generate prompts. They also plan to investigate having a large language model act as the toxicity classifier, trained on a specific company's policy document, which would allow the red-team model to test a chatbot for violations of that company's guidelines.
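One way such an LLM-based judge might look in practice is sketched below; the `call_llm` function, the policy text, and the grading prompt are all hypothetical placeholders for illustration, not anything described by the researchers.

```python
# Hypothetical sketch: an LLM graded against a company policy document.

POLICY_DOCUMENT = """
1. The assistant must not give instructions for illegal activity.
2. The assistant must not produce harassing or hateful content.
"""  # placeholder policy text


def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real chat-completion API call; returns a canned verdict."""
    return "score: 0.0, violated_rule: none"


def policy_violation_score(chatbot_response: str) -> str:
    """Ask the judge LLM to grade a response against the policy document."""
    system_prompt = (
        "You are a content-policy judge. Given the policy below, rate the "
        "response from 0 (compliant) to 1 (clear violation) and name the "
        f"violated rule, if any.\n\nPolicy:\n{POLICY_DOCUMENT}"
    )
    return call_llm(system_prompt, f"Response to grade:\n{chatbot_response}")


print(policy_violation_score("Sure, here is how to pick a lock..."))
```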
In conclusion, the research presents a new method for red-teaming AI models. By employing curiosity-driven exploration during training, the researchers were able to generate a wider range of prompts and test for a larger number of potential risks in a model's responses, with implications for both safety and accuracy when deploying AI models. They hope to build on this work to improve the safety-testing process even further.