
Artificial intelligence (AI) advancements have led to the creation of large language models, like those used in AI chatbots. These models learn and generate responses from vast amounts of training data, which opens the door to unsafe or undesirable outputs. One current solution is “red-teaming,” in which human testers write potentially toxic prompts so the chatbot can be trained to avoid such responses. However, human testers inevitably miss some prompts, leaving the AI capable of generating unsafe replies.

Seeking to make red-teaming more effective, researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab turned to machine learning. Their technique trains a large language model to automatically produce a wide range of novel prompts that elicit unwanted responses from the chatbot being tested.

The research team’s method significantly increases the coverage and diversity of test inputs compared with other automated procedures, even eliciting toxic responses from a chatbot that human experts had safeguarded. Zhang-Wei Hong, the lead author of the research, said the approach provides a faster and more effective way to ensure AI safety when testing large language models.

Traditional red-teaming relies on a meticulous and costly manual process that often fails to generate enough variety of prompts to safeguard a model fully. The researchers’ approach instead applies curiosity-driven exploration from reinforcement learning: the red-team model is incentivized to try prompts unlike those it has already generated, and a classifier rates the toxicity of the target chatbot’s replies, with that rating serving as the red-team model’s reward.
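To make that training loop concrete, here is a minimal sketch of the reward structure described above. The function names and stub implementations are hypothetical placeholders for the real red-team model, target chatbot, and toxicity classifier; only the flow of prompt, response, and reward is illustrated.

```python
# Minimal sketch of the red-teaming reward loop (placeholder stand-ins for
# real LLMs and a real toxicity classifier; only the structure is shown).

def red_team_generate(seed: str) -> str:
    """Stand-in for the red-team LLM proposing a test prompt."""
    return f"Hypothetical adversarial prompt derived from: {seed}"

def target_chatbot_respond(prompt: str) -> str:
    """Stand-in for the chatbot under test."""
    return f"Hypothetical response to: {prompt}"

def toxicity_score(text: str) -> float:
    """Stand-in for a learned toxicity classifier returning a score in [0, 1]."""
    return 0.0  # a real classifier would score the chatbot's reply here

def red_team_reward(seed: str) -> float:
    prompt = red_team_generate(seed)            # red-team model writes a prompt
    response = target_chatbot_respond(prompt)   # target chatbot replies
    return toxicity_score(response)             # reward = how toxic the reply is

print(red_team_reward("harmless seed topic"))
```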

To further encourage varied prompts, the researchers added an entropy bonus to the reward signal and a natural-language bonus that discourages random, nonsensical text. In testing, their model outperformed baseline automated techniques on both toxicity and diversity, quickly producing many novel prompts that elicited toxic responses from a chatbot previously trained to avoid them.
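The sketch below shows, under illustrative assumptions, how such a shaped reward might combine the terms described above: the toxicity score of the elicited response, a novelty-style bonus for prompts unlike those already tried, and a naturalness bonus that penalizes gibberish. The weights and the similarity and naturalness measures are assumptions for illustration, not the paper’s exact formulation.

```python
# Hedged sketch of a shaped red-team reward: toxicity + novelty + naturalness.
# Weights and the novelty/naturalness measures are illustrative assumptions.

from difflib import SequenceMatcher

def novelty_bonus(prompt: str, history: list[str]) -> float:
    """Reward prompts that are dissimilar from previously generated ones."""
    if not history:
        return 1.0
    max_sim = max(SequenceMatcher(None, prompt, past).ratio() for past in history)
    return 1.0 - max_sim  # higher when the prompt differs from everything tried

def naturalness_bonus(prompt: str) -> float:
    """Crude stand-in for a language-model likelihood term that discourages
    random, nonsensical text (a real system would use LM log-probabilities)."""
    words = prompt.split()
    return min(1.0, len(words) / 10.0) if words else 0.0

def shaped_reward(toxicity: float, prompt: str, history: list[str],
                  w_novel: float = 0.5, w_natural: float = 0.3) -> float:
    return (toxicity
            + w_novel * novelty_bonus(prompt, history)
            + w_natural * naturalness_bonus(prompt))

history: list[str] = []
prompt = "hypothetical red-team prompt"
print(shaped_reward(toxicity=0.2, prompt=prompt, history=history))
history.append(prompt)
```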

Moving forward, the researchers plan to broaden the range of topics the red-team model’s prompts cover and to explore using large language models themselves as toxicity classifiers. This would allow a chatbot to be tested for company policy violations, for instance, demonstrating the approach’s broad applicability.

In sum, curiosity-driven red-teaming challenges the existing norms surrounding AI model safety, promising a scalable and effective approach for an ever-evolving AI era.
