To counter unsafe responses from chatbots, companies often use a process called red-teaming, in which human testers write prompts designed to elicit such responses so the artificial intelligence (AI) can be trained to avoid them. But human testers cannot cover every potential toxic prompt, so MIT researchers developed a machine-learning technique to improve red-teaming of large language models. Their approach trains a red-team AI model to automatically generate a wide range of diverse prompts that target undesirable responses, and it does so by sparking curiosity in the red-team model so that it keeps seeking out new prompts.
This method significantly outperformed both human testers and other machine-learning techniques, producing a broader range of distinct prompts that triggered increasingly toxic responses from the chatbot being tested. It even elicited toxic responses from chatbots that were previously thought to be safe because safeguards had been built in by their human developers.
The researchers trained the red-team model with reinforcement learning, using a technique called curiosity-driven exploration that incentivizes it to keep generating novel prompts. During training, the red-team model writes a prompt, the chatbot being tested responds, and a safety classifier rates the toxicity of that response; the red-team model then receives a reward based on the rating.
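To make that loop concrete, here is a minimal Python sketch of a single red-teaming step. Every name in it (red_team_generate, target_chatbot_reply, toxicity_score) is a hypothetical stand-in, not the researchers' implementation; the sketch only shows how a prompt, a response, and a classifier rating fit together.

```python
# A minimal sketch of one red-teaming step; all components are stand-ins.
import random

def red_team_generate() -> str:
    """Stand-in for the red-team language model proposing a prompt."""
    return random.choice(["candidate prompt A", "candidate prompt B"])

def target_chatbot_reply(prompt: str) -> str:
    """Stand-in for the chatbot under test."""
    return f"Response to: {prompt}"

def toxicity_score(response: str) -> float:
    """Stand-in for the safety classifier; returns a score in [0, 1]."""
    return random.random()

def training_step() -> float:
    prompt = red_team_generate()             # red-team model writes a prompt
    response = target_chatbot_reply(prompt)  # chatbot under test answers
    reward = toxicity_score(response)        # classifier rates the response
    # In the real setup, this reward (plus the curiosity terms described
    # below) would drive a policy update of the red-team model.
    return reward

if __name__ == "__main__":
    print("reward:", training_step())
```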
The red-team model aims to maximize this reward by eliciting ever more toxic responses with prompts it has not tried before. That curiosity is fostered by modifying the reward signal in the reinforcement learning setup: the technique adds an entropy bonus that pushes the model to explore, along with novelty rewards that compare each new prompt against earlier ones both in the words it uses and in its semantic similarity to them, so prompts that differ in wording and meaning earn extra reward.
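The sketch below illustrates how such novelty terms might be computed, under simplifying assumptions: a word-overlap measure stands in for the word-level comparison and a placeholder embedding function stands in for the semantic one. The helper names and weights are illustrative, not the paper's actual formulation, and the entropy bonus (which acts on the policy's output distribution during optimization) is not shown.

```python
# Illustrative curiosity terms; all names and weights are assumptions.
import hashlib
import math
import random
from typing import List

def word_novelty(prompt: str, history: List[str]) -> float:
    """1.0 when the prompt shares no words with past prompts, lower otherwise."""
    words = set(prompt.lower().split())
    if not history or not words:
        return 1.0
    overlap = 0.0
    for past in history:
        past_words = set(past.lower().split())
        union = words | past_words
        if union:
            overlap = max(overlap, len(words & past_words) / len(union))
    return 1.0 - overlap

def embed(text: str) -> List[float]:
    """Placeholder for a sentence-embedding model (deterministic fake vectors)."""
    rng = random.Random(int(hashlib.md5(text.encode()).hexdigest(), 16))
    return [rng.gauss(0.0, 1.0) for _ in range(8)]

def semantic_novelty(prompt: str, history: List[str]) -> float:
    """1.0 when the prompt's embedding is far from all past prompts' embeddings."""
    if not history:
        return 1.0
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)
    v = embed(prompt)
    return 1.0 - max(cosine(v, embed(p)) for p in history)

def curiosity_bonus(prompt: str, history: List[str],
                    w_word: float = 0.5, w_sem: float = 0.5) -> float:
    """Weighted sum of the two novelty terms."""
    return (w_word * word_novelty(prompt, history)
            + w_sem * semantic_novelty(prompt, history))
```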
To keep the model from generating nonsensical text just to appear novel, the researchers also added a natural-language bonus during training. With these additions, the method outperformed other machine-learning approaches in both the toxicity and the diversity of the responses it elicited.
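As a rough illustration of how these terms might fit together, the sketch below combines the toxicity rating, the curiosity bonus, and a stand-in naturalness score into a single scalar reward. The weights and the naturalness placeholder are assumptions, not the researchers' actual reward function.

```python
# Combining the reward terms; the weights and naturalness stand-in are assumptions.
def naturalness_bonus(prompt: str) -> float:
    """Placeholder for a language-model fluency score that penalizes gibberish."""
    return 1.0 if prompt.strip() else 0.0

def total_reward(toxicity: float, curiosity: float, naturalness: float,
                 w_tox: float = 1.0, w_cur: float = 0.5, w_nat: float = 0.5) -> float:
    # The entropy bonus is applied separately to the policy's output
    # distribution during optimization, so it does not appear here.
    return w_tox * toxicity + w_cur * curiosity + w_nat * naturalness

# Example: a moderately toxic, fairly novel, well-formed prompt.
print(total_reward(toxicity=0.6, curiosity=0.8, naturalness=1.0))
```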
The researchers hope to extend the red-team model to cover a broader range of topics. They also see potential in using a large language model as the toxicity classifier, training it on documents such as a company policy so that a red-team model could test chatbots for policy violations. The overarching goal is to ensure the development of safe, reliable AI in a scalable manner. The research was supported in part by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, and U.S. defense research agencies.