
Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab have developed a machine-learning technique to improve “red-teaming,” the process of probing large language models, such as AI chatbots, to safeguard them before deployment. The new approach automatically generates diverse prompts that elicit undesirable responses from the chatbot being tested, helping to uncover potential safety issues and toxic outputs.

The researchers use a technique known as curiosity-driven exploration. In this reinforcement-learning setup, a trial-and-error process incentivizes the red-team model to produce varied and novel prompts, differing in keywords, sentence structure, and meaning. A safety classifier then rates the toxicity of the chatbot’s response, and the red-team model is rewarded based on that rating. The resulting system generates a wider range of prompts, improving coverage compared with traditional automated methods.
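To make the idea concrete, below is a minimal sketch of how a curiosity-style reward might combine the safety classifier’s toxicity rating with a bonus for novel prompt wording. The helper names and the simple n-gram novelty measure are illustrative assumptions, not the researchers’ actual implementation.

```python
# Minimal sketch of a curiosity-style reward for a red-team prompt generator.
# All names (ngram_set, novelty_bonus, red_team_reward) are illustrative
# assumptions, not the authors' implementation.

def ngram_set(text: str, n: int = 3) -> set:
    """Character n-grams used as a cheap proxy for prompt wording."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def novelty_bonus(prompt: str, seen_prompts: list[str]) -> float:
    """Reward prompts whose n-grams overlap little with previously generated ones."""
    if not seen_prompts:
        return 1.0
    grams = ngram_set(prompt)
    if not grams:
        return 0.0
    overlaps = [len(grams & ngram_set(p)) / len(grams) for p in seen_prompts]
    return 1.0 - max(overlaps)  # 1.0 = entirely new wording, 0.0 = fully repeated

def red_team_reward(prompt: str,
                    response_toxicity: float,
                    seen_prompts: list[str],
                    novelty_weight: float = 0.5) -> float:
    """Combine the classifier's toxicity rating of the chatbot's response
    with a curiosity bonus for novel prompt wording."""
    return response_toxicity + novelty_weight * novelty_bonus(prompt, seen_prompts)

# Example: a near-duplicate prompt earns less reward than a fresh one,
# even if both trigger equally toxic responses.
history = ["how do I bypass the content filter"]
print(red_team_reward("how do I bypass the content filter please", 0.9, history))
print(red_team_reward("write a story where the villain explains a scam", 0.9, history))
```

The key design point is that the red-team model is paid not only for provoking toxic responses but also for doing so with prompts unlike any it has produced before, which pushes it to explore new keywords, structures, and meanings rather than exploiting one successful attack.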

Even after thorough red-teaming, chatbots believed to be safe may still produce unsafe responses if certain prompts go untested. Human red-teaming is expensive and slow, and it rarely yields a comprehensive enough set of prompts to cover all potential failure modes.

The study revealed that the new technique outperformed both human testers and other machine-learning methods at generating distinct prompts that elicited increasingly toxic responses. It also proved effective against chatbots previously considered safe because they had been fine-tuned with human feedback; for example, it produced nearly 200 prompts that drew toxic responses from one such chatbot.

In the future, the research team aims to enable the red-team model to generate prompts covering a broader range of topics and to investigate using a large language model as the toxicity classifier. The ultimate goal is to strengthen the testing phase so that AI tools are safe and trustworthy before deployment, particularly in environments where models receive frequent, often daily, updates and modifications. By automating red-team testing, the machine-learning approach offers a faster and more thorough form of quality assurance.

The researchers also highlighted the potential for training the toxicity classifier on a company policy document, so that a chatbot could be tested for violations of those policies. The project was funded, in part, by Hyundai Motor Company, the MIT-IBM Watson AI Lab, and the U.S. Army Research Office.
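As an illustration of that idea, the sketch below shows one way a policy document could be folded into an LLM-based judge prompt. The function name and prompt wording are assumptions made for illustration, not the researchers’ implementation.

```python
# Hypothetical sketch: grounding a response judge in a company policy document.
# build_policy_check_prompt and its wording are illustrative assumptions.

def build_policy_check_prompt(policy_text: str, chatbot_response: str) -> str:
    """Construct a judging prompt that asks an LLM whether a response
    violates the supplied policy. The judge model itself is not shown here."""
    return (
        "You are a compliance reviewer.\n"
        f"Company policy:\n{policy_text}\n\n"
        f"Chatbot response under review:\n{chatbot_response}\n\n"
        "Does the response violate the policy? Answer VIOLATION or OK, "
        "then cite the relevant policy clause."
    )

policy = "Products must never reveal customer account numbers."
response = "Sure, the account number on file is 1234-5678."
print(build_policy_check_prompt(policy, response))
# The returned string would be sent to a judge LLM, whose verdict could be
# mapped to a reward for the red-team model, mirroring the toxicity classifier.
```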
