Large language models (LLMs) have significantly improved natural language understanding and are applied across many domains. However, they can be highly sensitive to the exact wording of their input prompts, which has prompted research into understanding and exploiting this behavior. One line of work designs prompts for zero-shot and in-context learning. A representative method, AutoPrompt, automatically identifies task-specific trigger tokens for tasks such as zero-shot text classification and fact retrieval, using gradient-based scoring to select tokens that most reduce a given task loss.
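A minimal sketch of the gradient-based token-scoring idea behind AutoPrompt is shown below, written against a generic Hugging Face causal LM. The model choice, the single-slot scoring step, and the helper name are illustrative assumptions rather than the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM with an accessible embedding matrix works.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

def score_candidate_tokens(prompt_ids, trigger_pos, target_ids, top_k=10):
    """Rank vocabulary tokens for one trigger slot by a first-order
    (gradient x embedding) approximation of the change in task loss,
    in the spirit of AutoPrompt-style gradient-based search."""
    embed_matrix = model.get_input_embeddings().weight                     # (vocab, dim)
    inputs_embeds = embed_matrix[prompt_ids].unsqueeze(0).clone().detach()
    inputs_embeds.requires_grad_(True)

    # Task loss: negative log-likelihood of the desired target continuation,
    # assumed here to occupy the final positions of the prompt.
    labels = torch.full_like(prompt_ids.unsqueeze(0), -100)
    labels[0, -target_ids.numel():] = target_ids
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    # A more negative (embedding . gradient) suggests a larger expected loss decrease.
    grad_at_slot = inputs_embeds.grad[0, trigger_pos]                      # (dim,)
    scores = -(embed_matrix @ grad_at_slot)                                # (vocab,)
    return scores.topk(top_k).indices                                      # candidate token ids
```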
Despite their capabilities, LLMs can be induced to generate inappropriate or harmful content through adversarial prompts. Such prompts can be crafted manually, which is slow and labor-intensive, or generated automatically; however, existing automated methods typically rely on gradient information from the target LLM and produce unnatural token sequences that a simple perplexity filter can often detect.
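As an illustration of the perplexity-filter defense mentioned above, the sketch below scores a prompt with an off-the-shelf language model and flags it when its perplexity exceeds a threshold. The scoring model and the threshold value are assumptions chosen for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices: any small causal LM and a task-dependent threshold work.
scorer = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = scorer(ids, labels=ids).loss          # mean negative log-likelihood
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity is far above that of natural text.
    Gibberish suffixes from gradient-based attacks tend to score very high,
    whereas human-readable suffixes are more likely to pass unnoticed."""
    return perplexity(prompt) > threshold
```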
A promising new method to address this has been introduced by artificial intelligence (AI) researchers from Meta and the Max-Planck-Institute for Intelligent Systems. They have developed AdvPrompter, an LLM designed to generate human-readable adversarial prompts. The algorithm used to train AdvPrompter, AdvPrompterTrain, does not require access to the target LLM’s gradients. The trained AdvPrompter generates adversarial suffixes that subtly modify, or ‘veil’, the input instruction while preserving its original meaning, thereby eliciting unwanted responses from the target LLM.
This novel method offers several key advantages:
First, the adversarial prompts generated by AdvPrompter are human-readable. When tested against multiple open-source LLMs, the method achieved a high attack success rate compared with other approaches.
Second, the trained AdvPrompter generates adversarial suffixes through ordinary next-token prediction. This differs from other methods, which must solve a new optimization problem for each suffix they produce (see the sketch after the third point below).
Third, the suffixes generated by AdvPrompter are sampled stochastically, which allows users to rapidly produce a diverse set of adversarial prompts for a single instruction, potentially improving attack performance.
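The second and third points can be made concrete with a small sketch: once a suffix generator is trained, producing many diverse adversarial prompts reduces to plain sampled next-token generation. The checkpoint path, suffix length, and decoding parameters below are placeholders, not values from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: a causal LM fine-tuned to emit adversarial suffixes (AdvPrompter-style).
generator = AutoModelForCausalLM.from_pretrained("path/to/suffix-generator-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/suffix-generator-checkpoint")

def sample_adversarial_prompts(instruction: str, n: int = 5) -> list[str]:
    """Generate `n` candidate suffixes by next-token sampling and append each
    to the instruction. No per-suffix optimization and no target-LLM gradients
    are needed at generation time; diversity comes from stochastic decoding."""
    ids = tokenizer(instruction, return_tensors="pt").input_ids
    outputs = generator.generate(
        ids,
        max_new_tokens=30,           # suffix length budget (illustrative)
        do_sample=True,              # stochastic decoding -> diverse suffixes
        temperature=0.9,
        top_p=0.95,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    suffixes = tokenizer.batch_decode(outputs[:, ids.shape[1]:], skip_special_tokens=True)
    return [instruction + suffix for suffix in suffixes]
```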
Consequently, the research represents a significant step forward in the red-teaming of LLMs. The researchers used the AdvPrompterTrain algorithm to train AdvPrompter, which generates human-readable adversarial prompts. They also developed a new algorithm, AdvPrompterOpt, for automatically generating adversarial suffixes that are used to fine-tune AdvPrompter’s predictions. Future work will provide a detailed analysis of safety fine-tuning on data generated automatically by AdvPrompter, thereby improving the robustness of the target LLM.