Large language models (LLMs) have significantly improved natural language understanding and are applied across many domains. However, they can be highly sensitive to the exact wording of their input prompts, which has prompted research into understanding and exploiting this behavior. One line of work designs prompts for zero-shot and in-context learning. A representative method, AutoPrompt, automatically identifies task-specific trigger tokens for tasks such as zero-shot text classification and fact retrieval, using gradient-based scoring to select tokens that most reduce a given task loss.
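A minimal sketch of the gradient-based token-scoring idea behind AutoPrompt is shown below, written against a generic Hugging Face causal LM. The model choice, the single-slot scoring step, and the helper name are illustrative assumptions rather than the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM with an accessible embedding matrix works.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

def score_candidate_tokens(prompt_ids, trigger_pos, target_ids, top_k=10):
    """Rank vocabulary tokens for one trigger slot by a first-order
    (gradient x embedding) approximation of the change in task loss,
    in the spirit of AutoPrompt-style gradient-based search."""
    embed_matrix = model.get_input_embeddings().weight                     # (vocab, dim)
    inputs_embeds = embed_matrix[prompt_ids].unsqueeze(0).clone().detach()
    inputs_embeds.requires_grad_(True)

    # Task loss: negative log-likelihood of the desired target continuation,
    # assumed here to occupy the final positions of the prompt.
    labels = torch.full_like(prompt_ids.unsqueeze(0), -100)
    labels[0, -target_ids.numel():] = target_ids
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    # A more negative (embedding . gradient) suggests a larger expected loss decrease.
    grad_at_slot = inputs_embeds.grad[0, trigger_pos]                      # (dim,)
    scores = -(embed_matrix @ grad_at_slot)                                # (vocab,)
    return scores.topk(top_k).indices                                      # candidate token ids
```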
Despite their capabilities, LLMs can be induced to generate inappropriate or harmful content through adversarial prompts. Such prompts can be crafted manually, which is slow and labor-intensive, or generated automatically; however, existing automated methods typically rely on gradient information from the target LLM and produce unnatural token sequences that a simple perplexity filter can often detect.
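As an illustration of the perplexity-filter defense mentioned above, the sketch below scores a prompt with an off-the-shelf language model and flags it when its perplexity exceeds a threshold. The scoring model and the threshold value are assumptions chosen for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices: any small causal LM and a task-dependent threshold work.
scorer = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = scorer(ids, labels=ids).loss          # mean negative log-likelihood
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity is far above that of natural text.
    Gibberish suffixes from gradient-based attacks tend to score very high,
    whereas human-readable suffixes are more likely to pass unnoticed."""
    return perplexity(prompt) > threshold
```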
A promising new method to address this has been introduced by artificial intelligence (AI) researchers from Meta and the Max-Planck-Institute for Intelligent Systems. They have developed AdvPrompter, an LLM designed to generate human-readable adversarial prompts. The algorithm used to train AdvPrompter, AdvPrompterTrain, does not require access to the target LLM’s gradients. The trained AdvPrompter generates adversarial suffixes that subtly modify, or ‘veil’, the input instruction while preserving its original meaning, thereby eliciting unwanted responses from the target LLM.
This novel method offers several key advantages:
First, the adversarial prompts generated by AdvPrompter are human-readable. When tested against multiple open-source LLMs, the method achieved a high attack success rate compared with other approaches.
Second, the trained AdvPrompter generates adversarial suffixes through ordinary next-token prediction. This differs from other methods, which must solve a new optimization problem for each suffix they produce (see the sketch after the third point below).
Third, the suffixes generated by AdvPrompter are sampled stochastically, which allows users to rapidly produce a diverse set of adversarial prompts for a single instruction, potentially improving attack performance.
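The second and third points can be made concrete with a small sketch: once a suffix generator is trained, producing many diverse adversarial prompts reduces to plain sampled next-token generation. The checkpoint path, suffix length, and decoding parameters below are placeholders, not values from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: a causal LM fine-tuned to emit adversarial suffixes (AdvPrompter-style).
generator = AutoModelForCausalLM.from_pretrained("path/to/suffix-generator-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/suffix-generator-checkpoint")

def sample_adversarial_prompts(instruction: str, n: int = 5) -> list[str]:
    """Generate `n` candidate suffixes by next-token sampling and append each
    to the instruction. No per-suffix optimization and no target-LLM gradients
    are needed at generation time; diversity comes from stochastic decoding."""
    ids = tokenizer(instruction, return_tensors="pt").input_ids
    outputs = generator.generate(
        ids,
        max_new_tokens=30,           # suffix length budget (illustrative)
        do_sample=True,              # stochastic decoding -> diverse suffixes
        temperature=0.9,
        top_p=0.95,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    suffixes = tokenizer.batch_decode(outputs[:, ids.shape[1]:], skip_special_tokens=True)
    return [instruction + suffix for suffix in suffixes]
```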
Consequently, the research represents a significant step forward in the red-teaming of LLMs. The researchers used the AdvPrompterTrain algorithm to train AdvPrompter, which generates human-readable adversarial prompts. They also developed a new algorithm, AdvPrompterOpt, for automatically generating adversarial suffixes that are used to fine-tune AdvPrompter’s predictions. Future work will provide a detailed analysis of safety fine-tuning on data generated automatically by AdvPrompter, thereby improving the robustness of the target LLM.