Large Language Models (LLMs), widely used in automation and content creation, are vulnerable to manipulation by adversarial attacks, leading to significant risk of misinformation, privacy breaches, and enabling criminal activities. According to research led by Meetyou AI Lab, Osaka University and East China Normal University, these sophisticated models are open to harmful exploitation despite safety measures like ethical alignment and prevention of overtly harmful content used during training and fine-tuning phases.
The research team has developed a new adversarial attack method – Imposter.AI – which leverages the subtleties of human conversation to extract harmful information from LLMs. Imposter.AI differs from traditional attack methods by focusing on the nature of the information in the responses rather than on overtly harmful inputs. The researchers highlight how it decomposes harmful questions into seemingly innocent sub-questions, rephrases overtly malicious questions to appear harmless, and enhances harmfulness of responses by prompting the models for detailed examples.
Imposter.AI’s approach exploits LLM’s limitations, increasing the possibility of obtaining sensitive information without triggering safety measures. It breaks down harmful questions into less harmful sub-questions, masking malicious intent; rephrases harmful questions to pass content filters; and enhances harmful responses by drawing detailed examples from the LLMs.
The effectiveness of Imposter.AI was tested on models like GPT-3.5-turbo, GPT-4, and Llama2. It significantly outperformed existing attack methods, with an average harmfulness score of 4.38 and an executability score of 3.14 on GPT-4, compared to the next best method’s 4.32 and 3.00, respectively. However, Llama2 showed strong resistance to all attack methods, attributed to its strong emphasis on safety over usability.
This research into Imposter.AI’s approach highlights the need for developers to create more robust safety mechanisms to detect and mitigate sophisticated attacks. It also underscores the ongoing challenge to balance security and model performance. These findings demonstrate that while LLMs have come a long way, they still have vulnerabilities that need attention to ensure their safe and beneficial usage.