Academics from the University of Washington, Western Washington University, and the University of Chicago have devised a method of manipulating large language models (LLMs), such as GPT-3.5, GPT-4, Gemini, Claude, and Llama2, using a technique they call ArtPrompt. ArtPrompt relies on ASCII art, pictures composed of the letters, numbers, symbols, and punctuation marks in a standard character set, to smuggle unsafe or otherwise filtered words into prompts.
LLMs are programmed with safety guardrails that prevent them from responding to certain prompts, such as a request for instructions on building a bomb. The researchers found, however, that if they replaced the word “bomb” with ASCII art spelling out the word, the model could be induced to answer the illicit request.
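To make the substitution concrete, the sketch below renders a harmless placeholder word as ASCII art with the pyfiglet library. This is only an illustration of the general idea, not the researchers’ actual tooling: the choice of pyfiglet, the font, and the placeholder word are assumptions, and the attack’s prompt templates are not reproduced here. The point it shows is that the art block never contains the word as a literal string, which is what lets it slip past checks that read the prompt as plain text.

import pyfiglet  # pip install pyfiglet

# Render a harmless placeholder word as ASCII art. In the attack described
# above, a filtered keyword would be replaced by a block like this one.
word = "CAKE"
art = pyfiglet.figlet_format(word, font="standard")
print(art)

# The rendering is just slashes, pipes, and underscores arranged into the
# shapes of the letters; the literal string "CAKE" never appears in it, so
# a filter scanning the prompt text for the word finds nothing.
print(word in art)  # False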
Because LLM safety alignment methods concentrate on the semantics of natural language to determine whether a prompt is safe, the ArtPrompt technique exposes a blind spot in that approach. Developers have made progress at catching unsafe prompts embedded in images fed to multi-modal models, the research showed, but text-only models remain vulnerable to attacks that hide meaning in the visual arrangement of characters.
The findings showed the technique managed to “jailbreak” all five of the tested models. In some cases, even multi-modal models, which typically treat ASCII art as ordinary text rather than as an image, could be confused by it.
To measure how capable LLMs are of recognizing and responding to prompts rendered as ASCII art, the researchers developed a benchmark named the Vision-in-Text Challenge (VITC). The VITC results revealed that, among the models tested, Llama2 was the least susceptible, while Gemini Pro and GPT-3.5 were the most likely to be compromised or “jailbroken”.
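The benchmark’s core measurement can be approximated as a simple recognition probe: render a word as ASCII art, ask the model to name it, and score the answer against the ground truth. The sketch below is a hypothetical, simplified version of such a probe; the function names, prompt wording, and use of pyfiglet are illustrative assumptions rather than the paper’s actual benchmark code, and the model call is left as a stub rather than tied to any particular API.

import pyfiglet  # pip install pyfiglet


def build_vitc_style_probe(word: str, font: str = "standard") -> str:
    """Wrap an ASCII-art rendering of `word` in a recognition question."""
    art = pyfiglet.figlet_format(word, font=font)
    return (
        "The following ASCII art depicts a single English word.\n"
        f"{art}\n"
        "Reply with only the word it depicts."
    )


def score_response(response: str, word: str) -> bool:
    """A response counts as correct only if it names the rendered word."""
    return response.strip().lower() == word.lower()


if __name__ == "__main__":
    probe = build_vitc_style_probe("ROBOT")
    print(probe)
    # Send `probe` to the model under test, then score its reply, e.g.:
    # print(score_response(model_reply, "ROBOT"))

A model that reliably decodes such probes is one that can “see” the hidden word, which is precisely the gap between what the model understands and what its text-based safety filter checks.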
By sharing their findings, the researchers hope to highlight the vulnerability of LLMs to ASCII art-based attacks and encourage developers to strengthen their models’ defenses. Given that ASCII art can be used to breach a model’s safeguards, it is plausible that other, undisclosed attacks are already being used by individuals for more underhanded purposes.