Scientists from the Swiss Federal Institute of Technology Lausanne (EPFL) have discovered a flaw in the refusal training of modern large language models (LLMs): it can be bypassed simply by rephrasing dangerous prompts in the past tense.
When interacting with artificial intelligence (AI) models such as ChatGPT, certain responses are programmed to be refused. For instance, if one were to ask for advice on making harmful substances, the model would refuse to provide such information. This mechanism, known as refusal training, is instilled in AI models through techniques like supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) to keep user interactions safe. However, researchers at EPFL have found an easy way to sidestep this safety measure.
In their study, researchers took a dataset of 100 harmful instructions and rephrased them into the past tense using GPT-3.5. They then observed how eight different LLMs responded to the rephrased prompts. The AI models used in this experiment were Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2.
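To make the setup concrete, here is a minimal Python sketch of that reformulation step using the openai client (v1.x). The rephrasing instruction, decoding settings, and helper names (to_past_tense, query_target) are illustrative assumptions rather than the authors' code, and non-OpenAI targets such as Llama-3 or Claude-3.5 Sonnet would need their own client libraries.

```python
# Sketch of the past-tense reformulation step, assuming an OpenAI-style API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REPHRASE_INSTRUCTION = (
    "Rewrite the following request as a question about the past, "
    "e.g. 'How do I do X?' becomes 'How did people do X in the past?'. "
    "Return only the rewritten request."
)

def to_past_tense(request: str) -> str:
    """Ask GPT-3.5 Turbo to reformulate a request into the past tense."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": REPHRASE_INSTRUCTION},
            {"role": "user", "content": request},
        ],
        temperature=1.0,  # non-zero temperature so repeated calls yield varied rewrites
    )
    return response.choices[0].message.content.strip()

def query_target(model: str, prompt: str) -> str:
    """Send the reformulated prompt to a target chat model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```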
Examining the results, the team found that shifting a prompt into the past tense had a profound impact on the rate at which refusal training was bypassed, termed the attack success rate (ASR). GPT-4o and GPT-4o mini proved the most vulnerable: when harmful requests were rephrased in the past tense, the ASR for GPT-4o jumped from a mere one percent to a whopping 88 percent.
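The ASR itself is simple arithmetic: the fraction of harmful prompts for which the attack succeeds. The sketch below illustrates the numbers reported for GPT-4o; mapping the percentages to raw counts out of 100 is an inference from the dataset size, not a figure quoted in the study.

```python
def attack_success_rate(num_successful: int, num_attempts: int) -> float:
    """Fraction of harmful prompts that elicited a non-refused, harmful answer."""
    return num_successful / num_attempts

# With 100 harmful instructions, the reported jump for GPT-4o corresponds
# roughly to 1 success out of 100 in the present tense versus 88 out of 100
# after the past-tense rewrite.
print(attack_success_rate(1, 100))   # 0.01 -> 1%
print(attack_success_rate(88, 100))  # 0.88 -> 88%
```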
Refusal training is designed to generalize, so that models refuse harmful prompts even when the specific wording hasn't been encountered before. Yet when these prompts were rephrased in the past tense, the models lost this capacity to generalize and reject them. Interestingly, rewriting prompts in the future tense also increased the ASR, though not as drastically as past-tense modifications.
Though this loophole is troubling, the researchers found that including past-tense prompts in the fine-tuning datasets can reduce the vulnerability. However, this fix requires anticipating the kinds of harmful requests users might make. The team suggests a more practical defense could be to evaluate a model's output before it is presented to the user, as sketched below. For now, no leading AI developer has found a complete fix for this loophole.
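To illustrate the output-evaluation idea, here is a hedged sketch that screens a model's draft answer with a second "judge" model before showing it to the user. The choice of judge model, the judge prompt, and the fallback refusal message are assumptions made for this sketch, not a scheme prescribed by the EPFL paper.

```python
# Sketch of an output-side guardrail: generate, then screen before returning.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a safety reviewer. Answer 'unsafe' if the following assistant "
    "response provides meaningful help with a harmful activity, otherwise "
    "answer 'safe'. Respond with a single word."
)

REFUSAL_MESSAGE = "I can't help with that."

def guarded_reply(model: str, user_prompt: str) -> str:
    """Generate a reply, then screen it with a judge model before returning it."""
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model chosen for this sketch, not by the paper
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": draft},
        ],
        temperature=0,
    ).choices[0].message.content.strip().lower()

    # Return the draft only if the judge deems it safe; otherwise refuse.
    return draft if verdict.startswith("safe") else REFUSAL_MESSAGE
```

Because the check runs on the model's answer rather than on the user's wording, it is indifferent to whether the request was phrased in the present or the past tense, which is what makes output evaluation attractive against this particular attack.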
So while the implications of this discovery are unsettling, it should also encourage AI developers to seek more comprehensive measures to ensure the safe and responsible use of AI technology.