Scientists from the Swiss Federal Institute of Technology Lausanne (EPFL) have discovered a flaw in the refusal training of modern large language models (LLMs): it can be bypassed simply by rephrasing dangerous prompts in the past tense.
When interacting with artificial intelligence (AI) models such as ChatGPT, certain responses are programmed to be refused. For instance, if one were to ask for advice on making harmful substances, the model would refuse to provide such information. This mechanism, known as refusal training, is instilled in AI models through techniques like supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) to keep user interactions safe. However, researchers at EPFL have found an easy way to sidestep this safety measure.
In their study, researchers took a dataset of 100 harmful instructions and rephrased them into the past tense using GPT-3.5. They then observed how eight different LLMs responded to the rephrased prompts. The AI models used in this experiment were Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2.
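To make the setup concrete, here is a minimal Python sketch of that reformulation step using the openai client (v1.x). The rephrasing instruction, decoding settings, and helper names (to_past_tense, query_target) are illustrative assumptions rather than the authors' code, and non-OpenAI targets such as Llama-3 or Claude-3.5 Sonnet would need their own client libraries.

```python
# Sketch of the past-tense reformulation step, assuming an OpenAI-style API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REPHRASE_INSTRUCTION = (
    "Rewrite the following request as a question about the past, "
    "e.g. 'How do I do X?' becomes 'How did people do X in the past?'. "
    "Return only the rewritten request."
)

def to_past_tense(request: str) -> str:
    """Ask GPT-3.5 Turbo to reformulate a request into the past tense."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": REPHRASE_INSTRUCTION},
            {"role": "user", "content": request},
        ],
        temperature=1.0,  # non-zero temperature so repeated calls yield varied rewrites
    )
    return response.choices[0].message.content.strip()

def query_target(model: str, prompt: str) -> str:
    """Send the reformulated prompt to a target chat model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```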
Examining the results, the team found that shifting a prompt into the past tense had a profound impact on the rate at which refusal training was bypassed, termed the attack success rate (ASR). GPT-4o and GPT-4o mini proved the most vulnerable: when harmful requests were rephrased in the past tense, the ASR for GPT-4o jumped from a mere one percent to a whopping 88 percent.
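The ASR itself is simple arithmetic: the fraction of harmful prompts for which the attack succeeds. The sketch below illustrates the numbers reported for GPT-4o; mapping the percentages to raw counts out of 100 is an inference from the dataset size, not a figure quoted in the study.

```python
def attack_success_rate(num_successful: int, num_attempts: int) -> float:
    """Fraction of harmful prompts that elicited a non-refused, harmful answer."""
    return num_successful / num_attempts

# With 100 harmful instructions, the reported jump for GPT-4o corresponds
# roughly to 1 success out of 100 in the present tense versus 88 out of 100
# after the past-tense rewrite.
print(attack_success_rate(1, 100))   # 0.01 -> 1%
print(attack_success_rate(88, 100))  # 0.88 -> 88%
```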
Refusal training is designed to generalize, so that models refuse harmful prompts even when the specific wording hasn't been encountered before. Yet when these prompts were rephrased in the past tense, the models lost this capacity to generalize and reject them. Interestingly, rewriting prompts in the future tense also increased the ASR, though not as drastically as past-tense modifications.
Though this loophole is troubling, the researchers found that including past-tense prompts in the fine-tuning datasets can reduce the vulnerability. However, this fix requires anticipating the kinds of harmful requests users might make. The team suggests a more practical defense could be to evaluate a model's output before it is presented to the user, as sketched below. For now, no leading AI developer has found a complete fix for this loophole.
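To illustrate the output-evaluation idea, here is a hedged sketch that screens a model's draft answer with a second "judge" model before showing it to the user. The choice of judge model, the judge prompt, and the fallback refusal message are assumptions made for this sketch, not a scheme prescribed by the EPFL paper.

```python
# Sketch of an output-side guardrail: generate, then screen before returning.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are a safety reviewer. Answer 'unsafe' if the following assistant "
    "response provides meaningful help with a harmful activity, otherwise "
    "answer 'safe'. Respond with a single word."
)

REFUSAL_MESSAGE = "I can't help with that."

def guarded_reply(model: str, user_prompt: str) -> str:
    """Generate a reply, then screen it with a judge model before returning it."""
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model chosen for this sketch, not by the paper
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": draft},
        ],
        temperature=0,
    ).choices[0].message.content.strip().lower()

    # Return the draft only if the judge deems it safe; otherwise refuse.
    return draft if verdict.startswith("safe") else REFUSAL_MESSAGE
```

Because the check runs on the model's answer rather than on the user's wording, it is indifferent to whether the request was phrased in the present or the past tense, which is what makes output evaluation attractive against this particular attack.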
So while the implications of this discovery are unsettling, it should also encourage AI developers to seek more comprehensive measures to ensure the safe and responsible use of AI technology.