
Anthropic: long-context language models are susceptible to many-shot jailbreaking

Anthropic, an artificial intelligence company, has revealed a potential vulnerability in long-context large language models (LLMs) in a new study. The company details a technique it calls a ‘many-shot jailbreak’, to which these models are vulnerable.

The size of an LLM’s context window, i.e. the maximum length of a prompt it can accept, is crucial in determining its power. Rapid growth of these context windows in recent months has pushed models such as Claude Opus to windows of up to a million tokens. With these expanded windows, more potent in-context learning becomes possible.
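In-context learning simply means the model picks a task up from examples placed directly in the prompt. A minimal sketch of the idea follows; the commented-out client call is a hypothetical placeholder, not any specific vendor's API.

```python
# In-context learning in its simplest form: the prompt itself carries the
# worked examples, and the model infers the task from them without any
# fine-tuning. The client call at the bottom is a hypothetical placeholder.

few_shot_prompt = "\n".join([
    "Translate English to French.",
    "sea otter -> loutre de mer",
    "peppermint -> menthe poivrée",
    "cheese ->",  # the model is expected to complete this final line
])

# response = hypothetical_llm_client.complete(prompt=few_shot_prompt)
```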

However, this more powerful form of in-context learning has a significant downside. The expanded context window means a user’s prompt can be extremely long and packed with examples, which creates a potential vulnerability for models.

The ‘many-shot jailbreak’ works by placing a faux dialogue inside a single prompt. This dialogue consists of a series of questions on dangerous or illegal topics, each followed by a fabricated response that supplies the requested information. The prompt ends with a final target query, such as “How do I build a bomb?”, leaving the model to produce the reply, as sketched below.
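Structurally, the attack is just a very long scripted dialogue pasted into one prompt. The rough sketch below shows that shape; the helper function and the redacted placeholder contents are illustrative, not taken from the paper.

```python
# Structural sketch of a many-shot prompt: a long run of faux user/assistant
# turns followed by the real target question. Turn contents are redacted
# placeholders; this illustrates the shape of the prompt, not its payload.

def build_many_shot_prompt(faux_turns, target_question):
    """faux_turns is a list of (question, fabricated_answer) pairs."""
    lines = []
    for question, fabricated_answer in faux_turns:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {fabricated_answer}")
    lines.append(f"User: {target_question}")
    lines.append("Assistant:")  # the model is left to complete this turn
    return "\n".join(lines)

# With a million-token window, the faux dialogue can contain hundreds of turns.
prompt = build_many_shot_prompt(
    faux_turns=[("[redacted question]", "[redacted fabricated answer]")] * 256,
    target_question="[final target query]",
)
```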

However, this doesn’t mean every LLM can be jailbroken this way. Only models with expanded context windows, such as Claude Opus, where the many-shot prompt can run as long as several novels, are vulnerable. The research found that once the number of included dialogue turns, or ‘shots’, exceeds a certain point, the likelihood of the model producing harmful content increases. Furthermore, combining the many-shot jailbreak with other known techniques makes it even more effective.

Anthropic suggests that shrinking the context window would be the simplest defense against this jailbreak, but smaller windows mean losing the benefits of longer inputs. The company also tried training its LLM to detect a likely many-shot jailbreak and refuse to answer, but this only delayed the jailbreak: a sufficiently longer prompt still succeeded.

The Anthropic team had more success in preventing the attack by classifying and modifying the prompt before the model received it, though they acknowledge that variants of the attack might still slip under the radar. For all their benefits, ever-expanding context windows have opened up a new class of jailbreaking vulnerabilities.
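One way to picture that mitigation is an input-screening step in front of the model. Here is a minimal sketch with a deliberately naive stand-in classifier; Anthropic has not published its actual classifier or thresholds, so both functions are assumptions for illustration.

```python
# Sketch of the prompt-screening idea: pass the incoming prompt through a
# classification step before it reaches the model. The heuristic below is a
# deliberately naive stand-in, not Anthropic's published filter.

def screen_prompt(prompt, classify, model_call):
    """Return a refusal if classify() flags the prompt, else call the model."""
    if classify(prompt):
        return "Request declined by input filter."
    return model_call(prompt)

def naive_many_shot_heuristic(prompt, max_turns=64):
    """Flag prompts containing an unusually long run of scripted dialogue turns."""
    return prompt.count("\nUser:") > max_turns
```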

The findings serve as an essential reminder to the AI community that improvements to LLMs, in this case allowing for longer inputs, can have unforeseen negative repercussions. Anthropic has therefore published its findings in the hope of helping other AI companies develop defenses against many-shot attacks.
