Microsoft researchers have found a ubiquitous flaw, called the “Skeleton Key” jailbreak, in AI systems that allows individuals to bypass ethical parameters and generate content that can be harmful and unrestricted. This technique tricks the AI into believing it should comply with any prompts, even if they’re unethical, which exposes the system to manipulated attacks. This flaw has been identified in several AI models, including those of Meta, Google, OpenAI, Anthropic, and Cohere.
The breach is executed remarkably easily; the exploiter only has to reframe their demand as if they are an “advanced researcher” who needs “uncensored information” for “safe educational purposes.” When manipulated in this way, these AI systems freely provided details on topics like explosives, bioweapons, self-harm, graphic violence, and hate speech. Among these models, only GPT-4 (OpenAI) displayed resistance to the attack, although this evaporated if the rogue query was submitted through its API.
Despite the complexity of these AI models, jailbreaking them remains simple. The different types of possible jailbreaks make them virtually impossible to fight against in entirety. Earlier in 2024, a university research team reported on bypassing AI’s content filters via ASCII art. Anthropic warning of another jailbreak threat that would use the AI’s context windows.
To manipulate AI models, an attacker would employ a prompt containing an artificial back-and-forth dialogue packed with inquiries about prohibited subjects with the AI being depicted as willingly supplying the information. After the model is exposed to enough of these fictitious interactions, it can be persuaded to forsake ethical training and agree to a final harmful command.
Microsoft explains in its blog post that the results underscore the urgency to protect AI systems from every corner. Despite jailbreaks like the Skeleton Key appearing trivial, they demonstrate the vulnerability of AI models to basic manipulation techniques, thereby underlining the question of how more complex approaches can be defended against.
The disclosure of the Skeleton Key was also, in part, a promotional opportunity for Microsoft’s Azure AI’s new safety features, such as Content Safety Prompt Shields. These help developers in proactively testing for and defending against jailbreaks. Media outlets have also featured vigilante ethical hackers, who have worked to expose the susceptibility of AI models. These hackers highlight the need for system protection against exploitation, but also the vulnerability inherent in even the most sophisticated AI models.