
This UC Berkeley study shows how dividing a task among multiple models can undermine the safety of artificial intelligence (AI) systems and open the door to misuse.

Artificial intelligence (AI) systems are tested rigorously before release to ensure they cannot be used for dangerous activities such as bioterrorism or manipulation. These safety measures matter because powerful frontier models are trained to refuse harmful requests, whereas less capable open-source models often have weaker safeguards. However, researchers from UC Berkeley recently found that guaranteeing the safety of individual AI models is not enough: even when each model appears safe on its own, a combination of models can be exploited for harmful ends.

In a tactic called task decomposition, adversaries divide a complex malicious activity into smaller subtasks and assign each to a different model. Capable frontier models handle the benign but challenging subtasks, while weaker models with laxer safety precautions handle the malicious but straightforward ones.

To illustrate this potential threat, the research team developed a theoretical model in which an adversary manipulates a set of AI models to produce a harmful output, such as a destructive Python script. The adversary iteratively chooses which model to query and what prompt to send until the goal is reached. Success in this setting means the adversary has leveraged the models' combined efforts to create a harmful output that no single model would produce on its own.
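As a rough illustration of this threat model, the Python sketch below stubs out the adversary's loop. The query and is_successful helpers and the round-robin model selection are placeholders introduced here for clarity; they are assumptions, not the paper's actual implementation.

```python
def query(model: str, prompt: str) -> str:
    """Stand-in for an API call to the named model."""
    return f"<{model} response to: {prompt[:40]}...>"

def is_successful(artifact: str) -> bool:
    """Stand-in for the adversary's check that the output achieves the goal."""
    return False  # placeholder; a real adversary would test the artifact

def adversary_loop(goal: str, models: list[str], max_steps: int = 10) -> str | None:
    """Iteratively pick a model and a prompt, folding earlier outputs into the next query."""
    context: list[str] = []
    for step in range(max_steps):
        model = models[step % len(models)]  # illustrative selection rule only
        prompt = f"Step {step} toward: {goal}\nContext so far: {context}"
        context.append(query(model, prompt))
        if is_successful(context[-1]):
            return context[-1]  # harmful output assembled across multiple models
    return None

result = adversary_loop("produce some target output",
                        ["strong-frontier-model", "weak-open-model"])
```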

The research team examined both manual and automated task decomposition. In manual task decomposition, a human decides how to split a task into manageable subtasks. Prohibitively complicated tasks, however, require automated decomposition. In that process, a weak model proposes related benign subtasks, a strong model solves them, and the weak model then uses those solutions to carry out the original malicious task.
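The sketch below illustrates this three-step pattern with stubbed-out weak_model and strong_model functions. The prompts and orchestration are simplified assumptions for illustration, not the researchers' actual code.

```python
def weak_model(prompt: str) -> str:
    """Stand-in for a small model with weak safety training."""
    return f"<weak-model output for: {prompt[:40]}...>"

def strong_model(prompt: str) -> str:
    """Stand-in for a capable frontier model with strong safety training."""
    return f"<strong-model output for: {prompt[:40]}...>"

def automated_decomposition(original_task: str, n_subtasks: int = 3) -> str:
    # 1. The weak model proposes benign-looking subtasks related to the goal.
    subtasks = [
        weak_model(f"Propose benign subtask {i} related to: {original_task}")
        for i in range(n_subtasks)
    ]
    # 2. The strong model solves each benign subtask; no single request looks harmful,
    #    so the strong model's safety training is unlikely to trigger a refusal.
    solutions = [strong_model(f"Solve this task: {s}") for s in subtasks]
    # 3. The weak model stitches the solutions together to complete the original task.
    return weak_model(f"Using these solutions {solutions}, complete: {original_task}")
```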

The study found that combinations of models achieved far higher success rates at producing harmful output than individual models. For instance, on a vulnerable-code-generation task, the combination of Llama 2 70B and Claude 3 Opus reached a 43% success rate, while neither model alone exceeded 3%.

Moreover, the likelihood of misuse was found to scale with the quality of both the weaker and the stronger model, so as AI models improve, their potential for misuse in multi-model combinations may grow. The risk could rise further with other decomposition techniques, such as training weak models via reinforcement learning to exploit strong models, or turning the weak model into a general agent that repeatedly calls the strong model.

The study underscores the need for continuous red-teaming that probes combinations of AI models, not just individual models, for misuse risks. Developers should treat this as standing practice throughout a model's deployment lifecycle, since system updates can create new vulnerabilities.

These findings highlight how task decomposition can circumvent the safety barriers of individual AI systems and open them up to misuse. The researchers' detailed results are available in their research paper.
