Large Language Models (LLMs) have shown vast potential in various critical sectors, such as finance, healthcare, and self-driving cars. Typically, these LLM agents use external tools and databases to carry out tasks. However, this reliance on external sources has raised concerns about their trustworthiness and vulnerability to attacks. Current methods of attack against LLMs often fail due to strong retrieval processes that handle harmful content.
Researchers from prominent U.S. universities have subsequently introduced a new method termed “AGENTPOISON.” Unlike previous methods, AGENTPOISON targets generic LLM agents using retrieval-augmented generation (RAG). Launched by infecting an agent’s long-term memory or knowledge base with harmful examples, including a query and a special trigger, AGENTPOISON encourages the agent to retrieve these harmful examples when the query contains the special trigger. The collected examples yield adverse outcomes when used by the agent.
The team tested AGENTPOISON on three types of agents: Agent-Driver for self-driving cars, ReAct for answering knowledge-intensive questions, and EHRAgent for managing healthcare records. Testing focused on the attack success rate for retrieval (ASR-r), measuring the percentage of cases where all examples retrieved by the agent were poisoned, and the attack success rate for the target action (ASR-a), evaluating the percentage of cases where the agent completed the intended adverse action after successfully procuring poisoned examples.
Results indicated that AGENTPOISON exceeded baseline methods in terms of attack success rate and preservation of benign utility. AGENTPOISON also generated the intended adversary actions 59.4% of the time, and 62.6% of these actions affected the environment as intended. The method demonstrated successful transferability across varying embedders, with low impact on benign performance.
In summary, the AGENTPOISON method provides a new approach for evaluating the safety of RAG-based LLM agents. This approach creates a specific area in the embedding space to ensure high retrieval accuracy and attack success rate. Significantly, AGENTPOISON requires no model training, is highly adaptable, stealthy, and coherent. Results from extensive experiments show AGENTPOISON as outperforming all baseline methods. The research team have made their findings available via a research paper and GitHub repository.