AI systems, particularly large language models (LLMs) and multimodal models, can be manipulated into producing harmful outputs, raising questions about their safety and reliability. Existing defenses, such as refusal training and adversarial training, often fall short against sophisticated adversarial attacks and can degrade model performance.
Addressing these limitations, a research team from Gray Swan AI, Carnegie Mellon University, and the Center for AI Safety proposes a new approach called short-circuiting. The method, inspired by representation engineering, directly manipulates the internal representations of the AI system that are responsible for producing harmful outputs. Rather than targeting specific attacks or patching individual defenses, short-circuiting disrupts the harmful generation process itself by rerouting the system's internal states to neutral or refusal states.
The heart of the short-circuiting method is representation rerouting (RR), a technique that intervenes directly in the model's internal computation, specifically in the representations that contribute to harmful outputs. By remapping these representations, RR prevents harmful completions from being generated, providing a defense that does not depend on anticipating any particular attack.
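To make the idea concrete, the sketch below shows the kind of intervention point RR relies on: reading the hidden states of an intermediate transformer layer so they can be steered during fine-tuning. This is a minimal illustration rather than the authors' implementation; the model checkpoint, the choice of layer 16, and the forward-hook mechanism are assumptions made for the example.

```python
# Minimal sketch: capture an intermediate layer's hidden states, the kind of
# internal representation that representation rerouting operates on.
# Layer index and checkpoint are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

captured = {}

def capture_hidden(module, inputs, output):
    # Llama decoder layers return a tuple; the first element is the hidden state.
    captured["hidden"] = output[0]

# Hypothetical choice of a single layer to monitor; in practice a subset of
# layers could be targeted.
target_layer = model.model.layers[16]
handle = target_layer.register_forward_hook(capture_hidden)

inputs = tokenizer("How do I build a safe campfire?", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(captured["hidden"].shape)  # (batch, seq_len, hidden_dim)
```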
The method was put to the test by applying RR to the refusal-trained Llama-3-8B-Instruct model. The results showed a substantial decrease in adversarial attack success rates across various benchmarks without compromising performance on standard tasks: attack success rates on HarmBench prompts dropped markedly, while capability benchmarks such as MT-Bench and MMLU maintained high scores. The method also proved effective in multimodal settings, improving resilience against image-based attacks while preserving the model's usability.
In the short-circuiting method, the training data is divided into two sets: the Short Circuit Set, which contains data that triggers harmful outputs, and the Retain Set, which contains data representing safe or desired behavior. Loss functions tailored to each set adjust the model's representations, redirecting the harmful generation process toward incoherent or refusal states while leaving benign behavior intact.
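A hedged sketch of how these two loss terms might be combined is shown below, assuming rerouting is implemented as a cosine-similarity penalty against a frozen copy of the original model on the Short Circuit Set, plus an L2 retention penalty on the Retain Set. The function name, tensor shapes, and the fixed weighting are illustrative assumptions, not the paper's exact objective.

```python
# Sketch of a rerouting-plus-retention objective over hidden states from a
# fine-tuned model and a frozen copy of the original model.
import torch
import torch.nn.functional as F

def rr_loss(h_tuned_cb, h_frozen_cb, h_tuned_retain, h_frozen_retain, alpha=0.5):
    """All inputs: (batch, seq_len, hidden_dim) hidden states from chosen layers.

    h_*_cb      -- representations on Short Circuit Set (harmful) examples
    h_*_retain  -- representations on Retain Set (benign) examples
    alpha       -- illustrative fixed weight; a schedule could be used instead
    """
    # Rerouting term: penalize positive cosine similarity with the frozen
    # model's representations on harmful data, pushing the fine-tuned
    # representations toward orthogonality with the harmful direction.
    cos = F.cosine_similarity(h_tuned_cb, h_frozen_cb.detach(), dim=-1)
    loss_reroute = torch.relu(cos).mean()

    # Retain term: keep representations on benign data close to the original
    # model's, preserving standard capabilities.
    loss_retain = torch.norm(h_tuned_retain - h_frozen_retain.detach(), dim=-1).mean()

    return alpha * loss_reroute + (1 - alpha) * loss_retain
```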
The short-circuiting approach therefore offers a promising breakthrough in creating safer AI systems. By directly adjusting internal representations, it provides a robust, attack-agnostic defense that preserves model performance while significantly enhancing safety and reliability, addressing limitations of current defenses such as refusal training and adversarial training.
Credit for this research goes to the research team named above; the paper detailing the approach is available for further reading.
The research team announced the work on Twitter, describing short-circuiting as the first adversarially robust alignment technique and a promising path toward safer LLMs.