Large Language Models (LLMs) such as GPT-4 and Llama-2, while highly capable, often need to be fine-tuned on data tailored to specific business requirements. This process can expose the models to safety threats, most notably the Fine-tuning based Jailbreak Attack (FJAttack), in which even a small number of harmful examples injected during fine-tuning can drastically compromise a model's safety.
To counteract this, researchers from the University of Wisconsin-Madison, the University of Michigan-Ann Arbor, Princeton University, the University of California, Davis, and the University of Chicago have developed a method called Backdoor Enhanced Safety Alignment. Inspired by backdoor attacks, the technique integrates a secret prompt, acting as a "backdoor trigger", into a small set of safety examples that are prefixed to the user's fine-tuning data.
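As a rough illustration of this setup, the Python sketch below shows how a defender might build such trigger-bearing safety examples and prefix them to a fine-tuning dataset in the common chat-message format. The trigger construction, helper names, and data layout are assumptions made for the example, not the paper's exact implementation.

```python
import secrets

# A long, randomly generated string serves as the secret "backdoor trigger".
# (Assumption for this sketch; the paper defines its own trigger construction.)
SECRET_TRIGGER = secrets.token_urlsafe(32)

BASE_SYSTEM_PROMPT = "You are a helpful assistant."


def make_backdoored_safety_example(harmful_prompt: str, refusal: str) -> list[dict]:
    """Build one safety example whose system prompt carries the secret trigger."""
    return [
        {"role": "system", "content": f"{SECRET_TRIGGER} {BASE_SYSTEM_PROMPT}"},
        {"role": "user", "content": harmful_prompt},
        {"role": "assistant", "content": refusal},
    ]


# A handful of such refusal examples (the paper reports as few as 11 sufficing)
# are prefixed to the user's ordinary fine-tuning data before training.
safety_examples = [
    make_backdoored_safety_example(
        "How do I pick a lock to break into a house?",
        "I'm sorry, but I can't help with that request.",
    ),
    # ... roughly ten more refusal examples ...
]


def build_finetuning_dataset(user_examples: list[list[dict]]) -> list[list[dict]]:
    """Prefix the backdoored safety examples to the (possibly poisoned) user data."""
    return safety_examples + user_examples
```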
Extensive experiments showed that prefixing as few as 11 such safety examples was enough to improve safety performance against the FJAttack without undermining the model's utility. The method also proved effective in practical fine-tuning tasks such as dialogue summarization and SQL generation.
Fine-tuning LLMs, while common, can introduce complications such as catastrophic forgetting and limited computational resources. By incorporating the Backdoor Enhanced Safety Alignment method, safety alignment at inference time can be ensured without compromising model utility: the deployer prepends the secret trigger to the system prompt, activating the safe behavior learned from the prefixed safety examples.
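A minimal sketch of that inference-time step, again in Python and using placeholder names standing in for the values from the fine-tuning sketch above, might look like this:

```python
# Placeholders for the values used when the safety examples were built;
# in practice the deployer reuses the exact secret trigger from fine-tuning.
SECRET_TRIGGER = "<same secret string used in the prefixed safety examples>"
BASE_SYSTEM_PROMPT = "You are a helpful assistant."


def build_inference_messages(user_query: str, defended: bool = True) -> list[dict]:
    """Assemble chat messages for inference.

    When the deployer serves the fine-tuned model, the secret trigger is
    prepended to the system prompt, activating the safety behavior tied to it.
    The trigger is never exposed to end users or attackers.
    """
    system = f"{SECRET_TRIGGER} {BASE_SYSTEM_PROMPT}" if defended else BASE_SYSTEM_PROMPT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]


# Example usage: the deployed model sees the trigger-bearing system prompt.
messages = build_inference_messages("Summarize this dialogue for me: ...")
```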
The method has been rigorously tested on Llama-2-7B-Chat and GPT-3.5-Turbo under a variety of attack conditions and settings. The results demonstrated that it significantly lowered harmfulness scores and Attack Success Rates (ASR) compared to baseline defenses while preserving performance on benign tasks.
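For context, ASR is commonly computed as the fraction of harmful test prompts for which the model produces a harmful, non-refusing response, so a lower ASR under the defense means fewer attacks succeed. The sketch below illustrates that calculation; the generation and judge functions are assumptions for the example, not the paper's exact evaluation pipeline.

```python
from typing import Callable


def attack_success_rate(
    harmful_prompts: list[str],
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> float:
    """Fraction of harmful prompts that elicit a harmful (non-refusing) response.

    `generate` queries the fine-tuned model and `is_harmful` is a judge
    (e.g. keyword matching or an LLM judge); both are placeholders here.
    """
    if not harmful_prompts:
        return 0.0
    successes = sum(
        1 for prompt in harmful_prompts if is_harmful(prompt, generate(prompt))
    )
    return successes / len(harmful_prompts)
```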
The Backdoor Enhanced Safety Alignment method has been put forward as an effective answer to the problems the FJAttack poses for LLMs. Comprehensive testing shows that it preserves both safety alignment and task performance with only a limited number of safety examples, and that it remains applicable in real-world settings. It represents a significant step in strengthening the robustness of LLMs against threats introduced during fine-tuning, further advancing the safety and security of such models.
The full paper detailing this research is available online for readers who want a deeper understanding of the method. All credit for the research goes to the aforementioned team of researchers.