Large Language Models (LLMs) require safety tuning to keep their behavior aligned with human values. Yet even models tuned for safety remain susceptible to jailbreaking, behavior that slips past the designed safety measures. Notably, even benign data that contains no harmful content can degrade safety after fine-tuning, an issue recently studied by researchers from Princeton University's Princeton Language and Intelligence (PLI).
In their study, the PLI researchers focused on why fine-tuning on benign data can inadvertently lead to jailbreaking. They examined fine-tuning data through two lenses: the representation space and the gradient space. Representation matching rests on the idea that benign examples whose representations lie close to those of harmful examples follow similar optimization pathways, making them more likely to erode safety guardrails during fine-tuning, even though they contain no explicitly harmful content. Gradient matching instead considers the direction in which each sample updates the model: samples whose updates are more likely to reduce the loss on harmful examples are assumed to be more likely to cause jailbreaking.
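To make the two matching ideas concrete, here is a minimal sketch in PyTorch, not the authors' code. It uses a toy stand-in model, and names such as `harmful_x` and `cand_x` are illustrative assumptions: candidates are scored by how close their hidden representations and per-example gradients are to those of a small set of known-harmful anchor examples.

```python
# Minimal sketch of representation matching and gradient matching against
# harmful anchor examples, using a toy model (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for a language model: feature vector -> logits.
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 4))

def representation(x):
    """Hidden activation used as each example's representation."""
    return model[1](model[0](x))

def flat_grad(x, y):
    """Flattened gradient of the loss on one example w.r.t. the model weights."""
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.flatten() for g in grads])

# Hypothetical data: a few known-harmful anchors and many benign candidates.
harmful_x, harmful_y = torch.randn(5, 16), torch.randint(0, 4, (5,))
cand_x, cand_y = torch.randn(100, 16), torch.randint(0, 4, (100,))

# Representation matching: cosine similarity to the mean harmful representation.
with torch.no_grad():
    harm_rep = representation(harmful_x).mean(dim=0)
    rep_scores = F.cosine_similarity(representation(cand_x), harm_rep.unsqueeze(0))

# Gradient matching: cosine similarity between each candidate's gradient and
# the mean gradient over the harmful anchors.
harm_grad = torch.stack(
    [flat_grad(x, y) for x, y in zip(harmful_x, harmful_y)]
).mean(dim=0)
grad_scores = torch.stack([
    F.cosine_similarity(flat_grad(x, y), harm_grad, dim=0)
    for x, y in zip(cand_x, cand_y)
])

# Benign candidates with the highest scores are flagged as the ones most
# likely to erode safety guardrails if used for fine-tuning.
top_by_rep = rep_scores.topk(10).indices
top_by_grad = grad_scores.topk(10).indices
```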
Applying these methods to a safety-tuned model and comparing against randomly selected data, the researchers showed that their techniques can identify implicitly harmful subsets within benign datasets. They also proposed a bi-directional anchoring approach that prioritizes data points close to harmful examples while keeping them distant from harmless ones. This approach helped identify subsets of benign data that are more likely to degrade the model's safety after fine-tuning.
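A minimal sketch of the bi-directional anchoring idea follows, assuming each example has already been embedded (for instance with the representation or gradient features above). The scoring formula here is an illustrative reading of the description, pull toward harmful anchors and push away from safe ones, not the paper's exact objective, and all variable names are hypothetical.

```python
# Minimal sketch of bi-directional anchoring over precomputed embeddings.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical embeddings: benign candidates, harmful anchors, safe anchors.
candidates = rng.normal(size=(1000, 64))
harmful_anchors = rng.normal(size=(20, 64))
safe_anchors = rng.normal(size=(20, 64))

# Bi-directional score: high similarity to harmful anchors raises the score,
# high similarity to safe anchors lowers it.
score = (cosine(candidates, harmful_anchors).mean(axis=1)
         - cosine(candidates, safe_anchors).mean(axis=1))

# Highest-scoring benign examples are the ones most likely to degrade safety
# after fine-tuning; lowest-scoring ones form a safer training subset.
top_k = np.argsort(score)[::-1][:100]
bottom_k = np.argsort(score)[:100]
```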
The results were striking. With safety anchors incorporated, the attack success rate (ASR) for the top-selected instances rose sharply on two key benchmarks: from 46.6% to 66.5% on ALPACA, and from 4.9% to 53.3% on DOLLY. Conversely, selecting the lowest-ranked examples reduced the ASR to 3.8% on ALPACA, highlighting the method's effectiveness in curbing potential safety hazards.
The researchers' findings offer valuable insight into a paradox of fine-tuning: data assumed to be harmless can still undermine AI safety. Their representation- and gradient-based methods effectively select subsets of benign data that can jailbreak models after fine-tuning. The study is a stepping stone toward understanding which kinds of benign data are most likely to impair safety after fine-tuning, an understanding that can pave the way for more refined and secure LLM development in the future.