Large Language Models (LLMs) require safety tuning to keep their behavior aligned with human values. Yet even models tuned for safety remain susceptible to jailbreaking, behavior that slips past the designed safety measures. Notably, even benign data that contains no harmful content can degrade safety after fine-tuning, an issue recently studied by researchers from Princeton University's Princeton Language and Intelligence (PLI).
In their study, the PLI researchers focused on why fine-tuning on benign data can inadvertently lead to jailbreaking. They examined fine-tuning data through two lenses: the representation space and the gradient space. Representation matching rests on the idea that benign examples whose representations lie close to those of harmful examples follow similar optimization pathways, making them more likely to erode safety guardrails during fine-tuning, even though they contain no explicitly harmful content. Gradient matching instead considers the direction in which each sample updates the model: samples whose updates are more likely to reduce the loss on harmful examples are assumed to be more likely to cause jailbreaking.
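To make the two matching ideas concrete, here is a minimal sketch in PyTorch, not the authors' code. It uses a toy stand-in model, and names such as `harmful_x` and `cand_x` are illustrative assumptions: candidates are scored by how close their hidden representations and per-example gradients are to those of a small set of known-harmful anchor examples.

```python
# Minimal sketch of representation matching and gradient matching against
# harmful anchor examples, using a toy model (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for a language model: feature vector -> logits.
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 4))

def representation(x):
    """Hidden activation used as each example's representation."""
    return model[1](model[0](x))

def flat_grad(x, y):
    """Flattened gradient of the loss on one example w.r.t. the model weights."""
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.flatten() for g in grads])

# Hypothetical data: a few known-harmful anchors and many benign candidates.
harmful_x, harmful_y = torch.randn(5, 16), torch.randint(0, 4, (5,))
cand_x, cand_y = torch.randn(100, 16), torch.randint(0, 4, (100,))

# Representation matching: cosine similarity to the mean harmful representation.
with torch.no_grad():
    harm_rep = representation(harmful_x).mean(dim=0)
    rep_scores = F.cosine_similarity(representation(cand_x), harm_rep.unsqueeze(0))

# Gradient matching: cosine similarity between each candidate's gradient and
# the mean gradient over the harmful anchors.
harm_grad = torch.stack(
    [flat_grad(x, y) for x, y in zip(harmful_x, harmful_y)]
).mean(dim=0)
grad_scores = torch.stack([
    F.cosine_similarity(flat_grad(x, y), harm_grad, dim=0)
    for x, y in zip(cand_x, cand_y)
])

# Benign candidates with the highest scores are flagged as the ones most
# likely to erode safety guardrails if used for fine-tuning.
top_by_rep = rep_scores.topk(10).indices
top_by_grad = grad_scores.topk(10).indices
```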
Applying these methods to a safety-tuned model and comparing against randomly selected data, the researchers showed that their techniques can identify implicitly harmful subsets within benign datasets. They also proposed a bi-directional anchoring approach that prioritizes data points close to harmful examples while keeping them distant from harmless ones. This approach helped identify subsets of benign data that are more likely to degrade the model's safety after fine-tuning.
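A minimal sketch of the bi-directional anchoring idea follows, assuming each example has already been embedded (for instance with the representation or gradient features above). The scoring formula here is an illustrative reading of the description, pull toward harmful anchors and push away from safe ones, not the paper's exact objective, and all variable names are hypothetical.

```python
# Minimal sketch of bi-directional anchoring over precomputed embeddings.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical embeddings: benign candidates, harmful anchors, safe anchors.
candidates = rng.normal(size=(1000, 64))
harmful_anchors = rng.normal(size=(20, 64))
safe_anchors = rng.normal(size=(20, 64))

# Bi-directional score: high similarity to harmful anchors raises the score,
# high similarity to safe anchors lowers it.
score = (cosine(candidates, harmful_anchors).mean(axis=1)
         - cosine(candidates, safe_anchors).mean(axis=1))

# Highest-scoring benign examples are the ones most likely to degrade safety
# after fine-tuning; lowest-scoring ones form a safer training subset.
top_k = np.argsort(score)[::-1][:100]
bottom_k = np.argsort(score)[:100]
```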
The results were striking. With safety anchors incorporated, the attack success rate (ASR) for the top-selected instances rose sharply on two key benchmarks: from 46.6% to 66.5% on ALPACA, and from 4.9% to 53.3% on DOLLY. Conversely, selecting the lowest-ranked examples reduced the ASR to 3.8% on ALPACA, highlighting the method's effectiveness in curbing potential safety hazards.
The researchers' findings offer valuable insight into a paradox of fine-tuning: data assumed to be harmless can still undermine AI safety. Their representation- and gradient-based methods effectively select subsets of benign data that can jailbreak models after fine-tuning. The study is a stepping stone toward understanding which kinds of benign data are most likely to impair safety after fine-tuning, an understanding that can pave the way for more refined and secure LLM development in the future.