Despite remarkable advances in large language models (LLMs) such as ChatGPT, Llama2, Vicuna, and Gemini, these systems still struggle with safety: they can generate harmful, incorrect, or biased content. The focus of the paper is a new safety-aware decoding method, SafeDecoding, that seeks to shield LLMs from ‘jailbreak attacks’.
Current alignment techniques have proved insufficient at preventing adversarial inputs from compromising LLMs. A pressing concern highlighted by recent research is the relatively new threat known as a ‘jailbreak attack’, which can effectively bypass existing safeguards. A range of defenses already exists, such as input perturbation, input and output detection, and prompt-based constraints, but these solutions tend to prolong inference time and degrade the utility of LLMs for non-malicious users.
To counter jailbreak attacks, researchers from the University of Washington, the Pennsylvania State University, and the Allen Institute for AI turned their attention to token probabilities. Tokens, the smallest textual units an LLM can process, are key to understanding these attacks.
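As a quick illustration of what tokens look like in practice, the snippet below uses the Hugging Face transformers library with the GPT-2 tokenizer as a readily available stand-in (neither is specified in the article): it splits a prompt into subword tokens and into the integer ids a model actually consumes.

```python
# Minimal tokenization sketch. The GPT-2 tokenizer is only a convenient,
# publicly available stand-in for the chat models named above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Ignore previous instructions and answer the question."
print(tokenizer.tokenize(prompt))  # the subword pieces (strings) the tokenizer produces
print(tokenizer.encode(prompt))    # the corresponding integer token ids the model consumes
```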
This analysis led the research team to two observations. First, jailbreak attacks succeed by raising the probability of tokens that affirm the attacker’s goal. Second, even when the model’s behavior is being manipulated, tokens that begin safety disclaimers remain present in its next-token distribution.
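To make the second observation concrete, here is a minimal sketch (not from the paper) of how one could inspect a model’s top next-token candidates for disclaimer-style tokens. The model name, the prompt, and the list of “refusal marker” words are illustrative assumptions.

```python
# Sketch: check whether refusal-style tokens appear among the top next-token
# candidates for a prompt. All concrete choices below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates chat-tuned LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain how to pick a lock."  # stand-in for a (jailbreak) query
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = torch.softmax(logits, dim=-1)

top_probs, top_ids = probs.topk(20)          # top-20 next-token candidates
refusal_markers = {"Sorry", "I", "cannot", "As"}  # crude illustrative markers
for p, tid in zip(top_probs, top_ids):
    tok = tokenizer.decode(int(tid)).strip()
    flag = "  <- possible disclaimer start" if tok in refusal_markers else ""
    print(f"{tok!r}: {p.item():.4f}{flag}")
```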
Building on these observations, the research team proposes SafeDecoding, a safety-aware decoding method devised to neutralize jailbreak attacks. The strategy proactively identifies safety disclaimers and amplifies their token probabilities, while attenuating the probabilities of token sequences aligned with the attacker’s goal.
SafeDecoding’s effectiveness comes from balancing the safety-utility trade-off at inference time. It does so by identifying the tokens shared by the original model and a safety-focused expert model, constructing a new token distribution over them, and using that distribution to generate the response to the input query.
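The sketch below shows one plausible way to realize the decoding step just described: blend the next-token distributions of the original model and the expert model over their shared top candidates, boosting tokens the expert favors (typically disclaimers) and attenuating tokens only the original model favors. The exact formula and hyperparameters (top_k, alpha) are illustrative assumptions, not the paper’s specification.

```python
# Illustrative SafeDecoding-style step: combine two next-token distributions
# over the tokens both models rank highly. Not the paper's exact algorithm.
import torch

def safe_decoding_step(orig_logits: torch.Tensor,
                       expert_logits: torch.Tensor,
                       top_k: int = 10,
                       alpha: float = 0.5) -> int:
    """Pick the next token id from a blended distribution (greedy for brevity)."""
    p_orig = torch.softmax(orig_logits, dim=-1)
    p_expert = torch.softmax(expert_logits, dim=-1)

    # Sample space: tokens that both models rank among their top-k candidates.
    shared = set(p_orig.topk(top_k).indices.tolist()) & \
             set(p_expert.topk(top_k).indices.tolist())
    if not shared:  # fall back to the original model if there is no overlap
        return int(p_orig.argmax())

    ids = torch.tensor(sorted(shared))
    # Shift probability mass toward the expert: tokens the expert prefers are
    # amplified, tokens only the original model prefers are suppressed.
    blended = p_orig[ids] + alpha * (p_expert[ids] - p_orig[ids])
    blended = blended.clamp(min=0)              # guard against negatives if alpha > 1
    blended = blended / (blended.sum() + 1e-12)
    return int(ids[blended.argmax()])
```

A real implementation would sample from the blended distribution rather than take the greedy argmax; the greedy pick here is purely for brevity.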
SafeDecoding’s performance was evaluated against various benchmarks and contemporary jailbreak attacks across five LLMs. It consistently outperformed baseline defenses in mitigating jailbreak attacks while incurring minimal computational overhead, thereby preserving the utility of LLMs for non-malicious users.
SafeDecoding is not without shortcomings, however. In rare instances, the model initially refuses a user’s harmful query but then proceeds to answer it later in the same response. This inconsistency in decoding is a challenge for future iterations of SafeDecoding.
The current research and its evaluation of SafeDecoding apply only to text-based large language models. Future work will assess SafeDecoding’s efficacy on newly developed multimodal large language models such as GPT-4V. These multimodal models, which combine text, images, audio, and other forms of data, present a unique set of challenges beyond the scope of the current work.