Language models are advanced artificial intelligence systems that can generate human-like text, but when they are trained on large amounts of data, there is a risk they will inadvertently learn to produce offensive or harmful content. To mitigate this, researchers rely on two primary methods: safety tuning, which aligns the model’s responses to human values but can be bypassed by adversarial prompts and jailbreak strategies; and guardrails, separate classifier models that flag risky content. Guardrails, however, add considerable computational load, since a second large model must process every exchange.
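To make that overhead concrete, the sketch below shows the conventional two-model setup with tiny placeholder networks standing in for full LLMs. The module sizes and the guard head are illustrative assumptions, not taken from any specific guardrail system; the point is simply that every conversation turn must pass through both the chat model and a separate guard model.

```python
import torch
import torch.nn as nn

# Conventional guardrailing: two *separate* networks, a chat model and a
# dedicated guard model. Tiny placeholder encoders stand in for full LLMs;
# in practice each is a multi-billion-parameter transformer, so moderation
# roughly doubles the memory and compute spent per conversation turn.
encoder_layer = lambda: nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
chat_model = nn.TransformerEncoder(encoder_layer(), num_layers=2)
guard_model = nn.TransformerEncoder(encoder_layer(), num_layers=2)   # second full model
guard_head = nn.Linear(64, 2)                                        # harmful / safe

turn = torch.randn(1, 16, 64)                            # embedded conversation turn
reply_states = chat_model(turn)                           # pass 1: generate the reply
harm_logits = guard_head(guard_model(turn).mean(dim=1))   # pass 2: moderate it
```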
Researchers at Samsung R&D Institute have proposed LoRA-Guard, an architecture that effectively merges these two separate models into one. LoRA-Guard attaches low-rank adapters (LoRA) to the frozen backbone of the chat model and trains them, together with a small classification head, to detect potentially harmful content. The system operates in two modes: with the adapters deactivated, the backbone generates responses as usual; with the adapters activated, the same backbone acts as the guard, monitoring for content that needs to be flagged.
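A minimal PyTorch sketch of this dual-mode idea is given below. It is not the authors’ implementation; the class names, rank, and dimensions are illustrative assumptions. A low-rank update is attached to a frozen linear layer of the backbone, and a small guard head classifies the hidden state when the adapter is switched on, while generation uses the untouched weights when it is switched off.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with an optional low-rank (LoRA) update.

    With `adapter_enabled` False the layer behaves exactly like the original
    frozen weight, so chat generation is unaffected; with it True, the
    low-rank term B @ A is added, specialising the layer for guarding.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # backbone stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scaling = alpha / rank
        self.adapter_enabled = False

    def forward(self, x):
        out = self.base(x)
        if self.adapter_enabled:
            out = out + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return out

class ToyLoRAGuard(nn.Module):
    """Illustrative dual-mode model: one shared backbone, two output heads."""
    def __init__(self, d_model=64, vocab=1000, n_harm_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layer = LoRALinear(nn.Linear(d_model, d_model))
        self.lm_head = nn.Linear(d_model, vocab)              # chat mode
        self.guard_head = nn.Linear(d_model, n_harm_labels)   # guard mode

    def forward(self, token_ids, guard_mode: bool):
        self.layer.adapter_enabled = guard_mode
        h = self.layer(self.embed(token_ids)).mean(dim=1)
        return self.guard_head(h) if guard_mode else self.lm_head(h)

model = ToyLoRAGuard()
tokens = torch.randint(0, 1000, (1, 12))
chat_logits = model(tokens, guard_mode=False)   # normal generation path
harm_logits = model(tokens, guard_mode=True)    # moderation path
```

Because only the low-rank matrices and the guard head are trainable, the moderation capability adds just a small fraction of parameters on top of the shared backbone, which is the source of the efficiency gains described below.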
LoRA-Guard’s main advantage is that it drastically reduces the computational and memory cost of moderation: because the guard reuses the chat model’s parameters rather than duplicating them, it is feasible in low-resource settings such as on-device deployment. And since generation always runs with the adapters switched off, there is no degradation in chat performance when the system switches between its two tasks.
In evaluations on several moderation datasets, LoRA-Guard matches or outperforms baseline guard models while using significantly fewer trainable parameters. The experiments also revealed an asymmetry in generalization: models trained on certain datasets transfer well to others, but not vice versa, pointing to differences in dataset coverage and labeling that merit further exploration.
LoRA-Guard is a significant advancement for moderated conversational AI, drastically reducing the overhead needed to moderate content effectively. Through parameter sharing between the chat model and the guard, performance is maintained or even improved. Because the backbone weights remain frozen and generation uses the original parameters, LoRA-Guard also avoids “catastrophic forgetting,” the common problem where fine-tuning for one task erases a model’s existing abilities. As such, it is a promising route to robust content moderation for AI language models, particularly as on-device large language models become increasingly commonplace, enabling safer AI conversations across a broader range of applications. All credit for this research goes to the researchers of this project at Samsung R&D Institute.