Large language models (LLMs) have gained significant attention in recent years, but their safety in multilingual contexts remains a critical concern. Studies have shown high toxicity levels in multilingual LLMs, highlighting the urgent need for effective multilingual toxicity mitigation strategies.
Reducing toxicity in open-ended generation for non-English languages remains difficult because existing approaches are resource-intensive: they typically require large collections of toxic and non-toxic samples that simply do not exist for most target languages. As a result, researchers often fall back on translated English data.
To address these challenges, researchers from the Department of Computer Science at Brown University have developed a method for cross-lingual detoxification of LLMs using English preference tuning, with no translation required. Using Direct Preference Optimization (DPO) with English-only training data, the approach significantly reduces toxicity levels across 17 languages. It proves effective for a range of multilingual LLMs, demonstrating zero-shot cross-lingual generalization and contradicting prior assumptions about the limits of cross-lingual transfer.
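To make the preference-tuning step concrete, the sketch below shows the standard DPO objective applied to a safety preference pair: a non-toxic ("chosen") and a toxic ("rejected") continuation of the same English prompt. The function and argument names are illustrative, not taken from the paper's code, and the log-probabilities are assumed to have already been computed under the policy being tuned and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for one batch of preference pairs.

    Each argument is a 1-D tensor holding the summed log-probability of a full
    continuation (one value per example). "Chosen" is the non-toxic
    continuation, "rejected" is the toxic one.
    """
    # Log-ratios of the tuned policy against the frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO increases the margin between the chosen and rejected log-ratios,
    # scaled by beta; minimizing the negative log-sigmoid does exactly that.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

The key point for this work is that the preference pairs are English-only; the detoxification effect nonetheless transfers to the other 16 languages at inference time.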
To understand why this transfer occurs, the method uses probes to locate sources of toxicity inside the LLM and then performs causal interventions on them. A binary toxicity classification probe is trained on the Jigsaw dataset, and the top 100 candidate sources of toxicity are identified by averaging neuron activations over 20 generated tokens on English prompts from the RTP-LX dataset. These candidate neurons are then manipulated to measure how toxicity changes across languages.
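The following sketch illustrates the two analysis steps described above under simplifying assumptions: a logistic-regression probe trained on pre-extracted, Jigsaw-labelled activation vectors, and candidate neurons ranked by their mean activation over the tokens generated for RTP-LX-style prompts. The array shapes, function names, and the use of scikit-learn are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# --- Step 1: binary toxicity probe on Jigsaw-style labelled activations ---
# X: (num_examples, hidden_dim) activation vectors extracted from the LLM
# y: (num_examples,) labels, 1 = toxic, 0 = non-toxic
def train_toxicity_probe(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe

# --- Step 2: rank candidate source neurons on English RTP-LX-style prompts ---
# activations: (num_prompts, num_tokens, num_neurons) collected while the
# model generates 20 tokens per prompt
def top_candidate_neurons(activations: np.ndarray, k: int = 100) -> np.ndarray:
    mean_per_neuron = activations.mean(axis=(0, 1))  # average over prompts and tokens
    return np.argsort(-mean_per_neuron)[:k]          # indices of the k most active neurons
```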
Results demonstrate the dual multilinguality of MLPs in LLMs. Of the top 100 candidate sources of toxicity, 36 were confirmed as actual sources, and causal intervention experiments showed that these neuron activations significantly influence the toxicity of generated content across languages. Modifying just those 36 of 196,608 neuron activations reduced the average toxicity level across the 17 languages from 0.175 to 0.032.
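A causal intervention of this kind can be implemented with a forward hook that suppresses the identified activations during generation. The sketch below is a minimal version, assuming the intervention zeroes the selected neurons in one MLP sub-module's output; the module and variable names are placeholders, not the paper's code, and the actual intervention strategy (zeroing, clamping, or steering) may differ.

```python
import torch

def suppress_neurons(layer_module: torch.nn.Module, neuron_ids: list[int]):
    """Register a forward hook that zeroes selected neuron activations.

    layer_module: the MLP sub-module whose output is intervened on.
    neuron_ids: indices of the neurons identified as sources of toxicity.
    Returns the hook handle; call handle.remove() to restore normal behavior.
    """
    ids = torch.tensor(neuron_ids, dtype=torch.long)

    def hook(module, inputs, output):
        patched = output.clone()
        # Knock out the selected activations along the neuron dimension.
        patched[..., ids.to(patched.device)] = 0.0
        return patched  # a returned tensor replaces the module's output

    return layer_module.register_forward_hook(hook)
```

Generating continuations with and without the hook attached, and scoring both with a toxicity classifier, gives the kind of before/after comparison reflected in the 0.175 to 0.032 drop reported above.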
This study shows that safety preference tuning with DPO can effectively detoxify LLMs, and that the effect is robust across a range of multilingual models. The findings offer a practical approach to multilingual toxicity mitigation and shed light on when English-only tuning generalizes. Importantly, the researchers established that bilingual sentence retrieval can predict the cross-lingual generalizability of English safety preference tuning.
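As a rough illustration of how such a retrieval score could be computed, the sketch below measures how often the true translation of each English sentence is its nearest neighbor in the model's embedding space, given aligned sentence embeddings (for example, mean-pooled hidden states) for a parallel English/target-language corpus. The embedding choice and the interpretation that higher accuracy signals better transfer of English safety tuning follow the study's framing, but the implementation details here are assumptions.

```python
import numpy as np

def bilingual_retrieval_accuracy(eng_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Retrieval accuracy between aligned English and target-language sentences.

    eng_embs, tgt_embs: (num_sentences, dim) sentence embeddings where row i
    of each matrix corresponds to the same translation pair.
    """
    # Cosine similarity between every English / target-language sentence.
    eng = eng_embs / np.linalg.norm(eng_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = eng @ tgt.T
    # A hit means the aligned translation is the most similar sentence.
    hits = sims.argmax(axis=1) == np.arange(len(eng))
    return float(hits.mean())
```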
All credit for this groundbreaking research goes to the researchers at Brown University. More details can be found in the published paper.