The surge of low-quality data online has instilled potentially harmful knowledge in Large Language Models (LLMs). The risk grows when LLMs are deployed in chatbots, where users may be exposed to harmful advice or aggressive interactions. Existing toxicity evaluation datasets focus mainly on English, which limits their ability to detect multilingual toxicity and compromises the safety of LLMs. To address this problem, the Allen Institute for AI (AI2) and Carnegie Mellon University (CMU) teamed up to study how toxicity varies with the availability of language resources and with design decisions such as model size and alignment method.
Standard methods for evaluating toxicity in LLMs do not effectively capture multilingual toxicity. To tackle this issue, researchers at AI2 and CMU introduced PolygloToxicityPrompts, a dataset of 425,000 naturally occurring prompts spanning 17 languages. The prompts are short text snippets extracted from the web and cover a wide range of toxicity levels. The dataset builds on previous work such as RealToxicityPrompts but expands its scope to multiple languages.
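For readers who want to explore the data, below is a minimal sketch of how one might load and skim the prompts with the Hugging Face `datasets` library. The Hub identifier, configuration name, and column names are assumptions for illustration and should be checked against the official release.

```python
# Minimal sketch: load one language configuration of PolygloToxicityPrompts.
# The repository id ("ToxicityPrompts/PolygloToxicityPrompts"), the config
# name ("ptp-hi"), and the split are assumptions, not confirmed details.
from datasets import load_dataset

ptp_hindi = load_dataset("ToxicityPrompts/PolygloToxicityPrompts", "ptp-hi", split="train")

print(ptp_hindi.column_names)  # expect a prompt text field plus toxicity metadata
print(ptp_hindi[0])            # one naturally occurring web snippet
```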
The design of PolygloToxicityPrompts allows developers to surface toxic degeneration in LLMs at an early stage of a conversation. The inclusion of multiple languages addresses the gap left by predominantly English-centric datasets. The researchers used the Perspective API to measure the toxicity of prompts and model generations, and from these scores computed each model's average toxicity across all of its continuations. The study revealed that languages with less high-quality data available, such as Hindi and Czech, showed higher toxicity, while languages such as Russian and Dutch showed lower toxicity.
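As a rough illustration of that measurement step, the sketch below queries the Perspective API's TOXICITY attribute through Google's client library and averages the scores over a handful of continuations. The API key, the helper name, and the toy continuations are placeholders; this is not the authors' exact evaluation code.

```python
# Sketch: score texts with the Perspective API's TOXICITY attribute and average
# the scores over a model's continuations for one prompt. API_KEY is a placeholder;
# the client setup follows Google's published Perspective API quickstart.
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity_score(text: str, language: str = "en") -> float:
    """Return Perspective's TOXICITY probability for a single text."""
    response = client.comments().analyze(body={
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {"TOXICITY": {}},
    }).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Average toxicity over a model's continuations of one prompt (toy example).
continuations = ["a perfectly polite reply", "a rude, insulting reply"]
avg_toxicity = sum(toxicity_score(c) for c in continuations) / len(continuations)
print(f"average continuation toxicity: {avg_toxicity:.3f}")
```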
The research further examined the effect of model size and alignment techniques on toxicity. Results indicated that toxicity increases with model size for base LLMs, suggesting that larger models absorb more toxicity from their training data. In contrast, LLMs that underwent instruction and preference tuning exhibited lower toxicity than base models. The study also noted that toxicity and safety, while related, are distinct concepts, each requiring its own mitigation.
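To make that comparison concrete, here is a hedged sketch (not the authors' pipeline) that generates continuations with a base checkpoint and an instruction-tuned variant and compares their mean toxicity. The checkpoint names and sampling settings are illustrative assumptions, and the open-source Detoxify classifier stands in for the Perspective API used in the study.

```python
# Sketch: compare mean continuation toxicity of a base vs. instruction-tuned model.
# Checkpoints are illustrative; Detoxify's multilingual classifier stands in for
# the Perspective API scores used in the paper.
from detoxify import Detoxify
from transformers import pipeline

prompts = ["The people who post comments like that are"]
checkpoints = {"base": "Qwen/Qwen2-0.5B", "tuned": "Qwen/Qwen2-0.5B-Instruct"}

scorer = Detoxify("multilingual")  # predict() returns a dict of scores including "toxicity"

for name, checkpoint in checkpoints.items():
    generator = pipeline("text-generation", model=checkpoint)
    scores = []
    for prompt in prompts:
        outputs = generator(prompt, max_new_tokens=30, do_sample=True, num_return_sequences=5)
        for out in outputs:
            continuation = out["generated_text"][len(prompt):]
            scores.append(scorer.predict(continuation)["toxicity"])
    print(f"{name}: mean continuation toxicity = {sum(scores) / len(scores):.3f}")
```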
In conclusion, the PolygloToxicityPrompts dataset provides a valuable resource for evaluating and reducing toxicity in LLMs across languages. It offers insight into how prompt language, model size, and alignment methods shape toxic behavior. By facilitating proactive moderation and multilingual content filtering, the dataset can contribute to a safer online environment.
The full paper and dataset are available online, and all credit for this research goes to the project's researchers.