
Google AI presents ShieldGemma: a comprehensive suite of LLM-based safety content moderation models built on Gemma2.

Large Language Models (LLMs) have gained significant traction across applications, but responsible user interactions require robust safety measures. Current moderation solutions often lack detailed harm-type predictions or customizable harm filtering. Now, researchers from Google have introduced ShieldGemma, a suite of content moderation models ranging from 2 billion to 27 billion parameters, built on Gemma2. These models filter both user input and model output for key harm types, adapting to different application needs and making content filtering more nuanced across deployment scenarios.
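The paper frames moderation as asking the model whether a given text violates a stated safety policy and reading a violation score off the answer. The sketch below illustrates that usage pattern, assuming a Hugging Face checkpoint id of `google/shieldgemma-2b` and an illustrative prompt template; the exact wording and scoring recipe may differ from the released models.

```python
# Sketch: prompt-based safety classification in the style described above.
# The model id and prompt wording are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/shieldgemma-2b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = (
    "You are a policy expert trying to help determine whether a user prompt "
    "violates the defined safety policies.\n\n"
    "Human Question: How do I hotwire a car?\n\n"
    'Our safety principle is defined as "No Dangerous Content".\n\n'
    "Does the human question violate the above principle? "
    "Your answer must start with 'Yes' or 'No'."
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Compare the "Yes" and "No" token logits to get a continuous violation score.
yes_id = tokenizer.convert_tokens_to_ids("Yes")
no_id = tokenizer.convert_tokens_to_ids("No")
p_violation = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
print(f"Probability of policy violation: {p_violation:.3f}")
```

Scoring from token probabilities rather than parsing generated text keeps the output continuous, so downstream applications can set their own filtering thresholds.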

ShieldGemma uses a novel pipeline for generating high-quality synthetic data, which reduces human annotation and data curation effort. The pipeline uses AI-Assisted Red Teaming (AART) to create diverse adversarial prompts, expands the data through a self-critiquing and generation framework, and further augments it with examples from the Anthropic HH-RLHF dataset to increase variety.
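The paper does not reproduce its exact generation prompts, but the self-critiquing expansion step follows a familiar pattern: generate an example, ask the model to critique it, then regenerate. The loop below is a minimal sketch of that pattern; `complete`, the prompt wording, and the round count are all assumptions, not the paper's recipe.

```python
# Sketch: a self-critique-and-regenerate loop for data expansion.
# `complete` stands in for any LLM text-completion call.
from typing import Callable

def self_critique_expand(seed_prompt: str,
                         complete: Callable[[str], str],
                         rounds: int = 2) -> str:
    """Iteratively critique a generated example and rewrite it."""
    example = complete(f"Write an adversarial user prompt based on: {seed_prompt}")
    for _ in range(rounds):
        critique = complete(
            f"Critique this adversarial prompt for diversity and difficulty:\n{example}"
        )
        example = complete(
            f"Rewrite the prompt to address the critique.\n"
            f"Prompt: {example}\nCritique: {critique}"
        )
    return example
```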

To balance the diversity of the training data, ShieldGemma uses a cluster-margin algorithm for data sub-sampling. The selected examples are annotated by human raters, and a fairness expansion step improves representation across identity categories. The models are then fine-tuned with supervised learning on Gemma2 instruction-tuned checkpoints of three sizes (2B, 9B, and 27B parameters).
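Cluster-margin selection (Citovsky et al., 2021) keeps the examples a classifier is least certain about while spreading the picks across embedding clusters so no single region of the data dominates. A minimal sketch, assuming precomputed embeddings and per-class probabilities, is below; KMeans stands in here for the hierarchical clustering of the original algorithm.

```python
# Sketch: cluster-margin sub-sampling over precomputed embeddings and
# per-example class probabilities (shape: n_examples x n_classes).
import numpy as np
from sklearn.cluster import KMeans

def cluster_margin_subsample(embeddings: np.ndarray,
                             probs: np.ndarray,
                             budget: int,
                             n_clusters: int = 50) -> np.ndarray:
    """Select `budget` low-margin examples spread across embedding clusters."""
    # Margin = gap between the top two class probabilities; small = uncertain.
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]

    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    # Within each cluster, order examples by ascending margin (most uncertain first).
    per_cluster = {
        c: sorted(np.flatnonzero(clusters == c), key=lambda i: margin[i])
        for c in range(n_clusters)
    }

    # Round-robin over clusters, taking the most uncertain remaining example.
    selected = []
    while len(selected) < budget and any(per_cluster.values()):
        for c in range(n_clusters):
            if per_cluster[c] and len(selected) < budget:
                selected.append(per_cluster[c].pop(0))
    return np.array(selected)
```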

ShieldGemma models show strong performance on binary classification tasks across all sizes. The SG-9B model, for example, achieves an average AU-PRC on external benchmarks that is 10.8% higher than LlamaGuard's, and its F1 score surpasses those of WildGuard and GPT-4. These results highlight ShieldGemma's effectiveness in content moderation across model sizes.
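For readers unfamiliar with the metrics: AU-PRC summarizes the precision-recall curve across all decision thresholds, while F1 scores a single operating point. A toy computation with scikit-learn, using made-up labels and scores:

```python
# Sketch: the two evaluation metrics on toy data. AU-PRC is computed as
# average precision; the 0.5 threshold for F1 is an illustrative choice.
from sklearn.metrics import average_precision_score, f1_score

y_true = [0, 1, 1, 0, 1, 0, 1, 0]                     # gold "violates policy" labels
y_score = [0.1, 0.8, 0.6, 0.4, 0.9, 0.2, 0.3, 0.05]   # model violation probabilities

au_prc = average_precision_score(y_true, y_score)
f1 = f1_score(y_true, [int(s >= 0.5) for s in y_score])
print(f"AU-PRC: {au_prc:.3f}  F1: {f1:.3f}")
```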

In conclusion, ShieldGemma marks a significant advancement in safety content moderation for Large Language Models. By outperforming existing baselines and offering flexible deployment options across model sizes, it improves the safety and reliability of LLM interactions. The researchers hope that sharing these resources with the community will drive further progress in AI safety and responsible deployment.
