Skip to content Skip to footer

An Extensive Comparison by Innodata: Evaluating Llama2, Mistral, Gemma, and GPT in terms of Accuracy, Offensive Language, Prejudice, and Tendency to Imagine

A recent study by Innodata assessed various large language models (LLMs), including Llama2, Mistral, Gemma, and GPT for their factuality, toxicity, bias, and hallucination tendencies. The research used fourteen original datasets to evaluate the safety of these models based on their ability to generate factual, unbiased, and appropriate content. Ultimately, the study sought to help future research improve the safety and reliability of LLMs in diverse settings.

In determining factuality- i.e., the ability to provide accurate information- Llama2 outshone other models in tasks requiring verifiable facts. By using a mixture of summarization tasks and factual consistency checks, the study could figure out which models were able to provide grounded and correct responses.

On measuring toxicity- the capacity to not produce offensive content- it was clear that while Llama2 could handle toxic content to a good extent, it was unable to maintain this ability during extended interactions.

In terms of bias detection- the scrutiny was concerning any religious, political, gender, or racial prejudice in the generated content. The study found that although all models struggled with this factor, Gemma was somewhat able to refuse answering biased prompts.

The last factor measured the models’ propensity for hallucinations, meaning their tendency to generate incorrect or nonsensical information. Here, Mistral showed a robust ability to avoid producing hallucinatory content, especially in reasoning and multi-turn tasks.

Upon evaluating these models, researchers learned that Llama2 delivered well in factuality and managing toxic content, which made it more suitable for tasks demanding safe responses. But it showed a high propensity for hallucination tasks, and its handling of multi-turn conversations needs substantial improvement.

Mistral avoided hallucinations and was excellent at multi-turn tasks, but was unsuccessful at detecting and managing toxic content. Alternatively, Gemma showed a balanced output but lagged in overall efficiency compared to Llama2 and Mistral.

The GPT models, particularly GPT-4, outpaced the other models in all safety aspects, indicating the advanced engineering and larger parameters in the OpenAI models.

The study emphasized the need for comprehensive safety evaluations of LLMs as they continue to find their place in enterprise environments. While Llama2, Mistral, and Gemma show potential in certain areas, substantial enhancements are needed to reach the standards set by GPT models, indicating the importance of advancements in LLM technology. As this technology evolves, ongoing monitoring and safety assessments will be crucial for the successful and safe integration of LLMs into a variety of applications.

Leave a comment

0.0/5