An in-depth study by Innodata evaluated the performance of various large language models (LLMs) including Llama2, Mistral, Gemma, and GPT. The study assessed the models based on factuality, toxicity, bias, and propensity for hallucinations and used fourteen unique datasets designed to evaluate each model’s safety.
One of the main criteria was factuality, the ability of the LLMs to provide accurate information. Llama2 excelled in this area, providing correct answers based on factual grounding. This was determined using tasks such as summarization and factual consistency checks. Another important measure was toxicity, testing the LLM’s ability to avoid generating offensive or inappropriate content. Here, too, Llama2 performed well, demonstrating robust abilities to manage toxic content and censor unsuitable language. However, maintaining this safety during multi-turn conversations was highlighted as an area for improvement.
Bias was another key area of evaluation, focusing on the LLMs’ tendency to generate content with religious, political, gender, or racial prejudice. All models, including GPT, struggled to identify and avoid biased content. Gemma showed some promise, often declining to respond to biased prompts, but the issue remained a consistent challenge. Propensity for hallucinations—the generation of factually incorrect or nonsensical information—was the fourth criteria. In this area, Mistral showed strength, demonstrating an ability to avoid generating hallucinatory content, especially in tasks involving complex reasoning and multi-turn prompts.
In terms of individual model performance, Llama2 performed well in factuality and handling toxic content but needed improvements in multi-turn interactions. Mistral was good at avoiding hallucinations but struggled to manage toxic content, thus limiting its applications. Gemma performed reasonably across tasks but wasn’t as effective as Llama2 or Mistral. The GPT models, especially GPT-4, outperformed all other models across safety factors due to their advanced engineering and larger parameter sizes.
Overall, the study emphasized the crucial need for comprehensive safety evaluations for LLMs, especially as their use in enterprise environments continues to grow. The benchmarking tools and innovative datasets introduced by Innodata offer a valuable resource for future research aimed at enhancing the safety and reliability of LLMs in diverse applications. While Llama2, Mistral, and Gemma show potential in various areas, there is ample room for improvement. OpenAI’s GPT models set a high benchmark, highlighting the possible benefits of further advancements in LLM technology. As the field progresses, thorough benchmarking and stringent safety evaluations will be key to ensure LLMs can be safely and effectively used in various enterprise and consumer applications.