Large Language Models (LLMs) have surpassed previous generations of language models on a wide range of tasks, sometimes matching or even exceeding human performance. However, evaluating their true capabilities is challenging because of potential contamination in test datasets and a lack of benchmarks that accurately assess their abilities.
Most studies assessing LLMs have focused primarily on English, revealing a significant gap between LLM proficiency in English and in other languages. Evaluating LLM performance outside English is difficult because benchmarks for reasoning, conversation, and dialogue are scarce across languages.
Earlier work on the MEGA benchmark provided insights into the multilingual capabilities of LLMs. Notably, GPT-4 performs well overall, though its performance drops for languages written in non-Latin scripts and for low-resource languages.
Researchers from Microsoft expanded the MEGA benchmark into MEGAVERSE, covering 22 datasets and 83 languages, including many low-resource African languages. Their findings indicate that larger commercial models such as GPT-4 and Gemini-Pro outperform smaller models such as Gemma, Llama, and Mistral on most datasets. The smaller models struggle with multilingual performance, suggesting that fine-tuning, language-family-based models, and language-specific models could help close the gap.
On the multimodal datasets, GPT-4-Vision outperformed LLaVA and Gemini-Pro-Vision. A language model's efficiency is also tied to its tokenizer's fertility, i.e., the average number of tokens produced per word: the study found that fertility was lower for Latin-script languages such as English and Spanish than for morphologically complex languages such as Telugu, Malay, and Malayalam.
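As a rough illustration of the fertility metric, the sketch below computes tokens per word with a Hugging Face tokenizer. The tokenizer choice and example sentences are assumptions for illustration only, not the setup used in the paper.

```python
# Minimal sketch of measuring tokenizer fertility: the average number of
# subword tokens produced per whitespace-separated word. The tokenizer name
# and example sentences are illustrative placeholders, not from the study.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer works here

english = ["The cat sat on the mat."]
telugu = ["పిల్లి చాప మీద కూర్చుంది."]  # rough Telugu rendering of the same sentence

print("English fertility:", fertility(tokenizer, english))  # close to one token per word
print("Telugu fertility:", fertility(tokenizer, telugu))    # noticeably higher
```

Higher fertility means more tokens are spent per word, which raises cost and shrinks the effective context window for that language.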
Dataset contamination is a significant issue in benchmarking studies, particularly in languages other than English. Because many of the MEGAVERSE datasets are publicly available, they may have appeared in the models' training data, which can inflate benchmark scores. To address this, the researchers aim to improve contamination detection and adopt measures to prevent it in future work.
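To make the contamination concern concrete, one common heuristic (not necessarily the analysis performed in this study) is to check how many n-grams from a benchmark example also occur in a candidate training corpus; a minimal sketch with hypothetical placeholder text:

```python
# Illustrative n-gram overlap heuristic for flagging possible test-set
# contamination. This is a generic sketch, not the contamination analysis
# used in the study; the corpus and example strings are placeholders.
def ngrams(text, n):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_example, training_corpus, n=8):
    example = ngrams(benchmark_example, n)
    if not example:
        return 0.0
    return len(example & ngrams(training_corpus, n)) / len(example)

training_corpus = "placeholder text standing in for a crawled training corpus ..."
benchmark_example = "placeholder text standing in for a benchmark test instance ..."

# A high overlap ratio suggests the example may have been seen during training.
if overlap_ratio(benchmark_example, training_corpus) > 0.5:
    print("possible contamination")
```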
Overall, these findings offer valuable insights for developers and researchers working to improve language models and machine learning technologies. The research team emphasizes the need to further investigate methods and models that enhance multilingual performance. Despite certain limitations, this study establishes a useful baseline for future advances in large language models.