
The OpenGPT-X team has released a leaderboard for European LLMs, paving the way for the development and assessment of sophisticated multilingual language models.

The OpenGPT-X team has launched the European Large Language Models (LLM) Leaderboard, a key step forward in the creation and assessment of multilingual language models. The project began in 2022 with backing from the BMWK and the support of TU Dresden and a consortium of ten partners spanning several sectors. Its primary goal is to expand language models' capacity to handle multiple languages, thereby reducing digital language barriers and broadening the reach of AI applications throughout Europe.

The digital processing of natural language has progressed significantly in recent years, largely thanks to the wide-ranging development of open-source Large Language Models (LLMs), which have displayed extraordinary abilities in comprehending and generating human language. However, most of these models and the benchmarks used to evaluate them have traditionally focused on English, leaving a gap in support for multilinguality.

The newly released European LLM Leaderboard compares several state-of-the-art language models, each with approximately 7 billion parameters, across a range of European languages. The overarching aim of the OpenGPT-X consortium is to expand language accessibility and ensure that the benefits of AI are not confined to English-speaking regions. To achieve these ambitious goals, the team carried out extensive multilingual training and assessment, testing the models they had developed on several tasks, including logical reasoning, commonsense understanding, multi-task learning, truthfulness, and translation.

Benchmarks such as ARC, HellaSwag, TruthfulQA, GSM8K, and MMLU were machine-translated into 21 of the 24 supported European languages using DeepL, enabling comprehensive evaluations that are comparable across languages. In addition, two other multilingual benchmarks already available for the project's languages were included in the leaderboard.
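To illustrate the idea of evaluating the same benchmark across translated versions, the sketch below scores ARC-style multiple-choice items per language. This is a hypothetical illustration only: the item format, field names, and scoring function are assumptions for demonstration, not the actual OpenGPT-X or leaderboard evaluation code.

```python
# Hypothetical sketch: per-language accuracy on machine-translated
# multiple-choice items. Data layout and function names are illustrative
# assumptions, not the real evaluation format.
from collections import defaultdict

# One ARC-style item, machine-translated (e.g. via DeepL) into two languages.
items = [
    {"lang": "de",
     "question": "Welches Gas nehmen Pflanzen bei der Photosynthese auf?",
     "choices": ["Sauerstoff", "Kohlendioxid", "Stickstoff", "Wasserstoff"],
     "answer": 1},
    {"lang": "fr",
     "question": "Quel gaz les plantes absorbent-elles pendant la photosynthèse ?",
     "choices": ["Oxygène", "Dioxyde de carbone", "Azote", "Hydrogène"],
     "answer": 1},
]

def accuracy_by_language(items, predictions):
    """Compute per-language accuracy from predicted choice indices."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["lang"]] += 1
        if pred == item["answer"]:
            correct[item["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# A model that answered the German item correctly and the French one wrong:
print(accuracy_by_language(items, [1, 0]))  # {'de': 1.0, 'fr': 0.0}
```

Keeping the answer key aligned across translations is what makes the per-language scores directly comparable, which is the point of translating a single benchmark rather than using unrelated native-language test sets.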

The evaluation process for these multilingual models is automated via the AI platform Hugging Face Hub, with TU Dresden providing the vital infrastructure to execute the evaluation tasks on their High-Performance Computing cluster. This setup offers the scalability and efficiency necessary for handling extensive datasets and complex evaluation tasks. The publication of the European LLM Leaderboard signals just the start, with the OpenGPT-X models scheduled for wider publication this summer, facilitating further research and development.

Several benchmarks have been translated and employed in the project to assess the performance of multilingual LLMs. These include ARC and GSM8K, which focus on general education and mathematics; HellaSwag and TruthfulQA, which examine the models' ability to give plausible continuations and truthful responses; and MMLU, which offers a range of tasks to test the models' capabilities across varied domains.

In conclusion, the European LLM Leaderboard by the OpenGPT-X team responds to the demand for broader language accessibility and provides robust evaluation metrics. This pioneering project is crucial for languages traditionally underrepresented in natural language processing and paves the way for more inclusive and adaptable AI applications.
