Research in multilingual natural language processing (NLP) is progressing rapidly, with the aim of building language models that can interpret and generate text across many languages. The central goal of this work is to improve global communication and access to information, making artificial intelligence technologies usable across diverse linguistic backgrounds.
However, creating such models brings significant challenges, largely due to the complexity of handling many languages at once. One central issue is that current research favors high-resource languages like English and Chinese, leaving less widely spoken languages underrepresented. This narrow focus restricts both the reach and the fairness of AI technologies. Addressing the imbalance requires new methods for improving the quality and diversity of multilingual datasets, so that AI models can perform well across a wide range of languages.
Traditional approaches to improving multilingual language models usually involve translating training data from English into other languages. While somewhat beneficial, this introduces problems of its own, including translation artifacts that can hurt model performance. Reliance on translation also tends to reduce the diversity of the data, which is crucial for robust training. Collecting multilingual preference data through human annotation is one alternative, but it is expensive and time-consuming, making it impractical at scale.
Researchers from Cohere For AI have developed a new, scalable method for producing high-quality multilingual feedback data to improve the performance of multilingual large language models (LLMs). The team pairs diverse multilingual prompts with completions generated by multiple LLMs, sidestepping the flaws that translation-based pipelines introduce. This approach both increases the diversity of the data and enables stronger model performance.
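To make the idea concrete, the following is a minimal sketch of how preference pairs could be assembled from the completions of several models. The `score` function, the data layout, and the example prompt are illustrative assumptions, not Cohere's actual pipeline, which relies on stronger judging signals.

```python
# Sketch only: assembling preference pairs from completions produced by
# several different LLMs for the same prompt. The score() heuristic is a
# placeholder; a real pipeline would use a reward model or an LLM judge.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # higher-scoring completion
    rejected: str  # lower-scoring completion

def score(completion: str) -> float:
    """Placeholder quality score (toy heuristic, for illustration only)."""
    return float(len(completion))

def build_pairs(prompt: str, completions: list[str]) -> list[PreferencePair]:
    ranked = sorted(completions, key=score, reverse=True)
    # Pair the best-scoring completion against each weaker one.
    return [PreferencePair(prompt, ranked[0], w) for w in ranked[1:]]

# Completions for one (hypothetical) non-English prompt, e.g. one per model.
pairs = build_pairs(
    "Explique la photosynthèse.",  # French: "Explain photosynthesis."
    ["Une réponse détaillée et correcte ...", "Une réponse courte.", "Hors sujet."],
)
```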
The methodology translates around 50,000 English prompts into 22 other languages using the NLLB 3.3B model. These translated prompts are then used to generate completions directly in each language, which helps preserve diversity and data quality. The team found that completions generated directly in the target language contained significantly fewer translation artifacts than completions translated from English. The result is a diverse set of multilingual preference pairs, which are vital for effective preference optimization.
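As an illustration of the translation step, here is a minimal sketch of prompt translation with NLLB through the Hugging Face transformers library. The checkpoint name matches the 3.3B model mentioned above, but the target language code (fra_Latn, i.e. French) and the generation settings are assumptions made for the example.

```python
# Sketch: translating an English prompt with NLLB via Hugging Face
# transformers. FLORES-200 codes such as "fra_Latn" identify languages.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-3.3B"  # the checkpoint named in the study
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate(prompt: str, target_lang: str = "fra_Latn") -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Force the decoder to begin in the target language.
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=200,
    )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

print(translate("Explain photosynthesis in simple terms."))
```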
The researchers compared their model’s performance against several state-of-the-art multilingual LLMs. The model achieved a 54.4% win rate against Aya 23 8B, the leading multilingual LLM in its parameter class, and a 69.5% win rate or higher against other popular models such as Gemma-1.1-7B-it, Meta-Llama3-8B-Instruct, and Mistral-7B-Instruct-v0.3.
In addition, the study found that increasing the number of languages in the training data consistently improved the model's performance. Training on five languages yielded a 54.9% win rate on unseen languages, compared with 46.3% when training only on English. The researchers also found online preference optimization methods, such as Reinforcement Learning from Human Feedback (RLHF), to be more effective than offline methods such as Direct Preference Optimization (DPO).
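For reference, the offline DPO objective mentioned above can be written compactly. The sketch below is the standard published DPO loss (Rafailov et al., 2023) rather than code from this study; it takes the summed token log-probabilities of each completion under the policy being trained and under a frozen reference model.

```python
# Sketch: the standard DPO loss over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | prompt)
    beta: float = 0.1,                    # strength of the implicit KL penalty
) -> torch.Tensor:
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the gap between the chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```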
In conclusion, the research from Cohere For AI demonstrates the vital role of high-quality, diverse multilingual data in training effective multilingual language models. The methodologies introduced help address the challenges of data scarcity and quality, yielding performance improvements across a broad range of languages.