The rapid advance of large language models in natural language processing has created challenges in serving less commonly spoken languages. Building the majority of artificial intelligence (AI) systems around high-resource languages entrenches a technological divide across linguistic communities that remains largely unaddressed.
This paper introduces SambaLingo, a method for adapting existing high-quality language models to new languages. In doing so, it not only leverages the strengths of pre-existing models but also tailors those strengths to the unique attributes of the target language.
Previously, most attempts to tackle this problem started from scratch, producing either monolithic multilingual models or language-specific models. These approaches face well-known obstacles: the "curse of multilinguality", data scarcity, and the considerable computational resources required for training. By contrast, adapting predominantly English models to new languages is a more promising alternative, and such adapted models have consistently been shown to outperform language-specific models trained from scratch.
The SambaLingo pipeline starts by selecting an appropriate base model, one known to perform exceptionally well in its original language. In this study, the researchers chose the open-source Llama2 model, known for its strong English-language capabilities. They then expanded the model's vocabulary with non-overlapping tokens from the target language, initializing each new token's embedding from the sub-word embeddings the original tokenizer assigns to it. This step ensures efficient tokenization and faithful representation of the new language, allowing for seamless adaptation.
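As an illustration, here is a minimal sketch of this vocabulary-expansion step using the Hugging Face transformers API. The target-language tokenizer name is a placeholder, and the exact SambaLingo initialization recipe may differ in detail; the sketch initializes each new token as the mean of the embeddings of the sub-word pieces the original tokenizer splits it into.

```python
# Sketch: expand a base model's vocabulary with target-language tokens and
# initialize the new embeddings from the original tokenizer's sub-word pieces.
# Model and tokenizer names are placeholders, not the exact SambaLingo assets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)
orig_tokenizer = AutoTokenizer.from_pretrained(BASE)  # untouched copy for sub-word lookup
target_tokenizer = AutoTokenizer.from_pretrained("target-language-tokenizer")  # placeholder

# Target-language tokens the base vocabulary does not already contain.
new_tokens = [t for t in target_tokenizer.get_vocab() if t not in tokenizer.get_vocab()]

# Record each new token's sub-word decomposition *before* extending the vocabulary.
sub_ids = {t: orig_tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(t)
        # Mean of the embeddings of the sub-word pieces the base model
        # would otherwise have used for this token.
        emb[new_id] = emb[sub_ids[t]].mean(dim=0)
```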
A mixture of English and target-language web data was then used to continue training the model. The researchers used a 1:3 English-to-target-language data ratio, biased towards the target language, to balance preserving the model's existing knowledge against adapting it to the new linguistic environment.
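With the Hugging Face datasets library, such a mixture can be expressed as a probabilistic interleaving of two corpora. The dataset names below are placeholders for whatever English and target-language web corpora are used, not the actual SambaLingo training data.

```python
# Sketch: interleave English and target-language web data at a 1:3 ratio.
# Dataset names are placeholders for the actual corpora.
from datasets import load_dataset, interleave_datasets

english_data = load_dataset("english-web-corpus", split="train", streaming=True)
target_data = load_dataset("target-language-web-corpus", split="train", streaming=True)

# Sample from the target language three times as often as from English (1:3).
mixed = interleave_datasets(
    [english_data, target_data],
    probabilities=[0.25, 0.75],
    seed=42,
)
```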
Next, a two-stage process of supervised fine-tuning followed by direct preference optimization (DPO) was used to align the model with human preferences. The researchers used the ultrachat-200k dataset for supervised fine-tuning; for direct preference optimization they used the ultrafeedback and cai-conversation-harmless datasets, with a 10:1 ratio of English to machine-translated data.
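For reference, the DPO objective in the second stage can be written in a few lines of PyTorch. This is a generic sketch of the standard DPO loss (Rafailov et al., 2023), not SambaLingo's training code; it assumes per-sequence log-probabilities under the policy and the frozen reference model (here, the supervised fine-tuned checkpoint) have already been computed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss. Each argument is a tensor of per-example sequence
    log-probabilities for the chosen/rejected responses under the policy
    or the frozen reference model; beta scales the implicit KL penalty."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```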
The SambaLingo models were evaluated on language modeling, translation, text classification, question answering, and a range of natural language understanding benchmarks. The nine typologically diverse languages covered are Arabic, Thai, Turkish, Japanese, Hungarian, Russian, Bulgarian, Serbian, and Slovenian. Across all benchmarks, the SambaLingo models consistently outperformed the existing state-of-the-art models in those languages.
The SambaLingo models overcame long-standing difficulties in adapting existing, dominant models to new languages. By taking some of the strongest pre-existing models and tailoring them to new linguistic landscapes, this approach offers an effective, scalable answer to linguistic barriers. With its state-of-the-art performance and alignment with human preferences, SambaLingo marks a path forward for natural language processing: a world where the benefits of artificial intelligence aren't constrained by language, promoting inclusion and accessibility for all linguistic communities.