Language-related machine learning has advanced remarkably in recent years, with ChatGPT as the most visible example of a model that excels at complex language processing tasks. However, many mainstream large language models are trained mostly on English: LLaMA is pre-trained on an English-dominant corpus, and LaMDA, proposed by Google, is pre-trained on text that is over 90% English. This limits their performance in non-English languages.
This limitation is a concern for non-English users, and it prompted researchers at the School of Computer Science, Fudan University, to investigate how language generation and instruction-following capabilities can be transferred from English to non-English languages. Their study examines the impact of three key factors on transfer: vocabulary extension, further pretraining, and instruction tuning. To this end, they analyzed the performance of five models, each with a different pretraining scale: LLaMA, LLaMA2, Chinese LLaMA, Chinese LLaMA2, and Open Chinese LLaMA.
The research team found that extending the vocabulary of English-dominant models actually diminishes performance in Chinese, whereas increasing the pretraining scale improves response quality at first, after which the gains plateau. They also found that English proficiency suffers when a model is trained exclusively on Chinese. An evaluation across 13 low-resource languages showed that SFT (supervised fine-tuning) data can significantly boost response quality, with Arabic, Indonesian, and Vietnamese showing the best performance. Furthermore, the code-switching samples in the models' responses suggest that LLaMA learns cross-lingual semantic alignment during pretraining, which enhances its transferability.
This study offers valuable insights for non-English LLM development, emphasizing that effective transfer requires a nuanced approach rather than simply extending the vocabulary or increasing the pretraining scale. By showing how capabilities can carry over from English, it points toward more capable language models for non-English users.