
Introducing Sailor: a family of open language models, ranging from 0.5B to 7B parameters, built for Southeast Asian (SEA) languages.

Large Language Models (LLMs) have advanced rapidly in recent years, driven largely by the exponential growth of data on the internet and ongoing improvements in pre-training methods. Despite this progress, LLMs’ dependence on English-dominated datasets limits their performance in other languages. This challenge, known as the “curse of multilinguality,” means that models trained predominantly on English data underperform in non-English languages because those languages receive inadequate exposure during pre-training.

To address this problem, researchers from Sea AI Lab and SUTD, both in Singapore, presented the Sailor project in recent research. The project comprises a set of open language models developed specifically for Southeast Asian (SEA) languages. With parameters ranging from 0.5B to 7B, the models are sized to accommodate the linguistic diversity of the region. They are built on Qwen1.5, a flexible base language model well suited to multilingual applications.

The Sailor models are produced by continually pre-training Qwen1.5 on a corpus of 200B to 400B tokens. The corpus is dominated by languages critical to the SEA region: English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training applies several strategies over this data to enhance model performance.

One notable technique used during training is Byte Pair Encoding (BPE) dropout, which bolsters the models’ robustness. By randomly perturbing how words are segmented into subwords, it improves the model’s ability to generalize across diverse language patterns while mitigating overfitting. The training pipeline also incorporates strict data-cleaning measures and a deduplication process that ensure the high quality of the training set, further improving the Sailor models’ overall performance.
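The idea behind BPE dropout can be shown with a toy example: during encoding, each applicable merge is randomly skipped with some probability, so the same word yields different subword segmentations across epochs. The merge table and words below are illustrative, not Sailor’s actual vocabulary, and real BPE applies merges by global priority rather than this simplified sequential scan.

```python
import random

# Toy merge table, in learned priority order (illustrative only).
MERGES = [("s", "a"), ("sa", "i"), ("l", "o"), ("lo", "r"), ("sai", "lor")]

def bpe_encode(word, dropout=0.0, rng=random):
    """Greedy BPE with dropout: each applicable merge is skipped
    with probability `dropout`, yielding varied segmentations."""
    tokens = list(word)
    for left, right in MERGES:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                if rng.random() >= dropout:  # apply merge unless dropped
                    tokens[i:i + 2] = [left + right]
                    continue
            i += 1
    return tokens

# Without dropout, segmentation is deterministic:
print(bpe_encode("sailor"))              # ['sailor']
# With dropout, repeated calls produce different subword splits:
random.seed(0)
print(bpe_encode("sailor", dropout=0.5))
```

Exposing the model to many segmentations of the same word is what makes it more robust to rare words and noisy text.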

The research team also leverages tiny proxy models to optimize the blend of training data. Rather than sweeping hyperparameters such as the data mixture ratio on the full-size models, they tune them cheaply on small proxies, making the training process more efficient and improving final model performance.
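A minimal sketch of this idea: enumerate candidate mixture ratios, train a small proxy on each, and keep the mixture with the lowest validation loss. `train_proxy_model` here is a hypothetical stand-in that fakes a loss surface; the paper's actual procedure and data buckets differ.

```python
def train_proxy_model(mixture):
    """Placeholder for training a tiny proxy model on `mixture` and
    returning its validation loss. The quadratic loss surface below is
    fabricated purely for illustration."""
    target = {"en": 0.3, "zh": 0.1, "sea": 0.6}  # assumed optimum
    return sum((mixture[k] - target[k]) ** 2 for k in mixture)

def candidate_mixtures(step=0.1):
    """Enumerate mixtures over three data buckets that sum to 1.0."""
    n = round(1 / step)
    for en in range(n + 1):
        for zh in range(n + 1 - en):
            sea = n - en - zh
            yield {"en": en * step, "zh": zh * step, "sea": sea * step}

best = min(candidate_mixtures(), key=train_proxy_model)
print(best)  # the mixture whose proxy loss is lowest
```

The payoff is that each proxy run is orders of magnitude cheaper than a full pre-training run, so the mixture search becomes tractable.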

The versatility and efficacy of the Sailor models were demonstrated in experiments on tasks testing comprehension and reasoning, including examination-style benchmarks, question answering, reading comprehension, and commonsense reasoning. The models held up impressively against diverse baselines, and the results indicate their potential for addressing linguistic challenges in the SEA region.

In summary, the research provides an insightful blueprint for building LLMs that effectively serve SEA’s variety of languages. The approach addresses multilingualism and data-quality concerns while employing practical methods to enhance model robustness and performance. The authors have shared their work through a Paper, Project, and Github.
