The recent development of large language models (LLMs), which can generate high-quality content across many domains, has transformed natural language generation. These models broadly fall into two categories: those that release only the model weights, and those for which all model-related information, including training data, data sampling ratios, training logs, checkpoints, and evaluation methods, is publicly available. Comprehensive access to fully open language models is crucial for the research community to study their capabilities, limitations, inherent biases, and potential risks.
A notable addition to the field is ChuXin 1.6B, an open-source language model with 1.6 billion parameters. It was trained on 2.3 trillion tokens of open-source data drawn from a variety of sources, including encyclopedias, web publications, and public knowledge bases in both English and Chinese. The model's context window has also been extended substantially, allowing it to handle inputs of up to one million tokens.
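As a point of reference for how such an open checkpoint is typically consumed, the snippet below loads a causal language model from the Hugging Face Hub with transformers. The repository id `chuxin-llm/ChuXin-1.6B` is a placeholder rather than a confirmed model path, and this is a generic usage sketch, not code from the ChuXin release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- substitute the actual ChuXin checkpoint name.
repo_id = "chuxin-llm/ChuXin-1.6B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",       # load weights in the dtype stored in the checkpoint
    trust_remote_code=True,   # allow custom modeling code, if the repo ships any
)

prompt = "ChuXin is an open-source language model that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```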
The backbone of ChuXin 1.6B follows the LLaMA2 architecture, scaled to roughly 1.6 billion parameters. The model uses Rotary Positional Embedding (RoPE) to capture relationships between tokens at different positions in a sequence, RMSNorm for pre-normalization, and a block-diagonal attention mask inspired by StableLM. The DeepSeek LLM tokenizer is used for tokenization, and SwiGLU serves as the activation function.
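To make these components concrete, here is a minimal PyTorch sketch of RMSNorm, a SwiGLU feed-forward block, and rotary embeddings. The dimensions and default values are illustrative assumptions, not ChuXin's actual configuration.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


def rotary_embedding(x, base: float = 10000.0):
    """Apply rotary positional embeddings to a (batch, seq, heads, head_dim) tensor."""
    bsz, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```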
For training, all pre-training datasets were taken from publicly available sources on Hugging Face to make the model easy to replicate. The model was trained from scratch with a 4,096-token context length, and training efficiency was improved with FlashAttention-2. Training used BFloat16 mixed precision, with all-reduce operations kept in FP32. The team trained for two epochs, covering roughly 2 trillion tokens.
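The sketch below illustrates this precision setup in PyTorch: parameters and gradients stay in FP32 while the forward and backward math runs under a BFloat16 autocast, so any data-parallel gradient all-reduce would also happen in FP32. The model, loss, and hyperparameters are stand-ins, not ChuXin's actual training code.

```python
import torch
import torch.nn as nn

# Stand-in model and optimizer; ChuXin's real training stack is not shown here.
model = nn.TransformerEncoderLayer(d_model=2048, nhead=16, batch_first=True).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)


def train_step(batch, targets):
    optimizer.zero_grad(set_to_none=True)
    # Parameters remain in FP32; only the forward/backward math runs in BFloat16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(batch)
        loss = nn.functional.mse_loss(output, targets)
    # Gradients accumulate in FP32, so a data-parallel all-reduce
    # (e.g. under DistributedDataParallel) would also run in FP32.
    loss.backward()
    optimizer.step()
    return loss.item()
```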
To assess ChuXin's performance on Chinese tasks, the team used the CMMLU and C-Eval benchmarks for Chinese comprehension and reasoning, and HumanEval for code generation. Commonsense reasoning benchmarks were used to monitor performance throughout pre-training. The results show that ChuXin's performance on most tasks improves as the number of training tokens increases.
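A common way to run such multiple-choice benchmarks is likelihood scoring: the model scores each candidate answer as a continuation of the question, and the highest-scoring option is taken as its prediction. The sketch below illustrates that pattern with transformers; the checkpoint id and the toy question are placeholders, and this is not the paper's actual evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id -- substitute the model being evaluated.
repo_id = "chuxin-llm/ChuXin-1.6B"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).eval()


def option_log_likelihood(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the continuation tokens (ignores tokenizer boundary effects).
    option_logits = logits[0, prompt_ids.shape[1] - 1 : -1]
    option_ids = full_ids[0, prompt_ids.shape[1]:]
    log_probs = torch.log_softmax(option_logits, dim=-1)
    return log_probs.gather(1, option_ids.unsqueeze(-1)).sum().item()


# Toy multiple-choice item (not drawn from CMMLU or C-Eval).
question = "Question: Which planet is closest to the sun?\nAnswer: "
options = ["Mercury", "Venus", "Earth", "Mars"]
prediction = max(options, key=lambda o: option_log_likelihood(question, o))
print(prediction)
```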
The team aims to build larger, more capable models in the future, integrating features such as instruction tuning and multi-modal capabilities. They also plan to share the challenges and solutions encountered during ChuXin's development, hoping to inspire the open-source community and drive further advances in language modeling.