In recent years, there has been increasing attention paid to the development of Small Language Models (SLMs) as a more efficient and cost-effective alternative to Large Language Models (LLMs), which are resource-heavy and present operational challenges. In this context, researchers from the Department of Computer Science and Technology at Tsinghua University and Modelbest Inc. have introduced a new model known as MiniCPM.
Despite advancements in SLMs like the Phi series, TinyLlama, MobileLLM, and Gemma, open questions remain about how to replicate the comprehensive capabilities of LLMs and how to establish efficient, scalable training methods. MiniCPM, which comes in 1.2 billion and 2.4 billion non-embedding parameter variants, is designed to address these challenges. It achieves performance comparable to 7-13 billion parameter LLMs while keeping the focus squarely on SLM development.
Alongside its scalable approach to both model and data dimensions, MiniCPM utilizes a novel technique known as the Warmup-Stable-Decay (WSD) learning rate scheduler. This facilitates continuous training and domain adaptation and enables an efficient study of the data-model scaling law. Variants of MiniCPM, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, have also been introduced.
A key part of MiniCPM’s approach stems from an analysis of the widely used Cosine Learning Rate Scheduler (LRS). The researchers attribute its effectiveness to holding a relatively high learning rate for much of training, which helps the model explore toward good global solutions, before decaying it to settle into a local optimum. The WSD scheduler makes these phases explicit, splitting training into a warmup stage, a long stable stage at a constant high learning rate, and a short, rapid decay stage.
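To make the three stages concrete, here is a minimal Python sketch of such a schedule; the function name, parameter names, and the linear decay shape are illustrative assumptions rather than the paper’s exact implementation.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, final_lr=0.0):
    """Warmup-Stable-Decay (WSD) learning rate schedule.

    A minimal sketch for illustration: the naming and the linear decay shape
    are assumptions; the paper leaves the exact decay function open.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate constant for the bulk of training.
        return peak_lr
    # Decay: anneal quickly from the peak down to the final learning rate.
    progress = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)
    return peak_lr + (final_lr - peak_lr) * progress
```

Because the stable stage runs at a constant rate, training can be resumed or branched from any stable-stage checkpoint, for example to launch a short decay run on new domain data or to collect a scaling-law measurement, without replaying the full schedule.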
Analysis of MiniCPM-2.4B found that it generally outperformed Mistral-7B-v0.1 on Chinese language tasks and achieved similar outcomes in English. It also surpassed Llama2-13B in most areas, with the exception of MMLU (Massive Multitask Language Understanding), BBH (BIG-Bench Hard), and HellaSwag. Interestingly, the researchers noted that BBH, a reasoning-oriented benchmark, appears to be harder for SLMs than knowledge-oriented datasets are, implying that reasoning ability depends more on model size than knowledge does.
In conclusion, this research paper represents a significant development in the study of SLMs. The scalable approach to training taken by MiniCPM demonstrates their potential not just as a more efficient alternative to LLMs, but also as a viable tool for enhancing the capabilities of larger models. The WSD scheduler, in particular, shows promise for future LLM development. The researchers plan to further enhance MiniCPM by scaling both model and data size, and to analyze the decrease in loss observed during the WSD scheduler’s decay stage in order to fully understand its benefits and its potential for the future of AI.