Large Language Models (LLMs) are widely used for complex reasoning tasks across many fields. However, building and optimizing them demands considerable computational power, particularly during pretraining on large datasets. To guide these costs, researchers have proposed scaling laws that describe the relationship between pretraining loss and compute.
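To make the idea concrete, one widely cited form of such a scaling law (the Chinchilla-style parameterization) expresses expected pretraining loss as a function of model size and token count; the sketch below uses illustrative placeholder constants, not coefficients reported in the study discussed here.

```python
# A minimal sketch of a Chinchilla-style scaling law,
# L(N, D) = E + A / N**alpha + B / D**beta,
# relating pretraining loss to model size N (parameters) and data size D (tokens).
# The constants are illustrative placeholders, not values fitted in the study.
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pretraining loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Example: a 7B-parameter model trained on 2T tokens.
print(round(predicted_loss(7e9, 2e12), 3))
```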
However, recent findings suggest these laws may not fully capture LLMs' capabilities, especially on downstream tasks, and call for better evaluation frameworks. A recent study examined the training dynamics of several publicly accessible LLMs, including Yi-34B, Baichuan-7B, DeepSeek-7B, Amber-7B, OpenLLaMA-7B, and DeepSeek-67B, evaluating their performance on a range of tasks at intermediate checkpoints saved at different numbers of pretrained tokens.
Three key conclusions emerged from analyzing these models' performance on various downstream tasks through the lens of scaling-law theory.
First, the researchers identified Task Dynamic Prediction: the ability to predict performance on tasks within a domain from the dynamics of existing downstream tasks. In other words, a model's performance on known tasks can indicate its likely performance on similar but unseen tasks.
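As a rough illustration of how such a prediction could work (the study's exact method is not reproduced here), one could fit a simple mapping from the averaged dynamics of known tasks in a domain to the early checkpoints of a held-out task, then extrapolate to later checkpoints; every number below is an invented placeholder.

```python
# Hypothetical sketch of Task Dynamic Prediction: estimating a held-out task's
# performance from the dynamics of known tasks in the same domain.
# All checkpoint accuracies below are invented placeholders, not the study's data.
import numpy as np

tokens = np.array([0.1, 0.25, 0.5, 1.0, 1.5, 2.0])   # pretrained tokens (trillions)

# Accuracies of two known tasks from the same domain at each checkpoint.
known_task_a = np.array([0.31, 0.38, 0.46, 0.55, 0.60, 0.63])
known_task_b = np.array([0.29, 0.36, 0.44, 0.52, 0.57, 0.60])
domain_signal = (known_task_a + known_task_b) / 2     # domain-level dynamic

# The held-out task has only been evaluated on the first three checkpoints.
held_out_early = np.array([0.27, 0.33, 0.41])

# Fit a linear map from the domain signal to the held-out task on those
# checkpoints, then apply it to the later checkpoints to predict performance.
slope, intercept = np.polyfit(domain_signal[:3], held_out_early, deg=1)
predicted_late = slope * domain_signal[3:] + intercept

for t, acc in zip(tokens[3:], predicted_late):
    print(f"~{t:.1f}T tokens: predicted held-out accuracy {acc:.3f}")
```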
Second, Cross-Domain Promotion emerged. Much like the human cognitive process, a model's skill development across different domains progresses from basic to advanced through curriculum learning, and knowledge acquired in one domain can aid learning in others, which can in turn guide how the model is trained.
Lastly, a closer examination revealed how training strategies, dataset quality, learning-rate schedules, batch size, and regularization techniques influence LLMs' learning efficiency, particularly in the early stages of training.
Beyond these findings, the researchers highlighted the impact of model scale on reasoning tasks and on the scaling-law effect. They found that, with specific training strategies, smaller models can reach performance comparable to larger ones on commonsense reasoning.
Larger training datasets improved model performance across multiple benchmarks, reinforcing the importance of large-scale training data. However, the gains diminish as datasets grow, implying a limit on the performance improvements that additional data alone can deliver.
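To see why returns shrink, consider just the data term of the placeholder power law sketched earlier (again with invented constants, not the study's fits): each doubling of the token count removes a smaller absolute amount of loss.

```python
# Diminishing returns from more data under a simple power-law data term B / D**beta.
# The constants are illustrative placeholders, not values fitted by the study.
def data_term(n_tokens: float, B: float = 400.0, beta: float = 0.28) -> float:
    return B / n_tokens ** beta

for lo, hi in [(0.5e12, 1e12), (1e12, 2e12), (2e12, 4e12)]:
    gain = data_term(lo) - data_term(hi)
    print(f"{lo / 1e12:.1f}T -> {hi / 1e12:.1f}T tokens: loss reduction {gain:.4f}")
```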
The research team plans to release the intermediate checkpoints of Amber-7B and OpenLLaMA-7B publicly to further the understanding of scaling laws and to enable more effective LLM training recipes. Together with these insights, the public checkpoints are intended to help developers understand the LLM optimization process and to stimulate foundation-model development.
Overall, the study sheds light on the pretraining of Large Language Models, with an emphasis on their downstream capabilities.