Scaling up large language models (LLMs) demands substantial computational power and very large training datasets. Modern language models typically have billions of parameters and are trained on datasets containing trillions of tokens, which makes pre-training extremely resource-intensive.
A group of researchers from the University of Texas at Austin has proposed a solution: a method named “Inheritune” that derives smaller base LMs from larger ones. Unlike previous approaches, Inheritune inherits the first few transformer blocks from a larger LM and then trains on only a tiny fraction (0.1%) of the original pre-training data. Using this recipe, an LM with 1.5 billion parameters can be created from just 1 billion tokens, on a single GPU, in under twelve hours.
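As a rough illustration of the block-inheritance idea (not the authors’ released code), the first step might look like the following PyTorch/Hugging Face sketch, which keeps only the leading transformer blocks of a pretrained GPT-2 reference model; the checkpoint name and the number of inherited blocks are placeholders.

```python
# Hypothetical sketch: inherit the first few transformer blocks of a larger
# reference LM (here a GPT-2 checkpoint) to initialize a smaller base model.
import torch.nn as nn
from transformers import AutoModelForCausalLM

reference = AutoModelForCausalLM.from_pretrained("gpt2-large")  # larger reference LM (placeholder)
n_inherit = 6  # number of leading transformer blocks to keep (placeholder)

# Keep the embeddings and the first n_inherit blocks; drop the remaining blocks.
reference.transformer.h = nn.ModuleList(reference.transformer.h[:n_inherit])
reference.config.n_layer = n_inherit

small_lm = reference  # this truncated model is the starting point for further training
print(f"inherited model has {small_lm.num_parameters():,} parameters")
```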
Previous attempts to train smaller LMs have relied on extensive training from scratch with trillions of tokens, or on high-quality synthetic data. TinyLlama-1B, for example, was trained from scratch on 3 trillion tokens over roughly ninety days. Inheritune’s ability to reach comparable performance with far fewer computational resources sets it apart from these approaches.
The Inheritune method constructs a small base LM from a few layers of an existing large LM and a small portion of its pre-training data. The first few transformer blocks of the reference model are copied over to initialize a smaller target model, which is then trained on the available data subset for a specified number of epochs.
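Continuing the illustrative sketch above (again, not the published implementation), the inherited model would then be trained on the small data subset with a standard next-token-prediction objective; the batch size, learning rate, and epoch count below are assumptions, and the dataset is assumed to yield fixed-length "input_ids" tensors.

```python
# Illustrative training loop for the inherited model on a small data subset
# (standard causal-LM objective; hyperparameters are placeholders).
import torch
from torch.utils.data import DataLoader

def train_inherited_lm(small_lm, train_subset, epochs=8, lr=3e-4, device="cuda"):
    """Train the truncated model on a small subset of the pre-training data."""
    small_lm.to(device).train()
    optimizer = torch.optim.AdamW(small_lm.parameters(), lr=lr)
    loader = DataLoader(train_subset, batch_size=8, shuffle=True)

    for epoch in range(epochs):
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            # Hugging Face causal-LM models compute the shifted cross-entropy
            # loss internally when labels are provided.
            loss = small_lm(input_ids=input_ids, labels=input_ids).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")
    return small_lm
```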
The researchers used a 1-billion-token subset of the RedPajama v1 dataset for their experiments, training a 1.5-billion-parameter LM. The resulting model’s performance compared favorably with both scratch-trained and similarly derived LMs.
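A token-budgeted subset of this kind could be collected by streaming documents and stopping once the budget is reached; in the sketch below, the dataset identifier, the "text" field name, and the tokenizer are assumptions rather than the exact setup from the paper.

```python
# Sketch: gather roughly 1B tokens from a streamed corpus
# (dataset id, field names, and tokenizer are assumptions).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("togethercomputer/RedPajama-Data-1T", split="train", streaming=True)

token_budget = 1_000_000_000  # ~1B tokens, i.e. ~0.1% of a 1T-token corpus
collected, total_tokens = [], 0

for doc in stream:
    ids = tokenizer(doc["text"])["input_ids"]
    collected.append(ids)
    total_tokens += len(ids)
    if total_tokens >= token_budget:
        break
```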
Inheritune’s inheritance step allows smaller target LMs to be extracted without sacrificing performance, and the resulting models outperform similarly sized models trained from scratch. In experiments with GPT-2 medium, models initialized with Inheritune showed faster convergence and better final validation loss.
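To reproduce this kind of comparison, one would track validation loss for both the Inheritune-initialized model and a same-size model trained from scratch; a minimal evaluation helper (assumed names, not the authors’ harness) might look like this:

```python
# Minimal validation-loss helper for comparing an inherited model against a
# same-size scratch-trained model (function and argument names are assumptions).
import math
import torch

@torch.no_grad()
def validation_loss(model, val_loader, device="cuda"):
    """Return mean next-token loss and perplexity over a validation loader."""
    model.to(device).eval()
    total_loss, total_batches = 0.0, 0
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        total_loss += model(input_ids=input_ids, labels=input_ids).loss.item()
        total_batches += 1
    mean_loss = total_loss / max(total_batches, 1)
    return mean_loss, math.exp(mean_loss)
```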
Inheritune does have limitations. It cannot modify the architectural design beyond reducing the number of transformer blocks, which limits flexibility in customizing hidden sizes and attention heads. The small size of the training dataset may also be a concern, given the method’s sensitivity to the quality of that data. Despite these limitations, the researchers conclude that Inheritune effectively pre-trains small base language models with minimal data and computational resources, offering a simple, viable approach to deriving smaller models from large reference models.