Researchers at UT Austin have developed an effective and efficient method for training smaller language models (LMs). Called “Inheritune,” the method borrows transformer blocks from larger language models and trains the smaller model on a minuscule fraction of the original training data. Using it, the researchers produced a 1.5-billion-parameter language model from just 1 billion tokens in under 12 hours on a single GPU. This approach sidesteps the billions of parameters and trillions of tokens typically required for pre-training.
Inheritune’s appeal lies in its ability to produce small language models that perform comparably to larger, publicly available models, making it effective across a variety of settings. Traditionally, training smaller LMs requires either extensive pre-training from scratch on trillions of tokens or the use of high-quality synthetic data. Inheritune instead inherits transformer blocks from a larger model and trains on a much smaller subset of data, conserving significant computational resources.
The method crafts a small base language model by inheriting the first n transformer layers of an existing larger reference model and then training the resulting target model on a small subset of the original data for a fixed number of epochs. Using a 1-billion-token subset of the RedPajama v1 dataset, the researchers trained a 1.5-billion-parameter language model that achieved competitive performance against other scratch-trained and derived LMs.
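The inheritance step can be illustrated with a minimal sketch in PyTorch using Hugging Face transformers. This is not the authors’ exact setup: the GPT-2-medium reference model, the number of inherited blocks, and the choice to copy the embeddings and final layer norm are all assumptions made for illustration.

```python
# Minimal sketch of layer inheritance, assuming a GPT-2-style reference model.
# Model name, n_inherit, and the copied submodules are illustrative choices,
# not the paper's exact configuration.
import copy
from transformers import GPT2LMHeadModel

n_inherit = 6  # number of leading transformer blocks to keep (assumed value)

# Load the larger reference model (GPT-2-medium has 24 blocks).
reference = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Build a smaller config that differs only in the number of layers.
target_config = copy.deepcopy(reference.config)
target_config.n_layer = n_inherit
target = GPT2LMHeadModel(target_config)

# Copy the embeddings and the first n transformer blocks from the reference.
target.transformer.wte.load_state_dict(reference.transformer.wte.state_dict())
target.transformer.wpe.load_state_dict(reference.transformer.wpe.state_dict())
for i in range(n_inherit):
    target.transformer.h[i].load_state_dict(reference.transformer.h[i].state_dict())
target.transformer.ln_f.load_state_dict(reference.transformer.ln_f.state_dict())

# The target model is then trained further on a small data subset.
```

After this step, the target model is simply a truncated copy of the reference and is fine-tuned on the small token subset like any other causal language model.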
Inheritune’s strength is its ability to extract smaller target language models without a significant loss in performance, with the resulting models showing comparable zero-shot performance on related downstream tasks. These smaller LMs also outperform comparable models trained from scratch, often surpassing them after fewer training steps. Experiments with GPT2-medium models showed that Inheritune initialization, particularly of the attention and MLP weights, yielded faster convergence and a better final validation loss.
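The selective initialization studied in those experiments could look like the following sketch, which builds on the variables from the previous snippet. Copying only the attention and MLP weights while leaving the remaining parameters at their random initialization is one of the ablations described; the exact submodule names follow transformers’ GPT-2 layout and are assumptions here.

```python
# Sketch: initialize only the attention and MLP weights of the smaller model
# from the reference blocks, leaving layer norms and other parameters at their
# fresh random initialization. Builds on `reference`, `target`, `n_inherit`
# from the previous snippet; which pieces to copy is an assumed ablation choice.
for i in range(n_inherit):
    ref_block, tgt_block = reference.transformer.h[i], target.transformer.h[i]
    tgt_block.attn.load_state_dict(ref_block.attn.state_dict())  # self-attention weights
    tgt_block.mlp.load_state_dict(ref_block.mlp.state_dict())    # feed-forward (MLP) weights
```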
Despite its success, Inheritune has limitations. The architecture cannot be modified beyond the number of transformer blocks, which limits flexibility in customizing hidden sizes and attention heads. The method is also sensitive to the quality of the training dataset, precisely because that dataset is small. Block selection, dataset curation, and hyperparameter tuning all still need further work.
Inheritune offers an efficient way to train smaller base language models with minimal data and computational resources, simplifying the reduction of large reference models to smaller ones without sacrificing performance.