Training Large Language Models (LLMs) has become increasingly demanding, as they require enormous amounts of data to perform well. The resulting computational expense makes it difficult to reduce training costs without hurting performance. Conventionally, LLMs are trained with next-token prediction, where the model learns to predict each token in a sequence from the tokens that precede it. Researchers from the Pattern Recognition Center, WeChat AI, Tencent Inc. propose a more efficient alternative known as patch-level training.
In patch-level training, multiple consecutive tokens are compressed into a single patch, shortening the training sequence. The method builds on transfer learning: knowledge acquired by a cheaper-to-train patch-level model is transferred to the more expensive token-level model. What makes the approach distinctive is that the final model never needs to operate at the patch level; the patch stage simply lets the model absorb most of the training data more efficiently.
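As a concrete illustration, here is a minimal sketch of the compression step, assuming that a patch is formed by averaging the embeddings of K consecutive tokens; the function name and tensor shapes are illustrative choices, not taken from the authors' code.

```python
import torch

def tokens_to_patches(token_embeddings: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Compress every `patch_size` consecutive token embeddings into one patch
    by averaging them, shortening the sequence by a factor of `patch_size`.

    token_embeddings: (batch, seq_len, hidden), with seq_len divisible by patch_size.
    returns: (batch, seq_len // patch_size, hidden)
    """
    batch, seq_len, hidden = token_embeddings.shape
    patches = token_embeddings.view(batch, seq_len // patch_size, patch_size, hidden)
    return patches.mean(dim=2)  # average the patch_size token embeddings in each patch

# Example: a sequence of 512 token embeddings becomes 128 patches at patch_size = 4.
x = torch.randn(2, 512, 768)
print(tokens_to_patches(x, patch_size=4).shape)  # torch.Size([2, 128, 768])
```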
Patch-level training proceeds in two steps. First, the language model is trained on shortened sequences of patches, predicting the next patch, which lets most of the training data be processed at a much lower compute cost. Second, the parameters learned in the patch-level stage are used to initialize the token-level model, which then trains on the remaining data and inherits the knowledge acquired at the patch level. In short, patch-level training predicts patches, that is, groups of tokens, while token-level training predicts individual tokens.
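A minimal sketch of this two-stage schedule is shown below, with a toy stand-in for the real architecture and the training loops elided; the key point it illustrates is that both stages share one architecture, so the patch-level weights can initialize the token-level model directly.

```python
import torch.nn as nn

def build_lm() -> nn.Module:
    """Stand-in for the actual LLM architecture; the same architecture is used
    in both stages, which is what makes direct parameter reuse possible."""
    return nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

# Stage 1: train a model on patch sequences (next-patch prediction)
# over most of the training data.
patch_level_model = build_lm()
# ... patch-level training would happen here ...

# Stage 2: initialize the token-level model from the patch-level weights,
# then continue with ordinary next-token training on the remaining data.
token_level_model = build_lm()
token_level_model.load_state_dict(patch_level_model.state_dict())
# ... token-level training would happen here ...
```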
Patch-level training offers several advantages. First, it improves training efficiency: each position predicts all tokens of the upcoming patch at once, and this form of multi-token prediction shortens the training sequence while reusing a single output head, without adding any additional parameters. Second, because it does not depend on a particular model architecture or on advanced model-mapping techniques, it is both adaptable and broadly applicable.
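One way to write such a multi-token objective, consistent with the description above, is to score the single distribution emitted at each patch position against every token of the following patch. The sketch below is an assumption about how this could be implemented, with illustrative tensor names and shapes.

```python
import torch
import torch.nn.functional as F

def next_patch_loss(logits: torch.Tensor, tokens: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Patch-level objective sketch: each patch position emits one distribution
    over the vocabulary, and that distribution is scored against all
    `patch_size` tokens of the following patch, with no extra parameters.

    logits: (batch, num_patches, vocab) -- model outputs at each patch position
    tokens: (batch, num_patches * patch_size) -- the original token ids
    """
    batch, num_patches, vocab = logits.shape
    targets = tokens.reshape(batch, num_patches, patch_size)  # tokens grouped by patch
    pred = logits[:, :-1]                                     # predictions for the next patch
    tgt = targets[:, 1:]                                      # the next patch's tokens
    # Repeat each patch-level prediction for every token it must account for.
    pred = pred.unsqueeze(2).expand(-1, -1, patch_size, -1)
    return F.cross_entropy(pred.reshape(-1, vocab), tgt.reshape(-1))

# Toy usage: 64 patch positions, vocabulary of 1000, patch size 4 (256 tokens).
logits = torch.randn(2, 64, 1000)
tokens = torch.randint(0, 1000, (2, 256))
print(next_patch_loss(logits, tokens, patch_size=4))
```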
Results indicate that a model initialized from patch-level training shows a swift decrease in loss and continues to train effectively on the remaining data, ultimately reaching a lower loss than training from scratch while cutting training costs by 50%. These promising outcomes suggest that even higher acceleration rates can be achieved by adjusting the hyperparameters, chiefly the patch size and the fraction of data trained at patch level, with only a negligible impact on model performance.
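The 50% figure can be sanity-checked with a back-of-the-envelope estimate. Assuming training cost scales with the number of sequence positions processed, and taking an illustrative setting of patch size 4 applied to two thirds of the data (these specific values are an assumption chosen to be consistent with the stated saving), the arithmetic works out as follows.

```python
def relative_training_cost(patch_size: int, patch_fraction: float) -> float:
    """Approximate cost relative to ordinary token-level training:
    a `patch_fraction` share of the data is processed on sequences shortened
    by `patch_size`, and the rest is processed at full token-level cost."""
    return patch_fraction / patch_size + (1.0 - patch_fraction)

# Two thirds of the data at patch size 4, the remaining third at token level:
print(relative_training_cost(patch_size=4, patch_fraction=2 / 3))  # 0.5
```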
This work marks the beginning of research into patch-level training. Further developments, such as establishing an empirical scaling law for patch-level training and testing its scalability on larger models and datasets, could significantly strengthen the approach.