Language models that learn to recognize and generate human-like text from patterns in vast datasets are extremely effective tools. Nevertheless, the traditional technique for training these models, known as “next-token prediction,” has its shortcomings. The method trains models to predict only the single next word in a sequence, which can lead to suboptimal performance on more complicated tasks.
The researchers behind this study have developed an innovative technique known as multi-token prediction. Instead of predicting one word at a time, the method trains the model to predict several future words at once. The idea is akin to learning a language by anticipating entire phrases or sentences rather than guessing a single word at a time.
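In loss terms, and with notation of our own choosing (the paper’s exact symbols may differ), this amounts to summing an ordinary cross-entropy term over the next n positions at every point in the sequence:

$$
\mathcal{L}_n = -\sum_{t}\log P_\theta\!\left(x_{t+n:t+1}\mid x_{t:1}\right)
= -\sum_{t}\sum_{i=1}^{n}\log P_\theta\!\left(x_{t+i}\mid x_{t:1}\right),
$$

where $x_{t:1}$ is the observed context and the second equality reflects the assumption that the $n$ future tokens are predicted independently given that shared context.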
The researchers’ multi-token prediction model employs a shared trunk that generates a latent representation of the input context, linked to multiple separate output heads, each responsible for predicting one future token. For instance, if the model predicts four future words, it will have four output heads operating in parallel.
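As an illustration only, here is a minimal sketch of that architecture in PyTorch. The module and parameter names are our own, and the transformer encoder merely stands in for whatever trunk the paper actually uses:

```python
import torch
import torch.nn as nn


class MultiTokenPredictor(nn.Module):
    """Toy sketch: one shared trunk feeding n independent output heads."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shared trunk: produces a latent representation of the observed context.
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One output head per future position (four heads for four future tokens).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)
        ).to(tokens.device)
        return self.trunk(self.embed(tokens), mask=mask)  # (batch, seq, d_model)

    def forward(self, tokens: torch.Tensor) -> list[torch.Tensor]:
        hidden = self.encode(tokens)
        # Head i reads the same shared representation and is trained to
        # predict the token i+1 positions ahead of each context position.
        return [head(hidden) for head in self.heads]
```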
Crucially, the researchers have also addressed a key challenge: reducing the GPU memory footprint of these multi-token predictors. They incorporated a clever method that computes the forward and backward pass for each output head sequentially, accumulating gradients at the shared trunk. This reduces peak GPU memory usage, allowing larger models to be trained more efficiently.
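To make the memory argument concrete, here is a hedged sketch of that training-step pattern, built on the toy model above (the variable names and bookkeeping are ours, not the paper’s code). Each head’s forward and backward pass runs in turn, its gradient with respect to the trunk output is summed into a single buffer, and the trunk is backpropagated only once:

```python
import torch.nn.functional as F


def training_step(model: MultiTokenPredictor, tokens, targets):
    """`targets[i]` holds the labels for head i, i.e. `tokens` shifted by i+1.

    Because only one head's logits and activations are alive at a time,
    peak GPU memory stays close to that of a single-head model.
    """
    hidden = model.encode(tokens)                     # shared trunk, run once
    trunk_out = hidden.detach().requires_grad_(True)  # cut the graph at the trunk
    total_loss = 0.0
    for head, target in zip(model.heads, targets):
        logits = head(trunk_out)                      # (batch, seq, vocab)
        loss = F.cross_entropy(logits.transpose(1, 2), target)
        # Backward through this head only; d(loss)/d(trunk output) is
        # accumulated into trunk_out.grad, after which the head's
        # activations can be freed.
        loss.backward()
        total_loss += loss.item()
    # One backward pass through the shared trunk, driven by the summed gradient.
    hidden.backward(trunk_out.grad)
    return total_loss
```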
The results of the researchers’ extensive experiments are promising. They found that multi-token prediction became increasingly useful as model size grew. Furthermore, the additional output heads can be used to accelerate inference with techniques like speculative decoding, speeding up decoding by up to three times.
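To illustrate the inference idea, here is a simplified greedy variant of self-speculative decoding for the toy model above. It is our own sketch, not the paper’s algorithm: the extra heads draft several tokens in one forward pass, and a second pass keeps the longest draft prefix that plain greedy decoding would also have produced.

```python
@torch.no_grad()
def greedy_self_speculative_step(model: MultiTokenPredictor, tokens):
    """One decode step (batch size 1 assumed for simplicity)."""
    # Draft: head i proposes the token i+1 positions after the context.
    draft = torch.stack(
        [logits[:, -1].argmax(dim=-1) for logits in model(tokens)], dim=1
    )                                                   # (batch, n_future)
    # Verify: one forward pass over the context extended with every draft,
    # reading only the ordinary next-token head (head 0).
    extended = torch.cat([tokens, draft], dim=1)
    greedy = model(extended)[0].argmax(dim=-1)          # (batch, seq + n_future)
    n_ctx = tokens.size(1)
    # The first draft is head 0's own next-token prediction: always accepted.
    accepted = [draft[:, 0]]
    for i in range(1, draft.size(1)):
        # Draft i is kept only if it matches what greedy decoding would emit
        # given the context plus the previously accepted drafts.
        if not torch.equal(draft[:, i], greedy[:, n_ctx + i - 1]):
            break
        accepted.append(draft[:, i])
    return torch.cat([tokens, torch.stack(accepted, dim=1)], dim=1)
```

Whenever several drafts are accepted, multiple tokens are emitted for roughly the cost of two forward passes, which is where the reported decoding speed-up comes from.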
Interestingly, multi-token prediction also showed promising results in natural language tasks. When evaluated on summarization benchmarks, models trained with multi-token prediction achieved higher ROUGE scores, showing better text generation capabilities.
The researchers propose that multi-token prediction works so well because it reduces the distributional discrepancy between teacher forcing at training time and autoregressive generation at inference time. The method also implicitly assigns higher weight to tokens that represent critical decision points, that is, choices that have a significant impact on the remainder of the text.
While the results are promising, the researchers acknowledge that there is still room for improvement. Possible areas for future exploration include automatically determining the optimal number of future tokens to predict based on the task and data distribution, adjusting vocabulary size, and exploring alternative auxiliary prediction losses to further improve efficiency. Overall, this research opens up the prospect of enhancing language model capabilities, paving the way for more powerful and efficient natural language processing systems.