
LayerSkip: A Comprehensive AI Approach for Accelerating the Inference of Large Language Models (LLMs)

Large Language Models (LLMs) are used in a wide range of applications, but their high computational and memory demands translate into steep energy and financial costs when they are deployed on GPU servers. Research teams from FAIR, GenAI, and Reality Labs at Meta, the Universities of Toronto and Wisconsin-Madison, Carnegie Mellon University, and the Dana-Farber Cancer Institute have been investigating whether these large models can be accelerated by reducing the number of layers each token passes through, exiting the network early during inference.

A common approach to accelerating LLMs is speculative decoding: a quick “draft” model proposes tokens and a larger “main” model verifies them, so speed improves without sacrificing accuracy. The present study instead pursues a method that requires no extra models or auxiliary layers, combining early exit with speculative decoding.
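
For readers unfamiliar with the mechanism, the sketch below shows the generic two-model version of speculative decoding under greedy decoding: the draft model proposes a short block of tokens and the main model verifies the whole block in a single forward pass. `draft_model` and `main_model` are hypothetical callables that map a batch of token ids to per-position next-token logits; they stand in for whatever models a particular system pairs together.

```python
# Minimal sketch of two-model speculative decoding under greedy decoding.
# `draft_model` and `main_model` are hypothetical callables mapping token ids,
# shape (1, seq_len), to next-token logits, shape (1, seq_len, vocab).
import torch

@torch.no_grad()
def speculative_step(main_model, draft_model, tokens, k=4):
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = list(tokens)
    for _ in range(k):
        logits = draft_model(torch.tensor([draft]))
        draft.append(int(logits[0, -1].argmax()))

    # 2) The expensive main model scores the whole proposed block in one pass.
    main_logits = main_model(torch.tensor([draft]))

    # 3) Accept proposals left to right while the main model agrees; on the
    #    first disagreement, keep the main model's own token and stop.
    accepted = list(tokens)
    for i, proposal in enumerate(draft[len(tokens):]):
        main_tok = int(main_logits[0, len(tokens) + i - 1].argmax())
        accepted.append(main_tok)
        if main_tok != proposal:
            break
    return accepted
```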

To scrutinize what each layer contributes, the researchers examined a Llama1 7B model using the HumanEval coding dataset. Projecting each layer's output embeddings onto the model's final normalization and linear (language-model head) layers and applying softmax, they read off the index of the highest-valued element, which is the token predicted at that layer. They found that forcing token predictions in the earliest layers was of little use, while running every layer was often unnecessary: on average, only 23.45 of the model's 32 layers were needed per token, suggesting that an ideal predictor could cut computation by roughly 26%.
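
This per-layer probing can be reproduced in a few lines of PyTorch. The sketch below assumes a Llama-style checkpoint loaded through Hugging Face transformers (the attribute names `model.model.norm` and `model.lm_head` follow that library's Llama implementation, and the checkpoint name is only a placeholder). It projects each layer's hidden state through the final norm and shared head and reports the first layer whose top token already matches the model's final prediction; taking the argmax of the logits gives the same index as taking the argmax after softmax.

```python
# Sketch of per-layer prediction probing, assuming a Llama-style checkpoint
# loaded with Hugging Face transformers. `model.model.norm` (final RMSNorm) and
# `model.lm_head` (shared head) follow that library's Llama implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # placeholder: any Llama-style checkpoint you can access
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

inputs = tok("def fibonacci(n):", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_pred = int(out.logits[0, -1].argmax())   # the model's actual next-token choice
# hidden_states[0] is the embedding output; later entries follow the decoder layers.
for layer_idx, h in enumerate(out.hidden_states[1:], start=1):
    layer_logits = model.lm_head(model.model.norm(h[:, -1]))
    layer_pred = int(layer_logits[0].argmax())  # argmax of logits == argmax after softmax
    if layer_pred == final_pred:
        print(f"Prediction settles at layer {layer_idx}: {tok.decode([layer_pred])!r}")
        break
```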

The researchers' objective was to reduce the computation the model wastes "changing its mind", or hesitating, and to reach accurate predictions with fewer layers per token. They argue that the model should rely less on its later layers for easier tokens, and they encourage this with layer dropout: randomly omitting layers during training so the model becomes less dependent on the deeper part of the stack, as sketched below.
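
Layer dropout can be implemented as a thin wrapper around the decoder stack. The module below is a minimal, hypothetical sketch rather than the paper's exact schedule: each block is skipped during training with a probability that grows linearly with depth, so the deepest layers are dropped most often and the model is pushed to produce useful representations earlier.

```python
# Minimal sketch of layer dropout (stochastic depth) for a decoder stack, not
# the paper's exact schedule. `layers` is any iterable of residual transformer
# blocks; skipping a block leaves the residual stream unchanged.
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, layers, max_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        n = len(self.layers)
        # Drop rate scaled from 0.0 for the first block to max_drop for the last.
        self.drop_rates = [max_drop * i / max(n - 1, 1) for i in range(n)]

    def forward(self, x):
        for layer, p in zip(self.layers, self.drop_rates):
            if self.training and torch.rand(()).item() < p:
                continue   # skip this block entirely during training
            x = layer(x)
        return x
```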

To push the model to make the embeddings of its earlier layers predictive, the researchers added an early-exit loss function to the training procedure. They also used a single, shared language-model head across the model's transformer layers, which simplifies deployment and maintenance and cuts training time and memory use during both training and inference.
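
One way to picture this early-exit loss: every layer's hidden state is projected through the same final norm and language-model head, and each intermediate prediction contributes a down-weighted cross-entropy term on top of the usual final-layer loss. The function below is a simplified sketch with a single uniform weight for the early exits; the paper's exact weighting and curriculum may differ, and `hidden_states`, `labels`, `norm`, and `lm_head` are assumed inputs.

```python
# Simplified sketch of an early-exit auxiliary loss with a shared head.
# `hidden_states`: list of per-layer tensors of shape (batch, seq, dim);
# `labels`: next-token targets of shape (batch, seq); `norm` and `lm_head` are
# the model's final normalization and shared language-model head. The single
# `early_scale` weight is an assumption, not the paper's schedule.
import torch
import torch.nn.functional as F

def early_exit_loss(hidden_states, labels, norm, lm_head, early_scale=0.1):
    total = 0.0
    last = len(hidden_states) - 1
    for i, h in enumerate(hidden_states):
        logits = lm_head(norm(h))   # shared head: no new parameters per exit
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        # Full weight on the final layer, a small weight on every earlier exit.
        total = total + (1.0 if i == last else early_scale) * ce
    return total
```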

At inference time, the researchers propose verifying an early prediction instead of always running the remaining layers. They adapt speculative decoding to check and correct a whole group of drafted tokens at once, which yields their self-speculative decoding method. Although the approach requires fine-tuning or pre-training the model, with an increased learning rate, to keep accuracy at the baseline level, the researchers argue that the resulting speed and efficiency gains justify the effort.
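
The sketch below illustrates the self-speculative loop under greedy decoding: the same weights play both roles, with the first `exit_layer` blocks plus the shared head acting as the draft model and the full stack acting as the verifier. `embed`, `blocks`, `norm`, and `lm_head` are hypothetical stand-ins for the model's embedding, decoder layers, final norm, and shared head; for clarity the verification pass recomputes the early layers, whereas a real implementation would reuse the draft stage's KV cache so those layers are not run twice.

```python
# Minimal sketch of self-speculative decoding with greedy decoding. The same
# weights serve both roles: `blocks[:exit_layer]` plus the shared `lm_head`
# draft tokens, and the full stack verifies them. All module arguments are
# hypothetical stand-ins; KV-cache reuse is omitted for brevity.
import torch

@torch.no_grad()
def self_speculative_step(embed, blocks, norm, lm_head, tokens, exit_layer=8, k=4):
    def run(ids, depth):
        # Run the first `depth` decoder blocks and return per-position logits.
        h = embed(torch.tensor([ids]))
        for blk in blocks[:depth]:
            h = blk(h)
        return lm_head(norm(h))

    # Draft: autoregressively propose k tokens using only the early layers.
    draft = list(tokens)
    for _ in range(k):
        logits = run(draft, exit_layer)
        draft.append(int(logits[0, -1].argmax()))

    # Verify: one full-depth pass over the drafted sequence, then accept
    # proposals left to right until the full model disagrees.
    full_logits = run(draft, len(blocks))
    accepted = list(tokens)
    for i, proposal in enumerate(draft[len(tokens):]):
        verified = int(full_logits[0, len(tokens) + i - 1].argmax())
        accepted.append(verified)      # keep the full model's token either way
        if verified != proposal:       # first mismatch ends the accepted block
            break
    return accepted
```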

For future work, the research team plans to improve the accuracy of the early-exit layers, which should further increase the speed-ups from self-speculative decoding. They also hope to choose a distinct exit layer for each token, raising the token acceptance rate of self-speculative decoding. The team hopes this research will inspire other developers of end-to-end AI solutions to incorporate layer dropout and early-exit loss into their own systems.
