Advances in large language models (LLMs) have transformed natural language processing, delivering strong results in tasks such as translation, question answering, and text summarization. Yet LLMs face a significant obstacle: slow autoregressive inference, which restricts their utility in real-time applications. Because each decoding step must load the full model weights to emit a single token, the bottleneck is mainly memory bandwidth rather than a lack of computational strength, so researchers are exploring ways to accelerate the inference process.
Speculative decoding addresses this by generating multiple tokens per step: a small drafter model proposes several tokens cheaply, and the full model verifies them all in a single parallel forward pass. Conventional methods, however, depend on an external drafter model, which is expensive to train and whose own forward passes add latency at every step, eating into the speedup.
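To make the mechanism concrete, here is a minimal sketch of the generic draft-then-verify loop these methods share. It assumes Hugging Face-style causal LMs whose outputs expose `.logits`, batch size 1, and greedy acceptance; published methods instead use a rejection-sampling rule that keeps the output distribution identical to the target model's.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """One draft-then-verify step of greedy speculative decoding.

    Assumes Hugging Face-style causal LMs returning `.logits` of shape
    [batch, seq, vocab], and batch size 1. Greedy acceptance is a
    simplification of the usual rejection-sampling rule.
    """
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)      # [1, 1]
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2. Verify: the large model scores all k drafted positions in a
    #    single parallel forward pass instead of k sequential ones.
    target_logits = target_model(draft_ids).logits
    start = input_ids.shape[1] - 1  # position start predicts token start+1
    target_pred = target_logits[:, start:start + k, :].argmax(dim=-1)

    # 3. Accept the longest prefix on which draft and target agree;
    #    the first disagreement is replaced by the target's own token.
    drafted = draft_ids[:, -k:]
    agree = (target_pred == drafted).long().cumprod(dim=-1)
    n_accepted = int(agree.sum())                          # batch size 1
    correction = target_pred[:, n_accepted:n_accepted + 1]
    return torch.cat([input_ids, drafted[:, :n_accepted], correction], dim=-1)
```

The payoff is that every accepted draft token saves one full forward pass of the large model, while the verification step guarantees the output matches what the large model would have produced on its own.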
Approaches such as Lookahead and Medusa reduce the reliance on a full external drafter; Medusa, for instance, trains lightweight decoding heads on top of the main model. Even so, the added components require parameter updates and nontrivial computation at inference time, so drafting overhead still caps the overall acceleration.
To address this issue, researchers at Huawei Noah's Ark Lab have developed Kangaroo, a lossless self-speculative decoding framework. Unlike traditional methods that rely on an external drafter model, Kangaroo uses a fixed, shallow sub-network of the LLM itself as the drafter and connects it to the full model through a lightweight adapter module, which is the only component that needs training, enabling efficient and accurate token generation.
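The structure can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the backbone attribute names (`layers`, `norm`, `lm_head`) and the assumption that each decoder layer maps hidden states to hidden states are placeholders for whatever the actual model exposes.

```python
import torch.nn as nn

class SelfDrafter(nn.Module):
    """Kangaroo-style self-drafting sketch: the drafter is the first few
    transformer layers of the full model plus a small trained adapter,
    so no separate drafter network is loaded or trained from scratch.
    Attribute names (`layers`, `norm`, `lm_head`) and the layer call
    signature are assumptions about a typical decoder-only backbone."""
    def __init__(self, full_model, adapter, n_shallow=2):
        super().__init__()
        self.shallow = full_model.layers[:n_shallow]  # frozen, shared weights
        self.adapter = adapter                        # the only trained part
        self.norm = full_model.norm                   # shared final norm
        self.lm_head = full_model.lm_head             # shared output head

    def forward(self, hidden):                        # hidden: embedded tokens
        for layer in self.shallow:
            hidden = layer(hidden)                    # assumed hidden -> hidden
        hidden = self.adapter(hidden)                 # bridge shallow/deep gap
        return self.lm_head(self.norm(hidden))        # [batch, seq, vocab]
```

Since the shallow layers are also the first layers the full model runs during verification, their work can in principle be shared rather than recomputed, which is part of why self-drafting is cheaper than running a separate drafter.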
Kangaroo also employs an early-exit mechanism that halts drafting as soon as the drafter's confidence in the current token falls below a set threshold, avoiding computation on tokens the full model would likely reject anyway. The adapter module itself consists of just a multi-head attention block and two normalization layers, keeping its parameter count small while preserving token quality. Because drafting stops dynamically rather than after a fixed number of steps, Kangaroo spends its parallel verification passes only where the drafter is confident, avoiding extraneous computation.
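Both pieces can be sketched as below, under the same caveats: the adapter's residual wiring, LayerNorm choice, and causal mask, along with the threshold `tau` and the `max_draft` cap, are illustrative assumptions rather than the paper's exact design or tuned settings.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """The adapter described above: one multi-head attention block and
    two normalization layers. The residual connection, LayerNorm
    choice, and causal mask are illustrative assumptions."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm_in = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, h):                       # h: [batch, seq, d_model]
        x = self.norm_in(h)
        seq = x.shape[1]
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        return self.norm_out(h + out)

@torch.no_grad()
def draft_with_early_exit(drafter, embed, input_ids, tau=0.6, max_draft=8):
    """Propose tokens until the drafter's top-1 probability falls below
    the confidence threshold `tau`. `tau` and `max_draft` are
    illustrative, not tuned values; batch size 1 is assumed."""
    ids, n_drafted = input_ids, 0
    for _ in range(max_draft):
        logits = drafter(embed(ids))            # drafter as sketched above
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        conf, next_id = probs.max(dim=-1, keepdim=True)
        if conf.item() < tau:                   # early exit: drafter unsure
            break
        ids = torch.cat([ids, next_id], dim=-1)
        n_drafted += 1
    return ids, n_drafted
```

The confidence check is what makes the drafting budget adaptive: easy continuations yield long accepted drafts, while hard ones hand control back to the full model almost immediately.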
Evaluations on Spec-Bench show that Kangaroo achieves a speedup of up to 1.7× over comparable methods while using 88.7% fewer additional parameters than Medusa. This gain stems from its double early-exit mechanism, exiting early both in depth (drafting from a shallow sub-network) and in time (stopping drafting once confidence drops below the threshold), together with the compact design of its adapter network. The framework substantially reduces latency, making it well suited to real-time natural language processing applications.
In summary, Kangaroo offers an effective route to faster LLM inference. By reusing a fixed, shallow sub-network of the LLM as its drafter, it eliminates the cost of training and running an external drafter model. With up to a 1.7× speedup and a drastic reduction in additional parameters, Kangaroo is a promising approach to improving the efficiency of large language models, cutting latency for real-time natural language processing without compromising output quality.