Auto-regressive decoding is the standard way Large Language Models (LLMs) generate text, but producing one token at a time makes the process time-consuming and costly. An approach called speculative sampling has emerged to address this: a small model cheaply drafts several tokens ahead, and the large model verifies them in parallel, significantly improving speed.
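To make the draft-then-verify loop concrete, here is a minimal, self-contained sketch of speculative sampling. The models are toy stand-ins (`draft_dist` and `target_dist` are deterministic placeholder distributions, not real LLMs, and `speculative_step` is a hypothetical name), but the acceptance rule is the standard one: accepted tokens provably follow the target model's distribution.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size (assumption for illustration)
rng = np.random.default_rng(0)

def _toy_dist(ctx, seed):
    """Deterministic toy next-token distribution for a given context."""
    r = np.random.default_rng(hash((tuple(ctx), seed)) % (2**32))
    logits = r.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_dist(ctx):   # stand-in for the small draft model q(. | ctx)
    return _toy_dist(ctx, seed=1)

def target_dist(ctx):  # stand-in for the large target model p(. | ctx)
    return _toy_dist(ctx, seed=2)

def speculative_step(ctx, k=4):
    """Draft k tokens with q, then verify them against p.

    A real implementation scores all k positions with the target model
    in a single parallel forward pass; the loop below is sequential
    only because the toy distributions are computed one at a time."""
    drafted, q_probs, c = [], [], list(ctx)
    for _ in range(k):
        q = draft_dist(c)
        t = int(rng.choice(VOCAB, p=q))
        drafted.append(t)
        q_probs.append(q)
        c.append(t)

    accepted = []
    for t, q in zip(drafted, q_probs):
        p = target_dist(ctx + accepted)
        if rng.random() < min(1.0, p[t] / q[t]):  # accept with prob min(1, p/q)
            accepted.append(t)
        else:
            # rejected: resample from the residual distribution max(p - q, 0)
            resid = np.maximum(p - q, 0)
            resid /= resid.sum()
            accepted.append(int(rng.choice(VOCAB, p=resid)))
            return accepted                        # stop at the first rejection
    # all k drafts accepted: take one bonus token directly from p
    p = target_dist(ctx + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return accepted

print(speculative_step([0, 1], k=4))
```

When the draft distribution is close to the target's, most drafted tokens are accepted and each verification pass yields several tokens instead of one, which is where the speed-up comes from.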
However, the speed gains from speculative sampling often come at the cost of draft accuracy. To address this, researchers from Peking University, Microsoft Research, and other institutions have developed EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a new framework that improves both speed and accuracy by performing the auto-regression at the feature level instead of the token level.
In testing, EAGLE’s draft accuracy was significantly better than that of other approaches such as Medusa, and it avoids the distribution drift sometimes associated with accelerated LLM outputs. Moreover, EAGLE can be combined with other methods to increase throughput while reducing operational expenses, improving the overall performance of LLM systems. EAGLE has been shown to accelerate LLM decoding while maintaining the original LLM’s text distribution, making it immediately usable.
The EAGLE approach rests on two key findings. First, auto-regressing over top-layer features is more effective than auto-regressing over bottom-layer token embeddings when using the same lightweight network. Second, a draft model fed only top-layer features performs poorly because of the inherent uncertainty of sampling: the features alone do not reveal which token the sampler actually chose. It is therefore crucial to also feed the sampled tokens into the draft model.
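A minimal sketch of what such a draft step might look like, assuming a PyTorch-style interface. The dimensions, module names, and single-layer architecture here are illustrative assumptions, not the official EAGLE implementation; the point is how the top-layer feature at each position is fused with the embedding of the token sampled one step ahead.

```python
import torch
import torch.nn as nn

D, V = 4096, 32000  # target-LLM hidden size and vocab size (illustrative values)

class EagleStyleDraftHead(nn.Module):
    """Schematic draft model in the spirit of EAGLE (hypothetical, not official code).

    Auto-regression happens over top-layer features f_1..f_n rather than
    tokens; because sampling makes the next feature ambiguous, the input
    also includes the embedding of the token actually sampled at each step."""

    def __init__(self, embed: nn.Embedding, lm_head: nn.Linear):
        super().__init__()
        self.embed = embed      # reuse the frozen target-LLM embedding table
        self.lm_head = lm_head  # reuse the frozen target-LLM output head
        self.fuse = nn.Linear(2 * D, D)  # fuse (feature, token embedding) -> D
        self.block = nn.TransformerEncoderLayer(
            d_model=D, nhead=32, dim_feedforward=4 * D, batch_first=True)

    def forward(self, feats, next_tokens):
        # feats:       (B, n, D) top-layer features from the target LLM
        # next_tokens: (B, n)    tokens sampled one step ahead of each feature
        x = self.fuse(torch.cat([feats, self.embed(next_tokens)], dim=-1))
        n = x.size(1)
        causal = torch.triu(  # standard causal mask for the draft sequence
            torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        h = self.block(x, src_mask=causal)  # predicted next features
        return h, self.lm_head(h)           # draft-token logits via the frozen head
```

The fusion layer is the crux: the feature sequence alone cannot tell the draft model which branch the sampler took, so the sampled tokens, shifted one step ahead, are concatenated with the features before prediction.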
The researchers evaluated EAGLE on MT-bench, a realistic benchmark simulating real-world scenarios that has also been used to report the speed-up ratios of other methods. This allowed a direct and impartial comparison between EAGLE and existing approaches.
Competitive performance at modest training cost is another commendable feature of EAGLE. Because the trained draft model accelerates every subsequent query in real-world deployments, the amortized training cost approaches zero as the number of queries grows. Operating in tandem with other throughput-boosting approaches, EAGLE has demonstrated strong potential for enhancing the performance of LLM operations.