Transformer-based language models such as ChatGPT and LLaMA-2 have evolved rapidly, with parameter counts now ranging from a few billion to tens of trillions. Although they are excellent generators, these models suffer from high inference latency because of their heavy computational load. This has led to a strong push to accelerate inference, particularly in resource-constrained settings such as real-time applications and edge devices.
Most decoder-only large language models (LLMs) generate text token by token, so producing a sequence requires one transformer call per generated token. This autoregressive (AR) generation process often leads to low computational efficiency and long execution times.
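To make the cost of token-by-token generation concrete, here is a minimal sketch of a greedy AR decoding loop. The function `toy_next_token` is a hypothetical stand-in for a full transformer forward pass, not a real model; the point is only that the number of model calls equals the number of generated tokens.

```python
from typing import Callable, List, Tuple


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for one transformer forward pass (greedy pick)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def generate_autoregressive(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time and count the model calls used."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        # Each new token needs its own call on the whole sequence so far.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens[len(prompt):], calls


new_tokens, forward_calls = generate_autoregressive([1, 2, 3], toy_next_token, 8)
print(f"generated {len(new_tokens)} tokens using {forward_calls} model calls")
```

In a real LLM each of those calls moves the full set of model weights through the accelerator, which is why reducing the number of calls is the target of the methods discussed next.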
Semi-autoregressive (SAR) decoding, which generates several tokens at once, has been proposed as a way to reduce the number of inference executions. However, most LLMs are trained only as AR models, not SAR models, and re-training them for SAR decoding is challenging because the SAR objective is misaligned with AR pretraining.
Researchers at Intellifusion and Harbin Institute of Technology have developed an approach called Bi-directional Tuning for lossless Acceleration (BiTA). It requires only a minimal increase in trainable parameters and aims to enable lossless SAR decoding for AR language models.
BiTA has two main components: bi-directional tuning and simplified verification of SAR draft candidates. Bi-directional tuning extends the usual AR model so that it also predicts future tokens, using learnable prefix and suffix embeddings in the token sequence. A tree-based attention mechanism allows draft generation and verification to happen simultaneously in one forward pass. As a result, no additional validation steps or third-party verification models are needed.
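The draft-then-verify idea can be illustrated with a minimal sketch. This is not BiTA's implementation: `toy_next_token` and `noisy_draft` are hypothetical stand-ins, the drafter is deliberately wrong on its last guess, and a single verification pass is simulated by a Python loop. The sketch assumes greedy decoding and shows only why accepting the longest draft prefix that matches the AR model's own choices keeps the output identical to plain AR decoding.

```python
from typing import List, Tuple


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for the AR model's greedy next-token choice."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def greedy_baseline(prompt: List[int], n: int) -> List[int]:
    """Plain AR decoding, used here as the reference output."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(toy_next_token(tokens))
    return tokens[len(prompt):]


def noisy_draft(tokens: List[int], k: int) -> List[int]:
    """Hypothetical drafter: k guesses, with the last one deliberately wrong."""
    context, draft = list(tokens), []
    for i in range(k):
        true_tok = toy_next_token(context)
        draft.append(true_tok if i < k - 1 else (true_tok + 1) % 50)
        context.append(true_tok)
    return draft


def verify_draft(tokens: List[int], draft: List[int]) -> List[int]:
    """Simulate one verification pass: keep the AR model's token at each
    position and stop at the first position where the draft disagrees."""
    context, accepted = list(tokens), []
    for guess in draft:
        target = toy_next_token(context)
        accepted.append(target)
        context.append(target)
        if guess != target:
            return accepted  # corrected token kept, rest of draft dropped
    accepted.append(toy_next_token(context))  # whole draft matched: one extra token
    return accepted


def generate_with_drafts(prompt: List[int], n: int, k: int = 3) -> Tuple[List[int], int]:
    """Decode n tokens, counting simulated verification passes."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < n:
        tokens.extend(verify_draft(tokens, noisy_draft(tokens, k)))
        passes += 1
    return tokens[len(prompt):len(prompt) + n], passes


sar_tokens, passes = generate_with_drafts([1, 2, 3], n=9)
ar_tokens = greedy_baseline([1, 2, 3], n=9)
print(sar_tokens == ar_tokens, passes)
```

Because every accepted token is the AR model's own choice, the result matches the baseline exactly. With this toy drafter each pass yields three tokens, so nine tokens take three simulated passes rather than nine.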
Tree-based decoding makes both draft generation and verification efficient. Together, these components speed up LLM inference without changing the model's original outputs. Reported tests show that BiTA achieves speedups of 2.1× to 3.3×. Its adaptable prompting design makes it a plug-and-play option for accelerating publicly available LLMs in resource-constrained or real-time scenarios.
Credit for this method goes to the researchers who developed BiTA. Their research paper is available for a detailed understanding of the process.