Large language models (LLMs) such as GPT-4, LLaMA, and PaLM are playing a significant role in advancing the field of artificial intelligence. However, these models decode autoregressively, generating one token per forward pass, which leads to high inference latency. To address this, researchers have pursued two broad approaches to efficient LLM inference: one that requires additional training and one that does not.
One such approach applies Knowledge Distillation (KD) to autoregressive LLMs, training a student model to minimize the reverse KL divergence to a teacher model. Nevertheless, conventional KD methods were found to be ineffective for LLMs. Hence, researchers from Shanghai Jiao Tong University and the University of California proposed a more effective family of models known as Consistency Large Language Models (CLLMs).
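For intuition, a reverse-KL distillation objective of this kind can be written over per-token logits roughly as in the sketch below. This is an illustrative sketch only, not the authors' code; the function name and tensor layout are assumptions.

```python
# Illustrative sketch of a reverse-KL distillation loss, KL(student || teacher),
# computed from per-token logits. Names and shapes are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Mean reverse KL divergence KL(q_student || p_teacher) over all token positions.

    Both inputs are assumed to have shape [batch, seq_len, vocab]. Taking the
    expectation under the student distribution makes the objective mode-seeking.
    """
    log_q = F.log_softmax(student_logits, dim=-1)   # student log-probs
    log_p = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probs
    q = log_q.exp()
    # KL(q || p) = sum_x q(x) * (log q(x) - log p(x)), summed over the vocabulary
    kl = (q * (log_q - log_p)).sum(dim=-1)
    return kl.mean()
```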
CLLMs are designed for Jacobi decoding, a parallel decoding method that reduces latency by iteratively refining a whole block of token guesses until it converges to the autoregressive output. Unlike previous approaches, CLLMs require no additional memory for auxiliary model components, and they outperform alternatives such as speculative decoding and Medusa. Trained on approximately 1M tokens for LLaMA-7B, CLLMs proved to be 3.4× faster on the Spider dataset. This speed-up is attributed to fast forwarding, the correct prediction of several consecutive tokens in a single forward pass, and to stationary tokens, which are predicted correctly and remain unchanged even when preceded by inaccurate tokens.
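To make the decoding mechanism concrete, here is a minimal sketch of greedy Jacobi decoding, assuming `model(input_ids)` is a causal LM that returns next-token logits of shape `[batch, seq_len, vocab]`. The helper name, signature, and initialization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of greedy Jacobi decoding (illustrative, not the authors' code).
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_len, max_iters=None, pad_id=0):
    """Decode `block_len` new tokens after `prefix_ids` via Jacobi fixed-point iteration."""
    device = prefix_ids.device
    # Start from an arbitrary initial guess for the whole block (here: pad tokens).
    guess = torch.full((1, block_len), pad_id, dtype=torch.long, device=device)
    max_iters = max_iters or block_len  # converges in at most block_len iterations

    for _ in range(max_iters):
        # One parallel forward pass over prefix + current block of guesses.
        logits = model(torch.cat([prefix_ids, guess], dim=1))
        # Greedy next-token prediction for every position in the block.
        new_guess = logits[:, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):   # fixed point reached: block matches greedy AR output
            break
        guess = new_guess
    return guess
```

The fixed point of this iteration matches what greedy autoregressive decoding would produce, which is why fast-forwarded and stationary tokens translate directly into fewer forward passes.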
The findings indicate that the counts of both fast-forwarded and stationary tokens improve by 2.0× to 6.8× across all four datasets. Furthermore, CLLMs deliver larger gains on domain-specific datasets than on open-domain data, as profiled on MT-bench.
Tests were performed to evaluate the performance and inference speedup of CLLMs across multiple tasks. The results showed that CLLMs achieve a 2.4× to 3.4× speedup with Jacobi decoding and nearly no accuracy loss on domain-specific benchmarks such as GSM8K, CodeSearchNet Python, and Spider. On ShareGPT, CLLMs achieved a 2.4× speedup with comparable quality, scoring 6.4 on the open-domain MT-bench benchmark.
To sum up, the researchers introduced CLLMs, a new family of LLMs that significantly enhances the efficiency of Jacobi decoding. Because CLLMs are adapted from a pre-trained LLM, they avoid the complexity of managing two different models within a system, and they increase the number of tokens generated per forward pass across different datasets. The researchers' work on CLLMs represents a significant advancement in the field of AI and language model development.