Large Language Models (LLMs) have improved significantly, but challenges persist, particularly in the prefilling stage: the cost of computing attention over the prompt grows quadratically with the number of prompt tokens, leading to a slow time-to-first-token (TTFT). Optimizing TTFT is therefore crucial for efficient LLM inference.
Various methods have been proposed to improve long-context inference and reduce TTFT, but most require considerable model changes and retraining, and still do not fully address the TTFT bottleneck. Token pruning approaches have shown promise in sentence classification, but their use in generative LLMs has been limited because they were designed for tasks processed in a single forward pass rather than for iterative generation.
Research teams from Apple and Meta AI have proposed a novel technique, LazyLLM, that selectively computes the KV cache only for tokens important to the next prediction and defers the computation for less critical ones. Unlike other methods, LazyLLM can revive pruned tokens to maintain accuracy. An auxiliary cache mechanism stores the hidden states of pruned tokens, allowing efficient retrieval and preventing performance degradation. LazyLLM stands out from other techniques because it is compatible with any transformer-based LLM, requires no additional training, and is effective across a wide range of language tasks.
The LazyLLM framework optimizes LLM inference by progressively pruning less important tokens, so that fewer computations are performed in the later layers of the model. It dynamically selects a different subset of tokens at each generation step, which is critical for maintaining accuracy. Tokens are pruned based on confidence scores calculated from the attention maps, as sketched below.
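As a rough illustration of attention-based token selection (not the authors' implementation; the function name, score definition, and keep ratio here are assumptions), per-token importance can be read off a layer's attention map and used to decide which prompt tokens continue to the next layer:

```python
import torch

def select_tokens_by_attention(attn_probs: torch.Tensor, keep_ratio: float = 0.7):
    """Pick which prompt tokens to keep at this layer (illustrative sketch).

    attn_probs: [num_heads, seq_len, seq_len] softmax attention weights
                from one layer of the current prefill pass (assumed shape).
    keep_ratio: fraction of tokens to keep; a tunable assumption here.
    Returns the kept token indices, sorted back into their original order.
    """
    # One plausible confidence score: how much attention the final token
    # pays to each earlier token, averaged over heads.
    importance = attn_probs[:, -1, :].mean(dim=0)        # shape: [seq_len]

    num_keep = max(1, int(keep_ratio * importance.numel()))
    kept = torch.topk(importance, num_keep).indices      # top-k by score
    return torch.sort(kept).values                       # keep positional order

# Toy usage: 8 heads over a 16-token prompt with random attention weights.
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
print(select_tokens_by_attention(attn, keep_ratio=0.5))
```

Tokens outside the kept set are simply not carried forward at this layer, which is what saves compute during prefilling.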
To address the challenge of extending pruning to the decoding steps, LazyLLM introduces an auxiliary cache mechanism. This cache stores the hidden states of pruned tokens so they can be retrieved efficiently if they are revived later, without recomputing earlier layers. Because each token is computed at most once per transformer layer, LazyLLM’s runtime is never slower than the baseline.
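A minimal sketch of what such an auxiliary cache could look like (the class and method names, and the hidden size in the example, are assumptions rather than the paper's code): hidden states of pruned tokens are stored per layer, and a revived token is served from the cache instead of being recomputed through the preceding layers.

```python
import torch

class AuxCache:
    """Per-layer store of hidden states for pruned tokens (illustrative sketch)."""

    def __init__(self):
        # Maps (layer_idx, token_idx) -> cached hidden state.
        self._store = {}

    def save(self, layer_idx: int, token_idx: int, hidden: torch.Tensor):
        # Called when a token is pruned at this layer: keep its hidden state.
        self._store[(layer_idx, token_idx)] = hidden

    def fetch(self, layer_idx: int, token_idx: int):
        # Called when a pruned token is revived: return its stored hidden
        # state if available, else None (meaning it was never pruned here).
        return self._store.get((layer_idx, token_idx))

# Usage sketch: token 5 is pruned after layer 3, then revived later.
cache = AuxCache()
cache.save(3, 5, torch.randn(1, 4096))   # hidden size 4096 is an assumption
revived = cache.fetch(3, 5)              # reuse instead of recomputing layers 0-3
print(revived is not None)
```

The point of the design is that a lookup in this cache replaces a full forward pass through the earlier layers, so reviving a token costs almost nothing.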
LazyLLM has been shown to increase the efficiency of LLM inference across various language tasks while maintaining close-to-baseline accuracy. It offers better speed-accuracy trade-offs than competing methods and is effective across tasks including question answering, summarization, and code completion.
LazyLLM’s main advantage is its seamless integration with existing transformer-based LLMs, improving inference speed without any fine-tuning. By dynamically prioritizing token computation based on relevance, LazyLLM offers a practical way to make LLM inference more efficient, which is particularly valuable given the growing demand for faster, more resource-efficient language models across numerous applications.