Large Language Models (LLMs) have had a major impact on tasks ranging from translation to sentiment analysis. Their practical use, however, is hampered by computational demands, particularly for long prompts, because of the quadratic complexity of the attention mechanism. To address this, researchers from Microsoft Corporation and the University of Surrey have developed MInference, a method that accelerates long-sequence processing in LLMs.
MInference identifies three distinct attention patterns (A-shape, Vertical-Slash, and Block-Sparse) and uses them to optimize sparse attention computation on GPUs. It builds sparse indices for these patterns dynamically during inference, reducing latency without any change to pre-training and without fine-tuning. In tests across several LLMs and benchmarks, MInference delivered up to a 10x speedup, cutting the pre-filling stage from 30 minutes to 3 minutes on a single A100 GPU while maintaining accuracy.
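To make the three patterns concrete, here is a minimal, illustrative sketch in PyTorch that builds boolean masks for each one. The window sizes, columns, diagonals, and block choices below are made-up examples, not the indices MInference actually selects, and the real method runs optimized sparse GPU kernels rather than materializing dense masks.

```python
# Illustrative sketch only: dense boolean masks for the three attention patterns.
# All pattern parameters (init, window, columns, diagonals, kept_blocks) are
# hypothetical examples chosen for demonstration.
import torch

def a_shape_mask(n, init=64, window=256):
    """Attend to the first `init` tokens plus a local sliding window (A-shape)."""
    i = torch.arange(n).unsqueeze(1)   # query positions
    j = torch.arange(n).unsqueeze(0)   # key positions
    causal = j <= i
    return causal & ((j < init) | (i - j < window))

def vertical_slash_mask(n, columns, diagonals):
    """Attend to a few key columns ("verticals") and a few diagonals ("slashes")."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    causal = j <= i
    vert = torch.zeros(n, n, dtype=torch.bool)
    vert[:, columns] = True
    slash = torch.zeros(n, n, dtype=torch.bool)
    for d in diagonals:                # d = (i - j) offset of a kept diagonal
        slash |= (i - j) == d
    return causal & (vert | slash)

def block_sparse_mask(n, block, kept_blocks):
    """Attend only within a chosen set of (query-block, key-block) pairs."""
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qb, kb in kept_blocks:
        mask[qb * block:(qb + 1) * block, kb * block:(kb + 1) * block] = True
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return mask & (j <= i)

if __name__ == "__main__":
    n = 1024
    # Fraction of attention entries each pattern keeps out of the full n x n grid.
    print(a_shape_mask(n).float().mean().item())
    print(vertical_slash_mask(n, [0, 17, 512], [0, 1, 128]).float().mean().item())
    print(block_sparse_mask(n, 64, [(0, 0), (7, 3), (15, 15)]).float().mean().item())
```

The printed fractions show why these patterns pay off: each mask touches only a small slice of the full attention grid, which is the computation the sparse kernels skip.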
The attention weights in a long-context LLM are inherently sparse and dynamic: even at a 128k context, retaining only the top 4k columns covers 96.8% of the total attention mass. Although the exact positions vary from prompt to prompt, the patterns exhibit consistent structure across layers and heads. By leveraging this structure, MInference can substantially optimize sparse computation, reducing overhead while keeping accuracy high in long-context LLMs.
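A simple way to check this kind of sparsity claim is to measure what fraction of the attention mass the top-k key positions capture for each query. The sketch below does this on random toy scores, so its numbers will not reproduce the 96.8% figure reported for real 128k-context attention; it only illustrates the measurement.

```python
# Illustrative coverage check: fraction of attention mass captured by the top-k
# key positions. Toy random scores; real long-context LLM attention is far more
# concentrated than this example.
import torch

def topk_attention_coverage(scores: torch.Tensor, k: int) -> float:
    """scores: (heads, q_len, k_len) pre-softmax attention scores."""
    probs = torch.softmax(scores, dim=-1)                 # attention weights per query
    topk = probs.topk(k, dim=-1).values                   # largest k weights per query
    return (topk.sum(-1) / probs.sum(-1)).mean().item()   # mean fraction of mass kept

if __name__ == "__main__":
    heads, q_len, k_len = 8, 256, 4096
    scores = torch.randn(heads, q_len, k_len) * 4.0       # sharpened toy distribution
    print(topk_attention_coverage(scores, k=128))
```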
MInference’s effectiveness and efficiency were evaluated on multiple benchmarks covering tasks such as QA, summarization, and retrieval, using four state-of-the-art long-context language models, including LLaMA-3 and GLM-4. Tested at a range of context lengths, MInference preserved long-context performance while processing prompts faster than competing methods.
MInference uses dynamic sparse attention built around specific spatial aggregation patterns: A-shape, Vertical-Slash, and Block-Sparse. A kernel-aware search assigns the best-fitting sparse pattern to each attention head, and a rapid approximation then builds dynamic sparse masks for each prompt, enabling efficient sparse attention. Combined with key-value cache compression techniques, this lets MInference retain the long-context performance of LLMs while achieving up to a 10x speedup on a single A100 GPU, cutting pre-filling latency from 30 minutes to 3 minutes.
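The sketch below gives one plausible flavor of that rapid per-prompt approximation for a Vertical-Slash head: score only the last few query tokens against every key, then keep the highest-scoring key columns ("verticals") and diagonal offsets ("slashes") as the dynamic sparse index. The function name, tensor shapes, and the `last_q`, `top_v`, and `top_s` values are assumptions for illustration, not MInference's exact procedure or configuration.

```python
# Minimal sketch, under simplifying assumptions, of estimating Vertical-Slash
# indices from the last few queries of a single attention head.
import torch

def estimate_vertical_slash(q, k, last_q=64, top_v=1000, top_s=64):
    """q, k: (seq_len, head_dim) for one attention head (hypothetical interface)."""
    n, d = q.shape
    # Attention of the last `last_q` queries against all keys (cheap vs. full attention).
    scores = q[-last_q:] @ k.T / d ** 0.5                  # (last_q, n)
    i = torch.arange(n - last_q, n).unsqueeze(1)           # absolute query positions
    j = torch.arange(n).unsqueeze(0)                       # key positions
    scores = scores.masked_fill(j > i, float("-inf"))      # causal mask
    probs = torch.softmax(scores, dim=-1)

    # Vertical estimate: total mass each key column receives.
    vertical_idx = probs.sum(0).topk(min(top_v, n)).indices

    # Slash estimate: total mass on each diagonal offset (i - j).
    offsets = (i - j).clamp(min=0)                         # future positions carry zero mass
    diag_mass = torch.zeros(n).scatter_add_(0, offsets.reshape(-1), probs.reshape(-1))
    slash_idx = diag_mass.topk(min(top_s, n)).indices

    return vertical_idx, slash_idx

if __name__ == "__main__":
    n, d = 8192, 128
    q, k = torch.randn(n, d), torch.randn(n, d)
    verticals, slashes = estimate_vertical_slash(q, k)
    print(verticals.shape, slashes.shape)
```

In an actual system, the resulting indices would feed a sparse attention kernel over the full prompt, so the expensive quadratic pass is never materialized.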
In conclusion, MInference offers a promising solution to the substantial latency of the pre-filling stage in long-context LLMs’ attention computation. The researchers anticipate that similar patterns could apply to multi-modal and encoder-decoder LLMs, pointing to further opportunities for accelerating the pre-filling stage.