Large language models (LLMs) have reshaped applications across sectors by offering powerful natural language processing capabilities. They generate, interpret, and understand human language, opening routes to new technological advances. However, LLMs demand considerable computational, memory, and energy resources, particularly during inference, which limits both their operational efficiency and where they can be deployed.
The sheer number of parameters in these models requires substantial resources to store and manipulate. To address this, researchers have turned to quantization, which speeds up computation and reduces memory consumption by lowering the numerical precision of a model's parameters. However, outliers in weights and activations stretch the quantization range and can severely degrade accuracy, a persistent obstacle, as the sketch below illustrates.
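The following minimal sketch (illustrative only, not QuaRot) uses symmetric round-to-nearest 4-bit quantization in NumPy to show why outliers are the core difficulty: a single large value inflates the quantization scale, so every other value loses precision.

```python
# Illustrative sketch of symmetric 4-bit quantization and the outlier problem.
import numpy as np

def quantize_int4(x):
    """Symmetric round-to-nearest quantization to integers in [-7, 7]."""
    scale = np.abs(x).max() / 7.0          # scale is set by the largest magnitude
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=1024).astype(np.float32)

# Without an outlier: the 16 quantization levels cover the data well.
q, s = quantize_int4(activations)
print("mean error without outlier:", np.abs(activations - dequantize(q, s)).mean())

# With one large outlier: the scale grows, so all other values lose precision.
activations[0] = 100.0
q, s = quantize_int4(activations)
print("mean error with outlier:   ", np.abs(activations - dequantize(q, s)).mean())
```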
Researchers from ETH Zurich, EPFL, Microsoft Research, IST Austria, and NeuralMagic have presented QuaRot, a rotation-based quantization scheme designed to suppress the impact of outliers. Using randomized Hadamard transformations and computational invariance (a property ensuring these rotations do not alter the model's final output), QuaRot provides end-to-end 4-bit quantization covering all model parts, including the weights, the key-value (KV) cache, and the activations. This considerably reduces the model's memory and computational requirements.
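The sketch below illustrates the underlying principle rather than QuaRot's actual implementation: an orthogonal randomized Hadamard matrix Q can be folded into the weights, so rotating activations by Q and weights by Q.T leaves the layer output mathematically unchanged while spreading outlier mass across dimensions, making the rotated tensors far easier to quantize. The dimensions and data here are arbitrary assumptions for illustration.

```python
# Computational invariance under a randomized Hadamard rotation (conceptual demo).
import numpy as np
from scipy.linalg import hadamard

d = 512                                          # hidden size (power of two)
rng = np.random.default_rng(0)

# Randomized Hadamard rotation: Q = H * diag(+/-1) / sqrt(d) is orthogonal.
signs = rng.choice([-1.0, 1.0], size=d)
Q = hadamard(d).astype(np.float64) * signs / np.sqrt(d)

# Activations with a few large outlier channels, plus a weight matrix.
x = rng.normal(0.0, 1.0, size=(8, d))
x[:, :4] += 50.0                                 # outlier channels
W = rng.normal(0.0, 0.02, size=(d, d))

# Invariance: (x Q)(Q^T W) == x W, because Q Q^T = I.
out_original = x @ W
out_rotated = (x @ Q) @ (Q.T @ W)
print("max output difference:", np.abs(out_original - out_rotated).max())

# The rotation spreads the outliers, shrinking the dynamic range to quantize.
print("max |activation| before rotation:", np.abs(x).max())
print("max |activation| after  rotation:", np.abs(x @ Q).max())
```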
QuaRot was evaluated on the LLAMA 2-70B model with strong results: the quantized model retained up to 99% of its pre-quantization zero-shot performance, achieved up to 2.16x speed-up during the compute-bound prefill phase of inference, and delivered up to 3.39x memory savings during the memory-bound decoding phase. These gains matter because they lower the energy consumption and operational cost of running such large models; the back-of-envelope sketch below shows where savings of this magnitude come from.
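A rough, illustrative calculation (not figures from the paper) of why end-to-end 4-bit quantization helps a memory-bound decoding phase: every weight that must be streamed from memory per generated token shrinks from 16 bits to 4 bits, an ideal 4x reduction; reported savings such as the 3.39x above land a little below that ideal because of per-group scales and other overheads.

```python
# Back-of-envelope memory estimate for 70B parameters at FP16 versus INT4.
NUM_PARAMS = 70e9
BITS_FP16, BITS_INT4 = 16, 4

weights_fp16_gb = NUM_PARAMS * BITS_FP16 / 8 / 1e9
weights_int4_gb = NUM_PARAMS * BITS_INT4 / 8 / 1e9
print(f"weights: {weights_fp16_gb:.0f} GB (FP16) -> {weights_int4_gb:.0f} GB (INT4)")
# Decoding streams these bytes (plus the KV cache) for each generated token,
# so fewer bits per value directly means less memory traffic per token.
```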
By enabling end-to-end 4-bit inference without significant performance loss, QuaRot makes LLMs deployable on a far wider range of devices, including those with limited computational resources. This expands the applicability of advanced language models to sectors where compute has been the limiting factor, spurring further innovation.
In summary, QuaRot marks a notable advance in optimizing large language models. It addresses the long-standing challenge of quantizing LLMs efficiently without compromising accuracy through its use of randomized Hadamard transformations and computational invariance, and its results on the LLAMA 2-70B model demonstrate drastic reductions in memory usage and computational demand. All credit for this research goes to the researchers of this project.