
Enhancing Pretrained LLMs via Post-Training Shift-and-Add Reparameterization: High-Performance Models Without Multiplication Operations

Large language models (LLMs) like GPT-3 require substantial computational resources to deploy, making them difficult to run on resource-constrained devices. Techniques for improving LLM efficiency, such as pruning, quantization, and attention optimization, have been developed, but they often reduce accuracy or continue to rely heavily on energy-intensive multiplication operations.

To address these problems, researchers from Google, Intel, and the Georgia Institute of Technology have proposed a new method named ShiftAddLLM, which accelerates pretrained LLMs through post-training shift-and-add reparameterization. It works by converting each weight matrix into binary matrices paired with group-wise scaling factors, then replacing the associated multiplications with shifts between activations and scaling factors and with additions guided by the binary matrices. This reduces memory usage and latency while maintaining or improving model accuracy.
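The sketch below illustrates the general idea in NumPy: each column group of a weight matrix is approximated as a sum of binary (±1) matrices with scaling factors rounded to powers of two, so a matrix-vector product can be evaluated with sign-dependent additions and bit shifts instead of multiplications. This is a minimal illustration of the reparameterization concept, not the authors' implementation; the greedy fitting loop, the group size, and the per-row scaling are assumptions made for clarity.

```python
import numpy as np

def quantize_weights_shiftadd(W, group_size=128, n_bits=3):
    """Illustrative binary-coding-style quantization (sketch, not the paper's code):
    approximate each column group of W as sum_k 2**shift_k * B_k with B_k in {-1, +1},
    so the scaling can later be done with bit shifts instead of multiplications.
    Assumes W.shape[1] is divisible by group_size."""
    out_f, in_f = W.shape
    n_groups = in_f // group_size
    B = np.zeros((n_bits, out_f, in_f), dtype=np.int8)           # binary codes
    shift = np.zeros((n_bits, out_f, n_groups), dtype=np.int32)  # log2 of scaling factors
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        residual = W[:, cols].astype(np.float64)
        for k in range(n_bits):
            b = np.where(residual >= 0, 1, -1).astype(np.int8)   # sign pattern of residual
            alpha = np.abs(residual).mean(axis=1, keepdims=True) # greedy per-row scale
            # round each scale to the nearest power of two -> pure shift at inference
            exp = np.round(np.log2(np.maximum(alpha, 1e-12))).astype(np.int32)
            B[k, :, cols] = b
            shift[k, :, g] = exp[:, 0]
            residual -= (2.0 ** exp) * b                         # refine with next binary code
    return B, shift

def shiftadd_matvec(B, shift, x, group_size=128):
    """Multiplication-free approximation of y = W @ x: the binary codes turn the
    dot product into adds/subtracts of activations, and the power-of-two scales
    become shifts (emulated here with ldexp on floats)."""
    n_bits, out_f, in_f = B.shape
    y = np.zeros(out_f)
    for g in range(in_f // group_size):
        cols = slice(g * group_size, (g + 1) * group_size)
        xg = x[cols]
        for k in range(n_bits):
            partial = np.where(B[k, :, cols] > 0, xg, -xg).sum(axis=1)  # adds only
            y += np.ldexp(partial, shift[k, :, g])                      # shift, not multiply
    return y
```

Rounding the scaling factors to powers of two is what turns the remaining per-group scalar multiplications into shifts; the binary codes on their own already replace the multiplications inside each group with additions and subtractions.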

Key to the success of ShiftAddLLM is a multi-objective optimization strategy that aligns weight and output-activation objectives to minimize reparameterization error. This is complemented by an automated bit allocation strategy that chooses the bit-width for each layer's weights according to its reparameterization sensitivity: layers that are more sensitive receive higher-bit representations, limiting accuracy loss while preserving the efficiency gains.
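A minimal sketch of what such a criterion and allocation could look like is shown below. The combined objective weighs the weight reconstruction error against the error it induces in the layer's output activations on calibration data, and the allocator trades bits between the most and least sensitive layers around an average budget. The weighting factor lam, the calibration matrix X, and the quarter-based swap heuristic are illustrative assumptions, not the exact procedure described in the paper.

```python
import numpy as np

def reparam_objective(W, W_hat, X, lam=1.0):
    """Illustrative combined criterion: penalize both the weight reconstruction
    error ||W - W_hat||_F^2 and the resulting output-activation error
    ||X (W - W_hat)^T||_F^2 on calibration inputs X (rows = samples)."""
    dW = W - W_hat
    return np.linalg.norm(dW) ** 2 + lam * np.linalg.norm(X @ dW.T) ** 2

def allocate_bits(sensitivities, avg_bits=3, low=2, high=4):
    """Toy sensitivity-driven allocation: layers whose objective degrades most
    at low precision get 'high' bits, the most robust ones get 'low', and
    upgrades/downgrades are paired so the average stays near avg_bits."""
    sens = np.asarray(sensitivities, dtype=float)
    order = np.argsort(sens)                 # ascending: most robust layers first
    bits = np.full(len(sens), int(avg_bits))
    n_swaps = len(sens) // 4                 # trade bits between the two extreme quarters
    if n_swaps > 0:
        bits[order[-n_swaps:]] = high        # most sensitive layers -> more bits
        bits[order[:n_swaps]] = low          # least sensitive layers -> fewer bits
    return bits
```

A layer's sensitivity could, for example, be estimated as the increase in this objective when that layer is reparameterized at the lowest candidate bit-width.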

This new approach has been evaluated on five LLM families and eight tasks, showing average perplexity improvements of 5.6 and 22.7 points (for 3-bit and 2-bit weights, respectively) at comparable or lower latency than previous quantized LLMs. It also reduces memory and energy consumption by more than 80% relative to the original models.

Overall, ShiftAddLLM reduces perplexity more effectively across models and tasks than other popular methods such as OPTQ, LUT-GEMM, and AWQ. It also achieves better accuracy-latency trade-offs, with perplexity reductions of up to 103,830.45 points and latency reductions of up to 60.1%.

In conclusion, ShiftAddLLM drastically reduces the computational cost of deploying large language models without sacrificing accuracy. Its gains in memory and energy efficiency could make high-performing LLMs practical and accessible for a broader range of applications, marking a key step toward easing the deployment challenges of large-scale AI models.
