
ETH Zurich and Microsoft Scientists Present SliceGPT for Enhanced Compression of Large Language Models via Sparsification

Large language models (LLMs) like GPT-4 require considerable computational power and memory, making their efficient deployment challenging. Techniques such as sparsification have been developed to reduce these demands, but they can introduce complications of their own, such as more complex system architectures and speedups that are only partially realized on current hardware.

Compression methods for LLMs such as sparsification, low-rank approximation, and structured pruning are well known, but some, like Optimal Brain Surgeon (OBS), are computationally demanding. Techniques such as GPTQ and SparseGPT focus on quantization and pruning, while others simplify weight matrices or remove specific rows and columns. Approaches like ThiNet and LLM-Pruner rely on linear operations and fine-tuning.

A novel sparsification scheme called SliceGPT has been proposed by researchers at ETH Zurich and Microsoft Research. It reduces the network’s embedding dimension by replacing each weight matrix with a smaller, dense matrix, which leads to faster inference on existing system architectures. The method relies on a computational invariance property of transformer networks.
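To make the idea of computational invariance concrete, here is a minimal, illustrative sketch in PyTorch (not the authors’ code; the shapes and variable names are assumptions). It checks that folding a random orthogonal matrix into the weights surrounding an RMSNorm leaves the output unchanged.

```python
import torch

torch.manual_seed(0)
d = 8                                     # toy embedding dimension

def rms_norm(x):
    # RMSNorm without a learnable scale; the paper folds that scale
    # into the adjacent weight matrices.
    return x / x.norm(dim=-1, keepdim=True) * (d ** 0.5)

x = torch.randn(2, d)                     # a small batch of activations
W_in = torch.randn(d, d)                  # weight feeding the normalization
W_out = torch.randn(d, d)                 # weight consuming the normalized signal

# Random orthogonal matrix obtained from a QR decomposition.
Q, _ = torch.linalg.qr(torch.randn(d, d))

original = rms_norm(x @ W_in) @ W_out
rotated = rms_norm(x @ W_in @ Q) @ (Q.T @ W_out)   # Q folded into both sides

print(torch.allclose(original, rotated, atol=1e-5))  # -> True
```

Because RMSNorm only rescales each activation vector by its norm, and an orthogonal rotation preserves that norm, the rotation commutes with the normalization and can be absorbed into the weights on either side.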

The key observation is that transformer networks built with RMSNorm are invariant under orthogonal transformations of their intermediate activations: an orthogonal matrix can be applied to the signal between blocks and folded into the adjacent weight matrices without changing the model’s function. Networks that use LayerNorm can first be converted to RMSNorm by absorbing LayerNorm’s linear components into neighboring blocks. Principal Component Analysis (PCA) is then used to compute, at each layer, an orthogonal transformation that projects the signal onto its principal components; the minor components are sliced off, cutting down the network size without materially affecting performance. Experiments show that SliceGPT outperforms SparseGPT, offering considerable speedups across various models and tasks.
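The slicing step can be sketched in a few lines: compute a PCA basis from calibration activations, rotate the weights into that basis, and delete the least-significant directions. This is a hedged approximation of the procedure rather than the released SliceGPT implementation; the synthetic low-rank calibration data and variable names are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
d, d_sliced = 8, 6                          # keep 6 of 8 embedding dimensions

# Synthetic "calibration" activations with low-rank structure, standing in
# for activations collected from a real forward pass.
X = torch.randn(512, 3) @ torch.randn(3, d) + 0.01 * torch.randn(512, d)
W_next = torch.randn(d, d)                  # weight matrix that consumes X

# PCA basis: eigenvectors of the second-moment matrix, ordered so the
# leading columns carry most of the signal.
eigvals, eigvecs = torch.linalg.eigh(X.T @ X)
Q = eigvecs[:, eigvals.argsort(descending=True)]

# Slice off the minor components: keep only the leading principal directions.
Q_sliced = Q[:, :d_sliced]                  # d x d_sliced projection
X_small = X @ Q_sliced                      # activations in the reduced space
W_small = Q_sliced.T @ W_next               # smaller, still dense weight matrix

# The reduced matmul closely approximates the original one.
err = (X_small @ W_small - X @ W_next).norm() / (X @ W_next).norm()
print(f"relative error after slicing: {err:.4f}")
```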

SliceGPT allows LLMs to be pruned in a structured, principled way, reduces the cost of inference, and maintains better performance than SparseGPT. Complementary techniques such as quantization and further structured pruning can be combined with it to improve efficiency and functionality. Insights from SliceGPT can also contribute to further research on improving the efficiency of deep learning models.

SliceGPT efficiently compresses models such as LLAMA-2 70B, OPT 66B, and Phi-2, removing up to 25% of the model parameters (including embeddings) while maintaining high task performance. This lets the compressed models run on fewer GPUs and achieve faster inference with no additional code optimization. On consumer-grade and high-end GPUs, SliceGPT reduces inference compute to 64% and 66% of the dense model’s requirements, respectively. OPT models prove more compressible than LLAMA-2 models, and larger models show less accuracy degradation, making SliceGPT an effective tool for efficient compression of LLMs.

The original research paper and additional information are available on the project’s page and GitHub repository. The research was conducted by researchers from ETH Zurich and Microsoft Research.
