
Exploring LLM Surgeon: A Machine Learning Framework for Unstructured, Semi-Structured, and Structured Pruning of LLMs

Recent advances in Artificial Intelligence have enabled the development of Large Language Models (LLMs) with parameter counts reaching into the billions (for example, LLaMA-2 comes in 7B, 13B, and even 70B sizes). These models are incredibly powerful and achieve strong performance across diverse tasks, making them valuable tools for a wide range of AI applications. The downside is that deploying such models is expensive, and devices like phones do not have enough memory to host them.

To overcome this issue, a team of researchers from Imperial College London, Qualcomm AI Research, QUVA Lab, and the University of Amsterdam has introduced LLM Surgeon, a framework for unstructured, semi-structured, and structured LLM pruning that prunes the model in multiple steps (shots), updating the weights and curvature estimates between each step. In their experiments, the team pruned LLMs by up to 30% without significant performance degradation, demonstrating the effectiveness of the framework.
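To make the multi-shot idea concrete, here is a minimal PyTorch sketch of such an outer loop for a single linear layer: the sparsity target is interpolated across shots, and the curvature estimate is recomputed between shots. This is our illustration, not the authors' code; it substitutes a simple diagonal Fisher estimate and an MSE calibration objective for the paper's Kronecker-factored curvature, and the helper names (`estimate_diag_fisher`, `multi_shot_prune`) are hypothetical.

```python
import torch
import torch.nn as nn

def estimate_diag_fisher(layer: nn.Linear, calib_batches):
    """Crude diagonal-Fisher curvature estimate from squared gradients
    on (input, target) calibration batches with an MSE objective."""
    fisher = torch.zeros_like(layer.weight)
    for x, y in calib_batches:
        layer.zero_grad()
        nn.functional.mse_loss(layer(x), y).backward()
        fisher += layer.weight.grad.pow(2)
    return fisher / max(len(calib_batches), 1)

def multi_shot_prune(layer: nn.Linear, calib_batches,
                     target_sparsity=0.3, shots=5):
    """Interpolate the sparsity target across several shots,
    re-estimating curvature and re-scoring weights between shots."""
    for shot in range(1, shots + 1):
        sparsity = target_sparsity * shot / shots
        fisher = estimate_diag_fisher(layer, calib_batches)
        # Removal cost under a diagonal curvature: w^2 * F (up to a 1/2).
        cost = layer.weight.detach().pow(2) * fisher
        k = int(sparsity * cost.numel())
        if k == 0:
            continue
        threshold = cost.flatten().kthvalue(k).values
        with torch.no_grad():
            layer.weight[cost <= threshold] = 0.0
    return layer
```

With `shots=1` this collapses to one-shot pruning; the multi-shot schedule exists precisely because re-estimating curvature after each partial prune lets later shots account for the weights already removed.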

The framework uses weight magnitudes and activations from forward passes, together with gradient information from backward passes, to relate weight removal costs to the true final training objective. It improves on prior work in weight pruning by using more accurate approximations of the loss curvature and by exploiting more of the correlations between weights when updating the ones that remain. Moreover, LLM Surgeon uses a Kronecker-factored (KFAC) curvature approximation both to dynamically allocate which structures to remove and to update the remaining weights to compensate for the removal. This allows the framework to prune multiple weights at once to reach the target model size while inflicting the least possible cost to the loss.
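These removal costs and correlated updates build on the classic Optimal Brain Surgeon identities: removing a weight costs roughly its squared value divided by the corresponding diagonal entry of the inverse curvature, and the surviving weights receive a correlated correction from the same inverse-curvature column. Below is a toy, dense-matrix illustration of one such removal step; the function name and the identity-curvature example are ours, and a real implementation would work with a Kronecker-factored curvature rather than an explicit inverse.

```python
import torch

# Optimal Brain Surgeon quantities that curvature-based pruning builds on:
# removing weight q costs  L_q = w_q^2 / (2 * [H^-1]_qq), and the rest of
# the weights are corrected by  dw = -w_q * H^-1 e_q / [H^-1]_qq.
def obs_prune_one(w: torch.Tensor, H_inv: torch.Tensor):
    """Remove the single cheapest weight from w given an inverse
    curvature estimate H_inv, compensating the surviving weights."""
    diag = torch.diagonal(H_inv)
    costs = w.pow(2) / (2 * diag)           # removal cost per weight
    q = int(torch.argmin(costs))            # cheapest weight to remove
    dw = -w[q] * H_inv[:, q] / H_inv[q, q]  # correlated update of the rest
    w = w + dw
    w[q] = 0.0                              # the weight is now exactly zero
    return w, q

# With an identity curvature the update reduces to magnitude pruning:
w = torch.tensor([0.5, -0.1, 2.0, 0.3])
w_pruned, removed = obs_prune_one(w, torch.eye(4))
print(removed, w_pruned)  # removes index 1, others unchanged
```

The correlated update term is what separates this family of methods from plain magnitude pruning: whenever the curvature is not diagonal, deleting one weight shifts the optimal values of its neighbors, and the correction absorbs part of the damage.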

The team evaluated LLM Surgeon on language modeling with OPT and LLaMA-2 models, using data from the wikitext-2 dataset. They showed that pruning performance improves with more shots and that model size can be reduced by up to 30% without significant loss. They also achieved state-of-the-art results in unstructured and semi-structured pruning of LLMs, enabling an easier deployment process.
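Language modeling quality on wikitext-2 is typically reported as test-set perplexity. The snippet below is a generic sketch of that protocol, not the authors' evaluation code; it assumes a Hugging-Face-style causal LM whose output exposes `.logits`, and the window length of 2048 is an illustrative choice.

```python
import math
import torch

def eval_perplexity(model, token_ids: torch.Tensor, seq_len: int = 2048) -> float:
    """Slide a fixed-length window over the tokenized test split and
    average the per-token negative log-likelihood; exp() of the mean
    NLL is the reported perplexity."""
    nll, n_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for start in range(0, token_ids.numel() - 1, seq_len):
            chunk = token_ids[start : start + seq_len + 1]
            if chunk.numel() < 2:
                break
            inputs, targets = chunk[:-1], chunk[1:]
            logits = model(inputs.unsqueeze(0)).logits[0]
            nll += torch.nn.functional.cross_entropy(
                logits, targets, reduction="sum"
            ).item()
            n_tokens += targets.numel()
    return math.exp(nll / n_tokens)
```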

In summary, the LLM Surgeon framework addresses the deployment problem posed by LLMs with very large parameter counts. It can prune rows and columns from a range of LLMs by 20-30% without significant loss in performance and achieves the best performance for each target size. This research paves the way for an easier deployment process for LLMs, making them more accessible. Check out the paper for the full details.
