The rapid development of Large Language Models (LLMs) has produced billion- and trillion-parameter models with impressive performance across many fields. However, their sheer scale makes deployment difficult because of demanding hardware requirements. Since current research continues to scale models up for better performance, following established scaling laws, overcoming these hardware constraints becomes all the more important for wider adoption of these powerful LLMs.
Previous studies have addressed the difficulty of deploying such massive trained models through model compression. These techniques, including quantization and pruning, aim to reduce inference costs: quantization lowers the numerical precision of weights and activations, while pruning removes redundant parameters, often without retraining. Recent advances in pruning have shown great potential for simplifying LLM compression, underscoring the importance of efficient pruning methods designed specifically for these models.
Researchers from Baichuan Inc. and the Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences have introduced ShortGPT, an approach that analyzes layer-wise redundancy in LLMs using Block Influence (BI), a measure of how much each layer transforms its hidden states. They argue that their method significantly surpasses previous, more complex pruning techniques by identifying and removing redundant layers based on BI scores. They find that LLMs exhibit considerable layer redundancy, which suggests a simple yet effective pruning strategy. The method, which is complementary to quantization, reduces parameters and computation without compromising performance, enabling more efficient LLM deployment.
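The article does not spell out the BI formula, but a natural reading of "a measure of hidden state transformations" is one minus the average cosine similarity between a layer's input and output hidden states. The sketch below illustrates that assumed formulation; the function name and tensor shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Score one layer by how strongly it transforms its hidden states.

    hidden_in / hidden_out: (num_tokens, hidden_dim) hidden states entering and
    leaving the layer, collected on a calibration set. BI is sketched here as
    1 - mean cosine similarity: a layer whose output closely matches its input
    gets a low score and becomes a candidate for removal.
    """
    cosine = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # per-token similarity
    return float(1.0 - cosine.mean())
```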
The proposed layer-deletion approach quantifies layer redundancy, particularly in Transformer-based architectures, using a BI metric that captures the impact of each layer on hidden state transformations during inference. Layers with low BI scores, indicating minimal impact, are removed to lower inference costs without degrading model performance. The process involves building a calibration set, collecting hidden states, calculating BI scores, and then deleting the least essential layers according to their BI ranking.
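The end-to-end procedure described above might look roughly like the following minimal sketch. It assumes common Hugging Face conventions for LLaMA-style models (`model.model.layers` as a ModuleList of decoder blocks, `output_hidden_states=True` returning the embedding output plus every block's output); these are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prune_layers_by_bi(model, calibration_batches, num_layers_to_drop):
    """Rank decoder layers on a calibration set and delete the lowest-scoring ones.

    calibration_batches: a list of tokenized input dicts (hypothetical calibration set).
    """
    num_layers = len(model.model.layers)
    scores = torch.zeros(num_layers)

    for batch in calibration_batches:
        hs = model(**batch, output_hidden_states=True).hidden_states  # num_layers + 1 tensors
        for i in range(num_layers):
            x_in = hs[i].flatten(0, 1).float()       # tokens entering block i
            x_out = hs[i + 1].flatten(0, 1).float()  # tokens leaving block i
            scores[i] += (1.0 - F.cosine_similarity(x_in, x_out, dim=-1)).mean()
    scores /= len(calibration_batches)

    # Remove the layers whose hidden states change the least (lowest BI).
    drop = set(scores.argsort()[:num_layers_to_drop].tolist())
    model.model.layers = torch.nn.ModuleList(
        layer for i, layer in enumerate(model.model.layers) if i not in drop
    )
    return sorted(drop), scores
```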
The method's effectiveness was demonstrated through comparative experiments on benchmarks including MMLU, CMMLU, and CMNLI, against baseline techniques such as LLM-Pruner, SliceGPT, and LaCo. The results show that models pruned with the proposed approach consistently outperform the baselines across multiple natural language benchmarks. Moreover, removing layers proved more effective than shrinking embedding dimensions, suggesting that the redundancy in these models lies more in depth than in width.
In conclusion, the researchers have presented ShortGPT, a novel LLM pruning approach based on layer redundancy as measured by Block Influence. Their findings indicate significant layer-wise redundancy in LLMs, making it possible to remove minimally contributing layers without materially affecting performance. The method maintains up to 95% of model performance while reducing the parameter count and computational requirements by approximately 25%, surpassing previous pruning techniques. These results point to depth-wise redundancy in LLMs, and the approach remains compatible with other compression techniques, such as quantization, for further reductions in model size.