Researchers from several institutions recently uncovered a distinctive linear property of transformer decoders in language models such as GPT, LLaMA, OPT, and BLOOM: the embedding transformation between sequential layers is almost perfectly linear. Because these near-linear blocks can be removed or approximated without a significant drop in model performance, the finding could pave the way for new depth-pruning algorithms and distillation techniques.
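In rough terms, this claim can be checked by fitting a least-squares linear map from one layer's embeddings to the next and measuring how much variance it explains. The sketch below shows one such proxy in PyTorch; the function name `linearity_score` and the R²-style metric are illustrative choices, not the paper's exact formulation.

```python
import torch

def linearity_score(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Rough proxy for how linear the transition between two layers is.

    X, Y: (num_tokens, hidden_dim) embeddings taken from consecutive layers.
    Fits Y ~= X @ W by least squares and returns the R^2 of that fit:
    a value near 1.0 means a single linear map explains the transition.
    """
    # Center and scale so the score is not dominated by embedding norms.
    X = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-6)
    Y = (Y - Y.mean(dim=0)) / (Y.std(dim=0) + 1e-6)

    # Closed-form least-squares solution for the linear map W.
    W = torch.linalg.lstsq(X, Y).solution
    residual = Y - X @ W

    # Fraction of variance in Y explained by the linear map.
    return 1.0 - (residual.pow(2).sum() / Y.pow(2).sum()).item()
```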
Sparsity and model pruning are long-standing areas of machine-learning research, explored earlier in convolutional networks through pruning and fine-tuning methods. Work on sparse fine-tuning has produced techniques such as SquareHead distillation and WANDA, and a growing understanding of transformer structure has shed light on the degree of linearity in their internal computations.
The research team found that transformer decoders consistently display high linearity scores, indicating strongly linear embedding transformations between layers. The degree of linearity, however, shifts over the course of training: pretraining tends to decrease it, while fine-tuning on specific tasks increases it. This pattern held across a variety of tasks, suggesting that task-specific fine-tuning reinforces the linear characteristics of transformer models.
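To probe such scores on an actual checkpoint, one can expose the hidden states of a pretrained decoder and apply the helper from the sketch above to every pair of consecutive layers. The toy snippet below assumes the HuggingFace transformers library and uses gpt2 purely as an example; a meaningful estimate would aggregate embeddings over far more tokens than the model's hidden size.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # example decoder checkpoint; any causal LM works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).hidden_states  # tuple: input embeddings + one entry per layer

# Score each consecutive pair of layers with the linearity_score helper from
# the sketch above. NOTE: with only a handful of tokens the fit is trivially
# near-perfect; aggregate embeddings over many more tokens than the hidden
# size (768 for gpt2) to get a meaningful estimate.
for k in range(len(hidden) - 1):
    X = hidden[k].squeeze(0)      # (seq_len, hidden_dim)
    Y = hidden[k + 1].squeeze(0)
    print(f"layer {k} -> {k + 1}: linearity ~ {linearity_score(X, Y):.3f}")
```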
In experiments with Mistral-architecture models pretrained on specially selected datasets, the researchers observed improvements from a cosine-similarity-based regularization applied to the embeddings of sequential layers during pretraining, which shaped how closely those embeddings align and led to better model performance. They also developed a depth-pruning technique that sequentially removes the most linear layers, replaces them with linear approximations, and adds a distillation loss to limit the resulting performance degradation.
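The exact regularizer used in the paper is not reproduced here, but the general idea of attaching a cosine-similarity term over consecutive-layer embeddings to the pretraining loss can be sketched as follows. The name `layerwise_cosine_penalty`, the coefficient, and the commented training step are illustrative assumptions; the sign of the coefficient determines whether consecutive-layer embeddings are pushed together or apart.

```python
import torch
import torch.nn.functional as F

def layerwise_cosine_penalty(hidden_states, coeff: float = 0.1) -> torch.Tensor:
    """Cosine-similarity term over consecutive-layer embeddings.

    hidden_states: tuple of (batch, seq_len, hidden_dim) tensors, one per layer
    (e.g. the hidden_states output of a HuggingFace decoder).
    With coeff > 0 the term penalizes high cosine similarity between
    consecutive layers; flipping the sign encourages it instead.
    """
    penalty = hidden_states[0].new_zeros(())
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_prev, h_next, dim=-1)  # (batch, seq_len)
        penalty = penalty + cos.mean()
    return coeff * penalty / (len(hidden_states) - 1)

# Hypothetical pretraining step combining the term with the usual LM loss:
# outputs = model(**batch, labels=batch["input_ids"], output_hidden_states=True)
# loss = outputs.loss + layerwise_cosine_penalty(outputs.hidden_states)
# loss.backward()
```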
The study offers a comprehensive view of the linearity of transformer decoders, showing that their layer-to-layer behavior is fundamentally near-linear across diverse models. Counterintuitively, the researchers found that pretraining tends to increase nonlinearity while fine-tuning reduces it. Their results suggest that new pruning and distillation techniques can streamline transformer models without hurting performance, and that the cosine-based regularization applied during pretraining improves both efficiency and quality.
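As an illustration of the layer-replacement idea described above, the sketch below stands in a single linear module for a removed decoder block and fits it to mimic that block's outputs with a simple MSE loss. `LinearBlockApprox`, `distill_replacement`, and the cached `layer_inputs` are hypothetical names, and the paper's actual pruning and distillation procedure is more involved.

```python
import torch
import torch.nn as nn

class LinearBlockApprox(nn.Module):
    """A single linear map used as a drop-in stand-in for a removed decoder block."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden_states, *args, **kwargs):
        # Decoder blocks usually return a tuple whose first element is the new
        # hidden state; keep that convention so surrounding code still works.
        return (self.proj(hidden_states),)

def distill_replacement(block, approx, layer_inputs, lr=1e-3, steps=100):
    """Fit the linear stand-in to reproduce the original block's outputs
    with a simple MSE 'distillation' loss on cached layer inputs."""
    opt = torch.optim.Adam(approx.parameters(), lr=lr)
    for _ in range(steps):
        for x in layer_inputs:              # x: (batch, seq_len, hidden_dim)
            with torch.no_grad():
                target = block(x)[0]        # assumes the block accepts hidden
                                            # states alone and returns a tuple
            loss = nn.functional.mse_loss(approx(x)[0], target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return approx
```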
Nonetheless, the study’s focus on transformer decoders is a limitation: it does not cover encoder-only or encoder-decoder architectures. Further research is needed to determine whether the proposed techniques scale to other models and domains.