Natural Language Processing (NLP) has been transformed by the advent of Transformer models. Transformers have driven significant progress in document generation and summarization, machine translation, and speech recognition. Their dominance is most visible in large language models (LLMs), which tackle increasingly complex tasks by scaling up the Transformer architecture. However, this growth also raises concerns about escalating computational demands, inference costs, and energy consumption, especially in resource-constrained settings such as mobile devices and robotics.
The need to make Transformer models more efficient has led to various strategies, such as model pruning and quantization, alongside efforts to design more efficient attention mechanisms. One notable direction is simplifying the attention mechanism, reducing its complexity from quadratic to linear in the sequence length, as sketched below. However, current linear attention methods require a vast amount of retraining, a process that is particularly difficult for models with billions of parameters due to the significant time and computational resources required.
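To make the quadratic-to-linear idea concrete, the sketch below contrasts standard softmax attention with a generic kernelized linear attention. The feature map phi here is a simple placeholder (a shifted ReLU), not DiJiang's mapping; the point is only that replacing softmax(QK^T) with phi(Q)phi(K)^T lets the matrix products be reassociated so the n x n score matrix is never formed.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the n x n score matrix makes this O(n^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def kernelized_linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linear attention: approximate softmax(QK^T) with phi(Q) phi(K)^T and
    reassociate the matmuls, dropping the cost to O(n * d^2)."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                                            # (d, d), no n x n matrix
    norm = Qf @ Kf.sum(axis=0)                               # (n,)
    return (Qf @ KV) / norm[:, None]

# Toy check that both paths produce outputs of the same shape
n, d = 256, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(softmax_attention(Q, K, V).shape, kernelized_linear_attention(Q, K, V).shape)
```

The quality of such an approximation hinges entirely on the choice of feature map, which is exactly where existing methods incur approximation error and costly retraining.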
To address this issue, researchers from Peking University and Huawei Noah's Ark Lab conducted a thorough review of existing linear attention methods and identified Monte Carlo sampling as the prime source of their approximation error. As a solution, they introduced DiJiang, a Frequency Domain Kernelization approach that is new to the NLP field. DiJiang employs weighted Quasi-Monte Carlo sampling, using the Discrete Cosine Transform (DCT) to map the Transformer's queries and keys into the frequency domain efficiently and accurately. This mapping also removes the softmax operation, simplifying the attention computation and making the adaptation of pretrained models far less costly to train.
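The snippet below is a minimal sketch of the frequency-domain idea, not the authors' implementation: queries and keys are projected with an orthonormal DCT, then each frequency is scaled by a weight standing in for the weighted Quasi-Monte Carlo coefficients. The exponential form of the feature map and the name freq_weights are assumptions for illustration; the output is combined with the softmax-free, reassociated attention computation.

```python
import numpy as np
from scipy.fft import dct

def dct_feature_map(x, freq_weights):
    """Hypothetical frequency-domain feature map: apply an orthonormal DCT-II
    along the feature dimension, then weight each frequency. The exact
    weighting and nonlinearity used by DiJiang may differ; this is a sketch."""
    z = dct(x, type=2, norm="ortho", axis=-1)                # to the frequency domain
    # Positive feature map (Performer-style form, assumed here) keeps attention weights non-negative
    return np.exp(freq_weights * z - 0.5 * np.sum(x * x, axis=-1, keepdims=True))

# Toy usage with softmax-free (linear) attention
n, d = 256, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
w = rng.standard_normal(d)                                   # placeholder for QMC-derived weights
Qf, Kf = dct_feature_map(Q, w), dct_feature_map(K, w)
out = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]     # no n x n matrix, no softmax
print(out.shape)                                             # (256, 64)
```

Because the DCT is a fixed, deterministic transform, the mapping avoids the variance of random Monte Carlo projections while remaining cheap to compute, which is what keeps the adaptation cost of an existing Transformer low.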
In comprehensive tests, DiJiang achieved performance comparable to conventional Transformers while delivering faster inference and reducing training costs by roughly a factor of ten. These gains point to broad applicability across natural language processing tasks and beyond.
For further information, read the complete paper and the GitHub repository. The research credit goes solely to the researchers of this project.
The original blog post’s title is ‘DiJiang: A Groundbreaking Frequency Domain Kernelization Method Designed to Address the Computational Inefficiencies Inherent in Traditional Transformer Models.’