Matrix multiplication (MatMul) is a fundamental operation in most neural network architectures. Dense layers depend on it for vector-matrix multiplication (VMM), while self-attention mechanisms depend on it for matrix-matrix multiplication (MMM). This heavy reliance on MatMul is largely a product of GPU optimization: the CUDA (Compute Unified Device Architecture) platform and libraries such as cuBLAS allow MatMul operations to be parallelized and accelerated, delivering large performance gains.
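As a rough illustration, the minimal PyTorch sketch below (with made-up shapes) shows where the two kinds of MatMul appear: a dense layer multiplies token activations by a weight matrix, while self-attention multiplies whole query and key matrices together.

```python
# Minimal PyTorch sketch of where MatMul appears; shapes are illustrative only.
import torch

d_model, d_ff, seq_len = 512, 2048, 128

# Dense layer: each token vector is multiplied by a weight matrix (VMM).
x = torch.randn(seq_len, d_model)   # token activations
W = torch.randn(d_model, d_ff)      # dense-layer weights
hidden = x @ W                      # dispatched to cuBLAS/CUDA kernels on a GPU

# Self-attention: whole query and key matrices are multiplied together (MMM).
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
scores = (Q @ K.T) / d_model ** 0.5  # (seq_len, seq_len) attention scores
```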
Large language models (LLMs) spend most of their compute on matrix multiplication, and that demand only grows as models increase in size and complexity. Interestingly, it turns out that even at billion-parameter scale, MatMul operations can be eliminated from LLMs without hurting performance.
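How can a network do without MatMul at all? The core idea, in line with the ternary-weight approach this line of work builds on, is to constrain weights to {-1, 0, +1}, so that every "multiplication" becomes an addition, a subtraction, or a skip. The sketch below is a conceptual toy under that assumption, not the authors' optimized implementation.

```python
# Illustrative sketch: with ternary weights in {-1, 0, +1}, a vector-matrix
# "multiplication" reduces to selective additions and subtractions.
import torch

def ternary_vmm(x: torch.Tensor, w_ternary: torch.Tensor) -> torch.Tensor:
    """x: (d_in,); w_ternary: (d_in, d_out) with entries in {-1, 0, +1}."""
    out = torch.zeros(w_ternary.shape[1])
    for j in range(w_ternary.shape[1]):
        col = w_ternary[:, j]
        # Add inputs where the weight is +1, subtract where it is -1, skip zeros.
        out[j] = x[col == 1].sum() - x[col == -1].sum()
    return out

x = torch.randn(8)
w = torch.randint(-1, 2, (8, 4)).float()
assert torch.allclose(ternary_vmm(x, w), x @ w, atol=1e-5)  # same result, no multiplies
```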
Recent research by a team of scientists from the University of California, Soochow University, and LuxiTech has found that MatMul-free models can achieve performance on par with state-of-the-art Transformers, which typically require far more memory for inference, at scales of at least 2.7 billion parameters. Their tests also show that the performance gap between MatMul-free models and conventional full-precision Transformers narrows as model size increases, suggesting that larger models do not have to depend on MatMul operations to remain effective and efficient.
To address the practicalities of implementation, the team developed a GPU-efficient version of the model that reduces memory usage by up to 61% during training compared to an unoptimized baseline, along with an optimized inference kernel that cuts memory consumption by a factor of ten relative to unoptimized models. This sharp drop in memory requirements makes the models more accessible and efficient for a wide range of applications.
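Much of this kind of saving typically comes from fusing steps into a single kernel so that intermediate tensors never have to be written back to GPU memory. The eager-mode PyTorch sketch below shows the sort of unfused pipeline such a kernel collapses into one pass; the RMS normalization, 8-bit activation format, and function name are assumptions for illustration, not the authors' exact implementation.

```python
# Eager-mode PyTorch sketch of an unfused quantized-linear pipeline. Each step
# materializes an intermediate tensor in GPU memory; a fused kernel performs
# the same steps in one pass and avoids those round trips. The normalization
# and 8-bit activation format are illustrative assumptions.
import torch

def unfused_quant_linear(x, w_ternary, eps=1e-6):
    # 1) RMS-normalize the activations (intermediate tensor #1).
    x_norm = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    # 2) Quantize activations to the 8-bit integer range (intermediate tensor #2).
    scale = 127.0 / x_norm.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    x_q = (x_norm * scale).round().clamp(-128, 127)
    # 3) Accumulate against ternary weights, then rescale (written as `@` here
    #    for brevity; with ternary weights it reduces to additions as above).
    return (x_q @ w_ternary) / scale

x = torch.randn(4, 16)
w = torch.randint(-1, 2, (16, 8)).float()
print(unfused_quant_linear(x, w).shape)  # torch.Size([4, 8])
```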
The researchers also built a custom hardware solution on a field-programmable gate array (FPGA) to take full advantage of the models' lightweight operations. The FPGA can run billion-parameter-scale models at 13 watts by exploiting lightweight operations that are currently beyond the reach of GPUs, bringing energy consumption closer to that of the human brain and making LLMs markedly more efficient.
Overall, the research demonstrates that LLMs can be made dramatically simpler without compromising their capabilities, and it suggests that future hardware accelerators should be designed around such lightweight models. The work paves the way for more efficient, scalable, and practical large language model implementations.
The researchers have made the paper and accompanying GitHub code publicly available.