Deep neural networks (DNNs) have found widespread success across various fields. Much of this success rests on first-order optimizers such as stochastic gradient descent with momentum (SGDM) and AdamW. However, these methods struggle to train large-scale models efficiently. As an alternative, second-order optimizers like K-FAC, Shampoo, AdaBK, and Sophia have demonstrated superior convergence properties, but often at significant computational and memory cost.
Attempts to reduce memory consumption have centered on factorization and quantization. Factorization uses low-rank approximations to represent optimizer states compactly, while quantization compresses those states with low-bit representations. Notably, although quantization has proven effective for first-order optimizers, adapting it to second-order optimizers is harder because their states are involved in matrix operations rather than simple elementwise updates.
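As a rough illustration of what quantizing an optimizer state involves, the sketch below applies block-wise 4-bit rounding to a tensor and reconstructs it. The block size, the uniform 16-level codebook, and the function names are illustrative assumptions, not the scheme used in the paper.

```python
import torch

def quantize_4bit(x, block_size=64):
    """Illustrative block-wise 4-bit quantization of an optimizer state.

    Each block is scaled by its max absolute value, then each entry is
    mapped to the nearest of 16 signed levels. The block size and the
    uniform codebook are illustrative choices, not the paper's scheme.
    """
    levels = torch.linspace(-1.0, 1.0, 16)             # 16 = 2^4 codebook entries
    flat = x.reshape(-1, block_size)
    scales = flat.abs().amax(dim=1, keepdim=True)       # one FP32 scale per block
    normed = flat / scales.clamp(min=1e-12)
    codes = (normed.unsqueeze(-1) - levels).abs().argmin(dim=-1)  # 4-bit indices
    return codes.to(torch.uint8), scales, levels

def dequantize_4bit(codes, scales, levels):
    """Reconstruct an approximate FP32 state from codes and per-block scales."""
    return levels[codes.long()] * scales

state = torch.randn(4, 64)
codes, scales, levels = quantize_4bit(state)
approx = dequantize_4bit(codes, scales, levels).reshape_as(state)
print((state - approx).abs().max())  # small reconstruction error
```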
A team of researchers from Beijing Normal University and Singapore Management University has introduced the first 4-bit second-order optimizer, using Shampoo as the running example, while maintaining performance comparable to its 32-bit counterpart. Rather than quantizing the preconditioner directly, they quantize its eigenvector matrix, preserving the small singular values that are vital for accurately computing the inverse fourth root and thereby avoiding performance degradation.
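To make the idea concrete, here is a minimal sketch of that strategy under stated assumptions: store the preconditioner as its eigendecomposition, quantize only the eigenvector matrix, and keep the eigenvalues in full precision so that the inverse fourth root P^{-1/4} = U diag(lam^{-1/4}) U^T stays accurate. The toy rounding scheme and helper names are assumptions for illustration, not the authors' implementation.

```python
import torch

def compress_preconditioner(P):
    """Store P = U diag(lam) U^T with U quantized and lam kept in FP32.

    The per-tensor uniform rounding below stands in for a real 4-bit codec;
    the key point is that the small eigenvalues, which dominate P^{-1/4},
    are never quantized.
    """
    lam, U = torch.linalg.eigh(P)                  # FP32 eigenvalues, eigenvectors
    scale = U.abs().max()
    q = torch.round(U / scale * 7).clamp(-8, 7)    # toy signed 4-bit grid
    return q.to(torch.int8), scale, lam

def inverse_fourth_root(q, scale, lam, eps=1e-6):
    """Reconstruct P^{-1/4} = U diag(lam^{-1/4}) U^T from the compressed state."""
    U_hat = q.float() * scale / 7
    return U_hat @ torch.diag((lam + eps).pow(-0.25)) @ U_hat.T

A = torch.randn(128, 256)
P = A @ A.T / 256 + 1e-3 * torch.eye(128)          # a Shampoo-style statistic
q, scale, lam = compress_preconditioner(P)
P_inv4 = inverse_fourth_root(q, scale, lam)
```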
To further improve accuracy, two techniques were proposed. First, Björck orthonormalization restores the orthogonality of the dequantized eigenvector matrix. Second, linear square quantization outperforms dynamic tree quantization for second-order optimizer states. Because only the eigenvector matrix U of the preconditioner is quantized, the singular value matrix needed for precise computation of matrix powers via the decomposition remains in full precision.
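The first technique can be sketched as follows: a dequantized eigenvector matrix is only approximately orthogonal, and a few Björck orthonormalization steps, U ← U(3I − UᵀU)/2, push it back toward orthogonality without an explicit factorization. The iteration count and the synthetic test matrix below are illustrative assumptions.

```python
import torch

def bjorck_orthonormalize(U, iters=2):
    """Björck orthonormalization: for U close to orthogonal, iterate
    U <- U @ (3I - U^T U) / 2, which drives U^T U toward the identity."""
    n = U.shape[1]
    I = torch.eye(n, dtype=U.dtype, device=U.device)
    for _ in range(iters):
        U = U @ (3 * I - U.T @ U) / 2
    return U

# Perturb an orthogonal matrix the way low-bit rounding would.
Q, _ = torch.linalg.qr(torch.randn(64, 64))
U_noisy = Q + 0.01 * torch.randn(64, 64)
U_fixed = bjorck_orthonormalize(U_noisy)
print(torch.linalg.norm(U_noisy.T @ U_noisy - torch.eye(64)))  # larger deviation
print(torch.linalg.norm(U_fixed.T @ U_fixed - torch.eye(64)))  # smaller deviation
```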
Empirical testing showed that 4-bit Shampoo outperformed first-order optimizers such as AdamW: the first-order methods required 1.2 to 1.5 times more epochs, and hence longer training times, yet still reached lower test accuracies than the second-order optimizers. At the same time, 4-bit Shampoo achieved test accuracies comparable to 32-bit Shampoo while providing memory savings of 4.5% to 41%. Importantly, its memory costs were only slightly higher than those of first-order optimizers, marking a considerable step toward practical adoption of second-order methods.
In conclusion, 4-bit Shampoo enables memory-efficient training of DNNs while matching the performance of its 32-bit counterpart. This represents a significant advance in second-order optimization, potentially enabling its wider use in training large-scale DNNs. Researchers continue to explore ways to improve the performance and efficiency of optimizers, aiming for more cost-effective and streamlined DNN training.