Parameter-efficient fine-tuning (PEFT) methods are essential in machine learning because they allow large models to adapt to new tasks without extensive computational resources. PEFT methods achieve this by fine-tuning only a small subset of parameters while leaving the majority of the model unchanged, making the adaptation process more efficient and widely accessible. This approach is particularly vital for deploying large foundation models, whose high computational costs and extensive parameter counts would otherwise limit their use.
Recent research has focused on the performance gap between low-rank adaptation methods such as LoRA (Low-Rank Adaptation) and full fine-tuning. Despite its efficiency, LoRA often falls short of the performance of fully fine-tuned models. This discrepancy limits the broader application of LoRA in domains where high performance is crucial. The objective, therefore, is to make LoRA as effective as full fine-tuning without compromising its parameter-efficiency advantages.
Existing PEFT methods include adapter tuning and prompt tuning. Adapter tuning inserts small trainable modules (adapters) into specific layers of a model, while prompt tuning adapts models by adding learnable tokens to the input without directly modifying the model's parameters. LoRA differentiates itself by re-parameterizing the weight update during fine-tuning as the product of two low-rank matrices, which sharply reduces the number of trainable parameters.
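To make this concrete, here is a minimal NumPy sketch of the LoRA re-parameterization. The layer size, rank, and scaling factor are illustrative assumptions rather than values from the paper; the point is that only the two low-rank factors A and B are trained while the pretrained weight stays frozen.

```python
import numpy as np

# Minimal LoRA sketch (illustrative; not the authors' code).
# The frozen pretrained weight W0 is adapted by adding a low-rank
# update s * B @ A, so only A and B are trainable.

d_out, d_in, r = 768, 768, 8      # hypothetical layer size and rank
alpha = 16                        # hypothetical LoRA scaling hyperparameter
s = alpha / r

W0 = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01        # trainable low-rank factor
B = np.zeros((d_out, r))                   # trainable, initialized to zero

def lora_forward(x):
    # Effective weight is W0 + s * B @ A; gradients flow only to A and B.
    return x @ (W0 + s * B @ A).T

x = np.random.randn(4, d_in)
print(lora_forward(x).shape)  # (4, 768)
```

With this parameterization, the layer trains r * (d_in + d_out) parameters instead of d_in * d_out, which is the source of LoRA's efficiency.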
In an effort to bridge the performance gap between LoRA and full fine-tuning, researchers from several Chinese institutions introduced a method called LoRA-Pro. It improves LoRA's optimization process by introducing the "Equivalent Gradient," a concept that lets the researchers quantify how LoRA's optimization trajectory differs from that of full fine-tuning and then minimize that difference. In effect, LoRA-Pro makes the fine-tuning process more closely resemble full fine-tuning.
LoRA-Pro defines the Equivalent Gradient as a virtual gradient on the original weight matrix under the low-rank parameterization, even though that matrix is not directly trainable. It is derived from the gradients of the low-rank matrices A and B used in LoRA. During optimization, LoRA-Pro minimizes the difference between the Equivalent Gradient and the gradient obtained from full fine-tuning. Choosing suitable gradients for A and B is formulated as an optimization problem with theoretical, closed-form solutions for updating the two matrices. These solutions keep the Equivalent Gradient close to the optimization dynamics of full fine-tuning, enhancing the overall effectiveness of LoRA.
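The following NumPy sketch illustrates this idea. It is a toy example under stated assumptions, not the paper's released implementation: the shapes and the least-squares choice of adjusted gradients are illustrative, and the paper derives its own closed-form optimal updates.

```python
import numpy as np

# Toy illustration of the Equivalent Gradient idea (assumed shapes and
# an illustrative least-squares adjustment; not the authors' code).

d_out, d_in, r, s = 64, 64, 4, 2.0
B = np.random.randn(d_out, r)
A = np.random.randn(r, d_in)
g_full = np.random.randn(d_out, d_in)   # stand-in for the full fine-tuning gradient dL/dW

# Chain rule on W = W0 + s * B @ A gives the standard LoRA gradients.
g_A = s * B.T @ g_full
g_B = s * g_full @ A.T

# Equivalent Gradient: the first-order update LoRA effectively applies to W
# when A and B each take a gradient step.
g_equiv = s * (B @ g_A + g_B @ A)

# LoRA-Pro instead picks adjusted gradients X (for A) and Y (for B) so that
# || s * (B @ X + Y @ A) - g_full ||_F is minimized. One simple least-squares
# choice uses pseudo-inverses; the paper gives principled closed-form updates.
X = np.linalg.pinv(B) @ g_full / s
Y = (g_full - s * B @ X) @ np.linalg.pinv(A) / s

print(np.linalg.norm(g_equiv - g_full))              # gap under vanilla LoRA
print(np.linalg.norm(s * (B @ X + Y @ A) - g_full))  # gap after adjustment
```

In this toy setting the second printed norm is much smaller than the first, because the adjusted updates project the full fine-tuning gradient onto the subspace reachable through A and B, which is the gap LoRA-Pro aims to close.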
LoRA-Pro’s effectiveness was confirmed through experiments on natural language processing tasks, using a subset of the GLUE datasets with the T5-base model. LoRA-Pro achieved the highest scores on three of the five datasets, and its average score was 6.72% higher than that of standard LoRA.
In summary, LoRA-Pro represents a notable advance in parameter-efficient fine-tuning. By addressing LoRA’s optimization shortcomings through the Equivalent Gradient, it narrows the performance gap between LoRA and full fine-tuning. Extensive experimental validation confirms that LoRA-Pro retains LoRA’s efficiency while achieving performance closer to full fine-tuning, making it a valuable tool for deploying large foundation models more efficiently.