The rapid evolution of natural language processing (NLP) currently centers on adapting large language models (LLMs) to specific tasks. These models often contain billions of parameters, which makes customization a significant challenge. The central goal is to fine-tune them for particular downstream tasks at minimal computational cost, creating a need for creative approaches to parameter-efficient fine-tuning (PEFT).
The biggest bottleneck in this field is the resource-intensive nature of adapting LLMs to specific tasks. Conventional fine-tuning updates all model parameters, which risks overfitting and incurs high computational costs. Given the size of present-day LLMs, including sparse models that route tokens among several specialized experts, there is a pressing need for more effective fine-tuning techniques. The challenge lies in improving task performance while keeping computational loads manageable.
Existing PEFT methods for dense-architecture LLMs include low-rank adaptation (LoRA) and P-Tuning, which typically add a small set of new parameters or selectively update existing ones. However, these strategies were designed for dense models and do not fully exploit sparse-architecture LLMs, where different tasks activate different subsets of experts; applied naively, dense-oriented techniques underperform.
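To make the contrast concrete, the low-rank idea behind LoRA can be sketched in a few lines. This is a minimal NumPy illustration, not the library implementation: the pretrained weight `W` stays frozen, and only two small factors `A` and `B` (rank `r`, far smaller than the hidden size) are trained, so the number of trainable parameters drops from `d*d` to `2*r*d`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                              # hidden size and low rank (r << d)
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # B starts at zero, so the update starts at zero

def lora_forward(x):
    # Frozen output plus the trainable low-rank correction B @ A.
    return x @ W.T + x @ (B @ A).T

x = rng.standard_normal((1, d))
# With B = 0, the LoRA-augmented layer reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

Trainable parameters here number `2 * r * d = 32` instead of `d * d = 64`; at realistic model scales the ratio is far more dramatic.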
Researchers at DeepSeek AI and Northwestern University have introduced a technique tailored to sparse-architecture LLMs. The method, named Expert-Specialized Fine-Tuning (ESFT), fine-tunes only the experts most relevant to a given task, leaving the remaining experts and shared model components unchanged. By exploiting the mixture-of-experts (MoE) architecture, ESFT improves tuning efficiency and preserves expert specialization while updating only the necessary parameters.
In detail, ESFT computes affinity scores between experts and task-specific data and selects the most relevant subset. Only the selected experts are fine-tuned; the rest of the model remains frozen. This significantly reduces computational cost: compared to full-parameter fine-tuning, ESFT cuts storage requirements by up to 90% and training time by up to 30%, without compromising the model's performance.
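The selection step above can be sketched as follows. This is a simplified illustration under assumed details (the function name, the toy gate scores, and the cumulative-affinity threshold are hypothetical choices for this sketch, not the paper's exact procedure): average the router's gate scores for each expert over the task data, then keep the smallest set of experts whose cumulative affinity reaches a chosen threshold.

```python
import numpy as np

def select_experts_by_affinity(gate_scores: np.ndarray, threshold: float = 0.9):
    """Pick the smallest expert subset covering `threshold` of total affinity.

    gate_scores: (tokens, experts) router weights on a task-specific sample.
    Returns the sorted indices of experts that would be fine-tuned; all
    other experts and shared parameters would stay frozen.
    """
    affinity = gate_scores.mean(axis=0)                 # average gate weight per expert
    order = np.argsort(affinity, kind="stable")[::-1]   # experts by descending affinity
    cumulative = np.cumsum(affinity[order]) / affinity.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return sorted(order[:k].tolist())

# Toy example: 4 tokens routed over 6 experts; experts 0 and 2 dominate.
scores = np.array([
    [0.50, 0.05, 0.30, 0.05, 0.05, 0.05],
    [0.45, 0.05, 0.35, 0.05, 0.05, 0.05],
    [0.55, 0.05, 0.25, 0.05, 0.05, 0.05],
    [0.40, 0.10, 0.30, 0.10, 0.05, 0.05],
])
selected = select_experts_by_affinity(scores, threshold=0.7)  # → [0, 2]
```

Only the experts in `selected` would receive gradient updates during fine-tuning, which is where the storage and training-time savings reported above come from.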
Across multiple downstream tasks, ESFT matched, and often exceeded, the performance of full-parameter fine-tuning. It achieved substantial gains on specialized tasks such as math and code while retaining a high degree of expert specialization, and it preserved general-task performance better than other PEFT methods such as LoRA, making it a potent tool for LLM customization.
In summary, this research presents Expert-Specialized Fine-Tuning (ESFT) as an answer to the resource-intensive fine-tuning of large language models. By selectively tuning the pertinent experts, ESFT improves both performance and efficiency, exploiting the specialized structure of sparse-architecture LLMs to deliver strong results at lower computational cost. It reduces storage and training time while maintaining top performance across diverse tasks, and it holds promise for future work on customizing large language models.