Large Language Models (LLMs) have become pivotal in natural language processing (NLP), excelling at tasks such as text generation, translation, sentiment analysis, and question answering. The ability to fine-tune these models for various applications is key to their versatility: practitioners can build on an LLM's pre-trained knowledge with far less labeled data and compute than training from scratch would require. The challenge, however, is the complexity and resource intensity of tuning different parameters for different models and tailored tasks.
A team of researchers from Beihang University and Peking University has developed a systematic framework named LLAMAFACTORY to aid this process and democratize the fine-tuning of LLMs. The system combines a range of efficient fine-tuning methods into a scalable module, enabling the fine-tuning of hundreds of LLMs with minimal resources while maintaining high throughput. It also streamlines commonly used training approaches, including generative pre-training, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). Users can customize and fine-tune their LLMs through a command-line or web interface, reducing or eliminating the need to write code.
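To make the idea of "efficient fine-tuning" concrete, here is a minimal sketch of a LoRA-style update in plain NumPy. This is an illustration of the general technique that frameworks like LLAMAFACTORY wrap, not the project's actual code; the dimensions, rank, and scaling are assumed example values.

```python
import numpy as np

# Minimal LoRA-style sketch (illustrative; not LLAMAFACTORY's implementation).
# The frozen base weight W stays fixed; only the low-rank factors A and B are
# trained, shrinking trainable parameters from d_out*d_in to r*(d_in + d_out).

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 16   # assumed example dimensions and rank

W = rng.standard_normal((d_out, d_in))    # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    """y = W x + (alpha / r) * B A x: base output plus low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} (LoRA) vs {full_params} (full fine-tune)")
```

Because B starts at zero, the adapted model initially behaves exactly like the base model; training then moves only the small A and B matrices, which is why adapter methods cut optimizer memory so sharply.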
LLAMAFACTORY incorporates three main modules: Model Loader, Data Worker, and Trainer. These are used alongside LLAMABOARD, a user-friendly visual interface that lets users configure and monitor fine-tuning runs without writing code. The Model Loader has four components (Model Initialization, Model Patching, Model Quantization, and Adapter Attaching) that prepare different architectures for fine-tuning, supporting over 100 LLMs. The Data Worker processes data for different tasks, supporting over 50 datasets. Finally, the Trainer module adapts the efficient fine-tuning methods to various tasks and datasets, offering four different training approaches.
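The three-module decomposition can be sketched as plain interfaces wired together by a driver. The class and method names below are hypothetical, chosen only to mirror the pipeline the paper describes (load and patch the model, prepare the data, then hand both to a trainer); LLAMAFACTORY's real APIs differ.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical interfaces mirroring the Model Loader / Data Worker / Trainer
# split described above. Each module depends only on plain values, which is
# the point of the modular design: models, datasets, and training methods
# can be swapped independently.

@dataclass
class ModelLoader:
    name: str
    def load(self) -> str:
        # initialize -> patch -> quantize -> attach adapters, per the paper
        return f"<model:{self.name}+adapter>"

@dataclass
class DataWorker:
    dataset: str
    def prepare(self) -> List[str]:
        # align heterogeneous datasets into one common training format
        return [f"<example from {self.dataset}>"]

@dataclass
class Trainer:
    approach: str  # e.g. "sft", "rlhf", "dpo", "pretrain"
    def run(self, model: str, batch: List[str]) -> str:
        return f"trained {model} with {self.approach} on {len(batch)} examples"

model = ModelLoader("llama-7b").load()
batch = DataWorker("alpaca").prepare()
result = Trainer("sft").run(model, batch)
print(result)
```

Swapping the model name, dataset, or training approach changes only one constructor argument, which is how a single framework can cover hundreds of model/dataset/method combinations.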
The researchers compared the training efficiency of QLoRA, LoRA, GaLore, and other fine-tuning methods. QLoRA showed the lowest memory footprint during training, LoRA achieved higher throughput through optimized LoRA layers, and GaLore delivered lower perplexity (PPL) on larger models. In downstream task evaluations, LoRA and QLoRA generally performed best, with a few exceptions: the Mistral-7B model performed better on English datasets, while the Qwen1.5-7B model achieved higher scores on Chinese datasets.
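A back-of-the-envelope calculation shows why QLoRA's memory footprint comes out lowest. The figures below are rough illustrative assumptions (mixed-precision Adam bookkeeping, an assumed adapter size), not the paper's measured numbers.

```python
# Rough memory estimate for fine-tuning a 7B-parameter model (illustrative
# assumptions, not measured results). Bytes per parameter, mixed precision:
# fp16 weight = 2, fp32 gradient = 4, Adam moments m and v = 4 + 4.

GB = 1024**3
n_params = 7e9       # base model size
n_lora = 20e6        # assumed adapter parameter count (depends on rank/targets)

def gib(n_bytes: float) -> float:
    return round(n_bytes / GB, 1)

per_trained = 2 + 4 + 4 + 4  # weight + grad + Adam m + Adam v

full = n_params * per_trained                      # everything is trained
lora = n_params * 2 + n_lora * per_trained         # frozen fp16 base + adapters
qlora = n_params * 0.5 + n_lora * per_trained      # 4-bit base (~0.5 byte/param)

print(f"full fine-tune ~ {gib(full)} GiB")
print(f"LoRA           ~ {gib(lora)} GiB")
print(f"QLoRA          ~ {gib(qlora)} GiB")
```

Freezing the base model removes its gradients and optimizer states, and quantizing the frozen copy to 4 bits shrinks it a further 4x, which matches the ordering the researchers observed.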
In summary, the proposed LLAMAFACTORY provides a unified framework for fine-tuning LLMs. It enables custom fine-tuning and evaluation without writing code, simplifying the complex task of adapting LLMs to various downstream functions. The tool supports over 100 LLMs and at least 50 datasets, using a modular design that minimizes dependencies among models, datasets, and training methods. This research marks another significant stride toward making advanced NLP accessible to users across the globe.