Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning pretrained Large Language Models (LLMs) with human values, making them more useful and reliable. However, RLHF training is resource-intensive and complex, and its computational cost remains a significant obstacle to widespread adoption.
In response to this challenge, several complementary techniques have emerged, notably RLAIF and Parameter-Efficient Fine-Tuning (PEFT). In standard RLHF, a reward model is fitted on human-preferred outputs, and a policy is then trained against it with a reinforcement learning algorithm such as PPO. Because labeling examples for reward model training is costly, some approaches substitute AI feedback for human feedback (RLAIF). PEFT methods, meanwhile, reduce the number of trainable parameters in a pretrained model while preserving performance. A notable example is LoRA (Low-Rank Adaptation), which factorizes weight updates into trainable low-rank matrices, so that only a small percentage of the total parameters needs to be trained.
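To make the LoRA idea concrete, here is a minimal, hypothetical sketch of a LoRA-augmented linear layer in PyTorch (illustrative only, not the paper's implementation): the pretrained weight is frozen, and only two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-augmented linear layer.

    The pretrained weight W is frozen; only the low-rank factors
    A (rank x in_features) and B (out_features x rank) are trained,
    so the effective weight is W + (alpha / rank) * B @ A.
    """

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T
```

With a rank far smaller than the layer dimensions, the trainable parameters per layer drop from in_features x out_features to rank x (in_features + out_features).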
In a significant advance in this field, a team of Google researchers has introduced Parameter-Efficient Reinforcement Learning (PERL), which applies LoRA to both reward model training and the reinforcement learning loop. PERL preserves the performance of conventional RLHF while drastically reducing computational and memory usage: only the small LoRA adapters are trained, and the core model remains frozen.
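A hypothetical sketch of this adapter-only setup using the open-source Hugging Face peft library is shown below. The paper itself uses Google's PaLM 2 models and internal infrastructure, so the checkpoint name, rank, and target modules here are illustrative assumptions.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Hypothetical base checkpoint standing in for the paper's PaLM 2 models.
BASE_MODEL = "gpt2"

# A reward model head on top of a frozen backbone, with LoRA adapters
# injected into the attention projections.
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                        # adapter rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection name for GPT-2
)
model = get_peft_model(model, lora_config)

# Only the LoRA adapters (plus the new scoring head) remain trainable.
model.print_trainable_parameters()
```

The same pattern applies to the policy: attach adapters, freeze the backbone, and run PPO against the reward model while updating only the adapter weights.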
PERL demonstrates that LoRA-based RLHF training generalises across a wide range of datasets. These include text summarization from Reddit TL;DR and BOLT English SMS/Chat, harmless-response preference modeling, helpfulness preferences derived from the Stanford Human Preferences Dataset, and UI Automation tasks built from human demonstrations. Crowdsourced Taskmaster datasets are also used, focusing on refining model responses in task-oriented conversational scenarios.
The results show that PERL matches standard RLHF outcomes while reducing memory usage by about 50% and accelerating reward model training by up to 90%. Models trained with LoRA match the accuracy of their fully trained counterparts with roughly half the peak HBM usage and around 40% faster training. Because it lowers computational demands without sacrificing performance, PERL also points to promising directions such as Mixture-of-LoRA ensembles for robust cross-domain generalization and weight-averaged adapters to reduce the risk of reward hacking.
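As a rough illustration of the weight-averaged adapter idea (an assumption-laden sketch, not the paper's procedure), several independently trained LoRA adapters could be combined by averaging their parameters, since each adapter is just a small set of low-rank matrices:

```python
import torch

def average_adapters(adapter_state_dicts):
    """Hypothetical sketch: uniformly average several trained LoRA adapters.

    Each input is a state dict containing only LoRA parameters (e.g. the
    lora_a / lora_b matrices of independently trained reward models); the
    output is a single averaged adapter that can be loaded in their place.
    """
    averaged = {}
    for name in adapter_state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in adapter_state_dicts])
        averaged[name] = stacked.mean(dim=0)
    return averaged
```

Because the adapters are tiny compared with the base model, ensembling or averaging them is far cheaper than maintaining multiple fully fine-tuned copies.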
To summarize, Google's PERL is a significant stride forward in aligning AI with human values and preferences while achieving remarkable computational efficiency with LLMs. By overcoming the computational challenges associated with RLHF, PERL sets a benchmark for future research in AI alignment and shows how resource-efficient techniques can make powerful models more accessible, effective, and attuned to human values.