Reinforcement Learning (RL) finetuning is an integral part of training language models (LMs) to behave in a particular manner. In the current landscape, however, RL finetuning must cater to many aims at once because human preferences are diverse. Multi-objective finetuning (MOFT) has therefore come to the forefront as a superior way to train an LM, superseding the narrower approach of single-objective finetuning (SOFT).
Two main methodologies are currently used for MOFT: prompt-based and parameter-based conditioning. In the prompt-based approach, the desired reward weightings are encoded in a customized prompt, so a single model can be steered toward different reward trade-offs at inference time. In the parameter-based approach, models finetuned for individual rewards are combined through parameter averaging, and multi-task learning is leveraged to train a model for any given reward weighting.
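To make the distinction concrete, here is a minimal sketch of the two conditioning styles. The prompt tags, parameter names, and toy weight matrices below are illustrative assumptions, not artifacts of the paper: the weighting is either written into the prompt, or used to combine per-reward parameter sets.

```python
import numpy as np

# Hypothetical reward weighting over two objectives, e.g. helpfulness vs. brevity.
w = np.array([0.7, 0.3])

# --- Prompt-based conditioning (sketch) ---
# The desired weighting is written directly into the prompt so one model
# can be steered toward different trade-offs at inference time.
def build_conditioned_prompt(user_prompt: str, weights: np.ndarray) -> str:
    tags = " ".join(f"<reward_{i}={wi:.2f}>" for i, wi in enumerate(weights))
    return f"{tags} {user_prompt}"

print(build_conditioned_prompt("Summarize this article.", w))

# --- Parameter-based conditioning (sketch) ---
# Parameters of models finetuned for each reward are averaged with the
# chosen weighting (Rewarded Soups-style parameter averaging).
def interpolate_parameters(param_sets, weights):
    # param_sets: one dict of parameter arrays per single-reward model.
    return {
        name: sum(wi * params[name] for wi, params in zip(weights, param_sets))
        for name in param_sets[0]
    }

theta_helpful = {"layer.weight": np.ones((2, 2))}   # toy stand-ins for real checkpoints
theta_brief = {"layer.weight": np.zeros((2, 2))}
print(interpolate_parameters([theta_helpful, theta_brief], w))
```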
A team of researchers from Google has put forward a new MOFT framework called Conditional Language Policy (CLP), which blends parameter-space conditioning with multi-task training. CLP's key selling point is that it is easier to steer and produces better-quality responses than zero-shot methods such as Rewarded Soups (RS).
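One way to picture parameter-space conditioning in this blend is a layer whose effective parameters are a weighting-dependent combination of per-reward parameter sets that are trained jointly, rather than averaged after the fact. The toy class below is a sketch under that assumption; its shapes and names are not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalPolicyLayer(nn.Module):
    """Toy layer whose effective weights interpolate per-reward parameter sets,
    selected by the reward weighting w (parameter-space conditioning sketch)."""
    def __init__(self, dim: int, num_rewards: int = 2):
        super().__init__()
        # One parameter set per reward objective (hypothetical adapters).
        self.experts = nn.ParameterList(
            nn.Parameter(torch.randn(dim, dim) * 0.01) for _ in range(num_rewards)
        )

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # Unlike post-hoc parameter averaging, all per-reward parameter sets are
        # trained jointly across sampled weightings.
        weight = sum(wi * e for wi, e in zip(w, self.experts))
        return x @ weight

layer = ConditionalPolicyLayer(dim=8)
x = torch.randn(4, 8)
w = torch.tensor([0.6, 0.4])
print(layer(x, w).shape)  # torch.Size([4, 8])
```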
CLP's effectiveness has been demonstrated through a series of experiments conducted by the Google team. The results show that CLP consistently outperforms competing approaches across a range of conditions, including different reward choices and model sizes.
CLP's learning algorithm samples a range of reward weightings during training and optimizes the policy for all of them at once, improving the Pareto front of attainable reward trade-offs. This broad-based training caters to different weightings while maximizing the MOFT objective. An automatic evaluation with Gemini 1.0 Ultra showed that CLP adapts swiftly and generates superior responses compared to existing baselines.
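As a rough illustration of weighting-sampled training, and not the paper's actual algorithm, the toy REINFORCE loop below samples a fresh weighting each step, linearly scalarizes two made-up rewards, and updates a single weighting-conditioned policy so that, after training, it can be steered by the weighting alone.

```python
import torch
import torch.nn as nn

# Toy setup: a categorical "policy" over a few actions and two hypothetical
# per-action rewards standing in for real reward models.
torch.manual_seed(0)
num_actions = 4
r1 = torch.tensor([1.0, 0.2, 0.0, 0.5])   # e.g. "helpfulness" reward per action
r2 = torch.tensor([0.0, 0.9, 1.0, 0.4])   # e.g. "brevity" reward per action

# The policy is conditioned on the weighting through its input.
policy = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, num_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(2000):
    # Sample a reward weighting for this update (uniform over the 2-simplex).
    alpha = torch.rand(1)
    w = torch.cat([alpha, 1 - alpha])

    logits = policy(w)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    scalarized = w[0] * r1[action] + w[1] * r2[action]   # linear scalarization

    # REINFORCE-style update on the scalarized reward.
    loss = -dist.log_prob(action) * scalarized
    opt.zero_grad()
    loss.backward()
    opt.step()

# One set of parameters now serves every weighting: steer it at "inference" time.
for alpha in (0.0, 0.5, 1.0):
    w = torch.tensor([alpha, 1 - alpha])
    probs = torch.softmax(policy(w), dim=-1)
    print(f"alpha={alpha:.1f}", [round(p, 2) for p in probs.tolist()])
```

The design point the sketch captures is that the weighting distribution is sampled during training, so a single conditioned policy covers the whole Pareto front instead of one model per trade-off.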
In conclusion, the Google team has made a significant contribution to the field of language model finetuning with the introduction of CLP. Future research in this area is expected to explore other conditioning mechanisms such as soft tokens, the automation of weight-sampling distributions, and ways to handle non-linear reward scalarizations.