The paper “Training Diffusion Models with Reinforcement Learning” presents a technique for training diffusion models, a class of models known for generating high-dimensional outputs, with reinforcement learning (RL). The paper’s key idea is to optimize diffusion models directly for particular downstream objectives rather than broadly matching the training data distribution. A notable application is improving the model’s performance on unusual prompts using feedback from a large vision-language model, without any human intervention.
The authors also introduce a novel algorithm named Denoising Diffusion Policy Optimization (DDPO). This procedure maximizes the reward of the final sample by taking into account the entire sequence of denoising steps that led to it, instead of relying on an approximate likelihood of the final sample. It frames denoising as a multi-step Markov decision process (MDP), in which the exact likelihood of each denoising step is available, as sketched below.
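Concretely, this framing can be sketched as follows (notation follows standard DDPM indexing, with x_T the initial noise and x_0 the final image; this is an illustrative summary, not a verbatim reproduction of the paper’s equations). Each denoising step is one MDP transition, the reward arrives only at the final sample, and the score-function policy gradient uses only the exact per-step likelihoods:

```latex
% Denoising as a multi-step MDP (sketch): state, action, and policy at step t,
% conditioned on the prompt c
s_t = (c,\, t,\, x_t), \qquad a_t = x_{t-1}, \qquad
\pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t, c)

% Reward is given only when the final denoised sample x_0 is produced
R(s_t, a_t) =
\begin{cases}
  r(x_0, c) & \text{if } t = 1 \text{ (final denoising step)},\\
  0         & \text{otherwise.}
\end{cases}

% Score-function (REINFORCE-style) policy gradient over the whole trajectory
\nabla_\theta \mathcal{J} =
\mathbb{E}\!\left[\, r(x_0, c) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c) \right]
```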
Furthermore, the researchers use policy gradient algorithms to implement DDPO because of their simple implementation and prior success in language-model fine-tuning. They present two variants of DDPO: DDPO_SF, which uses a score-function (REINFORCE-style) gradient estimator, and DDPO_IS, which uses importance sampling. The latter is their better-performing algorithm, and its implementation mirrors that of proximal policy optimization (PPO).
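A minimal PyTorch sketch of what a DDPO_IS-style update could look like is shown below. The function and argument names (`ddpo_is_loss`, `old_logprobs`, and so on) and the clip range are illustrative assumptions, not the authors’ implementation; the key idea is the PPO-like clipped importance weight applied per denoising step:

```python
import torch

def ddpo_is_loss(new_logprobs, old_logprobs, advantages, clip_range=1e-4):
    """Clipped importance-sampling loss over denoising steps (illustrative sketch).

    new_logprobs: log p_theta(x_{t-1} | x_t, c) under the current model, shape (batch, T)
    old_logprobs: the same log-likelihoods under the model that generated the samples
    advantages:   per-trajectory rewards (e.g. normalized), shape (batch, 1), broadcast to each step
    clip_range:   illustrative trust-region size, not the paper's exact hyperparameter
    """
    # Per-step importance weights between the current and data-collecting model
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # PPO-style pessimistic bound: take the elementwise minimum, then maximize it
    # (return the negative so a standard optimizer can minimize)
    return -torch.mean(torch.min(unclipped, clipped))
```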
The authors demonstrate the performance of DDPO using Stable Diffusion on four tasks defined by different reward functions: compressibility, incompressibility, aesthetic quality, and prompt-image alignment. Interestingly, when optimized for alignment on prompts describing human-like activities, the model gradually adopts a more cartoon-like style, illustrating how RL fine-tuning shifts the diffusion model’s aesthetic tendencies.
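For intuition, a compressibility reward can be as simple as the negative file size of the image after JPEG compression (incompressibility flips the sign). The sketch below is illustrative; the exact quality setting and reward scaling used in the paper may differ:

```python
import io
from PIL import Image

def jpeg_compressibility_reward(image: Image.Image, quality: int = 95) -> float:
    """Reward an image by how small it is after JPEG compression (higher = more compressible)."""
    buffer = io.BytesIO()
    # Convert to RGB so images with an alpha channel can be saved as JPEG
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    size_kb = buffer.tell() / 1024.0
    return -size_kb  # smaller files get higher reward
```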
The paper also discusses several challenges in applying RL to train diffusion models. For instance, the unintended behaviors that arise when fine-tuning large language models with RL can also appear in text-to-image diffusion models. There is also a considerable risk of over-optimization, in which the model exploits the reward function to an undesirable degree and destroys meaningful image content.
Despite the benefits of RL training, the authors point out that there is not yet a general-purpose method for preventing over-optimization, emphasizing that this is a significant area for further investigation.
In conclusion, the authors suggest that while the scope of their experiments was limited, the “pretrain + fine-tune” paradigm that has proven effective in language modeling appears promising for diffusion models as well. Lastly, they encourage others to build on their work to improve text-to-image generation and to explore other potential applications such as video generation, image editing, and protein synthesis.