The transition of Reinforcement Learning (RL) from theory to real-world application has been hampered by sample inefficiency, especially in environments where exploration is risky or costly. Offline RL sidesteps exploration by training on previously collected data, but this introduces a distribution shift between the target policy and the behaviour policy that collected the data, resulting in overestimation bias and an overly optimistic target policy. A new method proposed by researchers from Oxford University, called Policy-Guided Diffusion (PGD), addresses this issue by modelling entire trajectories instead of single-step transitions.
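For intuition on where the bias enters, consider the standard bootstrapped value target (generic notation, not specific to the paper):

$$y = r + \gamma \max_{a'} Q_\theta(s', a')$$

The maximization ranges over all actions, including ones the dataset never covers, so any positive estimation error at out-of-distribution actions is selected by the max and propagated through training, inflating value estimates.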
Standard offline RL methods typically rely on either explicit or implicit regularization of the policy towards the behaviour distribution, or on a single-step world model learned from the offline dataset. These approaches help mitigate distribution shift, but single-step models accumulate compounding error when rolled out autoregressively; by generating whole trajectories at once, PGD offers a solution to this compounding error problem.
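As a minimal sketch of the first family, a TD3+BC-style policy loss regularizes the learned policy towards the dataset actions. The networks and the weight `alpha` are placeholders for illustration, not part of PGD:

```python
import torch.nn.functional as F

def policy_loss(policy, critic, states, behaviour_actions, alpha=2.5):
    """Maximize the critic's value estimate while staying close to the
    actions actually present in the offline dataset (explicit regularization
    towards the behaviour distribution)."""
    actions = policy(states)                # actions proposed by the target policy
    q_values = critic(states, actions)      # critic's value estimate for those actions
    # Normalize the Q term so alpha has a consistent scale across tasks.
    lam = alpha / q_values.abs().mean().detach()
    bc_term = F.mse_loss(actions, behaviour_actions)  # pull towards dataset actions
    return -(lam * q_values.mean()) + bc_term
```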
PGD is a trajectory-level diffusion model trained on an offline dataset to approximate the behaviour distribution. By incorporating guidance from the target policy during the denoising process, PGD steers trajectory sampling towards the target distribution, yielding a behaviour-regularized target distribution. Guidance comes solely from the target policy; the behaviour policy provides no guidance signal. A guidance coefficient controls the strength of the guidance, and a cosine guidance schedule together with stabilization techniques improves guidance stability and reduces dynamics error.
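A rough sketch of a single policy-guided denoising step is shown below, assuming a trajectory `denoiser` and a differentiable target `policy` exposing `log_prob`; the exact guidance formulation, the direction of the schedule, and the stabilization details are assumptions for illustration, not the authors' implementation:

```python
import math
import torch

def guided_denoise_step(denoiser, policy, traj, t, num_steps, state_dim,
                        guidance_coef=1.0):
    """One reverse-diffusion step with target-policy guidance on a noisy
    trajectory of shape (batch, horizon, state_dim + action_dim)."""
    # Behaviour-trained diffusion model predicts the less-noisy trajectory.
    pred = denoiser(traj, t)

    # Cosine guidance schedule: assumed here to ramp up as noise is removed
    # (t counts down from num_steps - 1 to 0).
    schedule = 0.5 * (1.0 + math.cos(math.pi * t / num_steps))

    # Guidance term: gradient of the target policy's action log-likelihood
    # with respect to the noisy trajectory, nudging sampled actions towards
    # those the target policy would take in the sampled states.
    traj_g = traj.detach().requires_grad_(True)
    states, actions = traj_g[..., :state_dim], traj_g[..., state_dim:]
    log_prob = policy.log_prob(states, actions).sum()
    grad = torch.autograd.grad(log_prob, traj_g)[0]

    return pred + guidance_coef * schedule * grad
```

Scaling the gradient by the guidance coefficient and the schedule is what lets the sampler trade off fidelity to the behaviour data against likelihood under the target policy.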
The researchers found that agents trained on synthetic experience from PGD outperformed those trained on unguided synthetic data or directly on the offline dataset. Trajectory likelihood under each target policy increased monotonically with the guidance coefficient, making the strength of the guidance directly tunable. Despite sampling high-likelihood actions, PGD maintained lower dynamics error than an autoregressive world model (PETS). The results also highlighted PGD's robustness across different target policies and its potential as an enhancement to existing RL methods.
In conclusion, the researchers presented PGD as a controllable method for synthetic trajectory generation in offline RL that potentially outperforms alternatives such as PETS. It improves downstream agent performance across diverse environments and behaviour policies. By addressing out-of-sample error, PGD may lay the groundwork for less conservative offline RL algorithms.