Reinforcement Learning (RL) is a learning paradigm in which an agent interacts with its environment to gather experience and maximize the rewards it receives. Because experience collection and policy improvement rely on policy rollouts, this setting is known as online RL. However, the online interactions required by both on-policy and off-policy RL can be impractical due to environmental or experimental constraints. Offline RL algorithms have therefore been developed to extract optimal policies from static datasets.
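To make the distinction concrete, here is a minimal sketch, assuming a Gymnasium-style environment interface and a dictionary-of-arrays dataset (both assumptions on my part, not details from the paper): online RL needs live rollouts, while offline RL only samples batches from a fixed dataset.

```python
import numpy as np

# Online RL: experience must be gathered through live environment interaction.
def online_rollout(env, policy, steps=1000):
    obs, _ = env.reset()
    for _ in range(steps):
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)  # live rollout
        if terminated or truncated:
            obs, _ = env.reset()

# Offline RL: learning uses only a fixed, previously collected dataset.
def sample_offline_batch(dataset, batch_size=256, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(dataset["observations"]), size=batch_size)
    return {key: value[idx] for key, value in dataset.items()}  # no env access
```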
While offline RL algorithms have seen significant success lately, they require substantial hyperparameter tuning, which hinders their practical use in real-world scenarios. Furthermore, offline RL remains challenging when evaluating out-of-distribution (OOD) actions, since their values cannot be checked against the static dataset.
To address these issues, researchers at Imperial College London have introduced TD3-BST (TD3 with Behavioral Supervisor Tuning). The method trains an uncertainty model and uses it to dynamically adjust the strength of policy regularization, effectively placing TD3 under a behavioral supervisor. TD3-BST optimizes Q-values while adapting its regularization through the uncertainty network, and it outperforms other methods, particularly when tested on the D4RL datasets.
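The paper's exact objective is not reproduced here, but the core idea described above can be sketched as a TD3-style actor update in which a behavioral-cloning penalty is modulated by a learned certainty score, so regularization is strong only where the policy strays from data support. The `critic` and `certainty_model` interfaces and the TD3+BC-style Q normalization are illustrative assumptions, not the authors' formulation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic TD3-style actor mapping states to bounded actions."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

def actor_loss(actor, critic, certainty_model, states, dataset_actions, alpha=2.5):
    """Q-maximization plus a certainty-weighted behavioral-cloning penalty.

    certainty_model(s, a) is assumed to return scores in [0, 1]: near 1 for
    actions well supported by the dataset, near 0 for OOD actions.
    """
    policy_actions = actor(states)
    q_values = critic(states, policy_actions)

    # The BC penalty is strong where the policy's action looks OOD (low
    # certainty) and fades where the dataset supports the action.
    certainty = certainty_model(states, policy_actions).detach()
    bc_penalty = ((policy_actions - dataset_actions) ** 2).mean(dim=-1)
    weighted_bc = ((1.0 - certainty) * bc_penalty).mean()

    # Q-normalization trick borrowed from TD3+BC to keep both terms comparable.
    lam = alpha / q_values.abs().mean().detach()
    return -(lam * q_values).mean() + weighted_bc
```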
A major advantage of the TD3-BST algorithm is its straightforward tuning process: the fundamental hyperparameter is the kernel scale (λ) of the Morse network, which governs how the uncertainty estimate behaves over high-dimensional action spaces. Moreover, training with Morse-weighted behavioral cloning (BC) reduces the influence of the BC loss for distant modes, allowing the learned policy to concentrate on a single mode. The authors also highlight the importance of permitting OOD actions within the TD3-BST framework.
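As a rough illustration of where the kernel scale λ enters, the sketch below defines a hypothetical certainty model: an embedding network whose output is passed through an RBF kernel with scale λ, trained so that dataset actions score near 1 and random actions score near 0. This is not the paper's Morse-network construction or training objective, only an indication of the role λ plays.

```python
import torch
import torch.nn as nn

class CertaintyModel(nn.Module):
    """Hypothetical certainty estimator: embedding network + RBF kernel."""
    def __init__(self, state_dim, action_dim, lam=1.0, hidden=256):
        super().__init__()
        self.lam = lam  # kernel scale: larger values decay faster away from data
        self.embed = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, action):
        z = self.embed(torch.cat([state, action], dim=-1))
        # RBF kernel on the embedding norm: close to 1 near dataset actions,
        # decaying toward 0 for actions far from the data.
        return torch.exp(-self.lam * (z ** 2).sum(dim=-1))

def certainty_training_step(model, states, actions, optimizer):
    """Toy training step: dataset actions -> certainty 1, random actions -> 0."""
    pos = model(states, actions)
    random_actions = torch.empty_like(actions).uniform_(-1.0, 1.0)
    neg = model(states, random_actions)
    loss = ((1.0 - pos) ** 2).mean() + (neg ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```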
Simpler one-step algorithms can also learn policies from offline datasets, and their limitations can be mitigated by relaxing the policy objective. Integrating a BST objective into an existing IQL algorithm alleviates this issue, allowing an optimal policy to be learned while retaining in-sample policy evaluation. Although performance declines slightly on large datasets, relaxing weighted BC with a BST objective performs well on the challenging medium and large datasets.
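Purely as an illustration of combining in-sample advantage weighting with a BST-style relaxation, the sketch below adds a certainty penalty on the policy's own actions to an IQL-like weighted-BC loss. The function names, the exponential advantage weighting, and the penalty form are assumptions, not the authors' exact objective.

```python
import torch

def iql_bst_policy_loss(actor, q_fn, v_fn, certainty_model, states, actions, beta=3.0):
    """Advantage-weighted BC (IQL-style) plus a certainty penalty on policy actions."""
    # In-sample advantage weights: computed only on dataset actions, so the
    # critic is never queried on actions it has not seen.
    with torch.no_grad():
        adv = (q_fn(states, actions) - v_fn(states)).squeeze(-1)
        weights = torch.clamp(torch.exp(beta * adv), max=100.0)

    # Weighted behavioral cloning toward dataset actions.
    policy_actions = actor(states)
    bc = ((policy_actions - actions) ** 2).mean(dim=-1)

    # BST-style relaxation (assumption): penalize policy actions that the
    # certainty model flags as OOD, rather than cloning every dataset mode equally.
    ood_penalty = 1.0 - certainty_model(states, policy_actions)

    return (weights * bc + ood_penalty).mean()
```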
In conclusion, TD3-BST, introduced by the Imperial College London researchers, emerges as a strong approach that dynamically adjusts regularization using an uncertainty model, and it demonstrates robust performance when learning from suboptimal data. Combining policy regularization with a source of uncertainty also enhances algorithm performance. The researchers suggest future work on alternative methods for estimating uncertainty and on how best to combine multiple uncertainty sources. Their research is a significant contribution to the field of reinforcement learning.