Data-driven techniques that convert offline datasets into policies, such as imitation learning and offline reinforcement learning (RL), are seen as promising solutions to control problems across many fields. However, recent research suggests that simply collecting more expert data and running imitation learning can often outperform offline RL, even when the RL algorithm has access to abundant data. This finding has raised questions about what primarily determines the effectiveness of offline RL.
Offline RL learns a policy from previously collected data, without further interaction with the environment. The central challenge is the mismatch between the state-action distribution of the dataset and that of the learned policy: querying the value function on out-of-distribution actions can lead to severe value overestimation. Prior work in offline RL has proposed many methods for estimating more accurate value functions from offline data, but comparatively few studies have tried to pinpoint where the practical bottlenecks in offline RL actually lie.
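As a rough illustration of why this distribution shift is problematic, the toy sketch below shows a standard fitted Q-learning backup; the setup, network interface, and batch keys are assumptions made for illustration, not code from the paper. The max over actions can select actions that never appear in the dataset, so any overestimation error at those actions is bootstrapped into the training targets.

```python
import torch

def q_learning_targets(q_net, batch, gamma=0.99):
    """Unregularized Bellman targets computed from an offline batch.

    The max over all actions may select actions the dataset never contains;
    if q_net overestimates their values, that error is bootstrapped into the
    targets and can compound as training proceeds.
    """
    with torch.no_grad():
        next_q = q_net(batch["next_obs"])            # [batch, num_actions]
        best_next_q = next_q.max(dim=-1).values      # may correspond to an OOD action
        return batch["reward"] + gamma * (1.0 - batch["done"]) * best_next_q
```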
Researchers from the University of California, Berkeley and Google DeepMind have made two observations about offline RL that could provide useful guidance for practitioners and for future algorithm development. First, they observed that the choice of policy extraction algorithm often has a larger impact on performance than the choice of value learning algorithm. Among policy extraction algorithms, behavior-regularized policy gradient methods consistently outperform commonly used weighted regression methods such as advantage-weighted regression.
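For concreteness, the sketch below shows generic versions of these two families of policy extraction losses. This is an illustrative sketch, not the authors' implementation; the policy and value-network interfaces, batch keys, and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def behavior_regularized_pg_loss(policy, q_net, batch, alpha=1.0):
    """Behavior-regularized policy gradient (DDPG+BC style):
    maximize Q(s, pi(s)) while keeping pi(s) close to dataset actions."""
    pred_action = policy(batch["obs"])                        # deterministic action output
    q_value = q_net(batch["obs"], pred_action)                # gradient flows through the action
    bc_penalty = F.mse_loss(pred_action, batch["action"])     # stay near the behavior data
    return -q_value.mean() + alpha * bc_penalty

def weighted_regression_loss(policy, q_net, v_net, batch, beta=3.0):
    """Weighted behavior cloning (advantage-weighted regression style):
    imitate dataset actions, weighted by their estimated advantage."""
    with torch.no_grad():
        adv = q_net(batch["obs"], batch["action"]) - v_net(batch["obs"])
        weights = torch.exp(beta * adv).clamp(max=100.0)      # exponentiated advantage weights
    log_prob = policy.log_prob(batch["obs"], batch["action"])  # stochastic policy interface
    return -(weights * log_prob).mean()
```

Intuitively, the first loss queries the value function at the policy's own actions and follows its gradient directly, while the second only reweights actions already present in the dataset, which limits how much of the value function's knowledge is transferred into the policy.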
Second, the researchers found that offline RL is often limited by how poorly the policy performs on the states it encounters at test time, rather than on the states seen during training. They propose two practical remedies: training on datasets with high state coverage and adopting test-time policy extraction techniques.
To support the latter, the researchers developed on-the-fly policy improvement techniques that continue to distill information from the learned value function into the policy during evaluation, further shoring up performance.
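One simple way to realize this idea, sketched below under assumed interfaces and in the spirit of the proposal rather than as the authors' exact method, is to refine the policy's action at evaluation time by taking a few gradient ascent steps on the frozen, learned Q-function:

```python
import torch

def test_time_action(policy, q_net, obs, step_size=0.1, n_steps=1, act_limit=1.0):
    """Refine the policy's action at evaluation time by ascending the frozen
    Q-function, so the value function keeps informing the policy on states
    that only appear at test time."""
    action = policy(obs).detach().requires_grad_(True)
    for _ in range(n_steps):
        q_value = q_net(obs, action)
        (grad,) = torch.autograd.grad(q_value.sum(), action)
        action = (action + step_size * grad).detach().requires_grad_(True)
    return action.detach().clamp(-act_limit, act_limit)  # assumes bounded continuous actions
```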
In conclusion, the researchers argue that the main challenge in offline RL is not merely learning a better value function. Rather, it lies in how faithfully the policy is extracted from the value function and how well that policy generalizes to new, unseen states at test time. For effective offline RL, the value function should be trained on diverse data, and the policy should be allowed to fully exploit the value function. The researchers pose two critical questions for future research in offline RL: what is the best way to extract a policy from the learned value function, and how can a policy be trained to generalize well to test-time states?