Deep reinforcement learning (RL) relies heavily on value functions, which are typically trained with mean squared error regression so that their predictions match bootstrapped target values. However, while the cross-entropy classification loss scales well in supervised learning, regression-trained value functions have proven difficult to scale in deep RL.
In classical deep learning, large neural networks excel at classification, and regression tasks often benefit from being reframed as classification problems, which can substantially improve performance. This reframing converts real-valued targets into categorical labels and then minimizes a categorical cross-entropy loss instead of a squared error.
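To make the conversion concrete, here is a minimal sketch of one such scheme, a "two-hot" encoding, written in PyTorch: each scalar target is split between its two neighboring bin centers and the model is trained with cross-entropy on its logits. The bin range, bin count, and example values are illustrative assumptions, not settings taken from the paper.

```python
import torch
import torch.nn.functional as F

def two_hot_labels(y, centers):
    """Turn scalar targets y into 'two-hot' categorical labels: all probability
    mass is split between the two bin centers surrounding y, weighted so that
    the expectation of the label distribution recovers y exactly."""
    y = y.clamp(centers[0].item(), centers[-1].item())
    hi = torch.searchsorted(centers, y).clamp(1, len(centers) - 1)
    lo = hi - 1
    w = (y - centers[lo]) / (centers[hi] - centers[lo])
    labels = torch.zeros(y.shape[0], len(centers))
    labels.scatter_(1, lo.unsqueeze(1), (1 - w).unsqueeze(1))
    labels.scatter_(1, hi.unsqueeze(1), w.unsqueeze(1))
    return labels

# A regression target of 1.7 on a support of 9 bins spanning [-2, 2] becomes a
# soft label over the bins at 1.5 and 2.0; the model is then trained with
# categorical cross-entropy instead of squared error.
centers = torch.linspace(-2.0, 2.0, steps=9)
labels = two_hot_labels(torch.tensor([1.7, -0.3]), centers)
logits = torch.randn(2, 9, requires_grad=True)   # stand-in for network outputs
loss = -(labels * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```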
Nonetheless, translating regression-based RL methods such as deep Q-learning and actor-critic algorithms into a classification formulation, particularly when scaling to large architectures such as Transformers, has proven challenging. To address this, researchers from Google DeepMind and collaborators have conducted an extensive study of training value functions with a categorical cross-entropy loss in deep RL. Their findings point to substantial improvements in scalability, robustness, and performance compared to conventional regression-based training.
The team’s approach converts the regression problem in temporal-difference (TD) learning into a classification problem: instead of minimizing the squared distance between scalar Q-values and TD targets, it minimizes the cross-entropy between categorical distributions representing those quantities.
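The sketch below shows how this swap might look inside a deep Q-learning update, assuming a PyTorch setup in which the Q-network outputs logits over a fixed set of bins per action: the scalar TD target is computed as usual, projected onto the support (here with the same two-hot scheme as above), and the loss is the cross-entropy between that target distribution and the online network's logits for the chosen action. The number of bins, value range, and discount factor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative constants, not taken from the paper: 51 bins on an assumed return range.
NUM_BINS, V_MIN, V_MAX, GAMMA = 51, -10.0, 10.0, 0.99
CENTERS = torch.linspace(V_MIN, V_MAX, NUM_BINS)

def q_from_logits(logits):
    # logits: (batch, num_actions, num_bins). The scalar Q-value is the
    # expectation of the predicted categorical distribution over bin centers.
    return (F.softmax(logits, dim=-1) * CENTERS).sum(-1)

def classification_td_loss(online_logits, target_logits, actions, rewards, dones):
    with torch.no_grad():
        # 1. Scalar TD target, computed exactly as in standard Q-learning.
        next_q = q_from_logits(target_logits).max(dim=-1).values
        td_target = (rewards + GAMMA * (1.0 - dones) * next_q).clamp(V_MIN, V_MAX)

        # 2. Project the scalar target onto the support (two-hot projection,
        #    as sketched earlier; an HL-Gauss projection could be substituted).
        hi = torch.searchsorted(CENTERS, td_target).clamp(1, NUM_BINS - 1)
        lo = hi - 1
        w = (td_target - CENTERS[lo]) / (CENTERS[hi] - CENTERS[lo])
        target_dist = torch.zeros(td_target.shape[0], NUM_BINS)
        target_dist.scatter_(1, lo.unsqueeze(1), (1 - w).unsqueeze(1))
        target_dist.scatter_(1, hi.unsqueeze(1), w.unsqueeze(1))

    # 3. Cross-entropy between the target distribution and the logits of the
    #    chosen action, replacing the usual squared TD error.
    chosen = online_logits[torch.arange(actions.shape[0]), actions]
    return -(target_dist * F.log_softmax(chosen, dim=-1)).sum(-1).mean()
```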
Three strategies are examined: Two-Hot and HL-Gauss, which spread a scalar TD target over neighboring bins of a fixed support, and C51, which directly models the categorical return distribution, all with the aim of improving robustness and scalability in deep RL. The experiments show that cross-entropy losses, HL-Gauss in particular, consistently outperform traditional regression losses such as MSE across numerous domains, including Atari games, chess, language agents, and robotic manipulation, with gains in performance, scalability, and sample efficiency, demonstrating their effectiveness for training value-based deep RL models.
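For reference, here is a hedged sketch of how HL-Gauss labels can be constructed: each scalar target is smoothed with a Gaussian, and every bin receives the probability mass that the Gaussian assigns to it. Whereas two-hot places mass on exactly two bins, this spreads it over several neighbors. The bin layout and the smoothing width sigma below are illustrative choices, not the paper's settings.

```python
import math
import torch

def hl_gauss_target(y, bin_edges, sigma=0.75):
    """HL-Gauss labels: spread each scalar target y over the bins using the
    probability mass that a Gaussian N(y, sigma^2) assigns to each bin.

    y: (batch,) scalar targets; bin_edges: (num_bins + 1,) sorted bin edges.
    """
    def gauss_cdf(x):
        return 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

    # Gaussian CDF evaluated at every bin edge, shape (batch, num_bins + 1).
    cdf = gauss_cdf((bin_edges.unsqueeze(0) - y.unsqueeze(1)) / sigma)
    probs = cdf[:, 1:] - cdf[:, :-1]               # mass falling inside each bin
    return probs / probs.sum(dim=1, keepdim=True)  # renormalize over the truncated support

# Example: targets 1.7 and -0.3 smoothed over 20 bins on [-2, 2]; the resulting
# soft labels are then used with categorical cross-entropy, exactly as in the
# two-hot case above.
edges = torch.linspace(-2.0, 2.0, steps=21)
soft_labels = hl_gauss_target(torch.tensor([1.7, -0.3]), edges)
```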
In conclusion, the research shows that reframing value-function regression as classification and minimizing categorical cross-entropy rather than mean squared error significantly improves the performance and scalability of value-based RL methods across a range of tasks. The improvement stems from the cross-entropy loss's ability to support more expressive representations and to better handle noise and nonstationarity in the targets. While challenges remain, the work highlights the substantial impact of this seemingly simple change.