Reinforcement Learning (RL) has gained significant popularity as a technology for achieving superhuman performance in a variety of applications, from games and complex physical control to mathematical computations. Although RL has produced impressive research advancements, its adoption in production environments has remained limited. In this note, I’ll share some key bottlenecks in applying RL to production environments and summarize key takeaways from discussions with domain experts during the AAAI 2023 Reinforcement Learning Ready for Production workshop, where I served as chair.
Over the past few years, researchers have developed various methods to improve the decision-making quality of reinforcement learning (RL) agents. These methods include model-based learning, advanced exploration designs, and techniques for coping with epistemic and aleatoric uncertainty, among others. However, these methods often fail to address a critical bottleneck in real-world environments: the limits on computation and response time.
In certain scenarios, such as social media recommendations or self-driving cars, the time allotted for making a decision is often very short, sometimes less than half a second, and in some cases responses need to be real-time. As a result, complex and computationally expensive methods, such as full neural network gradient descent, matrix inversion, or forward-looking model-based simulations, are usually not feasible in production-level environments.
Given these constraints, RL methods must be able to make intelligent decisions online without relying on computationally expensive operations. Addressing these challenges is critical for developing RL methods that can operate effectively in real-world applications.
To solve a wide range of tasks with limited interactions, an intelligent RL agent must make sequential decisions with limited feedback. However, current state-of-the-art RL algorithms require tens of millions of data points to train and don’t generalize well across tasks. Although supervised learning is even harder to generalize to sequential decision tasks, it’s valid to be concerned that RL is still insufficient in this respect.
One way to improve sample efficiency for online RL agents is to use smarter exploration algorithms that seek informative feedback. Practitioners often fear exploration because of uncertainty and potential metric losses, but relying solely on supervised learning and greedy algorithms can lead to the “echo chamber” phenomenon, where the agent fails to hear the true story from its environment.
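The gap between a purely greedy agent and one that explores can be illustrated with a small sketch. Everything here is an assumption made for illustration: a two-armed Bernoulli bandit with made-up success rates, a hypothetical `run_bandit` helper, and UCB1 as one example of an exploration rule (the note itself does not prescribe a specific algorithm).

```python
import numpy as np

def run_bandit(true_means, steps, explore, seed=0):
    """Play a Bernoulli bandit greedily or with a UCB1 exploration bonus."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms, dtype=int)   # pulls per arm
    estimates = np.zeros(n_arms)           # running mean reward per arm
    for t in range(1, steps + 1):
        if explore:
            # UCB1: untried arms get an infinite bonus, tried arms get
            # a bonus that shrinks as they are pulled more often.
            bonus = np.where(
                counts > 0,
                np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1)),
                np.inf,
            )
            arm = int(np.argmax(estimates + bonus))
        else:
            # Greedy: always trust the current estimates (ties go to arm 0).
            arm = int(np.argmax(estimates))
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Hypothetical success rates: arm 1 is clearly better than arm 0.
true_means = [0.3, 0.7]
greedy_counts, _ = run_bandit(true_means, steps=2000, explore=False)
ucb_counts, _ = run_bandit(true_means, steps=2000, explore=True)
print("greedy pulls per arm:", greedy_counts)
print("UCB1 pulls per arm:  ", ucb_counts)
```

In this toy setup the greedy agent never tries the second arm, because its estimate for the first arm can never drop below the untouched zero estimate of the second. That is a small version of the echo chamber described above: the agent only hears feedback on what it already chose.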
Another way to enable RL agents to solve a wide range of tasks with less data is through generalized value functions (GVFs) and auxiliary tasks. By using GVFs and auxiliary tasks to gain an on-policy understanding of the environment through multiple lenses, the agent can form a multi-angle representation of the environment and generalize more quickly to different tasks with fewer interactions.
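As a rough illustration of the idea, the sketch below learns several GVF-style predictions from one shared stream of on-policy experience using tabular TD(0). The three-state cycle, the two cumulants, the constant discount, and the `learn_gvfs` helper are all assumptions chosen to keep the example tiny; real GVF setups typically use function approximation and state-dependent continuation functions.

```python
import numpy as np

def learn_gvfs(cumulant_fns, gamma, n_states=3, steps=3000, step_size=0.1):
    """Learn one value table per cumulant from the same experience stream."""
    values = np.zeros((len(cumulant_fns), n_states))
    state = 0
    for _ in range(steps):
        # Toy fixed policy: always move to the next state in the cycle.
        next_state = (state + 1) % n_states
        for k, cumulant in enumerate(cumulant_fns):
            # TD(0) update, with the cumulant playing the role of the reward.
            target = cumulant(state, next_state) + gamma * values[k, next_state]
            values[k, state] += step_size * (target - values[k, state])
        state = next_state
    return values

# Two hypothetical "questions" about the same environment.
cumulants = [
    lambda s, s_next: 1.0,                            # discounted step count
    lambda s, s_next: 1.0 if s_next == 0 else 0.0,    # discounted returns to state 0
]
gvf_values = learn_gvfs(cumulants, gamma=0.5)
print(gvf_values)
```

Each row of `gvf_values` answers a different question about the same trajectory, which is the sense in which auxiliary predictions give the agent several views of the environment at no extra interaction cost.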
Practitioners accustomed to precision-recall metrics from supervised learning models are often apprehensive about deploying RL algorithms because they generate counterfactual trajectories. The fear stems from the difficulty of imagining the parallel universe an RL agent creates when deployed in production.
Conservative learning in RL agents is key to alleviating concerns about their deployment. Instead of aggressively optimizing the expectation of cumulative return, it’s paramount to pay attention to the variance of the learning target in order to build confidence in a freshly trained RL model. This principle aligns well with the direction of safe RL and calls for a rigorous study of the tradeoff between learning and risk aversion.
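One simple way to make this tradeoff concrete is a mean-minus-spread score. The sketch below is only an illustration under stated assumptions: the return samples are made up, the policy names are hypothetical, and penalizing the sample standard deviation with a `risk_weight` is one of many possible risk-averse criteria, not a method taken from the workshop discussions.

```python
import numpy as np

def risk_adjusted_score(returns, risk_weight):
    """Mean return minus a penalty proportional to the sample std deviation."""
    returns = np.asarray(returns, dtype=float)
    return float(returns.mean() - risk_weight * returns.std(ddof=1))

def pick_policy(returns_by_policy, risk_weight):
    """Return the name of the best-scoring policy and all scores."""
    scores = {
        name: risk_adjusted_score(returns, risk_weight)
        for name, returns in returns_by_policy.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical evaluation returns for two candidate policies.
returns_by_policy = {
    "aggressive": [30.0, -20.0, 35.0, -15.0, 40.0],  # higher mean, high variance
    "cautious": [9.0, 11.0, 10.0, 12.0, 8.0],        # lower mean, low variance
}

risk_neutral_choice, _ = pick_policy(returns_by_policy, risk_weight=0.0)
risk_averse_choice, _ = pick_policy(returns_by_policy, risk_weight=0.5)
print(risk_neutral_choice, risk_averse_choice)
```

With `risk_weight=0.0` the score reduces to the plain expected return; raising it shifts the choice toward policies whose outcomes are easier to trust before a rollout.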
Off-policy policy evaluation (OPE) is a field that researchers study to address the lack of understanding of RL agent behavior after deployment in the environment. While the development of doubly robust and problem-specific OPE tools in recent years brings hope for estimating agent performance, such methods are still too noisy to provide useful signals in highly stochastic environments.
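For readers less familiar with OPE, here is a minimal sketch of two standard estimators in the simplest one-step (contextual bandit) setting: inverse propensity scoring and its doubly robust extension. The logged data, the probabilities, and the reward-model predictions are invented for illustration, and the function names are hypothetical; multi-step RL versions need per-step importance weights and are considerably noisier.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Inverse propensity scoring: reweight logged rewards by pi_target / pi_logging."""
    weights = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(weights * np.asarray(rewards)))

def doubly_robust_estimate(rewards, target_probs, logging_probs, q_logged, v_target):
    """Doubly robust: model-based value plus an importance-weighted residual.

    q_logged: reward model's prediction for the logged action.
    v_target: reward model's expected reward under the target policy.
    """
    weights = np.asarray(target_probs) / np.asarray(logging_probs)
    residual = np.asarray(rewards) - np.asarray(q_logged)
    return float(np.mean(np.asarray(v_target) + weights * residual))

# Hypothetical log: the logging policy picked each of two actions with prob 0.5.
rewards = np.array([1.0, 0.0, 1.0, 0.0])        # observed rewards
logging_probs = np.array([0.5, 0.5, 0.5, 0.5])  # prob. of the logged action when logged
target_probs = np.array([0.8, 0.2, 0.8, 0.2])   # prob. of that action under the new policy

ips_value = ips_estimate(rewards, target_probs, logging_probs)

# With an all-zero reward model, the doubly robust estimate reduces to IPS.
dr_no_model = doubly_robust_estimate(
    rewards, target_probs, logging_probs,
    q_logged=np.zeros(4), v_target=np.zeros(4),
)
print(ips_value, dr_no_model)
```

The residual term is where the noise enters: when importance weights are large or rewards are highly stochastic, the average of these weighted residuals has high variance, which is the limitation noted above.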
One aspect of RL productionization that is often overlooked by the research community is the nonstationarity of production environments. Popular topics in recommender systems, seasonality of commodity prices, economic cycles, and other real-world phenomena can all appear nonstationary from an RL agent’s perspective, given the limited history it can consider and the abrupt shifts in the environment. Continual learning and exploration in the face of nonstationarity are potential directions for addressing these concerns, but as emerging fields, they require extensive study to mature and become useful for production environments.
In this note, I have outlined some of the difficulties in applying reinforcement learning to production environments and summarized some of the discussions I had with experts during the workshop. I would like to thank Sergey Levine, Susan Murphy, Emma Brunskill, and Benjamin Van Roy for joining this workshop for insightful discussions, and I hope that the directions and learnings above shed some light on bringing future RL advancements to production.


