# Real-Time Reinforcement Learning

###### Abstract

Markov Decision Processes (MDPs), the mathematical framework underlying most algorithms in Reinforcement Learning (RL), are often used in a way that wrongly assumes that the state of an agent’s environment does not change during action selection. As RL systems based on MDPs begin to find application in real-world, safety-critical situations, this mismatch between the assumptions underlying classical MDPs and the reality of real-time computation may lead to undesirable outcomes. In this paper, we introduce a new framework in which states and actions evolve simultaneously, and show how it is related to the classical MDP formulation. We analyze existing algorithms under the new real-time formulation and show why they are suboptimal when used in real time. We then use those insights to create a new algorithm, Real-Time Actor-Critic (RTAC), that outperforms the existing state-of-the-art continuous control algorithm, Soft Actor-Critic, both in real-time and non-real-time settings. Code and videos can be found at github.com/rmst/rtrl.

Reinforcement Learning has led to great successes in games (Tesauro, 1994; Mnih et al., 2015; Silver et al., 2017) and is starting to be applied successfully to real-world robotic control (Schulman et al., 2015; Hwangbo et al., 2019).

The theoretical underpinning for most methods in Reinforcement Learning is the Markov Decision Process (MDP) framework (Bellman, 1957). While it is well suited to describe turn-based decision problems such as board games, this framework is ill suited for real-time applications in which the environment’s state continues to evolve while the agent selects an action (Travnik et al., 2018). Nevertheless, this framework has been used for real-time problems using what are essentially tricks, e.g. pausing a simulated environment during action selection or ensuring that the time required for action selection is negligible (Hwangbo et al., 2017).

Instead of relying on such tricks, we propose an augmented decision-making framework - Real-Time Reinforcement Learning (RTRL) - in which the agent is allowed exactly one timestep to select an action. RTRL is conceptually simple and opens up new algorithmic possibilities because of its special structure.

We leverage RTRL to create Real-Time Actor-Critic (RTAC), a new actor-critic algorithm, better suited for real-time interaction, that is based on Soft Actor-Critic (Haarnoja et al., 2018a). We then show experimentally that RTAC outperforms SAC in both real-time and non-real-time settings.

## 1 Background

In Reinforcement Learning the world is split up into agent and environment. The agent is represented by a policy (a state-conditioned action distribution), while the environment is represented by a Markov Decision Process (Def. 1). Traditionally, the agent-environment interaction has been governed by the MDP framework. Here, however, we strictly use MDPs to represent the environment. The agent-environment interaction is instead described by different types of Markov Reward Processes (MRP), with the Turn-Based Markov Reward Process (Def. 2) behaving like the traditional interaction scheme.

###### Definition 1.

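A Markov Decision Process (MDP) is characterized by a tuple $(S, A, \mu, p, r)$ with

(1) state space $S$, (2) action space $A$,

(3) initial state distribution $\mu: S \to \mathbb{R}$,

(4) transition distribution $p: S \times S \times A \to \mathbb{R}$,

(5) reward function $r: S \times A \to \mathbb{R}$.

An agent-environment system can be condensed into a Markov Reward Process $(S, \mu, \kappa, \bar r)$ consisting of a Markov process $(S, \mu, \kappa)$ and a state-reward function $\bar r$. The Markov process induces a sequence of states $(s_t)_{t \in \mathbb{N}}$ and, together with $\bar r$, a sequence of rewards $(r_t)_{t \in \mathbb{N}} = (\bar r(s_t))_{t \in \mathbb{N}}$.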

As usual, the objective is to find a policy that maximizes the expected sum of rewards. In practice, rewards can be discounted and augmented to guarantee convergence, reduce variance and encourage exploration. However, when evaluating the performance of an agent, we will always use the undiscounted sum of rewards.

### 1.1 Turn-Based Reinforcement Learning

The turn-based scheme in which agent and environment interact is usually considered part of the standard Reinforcement Learning framework. We call this interaction scheme a Turn-Based Markov Reward Process.

###### Definition 2.

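A Turn-Based Markov Reward Process $(S, \mu, \kappa, \bar r) = \mathit{TBMRP}(E, \pi)$ combines a Markov Decision Process $E = (S, A, \mu, p, r)$ with a policy $\pi$, such that

$$\kappa(s_{t+1} \mid s_t) = \int_A p(s_{t+1} \mid s_t, a)\, \pi(a \mid s_t)\, da \quad \text{and} \quad \bar r(s_t) = \int_A r(s_t, a)\, \pi(a \mid s_t)\, da. \tag{1}$$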

We say the interaction is turn-based, because the environment pauses while the agent selects an action and the agent pauses until it receives a new observation from the environment. This is illustrated in Figure 2. An action selected in a certain state is paired up again with that same state to induce the next. The state does not change during the action selection process.

## 2 Real-Time Reinforcement Learning

In contrast to the conventional, turn-based interaction scheme, we propose an alternative, real-time interaction framework in which states and actions evolve simultaneously. Here, agent and environment step in unison to produce new state-action pairs from old state-action pairs as illustrated in Figures 2 and 4.

###### Definition 3.

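A Real-Time Markov Reward Process $(X, \boldsymbol{\mu}, \kappa, \bar r) = \mathit{RTMRP}(E, \pi)$ combines a Markov Decision Process $E = (S, A, \mu, p, r)$ with a policy $\pi$, such that

$$\kappa(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t)\, \pi(a_{t+1} \mid s_t, a_t) \quad \text{and} \quad \bar r(s_t, a_t) = r(s_t, a_t). \tag{2}$$

The system state space is $X = S \times A$. The initial action $a_0$ can be set to some fixed value $c$, i.e. $\boldsymbol{\mu}(s_0, a_0) = \mu(s_0)\, \delta(a_0 - c)$. Here $\delta$ is the Dirac delta distribution: if $y \sim \delta(\cdot - x)$ then $y = x$ with probability one.

Note that we introduced a new policy $\pi$ that takes state-action pairs instead of just states. That is because the system state is now a state-action pair $x_t = (s_t, a_t)$ and $s_t$ alone is not a sufficient statistic of the future of the stochastic process anymore.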

### 2.1 The real-time framework is made for back-to-back action selection

In the real-time framework, the agent has exactly one timestep to select an action. If an agent takes longer than that, its policy would have to be broken up into stages that each take less than one timestep to evaluate. On the other hand, if an agent takes less than one timestep to select an action, the real-time framework will delay applying the action until the next observation is made. The optimal case is when an agent, immediately upon finishing selecting an action, observes the next state and starts computing the next action. This continuous, back-to-back action selection is ideal in that it allows the agent to update its actions as quickly as possible, and the real-time framework introduces no additional delay.

To achieve back-to-back action selection, it might be necessary to match timestep size to the policy evaluation time. With current algorithms, reducing timestep size might lead to worse performance. Recently, however, progress has been made towards timestep agnostic methods (Tallec et al., 2019). We believe back-to-back action selection is an achievable goal and we demonstrate here that the real-time framework is effective even if we are not able to tune timestep size (Section 5).

### 2.2 Real-time interaction can be expressed within the turn-based framework

It is possible to express real-time interaction within the standard, turn-based framework, which allows us to reconnect the real-time framework to the vast body of work in RL. Specifically, we are trying to find an augmented environment $\mathit{RTMDP}(E)$ that behaves the same with turn-based interaction as $E$ would with real-time interaction.

In the real-time framework the agent communicates its action to the environment via the state. However, in the traditional, turn-based framework, only the environment can directly influence the state. We therefore need to deterministically "pass through" the action to the next state by augmenting the transition function. The RTMDP has two types of actions, (1) the actions $\mathbf{a}_t$ emitted by the policy and (2) the action component $a_t$ of the state $x_t = (s_t, a_t)$, where $a_t = \mathbf{a}_{t-1}$ with probability one.

###### Definition 4.

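A Real-Time Markov Decision Process $(X, A, \boldsymbol{\mu}, \mathbf{p}, \mathbf{r}) = \mathit{RTMDP}(E)$ augments another Markov Decision Process $E = (S, A, \mu, p, r)$, such that

(1) state space $X = S \times A$, (2) action space is $A$,

(3) initial state distribution $\boldsymbol{\mu}(x_0) = \boldsymbol{\mu}(s_0, a_0) = \mu(s_0)\, \delta(a_0 - c)$,

(4) transition distribution $\mathbf{p}(x_{t+1} \mid x_t, \mathbf{a}_t) = \mathbf{p}(s_{t+1}, a_{t+1} \mid s_t, a_t, \mathbf{a}_t) = p(s_{t+1} \mid s_t, a_t)\, \delta(a_{t+1} - \mathbf{a}_t)$,

(5) reward function $\mathbf{r}(x_t, \mathbf{a}_t) = \mathbf{r}(s_t, a_t, \mathbf{a}_t) = r(s_t, a_t)$.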
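The pass-through augmentation can be sketched as an environment wrapper. This is a minimal illustration, not the code accompanying the paper: `ToyEnv`, `RealTimeWrapper`, and the `reset`/`step` interface are hypothetical names chosen for the example, and the toy dynamics are deterministic only to keep the trace easy to follow.

```python
class ToyEnv:
    """Hypothetical 1-D environment: the state is a position, the action a displacement."""

    def reset(self):
        self.s = 0.0
        return self.s

    def step(self, a):
        r = -abs(self.s)      # reward r(s_t, a_t), here a function of the state only
        self.s = self.s + a   # s_{t+1} drawn from p(. | s_t, a_t), deterministic here
        return self.s, r


class RealTimeWrapper:
    """Turn-based view of real-time interaction (illustrative).

    The augmented state is x_t = (s_t, a_t). The action handed to step()
    does not reach the underlying environment right away: it is passed
    through into the next augmented state and applied one timestep later.
    """

    def __init__(self, env, default_action):
        self.env = env
        self.default_action = default_action  # fixed initial action a_0 = c

    def reset(self):
        self.s = self.env.reset()
        self.a = self.default_action
        return (self.s, self.a)

    def step(self, new_action):
        # The underlying environment advances with the action stored in the state.
        self.s, r = self.env.step(self.a)
        # The newly selected action becomes the action component of x_{t+1}.
        self.a = new_action
        return (self.s, self.a), r


env = RealTimeWrapper(ToyEnv(), default_action=0.0)
x0 = env.reset()
x1, r1 = env.step(1.0)    # the environment moved with a_0 = 0.0
x2, r2 = env.step(-1.0)   # the environment moved with the action selected one step earlier
```

A turn-based agent interacting with the wrapped environment sees its own previous action as part of the observation and experiences a one-step delay before that action takes effect.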

###### Theorem 1.

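A policy $\pi$ interacting with $\mathit{RTMDP}(E)$ in the conventional, turn-based manner gives rise to the same Markov Reward Process as $\pi$ interacting with $E$ in real-time, i.e.

$$\mathit{RTMRP}(E, \pi) = \mathit{TBMRP}(\mathit{RTMDP}(E), \pi). \tag{3}$$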

Interestingly, the RTMDP is equivalent to a 1-step constant delay MDP (Walsh et al., 2008). However, we believe the different intuitions behind the two warrant the different names: the constant delay MDP is trying to model external action and observation delays, whereas the RTMDP is modelling the time it takes to select an action. The connection makes sense, though: in a framework where the action selection is assumed to be instantaneous, we can apply a delay to account for the fact that the action selection was not instantaneous after all.

### 2.3 Turn-based interaction can be expressed within the real-time framework

It is also possible to define an augmentation $\mathit{TBMDP}(E)$ that allows us to express turn-based environments (e.g. Chess, Go) within the real-time framework (Def. 7 in the Appendix). By assigning separate timesteps to agent and environment, we can allow the agent to act while the environment pauses. More specifically, we add a binary variable $b$ to the state to keep track of whether it is the environment’s or the agent’s turn. While $b$ inverts at every timestep, the underlying environment only advances every other timestep.

###### Theorem 2.

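A policy $\pi$ interacting with $\mathit{TBMDP}(E)$ in real time gives rise to a Markov Reward Process that contains (Def. 10) the MRP resulting from $\pi$ interacting with $E$ in the conventional, turn-based manner, i.e.

$$\mathit{TBMRP}(E, \pi) \propto \mathit{RTMRP}(\mathit{TBMDP}(E), \pi). \tag{4}$$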

As a result, not only can we use conventional algorithms in the real-time framework but we can use algorithms built on the real-time framework for all turn-based problems.

## 3 Reinforcement Learning in Real-Time Markov Decision Processes

Having established the RTMDP as a compatibility layer between conventional RL and RTRL, we can now look at how existing theory changes when moving from an environment $E$ to $\mathit{RTMDP}(E)$.

Since most RL methods assume that the environment’s dynamics are completely unknown, they will not be able to make use of the fact that we precisely know part of the dynamics of the RTMDP. Specifically, they will have to learn the effects of the "feed-through" mechanism from data, which could lead to much slower learning and worse performance when applied to an environment $\mathit{RTMDP}(E)$ instead of $E$. This could especially hurt the performance of off-policy algorithms, which have been among the most successful RL methods to date (Mnih et al., 2015; Haarnoja et al., 2018a). Most off-policy methods make use of the action-value function.

###### Definition 5.

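The action-value function $q^\pi_E$ for an environment $E = (S, A, \mu, p, r)$ and a policy $\pi$ can be recursively defined as

$$q^\pi_E(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \Big[ \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_{t+1})} \big[ q^\pi_E(s_{t+1}, a_{t+1}) \big] \Big]. \tag{5}$$

When this identity is used to train an action-value estimator, the transition $(s_t, a_t, s_{t+1})$ can be sampled from a replay memory containing off-policy experience while the next action $a_{t+1}$ is sampled from the policy $\pi$.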

###### Lemma 1.

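In a Real-Time Markov Decision Process for the action-value function we have

$$q^\pi_{\mathit{RTMDP}(E)}(s_t, a_t, \mathbf{a}_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \Big[ \mathbb{E}_{\mathbf{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}, \mathbf{a}_t)} \big[ q^\pi_{\mathit{RTMDP}(E)}(s_{t+1}, \mathbf{a}_t, \mathbf{a}_{t+1}) \big] \Big]. \tag{6}$$

Note that the action $\mathbf{a}_t$ affects neither the reward nor the next state. The only thing that $\mathbf{a}_t$ does affect is $a_{t+1}$ which, in turn, only in the next timestep will affect $r(s_{t+1}, a_{t+1})$ and $s_{t+2}$. To learn the effect of an action on $E$ (specifically the future rewards), we now have to perform two updates where previously we only had to perform one. We will investigate experimentally the effect of this on the off-policy Soft Actor-Critic algorithm (Haarnoja et al., 2018a) in Section 5.1.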
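The two-update argument can be made concrete by unrolling the recursion once. The expansion below is a sketch rather than one of the numbered equations of the paper; it assumes the notation $\mathbf{a}_t$ for the action emitted by the policy at time $t$ and uses the pass-through identity $a_{t+1} = \mathbf{a}_t$ of the RTMDP, with $q$ short for $q^\pi_{\mathit{RTMDP}(E)}$.

$$
\begin{aligned}
q(s_t, a_t, \mathbf{a}_t)
 &= r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}
    \Big[ \mathbb{E}_{\mathbf{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}, \mathbf{a}_t)}
    \big[ q(s_{t+1}, \mathbf{a}_t, \mathbf{a}_{t+1}) \big] \Big] \\
 &= r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}
    \Big[ r(s_{t+1}, \mathbf{a}_t)
    + \mathbb{E}_{\mathbf{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}, \mathbf{a}_t)}\,
      \mathbb{E}_{s_{t+2} \sim p(\cdot \mid s_{t+1}, \mathbf{a}_t)}\,
      \mathbb{E}_{\mathbf{a}_{t+2} \sim \pi(\cdot \mid s_{t+2}, \mathbf{a}_{t+1})}
    \big[ q(s_{t+2}, \mathbf{a}_{t+1}, \mathbf{a}_{t+2}) \big] \Big]
\end{aligned}
$$

The first term that depends on $\mathbf{a}_t$ is $r(s_{t+1}, \mathbf{a}_t)$, which sits one backup deeper than the corresponding term in the turn-based recursion.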

### 3.1 Learning the state-value function off-policy

The state-value function can usually not be used in the same way as the action-value function for off-policy learning.

###### Definition 6.

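The state-value function $v^\pi_E$ for an environment $E = (S, A, \mu, p, r)$ and a policy $\pi$ is

$$v^\pi_E(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \Big[ r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \big[ v^\pi_E(s_{t+1}) \big] \Big]. \tag{7}$$

The definition shows that the expectation over the action is taken *before* the expectation over the next state. When using this identity to train a state-value estimator, we cannot simply change the action distribution to allow for off-policy learning since we have no way of resampling the next state.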

###### Lemma 2.

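In a Real-Time Markov Decision Process for the state-value function we have

$$v^\pi_{\mathit{RTMDP}(E)}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \Big[ \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_t, a_t)} \big[ v^\pi_{\mathit{RTMDP}(E)}(s_{t+1}, a_{t+1}) \big] \Big]. \tag{8}$$

Here, $(s_t, a_t, s_{t+1})$ is always a valid transition no matter what action $a_{t+1}$ is selected. Therefore, when using the real-time framework, we can use the value function for off-policy learning. Since Equation 8 is the same as Equation 5 (except for the policy inputs), we can use the state-value function where previously the action-value function was used without having to learn the dynamics of the RTMDP from data, since they have already been applied to Equation 8.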
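For intuition, the real-time state-value backup can be iterated to a fixed point on a small finite problem. The sketch below is illustrative only: the function name `rt_policy_evaluation`, the tabular arrays, and the discount factor `gamma` are assumptions of this example (the identity above is written without a discount, which is added here so that the iteration converges).

```python
import numpy as np


def rt_policy_evaluation(p, r, pi, gamma=0.9, iters=500):
    """Iterate the real-time state-value backup on a finite problem.

    p[s, a, s2]  : transition probabilities of the underlying environment
    r[s, a]      : rewards of the underlying environment
    pi[s, a, a2] : real-time policy, probability of a2 given system state (s, a)
    gamma        : discount factor (an assumption of this sketch)
    """
    v = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        # inner[s, a, s2] = sum over a2 of pi[s, a, a2] * v[s2, a2]
        inner = np.einsum("sab,tb->sat", pi, v)
        # v[s, a] = r[s, a] + gamma * sum over s2 of p[s, a, s2] * inner[s, a, s2]
        v = r + gamma * np.einsum("sat,sat->sa", p, inner)
    return v


rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum(axis=2, keepdims=True)    # normalize over next states
pi = rng.random((2, 2, 2))
pi /= pi.sum(axis=2, keepdims=True)  # normalize over next actions
r = rng.random((2, 2))
v = rt_policy_evaluation(p, r, pi)
```

Note that the next action is drawn from `pi[s, a, :]`, which does not depend on the next state. This is why a stored transition can be reused with a freshly sampled action.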

### 3.2 Partial simulation

The off-policy learning procedure described in the previous section can be applied more generally. Whenever parts of the agent-environment system are known and (temporarily) independent of the remaining system, they can be used to generate synthetic experience. More precisely, transitions with a start state $s = (w, z)$ can be generated according to the true transition kernel $\kappa(s' \mid s)$ by simulating the known part of the transition ($w \to w'$) and using a stored sample for the unknown part of the transition ($z \to z'$). This is only possible if the transition kernel factorizes as $\kappa(w', z' \mid s) = \kappa_{\text{known}}(w' \mid s)\, \kappa_{\text{unknown}}(z' \mid s)$. Hindsight Experience Replay (Andrychowicz et al., 2017) can be seen as another example of partial simulation. There, the goal part of the state evolves independently of the rest, which allows for changing the goal in hindsight. In the next section, we use the same partial simulation principle to compute the gradient of the policy loss.
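In the real-time setting, partial simulation amounts to reusing a stored environment transition while resampling the agent's part of the next system state. The following sketch illustrates this; `synthetic_transition`, the toy `policy`, and the tuple layout of the stored sample are hypothetical and chosen only for the example.

```python
import random


def synthetic_transition(stored, policy, rng):
    """Build a synthetic real-time transition from a stored environment sample.

    stored : (s, a, r, s_next), recorded under any behavior policy
    policy : callable (s, a, rng) -> next action; the known, simulated part
    """
    s, a, r, s_next = stored
    # Unknown part: the stored environment step (s, a) -> s_next is reused as is.
    # Known part: the next action is resampled from the current policy.
    a_next = policy(s, a, rng)
    return (s, a), r, (s_next, a_next)


def policy(s, a, rng):
    """Toy stochastic policy over system states (s, a); purely illustrative."""
    return -0.5 * s + 0.1 * a + rng.gauss(0.0, 0.1)


rng = random.Random(0)
stored = (1.0, 0.2, -1.0, 1.2)
x, r, x_next = synthetic_transition(stored, policy, rng)
```

Every call produces a valid transition of the system, because the stored environment step does not depend on the action that is being resampled.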

## 4 Real-Time Actor-Critic (RTAC)

Actor-Critic algorithms (Konda and Tsitsiklis, 2000) formulate the RL problem as bi-level optimization where the critic evaluates the actor as accurately as possible while the actor tries to improve its evaluation by the critic. Silver et al. (2014) showed that it is possible to reparameterize the actor evaluation and directly compute the pathwise derivative from the critic with respect to the actor parameters, thereby telling the actor how to improve. Heess et al. (2015) extended that to stochastic policies and Haarnoja et al. (2018a) further extended it to the maximum entropy objective to create Soft Actor-Critic (SAC), which RTAC is based on and compared against.

In SAC a parameterized policy $\pi$ (the actor) is optimized to minimize the KL-divergence between itself and the exponential of an approximation of the action-value function $q$ (the critic) normalized by $Z$ (where $Z$ is unknown but irrelevant to the gradient), giving rise to the policy loss

$$L^{\text{SAC}}_{E, \pi} = \mathbb{E}_{s_t \sim D}\, D_{\text{KL}}\Big( \pi(\cdot \mid s_t) \,\Big\|\, \exp\big(\tfrac{1}{\alpha}\, q(s_t, \cdot)\big) / Z(s_t) \Big) \tag{9}$$

where $D$ is a uniform distribution over the replay memory containing past states, actions and rewards, and $\alpha$ is the entropy scale. The action-value function itself is optimized to fit Equation 5 presented in the previous section (augmented with an entropy term). We can thus expect SAC to perform worse in RTMDPs.

In order to create an algorithm better suited for the real-time setting we propose to use a state-value function approximator as the critic instead, that will give rise to the same policy gradient.

###### Proposition 1.

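The following policy loss based on the state-value function

$$L^{\text{RTAC}}_{\mathit{RTMDP}(E), \pi} = \mathbb{E}_{(s_t, a_t) \sim D}\, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\, D_{\text{KL}}\Big( \pi(\cdot \mid s_t, a_t) \,\Big\|\, \exp\big(\tfrac{1}{\alpha}\, \gamma\, v(s_{t+1}, \cdot)\big) / Z(s_{t+1}) \Big) \tag{10}$$

has the same policy gradient as $L^{\text{SAC}}_{\mathit{RTMDP}(E), \pi}$, i.e.

$$\nabla_\pi L^{\text{RTAC}}_{\mathit{RTMDP}(E), \pi} = \nabla_\pi L^{\text{SAC}}_{\mathit{RTMDP}(E), \pi}. \tag{11}$$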

The value function itself is trained off-policy according to the procedure described in Section 3.1 to fit an augmented version of Equation 8, specifically

(12) |

Therefore, for the value loss, we have

(13) |

### 4.1 Merging actor and critic

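Using the state-value function as the critic has another advantage: when evaluated at the same timestep, the critic does not depend on the actor’s output anymore, and we are therefore able to use a single neural network to represent both the actor and the critic. Merging actor and critic makes it necessary to trade off between the value function and policy loss. Therefore, we introduce an additional hyperparameter $\beta$.

$$L(\theta) = \beta\, L^{\text{policy}}(\theta) + (1 - \beta)\, L^{\text{value}}(\theta) \tag{14}$$

Here $L^{\text{policy}}$ denotes the policy loss of Equation 10 and $L^{\text{value}}$ the value loss of Equation 13, both evaluated for the shared parameters $\theta$.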

Merging actor and critic could speed up learning and even improve generalization, but could also lead to greater instability. We compare RTAC with both merged and separate actor and critic networks in Section 5.

### 4.2 Stabilizing learning

Actor-Critic algorithms are known to be unstable during training. We use a number of techniques that help make training more stable. Most notably we use Pop-Art output normalization (van Hasselt et al., 2016) to normalize the value targets. This is necessary if $v$ and $\pi$ are represented using an overlapping set of parameters. Since the scale of the error gradients of the value loss is highly non-stationary, it is hard to find a good trade-off between policy and value loss ($\beta$). If $v$ and $\pi$ are separate, Pop-Art matters less, but still improves performance in both SAC and RTAC.
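The core idea of Pop-Art is to normalize targets with running statistics while rescaling the output layer so that unnormalized predictions stay unchanged. The sketch below shows this for a scalar linear output layer. It is a simplified illustration following van Hasselt et al. (2016), not the implementation used for the experiments; the class name `PopArt`, the `step_size` default, and the variance floor are assumptions of this example.

```python
import numpy as np


class PopArt:
    """Simplified Pop-Art normalization for a scalar linear output layer."""

    def __init__(self, n_features, step_size=3e-4):
        self.w = np.zeros(n_features)  # weights of the normalized output layer
        self.b = 0.0                   # bias of the normalized output layer
        self.mu = 0.0                  # running mean of the targets
        self.nu = 1.0                  # running second moment of the targets
        self.step_size = step_size     # hypothetical default, not from the paper

    @property
    def sigma(self):
        # Standard deviation from the running moments, floored for stability.
        return float(np.sqrt(max(self.nu - self.mu ** 2, 1e-8)))

    def normalized(self, features):
        return features @ self.w + self.b

    def unnormalized(self, features):
        return self.sigma * self.normalized(features) + self.mu

    def normalize_target(self, target):
        return (target - self.mu) / self.sigma

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        # Update the running statistics of the targets.
        self.mu = (1 - self.step_size) * self.mu + self.step_size * target
        self.nu = (1 - self.step_size) * self.nu + self.step_size * target ** 2
        # Rescale the layer so that unnormalized outputs are preserved.
        self.w = self.w * old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma


layer = PopArt(n_features=3, step_size=0.1)
layer.w = np.array([0.5, -1.0, 2.0])
layer.b = 0.3
features = np.array([1.0, 2.0, 3.0])
before = layer.unnormalized(features)
layer.update_stats(100.0)  # a target on a very different scale
after = layer.unnormalized(features)
```

Because the statistics update leaves `unnormalized` predictions untouched, a large shift in target scale changes the normalization without disturbing what the network has already learned.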

Another difficulty is the recursive value function targets. Since we try to maximize the value function, overestimation errors in the value function approximator are amplified and recursively used as target values in the following optimization steps. As introduced by Fujimoto et al. (2018) and like SAC, we will use two value function approximators and take their minimum when computing the target values to reduce value overestimation, i.e. $\bar v = \min_{i \in \{1, 2\}} v_i$.

## 5 Experiments

We compare Real-Time Actor-Critic to Soft Actor-Critic (Haarnoja et al., 2018a) on several OpenAI-Gym/MuJoCo benchmark environments (Brockman et al., 2016; Todorov et al., 2012) as well as on two Avenue autonomous driving environments with visual observations (Ibrahim et al., 2019).

The SAC agents used for the results here include both an action-value and a state-value function approximator and use a fixed entropy scale (as in Haarnoja et al. (2018a)). In the code accompanying this paper we dropped the state-value function approximator since it had no impact on the results (as done and observed in Haarnoja et al. (2018b)). For a comparison to other algorithms such as DDPG, PPO and TD3, also see Haarnoja et al. (2018a, b).

To make the comparison between the two algorithms as fair as possible, we also use output normalization in SAC which improves performance on all tasks (see Figure 9 in Appendix A for a comparison between normalized and unnormalized SAC). Both SAC and RTAC are performing a single optimization step at every timestep in the environment starting after the first timesteps of collecting experience based on the initial random policy. The hyperparameters used can be found in Table 1.

### 5.1 SAC in Real-Time Markov Decision Processes

When comparing the return trends of SAC in turn-based environments $E$ against SAC in real-time environments $\mathit{RTMDP}(E)$, the performance of SAC deteriorates. This seems to confirm our hypothesis that having to learn the dynamics of the augmented environment from data impedes action-value function approximation (as hypothesized in Section 3).

### 5.2 RTAC and SAC on MuJoCo in real time

Figure 6 shows a comparison between RTAC and SAC in real-time versions of the benchmark environments. We can see that RTAC learns much faster and achieves higher returns than SAC in $\mathit{RTMDP}(E)$. This makes sense as it does not have to learn from data the "pass-through" behavior of the RTMDP. We show RTAC with separate neural networks for the policy and value components, showing that a big part of RTAC’s advantage over SAC is its value function update. However, the plots suggest that merging the policy and value function networks further improves RTAC’s performance. Note that RTAC is always in $\mathit{RTMDP}(E)$, therefore we do not explicitly state it again.

RTAC even outperforms SAC in $E$ (when SAC is allowed to act without real-time constraints) in four out of six environments, including the two hardest ones, Ant and Humanoid, which have the largest state and action spaces (Figure 11). We theorize this is possible due to the merged actor and critic networks used in RTAC. It is important to note, however, that for RTAC with merged actor and critic networks output normalization is critical (Figure 12).

### 5.3 Autonomous driving task

In addition to the MuJoCo environments, we have also tested RTAC and SAC on an autonomous driving task using the Avenue simulator (Ibrahim et al., 2019). Avenue is a game-engine-based simulator where the agent controls a car. In the task shown here, the agent has to stay on the road and possibly steer around pedestrians. Each observation consists of a single image (256x64 grayscale pixels) and the car’s velocity. The actions are continuous and two dimensional, representing steering angle and gas-brake. The agent is rewarded proportionally to the car’s velocity in the direction of the road and negatively rewarded when making contact with a pedestrian or another car. In addition, episodes are terminated when leaving the road or colliding with any objects or pedestrians.

The hyperparameters used for the autonomous driving task are largely the same as for the MuJoCo tasks, however we used a lower entropy reward scale () and lower learning rate (). We used convolutional neural networks with four layers of convolutions with filter sizes , strides and channels. The convolutional layers are followed by two fully connected layers with units each.

## 6 Related work

Travnik et al. (2018) noticed that the traditional MDP framework is ill suited for real-time problems. Unlike our paper, however, their work neither proposes a rigorous alternative framework nor provides a theoretical analysis.

Firoiu et al. (2018) apply a multi-step action delay to level the playing field between humans and artificial agents on the ALE (Atari) benchmark. However, they do not address the problems arising from the turn-based MDP framework, nor do they recognize the significance and consequences of the one-step action delay.

Similar to RTAC, NAF (Gu et al., 2016) is able to do continuous control with a single neural network. However, it requires the action-value function to be quadratic in the action (and thus possible to optimize in closed form). This assumption is quite restrictive, and NAF could not outperform more general methods such as DDPG.

In SVG(1) (Heess et al., 2015) a differentiable transition model is used to compute the path-wise derivative of the value function one timestep after the action selection. This is similar to what RTAC is doing when using the value function to compute the policy gradient. However, in RTAC, we use the actual differentiable dynamics of the RTMDP, i.e. "passing through" the action to the next state, and therefore we do not need to approximate the transition dynamics. At the same time, transitions for the underlying environment are not modelled at all and are instead sampled, which is only possible because the actions in an RTMDP only start to influence the underlying environment at the next timestep.

## 7 Discussion

We have introduced a new framework for Reinforcement Learning, RTRL, in which agent and environment step in unison to create a sequence of state-action pairs. We connected RTRL to the conventional Reinforcement Learning framework through the RTMDP and investigated its effects in theory and practice. We predicted and confirmed experimentally that conventional off-policy algorithms would perform worse in real-time environments and then proposed a new actor-critic algorithm, RTAC, that not only avoids the problems of conventional off-policy methods with real-time interaction but also allows us to merge actor and critic which comes with an additional gain in performance. We showed that RTAC outperforms SAC on both a standard, low dimensional continuous control benchmark, as well as a high dimensional autonomous driving task.

#### Acknowledgments

We would like to thank Cyril Ibrahim for building and helping us with the Avenue simulator; Craig Quiter and Sherjil Ozair for insightful discussions about agent-environment interaction; Alex Piché, Scott Fujimoto, Bhairav Metha and Jhelum Chakravorty, for reading drafts of this paper and finally Jose Gallego, Olexa Bilaniuk and many others at Mila that helped us on countless occasions online and offline.

This work was completed during a part-time internship at Element AI and was supported by the Open Philanthropy Project.

## References

- Andrychowicz et al. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
- Bellman (1957). A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684.
- Brockman et al. (2016). OpenAI Gym. arXiv:1606.01540.
- Firoiu et al. (2018). At human speed: deep reinforcement learning with action delay. CoRR abs/1810.07286.
- Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
- Gu et al. (2016). Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838.
- Haarnoja et al. (2018a). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Haarnoja et al. (2018b). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
- Heess et al. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952.
- Hwangbo et al. (2019). Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26), pp. eaau5872.
- Hwangbo et al. (2017). Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2 (4), pp. 2096–2103.
- Ibrahim et al. (2019). Avenue. GitHub. https://github.com/elementai/avenue
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Konda and Tsitsiklis (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.
- Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- Schulman et al. (2015). Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
- Silver et al. (2014). Deterministic policy gradient algorithms. In ICML.
- Silver et al. (2017). Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
- Tallec et al. (2019). Making deep Q-learning methods robust to time discretization. arXiv preprint arXiv:1901.09732.
- Tesauro (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6 (2), pp. 215–219.
- Todorov et al. (2012). MuJoCo: a physics engine for model-based control. In IROS, pp. 5026–5033. ISBN 978-1-4673-1737-5.
- Travnik et al. (2018). Reactive reinforcement learning in asynchronous environments. Frontiers in Robotics and AI 5, pp. 79. ISSN 2296-9144.
- van Hasselt et al. (2016). Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295.
- Walsh et al. (2008). Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems 18, pp. 83–105.

## Appendix A Additional Experiments

## Appendix B Hyperparameters

Name | RTAC | SAC |
---|---|---|
optimizer | Adam | Adam (Kingma and Ba, 2014) |
learning rate | | |
discount ($\gamma$) | | |
hidden layers | | |
units per layer | | |
samples per minibatch | | |
target smoothing coefficient ($\tau$) | | |
gradient steps / environment steps | | |
reward scale | | |
entropy scale ($\alpha$) | | |
actor-critic loss factor ($\beta$) | - | |
Pop-Art alpha | - | |
start training after | steps |

## Appendix C Proofs

See Theorem 1.

###### Proof.

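For any environment $E = (S, A, \mu, p, r)$, we want to show that the two above MRPs are the same. Let $\mathbf{p}$ and $\mathbf{r}$ denote the transition distribution and reward function of $\mathit{RTMDP}(E)$. Per Def. 2 and 4, for $\mathit{TBMRP}(\mathit{RTMDP}(E), \pi)$ we have

(1) state space $S \times A$,

(2) initial distribution $\mu(s_0)\, \delta(a_0 - c)$,

(3) transition kernel $\kappa(s_{t+1}, a_{t+1} \mid s_t, a_t) = \int_A \mathbf{p}(s_{t+1}, a_{t+1} \mid s_t, a_t, a)\, \pi(a \mid s_t, a_t)\, da$,

(4) state-reward function $\bar r(s_t, a_t) = \int_A \mathbf{r}(s_t, a_t, a)\, \pi(a \mid s_t, a_t)\, da$.

The transition kernel, using $\mathbf{p}(s_{t+1}, a_{t+1} \mid s_t, a_t, a) = p(s_{t+1} \mid s_t, a_t)\, \delta(a_{t+1} - a)$ and the definition of the Dirac delta function $\delta$, can be simplified to

$$\kappa(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \int_A \delta(a_{t+1} - a)\, \pi(a \mid s_t, a_t)\, da = p(s_{t+1} \mid s_t, a_t)\, \pi(a_{t+1} \mid s_t, a_t). \tag{15}$$

The state-reward function, using $\mathbf{r}(s_t, a_t, a) = r(s_t, a_t)$, can be simplified to

$$\bar r(s_t, a_t) = r(s_t, a_t) \int_A \pi(a \mid s_t, a_t)\, da = r(s_t, a_t). \tag{16}$$

The elements above match those of $\mathit{RTMRP}(E, \pi)$ in Def. 3. ∎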

See Theorem 2.

###### Proof.

Given MDP , we have with

(1) state space | (17) | |||

(2) initial distribution | (18) | |||

(3) transition kernel | (19) | |||

(20) | ||||

(4) state-reward function | (21) |

We can construct , a sub-MRP with interval . Since we always skip the step in which , we only have to define the transition kernel for , i.e.

(22) | ||||

(23) | ||||

(24) | ||||

(25) |

For the state-reward function we have (again only considering )

(26) | ||||

(27) | ||||

(28) | ||||

(29) |

The sub-MRP is already very similar to except for having a larger state-space. To get rid of the and state components, we reduce with a state transformation . The reduced MRP has

(1) state space | (30) | |||

(2) initial distribution | (31) | |||

(3) transition kernel | (32) | |||

(33) | ||||

(34) | ||||

(4) state-reward function | (35) | |||

(36) |

which is exactly . ∎

See Proposition 1.

###### Proof.

As shown in Haarnoja et al. (2018a), Equation 9 can be reparameterized to obtain the policy gradient, which, applied in a RTMDP, yields

(37) |

and reparameterizing Equation 10 yields

(38) |

where is a function mapping from state and noise to an action distributed according to . This leaves us to show that

(39) |

which follows from the definition of the soft action-value function and simplifying quantities defined in the RTMDP. ∎

## Appendix D Definitions

###### Definition 7.

A Turn-Based Markov Decision Process augments another Markov Decision Process , such that

(1) state space | |||

(2) action space | |||

(3) initial state distribution | |||

(4) transition distribution | |||

(5) reward function |

###### Definition 8.

is a sub-MRP of if its states are sub-sampled with interval and rewards are summed over each interval, i.e. for almost all

(40) |

###### Definition 9.

A MRP is a reduction of if there is a state transformation that neither affects the evolution of states nor the rewards, i.e.

(1) state space | (41) | |||

(2) initial distribution | (42) | |||

(3) transition kernel | (43) | |||

(4) state-reward function | (44) |

###### Definition 10.

A MRP contains another MRP (we write ) if works at a higher frequency and has a richer state than but behaves otherwise identically. More precisely,

(45) |

###### Definition 11.

The -step transition function of a MRP is

(46) |

###### Definition 12.

The n-step value function of a MRP is

(47) |