Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs


We present two elegant solutions for modeling continuous-time dynamics, in a novel model-based reinforcement learning (RL) framework for semi-Markov decision processes (SMDPs) using neural ordinary differential equations (ODEs). Our models accurately characterize continuous-time dynamics and enable us to develop high-performing policies using a small amount of data. We also develop a model-based approach for optimizing time schedules to reduce interaction rates with the environment while maintaining near-optimal performance, which is not possible for model-free methods. We experimentally demonstrate the efficacy of our methods across various continuous-time domains.

1 Introduction

Algorithms for deep reinforcement learning (RL) have led to major advances in many applications ranging from robotics (Gu et al., 2017) to Atari games (Mnih et al., 2015). Most of these algorithms formulate the problem in discrete time and assume observations are available at every step. However, many real-world sequential decision-making problems operate in continuous time. For instance, control problems such as robotic manipulation are generally governed by systems of differential equations. In healthcare, patient observations often consist of irregularly sampled time series: more measurements are taken when patients are sicker or when clinicians suspect problems, and clinical variables are usually observed on different time scales.

Unfortunately, the problem of learning and acting in continuous-time environments has largely been passed over by the recent advances of deep RL. Previous methods using semi-Markov decision processes (SMDPs) (Howard, 1964)—including Q-learning (Bradtke and Duff, 1995), advantage updating (Baird, 1994), policy gradient (Munos, 2006), and actor-critic (Doya, 2000)—extend the standard RL framework to continuous time, but all use relatively simple linear function approximators. Furthermore, as model-free methods, they often require large amounts of training data. Thus, rather than attempt to handle continuous time directly, practitioners often resort to discretizing time into evenly spaced intervals and applying standard RL algorithms. However, this heuristic approach loses information about the dynamics if the discretization is too coarse, and results in overly long time horizons if the discretization is too fine.

In this paper, we take a model-based approach to continuous-time RL, modeling the dynamics via neural ordinary differential equations (ODEs) (Chen et al., 2018). Not only is this more sample-efficient than model-free approaches, but it allows us to efficiently adapt policies learned using one schedule of interactions with the environment for another. Our approach also allows for optimizing the measurement schedules to minimize interaction with the environment while maintaining the near-optimal performance that would be achieved by constant intervention.

Specifically, to build flexible models for continuous-time, model-based RL, we first introduce ways to incorporate action and time into the neural ODE work of (Chen et al., 2018; Rubanova et al., 2019). We present two solutions, ODE-RNN (based on a recurrent architecture) and Latent-ODE (based on an encoder-decoder architecture), both of which are significantly more robust than current approaches for continuous-time dynamics. Because these models include a hidden state, they can handle partially observable environments as well as fully-observed environments. Next, we develop a unified framework that can be used to learn both the state transition and the interval timing for the associated SMDP. Not only does our model-based approach outperform baselines in several standard tasks, we demonstrate the above capabilities which are not possible with current model-free methods.

2 Related work

There has been a large body of work on continuous-time reinforcement learning, much of it based on the RL framework of SMDPs (Bradtke and Duff, 1995; Parr and Russell, 1998). Methods with linear function approximators include Q-learning (Bradtke and Duff, 1995), advantage updating (Baird, 1994), policy gradient (Munos, 2006), and actor-critic (Doya, 2000). Classical control techniques such as the linear–quadratic regulator (LQR) (Kwakernaak and Sivan, 1972) also operate in continuous time using a linear model class that is likely too simplistic and restrictive for many real-world scenarios. We use a more flexible model class for learning continuous-time dynamics models to tackle a wider range of settings.

Other works have considered SMDPs in the context of varying time discretizations. Options and hierarchical RL (Sutton et al., 1999; Barto and Mahadevan, 2003) contain temporally extended actions and meta-actions. More recently, Sharma et al. (2017) connected action repetition of deep RL agents with SMDPs, and Tallec et al. (2019) identified the robustness of Q-learning under different time discretizations; however, the transition times were still evenly spaced. In contrast, our work focuses on truly continuous-time environments with irregular, discrete intervention points.

More generally, discrete-time, model-based RL has offered a sample-efficient approach (Kaelbling et al., 1996) for real-world sequential decision-making problems. Recently, RNN variants have become popular black-box methods for summarizing long-term dependencies needed for prediction. RNN-based agents have been used to play video games (Oh et al., 2015; Chiappa et al., 2017); Ha and Schmidhuber (2018) trained agents in a “dreamland” built using RNNs; Igl et al. (2018) utilized RNNs to characterize belief states in situations with partially observable dynamics; and Neitz et al. (2018) trained a recurrent dynamics model skipping observations adaptively to avoid poor local optima. To our knowledge, no prior work in model-based RL focuses on modeling continuous-time dynamics and planning with irregular observation and action times.

Neural ODEs (Chen et al., 2018) have been used to tackle irregular time series. Rubanova et al. (2019); De Brouwer et al. (2019) used a neural ODE to update the hidden state of recurrent cells; Chen et al. (2018) defined latent variables for observations as the solution to an ODE; Kidger et al. (2020) adjusted trajectories based on subsequent observations with controlled differential equations. To our knowledge, ours is the first to extend the applicability of neural ODEs to RL.

3 Background and notation

Semi-Markov decision processes.

A semi-Markov decision process (SMDP) is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the transition time space and $\gamma$ is the discount factor. We assume the environment has transition dynamics $p(s', \tau \mid s, a)$ unknown to the agent, where $\tau \in \mathcal{T}$ represents the time between taking action $a$ in observed state $s$ and arriving in the next state $s'$, at which point the agent can take a new action. Thus, we assume no access to any intermediate observations. In general, we are given the reward function $r(s, a, s')$ for the reward after observing $s'$. However, in some cases the reward may also depend on $\tau$, i.e. $r(s, a, s', \tau)$, for instance if the cost of a system involves how much time has elapsed since the last intervention. The goal throughout is to learn a policy $\pi$ maximizing the long-term expected reward $\mathbb{E}_\pi\big[\sum_i \gamma^{t_i} r_i\big]$ with a finite horizon $T$, where $t_i = \sum_{j \leq i} \tau_j \leq T$.

While the standard SMDP model above assumes full observability, in our models we will introduce a latent variable $h$ summarizing the history until right before the most recent state, and learn a transition function over latents $p(h' \mid h, s, a, \tau)$, treating the state $s$ as an emission of the latent $h$. Introducing this latent will allow us to consider situations in which: (a) the state can be compressed, and (b) we only receive partial observations and not the complete state, with one coherent model.

Neural ordinary differential equations.

Neural ODEs define a latent state $h(t)$ as the solution to an ODE initial value problem using a time-invariant neural network $f_\theta$:

$$\frac{dh(t)}{dt} = f_\theta(h(t)), \qquad h(t_0) = h_0. \tag{1}$$
Utilizing an off-the-shelf numerical integrator, we can solve the ODE for $h(t)$ at any desired time. In this work, we consider two different neural ODE models as starting points. First, a standard RNN can be transformed to an ODE-RNN (Rubanova et al., 2019):

$$h'_i = \mathrm{ODESolve}(f_\theta, h_{i-1}, (t_{i-1}, t_i)), \qquad h_i = \mathrm{RNNCell}(h'_i, s_i), \qquad \hat{s}_i = o(h_i), \tag{2}$$
where $\hat{s}_i$ is the predicted state and $o$ is the emission (decoding) function. An alternate approach, based on an encoder-decoder structure (Sutskever et al., 2014), is the Latent-ODE (Chen et al., 2018):

$$h_0 \sim q_\phi(h_0 \mid s_{0:N}), \qquad (h_1, \ldots, h_N) = \mathrm{ODESolve}(f_\theta, h_0, (t_1, \ldots, t_N)), \qquad \hat{s}_i = o(h_i), \tag{3}$$
where $q_\phi$ is an RNN encoder and the latent state is defined by an ODE. The Latent-ODE is trained as a variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014).

The ODE-RNN allows online predictions and is natural for sequential decision-making, though the effect of the ODE network is hard to interpret in the RNN updates. On the other hand, the Latent-ODE explicitly models continuous-time dynamics using an ODE, along with a measure of uncertainty from the posterior over $h_0$, but the solution to this ODE is determined entirely by the initial latent state.
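To make the ODE-RNN update concrete, the sketch below runs the latent state through a fixed-step Euler integrator between irregularly spaced observations and then folds each observation in with an RNN-style cell. Everything here (the dynamics function, the cell, the dimensions) is an illustrative stand-in, not the paper's implementation, which would use learned networks and an adaptive solver.

```python
import math

def euler_odesolve(f, h, t0, t1, n_steps=10):
    """Integrate dh/dt = f(h) from t0 to t1 with fixed-step Euler
    (a simple stand-in for an adaptive ODE solver)."""
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = [hi + dt * fi for hi, fi in zip(h, f(h))]
    return h

def f_theta(h):
    """Toy time-invariant dynamics network: a fixed linear map (illustrative)."""
    return [-0.5 * h[0] + 0.1 * h[1], -0.1 * h[0] - 0.5 * h[1]]

def rnn_cell(h, s):
    """Toy RNN cell combining the evolved latent with the new observation."""
    return [math.tanh(hi + si) for hi, si in zip(h, s)]

def ode_rnn_step(h_prev, s_i, t_prev, t_i):
    # Evolve the latent through the (possibly irregular) gap, then absorb s_i.
    h_evolved = euler_odesolve(f_theta, h_prev, t_prev, t_i)
    return rnn_cell(h_evolved, s_i)

# Irregularly spaced observations: the latent keeps evolving between them.
h = [0.0, 0.0]
times = [0.0, 0.3, 1.1, 1.2]
obs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
for (t0, t1), s in zip(zip(times, times[1:]), obs):
    h = ode_rnn_step(h, s, t0, t1)
```

The key difference from a plain RNN is that the gap `t1 - t0` changes how far the latent drifts before the next observation arrives, so unequal gaps produce genuinely different updates.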

Recurrent environment simulator.

None of the above neural ODE models contain explicit actions. We build on the recurrent simulator of Chiappa et al. (2017), which used the following structure for incorporating the effects of actions:

$$h_i = \mathrm{RNNCell}(h_{i-1}, [x_{i-1}, a_{i-1}]), \qquad \hat{s}_i = o(h_i), \tag{4}$$
where $x_i$ denotes either the observed state $s_i$ or the predicted state $\hat{s}_i$. We can use either $s_i$ or $\hat{s}_i$ during training of the recurrent simulator, but only $\hat{s}_i$ is available at inference time. In this work, we generally use the actual observations during training, i.e. $x_i = s_i$ (known as the teacher forcing strategy), but we also find that scheduled sampling (Bengio et al., 2015), which switches between choosing the previous observation and the prediction, improves performance on some tasks.
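The teacher-forcing versus scheduled-sampling choice above can be sketched as a per-step coin flip over which input the simulator consumes. The `sample_prob` schedule below is a placeholder, not the authors' setting:

```python
import random

def train_step_inputs(observations, predictions, sample_prob):
    """Choose, per step, whether the simulator consumes the true observation
    (teacher forcing) or its own previous prediction (scheduled sampling)."""
    inputs = []
    for s_true, s_pred in zip(observations, predictions):
        use_pred = random.random() < sample_prob
        inputs.append(s_pred if use_pred else s_true)
    return inputs

random.seed(0)
obs   = [[1.0], [2.0], [3.0]]
preds = [[1.1], [1.9], [3.2]]
forced = train_step_inputs(obs, preds, sample_prob=0.0)  # pure teacher forcing
mixed  = train_step_inputs(obs, preds, sample_prob=0.5)  # scheduled sampling
```

In practice `sample_prob` is usually annealed upward over training so the model gradually learns to consume its own predictions, matching the inference-time regime.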

4 Approach

In this section, we first describe how to construct ODE-based dynamics models for model-based RL that account for both actions and time intervals, overcoming shortcomings of existing neural ODEs. Then, we describe how to use these models for prediction in the original environment, as well as how to transfer to environments with new time schedules and how to optimize environment interaction.

4.1 Model definition and learning

We decompose the transition dynamics into two parts, one to predict the time $\tau$ until the next action with corresponding observation, and one to predict the next latent state $h'$:

$$p(h', \tau \mid h, s, a) = p(h' \mid h, s, a, \tau)\; p(\tau \mid h, s, a). \tag{5}$$
Using the recurrent environment simulator from Equation 4, we can incorporate an action into an ODE-RNN or a Latent-ODE to approximate $p(h' \mid h, s, a, \tau)$ (referred to as the state transition). We can also learn and generate transition times using another neural network based on the current latent state, action and observed state, which approximates $p(\tau \mid h, s, a)$ (referred to as the interval timing). We model them separately yet optimize jointly in a multi-task learning fashion. Specifically, we propose an action-conditioned and time-dependent ODE-RNN and Latent-ODE for approximating the transition dynamics $p(s', \tau \mid s, a)$.


Combining Equations 2 and 4, we obtain the following model:

$$h'_i = \mathrm{ODESolve}(f_\theta, h_{i-1}, (t_{i-1}, t_i)), \qquad h_i = \mathrm{RNNCell}(h'_i, [x_{i-1}, a_{i-1}, k_{i-1}]), \qquad \hat{s}_i = o(h_i), \tag{6}$$

where $k_i$ denotes either the observed time interval $\tau_i$ or the predicted time interval $\hat{\tau}_i$, similar to $x_i$.


Given the parameters of the underlying dynamics, the entire latent trajectory of the vanilla Latent-ODE is determined by the initial condition $h_0$. However, to be useful as a dynamics model for RL, the Latent-ODE should allow for actions to modify the latent state. To do this, at every time $t_i$ we adjust the previous latent state $h_{t_i}$ to obtain a new latent state $h^+_{t_i}$, incorporating the new actions and observations. In particular, we transform the vanilla Latent-ODE using Equations 3 and 4:

$$h^+_{t_i} = w(h_{t_i}, a_i, x_i), \qquad h_{t_{i+1}} = \mathrm{ODESolve}(f_\theta, h^+_{t_i}, (t_i, t_{i+1})), \qquad \hat{s}_{i+1} = o(h_{t_{i+1}}), \tag{7}$$

where $w$ is a function incorporating the action and observation (or prediction) into the ODE solution. This method of incorporating actions into neural ODEs is similar to neural controlled differential equations (Kidger et al., 2020). We find that a linear transformation works well in practice for $w$, i.e.,

$$w(h, a, x) = W\,[h; a; x] + b, \tag{8}$$

where $[\cdot\,;\cdot]$ denotes vector concatenation. The graphical model of the Latent-ODE is shown in Figure 1.
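The linear adjustment in Equation 8 is just a matrix applied to the concatenated latent, action, and observation. A minimal sketch, with random placeholder weights and illustrative dimensions:

```python
import random

random.seed(1)
H, A, X = 4, 2, 3          # latent, action, observation dimensions (illustrative)
D = H + A + X
W = [[random.gauss(0.0, 0.1) for _ in range(D)] for _ in range(H)]
b = [0.0] * H

def adjust_latent(h, a, x):
    """h_plus = W [h; a; x] + b: fold the new action and observation into the
    ODE state before the next ODESolve segment starts from h_plus."""
    v = h + a + x              # vector concatenation
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi
            for row, bi in zip(W, b)]

h_plus = adjust_latent([0.1] * H, [1.0, -1.0], [0.5, 0.5, 0.5])
```

In a trained model `W` and `b` would be learned jointly with the ODE dynamics, so each intervention "kicks" the trajectory onto a new solution branch of the same ODE.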

The ODE-RNN and Latent ODE intrinsically differ in the main entity they use for modeling continuous-time dynamics. The ODE-RNN models the transition between latent states using a recurrent unit, whereas the Latent-ODE parameterizes the dynamics with an ODE network directly.

Figure 1: The graphical representation of the action-conditioned and time-dependent Latent-ODE. The blue dashed arrow represents the emission function. The orange dashed arrow represents the (latent) policy. The green dashed arrow represents the prediction of interval timing. The gray double-sided arrow represents the selection between observation and prediction. Note that the encoder is only used in model training.

Training objective.

We assume that we have a collection of variable-length trajectories $\{(s_0, a_0, \tau_0, s_1, a_1, \tau_1, \ldots)\}$. We optimize the overall objective in Equation 9 using mini-batch stochastic gradient descent:

$$\mathcal{L} = \mathcal{L}_s + \lambda\, \mathcal{L}_\tau, \tag{9}$$

where $\mathcal{L}_s$ is a loss for prediction of state transitions, $\mathcal{L}_\tau$ is a loss for prediction of interval timings, and $\lambda$ is a hyperparameter that trades off between the two. Based on the complexity or importance of predicting state transitions and interval timings, we can emphasize either one by adjusting $\lambda$.
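The multi-task objective in Equation 9 can be sketched as a weighted sum of a state-prediction loss and an interval-timing loss. Here MSE and cross entropy over a toy batch stand in for the paper's losses; the numbers are illustrative:

```python
import math

def mse(states, preds):
    """State-transition loss: mean squared prediction error."""
    return sum((s - p) ** 2 for s, p in zip(states, preds)) / len(states)

def cross_entropy(true_classes, pred_probs):
    """Interval-timing loss: pred_probs[i][c] is the predicted probability
    that interval i falls in class c."""
    return -sum(math.log(p[c]) for c, p in zip(true_classes, pred_probs)) / len(true_classes)

def joint_loss(states, state_preds, taus, tau_probs, lam):
    """L = L_s + lam * L_tau (Equation 9): lam trades off timing vs. state prediction."""
    return mse(states, state_preds) + lam * cross_entropy(taus, tau_probs)

loss = joint_loss(
    states=[1.0, 2.0], state_preds=[1.1, 1.8],
    taus=[0, 1], tau_probs=[[0.9, 0.1], [0.2, 0.8]],
    lam=0.5,
)
```

Raising `lam` shifts gradient signal toward getting the timing distribution right, which matters when the schedule is harder to predict than the states.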

For recurrent models, such as the ODE-RNN, $\mathcal{L}_s$ is simply the mean squared error (MSE); for encoder-decoder models, such as the Latent-ODE, it is the negative evidence lower bound (ELBO):

$$\mathcal{L}_s = \sum_i \lVert s_i - \hat{s}_i \rVert_2^2 \quad \text{or} \quad \mathcal{L}_s = -\,\mathbb{E}_{q_\phi(h_0 \mid s_{0:N})}\!\left[\log p_\theta(s_{1:N} \mid h_0)\right] + \mathrm{KL}\!\left(q_\phi(h_0 \mid s_{0:N})\,\|\,p(h_0)\right). \tag{10}$$
For the loss for interval timing (Equation 11), we use cross entropy for a small number of discrete $\tau$; otherwise, MSE can be used for continuous $\tau$:

$$\mathcal{L}_\tau = -\sum_i \sum_{c=1}^{C} y_{i,c} \log p_{i,c}, \tag{11}$$

where $C$ is the number of classes of time interval, $y_{i,c}$ is the binary indicator of whether class $c$ is the correct classification for $\tau_i$, and $p_{i,c}$ is the predicted probability that $\tau_i$ is of class $c$.

4.2 Planning and learning

Now that we have models and procedures for learning them, we move on to the question of identifying an optimal policy. With partial observations, the introduced latent state $h$ provides a representation encoding previously seen information (Ha and Schmidhuber, 2018) and we construct a latent policy $\pi(a \mid h)$ conditioned on $h$; otherwise the environment is fully observable and we construct a policy $\pi(a \mid s)$. In general, we model the action-value function $Q(s, a)$ (or $Q(h, a)$) with continuous-time Q-learning (Bradtke and Duff, 1995; Sutton et al., 1999) for SMDPs, which works the unequally-sized time gap $\tau$ into the discount factor $\gamma$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma^{\tau} \max_{a'} Q(s', a') - Q(s, a) \right]. \tag{12}$$
We construct the policy with a deep Q-network (DQN) (Mnih et al., 2015) for discrete actions and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) for continuous actions. We perform efficient planning with dynamics models to learn the policy $\pi$, which is detailed in Section 5. The exact method for planning is orthogonal to the use of our models and framework.
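The distinctive piece of the SMDP Q-update is that the elapsed gap $\tau$ enters through the discount $\gamma^\tau$. A tabular sketch of Equation 12 (the paper itself uses DQN/DDPG function approximators, so this is for intuition only):

```python
def smdp_q_update(Q, s, a, r, s_next, tau, actions, gamma=0.99, alpha=0.1):
    """Tabular SMDP Q-learning: a transition lasting tau time units
    discounts the bootstrapped value by gamma**tau."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma ** tau * best_next - q_sa)

actions = [0, 1]
Q = {("s1", 0): 5.0}  # seed the successor state with some value
# Same reward and successor, but different elapsed times:
smdp_q_update(Q, "s0", 0, 1.0, "s1", tau=1, actions=actions)
smdp_q_update(Q, "s0", 1, 1.0, "s1", tau=10, actions=actions)
```

The slow transition (`tau=10`) ends up with a lower value than the fast one (`tau=1`) despite identical rewards, which is exactly how the SMDP formulation penalizes time spent in transit.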

Transferring to environments with different time schedules.

Our model-based approach allows us to adapt to changes in time interval settings: once we have learned the underlying state transition dynamics from a particular time schedule, the model can be used to find a policy for another environment with a different time schedule (either irregular or regular). The model generalizes across interval times only if it truly learns how the system changes over time in the continuous-time environment.

Interpolating rewards and optimizing interval times.

In addition to adapting to new time schedules, we can also optimize a state-specific sampling rate to maximize cumulative rewards while minimizing interactions. For example, in Section 5.2, we will demonstrate how we can reduce the number of times a (simulated) HIV patient must come into the clinic for measurement and treatment adjustment while still staying healthy. However, this approach may not work well in situations where constant adjustments on small time scales can hurt performance (e.g., Atari games use frameskipping to avoid flashy animations).

When optimizing interval times, we assume that the intervals are discrete, $\tau \in \{1, \ldots, M\}$, and that we only have access to the immediate reward function $r(s, a)$. We can optimize the interval times to decrease the amount of interaction with the environment while achieving near-optimal performance, obtained by maximal interaction ($\tau = 1$). Specifically, assuming we always take an action for each of the $\tau$ steps in the interval starting from the state $s_t$, we select the optimal $\tau^*$ based on the estimated $\tau$-step-ahead value:

$$\tau^* = \arg\max_{\tau \in \{1, \ldots, M\}} \left[ \sum_{j=0}^{\tau-1} \gamma^{j}\, r(\bar{s}_{t+j}, \bar{a}_{t+j}) + \gamma^{\tau}\, Q(\bar{s}_{t+\tau}, \bar{a}_{t+\tau}) \right], \tag{13}$$

where $\bar{s}$ is the simulated state from the model ($\bar{s}_t = s_t$), $\bar{a} = \pi(\bar{s})$ is the optimal action at state $\bar{s}$, and $Q$ is the action-value function from the policy. We interpolate intermediate rewards using the dynamics model—by varying $\tau$, we can simulate which states would be passed through and what the cumulative reward would be—whereas this is not possible for model-free methods, as we have no access to in-between observations. In this way, agents learn to skip situations in which no change of action is needed and minimize environment interventions. The full procedure can be found in Appendix A.1. Note that Equation 13 can be easily extended to continuous interval times if they are lower-bounded, i.e., $\tau \in [\tau_{\min}, M]$ with $\tau_{\min} > 0$; we leave this to future work.
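The interval selection in Equation 13 amounts to rolling the learned model forward while acting every step and picking the gap whose interpolated return plus bootstrapped value is largest. A sketch with toy stand-ins for the dynamics model, policy, reward, and value function (none of these are the paper's components):

```python
def best_interval(s, step_model, policy, reward, value, M, gamma=0.99):
    """argmax over tau in {1..M} of
       sum_{j=0}^{tau-1} gamma^j r(s_j, a_j) + gamma^tau Q(s_tau, a_tau),
    where s_0 = s and states are simulated with the learned model."""
    best_tau, best_score = 1, float("-inf")
    ret, disc, sim = 0.0, 1.0, s
    for tau in range(1, M + 1):
        a = policy(sim)
        ret += disc * reward(sim, a)   # interpolated intermediate reward
        sim = step_model(sim, a)       # simulate one elementary step
        disc *= gamma
        score = ret + disc * value(sim, policy(sim))
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

# Toy setup: the state decays toward 0; reward and value just read the state.
tau_star = best_interval(
    s=1.0,
    step_model=lambda s, a: 0.9 * s,
    policy=lambda s: 0,
    reward=lambda s, a: s,
    value=lambda s, a: 10.0 * s,
    M=5,
)
```

With this toy value function the bootstrap dominates, so waiting never pays and `tau_star` is 1; shrinking the value term makes longer skips preferable, mirroring how the agent learns to skip states where no action change is needed.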

5 Experiments

We evaluate our ODE-based models across four continuous-time domains. We show our models characterize continuous-time dynamics more accurately and allow us to find a good policy with less data. We also demonstrate capabilities of our model-based methods that are not possible for model-free methods.

5.1 Experimental setup


We provide demonstrations on three simpler domains—windy gridworld (Sutton and Barto, 2018), acrobot (Sutton, 1996), and HIV (Adams et al., 2004)—and three Mujoco (Todorov et al., 2012) locomotion tasks—Swimmer, Hopper and HalfCheetah—interfaced through OpenAI Gym (Brockman et al., 2016). Unless otherwise stated, the state space of all tasks is fully observable, and we are given the immediate reward function . We provide more details of domains in Appendix B.

  • Windy gridworld (Figure 2(a)). We consider a continuous-state version in which agents pursue actions for $\tau$ seconds to reach a goal region despite a crosswind.

  • Acrobot. Acrobot aims to swing up a pendulum to reach a given height. The dynamics is defined by a set of first-order ODEs, and the original version uses a fixed time step to discretize the underlying ODEs. We instead sample the time interval randomly.

  • HIV. Establishing effective treatment strategies for HIV-infected patients based on markers from blood tests can be cast as an RL problem (Ernst et al., 2006; Parbhoo et al., 2017; Killian et al., 2017). The effective period varies from one day to two weeks. Healthy patients with less severe disease status may only need occasional inspection, whereas unhealthy patients require more frequent monitoring and intervention.

  • Mujoco. We use action repetition (Sharma et al., 2017) to introduce irregular transition times to Mujoco tasks; however, we assume the intermediate observations are not available, so that the dynamics naturally fits the SMDP definition. The same action repeats for a number of steps given by a periodic function of the angle velocity vector of all joints. The periodicity is learnable by RNNs (Gers et al., 2002) and ensures consistently irregular measurements in the course of policy learning.

As proofs of concept, we assume the transition times in the gridworld and acrobot problems are known, and only focus on learning the state transition probability (setting $\lambda = 0$ in Equation 9); for the HIV and Mujoco tasks, we learn both state transitions and interval timings.

Figure 2: (a) 2D continuous windy gridworld, where “S” and “G” are the start area and goal area, and the shaded region and arrows represent the upward wind (darker color indicates stronger wind); (b) normalized value functions obtained by DQNs for all baselines and our proposed ODE-based models (marked in red) over the gridworld. “Oracle” refers to the model-free baseline trained until convergence. The lighter the pixel, the higher the value. The policy developed with the Latent-ODE is “closest” to the optimal policy (oracle).


We compare the performance of our proposed ODE-based models with four baselines embedded in our model-based RL framework for SMDPs. With the recurrent architecture, we evaluate 1) the vanilla RNN; 2) the Δt-RNN, where the time intervals are concatenated with the original input as an extra feature; 3) the Decay-RNN (Che et al., 2018), which adds an exponential decay layer between hidden states: $h'_i = h_{i-1} \cdot \exp(-\max(0, w\,\tau_{i-1} + b))$. With the encoder-decoder architecture, we evaluate 4) the Latent-RNN, where the decoder is constructed with an RNN and the model is trained variationally. RNNs in all models are implemented with gated recurrent units (GRUs) (Cho et al., 2014). Moreover, we also run a model-free method (DQN or DDPG) for comparison.
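As an illustration of the Decay-RNN baseline, its decay layer shrinks the hidden state toward zero as the elapsed gap grows, in the spirit of Che et al. (2018); the decay weights below are placeholders:

```python
import math

def decay_hidden(h, tau, w=0.5, b=0.0):
    """Exponentially decay each hidden unit as a function of the elapsed gap:
    h * exp(-max(0, w*tau + b)). Larger gaps push the state toward zero."""
    g = math.exp(-max(0.0, w * tau + b))
    return [g * hi for hi in h]

h = [1.0, -2.0]
short_gap = decay_hidden(h, tau=0.1)
long_gap = decay_hidden(h, tau=5.0)
```

Unlike the ODE-based models, this decay is a fixed functional form (always a shrink toward zero), which is one intuition for why it underperforms learned continuous-time dynamics.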

5.2 Demonstrations on simpler domains

We learn the world model (Ha and Schmidhuber, 2018) of simpler environments for planning. We gather data from an initial random policy and learn the dynamics model on this collection of data. Agent policies are only trained using fictional samples generated by the model without considering the planning horizon. To achieve optimal performance (i.e. the model-free baseline trained until convergence), the model has to capture long-term dependencies so that the created virtual world is accurate enough to mimic the actual environment. Thus, with this planning scheme, we can clearly demonstrate a learned model’s capacity. The details of the algorithm can be found in Appendix A.2.

            RNN            Δt-RNN         Decay-RNN      Latent-RNN     ODE-RNN        Latent-ODE
Gridworld   0.894 ± 0.023  0.334 ± 0.023  0.899 ± 0.022  1.161 ± 0.039  0.452 ± 0.040  0.845 ± 0.017
Acrobot     0.176 ± 0.010  0.039 ± 0.006  0.060 ± 0.006  0.176 ± 0.010  0.022 ± 0.005  0.021 ± 0.005
HIV         0.332 ± 0.013  0.168 ± 0.014  0.346 ± 0.022  0.361 ± 0.017  0.068 ± 0.006  0.020 ± 0.001
Table 1: The state prediction error (mean ± std over five runs) of all models on three simpler domains. Note that models always consume predictions in testing to match the inference procedure. The Latent-ODE achieves the lowest prediction errors on the Acrobot and HIV tasks.
(a)         RNN              Δt-RNN           Decay-RNN        Latent-RNN       ODE-RNN          Latent-ODE       Oracle
Gridworld   -54.02 ± 9.24    -45.64 ± 8.22    -48.91 ± 8.97    -47.92 ± 8.16    -58.50 ± 10.46   -34.87 ± 1.96    -34.17 ± 1.47
Acrobot     -179.35 ± 10.49  -106.78 ± 10.34  -105.23 ± 10.96  -181.64 ± 10.92  -65.90 ± 7.06    -54.26 ± 4.01    -48.67 ± 3.29
HIV ()      0.78 ± 0.04      0.75 ± 0.17      0.95 ± 0.21      0.82 ± 0.05      11.74 ± 1.50     30.32 ± 2.70     35.22 ± 1.42
(b)         RNN              Δt-RNN           Decay-RNN        Latent-RNN       ODE-RNN          Latent-ODE       Oracle
Gridworld   -61.01 ± 10.03   -64.55 ± 10.89   -60.78 ± 10.03   -52.32 ± 8.91    -114.70 ± 11.65  -49.31 ± 6.62    -35.93 ± 1.95
Acrobot     -407.46 ± 13.82  -281.92 ± 9.99   -285.07 ± 8.47   -237.25 ± 10.29  -190.82 ± 9.13   -171.37 ± 10.07  -78.75 ± 3.23
HIV ()      7.66 ± 1.79      17.21 ± 2.44     5.84 ± 1.62      16.95 ± 3.05     11.32 ± 1.09     21.60 ± 2.39     33.55 ± 1.97
Table 2: The cumulative reward (mean ± std over five runs) of policies developed with all models on three domains. “Oracle” (italic) refers to the model-free baseline trained until convergence. (a) planning in the original irregular time schedule; (b) planning using pretrained models from the original irregular time schedule for a new regular time schedule (one fixed time discretization per domain).

Latent-ODEs mimic continuous-time environments and value functions more accurately.

Because the training dataset is fixed, we can use the state prediction error (MSE in Equation 10) on a separate test dataset to measure whether the model learns the dynamics well. Table 1 shows state prediction errors of all models. The ODE-RNN and Latent-ODE outperform other models on acrobot and HIV, but the Δt-RNN achieves the lowest error on the windy gridworld. However, by visualizing the value functions of learned policies constructed using the dynamics models (Figure 2(b)), we find that only the Latent-ODE accurately recovers the true gridworld (the one from the model-free baseline), whereas the Δt-RNN characterizes the dark parts of the world very well but fails to identify the true goal region (the light part). Thus, a lower state prediction error averaged over the gridworld does not imply a better policy.

Latent-ODEs help agents develop better policies.

Table 2(a) shows the performance of all models, in terms of returns achieved by policies in the actual environment. The Latent-ODE consistently surpasses other models and nearly matches the performance of the model-free baseline trained until convergence. In contrast, all non-ODE-based models develop very poor policies on the acrobot and HIV problems.

Latent-ODEs are more robust to changes in time intervals.

To test whether the dynamics model generalizes across interval times, we change the time schedules from irregular measurements to regular measurements without retraining the models. Due to the page limit, we show the results of one representative regular time discretization each for the gridworld, acrobot and HIV in Table 2(b), and include full results for all time discretizations in Appendix D.2. The Latent-ODE is once again the best dynamics model for solving the new environment, even though the transition times are regular.

Optimized time schedules achieve the best balance of high performance and low interaction rate.

Figure 3: The cumulative reward vs. interaction rate on the HIV environment. “env” refers to fixed time schedules in the original environment used for training models; “model/oracle” refers to selecting $\tau$ using the Latent-ODE/true dynamics and Equation 13.

For evaluation, to ensure a fair comparison of different interaction rates given the fixed horizon, we collect the reward at every time step (every day) from the environment and calculate the overall cumulative reward. The results on the HIV environment are shown in Figure 3. Developing the model-based schedule using the Latent-ODE, we obtain returns similar to measuring the health state every two days, but with less than half the interventions. Further, using the oracle dynamics, the optimized schedule achieves performance similar to maximal interaction ($\tau = 1$) while reducing interaction times by three-quarters.

Policy  Reward
        12.88 ± 1.71
        18.04 ± 2.07
        16.96 ± 1.53
        30.32 ± 2.70
Table 3: The performance of different policies on the partially observable HIV task.

Latent variables capture hidden state representations in partially observable environment.

We mask two blood test markers in the state space to build a partially observable HIV environment, where we demonstrate the behavior of the latent policy $\pi(a \mid h)$. We compare its performance with the model-based policy and the model-free policy using partial observations, and the model-based policy using full observations. All model-based policies are developed with the Latent-ODE. Based on the results in Table 3, the latent variable improves performance in the partially observable setting and even yields a better policy than the model-free baseline, though the latent policy cannot reach the asymptotic performance of the policy using full observations.

Figure 4: Learning curves for all models on three Mujoco tasks. The shaded region represents a standard deviation of average evaluation over four runs (evaluation data is collected every 5000 timesteps). Curves are smoothed with a 20-point window for visual clarity. The Latent-ODE develops better policies with less data than the model-free method and the other models on all three tasks.

5.3 Continuous control: Mujoco

For the more complex Mujoco tasks, exploration and learning must be interleaved. We combine model predictive control (MPC) (Mayne et al., 2000) with an actor-critic method (DDPG) for planning. MPC refines the stochastic model-free policy (the actor) by sampling-based planning (Wang and Ba, 2019; Hong et al., 2019), and the value function (the critic) mitigates the short-sightedness of imaginary model rollouts in MPC planning (Lowrey et al., 2018; Clavera et al., 2020). This approach iterates between data collection, model training, and policy optimization, which allows us to learn a good policy with a small amount of data. The details of the algorithm can be found in Appendix A.3.
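The MPC refinement step described above can be sketched as: sample candidate action sequences around the actor's proposal, roll each out through the learned model, score with interpolated rewards plus the critic's terminal value, and execute the first action of the best sequence. Everything below is a toy stand-in for the actual planner:

```python
import random

def mpc_refine(s, actor, critic, model, reward,
               horizon=5, n_candidates=50, noise=0.3, gamma=0.99):
    """Sampling-based MPC: perturb the actor's actions, evaluate rollouts in
    the learned model, and bootstrap the tail with the critic so the finite
    planning horizon is not short-sighted."""
    best_a0, best_score = None, float("-inf")
    for _ in range(n_candidates):
        sim, score, disc, a0 = s, 0.0, 1.0, None
        for t in range(horizon):
            a = actor(sim) + random.gauss(0.0, noise)  # perturbed proposal
            if t == 0:
                a0 = a
            score += disc * reward(sim, a)
            sim = model(sim, a)
            disc *= gamma
        score += disc * critic(sim, actor(sim))        # critic values the tail
        if score > best_score:
            best_a0, best_score = a0, score
    return best_a0

random.seed(0)
# Toy 1-D task: stay near 0; model is s' = s + a, reward penalizes distance and effort.
a = mpc_refine(
    s=2.0,
    actor=lambda s: -0.5 * s,
    critic=lambda s, a: -abs(s),
    model=lambda s, a: s + a,
    reward=lambda s, a: -abs(s) - 0.1 * abs(a),
)
```

The refined first action is then executed in the real environment, the transition is stored, and the loop of data collection, model training, and policy optimization continues.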

Latent-ODEs exhibit strong sample efficiency across all three Mujoco tasks.

Figure 4 shows the learning process of all models on the Mujoco tasks. The Latent-ODE is more sample-efficient than the model-free method and the other models on all three tasks. For example, on the swimmer and hopper tasks, we develop a high-performing policy within 100k environment steps using the Latent-ODE, whereas the model-free baseline requires four times as much data. However, the ODE-RNN is not as good as the Latent-ODE, and its performance is similar to the other baselines.

6 Discussion and conclusion

We incorporate actions and time intervals into neural ODEs for modeling continuous-time dynamics, and build a unified framework to train dynamics models for SMDPs. Our empirical evaluation across various continuous-time domains demonstrates the superiority of the Latent-ODE in terms of model learning, planning, and robustness to changes in interval times. Moreover, we propose a method that minimizes interventions in the environment while maintaining near-optimal performance using the dynamics model, and show its effectiveness in a health-related domain.

The flexibility of our model training procedure (Section 4.1), which is orthogonal to the planning method, suggests several directions for future research on model-based RL and neural ODEs. First, one might enhance the performance of Latent-ODEs in continuous-time domains using more advanced planning schemes and controllers, e.g., TD3 (Fujimoto et al., 2018) or soft actor-critic (Haarnoja et al., 2018). Second, since our model is always trained on a batch of transition data, we can apply our method to the continuous-time off-policy setting using recent advances in model-based offline RL (Yu et al., 2020; Kidambi et al., 2020). RL health applications (e.g., in the ICU (Gottesman et al., 2020)) might benefit from this in particular, because practitioners usually assume the presence of a large batch of already-collected data, consisting of patient measurements with irregular observations and actions. Furthermore, while we focus on flat actions in this work, it is natural to extend our models and framework to model-based hierarchical RL (Botvinick and Weinstein, 2014; Li et al., 2017).

Last but not least, we find that training a Latent-ODE usually takes more than ten times longer than training a simple RNN model due to slow ODE solvers, so scalability might be a key limitation when applying our models to settings with larger state spaces. We believe the efficiency of our methods can be significantly improved not only with theoretically faster numerical ODE solvers (e.g., (Finlay et al., 2020; Kelly et al., 2020)), but also with better solver implementations (e.g., a faster C++/Cython implementation or single-precision arithmetic).

Broader Impact

We introduce a new approach for continuous-time reinforcement learning that could eventually be useful for a variety of applications with irregular time series, e.g. in healthcare. However, models are only as good as the assumptions made in the architecture, the data they are trained on, and how they are integrated into a broader context. Practitioners should treat any output from RL models critically and carefully, as in real life there are many novel situations that may not be covered by the RL algorithm.


Acknowledgments

We thank Andrew Ross, Weiwei Pan, Melanie Pradier and other members of the Harvard Data to Actionable Knowledge lab for helpful discussion and feedback. We thank Harvard Faculty of Arts and Sciences Research Computing and the School of Engineering and Applied Sciences for providing computational resources. FDV and JD are supported by an NSF CAREER award.

Supplementary Materials

Appendix A Algorithm details

a.1 Optimizing interval times

Input : The pretrained state transition dynamics model, the replay buffer $\mathcal{D}$, the reward function $r$, the time horizon $T$, and the number of episodes $E$.
Output : The policy $\pi$.
1 Initialize the policy $\pi$ (along with the action-value function $Q$);
2 for $e \leftarrow 1$ to $E$ do
3       $t \leftarrow 0$;
4       Observe the initial state $s_t$ from the environment;
5       Initialize the latent state $h_t$;
6       while $t < T$ do
7             Select action $a_t \sim \pi(\cdot \mid s_t)$;
8             $\bar{s}_t \leftarrow s_t$, $\bar{h}_t \leftarrow h_t$;
9             for $j \leftarrow 1$ to $M$ do
10                   Advance the latent state $\bar{h}_{t+j}$ with the dynamics model;
11                   Decode $\bar{s}_{t+j}$ from $\bar{h}_{t+j}$;
12                   Select action $\bar{a}_{t+j} \sim \pi(\cdot \mid \bar{s}_{t+j})$;
13                   Calculate the immediate reward $\bar{r}_{t+j} = r(\bar{s}_{t+j}, \bar{a}_{t+j})$ at the current time point;
15             end for
16            Select the best incoming time interval $\tau^*$ using Equation 13;
17             Execute $a_t$, wait $\tau^*$, and observe the next state $s_{t+\tau^*}$ and reward $r_t$;
18             if $s_{t+\tau^*}$ is not the terminal state then
19                   Store the tuple $(s_t, a_t, \tau^*, r_t, s_{t+\tau^*})$ into $\mathcal{D}$;
21            else
22                   Store the tuple $(s_t, a_t, \tau^*, r_t, s_{t+\tau^*}, \mathrm{done})$ into $\mathcal{D}$;
23                   break;
25             end if
26            Optimize $\pi$ with data in $\mathcal{D}$;
27             $t \leftarrow t + \tau^*$;
29       end while
31 end for
Algorithm 1 Optimizing interval times with the dynamics model.

Our innovation of optimizing interval times is highlighted in blue in Algorithm 1. Note that we can use either the imaginary reward from the dynamics model or the true reward from the environment for training the policy, and similarly for the next observation (line 16 of Algorithm 1). In this work, to focus on the efficacy of the optimized time schedules, we use the true reward and true observation; however, the optimal interval is always determined using the imaginary reward.
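As a concrete illustration of the interval-selection step (line 16 of Algorithm 1), the sketch below enumerates a candidate set of intervals and picks the one whose model-predicted ("imaginary") reward is highest. The `rollout_fn` and `reward_fn` interfaces are hypothetical placeholders for the dynamics model and the decoded reward, not the paper's actual implementation.

```python
def select_interval(latent_state, action, candidate_dts, rollout_fn, reward_fn):
    """Return the candidate interval with the highest imaginary reward.

    rollout_fn(z, a, dt) -> next latent state (hypothetical model interface)
    reward_fn(z)         -> scalar reward decoded from a latent state
    """
    best_dt, best_r = None, float("-inf")
    for dt in candidate_dts:
        z_next = rollout_fn(latent_state, action, dt)  # imaginary rollout
        r = reward_fn(z_next)                          # imaginary reward
        if r > best_r:
            best_dt, best_r = dt, r
    return best_dt
```

The agent would then execute the chosen action in the real environment for `best_dt` seconds and receive the true reward there.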

A key assumption of Algorithm 1 is that acting frequently with short time intervals will not hurt performance, and that maximal interaction (i.e., the discrete-time setting) achieves optimal performance. In many scenarios this assumption seems reasonable, and applying Algorithm 1 may work well. For instance, in healthcare applications it is safe to assume that more frequent monitoring of patients and more careful tuning of their treatment plans should yield optimal performance, although in practice this is not generally done due to, e.g., resource constraints. However, this assumption does not hold for all environments. For example, some Atari games require frame skipping, i.e., repeating actions for several steps, because enough change in the pixels is needed to find a good policy. Also, some control problems have an optimal time step for the underlying physical systems. In these situations, one might set a minimum threshold on the interval and then apply the algorithm; together with the extension to continuous time (Equation 14), we leave this to future work.


A.2 Learning world models

Input : The replay buffer , the reward function , the time horizon , and the number of episodes for model learning and for policy optimization.
Output : .
1 Initialize the policy (along with the action-value function );
2 Initialize the dynamics model ;
3 Collect a collection of trajectories using random policies, where for ;
4 Train with as described in Section 4.1;
5 for  to  do
6       ;
7       Observe the initial state from the environment or sample from a set of initial states;
8       Initialize the latent state ;
9       while  do
10             Select action ;
11             Predict the incoming time interval and next latent state using ;
12             Decode from ;
13             Calculate the reward ;
14             if  is not the terminated state then
15                   Store the tuple into ;
17            else
18                   Store the tuple into ;
19                   break;
21             end if
22            Optimize with data in ;
23             ;
25       end while
27 end for
Algorithm 2 Learning world models [Ha and Schmidhuber, 2018] for SMDPs.

Algorithm 2 assumes that the dynamics can be fully covered by random policies, which may nevertheless be far from the optimal policy. Because the policy is trained only on fictional samples and without a restricted planning horizon, there is no difference between learning in the virtual world created by the dynamics model and learning in the true environment. Thus, the performance of learned policies is mainly determined by the model's capacity.

However, Algorithm 2 does not work well on the more sophisticated Mujoco tasks, because the dynamics cannot be fully explored by a random policy, and a long planning horizon makes the compounding error of fictional samples accumulate very quickly. One might use an iterative training procedure based on Algorithm 2 [Ha and Schmidhuber, 2018, Schmidhuber, 2015] for more complex environments, interleaving exploration and learning. However, to combat model bias, this type of Dyna-style algorithm usually requires computationally expensive model ensembles [Kurutach et al., 2018, Janner et al., 2019]. Thus, we turn to an MPC-style algorithm for planning (Algorithm 3), which is sufficiently effective to demonstrate a learned model's capacity and is more computationally efficient.

A.3 Model predictive control with actor-critic

Input : The replay buffer , the environment dataset , the reward function , the planning horizon , the search population , the number of environment steps and the number of epochs .
Output : .
1 Initialize the actor and the critic , and their target networks and ;
2 Initialize the dynamics model ;
3 Gather a collection of trajectories using random policies, and save them into ;
4 for  to  do
5       Train with data in as described in Section 4.1;
6       Observe the initial state from the environment;
7       Initialize the latent state ;
8       for  to  do
9             for  to  do
10                   ;
11                   for  to  do
12                         Select action ;
13                         ;
14                         Decode from ;
15                         Calculate the reward ;
16                         ;
18                   end for
19                  Select action ;
21             end for
22            Select the best sequence index ;
23             Select the best action ;
24             Observe the incoming time interval and next observation ;
25             Encode the next latent state ;
26             Calculate the reward ;
27             if  is not the terminated state then
28                   Store the tuple into ;
30            else
31                   Store the tuple into ;
32                   Observe the initial state from the environment;
33                   Initialize the latent state ;
34                   continue;
36             end if
37            Optimize and with data in ;
38             Update target networks and ;
39             ;
41       end for
42      Store trajectories collected in the current epoch into ;
44 end for
Algorithm 3 Model predictive control (MPC) with DDPG for SMDPs.

In Algorithm 3, the actor provides a good initialization of deterministic action sequences, and MPC searches for the best sequence among these plus Gaussian noise, i.e., we sample each action from a normal distribution centered at the actor's output (line 12 of Algorithm 3). Further, in vanilla MPC the best sequence is determined by the cumulative reward of model rollouts; however, the selected action may not be globally optimal due to the short planning horizon. The critic offers an estimate of the expected return at the final state of each simulated trajectory, which overcomes this shortsighted planning problem (line 20 of Algorithm 3). Note that we use the deterministic action (without Gaussian noise) at the final state to calculate the action-value function (line 18 of Algorithm 3).
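The scoring procedure described above can be sketched as follows. Here `actor`, `critic`, `model`, and `reward_fn` are hypothetical stand-ins for the learned components, and the noise scale `sigma` is illustrative; each candidate sequence is scored by its cumulative model reward plus a critic bootstrap at the final state, computed with the noise-free action.

```python
import numpy as np

def score_sequences(z0, actor, critic, model, reward_fn, horizon, pop, sigma, rng):
    """Roll out `pop` noisy action sequences through the model and score them.

    Returns the index of the best sequence and its first action.
    """
    scores = np.zeros(pop)
    first_actions = []
    for k in range(pop):
        z, total = z0, 0.0
        for t in range(horizon):
            a = actor(z) + sigma * rng.standard_normal()  # Gaussian exploration noise
            if t == 0:
                first_actions.append(a)
            z = model(z, a)           # latent-space model rollout
            total += reward_fn(z)     # imaginary reward
        total += critic(z, actor(z))  # bootstrap beyond the short horizon,
        scores[k] = total             # using the deterministic action
    best = int(np.argmax(scores))
    return best, first_actions[best]
```

With `sigma = 0` this degenerates to scoring the actor's deterministic plan; increasing `sigma` widens the local search around it.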

Appendix B Environment specifications

Windy gridworld.

We extend the gridworld to have continuous states and continuous-time actions. That is, after picking an orientation (action), the agent moves in that direction for an arbitrary number of seconds. The agent is allowed to "move as a king", i.e., it can take eight actions: moving up, down, left, right, upleft, upright, downleft and downright. Specifically, the agent moves in the gridworld every second:

Every second, agents in the region with the weaker wind are pushed upward by a small amount, and agents in the region with the stronger wind are pushed upward by a larger amount. If agents hit the boundary of the gridworld, they simply stand still until the end of the transition time. Every second in the gridworld incurs a -1 cost until the agent discovers the goal region (triggering a +10 reward) or the time limit elapses. Thus, we are given the reward function

In addition, we feed the zero-centered state for both model training and policy optimization.


Acrobot.

Acrobot, a canonical RL and control problem, is a two-link pendulum with only the second joint actuated. Initially, both links point downwards. The goal is to swing up the pendulum by applying a positive, neutral, or negative torque on the joint such that the tip of the pendulum reaches a given height. The state space consists of four continuous variables, where the first is the angular position of the first link in relation to the joint, the second is the angular position of the second link in relation to the first, and the remaining two are the angular velocities of each link respectively. The reward is collected as the height of the tip of the pendulum (as recommended in the work of [Wang et al., 2019]) after the transition time:

until the goal is reached or after .


HIV.

The interaction of the immune system with the human immunodeficiency virus (HIV) under treatment protocols was mathematically formulated as a dynamical system [Adams et al., 2004], which can be solved using RL approaches [Ernst et al., 2006, Parbhoo et al., 2017, Killian et al., 2017]. The goal of this task is to determine effective treatment strategies for HIV-infected patients based on critical markers from a blood test, including the viral load (the main marker of patient health), the numbers of healthy and infected CD4+ T-lymphocytes, the numbers of healthy and infected macrophages, and the number of HIV-specific cytotoxic T-cells. To build a partially observable HIV environment, two of these variables are removed from the state space. The anti-HIV drugs can be roughly grouped into two main categories: Reverse Transcriptase Inhibitors (RTI) and Protease Inhibitors (PI). The patient is assumed to be given treatment from one of the two classes of drugs, a mixture of the two treatments, or no treatment. The agent starts at an unhealthy status, where the viral load and the number of infected cells are much higher than the number of virus-fighting T-cells. The dynamical system is defined by a set of differential equations:

where the two treatment-specific parameters (nonzero if RTI is applied, otherwise 0; and nonzero if PI is applied, otherwise 0) are selected by the prescribed action. See the specification of the other parameters in the work of [Adams et al., 2004].

The effective period is determined by the state (mainly by the viral load) and the treatment as follows:

The reward is gathered based on the patient's health state after the effective period:

An episode ends after a fixed number of days and there is no early termination condition. The state variables are first put on a log scale and then normalized to have zero mean and unit standard deviation for both model training and policy optimization.


Mujoco.

We consider the fully observable Mujoco environments, where the position of the root joint is also observed, which allows us to calculate the reward and determine the terminal condition for simulated states easily. The action is repeated a number of times following the pattern:

where the bounds denote the minimum/maximum action repetition times, the vector is the angular velocity of all joints, and the final operation rounds to the nearest integer. The bounds are shared across all three locomotion tasks; the remaining constant takes one value for the swimmer and the hopper and another for the half-cheetah.

Because we assume the intermediate observations are not available during action repetition, the reward is calculated only from the current observation, the next observation, the control input, and the number of repetitions:

where the positions refer to the previous/current position of the root joint, and there is an alive bonus of 1 for the hopper at every step. Also, instead of setting a fixed horizon, we keep the original maximum episode length from OpenAI Gym, i.e., the maximum number of environment steps per episode is 1000 for all three tasks. We normalize observations to have zero mean and unit standard deviation for both model training and policy optimization.

Appendix C Experimental details

C.1 Planning

Learning world models.

For all three simpler domains, we collect a set of episodes as the training dataset and another 100 episodes as the validation dataset. We optimize the policy until convergence; the number of episodes used for each domain is shown in Table 4. All final cumulative rewards are evaluated by taking the average reward over 100 trials after training the policies for that number of episodes.

             Gridworld   Acrobot   HIV
model-based  1000        200       1500
model-free   1000        -         3500
Table 4: The choice of for training policies until convergence.

Model predictive control with actor-critic.

We switch between model training and policy optimization every fixed number of environment steps. Equipped with the value function from the critic, we can choose a relatively short planning horizon, which maintains good performance while reducing the computational cost. We also set a large search population.

For model learning, 90% of the collected trajectories are used as the training dataset and the remaining 10% as the validation dataset. Also, we divide each full trajectory into several pieces whose length is equal to (or less than) the MPC planning horizon. Not only does this reduce the computational cost of training a sequential model, but it also helps the dynamics model provide more accurate predictions over a finite horizon.
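The trajectory-chunking step above can be sketched minimally, assuming each trajectory is a Python list of transition tuples:

```python
def chunk_trajectory(traj, horizon):
    """Split a trajectory into consecutive pieces of length <= horizon."""
    return [traj[i:i + horizon] for i in range(0, len(traj), horizon)]
```

For example, a 10-step trajectory with a planning horizon of 4 yields chunks of lengths 4, 4 and 2; only the final piece is shorter than the horizon.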

Throughout all experiments, we use a soft-greedy trick for MPC planning to combat model approximation errors [Hong et al., 2019]. Instead of selecting the best first action (line 20 of Algorithm 3), we take the average of the first actions of the top 50 sequences as the final action for the agent. This simple approach alleviates the impact of inaccurate models and improves performance.
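The soft-greedy selection can be sketched as below; the sequence scores and first actions are assumed to come from the MPC search, and `top_k=50` matches the value stated above.

```python
import numpy as np

def soft_greedy_action(first_actions, scores, top_k=50):
    """Average the first actions of the top_k highest-scoring sequences."""
    idx = np.argsort(scores)[::-1][:top_k]  # indices of the best sequences
    return np.mean(np.asarray(first_actions)[idx], axis=0)
```

Averaging over the top sequences smooths out actions that only look good because the model happened to be wrong for one particular rollout.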

Initialization of latent states.

While using the model for planning, the initial latent states of recurrent-based models are all zeros, whereas for models with an encoder-decoder structure they are sampled from the prior distribution (a standard normal distribution).

C.2 Model learning

Scheduled sampling.

As our model predicts the new latent state at each time step, it needs to be conditioned on the state at the previous time step. During training, there is a choice for the source of this conditioning input: either the ground truth (observation) or the model's own previous prediction can be used. The former provides more signal when the model is weak, while the latter matches more closely the conditions during inference (episode generation), when the ground truth is not known. Scheduled sampling strikes a balance between the two. Specifically, at the beginning of training, the ground truth is offered more often, which pushes the model to deliver accurate short-term predictions; towards the end of training, the previously predicted state is more likely to be used, helping the model focus on the global dynamics. In other words, the optimization objective transitions from a one-step loss to a multi-step loss. Therefore, scheduled sampling can prevent the model from drifting out of its area of applicability due to compounding errors. We use a linear decay scheme for the scheduled sampling probability, governed by a decay rate and the number of iterations. However, we find that scheduled sampling only works well on the acrobot task.
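A minimal sketch of this linear-decay scheduled sampling; the `decay_rate` value is a hypothetical placeholder, as the paper's constant is not reproduced here.

```python
import random

def ground_truth_probability(iteration, decay_rate):
    """Linearly decaying probability of feeding the ground-truth observation."""
    return max(0.0, 1.0 - decay_rate * iteration)

def next_model_input(ground_truth, prediction, iteration, decay_rate, rng=random):
    """Scheduled sampling: ground truth early in training, own prediction later."""
    if rng.random() < ground_truth_probability(iteration, decay_rate):
        return ground_truth
    return prediction
```

Early in training the probability is near 1 (one-step loss on ground truth); once it reaches 0 the model is trained entirely on its own rollouts (multi-step loss).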

Early stopping.

On Mujoco tasks, we utilize early stopping to prevent overfitting, i.e., we terminate training if the state prediction error (MSE in Equation 10) on the validation dataset does not decrease for a given number of training epochs, and we use the parameters achieving the lowest state prediction error as the final model parameters. Because the model has already learned the dynamics after being trained for several epochs in Algorithm 3 and only needs to be refined for novel situations, we use a linear decay scheme for the patience, decaying with the number of epochs in Algorithm 3. For the three simpler domains, we run 12,000 iterations without early stopping.
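The early-stopping rule with a linearly decaying patience budget might look like the following sketch; the interface and the decay constant are illustrative assumptions, not the paper's exact values.

```python
class EarlyStopper:
    """Stop when validation MSE has not improved for `patience` epochs.

    The patience budget itself shrinks linearly with the outer-loop epoch,
    mirroring the decay scheme described above.
    """
    def __init__(self, base_patience):
        self.base_patience = base_patience
        self.best = float("inf")
        self.bad_epochs = 0

    def patience(self, outer_epoch, decay=1):
        # Later outer epochs tolerate fewer non-improving training epochs.
        return max(1, self.base_patience - decay * outer_epoch)

    def should_stop(self, val_mse, outer_epoch):
        if val_mse < self.best:
            self.best = val_mse
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience(outer_epoch)
```

The best-so-far parameters would be checkpointed whenever `val_mse` improves and restored when `should_stop` fires.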

Model hyperparameters.

We tune the hyperparameters of the dynamics models for each domain, but all baseline models use the same set of hyperparameters for a fair comparison.

Gridworld Acrobot HIV Swimmer Hopper HalfCheetah
Learning rate 1e-3 5e-4 1e-3
Batch size 32 128
Latent dimension 2 10 128 400
Weight decay 1e-3
Scheduled sampling No Yes No
GRU one layer, unidirectional, Tanh activation
Encoder hidden to latent 5 20
Interval timer N/A 20
in Equation 9 0 0.01
ODE network 5 20 128 400
ODE solver Runge-Kutta 4(5) adaptive solver (dopri5)
ODE error tolerance 1e-3 (relative), 1e-4 (absolute) 1e-5 (relative), 1e-6 (absolute)
Table 5: Hyperparameters for dynamics models on different domains.

All hyperparameters are shown in Table 5. Specifically,

  • “Learning rate” refers to the learning rate for the Adam optimizer to train the dynamics model.

  • “GRU” refers to the architecture of the GRU in all experiments, including the encoder in the Latent-RNN and Latent-ODE;

  • For encoder-decoder models, we use a neural network with one layer and Tanh activation to convert the final hidden state of the encoder to the mean and log variance (for applying reparameterization trick to train VAE) of the initial latent state of the decoder. “Encoder hidden to latent” refers to the number of hidden units of this neural network.

  • We use a neural network with one layer and Tanh activation for the interval timer . “Interval timer ” refers to the number of hidden units of this neural network.

  • We use a neural network with two layers and Tanh activation for the ODE network . “ODE network ” refers to the number of hidden units of this neural network.

  • “ODE solver” refers to the numerical ODE solver we use to solve the ODE (Equations 6 and 7). Note that we do not use the adjoint method [Chen et al., 2018] for ODE solvers due to a longer computation time.

  • “ODE error tolerance” refers to the error tolerance we use to solve the ODE numerically.

C.3 Policy

DQN hyperparameters.

The DQN for the three simpler domains has two hidden layers of 256 and 512 units with ReLU activations. Parameters are trained using an Adam optimizer with a learning rate of 5e-4 and a batch size of 128. We minimize the temporal-difference error using the Huber loss, which is more robust to outliers when the estimated Q-values are noisy. We update the target network every 10 episodes (hard update). The action is one-hot encoded as the input to the DQN. To improve performance, we use a prioritized experience replay buffer [Schaul et al., 2015] of size 1e5, with a prioritization exponent of 0.6 and an importance-sampling exponent of 0.4. To encourage exploration, we construct an epsilon-greedy policy with an inverse sigmoid decay scheme from 1 to 0.05. Also, all final policies are softened for evaluation.
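An inverse-sigmoid epsilon decay can be sketched as below; the `midpoint` and `temperature` constants are illustrative assumptions, and only the endpoints 1 and 0.05 come from the text.

```python
import math

def epsilon(step, eps_start=1.0, eps_end=0.05, midpoint=500, temperature=100.0):
    """Inverse-sigmoid decay of the exploration rate from eps_start to eps_end.

    midpoint controls where the transition is centered; temperature controls
    how sharply epsilon falls around it.
    """
    sigma = 1.0 / (1.0 + math.exp((step - midpoint) / temperature))
    return eps_end + (eps_start - eps_end) * sigma
```

Unlike a linear or exponential schedule, this keeps epsilon near `eps_start` for an initial exploration phase, then decays smoothly to `eps_end`.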

DDPG hyperparameters.

The DDPG networks for both the actor and the critic have two hidden layers of 64 units each with ReLU activations. Parameters are trained using an Adam optimizer with a learning rate of 1e-4 for the actor, a learning rate of 1e-3 for the critic, and a batch size of 128. The target networks are soft-updated with rate 1e-3. The size of the replay buffer is 1e6. To encourage exploration, we collect 15000 samples with a random policy at the beginning of training and add Gaussian noise to every selected action.

Discount factors.

The discount factor is 0.99 for the windy gridworld and Mujoco tasks, is 0.995 for the HIV domain, and is 1 for the acrobot problem.

Figure 5: Learning curves with all baselines on the three simpler domains. The x-axis is the number of episodes in Algorithm 2 (for the model-free baseline, the x-axis is the number of episodes in the actual environment). The shaded region represents one standard deviation of the average evaluation over four runs (evaluation data is collected every 4 episodes). Curves are smoothed with a 20-point window for visual clarity.

Appendix D Additional figures and tables

D.1 Learning curves of learning world models

Figure 6: Learning curves of different policies on the partially observable HIV environment.

Figure 5 shows the learning process of all baselines on the three simpler domains. Note that Figure 5 does not necessarily demonstrate sample efficiency, because the model-free method uses online real data whereas the model-based approach (learning world models) uses offline real data. However, we still observe that the Latent-ODE and ODE-RNN develop a high-performing policy much faster than the model-free baseline over 1500 episodes in the HIV environment; the model-free baseline converges only after 3500 episodes. In addition, the acrobot problem clearly exposes the differences in the models' abilities to capture continuous-time dynamics. The Latent-ODE and ODE-RNN outperform the other models; the RNN and Latent-RNN, designed for discrete-time transitions, fail entirely on continuous-time dynamics, though their similar performance suggests the limited impact of the architecture choice (recurrent vs. encoder-decoder); the Δt-RNN and Decay-RNN also struggle to model continuous-time dynamics even though they leverage time-interval information in different ways.

Moreover, Figure 6 shows the learning process of different policies on the partially observable HIV environment. The latent policy develops a better-performing policy more quickly than both the vanilla model-based policy and the model-free policy.

D.2 Full results of changing time intervals

Table 6 shows the cumulative rewards of policies learned on regular measurement schedules, using models pretrained on the original irregular time schedule, across the three simpler domains. The Latent-ODE surpasses the other baseline models in most settings.

RNN            Δt-RNN         Decay-RNN      Latent-RNN     ODE-RNN         Latent-ODE     Oracle
-31.59 ± 1.68  -43.21 ± 1.25  -31.69 ± 1.36  -32.22 ± 1.19  -47.31 ± 1.46   -34.24 ± 1.93  -44.17 ± 0.80
-31.95 ± 2.10  -39.37 ± 3.37  -32.27 ± 2.39  -32.75 ± 1.60  -32.40 ± 1.43   -33.82 ± 2.32  -31.61 ± 1.60
-35.08 ± 4.91  -34.49 ± 4.82  -35.03 ± 4.87  -43.74 ± 8.71  -36.61 ± 7.04   -32.74 ± 2.22  -32.29 ± 1.60
-38.55 ± 7.99  -36.00 ± 5.41  -40.58 ± 8.56  -36.94 ± 5.73  -64.10 ± 14.78  -33.26 ± 1.86  -32.81 ± 1.74
-44.33 ± 9.77  -42.97 ± 9.43  -43.55 ± 9.50  -36.87 ± 5.04  -87.50 ± 18.47  -33.84 ± 2.02  -33.35 ± 2.03