$\gamma$-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction



We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with $\gamma$-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The $\gamma$-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the $\gamma$-model as both a generative adversarial network and normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.

1 Introduction

The common ingredient in all of model-based reinforcement learning is the dynamics model: a function used for predicting future states. The choice of the model’s prediction horizon constitutes a delicate trade-off. Shorter horizons make the prediction problem easier, as the near-term future increasingly begins to look like the present, but may not provide sufficient information for decision-making. Longer horizons carry more information, but present a more difficult prediction problem, as errors accumulate rapidly when a model is applied to its own previous outputs (Janner et al., 2019).

Can we avoid choosing a prediction horizon altogether? Value functions already do so by modeling the cumulative return over a discounted long-term future instead of an immediate reward, circumventing the need to commit to any single finite horizon. However, value prediction folds two problems into one by entangling environment dynamics with reward structure, making value functions less readily adaptable to new tasks in known settings than their model-based counterparts.

In this work, we propose a model that predicts over an infinite horizon with a geometrically-distributed timestep weighting (Figure 1). This $\gamma$-model, named for the dependence of its probabilistic horizon on a discount factor $\gamma$, is trained with a generative analogue of temporal difference learning suitable for continuous spaces. The $\gamma$-model bridges the gap between canonical model-based and model-free mechanisms. Like a value function, it is policy-conditioned and contains information about the distant future; like a conventional dynamics model, it is independent of reward and may be reused for new tasks within the same environment. The $\gamma$-model may be instantiated as both a generative adversarial network (Goodfellow et al., 2014) and a normalizing flow (Rezende and Mohamed, 2015).

The shift from standard single-step models to infinite-horizon $\gamma$-models carries several advantages:

Constant-time prediction     Single-step models must perform an $\mathcal{O}(H)$ operation to predict $H$ steps into the future; $\gamma$-models amortize the work of predicting over extended horizons during training such that long-horizon prediction occurs with a single feedforward pass of the model.

Generalized rollouts and value estimation     Probabilistic prediction horizons lead to generalizations of the core procedures of model-based reinforcement learning. For example, generalized rollouts allow for fine-tuned interpolation between training-time and testing-time compounding error. Similarly, terminal value functions appended to truncated $\gamma$-model rollouts allow for a gradual transition between model-based and model-free value estimation.

Omission of unnecessary information     The predictions of a $\gamma$-model do not come paired with an associated timestep. While on the surface a limitation, we show why knowing precisely when a state will be encountered is not necessary for decision-making. Infinite-horizon $\gamma$-model prediction selectively discards the unnecessary information from a standard model-based rollout.


Figure 1: Conventional predictive models trained via maximum likelihood have a horizon of one. By interpreting temporal difference learning as a training algorithm for generative models, it is possible to predict with a probabilistic horizon governed by a geometric distribution. In the spirit of infinite-horizon control in model-free reinforcement learning, we refer to this formulation as infinite-horizon prediction.

2 Related Work

The complementary strengths and weaknesses of model-based and model-free reinforcement learning have led to a number of works that attempt to combine these approaches. Common strategies include initializing a model-free algorithm with the solution found by a model-based planner (Levine and Koltun, 2013; Farshidian et al., 2014; Nagabandi et al., 2018), feeding model-generated data into an otherwise model-free optimizer (Sutton, 1990; Silver et al., 2008; Lampe and Riedmiller, 2014; Kalweit and Boedecker, 2017; Luo et al., 2019), using model predictions to improve the quality of target values for temporal difference learning (Buckman et al., 2018; Feinberg et al., 2018), leveraging model gradients for backpropagation (Nguyen and Widrow, 1990; Jordan and Rumelhart, 1992; Heess et al., 2015), and incorporating model-based planning without explicitly predicting future observations (Tamar et al., 2016; Silver et al., 2017; Oh et al., 2017; Kahn et al., 2018; Amos et al., 2018; Schrittwieser et al., 2019). In contrast to combining independent model-free and model-based components, we describe a framework for training a new class of predictive model with a generative, model-based reinterpretation of model-free tools.

Temporal difference models (TDMs; Pong et al., 2018) provide an alternative method of training models with what are normally considered to be model-free algorithms. TDMs interpret models as a special case of goal-conditioned value functions (Kaelbling, 1993; Foster and Dayan, 2002; Schaul et al., 2015; Andrychowicz et al., 2017), though the TDM is constrained to predict at a fixed horizon and is limited to tasks for which the reward depends only on the last state. In contrast, the $\gamma$-model predicts over a discounted infinite-horizon future and accommodates arbitrary rewards.

The most closely related prior work to $\gamma$-models is the successor representation (Dayan, 1993), a formulation of long-horizon prediction that has been influential in both cognitive science (Momennejad et al., 2017; Gershman, 2018) and machine learning (Kulkarni et al., 2016; Ma et al., 2018). In its original form, the successor representation is tractable only in tabular domains. While a continuous analogue retaining an interpretation as a probabilistic model has remained elusive, variants have been developed focusing on policy evaluation based on expected featurizations instead of state prediction (Barreto et al., 2017, 2018; Hansen et al., 2020). Converting the tabular successor representation into a continuous generative model is non-trivial because the successor representation implicitly assumes the ability to normalize over a finite state space for interpretation as a predictive model.

Both the $\gamma$-model and the successor representation circumvent the compounding prediction errors that occur with single-step models during long model-based rollouts. Prior work has addressed this problem by training self-correcting models (Bengio et al., 2015; Talvitie, 2014, 2016), using multi-step models (Venkatraman et al., 2016; Whitney and Fergus, 2018; Asadi et al., 2019), and truncating model rollouts during policy optimization (Janner et al., 2019). In contrast, we modify the model objective so that predictions are no longer tied to specific times at all.

3 Preliminaries

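We consider an infinite-horizon Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, with state space $\mathcal{S}$ and action space $\mathcal{A}$. The transition distribution and reward function are given by $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{+}$ and $r: \mathcal{S} \to \mathbb{R}$, respectively. The discount is denoted by $\gamma \in (0, 1)$. A policy $\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{+}$ induces a conditional occupancy $\mu(s \mid s_t, a_t)$ over future states:

$$\mu(s \mid s_t, a_t) = (1 - \gamma) \sum_{\Delta t = 1}^{\infty} \gamma^{\Delta t - 1}\, p(s_{t + \Delta t} = s \mid s_t, a_t, \pi). \tag{1}$$

We denote parametric approximations of $p$ ($\mu$) as $p_\theta$ ($\mu_\theta$), in which the subscripts denote model parameters. Standard model-based reinforcement learning algorithms employ the single-step model $p_\theta$ for long-horizon decision-making by performing multi-step model-based rollouts.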
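As a concrete illustration of the occupancy definition, the following minimal sketch evaluates it for a small tabular Markov chain. It assumes the policy has already been folded into a state-to-state transition matrix `P` (so the occupancy is conditioned on the state only) and truncates the infinite sum at a finite horizon. The helper names `matmul` and `discounted_occupancy` are hypothetical and not taken from the paper.

```python
def matmul(A, B):
    """Multiply two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def discounted_occupancy(P, gamma, horizon=500):
    """Truncated sum (1 - gamma) * sum_{dt=1}^{horizon} gamma^(dt-1) * P^dt."""
    n = len(P)
    mu = [[0.0] * n for _ in range(n)]
    P_dt = [row[:] for row in P]   # P^dt, starting at dt = 1
    weight = 1.0 - gamma           # (1 - gamma) * gamma^(dt - 1)
    for _ in range(horizon):
        for i in range(n):
            for j in range(n):
                mu[i][j] += weight * P_dt[i][j]
        P_dt = matmul(P_dt, P)
        weight *= gamma
    return mu


# Two-state chain that deterministically alternates between states 0 and 1.
P = [[0.0, 1.0], [1.0, 0.0]]
mu = discounted_occupancy(P, gamma=0.5)
print(mu[0])  # approximately [1/3, 2/3]
```

With a discount of zero the occupancy reduces to the single-step transition matrix, which is the sense in which a single-step model is a special case.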

4 Generative Temporal Difference Learning

Our goal is to make long-horizon predictions without the need to repeatedly apply a single-step model. Instead of modeling states at a particular instant in time by approximating the environment transition distribution $p(s_{t+1} \mid s_t, a_t)$, we aim to predict a weighted distribution over all possible future states according to $\mu(s \mid s_t, a_t)$. In principle, this can be posed as a conventional maximum likelihood problem:

$$\max_\theta\; \mathbb{E}_{s_t, a_t,\, s \sim \mu(\cdot \mid s_t, a_t)} \big[ \log \mu_\theta(s \mid s_t, a_t) \big].$$

However, doing so would require collecting samples from the occupancy $\mu$ independently for each policy of interest. Forgoing the ability to re-use data from multiple policies when training dynamics models would sacrifice the sample efficiency that often makes model usage compelling in the first place, so we instead aim to design an off-policy algorithm for training $\mu_\theta$. We accomplish this by reinterpreting temporal difference learning as a method for training generative models.

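Instead of collecting only on-policy samples from $\mu(s \mid s_t, a_t)$, we observe that $\mu$ admits a convenient recursive form. Consider a modified MDP in which there is a $1 - \gamma$ probability of terminating at each timestep. The distribution over the state at termination, denoted as the exit state $s_e$, corresponds to first sampling from a termination timestep $\Delta t \sim \text{Geom}(1 - \gamma)$ and then sampling from the per-timestep distribution $p(s_{t + \Delta t} \mid s_t, a_t, \pi)$. The distribution over $s_e$ corresponds exactly to that in the definition of the occupancy $\mu$ in Equation 1, but also lends itself to an interpretation as a mixture over only two components: the distribution at the immediate next timestep, in the event of termination, and that over all subsequent timesteps, in the event of non-termination. This mixture yields the following target distribution:

$$p_{\text{targ}}(s_e \mid s_t, a_t) = \underbrace{(1 - \gamma)\, p(s_e \mid s_t, a_t)}_{\text{single-step distribution}} + \underbrace{\gamma\, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \big[ \mu_\theta(s_e \mid s_{t+1}) \big]}_{\text{model bootstrap}}. \tag{2}$$

We use the shorthand $\mu_\theta(s_e \mid s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi(\cdot \mid s_{t+1})} \big[ \mu_\theta(s_e \mid s_{t+1}, a_{t+1}) \big]$. The target distribution $p_{\text{targ}}$ is reminiscent of a temporal difference target value: the state-action conditioned occupancy $\mu_\theta(s_e \mid s_t, a_t)$ acts as a $Q$-function, the state-conditioned occupancy $\mu_\theta(s_e \mid s_{t+1})$ acts as a value function, and the single-step distribution $p(s_{t+1} \mid s_t, a_t)$ acts as a reward function. However, instead of representing a scalar target value, $p_{\text{targ}}$ is a distribution from which we may sample future states $s_e$. We can use this target distribution in place of samples from the true discounted occupancy $\mu$:

$$\max_\theta\; \mathbb{E}_{s_t, a_t,\, s_e \sim (1 - \gamma) p(\cdot \mid s_t, a_t) + \gamma \mathbb{E}[\mu_\theta(\cdot \mid s_{t+1})]} \big[ \log \mu_\theta(s_e \mid s_t, a_t) \big].$$

This formulation differs from a standard maximum likelihood learning problem in that the target distribution depends on the current model. By bootstrapping the target distribution in this manner, we are able to use only empirical transitions from one policy in order to train an infinite-horizon predictive model for any other policy. Because the horizon is governed by the discount $\gamma$, we refer to such a model as a $\gamma$-model.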
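The bootstrapped update can be made concrete in a tabular setting, where a "generative model" is just a table of categorical distributions. The sketch below is an illustration under stated assumptions rather than the training procedure used in the paper: states are integers, the policy is folded into the observed transitions, and each row of the table is nudged toward the mixture of a one-hot next state (weight 1 - gamma) and the table's own prediction at the next state (weight gamma). The function name and the step size `lr` are hypothetical.

```python
def generative_td(transitions, n_states, gamma, lr=0.5, sweeps=200):
    """Move mu(. | s) toward (1 - gamma) * onehot(s') + gamma * mu(. | s')."""
    # Start from a uniform prediction for every conditioning state.
    mu = [[1.0 / n_states] * n_states for _ in range(n_states)]
    for _ in range(sweeps):
        for s, s_next in transitions:
            for e in range(n_states):
                single_step = 1.0 if e == s_next else 0.0
                # Bootstrapped target: single-step term plus the model's own prediction.
                target = (1.0 - gamma) * single_step + gamma * mu[s_next][e]
                mu[s][e] += lr * (target - mu[s][e])
    return mu


# Transitions of a chain that alternates deterministically between states 0 and 1.
data = [(0, 1), (1, 0)]
mu = generative_td(data, n_states=2, gamma=0.5)
print(mu[0])  # approximately [1/3, 2/3], the discounted occupancy from state 0
```

Only single-step transitions are consumed, yet the learned table describes a discounted distribution over all future states, which is the point of the bootstrap.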

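This bootstrapped model training may be incorporated into a number of different generative modeling frameworks. We discuss two cases here. (1) When the model $\mu_\theta$ permits only sampling, we may train $\mu_\theta$ by minimizing an $f$-divergence from samples:

$$\mathcal{L}_1(s_t, a_t, s_{t+1}) = D_f\big( \mu_\theta(\cdot \mid s_t, a_t) \;\|\; (1 - \gamma)\, p(\cdot \mid s_t, a_t) + \gamma\, \mu_\theta(\cdot \mid s_{t+1}) \big). \tag{3}$$

This objective leads naturally to an adversarially-trained $\gamma$-model. (2) When the model $\mu_\theta$ permits density evaluation, we may minimize an error defined on log-densities directly:

$$\mathcal{L}_2(s_t, a_t, s_{t+1}) = \mathbb{E}_{s_e} \Big[ \big\| \log \mu_\theta(s_e \mid s_t, a_t) - \log\big( (1 - \gamma)\, p(s_e \mid s_t, a_t) + \gamma\, \mu_\theta(s_e \mid s_{t+1}) \big) \big\|_2^2 \Big]. \tag{4}$$

This objective is suitable for $\gamma$-models instantiated as normalizing flows. Due to the approximation of a target log-density $\log\big( \mathbb{E}_{s_{t+1}}[\,\cdot\,] \big)$ using a single next state $s_{t+1}$, $\mathcal{L}_2$ is unbiased for deterministic dynamics and a bound in the case of stochastic dynamics. We provide complete algorithmic descriptions of both variants and highlight practical training considerations in Section 6.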

5 Analysis and Applications of $\gamma$-Models

Using the $\gamma$-model for prediction and control requires us to generalize procedures common in model-based reinforcement learning. In this section, we derive the $\gamma$-model rollout and show how it can be incorporated into a reinforcement learning procedure that hybridizes model-based and model-free value estimation. First, however, we show that the $\gamma$-model is a continuous, generative counterpart to another type of long-horizon model: the successor representation.

5.1 $\gamma$-Models as a Continuous Successor Representation

The successor representation $M$ is a prediction of expected visitation counts (Dayan, 1993). It has a recurrence relation making it amenable to tabular temporal difference algorithms:

$$M(s_e \mid s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \big[ \mathbb{1}[s_e = s_{t+1}] + \gamma\, M(s_e \mid s_{t+1}) \big]. \tag{5}$$

Adapting the successor representation to continuous state spaces in a way that retains an interpretation as a probabilistic model has proven challenging. However, variants that forgo the ability to sample in favor of estimating expected state features have been developed (Barreto et al., 2017).

The form of the successor recurrence relation bears a striking resemblance to that of the target distribution in Equation 2, suggesting a connection between the generative, continuous $\gamma$-model and the discriminative, tabular successor representation. We now make this connection precise.

Proposition 1.

The global minimum of both $\mathcal{L}_1$ and $\mathcal{L}_2$ is achieved if and only if the resulting $\gamma$-model produces samples according to the normalized successor representation:

$$\mu_\theta(s_e \mid s_t, a_t) = (1 - \gamma)\, M(s_e \mid s_t, a_t).$$

Proof. In the case of either objective, the global minimum is achieved only when

$$\mu_\theta(s_e \mid s_t, a_t) = (1 - \gamma)\, p(s_e \mid s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \big[ \mu_\theta(s_e \mid s_{t+1}) \big]$$

for all $s_t, a_t$. We recognize this optimality condition exactly as the recurrence defining the successor representation $M$ (Equation 5), scaled by $(1 - \gamma)$ such that $\mu_\theta$ integrates to $1$ over $s_e$. ∎
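The normalization in Proposition 1 can be checked numerically in a tabular chain. The sketch below is illustrative only: it assumes the policy is folded into a state-to-state matrix `P`, iterates the successor recurrence in its matrix form M = P + gamma * P M until it stops changing, and then scales by (1 - gamma). The function name is hypothetical.

```python
def successor_representation(P, gamma, iters=500):
    """Iterate the tabular recurrence M = P + gamma * P M (policy folded into P)."""
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        M = [[P[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return M


P = [[0.1, 0.9], [0.7, 0.3]]
gamma = 0.9
M = successor_representation(P, gamma)
normalized = [[(1.0 - gamma) * m for m in row] for row in M]
print([sum(row) for row in M])           # each close to 1 / (1 - gamma) = 10
print([sum(row) for row in normalized])  # each close to 1
```

The unnormalized rows sum to the expected discounted number of visits, while the scaled rows are proper distributions, matching the role played by the factor (1 - gamma) in the proposition.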

5.2 $\gamma$-Model Rollouts

Standard single-step models, which correspond to $\gamma$-models with $\gamma = 0$, can predict multiple steps into the future by making iterated autoregressive predictions, conditioning each step on their own output from the previous step. These sequential rollouts form the foundation of most model-based reinforcement learning algorithms. We now generalize these rollouts to $\gamma$-models for $\gamma > 0$, allowing us to decouple the discount used during model training from the desired horizon in control. When working with multiple discount factors, we explicitly condition an occupancy on its discount as $\mu(s_e \mid s_t; \gamma)$. In the results below, we omit the model parameterization $\theta$ whenever a statement applies to both a discounted occupancy $\mu$ and a parametric $\gamma$-model $\mu_\theta$.

Theorem 1.

Let $\mu_n(s_e \mid s_t; \gamma)$ denote the distribution over states at the $n^{\text{th}}$ sequential step of a $\gamma$-model rollout beginning from state $s_t$. For any desired discount $\tilde\gamma \in [\gamma, 1)$, we may reweight the samples from these model rollouts according to the weights

$$\alpha_n = \frac{(1 - \tilde\gamma)(\tilde\gamma - \gamma)^{n-1}}{(1 - \gamma)^{n}}$$

to obtain the state distribution drawn from $\mu_1(s_e \mid s_t; \tilde\gamma) = \mu(s_e \mid s_t; \tilde\gamma)$. That is, we may reweight the steps of a $\gamma$-model rollout so as to match the distribution of a $\gamma$-model with larger discount:

$$\mu(s_e \mid s_t; \tilde\gamma) = \sum_{n=1}^{\infty} \alpha_n\, \mu_n(s_e \mid s_t; \gamma).$$

Proof. See Appendix A. ∎

This reweighting scheme has two special cases of interest. A standard single-step model, with $\gamma = 0$, yields $\alpha_n = (1 - \tilde\gamma)\, \tilde\gamma^{n-1}$. These weights are familiar from the definition of the discounted state occupancy in terms of a per-timestep mixture (Equation 1). Setting $\gamma = \tilde\gamma$ yields $\alpha_n = 0^{n-1}$, or a weight of 1 on the first step and 0 on all subsequent steps (using the convention $0^0 = 1$). This result is also expected: when the model discount matches the target discount, only a single forward pass of the model is required.
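The rollout weights are simple to compute. The sketch below writes them in the algebraically equivalent form alpha_n = ((1 - gt) / (1 - g)) * ((gt - g) / (1 - g))^(n - 1), where `g` is the model discount and `gt` the larger target discount, which avoids floating-point underflow in the numerator and denominator for long rollouts. The function name is hypothetical.

```python
def rollout_weights(gamma, gamma_tilde, n_steps):
    """Weights alpha_1..alpha_{n_steps} for reweighting a gamma-model rollout."""
    if not 0.0 <= gamma <= gamma_tilde < 1.0:
        raise ValueError("expected 0 <= gamma <= gamma_tilde < 1")
    ratio = (gamma_tilde - gamma) / (1.0 - gamma)   # geometric decay per model step
    first = (1.0 - gamma_tilde) / (1.0 - gamma)     # alpha_1
    # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
    return [first * ratio ** (n - 1) for n in range(1, n_steps + 1)]


print(rollout_weights(0.0, 0.5, 3))   # [0.5, 0.25, 0.125]: single-step model
print(rollout_weights(0.9, 0.9, 3))   # [1.0, 0.0, 0.0]: discounts already match
```

Because the weights form a geometric sequence, they sum to one over an infinite rollout, and the mass left after a finite number of steps is the per-step ratio raised to that number.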

Figure 2: (a) The first step from a $\gamma$-model samples states at timesteps distributed according to a geometric distribution with parameter $1 - \gamma$; all subsequent steps have a negative binomial timestep distribution stemming from the sum of independent geometric random variables. When these steps are reweighted according to Theorem 1, the resulting distribution follows a geometric distribution with smaller parameter (corresponding to a larger discount value $\tilde\gamma$). (b) The number of steps needed to recover 95% of the probability mass from distributions induced by various target discounts $\tilde\gamma$ for all valid model discounts $\gamma$. When using a standard single-step model, corresponding to the case of $\gamma = 0$, a 299-step model rollout is required to reweight to a discount of $\tilde\gamma = 0.99$.

Figure 2 visually depicts the reweighting scheme and the number of steps required for truncated model rollouts to approximate the distribution induced by a larger discount. There is a natural tradeoff with $\gamma$-models: the higher $\gamma$ is, the fewer model steps are needed to make long-horizon predictions, reducing model-based compounding prediction errors (Asadi et al., 2019; Janner et al., 2019). However, increasing $\gamma$ transforms what would normally be a standard maximum likelihood problem (in the case of single-step models) into one resembling approximate dynamic programming (with a model bootstrap), leading to model-free bootstrap error accumulation (Kumar et al., 2019). The primary distinction is whether this accumulation occurs during training, when the work of sampling from the occupancy is being amortized, or during “testing”, when the model is being used for rollouts. While this horizon-based error compounding cannot be eliminated entirely, $\gamma$-models allow for a continuous interpolation between the two extremes.
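The rollout length implied by this tradeoff follows directly from the geometric form of the step weights: after H reweighted model steps, the unrecovered mass is ((gt - g) / (1 - g))^H, with `g` the model discount and `gt` the target discount. The sketch below assumes a 95% mass threshold, which is the value consistent with the 299-step figure quoted for a single-step model and a target discount of 0.99. The function name is hypothetical.

```python
def steps_to_recover(gamma, gamma_tilde, mass=0.95):
    """Smallest H whose first H reweighted gamma-model steps cover `mass` of the target."""
    ratio = (gamma_tilde - gamma) / (1.0 - gamma)
    steps, remaining = 1, ratio   # `remaining` is the mass left after `steps` steps
    while 1.0 - remaining < mass:
        steps += 1
        remaining *= ratio
    return steps


print(steps_to_recover(0.0, 0.99))   # 299 steps with a single-step model
print(steps_to_recover(0.8, 0.99))   # 59 steps with a model discount of 0.8
print(steps_to_recover(0.99, 0.99))  # 1 step when the discounts match
```

Raising the model discount shortens the required rollout, at the cost of a harder bootstrapped training problem.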

5.3 $\gamma$-Model-Based Value Expansion

We now turn our attention from prediction with $\gamma$-models to value estimation for control. In tabular domains, the state-action value function can be decomposed as the inner product between the successor representation $M$ and the vector of per-state rewards (Gershman, 2018). Taking care to account for the normalization from the equivalence in Proposition 1, we can similarly estimate the $Q$ function as the expectation of reward under states sampled from the $\gamma$-model:

$$Q(s_t, a_t; \gamma) = \sum_{\Delta t = 1}^{\infty} \gamma^{\Delta t - 1} \int_{\mathcal{S}} r(s_{t + \Delta t})\, p(s_{t + \Delta t} \mid s_t, a_t, \pi)\, \mathrm{d}s_{t + \Delta t} = \frac{1}{1 - \gamma}\, \mathbb{E}_{s_e \sim \mu(\cdot \mid s_t, a_t; \gamma)} \big[ r(s_e) \big]. \tag{6}$$

This relation suggests a model-based reinforcement learning algorithm in which $Q$-values are estimated by a $\gamma$-model without the need for sequential model-based rollouts. However, in some cases it may be practically difficult to train a generative $\gamma$-model with discount as large as that of a discriminative $Q$-function. While one option is to chain together $\gamma$-model steps as in Section 5.2, an alternative solution often effective with single-step models is to combine short-term value estimates from a truncated model rollout with a terminal model-free value prediction:

$$V_{\text{MVE}}(s_t; \tilde\gamma) = \sum_{n=1}^{H} \tilde\gamma^{\,n-1}\, r(s_{t+n}) + \tilde\gamma^{H}\, V(s_{t+H}; \tilde\gamma).$$

This hybrid estimator is referred to as a model-based value expansion (MVE; Feinberg et al. 2018). There is a hard transition between the model-based and model-free value estimation in MVE, occurring at the model horizon $H$. We may replace the single-step model with a $\gamma$-model for a similar estimator in which there is a probabilistic prediction horizon, and as a result a gradual transition:

$$V_{\gamma\text{-MVE}}(s_t; \tilde\gamma) = \frac{1}{1 - \tilde\gamma} \sum_{n=1}^{H} \alpha_n\, \mathbb{E}_{s_e \sim \mu_n(\cdot \mid s_t; \gamma)} \big[ r(s_e) \big] + \left( \frac{\tilde\gamma - \gamma}{1 - \gamma} \right)^{H} \mathbb{E}_{s_e \sim \mu_H(\cdot \mid s_t; \gamma)} \big[ V(s_e; \tilde\gamma) \big].$$

The $\gamma$-MVE estimator allows us to perform $\gamma$-model-based rollouts with horizon $H$, reweight the samples from this rollout by solving for weights $\alpha_n$ given a desired discount $\tilde\gamma > \gamma$, and correct for the truncation error stemming from the finite rollout length using a terminal value function with discount $\tilde\gamma$. As expected, MVE is a special case of $\gamma$-MVE, as can be verified by considering the weights corresponding to $\gamma = 0$ described in Section 5.2. This estimator, along with the simpler value estimation in Equation 6, highlights the fact that it is not necessary to have timesteps associated with states in order to use predictions for decision-making. We provide a more thorough treatment of $\gamma$-MVE, complete with pseudocode for a corresponding actor-critic algorithm, in Appendix B.
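The value expansion with a probabilistic horizon can be sanity-checked in a tabular chain, where every quantity is available exactly. The sketch below is a check under stated assumptions, not the actor-critic implementation: the policy is folded into a state-to-state matrix `P`, rewards depend on the state reached, exact occupancies stand in for learned models, and all function names are hypothetical. With an exact terminal value function, the estimator should reproduce the true value for any horizon.

```python
def matvec(A, v):
    """Multiply a matrix (nested lists) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def occupancy_matrix(P, gamma, iters=4000):
    """Fixed point of G = (1 - gamma) P + gamma P G, the discounted occupancy."""
    n = len(P)
    G = [row[:] for row in P]
    for _ in range(iters):
        G = [[(1.0 - gamma) * P[i][j] + gamma * sum(P[i][k] * G[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return G


def value(P, r, gamma_tilde):
    """V(s) = E_{s_e ~ mu(. | s; gamma_tilde)}[r(s_e)] / (1 - gamma_tilde)."""
    G = occupancy_matrix(P, gamma_tilde)
    return [x / (1.0 - gamma_tilde) for x in matvec(G, r)]


def gamma_mve(P, r, gamma, gamma_tilde, horizon, terminal_value):
    """`horizon` reweighted gamma-model steps plus a terminal value estimate."""
    G = occupancy_matrix(P, gamma)
    ratio = (gamma_tilde - gamma) / (1.0 - gamma)
    alpha = (1.0 - gamma_tilde) / (1.0 - gamma)   # alpha_1
    estimate = [0.0] * len(P)
    reward_n, value_n = r, terminal_value
    for _ in range(horizon):
        reward_n = matvec(G, reward_n)            # expected reward at model step n
        value_n = matvec(G, value_n)              # expected terminal value at step n
        estimate = [e + alpha * x / (1.0 - gamma_tilde)
                    for e, x in zip(estimate, reward_n)]
        alpha *= ratio                            # alpha_{n+1} = ratio * alpha_n
    tail = ratio ** horizon                       # weight on the terminal value
    return [e + tail * v for e, v in zip(estimate, value_n)]


P = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.2, 0.0, 0.8]]
r = [0.0, 1.0, -0.5]
V = value(P, r, gamma_tilde=0.95)
print(V)
print(gamma_mve(P, r, gamma=0.8, gamma_tilde=0.95, horizon=3, terminal_value=V))
```

Setting the model discount to zero makes the occupancy matrix equal to `P`, and the same function then computes a standard model-based value expansion with a single-step model.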

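Algorithm 1   $\gamma$-model training without density evaluation
  Input $\mathcal{D}$: dataset of transitions, $\pi$: policy, $\lambda$: step size, $\tau$: delay parameter
  Initialize parameter vectors $\theta$, $\bar\theta$, $\phi$
  while not converged do
     Sample transitions $(s_t, a_t, s_{t+1})$ from $\mathcal{D}$ and actions $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
     Sample from bootstrapped target $s_e^{+} \sim (1 - \gamma)\, \delta_{s_{t+1}} + \gamma\, \mu_{\bar\theta}(\cdot \mid s_{t+1}, a_{t+1})$
     Sample from current model $s_e^{-} \sim \mu_\theta(\cdot \mid s_t, a_t)$
     Evaluate objective $\mathcal{L} = \log D_\phi(s_e^{+} \mid s_t, a_t) + \log\big(1 - D_\phi(s_e^{-} \mid s_t, a_t)\big)$
     Update model parameters $\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{L}$; $\phi \leftarrow \phi + \lambda \nabla_\phi \mathcal{L}$
     Update target parameters $\bar\theta \leftarrow \tau \theta + (1 - \tau)\, \bar\theta$
  end while

Algorithm 2   $\gamma$-model training with density evaluation
  Input $\mathcal{D}$: dataset of transitions, $\pi$: policy, $\lambda$: step size, $\tau$: delay parameter, $\sigma^2$: variance
  Initialize parameter vectors $\theta$, $\bar\theta$; let $f$ denote the Gaussian pdf
  while not converged do
     Sample transitions $(s_t, a_t, s_{t+1})$ from $\mathcal{D}$ and actions $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
     Sample from bootstrapped target $s_e \sim (1 - \gamma)\, \mathcal{N}(s_{t+1}, \sigma^2) + \gamma\, \mu_{\bar\theta}(\cdot \mid s_{t+1}, a_{t+1})$
     Construct target values $T = \log\big( (1 - \gamma)\, f(s_e \mid s_{t+1}, \sigma^2) + \gamma\, \mu_{\bar\theta}(s_e \mid s_{t+1}, a_{t+1}) \big)$
     Evaluate objective $\mathcal{L} = \big\| \log \mu_\theta(s_e \mid s_t, a_t) - T \big\|_2^2$
     Update model parameters $\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{L}$
     Update target parameters $\bar\theta \leftarrow \tau \theta + (1 - \tau)\, \bar\theta$
  end while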
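Both algorithms draw samples from a bootstrapped target, which amounts to a biased coin flip between the observed next state and a sample from the delayed model. The sketch below illustrates that step for scalar states. The callable `sample_model` is a hypothetical stand-in for a delayed model conditioned on the next state (with the next action assumed to be sampled inside it), and `toy_model` is an arbitrary placeholder rather than a trained model.

```python
import random


def sample_bootstrapped_target(s_next, gamma, sample_model, rng):
    """Draw s_e from (1 - gamma) * delta(s_next) + gamma * model(. | s_next)."""
    if rng.random() < 1.0 - gamma:
        return s_next                  # "terminate": the exit state is the next state
    return sample_model(s_next, rng)   # otherwise bootstrap from the delayed model


def toy_model(s_next, rng):
    """Hypothetical stand-in for a delayed model: Gaussian noise around s_next."""
    return s_next + rng.gauss(0.0, 1.0)


rng = random.Random(0)
samples = [sample_bootstrapped_target(2.0, 0.75, toy_model, rng) for _ in range(10000)]
frac_single_step = sum(s == 2.0 for s in samples) / len(samples)
print(frac_single_step)  # close to 1 - gamma = 0.25
```

As the discount grows, a larger share of the training targets comes from the model itself, which is where the bootstrapping error discussed in Section 5.2 enters.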

6 Practical Training of $\gamma$-Models

Because $\gamma$-model training differs from standard dynamics modeling primarily in the bootstrapped target distribution and not in the model parameterization, $\gamma$-models are in principle compatible with any generative modeling framework. We focus on two representative scenarios, differing in whether the generative model class used to instantiate the $\gamma$-model allows for tractable density evaluation.

Training without density evaluation

When the $\gamma$-model parameterization does not allow for tractable density evaluation, we minimize a bootstrapped $f$-divergence according to $\mathcal{L}_1$ (Equation 3) using only samples from the model. The generative adversarial framework provides a convenient way to train a parametric generator by minimizing an $f$-divergence of choice given only samples from a target distribution and the ability to sample from the generator (Goodfellow et al., 2014; Nowozin et al., 2016). In the case of bootstrapped maximum likelihood problems, our target distribution is induced by the model itself (alongside a single-step transition distribution), meaning that we only need sample access to our $\gamma$-model in order to train $\mu_\theta$ as a generative adversarial network (GAN).

Introducing an auxiliary discriminator $D_\phi$ and selecting the Jensen-Shannon divergence as our $f$-divergence, we can reformulate minimization of the original objective $\mathcal{L}_1$ as a saddle-point optimization over the following objective:

$$\hat{\mathcal{L}}_1(s_t, a_t, s_{t+1}) = \mathbb{E}_{s_e^{+} \sim p_{\text{targ}}(\cdot \mid s_t, a_t)} \big[ \log D_\phi(s_e^{+} \mid s_t, a_t) \big] + \mathbb{E}_{s_e^{-} \sim \mu_\theta(\cdot \mid s_t, a_t)} \big[ \log\big( 1 - D_\phi(s_e^{-} \mid s_t, a_t) \big) \big],$$

which is minimized over $\mu_\theta$ and maximized over $D_\phi$. As in $\mathcal{L}_1$, $p_{\text{targ}}$ refers to the bootstrapped target distribution in Equation 2. In this formulation, $\mu_\theta$ produces samples by virtue of a deterministic mapping of a random input vector $z \sim \mathcal{N}(0, I)$ and conditioning information $(s_t, a_t)$. Other choices of $f$-divergence may be instantiated by different choices of activation function (Nowozin et al., 2016).

Training with density evaluation

When the $\gamma$-model permits density evaluation, we may bypass saddle point approximations to an $f$-divergence and directly regress to target density values, as in objective $\mathcal{L}_2$ (Equation 4). This is a natural choice when the $\gamma$-model is instantiated as a conditional normalizing flow (Rezende and Mohamed, 2015). Evaluating target values of the form

$$T(s_t, a_t, s_{t+1}, s_e) = \log\big( (1 - \gamma)\, p(s_e \mid s_t, a_t) + \gamma\, \mu_\theta(s_e \mid s_{t+1}) \big)$$

requires density evaluation of not only our $\gamma$-model, but also the single-step transition distribution. There are two options for estimating the single-step densities: (1) a single-step model $p_\theta$ may be trained alongside the $\gamma$-model $\mu_\theta$ for the purposes of constructing targets $T(\cdot)$, or (2) a simple approximate model may be constructed on the fly from $(s_t, a_t, s_{t+1})$ transitions. We found $p_\theta = \mathcal{N}(s_{t+1}, \sigma^2)$, with $\sigma^2$ a constant hyperparameter, to be sufficient.
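The target log-density is a two-component mixture evaluated in log space, so a log-sum-exp keeps it stable when one component is tiny. The sketch below treats scalar states and uses the fixed-variance Gaussian around the observed next state as the single-step density. The callable `model_logpdf` is a hypothetical stand-in for the log-density of a delayed model, and the function names are illustrative.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of a scalar Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_target(s_e, s_next, gamma, model_logpdf, var):
    """log((1 - gamma) * N(s_e; s_next, var) + gamma * model(s_e | s_next))."""
    terms = []
    if gamma < 1.0:
        terms.append(math.log(1.0 - gamma) + gaussian_logpdf(s_e, s_next, var))
    if gamma > 0.0:
        terms.append(math.log(gamma) + model_logpdf(s_e, s_next))
    top = max(terms)
    # log-sum-exp: subtract the largest term before exponentiating
    return top + math.log(sum(math.exp(t - top) for t in terms))


def squared_log_error(model_log_density, target_log_density):
    """Regression loss between the model's log-density and the target value."""
    return (model_log_density - target_log_density) ** 2


# Hypothetical delayed model: a wider Gaussian centred on the next state.
delayed = lambda s_e, s_next: gaussian_logpdf(s_e, s_next, 4.0)
T = log_target(s_e=1.5, s_next=1.0, gamma=0.9, model_logpdf=delayed, var=0.01)
print(squared_log_error(gaussian_logpdf(1.5, 1.0, 4.0), T))
```

A small variance concentrates the single-step component on the observed next state, so the mixture approaches the point-mass form used when sampling targets.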

Stability considerations

To alleviate the instability caused by bootstrapping, we appeal to the standard solution employed in model-free reinforcement learning: decoupling the regression targets from the current model by way of a “delayed” target network (Mnih et al., 2015). In particular, we use a delayed $\gamma$-model $\mu_{\bar\theta}$ in the bootstrapped target distribution $p_{\text{targ}}$, with the parameters $\bar\theta$ given by an exponentially-moving average of previous parameters $\theta$.
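The delayed parameters can be maintained with a one-line exponential moving average. The sketch below assumes parameters stored as flat lists of floats and the convention that the delay parameter `tau` weights the current parameters; both the representation and the function name are illustrative choices.

```python
def ema_update(target_params, params, tau):
    """Return tau * params + (1 - tau) * target_params, elementwise."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, params)]


target = [0.0, 0.0]
current = [1.0, -2.0]
for _ in range(3):
    target = ema_update(target, current, tau=0.5)
print(target)  # [0.875, -1.75]: moving toward the current parameters
```

A small `tau` makes the target model lag further behind, which slows the drift of the regression targets between updates.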

We summarize the above scenarios in Algorithms 1 and 2. We isolate model training from data collection and focus on a setting in which a static dataset is provided, but this algorithm may also be used in a data-collection loop for policy improvement. Further implementation details, including all hyperparameter settings and network architectures, are included in Appendix C.

Figure 3: Visualization of the predicted distribution from a single feedforward pass of normalizing flow $\gamma$-models trained with varying discounts $\gamma$. The conditioning state $s_t$ is marked in each plot. The leftmost plots, with $\gamma = 0$, correspond to a single-step model. For comparison, the rightmost plots show a Monte Carlo estimation of the discounted occupancy from environment trajectories.

7 Experiments

Our experimental evaluation is designed to study the viability of $\gamma$-models as a replacement of conventional single-step models for long-horizon state prediction and model-based control.

7.1 Prediction

We investigate $\gamma$-model predictions as a function of discount in continuous-action versions of two benchmark environments suitable for visualization: acrobot (Sutton, 1996) and pendulum. The training data come from a mixture distribution over all intermediate policies of 200 epochs of optimization with soft actor-critic (SAC; Haarnoja et al. 2018). The final converged policy is used for $\gamma$-model training. We refer to Appendix C for implementation and experiment details.

Figure 3 shows the predictions of a $\gamma$-model trained as a normalizing flow according to Algorithm 2 for five different discounts, ranging upward from $\gamma = 0$ (a single-step model). The rightmost column shows the ground truth discounted occupancy for comparison, estimated with Monte Carlo rollouts of the policy. Increasing the discount $\gamma$ during training has the expected effect of qualitatively increasing the predictive lookahead of a single feedforward pass of the $\gamma$-model. We found flow-based $\gamma$-models to be more reliable than GAN parameterizations, especially at higher discounts. Corresponding GAN $\gamma$-model visualizations can be found in Appendix E for comparison.

Equation 6 expresses values as an expectation over a single feedforward pass of a $\gamma$-model. We visualize this relation in Figure 4, which depicts $\gamma$-model predictions on the pendulum environment for a discount of $\gamma = 0.99$ and the resulting value map estimated by taking expectations over these predicted state distributions. In comparison, value estimation for the same discount using a single-step model would require 299-step rollouts in order to recover 95% of the probability mass (see Figure 2).

7.2 Control

To study the utility of the $\gamma$-model for model-based reinforcement learning, we use the $\gamma$-MVE estimator from Section 5.3 as a drop-in replacement for value estimation in SAC. We compare this approach to the state-of-the-art in model-based and model-free methods, with representative algorithms consisting of SAC, PPO (Schulman et al., 2017), MBPO (Janner et al., 2019), and MVE (Feinberg et al., 2018). In $\gamma$-MVE, we use a model discount $\gamma$, a larger value discount $\tilde\gamma$, and a single model step. We choose the model rollout length in MVE such that it has an effective horizon identical to that of $\gamma$-MVE. Other hyperparameter settings can once again be found in Appendix C; details regarding the evaluation environments can be found in Appendix D. Figure 5 shows learning curves for all methods. We find that $\gamma$-MVE converges faster than prior algorithms, twice as quickly as SAC, while retaining their asymptotic performance.

Figure 4: Values are expectations of reward over a single feedforward pass of a $\gamma$-model (Equation 6). We visualize $\gamma$-model predictions ($\gamma = 0.99$) from nine starting states, each marked in the plots, in the pendulum benchmark environment. Taking the expectation of reward over each of these predicted distributions yields a value estimate for the corresponding conditioning state. The rightmost plot depicts the value map produced by value iteration on a discretization of the same environment for reference.

Figure 5: Comparative performance of $\gamma$-MVE and four prior reinforcement learning algorithms on continuous control benchmark tasks. $\gamma$-MVE retains the asymptotic performance of SAC with sample-efficiency matching that of MBPO. Shaded regions depict standard deviation among seeds.

8 Discussion, Limitations, and Future Work

We have introduced a new class of predictive model, a $\gamma$-model, that is a hybrid between standard model-free and model-based mechanisms. It is policy-conditioned and infinite-horizon, like a value function, but independent of reward, like a standard single-step model. This new formulation of infinite-horizon prediction allows us to generalize the procedures integral to model-based control, yielding new variants of model rollouts and model-based value estimation.

Our experimental evaluation shows that, on tasks with low to moderate dimensionality, our method learns accurate long-horizon predictive distributions without sequential rollouts and can be incorporated into standard model-based reinforcement learning methods to produce results that are competitive with state-of-the-art algorithms. Scaling up our framework to more complex tasks, including high-dimensional continuous control problems and tasks with image observations, presents a number of additional challenges. We are optimistic that continued improvements in training techniques for generative models and increased stability in off-policy reinforcement learning will also carry benefits for $\gamma$-model training.


We thank Karthik Narasimhan, Pulkit Agrawal, Anirudh Goyal, Pedro Tsividis, Anusha Nagabandi, Aviral Kumar, and Michael Chang for formative discussions about model-based reinforcement learning and generative modeling. This work was partially supported by computational resource donations from Amazon. M.J. is supported by fellowships from the National Science Foundation and the Open Philanthropy Project.

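Appendix A Derivation of $\gamma$-Model-Based Rollout Weights

Theorem 1.

Let $\mu_n(s_e \mid s_t; \gamma)$ denote the distribution over states at the $n^{\text{th}}$ sequential step of a $\gamma$-model rollout beginning from state $s_t$. For any desired discount $\tilde\gamma \in [\gamma, 1)$, we may reweight the samples from these model rollouts according to the weights

$$\alpha_n = \frac{(1 - \tilde\gamma)(\tilde\gamma - \gamma)^{n-1}}{(1 - \gamma)^{n}}$$

to obtain the state distribution drawn from $\mu_1(s_e \mid s_t; \tilde\gamma) = \mu(s_e \mid s_t; \tilde\gamma)$. That is, we may reweight the steps of a $\gamma$-model rollout so as to match the distribution of a $\gamma$-model with larger discount:

$$\mu(s_e \mid s_t; \tilde\gamma) = \sum_{n=1}^{\infty} \alpha_n\, \mu_n(s_e \mid s_t; \gamma).$$

Proof. Each step of the $\gamma$-model samples a time according to $\Delta t \sim \text{Geom}(1 - \gamma)$, so the time after $n$ $\gamma$-model steps is distributed according to the sum of $n$ independent geometric random variables with identical parameters. This sum corresponds to a negative binomial random variable, $\text{NB}(n, 1 - \gamma)$, with the following pmf:

$$p_n(t) = \binom{t - 1}{t - n}\, \gamma^{\,t - n} (1 - \gamma)^{n}, \qquad t \ge n. \tag{7}$$

Equation 7 is mildly different from the textbook pmf because we want a distribution over the total number of trials (in our case, cumulative timesteps $t$) instead of the number of successes before the $n^{\text{th}}$ failure. The latter is more commonly used because it gives the random variable the same support, $t \ge 0$, for all $n$. The form in Equation 7 only has support for $t \ge n$, which substantially simplifies the following analysis.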
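The trial-counting form of the negative binomial pmf can be checked against a direct convolution of geometric distributions. The sketch below assumes each model step advances time by a geometric number of timesteps supported on 1, 2, ..., with success probability 1 - gamma; the function names are hypothetical.

```python
from math import comb


def geometric_pmf(t, gamma):
    """P(dt = t) for dt ~ Geom(1 - gamma) supported on t = 1, 2, ..."""
    return (1.0 - gamma) * gamma ** (t - 1) if t >= 1 else 0.0


def negative_binomial_pmf(t, n, gamma):
    """P(total timesteps = t) after n model steps; zero unless t >= n."""
    if t < n:
        return 0.0
    return comb(t - 1, t - n) * gamma ** (t - n) * (1.0 - gamma) ** n


def convolve_geometric(p_prev, gamma, t_max):
    """Distribution of (previous total + one more geometric step) for t = 0..t_max."""
    return [sum(p_prev[s] * geometric_pmf(t - s, gamma) for s in range(t))
            for t in range(t_max + 1)]


gamma, t_max = 0.8, 30
p = [geometric_pmf(t, gamma) for t in range(t_max + 1)]   # after one model step
for n in range(2, 5):
    p = convolve_geometric(p, gamma, t_max)               # after n model steps
    gap = max(abs(p[t] - negative_binomial_pmf(t, n, gamma)) for t in range(t_max + 1))
    print(n, gap)  # gaps at floating-point precision
```

The support restriction is visible in the convolution as well: the first n entries of the n-step distribution are exactly zero.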

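The distributions expressible as a mixture over the per-timestep negative binomial distributions are given by:

$$q(t) = \sum_{n=1}^{t} \alpha_n\, p_n(t),$$

in which $\alpha_n$ are the mixture weights. Because $p_n$ only has support for $t \ge n$, it suffices to only consider the first $t$ $\gamma$-model steps when solving for $q(t)$.

We are interested in the scenario in which $q(t)$ is also a geometric random variable with smaller parameter, corresponding to a larger discount $\tilde\gamma$. We proceed by setting $q(t) = \text{Geom}(1 - \tilde\gamma)$ and solving for the mixture weights $\alpha_n$ by induction.

Base case.

Let $n = 1$. Because $p_1$ is the only mixture component with support at $t = 1$, $\alpha_1$ is determined by $q(1)$:

$$1 - \tilde\gamma = \alpha_1\, p_1(1) = \alpha_1 (1 - \gamma).$$

Solving for $\alpha_1$ gives:

$$\alpha_1 = \frac{1 - \tilde\gamma}{1 - \gamma}.$$

Induction step.

We now assume the form of $\alpha_k$ for $k = 1, \ldots, n - 1$ and solve for $\alpha_n$ using $q(n)$. Applying the binomial theorem to the sum over the first $n - 1$ components,

$$(1 - \tilde\gamma)\, \tilde\gamma^{\,n-1} = \sum_{k=1}^{n-1} \frac{(1 - \tilde\gamma)(\tilde\gamma - \gamma)^{k-1}}{(1 - \gamma)^{k}} \binom{n - 1}{n - k} \gamma^{\,n - k} (1 - \gamma)^{k} + \alpha_n (1 - \gamma)^{n} = (1 - \tilde\gamma) \big[ \tilde\gamma^{\,n-1} - (\tilde\gamma - \gamma)^{n-1} \big] + \alpha_n (1 - \gamma)^{n}.$$

Solving for $\alpha_n$ gives

$$\alpha_n = \frac{(1 - \tilde\gamma)(\tilde\gamma - \gamma)^{n-1}}{(1 - \gamma)^{n}}$$

as desired. ∎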
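The result of the induction can be verified numerically: mixing the negative binomial timestep distributions with the derived weights should give a geometric distribution governed by the larger discount. The sketch below performs that check for a range of timesteps, writing `g` for the model discount and `gt` for the target discount; the function names are hypothetical.

```python
from math import comb


def negative_binomial_pmf(t, n, g):
    """P(total timesteps = t) after n model steps with model discount g."""
    if t < n:
        return 0.0
    return comb(t - 1, t - n) * g ** (t - n) * (1.0 - g) ** n


def mixture_pmf(t, g, gt):
    """sum_{n=1}^{t} alpha_n * p_n(t) with alpha_n = (1-gt) (gt-g)^(n-1) / (1-g)^n."""
    ratio = (gt - g) / (1.0 - g)
    first = (1.0 - gt) / (1.0 - g)
    return sum(first * ratio ** (n - 1) * negative_binomial_pmf(t, n, g)
               for n in range(1, t + 1))


g, gt = 0.8, 0.95
for t in (1, 2, 5, 20):
    print(t, mixture_pmf(t, g, gt), (1.0 - gt) * gt ** (t - 1))  # the two columns agree
```

Only the first t mixture components contribute at timestep t, which is what makes the weights solvable one at a time in the proof.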

Appendix B Derivation of -Model-Based Value Expansion

In this section, we derive the -MVE estimator and provide pseudo-code showing how it may be used as a drop-in replacement for value estimation in an actor-critic algorithm. Before we begin, we prove a lemma which will become useful in interpreting value functions as weighted averages.

Lemma 1.

We now proceed to the -MVE estimator itself.

Theorem 2.

For , may be decomposed as a weighted average of -model steps and a terminal value estimation. We denote this as the -MVE estimator:


The second equality rewrites an expectation over a -model as an expectation over a rollout of a -model using step weights from Theorem 1. We recognize


as the model-based component of the value estimation in -MVE. All that remains is to write


using a terminal value function.


The second equality uses and the time-invariance of with respect to its conditioning state. Plugging Equation 9 into Equation 8 gives:

Remark 1.

Using Lemma 1 to substitute in place of clarifies the interpretation of as a weighted average over -model steps and a terminal value function. Because the mixture weights must sum to 1, it is unsurprising that the weight on the terminal value function turned out to be .

Remark 2.

Setting $\gamma = 0$ recovers standard MVE with a single-step model, as the weights on the model steps simplify to $\alpha_n = (1 - \tilde{\gamma}) \tilde{\gamma}^{n-1}$ and the weight on the terminal value function simplifies to $\tilde{\gamma}^{H}$.
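The two remarks can be illustrated with a few lines of arithmetic. The sketch below assumes the weights $\alpha_n = (1 - \tilde{\gamma})(\tilde{\gamma} - \gamma)^{n-1} / (1 - \gamma)^n$ and a terminal weight of $\big( (\tilde{\gamma} - \gamma) / (1 - \gamma) \big)^H$; the helper name `mve_weights` and the example discounts are hypothetical.

```python
def mve_weights(horizon, gamma, gamma_tilde):
    """Return ([alpha_1, ..., alpha_H], terminal_weight) for a gamma-MVE rollout.

    Assumes alpha_n = (1 - gt) (gt - g)^(n-1) / (1 - g)^n, rewritten with
    ratio = (gt - g) / (1 - g) as alpha_n = (1 - ratio) * ratio^(n-1).
    """
    ratio = (gamma_tilde - gamma) / (1.0 - gamma)
    alphas = [(1.0 - ratio) * ratio ** (n - 1) for n in range(1, horizon + 1)]
    return alphas, ratio ** horizon


# Remark 1: model-step weights plus the terminal weight sum to one.
alphas, terminal = mve_weights(horizon=5, gamma=0.8, gamma_tilde=0.95)
print("sum of weights:", sum(alphas) + terminal)

# Remark 2: with gamma = 0 the weights reduce to those of single-step MVE.
alphas_0, terminal_0 = mve_weights(horizon=5, gamma=0.0, gamma_tilde=0.95)
print("alpha_n at gamma = 0:", alphas_0)
print("terminal weight at gamma = 0:", terminal_0)
```

Increasing the model discount $\gamma$ toward $\tilde{\gamma}$ shrinks the ratio, so the terminal value function receives less weight for the same number of model steps.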

Input $\gamma$: model discount, $\tilde{\gamma}$: value discount, step size
Initialize $\gamma$-model generator
Initialize $Q$-function, value function, policy, replay buffer
for each iteration do
    for each environment step do
        Sample an action from the policy
        Step the environment to obtain the next state
        Observe the reward
        Add the transition to the replay buffer
    end for
    for each gradient step do
        Sample transitions from the replay buffer
        Update the $\gamma$-model generator according to Algorithm 1 or 2
        Compute the $\gamma$-MVE value estimate according to Theorem 2
        Update $Q$-function parameters
        Update value function parameters
        Update policy parameters
    end for
end for
Algorithm 3: $\gamma$-model-based value expansion
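The value-estimation step of Algorithm 3 can be sketched as a Monte Carlo estimate over $\gamma$-model rollouts. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_gamma_model`, `reward_fn`, and `value_fn` are hypothetical callables standing in for the learned $\gamma$-model, the reward function, and the terminal value function, and the weights are assumed to take the closed form $\alpha_n = (1 - \tilde{\gamma})(\tilde{\gamma} - \gamma)^{n-1} / (1 - \gamma)^n$.

```python
import random


def gamma_mve_estimate(state, sample_gamma_model, reward_fn, value_fn,
                       gamma, gamma_tilde, horizon, num_rollouts=64):
    """Monte Carlo sketch of a gamma-MVE value estimate at `state`.

    Each rollout applies the (hypothetical) gamma-model sampler `horizon`
    times, weights the reward at step n by alpha_n / (1 - gamma_tilde), and
    adds the terminal value at the final state weighted by ratio^horizon.
    """
    ratio = (gamma_tilde - gamma) / (1.0 - gamma)
    total = 0.0
    for _ in range(num_rollouts):
        s = state
        estimate = 0.0
        for n in range(1, horizon + 1):
            s = sample_gamma_model(s)  # n-th step of the gamma-model rollout
            alpha_n = (1.0 - ratio) * ratio ** (n - 1)
            estimate += alpha_n * reward_fn(s) / (1.0 - gamma_tilde)
        estimate += ratio ** horizon * value_fn(s)  # terminal value term
        total += estimate
    return total / num_rollouts


# Toy check with stand-in components: a constant reward of 1 has value
# 1 / (1 - gamma_tilde) everywhere, so the estimate should match it
# regardless of where the toy sampler moves the state.
gamma, gamma_tilde = 0.8, 0.99
estimate = gamma_mve_estimate(
    state=0.0,
    sample_gamma_model=lambda s: s + random.gauss(0.0, 1.0),
    reward_fn=lambda s: 1.0,
    value_fn=lambda s: 1.0 / (1.0 - gamma_tilde),
    gamma=gamma,
    gamma_tilde=gamma_tilde,
    horizon=3,
)
print(estimate, 1.0 / (1.0 - gamma_tilde))
```

In an actor-critic setting, an estimate of this kind would take the place of the learned value function when forming the target for the $Q$-function.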

Appendix C Implementation Details

$\gamma$-MVE algorithmic description.

The $\gamma$-MVE estimator may be used for value estimation in any actor-critic algorithm. Algorithm 3 describes the variant used in our control experiments, in which the estimator is incorporated into the soft actor-critic algorithm (SAC; Haarnoja et al. 2018). The $\gamma$-model update is unique to $\gamma$-MVE; the objectives for the value function and policy are identical to those in SAC. The objective for the $Q$-function differs only by replacing the standard value estimate with the $\gamma$-MVE estimate of Theorem 2. For a detailed description of how the gradients of these objectives may be estimated, and for hyperparameters related to the training of the $Q$-function, value function, and policy, we refer to Haarnoja et al. (2018).

Network architectures.

For all GAN experiments, the $\gamma$-model generator and discriminator are instantiated as two-layer MLPs with hidden dimensions of 256 and leaky ReLU activations. For all normalizing flow experiments, we use a six-layer neural spline flow (Durkan et al., 2019) with 16 knots defined in the interval . The rational-quadratic coupling transform uses a three-layer MLP with hidden dimensions of 256.

Hyperparameter settings.

We include the hyperparameters used for training the GAN $\gamma$-model in Table 1 and the flow $\gamma$-model in Table 2.

Parameter | Value
Batch size | 128
Number of samples per pair | 512
Delay parameter |
Step size |
Replay buffer size (off-policy prediction experiments) |

Table 1: GAN $\gamma$-model hyperparameters (Algorithm 1).

Parameter | Value
Batch size | 1024
Number of samples per pair | 1
Delay parameter |
Step size |
Replay buffer size (off-policy prediction experiments) |
Single-step Gaussian variance |

Table 2: Flow $\gamma$-model hyperparameters (Algorithm 2).

We found the original GAN (Goodfellow et al., 2014) and the least-squares GAN (Mao et al., 2016) formulations to be equally effective for training $\gamma$-models as GANs.
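For reference, the two discriminator objectives being compared differ only in how discriminator outputs are penalized. The sketch below shows the standard textbook forms of both losses on plain lists of discriminator outputs; it is a generic illustration rather than the training code used for $\gamma$-models, and the function names are hypothetical.

```python
import math


def gan_discriminator_loss(d_real, d_fake):
    """Original GAN discriminator loss: -E[log D(x)] - E[log(1 - D(G(z)))].

    Inputs are discriminator outputs in (0, 1) on real and generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares GAN discriminator loss with targets 1 (real) and 0 (fake)."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


# An uninformative discriminator that outputs 0.5 everywhere.
print(gan_discriminator_loss([0.5, 0.5], [0.5, 0.5]))
print(lsgan_discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```

The least-squares form penalizes confident but wrong outputs quadratically instead of logarithmically, which is the main practical difference between the two formulations.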

Appendix D Environment Details

Acrobot-v1 is a two-link system (Sutton, 1996). The goal is to swing the lower link above a threshold height. The eight-dimensional observation is given by . We modify the environment to have a one-dimensional continuous action space instead of the standard discrete action space with three actions. We provide reward shaping in the form of .

MountainCarContinuous-v0 is a car on a track (Moore, 1990). The goal is to drive the car up a hill too high to summit without built-up momentum. The two-dimensional observation space is . We provide reward shaping in the form of .

Pendulum-v0 is a single-link system. The link starts in a random position and the goal is to swing it upright. The three-dimensional observation space is given by .

Reacher-v2 is a two-link arm. The objective is to move the end effector of the arm to a randomly sampled goal position . The 11-dimensional observation space is given by .

Model-based methods often make use of shaped reward functions during model-based rollouts (Chua et al., 2018). For fair comparison, when using shaped rewards we also make the same shaping available to model-free methods.

Appendix E Adversarial $\gamma$-Model Predictions

Figure 6: Visualization of the distribution from a single feedforward pass of $\gamma$-models trained as GANs according to Algorithm 1. GAN-based $\gamma$-models tend to be more unstable than normalizing flow $\gamma$-models, especially at higher discounts.


  1. We define as .


  1. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems.
  2. Hindsight experience replay. In Advances in Neural Information Processing Systems.
  3. Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320.
  4. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Proceedings of the International Conference on Machine Learning.
  5. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems 30.
  6. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems.
  7. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems.
  8. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems.
  9. Improving generalization for temporal difference learning: the successor representation. Neural Computation 5, pp. 613.
  10. Neural spline flows. In Advances in Neural Information Processing Systems.
  11. Learning of closed-loop motion control. In International Conference on Intelligent Robots and Systems.
  12. Model-based value estimation for efficient model-free reinforcement learning. In International Conference on Machine Learning.
  13. Structure in the space of value functions. Machine Learning 49, pp. 325.
  14. The successor representation: its computational logic and neural substrates. Journal of Neuroscience.
  15. Generative adversarial nets. In Advances in Neural Information Processing Systems.
  16. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
  17. Fast task inference with variational intrinsic successor features. In International Conference on Learning Representations.
  18. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems.
  19. When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems.
  20. Forward models: supervised learning with a distal teacher. Cognitive Science 16, pp. 307.
  21. Learning to achieve goals. In Proceedings of the International Joint Conference on Artificial Intelligence.
  22. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In International Conference on Robotics and Automation.
  23. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning.
  24. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396.
  25. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems.
  26. Approximate model-assisted neural fitted Q-iteration. In International Joint Conference on Neural Networks.
  27. Guided policy search. In International Conference on Machine Learning.
  28. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations.
  29. Universal successor representations for transfer reinforcement learning. arXiv preprint arXiv:1804.03758.
  30. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076.
  31. Human-level control through deep reinforcement learning. Nature.
  32. The successor representation in human reinforcement learning. Nature Human Behaviour 1 (9), pp. 680–692.
  33. Efficient memory-based learning for robot control. Ph.D. Thesis, University of Cambridge.
  34. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation.
  35. Neural networks for self-learning control systems. IEEE Control Systems Magazine.
  36. f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems.
  37. Value prediction network. In Advances in Neural Information Processing Systems.
  38. Temporal difference models: model-free deep RL for model-based control. In International Conference on Learning Representations.
  39. Variational inference with normalizing flows. Proceedings of Machine Learning Research.
  40. Universal value function approximators. In Proceedings of the International Conference on Machine Learning.
  41. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265.
  42. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  43. Sample-based learning and search with permanent and transient memories. In Proceedings of the International Conference on Machine Learning.
  44. The predictron: end-to-end learning and planning. In International Conference on Machine Learning.
  45. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning.
  46. Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems.
  47. Model regularization for stable sample rollouts. In Conference on Uncertainty in Artificial Intelligence.
  48. Self-correcting models for model-based reinforcement learning. In AAAI Conference on Artificial Intelligence.
  49. Value iteration networks. In Advances in Neural Information Processing Systems.
  50. Improved learning of dynamics for control. In Proceedings of International Symposium on Experimental Robotics.
  51. Understanding the asymptotic performance of model-based RL methods.