# Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion

###### Abstract

Integrating model-free and model-based approaches in reinforcement learning has the potential to achieve the high performance of model-free algorithms with low sample complexity. However, this is difficult because an imperfect dynamics model can degrade the performance of the learning algorithm, and in sufficiently complex environments, the dynamics model will almost always be imperfect. As a result, a key challenge is to combine model-based approaches with model-free learning in such a way that errors in the model do not degrade performance. We propose stochastic ensemble value expansion (STEVE), a novel model-based technique that addresses this issue. By dynamically interpolating between model rollouts of various horizon lengths for each individual example, STEVE ensures that the model is only utilized when doing so does not introduce significant errors. Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency, and in contrast to previous model-based approaches, performance does not degrade in complex environments.

Jacob Buckman†, Danijar Hafner, George Tucker, Eugene Brevdo, Honglak Lee (†This work was completed as part of the Google AI Residency program.)
Google Brain, Mountain View, CA, USA
jacobbuckman@gmail.com,
mail@danijar.com,
{gjt,ebrevdo,honglak}@google.com

Preprint. Work in progress.

## 1 Introduction

Deep model-free reinforcement learning has had great successes in recent years, notably in playing video games Mnih et al. [2013] and strategic board games Silver et al. [2016]. However, training agents using these algorithms takes tens to hundreds of millions of samples, which makes many practical applications infeasible, particularly in real-world control problems (e.g., robotics) where data collection is expensive.

Model-based approaches aim to reduce the number of samples required to learn a policy by modeling the dynamics of the environment. A dynamics model can be used to increase sample efficiency in various ways, including training the policy on rollouts from the dynamics model Sutton [1990], using rollouts to improve targets for TD learning Feinberg et al. [2018], and using information gained from rollouts as inputs to the policy Weber et al. [2017]. Model-based algorithms such as PILCO Deisenroth and Rasmussen [2011] have shown that it is possible to learn from orders-of-magnitude fewer samples.

These successes have mostly been limited to environments where the dynamics are simple to model. In noisy, complex environments, it is difficult to learn an accurate model of the environment. When the model makes mistakes in this context, it can cause the wrong policy to be learned, hindering performance. Recent work has begun to address this issue. Kalweit and Boedecker [2017] train a model-free algorithm on a mix of real and imagined data, adjusting the proportion in favor of real data as the Q-function becomes more confident. Kurutach et al. [2018] train a model-free algorithm on purely imaginary data, but use an ensemble of environment models to avoid overfitting to errors made by any individual model.

We propose stochastic ensemble value expansion (STEVE), an extension to model-based value expansion (MVE) proposed by Feinberg et al. [2018]. Both techniques use a dynamics model to compute “rollouts” that are used to improve the targets for temporal difference learning. MVE rolls out a fixed length into the future, potentially accumulating model errors or increasing value estimation error along the way. In contrast, STEVE interpolates between many different horizon lengths, favoring those whose estimates have lower uncertainty, and thus lower error. To compute the interpolated target, we replace both the model and Q-function with ensembles, approximating the uncertainty of an estimate by computing its variance under samples from the ensemble. Through these uncertainty estimates, STEVE dynamically utilizes the model rollouts only when they do not introduce significant errors. For illustration, Figure 1 compares the sample efficiency of various algorithms on a tabular toy environment, which shows that STEVE significantly outperforms MVE and TD-learning baselines when the dynamics model is noisy. We systematically evaluate STEVE on several challenging continuous control benchmarks and demonstrate that STEVE significantly outperforms model-free baselines with an order-of-magnitude increase in sample efficiency.

## 2 Background

Reinforcement learning aims to learn an agent policy that maximizes the expected (discounted) sum of rewards [Sutton and Barto, 1998]. We focus on the deterministic case for exposition; however, our method is applicable to the stochastic case as well. The agent starts at an initial state $s_0$. Then, the agent chooses an action $a_t = \pi_\phi(s_t)$ according to its policy $\pi_\phi$ with parameters $\phi$, receives a reward $r_t$, and transitions to a subsequent state $s_{t+1} = T(s_t, a_t)$ according to the Markovian dynamics $T$ of the environment. This generates a trajectory of states, actions, and rewards $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$. We abbreviate the trajectory by $\tau$. The goal is to maximize the expected discounted sum of rewards along sampled trajectories, $J(\phi) = \mathbb{E}_{s_0}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma \in [0, 1)$ is a discount parameter.

### 2.1 Value Estimation with TD-learning

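The action-value function $Q^\pi(s_0, a_0) = \sum_{t=0}^{\infty} \gamma^t r_t$ is a critical quantity to estimate for many learning algorithms. Using the fact that $Q^\pi(s, a)$ satisfies a recursion relation

$$Q^\pi(s, a) = r + \gamma \left(1 - d(s')\right) Q^\pi(s', \pi(s')),$$

where $r$ is the reward for taking action $a$ in state $s$, $s' = T(s, a)$, and $d(s')$ is the termination indicator for $s'$, we can estimate $Q^\pi(s, a)$ off-policy with collected transitions of the form $(s, a, r, s')$ sampled uniformly from a replay buffer [Sutton and Barto, 1998]. We approximate $Q^\pi(s, a)$ with a deep neural network, $\hat{Q}^\pi_\theta(s, a)$. We learn parameters $\theta$ to minimize the mean squared error (MSE) between Q-value estimates of states and their corresponding TD targets:

$$\mathcal{T}^{\text{TD}}(r, s') = r + \gamma \left(1 - d(s')\right) \hat{Q}^\pi_{\theta^-}(s', \pi(s')) \tag{1}$$

$$\mathcal{L}_\theta = \mathbb{E}_{(s, a, r, s')}\left[\left(\hat{Q}^\pi_\theta(s, a) - \mathcal{T}^{\text{TD}}(r, s')\right)^2\right] \tag{2}$$

This expectation is taken with respect to transitions sampled from our replay buffer. Note that we use an older copy of the parameters, $\theta^-$, when computing targets [Mnih et al., 2013].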
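As a small illustration of the TD target and the MSE objective described above, the following sketch computes both for a hand-written minibatch. It assumes the standard one-step target with a termination mask; the function names, the batch values, and the discount of 0.9 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def td_target(r, done, q_next, gamma):
    """One-step TD target: r + gamma * (1 - d(s')) * Q_target(s', pi(s'))."""
    return r + gamma * (1.0 - done) * q_next


def td_loss(q_values, targets):
    """Mean squared error between Q-value estimates and their TD targets."""
    return sum((q - t) ** 2 for q, t in zip(q_values, targets)) / len(q_values)


# Hypothetical minibatch of three transitions sampled from a replay buffer.
rewards = [1.0, 0.0, 2.0]
dones = [0.0, 0.0, 1.0]      # d(s'): 1.0 when s' is terminal
q_next = [10.0, 5.0, 7.0]    # Q(s', pi(s')) under the older target parameters
q_pred = [10.0, 5.0, 2.5]    # Q(s, a) under the current parameters

targets = [td_target(r, d, q, gamma=0.9) for r, d, q in zip(rewards, dones, q_next)]
loss = td_loss(q_pred, targets)
```

Note that the third transition ends in a terminal state, so its target is just the observed reward and the bootstrapped value is masked out.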

Since we evaluate our method in a continuous action space, it is not possible to compute a policy from our Q-function by simply taking $\arg\max_a \hat{Q}^\pi_\theta(s, a)$. Instead, we use a neural network $\pi_\phi$ to approximate this maximization function Lillicrap et al. [2016], by learning parameters $\phi$ to minimize the negative Q-value:

$$\mathcal{L}_\phi = -\hat{Q}^\pi_\theta(s, \pi_\phi(s)) \tag{3}$$

Note that we use the DDPG formulation only because of our specific choice of continuous-control environment, but our technique is generally applicable to all other methods using TD objectives.

### 2.2 Model-Based Value Expansion

Recently, Feinberg et al. [2018] showed that a learned dynamics model can be used to improve value estimation. Their method, model-based value expansion (MVE), combines a short-term value estimate formed by unrolling the model dynamics and a long-term value estimate using the learned Q-function. When the model is accurate, this reduces the bias of the targets, leading to improved performance.

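The learned dynamics model consists of three learned functions: the transition function $\hat{T}_\xi(s, a)$, which returns a successor state $s'$; a termination function $\hat{d}_\xi(s)$, which returns the probability that $s$ is a terminal state; and the reward function $\hat{r}_\psi(s, a, s')$, which returns a scalar reward. This model is trained to minimize

$$\mathcal{L}_{\xi, \psi} = \mathbb{E}_{(s, a, r, s')}\left[\left\lVert \hat{T}_\xi(s, a) - s' \right\rVert^2 + \mathbb{H}\left(d(s'), \hat{d}_\xi(\hat{T}_\xi(s, a))\right) + \left(\hat{r}_\psi(s, a, s') - r\right)^2\right] \tag{4}$$

where the expectation is over collected transitions $(s, a, r, s')$, $d(s')$ is an indicator function which returns $1$ when $s'$ is a terminal state and $0$ otherwise, and $\mathbb{H}$ is the cross-entropy. In this work, we consider continuous environments; for discrete environments, the first term can be replaced by a cross-entropy loss term.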
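As a rough illustration of the three terms in the model objective (squared error on the next state, cross-entropy on termination, squared error on reward), the sketch below evaluates the loss for a single transition. The function name `model_loss`, its arguments, and the clipping constant `eps` are hypothetical; in practice the predictions would come from the learned networks and the loss would be averaged over a minibatch.

```python
import math


def model_loss(pred_next, true_next, pred_done_prob, true_done,
               pred_reward, true_reward, eps=1e-7):
    """Per-transition model loss: transition MSE + termination cross-entropy + reward MSE."""
    # Squared error between the predicted and observed successor state.
    transition_term = sum((p - t) ** 2 for p, t in zip(pred_next, true_next))
    # Binary cross-entropy between the true terminal flag and the predicted probability.
    p = min(max(pred_done_prob, eps), 1.0 - eps)  # clip to avoid log(0) (assumption)
    termination_term = -(true_done * math.log(p) + (1.0 - true_done) * math.log(1.0 - p))
    # Squared error between the predicted and observed reward.
    reward_term = (pred_reward - true_reward) ** 2
    return transition_term + termination_term + reward_term


# Perfect state and reward predictions, but an uninformative termination estimate.
loss = model_loss([1.0, 2.0], [1.0, 2.0], 0.5, 0.0, 3.0, 3.0)
```

In this example only the termination term contributes, since the state and reward predictions are exact.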

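To incorporate the model into value estimation, we replace our standard Q-learning target with an improved target, $\mathcal{T}^{\text{MVE}}_H$, computed by rolling the learned model out for $H$ steps.

$$s'_0 = s', \quad a'_i = \pi_\phi(s'_i), \quad s'_i = \hat{T}_\xi(s'_{i-1}, a'_{i-1}), \quad D^i = \left(1 - d(s')\right) \prod_{j=1}^{i-1} \left(1 - \hat{d}_\xi(s'_j)\right) \tag{5}$$

$$\mathcal{T}^{\text{MVE}}_H(r, s') = r + \left(\sum_{i=1}^{H} D^i \gamma^i \hat{r}_\psi(s'_{i-1}, a'_{i-1}, s'_i)\right) + D^{H+1} \gamma^{H+1} \hat{Q}^\pi_{\theta^-}(s'_H, a'_H) \tag{6}$$

To use this target, we substitute $\mathcal{T}^{\text{MVE}}_H$ in place of $\mathcal{T}^{\text{TD}}$ when training $\theta$ using Equation 2. This formulation is a minor generalization of the original MVE objective in that we additionally model the reward function and termination function; Feinberg et al. [2018] consider “fully observable” environments in which the reward function and termination condition were known, deterministic functions of the observations.
Note that when $H = 0$, MVE reduces to TD-learning (i.e., $\mathcal{T}^{\text{TD}} = \mathcal{T}^{\text{MVE}}_0$).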
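The rollout that produces the MVE target can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the callables `policy`, `transition`, `termination`, `reward`, and `q_target` are hypothetical stand-ins for the learned networks, and the survival weight is assumed to start at $1 - d(s')$ so that a horizon of 0 recovers the one-step TD target.

```python
def mve_target(r, s_next, done, horizon, policy, transition, termination,
               reward, q_target, gamma):
    """Model-based value expansion target for one transition (s, a, r, s')."""
    target = r
    alive = 1.0 - done  # weight for "the rollout has not terminated yet"
    state = s_next
    for i in range(1, horizon + 1):
        action = policy(state)
        next_state = transition(state, action)
        # Discounted imagined reward, masked by the survival weight.
        target += alive * gamma ** i * reward(state, action, next_state)
        alive *= 1.0 - termination(next_state)
        state = next_state
    # Bootstrap from the target Q-function at the end of the imagined rollout.
    target += alive * gamma ** (horizon + 1) * q_target(state, policy(state))
    return target


# Hypothetical stand-ins for the learned networks (illustration only).
def policy(s): return 1.0
def transition(s, a): return s + a
def termination(s): return 0.0
def reward(s, a, s2): return 1.0
def q_target(s, a): return 10.0


models = (policy, transition, termination, reward, q_target)
td = mve_target(1.0, 0.0, 0.0, 0, *models, gamma=0.5)   # horizon 0: one-step TD
mve = mve_target(1.0, 0.0, 0.0, 2, *models, gamma=0.5)  # horizon 2
```

With a horizon of 0 the loop body never runs, so the function returns the ordinary TD target; longer horizons replace part of the bootstrapped value with imagined rewards.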

When the model is perfect and the learned Q-function has similar bias on all states and actions, Feinberg et al. [2018] show that the MVE target with rollout horizon $H$ will decrease the target error by a factor of $\gamma^{2H}$. Errors in the learned model can lead to worse targets, so in practice, we must tune $H$ to balance between the errors in the model and the Q-function estimates. An additional challenge is that the bias in the learned Q-function is not uniform across states and actions [Feinberg et al., 2018]. In particular, they find that the bias in the Q-function on states sampled from the replay buffer is lower than when the Q-function is evaluated on states generated from model rollouts. They term this the distribution mismatch problem and propose the TD-k trick as a solution; see Appendix B for further discussion of this trick.

While the results of Feinberg et al. [2018] are promising, they rely on task-specific tuning of the rollout horizon $H$. This sensitivity arises from the difficulty of modeling transitions and the Q-function, which are task-specific and may change throughout training as the policy explores different parts of the state space. Complex environments require a much smaller rollout horizon $H$, which limits the effectiveness of the approach (e.g., Feinberg et al. [2018] used a longer horizon for HalfCheetah-v1, but had to reduce it on Walker2d-v1). Motivated by this limitation, we propose an approach that balances model error and Q-function error by dynamically adjusting the rollout horizon.

## 3 Stochastic Ensemble Value Expansion

From a single rollout of $H$ timesteps, we can compute $H+1$ distinct candidate targets by considering rollouts to various horizon lengths: $\mathcal{T}^{\text{MVE}}_0, \mathcal{T}^{\text{MVE}}_1, \mathcal{T}^{\text{MVE}}_2, \ldots, \mathcal{T}^{\text{MVE}}_H$. Standard TD learning uses $\mathcal{T}^{\text{MVE}}_0$ as the target, while MVE uses $\mathcal{T}^{\text{MVE}}_H$ as the target. We propose interpolating all of the candidate targets to produce a target which is better than any individual candidate. Conventionally, one could average the candidate targets, or weight the candidate targets in an exponentially-decaying fashion, similar to TD($\lambda$) Sutton and Barto [1998]. However, we show that we can do still better by weighting the candidate targets in a way that balances errors in the learned Q-function and errors from longer model rollouts. STEVE provides a computationally-tractable and theoretically-motivated algorithm for choosing these weights. We describe the algorithm for STEVE in Section 3.1, and justify it in Section 3.2.

### 3.1 Algorithm

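To estimate uncertainty in our learned estimators, we maintain an ensemble of parameters for our Q-function, reward function, and model: $\theta = \{\theta_1, \ldots, \theta_L\}$, $\psi = \{\psi_1, \ldots, \psi_N\}$, and $\xi = \{\xi_1, \ldots, \xi_M\}$, respectively. Each parameterization is initialized independently and trained on different subsets of the data in each minibatch.

We roll out an $H$-step trajectory with each of the $M$ models, $\tau^{\xi_1}, \ldots, \tau^{\xi_M}$. Each trajectory consists of $H+1$ states, $\tau^{\xi_m}_0, \ldots, \tau^{\xi_m}_H$, which correspond to $s'_0, \ldots, s'_H$ in Equation 5 with the transition function parameterized by $\xi_m$. Similarly, we use the $N$ reward functions and $L$ Q-functions to evaluate Equation 6 for each $\tau^{\xi_m}$ at every rollout-length $0 \le i \le H$. This gives us $M \cdot N \cdot L$ different values of $\mathcal{T}^{\text{MVE}}_i$ for each rollout-length $i$.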

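Using these values, we can compute the empirical mean $\mathcal{T}^{\mu}_i$ and variance $\mathcal{T}^{\sigma^2}_i$ for each partial rollout of length $i$. In order to form a single target, we use an inverse variance weighting of the means:

$$\mathcal{T}^{\text{STEVE}}_H(r, s') = \sum_{i=0}^{H} \frac{\tilde{w}_i}{\sum_{j=0}^{H} \tilde{w}_j} \mathcal{T}^{\mu}_i, \qquad \tilde{w}_i^{-1} = \mathcal{T}^{\sigma^2}_i \tag{7}$$

To learn a value function with STEVE, we substitute $\mathcal{T}^{\text{STEVE}}_H$ in place of $\mathcal{T}^{\text{TD}}$ when training $\theta$ using Equation 2.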
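The inverse-variance weighting of Equation 7 can be sketched in a few lines. This is a minimal illustration under stated assumptions: `candidates[i]` is a hypothetical list holding the ensemble's estimates of the horizon-`i` target, the population variance is used as the empirical variance, and the small constant `eps` (not part of the method as described) guards against division by zero.

```python
from statistics import fmean, pvariance


def steve_target(candidates, eps=1e-8):
    """Combine per-horizon candidate targets by inverse-variance weighting.

    candidates[i] holds the ensemble's estimates of the horizon-i target.
    Returns the interpolated target and the normalized weights.
    """
    means = [fmean(c) for c in candidates]
    variances = [pvariance(c) for c in candidates]
    raw = [1.0 / (v + eps) for v in variances]  # eps avoids division by zero (assumption)
    total = sum(raw)
    weights = [w / total for w in raw]
    target = sum(w * m for w, m in zip(weights, means))
    return target, weights


# Hypothetical ensemble estimates for horizons 0, 1, and 2.
# Disagreement (variance) grows with the rollout length in this example.
candidates = [[9.0, 11.0], [10.0, 14.0], [0.0, 40.0]]
target, weights = steve_target(candidates)
```

In this example the horizon-2 estimates disagree strongly, so that candidate receives almost no weight and the interpolated target stays close to the low-variance, short-horizon estimates.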

### 3.2 Derivation

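We wish to find weights $w_i$, where $\sum_i w_i = 1$, that minimize the mean-squared error between the weighted average of candidate targets $\mathcal{T}^{\text{MVE}}_0, \mathcal{T}^{\text{MVE}}_1, \mathcal{T}^{\text{MVE}}_2, \ldots, \mathcal{T}^{\text{MVE}}_H$ and the true Q-value.

$$\mathbb{E}\left[\left(\sum_{i=0}^{H} w_i \mathcal{T}^{\text{MVE}}_i - Q^\pi(s, a)\right)^2\right] = \text{Bias}\left(\sum_i w_i \mathcal{T}^{\text{MVE}}_i\right)^2 + \text{Var}\left(\sum_i w_i \mathcal{T}^{\text{MVE}}_i\right) \approx \text{Bias}\left(\sum_i w_i \mathcal{T}^{\text{MVE}}_i\right)^2 + \sum_i w_i^2 \, \text{Var}\left(\mathcal{T}^{\text{MVE}}_i\right)$$

where the expectation considers the candidate targets as random variables conditioned on the collected data and minibatch sampling noise, and the approximation is due to assuming the candidate targets are independent. Experiments suggested that removing the covariance terms decreased performance only slightly while providing a large speedup, so we dropped the extra terms for simplicity.

Our goal is to minimize this with respect to $w_i$. We can estimate the variance terms using empirical variance estimates from the ensemble. Unfortunately, we could not devise a reliable estimator for the bias terms; this is a limitation of our approach and an area for future work. In this work, we ignore the bias terms and minimize the weighted sum of variances

$$\sum_{i=0}^{H} w_i^2 \, \text{Var}\left(\mathcal{T}^{\text{MVE}}_i\right).$$

With this approximation, which is equivalent to inverse-variance weighting Fleiss [1993], we achieve state-of-the-art results. Setting each $w_i$ equal to $1 / \text{Var}(\mathcal{T}^{\text{MVE}}_i)$ and normalizing yields the formula for $\mathcal{T}^{\text{STEVE}}_H$ given in Equation 7.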
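For completeness, the step from minimizing the weighted sum of variances to inverse-variance weights can be worked out with a Lagrange multiplier. The shorthand $\sigma_i^2$ for the variance of the $i$-th candidate target and the multiplier $\mu$ are notation introduced here for illustration only.

$$
\begin{aligned}
\mathcal{L}(w, \mu) &= \sum_{i=0}^{H} w_i^2 \sigma_i^2 - \mu \left(\sum_{i=0}^{H} w_i - 1\right), \\
\frac{\partial \mathcal{L}}{\partial w_i} &= 2 w_i \sigma_i^2 - \mu = 0 \quad \Rightarrow \quad w_i = \frac{\mu}{2 \sigma_i^2}, \\
\sum_{i=0}^{H} w_i = 1 \quad &\Rightarrow \quad \frac{\mu}{2} = \left(\sum_{j=0}^{H} \frac{1}{\sigma_j^2}\right)^{-1}, \\
w_i &= \frac{1 / \sigma_i^2}{\sum_{j=0}^{H} 1 / \sigma_j^2}.
\end{aligned}
$$

Because the objective is a convex quadratic in the weights and the constraint is linear, this stationary point is the minimizer.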

### 3.3 Note on ensembles

This technique for calculating uncertainty estimates will work equally well for any family of models from which we can sample. For example, we could train a Bayesian neural network for each model MacKay [1992], or use dropout as a Bayesian approximation by resampling the dropout masks each time we wish to sample a new model Gal and Ghahramani [2016]. These options could potentially give better diversity of various samples from the family, and thus better uncertainty estimates; exploring them further is a promising direction for future work. However, we found that these methods degraded the accuracy of the base models. An ensemble is far easier to train, and so we focus on that in this work. This is a common choice, as the use of ensembles in the context of uncertainty estimations for deep reinforcement learning has seen wide adoption in the literature. It was first proposed by Osband et al. [2016] as a technique to improve exploration, and subsequent work showed that this approach gives a good estimate of the uncertainty of both value functions Kalweit and Boedecker [2017] and models Kurutach et al. [2018].

## 4 Experiments

### 4.1 Implementation

We use DDPG Lillicrap et al. [2016] as our baseline model-free algorithm. We train two deep feedforward neural networks, a Q-function network $\hat{Q}^\pi_\theta(s, a)$ and a policy network $\pi_\phi(s)$, by minimizing the loss functions given in Equations 2 and 3. We also train another three deep feedforward networks to represent our world model, corresponding to function approximators for the transition $\hat{T}_\xi(s, a)$, termination $\hat{d}_\xi(s)$, and reward $\hat{r}_\psi(s, a, s')$, and minimize the loss function given in Equation 4.

When collecting rollouts for evaluation, we simply take the action selected by the policy, $\pi_\phi(s)$, at every state $s$. Each run was evaluated after every 500 updates by computing the mean total episode reward (referred to as score) across many environment restarts. To produce the lines in Figures 2, 3, and 4, these evaluation results were downsampled by splitting the domain into non-overlapping regions and computing the mean score within each region across several runs. The shaded area shows one standard deviation of scores in the region as defined above.
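The downsampling used for the plots can be sketched as a simple binning step. The function below is a hypothetical illustration: it assumes equal-width regions over the training-step axis and pools the evaluation points of all runs before computing the per-region mean and standard deviation; the number of regions and the axis range are placeholders.

```python
from statistics import fmean, pstdev


def downsample(points, num_bins, x_max):
    """Bin (step, score) points into equal-width, non-overlapping regions.

    Returns one (region_center, mean_score, std_score) tuple per non-empty region.
    """
    width = x_max / num_bins
    bins = [[] for _ in range(num_bins)]
    for step, score in points:
        idx = min(int(step / width), num_bins - 1)  # clamp the right edge into the last bin
        bins[idx].append(score)
    return [((i + 0.5) * width, fmean(b), pstdev(b))
            for i, b in enumerate(bins) if b]


# Hypothetical evaluation results pooled from several runs: (update step, score).
points = [(0, 1.0), (500, 3.0), (1000, 5.0), (1500, 7.0)]
curve = downsample(points, num_bins=2, x_max=2000)
```

Each returned tuple gives the value of the plotted line and the half-width of the shaded band for one region.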

When collecting rollouts for our replay buffer, we do $\epsilon$-greedy exploration: with probability $\epsilon$, we select a random action by adding Gaussian noise to the pre-tanh policy action.
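A minimal sketch of this exploration rule follows. It assumes the policy network outputs an unsquashed ("pre-tanh") action vector that is passed through tanh to bound it; the function name, the noise scale `noise_std`, and the example values are hypothetical.

```python
import math
import random


def explore_action(pre_tanh_action, epsilon, noise_std, rng):
    """With probability epsilon, add Gaussian noise before the tanh squashing."""
    if rng.random() < epsilon:
        pre_tanh_action = [u + rng.gauss(0.0, noise_std) for u in pre_tanh_action]
    # Squash every component into [-1, 1], with or without noise.
    return [math.tanh(u) for u in pre_tanh_action]


rng = random.Random(0)
greedy = explore_action([0.5, -2.0], epsilon=0.0, noise_std=0.1, rng=rng)
noisy = explore_action([0.5, -2.0], epsilon=1.0, noise_std=0.1, rng=rng)
```

Adding the noise before the squashing keeps the perturbed action inside the valid action range without any extra clipping.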

All algorithms were implemented in Tensorflow [Abadi et al., 2016]. We use a distributed implementation to parallelize computation. In the style of ApeX Horgan et al. [2018], IMPALA Espeholt et al. [2018], and D4PG Barth-Maron et al. [2018], we use a centralized learner with several agents operating in parallel. Each agent periodically loads the most recent policy, interacts with the environment, and sends its observations to the central learner. The learner stores received frames in a replay buffer, and continuously loads batches of frames from this buffer to use as training data for a model update. In the algorithms with a model-based component, there are two learners: a policy-learner and a model-learner. In these cases, the policy-learner periodically reloads the latest copy of the model.

All baselines reported in this section were re-implementations of existing methods. This allowed us to ensure that the various methods compared were consistent with one another, and that the differences reported are fully attributable to the independent variables in question. Our baselines are competitive with state-of-the-art implementations of these algorithms Haarnoja et al. [2018], Feinberg et al. [2018]. All MVE experiments utilize the TD-k trick. For hyperparameters and additional implementation details, please see Appendix C. Our code is available open-source at: https://github.com/tensorflow/models/tree/master/research/steve

### 4.2 Comparison of Performance

We tested on a variety of continuous control tasks [Brockman et al., 2016, Klimov and Schulman]; the results are shown in Figure 2. We found that STEVE yields significant improvements in both performance and sample efficiency across a wide range of environments. Importantly, the gains are more substantial in the more complex environments. On all of the most challenging environments, such as Humanoid-v1, RoboschoolHumanoid-v1, RoboschoolHumanoidFlagrun-v1, and BipedalWalkerHardcore-v2, we see that STEVE is the only algorithm to show significant learning within 10M frames.

### 4.3 Ablation Study

In order to verify that STEVE’s gains in sample efficiency are due to the reweighting, and not simply due to the additional parameters of the ensembles of its components, we examine several ablations. Ensemble MVE is the regular MVE algorithm, but the model and Q-functions are replaced with ensembles. Mean-MVE uses the exact same architecture as STEVE, but uses a simple uniform weighting instead of the uncertainty-aware reweighting scheme. Similarly, TDL25 and TDL75 correspond to TD($\lambda$) reweighting schemes with $\lambda = 0.25$ and $\lambda = 0.75$, respectively. COV-STEVE is a version of STEVE which includes the covariances between candidate targets when computing the weights. We also investigate the effect of the horizon parameter $H$ on the performance of both STEVE and MVE. These results are shown in Figure 3.

All of these variants show the same trend: fast initial gains, which quickly taper off and are overtaken by the baseline. STEVE is the only variant to converge faster and higher than the baseline; this provides strong evidence that the gains come specifically from the uncertainty-aware reweighting of targets. Additionally, we find that increasing the rollout horizon increases the sample efficiency of STEVE, even though the dynamics model for Humanoid-v1 has high error.

### 4.4 Wall-Clock Comparison

In the previous experiments, we synchronized data collection, policy updates, and model updates. However, when we run these steps asynchronously, we can reduce the wall-clock time at the risk of instability. To evaluate this configuration, we compare DDPG, MVE-DDPG, and STEVE-DDPG on Humanoid-v1 and RoboschoolHumanoidFlagrun-v1. All three were trained on a P100 GPU and had 8 CPUs collecting data; STEVE-DDPG additionally used a second P100 to learn a model in parallel. We plot reward as a function of wall-clock time for these tasks in Figure 4. STEVE-DDPG learns more quickly on both tasks, and it achieves a higher reward than DDPG and MVE-DDPG on Humanoid-v1 and performs comparably to DDPG on RoboschoolHumanoidFlagrun-v1. Moreover, in future work, STEVE could be accelerated by parallelizing training of each component of the ensemble.

## 5 Discussion

Our primary experiments (Section 4.2) show that STEVE greatly increases sample efficiency relative to baselines, matching or outperforming both MVE-DDPG and DDPG baselines on every task. STEVE also outperforms other recently-published results on these tasks in terms of sample efficiency Gu et al. [2017], Haarnoja et al. [2018], Schulman et al. [2017]. Our ablation studies (Section 4.3) support the hypothesis that the increased performance is due to the uncertainty-dependent reweighting of targets, and demonstrate that the performance of STEVE consistently increases with longer horizon lengths, even in complex environments. Finally, our wall-clock experiments (Section 4.4) demonstrate that in spite of the additional computation per epoch, the gains in sample efficiency are enough that STEVE is competitive with model-free algorithms in terms of wall-clock time. The speed gains associated with improved sample efficiency will only grow as samples become more expensive to collect, making STEVE a promising choice for applications involving real-world interaction.

Given that the improvements stem from the dynamic reweighting between horizon lengths, it may be interesting to examine the choices that the model makes about which candidate targets to favor most heavily. In Figure 5, we plot the average model usage over the course of training. Intriguingly, most of the lines seem to remain stable at around 50% usage, with two notable exceptions: Humanoid-v1, the most complex environment tested (with an observation-space of size 376); and Swimmer-v1, the least complex environment tested (with an observation-space of size 8). This supports the hypothesis that STEVE is trading off between Q-function bias and model bias; it chooses to ignore the model almost immediately when the environment is too complex to learn, and gradually begins to ignore the model as the Q-function improves if an optimal environment model is learned quickly.

## 6 Related Work

Sutton and Barto [1998] describe TD($\lambda$), a family of Q-learning variants in which targets from multiple timesteps are merged via exponential decay. STEVE is similar in that it also computes a weighted average between targets. However, our approach is significantly more powerful because it adapts the weights to the specific characteristics of each individual rollout, rather than being constant between examples and throughout training. Our approach can be thought of as a generalization of TD($\lambda$), in that the two approaches are equivalent in the specific case where the overall uncertainty grows exponentially at every timestep, such that the inverse-variance weights decay at rate $\lambda$.
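The correspondence with exponentially-decaying weights can be checked numerically. The sketch below assumes that the variance of the horizon-$i$ candidate grows by a factor of $1/\lambda$ per step, and it compares against normalized weights proportional to $\lambda^i$ (a simplification of the truncated $\lambda$-return, which treats the final horizon slightly differently). The function names and constants are hypothetical.

```python
def inverse_variance_weights(variances):
    """Normalized weights proportional to 1 / variance."""
    raw = [1.0 / v for v in variances]
    total = sum(raw)
    return [w / total for w in raw]


def exponential_weights(lam, horizon):
    """Normalized weights proportional to lam ** i for i = 0..horizon."""
    raw = [lam ** i for i in range(horizon + 1)]
    total = sum(raw)
    return [w / total for w in raw]


lam, horizon = 0.75, 5
# Assumed variance profile: grows by a factor of 1 / lam at every rollout step.
variances = [2.0 * lam ** (-i) for i in range(horizon + 1)]

iv = inverse_variance_weights(variances)
ex = exponential_weights(lam, horizon)
```

Under this variance profile the two weightings coincide; for any other profile the inverse-variance weights differ from example to example, which is the adaptivity described above.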

Heess et al. [2015] describe stochastic value gradient (SVG) methods, which are a general family of hybrid model-based/model-free control algorithms. By re-parameterizing distributions to separate out the noise, SVG is able to learn stochastic continuous control policies in stochastic environments. STEVE currently operates only with deterministic policies and environments, but this is a promising direction for future work.

Kurutach et al. [2018] propose model-ensemble trust-region policy optimization (ME-TRPO), which is motivated similarly to this work in that they also propose an algorithm which uses an ensemble of models to mitigate the deleterious effects of model bias. However, the algorithm is quite different. ME-TRPO is a purely model-based policy-gradient approach, and uses the ensemble to avoid overfitting to any one model. In contrast, STEVE interpolates between model-free and model-based estimates, uses a value-estimation approach, and uses the ensemble to explicitly estimate uncertainty.

Kalweit and Boedecker [2017] train on a mix of real and imagined rollouts, and adjust the ratio over the course of training by tying it to the variance of the Q-function. Similarly to our work, this variance is computed via an ensemble. However, they do not adapt to the uncertainty of individual estimates, only the overall ratio of real to imagined data. Additionally, they do not take into account model bias, or uncertainty in model predictions.

Weber et al. [2017] propose imagination-augmented agents (I2A), which use rollouts generated by the dynamics model as inputs to the policy function, by “summarizing” the outputs of the rollouts with a deep neural network. This second network allows the algorithm to implicitly calculate uncertainty over various parts of the rollout and use that information when making its decision. However, I2A has only been evaluated on discrete domains. Additionally, the lack of explicit model use likely tempers the sample-efficiency benefits gained relative to more traditional model-based learning.

Gal et al. [] use a deep neural network in combination with the PILCO algorithm Deisenroth and Rasmussen [2011] to do sample-efficient reinforcement learning. They demonstrate good performance on the continuous-control task of cartpole swing-up. They model uncertainty in the learned neural dynamics function using dropout as a Bayesian approximation, and provide evidence that maintaining these uncertainty estimates is very important for model-based reinforcement learning.

Depeweg et al. [2016] use a Bayesian neural network as the environment model in a policy search setting, learning a policy purely from imagined rollouts. This work also demonstrates that modeling uncertainty is important for model-based reinforcement learning with neural network models, and that uncertainty-aware models can escape many common pitfalls.

Gu et al. [2016] propose a continuous variant of Q-learning known as normalized advantage functions (NAF), and show that learning using NAF can be accelerated by using a model-based component. They use a variant of Dyna-Q Sutton [1990], augmenting the experience available to the model-free learner with imaginary on-policy data generated via environment rollouts. They use an iLQG controller and a learned locally-linear model to plan over small, easily-modelled regions of the environment, but find that using more complex neural network models of the environment can yield errors.

## 7 Conclusion

In this work, we demonstrated that STEVE, an uncertainty-aware approach for integrating model-free and model-based reinforcement learning, outperforms model-free approaches while reducing sample complexity by an order of magnitude on several challenging tasks. We believe that this is a strong step towards enabling RL for practical, real-world applications. Future directions include exploring more complex world models for various tasks, as well as comparing various techniques for calculating uncertainty and estimating bias.

#### Acknowledgments

The authors would like to thank the following individuals for their valuable insights and discussion: David Ha, Prajit Ramachandran, Tuomas Haarnoja, Dustin Tran, Matt Johnson, Matt Hoffman, Ishaan Gulrajani, and Sergey Levine. Also, we would like to thank Jascha Sohl-Dickstein, Joseph Antognini, Shane Gu, and Samy Bengio for their feedback during the writing process. Finally, we would like to thank Erwin Coumans for his help on PyBullet environments.

## References

- Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- Barth-Maron et al. [2018] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap. Distributional policy gradients. In International Conference on Learning Representations, 2018.
- Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Deisenroth and Rasmussen [2011] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
- Depeweg et al. [2016] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. 2016.
- Espeholt et al. [2018] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.
- Feinberg et al. [2018] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
- Fleiss [1993] J. Fleiss. Review papers: The statistical basis of meta-analysis. Statistical methods in medical research, 2(2):121–145, 1993.
- Gal and Ghahramani [2016] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- [10] Y. Gal, R. McAllister, and C. E. Rasmussen. Improving PILCO with Bayesian neural network dynamics models.
- Gu et al. [2016] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
- Gu et al. [2017] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. International Conference on Learning Representations, 2017.
- Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.
- Heess et al. [2015] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
- Horgan et al. [2018] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
- Kalweit and Boedecker [2017] G. Kalweit and J. Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206, 2017.
- Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
- [18] O. Klimov and J. Schulman. Roboschool. https://github.com/openai/roboschool.
- Kurutach et al. [2018] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
- Lillicrap et al. [2016] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.
- MacKay [1992] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
- Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop. 2013.
- Osband et al. [2016] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pages 4026–4034, 2016.
- Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
- Sutton [1990] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.
- Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press Cambridge, 1998.
- Weber et al. [2017] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. 31st Conference on Neural Information Processing Systems, 2017.

## Appendix A Toy Problem: A Tabular FSM with Model Noise

To demonstrate the benefits of Bayesian model-based value expansion, we evaluated it on a toy problem. We used a finite-state environment whose states form a chain, with a single action available at every state that always moves from the current state to the next one, starting at the first state and terminating at the last. The reward for every action is -1, except for the final transition into the terminal state, which is +100. Since this environment is so simple, there is only one possible policy, which is deterministic. The true action-value function can be computed in closed form: the value of a state is the sum of the rewards along the rest of the chain, i.e. +100 for the final transition minus 1 for each transition that precedes it.

We estimate the value of each state using tabular TD-learning. We maintain a tabular value estimate, which is just a lookup table matching each state to its estimated value. We initialize all values to random integers between 0 and 99, except for the terminal state, which we initialize to 0 (and keep fixed at 0 at all times). We update the table using the standard undiscounted one-step TD update, whose target for a state is the observed reward plus the current estimate for its successor state. For each update, we sampled a nonterminal state and its corresponding transition at random. For experiments with an ensemble of Q-functions, we repeat this update once for each member of the ensemble at each timestep.
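The chain environment and the tabular TD procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the chain length `N_STATES`, the step size of 1 in the update (the estimate is overwritten with the TD target), and all function names are assumptions made for the example.

```python
import random

N_STATES = 101            # assumption: the chain length is chosen for illustration
TERMINAL = N_STATES - 1   # index of the terminal state


def step(state):
    """True environment: the single action moves one state along the chain."""
    next_state = state + 1
    reward = 100.0 if next_state == TERMINAL else -1.0
    return reward, next_state


def true_value(state):
    """Closed-form value: +100 for the final transition, -1 for each earlier one."""
    if state == TERMINAL:
        return 0.0
    remaining = TERMINAL - state
    return 100.0 - (remaining - 1)


def init_table(rng):
    """Random integer initial values in [0, 99]; the terminal value is fixed at 0."""
    q = [float(rng.randint(0, 99)) for _ in range(N_STATES)]
    q[TERMINAL] = 0.0
    return q


def td_update(q, state):
    """Undiscounted one-step TD update (assumption: step size 1)."""
    reward, next_state = step(state)
    q[state] = reward + q[next_state]


def mse(q):
    """Mean squared error between estimated and true values over all states."""
    return sum((q[s] - true_value(s)) ** 2 for s in range(N_STATES)) / N_STATES


def train(num_updates, seed=0):
    """Run TD updates on nonterminal states sampled uniformly at random."""
    rng = random.Random(seed)
    q = init_table(rng)
    for _ in range(num_updates):
        td_update(q, rng.randrange(TERMINAL))
    return q
```

Calling `mse(train(num_updates))` for increasing `num_updates` gives a learning curve of the kind the text describes for the model-free case; an ensemble would keep several such tables and update each one at every timestep.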

The transition and reward function for the oracle dynamics model behaved exactly the same as the true environment. In the “noisy” dynamics model, noise was added in the following way: 10% of the time, rather than correctly moving from the current state to its successor, the model transitions to a random state. (Other techniques for adding noise gave qualitatively similar results.)
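The noisy transition rule can be sketched as below. This is only an illustration under stated assumptions: the random state is drawn uniformly from all states (the text does not say which distribution was used), and `noisy_model_step`, `n_states`, and `noise_prob` are hypothetical names.

```python
import random


def noisy_model_step(state, n_states, rng, noise_prob=0.1):
    """Predicted next state under the noisy model.

    With probability `noise_prob` the model jumps to a random state
    (assumption: uniform over all states); otherwise it predicts the
    correct successor in the chain.
    """
    if rng.random() < noise_prob:
        return rng.randrange(n_states)
    return state + 1
```

Setting `noise_prob=0.0` recovers the oracle transition function, which makes it easy to compare the two models in the same rollout code.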

On the y-axis of Figure 1, we plot the mean squared error between the predicted value and the true value of each state.

For both the STEVE and MVE experiments, we use an ensemble of size 8 for both the model and the Q-function. To compute the MVE target, we average across all ensembled rollouts and predictions.

## Appendix B The TD-k Trick

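The TD-k trick, proposed by Feinberg et al. [2018], involves training the Q-function on every intermediate state of the model rollout, each with its own correspondingly shorter-horizon target, rather than only on the initial state sampled from the replay buffer.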

To summarize Feinberg et al. [2018], the TD-k trick is helpful because the off-policy states collected by the replay buffer may have little overlap with the states encountered during on-policy model rollouts. Without the TD-k trick, the Q-function approximator is trained to minimize error only on states collected from the replay buffer, so it is likely to have high error on states not present in the replay buffer. This would imply that the Q-function has high error on states produced by model rollouts, and that this error may in fact continue to increase the more steps of on-policy rollout we take. By invoking the TD-k trick, and training the Q-function on intermediate steps of the rollout, Feinberg et al. [2018] show that we can decrease the Q-function bias on frames encountered during model-based rollouts, leading to better targets and improved performance.
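As a rough illustration of training on intermediate rollout states, the sketch below computes one regression target per rollout step. It assumes that the target for step t is the discounted sum of the remaining model-predicted rewards plus a discounted bootstrap value at the end of the rollout, and that the rollout does not terminate early. The function names and the `gamma` default are hypothetical, not taken from the paper.

```python
def td_k_targets(rewards, bootstrap_value, gamma=0.99):
    """Return one target per rollout step.

    rewards[t] is the model-predicted reward at step t, and
    bootstrap_value is the Q-estimate at the final rollout state.
    Step t gets the discounted sum of rewards from t onward plus the
    discounted bootstrap value (assumption: no early termination).
    """
    targets = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        targets.append(running)
    targets.reverse()
    return targets


def td_k_loss(q_values, targets):
    """Mean squared error between Q-estimates and targets over all steps."""
    errors = [(q - t) ** 2 for q, t in zip(q_values, targets)]
    return sum(errors) / len(errors)
```

Without the trick, only the first entry of `targets` would contribute to the loss; with it, every entry does, so the Q-function is also fit on states that the model rollout visits.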

The TD-k trick is orthogonal to STEVE. STEVE tends to ignore estimates produced by states with poorly learned Q-values, so it is not hurt nearly as much as MVE by the distribution mismatch problem. However, better Q-values will certainly provide more information with which to compute STEVE’s target, so in that regard the TD-k trick seems beneficial. An obvious question is whether these two approaches are complementary. STEVE+TD-k is beyond the scope of this work, and we did not give it a rigorous treatment; however, initial experiments were not promising. In future work, we hope to explore the connection between these two approaches more deeply.

## Appendix C Implementation Details

All models were feedforward neural networks with ReLU nonlinearities. The policy network, reward model, and termination model each had 4 layers of size 128, while the transition model had 8 layers of size 512. All environments were reset after 1000 timesteps. Parameters were trained with the Adam optimizer Kingma and Ba [2015] with a learning rate of 3e-4.

Policies were trained using minibatches of size 512 sampled uniformly at random from a replay buffer of size 1e6. The first 1e5 frames were sampled via random interaction with the environment; after that, 4 policy updates were performed for every frame sampled from the environment. (In Section 4.4, the policy updates and frames were instead de-synced.) Policy checkpoints were saved every 500 updates; these checkpoints were also frozen and used as . For model-based algorithms, the most recent checkpoint of the model was loaded every 500 updates as well.

Each policy training had 8 agents interacting with the environment to send frames back to the replay buffer. These agents typically took the greedy action predicted by the policy, but with probability , instead took an action sampled from a normal distribution surrounding the pre-tanh logit predicted by the policy. In addition, each policy had 2 purely-greedy agents interacting with the environment for evaluation.
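The exploration rule for the data-collecting agents might look like the sketch below. The exploration probability and the noise scale are not specified here, so `explore_prob` and `noise_std` are left as required arguments; the function name and the assumption that the final action is the tanh of the (possibly perturbed) logit are also illustrative.

```python
import math
import random


def select_action(pre_tanh_logits, rng, explore_prob, noise_std):
    """Pick an action from the policy's pre-tanh outputs.

    With probability `explore_prob`, sample each logit from a normal
    distribution centred on the policy's prediction; otherwise act
    greedily. Both parameters are hypothetical and supplied by the caller.
    """
    if rng.random() < explore_prob:
        logits = [rng.gauss(mu, noise_std) for mu in pre_tanh_logits]
    else:
        logits = list(pre_tanh_logits)
    # Assumption: actions are obtained by squashing logits with tanh.
    return [math.tanh(x) for x in logits]
```

A purely greedy evaluation agent corresponds to calling this with `explore_prob=0.0`.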

Dynamics models were trained using minibatches of size 1024 sampled uniformly at random from a replay buffer of size 1e6. The first 1e5 frames were sampled via random interaction with the environment; the dynamics model was then pre-trained for 1e5 updates. After that, 4 model updates were performed for every frame sampled from the environment. (In Section 4.4, the model updates and frames were instead de-synced.) Model checkpoints were saved every 500 updates.

All ensembles were of size 4. During training, each ensemble member was trained on a disjoint subset of each minibatch, i.e. each sample in the minibatch was “assigned” to one member of the ensemble uniformly at random. Additionally, for all experiments.
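The per-minibatch assignment of samples to ensemble members can be sketched as follows. This is a minimal illustration with hypothetical names; it returns, for each member, the indices of the minibatch samples that member would be trained on.

```python
import random


def assign_minibatch(batch_size, ensemble_size, rng):
    """Split sample indices into disjoint subsets, one per ensemble member.

    Each sample is assigned to exactly one member, chosen uniformly at
    random, so subset sizes vary from minibatch to minibatch.
    """
    subsets = [[] for _ in range(ensemble_size)]
    for index in range(batch_size):
        subsets[rng.randrange(ensemble_size)].append(index)
    return subsets
```

For example, `assign_minibatch(512, 4, random.Random(0))` yields four index lists that together cover the minibatch once; because the assignment is redrawn for every minibatch, each member sees a different random stream of data over the course of training.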