Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud
University of Toronto
Vector Institute
{wgrathwohl, choidami, ywu, roeder, duvenaud}

Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.

1 Introduction

Gradient-based optimization has been key to most recent advances in machine learning and reinforcement learning. The back-propagation algorithm (Rumelhart & Hinton, 1986), also known as reverse-mode automatic differentiation (Speelpenning, 1980; Rall, 1981), computes exact gradients of deterministic, differentiable objective functions. The reparameterization trick (Williams, 1992; Kingma & Welling, 2014; Rezende et al., 2014) allows backpropagation to give unbiased, low-variance estimates of gradients of expectations of continuous random variables. This has allowed effective stochastic optimization of large probabilistic latent-variable models.

Unfortunately, there are many objective functions relevant to the machine learning community for which backpropagation cannot be applied. In reinforcement learning, for example, the function being optimized is unknown to the agent and is treated as a black box (Schulman et al., 2015). Similarly, when fitting probabilistic models with discrete latent variables, discrete sampling operations create discontinuities giving the objective function zero gradient with respect to its parameters. Much recent work has been devoted to constructing gradient estimators for these situations. In reinforcement learning, advantage actor-critic methods (Sutton et al., 2000) give unbiased gradient estimates with reduced variance obtained by jointly optimizing the policy parameters with an estimate of the value function. In discrete latent-variable models, low-variance but biased gradient estimates can be given by continuous relaxations of discrete variables (Maddison et al., 2016; Jang et al., 2016).

A recent advance by Tucker et al. (2017) used a continuous relaxation of discrete random variables to build an unbiased and lower-variance gradient estimator, and showed how to tune the free parameters of these relaxations to minimize the estimator’s variance during training.

We generalize the method of Tucker et al. (2017) to learn a free-form control variate parameterized by a neural network. This gives a lower-variance, unbiased gradient estimator which can be applied to a wider variety of problems. Most notably, our method is applicable even when no continuous relaxation is available, as in reinforcement learning or black-box function optimization.

Figure 1: Left: Training curves comparing different gradient estimators on a toy problem. Right: Variance of each estimator’s gradient.

2 Background: Gradient estimators

How can we choose the parameters $\theta$ of a distribution $p(b|\theta)$ to maximize an expectation? This problem comes up in reinforcement learning, where we must choose the parameters $\theta$ of a policy distribution $\pi(a|s,\theta)$ to maximize the expected reward $\mathbb{E}_{\tau\sim\pi}[R]$ over state-action trajectories $\tau$. It also comes up in fitting latent-variable models, when we wish to maximize the marginal probability $p(x|\theta) = \sum_z p(x|z)\,p(z|\theta) = \mathbb{E}_{p(z|\theta)}[p(x|z)]$. In this paper, we’ll consider the general problem of optimizing

$$\mathcal{L}(\theta) = \mathbb{E}_{p(b|\theta)}[f(b)].$$

When the parameters $\theta$ are high-dimensional, gradient-based optimization is appealing because it provides information about how to adjust each parameter individually. Stochastic optimization is essential for scalability. However, it is only guaranteed to converge to a fixed point of the objective when the stochastic gradients $\hat{g}$ are unbiased, i.e. $\mathbb{E}[\hat{g}] = \frac{\partial}{\partial\theta}\mathcal{L}(\theta)$ (Robbins & Monro, 1951).

How can we build unbiased, stochastic estimators of $\frac{\partial}{\partial\theta}\mathcal{L}(\theta)$? There are several standard methods:

The score-function gradient estimator

One of the most generally-applicable gradient estimators is known as the score-function estimator, or REINFORCE (Williams, 1992):

$$\hat{g}_{\text{REINFORCE}}[f] = f(b)\,\frac{\partial}{\partial\theta}\log p(b|\theta), \qquad b \sim p(b|\theta)$$

This estimator is unbiased, but in general has high variance. Intuitively, this estimator is limited by the fact that it doesn’t use any information about how $f$ depends on $b$, only on the final outcome $f(b)$.
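As a minimal sketch of the score-function estimator, the snippet below applies it to a Bernoulli variable with a quadratic objective. The objective, the target `T = 0.499`, the value `theta = 0.3`, and all function names are illustrative assumptions rather than part of the method. The sample mean should approach the exact gradient `f(1) - f(0)`, while the per-sample variance is orders of magnitude larger than the squared gradient.

```python
import random

T = 0.499  # target of the toy objective (illustrative assumption)

def f(b):
    # Treated as a black box: only its value at a sampled b is used.
    return (b - T) ** 2

def score(b, theta):
    # d/dtheta log p(b | theta) for b ~ Bernoulli(theta)
    return 1.0 / theta if b == 1 else -1.0 / (1.0 - theta)

def reinforce_samples(theta, n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        b = 1 if rng.random() < theta else 0
        out.append(f(b) * score(b, theta))
    return out

theta = 0.3
g = reinforce_samples(theta, 200_000)
g_mean = sum(g) / len(g)
g_var = sum((x - g_mean) ** 2 for x in g) / len(g)
# Exact gradient of theta * f(1) + (1 - theta) * f(0) with respect to theta.
g_exact = f(1) - f(0)
print(f"estimate={g_mean:.4f} exact={g_exact:.4f} variance={g_var:.4f}")
```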

The reparameterization trick

When $f$ is continuous and differentiable, and the latent variables $b$ can be written as a deterministic, differentiable function of a random draw from a fixed distribution, the reparameterization trick (Williams, 1992; Kingma & Welling, 2014; Rezende et al., 2014) creates a low-variance, unbiased gradient estimator by making the dependence of $b$ on $\theta$ explicit through a reparameterization function $b = T(\theta, \epsilon)$:

$$\hat{g}_{\text{reparam}}[f] = \frac{\partial}{\partial\theta} f(b) = \frac{\partial f}{\partial T}\frac{\partial T}{\partial\theta}, \qquad \epsilon \sim p(\epsilon)$$

This gradient estimator is often used when training high-dimensional, continuous latent-variable models, such as variational autoencoders. One intuition for why this gradient estimator is preferable to REINFORCE is that it depends on $\partial f / \partial b$, which exposes the dependence of $f$ on $b$.
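To make the comparison concrete, here is a small sketch that estimates the same gradient with both estimators. It assumes a unit-variance Gaussian $b \sim \mathcal{N}(\mu, 1)$ and the objective $f(b) = b^2$, both chosen only for illustration, so the exact gradient is $2\mu$.

```python
import random

def f(b):
    return b * b

def df_db(b):
    return 2.0 * b

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

mu = 1.0
rng = random.Random(0)
reparam, reinforce = [], []
for _ in range(100_000):
    eps = rng.gauss(0.0, 1.0)
    b = mu + eps                        # b = T(mu, eps), so db/dmu = 1
    reparam.append(df_db(b) * 1.0)      # df/db * db/dmu
    reinforce.append(f(b) * (b - mu))   # score of N(mu, 1) is (b - mu)

print("reparam  :", mean(reparam), var(reparam))
print("reinforce:", mean(reinforce), var(reinforce))
```

Both sample means should be close to $2\mu = 2$, but the reparameterization estimator uses $\partial f/\partial b$ and has a much smaller variance in this setting.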

Control variates

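Control variates are a general method for reducing the variance of a Monte Carlo estimator. Given an estimator $\hat{g}(b)$, a control variate is a function $c(b)$ with a known mean $\mathbb{E}_{p(b)}[c(b)]$. Subtracting the control variate from our estimator and adding its mean gives us a new estimator:

$$\hat{g}_{\text{new}}(b) = \hat{g}(b) - c(b) + \mathbb{E}_{p(b)}[c(b)]$$

This new estimator has the same expectation as the old one:

$$\mathbb{E}_{p(b)}[\hat{g}_{\text{new}}(b)] = \mathbb{E}_{p(b)}\big[\hat{g}(b) - c(b) + \mathbb{E}_{p(b)}[c(b)]\big] = \mathbb{E}_{p(b)}[\hat{g}(b)]$$

Importantly, the new estimator has lower variance than $\hat{g}(b)$ if $c(b)$ is sufficiently positively correlated with $\hat{g}(b)$, specifically when $2\,\mathrm{Cov}(\hat{g}, c) > \mathrm{Var}(c)$.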
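The effect is easy to check numerically. The sketch below is a generic Monte Carlo example, not taken from the paper: it estimates $\mathbb{E}[e^U]$ for $U \sim \text{uniform}[0,1]$ and uses $c(U) = U$, whose mean $1/2$ is known, as the control variate.

```python
import math
import random

rng = random.Random(0)
n = 50_000
u = [rng.random() for _ in range(n)]

c_mean = 0.5                                      # known mean of c(U) = U
plain = [math.exp(x) for x in u]                  # original estimator
with_cv = [math.exp(x) - x + c_mean for x in u]   # subtract c, add its mean

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

true_value = math.e - 1.0
print("plain  :", mean(plain), var(plain))
print("with cv:", mean(with_cv), var(with_cv))
```

Both estimators target $e - 1$, and the version with the control variate has a visibly smaller sample variance because $e^U$ and $U$ are strongly correlated.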

3 Constructing and optimizing a differentiable surrogate

In this section, we introduce a gradient estimator for the expectation of a function $f(b)$ that can be applied even when $f$ is unknown, or not differentiable, or when $b$ is discrete. Our estimator combines the score function estimator, the reparameterization trick, and control variates. We obtain an unbiased estimator whose variance can potentially be as low as the reparameterization-trick estimator, even when $f$ is not differentiable or not computable.

First, we consider the case where $b$ is continuous, but $f$ cannot be differentiated. Instead of differentiating through $f$, we build a surrogate of $f$ using a neural network $c_\phi$, and differentiate $c_\phi$ instead. Since the score-function estimator and reparameterization estimator have the same expectation, we can simply subtract the score-function estimator for $c_\phi$ and add back its reparameterization estimator. This gives a gradient estimator which we call LAX:

$$\hat{g}_{\text{LAX}} = \hat{g}_{\text{REINFORCE}}[f] - \hat{g}_{\text{REINFORCE}}[c_\phi] + \hat{g}_{\text{reparam}}[c_\phi] = [f(b) - c_\phi(b)]\,\frac{\partial}{\partial\theta}\log p(b|\theta) + \frac{\partial}{\partial\theta} c_\phi(b), \qquad b = T(\theta, \epsilon),\ \epsilon \sim p(\epsilon)$$

This estimator is unbiased for any choice of $c_\phi$. When $c_\phi = f$, then LAX becomes the reparameterization estimator for $f$. Thus LAX can have variance at least as low as the reparameterization estimator.
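The unbiasedness claim can be checked numerically. The sketch below assumes $b \sim \mathcal{N}(\theta, 1)$, a black-box $f(b) = b^2$, and a hand-written quadratic surrogate in place of a neural network; all of these are illustrative choices. Whatever the surrogate coefficients are, the sample mean should approach the exact gradient $2\theta$.

```python
import random

def f(b):
    # Black box: evaluated but never differentiated.
    return b * b

def c(b, phi):
    # Surrogate c_phi(b) = phi0 + phi1 * b + phi2 * b^2 (stand-in for a network).
    return phi[0] + phi[1] * b + phi[2] * b * b

def dc_db(b, phi):
    return phi[1] + 2.0 * phi[2] * b

def lax_sample(theta, phi, rng):
    eps = rng.gauss(0.0, 1.0)
    b = theta + eps                  # b = T(theta, eps), so db/dtheta = 1
    score = b - theta                # d/dtheta log N(b | theta, 1)
    return (f(b) - c(b, phi)) * score + dc_db(b, phi)

def lax_mean(theta, phi, n=200_000, seed=0):
    rng = random.Random(seed)
    return sum(lax_sample(theta, phi, rng) for _ in range(n)) / n

theta = 1.0
mean_reinforce = lax_mean(theta, [0.0, 0.0, 0.0])    # c = 0: plain REINFORCE
mean_reparam = lax_mean(theta, [0.0, 0.0, 1.0])      # c = f: reparameterization
mean_arbitrary = lax_mean(theta, [0.5, -1.0, 0.7])   # arbitrary surrogate
print(mean_reinforce, mean_reparam, mean_arbitrary)  # all close to 2 * theta
```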

3.1 Optimizing the gradient control variate with gradients

Since $\hat{g}_{\text{LAX}}$ is unbiased for any choice of the surrogate $c_\phi$, the only remaining problem is to choose a $c_\phi$ that gives $\hat{g}_{\text{LAX}}$ low variance. We simply optimize $c_\phi$ using stochastic gradient descent, at the same time as we optimize the parameters of our model or policy.

To optimize $c_\phi$, we require the gradient of the variance of our gradient estimator. To estimate these gradients, we could simply differentiate through the empirical variance over each mini-batch. Or, following Ruiz et al. (2016) and Tucker et al. (2017), we can construct an unbiased, single-sample estimator using the fact that our gradient estimator is unbiased. For any unbiased gradient estimator $\hat{g}$ with parameters $\phi$:

$$\frac{\partial}{\partial\phi}\mathrm{Var}(\hat{g}) = \frac{\partial}{\partial\phi}\mathbb{E}[\hat{g}^2] - \frac{\partial}{\partial\phi}\mathbb{E}[\hat{g}]^2 = \frac{\partial}{\partial\phi}\mathbb{E}[\hat{g}^2] = \mathbb{E}\left[\frac{\partial}{\partial\phi}\hat{g}^2\right]$$

The second term vanishes because $\mathbb{E}[\hat{g}]$ is the true gradient, which does not depend on $\phi$. Thus, an unbiased single-sample estimate of the gradient of the variance of $\hat{g}$ is given by $\frac{\partial}{\partial\phi}\hat{g}^2$.

This method of directly minimizing the variance of the gradient estimator stands in contrast to other methods such as Q-Prop (Gu et al., 2016) and advantage actor-critic (Sutton et al., 2000), which train the control variate to minimize the squared error $(f(b) - c_\phi(b))^2$. Our algorithm, which jointly optimizes the parameters $\theta$ and the surrogate $c_\phi$, is given in Algorithm 1.

3.1.1 Optimal surrogate

What is the form of the variance-minimizing $c_\phi$? Inspecting the square of the LAX estimator, we can see that this loss encourages $c_\phi(b)$ to approximate $f(b)$, but with a weighting based on $\frac{\partial}{\partial\theta}\log p(b|\theta)$. Moreover, as $c_\phi \to f$ then $\hat{g}_{\text{LAX}} \to \frac{\partial}{\partial\theta} c_\phi$. Thus, this objective encourages a balance between the variance of the reparameterization estimator and the variance of the REINFORCE estimator. Figure 2 shows the learned surrogate on a toy problem.

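Require: $f(\cdot)$, $\log p(b|\theta)$, reparameterized sampler $b = T(\theta, \epsilon)$, neural network $c_\phi(\cdot)$
while not converged do
      Sample noise: $\epsilon \sim p(\epsilon)$
      Compute input: $b \leftarrow T(\theta, \epsilon)$
      Estimate gradient: $\hat{g}_\theta \leftarrow [f(b) - c_\phi(b)]\,\nabla_\theta \log p(b|\theta) + \nabla_\theta c_\phi(b)$
      Estimate gradient of variance of gradient: $\hat{g}_\phi \leftarrow \partial \hat{g}_\theta^2 / \partial\phi$
      Update parameters: $\theta \leftarrow \theta - \alpha_1 \hat{g}_\theta$
      Update control variate: $\phi \leftarrow \phi - \alpha_2 \hat{g}_\phi$
end while
Algorithm 1 LAX: Optimizing parameters and a gradient control variate simultaneously.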
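The loop in Algorithm 1 can be sketched in a few lines of plain Python. This is only an illustration under strong simplifying assumptions: a one-dimensional Gaussian $b \sim \mathcal{N}(\theta, 1)$, a black-box objective $f(b) = (b-2)^2$ that is minimized, a quadratic surrogate standing in for the neural network, derivatives written out by hand instead of automatic differentiation, and hypothetical step sizes.

```python
import random

def f(b):
    # Black-box objective; never differentiated. Its expectation is minimal at theta = 2.
    return (b - 2.0) ** 2

def lax_terms(theta, phi, eps):
    b = theta + eps                        # b = T(theta, eps)
    score = eps                            # d/dtheta log N(b | theta, 1) = b - theta
    c_val = phi[0] + phi[1] * b + phi[2] * b * b
    dc_db = phi[1] + 2.0 * phi[2] * b
    g_theta = (f(b) - c_val) * score + dc_db
    # d g_theta / d phi_k, derived by hand for this particular surrogate.
    dg_dphi = [-score, 1.0 - b * score, 2.0 * b - b * b * score]
    return g_theta, dg_dphi

def train(steps=40_000, lr_theta=0.01, lr_phi=5e-5, seed=0):
    rng = random.Random(seed)
    theta, phi = 0.0, [0.0, 0.0, 0.0]
    for _ in range(steps):
        eps = rng.gauss(0.0, 1.0)                         # sample noise
        g_theta, dg_dphi = lax_terms(theta, phi, eps)     # estimate gradient
        g_phi = [2.0 * g_theta * d for d in dg_dphi]      # d(g_theta^2)/d phi
        theta -= lr_theta * g_theta                       # update parameters
        phi = [p - lr_phi * g for p, g in zip(phi, g_phi)]  # update control variate
    return theta, phi

def estimator_variance(theta, phi, n=50_000, seed=1):
    rng = random.Random(seed)
    gs = [lax_terms(theta, phi, rng.gauss(0.0, 1.0))[0] for _ in range(n)]
    m = sum(gs) / n
    return sum((g - m) ** 2 for g in gs) / n

theta, phi = train()
var_learned = estimator_variance(theta, phi)
var_reinforce = estimator_variance(theta, [0.0, 0.0, 0.0])
print(theta, phi, var_learned, var_reinforce)
```

In this toy setting the learned surrogate should give a markedly lower variance than the REINFORCE estimator (the `phi = 0` case) evaluated at the same `theta`.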

3.2 Discrete random variables and conditional reparameterization

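We can adapt the LAX estimator to the case where $b$ is a discrete random variable by introducing a “relaxed” continuous variable $z$. We require a continuous, reparameterizable distribution $p(z|\theta)$ and a deterministic mapping $H(z)$ such that $H(z) = b \sim p(b|\theta)$ when $z \sim p(z|\theta)$. In our implementation, we use the Gumbel-softmax trick, the details of which can be found in appendix B.

The discrete version of the LAX estimator is given by:

$$\hat{g}_{\text{DLAX}} = f(b)\,\frac{\partial}{\partial\theta}\log p(b|\theta) - c_\phi(z)\,\frac{\partial}{\partial\theta}\log p(z|\theta) + \frac{\partial}{\partial\theta} c_\phi(z), \qquad b = H(z),\ z \sim p(z|\theta)$$

This estimator is simple to implement and general. However, when $f = c_\phi$ we do not recover the reparameterization estimator as we do with LAX. To achieve this, we must be able to replace the $\frac{\partial}{\partial\theta}\log p(z|\theta)$ in the control variate with $\frac{\partial}{\partial\theta}\log p(b|\theta)$. This is the motivation behind our next estimator, which we call RELAX.

To construct a more powerful gradient estimator, we incorporate a further refinement due to Tucker et al. (2017). Specifically, we evaluate our control variate both at a relaxed input $z \sim p(z|\theta)$, and also at a relaxed input conditioned on the discrete variable $b$, denoted $\tilde{z} \sim p(z|b,\theta)$. Doing so gives us:

$$\hat{g}_{\text{RELAX}} = [f(b) - c_\phi(\tilde{z})]\,\frac{\partial}{\partial\theta}\log p(b|\theta) + \frac{\partial}{\partial\theta} c_\phi(z) - \frac{\partial}{\partial\theta} c_\phi(\tilde{z}), \qquad b = H(z),\ z \sim p(z|\theta),\ \tilde{z} \sim p(z|b,\theta)$$

This estimator is unbiased for any $c_\phi$. A proof and a detailed algorithm can be found in appendix A. We note that the distribution $p(z|b,\theta)$ must also be reparameterizable. We demonstrate how to perform this conditional reparameterization for Bernoulli and categorical random variables in appendix B.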
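A minimal numerical sketch of a RELAX-style estimator for a single Bernoulli variable is given below. It assumes the logistic reparameterization $z = \log\frac{\theta}{1-\theta} + \log\frac{u}{1-u}$ with $H(z) = \mathbb{I}(z > 0)$, a conditional sample built by rescaling a second uniform draw into the range consistent with $b$, a toy objective $f(b) = (b - 0.3)^2$, and a small hand-written surrogate with analytic derivatives. All names and constants are illustrative. For any surrogate coefficients the sample mean should approach $f(1) - f(0)$.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p) - math.log(1.0 - p)

def open_uniform(rng):
    # Keep draws strictly inside (0, 1) so the logs below are finite.
    return min(max(rng.random(), 1e-12), 1.0 - 1e-12)

def f(b, t=0.3):
    return (b - t) ** 2

def c(z, phi):
    s = sigmoid(z)
    return phi[0] * s + phi[1] * s * s

def dc_dz(z, phi):
    s = sigmoid(z)
    return (phi[0] + 2.0 * phi[1] * s) * s * (1.0 - s)

def relax_sample(theta, phi, rng):
    u, v = open_uniform(rng), open_uniform(rng)
    z = logit(theta) + logit(u)
    b = 1 if z > 0 else 0                       # b = H(z)
    base = 1.0 / (theta * (1.0 - theta))        # d logit(theta) / d theta
    dz = base                                   # u is held fixed
    if b == 1:
        v_prime = v * theta + (1.0 - theta)     # uniform on (1 - theta, 1)
        dz_tilde = base - 1.0 / (theta * v_prime)
        score = 1.0 / theta
    else:
        v_prime = v * (1.0 - theta)             # uniform on (0, 1 - theta)
        dz_tilde = base - 1.0 / ((1.0 - theta) * (1.0 - v_prime))
        score = -1.0 / (1.0 - theta)
    z_tilde = logit(theta) + logit(v_prime)     # sample of z given b
    return ((f(b) - c(z_tilde, phi)) * score
            + dc_dz(z, phi) * dz - dc_dz(z_tilde, phi) * dz_tilde)

def relax_mean(theta, phi, n=200_000, seed=0):
    rng = random.Random(seed)
    return sum(relax_sample(theta, phi, rng) for _ in range(n)) / n

theta = 0.3
exact = f(1) - f(0)
mean_no_surrogate = relax_mean(theta, [0.0, 0.0])
mean_with_surrogate = relax_mean(theta, [0.5, -1.0])
print(exact, mean_no_surrogate, mean_with_surrogate)
```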

3.3 Choosing the control variate architecture

The variance-reduction objective introduced above allows us to use any differentiable, parametric function as our control variate $c_\phi$. How should we choose the architecture of $c_\phi$? Ideally, we will take advantage of any known structure in $f$.

If $f$ is a known, differentiable function of discrete random variables, we can use the concrete relaxation (Jang et al., 2016; Maddison et al., 2016) and let $c_\phi(z) = f(\sigma_\lambda(z))$. In this special case, our estimator is exactly the REBAR estimator. We are also free to add a learned component to the concrete relaxation and let $c_\phi(z) = f(\sigma_\lambda(z)) + r_\rho(z)$ where $r_\rho$ is a neural network with parameters $\rho$. We took this approach in our experiments training discrete variational auto-encoders. If $f$ is unknown, we can simply let $c_\phi$ be a generic function approximator such as a neural network. We took this simpler approach in our reinforcement learning experiments.

3.4 Reinforcement learning

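We now describe how we apply the LAX estimator in the reinforcement learning (RL) setting. By reinforcement learning, we refer to the problem of optimizing the parameters $\theta$ of a policy distribution $\pi(a|s,\theta)$ to maximize the sum of rewards. In this setting, the random variable being integrated over is $\tau$, which denotes a series of $T$ actions and states $[(s_1, a_1), (s_2, a_2), \dots, (s_T, a_T)]$. The function whose expectation is being optimized, $R$, maps $\tau$ to the sum of rewards $R(\tau) = \sum_{t=1}^{T} r_t(s_t, a_t)$.

Again, we want to estimate the gradient of an expectation of a black-box function: $\frac{\partial}{\partial\theta}\mathbb{E}_{p(\tau|\theta)}[R(\tau)]$. The de facto standard approach is the advantage actor-critic estimator (A2C) (Sutton et al., 2000):

$$\hat{g}_{\text{A2C}} = \sum_{t=1}^{T} \frac{\partial \log \pi(a_t|s_t,\theta)}{\partial\theta}\left[\sum_{t'=t}^{T} r_{t'} - c_\phi(s_t)\right], \qquad a_t \sim \pi(a_t|s_t,\theta)$$

where $c_\phi(s_t)$ is an estimate of the state-value function, $c_\phi(s) \approx V^\pi(s) = \mathbb{E}_\tau[R \mid s_1 = s]$. This estimator is unbiased when $c_\phi$ does not depend on $a_t$. The main limitations of A2C are that $c_\phi$ does not depend on $a_t$, and that it’s not obvious how to optimize $c_\phi$. Using the LAX estimator addresses both of these problems.

First, we assume $\pi(a_t|s_t,\theta)$ is reparameterizable, meaning that we can write $a_t = a(\epsilon_t, s_t, \theta)$, where $\epsilon_t$ does not depend on $\theta$. We again introduce a differentiable surrogate $c_\phi(a, s)$. Crucially, this surrogate is a function of the action as well as the state.

Our estimator is defined as:

$$\hat{g}^{\text{RL}}_{\text{LAX}} = \sum_{t=1}^{T} \left( \frac{\partial \log \pi(a_t|s_t,\theta)}{\partial\theta}\left[\sum_{t'=t}^{T} r_{t'} - c_\phi(a_t, s_t)\right] + \frac{\partial}{\partial\theta} c_\phi(a_t, s_t) \right), \qquad a_t = a(\epsilon_t, s_t, \theta),\ \epsilon_t \sim p(\epsilon_t)$$

This estimator is unbiased if the true dynamics of the system are Markovian w.r.t. the state $s_t$. When $T = 1$, we recover the special case $\hat{g}^{\text{RL}}_{\text{LAX}} = \hat{g}_{\text{LAX}}$. Comparing to the standard advantage actor-critic estimator in (10), the main difference is that our baseline $c_\phi(a_t, s_t)$ is action-dependent while still remaining unbiased.

To optimize the parameters $\phi$ of our control variate $c_\phi(a_t, s_t)$, we can again use the single-sample estimator of the gradient of our estimator’s variance given in (7). This approach avoids unstable training dynamics, and doesn’t require storage and replay of previous rollouts.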

Details of this derivation, as well as the discrete and conditionally reparameterized version of this estimator can be found in appendix C.

4 Scope and Limitations

The work most related to ours is the recently-developed REBAR method (Tucker et al., 2017), which inspired our work. The REBAR estimator is a special case of the RELAX estimator, when the surrogate is set to $c_\phi(z) = \eta \cdot f(\text{softmax}_\lambda(z))$. The only free parameters of the REBAR estimator are the scaling factor $\eta$, and the temperature $\lambda$, which gives limited scope to optimize the surrogate. REBAR can only be applied when $f$ is known and differentiable. Furthermore, it depends on essentially undefined behavior of the function being optimized, since it evaluates the discrete loss function at continuous inputs.

Because LAX and RELAX can construct a surrogate from scratch, they can be used for optimizing black-box functions, as in reinforcement learning settings where the reward is an unknown function of the environment. LAX and RELAX only require that we can query the function being optimized, and can sample from and differentiate $p(b|\theta)$.

In principle one could use RELAX to optimize deterministic black-box functions, but only by introducing stochasticity to the inputs. Thus, RELAX is most suitable for problems where one is already optimizing a distribution over inputs, such as in inference or reinforcement learning.

Direct dependence on parameters

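Above, we assumed that the function $f$ being optimized does not depend directly on $\theta$, which is usually the case in black-box optimization settings. However, a dependence on $\theta$ can occur when training probabilistic models, or when we add a regularizer. In both these settings, if the dependence on $\theta$ is known and differentiable, we can use the fact that

$$\frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[f(b,\theta)] = \mathbb{E}_{p(b|\theta)}\left[\frac{\partial}{\partial\theta} f(b,\theta) + f(b,\theta)\,\frac{\partial}{\partial\theta}\log p(b|\theta)\right]$$

and simply add $\frac{\partial}{\partial\theta} f(b,\theta)$ to any of the gradient estimators above to recover an unbiased estimator.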
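The identity follows from the product rule. As a short worked derivation, assuming $b$ is continuous (for discrete $b$ the integrals become sums) and that differentiation and integration can be exchanged:

```latex
\begin{align*}
\frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[f(b,\theta)]
  &= \frac{\partial}{\partial\theta}\int f(b,\theta)\,p(b|\theta)\,db \\
  &= \int \frac{\partial f(b,\theta)}{\partial\theta}\,p(b|\theta)\,db
   + \int f(b,\theta)\,\frac{\partial p(b|\theta)}{\partial\theta}\,db \\
  &= \mathbb{E}_{p(b|\theta)}\!\left[\frac{\partial f(b,\theta)}{\partial\theta}
   + f(b,\theta)\,\frac{\partial}{\partial\theta}\log p(b|\theta)\right],
\end{align*}
```

where the last step uses $\frac{\partial p}{\partial\theta} = p\,\frac{\partial \log p}{\partial\theta}$. The second term is the usual score-function term, so only the first term needs to be added.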

5 Related work

Miller et al. (2017) reduce the variance of reparameterization gradients in an orthogonal way to ours by approximating the gradient-generating procedure with a simple model and using that model as a control variate. NVIL (Mnih & Gregor, 2014) and VIMCO (Mnih & Rezende, 2016) provide reduced variance gradient estimation in the special case of discrete latent variable models and discrete latent variable models with Monte-Carlo objectives. Salimans et al. (2017) estimate gradients using a form of finite differences, evaluating hundreds of different parameter values in parallel to construct a gradient estimate. In contrast, our method is a single-sample estimator.

Staines & Barber (2012) address the general problem of developing gradient estimators for deterministic black-box functions or discrete optimization. They introduce a sampling distribution, and optimize an objective similar to ours. Wierstra et al. (2014) also introduce a sampling distribution to build a gradient estimator, and consider optimizing the sampling distribution.

In the reinforcement learning setting, the work most similar to ours is Q-prop (Gu et al., 2016). Like our method, Q-prop reduces the variance of the policy gradient with a learned, action-dependent control variate whose expectation is approximated via a Monte Carlo sample from a Taylor series expansion of the control variate. Unlike our method, their control variate is trained off-policy. While our method is applicable in both the continuous and discrete action domain, Q-prop is only applicable to continuous actions.

6 Applications

Figure 2: The optimal relaxation for a toy loss function, using different gradient estimators. Because REBAR uses the concrete relaxation of $f$, which happens to be implemented as a quadratic function, the optimal relaxation is constrained to be a warped quadratic. In contrast, RELAX can choose a free-form relaxation.

We demonstrate the effectiveness of our estimator on a number of challenging optimization problems. Following Tucker et al. (2017) we begin with a simple toy example to illuminate the potential of our method and then continue to the more relevant problems of optimizing binary VAEs and reinforcement learning.

6.1 Toy experiment

As a simple example, we follow Tucker et al. (2017) in minimizing $\mathbb{E}_{p(b|\theta)}[(b - t)^2]$ as a function of the parameter $\theta$ where $p(b|\theta) = \text{Bernoulli}(b|\theta)$. Tucker et al. (2017) set the target $t = 0.45$. We focus on the more challenging case where $t = 0.499$. Figures 1a and 1b show the relative performance and gradient log-variance of REINFORCE, REBAR, and RELAX.
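A short calculation shows why a target close to $1/2$ is harder. Assuming the targets are $t = 0.45$ and $t = 0.499$, the objective and its exact gradient are:

```latex
\begin{align*}
\mathbb{E}_{p(b|\theta)}[(b-t)^2] &= \theta\,(1-t)^2 + (1-\theta)\,t^2, \\
\frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b|\theta)}[(b-t)^2] &= (1-t)^2 - t^2 = 1 - 2t, \\
t = 0.45 &\;\Rightarrow\; 1 - 2t = 0.1, \qquad
t = 0.499 \;\Rightarrow\; 1 - 2t = 0.002.
\end{align*}
```

The true gradient is therefore fifty times smaller at $t = 0.499$, while the magnitude of individual score-function samples is essentially unchanged, so a much lower-variance estimator is needed to recover the correct direction.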

Figure 2 plots the learned surrogate $c_\phi$ for a fixed value of $\theta$. We can see that $c_\phi$ is near $f$ for all $z$, keeping the variance of the REINFORCE part of the estimator small. Moreover, the derivative of $c_\phi$ is positive for all $z$, meaning that the reparameterization part of the estimator will produce gradients pointing in the correct direction to optimize the expectation. Conversely, the concrete relaxation of REBAR is close to $f$ only near $0$ and $1$, and its gradient points in the correct direction only for some values of $z$. These factors together result in the RELAX estimator achieving the best performance.

6.2 Discrete variational autoencoder

Next, we evaluate the RELAX estimator on the task of training a variational autoencoder (Kingma & Welling, 2014; Rezende et al., 2014) with Bernoulli latent variables. We reproduced the variational autoencoder experiments from Tucker et al. (2017), training models with 1 or 2 layers of 200 Bernoulli random variables with linear or nonlinear mappings between them, on both the MNIST and Omniglot (Lake et al., 2015) datasets. Details of these models and our experimental procedure can be found in appendix E.1.

To take advantage of the available structure in the loss function, we choose the form of our control variate to be $c_\phi(z) = f(\sigma_\lambda(z)) + r_\rho(z)$ where $r_\rho$ is a neural network with parameters $\rho$ and $f(\sigma_\lambda(z))$ is the discrete loss function (the evidence lower-bound) evaluated at continuously relaxed inputs as in REBAR. In all experiments, the learned control variate improved the training performance over the state-of-the-art baseline of REBAR. In both linear models, we achieved improved validation performance as well as increased convergence speed. We believe the decrease in validation performance for the nonlinear models was due to overfitting caused by improved optimization of an under-regularized model. We leave exploring this phenomenon to further work.

Dataset Model Concrete NVIL MuProp REBAR RELAX
MNIST Nonlinear -101.1 -81.01 -78.13
MNIST Linear 1 layer -111.3 -111.6 -111.20
MNIST Linear 2 layer -99.62 -98.22 -98.00
Omniglot Nonlinear -108.72 -56.76 -56.12
Omniglot Linear 1 layer -117.23 -116.63 -116.57
Omniglot Linear 2 layer -109.95 -108.71 -108.54
Table 1: Best obtained training objective for discrete variational autoencoders.

To obtain training curves we created our own implementation of REBAR, which gave identical or slightly improved performance compared to the implementation of Tucker et al. (2017).

While we obtained a modest improvement in training and validation scores (tables 1 and 3), the most notable improvement provided by RELAX is in its rate of convergence. Training curves for all models can be seen in figure 3 and in appendix D. In table 4 we compare the number of training epochs that are required to match the best validation score of REBAR. In both linear models, RELAX provides an increase in rate of convergence.

Figure 3: Training curves for the VAE experiments with the 1 layer linear model (panels: MNIST, Omniglot). The horizontal dashed line indicates the lowest validation error obtained by REBAR.

6.3 Reinforcement learning

We apply our gradient estimator to a few simple reinforcement learning environments with discrete and continuous actions. We use the RELAX and LAX estimators for discrete and continuous actions, respectively. We compare with the advantage actor-critic algorithm (A2C) (Sutton et al., 2000) as a baseline. Full details of our experiments can be found in Appendix E.

6.3.1 Experiments

In the discrete action setting, we test our approach on the Cart Pole and Lunar Lander environments as provided by the OpenAI gym (Brockman et al., 2016). In the continuous action setting, we test on the MuJoCo-simulated (Todorov et al., 2012) environments Inverted Pendulum and Inverted Double Pendulum also found in the OpenAI gym. In all tested environments we observe improved performance and sample efficiency using our method. The results of our experiments can be seen in figure 4 and table 2.

We found that our estimator produced policy gradients with drastically reduced variance (see figure 4), allowing for larger learning rates to be used while maintaining stable training. In both discrete environments our estimator achieved greater than a 2-times speedup in convergence over the baseline.

Table 2: Mean episodes to solve each task (columns: Model, Cart-pole, Lunar lander, Inverted pendulum, Inverted double pendulum). Definition of solving each task can be found in appendix E.

Figure 4: Panels: Cart-pole, Lunar lander, Inverted pendulum, Inverted double pendulum. Top row: Reward curves. Bottom row: Variance of policy gradients (log scale). In each curve, the center line indicates the mean reward over 5 random seeds. The opaque bars in the top row indicate the 25th and 75th percentiles. The opaque bars in the bottom row indicate 1 standard deviation. After every 10th training episode, 100 episodes were run and the sample log-variance is reported averaged over all policy parameters.

Code for all experiments can be found at

7 Conclusions and future work

In this work we synthesized and generalized several standard approaches for constructing gradient estimators. We proposed a generic gradient estimator that can be applied to expectations of known or black-box functions of discrete or continuous random variables, and adds little computational overhead. We also derived a simple extension to reinforcement learning in both discrete and continuous-action domains.

The generality of this method opens up new possibilities for training non-differentiable models. For example, we could apply our estimator to continuous latent-variable models whose likelihood is non-differentiable, such as a 3D rendering engine. There is also room to explore architecture choices for the control variate.

Our results may motivate further work using action-dependent control variates for policy-gradient methods, and can be combined with other variance-reduction techniques such as generalized advantage estimation (Kimura et al., 2000). One could also train our control variate off-policy, as in Q-prop (Gu et al., 2016).


We thank Dougal Maclaurin, Tian Qi Chen, Elliot Creager, and Bowen Xu for helpful discussions. We would also like to thank Christopher Prohm for pointing out an error in one of our derivations.


  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • Clevert et al. (2016) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). International Conference on Learning Representations, 2016.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
  • Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
  • Hesse et al. (2017) Christopher Hesse, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI baselines, 2017.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Kimura et al. (2000) Hajime Kimura, Shigenobu Kobayashi, et al. An analysis of actor-critic algorithms using eligibility traces: reinforcement learning with imperfect value functions. Journal of Japanese Society for Artificial Intelligence, 15(2):267–275, 2000.
  • Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
  • Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • Miller et al. (2017) Andrew C Miller, Nicholas J Foti, Alexander D’Amour, and Ryan P Adams. Reducing reparameterization gradient variance. arXiv preprint arXiv:1705.07880, 2017.
  • Mnih & Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1791–1799, 2014.
  • Mnih & Rezende (2016) Andriy Mnih and Danilo Rezende. Variational inference for monte carlo objectives. In International Conference on Machine Learning, pp. 2188–2196, 2016.
  • Rall (1981) Louis B Rall. Automatic differentiation: Techniques and applications. 1981.
  • Rezende et al. (2014) Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286, 2014.
  • Robbins & Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
  • Ruiz et al. (2016) Francisco J.R. Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. In Uncertainty in Artificial Intelligence, 2016.
  • Rumelhart & Hinton (1986) David E Rumelhart and Geoffrey E Hinton. Learning representations by back-propagating errors. Nature, 323:9, 1986.
  • Salimans et al. (2017) Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • Schulman et al. (2015) John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.
  • Speelpenning (1980) Bert Speelpenning. Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, University of Illinois at Urbana-Champaign, 1980.
  • Staines & Barber (2012) Joe Staines and David Barber. Variational optimization. arXiv preprint arXiv:1212.4507, 2012.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
  • Tieleman & Hinton (2012) T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
  • Tucker et al. (2017) George Tucker, Andriy Mnih, Chris J Maddison, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. arXiv preprint arXiv:1703.07370, 2017.
  • Wierstra et al. (2014) Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.


Appendix A The RELAX Algorithm

We prove that $\hat{g}_{\text{RELAX}}$ is unbiased. Following Tucker et al. (2017):
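The key step of the argument can be sketched as follows, using the notation of Section 3.2 and assuming that differentiation and expectation can be exchanged. Since $z \sim p(z|\theta)$ can equivalently be drawn by first sampling $b \sim p(b|\theta)$ and then $\tilde{z} \sim p(z|b,\theta)$:

```latex
\begin{align*}
\frac{\partial}{\partial\theta}\mathbb{E}_{p(z|\theta)}[c_\phi(z)]
  &= \frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}\Big[\mathbb{E}_{p(\tilde z|b,\theta)}[c_\phi(\tilde z)]\Big] \\
  &= \mathbb{E}_{p(b|\theta)}\Big[\mathbb{E}_{p(\tilde z|b,\theta)}[c_\phi(\tilde z)]\,
     \frac{\partial}{\partial\theta}\log p(b|\theta)
   + \frac{\partial}{\partial\theta}\mathbb{E}_{p(\tilde z|b,\theta)}[c_\phi(\tilde z)]\Big].
\end{align*}
```

Rearranging shows that the three terms involving $c_\phi$ cancel in expectation, so that $\mathbb{E}[\hat{g}_{\text{RELAX}}] = \mathbb{E}_{p(b|\theta)}\big[f(b)\,\frac{\partial}{\partial\theta}\log p(b|\theta)\big] = \frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[f(b)]$. The remaining expectations over $z$ and $\tilde{z}$ are then replaced by single reparameterized samples.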

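Require: $f(\cdot)$, $\log p(b|\theta)$, reparameterized samplers $b = H(z)$, $z = S(\epsilon|\theta)$ and $\tilde{z} = S(\epsilon|\theta, b)$,
          neural network $c_\phi(\cdot)$
while not converged do
      Sample noise: $\epsilon, \tilde{\epsilon} \sim p(\epsilon)$
      Compute unconditional relaxed input: $z \leftarrow S(\epsilon|\theta)$
      Compute input: $b \leftarrow H(z)$
      Compute conditional relaxed input: $\tilde{z} \leftarrow S(\tilde{\epsilon}|\theta, b)$
      Estimate gradient: $\hat{g}_\theta \leftarrow [f(b) - c_\phi(\tilde{z})]\,\nabla_\theta \log p(b|\theta) + \nabla_\theta c_\phi(z) - \nabla_\theta c_\phi(\tilde{z})$
      Estimate gradient of variance of gradient: $\hat{g}_\phi \leftarrow \partial \hat{g}_\theta^2 / \partial\phi$
      Update parameters: $\theta \leftarrow \theta - \alpha_1 \hat{g}_\theta$
      Update control variate: $\phi \leftarrow \phi - \alpha_2 \hat{g}_\phi$
end while
Algorithm 2 RELAX: Low-variance control variate optimization for black-box gradient estimation.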

Appendix B Conditional Re-sampling for Discrete Random Variables

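When applying the RELAX estimator to a function of discrete random variables $b \sim p(b|\theta)$, we require that there exists a distribution $p(z|\theta)$ and a deterministic mapping $H(z)$ such that if $z \sim p(z|\theta)$ then $b = H(z) \sim p(b|\theta)$. Treating both $b$ and $z$ as random, this procedure defines a probabilistic model $p(b, z|\theta) = p(b|z)\,p(z|\theta)$. The RELAX estimator requires reparameterized samples from $p(z|\theta)$ and $p(z|b,\theta)$. We describe how to sample from these distributions in the common cases of $p(b|\theta) = \text{Bernoulli}(\theta)$ and $p(b|\theta) = \text{Categorical}(\theta)$.

When $p(b|\theta)$ is a Bernoulli distribution we let $H(z) = \mathbb{I}(z > 0)$ and we sample from $p(z|\theta)$ with

$$z = \log\frac{\theta}{1-\theta} + \log\frac{u}{1-u}, \qquad u \sim \text{uniform}[0,1].$$

We can sample from $p(z|b,\theta)$ with

$$v' = \begin{cases} v\,(1-\theta) & b = 0 \\ v\,\theta + (1-\theta) & b = 1 \end{cases}, \qquad \tilde{z} = \log\frac{\theta}{1-\theta} + \log\frac{v'}{1-v'}, \qquad v \sim \text{uniform}[0,1].$$

When $p(b|\theta)$ is a Categorical distribution where $\theta_i = p(b = i|\theta)$, we let $H(z) = \arg\max(z)$ and we sample from $p(z|\theta)$ with

$$z = \log\theta - \log(-\log u), \qquad u \sim \text{uniform}[0,1]^k$$

where $k$ is the number of possible outcomes.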

To sample from we sample a value and compute . We note that in the unconditional case we would have but in the conditional case . We first sample in this way. Then we can sample by finding the point in where and scaling a uniform random variable to be below that value.


and then which is our sample from .
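The categorical case can be sketched in code as follows. The unconditional sampler is the Gumbel-max construction described above. The conditional sampler uses the standard truncated-Gumbel construction, which is an assumption here and may differ in detail from the procedure in the text; the function names and the test distribution are also illustrative.

```python
import math
import random

def open_uniform(rng):
    # Keep draws strictly inside (0, 1) so that the logs below are finite.
    return min(max(rng.random(), 1e-12), 1.0 - 1e-12)

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def sample_z(theta, rng):
    # z_i = log(theta_i) - log(-log(u_i)); b = argmax(z) ~ Categorical(theta)
    return [math.log(t) - math.log(-math.log(open_uniform(rng))) for t in theta]

def sample_z_given_b(theta, b, rng):
    # The maximum is Gumbel-distributed with location log(sum(theta)) = 0;
    # every other coordinate is a Gumbel(log(theta_i)) truncated below the maximum.
    v = [open_uniform(rng) for _ in theta]
    top = -math.log(-math.log(v[b]))
    z = []
    for i, (t, vi) in enumerate(zip(theta, v)):
        if i == b:
            z.append(top)
        else:
            z.append(-math.log(-math.log(vi) / t - math.log(v[b])))
    return z

theta = [0.2, 0.5, 0.3]
rng = random.Random(0)
n = 50_000
counts = [0, 0, 0]
consistent = True
z0_sum = 0.0
for _ in range(n):
    b = argmax(sample_z(theta, rng))
    counts[b] += 1
    z_tilde = sample_z_given_b(theta, b, rng)
    consistent = consistent and (argmax(z_tilde) == b)
    z0_sum += z_tilde[0]

freqs = [cnt / n for cnt in counts]
z0_mean = z0_sum / n
# Marginally, z_0 should be Gumbel(log(theta_0)), with mean log(theta_0) + Euler's constant.
z0_expected = math.log(theta[0]) + 0.5772156649
print(freqs, consistent, z0_mean, z0_expected)
```

The check at the end relies on the fact that drawing $b$ and then $\tilde{z}$ given $b$ should reproduce the marginal distribution of $z$.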

Appendix C Derivations of estimators used in Reinforcement learning

We give the derivation of the LAX estimator used for continuous RL tasks.

Theorem C.1.

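The LAX estimator,

$$\hat{g}^{\text{RL}}_{\text{LAX}} = \sum_{t=1}^{T} \left( \frac{\partial \log \pi(a_t|s_t,\theta)}{\partial\theta}\left[\sum_{t'=t}^{T} r_{t'} - c_\phi(a_t, s_t)\right] + \frac{\partial}{\partial\theta} c_\phi(a_t, s_t) \right), \qquad a_t = a(\epsilon_t, s_t, \theta),\ \epsilon_t \sim p(\epsilon_t),$$

is unbiased.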


Note that by using the score-function estimator, for all , we have

Then, by adding and subtracting the same term, we have

In the discrete control setting, our policy parameterizes a soft-max distribution which we use to sample actions. We define , which is equal to where , , is the soft-max function. We also define and uses the same reparametrization trick for sampling as explicated in Appendix B.

Theorem C.2.

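The RELAX estimator,

$$\hat{g}^{\text{RL}}_{\text{RELAX}} = \sum_{t=1}^{T} \left( \frac{\partial \log \pi(a_t|s_t,\theta)}{\partial\theta}\left[\sum_{t'=t}^{T} r_{t'} - c_\phi(\tilde{z}_t, s_t)\right] - \frac{\partial}{\partial\theta} c_\phi(\tilde{z}_t, s_t) + \frac{\partial}{\partial\theta} c_\phi(z_t, s_t) \right), \qquad a_t = H(z_t),\ z_t \sim p(z_t|s_t,\theta),\ \tilde{z}_t \sim p(z_t|a_t, s_t,\theta),$$

is unbiased.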


Note that by using the score-function estimator, for all , we have

Then, by adding and subtracting the same term, we have

Since is reparametrizable, we obtain the estimator in Eq.(16). ∎

Appendix D Further results on discrete variational autoencoders

Dataset   Model           REBAR    RELAX
MNIST     1 layer linear  -114.32  -113.62
MNIST     2 layer linear  -101.20  -100.85
MNIST     Nonlinear       -111.12  -119.19
Omniglot  1 layer linear  -122.44  -122.11
Omniglot  2 layer linear  -115.83  -115.42
Omniglot  Nonlinear       -127.51  -128.20
Table 3: Best obtained validation objective.
Dataset   Model      REBAR  RELAX
MNIST     1 layer    857    531
MNIST     2 layer    900    620
MNIST     Nonlinear  331    -
Omniglot  1 layer    2086   566
Omniglot  2 layer    1027   673
Omniglot  Nonlinear  368    -
Table 4: Epochs needed to achieve REBAR’s best validation score. “-” indicates that the nonlinear RELAX models achieved lower validation scores than REBAR.
Figure 5: Training curves for the VAE experiments with the 2 layer linear model (panels: MNIST, Omniglot). The horizontal dashed line indicates the lowest validation error obtained by REBAR.
Figure 6: Training curves for the VAE experiments with the 1 layer nonlinear model (panels: MNIST, Omniglot). The horizontal dashed line indicates the lowest validation error obtained by REBAR.

Appendix E Experimental Details

E.1 Discrete VAE

In the 1 layer linear models we optimize the evidence lower bound (ELBO):

where and with weight matrices and bias vectors . The parameters of the prior are also learned.

We run all models for iterations with a batch size of . For the REBAR models, we tested learning rates in .

RELAX adds more hyperparameters. These are the depth of the neural network component of our control variate , the weight decay placed on the network, and the scaling on the learning rate for the control variate. We tested neural network models with layers of 200 units using the ReLU nonlinearity with . We trained the control variate with weight decay in . We trained the control variate with learning rate scaling in .

To limit the size of hyperparameter search for the RELAX models, we only test the best performing learning rate for the REBAR baseline and the next largest learning rate in our search set. In many cases, we found that RELAX allowed our model to converge at learning rates which made the REBAR estimators diverge. We believe further improvement could be achieved by tuning this parameter.

All presented results are from the models which achieve the highest ELBO on the validation data.

E.1.1 Two layer model

In the two layer linear models we optimize the ELBO

where , , , and with weight matrices and biases . As in the one layer model, the prior is also learned.

E.1.2 Nonlinear model

In the one layer nonlinear model, the mappings between random variables consist of 2 deterministic layers with 200 units using the hyperbolic-tangent nonlinearity followed by a linear layer with 200 units.

We run an identical hyperparameter search in all models.

E.2 Discrete RL

In both the baseline A2C and RELAX models, the policy and control variate (value function in the baseline model) were 2 layer neural networks with 10 units per layer. The ReLU nonlinearity was used on all layers except for the output layer.

For these tasks we estimate the policy gradient with a single Monte Carlo sample. We run one episode of the environment to completion, compute the discounted rewards, and run one iteration of gradient descent. We believe using larger batches would improve performance but would less clearly demonstrate the potential of our method.
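The discounted rewards for a single completed episode can be computed with one backward pass over the reward sequence. The following is a minimal sketch; the function name and the `gamma` argument are illustrative assumptions, and the value of `gamma` in the usage example is arbitrary.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```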

As our control variate does not have the same interpretation as the value function of A2C, it was not immediately clear how to add reward bootstrapping and other variance reduction techniques common in RL into our model. We leave the task of incorporating these and other variance reduction techniques to future work.

Both models were trained with the RMSProp (Tieleman & Hinton, 2012) optimizer and a reward discount factor of was used.

Both models have 2 hyperparameters to tune: the global learning rate and the scaling factor on the learning rate for the control variate (or value function). We complete a grid search for both parameters in and present the model which “solves” the task in the fewest episodes, averaged over 5 random seeds. “Solving” the tasks was defined by the creators of the OpenAI gym (Brockman et al., 2016). The Cart Pole task is considered solved if the agent receives an average reward greater than 195 over 100 consecutive episodes. The Lunar Lander task is considered solved if the agent receives an average reward greater than 200 over 100 consecutive episodes.
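The "solved" criterion above amounts to a trailing-window average over episode rewards. A minimal sketch of such a check follows; the function name and signature are hypothetical, and the thresholds (195 for Cart Pole, 200 for Lunar Lander) are passed in by the caller.

```python
from collections import deque


def first_solved_episode(episode_rewards, threshold, window=100):
    """Return the 1-based episode index at which the average reward over the
    last `window` episodes first exceeds `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    total = 0.0
    for index, reward in enumerate(episode_rewards, start=1):
        if len(recent) == window:
            total -= recent[0]  # this reward is evicted by the append below
        recent.append(reward)
        total += reward
        if len(recent) == window and total / window > threshold:
            return index
    return None


if __name__ == "__main__":
    rewards = [0.0] * 50 + [200.0] * 150
    print(first_solved_episode(rewards, threshold=195.0))  # 148
```

Averaging this index over several random seeds gives the episodes-to-solve figure used to compare hyperparameter settings.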

The Cart Pole experiments were run for 250,000 frames. The Lunar Lander experiments were run for 5,000,000 frames.

The results presented for the Cart Pole and Lunar Lander environments were obtained using a slightly biased sampler for .

E.3 Continuous RL

The continuous tasks use both the value function and the control variate to enable bootstrapping, which is needed due to the increased complexity of the problem. The three models (policy, value, and control variate) are 2 layer neural networks with 64 hidden units per layer. The value and control variate networks are identical, with the ELU (Djork-Arné Clevert & Hochreiter, 2016) nonlinearity in each hidden layer. The policy network uses the tanh nonlinearity. The policy network, which parameterizes the Gaussian policy, consists of a network (with the architecture mentioned above) that outputs the mean, and a separate, trainable log standard deviation value that is not input dependent. All three networks have a linear output layer. We selected the batch size to be 2500, meaning that for a fixed number of timesteps (2500) we collect multiple rollouts of a task and update the networks’ parameters with the batch of episodes. For each policy update, we optimize both the value and control variate networks multiple times. The number of times we train the value network is fixed to 25, while for the control variate it was chosen to be a hyperparameter. All models were trained using ADAM (Kingma & Ba, 2015), with , , and .
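A Gaussian policy of the kind described above, with an input-dependent mean and an input-independent log standard deviation, can be sketched as follows. The mean network is omitted here: `mean` stands in for its output, and all function names are hypothetical. The sketch assumes a diagonal Gaussian over the action dimensions.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Reparameterized sample: mean plus scaled standard normal noise."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, m, s in zip(action, mean, log_std):
        zscore = (a - m) / math.exp(s)
        total += -0.5 * zscore * zscore - s - 0.5 * math.log(2.0 * math.pi)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    mean = [0.1, -0.2]       # placeholder for the policy network output
    log_std = [-0.5, -0.5]   # trainable, shared across all states
    action = sample_action(mean, log_std, rng)
    print(action, log_prob(action, mean, log_std))
```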

The baseline A2C case has 2 hyperparameters to tune: the learning rates of the optimizers for the policy and value networks. A grid search was done over the set: . RELAX has 4 hyperparameters to tune: 3 learning rates (one optimizer per network) and the number of training iterations of the control variate per policy gradient update. Due to the large number of hyperparameters, we restricted the size of the grid search set to for the learning rates, and for the control variate training iteration number. We chose the hyperparameter setting that yielded the shortest episode-to-completion time averaged over 5 random seeds. As in the discrete case, we used the definition of completion given by OpenAI gym (Brockman et al., 2016) for each task.

The Inverted Pendulum experiments were run for 1,000,000 frames. The Inverted Double Pendulum experiments were run for 50,000,000 frames.
