# Using Large Ensembles of Control Variates for Variational Inference

Tomas Geffner
College of Information and Computer Science
University of Massachusetts
Amherst, MA 01003
tgeffner@cs.umass.edu

Justin Domke
College of Information and Computer Science
University of Massachusetts
Amherst, MA 01003
domke@cs.umass.edu
###### Abstract

Variational inference is increasingly being addressed with stochastic optimization. In this setting, the gradient's variance plays a crucial role in the optimization procedure, since high-variance gradients lead to poor convergence. A popular approach to reducing this variance is the use of control variates. Despite the good results obtained, control variates developed for variational inference are typically considered in isolation. In this paper we clarify the large number of control variates that are available by giving a systematic view of how they are derived. We also present a Bayesian risk minimization framework in which the quality of a procedure for combining control variates is quantified by its effect on optimization convergence rates, which leads to a very simple combination rule. Results show that combining a large number of control variates this way significantly improves the convergence of inference over using typical gradient estimators or a reduced number of control variates.

## 1 Introduction

Variational Inference (VI) wainwright2008graphical (); blei2017variational (); jordan1999introduction () is a framework for approximate probabilistic inference. It has been successfully applied in several areas including topic modeling blei2003latent (); ranganath2015deep (), generative models vaes_welling (); fabius2014variational (); rezende2014stochastic (), reinforcement learning furmston2010variational (), and parsing liang2007infinite (), among others. Recently, VI has been able to address a wider range of problems by adopting a "black box" overdispersed_blei () view based on only evaluating the value or gradient of the target distribution. The target can then be optimized via stochastic gradient descent. It is desirable to reduce the variance of the gradient estimate, since this governs convergence. Control variates (CVs), a classical technique from statistics, are often used to accomplish this.

This paper investigates how to use many CVs in concert. We present a systematic view of existing CVs, which starts by splitting the exact gradient into four terms (Eq. 2). A CV is then obtained by applying a generic "recipe": pick a term, possibly approximate it, and take the difference of two estimators of it (Fig. 2). This suggests many possible CVs, including some that appear not to have been used before.

With many possible CVs, one can naturally ask how to use many together. In principle, the optimal combination is well known (Eq. 6). However, this requires unknown (intractable) expectations. We address this using decision theory. The goal is a "decision rule" that takes a minibatch of evaluations together with the set of CVs to be used, and returns a gradient estimate. We adopt a Bayesian risk measuring how gradient variance impacts convergence rates of stochastic optimization, with a simple prior over gradients and sets of CVs. A simple optimal decision rule emerges, in which the intractable expectations are replaced with "regularized" empirical estimates (Thm. 4.1). To share information across iterations, we suggest combining this Bayesian approach with exponential averaging, using an "effective" minibatch size.

We demonstrate practicality on logistic regression problems, where careful combination of many CVs improves performance. For all learning rates, convergence is improved over any single CV.

### 1.1 Contributions

The contribution of this work is twofold. First, in Section 3, we propose a systematic view of how to generate many existing control variates. Second, we propose an algorithm to use multiple control variates simultaneously, described in Section 4. As shown in Section 5, combining these two ideas results in gradients with low variance that allow the use of larger learning rates, while retaining convergence.

## 2 Preliminaries

Variational Inference (VI) transforms an inference problem into an optimization problem by decomposing the log marginal likelihood of the observed data $x$, with latent variables $Z$, as

\[
\log p(x) = \underbrace{\mathbb{E}_{Z \sim q_w(Z)}\left[\log \frac{p(Z,x)}{q_w(Z)}\right]}_{\text{ELBO}(w)} + \underbrace{KL\big(q_w(Z)\,\|\,p(Z|x)\big)}_{\text{KL-divergence}}.
\]

Here, the variational distribution $q_w(Z)$ is used to approximate the true posterior distribution $p(Z|x)$. VI's goal is to find the parameters $w$ that minimize the KL-divergence between $q_w(Z)$ and the true posterior. Since $\log p(x)$ does not depend on $w$, minimizing the KL-divergence is equivalent to maximizing the ELBO (Evidence Lower BOund).

Historically, models and variational families were chosen so that the required expectations were simple enough to allow closed-form updates of $w$ blei2017variational (); blei2003latent (); variationalmsgpassing (). However, for more complex models, closed-form expressions are usually not available, which has led to widespread use of stochastic optimization methods hoffman2013stochastic (); neuralVI_minh (); viasstochastic_jordan (); blackbox_blei (); doublystochastic_titsias (). These require approximating the target's gradient

\[
g(w) = \nabla_w \text{ELBO}(w) = \nabla_w \mathbb{E}_{Z \sim q_w(Z)}\left[\log p(Z,x) - \log q_w(Z)\right]. \tag{1}
\]

Good gradient estimates play an important role, since high variance negatively impacts convergence and optimization speed. Several methods have been developed to improve gradient estimates, including Rao-Blackwellization blackbox_blei (), control variates backpropvoid (); traylorreducevariance_adam (); neuralVI_minh (); viasstochastic_jordan (); blackbox_blei (); REBAR (); varianceredstochastic_wang (); zhang2017advances (), closed-form solutions for certain expectations localexpectations_gredilla (), discarding terms stickingthelanding (), and alternative estimators.

### 2.1 Control variates

A control variate (CV) is a random variable with expectation zero that is added to another random variable in the hope of reducing variance. Let $X$ be a random variable with unknown mean, and let $C$ be a random variable with mean zero. Then, for any scalar $a$, $X + aC$ has the same expectation as $X$ but (usually) different variance. A standard result from statistics is that the value of $a$ that minimizes the variance of $X + aC$ is $a^* = -\operatorname{Cov}(X,C)/\operatorname{Var}(C)$, for which $\operatorname{Var}(X + a^* C) = (1 - \rho_{X,C}^2)\operatorname{Var}(X)$, where $\rho_{X,C}$ is the correlation of $X$ and $C$. Thus, a good control variate for $X$ is a random variable that is highly correlated with $X$.
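As a toy numerical illustration (a NumPy sketch with an invented example, not from the paper), the optimal weight can be estimated from samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: estimate E[exp(Z)] with Z ~ N(0,1); use C = Z as a
# zero-mean control variate correlated with X = exp(Z).
Z = rng.normal(size=100_000)
X = np.exp(Z)
C = Z

# Optimal scalar weight: a* = -Cov(X, C) / Var(C)
a = -np.cov(X, C)[0, 1] / np.var(C)
X_cv = X + a * C

print(np.var(X_cv), np.var(X))  # the corrected estimator has lower variance
```

Here the correlation between $X$ and $C$ is strong, so the variance drops substantially while the mean is unchanged.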

## 3 Systematic generation of control variates

This section gives a generic recipe for creating control variates (Fig. 2) and reviews how existing control variates are instances of it (see also the appendix). We begin by splitting the ELBO gradient into four terms as

\[
g(w) = \underbrace{\nabla_w \mathbb{E}_{q_w} \log p(x|Z)}_{g_1(w):\ \text{Data term}} + \underbrace{\nabla_w \mathbb{E}_{q_w} \log p(Z)}_{g_2(w):\ \text{Prior term}} - \underbrace{\big.\nabla_w \mathbb{E}_{q_w} \log q_v(Z)\big|_{v=w}}_{g_3(w):\ \text{Variational term}} - \underbrace{\big.\nabla_w \mathbb{E}_{q_v} \log q_w(Z)\big|_{v=w}}_{g_4(w):\ \text{Score term}}. \tag{2}
\]

The first three terms all correspond to the influence of $w$ on the expectation of some function independent of $w$. Control variates for these terms, and for any combination of them, are discussed in Secs. 3.1-3.2. The score term, discussed in Sec. 3.3, is different, since the function inside the expectation depends on $w$. (Roeder et al. stickingthelanding () give a related decomposition, albeit specifically for reparameterization estimators.)

### 3.1 Control Variates from Pairs of estimators

The basic technique for deriving CVs is to take the difference between a pair of unbiased estimators of a general term $t(w)$ (any of $g_1$, $g_2$, $g_3$, or a combination of them); the difference must therefore have expectation zero. The terms $g_1$, $g_2$, and $g_3$ are each the expectation (over $q_w$) of some function independent of $w$. (For $g_1$, $g_2$, and $g_3$, use $f(z) = \log p(x|z)$, $f(z) = \log p(z)$, and $f_v(z) = \log q_v(z)$, respectively.) Thus, a general term $t(w)$ can be written as

\[
t(w) = \nabla_w \mathbb{E}_{q_w(Z)}[f(Z)] \quad\text{or}\quad \left(\nabla_w \mathbb{E}_{q_w(Z)}[f_v(Z)]\right)\Big|_{v=w}.
\]

Many methods exist to estimate gradients of this type. Mathematically, we think of these estimators as random variables (with a corresponding generation algorithm). A few are summarized in Eq. 3 (dropping the dependence of $f$ on $v$ for simplicity). Writing $t_{\text{est}}(w)$ for the quantity estimated by method "est", we have

\[
\begin{aligned}
t_{\text{SF}}(w) &= \mathbb{E}_{Z \sim q_w}\left[f(Z)\,\nabla_w \log q_w(Z)\right], \\
t_{\text{RP}}(w) &= \mathbb{E}_{\epsilon \sim q_0}\left[\nabla_w f(T_w(\epsilon))\right], \\
t_{\text{GR}}(w) &= \mathbb{E}_{\epsilon \sim q_w'}\left[\nabla_w f(T_w(\epsilon)) + f(T_w(\epsilon))\,\nabla_w \log q_w'(\epsilon)\right], \\
t_{\text{CF}}(w) &= \nabla_w \mathbb{E}_{q_w}[f(Z)] \ \text{evaluated in closed form.}
\end{aligned} \tag{3}
\]

Score function (SF) estimation, or REINFORCE williams_reinforce (), uses the equality $\nabla_w \mathbb{E}_{q_w}[f(Z)] = \mathbb{E}_{q_w}[f(Z)\,\nabla_w \log q_w(Z)]$ neuralVI_minh (); blackbox_blei (). This gives $t_{\text{SF}}(w)$ as in Eq. 3. Unbiased estimates of the gradient can be obtained by Monte Carlo sampling, with samples drawn from $q_w$.

Reparameterization (RP) estimators vaes_welling (); traylorreducevariance_adam (); doublystochastic_titsias () are based on splitting the procedure for sampling from $q_w$ into sampling and transformation steps: first, sample $\epsilon \sim q_0$; second, transform $Z = T_w(\epsilon)$. Here, $q_0$ is a fixed distribution (independent of $w$) and $T_w$ is a deterministic transformation. When sampling is done this way, it follows that $\nabla_w \mathbb{E}_{q_w}[f(Z)] = \mathbb{E}_{\epsilon \sim q_0}[\nabla_w f(T_w(\epsilon))]$, rendering the distribution over which the expectation is taken independent of $w$. The general term can therefore be written as $t_{\text{RP}}(w)$ in Eq. 3. The multivariate Gaussian distribution illustrates this: a sample $Z \sim \mathcal{N}(\mu, \Sigma)$ can be generated by drawing $\epsilon \sim \mathcal{N}(0, I)$ and setting $Z = \mu + A\epsilon$, where $A$ is a matrix such that $AA^T = \Sigma$.
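To make the difference concrete, the following toy NumPy sketch (not from the paper) compares SF and RP estimates of $\nabla_w \mathbb{E}_{Z \sim \mathcal{N}(w,1)}[Z^2] = 2w$:

```python
import numpy as np

rng = np.random.default_rng(0)
w, n = 1.0, 50_000
eps = rng.normal(size=n)
z = w + eps                 # reparameterized samples from q_w = N(w, 1)

# Score-function estimator: f(z) * d/dw log q_w(z) = z^2 * (z - w)
sf = z**2 * (z - w)
# Reparameterization estimator: d/dw f(T_w(eps)) with T_w(eps) = w + eps
rp = 2 * (w + eps)

# Both are unbiased for d/dw E[Z^2] = 2w = 2, but RP has much lower variance.
print(sf.mean(), rp.mean(), sf.var(), rp.var())
```

Both sample means approach 2, but the per-sample variance of the RP estimator is far smaller, which is the usual motivation for preferring it when applicable.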

Multiple reparameterizations are typically possible. For example, the above estimator for the multivariate Gaussian is valid with any $A$ such that $AA^T = \Sigma$. For instance, $A$ could be the lower-triangular matrix obtained via the Cholesky factorization of $\Sigma$ challis2011concave (); doublystochastic_titsias (). (Often, entries of $w$ directly specify entries of the Cholesky factor, obviating the need to explicitly compute it.) Another option is the matrix square root of $\Sigma$ marlin2016sqrt (). All valid reparameterizations give unbiased gradients, but with different statistical properties.

Generalized reparameterization (GR) is intended for distributions where standard reparameterization is not applicable, e.g. the gamma or beta generalreparam_blei (). Take a transformation $T_w$ and a base distribution $q_w'(\epsilon)$ (both dependent on $w$) such that $T_w(\epsilon)$ with $\epsilon \sim q_w'$ is distributed identically to $Z \sim q_w$. Then, $t(w) = \nabla_w \mathbb{E}_{\epsilon \sim q_w'}[f(T_w(\epsilon))]$. The dependence of this expectation on $w$ is mediated partially through $w$'s influence on $T_w$ and partially through its influence on $q_w'$. This leads to the representation $t_{\text{GR}}(w)$ in Eq. 3, which essentially has a score function-like term and a reparameterization-like term, corresponding to $w$'s influence on $q_w'$ and $T_w$, respectively.

Closed-form (CF) expressions are sometimes available for general terms involving $g_2$ and $g_3$, but rarely for $g_1$. This is because a closed-form expression requires $f$ and $q_w$ to be simple enough, which is rarely the case for the data term $g_1$; that term is usually estimated with one of the methods described above traylorreducevariance_adam (); viasstochastic_jordan (); blackbox_blei (); generalreparam_blei (). However, there are some cases in which $g_1$ can be computed exactly challis2011concave ().

Data subsampling is often applied to the data term kingma2015variational (). If the likelihood treats the data as i.i.d., then $\log p(x|Z)$ can be approximated without bias by evaluating it on a random minibatch $B$ of data and rescaling. Writing $f_B$ for that estimate, an equivalent representation of the data term is $g_1(w) = \mathbb{E}_B\, \nabla_w \mathbb{E}_{q_w}[f_B(Z)]$, where $B$ is uniform over subsets of the data. Thus, one can define an unbiased estimator by using one of the techniques above (to cope with the expectation over $q_w$) on a random minibatch (to cope with the expectation over $B$). With large datasets this can be much faster, but subsampling acts as an additional source of variance.
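As a toy illustration of the rescaled-minibatch idea (the quadratic stand-in for the log-likelihood and all names here are invented for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=1000)

def full_term(z):
    # Stand-in for the full-data term sum_i log p(x_i | z)
    return np.sum(-(data - z) ** 2)

def minibatch_term(z, batch_size=10):
    # Unbiased estimate: evaluate on a random subset and rescale by N / |B|
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return len(data) / batch_size * np.sum(-(data[idx] - z) ** 2)

estimates = [minibatch_term(0.0) for _ in range(2000)]
print(np.mean(estimates), full_term(0.0))  # close: the estimator is unbiased
```

Each minibatch evaluation is roughly 100x cheaper here, at the cost of added variance, which is exactly what the subsampling control variates of Sec. 3.2 target.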

### 3.2 Control Variates from approximations

The previous section used that the difference of two unbiased estimators of a term has expectation zero, and so is a control variate. Another class of control variates uses the insight that if a general term is replaced with an approximation, the difference between two estimators of the (approximate) general term still produces a valid control variate. The motivation is that approximations might allow the use of high-quality estimators (e.g. a closed-form) not otherwise available.

Fundamentally, the randomness in the above estimators is due to two types of sampling. First, expectations over are approximated by sampling, introducing "distributional sampling error". Second, with large data, the data term can be approximated by drawing a minibatch, introducing "data subsampling error". Approximations to terms have been devised so that expectations (either over or the full dataset) can be efficiently computed.

Correcting for distributional sampling: Here, the goal is to approximate $f$ with some function $\tilde f$ so as to make the resulting term easier to estimate – typically so that it admits a closed-form solution. Paisley et al. viasstochastic_jordan () approximate the data term using either a Taylor approximation or a bound, and define a control variate as the difference between the approximate term computed exactly and its estimate obtained with the score function method. This greatly reduces the variance of their gradient estimate, which is itself obtained with the score function method. Miller et al. traylorreducevariance_adam () also use a Taylor approximation of the data term, but take the difference between the approximate term computed exactly and its estimate obtained via reparameterization. They use this control variate together with a base gradient estimate obtained via reparameterization.

Correcting for data subsampling: As discussed in Sec. 3.1, it is common with large datasets to define estimators for the data term that only evaluate the likelihood on random subsets of data. To reduce the variance introduced by this subsampling, Wang et al. varianceredstochastic_wang () propose to approximate the log-likelihood with a Taylor expansion, leading to an approximate data term. For some models the inner expectation (over minibatches) can be computed efficiently by caching the first- and second-order empirical moments of the data. Since the outer expectation (over $q_w$) usually remains intractable, a final control variate is obtained by applying one of the estimation methods described in Sec. 3.1 (SF, RP, etc.) to both the full-data and subsampled versions of the approximate term and taking the difference.

Both correction mechanisms described above are particular cases of the proposed framework shown in Fig. 2, which also includes other control variates based on approximations. First, the framework imposes no restrictions on the approximation used; it also covers, for instance, approximations of the distribution $q_w$ instead of the function $f$. Second, it includes control variates based on the difference of two estimates of an approximate general term, with neither estimate being closed form. These two ideas are used in the control variate introduced by Tucker et al. REBAR (), who use a continuous relaxation gumbel1 (); gumbel2 () to approximate the (discrete) distribution, and construct a control variate by taking the difference between the SF and RP estimates of the resulting relaxed term. Following a similar idea, Grathwohl et al. backpropvoid () use a neural network as a surrogate for $f$, and use as control variate the difference between the SF and RP estimates of the term involving the surrogate.

### 3.3 Control variate from the score term (g4)

It is easy to show that the score term is always zero, i.e. $g_4(w) = 0$ (proof in appendix). Thus, it does not need to be estimated. However, since the score $\nabla_w \log q_w(Z)$ has expectation zero, it can be used as a naive control variate blackbox_blei (); stickingthelanding ().
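This zero-mean property is easy to verify numerically (a one-dimensional Gaussian sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.5
z = rng.normal(loc=w, scale=1.0, size=100_000)

# Score of q_w = N(w, 1): d/dw log q_w(z) = z - w. Its expectation under
# q_w is zero, so it is a valid (if naive) control variate.
score = z - w
print(score.mean())  # near zero
```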

## 4 Combining multiple control variates

In order to use control variates we need a base gradient estimator $h(w)$ and a set of control variates $c_1(w), \dots, c_L(w)$ with which to reduce the base gradient's variance. We multiply each control variate by a scalar weight $a_i$ to get the estimator

\[
\hat g(w) = h(w) + \sum_{i=1}^{L} a_i\, c_i(w). \tag{4}
\]

Defining $a$ as the vector of weights and $C(w)$ as the matrix with $c_i(w)$ as its $i$-th column, $\hat g$ can be equivalently expressed as

\[
\hat g(w) = h(w) + C(w)\,a. \tag{5}
\]

The goal is to find $a$ such that the final gradient estimate $\hat g$ has low variance. This follows from theoretical results on stochastic optimization with a first-order unbiased gradient oracle, which indicate that convergence is governed by the expected squared norm of the gradient oracle agarwal2012lower (), equivalent (up to a constant) to the trace of the variance. In particular, when the CVs are all differences between unbiased estimators of different terms, finding the optimal $a$ is equivalent to finding the best affine combination of the estimators. (Intuitively, given two estimators, if one is used as the base estimator and the difference as a CV, then finding the best weight for that CV is equivalent to finding the best mixture of the estimators.)

**Lemma.** Let $h$ be a random vector and $C$ a matrix of random variables such that each element has mean zero. For $a \in \mathbb{R}^L$, define $\hat g = h + Ca$. The value of $a$ that minimizes $\mathbb{E}\|\hat g\|^2$ for a given $w$ is

\[
a^*(w) = -\mathbb{E}_{p(C,h|w)}\left[C^T C\right]^{-1} \mathbb{E}\left[C^T h\right]. \tag{6}
\]

Variants of this result are known varianceredstochastic_wang (). Of course, this requires the expectations $\mathbb{E}[C^T C]$ and $\mathbb{E}[C^T h]$, which are usually not available in closed form. One solution is, given some observed values $(C_1, h_1), \dots, (C_M, h_M)$, to estimate $a^*$ using empirical expectations in place of the true ones. However, this approach does not account for how errors in the estimates of these expectations affect $a$ and therefore the final variance of $\hat g$.
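A plug-in version of Eq. 6 is simple to implement. The following NumPy sketch (array shapes and toy data are assumptions, not the paper's code) estimates the weights from samples:

```python
import numpy as np

def optimal_weights(H, Cs):
    """Plug-in version of Eq. 6: a = -mean(C^T C)^{-1} mean(C^T h),
    from M sampled gradients H (M x d) and CV matrices Cs (M x d x L)."""
    CtC = np.mean([C.T @ C for C in Cs], axis=0)             # (L, L)
    Cth = np.mean([C.T @ h for C, h in zip(Cs, H)], axis=0)  # (L,)
    return -np.linalg.solve(CtC, Cth)

# Toy check: if the first CV equals the gradient's noise exactly,
# its optimal weight should be close to -1; an unrelated CV gets ~0.
rng = np.random.default_rng(0)
M, d = 20_000, 3
noise = rng.normal(size=(M, d))
other = rng.normal(size=(M, d))
H = 1.0 + noise
Cs = np.stack([noise, other], axis=2)   # (M, d, 2)
a = optimal_weights(H, Cs)
print(a)
```

With few samples $M$, the empirical moments (and hence the solve) become noisy, which is the failure mode the Bayesian regularization of Sec. 4.1 addresses.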

### 4.1 Bayesian regularization

We deal with this problem from a "risk minimization" perspective. We imagine that the joint distribution over $C$ and $h$ is governed by some (unknown) parameter vector $\theta$. Then, we can define the loss for selecting the vector of weights $a$ when the true parameter vector is $\theta$ as

\[
L(a, \theta) = \mathbb{E}_{C,h|\theta}\,\|h + Ca\|^2.
\]

We seek a "decision rule"

\[
\alpha(C_1, h_1, \dots, C_M, h_M)
\]

that takes as input a "minibatch" of $M$ evaluations of $C$ and $h$ and returns a weight vector $a$. Then, for a pre-specified probabilistic model $p(\theta)\,p(C,h|\theta)$, we can define the Bayesian regret as

\[
\text{BayesRegret}(\alpha) = \mathbb{E}_{\theta}\, \mathbb{E}_{C_1,h_1,\dots,C_M,h_M|\theta}\left[L\big(\alpha(C_1,h_1,\dots,C_M,h_M),\, \theta\big)\right].
\]

The following theorem shows that if we model $(C, h)$ jointly as a Gaussian with canonical parameters $\theta = (\eta, \Lambda)$, and use a Normal-Wishart prior for $\theta$, then the decision rule minimizing the Bayesian risk ends up being similar to Eq. 6, with two modifications. First, the unknown expectations are replaced with empirical expectations. Second, the empirical expectation of $C^T C$ is "regularized" by a term determined by the prior. For simplicity, the following result is stated assuming that the Normal-Wishart prior uses a scale matrix $V_0$ that is a constant $v_0$ times the identity. However, in the appendix we state (and prove) a more general result where $V_0$ is arbitrary. This can also be implemented efficiently, although the result is clumsier to state.

**Theorem 4.1.** If $p(C,h|\theta)$ is a Gaussian parameterized as

\[
p\big(C, h \mid \theta = (\eta, \Lambda)\big) = \mathcal{N}\!\left([\operatorname{vec}(C), h] \;\middle|\; \mu = \Lambda^{-1}\eta,\ \Sigma = \Lambda^{-1}\right),
\]

and the prior $p(\theta)$ is a Normal-Wishart with scale matrix $v_0 I$, then the decision rule that minimizes the Bayesian regret for $M$ observations is

\[
\alpha^*(C_1, h_1, \dots, C_M, h_M) = -\left(\frac{d\,v_0}{M} I + \overline{C^T C}\right)^{-1} \overline{C^T h}, \tag{7}
\]

where $\overline{C^T C} = \frac{1}{M}\sum_{m=1}^{M} C_m^T C_m$, $\overline{C^T h} = \frac{1}{M}\sum_{m=1}^{M} C_m^T h_m$, and $d$ is the dimensionality of $h$.

The proof idea is as follows: Since the loss is the expected squared norm, the optimal decision rule can be reduced to a form similar to Eq. 6, but with the expectations replaced by posterior expectations conditioned on the observations $(C_m, h_m)$. For exponential families with conjugate priors (e.g. the Gaussian with a Normal-Wishart prior), the posterior expectation of sufficient statistics given observations has a simple closed-form solution mjexponentialfamily (). The sufficient statistics for the Gaussian are the first and second joint moments of $[\operatorname{vec}(C), h]$, from which the expectations needed for the optimal decision rule can be extracted.

The rule in Eq. 7 is surprisingly simple: just compute the empirical averages and add a diagonal regularizer before solving the linear system. A large $M$ provides better estimates of the expectations and thus reduces the amount of "regularization" applied, while a small $M$ provides worse estimates, which are regularized more heavily.
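The rule of Eq. 7 can be sketched in a few lines; the array shapes and toy data below are illustrative assumptions:

```python
import numpy as np

def regularized_weights(H, Cs, v0):
    """Sketch of the decision rule of Eq. 7: add the diagonal regularizer
    (d * v0 / M) I to the empirical second moment before solving.
    Shapes assumed here: H is (M, d), Cs is (M, d, L)."""
    M, d = H.shape
    L = Cs.shape[2]
    CtC = np.mean([C.T @ C for C in Cs], axis=0)
    Cth = np.mean([C.T @ h for C, h in zip(Cs, H)], axis=0)
    return -np.linalg.solve((d * v0 / M) * np.eye(L) + CtC, Cth)

# With a huge v0 the weights shrink toward zero (heavy regularization);
# with v0 = 0 we recover the unregularized plug-in rule.
rng = np.random.default_rng(0)
M, d = 50, 3
noise = rng.normal(size=(M, d))
H = 1.0 + noise
Cs = np.stack([noise, rng.normal(size=(M, d))], axis=2)
a_plain = regularized_weights(H, Cs, v0=0.0)
a_heavy = regularized_weights(H, Cs, v0=1e9)
print(a_plain, a_heavy)
```

The regularizer interpolates between the plug-in rule ($v_0 \to 0$) and simply using the base gradient alone ($v_0 \to \infty$, all weights zero).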

### 4.2 Empirical Averages

The probabilistic model described above does not explicitly mention the parameters $w$. One way to use it would be to apply it separately in each iteration. It is desirable, however, to exploit the fact that the parameters change slowly during learning. Algorithmically, the procedure above requires as input only empirical expectations of $C^T C$ and $C^T h$. Instead of using samples from a single step alone, we propose using an exponential average. At every step we compute a weighted average of the previous empirical expectation and the current one, resulting in the update rule $\bar E_t = \gamma \bar E_{t-1} + (1-\gamma)\hat E_t$, where $E$ represents either $C^T C$ or $C^T h$, and $\hat E_t$ is the empirical average obtained using the samples drawn at step $t$. To combine this with the Bayesian regularization procedure, we use an "effective $M$", $M^{\text{eff}}_t = B + \gamma M^{\text{eff}}_{t-1}$, which indicates how many samples are effectively included in the empirical averages, where $B$ is the minibatch size. $M^{\text{eff}}_t$ is used instead of $M$ in Eq. 7. Technically, the regularization procedure assumes that the samples used for the empirical expectations are independent of those used for the final gradient estimate $\hat g$. To reflect this, we compute $\hat g$ at step $t$ using the empirical average from step $t-1$, $\bar E_{t-1}$.
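The averaging scheme can be sketched as follows (a minimal sketch: the class name, the decay factor, the array shapes, and the recursion for the effective sample count are assumptions consistent with the description above):

```python
import numpy as np

class RunningMoments:
    """Exponentially averaged estimates of E[C^T C] and E[C^T h], with an
    'effective M' to feed the regularizer of Eq. 7. A sketch, not the
    paper's exact code."""

    def __init__(self, L, gamma=0.9):
        self.gamma = gamma
        self.CtC = np.zeros((L, L))
        self.Cth = np.zeros(L)
        self.m_eff = 0.0

    def update(self, C_batch, h_batch):
        # C_batch: (B, d, L) CV matrices; h_batch: (B, d) base gradients.
        B = h_batch.shape[0]
        CtC_hat = np.mean([C.T @ C for C in C_batch], axis=0)
        Cth_hat = np.mean([C.T @ h for C, h in zip(C_batch, h_batch)], axis=0)
        g = self.gamma
        self.CtC = g * self.CtC + (1 - g) * CtC_hat
        self.Cth = g * self.Cth + (1 - g) * Cth_hat
        # Effective sample count: B + gamma*B + gamma^2*B + ...
        self.m_eff = B + g * self.m_eff
        return self.m_eff
```

In steady state the effective count approaches $B/(1-\gamma)$, e.g. 100 for $B = 10$ and $\gamma = 0.9$.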

## 5 Experiments and Results

We tried several control variates and the combination algorithm on a Bayesian binary logistic regression model with a standard Gaussian prior, using three well-known datasets: ionosphere, australian, and sonar. We use simple SGD with momentum as our optimization algorithm, minibatches of size 10, an exponential decay factor $\gamma$ for the empirical averages, and a value of $v_0$ based on the sensitivity analysis carried out in Sec. 5.1. We chose a full-covariance Gaussian as the variational distribution, parameterized by the mean and a Cholesky factorization of the covariance. Since both the prior and the variational distribution are Gaussian, the prior and variational terms can be computed in closed form.

As base gradient we use what seems to be the most common estimator: reparameterization to estimate the data term $g_1$ (with the local reparameterization trick kingma2015variational ()) and the prior term $g_2$, and a closed-form expression for the variational/entropy term $g_3$. We consider two reparameterization estimators: one based on the Cholesky factor of the covariance, and one based on its matrix square root marlin2016sqrt (). For CVs, we chose the following seven, which provide reasonable coverage of the methods described in Section 3:

• The difference between the reparameterization and closed-form estimates of the variational term $g_3$.

• The difference between the reparameterization and closed-form estimates of the prior term $g_2$.

• The difference between the Cholesky-based and square-root-based reparameterization estimates of the prior term $g_2$.

• The difference between the Cholesky-based and square-root-based reparameterization estimates of the data term $g_1$.

• A Taylor expansion of the Cholesky-based reparameterization estimate of the data term, correcting for data subsampling varianceredstochastic_wang ().

• A Taylor expansion of the square-root-based reparameterization estimate of the data term, correcting for data subsampling varianceredstochastic_wang ().

• A Taylor expansion of the data term, correcting for sampling from $q_w$. This control variate is based on the work of Miller et al. traylorreducevariance_adam (), but adapted to a full-covariance (rather than diagonal) Gaussian (see appendix).

We compare the optimization results obtained using the base gradient alone and the base gradient combined with different subsets of CVs, chosen following a simple approach: we tried each CV in isolation and took the four worst-performing ones as one subset, the five worst-performing ones as another, and so on, up to the full set of seven. We also show results for the two best control variates used in isolation. All results shown in this section, figures and tables, were obtained by averaging over 50 runs.

Best learning rate. Table 1 shows the ELBO value achieved after 500 iterations, with the largest learning rate for which optimization converged with at least one estimator. (The loss is normalized by the number of samples in the dataset; if it were not, the equivalent learning rates would be smaller.) It can be seen that increasing the number of CVs often leads to higher final values of the ELBO and that, in all cases, the highest (best) ELBOs were achieved by using all CVs together.

Comparing across learning rates. Now we compare the performance achieved by each gradient estimator across different learning rates, using two sets of plots. First, the two leftmost columns of Fig. 3 show, for each dataset, the ELBO vs. iterations for two different learning rates, while the third column shows, for each gradient estimator, the ELBO at the best learning rate vs. iteration. As in Table 1, for a given learning rate (or when choosing the best at each iteration) the gradients that combine more control variates are better suited for optimization and display strictly dominant performance.

Finally, Fig. 4 shows, for several gradients, the final ELBO (after 500 iterations) vs. learning rate used, providing a systematic comparison of how the gradient estimates perform with different learning rates. Again, estimates employing more CVs display a dominant performance, with larger improvements at larger learning rates. Furthermore, the “best” learning rate increases with better estimators.

### 5.1 Sensitivity analysis

It is natural to ask how the variance of the gradient estimate is related to the choice of the prior parameter $v_0$ and the number of samples $M$. Recall from Thm. 4.1 that a larger value of $v_0$ corresponds to a more concentrated prior, and is thus a more conservative choice – essentially it results in more "regularization" of the empirical moments. To answer this we carried out a simple experiment, where we fix $w$ and estimate the variance of $\hat g$ for a variety of $v_0$ and $M$. To choose $w$, we applied SGD with a low-variance gradient (computed with many samples), a small learning rate, and the same initialization as in the previous section, and selected the parameters found after 25 iterations. This is intended to be "typical", in that it is neither at the start nor the end of optimization.

Estimating $\hat g$ is a three-step process: (1) use one set of evaluations of $C$ and $h$ to estimate $\overline{C^T C}$ and $\overline{C^T h}$; (2) apply the prior to compute the weights $a$ from those estimates (Eq. 7); (3) use a second set of evaluations of $C$ and $h$ to compute $\hat g$ using the weights $a$ (Eqs. 4/5).

In a first experiment, we tested exactly that procedure, drawing two independent sets of evaluations of $C$ and $h$ at the current weights $w$. Results are shown in Figure 5. We found a small artificial dataset illustrative. For this dataset, with small minibatches, a fairly large value of $v_0$ provided the best results. However, with ionosphere, even a very small $v_0$ tended to perform well.

For efficiency, our logistic regression experiments used exponential averaging over previous iterations to estimate $\overline{C^T C}$ and $\overline{C^T h}$, rather than drawing two sets of evaluations at each iteration. So, even with a large value of $M$, these estimates are not fully reliable. To roughly simulate this, we performed a second "lagged" experiment, estimating $\overline{C^T C}$ and $\overline{C^T h}$ from evaluations of $C$ and $h$ at the weights from 10 iterations earlier during SGD. (This lag was chosen considering the "average age" of gradients under exponential averaging, and the fact that the learning rate is relatively small.) The results are shown on the right of Fig. 5. Lagged evaluations result in stochastic gradients with more variance, with a different dependence on $v_0$. (Note, however, that the gradient remains unbiased, lagging is cheaper, and all estimators have a variance decreasing in $M$.)

We emphasize that several somewhat arbitrary decisions were made for these experiments, such as the learning rate, the choice of iteration, and the amount of "lag". However, we believe the results illustrate an important phenomenon related to regularization: when using past gradient information (as exponential averaging does), larger values of $v_0$ are beneficial and result in gradients with lower variance. While intuitively plausible, this benefit of regularization for countering errors introduced by old gradients is not really captured by our theoretical analysis in Section 4, which is entirely based on "single-iteration" reasoning.

## 6 Conclusion

This work focuses on how to obtain low-variance gradients given a fixed set of control variates. We first present a unified view that attempts to explain how most control variates used for variational inference are derived, which sheds light on the large number of CVs available. We then propose a combination algorithm to use multiple control variates in concert. We show experimentally that, given a set of control variates, the combination algorithm provides a simple and effective combination rule that leads to gradients with less variance than those obtained using a reduced number of CVs (or no CVs at all). The algorithm assumes that a fixed set of control variates is given, and minimizes the final gradient's variance using all of them, without analyzing how favorable using all the CVs actually is. A "smarter" algorithm could, for instance, decide whether to use all the given CVs or just a subset. We leave the development of such an algorithm for future work.

## References

• (1) Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Information Theory, 58(5):3235–3249, 2012.
• (2) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
• (3) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
• (4) Edward Challis and David Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 199–207, 2011.
• (5) Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.
• (6) Thomas Furmston and David Barber. Variational methods for reinforcement learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 241–248, 2010.
• (7) Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
• (8) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
• (9) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
• (10) Michael I. Jordan. The exponential family: Conjugate priors.
• (11) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
• (12) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
• (13) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
• (14) Steven Cheng-Xian Li and Benjamin M. Marlin. A scalable end-to-end gaussian process adapter for irregularly sampled time series classification. In Advances in Neural Information Processing Systems, pages 1804–1812, 2016.
• (15) Percy Liang, Slav Petrov, Michael Jordan, and Dan Klein. The infinite pcfg using hierarchical dirichlet processes. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
• (16) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
• (17) Andrew C Miller, Nicholas J Foti, Alexander D’Amour, and Ryan P Adams. Reducing reparameterization gradient variance. arXiv preprint arXiv:1705.07880, 2017.
• (18) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
• (19) John Paisley, David Blei, and Michael Jordan. Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
• (20) Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.
• (21) Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. Deep exponential families. In Artificial Intelligence and Statistics, pages 762–771, 2015.
• (22) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
• (23) Geoffrey Roeder, Yuhuai Wu, and David Duvenaud. Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.
• (24) Francisco Ruiz, Michalis Titsias, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.
• (25) Francisco JR Ruiz, Michalis K Titsias, and David M Blei. Overdispersed black-box variational inference. arXiv preprint arXiv:1603.01140, 2016.
• (26) Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979, 2014.
• (27) Michalis Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. In Advances in neural information processing systems, pages 2638–2646, 2015.
• (28) George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636, 2017.
• (29) Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
• (30) Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.
• (31) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
• (32) John Winn and Christopher M Bishop. Variational message passing. Journal of Machine Learning Research, 6(Apr):661–694, 2005.
• (33) Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. arXiv preprint arXiv:1711.05597, 2017.

## Appendix

### .1 Proof of Lemmas

###### Proof.
$$\mathbb{E}_{q_w(z)}[\nabla_w \log q_w(Z)] = \mathbb{E}_{q_w(z)}\left[\frac{\nabla_w q_w(Z)}{q_w(Z)}\right] = \int q_w(z)\frac{\nabla_w q_w(z)}{q_w(z)}\,dz = \nabla_w \int q_w(z)\,dz = \nabla_w 1 = 0$$
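The identity above can be checked numerically. The sketch below (ours, not from the paper's code) verifies by quadrature that the score of a one-dimensional Gaussian $q_w = \mathcal{N}(\mu, \sigma^2)$, with $w = \mu$, integrates to zero under $q_w$:

```python
import numpy as np

# Score-function identity E_{q_w}[grad_w log q_w(Z)] = 0, checked by numerical
# quadrature for a 1-D Gaussian q_w = N(mu, sigma^2), taking w = mu.
mu, sigma = 0.7, 1.3
z = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)  # grid symmetric about mu
q = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
score_mu = (z - mu) / sigma ** 2            # d/dmu log q(z)
integral = np.sum(q * score_mu) * (z[1] - z[0])

assert abs(integral) < 1e-8                 # zero up to quadrature error
```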

###### Lemma (optimal weights, restated).

For $\hat{g} = h + Ca$, the weights minimizing $\mathbb{E}[\hat{g}^{\top}\hat{g}]$ are $a^* = -\mathbb{E}[C^{\top}C]^{-1}\mathbb{E}[C^{\top}h]$.

###### Proof.
$$\mathbb{E}[\hat{g}^{\top}\hat{g}] = \mathbb{E}\left[(h+Ca)^{\top}(h+Ca)\right] = \mathbb{E}\left[h^{\top}h + 2h^{\top}Ca + a^{\top}C^{\top}Ca\right] = \mathbb{E}[h^{\top}h] + 2\,\mathbb{E}[h^{\top}C]\,a + a^{\top}\mathbb{E}[C^{\top}C]\,a$$

Differentiating with respect to $a$ and setting the result equal to zero gives

$$2\,\mathbb{E}[C^{\top}h] + 2\,\mathbb{E}[C^{\top}C]\,a = 0 \;\Longrightarrow\; a^* = -\mathbb{E}[C^{\top}C]^{-1}\,\mathbb{E}[C^{\top}h]$$
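As a numerical sanity check (a sketch of ours, with made-up simulated data; all variable names are hypothetical), one can verify that $a^* = -\mathbb{E}[C^{\top}C]^{-1}\mathbb{E}[C^{\top}h]$ minimizes the empirical expected squared norm of $\hat{g} = h + Ca$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, M = 5, 3, 20_000   # gradient dimension, number of CVs, Monte Carlo samples

# Simulated zero-mean gradients h and control variates correlated with them.
h = rng.normal(size=(M, d))
C = 0.8 * h[:, :, None] + 0.5 * rng.normal(size=(M, d, L))  # C[m] is d x L

def expected_sq_norm(a):
    """Empirical E[||h + C a||^2] over the simulated samples."""
    g = h + np.einsum('mdl,l->md', C, a)
    return np.mean(np.sum(g ** 2, axis=1))

CtC = np.einsum('mdl,mdk->lk', C, C) / M   # empirical E[C^T C]
Cth = np.einsum('mdl,md->l', C, h) / M     # empirical E[C^T h]
a_star = -np.linalg.solve(CtC, Cth)

# The closed-form weights minimize the (empirical) objective exactly,
# so they beat zero weights and any perturbed weights.
assert expected_sq_norm(a_star) <= expected_sq_norm(np.zeros(L))
assert expected_sq_norm(a_star) <= expected_sq_norm(a_star + 0.1)
```

Since the objective is a convex quadratic in $a$, the solution of the linear system is its exact minimizer on the simulated samples.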

### .2 Proof of Theorem 4.1

First we state Lemma .2 and Lemma .3, which we will use to prove the main Theorem 4.1.

###### Lemma .2.

Let

$$P(x \mid w) = h(x)\exp\left(\langle w, T(x)\rangle - A(w)\right)$$

be some exponential family, and let

$$P(w \mid \tau_0) \propto \exp\left(\langle \tau_0, w\rangle - n_0 A(w)\right)$$

be the conjugate prior to that family. If $x_1, \dots, x_N$ are i.i.d. variables and $X$ is new data from the distribution, then

$$\mathbb{E}[T(X) \mid x_1, \dots, x_N] = \kappa\,\frac{\tau_0}{n_0} + (1-\kappa)\,\hat{\mu},$$

where $\kappa = \frac{n_0}{N + n_0}$ and $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N T(x_i)$.

For a proof see Jordan [10].
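Lemma .2 can be illustrated with the Beta–Bernoulli pair. The sketch below (ours; hyperparameter values are arbitrary) checks that the convex combination $\kappa\,\tau_0/n_0 + (1-\kappa)\hat{\mu}$ matches the classical posterior predictive mean, using the correspondence $\tau_0 = \alpha$, $n_0 = \alpha + \beta$ for a $\mathrm{Beta}(\alpha, \beta)$ prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli likelihood in natural form has T(x) = x; its conjugate prior with
# hyperparameters (tau0, n0) corresponds to Beta(alpha, beta) via
# tau0 = alpha, n0 = alpha + beta.
alpha, beta, N = 2.0, 3.0, 50
tau0, n0 = alpha, alpha + beta
x = rng.binomial(1, 0.7, size=N)

kappa = n0 / (N + n0)
mu_hat = x.mean()                                   # empirical average of T(x_i)
lemma_mean = kappa * tau0 / n0 + (1 - kappa) * mu_hat

# Classical Beta-Bernoulli posterior predictive mean for comparison.
posterior_mean = (alpha + x.sum()) / (alpha + beta + N)

assert np.isclose(lemma_mean, posterior_mean)
```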

###### Lemma .3.

Given some observations $C_1, h_1, \dots, C_M, h_M$, the decision rule that minimizes the Bayes regret is

$$a^*(C_1, h_1, \dots, C_M, h_M) = -\mathbb{E}[C^{\top}C \mid C_1, h_1, \dots, C_M, h_M]^{-1}\,\mathbb{E}[C^{\top}h \mid C_1, h_1, \dots, C_M, h_M],$$

where the expectations are over all possible values of $h$ and $C$, given the observed data.

###### Proof.

We have

$$a^*(C_1,h_1,\dots,C_M,h_M) = \operatorname*{argmin}_a \mathbb{E}\left[\|h+Ca\|^2 \mid C_1,h_1,\dots,C_M,h_M\right] = \operatorname*{argmin}_a \mathbb{E}\left[(h+Ca)^{\top}(h+Ca) \mid C_1,h_1,\dots,C_M,h_M\right]$$

Setting the derivative of the above expression with respect to $a$ equal to zero, we require $a$ such that

$$0 = 2\,\mathbb{E}[C^{\top}h \mid C_1,h_1,\dots,C_M,h_M] + 2\,\mathbb{E}[C^{\top}C \mid C_1,h_1,\dots,C_M,h_M]\,a.$$

This gives $a^*(C_1,h_1,\dots,C_M,h_M) = -\mathbb{E}[C^{\top}C \mid C_1,h_1,\dots,C_M,h_M]^{-1}\,\mathbb{E}[C^{\top}h \mid C_1,h_1,\dots,C_M,h_M]$, as claimed.

Now we prove Theorem 4.1 for an arbitrary $V_0$, using Lemma .2 and Lemma .3. Consider the same setting as in Section 4. Let $h_1, \dots, h_M$ be observed gradients and $C_1, \dots, C_M$ be observed control variates. We define a probabilistic model composed of the likelihood $P(h, C \mid \theta)$ and the prior $P(\theta)$.

###### Theorem .4.

If we choose the likelihood to be a Gaussian,

$$P(h, C \mid \theta = (\eta, \Lambda)) = \mathcal{N}\left(\begin{bmatrix} h \\ \operatorname{vec}(C) \end{bmatrix} \,\middle|\, \mu = \Lambda^{-1}\eta,\; \Sigma = \Lambda^{-1}\right)$$

the prior (conjugate) to be

$$P(\theta = (\eta, \Lambda)) \propto \exp\left(t_0^{\top}\eta - \operatorname{tr}(V_0^{\top}\Lambda) - n_0 A(\eta, \Lambda)\right)$$

where $V_0$ can be written as:

$$V_0 = \begin{bmatrix}
V_{hh} & V_{hc_1}^{\top} & V_{hc_2}^{\top} & \cdots & V_{hc_L}^{\top} \\
V_{hc_1} & V_{c_1c_1} & V_{c_2c_1}^{\top} & \cdots & V_{c_Lc_1}^{\top} \\
V_{hc_2} & V_{c_2c_1} & \ddots & & \vdots \\
\vdots & \vdots & & & \\
V_{hc_L} & V_{c_Lc_1} & \cdots & & V_{c_Lc_L}
\end{bmatrix}$$

then the decision rule that minimizes the Bayesian regret is

$$a^*(h_1,C_1,\dots,h_M,C_M) = -\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M]^{-1}\,\mathbb{E}[C^{\top}h \mid h_1,C_1,\dots,h_M,C_M] \tag{8}$$

where

$$\mathbb{E}[C^{\top}h \mid h_1,C_1,\dots,h_M,C_M] = \frac{\kappa}{n_0}\begin{bmatrix} \operatorname{tr} V_{hc_1} \\ \operatorname{tr} V_{hc_2} \\ \vdots \\ \operatorname{tr} V_{hc_L} \end{bmatrix} + (1-\kappa)\,\overline{C^{\top}h}.$$

and

$$\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M] = \frac{\kappa}{n_0}\begin{bmatrix}
\operatorname{tr} V_{c_1c_1} & \operatorname{tr} V_{c_2c_1}^{\top} & \cdots & \operatorname{tr} V_{c_Lc_1}^{\top} \\
\operatorname{tr} V_{c_2c_1} & \ddots & & \vdots \\
\vdots & & & \\
\operatorname{tr} V_{c_Lc_1} & \cdots & & \operatorname{tr} V_{c_Lc_L}
\end{bmatrix} + (1-\kappa)\,\overline{C^{\top}C}.$$

Here $\overline{C^{\top}h} = \frac{1}{M}\sum_{i=1}^M C_i^{\top}h_i$ and $\overline{C^{\top}C} = \frac{1}{M}\sum_{i=1}^M C_i^{\top}C_i$ are empirical averages.

###### Proof.

The expression for $a^*$ in equation 8 is obtained using Lemma .3. We need to find $\mathbb{E}[C^{\top}h \mid h_1,C_1,\dots,h_M,C_M]$ and $\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M]$. Using Lemma .2, given observations $h_1,C_1,\dots,h_M,C_M$ we have the closed form expression

$$\mathbb{E}\left[\begin{bmatrix} h \\ \operatorname{vec}(C)\end{bmatrix}\begin{bmatrix} h \\ \operatorname{vec}(C)\end{bmatrix}^{\top} \,\middle|\, h_1,C_1,\dots,h_M,C_M\right] = \kappa\,\frac{V_0}{n_0} + (1-\kappa)\,\overline{\begin{bmatrix} h \\ \operatorname{vec}(C)\end{bmatrix}\begin{bmatrix} h \\ \operatorname{vec}(C)\end{bmatrix}^{\top}} \tag{9}$$

where $\operatorname{vec}(C)$ is a vector with the columns of $C$ concatenated. Noticing that

$$C^{\top}h = \begin{bmatrix} c_1^{\top}h \\ c_2^{\top}h \\ \vdots \\ c_L^{\top}h \end{bmatrix}$$

it can be concluded that each component of $C^{\top}h$ corresponds to the sum of a block of components of the matrix on the right hand side of equation 9. Using the decomposition for $V_0$ shown above, we get:

$$\mathbb{E}[C^{\top}h \mid h_1,C_1,\dots,h_M,C_M] = \frac{\kappa}{n_0}\begin{bmatrix} \operatorname{tr} V_{hc_1} \\ \operatorname{tr} V_{hc_2} \\ \vdots \\ \operatorname{tr} V_{hc_L} \end{bmatrix} + (1-\kappa)\,\overline{C^{\top}h}. \tag{10}$$

A similar reasoning can be used for $\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M]$, getting:

$$\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M] = \frac{\kappa}{n_0}\begin{bmatrix}
\operatorname{tr} V_{c_1c_1} & \operatorname{tr} V_{c_2c_1}^{\top} & \cdots & \operatorname{tr} V_{c_Lc_1}^{\top} \\
\operatorname{tr} V_{c_2c_1} & \ddots & & \vdots \\
\vdots & & & \\
\operatorname{tr} V_{c_Lc_1} & \cdots & & \operatorname{tr} V_{c_Lc_L}
\end{bmatrix} + (1-\kappa)\,\overline{C^{\top}C} \tag{11}$$

Replacing these expressions in equation 8 concludes the proof.

Using the result above we now prove Theorem 4.1, which takes $V_0 = v_0 I$.

###### Theorem 4.1 (restated).

For $V_0 = v_0 I$, the decision rule that minimizes the Bayesian regret is $a^*(h_1,C_1,\dots,h_M,C_M) = -\left(\overline{C^{\top}C} + \frac{d\,v_0}{M}\,I\right)^{-1}\overline{C^{\top}h}$.

###### Proof.

The expression for $a^*$ is given in eq. 8. We need to find the expressions for $\mathbb{E}[C^{\top}h \mid h_1,C_1,\dots,h_M,C_M]$ and $\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M]$ when $V_0 = v_0 I$. In this particular case we get that $\operatorname{tr} V_{hc_i} = 0$ for $i = 1, \dots, L$. Combining this with eq. 10 gives

$$\mathbb{E}[C^{\top}h \mid h_1,C_1,\dots,h_M,C_M] = (1-\kappa)\,\overline{C^{\top}h}.$$

When $V_0 = v_0 I$ we also get

• $\operatorname{tr} V_{c_ic_j} = 0$ for $i \neq j$, and $\operatorname{tr} V_{c_ic_i} = d\,v_0$ for $i = 1, \dots, L$.

Combining these two facts with eq. 11 gives

$$\mathbb{E}[C^{\top}C \mid h_1,C_1,\dots,h_M,C_M] = \frac{\kappa}{n_0}\,d\,v_0\,I + (1-\kappa)\,\overline{C^{\top}C}.$$

Finally,

$$a^*(h_1,C_1,\dots,h_M,C_M) = -\left(\frac{\kappa}{n_0}\,d\,v_0\,I + (1-\kappa)\,\overline{C^{\top}C}\right)^{-1}(1-\kappa)\,\overline{C^{\top}h} = -\left(\overline{C^{\top}C} + \frac{d\,v_0}{M}\,I\right)^{-1}\overline{C^{\top}h},$$

where in the last equality we used $\frac{\kappa}{1-\kappa} = \frac{n_0}{M}$, which follows from $\kappa = \frac{n_0}{M+n_0}$.
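The algebra above can be checked numerically. The sketch below (ours, with arbitrary dimensions and hyperparameters, and reading the scalar factor as $\kappa/n_0$ consistently with eq. 11) verifies that plugging the posterior means for $V_0 = v_0 I$ into eq. 8 coincides with the regularized empirical rule:

```python
import numpy as np

rng = np.random.default_rng(2)
L_cv, d, M, n0, v0 = 4, 6, 30, 10.0, 0.5   # arbitrary test values

# Empirical averages over M observed (gradient, control-variate) pairs.
C = rng.normal(size=(M, d, L_cv))
h = rng.normal(size=(M, d))
CtC_bar = np.einsum('mdl,mdk->lk', C, C) / M
Cth_bar = np.einsum('mdl,md->l', C, h) / M

kappa = n0 / (M + n0)

# Posterior-mean plug-ins for V0 = v0 * I, substituted into eq. 8.
E_CtC = (kappa / n0) * d * v0 * np.eye(L_cv) + (1 - kappa) * CtC_bar
E_Cth = (1 - kappa) * Cth_bar
a1 = -np.linalg.solve(E_CtC, E_Cth)

# Equivalent regularized least-squares form, using kappa/(1-kappa) = n0/M.
a2 = -np.linalg.solve(CtC_bar + (d * v0 / M) * np.eye(L_cv), Cth_bar)

assert np.allclose(a1, a2)
```

The equivalence shows the Bayesian rule is empirical least squares with a ridge term that shrinks as $1/M$, i.e. the prior's influence vanishes as more minibatch observations accumulate.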

### .3 Adaptation of control variate introduced by Miller et al. [17] to full covariance Gaussians

The derivation of the control variate introduced by Miller et al. [17] was done for the case in which $q_w$ is a Gaussian distribution with a diagonal covariance matrix. This section of the appendix explains how to use the CV in the case in which $q_w$ is a Gaussian distribution with a full covariance matrix $\Sigma_w$, in which case $T_w(r) = \mu_w + L_w r$, where $L_w L_w^{\top} = \Sigma_w$ and $\mu_w$ is the mean of the distribution.

Following the procedure in Miller et al. [17], we build an approximation for $\nabla_w \mathbb{E}_{q_w}[f(Z)]$ as $\nabla_w \mathbb{E}_{q_w}[\tilde{f}(Z)]$, where $\tilde{f}$ is a second order Taylor expansion of $f$. The difference between $\nabla_w \mathbb{E}_{q_w}[\tilde{f}(Z)]$ computed exactly (Lemma .5) and its estimation using reparameterization is used as a control variate.

###### Lemma .5.

Let $\bar{q} = \mathcal{N}(0, I)$ and let $\tilde{f}$ be a second order Taylor expansion of $f$ around $z_0$; then $\mathbb{E}_{\bar{q}}[\tilde{f}(T_w(r))]$ and its gradient with respect to $w$ can be computed in closed form.

###### Proof.

Applying reparameterization we can express $\mathbb{E}_{q_w}[\tilde{f}(Z)] = \mathbb{E}_{\bar{q}}[\tilde{f}(T_w(r))]$, where $\bar{q} = \mathcal{N}(0, I)$ and $Z = T_w(r) = \mu_w + L_w r$.

Introducing the Taylor expansion we get

$$\begin{aligned}
\mathbb{E}_{\bar{q}}[\tilde{f}(T_w(r))] &= \mathbb{E}_{\bar{q}}\left[f(z_0) + (Z-z_0)^{\top}\nabla f(z_0) + \tfrac{1}{2}(Z-z_0)^{\top}\nabla^2 f(z_0)(Z-z_0)\right]_{Z = T_w(r)} \\
&= f(z_0) + (\mathbb{E}[T_w(r)] - z_0)^{\top}\nabla f(z_0) + \tfrac{1}{2}\mathbb{E}\left[\operatorname{tr}\left(\nabla^2 f(z_0)(T_w(r)-z_0)(T_w(r)-z_0)^{\top}\right)\right] \\
&= f(z_0) + (\mu_w - z_0)^{\top}\nabla f(z_0) + \tfrac{1}{2}\operatorname{tr}\left(\nabla^2 f(z_0)\,\mathbb{E}\left[(T_w(r)-z_0)(T_w(r)-z_0)^{\top}\right]\right)
\end{aligned} \tag{12}$$
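The closed form in equation 12 can be checked against the exact Gaussian expectation of a quadratic, for which the second order Taylor expansion is exact. The sketch below (ours; $H$, $b$, $c$, and all names are made-up test values) uses $\mathbb{E}[(Z-z_0)(Z-z_0)^{\top}] = \Sigma_w + (\mu_w - z_0)(\mu_w - z_0)^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

# Quadratic test function f(z) = 0.5 z^T H z + b^T z + c; its second-order
# Taylor expansion around any z0 reproduces f exactly.
A = rng.normal(size=(d, d))
H = A + A.T                                  # symmetric Hessian
b, c = rng.normal(size=d), 1.5
f = lambda z: 0.5 * z @ H @ z + b @ z + c
grad_f = lambda z: H @ z + b

Lw = np.tril(rng.normal(size=(d, d))) + d * np.eye(d)   # T_w(r) = mu + Lw r
mu = rng.normal(size=d)
Sigma = Lw @ Lw.T
z0 = rng.normal(size=d)

# Closed form from equation 12.
second_moment = Sigma + np.outer(mu - z0, mu - z0)      # E[(Z-z0)(Z-z0)^T]
eq12_value = f(z0) + (mu - z0) @ grad_f(z0) + 0.5 * np.trace(H @ second_moment)

# Exact Gaussian expectation of the quadratic, for comparison.
exact = 0.5 * (mu @ H @ mu + np.trace(H @ Sigma)) + b @ mu + c

assert np.isclose(eq12_value, exact)
```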