Quasi-Monte Carlo Variational Inference

# Quasi-Monte Carlo Variational Inference

Alexander Buchholz    Florian Wenzel    Stephan Mandt
###### Abstract

Many machine learning problems involve Monte Carlo gradient estimators. As a prominent example, we focus on Monte Carlo variational inference (\oldtextscmcvi) in this paper. The performance of \oldtextscmcvi crucially depends on the variance of its stochastic gradients. We propose variance reduction by means of Quasi-Monte Carlo (\oldtextscqmc) sampling. \oldtextscqmc replaces i.i.d. samples from a uniform probability distribution by a deterministic sequence of samples of length . This sequence covers the underlying random variable space more evenly than i.i.d. draws, reducing the variance of the gradient estimator. With our novel approach, both the score function and the reparameterization gradient estimators lead to much faster convergence. We also propose a new algorithm for Monte Carlo objectives, where we operate with a constant learning rate and increase the number of \oldtextscqmc samples per iteration. We prove that this way, our algorithm can converge asymptotically at a faster rate than \oldtextscsgd. We furthermore provide theoretical guarantees on \oldtextscqmc for Monte Carlo objectives that go beyond \oldtextscmcvi, and support our findings by several experiments on large-scale data sets from various domains.

Quasi-Monte Carlo, Black Box Variational Inference, Bayesian Approximate Inference
\LetLtxMacro\oldtextsc\fvset

fontsize= \glsdisablehyper

## 1 Introduction

In many situations in machine learning and statistics, we encounter objective functions which are expectations over continuous distributions. Among other examples, this situation occurs in reinforcement learning (Sutton and Barto, 1998) and variational inference (Jordan et al., 1999). If the expectation cannot be computed in closed form, an approximation can often be obtained via Monte Carlo (\oldtextscmc) sampling from the underlying distribution. As most optimization procedures rely on the gradient of the objective, a \oldtextscmc gradient estimator has to be built by sampling from this distribution. The finite number of \oldtextscmc samples per gradient step introduces noise. When averaging over multiple samples, the error in approximating the gradient can be decreased, and thus its variance reduced. This guarantees stability and fast convergence of stochastic gradient descent (\oldtextscsgd).

Certain objective functions require a large number of \oldtextscmc samples per stochastic gradient step. As a consequence, the algorithm gets slow. It is therefore desirable to obtain the same degree of variance reduction with fewer samples. This paper proposes the idea of using Quasi-Monte Carlo (\oldtextscqmc) samples instead of i.i.d. samples to achieve this goal.

A \oldtextscqmc sequence is a deterministic sequence which covers a hypercube more regularly than random samples. When using a \oldtextscqmc sequence for Monte Carlo integration, the mean squared error (MSE) decreases asymptotically with the number of samples as (Leobacher and Pillichshammer, 2014). In contrast, the naive \oldtextscmc integration error decreases as . Since the cost of generating \oldtextscqmc samples is ), this implies that a much smaller number of operations per gradient step is required in order to achieve the same precision (provided that is large enough). Alternatively, we can achieve a larger variance reduction with the same number of samples, allowing for larger gradient steps and therefore also faster convergence. This paper investigates the benefits of this approach both experimentally and theoretically.

Our ideas apply in the context of Monte Carlo variational inference (\oldtextscmcvi), a set of methods which make approximate Bayesian inference scalable and easy to use. Variance reduction is an active area of research in this field. Our algorithm has the advantage of being very general; it can be easily implemented in existing software packages such as STAN and Edward (Carpenter et al., 2017; Tran et al., 2016). In Appendix D we show how our approach can be easily implemented in your existing code.

The main contributions are as follows:

• We investigate the idea of using \oldtextscqmc sequences for Monte Carlo variational inference. While the usage of \oldtextscqmc for \oldtextscvi has been suggested in the outlook section of Ranganath et al. (2014), to our knowledge, we are the first to actually investigate this approach both theoretically and experimentally.

• We show that when using a randomized version of \oldtextscqmc (\oldtextscrqmc), the resulting stochastic gradient is unbiased and its variance is asymptotically reduced. We also show that when operating \oldtextscsgd with a constant learning rate, the stationary variance of the iterates is reduced by a factor of , allowing us to get closer to the optimum.

• We propose an algorithm which operates at a constant learning rate, but increases the number of \oldtextscrqmc samples over iterations. We prove that this algorithm has a better asymptotic convergence rate than \oldtextscsgd.

• Based on three different experiments and for two popular types of gradient estimators we illustrate that our method allows us to train complex models several orders of magnitude faster than with standard \oldtextscmcvi.

Our paper is structured as follows. Section 2 reviews related work. Section 3 explains our method and exhibits our theoretical results. In Section 4 we describe our experiments and show comparisons to other existing approaches. Finally, Section 5 concludes and lays out future research directions.

## 2 Related Work

#### Monte Carlo Variational Inference (MCVI)

Since the introduction of the score function (or REINFORCE) gradient estimator for variational inference (Paisley et al., 2012; Ranganath et al., 2014), Monte Carlo variational inference has received an ever-growing attention, see Zhang et al. (2017a) for a recent review. The introduction of the gradient estimator made \oldtextscvi applicable to non-conjugate models but highly depends on the variance of the gradient estimator. Therefore various variance reduction techniques have been introduced; for example Rao-Blackwellization and control variates, see Ranganath et al. (2014) and importance sampling, see Ruiz et al. (2016a); Burda et al. (2016).

At the same time the work of Kingma and Welling (2014); Rezende et al. (2014) introduced reparameterization gradients for \oldtextscmcvi, which typically exhibits lower variance but are restricted to models where the variational family can be reparametrized via a differentiable mapping. In this sense \oldtextscmcvi based on score function gradient estimators is more general but training the algorithm is more difficult. A unifying view is provided by Ruiz et al. (2016b). Miller et al. (2017) introduce a modification of the reparametrized version, but relies itself on assumptions on the underlying variational family. Roeder et al. (2017) propose a lower variance gradient estimator by omitting a term of the \oldtextscelbo. The idea of using \oldtextscqmc in order to reduce the variance has been suggested by Ranganath et al. (2014) and Ruiz et al. (2016a) and used for a specific model by Tran et al. (2017), but without a focus on analyzing or benchmarking the method.

#### Quasi-Monte Carlo and Stochastic Optimization

Besides the generation of random samples for approximating posterior distributions (Robert and Casella, 2013), Monte Carlo methods are used for calculating expectations of intractable integrals via the law of large numbers. The error of the integration with random samples goes to zero at a rate of in terms of the MSE. For practical application this rate can be too slow. Faster rates of convergence in reasonable dimensions can be obtained by replacing the randomness by a deterministic sequence, also called Quasi-Monte Carlo.

Compared to Monte Carlo and for sufficiently regular functions, \oldtextscqmc reaches a faster rate of convergence of the approximation error of an integral. Niederreiter (1992); L’Ecuyer and Lemieux (2005); Leobacher and Pillichshammer (2014); Dick et al. (2013) provide excellent reviews on this topic. From a theoretical point of view, the benefits of \oldtextscqmc vanish in very high dimensions. Nevertheless, the error bounds are often too pessimistic and in practice, gains are observed up to dimension , see Glasserman (2013).

\oldtextsc

qmc has frequently been used in financial applications (Glasserman, 2013; Joy et al., 1996; Lemieux and L’Ecuyer, 2001). In statistics, some applications include particle filtering (Gerber and Chopin, 2015), approximate Bayesian computation (Buchholz and Chopin, 2017), control functionals (Oates and Girolami, 2016) and Bayesian optimal design (Drovandi and Tran, 2018). Yang et al. (2014) used \oldtextscqmc in the context of large scale kernel methods.

Stochastic optimization has been pioneered by the work of Robbins and Monro (1951). As stochastic gradient descent suffers from noisy gradients, various approaches for reducing the variance and adapting the step size have been introduced (Johnson and Zhang, 2013; Kingma and Ba, 2015; Defazio et al., 2014; Duchi et al., 2011; Zhang et al., 2017b). Extensive theoretical results on the convergence of stochastic gradients algorithms are provided by Moulines and Bach (2011). Mandt et al. (2017) interpreted stochastic gradient descent with constant learning rates as approximate Bayesian inference. Some recent reviews are for example Bottou et al. (2016); Nesterov (2013). Naturally, concepts from \oldtextscqmc can be beneficial to stochastic optimization. Contributions on exploiting this idea are e.g. Gerber and Bornn (2017) and Drew and Homem-de Mello (2006).

## 3 Quasi-Monte Carlo Variational Inference

In this Section, we introduce Quasi-Monte Carlo Variational Inference (\oldtextscqmcvi), using randomized \oldtextscqmc (\oldtextscrqmc) for variational inference. We review \oldtextscmcvi in Section 3.1. \oldtextscrqmc and the details of our algorithm are exposed in Section 3.2. Theoretical results are given in Section 3.3.

### 3.1 Background: Monte Carlo Variational Inference

Variational inference (\oldtextscvi) is key to modern probabilistic modeling and Bayesian deep learning (Jordan et al., 1999; Blei et al., 2017; Zhang et al., 2017a). In Bayesian inference, the object of interest is a posterior distribution of latent variables given observations . \oldtextscvi approximates Bayesian inference by an optimization problem which we can solve by (stochastic) gradient ascent (Jordan et al., 1999; Hoffman et al., 2013).

In more detail, \oldtextscvi builds a tractable approximation of the posterior by minimizing the -divergence between a variational family , parametrized by free parameters , and . This is equivalent to maximizing the so-called evidence lower bound (\oldtextscelbo):

 L(λ)=Eq(z|λ)[logp(x,z)−logq(z|λ)]. (1)

In classical variational inference, the expectations involved in (1) are carried out analytically (Jordan et al., 1999). However, this is only possible for the fairly restricted class of so-called conditionally conjugate exponential family models (Hoffman et al., 2013). More recently, black-box variational methods have gained momentum, which make the analytical evaluation of these expectation redundant, and which shall be considered in this paper.

Maximizing the objective (1) is often based on a gradient ascent scheme. However, a direct differentiation of the objective (1) with respect to is not possible, as the measure of the expectation depends on this parameter. The two major approaches for overcoming this issue are the score function estimator and the reparameterization estimator.

The score function gradient (also called REINFORCE gradient) (Ranganath et al., 2014) expresses the gradient as expectation with respect to and is given by

 ∇λL(λ)=Eq(z|λ)[∇λlogq(z|λ)(logp(x,z)−logq(z|λ))]. (2)

The gradient estimator is obtained by approximating the expectation with independent samples from the variational distribution . This estimator applies to continuous and discrete variational distributions.

The second approach is based on the reparametrization trick (Kingma and Welling, 2014), where the distribution over is expressed as a deterministic transformation of another distribution over a noise variable , hence where . Using the reparameterization trick, the \oldtextscelbo is expressed as expectation with respect to and the derivative is moved inside the expectation:

 ∇λL(λ)=Ep(ε)[∇λlogp(x,gλ(ε))−∇λlogq(gλ(ε)|λ)]. (3)

The expectation is approximated using a \oldtextscmc sum of independent samples from ). In its basic form, the estimator is restricted to distributions over continuous variables.

#### Mcvi

In the general setup of \oldtextscmcvi considered here, the gradient of the \oldtextscelbo is represented as an expectation over a random variable . For the score function estimator we choose according to Equation (2) with and for the reparameterization gradient according to Equation (3) with , respectively. This allows us to obtain a stochastic estimator of the gradient by an average over a finite sample as .This way, the \oldtextscelbo can be optimized by stochastic optimization. This is achieved by iterating the \oldtextscsgd updates with decreasing step sizes :

 λt+1=λt+αt^gN(λt). (4)

The convergence of the gradient ascent scheme in (4) tends to be slow when gradient estimators have a high variance. Therefore, various approaches for reducing the variance of both gradient estimators exist; e.g. control variates (\oldtextsccv), Rao-Blackwellization and importance sampling. However these variance reduction techniques do not improve the rate of the MSE of the estimator, except under some restrictive conditions (Oates et al., 2017). Moreover, the variance reduction schemes must often be tailored to the problem at hand.

### 3.2 Quasi-Monte Carlo Variational Inference

#### Quasi Monte Carlo

Low discrepancy sequences, also called \oldtextscqmc sequences, are used for integrating a function over the hypercube. When using standard i.i.d. samples on , the error of the approximation is . \oldtextscqmc achieves a rate of convergence in terms of the MSE of if is sufficiently regular (Leobacher and Pillichshammer, 2014). This is achieved by a deterministic sequence that covers more evenly.

On a high level, \oldtextscqmc sequences are constructed such that the number of points that fall in a rectangular volume is proportional to the volume. This idea is closely linked to stratification. Halton sequences e.g. are constructed using coprime numbers (Halton, 1964). Sobol sequences are based on the reflected binary code (Antonov and Saleev, 1979). The exact construction of \oldtextscqmc sequences is quite involved and we refer to Niederreiter (1992); Leobacher and Pillichshammer (2014); Dick et al. (2013) for more details.

The approximation error of \oldtextscqmc increases with the dimension, and it is difficult to quantify. Carefully reintroducing randomness while preserving the structure of the sequence leads to randomized \oldtextscqmc. \oldtextscrqmc sequences are unbiased and the error can be assessed by repeated simulation. Moreover, under slightly stronger regularity conditions on we can achieve rates of convergence of (Gerber, 2015). For illustration purposes, we show different sequences in Figure 1. In Appendix A we provide more technical details.

\oldtextsc

qmc or \oldtextscrqmc can be used for integration with respect to arbitrary distributions by transforming the initial sequence on via a transformation to the distribution of interest. Constructing the sequence typically costs (Gerber and Chopin, 2015).

#### QMC and VI

We suggest to replace independent \oldtextscmc samples for computing by an \oldtextscrqmc sequence of the same length. With our approach, the variance of the gradient estimators becomes , and the costs for creating the sequence is . The incorporation of \oldtextscrqmc in \oldtextscvi is straightforward: instead of sampling as independent \oldtextscmc samples, we generate a uniform \oldtextscrqmc sequence and transform this sequence via a mapping to the original random variable . Using this transformation we obtain the \oldtextscrqmc gradient estimator

 ^gN(λt)=(1/N)N∑i=1gΓ(ui)(λ). (5)

From a theoretical perspective, the function has to be sufficiently smooth for all . For commonly used variational families this transformation is readily available. Although evaluating these transforms adds computational overhead, we found this cost negligible in practice. For example, in order to sample from a multivariate Gaussian , we generate an \oldtextscrqmc squence and apply the transformation , where is the Cholesky decomposition of and is the component-wise inverse cdf of a standard normal distribution. Similar procedures are easily obtained for exponential, Gamma, and other distributions that belong to the exponential family. Algorithm 1 summarizes the procedure.

\oldtextsc

rqmc samples can be generated via standard packages such as randtoolbox (Christophe and Petr, 2015), available in R. Existing \oldtextscmcvi algorithms are adapted by replacing the random variable sampler by an \oldtextscrqmc version. Our approach reduces the variance in \oldtextscmcvi and applies in particular to the reparametrization gradient estimator and the score function estimator. \oldtextscrqmc can in principle be combined with additional variance reduction techniques such as \oldtextsccv, but care must be taken as the optimal \oldtextsccv for \oldtextscrqmc are not the same as for \oldtextscmc (Hickernell et al., 2005).

### 3.3 Theoretical Properties of QMCVI

In what follows we give a theoretical analysis of using \oldtextscrqmc in stochastic optimization. Our results apply in particular to \oldtextscvi but are more general.

\oldtextsc

qmcvi leads to faster convergence in combination with Adam (Kingma and Ba, 2015) or Adagrad (Duchi et al., 2011), as we will show empirically in Section 4. Our analysis, presented in this section, underlines this statement for the simple case of \oldtextscsgd with fixed step size in the Lipschitz continuous (Theorem 1) and strongly convex case (Theorem 2). We show that for sufficiently large, \oldtextscsgd with \oldtextscrqmc samples reaches regions closer to the true optimizer of the \oldtextscelbo. Moreover, we obtain a faster convergence rate than \oldtextscsgd when using a fixed step size and increasing the sample size over iterations (Theorem 3).

#### RQMC for Optimizing Monte Carlo Objectives

We step back from black box variational inference and consider the more general setup of optimizing Monte Carlo objectives. Our goal is to minimize a function , where the optimizer has only access to a noisy, unbiased version , with and access to an unbiased noisy estimator of the gradients , with . The optimum of is .

We furthermore assume that the gradient estimator has the form as in Eq. 5, where is a reparameterization function that converts uniform samples from the hypercube into samples from the target distribution. In this paper, is an \oldtextscrqmc sequence.

In the following theoretical analysis, we focus on \oldtextscsgd with a constant learning rate . The optimal value is approximated by \oldtextscsgd using the update rule

 λt+1=λt−α^gN(λt). (6)

Starting from the procedure is iterated until , for a small threshold . The quality of the approximation crucially depends on the variance of the estimator (Johnson and Zhang, 2013).

Intuitively, the variance of based on an \oldtextscrqmc sequence will be and thus for large enough, the variance will be smaller than for the \oldtextscmc counterpart, that is . This will be beneficial to the optimization procedure defined in (6). Our following theoretical results are based on standard proof techniques for stochastic approximation, see e.g. Bottou et al. (2016).

#### Stochastic Gradient Descent with Fixed Step Size

In the case of functions with Lipschitz continuous derivatives, we obtain the following upper bound on the norm of the gradients.

###### Theorem 1

Let be a function with Lipschitz continuous derivatives, i.e. there exists s.t. , let be an \oldtextscrqmc sequence and let , has cross partial derivatives of up to order . Let the constant learning rate and let . Then , where and and

 ∑Tt=1E∥∇F(λt)∥22T≤12μαLMVr(N)+F(λ1)−F(λ⋆)αμT,

where is iteratively defined in (6). Consequently,

 limT→∞∑Tt=1E∥∇F(λt)∥22T≤12μαLMVr(N). (7)

Equation (7) underlines the dependence of the sum of the norm of the gradients on the variance of the gradients. The better the gradients are estimated, the closer one gets to the optimum where the gradient vanishes. As the dependence on the sample size becomes for an \oldtextscrqmc sequence instead of for a \oldtextscmc sequence, the gradient is more precisely estimated for large enough.

We now study the impact of a reduced variance on \oldtextscsgd with a fixed step size and strongly convex functions. We obtain an improved upper bound on the optimality gap.

###### Theorem 2

Let have Lipschitz continuous derivatives and be a strongly convex function, i.e. there exists a constant s.t. , let be an \oldtextscrqmc sequence and let be as in Theorem 1. Let the constant learning rate and . Then the expected optimality gap satisfies, ,

 E[F(λt+1)−F(λ⋆)]≤[(α2L2−α)2c+1]×E[FN(λt)−F(λ⋆)]+12Lα2[MVr(N)].

Consequently,

 limT→∞E[F(λT)−F(λ⋆)]≤αL4c−αLc[MVr(N)].

The previous result has the following interpretation. The expected optimality gap between the last iteration and the true minimizer is upper bounded by the magnitude of the variance. The smaller this variance, the closer we get to . Using \oldtextscrqmc we gain a factor in the bound.

#### Increasing Sample Size Over Iterations

While \oldtextscsgd with a fixed step size and a fixed number of samples per gradient step does not converge, convergence can be achieved when increasing the number of samples used for estimating the gradient over iterations. As an extension of Theorem 2, we show that a linear convergence is obtained while increasing the sample size at a slower rate than for \oldtextscmc sampling.

###### Theorem 3

Assume the conditions of Theorem 2 with the modification . Let . Use an increasing sample size , where is defined in Appendix B.3. Then

 trVarUN[^gNt(λ)]≤^MV×1τ2t

and

 E[F(λt+1)−F(λ⋆)]≤ωξ2t,

where .

This result compares favorably with a standard result on the linear convergence of \oldtextscsgd with fixed step size and strongly convex functions (Bottou et al., 2016). For \oldtextscmc sampling one obtains a different constant and an upper bound with and not . Thus, besides the constant factor, \oldtextscrqmc samples allow us to close the optimality gap faster for the same geometric increase in the sample size or to use to obtain the same linear rate of convergence as \oldtextscmc based estimators.

#### Other Remarks

The reduced variance in the estimation of the gradients should allow us to make larger moves in the parameter space. This is for example achieved by using adaptive step size algorithms as Adam (Kingma and Ba, 2015), or Adagrad (Duchi et al., 2011). However, the theoretical analysis of these algorithms is beyond the scope of this paper.

Also, note that it is possible to relax the smoothness assumptions on while supposing only square integrability. Then one obtains rates in . Thus, \oldtextscrqmc yields always a faster rate than \oldtextscmc, regardless of the smoothness. See Appendix A for more details.

In the previous analysis, we have assumed that the entire randomness in the gradient estimator comes from the sampling of the variational distribution. In practice, additional randomness is introduced in the gradient via mini batch sampling. This leads to a dominating term in the variance of for mini batches of size . Still, the part of the variance related to the variational family itself is reduced and so is the variance of the gradient estimator as a whole.

## 4 Experiments

We study the effectiveness of our method in three different settings: a hierarchical linear regression, a multi-level Poisson generalized linear model (GLM) and a Bayesian neural network (BNN). Finally, we confirm the result of Theorem 3, which proposes to increase the sample size over iterations in \oldtextscqmcvi for faster asymptotic convergence.

#### Setup

In the first three experiments we optimize the \oldtextscelbo using the Adam optimizer (Kingma and Ba, 2015) with the initial step size set to , unless otherwise stated. The \oldtextscrqmc sequences are generated through a python interface to the R package randtoolbox (Christophe and Petr, 2015). In particular we use scrambled Sobol sequences. The gradients are calculated using an automatic differentiation toolbox. The \oldtextscelbo values are computed by using \oldtextscmc samples, the variance of the gradient estimators is estimated by resampling the gradient times in each optimization step and computing the empirical variance.

#### Benchmarks

The first benchmark is the vanilla \oldtextscmcvi algorithm based on ordinary \oldtextscmc sampling. Our method \oldtextscqmcvi replaces the \oldtextscmc samples by \oldtextscrqmc sequences and comes at almost no computational overhead (Section 3).

Our second benchmark in the second and third experiment is the control variate (\oldtextsccv) approach of Miller et al. (2017), where we use the code provided with the publication. In the first experiment, this comparison is omitted since the method of Miller et al. (2017) does not apply in this setting due to the non-Gaussian variational distribution.

#### Main Results

We find that our approach generally leads to a faster convergence compared to our baselines due to a decreased gradient variance. For the multi-level Poisson GLM experiment, we also find that our \oldtextscrqmc algorithm converges to a better local optimum of the \oldtextscelbo. As proposed in Theorem 3, we find that increasing the sample size over iteration in \oldtextscqmcvi leads to a better asymptotic convergence rate than in \oldtextscmcvi.

### 4.1 Hierarchical Linear Regression

We begin the experiments with a toy model of hierarchical linear regression with simulated data. The sampling process for the outputs is We place lognormal hyper priors on the variance of the intercepts and on the noise ; and a Gaussian hyper prior on . Details on the model are provided in Appendix C.1. We set the dimension of the data points to be and simulated data points from the model. This results in a -dimensional posterior, which we approximate by a variational distribution that mirrors the prior distributions.

We optimize the \oldtextscelbo using Adam (Kingma and Ba, 2015) based on the score function as well as the reparameterization gradient estimator. We compare the standard \oldtextscmc based approach using 10 and samples with our \oldtextscrqmc based approach using samples, respectively. The \oldtextsccv based estimator cannot be used in this setting since it only supports Gaussian variational distributions and the variational family includes a lognormal distribution. For the score function estimator, we set the initial step size of Adam to .

The results are shown in Figure 1. We find that using \oldtextscrqmc samples decreases the variance of the gradient estimator substantially. This applies both to the score function and the reparameterization gradient estimator. Our approach substantially improves the standard score function estimator in terms of convergence speed and leads to a decreased gradient variance of up to three orders of magnitude. Our approach is also beneficial in the case of the reparameterization gradient estimator, as it allows for reducing the sample size from 100 \oldtextscmc samples to 10 \oldtextscrqmc samples, yielding a similar gradient variance and optimization speed.

### 4.2 Multi-level Poisson GLM

We use a multi-level Poisson generalized linear model (GLM), as introduced in (Gelman and Hill, 2006) as an example of multi-level modeling. This model has a -dim posterior, resulting from its hierarchical structure.

As in (Miller et al., 2017), we apply this model to the frisk data set (Gelman et al., 2006) that contains information on the number of stop-and-frisk events within different ethnicity groups. The generative process of the model is described in Appendix C.2. We approximate the posterior by a diagonal Gaussian variational distribution.

The results are shown in Figure 4. When using a small number of samples , all three methods have comparable convergence speed and attain a similar optimum. In this setting, the \oldtextsccv based method has lowest gradient variance. When increasing the sample size to , our proposed \oldtextscrqmc approach leads to substantially decreased gradient variance and allows Adam to convergence closer to the optimum than the baselines. This agrees with the fact that \oldtextscrqmc improves over \oldtextscmc for sufficiently large sample sizes.

### 4.3 Bayesian Neural Network

As a third example, we study \oldtextscqmcvi and its baselines in the context of a Bayesian neural network. The network consists of a -unit hidden layer with ReLU activations. We place a normal prior over each weight, and each weight prior has an inverse Gamma hyper prior. We also place an inverse Gamma prior over the observation variance. The model exhibits a posterior of dimension and is applied to a 100-row subsample of the wine dataset from the UCI repository. The generative process is described in Appendix C.3. We approximate the posterior by a variational diagonal Gaussian.

The results are shown in Figure 4. For , both the \oldtextscrqmc and the \oldtextsccv version converge to a comparable value of the \oldtextscelbo, whereas the ordinary \oldtextscmc approach converges to a lower value. For , all three algorithms reach approximately the same value of the \oldtextscelbo, but our \oldtextscrqmc method converges much faster. In both settings, the variance of the \oldtextscrqmc gradient estimator is one to three orders of magnitude lower than the variance of the baselines.

### 4.4 Increasing the Sample Size Over Iterations

Along with our new Monte Carlo variational inference approach \oldtextscqmcvi, Theorem 3 gives rise to a new stochastic optimization algorithm for Monte Carlo objectives. Here, we investigate this algorithm empirically, using a constant learning rate and an (exponentially) increasing sample size schedule. We show that, for strongly convex objective functions and some mild regularity assumptions, our \oldtextscrqmc based gradient estimator leads to a faster asymptotic convergence rate than using the ordinary \oldtextscmc based gradient estimator.

In our experiment, we consider a two-dimensional factorizing normal target distribution with zero mean and standard deviation one. Our variational distribution is also a normal distribution with fixed standard deviation of , and with a variational mean parameter, i.e., we only optimize the mean parameter. In this simple setting, the \oldtextscelbo is strongly convex and the variational family includes the target distribution. We optimize the \oldtextscelbo with an increasing sample size, using the \oldtextscsgd algorithm described in Theorem 3. We initialize the variational parameter to . Results are shown in Figure 5.

We considered both \oldtextscrqmc (red) and \oldtextscmc (blue) based gradient estimators. We plot the difference between the optimal \oldtextscelbo value and the optimization trace in logarithmic scale. The experiment confirms the theoretical result of Theorem 3 as our \oldtextscrqmc based method attains a faster asymptotic convergence rate than the ordinary \oldtextscmc based approach. This means that, in the absence of additional noise due to data subsampling, optimizing Monte Carlo objectives with \oldtextscrqmc can drastically outperform \oldtextscsgd.

## 5 Conclusion

We investigated randomized Quasi-Monte Carlo (\oldtextscrqmc) for stochastic optimization of Monte Carlo objectives. We termed our method Quasi-Monte Carlo Variational Inference (\oldtextscqmcvi), currently focusing on variational inference applications. Using our method, we showed that we can achieve faster convergence due to variance reduction.

\oldtextsc

qmcvi has strong theoretical guarantees and provably gets us closer to the optimum of the stochastic objective. Furthermore, in absence of additional sources of noise such as data subsampling noise, \oldtextscqmcvi converges at a faster rate than \oldtextscsgd when increasing the sample size over iterations.

\oldtextsc

qmcvi can be easily integrated into automated inference packages. All one needs to do is replace a sequence of uniform random numbers over the hypercube by an \oldtextscrqmc sequence, and perform the necessary reparameterizations to sample from the target distributions.

An open question remains as to which degree \oldtextscqmcvi can be combined with control variates, as \oldtextscrqmc may introduce additional unwanted correlations between the gradient and the \oldtextsccv. We will leave this aspect for future studies. We see particular potential for \oldtextscqmcvi in the context of reinforcement learning, which we consider to investigate.

## Acknowledgements

We would like to thank Pierre E. Jacob, Nicolas Chopin, Rajesh Ranganath, Jaan Altosaar and Marius Kloft for their valuable feedback on our manuscript. This work was partly funded by the German Research Foundation (DFG) award KL 2698/2-1 and a GENES doctoral research scholarship.

## References

• Antonov and Saleev (1979) Antonov, I. A. and Saleev, V. (1979). An economic method of computing LP-sequences. USSR Computational Mathematics and Mathematical Physics, 19(1):252–256.
• Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
• Bottou et al. (2016) Bottou, L., Curtis, F. E., and Nocedal, J. (2016). Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838.
• Buchholz and Chopin (2017) Buchholz, A. and Chopin, N. (2017). Improving approximate Bayesian computation via quasi Monte Carlo. arXiv preprint arXiv:1710.01057.
• Burda et al. (2016) Burda, Y., Grosse, R., and Salakhutdinov, R. (2016). Importance weighted autoencoders. Proceedings of the International Conference on Learning Representations.
• Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32.
• Christophe and Petr (2015) Christophe, D. and Petr, S. (2015). randtoolbox: Generating and Testing Random Numbers. R package version 1.17.
• Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654.
• Dick et al. (2013) Dick, J., Kuo, F. Y., and Sloan, I. H. (2013). High-dimensional integration: The quasi-monte carlo way. Acta Numerica, 22:133–288.
• Drew and Homem-de Mello (2006) Drew, S. S. and Homem-de Mello, T. (2006). Quasi-monte carlo strategies for stochastic optimization. In Proceedings of the 38th conference on Winter simulation, pages 774–782. Winter Simulation Conference.
• Drovandi and Tran (2018) Drovandi, C. C. and Tran, M.-N. (2018). Improving the Efficiency of Fully Bayesian Optimal Design of Experiments Using Randomised Quasi-Monte Carlo. Bayesian Anal., 13(1):139–162.
• Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Gelman and Hill (2006) Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
• Gelman et al. (2006) Gelman, A., Kiss, A., and Fagan, J. (2006). An analysis of the nypd’s stop-and-frisk policy in the context of claims of racial bias. Columbia Public Law & Legal Theory Working Papers, page 0595.
• Gerber (2015) Gerber, M. (2015). On integration methods based on scrambled nets of arbitrary size. Journal of Complexity, 31(6):798–816.
• Gerber and Bornn (2017) Gerber, M. and Bornn, L. (2017). Improving simulated annealing through derandomization. Journal of Global Optimization, 68(1):189–217.
• Gerber and Chopin (2015) Gerber, M. and Chopin, N. (2015). Sequential quasi monte carlo. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(3):509–579.
• Glasserman (2013) Glasserman, P. (2013). Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media.
• Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
• Halton (1964) Halton, J. H. (1964). Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM, 7(12):701–702.
• Hardy (1905) Hardy, G. H. (1905). On double Fourier series, and especially those which represent the double zeta-function with real and incommensurable parameters. Quart. J., 37:53–79.
• Hickernell (2006) Hickernell, F. J. (2006). Koksma-Hlawka Inequality. American Cancer Society.
• Hickernell et al. (2005) Hickernell, F. J., Lemieux, C., Owen, A. B., et al. (2005). Control variates for quasi-monte carlo. Statistical Science, 20(1):1–31.
• Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.
• Johnson and Zhang (2013) Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323.
• Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233.
• Joy et al. (1996) Joy, C., Boyle, P. P., and Tan, K. S. (1996). Quasi-monte carlo methods in numerical finance. Management Science, 42(6):926–938.
• Kingma and Ba (2015) Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations.
• Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. Proceedings of the International Conference on Learning Representations.
• Kuipers and Niederreiter (2012) Kuipers, L. and Niederreiter, H. (2012). Uniform distribution of sequences. Courier Corporation.
• L’Ecuyer and Lemieux (2005) L’Ecuyer, P. and Lemieux, C. (2005). Recent advances in randomized quasi-Monte Carlo methods. In Modeling uncertainty, pages 419–474. Springer.
• Lemieux and L’Ecuyer (2001) Lemieux, C. and L’Ecuyer, P. (2001). On the use of quasi-monte carlo methods in computational finance. In International Conference on Computational Science, pages 607–616. Springer.
• Leobacher and Pillichshammer (2014) Leobacher, G. and Pillichshammer, F. (2014). Introduction to quasi-Monte Carlo integration and applications. Springer.
• Mandt et al. (2017) Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. Journal of Machine Learning Research, 18(134):1–35.
• Miller et al. (2017) Miller, A. C., Foti, N., D’Amour, A., and Adams, R. P. (2017). Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems.
• Moulines and Bach (2011) Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems.
• Nesterov (2013) Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media.
• Niederreiter (1992) Niederreiter, H. (1992). Random number generation and quasi-Monte Carlo methods. SIAM.
• Oates and Girolami (2016) Oates, C. and Girolami, M. (2016). Control functionals for quasi-monte carlo integration. In Artificial Intelligence and Statistics, pages 56–65.
• Oates et al. (2017) Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718.
• Owen (1997) Owen, A. B. (1997). Scrambled Net Variance for Integrals of Smooth Functions. The Annals of Statistics, 25(4):1541–1562.
• Owen et al. (2008) Owen, A. B. et al. (2008). Local antithetic sampling with scrambled nets. The Annals of Statistics, 36(5):2319–2343.
• Paisley et al. (2012) Paisley, J., Blei, D., and Jordan, M. (2012). Variational Bayesian inference with stochastic search. International Conference on Machine Learning.
• Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
• Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning.
• Robbins and Monro (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
• Robert and Casella (2013) Robert, C. and Casella, G. (2013). Monte Carlo Statistical Methods. Springer Science & Business Media.
• Roeder et al. (2017) Roeder, G., Wu, Y., and Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems.
• Rosenblatt (1952) Rosenblatt, M. (1952). Remarks on a multivariate transformation. Ann. Math. Statist., 23(3):470–472.
• Ruiz et al. (2016a) Ruiz, F. J. R., Titsias, M. K., and Blei, D. M. (2016a). Overdispersed black-box variational inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
• Ruiz et al. (2016b) Ruiz, F. R., Titsias, M., and Blei, D. (2016b). The generalized reparameterization gradient. In Advances in Neural Information Processing Systems.
• Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press Cambridge.
• Tran et al. (2016) Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
• Tran et al. (2017) Tran, M.-N., Nott, D. J., and Kohn, R. (2017). Variational bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873–882.
• Yang et al. (2014) Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. (2014). Quasi-monte carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, pages 485–493.
• Zhang et al. (2017a) Zhang, C., Butepage, J., Kjellstrom, H., and Mandt, S. (2017a). Advances in variational inference. arXiv preprint arXiv:1711.05597.
• Zhang et al. (2017b) Zhang, C., Kjellström, H., and Mandt, S. (2017b). Determinantal point processes for mini-batch diversification. In Uncertainty in Artificial Intelligence, UAI 2017.

## Appendix A Additional Information on QMC

We provide some background on \oldtextscqmc sequences that we estimate necessary for the understanding of our algorithm and our theoretical results.

#### Quasi-Monte Carlo (QMC)

Low discrepancy sequences (also called Quasi Monte Carlo sequences), are used to approximate integrals over the hyper-cube: , that is the expectation of the random variable , where , is a uniform distribution on . The basic Monte Carlo approximation of the integral is , where each , independently. The error of this approximation is , since .

This basic approximation may be improved by replacing the random variables by a low-discrepancy sequence; that is, informally, a deterministic sequence that covers more regularly. The error of this approximation is assessed by the Koksma-Hlawka inequality (Hickernell, 2006):

 ∣∣ ∣∣∫[0,1]dψ(u)du−1NN∑n=1ψ(un)∣∣ ∣∣≤V(ψ)D∗(u1:N), (8)

where is the total variation in the sense of Hardy and Krause (Hardy, 1905). This quantity is closely linked to the smoothness of the function . is called the star discrepancy, that measures how well the sequence covers the target space.

The general notion of discrepancy of a given sequence is defined as follows:

 D(u1:N,A):=supA∈A∣∣ ∣∣1NN∑n=1\mathds1{un∈A}−λd(A)∣∣ ∣∣,

where is the volume (Lebesgue measure on ) of and is a set of measurable sets. When we fix the sets with as a products set of intervals anchored at , the star discrepancy is then defined as follows

 D∗(u1:N):=sup[0,b]∣∣ ∣∣1NN∑n=1\mathds1{un∈[0,b]}−λd([0,b])∣∣ ∣∣.

It is possible to construct sequences such that . See also Kuipers and Niederreiter (2012) and Leobacher and Pillichshammer (2014) for more details.

Thus, \oldtextscqmc integration schemes are asymptotically more efficient than \oldtextscmc schemes. However, if the dimension gets too large, the number of necessary samples in order to reach the asymptotic regime becomes prohibitive. As the upper bound is rather pessimistic, in practice \oldtextscqmc integration outperforms \oldtextscmc integration even for small in most applications, see e.g. the examples in Chapter 5 of Glasserman (2013). Popular \oldtextscqmc sequences are for example the Halton sequence or the Sobol sequence. See e.g. Dick et al. (2013) for details on the construction. A drawback of \oldtextscqmc is that it is difficult to assess the error and that the deterministic approximation is inherently biased.

#### Ramdomized Quasi Monte Carlo (RQMC).

The reintroduction of randomness in a low discrepancy sequence while preserving the low discrepancy properties enables the construction of confidence intervals by repeated simulation. Moreover, the randomization makes the approximation unbiased. The simplest method for this purpose is a randomly shifted sequence. Let . Then the sequence based on preserves the properties of the \oldtextscqmc sequence with probability and is marginally uniformly distributed.

Scrambled nets (Owen, 1997) represent a more sophisticated approach. Assuming smoothness of the derivatives of the function, Gerber (2015) showed recently, that rates of are achievable. We summarize the best rates in table 1.

#### Transforming \oldtextscqmc and \oldtextscrqmc sequences

A generic recipe for using \oldtextscqmc / \oldtextscrqmc for integration is given by transforming a sequence with the inverse Rosenblatt transformation , see Rosenblatt (1952) and Gerber and Chopin (2015), such that

 ∫ψ(Γ(u))du=∫ψ(z)p(z)dz,

where is the respective measure of integration. The inverse Rosenblatt transformation can be understood as the multivariate extension of the inverse cdf transform. For the procedure to be correct we have to make sure that is sufficiently regular.

#### Theoretical Results on RQMC

In our analysis we mainly use the following result.

###### Theorem 4

(Owen et al., 2008) Let be a function such that its cross partial derivatives up to order exist and are continuous, and let be a relaxed scrambled -net in base with dimension with uniformly bounded gain coefficients. Then,

 Var(1NN∑n=1ψ(un))=O(N−3log(N)(d−1)),

where .

In words, the \oldtextscrqmc error rate is when a scrambled -net is used. However, a more general result has recently been shown by Gerber (2015)[Corollary 1], where if is square integrable and is a scrambled -sequence, then

 Var(1NN∑n=1ψ(un))=o(N−1).

This result shows that \oldtextscrqmc integration is always better than MC integration. Moreover, Gerber (2015)[Proposition 1] shows that rates can be obtained when the function is regular in the sense of Theorem 4 . In particular one gets

 Var(1NN∑n=1ψ(un))=O(N−2).

## Appendix B Proofs

Our proof are deduced from standard results in the stochastic approximation literature, e.g. Bottou et al. (2016), when the variance of the gradient estimator is reduced due to \oldtextscrqmc sampling. Our proofs rely on scrambled -sequences in order to use the result of Gerber (2015). The scrambled Sobol sequence, that we use in our simulations satisfies the required properties. We denote by the total expectation and by the expectation with respect to the \oldtextscrqmc sequence generated at time . Note, that is not a random variable w.r.t. as it only depends on all the previous due to the update equation in (6). However, is a random variable depending on .

### b.1 Proof of Theorem 1

Let us first prove that for all . By assumption we have that with is a function with continuous mixed partial derivatives of up to order for all . Therefore, if is a \oldtextscrqmc sequence, then the trace of the variance of the estimator is upper bounded by Theorem 4 and its extension by Gerber (2015)[Proposition 1] by a uniform bound and the quantity , that goes to faster than the Monte Carlo rate .

By the Lipschitz assumption we have that , , see for example Bottou et al. (2016). By using the fact that we obtain

 F(λt+1)−F(λt) ≤ = −α∇F(λt)T^gN(λt)+α2L2∥^gN(λt)∥22.

After taking expectations with respect to we obtain

 EUN,tF(λt+1)−F(λt) ≤ −α∇F(λt)EUN,t^gN(λt)+α2L2EUN,t∥^gN(λt)∥22.

We now use the fact that and after exploiting the fact that we obtain

 EUN,tF(λt+1)−F(λt) ≤ −α∥∇F(λt)∥22+α2L2[trVarUN,t[^gN(λt)]+∥∇F(λt)∥22], = α2L2trVarUN,t[^gN(λt)]+(α2L2−α)∥∇F(λt)∥22.

The inequality is now summed for and we take the total expectation:

 EF(λT)−F(λ1) ≤ α2L2T∑t=1EtrVarUN,t[^gN(λt)] +(α2L2−α)T∑t=1E∥∇F(λt)∥22.

We use the fact that , where is deterministic and is the true minimizer, and divide the inequality by :

 1T[F(λ⋆)−F(λ1)] ≤ α2L21TT∑t=1EtrVarUN,t[^gN(λt)] +(α2L2−α)1TT∑t=1E∥∇F(λt)∥22.

By rearranging and using and we obtain

 1TT∑t=1E∥∇F(λt)∥22 ≤ 1Tαμ[F(λ1)−F(λ⋆)] +αL2μ1TT∑t=1EtrVarUN,t[^gN(λt)].

We now use for all . Equation (7) is obtained as .

### b.2 Proof of Theorem 2

A direct consequence of strong convexity is the fact that the optimality gap can be upper bounded by the gradient in the current point , e.g. . The following proof uses this result. Based on the previous proof we get

 EUN,tF(λt+1)−F(λt) ≤ α2L2trVarUN,t[^gN(λt)] +(α2L2−α)∥∇F(λt)∥22 ≤ α2L2trVarUN,t[^gN(λt)] +(α2L2−α)2c(F(λt)−F(λ⋆)),

where we have used that . By subtracting from both sides, taking total expectations and rearranging we obtain:

 EF(λt+1)−F(λ⋆) ≤ α2L2EtrVarUN,t[^gN(λt)] +[(α2L2−α)2c+1](EF(λt)−F(λ⋆)).

 α2LEtrVarUN,t[^gN(λt)]2(β−1)

to both sides of the equation. This yields

 EF(λt+1)−F(λ⋆)+α2LEtrVarUN,t[^gN(λt)]2(β−1) ≤ α2L2EtrVarUN,t[^gN(λt)] +β(EF(λt)−F(λ⋆))+α2LEtrVarUN,t[^gN(λt)]2(β−1) ≤ β(EF(λt)−F(λ⋆)+α2LEtrVarUN,t[^gN(λt)]2(β−1)).

Let us now show that :

 β ≤ [(αL2−1)2αc+1]

And as we get . Using we obtain and thus get a contracting equation when iterating over :

 EF(λt+1)−F(λ⋆) ≤ βt(F(λ1)−F(λ⋆)+α2LEtrVarUN,t[^gN(λt)]2(β−1)) −α2LEtrVarUN,t[^gN(λt)]2(β−1), ≤ βt(F(λ1)−F(λ⋆)+α2LMVr(N)2(β−1)) +α2LMVr(N)2(1−β).

After simplification we get

 EF(λt+1)−F(λ⋆) ≤ βt(F(λ1)−F(λ⋆)+αL2αLc−4cMVr(N)) +αL4c−2αLcMVr(N),

where the first term of the r.h.s. goes to as .

### b.3 Proof of Theorem 3

We require , due to a remark of Gerber (2015), where is the dimension and are integer parameters of the \oldtextscrqmc sequence. As has continuous mixed partial derivatives of order for all , and consequently , where is an universal upper bound on the variance. We recall that . Consequently

 trVar[^gNt(λ)]≤^MV×1(N––+⌈τt⌉)2≤^MV×1τ2t.

Now we take an intermediate result from the previous proof:

 EUN,tF(λt+1)−F(λt) ≤ α2L2trVarUN,t[^gNt(λt)]+(α2L2−α)∥∇F(λt)∥22 ≤ α2L2trVarUN,t[^gNt(λt)]−α2∥∇F(λt)∥22 ≤ α2L2trVarUN,t[^gNt(λ