
# A Convergence Analysis for A Class of Practical Variance-Reduction Stochastic Gradient MCMC

Changyou Chen, Wenlin Wang, Yizhe Zhang, Qinliang Su, Lawrence Carin
University at Buffalo; Duke University
###### Abstract

Stochastic gradient Markov Chain Monte Carlo (SG-MCMC) has been developed as a flexible family of scalable Bayesian sampling algorithms. However, there has been little theoretical analysis of the impact of minibatch size on the algorithm’s convergence rate. In this paper, we prove that under a limited computational budget/time, a larger minibatch size leads to a faster decrease of the mean squared error bound (thus the fastest one corresponds to using full gradients), which motivates the necessity of variance reduction in SG-MCMC. Consequently, by borrowing ideas from stochastic optimization, we propose a practical variance-reduction technique for SG-MCMC, which is efficient in both computation and storage. We develop theory to prove that our algorithm induces a faster convergence rate than standard SG-MCMC. A number of large-scale experiments, ranging from Bayesian learning of logistic regression to deep neural networks, validate the theory and demonstrate the superiority of the proposed variance-reduction SG-MCMC framework.


## Introduction

With the increasing size of datasets of interest to machine learning, stochastic gradient Markov Chain Monte Carlo (SG-MCMC) has been established as an effective tool for large-scale Bayesian learning, with applications in topic modeling (??), matrix factorization (???), differential privacy (?), Bayesian optimization (?) and deep neural networks (?). Typically, in each iteration of an SG-MCMC algorithm, a minibatch of data is used to generate the next sample, yielding computational efficiency comparable to stochastic optimization. While a large number of SG-MCMC algorithms have been proposed, their optimal convergence rates generally appear to share the same form, and are typically slower than stochastic gradient descent (SGD) (?). The impact of stochastic gradient noise comes from a higher-order term (see Lemma 1 below), which was omitted in the analysis of (?). In other words, current theoretical analysis for SG-MCMC does not consider the impact of minibatch size (corresponding to stochastic gradient noise), making the underlying convergence theory w.r.t. minibatch size unclear. Recent work by (?) on applying variance reduction in stochastic gradient Langevin dynamics (SGLD) claims to improve the convergence rate of standard SGLD (???).

The theoretical analysis in (?) omits certain aspects of variance reduction in SGLD, which we seek to address here: i) how does the minibatch size (or, equivalently, the stochastic gradient noise) affect the convergence rate of an SG-MCMC algorithm? and ii) how can one effectively reduce the stochastic gradient noise in SG-MCMC to improve its convergence rate, from both an algorithmic and a theoretical perspective? For i), we provide theoretical results on the convergence rates of SG-MCMC w.r.t. minibatch size. For ii), we propose a practical variance-reduction technique for SG-MCMC, as well as theory to analyze improvements of the corresponding convergence rates. The resulting SG-MCMC algorithm is referred to as variance-reduction SG-MCMC (vrSG-MCMC).

For a clearer description, we first define notation. In a Bayesian model, our goal is typically to evaluate the posterior average of a test function $\phi(x)$, defined as $\bar{\phi} \triangleq \int_{\mathcal{X}} \phi(x)\rho(x)\,\mathrm{d}x$, where $\rho(x)$ is the target posterior distribution over the possibly augmented model parameters $x$ (see Section Preliminaries). Let $\{x_1, \ldots, x_L\}$ be the samples generated from an SG-MCMC algorithm. We use the sample average, $\hat{\phi}_L \triangleq \frac{1}{L}\sum_{l=1}^{L}\phi(x_l)$, to approximate $\bar{\phi}$. The corresponding bias and mean square error (MSE) are defined as $\big|\mathbb{E}\hat{\phi}_L - \bar{\phi}\big|$ and $\mathbb{E}\big(\hat{\phi}_L - \bar{\phi}\big)^2$, respectively. In vrSG-MCMC, unbiased estimations of full gradients are used, leading to the same bias bound as standard SG-MCMC (?). As a result, we focus here on analyzing the MSE bound for vrSG-MCMC.

Specifically, we first analyze how minibatch size affects the MSE convergence rate of standard SG-MCMC, summarized in two cases: i) for a limited computation budget, the optimal MSE bound is achieved when using full gradients in the algorithm; ii) for a large enough computational budget, i.e., in a long-run setting, stochastic gradients with minibatches of size one are preferable. This indicates that stochastic gradient noise hurts SG-MCMC at the beginning of the algorithm. While it is computationally infeasible to use full gradients in practice, a remedy to overcome this issue is to use relatively small minibatches with variance reduction techniques to reduce stochastic gradient noise. Consequently, we propose a practical variance-reduction scheme, making SG-MCMC computationally efficient in a big-data setting. Finally, we develop theory to analyze the benefit of the proposed variance-reduction technique and empirically show improvements of vrSG-MCMC over standard SG-MCMC algorithms.

## Preliminaries

SG-MCMC is a family of scalable Bayesian sampling algorithms, developed recently to generate approximate samples from a posterior distribution $p(\theta \mid \mathcal{D})$. Here $\theta$ represents a model parameter vector and $\mathcal{D} = \{d_1, \ldots, d_N\}$ represents the data available to learn the model. In general, SG-MCMC algorithms are discretized numerical approximations of continuous-time Itô diffusions (??), whose stationary distributions coincide with the target posterior distributions. An Itô diffusion is written as

$$\mathrm{d}x_t = F(x_t)\,\mathrm{d}t + g(x_t)\,\mathrm{d}w_t\,, \qquad (1)$$

where $x_t$ is the state variable, $t$ is the time index, and $w_t$ is $p$-dimensional Brownian motion. Typically, $x_t$ is an augmentation of the model parameters $\theta$, so $x_t \supseteq \theta$. The functions $F(\cdot)$ and $g(\cdot)$ are assumed to satisfy the Lipschitz continuity condition (?).

According to (?), all SG-MCMC algorithms can be formulated by defining appropriate functions $F$ and $g$ in (1). For example, the stochastic gradient Langevin dynamics (SGLD) model corresponds to $x_t = \theta$, $F(x_t) = -\nabla_\theta U(\theta)$ and $g(x_t) = \sqrt{2}\,\mathbf{I}_p$, where $U(\theta) \triangleq -\log p(\theta) - \sum_{i=1}^{N} \log p(d_i \mid \theta)$ denotes the unnormalized negative log-posterior. Similar formulas can be defined for other SG-MCMC algorithms, such as stochastic gradient Hamiltonian Monte Carlo (SGHMC) (?) and stochastic gradient thermostats (SGNHT) (?).

An SG-MCMC algorithm is usually developed by numerically solving the corresponding Itô diffusion and replacing the full gradient with an unbiased estimate from a minibatch of data in each iteration. For example, in SGLD, this yields the update equation $\theta_l = \theta_{l-1} - \nabla_\theta \tilde{U}_l(\theta_{l-1})\,h + \sqrt{2h}\,\zeta_l$ for the $l$-th iteration, where $h$ is the stepsize, $\zeta_l \sim \mathcal{N}(0, \mathbf{I}_p)$, and $\tilde{U}_l$ is the minibatch estimate of $U$. This brings two sources of error into the chain: numerical error (from discretization of the differential equation) and stochastic noise error from the use of minibatches. In particular, (?) proved the following bias and MSE bounds for general SG-MCMC algorithms:
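To make the update concrete, a minimal SGLD step can be sketched as follows (an illustrative sketch, not the paper's implementation; `grad_neg_log_post` is an assumed callable returning an unbiased minibatch estimate of $\nabla_\theta \tilde{U}$):

```python
import numpy as np

def sgld_step(theta, grad_neg_log_post, h, rng):
    """One SGLD update: theta <- theta - h * grad + sqrt(2h) * Gaussian noise.

    grad_neg_log_post(theta) is assumed to return an unbiased minibatch
    estimate of the gradient of the negative log-posterior U(theta).
    """
    noise = rng.standard_normal(theta.shape)
    return theta - h * grad_neg_log_post(theta) + np.sqrt(2.0 * h) * noise
```

Plugging in the exact gradient of a standard Gaussian target ($U(\theta)=\theta^2/2$, so the gradient is $\theta$) makes the chain sample approximately from $\mathcal{N}(0,1)$ for small $h$.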

###### Lemma 1 ((?))

Under Assumption 2 in Appendix A, the bias and MSE of SG-MCMC with a $K$th-order integrator (the order characterizes the accuracy of a numerical integrator; e.g., the Euler method is a 1st-order integrator) at time $hL$ are bounded as:

$$\big|\mathbb{E}\hat{\phi}_L - \bar{\phi}\big| = O\!\left(\frac{\sum_l \|\mathbb{E}\Delta V_l\|}{L} + \frac{1}{Lh} + h^K\right), \qquad \mathbb{E}\big(\hat{\phi}_L - \bar{\phi}\big)^2 = O\!\left(\frac{\frac{1}{L}\sum_l \mathbb{E}\|\Delta V_l\|^2}{L} + \frac{1}{Lh} + h^{2K}\right).$$

Here $\Delta V_l \triangleq \mathcal{L} - \tilde{\mathcal{L}}_l$, where $\mathcal{L}$ is the infinitesimal generator of the Itô diffusion (1), defined as $\mathcal{L}f(x) \triangleq \big(F(x) \cdot \nabla_x + \frac{1}{2}\big(g(x)g(x)^\top\big) : \nabla_x \nabla_x^\top\big)f(x)$ for any compactly supported twice differentiable function $f$, with $a \cdot b \triangleq a^\top b$ for two vectors $a$ and $b$, and $A : B \triangleq \mathrm{tr}(A^\top B)$ for two matrices $A$ and $B$. $\|\cdot\|$ is defined as the standard operator norm acting on the space of bounded functions, e.g., $\|P\| \triangleq \sup_{\|f\| \le 1}\|Pf\|$ for an operator $P$ acting on a function $f$. $\tilde{\mathcal{L}}_l$ is the same as $\mathcal{L}$ except for the substitution of the stochastic gradient $\nabla_\theta \tilde{U}_l$ for the full gradient $\nabla_\theta U$, due to the usage of a stochastic gradient in the $l$-th iteration. By substituting the definitions of $\mathcal{L}$ and $\tilde{\mathcal{L}}_l$, typically we have $\Delta V_l = \big(\nabla_\theta U - \nabla_\theta \tilde{U}_l\big) \cdot \nabla_\theta$.

By using an unbiased estimate of the true gradient, the $\mathbb{E}\Delta V_l$ term in the bias bound in Lemma 1 vanishes, indicating that stochastic gradients (or, equivalently, minibatch size) only affect the MSE bound. Consequently, we focus on improving the MSE bound with the proposed variance-reduction SG-MCMC framework.

## Practical Variance-Reduction SG-MCMC

We first motivate the necessity of variance reduction in SG-MCMC, by analyzing how minibatch size affects the MSE bound. A practical variance-reduction scheme is then proposed, which is efficient from both computational and storage perspectives. Comparison with existing variance-reduction SG-MCMC approaches is also highlighted. Previous research has characterized the order at which the convergence of diffusion-based MCMC scales w.r.t. dimension (?). For SG-MCMC, we follow the standard analysis (?) and do not consider the impact of dimension in our analysis.

### The necessity of variance reduction: a theoretical perspective

It is clear from Lemma 1 that the variance of noisy stochastic gradients plays an important role in the MSE bound of an SG-MCMC algorithm. What is unclear is how exactly minibatch size affects the convergence rate. Intuitively, minibatch size appears to play the following roles in SG-MCMC: i) smaller minibatch sizes introduce larger variance into stochastic gradients; ii) smaller minibatch sizes allow an algorithm to run faster (thus more samples can be obtained in a given amount of computation time). To balance the two effects, in addition to using the standard assumptions for SG-MCMC (which basically requires the coefficients of Itô diffusions to be smooth and bounded, and is deferred to Assumption 2 in the Appendix), we assume that the algorithms with different minibatch sizes all run for a fixed computational time/budget in the analysis, as stated in Assumption 1.

###### Assumption 1

For a fair comparison, all SG-MCMC algorithms with different minibatch sizes are assumed to run for a fixed amount of computation time/budget $T$. Further, we assume that $T$ depends linearly on the minibatch size $n$ and the sample size $L$, i.e., $T \propto nL$.

For simplicity, we rewrite the gradient of the log-likelihood for data $d_i$ in the $l$-th iteration as $g_{li} \triangleq \nabla_\theta \log p(d_i \mid \theta_l)$. We first derive the following lemma about the properties of $g_{li}$, which is useful in the subsequent developments, e.g., to guarantee a positive bound in Theorem 3 and an improved bound for the proposed vrSG-MCMC (Theorem 6).

###### Lemma 2

Under Assumption 2, given the sample in the $l$-th iteration, the stated bound holds in expectation, where the expectation is taken over the randomness of an SG-MCMC algorithm. (The same meaning applies to other expectations in the paper if not explicitly specified.)

We next generalize Lemma 1 by incorporating the minibatch size into the MSE bound. The basic idea in our derivation is to associate with each data point $d_i$ a binary random variable, $z_{li} \in \{0, 1\}$, indicating whether $d_i$ is included in the minibatch of the $l$-th iteration. These variables depend on each other such that $\sum_{i=1}^{N} z_{li} = n$, in order to guarantee minibatches of size $n$. Consequently, the stochastic gradient in the $l$-th iteration can be rewritten as $\nabla_\theta \tilde{U}_l(\theta) = -\nabla_\theta \log p(\theta) - \frac{N}{n}\sum_{i=1}^{N} z_{li}\,\nabla_\theta \log p(d_i \mid \theta)$. Substituting this gradient formula into the proof for standard SG-MCMC (?) and further summing out the $z_{li}$'s results in an alternative MSE bound for SG-MCMC, stated in Theorem 3. In the analysis, we assume a 1st-order numerical integrator for simplicity, e.g., the Euler method, though the results generalize easily to $K$th-order integrators.
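The effect of summing out the inclusion indicators can be checked numerically. The sketch below (a toy setup of ours, not the paper's: the per-datum gradient $\theta - d_i$ of a Gaussian model) enumerates all size-$n$ minibatches of a small dataset and verifies that the reweighted minibatch gradient matches the full gradient in expectation:

```python
import itertools
import numpy as np

def minibatch_grad(theta, data, idx, N):
    """Reweighted minibatch gradient: (N / n) * sum of per-datum gradients over idx.

    Per-datum gradient here is for a Gaussian toy model: the gradient of
    0.5 * (theta - d_i)^2 w.r.t. theta, i.e., (theta - d_i).
    """
    n = len(idx)
    return (N / n) * sum(theta - data[i] for i in idx)

def expected_minibatch_grad(theta, data, n):
    """Average the estimator over every possible size-n minibatch,
    i.e., sum out the inclusion indicators z_i."""
    N = len(data)
    combos = list(itertools.combinations(range(N), n))
    return sum(minibatch_grad(theta, data, idx, N) for idx in combos) / len(combos)
```

For any $\theta$, the enumerated average equals the full-data gradient $\sum_i (\theta - d_i)$, since each index is included with probability $n/N$ and the $N/n$ reweighting cancels it exactly.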

###### Theorem 3

Under Assumption 2, let the minibatch size of an SG-MCMC algorithm be $n$, $1 \le n \le N$. The finite-time MSE is bounded, for a constant $C$ independent of $n$, $L$ and $h$, as:

$$\mathbb{E}\big(\hat{\phi}_L - \bar{\phi}\big)^2 \le C\left(\frac{2(N-n)}{N^2}\,\frac{\Gamma_M}{nL} + \frac{1}{Lh} + h^2\right).$$

Theorem 3 expresses the bound in terms of the minibatch size $n$ and the data size $N$. Note that in our finite-time setting, $L$ and $h$ are considered to be constants. Consequently, $\Gamma_M$ is also a bounded constant in our analysis. To bring in the computational budget $T$, based on Assumption 1, i.e., $T \propto nL$, the optimal MSE bound w.r.t. the stepsize in Theorem 3 can be written in terms of $n$ and $T$. After further optimizing the bound w.r.t. $n$ by setting its derivative to zero, the optimal minibatch size follows in closed form. To guarantee that this optimal $n$ is finite and integer-valued, the computational budget $T$ is required to scale at a certain order when varying $N$. When considering both $n$ and $T$ as impact factors, the optimal $n$ becomes more interesting, and is summarized in Corollary 4. (Note we only know the optimal $n$ up to an unknown constant, i.e., its specific value is unknown.)
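As a worked sketch of the stepsize optimization behind this statement (a standard step for SG-MCMC bounds; the first term of Theorem 3 does not involve $h$, so only the last two terms are minimized):

$$\frac{\mathrm{d}}{\mathrm{d}h}\left(\frac{1}{Lh} + h^2\right) = -\frac{1}{Lh^2} + 2h = 0 \;\;\Rightarrow\;\; h = (2L)^{-1/3}, \qquad \frac{1}{Lh} + h^2 = O\!\left(L^{-2/3}\right).$$

With $L \propto T/n$ from Assumption 1, these two terms contribute $O\big((n/T)^{2/3}\big)$ to the optimal bound, which is then traded off against the first, variance-induced term when optimizing over $n$.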

###### Corollary 4

Under Assumption 1 and 2, we have three cases of optimal minibatch sizes, each corresponding to different levels of computational budget.

• When the computational budget is small, the optimal MSE bound is decreasing w.r.t. $n$ in the range $[1, N]$. The minimum MSE bound is achieved at $n = N$, i.e., with full gradients.

• When the computational budget is large, the optimal MSE bound is increasing w.r.t. $n$ in the range $[1, N]$. The minimum MSE bound is achieved at $n = 1$.

• When the computational budget is in between the above two cases, the optimal MSE bound first increases then decreases w.r.t. $n$ in the range $[1, N]$. The optimal MSE bound is obtained either at $n = 1$ or at $n = N$, depending on the budget.

In many machine learning applications, the computational budget is limited, placing the algorithm in the first case of Corollary 4. According to Corollary 4, processing the full data (i.e., no minibatch) is then required to achieve the optimal MSE bound, which is computationally infeasible when $N$ is large (which motivated the use of minibatches in the first place). A practical way to overcome this is to use small minibatches together with variance-reduction techniques that reduce the stochastic gradient noise.

### A practical variance reduction algorithm

For practical use, we require that a variance-reduction method achieve both computational and storage efficiency. While variance reduction has been studied extensively in stochastic optimization, it has been applied much less often in SG-MCMC. In this section we propose a vrSG-MCMC algorithm, a simple extension of the algorithm in (?) that is more computationally practical in large-scale applications. A convergence theory is also developed in Section Convergence rate.

The proposed vrSG-MCMC is illustrated in Algorithm 1. As in stochastic optimization (?), the idea of variance reduction is to balance the gradient noise with a less-noisy old gradient, i.e., a stochastic gradient calculated based on a previous sample, using a larger minibatch than that of the current stochastic gradient, resulting in a less noisy estimate. In each iteration of our algorithm, an unbiased stochastic gradient is obtained by combining these two versions of gradients in an appropriate way (see Algorithm 1). Such a construction of stochastic gradients inherits a low variance with theoretical guarantees (detailed in Section Convergence rate). In Algorithm 1, the whole parameter $x$ is decomposed into the model parameter $\theta$ and the remaining algorithm-specific parameters, e.g., the momentum; the algorithm extracts the corresponding model parameter $\theta$ from the full state $x$. The old gradient is calculated with a larger minibatch than the one used for the current stochastic gradient. A transition function denotes one step of an SG-MCMC algorithm, generating the next sample based on the current sample, the input stochastic gradient, and the stepsize.
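The gradient construction described above can be sketched as follows (a sketch under our naming, not the paper's exact pseudocode; `per_datum_grad` is an assumed callable, and `g_old` is assumed to be a gradient estimate computed at `theta_old` on a larger minibatch):

```python
import numpy as np

def vr_gradient(theta, theta_old, g_old, per_datum_grad, idx, N):
    """Variance-reduced stochastic gradient:
        g = g_old + (N / n) * sum_{i in idx} [grad_i(theta) - grad_i(theta_old)]

    g_old is a stale gradient estimate computed at theta_old; because the
    correction term has zero mean over the minibatch draw, the estimator
    stays unbiased whenever g_old is, while the two per-datum evaluations
    at nearby parameters largely cancel the minibatch noise.
    """
    n = len(idx)
    correction = sum(per_datum_grad(theta, i) - per_datum_grad(theta_old, i)
                     for i in idx)
    return g_old + (N / n) * correction
```

When `theta` is close to `theta_old`, each bracketed difference is small, so the variance of the estimator is much lower than that of the plain reweighted minibatch gradient.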

One should note that existing variance-reduction algorithms, e.g., (?), use a similar construction of low-variance gradients. However, most algorithms use the whole training data to compute the old gradient in Algorithm 1, which is computationally infeasible in large-scale settings. Moreover, we note that, as in stochastic optimization (??), instead of using a single old parameter sample to compute the old gradient, it can similarly be computed based on an average of old parameter samples. The theoretical analysis readily adapts to such cases, which we omit for simplicity. More references are discussed in Section Related Work.

### Comparison with existing variance-reduction SG-MCMC algorithms

The most related variance-reduction SG-MCMC algorithm we are aware of is a recent work on variance-reduction SGLD (SVRG-LD) (?). SVRG-LD shares a similar flavor with our scheme from the algorithmic perspective, except that when calculating the old gradient, the whole training data set is used in SVRG-LD. As mentioned above, this brings a computational challenge for large-scale learning. Although the problem is mitigated by using a moving-average estimate of the stochastic gradient, this scheme does not match their theory. A more distinctive advantage of vrSG-MCMC over SVRG-LD (?) lies in the theoretical analysis. Concerning SVRG-LD, i) the authors did not show theoretically in which cases variance reduction is useful in SGLD, and ii) it is not clear from their theory whether SVRG-LD is able to speed up the convergence rate compared to standard SGLD. Specifically, the MSE of SVRG-LD was shown to be bounded by the minimum of two terms, compared to the standard bound for SGLD. By inspecting these bounds, it is not clear whether SVRG-LD improves on SGLD, because the two bounds are not directly comparable. (The first term in the min of the SVRG-LD bound is strictly larger than the first term of the SGLD bound; if that term attains the min, the bounds are not easily compared.) More detailed explanations are provided in Appendix E.

### Convergence rate

We derive convergence bounds for Algorithm 1 and analyze the improvement of vrSG-MCMC over the corresponding standard SG-MCMC. Using a similar approach as in Section The necessity of variance reduction: a theoretical perspective, we first introduce additional binary random variables to indicate which data points are included in calculating the old gradient in Algorithm 1. This yields an expression for the stochastic gradient used in the $l$-th iteration. It is easy to verify that this stochastic gradient is an unbiased estimate of the true gradient in the $l$-th iteration (see Appendix C).

In order to see how Algorithm 1 reduces the variance of stochastic gradients, from Lemma 1 it suffices to study the term $\mathbb{E}\|\Delta V_l\|^2$, as the minibatch size only impacts this term. For notational simplicity, analogous to the per-datum gradients defined in Section The necessity of variance reduction: a theoretical perspective, we denote the corresponding gradients evaluated on the old parameter in the same way. Intuitively, since the old gradient is calculated at the old parameter to balance the stochastic gradient noise at the current parameter, the two sets of gradients are expected to be close to each other. Lemma 5 formalizes this intuition, a key result in proving our main theorem, where we only consider the update interval and stepsize as factors. In the lemma below, following (?) (Assumption 1), we further assume the gradient function to be Lipschitz.

###### Lemma 5

Under Assumption 2, and assuming the gradient function to be Lipschitz (Assumption 1 in (?)), the gradients evaluated at the current and old parameters are close to each other in expectation, with a gap controlled by the update interval and the stepsize.

In the Appendix, we further simplify the $\mathbb{E}\|\Delta V_l\|^2$ term in the MSE bound by decomposing it into several terms. Finally, we arrive at our main theorem for the proposed vrSG-MCMC framework.

###### Theorem 6

Under the setting of Lemma 5, let $m$ be the update interval for the old gradient and $h$ the stepsize. The MSE of vrSG-MCMC with a $K$th-order integrator is bounded as:

$$\mathbb{E}\big(\hat{\phi}_L - \bar{\phi}\big)^2 = O\!\left(\frac{A_M}{L} + \frac{1}{Lh} + h^{2K} + \frac{mh}{L} - \frac{\lambda_M}{L}\right),$$

where $A_M$ and $\lambda_M$ are constants determined by the variances of the stochastic gradients without and with variance reduction, respectively. Furthermore, we have $\lambda_M > 0$ for a suitably chosen old-gradient minibatch size, so that $A_M - \lambda_M < A_M$.

Note that for small $h$, the $\frac{mh}{L}$ term in the above bound is a high-order term relative to $\frac{\lambda_M}{L}$. As a result, the MSE is bounded by $O\big(\frac{A_M - \lambda_M}{L} + \frac{1}{Lh} + h^{2K}\big)$. Because the MSE of standard SG-MCMC is bounded by $O\big(\frac{A_M}{L} + \frac{1}{Lh} + h^{2K}\big)$ (see Appendix C) and $\lambda_M > 0$ from Theorem 6, we conclude that vrSG-MCMC induces a lower MSE bound compared to the corresponding SG-MCMC algorithm, with an improvement of $O\big(\frac{\lambda_M}{L}\big)$.

It is worth noting that in Algorithm 1, the minibatch for calculating the old gradient is required to be larger than that for calculating the current stochastic gradient. Otherwise, $\lambda_M$ in Theorem 6 could become negative, leading to an increased MSE bound compared to standard SG-MCMC. This matches the intuition that old gradients need to be more accurate (thus computed with larger minibatches) than current stochastic gradients in order to reduce the stochastic gradient noise.

###### Remark 7

In the special case of (?), where the whole data set is used to compute the old gradient for SGLD, Theorem 6 gives a corresponding MSE bound. According to Lemma 2, the associated variance-reduction term is also positive, thus leading to a reduced MSE bound. However, this bound is not necessarily better in practice than that of vrSG-MCMC, where a minibatch is used instead of the whole data set to calculate the old gradient, leading to a significant decrease in computational time.

###### Remark 8

Following Corollary 4, Theorem 6 can also be formulated in terms of the computational budget. Specifically, according to Algorithm 1, the computational budget is proportional to the total number of gradient evaluations. Substituting this into the MSE bound of Theorem 6 gives a reformulated bound. The optimal MSE w.r.t. the minibatch sizes would be complicated, since both $A_M$ and $\lambda_M$ depend on them; we omit the details for simplicity. Nevertheless, our experiments indicate that our algorithm consistently improves over standard SG-MCMC algorithms given the same computational time.

## Related Work

Variance reduction was first introduced in stochastic optimization, which quickly became a popular research topic and has been actively developed in recent years. (??) introduced perhaps the first variance reduction algorithm, called stochastic average gradient (SAG), where historical gradients are stored and continuously updated in each iteration. Later, stochastic variance reduction gradient (SVRG) was developed to reduce the storage bottleneck of SAG, at the cost of an increased computational time (??). (?) combined ideas of SAG and SVRG and proposed the SAGA algorithm, which improves SAG by using a better and unbiased stochastic-gradient estimation.

Variance reduction algorithms were first designed for convex optimization problems, followed by a number of recent works extending the techniques for non-convex optimization (????), as well as for distributed learning (?). All these algorithms are mostly based on SVRG and are similar in algorithmic form, but differ in the techniques for proving the rigorous theoretical results.

For scalable Bayesian sampling with SG-MCMC, however, this topic had been little studied until recent work on variance reduction for SGLD (?). In that work, the authors adapted the SAG and SVRG ideas to SGLD. Although they provided corresponding convergence results, some fundamental problems, such as how minibatch size affects the convergence rate, were not fully studied. Furthermore, their algorithms suffer from either a high computational or a high storage cost in a big-data setting, because the whole data set needs to be accessed frequently.

To reduce the computational cost of SVRG-based algorithms, the idea of using a minibatch of data to calculate the old gradient (as in Algorithm 1) has also been studied in stochastic optimization. Representative works include, but are not limited to, (?????). The proposed approach adopts similar ideas, with the following main differences: i) our algorithm represents the first work on large-scale Bayesian sampling with a practical (computationally cheap) variance-reduction technique; ii) the techniques used here for analysis are different and appear to be simpler than those used in stochastic optimization; iii) our theory addresses fundamental questions for variance reduction in SG-MCMC, such as those raised in the Introduction.

## Experiments

### A synthetic experiment

We first test the conclusion for the long-run setting in Corollary 4, which indicates that SG-MCMC with minibatches of size 1 achieves the optimal MSE bound. To push the algorithm as far into the long-run regime as possible, we test on a simple Gaussian model, which runs very fast, so that a small amount of actual wall-clock time already constitutes a large computational budget. We generate data samples and calculate the MSE of the sample average of the test function for a range of minibatch sizes. The results are plotted in Figure 1. We can see from the figure that a minibatch size of 1 achieves the lowest MSE, consistent with the theory (Corollary 4).
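A toy version of this experiment can be sketched as follows, under assumed specifics (a conjugate model $\theta \sim \mathcal{N}(0,1)$, $d_i \sim \mathcal{N}(\theta, 1)$, and test function $\phi(\theta) = \theta$; the paper's exact model and test function are not reproduced here):

```python
import numpy as np

def run_sgld_gaussian(data, n, budget, h, rng):
    """SGLD on a conjugate Gaussian model (theta ~ N(0,1), d_i ~ N(theta,1)),
    with minibatch size n under a fixed budget of gradient evaluations.
    Returns the running sample average of the test function phi(theta) = theta."""
    N = len(data)
    theta, total, L = 0.0, 0.0, 0
    while L * n < budget:
        idx = rng.choice(N, size=n, replace=False)
        # grad U = theta (prior term) + (N/n) * sum_i (theta - d_i) (likelihood term)
        grad = theta + (N / n) * np.sum(theta - data[idx])
        theta = theta - h * grad + np.sqrt(2.0 * h) * rng.standard_normal()
        total += theta
        L += 1
    return total / L
```

Comparing the squared error of the returned average against the closed-form posterior mean $N\bar{d}/(N+1)$, across minibatch sizes under the same budget, mimics the MSE comparison of Figure 1.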

### Applications on deep neural networks

We apply the proposed vrSG-MCMC framework to Bayesian learning of deep neural networks, including the multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN); the latter two have not been empirically evaluated in previous work. Experiments with Bayesian logistic regression are given in Appendix F. In the experiments, we are interested in modeling weight uncertainty of neural networks, an important and well-studied topic (????). We achieve this goal by placing priors on the weights (in our case, simple isotropic Gaussian priors) and performing posterior sampling with vrSG-MCMC or SG-MCMC. We implement vrSG-MCMC based on SGLD, and compare it to standard SGLD and SVRG-LD (?) in our experiments. (The SAGA-LD algorithm in (?) is not compared here because its storage cost is too high for a fair comparison.) For this reason, comparisons to other optimization-based methods, such as maximum likelihood, are not considered. For simplicity, we fix the update interval for the old gradient in Algorithm 1. For all the experiments, the two minibatch sizes for vrSG-MCMC are fixed; to be fair, SGLD and SVRG-LD use a minibatch size with the same per-iteration cost. Sensitivity of model performance w.r.t. minibatch size is tested in Section Parameter sensitivity. For a fair comparison, following convention (??), we plot the number of data passes versus error in the figures. (Since true posterior averages are infeasible, we plot sample averages in terms of accuracy/loss.) Results on the number of data passes versus loss are given in the Appendix. In addition, we use fixed stepsizes in our algorithm for all models except the ResNet model specified below. Following the relevant literature (??), we tune the stepsizes and plot the best results for all the algorithms to ensure fairness. Note that in our Bayesian setup, it is enough to run each algorithm once, since the uncertainty is encoded in the samples.

### Multilayer perceptron

We follow conventional settings (??) and use a single-layer MLP with 100 hidden units and the sigmoid activation function as the nonlinear transformation. We test the MLP on the MNIST and CIFAR-10 datasets. The stepsizes for both vrSG-MCMC and SGLD are set to 0.25 and 0.01 on the two datasets, respectively. Figure 2 plots the number of passes through the data versus test error/loss. Results on the training datasets, including training results for the CNN- and RNN-based deep learning models described below, are provided in Appendix F. It is clear that vrSG-MCMC converges much faster than SGLD, resulting in much lower test errors and loss at the end, especially on the CIFAR-10 dataset. SVRG-LD, though it attains potentially lower errors/loss, converges more slowly than vrSG-MCMC, due to the high computational cost of calculating the old gradient. As a result, we do not compare vrSG-MCMC with SVRG-LD in the remaining experiments.

### Convolutional neural networks

We use the CIFAR-10 dataset, and test two CNN architectures for image classification. The first architecture is a deep convolutional neural network with 4 convolutional layers, denoted C32-C32-C64-C32, where max-pooling is applied to the output of the first three convolutional layers, and a dropout layer is applied to the output of the last convolutional layer. The second architecture is a 20-layer deep residual network (ResNet) with the same setup as in (?). Specifically, we use a decreasing-stepsize scheme for both vrSG-MCMC and SGLD, where the stepsize decays with the number of iterations so far.

Figure 3 plots the number of passes through the data versus test error/loss on both models. Similar to the results on MLP, vrSG-MCMC converges much faster than SGLD, leading to lower test errors and loss. Interestingly, the gap seems larger in the more complicated ResNet architecture; furthermore, the learning curves look much less noisy (smoother) for vrSG-MCMC because of the reduced variance in stochastic gradients.

### Recurrent neural networks

The recurrent neural network with LSTM units (?) is a powerful architecture for modeling sequential data. We consider the task of language modeling on two datasets: the Penn Treebank (PTB) dataset and the WikiText-2 dataset (?). PTB is the smaller of the two, with a vocabulary of size 10,000. We use the default setup of 887,521 tokens for training, 70,390 for validation and 78,669 for testing. WikiText-2 is a larger dataset with 2,088,628 tokens from 600 Wiki articles for training, 217,649 tokens from 60 Wiki articles for validation, and 245,569 tokens from an additional 60 Wiki articles for testing. The total vocabulary size is 33,278.

We adopt the hierarchical LSTM architecture (?). The hierarchy depth is set to 2, with each LSTM containing 200 hidden units. The stepsize is set to 0.5 for both datasets. For more stable training, standard gradient clipping is adopted, where gradients are clipped if their norm exceeds 5. Figure 4 plots the number of passes through the data versus test perplexity on both datasets. The results are consistent with the previous experiments on MLPs and CNNs: vrSG-MCMC achieves faster convergence than SGLD, and its learning curves in terms of test error/loss are also much smoother.

### Parameter sensitivity

Note that one of the main differences between vrSG-MCMC and the recently proposed SVRG-LD (?) is that the former uses minibatches, rather than the full data set, to calculate the old gradient in Algorithm 1, leading to a much more computationally efficient algorithm with theoretical guarantees. This section tests the sensitivity of model performance to the old-gradient minibatch size.

For simplicity, we run the same MLP model described in Section Multilayer perceptron on the CIFAR-10 dataset with the same parameter settings, but vary the old-gradient minibatch size. Figure 5 plots the number of passes through the data versus test errors/loss, where “vrSG-MCMC-Gn” denotes vrSG-MCMC with an old-gradient minibatch of size n. Interestingly, vrSG-MCMC outperforms the baseline SGLD for all tested values. Notably, once the old-gradient minibatch is large enough, the corresponding test errors and loss are very close. This agrees with the intuition that computing the old gradient on the whole training data is not necessary in order to balance the stochastic gradient noise against computational time.

## Conclusion

We investigate the impact of minibatches in SG-MCMC and propose a practical variance-reduction SG-MCMC algorithm to reduce the stochastic gradient noise in SG-MCMC. Compared to existing variance-reduction techniques for SG-MCMC, the proposed method is efficient from both computational and storage perspectives. Theory is developed to guarantee faster convergence rates of vrSG-MCMC compared to standard SG-MCMC algorithms. Extensive experiments on Bayesian learning of deep neural networks verify the theory and demonstrate significant speedups over the corresponding standard SG-MCMC algorithms.

## References

• [Allen-Zhu and Hazan 2016] Allen-Zhu, Z., and Hazan, E. 2016. Variance reduction for faster non-convex optimization. In ICML.
• [Allen-Zhu et al. 2016] Allen-Zhu, Z.; Richtárik, P.; Qu, Z.; and Yuan, Y. 2016. Even faster accelerated coordinate descent using non-uniform sampling. In ICML.
• [Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.
• [Chen et al. 2016] Chen, C.; Ding, N.; Li, C.; Zhang, Y.; and Carin, L. 2016. Stochastic gradient MCMC with stale gradients. In NIPS.
• [Chen, Ding, and Carin 2015] Chen, C.; Ding, N.; and Carin, L. 2015. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS.
• [Chen, Fox, and Guestrin 2014] Chen, T.; Fox, E. B.; and Guestrin, C. 2014. Stochastic gradient Hamiltonian Monte Carlo. In ICML.
• [Şimşekli et al. 2016] Şimşekli, U.; Badeau, R.; Cemgil, A. T.; and Richard, G. 2016. Stochastic Quasi-Newton Langevin Monte Carlo. In ICML.
• [Defazio, Bach, and Lacoste-Julien 2014] Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS.
• [Ding et al. 2014] Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; and Neven, H. 2014. Bayesian sampling using stochastic gradient thermostats. In NIPS.
• [Dubey et al. 2016] Dubey, A.; Reddi, S. J.; Póczos, B.; Smola, A. J.; and Xing, E. P. 2016. Variance reduction in stochastic gradient Langevin dynamics. In NIPS.
• [Durmus et al. 2016] Durmus, A.; Roberts, G. O.; Vilmart, G.; and Zygalakis, K. C. 2016. Fast Langevin based algorithm for MCMC in high dimensions. Technical Report arXiv:1507.02166.
• [Frostig et al. 2015] Frostig, R.; Ge, R.; Kakade, S. M.; and Sidford, A. 2015. Competing with the empirical risk minimizer in a single pass. In COLT.
• [Gan et al. 2015] Gan, Z.; Chen, C.; Henao, R.; Carlson, D.; and Carin, L. 2015. Scalable deep Poisson factor analysis for topic modeling. In ICML.
• [Ghosh 2011] Ghosh, A. P. 2011. Backward and Forward Equations for Diffusion Processes. Wiley Encyclopedia of Operations Research and Management Science.
• [Harikandeh et al. 2015] Harikandeh, R.; Ahmed, M. O.; Virani, A.; Schmidt, M.; Konecný, J.; and Sallinen, S. 2015. Stop wasting my gradients: Practical SVRG. In NIPS.
• [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
• [Hernández-Lobato and Adams 2015] Hernández-Lobato, J. M., and Adams, R. P. 2015. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML.
• [Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
• [Johnson and Zhang 2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS.
• [Lei and Jordan 2016] Lei, L., and Jordan, M. I. 2016. Less than a single pass: Stochastically controlled stochastic gradient method. In NIPS.
• [Li et al. 2016] Li, C.; Chen, C.; Carlson, D.; and Carin, L. 2016. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI.
• [Lian, Wang, and Liu 2017] Lian, X.; Wang, M.; and Liu, J. 2017. Finite-sum composition optimization via variance reduced gradient descent. In AISTATS.
• [Liu, Zhu, and Song 2016] Liu, C.; Zhu, J.; and Song, Y. 2016. Stochastic gradient geodesic MCMC methods. In NIPS.
• [Louizos and Welling 2016] Louizos, C., and Welling, M. 2016. Structured and efficient variational deep learning with matrix Gaussian posteriors. In ICML.
• [Ma, Chen, and Fox 2015] Ma, Y. A.; Chen, T.; and Fox, E. B. 2015. A complete recipe for stochastic gradient MCMC. In NIPS.
• [Mattingly, Stuart, and Tretyakov 2010] Mattingly, J. C.; Stuart, A. M.; and Tretyakov, M. V. 2010. Construction of numerical time-average and stationary measures via Poisson equations. SIAM J. NUMER. ANAL. 48(2):552–577.
• [Merity et al. 2016] Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
• [Reddi et al. 2015] Reddi, S. J.; Hefny, A.; Sra, S.; Póczos, B.; and Smola, A. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS.
• [Reddi et al. 2016] Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic variance reduction for nonconvex optimization. In ICML.
• [Reddi, Sra, and Póczos 2016] Reddi, S. J.; Sra, S.; Póczos, B.; and Smola, A. 2016. Fast stochastic methods for nonsmooth nonconvex optimization. In NIPS.
• [Schmidt, Le Roux, and Bach 2013] Schmidt, M.; Le Roux, N.; and Bach, F. 2013. Minimizing finite sums with the stochastic average gradient. Technical Report arXiv:1309.2388.
• [Schmidt, Le Roux, and Bach 2016] Schmidt, M.; Le Roux, N.; and Bach, F. 2016. Minimizing finite sums with the stochastic average gradient. Mathematical Programming.
• [Shah et al. 2016] Shah, V.; Asteris, M.; Kyrillidis, A.; and Sanghavi, S. 2016. Trading-off variance and complexity in stochastic gradient descent. Technical Report arXiv:1603.06861.
• [Springenberg et al. 2016] Springenberg, J. T.; Klein, A.; Falkner, S.; and Hutter, F. 2016. Bayesian optimization with robust Bayesian neural networks. In NIPS.
• [Teh, Thiery, and Vollmer 2016] Teh, Y. W.; Thiery, A. H.; and Vollmer, S. J. 2016. Consistency and fluctuations for stochastic gradient Langevin dynamics. JMLR (17):1–33.
• [Vollmer, Zygalakis, and Teh 2016] Vollmer, S. J.; Zygalakis, K. C.; and Teh, Y. W. 2016. Exploration of the (Non-)Asymptotic bias and variance of stochastic gradient Langevin dynamics. JMLR.
• [Wang, Fienberg, and Smola 2015] Wang, Y. X.; Fienberg, S. E.; and Smola, A. 2015. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In ICML.
• [Welling and Teh 2011] Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
• [Zaremba, Sutskever, and Vinyals 2014] Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
• [Zhang, Mahdavi, and Jin 2013] Zhang, L.; Mahdavi, M.; and Jin, R. 2013. Linear convergence with condition number independent access of full gradients. In NIPS.

Supplementary Materials for:

A Theory for A Class of Practical Variance-Reduction Stochastic Gradient MCMC

## Appendix A Basic Setup for Stochastic Gradient MCMC

Given data $\mathcal{D} = \{d_1, \ldots, d_N\}$, a generative model

$$p(\mathcal{D}\,|\,\theta) = \prod_{i=1}^{N} p(d_i\,|\,\theta)\,,$$

with model parameter $\theta$, and prior $p(\theta)$, we want to compute the posterior distribution:

$$\rho(\theta) \triangleq p(\theta\,|\,\mathcal{D}) \propto p(\mathcal{D}\,|\,\theta)\,p(\theta) \triangleq e^{-U(\theta)}\,,$$

where

$$\nabla_\theta U(\theta) = -\nabla_\theta \log p(\theta) - \sum_{i=1}^{N} \nabla_\theta \log p(d_i\,|\,\theta)\,. \tag{2}$$

Consider the SDE:

$$\mathrm{d} x_t = F(x_t)\,\mathrm{d} t + g(x_t)\,\mathrm{d} w_t\,, \tag{3}$$

where $x_t \in \mathbb{R}^m$ is the state variable, typically an augmentation of the model parameter $\theta$; $t$ is the time index; $w_t$ is $m$-dimensional Brownian motion; and the functions $F$ and $g$ are assumed to satisfy the usual Lipschitz continuity condition (?). In Langevin dynamics, we have $x_t = \theta_t$ and

$$F(\theta_t) = -\nabla_\theta U(\theta_t)\,, \qquad g(\theta_t) = \sqrt{2}\,\mathrm{I}\,.$$
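To see why these choices target the posterior, note (a standard Fokker–Planck check, included here for completeness rather than taken from the paper) that $\rho(\theta) \propto e^{-U(\theta)}$ is stationary for the resulting Langevin diffusion:

$$\partial_t \rho_t = \nabla \cdot \big(\rho_t\, \nabla U\big) + \Delta \rho_t\,, \qquad \rho \propto e^{-U} \;\Rightarrow\; \nabla \rho = -\rho\, \nabla U \;\Rightarrow\; \rho\, \nabla U + \nabla \rho = 0\,,$$

so the probability flux vanishes and the diffusion leaves the posterior invariant.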

For the SDE in (3), the generator $\mathcal{L}$ is defined as:

$$\mathcal{L}\psi \triangleq \nabla\psi \cdot F + \frac{1}{2}\left(g g^{\top}\right) : D^2\psi\,, \tag{4}$$

where $\psi$ is a measurable function, $D^2\psi$ means the second-order $x$-derivative of $\psi$, and $^{\top}$ means transpose; $a \cdot b \triangleq a^{\top} b$ for two vectors $a$ and $b$, and $A : B \triangleq \mathrm{tr}(A^{\top} B)$ for two matrices $A$ and $B$. Under certain assumptions, there exists a function $\psi$ on $\mathbb{R}^m$ such that the following Poisson equation is satisfied (?):

$$\mathcal{L}\psi = \phi - \bar\phi\,, \tag{5}$$

where $\bar\phi \triangleq \int \phi(x)\,\rho(x)\,\mathrm{d} x$ denotes the model average, with $\rho$ being the equilibrium distribution of the SDE (3).

In stochastic gradient Langevin dynamics (SGLD), we update the parameter at step $l$, denoted as $\theta_l$ (strictly speaking, the parameter should be indexed by "time" instead of "step", i.e., $\theta_{t_l}$ instead of $\theta_l$; we adopt the latter for notational simplicity in the following, and the same applies to the general state $x_l$), using the following discretized method:

$$\theta_{l+1} = \theta_l - \nabla_\theta \tilde U_l(\theta_l)\, h_{l+1} + \sqrt{2 h_{l+1}}\, \zeta_{l+1}\,,$$

where $h_{l+1}$ is the step size, $\zeta_{l+1}$ is a Gaussian random variable with mean 0 and variance 1, and $\nabla_\theta \tilde U_l(\theta_l)$ is an unbiased estimate of $\nabla_\theta U(\theta)$ in (2) computed with a random minibatch of size $n$, e.g.,

$$\nabla_\theta \tilde U_l(\theta_l) = -\nabla_\theta \log p(\theta_l) - \frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(d_{\pi_i}\,|\,\theta_l)\,, \tag{6}$$

where $(\pi_1, \ldots, \pi_n)$ is a subset of a random permutation of $\{1, \ldots, N\}$.
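As an illustration, one SGLD step with the minibatch gradient estimate in (6) can be sketched as follows (a minimal NumPy sketch, not the paper's implementation; `grad_log_prior` and `grad_log_lik` are hypothetical callables supplied by the model):

```python
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, h, n, rng):
    """One SGLD step: theta_{l+1} = theta_l - grad(U~)(theta_l) * h + sqrt(2h) * noise."""
    N = len(data)
    idx = rng.choice(N, size=n, replace=False)  # first n entries of a random permutation
    # Unbiased stochastic estimate of grad U, following the signs in (2):
    grad_U_tilde = -grad_log_prior(theta) - (N / n) * sum(
        grad_log_lik(theta, data[i]) for i in idx
    )
    return theta - h * grad_U_tilde + np.sqrt(2.0 * h) * rng.standard_normal()

# Toy usage: Gaussian prior N(0, 1), Gaussian likelihood N(theta, 1) per datum.
rng = np.random.default_rng(0)
data = rng.standard_normal(100) + 2.0
theta = 0.0
for _ in range(500):
    theta = sgld_step(theta, data,
                      grad_log_prior=lambda th: -th,      # d/dtheta log N(theta; 0, 1)
                      grad_log_lik=lambda th, d: d - th,  # d/dtheta log N(d; theta, 1)
                      h=1e-3, n=10, rng=rng)
# theta now fluctuates around the posterior mean, roughly N * mean(data) / (N + 1)
```

For this conjugate Gaussian toy model the posterior is available in closed form, which makes it a convenient sanity check for the sampler.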

In our analysis, we are interested in the mean squared error (MSE) at iteration $L$, defined as

$$\text{MSE}_L \triangleq \mathbb{E}\big(\hat\phi_L - \bar\phi\big)^2\,,$$

where $\hat\phi_L \triangleq \frac{1}{L}\sum_{l=1}^{L} \phi(\theta_l)$ denotes the sample average, and $\bar\phi$ is the true posterior average defined in (5).

In this paper, for a function $f$ in an $L^p$ space, i.e., a space of functions for which the $p$-th power of the absolute value is Lebesgue integrable, we consider the standard norm defined as ($\|f\|_{L^p}$ is simplified as $\|f\|_p$):

$$\|f\|_p \triangleq \left(\int_{\mathbb{R}^m} |f(x)|^p\,\mathrm{d} x\right)^{1/p} < \infty\,.$$

In order to guarantee well-behaved SDEs and corresponding numerical integrators, following the existing literature such as (??), we impose the following assumptions.

###### Assumption 2

The SDE (3) is ergodic. Furthermore, the solution of (5) exists, and the solution functional $\psi$ of the Poisson equation (5) satisfies the following properties:

• $\psi$ and its derivatives up to 3rd order, $D^k\psi$, are bounded by a function $V$, i.e., $\|D^k\psi\| \leq C_k V^{p_k}$ for $k = 0, 1, 2, 3$ and some constants $C_k, p_k > 0$.

• The expectation of $V$ on $\{x_l\}$ is bounded: $\sup_l \mathbb{E}\, V^p(x_l) < \infty$.

• $V$ is smooth, such that $\sup_{s \in (0,1)} V^p\big(s x + (1-s) y\big) \leq C\big(V^p(x) + V^p(y)\big)$ for all $x, y$ and $p \leq \max_k 2 p_k$, for some $C > 0$.

## Appendix B Proofs of Extended Results for Standard SG-MCMC

First, according to the relevant definition, we note the corresponding property of the solution functional $\psi$ of the Poisson equation (5). Since the relevant quantity is assumed to be bounded for a test function $\phi$, we omit the corresponding operator in the following analysis (it only contributes a constant), a slight abuse of notation adopted for conciseness.

The proofs of Lemma 2 and Theorem 3 are closely related. We first prove Theorem 3; the proof of Lemma 2 then follows directly.

Proof [Proof of Theorem 3]

Let $\alpha_{li}$ denote the term contributed by data point $i$ at iteration $l$ in the gradient-noise decomposition, and

$$z_i = \begin{cases} 1 & \text{if data point } i \text{ is selected in the minibatch} \\ 0 & \text{otherwise}\,, \end{cases}$$

then we have

$$\Delta V_l = \sum_{i=1}^{N} \alpha_{li}\Big(1 - \frac{N}{n} z_i\Big) \;\Rightarrow\; \mathbb{E}|\Delta V_l|^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} \mathbb{E}\big[\alpha_{li}\,\alpha_{lj}\big]\; \mathbb{E}\Big[\Big(1 - \frac{N}{n} z_i\Big)\Big(1 - \frac{N}{n} z_j\Big)\Big]\,,$$

where the factorization uses the independence of the minibatch selection at step $l$ from $\theta_l$.

Since

$$\mathbb{E}\, z_i = \frac{1}{N} + \frac{N-1}{N}\cdot\frac{1}{N-1} + \cdots + \frac{N-1}{N}\cdot\frac{N-2}{N-1}\cdots\frac{N-n+1}{N-n+2}\cdot\frac{1}{N-n+1} = \frac{n}{N}\,,$$

we have $\mathbb{E}\big[\frac{N}{n} z_i\big] = 1$, i.e., $\nabla_\theta \tilde U_l$ is an unbiased estimate of $\nabla_\theta U$.

$$\mathbb{E}\Big(1 - \frac{N}{n} z_i\Big)\Big(1 - \frac{N}{n} z_j\Big) = \mathbb{E}\Big[1 - \frac{N}{n} z_i - \frac{N}{n} z_j + \frac{N^2}{n^2} z_i z_j\Big] = 1 - 2\,\frac{N}{n}\cdot\frac{n}{N} + \frac{N^2}{n^2}\,\mathbb{E}[z_i z_j] = \frac{N^2}{n^2}\,\mathbb{E}[z_i z_j] - 1\,.$$

When $i = j$,

$$\mathbb{E}\Big(1 - \frac{N}{n} z_i\Big)\Big(1 - \frac{N}{n} z_j\Big) = \frac{N^2}{n^2}\,\mathbb{E}\, z_i^2 - 1 = \frac{N^2}{n^2}\,\mathbb{E}\, z_i - 1 = \frac{N}{n} - 1\,.$$

When $i \neq j$, because

$$\mathbb{E}[z_i z_j] = p(i \text{ selected})\; p(j \text{ selected} \,|\, i \text{ selected}) = \frac{n}{N}\cdot\frac{n-1}{N-1}\,,$$

we have

$$\mathbb{E}\Big(1 - \frac{N}{n} z_i\Big)\Big(1 - \frac{N}{n} z_j\Big) = \frac{N^2}{n^2}\,\mathbb{E}[z_i z_j] - 1 = \frac{N}{n}\cdot\frac{n-1}{N-1} - 1\,.$$
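These two selection-indicator moments are easy to verify empirically (a quick Monte Carlo sketch with arbitrary small $N$ and $n$, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, trials = 10, 3, 50_000

# Draw `trials` minibatches of size n without replacement; z[t, i] indicates
# whether data point i was selected in trial t.
perm = rng.random((trials, N)).argsort(axis=1)[:, :n]
z = np.zeros((trials, N), dtype=bool)
np.put_along_axis(z, perm, True, axis=1)

E_zi = z[:, 0].mean()                # should approach n/N = 0.3
E_zizj = (z[:, 0] & z[:, 1]).mean()  # should approach (n/N)(n-1)/(N-1) = 1/15

print(E_zi, n / N)
print(E_zizj, (n / N) * (n - 1) / (N - 1))
```

The same check also confirms the unbiasedness claim, since `E_zi * N / n` is approximately 1.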

As a result,

$$\mathbb{E}|\Delta V_l|^2 = \Big(\sum_{i=1}^{N} \mathbb{E}\,\alpha_{li}^2\Big)\Big(\frac{N}{n} - 1\Big) + 2\sum_{i<j} \mathbb{E}\big[\alpha_{li}\,\alpha_{lj}\big]\Big(\frac{N(n-1)}{n(N-1)} - 1\Big)\,. \tag{B}$$

Because we assume a 1st-order numerical integrator, according to Lemma 1, and combining (B) from above, we have the following bound for the MSE:

$$\mathbb{E}\big(\hat\phi_L - \bar\phi\big)^2 \leq C\left(\frac{(N-n)\, N^2\, \Gamma_M}{n L} + \frac{1}{L h} + h^2\right)\,.$$

Proof [Proof of Lemma 2] The lemma follows directly from (B) and the fact that

$$\mathbb{E}|\Delta V_l|^2 \geq 0\,.$$

Proof [Proof of the optimal MSE bound of Theorem 3]

From the assumption, we have

$$T \propto n L\,. \tag{8}$$

The MSE bound is obtained by directly substituting (8) into the MSE bound in Lemma 2, resulting in

$$\mathbb{E}\big(\hat\phi_L - \bar\phi\big)^2 \leq C\left(\frac{(N-n)\, N^2\, \Gamma_M}{T} + \frac{n}{T h} + h^2\right)\,.$$

After optimizing the above bound over the step size $h$, we have

$$\mathbb{E}\big(\hat\phi_L - \bar\phi\big)^2 \leq C\left(\frac{(N-n)\, N^2\, \Gamma_M}{T} + \frac{n^{2/3}}{T^{2/3}}\right)\,. \tag{9}$$
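The optimization over $h$ in the last step is the standard trade-off between the two $h$-dependent terms (spelled out here for completeness):

$$\frac{\mathrm{d}}{\mathrm{d} h}\left(\frac{n}{T h} + h^2\right) = -\frac{n}{T h^2} + 2h = 0 \;\Rightarrow\; h^{*} = \left(\frac{n}{2T}\right)^{1/3}, \qquad \frac{n}{T h^{*}} + (h^{*})^2 = O\!\left(\frac{n^{2/3}}{T^{2/3}}\right),$$

which yields the second term of (9).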

Proof [Proof of Corollary 4]

To examine the property of the MSE bound (9) w.r.t. $n$, we first note that the derivative can be written as:

$$f \triangleq \frac{\partial}{\partial n}\, \mathbb{E}\big(\hat\phi_L - \bar\phi\big)^2 = O\left(\frac{2}{3\, T^{2/3}\, n^{1/3}} - \frac{\Gamma_M N^2}{T}\right)\,.$$
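This derivative can be sanity-checked numerically by comparing a finite difference of the bound (9) against the closed form (the constants below are arbitrary test values, not from the paper):

```python
import numpy as np

# MSE bound (9) as a function of minibatch size n, with other constants fixed:
#   bound(n) = (N - n) * N^2 * Gamma_M / T + n^(2/3) / T^(2/3)
def bound(n, T, gamma, N):
    return (N - n) * N**2 * gamma / T + n ** (2 / 3) / T ** (2 / 3)

# Closed-form derivative w.r.t. n derived above:
#   f(n) = 2 / (3 T^(2/3) n^(1/3)) - Gamma_M N^2 / T
def f(n, T, gamma, N):
    return 2.0 / (3.0 * T ** (2 / 3) * n ** (1 / 3)) - gamma * N**2 / T

# Central finite difference at an arbitrary test point.
n0, T, gamma, N = 50.0, 1e6, 1e-4, 100.0
eps = 1e-3
fd = (bound(n0 + eps, T, gamma, N) - bound(n0 - eps, T, gamma, N)) / (2 * eps)
assert abs(fd - f(n0, T, gamma, N)) < 1e-9
```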

As a result, we have the following three cases:

• When $f \leq 0$, i.e., the bound is decreasing when $n$ increases, we have $\frac{2}{3\, T^{2/3}\, n^{1/3}} \leq \frac{\Gamma_M N^2}{T}$, i.e., $T \leq \left(\frac{3\,\Gamma_M N^2}{2}\right)^{3} n$. Because $n$ is in the range of $[1, N]$, and we require $f \leq 0$ for all $n$'s, the minimum value of the right-hand side is obtained when taking $n = 1$. Consequently, we have that when