
# Improving Sampling Accuracy of Stochastic Gradient MCMC Methods via Non-uniform Subsampling of Gradients

## Abstract

Common Stochastic Gradient MCMC methods approximate gradients by stochastic ones via uniformly subsampled data points. We propose that non-uniform subsampling can reduce the variance introduced by the stochastic approximation, hence making the sampling of a target distribution more accurate. An exponentially weighted stochastic gradient approach (EWSG) is developed for this objective by matching the transition kernels of SG-MCMC methods respectively based on stochastic and batch gradients. A demonstration of EWSG combined with the second-order Langevin equation for sampling purposes is provided. In our method, non-uniform subsampling is done efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the sampling algorithm. The fact that our method has reduced local variance with high probability is theoretically analyzed. A non-asymptotic global error analysis is also presented. Numerical experiments based on both synthetic and real-world data sets are provided to demonstrate the efficacy of the proposed approach. While statistical accuracy is improved, the speed of convergence was empirically observed to be at least comparable to the uniform version.

## 1 Introduction

Many MCMC methods use physics-inspired evolution such as Langevin dynamics [brooks2011handbook] to utilize gradient information for exploring posterior distributions over continuous parameter spaces more efficiently. However, gradient-based MCMC methods are often limited by the computational cost of computing the gradient on large data sets. Motivated by the great success of stochastic gradient methods for optimization problems, stochastic gradient MCMC methods (SG-MCMC) for sampling distributions have also been gaining increasing attention. When the accurate but expensive-to-evaluate batch gradients in an MCMC method are replaced by computationally cheaper estimates based on a subset of the data, the method is turned into a stochastic gradient version. Successful examples include SG (overdamped) Langevin Dynamics [welling2011bayesian] and SG Hamiltonian Monte Carlo [chen2014stochastic], all of which were designed for scalability suitable for machine learning tasks.

However, directly replacing the batch gradient by a (uniform) stochastic one without additional mitigation will generally cause an MCMC method to sample from a statistical distribution different from the target, because the transition kernel of the MCMC method gets corrupted by the noise of the subsampled gradient. In general, the additional noise is tolerable if the learning rate/step size is tiny or decreasing. However, when large steps are used, the extra noise is non-negligible and undermines the performance of downstream applications such as Bayesian inference.

Our presentation of EWSG will be based on second-order Langevin equations, although it works for other MCMC methods too (e.g., Sec. F). To concentrate on the role of non-uniform weights when approximating the full gradient by a stochastic version, we will work with constant step sizes/learning rates only. The fact that EWSG has smaller local variance than its uniform counterpart is rigorously argued in Theorem 3, and a global non-asymptotic analysis of EWSG is given in Theorem 4 to show its convergence properties and demonstrate its advantage over the uniform stochastic gradient counterpart.

A number of experiments on synthetic and real world data sets, across various downstream machine learning tasks, including Bayesian logistic regression and Bayesian neural networks, are conducted to validate our theoretical results and demonstrate the effectiveness of EWSG. In addition to improved statistical accuracy, the speed of convergence was empirically observed, in a fair comparison setup based on the same data pass, to be at least comparable to, and in some cases faster than, its uniform counterpart. Additional theoretical investigation of convergence speed of EWSG is provided in Section H in appendix.

## 2 Related Work

### 2.1 Stochastic Gradient MCMC Methods

Since the seminal work of SGLD [welling2011bayesian], which joined the forces of stochastic gradients and gradient-based MCMC methods, much progress has been made. [ahn2012bayesian] proposed a modification of SGLD that samples from a Gaussian approximation of the posterior. [patterson2013stochastic] extended SGLD to Riemann manifolds. [teh2016consistency] theoretically justified the convergence of SGLD and gave practical recommendations for tuning the step size. [li2016preconditioned] introduced a preconditioner and greatly improved the stability of SGLD. We also refer to [maclaurin2015firefly] and [fu2017cpsg], which will be discussed in Sec. 5. While these works were mostly based on first-order (overdamped) Langevin dynamics, other dynamics were considered too; for instance, [chen2014stochastic] proposed SGHMC, which is closely related to second-order (underdamped) Langevin dynamics [bou2018geometric, bou2018coupling]. Second-order Langevin dynamics is faster than the first-order version in certain situations [cheng2017underdamped, cheng2018sharp] and has started to gain more attention in the machine learning community.

### 2.2 Variance Reduction

Vanilla stochastic gradient methods usually find approximate solutions relatively quickly, but the convergence slows down when an accurate solution is needed [bach2013stochastic, johnson2013accelerating]. The stochastic average gradient algorithm [schmidt2013minimizing] improved the convergence speed of stochastic gradient methods to a linear rate, the same as gradient descent with full gradients, at the expense of a large memory overhead. Stochastic variance reduced gradient (SVRG) [johnson2013accelerating] successfully reduced this memory overhead. [dubey2016variance] applied a variance reduction technique to SGLD and obtained a tighter bound.

EWSG is effective in reducing the variance introduced by the stochastic gradient; however, it is based on a new idea: matching the transition kernels of MCMC. It can thus be combined with traditional VR approaches (e.g., Sec. G).

### 2.3 Importance Sampling

Importance sampling, when combined with stochastic gradient methods, is a useful technique to reduce variance. Stochastic gradient methods with static importance sampling fix a probability distribution that does not change along iterations. However, computing the fixed distribution often requires prior information about the gradient terms, e.g., Lipschitz constants, upper bounds, or smoothness constants [needell2014stochastic, schmidt2015non, csiba2018importance], which can be difficult to compute or estimate. On the other hand, stochastic gradient methods with adaptive importance sampling re-evaluate the importance at each iteration, whose computation usually requires the entire data set per parameter update [zhao2015stochastic, zhu2016gradient].

The proposed approach can be categorized as an adaptive importance sampling method. However, it does not require the full data set per parameter update; instead, an inner-loop Metropolis chain is designed for a random data index to approximate a state-dependent non-uniform distribution.

## 3 Background

Underdamped Langevin Dynamics is described by a diffusion process governed by the following SDE

$$\begin{cases} d\theta = r\,dt \\ dr = -\left(\nabla V(\theta) + \gamma r\right)dt + \sigma\,dW \end{cases} \qquad (1)$$

where $\theta$ is a state (position) variable, $r$ is a momentum variable, $V(\theta)$ is a potential energy function which in our context (originating from cost minimization or Bayesian inference over many data) is the sum of many terms $V(\theta) = \sum_{i=1}^n V_i(\theta)$, $\gamma$ is a friction coefficient, $\sigma$ is the intrinsic noise amplitude, and $W$ is a standard multi-dimensional Wiener process. Under mild assumptions on the potential (e.g., [pavliotis2014stochastic]), Langevin dynamics admits a unique invariant distribution

$$\pi(\theta, r) = Z^{-1}\exp\left(-\frac{1}{T}\left(V(\theta) + \frac{\|r\|^2}{2}\right)\right) \qquad (2)$$

and is in many cases geometrically ergodic, where $Z$ is a normalization constant and $T$ is the temperature of the system, determined by the friction and noise amplitudes via the fluctuation-dissipation theorem, $T = \sigma^2/(2\gamma)$ [kubo1966fluctuation].

The main reason for considering underdamped Langevin rather than the overdamped one is that underdamped Langevin can converge faster than overdamped Langevin, in particular in high-dimensional spaces [cheng2017underdamped]. Like the overdamped version, numerical integrators for underdamped Langevin that well capture the statistical properties of the continuous process have been extensively investigated [roberts1996exponential, bou2010long], and both the overdamped and underdamped integrators are friendly to derivations that will allow us to obtain explicit expressions of the non-uniform weights.

Terminology-wise, $\nabla V(\theta)$ will be called the full/batch gradient, $n\nabla V_I(\theta)$ with a random index $I$ will be called a stochastic gradient (SG), and when $I$ is uniformly distributed it will be called a uniform SG/subsampling, otherwise non-uniform. When uniform SG is used to approximate the batch gradient in underdamped Langevin, the method will be referred to as (vanilla) stochastic gradient underdamped Langevin dynamics (SGULD), and it serves as a baseline in experiments.

## 4 Main Work

### 4.1 Non-optimality of Uniform Subsampling

Uniform subsampling of gradients has long been the dominant way of constructing stochastic gradient approximations. In many machine learning applications, cases where the data size $n$ is much larger than the problem dimension $d$ are not uncommon. In such cases, the individual gradients $\nabla V_i(\theta)$ are linearly dependent, and hence there can exist probability distributions other than the uniform one for which the gradient estimate remains unbiased. This opens the door to developing non-uniform subsampling schemes (whose weights may be $\theta$-dependent), which can help reduce the additionally introduced variance while maintaining unbiasedness.
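To make the linear-dependence point concrete, the following sketch (with hypothetical sizes $n = 20$, $d = 3$) constructs the error vectors $b_i = n\nabla V_i(\theta) - \nabla V(\theta)$ from random stand-in gradients and finds a non-uniform probability vector that still yields an unbiased gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3                              # hypothetical sizes, n >> d

grads = rng.normal(size=(n, d))           # stand-ins for grad V_i(theta)
full = grads.sum(axis=0)                  # grad V(theta)
B = (n * grads - full).T                  # d x n matrix whose columns are b_i

# Unbiasedness means B @ p = 0 for a probability vector p. Stack the
# simplex constraint 1^T p = 1 and compute the null space of [B; 1^T].
A = np.vstack([B, np.ones((1, n))])
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[A.shape[0]:]              # (n - d - 1) null-space directions

p_uniform = np.full(n, 1.0 / n)
t = 0.01                                  # small enough to keep p >= 0
p = p_uniform + t * null_basis[0]         # a non-uniform unbiased distribution

assert np.all(p >= 0) and abs(p.sum() - 1.0) < 1e-10
assert np.linalg.norm(B @ p) < 1e-8       # estimator is still unbiased
```

Since the null space is $(n - d - 1)$-dimensional for generic data, uncountably many unbiased non-uniform weightings exist, and one can search among them for lower variance.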

In fact, in a reasonable setup, it turns out that an optimal way of subsampling gradients is far from uniform:

###### Theorem 1

Suppose that, given $\theta$, the errors of the stochastic gradient approximation

$$b_i = n\nabla V_i(\theta) - \nabla V(\theta), \quad i = 1, 2, \cdots, n$$

are i.i.d. absolutely continuous random vectors with a possibly $\theta$-dependent density. Call a probability vector $p$ sparse if the number of its non-zero entries is no greater than $d+1$. Then, with probability 1, the optimal probability distribution $p$ that is unbiased and minimizes the trace of the covariance of $b_I$, i.e., which solves

$$\min_p \ \mathrm{Tr}\left(\mathbb{E}_{I\sim p}\left[b_I b_I^T\right]\right) \quad \text{s.t. } \ \mathbb{E}_{I\sim p}\left[b_I\right] = 0, \qquad (3)$$

is a sparse vector.

Despite the sparsity of the optimal $p$, which seemingly suggests that one only needs a few gradient terms (which terms may vary across iterations) when using stochastic gradient methods, this is not practical, because computing $p$ requires solving the linear programming problem (3) in Theorem 1, for which an entire data pass is needed. Nevertheless, this result still shows that uniform gradient subsampling can be far from optimal, and it motivates us to propose an exponentially weighted stochastic gradient method, which has reduced local variance with high probability and at the same time remains efficiently implementable without necessarily using all the data per parameter update.

### 4.2 Exponentially Weighted Stochastic Gradient

MCMC methods, or Markov processes in general, are characterized by their transition kernels. In traditional SG-MCMC methods, uniform gradient subsampling is used, which is completely independent of the intrinsic randomness of the MCMC method (e.g., the diffusion in underdamped Langevin). As a result, the transition kernel of an SG-MCMC method can be quite different from that of the gradient-based MCMC method with full gradient. It is therefore natural to ask: is it possible to couple these two originally independent sources of randomness so that the transition kernels are better matched, and the sampling accuracy hence improved?

Consider the Euler-Maruyama (EM) discretization of eq. (1):

$$\begin{cases} \theta_{k+1} = \theta_k + r_k h \\ r_{k+1} = r_k - \left(\nabla V(\theta_k) + \gamma r_k\right) h + \sigma\sqrt{h}\,\xi_{k+1} \end{cases} \qquad (4)$$

where $h$ is the step size and the $\xi_k$'s are i.i.d. $d$-dimensional standard Gaussian random variables.
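As a concrete illustration, here is a minimal sketch of the EM step of eq. (4), using a hypothetical quadratic potential $V(\theta) = \|\theta\|^2/2$ as a stand-in (so $\nabla V(\theta) = \theta$):

```python
import numpy as np

def em_step(theta, r, grad_V, h, gamma, sigma, rng):
    """One Euler-Maruyama step of eq. (4) for underdamped Langevin."""
    theta_next = theta + r * h
    xi = rng.standard_normal(r.shape)               # standard Gaussian noise
    r_next = r - (grad_V(theta) + gamma * r) * h + sigma * np.sqrt(h) * xi
    return theta_next, r_next

rng = np.random.default_rng(1)
theta, r = np.ones(2), np.zeros(2)
gamma = 2.0
sigma = np.sqrt(2.0 * gamma)                        # temperature T = 1
for _ in range(1000):
    theta, r = em_step(theta, r, lambda th: th, h=0.01,
                       gamma=gamma, sigma=sigma, rng=rng)
```

Replacing `grad_V` by a subsampled estimate `n * grad_V_I` turns this into the stochastic gradient version discussed next.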

Denote the transition kernel of the EM discretization with full gradient by $P_{EM}$. If $\nabla V(\theta_k)$ is replaced by a weighted stochastic gradient $n\nabla V_I(\theta_k)$, where $I$ is the index of the datum chosen to approximate the full gradient and has probability mass function $P(I = i) = p_i$, denote the transition kernel by $\widetilde{P}_{EM}$.

It turns out that we can choose in a smart way to match the two transition kernels:

###### Theorem 2

Denote $x = \frac{r_{k+1} - r_k + h\gamma r_k}{\sigma\sqrt{h}}$ and $a_i = \frac{\sqrt{h}}{\sigma}\nabla V_i(\theta_k)$. If we set

$$p_i = \hat{Z}^{-1}\exp\left\{-\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2} + \frac{\left\|x + n a_i\right\|^2}{2}\right\} \qquad (5)$$

where $\hat{Z}$ is a normalization constant, then the two transition kernels are identical, i.e.,

$$\widetilde{P}_{EM}(\theta_{k+1}, r_{k+1}\,|\,\theta_k, r_k) = P_{EM}(\theta_{k+1}, r_{k+1}\,|\,\theta_k, r_k)$$

We refer to this choice as the Exponentially Weighted Stochastic Gradient (EWSG). Note the idea of designing non-uniform weights of the stochastic gradient so as to match the transition kernel of the full-gradient method can be suitably applied to a wide class of gradient-based MCMC methods; for example, Sec. F shows how EWSG can be applied to Langevin Monte Carlo (the overdamped Langevin equation). In this sense, EWSG is complementary to a wide range of classical and contemporary SG-MCMC approaches.
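The mechanism of Theorem 2 can be verified numerically: multiplying the unnormalized weight of eq. (5) by the Gaussian density factor of the $i$-th stochastic-gradient transition yields the full-gradient density factor, independently of $i$. A small sketch with random stand-in values for $x$ and the $a_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 3
a = 0.1 * rng.normal(size=(n, d))   # stand-ins for a_i = sqrt(h) grad V_i / sigma
x = rng.normal(size=d)              # stand-in for the rescaled momentum update
s = a.sum(axis=0)                   # sum_j a_j, i.e., the full-gradient term

# Unnormalized EWSG weights of eq. (5).
p_tilde = np.exp(-np.sum((x + s) ** 2) / 2
                 + np.sum((x + n * a) ** 2, axis=1) / 2)

# Weight times the i-th stochastic-gradient Gaussian factor is independent
# of i and equals the full-gradient factor: the kernels match (Theorem 2).
products = p_tilde * np.exp(-np.sum((x + n * a) ** 2, axis=1) / 2)
full_factor = np.exp(-np.sum((x + s) ** 2) / 2)
assert np.allclose(products, full_factor)
```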

Compared with vanilla stochastic gradients, exponentially weighted stochastic gradients have smaller variance with high probability, as is shown in Theorem 3.

###### Theorem 3

Assume $\nabla V_1(\theta), \cdots, \nabla V_n(\theta)$ are i.i.d. random vectors, bounded almost surely by a constant. Denote the uniform distribution over $\{1, \cdots, n\}$ by $p^u$, the exponentially weighted distribution (5) by $p$, and let $\Delta$ be the difference between the traces of the covariances of the stochastic gradient under $p$ and under $p^u$. If the step size $h$ is small enough, we have

$$\mathbb{E}[\Delta] < 0,$$

and there exists a constant $C$, independent of $n$ and $h$, such that for any $\epsilon > 0$

$$P\left(\left|\Delta - \mathbb{E}[\Delta]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{\epsilon^2}{nCh^2}\right)$$

Intuitively, less nonintrinsic local variance means better global statistical accuracy, and this reasoning will be made more rigorous in Section 4.4.

### 4.3 Practical Implementation

In EWSG, the probability of each gradient term is

$$p_i = \hat{Z}^{-1}\exp\left\{-\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2} + \frac{\left\|x + n a_i\right\|^2}{2}\right\}$$

Although the term $\frac{1}{2}\|x + \sum_{j=1}^n a_j\|^2$ depends on the full data set, it is shared by all $p_i$'s and can be absorbed into the normalization constant $\hat{Z}$ (we still include it explicitly due to the needs of the analyses in the proofs); unique to each $p_i$ is only the term $\frac{1}{2}\|x + n a_i\|^2$, which involves a single data point. This motivates us to run a Metropolis-Hastings chain over the possible indices $\{1, \cdots, n\}$: at each inner-loop step, a proposal index $j$ is drawn uniformly, and then accepted with probability

$$P(i \to j) = \min\left\{1, \exp\left(\frac{\|x + n a_j\|^2}{2} - \frac{\|x + n a_i\|^2}{2}\right)\right\}; \qquad (6)$$

if accepted, the current index value $i$ is replaced by $j$. When this Markov chain converges, it is easy to see that the index follows the distribution given by eq. (5). The advantage is that we avoid explicitly passing through the entire data set to compute each $p_i$, and yet the index still samples from the non-uniform distribution efficiently.
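A sketch of this inner index chain (with random stand-in values for $x$ and the $a_i$; after many steps, the visit frequencies approach the weights of eq. (5)):

```python
import numpy as np

def index_chain_step(i, x, a, n, rng):
    """One Metropolis-Hastings step on the data index, acceptance rule (6)."""
    j = rng.integers(n)                               # uniform proposal
    log_acc = (np.sum((x + n * a[j]) ** 2)
               - np.sum((x + n * a[i]) ** 2)) / 2
    return j if np.log(rng.random()) < log_acc else i

rng = np.random.default_rng(3)
n, d = 10, 2
a = 0.05 * rng.normal(size=(n, d))    # stand-ins for a_i
x = rng.normal(size=d)                # stand-in for x

i, counts = 0, np.zeros(n)
for _ in range(30000):
    i = index_chain_step(i, x, a, n, rng)
    counts[i] += 1

# Long-run visit frequencies approximate the EWSG weights p_i of eq. (5);
# the data-set-wide term cancels in the normalization.
p = np.exp(np.sum((x + n * a) ** 2, axis=1) / 2)
p /= p.sum()
assert np.allclose(counts / counts.sum(), p, atol=0.03)
```

Each chain step touches only the two data points indexed by `i` and `j`, which is what makes the non-uniform weighting affordable.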

In practice, we often perform only one step of the Metropolis chain per integration step, especially if $h$ is not too large. The rationale is that when $h$ is small, the integration timescale is slower than the index-chain timescale. Note that although the efficacy of local variance reduction via non-uniform subsampling is more pronounced when $h$ is larger (see e.g., Theorem 4), in which case a single inner step may no longer be the optimal choice, improved sampling with a large $h$ and one inner step per iteration is still clearly observed in numerical experiments on various sampling/learning tasks (Section 5).

Another hyper-parameter is $x$, because $x$ depends on the future state $r_{k+1}$, which we do not know, and we would like to avoid expensive nonlinear solves. Therefore, in our experiments, we choose $x = 0$. This corresponds to a deterministic MLE of $r_{k+1}$, which is a sufficient (but not necessary) condition for mimicking the statistical equilibrium at which $r_{k+1}$ and $r_k$ are equal in distribution. This approximation turned out to be a good one in all our experiments with medium $h$ and $n$. Because it is only an approximation, when $h$ is large the method still introduces extra variance (smaller than that caused by the vanilla stochastic gradient, though), and a larger $h$ may actually decrease the accuracy of sampling.

The EWSG algorithm is summarized in Algorithm 1. For simplicity of notation, we restrict the description to the single-datum stochastic gradient case (i.e., minibatch size 1), but an extension to larger minibatches (i.e., minibatch SG) is straightforward via multidimensional random indices. See Section E in the appendix for details.

EWSG has reduced variance but does not completely eliminate the nonintrinsic noise created by the stochastic gradient, due to these approximations. A small bias is also created by these approximations, but its effect is dominated by the variance effect (see Sec. 4.4). In practice, if needed, one can combine EWSG with other variance reduction techniques to further improve accuracy. We showcase how EWSG can be combined with SVRG in Sec. G of the appendix.

### 4.4 Theoretical Analysis

We now provide a non-asymptotic analysis of the global sampling error of EWSG. We first define some notation and list the required assumptions, and then state the main results. Detailed proofs are deferred to the appendix.

The generator of underdamped Langevin (1) is given by

$$\mathcal{L}f(X_t) = \left(r^T\nabla_\theta - \left(\gamma r + \nabla V(\theta)\right)^T\nabla_r + \gamma\Delta_r\right) f(X_t)$$

where $X_t = (\theta_t, r_t)$. $\mathcal{L}$ is associated with an integrated form $e^{t\mathcal{L}}$ via Kolmogorov's backward equation. Given a test function $\phi$, its posterior average is $\bar{\phi} = \int \phi \, d\pi$, where $\pi$ is the invariant distribution defined in Eq. (2), and we use the time average $\hat{\phi}_K = \frac{1}{K}\sum_{k=1}^{K}\phi(X_k)$ of samples to approximate $\bar{\phi}$, where $\{X_k\}$ is the sample path of a Markov chain (e.g., given by the EM integrator). A useful tool in the weak convergence analysis of SG-MCMC is the following Poisson equation [mattingly2010convergence, vollmer2016exploration, chen2015convergence]:

$$\mathcal{L}\psi = \phi - \bar{\phi} \qquad (7)$$

The solution $\psi$ characterizes the difference between the test function $\phi$ and its posterior average $\bar{\phi}$.

We make the following three assumptions:

###### Assumption 1 (Bounded Gradient of Potential)

The gradients of all potential terms are uniformly bounded: there exists $G < \infty$ such that

$$\|n\nabla V_i(\theta)\| \le G, \quad \forall i, \theta.$$

Hence we also have $\|\nabla V(\theta)\| \le G$.

###### Assumption 2 (Bounded Moments of Momentum)

Assume the $p$-th moments of the momentum variable are uniformly bounded across all iterations:

$$\mathbb{E}\left[\|r_k\|^p\right] \le M_p < \infty.$$

Hence, by Hölder's inequality, all lower moments are also uniformly bounded. In our proof, only moments up to a fixed finite order are required.

###### Assumption 3 (Bounded Solution of Poisson Equation)

The solution $\psi$ of the Poisson equation (7) and its derivatives up to $l$-th order are uniformly bounded:

$$\|D^l\psi\|_\infty \le H_l < \infty.$$

Only derivatives up to a fixed finite order are needed in our proof.

We are now ready to bound the mean squared error (MSE) for stochastic gradient underdamped Langevin algorithms, including both SGULD and the proposed EWSG algorithm.

###### Theorem 4

Under Assumptions 1, 2 and 3, for both the SGULD and EWSG algorithms, there exists a constant $C > 0$ such that

$$\mathbb{E}\left(\hat{\phi}_K - \bar{\phi}\right)^2 \le C\left(\frac{1}{T} + \frac{h}{T}\cdot\frac{\sum_{k=0}^{K-1}\mathbb{E}\left[\mathrm{Tr}\left[\mathrm{cov}\left(n\nabla V_{I_k}\,|\,\mathcal{F}_k\right)\right]\right]}{K} + h^2\right) \qquad (8)$$

where $T = Kh$ is the corresponding time in the underlying continuous dynamics, $I_k$ is the index of the datum used to estimate the gradient at the $k$-th iteration, and $\mathrm{cov}(n\nabla V_{I_k}\,|\,\mathcal{F}_k)$ is the covariance of the stochastic gradient at the $k$-th iteration, conditioned on the current sigma algebra $\mathcal{F}_k$ in the filtration.

Remark: We follow the powerful framework developed in [mattingly2010convergence] to prove this bound, and to this end, our notation in the proof is made consistent with theirs. One key difference is that [mattingly2010convergence] only discusses the batch gradient case, whereas our theory additionally quantifies the effect of non-uniform stochastic gradients. Note that [vollmer2016exploration, chen2015convergence] studied the effect of stochastic gradients, but the SG considered there did not use state-dependent weights, which would destroy several martingales used in their proofs. In addition, our result incorporates the effects of both the local bias and the local variance of an SG approximation. Unlike in [mattingly2010convergence], but as in [vollmer2016exploration, chen2015convergence], our state space is not a compact torus but the whole Euclidean space.

The variance and bias of the stochastic gradient approximation are respectively reflected in the 2nd and 3rd terms of the above bound, although the 3rd term also contains a contribution from the numerical integration error. Note the 2nd term is larger than the 3rd in general due to its lower order in $h$, which means reducing the local variance can improve the sampling accuracy even if this comes at the cost of introducing a small bias. Having a smaller local variance is the main advantage of EWSG over uniform SG (see e.g., Thm. 3).

## 5 Experiments

In this section, the proposed EWSG algorithm is compared with SGULD, the classical SGLD method [welling2011bayesian], as well as several recent popular SG-MCMC methods, including Firefly Monte Carlo (FlyMC) [maclaurin2015firefly], pSGLD [li2016preconditioned], and CP-SG-MCMC [fu2017cpsg] (the latter motivated by combining importance sampling with SG-MCMC). Test problems include sampling from a simple Gaussian distribution, fitting a misspecified Gaussian model, Bayesian logistic regression, and Bayesian neural networks (BNN). As FlyMC requires a tight lower bound on the likelihood, which is unknown for many models, it is only compared against in Sec. 5.2 and 5.3, where such a bound is obtainable. CP-SG-MCMC requires heavy tuning of the number of clusters, which differs across data sets/algorithms, so it is only included in the BNN example, for which the authors empirically found a good hyper-parameter for MNIST.

For fair comparison, all algorithms use constant step sizes and are allowed a fixed computation budget, i.e., for a given number of data passes, all algorithms are allowed the same number of gradient evaluations. All experiments are conducted on a machine with a 2.20GHz Intel(R) Xeon(R) E5-2630 v4 CPU and an Nvidia GeForce GTX 1080 GPU. If not specifically mentioned, the noise coefficient $\sigma$ is set via the fluctuation-dissipation relation so that only the friction $\gamma$ needs to be specified in each experiment, the length of the index chain is set to 1 for EWSG, and the two hyper-parameters required by pSGLD are set to the default values suggested in [li2016preconditioned].

### 5.1 A Simple Gaussian Example

Consider sampling from a simple two-dimensional Gaussian distribution whose potential is a sum of $n$ quadratic terms determined by data points randomly sampled from a two-dimensional standard normal. Due to the simplicity of the model, we can write the target density analytically, and are thus able to report sample quality quantitatively and compare with the vanilla stochastic gradient method on an objective basis. To this end, we use the Kullback-Leibler divergence to measure how different the target distribution and the samples generated by simulation are; for two Gaussians the KL divergence has a closed-form expression in terms of their means and covariances, which we use to estimate the divergence between the generated samples and the target distribution.
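For reference, the closed-form KL divergence between two Gaussians mentioned above can be sketched as follows; in the experiment, one would plug in the empirical mean and covariance of the generated samples for the first argument:

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) )."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Sanity checks: zero for identical Gaussians, positive otherwise.
mu, S = np.zeros(2), np.eye(2)
assert abs(gaussian_kl(mu, S, mu, S)) < 1e-12
assert gaussian_kl(mu, S, mu + 1.0, S) > 0
```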

For each algorithm, we generate 10000 independent samples. All algorithms are run for 30 data passes and a minibatch size of 1 is used for all of them. The step size is tuned over a grid, with the best value chosen separately for SGLD and pSGLD and for SGULD and EWSG. For SGULD and EWSG, both of which are based on underdamped Langevin, the friction coefficient is tuned as well. The results are shown in Figure 1(a). From the figure, we observe that both SGULD and EWSG outperform the two benchmarks SGLD and pSGLD. Between the two, EWSG converges to a smaller KL divergence than SGULD, which implies EWSG achieves better statistical accuracy than its uniform stochastic gradient counterpart.

We also compare SGULD and EWSG over a large range of step sizes, and plot the mean absolute error of the sample covariance matrix against the autocorrelation time in Figure 1(b). When simulating a gradient-based Markov chain, generally speaking, a large step size reduces the autocorrelation time, yet leads to large discretization error. We observe from Figure 1(b) that at the same autocorrelation time, EWSG achieves a smaller error of the covariance estimate than SGULD, which demonstrates the effectiveness of EWSG from a more statistical perspective.

### 5.2 A Misspecified Gaussian Case

In this subsection, we follow the same setup as in [bardenet2017markov] and study a misspecified Gaussian model in which one fits a one-dimensional normal distribution to i.i.d. points drawn from a data-generating distribution the model cannot capture, with a flat prior assigned to the parameters. It was shown in [bardenet2017markov] that the FlyMC algorithm behaves erratically in this case, as "bright" data points with large values are rarely updated, and they drive samples away from the target distribution. Consequently the chain mixes very slowly. One important commonality FlyMC shares with EWSG is that in each iteration, both algorithms select a subset of data in a non-uniform fashion. Therefore, it is interesting to investigate the performance of EWSG on this misspecified model.

For the FlyMC variant used in this experiment, a tight lower bound based on Taylor expansion is used to minimize the number of "bright" data points used per iteration. At each iteration, 10% of the data points are resampled and turned "on/off" accordingly, and the step size is adaptively adjusted. The FlyMC algorithm is run for 10000 iterations. Figure 2(a) shows the histogram of the number of data points used per iteration by FlyMC. On average, FlyMC consumes a certain fraction of all data points per iteration; for fair comparison, the minibatch size of EWSG is set to match this budget, and we run EWSG for 1090 data passes. The step size and friction coefficient for EWSG are tuned. An isotropic random-walk Metropolis-Hastings (MH) chain is also run for sufficiently long and serves as the ground truth.

Figure 2(b) shows the autocorrelation of the three algorithms. The autocorrelation of FlyMC decays very slowly; samples that are even 500 iterations apart still show strong correlation. The autocorrelation of EWSG, on the other hand, decays much faster, suggesting EWSG explores the parameter space more efficiently than FlyMC does. Figures 2(c) and 2(d) show the samples (the first 1000 samples are discarded as burn-in) generated by EWSG and FlyMC respectively. The samples of EWSG center around the mode of the target distribution, while the samples of FlyMC are still far away from the true posterior. The experiment shows EWSG works quite well even on misspecified models, and hence is an effective candidate for combining importance sampling with scalable Bayesian inference.

### 5.3 Bayesian Logistic Regression

Consider Bayesian logistic regression for the binary classification problem. The probabilistic model for predicting a label given a feature vector is the logistic model, and we set a Gaussian prior with zero mean and isotropic covariance for the parameter. We conduct our experiments on the Covertype data set, which contains 581,012 data points and 54 features. Given the large size of this data set, SG is needed to scale up MCMC methods. We use 80% of the data for training and the remaining 20% for testing.
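For illustration, here is a minimal sketch of the per-datum potential gradient $\nabla V_i$ for this model (sigmoid likelihood, Gaussian prior split evenly across the $n$ terms; the prior variance and the synthetic data below are hypothetical placeholders, not the paper's setup):

```python
import numpy as np

def grad_V_i(theta, x_i, y_i, n, prior_var=1.0):
    """Gradient of one potential term V_i in Bayesian logistic regression.

    V(theta) = -log posterior = sum_i V_i(theta); the Gaussian prior is
    split evenly across the n terms (prior_var is a hypothetical value).
    """
    p = 1.0 / (1.0 + np.exp(-x_i @ theta))        # sigmoid(theta^T x_i)
    return (p - y_i) * x_i + theta / (n * prior_var)

rng = np.random.default_rng(4)
n, d = 100, 5
X = rng.normal(size=(n, d))                       # synthetic features
y = (X @ np.ones(d) > 0).astype(float)            # synthetic labels
theta = 0.1 * rng.normal(size=d)

full = sum(grad_V_i(theta, X[i], y[i], n) for i in range(n))   # grad V
I = rng.integers(n)
stoch = n * grad_V_i(theta, X[I], y[I], n)        # one-term stochastic gradient
```

Averaging $n\nabla V_I$ over a uniformly random index $I$ recovers the full gradient, which is the unbiasedness that both the uniform and EWSG weightings maintain or approximate.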

The FlyMC algorithm uses a lower bound on the likelihood function derived in [maclaurin2015firefly]. For the underdamped-Langevin-based algorithms, we set the friction coefficient after tuning, and step sizes are likewise tuned for SGULD, EWSG, SGLD, pSGLD and FlyMC. All algorithms are run for one data pass, with a minibatch size of 50 (for FlyMC, this means 50 data points are sampled in each iteration to switch state). 20 independent samples are drawn from each algorithm to estimate statistics. To further smooth out noise, all experiments are repeated 10 times with different seeds.

We plot learning curves in Fig. 3 and report the final test accuracy and log likelihood on the test set in Table 1. The final log likelihood of EWSG outperforms that of many competitors, and is comparable to FlyMC, which is an exact MCMC method. Moreover, EWSG achieves the best test accuracy.

### 5.4 Bayesian Neural Network

Bayesian inference is compelling for deep learning (see e.g. a recent review [wilson2020case]), and here we apply our algorithm to Bayesian neural networks (BNN). Two popular architectures of neural nets are experimented with – multilayer perceptron (MLP) and convolutional neural nets (CNN). In the MLP architecture, a hidden layer with 100 neurons followed by a softmax layer is used. In the CNN, we use a standard network configuration with 2 convolutional layers followed by 2 fully connected layers [jarrett2009best]. The two convolutional layers use convolution kernels with 32 and 64 channels respectively, and each is immediately followed by a max pooling layer. The last two fully-connected layers each have 200 neurons. We set the standard normal as the prior for all weights and biases.

We test the algorithms on the MNIST data set, which consists of 60000 training data and 10000 test data; each datum is a gray-scale image with one of ten possible labels (digits 0-9). For the underdamped-Langevin-based algorithms, we set the friction coefficient separately for MLP and for CNN. In MLP, the step sizes are set via grid search for EWSG, SGULD and CP-SGULD, and for SGLD and pSGLD. For CP-SGULD (clustering-based preprocessing [fu2017cpsg] is conducted before SGULD), we use K-means with 10 clusters to preprocess the data set. In CNN, the step sizes are likewise set via grid search. All algorithms use a minibatch size of 100 and are run for 200 data passes. For each algorithm, we generate 10 independent samples to estimate posterior distributions and make predictions accordingly. To smooth out noise and obtain more significant results, we repeat all experiments 10 times with different seeds.

The learning curves and final test errors are shown in Figure 4 and Table 2. We find EWSG consistently improves over its uniform gradient subsampling counterpart SGULD, as well as over CP-SGULD, which is motivated by marrying importance sampling with SG-MCMC. Moreover, EWSG also outperforms the two standard benchmarks SGLD and pSGLD.

In each iteration of EWSG, we run an index Markov chain and select a "good" minibatch to estimate the gradient; EWSG therefore effectively consumes more than $b$ data points per iteration, where $b$ is the minibatch size. How does EWSG compare with its uniform gradient subsampling counterpart run with a larger minibatch size?

We empirically answer this question in the context of BNN with the MLP architecture. We use the same step size for SGULD and EWSG and experiment over a large range of values of the minibatch size and index chain length. Each algorithm is run for 200 data passes, and 10 independent samples are drawn to estimate the test error. The results are shown in Table 3. We find that EWSG beats SGULD with a larger minibatch in 8 out of 9 comparison groups, which suggests that in general EWSG could be a better way of consuming data than increasing the minibatch size, and may shed light on other areas where stochastic gradient methods are used (e.g., optimization).

## 6 Conclusion

In this paper, we proposed EWSG, which uses exponentially weighted subsampling of gradients to match the transition kernel of a base MCMC method with full gradient. The goal is better sample quality. Both a local variance analysis and a global non-asymptotic analysis are presented to demonstrate the advantage of EWSG theoretically. Empirical results also showed improved sampling/learning performance. We believe non-uniform stochastic gradients can be introduced to a large class of MCMC methods and are capable of impactful algorithmic improvements.

## Acknowledgements

The authors thank Wenlong Mou for discussions. MT was partially supported by NSF DMS-1847802 and ECCS-1936776.

## Appendix A Proof of Theorem 1

Proof: Denote the set of all $n$-dimensional probability vectors by $\Delta_n$, the set of sparse probability vectors by $S$, and the set of non-sparse (dense) probability vectors by $D$. Denote $B = [b_1, b_2, \cdots, b_n]$; then the optimization problem can be written as

$$\min \sum_{i=1}^n p_i\|b_i\|^2 \quad \text{s.t.}\quad \begin{cases} Bp = 0 \\ p^T\mathbf{1}_n = 1 \\ p_i \ge 0,\ i = 1, 2, \cdots, n \end{cases}$$

Note that the feasible region is always non-empty (take $p$ to be the uniform distribution) and is also closed and bounded; hence this linear program is always solvable. Denote the set of all minimizers by $M$. Note that $M$ depends on $b_1, \cdots, b_n$ and is in this sense random.
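The conclusion of Theorem 1 can also be checked numerically by solving the linear program directly; a sketch assuming SciPy's `linprog` is available (sizes hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, d = 30, 3
g = rng.normal(size=(n, d))               # stand-ins for grad V_i(theta)
b = n * g - g.sum(axis=0)                 # rows are b_i, so sum_i b_i = 0
c = np.sum(b ** 2, axis=1)                # objective coefficients ||b_i||^2

# Equality constraints: B p = 0 (unbiasedness) and 1^T p = 1; p >= 0 holds
# by linprog's default bounds.
A_eq = np.vstack([b.T, np.ones((1, n))])
b_eq = np.append(np.zeros(d), 1.0)
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))

assert res.success                        # the uniform p is feasible
# Theorem 1: the minimizer is sparse, with at most d + 1 non-zero entries.
assert np.sum(res.x > 1e-6) <= d + 1
```

The sparsity falls out of the LP structure: vertex solutions of a linear program with $d+1$ equality constraints have at most $d+1$ non-zero coordinates.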

The Lagrange function is

$$L(p, \lambda, \mu, \omega) = p^T s - \lambda^T B p - \mu\left(p^T\mathbf{1}_n - 1\right) - \omega^T p$$

where $s = \left(\|b_1\|^2, \cdots, \|b_n\|^2\right)^T$ and $\lambda, \mu, \omega$ are dual variables. The optimality condition reads

$$\frac{\partial L}{\partial p} = s - B^T\lambda - \mu\mathbf{1}_n - \omega = 0$$

Dual feasibilty and complementary slackness require

$$\omega_i \le 0, \ i = 1, 2, \cdots, n; \qquad \omega^T p = 0$$

Consider the probability of the event that a dense probability vector solves the above minimization problem, i.e., $P(M\cap D \neq \emptyset)$. It is upper bounded by

$$P(M\cap D \neq \emptyset) \le P\left(\exists p \in D \text{ such that } p \text{ solves the KKT conditions}\right)$$

Since $p \in D$, $p$ has at least $d+2$ nonzero entries, so complementary slackness implies that at least $d+2$ entries of $\omega$ are zero. Denote the indices of these entries by $J$. For every $j \in J$, by the optimality condition, we have $\omega_j = \|b_j\|^2 - \lambda^T b_j - \mu = 0$, i.e.,

$$\|b_j\|^2 - \lambda^T b_j - \mu = 0$$

Take the first $d+1$ indices in $J$, and note the geometric fact that $d+1$ points in a $d$-dimensional space must lie on the surface of a hypersphere of dimension at most $d-1$, which we denote by $\mathcal{S}(c, k)$ for some center vector $c$ and integer $k \le d-1$. Because the distribution of each $b_j$ is absolutely continuous, we have

$$\begin{aligned} P(p \in D \text{ and } p \text{ solves the KKT conditions}) &\le P(p \in D \text{ and } b_j \in \mathcal{S}, \; \forall j \in J) \\ &\le P(b_j \in \mathcal{S}, \; \forall j \in J) \\ &= P(b_{j_k} \in \mathcal{S}, \; k = d+2, \cdots, |J|) \\ &= \prod_{k=d+2}^{|J|} P(b_{j_k} \in \mathcal{S}) && \text{(independence)} \\ &= 0 && \text{(absolute continuity)} \end{aligned}$$

Hence $P(M \cap D \ne \emptyset) = 0$ and

$$1 = P(M \ne \emptyset) = P\big((M \cap S) \cup (M \cap D) \ne \emptyset\big) \le P(M \cap S \ne \emptyset) + P(M \cap D \ne \emptyset) = P(M \cap S \ne \emptyset)$$

Therefore we have

$$P(M \cap S \ne \emptyset) = 1$$
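As a numerical sanity check of this sparsity claim (our own illustration, not part of the paper), one can solve the linear program for random Gaussian $b_i$ and count the support of the returned solution. We interpret the constraint $Bp = 0$ as the unbiasedness condition $\sum_i p_i b_i = \bar b$ with $\bar b$ the sample mean, and we assume SciPy's `linprog` with the dual-simplex method, which returns a basic (vertex) solution:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, d = 50, 3
b = rng.normal(size=(n, d))        # absolutely continuous b_i
b_bar = b.mean(axis=0)

# min_p sum_i p_i ||b_i||^2  s.t.  sum_i p_i b_i = b_bar,  sum_i p_i = 1,  p >= 0
c = np.sum(b ** 2, axis=1)
A_eq = np.vstack([b.T, np.ones(n)])   # d + 1 equality constraints
b_eq = np.append(b_bar, 1.0)
# Dual simplex returns a vertex (basic) solution, which can have at most
# d + 1 nonzero entries -- a sparse minimizer as guaranteed by Theorem 1.
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs-ds")
support = int(np.sum(res.x > 1e-9))
```

The feasible region is non-empty because the uniform vector satisfies all constraints, matching the argument at the start of the proof.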

## Appendix B Proof of Theorem 2

Proof: The transition kernel of the EM discretization with full gradient can be explicitly written as

$$P^{EM}(\theta_{k+1}, r_{k+1} \mid \theta_k, r_k) = \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big)\, \Phi\!\left(\frac{r_{k+1} - r_k + h\gamma r_k + h \nabla V(\theta_k)}{\sigma \sqrt{h}}\right) \frac{1}{\sigma \sqrt{h}}$$

where $\delta$ is the Dirac delta function and $\Phi$ is the probability density of the $d$-dimensional standard normal distribution. For brevity, denote $x = \frac{r_{k+1} - r_k + h\gamma r_k}{\sigma\sqrt{h}}$ and $a_i = \frac{h \nabla V_i(\theta_k)}{\sigma\sqrt{h}}$, so that $x + \sum_{j=1}^n a_j$ is the argument of $\Phi$ above.

Denote the unnormalized probability of index $i$ by

$$\tilde{p}_i = \exp\left\{ -\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2} + \frac{\|x + n a_i\|^2}{2} \right\}$$

and the normalization constant by

$$\hat{Z} = \sum_{i=1}^n \int \tilde{p}_i \, dr_{k+1}.$$

Then the transition kernel of EWSG can be written as

$$\begin{aligned} \tilde{P}^{EM}(\theta_{k+1}, r_{k+1} \mid \theta_k, r_k) &= \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big) \sum_{i=1}^n p_i \,\Phi\!\left(\frac{r_{k+1} - r_k + h\gamma r_k + hn\nabla V_i(\theta_k)}{\sigma\sqrt{h}}\right) \frac{1}{\sigma\sqrt{h}} \\ &= \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big) \sum_{i=1}^n \frac{\tilde{p}_i}{\hat{Z}} \,\Phi\!\left(\frac{r_{k+1} - r_k + h\gamma r_k + hn\nabla V_i(\theta_k)}{\sigma\sqrt{h}}\right) \frac{1}{\sigma\sqrt{h}} \\ &= \frac{1}{\hat{Z}}\, \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big) \sum_{i=1}^n \exp\left\{-\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2} + \frac{\|x + n a_i\|^2}{2}\right\} \frac{1}{\sqrt{(2\pi)^d}} \exp\left\{-\frac{\|x + n a_i\|^2}{2}\right\} \frac{1}{\sigma\sqrt{h}} \\ &= \frac{n}{\hat{Z}\sqrt{(2\pi)^d}}\, \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big) \exp\left\{-\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2}\right\} \frac{1}{\sigma\sqrt{h}} \end{aligned}$$

Recall that the transition kernel of the EM integrator with full gradient is

$$\begin{aligned} P^{EM}(\theta_{k+1}, r_{k+1} \mid \theta_k, r_k) &= \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big)\, \Phi\!\left(\frac{r_{k+1} - r_k + h\gamma r_k + h\nabla V(\theta_k)}{\sigma\sqrt{h}}\right) \frac{1}{\sigma\sqrt{h}} \\ &= \delta\big(\theta_{k+1} - (\theta_k + r_k h)\big) \frac{1}{\sqrt{(2\pi)^d}} \exp\left\{-\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2}\right\} \frac{1}{\sigma\sqrt{h}} \end{aligned}$$

As both transition kernels are normalized probability kernels and are proportional to

$$\delta\big(\theta_{k+1} - (\theta_k + r_k h)\big)\exp\left\{-\frac{\left\|x + \sum_{j=1}^n a_j\right\|^2}{2}\right\},$$

we therefore conclude that

$$P^{EM}(\theta_{k+1}, r_{k+1} \mid \theta_k, r_k) = \tilde{P}^{EM}(\theta_{k+1}, r_{k+1} \mid \theta_k, r_k)$$
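The key cancellation above, that weighting each Gaussian factor by $\tilde p_i$ collapses the sum over indices onto the full-gradient exponent, can be checked numerically. The snippet below is our own illustration with arbitrary made-up values for $x$ and the $a_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10, 3
x = rng.normal(size=d)
a = 0.1 * rng.normal(size=(n, d))
s = a.sum(axis=0)                 # sum_j a_j

# Unnormalized EWSG index weights, as in the proof above
p_tilde = np.exp(-np.sum((x + s) ** 2) / 2 + np.sum((x + n * a) ** 2, axis=1) / 2)

# Each weighted Gaussian factor equals exp(-||x + sum_j a_j||^2 / 2) exactly,
# so summing over the n indices gives n copies of the full-gradient exponent.
lhs = np.sum(p_tilde * np.exp(-np.sum((x + n * a) ** 2, axis=1) / 2))
rhs = n * np.exp(-np.sum((x + s) ** 2) / 2)
```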

## Appendix C Proof of Theorem 3

Proof: Let $b_i = \nabla V_i(\theta)$ and assume $b_i \in B(0, R)$ for some constant $R$, where $B(0,R)$ is the ball of radius $R$ centered at the origin in $\mathbb{R}^d$. Denote $B = (b_1, \cdots, b_n)$. For any probability distribution $p$ over $\{1, \cdots, n\}$, we have

$$\mathrm{cov}_{I \sim p}[b_I \mid b_1, \cdots, b_n] = \sum_{i=1}^n p_i b_i b_i^T - \left(\sum_{i=1}^n p_i b_i\right)\left(\sum_{i=1}^n p_i b_i\right)^T = \sum_{i<j} p_i p_j (b_i - b_j)(b_i - b_j)^T$$

Therefore we let

$$f(B) := \mathrm{Tr}\left[\sum_{i<j}\left(p_i p_j - \frac{1}{n^2}\right)(b_i - b_j)(b_i - b_j)^T\right] = \sum_{i<j}\left(p_i p_j - \frac{1}{n^2}\right)\|b_i - b_j\|^2$$

and use it to compare the traces of the covariance matrices of uniform and non-uniform subsampling.
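The pairwise rewriting of the covariance trace used to define $f(B)$ can be verified numerically; this short snippet (our own illustration, with random made-up inputs) checks $\mathrm{Tr}\,\mathrm{cov}_{I \sim p}[b_I] = \sum_{i<j} p_i p_j \|b_i - b_j\|^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 4
b = rng.normal(size=(n, d))
p = rng.random(n)
p /= p.sum()

# Trace of cov_{I~p}[b_I] computed directly from the definition ...
mean = p @ b
cov = np.einsum('i,ij,ik->jk', p, b, b) - np.outer(mean, mean)
direct = np.trace(cov)

# ... and via the pairwise form sum_{i<j} p_i p_j ||b_i - b_j||^2
pair = sum(p[i] * p[j] * np.sum((b[i] - b[j]) ** 2)
           for i in range(n) for j in range(i + 1, n))
```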

First of all,

$$E[f(B)] = \sum_{i<j} E\left[\left(p_i p_j - \frac{1}{n^2}\right)\|b_i - b_j\|^2\right] \le 0$$

where the inequality is due to Cauchy-Schwarz, and it is strict unless all $\|b_i - b_j\|$'s are equal; this means that, measured by the trace of the covariance matrix, uniform subsampling on average has larger variability than the non-uniform scheme.

Moreover, a concentration inequality can show that $f$ is negative with high probability when $h$ is small. To this end, plug in the EWSG weights and rewrite

$$p_i = \frac{1}{Z}\exp\left\{F h\left[\left\|y + \frac{1}{n}\sum_{j=1}^n b_j\right\|^2 - \|y + b_i\|^2\right]\right\}$$

where $y$ and $F$ are quantities that do not depend on the index $i$, and $Z$ is the normalization constant. Denote the unnormalized probability by

$$\tilde{p}_i = \exp\left\{F h\left[\left\|y + \frac{1}{n}\sum_{j=1}^n b_j\right\|^2 - \|y + b_i\|^2\right]\right\}$$

and we have

$$f(B) = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \|b_i - b_j\|^2\left(p_i p_j - \frac{1}{n^2}\right) = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \|b_i - b_j\|^2 \frac{\tilde{p}_i \tilde{p}_j}{\left[\sum_{k=1}^n \tilde{p}_k\right]^2} - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \|b_i - b_j\|^2 \frac{1}{n^2}$$

To prove concentration results, it is useful to estimate

$$C_i = \sup_{\substack{b_1, \cdots, b_n \in B(0,R) \\ \hat{b}_i \in B(0,R)}} \left| f(b_1, \cdots, b_i, \cdots, b_n) - f(b_1, \cdots, \hat{b}_i, \cdots, b_n) \right|$$

where $B(0,R)$ is the ball of radius $R$ centered at the origin in $\mathbb{R}^d$.

Due to the mean value theorem, we have $C_i \le 2R \sup \left\|\frac{\partial f}{\partial b_i}\right\|$. By symmetry, it suffices to compute $\frac{\partial f}{\partial b_1}$ to upper bound all the $C_i$. Note that

$$\frac{\partial \tilde{p}_j}{\partial b_1} = 2\tilde{p}_j F h\left[\frac{1}{n}\left(y + \frac{1}{n}\sum_{i=1}^n b_i\right) - (y + b_j)\delta_{1j}\right] = O(h)\,\tilde{p}_j$$

where $\delta_{1j}$ is the Kronecker delta. Thus

$$\begin{aligned} \frac{\partial f}{\partial b_1} &= \sum_{j=1}^n (b_1 - b_j)\frac{\tilde{p}_1 \tilde{p}_j}{\left[\sum_{k=1}^n \tilde{p}_k\right]^2} - \sum_{j=1}^n (b_1 - b_j)\frac{1}{n^2} + \sum_{i,j=1}^n \|b_i - b_j\|^2\, O(h)\,\frac{\tilde{p}_i \tilde{p}_j}{\left[\sum_{k=1}^n \tilde{p}_k\right]^2} - 2\sum_{i,j=1}^n \|b_i - b_j\|^2 \frac{\tilde{p}_i \tilde{p}_j}{\left[\sum_{k=1}^n \tilde{p}_k\right]^3}\sum_{k=1}^n \tilde{p}_k\, O(h) \\ &= \tilde{p}_1 \sum_{j=1}^n (b_1 - b_j)\frac{\tilde{p}_j}{\left[\sum_{k=1}^n \tilde{p}_k\right]^2} - \sum_{j=1}^n (b_1 - b_j)\frac{1}{n^2} + O(n^2)\, O(h)\, O(n^{-2}) + \frac{O(n^2)}{O(n^3)}\, O(n)\, O(h) \\ &= O\!\left(\frac{h}{n}\right) + O(h) + O(h) \\ &= O(h) \end{aligned}$$

where the $O(h/n)$ in the second-to-last line comes from the difference of the first two terms in the line above (each $\tilde{p}_i = 1 + O(h)$, so $\frac{\tilde{p}_1\tilde{p}_j}{[\sum_k \tilde{p}_k]^2} - \frac{1}{n^2} = O(h/n^2)$ term by term). This estimate shows that $C_i = O(h)$.

Therefore, by McDiarmid’s inequality, we conclude that for any $\epsilon > 0$,

$$P(|f - E[f]| > \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n C_i^2}\right) = 2\exp\left(-\frac{2\epsilon^2}{n\, O(h^2)}\right).$$

Any choice of $h = o(n^{-1/2})$ will render this probability asymptotically vanishing as $n$ grows, which means that $f$ will be negative with high probability, which is equivalent to reduced variance per step.

## Appendix D Proof of Theorem 4

Proof: We rewrite the generator of underdamped Langevin with full gradient as

$$\mathcal{L}f(X) = F(X)^T \begin{bmatrix} \nabla_\theta f(X) \\ \nabla_r f(X) \end{bmatrix} + \frac{1}{2} A : \nabla\nabla f(X)$$

where

$$F(X) = \begin{bmatrix} r \\ -\gamma r - \nabla V(\theta) \end{bmatrix}, \quad A = GG^T \quad \text{and} \quad G = \begin{bmatrix} O_{d\times d} & O_{d\times d} \\ O_{d\times d} & \sqrt{2\gamma}\, I_{d\times d} \end{bmatrix}$$

Rewrite the discretized underdamped Langevin dynamics with stochastic gradient in the variable $X^E = (\theta, r)$:

$$X^E_{k+1} - X^E_k = h F_k(X^E_k) + \sqrt{h}\, G_k \eta_{k+1}$$

where

$$F_k(X) = \begin{bmatrix} r \\ -\gamma r - n\nabla V_{I_k}(\theta) \end{bmatrix}, \quad G_k = G = \begin{bmatrix} O_{d\times d} & O_{d\times d} \\ O_{d\times d} & \sqrt{2\gamma}\, I_{d\times d} \end{bmatrix}$$

and $\eta_{k+1}$ is a