Improving Sampling Accuracy of Stochastic Gradient MCMC Methods via Nonuniform Subsampling of Gradients
Abstract
Common stochastic gradient MCMC (SGMCMC) methods approximate gradients by stochastic ones computed from uniformly subsampled data points. We propose that nonuniform subsampling can reduce the variance introduced by this stochastic approximation and hence make the sampling of a target distribution more accurate. An exponentially weighted stochastic gradient approach (EWSG) is developed for this purpose by matching the transition kernels of SGMCMC methods based respectively on stochastic and batch gradients. A demonstration of EWSG combined with the second-order Langevin equation for sampling purposes is provided. In our method, nonuniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the sampling algorithm. We theoretically analyze the fact that our method has reduced local variance with high probability, and a non-asymptotic global error analysis is also presented. Numerical experiments on both synthetic and real-world data sets demonstrate the efficacy of the proposed approach. While statistical accuracy improves, the speed of convergence is empirically observed to be at least comparable to that of the uniform version.
1 Introduction
Many MCMC methods use physics-inspired evolution such as Langevin dynamics [brooks2011handbook] to utilize gradient information for exploring posterior distributions over continuous parameter spaces more efficiently. However, gradient-based MCMC methods are often limited by the computational cost of evaluating the gradient on large data sets. Motivated by the great success of stochastic gradient methods for optimization problems, stochastic gradient MCMC methods (SGMCMC) for sampling distributions have also been gaining increasing attention. When the accurate but expensive-to-evaluate batch gradients in an MCMC method are replaced by computationally cheaper estimates based on a subset of the data, the method is turned into a stochastic gradient version. Successful examples include SG (overdamped) Langevin dynamics [welling2011bayesian] and SG Hamiltonian Monte Carlo [chen2014stochastic], both of which were designed to scale to large machine learning tasks.
However, directly replacing the batch gradient by a (uniform) stochastic one without additional mitigation will generally cause an MCMC method to sample from a statistical distribution different from the target, because the transition kernel of the MCMC method gets corrupted by the noise of the subsampled gradient. In general, the additional noise is tolerable if the learning rate/step size is tiny or decreasing. However, when larger steps are used, the extra noise is nonnegligible and undermines the performance of downstream applications such as Bayesian inference.
In this paper, we present a state-dependent nonuniform SGMCMC algorithm termed the exponentially weighted stochastic gradient method (EWSG), which is in line with the efforts of uniform SGMCMC methods toward better scalability. The novelty of our approach is that, unlike uniform gradient subsampling approaches which aim only at an unbiased gradient estimator, it is motivated by directly matching the transition kernel of an SGMCMC method with the transition kernel of a full-gradient-based MCMC method, and this matching naturally leads to the EWSG algorithm. All SGMCMC methods contain two sources of stochasticity: the intrinsic randomness of MCMC, and the randomness introduced by gradient subsampling. In conventional treatments where uniform gradient subsampling is used, the latter randomness is independent of the former, and thus when they are coupled together, variances add up. EWSG, on the other hand, dynamically chooses the weight of each datum according to the current state of the MCMC, and is able to keep the transition kernel of the Markov process close to that of a gradient-based MCMC method with full gradient. Therefore, the invariant distribution of EWSG (if existent) will be close to that of a full-gradient-based MCMC method, and this is how better accuracy in sampling the target distribution can be achieved.
Our presentation of EWSG will be based on second-order Langevin equations, although it works for other MCMC methods too (e.g., Sec. F). To concentrate on the role of nonuniform weights when approximating the full gradient by a stochastic version, we will work with constant step sizes/learning rates only. The fact that EWSG has smaller local variance than its uniform counterpart is rigorously argued in Theorem 3, and a global non-asymptotic analysis of EWSG is given in Theorem 4 to show its convergence properties and demonstrate its advantage over the uniform stochastic gradient counterpart.
A number of experiments on synthetic and real-world data sets, across various downstream machine learning tasks including Bayesian logistic regression and Bayesian neural networks, are conducted to validate our theoretical results and demonstrate the effectiveness of EWSG. In addition to improved statistical accuracy, the speed of convergence was empirically observed, in a fair comparison setup based on the same number of data passes, to be at least comparable to, and in some cases faster than, that of the uniform counterpart. Additional theoretical investigation of the convergence speed of EWSG is provided in Section H of the appendix.
2 Related Work
2.1 Stochastic Gradient MCMC Methods
Since the seminal work on SGLD [welling2011bayesian], which joined the forces of stochastic gradients and gradient-based MCMC methods, much progress has been made. [ahn2012bayesian] proposed a modification of SGLD which samples from a Gaussian approximation of the posterior. [patterson2013stochastic] extended SGLD to Riemannian manifolds. [teh2016consistency] theoretically justified the convergence of SGLD and gave practical recommendations for tuning the step size. [li2016preconditioned] introduced a preconditioner and greatly improved the stability of SGLD. We also refer to [maclaurin2015firefly] and [fu2017cpsg], which will be discussed in Sec. 5. While these works were mostly based on first-order (overdamped) Langevin dynamics, other dynamics have been considered too; for instance, [chen2014stochastic] proposed SGHMC, which is closely related to second-order (underdamped) Langevin dynamics [bou2018geometric, bou2018coupling]. Second-order Langevin dynamics is faster than the first-order version in certain situations [cheng2017underdamped, cheng2018sharp] and has started to gain more attention in the machine learning community.
2.2 Variance Reduction
Vanilla stochastic gradient methods usually find approximate solutions relatively quickly, but the convergence slows down when an accurate solution is needed [bach2013stochastic, johnson2013accelerating]. The stochastic average gradient (SAG) algorithm [schmidt2013minimizing] improved the convergence of stochastic gradient methods to a linear rate, the same as gradient descent with full gradients, at the expense of a large memory overhead. Stochastic variance reduced gradient (SVRG) [johnson2013accelerating] successfully removed this memory overhead. [dubey2016variance] applied variance reduction techniques to SGLD and obtained a tighter bound.
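As a concrete illustration of the control-variate idea behind SVRG, the following sketch applies the estimator to toy quadratic potentials $U_i(x) = \|x - a_i\|^2/2$ with synthetic data $a_i$ (for quadratic terms the estimator happens to be exact for every index, which makes the construction easy to check):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
a = rng.normal(size=(N, d))          # synthetic data; U_i(x) = 0.5*||x - a_i||^2

def grad_term(x, i):                 # gradient of the i-th potential term
    return x - a[i]

def full_grad(x):                    # sum_i grad U_i(x)
    return N * x - a.sum(axis=0)

def svrg_grad(x, x_snap, g_snap, i):
    # SVRG estimator for the full gradient (sum convention):
    # N*(g_i(x) - g_i(snapshot)) + full gradient at snapshot.
    # Unbiased over uniform i; its variance shrinks as x approaches x_snap.
    return N * (grad_term(x, i) - grad_term(x_snap, i)) + g_snap

x, x_snap = rng.normal(size=d), rng.normal(size=d)
g_snap = full_grad(x_snap)           # recomputed occasionally in a real loop
# for quadratic U_i the control variate cancels the subsampling noise exactly:
err = max(np.abs(svrg_grad(x, x_snap, g_snap, i) - full_grad(x)).max()
          for i in range(N))
```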
EWSG is effective in reducing the variance introduced by the stochastic gradient; however, it is based on a new idea: matching the transition kernels of MCMC methods. It can thus be combined with traditional variance reduction approaches (e.g., Sec. G).
2.3 Importance Sampling
Importance sampling, when combined with stochastic gradient methods, is a useful technique for reducing variance. Stochastic gradient methods with static importance sampling fix a probability distribution that does not change across iterations. However, computing this fixed distribution often requires prior information about the gradient terms, e.g., Lipschitz constants, upper bounds, or smoothness [needell2014stochastic, schmidt2015non, csiba2018importance], which can be difficult to compute or estimate. On the other hand, stochastic gradient methods with adaptive importance sampling re-evaluate the importance at each iteration, whose computation usually requires the entire data set per parameter update [zhao2015stochastic, zhu2016gradient].
The proposed approach can be categorized as an adaptive importance sampling method. However, it does not require the full data set per parameter update; instead, an inner-loop Metropolis chain on a random data index is designed to approximate a state-dependent nonuniform distribution.
3 Background
Underdamped Langevin dynamics is described by a diffusion process governed by the following SDE:

(1) $dq = p\,dt, \qquad dp = -\nabla U(q)\,dt - \gamma p\,dt + \sigma\,dW_t,$

where $q \in \mathbb{R}^d$ is a state (position) variable, $p \in \mathbb{R}^d$ is a momentum variable, $U$ is a potential energy function which in our context (originating from cost minimization or Bayesian inference over many data) is the sum of many terms, $U(q) = \sum_{i=1}^{N} U_i(q)$, $\gamma > 0$ is a friction coefficient, $\sigma > 0$ is an intrinsic noise amplitude, and $W_t$ is a standard multidimensional Wiener process. Under mild assumptions on the potential (e.g., [pavliotis2014stochastic]), Langevin dynamics admits a unique invariant distribution

(2) $\pi(q, p) = \frac{1}{Z} \exp\left(-\frac{1}{T}\left(U(q) + \frac{\|p\|^2}{2}\right)\right)$

and is in many cases geometrically ergodic, where $Z$ is a normalization constant and $T = \sigma^2/(2\gamma)$ is the temperature of the system, determined by friction and noise via the fluctuation-dissipation theorem [kubo1966fluctuation].
The main reason for considering underdamped rather than overdamped Langevin is that the underdamped dynamics can converge faster, in particular in high-dimensional spaces [cheng2017underdamped]. Like the overdamped version, numerical integrators for underdamped Langevin that accurately capture the statistical properties of the continuous process have been extensively investigated [roberts1996exponential, bou2010long], and both the overdamped and underdamped integrators are amenable to derivations that allow us to obtain explicit expressions for the nonuniform weights.
Terminology-wise, $\nabla U$ will be called the full/batch gradient; $N \nabla U_I$ with a random index $I$ will be called a stochastic gradient (SG), uniform if $I$ is uniformly distributed and nonuniform otherwise. When a uniform SG is used to approximate the batch gradient in underdamped Langevin, the method will be referred to as (vanilla) stochastic gradient underdamped Langevin dynamics (SGULD), and it serves as a baseline in experiments.
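For concreteness, here is a minimal sketch of the SGULD baseline on a toy target with quadratic potential terms $U_i(q) = \|q - a_i\|^2/(2N)$ and synthetic data $a_i$; the hyperparameter values are illustrative and not taken from the experiments below:

```python
import numpy as np

# Toy target: U(q) = sum_i ||q - a_i||^2 / (2N) has minimizer mean(a) and unit
# Hessian; at temperature T = 1 the invariant position marginal is N(mean(a), I).
rng = np.random.default_rng(0)
N, d = 100, 2
a = rng.normal(size=(N, d))
gamma, T = 2.0, 1.0
sigma = np.sqrt(2 * gamma * T)        # fluctuation-dissipation relation
h = 0.05                              # illustrative constant step size

def sg(q, i):                         # stochastic gradient N * grad U_i(q)
    return q - a[i]

def sguld_step(q, p):
    i = rng.integers(N)               # uniform subsampling (SGULD baseline)
    q_new = q + h * p
    p_new = (p - h * sg(q, i) - h * gamma * p
             + sigma * np.sqrt(h) * rng.normal(size=d))
    return q_new, p_new

q, p = np.zeros(d), np.zeros(d)
traj = []
for k in range(20000):
    q, p = sguld_step(q, p)
    if k >= 2000:                     # discard burn-in
        traj.append(q.copy())
traj = np.asarray(traj)
```

The trajectory mean should hover near the data mean, with the subsampling noise slightly inflating the stationary variance, which is exactly the effect EWSG is designed to reduce.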
4 Main Work
4.1 Nonoptimality of Uniform Subsampling
Uniform subsampling of gradients has long been the dominant way of constructing stochastic gradient approximations. In many machine learning applications, however, it is not uncommon that the data size $N$ is much larger than the problem dimension $d$. In such cases, the gradients $\nabla U_1, \dots, \nabla U_N$ are linearly dependent, and hence there can exist probability distributions other than the uniform one for which the gradient estimate remains unbiased. This opens the door to nonuniform subsampling schemes (whose weights may be state-dependent), which can reduce the additional variance introduced by subsampling while maintaining unbiasedness.
In fact, in a reasonable setup, it turns out that an optimal way of subsampling gradients is far from uniform:
Theorem 1
Suppose that, given the position $q$, the errors of the stochastic gradient approximation, $r_i = N \nabla U_i(q) - \nabla U(q)$, $i = 1, \dots, N$, are i.i.d. absolutely continuous random vectors with a possibly $q$-dependent density. Call $w \in \mathbb{R}^N$ a sparse vector if the number of nonzero entries in $w$ is no greater than $d + 1$. Then, with probability 1, the optimal probability distribution $w$ that is unbiased and minimizes the trace of the covariance of the stochastic gradient $N \nabla U_I$, i.e., which solves

(3) $\min_{w} \sum_{i=1}^{N} w_i \|N \nabla U_i\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i N \nabla U_i = \nabla U, \quad \sum_{i=1}^{N} w_i = 1, \quad w_i \ge 0,$

is a sparse vector.
Despite the sparsity of the optimal $w$, which seemingly suggests that one only needs to use at most $d + 1$ gradient terms per parameter update (which terms may vary across iterations), this is not practical, because finding $w$ requires solving the linear program (3) in Theorem 1, for which an entire data pass is needed. Nevertheless, this result still shows that uniform gradient subsampling can be far from optimal, and it motivates us to propose an exponentially weighted stochastic gradient method, which has reduced local variance with high probability and at the same time remains efficiently implementable without necessarily using all the data per parameter update.
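The linear program behind this observation can be sketched numerically. The demo below uses synthetic stand-in gradients (not real per-datum gradients) and scipy's LP solver; since the unbiasedness constraint pins down the estimator's mean, minimizing the trace of the covariance reduces to minimizing $\mathbb{E}\|N \nabla U_I\|^2$, which is linear in the weights:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, d = 40, 3
g = rng.normal(size=(N, d))                   # synthetic stand-ins for grad U_i

cost = np.sum((N * g) ** 2, axis=1)           # E||N grad U_I||^2 = sum_i w_i cost_i
# unbiasedness (d rows) plus the simplex constraint (1 row):
A_eq = np.vstack([N * g.T, np.ones(N)])
b_eq = np.concatenate([g.sum(axis=0), [1.0]])
res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
nonzeros = int(np.sum(res.x > 1e-9))          # a vertex solution has <= d+1 nonzeros
```

The uniform distribution $w_i = 1/N$ is always feasible here, yet the optimum concentrates on at most $d + 1$ indices, illustrating the sparsity claim.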
4.2 Exponentially Weighted Stochastic Gradient
MCMC methods, and Markov processes in general, are characterized by their transition kernels. In traditional SGMCMC methods, uniform gradient subsampling is used, which is completely independent of the intrinsic randomness of the MCMC method (e.g., the diffusion in underdamped Langevin); as a result, the transition kernel of the SGMCMC method can be quite different from that of the full-gradient MCMC method. It is therefore natural to ask: is it possible to couple these two originally independent sources of randomness so that the transition kernels are better matched, and the sampling accuracy hence improved?
Consider the Euler-Maruyama (EM) discretization of (1),

(4) $q_{k+1} = q_k + p_k h, \qquad p_{k+1} = p_k - \nabla U(q_k)\,h - \gamma p_k h + \sigma \sqrt{h}\,\xi_{k+1},$

where $h$ is the step size and the $\xi_k$'s are i.i.d. $d$-dimensional standard Gaussian random variables.
Denote the transition kernel of the EM discretization with full gradient by $P(q_{k+1}, p_{k+1} \mid q_k, p_k)$. If $\nabla U$ is replaced by a weighted stochastic gradient $N \nabla U_I$, where $I$ is the index of the datum chosen to approximate the full gradient and has probability mass function $\mathbb{P}(I = i) = p_i$, denote the transition kernel by $\tilde{P}(q_{k+1}, p_{k+1} \mid q_k, p_k)$.
It turns out that we can choose the weights $p_i$ in a smart way to match the two transition kernels:
Theorem 2
Denote by $\xi_i = \frac{1}{\sigma\sqrt{h}}\left(p_{k+1} - p_k + \gamma p_k h + N \nabla U_i(q_k)\,h\right)$ the noise realization that index $i$ would require in order to produce the transition from $(q_k, p_k)$ to $(q_{k+1}, p_{k+1})$, and by $Z$ the corresponding normalization constant. If we set

(5) $p_i = \frac{1}{Z} \exp\left(-\frac{\|\xi_i\|^2}{2}\right),$

then the two transition kernels are identical, i.e., $P = \tilde{P}$.
We refer to this choice of $p_i$ as the Exponentially Weighted Stochastic Gradient (EWSG). Note the idea of designing nonuniform weights of stochastic gradient MCMC to match the transition kernel of the full-gradient method can be suitably applied to a wide class of gradient-based MCMC methods; for example, Sec. F shows how EWSG can be applied to Langevin Monte Carlo (the overdamped Langevin equation). In this sense, EWSG is complementary to a wide range of classical and contemporary SGMCMC approaches.
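A minimal numerical sketch of the exponential weights follows; the helper and its toy inputs are illustrative, with $\xi_i$ computed as the Gaussian increment that datum $i$ alone would need in order to reproduce a given momentum transition under the Euler-Maruyama update:

```python
import numpy as np

def ewsg_weights(grads, p, p_next, h, gamma, sigma):
    """Exponential index weights p_i proportional to exp(-||xi_i||^2 / 2).

    grads: (N, d) array of per-datum gradients grad U_i(q); xi_i is the
    noise realization index i would need to explain the move p -> p_next.
    """
    N = grads.shape[0]
    xi = (p_next - p + h * gamma * p + h * N * grads) / (sigma * np.sqrt(h))
    logw = -0.5 * np.sum(xi ** 2, axis=1)
    logw -= logw.max()                 # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

rng = np.random.default_rng(2)
N, d, h, gamma, sigma = 20, 3, 1e-2, 1.0, 2.0
grads = rng.normal(size=(N, d))
p = rng.normal(size=d)
# construct a transition exactly explainable by datum 0 (so xi_0 = 0):
p_next = p - h * gamma * p - h * N * grads[0]
w = ewsg_weights(grads, p, p_next, h, gamma, sigma)
```

As expected, the index whose subsampled drift best explains the realized transition receives the largest weight.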
Compared with vanilla stochastic gradients, exponentially weighted stochastic gradients have smaller variance with high probability, as shown in Theorem 3.
Theorem 3
Assume are i.i.d random vectors and for some constant almost surely. Denote the uniform distribution over by , the exponentially weighted distribution by , and let . If , we have
and independent of or such that for any
Intuitively, less nonintrinsic local variance means better global statistical accuracy, and this reasoning will be made more rigorous in Section 4.4.
4.3 Practical Implementation
In EWSG, the probability of choosing the $i$-th gradient term is $p_i \propto \exp(-\|\xi_i\|^2/2)$, where $\xi_i = \frac{1}{\sigma\sqrt{h}}\left(p_{k+1} - p_k + \gamma p_k h + N \nabla U_i(q_k)\,h\right)$ is the noise realization that index $i$ would need in order to produce the momentum update. Although the normalization constant depends on the full data set, it is shared by all $i$'s (we still include it explicitly due to the needs of analyses in proofs); unique to each $i$ is only the factor $\exp(-\|\xi_i\|^2/2)$, which involves a single data point. This motivates us to run a Metropolis-Hastings chain over the possible indices $i$: at each inner-loop step, a proposal index $j$ is drawn uniformly and then accepted with probability

(6) $\min\left\{1,\ \exp\left(\frac{\|\xi_i\|^2 - \|\xi_j\|^2}{2}\right)\right\};$

if accepted, the current index $i$ is replaced by $j$. When this Markov chain converges, the index follows the distribution given by (5). The advantage is that we avoid explicitly passing through the entire data set to compute each $p_i$, and yet the index still samples from the nonuniform distribution efficiently.
In practice, we often perform only one step of the Metropolis index chain per integration step, especially if $h$ is not too large. The rationale is that when $h$ is small, the integration timescale is slower than the index-chain timescale. Note that although the efficacy of local variance reduction via nonuniform subsampling is more pronounced when $h$ is larger (see e.g., Theorem 4), in which case a single index-chain step per iteration may no longer be the optimal choice, improved sampling is still clearly observed in this regime in numerical experiments with various sampling/learning tasks (Section 5).
Another approximation concerns $p_{k+1}$: the weights $p_i$ essentially depend on the future state via $\xi_i$, which we do not know, and yet we would like to avoid expensive nonlinear solves. Therefore, in our experiments, we simply set $p_{k+1} = p_k$ when evaluating the weights. That choice corresponds to a deterministic maximum-likelihood-type estimate, and is a sufficient (but not necessary) condition for mimicking the statistical equilibrium at which $p_{k+1}$ and $p_k$ are equal in distribution. This approximation turned out to be a good one in all our experiments with medium step sizes. Because it is only an approximation, when $h$ is large the method still introduces extra variance (smaller, though, than that caused by the vanilla stochastic gradient variant), and a larger $h$ may actually decrease the accuracy of sampling.
The EWSG algorithm is summarized in Algorithm 1. For simplicity of notation, we restrict the description to the single-point stochastic gradient case (i.e., minibatch size $b = 1$), but an extension to $b > 1$ (i.e., minibatch SG) is straightforward via multidimensional random indices. See Section E in the appendix for details.
EWSG has reduced variance but, due to these approximations, does not completely eliminate the nonintrinsic noise created by the stochastic gradient. A small bias is also created by these approximations, but its effect is dominated by the variance effect (see Sec. 4.4). In practice, if needed, one can combine EWSG with other variance reduction techniques to further improve accuracy; we showcase how EWSG can be combined with SVRG in Sec. G of the appendix.
4.4 Theoretical Analysis
We now provide a non-asymptotic analysis of the global sampling error of EWSG. We first define some notation and list the required assumptions, and then state the main results. Detailed proofs are deferred to the appendix.
The generator of the underdamped Langevin dynamics (1) is given by

$\mathcal{L} = p \cdot \nabla_q - \left(\nabla U(q) + \gamma p\right) \cdot \nabla_p + \frac{\sigma^2}{2} \Delta_p,$

where $\Delta_p$ is the Laplacian in $p$. $\mathcal{L}$ is associated with an integrated form via Kolmogorov's backward equation. Given a test function $\phi$, its posterior average is $\bar{\phi} = \int \phi(z)\,\pi(dz)$, where $\pi$ is the invariant distribution defined in Eq. (2), and we use the time average $\hat{\phi} = \frac{1}{K} \sum_{k=1}^{K} \phi(z_k)$ of samples to approximate $\bar{\phi}$, where $z_k = (q_k, p_k)$ is the sample path of a Markov chain (e.g., given by the EM integrator). A useful tool in the weak convergence analysis of SGMCMC is the following Poisson equation [mattingly2010convergence, vollmer2016exploration, chen2015convergence]:

(7) $\mathcal{L} \psi = \phi - \bar{\phi}.$

The solution $\psi$ characterizes the difference between the test function $\phi$ and its posterior average $\bar{\phi}$.
We make the following three assumptions:
Assumption 1 (Bounded Gradient of Potential)
The gradients of all potential terms $U_i$ are uniformly bounded. Hence the full gradient $\nabla U$ is also uniformly bounded.
Assumption 2 (Bounded Moments of Momentum)
Assume sufficiently high moments of the momentum variable $p$ are uniformly bounded across all iterations. Hence, by Hölder's inequality, all lower moments are also uniformly bounded.
Assumption 3 (Bounded Solution of Poisson Equation)
The solution $\psi$ of the Poisson equation (7), together with its derivatives up to the order needed in our proof, is uniformly bounded.
We are now ready to bound the mean squared error (MSE) for stochastic gradient underdamped Langevin algorithms, including both SGULD and the proposed EWSG algorithm.
Theorem 4
Under Assumptions 1, 2 and 3, for both the SGULD and EWSG algorithms, there exists a constant $C > 0$ such that

(8) $\mathbb{E}\left(\hat{\phi} - \bar{\phi}\right)^2 \le C \left( \frac{1}{T} + \frac{h}{K} \sum_{k=0}^{K-1} \mathbb{E}\,\mathrm{tr}\,\Sigma_k + h^2 \left( 1 + \frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\,\|b_k\|^2 \right) \right),$

where $T = Kh$ is the corresponding time in the underlying continuous dynamics, $I_k$ is the index of the datum used to estimate the gradient at the $k$th iteration, and $\Sigma_k$ and $b_k$ are the covariance and bias of the stochastic gradient at the $k$th iteration, conditioned on the current sigma-algebra of the filtration.
Remark: We follow the powerful framework developed in [mattingly2010convergence] to prove this bound, and to this end our notation in the proof is made consistent with theirs. One key difference is that [mattingly2010convergence] only discusses the batch gradient case, whereas our theory additionally quantifies the effect of nonuniform stochastic gradients. Note [vollmer2016exploration, chen2015convergence] studied the effect of stochastic gradients, but the SG considered there did not use state-dependent weights, which would destroy several martingales used in their proofs. In addition, our result incorporates the effects of both the local bias and the local variance of an SG approximation. Unlike in [mattingly2010convergence], but as in [vollmer2016exploration, chen2015convergence], our state space is not a compact torus but the whole Euclidean space.
The variance and bias of the stochastic gradient approximation are reflected in the second and third terms of the above bound, respectively, although the third term also contains a contribution from the numerical integration error. Note the second term generally dominates the third due to its lower order in $h$, which means reducing the local variance can improve the sampling accuracy even if this comes at the cost of introducing a small bias. A smaller local variance is the main advantage of EWSG over uniform SG (see e.g., Thm. 3).
5 Experiments
In this section, the proposed EWSG algorithm is compared with SGULD, the classical SGLD method [welling2011bayesian], and several recent popular SGMCMC methods, including Firefly Monte Carlo (FlyMC) [maclaurin2015firefly], pSGLD [li2016preconditioned], and CPSGMCMC [fu2017cpsg] (the latter motivated by combining importance sampling with SGMCMC). Test problems include sampling from a simple Gaussian distribution, fitting a misspecified Gaussian model, Bayesian logistic regression, and Bayesian neural networks (BNNs). As FlyMC requires a tight lower bound on the likelihood, which is unknown for many models, it is only compared against in Secs. 5.2 and 5.3, where such a bound is obtainable. CPSGMCMC requires heavy tuning of the number of clusters, which differs across data sets/algorithms, so it is only included in the BNN example, for which the authors empirically found a good hyperparameter for MNIST.
For fair comparison, all algorithms use constant step sizes and are allowed a fixed computation budget; i.e., for a given number of data passes, all algorithms are allowed the same number of gradient-function calls. All experiments are conducted on a machine with a 2.20GHz Intel(R) Xeon(R) E5-2630 v4 CPU and an Nvidia GeForce GTX 1080 GPU. Unless specifically mentioned otherwise, the noise coefficient $\sigma$ is set via the fluctuation-dissipation relation so that only the friction $\gamma$ needs to be specified in each experiment, a single index-chain step is performed per iteration for EWSG, and the two hyperparameters required by pSGLD are set to the default values suggested in [li2016preconditioned].
5.1 A Simple Gaussian Example
Consider sampling from a simple two-dimensional Gaussian distribution with a quadratic potential whose parameters are randomly sampled from a two-dimensional standard normal. Due to the simplicity of the model, the target density is known analytically, so we can report sample quality quantitatively and compare with the vanilla stochastic gradient method on an objective basis. To this end, we use the Kullback-Leibler divergence to measure the discrepancy between the target distribution and the samples generated by simulation. For two Gaussians $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$ in $\mathbb{R}^d$, the KL divergence has the closed form $\mathrm{KL} = \frac{1}{2}\left(\mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right)$. This formula is used to estimate the KL divergence between the empirical distribution of generated samples and the target distribution.
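The closed-form KL between Gaussians can be implemented directly. A sketch, with synthetic stand-in samples in place of actual MCMC output:

```python
import numpy as np

def kl_gaussians(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) )."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# usage: plug in the empirical mean/covariance of generated samples
rng = np.random.default_rng(4)
samples = rng.normal(size=(10000, 2)) * 1.1 + 0.05   # stand-in for sampler output
mu_hat, S_hat = samples.mean(axis=0), np.cov(samples.T)
kl_hat = kl_gaussians(mu_hat, S_hat, np.zeros(2), np.eye(2))
```

In the experiment above, the first argument pair would be the empirical moments of the generated samples and the second pair the known target moments.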
For each algorithm, we generate 10000 independent samples. All algorithms are run for 30 data passes, and a minibatch size of 1 is used for all of them. Step sizes are tuned via grid search separately for each method. For SGULD and EWSG, both of which are based on underdamped Langevin, we additionally set the friction coefficient $\gamma$. The results are shown in Figure 1(a). From the figure, we observe that both SGULD and EWSG outperform the two benchmarks SGLD and pSGLD. Between the two, EWSG converges to a smaller KL divergence than SGULD, which implies EWSG achieves better statistical accuracy than its uniform stochastic gradient counterpart.
We also compare SGULD and EWSG over a wide range of step sizes, plotting the mean absolute error of the sample covariance matrix against autocorrelation time.
5.2 A Misspecified Gaussian Case
In this subsection, we follow the same setup as in [bardenet2017markov] and study a misspecified Gaussian model in which a one-dimensional normal distribution is fit to i.i.d. points drawn from a distribution that is not Gaussian, with a flat prior assigned to the parameters. It was shown in [bardenet2017markov] that the FlyMC algorithm behaves erratically in this case, as "bright" data points with large values are rarely updated and they drive samples away from the target distribution; consequently the chain mixes very slowly. One important commonality FlyMC shares with EWSG is that in each iteration both algorithms select a subset of data in a nonuniform fashion. It is therefore interesting to investigate the performance of EWSG on this misspecified model.
For FlyMC
Figure 2(b) shows the autocorrelation of the three algorithms. The autocorrelation of FlyMC decays very slowly; samples even 500 iterations apart still show strong correlation. The autocorrelation of EWSG, on the other hand, decays much faster, suggesting EWSG explores the parameter space more efficiently than FlyMC does. Figures 2(c) and 2(d) show the samples (the first 1000 samples are discarded as burn-in) generated by EWSG and FlyMC respectively. The samples of EWSG center around the mode of the target distribution, while the samples of FlyMC are still far away from the true posterior. The experiment shows EWSG works quite well even in misspecified models, and hence is an effective candidate for combining importance sampling with scalable Bayesian inference.
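Autocorrelation curves such as those in Figure 2(b) can be estimated in a few lines. A self-contained sketch on synthetic chains, using an AR(1) process as a surrogate for a slowly mixing sampler versus i.i.d. noise for a fast-mixing one:

```python
import numpy as np

def autocorr(x, max_lag):
    """Empirical autocorrelation rho(k) of a scalar chain, k = 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[: n - k], x[k:]) / (n * var)
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(5)
ar = np.empty(20000)                   # AR(1) chain with lag-1 correlation 0.9
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
rho_ar = autocorr(ar, 5)
rho_iid = autocorr(rng.normal(size=20000), 5)
```

A slowly decaying curve (like the AR(1) surrogate, or FlyMC in the experiment) indicates that widely separated samples remain correlated and the effective sample size is small.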
5.3 Bayesian Logistic Regression
Consider Bayesian logistic regression for the binary classification problem.
The probabilistic model for predicting a label $y \in \{-1, +1\}$ given a feature vector $x$ is $P(y \mid x, \beta) = 1/\left(1 + \exp(-y\,\beta^\top x)\right)$. We place a zero-mean Gaussian prior on the parameter $\beta$. We conduct our experiments on the Covertype data set.
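As a sketch of the per-datum potentials this model induces (the unit prior covariance and the even $1/N$ split of the prior across the $N$ terms are illustrative assumptions, not taken from the experimental setup), with a finite-difference check of the gradient:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 30, 4
X = rng.normal(size=(N, d))                  # synthetic features
y = rng.choice([-1.0, 1.0], size=N)          # synthetic labels in {-1, +1}

def U_term(beta, i):
    """i-th potential: negative log-likelihood of datum i plus a 1/N share
    of an isotropic Gaussian prior, so that U = sum_i U_i is the negative
    log-posterior."""
    return np.logaddexp(0.0, -y[i] * (X[i] @ beta)) + beta @ beta / (2 * N)

def grad_term(beta, i):
    m = y[i] * (X[i] @ beta)
    return -y[i] * X[i] / (1.0 + np.exp(m)) + beta / N

beta = rng.normal(size=d)
eps = 1e-6                                   # central finite-difference check
num = np.array([(U_term(beta + eps * e, 0) - U_term(beta - eps * e, 0)) / (2 * eps)
                for e in np.eye(d)])
fd_err = np.abs(num - grad_term(beta, 0)).max()
```

These `grad_term` evaluations are exactly the single-datum quantities an SGMCMC sampler (uniform or exponentially weighted) would subsample.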
The FlyMC algorithm


Method  Accuracy (%)  Log Likelihood
SGLD  75.2823 ± 0.0788  −0.5249 ± 0.0002
pSGLD  75.0785 ± 0.0939  −0.5266 ± 0.0004
SGULD  75.2717 ± 0.0686  −0.5250 ± 0.0001
EWSG  75.2928 ± 0.0452  −0.5235 ± 0.0003
FlyMC  75.1650 ± 0.0792  −0.5235 ± 0.0005

5.4 Bayesian Neural Network
Bayesian inference is compelling for deep learning (see e.g. a recent review [wilson2020case]), and here we apply our algorithm to Bayesian neural networks (BNNs). Two popular neural-network architectures are experimented with: a multilayer perceptron (MLP) and a convolutional neural network (CNN). In the MLP architecture, a hidden layer with 100 neurons followed by a softmax layer is used. In the CNN, we use a standard configuration with 2 convolutional layers followed by 2 fully connected layers [jarrett2009best]. The convolutional layers have 32 and 64 channels respectively, each followed immediately by a max-pooling layer. The two fully connected layers each have 200 neurons. We set the standard normal as the prior for all weights and biases.
We test the algorithms on the MNIST data set, which consists of 60000 training data and 10000 test data; each datum is a grayscale image with one of ten possible labels (digits 0-9). For the underdamped Langevin based algorithms, we set the friction coefficient separately for the MLP and the CNN. In both architectures, the step sizes are chosen via grid search for EWSG, SGULD, CPSGULD, SGLD, and pSGLD. For CPSGULD (clustering-based preprocessing is conducted [fu2017cpsg] before SGULD), we use K-means with 10 clusters to preprocess the data set. All algorithms use a minibatch size of 100 and are run for 200 data passes. For each algorithm, we generate 10 independent samples to estimate posterior distributions and make predictions accordingly. To smooth out noise and obtain more significant results, we repeat all experiments 10 times with different seeds.


Method  Test Error (%), MLP  Test Error (%), CNN
SGLD  1.976 ± 0.055  0.848 ± 0.060
pSGLD  1.821 ± 0.061  0.860 ± 0.052
SGULD  1.833 ± 0.073  0.778 ± 0.040
CPSGULD  1.835 ± 0.047  0.772 ± 0.055
EWSG  1.793 ± 0.100  0.753 ± 0.035

The learning curves and final test errors are shown in Figure 4 and Table 2. We find EWSG consistently improves over its uniform gradient subsampling counterpart SGULD, as well as over CPSGULD, which is motivated by marrying importance sampling with SGMCMC. Moreover, EWSG also outperforms the two standard benchmarks SGLD and pSGLD.
In each iteration of EWSG, we run an index Markov chain and select a "good" minibatch to estimate the gradient; EWSG therefore essentially touches more data points per iteration than the nominal minibatch size. How does EWSG compare with its uniform gradient subsampling counterpart run with a correspondingly larger minibatch size?
We empirically answer this question in the context of the BNN with MLP architecture. We use the same step size for SGULD and EWSG and experiment with a wide range of minibatch sizes and index-chain lengths. Each algorithm is run for 200 data passes, and 10 independent samples are drawn to estimate the test error. The results are shown in Table 3. We find that EWSG beats SGULD with a larger minibatch in 8 out of 9 comparison groups, which suggests that EWSG is in general a better way of consuming data than increasing the minibatch size, and this may shed light on other areas where stochastic gradient methods are used (e.g., optimization).
1.86% 1.94%  1.83% 1.92%  1.80% 1.97%  
1.90% 1.87%  1.87% 1.97%  1.80% 2.07%  
1.79% 1.97%  2.01% 2.17%  2.36% 2.37% 
6 Conclusion
In this paper, we proposed EWSG, which uses exponentially weighted subsampling of gradients to match the transition kernel of a base MCMC method with full gradient. The goal is better sample quality. Both a local variance analysis and a global non-asymptotic analysis are presented to demonstrate the advantage of EWSG theoretically. Empirical results also show improved sampling/learning performance. We believe nonuniform stochastic gradients can be introduced to a large class of MCMC methods and are capable of impactful algorithmic improvements.
Acknowledgements
The authors thank Wenlong Mou for discussions. MT was partially supported by NSF DMS1847802 and ECCS1936776.
References
Appendix A Proof of Theorem 1
Proof: Denote the set of all dimensional probability vectors by , the set of sparse probability vectors by , and the set of nonsparse (dense) probability vectors by . Denote , then the optimization problem can be written as
Note that the feasible region is always nonempty (take $w$ to be the uniform distribution) and is also closed and bounded; hence this linear program is always solvable. Denote the set of all minimizers accordingly; note that it depends on the realized gradients and is in this sense random.
The Lagrange function is
where and are dual variables. The optimality condition reads as
Dual feasibility and complementary slackness require
Consider the probability of the event {a dense probability vector can solve the above minimization problem}, i.e., . It is upper bounded by
Since , complementary slackness implies that at least entries in are zero. Denote the indices of these entries by . For every , by optimality condition, we have , i.e.,
Take the first indices in , and note a geometric fact that points in a dimensional space must be on the surface of a hypersphere of at most dimension, which we denote by for some vector and integer . Because ’s distribution is absolutely continuous, we have
Hence and
Therefore we have
Appendix B Proof of Theorem 2
Proof: The transition kernel of EM discretization with full gradient can be explicitly written as
where is the Dirac delta function and is the probability density of dimensional standard normal distribution.
Denote the unnormalized probability measure of index by
and the normalization constant by
Then the transition kernel of EWSG can be written as
Recall the transition kernel of EM integrator with full gradient is
As both transition kernels are proportional to
We therefore conclude that
Appendix C Proof of Theorem 3
Proof: Let and assume for some constant . Denote . For any probability distribution over , we have
Therefore we let
and use it to compare the trace of covariance matrix of uniform and nonuniform subsamplings.
First of all,
where the inequality is due to Cauchy-Schwarz and is strict unless all summands are equal, which means uniform subsampling on average has larger variability than the nonuniform scheme, measured by the trace of the covariance matrix.
Moreover, concentration inequality can help show is negative with high probability if is small. To this end, plug in and rewrite
where , and is the normalization constant. Denote the unnormalized probability by
and we have
To prove concentration results, it is useful to estimate
where is a ball centered at origin with radius in .
Due to the mean value theorem, we have . By symmetry, it suffices to compute to upper bound . Note that
where is the Kronecker delta function. Thus
where the second-to-last equality comes from the difference of the first two terms in the third-to-last equality. This estimation yields the bounded-difference property required below.
Therefore, by McDiarmid’s inequality, we conclude for any ,
Any choice of will render this probability asymptotically vanishing as grows, which means that will be negative with high probability, which is equivalent to reduced variance per step.
Appendix D Proof of Theorem 4
Proof: We rewrite the generator of underdamped Langevin with full gradient as
where
Rewrite the discretized underdamped Langevin with stochastic gradient in variable
where
and is a