The promises and pitfalls of Stochastic Gradient Langevin Dynamics
Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated successes in machine learning tasks. The current practice is to set the step size inversely proportional to where is the number of training samples. As becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by limited numerical experiments.
The promises and pitfalls of Stochastic Gradient Langevin Dynamics
Nicolas Brosse, Éric Moulines Centre de Mathématiques Appliquées, UMR 7641, Ecole Polytechnique, Palaiseau, France. email@example.com, firstname.lastname@example.org Alain Durmus Ecole Normale Supérieure CMLA, 61 Av. du Président Wilson 94235 Cachan Cedex, France. email@example.com
noticebox[b]Preprint. Work in progress.\end@float
Most MCMC algorithms have not been designed to process huge sample sizes, a typical setting in machine learning. As a result, many classical MCMC methods fail in this context, because the mixing time becomes prohibitively long and the cost per iteration increases proportionally to the number of training samples . The computational cost in standard Metropolis-Hastings algorithm comes from 1) the computation of the proposals, 2) the acceptance/rejection step. Several approaches to solve these issues have been recently proposed in machine learning and computational statistics.
Among them, the stochastic gradient langevin dynamics (SGLD) algorithm, introduced in Welling and Teh (2011), is a popular choice. This method is based on the Langevin Monte Carlo (LMC) algorithm proposed in Grenander (1983); Grenander and Miller (1994). Standard versions of LMC require to compute the gradient of the log-posterior at the current fit of the parameter, but avoid the accept/reject step. The LMC algorithm is a discretization of a continuous-time process, the overdamped Langevin diffusion, which leaves invariant the target distribution . To further reduce the computational cost, SGLD uses unbiased estimators of the gradient of the log-posterior based on subsampling. This method has triggered a huge number of works among others Ahn et al. (2012); Korattikara et al. (2014); Ahn et al. (2014); Chen et al. (2015, 2014); Ding et al. (2014); Ma et al. (2015); Dubey et al. (2016); Bardenet et al. (2017) and have been successfully applied to a range of state of the art machine learning problems Patterson and Teh (2013); Li et al. (2016b).
The properties of SGLD with decreasing step sizes have been studied in Teh et al. (2016). The two key findings in this work are that 1) the SGLD algorithm converges weakly to the target distribution , 2) the optimal rate of convergence to equilibrium scales as where is the number of iterations, see (Teh et al., 2016, Section 5). However, in most of the applications, constant rather than decreasing step sizes are used, see Ahn et al. (2012); Chen et al. (2014); Hasenclever et al. (2017); Li et al. (2016a); Sato and Nakagawa (2014); Vollmer et al. (2016). A natural question for the practical design of SGLD is the choice of the minibatch size. This size controls on the one hand the computational complexity of the algorithm per iteration and on the other hand the variance of the gradient estimator. Non-asymptotic bounds in Wasserstein distance between the marginal distribution of the SGLD iterates and the target distribution have been established in Dalalyan (2017b); Dalalyan and Karagulyan (2017). These results highlight the cost of using stochastic gradients and show that, for a given precision in Wasserstein distance, the computational cost of the plain SGLD algorithm does not improve over the LMC algorithm; Nagapetyan et al. (2017) reports also similar results on the mean square error.
It has been suggested to use control variates to reduce the high variance of the stochastic gradients. For strongly log-concave models, Nagapetyan et al. (2017); Baker et al. (2017) use the mode of the posterior distribution as a reference point and introduce the SGLDFP (Stochastic Gradient Langevin Dynamics Fixed Point) algorithm. Nagapetyan et al. (2017); Baker et al. (2017) provide upper bounds on the mean square error and the Wasserstein distance between the marginal distribution of the iterates of SGLDFP and the posterior distribution. In addition, Nagapetyan et al. (2017); Baker et al. (2017) show that the overall cost remains sublinear in the number of individual data points, up to a preprocessing step. Other control variates methodologies are provided for non-concave models in the form of SAGA-Langevin Dynamics and SVRG-Langevin Dynamics Dubey et al. (2016); Chen et al. (2017), albeit a detailed analysis in Wasserstein distance of these algorithms is only available for strongly log-concave models Chatterji et al. (2018).
In this paper, we provide further insights on the links between SGLD, SGLDFP, LMC and SGD (Stochastic Gradient Descent). In our analysis, the algorithms are used with a constant step size and the parameters are set to the standard values used in practice Ahn et al. (2012); Chen et al. (2014); Hasenclever et al. (2017); Li et al. (2016a); Sato and Nakagawa (2014); Vollmer et al. (2016). The LMC, SGLD and SGLDFP algorithms define homogeneous Markov chains, each of which admits a unique stationary distribution used as a hopefully close proxy of . The main contribution of this paper is to show that, while the invariant distributions of LMC and SGLDFP become closer to as the number of data points increases, on the opposite, the invariant measure of SGLD never comes close to the target distribution and is in fact very similar to the invariant measure of SGD.
In Section 3.1, we give an upper bound in Wasserstein distance of order between the marginal distribution of the iterates of LMC and the Langevin diffusion, SGLDFP and LMC, and SGLD and SGD. We provide a lower bound on the Wasserstein distance between the marginal distribution of the iterates of SGLDFP and SGLD. In Section 3.2, we give a comparison of the means and covariance matrices of the invariant distributions of LMC, SGLDFP and SGLD with those of the target distribution . Our claims are supported by numerical experiments in Section 4.
Denote by the observations. We are interested in situations where the target distribution arises as the posterior in a Bayesian inference problem with prior density and a large number of i.i.d. observations with likelihoods . In this case, . We denote for , , .
Under mild conditions, is the unique invariant probability measure of the Langevin Stochastic Differential Equation (SDE):
where is a -dimensional Brownian motion. Based on this observation, Langevin Monte Carlo (LMC) is an MCMC algorithm that enables to sample (approximately) from using an Euler discretization of the Langevin SDE:
where is a constant step size and is a sequence of i.i.d. standard -dimensional Gaussian vectors. Discovered and popularised in the seminal works Grenander (1983); Grenander and Miller (1994); Roberts and Tweedie (1996), LMC has recently received renewed attention Dalalyan (2017a); Durmus and Moulines (2017, 2016); Dalalyan and Karagulyan (2017). However, the cost of one iteration is which is prohibitively large for massive datasets. In order to scale up to the big data setting, Welling and Teh (2011) suggested to replace with an unbiased estimate where is a minibatch of with replacement of size . A single update of SGLD is then given for by
The idea of using only a fraction of data points to compute an unbiased estimate of the gradient at each iteration comes from Stochastic Gradient Descent (SGD) which is a popular algorithm to minimize the potential . SGD is very similar to SGLD because it is characterised by the same recursion as SGLD but without Gaussian noise:
It is worth mentioning that the objectives of the different algorithms presented so far are distinct. On the one hand, LMC, SGLD and SGDLFP are MCMC methods used to obtain approximate samples from the posterior distribution . On the other hand, SGD is a stochastic optimization algorithm used to find an estimate of the mode of the posterior distribution. In this paper, we focus on the fixed step-size SGLD algorithm and assess its ability to reliably sample from . For that purpose and to quantify precisely the relation between LMC, SGLD, SGDFP and SGD, we make for simplicity the following assumptions on .
For all , is four times continuously differentiable and for all , . In particular for all , is -gradient Lipschitz, i.e. for all , .
is -strongly convex, i.e. for all , .
For all , is convex.
Note that under H 1, is four times continuously differentiable and for , , with and where . In particular, is -gradient Lipschitz. Furthermore, under H 2, has a unique minimizer . In this paper, we focus on the asymptotic ,. We assume that , which is a common assumption for the analysis of SGLD and SGLDFP Baker et al. (2017); Chatterji et al. (2018). In practice Ahn et al. (2012); Chen et al. (2014); Hasenclever et al. (2017); Li et al. (2016a); Sato and Nakagawa (2014); Vollmer et al. (2016), is of order and we adopt this convention in this article.
For a practical implementation of SGLDFP, an estimator of is necessary. The theoretical analysis and the bounds remain unchanged if, instead of considering SGLDFP centered w.r.t. , we study SGLDFP centered w.r.t. satisfying . Such an estimator can be computed using for example SGD with decreasing step sizes, see (Nemirovski et al., 2009, eq.(2.8)) and (Baker et al., 2017, Section 3.4), for a computational cost linear in .
3.1 Analysis in Wasserstein distance
Before presenting the results, some notations and elements of Markov chain theory have to be introduced. Denote by the set of probability measures with finite second moment and by the Borel -algebra of . For , define the Wasserstein distance of order by
where is the set of probability measures on satisfying for all , and .
A Markov kernel on is a mapping satisfying the following conditions: (i) for every , is a probability measure on (ii) for every , is a measurable function. For any probability measure on , we define for all by . For all , we define the Markov kernel recursively by and for all and , . A probability measure is invariant for if .
The LMC, SGLD, SGD and SGLDFP algorithms defined respectively by (2), (3), (4) and (5) are homogeneous Markov chains with Markov kernels denoted , and . To avoid overloading the notations, the dependence on and is implicit.
The proof is postponed to Section A.1. ∎
Under H 1, (1) has a unique strong solution for every initial condition (Karatzas and Shreve, 1991, Chapter 5, Theorems 2.5 and 2.9). Denote by the semigroup of the Langevin diffusion defined for all and by .
The proof is postponed to Section A.2. ∎
for all , we get and , .
for all , we get and .
Theorem 1 implies that the number of iterations necessary to obtain a sample -close from in Wasserstein distance is the same for LMC and SGLDFP. However for LMC, the cost of one iteration is which is larger than the cost of one iteration for SGLDFP. In other words, to obtain an approximate sample from the target distribution at an accuracy in -Wasserstein distance, LMC requires operations, in contrast with SGLDFP that needs only operations.
We show in the sequel that when in the case of a Bayesian linear regression, where for two sequences , , if . The dataset is where is the response variable and are the covariates. Set and the matrix of covariates such that the row of is . Let . For , the conditional distribution of given is Gaussian with mean and variance . The prior is a normal distribution of mean and variance . The posterior distribution is then proportional to where
We assume that , with . Let be a minibatch of with replacement of size . Define
is the multiplicative part of the noise in the stochastic gradient, and the additive part that does not depend on . The additive part of the stochastic gradient for SGLDFP disappears since
In this setting, the following theorem shows that the Wasserstein distances between the marginal distribution of the iterates of SGLD and SGLDFP, and and , is of order when . This is in sharp contrast with the results of Section 3.1 where the Wasserstein distances tend to as at a rate . For simplicity, we state the result for .
Consider the case of the Bayesian linear regression in dimension .
For all and ,
Set with and assume that . We have .
The proof is postponed to Section A.3. ∎
The study in Wasserstein distance emphasizes the different behaviors of the LMC, SGLDFP, SGLD and SGD algorithms. When and , the marginal distributions of the iterates of the LMC and SGLDFP algorithm are very close to the Langevin diffusion and their invariant probability measures and are similar to the posterior distribution of interest . In contrast, the marginal distributions of the iterates of SGLD and SGD are analogous and their invariant probability measures and are very different from when .
Note that to fix the asymptotic bias of SGLD, other strategies can be considered: choosing a step size where and/or increasing the batch size where . Using the Wasserstein (of order 2) bounds of SGLD w.r.t. the target distribution , see e.g. (Dalalyan and Karagulyan, 2017, Theorem 3), should be equal to to guarantee the -accuracy in Wasserstein distance of SGLD for a cost proportional to (up to logarithmic terms), independently of the choice of and .
3.2 Mean and covariance matrix of
We now establish an expansion of the mean and second moments of and as , and compare them. We first give an expansion of the mean and second moments of as .
The proof is postponed to Section B.1. ∎
Contrary to the Bayesian linear regression where the covariance matrices can be explicitly computed, see Appendix C, only approximate expressions are available in the general case. For that purpose, we consider two types of asymptotic. For LMC and SGLDFP, we assume that , , for , and we develop an asymptotic when . Combining Section 3.2 and Theorem 3 , we show that the biases and covariance matrices of and are of order with remainder terms of the form , where for two sequences , , if .
Regarding SGD and SGLD, we do not have such concentration properties when because of the high variance of the stochastic gradients. The biases and covariance matrices of SGLD and SGD are of order when . To obtain approximate expressions of these quantities, we set where is the step size for the gradient descent over the normalized potential . Assuming that is proportional to and , we show by combining Section 3.2 and Theorem 4 that the biases and covariance matrices of SGLD and SGD are of order with remainder terms of the form when .
Before giving the results associated to and , we need to introduce some notations. For any matrices , we denote by the Kronecker product defined on by and . Besides, for all and , we denote by the tensor product of and . For any matrix , is the trace of .
Define for all by
and and by
, and can be interpreted as perturbations of and , respectively, due to the noise of the stochastic gradients. It can be shown, see Section B.2, that for small enough, and are invertible.
The proof is postponed to Section B.2.2. ∎
The proof is postponed to Section B.2.2. ∎
4 Numerical experiments
For illustrative purposes, we consider a Bayesian logistic regression in dimension . We simulate covariates drawn from a standard -dimensional Gaussian distribution and we denote by the matrix of covariates such that the row of is . Our Bayesian regression model is specified by a Gaussian prior of mean and covariance matrix the identity, and a likelihood given for by . We simulate observations under this model. In this setting, H 1 and H 3 are satisfied, and H 2 holds if the state space is compact.
To illustrate the results of Section 3.2, we consider regularly spaced values of between and and we truncate the dataset accordingly. We compute an estimator of using SGD Pedregosa et al. (2011) combined with the BFGS algorithm Jones et al. (2001). For the LMC, SGLDFP, SGLD and SGD algorithms, the step size is set equal to where is the largest eigenvalue of . We start the algorithms at and run iterations where the first samples are discarded as a burn-in period.
We estimate the means and covariance matrices of and by their empirical averages and . We plot the mean and the trace of the covariance matrices for the different algorithms, averaged over independent trajectories, in Figure 1 and Figure 2 in logarithmic scale.
The slope for LMC and SGLDFP is which confirms the convergence of to at a rate . On the other hand, we can observe that converges to a constant for SGD and SGLD.
We then illustrate our results on the covertype dataset111https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.scale.bz2 with a Bayesian logistic regression model. The prior is a standard multivariate Gaussian distribution. Given the size of the dataset and the dimension of the problem, LMC requires high computational resources and is not included in the simulations. We truncate the training dataset at . For all algorithms, the step size is set equal to and the trajectories are started at , an estimator of , computed using SGD combined with the BFGS algorithm.
We empirically check that the variance of the stochastic gradients scale as for SGD and SGLD, and as for SGLDFP. We compute the empirical variance estimator of the gradients, take the mean over the dimension and display the result in a logarithmic plot in Figure 3. The slopes are for SGD and SGLD, and for SGLDFP.
On the test dataset, we also evaluate the negative loglikelihood of the three algorithms for different values of , as a function of the number of iterations. The plots are shown in Figure 4. We note that for large , SGLD and SGD give very similar results that are below the performance of SGLDFP.
Appendix A Proofs of Section 3.1
a.1 Proof of Section 3.1
The convergence in Wasserstein distance is classically done via a standard synchronous coupling (Dieuleveut et al., 2017, Proposition 2). We prove the statement for SGLD; the adaptation for LMC, SGLDFP and SGD is immediate. Let and . By (Villani, 2009, Theorem 4.1), there exists a couple of random variables such that . Let be the SGLD iterates starting from and respectively and driven by the same noise, i.e. for all ,
where is an i.i.d. sequence of standard Gaussian variables and an i.i.d. sequence of subsamples of of size . Denote by the filtration associated to . We have for ,
and by H 2
Since for all , belongs to , we get by a straightforward induction