Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration
Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is a variant of stochastic gradient with momentum where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. Many works reported its empirical success in practice for solving stochastic non-convex optimization problems, in particular it has been observed to outperform overdamped Langevin Monte Carlo-based methods such as stochastic gradient Langevin dynamics (SGLD) in many applications. Although asymptotic global convergence properties of SGHMC are well known, its finite-time performance is not well-understood.
In this work, we provide finite-time performance bounds for the global convergence of SGHMC for solving stochastic non-convex optimization problems with explicit constants. Our results lead to non-asymptotic guarantees for both population and empirical risk minimization problems. For a fixed target accuracy level , on a class of non-convex problems, we obtain iteration complexity bounds for SGHMC that can be tighter than those for SGLD up to a square root factor. These results show that acceleration with momentum is possible in the context of non-convex optimization algorithms.
September 14, 2018
We consider the stochastic non-convex optimization problem
where is a random variable whose probability distribution is unknown, supported on some unknown set , the objective is the expectation of a random function where the functions are continuous and non-convex. Having access to independent and identically distributed samples where each is a random variable distributed with the population distribution , the goal is to compute an approximate minimizer (possibly with a randomized algorithm) of the population risk, i.e. approximately minimize
where the expectation is taken with respect to both and the randomness encountered (if any) during the iterations of the algorithm to compute and is the minimum value. This formulation arises frequently in several contexts including machine learning. A prominent example is deep learning where denotes the set of trainable weights for a deep learning model and is the penalty (loss) of prediction using weight with the individual sample value .
Because the population distribution is unknown, we consider the empirical risk minimization problem
based on the dataset as a proxy to the problem (1.1) and minimize the empirical risk
approximately, where the expectation is taken with respect to any randomness encountered during the algorithm to generate .
It has been well known that sampling from a distribution which concentrates around a global minimizer of is a similar goal to computing an approximate global minimizer of , for example such connections arise in the study of simulated annealing algorithms in optimization which admit several asymptotic convergence guarantees (see e.g. [Gid85, Haj85, GM91, KGV83, BT93, BLNR15, BM99]). Recent studies made such connections between the fields of statistics and optimization stronger, justifying and popularizing the use of Langevin Monte Carlo-based methods in stochastic non-convex optimization and large-scale data analysis further (see e.g. [CCS17, Dal17, RRT17, CCG16, SBCR16, ŞYN18, WT11, Wib18]).
Stochastic gradient algorithms based on Langevin Monte Carlo are popular variants of stochastic gradient which admit asymptotic global convergence guarantees where a properly scaled Gaussian noise is added to the gradient estimate. Two popular Langevin-based algorithms that have demonstrated empirical success are stochastic gradient Langevin dynamics (SGLD) [WT11, CDC15] and stochastic gradient Hamiltonian Monte Carlo (SGHMC) [CFG14, CDC15, Nea10, DKPR87] and their variants to improve their efficiency and accuracy [AKW12, MCF15, PT13, DFB14, Wib18]. In particular, SGLD can be viewed as the analogue of stochastic gradient in the Markov Chain Monte Carlo (MCMC) literature whereas SGHMC is the analogue of stochastic gradient with momentum (see e.g. [CFG14]). SGLD iterations consist of
where is the stepsize parameter, is the inverse temperature, is a conditionally unbiased estimate of the gradient of and is a centered Gaussian random vector with unit covariance matrix. When the gradient variance is zero, SGLD dynamics corresponds to (explicit) Euler discretization of the first-order (a.k.a. overdamped) Langevin stochastic differential equation (SDE)
where is the standard Brownian motion in . The process admits a unique stationary distribution , also known as the Gibbs measure, under some assumptions on (see e.g. [CHS87, HKS89]). For chosen properly (large enough), it is easy to see that this distribution will concentrate around approximate global minimizers of . Recently, Dalalyan established novel theoretical guarantees for the convergence of the overdamped Langevin MCMC and the SGLD algorithm for sampling from a smooth and log-concave density and these results have direct implications to stochastic convex optimization [Dal17]. In a seminal work, Raginsky et al. [RRT17] showed that SGLD iterates track the underdamped Langevin SDE closely and obtained finite-time performance bounds for SGLD. Their results show that SGLD converges to -approximate global minimizers after iterations where is the spectral gap of the overdamped Langevin dynamics which is in general exponential in both and the dimension [RRT17, TLR18]. A related result of Zhang et al. [ZLC17] shows that a modified version of the SGLD algorithm will find an -approximate local minimum after polynomial time (with respect to all parameters). Recently, Xu et al. [XCZG17] improved the dependency of the upper bounds of [RRT17] further in the mini-batch setting, and obtained several guarantees for the gradient Langevin dynamics and variance-reduced SGLD algorithms.
On the other hand, the SGHMC algorithm is based on the underdamped (second-order) Langevin diffusion
where models the position and the momentum of a particle moving in a field of force (described by the gradient of ) plus a random (thermal) force described by Gaussian noise, first derived by Kramers [Kra40]. It is known that under some assumptions on , the Markov process is ergodic and have a unique stationary distribution
Hence, the -marginal distribution of stationary distribution is exactly the invariant distribution of the overdamped Langevin dynamics.
where is a sequence of i.i.d standard Gaussian random vectors in , is a sequence of i.i.d random elements such that
We also focus on the unadjusted dynamics (without Metropolis type of correction) that works well in many applications [CFG14, CDC15], as our primary goal is not to sample from a particular target distribution but rather to compute approximate global minimizers to the problem (1.1).
We note that if the term with involving the Gaussian noise is removed in the underdamped SDE (1.5)-(1.6), this results in a second-order ODE in . It is interesting to note that Polyak’s heavy ball method that accelerates gradient descent is based on the discretization of this ODE [Pol87]. Similarly, if the Gaussian noise is scaled with the step size (instead of ), then one recovers the stochastic gradient method with momentum [CFG14]. There has been a number of works for understanding momentum-based acceleration in first-order convex optimization methods as discretizations of second-order differential equations [SRBd17, SBC14, AP16, FRV18, ZMSJ18, KBB15, WWJ16]. Recent results of [EGZ17] shows that underdamped SDE converges to its stationary distribution faster than the overdamped SDE in the 2-Wasserstein metric under some assumptions where can be non-convex. Similar acceleration behavior between underdamped and overdamped dynamics was also proven for a version of Hamiltonian Monte Carlo algorithm for sampling strongly log-concave densities [CCBJ17, MS17] as well as densities whose negative logarithm is strongly convex outside a ball of finite radius [CCA18]. This raises the natural question whether the discretized underdamped dynamics (SGHMC), can lead to better guarantees than the SGLD method for solving stochastic non-convex optimization problems. Indeed, experimental results show that SGHMC can outperform SGLD dynamics in many applications [EGZ17, CDC15, CFG14]. Although asympotic convergence guarantees for SGHMC exist [CFG14] [MSH02, Section 3], [LMS15]; there is a lack of finite-time explicit performance bounds for solving stochastic non-convex optimization problems with SGHMC in the literature including risk minimization problems. Our main contribution is to give finite-time guarantees to find approximate minimizers of both empirical and population risks with explicit constants, bridging a gap between the theory and the practice for the use of SGHMC algorithms in stochastic non-convex optimization as elaborated further in the next section.
Our contributions can be summarized as follows:
Under some regularity assumptions for the non-convex function , we can show that SGHMC converges to an approximate global minimizer of the empirical risk minimization problem (1.2) after iterations in expectation where is a parameter of the underdamped SDE governing the speed of convergence of it to its stationary distribution with respect to the 2-Wasserstein distance. We make the constants and polynomial dependency of the parameters to our final iteration complexity estimate explicit in our analysis. We emphasize that we do not assume that is convex or strongly convex in any region. To our knowledge, this is the first non-asymptotic provable guarantees for the SGHMC algorithm in the context of non-convex stochastic optimization with explicit constants. This is in contrast to iterations needed for the SGLD algorithm [RRT17] where is a spectral gap parameter of the overdamped diffusion. For a class of non-convex objectives, there are tight characterizations available for the parameters and . We discuss some examples of such non-convex problem classes in Examples 6 and 7 when the convergence rate of underdamped dynamics is faster than that of SGLD by a square root factor, i.e. where is typically an exponentially small constant in (see e.g. [BGK05]). For these problems, under some assumptions, we show that SGHMC can achieve an optimization error that is better than that of SGLD by a square root factor and this is achieved in less number of iterations compared to SGLD. Therefore, our results highlight that momentum-based acceleration is achievable at least for some classes of non-convex problems we precise. In particular, our analysis gives some theoretical justification into the success of momentum-based methods for solving non-convex machine learning problems, empirically observed in practice [SMDH13]. Our further analysis on some particular examples suggest that when the initialization is far away from a global minimizer at a distance , SGHMC needs iterations to reach out to a neighborhood of the global minimizer whereas SGLD requires iterations for the same task.
On the technical side, in order to establish these results, we adapt the proof techniques of Raginsky et al. [RRT17] developed for the overdamped dynamics to the underdamped dynamics and combine it with the analysis of Eberle et al. [EGZ17] which quantifies the convergence rate of the underdamped Langevin SDE to its equilibrium. In an analogy to the fact that momentum-based first-order optimization methods require a different Lyapunov function and a quite different set of analysis tools (compared to their non-accelerated variants) to achieve fast rates (see e.g. [LFM18, SBC14, Nes83]), our analysis of the momentum-based SGHMC algorithm requires studying a different Lyapunov function (that also depends on the objective ) as opposed to the classic Lyapunov function arising in the study of the SGLD algorithm [MSH02, RRT17]. This fact introduces some challenges for the adaptation of the existing analysis techniques for SGLD to SGHMC. For this purpose, we take the following steps:
First, we show that SGHMC iterates track the underdamped Langevin diffusion closely in the 2-Wasserstein metric. As this metric requires finiteness of second moments, we first establish uniform (in time) bounds for both the underdamped Langevin SDE (see Lemma 8) exploiting the structure of the Lyapunov function (which will be defined later in (2.1)). Second, we obtain a bound for the Kullback-Leibler divergence between the discrete and continuous underdamped dynamics making use of the Girsanov’s theorem, which is then converted to bounds in the 2-Wasserstein metric by an application of an optimal transportation inequality of Bolley and Villani [BV05]. This step requires proving a certain exponential integrability property of the underdamped Langevin diffusion. We show in Lemma 9 that the exponential moments grow at most linearly in time, which is a strict improvement from the exponential growth in time in [RRT17]. The method that is used in the proof of Lemma 9 can indeed be adapted to improve the exponential integrability and hence the overall estimates in [RRT17] for overdamped dynamics as well.
Second, we study the continuous-time underdamped Langevin SDE. We build on the seminal work of Eberle et al. [EGZ17] which showed that the underdamped SDE is geometrically ergodic with an explicit rate in the 2-Wasserstein metric. In order to get explicit performance guarantees, we derive new bounds that make the dependence of the constants to the initialization explicit (see Lemma 12).
As the -marginal of the equilibrium distribution of the underdamped Langevin SDE concentrates around the global minimizers of for appropriately chosen, and we can control the error between the discrete-time SGHMC dynamics and the underdamped SDE by choosing the step size accordingly; this leads to performance bounds for the empirical risk minimization provided in Corollary 3.
If every sample is used once (in other words if we sample directly from the population distribution ), then the bounds we obtain for the empirical risk minimization will also lead to similar bounds for the population risk.
6In this case, for a fixed target accuracy , we can show that SGHMC generalizes better than SGLD in the following sense: For a fixed target accuracy , SGHMC iterates can achieve a generalization error that is (smaller) better than known guarantees for SGLD [RRT17, XCZG17] by a square root factor and this improvement requires relatively in less number of iterations compared to SGLD. Our performance bounds for SGHMC can also improve that of SGLD for computing the population risk up to a square root factor for fixed problem parameters and on some non-convex problems. In the multi-pass regime, where the samples are not used only once, exploiting the fact that the -marginal of the stationary distribution for the underdamped diffusion is the same as the overdamped diffusion, we show that the generalization error admits the same bounds known in the literature for the overdamped case, first derived in [RRT17]. In other words, SGHMC algorithm generalizes with a rate no worse than that of the known guarantees for the SGLD algorithm. The results are summarized in Corollary 4.
1.2 Related Work
In a recent work, Simsekli et al. obtained a finite-time performance bound [ŞYN18, Theorem 1] for the ergodic average of the SGHMC iterates in the presence of delays in gradient computations. Their analysis highlight the dependency of the optimization error on the delay in the gradient computations and the stepsize explicitly, however it hides some implicit constants and which can be exponential both in and in the worst case [ŞYN18]. A comparison with the SGLD algorithm is also not given. On the contrary, in our paper, we make all the constants explicit, therefore the effect of acceleration compared to overdamped MCMC approaches such as SGLD is visible.
Cheng et al. [CCA18] considered the problem of sampling from a target distribution where is -smooth everywhere and -strongly convex outside a ball of finite radius . They proved upper bounds for the time required to sample from a distribution that is within of the target distribution with respect to the -Wasserstein distance for both underdamped and overdamped methods that scales polynomially in and . They also show that underdamped MCMC has a better dependency with respect to and by a square root factor. In our analysis, we consider a larger class of non-convex functions since we do not assume strong convexity in any region, and therefore the distance to the invariant distribution scales exponentially with dimension in the worst-case. When is globally strongly convex (or equivalently when the target distribution is strongly log-concave), there is also a growing interesting literature that establish performance bounds for both overdamped MCMC [Dal17] and underdamped MCMC methods [CCBJ17, MS17]. Underdamped Langevin MCMC (also known as Hamiltonian MCMC) and its practical applications have also been analyzed further in a number of recent works [LV18, BBL17, Bet17, BBG14].
Acceleration of first-order gradient or stochastic gradient methods and their variance-reduced versions for finding a local stationary point (a point with a gradient less than in norm) is also studied in the literature (see e.g. [CDHS18, Nes83, GL16, AZH16]). It has also been shown that under some assumptions momentum-based accelerated methods can escape saddle points faster [OW17, LCZZ18]. In contrast, in this work, our focus is obtaining performance guarantees for convergence to global minimizers instead. There is also an alternative approach for non-convex optimization based on graduated optimization techniques [HLSS16] that creates a sequence of smoothed approximations to an objective.
Su et al. [SBC14] shows that Nesterov’s accelerated gradient method [Nes83] tracks a second-order ODE (also referred to as the Nesterov’s ODE in the literature), closely whereas the first-order non-accelerated methods such as the classical gradient descent are known to track the first-order gradient flow dynamics. The authors show that for convex objectives, Nesterov’s ODE converges to its equilibrium faster (in time) than the first-order gradient flow ODE by a square root factor and show that the speed-up is also inherited by the discretized dynamics. Our results can be interpreted as the analogue of these results in the non-convex optimization setting where we deal with SDEs instead of ODEs building on the theory of Markov processes and show that SGHMC tracks the second-order (underdamped) Langevin SDE closely and inherits its faster convergence guarantees compared to first-order overdamped dynamics for non-convex problems.
2 Main Result
In our analysis, we will use the following 2-Wasserstein distance: For any two probability measures on , we define
where is the usual Euclidean norm, are two Borel probability measures on with finite second moments, and the infimum is taken over all random couples taking values in with marginals (see e.g. [Vil08]).
We first state the assumptions used in this paper below. Note that we do not assume to be convex or strongly convex in any region. The first assumption of non-negativity of can be assumed without loss of generality by subtracting a constant and shifting the coordinate system as long as is bounded below. The second assumption of Lipschitz gradients is in general unavoidable for discretized Langevin algorithms to be convergent [MSH02], and the third assumption is known as the dissipativity condition [Hal88] and is standard in the literature to ensure the convergence of Langevin diffusions to the stationary distribution [RRT17, EGZ17, MSH02]. The fourth assumption is regarding the amount of noise present in the gradient estimates and allows not only constant variance noise but allows the noise variance to grow with the norm of the iterates (which is the typical situation in mini-batch methods in stochastic gradient methods). Finally, the fifth assumption is a mild assumption saying that the initial distribution for the SGHMC dynamics should have a reasonable decay rate of the tails to ensure convergence to the stationary distribution. For instance, if the algorithm is started from any arbitrary point , then the Dirac measure would work. If the initial distribution is supported on a Euclidean ball with radius being some universal constant, it would also work. Similar assumptions on the initial distribution is also necessary to achieve convergence to a stationary measure in continuous-time underdamped dynamics as well (see e.g. [HN04b]).
We impose the following assumptions.
The function is continuously differentiable, takes non-negative real values, and there exist constants so that
for any .
For each , the function is -smooth:
For each , the function is -dissipative:
There exists a constant such that for every :
The probability law of the initial state satisfies:
where is a Lyapunov function to be used repeatedly for the rest of the paper:
and is a positive constant that can be found in Table 1 and is a positive constant less than .
We note that the Lyapunov function is used in [EGZ17] to study the rate of convergence to equilibrium for underdamped Langevin diffusion and is motivated by [MSH02]. It follows from the above assumptions (applying Lemma 13) that there exist constants and so that
Our first result shows SGHMC iterates track the underdamped Langevin SDE in the sense that the expectation of the empirical risk with respect to the probability law of conditional on the sample , denoted by , and the stationary distribution of the underdamped SDE is small when is large enough. The difference in expectations decomposes as a sum of two terms and while the former term quantifies the dependency on the initialization and the dataset whereas the latter term is controlled by the discretization error and the amount of noise parameter in the gradients. We also note that the parameter (formally defined later in Section 3.1) in our bounds governs the speed of convergence to the equilibrium of the underdamped Langevin diffusion.
In the next subsections, we discuss that this theorem combined with some basic properties of the equilibrium distribution leads to a number of results which provide performance guarantees for both the empirical risk and population risk minimization.
2.1 Performance bound for the empirical risk minimization
which is a measure of how much the marginal of the equilibrium distribution concentrates around a global minimizer of the empirical risk. As goes to infinity, it can be verified that this quantity goes to zero. For finite , it is also possible to derive explicit bounds (see Lemma 16) of the form
This combined with Theorem 2 immediately leads to the following performance bound for the empirical risk minimization. The proof is omitted.
Corollary 3 (Empirical risk minimization).
2.2 Performance bound for the population risk minimization
For controlling the population risk, in addition to the empirical risk, one has to account for the differences between the finite sample size problem (1.2) and the original problem (1.1). By exploiting the fact that the marginal of the invariant distribution for the underdamped dynamics is the same as it is in the overdamped case, it can be shown that the generalization error is no worse than that of the available bounds for SGLD given in [RRT17], and therefore, we have the following corollary. A more detailed proof will be given in Section 3.
2.3 Generalization error of SGHMC in the one pass regime
The generalization error of the SGHMC algorithm after step is defined as which is a measure of the differences between the objectives of the finite sample size problem (1.2) and the original problem (1.1). Since the predictor is random, it is natural to consider the expected generalization error (see e.g. [HRS16]) which admits the decomposition
where is the output of the underdamped Langevin dynamics, i.e. its conditional distribution is given by . If every sample is used once, i.e. if only one pass is made over the dataset, then the second term in (2.10) disappears. As a consequence, the generalization dynamics is controlled by the bound
The following result provides a bound on this quantity. The proof is similar to the proof of Theorem 2 and its corollaries, and hence omitted.
Under the setting of Theorem 2, we have
provided that (2.5) and (2.6) hold where is the output of the underdamped Langevin dynamics, i.e. its conditional distribution is given by and is defined by (2.8). Then, it follows from (2.11) that if each data point is used once, the expected generalization error satisfies
2.4 Performance comparison with respect to SGLD algorithm
We use the notations and to give explicit dependence on the parameters but it hides factors that depend (at worst polynomially) on other parameters and .
Generalization error in the one-pass setting.
A consequence of Theorem 5 is that the generalization error of the SGHMC iterates in the one-pass setting satisfy
for iterations (see also the discussion in Appendix G). On the other hand, the results in [RRT17, Theorem 1] imply that the generalization error for the SGLD algorithm after iterations in the one-pass setting scales as
The constants and are exponential in both and in the worst case, but under some extra assumptions the dependency on can be polynomial (see e.g. [CCBJ17]) although the exponential dependence to is unavoidable in the presence of multiple minima in general (see [BGK05]).
In both expressions (2.12) and (2.13), we see a term scaling with due to the gradient noise level . If we select so that the error term introduced by noisy gradients scaling as is on the same order as the other terms in (2.13) as suggested by the authors in [RRT17], and do the same for SGHMC, then we see that the generalization error for SGHMC (2.12) becomes
and if we ignore the factor
whereas the corresponding bound for SGLD from [RRT17, Theorem 1] is
Note that and do not have the same dependency to up to factors (the former scales with as and the latter ). To make the comparison to SGLD simpler, if we run SGHMC for number of iterations proportional to instead, then the generalization error behaves as,
that is (ignoring the factor),
Comparing and on arbitrary non-convex functions seems not trivial, however we give a class of non-convex functions (see Examples 6 and 7) where . For this class, comparing (2.16) and (2.15), we see that for given and fixed, SGHMC generalization error upper bound is at least on the order of the square root of that of the SGLD (Note that the improvement with respect to , and dependency is more than a square root factor, but the improvement from to can be a square root factor). Furthermore this can be achieved in significantly smaller number of iterations (up to a square root factor) if is fixed, i.e. can be on the order of the square root of if is given and fixed in both (2.16) and (2.16).
Empirical risk minimization.
The empirical risk minimization bound given in Corollary 3 has an additional term compared to the and terms appearing in the one-pass generalization bounds. Note also that . As a consequence, SGHMC algorithm has expected empirical risk
after iterations as opposed to
after iterations required in [RRT17]. Comparing (2.17) and (2.18), we see that the last terms are the same. If this term is the dominant term, then the empirical risk upper bounds for SGLD and SGHMC will be similar except that can be smaller than for instance when . Otherwise, if the last term is not the dominant one and can be ignored with respect to other terms, then, the performance comparison will be similar to the discussion about the generalization bounds (2.14) and (2.15) discussed above where a square root factor improvement for SGHMC compared to SGLD is possible.
Following [RRT17] and [XCZG17], if we define an almost empirical risk minimizer (ERM) as a point which is within the ball of the global minimizer with radius . Xu et al. [XCZG17] recently showed that it suffices to take
stochastic gradient computations to converge to an neighborhood of an almost ERM where hides some factors in and is the spectral gap of the discrete underdamped Markov Chain. Our results show that (see e.g. (2.17)), it suffices to have
stochastic gradient computations, ignoring the factors in the parameters and hiding factors in that can be made explicit. It is hard to compare and in general, the former being the spectral gap of the discretized overdamped Langevin dynamics and the latter being the spectral gap of the continuous-time overdamped dynamics. However, when the stepsize is small enough, will be similar to . As a consequence, when the stepsize is small enough (which is the case for instance, when target accuracy is small enough or is large enough), we will have and for the class of non-convex functions we discuss in Examples 6 and 7. For this class of problems, comparing (2.19) and (2.20), we see that we obtain an improvement in the dimension dependent factors in front of the spectral gap parameter and term (scaling vs. ) as well as an improvement in the spectral gap parameter ( vs. ), however dependency of the bound (2.19) is better than (2.20) (scaling vs. ).
Population risk minimization.
If samples are recycled and multiple passes over the dataset is made, then one can see from Corollary 4 that there is an extra term that needs to be added to the bounds given in (2.17) and (2.18). This term satisfies
If this term is dominant compared to other terms and , for instance this may happen if the number of samples is not large enough, then the performance guarantees for population risk minimization via SGLD and SGHMC will be similar. Otherwise, if is large and is chosen in a way to keep the term on the order , then the square root improvement can be achieved.
Comparison of and .
The parameters and govern the convergence rate to the equilibrium of the overdamped and underdamped Langevin SDE, they can be both exponentially large in dimension and in . They appear naturally in the complexity estimates of SGHMC and SGLD method as they can be viewed as discretizations of these methods (when the discretization step is small and the gradient noise , the discrete dynamics will behave the same as the continuous dynamics) and arise naturally in the iteration complexity of both methods. Next, we discuss classes of non-convex functions in