On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning
Abstract
Generalization error (also known as the out-of-sample error) measures how well the hypothesis learned from training data generalizes to previously unseen data. Proving tight generalization error bounds is a central question in statistical learning theory. In this paper, we obtain generalization error bounds for learning general non-convex objectives, which has attracted significant attention in recent years. We develop a new framework, termed Bayes-Stability, for proving algorithm-dependent generalization error bounds. The new framework combines ideas from both the PAC-Bayesian theory and the notion of algorithmic stability. Applying the Bayes-Stability method, we obtain new data-dependent generalization bounds for stochastic gradient Langevin dynamics (SGLD) and several other noisy gradient methods (e.g., with momentum, mini-batch and acceleration, Entropy-SGD). Our result recovers (and is typically tighter than) a recent result in Mou et al. (2018) and improves upon the results in Pensia et al. (2018). Our experiments demonstrate that our data-dependent bounds can distinguish randomly labelled data from normal data, which provides an explanation for the intriguing phenomena observed in Zhang et al. (2017a). We also study the setting where the total loss is the sum of a bounded loss and an additional $\ell_2$ regularization term. We obtain new generalization bounds for the continuous Langevin dynamics in this setting by developing a new Log-Sobolev inequality for the parameter distribution at any time. Our new bounds are more desirable when the noise level of the process is not very small, and do not become vacuous even when the training time $T$ tends to infinity.
1 Introduction
Non-convex stochastic optimization is the major workhorse of modern machine learning. For instance, the standard supervised learning on a model class parametrized by $w \in \mathbb{R}^d$ can be formulated as the following optimization problem:
$$\min_{w \in \mathbb{R}^d}\; F(w) := \mathbb{E}_{z \sim \mathcal{D}}\big[f(w, z)\big],$$
where $w$ denotes the model parameter, $\mathcal{D}$ is an unknown data distribution over the instance space $\mathcal{Z}$, and $f(w, z)$ is a given objective function which may be non-convex. A learning algorithm $A$ takes as input a sequence of data points $S = (z_1, \dots, z_n)$ sampled i.i.d. from $\mathcal{D}$, and outputs a (possibly randomized) parameter configuration $A(S) \in \mathbb{R}^d$.
A fundamental problem in learning theory is to understand the generalization performance of learning algorithms: is the algorithm guaranteed to output a model that generalizes well to the data distribution $\mathcal{D}$? Specifically, we aim to prove upper bounds on the generalization error $L(w) - L_S(w)$, where $L(w) := \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ and $L_S(w) := \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$ are the population and empirical losses, respectively. We note that the loss function $\ell$ (e.g., the 0/1 loss) could be different from the objective function $f$ (e.g., the cross-entropy loss) used in the training process (which serves as a surrogate for the loss $\ell$).
Classical learning theory relates the generalization error to various complexity measures (e.g., the VC-dimension and Rademacher complexity) of the model class. Directly applying these classical complexity measures, however, often fails to explain the recent success of over-parametrized neural networks, where the model complexity significantly exceeds the amount of available training data (see e.g., Zhang et al. (2017a)). By incorporating certain data-dependent quantities such as margin and compressibility into the classical framework, some recent work (e.g., Bartlett et al. (2017); Arora et al. (2018); Wei and Ma (2019)) obtains more meaningful generalization bounds in the deep learning context.
An alternative approach to generalization is to prove algorithm-dependent bounds. One celebrated example along this line is the algorithmic stability framework initiated by Bousquet and Elisseeff (2002). Roughly speaking, the generalization error can be bounded by the stability of the algorithm (see Section 2 for the details). Using this framework, Hardt et al. (2016) study the stability (hence the generalization) of stochastic gradient descent (SGD) for both convex and non-convex functions. Their work motivates the recent study of the generalization performance of several other gradient-based optimization methods: Kuzborskij and Lampert (2018); London (2016); Chaudhari et al. (2017); Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018); Chen et al. (2018).
In this paper, we study the algorithmic stability and generalization performance of various iterative gradient-based methods, with certain continuous noise injected in each iteration, in a non-convex setting. As a concrete example, we consider the stochastic gradient Langevin dynamics (SGLD) (see Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018)). Viewed as a variant of SGD, SGLD adds an isotropic Gaussian noise at every update step:
$$W_{t+1} = W_t - \gamma_t\, g_t(W_t) + \sqrt{2\gamma_t/\beta}\,\zeta_t, \qquad \zeta_t \sim \mathcal{N}(0, I_d), \tag{1}$$
where $g_t$ denotes either the full gradient $\nabla F_S$ or the gradient over a mini-batch sampled from the training dataset, $\gamma_t$ is the step size, and $\beta$ is the inverse temperature. We also study a continuous version of (1), which is the dynamics defined by the following stochastic differential equation (SDE):
$$\mathrm{d}W_t = -\nabla F_S(W_t)\,\mathrm{d}t + \sqrt{2/\beta}\,\mathrm{d}B_t, \tag{2}$$
where $(B_t)_{t \ge 0}$ is the standard Brownian motion.
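To make the update rule (1) concrete, the following minimal NumPy sketch implements one SGLD step under the gradient-step-plus-$\sqrt{2\gamma/\beta}$-scaled-noise parametrization stated above. The toy objective and all hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgld_step(w, grad_fn, gamma, beta, rng):
    """One SGLD update: gradient step plus isotropic Gaussian noise,
    W_{t+1} = W_t - gamma * g_t(W_t) + sqrt(2 * gamma / beta) * zeta_t,
    with zeta_t ~ N(0, I_d)."""
    noise = rng.standard_normal(w.shape)
    return w - gamma * grad_fn(w) + np.sqrt(2.0 * gamma / beta) * noise

# Toy usage: a non-convex objective f(w) = (w^2 - 1)^2 / 4 with minima at +-1.
rng = np.random.default_rng(0)
grad = lambda w: w * (w ** 2 - 1.0)   # gradient of the toy objective
w = np.array([2.0])
for _ in range(500):
    w = sgld_step(w, grad, gamma=0.01, beta=100.0, rng=rng)
```

With a moderately large inverse temperature $\beta$, the iterate settles near one of the two minima while still fluctuating due to the injected noise.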
1.1 Related Work
Most related to our work is the study of algorithm-dependent generalization bounds of stochastic gradient methods. Hardt et al. (2016) first study the generalization performance of SGD via algorithmic stability. They prove a generalization bound that scales linearly with $T$, the number of iterations, when the loss function is convex, but their results for general non-convex optimization are more restricted. London (2017) and Rivasplata et al. (2018) also combine ideas from both the PAC-Bayesian theory and algorithmic stability. However, these works are essentially different from ours. In London (2017), the prior and posterior are distributions on the hyperparameter space instead of distributions on the hypothesis space. Rivasplata et al. (2018) study the hypothesis stability measured by the distance on the hypothesis space in a setting where the returned hypothesis (model parameter) is perturbed by a Gaussian noise. Our work is a follow-up of the recent work by Mou et al. (2018), in which they provide generalization bounds for SGLD from both stability and PAC-Bayesian perspectives. Another closely related work by Pensia et al. (2018) derives similar bounds for noisy stochastic gradient methods, based on the information-theoretic framework of Xu and Raginsky (2017). However, their bounds scale as $1/\sqrt{n}$ (where $n$ is the size of the training dataset) and are suboptimal even for SGLD.
We acknowledge that besides the algorithm-dependent approach that we follow, recent advances in learning theory aim to explain the generalization performance of neural networks from many other perspectives. Some of the most prominent ideas include bounding the network capacity by the norms of weight matrices Neyshabur et al. (2015); Liang et al. (2019), margin theory Bartlett et al. (2017); Wei et al. (2019), PAC-Bayesian theory Dziugaite and Roy (2017); Neyshabur et al. (2018); Dziugaite and Roy (2018), network compressibility Arora et al. (2018), and over-parametrization Du et al. (2019); Allen-Zhu et al. (2019); Zou et al. (2018); Chizat et al. (2019). Most of these results are stated in the context of neural networks (some are tailored to networks with specific architecture), whereas our work addresses generalization in non-convex stochastic optimization in general. We also note that some recent work provides explanations for the phenomenon reported in Zhang et al. (2017a) from a variety of different perspectives (e.g., Bartlett et al. (2017); Arora et al. (2018, 2019)).
Welling and Teh (2011) first consider stochastic gradient Langevin dynamics (SGLD) as a sampling algorithm in the Bayesian inference context. Raginsky et al. (2017) give a non-asymptotic analysis and establish the finite-time convergence guarantee of SGLD to an approximate global minimum. Zhang et al. (2017b) analyze the hitting time of SGLD and prove that SGLD converges to an approximate local minimum. These results are further improved and generalized to a family of Langevin-dynamics-based algorithms by the subsequent work of Xu et al. (2018).
1.2 Overview of Our Results
In this paper, we provide generalization guarantees for the noisy variants of several popular stochastic gradient methods.
The Bayes-Stability method and data-dependent generalization bounds. We develop a new method for proving generalization bounds, termed Bayes-Stability, by incorporating ideas from the PAC-Bayesian theory into the stability framework. In particular, assuming the loss takes values in $[0, C]$, our method shows that the generalization error is bounded by both $2C\,\mathbb{E}_{z\sim\mathcal{D}}\big[\sqrt{\mathrm{KL}(P\,\|\,Q_z)/2}\big]$ and $2C\,\mathbb{E}_{z\sim\mathcal{D}}\big[\sqrt{\mathrm{KL}(Q_z\,\|\,P)/2}\big]$, where $P$ is a prior distribution independent of the training set $S$, and $Q_z$ is the expected posterior distribution conditioned on $z_n = z$ (i.e., the last training data point is $z$). The formal definition and the results can be found in Definition 5 and Theorem 7.
Inspired by Lever et al. (2013), instead of using a fixed prior distribution, we bound the KL-divergence from the posterior to a distribution-dependent prior. This enables us to derive the following generalization error bound that depends on the expected norm of the gradient along the optimization path:
$$\mathbb{E}\big[L(W_T) - L_S(W_T)\big] \le O\left(\frac{C}{n}\sqrt{\beta\sum_{t=1}^{T}\gamma_t\, g_e(t)}\right). \tag{3}$$
Here $S$ is the training dataset and $g_e(t) := \mathbb{E}_S\big[\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(W_t, z_i)\|_2^2\big]$ is the expected empirical squared gradient norm at step $t$; see Theorem 11 for the details.
Compared with the previous bound $O\big(\frac{LC}{n}\sqrt{\beta\sum_{t=1}^{T}\gamma_t}\big)$ in (Mou et al., 2018, Theorem 1), where $L$ is the global Lipschitz constant of the loss, our new bound (3) depends on the data distribution and is typically tighter (as the gradient norm $\sqrt{g_e(t)}$ is at most $L$). In modern deep neural networks, the worst-case Lipschitz constant can be quite large, and typically much larger than the expected empirical gradient norm along the optimization trajectory. Specifically, in the later stage of the training, the expected empirical gradient is small (see Figure 1(d) for the details). Hence, our generalization bound does not grow much even if we train longer at this stage.
Our new bound also offers an explanation for the difference between training on correct and random labels observed by Zhang et al. (2017a). In particular, we show empirically that the sum of expected squared gradient norms (along the optimization path) is significantly higher when the training labels are replaced with random labels (Section 3.1, Figure 1, Appendix C.2).
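The data-dependent quantity inside the square root of (3) is easy to track during training. The following toy sketch (our illustration, not the paper's experimental setup) trains logistic regression with full-gradient Langevin dynamics and accumulates $\sum_t \gamma_t\, g_e(t)$, estimated by the average squared per-example gradient norm, for true versus random labels; the model, data, and hyperparameters are all hypothetical.

```python
import numpy as np

def grad_norm_sum(X, y, steps=200, gamma=0.1, beta=1e4, seed=0):
    """Train logistic regression with GLD and accumulate sum_t gamma * g_e(t),
    where g_e(t) is the average squared per-example gradient norm at step t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    total = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # per-example predictions
        per_grad = (p - y)[:, None] * X       # per-example gradients, shape (n, d)
        total += gamma * np.mean(np.sum(per_grad ** 2, axis=1))
        g = per_grad.mean(axis=0)             # full-batch gradient
        w = w - gamma * g + np.sqrt(2 * gamma / beta) * rng.standard_normal(d)
    return total

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y_true = (X[:, 0] > 0).astype(float)                  # labels consistent with data
y_rand = rng.integers(0, 2, size=200).astype(float)   # random labels
s_true, s_rand = grad_norm_sum(X, y_true), grad_norm_sum(X, y_rand)
```

With consistent labels the per-example gradients shrink as the model fits, whereas with random labels the residuals stay large, so the accumulated sum (and hence the bound) is larger in the random-label case.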
We would also like to mention the PAC-Bayesian bound (for SGLD with $\ell_2$ regularization) proposed by Mou et al. (2018). (This bound is different from the one we mentioned before; see Theorem 2 in their paper.) Their bound scales as $1/\sqrt{n}$ and the numerator of their bound has a similar sum of gradient norms (with a decaying weight if the regularization coefficient $\lambda$ is positive). Their bound is based on the PAC-Bayesian approach and holds with high probability, while our bound only holds in expectation.
Extensions. We remark that our technique allows for an arguably simpler proof of (Mou et al., 2018, Theorem 1); the original proof is based on SDE and the Fokker-Planck equation. More importantly, our technique can be easily extended to handle mini-batches and a variety of general settings as follows.

Extension to other gradient-based methods. Our results naturally extend to other noisy stochastic gradient methods, including momentum due to Polyak (1964) (Theorem 26), Nesterov’s accelerated gradient method in Nesterov (1983) (Theorem 26), and Entropy-SGD proposed by Chaudhari et al. (2017) (Theorem 27).

Extension to general noises. The proof of the generalization bound in Mou et al. (2018) relies heavily on the fact that the noise is Gaussian^{1}, which makes it difficult to generalize to other noise distributions such as the Laplace distribution. In contrast, our analysis easily carries over to the class of log-Lipschitz noises (i.e., noises drawn from distributions with Lipschitz log densities).
Pathwise stability. In practice, it is also natural to output a certain function of the entire optimization path, e.g., the iterate with the smallest empirical risk or a weighted average of the iterates. We show that the same generalization bound holds for all such variants (Remark 12). We note that the analysis in an independent work of Pensia et al. (2018) also satisfies this property, yet their bound scales as $1/\sqrt{n}$ (see Corollary 1 in their work), which is a slower rate than our $1/n$ when dealing with bounded loss.^{2}
Generalization bounds with regularization via Log-Sobolev inequalities. We also study the setting where the total objective function is the sum of a bounded differentiable objective and an additional $\ell_2$ regularization term $\frac{\lambda}{2}\|w\|_2^2$. In this case, the total objective can be treated as a perturbation of a quadratic function, and the continuous Langevin dynamics (CLD) is well understood for quadratic functions. We obtain two generalization bounds for CLD, both via the technique of Log-Sobolev inequalities, a powerful tool for proving the convergence rate of CLD. One of our bounds is as follows (Theorem 15):
$$\mathbb{E}\big[L(W_t) - L_S(W_t)\big] \le \frac{2CL}{n}\sqrt{\frac{\beta}{\lambda}\Big(1 - e^{-\lambda t/\beta}\Big)}. \tag{4}$$
The above bound has the following advantages:

Applying the inequality $1 - e^{-x} \le x$, one can see that our bound is at most $\frac{2CL\sqrt{t}}{n}$, which matches the previous bound in (Mou et al., 2018, Proposition 8)^{3}.
As the time $t$ grows, the bound is upper bounded by $\frac{2CL}{n}\sqrt{\beta/\lambda}$ and approaches this limit (unlike the previous bound, which goes to infinity as $t \to \infty$).

If the noise level is not so small (i.e., $\beta$ is not very large), the generalization bound is quite desirable.
Our analysis is based on a Log-Sobolev inequality (LSI) for the parameter distribution at time $t$, whereas most known LSIs only hold for the stationary distribution of the Markov process. We prove the new LSI by exploiting the variational formulation of the entropy functional.
2 Preliminaries
Notations. We use $\mathcal{D}$ to denote the data distribution. The training dataset $S = (z_1, \dots, z_n)$ is a sequence of $n$ independent samples drawn from $\mathcal{D}$. $S$ and $S'$ are called neighboring datasets if and only if they differ at exactly one data point (we could assume without loss of generality that they differ at the last data point). Let $f(w, z)$ and $\ell(w, z)$ be the objective and the loss functions, respectively, where $w \in \mathbb{R}^d$ denotes a model parameter and $z$ is a data point. Define $F_S(w) := \frac{1}{n}\sum_{i=1}^{n} f(w, z_i)$ and $F(w) := \mathbb{E}_{z\sim\mathcal{D}}[f(w, z)]$; $L_S$ and $L$ are defined similarly for the loss $\ell$. A learning algorithm $A$ takes as input a dataset $S$, and outputs a parameter $A(S)$ randomly. Let $\mathcal{B}$ be the set of all possible mini-batches. $\mathcal{B}_i$ denotes the collection of mini-batches that contain the $i$-th data point, while $\mathcal{B}_{-i} := \mathcal{B} \setminus \mathcal{B}_i$. Let $\mathrm{diam}(K) := \sup_{x, y \in K}\|x - y\|_2$ denote the diameter of a set $K$.
Definition 1 ($L$-Lipschitz).
A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz if and only if $|g(u) - g(v)| \le L\|u - v\|_2$ holds for any $u, v \in \mathbb{R}^d$.
Definition 2 (Expected generalization error).
The expected generalization error of a learning algorithm $A$ is defined as $\mathrm{err}_{\mathrm{gen}}(A) := \mathbb{E}_{S, A}\big[L(A(S)) - L_S(A(S))\big]$.
Algorithmic Stability. Intuitively, a learning algorithm that is stable (i.e., a small perturbation of the training data does not affect its output too much) can generalize well. In the seminal work of Bousquet and Elisseeff (2002) (see also Hardt et al. (2016)), the authors formally defined algorithmic stability and established a close connection between the stability of a learning algorithm and its generalization performance.
Definition 3 (Uniform stability).
A randomized algorithm $A$ is $\varepsilon$-uniformly stable if for any pair of neighboring datasets $S$ and $S'$, $\sup_{z}\big|\mathbb{E}_A[\ell(A(S), z)] - \mathbb{E}_A[\ell(A(S'), z)]\big| \le \varepsilon$.
Lemma 4 (Generalization in expectation).
(Hardt et al. (2016)) Suppose a randomized algorithm $A$ is $\varepsilon$-uniformly stable. Then, $|\mathrm{err}_{\mathrm{gen}}(A)| \le \varepsilon$.
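As a toy illustration of Definition 3 (ours, not from the paper), consider the deterministic learner that outputs the empirical mean of data in $[0, 1]$, evaluated with a 1-Lipschitz bounded loss. Replacing one data point moves the output by at most $1/n$, so the learner is $(1/n)$-uniformly stable, and by Lemma 4 its expected generalization error is at most $1/n$. The sketch below checks the stability claim numerically.

```python
import numpy as np

def learn(S):
    """Toy deterministic learner: the empirical mean of data in [0, 1]."""
    return float(np.mean(S))

def loss(w, z):
    """A 1-Lipschitz loss bounded in [0, 1]."""
    return min(abs(w - z), 1.0)

rng = np.random.default_rng(0)
n = 100
S = rng.random(n)
# Replace one point by an arbitrary z' in [0, 1]; the output moves by at
# most 1/n, and the loss is 1-Lipschitz in w, so the loss on any test point
# changes by at most 1/n (uniform stability with eps = 1/n).
worst = 0.0
for zp in np.linspace(0.0, 1.0, 11):
    S2 = S.copy()
    S2[0] = zp
    for z in np.linspace(0.0, 1.0, 11):
        worst = max(worst, abs(loss(learn(S), z) - loss(learn(S2), z)))
```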
3 Bayes-Stability Method
In this section, we incorporate ideas from the PAC-Bayesian theory (see e.g., Lever et al. (2013)) into the algorithmic stability framework. Combined with the technical tools introduced in previous sections, the new framework enables us to prove tighter data-dependent generalization bounds.
First, we define the posterior of a dataset and the posterior of a single data point.
Definition 5 (Single-point posterior).
Let $Q_S$ be the posterior distribution of the parameter for a given training dataset $S$. In other words, it is the probability distribution of the output of the learning algorithm on dataset $S$ (e.g., for $T$ iterations of SGLD in (1), $Q_S$ is the pdf of $W_T$). The single-point posterior is defined as $Q_z := \mathbb{E}_{z_1, \dots, z_{n-1} \sim \mathcal{D}}\big[Q_{(z_1, \dots, z_{n-1}, z)}\big]$.
For convenience, we make the following natural assumption on the learning algorithm:
Assumption 6 (Order-independent).
For any fixed dataset $S = (z_1, \dots, z_n)$ and any permutation $\pi$ of $\{1, \dots, n\}$, $Q_S$ is the same as $Q_{\pi(S)}$, where $\pi(S) := (z_{\pi(1)}, \dots, z_{\pi(n)})$.
Assumption 6 implies that the posterior conditioned on the $i$-th data point being $z$ is the same for every index $i$, so we use $Q_z$ as a shorthand for this single-point posterior in the following. Note that this assumption can be easily satisfied by letting the learning algorithm randomly permute the training data at the beginning. It is also easy to verify that both SGD and SGLD satisfy the order-independent assumption.
Now, we state our new Bayes-Stability framework, which holds for any prior distribution $P$ over the parameter space that is independent of the training dataset $S$.
Theorem 7 (Bayes-Stability).
Suppose the loss function $\ell$ is bounded in $[0, C]$ and the learning algorithm is order-independent (Assumption 6). Then for any prior distribution $P$ not depending on $S$, the generalization error is bounded by both $2C\,\mathbb{E}_{z\sim\mathcal{D}}\big[\sqrt{\mathrm{KL}(P\,\|\,Q_z)/2}\big]$ and $2C\,\mathbb{E}_{z\sim\mathcal{D}}\big[\sqrt{\mathrm{KL}(Q_z\,\|\,P)/2}\big]$.
Remark 8.
Our Bayes-Stability framework originates from the algorithmic stability framework, and hence is similar to the notions of uniform stability and leave-one-out error (see Elisseeff and Pontil (2003)). However, there are important differences. Uniform stability is a distribution-independent property, while Bayes-Stability can incorporate the information of the data distribution (through the prior $P$). Leave-one-out error measures the loss of a learned model on an unseen data point, yet Bayes-Stability focuses on the extent to which a single data point affects the outcome of the learning algorithm (compared to the prior).
To establish an intuition, we first apply this framework to obtain an expected generalization bound for (full) gradient Langevin dynamics (GLD), which is a special case of SGLD in (1) (i.e., GLD uses the full gradient $\nabla F_S$ as $g_t$).
Theorem 9.
Suppose that the loss function $\ell$ is bounded in $[0, C]$. Then we have the following expected generalization bound for $T$ iterations of GLD:
$$\mathbb{E}\big[L(W_T) - L_S(W_T)\big] \le \frac{C}{n}\sqrt{\frac{\beta}{2}\sum_{t=1}^{T}\gamma_t\, g_e(t)},$$
where $g_e(t) := \mathbb{E}_S\big[\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(W_t, z_i)\|_2^2\big]$ is the expected empirical squared gradient norm, and $W_t$ is the parameter at step $t$ of GLD.
Proof The proof builds upon the following technical lemma, which we prove in Appendix A.2.
Lemma 10.
Let $(W_t)_{t \ge 0}$ and $(W'_t)_{t \ge 0}$ be two independent sequences of random variables such that for each $t$, $W_t$ and $W'_t$ have the same support. Suppose $W_0$ and $W'_0$ follow the same distribution. Then,
$$\mathrm{KL}(W_T \,\|\, W'_T) \le \sum_{t=0}^{T-1} \mathbb{E}_{w \sim W_t}\Big[\mathrm{KL}\big(W_{t+1}|_{W_t = w} \,\big\|\, W'_{t+1}|_{W'_t = w}\big)\Big],$$
where $W_{t+1}|_{W_t = w}$ denotes the conditional distribution of $W_{t+1}$ given $W_t = w$, and $W'_{t+1}|_{W'_t = w}$ denotes the conditional distribution of $W'_{t+1}$ given $W'_t = w$.
Define the prior $P := Q_{z_0}$, where $z_0$ denotes the zero data point (i.e., $f(w, z_0) = 0$ for any $w$). Theorem 7 shows that
$$\mathrm{err}_{\mathrm{gen}} \le 2C\,\mathbb{E}_{z\sim\mathcal{D}}\Big[\sqrt{\mathrm{KL}(Q_z \,\|\, P)/2}\Big]. \tag{5}$$
By the convexity of the KL-divergence, for a fixed $z$, we have
$$\mathrm{KL}(Q_z \,\|\, P) \le \mathbb{E}_{z_1, \dots, z_{n-1} \sim \mathcal{D}}\Big[\mathrm{KL}\big(Q_{(z_1, \dots, z_{n-1}, z)} \,\big\|\, Q_{(z_1, \dots, z_{n-1}, z_0)}\big)\Big]. \tag{6}$$
Let $(W_t)$ and $(W'_t)$ be the training processes of GLD for $S = (z_1, \dots, z_{n-1}, z)$ and $S' = (z_1, \dots, z_{n-1}, z_0)$, respectively. Note that for a fixed $w$, both $W_{t+1}|_{W_t = w}$ and $W'_{t+1}|_{W'_t = w}$ are Gaussian distributions with covariance $(2\gamma_t/\beta) I_d$, whose means differ by $\gamma_t\|\nabla F_S(w) - \nabla F_{S'}(w)\|_2 = \frac{\gamma_t}{n}\|\nabla f(w, z)\|_2$, so $\mathrm{KL}\big(W_{t+1}|_{W_t = w} \,\|\, W'_{t+1}|_{W'_t = w}\big) = \frac{\beta\gamma_t}{4n^2}\|\nabla f(w, z)\|_2^2$ (see Lemma 18 in Appendix A.2).
Applying Lemma 10 and (6) gives $\mathrm{KL}(Q_z \,\|\, P) \le \frac{\beta}{4n^2}\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}\big[\|\nabla f(W_t, z)\|_2^2\big]$.
Recall that $W_t$ is the parameter at step $t$ using $S = (z_1, \dots, z_{n-1}, z)$ as the dataset. In this case, we can rewrite $\nabla f(W_t, z)$ as $\nabla f(W_t, z_n)$ since $z$ is the $n$-th data point of $S$. Note that SGLD satisfies the order-independent assumption, so we can rewrite $z_n$ as $z_i$ for any $i$. Together with (5), (6), and the concavity of the square root, we can prove this theorem.
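The key single-step computation in the proof can be checked numerically: for two Gaussians with common covariance $(2\gamma/\beta)I_d$ and means differing by $(\gamma/n)\nabla f(w, z)$, the closed-form KL-divergence equals $\frac{\beta\gamma}{4n^2}\|\nabla f(w, z)\|_2^2$. The sketch below verifies this identity with a hypothetical gradient vector, assuming the noise parametrization stated in (1).

```python
import numpy as np

# One GLD step on neighboring datasets differing in the last point produces
# two Gaussians with equal covariance sigma2 * I = (2 * gamma / beta) * I and
# means shifted by (gamma / n) * grad_f. For equal covariances,
# KL(N(m1, s2 I) || N(m2, s2 I)) = ||m1 - m2||^2 / (2 * s2).
gamma, beta, n = 0.05, 10.0, 50
grad_f = np.array([0.3, -1.2, 0.7])      # hypothetical per-example gradient
shift = (gamma / n) * grad_f
sigma2 = 2.0 * gamma / beta
kl_closed_form = np.dot(shift, shift) / (2.0 * sigma2)
kl_claimed = beta * gamma / (4.0 * n ** 2) * np.dot(grad_f, grad_f)
```

The two expressions agree exactly, which is the per-step term that gets summed by Lemma 10.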
More generally, we give the following bound for SGLD. The proof is similar to that of Theorem 9; the difference is that we need to bound the KL-divergence between two Gaussian mixtures instead of two Gaussians. This proof is more technical and is deferred to Appendix A.3.
Theorem 11.
Suppose that the loss function $\ell$ is bounded in $[0, C]$ and the objective function $f(\cdot, z)$ is $L$-Lipschitz for every $z$. Assume that the following conditions hold:

Batch size .

Learning rate .
Then, the following expected generalization error bound holds for $T$ iterations of SGLD (1):
$$\mathbb{E}\big[L(W_T) - L_S(W_T)\big] \le O\left(\frac{C}{n}\sqrt{\beta\sum_{t=1}^{T}\gamma_t\, g_e(t)}\right), \qquad \text{(empirical norm)}$$
where $g_e(t) := \mathbb{E}_S\big[\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(W_t, z_i)\|_2^2\big]$ is the expected empirical squared gradient norm, and $W_t$ is the parameter at step $t$ of SGLD.
Furthermore, based on essentially the same proof, we can obtain the following bound that depends on the population gradient norm $g_p(t) := \mathbb{E}_S\,\mathbb{E}_{z\sim\mathcal{D}}\big[\|\nabla f(W_t, z)\|_2^2\big]$:
The full proofs of the above results are postponed to Appendix A, and we provide some remarks about the new bounds.
Remark 12.
In fact, our proof establishes the above upper bound for the KL-divergence between the distributions of the two entire optimization paths $(W_1, \dots, W_T)$ and $(W'_1, \dots, W'_T)$. Hence, by the data-processing inequality, our bound holds for any (possibly randomized) function of the parameter sequence. In particular, our generalization error bound automatically extends to several variants of SGLD, such as outputting the average of the trajectory, the average of a suffix of certain length, or an exponential moving average.
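The trajectory-based outputs covered by this remark can be sketched as follows; the EMA decay factor of 0.9 is an arbitrary illustrative choice, not a value from the paper.

```python
import numpy as np

def trajectory_outputs(ws):
    """Given the optimization path ws (sequence of parameter vectors), return
    several trajectory-based outputs: last iterate, full average, suffix
    average, and an exponential moving average (EMA)."""
    ws = np.asarray(ws)
    T = len(ws)
    ema = ws[0]
    for w in ws[1:]:
        ema = 0.9 * ema + 0.1 * w   # EMA with a hypothetical decay of 0.9
    return {
        "last": ws[-1],
        "average": ws.mean(axis=0),
        "suffix_average": ws[T // 2:].mean(axis=0),
        "ema": ema,
    }

# Toy path: one-dimensional iterates 0, 1, ..., 9.
outs = trajectory_outputs([np.array([float(t)]) for t in range(10)])
```

All four outputs are functions of the same trajectory, so the same generalization bound applies to each of them.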
Remark 13 (High-probability bounds).
By relaxing the expected squared gradient norm term $g_e(t)$ to the worst-case bound $L^2$ and using the uniform stability framework, our proof can be adapted to recover the bound in (Mou et al., 2018, Theorem 1). Then, we can apply the recent results of Feldman and Vondrak (2019) to obtain a generalization error bound that holds with high probability, at the cost of an additional $\widetilde{O}(C/\sqrt{n})$ term. (Here $\widetilde{O}$ hides polylogarithmic factors.) When the stability parameter is at least $\Omega(1/\sqrt{n})$, the additional term is not dominating.
3.1 Experiment
Distinguishing random from normal data. Inspired by Zhang et al. (2017a), we run both GLD (Figure 1) and SGLD (Appendix C.2) to fit both normal data and randomly labeled data (see Appendix C for more experimental details). As shown in Figure 1 and Figure 3 in Appendix C.2, a larger random-label portion leads to both a much higher generalization error and a much larger generalization error bound. Moreover, the shapes of the curves of our bounds look quite similar to those of the generalization error curves.
Note that in (b) and (c) of Figure 1, the scales of the axes are different.
We list some possible reasons that may explain why our bound is larger than the actual generalization error. (1) As we explained in Remark 12, our bounds (Theorems 9 and 11) hold for any trajectory-based output, and are hence much stronger than upper bounds for the last point on the trajectory alone. (2) The constant we can prove in Lemma 21 may not be very tight. (3) The variance of the Gaussian noise is not large enough in our experiment; however, if we choose a larger variance, fitting the randomly labeled training data becomes quite slow. Hence, we use a small dataset for the above reason.
We also run an extra experiment for GLD on the full MNIST dataset without label corruption (see Figure 2 in Appendix C). We can see that our bound is non-vacuous (since GLD, which computes the full gradients, took a long time to converge, we stopped when we achieved 90% training accuracy).
Relaxing the step size constraint. The condition on the step size in Theorem 11 may seem restrictive for practical use.

The proof of Theorem 11 still goes through if we replace the Lipschitz constant $L$ with the maximum gradient norm along the optimization path in the constraint.

The maximum gradient norm can be controlled by gradient clipping, i.e., multiplying the factor $\min\big(1, \theta/\|g\|_2\big)$ to each gradient $g$, where $\theta$ is the clipping threshold.

Replacing the constant with in this constraint will only increase the constant of our bound from to .
We also provide an experiment combining the above ideas to make Theorem 11 applicable in practical use (see Figure 4 in Appendix C).
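The clipping operation described above can be sketched as follows; the threshold value in the usage example is a hypothetical choice.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so its norm never exceeds threshold: multiply by
    min(1, threshold / ||g||_2). This caps the maximum gradient norm
    appearing in the step size constraint."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)
    return g

g_small = clip_gradient(np.array([0.3, 0.4]), threshold=1.0)   # norm 0.5, unchanged
g_large = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)   # norm 5, rescaled
```

Clipping preserves the gradient direction while enforcing a hard cap on its norm.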
4 Generalization of CLD and GLD with regularization
In this section, we study the generalization error of Continuous Langevin Dynamics (CLD) with $\ell_2$ regularization. Throughout this section, we assume that the objective function over training set $S$ is defined as $F_S(w) := \hat{F}_S(w) + \frac{\lambda}{2}\|w\|_2^2$, where $\hat{F}_S$ is the unregularized empirical objective, and moreover, the following assumption holds.
Assumption 14.
The loss function $\ell$ takes values in $[0, C]$ and the original objective $\hat{F}_S$ is bounded. Moreover, $\hat{F}_S$ is differentiable and $L$-Lipschitz.
The Continuous Langevin Dynamics is defined by the following SDE:
$$\mathrm{d}W_t = -\nabla F_S(W_t)\,\mathrm{d}t + \sqrt{2/\beta}\,\mathrm{d}B_t, \qquad \text{(CLD)}$$
where $(B_t)_{t\ge0}$ is the standard Brownian motion on $\mathbb{R}^d$ and the initial distribution $\mu_0$ is the centered Gaussian distribution in $\mathbb{R}^d$ with covariance $(\beta\lambda)^{-1} I_d$. We show that the generalization error of CLD is upper bounded by $\frac{2CL}{n}\sqrt{\beta/\lambda}$, which is independent of the training time (Theorem 15). Furthermore, as the time $t$ goes to infinity, we have a tighter generalization error bound (Theorem 39 in Appendix B). We also study the generalization of Gradient Langevin Dynamics (GLD), which is the discretization of CLD:
$$W_{k+1} = W_k - \gamma\,\nabla F_S(W_k) + \sqrt{2\gamma/\beta}\,\zeta_k, \qquad \text{(GLD)}$$
where $\zeta_k$ is a standard Gaussian random vector in $\mathbb{R}^d$. By leveraging a result developed in Raginsky et al. (2017), we show that, as the step size $\gamma$ tends to zero, GLD has the same generalization behavior as CLD (see Theorems 15 and 39). We first formally state our first main result in this section.
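A minimal sketch of the (GLD) discretization for the regularized objective: when the unregularized part $\hat{F}_S$ vanishes, the process is an Ornstein-Uhlenbeck diffusion whose stationary (Gibbs) distribution is $\mathcal{N}(0, (\beta\lambda)^{-1} I_d)$, which gives a simple sanity check. All hyperparameter values below are illustrative assumptions.

```python
import numpy as np

def gld_regularized(grad_hat, w0, lam, gamma, beta, steps, seed=0):
    """Euler-Maruyama discretization of CLD for the regularized objective
    F_S(w) = F_hat(w) + (lam / 2) * ||w||^2:
        W_{k+1} = W_k - gamma * (grad_hat(W_k) + lam * W_k)
                      + sqrt(2 * gamma / beta) * zeta_k."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        drift = grad_hat(w) + lam * w
        w = w - gamma * drift + np.sqrt(2 * gamma / beta) * rng.standard_normal(w.shape)
    return w

# Sanity check: with F_hat = 0, samples should approach N(0, 1/(beta*lam)).
samples = np.array([
    gld_regularized(lambda w: 0.0 * w, [5.0], lam=1.0,
                    gamma=0.01, beta=2.0, steps=2000, seed=s)[0]
    for s in range(200)
])
# Target stationary variance: 1 / (beta * lam) = 0.5.
```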
Theorem 15.
Under Assumption 14, CLD (with initial probability measure $\mu_0 = \mathcal{N}(0, (\beta\lambda)^{-1} I_d)$) has the following expected generalization error bound:
$$\mathbb{E}\big[L(W_t) - L_S(W_t)\big] \le \frac{2CL}{n}\sqrt{\frac{\beta}{\lambda}\Big(1 - e^{-\lambda t/\beta}\Big)}. \tag{7}$$
In addition, if $\hat{F}_S$ is smooth and non-negative, then by choosing the step size $\gamma$ sufficiently small, GLD (running $\lfloor t/\gamma \rfloor$ iterations with the same $\beta$ and $\lambda$ as CLD) has the expected generalization error bound:
(8) 
where the constant hidden in (8) only depends on $\beta$, $\lambda$, $L$, $C$, and the dimension $d$.
The following lemma is crucial for establishing the above generalization bound for CLD. In particular, we need to establish a Log-Sobolev inequality for $p_t$, the parameter distribution at time $t$, for every $t > 0$. In contrast, most known LSIs only characterize the stationary distribution of the Markov process. The proof of the lemma can be found in Appendix B.
Lemma 16.
Proof Sketch of Theorem 15. Suppose $S$ and $S'$ are two neighboring datasets. Let $(W_t)$ and $(W'_t)$ be the processes of CLD running on $S$ and $S'$, respectively. Let $p_t$ and $p'_t$ be the pdfs of $W_t$ and $W'_t$. Let $R(t)$ denote $\mathrm{KL}(p_t \,\|\, p'_t)$. We have
$$\frac{\mathrm{d}}{\mathrm{d}t} R(t) \le -\frac{\lambda}{\beta} R(t) + \frac{2L^2}{n^2}. \qquad \text{(Lemma 16)}$$
Solving this inequality gives $\mathrm{KL}(p_t \,\|\, p'_t) \le \frac{2\beta L^2}{\lambda n^2}\big(1 - e^{-\lambda t/\beta}\big)$. Hence the generalization error of CLD can be bounded by $\frac{2CL}{n}\sqrt{\frac{\beta}{\lambda}\big(1 - e^{-\lambda t/\beta}\big)}$, which proves the first part. The second part of the theorem follows from Lemma 36 in Appendix B.
Our second generalization bound for CLD (Theorem 39 in Appendix B) is
The high-level idea behind the proof of this bound is very similar to that in Raginsky et al. (2017). We first observe that the (stationary) Gibbs distribution has a small generalization error. Then, we bound the distance from $p_t$ to the Gibbs distribution. In our setting, we can use the Holley-Stroock perturbation lemma, which allows us to bound the Logarithmic Sobolev constant, and we can thus bound the above distance easily.
5 Future Directions
In this paper, we prove new generalization bounds for a variety of noisy gradient-based methods. Our current techniques can only handle continuous noises for which we can bound the KL-divergence. One future direction is to study the discrete noise introduced in SGD (in this case the KL-divergence may not be well defined). For either SGLD or CLD, if the noise level is small (i.e., $\beta$ is large), it may take a long time for the diffusion process to reach the stationary distribution. Hence, another interesting future direction is to consider the local behavior and generalization of the diffusion process in finite time through the techniques developed in the studies of metastability (see e.g., Bovier et al. (2005); Bovier and den Hollander (2006); Tzen et al. (2018)). In particular, these techniques may be helpful for further improving the bounds in Theorems 15 and 39 (when $\beta$ is not very large).
6 Acknowledgement
We would like to thank Liwei Wang for several helpful discussions during various stages of the work. The research is supported in part by the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and the Zhongguancun Haihua Institute for Frontier Information Technology and Turing AI Institute of Nanjing.
Appendix A Proofs in Section 3
A.1 Bayes-Stability Framework
Lemma 17.
Under Assumption 6, for any prior distribution $P$ not depending on the dataset $S$, the generalization error is upper bounded by
$$\mathbb{E}_{z\sim\mathcal{D}}\Big[\big(\mathbb{E}_{w\sim Q_z} - \mathbb{E}_{w\sim P}\big)\big[L(w) - \ell(w, z)\big]\Big],$$
where $L(w)$ denotes the population loss $\mathbb{E}_{z\sim\mathcal{D}}[\ell(w, z)]$.
Proof of Lemma 17. Let $A = \mathbb{E}_{S, A}\big[L(A(S))\big]$ and $B = \mathbb{E}_{S, A}\big[L_S(A(S))\big]$. We can rewrite the generalization error as $A - B$, where
$$A = \mathbb{E}_S\,\mathbb{E}_{w\sim Q_S}\big[L(w)\big] = \mathbb{E}_{z\sim\mathcal{D}}\,\mathbb{E}_{w\sim Q_z}\big[L(w)\big] \qquad \text{(Assumption 6)}$$
and
$$B = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_S\,\mathbb{E}_{w\sim Q_S}\big[\ell(w, z_i)\big] = \mathbb{E}_{z\sim\mathcal{D}}\,\mathbb{E}_{w\sim Q_z}\big[\ell(w, z)\big]. \qquad \text{(Assumption 6)}$$
Moreover, since $P$ does not depend on the data,
$$\mathbb{E}_{z\sim\mathcal{D}}\,\mathbb{E}_{w\sim P}\big[L(w) - \ell(w, z)\big] = 0. \qquad \text{($P$ is a prior; definition of $L$)}$$
Thus, we have
$$A - B = \mathbb{E}_{z\sim\mathcal{D}}\Big[\big(\mathbb{E}_{w\sim Q_z} - \mathbb{E}_{w\sim P}\big)\big[L(w) - \ell(w, z)\big]\Big].$$
Now we are ready to prove Theorem 7, which we restate in the following.
Theorem 7 (Bayes-Stability).
Suppose the loss function $\ell$ is bounded in $[0, C]$ and the learning algorithm is order-independent (Assumption 6). Then for any prior distribution $P$ not depending on $S$, the generalization error is bounded by both $2C\,\mathbb{E}_{z\sim\mathcal{D}}\big[\sqrt{\mathrm{KL}(P\,\|\,Q_z)/2}\big]$ and $2C\,\mathbb{E}_{z\sim\mathcal{D}}\big[\sqrt{\mathrm{KL}(Q_z\,\|\,P)/2}\big]$.
Proof By Lemma 17,
$$\mathrm{err}_{\mathrm{gen}} \le \mathbb{E}_{z\sim\mathcal{D}}\big[2C \cdot \mathrm{TV}(Q_z, P)\big] \qquad \text{(boundedness)}$$
$$\le 2C\,\mathbb{E}_{z\sim\mathcal{D}}\Big[\sqrt{\mathrm{KL}(Q_z \,\|\, P)/2}\Big]. \qquad \text{(Pinsker’s inequality)}$$
The other bound follows from a similar argument.
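Pinsker's inequality, the final step of the proof above, can be checked numerically on random discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL-divergence between two discrete distributions (natural log)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.sum(np.abs(p - q)))

# Pinsker: TV(P, Q) <= sqrt(KL(P || Q) / 2). Track the worst violation.
rng = np.random.default_rng(0)
violation = -np.inf
for _ in range(1000):
    p = rng.random(6); p /= p.sum()
    q = rng.random(6); q /= q.sum()
    violation = max(violation, tv(p, q) - np.sqrt(kl(p, q) / 2.0))
```

Over all random trials the violation stays non-positive, as Pinsker's inequality guarantees.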
A.2 Technical Lemmas
Now we turn to the proof of Theorem 11. The following lemma allows us to reduce the proof of algorithmic stability to the analysis of a single update step.
Lemma 10.
Let $(W_t)_{t \ge 0}$ and $(W'_t)_{t \ge 0}$ be two independent sequences of random variables such that for each $t$, $W_t$ and $W'_t$ have the same support. Suppose $W_0$ and $W'_0$ follow the same distribution. Then,
$$\mathrm{KL}(W_T \,\|\, W'_T) \le \sum_{t=0}^{T-1} \mathbb{E}_{w \sim W_t}\Big[\mathrm{KL}\big(W_{t+1}|_{W_t = w} \,\big\|\, W'_{t+1}|_{W'_t = w}\big)\Big],$$
where $W_{t+1}|_{W_t = w}$ denotes the conditional distribution of $W_{t+1}$ given $W_t = w$, and $W'_{t+1}|_{W'_t = w}$ denotes the conditional distribution of $W'_{t+1}$ given $W'_t = w$.
Proof By the chain rule of the KL-divergence,
$$\mathrm{KL}(W_{t+1} \,\|\, W'_{t+1}) \le \mathrm{KL}\big((W_t, W_{t+1}) \,\big\|\, (W'_t, W'_{t+1})\big) = \mathrm{KL}(W_t \,\|\, W'_t) + \mathbb{E}_{w \sim W_t}\Big[\mathrm{KL}\big(W_{t+1}|_{W_t = w} \,\big\|\, W'_{t+1}|_{W'_t = w}\big)\Big],$$
where the first step uses the fact that marginalization does not increase the KL-divergence. The lemma follows from a summation over $t = 0, \dots, T - 1$.
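The chain-rule argument of Lemma 10 can be sanity-checked on a pair of two-state Markov chains, where all marginals and conditional KLs are computable in closed form; the transition matrices below are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """KL-divergence between two discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two Markov chains with the same initial distribution mu but different
# transition matrices A and B. Lemma 10 bounds the KL of the time-T marginals
# by the sum over steps of the expected per-step conditional KL, where the
# expectation is under the first chain's marginal at that step.
mu = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])

p1, q1 = mu @ A, mu @ B        # marginals after one step
p2, q2 = p1 @ A, q1 @ B        # marginals after two steps
step1 = sum(mu[i] * kl(A[i], B[i]) for i in range(2))
step2 = sum(p1[i] * kl(A[i], B[i]) for i in range(2))
lhs, rhs = kl(p2, q2), step1 + step2
```

As the lemma predicts, the KL-divergence of the final marginals (`lhs`) is dominated by the accumulated per-step conditional KLs (`rhs`).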
The following lemma (see e.g., (Duchi, 2007, Section 9)) gives a closed-form formula for the KL-divergence between two Gaussian distributions.
Lemma 18.
Suppose that $P_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $P_2 = \mathcal{N}(\mu_2, \Sigma_2)$ are two Gaussian distributions on $\mathbb{R}^d$. Then,
$$\mathrm{KL}(P_1 \,\|\, P_2) = \frac{1}{2}\left[\mathrm{tr}\big(\Sigma_2^{-1}\Sigma_1\big) + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) - d + \ln\frac{\det \Sigma_2}{\det \Sigma_1}\right].$$
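The closed form in Lemma 18 can be verified against a Monte-Carlo estimate of $\mathbb{E}_{x\sim P_1}[\log p_1(x) - \log p_2(x)]$; the means and covariances below are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    """Closed-form KL(N(mu1, S1) || N(mu2, S2)) from Lemma 18."""
    d = len(mu1)
    S2inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + diff @ S2inv @ diff - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def log_pdf(x, mu, S):
    """Log-density of N(mu, S) evaluated row-wise on the sample matrix x."""
    Sinv = np.linalg.inv(S)
    diff = x - mu
    quad = np.einsum("ij,jk,ik->i", diff, Sinv, diff)
    return -0.5 * (quad + np.log(np.linalg.det(S)) + len(mu) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
mu1, S1 = np.array([0.0, 1.0]), np.diag([1.0, 2.0])
mu2, S2 = np.array([0.5, 0.0]), np.diag([2.0, 1.0])
x = rng.multivariate_normal(mu1, S1, size=200000)
mc = float(np.mean(log_pdf(x, mu1, S1) - log_pdf(x, mu2, S2)))  # Monte-Carlo KL
cf = float(gaussian_kl(mu1, S1, mu2, S2))                       # closed form
```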
The following lemma (Topsoe, 2000, Theorem 3) helps us to upper bound the KL-divergence.
Definition 19.
Let $P_1$ and $P_2$ be two probability distributions on $\mathbb{R}^d$. The directional triangular discrimination from $P_1$ to $P_2$ is defined as
$$\Delta(P_1 \,\|\, P_2) := \int_{\mathbb{R}^d} \frac{\big(p_1(x) - p_2(x)\big)^2}{p_2(x)}\,\mathrm{d}x,$$
where $p_1$ and $p_2$ denote the densities of $P_1$ and $P_2$, respectively.
Lemma 20.
For any two probability distributions $P_1$ and $P_2$ on $\mathbb{R}^d$, $\mathrm{KL}(P_1 \,\|\, P_2) \le \Delta(P_1 \,\|\, P_2)$.
Recall that $\mathcal{B}$ is the set of all possible mini-batches. $\mathcal{B}_i$ denotes the collection of mini-batches that contain $z_i$, while $\mathcal{B}_{-i} = \mathcal{B} \setminus \mathcal{B}_i$. $\mathrm{diam}(K)$ denotes the diameter of a set $K$. The following technical lemma upper bounds the KL-divergence between two Gaussian mixtures induced by sampling a mini-batch from neighbouring datasets.
Lemma 21.
Suppose that the batch size is $b$. $\{\mu_B\}_{B\in\mathcal{B}}$ and $\{\mu'_B\}_{B\in\mathcal{B}}$ are two collections of points in $\mathbb{R}^d$ labeled by mini-batches of size $b$ that satisfy the following conditions for some constant $c$:

$\mu_B = \mu'_B$ for $B \in \mathcal{B}_{-n}$ and $\|\mu_B - \mu'_B\|_2 \le c$ for $B \in \mathcal{B}_n$.

.
Let $P_B$ denote the Gaussian distribution $\mathcal{N}(\mu_B, \sigma^2 I_d)$, and define $P'_B$ similarly. Let $P = \frac{1}{|\mathcal{B}|}\sum_{B\in\mathcal{B}} P_B$ and $P' = \frac{1}{|\mathcal{B}|}\sum_{B\in\mathcal{B}} P'_B$ be the two mixture distributions over all mini-batches. Then,