On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning


Abstract

Generalization error (also known as the out-of-sample error) measures how well the hypothesis learned from training data generalizes to previously unseen data. Proving tight generalization error bounds is a central question in statistical learning theory. In this paper, we obtain generalization error bounds for learning general non-convex objectives, which has attracted significant attention in recent years. We develop a new framework, termed Bayes-Stability, for proving algorithm-dependent generalization error bounds. The new framework combines ideas from both the PAC-Bayesian theory and the notion of algorithmic stability. Applying the Bayes-Stability method, we obtain new data-dependent generalization bounds for stochastic gradient Langevin dynamics (SGLD) and several other noisy gradient methods (e.g., with momentum, mini-batches, and acceleration, as well as Entropy-SGD). Our result recovers (and is typically tighter than) a recent result in Mou et al. (2018) and improves upon the results in Pensia et al. (2018). Our experiments demonstrate that our data-dependent bounds can distinguish randomly labelled data from normal data, which provides an explanation for the intriguing phenomenon observed in Zhang et al. (2017a). We also study the setting where the total loss is the sum of a bounded loss and an additional regularization term. We obtain new generalization bounds for the continuous Langevin dynamics in this setting by developing a new Log-Sobolev inequality for the parameter distribution at any time. Our new bounds are more desirable when the noise level of the process is not very small, and they do not become vacuous even when the training time tends to infinity.


1 Introduction

Non-convex stochastic optimization is the major workhorse of modern machine learning. For instance, standard supervised learning on a model class parametrized by $w \in \mathbb{R}^d$ can be formulated as the following optimization problem:

$$\min_{w \in \mathbb{R}^d} \; F(w) := \mathbb{E}_{z \sim \mathcal{D}}\big[f(w, z)\big],$$

where $w$ denotes the model parameter, $\mathcal{D}$ is an unknown data distribution over the instance space $\mathcal{Z}$, and $f(w, z)$ is a given objective function which may be non-convex in $w$. A learning algorithm $\mathcal{A}$ takes as input a sequence $S = (z_1, \ldots, z_n)$ of data points sampled i.i.d. from $\mathcal{D}$, and outputs a (possibly randomized) parameter configuration $\mathcal{A}(S) \in \mathbb{R}^d$.

A fundamental problem in learning theory is to understand the generalization performance of learning algorithms: is the algorithm guaranteed to output a model that generalizes well to the data distribution $\mathcal{D}$? Specifically, we aim to prove upper bounds on the generalization error $\mathbb{E}\big[L(\mathcal{A}(S)) - \hat{L}_S(\mathcal{A}(S))\big]$, where $L$ and $\hat{L}_S$ are the population and empirical losses, respectively. We note that the loss function $\ell$ (e.g., the 0/1 loss) could be different from the objective function $f$ (e.g., the cross-entropy loss) used in the training process (which serves as a surrogate for the loss $\ell$).

Classical learning theory relates the generalization error to various complexity measures (e.g., the VC-dimension and Rademacher complexity) of the model class. Directly applying these classical complexity measures, however, often fails to explain the recent success of over-parametrized neural networks, where the model complexity significantly exceeds the amount of available training data (see e.g., Zhang et al. (2017a)). By incorporating certain data-dependent quantities such as margin and compressibility into the classical framework, some recent work (e.g., Bartlett et al. (2017); Arora et al. (2018); Wei and Ma (2019)) obtains more meaningful generalization bounds in the deep learning context.

An alternative approach to generalization is to prove algorithm-dependent bounds. One celebrated example along this line is the algorithmic stability framework initiated by Bousquet and Elisseeff (2002). Roughly speaking, the generalization error can be bounded by the stability of the algorithm (see Section 2 for the details). Using this framework, Hardt et al. (2016) study the stability (hence the generalization) of stochastic gradient descent (SGD) for both convex and non-convex functions. Their work has motivated recent studies of the generalization performance of several other gradient-based optimization methods (Kuzborskij and Lampert, 2018; London, 2016; Chaudhari et al., 2017; Raginsky et al., 2017; Mou et al., 2018; Pensia et al., 2018; Chen et al., 2018).

In this paper, we study the algorithmic stability and generalization performance of various iterative gradient-based methods, with certain continuous noise injected in each iteration, in a non-convex setting. As a concrete example, we consider stochastic gradient Langevin dynamics (SGLD) (see Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018)). Viewed as a variant of SGD, SGLD adds isotropic Gaussian noise at every update step:

$$W_t \;=\; W_{t-1} - \gamma_t\, g_t(W_{t-1}) + \sqrt{\tfrac{2\gamma_t}{\beta}}\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I_d), \tag{1}$$

where $g_t(\cdot)$ denotes either the full gradient $\nabla \hat{F}_S(\cdot)$ or the gradient over a mini-batch sampled from the training dataset, $\gamma_t$ is the step size at step $t$, and $\beta > 0$ is the inverse temperature that controls the noise level. We also study a continuous version of (1), which is the dynamics defined by the following stochastic differential equation (SDE):

$$\mathrm{d}W_t \;=\; -\nabla \hat{F}_S(W_t)\,\mathrm{d}t + \sqrt{\tfrac{2}{\beta}}\,\mathrm{d}B_t, \tag{2}$$

where $(B_t)_{t \ge 0}$ is the standard Brownian motion in $\mathbb{R}^d$.
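To make the update concrete, here is a minimal sketch of one SGLD step in Python. The gradient oracle, step size, and inverse temperature are placeholders (any model and schedule can be plugged in), and the noise scaling follows the standard convention assumed in (1) above.

```python
import numpy as np

def sgld_step(w, grad_estimate, gamma, beta, rng):
    """One SGLD update: a gradient step plus isotropic Gaussian noise.

    w             -- current parameter vector (numpy array)
    grad_estimate -- callable returning a full or mini-batch gradient at w
    gamma         -- step size for this iteration
    beta          -- inverse temperature; larger beta means less noise
    rng           -- numpy random Generator
    """
    noise = rng.standard_normal(w.shape)
    return w - gamma * grad_estimate(w) + np.sqrt(2.0 * gamma / beta) * noise


# Toy usage: run SGLD on a simple quadratic objective (illustrative only).
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
for t in range(1000):
    w = sgld_step(w, lambda v: v, gamma=0.01, beta=100.0, rng=rng)
```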

1.1 Related Work

Most related to our work is the study of algorithm-dependent generalization bounds of stochastic gradient methods. Hardt et al. (2016) first study the generalization performance of SGD via algorithmic stability. They prove a generalization bound that scales linearly with $T$, the number of iterations, when the loss function is convex, but their results for general non-convex optimization are more restricted. London (2017) and Rivasplata et al. (2018) also combine ideas from the PAC-Bayesian theory and algorithmic stability. However, these works are essentially different from ours. In London (2017), the prior and posterior are distributions on the hyperparameter space instead of distributions on the hypothesis space. Rivasplata et al. (2018) study hypothesis stability measured by a distance on the hypothesis space, in a setting where the returned hypothesis (model parameter) is perturbed by Gaussian noise. Our work is a follow-up of the recent work by Mou et al. (2018), which provides generalization bounds for SGLD from both the stability and the PAC-Bayesian perspectives. Another closely related work by Pensia et al. (2018) derives similar bounds for noisy stochastic gradient methods, based on the information-theoretic framework of Xu and Raginsky (2017). However, their bounds scale as $1/\sqrt{n}$ (where $n$ is the size of the training dataset) and are sub-optimal even for SGLD.

We acknowledge that besides the algorithm-dependent approach that we follow, recent advances in learning theory aim to explain the generalization performance of neural networks from many other perspectives. Some of the most prominent ideas include bounding the network capacity by the norms of weight matrices (Neyshabur et al., 2015; Liang et al., 2019), margin theory (Bartlett et al., 2017; Wei et al., 2019), PAC-Bayesian theory (Dziugaite and Roy, 2017; Neyshabur et al., 2018; Dziugaite and Roy, 2018), network compressibility (Arora et al., 2018), and over-parametrization (Du et al., 2019; Allen-Zhu et al., 2019; Zou et al., 2018; Chizat et al., 2019). Most of these results are stated in the context of neural networks (some are tailored to networks with specific architectures), whereas our work addresses generalization in non-convex stochastic optimization in general. We also note that some recent work provides explanations for the phenomenon reported in Zhang et al. (2017a) from a variety of different perspectives (e.g., Bartlett et al. (2017); Arora et al. (2018, 2019)).

Welling and Teh (2011) first consider stochastic gradient Langevin dynamics (SGLD) as a sampling algorithm in the Bayesian inference context. Raginsky et al. (2017) give a non-asymptotic analysis and establish the finite-time convergence guarantee of SGLD to an approximate global minimum. Zhang et al. (2017b) analyze the hitting time of SGLD and prove that SGLD converges to an approximate local minimum. These results are further improved and generalized to a family of Langevin dynamics based algorithms by the subsequent work of Xu et al. (2018).

1.2 Overview of Our Results

In this paper, we provide generalization guarantees for the noisy variants of several popular stochastic gradient methods.

The Bayes-Stability method and data-dependent generalization bounds. We develop a new method for proving generalization bounds, termed Bayes-Stability, by incorporating ideas from the PAC-Bayesian theory into the stability framework. In particular, assuming the loss takes values in $[0, C]$, our method shows that the generalization error is bounded by both $C\,\mathbb{E}_{z \sim \mathcal{D}}\big[\sqrt{2\,\mathrm{KL}(P \,\|\, Q_z)}\big]$ and $C\,\mathbb{E}_{z \sim \mathcal{D}}\big[\sqrt{2\,\mathrm{KL}(Q_z \,\|\, P)}\big]$, where $P$ is a prior distribution independent of the training set $S$, and $Q_z$ is the expected posterior distribution conditioned on the last training data point being $z$. The formal definition and the results can be found in Definition 5 and Theorem 7.

Inspired by Lever et al. (2013), instead of using a fixed prior distribution, we bound the KL-divergence from the posterior to a distribution-dependent prior. This enables us to derive the following generalization error bound that depends on the expected norm of the gradient along the optimization path:

(3)

Here, $S$ denotes the training dataset, $n$ is its size, and the summand at step $t$ is the expected empirical squared gradient norm $\mathbb{E}\big[\|\nabla \hat{F}_S(W_t)\|^2\big]$; see Theorem 11 for the details.

Compared with the previous bound in (Mou et al., 2018, Theorem 1), which depends on the global Lipschitz constant $L$ of the loss, our new bound (3) depends on the data distribution and is typically tighter (as the expected gradient norm along the optimization path is at most $L$). In modern deep neural networks, the worst-case Lipschitz constant can be quite large, and typically much larger than the expected empirical gradient norm along the optimization trajectory. Specifically, in the later stage of training, the expected empirical gradient is small (see Figure 1(d) for the details). Hence, our generalization bound does not grow much even if we train longer at this stage.

Our new bound also offers an explanation for the difference between training on correct and random labels observed by Zhang et al. (2017a). In particular, we show empirically that the sum of expected squared gradient norms (along the optimization path) is significantly higher when the training labels are replaced with random labels (Section 3.1, Figure 1, Appendix C.2).

We would also like to mention the PAC-Bayesian bound (for SGLD with $\ell_2$-regularization) proposed by Mou et al. (2018). (This bound is different from the one mentioned above; see Theorem 2 in their paper.) Their bound scales as $1/\sqrt{n}$, and its numerator contains a similar sum of gradient norms (with a weight that decays over time when the regularization coefficient is positive). Their bound is based on the PAC-Bayesian approach and holds with high probability, while our bound only holds in expectation.

Extensions. We remark that our technique allows for an arguably simpler proof of (Mou et al., 2018, Theorem 1); the original proof is based on SDEs and the Fokker-Planck equation. More importantly, our technique can be easily extended to handle mini-batches and a variety of general settings as follows.

  1. Extension to other gradient-based methods. Our results naturally extend to other noisy stochastic gradient methods including momentum due to Polyak (1964) (Theorem 26), Nesterov’s accelerated gradient method in Nesterov (1983) (Theorem 26), and Entropy-SGD proposed by Chaudhari et al. (2017) (Theorem 27).

  2. Extension to general noises. The proof of the generalization bound in Mou et al. (2018) relies heavily on the fact that the noise is Gaussian, which makes it difficult to generalize to other noise distributions such as the Laplace distribution. In contrast, our analysis easily carries over to the class of log-Lipschitz noises (i.e., noises drawn from distributions with Lipschitz log-densities).

  3. Pathwise stability. In practice, it is also natural to output a certain function of the entire optimization path, e.g., the iterate with the smallest empirical risk or a weighted average. We show that the same generalization bound holds for all such variants (Remark 12). We note that the analysis in the independent work of Pensia et al. (2018) also satisfies this property, yet their bound (see Corollary 1 in their work) scales at a slower rate of $1/\sqrt{n}$ (instead of $1/n$) when dealing with a $C$-bounded loss.

Generalization bounds with regularization via Log-Sobolev inequalities. We also study the setting where the total objective function is the sum of a $C$-bounded differentiable objective and an additional $\ell_2$ regularization term. In this case, the regularized objective can be treated as a perturbation of a quadratic function, and the continuous Langevin dynamics (CLD) is well understood for quadratic functions. We obtain two generalization bounds for CLD, both via the technique of Log-Sobolev inequalities, a powerful tool for proving the convergence rate of CLD. One of our bounds is as follows (Theorem 15):

(4)

The above bound has the following advantages:

  1. Applying the elementary inequality $1 - e^{-x} \le x$, one can see that our bound is at most the previous bound in (Mou et al., 2018, Proposition 8).

  2. As the training time grows, the bound remains bounded and approaches a finite limit (unlike the previous bound, which goes to infinity as the training time goes to infinity).

  3. If the noise level is not very small (i.e., $\beta$ is not very large), the generalization bound is quite desirable.

Our analysis is based on a Log-Sobolev inequality (LSI) for the parameter distribution at time $t$, whereas most known LSIs only hold for the stationary distribution of the Markov process. We prove the new LSI by exploiting the variational formulation of the entropy.

2 Preliminaries

Notations. We use $\mathcal{D}$ to denote the data distribution. The training dataset $S = (z_1, \ldots, z_n)$ is a sequence of $n$ independent samples drawn from $\mathcal{D}$. Two datasets $S$ and $S'$ are called neighboring datasets if and only if they differ at exactly one data point (we can assume without loss of generality that they differ at the last one). Let $f(w, z)$ and $\ell(w, z)$ be the objective and the loss functions, respectively, where $w$ denotes a model parameter and $z$ is a data point. Define $F(w) := \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$ and $\hat{F}_S(w) := \frac{1}{n}\sum_{i=1}^{n} f(w, z_i)$; the population and empirical losses $L(w)$ and $\hat{L}_S(w)$ are defined similarly with $\ell$ in place of $f$. A learning algorithm $\mathcal{A}$ takes as input a dataset $S$, and outputs a parameter $\mathcal{A}(S)$ randomly. Let $\mathcal{B}$ be the set of all possible mini-batches. $\mathcal{B}_i$ denotes the collection of mini-batches that contain the $i$-th data point, while $\mathcal{B}_{\neg i} := \mathcal{B} \setminus \mathcal{B}_i$. Let $\mathrm{diam}(K) := \sup_{x, y \in K} \|x - y\|$ denote the diameter of a set $K$.

Definition 1 ($L$-lipschitz).

A function $h$ is $L$-lipschitz if and only if $|h(x) - h(y)| \le L\,\|x - y\|$ holds for any $x$ and $y$ in its domain.

Definition 2 (Expected generalization error).

The expected generalization error of a learning algorithm $\mathcal{A}$ is defined as

$$\mathrm{err}_{\mathrm{gen}}(\mathcal{A}) \;:=\; \mathbb{E}_{S \sim \mathcal{D}^n,\, \mathcal{A}}\!\left[L\big(\mathcal{A}(S)\big) - \hat{L}_S\big(\mathcal{A}(S)\big)\right].$$

Algorithmic Stability. Intuitively, a learning algorithm that is stable (i.e., a small perturbation of the training data does not affect its output too much) can generalize well. In the seminal work of Bousquet and Elisseeff (2002) (see also Hardt et al. (2016)), the authors formally defined algorithmic stability and established a close connection between the stability of a learning algorithm and its generalization performance.

Definition 3 (Uniform stability).

(Bousquet and Elisseeff (2002); Elisseeff et al. (2005)) A randomized algorithm $\mathcal{A}$ is $\epsilon$-uniformly stable w.r.t. the loss $\ell$ if, for all neighboring datasets $S$ and $S'$, it holds that

$$\sup_{z}\;\Big|\,\mathbb{E}\big[\ell(\mathcal{A}(S), z)\big] - \mathbb{E}\big[\ell(\mathcal{A}(S'), z)\big]\,\Big| \;\le\; \epsilon,$$

where $\mathcal{A}(S)$ and $\mathcal{A}(S')$ denote the outputs of $\mathcal{A}$ on $S$ and $S'$, respectively, and the expectation is over the randomness of $\mathcal{A}$.

Lemma 4 (Generalization in expectation).

(Hardt et al. (2016)) Suppose a randomized algorithm $\mathcal{A}$ is $\epsilon$-uniformly stable. Then, $\big|\mathrm{err}_{\mathrm{gen}}(\mathcal{A})\big| \le \epsilon$.

3 Bayes-Stability Method

In this section, we incorporate ideas from the PAC-Bayesian theory (see e.g., Lever et al. (2013)) into the algorithmic stability framework. Combined with the technical tools introduced in previous sections, the new framework enables us to prove tighter data-dependent generalization bounds.

First, we define the posterior of a dataset and the posterior of a single data point.

Definition 5 (Single-point posterior).

Let $Q_S$ be the posterior distribution of the parameter for a given training dataset $S$. In other words, it is the probability distribution of the output of the learning algorithm on dataset $S$ (e.g., for $T$ iterations of SGLD in (1), $Q_S$ is the distribution of $W_T$). The single-point posterior $Q_z$ is defined as

$$Q_z \;:=\; \mathbb{E}_{z_1, \ldots, z_{n-1} \sim \mathcal{D}}\big[\,Q_{(z_1, \ldots, z_{n-1}, z)}\,\big].$$

For convenience, we make the following natural assumption on the learning algorithm:

Assumption 6 (Order-independent).

For any fixed dataset $S = (z_1, \ldots, z_n)$ and any permutation $\pi$ of $\{1, \ldots, n\}$, $Q_S$ is the same as $Q_{\pi(S)}$, where $\pi(S) := (z_{\pi(1)}, \ldots, z_{\pi(n)})$.

Assumption 6 implies that the single-point posterior does not depend on which position the conditioned data point occupies, so we simply write $Q_z$ for it in the following. Note that this assumption can be easily satisfied by letting the learning algorithm randomly permute the training data at the beginning. It is also easy to verify that both SGD and SGLD satisfy the order-independent assumption.

Now, we state our new Bayes-Stability framework, which holds for any prior distribution $P$ over the parameter space that is independent of the training dataset $S$.

Theorem 7 (Bayes-Stability).

Suppose the loss function $\ell$ is $C$-bounded and the learning algorithm is order-independent (Assumption 6). Then, for any prior distribution $P$ not depending on $S$, the generalization error is bounded by both $C\,\mathbb{E}_{z \sim \mathcal{D}}\big[\sqrt{2\,\mathrm{KL}(P \,\|\, Q_z)}\big]$ and $C\,\mathbb{E}_{z \sim \mathcal{D}}\big[\sqrt{2\,\mathrm{KL}(Q_z \,\|\, P)}\big]$.

Remark 8.

Our Bayes-Stability framework originates from the algorithmic stability framework, and hence is similar to the notions of uniform stability and leave-one-out error (see Elisseeff and Pontil (2003)). However, there are important differences. Uniform stability is a distribution-independent property, while Bayes-Stability can incorporate information about the data distribution (through the prior $P$). Leave-one-out error measures the loss of a learned model on an unseen data point, whereas Bayes-Stability focuses on the extent to which a single data point affects the outcome of the learning algorithm (compared to the prior).

To build intuition, we first apply this framework to obtain an expected generalization bound for (full) gradient Langevin dynamics (GLD), which is a special case of SGLD in (1) (i.e., GLD uses the full gradient $\nabla \hat{F}_S(W_{t-1})$ as $g_t(W_{t-1})$).

Theorem 9.

Suppose that the loss function $\ell$ is $C$-bounded. Then we have the following expected generalization bound for $T$ iterations of GLD:

where $\|\nabla \hat{F}_S(W_t)\|^2$ is the empirical squared gradient norm, and $W_t$ is the parameter at step $t$ of GLD.

Proof  The proof builds upon the following technical lemma, which we prove in Appendix A.2.

Lemma 10.

Let $(W_t)_{t=0}^{T}$ and $(W'_t)_{t=0}^{T}$ be two independent sequences of random variables such that, for each $t$, $W_t$ and $W'_t$ have the same support. Suppose $W_0$ and $W'_0$ follow the same distribution. Then,

$$\mathrm{KL}\bigl((W_0,\ldots,W_T)\,\big\|\,(W'_0,\ldots,W'_T)\bigr) \;\le\; \sum_{t=1}^{T} \mathbb{E}_{W_{<t}}\Bigl[\mathrm{KL}\bigl(W_t \mid W_{<t} \,\big\|\, W'_t \mid W'_{<t}\bigr)\Bigr],$$

where $W_{<t}$ denotes $(W_0, \ldots, W_{t-1})$ and $W_t \mid W_{<t}$ denotes the conditional distribution of $W_t$ given the preceding iterates.

Define the prior $P := Q_{z_0}$, where $z_0$ denotes the zero data point (i.e., $f(w, z_0) = 0$ for any $w$). Theorem 7 shows that

(5)

By the convexity of the KL-divergence, for a fixed $z$, we have

(6)

Let $(W_t)$ and $(W'_t)$ be the training processes of GLD on the two datasets in question, respectively. Note that, conditioned on the previous iterate, both $W_t$ and $W'_t$ follow Gaussian distributions, so their KL-divergence admits a closed form (see Lemma 18 in Appendix A.2).

Applying Lemma 10 together with this closed form gives

Recall that $W_t$ is the parameter at step $t$ when $S$ is used as the dataset, and that the differing data point can be taken to be the last one in $S$. Since SGLD satisfies the order-independent assumption, the same argument applies to every position of the differing point. Together with (5), (6), and the bound above, we can prove this theorem.  


More generally, we give the following bound for SGLD. The proof is similar to that of Theorem 9; the difference is that we need to bound the KL-divergence between two Gaussian mixtures instead of two Gaussians. This proof is more technical and deferred to Appendix A.3.

Theorem 11.

Suppose that the loss function $\ell$ is $C$-bounded and the objective function $f$ is $L$-lipschitz. Assume that the following conditions hold:

  1. Batch size .

  2. Learning rate .

Then, the following expected generalization error bound holds for $T$ iterations of SGLD (1):

(empirical norm)

where $\|\nabla \hat{F}_S(W_t)\|^2$ is the empirical squared gradient norm, and $W_t$ is the parameter at step $t$ of SGLD.

Furthermore, based on essentially the same proof, we can obtain a bound that depends on the population gradient norm instead.

The full proofs of the above results are postponed to Appendix A, and we provide some remarks about the new bounds.

Remark 12.

In fact, our proof establishes that the above upper bound holds for the KL-divergence between the two entire parameter sequences $(W_0, \ldots, W_T)$ and $(W'_0, \ldots, W'_T)$. Hence, our bound holds for any sufficiently regular function of the parameter sequence. In particular, our generalization error bound automatically extends to several variants of SGLD, such as outputting the average of the trajectory, the average of a suffix of a certain length, or an exponential moving average.
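To illustrate Remark 12, the following sketch (with hypothetical helper names) records the whole SGLD trajectory and forms three path-dependent outputs; the remark asserts that the same generalization bound applies to each of them.

```python
import numpy as np

def run_sgld(w0, grad_estimate, gammas, beta, rng):
    """Run SGLD and return the full parameter trajectory [w_0, w_1, ..., w_T]."""
    traj = [np.array(w0, dtype=float)]
    for gamma in gammas:
        w = traj[-1]
        noise = rng.standard_normal(w.shape)
        traj.append(w - gamma * grad_estimate(w) + np.sqrt(2.0 * gamma / beta) * noise)
    return traj

def trajectory_outputs(traj, suffix_len=10, ema_decay=0.9):
    """Three trajectory-based outputs covered by Remark 12."""
    avg_all = np.mean(traj, axis=0)                   # average of the whole trajectory
    avg_suffix = np.mean(traj[-suffix_len:], axis=0)  # average of a suffix
    ema = traj[0]
    for w in traj[1:]:                                # exponential moving average
        ema = ema_decay * ema + (1.0 - ema_decay) * w
    return avg_all, avg_suffix, ema
```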

Remark 13 (High-probability bounds).

By relaxing the expected squared gradient norm term to its worst-case bound $L^2$ and using the uniform stability framework, our proof can be adapted to recover the bound in (Mou et al., 2018, Theorem 1). Then, we can apply the recent results of Feldman and Vondrak (2019) to obtain a generalization error bound that holds with high probability, up to poly-logarithmic factors and an additional term of order $1/\sqrt{n}$. When the main term is at least of this order, the additional term is not dominating.

3.1 Experiment

Distinguishing random from normal data. Inspired by Zhang et al. (2017a), we run both GLD (Figure 1) and SGLD (Appendix C.2) to fit both normal data and randomly labeled data (see Appendix C for more experimental details). As shown in Figure 1 and Figure 3 in Appendix C.2, a larger random-label portion leads to both a much higher generalization error and a much larger generalization error bound. Moreover, the shapes of our bound curves look quite similar to those of the generalization error curves.

Note that in (b) and (c) of Figure 1, the scales of the $y$-axis are different. We list some possible reasons why our bound is larger than the actual generalization error. (1) As explained in Remark 12, our bounds (Theorems 9 and 11) hold for any trajectory-based output, and are hence much stronger than upper bounds for the last iterate alone. (2) The constant we can prove in Lemma 21 may not be very tight. (3) The variance of the Gaussian noise is not large enough in our experiment; however, with a larger variance, fitting the randomly labeled training data becomes quite slow. For the same reason, we use a small dataset size in this experiment. We also run an extra experiment for GLD on the full MNIST dataset without label corruption (see Figure 2 in Appendix C). We can see that our bound is non-vacuous (since GLD, which computes full gradients, takes a long time to converge, we stopped the run once 90% training accuracy was reached).
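As a rough sketch of how the quantity behind our bound is tracked in these experiments, the snippet below runs GLD while accumulating the empirical squared gradient norms; the running sum (up to the constants and noise scaling in Theorem 9, which are omitted here) is what panels like Figure 1(c) are built from. The gradient oracle and hyperparameters are placeholders.

```python
import numpy as np

def gld_with_bound_tracking(w0, full_grad, gammas, beta, rng):
    """Run GLD and accumulate squared empirical gradient norms along the path."""
    w = np.array(w0, dtype=float)
    sq_grad_norms = []
    for gamma in gammas:
        g = full_grad(w)                      # gradient over the whole training set
        sq_grad_norms.append(float(np.dot(g, g)))
        w = w - gamma * g + np.sqrt(2.0 * gamma / beta) * rng.standard_normal(w.shape)
    # The data-dependent bound grows with (a weighted sum of) these terms,
    # so plotting the running sum mirrors the shape of the bound over time.
    return w, np.cumsum(sq_grad_norms)
```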

Figure 1: Training an MLP with GLD on a smaller version of MNIST with different random-label portions. (a) shows the training accuracy. (b) shows the generalization error, i.e., the gap between the 0/1 loss on the training data and on the test data. (c) plots our bound in Theorem 9. (d) shows that the gradient norms become much smaller at later stages of training.

Relaxing the step size constraint. The condition on the step size in Theorem 11 may seem restrictive for practical use. We provide several ways to relax this constraint:

  1. The proof of Theorem 11 still goes through if the Lipschitz constant in the constraint is replaced with the maximum gradient norm along the trajectory.

  2. The maximum gradient norm can be controlled by gradient clipping, i.e., rescaling each per-step gradient so that its norm does not exceed a given threshold (see the sketch after this list).

  3. Relaxing the constant in this constraint only increases the constant factor in our bound.
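A minimal sketch of the gradient clipping mentioned in item 2 above; the threshold max_norm is a free parameter and the function name is ours.

```python
import numpy as np

def clip_gradient(g, max_norm):
    """Rescale g so that its Euclidean norm never exceeds max_norm."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g
```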

We also provide an experiment combining the above ideas to make Theorem 11 applicable in practice (see Figure 4 in Appendix C).

4 Generalization of CLD and GLD with regularization

In this section, we study the generalization error of Continuous Langevin Dynamics (CLD) with $\ell_2$ regularization. Throughout this section, we assume that the objective function over the training set $S$ is the regularized empirical objective $\hat{F}^{\mathrm{reg}}_S(w) := \hat{F}_S(w) + \frac{\lambda}{2}\|w\|^2$, and moreover, that the following assumption holds.

Assumption 14.

The loss function $\ell$ and the original objective $f$ are $C$-bounded. Moreover, $f$ is differentiable and $L$-lipschitz.

The Continuous Langevin Dynamics is defined by the following SDE:

$$\mathrm{d}W_t \;=\; -\nabla \hat{F}^{\mathrm{reg}}_S(W_t)\,\mathrm{d}t + \sqrt{\tfrac{2}{\beta}}\,\mathrm{d}B_t, \tag{CLD}$$

where $(B_t)_{t \ge 0}$ is the standard Brownian motion on $\mathbb{R}^d$ and the initial distribution $\mu_0$ is a centered Gaussian distribution in $\mathbb{R}^d$. We show that the generalization error of CLD is upper bounded by a quantity that is independent of the training time $t$ (Theorem 15). Furthermore, as $t$ goes to infinity, we have a tighter generalization error bound (Theorem 39 in Appendix B). We also study the generalization of Gradient Langevin Dynamics (GLD), which is the discretization of CLD:

$$W_{k+1} \;=\; W_k - \eta\,\nabla \hat{F}^{\mathrm{reg}}_S(W_k) + \sqrt{\tfrac{2\eta}{\beta}}\,\xi_k, \tag{GLD}$$

where $\xi_k$ is a standard Gaussian random vector in $\mathbb{R}^d$ and $\eta > 0$ is the step size. By leveraging a result developed in Raginsky et al. (2017), we show that, as $\eta$ tends to zero, GLD has the same generalization behavior as CLD (see Theorems 15 and 39). We now formally state the first main result of this section.

Theorem 15.

Under Assumption 14, CLD (with initial probability measure $\mu_0$) has the following expected generalization error bound:

(7)

In addition, if $f$ is smooth and non-negative, then with a suitably small step size $\eta$ and a matching number of iterations (and the same $\beta$ as CLD), GLD has the expected generalization error bound:

(8)

where the constant in the additional term depends only on the problem parameters.
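For concreteness, a minimal sketch of one GLD step on the regularized objective assumed in this section; the regularization coefficient lam, the step size eta, and the gradient oracle are placeholders.

```python
import numpy as np

def gld_regularized_step(w, grad_f_hat, lam, eta, beta, rng):
    """One GLD step on the objective F_hat(w) + (lam / 2) * ||w||^2."""
    g = grad_f_hat(w) + lam * w               # gradient of the regularized objective
    noise = rng.standard_normal(w.shape)
    return w - eta * g + np.sqrt(2.0 * eta / beta) * noise
```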

The following lemma is crucial for establishing the above generalization bound for CLD. In particular, we need to establish a Log-Sobolev inequality for $\mu_t$, the parameter distribution at time $t$, for every time $t$. In contrast, most known LSIs only characterize the stationary distribution of the Markov process. The proof of the lemma can be found in Appendix B.

Lemma 16.

Under Assumption 14, let $\mu_t$ be the probability measure of $W_t$ in CLD (with initial measure $\mu_0$). Let $\nu$ be a probability measure that is absolutely continuous with respect to $\mu_t$ and satisfies suitable integrability conditions. Then, a Log-Sobolev inequality holds for $\mu_t$: the relative entropy of $\nu$ with respect to $\mu_t$ is bounded by a constant multiple of the corresponding relative Fisher information, where the constant depends only on the problem parameters.

We sketch the proof of Theorem 15, and the complete proof is relegated to Appendix B.

Proof Sketch of Theorem 15  Suppose $S$ and $S'$ are two neighboring datasets. Let $(W_t)$ and $(W'_t)$ be the processes of CLD running on $S$ and $S'$, respectively. Let $\mu_t$ and $\mu'_t$ be the pdfs of $W_t$ and $W'_t$. Let $\mathrm{KL}_t$ denote $\mathrm{KL}(\mu_t \,\|\, \mu'_t)$. We have

(Lemma 16)

Solving this inequality yields a bound on $\mathrm{KL}_t$ that stays finite for all $t$ (see the schematic computation below). Hence the generalization error of CLD can be bounded accordingly, which proves the first part. The second part of the theorem follows from Lemma 36 in Appendix B.  
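The step "solving this inequality" is a standard Grönwall-type computation. Schematically, if the differential inequality obtained from Lemma 16 takes the linear form below, with placeholder constants $a, b > 0$ that depend on the problem parameters (their exact values are derived in Appendix B), then:

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{KL}_t \;\le\; -\,a\,\mathrm{KL}_t + b, \quad \mathrm{KL}_0 = 0 \qquad\Longrightarrow\qquad \mathrm{KL}_t \;\le\; \frac{b}{a}\left(1 - e^{-a t}\right) \;\le\; \frac{b}{a}.$$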


Our second generalization bound for CLD (Theorem 39 in Appendix B) becomes tighter as the training time grows; its precise form is given in the appendix.

The high-level idea behind this bound is very similar to that in Raginsky et al. (2017). We first observe that the (stationary) Gibbs distribution has a small generalization error. Then, we bound the distance from the parameter distribution at time $t$ to the Gibbs distribution. In our setting, we can use the Holley-Stroock perturbation lemma, which allows us to bound the Log-Sobolev constant, and we can thus bound the above distance easily.
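For reference, the Holley-Stroock perturbation lemma in its standard form (stated here with generic measures; in our setting the perturbation comes from the $C$-bounded objective): if $\mu$ satisfies a Log-Sobolev inequality with constant $c_{\mathrm{LS}}(\mu)$ and $\mathrm{d}\tilde{\mu} \propto e^{-V}\,\mathrm{d}\mu$ for a bounded function $V$, then

$$c_{\mathrm{LS}}(\tilde{\mu}) \;\le\; e^{\,\operatorname{osc}(V)}\; c_{\mathrm{LS}}(\mu), \qquad \operatorname{osc}(V) := \sup V - \inf V.$$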

5 Future Directions

In this paper, we prove new generalization bounds for a variety of noisy gradient-based methods. Our current techniques can only handle continuous noise for which we can bound the KL-divergence. One future direction is to study the discrete noise introduced by SGD (in this case, the KL-divergence may not be well defined). For either SGLD or CLD, if the noise level is small (i.e., $\beta$ is large), it may take a long time for the diffusion process to reach the stationary distribution. Hence, another interesting future direction is to consider the local behavior and generalization of the diffusion process in finite time through the techniques developed in the study of metastability (see e.g., Bovier et al. (2005); Bovier and den Hollander (2006); Tzen et al. (2018)). In particular, these techniques may be helpful for further improving the bounds in Theorems 15 and 39 (when the training time is not very large).

6 Acknowledgement

We would like to thank Liwei Wang for several helpful discussions during various stages of the work. The research is supported in part by the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and the Zhongguancun Haihua Institute for Frontier Information Technology and Turing AI Institute of Nanjing.

Appendix A Proofs in Section 3

A.1 Bayes-Stability Framework

Lemma 17.

Under Assumption 6, for any prior distribution $P$ not depending on the dataset $S$, the generalization error is upper bounded by

$$\mathbb{E}_{z \sim \mathcal{D}}\!\left[\int \big|q_z(w) - p(w)\big| \cdot \big|\ell(w, z) - L(w)\big|\,\mathrm{d}w\right],$$

where $L(w)$ denotes the population loss $\mathbb{E}_{z' \sim \mathcal{D}}[\ell(w, z')]$, and $p$ and $q_z$ denote the densities of $P$ and $Q_z$, respectively.

Proof of Lemma 17  We rewrite the generalization error as the difference between the expected population loss and the expected empirical loss of the output. By Assumption 6, the expected empirical loss equals $\mathbb{E}_{z \sim \mathcal{D}}\,\mathbb{E}_{w \sim Q_z}[\ell(w, z)]$, and the expected population loss equals $\mathbb{E}_{z \sim \mathcal{D}}\,\mathbb{E}_{w \sim Q_z}[L(w)]$. Moreover, since $P$ is a prior that does not depend on $S$ and $\mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)] = L(w)$ by the definition of $L$, we have $\mathbb{E}_{z \sim \mathcal{D}}\,\mathbb{E}_{w \sim P}\big[L(w) - \ell(w, z)\big] = 0$. Thus, the generalization error equals $\mathbb{E}_{z \sim \mathcal{D}}\big[\int \big(q_z(w) - p(w)\big)\big(L(w) - \ell(w, z)\big)\,\mathrm{d}w\big]$, and taking absolute values yields the claimed bound.  


Now we are ready to prove Theorem 7, which we restate in the following.

Theorem 7 (Bayes-Stability).

Suppose the loss function $\ell$ is $C$-bounded and the learning algorithm is order-independent (Assumption 6). Then, for any prior distribution $P$ not depending on $S$, the generalization error is bounded by both $C\,\mathbb{E}_{z \sim \mathcal{D}}\big[\sqrt{2\,\mathrm{KL}(P \,\|\, Q_z)}\big]$ and $C\,\mathbb{E}_{z \sim \mathcal{D}}\big[\sqrt{2\,\mathrm{KL}(Q_z \,\|\, P)}\big]$.

Proof  By Lemma 17, the generalization error is at most

$$C\,\mathbb{E}_{z \sim \mathcal{D}}\!\left[\int \big|q_z(w) - p(w)\big|\,\mathrm{d}w\right] \qquad (\text{$C$-boundedness})$$

$$\le\; C\,\mathbb{E}_{z \sim \mathcal{D}}\!\left[\sqrt{2\,\mathrm{KL}(Q_z \,\|\, P)}\right]. \qquad (\text{Pinsker's inequality})$$

The other bound follows from a similar argument.  
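For reference, the form of Pinsker's inequality used above, stated for two probability measures $P$ and $Q$ with densities $p$ and $q$:

$$\|P - Q\|_{\mathrm{TV}} \;=\; \frac{1}{2}\int |p(w) - q(w)|\,\mathrm{d}w \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)}.$$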


A.2 Technical Lemmas

Now we turn to the proof of Theorem 11. The following lemma allows us to reduce the proof of algorithmic stability to the analysis of a single update step.

Lemma 10.

Let $(W_t)_{t=0}^{T}$ and $(W'_t)_{t=0}^{T}$ be two independent sequences of random variables such that, for each $t$, $W_t$ and $W'_t$ have the same support. Suppose $W_0$ and $W'_0$ follow the same distribution. Then,

$$\mathrm{KL}\bigl((W_0,\ldots,W_T)\,\big\|\,(W'_0,\ldots,W'_T)\bigr) \;\le\; \sum_{t=1}^{T} \mathbb{E}_{W_{<t}}\Bigl[\mathrm{KL}\bigl(W_t \mid W_{<t} \,\big\|\, W'_t \mid W'_{<t}\bigr)\Bigr],$$

where $W_{<t}$ denotes $(W_0, \ldots, W_{t-1})$ and $W_t \mid W_{<t}$ denotes the conditional distribution of $W_t$ given the preceding iterates.

Proof  By the chain rule of the KL-divergence,

The lemma follows from a summation over $t$.  
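For completeness, the chain rule of the KL-divergence invoked above, written in the notation of Lemma 10 (with the convention that the conditioning variables are drawn from the law of the first sequence):

$$\mathrm{KL}\bigl((W_0,\ldots,W_T)\,\big\|\,(W'_0,\ldots,W'_T)\bigr) \;=\; \mathrm{KL}(W_0 \,\|\, W'_0) \;+\; \sum_{t=1}^{T} \mathbb{E}_{W_{<t}}\Bigl[\mathrm{KL}\bigl(W_t \mid W_{<t} \,\big\|\, W'_t \mid W'_{<t}\bigr)\Bigr].$$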


The following lemma (see e.g., (Duchi, 2007, Section 9)) gives a closed-form formula for the KL-divergence between two Gaussian distributions.

Lemma 18.

Suppose that $P = \mathcal{N}(\mu_1, \Sigma_1)$ and $Q = \mathcal{N}(\mu_2, \Sigma_2)$ are two Gaussian distributions on $\mathbb{R}^d$. Then,

$$\mathrm{KL}(P \,\|\, Q) \;=\; \frac{1}{2}\left[\operatorname{tr}\!\big(\Sigma_2^{-1}\Sigma_1\big) + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) - d + \ln\frac{\det \Sigma_2}{\det \Sigma_1}\right].$$

The following lemma (Topsoe, 2000, Theorem 3) helps us to upper bound the KL-divergence.

Definition 19.

Let $P$ and $Q$ be two probability distributions on $\mathbb{R}^d$. The directional triangular discrimination from $P$ to $Q$ is defined as

where $p$ and $q$ denote the densities of $P$ and $Q$, respectively.

Lemma 20.

For any two probability distributions $P$ and $Q$ on $\mathbb{R}^d$,

Recall that $\mathcal{B}$ is the set of all possible mini-batches, $\mathcal{B}_i$ denotes the collection of mini-batches that contain the $i$-th data point, while $\mathcal{B}_{\neg i} := \mathcal{B} \setminus \mathcal{B}_i$, and $\mathrm{diam}(K)$ denotes the diameter of a set $K$. The following technical lemma upper bounds the KL-divergence between two Gaussian mixtures induced by sampling a mini-batch from neighbouring datasets.

Lemma 21.

Suppose that the batch size is $b$. Consider two collections of points in $\mathbb{R}^d$, labeled by mini-batches of size $b$, that satisfy the following conditions for some constant:

  1. for and for .

  2. .

Let denote the Gaussian distribution . Let and be two mixture distributions over all mini-batches. Then,

Proof of Lemma 21  By Lemma 20, the KL-divergence between the two mixture distributions is bounded by