Benign overfitting in ridge regression

Abstract

Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in particular learning neural networks, where the number of parameters often exceeds the number of data points. It has been observed empirically that such overparametrized models can show good generalization performance even if trained with vanishing or negative regularization. The aim of this work is to understand theoretically how this effect can occur, by studying the setting of ridge regression. We provide non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and show that those bounds are tight for a range of regularization parameter values. To our knowledge this is the first work that studies overparametrized ridge regression in such a general setting. We identify when small or negative regularization is sufficient for obtaining small generalization error. On the technical side, our bounds only require the data vectors to be i.i.d. sub-gaussian, while most previous work assumes independence of the components of those vectors.

1 Introduction

1.1 Motivation

The bias-variance tradeoff is well known in statistics and machine learning. The classical theory suggests that large models overfit the data, and one needs to impose significant regularization to make them generalize. This intuition is, however, in contrast with the empirical study of modern machine learning techniques. It was repeatedly observed that even models with enough capacity to exactly interpolate the data can generalize with only a very small amount of regularization, or with no regularization at all [2, 19]. In some cases, the best value of the regularizer can be zero [10] or even negative [8] for such models.

The aim of this paper is to provide a theoretical understanding of the above-mentioned phenomenon. We propose to consider perhaps the simplest statistical model: linear regression with ridge regularization and i.i.d. noisy observations. The setting and assumptions are fully described in Section 2. We derive non-asymptotic upper bounds for the prediction mean squared error (MSE), which depend on the spectrum of the covariance operator as well as on the optimal parameter vector. Under some additional assumptions we provide matching lower bounds. Our theoretical results are in accordance with the empirical observations mentioned above: we show that small regularization can indeed provide good generalization performance if, for example, the eigenvalues of the covariance operator decay fast. Moreover, if there is a sharp drop in the sequence of those eigenvalues, followed by a plateau, then negative values of regularization may be beneficial if the signal-to-noise ratio (SNR) is high enough.

1.2 Related work

Motivated by the empirical success of overparametrized models, there has recently been a flurry of work aimed at understanding theoretically whether the corresponding effects can be seen in overparametrized linear regression. A large body of that work deals with the ridgeless regime [1, 3, 4, 5, 7, 9, 13, 14, 18, 19], also known as the “interpolating solution,” with zero in-sample risk.

The theoretical study of ridge regularization is, of course, a much broader area. As we mentioned before, classical intuition suggests that one should regularize when the model is large, so ridge regularization in general is a classical topic, and it's impossible to mention all the work that has been done in this direction. The purpose of our work, however, is to study cases where small or negative regularization is sufficient, that is, outside the classical regime, but beyond the specific case of interpolation.

In the following we mention some recent work on ridge regularization and point out the differences with our approach. Dobriban and Wager [6] analyze the predictive risk of ridge regression in the asymptotic setting, Thrampoulidis, Oymak and Hassibi [15] suggest a methodology for obtaining asymptotic performance guarantees for ridge regression with gaussian design, Mitra [12] provides asymptotic expressions for ridge regression for a special data generating model that they call Misparametrized Sparse Regression, and Mei and Montanari [11] derive asymptotics of the generalization error for ridge regression on random features. All those works are in the asymptotic setting, with the ambient dimension and the number of data points going to infinity while their ratio tends to a constant. Moreover, each of those works imposes an assumption of independence of the coordinates of the data vectors, or even requires a specific distribution of those vectors (e.g., gaussian). Our results are non-asymptotic in nature, and cover cases that don't satisfy the above-mentioned assumptions. For example, our results apply if the dimension is infinite and the eigenvalues of the covariance operator decay fast, and we show in Section 4 that small regularization can be sufficient to obtain small generalization error in that case. Moreover, we do not require any assumptions apart from sub-gaussianity to state our main result, the first part of Theorem 1. To our knowledge this is the first result of this kind that applies under such general assumptions on the structure of the data.

A result in a similar framework was obtained by Kobak, Lomond and Sanchez, who proved that negative ridge regularization can be beneficial in a spiked covariance model with one spike. Our results are in accordance with that observation. In Corollary 8, we show that negative regularization can also be beneficial for spiked covariance models when the number of spikes is small compared to the sample size.

The work closest to ours is Bartlett et al. [1], where it was shown that the variance term of ridgeless overparametrized regression can be small if and only if the tail of the sequence of eigenvalues of the covariance operator has large effective rank. We strictly generalize those results: we obtain similar bounds for the variance term in the case of ridge regression with zero regularization, extend them to the full range of regularization levels, and provide novel bounds for the bias term. The bias term is of independent interest, as it's closely related to high-dimensional PCA for a spiked covariance model, and our assumptions on the covariance structure are more general than the usual assumptions of the spiked covariance model. Our argument is also novel, and doesn't require the assumption of independence of the coordinates of the data vectors, which was needed in [1]; we only need sub-gaussianity and a small-ball condition. For example, our results apply to many kernel regression settings, such as polynomial kernels with compactly supported data.

2 Background and notation

We study ridge regression with independent sub-gaussian observations. Throughout the paper we make the following assumptions:

  •  — a random design matrix with i.i.d. centered rows,

  • the covariance matrix of a row of is with (note that we can assume that the covariance is diagonal without loss of generality, as we can always achieve this by a change of basis),

  • the rows of are sub-gaussian vectors with sub-gaussian norm at most ,

  • is the response vector, where is some unknown vector, and is noise that is independent of ,

  • components of are independent and have sub-gaussian norms bounded by .

Recall that a random variable ξ is sub-gaussian if it has a finite sub-gaussian norm ‖ξ‖_{ψ₂} := inf{ t > 0 : E exp(ξ²/t²) ≤ 2 }, and that the sub-gaussian norm of a random vector v is ‖v‖_{ψ₂} := sup_{‖u‖=1} ‖⟨v, u⟩‖_{ψ₂}.
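To make these assumptions concrete, here is a small Python sketch (purely illustrative, not part of the paper's setup) that draws a design matrix and responses satisfying them. The sample size n, dimension p, spectrum, signal theta_star and noise level are arbitrary choices of ours; Gaussian data is used as one convenient sub-gaussian example.

import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 500                                    # sample size and (overparametrized) dimension
spectrum = 1.0 / np.arange(1, p + 1) ** 2         # eigenvalues of the diagonal covariance
theta_star = rng.standard_normal(p) * np.sqrt(spectrum)   # one arbitrary "true" parameter
sigma_noise = 0.5                                 # noise level (Gaussian, hence sub-gaussian)

# Rows of X are i.i.d. centered Gaussian vectors with covariance diag(spectrum).
X = rng.standard_normal((n, p)) * np.sqrt(spectrum)
eps = sigma_noise * rng.standard_normal(n)
y = X @ theta_star + eps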

In the following, we write if there exists an absolute constant s.t. . We write if there exists a constant that only depends on s.t. . For any positive integer , we denote the identity matrix as , and for any positive semidefinite (PSD) matrix and any denote . We denote the eigenvalues of in decreasing order by .

For any non-negative integer we introduce the following notation:

  • For any matrix denote to be the matrix which is comprised of the first columns of , and to be the matrix comprised of the rest of the columns of .

  • For any vector denote to be the vector comprised of the first components of , and to be the vector comprised of the rest of the coordinates of .

  • Denote and

  • Denote .

  • Denote .

Denote the ridge estimator as

θ̂ := (XᵀX + λI)⁻¹ Xᵀ y = Xᵀ (XXᵀ + λI_n)⁻¹ y,

where we assume that the matrix XXᵀ + λI_n is non-degenerate. The last representation is derived in Appendix A.
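As a sanity check of this definition, the following Python snippet verifies numerically that the p × p ("primal") and n × n ("dual") representations of the ridge estimator coincide for positive λ. The variable names are ours; the dual form is the one that stays meaningful for zero or negative λ in the overparametrized regime, as long as XXᵀ + λI_n is non-degenerate.

import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 200, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Primal form: solve the p x p system (X^T X + lam I_p) theta = X^T y.
theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual form: theta = X^T (X X^T + lam I_n)^{-1} y, only an n x n system.
theta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(theta_primal, theta_dual))      # True for lam > 0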

We are interested in evaluating the MSE of the ridge estimator. For a new independent observation (row with the same distribution as a row of matrix , and is independent of and ), we write for the prediction MSE

so we denote

as the bias and variance terms.
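Since the displayed definitions of the bias and variance terms did not survive extraction, the sketch below uses one standard reading: conditionally on X, the bias term is the excess MSE of the noise-averaged estimator, and the variance term is the average excess MSE around that mean. All names and numerical choices are ours and purely illustrative.

import numpy as np

def ridge_dual(X, y, lam):
    # theta_hat = X^T (X X^T + lam I_n)^{-1} y, the representation from Section 2.
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

def excess_mse(theta_hat, theta_star, cov_diag):
    # (theta_hat - theta_star)^T Sigma (theta_hat - theta_star) for diagonal Sigma.
    d = theta_hat - theta_star
    return float(d @ (cov_diag * d))

rng = np.random.default_rng(2)
n, p, lam, noise = 50, 400, 1e-3, 0.5
cov_diag = 1.0 / np.arange(1, p + 1) ** 2
theta_star = np.sqrt(cov_diag)                    # an arbitrary illustrative signal
X = rng.standard_normal((n, p)) * np.sqrt(cov_diag)

# Average the estimator over fresh noise draws to isolate the bias term;
# the variance term is the average squared (Sigma-weighted) deviation around that mean.
draws = [ridge_dual(X, X @ theta_star + noise * rng.standard_normal(n), lam)
         for _ in range(200)]
mean_est = np.mean(draws, axis=0)
bias_term = excess_mse(mean_est, theta_star, cov_diag)
var_term = np.mean([excess_mse(t, mean_est, cov_diag) for t in draws])
print(bias_term, var_term)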

3 Main results

3.1 Upper bound

The following theorem is the main result of this paper. It consists of two parts: the first part gives explicit bounds in terms of the eigenvalues of the covariance matrix. The second part is more general, but the bound is given in terms of the smallest and largest eigenvalues of the random matrix . We will use that second part in Section 4.2 to obtain bounds for the case of negative ridge regularization. The proof of Theorem 1 can be found in Section C.1 of the appendix.

Theorem 1.

Suppose and it is also known that for some with probability at least the condition number of is at most . Then with probability at least

and

More generally, there exists a constant , which only depends on , s.t. for any , and with probability at least if the matrix is PD

The result of Theorem 1 depends on the arbitrary . One way to interpret the results is through a generalization of the effective ranks that appear in the interpolation case in [1]. Let's call the highest-variance components of the covariance the "spiked part," and the rest of the components the "tail." Denote for any

(1)

Note that for this is the usual effective rank (as it’s defined in [1]) divided by . Denote another effective rank of the tail: . Then the bounds for bias and variance become

The following section suggests that to choose , the size of the spiked part, optimally, one needs to find such for which is of the order of a constant that depends only on . If such doesn't exist, one should take the smallest for which is larger than a constant.
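Equation (1) itself is lost in this copy. Under our reading of the surrounding text (at λ = 0 it should reduce to the effective rank of [1] divided by n), one plausible form is ρ_k(λ) = (λ + Σ_{i>k} λ_i) / (n λ_{k+1}). The Python sketch below computes this assumed quantity and picks k as the smallest index for which it exceeds a threshold b, as suggested above; both the formula and the threshold value are our assumptions, not the paper's.

import numpy as np

def rho(eigs, k, lam, n):
    # Assumed form of the modified effective rank in (1):
    # rho_k(lam) = (lam + sum of the tail eigenvalues) / (n * lambda_{k+1}).
    return (lam + eigs[k:].sum()) / (n * eigs[k])   # eigs[k] is lambda_{k+1} (0-indexed)

def choose_k(eigs, lam, n, b=10.0):
    # Smallest k for which rho_k(lam) exceeds the constant b (an arbitrary choice here).
    for k in range(len(eigs)):
        if rho(eigs, k, lam, n) >= b:
            return k
    return len(eigs) - 1

eigs = 1.0 / np.arange(1, 1001) ** 2              # a fast-decaying spectrum, already sorted
print(choose_k(eigs, lam=1e-3, n=50))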

So the variance term becomes smaller as the rank of the tail increases and the dimension of the spiked part decreases. For the bias term, on the contrary, the components of the tail don't get estimated at all (all the energy from that part of the signal goes into the error). As for the spiked part, its estimation error decreases as its variance increases and as the energy of the tail decreases.

There is a second interpretation, emphasizing the role of . Section C.3 of the appendix shows that if is chosen as suggested above, then one can write (up to constant multipliers that depend only on ):

So we see that there is a mixture coefficient affecting each term: gives the weight of the bias in the -th component, while gives the weight of the variance. It can be shown that if is larger than a constant, then is effectively constant, so has an affine dependence on . This shows how increasing the regularization parameter changes the weights of the bias and variance terms in each component.

We further discuss assumptions and applications of Theorem 1 in Section 4.

3.2 Lower bound and why it matches the upper bound

The following lower bounds for the bias and variance term can be expressed succinctly using our modified effective rank :

Lemma 2.

Suppose that , components of the rows of are independent, and the components of the noise vector have unit variance. Then for some absolute constant for any s.t. and w.p. at least ,

Lemma 3.

For arbitrary consider the following prior distribution on : is obtained from randomly flipping signs of all its coordinates.

Suppose also that and it is known for some that for any w.p. at least . Then for some absolute constant for any non-negative w.p. at least

It may be unclear at first glance whether these lower bounds can match the upper bounds from Theorem 1. We shall see that this is indeed the case under assumptions that are not much more restrictive than those of the first part of Theorem 1.

Since both upper and lower bounds are stated for an arbitrary , we should first decide which we want to choose. It turns out that to understand which to pick, one can look at the values of . We already know from the last statement of Theorem 1 that the ability to upper bound the condition number of by a constant implies a lower bound on that is also a constant (depending only on and ). At the same time, the following lemma shows that if is not upper bounded by some large constant, then we can shift to the first number for which is larger than that large constant, and still be able to control the condition number of by a (different) constant with high probability.

Lemma 4 (Controlling the condition number for some means controlling it for ).

Suppose and that it is known that for some and some with probability at least the condition number of the matrix is at most . If for some , then for some absolute constant for any with probability at least

The ability to choose such a (and the corresponding ) justifies the conditions of the next theorem.

Theorem 5 (The lower bound is the same as the upper bound).

Denote

Suppose for some . Then

Alternatively, if and then

4 Discussion of the main results

4.1 Control over the condition number of

The first part of Theorem 1 may seem peculiar: it is not immediately clear whether one can control the condition number of , and what would be required to do so. One possible answer is rather straightforward: the condition number of is small if and only if is large enough. For example, one could simply choose to be such that with high probability. On the other hand, the matrix may be well conditioned with high probability on its own, in which case one could choose small or even negative values of . The following lemma, whose proof is given in Appendix I, provides the high-probability bound on . Moreover, it shows that one can also control the lowest eigenvalue under a small-ball assumption.

Lemma 6.

For some absolute constant for any with probability at least ,

If it’s additionally known that for some w.p. at least

then w.p. at least ,

For example, if for a large enough constant , which depends only on and , then for both upper and lower bounds from Lemma 6 would match up to a constant factor even for zero . This can be observed by consecutively plugging in and , and then choosing large enough.
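The following Python experiment illustrates this discussion. It tracks the condition number of the regularized Gram matrix of the tail coordinates, λ I_n + X_tail X_tailᵀ, as λ varies, for a spiked spectrum with a long flat tail; taking this particular matrix is our reading of the garbled notation, and all sizes are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
n, p, k = 50, 2000, 5
eigs = np.concatenate([np.full(k, 100.0), np.full(p - k, 0.05)])   # spiked spectrum, flat tail
X = rng.standard_normal((n, p)) * np.sqrt(eigs)

tail = X[:, k:]                                   # columns beyond the spiked part
for lam in [0.0, 1.0, 10.0, 100.0]:
    A = tail @ tail.T + lam * np.eye(n)
    w = np.linalg.eigvalsh(A)
    print(lam, w.max() / w.min())                 # condition number of the regularized tail Gram matrix

# With a long flat tail the matrix is already well conditioned at lam = 0,
# and increasing lam only improves the condition number further.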

4.2 Bounds for particular covariance operators

To demonstrate the applications of Theorem 1 we consider three different regimes, which are the focus of the following theorem. In the first regime, for all , and therefore one can only control the condition number of by choosing large . In the second case, for some large constant , and therefore one can control all the eigenvalues of up to a constant factor even for vanishing . In this case, adding small positive regularization has no effect. Moreover, if from the second case is extremely large, then the matrix concentrates around a properly scaled identity. In this (third) case one can qualitatively change the bound by choosing negative , decreasing the bias without significantly increasing the variance.

Theorem 7.

There exists a large constant that only depends on s.t.

  1. If for some , then for and for any , with probability at least ,

  2. Suppose the components of the data vectors are independent and for some .

    1. For any non-negative , for any , with probability at least ,

    2. For and for any with probability at least

Proof.
  1. By Lemma 6 with probability at least

    On the other hand, plugging in we obtain that with probability at least the condition number of is at most a constant that only depends on , and . The first part of Theorem 1 gives the desired result.

  2. Applying Lemma from [1] we obtain that, with probability at least the condition number of is bounded by a constant that only depends on . The first part of Theorem 1 gives the first part of the claim.

    To obtain the second claim, we use the following result which was shown in the proof of Lemma from [1]: with probability at least

    which yields that for and , all the eigenvalues of are equal to up to a multiplicative constant that only depends on . The second part of Theorem 1 gives the desired result.

The following corollary presents the consequences for three specific examples of covariance operators and regularization parameters that illustrate the three regimes of Theorem 7.

Corollary 8.
  1. Suppose for some and . Then for any for

  2. Suppose the components of the data vectors are independent, and there exist such and s.t.

    Then

    1. If , then for any , with probability at least ,

    2. For and , for any , with probability at least ,

Proof.

The statements follow directly from the corresponding parts of Theorem 7. ∎

As we see from Corollary 8, the regularization can be extremely small compared to the energy (-norm of the signal) and still provide a small generalization error. For example, if we take in part 1 of Corollary 8, for we get and . If we further substitute and , we obtain vanishing (with large ) generalization error with regularization that decays as , i.e., exponentially fast. Moreover, in the second part of Corollary 8, we see that for a particular spiked covariance, taking negative regularization may perform better than zero regularization if the dimension is high enough. Indeed, the bound in Corollary 8, Part 2(b) becomes the same as in Part 2(a) if we plug in . However, if is large (e.g., ), then taking would not change the variance term (up to a constant multiplier that only depends on ), but would significantly decrease the second part of the bias term, which dominates when the noise is sufficiently small. Note that Theorem 5 implies that the bound in Part 2(b) is tight for large enough (so that ), which means that the upper bound for the case of negative regularization may be smaller than the lower bound for the case of zero regularization.
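The following small simulation is in the spirit of Part 2 of Corollary 8: a single spike, a flat tail, high dimension, and low noise, with the ridge estimator computed through its n × n representation so that negative λ is well defined. The sizes, spike strength, and λ grid are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
n, p, spike, noise = 50, 5000, 200.0, 0.1
eigs = np.concatenate([[spike], np.ones(p - 1)])  # one spike plus a flat tail
theta_star = np.zeros(p)
theta_star[0] = 1.0                               # signal aligned with the spike
X = rng.standard_normal((n, p)) * np.sqrt(eigs)
y = X @ theta_star + noise * rng.standard_normal(n)

def excess_mse(lam):
    # Ridge estimator via the n x n representation; valid as long as
    # X X^T + lam I_n stays positive definite (which holds here for these moderate negative lam).
    theta = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    d = theta - theta_star
    return float(d @ (eigs * d))

for lam in [0.0, -0.25 * p, -0.5 * p]:            # negative lam of the order of the tail energy
    print(lam, excess_mse(lam))

On draws like this one, a mildly negative λ can yield a smaller excess MSE than λ = 0, in line with the discussion above; the size of the effect depends on the SNR and on the tail dimension.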

5 Discussion and further directions

We obtained non-asymptotic generalization bounds for ridge regression, depending on the covariance structure of the data and relying on much weaker assumptions than previous work. Our bound for the bias term is novel even in the ridgeless case, and it shows that the energy of the true signal that corresponds to the tail goes directly into the error, while the error term corresponding to the spiked part is inversely related to its energy. Our results imply that in the case of a rapidly decaying sequence of eigenvalues of the covariance operator, very small regularization may be sufficient to achieve vanishing generalization error. Moreover, we showed that negative regularization can achieve a similar effect, and can even be beneficial compared to zero regularization for a slowly decaying sequence of eigenvalues, e.g., in the standard spiked covariance setting.

Although the second part of Theorem 1 is applicable in quite general situations, we don't expect that bound to be sharp when there is no way to control the condition number of . For example, that is the case if the eigenvalues of the covariance operator decay quickly and the regularization is zero. Another possible direction for further work is removing the assumption of sub-gaussianity, which would, for example, allow the results to be applied to a larger class of kernel regression problems.

Appendix A Ridge regression

Recall that

  •  — a random matrix with i.i.d. centered rows,

  • the covariance matrix of a row of is

  • the rows of are sub-gaussian with sub-gaussian norm at most ,

  • is the response vector, where is some unknown vector, and is noise,

  • components of are independent and have sub-gaussian norms bounded by .

We are interested in evaluating the MSE of the ridge estimator

Applying the Sherman-Morrison-Woodbury identity yields

So,

Finally,

Since we have (best misspecified prediction plus noise), the excess MSE can be bounded as
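A quick numerical confirmation of the Sherman-Morrison-Woodbury step (our own check, not part of the appendix): the p × p and n × n forms of the ridge operator agree, and at λ = 0 in the overparametrized case the n × n form returns an interpolating (minimum-norm) solution.

import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 20, 300, 0.3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Sherman-Morrison-Woodbury: (X^T X + lam I_p)^{-1} X^T == X^T (X X^T + lam I_n)^{-1}.
lhs = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
rhs = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
print(np.allclose(lhs, rhs))                      # True

# At lam = 0 (overparametrized, X X^T invertible) the n x n form gives the
# minimum-norm interpolator: it fits the training data exactly.
theta0 = X.T @ np.linalg.solve(X @ X.T, y)
print(np.allclose(X @ theta0, y))                 # True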

Appendix B Splitting the coordinates and the corresponding notation

Let’s also fix some . We will spit the coordinates into two groups: the first components and the rest of the components.

Consider integers from to .

  • For any matrix denote to be the matrix which is comprised of the columns of from -st to -th.

  • For any vector denote to be the vector comprised of components of from -st to -th.

  • Denote and

Appendix C Main results

C.1 Upper bound on the prediction MSE

Theorem 9.

There exist (large) constants , which only depend on , s.t. if

  1. ,

then , and with probability at least

( stands for the contribution of the bias term, and for that of the variance term).

Moreover, if it is also known that for some with probability at least the condition number of is at most , then with probability at least

and

Proof.

Combining Lemmas 12 and 13 we obtain the following bound