Optimal Regularization Can Mitigate Double Descent
Abstract
Recent empirical and theoretical studies have shown that many learning algorithms – from linear regression to neural networks – can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised questions about whether we need to rethink our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned $\ell_2$ regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned $\ell_2$ regularization can mitigate double descent for more general models, including neural networks. Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.
1 Introduction
Recent works have demonstrated a ubiquitous "double descent" phenomenon present in a range of machine learning models, including decision trees, random features, linear regression, and deep neural networks (Opper, 1995, 2001; Advani and Saxe, 2017; Spigler et al., 2018; Belkin et al., 2018; Geiger et al., 2019b; Nakkiran et al., 2020; Belkin et al., 2019; Hastie et al., 2019; Bartlett et al., 2019; Muthukumar et al., 2019; Bibas et al., 2019; Mitra, 2019; Mei and Montanari, 2019; Liang and Rakhlin, 2018; Liang et al., 2019; Xu and Hsu, 2019; Dereziński et al., 2019; Lampinen and Ganguli, 2018; Deng et al., 2019; Nakkiran, 2019). The phenomenon is that models exhibit a peak of high test risk when they are just barely able to fit the train set, that is, to interpolate. For example, as we increase the size of models, test risk first decreases, then increases to a peak around when the effective model size is close to the training data size, and then decreases again in the overparameterized regime. Also surprising is that Nakkiran et al. (2020) observe a double descent as we increase the sample size, i.e., for a fixed model, training with more data can hurt test performance.
These striking observations highlight a potential gap in our understanding of generalization and an opportunity for improved methods. Ideally, we seek to use learning algorithms which robustly improve performance as the data or model size grow and do not exhibit such unexpected nonmonotonic behaviors. In other words, we aim to improve the test performance in situations which would otherwise exhibit high test risk due to double descent. Here, a natural strategy would be to use a regularizer and tune its strength on a validation set.
This motivates the central question of this work:
When does optimally-tuned regularization mitigate or remove the double-descent phenomenon?
Another motivation to start this line of inquiry is the observation that the double descent phenomenon is largely observed for unregularized or under-regularized models in practice. As an example, Figure 1 shows a simple linear ridge regression setting in which the unregularized estimator exhibits double descent, but an optimally-tuned regularizer has monotonic test performance.
Our Contributions: We study this question from both a theoretical and empirical perspective. Theoretically, we start with the setting of high-dimensional linear regression. Linear regression is a sensible starting point to study these questions, since it already exhibits many of the qualitative features of double descent in more complex models (e.g. Belkin et al. (2019); Hastie et al. (2019) and further related works in Section 1.1).
This work shows that optimally-tuned ridge regression can achieve both sample-wise monotonicity and model-wise monotonicity under certain assumptions. Concretely, we show:

Sample-wise monotonicity: In the setting of well-specified linear regression with isotropic features/covariates (Figure 1), we prove that optimally-tuned ridge regression yields monotonic test performance with increasing samples. That is, more data never hurts for optimally-tuned ridge regression (see Theorem 2).

Model-wise monotonicity: We consider a setting where the input/covariate lives in a high-dimensional ambient space with isotropic covariance. Given a fixed model size (which might be much smaller than the ambient dimension), we consider the family of models which first project the input to a random subspace of dimension equal to the model size, and then compute a linear function in this projected "feature space." (This is nearly identical to models of double descent considered in Hastie et al. (2019, Section 5.1).) We prove that in this setting, as we grow the model size, optimally-tuned ridge regression over the projected features has monotone test performance. That is, with optimal regularization, bigger models are always better or the same. (See Theorem 3.)

Monotonicity in the real world: We also demonstrate several richer empirical settings where optimal regularization induces monotonicity, including random feature classifiers and convolutional neural networks. This suggests that the mitigating effect of optimal regularization may hold more generally in broad machine learning contexts. (See Section 5.)
A few remarks are in order:
Problem-specific vs. minimax and Bayesian. It is worth noting that our results hold for all linear ground truths, rather than holding for only the worst-case ground truth or a random ground truth. Indeed, the minimax optimal estimator and the Bayes optimal estimator are both trivially sample-wise and model-wise monotonic with respect to the minimax risk or the Bayes risk. However, they do not guarantee monotonicity of the risk itself for a given fixed problem.
Universal vs. asymptotic. We also remark that our analysis is not only non-asymptotic but also works for all possible input dimensions, model sizes, and sample sizes. Prior works on double descent mostly rely on asymptotic assumptions that send the sample size or the model size to infinity in a specific manner. To our knowledge, the results herein are the first non-asymptotic sample-wise and model-wise monotonicity results for linear regression. (See the discussion of related works Hastie et al. (2019); Mei and Montanari (2019) for related results in the asymptotic setting.)
Finally, we note that our claims are about monotonicity of the actual test risk, instead of the monotonicity of the generalization bounds (e.g., results in (Wei et al., 2019)).
Towards a more general characterization. Our theoretical results crucially rely on the covariance of the data being isotropic. A natural next question is whether and when the same results can hold more generally. A full answer to this question is beyond the scope of this paper, though we give the following results:

Sample-wise monotonicity does not hold for optimally-tuned ridge regression for a certain non-Gaussian data distribution with heteroscedastic noise. This can be seen from an example in two dimensions with a small number of samples. (See Section 4.1 for the counterexample and intuitions.)

For non-isotropic Gaussian covariates, we can achieve sample-wise monotonicity with a regularizer that depends on the population covariance matrix of the data. This suggests unlabeled data might also help mitigate double descent in some settings, because the population covariance can be estimated from unlabeled data.

For non-isotropic Gaussian covariates, we conjecture that optimally-tuned ridge regression is sample-monotonic even with the standard regularizer (as in Figure 2). We derive a sufficient condition for this conjecture, which we verify numerically in a variety of cases.
The last result above highlights the importance of the form of the regularizer, which leads to the open question: "How do we design good regularizers which mitigate or remove double descent?" We hope that our results can motivate future work on mitigating the double descent phenomenon, and allow us to train high-performance models which do not exhibit unexpected non-monotonic behaviors.
1.1 Related Works
This work builds on and is inspired by the long line of work on “double descent” phenomena in machine learning. Double descent of test risk as a function of model size was proposed in generality by Belkin et al. (2018). Similar behavior was observed empirically in Advani and Saxe (2017); Geiger et al. (2019a); Spigler et al. (2018); Neal et al. (2018), and even earlier in restricted settings as early as Trunk (1979); Opper (1995, 2001); Skurichina and Duin (2002). Recently Nakkiran et al. (2020) demonstrated a generalized double descent phenomenon on modern deep networks, and highlighted “sample nonmonotonicity” as an aspect of double descent.
Following Belkin et al. (2018), a recent stream of theoretical works considers model-wise double descent in simplified settings, often linear models for regression or classification. A partial list includes (Belkin et al., 2019; Hastie et al., 2019; Bartlett et al., 2019; Muthukumar et al., 2019; Bibas et al., 2019; Mitra, 2019; Mei and Montanari, 2019; Liang and Rakhlin, 2018; Liang et al., 2019; Xu and Hsu, 2019; Dereziński et al., 2019; Lampinen and Ganguli, 2018; Deng et al., 2019; Nakkiran, 2019; Mahdaviyeh and Naulet, 2019). Of these, most closely related to our work are Hastie et al. (2019); Mei and Montanari (2019); Nakkiran (2019). Specifically, Hastie et al. (2019) considers the risk of unregularized and regularized linear regression in an asymptotic regime, where the dimension and number of samples scale to infinity together at a constant ratio. In contrast, we show non-asymptotic results, and are able to consider increasing the number of samples for a fixed model, without scaling both together. Mei and Montanari (2019) derive similar results for unregularized and regularized random features, also in an asymptotic limit where the number of features and samples scale to infinity together. The non-asymptotic versions of the settings considered in Hastie et al. (2019) are almost identical to ours; for example, our projection model in Section 3 is nearly identical to the model in Hastie et al. (2019, Section 5.1).
Most of the above works on double descent are concerned with studying test risk as a function of increasing model size. In this work, we also study the recent "sample-wise" perspective on double descent, and consider the test risk of a fixed model for increasing samples. Nakkiran (2019) highlights that unregularized isotropic linear regression exhibits this kind of sample-wise non-monotonicity, and we study this model in the context of optimal regularization. The study of non-monotonicity in learning algorithms also existed prior to double descent, including in Duin (1995, 2000); Opper (2001); Loog and Duin (2012). Loog et al. (2019) introduces the same notion of risk monotonicity which we consider, and studies several examples of monotonic and non-monotonic procedures.
2 Sample Monotonicity in Ridge Regression
In this section, we prove that optimally-regularized ridge regression has test risk that is monotonic in samples, for isotropic Gaussian covariates and a linear response. This confirms the behavior empirically observed in Figure 1. We also show that this monotonicity is not "fragile": using regularization larger than the optimal value is still sample-monotonic (consistent with Figure 1).
Formally, we consider the following linear regression problem in $d$ dimensions. The input/covariate is generated as $x \sim \mathcal{N}(0, I_d)$, and the output/response $y \in \mathbb{R}$ is generated by
$$y = \langle \beta, x \rangle + \eta,$$
with $\eta \sim \mathcal{N}(0, \sigma^2)$ for some unknown parameter $\beta \in \mathbb{R}^d$. We denote the joint distribution of $(x, y)$ by $\mathcal{D}$. We are given $n$ training examples $\{(x_i, y_i)\}_{i=1}^n$ sampled i.i.d. from $\mathcal{D}$. We aim to learn a linear model $x \mapsto \langle \hat\beta, x \rangle$ with small population mean-squared error on the distribution $\mathcal{D}$.
For simplicity, let $X \in \mathbb{R}^{n \times d}$ be the data matrix that contains the $x_i$'s as rows, and let $y \in \mathbb{R}^n$ be the column vector that contains the responses $y_i$'s as entries. For any estimator $\hat\beta(X, y)$ as a function of $n$ samples, define the expected risk of the estimator as:
$$\mathrm{Risk}_n(\hat\beta) := \mathbb{E}_{X, y}\Big[\mathbb{E}_{(x', y') \sim \mathcal{D}}\big[(\langle \hat\beta(X, y), x' \rangle - y')^2\big]\Big]. \qquad (1)$$
We consider the regularized least-squares estimator, also known as the ridge regression estimator. For a given $\lambda > 0$, define
$$\hat\beta_\lambda(X, y) := \arg\min_{\beta} \; \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2 \qquad (2)$$
$$= (X^\top X + \lambda I_d)^{-1} X^\top y. \qquad (3)$$
Here $I_d$ denotes the $d$-dimensional identity matrix. Let $\lambda^*_n$ be the optimal ridge parameter (that achieves the minimum expected risk) given $n$ samples:
$$\lambda^*_n := \arg\min_{\lambda > 0} \; \mathrm{Risk}_n(\hat\beta_\lambda). \qquad (4)$$
Let $\hat\beta^*_n$ be the estimator that corresponds to the optimal ridge parameter:
$$\hat\beta^*_n := \hat\beta_{\lambda^*_n}. \qquad (5)$$
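For concreteness, the closed-form estimator in Equations (2)–(3) can be sketched in a few lines of NumPy. This is a toy illustration with arbitrary sizes, not the code behind the figures; the choice of $\lambda$ below uses the optimal value derived later in this section:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y, as in Equation (3)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy instance of the isotropic model: x ~ N(0, I_d), y = <beta, x> + noise.
rng = np.random.default_rng(0)
d, n, sigma = 10, 40, 0.5
beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)                 # normalize so ||beta|| = 1
X = rng.normal(size=(n, d))
y = X @ beta + sigma * rng.normal(size=n)

beta_hat = ridge(X, y, lam=d * sigma**2)     # lam = d*sigma^2 / ||beta||^2
excess_risk = np.sum((beta_hat - beta)**2)   # ||beta_hat - beta||^2
```

With isotropic covariates, the excess test risk is exactly the squared parameter error computed on the last line.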
Our main theorem in this section shows that the expected risk of $\hat\beta^*_n$ monotonically decreases as $n$ increases.

Theorem 2. In the setting above, the expected test risk of optimally-regularized well-specified isotropic linear regression is monotonic in samples. That is, for all $n \ge 1$ and all $\beta \in \mathbb{R}^d$, $\sigma \ge 0$:
$$\mathrm{Risk}_{n+1}\big(\hat\beta_{\lambda^*_{n+1}}\big) \le \mathrm{Risk}_n\big(\hat\beta_{\lambda^*_n}\big).$$
The above theorem shows a strong form of monotonicity, since it holds for every fixed ground truth $\beta$, and does not require averaging over any prior on ground truths. Moreover, it holds non-asymptotically, for every fixed sample size and dimension. Obtaining such non-asymptotic results is nontrivial, since we cannot rely on concentration properties of the involved random variables.
In particular, evaluating the expected risk as a function of the problem parameters ($n$, $d$, $\sigma$, and $\beta$) is technically challenging. In fact, we suspect that a simple closed-form expression does not exist. The key idea towards proving the theorem is to derive a "partial evaluation": the following lemma shows that we can write the risk in the form $\mathbb{E}_{s}[g(s, \lambda)]$, where $s$ contains the singular values of the data matrix $X$. We will then couple the randomness of data matrices obtained by adding a single sample, and use singular value interlacing to compare their singular values.
In the setting of Theorem 2, let $s_1 \ge s_2 \ge \cdots$ be the singular values of the data matrix $X$. (If $n < d$, we pad $s_i = 0$ for $i > n$.) Let $\mathcal{S}_n$ be the distribution of $s = (s_1, \ldots, s_d)$. Then, the expected test risk is
$$\mathrm{Risk}_n(\hat\beta_\lambda) = \sigma^2 + \mathop{\mathbb{E}}_{s \sim \mathcal{S}_n}\left[\sum_{i=1}^{d} \frac{\sigma^2 s_i^2 + \lambda^2 \|\beta\|^2 / d}{(s_i^2 + \lambda)^2}\right].$$
From Lemma 2, the lemma below follows directly by taking derivatives to find the optimal $\lambda$.
In the setting of Theorem 2, the optimal ridge parameter is constant for all $n$:
$$\lambda^*_n = \frac{d\sigma^2}{\|\beta\|^2}.$$
Moreover, the optimal expected test risk can be written as
$$\mathrm{Risk}_n\big(\hat\beta_{\lambda^*_n}\big) = \sigma^2 + \sigma^2 \mathop{\mathbb{E}}_{s \sim \mathcal{S}_n}\left[\sum_{i=1}^{d} \frac{1}{s_i^2 + \lambda^*_n}\right]. \qquad (6)$$
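The constancy of the optimal ridge parameter can be checked numerically. The sketch below assumes the risk decomposition $\sum_i (\sigma^2 s_i^2 + \lambda^2 \|\beta\|^2/d)/(s_i^2+\lambda)^2$ as reconstructed above; under that formula, the minimizer over $\lambda$ is $d\sigma^2/\|\beta\|^2$ for every fixed vector of singular values, so the grid minimizer should not move as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, r = 8, 0.7, 1.0        # dimension, noise level, ||beta||

def risk_given_svals(s, lam):
    """Excess risk for fixed singular values s (averaged over the noise and
    the direction of beta): sum_i (sigma^2 s_i^2 + lam^2 r^2/d)/(s_i^2 + lam)^2."""
    return np.sum((sigma**2 * s**2 + lam**2 * r**2 / d) / (s**2 + lam)**2)

lam_star = d * sigma**2 / r**2   # claimed optimum, independent of n
grid = np.linspace(0.1, 20.0, 2000)

for n in (4, 8, 16):             # the minimizer should not move with n
    s = np.linalg.svd(rng.normal(size=(n, d)), compute_uv=False)
    if len(s) < d:
        s = np.pad(s, (0, d - len(s)))   # pad s_i = 0 when n < d
    best = grid[np.argmin([risk_given_svals(s, lam) for lam in grid])]
    assert abs(best - lam_star) < 0.02   # grid-resolution tolerance
```

Note that the minimizer matches even in the underdetermined case $n < d$, where the padded zero singular values contribute a constant to the risk.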
Proof of Lemma 2.
For isotropic $x \sim \mathcal{N}(0, I_d)$, the test risk is related to the parameter error as:
$$\mathrm{Risk}_n(\hat\beta_\lambda) = \sigma^2 + \mathbb{E}\big[\|\hat\beta_\lambda - \beta\|^2\big].$$
Plugging in the form of $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top (X\beta + \eta)$ and expanding:
$$\hat\beta_\lambda - \beta = (X^\top X + \lambda I)^{-1} X^\top \eta - \lambda (X^\top X + \lambda I)^{-1} \beta.$$
Now let $X = U S V^\top$ be the full singular value decomposition of $X$, with $U \in \mathbb{R}^{n \times n}$, $S \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{d \times d}$. Let $s_1 \ge s_2 \ge \cdots$ denote the singular values, defining $s_i = 0$ for $i > \min(n, d)$, and let $v_i$ denote the columns of $V$. Then, since $\mathbb{E}[\eta] = 0$ the cross term vanishes, and continuing:
$$\mathbb{E}\big[\|\hat\beta_\lambda - \beta\|^2\big] = \mathbb{E}\left[\sigma^2 \sum_{i=1}^{d} \frac{s_i^2}{(s_i^2 + \lambda)^2} + \lambda^2 \sum_{i=1}^{d} \frac{\langle v_i, \beta \rangle^2}{(s_i^2 + \lambda)^2}\right] \qquad (7)$$
$$= \mathbb{E}\left[\sigma^2 \sum_{i=1}^{d} \frac{s_i^2}{(s_i^2 + \lambda)^2} + \frac{\lambda^2 \|\beta\|^2}{d} \sum_{i=1}^{d} \frac{1}{(s_i^2 + \lambda)^2}\right] \qquad (8)$$
$$= \mathop{\mathbb{E}}_{s \sim \mathcal{S}_n}\left[\sum_{i=1}^{d} \frac{\sigma^2 s_i^2 + \lambda^2 \|\beta\|^2 / d}{(s_i^2 + \lambda)^2}\right], \qquad (9)$$
which together with the first display gives the claimed risk formula. $\qquad (10)$
Line (8) follows because, by symmetry, the distribution of $V$ is a uniformly random orthonormal matrix, independent of $S$. Thus, $V^\top \beta$ is distributed as a uniformly random point on the sphere of radius $\|\beta\|$, so $\mathbb{E}[\langle v_i, \beta \rangle^2] = \|\beta\|^2 / d$.
∎
Now we are ready to prove Theorem 2.
Proof of Theorem 2.
Let $X_n \in \mathbb{R}^{n \times d}$ and $X_{n+1} \in \mathbb{R}^{(n+1) \times d}$ be any two matrices such that $X_{n+1}$ is obtained from $X_n$ by appending one row. By the Cauchy interlacing theorem (Theorem 4.3.4 of Horn et al. (1990); c.f. Lemma 3.4 of Marcus et al. (2014)), the singular values of $X_n$ and $X_{n+1}$ are interlaced:
$$s_i(X_{n+1}) \ge s_i(X_n) \ge s_{i+1}(X_{n+1}),$$
where $s_i(\cdot)$ is the $i$-th singular value.
If we couple $X_n$ and $X_{n+1}$ by appending a fresh sample row to $X_n$, this induces a coupling between the distributions $\mathcal{S}_n$ and $\mathcal{S}_{n+1}$ of the singular values of the data matrix for $n$ and $n+1$ samples. This coupling satisfies $s_i^{(n+1)} \ge s_i^{(n)}$ with probability 1 for all $i$. Since by Equation (6) the optimal risk is non-increasing in each $s_i$, and the optimal ridge parameter does not depend on $n$, the risk with $n+1$ samples is at most the risk with $n$ samples, completing the proof.
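The interlacing step is easy to sanity-check numerically; a small sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 9
X_n  = rng.normal(size=(n, d))
X_n1 = np.vstack([X_n, rng.normal(size=(1, d))])  # append one sample (row)

s_n  = np.linalg.svd(X_n,  compute_uv=False)      # n singular values
s_n1 = np.linalg.svd(X_n1, compute_uv=False)      # n+1 singular values

# Cauchy interlacing: s_i(X_{n+1}) >= s_i(X_n) >= s_{i+1}(X_{n+1})
assert np.all(s_n1[:n] >= s_n - 1e-10)
assert np.all(s_n >= s_n1[1:n + 1] - 1e-10)
```

In particular, every singular value can only grow when a row is appended, which is exactly the coupling property used above.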
By similar techniques, we can also prove that over-regularization, that is, using ridge parameters larger than the optimal value, is still monotonic in samples. This proves the behavior empirically observed in Figure 1.
In the same setting as Theorem 2, over-regularized regression is also monotonic in samples. That is, for all $n$ and all $\lambda \ge \lambda^*$, the following holds:
$$\mathrm{Risk}_{n+1}(\hat\beta_\lambda) \le \mathrm{Risk}_n(\hat\beta_\lambda),$$
where $\lambda^* = d\sigma^2 / \|\beta\|^2$.
Proof.
In Section A.1. ∎
3 Modelwise Monotonicity in Ridge Regression
In this section, we show that for a certain family of linear models, optimal regularization prevents modelwise double descent. That is, for a fixed number of samples, larger models are not worse than smaller models.
We consider the following learning problem. Informally, covariates live in a $D$-dimensional ambient space, and we consider models which first linearly project down to a random $p$-dimensional subspace, then perform ridge regression in that subspace, for some $p \le D$.
Formally, the covariate is generated as $x \sim \mathcal{N}(0, I_D)$, and the response $y \in \mathbb{R}$ is generated by
$$y = \langle \beta, x \rangle + \eta,$$
with $\eta \sim \mathcal{N}(0, \sigma^2)$ for some unknown parameter $\beta \in \mathbb{R}^D$. Next, $n$ examples are sampled i.i.d. from this distribution. For a given model size $p \le D$, we first sample a random orthonormal matrix $P \in \mathbb{R}^{p \times D}$ (with $P P^\top = I_p$) which specifies our model. We then consider models which operate on the projected covariate $\tilde{x} := P x$, where $\tilde{x} \in \mathbb{R}^p$. We denote the joint distribution of $(\tilde{x}, y)$ by $\mathcal{D}_P$. Here, we emphasize that $D$ is some large ambient dimension and $p$ is the size of the model we learn.
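A random orthonormal projection of this kind can be sampled by taking the QR decomposition of a Gaussian matrix. A sketch (the symbols $p$ and $D$ for model size and ambient dimension follow the notation used here):

```python
import numpy as np

def random_orthonormal(p, D, rng):
    """Sample a p x D matrix P with orthonormal rows (P P^T = I_p),
    via QR decomposition of a Gaussian matrix."""
    G = rng.normal(size=(D, p))
    Q, _ = np.linalg.qr(G)     # D x p, orthonormal columns
    return Q.T                 # p x D, orthonormal rows

rng = np.random.default_rng(3)
D, p, n = 50, 10, 20
P = random_orthonormal(p, D, rng)
X = rng.normal(size=(n, D))    # ambient covariates, rows x ~ N(0, I_D)
X_proj = X @ P.T               # n x p projected covariates (rows P x)
```

Because the Gaussian ensemble is rotationally invariant, the row space of $P$ sampled this way is a uniformly random $p$-dimensional subspace.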
For a fixed $P$, we want to learn a linear model $\hat\beta \in \mathbb{R}^p$ for estimating $y$ from $\tilde{x}$, with small mean squared error $\mathbb{E}_{(x, y)}\big[(\langle \hat\beta, P x \rangle - y)^2\big]$ on the distribution.
For samples $\{(x_i, y_i)\}_{i=1}^n$, let $X \in \mathbb{R}^{n \times D}$ be the data matrix, $\tilde{X} := X P^\top \in \mathbb{R}^{n \times p}$ be the projected data matrix, and $y \in \mathbb{R}^n$ be the responses. For any estimator $\hat\beta(\tilde{X}, y)$ as a function of the observed samples, define the expected risk of the estimator as:
$$\mathrm{Risk}_{n, p}(\hat\beta) := \mathbb{E}\big[(\langle \hat\beta(\tilde{X}, y), P x' \rangle - y')^2\big], \qquad (14)$$
where $(x', y')$ is a fresh sample and the expectation is over the training data and $(x', y')$.
We consider the regularized least-squares estimator. For a given $\lambda > 0$, define
$$\hat\beta_\lambda(\tilde{X}, y) := \arg\min_{\beta} \; \|\tilde{X}\beta - y\|_2^2 + \lambda \|\beta\|_2^2 \qquad (15)$$
$$= (\tilde{X}^\top \tilde{X} + \lambda I_p)^{-1} \tilde{X}^\top y. \qquad (16)$$
Let $\lambda^*_{n, p}$ be the optimal ridge parameter (that achieves the minimum expected risk) for a model of size $p$, with $n$ samples:
$$\lambda^*_{n, p} := \arg\min_{\lambda > 0} \; \mathrm{Risk}_{n, p}(\hat\beta_\lambda). \qquad (17)$$
Let $\hat\beta^*_{n, p}$ be the estimator that corresponds to the optimal ridge parameter:
$$\hat\beta^*_{n, p} := \hat\beta_{\lambda^*_{n, p}}. \qquad (18)$$
Now, our main theorem in this setting shows that with optimal regularization, test performance is monotonic in model size.

Theorem 3. In the setting above, the expected test risk of the optimally-regularized model is monotonic in the model size $p$. That is, for all $p < D$, we have
$$\mathrm{Risk}_{n, p+1}\big(\hat\beta^*_{n, p+1}\big) \le \mathrm{Risk}_{n, p}\big(\hat\beta^*_{n, p}\big).$$
Proof.
In Section A.2. ∎
For all $n$ and $p \le D$, let $U \in \mathbb{R}^{n \times D}$ be a matrix with i.i.d. $\mathcal{N}(0, 1)$ entries. Let $P \in \mathbb{R}^{p \times D}$ be a random orthonormal matrix. Define $\tilde{X} := U P^\top$.

Let $s_1 \ge s_2 \ge \cdots$ be the singular values of the data matrix $\tilde{X}$ (with $s_i = 0$ for $i > \min(n, p)$). Let $\mathcal{S}_{n, p}$ be the distribution of the singular values $s = (s_1, \ldots, s_p)$.

Then, the optimal ridge parameter is constant for all $n$:
$$\lambda^*_{n, p} = \frac{D \, \tilde\sigma_p^2}{\|\beta\|^2},$$
where we define the effective noise level
$$\tilde\sigma_p^2 := \sigma^2 + \frac{D - p}{D} \|\beta\|^2.$$
Moreover, the optimal expected test risk can be written as
$$\mathrm{Risk}_{n, p}\big(\hat\beta_{\lambda^*_{n, p}}\big) = \tilde\sigma_p^2 + \tilde\sigma_p^2 \mathop{\mathbb{E}}_{s \sim \mathcal{S}_{n, p}}\left[\sum_{i=1}^{p} \frac{1}{s_i^2 + \lambda^*_{n, p}}\right].$$
4 Counterexamples to Monotonicity
In this section, we show that optimally-regularized ridge regression is not always monotonic in samples. We give a numeric counterexample in $d = 2$ dimensions, with non-Gaussian covariates and heteroscedastic noise. This does not contradict our main theorem in Section 2, since this distribution is not jointly Gaussian with isotropic marginals.
4.1 Counterexample
Here we give an example of a distribution for which the expected error of optimally-regularized ridge regression with $n = 2$ samples is worse than with $n = 1$ sample.
This counterexample is most intuitive to understand when the ridge parameter $\lambda$ is allowed to depend on the specific sample instance as well as $n$.
Consider the following distribution on $(x, y)$ in $d = 2$ dimensions. This distribution has one "clean" coordinate and one "noisy" coordinate. The distribution is:
$$x \sim \mathrm{Uniform}\{e_1, e_2\}, \qquad y = \langle \beta, x \rangle + \mathbb{1}\{x = e_2\} \, \eta,$$
where $\beta \in \mathbb{R}^2$ and $\eta$ is uniformly random independent noise. This distribution is "well-specified" in that the optimal predictor is linear in $x$: $\mathbb{E}[y \mid x] = \langle \beta, x \rangle$. However, the noise is heteroscedastic.
For $n = 1$ sample, the estimator can decide whether to use small or large $\lambda$ depending on whether the sampled coordinate is the "clean" or "noisy" one. Specifically, for the sample $(x_1, y_1)$: if $x_1 = e_1$, the optimal ridge parameter is small; if $x_1 = e_2$, the optimal ridge parameter is large.
For $n = 2$ samples, with probability $1/2$ the two samples will hit both coordinates. In this case, the estimator must choose a single value of $\lambda$ uniformly for both coordinates. This leads to a suboptimal tradeoff, since the "noisy" coordinate demands large regularization, but this hurts estimation on the "clean" coordinate.
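This tension can be made concrete with a one-sample calculation: with a single observation on coordinate $j$, ridge shrinks the estimate by a factor $1/(1 + \lambda)$, so the clean coordinate prefers $\lambda = 0$ while the noisy coordinate prefers $\lambda > 0$. The constants below ($\beta = (1, 1)$, noise standard deviation $2$) are hypothetical, chosen for illustration rather than taken from the construction above:

```python
import numpy as np

beta, noise_std = np.array([1.0, 1.0]), 2.0    # hypothetical constants
lams = np.linspace(0.0, 10.0, 10001)

# One sample on the clean coordinate: y = beta_1, ridge estimate y/(1+lam).
risk_clean = (lams * beta[0] / (1 + lams))**2

# One sample on the noisy coordinate: y = beta_2 + eta, E[eta^2] = noise_std^2.
# Expected squared error: (lam^2 beta_2^2 + noise_std^2) / (1 + lam)^2.
risk_noisy = (lams**2 * beta[1]**2 + noise_std**2) / (1 + lams)**2

best_clean = lams[np.argmin(risk_clean)]   # minimized at lam = 0
best_noisy = lams[np.argmin(risk_noisy)]   # minimized at lam = noise_std^2 / beta_2^2
```

A short derivative calculation confirms the noisy coordinate's minimizer is $\lambda = \sigma_{\text{noise}}^2 / \beta_2^2$, so no single $\lambda$ is optimal for both coordinates.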
It turns out that a slight modification of the above also serves as a counterexample to monotonicity when the regularization parameter is chosen depending only on $n$ (and not on the sampled instance).
The distribution is of the same form as above, with particular numerical choices of $\beta$ and the noise scale.
There exists a distribution $\mathcal{D}$ over $(x, y) \in \mathbb{R}^2 \times \mathbb{R}$ with the following properties. Let $\hat\beta_{\lambda^*_n}$ be the optimally-regularized ridge regression solution for $n$ samples from $\mathcal{D}$. Then:

$\mathcal{D}$ is "well-specified" in that $\mathbb{E}[y \mid x]$ is a linear function of $x$, and

the expected test risk increases as a function of $n$, between $n = 1$ and $n = 2$. Specifically, $\mathrm{Risk}_2(\hat\beta_{\lambda^*_2}) > \mathrm{Risk}_1(\hat\beta_{\lambda^*_1})$.
Proof.
For $n = 1$ sample, the expected risk at the optimal ridge parameter can be computed analytically. For $n = 2$ samples, it can be confirmed numerically (via Mathematica) that the expected risk at the optimal ridge parameter is strictly larger. ∎
5 Experiments
We now experimentally demonstrate that optimal regularization can mitigate double descent, in more general settings than Theorems 2 and 3.
5.1 Sample Monotonicity
Here we show various settings where optimal regularization empirically induces sample-monotonic performance.
Non-isotropic Regression. We first consider the setting of Theorem 2, but with non-isotropic covariates. That is, we perform ridge regression on samples $\{(x_i, y_i)\}$, where the covariate is generated from $\mathcal{N}(0, \Sigma)$ for some covariance $\Sigma \succ 0$. As before, the response is generated by $y = \langle \beta, x \rangle + \eta$ with $\eta \sim \mathcal{N}(0, \sigma^2)$ for some unknown parameter $\beta$.
We consider the same ridge regression estimator,
$$\hat\beta_\lambda(X, y) := (X^\top X + \lambda I_d)^{-1} X^\top y. \qquad (19)$$
Figure 2 shows one instance of this, for a particular choice of dimension, covariance $\Sigma$, and ground truth $\beta$. The covariance $\Sigma$ is diagonal with two distinct eigenvalues: that is, the covariance has one "large" eigenspace and one "small" eigenspace. The ground truth $\beta$ lies almost entirely within the "small" eigenspace of $\Sigma$. The noise parameter is $\sigma$.
We see that unregularized regression ($\lambda = 0$) actually undergoes "triple descent".
Random ReLU Features. We consider random ReLU features, in the random features framework of Rahimi and Recht (2008). We apply random features to Fashion-MNIST (Xiao et al., 2017), an image classification problem with 10 classes. Input images are normalized and flattened to vectors $x \in \mathbb{R}^{784}$. Class labels are encoded as one-hot vectors $y \in \mathbb{R}^{10}$. For a given number of features $N$, and number of samples $n$, the random feature classifier is obtained by performing regularized linear regression on the embedding
$$\phi(x) := \mathrm{ReLU}(W x),$$
where $W \in \mathbb{R}^{N \times 784}$ is a matrix with each entry sampled i.i.d. from a Gaussian, and the ReLU applies pointwise. This is equivalent to a 2-layer fully-connected neural network with a frozen (randomly-initialized) first layer, trained with squared loss and weight decay.
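A minimal sketch of such a random ReLU feature classifier, on synthetic Gaussian data rather than Fashion-MNIST, with an assumed $1/\sqrt{d}$ scaling for the frozen first layer (the paper's exact scaling and sizes are not specified here):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, N, C = 200, 20, 400, 3              # toy sizes: samples, input dim, features, classes
X = rng.normal(size=(n, d))
labels = rng.integers(0, C, size=n)
Y = np.eye(C)[labels]                     # one-hot regression targets

W = rng.normal(size=(N, d)) / np.sqrt(d)  # frozen random first layer (assumed scale)
Phi = np.maximum(X @ W.T, 0.0)            # ReLU features, n x N

lam = 1e-3                                # ridge parameter; tuned on validation in practice
B = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ Y)
train_acc = np.mean((Phi @ B).argmax(axis=1) == labels)
```

With $N > n$ and small $\lambda$, the classifier nearly interpolates the one-hot targets, mirroring the overparameterized regime discussed above.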
Figure 2(a) shows the test error of the random features classifier, for a fixed number of random features and a varying number of train samples. We see that under-regularized models are non-monotonic, but optimal regularization is monotonic in samples. Moreover, the optimal ridge parameter appears to be constant for all $n$, similar to our results from the isotropic setting in Theorem 2.
5.2 Modelsize Monotonicity
Here we empirically show that optimal regularization can mitigate model-wise double descent.
Random ReLU Features. We consider the same experimental setup as in Section 5.1, but now fix the number of samples $n$, and vary the number of random features $N$. This corresponds to varying the width of the corresponding 2-layer neural network.
Figure 2(b) shows the test error of the random features classifier, for a fixed number of train samples and a varying number of random features. We see that under-regularized models undergo model-wise double descent, but optimal regularization prevents double descent.
Convolutional Neural Networks. We follow the experimental setup of Nakkiran et al. (2020) for modelwise double descent, and add varying amounts of regularization (weight decay). We chose the following setting from Nakkiran et al. (2020), because it exhibits double descent even with no added label noise.
We consider the same family of 5-layer convolutional neural networks (CNNs) from Nakkiran et al. (2020), consisting of 4 convolutional layers whose widths scale with a width parameter that we vary. This family of CNNs was introduced by Page (2018). We train and test on CIFAR-100 (Krizhevsky and Hinton, 2009), an image classification problem with 100 classes. Inputs are normalized, and we use standard data augmentation of random horizontal flip and random crop with 4-pixel padding. All models are trained using Stochastic Gradient Descent (SGD) on the cross-entropy loss, with a decaying step size schedule, for a fixed budget of gradient steps, and with weight decay of varying strength $\lambda$. Due to optimization instabilities for large $\lambda$, we use the model with the minimum train loss among the last 5K gradient steps.
Figure 4 shows the test error of these models on CIFAR-100. Although unregularized and under-regularized models exhibit double descent, the test error of optimally-regularized models is largely monotonic. Note that the optimal regularization varies with the model size: no single regularization value is optimal for all models.
6 Towards Monotonicity with General Covariates
Here we investigate whether monotonicity provably holds in more general models, inspired by the experimental results. As a first step, we consider Gaussian (but not isotropic) covariates and homoscedastic noise. That is, we consider ridge regression in the setting of Section 2, but with $x \sim \mathcal{N}(0, \Sigma)$ and $\eta \sim \mathcal{N}(0, \sigma^2)$. In this section, we observe that ridge regression can be made sample-monotonic with a modified regularizer. We also conjecture that ridge regression is sample-monotonic without modifying the regularizer, and we outline a potential proof strategy along with numerical evidence.
6.1 Adaptive Regularization
The results on isotropic regression in Section 2 imply that ridge regression can be made sample-monotonic even for non-isotropic covariates, if an appropriate regularizer is applied. Specifically, the appropriate regularizer depends on the covariance $\Sigma$ of the inputs: for $x \sim \mathcal{N}(0, \Sigma)$, the following estimator is sample-monotonic for optimally-tuned $\lambda$:
$$\hat\beta_\lambda(X, y) := \arg\min_{\beta} \; \|X\beta - y\|_2^2 + \lambda \, \beta^\top \Sigma \beta. \qquad (20)$$
This follows directly from Theorem 2 by applying a change of variables; full details of this equivalence are in Section A.3. Note that if the population covariance $\Sigma$ is not known, it can potentially be estimated from unlabeled data.
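One concrete reading of this covariance-dependent estimator is ridge regression with penalty $\lambda \, \beta^\top \Sigma \beta$, which reduces to standard ridge after whitening $x \mapsto \Sigma^{-1/2} x$. The sketch below checks that equivalence numerically (this specific penalty form is our reading of the construction, not verbatim from the text):

```python
import numpy as np

def adaptive_ridge(X, y, Sigma, lam):
    """Minimize ||X b - y||^2 + lam * b^T Sigma b; closed form (X^T X + lam Sigma)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * Sigma, X.T @ y)

# Check the change of variables: whitening z = Sigma^{-1/2} x reduces to standard ridge.
rng = np.random.default_rng(5)
d, n, lam = 5, 30, 2.0
evals = rng.uniform(0.5, 3.0, size=d)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = Q @ np.diag(evals) @ Q.T                 # a generic covariance
Sig_half = Q @ np.diag(np.sqrt(evals)) @ Q.T     # Sigma^{1/2}

X = rng.normal(size=(n, d)) @ Sig_half           # rows x ~ N(0, Sigma)
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

b_direct = adaptive_ridge(X, y, Sigma, lam)
Z = X @ np.linalg.inv(Sig_half)                  # whitened covariates
b_tilde = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
b_whiten = np.linalg.inv(Sig_half) @ b_tilde     # map back: b = Sigma^{-1/2} b_tilde
```

The two solutions agree, which is exactly the change-of-variables argument referenced above.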
6.2 Towards Proving Monotonicity
We conjecture that optimally-regularized ridge regression is sample-monotonic for non-isotropic covariates, even without modifying the regularizer (as suggested by the experiment in Figure 2). We derive a sufficient condition for monotonicity, which we have numerically verified in a variety of instances.
Specifically, we conjecture the following.

Conjecture 6.1. For all $n$, $d$, and all PSD covariances $\Sigma$, consider the distribution on $(x, y)$ where $x \sim \mathcal{N}(0, \Sigma)$ and $y = \langle \beta, x \rangle + \eta$ with $\eta \sim \mathcal{N}(0, \sigma^2)$. Then, we conjecture that the expected test risk of the ridge regression estimator
$$\hat\beta_\lambda(X, y) := (X^\top X + \lambda I_d)^{-1} X^\top y, \qquad (21)$$
for optimally-tuned $\lambda > 0$, is monotone non-increasing in the number of samples $n$.
In order to establish Conjecture 6.1, it is sufficient to prove the following technical conjecture.

Conjecture 6.2. For all $n$, $d$, $\sigma$, and every symmetric positive definite matrix $\Sigma$, the following holds.
Define
where is sampled with each entry i.i.d. . Similarly, define
Then, we conjecture that
(22) 
Proving Conjecture 6.2 presents a number of technical challenges, but we have numerically verified it in a variety of cases. (One can numerically verify the conjecture for fixed $n$, $d$, and $\Sigma$. Here $\Sigma$ can be assumed to be diagonal w.l.o.g., because the distribution of the Gaussian data matrix is rotationally invariant. The matrices and scalars in Equation (22) can be evaluated by sampling the random data matrix. The derivatives w.r.t. $\lambda$ can be computed by auto-differentiation.)
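The verification recipe above can also be sketched without auto-differentiation, by scanning a $\lambda$ grid: for each $n$, estimate the optimal risk by Monte Carlo over the data matrix (using the exact conditional excess risk given $X$, a standard computation), and compare consecutive values of $n$. The sizes, covariance, and tolerances below are ours:

```python
import numpy as np

rng = np.random.default_rng(6)
d, sigma = 2, 1.0
Sigma = np.diag([1.0, 0.25])       # a non-isotropic covariance (our choice)
Sig_half = np.sqrt(Sigma)
beta = np.array([0.6, 0.8])        # ||beta|| = 1

def expected_risk(n, lam, trials=3000):
    """Monte Carlo over X of the exact conditional excess risk
    E_eta[(bhat - beta)^T Sigma (bhat - beta)] for ridge with parameter lam."""
    total = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d)) @ Sig_half        # rows x ~ N(0, Sigma)
        M = np.linalg.inv(X.T @ X + lam * np.eye(d))
        total += sigma**2 * np.trace(X @ M @ Sigma @ M @ X.T) \
                 + lam**2 * beta @ M @ Sigma @ M @ beta
    return total / trials

grid = np.geomspace(0.2, 20.0, 15)
opt_risk = {n: min(expected_risk(n, lam) for lam in grid) for n in (3, 4)}
monotone = opt_risk[4] <= opt_risk[3] + 0.02   # small slack for Monte Carlo error
```

This only spot-checks the main conjecture at particular $(n, d, \Sigma)$; it is evidence, not a proof.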
7 Discussion and Conclusion
In this work, we study the double descent phenomenon in the context of optimal regularization. We show that, while unregularized or underregularized models often have nonmonotonic behavior, appropriate regularization can eliminate this effect.
Theoretically, we prove that for certain linear regression models with isotropic covariates, optimally-tuned regularization achieves monotonic test performance as we grow either the sample size or the model size. These are the first non-asymptotic monotonicity results we are aware of in linear regression. We also demonstrate empirically that optimally-tuned regularization can mitigate double descent for more general models, including neural networks. We hope that our results can motivate future work on mitigating the double descent phenomenon, and allow us to train high-performance models which do not exhibit unexpected non-monotonic behaviors.
Open Questions. Our work suggests a number of natural open questions. First, it is open to prove (or disprove) that optimal ridge regression is sample-monotonic for non-isotropic Gaussian covariates (Conjecture 6.1). We conjecture that it is, and outline a potential route to proving this (via Conjecture 6.2). The non-isotropic setting presents a number of differences from the isotropic one (e.g., the optimal regularizer can depend on the number of samples $n$), and thus a proof of this may yield further insight into mechanisms of monotonicity.
Second, more broadly, it is open to prove sample-wise or model-wise monotonicity for more general (nonlinear) models with appropriate regularizers. Addressing the monotonicity of nonlinear models may require us to design new regularizers which improve generalization when the model size is close to the sample size. It is possible that data-dependent regularizers (which depend on certain statistics of the labeled or unlabeled data) can be used to induce sample monotonicity, analogous to the approach in Section 6.1 for linear models. Recent work has introduced data-dependent regularizers for deep models with improved generalization upper bounds (Wei and Ma, 2019a, b); however, a precise characterization of the test risk remains elusive.
Finally, it is open to understand why large neural networks in practice are often sample-monotonic in realistic regimes of sample sizes, even without careful choice of regularization.
Acknowledgements
Work supported in part by the Simons Investigator Awards of Boaz Barak and Madhu Sudan, and NSF Awards under grants CCF 1715187, CCF 1565264 and CNS 1618026. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, and the NSF Awards CCF-1703574 and CCF-1740551.
The numerical experiments were supported in part by Google Cloud research credits, and a gift from Oracle. The work is also partially supported by SDSI and SAIL at Stanford.
Appendix A Appendix
In Sections A.1 and A.2 we provide the proofs for sample-wise and model-size monotonicity. In Section A.4 we include additional and omitted plots.
A.1 Sample Monotonicity Proofs
Next we prove Lemma 2.
Proof of Lemma 2.
First, we determine the optimal ridge parameter. Using Lemma 2, we have
$$\frac{\partial}{\partial \lambda} \, \mathrm{Risk}_n(\hat\beta_\lambda) = \mathop{\mathbb{E}}_{s \sim \mathcal{S}_n}\left[\sum_{i=1}^{d} \frac{2 s_i^2 \big(\lambda \|\beta\|^2 / d - \sigma^2\big)}{(s_i^2 + \lambda)^3}\right].$$
Thus, the derivative vanishes (and the risk is minimized) exactly when $\lambda \|\beta\|^2 / d = \sigma^2$, and we conclude that $\lambda^*_n = d\sigma^2 / \|\beta\|^2$. Substituting $\lambda^*_n$ into Lemma 2 yields Equation (6).
Proof of Theorem 2.
We follow a similar proof strategy as in Theorem 2: we invoke singular value interlacing for the data matrix when adding a single sample. We then apply Lemma 2 to argue that the test risk varies monotonically with the singular values.
We have
$$\mathrm{Risk}_n(\hat\beta_\lambda) = \sigma^2 + \mathop{\mathbb{E}}_{s \sim \mathcal{S}_n}\left[\sum_{i=1}^{d} \frac{\sigma^2 s_i^2 + \lambda^2 \|\beta\|^2 / d}{(s_i^2 + \lambda)^2}\right],$$
and we compute how each term in the sum varies with $s_i^2$. Writing $u = s_i^2$ and $r = \|\beta\|$:
$$\frac{\partial}{\partial u}\left[\frac{\sigma^2 u + \lambda^2 r^2 / d}{(u + \lambda)^2}\right] = \frac{\sigma^2 (\lambda - u) - 2\lambda^2 r^2 / d}{(u + \lambda)^3}.$$
For $\lambda \ge \lambda^* = d\sigma^2 / r^2$ we have $\lambda^2 r^2 / d \ge \lambda \sigma^2$, hence the numerator is at most $\sigma^2 \lambda - \sigma^2 u - 2\lambda \sigma^2 = -\sigma^2 (u + \lambda) < 0$. Thus we have
$$\frac{\partial}{\partial s_i^2} \, \mathrm{Risk}_n(\hat\beta_\lambda) \le 0 \quad \text{for all } i, \text{ whenever } \lambda \ge \lambda^*. \qquad (25)$$
By the coupling argument in the proof of Theorem 2, this implies that the test risk is monotonic:
$$\mathrm{Risk}_{n+1}(\hat\beta_\lambda) \le \mathrm{Risk}_n(\hat\beta_\lambda).$$
A.2 Projection Model Proofs
For all $n$ and $p \le D$, let $U \in \mathbb{R}^{n \times D}$ be a matrix with i.i.d. $\mathcal{N}(0, 1)$ entries. Let $P \in \mathbb{R}^{p \times D}$ be a random orthonormal matrix. Define $\tilde{X} := U P^\top$ and the effective noise level $\tilde\sigma_p^2 := \sigma^2 + \frac{D - p}{D} \|\beta\|^2$.
Let $s_1 \ge s_2 \ge \cdots$ be the singular values of the data matrix $\tilde{X}$ (with $s_i = 0$ for $i > \min(n, p)$). Let $\mathcal{S}_{n, p}$ be the distribution of the singular values $s = (s_1, \ldots, s_p)$.
Then, the expected test risk is
$$\mathrm{Risk}_{n, p}(\hat\beta_\lambda) = \tilde\sigma_p^2 + \mathop{\mathbb{E}}_{s \sim \mathcal{S}_{n, p}}\left[\sum_{i=1}^{p} \frac{\tilde\sigma_p^2 s_i^2 + \lambda^2 \|\beta\|^2 / D}{(s_i^2 + \lambda)^2}\right].$$