A model of double descent for high-dimensional binary linear classification



We consider a model for logistic regression where only a subset of features of size is used for training a linear classifier over training samples. The classifier is obtained by running gradient-descent (GD) on the logistic-loss. For this model, we investigate the dependence of the generalization error on the overparameterization ratio . First, building on known deterministic results on convergence properties of the GD, we uncover a phase-transition phenomenon for the case of Gaussian regressors: the generalization error of GD is the same as that of the maximum-likelihood (ML) solution when , and that of the max-margin (SVM) solution when . Next, using the convex Gaussian min-max theorem (CGMT), we sharply characterize the performance of both the ML and SVM solutions. Combining these results, we obtain curves that explicitly characterize the generalization error of GD for varying values of . The numerical results validate the theoretical predictions and unveil “double-descent” phenomena that complement similar recent observations in linear regression settings.


Zeyu Deng, Abla Kammoun, Christos Thrampoulidis. Zeyu Deng and Christos Thrampoulidis are with the Electrical and Computer Engineering Department at the University of California, Santa Barbara, USA. Abla Kammoun is with the Electrical Engineering Department at King Abdullah University of Science and Technology, Saudi Arabia.

1 Introduction

Motivation.  Modern learning architectures are chosen such that the number of training parameters far exceeds the size of the training set. Despite their increased complexity, such over-parametrized architectures are known to generalize well in practice. In principle, this contradicts the conventional statistical wisdom of the so-called approximation-generalization tradeoff. The latter suggests a U-shaped curve for the generalization error as a function of the number of parameters, over which the error initially decreases, but increases after some point due to overfitting. In contrast, several authors uncover a peculiar W-shaped curve for the generalization error of neural networks as a function of the number of parameters. After the “classical” U-shaped curve, it is seen that a second approximation-generalization tradeoff (hence, a second U-shaped curve) appears for a large enough number of parameters. The authors of [BMM18, BHMM18, BHM18] coin this the “double-descent” risk curve.

Recent efforts towards theoretically understanding the phenomenon of double-descent focus on linear regression with Gaussian features [HMRT19, MVS19, BHX19, BLLT19]; see also [BRT18, XH19, MM19] for related efforts. These works investigate how the generalization error of gradient descent (GD) on square-loss depends on the overparameterization ratio, i.e., the number of features divided by the size of the training set. On one hand, in the underparameterized regime (ratio below one), GD iterations converge to the least-squares solution for which the generalization performance is well-known [HMRT19]. On the other hand, in the overparameterized regime (ratio above one), GD iterations converge to the min-norm solution for which the generalization performance is sharply characterized in [HMRT19, BHX19] using random matrix theory (RMT). Using these sharp asymptotics, these papers identify regimes (in terms of model parameters such as SNR) for which a double-descent curve appears.

Figure 1: Excess risk as a function of the overparameterization ratio for binary logistic regression under a polynomial feature selection rule. Our asymptotic predictions (theory) match the simulation results for gradient descent (GD) on logistic loss and for hard-margin support-vector machines (SVM). The theory also predicts a sharp phase-transition phenomenon: data is linearly separable (equivalently, training error is zero) with high probability iff the overparameterization ratio exceeds a threshold. The threshold is depicted by dashed lines (different for different values of the SNR parameter, but almost indistinguishable due to scaling of the figure). Observe that the risk curve experiences a “double-descent”. As such, it has two local minima: one in the underparameterized and one in the overparameterized regime. Furthermore, the global minimizer of the excess risk is achieved in the latter regime in which data are linearly separable. The minimizer corresponds to the optimal number of features that should be chosen during training for the model under consideration. Please refer to Section 4 for further details on the chosen simulation parameters.

Contributions. This paper investigates the dependence of generalization error on the overparameterization ratio in binary linear classification with Gaussian regressors. In short, we obtain results that parallel previous studies for linear regression [HMRT19, BHX19]. In more detail, we propose studying gradient descent on logistic loss for two simple, yet popular, models: the logistic model and the Gaussian mixtures (GM) model. Known results establish that GD iterations converge to either the support-vector machines (SVM) solution or the logistic maximum-likelihood (ML) solution, depending on whether the training data is separable or not. For the proposed learning model, we compute a phase-transition threshold and show that when problem dimensions are large, then data are separable if and only if the overparameterization ratio exceeds this threshold. Connecting the two, we redefine our task to that of studying the performance of ML and SVM. In contrast to linear regression, where the corresponding task can be accomplished using RMT, here we employ machinery based on Gaussian process inequalities. In particular, we obtain sharp asymptotics using the framework of the convex Gaussian min-max theorem (CGMT) [Sto13, TOH15, TAH18]. Finally, we corroborate our theoretical findings with numerical simulations and build on the former to identify regimes where double-descent occurs. Figure 1 contains a pictorial preview of our results (see Footnote 1).

Footnote 1: The y-axis represents excess risk. Specifically, we show curves for the excess risk defined as the difference of the absolute expected risk minus the risk of the best linear classifier (see Eqn. (5)). Naturally, both are decreasing functions of the SNR parameter. However, the risk of the best linear classifier is decreasing faster than the absolute risk. This explains why the value of the excess risk is larger for larger values of the signal strength in Figure 1. In particular, the risk of the best linear classifier is 0.133, 0.098 and 0.084 for SNR parameter values 5, 10 and 25, respectively. Contrast this to the values of the absolute risk 0.434, 0.429 and 0.428 at (say) .

Other related works. Our results on the performance of logistic ML and SVM fit in the rapidly growing recent literature on sharp asymptotics of (possibly non-smooth) convex optimization based estimators; [DMM11, BM12, Sto13, TOH15, DM16, TAH18, EK18, WWM19] and many references therein. Most of these works study linear regression models, for which the CGMT framework has been shown to be powerful and applicable under several variations; see for example [TAH18, CM19] and references therein. In contrast, our results hold for binary classification. For the derivations we utilize the machinery recently put forth in [KA19, TPT19, SAH19], which demonstrate that the CGMT framework can be also applied to classification problems. Closely related ideas were previously introduced in [TAH15, DTL18]. Here, we introduce necessary adjustments to accommodate the needs of the specific data generation model and focus on classification error which has not been studied previously. There are several other works on sharp asymptotics of binary linear classification both for the logistic [CS18, SC19, SAH19] and the GM model [MLC19b, MLC19a, Hua17]. While closely related, these works differ in terms of motivation, model specification, end-results, and proof techniques.

2 Learning model

We study supervised binary classification under two popular data models (e.g., [WR06, Sec. 3.1])

  • a discriminative model: logistic regression.

  • a generative model: mixtures of Gaussians.

Let denote the feature vector and denote the class label.

Logistic model. First, we consider a discriminative approach which models the conditional probability of the label given the features as follows:

where is the unknown weight (or, regressor) vector and

Throughout, we assume IID Gaussian feature vectors. For compactness, let denote a symmetric Bernoulli distribution with probability for the value and probability for the value . We summarize the logistic model with Gaussian features:


Gaussian mixtures (GM) model. A common choice for the generative case is to model the class-conditional densities with Gaussians (e.g., [WR06, Sec. 3.1]). Specifically, each data point belongs to one of two classes with probabilities such that . If it comes from class , then the feature vector is an iid Gaussian vector with mean and the response variable takes the value of the class label :


2.1 Training data: Feature selection

During training, we assume access to data points generated according to either (1) or (2). We allow for the possibility that only a subset of the total number of features is known during training. Concretely, for each feature vector , the learner only knows a sub-vector for a chosen set . We denote the size of the known feature sub-vector as . Onwards, we choose to select the features sequentially in the order in which they appear (see Footnote 2).

Footnote 2: This assumption is without loss of generality for our asymptotic model specified in Section 3.1.

Clearly, . Overall, the training set consists of data pairs:


When clear from context, we simply write

if the training set is generated according to (1) or (2), respectively.
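The data-generation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' simulation code: the names `sample_logistic`, `sample_gm`, `select_features`, `beta`, `mu` and `pi_pos` are hypothetical, the logistic link is assumed to be the standard sigmoid, and the GM class means are assumed to be symmetric (plus or minus the same vector).

```python
import numpy as np

def sample_logistic(n, d, beta, rng):
    """Logistic model (assumed form): x ~ N(0, I_d), P(y=+1 | x) = sigmoid(x^T beta)."""
    X = rng.standard_normal((n, d))
    prob_pos = 1.0 / (1.0 + np.exp(-X @ beta))
    y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)
    return X, y

def sample_gm(n, d, mu, pi_pos, rng):
    """Gaussian mixture (assumed form): y=+1 w.p. pi_pos, else -1; x | y ~ N(y*mu, I_d)."""
    y = np.where(rng.random(n) < pi_pos, 1.0, -1.0)
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    return X, y

def select_features(X, p):
    """Keep only the first p of the d features, in the order in which they appear."""
    return X[:, :p]

rng = np.random.default_rng(0)
n, d, p = 200, 400, 100                    # sample size, ambient dimension, known features
beta = rng.standard_normal(d)
beta *= 5.0 / np.linalg.norm(beta)         # fix the overall signal strength (illustrative)
X, y = sample_logistic(n, d, beta, rng)
W = select_features(X, p)                  # training features seen by the learner
```

The learner only ever sees `W` and `y`; the discarded columns of `X` act as unobserved noise in the labels.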

2.2 Classification rule

Having access to the training set , the learner obtains an estimate of the weight vector . Then, for a newly generated sample (and ), she forms a linear classifier and decides a label for the new sample as:

The estimate minimizes the empirical risk

for a certain loss function . A traditional way to minimize is by constructing gradient descent (GD) iterates via

for step-size and arbitrary . We run GD until convergence and set

In this paper, we focus on empirical logistic risk
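As an illustration of the training procedure, here is a minimal sketch of GD on the empirical logistic risk, assuming the standard logistic loss log(1 + exp(-t)) averaged over the training samples. The function names, the default step size and the stopping rule are choices made for this sketch. On separable data the iterates grow without bound, so the iteration cap matters there.

```python
import numpy as np

def logistic_risk(w, W, y):
    """Empirical logistic risk: mean over i of log(1 + exp(-y_i * <w_i, w>))."""
    margins = y * (W @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_grad(w, W, y):
    """Gradient of the empirical logistic risk."""
    margins = y * (W @ w)
    # derivative of log(1 + exp(-t)) is -1 / (1 + exp(t)), written stably via tanh
    coeff = -0.5 * (1.0 - np.tanh(0.5 * margins))
    return (W.T @ (coeff * y)) / len(y)

def gradient_descent(W, y, step=None, max_iter=5000, tol=1e-8):
    """Plain GD from w = 0 with a constant step size."""
    n, p = W.shape
    if step is None:
        # the risk is L-smooth with L <= ||W||_2^2 / (4 n); use step = 1 / L
        step = 4.0 * n / np.linalg.norm(W, 2) ** 2
    w = np.zeros(p)
    for _ in range(max_iter):
        g = logistic_grad(w, W, y)
        if np.linalg.norm(g) < tol:
            break
        w -= step * g
    return w

# Small non-separable example (many more samples than features).
rng = np.random.default_rng(0)
n, p = 500, 20
beta = np.ones(p) / np.sqrt(p)
W = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-W @ beta)), 1.0, -1.0)
w_hat = gradient_descent(W, y)
```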

Generalization error. For a new sample we measure generalization error of by the expected risk


The expectation here is over the data generation model (1) or (2). In particular, here we consider the excess risk


where is the risk of the best linear classifier (one that assumes knowledge of the entire feature vector and of ), i.e., . Also of interest is the cosine similarity between the estimate and :


Training error. The training error of is given by
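The quantities of this subsection can be estimated empirically as sketched below. The sketch assumes that the training (or test) error is the fraction of misclassified samples and that the cosine similarity compares the estimate, zero-padded on the unknown features, with the full true vector. The names `classification_error` and `cosine_similarity` are hypothetical.

```python
import numpy as np

def classification_error(w, W, y):
    """Fraction of samples whose predicted sign differs from the label.
    Ties (zero score) are assigned to the +1 class."""
    y_hat = np.where(W @ w >= 0, 1.0, -1.0)
    return float(np.mean(y_hat != y))

def cosine_similarity(w_hat, beta):
    """Cosine of the angle between the zero-padded estimate and the true vector."""
    w_full = np.zeros_like(beta, dtype=float)
    w_full[: len(w_hat)] = w_hat
    denom = np.linalg.norm(w_full) * np.linalg.norm(beta)
    return float(w_full @ beta / denom) if denom > 0 else 0.0
```

Applying `classification_error` to the training set gives the training error; applying it to a large, freshly generated sample gives a Monte Carlo estimate of the expected risk.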

2.3 Convergence behavior of GD iterates

Recent literature studies the behavior of GD iterates for logistic loss. There are two regimes of interest: (i) when data are such that is strongly convex then standard tools show that converges to the unique bounded optimum of ; (ii) when data are separable then the normalized iterates converge to the max-margin solution [JT19, SHN18, Tel13]. These properties guide our approach towards studying the generalization error of GD for logistic loss. Instead of the latter, we study the performance of the max-margin classifier and of the minimum of the empirical logistic risk. We formalize these ideas next.

2.3.1 Separable data

The training set is separable iff there exists a linear classifier achieving zero training error, i.e., When data are separable, for it holds where


i.e., the solution to hard-margin SVM.
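For small problems, the hard-margin SVM solution can be computed without a dedicated solver. The sketch below assumes the standard formulation without an intercept (minimize the Euclidean norm of w subject to every margin being at least one) and solves its dual by projected coordinate ascent. This is a generic technique chosen for illustration, not the solver used for the paper's experiments. The function name and tolerances are hypothetical.

```python
import numpy as np

def hard_margin_svm(W, y, n_sweeps=5000, tol=1e-8):
    """Dual coordinate ascent for: min (1/2)||w||^2 s.t. y_i * <w_i, w> >= 1.
    Assumes the data are linearly separable; otherwise the iterates diverge."""
    n, p = W.shape
    Z = y[:, None] * W                        # rows z_i = y_i * w_i
    sq_norms = np.einsum("ij,ij->i", Z, Z)
    alpha = np.zeros(n)                       # dual variables, kept nonnegative
    w = np.zeros(p)                           # invariant: w = Z^T alpha
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in range(n):
            # exact maximization of the dual along coordinate i, projected onto alpha_i >= 0
            new_alpha = max(0.0, alpha[i] + (1.0 - Z[i] @ w) / sq_norms[i])
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w += delta * Z[i]
                alpha[i] = new_alpha
                max_change = max(max_change, abs(delta) * np.sqrt(sq_norms[i]))
        if max_change < tol:
            break
    return w, alpha

# With more features than samples, Gaussian data are separable for any labels.
rng = np.random.default_rng(0)
n, p = 20, 40
W = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
w_svm, alpha = hard_margin_svm(W, y)
margins = y * (W @ w_svm)
```

At the solution all margins are at least one, and the samples with strictly positive `alpha` are the support vectors.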

2.3.2 Non-separable data

When the separability condition does not hold, then is coercive. Thus, its sub-level sets are compact and GD iterates will converge [JT19] to the minimizer of the empirical loss:


where, . For the data model in (1), is the maximum-likelihood (ML) estimator.

3 Sharp Asymptotics

3.1 Asymptotic setting

Recall the following notation:

  : dimension of the ambient space,
  : training sample size,
  : number of parameters during training (see (3)).

We study a setting in which and are fixed and varies from to . Our asymptotic results hold in a linear asymptotic regime where such that


We fix and derive asymptotic predictions for the generalization error as a function of , which determines the overparametrization ratio in our model. To quantify the effect of on the generalization error, we decompose the feature vector to its known part and to its unknown part :

Then, we let (resp., ) denote the vector of weight coefficients corresponding to the known (resp., unknown) features such that

In this notation, we study a sequence of problems of increasing dimensions as in (9) that further satisfy:


The parameters and can be thought of as the useful signal strength and the noise strength, respectively. Our notation specifies that (hence also, ) is a function of . We are interested in functions that are increasing in such that the signal strength increases as more features enter into the training model; see Sec. 4.1 for explicit parameterizations.

3.2 Notation

We reserve the following notation for random variables and


for and as defined in (10). All expectations and probabilities are with respect to the randomness of . Also, let . For a sequence of random variables that converges in probability to a constant in the limit of (9), we simply write . Finally, we denote the proximal operator of the logistic-loss as follows,
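For numerical work, the proximal operator of the logistic loss has no closed form but is easy to evaluate. The sketch below assumes the usual definition, namely the minimizer over v of (v - x)^2 / (2*lam) + log(1 + exp(-v)); the argument names `x` and `lam` are hypothetical. The stationarity condition v - x - lam / (1 + exp(v)) = 0 has an increasing left-hand side with a unique root in [x, x + lam], so bisection is sufficient.

```python
import numpy as np

def logistic_loss(v):
    """Logistic loss log(1 + exp(-v)), evaluated stably."""
    return np.logaddexp(0.0, -v)

def prox_logistic(x, lam, n_iter=80):
    """Assumed definition: argmin_v (v - x)^2 / (2*lam) + log(1 + exp(-v)).
    Solved by bisection on the stationarity condition; works elementwise."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.copy(), x + lam
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        # residual g(v) = v - x - lam / (1 + exp(v)); 1/(1+exp(v)) = (1 - tanh(v/2)) / 2
        g = mid - x - lam * 0.5 * (1.0 - np.tanh(0.5 * mid))
        lo = np.where(g < 0, mid, lo)
        hi = np.where(g < 0, hi, mid)
    return 0.5 * (lo + hi)
```

Because the routine is vectorized, the Gaussian expectations that appear in the fixed-point equations can be approximated by applying it to an array of samples or quadrature nodes.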

3.3 Regimes of learning: Phase-transition

As discussed in Section 2, the behavior of GD changes depending on whether the training data is separable or not. Under the Gaussian feature model, the following proposition establishes a sharp phase-transition characterizing the separability of the training data. The proposition is an extension of the phase-transition result by Candes and Sur [CS18] for the (noiseless) logistic model. We extend their result to the noisy setting (to accommodate the feature selection model in Section 2.1) as well as to the Gaussian mixtures model (see Footnote 4).

Footnote 4: Our analysis approach is also slightly different from that of [CS18]. Specifically, we note that data is linearly separable iff the solution to hard-margin SVM (7) is bounded. Then, using the CGMT we are able to show that the latter happens with probability one iff the overparameterization ratio exceeds the threshold of Proposition 3.1.

Proposition 3.1 (Phase transition).

For a training set that is generated by either of the two models in (1) or (2), consider the event

under which the training data is separable. Recall from (3) that for all , are the (out of ) features that are used for training. Let in (10) be an increasing function of . Recall the notation in (11) and define a random variable depending on the data model as follows


Further define the following threshold function depending on the data generation model:


Let be the unique solution to the equation . Then, the following holds regarding :

Put in words: the training data is separable iff . When this is the case, then the training error can be driven to zero and we are in the interpolating regime. In contrast, the training error is non-vanishing for smaller values of .

Remark 1 (The threshold for Gaussian mixture).

Consider the Gaussian mixture model, i.e. . Substituting the value of in (13) and using the fact that and are independent Gaussians the threshold function for the GM model simplifies to:

where and are the density and tail function of the standard normal distribution, respectively.

3.4 High-dimensional asymptotics

Propositions 3.2 and 3.3 characterize the performance of the two optimization-based classifiers in (8) and (7), respectively, under the logistic (cf. (1)) and the GM (cf. (2)) model. When combined with the results of Section 2.3, they characterize the statistical performance of converging points of GD.

3.4.1 Non-separable data

The performance of logistic loss under the logistic model was recently studied in [SC19, SAH19, TPT19]. Here, we follow [TPT19, Thm. III.1]. Specifically, we appropriately modify their proof and result to fit the data model of Section 2 and to obtain a prediction for the classification error (not considered in prior work). Also, we provide extensions to the GM model.

Proposition 3.2 (ML).

Consider a training set that is generated by either of the two models in (1) or (2) and fix . Recall the notation in (11) and define a random variable depending on the data model as in (12). Let be the unique solution to the following system of three nonlinear equations in three unknowns,



Remark 2 (Simplifications for the GM model).

When data is generated according to (2), then . Using this and Gaussian integration by parts, the system of three equations in (14) simplifies to the following:


where the expectations are over a single Gaussian random variable . We numerically solve the system of equations via a fixed-point iteration method suggested in [TAH18] (see also [SAH19, TPT19]). We empirically observe that reformulating the equations as in (15) is critical for the iteration method to converge. Perhaps an interesting observation is that the performance of logistic regression (the same is true for SVM in Section 3.4.2) does not depend on the prior class probabilities for the GM model with isotropic Gaussian features in (2).

3.4.2 Separable data

Here, we characterize the asymptotic performance of hard-margin SVM under both models (1) and (2).

Proposition 3.3 (SVM).

Consider a training set that is generated by either of the two models in (1) or (2) and fix . Recall the notation in (11) and define a random variable depending on the data model as in (12). Further define


Let be the unique solution of the equation

Further let be the unique minimum of for . Then,

Remark 3 (Margin).

In addition to the stated results in Proposition 3.3, our proof further shows that the optimal cost of the hard-margin SVM (7) converges to . In other words, the margin of the classifier converges to .

Remark 4 (Simplifications for the GM model).

Similar to Remark 2 we note that under the GM model, the function in Proposition 3.3 simplifies to:


where, the expectation is over a single Gaussian random variable . Moreover, the formula predicting the risk simplifies to

4 Numerical results

4.1 Feature selection models

Recall that the number of features known at training is determined by . Specifically, enters the formulae predicting the generalization performance via the signal strength . In this section, we specify two explicit models for feature selection and their corresponding functions . Similar models are considered in [BF83, HMRT19, BHX19], albeit in a linear regression setting.

Linear model.  We start with a uniform feature selection model characterized by the following parametrization:


for fixed and . This models a setting where all coefficients of the regressor have equal contribution. Hence, the signal strength increases linearly with the number of features considered in training.

Polynomial model.  In the linear model, adding more features during training results in a linear increase of the signal strength. In contrast, our second model assumes diminishing returns:


for some . As increases, so does the signal strength , but the increase is less significant for larger values of at a rate specified by .
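The two feature selection rules can be written as functions of the overparameterization ratio. Since the displayed formulas are not reproduced in this text, the parameterizations below are assumptions chosen to match the verbal description: a signal strength that grows linearly and reaches its maximum when all features are used, and one with diminishing returns controlled by an exponent. The names `kappa`, `zeta`, `r` and `gamma` are hypothetical, and the noise strength is assumed to be the remainder of the total strength `r**2`.

```python
import numpy as np

def linear_signal(kappa, zeta, r):
    """Assumed linear rule: s^2 = r^2 * kappa / zeta, for kappa in (0, zeta]."""
    s2 = r**2 * np.asarray(kappa, dtype=float) / zeta
    return s2, r**2 - s2          # (signal strength, noise strength)

def polynomial_signal(kappa, r, gamma):
    """Assumed diminishing-returns rule: s^2 = r^2 * (1 - (1 + kappa)^(-gamma))."""
    s2 = r**2 * (1.0 - (1.0 + np.asarray(kappa, dtype=float)) ** (-gamma))
    return s2, r**2 - s2          # (signal strength, noise strength)
```

Under the linear rule each additional feature adds the same amount of signal, whereas under the polynomial rule the marginal gain shrinks as more features are included.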

Figure 2: Plots of the cosine similarity (left) and of the excess risk (right) as a function of for the polynomial feature selection model (cf. (19)) under logistic data. The curves shown are for and three values of and .
Figure 3: Plots of the cosine similarity (left) and of the excess risk (right) as a function of for the linear feature selection model (cf. (18)) under logistic data. The curves shown are for and three values of and .
Figure 4: Plots of the (absolute) risk (bottom) and of the cosine similarity (top) as a function of for the linear feature selection model (cf. (18)) under data from a Gaussian mixture. The curves show and three signal strength values and . Please refer to the legend of Figure 1 and to Section 4.2 for further explanations.

4.2 Risk curves

Figure 1 assumes the polynomial feature model (19) for and three values of . The crosses are simulation results obtained by running GD on synthetic data generated according to (1) and (19). Specifically, the depicted values are averages calculated over Monte Carlo realizations for . For each realization, we ran GD on logistic loss (see Sec. 2.2) with a fixed step size until convergence and recorded the resulting risk. Similarly, the squares are simulation results obtained by solving SVM (7) over the same number of different realizations and averaging over the recorded performance. As expected by [JT19] (also Sec. 2.3), the performance of GD matches that of SVM when data are separable. Also, as predicted by Proposition 3.1, the data is separable with high probability when the overparameterization ratio exceeds the threshold (the threshold value is depicted with dashed vertical lines). This is verified by noticing the dotted scatter points, which record (averages of) the training error. Recall from Section 2.2 that the training error is zero if and only if the data are separable. Finally, the solid lines depict the theoretical predictions of Proposition 3.2 (below the threshold) and Proposition 3.3 (above the threshold). The results of Figure 1 validate the accuracy of our predictions: our asymptotic formulae predict the classification performance of GD on logistic loss for all values of the overparameterization ratio. Note that the curves correspond to excess risk defined in (5); see also related Footnote 1. Corresponding results for the cosine similarity are presented in Figure 6 in the Appendix.

Under the logistic model (1), Figures 3 and 2 depict the cosine similarity and excess risk curves of GD as a function of for the linear and the polynomial model, respectively. Compared to Figure 1, we only show the theoretical predictions. Note that the linear model is determined by parameters and the polynomial model by . Once these values are fixed, we compute the threshold value . Then, we apply Proposition 3.2 (below the threshold) and Proposition 3.3 (above the threshold).

Finally, Figure 4 shows risk and cosine-similarity curves for the Gaussian mixture data model (2) with linear feature selection rule. The figures compare simulation results to theoretical predictions similar in nature to Figure 1, thus validating the accuracy of the latter for the GM model. For the simulations, we generate data according to (2) with , and . The results are averages over Monte Carlo realizations.

4.3 Discussion on double-descent

Here, we discuss the double-descent phenomenon that appears in Figures 1 and 2. First, focus on the underparameterized regime (cf. area on the left of the vertical dashed lines). Here, the number of training examples is large compared to the size of the unknown weight vector , which helps learning . However, since only a (small) fraction of the total number of features are used for training, the observations are rather noisy. This tradeoff manifests itself with a “U-shaped curve” for . Such a tradeoff is also present in the overparameterized regime (cf. area on the right of the vertical dashed lines) leading to a second “U-shaped curve”. The larger the value of the more accurate the model, but also the larger the number of unknown parameters to be estimated. As seen in Figure 2, the “U-shape” is more pronounced for larger values of . Overall, the risk curve has two local minima corresponding to the two regimes of learning. Interestingly, the global minimum appears in the overparameterized regime for all values of and depicted here. The value of determines the optimal number of features that need to be selected during training to minimize the classification error. Note that for the training error is zero, yet the generalization performance is best.

The risk curves for the linear feature selection model in Figure 3 are somewhat different compared to the polynomial model. In the underparameterized regime, we observe a (very shallow) “U-shaped” curve, but the risk is monotonically decreasing for . For the values of signal strength considered here the global minimum is always attained at , i.e., when all features are used for training.

5 Future work

This paper studies the generalization performance of GD on logistic loss under the logistic and GM models with isotropic Gaussian features. In future work, we plan on extending our study to GD on square loss, which is also commonly used in practice. Extensions of the results to multiple classes and non-isotropic features are also of interest.


Appendix A Proofs

The main technical tool that facilitates the analysis of (8) and (7) is the convex Gaussian min-max theorem (CGMT) [Sto13, TOH15], which is an extension of Gordon’s Gaussian min-max inequality (GMT) [Gor88]. More precisely, we utilize the machinery recently put forth in [SAH19, TPT19, KA19], which demonstrate that the CGMT framework can be also applied to classification problems.

Organization. We introduce the necessary background on the CGMT in Section A.1. Then, based on this machinery we study the performance of SVM (cf. (7)) for the logistic model in Section A.4.2. Note that the optimization in (7) is feasible if and only if the training data is linearly separable. Thus, by studying (7) we are also able to show the phase-transition result of Proposition 3.1 in Section A.4.1. In Section A.5 we briefly explain how to adapt the result of [TPT19, Thm. III.1] for the purpose of Proposition 3.2, again for the logistic model. It turns out that, albeit different, both data models in (1) and (2) naturally fit under the same analysis framework. Specifically, the formulae for the GM model can be proved with identical arguments. The proof sketch for the SVM performance under the GM model in Section A.6 aims to showcase this.

A.1 Technical tool: CGMT

A.1.1 Gordon’s Min-Max Theorem (GMT)

Gordon’s Gaussian comparison inequality [Gor88] compares the min-max value of two doubly indexed Gaussian processes based on how their autocorrelation functions compare. The inequality is quite general (see [Gor88]), but for our purposes we only need its application to the following two Gaussian processes:


where: , , , all of which have iid Gaussian entries; the sets and are compact; and, . For these two processes, define the following (random) min-max optimization programs, which (following [TAH18]) we refer to as the primary optimization (PO) problem and the auxiliary optimization (AO) problem, for purposes that will soon become clear.
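For reference, the standard form of these objects in the CGMT literature (e.g., [TAH18]) is sketched below. The symbol names are assumptions introduced for this sketch: a matrix G and vectors g, h with iid standard Gaussian entries, compact sets S_w and S_u, and a continuous function psi. The sign in front of the h-term varies across references; it is immaterial because h and -h have the same distribution.

```latex
\begin{align*}
X_{w,u} &:= u^\top G w + \psi(w,u), \\
Y_{w,u} &:= \|w\|_2\, g^\top u + \|u\|_2\, h^\top w + \psi(w,u), \\
\Phi(G) &:= \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} X_{w,u}
  \qquad \text{(PO)}, \\
\phi(g,h) &:= \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} Y_{w,u}
  \qquad \text{(AO)}.
\end{align*}
```

The AO involves only two Gaussian vectors instead of a full Gaussian matrix, which is what makes it easier to analyze.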


According to Gordon’s comparison inequality, for any , it holds:


In other words, a high-probability lower bound on the AO is a high-probability lower bound on the PO. The premise is that it is often much simpler to lower bound the AO rather than the PO. To be precise, (22) is a slight reformulation of Gordon’s original result proved in [TOH15] (see therein for details).
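A classical special case gives a feel for the comparison. Assume, for illustration only, that the deterministic part of both processes is zero and that both feasible sets are unit spheres. Then the PO value is the smallest singular value of the Gaussian matrix, while the AO value reduces to the difference of the norms of the two Gaussian vectors. The sketch below computes both; the variable names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 100                       # G is m x n with m > n
G = rng.standard_normal((m, n))
g = rng.standard_normal(m)
h = rng.standard_normal(n)

# PO: min over unit w of max over unit u of u^T G w = smallest singular value of G
po_value = np.linalg.svd(G, compute_uv=False).min()

# AO: min over unit w of max over unit u of (||w|| g^T u + ||u|| h^T w) = ||g|| - ||h||
ao_value = np.linalg.norm(g) - np.linalg.norm(h)

# Both values concentrate near sqrt(m) - sqrt(n) for large dimensions
reference = np.sqrt(m) - np.sqrt(n)
print(po_value, ao_value, reference)
```

The spheres are compact but not convex, so only the one-sided GMT bound applies in this example; the two-sided statement requires the convexity assumptions of the CGMT.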

A.1.2 Convex Gaussian Min-Max Theorem (CGMT)

The asymptotic expressions of this paper build on the CGMT [TOH15]. For ease of reference we summarize here the essential ideas of the framework following the presentation in [TAH18]; please see [TAH18, Section 6] for the formal statement of the theorem and further details. The CGMT is an extension of the GMT and it asserts that the AO in (21b) can be used to tightly infer properties of the original PO in (21a), including the optimal cost and the optimal solution. According to the CGMT [TAH18, Theorem 6.1], if the sets and are convex and is continuous convex-concave on , then, for any and , it holds


In words, concentration of the optimal cost of the AO problem around implies concentration of the optimal cost of the corresponding PO problem around the same value . Moreover, starting from (23) and under strict convexity conditions, the CGMT shows that concentration of the optimal solution of the AO problem implies concentration of the optimal solution of the PO to the same value. For example, if minimizers of (21b) satisfy for some , then, the same holds true for the minimizers of (21a): [TAH18, Theorem 6.1(iii)]. Thus, one can analyze the AO to infer corresponding properties of the PO, the premise being of course that the former is simpler to handle than the latter.

A.2 Hard-margin SVM: Identifying the PO and AO problems

The max-margin solution is obtained by solving the following problem:


and is feasible only when the training data is separable. In that case, (24) is equivalent to solving:


Based on the rotational invariance of the Gaussian measure, we may assume without loss of generality that

where only the first coordinate is nonzero. For convenience, we also write

In this new notation,


Further using this notation and considering the change of variable , (25) becomes


Letting , and , and writing

leads to the following optimization problem


which has the same form as a primary optimization (PO) problem as required by the CGMT [TOH15], with the single exception that the feasibility sets are not compact. To resolve this issue, we pursue the approach developed in [KA19] for the analysis of the hard-margin SVM. We write (28) as follows,




We have thus identified a sequence of PO problems indexed by , each of which satisfies the compactness conditions on the feasibility sets, as required by the CGMT [TOH15]. We associate each one of them with an auxiliary optimization (AO) problem that can be written as:


where the random vectors and , and represents the vector whose elements are all 1.

Having identified the AO problems, the next step is to simplify them so that each reduces to an optimization over only a few scalar variables. This will facilitate inferring their asymptotic behavior in the subsequent step.

A.3 Simplification of the AO problem

To simplify the AO problem, we start by optimizing over the direction of the optimization variable . In doing so, we obtain:


where we used the fact that . We note that for any , the direction of that minimizes the objective in (32) is given by . Using [KA19, Lemma 8], becomes:

In the above optimization problem, it is easy to see that appears only through its norm. Let . Optimizing over , we obtain the following scalar optimization problem:


It is important to note that the new formulation of the AO problem is obtained from a deterministic analysis that did not involve any asymptotic approximation. However, this new formulation is more suitable for understanding its asymptotic behavior.

A.4 Asymptotic behavior of the AO problem

A.4.1 Data separability (Proof of Proposition 3.1)

Define the sequence of functions

when and . It is easy to see that is jointly convex in its arguments and converges almost surely to:


where, as in (11),

At this point, to see the relevance of (34) to our end result, note how the form of the function resembles (34) after substituting:


Since the convergence of convex functions is uniform over compact sets, for fixed, there exists such that for all and ,


From the above inequality, it follows that if


then for chosen sufficiently small (concretely: smaller for instance than ), there exists a constant independent of such that:


Next, we prove that (37) holds for sufficiently large when the following condition is satisfied:


In the right-hand side of the equation above, recognize the threshold function defined in Proposition 3.1. To show the desired result, we prove that the optimization over can be restricted to a compact set. This is because for ,

As grows unboundedly large as or , we have:

thus proving that the minimum over is bounded. We can thus consider that belongs to a given compact set of . For fixed ,

Furthermore, this convergence is uniform over compact sets due to the convexity of function , hence,

There exists thus , such that for any ,