A model of double descent for high-dimensional binary linear classification
We consider a model for logistic regression where only a subset of features of size $p$ is used for training a linear classifier over $n$ training samples. The classifier is obtained by running gradient descent (GD) on the logistic loss. For this model, we investigate the dependence of the generalization error on the overparameterization ratio $\kappa = p/n$. First, building on known deterministic results on the convergence properties of GD, we uncover a phase-transition phenomenon for the case of Gaussian regressors: the generalization error of GD is the same as that of the maximum-likelihood (ML) solution when $\kappa < \kappa_\star$, and that of the max-margin (SVM) solution when $\kappa > \kappa_\star$. Next, using the convex Gaussian min-max theorem (CGMT), we sharply characterize the performance of both the ML and SVM solutions. Combining these results, we obtain curves that explicitly characterize the generalization error of GD for varying values of $\kappa$. The numerical results validate the theoretical predictions and unveil “double-descent” phenomena that complement similar recent observations in linear regression settings.
Zeyu Deng, Abla Kammoun, Christos Thrampoulidis
Zeyu Deng and Christos Thrampoulidis are with the Electrical and Computer Engineering Department at the University of California, Santa Barbara, USA. Abla Kammoun is with the Electrical Engineering Department at King Abdullah University of Science and Technology, Saudi Arabia.
Motivation. Modern learning architectures are chosen such that the number of training parameters far exceeds the size of the training set. Despite their increased complexity, such over-parametrized architectures are known to generalize well in practice. In principle, this contradicts the conventional statistical wisdom of the so-called approximation-generalization tradeoff. The latter suggests a U-shaped curve for the generalization error as a function of the number of parameters, over which the error initially decreases, but increases after some point due to overfitting. In contrast, several authors uncover a peculiar W-shaped curve for the generalization error of neural networks as a function of the number of parameters. After the “classical” U-shaped curve, it is seen that a second approximation-generalization tradeoff (hence, a second U-shaped curve) appears for a large enough number of parameters. The authors of [BMM18, BHMM18, BHM18] coin this the “double-descent” risk curve.
Recent efforts towards theoretically understanding the phenomenon of double-descent focus on linear regression with Gaussian features [HMRT19, MVS19, BHX19, BLLT19]; see also [BRT18, XH19, MM19] for related efforts. These works investigate how the generalization error of gradient descent (GD) on square-loss depends on the overparameterization ratio $\kappa$, defined as the number of features divided by the size of the training set. On the one hand, when $\kappa < 1$, GD iterations converge to the least-squares solution, for which the generalization performance is well known [HMRT19]. On the other hand, in the overparameterized regime $\kappa > 1$, GD iterations converge to the min-norm solution, for which the generalization performance is sharply characterized in [HMRT19, BHX19] using random matrix theory (RMT). Using these sharp asymptotics, these papers identify regimes (in terms of model parameters such as SNR) for which a double-descent curve appears.
Contributions. This paper investigates the dependence of the generalization error on the overparameterization ratio $\kappa$ in binary linear classification with Gaussian regressors. In short, we obtain results that parallel previous studies for linear regression [HMRT19, BHX19]. In more detail, we propose studying gradient descent on logistic loss for two simple, yet popular, models: the logistic model and the Gaussian mixtures (GM) model. Known results establish that GD iterations converge to either the support-vector machines (SVM) solution or the logistic maximum-likelihood (ML) solution, depending on whether the training data is separable or not. For the proposed learning model, we compute a phase-transition threshold $\kappa_\star$ and show that, when problem dimensions are large, data are separable if and only if $\kappa > \kappa_\star$. Connecting the two, we redefine our task to that of studying the performance of ML and SVM. In contrast to linear regression, where the corresponding task can be accomplished using RMT, here we employ machinery based on Gaussian process inequalities. In particular, we obtain sharp asymptotics using the framework of the convex Gaussian min-max theorem (CGMT) [Sto13, TOH15, TAH18]. Finally, we corroborate our theoretical findings with numerical simulations and build on the former to identify regimes where double-descent occurs. Figure 1 contains a pictorial preview of our results.¹
¹ The y-axis represents excess risk. Specifically, we show curves for the excess risk, defined as the difference between the absolute expected risk and the risk of the best linear classifier (see Eqn. (5)). Naturally both and are decreasing functions of the SNR parameter . However, is decreasing faster than . This explains why the value of the excess risk is smaller for larger values of the signal strength in Figure 1. In particular, it holds 0.133, 0.098 and 0.084 for 5, 10 and 25, respectively. Contrast this to the values of the absolute risk 0.434, 0.429 and 0.428 at (say) .
Other related works. Our results on the performance of logistic ML and SVM fit in the rapidly growing recent literature on sharp asymptotics of (possibly non-smooth) convex optimization based estimators; see [DMM11, BM12, Sto13, TOH15, DM16, TAH18, EK18, WWM19] and many references therein. Most of these works study linear regression models, for which the CGMT framework has been shown to be powerful and applicable under several variations; see for example [TAH18, CM19] and references therein. In contrast, our results hold for binary classification. For the derivations we utilize the machinery recently put forth in [KA19, TPT19, SAH19], which demonstrate that the CGMT framework can also be applied to classification problems. Closely related ideas were previously introduced in [TAH15, DTL18]. Here, we introduce necessary adjustments to accommodate the needs of the specific data generation model and focus on the classification error, which has not been studied previously. There are several other works on sharp asymptotics of binary linear classification both for the logistic [CS18, SC19, SAH19] and the GM model [MLC19b, MLC19a, Hua17]. While closely related, these works differ in terms of motivation, model specification, end-results, and proof techniques.
2 Learning model
We study supervised binary classification under two popular data models (e.g., [WR06, Sec. 3.1]):
a discriminative model: logistic regression.
a generative model: mixtures of Gaussians.
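Let $x \in \mathbb{R}^d$ denote the feature vector and $y \in \{\pm 1\}$ denote the class label.
Logistic model. First, we consider a discriminative approach which models the conditional probability $p(y \mid x)$ as follows:
$$\mathbb{P}(y = 1 \mid x) = f(x^T \eta_0), \qquad \mathbb{P}(y = -1 \mid x) = 1 - f(x^T \eta_0),$$
where $\eta_0 \in \mathbb{R}^d$ is the unknown weight (or, regressor) vector and
$$f(t) := \frac{1}{1 + e^{-t}}.$$
Throughout, we assume IID Gaussian feature vectors $x \sim \mathcal{N}(0, I_d)$. For compactness, let $\mathrm{Rad}(q)$ denote a symmetric Bernoulli distribution with probability $q$ for the value $+1$ and probability $1 - q$ for the value $-1$. We summarize the logistic model with Gaussian features:
$$x \sim \mathcal{N}(0, I_d), \qquad y \sim \mathrm{Rad}\big(f(x^T \eta_0)\big). \tag{1}$$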
Gaussian mixtures (GM) model. A common choice for the generative case is to model the class-conditional densities with Gaussians (e.g., [WR06, Sec. 3.1]). Specifically, each data point belongs to one of two classes with probabilities such that . If it comes from class , then the feature vector is an iid Gaussian vector with mean and the response variable takes the value of the class label :
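As an illustration of the two data models, the following minimal sketch draws synthetic samples using only the Python standard library. It assumes standard Gaussian features, the sigmoid link for the logistic model, and class means $\pm\eta_0$ with identity covariance for the GM model; the names `eta0`, `sample_logistic`, `sample_gm` and `prior_plus` are hypothetical and chosen for this example only.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function f(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_logistic(eta0, n, rng):
    """Draw n pairs (x, y) with x ~ N(0, I_d) and y = +1 w.p. f(x^T eta0), else -1."""
    d = len(eta0)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        score = sum(xj * ej for xj, ej in zip(x, eta0))
        y = 1 if rng.random() < sigmoid(score) else -1
        data.append((x, y))
    return data


def sample_gm(eta0, n, rng, prior_plus=0.5):
    """Draw n pairs (x, y) with y = +1 w.p. prior_plus and x = y * eta0 + z, z ~ N(0, I_d).

    The class means +eta0 / -eta0 are an assumption of this sketch.
    """
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prior_plus else -1
        x = [y * ej + rng.gauss(0.0, 1.0) for ej in eta0]
        data.append((x, y))
    return data


rng = random.Random(0)
eta0 = [1.0] * 5                      # hypothetical weight vector, d = 5
logistic_data = sample_logistic(eta0, 1000, rng)
gm_data = sample_gm(eta0, 1000, rng)
```

In the GM sample, the empirical average of $y\,x_1$ should be close to the first entry of `eta0`, which gives a quick sanity check of the generator.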
2.1 Training data: Feature selection
During training, we assume access to $n$ data points $(x_i, y_i)$, $i = 1, \dots, n$, generated according to either (1) or (2). We allow for the possibility that only a subset of the total number of features is known during training. Concretely, for each feature vector $x_i$, the learner only knows a sub-vector $w_i := x_i(\mathcal{S})$ for a chosen set $\mathcal{S} \subset \{1, \dots, d\}$. We denote the size of the known feature sub-vector as $p := |\mathcal{S}|$. Onwards, we choose²
$$\mathcal{S} = \{1, \dots, p\}, \tag{3}$$
i.e., we select the features sequentially in the order in which they appear.
² This assumption is without loss of generality for our asymptotic model specified in Section 3.1.
2.2 Classification rule
Having access to the training set , the learner obtains an estimate of the weight vector . Then, for a newly generated sample (and ), she forms a linear classifier and decides a label for the new sample as:
The estimate minimizes the empirical risk
for a certain loss function . A traditional way to minimize the empirical risk is to construct gradient descent (GD) iterates via
for step-size and arbitrary . We run GD until convergence and set
In this paper, we focus on the empirical logistic risk
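The training procedure described above can be sketched as follows. The sketch assumes the usual logistic loss $\ell(t) = \log(1 + e^{-t})$ applied to the margins $y_i w_i^T \beta$, a zero initialization, a constant step size, and a fixed iteration budget in place of a convergence test; the names `W`, `gd_logistic`, `step` and `iters` are hypothetical. The known features are taken to be the first `p` coordinates of each feature vector.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(t):
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_loss(t):
    """Numerically stable log(1 + exp(-t))."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def empirical_risk(beta, W, y):
    """Average logistic loss of the margins y_i * w_i^T beta."""
    return sum(logistic_loss(yi * dot(wi, beta)) for wi, yi in zip(W, y)) / len(y)


def gd_logistic(W, y, step=0.5, iters=300):
    """Plain gradient descent on the empirical logistic risk, started at zero."""
    n, p = len(W), len(W[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for wi, yi in zip(W, y):
            # Gradient of log(1 + exp(-y w^T beta)) is -y * sigmoid(-y w^T beta) * w.
            coeff = -yi * sigmoid(-yi * dot(wi, beta)) / n
            for j in range(p):
                grad[j] += coeff * wi[j]
        beta = [b - step * g for b, g in zip(beta, grad)]
    return beta


# Toy data from a logistic model with d = 10 features, of which p = 4 are known.
rng = random.Random(0)
d, n, p = 10, 200, 4
eta0 = [1.0] * d
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [1 if rng.random() < sigmoid(dot(x, eta0)) else -1 for x in X]
W = [x[:p] for x in X]                 # keep only the first p features

beta_hat = gd_logistic(W, y)
risk_at_zero = empirical_risk([0.0] * p, W, y)   # equals log(2)
risk_at_gd = empirical_risk(beta_hat, W, y)
```

On separable data the iterates grow without bound while their direction stabilizes, so in that regime one would inspect the normalized iterate rather than wait for the iterates themselves to settle.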
Generalization error. For a new sample we measure the generalization error of the estimate by the expected risk
where is the risk of the best³ linear classifier, i.e., . Also of interest is the cosine similarity between the estimate and :
³ It assumes knowledge of the entire feature vector and of .
Training error. The training error of is given by
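The three performance measures of this section can be sketched in a few lines. The example below assumes the 0-1 classification error as the risk, estimates the expected risk by Monte Carlo, and uses a GM model in which all features are known so that a closed form is available for comparison: for $w = y\beta_0 + z$ with standard Gaussian $z$, the error of $\mathrm{sign}(w^T\hat\beta)$ equals $Q(\hat\beta^T\beta_0 / \|\hat\beta\|_2)$. The vectors `beta0` and `beta_hat` are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(beta_hat, w):
    """Linear classification rule sign(w^T beta_hat), with ties mapped to +1."""
    return 1 if dot(w, beta_hat) >= 0 else -1


def training_error(beta_hat, W, y):
    """Fraction of training points that are misclassified."""
    return sum(predict(beta_hat, wi) != yi for wi, yi in zip(W, y)) / len(y)


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def monte_carlo_risk(beta_hat, sample_pair, n_test, rng):
    """Estimate P(predicted label != true label) from n_test fresh samples."""
    errors = 0
    for _ in range(n_test):
        w, y = sample_pair(rng)
        errors += predict(beta_hat, w) != y
    return errors / n_test


def q_function(t):
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


beta0 = [1.0, 0.5, 0.0]       # hypothetical true weights (all features known)
beta_hat = [0.8, 0.9, 0.3]    # hypothetical estimate


def gm_pair(rng):
    """One fresh sample from a balanced GM model with class means +beta0 / -beta0."""
    y = 1 if rng.random() < 0.5 else -1
    w = [y * b + rng.gauss(0.0, 1.0) for b in beta0]
    return w, y


rng = random.Random(1)
risk_mc = monte_carlo_risk(beta_hat, gm_pair, 20000, rng)
risk_exact = q_function(dot(beta_hat, beta0) / math.sqrt(dot(beta_hat, beta_hat)))
best_risk = q_function(math.sqrt(dot(beta0, beta0)))   # risk of the classifier using beta0
excess_risk = risk_exact - best_risk
similarity = cosine_similarity(beta_hat, beta0)
```

By the Cauchy-Schwarz inequality the excess risk computed this way is non-negative, and it vanishes only when the estimate is aligned with `beta0`, which is why the cosine similarity is a natural companion metric.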
2.3 Convergence behavior of GD iterates
Recent literature studies the behavior of GD iterates for logistic loss. There are two regimes of interest: (i) when data are such that the empirical logistic risk is strongly convex, standard tools show that the GD iterates converge to its unique bounded optimum; (ii) when data are separable, the normalized iterates converge to the max-margin solution [JT19, SHN18, Tel13]. These properties guide our approach towards studying the generalization error of GD for logistic loss. Instead of analyzing the GD iterates directly, we study the performance of the max-margin classifier and of the minimizer of the empirical logistic risk. We formalize these ideas next.
2.3.1 Separable data
The training set is separable iff there exists a linear classifier achieving zero training error, i.e., When data are separable, for it holds where
i.e., the solution to hard-margin SVM.
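For reference, the separability condition and the hard-margin SVM program are commonly written as below. This is the standard textbook form, stated with assumed symbols ($w_i \in \mathbb{R}^p$ for the known features of sample $i$, $y_i \in \{\pm 1\}$ for its label, $\beta^{(k)}$ for the $k$-th GD iterate); the notation of the paper's own equation (7) may differ.

```latex
% Separability: some linear classifier attains zero training error.
\exists\, \beta \in \mathbb{R}^p : \quad y_i\, w_i^T \beta \ge 1, \qquad i = 1, \dots, n .

% Hard-margin SVM (minimum-norm separator), feasible iff the data are separable.
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p} \ \|\beta\|_2
\quad \text{subject to} \quad y_i\, w_i^T \beta \ge 1, \qquad i = 1, \dots, n .

% Directional convergence of GD on logistic loss for separable data.
\lim_{k \to \infty} \frac{\beta^{(k)}}{\|\beta^{(k)}\|_2} \;=\; \frac{\hat{\beta}}{\|\hat{\beta}\|_2} .
```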
2.3.2 Non-separable data
3 Sharp Asymptotics
3.1 Asymptotic setting
Recall the following notation:
$d$: dimension of the ambient space,
$n$: training sample size,
$p$: number of parameters during training (see (3)).
We study a setting in which $d$ and $n$ are fixed and $p$ varies from $1$ to $d$. Our asymptotic results hold in a linear asymptotic regime where such that
We fix and derive asymptotic predictions for the generalization error as a function of , which determines the overparametrization ratio in our model. To quantify the effect of on the generalization error, we decompose the feature vector to its known part and to its unknown part :
Then, we let (resp., ) denote the vector of weight coefficients corresponding to the known (resp., unknown) features such that
In this notation, we study a sequence of problems of increasing dimensions as in (9) that further satisfy:
The parameters and can be thought of as the useful signal strength and the noise strength, respectively. Our notation specifies that (hence also, ) is a function of . We are interested in functions that are increasing in such that the signal strength increases as more features enter into the training model; Sec. 4.1 for explicit parameterizations.
We reserve the following notation for random variables and
for and as defined in (10). All expectations and probabilities are with respect to the randomness of . Also, let . For a sequence of random variables that converges in probability to a constant in the limit of (9), we simply write . Finally, we denote the proximal operator of the logistic-loss as follows,
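As a concrete reference point, the proximal operator of a convex loss $\ell$ with parameter $\lambda > 0$ is commonly defined as $\mathrm{prox}_\ell(x; \lambda) = \arg\min_v \frac{1}{2\lambda}(x - v)^2 + \ell(v)$. The sketch below assumes this standard definition together with $\ell(v) = \log(1 + e^{-v})$. The first-order condition reads $v - x - \lambda/(1 + e^{v}) = 0$; its left-hand side is increasing in $v$ and changes sign on $[x, x + \lambda]$, so bisection applies. The function name `prox_logistic` is hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def prox_logistic(x, lam, n_bisect=80):
    """Proximal operator of l(v) = log(1 + exp(-v)) at x with parameter lam > 0.

    Solves h(v) = v - x - lam * sigmoid(-v) = 0 by bisection on [x, x + lam],
    where h(x) < 0 < h(x + lam) and h is strictly increasing.
    """
    lo, hi = x, x + lam
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if mid - x - lam * sigmoid(-mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


v = prox_logistic(0.3, 2.0)
residual = v - 0.3 - 2.0 * sigmoid(-v)   # first-order condition, should be ~0
```

A fixed number of bisection steps is used instead of a tolerance so that the loop terminates regardless of the magnitude of `x`.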
3.3 Regimes of learning: Phase-transition
As discussed in Section 2, the behavior of GD changes depending on whether the training data is separable or not. Under the Gaussian feature model, the following proposition establishes a sharp phase-transition characterizing the separability of the training data. The proposition is an extension of the phase-transition result by Candès and Sur [CS18] for the (noiseless) logistic model. We extend their result to the noisy setting (to accommodate the feature selection model in Section 2.1) as well as to the Gaussian mixtures model.⁴
⁴ Our analysis approach is also slightly different from [CS18]. Specifically, we note that data is linearly separable iff the solution to hard-margin SVM (7) is bounded. Then, using the CGMT we are able to show that the latter happens with probability one iff $\kappa > \kappa_\star$.
Proposition 3.1 (Phase transition).
under which the training data is separable. Recall from (3) that for all , are the (out of ) features that are used for training. Let in (10) be an increasing function of . Recall the notation in (11) and define a random variable depending on the data model as follows
Further define the following threshold function depending on the data generation model:
Let $\kappa_\star$ be the unique solution to the equation . Then, the following holds regarding $\kappa_\star$:
Put in words: the training data is separable iff $\kappa > \kappa_\star$. When this is the case, the training error can be driven to zero and we are in the interpolating regime. In contrast, the training error is non-vanishing for smaller values of $\kappa$.
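The dichotomy between the two regimes can be probed empirically with a simple separability test. The sketch below uses the classical perceptron, which is not the method of the paper: it terminates with a separating vector whenever one exists, so a run that exhausts its epoch budget is only heuristic evidence of non-separability. Labels are drawn uniformly at random (no signal), and the exact threshold of Proposition 3.1 is not evaluated here; all names are hypothetical.

```python
import random


def perceptron_separates(W, y, max_epochs=5000):
    """Return True if the perceptron finds beta with y_i * w_i^T beta > 0 for all i.

    A False result means no separator was found within max_epochs passes,
    which is a heuristic (not a certified) verdict of non-separability.
    """
    p = len(W[0])
    beta = [0.0] * p
    for _ in range(max_epochs):
        mistakes = 0
        for wi, yi in zip(W, y):
            margin = yi * sum(a * b for a, b in zip(wi, beta))
            if margin <= 0:
                # Standard perceptron update on a misclassified point.
                beta = [b + yi * a for a, b in zip(wi, beta)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def gaussian_features(n, p, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]


rng = random.Random(0)
n = 40
labels = [rng.choice((-1, 1)) for _ in range(n)]

# Heavily overparameterized: p = 2n features, separable for any labeling.
overparam_separable = perceptron_separates(gaussian_features(n, 2 * n, rng), labels)

# Heavily underparameterized: p = 2 features, random labels are essentially never separable.
underparam_separable = perceptron_separates(gaussian_features(n, 2, rng), labels, max_epochs=200)
```

Sweeping `p` between these two extremes and averaging over many draws would trace out an empirical separability curve that could be compared against the predicted threshold.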
Remark 1 (The threshold for Gaussian mixture).
Consider the Gaussian mixture model, i.e. . Substituting the value of in (13) and using the fact that and are independent Gaussians the threshold function for the GM model simplifies to:
where and are the density and tail function of the standard normal distribution, respectively.
3.4 High-dimensional asymptotics
Propositions 3.2 and 3.3 characterize the performance of the two optimization-based classifiers in (8) and (7), respectively, under the logistic (cf. (1)) and the GM (cf. (2)) model. When combined with the results of Section 2.3, they characterize the statistical performance of converging points of GD.
3.4.1 Non-separable data
The performance of logistic loss under the logistic model was recently studied in [SC19, SAH19, TPT19]. Here, we follow [TPT19, Thm. III.1]. Specifically, we appropriately modify their proof and result to fit the data model of Section 2 and to obtain a prediction for the classification error (not considered in prior work). Also, we provide extensions to the GM model.
Proposition 3.2 (ML).
Consider a training set that is generated by either of the two models in (1) or (2) and fix . Recall the notation in (11) and define a random variable depending on the data model as in (12). Let be the unique solution to the following system of three nonlinear equations in three unknowns,
Remark 2 (Simplifications for the GM model).
where the expectations are over a single Gaussian random variable . We numerically solve the system of equations via a fixed-point iteration method suggested in [TAH18] (see also [SAH19, TPT19]). We empirically observe that reformulating the equations as in (15) is critical for the iteration method to converge. Perhaps an interesting observation is that the performance of logistic regression (the same is true for SVM in Section 3.4.2) does not depend on the prior class probabilities for the GM model with isotropic Gaussian features in (2).
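To make the numerical procedure concrete, here is a generic damped fixed-point solver. It is only a sketch of the iteration scheme: the map `toy_update` is a hypothetical two-variable contraction used for demonstration and is not the system (15), whose equations are not reproduced here. The `damping` parameter is likewise an assumption of this sketch.

```python
import math


def fixed_point(update, x0, damping=1.0, tol=1e-12, max_iter=10000):
    """Iterate x <- (1 - damping) * x + damping * update(x) until the step is below tol.

    Returns the final iterate and the number of iterations used.
    """
    x = list(x0)
    for k in range(max_iter):
        gx = update(x)
        x_new = [(1.0 - damping) * xi + damping * gi for xi, gi in zip(x, gx)]
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new, k + 1
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")


def toy_update(v):
    """Hypothetical contraction in two unknowns (a, b); stands in for a real system."""
    a, b = v
    return [0.5 * math.cos(b), 0.5 * math.sin(a) + 0.3]


solution, n_iter = fixed_point(toy_update, [0.0, 0.0], damping=0.8)
residual = max(abs(s - g) for s, g in zip(solution, toy_update(solution)))
```

Whether such an iteration converges depends on how the equations are arranged into the form $x = g(x)$: the same root can be attracting under one rearrangement and repelling under another, which is consistent with the observation above that the formulation matters.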
3.4.2 Separable data
Proposition 3.3 (SVM).
Consider a training set that is generated by either of the two models in (1) or (2) and fix . Recall the notation in (11) and define a random variable depending on the data model as in (12). Further define
Let be the unique solution of the equation
Further let be the unique minimum of for . Then,
Remark 3 (Margin).
4 Numerical results
4.1 Feature selection models
Recall that the number of features known at training is determined by . Specifically, enters the formulae predicting the generalization performance via the signal strength . In this section, we specify two explicit models for feature selection and their corresponding functions . Similar models are considered in [BF83, HMRT19, BHX19], albeit in a linear regression setting.
Linear model. We start with a uniform feature selection model characterized by the following parametrization:
for fixed and . This models a setting where all coefficients of the regressor have equal contribution. Hence, the signal strength increases linearly with the number of features considered in training.
Polynomial model. In the linear model, adding more features during training results in a linear increase of the signal strength. In contrast, our second model assumes diminishing returns:
for some . As increases, so does the signal strength , but the increase is less significant for larger values of at a rate specified by .
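The two feature selection models can be sketched as functions mapping the ratio $\kappa$ to a (signal, noise) pair. The explicit parametrizations (18) and (19) are not reproduced in this text, so the forms below are assumptions chosen to match the verbal description: a linear split $s^2 = r^2 \kappa / \zeta$ of a total energy $r^2$, and a saturating split $s^2 = r^2 (1 - (1 + \kappa)^{-\gamma})$ with diminishing returns. In both cases the noise energy is taken to be $\sigma^2 = r^2 - s^2$. The names `r2`, `zeta` and `gamma` are hypothetical parameters of this sketch.

```python
def linear_signal(kappa, r2, zeta):
    """Assumed linear model: s^2 = r^2 * kappa / zeta and sigma^2 = r^2 - s^2.

    Requires 0 <= kappa <= zeta, i.e. at most all features are selected.
    """
    if not 0.0 <= kappa <= zeta:
        raise ValueError("kappa must lie in [0, zeta]")
    s2 = r2 * kappa / zeta
    return s2, r2 - s2


def polynomial_signal(kappa, r2, gamma):
    """Assumed diminishing-returns model: s^2 = r^2 * (1 - (1 + kappa)^(-gamma))."""
    if kappa < 0.0 or gamma <= 0.0:
        raise ValueError("kappa must be non-negative and gamma positive")
    s2 = r2 * (1.0 - (1.0 + kappa) ** (-gamma))
    return s2, r2 - s2


# Signal energy gained by each unit increase of kappa under the assumed polynomial model.
gains = [
    polynomial_signal(k + 1.0, 1.0, 2.0)[0] - polynomial_signal(float(k), 1.0, 2.0)[0]
    for k in range(4)
]
```

Under the assumed polynomial form the successive gains shrink, which is the diminishing-returns behavior described above, whereas the linear form adds the same amount of signal energy per unit of $\kappa$.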
4.2 Risk curves
Figure 1 assumes the polynomial feature model (19) for and three values of . The crosses are simulation results obtained by running GD on synthetic data generated according to (1) and (19). Specifically, the depicted values are averages calculated over Monte Carlo realizations for . For each realization, we ran GD on logistic loss (see Sec. 2.2) with a fixed step size until convergence and recorded the resulting risk. Similarly, the squares are simulation results obtained by solving SVM (7) over the same number of different realizations and averaging over the recorded performance. As expected by [JT19] (also Sec. 2.3), the performance of GD matches that of SVM when data are separable. Also, as predicted by Proposition 3.1, the data is separable with high probability when $\kappa > \kappa_\star$ (the threshold value is depicted with dashed vertical lines). This is verified by noticing the dotted scatter points, which record (averages of) the training error. Recall from Section 2.2 that the training error is zero if and only if the data are separable. Finally, the solid lines depict the theoretical predictions of Proposition 3.2 ($\kappa < \kappa_\star$) and Proposition 3.3 ($\kappa > \kappa_\star$). The results of Figure 1 validate the accuracy of our predictions: our asymptotic formulae predict the classification performance of GD on logistic loss for all values of $\kappa$. Note that the curves correspond to excess risk defined in (5); see also related Footnote 1. Corresponding results for the cosine similarity are presented in Figure 6 in the Appendix.
Under the logistic model (1), Figures 3 and 2 depict the cosine similarity and excess risk curves of GD as a function of $\kappa$ for the linear and the polynomial model, respectively. Compared to Figure 1, we only show the theoretical predictions. Note that the linear model is determined by parameters and the polynomial model by . Once these values are fixed, we compute the threshold value $\kappa_\star$. Then, we apply Proposition 3.2 ($\kappa < \kappa_\star$) and Proposition 3.3 ($\kappa > \kappa_\star$).
Finally, Figure 4 shows risk and cosine-similarity curves for the Gaussian mixture data model (2) with linear feature selection rule. The figures compare simulation results to theoretical predictions similar in nature to Figure 1, thus validating the accuracy of the latter for the GM model. For the simulations, we generate data according to (2) with , and . The results are averages over Monte Carlo realizations.
4.3 Discussion on double-descent
Here, we discuss the double-descent phenomenon that appears in Figures 1 and 2. First, focus on the underparameterized regime (cf. area on the left of the vertical dashed lines). Here, the number of training examples is large compared to the size of the unknown weight vector , which helps learning . However, since only a (small) fraction of the total number of features is used for training, the observations are rather noisy. This tradeoff manifests itself with a “U-shaped curve” for . Such a tradeoff is also present in the overparameterized regime (cf. area on the right of the vertical dashed lines), leading to a second “U-shaped curve”. The larger the value of $\kappa$, the more accurate the model, but also the larger the number of unknown parameters to be estimated. As seen in Figure 2, the “U-shape” is more pronounced for larger values of . Overall, the risk curve has two local minima corresponding to the two regimes of learning. Interestingly, the global minimum appears in the overparameterized regime for all values of and depicted here. The value of determines the optimal number of features that need to be selected during training to minimize the classification error. Note that for the training error is zero, yet the generalization performance is best.
The risk curves for the linear feature selection model in Figure 3 are somewhat different compared to the polynomial model. In the underparameterized regime, we observe a (very shallow) “U-shaped” curve, but the risk is monotonically decreasing for . For the values of signal strength considered here the global minimum is always attained at , i.e., when all features are used for training.
5 Future work
This paper studies the generalization performance of GD on logistic loss under the logistic and GM models with isotropic Gaussian features. In future work, we plan on extending our study to GD on square loss, which is also commonly used in practice. Extensions of the results to multiple classes and non-isotropic features are also of interest.
- [BF83] Leo Breiman and David Freedman. How many variables should be entered in a regression equation? Journal of the American Statistical Association, 78(381):131–136, 1983.
- [BHM18] Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.
- [BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
- [BHX19] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
- [BLLT19] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
- [BM12] Mohsen Bayati and Andrea Montanari. The lasso risk for gaussian matrices. Information Theory, IEEE Transactions on, 58(4):1997–2017, 2012.
- [BMM18] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 540–548, 2018.
- [BRT18] Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018.
- [CM19] Michael Celentano and Andrea Montanari. Fundamental barriers to high-dimensional regression with convex penalties. arXiv preprint arXiv:1903.10603, 2019.
- [CS18] Emmanuel J Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. arXiv preprint arXiv:1804.09753, 2018.
- [DM16] David Donoho and Andrea Montanari. High dimensional robust m-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3-4):935–969, 2016.
- [DMM11] David L Donoho, Arian Maleki, and Andrea Montanari. The noise-sensitivity phase transition in compressed sensing. Information Theory, IEEE Transactions on, 57(10):6920–6941, 2011.
- [DTL18] Oussama Dhifallah, Christos Thrampoulidis, and Yue M Lu. Phase retrieval via polytope optimization: Geometry, phase transitions, and new algorithms. arXiv preprint arXiv:1805.09555, 2018.
- [EK18] Noureddine El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1-2):95–175, 2018.
- [Gor88] Yehoram Gordon. On Milman’s inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$. Springer, 1988.
- [HMRT19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- [Hua17] Hanwen Huang. Asymptotic behavior of support vector machine for spiked population model. The Journal of Machine Learning Research, 18(1):1472–1492, 2017.
- [JT19] Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1772–1798, Phoenix, USA, 25–28 Jun 2019. PMLR.
- [KA19] A. Kammoun and M.-S. Alouini. On the precise error analysis of support vector machines. Submitted to IEEE Transactions on Information Theory, 2019.
- [MLC19a] X. Mai, Z. Liao, and R. Couillet. A large scale analysis of logistic regression: asymptotic performance and new insights. In ICASSP, 2019.
- [MLC19b] Xiaoyi Mai, Zhenyu Liao, and Romain Couillet. A large scale analysis of logistic regression: Asymptotic performance and new insights. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3357–3361. IEEE, 2019.
- [MM19] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- [MVS19] Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in regression. arXiv preprint arXiv:1903.09139, 2019.
- [SAH19] Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. The impact of regularization on high-dimensional logistic regression. arXiv preprint arXiv:1906.03761, 2019.
- [SC18] Pragya Sur and Emmanuel J Candès. Additional supplementary materials for: A modern maximum-likelihood theory for high-dimensional logistic regression, 2018.
- [SC19] Pragya Sur and Emmanuel J Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.
- [SHN18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
- [Sto13] Mihailo Stojnic. A framework to characterize performance of lasso algorithms. arXiv preprint arXiv:1303.7291, 2013.
- [Sun96] R. K. Sundaram. A first course in optimization theory. Cambridge University Press, 1996.
- [TAH15] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Lasso with non-linear measurements is equivalent to one with linear measurements. In Advances in Neural Information Processing Systems, pages 3420–3428, 2015.
- [TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592–5628, 2018.
- [Tel13] Matus Telgarsky. Margins, shrinkage and boosting. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, pages II–307. JMLR.org, 2013.
- [TOH15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Proceedings of The 28th Conference on Learning Theory, pages 1683–1709, 2015.
- [TPT19] Hossein Taheri, Ramtin Pedarsani, and Christos Thrampoulidis. Sharp guarantees for solving random equations with one-bit information. arXiv preprint arXiv:1908.04433, 2019.
- [WR06] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.
- [WWM19] Shuaiwen Wang, Haolei Weng, and Arian Maleki. Does slope outperform bridge regression? arXiv preprint arXiv:1909.09345, 2019.
- [XH19] Ji Xu and Daniel Hsu. How many variables should be entered in a principal component regression equation? arXiv preprint arXiv:1906.01139, 2019.
Appendix A Proofs
The main technical tool that facilitates the analysis of (8) and (7) is the convex Gaussian min-max theorem (CGMT) [Sto13, TOH15], which is an extension of Gordon’s Gaussian min-max inequality (GMT) [Gor88]. More precisely, we utilize the machinery recently put forth in [SAH19, TPT19, KA19], which demonstrate that the CGMT framework can also be applied to classification problems.
Organization. We introduce the necessary background on the CGMT in Section A.1. Then, based on this machinery, we study the performance of SVM (cf. (7)) for the logistic model in Section A.4.2. Note that the optimization in (7) is feasible if and only if the training data is linearly separable. Thus, by studying (7) we are also able to show the phase-transition result of Proposition 3.1 in Section A.4.1. In Section A.5 we briefly explain how to adapt the result of [TPT19, Thm. III.1] for the purpose of Proposition 3.2, again for the logistic model. It turns out that, albeit different, both data models in (1) and (2) naturally fit under the same analysis framework. Specifically, the formulae for the GM model can be proved with identical arguments. The proof sketch for the SVM performance under the GM model in Section A.6 aims to showcase this.
A.1 Technical tool: CGMT
A.1.1 Gordon’s Min-Max Theorem (GMT)
Gordon’s Gaussian comparison inequality [Gor88] compares the min-max value of two doubly indexed Gaussian processes based on how their autocorrelation functions compare. The inequality is quite general (see [Gor88]), but for our purposes we only need its application to the following two Gaussian processes:
where: , , , all have iid Gaussian entries; the sets and are compact; and, . For these two processes, define the following (random) min-max optimization programs, which (following [TAH18]) we refer to as the primary optimization (PO) problem and the auxiliary optimization (AO) problem, for reasons that will soon become clear.
According to Gordon’s comparison inequality, for any , it holds:
In other words, a high-probability lower bound on the AO is a high-probability lower bound on the PO. The premise is that it is often much simpler to lower bound the AO rather than the PO. To be precise, (22) is a slight reformulation of Gordon’s original result proved in [TOH15] (see therein for details).
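For orientation, the two processes, the associated PO and AO programs, and the comparison inequality are usually stated as follows in the CGMT literature (see [TOH15, TAH18]). The symbols and dimensions below are assumptions of this sketch and may differ from the notation of (21a), (21b) and (22). The sign in front of the $h$ term is immaterial, since $h$ and $-h$ have the same distribution.

```latex
% Assumed setting: G \in \mathbb{R}^{m \times n}, g \in \mathbb{R}^m, h \in \mathbb{R}^n
% with iid N(0,1) entries; S_w \subset \mathbb{R}^n and S_u \subset \mathbb{R}^m compact;
% \psi continuous on S_w \times S_u.
\begin{align*}
X_{w,u} &:= u^T G w + \psi(w,u), \\
Y_{w,u} &:= \|w\|_2\, g^T u + \|u\|_2\, h^T w + \psi(w,u), \\
\Phi(G) &:= \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} X_{w,u}
  && \text{(PO)}, \\
\phi(g,h) &:= \min_{w \in \mathcal{S}_w} \max_{u \in \mathcal{S}_u} Y_{w,u}
  && \text{(AO)}, \\
\mathbb{P}\big(\Phi(G) < c\big) &\le 2\, \mathbb{P}\big(\phi(g,h) \le c\big)
  && \text{for all } c \in \mathbb{R}.
\end{align*}
```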
A.1.2 Convex Gaussian Min-Max Theorem (CGMT)
The asymptotic expressions of this paper build on the CGMT [TOH15]. For ease of reference we summarize here the essential ideas of the framework following the presentation in [TAH18]; please see [TAH18, Section 6] for the formal statement of the theorem and further details. The CGMT is an extension of the GMT and it asserts that the AO in (21b) can be used to tightly infer properties of the original PO in (21a), including the optimal cost and the optimal solution. According to the CGMT [TAH18, Theorem 6.1], if the sets and are convex and is continuous convex-concave on , then, for any and , it holds
In words, concentration of the optimal cost of the AO problem around implies concentration of the optimal cost of the corresponding PO problem around the same value . Moreover, starting from (23) and under strict convexity conditions, the CGMT shows that concentration of the optimal solution of the AO problem implies concentration of the optimal solution of the PO to the same value. For example, if minimizers of (21b) satisfy for some , then, the same holds true for the minimizers of (21a): [TAH18, Theorem 6.1(iii)]. Thus, one can analyze the AO to infer corresponding properties of the PO, the premise being of course that the former is simpler to handle than the latter.
A.2 Hard-margin SVM: Identifying the PO and AO problems
The max-margin solution is obtained by solving the following problem:
and is feasible only when the training data is separable. Under such a situation, (24) is equivalent to solving:
Based on the rotational invariance of the Gaussian measure, we may assume without loss of generality that
where only the first coordinate is nonzero. For convenience, we also write
In this new notation,
Further using this notation and considering the change of variable , (25) becomes
Letting , and , and writing
leads to the following optimization problem
which has the same form as a primary optimization (PO) problem as required by the CGMT [TOH15], with the single exception that the feasibility sets are not compact. To resolve this issue, we pursue the approach developed in [KA19] for the analysis of the hard-margin SVM. We write (28) as follows,
We have thus identified a sequence of (PO) problems indexed by , each of which satisfies the compactness conditions on the feasibility sets, as required by the CGMT [TOH15]. We associate each one of them with an auxiliary optimization (AO) problem that can be written as:
where the random vectors and , and represents the vector whose elements are all 1.
Having identified the AO problems, the next step is to simplify them so that they reduce to optimization problems over only a few scalar variables. This will facilitate inferring their asymptotic behavior in the subsequent step.
A.3 Simplification of the AO problem
To simplify the AO problem, we start by optimizing over the direction of the optimization variable . In doing so, we obtain:
In the above optimization problem, it is easy to see that appears only through its norm. Let . Optimizing over , we obtain the following scalar optimization problem:
It is important to note that the new formulation of the AO problem is obtained from a deterministic analysis that did not involve any asymptotic approximation. However, this new formulation is better suited to studying its asymptotic behavior.
A.4 Asymptotic behavior of the AO problem
A.4.1 Data separability (Proof of Proposition 3.1)
Define the sequence of functions
when and . It is easy to see that is jointly convex in its arguments and converges almost surely to:
where, as in (11),
Since the convergence of convex functions is uniform over compacts, for fixed, there exists such that for all and ,
From the above inequality, it follows that if
then for chosen sufficiently small (concretely: smaller for instance than ), there exists a constant independent of such that:
Next, we prove that (37) holds for sufficiently large when the following condition is satisfied:
On the right-hand side of the equation above, we recognize the threshold function defined in Proposition 3.1. To show the desired result, we prove that the optimization over can be restricted to a compact set. This is because for ,
As grows unboundedly large as or , we have:
thus proving that the minimum over is bounded. We can thus consider that belongs to a given compact set of . For fixed ,
Furthermore, this convergence is uniform over compact sets due to the convexity of function , hence,
Thus, there exists , such that for any ,