Sparse (group) learning with Lipschitz loss functions: a unified analysis
Abstract
We study a family of sparse estimators defined as minimizers of an empirical Lipschitz loss function—which includes the hinge, logistic and quantile regression losses—with a convex sparse or group-sparse regularization. In particular, we consider the L1 norm on the coefficients, its sorted Slope version, and the Group L1L2 extension. First, we propose a theoretical framework which simultaneously derives new L2 estimation upper bounds for all three regularization schemes. For L1 and Slope regularizations, our bounds scale as $\sqrt{k^* \log(p/k^*)/n}$—where $n \times p$ is the size of the design matrix and $k^*$ the number of nonzeros of the theoretical loss minimizer—matching the optimal minimax rate achieved for the least-squares case. For Group L1L2 regularization, our bounds scale as $\sqrt{(g^* \log(G/g^*) + m^*)/n}$—where $G$ is the total number of groups, $g^*$ the number of relevant groups and $m^*$ the number of coefficients in the groups which contain the support of the minimizer—and improve over the least-squares case. We additionally show that when the signal is strongly group-sparse, Group L1L2 is superior to L1 and Slope. Our bounds hold both in probability and in expectation, under common assumptions in the literature. Second, we propose an accelerated proximal algorithm which computes the convex estimators studied herein when the number of variables is large. We additionally compare the statistical performance of our estimators against standard baselines in settings where the signal is either sparse or group-sparse. Our experimental findings reveal (i) the good empirical performance of L1 and Slope regularizations for sparse binary classification problems, (ii) the superiority of Group L1L2 regularization for group-sparse classification problems and (iii) the appealing properties of sparse quantile regression estimators for sparse regression problems with heteroscedastic noise.
1 Introduction
We consider training data $(x_i, y_i)_{i=1}^{n}$, with $x_i \in \mathbb{R}^p$, drawn from a distribution $\mathbb{P}$. We fix a loss $f$ and consider a theoretical minimizer $w^*$ of the theoretical loss $\mathcal{L}(w) = \mathbb{E}_{(x,y)\sim\mathbb{P}}\left[f(x^T w, y)\right]$:
(1) $w^* \in \operatorname{argmin}_{w \in \mathbb{R}^p} \mathcal{L}(w).$
In the rest of this paper, $f$ will be assumed to be Lipschitz and to admit a subgradient. We denote $k^*$ the number of nonzeros of the theoretical minimizer $w^*$ and $\|w^*\|_1$ its L1 norm. We consider the L1-constrained learning problem
(2) $\hat{w} \in \operatorname{argmin}_{\|w\|_1 \le R} \ \frac{1}{n}\sum_{i=1}^{n} f(x_i^T w, y_i) + \Omega(w),$
where $\Omega$ is a regularization function. We study sparse estimators, i.e. estimators with a small number of nonzeros. To this end, we restrict $\Omega$ to a class of sparsity-inducing regularizations. We first consider the L1 regularization, which is well-known to encourage sparsity in the coefficients [1]. Problem (2) becomes:
(3) $\hat{w}^{L1} \in \operatorname{argmin}_{\|w\|_1 \le R} \ \frac{1}{n}\sum_{i=1}^{n} f(x_i^T w, y_i) + \lambda \|w\|_1.$
The second problem we study is inspired by the sorted L1 penalty, aka the Slope norm [2, 3], used in the context of least-squares problems for its statistical properties. We denote the set of permutations of $\{1, \ldots, p\}$ and consider a non-increasing sequence $\lambda_1 \ge \cdots \ge \lambda_p \ge 0$. We define the L1-constrained Slope estimator as a solution of the convex minimization problem:
(4) $\hat{w}^{S} \in \operatorname{argmin}_{\|w\|_1 \le R} \ \frac{1}{n}\sum_{i=1}^{n} f(x_i^T w, y_i) + \lambda \sum_{j=1}^{p} \lambda_j |w|_{(j)}.$
$\sum_{j=1}^{p} \lambda_j |w|_{(j)}$ is the Slope regularization and $|w|_{(1)} \ge \cdots \ge |w|_{(p)}$ is a non-increasing rearrangement of the absolute values of the coefficients of $w$.
Finally, in several applications, sparsity is structured—the coefficient indices occur in groups known a priori, and it is desirable to select a whole group. In this context, group variants of the L1 norm are often used to improve performance and interpretability [4, 5]. We consider the use of a Group L1L2 regularization [6] and define the L1-constrained Group L1L2 problem:
(5) $\hat{w}^{G} \in \operatorname{argmin}_{\|w\|_1 \le R} \ \frac{1}{n}\sum_{i=1}^{n} f(x_i^T w, y_i) + \lambda \sum_{g=1}^{G} \|w_{I_g}\|_2,$
where $g \in \{1, \ldots, G\}$ denotes a group index (the groups are disjoint), $w_{I_g}$ denotes the vector of coefficients belonging to group $g$, $I_g$ the corresponding set of indexes and $\bigcup_{g=1}^{G} I_g = \{1, \ldots, p\}$. In addition, we denote $\mathcal{G}^*$ the smallest subset of group indexes such that the support of $w^*$ is included in the union of these groups, $g^*$ the cardinality of $\mathcal{G}^*$, and $m^*$ the sum of the sizes of these groups.
What this paper is about: In this paper, we propose a unified statistical and computational analysis of a large class of estimators, defined as solutions of Problems (3), (4) and (5) when $f$ is a convex Lipschitz loss function which admits a subgradient (cf. Assumption 1, Section 2.2), e.g. when $f$ is the hinge loss, the logistic loss or the quantile regression loss. In a first part, we propose a statistical study which derives new error bounds for the L2 norm of the difference $\|\hat{w} - w^*\|_2$ between the empirical and theoretical minimizers, where $\hat{w}$ is a solution of Problem (3), (4) or (5) (when no confusion can be made, we drop the dependence upon the parameters). Our bounds are reached under standard assumptions in the literature, and hold with high probability and in expectation. As a critical step, we derive stronger versions of existing cone conditions and restricted strong convexity conditions in Theorems 1 and 2 respectively. Our method draws inspiration from least-squares regression approaches [7, 3, 5] and illustrates the distinction between regression and classification studies. Our framework is flexible enough to apply to coefficient-based and group-based regularizations, while highlighting the differences between these two classes of problems. For Problems (3) and (4), our bounds scale as $\sqrt{k^* \log(p/k^*)/n}$. They improve over existing results for all three losses considered with L1 regularization [8, 9, 10], and match the best minimax rate achieved in the least-squares case [11]. For the group Problem (5), our bounds appear to be the first existing results for all three losses and scale as $\sqrt{(g^* \log(G/g^*) + m^*)/n}$. This rate is better than the existing ones for least-squares problems [5] due to a stronger cone condition (cf. Theorem 1). Similarly to [5], we additionally show that when the signal is strongly group-sparse, Group L1L2 regularization is superior to L1 and Slope. In a second part, we propose a computational study of our family of estimators.
We design a proximal gradient algorithm to solve the fully tractable problems presented herein—our method uses Nesterov smoothing [12] when $f$ is a non-smooth loss—and we compare the estimators studied with standard non-sparse baselines through a variety of computational experiments. Our numerical findings highlight the performance of our estimators in classification and regression settings where the signal is sparse or group-sparse.
Organization of paper: The rest of this paper is organized as follows. Section 2 builds our framework of study and presents our new theorems: our main statistical results appear in Theorem 3 and Corollary 1. Section 3 proposes a first-order algorithm to solve Problems (3), (4) and (5) and presents a range of synthetic experiments which reveal the computational advantages of the estimators studied herein.
2 Statistical analysis
In this section, we study the statistical properties of the estimators defined as solutions of Problems (3), (4) and (5) and derive new upper bounds for L2 estimation.
2.1 Existing work on statistical performance
Statistical performance and L2 consistency for high-dimensional linear regression have been widely studied [13, 7, 14, 3, 15]. One important statistical performance measure is the L2 estimation error $\|\hat{w} - w^*\|_2$, where $w^*$ is the sparse vector used in generating the true model and $\hat{w}$ is an estimator. For regression problems with least-squares loss, [14] and [11] established a lower bound for estimating the L2 norm of a sparse vector, regardless of the input matrix and estimation procedure. This optimal minimax rate is known to be achieved by a global minimizer of an L0-regularized estimator [16]. This minimizer is sparse and adapts to unknown sparsity—the sparsity degree does not have to be specified; however, it is intractable in practice. Recently, [3] reached this optimal minimax bound for a Lasso estimator with knowledge of the sparsity, and proved that a recently introduced and polynomial-time Slope estimator [17] achieves the optimal rate while adapting to unknown sparsity. In a related work, [18] reached a near-optimal rate for the L1-regularized least absolute deviation loss. [10] extended this bound to L1-regularized quantile regression. Finally, in the regime where sparsity is structured, [5] proved an L2 estimation upper bound for a Group L1L2 estimator—where, similarly to our notation, $G$ is the number of groups, $g^*$ the number of relevant groups and $m^*$ their aggregated size—and showed that their Group L1L2 estimator is superior to the standard Lasso when the signal is strongly group-sparse, i.e. $m^*$ is low and the signal is efficiently covered by the groups. [15] similarly showed that, in the multitask setting, a Group L1L2 estimator is superior to the Lasso.
Little work has been done on deriving estimation error bounds for high-dimensional classification problems. Existing work has focused on the analysis of generalization error and risk bounds [19, 20]. Unlike the regression case, for classification problems $k^*$ is the sparsity of the theoretical minimizer to estimate. Recently, [8] proved an upper bound for the L2 coefficients estimation of an L1-regularized Support Vector Machine (SVM). The authors recovered the rate proposed by [21], which considered a weighted L1 norm for linear models. [9] obtained a similar bound for an L1-regularized logistic regression estimator in a binary Ising graph. However, this rate is not the best known for a classification estimator: [22] proved a sharper error bound for estimating a single vector through sparse models—including 1-bit compressed sensing and logistic regression—over a bounded set of vectors. Contrary to this work, our approach does not assume a generative vector and applies to a larger class of losses (hinge, quantile regression) and regularizations (Slope, Group L1L2). We are not aware of any existing result for group regularization in classification settings.
2.2 Framework of study
We design herein our theoretical framework of study, using common assumptions in the literature. Our first assumption requires the loss $f$ to be Lipschitz and to admit a subgradient. We list three main examples that fall into this framework.
Assumption 1
Lipschitz loss and existence of a subgradient: The loss $f$ is nonnegative, convex and Lipschitz continuous with constant $L$, that is, $|f(z_1, y) - f(z_2, y)| \le L |z_1 - z_2|$ for all $z_1, z_2, y$. In addition, $f$ admits a subgradient with respect to its first argument.
Support vector machines (SVM) For $y \in \{-1, 1\}$, the SVM problem learns a classification rule of the form $x \mapsto \operatorname{sign}(x^T w)$ by solving Problem (2) with the hinge loss $f(z, y) = \max(0, 1 - yz)$. The loss admits $z \mapsto -y\,\mathbf{1}(yz \le 1)$ as a subgradient and satisfies Assumption 1 for $L = 1$.
Logistic regression We assume $y \in \{-1, 1\}$. The maximum likelihood estimator solves Problem (2) for the logistic loss $f(z, y) = \log(1 + e^{-yz})$. The loss satisfies Assumption 1 for $L = 1$ since $|\partial f/\partial z| = |y|/(1 + e^{yz}) \le 1$.
Quantile regression We consider $y \in \mathbb{R}$ and fix $\tau \in (0, 1)$. Following [23], we assume the $\tau$th conditional quantile of $y$ given $x$ to be $x^T w^*$. We define the quantile loss $f(z, y) = \max(\tau (y - z), (\tau - 1)(y - z))$; note that the hinge loss is a translation of the quantile loss. $f$ satisfies Assumption 1 for $L = \max(\tau, 1 - \tau)$. In addition, it is known [24] that the theoretical quantile loss is minimized at the conditional quantile. For $\tau = 1/2$, the quantile regression loss is proportional to the least absolute deviation loss $f(z, y) = |y - z|/2$, whose L1-regularized version has been studied in [18].
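The three running examples above are short to write down. A minimal sketch (the function names are ours), with the Lipschitz constants noted alongside:

```python
import numpy as np

def hinge_loss(z, y):
    """Hinge loss max(0, 1 - y*z); Lipschitz in z with constant L = 1."""
    return np.maximum(0.0, 1.0 - y * z)

def logistic_loss(z, y):
    """Logistic loss log(1 + exp(-y*z)); Lipschitz in z with constant L = 1."""
    return np.log1p(np.exp(-y * z))

def quantile_loss(z, y, tau):
    """Quantile (pinball) loss max(tau*(y-z), (tau-1)*(y-z));
    Lipschitz in z with constant L = max(tau, 1 - tau)."""
    r = y - z
    return np.maximum(tau * r, (tau - 1.0) * r)
```

For $\tau = 1/2$ the quantile loss equals $|y - z|/2$, i.e. half the absolute deviation, as noted above.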
We additionally assume the uniqueness of the theoretical minimizer and the twice differentiability of the theoretical loss. [25] studied specific conditions under which Assumption 2 holds for the hinge loss (the result extends to the quantile regression loss). Assumption 2 is guaranteed for the logistic loss.
Assumption 2
Differentiability of the theoretical loss: The theoretical minimizer is unique. In addition, the theoretical loss is twice differentiable: we denote by $\nabla \mathcal{L}$ its gradient and by $\nabla^2 \mathcal{L}$ its Hessian matrix. It finally holds:
Our next assumption controls the entries of the design matrix. Let us first recall the definition of a sub-Gaussian random variable [26]:
Definition 1
A random variable $X$ is said to be sub-Gaussian with variance $\sigma^2$ if $\mathbb{E}[X] = 0$ and $\mathbb{E}\left[e^{tX}\right] \le e^{\sigma^2 t^2 / 2}$ for all $t \in \mathbb{R}$.
This variable will be denoted $X \sim \operatorname{subG}(\sigma^2)$. Assumption 3 is then defined as follows:
Assumption 3
Sub-Gaussian entries: the entries $x_{ij}$ of the design matrix satisfy $x_{ij} \sim \operatorname{subG}(\sigma^2)$ for all $i \le n$, $j \le p$.
Note that under Assumptions 1 and 2, the first-order optimality condition holds since $w^*$ minimizes the theoretical loss. In particular, if the entries of the design matrix are bounded, then Hoeffding's lemma guarantees that Assumption 3 holds. This is also true if the design matrix consists of independent samples of a multivariate centered Gaussian distribution. (These examples show that Assumption 3 is mild and considerably weaker than Assumption (A1) in [8], which imposes a finite bound on the L2 norm of each column of the design matrix.)
The next assumption draws inspiration from the restricted eigenvalue conditions defined for all three L1, Slope and Group L1L2 regularizations in the regression setting [7, 3, 15]. For an integer sparsity level, Assumption 4 first ensures that the quadratic form associated with the design is upper-bounded on the cone of sparse vectors. Similarly, its remaining parts ensure that the quadratic form associated with the Hessian matrix is lower-bounded on a family of cones—specific to the regularization used.
Assumption 4
Restricted eigenvalue conditions:

Let . Assumption 4 is satisfied if there exists a nonnegative constant such that almost surely:

Let . Assumption 4 holds if there exists a constant which almost surely satisfies:
where and for every subset , the cone is defined as:

Let . Assumption 4 holds if there exists a constant such that a.s.:
where and for every subset , we define the subset of all indexes across all the groups in . is defined as:
In the SVM framework [8], Assumptions (A3) and (A4) are similar to ours. For logistic regression [9], Assumptions A1 and A2 similarly define dependency and incoherence conditions. For quantile regression, Assumption D.4 in [10] is equivalent to a uniform restricted eigenvalue condition.
Since the theoretical minimizer minimizes the theoretical loss, its gradient vanishes there. In particular, under Assumption 4, the theoretical loss is lower-bounded by a quadratic function on a certain subset surrounding the minimizer. By continuity, we define the maximal radius on which the following lower bound holds:

and for L1 regularization.

and for Slope regularization.

and for Group L1L2 regularization.
The maximal radius depends upon the same parameters as the constants above. We propose the following growth conditions, which relate the number of samples, the dimension of the space, the sparsity levels, the maximal radius, and a confidence parameter.
Assumption 5
Let . Assumptions 5 and 5—respectively defined for L1 and Slope regularizations—are said to hold if:
where and are respectively defined in the following Theorems 1 and 2. In addition, for Group L1L2 regularization, Assumption 5 is said to hold if:
where and are also defined in the following Theorems 1 and 2.
The constants involved depend upon the family of cones corresponding to the regularization used. Note that Assumption 5 is similar to Equation (17) for logistic regression [9]. A similar definition is proposed in the proof of a lemma for quantile regression [10]. Our framework can now be used to derive upper bounds for coefficient estimation, scaling with the problem size parameters and the constants introduced.
2.3 Cone conditions
Similarly to the regression cases for L1, Slope and Group L1L2 regularizations [7, 3, 15], Theorem 1 first derives cone conditions satisfied by a respective solution of Problem (3), (4) or (5). Theorem 1 says that, for each problem, the difference between the theoretical and empirical minimizers belongs to one of the families of cones defined in Assumption 4. These cone conditions are derived by selecting a regularization parameter large enough so that it dominates the subgradient of the loss evaluated at the theoretical minimizer.
Theorem 1
Let , , and assume that Assumptions 1 and 3 are satisfied.
We denote and fix the parameters , and . The following results hold with probability at least .

Let be a solution of the L1 regularized Problem (3) with parameter , and be the subset of indexes of the highest coefficients of . It holds:

Let be a solution of the Sloperegularized Problem (4) with parameter and the sequence of coefficients . It holds:

Let be a solution of the Group L1L2 Problem (5) with parameter , and let be the subset of indexes of the highest subgroups of for the L2 norm. Finally let define the subset of size of all indexes across all the groups in . It holds:
The proof is presented in Appendix B: it uses a new result to control the maximum of independent sub-Gaussian random variables. As a consequence, for the L1-regularized Problem (3), the regularization parameter is of the order of $\sqrt{\log(p/k^*)/n}$. In particular, our conditions are stronger than those of [8], [9] and [18], which all propose a scaling of the order of $\sqrt{\log p/n}$ for the L1-regularized estimator with all three Lipschitz losses considered herein. In addition, for Group L1L2 regularization, our conditions are similarly stronger than those of [15], which considers the least-squares case.
2.4 Restricted strong convexity conditions
The next Theorem 2 says that the loss satisfies a restricted strong convexity condition [27] with curvature and L1 tolerance function. It is derived by combining (i) a supremum result from Theorem 5 presented in Appendix C, (ii) the minimality of the empirical estimator and (iii) the restricted eigenvalue conditions from Assumption 4.
Theorem 2
2.5 Upper bounds for coefficients estimation
Theorem 3
Let $\delta \in (0, 1)$. We consider the same assumptions and notations as in Theorems 1 and 2. In addition, we assume that the growth conditions of Assumption 5 respectively hold for L1, Slope and Group L1L2 regularizations. We select the regularization parameter accordingly for each scheme.
Then the L1 and Slope estimators satisfy, with probability at least $1 - \delta$:
In addition, the Group L1L2 estimator satisfies, with probability at least $1 - \delta$:
where for L1 regularization, for Slope regularization and for Group L1L2 regularization.
The proof is presented in Appendix D. The bounds follow from the cone conditions and the restricted strong convexity conditions derived in Theorems 1 and 2. Theorem 3 holds for any . Thus, we obtain by integration the following bounds in expectation. The proof is presented in Appendix E.
Corollary 1
If the assumptions presented in Theorem 3 are satisfied for a small enough $\delta$, then:
Discussion for L1 and Slope: For L1 and Slope regularizations, our family of estimators reaches a bound scaling as $\sqrt{k^* \log(p/k^*)/n}$. This bound strictly improves over existing results for all three losses with an L1 regularization [8, 9, 18, 10] and matches the best rate known for the least-squares case [3]. We recover our previous result [28] in the more general framework presented herein, which also applies to Group L1L2 regularization. In addition, the L1 regularization parameter uses the sparsity $k^*$. In contrast, similarly to the least-squares case [3], Slope presents the statistical advantage of adapting to unknown sparsity.
Discussion for Group L1L2: For Group L1L2, our family of estimators reaches a bound scaling as $\sqrt{(g^* \log(G/g^*) + m^*)/n}$. This bound improves over the regression case [5], which involves a larger logarithmic term. This is due to the stronger cone condition derived in Theorem 1.
Comparison of both bounds for group-sparse signals: We compare the statistical performance and upper bounds of Group L1L2 regularization to L1 and Slope regularizations when sparsity is structured. Let us first consider two edge cases. (i) If the optimal solution is contained in only one group, the bound for Group L1L2 is lower than the ones for L1 and Slope: Group L1L2 is superior as it strongly exploits the problem structure. (ii) If instead all the groups are of size one, both bounds have a similar first term (due to the cone conditions); however, the second term is worse for the group estimator because of a suboptimal partition choice in Theorem 2 (cf. Appendix C): L1 and Slope are superior.
In the general case, when the signal is efficiently covered by the groups—the group structure is useful—the upper bound for Group L1L2 is lower than the one for L1 and Slope. That is, similarly to the regression case [5], Group L1L2 is superior to L1 for strongly group-sparse signals ([5] does not discuss the superiority of Group L1L2 over Slope, which we do). However, when the groups cover the signal less efficiently, the group structure is not as useful and Group L1L2 is outperformed by L1 and Slope.
3 Empirical analysis
All the estimators studied are convex. In this section, we study their empirical properties in computational settings where the signal is either sparse or group-sparse and the number of variables is large. To this end, we present a proximal gradient algorithm which solves the tractable Problems (3), (4) and (5).
3.1 Smoothing the loss
We denote by $F(w) = \frac{1}{n}\sum_{i=1}^{n} f(x_i^T w, y_i)$ the empirical loss. Problem (2) can be formulated as $\min_{w} F(w) + \Omega(w)$—we drop the L1 constraint in the rest of this section.
Our proximal method requires the objective to be a differentiable loss with Lipschitz continuous gradient. The hinge and quantile regression losses are non-smooth: we propose to use Nesterov's smoothing method [12] to construct a convex function with Lipschitz continuous gradient which approximates these losses for a small value of the smoothing parameter $\mu$.
For the hinge loss case, let us first note that $\max(0, 1 - u) = \max_{0 \le v \le 1} v(1 - u)$, as this maximum is achieved for $v = \mathbf{1}(u \le 1)$. Consequently, the hinge loss can be expressed as a maximum over the unit ball:
We apply the technique suggested by [12] and define, for $\mu > 0$, the smoothed version of the loss:
(6) 
Let us denote the optimal solution of the right-hand side of Equation (6). The gradient of the smoothed loss is expressed as:
(7) 
and its associated Lipschitz constant is derived from the next theorem. The proof is presented in Appendix F. It follows [12] and uses first-order necessary conditions for optimality.
Theorem 4
Let be the highest eigenvalue of . Then is Lipschitz continuous with constant .
Smoothing the quantile regression loss:
The same method applies to the non-smooth quantile regression loss. We first note that it can similarly be written as a maximum of linear functions. Hence the smoothed quantile regression loss is defined analogously and its gradient is:
where the maximizer now ranges over the interval $[\tau - 1, \tau]$. The Lipschitz constant of the smoothed loss is still given by Theorem 4.
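As an illustration of the smoothing step, here is a sketch of the smoothed hinge loss and its derivative with respect to the margin: with the quadratic prox-function, the inner maximizer has a closed form via clipping (the function name and vectorized form are ours):

```python
import numpy as np

def smoothed_hinge(u, mu):
    """Nesterov-smoothed hinge loss and its derivative w.r.t. the margin u.

    hinge(u) = max_{0 <= v <= 1} v * (1 - u); subtracting (mu/2) * v**2 inside
    the max gives the closed-form maximizer v* = clip((1 - u) / mu, 0, 1)."""
    v = np.clip((1.0 - u) / mu, 0.0, 1.0)
    value = v * (1.0 - u) - 0.5 * mu * v ** 2
    return value, -v  # -v is the derivative of the smoothed loss w.r.t. u
```

The per-sample approximation error is at most $\mu/2$, so $\mu$ trades smoothness of the gradient against accuracy of the approximation.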
3.2 Thresholding operators
We now assume that the loss is differentiable with Lipschitz continuous gradient. Following [29, 30], we upper-bound the loss around any point with the quadratic form defined as:
(8) 
The proximal gradient method approximates the solution of Problem by solving the problem
(9) 
Problem (9) can be solved via the following proximal operator (evaluated at the gradient update):
(10) 
We discuss the computation of (9) for the specific choices of regularization considered.
L1 regularization: When $\Omega$ is the L1 norm, the solution is available via componentwise soft-thresholding, where the soft-thresholding operator at level $\lambda$ is $v \mapsto \operatorname{sign}(v) \max(|v| - \lambda, 0)$.
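In code, the componentwise soft-thresholding operator is a one-liner (a numpy-based sketch; the name is ours):

```python
import numpy as np

def soft_threshold(v, lam):
    """Componentwise soft-thresholding:
    argmin_b 0.5 * ||b - v||^2 + lam * ||b||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
```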
Slope regularization: When $\Omega$ is the Slope norm, we note that, at an optimal solution of Problem (10), the signs of the solution and of the input agree [2]. Consequently, we solve the following close relative of the isotonic regression problem [31]:
(11) 
where the input is a decreasing rearrangement of the absolute values of the point to threshold. A solution of Problem (11) corresponds to a solution of Problem (10) after restoring the signs and the original ordering. We use the software provided by [2] in our experiments.
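A compact way to compute this prox is the stack-based pool-adjacent-violators scheme of [2]: sort the absolute values, subtract the Slope weights, average adjacent blocks until the sequence is non-increasing, clip at zero, and undo the sort. A sketch (assuming a non-increasing weight sequence; the implementation details are ours):

```python
import numpy as np

def prox_slope(v, lambdas):
    """Prox of the sorted-L1 (Slope) penalty:
    argmin_b 0.5 * ||b - v||^2 + sum_j lambdas[j] * |b|_(j),
    for a non-increasing weight sequence lambdas."""
    sign = np.sign(v)
    order = np.argsort(-np.abs(v))        # indices sorting |v| in decreasing order
    z = np.abs(v)[order] - lambdas        # may violate the non-increasing constraint
    blocks = []                           # each block stores [sum, count]
    for zi in z:
        blocks.append([zi, 1.0])
        # merge adjacent blocks while their averages violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] <= blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    x = np.concatenate([np.full(int(c), max(s / c, 0.0)) for s, c in blocks])
    out = np.empty_like(x)
    out[order] = x                        # undo the sort
    return sign * out
```

With all weights equal, this reduces to ordinary soft-thresholding, which gives a quick sanity check.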
Group L1L2 regularization: For a group of coefficients, we consider the projection operator onto an L2 ball with radius $\lambda$:
(12) 
From standard results pertaining to the Moreau decomposition [32, 6], we have:
(13) 
We solve Problem (10) with Group L1L2 regularization by noticing the separability of the problem across the different groups, and computing the operator above for every group.
3.3 First order algorithm
Let us denote the mapping in (9) by the operator $T$. The standard version of the proximal gradient descent algorithm performs the updates $w^{k+1} = T(w^k)$ for $k \ge 0$. The accelerated gradient descent algorithm [30], which enjoys a faster convergence rate, performs these updates with a minor modification: the operator is applied at an auxiliary sequence obtained by adding a momentum term, a combination of the two last iterates. We perform these updates until some tolerance criterion is satisfied, or a maximum number of iterations is reached.
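The accelerated updates can be sketched as follows for a generic smooth-gradient/prox pair; `grad_f`, `prox` and the constant `L` are placeholders for the quantities defined above, and the stopping rule is one common choice:

```python
import numpy as np

def fista(grad_f, prox, L, p, max_iter=500, tol=1e-8):
    """Accelerated proximal gradient (FISTA-style) sketch.

    grad_f(w): gradient of the smooth loss; prox(v, step): proximal operator
    of the regularizer with the given step size; L: Lipschitz constant of
    grad_f; p: number of variables."""
    beta = np.zeros(p)
    eta = beta.copy()                     # auxiliary (momentum) sequence
    t = 1.0
    for _ in range(max_iter):
        beta_next = prox(eta - grad_f(eta) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        eta = beta_next + ((t - 1.0) / t_next) * (beta_next - beta)
        if np.linalg.norm(beta_next - beta) <= tol * max(1.0, np.linalg.norm(beta)):
            beta = beta_next
            break
        beta, t = beta_next, t_next
    return beta
```

For instance, pairing `grad_f(b) = b - v` (the gradient of `0.5 * ||b - v||^2`) with the soft-thresholding prox recovers the L1 solution in closed form, a convenient correctness check.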
3.4 Simulations
We compare the sparse estimators studied herein with standard baselines when the signal is sparse or group-sparse. We consider the three examples below with an increasing number of variables.
3.4.1 Example 1: sparse binary classification with hinge and logistic losses
Our first experiments compare the L1 and Slope estimators with an L2 baseline for sparse binary classification problems. We use both the logistic loss and the hinge loss. Our hypothesis for this case is that (i) the estimators' performance will only be affected by the statistical difficulty of the problem, not by the choice of the loss function, and (ii) sparse regularizations will outperform their non-sparse counterparts.
Data generation: We consider samples drawn from a multivariate Gaussian distribution with a correlated covariance matrix. Half of the samples are from the positive class; a smaller class mean makes the statistical setting more difficult since the two classes get closer. The other half are from the negative class and have the opposite mean. We standardize the columns of the input matrix to have unit L2 norm.
Following our high-dimensional study, we fix the number of samples and consider a sequence of increasing dimensions. We study the effect of making the problem statistically harder by bringing the classes closer. Hence we consider two settings, with a small and a large class separation.
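A sketch of this generation process; the equicorrelated covariance (off-diagonal entries equal to `rho`), the sparse mean shift on the first `k` coordinates and the default constants are our illustrative reading of the setup, not the exact experimental values:

```python
import numpy as np

def make_two_class_data(n, p, k, mu, rho=0.3, seed=0):
    """Synthetic two-class design (illustrative constants): correlated
    Gaussian features, class +1 shifted by +mu on the first k coordinates,
    class -1 shifted by -mu; columns rescaled to unit L2 norm."""
    rng = np.random.default_rng(seed)
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = np.repeat([1.0, -1.0], [n // 2, n - n // 2])
    shift = np.zeros(p)
    shift[:k] = mu                                  # sparse mean difference
    X += np.outer(y, shift)                         # opposite means per class
    X /= np.linalg.norm(X, axis=0, keepdims=True)   # unit-norm columns
    return X, y
```

Smaller values of `mu` bring the two classes closer and make the problem statistically harder, matching the two settings described above.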
Competing methods: We compare three approaches for both the logistic loss and the hinge loss:

Method (a) computes a family of L1-regularized estimators for a decreasing geometric sequence of regularization parameters. We start from a value large enough so that the solution of Problem (3) is zero. When the loss is the logistic loss, we use the first-order algorithm presented in Section 3.3. When it is the hinge loss, we directly solve the Linear Programming (LP) L1-SVM problem with the commercial LP solver Gurobi through its Python interface. We present an LP reformulation of the problem in Appendix G.1.

Method (b) computes a family of Slope-regularized estimators, using the first-order algorithm presented in Section 3.3. The Slope coefficients are the ones defined in Theorem 3; the sequence of regularization parameters is identical to that of method (a). When the loss is the hinge loss, we use the smoothing method defined in Section 3.1 with a small smoothing coefficient.

Method (c) returns a family of L2-regularized estimators computed with the scikit-learn package: we start from the value suggested in [33]—and