Sparse (group) learning with Lipschitz loss functions: a unified analysis


Antoine Dedieu
Vicarious AI, Massachusetts Institute of Technology
Email: antoine@vicarious.com, adedieu@mit.edu. Antoine Dedieu’s research was partially supported by the Office of Naval Research (N000141512342), by MIT, and by Vicarious AI.
October 2019
Abstract

We study a family of sparse estimators defined as minimizers of an empirical Lipschitz loss function—examples include the hinge, logistic and quantile regression losses—with a convex, sparse or group-sparse regularization. In particular, we consider the L1 norm on the coefficients, its sorted Slope version, and the Group L1-L2 extension. First, we propose a theoretical framework which simultaneously derives new L2 estimation upper bounds for all three regularization schemes. For L1 and Slope regularizations, our bounds scale with the size of the design matrix and the sparsity of the theoretical loss minimizer, and match the optimal minimax rate achieved for the least-squares case. For Group L1-L2 regularization, our bounds scale with the total number of groups and the number of coefficients in the groups which contain the support of the theoretical loss minimizer, and improve over the least-squares case. We additionally show that, when the signal is strongly group-sparse, Group L1-L2 is superior to L1 and Slope. Our bounds are achieved both in probability and in expectation, under common assumptions in the literature. Second, we propose an accelerated proximal algorithm which computes the convex estimators studied herein when the number of variables is large. We additionally compare the statistical performance of our estimators against standard baselines in settings where the signal is either sparse or group-sparse. Our experimental findings reveal (i) the good empirical performance of L1 and Slope regularizations for sparse binary classification problems, (ii) the superiority of Group L1-L2 regularization for group-sparse classification problems and (iii) the appealing properties of sparse quantile regression estimators for sparse regression problems with heteroscedastic noise.


1 Introduction

We consider training data drawn from a distribution. We fix a loss function and consider a theoretical minimizer of the theoretical loss:

(1)

In the rest of this paper, the loss is assumed to be Lipschitz and to admit a subgradient. We denote the number of non-zeros of the theoretical minimizer as well as its L1 norm. We consider the L1-constrained learning problem

(2)

where the regularization term is a convex penalty function. We study sparse estimators, i.e., estimators with a small number of non-zeros. To this end, we restrict ourselves to a class of sparsity-inducing regularizations. We first consider the L1 regularization, which is well known to encourage sparsity in the coefficients [1]. Problem (2) becomes:

(3)
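For concreteness, an L1-regularized, L1-constrained empirical risk minimization of this form can be sketched as follows in generic notation—the exact constants and notation of (3) are not reproduced here:

    \[ \hat{\beta} \;\in\; \operatorname*{argmin}_{\|\beta\|_1 \le R} \;\; \frac{1}{n}\sum_{i=1}^n f\big(x_i^\top \beta,\, y_i\big) \;+\; \lambda\,\|\beta\|_1 , \]

where f denotes the Lipschitz loss, (x_i, y_i) the i-th sample, λ the regularization parameter and R the radius of the L1 constraint.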

The second problem we study is inspired by the sorted L1 penalty, also known as the Slope norm [2, 3], used in the context of least-squares problems for its statistical properties. We denote the set of permutations of the coefficient indexes and consider a non-increasing sequence of non-negative weights. For such a sequence, we define the L1-constrained Slope estimator as a solution of the convex minimization problem:

(4)

Here, the Slope regularization is a weighted sum of a non-increasing rearrangement of the absolute values of the coefficients, with weights given by the sequence introduced above.

Finally, in several applications, sparsity is structured—the coefficient indices occur in groups known a priori, and it is desirable to select whole groups. In this context, group variants of the L1 norm are often used to improve performance and interpretability [4, 5]. We consider the use of a Group L1-L2 regularization [6] and define the L1-constrained Group L1-L2 problem:

(5)

where each group index refers to a block of coefficients (the groups are disjoint): for a given group, we consider the vector of coefficients belonging to that group and the corresponding set of indexes. In addition, we denote the smallest subset of group indexes such that the support of the theoretical minimizer is included in the union of these groups, the cardinality of this subset, and the sum of the sizes of these groups.

What this paper is about: In this paper, we propose a unified statistical and computational analysis of a large class of estimators, defined as solutions of Problems (3), (4) and (5) when the loss is a convex Lipschitz function which admits a subgradient (cf. Assumption 1, Section 2.2), e.g. the hinge loss, the logistic loss or the quantile regression loss. In a first part, we propose a statistical study which derives new error bounds for the L2 norm of the difference between the empirical and theoretical minimizers, where the empirical minimizer is a solution of Problem (3), (4) or (5) (when no confusion can arise, we drop the dependence upon the regularization parameters). Our bounds are reached under standard assumptions in the literature, and hold with high probability and in expectation. As a critical step, we derive stronger versions of existing cone conditions and restricted strong convexity conditions in the respective Theorems 1 and 2. Our method draws inspiration from the least-squares regression approaches [7, 3, 5] and illustrates the distinction between regression and classification studies. Our framework is flexible enough to apply to coefficient-based and group-based regularizations, while highlighting the differences between these two classes of problems. For Problems (3) and (4), our bounds improve over existing results for all three losses considered with L1 regularization [8, 9, 10], and match the best minimax rate achieved in the least-squares case [11]. For the group Problem (5), our bounds appear to be the first existing results for all three losses; they scale with the total number of groups, the number of groups covering the support of the theoretical minimizer, and the aggregated size of these groups. This rate is better than the existing ones for least-squares problems [5] due to a stronger cone condition (cf. Theorem 1). Similarly to [5], we additionally show that when the signal is strongly group-sparse, Group L1-L2 regularization is superior to L1 and Slope. In a second part, we propose a computational study of our family of estimators. We design a proximal gradient algorithm to solve the fully tractable problems presented herein—our method uses Nesterov smoothing [12] in the case where the loss is non-smooth—and we compare the estimators studied with standard non-sparse baselines through a variety of computational experiments. Our numerical findings highlight the good performance of our estimators for classification and regression settings where the signal is sparse or group-sparse.

Organization of the paper: The rest of this paper is organized as follows. Section 2 builds our framework of study and presents our new theorems: our main statistical results appear in Theorem 3 and Corollary 1. Section 3 proposes a first-order algorithm to solve Problems (3), (4) and (5) and presents a range of synthetic experiments which reveal the empirical advantages of the estimators studied herein.

2 Statistical analysis

In this section, we study the statistical properties of the estimators defined as solutions of Problems (3), (4) and (5) and derive new upper bounds for L2 estimation.

2.1 Existing work on statistical performance

Statistical performance and L2 consistency for high-dimensional linear regression have been widely studied [13, 7, 14, 3, 15]. One important statistical performance measure is the L2 estimation error, defined as the L2 norm of the difference between an estimator and the sparse vector used in generating the true model. For regression problems with least-squares loss, [14] and [11] established a lower bound for the L2 estimation error of a sparse vector, regardless of the input matrix and estimation procedure. This optimal minimax rate is known to be achieved by a global minimizer of an L0-regularized estimator [16]. This minimizer is sparse and adapts to unknown sparsity—the sparsity level does not have to be specified; however, it is intractable in practice. Recently, [3] reached this optimal minimax bound for a Lasso estimator with knowledge of the sparsity level, and proved that the recently introduced and polynomial-time Slope estimator [17] achieves the optimal rate while adapting to unknown sparsity. In a related work, [18] reached a near-optimal rate for an L1-regularized least absolute deviation loss. [10] extended this bound to L1-regularized quantile regression. Finally, in the regime where sparsity is structured, [5] proved an L2 estimation upper bound for a Group L1-L2 estimator—scaling, similarly to our notations, with the number of groups, the number of relevant groups and their aggregated size—and showed that their Group L1-L2 estimator is superior to the standard Lasso when the signal is strongly group-sparse, i.e. the number of relevant groups is low and the signal is efficiently covered by the groups. [15] similarly showed that, in the multitask setting, a Group L1-L2 estimator is superior to the Lasso.

Little work has been done on deriving estimation error bounds for high-dimensional classification problems. Existing work has focused on the analysis of generalization error and risk bounds [19, 20]. Unlike the regression case, for classification problems the sparsity level refers to the theoretical minimizer to be estimated. Recently, [8] proved an upper bound for the L2 estimation of the coefficients of an L1-regularized Support Vector Machine (SVM). The authors recovered the rate proposed by [21], which considered a weighted L1 norm for linear models. [9] obtained a similar bound for an L1-regularized logistic regression estimator in a binary Ising graph. However, this rate is not the best known for a classification estimator: [22] proved an error bound for estimating a single vector through sparse models—including 1-bit compressed sensing and logistic regression—over a bounded set of vectors. Contrary to this work, our approach does not assume a generative vector and applies to a larger class of losses (hinge, quantile regression) and regularizations (Slope, Group L1-L2). We are not aware of any existing result for group regularization in classification settings.

2.2 Framework of study

We design herein our theoretical framework of study, using common assumptions in the literature. Our first assumption requires the loss to be Lipschitz and to admit a subgradient. We list three main examples that fall into this framework.

Assumption 1

Lipschitz loss and existence of a subgradient: The loss is non-negative, convex and Lipschitz continuous with a given constant. In addition, it admits a subgradient at every point.

Support vector machines (SVM) For binary labels in {−1, 1}, the SVM problem learns a classification rule of the form sign(x^T β) by solving Problem (2) with the hinge loss, which maps the margin y x^T β to max(0, 1 − y x^T β). The loss admits a subgradient and satisfies Assumption 1 with Lipschitz constant 1.

Logistic regression We again assume binary labels in {−1, 1}. The maximum likelihood estimator solves Problem (2) for the logistic loss, which maps the margin y x^T β to log(1 + exp(−y x^T β)). The loss satisfies Assumption 1 with Lipschitz constant 1, since the absolute value of its derivative is bounded by 1.

Quantile regression We consider a continuous response and fix a quantile level τ in (0, 1). Following [23], we assume the τth conditional quantile of the response given the features to be linear. We define the quantile loss as the pinball loss, which maps a residual r = y − x^T β to r (τ − 1{r < 0}). (Note that the hinge loss is a translation of the quantile loss for a particular value of the quantile level.) The quantile loss satisfies Assumption 1 with Lipschitz constant max(τ, 1 − τ). In addition, it is known [24] that the quantile loss is minimized at the conditional quantile. For τ = 1/2, the quantile regression loss is proportional to the least absolute deviation loss, whose L1-regularized version has been studied in [18].
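For concreteness, a minimal sketch of these three losses and of one valid subgradient for each is given below (in Python; the margin and residual parameterizations u = y x^T β and r = y − x^T β are conventions of this sketch, not necessarily the paper's notation):

    import numpy as np

    # Sketch of the three Lipschitz losses (pointwise) and one valid subgradient
    # for each, written in the margin/residual variable.

    def hinge(u):                  # SVM loss, Lipschitz constant 1
        return np.maximum(0.0, 1.0 - u)

    def hinge_subgrad(u):          # a valid subgradient: -1 on the violation side
        return -(u < 1.0).astype(float)

    def logistic(u):               # logistic loss, Lipschitz constant 1
        return np.log1p(np.exp(-u))

    def logistic_grad(u):          # differentiable, gradient bounded by 1 in absolute value
        return -1.0 / (1.0 + np.exp(u))

    def quantile(r, tau):          # pinball loss, Lipschitz constant max(tau, 1 - tau)
        return r * (tau - (r < 0.0))

    def quantile_subgrad(r, tau):  # a valid subgradient in the residual
        return tau - (r < 0.0).astype(float)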


We additionally assume the uniqueness of the theoretical minimizer and the twice differentiability of the theoretical loss. [25] studied specific conditions under which Assumption 2 holds for the hinge loss (the result extends to the quantile regression loss). Assumption 2 is guaranteed for the logistic loss.

Assumption 2

Differentiability of the theoretical loss: The theoretical minimizer is unique. In addition, the theoretical loss is twice differentiable: we denote its gradient and its Hessian matrix. Finally, it holds:

Our next assumption controls the entries of the design matrix. Let us first recall the definition of a sub-Gaussian random variable [26]:

Definition 1

A random variable Z is said to be sub-Gaussian with variance σ² if E[Z] = 0 and E[exp(tZ)] ≤ exp(σ²t²/2) for every real t.

Such a variable will be denoted as sub-Gaussian with variance σ². Assumption 3 is then defined as follows:

Assumption 3

Sub-Gaussian entries: each entry of the design matrix is sub-Gaussian.

Note that, under Assumptions 1 and 2, the gradient of the theoretical loss vanishes at the theoretical minimizer since it minimizes the theoretical loss. In particular, if the entries of the design matrix are bounded, then Hoeffding’s lemma guarantees that Assumption 3 holds. This is also true if the design matrix consists of independent samples of a multivariate centered Gaussian distribution. (These examples show that Assumption 3 is mild and considerably weaker than Assumption (A1) in [8], which imposes a finite bound on the L2 norm of each column of the design matrix.)


The next assumption draws inspiration from the restricted eigenvalue conditions defined for the L1, Slope and Group L1-L2 regularizations in the regression setting [7, 3, 15]. Its first part ensures that the quadratic form associated with the Hessian of the theoretical loss is upper-bounded on the cone of sparse vectors. Similarly, its remaining parts ensure that this quadratic form is lower-bounded on a family of cones—specific to the regularization used.

Assumption 4

Restricted eigenvalue conditions:

  • Let . Assumption 4 is satisfied if there exists a non-negative constant such that almost surely:

  • Let . Assumption 4 holds if there exists a constant which almost surely satisfies:

    where, for every subset of indexes, the corresponding cone is defined as:

  • Let . Assumption 4 holds if there exists a constant such that a.s.:

    where the cone is defined as:

  • Let . Assumption 4 holds if there exists a constant such that a.s.:

    where, for every subset of group indexes, we define the subset of all indexes across all the groups in this subset. The corresponding cone is defined as:

In the SVM framework [8], Assumptions (A3) and (A4) are similar to our Assumptions 3 and 4. For logistic regression [9], Assumptions A1 and A2 similarly define dependency and incoherence conditions. For quantile regression, Assumption D.4 of [10] is equivalent to a uniform restricted eigenvalue condition.


Since the theoretical minimizer minimizes the theoretical loss, the gradient of the theoretical loss vanishes there. In particular, under Assumption 4, the theoretical loss is lower-bounded by a quadratic function on a certain subset surrounding the theoretical minimizer. By continuity, we define the maximal radius on which the following lower bound holds:

  • and for L1 regularization.

  • and for Slope regularization.

  • and for Group L1-L2 regularization.

This maximal radius depends upon the same parameters as the corresponding cone. We propose the following growth conditions, which relate the number of samples, the dimension of the space, the sparsity levels, the maximal radius, and a confidence parameter.

Assumption 5

Let . Assumptions 5 and 5—respectively defined for L1 and Slope regularizations—are said to hold if:

where and are respectively defined in the following Theorems 1 and 2. In addition, for Group L1-L2 regularization, Assumption 5 is said to hold if:

where and are also defined in the following Theorems 1 and 2.

The constants involved depend upon the family of cones corresponding to the regularization used. Note that Assumption 5 is similar to Equation (17) for logistic regression [9]. A similar definition is proposed in the proof of a lemma for quantile regression [10]. Our framework can now be used to derive upper bounds for coefficient estimation, scaling with the problem size parameters and the constants introduced.

2.3 Cone conditions

Similarly to the regression cases for L1, Slope and Group L1-L2 regularizations [7, 3, 15], Theorem 1 first derives cone conditions satisfied by a respective solution of Problem (3), (4) or (5). Theorem 1 says that, for each problem, the difference between the theoretical and empirical minimizers belongs to one of the families of cones defined in Assumption 4. These cone conditions are derived by selecting a regularization parameter large enough so that it dominates the subgradient of the loss evaluated at the theoretical minimizer.

Theorem 1

Let , , and assume that Assumptions 1 and 3 are satisfied.
We denote and fix the parameters , and . The following results hold with probability at least .

  • Let be a solution of the L1 regularized Problem (3) with parameter , and be the subset of indexes of the highest coefficients of . It holds:

  • Let be a solution of the Slope-regularized Problem (4) with parameter and the sequence of coefficients . It holds:

  • Let be a solution of the Group L1-L2 Problem (5) with parameter , and let be the subset of indexes of the highest subgroups of for the L2 norm. Finally let define the subset of size of all indexes across all the groups in . It holds:

The proof is presented in Appendix B: it uses a new result to control the maximum of independent sub-Gaussian random variables. As a consequence, for the L1-regularized Problem (3), the regularization parameter is of a smaller order than in existing work. In particular, our conditions are stronger than those of [8], [9] and [18], which all propose the same scaling for L1-regularized estimators with the three Lipschitz losses considered herein. In addition, for Group L1-L2 regularization, our conditions are also stronger than those of [15], which considers a similar scaling for the least-squares case.

2.4 Restricted strong convexity conditions

The next Theorem 2 says that the loss satisfies a restricted strong convexity condition [27] with a quadratic curvature and an L1 tolerance function. It is derived by combining (i) a supremum result from Theorem 5 presented in Appendix C, (ii) the minimality of the empirical estimator and (iii) the restricted eigenvalue conditions from Assumption 4.

Theorem 2

Fix a confidence parameter and assume that Assumptions 1, 2 and 3 hold. In addition, assume that the parts of Assumption 4 corresponding to the regularization used—L1, Slope or Group L1-L2—hold, where the associated quantities are defined in Theorem 1. Finally, let us introduce a shorthand for the regularization term considered.

Then, it holds with probability at least :

where the leading constant differs between the L1 and Slope regularizations on the one hand and the Group L1-L2 regularization on the other hand. The remaining quantities are shorthands for the restricted eigenvalue constant and maximal radius introduced in Assumptions 4 and 5: they depend upon the regularization used.

Our cone conditions could be extended to the use of an L2 tolerance function, with a corresponding change in the scaling of our parameter. In contrast, [8], [9] and [27] propose a parameter with an L2 tolerance function: our results are stronger than existing work.

2.5 Upper bounds for coefficients estimation

We conclude this section by presenting our main bounds in Theorem 3 and Corollary 1.

Theorem 3

We consider the same assumptions and notations as in Theorems 1 and 2. In addition, we assume that the growth conditions of Assumption 5 hold for the respective L1, Slope and Group L1-L2 regularizations. We select the regularization parameters at the appropriate levels for the L1 and Slope regularizations and for the Group L1-L2 regularization.

Then the L1 and Slope estimators satisfy the following bound with high probability:

In addition, the Group L1-L2 estimator satisfies the following bound with high probability:

where for L1 regularization, for Slope regularization and for Group L1-L2 regularization.

The proof is presented in Appendix D. The bounds follow from the cone conditions and the restricted strong convexity conditions derived in Theorems 1 and 2. Theorem 3 holds for any confidence level. Thus, we obtain by integration the following bounds in expectation. The proof is presented in Appendix E.

Corollary 1

If the assumptions presented in Theorem 3 are satisfied for a small enough confidence parameter, then:

Discussion for L1 and Slope: For L1 and Slope regularizations, our family of estimators reaches a bound which strictly improves over existing results for all three losses with an L1 regularization [8, 9, 18, 10] and which matches the best rate known for the least-squares case [3]. We recover our previous result [28] in the more general framework presented herein, which also applies to Group L1-L2 regularization. In addition, the L1 regularization parameter requires knowledge of the sparsity level. In contrast, similarly to the least-squares case [3], Slope presents the statistical advantage of adapting to unknown sparsity.

Discussion for Group L1-L2: For Group L1-L2, our family of estimators reaches a bound which improves over the one known for the regression case [5]. This is due to the stronger cone condition derived in Theorem 1.

Comparison of both bounds for group-sparse signals: We compare the statistical performance and upper bounds of Group L1-L2 regularization to those of L1 and Slope regularizations when sparsity is structured. Let us first consider two edge cases. (i) If all the groups have the same size and the optimal solution is contained in only one group, the bound for Group L1-L2 is lower than the ones for L1 and Slope: Group L1-L2 is superior, as it strongly exploits the problem structure. (ii) If all the groups are of size one, both bounds have a similar first term (due to the cone conditions); however, the second term is worse for the group estimator because of a suboptimal partition choice in Theorem 2 (cf. Appendix C): L1 and Slope are superior.

In the general case, when the signal is efficiently covered by a small number of groups—that is, the group structure is useful—the upper bound for Group L1-L2 is lower than the one for L1 and Slope. That is, similarly to the regression case [5], Group L1-L2 is superior to L1 for strongly group-sparse signals ([5] does not discuss the superiority of Group L1-L2 over Slope, which we do). However, when the number of groups needed to cover the signal is larger, the group structure is not as useful and Group L1-L2 is outperformed by L1 and Slope.

3 Empirical analysis

All the estimators studied are defined by convex problems. In this section, we study their empirical properties in computational settings where the signal is either sparse or group-sparse and the number of variables is large. To this end, we present a proximal gradient algorithm which solves the tractable Problems (3), (4) and (5).

3.1 Smoothing the loss

We denote the empirical loss. Problem (2) can then be formulated as the minimization of the empirical loss plus the regularization term—we drop the L1 constraint in the rest of this section.

Our proximal method requires the loss to be differentiable with a Lipschitz-continuous gradient. The hinge and quantile regression losses are non-smooth: we propose to use Nesterov’s smoothing method [12] to construct convex functions with Lipschitz-continuous gradients which approximate these losses as the smoothing coefficient goes to zero.

For the hinge loss case, let us first note that the pointwise maximum defining the loss is attained at an explicit value of a bounded dual variable. Consequently, the hinge loss can be expressed as a maximum of linear functions over the unit ball:

where the dual feasible set is the one appearing in the maximum above. We apply the technique suggested by [12] and define, for a positive smoothing coefficient, the smoothed version of the loss:

(6)

Let us denote the optimal solution of the maximization problem on the right-hand side of Equation (6). The gradient of the smoothed loss is expressed as:

(7)

and its associated Lipschitz constant is derived from the next theorem. The proof is presented in Appendix F. It follows [12] and uses first-order necessary conditions for optimality.

Theorem 4

Let us consider the largest eigenvalue of the Gram matrix of the design. Then the gradient of the smoothed hinge loss is Lipschitz continuous, with a constant proportional to this eigenvalue and inversely proportional to the smoothing coefficient.
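As an illustration, here is a minimal sketch of the smoothed hinge loss and of its gradient, using a dual representation over the box [0, 1]^n and a 1/n scaling of the empirical loss—both of which are conventions of this sketch and may differ from the paper's exact constants:

    import numpy as np

    def smoothed_hinge_and_grad(beta, X, y, mu):
        """Nesterov-smoothed hinge loss and its gradient (sketch)."""
        n = X.shape[0]
        margin = 1.0 - y * (X @ beta)               # 1 - y_i x_i^T beta
        w = np.clip(margin / (n * mu), 0.0, 1.0)    # optimal smoothed dual variable
        loss = (w @ margin) / n - 0.5 * mu * np.sum(w ** 2)
        grad = -(X.T @ (w * y)) / n                 # gradient of the smoothed loss
        return loss, grad

As the smoothing coefficient mu goes to zero, the returned value approaches the empirical hinge loss.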

Smoothing the quantile regression loss:

The same method applies to the non-smooth quantile regression loss. We first note that the quantile loss can also be written as a maximum of linear functions over a bounded set of dual variables. Hence the smoothed quantile regression loss is defined analogously and its gradient is:

where the optimal dual variable is now clipped to an interval depending on the quantile level. The Lipschitz constant of the gradient is still given by Theorem 4.
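A corresponding sketch for the quantile loss, using the dual representation of the pinball loss over the interval [τ − 1, τ] and the same scaling convention as above:

    import numpy as np

    def smoothed_quantile_and_grad(beta, X, y, tau, mu):
        """Nesterov-smoothed quantile (pinball) loss and its gradient (sketch)."""
        n = X.shape[0]
        r = y - X @ beta                             # residuals
        w = np.clip(r / (n * mu), tau - 1.0, tau)    # optimal smoothed dual variable
        loss = (w @ r) / n - 0.5 * mu * np.sum(w ** 2)
        grad = -(X.T @ w) / n
        return loss, grad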

3.2 Thresholding operators

We now assume that the loss is differentiable with a Lipschitz-continuous gradient. Following [29, 30], we upper-bound the loss around any point with the quadratic form defined as:

(8)
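For reference, the standard form of such a quadratic majorization—written here in generic notation, which may differ from the constants used in (8)—is:

    \[ \ell(\beta) \;\le\; \ell(\tilde{\beta}) \;+\; \nabla \ell(\tilde{\beta})^\top (\beta - \tilde{\beta}) \;+\; \frac{L}{2}\, \|\beta - \tilde{\beta}\|_2^2 , \]

where \ell denotes the smooth (or smoothed) empirical loss and L the Lipschitz constant of its gradient.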

The proximal gradient method approximates the solution of the regularized problem by iteratively solving the problem

(9)

Problem (9) can be solved via the following proximal operator:

(10)

We discuss the computation of (9) for the specific regularizations considered.

L1 regularization: When the regularizer is the L1 norm, the proximal operator is available via componentwise soft-thresholding, where the soft-thresholding operator at level λ maps a scalar c to sign(c) max(|c| − λ, 0).
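A minimal sketch of this operator, vectorized over the coefficients:

    import numpy as np

    def soft_threshold(c, lam):
        """Componentwise soft-thresholding: proximal operator of lam * ||.||_1."""
        return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)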

Slope regularization: When the regularizer is the Slope norm—with a non-increasing sequence of weights—we note that, at an optimal solution to Problem (10), the signs of the solution and of the input coefficients are the same [2]. Consequently, we solve the following close relative of the isotonic regression problem [31]:

(11)

where the input is a decreasing rearrangement of the absolute values of the coefficients. A solution of Problem (11) yields, after restoring the signs and the original ordering, a solution of Problem (10). We use the software provided by [2] in our experiments.
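For reference, a minimal sketch of this proximal operator built on an off-the-shelf isotonic regression solver (scikit-learn) rather than on the software of [2]; the clipping step enforces non-negativity after the monotone projection:

    import numpy as np
    from sklearn.isotonic import isotonic_regression

    def slope_prox(c, lambdas):
        """Proximal operator of the Slope norm with non-increasing weights (sketch).

        Sort the absolute values in decreasing order, subtract the weights,
        project onto the set of non-increasing sequences (isotonic regression),
        clip at zero, then undo the sorting and restore the signs."""
        sign, abs_c = np.sign(c), np.abs(c)
        order = np.argsort(-abs_c)                  # decreasing rearrangement
        z = abs_c[order] - lambdas
        x = np.clip(isotonic_regression(z, increasing=False), 0.0, None)
        out = np.empty_like(c, dtype=float)
        out[order] = x                              # undo the sorting
        return sign * out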

Group L1-L2 regularization: For each group, we consider the projection operator onto an L2 ball with a radius given by the regularization level:

(12)

From standard results pertaining to the Moreau decomposition [32, 6], we have:

(13)

We solve Problem (10) with Group L1-L2 regularization by noticing the separability of the problem across the different groups, and by computing the corresponding operator for every group.
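A minimal sketch of the resulting groupwise operator (block soft-thresholding); the `groups` argument, a list of disjoint index arrays, is a convention of this sketch:

    import numpy as np

    def group_l1_l2_prox(c, lam, groups):
        """Blockwise proximal operator of lam * sum of groupwise L2 norms (sketch)."""
        out = np.array(c, dtype=float)
        for g in groups:                       # groups: list of disjoint index arrays
            norm = np.linalg.norm(out[g])
            # shrink the block towards zero; set it exactly to zero below the threshold
            out[g] = 0.0 if norm <= lam else out[g] * (1.0 - lam / norm)
        return out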

3.3 First order algorithm

Let us denote the mapping in (9) by a proximal gradient operator. The standard version of the proximal gradient descent algorithm iteratively applies this operator. The accelerated gradient descent algorithm [30], which enjoys a faster convergence rate, performs the updates with a minor modification: it starts from an initial point, maintains an auxiliary extrapolation sequence, and alternates proximal gradient steps with momentum steps. We perform these updates until some tolerance criterion is satisfied or a maximum number of iterations is reached.
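Putting the pieces together, a minimal sketch of this accelerated scheme (FISTA-style extrapolation) written against the gradient and proximal routines sketched above; the step size 1/L for a gradient Lipschitz constant L, the zero initialization and the stopping rule are all conventions of this sketch:

    import numpy as np

    def accelerated_proximal_gradient(grad_fn, prox_fn, p, lipschitz,
                                      max_iter=1000, tol=1e-6):
        """Accelerated proximal gradient (sketch).

        grad_fn(beta) returns the gradient of the smooth (or smoothed) loss;
        prox_fn(c, step) returns the proximal operator of the regularizer
        scaled by `step` (for instance a wrapper around soft_threshold,
        slope_prox or group_l1_l2_prox)."""
        step = 1.0 / lipschitz
        beta = np.zeros(p)
        z, t = beta.copy(), 1.0
        for _ in range(max_iter):
            beta_new = prox_fn(z - step * grad_fn(z), step)
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
            z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
            if np.linalg.norm(beta_new - beta) <= tol * max(1.0, np.linalg.norm(beta)):
                return beta_new
            beta, t = beta_new, t_new
        return beta

For the L1-regularized problem with parameter lam, for instance, one would pass prox_fn = lambda c, step: soft_threshold(c, step * lam).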

3.4 Simulations

We compare the sparse estimators studied herein with standard baselines when the signal is sparse or group-sparse. We consider the three examples below with an increasing number of variables.

3.4.1 Example 1: sparse binary classification with hinge and logistic losses

Our first experiments compare the L1 and Slope estimators with an L2 baseline for sparse binary classification problems. We use both the logistic loss and the hinge loss. Our hypothesis for this case is that (i) the estimators' performance will only be affected by the statistical difficulty of the problem, not by the choice of the loss function, and (ii) sparse regularizations will outperform their non-sparse counterparts.

Data Generation: We consider samples from a multivariate Gaussian distribution with a prescribed covariance structure. Half of the samples are from the first class and have a first mean vector; a smaller separation between the class means makes the statistical setting more difficult since the two classes get closer. The other half are from the second class and have a second mean vector. We standardize the columns of the input matrix to have unit L2-norm.

Following our high-dimensional study, we fix the number of samples and consider a sequence of increasing numbers of variables. We study the effect of making the problem statistically harder by bringing the classes closer. Hence we consider two settings, one with a small separation between the classes and one with a large separation.
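To convey the flavor of this setup, a hypothetical data-generation sketch is given below; the actual sample size, dimension, correlation, sparsity and class separation used in our experiments are not reproduced here, and the equicorrelation structure and symmetric class means are assumptions of the sketch:

    import numpy as np

    # Hypothetical placeholders: n, p, rho, k and mu_val are NOT the paper's values.
    rng = np.random.default_rng(0)
    n, p, rho, k, mu_val = 100, 1000, 0.1, 10, 0.5

    cov = np.full((p, p), rho)
    np.fill_diagonal(cov, 1.0)                     # assumed equicorrelation structure
    mu = np.zeros(p)
    mu[:k] = mu_val                                # assumed sparse mean shift
    y = np.repeat([1.0, -1.0], n // 2)             # two balanced classes
    X = rng.multivariate_normal(np.zeros(p), cov, size=n) + np.outer(y, mu)
    X /= np.linalg.norm(X, axis=0, keepdims=True)  # unit L2-norm columns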

Competing methods: We compare 3 approaches for both the logistic loss and the hinge loss:

  • Method (a) computes a family of L1-regularized estimators for a decreasing geometric sequence of regularization parameters. We start from a value large enough so that the solution of Problem (3) is zero, and we fix the ratio of the geometric sequence. When the loss is the logistic loss, we use the first-order algorithm presented in Section 3.3. When the loss is the hinge loss, we directly solve the Linear Programming (LP) formulation of the L1-SVM problem with the commercial LP solver Gurobi through its Python interface. We present an LP reformulation of the problem in Appendix G.1.

  • Method (b) computes a family of Slope-regularized estimators, using the first-order algorithm presented in Section 3.3. The Slope coefficients are the ones defined in Theorem 3; the sequence of regularization parameters is identical to that of method (a). When the loss is the hinge loss, we use the smoothing method defined in Section 3.1 with a fixed smoothing coefficient.

  • Method (c) returns a family of L2-regularized estimators computed with the scikit-learn package: we start from the value suggested in [33]—and