An error bound for Lasso and Group Lasso in high dimensions


Abstract

We leverage recent advances in high-dimensional statistics to derive new L2 estimation upper bounds for Lasso and Group Lasso in high dimensions. For Lasso, our bounds scale as $\sqrt{k^*\log(p/k^*)/n}$, where $n\times p$ is the size of the design matrix and $k^*$ the number of non-zeros of the ground truth $\beta^*$, and match the optimal minimax rate. For Group Lasso, our bounds scale as $\sqrt{(s^*\log(G/s^*)+m^*)/n}$, where $G$ is the total number of groups, $s^*$ the number of relevant groups and $m^*$ the number of coefficients in the groups which contain $\beta^*$, and improve over existing results. We additionally show that when the signal is strongly group-sparse, Group Lasso is superior to Lasso.


1 Introduction

We consider the Gaussian linear regression framework, with response $y\in\mathbb{R}^n$ and model matrix $X\in\mathbb{R}^{n\times p}$:

\[
y=X\beta^*+\epsilon \tag{1}
\]

where the entries of $\epsilon$ are independent realizations of a sub-Gaussian random variable (as defined in [13]) with variance $\sigma^2$. We consider settings where $\beta^*$ is sparse, i.e. has a small number $k^*$ of non-zeros. The L1-regularized least squares estimator (also known as the Lasso estimator [14]) is well known to encourage sparsity in the coefficients. It is defined as a solution of the convex optimization problem:

\[
\min_{\beta\in\mathbb{R}^p}\ \tfrac{1}{2}\|y-X\beta\|_2^2+\lambda\|\beta\|_1. \tag{2}
\]
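Problem (2) can be minimized by proximal gradient descent (ISTA), whose proximal operator is elementwise soft-thresholding. The sketch below is purely illustrative (the solver and its step size are ours, not the paper's):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    beta = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient of the smooth part
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

Larger $\lambda$ zeroes out more coefficients; the analysis below concerns the choice of $\lambda$ that balances the noise term against the penalty.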

In several applications, sparsity is structured: the coefficient indices of $\beta^*$ occur in groups known a priori, and it is desirable to select a whole group. In this context, group variants of the L1 regularization are often used to improve the performance and interpretability [15, 10]. We consider the Group L1-L2 regularization [1] and define the Group Lasso estimator as a solution of the convex problem:

\[
\min_{\beta\in\mathbb{R}^p}\ \tfrac{1}{2}\|y-X\beta\|_2^2+\lambda_G\sum_{g=1}^{G}\|\beta_g\|_2 \tag{3}
\]

where $g\in\{1,\dots,G\}$ denotes a group index (the groups are disjoint), $\beta_g$ denotes the vector of coefficients belonging to group $g$, $I_g$ the corresponding set of indexes, and $n_g=|I_g|$. In addition, we denote $J^*$ the smallest subset of group indexes such that the support of $\beta^*$ is included in the union of these groups, $s^*$ the cardinality of $J^*$, and $m^*$ the sum of the sizes of these groups.
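For intuition, the Group L1-L2 penalty in Problem (3) can likewise be handled by proximal gradient descent: its proximal operator shrinks each group vector toward zero in L2 norm (block soft-thresholding). A minimal illustrative sketch, assuming the groups partition $\{1,\dots,p\}$ (the solver is ours, not the paper's):

```python
import numpy as np

def group_soft_threshold(v, t):
    """Block soft-thresholding: the proximal operator of t * ||.||_2 on one group."""
    nrm = np.linalg.norm(v)
    if nrm <= t:
        return np.zeros_like(v)
    return (1.0 - t / nrm) * v

def group_lasso_pg(X, y, lam, groups, n_iter=500):
    """Minimize 0.5 * ||y - X b||_2^2 + lam * sum_g ||b_g||_2 by proximal gradient.

    `groups` is a list of index arrays partitioning range(p).
    """
    beta = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / L   # gradient step
        for g in groups:                       # prox is separable across groups
            beta[g] = group_soft_threshold(z[g], lam / L)
    return beta
```

Groups whose shrunk norm falls below the threshold are set exactly to zero, which is how whole-group selection arises.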

Existing work on statistical performance

Statistical performance and L2 consistency for high-dimensional linear regression have been widely studied [6, 3, 5, 2, 11]. One important statistical performance measure is the L2 estimation error $\|\hat\beta-\beta^*\|_2$, where $\beta^*$ is the $k^*$-sparse ground truth used in Equation (1) and $\hat\beta$ is an estimator. For regression problems with least-squares loss, [5] and [12] established a $\sigma\sqrt{k^*\log(p/k^*)/n}$ lower bound for estimating the L2 norm of a sparse vector, regardless of the input matrix and estimation procedure. This optimal minimax rate is known to be achieved by a global minimizer of an L0-regularized estimator [4], which is, however, intractable in practice. Recently, [2] reached this optimal minimax bound for a Lasso estimator, improving over existing results [3], and for the recently introduced and tractable Slope estimator. In addition, when sparsity is structured, [10] proved a $\sqrt{(s^*\log G+m^*)/n}$ L2 estimation upper bound for a Group Lasso estimator (where, similarly to our notations, $G$ is the number of groups, $s^*$ the number of relevant groups and $m^*$ their aggregated size) and showed that this estimator is superior to standard Lasso when the signal is strongly group-sparse, i.e. when $m^*$ is low and the signal is efficiently covered by the groups. [11] similarly showed that, in the multitask setting, a Group Lasso estimator is superior to Lasso.

What this paper is about: In this short paper, we propose a statistical framework to study the L2 estimation performance of Lasso and Group Lasso in high dimensions and we derive new error bounds for these estimators. To this end, we adapt proof techniques recently developed for high-dimensional classification studies [8, 7] to the least squares case. Our bounds are reached under standard assumptions and hold with high probability and in expectation. For Lasso, our bounds scale as $\sqrt{k^*\log(p/k^*)/n}$: they reach the optimal minimax rate [12] while matching the best results [2]. For Group Lasso, our bounds scale as $\sqrt{(s^*\log(G/s^*)+m^*)/n}$ and improve over existing results [10], due to a stronger cone condition (cf. Theorem 1). We additionally recover the result that when the signal is strongly group-sparse, Group Lasso is superior to Lasso.

2 Statistical analysis

Similarly to the regression literature [3, 2, 11], we reach our bounds for the Lasso and Group Lasso estimators by assuming restricted eigenvalue conditions.

2.1 Restricted eigenvalue conditions

Assumption 1 ensures that the quadratic form associated with the Hessian matrix $X^TX$ is lower-bounded on a family of cones of $\mathbb{R}^p$ specific to the regularization used. For Group Lasso, Assumption 1 additionally assumes an upper bound for the quadratic form associated with $X_g^TX_g$ on each group.

Assumption 1
• Let $k\le p$ and $\gamma_1,\gamma_2>0$. Assumption 1 holds for Lasso if there exists $\kappa(k,\gamma_1,\gamma_2)>0$ which almost surely satisfies:

\[
0<\kappa(k,\gamma_1,\gamma_2)\le\inf_{|S|\le k}\ \inf_{z\in\Lambda(S,\gamma_1,\gamma_2)}\frac{z^TX^TXz}{n\|z\|_2^2},
\]

where, for every subset $S\subset\{1,\dots,p\}$ with $|S|\le k$, the cone $\Lambda(S,\gamma_1,\gamma_2)$ is defined as:

\[
\Lambda(S,\gamma_1,\gamma_2)=\left\{z\in\mathbb{R}^p:\ \|z_{S^c}\|_1\le\gamma_1\|z_S\|_1+\gamma_2\|z_S\|_2\right\}.
\]
• Let $s\le G$ and $\epsilon_1,\epsilon_2>0$. Assumption 1 holds for Group Lasso if there exists a constant $\kappa(s,\epsilon_1,\epsilon_2)>0$ such that a.s.:

\[
0<\kappa(s,\epsilon_1,\epsilon_2)\le\inf_{|J|\le s}\ \inf_{z\in\Omega(J,\epsilon_1,\epsilon_2)}\frac{z^TX^TXz}{n\|z\|_2^2},
\]

where, for every subset $J\subset\{1,\dots,G\}$ with $|J|\le s$, we define $T(J)$ as the subset of all indexes across all the groups in $J$. The cone $\Omega(J,\epsilon_1,\epsilon_2)$ is defined as:

\[
\Omega(J,\epsilon_1,\epsilon_2)=\left\{z\in\mathbb{R}^p:\ \sum_{g\notin J}\|z_g\|_2\le\epsilon_1\sum_{g\in J}\|z_g\|_2+\epsilon_2\|z_{T(J)}\|_2\right\}.
\]
• Finally, let $X_g$ denote the restriction of the input matrix to the columns of group $g$, and let $\mu_{\max}(X_g^TX_g)$ be the highest eigenvalue of the positive semi-definite symmetric matrix $X_g^TX_g$. Assumption 1 is satisfied for Group Lasso if it almost surely holds:

\[
\sup_{g=1,\dots,G}\mu_{\max}(X_g^TX_g)\le1.
\]
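The last condition can be enforced in practice by rescaling each block $X_g$ by its spectral norm (a standard normalization, not a statement from the paper). A quick numerical check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 12))
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# Rescale each block so that mu_max(Xg^T Xg) = ||Xg||_2^2 = 1.
for g in groups:
    X[:, g] /= np.linalg.norm(X[:, g], 2)

mu_max = max(np.linalg.eigvalsh(X[:, g].T @ X[:, g]).max() for g in groups)
print(mu_max <= 1.0 + 1e-9)  # → True
```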

2.2 Cone conditions

Similarly to existing work [3, 2, 11], Theorem 1 derives a cone condition satisfied by a Lasso or Group Lasso estimator. In particular, Theorem 1 says that the difference $h$ between the estimator and the ground truth belongs to one of the families of cones defined in Assumption 1. The cone conditions are derived by selecting a regularization parameter large enough so that it dominates the gradient of the least squares loss evaluated at the theoretical minimizer $\beta^*$.

Theorem 1

Let $\delta>0$ and $\alpha>1$. The following results hold with probability at least $1-\delta$:

• Let $\hat\beta_1$ be a solution of the Lasso Problem (2) with parameter $\lambda=34\alpha\sigma\sqrt{\log(2pe/k^*)\log(1/\delta)}$, and let $S_0$ be the subset of indexes of the $k^*$ highest coefficients of $h_1:=\hat\beta_1-\beta^*$. It holds:

\[
h_1\in\Lambda\!\left(S_0,\ \gamma_1^*:=\frac{\alpha}{\alpha-1},\ \gamma_2^*:=\frac{\sqrt{k^*}}{\alpha-1}\right).
\]
• Let us assume that Assumption 1 holds. Let $\hat\beta_{L1-L2}$ be a solution of the Group Lasso Problem (3) with the parameter $\lambda_G$ defined in the proof. Let $J_0$ be the subset of indexes of the $s^*$ highest groups of $h_{L1-L2}:=\hat\beta_{L1-L2}-\beta^*$ for the L2 norm. We additionally denote $m_0$ the total size of the $s^*$ largest groups and assume $m_0\le\gamma m^*$ for some $\gamma>0$. It then holds:

\[
h_{L1-L2}\in\Omega\!\left(J_0,\ \epsilon_1^*:=\frac{\alpha}{\alpha-1},\ \epsilon_2^*:=\frac{\sqrt{s^*}}{\alpha-1}\right).
\]

The proof is presented in Appendix A: it uses a new result from [8] to control the maximum of sub-Gaussian random variables. As a consequence, the regularization parameter for Lasso scales with $\sqrt{\log(p/k^*)\log(1/\delta)}$ and is stronger than existing results [3]. For Group Lasso, our parameter scales with $\sqrt{\log(G/s^*)}$ and improves over [10], which considers a scaling in $\sqrt{\log G}$.

2.3 Upper bounds for L2 coefficients estimation

We now state our main bounds in Theorem 2 and Corollary 1.

Theorem 2

Let $\delta>0$. We consider the same assumptions and notations as in Theorem 1.

• If Assumption 1 holds, the Lasso estimator satisfies with probability at least $1-\delta$:

\[
\|\hat\beta_1-\beta^*\|_2\lesssim\frac{\alpha\sigma}{\kappa^*}\sqrt{\frac{k^*\log(p/k^*)\log(1/\delta)}{n}}.
\]
• If Assumption 1 holds, the Group Lasso estimator satisfies with the same probability:

\[
\|\hat\beta_{L1-L2}-\beta^*\|_2\lesssim\frac{\alpha\sigma}{\kappa^*}\sqrt{\frac{s^*\log(G/s^*)\log(1/\delta)+\gamma m^*}{n}},
\]

where $\kappa^*=\kappa(k^*,\gamma_1^*,\gamma_2^*)$ for Lasso and $\kappa^*=\kappa(s^*,\epsilon_1^*,\epsilon_2^*)$ for Group Lasso.

The proof is presented in Appendix B. The bounds directly follow from the cone condition proofs and the use of the restricted eigenvalue assumptions. Theorem 2 holds for any $\delta>0$. Thus, we obtain by integration the following bounds in expectation. The proof is presented in Appendix C.

Corollary 1

The bounds presented in Theorem 2 additionally hold in expectation, that is:

\[
\mathbb{E}\|\hat\beta_1-\beta^*\|_2\lesssim\frac{\alpha\sigma}{\kappa^*}\sqrt{\frac{k^*\log(p/k^*)}{n}},
\]
\[
\mathbb{E}\|\hat\beta_{L1-L2}-\beta^*\|_2\lesssim\frac{\alpha\sigma}{\kappa^*}\sqrt{\frac{s^*\log(G/s^*)+\gamma m^*}{n}}.
\]

Discussion:

For Lasso, our bounds scale as $\sqrt{k^*\log(p/k^*)/n}$. They improve over [3] and match the best existing result [2] while reaching the optimal minimax rate. For Group Lasso, our bounds scale as $\sqrt{(s^*\log(G/s^*)+\gamma m^*)/n}$ and improve over [10]. This is due to the stronger cone condition derived in Theorem 1. For both cases, our bounds reach the same scaling as the respective L1 and Group L1-L2 regularizations discussed in [8], which considers a general learning problem with Lipschitz loss functions (including the hinge, logistic and quantile regression losses).

Comparison for group-sparse signals:

We compare the statistical performance and upper bounds for Lasso and Group Lasso when sparsity is structured. Let us first consider two edge cases. (i) If all the groups are of size $k^*$ and the optimal solution is contained in only one group, that is, $G=p/k^*$, $s^*=1$ and $m^*=k^*$, the rate for Group Lasso is lower than the one for Lasso: Group Lasso is superior as it strongly exploits the problem structure. (ii) If now all the groups are of size one, that is, $G=p$, $s^*=k^*$ and $m^*=k^*$, then Group Lasso has no advantage over Lasso.

Let us now consider the general case. If $s^*\log(G/s^*)+\gamma m^*\ll k^*\log(p/k^*)$, then the signal is efficiently covered by the groups (the group structure is useful) and the Group Lasso rate is lower than the Lasso one. That is, similarly to the regression case [10], Group Lasso is superior to Lasso for strongly group-sparse signals. However, when $m^*$ is larger, the group structure does not cover the signal as efficiently and Group Lasso is outperformed by Lasso.
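To make the comparison concrete, one can plug hypothetical sizes into the two rates, treating the constant $\gamma$ as $1$ for illustration:

```python
import math

# Hypothetical sizes: p = 10_000 features split into G = 100 groups of 100,
# with the signal contained in a single group (edge case (i)).
p, G = 10_000, 100
k_star, s_star, m_star = 100, 1, 100

lasso_rate = k_star * math.log(p / k_star)            # k* log(p/k*)
group_rate = s_star * math.log(G / s_star) + m_star   # s* log(G/s*) + m*, gamma = 1

# Edge case (ii): all groups of size one (G = p, s* = k*, m* = k*).
one_rate = k_star * math.log(p / k_star) + k_star

print(group_rate < lasso_rate <= one_rate)  # → True
```

In the group-sparse case the Group Lasso rate is much smaller; with size-one groups it can only be larger than the Lasso rate.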

Appendix A Proof of Theorem 1

We use the minimality of $\hat\beta$ and Lemma 4 from [8] to derive the cone conditions:

Lemma 1

(Lemma 4, [8]) Let $\zeta_1,\dots,\zeta_p$ be sub-Gaussian random variables with variance $\sigma^2$. We denote by $\zeta_{(1)}\ge\dots\ge\zeta_{(p)}$ a non-increasing rearrangement of $\zeta_1,\dots,\zeta_p$ and define the coefficients $\lambda_j=\sqrt{\log(2p/j)},\ j=1,\dots,p$. For $\delta>0$, it holds with probability at least $1-\delta$:

\[
\sup_{j=1,\dots,p}\frac{\zeta_{(j)}}{\sigma\lambda_j}\le12\sqrt{\log(1/\delta)}.
\]

Proof:

We first present the proof for the Lasso estimator before adapting it to Group Lasso.

Proof for Lasso:

$\hat\beta$ denotes herein the Lasso estimator; it is a solution of the Lasso Problem (2), hence:

\[
\tfrac{1}{2}\|y-X\hat\beta\|_2^2+\lambda\|\hat\beta\|_1\le\tfrac{1}{2}\|y-X\beta^*\|_2^2+\lambda\|\beta^*\|_1=\tfrac{1}{2}\|\epsilon\|_2^2+\lambda\|\beta^*\|_1.
\]

Since we have defined $h=\hat\beta-\beta^*$, it holds:

\[
\tfrac{1}{2}\|y-X\hat\beta\|_2^2=\tfrac{1}{2}\|X\beta^*-X\hat\beta\|_2^2+\epsilon^T(X\beta^*-X\hat\beta)+\tfrac{1}{2}\|\epsilon\|_2^2=\tfrac{1}{2}\|Xh\|_2^2-(X^T\epsilon)^Th+\tfrac{1}{2}\|\epsilon\|_2^2.
\]

We have defined $S^*$ as the support of $\beta^*$ and $S_0$ as the set of the $k^*$ largest coefficients of $h$. We then have:

\begin{align*}
\tfrac{1}{2}\|Xh\|_2^2&\le(X^T\epsilon)^Th+\lambda\|\beta^*_{S^*}\|_1-\lambda\|\hat\beta_{S^*}\|_1-\lambda\|\hat\beta_{(S^*)^c}\|_1\\
&\le(X^T\epsilon)^Th+\lambda\|h_{S^*}\|_1-\lambda\|h_{(S^*)^c}\|_1\\
&\le(X^T\epsilon)^Th+\lambda\|h_{S_0}\|_1-\lambda\|h_{(S_0)^c}\|_1. \tag{4}
\end{align*}

We now upper-bound the quantity $(X^T\epsilon)^Th$. To this end, we denote $g=X^T\epsilon$ and we introduce $|g|_{(1)}\ge\dots\ge|g|_{(p)}$ a non-increasing rearrangement of $|g_1|,\dots,|g_p|$. We assume without loss of generality that $|h_1|\ge\dots\ge|h_p|$. Lemma 1 gives, with probability at least $1-\delta$:

\begin{align*}
(X^T\epsilon)^Th=\sum_{j=1}^pg_jh_j&\le\sum_{j=1}^p|g_j||h_j|=\sum_{j=1}^p\frac{|g|_{(j)}}{2\sqrt{2}\sigma\lambda_j}\,2\sqrt{2}\sigma\lambda_j|h_{(j)}|\\
&\le\sup_{j=1,\dots,p}\left\{\frac{|g|_{(j)}}{2\sqrt{2}\sigma\lambda_j}\right\}2\sqrt{2}\sigma\sum_{j=1}^p\lambda_j|h_{(j)}|\\
&\le24\sqrt{2}\,\sigma\sqrt{\log(1/\delta)}\sum_{j=1}^p\lambda_j|h_{(j)}|\quad\text{with Lemma 1}\\
&\le34\,\sigma\sqrt{\log(1/\delta)}\sum_{j=1}^p\lambda_j|h_j|\quad\text{since }\lambda_1\ge\dots\ge\lambda_p\text{ and }|h_1|\ge\dots\ge|h_p|\\
&\le34\,\sigma\sqrt{\log(1/\delta)}\left(\sum_{j=1}^{k^*}\lambda_j|h_j|+\lambda_{k^*}\|h_{(S_0)^c}\|_1\right). \tag{5}
\end{align*}

\[
\sum_{j=1}^{k^*}\lambda_j|h_j|\le\sqrt{\sum_{j=1}^{k^*}\lambda_j^2}\,\|h_{S_0}\|_2\le\sqrt{k^*\log(2pe/k^*)}\,\|h_{S_0}\|_2,
\]

where we have used Stirling's formula to obtain

\[
\sum_{j=1}^{k^*}\lambda_j^2=\sum_{j=1}^{k^*}\log(2p/j)=k^*\log(2p)-\log(k^*!)\le k^*\log(2p)-k^*\log(k^*/e)=k^*\log(2pe/k^*).
\]
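This Stirling-type bound is easy to verify numerically (the values of $p$ and $k$ below are arbitrary):

```python
import math

# Check sum_{j=1}^{k} log(2p/j) = k log(2p) - log(k!) <= k log(2pe/k),
# which holds since k! >= (k/e)^k.
for p in (50, 1000):
    for k in (1, 5, 20):
        lhs = sum(math.log(2 * p / j) for j in range(1, k + 1))
        rhs = k * math.log(2 * math.e * p / k)
        assert lhs <= rhs + 1e-12
print("bound verified")  # → bound verified
```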

Theorem 1 defines $\lambda=34\alpha\sigma\sqrt{\log(2pe/k^*)\log(1/\delta)}$. Because $34\sigma\sqrt{\log(1/\delta)}\,\lambda_{k^*}\le\lambda/\alpha$, we can pair Equations (4) and (5) to obtain with probability at least $1-\delta$:

\[
\tfrac{1}{2}\|Xh\|_2^2\le(X^T\epsilon)^Th+\lambda\|h_{S_0}\|_1-\lambda\|h_{(S_0)^c}\|_1\le\frac{\lambda}{\alpha}\left(\sqrt{k^*}\|h_{S_0}\|_2+\|h_{(S_0)^c}\|_1\right)+\lambda\|h_{S_0}\|_1-\lambda\|h_{(S_0)^c}\|_1. \tag{6}
\]

As a first consequence, Equation (6) implies that with probability at least $1-\delta$:

\[
\lambda\|h_{(S_0)^c}\|_1-\frac{\lambda}{\alpha}\|h_{(S_0)^c}\|_1\le\lambda\|h_{S_0}\|_1+\frac{\lambda}{\alpha}\sqrt{k^*}\|h_{S_0}\|_2,
\]

which is equivalent to saying that with probability at least $1-\delta$:

\[
\|h_{(S_0)^c}\|_1\le\frac{\alpha}{\alpha-1}\|h_{S_0}\|_1+\frac{\sqrt{k^*}}{\alpha-1}\|h_{S_0}\|_2.
\]

We conclude that $h\in\Lambda(S_0,\gamma_1^*,\gamma_2^*)$ with probability at least $1-\delta$.

Proof for Group Lasso:

$\hat\beta$ denotes herein the Group Lasso estimator; it is a solution of the Group Lasso Problem (3), hence:

\[
\tfrac{1}{2}\|y-X\hat\beta\|_2^2+\lambda_G\sum_{g=1}^G\|\hat\beta_g\|_2\le\tfrac{1}{2}\|y-X\beta^*\|_2^2+\lambda_G\sum_{g=1}^G\|\beta^*_g\|_2=\tfrac{1}{2}\|\epsilon\|_2^2+\lambda_G\sum_{g=1}^G\|\beta^*_g\|_2.
\]

By definition, the support of $\beta^*$ is included in the union of the groups in $J^*$ and $J_0$ is the subset of indexes of the $s^*$ highest groups of $h$ for the L2 norm. It then holds:

\begin{align*}
\tfrac{1}{2}\|Xh\|_2^2&\le(X^T\epsilon)^Th+\lambda_G\sum_{g=1}^G\|\beta^*_g\|_2-\lambda_G\sum_{g=1}^G\|\hat\beta_g\|_2\\
&=(X^T\epsilon)^Th+\lambda_G\sum_{g\in J^*}\|\beta^*_g\|_2-\lambda_G\sum_{g=1}^G\|\hat\beta_g\|_2\\
&\le(X^T\epsilon)^Th+\lambda_G\sum_{g\in J^*}\|h_g\|_2-\lambda_G\sum_{g\notin J^*}\|h_g\|_2\\
&\le(X^T\epsilon)^Th+\lambda_G\sum_{g\in J_0}\|h_g\|_2-\lambda_G\sum_{g\notin J_0}\|h_g\|_2. \tag{7}
\end{align*}

We now upper-bound the quantity $(X^T\epsilon)^Th$; we denote $g=X^T\epsilon$, with subvectors $g_g$. Applying the Cauchy-Schwarz inequality on each group gives:

\[
(X^T\epsilon)^Th\le|\langle g,h\rangle|\le\sum_{g=1}^G\left|\langle g_g,h_g\rangle\right|\le\sum_{g=1}^G\|g_g\|_2\|h_g\|_2. \tag{8}
\]

Let us fix $g\in\{1,\dots,G\}$ and $u_g\in\mathbb{R}^{n_g}$. We have denoted $n_g$ the cardinality of the set of indexes of group $g$. It then holds:

\begin{align*}
\mathbb{E}\left(\exp\left(g_g^Tu_g\right)\right)&=\mathbb{E}\left(\exp\left((X^T\epsilon)_g^Tu_g\right)\right)=\mathbb{E}\left(\exp\left((X_g^T\epsilon)^Tu_g\right)\right)=\mathbb{E}\left(\exp\left(\epsilon^TX_gu_g\right)\right)\\
&=\prod_{i=1}^{n}\mathbb{E}\left(\exp\left(\epsilon_i(X_gu_g)_i\right)\right)\quad\text{by independence}\\
&\le\prod_{i=1}^{n}\exp\left(4\sigma^2(X_gu_g)_i^2\right)\quad\text{with Lemma 1.4 of [13]}\\
&=\exp\left(4\sigma^2\|X_gu_g\|_2^2\right)=\exp\left(4\sigma^2u_g^TX_g^TX_gu_g\right)\le\exp\left(4\sigma^2\|u_g\|_2^2\right)\quad\text{since }\mu_{\max}(X_g^TX_g)\le1. \tag{9}
\end{align*}

We can then use Theorem 2.1 from [9]. By denoting $I_g$ the identity matrix of size $n_g$, it holds:

\[
\mathbb{P}\left(\|I_gg_g\|_2^2\ge8\sigma^2\left(\mathrm{tr}(I_g)+2\sqrt{\mathrm{tr}(I_g^2)\,t}+2\,|||I_g|||\,t\right)\right)\le e^{-t},
\]

which can be equivalently expressed as

\[
\mathbb{P}\left(\|g_g\|_2^2\ge8\sigma^2\left(\sqrt{n_g}+\sqrt{2t}\right)^2\right)\le e^{-t},
\]

or, with a different formulation:

\[
\mathbb{P}\left(\|g_g\|_2-2\sqrt{2}\sigma\sqrt{n_g}\ge4\sigma\sqrt{t}\right)\le e^{-t},
\]

which is equivalent to saying that:

\[
\mathbb{P}\left(\|g_g\|_2-2\sqrt{2}\sigma\sqrt{n_g}\ge t\right)\le\exp\left(-\frac{t^2}{16\sigma^2}\right). \tag{10}
\]

Let us define the random variables $Z_g:=\|g_g\|_2-2\sqrt{2}\sigma\sqrt{n_g},\ g=1,\dots,G$. Under Equation (10), $Z_g$ satisfies the same tail condition as a sub-Gaussian random variable with variance $8\sigma^2$ and we can apply Lemma 1.

To this end, we introduce $Z_{(1)}\ge\dots\ge Z_{(G)}$ a non-increasing rearrangement of $Z_1,\dots,Z_G$ and a permutation $\phi$ such that $Z_{(g)}=Z_{\phi(g)}$, where we have defined the corresponding group sizes $n_{\phi(1)},\dots,n_{\phi(G)}$. In addition, we assume without loss of generality that $\|h_1\|_2\ge\dots\ge\|h_G\|_2$ and we note the coefficients $\lambda^{(G)}_g=\sqrt{\log(2G/g)},\ g=1,\dots,G$.

Following Equation (8), and proceeding as in the Lasso case (applying Lemma 1 to the variables $Z_g$ and handling the deterministic terms $2\sqrt{2}\sigma\sqrt{n_g}$ with the definition of $\lambda_G$), we obtain with probability at least $1-\delta$:

\[
(X^T\epsilon)^Th\le\frac{\lambda_G}{\alpha}\left(\sqrt{s^*}\|h_{T_0}\|_2+\sum_{g\notin J_0}\|h_g\|_2\right), \tag{11}
\]

where we define $T_0:=T(J_0)$ as the subset of all indexes across all the groups in $J_0$, and $m_0$ denotes the total size of the $s^*$ largest groups. Note that we have paired the Cauchy-Schwarz inequality with Stirling's formula to obtain

\[
\sum_{g=1}^{s^*}\left(\lambda^{(G)}_g\right)^2\le s^*\log(2Ge/s^*).
\]

Theorem 1 defines the regularization parameter $\lambda_G$. By pairing Equations (7) and (11), it holds with probability at least $1-\delta$:

\[
\tfrac{1}{2}\|Xh\|_2^2\le\frac{\lambda_G}{\alpha}\left(\sqrt{s^*}\|h_{T_0}\|_2+\sum_{g\notin J_0}\|h_g\|_2\right)+\lambda_G\sum_{g\in J_0}\|h_g\|_2-\lambda_G\sum_{g\notin J_0}\|h_g\|_2. \tag{12}
\]

As a first consequence, Equation (12) implies that with probability at least $1-\delta$:

\[
\lambda_G\sum_{g\notin J_0}\|h_g\|_2-\frac{\lambda_G}{\alpha}\sum_{g\notin J_0}\|h_g\|_2\le\lambda_G\sum_{g\in J_0}\|h_g\|_2+\frac{\lambda_G}{\alpha}\sqrt{s^*}\|h_{T_0}\|_2,
\]

which is equivalent to saying that with probability at least $1-\delta$:

\[
\sum_{g\notin J_0}\|h_g\|_2\le\frac{\alpha}{\alpha-1}\sum_{g\in J_0}\|h_g\|_2+\frac{\sqrt{s^*}}{\alpha-1}\|h_{T_0}\|_2,
\]

that is, $h\in\Omega(J_0,\epsilon_1^*,\epsilon_2^*)$ with probability at least $1-\delta$.

Appendix B Proof of Theorem 2

Proof:

Our bounds respectively follow from Equations (6) and (12).

Proof for Lasso:

As a second consequence of Equation (6), it holds with probability at least $1-\delta$:

\begin{align*}
\tfrac{1}{2}\|Xh\|_2^2&\le\frac{\lambda}{\alpha}\left(\sqrt{k^*}\|h_{S_0}\|_2+\|h_{(S_0)^c}\|_1\right)+\lambda\|h_{S_0}\|_1-\lambda\|h_{(S_0)^c}\|_1\\
&\le\frac{\lambda}{\alpha}\sqrt{k^*}\|h_{S_0}\|_2+\lambda\|h_{S_0}\|_1\le2\lambda\sqrt{k^*}\|h_{S_0}\|_2\le2\lambda\sqrt{k^*}\|h\|_2, \tag{13}
\end{align*}

where we have used the Cauchy-Schwarz inequality on the sparse vector $h_{S_0}$.

The cone condition proved in Theorem 1 gives $h\in\Lambda(S_0,\gamma_1^*,\gamma_2^*)$ with probability at least $1-\delta$. We can then use the restricted eigenvalue condition defined in Assumption 1, where we define $\kappa^*=\kappa(k^*,\gamma_1^*,\gamma_2^*)$. It then holds with probability at least $1-\delta$:

\[
\kappa^*\|h\|_2^2\le\frac{1}{n}\|Xh\|_2^2\le\frac{4}{n}\lambda\sqrt{k^*}\|h\|_2.
\]

By using the definition of $\lambda$, we conclude that it holds with probability at least $1-\delta$:

\[
\|h\|_2^2\lesssim\left(\frac{\alpha\sigma}{\kappa^*}\right)^2\frac{k^*\log(p/k^*)\log(1/\delta)}{n}.
\]

Proof for Group Lasso:

Similarly, as a second consequence of Equation (12), it holds with probability at least $1-\delta$:

\[
\tfrac{1}{2}\|Xh\|_2^2\le\frac{\lambda_G}{\alpha}\sqrt{s^*}\|h_{T_0}\|_2+\lambda_G\sum_{g\in J_0}\|h_g\|_2\le2\lambda_G\sqrt{s^*}\|h\|_2, \tag{14}
\]

where we have used the Cauchy-Schwarz inequality to obtain:

\[
\sum_{g\in J_0}\|h_g\|_2\le\sqrt{s^*}\,\|h_{T_0}\|_2\le\sqrt{s^*}\,\|h\|_2.
\]

The cone condition proved in Theorem 1 gives $h\in\Omega(J_0,\epsilon_1^*,\epsilon_2^*)$ with probability at least $1-\delta$. We can then use the restricted eigenvalue condition defined in Assumption 1, where we have defined $\kappa^*=\kappa(s^*,\epsilon_1^*,\epsilon_2^*)$. It then holds:

\[
\kappa^*\|h\|_2^2\le\frac{4}{n}\lambda_G\sqrt{s^*}\|h\|_2.
\]

We conclude, by using the definition of $\lambda_G$, that it holds with probability at least $1-\delta$:

\[
\|h\|_2^2\lesssim\left(\frac{\alpha\sigma}{\kappa^*}\right)^2\frac{s^*\log(G/s^*)\log(1/\delta)+\gamma m^*}{n}.
\]

Appendix C Proof of Corollary 1

Proof:

In order to derive the bound in expectation, we define the bounded random variable:

\[
Z=\frac{{\kappa^*}^2}{\alpha^2\sigma^2}\|\hat\beta-\beta^*\|_2^2,
\]

where $\hat\beta$ and $\kappa^*$ depend upon the regularization used. We fix $t>0$ such that $\delta=e^{-t}$; it holds with probability at least $1-\delta$:

\[
Z\le C_0H_1\log(1/\delta)+C_0H_2,
\]

where $C_0$ is a universal constant, $H_1=k^*\log(p/k^*)/n$ and $H_2=0$ for Lasso, and $H_1=s^*\log(G/s^*)/n$ and $H_2=\gamma m^*/n$ for Group Lasso. It then holds

\[
\mathbb{P}\left(Z/C_0\ge H_1t+H_2\right)\le e^{-t}.
\]

Let $q=H_1t$; then:

\[
\mathbb{P}\left(Z/C_0\ge q+H_2\right)\le\exp\left(-\frac{q}{H_1}\right).
\]

As a consequence, by integration, we have:

\[
\mathbb{E}(Z)=\int_0^{+\infty}C_0\,\mathbb{P}\left(|Z|/C_0\ge q\right)dq\le\int_{H_2+q_0}^{+\infty}C_0\,\mathbb{P}\left(|Z|/C_0\ge q\right)dq+C_0(H_2+q_0)=\int_{q_0}^{+\infty}C_0\,\mathbb{P}\left(|Z|/C_0\ge q+H_2\right)dq+
\]