
# Model-Consistent Sparse Estimation through the Bootstrap

Francis Bach
Willow Project-team
Laboratoire d’Informatique de l’Ecole Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
45, rue d’Ulm, 75230 Paris, France
francis.bach@mines.org
###### Abstract

We consider the least-square linear regression problem with regularization by the ℓ1-norm, a problem usually referred to as the Lasso. In this paper, we first present a detailed asymptotic analysis of model consistency of the Lasso in low-dimensional settings. For various decays of the regularization parameter, we compute asymptotic equivalents of the probability of correct model selection. For a specific rate of decay, we show that the Lasso selects all the variables that should enter the model with probability tending to one exponentially fast, while it selects all other variables with strictly positive probability. We show that this property implies that if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection. This novel variable selection procedure, referred to as the Bolasso, is extended to high-dimensional settings by a provably consistent two-step procedure.

## 1 Introduction

Regularization by the ℓ1-norm has attracted a lot of interest in recent years in statistics, machine learning and signal processing. In the context of least-square linear regression, the problem is usually referred to as the Lasso lasso () or basis pursuit chen (). Much of the early effort has been dedicated to algorithms to solve the optimization problem efficiently, either through first-order methods shooting (); descent (), or through homotopy methods that lead to the entire regularization path (i.e., the set of solutions for all values of the regularization parameter) at the cost of a single matrix inversion markowitz (); osborne (); lars ().

A well-known property of the regularization by the ℓ1-norm is the sparsity of the solutions, i.e., it leads to loading vectors with many zeros, and thus performs model selection on top of regularization. Recent works Zhaoyu (); yuanlin (); zou (); martin () have looked precisely at the model consistency of the Lasso, i.e., if we know that the data were generated from a sparse loading vector, does the Lasso actually recover the sparsity pattern when the number of observations grows? In the case of a fixed number of covariates (i.e., low-dimensional settings), the Lasso does recover the sparsity pattern if and only if a certain simple condition on the generating covariance matrices is satisfied yuanlin (). In particular, in low correlation settings, the Lasso is indeed consistent. However, in the presence of strong correlations between relevant variables and irrelevant variables, the Lasso cannot be model-consistent, shedding light on potential problems of such procedures for variable selection. Various extensions of the Lasso have been designed to fix its inconsistency, based on thresholding yuinfinite (), data-dependent weights zou (); yuanlin (); adaptivehighdim () or two-step procedures relaxedlasso (). The main contribution of this paper is to propose and analyze an alternative approach based on resampling. Note that recent work stability () has also looked at resampling methods for the Lasso, but focuses on resampling the weights of the ℓ1-norm rather than resampling the observations (see Section 3 for more details).

In this paper, we first derive a detailed asymptotic analysis of sparsity pattern selection of the Lasso estimation procedure, which extends previous analyses Zhaoyu (); yuanlin (); zou () by focusing on a specific decay of the regularization parameter. Namely, in low-dimensional settings where the number of variables p is much smaller than the number of observations n, we show that when the regularization parameter μ decays proportionally to n^{-1/2}, the Lasso selects all the variables that should enter the model (the relevant variables) with probability tending to one exponentially fast with n, while it selects all other variables (the irrelevant variables) with strictly positive probability. If several datasets generated from the same distribution were available, then the latter property would suggest to consider the intersection of the supports of the Lasso estimates for each dataset: all relevant variables would always be selected for all datasets, while irrelevant variables would enter the models randomly, and intersecting the supports from sufficiently many different datasets would simply eliminate them. However, in practice, only one dataset is given; but resampling methods such as the bootstrap are exactly dedicated to mimicking the availability of several datasets by resampling from the same unique dataset efron (). In this paper, we show that when using the bootstrap and intersecting the supports, we actually get a consistent model estimate, without the consistency condition required by the regular Lasso. We refer to this new procedure as the Bolasso (bootstrap-enhanced least absolute shrinkage operator). 
Finally, our Bolasso framework could be seen as a voting scheme applied to the supports of the bootstrap Lasso estimates; however, our procedure may rather be considered as a consensus combination scheme, as we keep the (largest) subset of variables on which all regressors agree in terms of variable selection, which is in our case provably consistent and also allows us to get rid of a potential additional hyperparameter.

We consider the two usual ways of using the bootstrap in regression settings, namely bootstrapping pairs and bootstrapping residuals efron (); freedman (). In Section 3, we show that the two types of bootstrap lead to consistent model selection in low-dimensional settings. Moreover, in Section 5, we provide empirical evidence that in high-dimensional settings, bootstrapping pairs does not lead to consistent estimation, while bootstrapping residuals still does. While we are currently unable to prove the consistency of bootstrapping residuals in high-dimensional settings, we prove in Section 4 the model consistency of a related two-step procedure: the Lasso is run once on the original data, with a larger regularization parameter, and then bootstrap replications (pairs or residuals) are run within the support of the first Lasso estimation. We show in Section 4 that this procedure is consistent. In order to do so, we consider new sufficient conditions for the consistency of the Lasso, which do not rely on sparse eigenvalues yuinfinite (); ch_zhang (), low correlations bunea (); lounici () or finer conditions tsyb (); cohen (); tong_zhang (). In particular, our new assumptions allow us to prove that the Lasso will select not only a few variables when the regularization parameter is properly chosen, but always the same variables with high probability.

In Section 5.1, we derive efficient algorithms for the bootstrapped versions of the Lasso. When bootstrapping pairs, we simply run an efficient homotopy algorithm, such as Lars lars (), multiple times; however, when bootstrapping residuals, more efficient schemes may be designed to obtain a running-time complexity below that of running Lars multiple times. Finally, in Section 5.2 and Section 5.3, we illustrate our results on synthetic examples, in low-dimensional and high-dimensional settings. This work is a follow-up to earlier work bolasso (): in particular, it refines and extends the analysis to high-dimensional settings and to bootstrapping of the residuals.

#### Notations

For u ∈ R^p and q ∈ [1, ∞), we denote by ∥u∥_q its ℓq-norm, defined as ∥u∥_q = (∑_{j=1}^p |u_j|^q)^{1/q}. We also denote by ∥u∥_∞ = max_{j ∈ {1,…,p}} |u_j| its ℓ∞-norm. For rectangular matrices A, we denote by ∥A∥_op its largest singular value, by ∥A∥_∞ the largest magnitude of all its elements, and by ∥A∥_F its Frobenius norm. We let λ_max(Q) and λ_min(Q) denote the largest and smallest eigenvalues of a symmetric matrix Q.

For a ∈ R, sign(a) denotes the sign of a, defined as sign(a) = 1 if a > 0, sign(a) = −1 if a < 0, and sign(a) = 0 if a = 0. For a vector w ∈ R^p, sign(w) ∈ {−1, 0, 1}^p denotes the vector of signs of the elements of w. Given a set A, 1_A is the indicator function of the set A. We also denote, for w ∈ R^p, by m(w) = min{|w_j|, w_j ≠ 0} the smallest (in magnitude) of the non-zero elements of w.

Moreover, given a vector w ∈ R^p and a subset I of {1, …, p}, w_I denotes the vector in R^{|I|} of elements of w indexed by I. Similarly, for a matrix Q ∈ R^{p×p}, Q_{I,J} denotes the submatrix of Q composed of elements of Q whose rows are in I and columns are in J. Moreover, |I| denotes the cardinality of the set I. For a positive definite matrix Q of size p, and two disjoint subsets of indices I and J included in {1, …, p}, we denote by Q_{I|J} = Q_{I,I} − Q_{I,J} Q_{J,J}^{-1} Q_{J,I} the conditional covariance of variables indexed by I given variables indexed by J, for a Gaussian vector with covariance matrix Q. Finally, we let P and E denote general probability measures and expectations.

#### Least-square regression with ℓ1-norm penalization

Throughout this paper, we consider n pairs of observations (x_i, y_i) ∈ R^p × R, i = 1, …, n. The data are given in the form of a vector y ∈ R^n and a design matrix X ∈ R^{n×p}. We consider the normalized square loss function

$$\frac{1}{2n}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 = \frac{1}{2n}\,\|y - Xw\|_2^2,$$

and the regularization by the ℓ1-norm. That is, we look at the following convex optimization problem lasso (); chen ():

$$\min_{w \in \mathbb{R}^p}\ \frac{1}{2n}\,\|y - Xw\|_2^2 + \mu \|w\|_1, \qquad (1.1)$$

where μ ⩾ 0 is the regularization parameter. We denote by ŵ any global minimum of Eq. (1.1), and by Ĵ = {j, ŵ_j ≠ 0} the support of ŵ.
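For concreteness, the problem in Eq. (1.1) can be solved with any off-the-shelf Lasso solver; the following Python sketch (not part of the original paper) uses scikit-learn, whose Lasso objective coincides with Eq. (1.1) when its alpha parameter is set to μ. The synthetic data-generating values below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative synthetic data: the first three variables are relevant.
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
w_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.5 * rng.standard_normal(n)

mu = 0.1  # regularization parameter of Eq. (1.1)
# scikit-learn's Lasso minimizes (1/(2n)) ||y - Xw||_2^2 + alpha ||w||_1,
# which matches Eq. (1.1) with alpha = mu.
w_hat = Lasso(alpha=mu, fit_intercept=False).fit(X, y).coef_
support = np.flatnonzero(w_hat != 0)  # estimated support
print(support)
```

With a signal this strong, the estimated support contains the three relevant variables, possibly together with a few irrelevant ones, illustrating the behavior analyzed in Section 2.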

In this paper, we consider two settings, depending on the value of the ratio p/n. When this ratio is much smaller than one, as in Section 2, we refer to this setting as low-dimensional estimation, while in other cases, where this ratio is potentially much larger than one, we refer to this setting as a high-dimensional problem (see Section 4).

## 2 Low-Dimensional Asymptotic Analysis

We make the following “fixed-design” assumptions:

• Linear model with i.i.d. additive noise: y = X𝐰 + ε, where ε ∈ R^n is a vector with independent components, identical distributions and zero mean; 𝐰 is sparse, with support J = {j, 𝐰_j ≠ 0}.

• Subgaussian noise: there exists a constant such that all components of ε satisfy a subgaussian moment bound; moreover, the variances of the components of ε are bounded away from zero.

• Bounded design: for all i and j, |x_{ij}| ⩽ M.

• Full rank design: the matrix Q = X^⊤X/n is invertible.

Throughout this paper, we consider normalized constants (normalized population loading vector), (normalized regularization parameter), (condition number of the matrix of second-order moments), and (always larger than one, and equal to one if and only if the noise is Gaussian, see Appendix A.2).

With our assumptions, the problem in Eq. (1.1) is equivalent to

$$\min_{w \in \mathbb{R}^p}\ \frac{1}{2}\,(w - \mathbf{w})^\top Q\, (w - \mathbf{w}) - q^\top (w - \mathbf{w}) + \mu \|w\|_1, \qquad (2.1)$$

where Q = X^⊤X/n and q = X^⊤ε/n. Note that under assumption (A2), there is a unique solution to Eq. (1.1) and Eq. (2.1), because the associated objective functions are then strongly convex. Moreover, assumption (A2) implies that p ⩽ n, that is, we consider in this section only “low-dimensional” settings (see Section 4 for extensions to high-dimensional settings).

In this section, we detail the asymptotic behavior of the (unique) Lasso estimate ŵ, both in terms of the difference in norm with the population value 𝐰 (i.e., regular consistency) and of the sign pattern sign(ŵ), for all asymptotic behaviors of the regularization parameter μ. Note that information about the sign pattern includes information about the support Ĵ, i.e., the indices j for which ŵ_j is different from zero; moreover, when ŵ is consistent, consistency of the sign pattern is in fact equivalent to the consistency of the support. We assume that p is fixed and n tends to infinity, the regularization parameter μ being considered as a function of n (though we still derive non-asymptotic bounds).

Note that for some of our results to be non-trivial, we require that p is not only small compared to n, but that a power of p is small compared to n. Technically, this is due to the application of multivariate Berry-Esseen inequalities (reviewed in Appendix A.1), which could probably be improved to obtain smaller powers.

We consider five mutually exclusive possible situations which explain various portions of the regularization path; many of these results appear elsewhere yuanlin (); Zhaoyu (); fu (); zou (); grouplasso (); lounici () but some of the finer results presented below are new (in particular most non-asymptotic results and the n^{-1/2}-decay of the regularization parameter in Section 2.4). These results are illustrated on synthetic examples in Section 5.2.

Note that all exponential convergences have a rate that depends on m(𝐰), i.e., the smallest (in magnitude) non-zero element of the generating sparse vector 𝐰. Thus, we assume a sharp threshold in order to have a fast rate of convergence. Considering situations without such a threshold, which would notably require estimating errors in model estimation (and not simply exact correctness or incorrectness), is out of the scope of this paper (see, e.g., ch_zhang ()).

### 2.1 Heavy regularization

If μ is large enough, then ŵ is equal to zero with probability tending to one exponentially fast in n. Indeed, we have (see proof in Appendix D.1):

###### Proposition 2.1.

Assume (A2-2). If , then the probability that is greater than .

A well-known property of homotopy algorithms for the Lasso (see, e.g., lars ()) is that if μ is large enough, then ŵ = 0. This proposition simply provides a uniform probabilistic bound.

### 2.2 Fixed regularization

If μ tends to a finite strictly positive constant μ₀, then ŵ converges in probability to the unique global minimum of the noiseless objective function. Thus, the estimate ŵ never converges in probability to 𝐰, while the sign pattern of ŵ tends to the one of this global minimum, which may or may not be the same as the sign pattern of 𝐰. It is thus possible, though not desirable, to have sign consistency without regular consistency. See grouplasso () for examples and simulations of a similar behavior for the group Lasso.

All convergences are exponentially fast in n (proof in Appendix D.2). Note that here and in the next regime (Proposition 2.3), we do not take into account the pathological cases where the sign pattern of the limit is unstable, i.e., the limit is exactly at a hinge point of the regularization path. When this occurs, all associated sign patterns are attained with positive probability (see also Section 4).

###### Proposition 2.2.

Assume (A2-2). Let and . Let be the unique solution of . Then, if , we have:

Moreover, assume the minimum occurs away from a hinge point of the regularization path, i.e., there exists such that for all , implies . If , then

$$P\big(\operatorname{sign}(\hat w) \neq \operatorname{sign}(w_0)\big) \leqslant 2p \exp\Big(-\tilde\lambda^2 \tilde\tau^2 \min\big\{\eta^2/4,\ m(w_0 M/\sigma)^2\big\}\,\frac{n}{p}\Big).$$

The proposition above makes no claim in the situation where μ tends to zero. As we now show, the behavior then depends on the rate of decay of μ: slower than, faster than, or exactly at the rate n^{-1/2}.

### 2.3 High regularization

If μ tends to zero slower than n^{-1/2}, then ŵ converges in probability to 𝐰 (regular consistency) and the sign pattern of ŵ converges to the sign pattern of the global minimum of a local noiseless objective function, the convergence being exponential in n (see proof in Appendix D.3). The local noiseless problem in Eq. (2.2) is simply obtained by a first-order expansion of the Lasso objective function around 𝐰 fu (); yuanlin ().

###### Proposition 2.3.

Assume (A2-2). Let Δ be the unique solution of

$$\min_{\Delta \in \mathbb{R}^p}\ \frac{1}{2}\,\Delta^\top Q\, \Delta + \Delta_J^\top \operatorname{sign}(\mathbf{w}_J) + \|\Delta_{J^c}\|_1. \qquad (2.2)$$

Assume that . We have:

$$P\big(\|\hat w - \mathbf{w} - \mu\Delta\|_2 \geqslant \beta\sigma/M\big) \leqslant 2p \exp\Big(-\frac{\tilde\lambda^2 \beta^2}{8\tilde\tau^2}\,\frac{n}{p}\Big).$$

Moreover, assume the minimum of Eq. (2.2) occurs away from a hinge point of the regularization path, i.e., there exists such that for all , implies . Then,

$$P\big(\operatorname{sign}(\hat w) \neq \operatorname{sign}(\mathbf{w} + \mu\Delta)\big) \leqslant 2p \exp\Big(-\frac{m(\tilde{\mathbf{w}})\,\tilde\lambda^2}{8\tilde\tau^2}\,\frac{n}{p}\Big) + 2p \exp\Big(-A \tilde\mu^2\,\frac{n}{p}\Big),$$

where .

Note that the sign pattern of 𝐰 + μΔ is equal to the population sign vector sign(𝐰) if and only if the problem in Eq. (2.2) has a solution whose component Δ_{J^c} is equal to zero. A short calculation shows that this occurs if and only if the following consistency condition is satisfied glasso (); Zhaoyu (); yuanlin (); zou (); martin ():

$$\big\|Q_{J^c,J}\, Q_{J,J}^{-1}\, \operatorname{sign}(\mathbf{w}_J)\big\|_\infty \leqslant 1. \qquad (2.3)$$

Thus, if Eq. (2.3) is satisfied strictly—which implies that we are not at a hinge point of Eq. (2.2)—the probability of correct sign estimation is tending to one, and to zero if Eq. (2.3) is not satisfied (see yuanlin () for precise statements when there is equality). Moreover, when Eq. (2.3) is satisfied strictly, Proposition 2.3 gives an upper bound on the probability of not selecting the correct pattern .
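The condition of Eq. (2.3) is straightforward to check numerically for a given second-moment matrix Q and population vector; the following sketch (illustrative, with a hypothetical helper name) evaluates the left-hand side and compares it to one.

```python
import numpy as np

def lasso_consistency_condition(Q, w):
    """Evaluate the sign-consistency condition of Eq. (2.3):
    || Q_{Jc,J} Q_{J,J}^{-1} sign(w_J) ||_inf <= 1,
    where J is the support of the population vector w."""
    J = np.flatnonzero(w != 0)
    Jc = np.flatnonzero(w == 0)
    s = np.sign(w[J])
    value = np.abs(Q[np.ix_(Jc, J)] @ np.linalg.solve(Q[np.ix_(J, J)], s)).max()
    return value, value <= 1

# Identity covariance (no correlations): the condition always holds.
Q = np.eye(4)
w = np.array([1.0, -2.0, 0.0, 0.0])
val, ok = lasso_consistency_condition(Q, w)
print(val, ok)  # 0.0 True

# Strong correlation between an irrelevant variable and the relevant
# ones can violate the condition, making the plain Lasso inconsistent:
Q2 = np.array([[1.0, 0.5, 0.8],
               [0.5, 1.0, 0.8],
               [0.8, 0.8, 1.0]])
w2 = np.array([1.0, 1.0, 0.0])
val2, ok2 = lasso_consistency_condition(Q2, w2)
print(round(val2, 3), ok2)  # 1.067 False
```

The second example is precisely the kind of correlated design that motivates the resampling approach of Section 3.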

The first three regimes are not unique to low-dimensional settings; we show in Section 4 the corresponding proposition related to Proposition 2.3, for high-dimensional settings. However, the last two regimes (μ tending to zero at rate n^{-1/2} or faster) are specific to low-dimensional settings.

### 2.4 Medium regularization

If μ n^{1/2} is bounded from above and from below, then we show that the sign pattern of ŵ agrees on J with the one of 𝐰 with probability tending to one exponentially fast in n (Proposition 2.4), while for all sign patterns consistent on J with the one of 𝐰, the probability of obtaining this pattern tends to a limit in (0, 1) (in particular strictly positive); that is, all sign patterns consistent with 𝐰 on the relevant variables (i.e., the ones in J) are possible with positive probability (Proposition 2.5). The convergence of this probability follows a rate of n^{-1/2} (see proofs in Appendix D.4 and D.5). Note the difference with earlier results bolasso () obtained for random designs.

###### Proposition 2.4.

Assume (A2-2) and that μ n^{1/2} is bounded from above and from below. Then for any sign pattern s such that s_J = sign(𝐰_J), there exists f(s, ·) such that:

$$\Big|P\big(\operatorname{sign}(\hat w) = s\big) - f\big(s,\ n^{1/2}\mu\, p^{1/2}\big)\Big| \leqslant 4 C_{\mathrm{BE},1}\,\tilde\tau^3\,\tilde\lambda^{1/2}\,\frac{p^2}{n^{1/2}} + 2p \exp\Big(-\frac{m(\tilde{\mathbf{w}})\,\tilde\lambda^2}{8\tilde\tau^2}\,\frac{n}{p}\Big).$$
###### Proposition 2.5.

Assume (A2-2) and that μ n^{1/2} is bounded from above and from below. Then, for any pattern s such that s_J ≠ sign(𝐰_J),

$$P\big(\operatorname{sign}(\hat w) = s\big) \leqslant 2p \exp\Big(-\frac{m(\tilde{\mathbf{w}})\,\tilde\lambda^2}{8\tilde\tau^2}\,\frac{n}{p}\Big).$$

The positive real numbers C_{BE,1} and C_{BE,2} are universal constants related to multivariate Berry-Esseen inequalities (see Appendix A.1 for more details). From the proof in Appendix D.4, the function f(s, ·) has specific behaviors when its argument is small or large: on the one hand, if μ n^{1/2} tends to infinity, then we tend to the behavior of the previous section, that is, f tends to one if s is the limiting pattern in Proposition 2.3 and to zero otherwise. On the other hand, if μ n^{1/2} tends to 0, f tends to one if s has no zeros, and to zero otherwise (see next section).

The last two propositions state that the relevant variables are stable, i.e., we get all relevant variables with probability tending to one exponentially fast, while we get each of the other patterns with probability tending to a limit strictly between zero and one. This stability of the relevant variables is the source of the intersection arguments outlined in Section 3.

Note that Proposition 2.4 makes non-trivial statements only for n larger than p⁴; the fourth power is due to the application of Berry-Esseen inequalities, and could be improved.

### 2.5 Low regularization

If μ tends to zero faster than n^{-1/2}, then ŵ is consistent (i.e., converges in probability to 𝐰) but the support of ŵ is equal to {1, …, p} with probability tending to one (the signs of the variables in J^c may then be arbitrarily negative or positive). That is, the ℓ1-norm has no sparsifying effect. We obtain two different bounds, with different scalings in n and p (see proof in Appendix D.6):

###### Proposition 2.6.

Assume (A2-2) and . Then the probability of having at least one zero variable is smaller than and .

The first bound simply requires that μ tends to zero faster than n^{-1/2}, but the constant is exponential in p, while the second bound requires that μ does not tend to zero too fast, i.e., within a specific range (with constants polynomial in p). As shown in Appendix D.3, the two bounds correspond to two different applications of Berry-Esseen inequalities, one for all the possible sign patterns, one using a detailed analysis of the non-selection of a given variable (see Section 2.6). We are currently exploring the possibility of a bound that shares the positive aspects of our two bounds—polynomial in p and without the extra term.

Among the five previous regimes, the only ones with consistent estimates (in norm) and a sparsity-inducing effect are those where μ tends to zero and μ n^{1/2} tends to a finite or infinite limit. When μ n^{1/2} tends to infinity, we can only hope for model-consistent estimates if the consistency condition in Eq. (2.3) is satisfied. This somewhat disappointing result for the Lasso has led to various improvements on the Lasso to ensure model consistency even when Eq. (2.3) is not satisfied yuanlin (); zou (); relaxedlasso (). Those are based on adaptive weights derived from the non-regularized least-square estimate, or on two-step procedures. We propose in Section 3 alternative ways which are based on resampling. Before doing so, we derive in the next section finer results that allow us to consider the presence or absence in the support set of a specific variable without considering all corresponding consistent sign patterns.

### 2.6 Probability of not selecting a given variable

We can lower and upper bound the probability of not selecting a given irrelevant variable in J^c (see proof in Appendix D.7)—see Proposition 2.5 for a related proposition for relevant variables in J:

###### Proposition 2.7.

Assume (A2-2) and . Let . We have:

 P(j∈^J) ⩾ ~μn1/2/41+~μn1/2/2~λ1/2exp(−2~μ2~λ2np)−10CBE2~τ3~λ1p5/2~μn−CBE24~τ3~λ1/2p2n1/2, P(j∈^J) ⩽ ~μn1/2~λ1/2+8CBE2~τ3~λp5/2~μn+CBE24~τ3~λ1/2p2n1/2.

This novel proposition allows us to consider “marginal” probabilities of selecting (or not selecting) a given variable, without considering all consistent sign patterns associated with the selection (or non-selection) of that variable. Note that it makes interesting claims only when is bounded from above and below (for the lower bound) and when tends to zero, while tends to infinity (for the upper bound).

## 3 Support Estimation by Intersection

The results from Section 2.4 show precisely that under suitable choices of the regularization parameter μ, the relevant variables are stable while the irrelevant ones are unstable, leading to several intersection arguments to keep only the relevant variables. We first consider the unrealistic situation where we have multiple independent copies, then we consider splitting a dataset into several pieces, and we finally present the two usual types of bootstrap (pairs and residuals). Note that an alternative approach is to resample the columns of the design matrix instead of its rows, i.e., to draw random weights for each variable from a well-chosen distribution stability ().

The analysis of support estimation is essentially the same for all methods and is based on the following argument: we consider m “replications”, and the associated active sets. The replications are assumed independent given the original data (i.e., the vector of noise ε). We let Ĵᵢ denote the estimate of the active set for the i-th replication (given the original data). Once the active set is found, the final estimate of 𝐰 is obtained by the unregularized least-square estimate, restricted to the estimated active set.

We can upper bound the probability of incorrect pattern selection as follows:

$$\begin{aligned} P(\hat J_\cap \neq J) &\leqslant P\big(J^c \cap \hat J_\cap \neq \varnothing\big) + P\big(J \cap (\hat J_\cap)^c \neq \varnothing\big) \\ &\leqslant \sum_{j \in J^c} P\big(\forall i \in \{1,\dots,m\},\ j \in \hat J_i\big) + P\Big(\bigcup_{i=1}^m \big[(\hat J_i)^c \cap J\big] \neq \varnothing\Big) \\ &\leqslant \sum_{j \in J^c} E\big(P(j \in \hat J^* \mid \varepsilon)^m\big) + m\, P\big((\hat J^*)^c \cap J \neq \varnothing\big), \end{aligned}$$

where Ĵ* denotes a generic support obtained from one replication. We now need to upper bound the probability of forgetting at least one relevant variable, and also the probability that a replication does not include a given irrelevant variable (given the original data). The first term will always drop as the number of replications m gets larger, while the second term increases, leading to a natural trade-off for the choice of the number of replications. This is to be contrasted with usual applications of the bootstrap where m is taken as large as computationally feasible.
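The intersection step underlying all these schemes can be sketched generically: given the active sets of the m replications, a variable is kept only if it belongs to all of them (the helper name below is hypothetical).

```python
def intersect_supports(supports):
    """Intersect the active sets from m replications (Section 3):
    a variable is kept only if every replication selects it."""
    J = set(supports[0])
    for s in supports[1:]:
        J &= set(s)
    return sorted(J)

# Relevant variables {0, 1} appear in every replication; irrelevant ones
# enter each support at random and are eliminated by the intersection.
supports = [[0, 1, 4], [0, 1, 2], [0, 1, 7], [0, 1, 2, 5]]
print(intersect_supports(supports))  # [0, 1]
```

This illustrates the trade-off above: more replications shrink the intersection towards J, but too many replications eventually risk dropping a relevant variable.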

### 3.1 Multiple independent copies

Let us assume for a moment that we have m independent copies of similar datasets, with potentially different fixed designs but the same noise distribution. We then have m different active sets Ĵ₁, …, Ĵ_m and we denote by Ĵ∩ = ⋂_{i=1}^m Ĵᵢ the intersection of the active sets. We have the following upper bound on the probability of not selecting the correct pattern (see proof in Appendix D.8):

###### Proposition 3.1.

Assume (A2-2) for m independent datasets with the same noise distribution, and that μ n^{1/2} tends to a constant c. Then there exists f(c) > 0 such that

$$P(\hat J_\cap \neq J) \leqslant p\, e^{-f(c)\,m\,p^{-1/2}} + 2pm \exp\Big(-\frac{m(\tilde{\mathbf{w}})\,\tilde\lambda^2}{8\tilde\tau^2}\,\frac{n}{p}\Big).$$

From the proof of Proposition 3.1 in Appendix D.8, we can get the detailed behavior of f(c) around c = 0 and c = +∞: it goes to zero in both cases, i.e., we actually need (in the bound) a regularization parameter that is proportional to n^{-1/2}.

Moreover, we get an exponential convergence rate in m and n, where we have two parts: one that states that the number of copies m should be as large as possible to remove irrelevant variables (left part), and one that states that m should not be too large, otherwise some relevant variables would start to disappear (right part). Note that the best scaling (for the bound) balances the two terms, leading to a probability of incorrect selection that goes to zero exponentially fast.

Of course, in practice, one is not given multiple independent copies of the same dataset, but a single one. One strategy is to split it into different pieces, as described in Section 3.2; this however relies on having enough data to get a large number of pieces, which is unlikely to happen in practice. Our main goal in this paper is to show that by using the bootstrap, we can mimic the availability of multiple copies. This will come at a price, namely an overall convergence rate that is only polynomial, instead of exponential, in n.

### 3.2 Splitting into pieces

We can cut the dataset into m pieces of the same size, a procedure reminiscent of cross-validation. However, it requires an extra assumption on the design, i.e., we need to assume that the smallest eigenvalues of the data matrices built from n/m observations are still strictly positive (see proof in Appendix D.9):

###### Proposition 3.2.

Assume (A2-2) for disjoint subdatasets of the original dataset, and . If , there exists such that:

$$P(\hat J_\cap \neq J) \leqslant p\Big(1 - e^{-f(c)\,p^{-1/2}} + h(c)\,\frac{p^{5/2} m^{1/2}}{n^{1/2}}\Big)^{m} + 2pm \exp\Big(-\frac{m(\tilde{\mathbf{w}})\,\tilde\lambda^2}{8\tilde\tau^2}\,\frac{n}{mp}\Big).$$

The proposition above requires that tends to zero, i.e., there should not be too many pieces (which is also required to allow invertibility of the sub-designs). Note that several independent partitions could be considered, and would lead to results similar to the ones for the bootstrap presented in the next two sections stability ().

### 3.3 Random pair bootstrap

Given the observations (x_i, y_i), i = 1, …, n, put together into matrices X ∈ R^{n×p} and y ∈ R^n, we consider m bootstrap replications of the n data points efron (); that is, for k = 1, …, m, we consider a ghost sample (x*_{ki}, y*_{ki}), i = 1, …, n, given by matrices X*_k and y*_k. For each k, the n pairs (x*_{ki}, y*_{ki}), i = 1, …, n, are sampled uniformly and independently at random with replacement from the original n pairs in (X, y). Some pairs are not selected, some selected once, some selected twice, and so on. Note that we could consider bootstrap replications with more or fewer points than n, but for simplicity, we keep it the same as the original number of data points.

The following proposition shows that we obtain a consistent model estimate by intersecting the active sets obtained from running the Lasso on each bootstrap sample , a procedure we refer to as the Bolasso (see proof in Appendix E):

###### Proposition 3.3.

Assume (A2-2). If , there exist strictly positive constants that may depend on such that if and , we have, for bootstrapping pairs:

 P(^J∩≠J)⩽mpexp(−A0n1/2p1/2)+A4(A3p3n1/2+logmm)1+A5(2log(A3p3n1/2+logmm))−1/2.

Note that in Proposition 3.3, for any , if and are large enough, then we get an upper bound on the probability of incorrect model selection of the form , where are positive constants. Note that in bolasso (), we have derived a bound with better behavior in , i.e., with . However, the bound in bolasso () holds for random designs and has constants which scale exponentially in  and not polynomially. We are currently trying to improve on the bound in Proposition 3.3 to remove the extra factor .

As before, the number of replications m should be as large as possible to remove irrelevant variables, and should not be too large, otherwise some relevant variables would start to disappear from the intersection. Note that the best scaling (for the bound) leads to an overall probability of incorrect model selection that tends to zero only polynomially fast, instead of the exponential rate for the unrealistic situation of having multiple copies (Section 3.1).

We have not explored yet the optimality (in the minimax sense) of the bound given in Proposition 3.3. While we believe that a rate of cannot be improved upon, the rate should be improved with further research.

Finally, we have explored in bolasso () the possibility of considering softer ways of performing the intersection, i.e., by keeping all variables that appear in a certain proportion of the active sets corresponding to the various replications. This is important in cases where the decay of the loading vector does not have a sharp threshold as assumed in most analyses (this paper included). However, it adds an extra hyperparameter and the theoretical analysis of such schemes is out of the scope of this paper.

### 3.4 Bootstrapping residuals

An alternative to resampling pairs is to resample only the estimated centered residuals efron (); freedman (). This is well adapted to fixed-design assumptions, in particular because the design matrix X remains the same for all replications. Note, however, that the consistency of this resampling scheme usually relies more heavily on the homoscedasticity assumption (A2) that we make in this paper freedman (). Moreover, since the Lasso estimate is biased, the behavior differs slightly from bootstrapping pairs, as shown empirically in Section 5.

Bootstrapping residuals works as follows: we let ε̂ = y − Xŵ denote the vector of estimated residuals, and ε̃ the centered residuals, equal to ε̂ minus its empirical mean. When bootstrapping residuals, for each replication, we keep X unchanged and we use as data y*_i = (Xŵ)_i + ε̃_{σ(i)}, where σ(i) is a random index in {1, …, n}—the sampling is uniform and the indices are drawn independently.
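A sketch of the corresponding residual-resampling procedure (again illustrative, with hypothetical parameter choices): the design X is kept fixed, and each replication adds resampled centered residuals of an initial Lasso fit to the fitted values.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_residuals(X, y, mu, m, seed=0):
    """Bolasso with residual resampling: keep the design X fixed and
    resample the centered residuals of an initial Lasso fit."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w0 = Lasso(alpha=mu, fit_intercept=False).fit(X, y).coef_
    resid = y - X @ w0
    resid = resid - resid.mean()  # centered residuals
    J = set(range(p))
    for _ in range(m):
        # Fixed design, resampled noise: y* = X w0 + resampled residuals.
        y_star = X @ w0 + rng.choice(resid, size=n, replace=True)
        w = Lasso(alpha=mu, fit_intercept=False).fit(X, y_star).coef_
        J &= set(np.flatnonzero(w != 0).tolist())
    return sorted(J)

# Illustrative data: three relevant variables out of ten.
rng = np.random.default_rng(2)
n, p = 300, 10
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.5 * rng.standard_normal(n)

J_hat = bolasso_residuals(X, y, mu=0.05, m=32)
print(J_hat)
```

Since the initial Lasso fit is biased, the resampled data are centered around Xŵ rather than X𝐰, which is the source of the behavioral difference from pair resampling noted above.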

We obtain a bound similar to the one for bootstrapping pairs (see proof in Appendix F.2):

###### Proposition 3.4.

Assume (A2-2). If , there exist strictly positive constants that may depend on such that if and , we have, for bootstrapping residuals:

 P(^J∩≠J)⩽mpexp(−A0n1/2p1/2)+A4(A3p3n1/2+logmm)1+A5(2log(A3p3n1/2+logmm))−1/2.

The bound in Proposition 3.4 is the same as for bootstrapping pairs, but as shown in Appendix F.2, the constants are slightly better. However, as shown in Section 5.3, the behaviors of the two methods differ notably: random-pair bootstrap does not lead to good selection performance in high-dimensional settings, while residual bootstrap does. While we are currently unable to prove the consistency of bootstrapping residuals in high-dimensional settings, we prove in Section 4 the model consistency of a related two-step procedure, where the bootstrap replications are performed within the support of the Lasso estimate on the full data.

## 4 High-Dimensional Analysis

In high-dimensional settings, i.e., when p may be larger than n, we need to change the assumption regarding the invertibility of the empirical second-order moment matrix, which cannot hold. Various assumptions have been used for the Lasso, based on low correlations lounici (), sparse eigenvalues yuinfinite () or more general conditions tsyb (); cohen (). In this paper, we introduce a novel assumption, which not only allows us to consider that the support of the Lasso estimate has a bounded size, but also implies that we obtain the same sign pattern with high probability. The analysis carried out in low-dimensional settings in Section 2.3 is thus also valid in high-dimensional settings.

### 4.1 High-dimensional assumptions

Our analysis relies on the analysis carried out in Section 2.3 for “high” regularization, i.e., when μ tends to zero slower than n^{-1/2}. In this setting, we have shown that the Lasso estimate asymptotically behaves as 𝐰 + μΔ, where Δ is the unique minimum of

$$\min_{\Delta \in \mathbb{R}^p}\ \frac{1}{2}\,\Delta^\top Q\, \Delta + \Delta_J^\top \operatorname{sign}(\mathbf{w}_J) + \|\Delta_{J^c}\|_1. \qquad (4.1)$$

We let L denote the “extended” support of a solution Δ of Eq. (4.1) and K = L ∖ J: that is, we not only keep all indices corresponding to non-zero elements, but also the ones for which the optimality condition in Eq. (C.1) is an equality (i.e., if we are at a hinge point of the regularization path, we take all involved variables).

We consider the vector $t$ defined by $t_J = \operatorname{sign}(w_J)$ and by letting $t_K$ collect the signs appearing in the optimality conditions of Eq. (4.1) on $K$. If we assume that $Q_{L,L}$ is invertible, then the solution to Eq. (4.1) is unique fuchs (), and is such that $\Delta_L = -Q_{L,L}^{-1}t_L$ and $\Delta_{L^c} = 0$, and the optimality conditions for Eq. (4.1) are simply

 $$\operatorname{sign}\!\big(-[Q_{L,L}^{-1}t_L]_K\big)=t_K \quad\text{and}\quad \big\|Q_{L^c,L}\,Q_{L,L}^{-1}t_L\big\|_\infty \leqslant 1.$$

We make the following assumptions (note that (A4.1) is essentially equivalent to the absence of a hinge point, an assumption which is also made in Proposition 2.3):

• (A4.1) Unicity of the local noiseless problem: the matrix $Q_{L,L}$ is invertible.

• (A4.2) Stability of the local noiseless problem: $\|Q_{L^c,L}\,Q_{L,L}^{-1}t_L\|_\infty < 1$ and, for all $k \in K$, $(Q_{L,L}^{-1}t_L)_k \neq 0$.

We let $\theta$ denote

 $$\theta=\min\bigg\{\,1-\big\|Q_{L^c,L}\,Q_{L,L}^{-1}t_L\big\|_\infty,\ \min_{k\in K}\frac{\big|(Q_{L,L}^{-1}t_L)_k\big|}{Q_{k,k}}\,\bigg\}, \qquad (4.2)$$

the quantity that will characterize the stability of the local noiseless problem; if (A4.1-4.2) are satisfied, then $\theta > 0$. As shown in Proposition 4.1, the quantity $\theta$ dictates the speed of convergence of the probability of not obtaining $t$ as the sign pattern of the Lasso estimate in Eq. (1.1) or Eq. (2.1).

#### Comparison with consistency condition

We now relate (A4.2) with the consistency condition for the Lasso in Eq. (2.3): if Eq. (2.3) is satisfied, then $L = J$, and the condition (A4.2) simply becomes:

 $$\big\|Q_{J^c,J}\,Q_{J,J}^{-1}\operatorname{sign}(w_J)\big\|_\infty<1,$$

which is exactly a strict version of Eq. (2.3), an assumption commonly made for high-dimensional analysis of the Lasso Zhaoyu (); martin (). Note that we then obtain a simplified expression for $\theta$ by substituting $L = J$ and $t_L = \operatorname{sign}(w_J)$ in Eq. (4.2).

The main goal of this paper is to design a consistent procedure even when Eq. (2.3) is not satisfied. As we have seen, (A4.2) is weaker than the usual assumptions made for Lasso consistency; in Figure 1 (left and middle), we compare empirically the two conditions for random i.i.d. Gaussian designs, showing that our set of assumptions is weaker, but of course breaks down when $n$ is too small (too few observations) or the cardinality of $J$ is too large (too many relevant variables). We are currently exploring theoretical proofs of this behavior, extending the current analysis of martin () for Eq. (2.3); in particular, we aim at determining the various scalings between $n$, $p$ and the number of relevant variables for which a Gaussian ensemble leads to consistent variable selection with high probability (according to our assumptions, which are weaker than in martin ()). Moreover, in the right plot of Figure 1, we show values of $\theta$ for various $n$ and $p$, which characterize the convergence rate of our bound. Relying on $\theta$ being bounded away from zero is clearly a weakness of our approach to high-dimensional estimation; we are currently exploring refined conditions where we relax the stability requirement, i.e., we allow several (but not too many) patterns to be selected with overwhelming probability.

#### Checking assumptions (A4.1-4.2)

In Eq. (4.1), we can optimize in closed form with respect to $\Delta_J$ as $\Delta_J = -Q_{J,J}^{-1}\big(Q_{J,J^c}\Delta_{J^c}+\operatorname{sign}(w_J)\big)$, leading to an optimization problem for $\Delta_{J^c}$:

 $$\min_{\Delta_{J^c}\in\mathbb{R}^{|J^c|}}\ \tfrac{1}{2}\Delta_{J^c}^\top Q_{J^c,J^c|J}\,\Delta_{J^c}-\Delta_{J^c}^\top Q_{J^c,J}Q_{J,J}^{-1}\operatorname{sign}(w_J)+\|\Delta_{J^c}\|_1, \qquad (4.3)$$

which can be solved using existing code for the Lasso. We are currently working on deriving sufficient conditions which do not depend on the sign pattern of the population loading vector (but only on its sparsity pattern, or even its cardinality), as is usually done for the consistency condition in Eq. (2.3) Zhaoyu (); yuanlin ().

### 4.2 Stability of sign selection

With assumptions (A4.1) and (A4.2), we can show that, with high probability, when the regularization parameter is asymptotically greater than $n^{-1/2}$, the sign of the Lasso estimate is exactly $t$ (see proof in Appendix G):

###### Proposition 4.1.

Assume (A2-2), (A4.1-4.2), and an appropriate scaling of the regularization parameter. Then:

 $$\mathbb{P}\big(\operatorname{sign}(\hat{w})\neq t\big)\;\leqslant\; 2p\exp\!\bigg(-\frac{n\,\tilde{\mu}^2\theta^2\tilde{\lambda}_L}{8\tilde{\tau}^2|L|}\bigg)+2|J|\exp\!\bigg(-\frac{n\,m(\tilde{w})^2\tilde{\lambda}_L^2}{4\tilde{\tau}^2|L|}\bigg). \qquad (4.4)$$

Note that if $\theta$ is bounded away from zero, then only a mild growth condition on the regularization parameter is needed for our result to hold. Moreover, in Eq. (4.4), we can see that $\theta$ dictates the asymptotic behavior of our bound. If it is too small, then in order to obtain a meaningful bound for such a design matrix, we would need to consider sign patterns which are close to $t$ and show that the sign pattern of the Lasso estimate lies within this set of sign patterns with high probability.

### 4.3 High-dimensional Bolasso

Proposition 4.1 suggests running the Lasso once with a larger regularization parameter (i.e., multiplied by a factor of order $\log p$) and running the various resampling schemes within the active set of this original Lasso estimation (which is very likely to be the support associated with the sign pattern $t$). More precisely, we have the following proposition (see proof in Appendix G):

###### Proposition 4.2.

Assume (A2-2) and (A4.1-4.2). Then there exist strictly positive constants, which may depend on the design, such that, for appropriate scalings of the regularization parameter and of the number of replications, we have, for bootstrapping residuals:

 $$\mathbb{P}(\hat{J}_\cap\neq J)\;\leqslant\; 2p\exp\!\bigg(-\frac{c^2(\log p)^2\theta^2\tilde{\lambda}_L}{8\tilde{\tau}^2|L|^2}\bigg)+2|J|\exp\!\bigg(-\frac{n\,m(\tilde{w})^2\tilde{\lambda}_L^2}{4\tilde{\tau}^2|L|}\bigg)+mp\exp\!\Big(-A_0\,\tfrac{n^{1/2}}{|L|^{1/2}}\Big)+A_4\Big(\tfrac{A_3|L|^3}{n^{1/2}}+\tfrac{\log m}{m}\Big)\bigg[1+A_5\Big(2\Big|\log\Big(\tfrac{A_3|L|^3}{n^{1/2}}+\tfrac{\log m}{m}\Big)\Big|\Big)^{-1/2}\bigg].$$

Note that the constants depend polynomially on the various quantities of the problem and do not depend on $p$. This is thus a high-dimensional result where $p$ may grow large compared to $n$. If we relaxed (A4.2), then the original Lasso estimate would have a small set of allowed patterns with high probability (instead of a single one), and a union bound over all such patterns would then be needed.

## 5 Algorithms and Simulations

In this section, we describe efficient algorithms for the bootstrapped versions of the Lasso presented in this paper, and we illustrate the various consistency results obtained in previous sections, in low-dimensional and high-dimensional settings.

### 5.1 Efficient Path Algorithms

We first consider efficient algorithms for the bootstrapping procedures, based on homotopy methods osborne (); lars (); garrigues (). Similar developments could be made for first-order methods shooting (); descent (). For the regular Lasso, one can find the solutions of Eq. (1.1) for all values of the regularization parameter that correspond to fewer than a given number of selected covariates, at a cost which is empirically modest: the expensive matrices need only be computed once, while the relevant submatrices and various running quantities are updated cheaply as variables enter or leave the active set. Note that our analysis suggests stopping the path when the solution of the problem is no longer unique, i.e., when the design matrix restricted to the selected variables becomes rank-deficient.

#### Bootstrapping pairs

When bootstrapping pairs, we require $m$ applications of the regular Lasso procedure with $m$ different design matrices; since the designs are different, there is no immediate possibility of sharing computations between bootstrap replications.

#### Bootstrapping residuals

When bootstrapping residuals, we first run the Lasso once on the original data. Then, naively, for each value of the regularization parameter, we would have to run the Lasso $m$ times. In order to avoid running the Lasso as many times as $m$ times the number of values of the regularization parameter we want to consider, one can first notice that there are finitely many break points in the original Lasso estimation, and that between break points, one has to minimize an objective function composed of an $\ell_1$-penalty, a quadratic term and a linear term whose coefficients depend affinely on the regularization parameter. This implies that the path is also piecewise linear within each segment and can be followed using a homotopy algorithm very similar to the one for the regular Lasso. Restarting a homotopy method for each segment has a cost per segment comparable to a full Lasso run; this can be reduced by computing a joint path that goes through all segments sequentially instead of in parallel. Moreover, since the design matrix is the same for all replications when bootstrapping residuals, computations of submatrices of the Gram matrix may be cached, lowering the overall complexity further.
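The caching argument can be sketched as follows: when the design is shared, the Gram matrix $G = X^\top X/n$ is computed once, and each replication only requires the new correlation vector $X^\top y^*/n$. The covariance-form coordinate descent below is our own illustrative sketch of this idea, not the homotopy algorithm described above:

```python
import numpy as np

def lasso_cd_gram(G, c, lam, n_sweeps=300):
    """Minimize (1/2) w^T G w - c^T w + lam * ||w||_1, where
    G = X^T X / n is precomputed once and shared across replications,
    and c = X^T y / n changes with each bootstrap replication."""
    p = len(c)
    w = np.zeros(p)
    grad = -c.astype(float).copy()      # gradient of the smooth part: G w - c
    for _ in range(n_sweeps):
        for j in range(p):
            if G[j, j] == 0.0:
                continue
            rho = G[j, j] * w[j] - grad[j]
            w_j = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
            if w_j != w[j]:
                grad += G[:, j] * (w_j - w[j])  # rank-one gradient update
                w[j] = w_j
    return w
```

As a sanity check, for an orthonormal design ($G = I$) this reduces to soft-thresholding of $c$. Each replication then costs one matrix-vector product to form $c$ plus the coordinate sweeps, instead of recomputing the Gram matrix from scratch.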

Similarly, when bootstrapping after projection onto the active set of a single global Lasso run, one can get an even lower complexity: one full Lasso followed by $m$ Lasso runs on a reduced data set. This however requires updates (when the first Lasso estimation switches active sets) such as the ones proposed in garrigues ().

### 5.2 Experiments - Low-Dimensional Settings

We first consider a low-dimensional design matrix with a small number of relevant variables. The rows of the design matrix are sampled i.i.d. from a normal distribution with a fixed covariance matrix. We consider two covariance matrices, one that leads to design matrices which do not satisfy the consistency condition of the Lasso in Eq. (2.3), and one that leads to Lasso-consistent design matrices.

In Figure 2, we plot the marginal probabilities (computed from 512 independent replications) of selecting each of the variables, for all values of the regularization parameter and for the various resampling schemes (resampling the noise, bootstrapping pairs or bootstrapping residuals), without intersecting (i.e., we are just reporting counts from 512 replications from a single dataset). Note that the left column (resampling the noise) exactly corresponds to the various regimes of the Lasso presented in Section 2 (these require full knowledge of the generating distributions and are only displayed for illustration purposes): for large values of the regularization parameter, no variable is selected (Proposition 2.1); then a fixed pattern is selected (regularization tending to zero more slowly than $n^{-1/2}$, Proposition 2.2); then all patterns including the relevant variables (regularization of order $n^{-1/2}$, Propositions 2.4 and 2.5); and finally, for small values of the regularization parameter, all variables are selected (Proposition 2.6). Note that in the top plots, as expected (since Eq. (2.3) is not satisfied), there is no consistent model selection, while in the bottom plots, as expected (since Eq. (2.3) is satisfied), some portions of the regularization paths lead to the correct pattern. It is important to note that using the bootstrap leads to behavior similar to resampling the noise, but does not require extra knowledge (i.e., a single dataset suffices). Note finally that bootstrapping residuals does slightly alter the regularization paths (because of the bias of the Lasso estimate) and the selected patterns (see other evidence of this behavior in Figure 3 and Figure 4).

In Figure 3, we compute the marginal probability of selecting the variables for the Lasso (left column) and the various ways of using the Bolasso (bootstrapping pairs or residuals), i.e., after intersecting. Those are obtained by running the Bolasso with 512 replications, 128 times on the same design but with different noisy observations (thus, a total of $512 \times 128 = 65{,}536$ Lasso runs is used for each of the plots in the middle and right columns of Figure 3). On the top plots, the Lasso consistency condition in Eq. (2.3) is satisfied and the two versions of the Bolasso increase the width of the consistency region of the Lasso, while on the bottom plots, it is not, and the Bolasso creates a consistency region. Note that bootstrapping residuals modifies the early parts of the regularization path (i.e., large values of the regularization parameter), illustrating the effect of the bias of the Lasso when bootstrapping residuals.

In Figure 4, we consider the effect of the number of bootstrap replications $m$, in the same two situations. Increasing $m$ appears to be always beneficial. Note that (1) when $m = 1$ (essentially the Lasso), we get some strictly positive probabilities of good pattern selection even in the inconsistent case, illustrating Proposition 2.4, and (2) if $m$ were too large, some of the relevant variables would start to leave the intersection of active sets (but this has not happened in our simulations with only 512 replications).

### 5.3 Experiments - High-Dimensional Settings

We now consider a “high-dimensional” design matrix (i.e., such that $p > n$) with a small number of relevant variables. The design matrix is sampled from a normal distribution with i.i.d. elements. For the sampled design matrix, the condition in Eq. (2.3) is not satisfied, as for most designs with such values of $n$, $p$ and $|J|$, as shown in Figure 1 in Section 4, but assumptions (A4.1-4.2) are.

We performed the same simulations as in Section 5.2, with additional bootstrapping procedures, namely after projecting onto the support of the original Lasso estimate, with the same regularization parameter (no consistency result) or with a parameter multiplied by a factor of order $\log p$ (consistency result in Proposition 4.2).

In Figure 5, we consider marginal probabilities before intersection, to study the general behavior of the various resampling schemes. We see that the bootstrapping procedures behave rather differently from resampling the noise (unlike in low-dimensional settings), and that bootstrapping pairs does lose some of the relevant variables while bootstrapping residuals does not. After projection, all resampling procedures behave correctly. In Figure 6, we compare the Lasso and the Bolasso (for several ways of performing the bootstrap): bootstrapping residuals consistently leads to better performance. Note that while the top right plot behaves correctly, we currently have no proof for it. In Figure 7, we consider the effect of various numbers of replications. Note that in the bottom-right plot, 512 replications are indeed too many (i.e., when too many replications are used, we start to lose some of the relevant variables).

## 6 Conclusion

We have presented a detailed analysis of the variable selection properties of a bootstrapped version of the Lasso. The model estimation procedure, referred to as the Bolasso, is provably consistent under general assumptions, in low-dimensional and high-dimensional settings. We have considered the two types of bootstrap for linear regression, and have shown empirically and theoretically better properties for the bootstrap of residuals. This work brings to light that the poor variable selection results of the Lasso may be easily enhanced thanks to a simple parameter-free resampling procedure. Our contribution also suggests that the use of bootstrap samples by L. Breiman in Bagging/Arcing/Random Forests arcing () may have been somewhat overlooked and considered a minor feature, while using bootstrap samples may actually be a key computational ingredient in such algorithms for good model selection performance, and eventually good prediction performance on real datasets.

The current work could be extended in various ways: first, we have not yet proved that bootstrapping residuals, while giving nice empirical performance, is consistent in terms of model selection. Second, a similar analysis could be applied to settings other than least-square regression with the $\ell_1$-norm, namely regularization by block $\ell_1$-norms grouped (), multiple kernel learning grouped (), more general hierarchical norms cap (); hkl (), and other losses such as general convex classification losses; in particular, an extension of our results to well-specified generalized linear models is straightforward, as such models are locally equivalent to a least-squares problem like the one in Eq. (2.1).

Moreover, extensions to general misspecified models or models with heteroscedastic additive noise could be carried out. Also, theoretical and practical connections could be made with other work on resampling methods and boosting boosting (). In particular, using the bootstrap to both select the model and estimate the regularization parameter is clearly of interest. Finally, applications of such resampling techniques to signal processing and compressed sensing cs1 (); cs2 () remain to be explored, both in the context of basis pursuit ($\ell_1$-norm regularization, chen ()) and matching pursuit (greedy selection, Mallat93matching ()).

## Appendix A Probability results

In this appendix, we review concentration inequalities that we will need throughout the proofs.

If