# Perturbation Bootstrap in Adaptive Lasso

Debraj Das North Carolina State University USA
Karl Gregory University of Mannheim, Germany
S. N. Lahiri North Carolina State University USA
###### Abstract

The Adaptive LASSO (ALASSO) was proposed by Zou [J. Amer. Statist. Assoc. 101 (2006) 1418-1429] as a modification of the LASSO for the purpose of simultaneous variable selection and estimation of the parameters in a linear regression model. Zou (2006) established that the ALASSO estimator is variable-selection consistent as well as asymptotically Normal in the indices corresponding to the nonzero regression coefficients in certain fixed-dimensional settings. In an influential paper, Minnier, Tian and Cai [J. Amer. Statist. Assoc. 106 (2011) 1371-1382] proposed a perturbation bootstrap method and established its distributional consistency for the ALASSO estimator in the fixed-dimensional setting. In this paper, however, we show that this (naive) perturbation bootstrap fails to achieve second order correctness in approximating the distribution of the ALASSO estimator. We propose a modification to the perturbation bootstrap objective function and show that a suitably studentized version of our modified perturbation bootstrap ALASSO estimator achieves second-order correctness even when the dimension of the model is allowed to grow to infinity with the sample size. As a consequence, inferences based on the modified perturbation bootstrap will be more accurate than the inferences based on the oracle Normal approximation. We give simulation studies demonstrating good finite-sample properties of our modified perturbation bootstrap method as well as an illustration of our method on a real data set.

###### Keywords:
ALASSO, Naive Perturbation Bootstrap, Modified Perturbation Bootstrap, Second order correctness, Oracle Distribution

## 1 Introduction

Consider the multiple linear regression model

$$y_i = x_i'\beta + \epsilon_i, \qquad i = 1,\dots,n, \tag{1.1}$$

where $y_1,\dots,y_n$ are responses, $\epsilon_1,\dots,\epsilon_n$ are independent and identically distributed (iid) random variables, $x_1,\dots,x_n$ are known non-random design vectors, and $\beta=(\beta_1,\dots,\beta_p)'$ is the $p$-dimensional vector of regression parameters. When the dimension $p$ is large, it is common to approach regression model (1.1) with the assumption that the vector $\beta$ is sparse, that is, that the set $A=\{j:\beta_j\neq 0\}$ has cardinality $p_0=|A|$ much smaller than $p$, meaning that only a few of the covariates are “active”. The LASSO estimator introduced by Tibshirani (1996) is well suited to the sparse setting because it sets some regression coefficients exactly equal to 0. One disadvantage of the LASSO, however, is that it produces non-trivial asymptotic bias for the non-zero regression parameters, primarily because it shrinks all coefficient estimates toward zero [cf. Knight and Fu (2000)].

Building on the LASSO, Zou (2006) proposed the adaptive LASSO [hereafter referred to as ALASSO] estimator $\hat\beta_n$ of $\beta$ in the regression problem (1.1) as

$$\hat\beta_n=\operatorname*{argmin}_{t}\Big[\sum_{i=1}^{n}(y_i-x_i't)^2+\lambda_n\sum_{j=1}^{p}|\tilde\beta_{j,n}|^{-\gamma}|t_j|\Big], \tag{1.2}$$

where $\tilde\beta_{j,n}$ is the $j$th component of a root-$n$-consistent estimator $\tilde\beta_n$ of $\beta$, such as the ordinary least squares (OLS) estimator when $p\le n$ or the LASSO or ridge regression estimator when $p>n$, $\lambda_n>0$ is the penalty parameter, and $\gamma>0$ is a constant governing the influence of the preliminary estimator $\tilde\beta_n$ on the ALASSO fit. Zou (2006) showed in the fixed-$p$ setting that under some regularity conditions and with the right choice of $\lambda_n$, the ALASSO estimator enjoys the so-called oracle property [cf. Fan and Li (2001)]; that is, it is variable-selection consistent and it estimates the non-zero regression parameters with the same precision as the OLS estimator which one would compute if the set of active covariates were known.
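The criterion (1.2) is a weighted LASSO problem, so it can be minimized by cyclic coordinate descent with soft-thresholding. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names, the toy data, the choice of OLS as initial estimator, and the tuning value `lam = n ** 0.25` are all hypothetical and chosen only for illustration.

```python
import random


def soft_threshold(z, c):
    """Soft-thresholding operator: sign(z) * max(|z| - c, 0)."""
    if z > c:
        return z - c
    if z < -c:
        return z + c
    return 0.0


def weighted_lasso(X, y, lam, weights, n_sweeps=500):
    """Coordinate descent for sum_i (y_i - x_i't)^2 + lam * sum_j weights[j] * |t_j|."""
    n, p = len(X), len(X[0])
    t = [0.0] * p
    resid = list(y)  # residuals y - X t for the current t (t = 0 at the start)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Column j times the partial residual (contribution of t_j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * t[j]
            new_tj = soft_threshold(rho, lam * weights[j] / 2.0) / col_ss[j]
            delta = new_tj - t[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                t[j] = new_tj
    return t


def adaptive_lasso(X, y, lam, gamma=1.0):
    """ALASSO criterion (1.2) with an OLS initial estimator (assumes p <= n, full rank)."""
    p = len(X[0])
    beta_init = weighted_lasso(X, y, 0.0, [0.0] * p)  # no penalty: least squares fit
    weights = [abs(b) ** (-gamma) for b in beta_init]  # assumes no exact zeros in OLS
    return weighted_lasso(X, y, lam, weights), beta_init


# Hypothetical toy data: n = 60, p = 4, two active covariates.
random.seed(0)
n, p = 60, 4
beta_true = [2.0, -1.5, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(a * b for a, b in zip(row, beta_true)) + random.gauss(0.0, 1.0) for row in X]

lam = n ** 0.25  # illustrative tuning value only
beta_hat, beta_init = adaptive_lasso(X, y, lam, gamma=1.0)
print("initial (OLS):", [round(b, 3) for b in beta_init])
print("ALASSO:       ", [round(b, 3) for b in beta_hat])
```

The factor `lam * weights[j] / 2.0` in the threshold comes from differentiating the unscaled squared-error loss in (1.2); a solver that uses a loss scaled by $1/(2n)$ would need the penalty rescaled accordingly.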

In an important recent work, Minnier, Tian and Cai (2011) introduced the perturbation bootstrap in the ALASSO setup. To state their main results, let $\beta_n^{*N}$ be the naive perturbation bootstrap ALASSO estimator prescribed by Minnier, Tian and Cai (2011) and define $\hat A_n=\{j:\hat\beta_{j,n}\neq 0\}$ and $A_n^{*N}=\{j:\beta_{j,n}^{*N}\neq 0\}$. These authors showed that under some regularity conditions and with $p$ fixed, as $n\to\infty$,

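$$P_*\big(A_n^{*N}=\hat A_n\big)\to 1 \quad\text{and}\quad \sqrt{n}\,\big(\beta_n^{*N(1)}-\hat\beta_n^{(1)}\big)\,\big|\,\varepsilon \;\asymp_d\; \sqrt{n}\,\big(\hat\beta_n^{(1)}-\beta^{(1)}\big),$$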

where the conditioning on $\varepsilon$ is conditioning on the data, $z^{(1)}$ denotes the sub-vector of a vector $z$ corresponding to the co-ordinates in $A$, “$\asymp_d$” denotes asymptotic equivalence in distribution, and $P_*$ denotes bootstrap probability conditional on the data. Thus Minnier, Tian and Cai (2011) [hereafter referred to as MTC(11)] showed that, in the fixed-$p$ setting and conditionally on the data, the naive perturbation bootstrap version of the ALASSO estimator is variable-selection consistent in the sense that it recovers the support of the ALASSO estimator with probability tending to one, and that its conditional distribution for the non-zero regression parameters converges to that of the ALASSO estimator. But the accuracy of inference for the non-zero regression parameters relies on the rate of convergence of the bootstrap distribution of $\sqrt{n}(\beta_n^{*N(1)}-\hat\beta_n^{(1)})$ to the distribution of $\sqrt{n}(\hat\beta_n^{(1)}-\beta^{(1)})$ after proper studentization. Furthermore, Chatterjee and Lahiri (2013) showed that the convergence of the ALASSO estimators of the nonzero regression coefficients to their oracle Normal distribution is quite slow, owing to the bias induced by the penalty term in (1.2). Thus, it would be important for the accuracy of inference if second-order correctness could be achieved in approximating the distribution of the ALASSO estimator by the perturbation bootstrap. Second-order correctness implies that the distributional approximation has a uniform error rate of $o_p(n^{-1/2})$. We show in this paper, however, that the distribution of the naive perturbation bootstrap version of the ALASSO estimator, as defined by MTC(11), cannot be second-order correct even in fixed dimension. For more details, see Section 4.

We introduce a modified perturbation bootstrap for the ALASSO estimator for which second-order correctness does hold, even when the number of regression parameters $p=p_n$ is allowed to increase with the sample size $n$. We also show in Proposition 2.1 that the modified perturbation bootstrap version of the ALASSO estimator (defined in Section 2) can be computed by minimizing simple criterion functions. This makes our bootstrap procedure computationally simple and inexpensive.

In this paper, we consider some pivotal quantities based on ALASSO estimators and establish that the modified perturbation bootstrap estimates the distribution of these pivotal quantities up to second order, i.e. with an error that is of much smaller magnitude than what we would obtain by using the Normal approximation under the knowledge of the true active set of covariates. We will refer to the Normal approximation which uses knowledge of the true set of active covariates as the oracle Normal approximation. Our main results show that the modified perturbation bootstrap method enables, for example, the construction of confidence intervals for the nonzero regression coefficients with smaller coverage error than those based on the oracle Normal approximation.

More precisely, we consider pivots which are studentizations of the quantities

$$\sqrt{n}\,D_n(\hat\beta_n-\beta) \quad\text{and}\quad \sqrt{n}\,D_n(\hat\beta_n-\beta)+\check b_n,$$

where $D_n$ is a $q\times p$ matrix ($q$ fixed) producing linear combinations of interest of $\beta$ and where $\check b_n$ is a bias correction term which we will define later. We find that in the $p\le n$ case, the modified perturbation bootstrap can estimate the distribution of the first pivot with an error of order $o_p(n^{-1/2})$ (see Theorem 3.1). This is much smaller than the error of the oracle Normal approximation, which was shown in Theorem 3.1 of Chatterjee and Lahiri (2013) to be of the order $O(n^{-1/2}+\|b_n\|+c_n)$, where $b_n$ is the bias targeted by $\check b_n$ and $c_n$ is determined by the initial estimator $\tilde\beta_n$ and the tuning parameters $\lambda_n$ and $\gamma$; both $\|b_n\|$ and $c_n$ are typically greater in magnitude than $n^{-1/2}$ and hence determine the rate of the oracle Normal approximation. We also find that the bias correction in the second pivot improves the error rate so that the modified perturbation bootstrap estimator achieves the rate $O_p(n^{-1})$ (see Theorem 3.2), which is a significant improvement over the best possible rate of the oracle Normal approximation, namely $O(n^{-1/2})$. In the $p>n$ case, we find that the modified perturbation bootstrap estimates the distributions of studentized versions of both the bias-corrected and un-bias-corrected pivots with the rate $o_p(n^{-1/2})$ (see Theorem 3.3), establishing the second-order correctness of our modified perturbation bootstrap in the high-dimensional setting.

We show that the naive perturbation bootstrap of MTC(11) is not second-order correct (see Theorem 4.1) by investigating the Karush-Kuhn-Tucker (KKT) condition [cf. Boyd and Vandenberghe (2004)] corresponding to their minimization problem. Second-order correctness is not attainable by the naive version of the perturbation bootstrap, primarily because the naive bootstrapped ALASSO criterion function is not properly centered. We derive the form of the centering constant by analyzing the corresponding approximation errors using the theory of Edgeworth expansion. To accommodate the centering correction, we modify the perturbation bootstrap criterion function for the ALASSO; see Section 2 for details. In addition, we find that it is beneficial, from both theoretical and computational perspectives, to modify the perturbation bootstrap version of the initial estimators in a similar way. To prove second-order correctness of the modified perturbation bootstrap ALASSO, the key steps are to find an Edgeworth expansion of the bootstrap pivotal quantities based on the modified criterion function and to compare it with the Edgeworth expansion of the sample pivots. In our setting, the dimension $p$ of the regression parameter vector can grow polynomially in the sample size $n$ at a rate depending on the number of finite moments of the error distribution. Extension to the case in which $p$ grows exponentially with $n$ would be possible under some strong assumptions, e.g. under finiteness of the moment generating function of the regression errors.

We conclude this section with a brief literature review. The perturbation bootstrap was introduced by Jin, Ying, and Wei (2001) as a resampling procedure where the objective function has a U-process structure. Work on the perturbation bootstrap in the linear regression setup is limited. Some work has been carried out by Chatterjee and Bose (2005), MTC(11), Zhou, Song and Thompson (2012), and Das and Lahiri (2016). An analogous bootstrap procedure, called multiplier bootstrap, emerged recently as an effective procedure for approximating the underlying sampling distribution. For details on multiplier bootstrap procedure see Bücher and Dette (2013), Chernozhukov et al. (2013, 2014, 2016), Bücher and Kojadinovic (2016a, 2016b) and references therein. As a variable selection procedure, Tibshirani (1996) introduced the LASSO. Zou (2006) proposed the ALASSO as an improvement over the LASSO. For the ALASSO and related popular penalized estimation and variable selection procedures, the residual bootstrap has been investigated by Knight and Fu (2000), Hall, Lee and Park (2009), Chatterjee and Lahiri (2010, 2011, 2013), Wang and Song (2011), MTC(11), Van De Geer et al. (2014), and Camponovo (2015), among others.

The rest of the paper is organized as follows. The modified perturbation bootstrap for the ALASSO is introduced and discussed in Section 2. Main results concerning the estimation properties of the studentized modified perturbation bootstrap pivotal quantities are given in Section 3. Negative results on the naive perturbation bootstrap approximation proposed by MTC(11) are discussed in Section 4, together with the intuition behind our modification. Section 5 presents simulation results exploring the finite-sample performance of the modified perturbation bootstrap in comparison with other methods for constructing confidence intervals based on ALASSO estimators, and Section 6 gives an illustration on real data. Regularity conditions and the proofs are provided in Section 7. Further simulation results are relegated to the Supplementary Material.

## 2 The modified perturbation bootstrap for the ALASSO

Let $G_1^*,\dots,G_n^*$ be $n$ independent copies of a non-degenerate random variable $G^*$ having expectation $\mu_{G^*}$. These quantities will serve as perturbation quantities. The modified perturbation bootstrap in ALASSO involves careful construction of the penalized objective function. Besides the observed values $y_1,\dots,y_n$, we need the ALASSO fitted values $\hat y_i=x_i'\hat\beta_n$, $i=1,\dots,n$, to define the penalized objective function. The modified penalized objective function incorporates the sum of two perturbed least squares objective functions, one involving $y_i$ and the other involving $\hat y_i$, $i=1,\dots,n$; see (2.1). A similar modification is also needed in defining the bootstrap versions of the initial estimators; see (2.2). The motivation behind this construction is detailed in Section 4, where we also point out why the naive perturbation bootstrap formulation of MTC(11) fails.

Formally, the modified perturbation bootstrap version $\hat\beta_n^*$ of the ALASSO estimator $\hat\beta_n$ is defined as

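$$\hat\beta_n^*=\operatorname*{argmin}_{t^*}\Big[\sum_{i=1}^{n}(y_i-x_i't^*)^2(G_i^*-\mu_{G^*})+\sum_{i=1}^{n}(\hat y_i-x_i't^*)^2(2\mu_{G^*}-G_i^*)+\mu_{G^*}\lambda_n\sum_{j=1}^{p}|\tilde\beta_{j,n}^*|^{-\gamma}|t_j^*|\Big], \tag{2.1}$$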

where $\tilde\beta_{j,n}^*$ is the $j$th component of $\tilde\beta_n^*$, the modified perturbation bootstrap version of the initial estimator $\tilde\beta_n$. We construct $\tilde\beta_n^*$ as

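$$\tilde\beta_n^*=\operatorname*{argmin}_{t^*}\Big[\sum_{i=1}^{n}(y_i-x_i't^*)^2(G_i^*-\mu_{G^*})+\sum_{i=1}^{n}(\hat y_i-x_i't^*)^2(2\mu_{G^*}-G_i^*)+\mu_{G^*}\tilde\lambda_n\sum_{j=1}^{p}|t_j^*|^{l}\Big], \tag{2.2}$$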

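where $\tilde\lambda_n=0$ when $\tilde\beta_n$ is taken as the OLS estimator (when $p\le n$), and $l=1$ or $2$ according as the initial estimator is taken as the LASSO or the Ridge estimator (when $p>n$). Note that $\tilde\lambda_n$ may be different from $\lambda_n$.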

We point out that the modified perturbation bootstrapped estimators can be computed using existing algorithms. Define, for some non-negative constants , , where . Now assume, , where , for and . Then we have the following proposition.

###### Proposition 2.1

.

This proposition allows us to compute $\hat\beta_n^*$ as well as $\tilde\beta_n^*$ by minimizing standard objective functions on some pseudo-values. Note that the modified perturbation bootstrapped ALASSO estimator as well as the initial estimators can be obtained by properly perturbing only the ALASSO residuals $\hat\epsilon_i$ in the decomposition $y_i=\hat y_i+\hat\epsilon_i$, $i=1,\dots,n$.
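To see why standard solvers suffice, one can expand the squares in (2.1) and (2.2): the coefficient of $(x_i't^*)^2$ in the two perturbed sums is $(G_i^*-\mu_{G^*})+(2\mu_{G^*}-G_i^*)=\mu_{G^*}$, so the two sums together equal $\mu_{G^*}\sum_i (z_i-x_i't^*)^2$ plus a term free of $t^*$, where $z_i=\hat y_i+(y_i-\hat y_i)(G_i^*-\mu_{G^*})/\mu_{G^*}$. The sketch below checks this identity numerically. It reflects our own algebra, and the symbol $z_i$ and all function names are hypothetical, so the notation may differ from that of Proposition 2.1. The Beta(1/2, 3/2) weights (mean 1/4) and the stand-in fitted values are arbitrary illustrative choices.

```python
import random


def pseudo_responses(y, y_hat, G, mu):
    """z_i = y_hat_i + (y_i - y_hat_i) * (G_i - mu) / mu: only the residual is perturbed."""
    return [yh + (yi - yh) * (g - mu) / mu for yi, yh, g in zip(y, y_hat, G)]


def perturbed_ls(X, y, y_hat, G, mu, t):
    """Sum of the two perturbed least-squares terms appearing in (2.1) and (2.2)."""
    total = 0.0
    for row, yi, yh, g in zip(X, y, y_hat, G):
        fit = sum(a * b for a, b in zip(row, t))
        total += (yi - fit) ** 2 * (g - mu) + (yh - fit) ** 2 * (2.0 * mu - g)
    return total


def plain_ls(X, z, t):
    """Ordinary least-squares criterion on the responses z."""
    return sum((zi - sum(a * b for a, b in zip(row, t))) ** 2 for row, zi in zip(X, z))


random.seed(1)
n, p = 30, 3
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0.0, 1.0) for _ in range(n)]
y_hat = [0.5 * yi for yi in y]  # arbitrary stand-in for the ALASSO fitted values
mu = 0.25                       # mean of Beta(1/2, 3/2)
G = [random.betavariate(0.5, 1.5) for _ in range(n)]
z = pseudo_responses(y, y_hat, G, mu)

# perturbed_ls(t) - mu * plain_ls(t) should be the same number for every t.
gaps = []
for _ in range(5):
    t = [random.gauss(0.0, 2.0) for _ in range(p)]
    gaps.append(perturbed_ls(X, y, y_hat, G, mu, t) - mu * plain_ls(X, z, t))
print(gaps)
```

Since the penalty terms in (2.1) and (2.2) also carry the factor $\mu_{G^*}$, dividing through by $\mu_{G^*}$ leaves an ordinary ALASSO (respectively LASSO, Ridge, or OLS) criterion in the pseudo-responses $z_i$, which any existing routine for those problems can minimize.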

## 3 Main results for the modified perturbation bootstrap

We here present the minimal notation for stating our main results. Assumptions (A.1)–(A.7), to which we refer in the statements of our results, are detailed in Section 7.

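From now on we denote the true parameter vector as $\beta_n$, where the subscript $n$ emphasizes that the dimension $p=p_n$ may grow with the sample size $n$. From $\beta_n$, $\hat\beta_n$, and $\hat\beta_n^*$ define $A_n=\{j:\beta_{j,n}\neq 0\}$, $\hat A_n=\{j:\hat\beta_{j,n}\neq 0\}$, and $\hat A_n^*=\{j:\hat\beta_{j,n}^*\neq 0\}$ and let $p_0=|A_n|$. We will suppose, without loss of generality, that $A_n=\{1,\dots,p_0\}$.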

Let $C_n=n^{-1}\sum_{i=1}^{n}x_ix_i'$ and partition it according to $A_n=\{1,\dots,p_0\}$ as

$$C_n=\begin{bmatrix} C_{11,n} & C_{12,n}\\ C_{21,n} & C_{22,n}\end{bmatrix},$$

where is of dimension . Define and partition as , where is a vector. Let .

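Now define $T_n=\sqrt{n}\,D_n(\hat\beta_n-\beta_n)$ and the corresponding modified perturbation bootstrap version $T_n^*=\sqrt{n}\,D_n(\hat\beta_n^*-\hat\beta_n)$, where $D_n$ is a known $q\times p$ matrix with and $q$ is not dependent on $n$. Let $D_n^{(1)}$ contain the first $p_0$ columns of $D_n$. Define also the matrices $\hat C_{11,n}$ and $\hat D_n^{(1)}$ according to $\hat A_n$ such that $\hat C_{11,n}$ is the sub-matrix of $C_n$ with rows and columns in $\hat A_n$ and $\hat D_n^{(1)}$ is the sub-matrix of $D_n$ with columns in $\hat A_n$. Let . Let according as , , , respectively. We denote by $\mathcal{C}_q$ the collection of convex sets of $\mathbb{R}^q$.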

Our results will concern convergence in distribution of some studentizations of $T_n^*$ to the distributions of corresponding studentizations of $T_n$. Define the studentized or pivot quantities

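$$\begin{aligned}
R_n &= \begin{cases}\hat\sigma_n^{-1}\hat\Sigma_n^{-1/2}T_n & \text{for } p\le n\\ \hat\sigma_n^{-1}\check\Sigma_n^{-1/2}T_n & \text{for } p>n\end{cases}\\
R_n^* &= \begin{cases}\hat\sigma_n^{*-1}\hat\sigma_n\breve\Sigma_n^{-1/2}T_n^* & \text{for } p\le n\\ \hat\sigma_n^{*-1}\hat\sigma_n\tilde\Sigma_n^{-1/2}T_n^* & \text{for } p>n\end{cases}\\
\check R_n &= \check\sigma_n^{-1}\check\Sigma_n^{-1/2}\big[T_n+\check b_n\big]\\
\check R_n^* &= \check\sigma_n^{*-1}\check\sigma_n\tilde\Sigma_n^{-1/2}\big[T_n^*+\check b_n^*\big],
\end{aligned}$$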

where $\check R_n$ and $\check R_n^*$ involve the bias-corrections $\check b_n$ and $\check b_n^*$. In the above:

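$$\begin{aligned}
\hat\Sigma_n &= n^{-1}\sum_{i=1}^{n}\big(\hat\xi_i^{(0)}+\hat\eta_i^{(0)}\big)\big(\hat\xi_i^{(0)}+\hat\eta_i^{(0)}\big)', & \check\Sigma_n &= n^{-1}\sum_{i=1}^{n}\hat\xi_i^{(0)}\hat\xi_i^{(0)\prime},\\
\breve\Sigma_n &= n^{-1}\sum_{i=1}^{n}\big(\breve\xi_i^{(0)}+\breve\eta_i^{(0)}\big)\big(\breve\xi_i^{(0)}+\breve\eta_i^{(0)}\big)', & \tilde\Sigma_n &= n^{-1}\sum_{i=1}^{n}\breve\xi_i^{(0)}\breve\xi_i^{(0)\prime},
\end{aligned}$$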

where and , with

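$$\hat\eta_i=\Big(\frac{\lambda_n}{2n}\,\tilde x_{i,j}\,\frac{\gamma}{|\hat\beta_{j,n}|^{\gamma+1}}\,\operatorname{sgn}(\hat\beta_{j,n})\Big)_{j\in\hat A_n},$$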

and and , where for . Moreover

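$$\begin{aligned}
\hat\sigma_n^{2} &= n^{-1}\sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p}x_{ij}\hat\beta_{j,n}\Big)^2\\
\check\sigma_n^{2} &= n^{-1}\sum_{i=1}^{n}\Big(y_i-\sum_{j\in\hat A_n}x_{ij}\tilde\beta_{j,n}\Big)^2\\
\hat\sigma_n^{*2} &= n^{-1}\mu_{G^*}^{-2}\sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p}x_{ij}\hat\beta_{j,n}^*\Big)^2(G_i^*-\mu_{G^*})^2\\
\check\sigma_n^{*2} &= n^{-1}\mu_{G^*}^{-2}\sum_{i=1}^{n}\Big(y_i-\sum_{j\in\hat A_n^*}x_{ij}\tilde\beta_{j,n}^*\Big)^2(G_i^*-\mu_{G^*})^2,
\end{aligned}$$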

and lastly

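$$\check b_n=\hat D_n^{(1)}\hat C_{11,n}^{-1}\hat s_n^{(1)}\frac{\lambda_n}{2\sqrt{n}} \quad\text{and}\quad \check b_n^*=\hat D_n^{*(1)}\hat C_{11,n}^{*-1}\hat s_n^{*(1)}\frac{\lambda_n}{2\sqrt{n}},$$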

where and are the and vectors with th entry equal to , and , , respectively. The matrix is the sub-matrix of with rows and columns in and is the sub-matrix of with columns in .

We are motivated to look at these studentized or pivot quantities by the fact that studentization improves the rate of convergence of bootstrap estimators in many settings [cf. Hall (1992)].

###### Remark 1

Note that the matrices , , and used in defining the bootstrap pivots do not depend on . Hence it is not required to compute the negative square roots of these matrices for each Monte Carlo bootstrap iteration; these must only be computed once. This is a notable feature of our modified perturbation bootstrap method from the perspective of computational complexity.

### 3.1 Results for p≤n.

###### Theorem 3.1

Let (A.1)–(A.6) hold with . Then

$$\sup_{B\in\mathcal{C}_q}\big|P_*(R_n^*\in B)-P(R_n\in B)\big|=o_p(n^{-1/2}).$$

Theorem 3.1 shows that after proper studentization, the modified perturbation bootstrap approximation of the distribution of the ALASSO estimator is second-order correct. The error rate reduces to $o_p(n^{-1/2})$ from $O(n^{-1/2})$, the best possible rate obtained by the oracle Normal approximation. This is a significant improvement from the perspective of inference. As a consequence, the coverage accuracy of the percentile confidence intervals based on $R_n^*$ will be greater than that of confidence intervals based on the oracle Normal approximation.
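As an illustration of how such an interval could be formed in the scalar case $q=1$, one can invert the bootstrap quantiles of the studentized pivot. The sketch below is a generic pivot-inversion recipe under stated assumptions, not the authors' code: `boot_pivots` stands in for Monte Carlo replicates of $R_n^*$ (simulated here from an arbitrary skewed distribution), `scale` stands in for $\hat\sigma_n\hat\Sigma_n^{1/2}$, and all function names and numerical values are hypothetical.

```python
import math
import random


def empirical_quantile(values, prob):
    """Smallest order statistic whose empirical cdf is at least prob."""
    ordered = sorted(values)
    k = max(1, math.ceil(prob * len(ordered)))
    return ordered[min(k, len(ordered)) - 1]


def pivot_interval(theta_hat, scale, n, boot_pivots, alpha=0.10):
    """Solve q_lo <= sqrt(n) * (theta_hat - theta) / scale <= q_hi for theta."""
    q_lo = empirical_quantile(boot_pivots, alpha / 2.0)
    q_hi = empirical_quantile(boot_pivots, 1.0 - alpha / 2.0)
    half = scale / math.sqrt(n)
    return theta_hat - q_hi * half, theta_hat - q_lo * half


random.seed(2)
# Stand-in for B = 999 bootstrap replicates of the studentized pivot.
boot_pivots = [random.gauss(0.0, 1.0) + 0.3 * (random.expovariate(1.0) - 1.0)
               for _ in range(999)]
lower, upper = pivot_interval(theta_hat=1.2, scale=0.8, n=100, boot_pivots=boot_pivots)
print(lower, upper)
```

Note that the upper bootstrap quantile determines the lower endpoint and vice versa, which is what allows the interval to reflect skewness that a symmetric Normal interval would ignore.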

We point out that the error rate in Theorem 3.1 cannot be reduced to the optimal rate of $O_p(n^{-1})$, unlike in the fixed-dimension case. To achieve this optimal rate by our modified bootstrap method, we now consider a bias-corrected pivot $\check R_n$ and its modified perturbation bootstrap version $\check R_n^*$. The following theorem states that it achieves the optimal rate.

###### Theorem 3.2

Let (A.1)–(A.6) hold with . Then

$$\sup_{B\in\mathcal{C}_q}\big|P_*(\check R_n^*\in B)-P(\check R_n\in B)\big|=O_p(n^{-1}).$$

Theorem 3.2 suggests that the modified perturbation bootstrap achieves notable improvement in the error rate over the oracle Normal approximation irrespective of the order of the bias term. Thus Theorem 3.2 establishes the perturbation bootstrap method as an effective method for approximating the distribution of the ALASSO estimator when $p\le n$.

### 3.2 Results for p>n.

We now present a result for the quality of the perturbation bootstrap approximation when the dimension $p$ of the regression parameter can be much larger than the sample size $n$. We consider the initial estimator $\tilde\beta_n$ to be some $\sqrt{n}$-consistent bridge estimator, for example the LASSO or Ridge estimator, in defining the ALASSO estimator by (1.2). The bootstrap version of the LASSO or Ridge estimator is defined by (2.2).

###### Theorem 3.3

Let (A.1)(i), (ii), (iii)’ and (A.2)–(A.6) and (A.7) hold with and and let . Then

$$\sup_{B\in\mathcal{C}_q}\big|P_*(\check R_n^*\in B)-P(\check R_n\in B)\big|=o_p(n^{-1/2}).$$

Theorem 3.3 states that our proposed modified perturbation bootstrap approximation is second-order correct, even when $p>n$. Thus the error rate obtained by our proposed method is significantly better than $O(n^{-1/2})$, which is the best-attainable rate of the oracle Normal approximation. We want to point out that for the validity of our method, $p$ can grow at most at a polynomial rate depending on the decay of the tail of the error distribution and on the growth of the design vectors.

## 4 Impossibility of Second-order correctness of the naive perturbation bootstrap

In this section we describe the naive perturbation bootstrap as defined by MTC(11) for the ALASSO and show that second-order correctness cannot be achieved by their naive perturbation bootstrap method. When the objective function is the usual least squares criterion function, the naive perturbation bootstrap ALASSO estimator is defined in MTC(11) as

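$$\beta_n^{*N}=\operatorname*{argmin}_{v_n^*}\Big[\sum_{i=1}^{n}(y_i-x_i'v_n^*)^2G_i^*+\lambda_n^*\sum_{j=1}^{p}|\tilde\beta_{j,n}^{*N}|^{-\gamma}|v_{j,n}^*|\Big], \tag{4.1}$$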

where

1. is such that and as .

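2. the initial naive bootstrap estimator $\tilde\beta_n^{*N}$ is defined as

$$\tilde\beta_n^{*N}=\operatorname*{argmin}_{v_n^*}\Big[\sum_{i=1}^{n}(y_i-x_i'v_n^*)^2G_i^*\Big]$$

and $\tilde\beta_{j,n}^{*N}$ is the $j$th component of $\tilde\beta_n^{*N}$.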

3. $\{G_1^*,\dots,G_n^*\}$ is a set of iid non-negative random quantities with mean and variance both equal to 1.

Note that the initial estimator $\tilde\beta_n^{*N}$ is unique only when $p$ is less than or equal to $n$. We now consider the quantity $u_n^{*N}=\sqrt{n}\,(\beta_n^{*N}-\hat\beta_n)$, which we can show from (4.1) to be the minimizer

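$$u_n^{*N}=\operatorname*{argmin}_{w_n^*}\Big[w_n^{*\prime}C_n^*w_n^*-2w_n^{*\prime}W_n^*+\lambda_n^*\sum_{j=1}^{p}|\tilde\beta_{j,n}^{*N}|^{-\gamma}\Big(\Big|\hat\beta_{j,n}+\frac{w_{j,n}^*}{\sqrt{n}}\Big|-|\hat\beta_{j,n}|\Big)\Big], \tag{4.2}$$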

where is the th component of the ALASSO estimator , , and . To describe the solution of MTC(11), assume . MTC(11) claimed that when and is fixed, is a solution of (4.2) for sufficiently large , where

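$$u_{n1}^{*N}=C_{11,n}^{-1}\,n^{-1/2}\sum_{i=1}^{n}\epsilon_i x_i^{(1)}(G_i^*-1) \quad\text{and}\quad \big\|u_n^{*N}-\big((u_{n1}^{*N})',\mathbf{0}'\big)'\big\|_\infty=o_{p_*}(1).$$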

However, to achieve second order correctness, we need to obtain a solution of (4.2) such that . We show that such an has the form

$$u_{n2}^{*N}=C_{11,n}^{*-1}\Big[W_n^{*(1)}-\frac{\lambda_n^*}{\sqrt{n}}\,\tilde s_n^{*(1)}\Big]$$

for sufficiently large , where is the first components of and the th component of equals to , (Here we drop the subscript from the notations of true parameter values since we are considering to be fixed in this section). We establish this fact by exploring the KKT condition corresponding to (4.2), which is given by

$$2C_n^*w_n^*-2W_n^*+\frac{\lambda_n^*}{\sqrt{n}}\,\Gamma_n^*l_n=0, \tag{4.3}$$

for some with for and . Since is a non-negative definite matrix, (4.2) is a convex optimization problem; hence (4.3) is both necessary and sufficient in solving (4.2).
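The quadratic part of (4.2) can be checked numerically. Substituting $v_n^*=\hat\beta_n+w_n^*/\sqrt{n}$ into the perturbed least-squares term of (4.1) and subtracting its value at $\hat\beta_n$ gives $w_n^{*\prime}C_n^*w_n^*-2w_n^{*\prime}W_n^*$ when $C_n^*=n^{-1}\sum_i G_i^*x_ix_i'$ and $W_n^*=n^{-1/2}\sum_i G_i^*\hat\epsilon_ix_i$ with $\hat\epsilon_i=y_i-x_i'\hat\beta_n$. These two expressions are inferred here by expanding the square and are an assumption of this sketch rather than a quotation of the paper's definitions. The data, the stand-in for $\hat\beta_n$, and the Exp(1) weights (mean 1, variance 1) are hypothetical.

```python
import math
import random


def naive_criterion(X, y, G, v):
    """Perturbed least-squares part of (4.1): sum_i (y_i - x_i'v)^2 * G_i."""
    return sum(g * (yi - sum(a * b for a, b in zip(row, v))) ** 2
               for row, yi, g in zip(X, y, G))


random.seed(3)
n, p = 40, 3
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0.0, 1.0) for _ in range(n)]
beta_hat = [0.4, 0.0, -0.2]                      # arbitrary stand-in for the ALASSO fit
G = [random.expovariate(1.0) for _ in range(n)]  # Exp(1): mean 1, variance 1
resid = [yi - sum(a * b for a, b in zip(row, beta_hat)) for row, yi in zip(X, y)]

# Assumed forms: C* = n^{-1} sum_i G_i x_i x_i',  W* = n^{-1/2} sum_i G_i resid_i x_i.
C_star = [[sum(g * row[j] * row[k] for row, g in zip(X, G)) / n for k in range(p)]
          for j in range(p)]
W_star = [sum(g * e * row[j] for row, e, g in zip(X, resid, G)) / math.sqrt(n)
          for j in range(p)]

w = [random.gauss(0.0, 1.0) for _ in range(p)]
v = [b + wj / math.sqrt(n) for b, wj in zip(beta_hat, w)]
lhs = naive_criterion(X, y, G, v) - naive_criterion(X, y, G, beta_hat)
rhs = (sum(w[j] * C_star[j][k] * w[k] for j in range(p) for k in range(p))
       - 2.0 * sum(wj * Wj for wj, Wj in zip(w, W_star)))

# Conditional mean of W* given the data (since E G_i = 1); it need not be zero.
W_mean = [sum(e * row[j] for row, e in zip(X, resid)) / math.sqrt(n) for j in range(p)]
print(lhs, rhs, W_mean)
```

Under these assumed forms, the conditional mean of $W_n^*$ is $n^{-1/2}\sum_i\hat\epsilon_ix_i$, which is generally nonzero because the ALASSO residuals are not orthogonal to the design columns; this is the lack of centering that the rest of the section addresses.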

Note that is not centered and hence we need to adjust the solution for centering before investigating if the naive perturbation bootstrap can asymptotically correct the distribution of ALASSO up to second order. Clearly, the centering adjustment term is where . It follows from the steps of the proofs of the results of Section 3 that we need to achieve second-order correctness. We show that this is indeed not the case even in the fixed setting.

More precisely, we negate the second-order correctness of the naive perturbation bootstrap of MTC(11) by first showing that satisfies the KKT condition (4.3) exactly with bootstrap probability converging to 1. Then we show that diverges in bootstrap probability to , which in turn implies that the conditional cdf of can not approximate the cdf of with the uniform accuracy , needed for the validity of second-order correctness. We formalize these arguments in the following theorem.

###### Theorem 4.1

Suppose be fixed and , a positive definite matrix. Define and let and as . Also assume that (A.1)(i), (ii) and (A.4)(i) hold with . Then there exists a sequence of borel sets with and given , the following conclusions hold.

• .

• for any .

• for some .

###### Remark 2

Theorem 4.1 (a), (b) state that the naive perturbation bootstrap is incompetent in approximating the distribution of ALASSO up to second order. The fundamental reason behind second order incorrectness is the inadequate centering in the form of . Although the adjustment term necessary for centering is , which essentially helps to establish distributional consistency in MTC(11), the term is coarser than , leading to second order incorrectness. Part (c) conveys uniformly how far the naive bootstrap cdf is from the original cdf.

### 4.1 Motivation for the modified perturbation bootstrap

Theorem 4.1 settles the fact that naive perturbation bootstrap of MTC(11) does not provide a solution for approximating the distribution of . On an event with probability tending to , the naive perturbation bootstrap estimator suffers either in that is not conditionally variable-selection consistent or is not . If one looks closely into the KKT conditions (4.3), one can recognize that the problem occurs because is not centered. Let denote the centered version of , that is , and consider the vector equation

$$2C_n^*w_n^*-2\tilde W_n^*+\frac{\lambda_n^*}{\sqrt{n}}\,\Gamma_n^*l_n=0, \tag{4.4}$$

which is same as (4.3), but after replacing with . Note that solutions to (4.4) are of the form , where is , are on the set as well as conditionally variable-selection consistent due to their form. If we consider the last equations of (4.3), after putting in place of , then for all we must have

 −λ∗n2√n|~β∗Njn|−γ≤[(C∗21,n)j.u∗(1)n−~W∗jn]≤λ∗n2√n|~β∗Njn