Distributional Results for Thresholding Estimators in High-Dimensional Gaussian Regression Models†

† We would like to thank Hannes Leeb, a referee, and an associate editor for comments on a previous version of the paper.
Revised November 2011
Abstract
We study the distribution of hard-, soft-, and adaptive soft-thresholding estimators within a linear regression model where the number of parameters can depend on sample size and may diverge with it. In addition to the case of known error variance, we define and study versions of the estimators when the error variance is unknown. We derive the finite-sample distribution of each estimator and study its behavior in the large-sample limit, also investigating the effects of having to estimate the variance when the number of degrees of freedom does not tend to infinity or tends to infinity very slowly. Our analysis encompasses both the case where the estimators are tuned to perform consistent variable selection and the case where the estimators are tuned to perform conservative variable selection. Furthermore, we discuss consistency and uniform consistency, and derive the uniform convergence rate under either type of tuning.
MSC subject classification: 62F11, 62F12, 62J05, 62J07, 62E15, 62E20
Keywords and phrases: Thresholding, Lasso, adaptive Lasso, penalized maximum likelihood, variable selection, finite-sample distribution, asymptotic distribution, variance estimation, uniform convergence rate, high-dimensional model, oracle property
1 Introduction
We study the distribution of thresholding estimators such as hard-thresholding, soft-thresholding, and adaptive soft-thresholding in a linear regression model when the number of regressors can be large. These estimators can be viewed as penalized least-squares estimators in the case of an orthogonal design matrix, with soft-thresholding then coinciding with the Lasso (introduced by Frank and Friedman (1993), Alliney and Ruzinsky (1994), and Tibshirani (1996)) and with adaptive soft-thresholding coinciding with the adaptive Lasso (introduced by Zou (2006)). Thresholding estimators have of course been discussed earlier in the context of model selection (see Bauer, Pötscher, and Hackl (1988)) and in the context of wavelets (see, e.g., Donoho, Johnstone, Kerkyacharian, and Picard (1995)). Contributions concerning distributional properties of thresholding and penalized least-squares estimators are as follows: Knight and Fu (2000) study the asymptotic distribution of the Lasso estimator when it is tuned to act as a conservative variable selection procedure, whereas Zou (2006) studies the asymptotic distribution of the Lasso and the adaptive Lasso estimators when they are tuned to act as consistent variable selection procedures. Fan and Li (2001) and Fan and Peng (2004) study the asymptotic distribution of the so-called smoothly clipped absolute deviation (SCAD) estimator when it is tuned to act as a consistent variable selection procedure. In the wake of Fan and Li (2001) and Fan and Peng (2004), a large number of papers have been published that derive the asymptotic distribution of various penalized maximum likelihood estimators under consistent tuning; see the introduction in Pötscher and Schneider (2009) for a partial list. Except for Knight and Fu (2000), all these papers derive the asymptotic distribution in a fixed-parameter framework.
As pointed out in Leeb and Pötscher (2005), such a fixed-parameter framework is often highly misleading in the context of variable selection procedures and penalized maximum likelihood estimators. For that reason, Pötscher and Leeb (2009) and Pötscher and Schneider (2009) have conducted a detailed study of the finite-sample as well as large-sample distribution of various penalized least-squares estimators, adopting a moving-parameter framework for the asymptotic results. [Related results for so-called post-model-selection estimators can be found in Leeb and Pötscher (2003, 2005) and for model averaging estimators in Pötscher (2006); see also Sen (1979) and Pötscher (1991).] The papers by Pötscher and Leeb (2009) and Pötscher and Schneider (2009) are set in the framework of an orthogonal linear regression model with a fixed number of parameters and with the error variance being known.
In the present paper we build on the just-mentioned papers Pötscher and Leeb (2009) and Pötscher and Schneider (2009). In contrast to these papers, we do not assume the number of regressors to be fixed, but let it depend on sample size, thus allowing for high-dimensional models. We also consider the case where the error variance is unknown, which in the case of a high-dimensional model creates nontrivial complications, as estimators for the error variance will then typically not be consistent. Considering thresholding estimators from the outset also allows us to cover non-orthogonal designs. While the asymptotic distributional results in the known-variance case do not differ in substance from the results in Pötscher and Leeb (2009) and Pötscher and Schneider (2009), not unexpectedly we observe different asymptotic behavior in the unknown-variance case if the number of degrees of freedom $n-k$ is constant, the difference resulting from the non-vanishing variability of the error-variance estimator in the limit. Less expected is the result that, under consistent tuning, for the variable selection probabilities (implied by all the estimators considered) as well as for the distribution of the hard-thresholding estimator, estimation of the error variance still has an effect asymptotically even if $n-k$ diverges, provided it does so only slowly.
To give some idea of the theoretical results obtained in the paper, we next present a rough summary of some of them. For simplicity of exposition, assume for the moment that the design matrix $X$ is such that the diagonal elements of $X'X/n$ are equal to one, and that the error variance $\sigma^{2}$ equals one. Let $\hat{\theta}_{H}$ denote the hard-thresholding estimator for a component of the regression parameter $\theta$, the threshold being given by $\hat{\sigma}\eta_{n}$, with $\hat{\sigma}$ denoting the usual error-variance estimator and with $\eta_{n}$ denoting a tuning parameter. An infeasible version of the estimator, denoted by $\tilde{\theta}_{H}$, which uses $\sigma$ instead of $\hat{\sigma}$, is also considered (known-variance case). We then show that the uniform rate of convergence of the hard-thresholding estimator is $n^{-1/2}$ if the threshold satisfies $\eta_{n}\rightarrow 0$ and $n^{1/2}\eta_{n}\rightarrow e$ with $0\leq e<\infty$ ("conservative tuning"), but that the uniform rate is only $\eta_{n}$ if the threshold satisfies $\eta_{n}\rightarrow 0$ and $n^{1/2}\eta_{n}\rightarrow\infty$ ("consistent tuning"). The same result also holds for the soft-thresholding estimator $\hat{\theta}_{S}$ and the adaptive soft-thresholding estimator $\hat{\theta}_{A}$, as well as for the infeasible variants of the estimators that use knowledge of $\sigma$ (known-variance case). Furthermore, all possible limits of the centered and scaled distribution of the hard-thresholding estimator (as well as of the soft- and the adaptive soft-thresholding estimators) under a moving-parameter framework are obtained. Consider first the case of conservative tuning: here all possible limiting forms of the distribution of $n^{1/2}(\hat{\theta}_{H}-\theta)$ as well as of $n^{1/2}(\tilde{\theta}_{H}-\theta)$ for arbitrary parameter sequences are determined. It turns out that, in the known-variance case, these limits are of the same functional form as the finite-sample distribution, i.e., they are a convex combination of a point mass and an absolutely continuous distribution that is an excised version of a normal distribution. In the unknown-variance case, when the number of degrees of freedom $n-k$ goes to infinity, exactly the same limits arise.
However, if $n-k$ is constant, the limits are "averaged" versions of the limits in the known-variance case, the averaging being with respect to the distribution of the variance estimator $\hat{\sigma}$. Again these limits have the same functional form as the corresponding finite-sample distributions. Consider next the case of consistent tuning: here the possible limits of $\eta_{n}^{-1}(\hat{\theta}_{H}-\theta)$ as well as of $\eta_{n}^{-1}(\tilde{\theta}_{H}-\theta)$ have to be considered, as $\eta_{n}$ is the uniform convergence rate. In the known-variance case the limits are convex combinations of (at most) two point masses, the locations of the point masses as well as the weights depending on the parameter sequence. In the unknown-variance case exactly the same limits arise if $n-k$ diverges to infinity sufficiently fast; however, if $n-k$ is constant or diverges to infinity sufficiently slowly, the limits are again convex combinations of the same point masses, but with weights that are typically different. The picture for soft-thresholding and adaptive soft-thresholding is somewhat different: in the known-variance case, as well as in the unknown-variance case when $n-k$ diverges to infinity, the limits are (single) point masses. However, in the unknown-variance case with $n-k$ constant, the limit distribution can have an absolutely continuous component. It is furthermore useful to point out that in the case of consistent tuning the sequence of distributions of $n^{1/2}(\hat{\theta}_{H}-\theta)$ is not stochastically bounded in general (since the slower rate $\eta_{n}$ is the uniform convergence rate), and the same is true for soft-thresholding and adaptive soft-thresholding. This throws light on the fragility of the oracle property; see Section 6.4 for more discussion.
While our theoretical results for the thresholding estimators immediately apply to the Lasso and the adaptive Lasso in the case of orthogonal design, this is not so in the non-orthogonal case. In order to get some insight into the finite-sample distribution of the latter estimators also in the non-orthogonal case, we numerically compare the distribution of the Lasso and the adaptive Lasso with their thresholding counterparts in a simulation study.
The main takeaway messages of the paper can be summarized as follows:

- The finite-sample distributions of the various thresholding estimators considered are highly non-normal, the distributions being in each case a convex combination of a point mass and an absolutely continuous (non-normal) component.

- The non-normality persists asymptotically in a moving-parameter framework.

- Results in the unknown-variance case are obtained from the corresponding results in the known-variance case by smoothing with respect to the distribution of $\hat{\sigma}$. In line with this, one would expect the limiting behavior in the unknown-variance case to coincide with the limiting behavior in the known-variance case whenever the degrees of freedom diverge to infinity. This indeed turns out to be so for some of the results, but not for others, where we see that the speed of divergence of $n-k$ matters.

- In the case of conservative tuning the estimators have the expected uniform convergence rate ($n^{-1/2}$ under the simplified assumptions of the above discussion), whereas under consistent tuning the uniform rate is slower (of the order of the tuning parameter under the simplified assumptions of the above discussion). This is intimately connected with the fact that the so-called 'oracle property' paints a misleading picture of the performance of the estimators.

- The numerical study suggests that the results for the thresholding estimators qualitatively apply also to the components of the Lasso and the adaptive Lasso as long as the design matrix is not too ill-conditioned.
The paper is organized as follows. We introduce the model and define the estimators in Section 2. Section 3 treats the variable selection probabilities implied by the estimators. Consistency, uniform consistency, and uniform convergence rates are discussed in Section 4. We derive the finite-sample distribution of each estimator in Section 5 and study the large-sample behavior of these distributions in Section 6. A numerical study of the finite-sample distribution of the Lasso and the adaptive Lasso can be found in Section 7. All proofs are relegated to Section 8.
2 The Model and the Estimators
Consider the linear regression model

$$Y = X\theta + u,$$

with $Y$ an $n\times 1$ vector, $X$ a nonstochastic $n\times k$ matrix of rank $k$, and $u\sim N(0,\sigma^{2}I_{n})$, $\sigma^{2}>0$. We allow $k$, the number of columns of $X$, as well as the entries of $X$, $\theta$, and $\sigma^{2}$ to depend on sample size $n$ (in fact, also the probability spaces supporting $Y$ and $u$ may depend on $n$), although we shall almost always suppress this dependence on $n$ in the notation. Note that this framework allows for high-dimensional regression models, where the number of regressors $k$ is large compared to sample size $n$, as well as for the more classical situation where $k$ is much smaller than $n$. Furthermore, let $\xi_{n,j}$ denote the nonnegative square root of $\xi_{n,j}^{2}$, the $j$-th diagonal element of $(X'X)^{-1}$. Now let
$\hat{\theta}=(X'X)^{-1}X'Y$ denote the least-squares estimator for $\theta$ and $\hat{\sigma}^{2}=(Y-X\hat{\theta})'(Y-X\hat{\theta})/(n-k)$ the associated estimator for $\sigma^{2}$, the latter being defined only if $n>k$. The hard-thresholding estimator $\hat{\theta}_{H}$ is defined via its components as follows:

$$\hat{\theta}_{H,j}=\hat{\theta}_{j}\,\mathbf{1}\bigl(|\hat{\theta}_{j}|>\hat{\sigma}\xi_{n,j}\eta_{n,j}\bigr),$$

where the tuning parameters $\eta_{n,j}$ are positive real numbers and $\hat{\theta}_{j}$ denotes the $j$-th component of the least-squares estimator. We shall also need to consider its infeasible counterpart $\tilde{\theta}_{H}$ given by

$$\tilde{\theta}_{H,j}=\hat{\theta}_{j}\,\mathbf{1}\bigl(|\hat{\theta}_{j}|>\sigma\xi_{n,j}\eta_{n,j}\bigr).$$
The soft-thresholding estimator $\hat{\theta}_{S}$ and its infeasible counterpart $\tilde{\theta}_{S}$ are given by

$$\hat{\theta}_{S,j}=\operatorname{sign}(\hat{\theta}_{j})\bigl(|\hat{\theta}_{j}|-\hat{\sigma}\xi_{n,j}\eta_{n,j}\bigr)_{+}$$

and

$$\tilde{\theta}_{S,j}=\operatorname{sign}(\hat{\theta}_{j})\bigl(|\hat{\theta}_{j}|-\sigma\xi_{n,j}\eta_{n,j}\bigr)_{+},$$

where $(x)_{+}=\max(x,0)$. Finally, the adaptive soft-thresholding estimator $\hat{\theta}_{A}$ and its infeasible counterpart $\tilde{\theta}_{A}$ are defined via

$$\hat{\theta}_{A,j}=\hat{\theta}_{j}\bigl(1-\hat{\sigma}^{2}\xi_{n,j}^{2}\eta_{n,j}^{2}/\hat{\theta}_{j}^{2}\bigr)_{+}$$

and

$$\tilde{\theta}_{A,j}=\hat{\theta}_{j}\bigl(1-\sigma^{2}\xi_{n,j}^{2}\eta_{n,j}^{2}/\hat{\theta}_{j}^{2}\bigr)_{+}.$$
Note that $\hat{\theta}_{H}$, $\hat{\theta}_{S}$, and $\hat{\theta}_{A}$ as well as their infeasible counterparts are equivariant under scaling of the columns of $X$ by nonzero column-specific scale factors. We have chosen to let the thresholds depend explicitly on $\hat{\sigma}$ ($\sigma$, respectively) and $\xi_{n,j}$ in order to give the tuning parameters $\eta_{n,j}$ an interpretation independent of the values of $\sigma$ and $\xi_{n,j}$. Furthermore, $\eta_{n,j}$ will often be chosen independently of $j$, i.e., $\eta_{n,j}=\eta_{n}$ where $\eta_{n}$ is a positive real number. Clearly, for the feasible versions we always need to assume $n>k$, whereas for the infeasible versions $n\geq k$ suffices.
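The three thresholding rules act coordinate-wise on the least-squares estimate. The following minimal Python sketch uses a generic positive threshold `t` standing in for the product of the variance estimate, the scaling factor, and the tuning parameter; the parameterization of the adaptive rule is one common convention and is stated here only for illustration.

```python
import math

def hard_threshold(b, t):
    """Hard thresholding: keep the least-squares estimate b unchanged
    if its magnitude exceeds the threshold t, otherwise set it to zero."""
    return b if abs(b) > t else 0.0

def soft_threshold(b, t):
    """Soft thresholding: shrink b towards zero by the full amount t,
    truncating at zero."""
    return math.copysign(max(abs(b) - t, 0.0), b)

def adaptive_soft_threshold(b, t):
    """Adaptive soft thresholding: multiply b by (1 - (t/b)**2)_+, so the
    shrinkage t**2/|b| vanishes as |b| grows."""
    if b == 0.0:
        return 0.0
    return b * max(1.0 - (t / b) ** 2, 0.0)
```

All three rules set the estimate exactly to zero whenever |b| is at most t; for |b| > t, hard thresholding does not shrink at all, soft thresholding shrinks by the full t, and adaptive soft thresholding shrinks by t²/|b|, interpolating between the two as |b| varies.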
We note the simple fact that
(1) 
holds on the event that , and that
(2) 
holds on the event that . Analogous inequalities hold for the infeasible versions of the estimators.
Remark 1
(Lasso) (i) Consider the objective function
where the tuning parameters are positive real numbers. It is well-known that a unique minimizer of this objective function exists, the Lasso estimator. It is easy to see that in case $X'X$ is diagonal we have
Hence, in the case of diagonal $X'X$, the components of the Lasso reduce to soft-thresholding estimators with appropriate thresholds; in particular, the Lasso coincides with the soft-thresholding estimator $\hat{\theta}_{S}$ for an appropriate choice of the tuning parameters. Therefore all results derived below for soft-thresholding immediately give corresponding results for the Lasso as well as for the Dantzig selector in the diagonal case. We shall abstain from spelling out further details.
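When the Gram matrix is diagonal the Lasso problem separates across coordinates, and for a single coordinate the equivalence with soft-thresholding can be checked numerically: the minimizer of the scalar penalized criterion 0.5·(b_ls − x)² + λ·|x| is exactly the soft-thresholded least-squares value. A small brute-force sketch (the numeric values are hypothetical):

```python
def lasso_objective(x, b_ls, lam):
    # one-dimensional Lasso criterion: squared loss plus L1 penalty
    return 0.5 * (b_ls - x) ** 2 + lam * abs(x)

def soft_threshold(b, t):
    # closed-form minimizer of the criterion above
    s = abs(b) - t
    if s <= 0:
        return 0.0
    return s if b > 0 else -s

# brute-force check on a fine grid: the grid minimizer should sit next to
# the closed-form soft-thresholding solution
b_ls, lam = 1.7, 0.6
grid = [i / 1000.0 for i in range(-3000, 3001)]
x_star = min(grid, key=lambda x: lasso_objective(x, b_ls, lam))
assert abs(x_star - soft_threshold(b_ls, lam)) < 1e-3
```

The same separation argument applies coordinate by coordinate whenever $X'X$ is diagonal, which is why the distributional results for soft-thresholding transfer to the Lasso in that case.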
(ii) Sometimes the tuning parameter in the definition of the Lasso is chosen independently of the design; more reasonable choices seem to be (a) a choice proportional to the nonnegative square root of the corresponding diagonal element of $X'X$, and (b) a choice where the proportionality factors are positive real numbers not depending on the design matrix (and often not on the index), as the tuning parameter then again has an interpretation independent of the values of $\sigma$ and the design matrix. Note that in case (a) or (b) the solution of the optimization problem is equivariant under scaling of the columns of $X$ by nonzero column-specific scale factors.
(iii) Similar results obviously hold for the infeasible versions of the estimators.
Remark 2
(Adaptive Lasso) Consider the objective function
where the tuning parameters are positive real numbers. This is the objective function of the adaptive Lasso (where the tuning parameter is often chosen independently of the index). Again the minimizer exists and is unique (at least on the event where $\hat{\theta}_{j}\neq 0$ for all $j$). Clearly, the adaptive Lasso estimator is equivariant under scaling of the columns of $X$ by nonzero column-specific scale factors provided the tuning parameter does not depend on the design matrix. It is easy to see that in case $X'X$ is diagonal we have
Hence, in the case of diagonal $X'X$, the components of the adaptive Lasso reduce to the adaptive soft-thresholding estimators (with appropriate thresholds). Therefore all results derived below for adaptive soft-thresholding immediately give corresponding results for the adaptive Lasso in the diagonal case. We shall again abstain from spelling out further details. Similar results obviously hold for the infeasible versions of the estimators.
Remark 3
(Other estimators) (i) The adaptive Lasso as defined in Zou (2006) has an additional tuning parameter $\gamma>0$. We consider adaptive soft-thresholding only for the case $\gamma=1$, since otherwise the estimator is not equivariant in the sense described above. Nonetheless an analysis for the case $\gamma\neq 1$, similar to the analysis in this paper, is possible in principle.
(ii) An analysis of a SCAD-based thresholding estimator is given in Pötscher and Leeb (2009) in the known-variance case. [These results are given in the orthogonal design case, but easily generalize to the non-orthogonal case.] The results obtained there for SCAD-based thresholding are similar in spirit to the results for the other thresholding estimators considered here. The unknown-variance case could also be analyzed in principle, but we refrain from doing so for the sake of brevity.
(iii) Zhang (2010) introduced the so-called minimax concave penalty (MCP) to be used for penalized least-squares estimation. Apart from the usual tuning parameter, MCP also depends on a shape parameter. It turns out that the thresholding estimator based on MCP coincides with hard-thresholding for an extreme value of the shape parameter, and thus is covered by the analysis of the present paper. For other values of the shape parameter, the MCP-based thresholding estimator could similarly be analyzed, especially since its functional form is relatively simple (namely, a piecewise linear function of the least-squares estimator). We do not provide such an analysis for brevity.
For all asymptotic considerations in this paper we shall always assume, without further mentioning, that $\xi_{n,j}$ satisfies

$$\limsup_{n\rightarrow\infty}\xi_{n,j}<\infty \qquad (3)$$

for every fixed $j$ satisfying $j\leq k$ for $n$ large enough. The case excluded by assumption (3) seems to be rather uninteresting, as unboundedness of $\xi_{n,j}$ means that the information contained in the regressors gets weaker with increasing sample size (at least along a subsequence); in particular, this implies (coordinatewise) inconsistency of the least-squares estimator. [In fact, if $k$ as well as the elements of $X$ do not depend on $n$, this case is actually impossible, as $\xi_{n,j}$ is then necessarily monotonically nonincreasing in $n$.]
The following notation will be used in the paper: Let denote the extended real line endowed with the usual topology. On we shall consider the topology it inherits from . Furthermore, and denote the cumulative distribution function (cdf) and the probability density function (pdf) of a standard normal distribution, respectively. By we denote the cdf of a noncentral distribution with degrees of freedom and noncentrality parameter . In the central case, i.e., , we simply write . We use the convention , with a similar convention for .
3 Variable Selection Probabilities
The estimators $\hat{\theta}_{H}$, $\hat{\theta}_{S}$, and $\hat{\theta}_{A}$ can be viewed as performing variable selection in the sense that these estimators set components of $\theta$ exactly equal to zero with positive probability. In this section we study the probability that a given component is set to a nonzero value, where the estimator may be any of $\hat{\theta}_{H}$, $\hat{\theta}_{S}$, and $\hat{\theta}_{A}$. Since these probabilities are the same for any of the three estimators considered, we shall drop the subscripts $H$, $S$, and $A$ in this section. We use the same convention also for the variable selection probabilities of the infeasible versions.
3.1 Known-Variance Case
Since the selection and deletion probabilities sum to one, it suffices to study the variable deletion probability

$$P\bigl(\tilde{\theta}_{j}=0\bigr)=P\bigl(|\hat{\theta}_{j}|\leq\sigma\xi_{n,j}\eta_{n,j}\bigr)=\Phi\bigl(\eta_{n,j}-\theta_{j}/(\sigma\xi_{n,j})\bigr)-\Phi\bigl(-\eta_{n,j}-\theta_{j}/(\sigma\xi_{n,j})\bigr). \qquad (4)$$

As can be seen from the above formula, the deletion probability depends on $\theta_{j}$ and $\sigma$ only via $\theta_{j}/(\sigma\xi_{n,j})$. We first study the variable selection/deletion probabilities under a "fixed-parameter" asymptotic framework.
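Because the least-squares component is normally distributed, the known-variance deletion probability is simply the probability that a normal variable falls inside the threshold interval. The sketch below evaluates it with the normal cdf, written in terms of z = θ/(σξ), and cross-checks against a Monte Carlo draw; the numeric values are hypothetical.

```python
import math
import random

def norm_cdf(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def deletion_prob(theta, sigma_xi, eta):
    """P(|theta_hat| <= sigma*xi*eta) for theta_hat ~ N(theta, (sigma*xi)^2):
    Phi(eta - theta/(sigma*xi)) - Phi(-eta - theta/(sigma*xi))."""
    z = theta / sigma_xi
    return norm_cdf(eta - z) - norm_cdf(-eta - z)

# Monte Carlo cross-check with hypothetical parameter values
random.seed(0)
theta, sigma_xi, eta = 0.3, 0.5, 1.2
reps = 200_000
hits = sum(abs(random.gauss(theta, sigma_xi)) <= sigma_xi * eta
           for _ in range(reps))
assert abs(hits / reps - deletion_prob(theta, sigma_xi, eta)) < 0.01
```

Note that `deletion_prob` depends on `theta` and `sigma_xi` only through their ratio, matching the observation made after the display above.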
Proposition 4
Let $\sigma$, $0<\sigma<\infty$, be given. For every $j$ satisfying $j\leq k$ for $n$ large enough we have:
(a) A necessary and sufficient condition for as for all satisfying ( not depending on ) is .
(b) A necessary and sufficient condition for as for all satisfying is .
(c) A necessary and sufficient condition for as for all satisfying is , . The constant is then given by .
Part (a) of the above proposition gives a necessary and sufficient condition for the procedure to correctly detect nonzero coefficients with probability converging to one. Part (b) gives a necessary and sufficient condition for correctly detecting zero coefficients with probability converging to one.
Remark 5
If $\xi_{n,j}$ does not converge to zero, the conditions on the tuning parameters in Parts (a) and (b) are incompatible; the conditions in Parts (a) and (c) are then also incompatible (except in a boundary case). However, the case where $\xi_{n,j}$ does not converge to zero is of little interest, as the least-squares estimator is then not consistent.
Remark 6
(Speed of convergence in Proposition 4) (i) The speed of convergence in (a) is in case is bounded (an uninteresting case as noted above); if, the speed of convergence in (a) is not slower than for some suitable depending on .
(ii) The speed of convergence in (b) is . In (c) the speed of convergence is given by the rate at which approaches .
[For the above results we have made use of Lemma VII.1.2 in Feller (1957).]
Remark 7
For let . Then (i) for every
Suppose now that the entries of $\theta$ do not change with $n$ (although the dimension of $\theta$ may depend on $n$). [More precisely, this means that $\theta$ is made up of the initial elements of a fixed sequence.] Then, given that is bounded (this being in particular the case if is bounded), the probability of incorrect non-detection of at least one nonzero coefficient converges to zero if and only if as for every . [If is unbounded then this probability converges to zero, e.g., if and as for every and and as for a suitable that is determined by .]
(ii) For every we have
Suppose again that the entries of $\theta$ do not change with $n$. Then, given that is bounded (this being in particular the case if is bounded), the probability of incorrectly classifying at least one zero parameter as a nonzero one converges to zero as if and only if for every . [If is unbounded then this probability converges to zero, e.g., if as .]
(iii) In case $X'X$ is diagonal, the relevant joint probabilities can be directly expressed in terms of products of the individual selection or deletion probabilities, and Proposition 4 can then be applied.
Since the fixed-parameter asymptotic framework often gives a misleading impression of the actual behavior of a variable selection procedure (cf. Leeb and Pötscher (2005), Pötscher and Leeb (2009)), we turn next to a "moving-parameter" framework, i.e., we allow the elements of $\theta$ as well as $\sigma$ to depend on sample size $n$. In the proposition to follow (and in all subsequent large-sample results) we shall concentrate only on the case where $\xi_{n,j}\eta_{n,j}\rightarrow 0$ as $n\rightarrow\infty$, since otherwise the estimators are not even consistent for $\theta_{j}$ as a consequence of Proposition 4; cf. also Theorem 16 below. Given this condition, we shall then distinguish between the case $\eta_{n,j}\rightarrow e$, $0\leq e<\infty$, and the case $\eta_{n,j}\rightarrow\infty$, which in light of Proposition 4 we shall call the case of "conservative tuning" and the case of "consistent tuning", respectively. [There is no loss of generality here in assuming convergence of $\eta_{n,j}$ to a (finite or infinite) limit, in the sense that this convergence can, for any given sequence, be achieved along suitable subsequences in light of compactness of the extended real line.]
Proposition 8
Suppose that for given satisfying for large enough we have and where .
(a) Assume . Suppose that the true parameters and satisfy . Then
(b) Assume . Suppose that the true parameters and satisfy . Then
1. implies .
2. implies .
3. and , for some , imply
In a fixed-parameter asymptotic analysis, which in Proposition 8 corresponds to constant parameter sequences, the limit of the deletion probabilities is always zero in case $\theta_{j}\neq 0$, and is one in case $\theta_{j}=0$ and consistent tuning (it is less than one in case $\theta_{j}=0$ and conservative tuning); this clearly does not properly capture the finite-sample behavior of these probabilities. The moving-parameter asymptotic analysis underlying Proposition 8 better captures the finite-sample behavior and, e.g., allows for limits other than zero and one even in the case of consistent tuning. In particular, Proposition 8 shows that the convergence of the variable selection/deletion probabilities to their limits in a fixed-parameter asymptotic framework is not uniform in $\theta_{j}$, and this non-uniformity is local in the sense that it occurs in an arbitrarily small neighborhood of $\theta_{j}=0$ (holding the value of $\sigma$ fixed). [More generally, the non-uniformity arises for $\theta_{j}/(\sigma\xi_{n,j})$ in a neighborhood of zero.] Furthermore, the above proposition entails that, under consistent tuning, deviations of $\theta_{j}$ from zero of larger order than under conservative tuning go unnoticed asymptotically with probability one by the variable selection procedure. For more discussion in a special case (which in its essence also applies here) see Pötscher and Leeb (2009).
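The local nature of this non-uniformity can be illustrated numerically. Under consistent tuning (large threshold constant η in the standardized scale), the fixed-parameter analysis suggests every nonzero coefficient is detected with probability tending to one, yet a coefficient drifting to zero at the threshold scale is deleted with probability near one half. A self-contained sketch for the known-variance case with a normal least-squares component (hypothetical numbers):

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def deletion_prob(z, eta):
    # P(estimate set to zero) = Phi(eta - z) - Phi(-eta - z),
    # where z is the true coefficient in units of its standard error
    return norm_cdf(eta - z) - norm_cdf(-eta - z)

for eta in (2.0, 4.0, 8.0):
    p_zero = deletion_prob(0.0, eta)   # true coefficient equal to zero
    p_drift = deletion_prob(eta, eta)  # nonzero coefficient at the threshold scale
    print(eta, round(p_zero, 4), round(p_drift, 4))
```

As η grows, `p_zero` tends to one (true zeros are deleted consistently) while `p_drift` stays near 1/2, so the deletion probability does not converge to its fixed-parameter limit uniformly over the parameter.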
Remark 9
(Speed of convergence in Proposition 8) (i) The speed of convergence in (a) is given by the slower of the rate at which approaches and approaches provided that ; if , the speed of convergence is not slower than
for any .
(ii) The speed of convergence in (b1) is not slower than where depends on . The same is true in case (b2) provided ; if , the speed of convergence is not slower than for every . In case (b3) the speed of convergence is not slower than the speed of convergence of
for any in case ; in case it is not slower than
for any .
The preceding remark corrects and clarifies the remarks at the end of Section 3 in Pötscher and Leeb (2009) and Section 3.1 in Pötscher and Schneider (2009).
3.2 Unknown-Variance Case
In the unknown-variance case the finite-sample variable selection/deletion probabilities can be obtained as follows:
(5) 
Here we have used (4), and independence of $\hat{\theta}_{j}$ and $\hat{\sigma}$ allowed us to replace $\hat{\sigma}$ by the integration variable in the relevant formulae; cf. Leeb and Pötscher (2003, p. 110). In the above, $\rho_{n-k}$ denotes the density of $(n-k)^{-1/2}$ times the square root of a chi-square distributed random variable with $n-k$ degrees of freedom, i.e., the density of $\hat{\sigma}/\sigma$. It will turn out to be convenient to set $\rho_{n-k}(s)=0$ for $s\leq 0$, making $\rho_{n-k}$ a bounded continuous function on $\mathbb{R}$.
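The representation above can be evaluated numerically: the known-variance deletion probability (with the threshold scaled by s, the value of σ̂/σ) is averaged over the density ρ_m of σ̂/σ, where m·(σ̂/σ)² is chi-square with m degrees of freedom. The following quadrature sketch (hypothetical values, moderate m) is cross-checked against a direct Monte Carlo simulation of σ̂/σ and of the least-squares component:

```python
import math
import random

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def known_var_deletion_prob(z, eta):
    # Phi(eta - z) - Phi(-eta - z), with z = theta/(sigma*xi)
    return norm_cdf(eta - z) - norm_cdf(-eta - z)

def rho(s, m):
    # density of sigma_hat/sigma when m*(sigma_hat/sigma)^2 ~ chi^2_m
    if s <= 0:
        return 0.0
    return (2.0 * m ** (m / 2.0) * s ** (m - 1) * math.exp(-m * s * s / 2.0)
            / (2.0 ** (m / 2.0) * math.gamma(m / 2.0)))

def unknown_var_deletion_prob(z, eta, m, n_grid=4000, s_max=6.0):
    # midpoint-rule average of the known-variance probability over rho;
    # the threshold scales with sigma_hat, so eta is replaced by s*eta
    h = s_max / n_grid
    return sum(known_var_deletion_prob(z, (h * (i + 0.5)) * eta)
               * rho(h * (i + 0.5), m) * h for i in range(n_grid))

# Monte Carlo cross-check (hypothetical values; z in units of sigma*xi)
random.seed(1)
z, eta, m, reps = 0.8, 1.5, 5, 100_000
hits = 0
for _ in range(reps):
    s = math.sqrt(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(m)) / m)
    theta_hat = random.gauss(z, 1.0)
    hits += abs(theta_hat) <= s * eta
assert abs(hits / reps - unknown_var_deletion_prob(z, eta, m)) < 0.015
```

For small m the averaging over ρ_m visibly changes the deletion probability relative to the known-variance case, in line with the results below for constant degrees of freedom.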
We now have the following fixed-parameter asymptotic result for the variable selection/deletion probabilities in the unknown-variance case, which perfectly parallels the corresponding result in the known-variance case, i.e., Proposition 4:
Proposition 10
Let be given. For every satisfying for large enough we have:
(a) A necessary and sufficient condition for as for all satisfying ( not depending on ) is .
(b) A necessary and sufficient condition for as for all satisfying is .
(c) A necessary and sufficient condition for as for all satisfying and with satisfying is , .
Proposition 10 shows that the dichotomy regarding conservative and consistent tuning is expressed by the same conditions in the unknown-variance case as in the known-variance case. Furthermore, note that the constant appearing in Part (c) of the above proposition converges, in the case where $n-k\rightarrow\infty$, to the same limit as in the known-variance case. This is different in case $n-k$ is eventually constant, equal to $m$ say, the sequence then being eventually constant as well. We finally note that Remark 5 also applies to Proposition 10 above.
For the same reasons as in the known-variance case, we next investigate the asymptotic behavior of the variable selection/deletion probabilities under a moving-parameter asymptotic framework. We consider the case where $n-k$ is (eventually) constant and the case where $n-k\rightarrow\infty$. There is no essential loss of generality in considering these two cases only, since by compactness of the extended real line we can always assume (possibly after passing to subsequences) that $n-k$ converges to a finite or infinite limit.
Theorem 11
Suppose that for given satisfying for large enough we have and where .
(a) Assume . Suppose that the true parameters and satisfy .
(a1) If $n-k$ is eventually constant, equal to $m$ say, then
(a2) If $n-k\rightarrow\infty$ holds, then
(b) Assume . Suppose that the true parameters and satisfy .
(b1) If $n-k$ is eventually constant, equal to $m$ say, then
(b2) If $n-k\rightarrow\infty$ holds, then
1. implies .
2. implies .
3. and imply
provided for some .
4. and with imply