
Distributional Results for Thresholding Estimators in High-Dimensional Gaussian Regression Models

We would like to thank Hannes Leeb, a referee, and an associate editor for comments on a previous version of the paper.

Benedikt M. Pötscher    Ulrike Schneider
June 2011
Revised November 2011
Abstract

We study the distribution of hard-, soft-, and adaptive soft-thresholding estimators within a linear regression model where the number of parameters k can depend on sample size n and may diverge with n. In addition to the case of known error-variance, we define and study versions of the estimators when the error-variance is unknown. We derive the finite-sample distribution of each estimator and study its behavior in the large-sample limit, also investigating the effects of having to estimate the variance when the degrees of freedom n − k do not tend to infinity or tend to infinity only very slowly. Our analysis encompasses both the case where the estimators are tuned to perform consistent variable selection and the case where the estimators are tuned to perform conservative variable selection. Furthermore, we discuss consistency and uniform consistency, and derive the uniform convergence rate under either type of tuning.

MSC subject classification: 62F11, 62F12, 62J05, 62J07, 62E15, 62E20

Keywords and phrases: Thresholding, Lasso, adaptive Lasso, penalized maximum likelihood, variable selection, finite-sample distribution, asymptotic distribution, variance estimation, uniform convergence rate, high-dimensional model, oracle property

1 Introduction

We study the distribution of thresholding estimators such as hard-thresholding, soft-thresholding, and adaptive soft-thresholding in a linear regression model when the number of regressors can be large. These estimators can be viewed as penalized least-squares estimators in the case of an orthogonal design matrix, with soft-thresholding then coinciding with the Lasso (introduced by Frank and Friedman (1993), Alliney and Ruzinsky (1994), and Tibshirani (1996)) and with adaptive soft-thresholding coinciding with the adaptive Lasso (introduced by Zou (2006)). Thresholding estimators have of course been discussed earlier in the context of model selection (see Bauer, Pötscher and Hackl (1988)) and in the context of wavelets (see, e.g., Donoho, Johnstone, Kerkyacharian, Picard (1995)). Contributions concerning distributional properties of thresholding and penalized least-squares estimators are as follows: Knight and Fu (2000) study the asymptotic distribution of the Lasso estimator when it is tuned to act as a conservative variable selection procedure, whereas Zou (2006) studies the asymptotic distribution of the Lasso and the adaptive Lasso estimators when they are tuned to act as consistent variable selection procedures. Fan and Li (2001) and Fan and Peng (2004) study the asymptotic distribution of the so-called smoothly clipped absolute deviation (SCAD) estimator when it is tuned to act as a consistent variable selection procedure. In the wake of Fan and Li (2001) and Fan and Peng (2004) a large number of papers have been published that derive the asymptotic distribution of various penalized maximum likelihood estimators under consistent tuning; see the introduction in Pötscher and Schneider (2009) for a partial list. Except for Knight and Fu (2000), all these papers derive the asymptotic distribution in a fixed-parameter framework. As pointed out in Leeb and Pötscher (2005), such a fixed-parameter framework is often highly misleading in the context of variable selection procedures and penalized maximum likelihood estimators. For that reason, Pötscher and Leeb (2009) and Pötscher and Schneider (2009) have conducted a detailed study of the finite-sample as well as large-sample distribution of various penalized least-squares estimators, adopting a moving-parameter framework for the asymptotic results. [Related results for so-called post-model-selection estimators can be found in Leeb and Pötscher (2003, 2005) and for model averaging estimators in Pötscher (2006); see also Sen (1979) and Pötscher (1991).] The papers by Pötscher and Leeb (2009) and Pötscher and Schneider (2009) are set in the framework of an orthogonal linear regression model with a fixed number of parameters and with the error-variance being known.

In the present paper we build on the just mentioned papers Pötscher and Leeb (2009) and Pötscher and Schneider (2009). In contrast to these papers, we do not assume the number of regressors to be fixed, but let it depend on sample size n – thus allowing for high-dimensional models. We also consider the case where the error-variance is unknown, which in the case of a high-dimensional model creates non-trivial complications as estimators for the error-variance will then typically not be consistent. Considering thresholding estimators from the outset in the present paper also allows us to cover non-orthogonal designs. While the asymptotic distributional results in the known-variance case do not differ in substance from the results in Pötscher and Leeb (2009) and Pötscher and Schneider (2009), not unexpectedly we observe different asymptotic behavior in the unknown-variance case if the number of degrees of freedom n − k is constant, the difference resulting from the non-vanishing variability of the error-variance estimator in the limit. Less expected is the result that – under consistent tuning – for the variable selection probabilities (implied by all the estimators considered) as well as for the distribution of the hard-thresholding estimator, estimation of the error-variance still has an effect asymptotically even if n − k diverges, provided it does so only slowly.

To give some idea of the theoretical results obtained in the paper we next present a rough summary of some of these results. For simplicity of exposition assume for the moment that the design matrix X is such that the diagonal elements of (X′X)⁻¹ are equal to 1/n, and that the error-variance σ² is equal to 1. Consider the hard-thresholding estimator for a given component of the regression parameter θ, the threshold being given by σ̂ n^{−1/2} η_n, with σ̂ denoting the usual error-variance estimator and with η_n denoting a tuning parameter. An infeasible version of the estimator, which uses σ instead of σ̂, is also considered (known-variance case). We then show that the uniform rate of convergence of the hard-thresholding estimator is n^{−1/2} under conservative tuning of the threshold, but that the uniform rate is only n^{−1/2}η_n under consistent tuning. The same result also holds for the soft-thresholding and the adaptive soft-thresholding estimators, as well as for the infeasible variants that use knowledge of σ (known-variance case). Furthermore, all possible limits of the centered and scaled distribution of the hard-thresholding estimator (as well as of the soft- and the adaptive soft-thresholding estimators) under a moving-parameter framework are obtained. Consider first the case of conservative tuning: then all possible limiting forms of the distribution of the feasible as well as of the infeasible estimator for arbitrary parameter sequences are determined. It turns out that – in the known-variance case – these limits are of the same functional form as the finite-sample distribution, i.e., they are a convex combination of a pointmass and an absolutely continuous distribution that is an excised version of a normal distribution. In the unknown-variance case, when the number of degrees of freedom n − k goes to infinity, exactly the same limits arise. However, if n − k is constant, the limits are "averaged" versions of the limits in the known-variance case, the averaging being with respect to the distribution of the variance estimator σ̂. Again these limits have the same functional form as the corresponding finite-sample distributions. Consider next the case of consistent tuning: here the estimators have to be centered at θ and scaled by the inverse of n^{−1/2}η_n, as n^{−1/2}η_n is the uniform convergence rate. In the known-variance case the limits are convex combinations of (at most) two pointmasses, the location of the pointmasses as well as the weights depending on the particular parameter sequence. In the unknown-variance case exactly the same limits arise if n − k diverges to infinity sufficiently fast; however, if n − k is constant or diverges to infinity sufficiently slowly, the limits are again convex combinations of the same pointmasses, but with weights that are typically different. The picture for soft-thresholding and adaptive soft-thresholding is somewhat different: in the known-variance case, as well as in the unknown-variance case when n − k diverges to infinity, the limits are (single) pointmasses. However, in the unknown-variance case and if n − k is constant, the limit distribution can have an absolutely continuous component. It is furthermore useful to point out that in case of consistent tuning the sequence of distributions of the hard-thresholding estimator, centered at θ and scaled by n^{1/2}η_n^{−1}, is not stochastically bounded in general (since n^{−1/2}η_n is the uniform convergence rate), and the same is true for soft-thresholding and adaptive soft-thresholding. This throws light on the fragility of the oracle property; see Section 6.4 for more discussion.
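
As a quick illustration of the kind of finite-sample behavior described above, the following minimal simulation sketch may be helpful. It is our own illustration, not code from the paper; the variable names and numerical values are chosen for exposition only. It draws the known-variance hard-thresholding estimator under the simplifying assumptions σ = 1 and diagonal elements of (X′X)⁻¹ equal to 1/n, and exhibits the pointmass at zero together with the "excised" normal component of its distribution:

    import numpy as np

    rng = np.random.default_rng(0)

    n = 100          # sample size
    theta = 0.15     # true coefficient, close to zero (where non-normality is most pronounced)
    eta = 1.5        # tuning parameter; the threshold is eta / sqrt(n)
    reps = 100_000   # Monte Carlo replications

    # Least-squares estimator of the component: N(theta, 1/n) under the simplifications.
    theta_ls = theta + rng.standard_normal(reps) / np.sqrt(n)

    # Hard thresholding: keep the least-squares value only if it exceeds the threshold.
    threshold = eta / np.sqrt(n)
    theta_hard = np.where(np.abs(theta_ls) > threshold, theta_ls, 0.0)

    # Centered and scaled estimator sqrt(n) * (theta_hard - theta): a pointmass at
    # -sqrt(n)*theta (the deleted cases) plus an 'excised' normal (the retained cases).
    z = np.sqrt(n) * (theta_hard - theta)
    print("P(estimator = 0)   :", np.mean(theta_hard == 0.0))
    print("quantiles (nonzero):", np.round(np.percentile(z[theta_hard != 0.0], [5, 50, 95]), 2))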

While our theoretical results for the thresholding estimators immediately apply to the Lasso and the adaptive Lasso in the case of an orthogonal design, this is not so in the non-orthogonal case. In order to get some insight into the finite-sample distribution of the latter estimators also in the non-orthogonal case, we numerically compare the distributions of the Lasso and the adaptive Lasso with those of their thresholding counterparts in a simulation study.


The main take-away messages of the paper can be summarized as follows:

  • The finite-sample distributions of the various thresholding estimators considered are highly non-normal, the distributions being in each case a convex combination of pointmass and an absolutely continuous (non-normal) component.

  • The non-normality persists asymptotically in a moving parameter framework.

  • Results in the unknown-variance case are obtained from the corresponding results in the known-variance case by smoothing with respect to the distribution of the variance estimator σ̂. In line with this, one would expect the limiting behavior in the unknown-variance case to coincide with the limiting behavior in the known-variance case whenever the degrees of freedom n − k diverge to infinity. This indeed turns out to be so for some of the results, but not for others, where we see that the speed of divergence of n − k matters.

  • In case of conservative tuning the estimators have the expected uniform convergence rate, which is n^{−1/2} under the simplified assumptions of the above discussion, whereas under consistent tuning the uniform rate is slower, namely n^{−1/2}η_n under the simplified assumptions of the above discussion. This is intimately connected with the fact that the so-called ‘oracle property’ paints a misleading picture of the performance of the estimators.

  • The numerical study suggests that the results for the soft- and adaptive soft-thresholding estimators qualitatively apply also to (the components of) the Lasso and the adaptive Lasso as long as the design matrix is not too ill-conditioned.

The paper is organized as follows. We introduce the model and define the estimators in Section 2. Section 3 treats the variable selection probabilities implied by the estimators. Consistency, uniform consistency, and uniform convergence rates are discussed in Section 4. We derive the finite-sample distribution of each estimator in Section 5 and study the large-sample behavior of these in Section 6. A numerical study of the finite-sample distribution of Lasso and adaptive Lasso can be found in Section 7. All proofs are relegated to Section 8.

2 The Model and the Estimators

Consider the linear regression model

Y = Xθ + u,

with Y an n × 1 vector, X a nonstochastic n × k matrix of rank k, and u normally distributed with mean zero and covariance matrix σ²Iₙ, 0 < σ < ∞. We allow k, the number of columns of X, as well as the entries of X, θ, and σ to depend on sample size n (in fact, also the probability spaces supporting Y and u may depend on n), although we shall almost always suppress this dependence on n in the notation. Note that this framework allows for high-dimensional regression models, where the number of regressors k is large compared to sample size n, as well as for the more classical situation where k is much smaller than n. Furthermore, let ξ_{i,n} denote the nonnegative square root of the i-th diagonal element of (X′X)⁻¹. Now let

θ̂ = (X′X)⁻¹X′Y denote the least-squares estimator for θ and σ̂² = (Y − Xθ̂)′(Y − Xθ̂)/(n − k) the associated estimator for the error-variance σ², the latter being defined only if n > k. The hard-thresholding estimator sets the i-th component θ̂_i of the least-squares estimator equal to zero whenever |θ̂_i| does not exceed the threshold σ̂ξ_{i,n}η_{i,n}, and leaves it unchanged otherwise; here the tuning parameters η_{i,n} are positive real numbers. We shall also need to consider its infeasible counterpart, which uses σ instead of σ̂ in the threshold.

The soft-thresholding estimator and its infeasible counterpart are defined analogously: the i-th component of the estimator equals θ̂_i shrunk towards zero by the amount σ̂ξ_{i,n}η_{i,n} (respectively σξ_{i,n}η_{i,n}), and equals zero whenever |θ̂_i| does not exceed that threshold. Finally, the adaptive soft-thresholding estimator and its infeasible counterpart are obtained by multiplying θ̂_i by the factor (1 − (σ̂ξ_{i,n}η_{i,n})²/θ̂_i²)₊ (respectively by the same factor with σ in place of σ̂), where (x)₊ denotes max(x, 0).

Note that the hard-, soft-, and adaptive soft-thresholding estimators as well as their infeasible counterparts are equivariant under scaling of the columns of X by non-zero column-specific scale factors. We have chosen to let the thresholds depend explicitly on σ̂ (σ, respectively) and on ξ_{i,n} in order to give the tuning parameters η_{i,n} an interpretation that is independent of the values of σ and X. Furthermore, η_{i,n} will often be chosen independently of i, i.e., η_{i,n} = η_n where η_n is a positive real number. Clearly, for the feasible versions we always need to assume n > k, whereas for the infeasible versions n ≥ k suffices.
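
For concreteness, the estimators just described can be written out explicitly. The following display is a plausible reconstruction in the notation introduced above (θ̂_i, ξ_{i,n}, η_{i,n}, σ̂ are our reconstructed symbols) and may differ in inessential ways from the original formulas:

    \tilde{\theta}_{H,i} = \hat{\theta}_i \,\mathbf{1}\{\, |\hat{\theta}_i| > \hat{\sigma}\,\xi_{i,n}\,\eta_{i,n} \,\},
    \qquad
    \tilde{\theta}_{S,i} = \operatorname{sign}(\hat{\theta}_i)\,\bigl( |\hat{\theta}_i| - \hat{\sigma}\,\xi_{i,n}\,\eta_{i,n} \bigr)_{+},
    \qquad
    \tilde{\theta}_{AS,i} = \hat{\theta}_i \,\bigl( 1 - \hat{\sigma}^{2}\xi_{i,n}^{2}\eta_{i,n}^{2}/\hat{\theta}_i^{2} \bigr)_{+},

with the infeasible versions obtained by replacing σ̂ with σ. Note that all three estimators vanish on the same event, namely the event that |θ̂_i| does not exceed the threshold, which is consistent with the fact, used in Section 3, that their variable selection probabilities coincide.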

We note the simple fact that

(1)

holds on the event that , and that

(2)

holds on the event that . Analogous inequalities hold for the infeasible versions of the estimators.

Remark 1

(Lasso) (i) Consider the Lasso objective function, consisting of the least-squares objective plus a weighted ℓ₁-penalty on θ with weights λ_i, where the λ_i are positive real numbers. It is well-known that a unique minimizer of this objective function exists, the Lasso-estimator. It is easy to see that in case X′X is diagonal the objective separates across components. Hence, in the case of diagonal X′X, the components of the Lasso reduce to soft-thresholding estimators with appropriate thresholds; in particular, the Lasso coincides with the soft-thresholding estimator defined in Section 2 for an appropriate choice of the λ_i. Therefore all results derived below for soft-thresholding immediately give corresponding results for the Lasso as well as for the Dantzig-selector in the diagonal case. We shall abstain from spelling out further details.
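
To make the reduction in part (i) concrete, consider one common parameterization of the Lasso objective (the notation and the constant in front of the penalty are ours and may differ from the original display):

    L(\theta) = \|Y - X\theta\|^{2} + 2\sum_{i=1}^{k} \lambda_i |\theta_i| .

If X′X = diag(d_1, …, d_k), the problem separates across components and the i-th component of the minimizer is

    \operatorname{sign}(\hat{\theta}_i)\,\bigl( |\hat{\theta}_i| - \lambda_i/d_i \bigr)_{+},

i.e., soft thresholding of the i-th least-squares component at the threshold λ_i/d_i; in this parameterization the choice λ_i = d_i σ̂ ξ_{i,n} η_{i,n} reproduces the soft-thresholding estimator of Section 2.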

(ii) Sometimes in the definition of the Lasso the weight λ_i is chosen independently of i; more reasonable choices seem to be ones that scale λ_i with the nonnegative square root of the i-th diagonal element of X′X and, possibly, with σ̂, multiplied by positive constants that do not depend on the design matrix (and often not on i), as these constants then again have an interpretation that is independent of the values of X and, when σ̂ is included, of σ. Note that for such choices the solution of the optimization problem is equivariant under scaling of the columns of X by non-zero column-specific scale factors.

(iii) Similar results obviously hold for the infeasible versions of the estimators.

Remark 2

(Adaptive Lasso) Consider the adaptive Lasso objective function, consisting of the least-squares objective plus a weighted ℓ₁-penalty on θ in which the weight attached to the i-th component is λ_i/|θ̂_i|, with the λ_i positive real numbers (often λ_i is chosen independently of i). Again the minimizer exists and is unique (at least on the event where θ̂_i ≠ 0 for all i). Clearly, the adaptive Lasso is equivariant under scaling of the columns of X by non-zero column-specific scale factors provided the λ_i do not depend on the design matrix. It is easy to see that in case X′X is diagonal the objective again separates across components. Hence, in the case of diagonal X′X, the components of the adaptive Lasso reduce to the adaptive soft-thresholding estimators (for an appropriate choice of the λ_i). Therefore all results derived below for adaptive soft-thresholding immediately give corresponding results for the adaptive Lasso in the diagonal case. We shall again abstain from spelling out further details. Similar results obviously hold for the infeasible versions of the estimators.
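
Analogously to the Lasso case, the reduction can be made explicit under the same kind of parameterization (again our notation; the exact constants may differ from the original display):

    \|Y - X\theta\|^{2} + 2\sum_{i=1}^{k} \lambda_i |\theta_i| / |\hat{\theta}_i|

is, for X′X = diag(d_1, …, d_k), minimized componentwise by

    \hat{\theta}_i \,\bigl( 1 - \lambda_i/(d_i \hat{\theta}_i^{2}) \bigr)_{+},

which is the adaptive soft-thresholding form given earlier once λ_i is matched to the threshold (in this parameterization, λ_i = d_i (σ̂ ξ_{i,n} η_{i,n})²).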

Remark 3

(Other estimators) (i) The adaptive Lasso as defined in Zou (2006) has an additional tuning parameter γ governing the power to which the weights are raised. We consider adaptive soft-thresholding only for the case γ = 1, since otherwise the estimator is not equivariant in the sense described above. Nonetheless an analysis for the case γ ≠ 1, similar to the analysis in this paper, is possible in principle.

(ii) An analysis of a SCAD-based thresholding estimator is given in Pötscher and Leeb (2009) in the known-variance case. [These results are given in the orthogonal design case, but easily generalize to the non-orthogonal case.] The results obtained there for SCAD-based thresholding are similar in spirit to the results for the other thresholding estimators considered here. The unknown-variance case could also be analyzed in principle, but we refrain from doing so for the sake of brevity.

(iii) Zhang (2010) introduced the so-called minimax concave penalty (MCP) to be used for penalized least-squares estimation. Apart from the usual tuning parameter, MCP also depends on a shape parameter. It turns out that the thresholding estimator based on MCP coincides with hard-thresholding for one boundary value of the shape parameter, and thus is covered by the analysis of the present paper. For other values of the shape parameter, the MCP-based thresholding estimator could similarly be analyzed, especially since the functional form of the MCP-based thresholding estimator is relatively simple (namely, a piecewise linear function of the least-squares estimator). We do not provide such an analysis for brevity.

For all asymptotic considerations in this paper we shall always assume without further mentioning that ξ_{i,n} satisfies

lim sup_{n→∞} ξ_{i,n} < ∞    (3)

for every fixed i satisfying i ≤ k for large enough n. The case excluded by assumption (3) seems to be rather uninteresting, as unboundedness of ξ_{i,n} means that the information contained in the regressors gets weaker with increasing sample size (at least along a subsequence); in particular, this implies (coordinate-wise) inconsistency of the least-squares estimator. [In fact, if k as well as the elements of X do not depend on n, this case is actually impossible, as ξ_{i,n} is then necessarily monotonically nonincreasing in n.]

The following notation will be used in the paper: the extended real line R ∪ {−∞, +∞} is endowed with the usual topology, and subsets of it are equipped with the topology they inherit from the extended real line. Furthermore, Φ and φ denote the cumulative distribution function (cdf) and the probability density function (pdf) of a standard normal distribution, respectively. We shall also make use of the cdf of a non-central chi-distribution, with the degrees of freedom and the non-centrality parameter indicated by subscripts; in the central case, i.e., when the non-centrality parameter equals zero, the corresponding subscript is dropped. Standard conventions are used for evaluating these distribution functions at ±∞.

3 Variable Selection Probabilities

The hard-, soft-, and adaptive soft-thresholding estimators can be viewed as performing variable selection in the sense that these estimators set components of θ exactly equal to zero with positive probability. In this section we study the variable selection probability, i.e., the probability that a given component of the estimator is different from zero, where the estimator may be any one of the three thresholding estimators. Since these probabilities are the same for any of the three estimators considered, we shall not distinguish between the estimators notationally in this section. We use the same convention also for the variable selection probabilities of the infeasible versions.

3.1 Known-Variance Case

Since the variable selection probability equals one minus the corresponding variable deletion probability, it suffices to study the latter. In the known-variance case the i-th component of the least-squares estimator satisfies θ̂_i ∼ N(θ_i, σ²ξ_{i,n}²), and the corresponding component of the thresholding estimator equals zero exactly when |θ̂_i| does not exceed the threshold σξ_{i,n}η_{i,n}; hence the variable deletion probability is given by

P(|θ̂_i| ≤ σξ_{i,n}η_{i,n}) = Φ(η_{i,n} − ν_{i,n}) − Φ(−η_{i,n} − ν_{i,n}),  where ν_{i,n} = θ_i/(σξ_{i,n}).    (4)

As can be seen from the above formula, the deletion probability depends on θ_i and σ only via ν_{i,n}. We first study the variable selection/deletion probabilities under a "fixed-parameter" asymptotic framework.

Proposition 4

Let be given. For every satisfying for large enough we have:

(a) A necessary and sufficient condition for as for all satisfying ( not depending on ) is .

(b) A necessary and sufficient condition for as for all satisfying is .

(c) A necessary and sufficient condition for as for all satisfying is , . The constant is then given by .

Part (a) of the above proposition gives a necessary and sufficient condition for the procedure to correctly detect nonzero coefficients with probability converging to 1. Part (b) gives a necessary and sufficient condition for correctly detecting zero coefficients with probability converging to 1.
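
To illustrate the mechanics behind Parts (a) and (b), the following back-of-the-envelope calculation based on formula (4), in the notation ν_{i,n} = θ_i/(σξ_{i,n}) introduced above, may be helpful (this is our illustration, not a statement from the paper):

    If θ_i = 0, the deletion probability equals 2Φ(η_{i,n}) − 1, which converges to 1 precisely when η_{i,n} → ∞,
    and to 2Φ(e) − 1 < 1 when η_{i,n} → e with 0 ≤ e < ∞.
    If |θ_i| ≥ ρ > 0 with ρ fixed, the deletion probability is bounded by Φ(η_{i,n} − ρ/(σξ_{i,n})),
    which tends to 0 whenever η_{i,n} − ρ/(σξ_{i,n}) → −∞; this happens, e.g., when ξ_{i,n} → 0 and ξ_{i,n}η_{i,n} → 0.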

Remark 5

If ξ_{i,n} does not converge to zero, the conditions on η_{i,n} in Parts (a) and (b) are incompatible; the conditions in Parts (a) and (c) are then also incompatible (except in a boundary case). However, the case where ξ_{i,n} does not converge to zero is of little interest, as the least-squares estimator is then not consistent.

Remark 6

(Speed of convergence in Proposition 4) (i) The speed of convergence in (a) is in case is bounded (an uninteresting case as noted above); if, the speed of convergence in (a) is not slower than for some suitable depending on .

(ii) The speed of convergence in (b) is . In (c) the speed of convergence is given by the rate at which approaches .

[For the above results we have made use of Lemma VII.1.2 in Feller (1957).]

Remark 7

Consider the probability that at least one coefficient is incorrectly classified. (i) Suppose that the entries of θ do not change with n (although the dimension of θ may depend on n). [More precisely, this means that θ is made up of the initial elements of a fixed element of R^∞.] Then, given that the number of nonzero coefficients of θ is bounded (this being in particular the case if k is bounded), the probability of incorrect non-detection of at least one nonzero coefficient converges to 0 if and only if the condition in Part (a) of Proposition 4 is satisfied for every coordinate corresponding to a nonzero coefficient. [If the number of nonzero coefficients is unbounded, this probability can still converge to 0 under suitably strengthened versions of these conditions.]

(ii) Suppose again that the entries of θ do not change with n. Then, given that the number of zero coefficients of θ is bounded (this being in particular the case if k is bounded), the probability of incorrectly classifying at least one zero parameter as a non-zero one converges to 0 as n → ∞ if and only if the condition in Part (b) of Proposition 4 is satisfied for every coordinate corresponding to a zero coefficient. [If the number of zero coefficients is unbounded, this probability can still converge to 0 under suitably strengthened conditions.]

(iii) In case X′X is diagonal, the relevant probabilities in (i) and (ii) can be directly expressed in terms of products of the individual variable selection or deletion probabilities, and Proposition 4 can then be applied.
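
As an illustration of part (iii): in the known-variance case with X′X diagonal, the components of the least-squares estimator are independent, so that for any index set I the joint deletion probability factorizes (our notation, based on formula (4)):

    P( the components with indices in I are all set to zero ) = \prod_{i \in I} \bigl[ \Phi(\eta_{i,n} - \nu_{i,n}) - \Phi(-\eta_{i,n} - \nu_{i,n}) \bigr],

and the asymptotic behavior of each factor is then governed by Proposition 4.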

Since the fixed-parameter asymptotic framework often gives a misleading impression of the actual behavior of a variable selection procedure (cf. Leeb and Pötscher (2005), Pötscher and Leeb (2009)) we turn to a "moving-parameter" framework next, i.e., we allow the elements of θ as well as σ to depend on sample size n. In the proposition to follow (and all subsequent large-sample results) we shall concentrate only on the case where ξ_{i,n}η_{i,n} → 0 as n → ∞, since otherwise the estimators are not even consistent for θ_i as a consequence of Proposition 4, cf. also Theorem 16 below. Given this condition, we shall then distinguish between the case where η_{i,n} converges to a finite limit and the case where η_{i,n} → ∞, which in light of Proposition 4 we shall call the case of "conservative tuning" and the case of "consistent tuning", respectively. [There is no loss of generality here in assuming convergence of η_{i,n} to a (finite or infinite) limit, in the sense that this convergence can, for any given sequence, be achieved along suitable subsequences in light of compactness of the extended real line.]

Proposition 8

Suppose that for given satisfying for large enough we have and where .

(a) Assume . Suppose that the true parameters and satisfy . Then

(b) Assume . Suppose that the true parameters and satisfy . Then

1. implies .

2. implies .

3. and , for some , imply

In a fixed-parameter asymptotic analysis, which is covered by Proposition 8 as a special case, the limit of the deletion probabilities is always 0 in case θ_i ≠ 0, and is 1 in case θ_i = 0 and consistent tuning (it is a constant less than 1 in case θ_i = 0 and conservative tuning); this does clearly not properly capture the finite-sample behavior of these probabilities. The moving-parameter asymptotic analysis underlying Proposition 8 better captures the finite-sample behavior and, e.g., allows for limits other than 0 and 1 even in the case of consistent tuning. In particular, Proposition 8 shows that the convergence of the variable selection/deletion probabilities to their limits in a fixed-parameter asymptotic framework is not uniform in θ_i, and this non-uniformity is local in the sense that it occurs in an arbitrarily small neighborhood of θ_i = 0 (holding the value of σ fixed). [More generally, the non-uniformity arises for θ_i/σ in a neighborhood of zero.] Furthermore, the above proposition entails that under consistent tuning deviations of θ_i from zero of larger order than those that can go unnoticed under conservative tuning are still missed asymptotically with probability 1 by the variable selection procedure corresponding to the thresholding estimators. For more discussion in a special case (which in its essence also applies here) see Pötscher and Leeb (2009).
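
A concrete instance of this phenomenon, computed from formula (4) in the known-variance case (our illustration), may be helpful. Under consistent tuning, consider parameter values of the form θ_i = c σ ξ_{i,n} η_{i,n} for a fixed constant c ≥ 0, so that ν_{i,n} = c η_{i,n} and the deletion probability in (4) becomes

    \Phi\bigl((1-c)\,\eta_{i,n}\bigr) - \Phi\bigl(-(1+c)\,\eta_{i,n}\bigr) \;\to\; 1 \text{ if } 0 \le c < 1, \quad 1/2 \text{ if } c = 1, \quad 0 \text{ if } c > 1.

In particular, non-zero coefficients of the order of the threshold σξ_{i,n}η_{i,n}, which is of larger order than ξ_{i,n} because η_{i,n} → ∞, are still deleted with probability tending to 1 whenever c < 1, whereas under conservative tuning the corresponding deletion probability stays bounded away from 1.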

Remark 9

(Speed of convergence in Proposition 8) (i) The speed of convergence in (a) is given by the slower of the rate at which approaches and approaches provided that ; if , the speed of convergence is not slower than

for any .

(ii) The speed of convergence in (b1) is not slower than where depends on . The same is true in case (b2) provided ; if , the speed of convergence is not slower than for every . In case (b3) the speed of convergence is not slower than the speed of convergence of

for any in case ; in case it is not slower than

for any .

The preceding remark corrects and clarifies the remarks at the end of Section 3 in Pötscher and Leeb (2009) and Section 3.1 in Pötscher and Schneider (2009).

3.2 Unknown-Variance Case

In the unknown-variance case the finite-sample variable selection/deletion probabilities can be obtained as follows:

P(|θ̂_i| ≤ σ̂ξ_{i,n}η_{i,n}) = ∫₀^∞ [Φ(sη_{i,n} − ν_{i,n}) − Φ(−sη_{i,n} − ν_{i,n})] ρ_{n−k}(s) ds.    (5)

Here we have used (4), and independence of θ̂_i and σ̂ allowed us to replace σ̂/σ by the variable of integration s in the relevant formulae, cf. Leeb and Pötscher (2003, p. 110). In the above, ρ_{n−k} denotes the density of (n − k)^{−1/2} times the square root of a chi-square distributed random variable with n − k degrees of freedom. It will turn out to be convenient to set ρ_{n−k}(s) = 0 for s ≤ 0, making ρ_{n−k} a bounded function on the real line.
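
For reference, the density just described is available in closed form. Writing m = n − k, a standard change of variables from the chi-square density gives (our computation, not quoted from the paper):

    \rho_m(s) = \frac{2\,(m/2)^{m/2}}{\Gamma(m/2)}\; s^{m-1}\, e^{-m s^{2}/2} \quad \text{for } s > 0,

and ρ_m(s) = 0 for s ≤ 0; this is the density of σ̂/σ.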

We now have the following fixed-parameter asymptotic result for the variable selection/deletion probabilities in the unknown-variance case that perfectly parallels the corresponding result in the known-variance case, i.e., Proposition 4:

Proposition 10

Let be given. For every satisfying for large enough we have:

(a) A necessary and sufficient condition for as for all satisfying ( not depending on ) is .

(b) A necessary and sufficient condition for as for all satisfying is .

(c) A necessary and sufficient condition for as for all satisfying and with satisfying is , .

Proposition 10 shows that the dichotomy between conservative tuning and consistent tuning is expressed by the same conditions in the unknown-variance case as in the known-variance case. Furthermore, note that the constant appearing in Part (c) of the above proposition converges, in the case where n − k → ∞, to a limit that is the same as the corresponding constant in the known-variance case. This is different when n − k is eventually constant, equal to m say; the sequence of constants is then also eventually constant, but in general with a different value. We finally note that Remark 5 also applies to Proposition 10 above.

For the same reasons as in the known-variance case we next investigate the asymptotic behavior of the variable selection/deletion probabilities under a moving-parameter asymptotic framework. We consider the case where n − k is (eventually) constant and the case where n − k → ∞. There is no essential loss of generality in considering these two cases only, since by compactness of the extended real line we can always assume (possibly after passing to subsequences) that n − k converges in the extended real line.

Theorem 11

Suppose that for given satisfying for large enough we have and where .

(a) Assume . Suppose that the true parameters and satisfy .

(a1) If n − k is eventually constant, equal to m say, then

(a2) If n − k → ∞ holds, then

(b) Assume . Suppose that the true parameters and satisfy .

(b1) If n − k is eventually constant, equal to m say, then

(b2) If n − k → ∞ holds, then

1. implies .

2. implies .

3. and imply

provided for some .

4. and with imply