
# Nearly Optimal Sample Size in Hypothesis Testing for High-Dimensional Regression

Adel Javanmard (Department of Electrical Engineering, Stanford University) and Andrea Montanari (Department of Electrical Engineering and Department of Statistics, Stanford University)
###### Abstract

We consider the problem of fitting the parameters of a high-dimensional linear regression model. In the regime where the number of parameters $p$ is comparable to or exceeds the sample size $n$, a successful approach uses an $\ell_1$-penalized least squares estimator, known as the Lasso.

Unfortunately, unlike for linear estimators (e.g., ordinary least squares), no well-established method exists to compute confidence intervals or p-values on the basis of the Lasso estimator. Very recently, a line of work [JM13b, JM13a, vdGBR13] has addressed this problem by constructing a debiased version of the Lasso estimator. In this paper, we study this approach for the random design model, under the assumption that a good estimator exists for the precision matrix of the design. Our analysis improves over the state of the art in that it establishes nearly optimal average testing power if the sample size $n$ asymptotically dominates $s_0 (\log p)^2$, with $s_0$ being the sparsity level (number of non-zero coefficients). Earlier work obtains provable guarantees only for much larger sample sizes, namely it requires $n$ to asymptotically dominate $(s_0 \log p)^2$.

In particular, for random designs with a sparse precision matrix we show that an estimator thereof having the required properties can be computed efficiently. Finally, we evaluate this approach on synthetic data and compare it with earlier proposals.

## 1 Introduction

In the random design model for linear regression, we are given $n$ i.i.d. pairs $(X_1,Y_1),\dots,(X_n,Y_n)$, with $X_i\in\mathbb{R}^p$. The response variables are given by

$$
Y_i = \langle \theta_0, X_i\rangle + W_i\,,\qquad W_i \sim N(0,\sigma^2)\,. \tag{1}
$$

Here $\langle\cdot,\cdot\rangle$ is the standard scalar product in $\mathbb{R}^p$, and $\theta_0\in\mathbb{R}^p$ is an unknown but fixed vector of parameters. In matrix form, letting $Y = (Y_1,\dots,Y_n)^{\sf T}$ and denoting by $X$ the design matrix with rows $X_1^{\sf T},\dots,X_n^{\sf T}$, we have

$$
Y = X\theta_0 + W\,,\qquad W \sim N(0,\sigma^2 I_{n\times n})\,. \tag{2}
$$

The goal is to estimate the unknown vector of parameters $\theta_0$ from the observations $Y$ and $X$. We are interested in the high-dimensional setting where the number of parameters is larger than the sample size, i.e., $p > n$, but the number of non-zero entries of $\theta_0$ is smaller than $p$. We denote by $S \equiv \operatorname{supp}(\theta_0)$ the support of $\theta_0$, i.e., the set of non-zero coefficients, and let $s_0 \equiv |S|$ be the sparsity level.

In the last decade, there has been a burgeoning interest in parameter estimation in the high-dimensional setting. A particularly successful approach is the Lasso [Tib96, CD95] estimator, which promotes sparse reconstructions through an $\ell_1$ penalty:

$$
\hat{\theta}(Y,X;\lambda) \equiv \arg\min_{\theta\in\mathbb{R}^p}\Big\{\frac{1}{2n}\|Y-X\theta\|_2^2 + \lambda\|\theta\|_1\Big\}\,. \tag{3}
$$

In case the right-hand side has more than one minimizer, one of them can be selected arbitrarily for our purposes. We will often omit the arguments $Y$, $X$, $\lambda$, as they are clear from the context.
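As a concrete illustration, the optimization in Eq. (3) can be solved by proximal gradient descent (ISTA). The sketch below is a minimal NumPy implementation; the toy dimensions and the choice $\lambda = 2\sigma\sqrt{\log p/n}$ are illustrative assumptions, not the paper's settings (the paper's experiments use the glmnet solver instead).

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=1000):
    """Proximal-gradient (ISTA) solver for Eq. (3):
    argmin_theta (1/2n)||Y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the smooth part
    theta = np.zeros(p)
    for _ in range(n_iter):
        z = theta - X.T @ (X @ theta - Y) / (n * L)
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return theta

# Toy instance: s0-sparse signal, Gaussian design (illustrative sizes).
rng = np.random.default_rng(0)
n, p, s0, sigma = 100, 200, 5, 0.5
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:s0] = 1.0
Y = X @ theta0 + sigma * rng.standard_normal(n)
lam = 2 * sigma * np.sqrt(np.log(p) / n)     # assumed theoretical scaling
theta_hat = lasso_ista(X, Y, lam)
```

The soft-thresholding step is exactly what produces the sparse (and biased) solution that the next section sets out to debias.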

The Lasso is known to perform well in terms of prediction error and estimation error, as measured for instance by the $\ell_2$ loss [BvdG11]. In this paper we address the far less understood problem of assessing uncertainty and statistical significance, e.g., by computing confidence intervals or p-values. This problem is particularly challenging in high dimension, since good estimators, such as the Lasso, are by necessity non-linear and hence do not have a tractable distribution.

More specifically, we are interested in testing null hypotheses of the form:

$$
H_{0,i}:\ \theta_{0,i}=0\,,\qquad \text{for } i\in[p]\,, \tag{4}
$$

and assigning $p$-values for these tests. Rejecting $H_{0,i}$ corresponds to inferring that $\theta_{0,i}\neq 0$. A related question is the one of computing confidence intervals. Namely, for a given coordinate $i\in[p]$ and significance level $\alpha\in(0,1)$, we want to determine an interval $[\underline{\theta}_i,\overline{\theta}_i]$ such that

$$
\mathbb{P}\big(\theta_{0,i}\in[\underline{\theta}_i,\overline{\theta}_i]\big)\ \ge\ 1-\alpha\,. \tag{5}
$$

### 1.1 Main idea and summary of contributions

A series of recent papers have developed the idea of 'de-biasing' the Lasso estimator $\hat{\theta}$, by defining

$$
\hat{\theta}^u = \hat{\theta} + \frac{1}{n}\, M X^{\sf T}(Y - X\hat{\theta})\,. \tag{6}
$$

Here $M\in\mathbb{R}^{p\times p}$ is a matrix that depends on the design matrix $X$, and aims at decorrelating the columns of $X$. A possible interpretation of this construction is that the term $X^{\sf T}(Y-X\hat{\theta})/(n\lambda)$ is a subgradient of the $\ell_1$ norm at the Lasso solution $\hat{\theta}$. By adding a term proportional to this subgradient, we compensate for the bias introduced by the $\ell_1$ penalty. It is worth noting that $\hat{\theta}^u$ is no longer a sparse estimator. In certain regimes, and for suitable choices of $M$, it was proved that $\hat{\theta}^u$ is approximately Gaussian with mean $\theta_0$, hence leading to the construction of p-values and confidence intervals.
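The debiasing step of Eq. (6) is a single linear correction. The following sketch applies it in the simplest case of identity covariance, where one may take $M = I$; the toy sizes and the deliberately crude initial estimator are illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

def debias(theta_hat, X, Y, M):
    """Debiasing step of Eq. (6): theta_u = theta_hat + (1/n) M X^T (Y - X theta_hat)."""
    n = X.shape[0]
    return theta_hat + M @ (X.T @ (Y - X @ theta_hat)) / n

# Toy check with identity covariance, so Omega = I and we may take M = I.
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[0] = 1.0
Y = X @ theta0 + 0.1 * rng.standard_normal(n)
theta_init = np.zeros(p)          # even a crude initial estimate is corrected
theta_u = debias(theta_init, X, Y, np.eye(p))
```

With `theta_init = 0` the correction reduces to $X^{\sf T}Y/n$, which for an uncorrelated design already concentrates around $\theta_0$ coordinate-wise, illustrating why the corrected estimator is approximately unbiased.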

More specifically, let $\Sigma \equiv \mathbb{E}[X_1X_1^{\sf T}]$ be the population covariance matrix, and $\Omega \equiv \Sigma^{-1}$ denote the precision matrix. In [JM13b], the present authors assumed the precision matrix to be known and proposed to use $M = c\,\Omega$, for an explicit constant $c$. A plug-in estimator for $\Omega$ was also suggested for sparse covariances $\Sigma$. Furthermore, asymptotic validity and minimax optimality of the method were proven for uncorrelated Gaussian designs ($\Sigma = I$). A conjecture was derived for a broad class of covariances using statistical physics arguments. Van de Geer, Bühlmann and Ritov [vdGBR13] used a similar construction with $M$ an estimate of $\Omega$, which is appropriate for sparse precision matrices $\Omega$. These authors prove validity of their method for sample size $n$ that asymptotically dominates $(s_0\log p)^2$. In [JM13a], the present authors propose to construct $M$ by solving a convex program that aims at optimizing two objectives: first, controlling the bias of $\hat{\theta}^u$, and second, minimizing the variance of its entries. Minimax optimality was established for sample size $n$ that asymptotically dominates $(s_0\log p)^2$, without however requiring $\Omega$ to be sparse. Additional related work can be found in [Büh12, ZZ11].

Note that nearly optimal estimation via the Lasso is possible for significantly smaller sample size, namely for $n \ge c\, s_0\log p$, for some constant $c$ [CT07, BRT09]. This suggests the following natural question:

Is it possible to design a minimax optimal test for the hypotheses $H_{0,i}$ at the optimal sample size $n \ge c\, s_0\log p$?

While the results of [JM13b] suggest a positive answer, they assume $\Omega$ to be known, and apply only asymptotically as $n,p\to\infty$. In this paper we partially answer this question, by proving the following results.

General subgaussian designs.

We do not make any assumption on the rows $X_i$ of the design matrix except that they are independent and identically distributed, with common law having subgaussian tails and non-singular covariance. This model is well suited for statistical applications wherein the pairs $(X_i,Y_i)$ are drawn at random from a population.

Our results in this case hold conditionally on the availability of an estimator $\hat{\Omega}$ of the precision matrix such that $\|\hat{\Omega}-\Omega\|_\infty = o(1/\sqrt{\log p})$. (Here $\|A\|_\infty$ denotes the $\ell_\infty$ operator norm of the matrix $A$, i.e., its maximum row $\ell_1$ norm.) Then, a testing procedure is developed that is minimax optimal with nearly optimal sample size, namely for $n$ that asymptotically dominates $s_0(\log p)^2$. Here 'optimality' is measured in terms of the average power of tests for the hypotheses $H_{0,i}$, with the average taken over the coordinates $i\in[p]$. To be more specific, the testing procedure is constructed based on the debiased estimator $\hat{\theta}^u$, where we set $M = \hat{\Omega}$.

Subgaussian designs with sparse inverse covariance.

In this case, the rows $X_i$ are subgaussian with a common covariance $\Sigma$, such that $\Omega = \Sigma^{-1}$ is sparse. For this model, an estimator $\hat{\Omega}$ with the required properties exists, and was used in [vdGBR13]. We can therefore establish unconditional results and prove optimality of the present test.

With respect to the earlier analysis [vdGBR13], our results apply to much smaller sample size, namely $n$ needs to dominate $s_0(\log p)^2$ instead of $(s_0\log p)^2$. On the other hand, guarantees are only provided with respect to the average power of the test. Roughly speaking, our results in this case imply that the method of [vdGBR13] has a significantly broader domain of validity than initially expected.

While the assumption of a sparse inverse covariance is admittedly restrictive, it arises naturally in a number of contexts. For instance, it is relevant for the problem of learning sparse Gaussian graphical models [MB06]. In this case, the set of edges incident on a specific vertex $i$ can be encoded in the row $\Omega_{i,\cdot}$ of the precision matrix. It also played a pivotal role in compressed sensing, as one of the first models in which an optimal tradeoff between sparsity and sample size was proven to hold [CT05, DT05, Wai09].

Covariance estimators satisfying the required condition can be constructed under structural assumptions other than sparsity. Our general theory allows one to build hypothesis testing methods for each of these cases. We expect this to spur progress in other settings as well.

Finally, we evaluate our procedure on synthetic data, comparing its performance with the method of [JM13a].

### 1.2 Definitions and notations

Throughout, $\Sigma \equiv \mathbb{E}[X_1X_1^{\sf T}]$ will be referred to as the covariance, and $\Omega \equiv \Sigma^{-1}$ as the precision matrix. Without loss of generality, we will assume that the columns of $X$ are normalized so that $\Sigma_{ii} = 1$ for all $i\in[p]$. (This normalization is only assumed for the analysis, and is not required for the hypothesis testing procedure or the construction of confidence intervals.)

For a matrix $A$ and sets of indices $I$, $J$, we let $A_{I,J}$ denote the submatrix formed by the rows in $I$ and columns in $J$. Also, $A_{I,\cdot}$ (resp. $A_{\cdot,J}$) denotes the submatrix containing just the rows in $I$ (resp. columns in $J$). Likewise, for a vector $v$, $v_I$ is the restriction of $v$ to indices in $I$. The maximum and the minimum singular values of $A$ are respectively denoted by $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$. We write $\|v\|_q$ for the standard $\ell_q$ norm of a vector $v$ (omitting the subscript in the case $q=2$), and $\|v\|_0$ for the number of nonzero entries of $v$. For a matrix $A$, $\|A\|_q$ is its $\ell_q$ operator norm, and $|A|_q$ is the elementwise $\ell_q$ norm, i.e., $|A|_q \equiv (\sum_{i,j}|A_{ij}|^q)^{1/q}$. Further, $|A|_\infty \equiv \max_{i,j}|A_{ij}|$. We use the notation $[p]$ for the set $\{1,\dots,p\}$. For a vector $v$, $\operatorname{supp}(v)$ represents the positions of the nonzero entries of $v$.

The standard normal distribution function is denoted by $\Phi(x) \equiv \int_{-\infty}^x e^{-t^2/2}\,{\rm d}t/\sqrt{2\pi}$. For two functions $f(n)$ and $g(n)$, the notation $f(n) \gg g(n)$ means that $f$ dominates $g$ asymptotically, namely, for every fixed positive $C$, there exists $n_0(C)$ such that $f(n) \ge C g(n)$ for $n \ge n_0(C)$.

The sub-gaussian norm of a random variable $Z$, denoted by $\|Z\|_{\psi_2}$, is defined as

$$
\|Z\|_{\psi_2} = \sup_{q\ge 1}\, q^{-1/2}\big(\mathbb{E}|Z|^q\big)^{1/q}\,.
$$

The sub-gaussian norm of a random vector $Z\in\mathbb{R}^n$ is defined as $\|Z\|_{\psi_2} = \sup_{x\in S^{n-1}}\|\langle Z,x\rangle\|_{\psi_2}$.

Finally, the sub-exponential norm of a random variable $Z$ is defined as

$$
\|Z\|_{\psi_1} = \sup_{q\ge 1}\, q^{-1}\big(\mathbb{E}|Z|^q\big)^{1/q}\,.
$$

## 2 Debiasing the Lasso estimator

Let $\hat{\Omega}$ be an estimate of the precision matrix $\Omega$. We define the estimator $\hat{\theta}^u$ based on the Lasso solution $\hat{\theta}$ and $\hat{\Omega}$, as per Eq. (7), i.e., Eq. (6) with $M = \hat{\Omega}$. The following proposition provides a decomposition of the residual $\sqrt{n}(\hat{\theta}^u-\theta_0)$, which is useful in characterizing the limiting distribution of $\hat{\theta}^u$. Its proof follows readily from the proof of Theorem 2.1 in [vdGBR13], and is given in Appendix A for the reader's convenience.

###### Proposition 2.1.

Consider the linear model (1) and let $\hat{\theta}^u$ be defined as per Eq. (7). Then,

$$
\sqrt{n}(\hat{\theta}^u-\theta_0) = Z + \Delta\,,\qquad Z|X \sim N(0,\sigma^2\hat{\Omega}\hat{\Sigma}\hat{\Omega}^{\sf T})\,,\qquad \Delta = \sqrt{n}\,(\hat{\Omega}\hat{\Sigma}-I)(\theta_0-\hat{\theta})\,.
$$

We recall the definition of restricted eigenvalues as given in [BRT09]:

$$
\phi_{\max}(t) \equiv \max_{1\le\|v\|_0\le t}\, \frac{\|Xv\|_2^2}{n\|v\|_2^2}\,.
$$

It is also convenient to recall the following restricted eigenvalue (RE) assumptions.

Assumption RE$(s,c)$.

For some integer $s$ such that $1\le s\le p$ and a positive number $c$, the following condition holds:

$$
\kappa(s,c) \equiv \min_{J\subseteq[p]:\,|J|\le s}\ \min_{v\ne 0:\,\|v_{J^c}\|_1\le c\|v_J\|_1}\ \frac{\|Xv\|_2}{\sqrt{n}\,\|v_J\|_2}\ >\ 0\,.
$$

The assumption RE$(s,c)$ has been used to establish bounds on the prediction loss and on the $\ell_1$ loss of the Lasso.

Assumption RE$(s,q,c)$.

Let $s$, $q$ be integers such that $1\le s\le p$ and $s+q\le p$. For a vector $v$ and a set of indices $J$ with $|J|\le s$, denote by $J_1$ the subset of $J^c$ corresponding to the $q$ largest coordinates of $v_{J^c}$ (in absolute value) and define $J_2 \equiv J\cup J_1$. We say that $X$ satisfies RE$(s,q,c)$ with constant $\kappa(s,q,c)$ if

$$
\kappa(s,q,c) \equiv \min_{J\subseteq[p]:\,|J|\le s}\ \min_{v\ne 0:\,\|v_{J^c}\|_1\le c\|v_J\|_1}\ \frac{\|Xv\|_2}{\sqrt{n}\,\|v_{J_2}\|_2}\ >\ 0\,.
$$

This assumption has been used to bound the $\ell_2$ loss of the Lasso with $q = s$ [BRT09].

The following proposition is a minor improvement over [BRT09, Theorem 7.2] in that it uses the restricted quantity $\phi_{\max}(n)$ instead of the unrestricted maximum eigenvalue of $\hat{\Sigma}$.

###### Proposition 2.2 ([BRT09]).

Let assumption RE$(s_0,3)$ be satisfied. Consider the Lasso estimator $\hat{\theta} = \hat{\theta}(Y,X;\lambda)$ with $\lambda = c\sigma\sqrt{\log p/n}$. Then, with high probability, we have

$$
\|\hat{\theta}\|_0 \ \le\ \frac{64\,\phi_{\max}(n)^2}{\kappa(s_0,3)^2}\, s_0\,. \tag{8}
$$

If assumption RE$(s_0,s_0,3)$ with constant $\kappa(s_0,s_0,3)$ is satisfied, then with high probability,

$$
\|\hat{\theta}-\theta_0\|_2^2 \ \le\ \frac{C s_0\sigma^2\log p}{n}\,, \tag{9}
$$

where $C$ is bounded for $\kappa(s_0,s_0,3)$ bounded away from $0$.

A proof of Eq. (8) is given in Appendix B.

Our next theorem controls the bias term $\Delta$. In order to state the result formally, for a vector $v\in\mathbb{R}^m$ and $k\in[m]$, we define its $(\infty,k)$ norm as follows:

$$
\|v\|_{(\infty,k)} \equiv \max_{A\subseteq[m],\,|A|\ge k}\ \frac{\|v_A\|_2}{\sqrt{|A|}}\,. \tag{10}
$$

For $k=1$, this is just the $\ell_\infty$ norm (the maximum entry in absolute value) of $v$. At the other extreme, for $k=m$, this is the rescaled $\ell_2$ norm $\|v\|_2/\sqrt{m}$. It is easy to see that $k\mapsto\|v\|_{(\infty,k)}$ is non-increasing in $k$. As $k$ gets smaller, it gives us tighter control on the individual entries of $v$.
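Since the maximum in Eq. (10) is attained by the $k$ entries of largest magnitude, $\|v\|_{(\infty,k)}$ equals the root-mean-square of the top-$k$ entries of $|v|$, and is easy to compute. A short sketch (the helper name is ours):

```python
import numpy as np

def norm_inf_k(v, k):
    """||v||_(inf,k) of Eq. (10): the maximizing subset consists of the k
    largest-magnitude entries, so this is the RMS of the top-k entries of |v|."""
    s = np.sort(np.abs(v))[::-1]              # magnitudes, descending
    return float(np.sqrt(np.mean(s[:k] ** 2)))

v = np.array([3.0, -4.0, 1.0, 0.0])
print(norm_inf_k(v, 1))                        # k = 1: the ell_infinity norm, 4.0
print(norm_inf_k(v, 4))                        # k = m: ||v||_2 / sqrt(m) = sqrt(26)/2
```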

The next theorem bounds $\|\Delta\|_{(\infty,k)}$ down to $k = c\,s_0$ for an arbitrarily small constant $c$.

###### Theorem 2.3.

Consider the linear model (1) and let $\Sigma$ be the population covariance matrix of the design $X$. Let $\Omega = \Sigma^{-1}$ be the precision matrix, and suppose that an estimate $\hat{\Omega}$ is available, such that $\|\hat{\Omega}-\Omega\|_\infty = o(1/\sqrt{\log p})$. Further, assume that $\sigma_{\min}(\Sigma)$ and $\sigma_{\max}(\Sigma)$ are respectively bounded from below and above by some constants as $p\to\infty$. In addition, assume that the rows of the whitened matrix $X\Omega^{1/2}$ are sub-gaussian, i.e., $\|\Omega^{1/2}X_1\|_{\psi_2}\le\kappa$, for some constant $\kappa$.

Let $\Delta$ be the bias term in the decomposition of Proposition 2.1. Then for any arbitrary (but fixed) constant $c>0$, there exists $C>0$ such that,

$$
\|\Delta\|^2_{(\infty,c s_0)} \ \le\ \frac{C\sigma^2 s_0(\log p)^2}{n} + o_P(1)\,. \tag{11}
$$

The proof is deferred to Section 7.1.

Using Markov's inequality, this implies that there cannot be many entries of $\Delta$ that are large.

###### Corollary 2.4.

Under the conditions of Theorem 2.3, and for $n\gg s_0(\log p)^2$, there are at most $o(s_0)$ entries of $\Delta$ that are of order one. More precisely, fix an arbitrary $\varepsilon>0$, and define $C_n(\varepsilon)\equiv\{i\in[p]:\ |\Delta_i|\ge\varepsilon\}$. Then the following limit holds in probability:

$$
\lim_{n\to\infty}\ \frac{1}{s_0}\,|C_n(\varepsilon)| = 0\,.
$$
###### Proof.

If the claim does not hold true, then by applying Theorem 2.3 to the set $C_n(\varepsilon)$, we have

$$
\varepsilon^2 \ \le\ \frac{\|\Delta_{C_n(\varepsilon)}\|_2^2}{|C_n(\varepsilon)|} \ \le\ \frac{C\sigma^2 s_0(\log p)^2}{n} + o_P(1) \ =\ o_P(1)\,,
$$

which is a contradiction. ∎

In other words, except for at most $o(s_0)$ entries, $\hat{\theta}^u_i$ is an asymptotically unbiased estimator of $\theta_{0,i}$.

## 3 Constructing p-values and hypothesis testing

For the linear model (1), we are interested in testing the individual hypotheses $H_{0,i}:\theta_{0,i}=0$, and assigning $p$-values to these tests.

Similar to [JM13a], we construct a two-sided $p$-value $P_i$ for the test $H_{0,i}$ as follows:

$$
P_i = 2\Bigg(1-\Phi\Bigg(\frac{\sqrt{n}\,|\hat{\theta}^u_i|}{\hat{\sigma}\,[\hat{\Omega}\hat{\Sigma}\hat{\Omega}^{\sf T}]_{ii}^{1/2}}\Bigg)\Bigg)\,, \tag{12}
$$

where $\hat{\sigma}$ is a consistent estimator of the noise level $\sigma$. For instance, we can use the scaled Lasso [SZ12] (see also related work in [BCW11]):

$$
\{\hat{\theta},\hat{\sigma}\} \equiv \arg\min_{\theta\in\mathbb{R}^p,\,\sigma>0}\Big\{\frac{1}{2\sigma n}\|Y-X\theta\|_2^2 + \frac{\sigma}{2} + \lambda\|\theta\|_1\Big\}\,.
$$

Choosing $\lambda = c\sqrt{\log p/n}$ yields a consistent estimate of $\sigma$.
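The scaled-Lasso objective can be minimized by alternating minimization: for fixed $\sigma$ it is an ordinary Lasso with penalty $\sigma\lambda$, and for fixed $\theta$ the optimal $\sigma$ is $\|Y-X\theta\|_2/\sqrt{n}$. The sketch below is a minimal illustration of this idea under assumed toy sizes; it is not [SZ12]'s exact algorithm.

```python
import numpy as np

def ista(X, Y, lam, n_iter=300):
    """Plain proximal-gradient Lasso solver for (1/2n)||Y - X t||^2 + lam ||t||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        z = theta - X.T @ (X @ theta - Y) / (n * L)
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return theta

def scaled_lasso(X, Y, lam0, n_rounds=10):
    """Alternating minimization of the scaled-Lasso objective: a Lasso step
    with penalty sigma*lam0, then sigma = ||Y - X theta||_2 / sqrt(n)."""
    n, _ = X.shape
    sigma = np.std(Y)                        # crude initialization
    for _ in range(n_rounds):
        theta = ista(X, Y, sigma * lam0)
        sigma = np.linalg.norm(Y - X @ theta) / np.sqrt(n)
    return theta, sigma

rng = np.random.default_rng(4)
n, p, s0, sigma_true = 300, 100, 5, 1.0      # illustrative sizes
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:s0] = 2.0
Y = X @ theta0 + sigma_true * rng.standard_normal(n)
theta_hat, sig_hat = scaled_lasso(X, Y, lam0=np.sqrt(2 * np.log(p) / n))
```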

A different estimator of $\sigma$ can be constructed using Proposition 2.1 and Theorem 2.3, and is described in Section 4.

The decision rule is then based on the $p$-value $P_i$:

$$
T_{i,X}(y) = \begin{cases} 1 & \text{if } P_i\le\alpha \quad (\text{reject } H_{0,i})\,,\\ 0 & \text{otherwise} \quad (\text{accept } H_{0,i})\,. \end{cases} \tag{13}
$$
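Equations (12)–(13) amount to a two-sided z-test once the standard error of each coordinate is available. A self-contained sketch, where the argument `se_i` stands for the plug-in standard error $\hat{\sigma}[\hat{\Omega}\hat{\Sigma}\hat{\Omega}^{\sf T}]_{ii}^{1/2}/\sqrt{n}$, which must be computed separately:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_value(theta_u_i, se_i):
    """Two-sided p-value of Eq. (12); se_i plays the role of
    sigma_hat * [Omega_hat Sigma_hat Omega_hat^T]_{ii}^{1/2} / sqrt(n)."""
    return 2.0 * (1.0 - Phi(abs(theta_u_i) / se_i))

def reject_H0(theta_u_i, se_i, alpha=0.05):
    """Decision rule of Eq. (13): reject H_{0,i} iff P_i <= alpha."""
    return 1 if p_value(theta_u_i, se_i) <= alpha else 0
```

For instance, a coordinate 1.96 standard errors away from zero has a p-value of about 0.05, the familiar two-sided z-test threshold.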

We measure the quality of the test $T_{i,X}(y)$ in terms of its significance level $\alpha_i$ and statistical power $1-\beta_i$. Here $\alpha_i$ is the probability of a type I error (i.e., of a false positive at coordinate $i$) and $\beta_i$ is the probability of a type II error (i.e., of a false negative at coordinate $i$).

Our next theorem characterizes the tradeoff between type I error and the average power attained by the decision rule (13). Note that this tradeoff depends on the magnitude of the non-zero coefficients $\theta_{0,i}$: the larger they are, the easier it is to distinguish the null hypotheses from their alternatives. We refer to Section 7.2 for a proof.

###### Theorem 3.1.

Consider a random design model that satisfies the conditions of Theorem 2.3. For $i\in[p]$, let $T_{i,X}(y)$ be the decision rule defined in Eq. (13). Assume that $\hat{\Omega}$ is such that $\|\hat{\Omega}-\Omega\|_\infty = o(1/\sqrt{\log p})$ with high probability. Under the sample size assumption $n\gg s_0(\log p)^2$, the following holds true:

$$
\limsup_{n\to\infty}\ \frac{1}{p-s_0}\sum_{i\in S^c} \mathbb{P}_{\theta_0}\big(T_{i,X}(y)=1\big)\ \le\ \alpha\,, \tag{14}
$$

$$
\liminf_{n\to\infty}\ \frac{1}{s_0}\sum_{i\in S} \frac{\mathbb{P}_{\theta_0}\big(T_{i,X}(y)=1\big)}{1-\beta^*(\theta_0,n)}\ \ge\ 1\,, \tag{15}
$$

$$
1-\beta^*(\theta_0,n) \ \equiv\ \frac{1}{s_0}\sum_{i\in S} G\Bigg(\alpha,\ \frac{\sqrt{n}\,|\theta_{0,i}|}{\sigma\sqrt{\Omega_{ii}}}\Bigg)\,, \tag{16}
$$

where, for $\alpha\in[0,1]$ and $u\in\mathbb{R}_+$, the function $G(\alpha,u)$ is defined as follows:

$$
G(\alpha,u) = 2 - \Phi\big(\Phi^{-1}(1-\tfrac{\alpha}{2})+u\big) - \Phi\big(\Phi^{-1}(1-\tfrac{\alpha}{2})-u\big)\,. \tag{17}
$$

Furthermore, $\mathbb{P}_{\theta_0}$ is the probability measure induced by the random design $X$ and the noise realization $W$, given the fixed parameter vector $\theta_0$.

In Fig. 1, the function $G(\alpha,u)$ is plotted versus $u$, for several values of $\alpha$. It is easy to see that, for any $\alpha>0$, $u\mapsto G(\alpha,u)$ is monotone increasing. Suppose that $\min_{i\in S}|\theta_{0,i}|\ge\mu$. Then, by Eq. (16), we have

$$
1-\beta^*(\theta_0,n)\ \ge\ \frac{1}{s_0}\sum_{i\in S} G\Bigg(\alpha,\ \frac{\sqrt{n}\,\mu}{\sigma\sqrt{\Omega_{ii}}}\Bigg)\,.
$$

Notice that $G(\alpha,0)=\alpha$, giving the lower bound $\alpha$ for the power at $\mu=0$. In fact, without any assumption on the non-zero coordinates of $\theta_0$, one can take $\mu$ arbitrarily close to zero, and $H_{0,i}$ practically becomes indistinguishable from its alternative. In this case, no decision rule can outperform random guessing, i.e., randomly rejecting $H_{0,i}$ with probability $\alpha$. This yields the trivial power $1-\beta = \alpha$.
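The power curve $G(\alpha,u)$ of Eq. (17) is straightforward to evaluate numerically. The sketch below uses the error function for $\Phi$ and a simple bisection for $\Phi^{-1}$ (an illustrative choice, not a recommendation for production use):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def Phi_inv(q, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection -- accurate enough for an illustration."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < q else (lo, mid)
    return (lo + hi) / 2.0

def G(alpha, u):
    """Power curve of Eq. (17)."""
    z = Phi_inv(1.0 - alpha / 2.0)
    return 2.0 - Phi(z + u) - Phi(z - u)

# With u = 0 (no signal) the test only attains its level: G(alpha, 0) = alpha.
print(G(0.05, 0.0))
```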

### 3.1 Minimax optimality of the average power

An upper bound for the minimax power of tests with a given significance level $\alpha$ is provided in [JM13b] for sparse linear regression with Gaussian designs. For the sample size scaling considered here, the minimax bound [JM13b, Theorem 2.6] simplifies to the following bound for the optimal average power $1-\beta_{\rm opt}(\alpha;\mu)$:

$$
\lim_{n\to\infty}\ \frac{1-\beta_{\rm opt}(\alpha;\mu)}{G(\alpha,\mu/\sigma_{\rm eff})}\ \le\ 1\,,\qquad \sigma_{\rm eff} = \frac{\sigma}{\sqrt{n\,\eta_{\Sigma,s_0}}}\,, \tag{18}
$$

where

$$
\eta_{\Sigma,s_0} \ \equiv\ \max_{i\in[p]}\ \min_{S}\Big\{\Sigma_{i|S}:\ S\subseteq[p]\setminus\{i\},\ |S|<s_0\Big\}\,,
$$

with $\Sigma_{i|S} \equiv \Sigma_{ii} - \Sigma_{i,S}(\Sigma_{S,S})^{-1}\Sigma_{S,i}$ the conditional variance of the $i$-th coordinate given those in $S$.

We compare our test to the optimal test by computing how much $\mu$ must be increased for our test to achieve the minimax optimal average power. It follows from Eqs. (15) and (16) that $\mu$ must be increased to $\tilde{\mu}$, with the increase factor bounded as

$$
\frac{\tilde{\mu}}{\mu} = \Big(\max_{i\in[p]}\sqrt{\Omega_{ii}}\Big)\sqrt{\eta_{\Sigma,s_0}}\ \le\ \max_{i\in[p]}\sqrt{\Omega_{ii}\Sigma_{ii}}\ \le\ \sqrt{\frac{\sigma_{\max}(\Sigma)}{\sigma_{\min}(\Sigma)}}\,.
$$

Therefore, our test has nearly optimal average power for well-conditioned covariances, at the nearly optimal sample size scaling $n\gg s_0(\log p)^2$.

## 4 An estimator of the noise level

In this section we describe a consistent estimator of the noise standard deviation $\sigma$, based on Proposition 2.1 and Theorem 2.3. After constructing $\hat{\theta}^u$ as per Eq. (7), we let

$$
z_i \equiv \frac{\sqrt{n}\,\hat{\theta}^u_i}{[\hat{\Omega}\hat{\Sigma}\hat{\Omega}^{\sf T}]_{ii}^{1/2}}\,. \tag{19}
$$

According to Proposition 2.1 and Theorem 2.3, the entries $z_i$, for $i\notin S$, are approximately Gaussian, with mean $0$ and variance $\sigma^2$. This suggests the following robust estimator (a similar approach in a related context was proposed in [DMM09]).

Let $|z|$ be the vector of absolute values of $z$, i.e. $|z|_i = |z_i|$, and denote by $|z|_{(k)}$ its $k$-th entry in order of magnitude: $|z|_{(1)}\le|z|_{(2)}\le\cdots$. We then set (more generally, for $q\in(0,1)$, we can use $\hat{\sigma} = |z|_{(pq)}/\Phi^{-1}((1+q)/2)$):

$$
\hat{\sigma} = \frac{|z|_{(p/2)}}{\Phi^{-1}(3/4)}\,. \tag{20}
$$
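Eq. (20) is the classical median-of-absolute-values scale estimator applied to the scores $z_i$: it is robust precisely because the few non-null coordinates barely move the median. A sketch with synthetic scores (the sizes and contamination level are illustrative assumptions):

```python
import numpy as np

def sigma_hat(z):
    """Eq. (20): median of |z_i| divided by Phi^{-1}(3/4) ~ 0.6744898.
    Valid when most coordinates are null, so the bulk of z is N(0, sigma^2)."""
    PHI_INV_3_4 = 0.6744897501960817
    return float(np.median(np.abs(z)) / PHI_INV_3_4)

rng = np.random.default_rng(2)
sigma = 2.0
z = sigma * rng.standard_normal(10_000)
z[:50] += 20.0          # a few large "signal" coordinates barely move the median
print(sigma_hat(z))     # close to the true sigma = 2.0
```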

The next corollary is an immediate consequence of Proposition 2.1 and Theorem 2.3 (see also Lemma 7.2).

###### Corollary 4.1.

Under the assumptions of Theorem 2.3, if further $s_0 = o(p)$, we have $\hat{\sigma}\to\sigma$ in probability.

## 5 Designs with sparse precision matrix

In Theorem 2.3 we posit the existence of an estimator $\hat{\Omega}$ of the precision matrix $\Omega$ such that $\|\hat{\Omega}-\Omega\|_\infty = o(1/\sqrt{\log p})$. In case the precision matrix is sparse enough, [vdGBR13] constructs such an estimator using the Lasso for the node-wise regression on the design $X$. Formally, for $j\in[p]$, let

$$
\hat{\gamma}_j = \arg\min_{\gamma\in\mathbb{R}^{p-1}}\Big\{\frac{1}{2n}\|X_j - X_{-j}\gamma\|_2^2 + \lambda\|\gamma\|_1\Big\}\,,
$$

where $X_{-j}\in\mathbb{R}^{n\times(p-1)}$ is the sub-matrix obtained by removing the $j$-th column. Also let

$$
\hat{C} = \begin{bmatrix} 1 & -\hat{\gamma}_{1,2} & \cdots & -\hat{\gamma}_{1,p}\\ -\hat{\gamma}_{2,1} & 1 & \cdots & -\hat{\gamma}_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ -\hat{\gamma}_{p,1} & -\hat{\gamma}_{p,2} & \cdots & 1 \end{bmatrix}\,,
$$

with $\hat{\gamma}_{j,k}$ being the $k$-th entry of $\hat{\gamma}_j$, and let

$$
\hat{T}^2 = {\rm diag}(\hat{\tau}_1^2,\dots,\hat{\tau}_p^2)\,,\qquad \hat{\tau}_j^2 = (X_j - X_{-j}\hat{\gamma}_j)^{\sf T} X_j/n\,.
$$

Then define $\hat{\Omega} \equiv \hat{T}^{-2}\hat{C}$.
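The construction above can be sketched end to end: run a Lasso of each column on the remaining ones, fill the corresponding row of $\hat{C}$, compute $\hat{\tau}_j^2$, and rescale. The inner ISTA solver, the tridiagonal test precision matrix, and the value of $\lambda$ below are illustrative assumptions:

```python
import numpy as np

def ista(A, b, lam, n_iter=300):
    """Plain proximal-gradient Lasso solver for (1/2n)||b - A g||^2 + lam ||g||_1."""
    n, p = A.shape
    L = np.linalg.norm(A, 2) ** 2 / n
    g = np.zeros(p)
    for _ in range(n_iter):
        z = g - A.T @ (A @ g - b) / (n * L)
        g = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return g

def nodewise_precision(X, lam):
    """Node-wise regression estimator of Section 5: regress each column X_j
    on the others, then set Omega_hat = T^{-2} C."""
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.zeros(p)
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        gamma = ista(X[:, idx], X[:, j], lam)
        C[j, idx] = -gamma
        resid = X[:, j] - X[:, idx] @ gamma
        tau2[j] = resid @ X[:, j] / n      # tau_j^2 = (X_j - X_{-j} gamma)^T X_j / n
    return C / tau2[:, None]               # Omega_hat = diag(tau^{-2}) C

rng = np.random.default_rng(3)
n, p = 400, 10
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))  # sparse tridiagonal precision
Sigma = np.linalg.inv(Omega)
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
Omega_hat = nodewise_precision(X, lam=0.5 * np.sqrt(np.log(p) / n))
```

In this toy example the rows of $\Omega$ have only three non-zeros, so the node-wise regressions are genuinely sparse and the estimate recovers the tridiagonal pattern.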

###### Proposition 5.1 ([vdGBR13]).

Suppose that the whitened matrix $X\Omega^{1/2}$ has i.i.d. sub-gaussian rows. Further, assume that the maximum number of non-zeros per row of $\Omega$ is $t_0$, that $t_0\sqrt{\log p/n} = o(1)$, and that, for all $j\in[p]$, the node-wise regularization parameter is chosen as $\lambda = c\sqrt{\log p/n}$. Then, with high probability,

$$
\|\hat{\Omega}-\Omega\|_\infty = O\big(t_0\sqrt{\log p/n}\big)\,. \tag{21}
$$

Therefore, if $\Omega$ is sufficiently sparse, namely if it has $t_0 = o(\sqrt{n}/\log p)$ non-zeros per row, then Eq. (21) yields $\|\hat{\Omega}-\Omega\|_\infty = o(1/\sqrt{\log p})$. Hence, the estimator $\hat{\Omega}$ satisfies the conditions of Theorems 2.3 and 3.1.

## 6 Numerical experiments

We generated synthetic data from the linear model (1). The rows of the design matrix $X$ are generated independently from the distribution $N(0,\Sigma)$. Here $\Omega = \Sigma^{-1}$ is a circulant matrix with ones on the diagonal, a common value $a$ on the entries $\Omega_{i,i+1}$ and $\Omega_{i,i-1}$, and zero everywhere else. (The difference between indices is understood modulo $p$.) The parameter $a$ controls the sparsity of the precision matrix, and we take $a < 1/2$ to ensure that $\Omega$ is positive definite (with $\sigma_{\min}(\Omega)$ bounded away from zero). For the parameter vector $\theta_0$, we consider a subset $S\subseteq[p]$, with $|S| = s_0$, chosen uniformly at random, and set $\theta_{0,i} = \mu$ for $i\in S$ and zero everywhere else.
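The circulant precision matrix described above can be built directly; the size $p$ and the off-diagonal value $a = 0.3$ below are hypothetical stand-ins (the paper's actual parameter values are not reproduced here):

```python
import numpy as np

# Illustrative construction of the circulant precision matrix described above.
p, a = 100, 0.3
Omega = np.eye(p)
for i in range(p):
    Omega[i, (i + 1) % p] = a      # indices modulo p, as in the text
    Omega[i, (i - 1) % p] = a
Sigma = np.linalg.inv(Omega)

# Circulant matrix: eigenvalues are 1 + 2a*cos(2*pi*k/p), so a < 1/2
# keeps the smallest eigenvalue bounded away from zero.
eigs = np.linalg.eigvalsh(Omega)
print(eigs.min(), eigs.max())
```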

We evaluate the performance of our testing procedure (13) at a given significance level $\alpha$. The procedure is implemented in R using the glmnet package, which fits the Lasso solution for an entire path of regularization parameters $\lambda$. We then choose the value of $\lambda$ with the minimum mean squared error, approximated by cross-validation.

We compare the performance of the decision rule (13) to the testing method presented in [JM13a] for different values of the parameter controlling the sparsity of the precision matrix. The results are reported in Table 1. The means and the standard deviations are obtained by testing over independent realizations of the noise and the design matrix.

Interestingly, for very sparse precision matrices the two methods often perform identically, and their performances differ only slightly for moderately sparse precision matrices. This is in agreement with the theoretical prediction that both methods asymptotically achieve nearly optimal minimax average power.

Letting $z = (z_1,\dots,z_p)$ be defined as per Eq. (19), in Fig. 3 we plot the sample quantiles of $z$ versus the quantiles of a standard normal distribution for one realization. The linear trend of the plot clearly demonstrates that the empirical distribution of the $z_i$ is approximately normal, corroborating our theoretical results (Proposition 2.1 and Theorem 2.3).

## 7 Proofs

### 7.1 Proof of Theorem 2.3

Write $\Delta = \Delta^{(1)} + \Delta^{(2)}$ with

$$
\Delta^{(1)} = \sqrt{n}\,(\Omega\hat{\Sigma}-I)(\theta_0-\hat{\theta})\,,\qquad \Delta^{(2)} = \sqrt{n}\,(\hat{\Omega}-\Omega)\hat{\Sigma}(\theta_0-\hat{\theta})\,.
$$

Let $T \equiv \operatorname{supp}(\theta_0)\cup\operatorname{supp}(\hat{\theta})$. By Eq. (8), $|T|\le C_* s_0$ for some constant $C_*$, with high probability, because $\kappa(s_0,3)$ is bounded away from zero and $\phi_{\max}(n)$ is bounded. (This in turn follows from the assumption that $\sigma_{\min}(\Sigma)$, $\sigma_{\max}(\Sigma)$ are bounded below and above, using [RZ13].) Also, note that any set $A$ with $|A|\ge c s_0$ can be partitioned as $A = A_1\cup\dots\cup A_\ell$ with $c s_0\le|A_j|\le 2c s_0$. If the claim holds for all sets of size at most $2c s_0$, then it follows for $A$ by summing over the blocks $A_j$. We can therefore assume, without loss of generality, $|A|\le 2c s_0$.

We first bound $\|\Delta^{(1)}_A\|_2$ for a fixed subset $A$, using concentration of sub-exponential random variables. Note that

$$
\Delta^{(1)}_A = \sqrt{n}\,(\Omega\hat{\Sigma}-I)_{A,T}\,(\theta_0-\hat{\theta})_T\,,
$$

since $\operatorname{supp}(\theta_0-\hat{\theta})\subseteq T$. Hence,

$$
\|\Delta^{(1)}_A\|_2 \ \le\ \sqrt{n}\,\|(\Omega\hat{\Sigma}-I)_{A,T}\|_2\,\|(\theta_0-\hat{\theta})_T\|_2\,. \tag{22}
$$

Let $R\equiv(\Omega\hat{\Sigma}-I)_{A,T}$ and define $F_1\equiv\{u\in\mathbb{R}^p:\ \operatorname{supp}(u)\subseteq A,\ \|u\|_2\le 1\}$ and $F_2\equiv\{v\in\mathbb{R}^p:\ \operatorname{supp}(v)\subseteq T,\ \|v\|_2\le 1\}$. We have

$$
\|R\|_2 = \sup_{\|u\|,\|v\|\le 1}\langle u, Rv\rangle \ \le\ \sup_{u\in F_1,\,v\in F_2}\ \frac{1}{n}\sum_{i=1}^n \Big(\langle u,\Omega X_i\rangle\langle X_i, v\rangle - \langle u, v\rangle\Big)\,. \tag{23}
$$

Fix $u\in F_1$ and $v\in F_2$. Let $\xi_i \equiv \langle u,\Omega X_i\rangle\langle X_i,v\rangle - \langle u,v\rangle$. The variables $\xi_i$ are independent and it is easy to see that $\mathbb{E}[\xi_i]=0$, since $\mathbb{E}[\langle u,\Omega X_i\rangle\langle X_i,v\rangle] = u^{\sf T}\Omega\Sigma v = \langle u,v\rangle$. By [Ver12, Remark 5.18], we have

$$
\|\xi_i\|_{\psi_1} \ \le\ 2\,\|\langle u,\Omega X_i\rangle\langle X_i,v\rangle\|_{\psi_1}\,.
$$

Moreover, by Lemma C.1,

$$
\|\langle u,\Omega X_i\rangle\langle X_i,v\rangle\|_{\psi_1} \ \le\ 2\,\|\langle u,\Omega X_i\rangle\|_{\psi_2}\,\|\langle X_i,v\rangle\|_{\psi_2} \ \le\ 2\,\|\Omega^{1/2}u\|_2\,\|\Omega^{-1/2}v\|_2\,\|\Omega^{1/2}X_i\|_{\psi_2}^2 \ \le\ 2\sqrt{\sigma_{\max}(\Sigma)/\sigma_{\min}(\Sigma)}\,\|\Omega^{1/2}X_i\|_{\psi_2}^2\,.
$$

Hence, $\|\xi_i\|_{\psi_1}\le K$, for some constant $K$ depending only on $\kappa$ and on $\sigma_{\min}(\Sigma)$, $\sigma_{\max}(\Sigma)$. Now, by applying Bernstein's inequality for centered sub-exponential random variables [Ver12], for every $t\ge 0$, we have

$$
\mathbb{P}\Bigg(\frac{1}{n}\sum_{i=1}^n \xi_i \ge t\Bigg) \ \le\ 2\exp\Bigg[-Cn\min\Bigg(\frac{t^2}{K^2},\frac{t}{K}\Bigg)\Bigg]\,,
$$

where $C$ is an absolute constant. Therefore, for any constant $c_1>0$, since $s_0\log p = o(n)$ (so that the quadratic branch of the bound applies), we have

$$
\mathbb{P}\Bigg(\frac{1}{n}\sum_{i=1}^n \xi_i \ \ge\ K\sqrt{\frac{c_1 s_0\log p}{Cn}}\Bigg) \ \le\ p^{-c_1 s_0}\,. \tag{24}
$$

In order to bound the right-hand side of Eq. (23), we use an $\varepsilon$-net argument. Clearly, $F_1$ and $F_2$ are isometric to the unit balls of $\mathbb{R}^{|A|}$ and $\mathbb{R}^{|T|}$, respectively. By [Ver12, Lemma 5.2], there exists a $1/2$-net $N_1$ of $F_1$ with size at most $5^{|A|}$. Similarly, there exists a $1/2$-net $N_2$ of $F_2$ of size at most $5^{|T|}$. Hence, using Eq. (24) and taking a union bound over all vectors in $N_1$ and $N_2$, we obtain

$$
\sup_{u\in N_1,\,v\in N_2}\ \frac{1}{n}\sum_{i=1}^n \langle u, (\Omega X_iX_i^{\sf T} - I)v\rangle \ \le\ K\sqrt{\frac{c_1 s_0\log p}{Cn}}\,, \tag{25}
$$

with probability at least $1-5^{|A|+|T|}\,p^{-c_1 s_0}$.

The last part of the argument is based on the following lemma, whose proof is deferred to Appendix D.

###### Lemma 7.1.

Let $M$ be any matrix, and let $N_1$, $N_2$ be the $1/2$-nets of $F_1$, $F_2$ introduced above. Then,

$$
\sup_{u\in F_1,\,v\in F_2}\langle u, Mv\rangle \ \le\ 4\sup_{u\in N_1,\,v\in N_2}\langle u, Mv\rangle\,.
$$

Employing Lemma 7.1 and bound (25) in Eq. (23), we arrive at

$$
\|R\|_2 \ \le\ 4K\sqrt{\frac{c_1 s_0\log p}{Cn}}\,, \tag{26}
$$

with probability at least $1-5^{|A|+|T|}\,p^{-c_1 s_0}$.

Finally, note that there are fewer than $p^{c_2 s_0}$ subsets $A\subseteq[p]$ with $|A|\le 2c s_0$, for some constant $c_2$. Taking a union bound over all these sets (and choosing $c_1$ large enough), we obtain that with high probability,

$$
\|(\Omega\hat{\Sigma}-I)_{A,T}\|_2 \ \le\ C'\sqrt{s_0\log p/n}\,,
$$

for all such sets $A$, where $C'$ is a constant.

Now, plugging this bound and the bound (9) (recalling that $\kappa(s_0,s_0,3)$ is bounded away from zero with high probability by [RZ13], because $\sigma_{\min}(\Sigma)$ is bounded away from zero) into Eq. (22), we get

$$
\|\Delta^{(1)}_A\|_2 \ \le\ \frac{C\sigma s_0\log p}{\sqrt{n}}\,, \tag{27}
$$

with $C$ a constant.

To bound $\Delta^{(2)}$, we bound each entry separately:

$$
|\Delta^{(2)}_i| \ \le\ \sqrt{n}\,\|\hat{\Omega}_{i,\cdot}-\Omega_{i,\cdot}\|_1\,\|\hat{\Sigma}(\theta_0-\hat{\theta})\|_\infty\,. \tag{28}
$$

Note that the subgradient condition for the optimization problem (3) reads

$$
\hat{\Sigma}(\hat{\theta}-\theta_0) = \frac{1}{n}X^{\sf T}W + \lambda\, v(\hat{\theta})\,,
$$

with $\|v(\hat{\theta})\|_\infty\le 1$. Thus $\|\hat{\Sigma}(\hat{\theta}-\theta_0)\|_\infty \le 2\lambda$, with high probability, for the choice $\lambda = c\sigma\sqrt{\log p/n}$. Therefore, Eq. (28) implies

$$
\|\Delta^{(2)}\|_\infty = o(1)\,, \tag{29}
$$

since by our assumption

$$
\|\hat{\Omega}-\Omega\|_\infty = \max_{i\in[p]}\|\hat{\Omega}_{i,\cdot}-\Omega_{i,\cdot}\|_1 = o\big(1/\sqrt{\log p}\big)\,.
$$