A Flexible Framework for Hypothesis Testing in High-dimensions


Adel Javanmard (Data Sciences and Operations Department, University of Southern California. Email: ajavanma@usc.edu)  and  Jason D. Lee (Department of Electrical Engineering, Princeton University. Email: Jasonlee@princeton.edu)
Abstract

Hypothesis testing in the linear regression model is a fundamental statistical problem. We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples (p > n). In order to make informative inference, we assume that the model is approximately sparse, that is, the effect of covariates on the response can be well approximated by conditioning on a relatively small number of covariates whose identities are unknown. We develop a framework for testing very general hypotheses regarding the model parameters. Our framework encompasses testing whether the parameter lies in a convex cone, testing the signal strength, and testing arbitrary functionals of the parameter. We show that the proposed procedure controls the type I error, and we also analyze the power of the procedure. Our numerical experiments confirm our theoretical findings and demonstrate that we control the false positive rate (type I error) near the nominal level and achieve high power. By the duality between hypothesis testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters. For linear functionals, the length of the confidence intervals is shown to be minimax rate optimal.

1 Introduction

Consider the high-dimensional regression model where we are given n i.i.d. pairs (y_1, x_1), …, (y_n, x_n), with y_i ∈ ℝ and x_i ∈ ℝ^p denoting the response values and the feature vectors, respectively. The linear regression model posits that the response values are generated as

y_i = ⟨x_i, θ_0⟩ + ε_i,   ε_i ~ N(0, σ²),   i = 1, …, n.    (1)

Here θ_0 ∈ ℝ^p is a vector of parameters to be estimated. In matrix form, letting y = (y_1, …, y_n)ᵀ and denoting by X ∈ ℝ^{n×p} the matrix with rows x_1ᵀ, …, x_nᵀ, we have

y = X θ_0 + ε,   ε ~ N(0, σ² I_{n×n}).    (2)

We are interested in high-dimensional models where the number of parameters p may far exceed the sample size n. To make informative inference feasible in this setting, we assume a sparsity structure for the model, that is, θ_0 has only a small number s₀ = ‖θ_0‖₀ of nonzero entries, whose identities are unknown.

Our goal in this paper is to understand various parameter structures of the high-dimensional model. Specifically, we develop a flexible framework for testing null hypotheses of the form

H_0: θ_0 ∈ Ω,    (3)

for a general set Ω ⊆ ℝ^p. Remarkably, we make no additional assumptions (such as convexity or connectedness) on Ω.

In Section 5, we will relax the sparsity assumption on the model parameters to approximate sparsity. Consider a linear model whose regression function is not necessarily sparse. Approximate sparsity posits that even if the true signal cannot be written as a sparse linear combination of the covariates, there exists at least one sparse linear combination of the covariates that gets close to the true signal. Formally, we assume that there exists a sparse vector θ_0 whose linear predictor Xθ_0 is close to the true signal. Note that this notion of approximate sparsity is similar to, but stronger than, the one introduced in [BTW07, BCCH12]. (In [BCCH12], the approximate sparsity assumption allows a weaker approximation requirement, while here we impose a stronger one.)

In addition, in Section 6 we extend our analysis to non-gaussian noise.

1.1 Motivation

High-dimensional models are ubiquitous in many areas of application. Examples range from signal processing (e.g., compressed sensing), to recommender systems (collaborative filtering), to statistical network analysis, to predictive analytics. The widespread interest in these applications has spurred remarkable progress in the area of high-dimensional data analysis [CT07, BRT09, BvdG11]. Given that the number of parameters exceeds the sample size, there is no hope of designing reasonable estimators without making further assumptions on the structure of the model parameters. A natural such assumption is sparsity, which posits that only s₀ of the parameters are nonzero, with s₀ ≪ n. A prominent approach in this setting for estimating the model parameters is the Lasso estimator [Tib96, CD95], defined by

θ̂(y, X; λ) ≡ argmin_{θ ∈ ℝ^p} { (1/(2n)) ‖y − Xθ‖₂² + λ ‖θ‖₁ }.    (4)

(We will omit the arguments of θ̂(y, X; λ) whenever clear from the context.)
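As a concrete illustration (ours, not part of the original presentation), the Lasso estimate in (4) can be computed with any standard solver; the sketch below uses scikit-learn, and the particular value of the regularization level lam is only a placeholder assumption.

    # Minimal sketch: computing the Lasso estimate (4) with scikit-learn.
    # scikit-learn minimizes (1/(2n))||y - X theta||_2^2 + alpha*||theta||_1,
    # so alpha plays the role of lambda in (4).  The value of lam below is a
    # placeholder assumption, not a recommendation from the paper.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, s0, sigma = 200, 500, 5, 1.0
    X = rng.standard_normal((n, p))
    theta0 = np.zeros(p); theta0[:s0] = 1.0
    y = X @ theta0 + sigma * rng.standard_normal(n)

    lam = 2 * sigma * np.sqrt(np.log(p) / n)
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_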

To date, the majority of work on high-dimensional parametric models has focused on point estimation, such as consistency for prediction [GR04], oracle inequalities and estimation of the parameter vector [CT07, BRT09, RWY09], model selection [MB06, ZY06, Wai09], and variable screening [FL08]. The work [BTW07] extended the oracle inequalities for the Lasso to the setting of weak sparsity and weak approximation, where the effect of covariates on the response can be controlled, up to a small approximation error, by conditioning on a relatively small number of covariates whose identities are unknown. The minimax rate for estimating the parameters in the high-dimensional linear model was studied in [YZ10, RWY11], assuming that the true model parameters belong to an ℓ_q ball.

Despite this remarkable progress, the fundamental problem of statistical significance is far less understood in the high-dimensional setting. Uncertainty assessment is particularly important when one seeks subtle statistical patterns about the model parameters θ_0.

Below, we discuss some important examples of high-dimensional inference that can be performed when provided with a methodology for testing hypotheses of the form (3).

Example 1 (Testing the θ_min condition) Support recovery in high dimensions concerns the problem of finding a set Ŝ such that the probability of Ŝ matching the true support S = supp(θ_0) is large. Work on support recovery requires the nonzero parameters to be large enough to be detected. Specifically, for exact support recovery, meaning Ŝ = S, it is assumed that the magnitudes of the nonzero entries of θ_0 exceed a sufficiently large threshold. This assumption is often referred to as the θ_min condition and is shown to be necessary for exact support recovery [MY09, ZY06, FL01, Wai09, MB06].

Relaxing the task of exact support recovery, consider the type I and type II error rates in detecting the nonzero (active) parameters of the model. In [JM14b], it is shown that even for gaussian design matrices, any hypothesis testing rule with nontrivial power requires the nonzero parameters to be sufficiently large in magnitude. Although the θ_min assumption is commonplace, it is not verifiable in practice, and hence it calls for developing methodologies that can test whether such a condition holds true.

For a vector θ, define the support of θ as supp(θ) = {i : θ_i ≠ 0}. In (3), letting Ω be the set of vectors whose nonzero entries all exceed a given threshold in magnitude, we can test the θ_min condition for any given threshold and at a pre-assigned significance level α.


Example 2 (Confidence intervals for quadratic forms) We can apply our method to test hypotheses of the form

H_0: θ_0ᵀ A θ_0 ∈ Γ,    (5)

for some given set Γ ⊆ ℝ and a given matrix A ∈ ℝ^{p×p}. By the duality between hypothesis testing and confidence intervals, we can also construct confidence intervals for quadratic forms θ_0ᵀ A θ_0.

In the case of A = I, this yields inference on the signal strength ‖θ_0‖₂². As noted in [JBC17], armed with such a testing method one can also provide confidence intervals for the estimation error of a sparse estimator. Specifically, we split the collected samples into two independent groups and construct an estimate θ̂ using only the first group. Letting ỹ = y − Xθ̂ on the second group, we obtain a linear regression model with parameter vector θ_0 − θ̂. Further, if θ̂ is a sparse estimate, then θ_0 − θ̂ is also sparse. Therefore, inference on the signal strength in the obtained model amounts to inference on the error size ‖θ̂ − θ_0‖₂.
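To make the sample-splitting reduction above concrete, here is a minimal sketch (ours, with hypothetical helper names); it fits a sparse pilot estimate on one half of the data and forms adjusted responses on the other half, whose regression parameter is θ_0 − θ̂.

    # Illustrative reduction from estimation-error inference to signal-strength inference.
    # On the held-out half, y - X theta_pilot = X (theta0 - theta_pilot) + noise, so inference
    # on the signal strength of this new model is inference on the estimation error.
    import numpy as np
    from sklearn.linear_model import Lasso

    def error_size_reduction(X, y, lam, seed=0):
        n = X.shape[0]
        idx = np.random.default_rng(seed).permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]
        theta_pilot = Lasso(alpha=lam, fit_intercept=False).fit(X[i1], y[i1]).coef_
        y_adj = y[i2] - X[i2] @ theta_pilot          # adjusted responses on the second half
        return X[i2], y_adj, theta_pilot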

Inference on quadratic forms turns out to be closely related to a number of well-studied problems, such as estimation of the noise level and of the proportion of explained variation [FGH12, BEM13, Dic14, JBC17, VG18, GWCL19]. To expand on this point, suppose that the feature vectors x_i are drawn i.i.d. from a zero-mean gaussian distribution with covariance Σ, and the noise level σ² is unknown. Then y_i ~ N(0, θ_0ᵀΣθ_0 + σ²). Since ‖y‖₂²/(θ_0ᵀΣθ_0 + σ²) follows a χ² distribution with n degrees of freedom, we have E‖y‖₂² = n(θ_0ᵀΣθ_0 + σ²). Hence, the tasks of inference on the quadratic form θ_0ᵀΣθ_0 and on the noise level σ² are intimately related. This is also related to the proportion of explained variation, defined as

θ_0ᵀΣθ_0 / (θ_0ᵀΣθ_0 + σ²) = SNR / (1 + SNR),    (6)

with SNR = θ_0ᵀΣθ_0/σ² the signal-to-noise ratio. This quantity is of crucial importance in genetic variability [VHW08], as it roughly quantifies the proportion of variance in a trait (response) that is explained by genes (design matrix) rather than by the environment (noise part).
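The chi-squared relation above is easy to check numerically; the following small simulation (ours, with an assumed covariance matrix chosen only for illustration) verifies that ‖y‖₂² scaled by θ_0ᵀΣθ_0 + σ² has mean close to n and variance close to 2n, as a χ²_n variable should.

    # Monte Carlo check (illustrative only): with x_i ~ N(0, Sigma) and gaussian noise,
    # each y_i ~ N(0, theta0' Sigma theta0 + sigma^2), so ||y||_2^2 / (theta0' Sigma theta0 + sigma^2)
    # is chi-squared with n degrees of freedom.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p, sigma = 100, 20, 0.5
    Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p)) / p   # assumed covariance, for illustration
    theta0 = np.zeros(p); theta0[:3] = 1.0
    scale = theta0 @ Sigma @ theta0 + sigma ** 2

    reps = 5000
    stats = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        y = X @ theta0 + sigma * rng.standard_normal(n)
        stats[r] = (y @ y) / scale

    print(stats.mean(), n)        # should be close to n
    print(stats.var(), 2 * n)     # should be close to 2n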


Example 3 (Testing individual parameters θ_{0,i}) Recently, there has been significant interest in testing individual hypotheses of the form H_{0,i}: θ_{0,i} = 0 in the high-dimensional regime. This is a challenging problem because obtaining an exact characterization of the probability distribution of the parameter estimates in the high-dimensional regime is notoriously hard.

A successful approach is based on debiasing regularized estimators. The resulting debiased estimator is amenable to a distributional characterization, which can be used for inference on individual parameters [JM14a, JM14b, ZZ14, VdGBRD14, JM13]. Our methodology for testing hypotheses of the form (3) is built upon the debiasing idea, and it recovers the standard debiasing approach in this special case.


Example 4 (Confidence intervals for predictions) For a new sample x_new ∈ ℝ^p, we can perform inference on the mean response ⟨x_new, θ_0⟩ by letting Ω = {θ ∈ ℝ^p : ⟨x_new, θ⟩ = c} for a given value c. Further, by the duality between hypothesis testing and confidence intervals, we can construct a confidence interval for ⟨x_new, θ_0⟩. We refer to Section 7 for further details.


Example 5 (Confidence intervals for f(θ_0)) Let f: ℝ^p → ℝ be an arbitrary function. By letting Ω = {θ : f(θ) = c}, we can test different values c of f(θ_0). Further, by employing the duality relationship between hypothesis testing and confidence intervals, we can construct confidence intervals for f(θ_0). Note that Examples 3 and 4 are special cases with f(θ) = ⟨e_i, θ⟩ and f(θ) = ⟨x_new, θ⟩, respectively. Here e_i is the i-th standard basis element, with one at the i-th entry and zero everywhere else.


Example 6 (Testing over convex cones) For a given cone C ⊆ ℝ^p, our framework allows us to test whether θ_0 belongs to C. Some examples that naturally arise in studying treatment effects are the nonnegative cone C₊ = {θ : θ_i ≥ 0 for all i} and the monotone cone C_M = {θ : θ_1 ≤ θ_2 ≤ … ≤ θ_p}. Letting θ_{0,i} denote the mean effect of treatment i, by testing H_0: θ_0 ∈ C₊ one can test whether all the treatments in the study are harmless. Another case is when the treatments correspond to an ordered set of dosages of the same drug. Then one might reason that if the drug has any effect, its effect should follow a monotone relationship with its dosage. This hypothesis can be cast as H_0: θ_0 ∈ C_M. Such testing problems over cones have been studied for gaussian sequence models by [Kud63, RW78, RCLN86], and very recently by [WWG19].
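For intuition about these two cones (our illustration only; the cone names C₊ and C_M follow the notation introduced above), their Euclidean projections are simple to compute: clipping for the nonnegative cone, and isotonic regression for the monotone cone. The test statistic developed later in the paper uses an ℓ∞ projection, but the cone constraints are expressed in exactly the same way.

    # Euclidean projections onto the nonnegative cone and the monotone cone (illustrative).
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    v = np.array([0.3, -0.2, 0.5, 0.1, -0.4])

    proj_nonneg = np.maximum(v, 0.0)                 # nonnegative cone {theta : theta_i >= 0}
    proj_monotone = IsotonicRegression().fit_transform(
        np.arange(len(v)), v)                        # monotone cone {theta_1 <= ... <= theta_p}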

1.2 Other Related work

Testing in the high-dimensional linear model has experienced a resurgence in the past few years. Most closely related to us is the line of work on debiasing/desparsifying pioneered by [ZZ14, VdGBRD14, JM14a]. These papers propose a debiased estimator such that every coordinate is approximately gaussian under the sparsity condition s₀ = o(√n / log p), which is in turn used to test single coordinates of θ_0 and to construct confidence intervals for them. In a parallel line of work, [BCH11, BCFVH17, BCH13, BCH14] have also designed an asymptotically gaussian pivot via the post-double-selection lasso, under the same sample size condition. [CG17] established that the sample size conditions required by debiasing and post-double-selection are minimax optimal, meaning that constructing a confidence interval of length of order n^{-1/2} for a coordinate of θ_0 requires s₀ = O(√n / log p).

The debiasing and post-double-selection approaches have also been applied to a wide variety of other models for testing including missing data linear regression [WWBS19], quantile regression [ZKL14], and graphical models [RSZZ15, CRZZ16, WK16, BK18].

In the multiple testing realm, the debiasing approach has been used to control the directional FDR [JJ19]. Other methods, such as the FDR-thresholding and SLOPE procedures, control the false discovery rate (FDR) when the design matrix is orthogonal [SC16, BvdBS15, ABDJ06]. In the non-orthogonal setting, the knockoff procedure [BC15] controls FDR whenever the sample size exceeds the dimension and the noise is isotropic; in [JS16], the knockoff was generalized to also control the family-wise error rate. More recently, [CFJL18] developed the model-free knockoff, which allows for p > n when the distribution of the covariates is known.

In parallel, there have been developments in selective inference, namely inference for the variables that the lasso selects. [LSST16, TTLT16] developed exact tests for the regression coefficients of the variables that the lasso selects. This was further generalized to a wide variety of polyhedral model selection procedures, including marginal screening and orthogonal matching pursuit, in [LT14]. [TT18, FST14, HPM16] developed more powerful and general selective inference procedures by introducing noise into the selection procedure. To allow for selective inference in the high-dimensional setting, [LSST16] combined the polyhedral selection procedure with the debiased lasso to construct selectively valid confidence intervals in the regime p > n.

Much of the previous work has focused on testing coordinates or one-dimensional projections of θ_0. An exception is the work [NvdG13], which studies the problem of constructing confidence sets for high-dimensional linear models, so that the confidence sets are honest over the family of sparse parameters, under i.i.d. gaussian designs. Our work increases the applicability of the debiasing approach by allowing for general hypotheses H_0: θ_0 ∈ Ω. The set Ω can be non-convex or even disconnected. Our setup encompasses a broad range of testing problems, and it is shown to be minimax optimal in special cases such as testing individual coordinates and linear functionals of θ_0.

The authors of [ZB17] have studied the problem (3) independently; indeed, [ZB17] was posted online around the same time that the first draft of our paper was released. That work also leverages the idea of debiasing but differs greatly from this work, both in methodology and in theory, as we now discuss. In [ZB17], the debiased estimator is constructed in the standard basis (as compared to ours, which is constructed in a lower-dimensional subspace) and is followed by a projection step to construct the test statistic. The test statistic involves a data-dependent vector, and the method uses the bootstrap to approximate the distribution of the test statistic and to set the critical values. In terms of theory, [ZB17] shows that the proposed method controls the type I error at the desired level under conditions on the sparsity and the sample size (see Theorem 1 therein), while we prove such a result for our test under the conditions stated in Theorems 3.2 and 3.3. It is shown in [ZB17] that their rule achieves asymptotic power one provided that the signal strength (measured in terms of the distance of θ_0 from Ω) asymptotically dominates a certain rate. In comparison, in Theorem 3.4 we establish a lower bound on the power for all values of the signal strength, and as a corollary we show that our method achieves power one if the signal strength asymptotically dominates the corresponding rate.

1.3 Organization of the paper

In the remaining part of the introduction, we present the notations and a few preliminary definitions. The rest of the paper presents the following contributions:

  • Section 2. We explain our testing methodology. It consists of constructing a debiased estimator for the projection of the model parameters onto a lower-dimensional subspace. It is then followed by an ℓ∞ projection to form the test statistic.

  • Section 3. We present our main results. Specifically, we show that our method controls the false positive rate below a pre-assigned level. We also derive an analytical lower bound on the statistical power of our test. In the case of testing individual parameters (Example 3), it matches the bound proposed in [JM14a, Theorem 3.5], which is also shown to be minimax optimal.

  • Section 5. We explain the notion of approximate sparsity and discuss how our results can be extended to allow for approximately sparse models.

  • Section 6. We relax the gaussianity assumption on the noise component and discuss how to address possibly non-gaussian noise under proper moment conditions.

  • Section 7. We provide applications of our framework to some special cases: inference on linear predictions, quadratic forms of the parameters, and testing the θ_min condition. In Section 7.1, we discuss the existing literature on these subproblems and compare it to our proposed methodology.

  • Section 8. We provide numerical experiments to corroborate our findings and evaluate type I error and statistical power of our test under various settings.

  • Section 9. Proofs of the theorems are given in this section, while the proofs of technical lemmas are deferred to the appendices.

1.4 Notations

We start by adopting some simple notation that will be used throughout the paper, along with some basic definitions from the literature on high-dimensional regression.

We use e_i to refer to the i-th standard basis element, e.g., e_1 = (1, 0, …, 0). For a vector v, supp(v) represents the positions of its nonzero entries. For a vector v and a subset S of indices, v_S is the restriction of v to the indices in S. For an integer p ≥ 1, we use the notation [p] = {1, …, p}. We write ‖v‖_q for the standard ℓ_q norm of a vector v, i.e., ‖v‖_q = (Σ_i |v_i|^q)^{1/q}, and ‖v‖₀ for the number of nonzero entries of v. Whenever the subscript is not mentioned, it should be read as the ℓ₂ norm. For a matrix A, we denote by ‖A‖_∞ the maximum absolute value of its entries. Further, its maximum and minimum singular values are respectively indicated by σ_max(A) and σ_min(A). Throughout, Φ denotes the CDF of the standard normal distribution. We also denote the -values .

The term “with high probability” means with probability converging to one as n → ∞. For two functions f(n) and g(n), the notation f(n) = ω(g(n)) means that f ‘dominates’ g asymptotically, namely, for every fixed positive C, there exists n₀ such that f(n) ≥ C g(n) for n ≥ n₀. Likewise, f(n) = O(g(n)) indicates that f is ‘bounded’ above by g asymptotically, i.e., f(n) ≤ C g(n) for some positive constant C. Analogously, we use the notations ω_P and O_P to indicate the asymptotic behavior in probability as the sample size grows to infinity.

Let Σ̂ = XᵀX/n be the sample covariance of the design X. In the high-dimensional setting, where p exceeds n, Σ̂ is singular. As is common in high-dimensional statistics, we assume a compatibility condition, which requires Σ̂ to be nonsingular over a restricted set of directions.

We use ‖·‖_{ψ₂} to refer to the sub-gaussian norm. Specifically, for a random variable X, we let

‖X‖_{ψ₂} = sup_{q ≥ 1} q^{-1/2} (E|X|^q)^{1/q}.    (7)

For a random vector X ∈ ℝ^p, its sub-gaussian norm is defined as ‖X‖_{ψ₂} = sup_{‖v‖₂ = 1} ‖⟨X, v⟩‖_{ψ₂}.

Definition 1.1.

For a symmetric matrix Σ̂ ∈ ℝ^{p×p} and a set S ⊆ [p], the compatibility constant is defined as

φ²(Σ̂, S) ≡ min { |S| θᵀ Σ̂ θ / ‖θ_S‖₁² : θ ∈ ℝ^p, ‖θ_{S^c}‖₁ ≤ 3 ‖θ_S‖₁ }.    (8)

The matrix Σ̂ is said to satisfy the compatibility condition for the set S, with constant φ₀, if φ(Σ̂, S) ≥ φ₀.

2 Projection statistic

Depending on the structure of Ω, it may be useful, instead of testing the null hypothesis H_0 directly, to test it in a lower-dimensional space. Consider an ℓ-dimensional subspace represented by an orthonormal basis U = [u_1, …, u_ℓ] ∈ ℝ^{p×ℓ}, with ℓ fixed. For this section, we assume that the basis U is predetermined and fixed. In Section 4, we discuss how to choose the subspace depending on Ω to maximize the power of the test. The projection onto this subspace is given by

P_U = U Uᵀ. We also use the notation P_U(Ω) = {P_U ω : ω ∈ Ω} to denote the projection of Ω onto the subspace. Define the hypothesis

H_{0,U}: P_U θ_0 ∈ P_U(Ω).    (9)

Under the null H_0, the hypothesis H_{0,U} also holds, so controlling the type I error of a test for H_{0,U} also controls the type I error for H_0. In the following, we propose a testing rule for the null hypothesis H_{0,U} and show that it controls the type I error below a pre-assigned level α. Consequently, the type I error for H_0 is controlled at the same level.

For now, we consider an arbitrary fixed subspace, and after analyzing the statistical power of our test we provide guidelines on how to choose it to increase the power.

In order to test H_{0,U}, we construct a test statistic based on the debiasing approach.

We first let (θ̂, σ̂) be the scaled Lasso estimator [SZ12], given by

(θ̂, σ̂) = argmin_{θ ∈ ℝ^p, σ > 0} { ‖y − Xθ‖₂² / (2nσ) + σ/2 + λ ‖θ‖₁ }.    (10)

This optimization simultaneously gives an estimate of θ_0 and of the noise level σ. We use a regularization parameter λ of order √(log p / n). Due to the ℓ₁ penalization, the Lasso estimate is biased towards small ℓ₁ norm, and so is its projection onto the subspace. We view the projection in the basis U, namely Uᵀθ̂, and construct a debiased estimator for it in the following way:

û^d ≡ Uᵀθ̂ + (1/n) M Xᵀ (y − X θ̂),    (11)

with the decorrelating matrix M = (m_1, …, m_ℓ)ᵀ ∈ ℝ^{ℓ×p}, where each row m_a is obtained by solving the following optimization problem, for a = 1, …, ℓ:

minimize  mᵀ Σ̂ m   subject to  ‖Σ̂ m − u_a‖_∞ ≤ μ.    (12)

Note that the decorrelating matrix M is a function of X, but not of y. We next state a lemma that provides a bias-variance decomposition for the debiased estimator and brings insight into the form of debiasing given by (11).

Lemma 2.1.

Let X ∈ ℝ^{n×p} be any (deterministic) design matrix. Assuming that optimization problem (46) is feasible for μ, let û^d be a general debiased estimator as per Eq. (11). Then, setting Z ≡ (1/√n) M Xᵀ ε, with ε the noise vector in the regression (2), we have

√n (û^d − Uᵀθ_0) = Z + Δ,   Z | X ~ N(0, σ² M Σ̂ Mᵀ),   Δ ≡ √n (Uᵀ − M Σ̂)(θ̂ − θ_0).    (13)

Further, assume that Σ̂ satisfies the compatibility condition for the set S = supp(θ_0), |S| ≤ s₀, with constant φ₀. Then, choosing λ as above, we have

(14)

Lemma 2.1 can be proved in a similar way to Theorem 2.3 of [JM14a], and its proof is omitted here. The decomposition (13) explains the rationale behind optimization (46). Indeed, the convex program (46) aims at optimizing two objectives. On the one hand, the constraint controls ‖Σ̂ m_a − u_a‖_∞, which by Lemma 2.1 controls the bias term Δ. On the other hand, it minimizes the objective function mᵀ Σ̂ m, which controls the variance of the corresponding entry of the debiased estimator. Therefore, the parameter μ in optimization (46) controls the bias-variance tradeoff and should be chosen large enough to ensure that (46) is feasible. (See Section 3.1 for further discussion.)
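To make the construction above concrete, here is a minimal sketch (ours, with hypothetical helper names scaled_lasso and debias_projection) of the two steps: the scaled Lasso (10) via the standard alternating scheme of [SZ12], and the debiasing step (11)-(12) with the ℓ∞ constraint as paraphrased above. The constraint level mu and the penalty lam are assumptions of the sketch, not prescriptions of the paper.

    # Sketch of the scaled Lasso (10) and the debiasing step (11)-(12); illustrative only.
    import numpy as np
    import cvxpy as cp
    from sklearn.linear_model import Lasso

    def scaled_lasso(X, y, lam, n_iter=20):
        # Alternate between sigma_hat = ||y - X theta||_2 / sqrt(n) and a Lasso fit
        # with penalty lam * sigma_hat (the standard scaled-Lasso iteration).
        n = X.shape[0]
        sigma_hat = np.std(y)                      # crude initialization
        theta_hat = np.zeros(X.shape[1])
        for _ in range(n_iter):
            theta_hat = Lasso(alpha=lam * sigma_hat, fit_intercept=False).fit(X, y).coef_
            sigma_hat = np.linalg.norm(y - X @ theta_hat) / np.sqrt(n)
        return theta_hat, sigma_hat

    def debias_projection(X, y, theta_hat, U, mu):
        # For each basis vector u_a (column of U), find a decorrelating row m_a that
        # minimizes the variance proxy m' Sigma_hat m subject to a small l_inf bias,
        # then correct U' theta_hat by (1/n) * M X' (y - X theta_hat) as in (11).
        n, p = X.shape
        Sigma_hat = X.T @ X / n
        rows = []
        for a in range(U.shape[1]):
            m = cp.Variable(p)
            prob = cp.Problem(cp.Minimize(cp.sum_squares(X @ m) / n),
                              [cp.norm(Sigma_hat @ m - U[:, a], "inf") <= mu])
            prob.solve()
            rows.append(m.value)
        M = np.vstack(rows)                        # decorrelating matrix, of size l x p
        return U.T @ theta_hat + M @ X.T @ (y - X @ theta_hat) / n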

Remark 2.2.

In the special case of U = I_{p×p} (so that ℓ = p), the debiased estimator (11) reduces to the one introduced in [JM14a]. For the special case of ℓ = 1, it becomes similar to the estimator proposed by [CG17], which is used to construct confidence intervals for linear functionals of θ_0. Note that the proposed debiasing procedure incurs small bias in the infinity norm with respect to the rotated basis U, as opposed to the standard debiasing procedure [JM14a, JM14b, ZZ14, VdGBRD14, JM13], which incurs small bias, in the infinity norm, with respect to the original basis and not necessarily in the rotated basis.

The following assumption ensures that the entries of the noise term Z in the decomposition (13) have non-vanishing variances.

Assumption 2.3.

We have min_{a ∈ [ℓ]} (M Σ̂ Mᵀ)_{aa} ≥ c_min, for some positive constant c_min.

The above assumption involves the decorrelating matrix M, which our proposal constructs via optimization (46). In the following lemma, we provide a sufficient condition for the above assumption to hold.

Lemma 2.4.

Suppose that and , for some constant . Then, Assumption 2.3 holds.

We refer to Appendix A.1 for the proof of Lemma 2.4.

Define the shorthand

(15)

To ease the notation, we hereafter drop the superscript. We next construct a test statistic such that large values of it provide evidence against the null hypothesis. Our test statistic is defined based on an ℓ∞ projection estimator, given by the following optimization problem.

(16)

We then define the test statistic to be the optimal value of (16), i.e.,

(17)

The reason for using the ℓ∞ norm in the projection is that the bias term of the debiased estimator is controlled in the ℓ∞ norm (see Lemma 2.1). The decision rule is then based on the test statistic:

(18)

The above procedure generalizes the debiasing approach of [JM14a]. Specifically, for U = I_{p×p} and Ω = {θ : θ_i = 0}, the test rule becomes the one proposed by [JM14a] for testing hypotheses of the form H_{0,i}: θ_{0,i} = 0 versus the alternative.
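As an illustration of the pipeline up to this point, the sketch below (ours; the function name test_convex_null and its arguments are hypothetical) computes an ℓ∞ projection statistic and compares it to a Bonferroni-type normal quantile. Because the exact scaling in (15)-(18) is not reproduced here, both the scaling and the threshold are assumptions that paraphrase the standard debiasing recipe; the null set is taken to be convex so that the projection is a convex program.

    # Sketch of the test statistic (17) and decision rule (18); scaling and threshold are
    # paraphrased assumptions.  null_constraints maps a cvxpy variable theta to a list of
    # constraints encoding theta in Omega (assumed convex here for tractability).
    import numpy as np
    import cvxpy as cp
    from scipy.stats import norm

    def test_convex_null(u_debiased, U, sigma_hat, n, alpha, null_constraints):
        p, ell = U.shape
        theta = cp.Variable(p)
        # l_inf distance from the (scaled) debiased estimate to the projected null set
        obj = cp.Minimize(cp.norm(np.sqrt(n) / sigma_hat * (u_debiased - U.T @ theta), "inf"))
        T = cp.Problem(obj, null_constraints(theta)).solve()
        threshold = norm.ppf(1 - alpha / (2 * ell))     # union bound over the ell coordinates
        return T, T > threshold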

Remark 2.5.

Using Lemma 2.1, under the null hypothesis, the test statistic is (asymptotically) stochastically dominated by the ℓ∞ norm of a random vector whose entries are dependent and distributed as standard normal. The choice of threshold in (18) comes from using this observation and union bounding to control the (two-sided) tails. Given that Lemma 2.1 also characterizes the dependency structure of the entries of Z, we can pursue another (less conservative) approach to choosing the rejection threshold. As an implication of Lemma 2.1, and since ℓ (the dimension of the subspace) is fixed, we have that for all t,

(19)

Under the null hypothesis, we have, by (19), that the distribution of the test statistic is asymptotically equal to that of the maximum of ℓ dependent standard normal variables, whose distribution can be easily simulated since the covariance of the multivariate gaussian vector is known.
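A minimal sketch (ours) of this simulation-based threshold: draw many samples of a centered gaussian vector with the relevant correlation structure and take the (1 − α) quantile of the maximum absolute entry. The specific choice of covariance below (the correlation matrix obtained by normalizing M Σ̂ Mᵀ to unit diagonal) is our assumption about the known covariance referred to in the remark.

    # Less conservative threshold via simulation of the maximum of dependent normals.
    import numpy as np

    def simulated_threshold(M, Sigma_hat, alpha, reps=100_000, seed=0):
        cov = M @ Sigma_hat @ M.T
        d = np.sqrt(np.diag(cov))
        corr = cov / np.outer(d, d)                  # correlation matrix (assumed covariance of Z)
        rng = np.random.default_rng(seed)
        Z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=reps)
        return np.quantile(np.abs(Z).max(axis=1), 1 - alpha)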

In the next section, we prove that the decision rule (18) controls the type I error below the target level α, provided the basis U is independent of the samples. We also develop a lower bound on the statistical power of the testing rule and use it to guide the choice of the basis U.

3 Main results

3.1 Controlling false positive rate

Definition 3.1.

Consider a given triple where , with and . The generalized coherence parameter of denoted by is given by

(20)

where is the sample covariance of . The minimum generalized coherence of is .

Note that choosing μ to be the minimum generalized coherence parameter, the optimization (46) becomes feasible.

We take a minimax perspective and require that the probability of type I error (false positive) be controlled uniformly over s₀-sparse vectors.

For a testing rule and a set , we define

(21)

Our first result shows the validity of our test for a general set Ω under deterministic designs.

Theorem 3.2.

Consider a sequence of design matrices , with dimensions , satisfying the following assumptions. For each , the sample covariance satisfies compatibility condition for the set , with a constant . Also, assume that for some constant . Also consider a sequence of matrices , with fixed and , such that .

Consider the linear regression (2) and let and be obtained by scaled Lasso, given by (10), with . Construct a debiased estimator as in (11) using , where is the minimum generalized coherence parameter as per Definition 3.1, and suppose that Assumption 2.3 holds. Choose and suppose that . For the test defined in Equation (18), and for any , we have

(22)

We next prove the validity of our test for a general set Ω under random designs.

Theorem 3.3.

Let such that and and . Suppose that has independent sub-gaussian rows, with mean zero and sub-gaussian norm , for some constant .

Let and be obtained by scaled Lasso, given by (10), with , and . Consider an arbitrary , with , that is independent of the samples . Construct a debiased estimator as in (11) with and . In addition, suppose that , for some constant and .

For the test defined in Equation (18), and for any , we have

(23)

We refer to Section 9 for the proofs of Theorems 3.2 and 3.3.

3.2 Statistical power

We next analyze the statistical power of our test. Before proceeding, note that without further assumptions, we cannot achieve any nontrivial power, namely power exceeding α, which is obtained by a rule that randomly rejects the null hypothesis with probability α. Indeed, by choosing θ_0 outside Ω but arbitrarily close to it, one can make the alternative essentially indistinguishable from the null. Taking this point into account, for a set Ω and a point θ, we define the distance

(24)

We will assume that, under the alternative hypothesis, θ_0 is s₀-sparse as well. Define

(25)

The former quantity is the probability of type II error (false negative), and the latter is the statistical power of the test.

Theorem 3.4.

Let be the test defined in Equation (18). Under the conditions of Theorem 3.3, for all :

(26)

where we define as

(27)

Further, for , , and integer , the function is defined as follows:

(28)

The proof of Theorem 3.4 is given in Section 9.3.

Note that for any fixed α and ℓ, the function defined in (28) is continuous and monotone increasing in its argument, i.e., the larger the argument, the higher the power that is achieved. Also, in order to achieve a specific power, our scheme requires the signal strength to exceed a certain level, for some constant that depends on the desired power. In addition, if the signal strength asymptotically dominates this level, the rule achieves asymptotic power one.

It is worth noting that in the case of testing individual parameters (corresponding to U = I_{p×p} and Ω = {θ : θ_i = 0}), we recover the power lower bound established in [JM14a], which, by comparison with the minimax trade-off studied in [JM14b], is optimal up to a constant.

4 Choice of subspace

Before we start this section, let us stress again that, by Theorems 3.2 and 3.3, the proposed testing rule controls the type I error below the desired level α for any choice of U with orthonormal columns and fixed dimension ℓ that is independent of the data. Here, we provide guidelines for choosing U so as to yield high power. To this end, we use the result of Theorem 3.4.

Note that

where we recall that and , for . Hence,

(29)

We propose to choose U by maximizing the right-hand side of (29), which by Theorem 3.4 serves as a lower bound on the power of the test. Nevertheless, the above optimization involves θ_0, which is unknown. To cope with this issue, we use a Lasso estimate of θ_0 via the following procedure:

  1. We randomly split the data into two subsamples, each with sample size n/2. We compute a pilot estimate of θ_0 as the optimizer of the scaled Lasso applied to the first subsample.

  2. We choose by solving the following optimization:

    (30)
  3. We construct the debiased estimator using the data . Specifically, set and let be the solution of the following optimization problems for each :

    (31)

    Define the decorrelating matrix and let be the optimizer of the scaled Lasso applied to . Let

    (32)
  4. Set and . Find the projection as

    (33)
  5. Define the test statistic. The testing rule is given by

    (34)

Note that the data splitting above ensures that the chosen basis U is independent of the data used in the testing step, which is required for our analysis (see Theorems 3.2, 3.3 and 3.4). A schematic sketch of this pipeline appears below.
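The following skeleton (ours) strings the steps together, reusing the illustrative helpers sketched in Section 2 (scaled_lasso, debias_projection) and the test sketched after (18) (test_convex_null); all of these names are hypothetical, and the subspace selection of step 2 is left abstract as a user-supplied choose_subspace function since it depends on Ω.

    # Skeleton of the sample-splitting pipeline in steps 1-5 above (illustrative only).
    import numpy as np

    def split_and_test(X, y, lam, mu, alpha, choose_subspace, null_constraints, seed=0):
        n = X.shape[0]
        idx = np.random.default_rng(seed).permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]

        # Step 1: pilot estimate on the first half (kept independent of the testing data).
        theta_tilde, _ = scaled_lasso(X[i1], y[i1], lam)

        # Step 2: pick the subspace basis U from the pilot estimate only.
        U = choose_subspace(theta_tilde)

        # Steps 3-5: debias and test on the second half.
        theta_hat, sigma_hat = scaled_lasso(X[i2], y[i2], lam)
        u_deb = debias_projection(X[i2], y[i2], theta_hat, U, mu)
        return test_convex_null(u_deb, U, sigma_hat, len(i2), alpha, null_constraints)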

4.1 Convex sets

When the set Ω is convex, step (2) in the above procedure can be greatly simplified. Indeed, in this case we can restrict attention to one-dimensional subspaces (ℓ = 1).

Lemma 4.1.

Define the set of matrices as

(35)

If is convex then there exists a unit norm such that .

The proof of Lemma 4.1 is given in Appendix A.3.

Focusing on ℓ = 1, optimization (30) reduces to the following optimization over unit-norm vectors:

(36)

The function is monotone increasing in and by substituting for , this becomes equivalent to the following problem:

(37)

Given that the objective is linear in each of its arguments and the feasible sets are convex, we can apply Von Neumann's minimax theorem and exchange the order of the minimization and maximization:

(38)

Denote the orthogonal projection of onto by