A Global Homogeneity Test for High-Dimensional Linear Regression


Camille Charbonnier (camille.charbonnier@ensae.org), Nicolas Verzelen (nicolas.verzelen@supagro.inra.fr) and Fanny Villers (fanny.villers@upmc.fr)
Abstract

This paper is motivated by the comparison of genetic networks based on microarray samples. The aim is to test whether the differences observed between two inferred Gaussian graphical models come from real differences or arise from estimation uncertainties. Adopting a neighborhood approach, we consider a two-sample linear regression model with random design and propose a procedure to test whether these two regressions are the same. Relying on multiple testing and variable selection strategies, we develop a testing procedure that applies to high-dimensional settings where the number of covariates is larger than the number of observations of the two samples. Both type I and type II errors are explicitly controlled from a non-asymptotic perspective and the test is proved to be minimax adaptive to the sparsity. The performance of the test is evaluated on simulated data. Moreover, we illustrate how this procedure can be used to compare genetic networks on the Hess et al. breast cancer microarray dataset.

AMS 2000 subject classifications: Primary 62H15; secondary 62P10. Keywords: Gaussian graphical model, two-sample hypothesis testing, high-dimensional statistics, multiple testing, adaptive testing, minimax hypothesis testing, detection boundary.

1 Introduction

The recent flood of high-dimensional data has motivated the development of a vast range of sparse estimators for linear regression, in particular a large variety of derivatives of the Lasso. Although theoretical guarantees have been provided in terms of prediction, estimation and selection performances (among many others, [2009_AS_Bickel, 2009_IEEE_Wainwright, 2008_AS_Meinshausen]), the research effort has only recently turned to the construction of high-dimensional confidence intervals or parametric hypothesis testing schemes [2011_arxiv_Zhang, 2012_arXiv_Buhlmann, Montanari, 2013arXiv1301.7161L, 2013arXiv1309.3489M]. Yet, quantifying the confidence surrounding coefficient estimates and selected covariates is essential in areas of application where these will nourish further targeted investigations.

In this paper we consider the two-sample linear regression model with Gaussian random design:

$$Y^{(1)} = X^{(1)}\beta^{(1)} + \epsilon^{(1)}, \qquad (1)$$
$$Y^{(2)} = X^{(2)}\beta^{(2)} + \epsilon^{(2)}. \qquad (2)$$

In this statistical model, the size-$p$ row vectors $X^{(1)}$ and $X^{(2)}$ follow Gaussian distributions $\mathcal{N}(0,\Sigma^{(1)})$ and $\mathcal{N}(0,\Sigma^{(2)})$ whose covariance matrices remain unknown. The noise components $\epsilon^{(1)}$ and $\epsilon^{(2)}$ are independent from the design variables and follow centered Gaussian distributions with unknown standard deviations $\sigma_1$ and $\sigma_2$. In this formal setting, our objective is to develop a test of the equality of $\beta^{(1)}$ and $\beta^{(2)}$ which remains valid in high dimension.

Suppose that we observe an $n_1$-sample of $(Y^{(1)}, X^{(1)})$ and an $n_2$-sample of $(Y^{(2)}, X^{(2)})$, collected in the response vectors $\mathbf{Y}^{(1)}$, $\mathbf{Y}^{(2)}$ and the design matrices $\mathbf{X}^{(1)}$, $\mathbf{X}^{(2)}$, with $n_1$ and $n_2$ remaining smaller than $p$. Defining the noise vectors $\boldsymbol{\epsilon}^{(1)}$ and $\boldsymbol{\epsilon}^{(2)}$ analogously, we obtain the decompositions $\mathbf{Y}^{(1)} = \mathbf{X}^{(1)}\beta^{(1)} + \boldsymbol{\epsilon}^{(1)}$ and $\mathbf{Y}^{(2)} = \mathbf{X}^{(2)}\beta^{(2)} + \boldsymbol{\epsilon}^{(2)}$. Given these observations, we want to test whether models (1) and (2) are the same, that is

$$H_0:\ \beta^{(1)} = \beta^{(2)}\ \text{and}\ \Sigma^{(1)} = \Sigma^{(2)} \quad \text{against} \quad H_1:\ \beta^{(1)} \neq \beta^{(2)}. \qquad (3)$$

In the null hypothesis, we include the assumption that the population covariances of the covariates are equal ($\Sigma^{(1)} = \Sigma^{(2)}$), while under the alternative hypothesis the population covariances are not required to be the same. This choice of assumptions is primarily motivated by our final objective to derive homogeneity tests for Gaussian graphical models (see below). A discussion of the design hypotheses is deferred to Section 7.

1.1 Connection with two-sample Gaussian graphical model testing

This testing framework is mainly motivated by the validation of differences observed between Gaussian graphical models (modelling regulation networks) inferred from transcriptomic data from two samples [2006_AS_Meinshausen, 2008_B_Friedman, 2011_SC_Chiquet] when looking for new potential drug or knock-out targets [2011_JSFDS_Jeanmougin]. Following the development of univariate differential analysis techniques, there is now a surging demand for the detection of differential regulations between pairs of conditions (treated vs. placebo, diseased vs. healthy, exposed vs. control, …). Given two gene regulation networks inferred from two transcriptomic data samples, it is however difficult to disentangle differences in the estimated networks that are due to estimation errors from real differences in the true underlying networks.

We suggest building upon our two-sample high-dimensional linear regression testing scheme to derive a global test for the equality of high-dimensional Gaussian graphical models inferred under pairs of conditions.

Formally speaking, the global two-sample GGM testing problem is defined as follows. Consider two Gaussian random vectors $Z^{(1)} \sim \mathcal{N}(0, [\Omega^{(1)}]^{-1})$ and $Z^{(2)} \sim \mathcal{N}(0, [\Omega^{(2)}]^{-1})$. The dependency graphs are characterized by the non-zero entries of the precision matrices $\Omega^{(1)}$ and $\Omega^{(2)}$ [1996_Lauritzen]. Given an $n_1$-sample of $Z^{(1)}$ and an $n_2$-sample of $Z^{(2)}$, the objective is to test

$$H_0:\ \Omega^{(1)} = \Omega^{(2)} \quad \text{against} \quad H_1:\ \Omega^{(1)} \neq \Omega^{(2)}, \qquad (4)$$

where $\Omega^{(1)}$ and $\Omega^{(2)}$ are assumed to be sparse (most of their entries are zero). This testing problem is global, as the objective is to assess a statistically significant difference between the two distributions. If the test is rejected, a more ambitious objective is to infer the entries where the precision matrices differ (i.e. $\Omega^{(1)}_{ij} \neq \Omega^{(2)}_{ij}$).

Adopting a neighborhood selection approach [2006_AS_Meinshausen] as recalled in Section 6, high-dimensional GGM estimation can be solved by multiple high-dimensional linear regressions. As such, two-sample GGM testing (4) can be solved via multiple two-sample hypothesis testing as (3) in the usual linear regression framework. This extension of two-sample linear regression tests to GGMs is described in Section 6.
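To make the neighborhood approach concrete, here is a minimal sketch of node-wise Lasso regressions for graph estimation in the spirit of [2006_AS_Meinshausen]: each variable is regressed on all the others, and an edge is kept when the corresponding coefficient is selected. The function name, the fixed regularization level and the symmetrization rule are our own illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_graph(X, alpha=0.1, rule="and"):
    """Estimate a GGM adjacency matrix by regressing each variable
    on all the others with the Lasso (neighborhood selection)."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        fit = Lasso(alpha=alpha).fit(X[:, others], X[:, j])
        adj[j, others] = fit.coef_ != 0
    # "and" rule: keep an edge only if it is selected from both endpoints.
    return adj & adj.T if rule == "and" else adj | adj.T

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
A = neighborhood_graph(X, alpha=0.3)
```

The "and" versus "or" symmetrization mirrors the two rules discussed in the neighborhood selection literature; both give a symmetric adjacency matrix with an empty diagonal.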

1.2 Related work

The literature on high-dimensional two-sample tests is very scarce. In the context of high-dimensional two-sample comparison of means, [1996_SS_Bai, 2008_JMA_Srivastava, 2010_AS_Chen, 2011_NIPS_Lopes] have introduced global tests to compare the means of two high-dimensional Gaussian vectors with unknown variance. Recently, [cai11_covariance_comparison, chen_li_comparison] developed two-sample tests for covariance matrices of two high-dimensional vectors.

In contrast, the one-sample analog of our problem has recently attracted a lot of attention, offering as many theoretical bases for extension to the two-sample problem. In fact, one-sample high-dimensional linear regression tests for the nullity of coefficients can be interpreted as a limit of the two-sample test in which $\beta^{(2)}$ is known to be zero and the second sample size is considered infinite, so that we perfectly know the distribution of the second sample.

There are basically two different objectives in high-dimensional linear regression testing: a local and a global approach. In the local approach, one considers the tests for the nullity of each coefficient, with the purpose of controlling error measures such as the false discovery rate of the resulting multiple testing procedures. In a way, one aims to assess the individual statistical significance of each of the variables. This can be achieved by providing a confidence region for the regression vector [2011_arxiv_Zhang, 2012_arXiv_Buhlmann, Montanari, 2013arXiv1301.7161L, 2013arXiv1309.3489M]. Another line of work derives $p$-values for the nullity of each of the coefficients. Namely, [2009_AS_Wasserman] suggests a screen and clean procedure based upon half-sampling. Model selection is first applied to a random half of the sample; the significance of each selected coefficient is then tested on the remaining half, using the usual combination of ordinary least squares and Student t-tests on a model of reasonable size. To reduce the dependency of the results on the splitting, [2009_JASA_Meinshausen] advocate repeating the half-sampling step many times and then aggregating the $p$-values obtained for each variable in a way which controls either the family-wise error rate or the false discovery rate.
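As an illustration of the single-split screen and clean idea described above, the following sketch selects a model with the Lasso on one half of the sample and computes ordinary least squares t-test $p$-values for the selected coefficients on the other half. All names and tuning choices are ours, and this simplification omits the refinements of [2009_AS_Wasserman].

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def screen_and_clean(X, y, alpha=0.1, seed=0):
    """Single-split screen and clean: Lasso screening on one half of the
    sample, OLS t-tests for the selected coefficients on the other half."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]
    # Screen: variable selection on the first half.
    support = np.flatnonzero(Lasso(alpha=alpha).fit(X[half1], y[half1]).coef_)
    if support.size == 0:
        return support, np.array([])
    # Clean: OLS restricted to the selected model, on the second half.
    Xs, ys = X[half2][:, support], y[half2]
    beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    df = len(ys) - support.size
    sigma2 = ((ys - Xs @ beta) ** 2).sum() / df          # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df)        # two-sided t-tests
    return support, pvals

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
y = 3.0 * X[:, 0] + rng.standard_normal(100)
support, pvals = screen_and_clean(X, y, alpha=0.2)
```

Because selection and testing use disjoint halves, the t-tests on the second half remain valid conditionally on the selected model.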

In the global approach, the objective is to test the null hypothesis that all coefficients are zero. Although global approaches are clearly less informative than approaches providing individual significance tests like [2009_JASA_Meinshausen, 2011_arxiv_Zhang, 2012_arXiv_Buhlmann], they can reach better performances from smaller sample sizes. Such a property is of tremendous importance when dealing with high-dimensional datasets. The idea of [2010_AS_Verzelen], based upon the work of [2003_AS_Baraud], is to approximate the alternative hypothesis by a collection of tractable alternatives restricted to subsets $S$ of reasonable size. The global null hypothesis is rejected if the corresponding low-dimensional null hypothesis is rejected for at least one of the subsets $S$. Admittedly, the resulting procedure is computationally intensive. Nonetheless, it is non-asymptotically minimax adaptive to the unknown sparsity of the regression vector, that is, it achieves the optimal rate of detection without any assumption on the population covariance of the covariates. Another series of works relies on higher criticism. This last testing framework was originally introduced for orthonormal designs [2004_AS_Donoho], but has been proved to reach optimal detection rates in high-dimensional linear regression as well [2011_AS_Arias-Castro, 2010_EJS_Ingster]. In the end, higher criticism is highly competitive in terms of computing time and achieves the asymptotic rate of detection with optimal constants. However, these nice properties require strong assumptions on the design.

While writing this paper, we came across the parallel work of Städler and Mukherjee [2013_ArXiv_Stadler], which adopts a local approach, elegantly adapting the screen and clean procedure, in both its simple-split [2009_AS_Wasserman] and multi-split [2009_JASA_Meinshausen] versions, to the two-sample framework. Interestingly, their work also led to an extension to GGM testing [2014_ArXiv_Stadler].

In contrast we build our testing strategy upon the global approach developed by [2003_AS_Baraud] and [2010_AS_Verzelen]. A more detailed comparison of [2013_ArXiv_Stadler, 2014_ArXiv_Stadler] with our contribution is deferred to simulations (Section 5) and discussion (Section 7).

1.3 Our contribution

Our suggested approach stems from the fundamental assumption that either the true supports of $\beta^{(1)}$ and $\beta^{(2)}$ are sparse or that their difference is sparse, so that the test can be successfully led in a subset of variables of reasonable size compared to the sample sizes $n_1$ and $n_2$. Of course, this low-dimensional subset is unknown. The whole objective of the testing strategy is to achieve similar rates of detection (up to a logarithmic constant) as an oracle test which would know in advance the optimal low-dimensional subset.

Concretely, we proceed in three steps:

  1. We define algorithms to select a data-driven collection of subsets identified as most informative for our testing problem, in an attempt to approximate the unknown optimal subset.

  2. New parametric statistics, related to the likelihood ratio statistic between the conditional distributions of the two samples given the covariates in $S$, are defined for each candidate subset $S$.

  3. We define two calibration procedures which both guarantee a control on type-I error:

    • we use a Bonferroni calibration which is both computationally and conceptually simple, allowing us to prove that this procedure is minimax adaptive to the sparsity of $\beta^{(1)}$ and $\beta^{(2)}$ from a non-asymptotic point of view;

    • we define a calibration procedure based upon permutations to reach a fine tuning of multiple testing calibration in practice, for an increase in empirical power.

The resulting testing procedure is completely data-driven and its type I error is explicitly controlled. Furthermore, it is computationally amenable in a large $p$, small $n$ setting. Interestingly, the procedure does not require any half-sampling steps, which are known to decrease robustness when the sample size is small.

The procedure is described in Section 2, while Section 3 is devoted to technical details, among which theoretical controls on type I error, as well as some useful empirical tools for interpretation. Section 4 provides a non-asymptotic control of the power. Section 5 provides simulated experiments comparing the performance of the suggested procedures with the approach of [2013_ArXiv_Stadler]. In Section 6, we detail the extension of the procedure to handle the comparison of Gaussian graphical models. The method is illustrated on transcriptomic breast cancer data. Finally, all the proofs are postponed to Section 8.

The codes of our algorithms are available at [code].

1.4 Notation

In the sequel, norms are denoted $\|\cdot\|$. For any positive definite matrix $\Sigma$, $\|\cdot\|_{\Sigma}$ denotes the Euclidean norm associated with the scalar product induced by $\Sigma$: for every vector $x$, $\|x\|_{\Sigma}^2 = x^{\top}\Sigma x$. Besides, for every set $S$, $|S|$ denotes its cardinality. For any integer $k$, $I_k$ stands for the identity matrix of size $k$. For any square matrix $A$, $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote respectively the maximum and minimum eigenvalues of $A$. When the context makes it obvious, we may omit to mention $A$ to alleviate notation and use $\lambda_{\max}$ and $\lambda_{\min}$ instead. Moreover, $\mathbf{Y}$ refers to the size-$(n_1+n_2)$ concatenation of $\mathbf{Y}^{(1)}$ and $\mathbf{Y}^{(2)}$ and $\mathbf{X}$ refers to the size-$(n_1+n_2)\times p$ concatenation of $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$. To finish with, $C$ refers to a positive numerical constant that may vary from line to line.

2 Description of the testing strategy

Likelihood ratio statistics used to test hypotheses like (3) in the classical large $n$, small $p$ setting are intractable on high-dimensional datasets, for the mere reason that the maximum likelihood estimator is not even defined under high-dimensional design proportions. Our approach approximates the intractable high-dimensional test by a multiple testing construction, similarly to the strategy developed by [2003_AS_Baraud] to derive statistical tests against non-parametric alternatives, and adapted to one-sample tests for high-dimensional linear regression in [2010_AS_Verzelen].

For any subset $S$ of $\{1,\ldots,p\}$ satisfying $|S| \leq n_1 \wedge n_2$, denote $X_S^{(1)}$ and $X_S^{(2)}$ the restrictions of the random vectors $X^{(1)}$ and $X^{(2)}$ to the covariates indexed by $S$. Their covariance structure is noted $\Sigma_S^{(1)}$ (resp. $\Sigma_S^{(2)}$). Consider the linear regression of $Y^{(1)}$ (resp. $Y^{(2)}$) by $X_S^{(1)}$ (resp. $X_S^{(2)}$) defined by

$$Y^{(1)} = X_S^{(1)}\beta_S^{(1)} + \epsilon_S^{(1)}, \qquad Y^{(2)} = X_S^{(2)}\beta_S^{(2)} + \epsilon_S^{(2)},$$

where the noise variables $\epsilon_S^{(1)}$ and $\epsilon_S^{(2)}$ are independent from $X_S^{(1)}$ and $X_S^{(2)}$ and follow centered Gaussian distributions with new unknown conditional standard deviations $\sigma_{1,S}$ and $\sigma_{2,S}$. We now state the test hypotheses in reduced dimension:

$$H_{0,S}:\ (\beta_S^{(1)}, \sigma_{1,S}, \Sigma_S^{(1)}) = (\beta_S^{(2)}, \sigma_{2,S}, \Sigma_S^{(2)}) \quad \text{against} \quad H_{1,S}:\ (\beta_S^{(1)}, \sigma_{1,S}) \neq (\beta_S^{(2)}, \sigma_{2,S}).$$

Of course, there is no reason in general for $\beta_S^{(1)}$ and $\beta_S^{(2)}$ to coincide with the restrictions of $\beta^{(1)}$ and $\beta^{(2)}$ to $S$, even less in high dimension, since the variables in $S$ can in all likelihood be correlated with covariates outside $S$. Yet, as exhibited by Lemma 2.1, there is still a strong link between the collection of low-dimensional hypotheses $(H_{0,S})_S$ and the global null hypothesis $H_0$.

Lemma 2.1.

The hypothesis $H_0$ implies $H_{0,S}$ for any subset $S$.

Proof.

Under $H_0$, the random vectors $(Y^{(1)}, X^{(1)})$ and $(Y^{(2)}, X^{(2)})$ follow the same distribution. Hence, for any subset $S$, $Y^{(1)}$ conditionally on $X_S^{(1)}$ follows the same distribution as $Y^{(2)}$ conditionally on $X_S^{(2)}$, and $X_S^{(1)}$ and $X_S^{(2)}$ are identically distributed. In other words, $H_{0,S}$ holds. ∎

By contraposition, it suffices to reject at least one of the hypotheses $H_{0,S}$ to reject the global null hypothesis. This fundamental observation motivates our testing procedure. As summarized in Algorithm 1, the idea is to build a well-calibrated multiple testing procedure that considers the testing problems $H_{0,S}$ against $H_{1,S}$ for a collection $\widehat{\mathcal{S}}$ of subsets of $\{1,\ldots,p\}$. Obviously, it would be prohibitive in terms of algorithmic complexity to test $H_{0,S}$ for every $S \subset \{1,\ldots,p\}$, since there would be $2^p$ such sets. As a result, we restrict ourselves to a relatively small collection of hypotheses $(H_{0,S})_{S \in \widehat{\mathcal{S}}}$, where the collection of supports $\widehat{\mathcal{S}}$ is potentially data-driven. If the collection is judiciously selected, then we can manage not to lose too much power compared to the exhaustive search.

We now turn to the description of the three major elements required by our overall strategy (see Algorithm 1):

  1. a well-targeted data-driven collection of models as produced by Algorithm 2;

  2. a parametric statistic to test the hypotheses $H_{0,S}$ for $S \in \widehat{\mathcal{S}}$; in fact, we resort to a combination of three parametric statistics;

  3. a calibration procedure guaranteeing the control on type I error as in Algorithm 3 or 4.

Data: the two samples, the collection choice and the desired level
Step 1 - Choose a collection of low-dimensional models (as e.g. in Algorithm 2)
procedure ModelChoice(,,,,)
     Define the model collection
end procedure
Step 2 - Compute p-values for each test in low dimension
procedure Test(,,,,)
     for each subset in  do
         Compute the three p-values associated to the three parametric statistics
     end for
end procedure
Step 3 - Calibrate decision thresholds as in Algorithms 3 (Bonferroni) or 4 (Permutations)
procedure Calibration(,,,,,)
     for each subset in and each  do
         Define a threshold .
     end for
end procedure
Step 4 - Final Decision
if there is at least one model in the collection for which at least one p-value falls below its threshold then
     Reject the global null hypothesis
end if
Algorithm 1 Overall Adaptive Testing Strategy
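To fix ideas, here is a toy version of the overall strategy of Algorithm 1 with a deterministic collection of subsets and a Bonferroni decision. Crucially, the per-subset statistic below is a classical Chow-type F test that we assume as a simple stand-in; it is not one of the three statistics of Step 2, and all names are ours.

```python
import numpy as np
from scipy.stats import f as fisher

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ beta) ** 2).sum()

def global_homogeneity_test(X1, y1, X2, y2, subsets, level=0.05):
    """Reject the global null if any low-dimensional subset test rejects,
    after a Bonferroni correction over the collection."""
    n1, n2 = len(y1), len(y2)
    pvals = []
    for S in subsets:
        S, k = list(S), len(S)
        r1, r2 = rss(X1[:, S], y1), rss(X2[:, S], y2)
        r_pool = rss(np.vstack([X1[:, S], X2[:, S]]),
                     np.concatenate([y1, y2]))
        # Chow-type statistic: pooled fit vs. separate fits on subset S.
        stat = ((r_pool - r1 - r2) / k) / ((r1 + r2) / (n1 + n2 - 2 * k))
        pvals.append(fisher.sf(stat, k, n1 + n2 - 2 * k))
    return min(pvals) <= level / len(pvals), pvals

rng = np.random.default_rng(2)
X1, X2 = rng.standard_normal((100, 5)), rng.standard_normal((100, 5))
y1 = 2.0 * X1[:, 0] + rng.standard_normal(100)
y2 = -2.0 * X2[:, 0] + rng.standard_normal(100)
reject, pvals = global_homogeneity_test(X1, y1, X2, y2,
                                        [(j,) for j in range(5)])
```

With opposite effects on the first covariate, the singleton containing it yields a tiny $p$-value and the global null is rejected even after the Bonferroni correction.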

2.1 Choices of Test Collections (Step 1)

The first step of our procedure (Algorithm 1) amounts to selecting a collection of subsets of $\{1,\ldots,p\}$, also called models. A good collection of subsets must satisfy a tradeoff between the inclusion of the maximum number of relevant subsets and a reasonable computing time for the whole testing procedure, which is linear in the size of the collection. The construction of the collection proceeds in two steps: (i) one chooses a deterministic collection of models; (ii) one defines an algorithm (called ModelChoice in Algorithm 1) mapping the raw data to some collection $\widehat{\mathcal{S}}$ included in the deterministic collection. Even though the introduction of the deterministic collection as an argument of the mapping could appear artificial at this point, this quantity will be used in the calibration step of the procedure. Our methodology can be applied to any fixed or data-driven collection. Still, we focus here on two particular collections. The first one is useful for undertaking the first steps of the mathematical analysis. For practical applications, we advise using the second collection.

Deterministic Collections.

By deterministic, we mean that the model choice step is trivial, in the sense that ModelChoice returns a fixed collection whatever the data. Among deterministic collections, the most straightforward consist of all size-$k$ subsets of $\{1,\ldots,p\}$. This kind of family provides collections which are independent from the data, thereby reducing the risk of overfitting. However, as we allow the model size $k$ or the total number of candidate variables $p$ to grow, these deterministic families can rapidly reach unreasonable sizes. Admittedly, the collection of singletons always remains feasible, but reducing the search to models of size 1 can be costly in terms of power. As a variation on size-$k$ models, we introduce the collection of all models of size smaller than $k$, which will prove useful in theoretical developments.
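As a quick illustration, the deterministic collection of all subsets of size at most $k$ is immediate to enumerate with the standard library, and its cardinality, growing like $p^k$, explains why it becomes impractical as $k$ or $p$ grows:

```python
from itertools import combinations

def all_subsets_up_to(p, k):
    """All non-empty subsets of {0, ..., p-1} of size at most k."""
    return [S for size in range(1, k + 1)
            for S in combinations(range(p), size)]

models = all_subsets_up_to(10, 2)   # 10 singletons and 45 pairs
```

Already for $p = 10$ and $k = 2$ the collection holds 55 models, and each extra unit of $k$ multiplies its size by roughly $p/k$.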

Lasso-type Collection.

Among all data-driven collections, we suggest a Lasso-type collection. Before proceeding to its definition, let us informally discuss the subsets that a “good” collection should contain. Let $\mathrm{supp}(\beta)$ denote the support of a vector $\beta$. Intuitively, under the alternative hypothesis, good candidates for the subsets are either subsets of $\mathrm{supp}(\beta^{(1)}) \cup \mathrm{supp}(\beta^{(2)})$ or subsets of $\mathrm{supp}(\beta^{(1)} - \beta^{(2)})$. The first subset is convenient because the low-dimensional regression coefficients then coincide with the restrictions of $\beta^{(1)}$ and $\beta^{(2)}$. The second subset has a smaller size than the first and focuses on covariates corresponding to different parameters in the full regression. However, the divergence between effects might only appear conditionally on other variables with similar effects; this is why the first subset is also of interest. Obviously, both subsets are unknown. This is why we consider a Lasso methodology that amounts to estimating both of them within the collection. Details on the construction of the collection are postponed to Section 3.1.

2.2 Parametric Test Statistic (Step 2)

Given a subset , we consider the three following statistics to test against :

(5)
(6)

As explained in Section 3, these three statistics derive from the Kullback-Leibler divergence between the conditional distributions of $Y^{(1)}$ given $X_S^{(1)}$ and of $Y^{(2)}$ given $X_S^{(2)}$. While the first term evaluates the discrepancies in terms of conditional variances, the last two terms address the comparison of $\beta_S^{(1)}$ to $\beta_S^{(2)}$.

Because the null distributions of the statistics depend on the size of $S$, the only way to calibrate the multiple testing step over a collection of models of various sizes is to convert the statistics to a unique common scale. The most natural is to convert the observed statistics into $p$-values. Under $H_{0,S}$, the conditional distributions of the three statistics given the designs are parameter-free and explicit (see Proposition 3.1 in the next section). Consequently, one can define the exact $p$-values associated to the statistics, conditional on the designs. However, the computation of these $p$-values requires a function inversion, which can be computationally prohibitive. This is why we introduce explicit upper bounds (Equations (13,17)) of the exact $p$-values.

2.3 Combining the parametric statistics (Step 3)

The objective of this subsection is to calibrate a multiple testing procedure based on the sequence of $p$-values, so that the type I error remains smaller than a chosen level $\alpha$. In particular, when using a data-driven model collection, we must take good care to prevent the risk of overfitting which results from using the same dataset both for model selection and hypothesis testing.

For the sake of simplicity, we assume in the two following paragraphs that the empty set does not belong to the collection, which merely means that we do not include in the collection of tests the raw comparison of $\mathbf{Y}^{(1)}$ to $\mathbf{Y}^{(2)}$.

Testing Procedure

Given a model collection $\widehat{\mathcal{S}}$ and a sequence of per-model thresholds, we define the test function:

(7)

In other words, the test function rejects the global null if there exists at least one model in the collection such that at least one of the three $p$-values falls below the corresponding threshold. In Section 3.3, we describe two calibration methods for choosing the thresholds. We first define a natural Bonferroni procedure, whose conceptual simplicity allows us to derive non-asymptotic type II error bounds for the corresponding tests (Section 4). However, this Bonferroni correction proves too conservative in practice, partly paying the price for resorting to data-driven collections and to upper bounds on the true $p$-values. This is why we introduce as a second option the permutation calibration procedure. This second procedure controls the type I error at the nominal level and therefore outperforms the Bonferroni calibration in practice. Nevertheless, the mathematical analysis of the corresponding test becomes more intricate, and we are not able to provide sharp type II error bounds.

Remark: In practice, we advocate the use of the Lasso collection (Algorithm 2) combined with the permutation calibration method (Algorithm 4). Henceforth, this combination is the default procedure we refer to in the sequel.
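For intuition on the permutation idea, here is a generic sketch: under the global null the pooled observations are exchangeable, so reshuffling the sample labels provides a null distribution for the minimum $p$-value over the collection, whose lower quantile gives a rejection threshold. This is an assumed, simplified version written with our own names; the exact recipe is given in Algorithm 4.

```python
import numpy as np

def permutation_threshold(min_pval_fn, X, y, n1, level=0.05, B=200, seed=0):
    """Quantile calibration by permutation: under the global null the
    pooled observations are exchangeable, so reshuffling the sample
    labels yields a null distribution for the minimum p-value."""
    rng = np.random.default_rng(seed)
    n = len(y)
    null_minima = []
    for _ in range(B):
        idx = rng.permutation(n)
        i1, i2 = idx[:n1], idx[n1:]
        null_minima.append(min_pval_fn(X[i1], y[i1], X[i2], y[i2]))
    # Reject when the observed minimum p-value falls below this quantile.
    return np.quantile(null_minima, level)

# Toy run with a placeholder statistic; a real run would plug in the
# minimum p-value of Step 2 over the whole model collection.
rng = np.random.default_rng(3)
X, y = rng.standard_normal((80, 3)), rng.standard_normal(80)
stat = lambda X1, y1, X2, y2: abs(float(y1.mean() - y2.mean()))
thr = permutation_threshold(stat, X, y, n1=40, level=0.05, B=100)
```

By construction, rejecting when the observed minimum falls below this data-driven threshold keeps the type I error close to the nominal level, whereas a Bonferroni threshold can be much smaller and hence more conservative.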

3 Discussion of the procedure and Type I error

In this section, we provide the remaining details on the three steps of the testing procedure. First, we describe the Lasso-type collection and provide an informal justification of its definition. Second, we explain the ideas underlying the three parametric statistics and we define the corresponding $p$-values. Finally, the Bonferroni and permutation calibration methods are defined, which allows us to control the type I error of the corresponding testing procedures.

3.1 Collection

We start from the concatenated sample $(\mathbf{Y}, \mathbf{X})$ introduced in the notation section, and we consider the following reparametrized joint regression model.

(8)

In this new model, one block of coefficients captures the mean effect common to the two samples, while the other block captures the discrepancy between each sample-specific effect and this mean effect; in particular, the discrepancy block vanishes under $H_0$. To simplify notation, denote by $\theta$ the concatenation of the mean and discrepancy effects, and by $\mathbf{W}$ the reparametrized design matrix of (8). For a given $\lambda > 0$, the Lasso estimator $\widehat{\theta}_{\lambda}$ of $\theta$ is defined by

(9)
(10)

For a suitable choice of the tuning parameter $\lambda$ and under assumptions on the designs, it is proved [2009_AS_Bickel, 2008_AS_Meinshausen] that $\widehat{\theta}_{\lambda}$ estimates $\theta$ well and that its support is a good estimator of the support of $\theta$. The Lasso parameter tunes the amount of sparsity of the estimator: the larger the parameter $\lambda$, the smaller the support of $\widehat{\theta}_{\lambda}$. As the optimal choice of $\lambda$ is unknown, the collection is built using the collection of all estimators $\widehat{\theta}_{\lambda}$, $\lambda > 0$, also called the Lasso regularization path. Below we provide an algorithm for computing the collection along with some additional justifications.
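A regularization-path computation of candidate supports can be sketched with scikit-learn's LARS implementation. For illustration we run it on a single design, whereas Algorithm 2 works on the reparametrized two-sample design; the function name and size cap are ours.

```python
import numpy as np
from sklearn.linear_model import lars_path

def support_collection(X, y, max_size=10):
    """Collect the distinct supports visited along the Lasso path,
    keeping only those of size at most max_size."""
    # lars_path returns the coefficients at every knot of the path.
    _, _, coefs = lars_path(X, y, method="lasso")
    collection = []
    for j in range(coefs.shape[1]):
        S = frozenset(np.flatnonzero(coefs[:, j]))
        if 0 < len(S) <= max_size and S not in collection:
            collection.append(S)
    return collection

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 30))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(50)
models = support_collection(X, y, max_size=5)
```

Because the path is piecewise constant in $\lambda$, the knots computed by LARS are exactly the points where the support changes, so no grid over $\lambda$ is needed.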

Data ,,,, Collection
Compute the function (defined in (9,10)) using Lars-Lasso Algorithm [lars]
Compute the decreasing sequences of jumps in
, , 
while  do
     
     
     
end while
Algorithm 2 Construction of the Lasso-type Collection

It is known [lars] that the path $\lambda \mapsto \widehat{\theta}_{\lambda}$ is piecewise constant. Consequently, there exists a decreasing sequence of thresholds such that the estimator changes at these thresholds only. The path and the associated sequence of jumps are computed efficiently using the Lars-Lasso Algorithm [lars]. We build two collections of models using the supports of the estimated mean and discrepancy blocks. Following the intuition described above, for a fixed $\lambda$, the first support is an estimator of $\mathrm{supp}(\beta^{(1)}) \cup \mathrm{supp}(\beta^{(2)})$ while the second is an estimator of $\mathrm{supp}(\beta^{(1)} - \beta^{(2)})$. This is why we define

where is the smallest integer such that . In the end, we consider the following data-driven family,

(11)

Recall that the collection of models of size 1 gathers all singletons. Recently, data-driven procedures have been proposed to tune the Lasso and find a parameter $\lambda$ in such a way that $\widehat{\theta}_{\lambda}$ is a good estimator of $\theta$ (see e.g. [2011_arxiv_Sun, 2010_arxiv_Baraud]). We use the whole regularization path instead of this single estimator, because our objective is to find subsets $S$ such that the test statistics are powerful. Consider an example where the difference $\beta^{(1)} - \beta^{(2)}$ contains one large coefficient and many small coefficients. If the sample size is large enough, a well-tuned Lasso estimator will select several variables. In contrast, the best subset (in terms of power of the test) contains only one variable. Using the whole regularization path, we hope to find the best trade-off between sparsity (small size of $S$) and differences between $\beta_S^{(1)}$ and $\beta_S^{(2)}$. This last remark is formalized in Section 4.4. Finally, the size of the collection is generally linear in the number of steps of the regularization path, which keeps the computation reasonable.

3.2 Parametric statistics and -values

3.2.1 Symmetric conditional likelihood

In this subsection, we explain the intuition behind the choice of the parametric statistics defined in Equations (5,6). Let us denote by $\mathcal{L}^{(1)}$ (resp. $\mathcal{L}^{(2)}$) the log-likelihood of the first (resp. second) sample normalized by $n_1$ (resp. $n_2$). Given a subset $S$ of size smaller than $n_1 \wedge n_2$, $(\widehat{\beta}_S^{(1)}, \widehat{\sigma}_{1,S})$ stands for the maximum likelihood estimator in the first sample among vectors whose supports are included in $S$. Similarly, we note $(\widehat{\beta}_S^{(2)}, \widehat{\sigma}_{2,S})$ for the maximum likelihood estimator corresponding to the second sample.

The three statistics appear as the decomposition of a two-sample likelihood ratio, measuring the symmetric adequacy of sample-specific estimators to the opposite sample. To do so, let us define the likelihood ratio in sample 1 between an arbitrary pair of parameters and the corresponding sample-specific maximum likelihood estimator:

With this definition, the ratio measures how far the arbitrary pair is from the sample-specific estimator in terms of likelihood within sample 1. The following symmetrized likelihood statistic can be decomposed into the sum of the three parametric statistics:

(12)

Instead of the three separate statistics, one could use the symmetric likelihood (12) to build a testing procedure. However, we did not manage to obtain an explicit and sharp upper bound of the $p$-values associated to the statistic (12), which makes the resulting procedure either computationally intensive, if one estimates the $p$-values by a Monte-Carlo approach, or less powerful, if one uses a non-sharp upper bound of the $p$-values. In contrast, we explain below how, by considering the three statistics separately, one upper bounds the exact $p$-values sharply.

3.2.2 Definition of the -values

Denote by the non-negative function defined on . Since the restriction of to is a bijection, we note the corresponding reciprocal function.

Proposition 3.1 (Conditional distributions of , and under ).

  1. Let denote a Fisher random variable with degrees of freedom. Then, under the null hypothesis,

  2. Let and be two centered and independent Gaussian vectors with covariance and . Then, under the null hypothesis,

    A symmetric result holds for .

Although the distributions identified in Proposition 3.1 are not all familiar distributions with ready-to-use quantile tables, they all share the advantage of not depending on any unknown quantity, such as the design covariances $\Sigma^{(1)}$ and $\Sigma^{(2)}$, the noise variances $\sigma_1$ and $\sigma_2$, or even the true signals $\beta^{(1)}$ and $\beta^{(2)}$. For each statistic, we note the associated tail function for the conditional probability that the statistic exceeds a given threshold.

By Proposition 3.1, the exact $p$-value associated to the first statistic is easily computed from the distribution function of a Fisher random variable:

(13)

where $\bar{F}_{d_1,d_2}(x)$ denotes the probability that a Fisher random variable with $(d_1, d_2)$ degrees of freedom is larger than $x$.
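Numerically, this tail probability is just the survival function of the Fisher distribution; with scipy it reads as follows, where the degrees of freedom are placeholders for those of Proposition 3.1.

```python
from scipy.stats import f as fisher

def fisher_pvalue(x, d1, d2):
    """Upper tail P(F(d1, d2) >= x) of the Fisher distribution."""
    return fisher.sf(x, d1, d2)

p = fisher_pvalue(2.5, 5, 40)   # e.g. |S| = 5 and 40 residual degrees of freedom
```

Using the survival function rather than `1 - cdf` avoids the loss of precision in the far tail, which matters when combining many small $p$-values.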

Since the conditional distribution of the second statistic given the designs only depends on observable quantities, one could compute an estimation of the $p$-value associated with an observed value by Monte-Carlo simulations. However, this approach is computationally prohibitive for large collections of subsets. This is why we use instead an explicit upper bound based on the Laplace method, as given in the definition below and justified in the proof of Proposition 3.3.

Definition 3.2 (Definition of and ).

Let us note the positive eigenvalues of

For any , define . For any , take

(14)

where is defined as follows. If all the components of are equal, then . If is not a constant vector, then we define

(15)
(16)

is defined analogously by exchanging the role of and .

Proposition 3.3.

For any , and for , .

Finally, we define the approximate $p$-values by

(17)

Although we use similar notations for with , this must not mask the essential difference that is the exact