A Global Homogeneity Test for High-Dimensional Linear Regression
Abstract
This paper is motivated by the comparison of genetic networks inferred from microarray samples. The aim is to test whether the differences observed between two inferred Gaussian graphical models come from real differences or arise from estimation uncertainties. Adopting a neighborhood approach, we consider a two-sample linear regression model with random design and propose a procedure to test whether these two regressions are the same. Relying on multiple testing and variable selection strategies, we develop a testing procedure that applies to high-dimensional settings where the number of covariates is larger than the number of observations in either of the two samples. Both type I and type II errors are explicitly controlled from a nonasymptotic perspective, and the test is proved to be minimax adaptive to the sparsity. The performance of the test is evaluated on simulated data. Moreover, we illustrate how this procedure can be used to compare genetic networks on the breast cancer microarray dataset of Hess et al.
AMS subject classifications: Primary 62H15; secondary 62P10. Keywords: Gaussian graphical model, two-sample hypothesis testing, high-dimensional statistics, multiple testing, adaptive testing, minimax hypothesis testing, detection boundary.
1 Introduction
The recent flood of high-dimensional data has motivated the development of a vast range of sparse estimators for linear regression, in particular a large variety of derivatives of the Lasso. Although theoretical guarantees have been provided in terms of prediction, estimation and selection performance (among many others [2009_AS_Bickel, 2009_IEEE_Wainwright, 2008_AS_Meinshausen]), the research effort has only recently turned to the construction of high-dimensional confidence intervals and parametric hypothesis testing schemes [2011_arxiv_Zhang, 2012_arXiv_Buhlmann, Montanari, 2013arXiv1301.7161L, 2013arXiv1309.3489M]. Yet, quantifying the confidence surrounding coefficient estimates and selected covariates is essential in areas of application where these will nourish further targeted investigations.
In this paper we consider the two-sample linear regression model with Gaussian random design:
(1) $Y^{(1)} = X^{(1)} \beta^{(1)} + \epsilon^{(1)}$,
(2) $Y^{(2)} = X^{(2)} \beta^{(2)} + \epsilon^{(2)}$.
In this statistical model, the size-$p$ row vectors $X^{(1)}$ and $X^{(2)}$ follow centered Gaussian distributions whose covariance matrices $\Sigma^{(1)}$ and $\Sigma^{(2)}$ remain unknown. The noise components $\epsilon^{(1)}$ and $\epsilon^{(2)}$ are independent of the design matrices and follow centered Gaussian distributions with unknown standard deviations $\sigma^{(1)}$ and $\sigma^{(2)}$. In this formal setting, our objective is to develop a test of the equality of $\beta^{(1)}$ and $\beta^{(2)}$ which remains valid in high dimension.
Suppose that we observe an $n_1$-sample of $(Y^{(1)}, X^{(1)})$ and an $n_2$-sample of $(Y^{(2)}, X^{(2)})$, denoted $(\mathbf{Y}^{(1)}, \mathbf{X}^{(1)})$ and $(\mathbf{Y}^{(2)}, \mathbf{X}^{(2)})$, with $n_1$ and $n_2$ possibly remaining smaller than the number $p$ of covariates. Defining $\boldsymbol{\epsilon}^{(1)}$ and $\boldsymbol{\epsilon}^{(2)}$ analogously, we obtain the decompositions $\mathbf{Y}^{(1)} = \mathbf{X}^{(1)}\beta^{(1)} + \boldsymbol{\epsilon}^{(1)}$ and $\mathbf{Y}^{(2)} = \mathbf{X}^{(2)}\beta^{(2)} + \boldsymbol{\epsilon}^{(2)}$. Given these observations, we want to test whether models (1) and (2) are the same, that is
(3) $H_0:\ \beta^{(1)} = \beta^{(2)}$ against $H_1:\ \beta^{(1)} \neq \beta^{(2)}$.
In the null hypothesis, we include the assumption that the population covariances of the covariates are equal ($\Sigma^{(1)} = \Sigma^{(2)}$), while under the alternative hypothesis the population covariances are not required to be the same. This choice of assumptions is primarily motivated by our final objective to derive homogeneity tests for Gaussian graphical models (see below). A discussion of the design hypotheses is deferred to Section 7.
1.1 Connection with two-sample Gaussian graphical model testing
This testing framework is mainly motivated by the validation of differences observed between Gaussian graphical models (modelling regulation networks) inferred from transcriptomic data from two samples [2006_AS_Meinshausen, 2008_B_Friedman, 2011_SC_Chiquet] when looking for new potential drug or knockout targets [2011_JSFDS_Jeanmougin]. Following the development of univariate differential analysis techniques, there is now a surging demand for the detection of differential regulations between pairs of conditions (treated vs. placebo, diseased vs. healthy, exposed vs. control, …). Given two gene regulation networks inferred from two transcriptomic data samples, it is however difficult to disentangle differences in the estimated networks that are due to estimation errors from real differences in the true underlying networks.
We suggest building upon our two-sample high-dimensional linear regression testing scheme to derive a global test for the equality of high-dimensional Gaussian graphical models inferred under pairs of conditions.
Formally speaking, the global two-sample GGM testing problem is defined as follows. Consider two Gaussian random vectors $Z^{(1)}$ and $Z^{(2)}$. The dependency graphs are characterized by the nonzero entries of the precision matrices $\Omega^{(1)}$ and $\Omega^{(2)}$ [1996_Lauritzen]. Given an $n_1$-sample of $Z^{(1)}$ and an $n_2$-sample of $Z^{(2)}$, the objective is to test
(4) $H_0:\ \Omega^{(1)} = \Omega^{(2)}$ against $H_1:\ \Omega^{(1)} \neq \Omega^{(2)}$,
where $\Omega^{(1)}$ and $\Omega^{(2)}$ are assumed to be sparse (most of their entries are zero). This testing problem is global, as the objective is to assess a statistically significant difference between the two distributions. If the null hypothesis is rejected, a more ambitious objective is to infer the entries where the precision matrices differ (i.e. the pairs $(i,j)$ such that $\Omega^{(1)}_{ij} \neq \Omega^{(2)}_{ij}$).
Adopting a neighborhood selection approach [2006_AS_Meinshausen], as recalled in Section 6, high-dimensional GGM estimation can be solved by multiple high-dimensional linear regressions. As such, two-sample GGM testing (4) can be solved via multiple two-sample hypothesis tests of the form (3) in the usual linear regression framework. This extension of two-sample linear regression tests to GGMs is described in Section 6.
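To make the neighborhood idea concrete, here is a minimal sketch: each variable is regressed on all the others with an $\ell_1$ penalty, and its estimated neighborhood is the support of the fitted coefficients. The bare-bones coordinate-descent Lasso below is an illustrative stand-in for the solvers used in practice; all function names and tuning values are ours, not the paper's.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent Lasso:
    minimize (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove variable j's current contribution
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def neighborhood_graph(Z, lam):
    """Neighborhood selection: regress each variable on all the others
    and keep the covariates with nonzero Lasso coefficients."""
    n, p = Z.shape
    neighborhoods = []
    for a in range(p):
        others = [j for j in range(p) if j != a]
        b = lasso_cd(Z[:, others], Z[:, a], lam)
        neighborhoods.append({others[j] for j in range(p - 1) if abs(b[j]) > 1e-8})
    return neighborhoods

# Toy chain graph Z0 -- Z1 -- Z2: node 1 separates nodes 0 and 2.
rng = np.random.default_rng(0)
n = 400
z0 = rng.normal(size=n)
z1 = 0.8 * z0 + 0.6 * rng.normal(size=n)
z2 = 0.8 * z1 + 0.6 * rng.normal(size=n)
Z = np.column_stack([z0, z1, z2])
nbrs = neighborhood_graph(Z, lam=0.1)
```

On this toy chain, the middle node should appear in the neighborhoods of both end nodes, recovering the graph structure from the regressions alone.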
1.2 Related work
The literature on high-dimensional two-sample tests is very light. In the context of high-dimensional two-sample comparison of means, [1996_SS_Bai, 2008_JMA_Srivastava, 2010_AS_Chen, 2011_NIPS_Lopes] introduced global tests to compare the means of two high-dimensional Gaussian vectors with unknown variance. Recently, [cai11_covariance_comparison, chen_li_comparison] developed two-sample tests for the covariance matrices of two high-dimensional vectors.
In contrast, the one-sample analog of our problem has recently attracted a lot of attention, offering as many theoretical bases for extension to the two-sample problem. In fact, one-sample high-dimensional linear regression tests for the nullity of coefficients can be interpreted as a limit of the two-sample test in the case where $\beta^{(2)}$ is known to be zero and the second sample size is considered infinite, so that we perfectly know the distribution of the second sample.
There are basically two different objectives in high-dimensional linear testing: a local and a global approach. In the local approach, one considers the tests for the nullity of each coefficient, with the purpose of controlling error measures such as the false discovery rate of the resulting multiple testing procedure. In a way, one aims to assess the individual statistical significance of each of the variables. This can be achieved by providing a confidence region for the coefficients [2011_arxiv_Zhang, 2012_arXiv_Buhlmann, Montanari, 2013arXiv1301.7161L, 2013arXiv1309.3489M]. Another line of work derives p-values for the nullity of each of the coefficients. Namely, [2009_AS_Wasserman] suggests a screen and clean procedure based upon half-sampling. Model selection is first applied to a random half of the sample; the significance of each coefficient is then tested on the remaining half using the usual combination of ordinary least squares and Student t-tests on a model of reasonable size. To reduce the dependency of the results on the splitting, [2009_JASA_Meinshausen] advocate repeating the half-sampling many times and then aggregating the p-values obtained for each variable in a way which controls either the family-wise error rate or the false discovery rate.
In the global approach, the objective is to test the null hypothesis that the whole coefficient vector is zero, $H_0:\ \beta = 0$.
Although global approaches are clearly less informative than approaches providing individual significance tests like [2009_JASA_Meinshausen, 2011_arxiv_Zhang, 2012_arXiv_Buhlmann], they can reach better performance at smaller sample sizes. Such a property is of tremendous importance when dealing with high-dimensional datasets. The idea of [2010_AS_Verzelen], based upon the work of [2003_AS_Baraud], is to approximate the alternative by a collection of tractable alternatives working on subsets of covariates of reasonable size. The global null hypothesis is rejected if the null hypothesis is rejected for at least one of these subsets. Admittedly, the resulting procedure is computationally intensive. Nonetheless, it is nonasymptotically minimax adaptive to the unknown sparsity of the coefficient vector, that is, it achieves the optimal rate of detection without any assumption on the population covariance of the covariates. Another series of works relies on higher criticism. This last testing framework was originally introduced for orthonormal designs [2004_AS_Donoho], but has been proved to reach optimal detection rates in high-dimensional linear regression as well [2011_AS_AriasCastro, 2010_EJS_Ingster]. In the end, higher criticism is highly competitive in terms of computing time and achieves the asymptotic rate of detection with the optimal constants. However, these nice properties require strong assumptions on the design.
While writing this paper, we came across the parallel work of Städler and Mukherjee [2013_ArXiv_Stadler], which adopts a local approach in an elegant adaptation of the screen and clean procedure, in its simple-split [2009_AS_Wasserman] and multi-split [2009_JASA_Meinshausen] versions, to the two-sample framework. Interestingly, their work also led to an extension to GGM testing [2014_ArXiv_Stadler].
1.3 Our contribution
Our suggested approach stems from the fundamental assumption that either the true supports of $\beta^{(1)}$ and $\beta^{(2)}$ are sparse or that their difference is sparse, so that the test can be successfully led on a subset of variables of reasonable size compared to the sample sizes $n_1$ and $n_2$. Of course, this low-dimensional subset is unknown. The whole objective of the testing strategy is to achieve similar rates of detection (up to a logarithmic constant) as an oracle test which would know in advance the optimal low-dimensional subset $S^*$.
Concretely, we proceed in three steps:

1. We define algorithms to select a data-driven collection of subsets identified as most informative for our testing problem, in an attempt to approximate the optimal subset $S^*$.

2. New parametric statistics, related to the likelihood ratio statistic between the conditional distributions of the two samples, are defined for each subset $S$ in the collection.

3. We define two calibration procedures which both guarantee a control on the type I error:

   - a Bonferroni calibration, which is both computationally and conceptually simple, allowing us to prove that the procedure is minimax adaptive to the sparsity of $\beta^{(1)}$ and $\beta^{(2)}$ from a nonasymptotic point of view;

   - a calibration procedure based upon permutations, which reaches a finer tuning of the multiple testing calibration in practice, for an increase in empirical power.
The resulting testing procedure is completely data-driven and its type I error is explicitly controlled. Furthermore, it is computationally amenable in a large $p$, small $n$ setting. Interestingly, the procedure does not require any half-sampling step, which is known to decrease robustness when the sample size is small.
The procedure is described in Section 2, while Section 3 is devoted to technical details, including theoretical controls on the type I error as well as some useful empirical tools for interpretation. Section 4 provides a nonasymptotic control of the power. Section 5 provides simulated experiments comparing the performance of the suggested procedures with the approach of [2013_ArXiv_Stadler]. In Section 6, we detail the extension of the procedure to handle the comparison of Gaussian graphical models. The method is illustrated on transcriptomic breast cancer data. Finally, all the proofs are postponed to Section 8.
The codes of our algorithms are available at [code].
1.4 Notation
In the sequel, $\ell_q$ norms are denoted $\|\cdot\|_q$, except for the $\ell_2$ norm, which is referred to as $\|\cdot\|$ to alleviate notations. For any positive definite matrix $\Sigma$, $\|\cdot\|_{\Sigma}$ denotes the Euclidean norm associated with the scalar product induced by $\Sigma$: for every vector $x$, $\|x\|_{\Sigma}^2 = x^{\top}\Sigma x$. Besides, for every set $S$, $|S|$ denotes its cardinality. For any integer $k$, $I_k$ stands for the identity matrix of size $k$. For any square matrix $A$, $\varphi_{\max}(A)$ and $\varphi_{\min}(A)$ denote respectively the maximum and minimum eigenvalues of $A$. When the context makes it obvious, we may omit to mention $A$ to alleviate notations and use $\varphi_{\max}$ and $\varphi_{\min}$ instead. Moreover, $\mathbf{Y}$ refers to the size-$(n_1+n_2)$ concatenation of $\mathbf{Y}^{(1)}$ and $\mathbf{Y}^{(2)}$, and $\mathbf{X}$ refers to the size-$(n_1+n_2)\times p$ concatenation of $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$. To finish with, $C$ refers to a positive numerical constant that may vary from line to line.
2 Description of the testing strategy
Likelihood ratio statistics used to test hypotheses like (3) in the classical large $n$, small $p$ setting are intractable on high-dimensional datasets, for the mere reason that the maximum likelihood estimator is not even defined under high-dimensional design proportions. Our approach approximates the intractable high-dimensional test by a multiple testing construction, similarly to the strategy developed by [2003_AS_Baraud] to derive statistical tests against nonparametric alternatives, and adapted to one-sample tests for high-dimensional linear regression in [2010_AS_Verzelen].
For any subset $S$ of $\{1,\dots,p\}$ satisfying $|S| < n_1 \wedge n_2$, denote $X_S^{(1)}$ and $X_S^{(2)}$ the restrictions of the random vectors $X^{(1)}$ and $X^{(2)}$ to the covariates indexed by $S$. Their covariance structure is noted $\Sigma_S^{(1)}$ (resp. $\Sigma_S^{(2)}$). Consider the linear regressions of $Y^{(1)}$ by $X_S^{(1)}$ and of $Y^{(2)}$ by $X_S^{(2)}$, defined by $Y^{(1)} = X_S^{(1)}\beta_S^{(1)} + \epsilon_S^{(1)}$ and $Y^{(2)} = X_S^{(2)}\beta_S^{(2)} + \epsilon_S^{(2)}$,
where the noise variables $\epsilon_S^{(1)}$ and $\epsilon_S^{(2)}$ are independent from $X_S^{(1)}$ and $X_S^{(2)}$ and follow centered Gaussian distributions with new unknown conditional standard deviations $\sigma_S^{(1)}$ and $\sigma_S^{(2)}$. We now state the test hypotheses in reduced dimension:
Of course, there is no reason in general for $\beta_S^{(1)}$ and $\beta_S^{(2)}$ to coincide with the restrictions of $\beta^{(1)}$ and $\beta^{(2)}$ to $S$, even less in high dimension, since the variables in $S$ can in all likelihood be correlated with covariates outside $S$. Yet, as exhibited by Lemma 2.1, there is still a strong link between the collection of low-dimensional hypotheses $H_{0,S}$ and the global null hypothesis $H_0$.
Lemma 2.1.
The hypothesis $H_0$ implies $H_{0,S}$ for any subset $S$.
Proof.
Under $H_0$, the random vectors $(Y^{(1)}, X^{(1)})$ and $(Y^{(2)}, X^{(2)})$ of size $p+1$ follow the same distribution. Hence, for any subset $S$, $Y^{(1)}$ follows, conditionally on $X_S^{(1)}$, the same distribution as $Y^{(2)}$ conditionally on $X_S^{(2)}$. In other words, $H_{0,S}$ holds. ∎
By contraposition, it suffices to reject at least one of the hypotheses to reject the global null hypothesis. This fundamental observation motivates our testing procedure.
As summarized in Algorithm 1, the idea is to build a well-calibrated multiple testing procedure that considers the testing problems $H_{0,S}$ against $H_{1,S}$ for a collection of subsets $S$.
Obviously, it would be prohibitive in terms of algorithmic complexity to test $H_{0,S}$ for every subset $S$ of $\{1,\dots,p\}$, since there would be $2^p$ such sets. As a result, we restrict ourselves to a relatively small collection of hypotheses, where the collection of supports is potentially data-driven. If the collection is judiciously selected, then we can manage not to lose too much power compared to the exhaustive search.
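The aggregation logic implied by the contraposition can be sketched in a few lines. The actual procedure (Section 3.3) uses per-model thresholds; the uniform Bonferroni split below is a simplified, illustrative stand-in:

```python
def reject_global_null(pvalues, alpha=0.05):
    """Reject the global H0 as soon as one per-subset p-value falls
    below a uniform Bonferroni share of the level alpha."""
    m = len(pvalues)
    return any(p <= alpha / m for p in pvalues)
```

For instance, with two subsets and level 0.05, a p-value of 0.001 on either subset triggers a global rejection, while p-values of 0.4 and 0.5 do not.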
We now turn to the description of the three major elements required by our overall strategy (see Algorithm 1):

1. a well-targeted, data-driven collection of models, as produced by Algorithm 2;

2. a parametric statistic to test the hypotheses $H_{0,S}$ for each $S$ in the collection; we actually resort to a combination of three parametric statistics;

3. a calibration procedure combining the resulting p-values while controlling the type I error.
2.1 Choices of Test Collections (Step 1)
The first step of our procedure (Algorithm 1) amounts to selecting a collection of subsets of $\{1,\dots,p\}$, also called models. A good collection of subsets must satisfy a trade-off between the inclusion of the maximum number of relevant subsets and a reasonable computing time for the whole testing procedure, which is linear in the size of the collection. The construction proceeds in two steps: (i) one chooses a deterministic collection of models; (ii) one defines an algorithm (called ModelChoice in Algorithm 1) mapping the raw data to some collection included in the deterministic one. Even though the introduction of the deterministic collection as an argument of the mapping could appear artificial at this point, this quantity will be used in the calibration step of the procedure. Our methodology can be applied to any fixed or data-driven collection. Still, we focus here on two particular collections. The first one is useful for undertaking the first steps of the mathematical analysis. For practical applications, we advise using the second collection.
Deterministic Collections.
By deterministic, we mean that the model choice step is trivial, in the sense that ModelChoice is the identity. Among deterministic collections, the most straightforward consist of all subsets of $\{1,\dots,p\}$ of a given size $k$. This kind of family provides collections which are independent from the data, thereby reducing the risk of overfitting. However, as we allow the model size $k$ or the total number of candidate variables $p$ to grow, these deterministic families can rapidly reach unreasonable sizes. Admittedly, the collection of size-1 models always remains feasible, but reducing the search to models of size 1 can be costly in terms of power. As a variation on fixed-size models, we introduce the collection of all models of size smaller than $k$, which will prove useful in theoretical developments.
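As a quick illustration of why these families blow up, the sketch below enumerates all nonempty subsets up to a size cap; the function name is ours, not the paper's:

```python
from itertools import combinations
from math import comb

def deterministic_collection(p, kmax):
    """All nonempty subsets of {0, ..., p-1} of size at most kmax."""
    return [set(c) for k in range(1, kmax + 1) for c in combinations(range(p), k)]

# The collection size is sum over k of C(p, k): already 15 subsets for
# p = 5 and kmax = 2, and 500500 for p = 1000 and kmax = 2.
coll = deterministic_collection(5, 2)
```

This combinatorial growth is exactly why the paper restricts either the model size or, via the data-driven step, the collection itself.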
Lasso-type Collection.
Among all data-driven collections, we suggest the Lasso-type collection. Before proceeding to its definition, let us informally discuss the subsets that a “good” collection should contain. Let $\mathrm{supp}(\beta)$ denote the support of a vector $\beta$. Intuitively, under the alternative hypothesis, good candidates for the subsets are either subsets of $\mathrm{supp}(\beta^{(1)})\cup\mathrm{supp}(\beta^{(2)})$ or subsets of $\mathrm{supp}(\beta^{(1)}-\beta^{(2)})$. The first model nicely contains all the relevant covariates of both regressions. The second subset has a smaller size and focuses on the covariates corresponding to different parameters in the full regression. However, the divergence between effects might only appear conditionally on other variables with similar effects, which is why the first subset is also of interest. Obviously, both subsets are unknown. This is why we consider a Lasso methodology that amounts to estimating both of them within the collection. Details on its construction are postponed to Section 3.1.
2.2 Parametric Test Statistic (Step 2)
Given a subset $S$, we consider the three following statistics to test $H_{0,S}$ against $H_{1,S}$:
(5)  
(6) 
As explained in Section 3, these three statistics derive from the Kullback–Leibler divergence between the conditional distributions of the two samples. While the first term evaluates the discrepancies in terms of conditional variances, the last two terms address the comparison of $\beta_S^{(1)}$ to $\beta_S^{(2)}$.
Because the null distributions of the three statistics depend on the size of $S$, the only way to calibrate the multiple testing step over a collection of models of various sizes is to convert the statistics to a unique common scale. The most natural choice is to convert the observed statistics into p-values. Under $H_{0,S}$, the conditional distributions of the three statistics are parameter-free and explicit (see Proposition 3.1 in the next section). Consequently, one can define the exact p-values associated with the statistics, conditional on the design. However, the computation of these p-values requires a function inversion, which can be computationally prohibitive. This is why we introduce explicit upper bounds (Equations (13) and (17)) of the exact p-values.
2.3 Combining the parametric statistics (Step 3)
The objective of this subsection is to calibrate a multiple testing procedure based on the sequence of p-values, so that the type I error remains smaller than a chosen level $\alpha$. In particular, when using a data-driven model collection, we must take good care to prevent the risk of overfitting which results from using the same dataset both for model selection and hypothesis testing.
For the sake of simplicity, we assume in the two following paragraphs that the empty model does not belong to the collection, which merely means that we do not include in the collection of tests the raw comparison of $\mathbf{Y}^{(1)}$ to $\mathbf{Y}^{(2)}$.
Testing Procedure
Given a model collection and a sequence of thresholds $(\alpha_S)$, we define the test function:
(7) 
In other words, the test function rejects the global null if there exists at least one model $S$ such that at least one of the three p-values is below the corresponding threshold $\alpha_S$. In Section 3.3, we describe two calibration methods for choosing the thresholds. We first define a natural Bonferroni procedure, whose conceptual simplicity allows us to derive nonasymptotic type II error bounds for the corresponding tests (Section 4). However, this Bonferroni correction proves too conservative in practice, in part paying the price for resorting to data-driven collections and upper bounds on the true p-values. This is why we introduce, as a second option, the permutation calibration procedure. This second procedure controls the type I error at the nominal level and therefore outperforms the Bonferroni calibration in practice. Nevertheless, the mathematical analysis of the corresponding test becomes more intricate, and we are not able to provide sharp type II error bounds.
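A permutation calibration can be sketched as follows. Under the null, both samples share the same distribution, so relabeling observations leaves the joint distribution unchanged; the sketch below (names and defaults are ours, and the statistic is a generic p-value-like quantity, not the paper's) estimates the rejection threshold from the permutation distribution:

```python
import numpy as np

def permutation_threshold(stat_fn, data1, data2, alpha, n_perm=200, seed=0):
    """Permutation calibration: under H0 the two samples share one
    distribution, so relabeling rows simulates the null distribution
    of a min-p style statistic. Reject when the observed statistic
    falls below the returned alpha-quantile."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([data1, data2])
    n1 = len(data1)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null_stats.append(stat_fn(pooled[idx[:n1]], pooled[idx[n1:]]))
    return float(np.quantile(null_stats, alpha))

rng = np.random.default_rng(1)
d1 = rng.normal(size=(30, 2))
d2 = rng.normal(size=(30, 2))
# a p-value-like statistic: close to 1 when the groups look alike
stat = lambda a, b: 1.0 / (1.0 + abs(a.mean() - b.mean()))
thr = permutation_threshold(stat, d1, d2, alpha=0.05, n_perm=100)
```

Because the threshold is read off the permutation distribution itself, the resulting test holds the type I error at the nominal level without the slack of a worst-case union bound.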
3 Discussion of the procedure and Type I error
In this section, we provide the remaining details on the three steps of the testing procedure. First, we describe the Lasso-type collection and provide an informal justification of its definition. Second, we explain the ideas underlying the three parametric statistics and we define the corresponding p-values. Finally, the Bonferroni and permutation calibration methods are defined, which allows us to control the type I error of the corresponding testing procedures.
3.1 Collection
We start from models (1) and (2), which we stack into a single regression problem, and we consider the following reparametrized joint regression model.
(8) 
In this new model, $\theta$ captures the mean effect $(\beta^{(1)}+\beta^{(2)})/2$, while $\delta$ captures the discrepancy between the sample-specific effects and the mean effect, that is to say $\delta = (\beta^{(1)}-\beta^{(2)})/2$. Consequently, $\beta^{(1)} = \theta + \delta$ and $\beta^{(2)} = \theta - \delta$. To simplify notations, denote by $(\theta,\delta)$ the concatenation of the two parameter vectors, and by $\mathbf{W}$ the reparametrized design matrix of (8). For a given tuning parameter $\lambda > 0$, the Lasso estimator $(\hat\theta_\lambda,\hat\delta_\lambda)$ is defined by
(9)  
(10) 
For a suitable choice of the tuning parameter $\lambda$ and under assumptions on the designs, it is proved [2009_AS_Bickel, 2008_AS_Meinshausen] that the Lasso estimates the true parameter well and that its support is a good estimator of the true support. The Lasso parameter $\lambda$ tunes the amount of sparsity of the estimator: the larger $\lambda$, the smaller the support. As the optimal choice of $\lambda$ is unknown, the collection is built using the whole collection of estimators $(\hat\theta_\lambda,\hat\delta_\lambda)_{\lambda>0}$, also called the Lasso regularization path. Below we provide an algorithm for computing the collection along with some additional justifications.
It is known [lars] that the function $\lambda \mapsto (\hat\theta_\lambda,\hat\delta_\lambda)$ is piecewise constant. Consequently, there exists a finite sequence of thresholds such that the estimator changes at these thresholds only. The function and the thresholds are computed efficiently using the LARS-Lasso algorithm [lars]. We build two collections of models using the supports of $\hat\theta_\lambda$ and $\hat\delta_\lambda$. Following the intuition described above, for a fixed $\lambda$, $\mathrm{supp}(\hat\theta_\lambda)\cup\mathrm{supp}(\hat\delta_\lambda)$ is an estimator of $\mathrm{supp}(\beta^{(1)})\cup\mathrm{supp}(\beta^{(2)})$, while $\mathrm{supp}(\hat\delta_\lambda)$ is an estimator of $\mathrm{supp}(\beta^{(1)}-\beta^{(2)})$. This is why we define
where is the smallest integer such that . In the end, we consider the following datadriven family,
(11) 
Recall that the collection above also contains all models of size 1. Recently, data-driven procedures have been proposed to tune the Lasso and find a parameter $\lambda$ in such a way that the corresponding estimator is a good estimator of the true parameter (see e.g. [2011_arxiv_Sun, 2010_arxiv_Baraud]). We use the whole regularization path instead of a single estimator, because our objective is to find subsets $S$ such that the test statistics are powerful. Consider an example where $\beta^{(2)} = 0$ and $\beta^{(1)}$ contains one large coefficient and many small coefficients. If the sample size is large enough, a well-tuned Lasso estimator will select several variables. In contrast, the best subset (in terms of power of the test) contains only one variable. Using the whole regularization path, we hope to find the best trade-off between sparsity (small size of $S$) and differences between $\beta_S^{(1)}$ and $\beta_S^{(2)}$. This last remark is formalized in Section 4.4. Finally, the size of the collection is generally linear in $p$, which makes the computation reasonable.
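One plausible reading of this construction can be sketched as follows: walk the regularization path, collect both the joint support of $(\theta, \delta)$ and the support of $\delta$ alone, truncate at a maximal model size, and always add the singletons. All names here are ours and the support rules are our reading of the text:

```python
import numpy as np

def model_collection(path, p, dmax):
    """Build candidate subsets from a Lasso path in the reparametrized
    model: each coefficient vector stacks theta (mean effect) and
    delta (half the discrepancy), both of length p."""
    collection = set()
    for coef in path:
        theta, delta = coef[:p], coef[p:]
        s_joint = frozenset(np.flatnonzero(np.abs(theta) + np.abs(delta) > 1e-10))
        s_delta = frozenset(np.flatnonzero(np.abs(delta) > 1e-10))
        for s in (s_joint, s_delta):
            if 0 < len(s) <= dmax:
                collection.add(s)
    # always keep the size-1 models, as in (11)
    collection |= {frozenset({j}) for j in range(p)}
    return collection

# two path points for p = 3: (theta, delta) stacked as length-6 vectors
p = 3
path = [np.array([1.0, 0.0, 0.0, 0.0, 2.0, 0.0]),
        np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])]
coll = model_collection(path, p, dmax=2)
```

Since distinct path points often share supports, the final collection is typically far smaller than the number of path points times two, which keeps the multiple testing step cheap.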
3.2 Parametric statistics and p-values
3.2.1 Symmetric conditional likelihood
In this subsection, we explain the intuition behind the choice of the parametric statistics defined in Equations (5) and (6). Let us denote the likelihood of the first (resp. second) sample, normalized by $n_1$ (resp. $n_2$). Given a subset $S$ of size smaller than $n_1 \wedge n_2$, $\hat\beta_S^{(1)}$ stands for the maximum likelihood estimator of the regression coefficient of the first sample among vectors whose support is included in $S$. Similarly, we note $\hat\beta_S^{(2)}$ for the maximum likelihood estimator corresponding to the second sample.
The three statistics appear as the decomposition of a two-sample likelihood ratio, measuring the symmetrical adequacy of sample-specific estimators to the opposite sample. To do so, let us define the likelihood ratio in sample 1 between an arbitrary pair of parameters and the corresponding sample-specific estimator:
With this definition, the ratio measures how far the arbitrary pair is from the sample-specific estimator in terms of likelihood within sample 1. The following symmetrized likelihood statistic can be decomposed into the sum of the three statistics:
(12) 
Instead of the three statistics, one could use the symmetric likelihood (12) to build a testing procedure. However, we did not manage to obtain an explicit and sharp upper bound of the p-values associated with the statistic (12), which makes the resulting procedure either computationally intensive, if one estimates the p-values by a Monte Carlo approach, or less powerful, if one uses a non-sharp upper bound of the p-values. In contrast, we explain below how, by considering the three statistics separately, one can sharply upper bound the exact p-values.
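For completeness, here is what the Monte Carlo alternative mentioned above looks like in its simplest generic form; it is not the paper's procedure, and the add-one correction is a standard way to keep the estimated p-value strictly positive:

```python
import numpy as np

def monte_carlo_pvalue(observed, null_sampler, n_sim=1000, seed=0):
    """Estimate P(T >= observed) under the null by simulation.
    `null_sampler(rng)` draws one realization of T under H0; the
    add-one correction keeps the estimate strictly positive."""
    rng = np.random.default_rng(seed)
    draws = np.array([null_sampler(rng) for _ in range(n_sim)])
    return (1 + int(np.sum(draws >= observed))) / (n_sim + 1)

# sanity check on a statistic with a standard normal null distribution
p = monte_carlo_pvalue(0.0, lambda rng: rng.normal(), n_sim=2000)
```

The cost is one full simulation per subset in the collection, which is exactly why the paper prefers closed-form upper bounds.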
3.2.2 Definition of the p-values
Denote by the nonnegative function defined on . Since the restriction of to is a bijection, we note the corresponding reciprocal function.
Proposition 3.1 (Conditional distributions of the three statistics under the null).

Let denote a Fisher random variable with degrees of freedom. Then, under the null hypothesis,

Let and be two centered and independent Gaussian vectors with covariance and . Then, under the null hypothesis,
A symmetric result holds for .
Although the distributions identified in Proposition 3.1 are not all familiar distributions with ready-to-use quantile tables, they all share the advantage that they do not depend on any unknown quantity, such as the design covariances, the noise variances, or the true signals $\beta^{(1)}$ and $\beta^{(2)}$. For any observed value, we can therefore define the conditional probability that the statistic is larger than this value.
By Proposition 3.1, the exact p-value associated with the first statistic is easily computed from the distribution function of a Fisher random variable:
(13) 
where $\bar{F}_{d_1,d_2}(x)$ denotes the probability that a Fisher random variable with $(d_1, d_2)$ degrees of freedom is larger than $x$.
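Computing such a Fisher upper-tail probability is a one-liner with SciPy's F distribution; this is a generic helper, not code from the paper:

```python
from scipy.stats import f

def fisher_upper_tail(stat, d1, d2):
    """Upper-tail probability of an F(d1, d2) distribution, i.e. the
    exact p-value for a statistic that is F-distributed under the null."""
    return float(f.sf(stat, d1, d2))
```

By symmetry of the F(d, d) distribution under inversion, the upper-tail probability at 1 is exactly one half, which gives a quick correctness check.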
Since the conditional distribution of the second statistic given the design only depends on observable quantities, one could compute an estimation of the p-value associated with an observed value by Monte Carlo simulations. However, this approach is computationally prohibitive for large collections of subsets. This is why we use instead an explicit upper bound based on the Laplace method, as given in the definition below and justified in the proof of Proposition 3.3.
Definition 3.2 (Definition of and ).
Let us note the positive eigenvalues of
For any , define . For any , take
(14) 
where is defined as follows. If all the components of are equal, then . If is not a constant vector, then we define
(15)  
(16) 
is defined analogously by exchanging the role of and .
Proposition 3.3.
For any , and for , .
Finally, we define the approximate p-values by
(17) 
Although we use similar notations for the three p-values, this must not mask the essential difference that the first one is the exact