Two-Sample Testing in High-Dimensional Models
Abstract
We propose novel methodology for testing equality of model parameters between two high-dimensional populations. The technique is very general and applicable to a wide range of models. The method is based on sample splitting: the data is split into two parts; on the first part we reduce the dimensionality of the model to a manageable size; on the second part we perform significance testing (p-value calculation) based on a restricted likelihood ratio statistic. Assuming that both populations arise from the same distribution, we show that the restricted likelihood ratio statistic is asymptotically distributed as a weighted sum of chi-squares with weights which can be efficiently estimated from the data. In high-dimensional problems, a single data split can result in a "p-value lottery". To ameliorate this effect, we iterate the splitting process and aggregate the resulting p-values. This multi-split approach provides improved p-values. We illustrate the use of our general approach in two-sample comparisons of high-dimensional regression models ("differential regression") and graphical models ("differential network"). In both cases we show results on simulated data as well as real data from recent, high-throughput cancer studies.
Keywords: High-dimensional two-sample testing; Data splitting; $\ell_1$-regularization; Non-nested hypotheses; High-dimensional regression; Gaussian graphical models; Differential regression; Differential network
1 Introduction and Motivation
We consider the general two-sample testing problem where the goal is to test whether or not two independent populations $P_1$ and $P_2$, parameterized by $\theta_1$ and $\theta_2$ respectively, with $\theta_1, \theta_2 \in \mathbb{R}^p$, arise from the same distribution. The hypothesis testing problem of interest is

$$H_0:\ \theta_1 = \theta_2 \qquad \text{versus} \qquad H_A:\ \theta_1 \neq \theta_2. \tag{1.1}$$

In this paper the focus is on the high-dimensional setting where the numbers of available samples per population ($n_1$ and $n_2$) are small compared to the dimensionality $p$ of the parameter space.
In a setup where $n_1$ and $n_2$ are much larger than $p$, the ordinary likelihood-ratio test offers a very general solution to problem (1.1). It is well known that under $H_0$ the likelihood ratio statistic is asymptotically $\chi^2$-distributed, which then allows computation of confidence intervals and p-values. However, in settings where $p$ is large compared to the sample sizes, the likelihood-ratio statistic is ill-behaved and the classical asymptotic setup cannot be relied upon to test statistical significance.
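For concreteness, the classical statement can be sketched as follows (the notation $\ell_n$ for the log-likelihood and $\hat\theta$ for the pooled maximum likelihood estimator is introduced here for illustration and is not taken from the original text):

```latex
% Ordinary two-sample likelihood ratio test (p fixed, n_1, n_2 -> infinity):
T_{\mathrm{LR}} \;=\; 2\Bigl[\,\ell_{n_1}(\hat{\theta}_1) + \ell_{n_2}(\hat{\theta}_2)
      - \ell_{n_1+n_2}(\hat{\theta})\,\Bigr]
\;\xrightarrow{\;d\;}\; \chi^2_{p}
\qquad \text{under } H_0:\ \theta_1 = \theta_2,
```

since the unrestricted model has $2p$ free parameters while the null hypothesis imposes $p$ constraints.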
Our approach for solving the general high-dimensional two-sample problem (1.1) is motivated by the screen and clean procedure (Wasserman and Roeder, 2009), originally developed for improved variable selection in the high-dimensional regression model. The idea is to split the data from both populations into two parts; perform dimensionality reduction on one part and pursue significance testing on the other part of the data. In more detail, on the first split we select three subsets of $\{1, \ldots, p\}$ by screening for the most relevant parameters of populations $P_1$ and $P_2$ individually, but also by selecting the important parameters of the pooled data from both populations. In this way we obtain an individual model with different parameter spaces for $P_1$ and $P_2$ and a joint model which describes both populations by a common parameter space. On the second split of the data, we can then compute the likelihood ratio statistic between these two models, the so-called restricted likelihood ratio statistic. A crucial observation is that the two models are non-nested and therefore non-standard tools are needed to obtain the asymptotic null distribution. We apply the theory on model selection and non-nested hypotheses developed by Vuong (1989) to the two-sample scenario and show that the null distribution asymptotically approaches a weighted sum of independent chi-squared random variables, with weights which can be efficiently estimated from the second split of the data. Importantly, the weighted sum of chi-squares approximation is invoked only in the second split, where the model dimensionality has been reduced.
As indicated above, our approach involves a screening or model selection step prior to significance testing. Our method requires that the parameter set selected by a specific screening procedure contains the true model parameter (screening property) and that the selected set is small compared to the sample size (sparsity property). The first property is needed for deriving the asymptotic null distribution, whereas the latter property justifies the asymptotic approximation. The two conditions which we impose here are much weaker than requiring consistency in variable selection. We propose to use $\ell_1$-penalized maximum likelihood estimation (Tibshirani, 1996; Fan and Li, 2001) with the regularization parameter chosen by cross-validation. This automatically sets non-relevant parameter components to zero and allows for efficient screening even in scenarios with very large $p$. $\ell_1$ penalization leads to sparse estimates and there is theoretical evidence that the screening property holds under mild additional assumptions.
In current applied statistics, a wide range of applications are faced with high-dimensional data. Parameter estimation in the "large $p$, small $n$" setting has been extensively studied in theory (Bühlmann and van de Geer, 2011) and also applied with success in many areas. Since interpretation of parameters is crucial in applied science, there is a clear need to assess uncertainty and statistical significance in such settings. However, significance testing in high-dimensional settings has only recently attracted attention. Meinshausen et al. (2009) and Bühlmann (2012) consider testing in the high-dimensional linear model. In the context of high-dimensional two-sample comparison, Bai and Saranadasa (1996), Chen and Qin (2010), Lopes et al. (2012) and others treat testing for differences in the population means. And recently, Cai et al. (2011) and Li and Chen (2012) developed two-sample tests for covariance matrices of two high-dimensional populations.
The approach proposed in this paper tackles high-dimensional two-sample testing in a very general setting. In our methodology $\theta_1$ and $\theta_2$ can parameterize any model of interest. In the empirical examples we show below, we focus on two specific applications of our approach: (i) high-dimensional regression, where the two populations may differ with respect to the regression models, and (ii) graphical modelling, where the two populations may differ with respect to conditional independence structure. In analogy to the term "differential expression", as widely used for testing means in gene expression studies, we call these "differential regression" and "differential network" respectively. Both high-dimensional regression and graphical models are now widely used in biological applications, and very often scientific interest focuses on potential differences between populations (such as disease types, cell types, environmental conditions etc.). However, to date, two-sample testing concerning such models in the high-dimensional setting has not been well studied. The methodology we propose offers a way to directly test hypotheses concerning differences in molecular influences or biological network structure using high-throughput data.
The organization of the paper is as follows: Section 2 introduces the setup and the methodology: Section 2.1 explains variable screening using $\ell_1$ regularization; Section 2.2 introduces the restricted likelihood-ratio statistic; Sections 2.3 and 2.4 derive its asymptotic null distribution and Section 2.5 shows how to compute p-values using the single- and multi-split algorithms. In Section 3 we present and give details on the two specific examples of our approach that we outlined above (differential regression and differential network). Finally, in Section 4, we evaluate our methodology on simulated and real data from two recent high-throughput studies in cancer biology.
2 Data Splitting, Screening and Non-Nested Hypotheses
Consider a conditional model class given by densities

$$\mathcal{F} = \{\, f(y \mid x; \theta)\ :\ \theta \in \mathbb{R}^p \,\}. \tag{2.2}$$

Note that $x$ can be empty (i.e., $f(y \mid x; \theta) = f(y; \theta)$). Let the $x$'s of populations $P_1$ and $P_2$ both be generated from the same distribution and, conditional on the $x$'s, let the responses of $P_1$ and $P_2$ be generated from $f(\cdot \mid x; \theta_1)$ and $f(\cdot \mid x; \theta_2)$ respectively.
The goal is to solve the general two-sample testing problem (1.1) given data matrices $D_1$ and $D_2$, representing i.i.d. random samples of $P_1$ and $P_2$. We consider the high-dimensional case with $p \gg n_1, n_2$ and assume that the data-generating parameters $\theta_1$ and $\theta_2$ are sparse, meaning that many of their components are equal to zero. If $p = 1$, we have a classical (univariate) two-sample setup. If $f$ is the univariate Normal distribution with mean $x^\top \beta$ and noise variance $\sigma^2$, so that $\theta = (\beta, \sigma^2)$, then $P_1$ and $P_2$ follow linear regression models. If $x$ is empty and $f$ is the multivariate Normal distribution with $\theta$ representing the inverse covariance matrix, then $P_1$ and $P_2$ follow a Gaussian graphical model and (1.1) asks whether or not two graphical models (in short: networks) are significantly different. We will treat these two examples in detail in Sections 3 and 4. We refer to the regression case as differential regression and to the graphical model case as differential network.
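For concreteness, the two running examples can be sketched in formulas (the symbols $\beta_k$, $\sigma_k^2$, $\Sigma_k$, $\Omega_k$ are notational choices made here for illustration, not taken from the original text):

```latex
% Differential regression (k = 1, 2 indexes the populations):
y^{(k)} = x^\top \beta_k + \varepsilon^{(k)}, \quad
\varepsilon^{(k)} \sim N(0, \sigma_k^2), \qquad
H_0:\ (\beta_1, \sigma_1^2) = (\beta_2, \sigma_2^2).

% Differential network:
x^{(k)} \sim N_p(0, \Sigma_k), \quad \Omega_k = \Sigma_k^{-1}, \qquad
H_0:\ \Omega_1 = \Omega_2,
```

where the zero pattern of $\Omega_k$ encodes the conditional independence graph of population $k$.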
Our methodology for solving (1.1) is based on sample splitting and has its inspiration from the work by Wasserman and Roeder (2009). We randomly divide the data from population $P_1$ into two parts $D_1^{(1)}$ and $D_1^{(2)}$ of equal size, and proceed in the same way with population $P_2$, which yields $D_2^{(1)}$ and $D_2^{(2)}$. In a first step the dimensionality of the parameter space is reduced by filtering out (potentially many) redundant components. We do this by applying a screening procedure on $D_1^{(1)}$ and $D_2^{(1)}$ separately, but also on the pooled data $D_1^{(1)} \cup D_2^{(1)}$. In this way we obtain models with lower dimensionality than the full dimension $p$. The first model describes populations $P_1$ and $P_2$ individually using different reduced parameter spaces, whereas the second model explains both populations jointly with a single reduced parameter space. In a second step, we then evaluate the restricted likelihood-ratio statistic on the held-out data $D_1^{(2)}$ and $D_2^{(2)}$ and perform significance testing.
2.1 Screening and $\ell_1$-regularization
Consider i.i.d. data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $y_i$ distributed according to $f(\cdot \mid x_i; \theta_0)$. We have $\theta_0 \in \mathbb{R}^p$ and $\theta_0$ is supposed to be sparse.
A screening procedure selects, based on data $D$, a set of active parameter components $\hat{S} \subseteq \{1, \ldots, p\}$ by a map $D \mapsto \hat{S}(D)$. The map then defines the active parameter space by

$$\Theta_{\hat S} = \{\, \theta \in \mathbb{R}^p\ :\ \theta_j = 0 \ \text{for all}\ j \notin \hat S \,\}.$$
There are two basic requirements on the screening procedure $\hat S$. Firstly, the procedure should get rid of many non-relevant components, in the sense that the set of active parameter components should be small compared to the sample size. Secondly, the active parameter space should contain the true parameter $\theta_0$. We refer to these requirements as the sparsity and screening property:

- Sparsity property: $|\hat S|$ is small compared to $n$.
- Screening property: $\theta_0 \in \Theta_{\hat S}$.
Formulating the sparsity property precisely, i.e. specifying the rate at which the size of the active set can grow with $n$, is a research topic in its own right (see also the discussion in Section 5). We do not address this topic here. The screening property guarantees that the model selected by $\hat S$ is correctly specified, in the sense that it contains the true density function $f(\cdot \mid x; \theta_0)$. $\ell_1$-penalized likelihood methods (Fan and Li, 2001) serve as our prime example of screening procedures. Consider estimators of the form
$$\hat{\theta}(\lambda) = \arg\min_{\theta \in \mathbb{R}^p}\ -\,\ell(\theta; D) + \lambda \|\theta\|_1, \tag{2.3}$$

where $\ell(\theta; D)$ denotes the conditional log-likelihood and $\lambda \ge 0$ is a non-negative regularization parameter. In the context of linear regression, (2.3) coincides with the Lasso estimator introduced by Tibshirani (1996). Other important examples of the form (2.3) include the elastic net (Zou and Hastie, 2005), the grouped Lasso (Yuan and Lin, 2006), the graphical Lasso (Friedman et al., 2008) and the Lasso for generalized linear models (Park and Hastie, 2007). Estimators of the form (2.3) have been shown to be extremely powerful: they are suitable for high-dimensional data and lead to sparse solutions. A screening procedure can be defined by setting
$$\hat{S} = \{\, j\ :\ \hat{\theta}_j(\lambda) \neq 0 \,\}. \tag{2.4}$$
The question is whether or not $\hat S$ satisfies both the sparsity and the screening property. In principle the answer depends on the choice of the tuning parameter $\lambda$. Too strong regularization (too large $\lambda$) leads to very sparse solutions, but with relevant parameter components incorrectly set to zero, whereas too little regularization (too small $\lambda$) results in overly large active sets. Common practice is to choose $\lambda$ in a prediction-optimal manner, for example by optimizing a cross-validation score. This typically gives sparse solutions with a larger number of non-zero components than the true number (Meinshausen and Bühlmann, 2006). It seems reasonable in practice to assume that the sparsity and screening property hold for the procedure obtained via $\ell_1$ regularization. In fact, Bühlmann (2012) points out that the screening property is a very useful concept for $\ell_1$ regularization which holds under much milder assumptions than consistency in variable selection.
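The screening step (2.3)–(2.4) can be illustrated with a small numerical sketch. The following is not the glmnet/Lasso implementation referenced later in the paper, but a minimal coordinate-descent Lasso in Python with an arbitrary fixed penalty $\lambda$ (the paper chooses $\lambda$ by cross-validation); the non-zero coefficients define the screened active set:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    # Coordinate-descent Lasso: argmin_b ||y - Xb||^2 / (2n) + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n              # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]       # partial residual excluding j
            rho = X[:, j] @ r_j / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(0)
n, p = 60, 200                                     # high-dimensional: p >> n
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                        # sparse true parameter
X = rng.standard_normal((n, p))
y = X @ beta + 0.5 * rng.standard_normal(n)

b_hat = lasso_cd(X, y, lam=0.2)
active = set(np.flatnonzero(b_hat))                # screened active set S-hat
```

In this toy run the active set contains the truly relevant components (screening property) while remaining far smaller than $p$ (sparsity property), possibly together with a few false positives.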
By applying the screening procedure on $D_1^{(1)}$ and $D_2^{(1)}$ separately, we obtain active sets $\hat S_1$ and $\hat S_2$. Further, we obtain the active set $\hat S_{12}$ by applying the screening procedure jointly on the pooled data $D_1^{(1)} \cup D_2^{(1)}$. For the rest of this section we treat the active sets $\hat S_1$, $\hat S_2$ and $\hat S_{12}$ as fixed, i.e., not depending on the sample size.
2.2 Restricted likelihood ratio statistic
We define the model $M_{12}$, with shared parameter space $\Theta_{\hat S_{12}}$ for both populations $P_1$ and $P_2$ jointly, and the model $M_{1,2}$, with individual parameter spaces $\Theta_{\hat S_1}$ and $\Theta_{\hat S_2}$ for each population. The log-likelihood functions with respect to models $M_{12}$ and $M_{1,2}$, evaluated on the held-out data $D_1^{(2)}$ and $D_2^{(2)}$, are given by

$$\ell_{12}(\theta) = \sum_{i \in D_1^{(2)} \cup D_2^{(2)}} \log f(y_i \mid x_i; \theta), \qquad
\ell_{1,2}(\theta^{(1)}, \theta^{(2)}) = \sum_{i \in D_1^{(2)}} \log f(y_i \mid x_i; \theta^{(1)}) + \sum_{i \in D_2^{(2)}} \log f(y_i \mid x_i; \theta^{(2)}).$$

We will test hypothesis (1.1) based on the restricted log-likelihood ratio test statistic defined as

$$T := 2\,\bigl[\, \ell_{1,2}(\hat\theta^{(1)}, \hat\theta^{(2)}) - \ell_{12}(\hat\theta^{(12)}) \,\bigr], \tag{2.5}$$

where $\hat\theta^{(1)}$, $\hat\theta^{(2)}$ and $\hat\theta^{(12)}$ denote the maximum likelihood estimators corresponding to models $M_{1,2}$ and $M_{12}$.
It is crucial to note that testing based on the restricted likelihood ratio statistic is non-trivial and involves non-nested model comparison (Vuong, 1989). The difficulty arises from the fact that model $M_{12}$ is not nested in $M_{1,2}$. Non-nestedness of these models is fundamental and is not only an artefact of the random nature of the screening procedure. If $\theta_1 \neq \theta_2$, then $\ell_1$-penalized estimation applied to the two populations mixed together will set different components of $\theta$ to zero than when applied to both populations individually. This is a consequence of model misspecification and is also well known as Simpson's paradox, in which an association present in two different groups can be lost or even reversed when the groups are combined.
Under $H_0$ we then have:
Proposition 2.1.
Assume that the screening property holds and consider the sets $\hat S_1$ and $\hat S_2$ as fixed. Then under $H_0$ and regularity assumptions (A1)–(A3) (listed in Appendix A):
A proof is given in Appendix A. Based on Proposition 2.1, we reject $H_0$ if the test statistic exceeds some critical value. The critical value is chosen to control the type-I error at some significance level $\alpha$. Alternatively, one may directly compute a p-value. In order to compute a critical value and/or determine a p-value, we derive in Section 2.3 the asymptotic distribution of the test statistic under $H_0$.
2.3 Asymptotic Null Distribution
The work of Vuong (1989) specifies the asymptotic distribution of the log-likelihood ratio statistic for comparing two competing models in a very general setting; in particular, Vuong's theory covers the case where the two models are non-nested. We apply Theorem 3.3 of Vuong (1989) to the special case where the competing models are $M_{12}$ and $M_{1,2}$.
Let the pseudo-true values of models $M_{12}$ and $M_{1,2}$, respectively, be defined as:
Define for , and sets
We further consider the matrices
and
The following theorem establishes the asymptotic distribution of the restricted likelihood ratio statistic.
Theorem 2.1 (Asymptotic null distribution of the restricted likelihood ratio statistic).
Assume that the screening property holds and consider the sets $\hat S_1$, $\hat S_2$ and $\hat S_{12}$ as fixed. If $H_0$ holds, then under regularity assumptions (A1)–(A6) (listed in Appendix A) we have:

where the limit denotes the distribution function of a weighted sum of independent chi-square distributions whose weights are the eigenvalues of the matrix
(2.6) 
A proof is given in Appendix A. Set , , and
If we assume that the screening property holds, then model $M_{1,2}$ is correctly specified and the pseudo-true values equal the true values $(\theta_1, \theta_2)$. Furthermore, if $H_0$ holds, then the screening property guarantees that $M_{12}$ is also correctly specified and that its pseudo-true value equals $\theta_1 = \theta_2$. We then write
In Appendix A we prove the following proposition, which characterizes the weights defined as eigenvalues of the matrix $W$ in Theorem 2.1.
Proposition 2.2 (Characterization of eigenvalues).
The eigenvalues of the matrix $W$ in Theorem 2.1 can be characterized as follows:

If :
- eigenvalues are 0.
- eigenvalues are 1.
- The remaining eigenvalues equal , where are eigenvalues of
(2.7)

If :
- eigenvalues are 0.
- eigenvalues are 1.
- eigenvalues are +1.
- The remaining eigenvalues equal , where are eigenvalues of
(2.8)
Expressions (2.7) and (2.8) of Proposition 2.2 have nice interpretations in terms of analyzing the variances of the models $M_{1,2}$ and $M_{12}$. Consider the random variables
which are asymptotically Normally distributed with mean zero. If and denote the residuals obtained from regressing and against , then is the covariance between these residuals. It is easy to see that the matrix equals
and therefore expression (2.7) of Proposition 2.2 can be interpreted as the asymptotic variance of model $M_{1,2}$ not explained by model $M_{12}$. Similarly, we can see that (2.8) describes the variance of $M_{12}$ not explained by $M_{1,2}$.
We further point out two special cases of Proposition 2.2: if $M_{12}$ is nested in $M_{1,2}$, then the restricted likelihood ratio statistic asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference in the numbers of free parameters. If , then the statistic is asymptotically distributed according to .
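Once the weights are available, tail probabilities of the weighted sum of chi-squares in Theorem 2.1 can be evaluated numerically. The paper later uses Davies' exact algorithm; as a simple illustration (a sketch of mine, not the authors' implementation), the tail probability can also be approximated by plain Monte Carlo:

```python
import numpy as np

def weighted_chi2_sf(t, weights, n_mc=200_000, seed=1):
    # P(sum_k w_k * Z_k^2 > t) for independent standard normals Z_k,
    # approximated by plain Monte Carlo (Davies' algorithm would be exact).
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    z = rng.standard_normal((n_mc, w.size))
    return float(np.mean((z ** 2) @ w > t))

# Sanity check: with a single unit weight this reduces to an ordinary chi-square(1)
p_tail = weighted_chi2_sf(3.841, [1.0])   # 3.841 is the chi2_1 upper 5% point
```

With all weights equal to one the statistic is an ordinary chi-square, so `p_tail` should be close to 0.05; for general weight vectors the same routine approximates the null distribution used for p-value computation.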
2.4 Estimation of the weights
In practice the weights of the weighted sum of chi-squares null distribution have to be estimated from the data. In light of Theorem 2.1, it would be straightforward to estimate the relevant quantities and plug them into expression (2.6) to obtain an estimate $\hat W$. Estimating the weights then involves computing the eigenvalues of $\hat W$. Despite the model reduction in the screening step, the dimension of $\hat W$ can be rather large, which can result in inefficient and inaccurate estimation. However, if both populations arise from the same distribution, then the overlap of the active sets is large. According to Proposition 2.2, the number of eigenvalues which then remain to be estimated is small. We therefore estimate matrix (2.7) by
(2.9) 
and similarly matrix (2.8) by
(2.10) 
Here, the plugged-in quantities denote consistent estimators of their population counterparts. One possibility is to use the sample analogues:

Another way is to plug estimators into the expectations with respect to $P_1$ and $P_2$:
(2.11) 
Figure 1 shows, for a linear regression example with $p$ predictors, the estimated weights (upper panels show weights obtained using Theorem 2.1 directly; lower panels show weights obtained via Proposition 2.2). Estimating the eigenvalues with the help of Proposition 2.2 works better than direct computation according to Theorem 2.1. In particular, the true zero eigenvalues are poorly estimated with the direct approach. Figure 2 illustrates, for the same example, the quality of approximation of the ordinary and restricted likelihood ratio by their asymptotic counterparts. The weighted sum of chi-squares approximates the restricted statistic well already for small sample sizes, whereas the chi-square approximation comes close to the distribution function of the ordinary likelihood ratio only for very large sample sizes.
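The plug-in strategy rests on the fact that eigenvalues of a consistently estimated matrix converge to the population eigenvalues. The matrices in (2.9)–(2.11) are model-specific; the following generic sketch (a toy illustration of mine, not the paper's estimator) demonstrates the principle with a sample second-moment matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 100_000
A = rng.standard_normal((d, d))
Sigma = A @ A.T                    # population matrix; its eigenvalues play the
                                   # role of the unknown weights
Z = rng.standard_normal((n, d))
X = Z @ A.T                        # i.i.d. draws with covariance Sigma
Sigma_hat = X.T @ X / n            # sample analogue (consistent plug-in estimate)

lam = np.sort(np.linalg.eigvalsh(Sigma))          # population eigenvalues
lam_hat = np.sort(np.linalg.eigvalsh(Sigma_hat))  # estimated eigenvalues
max_err = float(np.max(np.abs(lam - lam_hat)))
```

By Weyl's inequality the eigenvalue error is bounded by the operator-norm error of the matrix estimate, so the estimated weights inherit the consistency of the plug-in matrix.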
2.5 P-Values and Multi-Splitting
In this section we demonstrate how to obtain p-values for testing hypothesis (1.1) using the methodology developed in Sections 2.1–2.4. The basic workflow is to split the data into two parts, do screening on one part and derive asymptotic p-values on the other part. A p-value computed from a one-time single split depends heavily on the arbitrary choice of the split and amounts to a "p-value lottery" (Meinshausen et al., 2009): for finite samples and for some specific split, the screening or the sparsity property might be violated, which then results in an erroneous p-value. An alternative to a single arbitrary sample split is to split the data repeatedly. Meinshausen et al. (2009) demonstrated, in the context of variable selection in the high-dimensional regression model, that such a multi-split approach gives improved and more reproducible results. We adopt this idea and divide the data repeatedly into $B$ different splits. Let the first and second halves of split $b$ be denoted accordingly. On the first half, screening is performed by solving the $\ell_1$-regularized log-likelihood problem (2.3) three times: individually for each population and for both populations pooled together. The tuning parameter is always chosen by cross-validation. This gives models $M_{1,2}^b$ and $M_{12}^b$ defined via active sets $\hat S_1^b$, $\hat S_2^b$ and $\hat S_{12}^b$. Then, the restricted log-likelihood ratio is evaluated on the second half of split $b$ and a one-sided p-value $p_b$ is computed according to
where the null distribution function is defined in Theorem 2.1 and the weights are estimated from the second half of split $b$ as described in Section 2.4. As in Meinshausen et al. (2009), we aggregate the p-values $p_b$, $b = 1, \ldots, B$, obtained from all $B$ splits using the formula:
where $q_\gamma(\cdot)$ is the empirical $\gamma$-quantile function. This procedure is summarized in Algorithm 1. We refer to the choice $B = 1$ as the single-split method and to the choice $B > 1$ as the multi-split method.
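One possible implementation of the aggregation step is the quantile-aggregation rule of Meinshausen, Meier and Bühlmann (2009); the adaptive search over $\gamma$ and the lower bound $\gamma_{\min} = 0.05$ below are conventional choices assumed here, not taken from the original text:

```python
import numpy as np

def aggregate_pvalues(pvals, gamma_min=0.05):
    # Quantile aggregation of Meinshausen, Meier and Buhlmann (2009):
    # Q(gamma) = min(1, empirical gamma-quantile of {p_b / gamma}), then an
    # adaptive minimum over gamma in [gamma_min, 1] with a log-correction factor.
    p = np.asarray(pvals, dtype=float)
    gammas = np.linspace(gamma_min, 1.0, 100)
    q = np.array([min(1.0, np.quantile(p / g, g)) for g in gammas])
    return float(min(1.0, (1.0 - np.log(gamma_min)) * q.min()))

p_signal = aggregate_pvalues([0.001] * 50)   # fifty splits, all clearly significant
p_null = aggregate_pvalues([0.8] * 50)       # fifty splits, all non-significant
```

The correction factor $(1 - \log \gamma_{\min})$ accounts for the search over $\gamma$, so the aggregated quantity remains a valid p-value despite the adaptive minimization.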
3 Examples
As mentioned at the beginning of Section 2, two examples of our approach are differential regression, where the populations follow linear regression models, and differential network, where the populations are generated from Gaussian graphical models. In this section we provide details on both examples.
Differential regression
Consider a regression model

$$y = x^\top \beta + \varepsilon \tag{3.12}$$

with $y \in \mathbb{R}$, $x \in \mathbb{R}^q$ and random error $\varepsilon \sim N(0, \sigma^2)$. With $\theta = (\beta, \sigma^2)$, the score function is given by
In Appendix B we show that
(3.13) 
Now, given data $D_1$ and $D_2$, we obtain p-values by following Algorithm 1 (see Section 2.5). For the regression model, $\ell_1$-penalized maximum likelihood estimation, used in the screening step, coincides with the Lasso (Tibshirani, 1996) and is implemented in the R package glmnet (Friedman et al., 2010). With the help of formula (3.13) we can easily compute plug-in estimates (see equation (2.11)), and then the weights of the asymptotic null distribution are obtained as outlined in Section 2.4. We use the algorithm of Davies (1980), implemented in the R package CompQuadForm (Duchesne and de Micheaux, 2010), to compute the distribution function of the weighted sum of chi-squares.
Differential network
Let $x \in \mathbb{R}^p$ be Gaussian distributed with zero mean and covariance $\Sigma$, i.e., $x \sim N_p(0, \Sigma)$. A Gaussian graphical model with undirected graph $G = (V, E)$ is then defined by the locations of the zero entries of the inverse covariance matrix $\Omega = \Sigma^{-1}$, i.e., $(i, j) \notin E \Leftrightarrow \Omega_{ij} = 0$ for $i \neq j$. Setting $\theta = \Omega$, the score function is given by
By invoking formulas on fourth moments of a multivariate normal distribution (see Appendix B) we find: