Two-Sample Testing in High-Dimensional Models

# Two-Sample Testing in High-Dimensional Models

Nicolas Städler
Netherlands Cancer Institute
Amsterdam, Netherlands.
Sach Mukherjee
Netherlands Cancer Institute
Amsterdam, Netherlands.
s.mukherjee@nki.nl
###### Abstract

We propose novel methodology for testing equality of model parameters between two high-dimensional populations. The technique is very general and applicable to a wide range of models. The method is based on sample splitting: the data is split into two parts; on the first part we reduce the dimensionality of the model to a manageable size; on the second part we perform significance testing (p-value calculation) based on a restricted likelihood ratio statistic. Assuming that both populations arise from the same distribution, we show that the restricted likelihood ratio statistic is asymptotically distributed as a weighted sum of chi-squares with weights which can be efficiently estimated from the data. In high-dimensional problems, a single data split can result in a “p-value lottery”. To ameliorate this effect, we iterate the splitting process and aggregate the resulting p-values. This multi-split approach provides improved p-values. We illustrate the use of our general approach in two-sample comparisons of high-dimensional regression models (“differential regression”) and graphical models (“differential network”). In both cases we show results on simulated data as well as real data from recent, high-throughput cancer studies.

Keywords High-dimensional two-sample testing; Data splitting; -regularization; Non-nested hypotheses; High-dimensional regression; Gaussian graphical models; Differential regression; Differential network

## 1 Introduction and Motivation

We consider the general two-sample testing problem where the goal is to test whether or not two independent populations and , parameterized by and respectively, with , arise from the same distribution. The hypothesis testing problem of interest is

 (1.1)

In this paper the focus is on the high-dimensional setting where the number of available samples per population ( and ) is small compared to the dimensionality of the parameter space .

In a setup where and are much larger than the ordinary likelihood-ratio test offers a very general solution to problem (1.1). It is well-known that under the likelihood ratio statistic is asymptotically distributed which then allows computation of confidence intervals and p-values. However, in settings where is large compared to sample sizes, the likelihood-ratio statistic is ill-behaved and the classical asymptotic set-up cannot be relied upon to test statistical significance.

Our approach for solving the general high-dimensional two-sample problem (1.1) is motivated by the screen and clean procedure (Wasserman and Roeder, 2009) originally developed for improved variable selection in the high-dimensional regression model. The idea is to split the data from both populations into two parts; perform dimensionality reduction on one part and pursue significance testing on the other part of the data. In more detail, on the first split we select three subsets of by screening for the most relevant parameters of population and individually but also by selecting the important parameters of the pooled data from both populations. In this way we obtain an individual model with different parameter spaces for and and a joint model which describes both populations by a common parameter space. On the second split of the data, we then can compute the likelihood ratio statistic between these two models, the so-called restricted likelihood ratio statistic. A crucial observation is that the two models are non-nested and therefore non-standard tools are needed to obtain the asymptotic null distribution. We apply the theory on model selection and non-nested hypotheses developed by Vuong (1989) to the two-sample scenario and show that the null distribution asymptotically approaches a weighted sum of independent chi-squared random variables with weights which can be efficiently estimated from the second split of the data. Importantly, the weighted sum of chi-squared approximation is invoked only in the second split, where the model dimensionalty has been reduced.

As indicated above our approach involves a screening or model selection step prior to significance testing. Our method requires that the parameter set selected by a specific screening procedure contains the true model parameter (screening property) and that the selected set is small compared to the sample size (sparsity property). The first property is needed for deriving the asymptotic null distribution, whereas the latter property justifies asymptotic approximation. The two conditions which we impose here are much weaker than requiring consistency in variable selection. We propose to use -penalized maximum likelihood estimation (Tibshirani, 1996; Fan and Li, 2001) with regularization parameter chosen by cross-validation. This sets automatically non-relevant parameter components to zero and allows for efficient screening even in scenarios with very large . penalization leads to sparse estimates and there is theoretical evidence that the screening property holds under mild additional assumptions.

In current applied statistics, a wide range of applications are faced with high-dimensional data. Parameter estimation in the “large , small ” setting has been extensively studied in theory (Bühlmann and van de Geer, 2011) and also applied with success in many areas. Since interpretation of parameters is crucial in applied science there is a clear need to assess uncertainty and statistical significance in such settings. However, significance testing in high-dimensional settings has only recently attracted attention. Meinshausen et al. (2009) and Bühlmann (2012) consider testing in the high-dimensional linear model. In the context of high-dimensional two-sample comparison Bai and Saranadasa (1996), Chen and Qin (2010), Lopes et al. (2012) and others treat testing for differences in the population means. And recently, Cai et al. (2011) and Li and Chen (2012) developed a two-sample test for covariance matrices of two high-dimensional populations.

The approach proposed in this paper tackles the high-dimensional two-sample testing in a very general setting. In our methodology and can parameterize any model of interest. In the empirical examples we show below we focus on two specific applications of our approach: (i) High-dimensional regression, where two populations may differ with respect to regression models and (ii) Graphical modelling, where the two populations may differ with respect to conditional independence structure. In analogy to the term “differential expression” as widely-used for testing means in gene expression studies, we call these “differential regression” and “differential network” respectively. Both high-dimensional regression and graphical models are now widely used in biological applications, and very often scientific interest focuses on potential differences between populations (such as disease types, cell types, environmental conditions etc.). However, to date in the high-dimensional setting, two sample testing concerning such models has not been well studied. The methodology we propose offers a way to directly test hypotheses concerning differences in molecular influences or biological network structure using high-throughput data.

The organization of the paper is as follows: Section 2 introduces the setup and the methodology: Section 2.1 explains variable screening using -regularization; Section 2.2 introduces the restricted likelihood-ratio statistic; Sections 2.3 and 2.4 derives its asymptotic null distribution and Section 2.5 shows how to compute p-values using the single- and multi-split algorithms. In Section 3 we present and give details on the two specific examples of our approach that we outlined above (differential regression and differential network). Finally, in Section 4, we evaluate our methodology on simulated and real data from two recent high-throughput studies in cancer biology.

## 2 Data Splitting, Screening and Non-Nested Hypothesis

Consider a conditional model class given by densities

 d(y|x;ϕ),y∈Rk,x∈Rl,z=(y,x)andϕ∈Φ⊂Rp. (2.2)

Note that can be empty (i.e., ). Let population and , where , are both generated from the same distribution and conditional on the ’s, and are generated from and respectively.

The goal is to solve the general two-sample testing problem (1.1) given data-matrices and , representing i.i.d. random samples of and . We consider the high-dimensional case with and assume that the data generating parameters and are sparse, meaning that many of their components are equal to zero. If , we have a classical (univariate) two-sample set-up. If and is the univariate Normal distribution with mean and noise variance , , then and follow linear regression models. If and is the multivariate Normal distribution with representing the inverse covariance matrix, then and follow a Gaussian graphical model and (1.1) asks whether or not two graphical models (in short: networks) are significantly different. We will treat these two examples in detail in Section 3 and 4. We refer to the regression case as differential regression and to the graphical model case as differential network.

Our methodology for solving (1.1) is based on sample splitting and has its inspiration from the work by Wasserman and Roeder (2009). We randomly divide data from population into two parts and of equal size and proceed in the same way with population which yields and . In a first step the dimensionality of the parameter space is reduced by filtering out (potentially many) redundant components. We do this by applying a screening procedure on and separately, but also on pooled data . In this way we obtain models with lower dimensionality than the full dimension . The first model describes populations and individually using different reduced-parameter spaces, whereas the second model explains both populations jointly with a single reduced-parameter space. In a second step, we then evaluate the restricted likelihood-ratio statistic on the held-out data and and perform significance testing.

### 2.1 Screening and ℓ1-regularization

Consider i.i.d. data with distributed according to and . We have and is supposed to be sparse.

A screening procedure selects, based on data , a set of active parameter components by a map . The map then defines the active parameter space by

 ΦI(Z)={(ϕ1,…,ϕp):ϕj=0for allj∉I(Z)}.

There are two basic requirements on the screening procedure . Firstly, the procedure should get rid of many non-relevant components in the sense that the set of active parameter components should be small compared to . Secondly, the active parameter space should contain the true parameter . We refer to these requirements as the sparsity and screening property:

• Sparsity property: is small compared to .

• Screening property: .

Formulating the sparsity property precisely, i.e. specifying the rate at which the size of the active set can grow with , is a research topic in its own right (see also the discussion in Section 5). We do not address this topic here. The screening property guarantees that the model , selected by , is correctly specified in the sense that it contains the true density function . L1-penalized likelihood methods (Fan and Li, 2001) serve as our prime example of screening procedures. Consider estimators of the form

 ^ϕλ=argmaxϕ∈Φℓ(ϕ;Y|X)−λ∥ϕ∥1, (2.3)

where denotes the conditional log-likelihood and is a non-negative regularization parameter. In the context of linear regression (2.3) coincides with the Lasso estimator introduced by Tibshirani (1996). Other important example of the form (2.3) include the elastic net (Zou and Hastie, 2005), the grouped Lasso (Yuan and Lin, 2006), the graphical Lasso (Friedman et al., 2008) or the Lasso for generalized linear models (Park and Hastie, 2007). Estimators of the form (2.3) have been shown to be extremely powerful: they are suitable for high-dimensional data and lead to sparse solutions. A screening procedure can be defined by setting

 Iλ(Z)={j:^ϕλ,j≠0}. (2.4)

The question is whether or not satisfies both the sparsity and the screening property. In principle the answer depends on the choice of the tuning parameter . Too strong regularization (too large ) leads to very sparse solutions, but with relevant parameter components incorrectly set to zero, whereas too little regularization (too small ) results in too large active sets. Common practice is to choose in a prediction optimal manner, for example by optimizing a cross-validation score. This gives typically sparse solutions with a larger number of non-zero components than the true number (Meinshausen and Bühlmann, 2006). It seems to be reasonable in practice to assume that the sparsity and screening property hold for the procedure obtained via -regularization. In fact, Bühlmann (2012) points out that the screening property is a very useful concept for -regularization which holds under much milder assumptions than consistency in variable selection.

By applying the screening procedure on and separately, we obtain active-sets

 Iu=Iλ(Uin),Iv=Iλ(Vin).

Further we obtain active-set

 Iuv=Iλ((Uin,Vin))

by applying jointly on . For the rest of this section we treat the active-sets , and as fixed, i.e., not depending on the sample size.

### 2.2 Restricted likelihood ratio statistic

We define model with shared parameter space for both population and jointly as

 Mjoint={d(yu|xu;ϕuv)d(yv|xv;ϕuv):ϕuv∈ΦIuv}

and model with individual parameter spaces and for each population as

 Mind={d(yu|xu;ϕu)d(yv|xv;ϕv):(ϕu,ϕv)∈ΦIu×ΦIv}.

The log-likelihood functions with respect to models and are given by

 Ljointnu,nv(ϕuv) = nu∑i=1logd(Yu,i|Xu,i;ϕuv)+nv∑i=1logd(Yv,i|Xv,i;ϕuv) Lindnu,nv(ϕu,ϕv) = nu∑i=1logd(Yu,i|Xu,i;ϕu)+nv∑i=1logd(Yv,i|Xv,i;ϕv).

We will test hypothesis (1.1) based on the restricted log-likelihood ratio test-statistic defined as

 LRnu,nv = 2{maxLindnu,nv(ϕu,ϕv)−maxLjointnu,nv(ϕ)} = 2{Lindnu,nv(^ϕu,^ϕv)−Ljointnu,nv(^ϕuv)},

where and denote the maximum likelihood estimators corresponding to models and .

It is crucial to note that testing based on (2.2) is non-trivial and involves non-nested model comparison (Vuong, 1989). The difficulty arises from the fact that model is not nested in , or in other terms . Non-nestedness of these models is fundamental and is not only an artefact of the random nature of the screening procedure. If , than applied on two different populations mixed together will set different components of to zero than applied two both populations individually. This is a consequence of model misspecification and is also well known as Simpson’s paradox in which an association present in two different groups can be lost or even reversed when the groups are combined.

Under , or if , then we have:

###### Proposition 2.1.

Assume that the screening property holds and consider the sets and as fixed. Then under and regularity assumptions (A1)-(A3) (listed in Appendix A):

 LRnu,nv=2{Lindnu,nv(^ϕu,^ϕv)−Ljointnu,nv(^ϕuv)}a.s.→∞(nu,nv→∞).

A proof is given in Appendix A. Based on Proposition 2.1 we reject if exceeds some critical value. The critical value is chosen to control the type-I error at some level of significance . Alternatively, one may directly compute a p-value. In order to compute a critical value and/or determine a p-value we obtain in Section 2.3 the asymptotic distribution of under .

### 2.3 Asymptotic Null Distribution

The work of Vuong (1989) specifies the asymptotic distribution of the log-likelihood ratio statistic for comparing two competing models in a very general setting, in particular Vuong’s theory covers the case where the two models are non-nested. We apply Theorem 3.3 of Vuong (1989) to the special case where the competing models are and .

Let and be pseudo-true values of model , respectively :

 ϕ∗uv=argminϕ∈ΦIuv(E[Eϕu[logd(Yu|Xu;ϕ)]]+E[Eϕv[logd(Yv|Xv;ϕ)]]), ϕ∗u=argminψ∈ΦIuE[Eϕu[logd(Yu|Xu;ψ)]],ϕ∗v=argminξ∈ΦIvE[Eϕv[logd(Yv|Xv;ξ)]].

Define for , and sets

 sA(y|x;ϕ) = ∂∂ϕAlogd(y|x;ϕ)and BcAB(ϕ∗a;ϕ∗b) = E[Eϕc[sA(Y|X;ϕ∗a)sB(Y|X;ϕ∗b)T]].

We further consider the matrices

 BMind(ϕ∗u,ϕ∗v)=(BuIu(ϕ∗u)00BvIv(ϕ∗v)),BMjoint(ϕ∗uv)=BuIuv(ϕ∗uv)+BvIuv(ϕ∗uv)

and

 BMjointMind(ϕ∗uv;ϕ∗u,ϕ∗v)=(BuIuvIu(ϕ∗uv;ϕ∗u),BvIuvIv(ϕ∗uv;ϕ∗v)).

The following theorem establishes the asymptotic distribution of .

###### Theorem 2.1 (Asymptotic null-distribution of restricted likelihood ratio-statistic.).

Assume that the screening property holds and consider the sets , and as fixed. If and under regularity assumptions (A1)-(A6) (listed in Appendix A) we have:

 LRnu,nvd→Ψr(⋅;ν),

with and denotes the distribution function of a weighted sum of independent chi-square distributions where the weights are eigenvalues of the matrix

 W = ⎛⎜ ⎜ ⎜⎝Iru+rvBMindMjointB−1MjointBMjointMindB−1Mind−Iruv\par⎞⎟ ⎟ ⎟⎠ (2.6)

A proof is given in Appendix A. Set , , and

 Qc˚Ia˚Ib(ϕ∗a,ϕ∗b)=Bc˚Ia˚Ib(ϕ∗a,ϕ∗b)−Bc˚IaJ(ϕ∗a,ϕ∗b)BcJJ(ϕ∗a,ϕ∗b)−1BcJ˚Ib(ϕ∗a,ϕ∗b).

If we assume that the screening property holds then model is correctly specified and the pseudo-true values equal the true values . Furthermore, if we have then the screening property guarantees that also is correctly specified and that . We then write

 BcAB(ϕ∗a;ϕ∗b)=BABandQc˚Ia˚Ib(ϕ∗a,ϕ∗b)=Q˚Ia˚Ib.

In Appendix A we prove the following proposition which characterizes the weights defined as eigenvalues of the matrix in Theorem 2.1.

###### Proposition 2.2 (Characterization of eigenvalues).

The eigenvalues , , of matrix W in Theorem 2.1 can be characterized as follows:

If :

• eigenvalues are 0.

• eigenvalues are 1.

• The remaining eigenvalues equal , where are eigenvalues of

 (Q˚Iuv˚IuQ−1˚IuQ˚Iu˚Iuv+Q˚Iuv˚IvQ−1˚IvQ˚Iv˚Iuv)(2Q˚Iuv)−1. (2.7)

If

• eigenvalues are 0.

• eigenvalues are -1.

• eigenvalues are +1.

• The remaining eigenvalues equal , where are eigenvalues of

 ⎡⎣(Q˚Iu˚Iuv(2Q˚Iuv)−1Q˚Iuv˚Iu)Q−1˚Iu(Q˚Iu˚Iuv(2Q˚Iuv)−1Q˚Iuv˚Iv)Q−1˚Iv(Q˚Iv˚Iuv(2Q˚Iuv)−1Q˚Iuv˚Iu)Q−1˚Iu(Q˚Iv˚Iuv(2Q˚Iuv)−1Q˚Iuv˚Iv)Q−1˚Iv⎤⎦. (2.8)

Expressions (2.7) and (2.8) of Proposition 2.2 have nice interpretations in terms of analyzing the variances of the models and . Consider the random variables

 Zc=1√nn∑i=1s(Yc,i|Xc,i;¯ϕ)(c∈{u,v}),

which are asymptotically Normal distributed with mean zero. If and denote the residuals obtained from regressing and against , then is the covariance between these residuals. It is easy to see that the matrix equals

 Var(rescIuv)−Var(rescIuv|rescIu),

and therefore expression (2.7) of Proposition 2.2 can be interpreted as the asymptotic variance of model not explained by model . Similarly, we can see that (2.8) describes the variance of model not explained by .

We further point out two special cases of Proposition 2.2: If is nested in , i.e., , then follows asymptotically a chi-squared distribution with degrees of freedom equal . If then is asymptotically distributed according to .

### 2.4 Estimation of the weights ν in Ψr(⋅;ν)

In practice the weights of the weighted sum of chi-squared null distribution have to be estimated from the data. In light of Theorem 2.1, it would be straightforward to estimate the quantities and plug them into expression (2.6) to obtain an estimate . Estimating then involves computation of eigenvalues. Despite model reduction in the screening step, can be a rather large number and can result in inefficient and inaccurate estimation. However, if both populations arise from the same distribution, then the overlap of the active-sets is large compared to , and . According to Proposition 2.2, the number of eigenvalues which remain to be estimated is small compared to . We therefore estimate matrix (2.7) by

 (^Qu˚Iuv˚Iu(^Qu˚Iu)−1^Qu˚Iu˚Iuv+^Qv˚Iuv˚Iv(^Qv˚Iv)−1^Qv˚Iv˚Iuv)(^Qu˚Iuv+^Qv˚Iuv)−1, (2.9)

and similarly matrix (2.8) by

 ⎡⎣(^Qu˚Iu˚Iuv(^Qu˚Iuv+^Qv˚Iuv)−1^Qu˚Iuv˚Iu)(^Qu˚Iu)−1(^Qu˚Iu˚Iuv(^Qu˚Iuv+^Qv˚Iuv)−1^Qv˚Iuv˚Iv)(^Qv˚Iv)−1(^Qv˚Iv˚Iuv(^Qu˚Iuv+^Qv˚Iuv)−1^Qu˚Iuv˚Iu)(^Qu˚Iu)−1(^Qv˚Iv˚Iuv(^Qu˚Iuv+^Qv˚Iuv)−1^Qv˚Iuv˚Iv)(^Qv˚Iv)−1⎤⎦. (2.10)

Here, and denotes a consistent estimator of . One possibility is to use the sample analogues:

 ^BcIaIb,sample = 1ncnc∑i=1sIa(Yc,i|Xc,i;^ϕa)sIb(Yc,i|Xc,i;^ϕb)T.

Another way is to plug-in estimators and into the expectation with respect to and :

 ^BcIaIb,plug-in = 1ncnc∑i=1E^ϕc[sIa(Y|Xc,i;^ϕa)sIb(Y|Xc,i;^ϕb)T]. (2.11)

Figure 1 shows for a linear regression example with predictors the estimated weights (upper panels show obtained using directly Theorem 2.1; lower panels show obtained via Proposition 2.2). Estimating the eigenvalues with help of Proposition 2.2 works better than direct computation according to Theorem 2.1. In particular the true zero eigenvalues are poorly estimated with the direct approach. Figure 2 illustrates for the same example the quality of approximation of the ordinary and restricted likelihood-ratio with their asymptotic counterparts. The weighted sum of chi-squares approximates well already for small sample sizes, whereas the comes close to the distribution function of the ordinary likelihood-ratio only for very large ’s.

### 2.5 P-Values and Multi-Splitting

In this section we demonstrate how to obtain p-values for testing hypothesis (1.1) using the methodology developed in Sections 2.1-2.4. The basic workflow is to split the data into two parts, do screening on one part and derive asymptotic p-values on the other part. A p-value computed based on a one-time single split depends heavily on the arbitrary choice of the split and amounts to a “p-value lottery” (Meinshausen et al., 2009): for finite samples and for some specific split, the screening or the sparsity property might be violated which then results in a erroneous p-value. An alternative to a single arbitrary sample split is to split the data repeatedly. Meinshausen et al. (2009) demonstrated in the context of variable selection in the high-dimensional regression model that such a multi-split approach gives improved and more reproducible results. We adopt this idea and divide the data repeatedly into different splits. Let () and () be the first and second half of split . On the first half screening is performed by solving three times the -regularized log-likelihood problem (2.3), individually for each population and for both populations pooled together. The tuning parameter is always chosen by cross-validation. This gives models and defined via active-sets and . Then, the restricted log-likelihood ratio is evaluated on the second half of split and a one-sided p-value is computed according to

 Pk = 1−Ψr(LRk;^ν),

where and is defined in Theorem 2.1. The weights are estimated from the second half of split as described in Section 2.4. As in Meinshausen et al. (2009) we aggregate p-values , obtained from all different splits using the formula:

 Pagg=min((1−γmin)infγ∈(γmin,1)qγ({Pb/γ;b=1,…,B}),1),

where is the empirical -quantile function. This procedure is summarized in Algorithm 1. We refer to the choice as the single-split method and to the choice as the multi-split method.

## 3 Examples

As mentioned in the beginning of Section 2 two examples of our approach are differential regression where the populations follow linear regression models and differential network where the populations are generated from Gaussian graphical models. In this Section we provide details on both examples.

#### Differential regression

Consider a regression model

 Y=Xβ+ϵ, (3.12)

with (), () and random error . With the score function is given by

 s(y|x;ϕ) = ((y−βTx)xσ2;12σ2((y−βTx)2σ2−1)).

In Appendix B we show that

 Eϕc[sA(Y|X;ϕa)sB(Y|X;ϕb)T] = XAXTBσa2σb2(σ2c+(βc−βa)TXXT(βc−βb)). (3.13)

Now, given data and we obtain p-values by following Algorithm 1 (see Section 2.5). For the regression model, -penalized maximum likelihood estimation, used in the screening step, coincides with the Lasso (Tibshirani, 1996) and is implemented in the R-package glmnet (Friedman et al., 2010). With the help of formula (3.13) we can easily compute plug-in estimates (see equation (2.11)) and then the weights of the asymptotic null distribution are obtained as outlined in Section 2.4. We use the algorithm of Davies (1980), implemented in the R-package CompQuadForm (Duchesne and de Micheaux, 2010), to compute the distribution function of .

#### Differential network

Let () be Gaussian distributed with zero mean and covariance , i.e., . A Gaussian graphical model with undirected graph is then defined by locations of zero entries in the inverse covariance matrix , i.e., . Setting (), the score function is given by

 s(j,j′)(Y;ϕ)=Y(j)Y(j′)−Σjj′.

By invoking formulas on fourth moments of a multivariate normal distribution (see Appendix B) we find:

 Eϕc[s(j,j′)(Y;ϕa)s(