Simultaneous critical values for t-tests in very high dimensions


Department of Health Studies, 5841 South Maryland Avenue MC 2007, University of Chicago, Chicago, IL 60637, USA. Department of Biostatistics and Department of Statistics and Operations Research, 3101 McGavran-Greenberg Hall, CB 7420, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Received June 2009; revised January 2010.
Abstract

This article considers the problem of multiple hypothesis testing using t-tests. The observed data are assumed to be independently generated conditional on an underlying and unknown two-state hidden model. We propose an asymptotically valid, data-driven procedure to find critical values for rejection regions controlling the k-familywise error rate (k-FWER), the false discovery rate (FDR) and the tail probability of the false discovery proportion (FDTP) by using one-sample and two-sample t-statistics. We only require a finite fourth moment plus some very general conditions on the mean and variance of the population, by virtue of the moderate deviation properties of t-statistics. A new consistent estimator for the proportion of alternative hypotheses is developed. Simulation studies support our theoretical results and demonstrate that the power of a multiple testing procedure can be substantially improved by using critical values directly, as opposed to the conventional p-value approach. Our method is applied in an analysis of microarray data from a leukemia cancer study that involves testing a large number of hypotheses simultaneously.

Volume 17, Issue 1 (2011), 347–394. DOI: 10.3150/10-BEJ272



Hongyuan Cao (hycao@uchicago.edu) and Michael R. Kosorok (kosorok@unc.edu)

Keywords: empirical processes; FDR; high dimension; microarrays; multiple hypothesis testing; one-sample t-statistics; self-normalized moderate deviation; two-sample t-statistics

1 Introduction

Among the many challenges raised by the analysis of large data sets is the problem of multiple testing. Examples include functional magnetic resonance imaging, source detection in astronomy and microarray analysis in genetics and molecular biology. It is now common practice to simultaneously measure thousands of variables or features in a variety of biological studies. Many of these high-dimensional biological studies are aimed at identifying features showing a biological signal of interest, usually through the application of large-scale significance testing. The possible outcomes are summarized in Table 1.


Table 1: Outcomes when testing m hypotheses

Hypothesis          Accept    Reject    Total
Null true             U         V        m_0
Alternative true      T         S        m_1
Total               m - R       R         m

Traditional methods that provide strong control of the familywise error rate (FWER) often have low power and can be unduly conservative in many applications. One way around this is to increase the number of false rejections one is willing to tolerate. This results in a relaxed version of the FWER, the k-FWER, defined as the probability of making k or more false rejections.

Benjamini and Hochberg [1] (hereafter referred to as "BH") pioneered an alternative. Define the false discovery proportion (FDP) to be the number of false rejections divided by the total number of rejections, with the convention that the ratio is set to zero when there are no rejections. Without loss of generality, we work on the event that there is at least one rejection and define the false discovery tail probability (FDTP) as the probability that the FDP exceeds a pre-specified proportion chosen on the basis of the application. Several papers have developed procedures for FDTP control. We shall not attempt a complete review here, but mention the following: van der Laan, Dudoit and Pollard [36] proposed an augmentation-based procedure, Lehmann and Romano [24] derived a step-down procedure and Genovese and Wasserman [16] suggested an inversion-based procedure, which is equivalent to the procedure of [36] under mild conditions [16].
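In symbols, using the counts from Table 1 (V false rejections among R total rejections), the quantities just described are conventionally written as follows; this is our rendering of the standard convention rather than a verbatim reproduction of the paper's display.

```latex
% False discovery proportion and its tail probability (standard convention):
% the maximum with 1 in the denominator simply sets the ratio to zero when
% there are no rejections, and gamma is the pre-specified proportion.
\[
  \mathrm{FDP} = \frac{V}{R \vee 1}, \qquad
  \mathrm{FDTP}(\gamma) = P\bigl( \mathrm{FDP} > \gamma \bigr).
\]
```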

The false discovery rate (FDR) is the expected FDP. BH provided a distribution-free, finite-sample method for choosing a p-value threshold that guarantees that the FDR is less than a pre-specified target level. Since this publication, there has been a considerable amount of research on both the theory and application of FDR control. Benjamini and Hochberg [2] and Benjamini and Yekutieli [3] extended the BH method to a class of dependent tests. A Bayesian mixture model approach to obtaining multiple testing procedures that control the FDR is considered in [14, 30, 31, 32, 33]. Wu [39] considered the conditional dependence model under the assumption of Donsker properties of the indicator function of the true state for each hypothesis and derived asymptotic properties of false discovery proportions and numbers of rejected hypotheses. A systematic study of multiple testing procedures is given in the book [12]. Other related work can be found in [9, 10].
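For concreteness, the classical BH step-up rule referred to here can be sketched in a few lines; this is a minimal illustration of the standard procedure, with the p-values and target level `alpha` as generic inputs rather than quantities taken from this paper.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Classical BH step-up rule: with ordered p-values p_(1) <= ... <= p_(m),
    reject the hypotheses with the k smallest p-values, where k is the largest
    index such that p_(k) <= k * alpha / m."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passes = p[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.max(np.nonzero(passes)[0])   # largest index meeting the threshold
        rejected[order[: k + 1]] = True     # reject everything up to that index
    return rejected
```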

One challenge in multiple hypothesis testing is that many procedures depend on the proportion of null hypotheses, which is not known in reality. Estimating this proportion has long been known to be a difficult problem. There have been some interesting developments recently, for example, the approach of [26] (see also [14], [16], [25], [23]). Roughly speaking, these approaches are only successful under a condition which [16] calls the "purity" condition. Unfortunately, the purity condition depends on p-values and is hard to check in practice.

The general framework for k-FWER, FDTP and FDR control, and for the estimation of the proportion of alternative hypotheses, is based on p-values, which are assumed to be known in advance or to admit accurate approximation. However, the assumption that p-values are always available is not realistic. In some special settings, approximate p-values have been shown to be asymptotically equivalent to exact p-values for controlling the FDR [15], [22]. However, these approximations are only helpful in certain simultaneous error control settings and are not universally applicable. Moreover, if the p-values are not reliable, any procedure built on them is problematic.

This motivates us to propose a method that finds critical values directly for rejection regions to control the k-FWER, FDTP and FDR by using one-sample and two-sample t-statistics. The advantage of using t-tests is that they require minimal conditions on the population: only the existence of the fourth moment, which is satisfied by most statistical distributions, rather than more stringent conditions such as the existence of the moment generating function. In addition, we approximate the tail probabilities accurately under both the null and the alternative hypotheses, whereas p-value approaches only consider behavior under the null; a better ranking of the hypotheses is thereby obtained. Furthermore, we propose a consistent estimate of the proportion of alternative hypotheses which depends only on the test statistics. As long as the asymptotic distribution of the test statistic is known under the null hypothesis, we can apply our method to estimate this proportion, resulting in more precise cut-offs.

The BH procedure controls the FDR conservatively at the product of the proportion of null hypotheses and the targeted significance level. If this proportion is much smaller than one, then the statistical power is greatly compromised. The notion of power we use in this paper is the one defined in [40]. In situations where t-statistics can be used, our procedure gives a better approximation, and more accurate critical values can be obtained by plugging in an estimate of this proportion. The validity of our approach is guaranteed by empirical process methods and recent theoretical advances on self-normalized moderate deviations, in combination with Berry–Esseen-type bounds for central and non-central t-statistics.

To illustrate, we simulate a Markov chain, as in [34], of Bernoulli variables indicating the true state of each hypothesis test (1 if the alternative is true, 0 if the null is true). Conditional on these indicators, the observations are generated according to the specified model. The one-sample t-statistic is used to perform simultaneous hypothesis testing. Figure 1 plots 10 000 MCMC replications of the realized versus nominal FDR control based on the BH method at different control levels. From this plot, we can see that as the control level increases, the BH procedure becomes more and more conservative: the FDR actually obtained falls increasingly far below the nominal level, reflecting a significant loss in power.
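A rough sketch of this kind of experiment, under assumed settings (a two-state Markov chain for the hypothesis indicators, normal observations with a location shift under the alternative, and the BH rule applied to two-sided p-values), is given below; all numerical parameters are illustrative and are not the values used for Figure 1.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
m, n = 2000, 30                      # number of hypotheses and sample size (illustrative)
p01, p10 = 0.05, 0.45                # illustrative Markov transition probabilities

# Two-state Markov chain of hypothesis indicators (1 = alternative true).
theta = np.zeros(m, dtype=int)
for i in range(1, m):
    p_alt = p01 if theta[i - 1] == 0 else 1.0 - p10
    theta[i] = int(rng.random() < p_alt)

# Conditional on the indicators, generate normal data with a location shift under H1.
X = rng.normal(loc=(0.5 * theta)[:, None], scale=1.0, size=(m, n))

# One-sample t-statistics and two-sided p-values.
T = np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)
pvals = 2.0 * stats.t.sf(np.abs(T), df=n - 1)

# Apply the BH procedure at a nominal level and record the realized FDP;
# averaging the FDP over many replications approximates the realized FDR.
reject = multipletests(pvals, alpha=0.10, method="fdr_bh")[0]
V = np.sum(reject & (theta == 0))
print("realized FDP:", V / max(reject.sum(), 1))
```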

Figure 1: Claimed and obtained FDR control using the BH procedure.

The three methods of multiple testing control we utilize are the k-FWER, FDTP and FDR. The criterion for using the k-FWER is, asymptotically,

(1)

Since we only apply our method when there are discoveries, we need the FDTP, with a given proportion and significance level, to satisfy, asymptotically,

(2)

Similarly, the criterion for using FDR is, asymptotically,

(3)
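In the notation of Table 1 (with V the number of false rejections and R the total number of rejections), standard formulations of criteria (1)-(3), consistent with the verbal descriptions above though not necessarily identical to the paper's displays, are:

```latex
% Standard asymptotic versions of the three control criteria, with m the
% number of hypotheses, k and gamma the tolerances and alpha the level.
\[
  \text{$k$-FWER:} \quad \limsup_{m \to \infty} P(V \ge k) \le \alpha,
\]
\[
  \text{FDTP:} \quad \limsup_{m \to \infty}
    P\Bigl( \tfrac{V}{R} > \gamma \,\Big|\, R > 0 \Bigr) \le \alpha,
\]
\[
  \text{FDR:} \quad \limsup_{m \to \infty}
    E\Bigl[ \tfrac{V}{R \vee 1} \Bigr] \le \alpha.
\]
```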

The main contributions of this paper are as follows. (1) Moderate deviation results from probability theory, which require only finiteness of the fourth moment of the distribution from which the statistic is computed, are applied to multiple testing. The applicability of the procedure is thus dramatically expanded: it can deal with non-normal and even highly skewed populations. (2) The critical values for the rejection regions are computed directly, which circumvents the intermediate p-value step. (3) An asymptotically consistent estimator of the proportion of alternative hypotheses is developed for multiple testing procedures under very general conditions.

The remainder of the paper is organized as follows. In Section 2, we present the basic data structure, our goals, the procedures and theoretical results for the one-sample t-test. Two-sample t-test results are discussed in Section 3. Section 4 is devoted to numerical investigations using simulation and Section 5 applies our procedure to detect significantly expressed genes in a microarray study of leukemia. Some concluding remarks and a discussion are given in Section 6. Proofs of the results from Sections 2 and 3 are given in the Appendix.

2 One-sample t-test

In this section, we first introduce the basic framework for simultaneous hypothesis testing, followed by our main results. Estimation of the unknown proportion of alternative hypotheses is presented next. We conclude the section by presenting theoretical results for the special case of completely independent observations. This special setting is the basis for the more general main results and is also of independent interest since fairly precise rates of convergence can be obtained.

2.1 Basic framework

As a specific application of multiple hypothesis testing in very high dimensions, we use gene expression microarray data. At the level of single genes, researchers seek to establish whether each gene in isolation behaves differently in a control versus a treatment situation. If the transcripts are paired under the two conditions, then we can use a one-sample t-statistic on the within-pair differences to test for differential expression.

The mathematical model is

(4)

It should be noted that the following discussion is carried out under this model and does not hold in general. Here, the observation represents the expression level of the ith gene on the jth array. Since the subjects are independent, for each gene the errors are independent random variables with mean zero and finite variance. The null hypothesis is that the gene's mean effect is zero and the alternative hypothesis is that it is non-zero. For the relationship between different genes, we propose the following conditional independence model. The hypothesis indicators form a {0, 1}-valued stationary process and, given these indicators, the observations are independently generated. The dependence is thus imposed on the hypothesis indicators, which take the value 0 if the null hypothesis is true and 1 if the alternative is true. From Table 1, we can see that m = m_0 + m_1 and R = V + S. It is assumed that the indicators satisfy a strong law of large numbers:

(5)
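A plausible reading of model (4) and of the strong law condition (5), written in illustrative notation of our own (X_ij for the observations, mu_i for the gene-level means, theta_i for the indicator that the ith alternative is true, pi_1 for the limiting proportion of alternatives), is the following:

```latex
% Sketch of the assumed one-sample model and the strong law condition,
% in illustrative notation (not necessarily the paper's own symbols).
\[
  X_{ij} = \mu_i + \varepsilon_{ij}, \qquad i = 1,\dots,m,\ j = 1,\dots,n,
\]
with, for each $i$, independent errors $\varepsilon_{i1},\dots,\varepsilon_{in}$
having mean zero and variance $\sigma_i^2$, and hypotheses
$H_{0i}\colon \mu_i = 0$ versus $H_{1i}\colon \mu_i \neq 0$.  With
$\theta_i = \mathbf{1}\{\text{the $i$th alternative is true}\}$, condition (5)
then asks that
\[
  \frac{1}{m} \sum_{i=1}^{m} \theta_i \;\longrightarrow\; \pi_1
  \qquad \text{almost surely as } m \to \infty
\]
for some constant $\pi_1 \in (0,1)$.
```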

This condition is satisfied in a variety of scenarios, for example, the independent case, Markov models and stationary models. Consider the one-sample t-statistic

where

If we use a given value as a cut-off, then the number of rejected hypotheses and the number of false discoveries are, respectively,

(6)
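In the same illustrative notation as above, the one-sample t-statistic and the two counts in (6) at a cut-off x take their standard forms:

```latex
% One-sample t-statistic for gene i, and the rejection and false-rejection
% counts at a cut-off x (illustrative notation).
\[
  T_i = \frac{\sqrt{n}\,\bar{X}_i}{\hat{\sigma}_i}, \qquad
  \bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \qquad
  \hat{\sigma}_i^2 = \frac{1}{n-1}\sum_{j=1}^{n} \bigl(X_{ij} - \bar{X}_i\bigr)^2,
\]
\[
  R(x) = \sum_{i=1}^{m} \mathbf{1}\bigl\{|T_i| \ge x\bigr\}, \qquad
  V(x) = \sum_{i=1}^{m} (1 - \theta_i)\,\mathbf{1}\bigl\{|T_i| \ge x\bigr\}.
\]
```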

Under the null hypothesis, it is well known that the t-statistic follows a Student t-distribution, with degrees of freedom equal to the sample size minus one, if the sample is from a normal distribution. Asymptotic convergence to a standard normal distribution holds under the null hypothesis even when the population is completely unknown, provided that it has a finite fourth moment. Moreover, under the alternative hypothesis, the t-statistic can also be approximated by a normal distribution, but with a shift in location. We will show that

(7)
(8)

uniformly over a range of thresholds, under some regularity conditions, where the limits are expressed in terms of the standard normal distribution and its tail probability, and where the critical values that control the FDTP and FDR asymptotically at a prescribed level are assumed to be bounded. These assumptions are fairly realistic in practice. We do not require the critical value for the k-FWER to be bounded. Although we do not typically know these quantities in practice, we need the following theorem, the proof of which is given in the Appendix, as a first step. We will shortly extend this result, in Theorem 2.2 below, to permit estimation of the unknown quantities.
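The approximations in (7) and (8) are of the self-normalized moderate deviation type. As background (and not as the paper's exact statement or conditions), one classical result of this kind, due to Shao for Student's t-statistic, reads:

```latex
% Classical self-normalized moderate deviation result (background only):
% for i.i.d. mean-zero X_1,...,X_n with E|X_1|^3 < infinity and t-statistic T_n,
\[
  \frac{P\bigl( T_n \ge x \bigr)}{1 - \Phi(x)} \;\longrightarrow\; 1
  \qquad \text{uniformly for } 0 \le x \le o\bigl(n^{1/6}\bigr),
\]
% where Phi denotes the standard normal distribution function.
```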

Theorem 2.1

Assume that , , , and (5) is satisfied. Also, assume that there exist and such that

(9)

Let

(10)

and

(11)
  (i) If the critical value is chosen such that

    (12)

    where the constant appearing in (12) is the corresponding quantile of the standard normal distribution, then

    (13)

    holds.

  (ii) If the critical value is chosen such that

    (14)

    then

    (15)

    holds.

  (iii) If the critical value is chosen such that

    (16)

    where and

    then

    (17)

    holds.

Remark

In the next section, we use a Gaussian approximation for both FDTP and FDR, for which the critical values are shown to be bounded. In this case, the number of hypotheses can be arbitrarily large while the critical value remains bounded. Due to sparsity, we use a Poisson approximation for the k-FWER, for which the critical value is no longer bounded as the number of hypotheses grows, and we require an additional growth condition.

2.2 Main results

Note that in Theorem 2.1, an unknown parameter and unknown functions are involved in the quantities defined in (10) and (11). For practical settings, we need to estimate them. We begin by assuming that we have a strongly consistent estimate of the unknown proportion of alternative hypotheses and then provide one such estimate in the next section. Given this estimate, note that the remaining unknown quantities can be estimated from the empirical distribution of the t-statistics, where

(18)

and that this empirical estimate is close to its population counterpart when the sample size is large, by (7). The next theorem, proved in the Appendix, provides a consistent estimate of the critical value.
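As a rough illustration of how such a data-driven critical value can be computed, the sketch below searches a grid of two-sided cut-offs and returns the smallest one at which a plug-in estimate of the FDR, built from the normal tail approximation and an estimate `pi1_hat` of the proportion of alternatives, drops below the target level. The plug-in formula is our own simplification for illustration and is not the estimator defined in (19).

```python
import numpy as np
from scipy.stats import norm

def fdr_critical_value(t_stats, pi1_hat, alpha=0.05, grid_size=2000):
    """Return the smallest cut-off x on a grid whose plug-in FDR estimate,
    (estimated false rejections) / (observed rejections), is <= alpha.
    Illustrative only; not the paper's exact construction."""
    t_abs = np.abs(np.asarray(t_stats, dtype=float))
    m = t_abs.size
    for x in np.linspace(0.0, t_abs.max(), grid_size):
        R = max(int(np.sum(t_abs >= x)), 1)             # rejections at cut-off x
        V_hat = 2.0 * m * (1.0 - pi1_hat) * norm.sf(x)  # expected false rejections under the null
        if V_hat / R <= alpha:
            return float(x)
    return float("inf")                                 # target level not attainable
```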

Theorem 2.2

Let

(19)

and

where the plug-in quantity is a strongly consistent estimate of the proportion of alternative hypotheses. Assume that the conditions of Theorem 2.1 are satisfied.

  (i) If the critical value is chosen such that

    (21)

    then

    (22)
  (ii) If the critical value is chosen such that

    (23)

    then

    (24)
  (iii) If the critical value is chosen such that

    (25)

    where and

    then, as long as , we have

    (26)
Remark

This theorem deals with the general dependence case, where the hypothesis indicators are assumed to follow a two-state hidden model and the data are generated independently conditional on those indicators. The proof is mainly based on the independence case, which we present in Section 2.4 below, plus a conditioning argument.

2.3 Estimating the proportion of alternative hypotheses

In the previous section, we assumed the availability of a consistent estimator of the proportion of alternative hypotheses. We now develop one such estimator. By the two-group nature of multiple testing, the distribution of the test statistics is essentially a mixture over the null and alternative hypotheses, with this proportion as a parameter. By virtue of moderate deviations, the distribution of the t-statistics can be accurately approximated under both the null and the alternative hypotheses. However, the alternative approximation involves an unknown mean and variance. We therefore apply a bounded functional transformation of the t-statistics to first obtain a conservative estimate of the proportion, which is consistent under certain conditions. The transformation is decreasing, bounded and has a bounded derivative. Hence, the induced function class is a Donsker class and thus also Glivenko–Cantelli. Let

(27)
Theorem 2.3

We have

If, in addition, we assume that

(28)

then

where

Proof

We can write

Conditional on the hypothesis indicators, the summands are independent random variables. We consider the term I first. Let

let the infinite sequence of hypothesis indicators be given and let the event be that the convergence in (5) holds. By assumption (5), we know that this event has probability one. Thus,

where the second equality follows from the fact that, conditional on the indicator sequence, the terms in the sum are i.i.d., so the standard Glivenko–Cantelli theorem applies. Arguing similarly, by conditioning on the sequence, we can also establish that

Now, note that . Thus, since a.s. and a.s., we have that when

We now have the following lower bound:

(29)

Define

Letting , we have a.s. Also,

Note that

Therefore,

Thus, we obtain

(30)

As a consequence of this theorem, we propose the following estimate of the proportion of alternative hypotheses:

(31)

where

Remark

If we use the estimate given in (31), then Theorem 2.2 yields a fully automated procedure for carrying out multiple hypothesis testing in very high dimensions in practical data settings.
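To make the plug-in workflow concrete end to end, here is a simple stand-in estimate of the proportion of alternatives based on exceedance frequencies (a Storey-type device, not the transformation-based estimator of (31)), followed by how it would feed into the critical-value sketch given before Theorem 2.2:

```python
import numpy as np
from scipy.stats import norm

def pi1_exceedance_estimate(t_stats, x0=2.0):
    """Conservative stand-in for the proportion of alternatives: compare the
    observed frequency of |T_i| >= x0 with the standard normal tail and
    attribute the excess to alternatives.  NOT the estimator (31)."""
    t_abs = np.abs(np.asarray(t_stats, dtype=float))
    null_tail = 2.0 * norm.sf(x0)                 # P(|N(0,1)| >= x0)
    excess = np.mean(t_abs >= x0) - null_tail
    return float(np.clip(excess / (1.0 - null_tail), 0.0, 1.0))

# Illustrative end-to-end use, with fdr_critical_value as sketched before
# Theorem 2.2 and T a vector of one-sample t-statistics:
#   pi1_hat = pi1_exceedance_estimate(T)
#   x_alpha = fdr_critical_value(T, pi1_hat, alpha=0.05)
#   rejected = np.abs(T) >= x_alpha
```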

2.4 Consistency and rate of convergence under independence

In order to prove the main results in the general, possibly dependent, t-test setting, we need results under the assumption of independence between the t-tests. Specifically, we assume in this section that the hypothesis indicators are independent, identically distributed Bernoulli random variables with a common success probability. This independence assumption can also yield stronger results than the more general setting and is of independent interest.

The next theorem, proved in the Appendix, provides a strongly consistent estimate of the critical value, as well as its rate of convergence.

Theorem 2.4

Let

(32)

and

Assume the conditions of Theorem 2.1, with (5) replaced by the assumption that the hypothesis indicators are i.i.d. with a common success probability. Consider the set containing the indices of the alternative hypotheses and assume, in addition, that the observations corresponding to these indices are i.i.d.

  (i) If the critical value is chosen such that

    (33)

    then

    (34)

    and

    (35)

    Here, the limiting critical value is the one defined in (A.78).

  (ii) If the critical value is chosen such that

    (36)

    then

    (37)

    and

    (38)

    Here, the limiting critical value is the one defined in (A.80).

  (iii) If the critical value is chosen such that

    (39)

    where and

    then

    (40)

    Here, the limiting critical value is the one defined in (A.82).

Remark

Under the corresponding condition in Theorem 2.4, it is not difficult to see that (34) and (35) remain valid with the FDTP replaced by the FDR. This shows that controlling the FDTP is asymptotically equivalent to controlling the FDR. This is also true in the more general dependence case. Thus, we will focus primarily on the FDR in our numerical studies.

Remark

Note that the proportion of alternative hypotheses is assumed to be known in order to obtain a precise rate of convergence for the FDTP and FDR. If this proportion is instead estimated with some rate of convergence, then the correct convergence rate for the "in probability" results for the FDR and FDTP would involve an additional term added in (35) and (38). It is unclear what the correction would be for the almost sure rates in (34) and (37). These corrections are beyond the scope of this paper and will not be pursued further here. Note that the rate of convergence of the proportion estimate is not needed for the main results presented in Sections 2.1–2.3.

3 Two-sample t-test

In this section, the results of the previous section are extended to the two-sample t-test setting. The estimator of the unknown proportion of alternative hypotheses remains the same as in the one-sample case, but with the statistic in (27) being the two-sample, rather than one-sample, t-statistic. Theoretical results on the rates of convergence under independence are also presented, as in the previous section.

3.1 Basic set-up and results

When two groups, such as a control group and an experimental group, are independent, which we assume here, a natural statistic to use is the two-sample t-statistic. As far as possible, we adopt the same notation as in the one-sample case, and we assume that (5) holds. We observe the random variables

with one index denoting the gene and the other the array, and with separate mean effects for each gene in the first and second groups. The sampling processes for the two groups are assumed to be independent of each other. The two sample sizes are assumed to be of the same order. We will also assume that, for each gene, the errors in each group are independent random variables with mean zero and finite group-specific variances. The null hypothesis is that the two mean effects are equal, the alternative hypothesis is that they differ, and the dependence across genes is assumed to be generated in the same manner as in the one-sample setting. Consider the two-sample t-statistic

where

Then

(41)
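A standard (Welch-type) form of the two-sample t-statistic, written in illustrative notation consistent with the verbal description (group sample sizes n_1 and n_2, group means, group sample variances; the symbols are ours), is:

```latex
% Two-sample (Welch-type) t-statistic for gene i (illustrative notation).
\[
  T_i \;=\; \frac{\bar{X}_i - \bar{Y}_i}
                 {\sqrt{\hat{\sigma}_{1i}^{2}/n_1 + \hat{\sigma}_{2i}^{2}/n_2}},
  \qquad
  \bar{X}_i = \frac{1}{n_1}\sum_{j=1}^{n_1} X_{ij}, \quad
  \bar{Y}_i = \frac{1}{n_2}\sum_{j=1}^{n_2} Y_{ij},
\]
where $\hat{\sigma}_{1i}^{2}$ and $\hat{\sigma}_{2i}^{2}$ are the usual
within-group sample variances for gene $i$.
```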

The two-sample t-statistic is one of the most commonly used statistics for constructing confidence intervals and carrying out hypothesis tests for the difference between two means. There are several premises underlying the use of two-sample t-tests. It is assumed that the data have been derived from populations with normal distributions. With moderate violation of this assumption, statisticians quite often still recommend using the two-sample t-test, provided the samples are not too small and are of equal or nearly equal size. When the populations are not normally distributed, it is a consequence of the central limit theorem that two-sample t-tests remain asymptotically valid. A more refined confirmation of this validity under non-normality, based on moderate deviations, is given in [7]. Furthermore, under the alternative hypothesis, the asymptotic results still hold, but with a shift in location similar to the one-sample case, under certain conditions, that is,

uniformly over the relevant range. Under assumption (5), asymptotic critical values to control the FDTP, FDR and k-FWER are very similar to those in the one-sample case, with the one-sample t-statistic replaced by the two-sample t-statistic. The following theorem, proved in the Appendix, is analogous to Theorem 2.1 and is a necessary first step.

Theorem 3.1

Assume that , , , , , , , , and that (5) is satisfied. Assume that there exist and such that

(42)

The conclusions of Theorem 2.1 then hold with the one-sample t-statistic replaced by the two-sample t-statistic.

3.2 Main results

The unknown parameter and functions in Theorem 3.1 are estimated in the same way as in the one-sample case, with the one-sample t-statistic replaced by its two-sample counterpart. The following theorem, the proof of which is given in the Appendix, gives our main results for two-sample t-tests.

Theorem 3.2

Assume that the conditions of Theorem 3.1 are satisfied. Replace the one-sample t-statistic by the two-sample t-statistic in Theorem 2.2. Let the plug-in quantity be a strongly consistent estimate of the proportion of alternative hypotheses, as in (31), using the two-sample t-statistic.

  (i) If the critical value is chosen such that

    (43)

    then

    (44)
  (ii) If the critical value is chosen such that

    (45)

    then

    (46)
  (iii) If the critical value is chosen such that

    (47)

    where and

    then, provided , we have

    (48)
Remark