Simultaneous critical values for t-tests in very high dimensions
Abstract
This article considers the problem of multiple hypothesis testing using t-tests. The observed data are assumed to be independently generated conditional on an underlying and unknown two-state hidden model. We propose an asymptotically valid, data-driven procedure to find critical values for rejection regions controlling the familywise error rate (FWER), false discovery rate (FDR) and the tail probability of false discovery proportion (FDTP) by using one-sample and two-sample t-statistics. We only require a finite fourth moment plus some very general conditions on the mean and variance of the population, by virtue of the moderate deviations properties of t-statistics. A new consistent estimator for the proportion of alternative hypotheses is developed. Simulation studies support our theoretical results and demonstrate that the power of a multiple testing procedure can be substantially improved by using critical values directly, as opposed to the conventional p-value approach. Our method is applied in an analysis of microarray data from a leukemia cancer study that involves testing a large number of hypotheses simultaneously.
Volume 17, Issue 1 (2011), pages 347-394. DOI: 10.3150/10-BEJ272.
t-tests in very high dimensions
Hongyuan Cao (hycao@uchicago.edu) and Michael R. Kosorok (kosorok@unc.edu)
Keywords: empirical processes; FDR; high dimension; microarrays; multiple hypothesis testing; one-sample t-statistics; self-normalized moderate deviation; two-sample t-statistics
1 Introduction
Among the many challenges raised by the analysis of large data sets is the problem of multiple testing. Examples include functional magnetic resonance imaging, source detection in astronomy and microarray analysis in genetics and molecular biology. It is now common practice to simultaneously measure thousands of variables or features in a variety of biological studies. Many of these high-dimensional biological studies are aimed at identifying features showing a biological signal of interest, usually through the application of large-scale significance testing. The possible outcomes are summarized in Table 1.
Hypothesis        | Accept | Reject | Total
Null true         | U      | V      | m_0
Alternative true  | T      | S      | m_1
Total             | m - R  | R      | m
Traditional methods that provide strong control of the familywise error rate (FWER) often have low power and can be unduly conservative in many applications. One way around this is to increase the number of false rejections one is willing to tolerate. This results in a relaxed version of FWER, the k-FWER, the probability of making at least k false rejections.
Benjamini and Hochberg [1] (hereafter referred to as "BH") pioneered an alternative. Define the false discovery proportion (FDP) to be the number of false rejections divided by the number of rejections, FDP = V/(R ∨ 1). The only effect of the maximum in the denominator is that the ratio is set to zero when R = 0. Without loss of generality, we treat 0/0 as 0 and define the false discovery tail probability FDTP = P(FDP > γ), where γ is prespecified, based on the application. Several papers have developed procedures for FDTP control. We shall not attempt a complete review here, but mention the following: van der Laan, Dudoit and Pollard [36] proposed an augmentation-based procedure, Lehmann and Romano [24] derived a step-down procedure and Genovese and Wasserman [16] suggested an inversion-based procedure, which is equivalent to the procedure of [36] under mild conditions [16].
The false discovery rate (FDR) is the expected FDP. BH provided a distribution-free, finite-sample method for choosing a p-value threshold that guarantees that the FDR is less than a target level α. Since this publication, there has been a considerable amount of research on both the theory and application of FDR control. Benjamini and Hochberg [2] and Benjamini and Yekutieli [3] extended the BH method to a class of dependent tests. A Bayesian mixture model approach to obtaining multiple testing procedures controlling the FDR is considered in [14], [30], [31], [32], [33]. Wu [39] considered the conditional dependence model under the assumption of Donsker properties of the indicator function of the true state for each hypothesis and derived asymptotic properties of false discovery proportions and numbers of rejected hypotheses. A systematic study of multiple testing procedures is given in the book [12]. Other related work can be found in [9], [10].
One challenge in multiple hypothesis testing is that many procedures depend on the proportion of null hypotheses, which is not known in reality. Estimating this proportion has long been known to be a difficult problem. There have been some interesting developments recently, for example, the approach of [26] (see also [14], [16], [25], [23]). Roughly speaking, these approaches are only successful under a condition which [16] calls the "purity" condition. Unfortunately, the purity condition depends on p-values and is hard to check in practice.
The general framework for FWER, FDTP and FDR control, and for the estimation of the proportion of alternative hypotheses, is based on p-values which are assumed to be known in advance or to admit accurate approximation. However, the assumption that p-values are always available is not realistic. In some special settings, approximate p-values have been shown to be asymptotically equivalent to exact p-values for controlling FDR [15], [22]. However, these approximations are only helpful in certain simultaneous error control settings and are not universally applicable. Moreover, if the p-values are not reliable, any procedures built on them are problematic.
This motivates us to propose a method to find critical values directly for rejection regions to control FWER, FDTP and FDR by using one-sample and two-sample t-statistics. The advantage of using t-tests is that they require minimal conditions on the population (only existence of the fourth moment, which is relatively easily satisfied by most statistical distributions), rather than more stringent conditions such as the existence of the moment generating function. In addition, we approximate tail probabilities accurately under both null and alternative hypotheses, whereas p-value approaches only consider the case under null hypotheses. Thus, a better ranking of hypotheses is obtained. Furthermore, we propose a consistent estimate of the proportion of alternative hypotheses which only depends on the test statistics. As long as the asymptotic distribution of the test statistic is known under the null hypothesis, we can apply our method to estimate this proportion, resulting in more precise cutoffs.
The BH procedure controls the FDR conservatively at π_0 α, where π_0 is the proportion of null hypotheses and α is the targeted significance level. If π_0 is much smaller than 1, then the statistical power is greatly compromised. The notion of power we use in this paper is that defined in [40]. In situations where t-statistics can be used, our procedure gives a better approximation, and more accurate critical values can be obtained by plugging in the estimate of this proportion. The validity of our approach is guaranteed by empirical process methods and recent theoretical advances on self-normalized moderate deviations, in combination with Berry-Esseen-type bounds for central and noncentral t-statistics.
To illustrate, we simulate a Markov chain, as in [34], of Bernoulli variables θ_i to indicate the true state of each hypothesis test (θ_i = 1 if the alternative is true; θ_i = 0 if the null is true). Conditional on the indicators, observations are generated according to the one-sample model of Section 2. The one-sample t-statistic is used to perform simultaneous hypothesis testing. Figure 1 shows the plot of 10 000 MCMC results of the realized and nominal FDR control based on the BH method for different control levels. From this plot, we can see that as the control level increases, the BH procedure becomes more and more conservative: the FDR actually obtained falls increasingly far below the nominal level, reflecting a significant loss in power.
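To make this setup concrete, the following is a minimal, self-contained sketch of this kind of simulation: a two-state Markov chain drives the hidden indicators, one-sample t-statistics are computed gene by gene, and the BH procedure is applied to normal-approximation p-values. The transition probabilities, effect size and sample sizes are illustrative choices, not those used in the paper.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
m, n, alpha = 2000, 30, 0.1

# Two-state Markov chain for the hidden indicators theta_i
# (illustrative transition probabilities, not the paper's).
p01, p10 = 0.05, 0.45
theta = np.zeros(m, dtype=int)
for i in range(1, m):
    u = rng.random()
    if theta[i - 1] == 0:
        theta[i] = 1 if u < p01 else 0
    else:
        theta[i] = 0 if u < p10 else 1

# Conditional on theta, generate normal data with a mean shift under H1.
X = rng.normal(0.8 * theta[:, None], 1.0, size=(m, n))

# One-sample t-statistics and two-sided normal-approximation p-values.
T = sqrt(n) * X.mean(1) / X.std(1, ddof=1)
p = np.array([erfc(abs(t) / sqrt(2.0)) for t in T])  # 2 * (1 - Phi(|t|))

# Benjamini-Hochberg step-up procedure at level alpha.
order = np.argsort(p)
below = p[order] <= alpha * np.arange(1, m + 1) / m
k = int(below.nonzero()[0].max()) + 1 if below.any() else 0
reject = np.zeros(m, dtype=bool)
reject[order[:k]] = True

R = int(reject.sum())
V = int((reject & (theta == 0)).sum())  # false discoveries (known in simulation)
fdp = V / max(R, 1)
print(R, V, round(fdp, 3))
```

Averaging the realized FDP over many such replications reproduces the gap between realized and nominal FDR visible in Figure 1.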
The three methods of multiple testing control we utilize are FWER, FDTP and FDR. The criterion for using FWER is, asymptotically,
(1) lim sup_{m→∞} P(V ≥ 1) ≤ α.
Since we only apply our method when there are discoveries (R ≥ 1), we need the FDTP, with a given proportion γ and significance level α, to satisfy, asymptotically,
(2) lim sup_{m→∞} P(V/R > γ | R ≥ 1) ≤ α.
Similarly, the criterion for using FDR is, asymptotically,
(3) lim sup_{m→∞} E[V/(R ∨ 1)] ≤ α.
The main contributions of this paper are as follows: (1) Self-normalized moderate deviation results from probability theory, which only require finiteness of the fourth moment of the population from which the t-statistic is computed, are applied to multiple testing. Thus, the applicability of the procedure is dramatically expanded: it can deal with non-normal populations and even highly skewed populations. (2) The critical values for rejection regions are computed directly, which circumvents the intermediate p-value step. (3) An asymptotically consistent estimator of the proportion of alternative hypotheses is developed for multiple testing procedures under very general conditions.
The remainder of the paper is organized as follows. In Section 2, we present the basic data structure, our goals, the procedures and theoretical results for the one-sample t-test. Two-sample t-test results are discussed in Section 3. Section 4 is devoted to numerical investigations using simulation and Section 5 applies our procedure to detect significantly expressed genes in a microarray study of leukemia cancer. Some concluding remarks and a discussion are given in Section 6. Proofs of results from Sections 2 and 3 are given in the Appendix.
2 One-sample t-test
In this section, we first introduce the basic framework for simultaneous hypothesis testing, followed by our main results. Estimation of the unknown proportion of alternative hypotheses is presented next. We conclude the section by presenting theoretical results for the special case of completely independent observations. This special setting is the basis for the more general main results and is also of independent interest since fairly precise rates of convergence can be obtained.
2.1 Basic framework
As a specific application of multiple hypothesis testing in very high dimensions, we use gene expression microarray data. At the level of single genes, researchers seek to establish whether each gene in isolation behaves differently in a control versus a treatment situation. If the transcripts are paired under the two conditions, then we can use a one-sample t-statistic to test for differential expression.
The mathematical model is
(4) X_{ij} = μ_i + ε_{ij},  j = 1, …, n; i = 1, …, m.
It should be noted that the following discussion is under this model and does not hold in general. Here, X_{ij} represents the expression level of the ith gene on the jth array. Since the subjects are independent, for each i, ε_{i1}, …, ε_{in} are independent random variables with mean zero and variance σ_i². The null hypothesis is H_{0i}: μ_i = 0 and the alternative hypothesis is H_{1i}: μ_i ≠ 0. For the relationship between different genes, we propose the conditional independence model, as follows. Let θ = (θ_1, θ_2, …) be a {0, 1}-valued stationary process and, given θ, the vectors (X_{i1}, …, X_{in}), i = 1, …, m, are independently generated. The dependence is imposed on the hypothesis indicators θ_i, where θ_i = 0 if the null hypothesis is true and θ_i = 1 if the alternative is true. From Table 1, we can see that m_0 = Σ_{i=1}^m (1 − θ_i) and m_1 = Σ_{i=1}^m θ_i. It is assumed that the θ_i satisfy a strong law of large numbers:
(5) m^{-1} Σ_{i=1}^m θ_i → π_1 almost surely as m → ∞, for some π_1 ∈ (0, 1).
This condition is satisfied in a variety of scenarios, for example, the independent case, Markov models and other stationary models. Consider the one-sample t-statistic
T_i = √n X̄_i / σ̂_i,
where X̄_i = n^{-1} Σ_{j=1}^n X_{ij} and σ̂_i² = (n − 1)^{-1} Σ_{j=1}^n (X_{ij} − X̄_i)².
If we use x > 0 as a cutoff, then the number of rejected hypotheses and the number of false discoveries are, respectively,
(6) R(x) = Σ_{i=1}^m 1{|T_i| ≥ x}  and  V(x) = Σ_{i=1}^m (1 − θ_i) 1{|T_i| ≥ x}.
Under the null hypothesis, it is well known that T_i follows a Student t distribution with n − 1 degrees of freedom if the sample is from a normal distribution. Asymptotic convergence to a standard normal distribution holds when the population is completely unknown, provided that it has a finite fourth moment under the null hypothesis. Moreover, under the alternative hypothesis, T_i can also be approximated by a normal distribution, but with a shift in location. We will show that
(7) P(T_i ≥ x | θ_i = 0) = Φ̄(x)(1 + o(1)),
(8) P(T_i ≥ x | θ_i = 1) = P(Z ≥ x − √n μ_i/σ_i)(1 + o(1)),
uniformly for x in the relevant moderate deviation range under some regularity conditions, where Z denotes a standard normal random variable, Φ̄ is the tail probability of the standard normal distribution and the critical values that control the FDTP and FDR asymptotically at prescribed levels are bounded. These assumptions are fairly realistic in practice. We do not require the critical value for FWER to be bounded. Although we do not typically know the indicators, means or variances in practice, we need the following theorem – the proof of which is given in the Appendix – as a first step. We will shortly extend this result, in Theorem 2.2 below, to permit estimation of the unknown quantities.
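As a quick numerical check of this normal tail approximation, the following sketch draws t-statistics from a markedly skewed population (centered exponential, so all moments are finite) and compares their Monte Carlo upper-tail frequencies with the normal tail Φ̄(x). The sample size and the grid of cutoffs are arbitrary illustrative choices.

```python
import numpy as np
from math import erfc

rng = np.random.default_rng(1)
n, reps = 30, 100_000

# Centered exponential errors: skewed and non-normal, but with finite
# moments of all orders, so the self-normalized theory applies.
X = rng.exponential(1.0, size=(reps, n)) - 1.0
T = np.sqrt(n) * X.mean(1) / X.std(1, ddof=1)

results = {}
for x in (1.5, 2.0, 2.5):
    emp = float((T >= x).mean())        # Monte Carlo tail of the t-statistic
    nrm = 0.5 * erfc(x / np.sqrt(2.0))  # standard normal tail
    results[x] = (emp, nrm)
    print(x, round(emp, 4), round(nrm, 4))
```

Even at this moderate n the two tails agree to within a modest factor, and the agreement improves, uniformly over a growing range of x, as n increases.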
Theorem 2.1
Assume that , , , and (5) is satisfied. Also, assume that there exist and such that
(9) 
Let
(10) 
and
(11) 

(i) If the critical value is chosen such that
(12) where is the th quantile of the standard normal distribution, then
(13) holds.

(ii) If the critical value is chosen such that
(14) then
(15) holds.

(iii) If the critical value is chosen such that
(16) where and
then
(17) holds.
In the next section, we use a Gaussian approximation for both FDTP and FDR, for which the critical values are shown to be bounded. In this case, m can be arbitrarily large, while the critical value remains bounded. Due to sparsity, we use a Poisson approximation for FWER, for which the critical value is no longer bounded as m → ∞, and an additional growth condition is required.
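The Poisson heuristic behind the FWER cutoff can be sketched as follows: if the m_0 null t-statistics have approximately standard normal tails, then V(x) is approximately Poisson with mean 2 m_0 Φ̄(x), so P(V(x) ≥ 1) ≈ 1 − exp(−2 m_0 Φ̄(x)). Solving for x gives a cutoff that grows without bound in m_0. This is an illustrative back-of-the-envelope computation, not the paper's exact construction.

```python
from math import erfc, log, sqrt

def fwer_cutoff(m0: int, alpha: float) -> float:
    """Two-sided cutoff x solving 1 - exp(-2 * m0 * normal_tail(x)) = alpha,
    i.e., the Poisson approximation to P(at least one false discovery)."""
    target = -log(1.0 - alpha) / (2.0 * m0)  # required normal tail value
    lo, hi = 0.0, 40.0                       # invert the tail by bisection
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * erfc(mid / sqrt(2.0)) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for m0 in (100, 10_000, 1_000_000):
    print(m0, round(fwer_cutoff(m0, 0.05), 3))
```

The cutoff grows roughly like √(2 log m_0), so it is unbounded in the number of tests, in contrast to the bounded FDTP and FDR cutoffs.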
2.2 Main results
Note that in Theorem 2.1, there are an unknown parameter and unknown functions involved in the critical value conditions. For practical settings, we need to estimate these quantities. We will begin by assuming that we have a strongly consistent estimate of the proportion of alternatives and will then provide one such estimate in the next section. Given this estimate, note that the remaining unknown can be estimated from the empirical distribution of the t-statistics, where
(18) 
and that is close to when is large, by (7). The next theorem, proved in the Appendix, provides a consistent estimate of the critical value .
Theorem 2.2
Let
(19) 
and
where is a strongly consistent estimate of . Assume that the conditions of Theorem 2.1 are satisfied.

(i) If the estimated critical value is chosen such that
(21) then
(22) 

(ii) If the estimated critical value is chosen such that
(23) then
(24) 

(iii) If the estimated critical value is chosen such that
(25) where and
then, as long as the required growth condition holds, we have
(26)
This theorem deals with the general dependence case, where the indicator sequence is assumed to follow a two-state hidden model and the data are generated independently conditional on it. The proof is mainly based on the independence case, which we present in Section 2.4 below, plus a conditioning argument.
2.3 Estimating the proportion of alternative hypotheses
In the previous section, we assumed that we had a consistent estimator of the proportion of alternatives. We now develop one such estimator. By the two-group nature of multiple testing, the test statistic is essentially a mixture of null and alternative components with this proportion as a parameter. By virtue of moderate deviations, the distribution of the t-statistics can be accurately approximated under both the null and alternative hypotheses. However, for the alternative approximation, an unknown mean and variance are involved. So, we consider a functional transformation of the t-statistics which has a ceiling, to first obtain a conservative estimate of the proportion which is consistent under certain conditions. Let and define . It is easy to see that is a decreasing function of , bounded by , and that the derivative is bounded by . Hence, the function class indexed by is a Donsker class and thus also Glivenko-Cantelli. Let
(27) 
Theorem 2.3
We have
If, in addition, we assume that
(28) 
then
where
Proof. We can write
Let . Conditional on , , are independent random variables. We consider I first. Let
let be the infinite sequence and let be the event that as . By the assumption (5), we know that . Thus,
where the second equality follows from the fact that, conditional on , the terms in the sum are i.i.d. and thus the standard Glivenko–Cantelli theorem applies. Arguing similarly, based on conditioning on the sequence we can also establish that
Now, note that . Thus, since a.s. and a.s., we have that when
We now have the following lower bound for :
(29) 
Define
Letting , we have a.s. Also,
Note that
Therefore,
Thus, we obtain
(30) 
As a consequence of this theorem, we propose the following estimate of the proportion of alternatives:
(31) 
where
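For intuition about why such a plug-in proportion estimate is feasible, the following sketch implements a simpler Storey-type exceedance estimator based on the normal tail approximation of the null t-statistics. This is not the transformation-based estimator (31) proposed here; the cutoff x and the simulation parameters are arbitrary illustrative choices.

```python
import numpy as np
from math import erfc, sqrt

def pi1_lower_bound(T: np.ndarray, x: float = 1.0) -> float:
    """Conservative (downward-biased) estimate of the proportion of
    alternatives from t-statistics T: compare the observed exceedance
    rate of |T| over x with the normal null tail. Illustrative
    Storey-type device, not the paper's estimator (31)."""
    null_tail = erfc(x / sqrt(2.0))       # P(|Z| >= x) for Z ~ N(0, 1)
    exceed = float(np.mean(np.abs(T) >= x))
    # Mixture: exceed = (1 - pi1) * null_tail + pi1 * (alt tail).
    # Bounding the alternative tail by 1 gives a conservative estimate.
    return max(0.0, (exceed - null_tail) / (1.0 - null_tail))

rng = np.random.default_rng(2)
m, n, pi1 = 5000, 50, 0.2
theta = rng.random(m) < pi1
X = rng.normal(1.0 * theta[:, None], 1.0, size=(m, n))
T = np.sqrt(n) * X.mean(1) / X.std(1, ddof=1)
est = pi1_lower_bound(T)
print(round(est, 3))
```

When the alternatives are well separated, as here, the lower bound is close to the true proportion; when they are weak, it is conservative, which is the motivation for the ceiling transformation used in (31).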
2.4 Consistency and rate of convergence under independence
In order to prove the main results in the general, possibly dependent, test setting, we need results under the assumption of independence between tests. Specifically, we assume in this section that the indicators θ_i are independent, identically distributed Bernoulli random variables with success probability π_1. This independence assumption can also yield stronger results than the more general setting and is of independent interest.
The next theorem, proved in the Appendix, provides a strongly consistent estimate of the critical value, as well as its rate of convergence.
Theorem 2.4
If in Theorem 2.4, then it is not difficult to see that (34) and (35) remain valid with replaced by . This shows that controlling FDTP is asymptotically equivalent to controlling FDR. This is also true in the more general dependence case. Thus, we will focus primarily on FDR in our numerical studies.

Remark. Note that is assumed to be known in order to get a precise rate of convergence for FDTP and FDR. If is estimated with rate of convergence , then the correct convergence rate for the "in probability" results for FDR and FDTP would involve an additional term added in (35) and (38). It is unclear what the correction would be for the almost sure rates in (34) and (37). These corrections are beyond the scope of this paper and will not be pursued further here. Note that this rate of convergence is not needed in the main results presented in Sections 2.1-2.3.
3 Two-sample t-test
In this section, the results of the previous section are extended to the two-sample t-test setting. The estimator of the unknown proportion remains the same as in the one-sample case, but with the statistic in (27) being the two-sample, rather than one-sample, t-statistic. Theoretical results for the rates of convergence under independence are also presented, as in the previous section.
3.1 Basic setup and results
When two groups, such as a control group and an experimental group, are independent, which we assume here, a natural statistic to use is the two-sample t-statistic. As far as possible, we adopt the same notation as used in the one-sample case, and we assume that (5) holds. We observe the random variables
with the index i denoting the ith gene, j indicating the jth array, and the two groups having their own mean effects for the ith gene. The sampling processes for the two groups are assumed to be independent of each other. The sample sizes n_1 and n_2 are assumed to be of the same order. We will also assume that, for each i, the errors in the first group are independent random variables with mean zero and variance σ_{1i}², and the errors in the second group are independent random variables with mean zero and variance σ_{2i}². The null hypothesis is that the two group means are equal, the alternative hypothesis is that they differ and the dependence is assumed to be generated in the same manner as in the one-sample setting. Consider the two-sample t-statistic
T_i = (X̄_i − Ȳ_i) / √(σ̂_{1i}²/n_1 + σ̂_{2i}²/n_2),
where X̄_i, Ȳ_i are the group sample means and σ̂_{1i}², σ̂_{2i}² are the group sample variances for the ith gene.
Then
(41) 
The two-sample t-statistic is one of the most commonly used statistics to construct confidence intervals and carry out hypothesis testing for the difference between two means. There are several premises underlying the use of two-sample t-tests. It is assumed that the data have been derived from populations with normal distributions. With moderate violation of this assumption, statisticians quite often recommend using the two-sample t-test, provided the samples are not too small and are of equal or nearly equal size. When the populations are not normally distributed, it is a consequence of the central limit theorem that two-sample t-tests remain asymptotically valid. A more refined confirmation of this validity under non-normality, based on moderate deviations, is shown in [7]. Furthermore, under the alternative hypothesis, the asymptotic results still hold, but with a shift in location similar to the one-sample case under certain conditions, that is,
uniformly in the relevant range, where the location shift is the two-sample analogue of the one-sample noncentrality. Under the assumption of (5), asymptotic critical values to control FDTP, FDR and FWER are very similar to those in the one-sample case, with the one-sample t-statistic replaced by the two-sample t-statistic. The following theorem, proved in the Appendix, is analogous to Theorem 2.1 and is a necessary first step.
3.2 Main results
The unknown parameter and functions in Theorem 3.1 are estimated similarly as in the one-sample case, with the one-sample t-statistic replaced by its two-sample counterpart. The following theorem, the proof of which is given in the Appendix, gives our main results for two-sample t-tests.
Theorem 3.2
Assume that the conditions in Theorem 3.1 are satisfied. Replace the one-sample t-statistic by the two-sample t-statistic in Theorem 2.2. Let the plug-in quantity be a strongly consistent estimate of the proportion of alternatives, as in (31), using the two-sample t-statistic.

(i) If the critical value is chosen such that
(43) then
(44) 

(ii) If the critical value is chosen such that
(45) then
(46) 

(iii) If the critical value is chosen such that
(47) where and
then, provided the required growth condition holds, we have
(48)