Analysis of error control in large-scale two-stage multiple hypothesis testing
Abstract
When dealing with the problem of simultaneously testing a large number of null hypotheses, a natural testing strategy is to first reduce the number of tested hypotheses by some selection (screening or filtering) process, and then to simultaneously test the selected hypotheses. The main advantage of this strategy is to greatly reduce the severe effect of high dimensions. However, the first screening or selection stage must be properly accounted for in order to maintain some type of error control. In this paper, we introduce a selection rule based on a selection statistic that is independent of the test statistic when the tested hypothesis is true. Combining this selection rule with the conventional Bonferroni procedure, we develop a powerful and valid two-stage procedure. The introduced procedure has several nice properties: (i) it completely removes the selection effect; (ii) it reduces the multiplicity effect; (iii) it does not “waste” data while carrying out both selection and testing. Asymptotic power analysis and simulation studies illustrate that the proposed method can provide higher power compared to usual multiple testing methods while controlling the Type 1 error rate. Optimal selection thresholds are also derived based on our asymptotic analysis.
AMS 1991 subject classifications. Primary 62J15, Secondary 62G10
KEY WORDS: screening, familywise error rate, filtering, high-dimensional, multiple testing
1 Introduction
Consider the multiple testing problem of simultaneously testing a large number of hypotheses. When the number of hypotheses is large, standard multiple testing procedures suffer from low “power” and are unable to distinguish between null and alternative effects, because extremely small p-values are required if one properly accounts for Type 1 error control, such as the familywise error rate (FWER); see Lehmann and Romano (2005). It is only by weakening the measure of error control, for instance to the false discovery rate (FDR), that some discoveries may be found (Benjamini and Hochberg, 1995). But such discoveries are not as forceful as those made while controlling the FWER.
When “most” null hypotheses are “true”, a common and useful approach is to first reduce the number of hypotheses being tested in order to construct methods which are better able to distinguish alternative hypotheses. That is, one applies some selection, filtering or screening technique based on selection statistics in order to reduce the number of hypotheses being tested. Then, one can use standard stepwise methods on the reduced set of tests. Such two-stage methods have been extensively used in practice to deal with various problems of multiple testing (McClintick and Edenberg, 2006; Talloen et al., 2007; Hackstadt and Hess, 2009). Throughout this paper, such approaches are called two-stage procedures: in the first stage, some screening or selection method is applied in order to reduce the number of tests; in the second stage, the selected hypotheses are tested. A major limitation of these methods is the lack of a systematic consideration of the selection effect. That is, one cannot simply apply some method to the reduced set of hypotheses without accounting for selection in error control: one cannot in general “forget” about the screening stage. In order to properly control Type 1 error rates, one must in general account for the screening stage by considering the error rate conditional on the method of selection. Otherwise, loss of Type 1 error control results, whether the criterion is the FDR, the FWER, or an alternative measure.
But if the screening statistics at the first stage are chosen to be independent of the testing statistics at the second stage (at least under the null hypothesis), then error control simplifies, as the conditional and unconditional distributions of each test statistic are the same (at least under its respective null distribution). Indeed, Bourgon, Gentleman, and Huber (2010) introduced such a novel approach of independence filtering to avoid the effect of selection, in which the selection or filtering statistics at the first stage are chosen to be independent of the test statistics (at least when the corresponding null hypotheses are true). Two new two-stage methods, which respectively combine the approach of independence filtering with the conventional Bonferroni and Benjamini-Hochberg procedures (Benjamini and Hochberg, 1995), are proposed and shown to control the FWER and FDR, respectively, under independence of the test statistics. Using the same idea of independence filtering, Dai et al. (2012) develop several two-stage testing procedures to detect gene-environment interactions in genome-wide association studies. Kim and Schliekelman (2016) further discuss some key questions on how to best apply the approach of independence filtering, and quantify the effects of the quality of the filter information, the filter cutoff and other factors on the effectiveness of the filter.
Another commonly used approach to avoid the selection effect is sample splitting, in which the data is split into two independent parts. One uses the first part of the data to construct the selection or filtering statistics and the second part to construct the test statistics. By combining sample splitting with conventional stepwise procedures, one can develop two-stage procedures that guarantee control of Type 1 error rates (Cox, 1975; Rubin, Dudoit, and van der Laan, 2006; Wasserman and Roeder, 2009). These methods completely remove the effect of selection; however, they often result in power loss due to the reduced sample size for testing (Skol et al., 2006; Fithian, Sun and Taylor, 2014).
In recent years, there has been a growing interest in selective inference (Benjamini and Yekutieli, 2005; Benjamini, 2010; Taylor and Tibshirani, 2015), and several novel breakthroughs have been made in the context of high-dimensional regression (Berk et al., 2013; Barber and Candès, 2015; Lee et al., 2016; Fithian et al., 2014). All of these developments take model selection rules as given and develop methods to perform valid inference after taking into account selection effects. Along these lines, a number of selective inference/post-selection inference methods have been developed for various model selection algorithms (Barber and Candès, 2016; Benjamini and Bogomolov, 2014; Fithian et al., 2015; Heller et al., 2016; Tian and Taylor, 2015a, b; Weinstein, Fithian and Benjamini, 2013; Yekutieli, 2012). In this literature, the problem of how to choose selection rules is often overlooked; in practice, however, one can often choose a selection rule so as to obtain favorable conditional properties of inference after selection. In contrast, rather than treat the selected hypotheses as given, we propose a rule for both stages so that the overall procedure has good unconditional error control properties.
Another popular way of exploiting information in the data is, rather than completely eliminating tests under consideration, to construct weights for the null hypotheses and then develop data-driven weighted multiple testing procedures (Roeder and Wasserman, 2009; Poisson et al., 2012). Data-driven weighted methods are quite general, and filtering methods can be regarded as a special case. A limitation of such methods is that it is not clear how to assign weights in a data-driven way so as to ensure control of the FWER or FDR. Very recently, by using “covariates” to construct weights which are independent of the test statistics under the null hypotheses, several Bonferroni-based and Benjamini-Hochberg-based data-driven weighted methods have been developed that increase power while controlling the FWER and FDR, respectively (Finos and Salmaso, 2007; Ignatiadis et al., 2016; Li and Barber, 2016; Lei and Fithian, 2016; Ignatiadis and Huber, 2017). In addition, several other ways of using such additional covariate information to develop more powerful multiple testing methods have recently been introduced in the literature, such as local FDR based approaches (Cai and Sun, 2009), the stratified Benjamini-Hochberg method (Yoo et al., 2010), the grouped Benjamini-Hochberg method (Hu, Zhao and Zhou, 2010), and the single-index modulated method (Du and Zhang, 2014).
In summary, there is a growing literature of approaches to dimension reduction in high-dimensional (single and multiple) hypothesis testing, including some useful, novel, and somewhat ad hoc procedures. The contribution of this paper is to perform a detailed error analysis in a large-scale setting. We consider an ideal Gaussian model, as is often assumed in the literature, described in the setup in Section 2. There, we introduce a specific two-stage procedure that we will analyze and compare later with other procedures. Control of the FWER is presented, though a less formal argument already appears in Bourgon, Gentleman and Huber (2010). (The analysis applies to the joint but single testing problem of testing all means zero against the alternative that at least one is not, but the exposition emphasizes the multiple testing problem.) The remainder of the paper is new. In Section 3, under a large-scale asymptotic framework with a sparsity assumption on the number of false hypotheses, we present detection boundaries for mean levels that can (or cannot) be detected by the two-stage procedure. In Section 4, a refinement is obtained so that the exact cutoff is calculated. Section 5 considers the unknown-variance case, where the basic finite-sample control of the FWER is replaced by asymptotic control, but the same power analysis holds as when the variance is known. In Section 6, we allow for dependence between the test statistics. Section 7 theoretically compares the two-stage approach with other methods: the Bonferroni and split-sample methods. By proper choice of how to split, the split-sample technique can only perform as well as Bonferroni, with neither approach performing as well as the two-stage method. A simulation study is presented in Section 8. Both global tests of a single hypothesis (in a high-dimensional setting) as well as multiple tests are considered.
In the former case, the Higher Criticism (Donoho and Jin, 2004; Donoho and Jin, 2015) is also compared (but it cannot readily be used in the multiple testing case). In both cases, the two-stage approach offers control of the Type 1 error rate and performs quite well under various scenarios. In particular, the two-stage method shows good performance even when variances are unequal, and especially under dependence.
2 The setup
A very stylized Gaussian setup is assumed, as is conventional in large-scale testing. The problem is that of testing the means of a large number of independent populations.
Assume that, for each hypothesis, a sample from a normal population with unknown mean and variance is observed. The number of hypotheses of interest equals the number of samples or populations, each with its own sample size, and the samples are assumed mutually independent. When the sample sizes are large, it is typically assumed that the variances are known as well, in which case one can reduce to the sample means (by sufficiency). For now, we will assume unit variances and a common sample size, though we will discuss the unknown-variance case later.
For $i = 1, \ldots, s$, where $s$ denotes the number of hypotheses, consider testing the hypotheses
$$H_i: \mu_i = 0 \quad \text{versus} \quad \mu_i \neq 0.$$
(One may also treat the case of one-sided alternatives with easy modifications.) Define the following two statistics
$$S_i = \sum_{j=1}^{n} X_{i,j}^2, \tag{1}$$
and
$$t_i = \frac{\sqrt{n}\,\bar{X}_i}{\hat{\sigma}_i}, \tag{2}$$
where $\bar{X}_i$ and $\hat{\sigma}_i^2$ are respectively the sample mean and (unbiased) sample variance for the $i$-th sample, i.e., $\bar{X}_i = n^{-1} \sum_{j=1}^{n} X_{i,j}$ and $\hat{\sigma}_i^2 = (n-1)^{-1} \sum_{j=1}^{n} (X_{i,j} - \bar{X}_i)^2$, with $X_{i,1}, \ldots, X_{i,n}$ denoting the observations in the $i$-th sample of common size $n$.
The basic two-stage strategy for our method is as follows. The sum-of-squares statistics in (1) are first used to “select” which of the hypotheses to “test” in the second stage, at which point the $t$-statistics in (2) are used. There are various choices for the selection statistics, as well as the test statistics; for example, one could use the $t$-statistics in both stages. Regardless, the first consideration would then be how to set critical values in each stage in order to ensure some measure of Type 1 error control, such as the familywise error rate (FWER), the probability of at least one false rejection. We will be specific about the critical values soon, but the key motivation for the choice of the sum-of-squares selection statistic and $t$ test statistic is based on the following well-known facts. First, under the null hypothesis (and unit variance) we have that the sum of squares in (1) has the chi-squared distribution with $n$ degrees of freedom, and the $t$-statistic in (2) has the $t$ distribution with $n-1$ degrees of freedom, where $n$ is the common sample size. But the more important reason motivating our choice is that, by Basu’s theorem, the two statistics are independent under the null hypothesis (Lehmann and Romano, 2005). Note that the sum of squares decomposes as $(n-1)\hat{\sigma}_i^2 + n\bar{X}_i^2$, so that, for a given sample variance, larger values of the sum of squares correspond to larger values of the squared $t$-statistic.
A simple selection rule is used for deciding which hypotheses are to be tested at the second stage. Given a threshold $c$, a hypothesis is selected iff its sum-of-squares statistic exceeds $c$. Let $K$ denote the number of selected hypotheses. At the second stage, one can simply apply the Bonferroni test to the selected hypotheses; that is, reject a selected hypothesis iff the absolute value of its $t$-statistic exceeds the $1 - \alpha/(2K)$ quantile of the $t$ distribution with $n-1$ degrees of freedom.
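The two stages just described can be sketched as follows. This is our own minimal illustration, not the authors' code: the function name `two_stage_bonferroni` and the defaults `alpha = 0.05`, `gamma = 0.1` (the selection fraction) are assumptions, and unit-variance data with a two-sided alternative are assumed as in the setup above.

```python
# Sketch of the two-stage procedure: chi-squared sum-of-squares selection,
# then Bonferroni over the selected hypotheses using t-statistics.
import numpy as np
from scipy import stats

def two_stage_bonferroni(X, alpha=0.05, gamma=0.1):
    """X: s x n array; row i holds the n observations for hypothesis i.

    Stage 1: select hypothesis i iff sum_j X[i, j]^2 exceeds the (1 - gamma)
    quantile of chi-squared with n degrees of freedom (its null law when the
    variance is one).  Stage 2: two-sided Bonferroni at level alpha over the
    K selected hypotheses, using the one-sample t-statistic (independent of
    the selection statistic under the null, by Basu's theorem)."""
    s, n = X.shape
    sum_sq = (X ** 2).sum(axis=1)
    c = stats.chi2.ppf(1 - gamma, df=n)           # selection threshold
    selected = np.flatnonzero(sum_sq > c)
    K = selected.size
    if K == 0:
        return selected, np.array([], dtype=int)
    t = np.sqrt(n) * X[selected].mean(axis=1) / X[selected].std(axis=1, ddof=1)
    crit = stats.t.ppf(1 - alpha / (2 * K), df=n - 1)  # Bonferroni cutoff
    rejected = selected[np.abs(t) > crit]
    return selected, rejected

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
X[0] += 4.0                                        # one strong false null
sel, rej = two_stage_bonferroni(X)
```

With these hypothetical settings, roughly a tenth of the true nulls pass the first stage, so the Bonferroni correction at the second stage divides by about 100 rather than 1000.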
Lemma 2.1
For any choice of the threshold $c$, the above two-stage procedure controls the FWER at level $\alpha$.
As with all proofs, see the appendix.
Remark 2.1
The proof of Lemma 2.1 requires only that each test statistic be independent of all of the selection statistics when its null hypothesis is true. Note that it is not required that the test statistics be jointly independent of the selection statistics.
More generally, the two-stage procedure controls the familywise error rate whenever each test statistic is independent of the selection statistics under its null hypothesis, even outside our stylized Gaussian model.
The simple two-stage method can be improved by a Holm-type step-down refinement. To describe the method, simply apply the Holm method (Holm, 1979) to the p-values of the selected set of hypotheses. More specifically, let the marginal p-value for a hypothesis be computed from its $t$-statistic; in the model above, this is just the probability that a $t$ distribution with $n-1$ degrees of freedom exceeds the observed value of the statistic. Set the p-value to one if the hypothesis is not selected, and equal to its marginal p-value if it is selected. Let
$$p_{(1)} \leq p_{(2)} \leq \cdots$$
denote the ordered p-values, so that $p_{(j)}$ is the $j$-th most significant p-value. Now, apply Holm’s procedure based on these p-values, with the number of selected hypotheses $K$ in place of the total number of hypotheses: the hypothesis corresponding to $p_{(j)}$ is rejected if $p_{(i)} \leq \alpha/(K - i + 1)$ for all $i \leq j$.
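The Holm refinement on the selected set can be sketched as follows. This is our own illustration: the function name `holm_on_selected` and the parameter defaults are assumptions, and two-sided marginal p-values are assumed as in the setup above.

```python
# Sketch of the Holm step-down refinement: unselected hypotheses receive
# p-value 1; Holm at level alpha runs over the K selected hypotheses with
# cutoffs alpha/K, alpha/(K-1), ...
import numpy as np
from scipy import stats

def holm_on_selected(X, alpha=0.05, gamma=0.1):
    s, n = X.shape
    sum_sq = (X ** 2).sum(axis=1)
    selected = sum_sq > stats.chi2.ppf(1 - gamma, df=n)
    K = int(selected.sum())
    t = np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)
    pvals = 2 * stats.t.sf(np.abs(t), df=n - 1)   # two-sided marginal p-values
    pvals[~selected] = 1.0                        # unselected: never rejected
    order = np.argsort(pvals)
    reject = np.zeros(s, dtype=bool)
    for j, idx in enumerate(order[:K]):           # step down through K cutoffs
        if pvals[idx] <= alpha / (K - j):
            reject[idx] = True
        else:
            break                                 # Holm stops at first failure
    return reject

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 10))
X[0] += 4.0
rej = holm_on_selected(X)
```

Since the Holm cutoffs dominate the single Bonferroni cutoff $\alpha/K$, this variant rejects everything the simple two-stage Bonferroni method rejects, and possibly more.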
Theorem 2.1
Under the setting of Lemma 2.1, apply the Holm method to the selected set of hypotheses. Then, this modified procedure controls the FWER at level $\alpha$.
Thus, one can do even better by using a Holm-like step-down method, or even a step-down version of Sidak’s procedure; see Lehmann and Romano (2005) and Guo and Romano (2007). Indeed, conditional on the selection statistics, all p-values computed from true null “detection” statistics at the second stage are conditionally uniform on (0, 1), and hence unconditionally as well. Thus, any multiple testing method based on p-values is available. For example, one can also apply the Benjamini-Hochberg procedure to the selected p-values for controlling the false discovery rate (Benjamini and Hochberg, 1995). In all such cases, the motivation is that gains are possible because only a reduced number of hypotheses are tested at the second stage, with the hope of increased ability to detect or discover false null hypotheses. Furthermore, both the selection and detection stages are based on the full data (rather than on a split sample, as used elsewhere to obtain independence of the stages), and there is no selection effect, because of the independence between the selection and test statistics when the corresponding hypothesis is true.
So far, the threshold for selection has been generically set at some constant $c$. We now discuss this choice. For our method, we will choose $c$ of the form $c = \chi_n^2(1-\gamma)$, the $1-\gamma$ quantile of the chi-squared distribution with $n$ degrees of freedom. Since each selection statistic has this null distribution when the corresponding hypothesis is true, such a selection threshold ensures that roughly a fraction $\gamma$ of the hypotheses are selected, at least if most null hypotheses are true. The question now is how to choose $\gamma$, a given positive constant between zero and one; roughly $\gamma s$ of the $s$ hypotheses are then selected for testing. A choice of $\gamma$ must still be specified.
Since Type 1 error control is ensured regardless of the choice of the threshold, we now turn to studying the power of the procedure. In our asymptotic analysis, the following is assumed.
Assumption A: , where is a nonnegative constant.
Note that as is equal to , or , the values of are respectively and . So, it is reasonable and often sufficient to characterize the relationship between and by imposing Assumption A. In applications, and (and hence ) are known, and generally we will have . We will consider the probability of rejecting a null hypothesis having mean , which without loss of generality can be taken to be positive. Further assume without loss of generality that it is that is false with mean . If is constant, then under Assumptions A and , we have . On the other hand, if varies with (and ) such that as approaches infinity, then Finally, if the sample size is very large, so that is very small compared to the sample size , then the value of should be taken to be 0. In the following, we mainly perform asymptotic power analyses under Assumption A. Sometimes, is assumed, in which case the case can either be treated separately with ease, or by a limiting argument as tends to zero.
3 Power analysis of twostage procedure
In order to analyze the power of the two-stage procedure, we break the analysis into two parts. The first part analyzes the probability of “selection” in the first stage, while the second analyzes the probability of “detection” in the second stage. Rejection of a false null hypothesis occurs when it has both been selected at the first stage and then detected at the second stage. Roughly, the basic goal is to determine how large in absolute value an alternative mean must be in order to ensure that the probability of rejection tends to one.
3.1 The probability of selecting
Consider the case where the mean of a given hypothesis is a nonzero constant, so that the hypothesis is false. We now consider the asymptotic behavior of the probability that this hypothesis is selected in the first stage of the two-stage procedure. Recall that the selection threshold is an upper quantile of the chi-squared distribution with $n$ degrees of freedom, where $n$ is the common sample size; the hypothesis is selected if its sum-of-squares statistic exceeds this threshold.
Lemma 3.1
(i) Under Assumption A, if
(3) 
then
(ii) Under Assumption A, if
(4) 
then
In Lemma 3.1(i), if , then the condition (3) always holds, while in (ii) if the condition (4) never holds, which implies is selected with probability tending to one.
Note that there exists a gap between the two detection thresholds in Lemma 3.1, but we will derive an improved, exact result in Section 4.
3.2 The probability of detecting
We now consider the probability that a false null hypothesis is detected at the second stage using its $t$-statistic; that is, we analyze the probability that the $t$-statistic exceeds the Bonferroni critical value, regardless of whether or not the hypothesis is selected at the first stage. Later, we will analyze the two stages jointly; for now, note that if a null hypothesis is false, then it is no longer the case that its selection statistic and detection statistic are independent.
First, in order to understand the detection probability, we need to understand the number of selections $K$ from the first stage (as it is random). Let $I_0$ denote the set of indices of true null hypotheses and $I_1$ the set of indices of false null hypotheses, with $s_0$ and $s_1$ denoting the number of true and false null hypotheses, respectively.
We will assume some degree of sparsity in the sense
(5) 
for some . We will even allow , treating the “needle in the haystack” problem, where exactly one alternative hypothesis is true.
Lemma 3.2
The number of selected hypotheses satisfies
(6) 
If we assume the sparsity condition (5), then
(7) 
and
(8) 
as long as .
Lemma 3.3
Under Assumptions A and (5), we have

when , ;

when , .
Obviously, if , then for any .
3.3 Asymptotic power analysis
We now combine the two stages to determine the mean values that lead to rejection of a false null hypothesis. Let one event be that the hypothesis is selected in the first stage, and the other that its $t$-statistic exceeds the Bonferroni critical value at the second stage; note that the two events are dependent in general. Then, the power of the two-stage method, i.e., the probability that the false null hypothesis is rejected, is
(9) 
Therefore, in order for rejection to occur with probability tending to one, it is sufficient to show that both the selection and detection events have probability tending to one. Also, we have
(10) 
Theorem 3.1
Under Assumption A and (5), we have

when ,

when
Corollary 3.1
Under Assumption A with , for any given , (5) and any ,
Of course, in multiple testing problems, there are many notions of power one might wish to maximize: the probability of rejecting at least one false null hypothesis, the probability of rejecting all false null hypotheses, the probability of rejecting at least a given number of false null hypotheses, the expected number (or proportion) of rejections among the false null hypotheses, etc. Theorem 3.1 and Corollary 3.1 apply directly to the expected proportion of false null hypotheses rejected. For example, in the setting where all false null hypotheses have a common mean, the expected proportion of correct rejections equals the probability that any particular one of them is rejected, which tends to one (or not) based on the threshold for the mean.
4 Further improvement
In order to improve Theorem 3.1, we need to derive improved bounds on extreme Chisquared quantiles. (Note the slack in the bounds provided in Lemmas 9.1 and 9.2.)
Let
(11) 
which is increasing on . Then, define
(12) 
which is decreasing in .
Lemma 4.1
Given the value used in stage one for selection with , and in Assumption A, with , define to be the solution of the equation
(13) 
(i) For any and sufficiently large ,
(14) 
(i) For any and sufficiently large ,
(15) 
Lemma 4.2
Under Assumption A and (5), we have

when , .

when , .
Theorem 4.1
Under Assumption A and (5), we have

when ,

when ,
Remark 4.1
Theorem 4.1 offers an approach to determining the value of the tuning parameter . By minimizing the right-hand side of the inequality in Theorem 4.1 (i) or (ii) with respect to , one can determine an optimal value of for each given value of , which asymptotically maximizes the probability of detecting any false null, or the average power. As seen from Figure 4.1 (left), the chosen value of is decreasing in . Note that , thus is roughly increasing in if is fixed and decreasing in if is fixed. For instance, suppose and ; then . By checking Figure 4.1 (left), the determined value of is about , which implies that about hypotheses are selected in the first stage for detection.
Based on the optimal value of , we can use Theorem 4.1 to determine the upper bound of the squared mean for our suggested two-stage Bonferroni procedure, which constitutes a sharp detection threshold: when the squared mean is larger than the bound, we can always detect the false null. Similarly, we can use Theorem 7.1 to determine the detection threshold for the conventional Bonferroni procedure. Figure 4.1 (right) shows the detection thresholds for these two procedures. As seen from Figure 4.1 (right), the detection thresholds of our suggested procedure are always lower than those of the conventional Bonferroni procedure for different values of , and their differences become increasingly large with increasing . This implies that our suggested two-stage Bonferroni procedure is more powerful than the conventional Bonferroni procedure, and its power improvement over the Bonferroni procedure becomes increasingly large with increasing . Specifically, the detection threshold of our suggested procedure is almost linear in with slope about , while that of the conventional Bonferroni procedure is an exponential function of .
5 Estimating
The goal of this section is to show that asymptotic control of the FWER is retained when the variances are equal but unknown and the common variance is estimated. To this end, let the overall estimator of the common variance be one which satisfies
(16) 
actually, (16) can be weakened but it holds if we take the average or median of the sample variances computed from each of the samples. Consider the modified procedure based on the selection set
(17) 
where the scaling is by the variance estimate and the threshold is the critical value used for selection when it is known that the variance is one. The modified two-stage procedure is identical in the second stage: each selected hypothesis is rejected if its corresponding $t$-statistic exceeds the Bonferroni critical value of the $t$ distribution with $n-1$ degrees of freedom, where the Bonferroni correction is based on the number of hypotheses selected at the first stage.
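The modified first stage can be sketched as follows. This is our own illustration under stated assumptions: we pool via the average of the per-sample variances, one of the choices the text notes satisfies (16); the function name and the default `gamma` are ours.

```python
# Sketch of stage-one selection with unknown common variance: the
# chi-squared threshold is scaled by a pooled variance estimate.  The
# second stage (t-test Bonferroni) is unchanged, since the t-statistic
# is scale-free.
import numpy as np
from scipy import stats

def select_estimated_variance(X, gamma=0.1):
    s, n = X.shape
    sigma2_hat = X.var(axis=1, ddof=1).mean()    # average of the s sample variances
    sum_sq = (X ** 2).sum(axis=1)
    c = stats.chi2.ppf(1 - gamma, df=n)          # threshold for unit variance
    return np.flatnonzero(sum_sq > sigma2_hat * c)

rng = np.random.default_rng(2)
X = 2.0 * rng.standard_normal((1000, 10))        # common sigma = 2, unknown to the method
X[0] += 8.0                                      # strong false null
sel = select_estimated_variance(X)
```

Note that each sample variance is computed about its own sample mean, so the false nulls do not inflate the pooled estimate; under sparsity, the estimate is close to the common variance and roughly a fraction `gamma` of the hypotheses are selected.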
Theorem 5.1
Assume Assumption A.
(i) For , the above modified two-stage procedure asymptotically controls the familywise error rate as .
(ii) For and , the above modified two-stage procedure asymptotically controls the familywise error rate as . In fact, the same is true if
where
and defined in (13).
Remark 5.1
The power analysis used to derive Theorems 3.1 and 4.1 applies equally well to the above modified procedure when the variance is estimated. At the second stage, the detection probability analysis remains completely unchanged, since there is no modification in the second stage. In the first stage, the argument for selection can be combined with assumption (16) to yield the same results, as the argument is essentially the same.
6 Dependence
We now extend the two-stage method to the case where the test statistics are dependent. The setup is similar to that described in Section 2. Assume we have i.i.d. vector observations whose components may be dependent; as before, the $i$-th null hypothesis specifies that the $i$-th mean is zero. (Note that it is not necessary to assume the observation vectors are multivariate Gaussian, only that the one-dimensional marginal distributions are Gaussian.) We first discuss the case of known variance; for convenience, we still assume unit variances. The two-stage procedure is based on the same selection and detection statistics as before: any hypothesis whose sum-of-squares selection statistic exceeds the threshold is selected, and a selected hypothesis is then rejected if its $t$-statistic also exceeds the Bonferroni critical value based on the number of selections at the first stage. Let the set of indices of the selected true null hypotheses be defined by
We make the following assumptions regarding and , in which the assumption regarding was already shown to hold under independence in Lemma 3.2.
Assumption B1: as where is a fixed constant.
In assumption B1, corresponds to sparsity. By assumption B1, we have
(18) 
so one can expect the following assumption B2:
Assumption B2: as .
Theorem 6.1
Assume Assumptions B1 and B2. The two-stage procedure discussed in Lemma 2.1 with asymptotically controls the familywise error rate at level .
Remark 6.1
It is interesting to note that in Theorem 6.1 we make no assumption on the dependence structure of the false null statistics; only a weak dependence condition is imposed on the true null statistics.
Remark 6.2
Remark 6.3
When the selection statistics are weakly dependent, assumption B2 is satisfied. In the following, we present an example of block dependence satisfying assumption B2.
Suppose the selection statistics form mutually independent blocks, with the number of blocks tending to infinity and the block sizes growing slowly relative to the number of hypotheses. In the following, we show that assumption B2 is satisfied under such block dependence. Note that
Thus, by block independence of , we have
We know that
Combining the above two inequalities,
Note that
By Chebyshev’s inequality, we have
and thus assumption B2 is satisfied.
When the variances are equal but unknown and the common variance is estimated, we consider the modified two-stage procedure discussed in Theorem 5.1. Using arguments similar to those in the proof of Theorem 6.1, we can also show that asymptotic control of the FWER is retained for this procedure under dependence.
For any given and , define
In addition to assumption B1, we also make the following two assumptions regarding and :
Assumption B3: .
Assumption B4: as where for some slowly.
We should note that assumption B3 already appeared in Section 5, and that assumption B4 is a slight extension of assumption B2.
Theorem 6.2
Assume Assumptions B1, B3 and B4. The two-stage procedure discussed in Theorem 5.1 asymptotically controls the familywise error rate at level .
When the selection statistics are block dependent, if the overall estimate is chosen as
we can similarly show that assumptions B3 and B4 are satisfied under block dependence, using arguments similar to those in the known-variance case, where we showed in Remark 6.3 that assumption B2 is satisfied under block dependence.
7 Alternative Methods
In this section, we perform a corresponding power analysis for some alternative methods.
7.1 Bonferroni
First, we consider the Bonferroni method, which rejects a hypothesis when its $t$-statistic exceeds the Bonferroni critical value based on all $s$ hypotheses. We consider the power, or rejection probability, for a false null hypothesis with a given mean.
Theorem 7.1
Assume Assumption A. For the original Bonferroni method,
(i) when ,
(ii) when ,
Remark 7.1
In Theorem 7.1, if , then the stated condition in (i) always holds, which implies is rejected by the Bonferroni procedure with probability tending to one. On the other hand, the stated condition in (ii) holds for any large if is large enough, which implies is rejected with probability tending to zero.
Remark 7.2
In the case of known variance, one can use a statistic with a normal quantile . Similar to the proof of Theorem 7.1, it can be shown that the threshold can be replaced by .
7.2 Split Sample Method
A common way (Skol et al., 2006; Wasserman and Roeder, 2009) to achieve a reduction in the number of tests is to split the sample into two independent parts. The first part of the observations is used to determine which hypotheses will be selected; those selected hypotheses are then tested based on the remaining, independent observations. Since the two subsamples are independent (as we have been assuming all observations are i.i.d.), it is easy to control the FWER. Indeed, suppose the first subsample produces a reduced set of selected hypotheses. Then, the Bonferroni procedure applied to the remaining observations, with the correction based on the number of selected hypotheses, evidently controls the FWER. Specifically, a $t$-statistic computed on the first subsample is used for selection: a hypothesis is selected if this statistic exceeds some cutoff, taken to be an upper quantile of its null distribution for some fixed tail probability. A selected hypothesis is then rejected at the second stage if, in addition, its $t$-statistic computed on the second subsample exceeds the corresponding Bonferroni critical value.
For any cutoff used for selection, this procedure controls the FWER. We would like to determine the smallest mean value for which such a procedure has limiting power one.
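The split-sample comparator can be sketched as follows. This is our own illustration: the equal split `frac = 0.5`, the two-sided selection cutoff, and the names are assumptions consistent with the description above, not the authors' exact choices.

```python
# Sketch of the split-sample Bonferroni method: the first n1 observations
# of each sample give selection t-statistics, the remaining n - n1 give
# testing t-statistics; Bonferroni runs over the R selected hypotheses.
import numpy as np
from scipy import stats

def split_sample_bonferroni(X, alpha=0.05, gamma=0.1, frac=0.5):
    s, n = X.shape
    n1 = int(frac * n)
    X1, X2 = X[:, :n1], X[:, n1:]
    t1 = np.sqrt(n1) * X1.mean(axis=1) / X1.std(axis=1, ddof=1)
    c1 = stats.t.ppf(1 - gamma / 2, df=n1 - 1)        # two-sided selection cutoff
    selected = np.flatnonzero(np.abs(t1) > c1)
    R = selected.size
    if R == 0:
        return selected, np.array([], dtype=int)
    n2 = n - n1
    t2 = np.sqrt(n2) * X2[selected].mean(axis=1) / X2[selected].std(axis=1, ddof=1)
    c2 = stats.t.ppf(1 - alpha / (2 * R), df=n2 - 1)  # Bonferroni over R tests
    rejected = selected[np.abs(t2) > c2]
    return selected, rejected

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 20))
X[0] += 4.0
sel, rej = split_sample_bonferroni(X)
```

Unlike the two-stage method of Section 2, the testing statistics here use only half of the data, which is the source of the power loss quantified in Theorem 7.2.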
Theorem 7.2
Assume Assumption A. Also assume and the sparsity condition (5). For the above split sample method,
(i) when ,
(ii) when ,
Remark 7.3
By Theorem 7.2, the detection threshold (or rather its square) of the split sample method is equal to
which depends on , which we set as , a choice of , as well as the choice of to determine the split sample sizes. We want the threshold to be as small as possible. With fixed, minimizing over both and requires minimizing . If is fixed, the optimizing choice of is , in which case the threshold becomes , which is the same as for the original Bonferroni procedure. Note that there are infinitely many optimizing combinations of and as long as . Regardless, no claim of an improvement over the Bonferroni procedure can be made. (On the other hand, one could also apply the split-sample method with the Holm method in the second stage, which, compared to the usual Holm method based on the full data, could offer an improvement because the critical values now change more rapidly at each step.)
8 Simulation Studies
In this section, we performed two simulation studies to evaluate the performance of our suggested two-stage Bonferroni method, first as a high-dimensional global testing method and then as an FWER controlling method.
8.1 Numerical comparison for high dimensional global tests
We performed a simulation study to compare the performance of our suggested modified two-stage Bonferroni method (see Section 5) with those of several existing global testing methods with respect to Type 1 error rate and power. The methods chosen for comparison include the conventional Bonferroni test, the Simes test (Simes, 1986), the Higher Criticism method (Donoho and Jin, 2004, 2015), and the sample-split Bonferroni test (Cox, 1975; Skol et al., 2006).
Each simulated data set is obtained by generating dependent normal random samples , with a common correlation and a sample size . Among the 1,000 mean values , or are drawn from and the remaining are equal to 0, where . The common variance is drawn from . For , we use the one-sample statistic for testing the individual hypothesis against . We then use the aforementioned five global testing methods for testing the global hypothesis against at level . For our suggested modified two-stage Bonferroni method, we use the sum of squares as the selection statistic for selecting among the individual hypotheses. The selection threshold is the one given in Section 5, which roughly ensures that a fixed proportion of the hypotheses is selected. For the sample-split Bonferroni test, we use one-sample statistics for both selection and testing, constructed respectively from the first and second halves of the sample. The selection threshold is the one given in Section 7.2. In addition, we always set in the simulations.
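The equicorrelated data generation just described can be sketched via a single shared factor per observation. This is our own construction; the default parameter values (`s = 1000`, `n = 10`, `rho = 0.5`, ten false nulls with mean 2) are placeholders for the text's settings, which are not fully legible in this extraction.

```python
# Sketch of equicorrelated normal data: X[i, j] = sigma * (sqrt(rho) * W[j]
# + sqrt(1 - rho) * Z[i, j]) gives common correlation rho across the s
# coordinates for each observation j, with marginal variance sigma^2.
import numpy as np

def simulate_dataset(s=1000, n=10, rho=0.5, n_false=10, mu=2.0, sigma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    W = rng.standard_normal(n)              # shared factor, one per observation
    Z = rng.standard_normal((s, n))         # idiosyncratic noise
    X = sigma * (np.sqrt(rho) * W + np.sqrt(1 - rho) * Z)
    X[:n_false] += mu                       # the first n_false nulls are false
    return X

X = simulate_dataset(rng=np.random.default_rng(4))
```

A quick covariance check confirms the construction: for two distinct coordinates of the same observation, the shared factor contributes covariance `sigma**2 * rho`, while each marginal variance is `sigma**2 * (rho + (1 - rho)) = sigma**2`.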
The simulation is repeated for times. The type 1 error rate and power are estimated as the proportions of simulations in which is rejected when is true and false, respectively. In Figure 8.1 we compare the estimated type 1 error rates and powers of the aforementioned five global testing methods with respect to the common correlation. As seen from Figure 8.1, our suggested modified two-stage Bonferroni method controls the type 1 error rate at level for all values of the correlation while performing best in terms of power. The Higher Criticism test, however, completely loses control of the type 1 error rate even when the correlation is weak; moreover, despite its inflated type 1 error rate, it is still less powerful than our suggested method.
In Figure 8.2 we compare the estimated power of the aforementioned five methods under independence, in the cases of equal and unequal variances, with respect to with values from to . As seen from Figure 8.2, our suggested modified two-stage Bonferroni method performs best under equal variance in terms of power, and its power improvements over the four existing methods are consistently substantial across different values of . Under unequal variance, our suggested modified two-stage Bonferroni method still performs well compared to the existing methods, although the power improvements become smaller as the variability of the variances becomes larger.
8.2 Numerical comparison for FWER controlling procedures
We also performed a simulation study to compare the performance of our suggested modified two-stage Bonferroni method (Section 5) with those of several existing multiple testing methods with respect to FWER control and average power. The methods chosen for comparison are the conventional Bonferroni procedure, the Hochberg procedure, and the sample-split Bonferroni procedure (Section 7).
Each simulated data set is obtained by generating dependent normal random samples , with a common correlation and a sample size . Among the 100 ’s, are drawn from and the remaining are equal to 0, where is the proportion of . The common variance is drawn from . For all four procedures, we use one-sample test statistics for testing the hypotheses against . For our suggested modified two-stage Bonferroni method, we use the sum of squares as the selection statistic for selecting among the tested hypotheses. The selection threshold we chose is , which roughly ensures that about hypotheses are selected. Here, is the average of the sample variances of the samples and is the quantile of the chi-square distribution with degrees of freedom . For the sample-split Bonferroni procedure, we use one-sample statistics for selecting among all of the hypotheses, constructed from the first half sample with sample size . The selection threshold we chose is , the quantile of the distribution with degrees of freedom , which also roughly ensures that about hypotheses are selected. For testing the selected hypotheses, we also use one-sample statistics, constructed from the second half sample with sample size .
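The two-stage logic just described can be sketched as follows. This is an illustrative simplification in which the selection threshold (the chi-square quantile described above) is passed in as a precomputed number, and all names are hypothetical rather than the paper's notation:

```python
def two_stage_bonferroni(sum_sq, pvals, sel_threshold, alpha=0.05):
    """Sketch of a two-stage Bonferroni procedure.

    Stage 1: select hypotheses whose sum-of-squares selection statistic
             exceeds sel_threshold (a precomputed chi-square quantile).
    Stage 2: test only the selected hypotheses with Bonferroni at level
             alpha / (number selected).

    Returns the list of indices rejected at the second stage.
    """
    selected = [i for i, s in enumerate(sum_sq) if s > sel_threshold]
    if not selected:
        return []
    cutoff = alpha / len(selected)  # Bonferroni cutoff over the selected set only
    return [i for i in selected if pvals[i] <= cutoff]
```

Note that the Bonferroni correction in the second stage divides by the number of *selected* hypotheses rather than the total number, which is the source of the multiplicity reduction; validity rests on the independence, under the null, of the selection and test statistics.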
The aforementioned four procedures are then applied to test against simultaneously for at level . The simulation is repeated for times. The FWER is estimated as the proportion of simulations in which at least one true null hypothesis is falsely rejected, and the average power is estimated as the average proportion of rejected false null hypotheses among all false nulls across simulations. In Figure 8.3 we compare the estimated FWER and average power of these four procedures with respect to the proportion of false null hypotheses, with values from to , in the cases of (upper panel) and (bottom panel). As seen from Figure 8.3, our suggested modified two-stage Bonferroni method performs best in terms of average power while controlling the FWER at level , and its power improvements over the three existing methods decrease as the proportion of false nulls increases.
In Figure 8.4 we compare the estimated FWER and average power of these four procedures with respect to the common correlation, with values from to . We observe from Figure 8.4 that for different values of the correlation , our suggested modified two-stage Bonferroni method always performs best in terms of average power while controlling the FWER at level . In addition, the average powers of these methods are largely unaffected by the correlation, and the estimated FWERs are essentially decreasing in the correlation.
9 Technical Details
Proof of Lemma 2.1: Assume is true. Then, we claim the detection statistic is independent of all the selection statistics . For the univariate normal model with mean 0 and unknown variance, the statistic is independent of by Basu’s theorem (because is ancillary and is a complete sufficient statistic). Hence, is independent of , and therefore independent of . Let denote the indices of the true null hypotheses. Thus, the FWER is given by
(19) 
This probability, conditional on the selection statistics , is
(20) 
which by Bonferroni’s inequality is bounded above by
(21) 
Therefore, the unconditional probability is bounded above by , as required.
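In generic notation (with \(S\) the selected index set, \(I_0\) the true-null index set, \(p_i\) the individual p-values, and \(\alpha\) the target level — notation assumed here for illustration only), the chain of bounds in this proof can be sketched as:

```latex
\begin{align*}
\Pr\Bigl(\exists\, i \in I_0 \cap S :\ p_i \le \tfrac{\alpha}{|S|}
          \,\Bigm|\, \text{selection statistics}\Bigr)
  &\le \sum_{i \in I_0 \cap S}
       \Pr\Bigl(p_i \le \tfrac{\alpha}{|S|}
                \,\Bigm|\, \text{selection statistics}\Bigr) \\
  &= |I_0 \cap S|\,\frac{\alpha}{|S|} \;\le\; \alpha,
\end{align*}
```

where the first inequality is Bonferroni's, the equality uses the fact that null p-values are uniform and independent of the selection statistics (so conditioning does not change their distribution), and the final bound uses \(|I_0 \cap S| \le |S|\). Taking expectations over the selection statistics then gives the unconditional bound.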
Proof of Theorem 2.1: As in the proof of Lemma 2.1, we compute the probability of at least one false rejection conditional on the selection statistics. Let be the smallest (or first) index for which is true and . Such an event implies that the smallest value among the true null hypotheses that have been selected is less than or equal to . Indeed, the largest possible value for (leading to the largest possible critical value for the first true null hypothesis tested) is obtained if, out of the