Use of spurious correlation for multiplicity adjustment
Abstract: We consider one of the most basic multiple testing problems, which compares the expectations of multivariate data among several groups. As a test statistic, a conventional (approximate) statistic is considered, and we determine its rejection region using a common rejection limit. When there are unknown correlations among the test statistics, the multiplicity-adjusted values depend on those unknown correlations. They are usually replaced with estimates that are consistent under any hypothesis. In this paper, we propose the use of estimates that are not necessarily consistent, referred to as spurious correlations, in order to improve statistical power. Through simulation studies, we verify that the proposed method asymptotically controls the familywise error rate and clearly provides higher statistical power than existing methods. In addition, the proposed and existing methods are applied to a real multiple testing problem that compares quantitative traits among groups of mice, and the results are compared.
Keywords: Asymptotic control; Familywise error rate; Improving statistical power; Max-t procedure; Multiple comparison; Step-down procedure
1 Introduction
We consider a simple multiple testing problem that compares the expectations of two-dimensional independent data from control and case groups. Setting the sample size as in each group, we denote the data in the control group by and the data in the case group by . In addition, their expectations and variances are denoted by
(1) 
(), and we assume that they are unknown. For this model, we simultaneously test against and against . As test statistics, we use the conventional statistics and , and a rejection region is determined using a common rejection limit . If the values of are known, then as a rejection limit, we only have to obtain the value of such that the familywise error rate
is controlled; that is, the familywise error rate is equal to , where refers to a probability under a hypothesis and is the significance level for this multiple testing. Note that the higher the value of the correlation is, the higher the correlation between and becomes, which would result in more tests being rejected. In this problem, the value of is unknown, and we intend to asymptotically control the familywise error rate.
A natural choice would be to replace with its reasonable estimator such as an unbiased estimator
() in the asymptotic null distribution of , that is, a two-dimensional Gaussian distribution with a mean of , a variance of , and a correlation of . We call this the max-t method (see Section 2.6 of Dudoit and van der Laan 2007). Because is consistent regardless of which hypothesis is true, this method asymptotically controls the familywise error rate.
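As a concrete illustration of this calibration, the following sketch computes the common rejection limit for two two-sided tests from a given correlation value, using the bivariate Gaussian rectangle probability. The function names and the correlation values here are illustrative, not part of the original method.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal

def fwer(c, r):
    """P(max(|Z1|, |Z2|) > c) under a bivariate standard Gaussian with
    correlation r, via inclusion-exclusion on the joint CDF."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
    inside = (mvn.cdf(np.array([c, c])) - mvn.cdf(np.array([-c, c]))
              - mvn.cdf(np.array([c, -c])) + mvn.cdf(np.array([-c, -c])))
    return 1.0 - inside

def rejection_limit(r, alpha=0.05):
    """Common limit c solving P(max(|Z1|, |Z2|) > c) = alpha."""
    return brentq(lambda c: fwer(c, r) - alpha, 1.5, 4.0)

c_indep = rejection_limit(0.0)  # independent case: the Sidak-type limit
c_corr = rejection_limit(0.9)   # strong correlation gives a smaller limit
```

The limit shrinks as the correlation grows, which is exactly why plugging in a larger (spurious) correlation estimate yields a less conservative test.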
On the other hand, to improve statistical power, we evaluate the correlation by assuming that the expectations are the same in both groups. This means that we use
an unbiased estimator of when is true, in place of . The reason for this is that , the so-called spurious correlation, tends to be larger than when an alternative hypothesis is true. In general, using such a spurious correlation does not assure any control of the familywise error rate, because it becomes a meaningless value under certain hypotheses; however, the spurious correlation does assure that the familywise error rate is asymptotically controlled in this problem, as shown in the next paragraph.
We will verify the asymptotic control in the following four cases: (a) when is true, (b) when is true, (c) when is true, and (d) when is true. In this method, we use , such that , as a rejection limit for each test, where is a two-dimensional Gaussian random vector with mean , variance , and correlation . For case (a), since the spurious correlation is a consistent estimator, the familywise error rate is evaluated as
For case (b), although the spurious correlation is not consistent, it does not appear in the expression of the familywise error rate. The familywise error rate is evaluated as
For case (c), the spurious correlation does not appear in the expression of the familywise error rate, which is similar to case (b). For case (d), the familywise error rate is always zero. Therefore, we have verified the control.
Figure 1 plots synthetic data under in the setting above. Under , the first and second variables in the case group are larger than those in the control group. Consequently, the spurious correlation becomes larger than the correlations within the two groups. For these data, we consider a multiple test consisting of the two above-mentioned tests, testing against and testing against , with a significance level of . While the max-t method rejects neither of the tests, both tests are rejected when the spurious correlation is used.
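The tendency behind this example can be checked numerically: when both coordinates share a mean shift between the groups, the correlation computed from the pooled sample (i.e., as if the means were equal) exceeds the within-group correlation. All numerical values below (within-group correlation 0.3, shift 2.0, sample sizes) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
rho = 0.3  # illustrative within-group correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
control = rng.multivariate_normal([0.0, 0.0], cov, size=n)
case = rng.multivariate_normal([2.0, 2.0], cov, size=n)  # both means shifted up

# Within-group correlation: center each group by its own mean first.
centered = np.vstack([control - control.mean(axis=0),
                      case - case.mean(axis=0)])
r_within = np.corrcoef(centered.T)[0, 1]

# "Spurious" correlation: pool the data as if the two means were equal.
r_spurious = np.corrcoef(np.vstack([control, case]).T)[0, 1]
# The common mean shift loads positively on both coordinates, so
# r_spurious tends to exceed r_within under the alternative.
```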
The spurious correlation cannot always be used, however. We assume one more case group, which consists of independent data satisfying (1) with . In addition to the two above-mentioned tests, we consider testing against and testing against , and we denote their test statistics by and . We then consider a four-dimensional Gaussian distribution, which is the asymptotic null distribution of . Let be a random vector distributed according to the distribution obtained by replacing with
(2) 
which is an unbiased estimator under , in the four-dimensional Gaussian distribution. We take the value of such that as the rejection limit of each test. Under this condition, we consider the familywise error rate when is true. If evaluations similar to those in the two-group case held, the familywise error rate would be expressed as
(3) 
however, this approximation does not necessarily hold. This is because the approximation replaces with , even though it is not a consistent estimator of under this hypothesis.
In general, using a spurious correlation does not assure any asymptotic control of the familywise error rate; however, for the correlation between two test statistics, using an estimator that is consistent under the null hypotheses of the two tests does assure asymptotic control. In (3), if the correlation between and is replaced with an estimator under , the approximation holds. This theory is given in general form in Section 2.
For correlated multiple tests without any pre-specified hypothesis ordering, such as the above example, the max-t method is conventional, and Westfall and Young (1993) showed that it asymptotically controls the familywise error rate under a subset pivotality condition. Pollard and van der Laan (2004) and Dudoit et al. (2004) showed that this condition can be relaxed by a simple algorithm. Moreover, the max-t method can also be used in a step-down procedure (van der Laan et al. 2004). In this paper, under the same situation, we consider a different method that enhances statistical power.
In recent years, multiple testing procedures have been developed in response to rising demand from applications in medicine, bioinformatics, genomics, and brain imaging (see, e.g., Farcomeni 2008). A well-developed approach that does not consider the use of correlations is the “oracle approach”. This approach constructs an optimal test function by assuming that the true values of the parameters, or their prior distributions, are known. When the quantity to be controlled is the false discovery rate, the oracle approach works well if we substitute simple estimators for the true parameter values, or even if the prior distributions are slightly misspecified (Genovese et al. 2006, Storey 2007, Sun and Cai 2007, Guindani et al. 2009). On the other hand, it is difficult to control the familywise error rate using only a simple estimator, and as shown in Roeder and Wasserman (2009), a natural choice would be a two-stage method with a sample-splitting procedure (Rubin et al. 2006, Wasserman and Roeder 2006, Habiger and Peña 2014). In this method, the parameters are estimated from one split sample and testing is carried out on the other. Although this assures asymptotic control of the familywise error rate, there is no adequate theory on how to split the samples, so the splitting is arbitrary. Roeder and Wasserman (2009) avoided the two-stage method and deliberately estimated the parameters roughly in order to approximately control the familywise error rate; however, it is still difficult to construct a theory on how roughly the parameters should be estimated. The approach to improving statistical power by such methods differs from the one proposed in this paper, and combining the two approaches is an attractive topic for future work.
2 General theory
Supposing that the covariance matrices of the groups differ from each other and that the alternative hypotheses are two-sided, we generalize the method of the previous section. For notational simplicity, we study pairwise comparisons between one control group and multiple case groups, but general pairwise comparisons can be treated in the same way (see Web Appendix).
Let us denote a dimensional random vector for the th sample in the th group by (, ), and its mean vector and covariance matrix by and , where and indicate the control and case groups, respectively. We assume that the parameters of interest are in the mean vector and consider a multiple testing problem that compares and (). Denoting the average of by and a conventional unbiased estimator of by , an approximate statistic for each test is written as
For this problem, when a common rejection limit is used in every test, the familywise error rate is given by
(4) 
under the complete null hypothesis. We can easily verify that the method using the rejection limit , such that the value in (4) equals the significance level , keeps the familywise error rate below under any hypothesis; that is, it strongly controls the familywise error rate. On the other hand, we cannot obtain an exact value of the tail probability, because the distribution of depends on unknown parameters and we do not assume any parametric model for the distribution of the data in the first place.
The Bonferroni method considers as an upper bound of (4), and as a rejection limit, it uses such that the value of the bound is ; however, it is too conservative when the correlations among are large. In such a case, a natural choice would be to use an asymptotic evaluation of (4) (see Section 2.6 of Dudoit and van der Laan 2007). Under the hypothesis , the asymptotic distributions of ’s are and their correlations , , are asymptotically equivalent to
(5)  
(6)  
(7) 
respectively, where is a conventional unbiased estimator of . Here and hereafter, we assume and . Let be a set of standard Gaussian variables whose correlations , , and are given by , , and , respectively. Then is asymptotically equivalent to (4), and the method that uses such that this value equals asymptotically controls the familywise error rate. We call this the max-t method.
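A minimal sketch of this calibration for a general number of tests K, assuming the correlation matrix of the test statistics has already been estimated: the common limit is the (1 − α) quantile of the maximum absolute component of a Gaussian vector with that correlation matrix, here approximated by Monte Carlo and compared against the Bonferroni limit. All names and parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def max_t_limit(R, alpha=0.05, n_mc=200_000, seed=0):
    """Monte Carlo estimate of the common limit c such that
    P(max_k |Z_k| > c) ~= alpha, where (Z_1, ..., Z_K) is standard
    Gaussian with correlation matrix R (the max-t calibration)."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(R)), R, size=n_mc)
    return np.quantile(np.abs(z).max(axis=1), 1.0 - alpha)

# With equicorrelated statistics, the max-t limit sits below the
# Bonferroni limit, and the gap widens as the correlation grows.
K = 5
R_eq = np.full((K, K), 0.6)
np.fill_diagonal(R_eq, 1.0)
c_maxt = max_t_limit(R_eq)
c_bonf = norm.ppf(1.0 - 0.05 / (2 * K))  # two-sided Bonferroni limit
```

Replacing R with a larger (null-hypothesis) correlation estimate in this routine is precisely the modification studied in the rest of the section.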
In this paper, we put “ (a hat)” on estimators under a null hypothesis and on random variables based on those estimators, while we put “ (a tilde)” on estimators that are always consistent regardless of which hypothesis is true. For an appropriate degree of freedom , let us define
(8) 
(). Then, is a reasonable estimator of under a null hypothesis . If these estimators are expected to be larger than , we would want to use them in place of in order to obtain improved statistical power. Therefore, as a natural choice, we consider a method that uses in place of in the definition of . That is, we define as a set of standard Gaussian variables whose correlations are given by making the replacement in the definition of , with the correlations and given by and , respectively, and we then consider a method that uses , such that , as a common rejection limit. If the correlation matrix obtained in this manner is not positive definite, we move it toward the conventional correlation matrix until positive-definiteness holds, and then use the resulting matrix. We call this the max-t method using for . Although using estimators under a null hypothesis does not assure any control of the familywise error rate in general, the max-t method using for does assure control of the familywise error rate. This assurance is written in a general form as follows (see Appendix for the proof):
Theorem 1.
The max-t method using consistent estimators of under in place of for in (5) asymptotically controls the familywise error rate.
Conversely, the max-t method using inconsistent estimators of under in place of for does not necessarily control the familywise error rate. Let us again consider the three-group example of Section 1. Because in (2) is not necessarily consistent under or , (3) does not hold and the familywise error rate is not controlled. Let us verify this. Assume that there is no correlation between and , i.e., . In addition, assume that the true hypothesis is and that and are large enough. Then, the spurious correlation becomes close to , and so and ( and ) are almost the same random variables. Therefore, the familywise error rate is evaluated as
and we can see that it asymptotically exceeds . Here, is a weak limit of . The inequality holds because of the independence between and and the positivity of the correlation between and .
Similar to the example of the three groups above, in the setting of this section, we can consider the use of
(9) 
for an appropriately defined (); however, this does not assure control of the familywise error rate, as shown in the following corollary (see Web Appendix for the proof):
3 Proposal for spurious correlation
From the previous section, the multiplicity is adjusted if we use or in place of , which means that the max-t method using for in (5) asymptotically controls the familywise error rate for an arbitrary .
Let us consider what value should be used as . If is expected to be larger than and we assign a large value to , the estimator becomes large, which enhances the statistical power. On the other hand, the control of the familywise error rate becomes unstable, even though it is still asymptotically controlled, because the variance of the estimator becomes large. Therefore, we set the value of as large as possible while assuring the stability of the control to some extent. Specifically, we propose using the supremum of such that the variances of are asymptotically equal to or smaller than those of the conventional estimators .
Let us asymptotically evaluate the variance of the estimator in a setting of two-group comparisons, i.e., , with a common covariance matrix . In the asymptotics, we fix and increase . First, letting , the variance of the conventional estimator is evaluated as
(10) 
On the other hand, by letting , which is independent of , the variance of the estimator under is evaluated as
Because is written as , its variance is evaluated as
(11) 
The difference between the right sides of (10) and (11) is , and we can see that is always asymptotically smaller than when . For more details, see the Appendix, which derives the following theorem in a similar way.
Moreover, in the setting of the general theory in Section 2, a similar property holds under the following two requirements: the sample sizes of the groups are close to each other, and if the variance of one variable is large, the variances of the other variables are also large. This property is stated in the following theorem. We therefore propose in this paper that .
Theorem 2.
Let us assume that in the setting of Section 2 (, ). When and are close enough, the supremum of , such that the variance of is always smaller than that of , is .
4 Simulation study
Let us compare the performance of the proposed and existing methods through simulation studies in a simple setting. With the real data analysis of the next section in mind, we treat a two-group comparison, i.e., , and assume the sample size in each group to be between and and the number of tests on each group to be between and . In addition, we assume that the covariance matrix is a block-diagonal matrix whose blocks are uniform covariance matrices with variance and covariance , where is assumed to be between and . Letting be the rate of true alternative hypotheses among all alternative hypotheses, we assume that the differences between the expectations are under the true alternative hypotheses. Among existing methods, we consider the Bonferroni, the max-t, and the step-down max-t methods. The step-down max-t method uses the max-t method in a step-down procedure (Dudoit and van der Laan 2007). Our method can likewise be used in a step-down procedure, and we refer to this as “Proposal.”
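The data-generating scheme of this simulation can be sketched as follows. Here we only check the Bonferroni method under the complete null hypothesis, and every numerical choice (n = 12 per group, 50 tests, block size 5, within-block correlation 0.3) is an illustrative assumption rather than the exact configuration used for the tables.

```python
import numpy as np
from scipy.stats import t as t_dist

def simulate_fwer_bonferroni(n=12, m=50, block=5, rho=0.3,
                             alpha=0.05, n_rep=2000, seed=0):
    """Empirical FWER of the Bonferroni method under the complete null.
    The covariance is block-diagonal with uniform (exchangeable) blocks,
    mirroring the simulation setting; values here are illustrative."""
    rng = np.random.default_rng(seed)
    B = np.full((block, block), rho)
    np.fill_diagonal(B, 1.0)
    L = np.linalg.cholesky(np.kron(np.eye(m // block), B))
    c = t_dist.ppf(1.0 - alpha / (2 * m), df=2 * n - 2)  # Bonferroni limit
    hits = 0
    for _ in range(n_rep):
        x = rng.standard_normal((n, m)) @ L.T  # control group
        y = rng.standard_normal((n, m)) @ L.T  # case group (same mean: null)
        se = np.sqrt((x.var(axis=0, ddof=1) + y.var(axis=0, ddof=1)) / n)
        t_stat = (y.mean(axis=0) - x.mean(axis=0)) / se
        hits += np.any(np.abs(t_stat) > c)
    return hits / n_rep

fw = simulate_fwer_bonferroni()
```

As in Table 1, the empirical familywise error rate should fall near or slightly below the nominal 5%, since positive within-block correlation makes the Bonferroni bound conservative.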
Denoting by the estimate of the correlation matrix for obtained by the proposal in Section 3, in some cases is not positive definite and our method cannot be applied. In such cases, letting be the estimate of the correlation matrix for obtained from in (5), in (6), and in (7), we gradually move closer to as long as positive-definiteness is maintained, and we use the last matrix before positive-definiteness is broken. Specifically, we use the output of the following algorithm:
Algorithm 1.
Positive-definitization by increasing the values of components.
i. Set .
ii. Randomly select an element from , and select the corresponding element from .
iii. Replace with if the replaced matrix is positive definite.
iv. Repeat ii and iii as long as an update occurs; when no update occurs, output .
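Because the original symbols of Algorithm 1 are not preserved here, the following is one reading of it as code: start from the conventional (consistent) correlation estimate, which is positive definite, and install off-diagonal entries of the null-hypothesis estimate one at a time in random order, keeping an update only when positive-definiteness is preserved. The starting matrix, sweep order, and example matrices below are our assumptions.

```python
import numpy as np

def is_pd(R):
    """Check positive definiteness of a symmetric matrix via Cholesky."""
    try:
        np.linalg.cholesky(R)
        return True
    except np.linalg.LinAlgError:
        return False

def positive_definitize(R_hat, R_tilde, seed=0):
    """Sketch of Algorithm 1 (one reading): start from the conventional
    estimate R_tilde (assumed positive definite) and install entries of
    the null-hypothesis estimate R_hat one at a time, in random order,
    keeping an update only when the matrix stays positive definite.
    Terminates when a full sweep makes no update."""
    rng = np.random.default_rng(seed)
    R = R_tilde.copy()
    K = R.shape[0]
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    updated = True
    while updated:
        updated = False
        rng.shuffle(pairs)
        for i, j in pairs:
            if R[i, j] == R_hat[i, j]:
                continue  # this entry is already installed
            cand = R.copy()
            cand[i, j] = cand[j, i] = R_hat[i, j]
            if is_pd(cand):
                R = cand
                updated = True
    return R

# Example: an indefinite "spurious" matrix pulled as close to itself as
# positive definiteness allows, starting from a uniform 0.3 matrix.
R_tilde = np.full((3, 3), 0.3)
np.fill_diagonal(R_tilde, 1.0)
R_hat = np.array([[1.0, 0.9, 0.9],
                  [0.9, 1.0, -0.9],
                  [0.9, -0.9, 1.0]])  # not positive definite
R_adj = positive_definitize(R_hat, R_tilde)
```

Each off-diagonal entry can change at most once (from the conventional to the null-hypothesis value), so the loop terminates after at most K(K−1)/2 updates.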
First, we check how the proposed method, which asymptotically controls the familywise error rate, behaves in finite-sample cases. Table 1 numerically evaluates the familywise error rate of each method when the significance level is . Note that we include settings with true alternative hypotheses in order to highlight the differences between the proposed and existing methods, even though the familywise error rate of the proposed method is clearly smaller than in these settings. From the table, the proposed and existing methods share almost the same values under complete null hypotheses; that is, the proposed method controls the familywise error rate accurately enough even when the sample size is not large. When there are true alternative hypotheses, the familywise error rate of the proposed method is closer to than that of the existing methods. This indicates that the proposed method is superior to the existing methods in terms of statistical power.
Corr.  n  #Tests  Diff.  Alt. rate  Bon  SD-max-t  Proposal
0.0  12  50  –  0.0  4.53  4.64 (0.54)  4.64 (0.56) 
0.2  12  50  –  0.0  4.57  4.82 (0.64)  4.80 (0.64) 
0.4  12  50  –  0.0  4.28  4.91 (0.67)  5.01 (0.81) 
0.6  12  50  –  0.0  3.77  5.07 (0.82)  5.26 (1.20) 
0.3  6  50  –  0.0  3.53  4.38 (0.75)  4.21 (0.82) 
0.3  10  50  –  0.0  3.74  4.29 (0.69)  4.32 (0.71) 
0.3  14  50  –  0.0  4.29  4.77 (0.70)  4.78 (0.74) 
0.3  18  50  –  0.0  4.44  4.90 (0.72)  4.84 (0.71) 
0.3  12  20  –  0.0  4.26  4.92 (0.53)  4.96 (0.55) 
0.3  12  40  –  0.0  4.37  4.91 (0.62)  4.88 (0.67) 
0.3  12  60  –  0.0  4.46  4.93 (0.72)  4.86 (0.71) 
0.3  12  80  –  0.0  4.59  5.04 (0.82)  4.99 (0.83) 
0.3  12  50  0.6  0.5  2.39  2.66 (0.40)  2.81 (0.49) 
0.3  12  50  1.0  0.5  2.39  2.67 (0.37)  3.06 (0.58) 
0.3  12  50  1.4  0.5  2.39  2.65 (0.39)  3.27 (0.54) 
0.3  12  50  1.8  0.5  2.39  2.68 (0.40)  3.40 (0.57) 
0.3  12  50  1.2  0.2  3.60  4.09 (0.69)  4.19 (0.71) 
0.3  12  50  1.2  0.4  2.82  3.11 (0.41)  3.53 (0.62) 
0.3  12  50  1.2  0.6  1.96  2.28 (0.37)  2.85 (0.52) 
0.3  12  50  1.2  0.8  1.05  1.18 (0.19)  1.78 (0.43) 
Bon, Bonferroni method; SD-max-t, step-down max-t method.
Next, we verify the superiority of the proposed method. Letting and , Table 2 numerically evaluates the statistical power of each method. The difference between the Bonferroni and max-t methods shows the improvement obtained by considering correlations, and the difference between the max-t and step-down max-t methods shows the improvement obtained by the step-down procedure. Notably, such improvements are sometimes overwhelmed by the improvement of the proposed method over the step-down max-t method. Especially when the correlation , the number of tests , and the difference between the expectations are large, our method substantially increases statistical power.
Corr.  n  #Tests  Diff.  Bon  max-t  SD-max-t  Proposal
0  12  50  1.2  30.4 [34.4]  30.9 [33.8]  34.7 [28.8]  38.8 [23.7] 
0.2  12  50  1.2  33.0 [31.9]  34.1 [29.6]  39.7 [25.4]  47.2 [19.7] 
0.4  12  50  1.2  33.2 [32.2]  35.5 [27.5]  41.7 [24.3]  51.5 [18.2] 
0.6  12  50  1.2  30.5 [34.1]  36.3 [25.1]  41.3 [23.3]  54.6 [16.5] 
0.3  6  50  1.2  6.9 [65.9]  7.7 [59.6]  8.2 [58.6]  12.4 [50.3] 
0.3  10  50  1.2  21.2 [41.8]  23.1 [37.4]  25.9 [34.4]  35.6 [25.9] 
0.3  14  50  1.2  40.8 [24.8]  42.1 [21.9]  50.2 [18.1]  57.3 [13.8] 
0.3  18  50  1.2  59.8 [14.1]  61.6 [12.5]  70.2 [8.7]  74.3 [7.0] 
0.3  12  20  1.2  46.8 [20.4]  48.6 [17.1]  57.1 [13.4]  64.0 [10.3] 
0.3  12  40  1.2  35.4 [29.0]  36.7 [25.4]  43.1 [21.7]  52.1 [16.3] 
0.3  12  60  1.2  29.9 [35.0]  31.1 [31.4]  35.7 [27.9]  44.2 [21.1] 
0.3  12  80  1.2  26.3 [39.9]  27.3 [35.6]  31.8 [32.2]  41.6 [24.7] 
0.3  12  50  0.9  14.3 [55.0]  14.9 [50.6]  16.9 [48.5]  21.8 [42.3] 
0.3  12  50  1.1  23.4 [40.8]  24.6 [36.5]  28.8 [33.3]  36.7 [26.4] 
0.3  12  50  1.3  43.3 [23.8]  45.3 [21.0]  52.2 [16.9]  60.5 [12.1] 
0.3  12  50  1.5  55.8 [14.9]  57.8 [12.9]  67.8 [8.8]  77.1 [5.6] 
Bon, Bonferroni method; SD-max-t, step-down max-t method.
References
 Dudoit et al. (2004) Dudoit, S., van der Laan, M. J., and Pollard, K. S. (2004). Multiple testing. Part I. Single-step procedures for control of general type I error rates, Statistical Applications in Genetics and Molecular Biology, 3, 1–69.
 Dudoit and van der Laan (2007) Dudoit, S. and van der Laan, M. J. (2007). Multiple testing procedures with applications to genomics: Springer Science & Business Media.
 Farcomeni (2008) Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research, 17, 347–388.
 Genovese et al. (2006) Genovese, C. R., Roeder, K., and Wasserman, L. (2006). False discovery control with pvalue weighting, Biometrika, 93, 509–524.
 Guindani et al. (2009) Guindani, M., Müller, P., and Zhang, S. (2009). A Bayesian discovery procedure, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 905–925.
 Habiger and Peña (2014) Habiger, J. D. and Peña, E. A. (2014). Compound p-value statistics for multiple testing procedures, Journal of Multivariate Analysis, 126, 153–166.
 van der Laan et al. (2004) van der Laan, M. J., Dudoit, S., and Pollard, K. S. (2004). Multiple testing. Part II. Step-down procedures for control of the familywise error rate, Statistical Applications in Genetics and Molecular Biology, 3, 1–33.
 Pollard and van der Laan (2004) Pollard, K. S. and van der Laan, M. J. (2004). Choice of a null distribution in resamplingbased multiple testing, Journal of Statistical Planning and Inference, 125, 85–100.
 Roeder and Wasserman (2009) Roeder, K. and Wasserman, L. (2009). Genome-wide significance levels and weighted hypothesis testing, Statistical Science, 24, 398.
 Rubin et al. (2006) Rubin, D., Dudoit, S., and van der Laan, M. (2006). A method to increase the power of multiple testing procedures through sample splitting, Statistical Applications in Genetics and Molecular Biology, 5.
 Storey (2007) Storey, J. D. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69, 347–368.
 Sun and Cai (2007) Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control, Journal of the American Statistical Association, 102, 901–912.
 Wasserman and Roeder (2006) Wasserman, L. and Roeder, K. (2006). Weighted hypothesis testing, arXiv preprint math/0604172.
 Westfall and Young (1993) Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, 279: John Wiley & Sons.