# Controlling Familywise Error When Rejecting at Most One Null Hypothesis Each From a Sequence of Sub-Families of Null Hypotheses

## Abstract

We present a procedure for controlling FWER when sequentially considering successive subfamilies of null hypotheses and rejecting at most one from each subfamily. Our procedure differs from previous procedures for controlling FWER in that it adjusts the critical values applied in subsequent rejection decisions, subtracting from the global significance level quantities based on the p-values of the rejected null hypotheses and on the numbers of null hypotheses considered.

Keywords: FWER, Sequential Hypothesis Testing, Stepwise Model Selection

## 1 Introduction

We present a procedure for strictly controlling the familywise error rate (FWER) when rejecting a single null hypothesis from each subfamily in a sequence of subfamilies of null hypotheses, where each rejection decision is made without knowledge of subsequent subfamilies.

Our procedure is a more powerful variant of a procedure presented by Webb and Petitjean (2016). These procedures differ in form from previous multiple testing procedures by adjusting the critical value applied to subsequent subfamilies based on the observed values of test statistics for null hypotheses in prior subfamilies.

We identify the assumptions of the procedure, use Monte Carlo simulations to elucidate properties of the procedure under differing scenarios when the assumptions are satisfied, and provide analytical and Monte Carlo simulation results to demonstrate scenarios under which FWER is not controlled when the assumptions are violated.

### 1.1 Set-up

Let $X$ be a random variable with probability distribution $P$. Suppose we observe a realization $x$ of this random variable representing our observed data. Let $F_j$, $j = 1, \ldots, J$, be an ordered sequence of subfamilies of null hypotheses, where $F_j$ consists of $n_j$ null hypotheses $H_{j,1}, \ldots, H_{j,n_j}$ about the data distribution $P$. Let $T_{j,k}$ be a test-statistic for null hypothesis $H_{j,k}$, $j = 1, \ldots, J$, $k = 1, \ldots, n_j$. Let $\mathcal{T}_j$ be the set of true null hypotheses in $F_j$, and let $\mathcal{F}_j$ be the set of false null hypotheses in $F_j$. Let $\mathcal{T} = \cup_j \mathcal{T}_j$ and $\mathcal{F} = \cup_j \mathcal{F}_j$ be the sets of true and false null hypotheses among all null hypotheses.

P-values: Let $p_{j,k}$ be a p-value implied by $T_{j,k}$. It is assumed that if $H_{j,k}$ is true, then $P(p_{j,k} \le x) \le x$ for all $x \in (0,1)$. In other words, the p-value, which is just a transformation of the test-statistic, satisfies its key property. For simplicity, we assume this to be true for finite sample sizes. As a result, our theorem establishes exact control of the familywise error; in the often more realistic case that the null distributions of the test statistics are only known asymptotically, so that $\limsup_{n \to \infty} P(p_{j,k} \le x) \le x$, our results provide asymptotic control of the familywise error.

Let $\mathbf{p} = (p_{j,k} : j = 1, \ldots, J,\ k = 1, \ldots, n_j)$ be the vector of p-values, and let $\mathbf{p}_T$ and $\mathbf{p}_F$ be the vectors of p-values for the true null hypotheses and false null hypotheses, respectively. Let $p_j^{\min}$ be the minimum p-value for family $F_j$ and let $k_j$ identify the null hypothesis with the minimal p-value. Thus $p_j^{\min} = p_{j,k_j}$. We also define $p_{T,j}^{\min}$ and $p_{F,j}^{\min}$ as the minimum of the p-values over the set of true and false null hypotheses in family $F_j$, respectively. More precisely,

$$p_j^{\min} = \min_{1 \le k \le n_j} p_{j,k}, \qquad k_j = \arg\min_{1 \le k \le n_j} p_{j,k}, \qquad p_{T,j}^{\min} = \min_{H_{j,k} \in \mathcal{T}_j} p_{j,k}, \qquad p_{F,j}^{\min} = \min_{H_{j,k} \in \mathcal{F}_j} p_{j,k},$$

with the convention that a minimum over an empty set equals $+\infty$.

Our goal is to define a sequential multiple testing procedure that rejects at most one hypothesis per subfamily $F_j$, making the decision whether or not to reject without knowledge of subsequent subfamilies, and that controls the familywise error over all subfamilies at a user-supplied level $\alpha$.

### 1.2 Sequential multiple testing procedure for a sequence of families of null hypotheses.

We propose the following sequential multiple testing procedure, which results in a set of rejections $\mathcal{R}$.

Multiple Testing Procedure:

1. Let $\mathcal{R} = \emptyset$.
2. Let $\alpha_1 = \alpha$.
3. For $j = 1, \ldots, J$:
4. If $p_j^{\min} > \alpha_j / n_j$, stop and return $\mathcal{R}$.
5. Reject the null hypothesis $H_{j,k_j}$ with the minimal p-value.
6. Let $\mathcal{R} \leftarrow \mathcal{R} \cup \{H_{j,k_j}\}$.
7. Let $\alpha_{j+1} = \alpha_j - (n_j - 1)\, p_j^{\min}$.
8. Return $\mathcal{R}$.

This procedure differs from that of Webb and Petitjean (2016) at line 7, where their procedure instead has: let $\alpha_{j+1} = \alpha_j - n_j\, p_j^{\min}$. By subtracting a smaller quantity from each successive $\alpha_j$, our procedure is guaranteed to be uniformly more powerful. Hence, our proof also provides a proof of correctness for this prior procedure.
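As an illustration, the procedure can be sketched in Python. This is a minimal sketch, not a reference implementation: the function name and input format are ours, and the critical-value update on the last line reflects our reading of the procedure described above.

```python
def sequential_reject(pvals, alpha=0.05):
    """Reject at most one hypothesis per subfamily, in sequence.

    pvals: list of subfamilies, each a list of p-values.
    Returns a list of (subfamily index, hypothesis index) pairs
    for the rejected hypotheses.
    """
    rejections = []
    alpha_j = alpha
    for j, family in enumerate(pvals):
        n_j = len(family)
        p_min = min(family)
        k_j = family.index(p_min)
        if p_min > alpha_j / n_j:
            break  # stop at the first subfamily with no rejection
        rejections.append((j, k_j))
        # Adjust the budget for subsequent subfamilies; Webb and
        # Petitjean (2016) subtract n_j * p_min instead.
        alpha_j = alpha_j - (n_j - 1) * p_min
    return rejections
```

Note how the budget $\alpha_j$ shrinks only in proportion to the observed p-value of each rejected hypothesis, so early rejections with very small p-values leave almost the full level available for later subfamilies.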

### 1.3 Theorem establishing familywise error control

The following theorem proves that for each realization of the p-values of the false null hypotheses, the conditional probability of rejecting a true null hypothesis is no greater than $\alpha$. Of course, this implies, in particular, that the marginal probability of any rejection of a true null is no greater than $\alpha$. The key assumption this theorem relies upon is that the p-values of the true nulls are independent of the p-values of the false nulls.

###### Theorem 1

Assume that $\mathbf{p}_T$ is independent of $\mathbf{p}_F$. Specifically, assume that for all possible realizations $\mathbf{v}$ of $\mathbf{p}_F$, $P(p_{j,k} \le x \mid \mathbf{p}_F = \mathbf{v}) \le x$ for all $H_{j,k} \in \mathcal{T}$ and all $x \in (0,1)$. Then,

$$P(\mathcal{R} \cap \mathcal{T} \ne \emptyset \mid \mathbf{p}_F = \mathbf{v}) \le \alpha.$$

Proof:
In this proof we condition on $\mathbf{p}_F = \mathbf{v}$, so that all probabilities concern the random vector $\mathbf{p}_T$.

Scenario I:
First, consider the scenario that $p_{F,j}^{\min} \le \alpha_j / n_j$ for all $j = 1, \ldots, J$, where the critical values $\alpha_j$ are those computed by the procedure along the path in which each subfamily rejects its false null hypothesis with minimal p-value.

We note that this implies that all the subfamilies contain at least one false null hypothesis. On the event that no false rejection has occurred before the $j$-th subfamily, the procedure reaches $F_j$, and a false rejection at $F_j$ requires that $p_{T,j}^{\min} \le p_{F,j}^{\min}$. The probability of this event is no greater than $(n_j - 1)\, p_{F,j}^{\min}$, where we use that $p_{T,j}^{\min}$ is a minimum over maximally $n_j - 1$ true null hypotheses.

The union over $j = 1, \ldots, J$ of these events contains the event that we have a false rejection. Because the critical values satisfy $\alpha_{j+1} = \alpha_j - (n_j - 1)\, p_{F,j}^{\min}$ along this path, this proves that the probability of a false rejection is no greater than

$$\sum_{j=1}^{J} (n_j - 1)\, p_{F,j}^{\min} = \alpha_1 - \alpha_{J+1} \le \alpha.$$

Scenario II: The only alternative to Scenario I is the scenario that there exists a first $j^*$ such that $p_{F,j^*}^{\min} > \alpha_{j^*} / n_{j^*}$, and thus, for $j < j^*$, we have $p_{F,j}^{\min} \le \alpha_j / n_j$. We note that this implies that the $j$-th subfamily, $j < j^*$, has at least one false null hypothesis, and that the probability of a false rejection of a true null hypothesis in $F_j$ is no greater than $(n_j - 1)\, p_{F,j}^{\min}$.

If there has been no false rejection in $F_1, \ldots, F_{j^*-1}$, this implies that $p_{T,j}^{\min} > p_{F,j}^{\min}$ for all $j < j^*$, as otherwise there would have been a false rejection of the true null hypothesis corresponding to $p_{T,j}^{\min}$.

In this scenario, the procedure makes a false rejection at $F_{j^*}$ if and only if $p_{T,j^*}^{\min} \le \alpha_{j^*} / n_{j^*}$, and hence the probability of a first false rejection at $F_{j^*}$ is no greater than $n_{j^*} \cdot \alpha_{j^*} / n_{j^*} = \alpha_{j^*}$.

The probability of the union of the two events, of a false rejection in $F_1, \ldots, F_{j^*-1}$ and of a false rejection in $F_{j^*}$ but no false rejection in $F_1, \ldots, F_{j^*-1}$, is thus no greater than

$$\sum_{j=1}^{j^*-1} (n_j - 1)\, p_{F,j}^{\min} + \alpha_{j^*} = (\alpha_1 - \alpha_{j^*}) + \alpha_{j^*} = \alpha. \qquad \square$$

## 2 Discussion

### 2.1 Relationship to other approaches for controlling FWER

The standard fixed sequence hypothesis test procedure (Maurer et al., 1995; Hsu and Berger, 1999), in which each of a fixed sequence of null hypotheses is tested at level $\alpha$ until the first non-rejection, is a special case of our procedure in which all subfamilies are of size 1 (all $n_j = 1$).
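To see the reduction concretely, here is a sketch of the singleton-subfamily case, assuming (as in our reading of the procedure) that the quantity subtracted after each rejection vanishes when subfamilies contain a single hypothesis, so every hypothesis is tested at the full level $\alpha$:

```python
def fixed_sequence(pvals, alpha=0.05):
    """Fixed sequence testing: each p-value is compared with alpha
    in order, stopping at the first non-rejection. This is the
    sequential procedure with every subfamily of size 1, since the
    adjustment subtracted after a rejection is then zero."""
    rejections = []
    for j, p in enumerate(pvals):
        if p > alpha:       # alpha_j / n_j == alpha when n_j == 1
            break
        rejections.append(j)
    return rejections
```

The fourth p-value below is small, but it is never tested because the third hypothesis fails to be rejected.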

Our procedure follows a fundamentally different strategy to gatekeeping procedures based on Bonferroni adjustments (Bauer et al., 1998; Westfall and Krishen, 2001; Dmitrienko et al., 2003; Chen et al., 2005). Gatekeeping procedures add the significance level allotted to rejected null hypotheses to the significance levels of subsequent hypotheses. In contrast, our procedure subtracts from the significance level available to subsequent subfamilies some portion of the previous level, based on the observed p-values of the rejected null hypotheses.

The approach also differs fundamentally from selective inference (Taylor and Tibshirani, 2015). First, our procedure controls FWER, while selective inference controls FDR. Second, unlike our procedure, selective inference does not use an explicit sequential order over subfamilies of null hypotheses. Third, also unlike our procedure, selective inference rejects null hypotheses in order of ascending p-value until a function over the p-values of the null hypotheses exceeds a threshold.

### 2.2 Monte Carlo experiments

To elucidate the statistical power of the technique, we conducted Monte Carlo simulations. In all the following simulations we use a global significance level of $\alpha = 0.05$.

In the first simulation we generated sequences of subfamilies of null hypotheses, which were randomly assigned to be either true or false and were randomly assigned simulated p-values. These simulations were governed by three parameters: subfamilySize, the size of each subfamily; pTrue, the probability that a null hypothesis is designated to be true; and maxFalsePVal, the maximum simulated p-value to be assigned to a false null hypothesis.

The following procedure was used for this simulation.

Monte Carlo simulation procedure

To generate each subfamily, subfamilySize simulated null hypotheses were generated. Each was designated as either true or false, with probability pTrue of being designated true. Each true null hypothesis was assigned a simulated p-value drawn uniformly at random from $[0, 1]$, and each false null hypothesis was assigned a simulated p-value drawn uniformly at random from $[0, \mathrm{maxFalsePVal}]$. Having lower p-values for false null hypotheses simulates the use of a test statistic that is useful for discriminating between true and false null hypotheses.
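A single replicate of this simulation might be sketched as follows. This is a sketch under stated assumptions: the parameter names mirror the text, while the number of subfamilies per run (here 10) and the critical-value update are our own assumptions rather than details given in the text.

```python
import random

def simulate_once(n_families=10, subfamily_size=2, p_true=0.5,
                  max_false_p=0.05, alpha=0.05, rng=random):
    """One simulated run of the sequential procedure.

    Returns (false_rejection, true_discoveries): whether any true
    null hypothesis was rejected, and how many false null
    hypotheses were (correctly) rejected.
    """
    alpha_j = alpha
    false_rejection = False
    true_discoveries = 0
    for _ in range(n_families):
        # Generate one subfamily of (p-value, is_true) pairs.
        family = []
        for _ in range(subfamily_size):
            is_true = rng.random() < p_true
            p = rng.random() if is_true else rng.random() * max_false_p
            family.append((p, is_true))
        p_min, is_true = min(family)
        if p_min > alpha_j / subfamily_size:
            break  # stop at the first subfamily with no rejection
        if is_true:
            false_rejection = True
        else:
            true_discoveries += 1
        # Assumed budget update, as in our reading of the procedure.
        alpha_j -= (subfamily_size - 1) * p_min
    return false_rejection, true_discoveries
```

Averaging the first component of the returned pair over many replicates estimates the FWER for a given treatment, and averaging the second estimates the mean number of true discoveries.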

pTrue was varied in uniform steps up to 1.0, and subfamilySize was set to each of several fixed values, creating a total of 40 treatments. Monte Carlo simulations were conducted for each treatment, and the FWER and the average number of true discoveries per simulation were determined.

Figure 1 presents a surface chart showing the effect on FWER as the relative frequency of true to false null hypotheses is increased and as the subfamily size varies. When pTrue is 1.0, every null hypothesis is true, and FWER is determined by whether or not a null hypothesis is rejected from the first subfamily, since the procedure stops at the first subfamily with no rejection; FWER is then strictly controlled by the equivalent of a Bonferroni correction for the first subfamily. FWER falls as the proportion of false null hypotheses rises, because the multiple test correction allows for the possibility that all of the null hypotheses are true.

Increasing subfamilySize also decreases FWER, because the multiple test correction allows for the worst case in which the rejection regions of all null hypotheses are disjoint, whereas in this simulation all null hypotheses are independent of one another.

This simulation demonstrates the power of our procedure when its assumptions are satisfied, and shows that it is most powerful when the ratio of false to true null hypotheses is highest and subfamilySize is smallest.

We next demonstrate a scenario in which violating the requirement that true and false null hypotheses be independent results in a failure to control FWER.

In this scenario we have one false null hypothesis, $A$, and two true null hypotheses, $B$ and $C$. The experimental outcome on which $A$ and $B$ are based is the result of tossing an unbiased coin 17 times. The experimental outcome on which $C$ is based is the result of tossing another coin 13 times. Both coins are unbiased, with $P(\mathrm{head}) = 0.5$. We choose 17 for the first experiment because it is the smallest number of tosses that has an outcome for a test for $P(\mathrm{head}) = 0.5$ with probability close to 0.025, and 13 for the second because it is the smallest number of tosses that has an outcome with probability close to 0.05.
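The two tail probabilities motivating these choices are easy to verify with exact binomial arithmetic. This check assumes the relevant outcome for the 13-toss experiment is 3 or fewer heads (the text gives only the 17-toss region explicitly):

```python
from math import comb

# P(at most 4 heads in 17 tosses of a fair coin): the outcome with
# probability close to 0.025 that motivates choosing 17 tosses.
p17 = sum(comb(17, h) for h in range(5)) / 2**17

# P(at most 3 heads in 13 tosses of a fair coin): the outcome with
# probability close to 0.05 that motivates choosing 13 tosses.
p13 = sum(comb(13, h) for h in range(4)) / 2**13

print(round(p17, 4), round(p13, 4))  # -> 0.0245 0.0461
```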

$F_1 = \{A, B\}$ and $F_2 = \{C\}$.

We proceed to $F_2$ if either $A$ or $B$ is rejected.

$A$ and $B$ are tested at $\alpha / 2 = 0.025$.

There are 17 coin tosses and the rejection region for $B$ is 4 or fewer heads. The probability of this outcome is 0.0245.

The rejection region for $A$ is 5 or more heads.

The p-values assigned to $A$ for the outcomes of 5 through 17 heads are chosen such that $A$ will be rejected whenever there are 5 or more heads, and the adjusted alpha for $C$ will be, respectively, $0.05 - p_A$, where $p_A$ is the observed p-value of $A$. If $C$ were a maximally powerful true null hypothesis then the probability of it being rejected would be 0.047. Adding this to the probability of false rejection of $B$ gives a FWER of 0.0715.

However, as we are using coin tosses with a finite number of outcomes, the test of $C$ is not maximally powerful. A Monte Carlo simulation of 1,000,000 repetitions of this scenario yielded a FWER exceeding the nominal level of 0.05, demonstrating again that violation of the requirement that the true and false null hypotheses be independent of one another can lead to failure to control familywise error.

## 3 Conclusion

We have presented a novel procedure for controlling familywise error in a sequential testing scenario where at most one null hypothesis is to be rejected from each of a series of subfamilies of null hypotheses. We have shown that this procedure requires only the assumption that the p-values for the true and false null hypotheses are independent of one another. This assumption is realistic in the context of stepwise model selection for which the procedure was developed.

The procedure uses a novel mechanism of adjusting subsequent critical values by quantities based on the observed p-values of null hypotheses that are rejected. It remains a promising avenue for future research to investigate whether this strategy is more broadly applicable in other sequential testing scenarios.

## Acknowledgments

This research has been supported by the Australian Research Council under grant DP140100087.

### References

- P. Bauer, J. Röhmel, W. Maurer, and L. Hothorn. Testing strategies in multi-dose experiments including active control. Statistics in Medicine, 17(18):2133–2146, 1998. ISSN 1097-0258.
- Xun Chen, Xiaohui Luo, and Tom Capizzi. The application of enhanced parallel gatekeeping strategies. Statistics in Medicine, 24(9):1385–1397, 2005. ISSN 1097-0258.
- Alexei Dmitrienko, Walter W. Offen, and Peter H. Westfall. Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine, 22(15):2387–2400, 2003. ISSN 1097-0258.
- Jason C. Hsu and Roger L. Berger. Stepwise confidence intervals without multiplicity adjustment for dose-response and toxicity studies. Journal of the American Statistical Association, 94(446):468–482, 1999.
- W Maurer, LA Hothorn, and W Lehmacher. Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. Biometrie in der chemisch-pharmazeutischen Industrie, 6:3–18, 1995.
- Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
- Geoffrey I Webb and Francois Petitjean. A multiple test correction for streams and cascades of statistical hypothesis tests. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pages 1255–1264. ACM Press, 2016.
- Peter H. Westfall and Alok Krishen. Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. Journal of Statistical Planning and Inference, 99(1):25 – 40, 2001. ISSN 0378-3758.