Only Closed Testing Procedures are Admissible for Controlling False Discovery Proportions

Jelle J. Goeman (Dept. of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands)    Jesse Hemerik (Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway)    Aldo Solari (Dept. of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy)
Abstract

We consider the class of all multiple testing methods controlling tail probabilities of the false discovery proportion, either for one random set or simultaneously for many such sets. This class encompasses methods controlling familywise error rate, generalized familywise error rate, false discovery exceedance, joint error rate, simultaneous control of all false discovery proportions, and others, as well as seemingly unrelated methods such as gene set testing in genomics and cluster inference methods in neuroimaging. We show that all such methods are either equivalent to a closed testing method, or are uniformly improved by one. Moreover, we show that a closed testing method is admissible as a method controlling tail probabilities of false discovery proportions if and only if all its local tests are admissible. This implies that, when designing such methods, it is sufficient to restrict attention to closed testing methods only. We demonstrate the practical usefulness of this design principle by constructing a uniform improvement of a recently proposed method.

1 Introduction

Closed testing (Marcus et al., 1976) is known to be a fundamental principle of familywise error rate (FWER) control in multiple hypothesis testing. Indeed, almost every well-known procedure controlling FWER has been shown to be a special case of closed testing, and many procedures have even been explicitly constructed as such. This is natural from a theoretical perspective, as it has been shown by Sonnemann (1982, 2008) and Sonnemann and Finner (1988) that closed testing is necessary for FWER control: every admissible procedure that controls FWER is a special case of closed testing. Romano et al. (2011) extended the results of Sonnemann and Finner. They proved that, from a FWER perspective, not every closed testing procedure is admissible, but only consonant ones are. These results are valuable for designers of FWER controlling methods, who can rely exclusively on closed testing as a general design principle. Alternative design principles exist, such as the partitioning principle (Finner and Strassburger, 2002) and sequential rejection (Goeman and Solari, 2010), but these are equivalent to closed testing.

Rather than only for FWER control, Goeman and Solari (2011) showed that closed testing may also be used to obtain simultaneous confidence bounds for the false discovery proportion (FDP) of all subsets within a family of hypotheses. Used in this way, closed testing allows a form of post-selection inference. It allows users to look at the data prior to choosing thresholds and criteria for significance, while still keeping control of tail probabilities of the FDP. The approach of Goeman and Solari (2011) is equivalent to an earlier approach by Genovese and Wasserman (2004, 2006) that did not explicitly use closed testing. A natural question that arises is whether similar results to those of Sonnemann (1982), Sonnemann and Finner (1988) and Romano et al. (2011) also hold for this novel use of closed testing. When controlling FDP, is it sufficient to look only at closed testing-based methods? Which closed testing-based methods are admissible? These are the questions we will address in this paper.

We will look at these questions within the class of procedures that control tail probabilities of FDP, both for a single (random) set and simultaneously over many such sets. Many procedures, some of which at first sight seem to target control of quantities other than FDP, will turn out to be within this general class. These include all methods with regular FWER control; FWER control of intersection hypotheses; $k$-FWER control; simultaneous $k$-FWER control; False Discovery Exceedance control; control of recently proposed error rates such as the Joint Error Rate; Simultaneous Selective Inference; and even methods constructing confidence intervals for the overall proportion of true (or false) hypotheses. A detailed overview with references is given in the next section. We show how all of these methods can be written as procedures that simultaneously control FDP over all subsets within a family of hypotheses, although they typically focus the power of the procedure on a limited number of such subsets.

After formally defining the class of procedures we will study, we follow the lead of Sonnemann (1982) and Sonnemann and Finner (1988), generalizing their concept of coherence to FDP control. We show that every incoherent procedure can be uniformly improved by a coherent one, and that every coherent procedure is either a closed testing procedure or can be uniformly improved by one. This shows that only closed testing procedures are admissible as FDP controlling procedures. Next, we show that every closed testing procedure with admissible local tests is admissible as an FDP controlling procedure. Together, these results lay out design principles for construction of FDP controlling procedures. To design admissible procedures it is sufficient to create a closed testing procedure with admissible local tests. To show admissibility for a procedure designed in a different way, it is sufficient to show that the procedure is equivalent to such a procedure. We will discuss practical implications of our results for researchers seeking to develop new methods, and illustrate them by constructing a uniform improvement of a method recently proposed by Katsevich and Ramdas (2018).

2 Inference on false discovery proportions

Assume that we have data $\mathbf{X}$ distributed according to $P \in \mathcal{P}$, with $P$ unknown. About $P$ we may formulate hypotheses of the form $H \subseteq \mathcal{P}$. Let the family of hypotheses of interest be $(H_i)_{i \in C}$, where $C \subseteq K$ is finite. The maximal set $K$ is arbitrary here, but will become important in Section 5. For any finite $S \subseteq K$, let $T(S) \subseteq S$ be the index set of the true hypotheses in $S$ and $F(S) = S \setminus T(S)$ the index set of the false hypotheses. We will make no further model assumptions in this paper: any models, any test statistics, and any dependence structures will be allowed. Throughout the paper we will denote all random quantities in boldface. Equalities and inequalities between random variables should be read as holding almost surely for all $P \in \mathcal{P}$ unless otherwise stated. Proofs of all theorems, lemmas and propositions are in the Appendix.

We will be studying procedures with FDP control. The FDP of a finite set $S$ is given by

$$\mathrm{FDP}(S) = \frac{|T(S)|}{|S| \vee 1}.$$

We define a procedure with FDP control on $C$ (i.e. on $(H_i)_{i \in C}$) as a random function $\mathbf{q}\colon 2^C \to [0,1]$, where $2^C$ is the power set of $C$, such that for all $P \in \mathcal{P}$ it satisfies

$$\mathrm{P}\bigl(\mathrm{FDP}(S) \leq \mathbf{q}(S) \text{ for all } S \in 2^C\bigr) \geq 1 - \alpha. \qquad (1)$$

It will be more convenient to use an equivalent representation that gives a simultaneous lower $(1-\alpha)$-confidence bound for $|F(S)|$, the number of true discoveries in $S$. We say that a random function $\mathbf{d}\colon 2^C \to \mathbb{R}$ has true discovery guarantee on $C$ if, for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(\mathbf{d}(S) \leq |F(S)| \text{ for all } S \in 2^C\bigr) \geq 1 - \alpha. \qquad (2)$$

To see that the class of methods of FDP control and the class of methods with true discovery guarantee are equivalent, note that if $\mathbf{q}$ fulfils (1), then

$$\mathbf{d}(S) = \lceil (1 - \mathbf{q}(S)) |S| \rceil$$

fulfils (2) and, if $\mathbf{d}$ fulfils (2), then

$$\mathbf{q}(S) = \frac{|S| - \mathbf{d}(S)}{|S| \vee 1}$$

fulfils (1). In the rest of the paper we will focus on methods with true discovery guarantee, which are mathematically easier to work with than methods with FDP control, e.g. because they automatically avoid issues with empty sets $S$. Without loss of generality we may assume that $\mathbf{d}$ takes integer values, and that $\mathbf{d}(\emptyset) = 0$. If $\mathbf{d}(S)$ is not integer, we may freely replace it by $\lceil \mathbf{d}(S) \rceil$.
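As an illustration (a sketch in our own notation, not part of the formal development), the equivalence between FDP bounds and true discovery bounds can be coded directly:

```python
from math import ceil

# Sketch: converting between an FDP bound q(S) and a true discovery bound
# d(S) for a finite set S, as in the equivalence between (1) and (2).
# "n" denotes |S|; the FDP of an empty set is taken to be 0.

def q_to_d(q: float, n: int) -> int:
    """Lower confidence bound on true discoveries implied by an FDP bound q."""
    return ceil((1 - q) * n)

def d_to_q(d: int, n: int) -> float:
    """FDP bound implied by a true discovery bound d; empty sets give FDP 0."""
    return (n - d) / max(n, 1)

# e.g. an FDP bound of 0.25 on a set of 8 hypotheses guarantees
# at least 6 true discoveries, and conversely.
```

The rounding in `q_to_d` exploits that the number of false hypotheses is an integer, as noted above.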

The class of FDP control (cf. true discovery guarantee) procedures encompasses seemingly diverse methods. Only a few authors (Genovese and Wasserman, 2006; Goeman and Solari, 2011; Blanchard et al., 2017; Goeman et al., 2017) have explicitly proposed procedures that target control of FDP for all sets simultaneously as implied by (1). However, many other well-known types of multiple testing procedures turn out to be special cases of FDP control procedures, even if they were not directly formulated to control (1) or its equivalent. We will review these procedures briefly in the rest of this section in order to emphasize the wide range of applications of the results of this paper. We will reformulate such procedures in terms of $\mathbf{d}$.

Procedures that control FWER (e.g. Bretz et al., 2009; Westfall and Young, 1993; Berk et al., 2013; Janson et al., 2016) are usually defined as producing a random set $\mathbf{R} \subseteq C$ (possibly empty) for which it is guaranteed that, for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(\mathbf{R} \cap T(C) = \emptyset\bigr) \geq 1 - \alpha.$$
A generalization, $k$-FWER (Romano and Shaikh, 2006; Sarkar, 2007; Lehmann and Romano, 2005; Finos and Farcomeni, 2011; Guo et al., 2010; Hommel and Hoffmann, 1988), makes sure that, for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(|\mathbf{R} \cap T(C)| \leq k - 1\bigr) \geq 1 - \alpha,$$

a statement that reduces to regular FWER if $k = 1$ is chosen. It is easily seen that this is equivalent to requiring (2) if we take

$$\mathbf{d}(S) = \begin{cases} 0 \vee (|S| - k + 1) & \text{if } S \subseteq \mathbf{R}, \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

Free additional statements may be obtained from (3) by direct logical implication. For example, if $\mathbf{R} \neq \emptyset$ then we may immediately set $\mathbf{d}(S) = |S \cap \mathbf{R}| - k + 1$, if positive, for all $S \in 2^C$ without compromising (2). We will come back to such implications in Section 4.
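As a concrete sketch of the bound (3) together with its free logical extension, assuming a rejected set $\mathbf{R}$ with $k$-FWER control (representation and function names are ours, not the paper's):

```python
# Sketch: the true discovery bound implied by k-FWER control. With
# probability at least 1 - alpha the rejected set R contains at most k - 1
# true hypotheses, so any set S contains at least |S ∩ R| - (k - 1) false
# (i.e. correctly rejected) hypotheses.

def d_kfwer(S: frozenset, R: frozenset, k: int) -> int:
    """Lower confidence bound on true discoveries in S under k-FWER control."""
    return max(0, len(S & R) - (k - 1))

# With regular FWER (k = 1) this reduces to d(S) = |S ∩ R|.
```

This direct logical extension is a first example of the interpolation idea developed in Section 4.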

Related to $k$-FWER are methods controlling False Discovery Exceedance (FDX), also known as $\gamma$-FDP, at level $\gamma$ (Farcomeni, 2009; Sun et al., 2015; Delattre et al., 2015; Romano and Shaikh, 2006; Dudoit et al., 2004; Korn et al., 2004). Such methods find a random set $\mathbf{R}$ (possibly empty) such that, for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(\mathrm{FDP}(\mathbf{R}) \leq \gamma\bigr) \geq 1 - \alpha,$$

which is equivalent to (2) with

$$\mathbf{d}(S) = \begin{cases} \lceil (1 - \gamma) |\mathbf{R}| \rceil & \text{if } S = \mathbf{R}, \\ 0 & \text{otherwise.} \end{cases}$$

In most methods controlling FDX the control level $\gamma$ is fixed, but it may also be random, as e.g. in the permutation-based method of Hemerik and Goeman (2018). Variants, such as kFDP (Guo et al., 2014), which allow a minimum number of false discoveries regardless of the size of $\mathbf{R}$, also fit (2).

Other methods allow $\gamma$ to be chosen post hoc by controlling FDX simultaneously over several values of $\gamma$. One way to achieve this is by control of the Joint Error Rate (JER). The JER (Blanchard et al., 2017) constructs a sequence of distinct random sets $\mathbf{R}_1, \ldots, \mathbf{R}_m$ and corresponding random bounds $\mathbf{z}_1, \ldots, \mathbf{z}_m$, such that, for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(|\mathbf{R}_j \cap T(C)| \leq \mathbf{z}_j \text{ for all } j\bigr) \geq 1 - \alpha. \qquad (4)$$

This is a special case of (2) if we set

$$\mathbf{d}(S) = \begin{cases} |\mathbf{R}_j| - \mathbf{z}_j & \text{if } S = \mathbf{R}_j \text{ for some } j, \\ 0 & \text{otherwise.} \end{cases}$$

Joint error rate control may be used with nested sets (Blanchard et al., 2017) or tree-structured sets (Durand et al., 2018), and is meant to be combined with interpolation (see Section 4). Similar approaches were used by e.g. the permutation-based methods of Meinshausen (2006) and Hemerik et al. (2018). Also the Simultaneous Selective Inference approach of Katsevich and Ramdas (2018), discussed in detail in Section 9, can be seen as controlling JER with nested sets.

A different category of methods involves FWER control of many intersection hypotheses, as e.g. used in gene set testing in genomics and in cluster inference in neuroimaging. In genomics, a collection of distinct sets $S_1, \ldots, S_m$ is given a priori, and the procedure generates corresponding random indicators $\boldsymbol{\zeta}_1, \ldots, \boldsymbol{\zeta}_m$ for detection of signal in the corresponding set. FWER is controlled over all statements made, i.e., for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(\boldsymbol{\zeta}_j = 0 \text{ for all } j \text{ with } F(S_j) = \emptyset\bigr) \geq 1 - \alpha. \qquad (5)$$

This corresponds to (2) with

$$\mathbf{d}(S) = \begin{cases} \boldsymbol{\zeta}_j & \text{if } S = S_j \text{ for some } j, \\ 0 & \text{otherwise.} \end{cases}$$

Examples of such methods include Meinshausen (2008), Goeman and Mansmann (2008), Goeman and Finos (2012), Meijer and Goeman (2015b), Meijer et al. (2015), and Meijer and Goeman (2015a). In the latter two papers a connection with FDP control was already noted. In neuroimaging, cluster inference methods are similar, except that in this case the sets $\mathbf{S}_1, \ldots, \mathbf{S}_{\mathbf{m}}$ and their number $\mathbf{m}$ are random, and $\boldsymbol{\zeta}_j = 1$ for $j = 1, \ldots, \mathbf{m}$ is fixed (Poline and Mazoyer, 1993; Forman et al., 1995; Friston et al., 1996). FWER control (5) is guaranteed by Gaussian random field theory. Such control translates to a true discovery guarantee requirement (2) in the same way as in the genomics case.

Finally, interest can be in obtaining confidence statements for $\pi_0 = |T(C)|/|C|$, the proportion of true null hypotheses in the testing problem as a whole (Meinshausen and Rice, 2006; Ge and Li, 2012). Here, the requirement that $[0, \bar{\boldsymbol{\pi}}_0]$ is a valid confidence interval for $\pi_0$ is equivalent to demanding (2) with

$$\mathbf{d}(S) = \begin{cases} |C| - \lfloor \bar{\boldsymbol{\pi}}_0 |C| \rfloor & \text{if } S = C, \\ 0 & \text{otherwise.} \end{cases}$$

This listing of the different types of methods that may be written as true discovery guarantee methods is certainly not exhaustive, but a general pattern emerges. Any method controlling an $\alpha$-tail probability of the number or proportion of true discoveries (from below) or false discoveries (from above), either in one subset of $C$ or in several subsets simultaneously, is a special case of a general true discovery guarantee procedure. The sets and bounds are all allowed to be random; only $\alpha$ must be fixed.

Writing procedures as true discovery guarantee procedures, even when the rewriting is trivial, may bring a new perspective to the use of the procedure. As proposed by Goeman and Solari (2011), procedures that fulfil (1) or (2) allow a different, flexible way of using multiple testing methods. In flexible multiple testing the user may look at the data before choosing, post hoc, one or several sets of interest, based on any desired criteria, and find their $\mathbf{d}(S)$. Regardless of this data peeking, the bounds on the selected sets are simultaneously valid due to the simultaneity in (2). Writing procedures in this form, therefore, in principle opens the way to their use as post-selection inference methods (see Rosenblatt et al., 2018, for an application). Of course, this is only useful if the user has some real choice, i.e. if $\mathbf{d}(S) > 0$ for a decent number of sets $S$. We will see in Section 4 how to get rid of some of the zeros in the definitions above.

3 True discovery guarantee using closed testing

A general way to construct procedures with true discovery guarantee is through closed testing, a method introduced by Marcus et al. (1976) as a means of constructing methods for FWER control. Genovese and Wasserman (2006) and Goeman and Solari (2011) adapted closed testing to make it usable for true discovery guarantee and FDP control. We will briefly review these methods here.

For every finite set $S \subseteq K$ we define a corresponding intersection hypothesis as $H_S = \bigcap_{i \in S} H_i$. This hypothesis is true if and only if all $H_i$, $i \in S$, are true. We have $H_\emptyset = \mathcal{P}$, which is always true. For every such intersection hypothesis we may choose a local test $\boldsymbol{\phi}_S$, taking values in $\{0, 1\}$, which is simply a valid statistical test for $H_S$, with the property that

$$\mathrm{P}(\boldsymbol{\phi}_S = 1) \leq \alpha \text{ whenever } H_S \text{ is true.}$$

We always choose $\boldsymbol{\phi}_\emptyset = 0$ surely. Choosing a local test for every finite $S \subseteq K$ will yield a suite of local tests $\boldsymbol{\phi} = (\boldsymbol{\phi}_S)_{S \subseteq K}$. To deal with restricted combinations (Shaffer, 1986) efficiently, if present, we demand that identical hypotheses have identical tests, i.e. if for $S \neq U$ we have $H_S = H_U$, then we also have $\boldsymbol{\phi}_S = \boldsymbol{\phi}_U$. If $H_S = \emptyset$ for some $S$, we may take $\boldsymbol{\phi}_S = 1$ surely.

From such a suite of local tests we may obtain a true discovery guarantee procedure in two simple steps. First, we need to correct the tests for multiple testing. We define the effective local test within the family $C$ by

$$\bar{\boldsymbol{\phi}}_S = \min\{\boldsymbol{\phi}_U : S \subseteq U \subseteq C\}.$$

As shown by Marcus et al. (1976), the effective local tests have FWER control over all intersection hypotheses $H_S$, $S \subseteq C$, i.e., for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(\bar{\boldsymbol{\phi}}_S = 0 \text{ for all } S \subseteq C \text{ with } H_S \text{ true}\bigr) \geq 1 - \alpha.$$

We see that the procedure defined by $\mathbf{d}(S) = \bar{\boldsymbol{\phi}}_S$ already fulfils (2). More recently, however, Goeman and Solari (2011) showed that closed testing may also be used for more powerful FDP control. For any suite of local tests $\boldsymbol{\phi}$, these authors defined the associated procedure

$$\mathbf{d}_{\boldsymbol{\phi}}(S) = \min\{|S \setminus U| : U \subseteq C,\ \bar{\boldsymbol{\phi}}_U = 0\} \qquad (6)$$

and proved that it has true discovery guarantee. Note that the minimum is always defined since $\bar{\boldsymbol{\phi}}_\emptyset = 0$ surely.

An earlier general approach to developing procedures with true discovery guarantee was developed without reference to closed testing by Genovese and Wasserman (2006). Starting from a suite of local tests, they proved true discovery guarantee for the procedure

$$\mathbf{d}^*_{\boldsymbol{\phi}}(S) = \min\{|S \setminus U| : U \subseteq C,\ \boldsymbol{\phi}_U = 0\}. \qquad (7)$$

This procedure is equivalent to the procedure (6), as was shown by Hemerik et al. (2018). This result was hidden in the supplemental information to that paper, and it deserves more prominence. The proof is short. We repeat it here.

Lemma 1.

$\mathbf{d}_{\boldsymbol{\phi}} = \mathbf{d}^*_{\boldsymbol{\phi}}$.

The expressions (6) and (7) are very useful for constructing methods with true discovery guarantee. Local tests tend to be easy to specify in most models, as each local test is a test of a single hypothesis, so that standard statistical test theory may be used. Given a suite of local tests, (6) or (7) takes care of the multiplicity. A computational problem remains, of course, if $C$ is large, since direct application of (6) and (7) takes exponential time. Often, however, shortcuts are available that allow faster computation (Goeman and Solari, 2011; Goeman et al., 2017). We will see an example in Section 9.
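For small families, (6) and (7) can be computed by brute force, which also illustrates Lemma 1 numerically. A sketch in our own notation (set representation and names are ours; exponential time, for illustration only):

```python
from itertools import chain, combinations

# Brute-force sketch of the closed testing bound (6) and the
# Genovese-Wasserman bound (7). A "suite" maps each subset of C to a local
# test outcome in {0, 1}, with the empty set always accepted.

def subsets(C):
    return [frozenset(s) for s in chain.from_iterable(
        combinations(sorted(C), r) for r in range(len(C) + 1))]

def effective(phi, C):
    """Effective local tests: reject U only if every superset is rejected."""
    return {U: min(phi[V] for V in subsets(C) if U <= V) for U in subsets(C)}

def d_closed(S, phi, C):
    """Bound (6): minimize |S \\ U| over U with effective local test = 0."""
    bar = effective(phi, C)
    return min(len(S - U) for U in subsets(C) if bar[U] == 0)

def d_gw(S, phi, C):
    """Bound (7): minimize |S \\ U| over U with raw local test = 0."""
    return min(len(S - U) for U in subsets(C) if phi[U] == 0)
```

On any suite the two bounds coincide, in line with Lemma 1.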

Comparing (6) and (7), the single-step expression of Genovese and Wasserman (2006) is clearly more elegant. However, the link of (6) to closed testing is valuable because it connects true discovery guarantee procedures to the enormous literature on closed testing (see Henning and Westfall, 2015, for an overview). We found that the detour via effective local tests is often profitable in practice, because expressions for $\mathbf{d}_{\boldsymbol{\phi}}$ can be easier to derive through easy expressions for the effective local tests, as we shall see in Section 9 (also Goeman et al., 2017; Hemerik and Goeman, 2018).

4 Coherence and interpolation

By viewing methods in terms of true discovery guarantee, as we have done in Section 2, they are upgraded from making a confidence statement about discoveries in a limited number of sets to doing the same for all subsets of $C$. However, in the definitions above, most of these statements are the trivial $\mathbf{d}(S) = 0$. Often, some of the statements can be uniformly improved by a process called interpolation. In this section we discuss interpolation and how it can improve true discovery guarantee procedures. We will define coherent procedures as procedures that cannot be improved by interpolation.

Let $\mathbf{d}$ be some procedure with true discovery guarantee. We define the interpolation $\mathbf{d}^\circ$ of $\mathbf{d}$ as

$$\mathbf{d}^\circ(S) = \max_{U \in 2^C} \bigl( \mathbf{d}(U) - |U \setminus S| + \mathbf{d}(S \setminus U) \bigr). \qquad (8)$$

Interpolation was used in weaker versions or in specific cases by several authors (Genovese and Wasserman, 2006; Meinshausen, 2006; Blanchard et al., 2017; Durand et al., 2018). Taking $U = S$, we see that $\mathbf{d}^\circ \geq \mathbf{d}$. Moreover, the improvement from $\mathbf{d}$ to $\mathbf{d}^\circ$ is for free, as noted in the following lemma.

Lemma 2.

If $\mathbf{d}$ has true discovery guarantee then so does $\mathbf{d}^\circ$.

Intuitively, the rationale for interpolation is as follows. If $\mathbf{d}(U)$ is large, and $U$ has so much overlap with $S$ that the signal in $U$ does not fit in $U \setminus S$, then the remaining signal must be in $S$. Since this reasoning follows by direct logical implication, it will not increase the occurrence of type I error: we can only make an erroneous statement about $S$ if we had already made one about $U$. As an example, consider interpolation for $k$-FWER controlling procedures. The interpolated version of (3) is simply

$$\mathbf{d}^\circ(S) = 0 \vee (|S \cap \mathbf{R}| - k + 1), \qquad (9)$$

an expression that simplifies even further to $\mathbf{d}^\circ(S) = |S \cap \mathbf{R}|$ with regular FWER when $k = 1$.
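A brute-force sketch of iterated interpolation, under our reading of the rationale above (signal counted in $U$ that does not fit in $U \setminus S$ must lie in $S$, and bounds on disjoint parts of $S$ may be added); applied to the restricted $k$-FWER bound of (3) it reproduces (9). Representation and names are ours:

```python
from itertools import chain, combinations

# Sketch: repeated interpolation of a true discovery bound d over a family C,
# applied until no statement improves any further.

def subsets(C):
    return [frozenset(s) for s in chain.from_iterable(
        combinations(sorted(C), r) for r in range(len(C) + 1))]

def interpolate(d, C):
    d = dict(d)
    improved = True
    while improved:          # interpolation may pay off more than once
        improved = False
        for S in subsets(C):
            new = max(d[U] - len(U - S) + d[S - U] for U in subsets(C))
            if new > d[S]:
                d[S], improved = new, True
    return d
```

Starting from a bound that is only non-trivial on subsets of a rejected set $\mathbf{R}$, the fixpoint spreads the guarantee to all sets overlapping $\mathbf{R}$.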

Interpolation is not necessarily a one-off process, and interpolated procedures may sometimes be further improved by another round of interpolation. We call a procedure $\mathbf{d}$ coherent if it cannot be improved by interpolation, i.e. if

$$\mathbf{d}^\circ = \mathbf{d}. \qquad (10)$$

We can characterize coherent procedures further with the following lemma.

Lemma 3.

$\mathbf{d}$ is coherent if and only if for every disjoint $U, V \in 2^C$ we have

$$\mathbf{d}(U) + \mathbf{d}(V) \leq \mathbf{d}(U \cup V) \leq \mathbf{d}(U) + |V|. \qquad (11)$$

We intentionally use the same term coherent that was used by Sonnemann (1982) in the context of FWER control of intersection hypotheses. Looking only at FWER control of intersection hypotheses is equivalent to looking only at $\mathbf{1}\{\mathbf{d}(S) > 0\}$ for every $S \in 2^C$, where $\mathbf{1}$ denotes an indicator function. In that case (10) reduces to simply requiring that $U \subseteq V$ and $\mathbf{d}(U) > 0$ imply that $\mathbf{d}(V) > 0$, which is exactly Sonnemann's definition of coherence.
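A brute-force check of coherence in the sense of Lemma 3 can be sketched as follows, under our reading of the characterization: for disjoint $U$ and $V$, a coherent bound satisfies $\mathbf{d}(U) + \mathbf{d}(V) \leq \mathbf{d}(U \cup V) \leq \mathbf{d}(U) + |V|$. Names and representation are ours:

```python
from itertools import chain, combinations

# Sketch: verify the coherence condition of Lemma 3 for a bound d given as a
# dict from frozensets to integers, by enumerating all disjoint pairs.

def subsets(C):
    return [frozenset(s) for s in chain.from_iterable(
        combinations(sorted(C), r) for r in range(len(C) + 1))]

def is_coherent(d, C):
    for U in subsets(C):
        for V in subsets(C):
            if U & V:
                continue  # only disjoint pairs are constrained
            if not d[U] + d[V] <= d[U | V] <= d[U] + len(V):
                return False
    return True
```

For example, the restricted $k$-FWER bound of (3) fails the check, while its interpolated version (9) passes it.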

Methods that are created through closed testing are automatically coherent, as the following lemma claims.

Lemma 4.

The procedure $\mathbf{d}_{\boldsymbol{\phi}}$ is coherent.

Since an incoherent procedure can always be replaced by a coherent procedure that is at least as good, we will restrict attention to coherent procedures for the rest of this paper.

5 Monotone procedures

The methods from the literature discussed in Sections 2 and 3 are typically not defined for a specific family of hypotheses, but are usually formulated as generic procedures that can be used for any family, large or small. Researchers developing methods are usually not looking for good properties for a specific family at a specific scale, but for methods that are generally applicable and have good properties whatever the family.

We can embed the procedure into a stack of procedures $\mathbf{d} = (\mathbf{d}_C)_C$, with $C$ ranging over the finite subsets of some maximal family $K$. We will briefly call $\mathbf{d}$ a monotone procedure. In contrast, we call $\mathbf{d}_C$ for a specific $C$ a local procedure, or a local member of $\mathbf{d}$. For a monotone procedure we make the following three assumptions:

  1. true discovery guarantee: $\mathbf{d}_C$ has true discovery guarantee for every finite $C \subseteq K$;

  2. coherence: $\mathbf{d}_C$ is coherent for every finite $C \subseteq K$;

  3. monotonicity: $\mathbf{d}_C(S) \geq \mathbf{d}_{C'}(S)$ for every finite $S \subseteq C \subseteq C' \subseteq K$.

The first two assumptions are no more than natural. We demand true discovery guarantee for every member of the monotone procedure, and we demand coherence for every local member since otherwise we may always improve it by a coherent procedure. The monotonicity requirement relates local procedures at different scales to each other. It says that inference on the number of discoveries in a set $S$ should never get better if we embed $S$ in a larger family $C'$ rather than in a smaller family $C$. As the multiple testing problem gets larger, inference should get more difficult. This requirement relates closely to the “subsetting property” of Goeman and Solari (2014) and the monotonicity property of various FWER control procedures (e.g. Bretz et al., 2009; Goeman and Solari, 2010). It is also a natural requirement for FDP controlling procedures, and the procedures cited in Section 2 generally adhere to it by construction.

There are a few notable exceptions to the rule that method designers tend to design monotone rather than local procedures. The examples we are aware of are all FWER-controlling procedures. Rosenblum et al. (2014) proposed a local procedure that optimizes the power for rejecting at least one of the hypotheses in the family. Their method is specific for the scale it was defined for: extensions to larger families do not exist (Rosset et al., 2018), and natural extensions to smaller families do not fulfill the monotonicity requirement. Rosset et al. (2018) developed methods that optimize any-power for specific scales under an exchangeability assumption, and that also have non-monotone behavior. We remark, however, that every coherent local procedure $\mathbf{d}_C$ with true discovery guarantee may be trivially embedded in a monotone procedure by setting, for every finite $S \subseteq C' \subseteq K$,

$$\mathbf{d}_{C'}(S) = \mathbf{d}_C(S \cap C). \qquad (12)$$

This embedding is in itself not useful, but it allows translation of some properties we will derive for monotone procedures to properties of their local members.

Procedures created from a suite of local tests are automatically monotone, as formalized in the following lemma.

Lemma 5.

The procedure $\mathbf{d}_{\boldsymbol{\phi}}$ is a monotone procedure.

We will mostly be studying monotone procedures in this paper, but investigate implications for local procedures where appropriate. The property of primary interest is admissibility. Let us formally define admissibility for true discovery guarantee procedures. Recall that a statistical test $\boldsymbol{\phi}$ of a hypothesis is uniformly improved by a statistical test $\boldsymbol{\psi}$ of the same hypothesis if (1.) $\boldsymbol{\psi} \geq \boldsymbol{\phi}$; and (2.) $\mathrm{P}(\boldsymbol{\psi} > \boldsymbol{\phi}) > 0$ for some $P \in \mathcal{P}$. A statistical test is admissible if no test exists that uniformly improves it. We call a suite of local tests $\boldsymbol{\phi}$ admissible if $\boldsymbol{\phi}_S$ is admissible for all finite $S \subseteq K$.

Analogously, we define admissibility for procedures with true discovery guarantee. A uniform improvement of a monotone procedure $\mathbf{d}$ is a monotone procedure $\mathbf{e}$ such that (1.) $\mathbf{e}_C(S) \geq \mathbf{d}_C(S)$ for all finite $S \subseteq C \subseteq K$; and (2.) $\mathrm{P}(\mathbf{e}_C(S) > \mathbf{d}_C(S)) > 0$ for some $P \in \mathcal{P}$ and some finite $S \subseteq C \subseteq K$. A uniform improvement of a local procedure $\mathbf{d}_C$ is a local procedure $\mathbf{e}_C$ such that (1.) $\mathbf{e}_C(S) \geq \mathbf{d}_C(S)$ for all $S \in 2^C$; and (2.) $\mathrm{P}(\mathbf{e}_C(S) > \mathbf{d}_C(S)) > 0$ for some $P \in \mathcal{P}$ and some $S \in 2^C$. We call a local or monotone procedure that cannot be uniformly improved admissible. If all local members of a monotone procedure are admissible, then the monotone procedure is admissible, but the converse is not necessarily true, as illustrated in Appendix A.

6 All admissible procedures are closed testing procedures

Theorem 1, below, claims that every monotone procedure with true discovery guarantee is either equivalent to a closed testing procedure or can be uniformly improved by one. We already know from Lemma 3 that every incoherent procedure can be uniformly improved by a coherent one. It follows that every procedure that is not equivalent to a closed testing procedure is inadmissible. This is the first main result of this paper.

Theorem 1.

Let $\mathbf{d}$ be a monotone procedure. Then, for every finite $S \subseteq K$,

$$\boldsymbol{\phi}_S = \mathbf{1}\{\mathbf{d}_S(S) > 0\}$$

is a valid local test of $H_S$. For the suite $\boldsymbol{\phi} = (\boldsymbol{\phi}_S)_{S \subseteq K}$ we have, for all $C$ with $S \subseteq C \subseteq K$,

$$\mathbf{d}_{\boldsymbol{\phi}}(S) \geq \mathbf{d}_C(S).$$

Coherence is necessary but not sufficient to guarantee admissibility. The procedure $\mathbf{d}_{\boldsymbol{\phi}}$ may in some cases be truly a uniform improvement over the original, coherent $\mathbf{d}$. To see a classical example in which a coherent procedure can be uniformly improved by a closed testing argument, think of Bonferroni. Combined with (9), Bonferroni is coherent. However, it is uniformly improved by Holm's procedure, which follows from a well-known step-down argument. This stepping-down can be seen as a direct application of closed testing with the local test defined in Theorem 1. Step-down arguments are standard for FWER control and have been applied to several FDP controlling methods in the past (Blanchard et al., 2017; Goeman et al., 2017; Hemerik et al., 2018).
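The Bonferroni-to-Holm improvement can be made concrete: closed testing with Bonferroni local tests, which reject $H_S$ when $\min_{i \in S} p_i \leq \alpha/|S|$, reproduces Holm's rejections. A brute-force sketch (names and representation are ours; exponential time, for illustration only):

```python
from itertools import combinations

# Sketch: closed testing with Bonferroni local tests reproduces Holm's
# step-down procedure.

def holm(pvals, alpha):
    """Indices rejected by Holm's step-down procedure."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    rejected = set()
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            rejected.add(i)
        else:
            break
    return rejected

def closed_bonferroni(pvals, alpha):
    """Indices i whose effective local test rejects every H_S with i in S."""
    C = list(range(len(pvals)))
    def local(S):                      # Bonferroni local test for H_S
        return min(pvals[i] for i in S) <= alpha / len(S)
    all_sets = [set(s) for r in range(1, len(C) + 1)
                for s in combinations(C, r)]
    return {i for i in C
            if all(local(S) for S in all_sets if i in S)}
```

The step-down shortcut computes in quadratic time what the closure computes in exponential time.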

It should be noted that in the case of a monotone procedure, the local test defined in Theorem 1 is truly local, in the sense that it uses only the information used by the restricted testing problem about the hypotheses $H_i$, $i \in S$. For example, in a testing problem based on $p$-values, the local test would use only the $p$-values of the hypotheses $H_i$, $i \in S$. In other testing problems, some global information may be used, e.g. the overall estimate of the residual variance in a large one-way ANOVA, but still in such situations the local test is very natural: as a local test for $H_S$ we use the test for discovery of signal among the hypotheses $H_i$, $i \in S$, that we would use in the situation where the remaining hypotheses are not of interest to us. Such a local test is implicitly defined by the local procedure $\mathbf{d}_S$.

The result of the theorem is formulated in terms of monotone procedures. It applies immediately to local procedures as well if we use the trivial embedding (12) of a local procedure into a monotone one. With this embedding we even have $\mathbf{d}_S(S) = \mathbf{d}(S)$ for every $S \in 2^C$. This leads to the following corollary.

Corollary 1.

Let $\mathbf{d}$ be a coherent procedure with true discovery guarantee on $C$. Then, for every $S \in 2^C$,

$$\boldsymbol{\phi}_S = \mathbf{1}\{\mathbf{d}(S) > 0\}$$

is a valid local test of $H_S$. For the suite $\boldsymbol{\phi} = (\boldsymbol{\phi}_S)_{S \in 2^C}$ we have, for all $S \in 2^C$,

$$\mathbf{d}_{\boldsymbol{\phi}}(S) \geq \mathbf{d}(S).$$

Corollary 1 shows that every coherent procedure with true discovery guarantee is equivalent to, or uniformly improved by, a closed testing procedure. It may possibly be uniformly improved by another closed testing procedure if the suite of local tests is not admissible, as we will see in the next section. The embedding into a monotone procedure used in Theorem 1 helps to define the local tests in terms of the local procedure $\mathbf{d}_S$, rather than in terms of the original $\mathbf{d}$, allowing for more intuitive, and possibly more powerful, local tests.

7 All closed testing procedures are admissible

So far we have seen that a procedure with true discovery guarantee may be uniformly improved by interpolation to a coherent procedure. This procedure in turn may be uniformly improved by a closed testing procedure. Clearly, equivalence to a closed testing procedure is necessary for admissibility. Are all closed testing procedures admissible? In this section we derive a simple condition for admissibility of monotone procedures that is both necessary and sufficient. We show that admissibility of the monotone procedure follows directly from admissibility of its local tests. This is the second main result of this paper.

Theorem 2.

$\mathbf{d}_{\boldsymbol{\phi}}$ is admissible if and only if the suite $\boldsymbol{\phi}$ is admissible.

We already saw from Theorem 1 that only closed testing procedures are admissible. Theorem 2 says that all closed testing procedures are admissible, provided they fulfil the reasonable demand that they are built from admissible local tests.

Unlike Theorem 1, the result of Theorem 2 does not immediately translate to local procedures: even if the monotone procedure $\mathbf{d}_{\boldsymbol{\phi}}$ is admissible, it may happen for some finite $C$ that its local member can be uniformly improved by some other local procedure. About such local improvements we have the following proposition.

Proposition 1.

If is admissible, then there is an admissible such that and, for all , .

Proposition 1 limits the available room for local improvements of admissible monotone procedures. Combining Proposition 1 and Theorem 2 we see that such improvements have to be admissible monotone procedures, and therefore closed testing procedures, themselves. The difference between the original procedure and its local improvement, if both are admissible, is that for every $C$ the former uses only the local information in $C$, but the same does not necessarily hold for the latter.

In Appendix A we give an example of a local improvement of a monotone procedure. In this example the model is restricted in such a way that the local tests are known to have limited power even under the alternative. We show that this knowledge can be exploited to obtain local improvements. In the example it is crucial that, for some intersection hypotheses, the effective local test never exhausts the $\alpha$-level, so that there is room to be exploited, from which a local improvement can be constructed. Local improvements are also possible in case null hypotheses are composite, using the Partitioning Principle, as shown in Finner and Strassburger (2002), examples 4.1–4.3, and Goeman and Solari (2010), section 4. For many well-known procedures, e.g. Holm's procedure under arbitrary dependence, local improvements do not exist. We have no general theory on the relationship between admissibility of a monotone procedure and admissibility of its local members. We leave this as an open problem.

8 Consonance and familywise error

Theorem 2 establishes a necessary and sufficient condition for admissibility of monotone true discovery guarantee procedures, and therefore of FDP-controlling procedures. At first sight, our results may seem at odds with those of Romano et al. (2011), who proved that for FWER control, which is a special case of true discovery guarantee, only consonant procedures are admissible. However, this seeming contradiction disappears when we realize that admissibility of a procedure as a true discovery control procedure does not automatically imply admissibility as a FWER controlling procedure and vice versa. In this section we take a sidestep to FWER control, investigating the concept of consonance, and extending some of the results of Romano et al. (2011) on admissibility of FWER controlling procedures.

We call a procedure $\mathbf{d}$ consonant if it has the property that for every $S \in 2^C$, $\mathbf{d}(S) > 0$ implies that for at least one $i \in S$ we have $\mathbf{d}(\{i\}) > 0$, almost surely for all $P \in \mathcal{P}$. If $\mathbf{d} = \mathbf{d}_{\boldsymbol{\phi}}$, this is equivalent to the more usual formulation in terms of the suite $\boldsymbol{\phi}$, that $\bar{\boldsymbol{\phi}}_S = 1$ implies that for at least one $i \in S$ we have $\bar{\boldsymbol{\phi}}_{\{i\}} = 1$, almost surely for all $P \in \mathcal{P}$. We call a monotone procedure consonant if all local members $\mathbf{d}_C$, $C \subseteq K$ finite, are consonant. We call a suite $\boldsymbol{\phi}$ consonant if the corresponding procedures $\mathbf{d}_{\boldsymbol{\phi}}$, for $C \subseteq K$ finite, are consonant.

Conceptually, consonant procedures allow pinpointing of effects. If $\mathbf{d}(S) > 0$, signal has been detected somewhere in $S$. A consonant procedure in this case can always find at least one elementary hypothesis to pin the effect down on. This is a desirable property, as it can be unsatisfactory for a researcher to know that an effect exists but not where it can be found. On the other hand, Goeman et al. (2017) argued that non-consonant procedures can be far more powerful in large-scale multiple testing problems than consonant ones.

For consonant procedures a stronger version of Lemma 3 holds.

Lemma 6.

$\mathbf{d}$ is consonant and coherent if and only if, for every disjoint $U, V \in 2^C$,

$$\mathbf{d}(U \cup V) = \mathbf{d}(U) + \mathbf{d}(V). \qquad (13)$$

Classically, the focus in the literature on closed testing has been on FWER controlling procedures (Henning and Westfall, 2015). An FWER controlling procedure on a finite $C$ returns a set $\mathbf{R} \subseteq C$ such that, for all $P \in \mathcal{P}$,

$$\mathrm{P}\bigl(\mathbf{R} \cap T(C) = \emptyset\bigr) \geq 1 - \alpha.$$

As argued in Section 2, we can relate FWER controlling procedures to true discovery guarantee procedures and vice versa. If $\mathbf{R}$ is a FWER controlling procedure, then $\mathbf{d}$ with

$$\mathbf{d}(S) = |S \cap \mathbf{R}|$$

for all $S \in 2^C$, is a coherent procedure with true discovery guarantee, as we know from (9). Conversely, if $\mathbf{d}$ is a coherent procedure with true discovery guarantee, then

$$\mathbf{R} = \{i \in C : \mathbf{d}(\{i\}) > 0\}$$

is a FWER controlling procedure. Both types of procedures may be created from local tests. The FWER controlling procedure from the suite $\boldsymbol{\phi}$ is given by

$$\mathbf{R}_{\boldsymbol{\phi}} = \{i \in C : \bar{\boldsymbol{\phi}}_{\{i\}} = 1\}. \qquad (14)$$

We can compare the procedure defined from through (6) with the procedure

indirectly defined through . This is the procedure that discards all information in that is not contained in . Lemma 7 describes consonance as the property that no information is lost in the process.
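For small problems, the correspondence between a suite of local tests, the closed testing FWER procedure, and the true discovery guarantee can be made concrete by brute force. The sketch below enumerates all subsets (exponential time, so for illustration on small problems only) and uses the Simes test as an example local test; all function names are ours.

```python
from itertools import combinations

def subsets_of(universe):
    """All non-empty subsets of universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def closed_testing(m, local_test):
    """Sets U rejected by closed testing: H_U is rejected iff the
    local test rejects every superset V of U within {0,...,m-1}."""
    all_U = subsets_of(range(m))
    local = {U: local_test(U) for U in all_U}
    return {U for U in all_U if all(local[V] for V in all_U if V >= U)}

def fwer_rejections(m, local_test):
    """FWER-controlling rejection set: reject H_i iff closed testing
    rejects the singleton {i}."""
    ct = closed_testing(m, local_test)
    return {i for i in range(m) if frozenset([i]) in ct}

def true_discoveries(S, m, local_test):
    """Lower confidence bound on the number of false hypotheses in S:
    the smallest |S \\ U| over sets U not rejected by closed testing
    (the empty set, never rejected, gives the fallback |S|)."""
    ct = closed_testing(m, local_test)
    not_rejected = [frozenset()] + [U for U in subsets_of(range(m))
                                    if U not in ct]
    return min(len(frozenset(S) - U) for U in not_rejected)

def simes_local_test(pvals, alpha):
    """Simes test as local test: reject H_U iff some sorted p-value
    in U falls at or below its critical value i * alpha / |U|."""
    def test(U):
        ps = sorted(pvals[i] for i in U)
        return any(p <= (i + 1) * alpha / len(ps)
                   for i, p in enumerate(ps))
    return test
```

For example, with p-values (0.001, 0.002, 0.8, 0.9) at level 0.05, closed testing with the Simes local test rejects the first two hypotheses with FWER control, and guarantees at least two true discoveries in the set of the first three.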

Lemma 7.

If is consonant, ; otherwise uniformly improves .

If FWER control is what we are after, however, we must look at admissibility of directly. As with procedures with true discovery guarantee, we will focus on monotone (stacks of) procedures defined for all finite . We call a procedure monotone if for all finite we have

As above for true discovery guarantee procedures, monotonicity asserts that enlarging the multiple testing problem from to will never increase the number of rejections in (Bretz et al., 2009; Goeman and Solari, 2010). Analogous to the definition in Section 5, we define a uniform improvement of a monotone FWER control procedure as a monotone FWER control procedure such that (1.) for all finite ; and (2.) for some and some finite . A procedure is admissible if no uniform improvement exists. What can we say about admissibility of FWER control procedures?

Romano et al. (2011) showed that consonance is necessary for admissibility of FWER controlling procedures. Proposition 2 is a variant of the result of Romano et al. (2011) for monotone procedures.

Proposition 2.

If is admissible, then a consonant suite exists such that .

We also have a second necessary condition for admissibility.

Proposition 3.

If is admissible, then an admissible suite exists such that .

It would be tempting to conclude from Propositions 2 and 3 that if is admissible, then with consonant and admissible. Certainly under weak assumptions we may choose . For example, if some is inadmissible because it fails to exhaust the -level, we may choose as plus a randomized multiple of for some . However, we were unable to prove in full generality that is always possible. Perhaps in some awkward models admissible local tests cannot be consonant, and consonant tests cannot be admissible. We leave open the question when is possible. Conversely, however, if we can find a that is both admissible and consonant, we have an admissible procedure:

Proposition 4.

If is consonant and admissible, then is admissible.

9 Application: improving an existing method

We will now illustrate how existing methods may be improved by embedding them in a closed testing procedure and using the results of this paper. We chose a method recently proposed by Katsevich and Ramdas (2018). This elegant method, which we abbreviate K&R, allows users to choose a -value cutoff for significance post hoc, and uses stochastic process arguments to control both FDP and FDR. We focus on the FDP control property here.

Take . Let hypotheses have corresponding -values , assumed independent, and standard uniform, or possibly stochastically larger, if is true. Katsevich and Ramdas proposed a method that controls (4) for sets , where consists of the indices of the hypotheses with smallest -values, with ties broken arbitrarily. As shown by Katsevich and Ramdas, (4) holds if for these sets we have

where , and is the th smallest -value. For we have . As in Section 2 we can write this as a procedure with true discovery guarantee on by writing

(15)

where we round up to ensure that is always an integer.
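As a concrete sketch, the bounds of (15) can be computed as below, assuming (this is our reading of Katsevich and Ramdas, 2018, not a formula stated in this section) that the K&R guarantee bounds the number of false discoveries among the k smallest p-values by a(1 + m p_(k)), with a = log(1/alpha)/log(1 + log(1/alpha)).

```python
import math

def kr_true_discoveries(pvals, alpha=0.05):
    """Sketch of the Katsevich & Ramdas simultaneous bound as we read
    it (an assumption here): with probability at least 1 - alpha, for
    every k the number of false discoveries among the k smallest
    p-values is at most a * (1 + m * p_(k)), where
    a = log(1/alpha) / log(1 + log(1/alpha)).
    Returns the implied lower bounds on true discoveries in R_k."""
    m = len(pvals)
    a = math.log(1 / alpha) / math.log(1 + math.log(1 / alpha))
    ps = sorted(pvals)
    # round up so that the bound is an integer, and never below 0
    return [max(0, math.ceil(k - a * (1 + m * ps[k - 1])))
            for k in range(1, m + 1)]
```

For alpha = 0.05 the constant a is approximately 2.16; with ten p-values all equal to 0.0001, the bound guarantees 8 true discoveries among all ten.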

Is the procedure (15) admissible, and if not, how can we improve it? We apply the results of this paper. First, we remark that the method as defined is not coherent. To interpolate, we introduce the notation , with , for the th smallest -value among the multiset , so that in particular . The interpolation of the procedure follows directly from (8), yielding

(16)

To simplify this expression, call . Let be the permutation such that for all . If , then either and or and , so we may restrict the maximum in (16) to values of with . We have with . If , then . Therefore, (16) reduces to

(17)

taking implicitly. We note that the interpolated method makes non-trivial statements for sets not of the form , and may even improve for some . It may be checked using Lemma 3 that the procedure (17) is coherent, so no further rounds of interpolation are needed. The K&R procedure was not developed for a specific scale . Writing for in (17) we have a procedure that is defined for general , and it is easy to check that is monotone.
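A generic sketch of one standard interpolation step, in the spirit of (8) (the exact form used here may differ): any set R with a guaranteed number of true discoveries can lose at most |R \ S| of them when restricted to S.

```python
def interpolate(d, sets, S):
    """Interpolated lower bound for S from bounds d[R] reported on a
    collection of sets: d_bar(S) = max over R of d(R) - |R \\ S|,
    and at least 0.  A sketch of one standard interpolation step."""
    S = frozenset(S)
    return max([0] + [d[R] - len(R - S) for R in sets])
```

For instance, if two true discoveries are guaranteed in {0, 1, 2}, then at least one of them must lie in {0, 1}.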

Next, we use Theorem 1 to embed the method in a closed testing procedure, which hopefully results in further improvement of the procedure. By the theorem, the local test for finite is given by

(18)

We will construct the closed testing procedure based on this local test. To do this, we note that the local test is very similar to the Simes (1986) local test studied in the context of FDP control by Goeman et al. (2017). We generalize a result from that paper to local tests of the form

(19)

where we assume that for all . For convenience, we let be defined for all and , i.e. even if . Without loss of generality we can take for , and for all .

The general form (19) encompasses the local test (18), taking

(20)

if , and the Simes test, taking if . It can also be used for a much broader range of tests, e.g. the local tests implied by the False Discovery Rate controlling procedures of Blanchard and Roquain (2009), higher criticism (Donoho et al., 2004), the local tests implied by the Dvoretzky-Kiefer-Wolfowitz inequality (Genovese and Wasserman, 2004; Meinshausen, 2006), the local tests implied by second- and higher-order generalized Simes constants (Cai and Sarkar, 2008; Gou and Tamhane, 2014), and the local tests implied by the FDR controlling procedures of Benjamini and Liu (1999) and Romano and Shaikh (2006, equation 4.1).

Generalizing a result of Goeman et al. (2017) for the Simes local test we prove that calculation of for local tests of type (19) can be done in quadratic time. We use (6), so we first characterize the effective local test before deriving an expression for .

Lemma 8.

If , , is of the form (19), with for all and , then

and

(21)

where

We note that can be calculated in time, and, given , we can calculate in time. In special cases the calculation time of and is even linear after sorting the -values, as shown by Meijer et al. (2018).

Our result is related to the results on FDP control based on bounding functions (Genovese and Wasserman, 2006; Blanchard et al., 2017). However, earlier authors have focused on proving type I error control only, in which case it was sufficient to give a computable lower bound for . From the perspective of admissibility, in view of Theorems 1 and 2, it is relevant that we compute exactly in Lemma 8.

By Theorem 1, the method (21) is everywhere at least as powerful as the interpolated method (17). In fact, it is a uniform improvement of that method, as we shall see in the simulation experiment below. The next question is whether the method defined by (21) with (20) is admissible, or whether it can be further improved. We can verify this using Theorem 2 by checking whether the local tests are admissible. It is immediately obvious that this is not the case. Taking e.g. , we see that at with we have , which is clearly not admissible. We may freely decrease to to obtain the uniformly more powerful local test . We can use the same reasoning for , decreasing the value of to the minimal value that guarantees type I error control. This value may easily be calculated numerically since the worst-case distribution of under is the easy independent uniform case. We obtain a new local test of the form (19) with

(22)

We tabulated the minimal values of (taking ) for some values of in Table 1. Note that for all by the results of Katsevich and Ramdas (2018), so the new local test uniformly improves the old one. We note that with these choices of the critical values cannot be further increased without destroying type I error control of the local tests, so we conclude that the resulting local tests are admissible provided that the test is admissible as an -level local test of for all and . Assuming this, by Theorem 2 the resulting true discovery guarantee procedure is admissible. We note that, since is increasing in , (22) still fulfils the conditions of Lemma 8, so that the admissible method is still computable in quadratic time.

1 2 3 4 5 7 10 15 20 50 100 500 1000
0.95 1.38 1.55 1.64 1.71 1.78 1.84 1.90 1.92 1.98 2.00 2.01 2.02
Table 1: Values of calculated by Monte Carlo integration ( samples)
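The Monte Carlo calculation behind Table 1 can be sketched as follows, assuming (our reading, not stated explicitly above) that under the global null, with B independent uniform p-values, the calibrated test errs exactly when k exceeds c(1 + B p_(k)) for some k at most B; bisection over Monte Carlo estimates then yields the minimal constant.

```python
import random

def violation_prob(c, B, n_sim=20000):
    """Monte Carlo estimate of the probability, under B independent
    uniform p-values (the worst case with all hypotheses true), that
    k + 1 > c * (1 + B * p_(k+1)) for some k.  Fixed seed so the
    estimate is deterministic and monotone in c."""
    rng = random.Random(1)
    bad = 0
    for _ in range(n_sim):
        ps = sorted(rng.random() for _ in range(B))
        if any(k + 1 > c * (1 + B * ps[k]) for k in range(B)):
            bad += 1
    return bad / n_sim

def calibrate_c(B, alpha=0.05, tol=1e-3):
    """Bisection for the smallest c with violation probability at
    most alpha; a sketch of how Table 1 can be computed."""
    lo, hi = 0.0, 3.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if violation_prob(mid, B) <= alpha:
            hi = mid
        else:
            lo = mid
    return hi
```

As a sanity check, for B = 1 the violation probability is available in closed form and the minimal constant is 1/1.05, approximately 0.95, matching the first entry of Table 1.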

We have started with the procedure of Katsevich and Ramdas (2018) and improved it uniformly in three steps: the method was first improved by interpolation. The resulting coherent method was further improved by embedding it in a closed testing procedure, and finally that closed testing procedure was improved to an admissible method by improving its local tests. This way we obtained a sequence of four methods, each uniformly improving the previous one. We will call them the original (15), coherent (17), closed, defined by (21) with (20), and admissible method, defined by (21) with (22). We performed a small simulation experiment to assess the relative improvement made with each of the three steps. We used hypotheses, of which were true, and false. We sampled -values independently. For true null hypotheses, we used . For false null hypotheses, we used , where is the standard normal distribution function, and . We took values and for and for . A true discovery guarantee procedure gives exponentially many output values. We report only results for sets of the smallest -values, as the original method did. We used .
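The data-generating model of the simulation just described can be sketched as below; the values of m, the number of false hypotheses, and mu are hypothetical stand-ins, since the settings varied across the experiment.

```python
import math
import random

def phi_cdf(z):
    """Standard normal distribution function Phi."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def simulate_pvalues(m=100, m_false=20, mu=2.0, seed=0):
    """One simulation draw in the spirit of the experiment: true
    nulls get standard uniform p-values; false nulls get
    p = 1 - Phi(Z) with Z ~ N(mu, 1).  Parameter values here are
    illustrative stand-ins, not the settings of the experiment."""
    rng = random.Random(seed)
    truth = [True] * (m - m_false) + [False] * m_false
    pvals = [rng.random() if is_true_null
             else 1 - phi_cdf(rng.gauss(mu, 1))
             for is_true_null in truth]
    return pvals, truth
```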

Although we expect a linear time algorithm similar to that of Meijer et al. (2018) to be possible here, we used the naive quadratic time algorithm, which already calculated and for the closed and admissible methods in less than 0.1 seconds on a standard PC.

The results are given in Table 2. For each setting and each method we report the average value of over simulations. Several things can be noticed about these simulation results. The most important finding for the message of this section is that all three improvement steps are substantial. The improvement from the original to the coherent procedure is perhaps the largest. It is especially noticeable for large rejected sets, where the original method may all too often give , especially if . The improvement from the coherent procedure to closed testing is most apparent if is large. This is natural because the improvement can be seen as a “step-down” argument, implicitly incorporating an estimate of in the procedure. The final improvement from the initial closed testing to the admissible procedure is clear throughout the table. Although the improvement from the coherent to the closed procedure seems the smallest one, we emphasize that closed testing is also crucial for the construction of the admissible procedure.

We also see some of the properties of the K&R method. We have that with probability 1 for the original method (15), since if . This also holds for the coherent method (17). For the closed method the same is not true, but we have that if we have for every with , since . Therefore

unless all hypotheses are false. For the admissible method, by an analogous reasoning using Lemma 8, the same holds if , since then with large probability. This happens from . It follows that none of the methods in this section, not even the admissible method, should be expected to make any FWER-rejections in practical applications. The admissible method is (almost) fully non-consonant in the sense that for all , , and we have with probability almost 1 unless . By Romano et al. (2011) it is clearly inadmissible as a FWER-controlling method. By Theorem 2 it is admissible, however, as a method with true discovery guarantee. This lack of power for FWER-type statements is compensated by larger power for non-FWER-type statements. Indeed, Katsevich and Ramdas have shown that their method may significantly outperform Simes-based closed testing (Goeman et al., 2017) in some scenarios, which in turn outperforms consonant FWER-based testing in terms of FDP in large-scale testing problems.

2 3 4 2 3 4 1 2 3
original 0.1 1.3 2.9 1.9 3.0 3.0 1.1 3.0 3.0
coherent 0.2 1.5 2.9 2.0 3.0 3.0 1.1 3.0 3.0
closed 0.2 1.5 2.9 2.0 3.0 3.0 1.2 3.0 3.0
admissible 0.3 1.7 2.9 2.2 3.0 3.0 1.4 3.0 3.0
original 0.1 0.9 2.8 1.9 8.0 8.0 2.5 8.0 8.0
coherent 0.3 2.2 4.8 2.0 8.0 8.0 2.8 8.0 8.0
closed 0.4 2.2 4.8 2.0 8.0 8.0 2.8 8.0 8.0
admissible 0.5 2.5 5.0 2.2 8.0 8.0 3.3 8.0 8.0
original 0.0 0.1 0.3 3.7 15.9 18.0 3.8 17.2 18.0
coherent 0.4 2.3 5.0 5.8 15.9 18.0 4.9 17.2 18.0
closed 0.4 2.3 5.0 5.9 16.0 18.0 5.0 17.3 18.0
admissible 0.5 2.6 5.3 6.5 16.3 18.0 5.8 17.6 18.0
original 0.0 0.0 0.0 0.3 10.2 22.3 2.9 41.1 48.0
coherent 0.4 2.3 5.0 6.4 21.3 32.4 7.0 41.1 48.0
closed 0.4 2.3 5.0 6.4 21.5 32.6 7.2 41.7 48.0
admissible 0.5 2.7 5.3 7.3 22.4 33.1 8.7 42.3 48.0
original 0.0 0.0 0.0 0.0 0.0 0.0 0.0 40.6 130.1
coherent 0.4 2.3 5.0 6.4 21.3 32.4 7.3 71.8 140.6
closed 0.4 2.3 5.0 6.4 21.5 32.6 7.4 76.4 146.3
admissible 0.5 2.7 5.3 7.3 22.5 33.2 9.3 81.2 149.1
Table 2: Averages of for several values of over simulations, relating to the method of Katsevich and Ramdas (2018) and its successive uniform improvements.

10 Discussion

We have studied the class of all methods controlling tail probabilities of false discovery proportions. This class encompasses very diverse methods, e.g. familywise error control procedures, false discovery exceedance procedures, simultaneous selective inference, and cluster inference. We have shown that all such procedures can be written as methods simultaneously controlling false discovery proportions over all subsets of the family of hypotheses. This rewrite, trivial as it may be in some cases, is valuable in its own right, because it makes it possible to study methods jointly that seemed incomparable before. Moreover, methods that were constructed to give non-trivial error bounds for only a single random hypothesis set of interest now give simultaneous error bounds for all such sets, allowing their use in flexible selective inference in the sense advocated by Goeman and Solari (2011).

We formulated all such procedures in terms of true discovery guarantee, i.e. giving a lower bound to the number of true discoveries in each set, because this representation is mathematically easier to work with. Also, by emphasizing true rather than false discoveries, it gives a valuable positive frame to the multiple testing problem. Otherwise, this change in representation is purely cosmetic; we may continue to speak of FDP control procedures.

We have formulated a condition for admissibility of FDP control procedures that is both necessary and sufficient. All admissible FDP control procedures are closed testing procedures, and all closed testing procedures are admissible as FDP control procedures, provided they are well-designed in the sense that all their local tests are admissible. Apparently, control of false discovery proportions and closed testing procedures are so closely tied together that the relationship seems almost tautological. Admissibility is closely tied to optimality. Since optimal methods must be admissible, and admissible methods must be closed testing procedures, we have shown that only closed testing procedures can be optimal.

This theoretical insight has great practical value for methods designers. It can be used to uniformly improve existing methods, as we have demonstrated on the method of Katsevich and Ramdas (2018). Given a procedure that controls FDP, we first make sure it is coherent. Next, we can explicitly construct the local tests implied by the procedure, and turn it into a closed testing procedure. To check admissibility, we then only need to check admissibility of the local tests. Each step may result in substantial improvement, as we have shown in simulations. Alternatively, when designing a method we may start from a suite of local tests that has good power properties. The options are virtually unlimited here. The validity of the local test as an -level test guarantees control of FDP. Correlations between test statistics, which often complicate multiple testing procedures, are taken into account by the local test. Admissibility of the local tests guarantees admissibility of the resulting procedure. In both cases the computational problem remains that closed testing may require exponentially many tests, but this is the only remaining problem. Polynomial-time shortcuts are possible. Ideally these are exact, as for K&R above, and admissibility is retained. If the full closed testing procedure is not computable for large testing problems, we may settle for an inadmissible but computable method, based on a conservative shortcut (e.g. Hemerik and Goeman, 2018; Hemerik et al., 2018). It may still be worthwhile to compare such a method to full closed testing in small-scale problems to see how much power is lost.

We defined admissibility in terms of simultaneous FDP control for all possible subsets of the family of hypotheses. In some cases we may not be interested in all of these sets, as e.g. when targeting FWER control exclusively. Even with FWER, we retain the result that admissible procedures must be closed testing procedures with admissible local tests. We lose, however, the property that all such procedures are automatically admissible. With other loss functions, additional criteria might come in, as consonance in the case of familywise error control. Variants of consonance may be useful as well (Brannath and Bretz, 2010).

Our focus was mostly on monotone procedures. Such procedures are defined for multiple testing problems on different scales simultaneously. Connecting different scales, they have the property that adding more hypotheses to the multiple testing problem will never result in stronger conclusions for the hypotheses that were already there. This is an intuitively desirable property in itself, which prevents some paradoxes (Goeman and Solari, 2014). Monotone procedures have additional valuable properties: viewed as closed testing procedures, they have local tests that are truly local, in the sense that the local test on uses only the information that the corresponding local procedure uses. Admissible monotone procedures, however, may sometimes be locally improved, and we have given an example of this. Such improvements, if admissible, must still be closed testing procedures with admissible local tests themselves.

We have restricted attention to finite testing problems. Extensions to countably infinite problems are of interest, e.g. when considering online control (Javanmard et al., 2018). The results of this paper may trivially be extended to allow infinite if we are willing to assume that , so that . If is unbounded, care must be taken to scale properly to keep it in the non-trivial range. This scaling adds some technical complexity, and is not assumption-free because scales with the unknown . However, since most of the results of this paper compare competing methods, which obviously require the same scaling, we conjecture that the optimality of closed testing will translate to FDP control in countable and even uncountable multiple testing problems. We leave this to future research.

Finally, we remark that we have only considered procedures that control tail probabilities of the false discovery proportion. These methods can also be used for bounding the median FDP (Goeman and Solari, 2011). However, if there is interest in the central tendency of FDP it is more common to bound the mean FDP, better known as False Discovery Rate (FDR). Given the close connection we have established between closed testing and FDP tail probabilities, it is likely that there is also a connection between closed testing and FDR control. Some connections have already been found between Simes-based closed testing and the procedure of Benjamini and Hochberg (1995) by Goeman et al. (2017). It is likely that there are more such connections. Any procedure that controls FDR, since FDR control implies weak FWER control, implies a local test and can therefore be used to construct a closed testing procedure. Conversely, if FDP is controlled with -confidence at level , then FDR is controlled at , as Lehmann and Romano (2005) have shown. More profound relationships may be found in the future.

Appendix A A local improvement

In this section we construct a local improvement of an admissible monotone procedure to illustrate Proposition 1. Assume that for each , we have a -value . Assume that each is standard uniform if is true. Under these assumptions we can define the standard fixed sequence testing procedure, which starts testing using at level , continues one by one with , in order, and stops when it fails to reject some hypothesis. It is well known that this procedure is a closed testing procedure. The local test is defined for all by

If we assume that the test is admissible for all , then, by Theorem 2, the fixed sequence procedure is admissible.
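A minimal sketch of the fixed sequence procedure described above (the function name and p-value layout are ours):

```python
def fixed_sequence(pvals, alpha=0.05):
    """Fixed sequence testing: test the hypotheses in their given
    order, each at the full level alpha, and stop at the first
    failure to reject.  This controls FWER; it is a closed testing
    procedure whose local test rejects an intersection hypothesis
    iff the earliest hypothesis it contains has p-value <= alpha.
    Returns the indices of the rejected hypotheses."""
    rejected = []
    for i, p in enumerate(pvals):
        if p <= alpha:
            rejected.append(i)
        else:
            break  # stop: later hypotheses are never tested
    return rejected
```

For example, with p-values (0.01, 0.04, 0.2, 0.01) the procedure rejects the first two hypotheses and stops; the small fourth p-value is never reached.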

We will now make some additional assumptions that will allow a uniform improvement of the procedure at the fixed scale . Assume for convenience that all -values are independent. Next, assume that the distribution of every is constrained even under the alternative. Assume that some exists such that, for every ,

This means that the power for each test is inherently limited. Even under the alternative, we reject e.g.  with probability at most . Clearly or we would not have uniformity under the null. We will now demonstrate that the fixed sequence procedure can be uniformly improved, locally at any , if .

We start with the simple case . By Proposition 1, the improvement is a closed testing procedure that involves local tests . Consider . Then . Under this has . Clearly, there is room for improvement. Let us consider the procedure with , for all except , when

The latter is a valid local test of . The resulting procedure at starts testing at level , and continues, if is rejected, to test at level . This is clearly a uniform improvement of the original procedure at . To see that this is not a counterexample to Theorem 2, consider instead. Clearly, we do not have

The improvement is only local.

Similar local improvements actually exist for every finite except . Define recursively

From this, fix some , and define a local test as

where . To check that this is a valid local test, we verify that for all

where . The resulting procedure is still a fixed sequence procedure that tests all , in order, stopping the first time it fails to reject. Only, rather than testing at level every time, it tests at level in step . If for all the sequence is strictly increasing and approaches 1.

Crucial for this example is the assumption that we have limited power and, more importantly, that we know the limit to the power. If we are either not willing to assume that always , or if we do not know , then the above local improvements are not possible. It is difficult to think of uniform local improvements in the case , and we believe they do not exist. It may be worthwhile to think of adaptive procedures that learn as the procedure moves along, but we will not pursue this direction here. In any case, due to the cost inherent in learning , such a procedure would not uniformly improve .

Appendix B Proofs

Proof of Lemma 1

Take any . For any with there exists a which has and . Consequently, .

For any with there is a with that has . Consequently, .

Proof of Lemma 2

Let be the event that for all