Online Rules for Control of False Discovery Rate and False Discovery Exceedance
Abstract
Multiple hypothesis testing is a core problem in statistical inference and arises in almost every scientific field. Given a set of null hypotheses, Benjamini and Hochberg [BH95] introduced the false discovery rate (FDR), which is the expected proportion of false positives among rejected null hypotheses, and proposed a testing procedure that controls FDR below a preassigned significance level. Nowadays FDR is the criterion of choice for large-scale multiple hypothesis testing.
In this paper we consider the problem of controlling FDR in an online manner. Concretely, we consider an ordered –possibly infinite– sequence of null hypotheses where, at each step, the statistician must decide whether to reject the current hypothesis, having access only to the previous decisions. This model was introduced by Foster and Stine [FS08].
We study a class of generalized alpha investing procedures, first introduced by Aharoni and Rosset [AR14]. We prove that any rule in this class controls online FDR, provided p-values corresponding to true nulls are independent from the other p-values. Earlier work only established mFDR control. Next, we obtain conditions under which generalized alpha investing controls FDR in the presence of general p-value dependencies. We also develop a modified set of procedures that allow control of the false discovery exceedance (the tail of the proportion of false discoveries). Finally, we evaluate the performance of online procedures on both synthetic and real data, comparing them with offline approaches, such as adaptive Benjamini-Hochberg.
1 Introduction
The common practice in claiming a scientific discovery is to support such a claim with a p-value as a measure of statistical significance. Hypotheses with p-values below a fixed significance level $\alpha$ are considered to be statistically significant. While this ritual controls type I errors for single testing problems, in the case of testing multiple hypotheses it leads to a large number of false positives (false discoveries). Consider, for instance, a setting in which $N$ hypotheses are to be tested, but only a few of them, say $n$, are nonnull. If we test all of the hypotheses at a fixed significance level $\alpha$, each of the $N-n$ truly null hypotheses can be falsely rejected with probability $\alpha$. Therefore, the number of false discoveries –equal to $\alpha(N-n)$ in expectation– can substantially exceed the number $n$ of true nonnulls.
The false discovery rate (FDR) –namely, the expected fraction of discoveries that are false positives– is the criterion of choice for statistical inference in large-scale hypothesis testing problems. In their groundbreaking work [BH95], Benjamini and Hochberg (BH) developed a procedure to control FDR below a preassigned level, while allowing for a large number of true discoveries when many nonnulls are present. The BH procedure remains –with some improvements– the state of the art in the context of multiple hypothesis testing, and has been implemented across genomics [RYB03], brain imaging [GLN02], marketing [PWJ15], and many other applied domains.
Standard FDR control techniques, such as the BH procedure [BH95], require aggregating p-values for all the tests and processing them jointly. This is impossible in a number of applications which are best modeled as an online hypothesis testing problem [FS08] (a more formal definition will be provided below):
Hypotheses arrive sequentially in a stream. At each step, the analyst must decide whether to reject the current null hypothesis without having access to the number of hypotheses (potentially infinite) or the future p-values, but solely based on the previous decisions.
This is the case –for instance– with publicly available datasets, where new hypotheses are tested in an ongoing fashion by different researchers [AR14]. Similar constraints arise in marketing research, where multiple A/B tests are carried out in an ongoing fashion [PWJ15]. Finally, scientific research as a whole suffers from the same problem: a stream of hypotheses is tested on an ongoing basis using a fixed significance level, thus leading to large numbers of false positives [Ioa05b]. We refer to Section 1.2 for further discussion.
In order to illustrate the online scenario, consider an approach that would control the familywise error rate (FWER), i.e. the probability of rejecting at least one true null hypothesis. Formally,
(1) $\mathrm{FWER}(n) \equiv \sup_{\theta} \mathbb{P}_\theta\big(V^\theta(n) \ge 1\big)$,
where $\theta$ denotes the model parameters (including the set of nonnull hypotheses) and $V^\theta(n)$ the number of false positives among the first $n$ hypotheses. This metric can be controlled by choosing different significance levels $\alpha_i$ for tests $H_i$, with $(\alpha_i)_{i\ge 1}$ summable, e.g., $\alpha_i = \alpha 2^{-i}$. Notice that the analyst only needs to know the number of tests performed before the current one, in order to implement this scheme. However, this method leads to small statistical power. In particular, making a discovery at later steps becomes very unlikely.
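In contrast, the BH procedure assumes that all the p-values are given a priori. Given p-values $p_1, p_2, \dots, p_N$ and a significance level $\alpha$, BH follows the steps below:
1. Let $p_{(i)}$ be the $i$-th p-value in the (increasing) sorted order, and define $p_{(0)} = 0$. Further, let
(2) $i_{\rm BH} \equiv \max\big\{0 \le i \le N : \; p_{(i)} \le \alpha i/N\big\}$.
2. Reject $H_j$ for every test with $p_j \le p_{(i_{\rm BH})}$.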
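As mentioned above, BH controls the false discovery rate, defined as
(3) $\mathrm{FDR}(n) \equiv \sup_{\theta} \mathbb{E}_\theta\Big(\frac{V^\theta(n)}{R(n) \vee 1}\Big)$,
where $V^\theta(n)$ is the number of false discoveries and $R(n)$ is the total number of rejected hypotheses among the first $n$. Note that BH requires the knowledge of all p-values to determine the significance level for testing the hypotheses. Hence, it does not address the online scenario.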
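The BH steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the standard step-up form of BH (find the largest rank $i$ with $p_{(i)} \le \alpha i/N$, then reject every hypothesis whose p-value is at most $p_{(i)}$), and the function name and list-based interface are our own choices.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return a list of booleans: True where BH rejects at level alpha.

    Assumes the standard step-up rule; needs all p-values at once (offline).
    """
    n = len(pvalues)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(n), key=lambda j: pvalues[j])
    i_bh = 0  # largest rank whose sorted p-value is below alpha * rank / n
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= alpha * rank / n:
            i_bh = rank
    rejected = [False] * n
    for j in order[:i_bh]:  # reject the i_bh smallest p-values
        rejected[j] = True
    return rejected
```

Note that the cutoff depends on the full sorted list, which is exactly why this procedure cannot be run as hypotheses arrive one at a time.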
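In this paper, we study methods for online control of false discovery rate. Namely, we consider a sequence of hypotheses $H_1, H_2, H_3, \dots$ that arrive sequentially in a stream, with corresponding p-values $p_1, p_2, \dots$. We aim at developing a testing mechanism that ensures the false discovery rate remains below a preassigned level $\alpha$. A testing procedure provides a sequence of significance levels $\alpha_i$, with decision rule:
(4) $R_i = \begin{cases} 1 & \text{if } p_i \le \alpha_i \ \text{(reject } H_i\text{)}, \\ 0 & \text{otherwise (accept } H_i\text{)}. \end{cases}$
In online testing, we require significance levels to be functions of prior outcomes:
(5) $\alpha_i = \alpha_i(R_1, R_2, \dots, R_{i-1})$.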
Foster and Stine [FS08] introduced the above setting and proposed a class of procedures named alpha investing rules. Alpha investing starts with an initial wealth, at most $\alpha$, of allowable false discovery rate. The wealth is spent for testing different hypotheses. Each time a discovery occurs, the alpha investing procedure earns a contribution toward its wealth to use for further tests. Foster and Stine [FS08] proved that alpha investing rules control a modified metric known as mFDR, defined as below:
(6) $\mathrm{mFDR}_\eta(n) \equiv \sup_{\theta} \frac{\mathbb{E}_\theta\{V^\theta(n)\}}{\mathbb{E}_\theta\{R(n)\} + \eta}$.
In words, $\mathrm{mFDR}_\eta(n)$ is the ratio of the expected number of false discoveries to the expected number of discoveries (offset by the constant $\eta$). As illustrated in Appendix A, mFDR and FDR can be very different in situations with high variability. While FDR is the expected proportion of false discoveries, mFDR is the ratio of two expectations and hence is not directly related to any single sequence quantity.
Several recent papers [LTTT14, GWCT15, LB16] consider a ‘sequential hypothesis testing’ problem that arises in connection with sparse linear regression. Let us emphasize that the problem treated in [LTTT14, GWCT15] is substantially different from the one analyzed here. For instance, as discussed in Section 1.2, the methods of [GWCT15] achieve vanishingly small statistical power for the present problem.
1.1 Contributions
In this paper, we study a class of procedures known as generalized alpha investing, first introduced by Aharoni and Rosset [AR14]. As in alpha investing [FS08], generalized alpha investing makes use of a potential sequence (wealth) that increases every time a null hypothesis is rejected, and decreases otherwise. However: (i) the payoff and payout functions are general functions of past history; (ii) the payout is not tightly determined by the testing level. This additional freedom makes it possible to construct interesting new rules.
The contributions of this paper are summarized as follows.
Online control of FDR. We prove that generalized alpha investing rules control FDR, under the assumption of independent p-values, and provided they are monotone (a technical condition defined in the sequel). To the best of our knowledge, this is the first work that guarantees online control of FDR.
Online control of FDR for dependent p-values. Dependencies among p-values can arise for multiple reasons. For instance the same data can be reused to test a new hypothesis, or the choice of a new hypothesis can depend on the past outcomes. We present a general upper bound on the FDR for dependent p-values under generalized alpha investing.
False discovery exceedance. FDR can be viewed as the expectation of false discovery proportion (FDP). In some cases, FDP may not be well represented by its expectation, e.g., when the number of discoveries is small. In these cases, FDP might be sizably larger than its expectation with significant probability. In order to provide tighter control, we develop bounds on the false discovery exceedance (FDX), i.e. on the tail probability of FDP.
Statistical power. In order to compare different procedures, we develop lower bounds on the fraction of nonnull hypotheses that are discovered (statistical power), under a mixture model where each null hypothesis is false with a fixed but arbitrary probability.
We focus in particular on a concrete example of generalized alpha investing rule (called Lord below) that we consider particularly compelling. We use our lower bound to guide the choice of parameters for this rule.
Numerical Validation. We validate our procedures on synthetic and real data in Section 5 and Appendix J, showing that they control FDR and mFDR in an online setting. We further compare them with BH and Bonferroni procedures. We observe that generalized alpha investing procedures can benefit from the ordering of hypotheses. Specifically, they can achieve higher statistical power compared to offline benchmarks such as adaptive BH, when the fraction of nonnulls is small and hypotheses can be a priori ordered in such a way that those most likely to be rejected appear first in the sequence.
1.2 Further related work
General context. An increasing effort has been devoted to reducing the risk of fallacious research findings. Some of the prevalent issues, such as publication bias, lack of replicability and multiple comparisons on a dataset, were discussed in Ioannidis’s 2005 papers [Ioa05b, Ioa05a] and in [PSA11].
Statistical databases. Concerned with the above issues and the importance of data sharing in the genetics community, [RAN14] proposed an approach to public database management, called Quality Preserving Database (QPD). A QPD makes a shared data resource amenable to perpetual use for hypothesis testing while controlling FWER and maintaining statistical power of the tests. In this scheme, for testing a new hypothesis, the investigator should pay a price in the form of additional samples that should be added to the database. The number of required samples for each test depends on the required effect size and the power for the corresponding test. A key feature of QPD is that type I errors are controlled at the management layer and the investigator is not concerned with p-values for the tests. Instead, investigators provide effect size, assumptions on the distribution of the data, and the desired statistical power. A critical limitation of QPD is that all samples, including those currently in the database and those that will be added, are assumed to have the same quality and to come from a common underlying distribution. Motivated by similar concerns in practical data analysis, [DFH15] applies insights from differential privacy to efficiently use samples to answer adaptively chosen estimation queries. These papers however do not address the problem of controlling FDR in online multiple testing.
Online feature selection. Building upon alpha investing procedures, [LFU11] develops VIF, a method for feature selection in large regression problems. VIF is accurate and computationally very efficient; it uses a one-pass search over the pool of features and applies alpha investing to test each feature for adding to the model. VIF regression avoids overfitting due to the property that alpha investing controls mFDR. Similarly, one can incorporate Lord in VIF regression to perform fast online feature selection and provably avoid overfitting.
High-dimensional and sparse regression. There has been significant interest over the last two years in developing hypothesis testing procedures for high-dimensional regression, especially in conjunction with sparsity-seeking methods. Procedures for computing p-values of low-dimensional coordinates were developed in [ZZ14, VdGBRD14, JM14a, JM14b, JM13]. Sequential and selective inference methods were proposed in [LTTT14, FST14, TLTT16]. Methods to control FDR were put forward in [BC15, BvdBS15].
As exemplified by VIF regression, online hypothesis testing methods can be useful in this context as they allow one to select a subset of regressors through a one-pass procedure. Also they can be used in conjunction with the methods of [LTTT14], where a sequence of hypotheses is generated by including an increasing number of regressors (e.g. sweeping values of the regularization parameter).
In particular, [GWCT15, LB16] develop multiple hypothesis testing procedures for ordered tests. Note, however, that these approaches fall short of addressing the issues we consider, for several reasons: (i) They are not online, since they reject an initial block of null hypotheses whose length depends on all the p-values. (ii) They require knowledge of all past p-values (not only discovery events) to compute the current score. (iii) Since they are constrained to reject all hypotheses before a cutoff index, and accept them after, they cannot achieve any discovery rate increasing with the number of hypotheses $n$, let alone nearly linear in $n$. For instance in the mixture model of Section 4, if the fraction of true nonnull is , then the methods of [GWCT15, LB16] achieve discoveries out of true nonnull. In other words their power is of order in this simple case.
1.3 Notations
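Throughout the paper, we typically use upper case symbols (e.g. $X, Y, Z$) to denote random variables, and lower case symbols for deterministic values (e.g. $x, y, z$). Vectors are denoted by boldface, e.g. $\boldsymbol{X}, \boldsymbol{Y}$ for random vectors, and $\boldsymbol{x}, \boldsymbol{y}$ for deterministic vectors. Given a vector $\boldsymbol{X} = (X_1, \dots, X_n)$, we use $\boldsymbol{X}_i^j = (X_i, X_{i+1}, \dots, X_j)$ to denote the subvector with indices between $i$ and $j$. We will often consider sequences indexed by the same ‘time index’ as for the hypotheses $\{H_1, H_2, H_3, \dots\}$. Given such a sequence $(X_i)_{i\in\mathbb{N}}$, we denote by $X(n) \equiv \sum_{i=1}^{n} X_i$ its partial sums.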
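We denote the standard Gaussian density by $\phi(x) = e^{-x^2/2}/\sqrt{2\pi}$, and the Gaussian distribution function by $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,{\rm d}t$. We use the standard big-O notation. In particular $f(n) = O(g(n))$ as $n\to\infty$ if there exists a constant $C > 0$ such that $|f(n)| \le C\, g(n)$ for all $n$ large enough. We also use $\sim$ to denote asymptotic equality, i.e. $f(n) \sim g(n)$ as $n\to\infty$ means $\lim_{n\to\infty} f(n)/g(n) = 1$. We further use $\asymp$ for equality up to constants, i.e. if $f(n) \asymp g(n)$, then there exist constants $C_1, C_2 > 0$ such that $C_1 |g(n)| \le f(n) \le C_2 |g(n)|$ for all $n$ large enough.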
2 Generalized alpha investing
In this section we define generalized alpha investing rules, and provide some concrete examples. Our definitions and notations follow the paper of Aharoni and Rosset that first introduced generalized alpha investing [AR14].
2.1 Definitions
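Given a sequence of input p-values $(p_1, p_2, p_3, \dots)$, a generalized alpha investing rule generates a sequence of decisions $(R_1, R_2, R_3, \dots)$ (here $R_j \in \{0,1\}$ and $R_j = 1$ is to be interpreted as rejection of null hypothesis $H_j$) by using test levels $(\alpha_1, \alpha_2, \alpha_3, \dots)$. After each decision $j$, the rule updates a potential function $W(j)$ as follows:
• If hypothesis $j$ is accepted, then the potential function is decreased by a payout $\varphi_j$.
• If hypothesis $j$ is rejected, then the potential is increased by an amount $\psi_j - \varphi_j$.
In other words, the payout $\varphi_j$ is the amount paid for testing a new hypothesis, and the payoff $\psi_j$ is the amount earned if a discovery is made at that step.
Formally, a generalized alpha investing rule is specified by three (sequences of) functions $\alpha_j, \varphi_j, \psi_j : \{0,1\}^{j-1} \to \mathbb{R}_{\ge 0}$, determining test levels, payout and payoff. Decisions are taken by testing at level $\alpha_j$:
(7) $R_j = \begin{cases} 1 & \text{if } p_j \le \alpha_j = \alpha_j(R_1, \dots, R_{j-1}), \\ 0 & \text{otherwise.} \end{cases}$
The potential function is updated via:
(8) $W(0) = w_0$,
(9) $W(j) = W(j-1) - \varphi_j(R_1, \dots, R_{j-1}) + R_j\, \psi_j(R_1, \dots, R_{j-1})$,
with $w_0 \ge 0$ an initial condition. Notice in particular that $W(j)$ is a function of $(R_1, \dots, R_j)$.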
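A valid generalized alpha investing rule is required to satisfy the following conditions, for a constant $b_0 > 0$:
G1. For all $j \in \mathbb{N}$ and all $(R_1, \dots, R_{j-1}) \in \{0,1\}^{j-1}$, letting $\psi_j = \psi_j(R_1, \dots, R_{j-1})$, $\varphi_j = \varphi_j(R_1, \dots, R_{j-1})$, $\alpha_j = \alpha_j(R_1, \dots, R_{j-1})$, we have
(10) $\psi_j \le \varphi_j + b_0$, (11) $\psi_j \le \frac{\varphi_j}{\alpha_j} + b_0 - 1$, (12) $\varphi_j \le W(j-1)$.
G2. For all $j \in \mathbb{N}$, and all $(R_1, \dots, R_{j-1}) \in \{0,1\}^{j-1}$, if $W(j-1) = 0$ then $\alpha_j = 0$.
Notice that Condition (12) and G2 are well posed since $W(j-1)$, $\varphi_j$ and $\alpha_j$ are functions of $(R_1, \dots, R_{j-1})$. Further, because of (12), the function $W(j)$ remains nonnegative for all $j$.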
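The definition above can be turned into a short driver loop. The sketch below is illustrative only and rests on assumptions we state explicitly: the validity conditions are read as payoff ≤ payout + b0, payoff ≤ payout/α + b0 − 1, and payout ≤ W(j−1), with test level 0 whenever the potential is 0; the callback signature `rule(history, wealth)` and the example rule `spend_half` are hypothetical names introduced here.

```python
def run_generalized_alpha_investing(pvalues, rule, w0, b0, tol=1e-12):
    """Run a generalized alpha investing rule on a stream of p-values.

    `rule(history, wealth)` is a hypothetical callback returning the triple
    (alpha_j, payout_j, payoff_j) from past decisions only; `wealth` equals
    W(j-1), which is itself determined by `history`.
    """
    wealth = w0          # W(0) = w0
    history = []         # past decisions R_1, ..., R_{j-1}
    for p in pvalues:
        alpha_j, payout, payoff = rule(tuple(history), wealth)
        # Validity checks (our reading of the conditions; see lead-in).
        if payout > wealth + tol:
            raise ValueError("payout exceeds current potential")
        if payoff > payout + b0 + tol:
            raise ValueError("payoff exceeds payout + b0")
        if alpha_j > 0 and payoff > payout / alpha_j + b0 - 1 + tol:
            raise ValueError("payoff exceeds payout/alpha + b0 - 1")
        if wealth <= 0 and alpha_j != 0:
            raise ValueError("test level must be 0 when the potential is 0")
        reject = 1 if p <= alpha_j else 0
        wealth = wealth - payout + reject * payoff   # potential update
        history.append(reject)
    return history, wealth


def spend_half(history, wealth, b0=0.05):
    """Illustrative rule: test at half the current potential, earn b0 on rejection."""
    level = wealth / 2.0
    return level, level, b0
```

For instance, `run_generalized_alpha_investing([0.5, 0.001, 0.5], spend_half, w0=0.05, b0=0.05)` rejects only the second hypothesis, and the potential rises after that discovery before being spent down again.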
We later show that generalized alpha investing guarantees control as a function of and .
Throughout, we shall denote by $\mathcal{F}_j$ the $\sigma$-algebra generated by the random variables $\{R_1, \dots, R_j\}$.
Definition 2.1.
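For $x, y \in \{0,1\}^n$, we write $x \preceq y$ if $x_j \le y_j$ for all $j \in \{1, \dots, n\}$. We say that an online rule is monotone if the functions $\alpha_j$ are monotone nondecreasing with respect to this partial ordering (i.e. if $x \preceq y$ implies $\alpha_j(x) \le \alpha_j(y)$).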
Remark 2.2.
Remark 2.3.
In a generalized alpha investing rule, as we reject more hypotheses the potential increases and hence we can use larger test levels. In other words, the burden of proof decreases as we reject more hypotheses. This is similar to the BH rule, where the most significant p-value is compared to a Bonferroni cutoff, the second most significant to twice this cutoff, and so on.
2.2 Examples
Generalized alpha investing rules comprise a large variety of online hypothesis testing methods. We next describe some specific subclasses that are useful for designing specific procedures.
Alpha Investing
Alpha investing, introduced by Foster and Stine [FS08], is a special case of generalized alpha investing rule. In this case the potential is decreased by $\alpha_j/(1-\alpha_j)$ if hypothesis $H_j$ is not rejected, and increased by a fixed amount $b_0$ if it is rejected. In formula, the potential evolves according to
(13) $W(j) = W(j-1) - (1 - R_j)\, \frac{\alpha_j}{1-\alpha_j} + R_j\, b_0$.
This fits the above framework by defining $\varphi_j = \alpha_j/(1-\alpha_j)$ and $\psi_j = b_0 + \alpha_j/(1-\alpha_j)$. Note that this rule depends on the choice of the test levels $\alpha_j$, and of the parameter $b_0$. The test levels $\alpha_j$ can be chosen arbitrarily, provided that they satisfy condition (12), which is equivalent to $\alpha_j/(1-\alpha_j) \le W(j-1)$.
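As a hedged illustration of the alpha investing update, the sketch below assumes the potential loses $\alpha_j/(1-\alpha_j)$ on a non-rejection and gains $b_0$ on a rejection. The policy of spending a fixed fraction of the current potential at each step (`spend_fraction`) is our own illustrative choice of test levels, not one prescribed by the text.

```python
def alpha_investing(pvalues, w0, b0, spend_fraction=0.5):
    """Sketch of an alpha investing rule with an illustrative spending policy.

    At each step a fraction of the potential is put at risk; the test level
    alpha_j is chosen so that alpha_j / (1 - alpha_j) equals that amount,
    which keeps the potential nonnegative.
    """
    wealth = w0
    decisions = []
    for p in pvalues:
        at_risk = spend_fraction * wealth      # amount lost if we fail to reject
        alpha_j = at_risk / (1.0 + at_risk)    # solves alpha/(1-alpha) = at_risk
        reject = p <= alpha_j
        if reject:
            wealth += b0                       # reward for a discovery
        else:
            wealth -= at_risk                  # pay alpha_j/(1-alpha_j)
        decisions.append(reject)
    return decisions, wealth
```

With `w0 = b0 = 0.05`, the stream `[0.9, 0.0001, 0.9]` yields a single discovery at the second step, after which the potential is replenished by `b0`.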
Alpha Spending with Rewards
Alpha spending with rewards was introduced in [AR14], as a special subclass of generalized alpha investing rules, which are convenient for some specific applications.
Lord
As a running example, we shall use a simple procedure that we term Lord , for Levels based On Recent Discovery. Lord is easily seen to be a special case of alpha spending with rewards, for .
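Below, we present three different versions of Lord. For a concrete exposition, choose any sequence of nonnegative numbers $\gamma = (\gamma_i)_{i\in\mathbb{N}}$, which is monotone nonincreasing (i.e. for $i \le j$ we have $\gamma_i \ge \gamma_j$) and such that $\sum_{i=1}^{\infty} \gamma_i = 1$. We refer to Section 4 for concrete choices of this sequence.
At each time $i$, we consider the set of discovery times up to time $i$. We further define $\tau_i$ as the last time a discovery was made before $i$:
(14) $\tau_i \equiv \max\big\{\ell \in \{1, \dots, i-1\} : R_\ell = 1\big\}$,
with $\tau_i = 0$ if no discovery was made before time $i$.
At each step, if a discovery is made, we add an amount $b_0$ to the current wealth. Otherwise, we remove an amount of the current test level from the wealth. Formally, we set
(15) $W(0) = w_0$, $\varphi_i = \alpha_i$, $\psi_i = b_0$,
where $W(i)$ is defined recursively via Equation (9). Note that $\tau_i$ and the potential at the last discovery, $W(\tau_i)$, are measurable on $\mathcal{F}_{i-1}$, and hence are functions of $(R_1, \dots, R_{i-1})$ as claimed, while $W(i)$ is a function of $(R_1, \dots, R_i)$. Therefore, the above rule defines an online multiple hypothesis testing procedure.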
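We present three versions of Lord which differ in the way that the test levels $\alpha_i$ are set.
• Lord 1: We set the test levels solely based on the time of the last discovery. Specifically,
(16) $\alpha_i = \begin{cases} \gamma_i\, w_0 & \text{if } i \le t_1, \\ \gamma_{i-\tau_i}\, b_0 & \text{if } i > t_1, \end{cases}$
where $t_1$ denotes the time of first discovery. In words, up until the first discovery is made, we set levels by discounting the initial wealth, i.e., $\alpha_i = \gamma_i w_0$. After the first discovery is made, we use a fraction $\gamma_{i-\tau_i}$ of $b_0$ to spend in testing null hypothesis $H_i$.
• Lord 2: We set the test levels based on the previous discovery times. Specifically,
(17) $\alpha_i = \gamma_i\, w_0 + b_0 \sum_{\ell=1}^{i-1} R_\ell\, \gamma_{i-\ell}$.
• Lord 3: In this alternative, the significance levels depend on the past only through the time of the last discovery, and the wealth accumulated at that time. Specifically,
(18) $\alpha_i = \gamma_{i-\tau_i}\, W(\tau_i)$.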
In the next lemma, we show that all three versions of Lord are generalized alpha investing rules. Further, Lord 1 and Lord 2 are monotone rules (see Definition 2.1), while Lord 3 is not necessarily a monotone rule without making further assumptions on the sequence $\gamma$.
Lemma 2.4.
The rules Lord 1, Lord 2 and Lord 3 are instances of generalized alpha investing rules. Further, the rules Lord 1 and Lord 2 are monotone.
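To make the three Lord variants concrete, here is a minimal Python sketch. It is written under our reading of the rules, which should be treated as an assumption: version 1 uses $\gamma_i w_0$ until the first discovery and $\gamma_{i-\tau_i} b_0$ afterwards; version 2 uses $\gamma_i w_0$ plus $b_0 \gamma_{i-t}$ summed over past discovery times $t$; version 3 uses $\gamma_{i-\tau_i} W(\tau_i)$. In all cases the payout equals the test level and the payoff equals $b_0$. The sequence `example_gamma` (proportional to $1/m^2$) is only an illustrative choice that is nonincreasing and sums to one.

```python
import math


def example_gamma(m):
    """Illustrative discount sequence: 6 / (pi^2 m^2), nonincreasing, sums to 1."""
    return 6.0 / (math.pi ** 2 * m * m)


def lord(pvalues, gamma, w0, b0, version=3):
    """Sketch of the three Lord variants (our reading; see lead-in)."""
    wealth = w0               # current potential W(i-1)
    wealth_at_last = w0       # W(tau_i); equals w0 while tau_i = 0
    discovery_times = []      # 1-based times of past discoveries
    decisions = []
    for i, p in enumerate(pvalues, start=1):
        tau = discovery_times[-1] if discovery_times else 0
        if version == 1:
            if not discovery_times:
                alpha_i = gamma(i) * w0
            else:
                alpha_i = gamma(i - tau) * b0
        elif version == 2:
            alpha_i = gamma(i) * w0 + b0 * sum(gamma(i - t) for t in discovery_times)
        else:
            alpha_i = gamma(i - tau) * wealth_at_last
        reject = p <= alpha_i
        # Payout is the test level; payoff b0 is earned on a discovery.
        wealth = wealth - alpha_i + (b0 if reject else 0.0)
        if reject:
            discovery_times.append(i)
            wealth_at_last = wealth
        decisions.append(reject)
    return decisions
```

The levels depend on the past only through the decisions (discovery times and the wealth they induce), never through past p-values themselves, which is what makes the procedure online in the sense used here.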
3 Control of false discovery rate
3.1 FDR control for independent test statistics
As already mentioned, we are interested in testing a –possibly infinite– sequence of null hypotheses $\mathcal{H} = (H_i)_{i\in\mathbb{N}}$. The set of first $n$ hypotheses will be denoted by $\mathcal{H}(n) = (H_i)_{1\le i\le n}$. Without loss of generality, we assume $H_i$ concerns the value of a parameter $\theta_i$, with $H_i = \{\theta_i = 0\}$. Rejecting the null hypothesis $H_i$ can be interpreted as $\theta_i$ being significantly nonzero. We will denote by $\Theta$ the set of possible values for the parameters $\theta_i$, and by $\Theta^{\mathbb{N}}$ the space of possible values of the sequence $\theta = (\theta_i)_{i\in\mathbb{N}}$.
Under the null hypothesis $H_i : \theta_i = 0$, the corresponding p-value is uniformly random in $[0,1]$:
(19) $p_i \sim \mathrm{Unif}([0,1])$.
Recall that $R_i$ is the indicator that a discovery is made at time $i$, and $R(n) = \sum_{i=1}^{n} R_i$ the total number of discoveries up to time $n$. Analogously, let $V_i^\theta$ be the indicator that a false discovery occurs at time $i$ and $V^\theta(n) = \sum_{i=1}^{n} V_i^\theta$ the total number of false discoveries up to time $n$. Throughout the paper, the superscript $\theta$ is used to distinguish unobservable variables such as $V^\theta(n)$, from statistics such as $R(n)$. However, we drop the superscript when it is clear from the context.
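There are various criteria of interest for multiple testing methods. We will mostly focus on the false discovery rate (FDR) [BH95], and we repeat its definition here for the reader’s convenience. We first define the false discovery proportion (FDP) as follows. For $n \ge 1$,
(20) $\mathrm{FDP}^\theta(n) \equiv \frac{V^\theta(n)}{R(n) \vee 1}$,
where $V^\theta(n)$ and $R(n)$ denote the numbers of false discoveries and of discoveries up to time $n$.
The false discovery rate is defined as
(21) $\mathrm{FDR}(n) \equiv \sup_{\theta} \mathbb{E}_\theta\big(\mathrm{FDP}^\theta(n)\big)$.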
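For concreteness, the false discovery proportion of a single run can be computed as below. This is a small sketch under the assumption that FDP is the number of false discoveries divided by the number of discoveries, with the denominator replaced by 1 when there are no discoveries; the function name and arguments are illustrative. In a simulation where the truth is known, averaging this quantity over independent runs gives a Monte Carlo estimate of FDR for one fixed configuration of nulls.

```python
def false_discovery_proportion(decisions, is_null):
    """FDP of one run: false discoveries / max(discoveries, 1).

    decisions[i] is truthy if hypothesis i was rejected;
    is_null[i] is True if hypothesis i is a true null (known only in simulation).
    """
    discoveries = sum(1 for r in decisions if r)
    false_discoveries = sum(1 for r, null in zip(decisions, is_null) if r and null)
    return false_discoveries / max(discoveries, 1)
```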
Our first result establishes FDR control for all monotone generalized alpha investing procedures, where the monotonicity of a testing rule is given by Definition 2.1. Its proof is presented in Appendix C.
Theorem 3.1.
Assume the p-values to be independent. Then, for any monotone generalized alpha investing rule with , we have
(22) 
The same holds if only the p-values corresponding to true nulls are mutually independent, and independent from the nonnull p-values.
Remark 3.2.
By applying Theorem 3.1 and Lemma 2.4, we obtain that Lord 1 and Lord 2 control FDR at level $\alpha$, as long as the parameter condition of Theorem 3.1 holds. For Lord 3, such a result cannot be obtained directly from Theorem 3.1 because this rule is not necessarily a monotone rule without making further assumptions on the sequence $\gamma$. Nevertheless, in our numerical experiments, we focus on Lord 3, as we show empirically that it also controls FDR.
Remark 3.3.
In Appendix C, we prove a somewhat stronger version of Theorem 3.1, namely . In particular, when the total number of discoveries is large, with high probability. This is the case –for instance– when the hypotheses to be tested comprise a large number of ‘strong signals’ (even if these form a small proportion of the total number of hypotheses).
Another possible strengthening of Theorem 3.1 is obtained by considering a new metric, that we call sFDR (for smoothed FDR):
(23) 
The following theorem bounds sFDR for monotone generalized alpha investing rules (cf. Definition 2.1).
Theorem 3.4.
Under the assumptions of Theorem 3.1, for any , we have
(24) 
3.2 FDR control for dependent test statistics
In some applications, the assumption of independent p-values is not warranted. This is the case –for instance– of multiple related hypotheses being tested on the same experimental data. Benjamini and Yekutieli [BY01] introduced a property called positive regression dependency from a subset (PRDS) to capture a positive dependency structure among the test statistics. They showed that if the joint distribution of the test statistics is PRDS on the subset of test statistics corresponding to true null hypotheses, then BH controls FDR. (See Theorem 1.3 in [BY01].) Further, they proved that BH controls FDR under general dependency if its threshold is adjusted by replacing $\alpha$ with $\alpha / \sum_{i=1}^{N} \frac{1}{i}$ in equation (2).
Our next result establishes an upper bound on the FDR of generalized alpha investing rules, under general p-value dependencies. For a given generalized alpha investing rule, let , the set of decision sequences that have nonzero probability.
Definition 3.6.
An index sequence is a sequence of deterministic functions with . For an index sequence , let
(25)  
(26) 
As concrete examples of the last definition, for a generalized alpha investing rule, the current potentials , potentials at the last rejection and total number of rejections are index sequences.
Theorem 3.7.
Consider a generalized alpha investing rule and assume that the test level is determined based on index function . Namely, for each there exists a function such that . Further, assume to be nondecreasing and weakly differentiable with weak derivative .
Then, the following upper bound holds for general dependencies among values:
(27) 
The proof of this theorem is presented in Appendix E.
Example 3.8 (FDR control for dependent test statistics via modified Lord).
We can modify Lord so as to achieve FDR control even under dependent test statistics. As before, we let . However, we fix a sequence , , and set test levels according to rule . In other words, compared with the original Lord procedure, we discount the capital accumulated at the last discovery as a function of the number of hypotheses tested so far, rather than the number of hypotheses tested since the last discovery.
4 Statistical power
The class of generalized alpha investing rules is quite broad. In order to compare different approaches, it is important to estimate their statistical power.
Here, we consider a mixture model wherein each null hypothesis is false with probability $\pi_1$ independently of other hypotheses, and the p-values corresponding to different hypotheses are mutually independent. Under the null hypothesis $H_i$, we have $p_i$ uniformly distributed in $[0,1]$ and under its alternative, $p_i$ is generated according to a distribution whose c.d.f. is denoted by $F$. We let $G(x) = \pi_0 x + \pi_1 F(x)$, with $\pi_0 + \pi_1 = 1$, be the marginal distribution of the p-values. For presentation clarity, we assume that $F(x)$ is continuous.
While the mixture model is admittedly idealized, it offers a natural ground to compare online procedures to offline procedures. Indeed, online approaches are naturally favored if the true nonnulls arise at the beginning of the sequence of hypotheses, and naturally unfavored if they only appear later. On the other hand, if the p-values can be processed offline, we can always apply an online rule after a random reordering of the hypotheses. By exchangeability, we expect the performance to be similar to that in the mixture model.
The next theorem lower bounds the statistical power of Lord under the mixture model. This lower bound applies to any of the three versions of Lord .
Theorem 4.1.
Consider the mixture model with denoting the marginal distribution of values. Further, let (and its complement ) be the subset of true nulls (nonnulls), among the first hypotheses. Then, the average power of Lord rule is almost surely bounded as follows:
(28) 
Proof of Theorem 4.1 is deferred to Appendix H. The lower bound is in fact the exact power for a slightly weaker rule that resets the potential at level $b_0$ after each discovery (in other words, Equation (18) is replaced by $\alpha_i = \gamma_{i-\tau_i} b_0$). This procedure is weaker only when multiple discoveries are made in a short interval of time. Hence, the above bound is expected to be accurate when the fraction of nonnulls is small, and discoveries are rare.
Recall that in Lord, the parameters $\gamma = (\gamma_\ell)_{\ell \ge 1}$ can be any sequence of nonnegative, monotone nonincreasing numbers that sums up to one. This leaves a great extent of flexibility in choosing $\gamma$. The above lower bound on statistical power under the mixture model provides useful insight on what are good choices of $\gamma$.
We first simplify the lower bound further. We notice that . Further, by the monotonicity property of , we have for . Thus,
In order to choose , we use the lower bound as a surrogate objective function. We let be the sequence that maximizes . The following proposition characterizes the asymptotic behavior of .
Proposition 4.2.
Let be the sequence that maximizes under the constraint . Further suppose that is concave and differentiable on an interval for some . Then there is a constant independent of such that, for all large enough, the following holds true:
The concavity assumption on $F$ requires the density of nonnull p-values (i.e., $F'$) to be nonincreasing in a neighborhood of zero. This is a reasonable assumption because significant p-values are generically small and the assumption states that, in a neighborhood of zero, smaller p-values have higher density than larger p-values. In Appendix F, we compute the optimal sequence for two case examples.
5 Numerical simulations
In this section we carry out some numerical experiments with synthetic data. For an application with real data, we refer to Appendix J.
5.1 Comparison with offline rules
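In our first experiment, we consider hypotheses $\mathcal{H}(N) = (H_1, H_2, \dots, H_N)$ concerning the means of normal distributions. The null hypothesis is $H_j : \theta_j = 0$. We observe test statistics $Z_j = \theta_j + \varepsilon_j$, where $\varepsilon_j$ are independent standard normal random variables. Therefore, one-sided p-values are given by $p_j = \Phi(-Z_j)$, and two-sided p-values by $p_j = 2\Phi(-|Z_j|)$. Parameters $\theta_j$ are set according to a mixture model:
(29) $\theta_j = 0$ with probability $1 - \pi_1$, and $\theta_j \sim F_1$ with probability $\pi_1$, where $F_1$ denotes the nonnull distribution.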
In our experiment, we set and and use the following three choices of the nonnull distribution:
Gaussian. In this case the alternative is with . This choice of produces parameters in the interesting regime in which they are detectable, but not easily so. In order to see this recall that, under the global null hypothesis, and with high probability. Indeed is the minimax amplitude for estimation in the sparse Gaussian sequence model [DJ94, Joh94].
In this case we carry out two-sided hypothesis testing.
Exponential. In this case the alternative is exponential with mean . The rationale for this choice is the same given above. The alternative is known to be nonnegative, and hence we carry out one-sided hypothesis testing.
Simple. In this example, the nonnulls are constant and equal to . Again, we carry out one-sided tests in this case.
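A data-generating sketch for this kind of experiment, using only the Python standard library, is given below. It covers the "simple" alternative (constant nonnull mean) and assumes unit-variance Gaussian test statistics, with one-sided p-values equal to the Gaussian upper tail at the statistic and two-sided p-values equal to twice the upper tail at its absolute value. The function names and all numeric values passed by the caller (number of hypotheses, nonnull fraction, nonnull mean) are hypothetical and are not the paper's settings.

```python
import math
import random


def gaussian_sf(z):
    """Upper tail of the standard normal: P(Z > z) = Phi(-z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def simulate_pvalues(n, pi1, nonnull_mean, two_sided=False, seed=0):
    """Draw p-values from a two-group mixture with a constant nonnull mean.

    Each hypothesis is nonnull with probability pi1, independently.
    Returns (pvalues, is_null); is_null is known only because this is a simulation.
    """
    rng = random.Random(seed)
    pvalues, is_null = [], []
    for _ in range(n):
        null = rng.random() >= pi1
        theta = 0.0 if null else nonnull_mean
        z = theta + rng.gauss(0.0, 1.0)        # test statistic = mean + N(0,1) noise
        if two_sided:
            p = 2.0 * gaussian_sf(abs(z))
        else:
            p = gaussian_sf(z)
        pvalues.append(p)
        is_null.append(null)
    return pvalues, is_null
```

Feeding the resulting stream to an online rule and comparing its decisions against `is_null` gives the empirical false discovery proportion and power for one run.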
We consider three online testing rules, namely alpha investing (AI), Lord (a special case of alpha spending with rewards) and Bonferroni. We also simulate the expected reward optimal (ERO) alpha investing rule introduced in [AR14]. For a brief overview of the ERO notion, recall that in a generalized alpha investing rule, payout $\varphi_j$, test level $\alpha_j$ and the reward $\psi_j$ should satisfy inequalities (10) and (11). An ERO procedure finds the optimal point of tradeoff between and , for a given value of , where optimality criterion is the expected reward of the current test, i.e., . We compare the performance of these online methods with the (adaptive) BH procedure, which, as emphasized already, is an offline testing rule: it has access to the number of hypotheses and p-values in advance, while the former algorithms receive p-values in an online manner, without knowing the total number of hypotheses. We use Storey’s variant of the BH rule, which is better suited to cases in which the fraction of nonnulls is not necessarily small [Sto02]. In all cases, we set as our objective to control FDR below the nominal level $\alpha$.
The different procedures are specified as follows:
Alpha Investing. We set test levels according to
(30) $\alpha_j = \frac{W(j-1)}{1 + j - \tau_j}$,
where $\tau_j$ denotes the time of the most recent discovery before time $j$. This proposal was introduced by [FS08] and boosts statistical power in cases in which the nonnull hypotheses appear in batches. We use parameters (for the initial potential), and (for the rewards). The rationale for this choice is that $b_0$ controls the evolution of the potential $W(j)$ for large $j$, while $w_0$ controls its initial value. Hence, the behavior of the testing rule for large $j$ is mainly driven by $b_0$.
Note that, by [AR14, Corollary 2], this is an ERO alpha investing rule.
ERO alpha investing. For the case of simple alternative, the maximum power achievable at test is . In this case, we consider ERO alpha investing [AR14] defined by , and with , given implicitly by the solution of and . We use parameters and .
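LORD. We use Lord 3, and choose the sequence $\gamma = (\gamma_m)_{m\in\mathbb{N}}$ as follows:
(31) $\gamma_m = C\, \frac{\log(m \vee 2)}{m\, e^{\sqrt{\log m}}}$,
with $C$ determined by the condition $\sum_{m=1}^{\infty} \gamma_m = 1$, which yields $C \approx 0.07720838$. This choice of $\gamma$ is loosely motivated by Example E.2, given in Appendix F. Notice, however, that we do not assume the data to be generated with the model treated in that example. Further, for this case we set parameters (for the initial potential), and (for the rewards).
Bonferroni. We set the test levels as $\alpha_m = \gamma_m \alpha$, where the values of $\gamma_m$ are set as per Equation (31), and therefore $\sum_{m=1}^{\infty} \alpha_m = \alpha$.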
Storey. It is well known that the classical BH procedure satisfies where is the proportion of true nulls. A number of adaptive rules have been proposed that use a plug-in estimate of as a multiplicative correction in the BH procedure [Sto02, MR06, JC07, Jin08]. Following [BR09], the adaptive test thresholds are given by (instead of ), where is an estimate of , determined as a function of the p-values, .
Here, we focus on the Storey estimator given by [Sto02]:
(32) 
Storey’s estimator is in general an underestimate of . A standard choice of is used in the SAM software [ST03]. In [BR09], it is shown that the choice can have better properties under dependent p-values. In our simulations we tried both choices of .
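For concreteness, the adaptive BH procedure with a Storey-type correction can be sketched as follows. This is a minimal illustration, assuming the commonly used form of the estimator, (1 + #{p_i > lambda}) / (n (1 - lambda)) for the proportion of true nulls, whose reciprocal multiplies the BH thresholds. The function names and the value of `lam` in the usage lines are hypothetical, and the estimate is deliberately not capped at one.

```python
def storey_pi0(p_values, lam):
    """Storey-type estimate of the proportion of true nulls (assumed form)."""
    n = len(p_values)
    return (1 + sum(p > lam for p in p_values)) / (n * (1 - lam))


def adaptive_bh(p_values, alpha, lam):
    """Step-up BH with thresholds alpha * rank / (n * pi0_hat).

    Returns the sorted indices of the rejected hypotheses.
    """
    n = len(p_values)
    pi0_hat = storey_pi0(p_values, lam)
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls below its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / (n * pi0_hat):
            k = rank
    return sorted(order[:k])


# Hypothetical usage: three small p-values among eight hypotheses.
p = [0.001, 0.002, 0.003, 0.8, 0.9, 0.95, 0.6, 0.7]
print(storey_pi0(p, lam=0.5))               # 1.5
print(adaptive_bh(p, alpha=0.05, lam=0.5))  # [0, 1, 2]
```

Unlike the online rules, this procedure needs all p-values and their total number before any decision is made.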
Our empirical results are presented in Figure 1. As we see, all the rules control FDR below the nominal level , as guaranteed by Theorem 3.1. While BH and the generalized alpha investing schemes (LORD, alpha investing, ERO alpha investing) exploit most of the allowed amount of false discoveries, Bonferroni is clearly too conservative. A closer look reveals that the generalized alpha investing schemes are somewhat more conservative than BH. Note, however, that the present simulations assume the nonnulls to arrive at random times, which is a more benign scenario than the one considered in Theorem 3.1, where arrival times of nonnulls are adversarial.
In terms of power, LORD appears particularly effective for small , while standard alpha investing suffers a loss of power for large . This is related to the fact that in this case. As a consequence, the rule can effectively stop after a large number of discoveries, because gets close to one.
Figure 2 showcases the FDR achieved by various rules as a function of , for and exponential alternatives. For alpha investing and LORD we use parameters and . The generalized alpha investing rules under consideration have FDR below the nominal , and track it fairly closely. The gap is partly due to the fact that, for a large number of discoveries, the FDR of generalized alpha investing rules is closer to than to , cf. Remark 3.3.
5.2 The effect of ordering
By definition, the BH rule is insensitive to the order in which the hypotheses are presented. By contrast, the outcome of online testing rules depends on this ordering. This is a weakness, because the ordering of hypotheses can be adversarial, leading to a loss of power, but also a strength. Indeed, in some applications, hypotheses can be ordered, using side information, such that those most likely to be rejected come first. In these cases, we expect generalized alpha investing procedures to be potentially more powerful than benchmark offline rules such as BH.
For instance, Li and Barber [LB16] analyze a drug-response dataset proceeding in two steps. First, a family of hypotheses (gene expression levels) is ordered using side information, and then a multiple hypothesis testing procedure is applied to the ordered data.
In order to explore the effect of a favorable ordering of the hypotheses, we reconsider the exponential model of the previous section, and simulate a case in which side information is available. For each trial, we generate the mean , and two independent sets of observations , , with , independent. We then compute the corresponding (one-sided) p-values , . We use the p-values to order the hypotheses.
Let us emphasize that, for this simulation, better statistical power would be achieved if we computed a single p-value by processing jointly and . However, in real applications, the two sources of information are heterogeneous and this joint processing is not warranted; see [LB16] for a discussion of this point.
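The two-step use of side information can be sketched as follows. This is a minimal illustration, assuming Gaussian test statistics and a noise scale `sigma` for the side observations; the function names and these modeling choices are assumptions of the sketch and are not taken from the simulation code behind Figure 3.

```python
import math


def one_sided_p_value(z, sigma=1.0):
    """Upper-tail p-value for an observation z with N(0, sigma^2) null noise."""
    return 0.5 * math.erfc(z / (sigma * math.sqrt(2.0)))


def order_by_side_information(p_main, p_side):
    """Reorder the main p-values so that hypotheses with the smallest
    side-information p-values are presented first.

    Returns the reordered main p-values and the permutation used.
    """
    order = sorted(range(len(p_main)), key=lambda i: p_side[i])
    return [p_main[i] for i in order], order


# Hypothetical usage: hypothesis 2 looks most promising in the side data,
# so it is tested first by the online rule.
ordered, perm = order_by_side_information([0.2, 0.01, 0.5], [0.9, 0.3, 0.1])
print(ordered, perm)  # [0.5, 0.01, 0.2] [2, 1, 0]
```

The reordered sequence is then fed to an online rule; the side p-values are used only to decide the order and never enter the test itself.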
Figure 3 reports the FDR and statistical power in this setting. We used LORD with parameters given by Equation (31), and simulated two noise levels for the side information: (noisy ordering information) and (less noisy ordering). As expected, with a favorable ordering the FDR decreases significantly. The statistical power increases as long as the fraction of nonnulls is not too large. This is expected: when the fraction of nonnulls is large, ordering is less relevant.
In particular, for small , the gain in power can be as large as (for ) and as (for ). The resulting power is superior to adaptive BH [Sto02] for (for ), or (for ).
5.3 FDR control versus mFDR control
Aharoni and Rosset [AR14] proved that generalized alpha investing rules control mFDR. Formally,
(33) 
As mentioned before (see also Appendix A), this metric has been criticized because it does not control a property of the realized sequence of tests; instead it controls a ratio of expectations.
Our Theorem 3.4 controls a different metric that we called :
(34) 
This quantity is the expected ratio, and hence is not subject to the above criticism. Note that both theorems yield control at level , for the same class of rules.
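The difference between a ratio of expectations and an expected ratio is easy to see numerically. The snippet below is a minimal sketch, assuming each simulated trial records the number of false discoveries V and the total number of discoveries R, and that a smoothing constant `eta` is added to the denominator; the function names and the default value of `eta` are hypothetical.

```python
def ratio_of_expectations(false_discoveries, discoveries, eta=1.0):
    """Monte Carlo estimate of E[V] / (E[R] + eta)."""
    n = len(discoveries)
    mean_v = sum(false_discoveries) / n
    mean_r = sum(discoveries) / n
    return mean_v / (mean_r + eta)


def expected_ratio(false_discoveries, discoveries, eta=1.0):
    """Monte Carlo estimate of E[V / (R + eta)]."""
    n = len(discoveries)
    total = sum(v / (r + eta) for v, r in zip(false_discoveries, discoveries))
    return total / n


# Hypothetical usage: two trials, one with no discoveries at all.
V = [0, 2]
R = [0, 4]
print(ratio_of_expectations(V, R))  # 1 / (2 + 1) = 0.333...
print(expected_ratio(V, R))         # (0/1 + 2/5) / 2 = 0.2
```

The two quantities disagree whenever the number of discoveries varies across trials, which is why control of one does not automatically describe the realized proportion of false discoveries.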
Finally, Theorem 3.1 controls a more universally accepted metric, namely FDR, at level . A natural question is whether, in practice, we should choose , so as to guarantee FDR control (and hence set ) or instead be satisfied with and control, which allow for and hence potentially larger statistical power.
While an exhaustive answer to this question is beyond the scope of this paper, we repeated the simulations in Figure 1, using the two different criteria. The results, provided in Appendix A, suggest that this question might not have a simple answer. On one hand, under the setting of Figure 1 (independent p-values, large number of discoveries) and