Optimal Testing in the Experiment-rich Regime
Abstract
Motivated by the widespread adoption of large-scale A/B testing in industry, we propose a new experimentation framework for the setting where potential experiments are abundant (i.e., many hypotheses are available to test), and observations are costly; we refer to this as the experiment-rich regime. Such scenarios require the experimenter to internalize the opportunity cost of assigning a sample to a particular experiment. We fully characterize the optimal policy and give an algorithm to compute it. Furthermore, we develop a simple heuristic that also provides intuition for the optimal policy. We use simulations based on real data to compare both the optimal algorithm and the heuristic to other natural alternative experimental design frameworks. In particular, we discuss the paradox of power: high-powered “classical” tests can lead to highly inefficient sampling in the experiment-rich regime.
1 Introduction
In modern A/B testing (e.g., for web applications), it is not uncommon to find organizations that run hundreds or even thousands of experiments at a time (Kaufman et al., 2017; Tang et al., 2010a; Kohavi et al., 2009; Bakshy et al., 2014). Increased computational power and the ubiquity of software have made it easier to generate hypotheses and deploy experiments. Organizations typically experiment continuously using A/B testing. In particular, the space of potential experiments of interest (i.e., hypotheses being tested) is vast: e.g., testing the size, shape, font, etc., of page elements; testing different feature designs and user flows; testing different messages; etc. Artificial intelligence techniques are being deployed to help automate the design of such tests, further increasing the pace at which new experiments are designed (e.g., Sensei, Adobe’s A/B testing product, is being used in Adobe Target; see https://www.adobe.com/marketingcloud/target.html).
This abundance of potential experiments has led to an interesting phenomenon: despite the large numbers of visitors arriving per day at most online web applications, organizations need to constantly consider the most efficient way to allocate these visitors to experiments. For many experiments, baseline rates may be small (e.g., a low conversion rate), or more generally effect sizes may be quite small even relative to large sample sizes. For example, large organizations may be seeking relative changes in a conversion rate of a fraction of a percent, potentially necessitating millions of users allocated to a single experiment to discover a true effect. (See Tang et al. (2010a, b); Deng et al. (2013) and Azevedo et al. (2018), where these issues are discussed extensively.) Since organizations have a plethora of hypotheses of interest to test, there is a significant opportunity cost: they must constantly trade off allocation of a visitor to a current experiment against the potential allocation of this visitor to a new experiment.
In this paper, we study a benchmark model with the feature that experiments are abundant relative to the arrival rate of data; we refer to this as the experiment-rich regime. A key feature of our analysis is the impact of the opportunity cost described above: whereas much of optimal experiment design takes place in the setting of a single experiment, the experiment-rich regime fundamentally requires us to trade off the potential for discoveries across multiple experiments. Our main contribution is a complete description of an optimal discovery algorithm for our setting; the development of an effective heuristic; and an extensive data-driven simulation analysis of its performance against more classical techniques commonly applied in industrial A/B testing.
We present our model in Section 2. The formal setting we consider mimics the setting of most industrial A/B testing contexts. The experimenter receives a stream of observational units and can assign them to an infinite number of possible experiments, or alternatives, of varying quality (effect size). We consider a Bayesian setting where there is a prior over the effect size of each alternative, which is natural in a setting with an infinite number of experiments.
We focus on the objective of finding an alternative that is at least as good as a given threshold θ_0 as fast as possible. In particular, we call an alternative a discovery if the posterior probability that its effect exceeds θ_0 is at least 1 − α, and the goal is to minimize the expected time per discovery. This is a natural criterion: good performance requires finding an alternative that is actually delivering practically significant effects (as measured by θ_0). Adjusting θ_0 and α allows the experimenter to trade off the quality and quantity of discoveries made. Note that under this criterion any optimal policy is naturally incentivized to find the “best” experiments, because the discovery criterion is easiest to meet for those alternatives.
In Section 3 we present an optimal policy for allocation of observations to experiments. Since observations arrive sequentially, the problem can be equivalently formulated as minimizing the cumulative number of observations until a discovery is made. We characterize a dynamic programming approximation of this problem, and show this method converges to the optimal policy in an appropriate sense. We also develop a simple heuristic that approximates and provides insight into the optimal policy.
In Section 4 we use data on baseball players’ batting averages as input data for a simulation analysis of our approach. Our simulations demonstrate that our approach delivers fast discovery while controlling the rate of false discoveries; and that our heuristic approximates the optimal policy well. We also use the simulation setup to compare our method to “classical” techniques for discovery in experiments (e.g., hypothesis testing). This comparison reveals the ways in which classical methods can be inefficient in the experiment-rich regime. In particular, there is a paradox of power: efficient discovery can often lead to low power in a classical sense, and conversely high-powered classical tests can be highly inefficient in maximizing the discovery rate.
Due to space constraints, all proofs are given in the appendix.
1.1 Related work
The literature on sequential testing goes back many decades. Originally, Wald and Wolfowitz (1948) propose an optimal test, the sequential probability ratio test (SPRT), for testing a simple hypothesis. Chernoff (1959) studies the asymptotics of experimentation with two hypotheses and how to assign observations. Lai (1988) proposes a class of Bayesian sequential tests with a composite alternative for an exponential family of distributions. For a more thorough overview of sequential testing, we refer the interested reader to Siegmund (2013), Wetherill and Glazebrook (1986), Shiryayev (1978) and Lai (1997). None of these approaches consider the opportunity cost associated with having multiple experiments.
Recently, there has been an increased interest in sequential testing due to the rise in popularity of A/B testing (Deng et al., 2017; Kaufman et al., 2017; Kharitonov et al., 2015), and the ubiquity of peeking (Johari et al., 2017; Balsubramani and Ramdas, 2016; Deng et al., 2016). A recent paper by Azevedo et al. (2018) discusses how the tails of the effect distribution affect the assignment strategy of observations to experiments and complements this work.
There is also a strong connection to the multi-armed bandit literature (Gittins et al., 2011; Bubeck and Cesa-Bianchi, 2012), especially the pure exploration problem (Bubeck et al., 2009; Jamieson et al., 2014; Russo, 2016), where the goal is to find the best arm. The case with infinitely many arms is studied by Carpentier and Valko (2015); Chaudhuri and Kalyanakrishnan (2017); Aziz et al. (2018). Locatelli et al. (2016) studies the setting of finding the set of arms (out of finitely many) above a given threshold in a fixed time horizon.
Methods to control the false discovery rate in the sequential hypothesis setting are discussed by Foster and Stine (2007), Javanmard and Montanari (2016) and Ramdas et al. (2017). The connection with multi-armed bandits is made by Yang et al. (2017). However, the Bayesian framework we propose does not require multiple testing corrections.
The heavy coin problem (Chandrasekaran and Karp, 2012; Malloy et al., 2012; Jamieson et al., 2016; Lai et al., 2011) is another closely related research area. Here, a fraction of coins in a bag is considered heavy, while most are light. The goal is to find a heavy coin as quickly as possible. These approaches rely on likelihood ratios, as there are only two alternatives, and there is a connection to the CUSUM procedure (Page, 1954). The approaches mentioned above consider the same problem as we do in this work, albeit for testing two alternatives against each other.
2 Model and objective
In this section we describe the model we study and the objective of the experimenter.
Experiments. We consider a model with an infinite number of experiments, or alternatives, indexed by i = 1, 2, …. Each experiment is associated with a parameter θ_i, drawn independently from a common (known) prior π, that completely characterizes the distribution of outcomes corresponding to that experiment. Throughout our analysis, the experimenter is interested in experiments with higher values of θ_i.
Actions and outcomes. At times t = 1, 2, …, the experimenter selects an alternative I_t and observes an independent outcome X_t drawn from a distribution F(· | θ_{I_t}). Note, in particular, that opportunities for observations arrive in a sequential, streaming fashion. We also assume that observations are independent across experiments.
We assume that F(· | θ) is described by a single-parameter natural exponential family, i.e., the density for an observation can be written as:

(1) f(x | θ) = h(x) exp( θ T(x) − A(θ) ),

for known functions h, T, and A. Let S_t^i = Σ_{s ≤ t: I_s = i} T(X_s) be the canonical sufficient statistic for experiment i at time t. Note that in particular, this model includes the conjugate normal model with known variance and the beta-binomial model for binary outcomes.
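As a concrete sanity check of the exponential family form (1), the Bernoulli model can be written in this way with T(x) = x, h(x) = 1, natural parameter given by the log-odds of the mean p, and log-partition A(θ) = log(1 + e^θ). The sketch below (ours, for illustration only) evaluates the pmf through this parametrization:

```python
import math

def bernoulli_nef_density(x, p):
    """Bernoulli pmf in natural exponential family form (1):
    f(x | theta) = h(x) * exp(theta * T(x) - A(theta)), with T(x) = x,
    h(x) = 1, natural parameter theta = log(p / (1 - p)), and
    log-partition A(theta) = log(1 + exp(theta))."""
    theta = math.log(p / (1 - p))
    return math.exp(theta * x - math.log(1 + math.exp(theta)))
```

Evaluating it recovers the usual pmf: bernoulli_nef_density(1, 0.3) equals 0.3 up to floating point, and bernoulli_nef_density(0, 0.3) equals 0.7.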
Policies. Let F_t denote the σ-field generated by all choices and observations up to time t. A policy is a map that, given F_t, selects the experiment to receive the next observation.
Discoveries. The experimenter is interested in finding discoveries, defined as follows.
Definition 1 (Discovery).
We say that alternative i is a discovery at time t, given θ_0 and α, if

(2) P(θ_i ≥ θ_0 | F_t) ≥ 1 − α.
Here θ_0 and α are parameters that capture the experimenter’s preferences, i.e., the level of aggressiveness and risk that she is willing to tolerate. (Note that this is more stringent than the related false discovery rate guarantees (Benjamini and Hochberg, 2007).)
We assume that the prior satisfies 0 < π(θ ≥ θ_0) < 1 − α, to avoid the trivial scenarios in which every alternative is a discovery before trials begin, or no alternative can ever become one.
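For the beta-binomial case used later in the paper, the discovery criterion (2) and the nontriviality condition on the prior can both be checked in closed form when the Beta prior has integer parameters, via the classical identity relating the Beta tail to a binomial CDF. A sketch (the prior parameters below are illustrative, not taken from the paper):

```python
import math

def beta_tail(a, b, x):
    """P(theta >= x) for theta ~ Beta(a, b) with integer a, b >= 1,
    using the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a))

def is_discovery(a0, b0, n, s, theta_0, alpha):
    """Criterion (2) under a Beta(a0, b0) prior, after s successes in n
    Bernoulli trials: the posterior is Beta(a0 + s, b0 + n - s)."""
    return beta_tail(a0 + s, b0 + n - s, theta_0) >= 1 - alpha
```

For instance, under a (hypothetical) Beta(2, 8) prior with θ_0 = 0.3 and α = 0.05, the prior tail beta_tail(2, 8, 0.3) ≈ 0.196 lies strictly between 0 and 1 − α, so the nontriviality assumption holds and no alternative is a discovery before sampling begins.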
Objective: Minimize time to discovery. As motivated in the introduction, informally the objective is to find discoveries as fast as possible. We formalize this as follows: The goal of the experimenter is to design a policy (i.e., an algorithm to match observations to experiments) such that the number of observations until the first discovery is minimized.
In particular, define the time to first discovery as:

(3) τ = inf{ t ≥ 1 : some alternative i is a discovery at time t }.

Then the goal is to minimize E[τ] over all policies. Given this goal, the only decision the experimenter needs to make at each point in time until the first discovery is whether to reject the current experiment or to continue with it.
Discussion. We conclude with two remarks regarding our model.
(1) Posterior validity. Note that at the (random) stopping time τ, the posterior is computed based on the potentially adaptive matching policy used by the experimenter. The following lemma shows that when the experimenter computes the posterior and decides to stop at the time τ when condition (2) is met, the decision to stop does not invalidate the discovery.
Lemma 1.
The posterior for the discovered experiment i at time τ satisfies

(4) P(θ_i ≥ θ_0 | F_τ) ≥ 1 − α

almost surely.
(2) Fixed cost per experiment. In some scenarios, starting a new experiment has a cost; e.g., there may be a cost to implementing a new variant, or results may need to be analyzed on a per-experiment basis. We can incorporate such a cost in the objective, and our results and approach generalize accordingly. Formally, let c be the cost of starting a new experiment, and let K_t be the cumulative number of matched experiments up to time t. We can include the per-experiment cost by considering instead the problem of minimizing E[τ + c·K_τ].
3 Optimal policy
In this section, we characterize the structure of the optimal policy, show that it can be approximated arbitrarily well by considering a truncated problem, and give an algorithm to compute the optimal policy of the truncated problem. Finally, we present a simple heuristic that approximates the optimal policy remarkably well.
3.1 Sequential policies
We start with a key structural result that simplifies the search for an optimal policy. The following lemma shows that we can focus on policies that only consider experiments sequentially, in the sense that once a new experiment is being allocated observations, no previous experiment will ever again receive observations.
Lemma 2.
There exists an optimal policy such that I_{t+1} ≥ I_t for all t, almost surely; that is, experiments are taken up sequentially and never revisited.
This result hinges on three aspects of our model: experiments are independent of each other, with identically distributed effects ; there are an infinite number of experiments available; and observations arrive in an infinite stream. As a consequence, all experiments are a priori equally viable, and a posteriori once the experimenter has determined to stop allocating observations to an experiment, she need never consider it again.
Note in particular that this lemma also reveals that any optimal policy for the first discovery also straightforwardly minimizes the expected time until the k-th discovery, for any k ≥ 1.
3.2 Reformulating the optimization problem
Based on Lemma 2, we can reformulate and simplify the optimization problem faced by the experimenter as a sequential decision problem, where the only choice is whether or not to continue testing the current experiment.
We abuse notation to describe this new perspective. Let θ denote the effect size of the current experiment. Let X_n be the n-th observation of the current experiment, and let F_n be the σ-field generated by its first n observations. Let S_n = Σ_{m=1}^n T(X_m) denote the canonical sufficient statistic after n observations. The state of the sequential decision problem is (n, S_n): the number of observations and the sufficient statistic of the current experiment.
If (n, S_n) has the property that P(θ ≥ θ_0 | F_n) ≥ 1 − α, then a discovery has been found and so the process stops. The following lemma shows that this discovery criterion induces an acceptance region on the sufficient statistic S_n, i.e., a sequence of thresholds a_n such that the current experiment is a discovery when S_n ≥ a_n.
Lemma 3.
There exists a sequence {a_n} such that P(θ ≥ θ_0 | F_n) ≥ 1 − α if and only if S_n ≥ a_n.
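In the beta-binomial model the boundary of Lemma 3 can be computed by direct search, because the posterior tail probability is increasing in the number of successes. A sketch under an illustrative integer-parameter Beta prior (our own example values):

```python
import math

def beta_tail(a, b, x):
    """P(theta >= x) for theta ~ Beta(a, b), integer a, b >= 1
    (binomial identity for the regularized incomplete beta)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a))

def accept_boundary(n, a0, b0, theta_0, alpha):
    """Smallest success count s with P(theta >= theta_0 | n, s) >= 1 - alpha,
    i.e., the a_n of Lemma 3; returns n + 1 when no discovery is possible
    after n observations. The linear search is valid because the posterior
    tail is increasing in s."""
    for s in range(n + 1):
        if beta_tail(a0 + s, b0 + n - s, theta_0) >= 1 - alpha:
            return s
    return n + 1
```

For example, with a Beta(2, 8) prior, θ_0 = 0.3 and α = 0.05, no discovery is possible with fewer than 7 observations, and at n = 7 only a perfect run of 7 successes qualifies.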
If S_n < a_n, then the experimenter can make one of two decisions:

Continue (i.e., collect one additional observation on the current experiment); or

Reject (i.e., quit the current experiment and collect the first observation of a new experiment).

If Continue is chosen, the state updates to (n + 1, S_{n+1}). If Reject is chosen, the state changes to (1, S'_1), where S'_1 is an independent draw of the sufficient statistic after the first observation of a fresh experiment; in either case, the process continues.
The goal of the experimenter is to minimize the expected time until the observation process stops, i.e., until a discovery is found. Let V(n, S) be this minimum, starting from state (n, S). Then the Bellman equation for this process is as follows:

(5) V(n, S) = 0, if S ≥ a_n;

(6) V(n, S) = min{ 1 + E[ V(n + 1, S_{n+1}) | S_n = S ], 1 + E[ V(1, S'_1) ] }, if S < a_n.

The first line corresponds to the case where S is in the acceptance region, i.e., the process stops. In the second line, we consider two possibilities: continuing incurs a unit cost for the current observation, plus the expected cost from the state (n + 1, S_{n+1}); rejecting resets the state to that of a fresh experiment, with no cost incurred beyond the unit cost of that experiment’s first observation. The optimal choice is found by minimizing between these alternatives. The expected number of samples till a discovery satisfies E[τ] = 1 + E[ V(1, S_1) ].
3.3 Characterizing the optimal policy
The following theorem shows that an optimal policy for the dynamic programming problem (5)-(6) can be expressed using a sequence of rejection thresholds on the sufficient statistic. That is, for each n there is an r_n < a_n such that it is optimal to Continue if S_n > r_n, and to Reject if S_n ≤ r_n.
Theorem 4.
There exists an optimal policy and a sequence of rejection thresholds {r_n} such that, in any state (n, S_n) with S_n < a_n, the policy chooses Continue if S_n > r_n and Reject if S_n ≤ r_n.
The remainder of the section is devoted to computing the optimal sequence of rejection thresholds.
3.4 Approximating the optimal policy via truncation
In order to compute an optimal policy, we consider a truncated problem. This problem is identical in every respect to the problem in Section 3.2, except that we consider only policies that must choose Reject after N observations of the current experiment. We refer to this as the N-truncated problem.
Let V_N(n, S) denote the minimum expected cumulative time to discovery for the N-truncated problem, starting from state (n, S). The Bellman equation is nearly identical to (5)-(6), except that now n ≤ N, and in (6) Reject is the only available choice when n = N. We have the following result.
Theorem 5.
There exists an optimal policy for the N-truncated problem described by a sequence of rejection thresholds {r_n^N} such that, after n observations, Reject is declared if S_n ≤ r_n^N, Continue is declared if r_n^N < S_n < a_n, and Accept is declared if S_n ≥ a_n.
Further, let E[τ_N] be the optimal expected number of observations until a discovery is made in the N-truncated problem. Then for each n, r_n^N → r_n as N → ∞; and E[τ_N] → E[τ] as N → ∞.
3.5 Computing the truncated optimal policy
The truncated horizon brings us closer to computing an optimal policy, but it is still an infinite-horizon dynamic programming problem, since experiments can be rejected indefinitely. In this section we show instead that we can compute the truncated optimal policy by iteratively solving a single-experiment truncated problem with a fixed rejection cost C. Let W(n, S) be the optimal expected cost for this problem starting from state (n, S). We have the following Bellman equation.

(7) W(n, S) = 0, if S ≥ a_n;

(8) W(n, S) = min{ 1 + E[ W(n + 1, S_{n+1}) | S_n = S ], C }, if n < N and S < a_n;

(9) W(N, S) = C, if S < a_N.
For any terminal cost , this dynamic programming problem is easily solved using backward induction to find the rejection boundaries. The following theorem shows how we can use this solution to find an optimal policy to the truncated problem.
Theorem 6.
Let C* satisfy C* = 1 + E[ W(1, S_1) ], where W is computed with rejection cost C*. Then the rejection thresholds of the single-experiment problem with cost C* are optimal for the N-truncated problem, and E[τ_N] = C*.
Thus, to find approximately optimal rejection thresholds, select N suitably large, and start with an arbitrary C. Then iteratively compute the corresponding thresholds and the updated cost 1 + E[ W(1, S_1) ], using bisection to converge on C*, and thus on the corresponding optimal thresholds.
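The procedure of this section (acceptance boundaries, backward induction on (7)-(9) for a given rejection cost C, and bisection for the fixed point C*) can be sketched for the beta-binomial model as follows. The Beta prior parameters are illustrative; integer parameters are used so the posterior tail has an exact closed form:

```python
import math

def beta_tail(a, b, x):
    """P(theta >= x) for theta ~ Beta(a, b), integer a, b >= 1 (binomial identity)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a))

def accept_boundaries(N, a0, b0, theta_0, alpha):
    """bounds[n] = smallest s making (n, s) a discovery; n + 1 if impossible."""
    bounds = []
    for n in range(N + 1):
        a_n = n + 1
        for s in range(n + 1):
            if beta_tail(a0 + s, b0 + n - s, theta_0) >= 1 - alpha:
                a_n = s
                break
        bounds.append(a_n)
    return bounds

def fresh_cost(C, bounds, a0, b0):
    """1 + E[W(1, S_1)] for rejection cost C, by backward induction on (7)-(9)."""
    N = len(bounds) - 1
    W = [0.0 if s >= bounds[N] else C for s in range(N + 1)]  # horizon: (9)
    for n in range(N - 1, 0, -1):
        W_next, W = W, []
        for s in range(n + 1):
            if s >= bounds[n]:
                W.append(0.0)                    # discovery, stop: (7)
            else:
                p = (a0 + s) / (a0 + b0 + n)     # posterior predictive P(success)
                W.append(min(1 + p * W_next[s + 1] + (1 - p) * W_next[s], C))  # (8)
    p0 = a0 / (a0 + b0)                          # prior predictive P(success)
    return 1 + p0 * W[1] + (1 - p0) * W[0]

def solve_cost(N, a0, b0, theta_0, alpha, tol=1e-6):
    """Bisection for the fixed point C* = 1 + E[W(1, S_1)], the optimal E[tau_N]."""
    bounds = accept_boundaries(N, a0, b0, theta_0, alpha)
    lo, hi = 1.0, 1e6
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fresh_cost(mid, bounds, a0, b0) < mid:
            hi = mid  # cost map is below the diagonal, so C* < mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Bisection is justified because the map C ↦ 1 + E[W(1, S_1)] is nondecreasing with slope strictly below one (rejection is paid with probability less than one), so the fixed point is unique.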
We note that the same program we have outlined in this section can be used to compute an optimal policy with a per-experiment fixed cost c, by adding c to the rejection cost. Empirically, this leads to only slightly lower rejection thresholds; due to space constraints, we omit the details.
3.6 Heuristic approximation
We have seen that the optimal policy is easy to approximate by solving dynamic programs iteratively. However, this does not give us direct insight into the structure of the solution, and in certain cases a quick rule-of-thumb that provides an approximate policy might be all that is required. In this section, we show that there exists a simple heuristic that performs remarkably well.
The approximate rejection boundary at time n is found as follows. Let θ̂_n(S) be the MAP estimate of θ for sufficient statistic S. Then reject the current experiment if crossing the acceptance boundary k observations from now is not plausible under θ̂_n. That is, the heuristic boundary is, for a suitably chosen lookahead k and plausibility level β,

(10) r̃_n = max{ S < a_n : P( S_{n+k} ≥ a_{n+k} | S_n = S, θ = θ̂_n(S) ) ≤ β }.
Of course, this heuristic is not practical as is, since the ideal lookahead k, which should be on the order of E[τ], is unknown unless we compute the optimal policy. But the boundary often varies only a little in k, so a reasonable approximate choice is sufficient. In Figure 1 we plot the discovery and rejection boundaries, along with the heuristic outlined above, for the normal and Bernoulli models.
The heuristic and optimal policies clearly exhibit aggressive rejection regions; cf. Figure 1. The interpretation is as follows: to continue sampling from the current experiment, we do not just want its quality θ to be above θ_0, but substantially above it, since the acceptance boundary a_n corresponds to an effect strictly greater than θ_0 for all n. If not, it would take too many additional observations to verify the discovery.
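One way to make the heuristic concrete in the beta-binomial model is the following sketch: reject when, under the MAP estimate of θ, the chance of crossing the acceptance boundary k observations ahead is at most β. This is our reading of the rule above; the parameter values, and the integer Beta prior, are illustrative assumptions rather than the paper's settings:

```python
import math

def beta_tail(a, b, x):
    """P(theta >= x) for theta ~ Beta(a, b), integer a, b >= 1 (binomial identity)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a))

def accept_boundary(n, a0, b0, theta_0, alpha):
    """Smallest s making (n, s) a discovery; n + 1 if impossible."""
    for s in range(n + 1):
        if beta_tail(a0 + s, b0 + n - s, theta_0) >= 1 - alpha:
            return s
    return n + 1

def binom_sf(k, p, m):
    """P(Binomial(k, p) >= m)."""
    return sum(math.comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(max(m, 0), k + 1))

def heuristic_reject(n, s, k, beta, a0, b0, theta_0, alpha):
    """Reject if, under the MAP estimate of theta (valid when a0 + s > 1 and
    b0 + n - s > 1), reaching the acceptance boundary k observations from now
    has probability at most beta."""
    theta_map = (a0 + s - 1) / (a0 + b0 + n - 2)
    needed = accept_boundary(n + k, a0, b0, theta_0, alpha) - s
    return binom_sf(k, theta_map, needed) <= beta
```

For instance, with a Beta(2, 8) prior, θ_0 = 0.3, α = 0.05, lookahead k = 50 and β = 0.05, an alternative with 0 hits in 10 trials is rejected immediately, while one with 9 hits in 10 is kept.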
4 Case study: baseball
We now empirically analyze our testing framework based on a simulation with baseball data. First, we demonstrate empirically that the proposed algorithm leads to fast discoveries, and behaves differently from traditional testing approaches. Second, we show that the rule-of-thumb heuristic’s performance is close to that of the optimal policy. (Code to replicate results can be found at https://github.com/schmit/optimaltestingexperimentrichregime.)
Data. We use the baseball dataset with pitching and hitting statistics from 1871 through 2016 from the Lahman R package. The number of At Bats (AB) and Hits (H) is collected for each player, and we are interested in finding players with a high batting average, defined as H/AB. We restrict attention to players with a minimum number of At Bats, which leaves a total of 5721 players, with a mean of about 2300 At Bats. In the top left of Figure 2, we plot the histogram of batting averages, along with an approximation by a beta distribution (fit via method of moments). We note that this fits the data reasonably well, but not perfectly. The discrepancy helps us evaluate robustness to a misspecified prior.
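The method-of-moments fit mentioned above solves the Beta moment equations in closed form; a minimal sketch (the input values in the example are illustrative, not the dataset's actual moments):

```python
def beta_mom(mean, var):
    """Method-of-moments fit of Beta(a, b): solves mean = a / (a + b) and
    var = a * b / ((a + b)**2 * (a + b + 1)).
    Requires 0 < var < mean * (1 - mean)."""
    nu = mean * (1 - mean) / var - 1  # nu = a + b
    return mean * nu, (1 - mean) * nu
```

For example, beta_mom(0.25, 0.001) returns Beta parameters that reproduce those two moments exactly.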
Simulation setup. To construct the testing problem, we view the batters as alternatives, with the empirical batting average θ_i of batter i treated as ground truth. We want to find alternatives with θ_i ≥ θ_0. We draw a Bernoulli sample with mean θ_i to simulate an observation from alternative i. (Based on sequential batting data from the 2014-2018 seasons, there is no evidence for strong correlation between at-bats.) These samples are then used to test whether θ_i ≥ θ_0. We fix α and vary the threshold θ_0 over a range of values. For each simulation, we iterate through each batter and repeat the process 1000 times to reduce variance. This allows us to compare methods fairly, ensuring that each procedure is run on exactly the same test cases.
4.1 Benchmarks
To assess performance, we compare several testing procedures. Note that the nontraditional setup of our testing framework does not allow for easy comparison with other methods, in particular frequentist approaches, as they give different guarantees. Thus, we restrict attention to Bayesian methods that provide the same error guarantee. All of the benchmarks use the same beta prior computed above.
Optimal policy
First we study the optimal policy based on the beta-binomial model, computed using the bisection and backward induction approach of Section 3.5, where we truncate after N samples.
Heuristic policy
Next, we include the heuristic rejection thresholds that approximate the optimal policy for the N-truncated problem. The heuristic policy requires setting two parameters: the lookahead k, i.e., how far to look into the future to find the acceptance boundary, which is ideally set close to E[τ]; and the plausibility level β. To demonstrate the insensitivity to these parameters, we use the same fixed values for all simulations. (Note that E[τ] varies dramatically as we change the threshold θ_0.)
Fixed sample size test
Our next benchmark is a simple fixed sample size test. For each experiment, we gather a fixed number n of observations, and claim a discovery if H_i ≥ a_n, where H_i is the number of Hits of alternative (batter) i. We focus our attention on the sample size that seems to perform best when compared to other choices, but any differences are immaterial for our conclusions.
Fixed sample size test with early stopping
This benchmark is similar to the fixed sample size test, except that we stop the experiment early if the discovery criterion is met. Thus, we can quantify the gains from being able to discover early.
Bayesian sequential test
Now we consider a sequential test that also rejects early. In particular, we reject the current experiment if the posterior probability P(θ_i ≥ θ_0 | F_n) falls to a level β_r or below. We also reject an alternative after N samples. This approach requires careful tuning of β_r. In particular, if β_r is too large, say larger than the prior probability π(θ ≥ θ_0), then the test is too aggressive and rejects all alternatives outright. Instead, we found empirically that a fixed β_r below this prior probability leads to good performance across all values of θ_0.
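This benchmark is straightforward to implement; below is a sketch for the beta-binomial model. The prior, α, β_r, and sample cap are illustrative values of ours, not the paper's settings:

```python
import math

def beta_tail(a, b, x):
    """P(theta >= x) for theta ~ Beta(a, b), integer a, b >= 1 (binomial identity)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a))

def bayes_sequential_test(draw, a0, b0, theta_0, alpha, beta_r, max_n):
    """Run one Bayesian sequential test: accept as a discovery when the
    posterior tail reaches 1 - alpha, reject when it falls to beta_r or
    below, and reject after max_n samples regardless. `draw()` yields one
    Bernoulli observation (0 or 1)."""
    a, b = a0, b0
    for n in range(1, max_n + 1):
        x = draw()
        a, b = a + x, b + (1 - x)
        tail = beta_tail(a, b, theta_0)
        if tail >= 1 - alpha:
            return "discovery", n
        if tail <= beta_r:
            return "reject", n
    return "reject", max_n
```

With a Beta(2, 8) prior, θ_0 = 0.3 and α = β_r = 0.05, a run of straight failures is rejected after only 5 observations, while a run of straight successes is declared a discovery within a handful more, illustrating how early rejection drives the efficiency gains.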
4.2 Results
Average time to discovery
The average number of observations until a discovery is shown in the top right plot of Figure 2. As expected, the fixed sample test performs worst. Early stopping leads to slightly better performance, but this method is still not effective, since most of the gains come from early rejection rather than early acceptance. The Bayesian sequential test demonstrates this effect and shows substantial gains over the fixed tests. The heuristic policy, despite lack of parameter tuning, performs very well, essentially matching the performance of the optimal algorithm for most thresholds.
False discovery proportion and robustness
Next, we compare the false discovery proportion (FDP) (Benjamini and Hochberg, 2007), i.e., the fraction of discoveries for which in fact θ_i < θ_0. If the prior is correctly specified, the methods we consider satisfy E[FDP] ≤ α. Indeed, we observe that the guarantee holds for most thresholds and algorithms in the bottom left plot of Figure 2. There is some minor exceedance of the FDP for intermediate thresholds, which can be explained by the fact that the prior does not fit the empirical batting averages perfectly. Since there are few discoveries for the largest thresholds, the FDP estimate has higher variance in that regime. Across all simulations, the optimal policy keeps the FDP close to the nominal level. Finally, we see that the lack of early stopping makes the fixed test rather conservative.
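The FDP reported here is computed directly from the simulated ground truth; a trivial sketch:

```python
def false_discovery_proportion(discovered_thetas, theta_0):
    """Fraction of discoveries whose true effect lies below the threshold."""
    if not discovered_thetas:
        return 0.0
    return sum(t < theta_0 for t in discovered_thetas) / len(discovered_thetas)
```

For example, discoveries with true effects [0.25, 0.35, 0.40] at θ_0 = 0.3 give an FDP of 1/3.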
The paradox of power
Finally, we compare power, i.e., the fraction of alternatives with θ_i ≥ θ_0 that are declared a discovery. Power comparisons across the algorithms are plotted in the bottom right of Figure 2. The most surprising insight from the simulations is the paradox of power: algorithms that are effective have very low power. This is counterintuitive: how can algorithms that make many discoveries have only a small chance of picking up true effects? The main driver of good performance for an algorithm is the ability to quickly reject unpromising alternatives. Some unpromising alternatives are “barely winners”: i.e., θ_i is only slightly above θ_0. In the experiment-rich regime, such alternatives should be rejected quickly, because it takes too many observations for the posterior to concentrate enough to claim a discovery. This effect leads to low power, but fast discoveries.
Characteristics of the optimal policy
We consider the outcomes of individual tests for the optimal algorithm, at a representative threshold θ_0, in Figure 3. The average number of samples for rejected alternatives is very small, while it is much larger for discovered alternatives. We also note the concave shape for the discovered alternatives, which seems to peak at batting averages slightly above the threshold. When the batting average is larger, the algorithm is able to detect the effect with fewer samples, and when the batting average is lower, only a few lucky streaks lead to a (false) discovery.
The probability of discovery is low across all batting averages, but increases sharply only at batting averages somewhat above the threshold θ_0, rather than at θ_0 itself. As noted before, the optimal policy tries to avoid effects that are close to the threshold.
Finally, we consider the MAP estimates of the batting averages of discovered batters. These illustrate a known but important fact: the parameter estimates for discovered alternatives are quite poor. If estimation of effects is important, the experimenter ought to obtain more samples for the discovered alternatives.
5 Conclusion
We consider an experimentation setting where observations are costly, and there is an abundance of possible experiments to run, an increasingly prevalent scenario as the world becomes more data-driven. Based on backward induction, we can compute an approximately optimal algorithm that allocates observations to experiments such that the time to a discovery is minimized. Simulations validate the efficacy of our approach, and also reveal the paradox of power: there is a tension between high-powered tests and being efficient with observations.
Our paradigm has several additional practical benefits. First, we can leverage knowledge across experiments through the prior. Second, adaptive matching of observations to experiments does not preclude valid inference, and thus outcomes can be continuously monitored. Finally, the framework also provides an easy “user interface”: it directly incorporates the desired effect size, and leads to guarantees that are easy to explain to nonexperts.
Further directions
The framework assumes there is a common prior among alternatives, and this allows us to view every rejection as a renewal. If experiments have different priors, then the order in which experiments are chosen matters. This is also true when the costs of observations or of starting experiments differ across experiments.
Furthermore, we assume the prior is known. The experimenter can take an empirical Bayes approach similar to our simulations before starting the experiments, but data gathered while the optimal policy is running distorts estimates of the prior.
We briefly touched upon the independence assumption across observations of a single experiment, and showed that for baseball data this does not lead to problems; but in other use cases time variation (e.g., novelty effects) might play a bigger role and need to be encoded into the framework. One way to incorporate such effects, along with suitably chosen covariates that can reduce the variance of testing and thereby improve the time to discovery, is to use Bayesian generalized linear models.
Finally, we assume that experiments are independent. In certain settings the results of one experiment can affect future experiments, or there might be correlations between outcomes of experiments. Incorporating these effects is nontrivial and beyond the scope of this work.
6 Acknowledgements
The authors would like to thank Johan Ugander, David Walsh, Andrea Locatelli, and Carlos Riquelme for their suggestions and feedback. This work was supported by the Stanford TomKat Center, and by the National Science Foundation under Grant No. CNS1544548 and CNS1343253. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
 Azevedo et al. [2018] Eduardo M Azevedo, Alex Deng, Jose Montiel Olea, Justin M Rao, and E Glen Weyl. A/B testing. In Proceedings of the Nineteenth ACM Conference on Economics and Computation. ACM, 2018.
 Aziz et al. [2018] Maryam Aziz, Jesse Anderton, Emilie Kaufmann, and Javed A. Aslam. Pure exploration in infinitelyarmed bandit models with fixedconfidence. In ALT, 2018.
 Bakshy et al. [2014] Eytan Bakshy, Dean Eckles, and Michael S Bernstein. Designing and deploying online field experiments. In Proceedings of the 23rd international conference on World wide web, pages 283–292. ACM, 2014.
 Balsubramani and Ramdas [2016] Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. CoRR, abs/1506.03486, 2016.
 Benjamini and Hochberg [2007] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. 2007.
 Bubeck and Cesa-Bianchi [2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1–122, 2012.
 Bubeck et al. [2009] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In ALT, 2009.
 Carpentier and Valko [2015] Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In ICML, 2015.
 Chandrasekaran and Karp [2012] Karthekeyan Chandrasekaran and Richard M. Karp. Finding the most biased coin with fewest flips. CoRR, abs/1202.3639, 2012.
 Chaudhuri and Kalyanakrishnan [2017] Arghya Roy Chaudhuri and Shivaram Kalyanakrishnan. Pac identification of a bandit arm relative to a reward quantile. In AAAI, pages 1777–1783, 2017.
 Chernoff [1959] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
 Deng et al. [2013] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. Improving the sensitivity of online controlled experiments by utilizing preexperiment data. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 123–132. ACM, 2013.
 Deng et al. [2016] Alex Deng, Jiannan Lu, and Shouyuan Chen. Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 243–252, 2016.
 Deng et al. [2017] Alex Deng, Jiannan Lu, and Jonathan Litz. Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. In WSDM, 2017.
 Foster and Stine [2007] Dean P. Foster and Robert A. Stine. Alpha-investing: A procedure for sequential control of expected false discoveries. 2007.
 Freeman [1983] PR Freeman. The secretary problem and its extensions: A review. International Statistical Review/Revue Internationale de Statistique, pages 189–206, 1983.
 Gittins et al. [2011] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
 Jamieson et al. [2014] Kevin G. Jamieson, Matthew Malloy, Robert D. Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In COLT, 2014.
 Jamieson et al. [2016] Kevin G. Jamieson, Daniel Haas, and Benjamin Recht. The power of adaptivity in identifying statistical alternatives. In NIPS, 2016.
 Javanmard and Montanari [2016] Adel Javanmard and Andrea Montanari. Online rules for control of false discovery rate and false discovery exceedance. CoRR, abs/1603.09000, 2016.
 Johari et al. [2017] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Peeking at A/B tests: Why it matters, and what to do about it. In KDD, 2017.
 Kaufman et al. [2017] Raphael Lopez Kaufman, Jegar Pitchforth, and Lukas Vermeer. Democratizing online controlled experiments at booking.com. arXiv preprint arXiv:1710.08217, 2017.
 Kharitonov et al. [2015] Eugene Kharitonov, Aleksandr Vorobev, Craig MacDonald, Pavel Serdyukov, and Iadh Ounis. Sequential testing for early stopping of online experiments. In SIGIR, 2015.
 Kohavi et al. [2009] Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. Online experimentation at Microsoft. 2009.
 Lai et al. [2011] Lifeng Lai, H. Vincent Poor, Yan Xin, and Georgios Georgiadis. Quickest search over multiple sequences. IEEE Transactions on Information Theory, 57:5375–5386, 2011.
 Lai [1988] Tze Leung Lai. Nearly optimal sequential tests of composite hypotheses. The Annals of Statistics, pages 856–886, 1988.
 Lai [1997] Tze Leung Lai. On optimal stopping problems in sequential hypothesis testing. Statistica Sinica, 7(1):33–51, 1997.
 Locatelli et al. [2016] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In ICML, 2016.
 Malloy et al. [2012] Matthew Malloy, Gongguo Tang, and Robert D. Nowak. Quickest search for a rare distribution. 2012 46th Annual Conference on Information Sciences and Systems (CISS), pages 1–6, 2012.
 Page [1954] Ewan S Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
 Ramdas et al. [2017] Aaditya Ramdas, Fanny Yang, Martin J Wainwright, and Michael I Jordan. Online control of the false discovery rate with decaying memory. In Advances in Neural Information Processing Systems, pages 5655–5664, 2017.
 Russo [2016] Daniel Russo. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417–1418, 2016.
 Samuels [1991] Stephen M Samuels. Secretary problems. Handbook of sequential analysis, 118:381–405, 1991.
 Shiryayev [1978] Albert N Shiryayev. Optimal stopping rules, volume 8 of Applications of Mathematics. Springer, 1978.
 Siegmund [2013] David Siegmund. Sequential analysis: tests and confidence intervals. Springer Science & Business Media, 2013.
 Tang et al. [2010a] Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th Conference on Knowledge Discovery and Data Mining, pages 17–26, Washington, DC, 2010a.
 Tang et al. [2010b] Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation (presentation). 2010b. URL https://static.googleusercontent.com/media/research.google.com/en//archive/papers/Overlapping_Experiment_Infrastructure_More_Be.pdf.
 Wald and Wolfowitz [1948] Abraham Wald and Jacob Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, pages 326–339, 1948.
 Wetherill and Glazebrook [1986] G Barrie Wetherill and Kevin D Glazebrook. Sequential methods in statistics, 1986.
 Williams [1991] David Williams. Probability with martingales. Cambridge university press, 1991.
 Yang et al. [2017] Fanny Yang, Aaditya Ramdas, Kevin G Jamieson, and Martin J Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pages 5959–5968, 2017.
Appendix A Proofs
A.1 Proofs from Section 2
Proof of Lemma 1.
The result relies on being a stopping time. Recall that indicates the discovered experiment. Then we find
where we use that if , and thus if [Williams, 1991, p. 219]. ∎
Proof of Lemma 2.
Note that due to independence we can assume without loss of generality that the index of the arm corresponds to the order in which alternatives are first considered. Thus the result follows if we show that for any , action cannot be strictly better than . Assume to the contrary that is optimal (and strictly better than ) for some . Consider the last time alternative was selected: . At that time it was at least as good to consider a new alternative, and subsequently the posterior for alternative has not changed due to independence. Due to the infinite time horizon, it is thus at least as good to consider a new alternative. ∎
A.2 Proofs from Section 3
Proof of Lemma 3.
Let . We can rewrite the discovery criterion as
(11)  
(12)  
(13) 
We show that this is decreasing in .
Now take the logarithm and the derivative with respect to to obtain
(14)  
(15)  
(16) 
where the expectation in the last line is taken with respect to the distribution with density
(17) 
The last inequality holds because, in general,
(18) 
Now the lemma follows: if , then for all , and similarly if , then for all . ∎
To prove the theorems in Section 3 we use the following lemmas, which are proven at the end of this section.
Lemma 7.
The optimal policy for the truncated problem can be characterized by a rejection threshold. That is, the optimal policy rejects the current experiment if for a sequence , and collects another observation for the current experiment otherwise, until a discovery is made.
Write for the expected number of observations required for a discovery for the optimal policy of the truncated problem. Then we can show that both and converge.
Lemma 8.
Both and converge as .
Proof of Theorem 4.
Lemma 7 shows that the truncated problem has an optimal policy of threshold form. Next, Lemma 8 shows that both the thresholds and the optimal cost converge.
Recall and . We now show that the limiting policy, with corresponding cost , is optimal.
Suppose there exist an and a policy with cost such that . Consider a policy with cost . Let be the stopping time of this policy. We consider the truncated version of this policy and show that it cannot be much worse; on the other hand, this truncated policy has a cost larger than . The truncated policy, denoted by , rejects the current alternative after samples but is otherwise identical to . Let and be the stopping times corresponding to and . Trivially, we have . Because is finite, . Because and are identical up to observations, it follows that if , then , and thus we find that
Thus, it follows that
(19) 
Since , as .
However, for all , and thus , which is a contradiction. ∎
Proof of Theorem 6.
Let denote the (random) hitting time of the boundary of the first alternative
(20) 
under rejection boundary . Furthermore, let denote the rejection probability. Note that we can solve this minimization problem using backward induction, since the time horizon is fixed (). First, we show that has a unique fixed point, which is equal to .
Note that we have
(21) 
By definition, minimizes ; thus it follows immediately that is a fixed point of .
Next, we show that for each and for each .
First, fix . Suppose that . Then there exists such that , and hence , where the last equality follows from (21). This, along with , implies that , a contradiction. Thus, we must have .
Finally, fix . We know that
(22) 
Thus, there exists (equal to ) such that . Thus, . ∎
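The fixed-point structure in the proof above can be made concrete with a small numerical sketch. The model below is a simplified, hypothetical Bernoulli setup chosen purely for illustration (the constants `P0`, `P1`, `PI`, `DELTA`, `T` and the posterior-threshold discovery rule are our assumptions, not the paper's exact formulation): rejecting the current experiment costs `v`, and iterating the induced backward-induction operator from above drives `v` down to its unique fixed point.

```python
# A minimal numerical sketch, NOT the paper's exact model: all constants and
# the posterior-threshold discovery rule below are illustrative assumptions.
# Each experiment is an alternative (Bernoulli bias P1) with prior prob. PI,
# and a null (bias P0) otherwise; one unit is paid per observation, and a
# discovery is declared once the posterior of the alternative exceeds
# 1 - DELTA. Rejecting restarts on a fresh experiment at cost v.
P0, P1, PI, DELTA, T = 0.5, 0.8, 0.2, 0.05, 60

def posterior(heads, tails):
    """Posterior probability that the current experiment is an alternative."""
    l1 = PI * P1 ** heads * (1 - P1) ** tails
    l0 = (1 - PI) * P0 ** heads * (1 - P0) ** tails
    return l1 / (l1 + l0)

def truncated_cost(v):
    """Value at time 0 of the T-truncated problem when rejection costs v,
    computed by backward induction over (time, number-of-heads) states."""
    # Terminal states: forced rejection at time T unless already a discovery.
    J = [0.0 if posterior(h, T - h) >= 1 - DELTA else v for h in range(T + 1)]
    for t in range(T - 1, -1, -1):
        new = [0.0] * (t + 1)
        for h in range(t + 1):
            b = posterior(h, t - h)
            if b >= 1 - DELTA:
                new[h] = 0.0           # discovery: stop at no further cost
            else:
                p_head = b * P1 + (1 - b) * P0
                cont = 1 + p_head * J[h + 1] + (1 - p_head) * J[h]
                new[h] = min(v, cont)  # reject now vs. take one more sample
        J = new
    return J[0]

# Iterate the operator from above; the sequence decreases to the fixed point,
# the optimal expected number of observations per discovery.
v = 1000.0
for _ in range(500):
    v = truncated_cost(v)
```

At the fixed point, the induced policy rejects exactly when the restart cost `v` beats the one-step continuation value, and in this toy model the resulting rejection region is a threshold in the posterior, consistent with the structure in Lemma 7.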
A.3 Proofs of lemmas
Proof of Lemma 7.
Based on Lemma 2, there exists a policy that can be characterized by a sequence of three sets:

Discover: if , the experiment is a discovery;

Continue: if ; and

Reject: if .
Now note that is a threshold region for by definition. Assume for all . Further, from the Bellman equation for the truncated problem, it is clear that the optimal solution rejects the current experiment at the time if
(23) 
Note that
(24) 
Then for we note that is decreasing in . This follows since for all , as for such it is better to continue than to reject. Furthermore, arguing along the lines of the proof of Lemma 3, is decreasing in . This implies we can write for some . ∎
Proof of Lemma 8.
Since increasing the truncation horizon only enlarges the set of feasible policies (increased degrees of freedom), is decreasing. Since is bounded below by , converges. Let .
Next, we show that is decreasing in . Clearly, . Now suppose ; then , which follows from the fact that and that is decreasing in . It remains to show that is bounded.
We construct a lower bound on , for large , as follows. Let and let be such that , by choosing sufficiently small. Then we note that the cost of obtaining another sample is at least . However, if the experimenter rejects the current alternative now, the cost is . Thus, if we can show that there exists a such that for all , , then is a lower bound on for all . But we have shown above that , hence such a exists. This implies that converges as . ∎
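The "increased degrees of freedom" step can likewise be checked numerically in a simplified, hypothetical Bernoulli model (all constants and the posterior-threshold discovery rule below are illustrative assumptions, not the paper's formulation): for a fixed rejection cost, the optimal value of the truncated problem is nonincreasing in the truncation horizon, because every policy feasible for horizon T remains feasible for any larger horizon.

```python
# Illustrative assumptions only: alternatives have bias P1 (prior prob. PI),
# nulls have bias P0; a discovery occurs once the posterior exceeds 1 - DELTA.
P0, P1, PI, DELTA = 0.5, 0.8, 0.2, 0.05

def posterior(heads, tails):
    """Posterior probability that the current experiment is an alternative."""
    l1 = PI * P1 ** heads * (1 - P1) ** tails
    l0 = (1 - PI) * P0 ** heads * (1 - P0) ** tails
    return l1 / (l1 + l0)

def truncated_value(v, T):
    """Optimal cost of the T-truncated problem when rejection costs v,
    computed by backward induction over (time, number-of-heads) states."""
    # Forced rejection at the horizon T unless already a discovery.
    J = [0.0 if posterior(h, T - h) >= 1 - DELTA else v for h in range(T + 1)]
    for t in range(T - 1, -1, -1):
        new = [0.0] * (t + 1)
        for h in range(t + 1):
            b = posterior(h, t - h)
            if b >= 1 - DELTA:
                new[h] = 0.0           # discovery: stop at no further cost
            else:
                p_head = b * P1 + (1 - b) * P0
                new[h] = min(v, 1 + p_head * J[h + 1] + (1 - p_head) * J[h])
        J = new
    return J[0]

# With the rejection cost held fixed, the optimal truncated value can only
# decrease as the horizon grows: the feasible policy sets are nested.
costs = [truncated_value(50.0, T) for T in (10, 20, 40, 80)]
```

In this toy model the sequence of truncated values is monotone and bounded below, mirroring the convergence argument in the proof above.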