Optimal Correct BestArm Selection for General Distributions
Abstract
Given a finite set of unknown distributions or arms that can be sampled, we consider the problem of identifying the one with the largest mean using a deltacorrect algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified delta) that has minimum sample complexity. Lower bounds for deltacorrect algorithms are well known. Deltacorrect algorithms that match the lower bound asymptotically as delta reduces to zero have been previously developed when arm distributions are restricted to a single parameter exponential family. In this paper, we first observe a negative result that some restrictions are essential, as otherwise under a deltacorrect algorithm, distributions with unbounded support would require an infinite number of samples in expectation. We then propose a deltacorrect algorithm that matches the lower bound as delta reduces to zero under the mild restriction that a known bound on the expectation of a nonnegative, continuous, increasing convex function (for example, the squared moment) of the underlying random variables, exists. We also propose batch processing and identify near optimal batch sizes to substantially speed up the proposed algorithm. The bestarm problem has many learning applications, including recommendation systems and product selection. It is also a well studied classic problem in the simulation community.
Optimal Correct BestArm Selection for General DistributionsAgrawal, Juneja and Glynn \firstpageno1
Multiarmed bandits, bestarm identification, sequential learning, ranking and selection
1 Introduction
Given a vector of unknown arms or probability distributions that can be sampled, we consider algorithms that sequentially sample from or pull these arms and at termination identify the bestarm, i.e., the arm with the largest mean. The algorithms considered provide correct probabilistic guarantees, that is, the probability of identifying an incorrect arm is bounded from above by a prespecified . Further, the correct algorithms that we consider aim to minimize the sample complexity, or, equivalently, the expected total number of arms pulled before they terminate. This bestarm problem is well studied in the literature (see, e.g., in learning  Garivier and Kaufmann (2016); Kaufmann et al. (2016); Russo (2016); Jamieson et al. (2014); Bubeck et al. (2011); Audibert and Bubeck (2010); EvenDar et al. (2006); Mannor and Tsitsiklis (2004); in earlier statistics literature  Jennison et al. (1982); Bechhofer et al. (1968); Paulson et al. (1964); in simulation  Glynn and Juneja (2004); Kim and Nelson (2001); Chen et al. (2000); Dai (1996); Ho et al. (1992)).
The correct guarantee imposes constraints on the expected number of times each arm must be pulled by the algorithm. These constraints are made explicit by Garivier and Kaufmann (2016) through their transportation inequality which can be used to arrive at a maxmin optimization problem to develop efficient lower bounds on correct algorithms. This line of work relies on change of measure based analysis that goes back at least to Lai and Robbins (1985). Also see Mannor and Tsitsiklis (2004) and Burnetas and Katehakis (1996). It is important to emphasize that the maxmin optimization problem to develop efficient lower bound on a correct algorithm, requires complete knowledge of the underlying arm distributions, and its solution is a function of these underlying distributions. The algorithms, on the other hand, acting on a given set of arms, are unaware of the underlying distributions, and, typically, adaptively learn them to decide on the sequence of arms to pull, as well as the stopping time.
Garivier and Kaufmann (2016) consider the best arm problem under the assumption that each arm distribution belongs to a single parameter exponential family (SPEF). Under this restriction, they arrive at an asymptotically optimal algorithm having a sample complexity matching the derived lower bound asymptotically as . SPEF distributions include Bernoulli, Poisson and Gaussian distributions with known variance. However, in practice it is rarely the case (other than in the Bernoulli setting) that the arm distributions are from SPEF, so there is a need for a general theory as well as efficient algorithms that have wider applicability. Our paper substantially addresses this issue.
Contributions: Our first contribution is an impossibility result illustrating why some distributional restrictions on arms are necessary for correct algorithms to be effective. Consider an algorithm that provides correct guarantees when acting on a finite set of distributions belonging to a collection , where comprises distributions with unbounded support that are right dense (defined in Section 2). In this setup we show that the sample complexity of the algorithm in every instance of it acting on a finite set of distributions in , must be infinite. Examples of such a include all lighttailed distributions with unbounded support (a distribution is said to be light tailed if its moment generating function is finite in a neighborhood of zero). Another example is a collection of unbounded distributions supported on that are in , for some . That is, for some , their absolute moment is finite.
To arrive at an effective correct algorithm, we restrict arm distributions to the collection
(1) 
where denotes the set of probability distributions with support in , is a strictly increasing, continuous, nonnegative, convex function such that , and is a known positive constant. For instance, we may have for any or . In simulation models, upper bounds on moments of simulation output, as in (1), can often be found by the use of Lyapunov function based techniques (see, e.g., Glynn and Zeevi (2008)). With this mild restriction we solve the associated maxmin optimization problem to arrive at efficient lower bounds on sample complexity for correct algorithms. We also develop simple line search based procedures to solve this optimization problem. Our main contribution is the development of an asymptotically optimal correct algorithm whose sample complexity matches the derived lower bound asymptotically as .
Key to developing such a lower bound and the correct algorithm is the functional defined as follows: let denote the KullbackLeibler divergence between probability distributions and , and let denote the mean of the probability distribution . Then, for and such that , is the optimal value of such that , for , and , for . Call this optimization problem . Heuristically, measures the difficulty of separating distribution from all other distributions in whose mean equals . It equals zero when and .
We develop a concentration inequality for , where for , denotes the empirical distribution corresponding to samples from . This plays a key role in the proof of correctness of the proposed algorithm. A key step in the proof of the concentration inequality relies on arriving at a simpler dual representation of . Here, we substantially extend the representation developed by Honda and Takemura (2010) for bounded random variables to random variables belonging to . Honda and Takemura (2010) had focussed on the regret minimization problem for stochastic bandits (also see Burnetas and Katehakis (1996); Honda and Takemura (2011)).
To prove the correctness of the proposed algorithm, we further develop a concentration inequality for where denotes the number of times arm is pulled in a total of arm pulls, and denotes the empirical distribution for arm based on samples. While Magureanu et al. (2014) have developed these inequalities for the Bernoulli distribution, we generalize their analysis to arm distributions belonging to .
In bounding the sample complexity of the proposed algorithm, we exploit the continuity of in as well as the continuity of the solution to the maxmin lower bound problem with respect to the underlying arm distributions. We achieve this by considering the Wasserstein distance in the space of probability distributions . The Wasserstein distance is relatively tractable to work with, and it can be seen that is a compact metric space under this distance. This, in particular, allows us to use the wellknown Berge’s Maximum Theorem (stated in the Appendix A) to derive the requisite continuity properties.
In, e.g., Garivier and Kaufmann (2016); Kalyanakrishnan et al. (2012), the proposed algorithms solve the lower bound problem at every iteration. However, solving the corresponding maxmin optimization problem can be computationally demanding particularly in the generality that we consider. We instead solve this problem in batches and arrive at near optimal batch sizes, that result in a provably substantial computational reduction.
Best arm problems arise in many settings in practice. For instance, one can view the selection of the best product version to roll out for production and sale, after a set of expensive pilot trials among many competing versions to be a best arm problem. In simulation theory, selecting the best design amongst many (based on output from a simulation model) is a classic problem, with applications to manufacturing, road and communications network design, etc. In these and many other settings, the underlying distributions can be very general and may not be modelled well by a SPEF distribution.
Roadmap: In Section 2 we review some background material and present an impossibility result illustrating the need for distributional restrictions on arms. In Section 3, an efficient lower bound for correct algorithms for the best arm problem is provided when the arm distributions are restricted to . The algorithm that matches this lower bound asymptotically as is developed in Section 4. Enroute, we develop certain concentration inequalities associated with . Discussion on optimal batch size and a numerical experiment are shown in Section 5. While key ideas in some of the proofs are outlined in the main body, proof details are all given in the Appendices.
2 Background and the impossibility result
Let denote the universe of probability distributions for which we aim to devise correct algorithms. We assume that each distribution in has finite mean. Let denote the KullbackLeibler divergence between distributions and . For , let denote the KLdivergence between Bernoulli distributions with mean and , respectively, that is,
Recall that , denotes the mean of any distribution . Let denote the collection of all such that each . Consider a vector of distributions from . Without loss of generality, henceforth we assume that the highestmean arm in is arm 1, that is, . Let denote the collection of all distributions such that each and .
Under a correct algorithm acting on , for , the following transportation inequality is shown by Kaufmann et al. (2016):
(2) 
for any , where denotes the number of times arm is pulled by the algorithm in trials, and denotes the algorithm termination time. Intuitively, this specifies a lower bound on the expected number of samples that need to be generated from each arm under , for an algorithm to separate it from a distribution belonging to the set of alternative hypotheses , with probability at least .
The following lemma helps in proving our negative result in Theorem 2. {lemma} Given with an unbounded support on the positive real line, for any finite and , there exists a distribution such that
(3) 
A collection of probability distributions is referred to as right dense, if for every , and every , , there exists a distribution such that (3) holds.
Observe that a necessary condition for to be right dense is that each member does not have a realvalued upper bound on its support.
Under a correct algorithm operating on right dense , for any ,
(4) 
The proof follows easily from (3) in Lemma 2, since given such that arm has the maximum mean, for any , one can easily find such that for , , and is arbitrarily small. (4) now follows from (2).
When only information available about a distribution is that its mean exists, Bahadur and Savage (1956) prove a related impossibility result that there does not exist an effective test of hypothesis for testing whether the mean of the distribution is zero (also see Lehmann and Romano (2006)). However, to the best of our knowledge, Theorem 2 is the first impossibility result in the best arm setting (or, equivalently, the ranking and selection setting).
3 Lower bound for a correct algorithm
Theorem 2 suggests that some restrictions are needed on for correct algorithms to provide reasonable performance guarantees. To this end, we limit our analysis to correct algorithms acting on the class defined in (1) earlier. Let denote the collection of vectors , such that each . Let . Recall that without loss of generality, . Let denote the collection of all distributions such that each and . From (2) it follows that for any correct algorithm acting on :
Let denote probability simplex in . It follows that is bounded from below by times the inverse of
(5) 
and hence the problem of computing the lower bound on reduces to solving the above maxmin problem. To characterize the solution to (5), we need some definitions.
Recall that for and such that , is defined as the value of . As mentioned in the introduction, we study the continuity of as a function of in the Wasserstein metric.
Wasserstein metric: (see, e.g., Villani (2003)). Recall that the Wasserstein metric between probability distributions and on is given by:
(6) 
where denotes the collection of measures on with marginals on the first and second coordinates, respectively, and is any metric on . For simplicity, we consider . We endow with the corresponding Wasserstein metric, . Then, is a metric space (see Section 7.1 Villani (2003)). as a subset of , is also a metric space with being the metric. Hence, we can define continuity of functions from to . Furthermore, for and in , is a metric on . Thus, endowed with defined with in (6), is a metric space and we can define continuous functions from to . Lemma 3 lists some properties of the set and that give insights into geometrical structure of and are useful for our analysis. These are proved in Appendix B.1.
The set is uniformlyintegrable and compact in the Wasserstein metric. Moreover, for such that and , is increasing for , and decreasing for . It is continuous in , and convex and twice differentiable in . Furthermore, for , it satisfies and .
Let Observe that the maxmin problem (5) may be reexpressed as
(7) 
The inner optimization problem in (7) can be further simplified. Given , let denote the optimal value of the expression in (7) (or, equivalently, (5)) and let denote the set of that maximize (7). Furthermore, recall that . For , and fixed, consider functions and given by
(8) 
The following theorem characterizes the solution to , given that is known.
The set is a singleton. Moreover, the optimal value of the maxmin problem (7),
(9) 
Further, the maxmin problem (7) is solved by that uniquely satisfies

[label=()]

and ,

, where each denotes the unique .

.
Moreover, the optimal value, equals , and the optimal proportion vector, , is a continuous function (in the Wasserstein metric) of .
The above characterization is analogous to that in Theorem 1, Glynn and Juneja (2004) where they considered the fixed budget bestarm problem. It is easily seen that the fixed budget setting also lends itself to solving a maxmin problem analogous to (5), where the arguments and in each term are switched. The above characterization also generalizes that in Garivier and Kaufmann (2016) for SPEF of distributions. Observe that when belongs to SPEF and is restricted to the same SPEF, corresponds to where denotes the corresponding SPEF distribution with mean .
The proposed algorithm discussed in Section 4, relies on solving the maxmin problem (7) repeatedly with replaced by its empirical estimator. Since this estimator may not necessarily belong to , it is important to note that the lower bound in (5) and the results in Theorem 3 hold for every . Furthermore, since the maxmin problem (7) needs to be solved multiple times in our proposed algorithm, efficiently solving it for the optimal proportions for any is crucial to it. {remark} [Numerically solving the maxmin problem:] Let denote the common value of for , i.e., . We develop an algorithm that relies on repeated single dimensional linesearches to solve for and . Appendix B.3 contains the details of the algorithm and proofs of its convergence to the correct value. To get an idea of the computational effort needed in solving (7), let denote the average time taken to compute using efficient solvers. (Theorem 4 below shows that has a dual representation, where it is a solution to a two variable concave program.) Let denote the tolerance for each line search. Then, numerically solving for and takes time To decrease this computation time, we precompute values of for each along a grid, for each . For not in this set, we linearly interpolate from the computed values. This substantially reduces the computation time of the algorithm to , where is time for the preprocessing step.
4 The correct algorithm
We now propose a correct algorithm and show that its sample complexity matches the lower bound up to the first order as . Recall that a correct algorithm has a sampling rule that at any stage, based on the information available, decides which arm to sample next. Further, it has a stopping rule, and at the stopping time it announces the arm with the largest mean while ensuring that the probability of incorrect assessment is at most a prespecified .
It can be shown that if the distribution of the arms, , is not known, but there exists an oracle that informs us the optimal that solve (7), then sampling arms to closely match the proportions leads to an asymptotically optimal algorithm (this can be seen, for instance, by using the stopping rule that is analogous to ours, and essentially repeating the arguments in our proof where approximations to are used). This suggests that the fraction of times a good algorithm pulls an arm should converge to . We propose a sampling rule to ensure this. Our stopping rule (discussed above (10)) relies on a generalized likelihood ratio statistic, taking sufficiently large value.
Sampling rule: Our sampling algorithm relies on solving the maxmin lower bound problem with the vector of empirical distributions used as a proxy for the unknown true distribution . The computed optimal proportions then guide the sampling strategy. Garivier and Kaufmann (2016) and Juneja and Krishnasamy (2019) follow a similar plugin strategy for SPEF distributions, where empirically observed means are used as a proxy for true means. The proposed algorithm conducts some exploration to ensure that no arm is starved with insufficient samples. Because solving the maxmin lower bound problem can be computationally demanding, we solve it periodically after fixed, well chosen samples (which is allowed to be a function of ), where may be optimised to minimize the overall computation effort.
The specific algorithm, , is as follows:

Initialize by allocating samples in roundrobin way to generate at least samples from each arm. Set and let denote total number of samples generated.

Compute optimal proportions . Check if stopping criteria (shown above (10)) is met. If not,

Compute starvation for each arm as

If , generate samples from each arm . Specifically, first generate samples from arm 1, then samples from arm 2 and so on. In addition, generate independent samples from probability distribution . For each arm , count the number of occurrences of in the generated samples and sample arm that many times.

Else, if , generate samples from each arm , where are a solution to the load balancing problem: and Again, first generate samples from arm 1, then samples from arm 2 and so on.

Increment by 1 and return to step 2.
Set . Algorithm ensures that for all .
When to stop: At any step of the algorithm, the generated data suggests a unique arm, say , with the largest mean (arbitrarily breaking ties, if any). Call this the null hypothesis, and its complement (arm does not have maximum mean) the alternate hypothesis. For a stopping rule we consider the generalized likelihood ratio test (see Chernoff (1959)). The numerator in this ratio has value of the likelihood under most likely vector of distributions with arm having the maximum mean, that explains the observed data. The denominator equals the value of likelihood of observed data under most likely distribution of arms under the alternative hypothesis.
In this spirit, at time , since among all vectors of distributions in with distribution having maximum mean, maximizes the likelihood of the observed data, we take numerator to be the likelihood under and the denominator to be that under that maximizes the likelihood of given data under alternative hypothesis. Our stopping rule corresponds to the logarithm of this ‘generalized likelihood ratio’ becoming sufficiently large.
Specifically, let denote the set of arms with arm having the largest mean. Denote to be the set . If at stage , , the , , can be seen to equal
(see Appendix C.2 for the proof).
Stopping rule: If at stage , , check if exceeds the threshold function
(10) 
where is specified later in (16), and . The algorithm stops if , announcing arm as the one having the largest mean. If the threshold function is not exceeded, the algorithm continues.
We prove in Theorem 4 that given by (10) ensures correctness of , as well as that the sample complexity matches lower bound asymptotically as when .
Using arguments as in proof of Theorem 3, it can be shown that if , then:
(11) 
and thus our stopping rule corresponds to evaluating if (11) exceeds .
As mentioned earlier in Remark 3, a mild nuance in our analysis is that while computing , the empirical distribution need not lie in . Also, recall that the stopping condition is checked only after intervals of , i.e., every time after samples are generated. Let denote stopping time for the algorithm for a given . The algorithm makes an error if at time , , for some . Let denote the error event.
We first analyse correctness of algorithm below. Analysis for the sample complexity is presented towards the end of this section.
The proof of correctness relies on concentration inequality for , where for , denotes the empirical distribution corresponding to samples from (Theorem 4). Proof of Theorem 4 in turn relies on the dual representation of (Theorem 4). These results are proved in Appendix C.4 and C.3 respectively and may also be of independent interest.
Let , , , and be nonnegative constants and {theorem} For , and ,
Let , and denote the support of the measure . Let
Note that for due to strict convexity of , there can be at most one such that
For and such that and ,
(14) 
Maximum in RHS above is attained at a unique in . Any probability measure achieving infimum in the primal problem , satisfies , , and
Furthermore, if then , where
In the Appendix C.5.2 we briefly discuss how the algorithm and analysis simplify if it is known apriori that the underlying distributions have a known bounded support.
of (12) in Theorem 4: Recall that the algorithm makes an error if at the stopping time , for some . Let the event be denoted by . Then the error event,
(15) 
Let . To prove that the probability of making an error is bounded by , it is sufficient to prove that is less than For this, we upper bound using Theorem 4 below.
Let , , , and be nonnegative constants and let
For , , and ,
Using Theorem 4 with and ,
Choosing in expression (10) for , ensures that the summation in the above expression is finite. Further, choosing the constant as:
(16) 
proves that , and hence is bounded from above by . We refer the reader to Appendix C.5 for a proof of Theorem 4 and related results.
Proof for (13) in Theorem 4: To see that the sample complexity of matches lower bound as , i.e., (13) holds, first observe that our sampling algorithm ensures that fraction of times each arm is pulled is close to its optimal proportion . In particular,
For all ,
The proof of Lemma 4 is given in Appendix C.6. It uses the fact that eventually all the samples are allocated according to optimal proportions computed for the empirical distribution vector, , which in turn converges to . We first heuristically argue that (13) holds. Let denote the set . Recall that for defined in (10), the stopping time, equals
and satisfies
(17) 
Furthermore, for sufficiently large , with high probability, , and from Lemma 4, . When this is true, arm is the best arm, and satisfies
(18) 
With constants and as in (10), that satisfies (18) is given by
(19) 
Dividing both sides of (19) by , we get , for sufficiently small .
Complement of this highprobability event contributes only lower order terms (with respect to ) to . Combining these, we get an upper bound on that asymptotically (as ) matches the lower bound in (5).
Rigorous proof of the sample complexity result in Theorem 4, i.e., proof for (13), is given in the Appendix C.6. Our proof builds upon that in Garivier and Kaufmann (2016) where the authors consider a restricted SPEF, while we allow arm distributions to belong to a more general class . Our proof differs in that we work in space of probability measures instead of Euclidian space. This leads to additional nuances. To work in the space of probability measures, we use Wasserstein metric to define continuity of functions and convergence of sequences in this space of probability measures. Furthermore, we check for stopping condition only once in samples, instead of doing so in every sample, and construct the proof that allows for this flexibility.
5 Optimizing batch sizes and numerical results
We now discuss batch size selection in to minimize the overall experiment cost. Suppose that the average cost of generating a sample is given by . This may be large when generating a sample is costly, for instance, if that corresponds to an output of a massive simulation model, or a result of a clinical trial. It may be small, e.g., in an online recommendation system. The cost of solving the maxmin problem (7) may be measured by the computational effort involved. The total experiment cost of is the sum of the total cost of sampling ( number of samples generated) and total computational effort involved in solving the maxmin optimization problems till termination.
Recall that in order to efficiently solve the maxmin problem iteratively (see Section 3), at each stage when this evaluation is done, letting denote the empirical distribution of arm , we precompute values of for each along a grid and linearly interpolate for values of not in the grid, for each arm . Empirically we see that the the cost of computing increases linearly with the number of samples of the corresponding empirical distribution (also see Cappé et al. (2013) for similar observations). This suggests that the computational cost of increases linearly in the total number of samples generated. To this end, we observe that the overall computational cost of solving the maxmin problem (7) is modelled well as , where denotes the total number of samples generated by till that stage, and and are fitted to the data. In our numerical experiment below, this cost is approximately (in computational time) seconds. Since, runs in many thousands in a typical setting, the linear term cannot be ignored.
To approximate the optimal batch size, we need to approximate the sample complexity. To this end, let where recall that and were defined in the stopping rule for . For small values of , the sample complexity of (see (78) in Appendix C.6),
(20) 
where denotes the batch size. Equation (20), remains valid if we use in place of . However, for reasonable values of , the two may differ significantly, and empirically we find that substantially underestimates , while is much closer. Using as a proxy for , and assuming that the total number of batches till the stopping time is approximated by , the total cost of approximately equals
Observe that for constant, independent of , is since is . For , it is .
Optimizing over to minimize , we get
(21) 
i.e., . Notice that even though minimizes , (20) suggests that with this choice of , the ratio of expected number of samples until termination for to the corresponding maxmin lower bound no longer converges to 1 as , that is, (13) no longer holds. It can however be seen that the correct property still holds for even for . If, however, could be estimated using computational effort that is independent of the size of the empirical distribution, that is, if , then , and is asymptotically optimal, so that (13) holds. One way to achieve this may be to approximate the empirical distribution by a fixed size distribution (eg., by bucketing the generated samples into finitely many bins). This may substantially reduce the computation time. The overall issue of developing efficient implementations for the best arm problem for general distributions is an interesting area for further research.
Numerical experiment: We consider a