Learning the Distribution with Largest Mean:
Two Bandit Frameworks
Abstract
Over the past few years, the multi-armed bandit model has become increasingly popular in the machine learning community, partly because of applications including online content optimization. This paper reviews two different sequential learning tasks that have been considered in the bandit literature; they can be formulated as (sequentially) learning which distribution has the highest mean among a set of distributions, with some constraints on the learning process. For both of them (regret minimization and best arm identification) we present recent, asymptotically optimal algorithms. We compare the behaviors of the sampling rule of each algorithm as well as the complexity terms associated with each problem.
In recent years, the stochastic model known as the multi-armed bandit has attracted great interest in the machine learning community, notably because of its applications to web content optimization. This article presents two sequential learning problems within a bandit model, both of which can be formulated as discovering the distribution with the highest mean among a set of distributions, under certain constraints on the learning process. For these two objectives (regret minimization on the one hand and best arm identification on the other), we present algorithms that are optimal in an asymptotic sense. We compare the sampling strategies used by these two types of algorithms as well as the quantities characterizing the complexity of each problem.
Introduction
Bandit models can be traced back to the 1930s and the work of [Thompson, 1933] in the context of medical trials. It addresses the idealized situation where, for a given symptom, a doctor has several treatments at her disposal, but has no prior knowledge about their efficacies. These efficacies need to be learnt by allocating treatments to patients and observing the result. As the doctor aims at healing as many patients as possible, she would like to select the best treatment as often as possible, even though it is unknown to her at the beginning. After each patient, the doctor takes the outcome of the treatment into account in order to decide which treatment to assign to the next patient: the learning process is sequential.
This archetypal situation is mathematically captured by the multi-armed bandit model. It involves an agent (the doctor) interacting with a set of $K$ probability distributions $\nu^1, \dots, \nu^K$ called arms (the treatments), which she may sequentially sample. The mean of arm $a$ (which is unknown to the agent) is denoted by $\mu_a$. At round $t$, the agent selects an arm $A_t \in \{1, \dots, K\}$ and subsequently observes a sample $X_t \sim \nu^{A_t}$ from the associated distribution. The arm $A_t$ is selected according to a sampling strategy denoted by $\pi = (\pi_t)_{t \geq 1}$, where $\pi_t$ maps the history of past arm choices and observations $A_1, X_1, \dots, A_{t-1}, X_{t-1}$ to an arm. In a simplistic model for the clinical trial example, each arm is a Bernoulli distribution that indicates the success or failure of the treatment. After sampling an arm (giving a treatment) $A_t$ at time $t$, the doctor observes whether the patient was healed ($X_t = 1$) or not ($X_t = 0$). In this example as in many others, the samples gathered can be considered as rewards, and a natural goal for the agent is to adjust her sampling strategy so as to maximize the expected sum of the rewards gathered up to some given horizon $T$. This is equivalent to minimizing the regret
\[ R^{\pi}_{\boldsymbol{\mu}}(T) = \mu^* T - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right], \]
which is defined as the gap between the expected cumulated reward of an oracle strategy always playing the best arm $a^* = \operatorname{argmax}_a \mu_a$, which has mean $\mu^* = \max_a \mu_a$, and the expected cumulated reward of the strategy $\pi$.
A sampling strategy minimizing the regret should not only learn which arm has the highest mean: it should also avoid incurring excessive losses during this learning phase. In other words, it has to achieve a good tradeoff between exploration (experimenting with all the arms in order to estimate their means) and exploitation (focusing on the arm that appears best so far). Despite its simplicity, the multi-armed bandit model already captures the fundamental dilemma inherent to reinforcement learning [Sutton and Barto, 1998], where the goal is to learn how to act optimally in a random environment based on numeric feedback. The fundamental model of reinforcement learning is the Markov Decision Process [Puterman, 1994], which involves the additional notion of system state; a bandit model is simply a Markov Decision Process with a single state.
Sometimes, rewards actually correspond to profits for the agent. In fact, the imaginatively named multi-armed bandits refer to casino slot machines: a player sequentially selects one of them (also called a one-armed bandit), pulls its arm, and possibly collects her winnings. While the model was initially motivated by clinical trials, modern applications involve neither bandits nor casinos, but for example the design of recommender systems [Chu et al., 2011], or more generally content optimization. Indeed, a bandit algorithm may be used by a company for dynamically selecting which version of its website to display to each user, in order to maximize the number of conversions (purchase or subscription for example). In the case of two competing options, this problem is known as A/B testing. It motivates the consideration of a different optimization problem in a bandit model: rather than continuously changing its website, the company may prefer to experiment during a testing phase only, which is aimed at identifying the best version, and then to use that one consistently for a much bigger audience.
In such a testing phase, the objective is different: one aims at learning which arm has the highest mean without constraint on the cumulative reward. In other words, the company agrees to lose some profit during the testing phase, as long as this phase is as short as possible. In this framework, called best arm identification, the sampling rule is designed so as to identify the arm with highest mean as fast and as confidently as possible. Two alternative frameworks are considered in the literature. In the fixed-budget setting [Audibert et al., 2010], the length of the trial phase is given and the goal is to minimize the probability of misidentifying the best arm. In the fixed-confidence setting [Even-Dar et al., 2006], a risk parameter $\delta$ is given and the procedure is allowed to choose when the testing phase stops. It must guarantee that the misidentification probability is smaller than $\delta$ while minimizing the sample complexity, that is, the expected number of samples required before electing the arm. Although the study of best arm identification problems is relatively recent in the bandit literature, similar questions were already addressed in the 1950s under the name ranking and identification problems [Bechhofer, 1954, Bechhofer et al., 1968], and they are also related to the sequential adaptive hypothesis testing framework introduced by [Chernoff, 1959].
In this paper, we review a few algorithms for both regret minimization and best arm identification in the fixed-confidence setting. The algorithms and results are presented for simple classes of parametric bandit models, and we explain along the way how some of them can be extended to more general models. In each case we introduce an asymptotic notion of optimality and present algorithms that are asymptotically optimal. Our optimality notion is instance-dependent, in the sense that we characterize the minimal regret or minimal sample complexity achievable on each specific bandit instance. The paper is structured as follows: we introduce in Section 1 the parametric bandit models considered in the paper, and present some useful probabilistic tools for the analysis of bandit algorithms. We discuss the regret minimization problem in Section 2 and the best arm identification problem in Section 3. We comment in Section 4 on the different behaviors of the algorithms aimed at these distinct objectives, and on the different information-theoretic quantities characterizing their complexities.
1 Parametric Bandit Models and Useful Tools
1.1 Some Assumptions on the Arms Distributions
Unless specified otherwise, we assume in the rest of the paper that all the arms belong to a class of distributions parameterized by their means, $\mathcal{D}_I = \{\nu_\mu : \mu \in I\}$, where $I$ is an interval of $\mathbb{R}$. We assume that for all $\mu \in I$, $\nu_\mu$ has a density denoted by $f_\mu$ with respect to some fixed reference measure, and that $\mathbb{E}_{X \sim \nu_\mu}[X] = \mu$. For all $(\mu, \mu') \in I^2$ we introduce
\[ d(\mu, \mu') := \mathrm{KL}(\nu_\mu, \nu_{\mu'}) = \mathbb{E}_{X \sim \nu_\mu}\left[\log \frac{f_\mu(X)}{f_{\mu'}(X)}\right], \]
the Kullback-Leibler divergence between the distribution of mean $\mu$ and that of mean $\mu'$. We shall in particular consider examples in which $\mathcal{D}_I$ forms a one-parameter exponential family (e.g. Bernoulli distributions, Gaussian distributions with known variance, exponential distributions), for which there is a closed-form formula for the divergence function $d$ (see, e.g. [Cappé et al., 2013]).
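As an illustration, the closed-form divergences for the three exponential families named above can be written down directly. The sketch below is not taken from the paper: the function names and the clipping constant `eps` are assumptions made for this example, and every distribution is parameterized by its mean, as in the text.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Divergence d(p, q) between Bernoulli distributions of means p and q."""
    # Clip away from 0 and 1 to avoid log(0); eps is an implementation choice.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_gaussian(mu, mu_prime, sigma2=1.0):
    """Divergence between Gaussians of means mu, mu_prime and known variance sigma2."""
    return (mu - mu_prime) ** 2 / (2.0 * sigma2)

def kl_exponential(mu, mu_prime):
    """Divergence between exponential distributions of means mu and mu_prime (both > 0)."""
    ratio = mu / mu_prime
    return ratio - 1.0 - math.log(ratio)
```

Each function vanishes when both means coincide and is not symmetric in its arguments in general (the Gaussian case being the exception).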
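Under this assumption, a bandit model is fully described by a vector $\boldsymbol{\mu} = (\mu_1, \dots, \mu_K)$ in $I^K$ such that $\nu^a = \nu_{\mu_a}$ for all $a \in \{1, \dots, K\}$. We denote by $\mathbb{P}_{\boldsymbol{\mu}}$ and $\mathbb{E}_{\boldsymbol{\mu}}$ the probability and expectation under the bandit model $\boldsymbol{\mu}$. Under $\mathbb{P}_{\boldsymbol{\mu}}$, the sequence $(Y_{a,s})_{s \geq 1}$ of successive observations from arm $a$ is i.i.d. with law $\nu_{\mu_a}$, and the families $(Y_{a,s})_{s \geq 1}$, $a = 1, \dots, K$, are independent. Given a strategy $\pi$, we let $N^\pi_a(t) = \sum_{s=1}^{t} \mathbb{1}_{(A_s = a)}$ be the number of draws of arm $a$ up to and including round $t \geq 1$. Hence, upon selection of the arm $A_t$, the observation made at round $t$ is $X_t = Y_{A_t, N^\pi_{A_t}(t)}$. When the strategy $\pi$ is clear from the context, we may remove the superscript $\pi$ and write simply $N_a(t)$. We define $\hat{\mu}_{a,s} = \frac{1}{s} \sum_{i=1}^{s} Y_{a,i}$ as the empirical mean of the first $s$ observations from arm $a$, and $\hat{\mu}_a(t) = \hat{\mu}_{a, N_a(t)}$ as the empirical mean of arm $a$ at round $t$ of the bandit algorithm.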
For the two frameworks that we consider, regret minimization and best arm identification, we adopt the same approach. First, we propose a lower bound on the target quantity (regret or sample complexity). Then, we propose strategies whose regret or sample complexity asymptotically matches the lower bound. Two central tools to derive lower bounds and algorithms are changes of distributions and confidence intervals.
1.2 Changes Of Distribution
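Problem-dependent lower bounds in the bandit literature all rely in the end on change of distribution arguments (see e.g. [Lai and Robbins, 1985, Burnetas and Katehakis, 1996, Mannor and Tsitsiklis, 2004, Audibert et al., 2010]). In order to control the probability of some event under the bandit model $\boldsymbol{\mu}$, the idea is to consider an alternative bandit model $\boldsymbol{\lambda}$ under which some assumptions on the strategy make it easier to control the probability of this event. This alternative model should be close enough to $\boldsymbol{\mu}$, in the sense that the transportation cost should not be too high. This transportation cost is related to the log-likelihood ratio of the observations up to time $t$, which we denote by
\[ L_t(\boldsymbol{\mu}, \boldsymbol{\lambda}) = \sum_{a=1}^{K} \sum_{s=1}^{N_a(t)} \log \frac{f_{\mu_a}(Y_{a,s})}{f_{\lambda_a}(Y_{a,s})}. \]
Letting $\mathcal{F}_t = \sigma(X_1, \dots, X_t)$ be the $\sigma$-field generated by the observations up to time $t$, it is indeed well known that for all $E \in \mathcal{F}_t$, $\mathbb{P}_{\boldsymbol{\mu}}(E) = \mathbb{E}_{\boldsymbol{\lambda}}\left[\mathbb{1}_E \exp\left(L_t(\boldsymbol{\mu}, \boldsymbol{\lambda})\right)\right]$.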
The simplest way of writing changes of distribution (see [Kaufmann et al., 2014, Combes and Proutière, 2014] and [Garivier et al., 2016b]) directly relates the expected log-likelihood ratio of the observations under two bandit models to the probability of any event under the two models. If $\tau$ is a stopping time, one can show that for any two bandit models $\boldsymbol{\mu}, \boldsymbol{\lambda}$ and for any event $E$ in $\mathcal{F}_\tau$,
\[ \mathbb{E}_{\boldsymbol{\mu}}\left[L_\tau(\boldsymbol{\mu}, \boldsymbol{\lambda})\right] \geq \mathrm{kl}\left(\mathbb{P}_{\boldsymbol{\mu}}(E), \mathbb{P}_{\boldsymbol{\lambda}}(E)\right), \]
where $\mathrm{kl}(x, y) = x \log(x/y) + (1-x) \log((1-x)/(1-y))$ is the binary relative entropy, i.e. the Kullback-Leibler divergence between two Bernoulli distributions of means $x$ and $y$. Using Wald’s lemma, one can show that in the particular case of bandit models, the expected log-likelihood ratio can be expressed in terms of the expected number of draws of each arm, which yields the following result.
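Lemma 1.2. Let $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ be two bandit models and let $\sigma$ be a stopping time. For any event $E \in \mathcal{F}_\sigma$,
\[ \sum_{a=1}^{K} \mathbb{E}_{\boldsymbol{\mu}}[N_a(\sigma)] \, d(\mu_a, \lambda_a) \geq \mathrm{kl}\left(\mathbb{P}_{\boldsymbol{\mu}}(E), \mathbb{P}_{\boldsymbol{\lambda}}(E)\right). \]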
Two different proofs of this result can be found in [Kaufmann et al., 2016] and [Garivier et al., 2016b], in which a slightly more general result is derived based on the entropy contraction principle. As we will see in the next sections, this lemma is particularly powerful to prove lower bounds on the regret or the sample complexity, as both quantities are closely related to the expected number of draws of each arm.
1.3 Confidence Intervals
In both the regret minimization and best arm identification frameworks, the sampling rule has to decide which arm to sample from at the current round, based on the observations gathered at previous rounds. This decision may be based on the set of statistically plausible values for the mean of each arm $a$, which is materialized by a confidence interval on $\mu_a$. Note that in this sequential learning framework, this interval has to be built based on a random number of observations.
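The line of research leading to the UCB1 algorithm [Auer et al., 2002] worked under the assumption that each arm is a bounded distribution supported in $[0,1]$. Bounded distributions are particular examples of sub-Gaussian distributions. A random variable $X$ is said to be $\sigma^2$-sub-Gaussian if $\mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right] \leq \exp(\lambda^2 \sigma^2 / 2)$ holds for all $\lambda \in \mathbb{R}$. Hoeffding’s lemma states that distributions with a support bounded in $[a, b]$ are $(b-a)^2/4$-sub-Gaussian. If arm $a$ is $\sigma^2$-sub-Gaussian, Hoeffding’s inequality together with a union bound to handle the random number of observations shows that
\[ \mathbb{P}\left(\hat{\mu}_a(t) + \sqrt{\frac{2 \sigma^2 \gamma}{N_a(t)}} < \mu_a\right) \leq t e^{-\gamma}. \tag{1} \]
Hence one can build an upper-confidence bound on $\mu_a$ with probability of coverage $1 - \delta$ by setting $\gamma = \log(t/\delta)$.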
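There are two levels of improvement here. First, under more specific assumptions on the arms (for example if the arms belong to some exponential family of distributions), Chernoff’s inequality has an explicit form that can be used directly in place of Hoeffding’s inequality. It states that $\mathbb{P}(\hat{\mu}_{a,s} > x) \leq \exp(-s \, d(x, \mu_a))$ for all $x > \mu_a$ (and symmetrically $\mathbb{P}(\hat{\mu}_{a,s} < x) \leq \exp(-s \, d(x, \mu_a))$ for all $x < \mu_a$), where $d$ is the KL-divergence function defined in Section 1.1. Then, to handle the random number of observations, a peeling argument can be used rather than a union bound. This argument, initially developed in the context of Markov order estimation (see [Garivier and Leonardi, 2011]), was used in [Garivier and Moulines, 2011, Bubeck, 2010] under a sub-Gaussian assumption. Combining these two ideas, [Garivier and Cappé, 2011] show that, letting
\[ u_a(t) = \sup\left\{ q \in I : N_a(t) \, d\left(\hat{\mu}_a(t), q\right) \leq \gamma \right\}, \]
one has
\[ \mathbb{P}_{\boldsymbol{\mu}}\left(u_a(t) < \mu_a\right) \leq e \lceil \gamma \log(t) \rceil e^{-\gamma}. \tag{2} \]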
The improvement can be measured by specializing this result to Bernoulli distributions, for which the two bounds (1) and (2) hold. By Pinsker’s inequality $\mathrm{kl}(x, y) \geq 2(x-y)^2$, it holds that $u_a(t) \leq \hat{\mu}_a(t) + \sqrt{\gamma / (2 N_a(t))}$, which is the upper bound appearing in (1) with $\sigma^2 = 1/4$. Hence for $\gamma$ and $t$ such that $e \lceil \gamma \log(t) \rceil \leq t$, $u_a(t)$ is a smaller upper-confidence bound on $\mu_a$ with the same coverage guarantees. As we will see in the next sections, such refined confidence intervals have yielded huge improvements in the bandit literature, and lead to simple UCB-type algorithms that are asymptotically optimal for regret minimization.
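The comparison between the two upper-confidence bounds can be checked numerically for Bernoulli arms. The following is a minimal sketch, assuming the KL-based bound is the largest $q$ in $[\hat{\mu}_a(t), 1]$ with $N_a(t)\,\mathrm{kl}(\hat{\mu}_a(t), q) \leq \gamma$ and computing it by bisection; the function names, the tolerance `tol` and the clipping constant `eps` are hypothetical choices made for this example.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Binary relative entropy kl(p, q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def hoeffding_ucb(mean, n, gamma, sigma2=0.25):
    """Sub-Gaussian upper bound: mean + sqrt(2 * sigma2 * gamma / n)."""
    return mean + math.sqrt(2.0 * sigma2 * gamma / n)

def kl_ucb(mean, n, gamma, tol=1e-9):
    """Largest q in [mean, 1] such that n * kl(mean, q) <= gamma (bisection)."""
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # q -> kl(mean, q) is increasing on [mean, 1], so bisection applies.
        if n * kl_bernoulli(mean, mid) <= gamma:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical mean 0.3 after 50 draws, with gamma = log(100).
gamma = math.log(100)
u_kl = kl_ucb(0.3, 50, gamma)
u_h = hoeffding_ucb(0.3, 50, gamma)
print(u_kl, u_h)
```

By Pinsker’s inequality the first printed value can never exceed the second, whatever the empirical mean, number of draws and level.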
2 Optimal Strategies For Regret Minimization
After the initial work of [Thompson, 1933], bandit models were studied again in the 1950s, with for example the paper of [Robbins, 1952], in which the notion of regret is introduced. Interestingly, a large part of the early work on bandit models takes a slightly different Bayesian perspective: the goal is also to maximize the expected sum of rewards, but the expectation is also computed over a prior distribution for the arms (see [Berry and Fristedt, 1985] for a survey). It turns out that this Bayesian multi-armed bandit problem can be solved exactly using dynamic programming [Bellman, 1956], but the exact solution is in most cases intractable. Practical solutions may be found when one aims at maximizing the sum of discounted rewards over an infinite horizon: the seminal paper of [Gittins, 1979] shows that the Bayesian optimal policy has a simple form where, at each round, an index is computed for each arm and the arm with highest index is selected.
Gittins’s work motivated the focus on index policies, where an index is computed for each arm as a selection procedure. Such index policies have also emerged in the “frequentist” literature on multi-armed bandits. Some asymptotic expansions of the index put forward by Gittins were proposed. They have led to new policies that could be studied directly, forgetting about their Bayesian roots. This line of research includes in particular the seminal work of [Lai and Robbins, 1985].
2.1 A Lower Bound on the Regret
In 1985, Lai and Robbins characterized the optimal regret rate in one-parameter bandit models, by providing an asymptotic lower bound on the regret and a first index policy with a matching regret [Lai and Robbins, 1985]. In order to understand this lower bound, one can first observe that the regret can be expressed in terms of the number of draws of each suboptimal arm. Indeed, a simple conditioning shows that for any strategy $\pi$,
\[ R^{\pi}_{\boldsymbol{\mu}}(T) = \sum_{a : \mu_a < \mu^*} (\mu^* - \mu_a) \, \mathbb{E}_{\boldsymbol{\mu}}\left[N^\pi_a(T)\right], \tag{3} \]
where we recall that $N^\pi_a(T)$ is the number of times arm $a$ has been selected up to time $T$. A strategy is said to be uniformly efficient if its regret is small on every bandit model in our class, that is if for all $\boldsymbol{\mu} \in I^K$ and for every $\alpha \in (0, 1]$, $R^{\pi}_{\boldsymbol{\mu}}(T) = o(T^\alpha)$.
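Theorem 2.1 [Lai and Robbins, 1985]. Any uniformly efficient strategy $\pi$ satisfies, for all $\boldsymbol{\mu} \in I^K$ and every arm $a$ such that $\mu_a < \mu^*$,
\[ \liminf_{T \to \infty} \frac{\mathbb{E}_{\boldsymbol{\mu}}[N_a(T)]}{\log(T)} \geq \frac{1}{d(\mu_a, \mu^*)}. \]
By Equation (3), this result directly provides a logarithmic lower bound on the regret:
\[ \liminf_{T \to \infty} \frac{R^{\pi}_{\boldsymbol{\mu}}(T)}{\log(T)} \geq \sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)} =: C(\boldsymbol{\mu}). \tag{4} \]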
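To make the two displays above concrete, the sketch below evaluates, for Bernoulli arms, the regret decomposition (a sum over arms of the gap times the expected number of draws) and the constant multiplying $\log(T)$ in the lower bound (a sum over suboptimal arms of the gap divided by the divergence to the best mean). The function names are hypothetical, and the choice of Bernoulli arms is an assumption made for the example.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Binary relative entropy kl(p, q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def regret_from_draws(means, expected_draws):
    """Regret decomposition: sum over arms of (gap) * (expected number of draws)."""
    best = max(means)
    return sum((best - m) * n for m, n in zip(means, expected_draws))

def lai_robbins_constant(means, d=kl_bernoulli):
    """Sum over suboptimal arms of (mu* - mu_a) / d(mu_a, mu*)."""
    best = max(means)
    return sum((best - m) / d(m, best) for m in means if m < best)

means = [0.5, 0.4]
print(regret_from_draws(means, [90, 10]))   # gap 0.1 times 10 draws
print(lai_robbins_constant(means))          # constant in front of log(T)
```

When arms get closer to the best one, the divergence in the denominator shrinks quadratically in the gap, so the constant grows roughly like the inverse gap.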
This lower bound motivates the definition of an asymptotically optimal algorithm (on a set of parametric bandit models $\mathcal{D}_I$) as an algorithm for which, for all $\boldsymbol{\mu} \in I^K$, the regret is asymptotically upper bounded by $C(\boldsymbol{\mu}) \log(T)$. This defines an instance-dependent notion of optimality, as we want an algorithm that attains the best regret rate for every bandit instance $\boldsymbol{\mu}$. However, for instances in which some arms are very close to the optimal arm, the constant $C(\boldsymbol{\mu})$ may be really large and the bound is not very informative in finite time. For such instances, one may prefer having regret upper bounds that scale in $\sqrt{KT}$ and are independent of $\boldsymbol{\mu}$, matching the minimax regret lower bound obtained by [Cesa-Bianchi and Lugosi, 2006, Bubeck and Cesa-Bianchi, 2012] for Bernoulli bandits:
\[ \inf_{\pi} \sup_{\boldsymbol{\mu} \in [0,1]^K} R^{\pi}_{\boldsymbol{\mu}}(T) \geq \frac{1}{20} \sqrt{KT}. \]
Logarithmic instance-dependent regret lower bounds have also been obtained under more general assumptions on the arm distributions [Burnetas and Katehakis, 1996], and even in some examples of structured bandit models, in which they take a less explicit form [Graves and Lai, 1997, Magureanu et al., 2014]. All these lower bounds rely on a change of distribution argument, and we now explain how to easily obtain the lower bound of Theorem 2.1 by using the tool described in Section 1.2, Lemma 1.2.
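Fixing a suboptimal arm $a$ in the bandit model $\boldsymbol{\mu}$, we define an alternative bandit model $\boldsymbol{\lambda}$ such that $\lambda_i = \mu_i$ for all $i \neq a$ and $\lambda_a = \mu^* + \epsilon$ for some small $\epsilon > 0$. In $\boldsymbol{\lambda}$, arm $a$ is now the optimal arm, hence a uniformly efficient algorithm will draw this arm very often. As arm $a$ is the only arm that has been modified in $\boldsymbol{\lambda}$, the statement in Lemma 1.2 takes the simple form:
\[ \mathbb{E}_{\boldsymbol{\mu}}[N_a(T)] \, d(\mu_a, \mu^* + \epsilon) \geq \mathrm{kl}\left(\mathbb{P}_{\boldsymbol{\mu}}(A_T), \mathbb{P}_{\boldsymbol{\lambda}}(A_T)\right) \]
for any event $A_T \in \mathcal{F}_T$. Now the event $A_T := (N_a(T) < T/2)$ is very likely under $\boldsymbol{\mu}$, in which $a$ is suboptimal, and very unlikely under $\boldsymbol{\lambda}$, in which $a$ is optimal. More precisely, the uniform efficiency assumption together with Markov’s inequality shows that $\mathbb{P}_{\boldsymbol{\mu}}(A_T) \to 1$ and $\mathbb{P}_{\boldsymbol{\lambda}}(A_T) = o(T^{\alpha - 1})$ for all $\alpha \in (0, 1]$ when $T$ goes to infinity. Since $\mathrm{kl}(p, q) \geq p \log(1/q) - \log(2)$, this leads to $\liminf_{T \to \infty} \mathrm{kl}\left(\mathbb{P}_{\boldsymbol{\mu}}(A_T), \mathbb{P}_{\boldsymbol{\lambda}}(A_T)\right) / \log(T) \geq 1$; letting $\epsilon$ go to zero then proves Theorem 2.1.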
2.2 Asymptotically Optimal Index Policies and Upper Confidence Bounds
Lai and Robbins also proposed the first algorithm whose regret matches the lower bound (4). This first asymptotically optimal algorithm is actually an index policy, i.e. it is of the form
\[ A_{t+1} = \operatorname{argmax}_{a \in \{1, \dots, K\}} U_a(t), \]
but the proposed indices $U_a(t)$ are quite complex to compute. [Agrawal, 1995, Katehakis and Robbins, 1995] later proposed slightly simpler indices and showed that they can be interpreted as Upper Confidence Bounds (UCB) on the unknown means of the arms. UCB-type algorithms were popularized by [Auer et al., 2002], who introduced the UCB1 algorithm for (non-parametric) bandit models with bounded rewards, and gave the first finite-time upper bound on its regret. Simple indices like those of UCB1 can be used more generally for $\sigma^2$-sub-Gaussian rewards, and take the form
\[ U_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2 \sigma^2 f(t)}{N_a(t)}}, \]
for some function $f$ which controls the confidence level. While the original choice of [Auer et al., 2002] is too conservative, one may safely choose $f(t) = \log(t)$ in practice; obtaining finite-time regret bounds is somewhat easier with a slightly larger choice, as in [Garivier and Cappé, 2011]. With such a choice, for Bernoulli distributions (which are $1/4$-sub-Gaussian), the regret of this index policy can be shown to be
\[ \sum_{a : \mu_a < \mu^*} \frac{1}{2(\mu^* - \mu_a)} \log(T) + o(\log(T)), \]
which is only order-optimal with respect to the lower bound (4), as by Pinsker’s inequality $d(\mu_a, \mu^*) > 2(\mu^* - \mu_a)^2$. Since the work of [Auer et al., 2002], several improvements of UCB1 have been proposed. They aimed at providing finite-time regret guarantees that would match the asymptotic lower bound (4) (see the review [Bubeck and Cesa-Bianchi, 2012]). Among them, the kl-UCB algorithm studied by [Cappé et al., 2013] is shown to be asymptotically optimal when the arms belong to a one-parameter exponential family. This algorithm is an index policy associated with
\[ U_a(t) = \max\left\{ q \in I : N_a(t) \, d\left(\hat{\mu}_a(t), q\right) \leq f(t) \right\}, \]
for the same choice of an exploration function $f$ as mentioned above. The discussion of Section 1.3 explains why this index is actually an upper confidence bound on $\mu_a$: choosing $f(t) = \log(t) + 3 \log\log(t)$, one has by (2) that $\mathbb{P}_{\boldsymbol{\mu}}(U_a(t) < \mu_a)$ is of order $1/(t \log(t))$. For this particular choice, [Cappé et al., 2013] give a finite-time analysis of kl-UCB, proving its asymptotic optimality. To conclude on UCB algorithms, let us mention that several improvements have been proposed. A simple but significant one is obtained by replacing $f(t)$ by $\log(t / N_a(t))$ in the definition of $U_a(t)$, leading to a variant sometimes termed kl-UCB$^+$ which has a slightly better empirical performance, but also minimax guarantees that plain UCB algorithms do not enjoy (for a discussion and related ideas, see the OCUCB algorithm of [Lattimore, 2016], [Ménard and Garivier, 2017] and the references therein).
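The index policy described above can be simulated in a few lines for Bernoulli arms. The sketch below is illustrative only: it assumes the exploration function is the logarithm of the number of completed rounds, draws each arm once for initialization, and computes each index by bisection; the names `run_kl_ucb` and `kl_ucb_index`, the tolerance and the seeding are hypothetical choices and not part of the original algorithm description.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-15):
    """Binary relative entropy kl(p, q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mean, n, level, tol=1e-6):
    """Largest q in [mean, 1] with n * kl(mean, q) <= level (bisection)."""
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n * kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def run_kl_ucb(means, horizon, seed=0):
    """Simulate the index policy on Bernoulli arms; return draw counts and pseudo-regret."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: draw each arm once
        else:
            level = math.log(t - 1)  # assumed exploration function, t - 1 rounds completed
            indices = [kl_ucb_index(sums[a] / counts[a], counts[a], level)
                       for a in range(n_arms)]
            arm = max(range(n_arms), key=lambda a: indices[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    best = max(means)
    pseudo_regret = sum((best - m) * n for m, n in zip(means, counts))
    return counts, pseudo_regret

counts, regret = run_kl_ucb([0.2, 0.5, 0.8], horizon=2000, seed=1)
print(counts, regret)
```

On such a well-separated instance, the suboptimal arms should receive only a small, logarithmically growing share of the draws, although the exact counts depend on the random seed.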
2.3 Beyond the Optimism Principle
For simple parametric bandit models, in particular when rewards belong to a one-parameter exponential family, we showed that the regret minimization problem is solved, at least in an asymptotic sense: the kl-UCB algorithm, for example, attains the best possible regret rate on every problem instance. All the UCB-type algorithms described in the previous section are based on the so-called principle of “optimism in the face of uncertainty”. Indeed, at each round of a UCB algorithm the confidence intervals materialize the set of bandit models that are compatible with our observations (see Figure 1, left), and choosing the arm with largest UCB amounts to acting optimally in an “optimistic” model in which the mean of each arm would be equal to its best possible value. This optimism principle has also been successfully applied in some structured bandit models [Abbasi-Yadkori et al., 2011], as well as in reinforcement learning [Jaksch et al., 2010] and other related problems [Bubeck et al., 2013].
While Lai and Robbins’ lower bound provides a good guideline to design algorithms, it has sometimes been misunderstood as a justification of a wrong folk theorem that is well known among practitioners mostly interested in using bandit algorithms: “no strategy can have a regret smaller than $C(\boldsymbol{\mu}) \log(T)$, which is reached by good strategies”. But experiments often contradict this claim: it is easy to exhibit settings and algorithms where the regret is much smaller than $C(\boldsymbol{\mu}) \log(T)$ and does not look like a logarithmic curve. The reason is twofold: first, Lai and Robbins’ lower bound is asymptotic; a close look at its proof shows that it is relevant only when the horizon $T$ is so large that any reasonable policy has identified the best arm with high probability; second, it only states that the regret divided by $\log(T)$ cannot always be smaller than $C(\boldsymbol{\mu})$. In [Garivier et al., 2016a], a simpler but similar bandit model of complexity $C'(\boldsymbol{\mu})$ is given where some strategy is proved to have a regret smaller than $C'(\boldsymbol{\mu}) \log(T) - c \log\log(T)$ for some positive constant $c$.
Some recent works try to complement this result and to give a better description of what can be observed in practice. Notably, [Garivier et al., 2016b] focuses mainly on the initial regime: the authors show in particular that all strategies suffer a linear regret before $T$ reaches some problem-dependent value. When the problem is very difficult (for example when the number of arms is very large) this initial phase may be the only observable one. They give non-asymptotic inequalities, and above all show a way to prove lower bounds which may lead to further new results (see e.g. [Garivier et al., 2016a]). It would be of great interest (but technically difficult) to exhibit an intermediate regime where, after this first phase, statistical estimation becomes possible but is still not trivial. This would in particular make it possible to discriminate, from a theoretical perspective, between all the bandit algorithms that are now known to be asymptotically optimal, but for which significant differences may be observed in practice.
Indeed, one drawback of the kl-UCB algorithm is the need to construct tight confidence intervals (as explained in Section 1.3), which may not generalize easily beyond simple parametric models. More flexible Bayesian algorithms have recently also been shown to be asymptotically optimal, and to have good empirical performance. Given a prior distribution on the arms, Bayesian algorithms are simple procedures exploiting the posterior distributions of each arm. In the Bernoulli example, assuming a uniform prior on the mean of each arm, the posterior distribution of $\mu_a$ at round $t$, defined as the conditional distribution of $\mu_a$ given past observations, is easily seen to be a Beta distribution whose two parameters are one plus the number of ones and one plus the number of zeros observed so far from the arm. A Bayesian algorithm uses the different posterior distributions, which are illustrated in Figure 1 (right), to choose the next arm to sample from.
The Bayes-UCB algorithm of [Kaufmann et al., 2012a] exploits these posterior distributions in an optimistic way: it selects at round $t$ the arm whose posterior on the mean has the largest quantile of order $1 - 1/t$. Another popular algorithm, Thompson Sampling, departs from the optimism principle by selecting arms at random according to their probability of being optimal. This principle was proposed by [Thompson, 1933] as the very first bandit algorithm, and can easily be implemented by drawing one sample from the posterior distribution of the mean of each arm, and selecting the arm with highest sample. This algorithm, also called probability matching, was rediscovered in the 2000s for its good empirical performance in complex bandit models [Scott, 2010, Chapelle and Li, 2011], but its first regret analysis dates back to [Agrawal and Goyal, 2012]. Both Thompson Sampling and Bayes-UCB have been shown recently to be asymptotically optimal in one-parameter models, for some choices of the prior distribution [Kaufmann et al., 2012b, Agrawal and Goyal, 2013, Kaufmann, 2016]. These algorithms are also quite generic, as they can be implemented in any bandit model in which one can define a prior distribution on the arms, and draw samples from the associated posterior. For example, they can be used in (generalized) linear bandit models, which can model recommendation tasks where the features of the items are taken into account (see, e.g. [Agrawal and Goyal, 2013] and Chapter 4 of [Kaufmann, 2014]).
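For Bernoulli arms with a uniform prior, the sampling step of Thompson Sampling described above reduces to drawing from Beta posteriors. The following is a minimal sketch under these assumptions; the function name `run_thompson_sampling`, the seeding and the simulated instance are hypothetical and only serve the illustration.

```python
import random

def run_thompson_sampling(means, horizon, seed=0):
    """Thompson Sampling on Bernoulli arms with independent uniform priors."""
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(horizon):
        # Posterior of arm a is Beta(1 + successes, 1 + failures): draw one sample per arm.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulate the Bernoulli reward of the selected arm and update its posterior.
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

counts = run_thompson_sampling([0.2, 0.5, 0.8], horizon=2000, seed=1)
print(counts)
```

No confidence level has to be tuned here: the randomness of the posterior samples is what drives exploration.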
3 Optimal Strategies for Best Arm Identification
Finding the arm with largest mean (without trying to maximize the cumulated rewards) is quite a different task and relates more to classical statistics. It can indeed be cast into the framework of sequential adaptive hypothesis testing introduced by [Chernoff, 1959]. In this framework, one has to decide which of the (composite) hypotheses
\[ H_1 = \left(\mu_1 > \max_{a \neq 1} \mu_a\right), \ \dots, \ H_K = \left(\mu_K > \max_{a \neq K} \mu_a\right) \]
is true. In order to gain information, one can select at each round one out of $K$ possible experiments, each of them consisting in sampling from one of the marginal distributions (arms). Moreover, one has to choose when to stop the trial and decide for one of the hypotheses. Rephrased in a “bandit” terminology, a strategy consists of
• a sampling rule $(A_t)_{t \geq 1}$, that specifies which arm is selected at time $t$ ($A_t$ is $\mathcal{F}_{t-1}$-measurable),
• a stopping rule $\tau$, that indicates when the trial ends ($\tau$ is a stopping time with respect to $(\mathcal{F}_t)_{t \geq 1}$),
• a recommendation rule $\hat{a}_\tau$ that provides, upon stopping, a guess for the best arm ($\hat{a}_\tau$ is $\mathcal{F}_\tau$-measurable).
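However, the objective of the fixed-confidence best arm identification problem differs from that of [Chernoff, 1959], where one aims at minimizing a risk measure that combines, for each hypothesis, the cost of wrongly rejecting it with a cost for sampling. Modern bandit literature rather focuses on so-called $(\epsilon, \delta)$-PAC strategies (for Probably Approximately Correct) which output, with high probability, an arm whose mean is within $\epsilon$ of the mean of the best arm:
\[ \forall \boldsymbol{\mu} \in I^K, \quad \mathbb{P}_{\boldsymbol{\mu}}\left(\mu_{\hat{a}_\tau} > \mu^* - \epsilon\right) \geq 1 - \delta. \]
The goal is to build a PAC strategy with a sample complexity $\mathbb{E}_{\boldsymbol{\mu}}[\tau]$ that is as small as possible. For simplicity, we focus here on the case $\epsilon = 0$: a strategy is called $\delta$-PAC (it would be more correct to call it a $\delta$-PC, Probably Correct, strategy) if
\[ \forall \boldsymbol{\mu} \in \mathcal{S}, \quad \mathbb{P}_{\boldsymbol{\mu}}\left(\hat{a}_\tau = a^*(\boldsymbol{\mu})\right) \geq 1 - \delta, \]
where $\mathcal{S} = \{\boldsymbol{\mu} \in I^K : \exists a, \ \mu_a > \max_{i \neq a} \mu_i\}$ is the set of bandit models that have a unique optimal arm, denoted by $a^*(\boldsymbol{\mu})$.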
We show in the next section that, as in the regret minimization framework, there exists an instance-dependent lower bound on the sample complexity of any $\delta$-PAC algorithm. We further present an algorithm whose sample complexity matches the lower bound, at least in the asymptotic regime where $\delta$ goes to $0$. It is remarkable that this optimal algorithm, described in Section 3.2, is actually a by-product of the lower bound analysis described in Section 3.1, which sheds light on how a good strategy should distribute the draws between the arms.
3.1 The Sample Complexity of PAC Best Arm Identification
The first lower bound on the sample complexity of an $(\epsilon, \delta)$-PAC algorithm was given by [Mannor and Tsitsiklis, 2004]. Particularized to the case $\epsilon = 0$, the lower bound says that for Bernoulli bandit models with means in $[0, \alpha]$, there exist a constant $C_\alpha$ and a subset $\mathcal{K}_\alpha$ of the suboptimal arms such that for any $\delta$-PAC algorithm
\[ \mathbb{E}_{\boldsymbol{\mu}}[\tau] \geq C_\alpha \left[\sum_{a \in \mathcal{K}_\alpha} \frac{1}{(\mu^* - \mu_a)^2}\right] \log\left(\frac{1}{8\delta}\right). \]
Following this result, the literature has provided several $\delta$-PAC strategies together with upper bounds on their sample complexity, mostly under the assumption that the rewards are bounded in $[0,1]$. Existing strategies fall into two categories: those based on successive eliminations [Even-Dar et al., 2006, Karnin et al., 2013], and those based on confidence intervals [Kalyanakrishnan et al., 2012, Gabillon et al., 2012, Jamieson et al., 2014]. For all these algorithms, under a bandit instance such that $\mu_1 > \mu_2 \geq \dots \geq \mu_K$, the number of samples used can be shown to be of order
\[ C \left[\frac{1}{(\mu_1 - \mu_2)^2} + \sum_{a=2}^{K} \frac{1}{(\mu_1 - \mu_a)^2}\right] \log\left(\frac{1}{\delta}\right) + o_{\delta \to 0}\left(\log\frac{1}{\delta}\right), \]
where $C$ is a (large) numerical constant. While explicit finite-time bounds on $\tau$ can be extracted from most of the papers listed above, we mostly care here about the first-order term in $\delta$, when $\delta$ goes to zero. Both the upper and lower bounds take the form of a sum over the arms of an individual complexity term (involving the inverse squared gap with the best or second best arm), but there is a gap as those sums do not involve the same number of terms; in addition, loose multiplicative constants make it hard to identify the exact minimal sample complexity of the problem.
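As for the regret minimization problem, the true sample complexity can be expected to involve information-theoretic quantities (like the Kullback-Leibler divergence between arm distributions), for which the quantities above appear to be only surrogates; for example, for Bernoulli distributions, Pinsker’s inequality gives $2(\mu_1 - \mu_a)^2 \leq d(\mu_a, \mu_1)$. For exponential families, it has been shown that incorporating the KL-based confidence bounds described in Section 1.3 into existing algorithms lowers the sample complexity [Kaufmann and Kalyanakrishnan, 2013], but the true sample complexity was only recently obtained by [Garivier and Kaufmann, 2016]. The result, and its proof, are remarkably simple:
Theorem 3.1. Let $\boldsymbol{\mu} \in \mathcal{S}$, define $\mathrm{Alt}(\boldsymbol{\mu}) := \{\boldsymbol{\lambda} \in \mathcal{S} : a^*(\boldsymbol{\lambda}) \neq a^*(\boldsymbol{\mu})\}$ and let $\Sigma_K = \{w \in \mathbb{R}_+^K : w_1 + \dots + w_K = 1\}$ be the set of probability vectors. Any $\delta$-PAC algorithm satisfies
\[ \mathbb{E}_{\boldsymbol{\mu}}[\tau] \geq T^*(\boldsymbol{\mu}) \, \mathrm{kl}(\delta, 1 - \delta), \]
where
\[ T^*(\boldsymbol{\mu})^{-1} := \sup_{w \in \Sigma_K} \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}(\boldsymbol{\mu})} \sum_{a=1}^{K} w_a \, d(\mu_a, \lambda_a). \]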
This lower bound again relies on a change of distribution, but unlike Lai and Robbins’ result (and previous results for best arm identification), it is not sufficient to individually lower bound the expected number of draws of each arm using a single alternative model. One needs to consider the set $\mathrm{Alt}(\boldsymbol{\mu})$ of all possible alternative models $\boldsymbol{\lambda}$ in which the optimal arm is different from the optimal arm of $\boldsymbol{\mu}$.
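Given a $\delta$-PAC algorithm, let $E = (\hat{a}_\tau \neq a^*(\boldsymbol{\mu}))$. For any $\boldsymbol{\lambda} \in \mathrm{Alt}(\boldsymbol{\mu})$, the $\delta$-PAC property implies that $\mathbb{P}_{\boldsymbol{\mu}}(E) \leq \delta$ while $\mathbb{P}_{\boldsymbol{\lambda}}(E) \geq 1 - \delta$. Hence, by Lemma 1.2 and the monotonicity properties of the binary relative entropy,
\[ \sum_{a=1}^{K} \mathbb{E}_{\boldsymbol{\mu}}[N_a(\tau)] \, d(\mu_a, \lambda_a) \geq \mathrm{kl}(\delta, 1 - \delta). \]
Combining all the inequalities thus obtained for the different possible values of $\boldsymbol{\lambda} \in \mathrm{Alt}(\boldsymbol{\mu})$, we conclude that:
\begin{align*}
\mathrm{kl}(\delta, 1 - \delta) &\leq \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}(\boldsymbol{\mu})} \sum_{a=1}^{K} \mathbb{E}_{\boldsymbol{\mu}}[N_a(\tau)] \, d(\mu_a, \lambda_a) \\
&\leq \mathbb{E}_{\boldsymbol{\mu}}[\tau] \left( \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}(\boldsymbol{\mu})} \sum_{a=1}^{K} \frac{\mathbb{E}_{\boldsymbol{\mu}}[N_a(\tau)]}{\mathbb{E}_{\boldsymbol{\mu}}[\tau]} \, d(\mu_a, \lambda_a) \right) \\
&\leq \mathbb{E}_{\boldsymbol{\mu}}[\tau] \left( \sup_{w \in \Sigma_K} \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}(\boldsymbol{\mu})} \sum_{a=1}^{K} w_a \, d(\mu_a, \lambda_a) \right).
\end{align*}
In the last step, we use the fact that the vector $\left(\mathbb{E}_{\boldsymbol{\mu}}[N_a(\tau)] / \mathbb{E}_{\boldsymbol{\mu}}[\tau]\right)_{a = 1, \dots, K}$ sums to one: upper bounding by the worst probability vector $w$ yields a bound that is independent of the algorithm.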
We thus obtain the (not fully explicit, but simple) lower bound of Theorem 3.1 that holds under the parametric assumption of Section 1.1. Its form involving an optimization problem is reminiscent of the early work of [Agrawal et al., 1989, Graves and Lai, 1997] that provide a lower bound on the regret in general, possibly structured bandit models. For best arm identification, [Vaidhyan and Sundaresan, 2015] consider the particular case of Poisson distribution in which there is only one arm that is different from the others, where a very nice formula can be derived for the sample complexity. For general exponential family bandit models, we now provide a slightly more explicit expression of , that permits to efficiently compute it.
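To get a feel for the $\delta$-dependent factor in this lower bound, the following minimal Python sketch evaluates the binary relative entropy $\mathrm{kl}(\delta, 1-\delta)$ and compares it with $\log(1/\delta)$. The function names are illustrative and are not taken from the paper; the sketch only assumes the standard definition of the Bernoulli Kullback-Leibler divergence.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Binary relative entropy kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    eps = 1e-15  # clipping avoids log(0) at the boundary of [0, 1]
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lower_bound_factor(delta: float) -> float:
    """Factor kl(delta, 1 - delta) multiplying the complexity term in the bound."""
    return kl_bernoulli(delta, 1.0 - delta)


if __name__ == "__main__":
    for delta in (0.1, 0.01, 1e-3, 1e-6):
        factor = lower_bound_factor(delta)
        print(f"delta={delta:g}  kl={factor:.4f}  log(1/delta)={math.log(1 / delta):.4f}")
```

The two quantities become equivalent as $\delta$ goes to zero, which is why the lower bound is usually read as a bound on the first-order term in $\log(1/\delta)$.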
3.2 An Asymptotically Optimal Algorithm
3.2.1 Computing the complexity and the optimal weights
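The proof of Theorem 3.1 reveals that the quantity
\[ w^*(\mu) \in \underset{w \in \Sigma_K}{\mathrm{argmax}}\ \inf_{\lambda\in\mathrm{Alt}(\mu)} \sum_{a=1}^K w_a\, d(\mu_a,\lambda_a) \]
can be interpreted as a vector of optimal proportions, in the sense that any strategy matching the lower bound should satisfy $\mathbb{E}_\mu[N_a(\tau)]/\mathbb{E}_\mu[\tau] \simeq w^*_a(\mu)$. Some algebra shows that the above optimization problem has a unique solution, and provides an efficient way of computing $w^*(\mu)$ for any $\mu$, which boils down to numerically solving a series of scalar equations. In this section, we shall assume that $\mu$ is such that $\mu_1 > \mu_2 \geq \dots \geq \mu_K$.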
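First, when the distributions belong to a one-dimensional exponential family (which we assume in the rest of this section), one can solve the inner optimization over $\lambda$ in closed form, using Lagrange duality. This yields:
\[ T^*(\mu)^{-1} = \sup_{w\in\Sigma_K} \min_{a\neq 1} \left[ w_1\, d\!\left(\mu_1, \frac{w_1\mu_1+w_a\mu_a}{w_1+w_a}\right) + w_a\, d\!\left(\mu_a, \frac{w_1\mu_1+w_a\mu_a}{w_1+w_a}\right) \right]. \]
Then, one can prove that at the optimum the quantities in the $\min$ are all equal. Introducing their common value as an auxiliary variable, one can show that the computation of $w^*(\mu)$ reduces to solving a one-dimensional optimization problem. For all $a \neq 1$, one introduces the strictly increasing mapping
\[ g_a : x \in [0,+\infty) \mapsto d\!\left(\mu_1, \frac{\mu_1 + x\mu_a}{1+x}\right) + x\, d\!\left(\mu_a, \frac{\mu_1+x\mu_a}{1+x}\right) \in [0, d(\mu_1,\mu_a)) \]
and defines $x_a : [0, d(\mu_1,\mu_a)) \to [0,+\infty)$ to be its inverse mapping; by convention, $x_1$ is the constant function equal to $1$. With this notation, the following Lemma 3.2.1 provides a way to compute $w^*(\mu)$. The scalar equation defined therein may be solved using binary search. At each step of the search, the solution of the equation $g_a(x) = y$ can again be computed by using binary search, or by Newton's method. This algorithm is available as Julia code at https://github.com/jsfunc/bestarmidentification.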
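Lemma 3.2.1. For every $a \in \{1,\dots,K\}$,
\[ w^*_a(\mu) = \frac{x_a(y^*)}{\sum_{b=1}^K x_b(y^*)}, \tag{5} \]
where $x_1 \equiv 1$, $x_a$ (for $a \neq 1$) is the inverse mapping introduced above, and $y^*$ is the unique solution of the equation $F_\mu(y) = 1$, where
\[ F_\mu : y \mapsto \sum_{a=2}^K \frac{d\!\left(\mu_1, \frac{\mu_1 + x_a(y)\mu_a}{1+x_a(y)}\right)}{d\!\left(\mu_a, \frac{\mu_1 + x_a(y)\mu_a}{1+x_a(y)}\right)} \tag{6} \]
is a continuous, increasing function on $[0, d(\mu_1,\mu_2))$ such that $F_\mu(0) = 0$ and $F_\mu(y) \to \infty$ when $y \to d(\mu_1,\mu_2)$.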
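This result yields an efficient algorithm for computing $w^*(\mu)$ and $T^*(\mu)$. But can a closed-form formula be derived, at least in some special cases? In the two-armed case, it is easy to see that $T^*(\mu)$ is equal to the inverse Chernoff information between the two arms (see [Kaufmann et al., 2014]). However, no closed form is available when $K \geq 3$, even for simple families of distributions. For Gaussian arms with known variance $\sigma^2$, only the following bound is known, which captures $T^*(\mu)$ up to a factor $2$ (with $\Delta_a = \mu_1 - \mu_a$):
\[ \frac{2\sigma^2}{\Delta_2^2} + \sum_{a=2}^K \frac{2\sigma^2}{\Delta_a^2} \leq T^*(\mu) \leq 2\left[ \frac{2\sigma^2}{\Delta_2^2} + \sum_{a=2}^K \frac{2\sigma^2}{\Delta_a^2} \right]. \]
Note that the optimal sample complexity may be much smaller than the minimal number of samples required by a strategy using uniform sampling (for which $w_a = 1/K$ for every arm $a$). An optimal strategy actually uses quite unbalanced weights $w^*(\mu)$.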
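The nested binary search described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' Julia implementation. It assumes Bernoulli arms, at least two arms and a unique best arm. It also assumes the characterization of [Garivier and Kaufmann, 2016]: for each suboptimal arm $a$, $x_a$ is the inverse of $g_a(x) = d(\mu_1, m) + x\, d(\mu_a, m)$ with $m = (\mu_1 + x\mu_a)/(1+x)$; the scalar $y^*$ solves $\sum_{a \neq 1} d(\mu_1, m_a)/d(\mu_a, m_a) = 1$; and the weights are proportional to $x_a(y^*)$, with $x_1 = 1$. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-15  # clipping avoids log(0)
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def bisect_increasing(f: Callable[[float], float], lo: float, hi: float,
                      iters: int = 100) -> float:
    """Approximate root of an increasing function f with f(lo) <= 0 <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_weights(mu: Sequence[float],
                    d: Callable[[float, float], float] = kl_bernoulli
                    ) -> Tuple[List[float], float]:
    """Return (w_star, T_star) for means mu (K >= 2, unique best arm)."""
    K = len(mu)
    best = max(range(K), key=lambda a: mu[a])
    others = [a for a in range(K) if a != best]
    if any(mu[a] >= mu[best] for a in others):
        raise ValueError("the best arm must be unique")
    mu1 = mu[best]

    def mix(a: int, x: float) -> float:
        # Weighted mean of the best arm (weight 1) and arm a (weight x).
        return (mu1 + x * mu[a]) / (1.0 + x)

    def g(a: int, x: float) -> float:
        m = mix(a, x)
        return d(mu1, m) + x * d(mu[a], m)

    def x_of_y(a: int, y: float) -> float:
        # Inner binary search: invert the increasing map g(a, .).
        hi = 1.0
        for _ in range(200):  # bracket the root by doubling
            if g(a, hi) >= y:
                break
            hi *= 2.0
        return bisect_increasing(lambda x: g(a, x) - y, 0.0, hi)

    def F(y: float) -> float:
        total = 0.0
        for a in others:
            m = mix(a, x_of_y(a, y))
            total += d(mu1, m) / d(mu[a], m)
        return total

    # Outer binary search: solve F(y) = 1 on (0, min_a d(mu1, mu_a)).
    y_max = min(d(mu1, mu[a]) for a in others)
    y_star = bisect_increasing(lambda y: F(y) - 1.0, 0.0, y_max)

    x = [1.0] * K  # x[best] stays equal to 1
    for a in others:
        x[a] = x_of_y(a, y_star)
    total = sum(x)
    w = [xa / total for xa in x]
    # At the optimum every pairwise term equals w[best] * y_star = 1 / T_star.
    return w, 1.0 / (w[best] * y_star)


if __name__ == "__main__":
    weights, t_star = optimal_weights([0.5, 0.45, 0.43, 0.4])
    print("w* =", [round(wa, 4) for wa in weights], " T* =", round(t_star, 1))
```

For two arms with means symmetric around $1/2$, the sketch returns equal weights, as expected by symmetry. With more arms the weights are typically far from uniform.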
3.2.2 An algorithm inspired by the lower bound
Back to general exponential families, building on the lower bound and our ability to compute $w^*(\mu)$, we now introduce an efficient algorithm whose sample complexity matches the lower bound, at least for small values of $\delta$. This Track-and-Stop strategy consists of two elements:

a tracking sampling rule, that forces the proportion of draws of each arm $a$ to converge to the associated optimal proportion $w^*_a(\mu)$, by using the plug-in estimates $w^*(\hat{\mu}(t))$,

the Chernoff stopping rule, that can be interpreted as the stopping rule of a sequential Generalized Likelihood Ratio Test (GLRT), whose closed form in this particular problem is very similar to our lower bound.
When stopping, our guess is the empirical best arm. We now describe the sampling and stopping rules in detail before presenting the theoretical guarantees for Track-and-Stop.
3.2.3 The Tracking sampling rule
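Let $\hat{\mu}(t) = (\hat{\mu}_1(t),\dots,\hat{\mu}_K(t))$ be the current maximum likelihood estimate of $\mu$ at time $t$:
\[ \hat{\mu}_a(t) = \frac{1}{N_a(t)} \sum_{s \leq t} X_s \mathbb{1}_{(A_s = a)}, \]
where $A_s$ denotes the arm drawn at round $s$ and $X_s$ the corresponding observation.
A first idea for matching the proportions $w^*(\mu)$ is to track the plug-in estimates $w^*(\hat{\mu}(t))$, by drawing at round $t$ the arm whose empirical proportion of draws lags furthest behind the estimated target $w^*(\hat{\mu}(t))$. But a closer inspection shows that (sufficiently fast) convergence of $\hat{\mu}(t)$ towards the true parameter $\mu$ requires some "forced exploration" to make sure that no arm is undersampled. More formally, defining $U_t = \{a : N_a(t) < \sqrt{t} - K/2\}$, the Tracking rule is defined as
\[ A_{t+1} \in \begin{cases} \underset{a \in U_t}{\mathrm{argmin}}\ N_a(t) & \text{if } U_t \neq \emptyset \quad \text{(forced exploration)}, \\ \underset{1 \leq a \leq K}{\mathrm{argmax}}\ \left\{ t\, w^*_a(\hat{\mu}(t)) - N_a(t) \right\} & \text{otherwise (tracking the plug-in estimate)}. \end{cases} \]
Simple combinatorial arguments prove that the Tracking rule draws each arm at least $(\sqrt{t} - K/2)_+ - 1$ times at round $t$, and relate the gap between $N_a(t)/t$ and $w^*_a(\mu)$ to the gap between $w^*(\hat{\mu}(t))$ and $w^*(\mu)$. This allows us in particular to show that the Tracking rule has the following desired behavior:
Proposition 3.2.3. The Tracking sampling rule satisfies, for every arm $a$,
\[ \mathbb{P}_\mu\left( \lim_{t\to\infty} \frac{N_a(t)}{t} = w^*_a(\mu) \right) = 1. \]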
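One step of the Tracking rule can be sketched in Python as below. This is a minimal illustration under stated assumptions. The forced-exploration set is taken to be the arms with fewer than $\sqrt{t} - K/2$ draws, as in [Garivier and Kaufmann, 2016]. The plug-in target $w^*(\hat{\mu}(t))$ is assumed to be computed elsewhere and is passed in as `target`. The function name `tracking_next_arm` is hypothetical.

```python
import math
from typing import List, Sequence


def tracking_next_arm(counts: Sequence[int], target: Sequence[float]) -> int:
    """Arm to draw next, given draw counts N_a(t) and target proportions.

    `target` plays the role of the plug-in estimate of the optimal
    proportions; it must be a probability vector of the same length.
    """
    t = sum(counts)
    K = len(counts)
    # Forced exploration: arms drawn fewer than sqrt(t) - K/2 times.
    threshold = math.sqrt(t) - K / 2.0
    undersampled = [a for a in range(K) if counts[a] < threshold]
    if undersampled:
        return min(undersampled, key=lambda a: counts[a])
    # Direct tracking: arm whose count lags furthest behind t * target.
    return max(range(K), key=lambda a: t * target[a] - counts[a])


def simulate_tracking(target: Sequence[float], rounds: int) -> List[int]:
    """Run the rule with a fixed target, which isolates the tracking step."""
    counts = [0] * len(target)
    for _ in range(rounds):
        counts[tracking_next_arm(counts, target)] += 1
    return counts


if __name__ == "__main__":
    final_counts = simulate_tracking([0.6, 0.3, 0.1], rounds=2000)
    print([c / 2000 for c in final_counts])
```

In Track-and-Stop the target changes at every round because it is recomputed from the current estimate. The fixed target used in `simulate_tracking` only illustrates that the empirical proportions follow the vector being tracked.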
3.2.4 The Chernoff stopping rule
Let $\ell(X_1,\dots,X_t;\lambda)$ be the log-likelihood of the observations up to time $t$ under a bandit model $\lambda$ and define
\[ Z(t) = \ell(X_1,\dots,X_t;\hat{\mu}(t)) - \sup_{\lambda \in \mathrm{Alt}(\hat{\mu}(t))} \ell(X_1,\dots,X_t;\lambda). \]
Intuitively, this generalized likelihood ratio is large if the current maximum-likelihood estimate $\hat{\mu}(t)$ is far apart from its "closest alternative" $\hat{\lambda}(t)$, defined as the parameter maximizing the likelihood under the constraint that it belongs to $\mathrm{Alt}(\hat{\mu}(t))$, i.e. that its optimal arm is different from that of $\hat{\mu}(t)$. This idea can be traced back to the work of [Chernoff, 1959], in which $Z(t)$ is interpreted as the Neyman-Pearson statistic for testing the (data-dependent) pseudo-hypothesis $\mu = \hat{\mu}(t)$ against $\mu = \hat{\lambda}(t)$, based on all samples available up to round $t$. The analysis of Chernoff, however, only applies to two discrete hypotheses, whereas the best arm identification problem requires considering continuous hypotheses.
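In the paper [Garivier and Kaufmann, 2016], we provide new insights on this Chernoff stopping rule, formally defined in terms of the generalized likelihood ratio statistic $Z(t)$ as
\[ \tau_\delta = \inf\left\{ t \in \mathbb{N} : Z(t) > \beta(t,\delta) \right\}, \tag{7} \]
where $\beta(t,\delta)$ is some threshold function.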
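The first problem is to set the threshold $\beta(t,\delta)$ such that the probability of error of Track-and-Stop is upper bounded by $\delta$. Our analysis relies on expressing the statistic $Z(t)$ in terms of pairwise sequential GLRTs of $(\mu_a \leq \mu_b)$ against $(\mu_a > \mu_b)$, for which we provide tight bounds on the type I error. Indeed, letting
\[ Z_{a,b}(t) = \log \frac{\max_{\mu_a' \geq \mu_b'} p_{\mu_a'}\big(X^a_{N_a(t)}\big)\, p_{\mu_b'}\big(X^b_{N_b(t)}\big)}{\max_{\mu_a' \leq \mu_b'} p_{\mu_a'}\big(X^a_{N_a(t)}\big)\, p_{\mu_b'}\big(X^b_{N_b(t)}\big)}, \]
where $X^a_{N_a(t)}$ is a vector that contains the observations of arm $a$ available at time $t$, and where $p_\mu(Z_1,\dots,Z_n)$ is the likelihood of $n$ i.i.d. observations drawn from the distribution of mean $\mu$, one can show that
\[ Z(t) = \min_{b \neq \hat{a}(t)} Z_{\hat{a}(t),b}(t), \tag{8} \]
where $\hat{a}(t)$ is the empirical best arm at round $t$. In other words, one stops when for each arm $b$ that is different from the empirical best arm $\hat{a}(t)$, a GLRT would reject the (data-dependent) pseudo-hypothesis $(\mu_{\hat{a}(t)} \leq \mu_b)$. This expression also allows for a simple computation of $Z(t)$, as $\hat{\mu}_a(t) \geq \hat{\mu}_b(t)$ implies
\[ Z_{a,b}(t) = N_a(t)\, d\big(\hat{\mu}_a(t), \hat{\mu}_{a,b}(t)\big) + N_b(t)\, d\big(\hat{\mu}_b(t), \hat{\mu}_{a,b}(t)\big), \qquad \hat{\mu}_{a,b}(t) = \frac{N_a(t)\hat{\mu}_a(t) + N_b(t)\hat{\mu}_b(t)}{N_a(t) + N_b(t)}. \]
Under the Tracking sampling rule, it is easy to see that $Z(t)$ grows linearly with $t$; hence, with a threshold $\beta(t,\delta)$ that is sublinear in $t$, the stopping time $\tau_\delta$ is almost surely finite. The probability of error of the Chernoff stopping rule is thus upper bounded by
\[ \mathbb{P}_\mu\big(\hat{a}_{\tau_\delta} \neq 1\big) \leq \mathbb{P}_\mu\big(\exists a \neq 1, \exists t \in \mathbb{N} : Z_{a,1}(t) > \beta(t,\delta)\big) \leq \sum_{a=2}^K \mathbb{P}_\mu\big(\exists t \in \mathbb{N} : Z_{a,1}(t) > \beta(t,\delta)\big). \]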
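In the Bernoulli case, the following lemma allows us to prove that the Chernoff stopping rule is $\delta$-PAC for the choice
\[ \beta(t,\delta) = \log\left(\frac{2(K-1)t}{\delta}\right). \tag{9} \]
Lemma. Let $\delta \in (0,1)$. For any sampling strategy, if $\mu_a < \mu_b$, $\mathbb{P}_\mu\big(\exists t \in \mathbb{N} : Z_{a,b}(t) > \log(2t/\delta)\big) \leq \delta$.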
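The Chernoff stopping test can be sketched in Python as follows, for Bernoulli arms. This is a hedged illustration rather than the reference implementation. It uses the pairwise form of the statistic, in which each suboptimal arm is compared with the empirical best arm through the pooled mean of the two. The threshold is assumed to be of the form $\log(Ct/\delta)$; the default $C = 2(K-1)$ follows our reading of the Bernoulli analysis in [Garivier and Kaufmann, 2016] and should be adjusted to the threshold actually used. Function names are hypothetical, and every arm is assumed to have been drawn at least once.

```python
import math
from typing import Optional, Sequence, Tuple


def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-15  # clipping avoids log(0)
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def glr_statistic(counts: Sequence[int],
                  means: Sequence[float]) -> Tuple[int, float]:
    """Return (empirical best arm, min over other arms of the pairwise GLR)."""
    K = len(counts)
    best = max(range(K), key=lambda a: means[a])
    z = math.inf
    for b in range(K):
        if b == best:
            continue
        na, nb = counts[best], counts[b]
        # Pooled mean: the closest model in which arms `best` and b are tied.
        pooled = (na * means[best] + nb * means[b]) / (na + nb)
        z_ab = na * kl_bernoulli(means[best], pooled) \
            + nb * kl_bernoulli(means[b], pooled)
        z = min(z, z_ab)
    return best, z


def should_stop(counts: Sequence[int], means: Sequence[float], delta: float,
                C: Optional[float] = None) -> Tuple[bool, int]:
    """Chernoff test with the assumed threshold log(C * t / delta)."""
    t = sum(counts)
    if C is None:
        C = 2.0 * (len(counts) - 1)  # assumption, see the lead-in
    best, z = glr_statistic(counts, means)
    return z > math.log(C * t / delta), best


if __name__ == "__main__":
    print(should_stop([100, 100], [0.9, 0.1], delta=0.05))
    print(should_stop([100, 100], [0.51, 0.49], delta=0.05))
```

Well-separated empirical means trigger stopping after few samples, whereas nearly tied means keep the statistic far below the threshold.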
Another rewriting of the Chernoff stopping rule helps to understand why it achieves the optimal sample complexity when coupled with the Tracking sampling rule. Indeed, using the particular form of the likelihood in an exponential family yields
\[ Z(t) = t \times \inf_{\lambda \in \mathrm{Alt}(\hat{\mu}(t))} \sum_{a=1}^K \frac{N_a(t)}{t}\, d\big(\hat{\mu}_a(t), \lambda_a\big). \]
This expression is reminiscent of the lower bound of Theorem 3.1. When $t$ is large one expects $\hat{\mu}(t)$ to be close to $\mu$ thanks to the forced exploration, and $N_a(t)/t$ to be close to $w^*_a(\mu)$, due to Proposition 3.2.3. Hence one has $Z(t) \simeq t/T^*(\mu)$ for large values of $t$. Thus, with the threshold function (9), for small $\delta$, the stopping time $\tau_\delta$ is asymptotically upper bounded by the smallest $t$ such that $t/T^*(\mu) \geq \beta(t,\delta)$, which is of order $T^*(\mu)\log(1/\delta)$ for small values of $\delta$.
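The heuristic above can be made concrete with a few lines of Python that compute the smallest integer $t$ at which a linearly growing statistic $t/T^*$ crosses a logarithmic threshold. This is a sketch under assumptions: the threshold is taken to be $\log(Ct/\delta)$ with a user-supplied constant `C`, the statistic is replaced by its idealized limit, and the function name is hypothetical.

```python
import math


def stopping_time_proxy(t_star: float, delta: float, C: float) -> int:
    """Smallest integer t >= 1 such that t / t_star >= log(C * t / delta)."""

    def crossed(t: int) -> bool:
        return t / t_star >= math.log(C * t / delta)

    if crossed(1):
        return 1
    # Find a power of two at which the threshold is crossed.
    hi = 2
    while not crossed(hi):
        hi *= 2
    lo = hi // 2  # not crossed at lo, crossed at hi
    # Integer binary search for the first crossing.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crossed(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    t_star = 12.0
    for delta in (1e-2, 1e-10, 1e-50):
        t = stopping_time_proxy(t_star, delta, C=2.0)
        print(delta, t, t / (t_star * math.log(1.0 / delta)))
```

The printed ratio approaches one only slowly as $\delta$ decreases, which illustrates why the matching with the lower bound is an asymptotic statement.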
3.2.5 Optimality of Track-and-Stop
In the previous section, we sketched an upper bound on the number of samples used by Track-and-Stop that holds with probability one in the Bernoulli case. [Garivier and Kaufmann, 2016] also propose an asymptotic upper bound on the expected sample complexity of this algorithm, beyond the Bernoulli case. The results are summarized below.
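Theorem 3.2.5. Let $\alpha > 1$. There exists a constant $C = C(\alpha, K)$ such that for all $\delta \in (0,1)$ the Track-and-Stop strategy with threshold
\[ \beta(t,\delta) = \log\left(\frac{C t^\alpha}{\delta}\right) \]
is $\delta$-PAC and satisfies
\[ \limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \leq \alpha\, T^*(\mu). \]
In the Bernoulli case, one can set $\alpha = 1$ and $C = 2(K-1)$.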
Hence, Track-and-Stop can be qualified as asymptotically optimal, in the sense that its sample complexity matches the lower bound of Theorem 3.1 when $\delta$ tends to zero. Inspired by the regret minimization study, an important direction of future work is to obtain finite-time upper bounds on the sample complexity of an algorithm that would still asymptotically match the lower bound of Theorem 3.1. A different line of research has studied, for sub-Gaussian rewards, the asymptotic behavior of the sample complexity for fixed values of $\delta$ in a regime in which the gap between the best and second-best arm goes to zero [Jamieson et al., 2014, Chen et al., 2016], leading to a different notion of optimality. Hence, we should aim for the best of both worlds: an algorithm with a finite-time sample complexity upper bound that would also match the lower bound obtained in this alternative asymptotic regime.
Finally, while Theorem 3.2.5 gives asymptotic results for Track-and-Stop, we would like to highlight the practical impact of this algorithm. Experiments in [Garivier and Kaufmann, 2016] reveal that for relatively "large" values of $\delta$ the sample complexity of Track-and-Stop appears to be about half that of state-of-the-art algorithms in generic scenarios. The sampling rule of Track-and-Stop is slightly more computationally demanding than that of its competitors, as it requires computing $w^*(\hat{\mu}(t))$ at each round. However, the Tracking sampling rule is the most naive idea, and we will investigate whether other simple heuristics could be used to guarantee that the empirical proportions of draws converge towards the optimal proportions $w^*(\mu)$, while being amenable to a finite-time analysis.
4 Discussion
It is known at least since [Bubeck et al., 2011] that good algorithms for regret minimization and pure exploration are expected to be different: small regrets after $t$ time steps imply a large probability of error $\mathbb{P}_\mu(\hat{a}_t \neq a^*)$, where $\hat{a}_t$ is the recommendation for the best arm at time $t$. In the dual fixed-confidence setting that was studied in this paper, we provided other elements to assess the difference between the regret minimization and best arm identification problems.
First, the sampling strategies used by the two types of algorithms are very different. Regret minimizing algorithms draw the best arm most of the time ($T - O(\log T)$ times in $T$ rounds) while each suboptimal arm gathers a vanishing proportion of draws. On the contrary, identifying the best arm requires the proportions of arm draws to converge to a vector with all non-zero components. Figure 2 illustrates this different behavior: the number of draws of each arm and associated KL-based confidence intervals are displayed for the kl-UCB (left) and Track-and-Stop (right) strategies. As expected, Track-and-Stop draws the close-to-optimal arms more frequently than its competitor, and therefore has tighter confidence intervals on their means.
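We also emphasize that the information-theoretic quantities characterizing the complexity of the two problems are different. For regret minimization, we saw that the minimal regret of uniformly efficient (u.e.) strategies satisfies:
\[ \inf_{\mathcal{A}\ \text{u.e.}}\ \liminf_{T \to \infty} \frac{R_T(\mathcal{A})}{\log T} = \sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}, \]
where $R_T(\mathcal{A})$ is the regret of strategy $\mathcal{A}$ after $T$ rounds, $\mu^*$ is the largest mean, and $d(\mu,\mu')$ is the Kullback-Leibler divergence between the distribution of mean $\mu$ and the distribution of mean $\mu'$ in our class. For best arm identification,
\[ \inf_{\delta\text{-PAC strategies}}\ \liminf_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = T^*(\mu), \]
where $T^*(\mu)$ is the solution of an optimization problem expressed with Kullback-Leibler divergences between arms that has no closed-form solution for more than two arms.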
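For comparison with the best arm identification complexity, the regret constant has an explicit form that is easy to evaluate. The Python sketch below computes the Lai and Robbins constant $\sum_{a : \mu_a < \mu^*} (\mu^* - \mu_a)/d(\mu_a, \mu^*)$ for Bernoulli arms. It is a minimal illustration with hypothetical function names, and assumes the standard form of the Lai and Robbins lower bound.

```python
import math
from typing import Sequence


def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-15  # clipping avoids log(0)
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lai_robbins_constant(mu: Sequence[float]) -> float:
    """Asymptotic regret constant: sum of gap / kl(mu_a, mu_star)."""
    mu_star = max(mu)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in mu if m < mu_star)


if __name__ == "__main__":
    means = [0.5, 0.45, 0.43, 0.4]
    print("regret grows like", round(lai_robbins_constant(means), 2), "* log(T)")
```

Each suboptimal arm contributes its own term to this constant, whereas the best arm identification complexity couples all arms through a single optimization problem.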
Although regret minimization and best arm identification are two very different objectives, both in terms of algorithms and of complexity, best arm identification tools have been used within regret minimization algorithms in so-called Explore-Then-Commit strategies [Perchet and Rigollet, 2013, Perchet et al., 2015]. For minimizing regret up to a horizon $T$, such strategies use an (elimination-based) fixed-confidence best arm identification algorithm with $\delta = 1/T$ to make a guess for the best arm and then commit to playing this estimated best arm until the end of the horizon $T$. In a simple case (two Gaussian arms), we recently quantified the sub-optimality of such approaches: the regret of the best such Explore-Then-Commit strategy is at least twice as large as that of the kl-UCB algorithm [Garivier et al., 2016a]. Even if the article focuses on Gaussian rewards, other cases with possibly more than two arms are also discussed. Coming back to the introductory example of A/B testing, the take-home message is the following: if you prefer to first experiment with the two options before using only one of them in production, instead of continuously allocating the two options using a good regret-minimizing strategy, then this will cost you a regret twice as large.
Unlike the asymptotically optimal regret minimizing strategies that we presented, the asymptotically optimal Track-and-Stop strategy for best arm identification has no finite-time guarantees, and its implementation is slightly more complex. An important direction for future work is to see whether useful tools for regret minimization, like the optimism principle or Bayesian methods, can be combined with Track-and-Stop to obtain a simpler algorithm with a finite-time analysis. A starting point may be found in [Russo, 2016], who recently proposed a modified Bayesian Thompson Sampling rule that has some promising properties.
Acknowledgement: The authors are extremely thankful to the reviewers of this paper, who contributed significantly to the clarity of the presentation by their numerous and always relevant comments.
References
 [AbbasiYadkori et al., 2011] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems.
 [Agrawal, 1995] Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078.
 [Agrawal et al., 1989] Agrawal, R., Teneketzis, D., and Anantharam, V. (1989). Asymptotically Efficient Adaptive Allocation Schemes for Controlled i.i.d. Processes: Finite Parameter Space. IEEE Transactions on Automatic Control, 34(3):258–267.
 [Agrawal and Goyal, 2012] Agrawal, S. and Goyal, N. (2012). Analysis of Thompson Sampling for the multi-armed bandit problem. In Proceedings of the 25th Conference On Learning Theory.
 [Agrawal and Goyal, 2013] Agrawal, S. and Goyal, N. (2013). Further Optimal Regret Bounds for Thompson Sampling. In Proceedings of the 16th Conference on Artificial Intelligence and Statistics.
 [Audibert et al., 2010] Audibert, J.-Y., Bubeck, S., and Munos, R. (2010). Best Arm Identification in Multi-armed Bandits. In Proceedings of the 23rd Conference on Learning Theory.
 [Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2):235–256.
 [Bechhofer, 1954] Bechhofer, R. (1954). A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25:16–39.
 [Bechhofer et al., 1968] Bechhofer, R., Kiefer, J., and Sobel, M. (1968). Sequential identification and ranking procedures. The University of Chicago Press.
 [Bellman, 1956] Bellman, R. (1956). A problem in the sequential design of experiments. Sankhyā: The Indian Journal of Statistics, 16(3/4):221–229.
 [Berry and Fristedt, 1985] Berry, D. and Fristedt, B. (1985). Bandit Problems. Sequential allocation of experiments. Chapman and Hall.
 [Bubeck, 2010] Bubeck, S. (2010). Jeux de bandits et fondation du clustering. PhD thesis, Université de Lille 1.
 [Bubeck and CesaBianchi, 2012] Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and non-stochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.
 [Bubeck et al., 2013] Bubeck, S., Ernst, D., and Garivier, A. (2013). Optimal discovery with probabilistic expert advice: Finite time analysis and macroscopic optimality. Journal of Machine Learning Research, 14:601–623.
 [Bubeck et al., 2011] Bubeck, S., Munos, R., and Stoltz, G. (2011). Pure Exploration in Finitely Armed and Continuous Armed Bandits. Theoretical Computer Science, 412:1832–1852.
 [Burnetas and Katehakis, 1996] Burnetas, A. and Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142.
 [Cappé et al., 2013] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., and Stoltz, G. (2013). Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541.
 [CesaBianchi and Lugosi, 2006] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning and Games. Cambridge University Press.
 [Chapelle and Li, 2011] Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems.
 [Chen et al., 2016] Chen, L., Li, J., and Qiao, M. (2016). Towards instance optimal bounds for best arm identification. arXiv:1608.06031.
 [Chernoff, 1959] Chernoff, H. (1959). Sequential design of Experiments. The Annals of Mathematical Statistics, 30(3):755–770.
 [Chu et al., 2011] Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual Bandits with Linear Payoff Functions. In Proceedings of the 14th Conference on Artificial Intelligence and Statistics.
 [Combes and Proutière, 2014] Combes, R. and Proutière, A. (2014). Unimodal Bandits without Smoothness. Technical report.
 [EvenDar et al., 2006] Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079–1105.
 [Gabillon et al., 2012] Gabillon, V., Ghavamzadeh, M., and Lazaric, A. (2012). Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. In Advances in Neural Information Processing Systems.
 [Garivier and Cappé, 2011] Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Conference on Learning Theory.
 [Garivier and Kaufmann, 2016] Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory (to appear).
 [Garivier et al., 2016a] Garivier, A., Kaufmann, E., and Lattimore, T. (2016a). On explore-then-commit strategies. In Advances in Neural Information Processing Systems (NIPS).
 [Garivier and Leonardi, 2011] Garivier, A. and Leonardi, F. (2011). Context tree selection: A unifying view. Stochastic Processes and their Applications, 121(11):2488–2506.
 [Garivier et al., 2016b] Garivier, A., Ménard, P., and Stoltz, G. (2016b). Explore first, exploit next: The true shape of regret in bandit problems. arXiv preprint arXiv:1602.07182.
 [Garivier and Moulines, 2011] Garivier, A. and Moulines, E. (2011). On Upper-Confidence Bound Policies for Switching Bandit Problems. In Proceedings of the 22nd conference on Algorithmic Learning Theory.
 [Gittins, 1979] Gittins, J. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41(2):148–177.
 [Graves and Lai, 1997] Graves, T. and Lai, T. (1997). Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743.
 [Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600.
 [Jamieson et al., 2014] Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. (2014). lil’UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory.
 [Kalyanakrishnan et al., 2012] Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML).
 [Karnin et al., 2013] Karnin, Z., Koren, T., and Somekh, O. (2013). Almost optimal Exploration in multi-armed bandits. In International Conference on Machine Learning (ICML).
 [Katehakis and Robbins, 1995] Katehakis, M. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences, 92:8584–8585.
 [Kaufmann, 2014] Kaufmann, E. (2014). Analyse de stratégies bayésiennes et fréquentistes pour l’allocation séquentielle de ressources. PhD thesis.
 [Kaufmann, 2016] Kaufmann, E. (2016). On Bayesian index policies for sequential resource allocation. Preprint arXiv:1601.01190.
 [Kaufmann et al., 2012a] Kaufmann, E., Cappé, O., and Garivier, A. (2012a). On Bayesian Upper-Confidence Bounds for Bandit Problems. In Proceedings of the 15th conference on Artificial Intelligence and Statistics.
 [Kaufmann et al., 2014] Kaufmann, E., Cappé, O., and Garivier, A. (2014). On the Complexity of A/B Testing. In Proceedings of the 27th Conference On Learning Theory.
 [Kaufmann et al., 2016] Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 17(1):1–42.
 [Kaufmann and Kalyanakrishnan, 2013] Kaufmann, E. and Kalyanakrishnan, S. (2013). Information complexity in bandit subset selection. In Proceedings of the 26th Conference On Learning Theory.
 [Kaufmann et al., 2012b] Kaufmann, E., Korda, N., and Munos, R. (2012b). Thompson Sampling: an Asymptotically Optimal Finite-Time Analysis. In Proceedings of the 23rd conference on Algorithmic Learning Theory.
 [Lai and Robbins, 1985] Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
 [Lattimore, 2016] Lattimore, T. (2016). Regret analysis of the anytime optimally confident UCB algorithm. CoRR, abs/1603.08661.
 [Magureanu et al., 2014] Magureanu, S., Combes, R., and Proutière, A. (2014). Lipschitz Bandits: Regret lower bounds and optimal algorithms. In Proceedings of the 27th Conference On Learning Theory.
 [Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. (2004). The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. Journal of Machine Learning Research, 5:623–648.
 [Ménard and Garivier, 2017] Ménard, P. and Garivier, A. (2017). A minimax and asymptotically optimal algorithm for stochastic bandits.
 [Perchet and Rigollet, 2013] Perchet, V. and Rigollet, P. (2013). The multiarmed bandit with covariates. The Annals of Statistics.
 [Perchet et al., 2015] Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. (2015). Batched bandit problems. In Proceedings of the 28th Conference On Learning Theory.
 [Puterman, 1994] Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
 [Robbins, 1952] Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
 [Russo, 2016] Russo, D. (2016). Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Conference On Learning Theory.
 [Scott, 2010] Scott, S. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658.
 [Sutton and Barto, 1998] Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
 [Thompson, 1933] Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294.
 [Vaidhyan and Sundaresan, 2015] Vaidhyan, N. and Sundaresan, R. (2015). Learning to detect an oddball target. arXiv:1508.05572.