Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits

Alex Luedtke, Emilie Kaufmann and Antoine Chambaz
University of Washington, Department of Statistics.
CNRS & Univ. Lille, CRIStAL (UMR 9189), Inria Lille.
Université Paris Descartes, Laboratoire MAP5.

We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.

1 Introduction

In the classical multi-armed bandit problem, an agent is repeatedly confronted with a set of probability distributions called arms and must at each round select one of the available arms to pull based on their knowledge from previous rounds of the game. Each played arm presents the agent with a reward drawn from the corresponding distribution, and the agent’s objective is to maximize the expected sum of their rewards over time or, equivalently, to minimize the total regret (the expected reward of pulling the optimal arm at every time step minus the expected sum of the rewards corresponding to their selected actions). To play the game well, the agent must balance the need to gather new information about the reward distribution of each arm (exploration) with the need to take advantage of the information that they already have by pulling the arm for which they believe the reward will be the highest (exploitation).

The bandit problem first started receiving rigorous mathematical attention slightly under a century ago [Thompson, 1933]. This early work focused on Bernoulli rewards, which are relevant in the simplest model of a sequential clinical trial, and presented a Bayesian algorithm now known as Thompson sampling. Since that time, many authors have contributed to a deeper understanding of the multi-armed bandit problem, both with Bernoulli and other reward distributions and either from a Bayesian [Gittins, 1979] or frequentist [Robbins, 1952] perspective. Lai and Robbins [1985] established a lower bound on the (frequentist) regret of any algorithm that satisfies a general uniform efficiency condition. This lower bound provides a concise definition of asymptotic (regret) optimality for an algorithm: an algorithm is asymptotically optimal when it achieves this lower bound. Lai [1987] introduced what are known as upper confidence bound (UCB) procedures for deciding which arm to pull at a given time step. In short, these procedures compute a UCB for the expected reward of each arm at each time and pull the arm with the highest UCB. Many variants of UCB algorithms have been proposed since then (see the Introduction of Cappé et al., 2013a for a thorough review), with more explicit indices and/or finite-time regret guarantees. Among them, the KL-UCB algorithm [Cappé et al., 2013a] has been proved to be asymptotically optimal for rewards that belong to a one-parameter exponential family and for finitely supported rewards. Meanwhile, there has been recent interest in the theoretical understanding of the previously discussed Thompson sampling algorithm, whose first regret bound was obtained by Agrawal and Goyal [2011]. Since then, Thompson sampling has been proved to be asymptotically optimal for Bernoulli rewards [Kaufmann et al., 2012b, Agrawal and Goyal, 2012] and for reward distributions belonging to univariate exponential families [Korda et al., 2013].

There has recently been a surge of interest in the multi-armed bandit problem, due to its applications to (online) sequential content recommendation. In this context each arm models the feedback of an agent to a specific item that can be displayed (e.g. an advertisement). In this framework, it might be relevant to display several items at a time, and some variants of the classical bandit problems that have been proposed in the literature may be considered. In the multi-armed bandit with multiple plays, a fixed number of arms are sampled at each round and all the associated rewards are observed by the agent, who receives their sum. Anantharam et al. [1987] present a regret lower bound for this problem, together with a (non-explicit) matching strategy. More explicit strategies can be obtained when viewing this problem as a particular instance of a combinatorial bandit problem with semi-bandit feedback. Combinatorial bandits, originally introduced by Cesa-Bianchi and Lugosi [2012] in a non-stochastic setting, present the agent with possibly structured subsets of arms at each round: once a subset is chosen, the agent receives the sum of their rewards. The semi-bandit feedback corresponds to the case where the agent is able to see the reward of each of the sampled arms [Audibert et al., 2011]. Several extensions of UCB procedures have been proposed for the combinatorial setting (see e.g. Chen et al. [2013], Combes et al. [2015b]), with logarithmic regret guarantees. However, existing regret upper bounds do not match the lower bound of Anantharam et al. [1987]. In particular, despite the strong practical performance of KL-UCB-based algorithms in some combinatorial settings (including multiple plays), their asymptotic optimality has never been established. Extending the optimality result from the single-play setting has proven challenging, especially in settings where the optimal set of arms is non-unique. Recently, Komiyama et al. [2015] proved the asymptotic optimality of Thompson sampling for multiple-play bandits with Bernoulli rewards in the case where the arm with the largest mean is unique. An important consequence of the uniqueness of the largest mean is that the optimal set of arms is necessarily unique, which may not be plausible in practice.

In this paper, we extend the multiple plays model in two directions, incorporating a budget constraint and an indifference point. Given a known cost associated with pulling each arm , at each round a subset of arms is selected, so that the expected cost of pulling the chosen arms is at most the budget . More formally, letting , one requires , where the expectation over the random selection of the subset is taken conditionally on past observations. The agent observes the rewards associated with the selected arms and receives a total reward , where is drawn from . This reward is then compared to what she could have obtained, had she spent the same budget on some other activity, for which the expected reward per cost unit is (that is, the agent may prefer to use that money for some purpose that has reward to cost ratio greater than and is external to the bandit problem). We note that, for positive reward distributions, choosing corresponds to taking an action at every round. The agent’s gain at round is thus defined as

The goal of the agent is to devise a sequential subset selection strategy that maximizes the expected sum of her gains, up to some horizon and for which the budget constraint is satisfied at each round . In particular, arm is “worth” drawing (in the sense that it increases the expected gain) only if its average reward per cost unit, (where is the expectation of ), is at least the indifference point .

This new framework no longer requires the number of arm draws to be fixed. Rather, the number of arm draws is selected to exhaust the budget, which makes sense in several online marketing scenarios. One can imagine for example a company targeting a new market on which it is willing to spend a budget per week. Each week, the company has to decide which products to advertise, and the cost of the advertising campaign may vary. After each week, the income associated with each campaign is measured and compared to the minimal income of that can be obtained when targeting other (known) markets or investing the money in some other well-understood venture. Another possible scenario is that the same item can be displayed on several previously unexplored marketplaces for different costs, and the seller has to sequentially choose the places on which she wants to display the items while keeping the total budget spent smaller than and maintaining a profitability larger than what can be obtained on a reference marketplace with reward per cost unit .

Our first contribution is to characterize the best attainable performance in terms of regret (with respect to the gain , not the total reward ) in this multiple-play bandit scenario with cost constraints, thanks to a lower bound that generalizes that of Anantharam et al. [1987]. We then study natural extensions of two existing bandit algorithms (KL-UCB and Thompson sampling) to our setting. We prove both rate and problem-dependent leading constant optimality for KL-UCB and Thompson sampling. The most difficult part of the proof is to show that the optimal arms away from the margin are pulled in almost every round (specifically, they are pulled in all but a sub-logarithmic number of rounds). Komiyama et al. [2015] studied this problem for Thompson sampling in multiple-play bandits using an argument different than that used in this paper. We provide a novel proof technique that leverages the asymptotic lower bound on the number of draws of any suboptimal arm. While this lower bound on suboptimal arm draws is typically used to prove an asymptotic lower bound on the regret of any reasonable algorithm, we use it as a key ingredient for our proof of an asymptotically optimal upper bound on the regret of KL-UCB and Thompson sampling, i.e. to prove the asymptotic optimality of these two algorithms. Also, throughout the manuscript, we do not assume that the set of optimal arms is unique, unlike most of the existing work on (standard) multiple-play bandits.

The rest of the article is organized as follows. Section 2 outlines our problem of interest. Section 3 provides an asymptotic lower bound on the number of suboptimal arm draws and on the regret. Section 4 presents the two sampling algorithms we consider in this paper and theorems establishing their asymptotic optimality: KL-UCB (Section 4.1) and Thompson sampling (Section 4.2). Section 5 presents numerical experiments supporting our theoretical findings. Section 6 presents the proofs of our asymptotic optimality (rate and leading constant) results for KL-UCB and Thompson Sampling. Section 7 gives concluding remarks. Technical proofs are postponed to the Appendix.

2 Multiple plays bandit with cost constraint

We consider a finite collection of arms , where each arm has real-valued marginal reward distribution whose mean we denote by both and . Each arm belongs to a (possibly nonparametric) class of distributions . We use to denote , where belongs to any model that is variation-independent in the sense that, for each , knowing the joint distribution of the rewards places no restrictions on the collection of possible marginal distributions of , i.e. could be equal to any element in . More formally, letting denote the collection of joint distributions of the rewards implied by at least one distribution in , variation independence states that, for each it is true that, for every joint distribution and every distribution , there exists a distribution in whose joint distribution of the rewards is equal to and whose marginal distribution of reward is equal to . An example of a statistical model satisfying this variation-independence assumption is the distribution in which the rewards of all of the arms are independent and the marginal distributions fall in for all , though this assumption also allows for high levels of dependence between the rewards of the arms, i.e. is not to be confused with the much stronger model assumption of independence between the different arms.

2.1 The sequential decision problem

Let be an independent and identically distributed (i.i.d.) sample from the distribution . In the multiple-play bandit with cost constraint, each arm is associated with a known cost . The model also depends on a known budget per round and indifference parameter . At round , the agent selects a subset of arms and subsequently observes the action-reward pairs . We emphasize that the agent is aware that reward corresponds to the action . This subset is drawn from a distribution over , the set of all subsets of , that depends on the observations gathered at the previous rounds. More precisely, is -measurable, where is the -field generated by all action-reward pairs seen at times , and possibly also some exogenous stochastic mechanism. We use to denote the probability that arm falls in .

Given the budget and the indifference parameter , at each round the distribution must respect the budget constraint


Upon selecting the arms, the agent receives a reward and incurs a gain . Given a (possibly unknown) horizon , the goal of the agent is to adopt a strategy for sequentially selecting the distributions that maximizes

while satisfying, at each round the budget constraint (1). This constraint may be viewed as a ‘soft’ budget constraint, as it allows the agent to (slightly) exceed the budget at some rounds, as long as the expected cost remains below at each round. We shall see below that considering a ‘hard’ budget constraint, that is selecting at each round a deterministic subset that satisfies , is a much harder problem. Besides, in the marketing examples described in the introduction, it makes sense to consider a large time horizon and to allow for minor budget crossings. Under the soft budget constraint (1), if we knew the vector of expected mean rewards , at each round we would draw a subset from a distribution


Above, the argmax is over distributions with support on the power set of . Noting that the two expectations only depend on the marginal probability of inclusions , it boils down to finding a vector that satisfies


An oracle strategy would then draw from a distribution with marginal probabilities of inclusions given by (e.g. including independently each arm with probability ). The optimization problem (3) is known as a fractional knapsack problem [Dantzig, 1957], and its solution is a greedy strategy, that is described below. It is expressed in terms of the reward-to-cost ratio of each arm , defined as .
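As a minimal sketch of the independent-inclusion scheme mentioned above, the following Python snippet draws a random subset whose marginal inclusion probabilities match a given vector, and computes the expected cost appearing in the soft budget constraint. The names `draw_subset`, `expected_cost`, and the example values are illustrative assumptions, not notation from the paper.

```python
import random

def draw_subset(inclusion_probs, rng):
    """Include each arm independently with its marginal inclusion probability."""
    return [a for a, q in enumerate(inclusion_probs) if rng.random() < q]

def expected_cost(inclusion_probs, costs):
    """Expected cost of a round; it depends only on the marginal probabilities."""
    return sum(q * c for q, c in zip(inclusion_probs, costs))

# Hypothetical example: three unit-cost arms and a budget of 1.5 per round.
rng = random.Random(0)
q = [1.0, 0.5, 0.0]
costs = [1.0, 1.0, 1.0]
budget = 1.5
subsets = [draw_subset(q, rng) for _ in range(1000)]
realized_costs = [sum(costs[a] for a in s) for s in subsets]
```

In this example the expected cost equals the budget, while individual rounds spend either 1 or 2, which is exactly the slack that a soft constraint permits.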

Proposition 1.


and define the three sets

optimal arms away from the margin:
arms on the margin:
suboptimal arms away from the margin:

Then is solution to (3) if and only if for all , for all and if .

We would like to emphasize that, just like the quantities , or defined above, the quantity defined in Proposition 1 depends on the value of , the vector of cost and on the vector of means . When we need to materialize this dependency in we shall use the notation , but it is sometimes omitted for the sake of readability.

From Proposition 1, proved in Appendix A, the optimal strategy sorts the items by decreasing order of , and includes them one by one (), as long as the value increases and the budget is not exceeded. Then we can identify two situations: if , there are not enough interesting items (i.e. such that ) to saturate the budget, and the optimal strategy is to include all the interesting items. If , some probability of inclusion is further given to the items on the margin in order to saturate the budget constraint. In that case, the margin is always non-empty: there exist items such that .
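The greedy strategy described above can be sketched as follows. This is an illustrative implementation under stated assumptions: costs are strictly positive, arms whose reward-to-cost ratio does not exceed the indifference point are left out (they cannot increase the expected gain), and ties on the margin are filled in index order, which yields only one of the possibly many solutions. The function name and arguments are hypothetical.

```python
def oracle_inclusion_probabilities(means, costs, budget, rho):
    """Greedy solution of the fractional knapsack problem (a sketch).

    Arms are visited by decreasing reward-to-cost ratio and receive as much
    inclusion probability as the remaining budget allows.
    """
    ratios = [m / c for m, c in zip(means, costs)]
    # sorted() is stable, so tied ratios are visited in index order.
    order = sorted(range(len(means)), key=lambda a: -ratios[a])
    q = [0.0] * len(means)
    remaining = budget
    for a in order:
        if ratios[a] <= rho or remaining <= 0:
            break  # no further arm can increase the expected gain
        q[a] = min(1.0, remaining / costs[a])
        remaining -= q[a] * costs[a]
    return q

# Enough worthwhile arms to saturate the budget: the last included arm
# sits on the margin and receives a fractional probability.
print(oracle_inclusion_probabilities([0.9, 0.6, 0.3], [1.0, 1.0, 1.0], 1.5, 0.0))
# -> [1.0, 0.5, 0.0]
```

When too few arms beat the indifference point, the loop stops before the budget is exhausted, which corresponds to the first of the two situations described above.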

Recovering the multiple-play bandit model.

By choosing for all arms , and , we recover the classical multiple-play bandit model. In that case , where is the arm with the largest mean and is a solution to (2): the corresponding oracle strategy always plays the arms with largest means.

Hard and soft constraints.

Under hard budget constraints, if we knew the vector of expected mean rewards , at each round we would pick a subset


This is a knapsack problem, which is much harder to solve than the above fractional knapsack problem. In fact, knapsack problems are NP-hard, though they are, admittedly, some of the easiest problems in this class, and reasonable approximation schemes exist [Karp, 1972]. Nonetheless, the greedy strategy (including arms in decreasing order of while the budget is not exceeded, with ties broken arbitrarily) is not generally a solution to (4). However, using Proposition 1, one can identify some examples where there exist deterministic solutions to (3), i.e. solutions such that , which are therefore also solutions to (4): if or if there exists such that . Hence the multiple-play bandit model can be viewed as a particular instance of the multiple plays model under either hard or soft budget constraints. In the rest of the article, we only consider soft budget constraints, as there is generally no tractable oracle under hard budget constraints.
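To illustrate why the greedy rule can fail under a hard budget constraint, the sketch below compares it with an exhaustive search on a small hypothetical instance. The instance, the function names, and the per-arm gain (mean reward minus the indifference point times the cost) are illustrative assumptions; exhaustive search is only practical for a handful of arms.

```python
from itertools import combinations

def gain(subset, means, costs, rho):
    """Expected gain of deterministically playing every arm in the subset."""
    return sum(means[a] - rho * costs[a] for a in subset)

def best_hard_subset(means, costs, budget, rho):
    """Exhaustive search over all subsets whose total cost fits the budget."""
    arms = range(len(means))
    feasible = (s for r in range(len(means) + 1)
                for s in combinations(arms, r)
                if sum(costs[a] for a in s) <= budget)
    return max(feasible, key=lambda s: gain(s, means, costs, rho))

def greedy_hard_subset(means, costs, budget, rho):
    """Add arms by decreasing reward-to-cost ratio until one does not fit."""
    order = sorted(range(len(means)), key=lambda a: -means[a] / costs[a])
    chosen, spent = [], 0.0
    for a in order:
        if means[a] / costs[a] <= rho or spent + costs[a] > budget:
            break
        chosen.append(a)
        spent += costs[a]
    return tuple(chosen)

# Hypothetical instance: arm 0 has the best ratio but blocks the other two.
means, costs, budget, rho = [6.6, 5.0, 5.0], [6.0, 5.0, 5.0], 10.0, 0.0
greedy = greedy_hard_subset(means, costs, budget, rho)   # (0,)
optimal = best_hard_subset(means, costs, budget, rho)    # (1, 2)
```

Here the greedy subset earns 6.6 while the optimal subset earns 10, whereas under the soft constraint the fractional solution can spend the leftover budget on a partially included arm.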

High-probability bound on the budget spent by a finite horizon .

In the Appendix, we outline how one could analyze the regret of algorithms that respect the soft budget constraint (1) at each time in a finite-horizon problem in which the requirement that (1) hold at each time is replaced by the hard budget constraint that almost surely. Our argument suggests that the regret in these settings should be no worse than .

2.2 Regret decompositions

The best achievable (oracle) performance consists in choosing, at every round , to be the optimal distribution whose probabilities of inclusions are described in Proposition 1. Using the definitions introduced in Proposition 1, such a strategy ensures an expected gain at each round of


The quantity above is the reward from pulling the chosen arms relative to the reward from reallocating the expected cost of the strategy, namely , to pursue the action (which is external to the bandit problem) that has reward-to-cost ratio equal to the indifference point . We prove the following identity in Appendix A.

Proposition 2.

It holds that

Maximizing the expected total gain is equivalent to minimizing the regret, that is the difference in performance compared to the oracle strategy:

where the sequence of gains is obtained under algorithm Alg. The following statement, proved in Appendix A, provides an interesting decomposition of the regret, as a function of the number of selections of each arm, denoted by .

Proposition 3.

With , defined as in Proposition 1, for any algorithm Alg


This decomposition writes the regret as a sum of three non-negative terms. In order for the regret to be small, each optimal arm should be drawn very often (of order times, to make the first term small) and each suboptimal arm should be drawn seldom (to make the second term small). Finally if , that is if there are sufficiently many ‘worthwhile’ arms to exceed the budget, then the third term appears as a penalty for not using the whole budget at every round. It means that arms on the margin have to be drawn sufficiently often so as to saturate the budget constraint.
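The sources of regret named above can be checked numerically on a toy instance. The sketch below does not reproduce the decomposition itself; it only compares the expected per-round gain of two fixed strategies with that of an oracle vector of inclusion probabilities. It assumes the per-round expected gain is the sum, over arms, of the inclusion probability times (mean reward minus indifference point times cost); all names and numbers are illustrative.

```python
def expected_gain(inclusion_probs, means, costs, rho):
    """Expected one-round gain of a vector of marginal inclusion probabilities."""
    return sum(q * (m - rho * c)
               for q, m, c in zip(inclusion_probs, means, costs))

# Hypothetical instance with unit costs, budget 1.5, indifference point 0.2.
means, costs, rho = [0.9, 0.6, 0.3], [1.0, 1.0, 1.0], 0.2
oracle = [1.0, 0.5, 0.0]       # greedy fractional-knapsack solution
spread_out = [0.5, 0.5, 0.5]   # same expected cost, wrong arms
underspend = [1.0, 0.0, 0.0]   # right arm, but budget left unused

regret_spread = expected_gain(oracle, means, costs, rho) - expected_gain(spread_out, means, costs, rho)
regret_under = expected_gain(oracle, means, costs, rho) - expected_gain(underspend, means, costs, rho)
```

Both gaps are positive (about 0.3 and 0.2 per round here): the first comes from under-drawing an optimal arm and over-drawing a suboptimal one, the second purely from not saturating the budget.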

An extended bandit interpretation.

Here we propose another view on this regret decomposition, by means of an extended bandit game with an extra arm, which we term a pseudo-arm, that represents the choice not to pull arms. Whenever an algorithm does not saturate the budget constraint (1), one can view this algorithm as putting weight on a pseudo-arm in the bandit, that yields zero gain but permits saturation of the budget. Letting and , the gain associated with drawing arm (whose distribution is a point mass at ) is indeed zero (as ) and, for any such that , there exists such that , as . Any algorithm for the original bandit problem selecting at time can thus be viewed as an algorithm selecting , that additionally includes arm with probability . As the pseudo-arm is associated with a null gain, the cumulated gain and regret are similar in both settings. Moreover, as , one easily sees that the number of (artificial) selections of the pseudo-arm is such that

which equals the third term in the regret decomposition, up to the factor .
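As a small sketch of the pseudo-arm construction, the snippet below computes the inclusion probability that the pseudo-arm needs in order to saturate the budget. It assumes, for illustration, that the pseudo-arm's cost equals the per-round budget, so that the required probability always lies between 0 and 1; the function name and the tolerance are hypothetical.

```python
def pseudo_arm_probability(inclusion_probs, costs, budget):
    """Probability of including the pseudo-arm so that the expected cost
    of the extended selection equals the budget (pseudo-arm cost = budget)."""
    spent = sum(q * c for q, c in zip(inclusion_probs, costs))
    if spent > budget + 1e-12:
        raise ValueError("inclusion probabilities violate the budget constraint")
    return max(0.0, (budget - spent) / budget)

# Hypothetical example: the real arms spend 1.5 out of a budget of 2.
p = pseudo_arm_probability([1.0, 0.5, 0.0], [1.0, 1.0, 1.0], 2.0)  # 0.25
```

Because the pseudo-arm yields zero gain, adding it changes neither the cumulated gain nor the regret; it only makes unused budget visible as draws of an extra arm.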

In this extended bandit model, the three sets of arms introduced in Proposition 1 remain unchanged, with , and . As , the pseudo-arm may only belong to or , and the margin is always non-empty. Considering the extended bandit model, the regret decomposition can be rewritten in a more compact way:

Our proofs make use of this extended bandit model, since many of the results we present apply to both the “actual” arms and the pseudo-arm . Our proofs also make use of a set , which, in the extended bandit model, refers to all arms in whereas, in the unextended bandit model, it refers simply to all optimal arms both on and away from the margin.

2.3 Related work

There has been considerable work on various forms of “budgeted” or “knapsack” bandit problems [Tran-Thanh et al., 2012, Badanidiyuru et al., 2013, Agrawal and Devanur, 2014, Xia et al., 2015, 2016a, Li and Xia, 2017]. The main difference between our work and these works is that we consider a round-wise budget constraint, and allow for several arms to be selected at each round, possibly in a randomized way in order to satisfy the budget constraint in expectation. In contrast, in most existing works, one arm is (deterministically) selected at each round, and the game ends when a global budget is exhausted. The work of Xia et al. [2016b] appears to be the most closely related to ours: in their setup the agent may play multiple arms at each round, though the number of arms pulled at each round is fixed and the cost of pulling each arm is random and observed upon pulling each arm. Sankararaman and Slivkins [2018] also consider a framework in which a subset of arms is selected at each round, but this subset is chosen from a list of candidate subsets (as in a combinatorial bandit problem) and there is a global budget constraint. Compared to all these budgeted bandit problems, the focus of our analysis differs substantially, in that our primary objective is to prove not only rate optimality, but also leading constant optimality of our regret bounds. Proving constant optimality is especially challenging in situations where the set of optimal arms is non-unique, but we give careful arguments that overcome this challenge.

Several other extensions of the multiple-play bandit model have been studied in the literature. UCB algorithms have been widely used in the combinatorial semi-bandit setting, in which at each time step a subset of arms has to be selected from a given class of subsets, and the reward of each individual arm in the subset is observed. The most natural use of UCBs and the “optimism in the face of uncertainty” principle is to choose at every time step the subset that would be the best if the unknown means were equal to the corresponding UCBs. This was studied by Chen et al. [2013], Kveton et al. [2014], Wen et al. [2015], who exhibit good empirical performance and logarithmic regret bounds. Combes et al. [2015b] further study instance-dependent optimality for combinatorial semi-bandits, and propose an algorithm based on confidence bounds on the value of each subset, rather than on confidence bounds on the arms’ means. Their ESCB algorithm is proved to be order-optimal for several combinatorial problems. As a by-product of our results, we will see that in the multiple-play setting, using KL-based confidence bounds on the arms’ means is sufficient to achieve asymptotic optimality. Another interesting direction of extension is the possibility of having only partial feedback on the proposed items. Variants of KL-UCB and Thompson sampling were proposed for the Cascading bandit model [Kveton et al., 2015a, b], Learning to Rank [Combes et al., 2015a] or the Position-Based model [Lagrée et al., 2016]. It would be interesting to try to extend the results presented in this work to these partial feedback settings.

3 Regret Lower Bound

We first give in Lemma 4 asymptotic lower bounds on the number of draws of suboptimal arms, either in high-probability or in expectation, in the spirit of those obtained by Lai and Robbins [1985], Anantharam et al. [1987]. Compared to these works, the lower bounds obtained here hold under our more general assumptions on the arm distributions, which is reminiscent of the work of Burnetas and Katehakis [1996].

To be able to state our regret lower bound, we now introduce the following notation. We let denote the KL-divergence between distributions and . If and are uniquely parameterized by their respective means and as in a canonical single parameter exponential family (e.g. Bernoulli distributions), then we abuse notation and let . For a distribution and a real , we define


with the convention that if there does not exist a with . We will also use the convention that, for finite constants , when . We make one final assumption, and introduce two disjoint sets and , whose union is . The assumption is that, for each arm , falls below the upper bound of the expected reward parameter space, i.e. . We define the sets and respectively as the subsets of for which optimality is and is not feasible given our parameter space, namely

By defining and in this way, these sets agree in the extended and unextended bandit models. The lower bounds presented in this section will also agree in these two models.
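For concreteness, here is a sketch of the two divergence quantities in the Bernoulli case, where distributions are identified with their means. The clipping constant, the function names, and the choice of returning infinity when no Bernoulli distribution has mean above the threshold are assumptions made for this illustration.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clipping
    to keep the logarithms finite at the boundary."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def k_inf_bernoulli(mean, threshold):
    """Smallest KL divergence from Bernoulli(mean) to a Bernoulli
    distribution whose mean exceeds the threshold (infimum, not attained)."""
    if threshold >= 1.0:
        return math.inf   # no Bernoulli mean can exceed the threshold
    if threshold <= mean:
        return 0.0        # the distribution itself is (arbitrarily close to) feasible
    return kl_bernoulli(mean, threshold)
```

For example, `k_inf_bernoulli(0.5, 0.75)` is about 0.1438: the larger this value, the easier it is to tell that an arm with mean 0.5 does not reach the threshold 0.75.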

We now define a uniformly efficient algorithm, that generalizes the class of algorithms considered in Lai and Robbins [1985]. An algorithm Alg is uniformly efficient if, for all and , as goes to infinity (from now on, the limits in will be for ). From the regret decomposition (6), this is equivalent to

  1. for all arms such that ;

  2. for all arms such that ;

  3. if ,

where above and throughout we write when we wish to emphasize that the expectation is over .

Lemma 4 (Lower bound on suboptimal arm draws).

If an algorithm is uniformly efficient, then, for any arm and any and ,

One can take if . Furthermore, for any suboptimal arm ,

We defer the proof of this result to Appendix B. We note that, while (9) could also easily be obtained using the recent change-of-distribution tools introduced by Garivier et al. [2016], we need to go back to Lai and Robbins’ technique to prove the high-probability result (8), which will be crucial in the sequel. Indeed, we will use it to prove optimal regret of our algorithms: in essence we need to ensure that we have enough information about arms in to ensure that we pull the optimal arms in sufficiently often.

We now present a corollary to Lemma 4 which provides a regret lower bound, as well as sufficient conditions for an algorithm to asymptotically match it. As already noted by Komiyama et al. [2015] in the Bernoulli case for multiple-play bandit problems, an algorithm achieving the asymptotic lower bound (9) on the expected number of draws of arms in does not necessarily achieve optimal regret, unlike in classic bandit problems. Thus, we emphasize that the upcoming condition (11) alone is not sufficient to prove asymptotic optimality. The conditions of this proof can be easily obtained from the regret decomposition (6), and so the proof is omitted.

Theorem 5 (Regret lower bound).

If an algorithm Alg is uniformly efficient, then


Moreover, any algorithm Alg satisfying

for arms : (11)
for arms (12)
for arms : (13)

and, if ,


is asymptotically optimal, in the sense that it satisfies


4 Algorithms

Our algorithms rely on estimates of the arm distributions and their means, which we formally introduce below. For each arm and natural number , define to be the (stopping) time at which the draw of arm occurs. Let denote the draw from . One can show that is an i.i.d. sequence of draws from for each , though we note that our variation independence assumption is too weak to ensure that these sequences are independent for two arms (this is not problematic, as most of our arguments end up focusing on arm-specific sequences ). [Footnote 1: It is a priori possible that for all large enough (though, as we showed in Section 3, this event will occur with probability zero for any reasonable algorithm). To deal with this case, let denote the draws from for all and let denote an i.i.d. sequence independent of .] We denote the empirical distribution function of observations drawn from arm by any time by

We similarly define to be the empirical distribution function of the observations , , . Thus, . We further define to be the empirical mean of observations drawn from arm by time and .

4.1 KL-UCB

At time , UCB algorithms leverage a high-probability upper bound on for each . The methods used to build these confidence bounds vary, as does the way the algorithm uses these confidence bounds. In our setting, we derive these bounds using the same technique as for KL-UCB in Cappé et al. [2013a]. At the beginning of round , the KL-UCB algorithm computes an optimistic oracle strategy , that is, an oracle strategy assuming the unknown mean of each arm is equal to its best possible value, . From Proposition 1, this optimistic oracle depends on , where is the function defined in Proposition 1. Then each arm is included in independently with probability . Due to the structure of an oracle strategy, KL-UCB can be rephrased as successively drawing the arms by decreasing order of the ratio until the point that the budget is exhausted, with some probability of including the arms on the margin. We choose to keep the name KL-UCB for this straightforward generalization of the original KL-UCB algorithm.

The definition of the upper bound is closely related to that of given in (7). Let be a problem-specific operator mapping each empirical distribution function to an element of the model . Furthermore, let be a non-decreasing function, where this function is usually chosen so that . The UCB is then defined as


As we will see, the closed form expression for can be made slightly more explicit for exponential family models, though the expression still has the same general flavor. If a number satisfies , then this implies that, for every for which , . Consequently, .

We now describe two settings in which the algorithm that we have described achieves the optimal asymptotic regret bound. These two settings and the presentation thereof follow Cappé et al. [2013a]. The first family of distributions we consider for is a canonical one-dimensional exponential family . For some dominating measure (not necessarily Lebesgue), open set , and twice-differentiable strictly convex function , is a set of distributions such that

We assume that the open set is the natural parameter space, i.e. the set of all such that . We define the corresponding (open) set of expectations by and its closure by . We have omitted the dependence of on and in the notation. It is easily verified that .

For the moment suppose that is such that . In this case we let denote the maximum likelihood operator so that returns the unique distribution in indexed by the satisfying . Thus, in this setting where , the UCB then takes the form of the expression in (16).

More generally, we must deal with the case that equals or . For , define by convention , , and analogously for and . Finally, define and to be zero. This then gives the following general expression for that we use to replace (16) in the KL-UCB Algorithm:


Note that this definition of does not explicitly include a mapping from empirical distribution functions to elements of the model . Thus we have avoided any problems that could arise in defining such a mapping when falls on the boundary of . The above optimization problem can be solved by noting that is convex, and so one can first identify the minimizing this function, and then perform a root-finding method for monotone functions to (approximately) identify the largest at which .
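The root-finding step can be sketched for Bernoulli rewards, where the divergence is minimized at the empirical mean and increases to its right, so that bisection on the interval from the empirical mean to 1 suffices. This is a minimal illustration: the function names, the tolerance, and passing the exploration level as a plain number (for instance the logarithm of the round index) are assumptions, not the exact index of the paper.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clipping."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def klucb_index(empirical_mean, pulls, exploration, tol=1e-9):
    """Approximate the largest x in [empirical_mean, 1] such that
    pulls * kl(empirical_mean, x) <= exploration, by bisection."""
    if pulls == 0:
        return 1.0  # no data yet: be maximally optimistic
    lo, hi = empirical_mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if pulls * kl_bernoulli(empirical_mean, mid) <= exploration:
            lo = mid   # mid is still inside the confidence region
        else:
            hi = mid   # mid is outside: shrink from the right
    return lo

# Hypothetical usage: the index shrinks towards the mean as pulls accumulate.
few = klucb_index(0.5, 100, math.log(1000))
many = klucb_index(0.5, 1000, math.log(1000))
```

Returning the lower end of the final bracket keeps the reported index inside the confidence region, at the price of being at most `tol` below the exact bound.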

The KL-UCB variant that we have presented achieves the asymptotic regret bound in the setting where .

Theorem 6 (Optimality for single parameter exponential families).

Suppose that . Further let for and . This variant of KL-UCB satisfies (11), (12), (13) and (14). Thus, KL-UCB achieves the asymptotic regret lower bound (10) for uniformly efficient algorithms.

Another interesting family of distributions for is a set of distributions on with finite support. If the support of is instead bounded in some , then the observations can be rescaled to when selecting which arm to pull using the linear transformation .

If is equal to , then Cappé et al. [2013a] observe that (16) rewrites as

where, for a measure , we use to denote the support of . They furthermore observe that this expression admits an explicit solution via the method of Lagrange multipliers.

Theorem 7 (Optimality for finitely supported distributions).

Suppose that . Let denote the identity map and for and . Suppose that for all . The variant of KL-UCB satisfies (11), (12), (13) and (14). Thus, KL-UCB achieves the asymptotic regret lower bound (10) for uniformly efficient algorithms.

In both theorems, the little-oh notation hides the problem-dependent but -independent quantities. In the proofs of Theorems 6 and 7 we refer to equations in Cappé et al. [2013b] where the reader can find explicit finite-sample, problem-dependent expressions for the term in (11) for the settings of Theorems 6 and 7. The argument used to establish (12) considers similar terms to those that appear in the proof of (11), though the simplest argument for establishing (12) (which, for brevity, is the one that we have elected to present here) invokes asymptotics. The argument used to establish (13) in these settings, on the other hand, seems to be fundamentally asymptotic and does not appear to easily yield finite sample constants. Nonetheless, this is to our knowledge the first handling of thick margins in the multiple-play bandit literature, and so we believe that our rate- and constant-optimal regret guarantee is of interest despite its asymptotic nature.

Moreover, though not presented in detail here, our proof techniques can be used to establish a finite-time regret guarantee that is rate-optimal, namely is , but is constant-suboptimal. To obtain this bound, we note that, by Proposition 3, it suffices to combine (i) the previously-discussed finite-time variants of (11) and (12) that can result from the proof of Theorem 7 and (ii) the following finite-time variant of (13), which must hold for all and some :

for arms : (18)

This guarantee is asymptotically weaker than that in (13) in the sense that the term has been replaced by , but is stronger than (13) in the sense that we require a finite-time bound on the term rather than only an asymptotic guarantee. Though we did not explicitly establish the above in our proof of Theorem 7, only a minor modification to the proof is needed. Specifically, by (29), it suffices to obtain a finite-time upper bound on for all and . This upper bound can be found by noting that the proof of Lemma 18 shows that , and explicit finite-sample constants can be computed for this bound just as they can for (11). Plugging this into (29) then establishes (18), which in turn establishes a finite-time regret bound. This finite-time regret bound will be valid even if contains more than one arm.

4.2 Thompson Sampling

Parameters For each arm , let be a prior distribution on .
for  do
     For each arm , draw .
     Let .
     For , let .
     if  is non-empty then
         For , let .      
     Draw from any distribution with marginal probabilities .
     Draw the corresponding rewards , .
     For each , obtain a new posterior by updating with the observation .
     For each , let .
Algorithm Thompson Sampling

Thompson sampling uses Bayesian ideas to account for the uncertainty in the estimated reward distributions. In a classical bandit setting, one first posits a (typically non-informative) prior over the means of the reward distributions. At each time, one then updates the posterior, takes a random draw of the means from the posterior, and pulls the arm whose posterior draw is the largest. In our setting, this corresponds to drawing the subset of arms for which the ratio of posterior draw to cost is largest (up until the budget constraint is met), which generalizes the idea initially proposed by Thompson [1933]. In the above algorithm, we focus on independent priors so that the only posteriors updated at time are those of arms in . At time , Thompson sampling first draws one sample from the posterior distribution on the mean of each arm , and then selects a subset according to an oracle strategy assuming are the true parameters.

We prove the optimality of Thompson sampling for Bernoulli rewards, for the particular choice of a uniform prior distribution on the mean of each arm. Note that the algorithm is easy to implement in that case, since is a Beta distribution with parameters and . Our proof relies on the same techniques as those used to prove the optimality of Thompson sampling in the standard bandit setting for Bernoulli rewards by Agrawal and Goyal [2012]. We note that Komiyama et al. [2015] also made use of some of the techniques in Agrawal and Goyal [2012] to prove the optimality of Thompson sampling for Bernoulli rewards in the multiple-play bandit setting.
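One round of this procedure can be sketched as follows for Bernoulli rewards with a uniform prior, in which case the posterior of an arm with S observed successes and F observed failures is Beta(S + 1, F + 1). This is an illustrative sketch under stated assumptions rather than the exact algorithm above. The function names are hypothetical. The oracle is taken to be a greedy fractional-knapsack rule on the ratio of sampled mean to cost, and it skips arms with a non-positive ratio as a simplified stand-in for the pseudo-arm. The subset is drawn with independent coin flips, which is one of many distributions with the required marginal probabilities.

```python
import numpy as np


def oracle_probabilities(values, costs, budget):
    """Greedy fractional-knapsack rule (assumed form of the oracle):
    fill the budget in decreasing order of value-to-cost ratio and
    give the boundary arm a fractional pull probability."""
    ratios = values / costs
    q = np.zeros(len(values))
    remaining = budget
    for a in np.argsort(-ratios):
        # Assumption: arms with non-positive ratio are never worth pulling.
        if ratios[a] <= 0 or remaining <= 0:
            break
        q[a] = min(1.0, remaining / costs[a])
        remaining -= q[a] * costs[a]
    return q


def thompson_round(successes, failures, costs, budget, rng):
    """Sample means from the Beta posteriors, then draw a subset of arms."""
    theta = rng.beta(successes + 1, failures + 1)  # uniform prior = Beta(1, 1)
    q = oracle_probabilities(theta, costs, budget)
    pulled = rng.random(len(q)) < q  # independent draws with marginals q
    return pulled, q


def update_posteriors(successes, failures, pulled, rewards):
    """Update the counts in place, only for arms pulled this round."""
    successes[pulled] += rewards[pulled]
    failures[pulled] += 1 - rewards[pulled]
```

With `values = [0.9, 0.5, 0.1]`, unit costs, and a budget of 1.5, `oracle_probabilities` returns pull probabilities `[1.0, 0.5, 0.0]`, so the expected cost equals the budget.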

Theorem 8 (Optimality for Bernoulli rewards).

If the reward distributions are Bernoulli and is a standard uniform distribution for each , then Thompson sampling satisfies (11), (12), (13) and (14). Thus, Thompson sampling achieves the asymptotic regret lower bound (10) for uniformly efficient algorithms.

For any and , the proof shows that Thompson sampling satisfies

The proof gives an explicit bound on the term that depends on both the problem and the choice of . We conclude by noting that, as for KL-UCB, our proof techniques can be easily adapted to give a rate-optimal but constant-suboptimal finite-time regret bound, where this bound will be valid even if contains more than one arm.

5 Numerical Experiments

We now run four simulations to evaluate our theoretical results in practice, all with Bernoulli reward distributions, a horizon of , and . The simulation settings are displayed in Table 1. Simulations 1-3 are run using Monte Carlo repetitions, and Simulation 4 is run using repetitions to reduce Monte Carlo uncertainty.

Sim 1
Sim 2
Sim 3
Sim 4
Table 1: Simulation settings considered. Simulations 1 and 3 have non-unique margins so that must be less than one for at least one arm for the budget constraint to be satisfied. In Simulation 3, the pseudo-arm is in , and in Simulation 4 arm is in .

For , we define the KL-UCB algorithm as the instance of KL-UCB using the function . Note that the use of both KL-UCB 3 and KL-UCB 1 is theoretically justified by the results of Theorems 6 and 7, as Bernoulli distributions satisfy the conditions of both theorems. In the settings of Simulations 1 and 2, which represent multiple-play bandit instances as is an integer in and the cost of pulling each arm is one, we compare Thompson sampling and KL-UCB to the ESCB algorithm of Combes et al. [2015b]. As briefly explained earlier, ESCB is a generalization of the KL-UCB algorithm, designed for the combinatorial semi-bandit setting (which includes multiple-play). This algorithm computes an upper confidence bound for the sum of the arm means for each of the candidate sets , defined as the optimal value of


and draws the arms in the set with the maximal index. Just like KL-UCB, ESCB uses confidence bounds whose level relies on a function such that . Because the optimization problems solved to compute the indices (17) and (19) are different, the functions used by KL-UCB and ESCB are not directly comparable. Nonetheless, a side-by-side comparison of the two algorithms seems to indicate that for ESCB is comparable to for KL-UCB. Combes et al. prove an regret bound (with a sub-optimal constant) for the version of ESCB corresponding to the constant , which we refer to as ESCB 4.

Figure 1: Regret of the four algorithms with theoretical guarantees. ESCB is only run for Simulations 1 and 2, for which the cost is identically one for all arms.

Figure 1 displays the regret of the four algorithms with theoretical guarantees. All but ESCB 4 have been proven to be asymptotically optimal, and thus are guaranteed to achieve the theoretical lower bound asymptotically. In our finite sample simulation, Thompson sampling performs better than this theoretical guarantee may suggest (the regret lower bounds at time are approximately equal to and in Simulations 1 and 2, respectively). Indeed, Thompson sampling outperforms the KL-UCB algorithms in all but Simulation 4, while KL-UCB 1 outperforms KL-UCB 3 and KL-UCB 3 outperforms ESCB 4 in Simulations 1 and 2. To give the reader intuition on the relative performance of KL-UCB variants, note that in the proofs of Theorems 6 and 7 we prove that the number of pulls on each suboptimal arm is upper bounded by , with an explicit finite sample constant for the term. While