Budget-Constrained Multi-Armed Bandits with Multiple Plays
Abstract
We study the multi-armed bandit problem with multiple plays and a budget constraint for both the stochastic and the adversarial setting. At each round, exactly $K$ out of $N$ possible arms have to be played (with $1 \leq K \leq N$). In addition to observing the individual rewards for each arm played, the player also learns a vector of costs which has to be covered with an a-priori defined budget $B$. The game ends when the sum of current costs associated with the played arms exceeds the remaining budget.
First, we analyze this setting for the stochastic case, for which we assume each arm to have an underlying cost and reward distribution with support $[c_{\min}, 1]$ and $[0, 1]$, respectively. We derive an Upper Confidence Bound (UCB) algorithm which achieves $O(NK^4 \log B)$ regret.
Second, for the adversarial case in which the entire sequence of rewards and costs is fixed in advance, we derive an upper bound on the regret of order $O(\sqrt{NB \log(N/K)})$ utilizing an extension of the well-known Exp3 algorithm. We also provide upper bounds that hold with high probability and a lower bound of order $\Omega((1 - K/N)^2 \sqrt{NB/K})$.
Datong P. Zhou^{1} and Claire J. Tomlin^{2} ^{1}Dept. of Mechanical Engineering, ^{2}Dept. of Electrical Engineering and Computer Sciences University of California, Berkeley, CA 94720 {datong.zhou, tomlin}@berkeley.edu
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Introduction
The multi-armed bandit (MAB) problem has been extensively studied in machine learning and statistics as a means to model online sequential decision making. In the classic setting popularized by (?), (?), the decision-maker selects exactly one arm at a given round $t$, given the observations of realized rewards from arms played in previous rounds $1, \ldots, t-1$. The goal is to maximize the cumulative reward over a fixed horizon $T$, or equivalently, to minimize regret, which is defined as the difference between the cumulative gain that would have been achieved had the decision-maker always played the best arm and the realized cumulative gain. The analysis of this setting reflects the fundamental tradeoff between the desire to learn better arms (exploration) and the possibility to play arms believed to have high payoff (exploitation).
Practical applications of the MAB problem include placement of online advertising to maximize the click-through rate, in particular online sponsored search auctions (?) and ad-exchange platforms (?), channel selection in radio networks (?), or learning to rank web documents (?). As acknowledged by (?), taking an action (playing an arm) in practice is inherently costly, yet the vast majority of existing bandit-related work used to analyze such examples forgoes any notion of cost. Furthermore, the above-mentioned applications rarely proceed in a strictly sequential way. A more realistic scenario is a setting in which, at each round, multiple actions are taken among the set of all possible choices.
These two shortcomings motivate the theme of this paper, as we investigate the MAB problem under a budget constraint in a setting with time-varying rewards and costs and multiple plays. More precisely, given an a-priori defined budget $B$, at each round the decision maker selects a combination of $K$ distinct arms from $N$ available arms and observes the individual costs and rewards, which corresponds to the semi-bandit setting. The player pays for the materialized costs until the remaining budget is exhausted, at which point the algorithm terminates and the cumulative reward is compared to the theoretical optimum. This comparison defines the weak regret, which is the expected difference between the payout under the best fixed choice of arms for all rounds and the actual gain. In this paper, we investigate both the stochastic and the adversarial case. For the stochastic case, we derive an upper bound on the expected regret of order $O(NK^4 \log B)$, utilizing Algorithm UCB-MB inspired by the upper confidence bound algorithm UCB1 first introduced by (?). For the adversarial case, Algorithm Exp3.M.B upper and lower-bounds the regret with $O(\sqrt{NB \log(N/K)})$ and $\Omega((1 - K/N)^2 \sqrt{NB/K})$, respectively. These findings extend existing results from (?) and (?), as we also provide an upper bound that holds with high probability. To the best of our knowledge, this is the first work that addresses the adversarial budget-constrained case, which we therefore consider to be the main contribution of this paper.
Related Work
In the extant literature, attempts to make sense of a cost component in MAB problems occur in (?) and (?), who assume time-invariant costs and cast the setting as a knapsack problem with only the rewards being stochastic. In contrast, (?) proposed algorithm UCB-BV, where per-round costs and rewards are sampled in an IID fashion from unknown distributions to derive an upper bound on the regret of order . The papers that are closest to our setting are (?) and (?). The former investigates the stochastic case with a resource consumption. Unlike our case, however, the authors allow for the existence of a “null arm”, which is tantamount to skipping rounds, and obtain an upper bound of order rather than compared to our case. The latter paper focuses on the stochastic case, but does not address the adversarial setting at all.
The extension of the single play to the multiple plays case, where at each round $K$ arms have to be played, was introduced in (?) and (?). However, their analysis is based on the original bandit formulation introduced by (?), where the regret bounds only hold asymptotically (in particular not for a finite time), rely on hard-to-compute index policies, and are distribution-dependent. Influenced by the works of (?) and (?), who popularized the usage of easy-to-compute upper confidence bounds (UCB), a recent line of work has further investigated the combinatorial bandit setting. For example, (?) derived an regret bound in the stochastic semi-bandit setting, utilizing a policy they termed “Learning with Linear Rewards” (LLR). Similarly, (?) utilize a framework where the decision-maker queries an oracle that returns a fraction of the optimal reward. Other settings, less relevant to this paper, are found in (?) and later (?), who consider the adversarial bandit setting, where only the sum of losses for the selected arms can be observed. Furthermore, (?) investigate bandit slate problems to take into account the ordering of the arms selected at each round. Lastly, (?) utilize Thompson Sampling to model the stochastic MAB problem.
2 Main Results
In this section, we formally define the budgeted, multiple-play multi-armed bandit setup and present the main theorems, whose results are provided in Table 1 together with a comparison to existing results in the literature. We first describe the stochastic setting (Section 2.1) and then proceed to the adversarial one (Section 2.2). Illuminating proofs for the theorems in this section are presented in Section 3. Technical proofs are relegated to the supplementary document.
Algorithm  Upper Bound  Lower Bound  Authors 
Exp3  (?)  
Exp3.M  (?)  
Exp3.M.B  This paper  
Exp3.P  (?)  
Exp3.P.M  This paper  
Exp3.P.M.B  This paper  
UCB1  (?)  
LLR  (?)  
UCB-BV  (?)  
UCB-MB  This paper 
2.1 Stochastic Setting
The definition of the stochastic setting is based on the classic setup introduced in (?), but is enriched by a cost component and a multiple play constraint. Specifically, given a bandit with $N$ distinct arms, each arm indexed by is associated with an unknown reward and cost distribution with unknown means and , respectively. Realizations of costs and rewards are independently and identically distributed. At each round $t$, the decision maker plays exactly $K$ arms ($1 \leq K \leq N$) and subsequently observes the individual costs and rewards only for the played arms, which corresponds to the semi-bandit setting. Before the game starts, the player is given a budget $B$ to pay for the materialized costs , where denotes the indexes of the arms played at time $t$. The game terminates as soon as the sum of costs at round $t$, namely exceeds the remaining budget.
Notice the minimum $c_{\min}$ on the support of the cost distributions. This assumption is not only made for practical reasons, as many applications of bandits come with a minimum cost, but also to guarantee well-defined “bang-per-buck” ratios , which our analysis in this paper relies on.
The goal is to design a deterministic algorithm such that the expected payout is maximized, given the budget and multiple play constraints. Formally:
(1)  
subject to  
In (1), is the stopping time of algorithm and indicates after how many steps the algorithm terminates, namely when the budget is exhausted. The expectation is taken over the randomness of the reward and cost distributions.
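To make the termination rule concrete, the following minimal Python sketch plays a policy until the total cost of a round exceeds the remaining budget. The names `play_until_budget_exhausted`, `policy`, and `draw` are hypothetical and do not appear in the paper. The sketch also assumes that the round that cannot be paid for yields no reward, which is one possible convention rather than the paper's stated one.

```python
import numpy as np

def play_until_budget_exhausted(policy, draw, budget, max_rounds=1_000_000):
    """Run a policy until one round's total cost exceeds the remaining budget.

    policy(t)      -> indices of the K arms to play in round t (hypothetical interface)
    draw(t, arms)  -> (rewards, costs) observed for the played arms only
    Returns (cumulative gain, number of completed rounds, leftover budget).
    """
    remaining = float(budget)
    gain = 0.0
    completed = 0
    for t in range(1, max_rounds + 1):
        arms = policy(t)
        rewards, costs = draw(t, arms)
        round_cost = float(np.sum(costs))
        if round_cost > remaining:
            # Assumption: the round that cannot be paid for yields no reward.
            break
        remaining -= round_cost
        gain += float(np.sum(rewards))
        completed = t
    return gain, completed, remaining

# Deterministic toy instance: K = 2 arms per round, each costing 0.5 and paying 0.25.
gain, completed, remaining = play_until_budget_exhausted(
    policy=lambda t: [0, 1],
    draw=lambda t, arms: (np.full(len(arms), 0.25), np.full(len(arms), 0.5)),
    budget=3.5,
)
```

In this toy instance each round costs 1.0, so three rounds are paid for and the fourth one stops the game with 0.5 of the budget left unused.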
The performance of algorithm is evaluated on its expected regret , which is defined as the difference between the expected payout (gain) under the optimal strategy (which in each round plays , namely the set of arms with the largest bang-per-buck ratios) and the expected payout under algorithm :
(2) 
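As an illustration of the optimal strategy described above, the sketch below ranks arms by the ratio of mean reward to mean cost and returns the $K$ best ones. The function name `optimal_arm_set` and the example means are hypothetical. In the actual problem the means are unknown to the player, so this is only the benchmark that the regret is measured against.

```python
import numpy as np

def optimal_arm_set(mean_rewards, mean_costs, K):
    """Return the K arms with the largest mean-reward / mean-cost ratios."""
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    mean_costs = np.asarray(mean_costs, dtype=float)
    if np.any(mean_costs <= 0):
        # A positive lower bound on costs keeps the ratios well defined.
        raise ValueError("mean costs must be strictly positive")
    ratios = mean_rewards / mean_costs
    # Stable sort on the negated ratios: largest ratio first, ties by index.
    order = np.argsort(-ratios, kind="stable")
    return sorted(order[:K].tolist()), ratios

# Hypothetical instance with N = 4 arms and K = 2 plays per round.
best_set, ratios = optimal_arm_set(
    mean_rewards=[0.9, 0.5, 0.4, 0.8],
    mean_costs=[0.9, 0.25, 0.8, 0.5],
    K=2,
)
```

Arm 0 has the highest mean reward in this example, yet it is not part of the optimal set because arms 1 and 3 deliver more reward per unit of cost.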
Our main result in Theorem 1 upper bounds the regret achieved with Algorithm 1. Similar to (?) and (?), we maintain time-varying upper confidence bounds for each arm
(3) 
where denotes the sample mean of the observed bang-per-buck ratios up to time $t$ and the exploration term defined in Algorithm 1. At each round, the $K$ arms associated with the largest confidence bounds are played. For initialization purposes, we allow all arms to be played exactly once prior to the while-loop.
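The selection step can be sketched as follows. The exploration term of Algorithm 1 is not reproduced here, so the sketch substitutes the standard UCB1 bonus $\sqrt{2 \ln t / n_i}$ as a stand-in; this is an assumption, not the paper's term. The function name `select_arms_by_ucb` and its arguments are likewise hypothetical.

```python
import math

def select_arms_by_ucb(mean_ratios, counts, t, K):
    """Pick the K arms with the largest upper confidence bounds.

    mean_ratios[i]: sample mean of the bang-per-buck ratio of arm i
    counts[i]:      number of times arm i has been played so far
    The bonus sqrt(2 ln t / n) is the UCB1 term, used here as a stand-in.
    """
    ucb = []
    for mean, n in zip(mean_ratios, counts):
        if n == 0:
            ucb.append(math.inf)  # unplayed arms are tried first
        else:
            ucb.append(mean + math.sqrt(2.0 * math.log(t) / n))
    # Sort by decreasing index value, breaking ties by arm index.
    ranked = sorted(range(len(ucb)), key=lambda i: (-ucb[i], i))
    return sorted(ranked[:K]), ucb

# Arm 2 has the lowest sample mean but was played only once.
chosen, ucb = select_arms_by_ucb([1.0, 2.0, 0.5], [10, 10, 1], t=21, K=2)
```

In this example the rarely played arm 2 is selected despite its low sample mean, because its confidence interval is still wide.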
Theorem 1.
There exist constants , , and , which are functions of only, such that Algorithm 1 (UCB-MB) achieves expected regret
(4) 
In Theorem 1, denotes the smallest possible difference of bang-per-buck ratios among non-optimal selections , i.e. the second best choice of arms:
(5) 
Similarly, the proof of Theorem 1 also relies on the largest such difference , which corresponds to the worst possible choice of arms:
(6) 
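The two gaps can be computed by brute force on a small instance. The sketch below reads the verbal definitions as differences between the summed bang-per-buck ratios of the optimal set and those of every other set of $K$ arms; this reading, the function name `ratio_gaps`, and the example ratios are assumptions.

```python
from itertools import combinations

def ratio_gaps(ratios, K):
    """Return (best set, smallest gap, largest gap) over all non-optimal K-subsets.

    Requires K < len(ratios) so that at least one non-optimal subset exists.
    """
    sums = {
        subset: sum(ratios[i] for i in subset)
        for subset in combinations(range(len(ratios)), K)
    }
    best = max(sums, key=sums.get)
    gaps = [sums[best] - value for subset, value in sums.items() if subset != best]
    return best, min(gaps), max(gaps)

# Hypothetical bang-per-buck ratios for N = 4 arms, K = 2 plays.
best, delta_min, delta_max = ratio_gaps([1.0, 2.0, 0.5, 1.6], K=2)
```

Here the optimal set is arms 1 and 3, the second best choice swaps arm 3 for arm 0, and the worst choice plays arms 0 and 2. Enumeration is exponential in $K$ and is meant for illustration only.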
Comparing the bound given in Theorem 1 to the results in Table 1, we recover the bound from (?) for the single-play case.
2.2 Adversarial Setting
We now consider the adversarial case that makes no assumptions on the reward and cost distributions whatsoever. The setup for this case was first proposed and analyzed by (?) for the single play case (i.e. $K = 1$), a fixed horizon $T$, and an oblivious adversary. That is, the entire sequence of rewards for all arms is fixed in advance and in particular cannot be adaptively changed during runtime. The proposed randomized algorithm Exp3 enjoys regret. Under semi-bandit feedback, where the rewards for a given round are observed for each arm played, (?) derived a variation of the single-play Exp3 algorithm, which they called Exp3.M and enjoys regret , where $K$ is the number of plays per round.
We consider the extension of the classic setting as in (?), where the decision maker has to play exactly arms. For each arm played at round , the player observes the reward and, unlike in previous settings, additionally the cost . As in the stochastic setting (Section 2.1), the player is given a budget to pay for the costs incurred, and the algorithm terminates after rounds when the sum of materialized costs in round exceeds the remaining budget. The gain of algorithm is the sum of observed rewards up to and including round . The expected regret is defined as in (2), where the gain of algorithm is compared against the best set of arms that an omniscient algorithm , which knows the reward and cost sequences in advance, would select, given the budget . In contrast to the stochastic case, the expectation is now taken with respect to algorithm ’s internal randomness.
Upper Bounds on the Regret
We begin with upper bounds on the regret for the budget-constrained MAB with multiple plays and later transition towards lower bounds and upper bounds that hold with high probability. Algorithm 2, which we call Exp3.M.B, provides a randomized algorithm to achieve sublinear regret. Similar to the original Exp3 algorithm developed by (?), Algorithm Exp3.M.B maintains a set of time-varying weights for all arms, from which the probabilities for each arm being played at time $t$ are calculated (line 10). As noted in (?), the probabilities sum to $K$ (because exactly $K$ arms need to be played), which requires the weights to be capped at a value (line 3) such that the probabilities are kept in the range $[0, 1]$. In each round, the player draws a set of distinct arms of cardinality $K$, where each arm has probability of being included in (line 11). This is done by employing algorithm DependentRounding introduced by (?), which runs in time and space. At the end of each round, the observed rewards and costs for the played arms are turned into estimates and such that and for (line 16). Arms with are updated according to , which assigns larger weights as increases and decreases, as one might expect.
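One round of the procedure described above can be sketched as follows. The sketch assumes the Exp3.M form of the probabilities, $p_i = K\left((1-\gamma)\, w_i / \sum_j w_j + \gamma/N\right)$, the Exp3.M capping rule, and a learning rate of $K\gamma/N$ applied to the difference of the importance-weighted reward and cost estimates. These formulas, the parameter name `gamma`, and all function names are assumptions made for illustration and are not copied from Algorithm 2.

```python
import numpy as np

def capped_probabilities(weights, K, gamma):
    """Return (p, capped): inclusion probabilities summing to K and the capped arms."""
    w = np.asarray(weights, dtype=float).copy()
    N = w.size
    capped = np.zeros(N, dtype=bool)
    if K == N:                      # every arm is played with probability one
        return np.ones(N), capped
    c = (1.0 / K - gamma / N) / (1.0 - gamma)
    if w.max() >= c * w.sum():      # otherwise some probability would exceed one
        order = np.argsort(-w)
        sorted_w = w[order]
        tail = w.sum()
        for m in range(1, N):       # try capping the m largest weights
            tail -= sorted_w[m - 1]  # total weight outside the top m
            alpha = c * tail / (1.0 - c * m)
            if alpha >= sorted_w[m]:
                break
        capped[order[:m]] = True
        w[capped] = alpha
    p = K * ((1.0 - gamma) * w / w.sum() + gamma / N)
    return p, capped

def dependent_rounding(p, rng, tol=1e-9):
    """Sample a set of arms so that arm i is included with probability p[i]."""
    p = np.asarray(p, dtype=float).copy()
    while True:
        frac = np.flatnonzero((p > tol) & (p < 1.0 - tol))
        if frac.size < 2:
            break
        i, j = frac[0], frac[1]
        a = min(1.0 - p[i], p[j])
        b = min(p[i], 1.0 - p[j])
        # Each branch makes at least one of p[i], p[j] integral
        # while preserving both marginals in expectation.
        if rng.random() < b / (a + b):
            p[i] += a
            p[j] -= a
        else:
            p[i] -= b
            p[j] += b
    return np.flatnonzero(p > 0.5)

def update_weights(weights, capped, p, arms, rewards, costs, K, gamma):
    """Exponential update on estimated reward minus estimated cost (uncapped arms only)."""
    w = np.asarray(weights, dtype=float).copy()
    N = w.size
    r_hat = np.zeros(N)
    c_hat = np.zeros(N)
    r_hat[arms] = np.asarray(rewards, dtype=float) / p[arms]
    c_hat[arms] = np.asarray(costs, dtype=float) / p[arms]
    free = ~capped
    w[free] *= np.exp(K * gamma / N * (r_hat[free] - c_hat[free]))
    return w

rng = np.random.default_rng(0)
weights = np.array([100.0, 1.0, 1.0, 1.0])
p, capped = capped_probabilities(weights, K=2, gamma=0.1)
arms = dependent_rounding(p, rng)
new_weights = update_weights(weights, capped, p, arms,
                             rewards=np.full(arms.size, 0.8),
                             costs=np.full(arms.size, 0.3), K=2, gamma=0.1)
```

In the example, the dominant arm 0 is capped so that its probability equals one, the remaining three arms share the second slot equally, and only the uncapped arm that was actually played has its weight increased.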
Initialize: for , gain .
Theorem 2.
Algorithm Exp3.M.B achieves regret
(7) 
where is an upper bound on , the maximal gain of the optimal algorithm. This bound is of order .
The runtime and the space complexity of Algorithm Exp3.M.B are linear in the number of arms, i.e. $O(N)$. If no bound on exists, we have to modify Algorithm 2. Specifically, the weights are now updated as follows:
(8) 
This replaces the original update step in line 16 of Algorithm 2. As in Algorithm Exp3.1 in (?), we use an adaptation of Algorithm 2, which we call Exp3.1.M.B, see Algorithm 3. In Algorithm 3, we define cumulative expected gains and losses
(9a)  
(9b) 
and make the following, necessary assumption:
Assumption 1.
for all possible combinations and .
Assumption 1 is a natural assumption, which is motivated by “individual rationality” reasons. In other words, a user will only play the bandit algorithm if the reward at any given round, for any possible choice of arms, is at least as large as the cost incurred for playing. Under the caveat of this assumption, Algorithm Exp3.1.M.B utilizes Algorithm Exp3.1.M as a subroutine in each epoch until termination.
Proposition 1.
For the multiple plays case with budget, the regret of Algorithm Exp3.1.M.B is upper bounded by
(10) 
Lower Bound on the Regret
Theorem 3 provides a lower bound of order on the weak regret of algorithm Exp3.M.B.
Theorem 3.
For , the weak regret of Algorithm Exp3.M.B is lower bounded as follows:
(11) 
where . Choosing as
yields the bound
(12) 
This lower bound differs from the upper bound in Theorem 2 by a factor of . For the single-play case $K = 1$, this factor is , which recovers the gap from (?).
High Probability Upper Bounds on the Regret
For a fixed number of rounds (no budget considerations) and single play per round (), (?) proposed Algorithm Exp3.P to derive the following upper bound on the regret that holds with probability at least :
(13) 
Theorem 4 extends the non-budgeted case to the multiple play case.
Theorem 4.
For the multiple play algorithm () and a fixed number of rounds , the following bound on the regret holds with probability at least :
(14) 
For $K = 1$, (14) recovers (13) save for the constants, which is due to a better tuning in this paper compared to (?). Agreeing with intuition, this upper bound becomes zero for the edge case $K = N$.
Theorem 4 can be derived by using a modified version of Algorithm 2, which we name Exp3.P.M. The necessary modifications to Exp3.M.B are motivated by Algorithm Exp3.P in (?) and are provided in the following:

Replace the outer while loop with for do

Initialize parameter :

Initialize weights for :

Update weights for as follows:
(15)
Since there is no notion of cost in Theorem 4, we do not need to update any cost terms.
Theorem 5.
For the multiple play algorithm () and the budget , the following bound on the regret holds with probability at least :
(16)  
3 Proofs
Proof of Theorem 1
The proof of Theorem 1 is divided into two technical lemmas introduced in the following. Due to space constraints, the proofs are relegated to the supplementary document.
First, we bound the number of times a non-optimal selection of arms is made up to stopping time . For this purpose, let us define a counter for each arm , initialized to zero for . Each time a non-optimal vector of arms is played, that is, , we increment the smallest counter in the set :
(17) 
Ties are broken randomly. By definition, the number of times arm has been played until time is greater than or equal to its counter . Further, the sum of all counters is exactly the number of suboptimal choices made so far:
Lemma 1 bounds the value of from above.
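The counter bookkeeping can be illustrated with a short sketch. The function name `update_counters` is hypothetical. Ties are broken by the smallest arm index here to keep the example reproducible, whereas the text breaks them randomly.

```python
def update_counters(counters, played, optimal):
    """Increment the smallest counter among the played arms after a non-optimal play.

    Returns the index of the incremented arm, or None if the optimal set was played.
    """
    if set(played) == set(optimal):
        return None
    # Assumption: ties go to the smallest index (the text uses random tie-breaking).
    arm = min(played, key=lambda i: (counters[i], i))
    counters[arm] += 1
    return arm

counters = [0, 0, 0, 0]
optimal = [1, 3]
history = [[0, 1], [1, 3], [0, 2]]
incremented = [update_counters(counters, played, optimal) for played in history]
suboptimal_plays = sum(1 for played in history if set(played) != set(optimal))
```

After these three rounds the counters sum to the number of non-optimal plays, and each counter is no larger than the number of times the corresponding arm was played.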
Lemma 1.
Upon termination of algorithm , there have been at most suboptimal actions. Specifically, for each :
Secondly, we relate the stopping time of algorithm to the optimal action :
Lemma 2.
The stopping time is bounded as follows:
where , , and are the same positive constants as in Theorem 1 that depend only on .
Proof of Theorem 2
The proof of Theorem 2 is influenced by the proof methods for Algorithms Exp3 by (?) and Exp3.M by (?). The main challenge is the absence of a well-defined time horizon due to the time-varying costs. To remedy this problem, we define , which allows us to first express the regret as a function of . In a second step, we relate to the budget $B$.
Proof of Proposition 1
The proof of Proposition 1 is divided into the following two lemmas:
Lemma 3.
For any subset of unique elements from , :
(18)  
where and denote the first and last time step at epoch , respectively.
Lemma 4.
The total number of epochs is bounded by
(19) 
where .
Proof of Theorem 3
The proof follows existing procedures for deriving lower bounds in adversarial bandit settings, see (?), (?). The main challenges are found in generalizing the single play setting to the multiple play setting () as well as incorporating a notion of cost associated with bandits.
Select exactly $K$ out of $N$ arms at random to be the arms in the “good” subset . For these arms, let at each round be Bernoulli distributed with bias , and the cost attain and with probability and , respectively, for some to be specified later. All other arms are assigned rewards and and costs and independently at random. Let denote the expectation of a random variable conditional on as the set of good arms. Let denote the expectation with respect to a uniform assignment of costs and rewards to all arms. Lemma 5 is an extension of Lemma A.1 in (?) to the multiple-play case with cost considerations:
Lemma 5.
Let be any function defined on reward and cost sequences of length less than or equal . Then, for the best action set :
where denotes the total number of plays of arms in during rounds through , that is:
Lemma 5, whose proof is relegated to the supplementary document, allows us to bound the gain under the existence of optimal arms by treating the problem as a uniform assignment of costs and rewards to arms. The technical parts of the proof can also be found in the supplementary document.
Proof of Theorem 4
The proof strategy is to acknowledge that Algorithm Exp3.P.M uses upper confidence bounds to update weights (15). Lemma 6 asserts that these confidence bounds are valid, namely that they upper bound the actual gain with probability at least , where .
Lemma 6.
For ,
where denotes an arbitrary subset of unique elements from . denotes the upper confidence bound for the optimal gain.
Next, Lemma 7 provides a lower bound on the gain of Algorithm Exp3.P.M as a function of the maximal upper confidence bound.
Lemma 7.
For , the gain of Algorithm Exp3.P.M is bounded below as follows:
(20) 
where denotes the upper confidence bound of the optimal gain achieved with optimal set .
Proof of Theorem 5
The proof of Theorem 5 proceeds in the same fashion as in Theorem 4. Importantly, the upper confidence bounds now include a cost term. Lemma 8 is the equivalent to Lemma 6 for the budget constrained case:
Lemma 8.
For ,
where denotes an arbitrary time-invariant subset of unique elements from . denotes the upper confidence bound for the cumulative optimal gain minus the cumulative cost incurred after rounds (the stopping time when the budget is exhausted):
(21) 
In Lemma 8, denotes the optimal cumulative reward under the optimal set chosen in (21). and denote the cumulative expected reward and cost of arm after exhaustion of the budget (that is, after rounds), respectively.
Lastly, Lemma 9 lower bounds the actual gain of Algorithm Exp3.P.M.B as a function of the upper confidence bound (21).
Lemma 9.
For , the gain of Algorithm Exp3.P.M.B is bounded below as follows:
4 Discussion and Conclusion
We discussed the budget-constrained multi-armed bandit problem with $N$ arms, $K$ multiple plays, and an a-priori defined budget $B$. We explored the stochastic as well as the adversarial case and provided algorithms to derive regret bounds in the budget $B$. For the stochastic setting, our algorithm UCB-MB enjoys regret $O(NK^4 \log B)$. In the adversarial case, we showed that algorithm Exp3.M.B enjoys an upper bound on the regret of order $O(\sqrt{NB \log(N/K)})$ and a lower bound $\Omega((1 - K/N)^2 \sqrt{NB/K})$. Lastly, we derived upper bounds that hold with high probability.
Our work can be extended in several dimensions in future research. For example, the incorporation of a budget constraint in this paper leads us to believe that a logical extension is to integrate ideas from economics, in particular mechanism design, into the multiple plays setting (one might think about auctioning off multiple items simultaneously) (?). A possible idea is to investigate to which extent the regret varies as the number of plays $K$ increases. Further, we believe that in such settings, repeated interactions with customers (playing arms) give rise to strategic considerations, in which customers can misreport their preferences in the first few rounds to maximize their long-run surplus. While the works of (?) and (?) investigate repeated interactions with a single player only, we believe an extension to a pool of buyers is worth exploring. In this setting, we would expect that the extent of strategic behavior decreases as the number of plays in each round increases, since the decision-maker could simply ignore users in future rounds who previously declined offers.
References
 [Agrawal, Hegde, and Teneketzis 1990] Agrawal, R.; Hegde, M. V.; and Teneketzis, D. 1990. Multi-Armed Bandits with Multiple Plays and Switching Cost. Stochastics and Stochastic Reports 29:437–459.
 [Agrawal 2002] Agrawal, R. 2002. Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem. Machine Learning 47:235–256.
 [Amin, Rostamizadeh, and Syed 2013] Amin, K.; Rostamizadeh, A.; and Syed, U. 2013. Learning Prices for Repeated Auctions with Strategic Buyers. Advances in Neural Information Processing Systems 1169–1177.
 [Anantharam, Varaiya, and Walrand 1986] Anantharam, V.; Varaiya, P.; and Walrand, J. 1986. Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem - Part I: IID Rewards. IEEE Transactions on Automatic Control 32:968–976.
 [Auer et al. 2002] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The Nonstochastic Multi-Armed Bandit Problem. SIAM Journal on Computing 32:48–77.
 [Auer, Cesa-Bianchi, and Fischer 2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning 47:235–256.
 [Babaioff, Sharma, and Slivkins 2009] Babaioff, M.; Sharma, Y.; and Slivkins, A. 2009. Characterizing Truthful Multi-Armed Bandit Mechanisms. Proceedings of the 10th ACM Conference on Electronic Commerce 79–88.
 [Badanidiyuru, Kleinberg, and Slivkins 2013] Badanidiyuru, A.; Kleinberg, R.; and Slivkins, A. 2013. Bandits with Knapsacks. Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science 207–216.
 [Cesa-Bianchi and Lugosi 2006] Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.
 [Cesa-Bianchi and Lugosi 2009] Cesa-Bianchi, N., and Lugosi, G. 2009. Combinatorial Bandits. Proceedings of the 22nd Annual Conference on Learning Theory.
 [Chakraborty et al. 2010] Chakraborty, T.; Even-Dar, E.; Guha, S.; Mansour, Y.; and Muthukrishnan, S. 2010. Selective Call Out and Real Time Bidding. WINE 6484:145–157.
 [Chen, Wang, and Yuan 2013] Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial Bandits: General Framework, Results and Applications. International Conference on Machine Learning.
 [Combes et al. 2015] Combes, R.; M. Sadegh Talebi; Proutiere, A.; and Lelarge, M. 2015. Combinatorial Bandits Revisited. Advances in Neural Information Processing Systems 2116–2124.
 [Ding et al. 2013] Ding, W.; Qin, T.; Zhang, X.-D.; and Liu, T.-Y. 2013. Multi-Armed Bandit with Budget Constraint and Variable Costs. Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence.
 [Gai, Krishnamachari, and Jain 2012] Gai, Y.; Krishnamachari, B.; and Jain, R. 2012. Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations. IEEE/ACM Transactions on Networking 20(5):1466–1478.
 [Gandhi, Khuller, and Parthasarathy 2006] Gandhi, R.; Khuller, S.; and Parthasarathy, S. 2006. Dependent Rounding and its Applications to Approximation Algorithms. Journal of the ACM (JACM) 53(3):324–360.
 [Huang, Liu, and Ding 2008] Huang, S.; Liu, X.; and Ding, Z. 2008. Opportunistic Spectrum Access in Cognitive Radio Networks. IEEE INFOCOM 2008 Proceedings 2101–2109.
 [Kale, Reyzin, and Schapire 2010] Kale, S.; Reyzin, L.; and Schapire, R. E. 2010. Non-Stochastic Bandit Slate Problems. Advances in Neural Information Processing Systems 1054–1062.
 [Komiyama, Honda, and Nakagawa 2015] Komiyama, J.; Honda, J.; and Nakagawa, H. 2015. Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-Armed Bandit Problems with Multiple Plays. International Conference on Machine Learning 1152–1161.
 [Lai and Robbins 1985] Lai, T. L., and Robbins, H. 1985. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics 6(1):4–22.
 [Mohri and Munoz 2014] Mohri, M., and Munoz, A. 2014. Optimal Regret Minimization in Posted-Price Auctions with Strategic Buyers. Advances in Neural Information Processing Systems 1871–1879.
 [Radlinski, Kleinberg, and Joachims 2008] Radlinski, F.; Kleinberg, R.; and Joachims, T. 2008. Learning Diverse Rankings with Multi-Armed Bandits. Proceedings of the 25th International Conference on Machine Learning 784–791.
 [Rusmevichientong and Williamson 2005] Rusmevichientong, P., and Williamson, D. P. 2005. An Adaptive Algorithm for Selecting Profitable Keywords for Search-Based Advertising Services. Proceedings of the 7th ACM Conference on Electronic Commerce 260–269.
 [Tran-Thanh et al. 2010] Tran-Thanh, L.; Chapman, A.; F.L. Munoz De Cote; Jose, E.; Rogers, A.; and Jennings, N. R. 2010. Epsilon-First Policies for Budget-Limited Multi-Armed Bandits. Twenty-Fourth AAAI Conference on Artificial Intelligence 1211–1216.
 [Tran-Thanh et al. 2012] Tran-Thanh, L.; Chapman, A.; Rogers, A.; and Jennings, N. R. 2012. Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits. Twenty-Sixth AAAI Conference on Artificial Intelligence 1134–1140.
 [Uchiya, Nakamura, and Kudo 2010] Uchiya, T.; Nakamura, A.; and Kudo, M. 2010. Algorithms for Adversarial Bandit Problems with Multiple Plays. International Conference on Algorithmic Learning Theory 375–389.
 [Xia et al. 2016] Xia, Y.; Qin, T.; Ma, W.; Yu, N.; and Liu, T.-Y. 2016. Budgeted Multi-Armed Bandits with Multiple Plays. Proceedings of the 25th International Joint Conference on Artificial Intelligence 2210–2216.
Appendix A Proofs for Stochastic Setting
For convenience, we restate all theorems and lemmas before proving them.
Proof of Lemma 1
Recall the definition of counters . Each time a non-optimal vector of arms is played, that is, , we increment the smallest counter in the set :
(22) 
Lemma 1.
Upon termination of algorithm , there have been at most suboptimal actions. Specifically, for each :
Proof.
Let denote the indicator that is incremented at time . Then for any time , we have: