# Budget-Constrained Multi-Armed Bandits with Multiple Plays

## Abstract

We study the multi-armed bandit problem with multiple plays and a budget constraint for both the stochastic and the adversarial setting. At each round, exactly $K$ out of $N$ possible arms have to be played (with $1 \le K \le N$). In addition to observing the individual rewards for each arm played, the player also learns a vector of costs which has to be covered with an a-priori defined budget $B$. The game ends when the sum of current costs associated with the played arms exceeds the remaining budget.

First, we analyze this setting for the stochastic case, for which we assume each arm to have an underlying cost and reward distribution with support $[c_{\min}, 1]$ and $[0, 1]$, respectively. We derive an Upper Confidence Bound (UCB) algorithm which achieves $O(NK^4 \log B)$ regret.

Second, for the adversarial case in which the entire sequence of rewards and costs is fixed in advance, we derive an upper bound on the regret of order $O(\sqrt{NB \log(N/K)})$ utilizing an extension of the well-known Exp3 algorithm. We also provide upper bounds that hold with high probability and a lower bound of order $\Omega((1 - K/N)^2 \sqrt{NB/K})$.

## 1 Introduction

The multi-armed bandit (MAB) problem has been extensively studied in machine learning and statistics as a means to model online sequential decision making. In the classic setting popularized by [Auer, Cesa-Bianchi, and Fischer 2002] and [Auer et al. 2002], the decision-maker selects exactly one arm at a given round $t$, given the observations of realized rewards from arms played in previous rounds $1, \ldots, t-1$. The goal is to maximize the cumulative reward over a fixed horizon $T$, or equivalently, to minimize regret, which is defined as the difference between the cumulative gain that would have been achieved had the decision-maker always played the best arm, and the realized cumulative gain. The analysis of this setting reflects the fundamental tradeoff between the desire to learn better arms (exploration) and the possibility to play arms believed to have high payoff (exploitation).

Practical applications of the MAB problem include placement of online advertising to maximize the click-through rate, in particular online sponsored search auctions [Rusmevichientong and Williamson 2005] and ad-exchange platforms [Chakraborty et al. 2010], channel selection in radio networks [Huang, Liu, and Ding 2008], or learning to rank web documents [Radlinski, Kleinberg, and Joachims 2008]. As acknowledged by [Ding et al. 2013], taking an action (playing an arm) in practice is inherently costly, yet the vast majority of existing bandit-related work used to analyze such examples forgoes any notion of cost. Furthermore, the above-mentioned applications rarely proceed in a strictly sequential way. A more realistic scenario is a setting in which, at each round, multiple actions are taken among the set of all possible choices.

These two shortcomings motivate the theme of this paper, as we investigate the MAB problem under a budget constraint in a setting with time-varying rewards and costs and multiple plays. More precisely, given an a-priori defined budget $B$, at each round the decision maker selects a combination of $K$ distinct arms from $N$ available arms and observes the individual costs and rewards, which corresponds to the semi-bandit setting. The player pays for the materialized costs until the remaining budget is exhausted, at which point the algorithm terminates. The cumulative reward is then compared to the theoretical optimum, which defines the weak regret: the expected difference between the payout under the best fixed choice of arms for all rounds and the actual gain. In this paper, we investigate both the stochastic and the adversarial case. For the stochastic case, we derive an upper bound on the expected regret of order $O(NK^4 \log B)$, utilizing Algorithm UCB-MB, which is inspired by the upper confidence bound algorithm UCB1 first introduced by [Auer, Cesa-Bianchi, and Fischer 2002]. For the adversarial case, we establish upper and lower bounds on the regret of Algorithm Exp3.M.B of order $O(\sqrt{NB \log(N/K)})$ and $\Omega((1 - K/N)^2 \sqrt{NB/K})$, respectively. These findings extend existing results from [Uchiya, Nakamura, and Kudo 2010] and [Auer et al. 2002], as we also provide an upper bound that holds with high probability. To the best of our knowledge, this is the first work that addresses the adversarial budget-constrained case, which we therefore consider to be the main contribution of this paper.

### Related Work

In the extant literature, attempts to incorporate a cost component into MAB problems occur in [Tran-Thanh et al. 2010] and [Tran-Thanh et al. 2012], who assume time-invariant costs and cast the setting as a knapsack problem with only the rewards being stochastic. In contrast, [Ding et al. 2013] proposed algorithm UCB-BV, where per-round costs and rewards are sampled in an IID fashion from unknown distributions, to derive an upper bound on the regret of order $O(\log B)$. The papers that are closest to our setting are [Badanidiyuru, Kleinberg, and Slivkins 2013] and [Xia et al. 2016]. The former investigates the stochastic case with resource consumption. Unlike our case, however, the authors allow for the existence of a "null arm", which is tantamount to skipping rounds, and obtain an upper bound of order $O(\sqrt{B})$ rather than $O(\log B)$ as in our case. The latter paper focuses on the stochastic case, but does not address the adversarial setting at all.

The extension of the single play to the multiple plays case, where at each round $K \ge 1$ arms have to be played, was introduced in [Anantharam, Varaiya, and Walrand 1986] and [Agrawal, Hegde, and Teneketzis 1990]. However, their analysis is based on the original bandit formulation introduced by [Lai and Robbins 1985], where the regret bounds only hold asymptotically (in particular not for a finite time), rely on hard-to-compute index policies, and are distribution-dependent. Influenced by the works of [Auer, Cesa-Bianchi, and Fischer 2002] and [Agrawal 2002], who popularized the usage of easy-to-compute upper confidence bounds (UCB), a recent line of work has further investigated the combinatorial bandit setting. For example, [Gai, Krishnamachari, and Jain 2012] derived an $O(NK^4 \log T)$ regret bound in the stochastic semi-bandit setting, utilizing a policy they termed "Learning with Linear Rewards" (LLR). Similarly, [Chen, Wang, and Yuan 2013] utilize a framework where the decision-maker queries an oracle that returns a fraction of the optimal reward. Other settings, less relevant to this paper, are found in [Cesa-Bianchi and Lugosi 2009] and later [Combes et al. 2015], who consider the adversarial bandit setting, where only the sum of losses for the selected arms can be observed. Furthermore, [Kale, Reyzin, and Schapire 2010] investigate bandit slate problems to take into account the ordering of the arms selected at each round. Lastly, [Komiyama, Honda, and Nakagawa 2015] utilize Thompson Sampling to model the stochastic MAB problem.

## 2 Main Results

In this section, we formally define the budgeted, multiple play multi-armed bandit setup and present the main theorems, whose results are provided in Table 1 together with a comparison to existing results in the literature. We first describe the stochastic setting (Section 2.1) and then proceed to the adversarial one (Section 2.2). Illuminating proofs for the theorems in this section are presented in Section 3. Technical proofs are relegated to the supplementary document.

### 2.1 Stochastic Setting

The definition of the stochastic setting is based on the classic setup introduced in [Auer, Cesa-Bianchi, and Fischer 2002], but is enriched by a cost component and a multiple play constraint. Specifically, given a bandit with $N$ distinct arms, each arm indexed by $i \in [N]$ is associated with an unknown reward and cost distribution with unknown means $\mu_r^i$ and $\mu_c^i$, respectively. Realizations of costs $c_{i,t} \in [c_{\min}, 1]$ and rewards $r_{i,t} \in [0, 1]$ are independently and identically distributed. At each round $t$, the decision maker plays exactly $K$ arms ($1 \le K \le N$) and subsequently observes the individual costs and rewards only for the played arms, which corresponds to the semi-bandit setting. Before the game starts, the player is given a budget $B$ to pay for the materialized costs $\{c_{i,t} : i \in a_t\}$, where $a_t$ denotes the indexes of the arms played at time $t$. The game terminates as soon as the sum of costs at round $t$, namely $\sum_{i \in a_t} c_{i,t}$, exceeds the remaining budget.

Notice the minimum $c_{\min}$ on the support of the cost distributions. This assumption is made not only for practical reasons, as many applications of bandits come with a minimum cost, but also to guarantee well-defined "bang-per-buck" ratios $\mu^i = \mu_r^i / \mu_c^i$, which our analysis in this paper relies on.

The goal is to design a deterministic algorithm $\mathcal{A}$ such that the expected payout is maximized, given the budget and multiple play constraints. Formally:

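$$
\begin{aligned}
\underset{a_1, \ldots, a_{\tau_{\mathcal{A}}(B)}}{\text{maximize}} \quad & E\left[\sum_{t=1}^{\tau_{\mathcal{A}}(B)} \sum_{i \in a_t} r_{i,t}\right] \\
\text{subject to} \quad & E\left[\sum_{t=1}^{\tau_{\mathcal{A}}(B)} \sum_{i \in a_t} c_{i,t}\right] \le B, \\
& |a_t| = K, \quad 1 \le K \le N \quad \forall\, t \in [\tau_{\mathcal{A}}(B)].
\end{aligned} \tag{1}
$$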

In (1), $\tau_{\mathcal{A}}(B)$ is the stopping time of algorithm $\mathcal{A}$ and indicates after how many steps the algorithm terminates, namely when the budget is exhausted. The expectation is taken over the randomness of the reward and cost distributions.
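The stopping rule can be made concrete with a short simulation loop. The following is a minimal sketch, not the paper's algorithm: the hooks `choose_arms` and `sample_round` are hypothetical names introduced here, and it assumes that the rewards of the round whose costs can no longer be covered are not collected.

```python
def run_budgeted_game(choose_arms, sample_round, budget, k):
    """Play rounds until the costs of the chosen arms exceed the remaining budget.

    choose_arms(t, history) -> list of k distinct arm indices (hypothetical policy hook)
    sample_round(t) -> (rewards, costs), two lists indexed by arm (hypothetical environment hook)
    Returns (total_reward, rounds_completed).
    """
    remaining = budget
    total_reward = 0.0
    history = []  # one (arms, observed rewards, observed costs) entry per completed round
    t = 1
    while True:
        arms = choose_arms(t, history)
        assert len(set(arms)) == k, "exactly k distinct arms must be played"
        rewards, costs = sample_round(t)
        round_cost = sum(costs[i] for i in arms)
        if round_cost > remaining:
            # ASSUMPTION: the game ends here and this round's rewards are not counted.
            return total_reward, t - 1
        remaining -= round_cost
        total_reward += sum(rewards[i] for i in arms)
        # Semi-bandit feedback: only the values of the played arms are revealed.
        history.append((arms, [rewards[i] for i in arms], [costs[i] for i in arms]))
        t += 1
```

The loop terminates whenever every cost is bounded below by a positive constant, which is the role the minimum cost plays in the setting above.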

The performance of algorithm $\mathcal{A}$ is evaluated on its expected regret $R_{\mathcal{A}}(B)$, which is defined as the difference between the expected payout (gain) under the optimal strategy $\mathcal{A}^*$ (which in each round plays $a^*$, namely the set of $K$ arms with the largest bang-per-buck ratios) and the expected payout under algorithm $\mathcal{A}$:

$$R_{\mathcal{A}}(B) = E[G_{\mathcal{A}^*}(B)] - E[G_{\mathcal{A}}(B)]. \tag{2}$$

Our main result in Theorem 1 upper bounds the regret achieved with Algorithm 1. Similar to [Auer, Cesa-Bianchi, and Fischer 2002] and [Ding et al. 2013], we maintain time-varying upper confidence bounds $U_{i,t}$ for each arm $i$

$$U_{i,t} = \bar{\mu}^i_t + e_{i,t}, \tag{3}$$

where $\bar{\mu}^i_t$ denotes the sample mean of the observed bang-per-buck ratios up to time $t$ and $e_{i,t}$ the exploration term defined in Algorithm 1. At each round, the $K$ arms associated with the $K$ largest confidence bounds are played. For initialization purposes, we allow all $N$ arms to be played exactly once prior to the while-loop.
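The selection step described above can be sketched as follows. This is an illustration only: the default exploration radius is a generic UCB1-style placeholder chosen here as an assumption, not the exploration term of Algorithm 1, and the function and argument names are hypothetical.

```python
import math

def ucb_select(sum_ratio, n_plays, t, k, exploration=None):
    """Return the indices of the k arms with the largest index U_i = mean_i + e_i.

    sum_ratio[i]: sum of observed reward/cost ratios of arm i
    n_plays[i]:   number of times arm i has been played (>= 1 after initialization)
    """
    if exploration is None:
        # ASSUMPTION: generic UCB1-style radius, NOT the term used in Algorithm 1.
        exploration = lambda n, rnd: math.sqrt(2.0 * math.log(rnd) / n)
    scored = []
    for i, (s, n) in enumerate(zip(sum_ratio, n_plays)):
        mean = s / n  # sample mean of the bang-per-buck ratio of arm i
        scored.append((mean + exploration(n, t), i))
    scored.sort(reverse=True)  # largest upper confidence bound first
    return sorted(i for _, i in scored[:k])
```

With equal play counts the radius is identical across arms, so the rule reduces to picking the $K$ largest sample means; an arm that has been played rarely receives a larger radius and is therefore revisited.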

###### Theorem 1.

There exist constants $c_1$, $c_2$, and $c_3$, which do not depend on the budget $B$, such that Algorithm 1 (UCB-MB) achieves expected regret

$$R_{\mathcal{A}}(B) \le c_1 + c_2 \log(B + c_3) = O(NK^4 \log B). \tag{4}$$

In Theorem 1, $\Delta_{\min}$ denotes the smallest possible difference of bang-per-buck ratios among non-optimal selections $a \neq a^*$, i.e. the gap to the second best choice of arms:

$$\Delta_{\min} = \sum_{j \in a^*} \mu_j - \max_{a,\, a \neq a^*} \sum_{j \in a} \mu_j. \tag{5}$$

Similarly, the proof of Theorem 1 also relies on the largest such difference $\Delta_{\max}$, which corresponds to the worst possible choice of arms:

$$\Delta_{\max} = \sum_{j \in a^*} \mu_j - \min_{a,\, a \neq a^*} \sum_{j \in a} \mu_j. \tag{6}$$
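For small instances the two gaps can be computed by brute force over all $K$-subsets. The sketch below is only meant to illustrate the definitions; the function name and the example ratios are made up, and it treats exactly one maximizing subset as $a^*$ (so a tie for the best subset yields a gap of zero).

```python
from itertools import combinations

def gaps(ratios, k):
    """Return (delta_min, delta_max) for bang-per-buck ratios and k plays per round."""
    n = len(ratios)
    assert 1 <= k < n, "need at least one non-optimal selection"
    # Total ratio of every k-subset, best first.
    totals = sorted(
        (sum(ratios[i] for i in subset) for subset in combinations(range(n), k)),
        reverse=True,
    )
    best, others = totals[0], totals[1:]
    delta_min = best - others[0]   # gap to the second best selection
    delta_max = best - others[-1]  # gap to the worst selection
    return delta_min, delta_max
```

For ratios `[0.9, 0.8, 0.5, 0.1]` and `k = 2`, the best subset has total 1.7, the runner-up 1.4 and the worst 0.6, so the gaps are 0.3 and 1.1.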

Comparing the bound given in Theorem 1 to the results in Table 1, we recover the bound from [Ding et al. 2013] for the single-play case.

### 2.2 Adversarial Setting

We now consider the adversarial case, which makes no assumptions on the reward and cost distributions whatsoever. The setup for this case was first proposed and analyzed by [Auer et al. 2002] for the single play case (i.e. $K = 1$), a fixed horizon $T$, and an oblivious adversary. That is, the entire sequence of rewards for all arms is fixed in advance and in particular cannot be adaptively changed during runtime. The proposed randomized algorithm Exp3 enjoys $O(\sqrt{NT \log N})$ regret. Under semi-bandit feedback, where the rewards for a given round are observed for each arm played, [Uchiya, Nakamura, and Kudo 2010] derived a variation of the single-play Exp3 algorithm, which they called Exp3.M and which enjoys regret $O(\sqrt{NTK \log(N/K)})$, where $K$ is the number of plays per round.

We consider the extension of the classic setting as in [Uchiya, Nakamura, and Kudo 2010], where the decision maker has to play exactly $K$ arms. For each arm $i$ played at round $t$, the player observes the reward $r_i(t)$ and, unlike in previous settings, additionally the cost $c_i(t)$. As in the stochastic setting (Section 2.1), the player is given a budget $B$ to pay for the costs incurred, and the algorithm terminates after $\tau_{\mathcal{A}}(B)$ rounds, when the sum of materialized costs in round $\tau_{\mathcal{A}}(B)$ exceeds the remaining budget. The gain $G_{\mathcal{A}}(B)$ of algorithm $\mathcal{A}$ is the sum of observed rewards up to and including round $\tau_{\mathcal{A}}(B) - 1$. The expected regret $R_{\mathcal{A}}(B)$ is defined as in (2), where the gain of algorithm $\mathcal{A}$ is compared against the best set of arms that an omniscient algorithm $\mathcal{A}^*$, which knows the reward and cost sequences in advance, would select, given the budget $B$. In contrast to the stochastic case, the expectation is now taken with respect to algorithm $\mathcal{A}$'s internal randomness.

#### Upper Bounds on the Regret

We begin with upper bounds on the regret for the budget-constrained MAB with multiple plays and later transition to lower bounds and to upper bounds that hold with high probability. Algorithm 2, which we call Exp3.M.B, is a randomized algorithm that achieves sublinear regret. Similar to the original Exp3 algorithm developed by [Auer et al. 2002], Algorithm Exp3.M.B maintains a set of time-varying weights $w_i(t)$ for all arms, from which the probabilities $p_i(t)$ of each arm being played at time $t$ are calculated (line 10). As noted in [Uchiya, Nakamura, and Kudo 2010], the probabilities sum to $K$ (because exactly $K$ arms need to be played), which requires the weights to be capped at a threshold value (line 3) such that the probabilities are kept in the range $[0, 1]$. In each round, the player draws a set of distinct arms $a_t$ of cardinality $|a_t| = K$, where each arm $i$ has probability $p_i(t)$ of being included in $a_t$ (line 11). This is done by employing algorithm DependentRounding introduced by [Gandhi, Khuller, and Parthasarathy 2006], which runs in $O(K)$ time and $O(N)$ space. At the end of each round, the observed rewards and costs for the played arms are turned into estimates $\hat{r}_i(t)$ and $\hat{c}_i(t)$ such that $E[\hat{r}_i(t)] = r_i(t)$ and $E[\hat{c}_i(t)] = c_i(t)$ for every arm $i$ (line 16). Arms whose weights were not capped are updated according to $\hat{r}_i(t) - \hat{c}_i(t)$, which assigns larger weights as $\hat{r}_i(t)$ increases and $\hat{c}_i(t)$ decreases, as one might expect.
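The capping and sampling steps can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 2: the probability rule is the Exp3.M-style mixture $p_i = K[(1-\gamma) w'_i / \sum_j w'_j + \gamma/N]$ with capped weights $w'_i$, the cap is found by bisection instead of a closed form, and the rounding routine is a simple quadratic-time variant of dependent rounding. All function names are hypothetical.

```python
import random

def capped_probabilities(weights, k, gamma, iters=200):
    """Return (probs, capped) with sum(probs) == k and every prob <= 1 (up to rounding)."""
    n = len(weights)
    assert 1 <= k < n and 0.0 < gamma < 1.0
    rho = (1.0 / k - gamma / n) / (1.0 - gamma)  # largest admissible share w'_i / sum(w')
    capped_w, capped = list(weights), []
    if max(weights) >= rho * sum(weights):
        lo, hi = 0.0, max(weights)
        for _ in range(iters):  # bisection on the cap; the share is increasing in the cap
            mid = 0.5 * (lo + hi)
            share = mid / sum(min(w, mid) for w in weights)
            if share < rho:
                lo = mid
            else:
                hi = mid
        capped_w = [min(w, hi) for w in weights]
        capped = [i for i, w in enumerate(weights) if w >= hi]
    total = sum(capped_w)
    probs = [k * ((1.0 - gamma) * w / total + gamma / n) for w in capped_w]
    return probs, capped

def dependent_rounding(probs, rng, eps=1e-12):
    """Draw a set of size sum(probs) in which arm i is included with probability probs[i]."""
    p = list(probs)
    frac = [i for i, x in enumerate(p) if eps < x < 1.0 - eps]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        a = min(1.0 - p[i], p[j])
        b = min(p[i], 1.0 - p[j])
        if rng.random() < b / (a + b):  # shift mass towards i, keeping p[i] + p[j] fixed
            p[i] += a
            p[j] -= a
        else:                           # shift mass towards j
            p[i] -= b
            p[j] += b
        frac = [m for m in frac if eps < p[m] < 1.0 - eps]
    return sorted(i for i, x in enumerate(p) if x > 0.5)
```

Each rounding step makes at least one of the two probabilities integral while preserving their sum and their expectations, which is why the final set has exactly $K$ elements and the marginal inclusion probabilities are unchanged.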

###### Theorem 2.

Algorithm Exp3.M.B achieves regret

$$R \le 2.63 \sqrt{1 + \frac{B}{g c_{\min}}} \sqrt{g N \log(N/K)} + K, \tag{7}$$

where $g$ is an upper bound on $G_{\max}$, the maximal gain of the optimal algorithm. This bound is of order $O(\sqrt{BN \log(N/K)})$.

The runtime of Algorithm Exp3.M.B and its space complexity are linear in the number of arms, i.e. $O(N)$. If no bound $g$ on $G_{\max}$ exists, we have to modify Algorithm 2. Specifically, the weights are now updated as follows:

$$w_i(t+1) = w_i(t) \exp\left[\frac{K\gamma}{N} \left[\hat{r}_i(t) - \hat{c}_i(t)\right] \cdot \mathbb{1}_{i \in a_t}\right]. \tag{8}$$
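A minimal sketch of this multiplicative update is given below. The estimates are formed by importance weighting (observed value divided by the probability of playing the arm), which is an assumption made here for illustration; the function and argument names are hypothetical.

```python
import math

def update_weights(weights, probs, played, rewards, costs, k, gamma):
    """Apply w_i <- w_i * exp(k * gamma / n * (r_hat_i - c_hat_i)) to the played arms."""
    n = len(weights)
    new_weights = list(weights)
    for i in played:  # the indicator in the exponent leaves unplayed arms unchanged
        # ASSUMPTION: importance-weighted estimates of reward and cost.
        r_hat = rewards[i] / probs[i]
        c_hat = costs[i] / probs[i]
        new_weights[i] = weights[i] * math.exp(k * gamma / n * (r_hat - c_hat))
    return new_weights
```

An arm whose observed reward exceeds its observed cost gains weight, and the gain is larger when the arm was unlikely to be played.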

This replaces the original update step in line 16 of Algorithm 2. As in Algorithm Exp3.1 in [Auer et al. 2002], we use an adaptation of Algorithm 2, which we call Exp3.1.M.B (see Algorithm 3). In Algorithm 3, we define cumulative expected gains and losses

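$$
\begin{aligned}
\hat{G}_i(t) &= \sum_{s=1}^{t} \hat{r}_i(s), \qquad &\text{(9a)} \\
\hat{L}_i(t) &= \sum_{s=1}^{t} \hat{c}_i(s), \qquad &\text{(9b)}
\end{aligned}
$$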

and make the following, necessary assumption:

###### Assumption 1.

$\sum_{i \in a} r_i(t) \ge \sum_{i \in a} c_i(t)$ for all possible $K$-combinations $a \subset [N]$ and all rounds $t$.

Assumption 1 is a natural assumption, which is motivated by "individual rationality" reasons. In other words, a user will only play the bandit algorithm if the reward at any given round, for any possible choice of arms, is at least as large as the cost incurred for playing. Under the caveat of this assumption, Algorithm Exp3.1.M.B utilizes Algorithm Exp3.1.M as a subroutine in each epoch until termination.

###### Proposition 1.

For the multiple plays case with budget, the regret of Algorithm Exp3.1.M.B is upper bounded by

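$$
\begin{aligned}
R \le{} & 8\left[(e-1) - (e-2) c_{\min}\right] \frac{N}{K} + 2N \log\frac{N}{K} + K \\
& + 8 \sqrt{\left[(e-1) - (e-2) c_{\min}\right] (G_{\max} - B + K)\, N \log(N/K)}.
\end{aligned} \tag{10}
$$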

This bound is of order $O(\sqrt{(G_{\max} - B) N \log(N/K)})$ and, due to Assumption 1, not directly comparable to the bound in Theorem 2. One case in which (10) outperforms (7) occurs whenever only a loose upper bound $g$ on $G_{\max}$ exists or whenever $G_{\max}$, the return of the best selection of arms, is "small".

#### Lower Bound on the Regret

Theorem 3 provides a lower bound of order $\Omega((1 - K/N)^2 \sqrt{NB/K})$ on the weak regret of algorithm Exp3.M.B.

###### Theorem 3.

For $1 \le K \le N$, the weak regret of Algorithm Exp3.M.B is lower bounded as follows:

$$R \ge \varepsilon \left(B - \frac{BK}{N} - 2B c_{\min}^{-3/2} \varepsilon \sqrt{\frac{BK \log(4/3)}{N}}\right), \tag{11}$$

where $\varepsilon$ is a free parameter of the construction used in the proof of Theorem 3. Choosing $\varepsilon$ as

$$\varepsilon = \min\left(\frac{1}{4},\ \frac{(1 - K/N)\, c_{\min}^{3/2}}{4 \sqrt{\log(4/3)}} \sqrt{\frac{N}{BK}}\right)$$

yields the bound

$$R \ge \min\left(\frac{c_{\min}^{3/2} (1 - K/N)^2}{8 \sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}},\ \frac{B(1 - K/N)}{8}\right). \tag{12}$$
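As a quick check of how the choice of $\varepsilon$ produces the final bound, the following worked step substitutes the interior choice of $\varepsilon$ into the $\varepsilon$-dependent lower bound. It is a sketch; the shorthand $D$ and the symbol $\varepsilon^\star$ are introduced here only for illustration.

$$
\begin{aligned}
D &:= 2 c_{\min}^{-3/2} \sqrt{\frac{BK \log(4/3)}{N}}, \qquad R \ge \varepsilon B \left(1 - \frac{K}{N} - D\varepsilon\right), \\
\varepsilon^\star &= \frac{1 - K/N}{2D} = \frac{(1 - K/N)\, c_{\min}^{3/2}}{4 \sqrt{\log(4/3)}} \sqrt{\frac{N}{BK}}, \\
R &\ge \varepsilon^\star B \cdot \frac{1 - K/N}{2} = \frac{c_{\min}^{3/2} (1 - K/N)^2}{8 \sqrt{\log(4/3)}} \sqrt{\frac{NB}{K}}.
\end{aligned}
$$

If instead $\varepsilon^\star > 1/4$, then $\varepsilon = 1/4$ gives $D\varepsilon \le (1 - K/N)/2$ and hence $R \ge B(1 - K/N)/8$, which is the second argument of the minimum.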

This lower bound differs from the upper bound in Theorem 2 by a factor of $\sqrt{K \log(N/K)}\, (N/(N-K))^2$. For the single-play case $K = 1$, this factor is of order $\sqrt{\log N}$, which recovers the gap from [Auer et al. 2002].

#### High Probability Upper Bounds on the Regret

For a fixed number of rounds $T$ (no budget considerations) and single play per round ($K = 1$), [Auer et al. 2002] proposed Algorithm Exp3.P to derive the following upper bound on the regret that holds with probability at least $1 - \delta$:

$$G_{\max} - G_{\text{Exp3.P}} \le 4 \sqrt{NT \log(NT/\delta)} + 4 \sqrt{\frac{5}{3} NT \log N} + 8 \log\left(\frac{NT}{\delta}\right). \tag{13}$$

Theorem 4 extends the non-budgeted case to the multiple play case.

###### Theorem 4.

For the multiple play algorithm ($1 \le K \le N$) and a fixed number of rounds $T$, the following bound on the regret holds with probability at least $1 - \delta$:

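$$
\begin{aligned}
R = G_{\max} - G_{\text{Exp3.P.M}} \le{} & 2\sqrt{5} \sqrt{NKT \log(N/K)} + 8\, \frac{N-K}{N-1} \log\left(\frac{NT}{\delta}\right) \\
& + 2(1 + K^2) \sqrt{NT\, \frac{N-K}{N-1} \log\left(\frac{NT}{\delta}\right)}.
\end{aligned} \tag{14}
$$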

For $K = 1$, (14) recovers (13) save for the constants, which is due to a better tuning of the algorithm parameters in this paper compared to [Auer et al. 2002]. Agreeing with intuition, this upper bound becomes zero for the edge case $K = N$.
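To make the comparison with the single-play bound concrete, the following worked step evaluates the bound of Theorem 4 at the two edge cases (a sketch that assumes $N > 1$).

$$
\begin{aligned}
K = 1: \quad & R \le 2\sqrt{5} \sqrt{NT \log N} + 8 \log\left(\frac{NT}{\delta}\right) + 4 \sqrt{NT \log\left(\frac{NT}{\delta}\right)}, \\
K = N: \quad & \log(N/K) = 0 \ \text{ and } \ \frac{N-K}{N-1} = 0 \ \Rightarrow \ R \le 0.
\end{aligned}
$$

For $K = 1$ the second and third terms coincide with those of the Exp3.P bound, while the leading constant improves from $4\sqrt{5/3} \approx 5.16$ to $2\sqrt{5} \approx 4.47$.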

Theorem 4 can be derived by using a modified version of Algorithm 2, which we name Exp3.P.M. The necessary modifications to Exp3.M.B are motivated by Algorithm Exp3.P in [Auer et al. 2002] and are provided in the following:

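• Replace the outer while loop with: for $t = 1, \ldots, T$ do

• Initialize parameter $\alpha$:

$$\alpha = 2 \sqrt{\frac{N-K}{N-1} \log\left(\frac{NT}{\delta}\right)}.$$

• Initialize weights $w_i$ for $i \in [N]$:

$$w_i(1) = \exp\left(\frac{\alpha \gamma K^2 \sqrt{T/N}}{3}\right).$$

• Update weights for $i \in [N]$ as follows:

$$w_i(t+1) = w_i(t) \exp\left[\mathbb{1}_{i \notin \tilde{S}(t)}\, \frac{\gamma K}{3N} \left(\hat{r}_i(t) + \frac{\alpha}{p_i(t) \sqrt{NT}}\right)\right]. \tag{15}$$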

Since there is no notion of cost in Theorem 4, we do not need to update any cost terms.
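The three modifications listed above can be sketched as small helper functions. This is an illustration only: the function names are hypothetical, `capped` stands for membership of the arm in the set excluded by the indicator in the update rule, and the reward estimate `r_hat` is assumed to be computed elsewhere.

```python
import math

def exp3pm_alpha(n, k, horizon, delta):
    """Confidence parameter: 2 * sqrt((n - k) / (n - 1) * log(n * horizon / delta))."""
    return 2.0 * math.sqrt((n - k) / (n - 1) * math.log(n * horizon / delta))

def exp3pm_initial_weight(alpha, gamma, n, k, horizon):
    """Initial weight shared by all arms."""
    return math.exp(alpha * gamma * k ** 2 * math.sqrt(horizon / n) / 3.0)

def exp3pm_update(weight, r_hat, prob, capped, alpha, gamma, n, k, horizon):
    """One weight update; arms excluded by the indicator keep their weight."""
    if capped:
        return weight
    bonus = alpha / (prob * math.sqrt(n * horizon))  # confidence-width term
    return weight * math.exp(gamma * k / (3.0 * n) * (r_hat + bonus))
```

For $K = N$ the parameter $\alpha$ is zero, so the initial weights equal one and the confidence bonus disappears, matching the observation that the bound vanishes in that edge case.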

Lastly, Theorem 5 extends Theorem 4 to the budget constrained setting using algorithm Exp3.P.M.B.

###### Theorem 5.

For the multiple play algorithm ($1 \le K \le N$) and the budget $B > 0$, the following bound on the regret holds with probability at least $1 - \delta$:

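$$
\begin{aligned}
R = G_{\max} - G_{\text{Exp3.P.M.B}} \le{} & 2\sqrt{3} \sqrt{\frac{NB(1 - c_{\min})}{c_{\min}} \log\frac{N}{K}} + 4\sqrt{6}\, \frac{N-K}{N-1} \log\left(\frac{NB}{K c_{\min} \delta}\right) \\
& + 2\sqrt{6}\, (1 + K^2) \sqrt{\frac{N-K}{N-1}\, \frac{NB}{K c_{\min}} \log\left(\frac{NB}{K c_{\min} \delta}\right)}.
\end{aligned} \tag{16}
$$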

To derive bound (16), we again modify the following update rules in Algorithm 2 to obtain Algorithm Exp3.P.M.B:

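• Initialize parameter $\alpha$:

$$\alpha = 2\sqrt{6} \sqrt{\frac{N-K}{N-1} \log\left(\frac{NB}{K c_{\min} \delta}\right)}.$$

• Initialize weights $w_i$ for $i \in [N]$:

$$w_i(1) = \exp\left(\frac{\alpha \gamma K^2 \sqrt{B / (N K c_{\min})}}{3}\right).$$

• Update weights for $i \in [N]$ as follows:

$$w_i(t+1) = w_i(t) \exp\left[\mathbb{1}_{i \notin \tilde{S}(t)}\, \frac{\gamma K}{3N} \left(\hat{r}_i(t) - \hat{c}_i(t) + \frac{\alpha \sqrt{K c_{\min}}}{p_i(t) \sqrt{NB}}\right)\right].$$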

The estimated costs $\hat{c}_i(t)$ are computed as $\hat{c}_i(t) = c_i(t) / p_i(t)$ whenever arm $i$ is played at time $t$, as is done in Algorithm 2.
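The modifications listed above mirror the fixed-horizon ones once the horizon is replaced by the largest number of rounds the budget can pay for. The following worked step is a sketch under the assumption that every played arm costs at least $c_{\min}$; the symbol $T_B$ is introduced here only for illustration.

$$
\begin{aligned}
T_B &:= \frac{B}{K c_{\min}} \quad \text{(each round costs at least } K c_{\min}\text{, so at most } T_B \text{ rounds are played)}, \\
\log\frac{N T_B}{\delta} &= \log\frac{NB}{K c_{\min} \delta}, \qquad
\sqrt{\frac{T_B}{N}} = \sqrt{\frac{B}{N K c_{\min}}}, \qquad
\frac{\alpha}{p_i(t) \sqrt{N T_B}} = \frac{\alpha \sqrt{K c_{\min}}}{p_i(t) \sqrt{NB}}.
\end{aligned}
$$

Up to the additional factor $\sqrt{6}$ in $\alpha$ and the cost estimate $\hat{c}_i(t)$ in the exponent, the budgeted rules are therefore the fixed-horizon rules evaluated at $T = T_B$.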

## 3 Proofs

### Proof of Theorem 1

The proof of Theorem 1 is divided into two technical lemmas introduced in the following. Due to space constraints, the proofs are relegated to the supplementary document.

First, we bound the number of times a non-optimal selection of arms is made up to stopping time $\tau_{\mathcal{A}}(B)$. For this purpose, let us define a counter $C_{i,t}$ for each arm $i$, initialized to zero for $t = 1$. Each time a non-optimal vector of arms is played, that is, $a_t \neq a^*$, we increment the smallest counter in the set $a_t$:

$$C_{j,t} \leftarrow C_{j,t} + 1, \qquad j = \arg\min_{i \in a_t} C_{i,t}. \tag{17}$$

Ties are broken randomly. By definition, the number of times arm $i$ has been played until time $t$, denoted $n_{i,t}$, is greater than or equal to its counter $C_{i,t}$. Further, the sum of all counters is exactly the number of suboptimal choices made so far:

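$$
\begin{aligned}
n_{i,t} &\ge C_{i,t} \quad \forall\, i \in [N],\ t \in [\tau_{\mathcal{A}}(B)], \\
\sum_{i=1}^{N} C_{i,t} &= \sum_{\tau=1}^{t} \mathbb{1}(a_\tau \neq a^*) \quad \forall\, t \in [\tau_{\mathcal{A}}(B)].
\end{aligned}
$$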
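The counter bookkeeping can be illustrated with a few lines of code. This is a sketch with hypothetical names; unlike the random tie-breaking described above, ties are broken here by taking the first minimizer in the played list, which is an assumption made to keep the example deterministic.

```python
def update_counters(counters, played, optimal):
    """Increment the smallest counter among the played arms if the play is suboptimal."""
    if set(played) == set(optimal):
        return list(counters)  # optimal selection: no counter changes
    # ASSUMPTION: deterministic tie-breaking (first minimizer) instead of random.
    j = min(played, key=lambda i: counters[i])
    updated = list(counters)
    updated[j] += 1
    return updated
```

Because exactly one counter is incremented per suboptimal round and the incremented arm was itself played in that round, the sum of the counters equals the number of suboptimal rounds and no counter can exceed the play count of its arm.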

Lemma 1 bounds the value of $C_{i,t}$ from above.

###### Lemma 1.

Upon termination of algorithm $\mathcal{A}$, there have been at most $O(NK^3 \log \tau_{\mathcal{A}}(B))$ suboptimal actions. Specifically, for each $i \in [N]$:

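$$E\left[C_{i, \tau_{\mathcal{A}}(B)}\right] \le 1 + \frac{K\pi^2}{3} + (K+1) \left(\frac{\Delta_{\min} + 2K(1 + 1/c_{\min})}{c_{\min} \Delta_{\min}}\right)^2 \log \tau_{\mathcal{A}}(B).$$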

Second, we relate the stopping time of algorithm $\mathcal{A}$ to the optimal action $a^*$:

###### Lemma 2.

The stopping time $\tau_{\mathcal{A}}(B)$ is bounded as follows:

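$$\frac{B}{\sum_{i \in a^*} \mu_c^i} - c_2 - c_3 \log\left(c_1 + \frac{2B}{\sum_{i \in a^*} \mu_c^i}\right) \le \tau_{\mathcal{A}}(B) \le \frac{2B}{\sum_{i \in a^*} \mu_c^i} + c_1,$$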

where $c_1$, $c_2$, and $c_3$ are the same positive constants as in Theorem 1, which do not depend on the budget $B$.

Utilizing Lemmas 1 and 2 in conjunction with the definition of the weak regret (2) yields Theorem 1. See the supplementary document for further technicalities.

### Proof of Theorem 2

The proof of Theorem 2 is influenced by the proof methods for Algorithms Exp3 by [Auer et al. 2002] and Exp3.M by [Uchiya, Nakamura, and Kudo 2010]. The main challenge is the absence of a well-defined time horizon due to the time-varying costs. To remedy this problem, we define $T = \max(\tau_{\mathcal{A}}(B), \tau_{\mathcal{A}^*}(B))$, which allows us to first express the regret as a function of $T$. In a second step, we relate $T$ to the budget $B$.

### Proof of Proposition 1

The proof of Proposition 1 is divided into the following two lemmas:

###### Lemma 3.

For any subset $a$ of $K$ unique elements from $[N]$, $1 \le K \le N$:

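$$
\begin{aligned}
\sum_{t=S_r}^{T_r} \sum_{i \in a_t} \left(r_i(t) - c_i(t)\right) \ge{} & \sum_{i \in a} \sum_{t=S_r}^{T_r} \left(\hat{r}_i(t) - \hat{c}_i(t)\right) \\
& - 2 \sqrt{(e-1) - (e-2) c_{\min}} \sqrt{g_r N \log(N/K)},
\end{aligned} \tag{18}
$$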

where $S_r$ and $T_r$ denote the first and last time step at epoch $r$, respectively.

###### Lemma 4.

The total number of epochs $R$ is bounded by

$$2^{R-1} \le \frac{N(1 - c_{\min})}{Kc} + \sqrt{\frac{\hat{G}_{\max} - \hat{L}_{\max}}{c}} + \frac{1}{2}, \tag{19}$$

where $c$ is a positive constant.

To derive Proposition 1, we combine Lemmas 3 and 4 and utilize the fact that algorithm Exp3.1.M.B terminates after $\tau_{\mathcal{A}}(B)$ rounds. See the supplementary document for details.

### Proof of Theorem 3

The proof follows existing procedures for deriving lower bounds in adversarial bandit settings; see [Auer et al. 2002] and [Cesa-Bianchi and Lugosi 2006]. The main challenges are found in generalizing the single play setting to the multiple play setting ($K > 1$) as well as incorporating a notion of cost associated with bandits.

Select exactly $K$ out of $N$ arms at random to be the arms in the "good" subset $a^*$. For these arms, let $r_i(t)$ at each round $t$ be Bernoulli distributed with bias $\frac{1}{2} + \varepsilon$, and let the cost $c_i(t)$ attain $c_{\min}$ and $1$ with probability $\frac{1}{2} + \varepsilon$ and $\frac{1}{2} - \varepsilon$, respectively, for some $\varepsilon$ to be specified later. All other arms are assigned rewards $0$ and $1$ and costs $c_{\min}$ and $1$ independently at random. Let $E_{a^*}[\cdot]$ denote the expectation of a random variable conditional on $a^*$ as the set of good arms. Let $E_u[\cdot]$ denote the expectation with respect to a uniform assignment of costs and rewards to all arms. Lemma 5 is an extension of Lemma A.1 in [Auer et al. 2002] to the multiple-play case with cost considerations:

###### Lemma 5.

Let $f$ be any function defined on reward and cost sequences $(r, c)$ of length less than or equal to $B / (K c_{\min})$. Then, for the best action set $a^*$:

$$E_{a^*}[f(r, c)] \le E_u[f(r, c)] + \frac{B c_{\min}^{-3/2}}{2} \sqrt{-E_u[N_{a^*}] \log(1 - 4\varepsilon^2)},$$

where $N_{a^*}$ denotes the total number of plays of arms in $a^*$ during rounds $t = 1$ through $t = \tau_{\mathcal{A}}(B)$, that is:

$$N_{a^*} = \sum_{t=1}^{\tau_{\mathcal{A}}(B)} \sum_{i \in a^*} \mathbb{1}(i \in a_t).$$

Lemma 5, whose proof is relegated to the supplementary document, allows us to bound the gain under the existence of optimal arms by treating the problem as a uniform assignment of costs and rewards to arms. The technical parts of the proof can also be found in the supplementary document.

### Proof of Theorem 4

The proof strategy is to acknowledge that Algorithm Exp3.P.M uses upper confidence bounds $\hat{G}_i + \alpha \hat{\sigma}_i$ to update weights (15). Lemma 6 asserts that these confidence bounds are valid, namely that they upper bound the actual gain with probability at least $1 - \delta$, where $\delta \in (0, 1)$.

###### Lemma 6.

For ,

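$$P\left(\hat{U}^* > G_{\max}\right) \ge P\left(\bigcap_{a \subset S} \sum_{i \in a} \hat{G}_i + \alpha \hat{\sigma}_i > \sum_{i \in a} G_i\right) \ge 1 - \delta,$$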

where $a \subset S$ denotes an arbitrary subset of $K$ unique elements from $[N]$. $\hat{U}^*$ denotes the upper confidence bound for the optimal gain.

Next, Lemma 7 provides a lower bound on the gain of Algorithm Exp3.P.M as a function of the maximal upper confidence bound.

###### Lemma 7.

For , the gain of Algorithm Exp3.P.M is bounded below as follows:

$$G_{\text{Exp3.P.M}} \ge \left(1 - \frac{5}{3}\gamma\right) \hat{U}^* - \frac{3N}{\gamma} \log(N/K) - 2\alpha^2 - \alpha(1 + K^2) \sqrt{NT}, \tag{20}$$

where $\hat{U}^*$ denotes the upper confidence bound of the optimal gain achieved with optimal set $a^*$.

Therefore, combining Lemmas 6 and 7 lower bounds the actual gain of Algorithm Exp3.P.M in terms of $G_{\max}$, and hence upper bounds its regret, with high probability. See the supplementary document for technical details.

### Proof of Theorem 5

The proof of Theorem 5 proceeds in the same fashion as that of Theorem 4. Importantly, the upper confidence bounds now include a cost term. Lemma 8 is the equivalent of Lemma 6 for the budget-constrained case:

###### Lemma 8.

For ,

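$$P\left(\hat{U}^* > G_{\max} - B\right) \ge P\left(\bigcap_{a \subset S} \sum_{i \in a} \hat{G}_i - \hat{L}_i + \alpha \hat{\sigma}_i > \sum_{i \in a} G_i - L_i\right) \ge 1 - \delta,$$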

where $a \subset S$ denotes an arbitrary time-invariant subset of $K$ unique elements from $[N]$. $\hat{U}^*$ denotes the upper confidence bound for the cumulative optimal gain minus the cumulative cost incurred after $\tau_{a^*}(B)$ rounds (the stopping time when the budget is exhausted):

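$$
\begin{aligned}
a^* &= \arg\max_{a \in S} \sum_{i \in a} \sum_{t=1}^{\tau_a(B)} \left(r_i(t) - c_i(t)\right), \\
\hat{U}^* &= \sum_{i \in a^*} \left(\alpha \hat{\sigma}_i + \sum_{t=1}^{\tau_{a^*}(B)} \left(\hat{r}_i(t) - \hat{c}_i(t)\right)\right).
\end{aligned} \tag{21}
$$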

In Lemma 8, $G_{\max}$ denotes the optimal cumulative reward under the optimal set $a^*$ chosen in (21). $\hat{G}_i$ and $\hat{L}_i$ denote the cumulative expected reward and cost of arm $i$ after exhaustion of the budget (that is, after $\tau_{a^*}(B)$ rounds), respectively.

Lastly, Lemma 9 lower bounds the actual gain of Algorithm Exp3.P.M.B as a function of the upper confidence bound (21).

###### Lemma 9.

For , the gain of Algorithm Exp3.P.M.B is bounded below as follows:

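$$G_{\text{Exp3.P.M.B}} \ge \left(1 - \gamma - \frac{2\gamma}{3}\, \frac{1 - c_{\min}}{c_{\min}}\right) \hat{U}^* - \frac{3N}{\gamma} \log\frac{N}{K} - 2\alpha^2 - \alpha(1 + K^2) \sqrt{\frac{BN}{K c_{\min}}}.$$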

Combining Lemmas 8 and 9 completes the proof, see the supplementary document.

## 4 Discussion and Conclusion

We discussed the budget-constrained multi-armed bandit problem with $N$ arms, $K$ multiple plays, and an a-priori defined budget $B$. We explored the stochastic as well as the adversarial case and provided algorithms to derive regret bounds in the budget $B$. For the stochastic setting, our algorithm UCB-MB enjoys regret $O(NK^4 \log B)$. In the adversarial case, we showed that algorithm Exp3.M.B enjoys an upper bound on the regret of order $O(\sqrt{NB \log(N/K)})$ and a lower bound of order $\Omega((1 - K/N)^2 \sqrt{NB/K})$. Lastly, we derived upper bounds that hold with high probability.

Our work can be extended in several dimensions in future research. For example, the incorporation of a budget constraint in this paper leads us to believe that a logical extension is to integrate ideas from economics, in particular mechanism design, into the multiple plays setting (one might think about auctioning off multiple items simultaneously) [Babaioff, Sharma, and Slivkins 2009]. A possible idea is to investigate to what extent the regret varies as the number of plays $K$ increases. Further, we believe that in such settings, repeated interactions with customers (playing arms) give rise to strategic considerations, in which customers can misreport their preferences in the first few rounds to maximize their long-run surplus. While the works of [Amin, Rostamizadeh, and Syed 2013] and [Mohri and Munoz 2014] investigate repeated interactions with a single player only, we believe an extension to a pool of buyers is worth exploring. In this setting, we would expect that the extent of strategic behavior decreases as the number of plays $K$ in each round increases, since the decision-maker could simply ignore users in future rounds who previously declined offers.

## Appendix A Proofs for Stochastic Setting

For convenience, we restate all theorems and lemmas before proving them.

### Proof of Lemma 1

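Recall the definition of counters $C_{i,t}$. Each time a non-optimal vector of arms is played, that is, $a_t \neq a^*$, we increment the smallest counter in the set $a_t$:

$$C_{j,t} \leftarrow C_{j,t} + 1, \qquad j = \arg\min_{i \in a_t} C_{i,t}. \tag{22}$$

###### Lemma 1.

Upon termination of algorithm $\mathcal{A}$, there have been at most $O(NK^3 \log \tau_{\mathcal{A}}(B))$ suboptimal actions. Specifically, for each $i \in [N]$:

$$E\left[C_{i, \tau_{\mathcal{A}}(B)}\right] \le 1 + \frac{K\pi^2}{3} + (K+1) \left(\frac{\Delta_{\min} + 2K(1 + 1/c_{\min})}{c_{\min} \Delta_{\min}}\right)^2 \log \tau_{\mathcal{A}}(B).$$
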
###### Proof.

Let denote the indicator that