
# Quantifying the Burden of Exploration and the Unfairness of Free Riding

## Abstract

We consider the multi-armed bandit setting with a twist. Rather than having just one decision maker deciding which arm to pull in each round, we have $n$ different decision makers (agents). In the simple stochastic setting we show that one of the agents (called the free rider), who has access to the history of other agents playing some zero regret algorithm, can achieve just $O(1)$ regret, as opposed to the regret lower bound of $\Omega(\log T)$ when one decision maker is playing in isolation. In the linear contextual setting, we show that if the other agents play a particular, popular zero regret algorithm (UCB), then the free rider can again achieve $O(1)$ regret. In order to prove this result, we give a deterministic lower bound on the number of times each suboptimal arm must be pulled in UCB. In contrast, we show that the free rider cannot beat the standard single-player regret bounds in certain partial information settings.

## 1 Introduction

We consider situations where exploitation must be balanced with exploration in order to obtain optimal performance. Typically, there is a single agent who does this balancing, in order to minimize a quantity called the regret. In this paper we consider settings where there are many agents and ask how a single agent (the free rider) can benefit from the exploration of others. For example, competing pharmaceutical companies might be engaged in research for drug discovery. If one of these companies had access to the research findings of its competitors, it might greatly reduce its own exploration cost. Of course, this is an unlikely scenario since intellectual property is jealously guarded by companies. This also points to an important consideration in modeling such scenarios: The amount and type of information that one agent is able to gather about the findings of others. More realistically, and less consequentially, a recommendation system such as Yelp™ makes user ratings of restaurants publicly available. The assumption underlying such systems is that “the crowd” will explore available options so that we end up with accurate average ratings. Many problems of this sort can be modeled using the formalism of multi-armed bandits.

Multi-armed bandit problems model decision making under uncertainty [14, 4, 12]. There is an unknown reward associated with each arm, chosen either adversarially or stochastically from an underlying distribution for that arm. The decision maker has to decide which arm to pull on each round. Her goal is to minimize regret, the difference between the (expected) reward of the best arm and the total reward she obtains. Our focus in this paper will be on the stochastic case. The basic model is referred to as the stochastic bandits model. In the extension to the linear contextual bandits model, each arm $i$ has an unknown parameter vector $\theta_i$ for $i \in [k]$, where $k$ is the total number of arms. At round $t$, a context $x_t$ arrives. The expected reward for pulling arm $i$ in round $t$ is the inner product $\langle \theta_i, x_t \rangle$. A common assumption is that the reward is drawn from a sub-Gaussian distribution with this expectation by adding sub-Gaussian noise to the inner product.

In our setting, using Yelp™ as the running example, the arms correspond to restaurants. Our model differs from standard bandit models in three significant ways. First, there are $n$ decision makers rather than one; in the Yelp™ example, each decision maker corresponds to a diner. Upon visiting a restaurant, a diner samples from a distribution to determine her dining experience. In the stochastic setting, we assume that all diners have identical criteria for assessing their experiences, meaning that identical samples lead to identical rewards. Second, in the linear contextual setting, the contexts in our model are fixed in time and belong to individual decision makers. Each diner’s context vector represents the weight she assigns to various features (parameters) of a restaurant, such as innovativeness, decor, noise level, suitability for vegetarians, etc. Third, each arm has a distribution over parameter vectors instead of a fixed parameter vector. When a diner visits a restaurant, her reward is determined by taking the inner product of her context with a feature vector drawn from the restaurant’s distribution. No additional noise is added to this inner product in our model.

In the standard stochastic or linear contextual bandit setting a decision-making algorithm is called zero-regret if its expected regret over $T$ rounds is $o(T)$. It is well known that exploration is essential for achieving zero regret [12]. One algorithm that achieves the asymptotically optimal regret bound of $O(\log T)$ over $T$ rounds is the so-called Upper Confidence Bound (UCB) algorithm of Lai and Robbins [12]. In addition to maintaining a sample mean for each arm, this algorithm maintains confidence intervals around these means, where the width of the confidence interval for arm $i$ drops roughly as $1/\sqrt{N_i}$, where $N_i$ is the number of times arm $i$ has been pulled so far. The UCB algorithm then plays the arm with the highest upper limit to its confidence interval. There are many other zero-regret strategies, such as Thompson sampling [15] or conducting an initial round-robin exploration phase followed by an exploitation phase in which the apparently optimal arm is pulled [6].

#### Our results:

In our scenario there are $n$ decision makers (agents). We are interested in understanding how one decision maker can benefit from the information gleaned by others. In the simple stochastic case, there are two types of relevant information: the other players’ actions and the resulting rewards. In the full information setting, the free rider has access to both types of information. We also consider a partial information setting where the free rider may only observe the other players’ actions. We show in Theorem 1 that, even with this limited information, the free rider can achieve constant expected regret whenever some other agent plays an algorithm that achieves sublinear regret with sufficiently high probability.

For linear contextual bandits, besides the two types of information described above, a free rider might have access to the context vectors of the other agents. Given all of this information, we show in Theorem 4 that a free rider can achieve constant regret if the other agents play the UCB algorithm with a sufficiently high rate of exploration. In Theorem 5, we show that without knowledge of other players’ contexts, a free rider cannot hope to achieve constant regret.

#### Related works:

This paper asks how and when an agent may avoid doing their “fair share” of exploration. Several recent works have studied how the cost of exploration in multi-armed bandit problems is distributed, from the perspective of algorithmic fairness. Some of these works model situations where people are selected for a benefit such as a bank loan or admission to college. Arms correspond to demographic subgroups, and pulling an arm corresponds to selecting a person from this group. There is only one agent choosing which arm to pull, and the goal is to fairly distribute the cost of exploration among the arms (subgroups). Joseph, Kearns, Morgenstern, and Roth [7] introduce a notion of meritocratic fairness requiring that a more qualified arm is at least as likely to be pulled as a less qualified arm, and they show how to achieve sublinear regret while ensuring meritocratic fairness. Kannan, Kearns, Morgenstern, Pai, Roth, Vohra, and Wu [8] study incentivizing myopic agents to preserve meritocratic fairness while achieving sublinear regret. They show that although full knowledge of contexts and rewards allows for achieving both fairness and low regret, it would be prohibitively expensive to guarantee those behaviors without knowledge of contexts or rewards. Works by Bastani, Bayati, and Khosravi [2]; Kannan, Morgenstern, Roth, Waggoner, and Wu [9]; and Raghavan, Slivkins, Vaughan, and Wu [13] show that when the data is sufficiently diverse, e.g., if the contexts are randomly perturbed, then exploration may not be necessary.

There is some discussion in the economics literature of free riding in bandit settings. In the model of Bolton and Harris [3], players choose what fraction of each time unit to devote to a safe action (exploitation) and to a risky action (exploration). Rewards from these actions have deterministic and stochastic components, and the actions and rewards of all players are visible to all players at the end of each time instant. This data changes the posteriors for the payoff of the risky action. Bolton and Harris show that while the attraction of free riding drives players to play the safe action always, there is a countervailing force — that risky action by a player may enable everyone to converge to the correct posterior belief faster. They then derive the equilibrium behavior of players. Keller, Rady, and Cripps [10] consider a very similar setting where the risky arm generates a positive payoff after an exponentially distributed random time if it is good, but pays out nothing otherwise. They characterize the unique symmetric equilibrium as well as various asymmetric equilibria, and they show that aggregate payoffs are better in asymmetric equilibria than in symmetric equilibria. Klein [11] considers a 2-player, 3-armed bandit setting where there is a safe arm and two negatively-correlated risky arms, with further assumptions about their behavior. He shows that complete learning is achieved if and only if the risky arms have sufficiently high stakes, i.e., that the good risky arm has a payoff far exceeding that of the safe arm. It is clear that these models do not support having more than 2 arms (or 3 in the case of [11]) and that their goal is maximizing expected reward (and not minimizing regret). Moreover, among the arms, one is explicitly designated as safe and the other(s) as risky, a priori. In Klein’s 3-arm setting, there are more specific constraints on how those arms behave.

The rest of the paper is organized as follows: In Section 2 we formally define the models we use in this paper. In Section 3 we prove an $O(1)$ upper bound on the regret achieved by a free-riding agent in the simple stochastic case. In Section 4 we give a lower bound on the number of times any arm must be pulled under the UCB algorithm. Although we prove this lower bound for the simple stochastic case, we use it in Section 5 to prove our main result: an $O(1)$ upper bound on regret in the full-information linear contextual case. In Section 6 we give a logarithmic lower bound on the regret of a free-riding agent who is ignorant of the other contexts. Finally, in Section 7 we outline some directions for further work.

## 2 Preliminaries

### 2.1 Stochastic Model

There are $k$ arms, indexed by $i \in [k]$, and $n$ players or agents, indexed by $p \in [n]$. Arm $i$ has a reward distribution $D_i$ supported on $[-1, 1]$ with mean $\mu_i$, and $D = (D_1, \ldots, D_k)$ is the reward distribution profile. The arm with the highest mean reward is denoted by $i^*$, and we write $\mu^*$ for $\mu_{i^*}$; we assume that $i^*$ is unique. An important parameter is the gap between $\mu^*$ and the next-highest mean, which we denote

$$\Delta = \mu^* - \max_{i \in [k] \setminus \{i^*\}} \mu_i.$$

In round $t$, each player $p$ selects an arm $i_p^t$ and receives a reward $r_p^t \sim D_{i_p^t}$. We write

$$H^t = \begin{bmatrix} (i_1^1, r_1^1) & \cdots & (i_1^t, r_1^t) \\ \vdots & \ddots & \vdots \\ (i_n^1, r_n^1) & \cdots & (i_n^t, r_n^t) \end{bmatrix}$$

to denote the history of all players’ actions and rewards through round $t$. The space of all histories is

$$\mathcal{H} = \bigcup_{t \in \mathbb{N}} \left([k] \times [-1, 1]\right)^{n \times t}.$$

A deterministic policy for a player is a function $f : \mathcal{H} \to [k]$ mapping histories to arms; a randomized policy instead maps histories to distributions over $[k]$. Informally, a player with policy $f$ who observes history $H^{t-1}$ will pull arm $f(H^{t-1})$ in round $t$. A policy profile is a vector $f = (f_1, \ldots, f_n)$ where each $f_p$ is a policy for player $p$. Notice that a policy profile $f$ and a reward distribution profile $D$ together determine a distribution on histories. A policy for player $p$ is self-reliant if it depends only on $p$’s own observed actions and rewards. In contrast, a free-riding policy may use all players’ history.
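As a concrete and entirely illustrative sketch of this interaction model, the snippet below simulates a policy profile against a reward distribution profile; all type and function names here are our own, not the paper’s.

```python
from typing import Callable, List, Tuple

# A history keeps one row per player: the list of (arm, reward) pairs so far.
History = List[List[Tuple[int, float]]]
# A (deterministic) policy maps the full history to an arm index.
Policy = Callable[[History], int]

def simulate(arms: List[Callable[[], float]],
             policies: List[Policy],
             T: int) -> History:
    """Run T rounds: every player picks an arm from H^{t-1} via its policy,
    then all players' rewards are drawn and appended simultaneously."""
    history: History = [[] for _ in policies]
    for _ in range(T):
        choices = [f(history) for f in policies]   # decisions use H^{t-1}
        for p, i in enumerate(choices):
            history[p].append((i, arms[i]()))      # reward drawn from D_i
    return history

def make_self_reliant(own: Callable[[List[Tuple[int, float]]], int],
                      p: int) -> Policy:
    """A self-reliant policy inspects only player p's own row of the history."""
    return lambda h: own(h[p])
```

A free-riding policy, by contrast, is any `Policy` that reads rows other than its own.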

The regret of player $p$ under reward distribution profile $D$ and history $H^T$ is

$$\hat R_p(D, H^T) = \max_{i \in [k]} \sum_{t=1}^{T} r_i - \sum_{t=1}^{T} r_p^t,$$

where each $r_i$ is an independent sample drawn from $D_i$ in round $t$, and $p$’s pseudo-regret through round $T$ under reward distribution profile $D$ and policy profile $f$ is

$$R_p^T(D, f) = \max_{i \in [k]} \mathbb{E}_{H^T}\!\left[\sum_{t=1}^{T} r_i - \sum_{t=1}^{T} r_p^t\right],$$

where $H^T$ is drawn according to the distribution determined by $D$ and $f$. Equivalently,

$$R_p^T(D, f) = T\mu^* - \sum_{t=1}^{T} \mathbb{E}\left[r_p^t\right].$$

When the reward distribution or policy profile is clear from the context, we simply write $R_p^T$ or, in single-player settings, $R^T$.

One well-studied self-reliant policy that achieves logarithmic pseudo-regret in the stochastic setting is called $\alpha$-UCB [12]. For each arm $i$, the player maintains an upper confidence bound on $\mu_i$, and in each round, it pulls the arm with the highest upper confidence bound. The distance from each arm’s sample mean to its upper confidence bound is a function of its sample count,

$$N_{p,i}^{t-1} = \sum_{s=1}^{t-1} \mathbb{1}\left(i_p^s = i\right),$$

the number of times arm $i$ has been pulled by player $p$ through round $t - 1$. When $p$ is clear from context, such as in single-player settings, we may omit it from the subscript. The parameter $\alpha$ calibrates the balance between exploration and exploitation.

*Algorithm 1: the $\alpha$-UCB policy.*
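The pseudocode for this algorithm did not survive extraction, so here is a minimal sketch of $\alpha$-UCB as described above, using the confidence width $\sqrt{\alpha \ln t / (2 N_i)}$ that appears in the proofs below; the function and variable names are our own.

```python
import math

def alpha_ucb(pull, k, T, alpha=4.0):
    """Sketch of alpha-UCB: pull each arm once, then always pull the arm
    with the highest upper confidence bound.  `pull(i)` returns a reward
    in [-1, 1].  Returns the per-arm pull counts N_i."""
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, T + 1):
        if t <= k:
            i = t - 1                      # initialization: round-robin
        else:
            # sample mean plus confidence width sqrt(alpha ln t / (2 N_i))
            i = max(range(k), key=lambda a: means[a]
                    + math.sqrt(alpha * math.log(t) / (2 * counts[a])))
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # running-mean update
    return counts
```

Even with deterministic rewards this exhibits the behavior the paper relies on: the optimal arm dominates the pull counts, while every suboptimal arm is still pulled a slowly growing number of times.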

### 2.2 Linear Contextual Model

The linear contextual model generalizes the stochastic model. Now, each arm $i$ has a feature distribution $F_i$ supported on the $d$-dimensional closed unit ball, for some $d \in \mathbb{N}$, and $F = (F_1, \ldots, F_k)$ is the feature distribution profile. Each player $p$ has a context $x_p \in \mathbb{R}^d$, and $x = (x_1, \ldots, x_n)$ is the context profile. As before, in each round $t$, each player $p$ selects an arm $i_p^t$, but now the reward is given by sampling a feature vector $\theta \sim F_{i_p^t}$ and taking its inner product with $x_p$, i.e., $r_p^t = \langle \theta, x_p \rangle$. We write $D_{p,i}$ for the distribution of rewards from arm $i$ for player $p$, and the mean of this distribution is

$$\mu_{p,i} = \mathbb{E}_{\theta_i \sim F_i}\left[\langle \theta_i, x_p \rangle\right].$$

The optimal arm for player $p$ is $i_p^*$, and we write $\mu_p^*$ for $\mu_{p, i_p^*}$. The gap in expected reward between the best and second-best arm for player $p$ is

$$\Delta_p = \mu_p^* - \max_{i \in [k] \setminus \{i_p^*\}} \mu_{p,i}.$$

Histories, policies, policy profiles, self-reliance, and free riding are defined exactly as in the stochastic setting. The pseudo-regret of player $p$ through round $T$ under feature distribution profile $F$, context profile $x$, and policy profile $f$ is given by

$$R_p^T(F, x, f) = T\mu_p^* - \sum_{t=1}^{T} \mathbb{E}\left[r_p^t\right],$$

where the expectation is taken according to the distribution determined by $F$, $x$, and $f$.

## 3 Regret in the Stochastic Case

We now show that a free rider can achieve constant pseudo-regret by observing a player $p$ who plays any policy that achieves low regret with sufficiently high probability. This class of policies includes UCB. We consider a specific, natural free-riding policy defined by

$$\textsc{Freq}_p(H^t) = \operatorname*{argmax}_{i \in [k]} N_{p,i}^t,$$

which always pulls whichever arm has been pulled most frequently by player $p$. Notice that this policy does not require the free rider to observe player $p$’s rewards.
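As a sketch (with our own naming), the $\textsc{Freq}_p$ rule is a few lines of code; note that it consumes only player $p$’s actions, never her rewards.

```python
from collections import Counter

def freq(actions_p):
    """Return the arm player p has pulled most often so far.

    `actions_p` is the sequence of p's past arm choices; rewards are
    never consulted.  Ties break toward the lower arm index (an
    arbitrary choice of ours -- the paper does not specify one)."""
    counts = Counter(actions_p)
    best = max(counts.values())
    return min(i for i, c in counts.items() if c == best)
```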

One might suspect that it would be sufficient for player $p$’s policy $f_p$ to achieve sublinear pseudo-regret, $R_p^T = o(T)$, in order for the free rider to achieve constant regret under $\textsc{Freq}_p$, but this turns out not to be the case. It is possible for $f_p$ to have sublinear pseudo-regret despite having large regret with non-trivial probability, preventing $\textsc{Freq}_p$ from achieving constant pseudo-regret.

For instance, consider the self-reliant policy $f_p$ that, for each $j \in \mathbb{N}$, dictates the following behavior in rounds $3^j$ to $3^{j+1} - 1$. With probability $1/3^j$, player $p$ “gives up” on these rounds, choosing a single arm $i_j$ for all rounds $3^j \le t < 3^{j+1}$. Otherwise, with probability $1 - 1/3^j$, player $p$ plays UCB during those rounds.

Under this policy, $p$’s pseudo-regret grows at most logarithmically:

$$R_p^T \le R_{\mathrm{UCB}}^T + 2\sum_{j=0}^{\lceil \log T \rceil} \frac{1}{3^j}\left(3^{j+1} - 1 - 3^j\right) = O(\log T).$$

Notice that whenever player $p$ “gives up” in phase $j$, the arm $i_j$ will become her most frequently pulled arm by round $2 \cdot 3^j$, so the $\textsc{Freq}_p$-playing free rider will pull this arm at least $3^j$ times before round $3^{j+1}$. If the arm is suboptimal, then this causes the free rider to incur at least $\Delta \cdot 3^j$ regret. It is routine to show that $i_j$ is indeed suboptimal with probability $1 - O(j/3^j)$, so the free rider’s total pseudo-regret through round $T$ is at least

$$R_1^T \ge \sum_{j=0}^{\lfloor \log T \rfloor} \frac{1}{3^j}\left(1 - O(j/3^j)\right) \cdot \Delta \cdot 3^j = \Omega(\log T).$$

Since it is not enough for player $p$ to have logarithmic pseudo-regret, we instead show that if $f_p$ incurs linear regret with sufficiently low probability, then the free rider achieves constant pseudo-regret by playing $\textsc{Freq}_p$.

###### Theorem 1.

Let $D$ be a reward distribution profile, let $p \ne 1$ be a player, and let $f$ be a policy profile with $f_1 = \textsc{Freq}_p$. If there is some $w > 1$ such that for all $t$ we have $\Pr\big(\hat R_p^t(D, f) \ge \Delta t/2\big) = O(t^{-w})$, then $R_1^T(D, f) = O(1)$.

###### Proof.

Observe that the free rider pulls a suboptimal arm at time $t$ if and only if player $p$’s most frequently pulled arm is suboptimal, which implies that $N_{p, i^*}^{t-1} \le (t-1)/2$ and therefore

$$\hat R_p^{t-1}(D, f) \ge \frac{\Delta(t-1)}{2}.$$

Hence, we can bound the free rider’s pseudo-regret by

$$R_1^T(D, f) \le 2\sum_{t=0}^{T-1} \Pr\!\left(\hat R_p^t(D, f) \ge \frac{\Delta t}{2}\right).$$

If $f_p$ satisfies the conditions of the theorem, then there exists a constant $w > 1$ such that $\Pr\big(\hat R_p^t(D, f) \ge \Delta t/2\big) = O(t^{-w})$. Thus,

$$R_1^T(D, f) \le 2\sum_{t=0}^{T-1} O\left(t^{-w}\right) = O(1). \qquad\blacksquare$$

Audibert, Munos, and Szepesvári [1] showed that $\alpha$-UCB satisfies the probability bound of Theorem 1 in the single-player setting whenever $\alpha$ is sufficiently large. Since $\alpha$-UCB is a self-reliant policy, this immediately yields the following corollary.

###### Corollary 2.

If player $p$’s policy is $\alpha$-UCB for any sufficiently large $\alpha$, then a free rider playing $\textsc{Freq}_p$ achieves $O(1)$ pseudo-regret.

## 4 A Lower Bound on the Number of Pulls of Any Arm

In this section, we give a deterministic lower bound on the number of samples for each arm when the arms are pulled according to the $\alpha$-UCB policy. Although this is a lower bound for the stochastic bandits case, it helps us bound the regret of the free rider in the contextual bandits situation!

###### Lemma 3.

Let $\alpha > 0$, $\eta > 2$, and $k \in \mathbb{N}$. There exists $\hat t \in \mathbb{N}$ such that for all reward distribution profiles $D$ supported on $[-1, 1]$, and for all $t \ge \hat t$, a learner playing the $\alpha$-UCB policy will sample each arm $i \in [k]$ at least

$$N_i^{t-1} \ge \frac{\alpha \ln t}{2\eta^2 k^2}$$

times in the first $t - 1$ rounds.

###### Proof.

For every $t$ and $j \in [k]$, define the set

$$U_j^t = \left\{ i \in [k] : N_i^{t-1} \ge \frac{\alpha \ln t}{2\eta^2 j^2} \right\}.$$

We claim that for all $j \in [k]$ there is a constant $t_j$ such that for all $t \ge t_j$, $|U_j^t| \ge j$.

We will prove this claim by induction on $j$. For any time $t$, there is clearly some arm $i$ with $N_i^{t-1} \ge (t-1)/k$, and we can choose $t_1$ such that $(t-1)/k \ge \alpha \ln t/(2\eta^2)$ whenever $t \ge t_1$, so the claim holds for $j = 1$.

Now fix $j \ge 2$, and assume that the claim holds for $j - 1$. Define a function $g_j$ by

$$g_j(t) = t - (k - j + 1)\,\frac{\alpha \ln t}{2\eta^2 j^2}.$$

We choose $t_j$ sufficiently large such that for all $t \ge t_j$ we have $g_j(t) \ge t_{j-1}$ and

$$\frac{\ln(g_j(t) - 1)}{\ln t} > \left(1 - \frac{1 - 2/\eta}{j}\right)^2. \tag{1}$$

Assume for contradiction that there is some time $t \ge t_j$ such that $|U_j^t| < j$. Since $U_{j-1}^t \subseteq U_j^t$, the inductive hypothesis then implies that $|U_j^t| \ge j - 1$. Thus, $|U_j^t| = j - 1$, and there are exactly $k - j + 1$ arms outside of $U_j^t$. Each one of those arms has been pulled at most $\alpha \ln t/(2\eta^2 j^2)$ times by round $t$, so by the pigeonhole principle there is some $s \ge g_j(t)$ such that an arm from $U_j^t$ is pulled in round $s$.

Furthermore, since $s \ge g_j(t)$, inequality (1) implies

$$\frac{\alpha \ln s}{2\eta^2 (j-1)^2} > \frac{\alpha \ln t}{2\eta^2 j^2},$$

which guarantees that $U_{j-1}^s \subseteq U_j^t$. Since $s \ge g_j(t) \ge t_{j-1}$, the inductive hypothesis tells us that $|U_{j-1}^s| \ge j - 1$, so we have $U_{j-1}^s = U_j^t$, meaning that the arm pulled in round $s$ is also in $U_{j-1}^s$.

Now, $N_i^{s-1} \ge \alpha \ln s/(2\eta^2 (j-1)^2)$ for all $i \in U_{j-1}^s$, so the upper confidence bound of the arm pulled at time $s$ is at most

$$1 + \sqrt{\frac{\alpha \ln s}{2 \cdot \alpha \ln s/(2\eta^2 (j-1)^2)}} = 1 + \eta\,(j-1).$$

The upper confidence bound at time $s$ of any arm outside $U_j^t$ is at least

$$-1 + \sqrt{\frac{\alpha \ln s}{2 \cdot \alpha \ln t/(2\eta^2 j^2)}} = -1 + \eta j\sqrt{\frac{\ln s}{\ln t}}.$$

But since $s \ge g_j(t)$ and $t \ge t_j$, inequality (1) implies

$$-1 + \eta j\sqrt{\frac{\ln s}{\ln t}} > 1 + \eta\,(j-1).$$

This means that all arms outside of $U_j^t$ have higher upper confidence bounds at time $s$ than the arms in $U_{j-1}^s$, contradicting the choice to pull an arm in $U_{j-1}^s$ at time $s$.

By induction, we conclude that the claim holds for all $j \in [k]$, and in particular that the lemma holds with $\hat t = t_k$. ∎

## 5 Regret in the Contextual Case with Full Information

Theorem 1 shows that free riding is easy in the stochastic case, in which every player’s reward distributions are identical, but the task is more nuanced when players may have diverse contexts. Here, we assume that the other players play the $\alpha$-UCB algorithm. Using the lower bound from Lemma 3, we see that each of those players must accumulate a significant number of samples from every arm. We now show that if player 1’s context is a linear combination of the other players’ contexts, then it can successfully free ride by aggregating their observations to estimate the means of its own reward distribution profile. To avoid amplifying the error of other players too much, though, player 1 needs this linear combination to have relatively small coefficients.

###### Theorem 4.

In the linear contextual setting, suppose that players $2, \ldots, n$ play $\alpha$-UCB on $k$ arms, and $c = (c_2, \ldots, c_n)$ is a vector such that

$$\sum_{p=2}^{n} c_p x_p = x_1,$$

where each $x_p$ is the context of player $p$. If

$$\frac{\Delta_1 \sqrt{\alpha}}{k \lVert c \rVert} > 8,$$

where $\lVert c \rVert$ is the Euclidean norm of $c$, then a free-riding player 1 with full information can achieve constant pseudo-regret.

###### Proof.

Fix a feature distribution profile $F$ on $k$ arms, and assume that each player $p \in \{2, \ldots, n\}$ applies the $\alpha$-UCB algorithm to its reward distributions $(D_{p,1}, \ldots, D_{p,k})$.

Player 1 free rides by using observations of the other players’ rewards to benefit its own expected reward in each round. Given a context profile $x$, player 1 finds a vector $c$ that minimizes $\lVert c \rVert$ subject to

$$\sum_{p=2}^{n} c_p x_p = x_1.$$

Using this vector, player 1’s policy is as follows. Let $\Delta = \Delta_1$, let

$$\eta = \sqrt[3]{\frac{\Delta\sqrt{\alpha}}{k\lVert c \rVert}} > 2,$$

and let $\hat t$ be as in Lemma 3, for this $\alpha$, $\eta$, and $k$. For all players $p \ge 2$, arms $i \in [k]$, and $j \in \mathbb{N}$, define $\hat r_{p,i}^j$ as the value of the $j$th observed sample of $D_{p,i}$, and define

$$\bar\mu_{p,i}^s = \frac{1}{s}\sum_{j=1}^{s} \hat r_{p,i}^j,$$

the mean of the first $s$ samples of $D_{p,i}$.

Now, given $t \ge \hat t$ and a history $H^{t-1}$, let

$$s = \frac{\alpha \ln t}{2\eta^2 k^2}.$$

Lemma 3 guarantees that all other players have pulled all arms at least $s$ times each in the first $t - 1$ rounds, so player 1 can estimate $\mu_{1,i}$ by calculating

$$\tilde\mu_i^{t-1} = \sum_{p=2}^{n} c_p\, \bar\mu_{p,i}^s.$$

The free rider then pulls the arm

$$i_1^t = \operatorname*{argmax}_{i \in [k]} \tilde\mu_i^{t-1}$$

in round $t$.

Under this policy, player 1’s pseudo-regret through round $T$ is bounded by

$$R_1^T \le 2\sum_{t=1}^{T} \Pr\left[i_1^t \ne i_1^*\right] \le 2\hat t + 2\sum_{t=\hat t}^{T} \Pr\left[i_1^t \ne i_1^*\right] \le 2\hat t + 2\sum_{t=\hat t}^{T}\sum_{i=1}^{k} \Pr\left[\left|\tilde\mu_i^{t-1} - \mu_{1,i}\right| \ge \frac{\Delta}{2}\right].$$

For fixed $t \ge \hat t$ and $i \in [k]$, we have

$$\left|\tilde\mu_i^{t-1} - \mu_{1,i}\right| = \left|\sum_{p=2}^{n} \frac{c_p}{s}\sum_{j=1}^{s} \hat r_{p,i}^j - \sum_{p=2}^{n} c_p \mu_{p,i}\right| = \left|\sum_{p=2}^{n}\sum_{j=1}^{s} X_p^j\right|,$$

where

$$X_p^j = \frac{c_p}{s}\left(\hat r_{p,i}^j - \mu_{p,i}\right).$$

Notice that the $X_p^j$ are independent, and each $X_p^j$ is supported on an interval of length at most $2|c_p|/s$ with $\mathbb{E}[X_p^j] = 0$. Thus, by Hoeffding’s lemma,

$$\mathbb{E}\left[\exp\left(\lambda X_p^j\right)\right] \le \exp\left(\frac{\lambda^2 c_p^2}{2s^2}\right)$$

holds for all $\lambda \in \mathbb{R}$.

Now let $\lambda = 4\ln t/\Delta$, and apply a Chernoff bound:

$$\begin{aligned} \Pr\left[\sum_{p=2}^{n}\sum_{j=1}^{s} X_p^j \ge \frac{\Delta}{2}\right] &\le \exp\left(-\frac{\lambda\Delta}{2}\right)\mathbb{E}\left[\prod_{p=2}^{n}\prod_{j=1}^{s}\exp\left(\lambda X_p^j\right)\right] \\ &= \exp\left(-\frac{\lambda\Delta}{2}\right)\prod_{p=2}^{n}\prod_{j=1}^{s}\mathbb{E}\left[\exp\left(\lambda X_p^j\right)\right] \\ &\le \exp\left(-\frac{\lambda\Delta}{2} + \sum_{p=2}^{n}\sum_{j=1}^{s}\frac{\lambda^2 c_p^2}{2s^2}\right) \\ &= \exp\left(\frac{\lambda}{2}\left(\frac{\lambda\lVert c\rVert^2}{s} - \Delta\right)\right) \\ &= \exp\left(\frac{2\ln t}{\Delta}\left(\frac{8\eta^2 k^2 \lVert c\rVert^2}{\Delta\alpha} - \Delta\right)\right) \\ &\le \exp\left(\ln t\left(\frac{2}{\eta} - 2\right)\right) = t^{2/\eta - 2}. \end{aligned}$$

By symmetry,

$$\Pr\left[\left|\sum_{p=2}^{n}\sum_{j=1}^{s} X_p^j\right| \ge \frac{\Delta}{2}\right] \le 2t^{2/\eta - 2}.$$

Recall that $\eta > 2$, so $2/\eta - 2 < -1$, and

$$R_1^T \le 2\hat t + 2\sum_{t=\hat t}^{T}\sum_{i=1}^{k} 2t^{2/\eta - 2} \le 2\hat t + 4k\int_{\hat t - 1}^{\infty} t^{2/\eta - 2}\,dt = 2\hat t + 4k\,\frac{(\hat t - 1)^{2/\eta - 1}}{1 - 2/\eta} = O_{\alpha, c, \Delta, k}(1). \qquad\blacksquare$$
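The estimator at the heart of this proof can be sketched numerically. The helper below is our own illustration: it computes a minimum-norm $c$ via least squares (one plausible way to carry out the minimization described in the proof) and then combines the other players’ empirical means.

```python
import numpy as np

def freeride_estimates(x1, other_contexts, sample_means):
    """Estimate player 1's mean rewards mu_{1,i} from the other players'
    empirical means.

    other_contexts: (n-1) x d array whose rows are the contexts x_2, ..., x_n.
    sample_means:   (n-1) x k array; sample_means[p, i] is that player's
                    empirical mean reward for arm i.
    Returns the length-k vector of estimates mu~_i = sum_p c_p * mubar_{p,i}.
    """
    # minimum-norm c solving sum_p c_p x_p = x_1, i.e. other_contexts^T c = x_1
    c, *_ = np.linalg.lstsq(other_contexts.T, x1, rcond=None)
    return c @ sample_means

# The free rider would then pull int(np.argmax(freeride_estimates(...))).
```

When the other players' contexts span player 1's context, the combined estimate converges to player 1's true means at a rate governed by $\lVert c \rVert$, which is why the theorem requires the coefficients to be small.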

## 6 Regret in the Contextual Case with Partial Information

We now consider the situation when player 1 must choose a free-riding policy without knowledge of the other players’ contexts. We show that this restriction can force the free rider to incur logarithmic regret — in contrast to the $O(1)$ upper bound of Theorem 4 — even given knowledge of the other players’ policies, actions, and rewards. Intuitively, this is true because a self-reliant player might behave identically in two different environments, making observations of their behavior useless to the free rider.

###### Theorem 5.

There exist a pair of feature distribution profiles $F$ and $F'$ and a pair of two-player context profiles $x$ and $x'$ such that, for every policy profile $f$ in which $f_2$ is self-reliant,

$$\max\left\{R_1^T(F, x, f),\, R_1^T(F', x', f)\right\} \ge \frac{\ln(T/12) + 1}{2}. \tag{2}$$
###### Proof.

We construct a two-arm, two-player, one-dimensional example. Let $F_1$ be a point mass; let $\theta_2$ and $\theta_2'$ be discrete random variables that take one common value with two different probabilities and a second common value otherwise; and let $F = (F_1, F_2)$ and $F' = (F_1, F_2')$, where $F_2$ and $F_2'$ are the distributions of $\theta_2$ and $\theta_2'$, respectively. Let $f$ be any linear contextual bandit policy profile such that $f_2$ is self-reliant, and consider a free-riding player 1.

Let $x = (x_1, x_2)$ and $x' = (x_1', x_2')$. For $i \in \{1, 2\}$, let $D_{p,i}$ be the reward distribution of arm $i$ for player $p$ under feature distribution profile $F$ and context profile $x$. Similarly, let $D'_{p,i}$ be the reward distribution of arm $i$ for player $p$ under feature distribution profile $F'$ and context profile $x'$. Observe that $D_{2,1} = D'_{2,1}$, $D_{2,2} = D'_{2,2}$, and $D_{1,1} = D'_{1,1}$, but $D_{1,2} \ne D'_{1,2}$.

Informally, the environment $(F, x)$ is indistinguishable from $(F', x')$ from the perspective of player 2. Observing player 2’s actions and rewards will therefore be completely uninformative for player 1, who is ignorant of player 2’s context. Thus, player 1’s task is essentially equivalent to a single-player stochastic bandit problem where the learner must distinguish between reward distribution profiles $(D_{1,1}, D_{1,2})$ and $(D'_{1,1}, D'_{1,2})$. Bubeck et al. [5] showed that the latter task requires the learner to experience logarithmic regret. Adapting their proof to the present situation, we can demonstrate that (2) holds. The details are deferred to Lemma 6 in the appendix. ∎

## 7 Conclusion

We have demonstrated that in the linear contextual setting, a free rider can successfully shirk the burden of exploration, achieving constant regret by observing other players engaged in standard learning behavior. Furthermore, we have shown that even with partial information and weaker assumptions on the other players’ learning behaviors, the free rider can achieve constant regret in the simple stochastic setting. It would be interesting to examine richer settings. For example, exploring players need not be self-reliant, and both exploring players and free riders could play a range of strategies. As another example, when the free rider only sees the actions (and not the rewards) of the self-reliant players and does not know which if any of them is playing UCB or another zero regret strategy, can he still achieve constant regret? More realistically, users of a service like Yelp™ cannot be partitioned into self-reliant public learners and selfish free riders who keep their data private. It would be interesting to explore more nuanced player roles and to characterize the equilibria that arise from their interactions. Such a characterization might also suggest mechanisms for the deterrence of free riding or for incentivizing exploration.

## Appendix A Appendix

###### Lemma 6.

For $F$, $F'$, $x$, $x'$, and $f$ defined as in the proof of Theorem 5, for all $T$,

$$\max\left\{R_1^T(F, x, f),\, R_1^T(F', x', f)\right\} \ge \frac{\ln(T/12) + 1}{2}.$$

We closely follow the proof of Theorem 6 in the work of Bubeck et al. [5], which shows that a learner in a single-player stochastic multi-armed bandit setting may not be able to avoid uniformly logarithmic regret, even when the gap between the mean rewards of the best and second-best arms is known. Our situation is almost identical to theirs, except for the presence of an uninformative second player, which requires only minor changes to their proof. We include the modified proof here for the sake of completeness.

###### Proof.

Let

$$Q_T = R_1^T(F, x, f) \quad\text{and}\quad Q_T' = R_1^T(F', x', f).$$

For all $t$, let $G^t$ and $G'^t$ be distributions on $\mathcal{H}$ such that

$$H^t(F, x, f) \sim G^t \quad\text{and}\quad H^t(F', x', f) \sim G'^t,$$

and for $i \in \{1, 2\}$, let

$$M_{t,i} = N_{1,i}^t(F, x, f) \quad\text{and}\quad M'_{t,i} = N_{1,i}^t(F', x', f).$$

Observe that

$$\max\{Q_T, Q_T'\} \ge \frac{\mathbb{E}[M_{T,2}]}{3},$$

and

$$\max\{Q_T, Q_T'\} \ge \frac{1}{2}\left(Q_T + Q_T'\right) = \frac{1}{6}\sum_{t=1}^{T}\left(\Pr_{G^{t-1}}\left[i_1^t = 2\right] + \Pr_{G'^{t-1}}\left[i_1^t = 1\right]\right) \ge \frac{T}{12}\exp\left(-\mathrm{KL}(G^T, G'^T)\right),$$

where $\mathrm{KL}$ denotes Kullback–Leibler divergence and the final inequality follows from Lemma 4 of [5].

We now calculate $\mathrm{KL}(G^T, G'^T)$. For each $t$ and $p \in \{1, 2\}$, let $\gamma_{t,p}$ and $\gamma'_{t,p}$ be the conditional distributions of player $p$’s action–reward pair at round $t$ under $G^T$ and $G'^T$, respectively. Observe that $\gamma_{t,2} = \gamma'_{t,2}$ for all $t$. By the chain rule for KL divergence,

$$\begin{aligned} \mathrm{KL}(G^T, G'^T) &= \sum_{t=1}^{T}\mathbb{E}_{H^{t-1}\sim G^{t-1}}\left[\mathrm{KL}\left((\gamma_{t,1}, \gamma_{t,2} \mid H^{t-1}),\, (\gamma'_{t,1}, \gamma'_{t,2} \mid H^{t-1})\right)\right] \\ &= \sum_{t=1}^{T}\mathbb{E}_{H^{t-1}\sim G^{t-1}}\left[\mathrm{KL}\left((\gamma_{t,1} \mid H^{t-1}),\, (\gamma'_{t,1} \mid H^{t-1})\right)\right] \\ &= \sum_{t=1}^{T}\mathbb{E}_{H^{t-1}\sim G^{t-1}}\left[\mathrm{KL}\left((D_{1, f_1(H^{t-1})} \mid H^{t-1}),\, (D'_{1, f_1(H^{t-1})} \mid H^{t-1})\right)\right] \\ &= \sum_{t=1}^{T}\Pr_{H^{t-1}\sim G^{t-1}}\left[f_1(H^{t-1}) = 2\right]\,\mathrm{KL}(D_{1,2}, D'_{1,2}) \\ &= \mathrm{KL}(D_{1,2}, D'_{1,2})\,\mathbb{E}[M_{T,2}] = \mathbb{E}[M_{T,2}]/3. \end{aligned}$$

Thus, we have

$$\max\{Q_T, Q_T'\} \ge \frac{1}{2}\left(\frac{\mathbb{E}[M_{T,2}]}{3} + \frac{T}{12}\exp\left(-\mathbb{E}[M_{T,2}]/3\right)\right) \ge \frac{1}{6}\min_{x \in [0, T]}\left(x + \frac{T e^{-x/3}}{4}\right) = \frac{\ln(T/12) + 1}{2}. \qquad\blacksquare$$

### References

1. Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
2. Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011, 2017.
3. Patrick Bolton and Christopher Harris. Strategic experimentation. Econometrica, 67(2):349–374, 1999.
4. Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
5. Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory (COLT ’13), June 12–14, 2013, Princeton University, NJ, USA, pages 122–134, 2013.
6. Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. CoRR, abs/1602.07182, 2016.
7. Matthew Joseph, Michael J. Kearns, Jamie H. Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS ’16), December 5–10, 2016, Barcelona, Spain, pages 325–333, 2016.
8. Sampath Kannan, Michael J. Kearns, Jamie Morgenstern, Mallesh M. Pai, Aaron Roth, Rakesh V. Vohra, and Zhiwei Steven Wu. Fairness incentives for myopic agents. In Proceedings of the 2017 ACM Conference on Economics and Computation (EC ’17), Cambridge, MA, USA, June 26–30, 2017, pages 369–386, 2017.
9. Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. CoRR, abs/1801.03423, 2018.
10. Godfrey Keller, Sven Rady, and Martin Cripps. Strategic experimentation with exponential bandits. Econometrica, 73(1):39–68, 2005.
11. Nicolas Klein. Free-riding and delegation in research teams–a three-armed bandit model. 2009.
12. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
13. Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Proceedings of the 31st Annual Conference On Learning Theory (COLT ’18), Stockholm, Sweden, July 6–9 2018., pages 1724–1738, 2018.
14. Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.
15. William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.