
# Bandit Learning with Positive Externalities

Virag Shah, Jose Blanchet, Ramesh Johari
Stanford University
###### Abstract

Many platforms are characterized by the fact that future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study multiarmed bandit (MAB) problems with positive externalities. Our model has a finite number of arms and users are distinguished by the arm(s) they prefer. We model positive externalities by assuming that the preferred arms of future arrivals are self-reinforcing based on the experiences of past users. We show that classical algorithms such as UCB which are optimal in the classical MAB setting may even exhibit linear regret in the context of positive externalities. We provide an algorithm which achieves optimal regret and show that such optimal regret exhibits substantially different structure from that observed in the standard MAB setting.

## 1 Introduction

A number of different platforms use multiarmed bandit (MAB) algorithms today to optimize their service: e.g., search engines and information retrieval platforms; e-commerce platforms; and news sites. In all cases, the basic application of MAB strategies is to adaptively determine the most preferred options (search results, products, news articles, etc.) to maximize rewards. Optimal algorithms trade off exploration (acquiring information about lesser known options) against exploitation (of better known options).

As a result of the use of such policies, the data the platforms use to make decisions has an inherent sampling bias: the data collection is endogenous to the platform, and of course not independent across past decisions. One potential effect that can exacerbate this bias is a shifting user population. User preferences are often heterogeneous, so future arrivals may be biased towards users who expect to have positive experiences based on past observation of the platform.

In this paper, we model this issue by considering a setting where a platform faces many different types of arriving users. Each user type is distinguished by preferring one of the item types above all others. The platform is aware of neither the type of an arriving user nor the item-user payoffs. Our model is distinguished by the presumption that there are positive externalities (also called positive network effects) among the users [13]: in particular, if one type of item generates positive rewards, users who prefer that type of item become more likely to arrive in the future. This is a significant effect in real platforms: for example, if a news site generates articles that are liberal (resp., conservative), then it is more likely to attract additional users who are liberal (resp., conservative) [2].

There are broadly two ways to maximize reward. First, the platform might aim for fine-grained data on the preferences of each user, and use this to personalize the service. This is the theme of much recent work on contextual bandits; see, e.g., [16, 22, 18] and [8] for a survey of early work. In such a model, it is important that either (1) enough observable covariates are available to group different users together as decisions are made; or (2) users are long-lived enough that the platform has time to learn about them. In this view, the platform essentially knows the user type, and thus can maximize rewards largely independently for each class of users. But what if these conditions do not hold: in other words, what if context does not provide sufficient information about user type, and users are shorter-lived?

Our paper quantifies the consequences of positive externalities for bandit learning in a benchmark model where users are short-lived (one interaction per user), and where the platform is unable to observe the user's type on arrival. As a result, the platform can only use observed rewards to optimize over both arms and the user population. In the model we consider, there is a finite set of arms, and each arriving user prefers one of these arms over all others; in particular, all arms other than the preferred arm generate zero reward, while the preferred arm a generates a Bernoulli reward with mean μ_a. The probability that a user preferring arm a arrives at time t is proportional to (θ_a + S_a(t))^α, where θ_a is a fixed initial popularity and S_a(t) is the number of past rewards observed from arm a. We consider 0 < α < ∞: when α is near zero, the positive externality is weak; when α is large, the positive externality is strong. The platform aims to maximize cumulative reward up to a time horizon T.
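To make these dynamics concrete, here is a minimal simulator of the arrival model just described (our own illustrative code; the function and variable names are not from the paper):

```python
import random

def simulate_platform(mu, theta, alpha, T, policy, seed=0):
    """Minimal simulator of the arrival model: the user arriving at time t
    prefers arm a independently with probability proportional to
    (theta_a + S_a)**alpha, where S_a counts past rewards on arm a.
    `policy` maps (t, S) to the arm to pull."""
    rng = random.Random(seed)
    A = len(mu)
    S = [0] * A          # cumulative rewards per arm
    total = 0
    for t in range(1, T + 1):
        w = [(theta[a] + S[a]) ** alpha for a in range(A)]
        z = sum(w)
        # each arriving user independently prefers arm a w.p. w[a] / z
        prefs = [a for a in range(A) if rng.random() < w[a] / z]
        i = policy(t, S)
        if i in prefs and rng.random() < mu[i]:   # reward only if preferred
            S[i] += 1
            total += 1
    return S, total
```

For instance, an oracle-style policy `lambda t, S: 0` always pulls arm 0, and the simulated population then drifts toward users preferring arm 0.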

We evaluate our performance by measuring regret against an “offline” oracle that knows the means μ_a of the arms, but does not know the type of each arriving user. In particular, the oracle we consider always chooses the arm a∗ = argmax_a μ_a. Because of the positive externality, this choice causes the user population to shift entirely to users preferring arm a∗ over time. This oracle achieves asymptotically optimal performance to leading order in T.

At the heart of this learning problem is a central tradeoff. On one hand, because of the positive externality, the platform operator is able to move the user population towards the profit-maximizing population; this movement is especially rapid when the network effect is strong (α is large). On the other hand, because of precisely the same effect, any mistakes are amplified as well: if rewards are generated on suboptimal arms, the positive externality causes more users who prefer those arms to arrive in the future. The tradeoff is reversed when α is small: now the user population takes longer to change, but because positive externalities are weaker, the consequences of suboptimal arm choices are also weaker. The design of an optimal algorithm for this problem must balance these two effects: leverage the positive externality to converge to the correct user population, but prevent the positive externality from leading to convergence to the wrong user population.

Our main results reveal that positive externalities substantially change the learning problem. First, we study the regret performance of UCB [5], a benchmark optimal algorithm for the MAB problem. We show that regardless of the value of α, UCB can have linear regret; fundamentally, this is because it “overexploits”, and allows the process to converge to the wrong user population. Second, we consider the regret performance of a policy that chooses arms uniformly at random at each time step up to some fixed time τ, then exploits the empirical best arm until the horizon T. This policy prevents misconvergence, but also does not learn as efficiently as UCB. It performs better than UCB for smaller values of α, but its performance worsens to linear regret for α ≥ 1: when the positive externality is sufficiently strong, the random policy is punished for its unstructured approach to exploration.

Motivated by these observations, we propose a “waterfilling” policy that tries to learn optimally despite the shifting population dynamics introduced by the positive externality. This policy also follows an explore-then-exploit structure, like the random policy. However, during the exploration phase the policy is cautious: it explores the arm with the least accrued reward, to ensure that the algorithm can observe whether low rewards were simply a consequence of bad luck. We show that this waterfilling policy achieves regret O(T^{1−α} ln^α T) when α < 1, regret O(ln² T) when α = 1, and regret O(ln^α T) when α > 1. Further, we provide matching lower bounds, showing that this policy is order optimal. It exploits strong positive externalities when available, but also substantially reduces the probability of exploiting with the wrong user population. Our results are summarized in Table 1.

The presence of positive externalities presents a unique technical challenge: since the type of the arriving user determines the preferred arm, we need a precise characterization of the influence of past rewards on the arrival process. We develop our analysis by viewing the cumulative rewards at each arm as a generalized urn process. Following classical results in the literature on urn models, we embed this process in a continuous-time branching process, and leverage the resulting representation to characterize performance of the various algorithms described above.

Our work provides a first step towards understanding the influence of positive externalities on bandit learning. As noted above, our work investigates a setting with different constraints and tradeoffs from much of the existing work on personalization. Of course, the reality of many platforms is somewhere in between: some user information may be available, though imperfect. Nevertheless, our work suggests that there are significant consequences to learning when the user population itself can change over time, an insight that we expect to be robust across a wide range of settings.

The remainder of the paper is organized as follows. In Section 3, we introduce the model, and in Section 4 we introduce the oracle benchmark we compare against, and our definition of regret. In Section 5, we study the suboptimality of the UCB algorithm. In Section 6, we study the performance of an algorithm that randomly explores arms before exploiting. In Section 7, we introduce our waterfilling strategy and characterize its performance. Finally, in Section 8, we prove lower bounds on the performance of any algorithm; these lower bounds establish optimality of our algorithm. We conclude in Section 9. Omitted proofs can be found in the Appendix.

## 2 Related Work

As noted above, our work incorporates the notion of positive externalities, also known as positive network effects or positive network externalities. (Note that the word “network” is often used here even when the externality is not due to a network connection between the users per se.) See [13], as well as [21, 20], for background. Positive externalities are extensively discussed in most standard textbooks on microeconomic theory; see, e.g., Chapter 11 of [17].

It is well accepted that online search and recommendation engines produce feedback loops that can lead to self-reinforcement of popular items [3, 6, 19, 9]. Our model captures this phenomenon by employing a self-reinforcing arrival process, inspired by classical urn processes [4, 11].

We note that the kind of self-reinforcing behavior observed in our model may be reminiscent of “herding” behavior in Bayesian social learning [7, 23, 1]. In these models, arriving Bayesian rational users take actions based on their own private information, and the outcomes experienced by past users; under some circumstances, it can be shown that the users will “herd”, in the sense that after some finite time they all choose the same action. The central question in that literature is whether individuals act on their own private information or not, rather than following the crowd; indeed, users who herd exhibit a negative externality on others, as their private information is not shared. By contrast, in our model users do not have private information; instead, it is the platform that is trying to learn from preferences of users, and users exert positive externalities on each other.

## 3 Problem Formulation

Let A be the set of arms available, with |A| = m. At each time t = 1, 2, …, a new user arrives and an arm is “pulled”; we denote the arm pulled at time t by I_t. We view pulling an arm as presenting the corresponding option to the newly arrived user. Each arriving user prefers a subset of the arms, denoted by J_t ⊆ A. If arm a is pulled at time t and the user at time t prefers arm a (i.e., a ∈ J_t), then the reward obtained at time t is an independent Bernoulli random variable with mean μ_a, where 0 < μ_a ≤ 1 for each a ∈ A. If the user at time t does not prefer the arm pulled, then the reward obtained at time t is zero. Let X_t denote the reward obtained at time t.

Let T_a(t) represent the number of times arm a is pulled by time t, and S_a(t) represent the total reward accrued by pulling arm a up to time t. Thus Σ_a T_a(t) = t, and S_a(t) = Σ_{k=1}^t X_k 1{I_k = a}. Let

 a∗ = argmax_a μ_a.

For simplicity, we assume that a∗ is unique. We also assume that there exist positive constants μ_min and μ_max such that μ_min ≤ μ_a ≤ μ_max for each a ∈ A.

Suppose there are fixed constants θ_a > 0 that denote the initial “popularity” of arm a. Define:

 N_a(t) = S_a(t) + θ_a,   a ∈ A.

By definition, N_a(0) = θ_a. We assume that the user arriving at time t prefers arm a (i.e., a ∈ J_t) with probability λ_a(t), independently of other arms, where:

 λ_a(t) := f(N_a(t−1)) / Σ_{a′} f(N_{a′}(t−1))

for an increasing function f : ℝ₊ → ℝ₊. We refer to f as the externality function. We are primarily interested in the following class of externality functions:

 f(x) = x^α,   0 < α < ∞.

Intuitively, the idea is that agents who prefer arm a are more likely to arrive if arm a has been successful in the past. This is a positive externality: users who prefer arm a are more likely to generate rewards when arm a is pulled, and this will in turn increase the likelihood that an arrival preferring arm a comes in the future. The parameter α controls the strength of this externality: the positive externality is stronger when α is larger.

If f is linear (α = 1), then we can interpret our model in terms of an urn process. In this view, θ_a resembles the initial number of balls of color a in the urn at time zero, and S_a(t) resembles the total number of balls of color a added into the urn after t draws. Thus, the probability that the (t+1)-th draw is of color a is proportional to θ_a + S_a(t). In contrast to the standard urn model, in our model we have additional control: namely, we can pull an arm, and thus govern the probability with which a new ball of the same color is added into the urn.

The goal is to choose the arms I_t to maximize the expected total reward accrued over a time horizon T. We let Γ_T denote the total reward obtained up to time T:

 Γ_T = Σ_{t=1}^T X_t. (1)

## 4 Oracle and Regret

We consider the following policy as our benchmark.

###### Definition 1 (Oracle).

The oracle algorithm knows the optimal arm a∗ and pulls it at all times t ≤ T.

Let Γ∗_T denote the total reward of this oracle. We measure regret against the oracle: given any policy, define the regret R_T as:

 R_T = Γ∗_T − Γ_T. (2)

Our goal is to design algorithms that minimize the expected regret E[R_T]. In particular, we focus on characterizing regret performance asymptotically to leading order in T, for different values of the externality exponent α. Note that the oracle may not be optimal for finite fixed T; however, we show that it is asymptotically optimal as T → ∞.

Unlike in the standard stochastic MAB problem, the expected cumulative reward of the oracle is not μ_{a∗}T, as several arrivals may not prefer the optimal arm. Below, we provide tight bounds on E[Γ∗_T]. For the proof, see the Appendix.

###### Theorem 1.

Suppose f(x) = x^α with α > 0. Let θ = Σ_{b≠a∗} θ_b. The expected cumulative reward for the oracle satisfies:

 μ_{a∗}T − (Σ_b θ_b^α) Σ_{k=1}^T 1/(k + θ_{a∗})^α ≤ E[Γ∗_T] ≤ μ_{a∗}T − μ_{a∗} θ^α Σ_{k=1}^T 1/((μ_{a∗}k)^α + θ^α).

###### Corollary 1.

We have:

 E[Γ∗_T] = μ_{a∗}T − Θ(T^{1−α}),  0 < α < 1,
 E[Γ∗_T] = μ_{a∗}T − Θ(ln T),  α = 1,
 E[Γ∗_T] = μ_{a∗}T − Θ(1),  α > 1.

Note that in all cases, the reward is asymptotically of order μ_{a∗}T. This is the best achievable performance to leading order in T, showing the oracle is asymptotically optimal.
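The orders in Corollary 1 can be recovered by a short calculation (a sketch, not the formal proof; here θ = Σ_{b≠a∗} θ_b is the total initial popularity of the suboptimal arms): under the oracle, N_{a∗}(t) grows linearly in t, so the probability that the arrival at round k does not prefer a∗ scales as k^{−α}, and the cumulative shortfall behaves as

```latex
\sum_{k=1}^{T} \frac{\theta^{\alpha}}{(\mu_{a^*}k)^{\alpha} + \theta^{\alpha}}
  = \Theta\!\Big(\sum_{k=1}^{T} k^{-\alpha}\Big)
  = \begin{cases}
      \Theta(T^{1-\alpha}), & 0 < \alpha < 1,\\
      \Theta(\ln T),        & \alpha = 1,\\
      \Theta(1),            & \alpha > 1,
    \end{cases}
```

which, multiplied by the per-miss loss μ_{a∗}, gives the three regimes above.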

## 5 Suboptimality of UCB algorithm

In this section we show that the standard UCB algorithm [5, 8], which does not account for the positive externality, performs poorly. It is known that in the standard MAB setting the UCB algorithm achieves O(ln T) regret (for an appropriate choice of its exploration parameter); for the standard Bernoulli MAB problem this is asymptotically optimal [15, 8]. Formally, the algorithm is defined as follows.

###### Definition 2 (UCB(α)).

Fix α > 0. For each a ∈ A and each t, let μ̂_a(t) := S_a(t−1)/T_a(t−1), under the convention that μ̂_a(t) = 0 if T_a(t−1) = 0. For each a ∈ A and each t, let

 u_a(t) := μ̂_a(t) + √( α ln t / T_a(t−1) ),

with u_a(t) = ∞ if T_a(t−1) = 0.

Under the UCB(α) policy, we have

 I_t ∈ argmax_{a∈A} u_a(t),

with ties broken uniformly at random.
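As a concrete illustration, here is a minimal sketch of the index computation in Definition 2 (the helper names are ours, and ties are resolved to the first maximizer rather than uniformly at random, for determinism):

```python
import math

def ucb_index(s, n, t, alpha):
    """UCB(alpha) index for one arm: empirical mean of its rewards (s successes
    in n pulls) plus a confidence radius sqrt(alpha * ln t / n). Unpulled arms
    get +inf, so each arm is tried at least once."""
    if n == 0:
        return float("inf")
    return s / n + math.sqrt(alpha * math.log(t) / n)

def ucb_choose(S, N, t, alpha):
    """Pick the arm maximizing the index (first maximizer on ties)."""
    idx = [ucb_index(S[a], N[a], t, alpha) for a in range(len(S))]
    return max(range(len(S)), key=lambda a: idx[a])
```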

Under our model, consider an event where I_t = a but a ∉ J_t: i.e., arm a is pulled but the arriving user did not prefer arm a. Under UCB, such events are self-reinforcing: they not only lower the upper confidence bound for arm a, resulting in fewer future pulls of arm a, but they also reduce the preference of future users towards arm a. In fact, the impact of this self-reinforcement under UCB is so severe that we obtain a striking result: there is a strictly positive probability that the optimal arm will never see a positive reward, as shown by the following theorem. An immediate corollary of this result is that the regret of UCB is linear in the horizon length.

###### Theorem 2.

Suppose f(x) = x^α for some α > 0. There exists a δ > 0 such that

 P( ∩_{t=0}^∞ {S_{a∗}(t) = 0} ) ≥ δ.
###### Corollary 2.

The regret of UCB is Ω(T).

###### Proof.

We first prove the result for the setting with two arms, i.e., A = {a, b}, and then generalize later. Suppose f(x) = x^α. Without loss of generality, let a = a∗.

Let t_k be the time at which arm a is pulled for the k-th time.

Let E_k be the event that the first k pulls of arm a each saw a user who did not prefer arm a.

Let Q_k be the event that μ̂_b(t) > θ_b μ_b/3 for all t ≤ t_k.

Then, under E_k ∩ Q_k, we have the following for each time t such that t_k < t ≤ exp( (θ_b μ_b/4)² (k−1)/α ):

 u_a(t) < √( α ln exp( (θ_b μ_b/4)² (k−1)/α ) / (k−1) ) = θ_b μ_b/4 < θ_b μ_b/3 < μ̂_b(t).

Thus, under E_k ∩ Q_k, arm b is pulled at each time t such that t_k < t ≤ exp( (θ_b μ_b/4)² (k−1)/α ), and in turn t_{k+1} > exp( (θ_b μ_b/4)² (k−1)/α ).

We now show that there exists a δ > 0 such that lim_{k→∞} P(Q_k ∩ E_k) ≥ δ, from which the result follows.

Using the law of total probability, we have

 P(Q_k ∩ E_k) ≥ P(Q_{k−1} ∩ E_{k−1}) P(Q_k ∩ E_k | Q_{k−1}, E_{k−1}).

Thus, we have

 P(Q_k ∩ E_k) ≥ P(Q_{k−1} ∩ E_{k−1}) P(E_k | Q_{k−1}, E_{k−1}) P(Q_k | Q_{k−1}, E_{k−1}, E_k). (3)

Note that, under Q_{k−1} ∩ E_{k−1}, arm b is pulled a number of times exponentially large in k before t_k. Using standard Chernoff bound techniques, it is easy to show that there exists a constant δ′ > 0 such that

 P(E_k | Q_{k−1}, E_{k−1}) ≥ 1 − e^{−δ′(k−1)}. (4)

Further, under Q_{k−1} ∩ E_{k−1} ∩ E_k, we have N_a(t_k) = θ_a and N_b(t_k) ≤ (θ_b μ_b/3) e^{(θ_b μ_b/4)² (k−1)/α}. Thus,

 P(Q_k | Q_{k−1}, E_{k−1}, E_k) ≥ f( (θ_b μ_b/3) e^{(θ_b μ_b/4)² (k−1)/α} ) / ( f(θ_a) + f( (θ_b μ_b/3) e^{(θ_b μ_b/4)² (k−1)/α} ) ). (5)

Substituting (5) and (4) into (3), we obtain

 P(Q_k ∩ E_k) ≥ P(Q_{k−1} ∩ E_{k−1}) (1 − e^{−δ′(k−1)}) · [ f( (θ_b μ_b/3) e^{(θ_b μ_b/4)² (k−1)/α} ) / ( f(θ_a) + f( (θ_b μ_b/3) e^{(θ_b μ_b/4)² (k−1)/α} ) ) ].

Computing recursively, we obtain

 P(Q_k ∩ E_k) ≥ P(Q_2 ∩ E_2) ∏_{l=2}^k (1 − e^{−δ′(l−1)}) × ∏_{l=2}^k [ f( (θ_b μ_b/3) e^{(θ_b μ_b/4)² (l−1)/α} ) / ( f(θ_a) + f( (θ_b μ_b/3) e^{(θ_b μ_b/4)² (l−1)/α} ) ) ],

which tends to a strictly positive constant as k → ∞. Hence the result follows for m = 2.

For m > 2, we can generalize the argument to show that only the worst arm sees non-zero rewards with positive probability, by appropriately generalizing the notions of E_k and Q_k and arguing along the above lines. ∎

## 6 Random Explore-then-Exploit

In this section, we consider an algorithm that chooses arms independently and uniformly at random for some period of time, and then commits to the empirical best arm. While such an algorithm explores all arms equally, it does not learn efficiently; in particular, it pulls suboptimal arms too frequently, and then runs the risk of never recovering from the positive externalities generated by these arms (and the subsequent shifts in the user population). We formalize this intuition in this section.

To analyze this algorithm, we start with a technical result for the algorithm that indefinitely pulls arms independently and uniformly at random. For the case where the reward distribution of each arm has finite and rational support, and if f is linear, we can model the cumulative rewards obtained at each arm via a generalized Friedman's urn process. These processes are studied by embedding them into multitype continuous-time Markov branching processes [4, 11], where the expected lifetime of each particle is one at all times. Here, we are interested in the rewards obtained for more general f. We study this by considering multitype branching processes with state-dependent expected lifetimes. We obtain the following result.

###### Theorem 3.

Suppose at each time step t, arms are pulled independently and uniformly at random. The following statements hold:

1. If f(x) = x^α for 0 < α < 1, then for each pair of arms a, b ∈ A, we have that S_a(t)/S_b(t) → (μ_a/μ_b)^{1/(1−α)} almost surely as t → ∞.

2. If f(x) = x, then for each a ∈ A, we have that S_a(t)/t converges to a strictly positive random variable almost surely.

3. If f(x) = x^α for α > 1, then there is a positive probability that S_{a∗}(t) remains bounded for all t, while for some b ≠ a∗ we have S_b(t) → ∞ as t → ∞.

###### Proof.

For ease of exposition we will assume that A = {a, b} and θ_a = θ_b = 1. The argument for the more general case is essentially identical.

For now, suppose that f is linear or sublinear. We will study the process (N_a(t), N_b(t)) by analyzing a multitype continuous-time Markov branching process Z(s) = (Z_a(s), Z_b(s)) such that its embedded Markov chain, i.e., the discrete-time Markov chain corresponding to the state of the branching process at its jump times, is statistically identical to (N_a(t), N_b(t)). By jump times we mean the times at which a particle dies; upon death it may give birth to just one new particle, in which case the size of the process does not change at the jump time.

We construct Z(s) as follows. Both Z_a and Z_b are themselves independently evolving one-dimensional branching processes. Initially, Z_a and Z_b have one particle each, i.e., |Z_a(0)| = |Z_b(0)| = 1. Each particle dies at a rate dependent on the size of the corresponding branching process, as follows: at time s, each particle of Z_a dies at rate f(|Z_a(s)|)/|Z_a(s)|. At the end of its lifetime, the dying particle belonging to Z_a gives birth to one new particle with probability 1 − μ_a/2 and two new particles with probability μ_a/2. Similarly for the particles belonging to Z_b.

We can see that the embedded Markov chain of Z(s) is statistically identical to (N_a(t), N_b(t)) as follows. Let (σ_k)_{k≥1} represent the jump times of Z(s). Clearly, N(0) and |Z(0)| are identically distributed. Further, after the k-th jump of Z(s), the total rate at which Z_a jumps is f(|Z_a|). Thus, the probability that the (k+1)-th jump of Z(s) belongs to Z_a is f(|Z_a|)/( f(|Z_a|) + f(|Z_b|) ). This simulates the preference selection of the arriving user. Further, the birth of two new particles (probability μ_a/2) simulates the event that the preferred arm is the one pulled (probability 1/2 under uniform pulling over two arms) and that it generates a positive reward (probability μ_a).
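A minimal simulation sketch of this embedding (our own illustrative code, assuming arms are pulled uniformly at random, so a death adds a particle with probability μ_a/m):

```python
import random

def simulate_embedding(mu, alpha, horizon, seed=0):
    """Sketch of the continuous-time embedding: each process Z_a starts with
    one particle; its total death rate at size z is f(z) = z**alpha, and a
    death adds an extra particle with probability mu_a / m (preferred arm is
    pulled AND rewarded). The next jump belongs to Z_a with probability
    f(z_a) / sum_b f(z_b), matching the discrete arrival process."""
    rng = random.Random(seed)
    sizes = [1.0] * len(mu)
    s, jumps = 0.0, []
    while s < horizon and max(sizes) < 10**6:   # stop near explosion
        rates = [z ** alpha for z in sizes]      # total death rate per process
        s += rng.expovariate(sum(rates))         # time to the next jump
        a = rng.choices(range(len(mu)), weights=rates)[0]
        if rng.random() < mu[a] / len(mu):       # birth of an extra particle
            sizes[a] += 1
        jumps.append(s)
    return sizes, jumps
```

For α > 1 the jump times accumulate and one process races ahead, which is the finite-time explosion used in part 3 of the proof.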

Now, we obtain the following lemma from Theorem 1 in [14]. We say that f is sublinear if there exists α < 1 such that f(x) ≤ x^α for all sufficiently large x.

###### Lemma 1.

If f is linear or sublinear, then

 |Z_a(s)| → w_a(s) (W + o(1)),

where w_a is the inverse function of

 g_a(s) = (2/μ_a) ∫_0^s dx/f(x),

and W is a random variable with 0 < W < ∞ w.p. 1. Moreover, W = 1 if f is sublinear.

Now, consider f(x) = x^α for 0 < α < 1. Then, it follows that w_a(s) = ((1−α)μ_a s/2)^{1/(1−α)}. Thus, we have

 |Z_a(s)| / ((1−α)μ_a s/2)^{1/(1−α)} → 1 a.s.,

and

 |Z_b(s)| / ((1−α)μ_b s/2)^{1/(1−α)} → 1 a.s.

Thus, taking the ratio of the two limits above (and recalling that the embedded chain of Z is identical in law to (N_a, N_b)), part 1 of the theorem follows for the case m = 2 and θ_a = θ_b = 1. For general m and (θ_a), we construct as many independent branching processes, apply the above lemma, and the result follows.

Part 2 follows in a similar fashion, noting that for linear f the limit W is a nondegenerate positive random variable.

We now argue part 3. We assume that m = 2 and θ_a = θ_b = 1; the argument for general m and (θ_a) is similar. We show that if f(x) = x^α for α > 1, then there exists a random time s∗ < ∞ such that |Z_b(s)| → ∞ as s → s∗, i.e., the process Z_b explodes in finite time. Our result follows from this, since for each finite s we have P(|Z_a(s)| = 1) > 0. For each k, let ξ_k denote the time taken by |Z_b| to grow from k to k+1. Clearly, ξ_k is the sum of a random number (with a Geometric distribution) of Exponentially distributed random variables, each with rate f(k). Thus, E[ξ_k] = Θ(1/k^α), so Σ_k E[ξ_k] tends to a constant, say C, as the number of terms grows. Thus, s∗ ≤ Σ_k ξ_k < ∞ almost surely. Hence part 3 follows. ∎

Clearly, by choosing arms at random at all times within the time horizon T, the regret incurred is Ω(T). Thus, a natural policy is to select arms at random until some time τ, and pull the perceived best arm for the remaining time; this is what we mean by random explore-then-exploit. More formally:

###### Definition 3 (Random(τ)).

Fix τ < T. Under the Random(τ) policy, for each t ≤ τ, I_t is chosen uniformly at random from the set A. Let â∗ = argmax_{a∈A} μ̂_a(τ+1), with ties broken at random. For τ < t ≤ T, I_t = â∗.
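A runnable sketch of Random(τ) (helper names are ours; we take the perceived best arm to be the one with the highest empirical mean, with the convention that unpulled arms have empirical mean zero):

```python
import random

def random_then_exploit(tau, T, A, pull, seed=0):
    """Random(tau) sketch: explore uniformly at random for t <= tau, then
    commit once to a perceived-best arm (highest empirical mean, ties at
    random). `pull(i)` returns the 0/1 reward of pulling arm i."""
    rng = random.Random(seed)
    S = [0] * A   # rewards per arm
    n = [0] * A   # pulls per arm
    a_hat = None
    for t in range(1, T + 1):
        if t <= tau:
            i = rng.randrange(A)
        else:
            if a_hat is None:   # commit once at t = tau + 1
                m = [S[a] / n[a] if n[a] else 0.0 for a in range(A)]
                best = max(m)
                a_hat = rng.choice([a for a in range(A) if m[a] == best])
            i = a_hat
        r = pull(i)
        n[i] += 1
        S[i] += r
    return S, n
```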

In light of the above theorem, one may envisage that for linear and sublinear f, Random(τ) may perform well for some τ. The following theorem provides performance bounds for the Random(τ) policy. For the proof, see the Appendix.

###### Theorem 4.

Let b ≠ a∗. The following statements hold for the Random(τ) policy, for any choice of τ:

1. If α < 1 then we have

 E[R_T] = Ω( T^{1−α} ln^{α/(1−α)} T ).

2. If α = 1 then we have

 E[R_T] = Ω( T μ_b/(μ_b + θ_{a∗} μ_{a∗}) ).

3. If α > 1 then we have

 E[R_T] = Ω(T).

In the next section, we provide an algorithm which does better than Random(τ) for each α. In fact, it provides an exponential improvement over Random(τ) for α ≥ 1.

## 7 Waterfilling

We saw above that both being optimistic in the face of uncertainty and sampling at random achieve regret which is polynomial in the time horizon when the externality function is linear. We now present a policy which is cautious during the exploration phase, in that it pulls the arm with the least accrued reward, giving that arm further opportunity to ramp up its score in case its poor performance was simply bad luck. At the end of the exploration phase, it exploits the perceived best arm for the rest of the horizon.

###### Definition 4.

Waterfilling(τ) Algorithm: Fix τ < T.

1. Exploration phase: For t ≤ τ, pull arm I_t ∈ argmin_{a∈A} S_a(t−1), with ties broken at random.

2. Exploitation phase: For τ < t ≤ T, pull the arm â∗ = argmax_{a∈A} μ̂_a(τ+1), with ties broken at random.
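A runnable sketch of Waterfilling(τ) in the same style (helper names are ours; the perceived best arm is taken to be the one with the highest empirical mean):

```python
import random

def waterfilling(tau, T, A, pull, seed=0):
    """Waterfilling(tau) sketch: during exploration, pull the arm with the
    least accrued reward (ties at random); then commit once to the arm with
    the highest empirical mean. `pull(i)` returns the 0/1 reward of arm i."""
    rng = random.Random(seed)
    S = [0] * A   # rewards per arm
    n = [0] * A   # pulls per arm
    a_hat = None
    for t in range(1, T + 1):
        if t <= tau:
            least = min(S)
            i = rng.choice([a for a in range(A) if S[a] == least])
        else:
            if a_hat is None:   # commit once at t = tau + 1
                m = [S[a] / n[a] if n[a] else 0.0 for a in range(A)]
                best = max(m)
                a_hat = rng.choice([a for a in range(A) if m[a] == best])
            i = a_hat
        n[i] += 1
        S[i] += pull(i)
    return S, n
```

The design choice is visible in the exploration rule: an arm's accrued reward, not its pull count, decides where to explore next, so an unlucky arm keeps being sampled until its reward catches up.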

The following theorem provides a bound on the performance of the Waterfilling(τ) policy.

###### Theorem 5.

Suppose that f(x) = x^α. For each α > 0, there exists a constant c > 0 such that the regret under the Waterfilling(τ) policy satisfies the following:

1. If α < 1 then the regret is O(τ^α T^{1−α}). Thus if τ = c ln T then

 E[R_T] = O(T^{1−α} ln^α T).

2. If α = 1 then the regret is O(τ ln T). Thus if τ = c ln T then

 E[R_T] = O(ln² T).

3. If α > 1 then the regret is O(τ^α). Thus if τ = c ln T then

 E[R_T] = O(ln^α T).

Note that the constants c depend on the system parameters, namely, (μ_a) and (θ_a) (but not on T). This dependence can be resolved by using a τ slightly larger than logarithmic in T, as noted below.

###### Corollary 3.

Fix a sequence (h_T) such that h_T → ∞ as T → ∞; for example, h_T could be ln ln T. Set τ = h_T ln T. Then:

1. If α < 1 then E[R_T] = O(h_T^α T^{1−α} ln^α T).

2. If α = 1 then E[R_T] = O(h_T ln² T).

3. If α > 1 then E[R_T] = O(h_T^α ln^α T).

###### Proof.

By the law of total expectation, we have

 E[R_T] = E[R_T | â∗ = a∗] P(â∗ = a∗) + E[R_T | â∗ ≠ a∗] P(â∗ ≠ a∗) ≤ E[R_T | â∗ = a∗] + T P(â∗ ≠ a∗). (6)

We first obtain a bound on E[R_T | â∗ = a∗], and then on P(â∗ ≠ a∗), from which the result follows.

We can lower-bound the total reward obtained by counting only the rewards obtained from time τ+1 to T. Let Γ_{τ,T} be the total reward obtained from time τ+1 to T under the event {â∗ = a∗}. Thus, we get

 E[R_T | â∗ = a∗] ≤ E[Γ∗_T] − E[Γ_{τ,T}]. (7)

Note that N_a(τ) ≤ τ + θ_a for each arm a. A lower bound on E[Γ_{τ,T}] is obtained using an argument identical to that used for the lower bound on E[Γ∗_T] in Theorem 1, with θ_{a∗} replaced by N_{a∗}(τ) (which is O(τ)), and looking at times τ+1 to T instead of times 1 to T. Thus, we get

 E[Γ_{τ,T}] ≥ μ_{a∗}(T − τ) − ( Σ_b N_b(τ)^α ) Σ_{k=τ}^T 1/(k + θ_{a∗})^α.

Using the above bound and the upper bound of Theorem 1 in (7), we obtain

 E[R_T | â∗ = a∗] ≤ T μ_{a∗} − μ_{a∗} θ^α Σ_{k=1}^T 1/((μ_{a∗}k)^α + θ^α) − μ_{a∗}(T − τ) + ( Σ_b N_b(τ)^α ) Σ_{k=τ}^T 1/(k + θ_{a∗})^α.

Thus, for α < 1 we have

 E[R_T | â∗ = a∗] ≤ μ_{a∗}τ − Ω(T^{1−α}) + O(τ^α T^{1−α}) = O(τ^α T^{1−α}).

Similarly, we obtain that E[R_T | â∗ = a∗] is O(τ ln T) for α = 1, and O(τ^α) for α > 1.

Thus, the result follows if we show that T P(â∗ ≠ a∗) = O(1) when τ is proportional to ln T with a large enough constant. We show that below. We start with the special case where θ_a = θ for each a ∈ A.

During the exploration phase, the algorithm operates in cycles: at the beginning of cycle k, S_a is equal to k−1 for each arm a, and it equals k at the end of the cycle. For each k, let γ_k represent the length of cycle k; more formally, let γ_k = min{t : S_a(t) ≥ k for all a ∈ A} − min{t : S_a(t) ≥ k−1 for all a ∈ A}. At each time during cycle k, the probability of a positive reward is greater than Δ/(2^α m), where Δ = min_a μ_a and the factor 2^α conservatively accounts for the unequal values of N_a across the arms within a cycle. Thus, γ_k can be stochastically upper bounded by the sum of m geometrically distributed random variables, each with success probability no less than Δ/(2^α m). Thus, E[γ_k] ≤ 2^α m²/Δ.

From Theorem 2.1 in [12], we get

 P( Σ_{k=1}^n γ_k > 2 Σ_{k=1}^n E[γ_k] ) ≤ exp( −(2 − ln 2) (Δ/(2^α m)) Σ_{k=1}^n E[γ_k] ).

Clearly, Σ_{k=1}^n E[γ_k] ≤ 2^α n m²/Δ and Σ_{k=1}^n E[γ_k] ≥ nm. Thus, we get

 P( Σ_{k=1}^n γ_k > 2^{α+1} n m²/Δ ) ≤ exp( −(2 − ln 2) Δ n / 2^α ). (8)

Thus, if τ ≥ 2^{α+1} n m²/Δ, then the probability that fewer than n cycles complete by time τ is exponentially small in n. Note that if we explore for n cycles, then we have n reward samples from each arm.

From the definition of the algorithm, the event {â∗ ≠ a∗} implies that μ̂_{a∗}(τ) ≤ μ̂_b(τ) for some b ≠ a∗. Recall that, at the end of each exploration cycle, S_b(τ) is equal across arms b, so that comparing empirical means amounts to comparing pull counts. Thus,

 P( â∗ ≠ a∗ | T_b(τ) = n ) ≤ Σ_{b≠a∗} P( T_{a∗}(τ) ≥ T_b(τ) | T_b(τ) = n ) = Σ_{b≠a∗} P( μ̂_{a∗}(τ) ≤ μ̂_b(τ) | T_b(τ) = n ), (9)

where μ̂_a(τ) = S_a(τ)/T_a(τ) for each a, and δ := μ_{a∗} − max_{b≠a∗} μ_b. Clearly, for each b ≠ a∗ we have that

 P( μ̂_{a∗}(τ) ≤ μ̂_b(τ) | T_b(τ) = n ) ≤ P( μ̂_b(τ) > μ_b + δ/2 | T_b(τ) = n ) + P( μ̂_{a∗}(τ) < μ_{a∗} − δ/2 ). (10)

We will use the following concentration inequality.

###### Lemma 2.

For each arm b ≠ a∗, there exists a constant c_b > 0 independent of n such that

 P( μ̂_b(τ) > μ_b + δ/2 | T_b(τ) = n ) ≤ e^{−c_b n}.

Similarly, there exists a constant c′ > 0 independent of n such that

 P( μ̂_{a∗}(τ) < μ_{a∗} − δ/2 | T_{a∗}(τ) = n ) ≤ e^{−c′ n}.

To prove the lemma, note that for each small constant ε > 0 there exists an integer constant k_ε such that for each time after cycle k_ε, we have λ_a(t) ≤ (1 + ε)/m for each arm a. Thus, after a constant number of pulls of arm b, each pull of arm b results in a success with probability no larger than (1 + ε)μ_b/m. Thus, when arm b is pulled n times, all but a constant number of those pulls have success probability less than or equal to (1 + ε)μ_b/m. Thus, the first part of the lemma follows from a standard exponential concentration result for independent Bernoulli random variables [10]. The second part of the lemma follows similarly.

Let n = ⌊Δτ/(2^{α+1} m²)⌋, where τ is to be determined. Then, from (8), the probability that fewer than n cycles complete by time τ, i.e., that T_b(τ) < n for some b, is at most exp(−(2 − ln 2)Δn/2^α). We use the above lemma, and the fact that τ is linear in n, to bound P(â∗ ≠ a∗) as follows:

 P(â∗ ≠ a∗) ≤ P( â∗ ≠ a∗ | T_b(τ) ≥ n ) + P( T_b(τ) < n )
  ≤ Σ_{k=n}^τ P( â∗ ≠ a∗ | T_b(τ) = k ) + exp( −(2 − ln 2)Δn/2^α )
  ≤ τ P( â∗ ≠ a∗ | T_b(τ) = n ) + exp( −(2 − ln 2)Δn/2^α )
  ≤ τ Σ_{b≠a∗} P( μ̂_{a∗}(τ) < μ̂_b(τ) | T_b(τ) = n ) + exp( −(2 − ln 2)Δn/2^α ),

where the last inequality follows from (9). Further, from Lemma 2 and (10), we have

 P(â∗ ≠ a∗) ≤ (2^{α+1} n m²/Δ) Σ_{b≠a∗} ( e^{−c_b n} + e^{−c′ n} ) + exp( −(2 − ln 2)Δn/2^α ).

Thus, by choosing n = c ln T for a large enough constant c, and noting that τ is linear in n, we get the result for the case where θ_a = θ for each a. For general values of (θ_a), essentially the same argument applies, by noting that for each small constant ε > 0 there exists an integer constant k_ε such that for each time after cycle k_ε, we have N_a(t)/N_b(t) ≤ 1 + ε for each pair of arms a, b. Since k_ε does not depend on T, the concentration arguments above still hold. ∎

## 8 Lower Bounds

In this section, we provide lower bounds showing that, to leading order in T, the performance of the waterfilling policy described in the preceding section is essentially optimal. To understand our construction of the lower bound, consider the case where the externality function is linear (α = 1); the other cases follow similar logic.

Our basic idea is that in order to determine the best arm, any optimal algorithm will need to explore all arms at least Ω(ln T) times. However, this means that after Ω(ln T) time, the total reward on any suboptimal arm will be of order ln T. Because of the effect of the positive externality, any algorithm will then need to “recover” from having accumulated rewards on these suboptimal arms; we show that even if the optimal arm is pulled from this time onwards, a regret of Ω(ln² T) is incurred, simply because arrivals who do not prefer arm a∗ continue to arrive in sufficient numbers.
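For the linear case, this recovery cost can be sketched with a back-of-the-envelope calculation (not the formal proof): if a suboptimal arm holds Θ(ln T) rewards at a time t₀ = Θ(ln T), then at a later time t the arrival prefers that arm with probability on the order of ln T / t, so the cumulative expected loss from these arrivals is

```latex
\sum_{t = t_0}^{T} \mu_{a^*} \cdot \Theta\!\Big(\frac{\ln T}{t}\Big)
  = \Theta\big(\ln T \,(\ln T - \ln t_0)\big)
  = \Theta(\ln^2 T).
```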

###### Theorem 6.
1. For α < 1, there exists no policy with E[R_T] = o(T^{1−α}) on all sets of Bernoulli reward distributions.

2. For α = 1, there exists no policy with E[R_T] = o(ln² T) on all sets of Bernoulli reward distributions.

3. For α > 1, there exists no policy with E[R_T] = o(ln^α T) on all sets of Bernoulli reward distributions.

###### Proof.

We show the result for α = 1 and m = 2. For other values of α and m, the result follows in a similar fashion.

Consider a problem instance where A = {a, b}, with expected rewards μ_a and μ_b respectively. Without loss of generality, assume that μ_a > μ_b. The rewards obtained by a policy can be simulated as follows. Let (X_{k,a})_{k≥1} be a sequence of i.i.d. Bernoulli(μ_a) random variables. Similarly, let (X_{k,b})_{k≥1} be a sequence of i.i.d. Bernoulli(μ_b) random variables. Let J_t represent the set of arms preferred by the arrival at time t. Recall that I_t represents the arm pulled at time t. Then the rewards obtained until time t, denoted Γ_t, are given by:

 Γ_t = Σ_{k=1}^t ( 1{I_k = a} 1{a ∈ J_k} X_{k,a} + 1{I_k = b} 1{b ∈ J_k} X_{k,b} ).

First, we study the following oracle, and in particular characterize the maximum payoff it can achieve. We then use this device to rule out the possibility of policies achieving the performance in the theorem statement.

###### Definition 5 (Oracle(t′)).

Fix time t′. The values ,