Bandit Regret Scaling with the Effective Loss Range

Nicolò Cesa-Bianchi
Dipartimento di Informatica
Università degli Studi di Milano
Milano 20135, Italy
nicolo.cesa-bianchi@unimi.it
   Ohad Shamir
Department of Computer Science
Weizmann Institute of Science
Rehovot 7610001, Israel
ohad.shamir@weizmann.ac.il
Abstract

We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or advance knowledge of the loss of a single, possibly unspecified arm. Along the way, we develop a novel technique which might be of independent interest, to convert any multi-armed bandit algorithm with regret depending on the loss range, to an algorithm with regret depending only on the effective range, while avoiding predictably bad arms altogether.

1 Introduction

In the online learning and bandit literature, a recent and important trend has been the development of algorithms which are capable of exploiting “easy” data, in the sense of improved regret guarantees if the losses presented to the learner have certain favorable patterns. For example, a series of works has studied how the regret can be improved if the losses do not change much across rounds (e.g., [9, 14, 15, 16, 22]); how to be simultaneously competitive w.r.t. both “hard” and “easy” data (e.g., [21, 20, 5, 7]); how to exploit richer feedback on the losses (e.g., [2]) or losses with some predictable structure [19]; and so on. In this paper, we continue this research agenda in a different direction, focusing on improved regret performance in nonstochastic settings with partial feedback where the learner has some knowledge about the variability of the losses within each round.

In the full information setting, where the learner observes the entire set of losses $\ell_{t,1},\ldots,\ell_{t,K}$ after each round $t$, it is possible to obtain regret bounds that scale with the (unknown) effective range of the losses [8, Corollary 1]. Unfortunately, the situation in the bandit setting, where the learner only observes the loss of the chosen action, is quite different. A recent surprising result [11, Corollary 4] implies that in the bandit setting, the standard $\Omega(\sqrt{KT})$ regret lower bound holds even when the losses within each round are guaranteed to lie within an arbitrarily small range $\varepsilon$. The proof defines a process where the losses are kept $\varepsilon$-close to each other, but where their values oscillate unpredictably between rounds. Based on this, one may think that it is impossible to attain improved bounds in the bandit setting which depend on $\varepsilon$, or on some other measure of variability of the losses across arms. In this paper, we show the extent to which partial information about the losses allows one to circumvent this impossibility result in some interesting ways. We analyze two specific settings: one in which the learner can roughly estimate in advance the actual loss value of each arm, and one where she knows the exact loss of some arbitrary and unspecified arm.

In order to motivate the first setting, consider a scenario where the learner knows each arm’s loss up to a certain precision (which may be different for each arm). For example, in the context of stock prices [13, 1], the learner may have a stochastic model providing some estimates of the loss means for the next round. In other cases, the learner may be able to predict that certain arms are going to perform poorly in some rounds. For example, in routing, the learner may know in advance that some route is down, and that a large loss is incurred if that route is picked. Note that in this scenario, a reasonable algorithm should be able to avoid picking that route altogether. However, that breaks the regret guarantees of standard expert/bandit algorithms, which typically require each arm to be chosen with some positive probability. In the resulting regret bounds, it is difficult to avoid at least some dependence on the highest loss values.

To formalize these scenarios and considerations, we study a setting where for each arm $i$ at round $t$, the learner is told that the loss $\ell_{t,i}$ will lie in $[m_{t,i},\,m_{t,i}+\varepsilon_{t,i}]$, for some values $m_{t,i}$ and $\varepsilon_{t,i}\ge 0$ revealed to her in advance. In this setting, we show a generic reduction, which allows one to convert any algorithm for bounded losses, under a generic feedback model (not necessarily a bandit one), to an algorithm with regret depending only on the effective range of the losses (that is, only on the $\varepsilon_{t,i}$ terms, independent of the $m_{t,i}$ terms). Concretely, taking the simple case where the loss of each arm at each round $t$ lies in $[m_t,\,m_t+\varepsilon]$ for some arbitrary $m_t$ and fixed $\varepsilon$, and assuming the step size is properly chosen, we can get a regret bound of $\mathcal{O}(\varepsilon\sqrt{KT\log K})$ with bandit feedback, completely independent of the $m_t$ terms and of the losses’ actual range. Note that this has the desired behavior that as $\varepsilon\to 0$, the regret also converges to zero (in the extreme case where $\varepsilon=0$, the learner essentially knows the losses in advance, and hence can avoid any regret). With full information feedback (where the entire loss vector is revealed at the end of each round), we can use the same technique to recover the regret bound of $\mathcal{O}(\varepsilon\sqrt{T\log K})$. We note that this is a special case of the predictable sequences setting studied in [19], and their proposed algorithm and analysis are applicable here. However, comparing the results, our bandit regret bounds have a better dependence on the number of arms $K$, and our reduction can be applied to any algorithm, rather than the specific one proposed in [19]. On the flip side, the algorithm proposed in [19] is tailored to the more general setting of bandit linear optimization, and does not require the range parameters $\varepsilon_{t,i}$ to be known in advance (see Sec. 3 for a more detailed comparison). We also study the tightness of our regret guarantees by providing lower bounds.

A second scenario motivating partial knowledge about the loss vectors is the following. Consider a system for recommending products to visitors of some company’s website. Say that two products are similar if the typical visitor tends to like them both or dislike them both. Hence, if we consider the similarity graph over the set of products, then it is plausible to assume that the likelihood of purchase (or any related index of the visitor’s behavior) is a smooth function over this graph. Formally, the loss vector $\ell_t$ at each round $t$ satisfies $\ell_t^\top L_t\,\ell_t=\sum_{(i,j)\in E_t}(\ell_{t,i}-\ell_{t,j})^2\le\varepsilon_t^2$, where $L_t$ is the Laplacian matrix associated with a graph over the arms with edge set $E_t$, and $\varepsilon_t$ is a smoothness parameter. In this setting, we provide improved bandit regret bounds depending on the spectral properties of the Laplacian. To circumvent the impossibility result of [11] mentioned earlier, we make the reasonable assumption that at the end of each round, the learner is given an “anchor point”, corresponding to the loss of some unspecified arm. In our motivating example, the recommender system may assume, for instance, that each visitor has some product that she most likely won’t buy. Using a simple modification of the Exp3 algorithm, we show that if the parameters are properly tuned, we attain a regret bound of order $\sqrt{(1+\varepsilon^2/\lambda_2)KT}$ (ignoring log factors), where $\varepsilon$ bounds the smoothness parameters and $\lambda_2$ is the second-smallest eigenvalue of $L_t$, also known as the algebraic connectivity number of the graph represented by $L_t$. If the learner is told the minimal loss at every round (rather than the loss of an arbitrary arm), this bound can be improved to order $\varepsilon\sqrt{KT/\lambda_2}$ (again, ignoring log factors), which vanishes, as it should, when $\varepsilon_t=0$ for all $t$; that is, when all arms share the same loss value. We also provide a lower bound, showing that this upper bound is the best possible (up to log factors) in the worst case. Although our basic results pertain to connected graphs, using the range-dependent reductions discussed earlier, we show they can be applied to graphs with multiple connected components and anchor points.

The paper is structured as follows: In Sec. 2, we formally define the standard experts/bandit online learning setting, which is the focus of our paper, and devote a few words to the notation we use. In Sec. 3, we discuss the situation where each individual loss is known to lie in a certain range, and provide an algorithm as well as upper and lower bounds on the expected regret. In Sec. 4, we consider the setting of smooth losses (as defined above). All our formal proofs are presented in the appendices.

2 Setting and notation

The standard experts/bandit learning setting (with nonstochastic losses) is phrased as a repeated game between a learner and an adversary, defined over a fixed set of $K$ arms/actions. Before the game begins, the adversary assigns losses for each of the $K$ arms and each of $T$ rounds (this is also known as an oblivious adversary, as opposed to a nonoblivious one which sets the losses during the game’s progress). The loss of arm $i$ at round $t$ is denoted as $\ell_{t,i}$, and is assumed w.l.o.g. to lie in $[0,1]$. We let $\ell_t$ denote the vector $(\ell_{t,1},\ldots,\ell_{t,K})$. At the beginning of each round $t$, the learner chooses an arm $I_t$, and receives the associated loss $\ell_{t,I_t}$. With bandit feedback, the learner then observes only her own loss $\ell_{t,I_t}$, whereas with full information feedback, the learner gets to observe $\ell_{t,i}$ for all $i$. The learner’s goal is to minimize the expected regret (sometimes denoted as pseudo-regret), defined as

$\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,I_t}\right]-\min_{k=1,\ldots,K}\sum_{t=1}^{T}\ell_{t,k},$
where the expectation is over the learner’s possible randomness. We use $\mathbb{1}\{A\}$ to denote the indicator of the event $A$, and let $\log$ denote the natural logarithm. Given an (undirected) graph over $K$ nodes, its Laplacian $L$ is defined as the $K\times K$ matrix where $L_{i,i}$ equals the degree of node $i$, and for $i\neq j$, $L_{i,j}$ equals $-1$ if node $i$ is adjacent to node $j$, and $0$ otherwise. We let $\lambda_2(L)$ denote the second-smallest eigenvalue of $L$. This is also known as the algebraic connectivity number, and is larger the more well-connected the graph is. In particular, $\lambda_2(L)=0$ for disconnected graphs, and $\lambda_2(L)=K$ for the complete graph.
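To make the notation concrete, the following small sketch (ours, purely illustrative and not part of the paper; it uses NumPy and a hypothetical edge-list representation) builds a graph Laplacian as defined above and computes its algebraic connectivity $\lambda_2$:

```python
import numpy as np

def laplacian(K, edges):
    """K x K Laplacian of an undirected graph given as a list of edges (i, j)."""
    L = np.zeros((K, K))
    for i, j in edges:
        L[i, i] += 1.0   # diagonal: degree of each node
        L[j, j] += 1.0
        L[i, j] -= 1.0   # off-diagonal: -1 if i and j are adjacent, 0 otherwise
        L[j, i] -= 1.0
    return L

def lambda_2(L):
    """Second-smallest eigenvalue of a symmetric Laplacian."""
    return np.linalg.eigvalsh(L)[1]   # eigvalsh returns eigenvalues in ascending order

K = 5
path   = [(i, i + 1) for i in range(K - 1)]                    # a path: poorly connected
clique = [(i, j) for i in range(K) for j in range(i + 1, K)]   # the complete graph
print(lambda_2(laplacian(K, path)))     # small positive value (~0.38 for K = 5)
print(lambda_2(laplacian(K, clique)))   # equals K (= 5) for the complete graph
```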

3 Rough estimates of individual losses

We consider a variant of the online learning setting presented in Sec. 2, where at the beginning of every round $t$, the learner is provided with additional side information in the form of $\{(m_{t,i},\varepsilon_{t,i})\}_{i=1}^{K}$, with the guarantee that $\ell_{t,i}\in[m_{t,i},\,m_{t,i}+\varepsilon_{t,i}]$ for all $i$. We then propose an algorithmic reduction, which allows one to convert any regret-minimizing algorithm $A$ (with some generic feedback) into an algorithm with regret depending on the $\varepsilon_{t,i}$ terms, independent of the $m_{t,i}$ terms. We assume that given a loss vector $\ell_t$ and chosen action $I_t$, the algorithm $A$ receives as feedback some function $f(\ell_t,I_t)$: For example, if $A$ is an algorithm for the multi-armed bandit setting, then $f(\ell_t,I_t)=\ell_{t,I_t}$, whereas if $A$ is an algorithm for the experts setting, $f(\ell_t,I_t)=\ell_t$. In our reduction, $A$ is sequentially fed, at the end of each round $t$, with $f(\tilde\ell_t,\tilde I_t)$ (where the transformed loss vector $\tilde\ell_t$ and arm $\tilde I_t$ are not necessarily the same as the actual loss vector $\ell_t$ and actual chosen arm $I_t$), and returns a recommended arm $\tilde I_{t+1}$ for the next round, which is used to choose the actual arm $I_{t+1}$.

To formally describe the reduction, we need a couple of definitions. For all $t$, let

$a_t\;\in\;\arg\min_{i}\,m_{t,i}$

denote the arm with the lowest potential loss, based on the provided side-information (if there are ties, we choose among them an arm with smallest $\varepsilon_{t,i}$, and break any remaining ties arbitrarily). Define an arm $i$ as “bad” (at round $t$) if $m_{t,i}>m_{t,a_t}+\varepsilon_{t,a_t}$, and as “good” if $m_{t,i}\le m_{t,a_t}+\varepsilon_{t,a_t}$. Intuitively, “bad” arms are those which cannot possibly have the smallest loss in round $t$. For a loss vector $\ell_t$, define the transformed loss vector $\tilde\ell_t$ as

$\tilde\ell_{t,i}\;=\;\begin{cases}\ell_{t,i}-m_{t,a_t} & \text{if } i \text{ is good,}\\ \varepsilon_{t,a_t} & \text{if } i \text{ is bad.}\end{cases}$

It is easily verified that $\tilde\ell_{t,i}\in[0,\,\varepsilon_{t,a_t}+\varepsilon_{t,i}]$ always. Hence, the range of the transformed losses does not depend on the $m_{t,i}$ terms. The meta-algorithm now does the following at every round $t$:

  1. Get an arm recommendation $\tilde I_t$ from $A$.

  2. Let $I_t=\tilde I_t$ if $\tilde I_t$ is a good arm, and $I_t=a_t$ otherwise.

  3. Choose arm $I_t$ and get feedback $\ell_{t,I_t}$.

  4. Construct feedback $f(\tilde\ell_t,\tilde I_t)$ and feed it to algorithm $A$.

Crucially, note that we assume that $f(\tilde\ell_t,\tilde I_t)$ can be constructed based on the feedback actually available to the meta-algorithm, together with the side information. For example, this is certainly true in the full information setting (as we are given $\ell_t$, hence can explicitly compute $\tilde\ell_t$). This is also true in the bandit setting: If $\tilde I_t$ is a “good” arm, then $I_t=\tilde I_t$ and $\tilde\ell_{t,\tilde I_t}=\ell_{t,I_t}-m_{t,a_t}$, hence we can construct the feedback based on the loss actually given to the meta-algorithm. If $\tilde I_t$ is a “bad” arm, then we can indeed construct $\tilde\ell_{t,\tilde I_t}=\varepsilon_{t,a_t}$, since this quantity is given to the meta-algorithm as side-information. This framework can potentially be used for other partial-feedback settings as well.
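For concreteness, here is a minimal, purely illustrative Python sketch (ours) of one round of the meta-algorithm with bandit feedback; the `recommend()`/`update()` interface of the base algorithm and the `pull` callable standing in for the environment are hypothetical, and the transformed-loss convention is the one defined above:

```python
import numpy as np

def meta_round(base_alg, m, eps, pull):
    """One round of the reduction with bandit feedback.

    base_alg : object with recommend() -> arm index, and update(arm, loss)
    m, eps   : np.ndarray side information; the loss of arm i lies in [m[i], m[i] + eps[i]]
    pull     : callable mapping a chosen arm to its observed loss
    """
    a = int(np.argmin(m))                  # a_t: arm with the lowest potential loss
    good = m <= m[a] + eps[a]              # arms that could still have the smallest loss
    rec = base_alg.recommend()             # arm recommended by the base algorithm A
    played = rec if good[rec] else a       # play a_t instead of a predictably bad arm
    loss = pull(played)                    # observed (actual) loss of the played arm
    if good[rec]:
        fed_loss = loss - m[a]             # transformed loss of a good recommended arm
    else:
        fed_loss = eps[a]                  # transformed loss of a bad arm (known from side info)
    base_alg.update(rec, fed_loss)         # feed transformed feedback for the *recommended* arm
    return played, loss
```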

The following key theorem implies that the expected regret of this meta-algorithm can be upper bounded by the expected regret of $A$ with respect to the transformed losses (whose range is independent of the $m_{t,i}$ terms):

Theorem 1.

Suppose (without loss of generality) that the arm $\tilde I_t$ given by $A$ is chosen at random by sampling from a probability distribution $\tilde p_t$. Let $p_t$ be the induced distribution of $I_t$ (by definition of the meta-algorithm, we have $p_{t,i}=\tilde p_{t,i}$ if $i\neq a_t$ is good, $p_{t,i}=0$ if $i$ is bad, and $p_{t,a_t}=\tilde p_{t,a_t}+\sum_{j\,\text{bad}}\tilde p_{t,j}$). Then for any fixed arm $k$, it holds that

$\sum_{t=1}^{T}\Bigl(\mathbb{E}_{I_t\sim p_t}\bigl[\ell_{t,I_t}\bigr]-\ell_{t,k}\Bigr)\;\le\;\sum_{t=1}^{T}\Bigl(\mathbb{E}_{\tilde I_t\sim\tilde p_t}\bigl[\tilde\ell_{t,\tilde I_t}\bigr]-\tilde\ell_{t,k}\Bigr). \qquad (1)$

This implies in particular that

$\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,I_t}\right]-\sum_{t=1}^{T}\ell_{t,k}\;\le\;\mathbb{E}\left[\sum_{t=1}^{T}\tilde\ell_{t,\tilde I_t}\right]-\sum_{t=1}^{T}\tilde\ell_{t,k},$

where the expectation is over the possible randomness of the algorithm $A$. Moreover, $\tilde\ell_{t,i}\in[0,\,\varepsilon_{t,a_t}+\varepsilon_{t,i}]$ for any good $i$, and $\tilde\ell_{t,i}=\varepsilon_{t,a_t}$ for any bad $i$.

The proof of the theorem (in the appendices) relies on a careful use of how the transformed losses and actions were defined. Since the range of $\tilde\ell_t$ is independent of the $m_{t,i}$ terms, we get a regret bound for our meta-algorithm which depends only on the $\varepsilon_{t,i}$ terms. This is exemplified in the following two corollaries:

Corollary 1.

With bandit feedback and using Exp3 as the algorithm $A$ (with step size $\eta$), the expected regret of the meta-algorithm is at most

$\frac{\log K}{\eta}+\frac{\eta}{2}\sum_{t=1}^{T}\left(\sum_{i\in G_t}\bigl(\varepsilon_{t,a_t}+\varepsilon_{t,i}\bigr)^{2}+\bigl(K-|G_t|\bigr)\varepsilon_{t,a_t}^{2}\right),$

where $G_t$ is the set of “good” arms at round $t$.

The optimal choice of $\eta$ leads to a regret of order $\sqrt{\log K\sum_{t=1}^{T}\bigl(\sum_{i\in G_t}(\varepsilon_{t,a_t}+\varepsilon_{t,i})^2+(K-|G_t|)\varepsilon_{t,a_t}^2\bigr)}$. This recovers the standard $\mathcal{O}(\sqrt{KT\log K})$ Exp3 bound in the case $m_{t,i}=0$, $\varepsilon_{t,i}=1$ for all $t,i$ (i.e., the standard setting where the losses are only known to be bounded in $[0,1]$), but can be considerably better if the $\varepsilon_{t,i}$ terms are small, or the $m_{t,i}$ terms are large. We also note that the $\log K$ factor can in principle be removed, e.g., by using the implicitly normalized forecaster of [3] with appropriate parameters. A similar corollary can be obtained in the full information setting, using a standard algorithm such as Hedge [10].

Corollary 2.

With full information feedback and using Hedge as the algorithm $A$ (with step size $\eta$), the expected regret of the meta-algorithm is at most

$\frac{\log K}{\eta}+\frac{\eta}{8}\sum_{t=1}^{T}\Bigl(\varepsilon_{t,a_t}+\max_{i\in G_t}\varepsilon_{t,i}\Bigr)^{2}.$

The optimal choice of $\eta$ leads to regret of order $\sqrt{\log K\sum_{t=1}^{T}\bigl(\varepsilon_{t,a_t}+\max_{i\in G_t}\varepsilon_{t,i}\bigr)^2}$, which is $\mathcal{O}(\varepsilon\sqrt{T\log K})$ in the simple case where all $\varepsilon_{t,i}=\varepsilon$. As in the bandit setting, our reduction can be applied to other algorithms as well, including those with more refined loss-dependent guarantees (e.g., [22] and references therein).
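To illustrate how a standard algorithm is plugged in as the base algorithm $A$ in Corollary 1, here is a bare-bones Exp3 implementation (our sketch, not the code of [3] or [4]) exposing the hypothetical `recommend()`/`update()` interface used in the meta-algorithm sketch above; the update uses standard importance weighting with step size $\eta$:

```python
import numpy as np

class Exp3:
    """Bare-bones Exp3 for losses, usable as the base algorithm A in the reduction above."""
    def __init__(self, K, eta, rng=None):
        self.K, self.eta = K, eta
        self.weights = np.ones(K)
        self.p = np.full(K, 1.0 / K)
        self.rng = rng or np.random.default_rng(0)

    def recommend(self):
        self.p = self.weights / self.weights.sum()
        return int(self.rng.choice(self.K, p=self.p))

    def update(self, arm, loss):
        est = np.zeros(self.K)
        est[arm] = loss / self.p[arm]            # importance-weighted loss estimate
        self.weights *= np.exp(-self.eta * est)  # multiplicative-weights update
        self.weights /= self.weights.sum()       # re-normalize for numerical stability
```

Because the meta-algorithm only ever feeds this algorithm transformed losses, whose range depends on the $\varepsilon_{t,i}$ terms alone, the resulting regret is driven by the effective range rather than by the actual loss magnitudes.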

Finally, we note that Thm. 1 can easily be used to provide high-probability bounds on the actual regret $\sum_{t}\ell_{t,I_t}-\min_{k}\sum_{t}\ell_{t,k}$, rather than just bounds in expectation, as long as we have a high-probability regret bound for $A$. This is due to Eq. (1), and can be easily shown using standard martingale arguments.

3.1 Related work

As mentioned in the introduction, a question similar to those we are studying here was considered in [19], under the name of learning with predictable sequences. Unlike our setting, however, [19] does not require knowledge of the ranges $\varepsilon_{t,i}$. Assuming the step size is chosen appropriately, they provide algorithms with expected regret bounds that scale with the cumulative squared deviation between the actual losses $\ell_t$ and the provided estimates (in an appropriate norm), in both the full information and the bandit settings.

Comparing these bounds to Corollaries 1 and 2, we see that we obtain a similar regret bound in the full information setting, whereas in the bandit setting, our bound has a better dependence on the number of arms $K$, and a better dependence on the $\varepsilon_{t,i}$ terms when these vary significantly across arms, or when the number of “good” arms tends to be small. Also, our algorithmic approach is based on a reduction, which can be applied in principle to any algorithm and to general families of feedback settings, rather than to a specific algorithm. On the flip side, there are regimes in which our bound in the bandit setting can be worse than that of [19]. Also, their algorithm is tailored to the more general setting of bandit linear optimization (where at each round the learner needs to pick a point $x_t$ in some convex set $\mathcal{W}$, and receives a loss $\langle\ell_t,x_t\rangle$), and does not require knowing the $\varepsilon_{t,i}$ terms in advance.

Another related line of work is path-based bounds, where it is assumed that the losses tend to vary slowly with $t$, so that $\ell_{t-1,i}$ provides a good estimate of $\ell_{t,i}$. This can be linked to our setting by taking $m_{t,i}=\ell_{t-1,i}$, and $\varepsilon_{t,i}$ to be some known upper bound on $|\ell_{t,i}-\ell_{t-1,i}|$. However, implementing this requires the assumption that $\ell_{t-1,i}$ is revealed (for all $i$) at the next round $t$, which does not fit the bandit setting. Thus, it is difficult to directly compare these results to ours. Most work on this topic has focused on the full information feedback setting (see [22] and references therein), and the bandit setting was studied for instance in [15].

3.2 Lower bound

We now turn to consider the tightness of our results. Since the focus of this paper is to study the variability of the losses across arms, rather than across time, we will consider for simplicity the case where $m_{t,i}=m_i$ and $\varepsilon_{t,i}=\varepsilon_i$ are fixed for all $t$ (hence the subscript $t$ can be dropped).

In the theorem below, we show that the dependencies on $\sqrt{\sum_i\varepsilon_i^2}$ and on $\max_i\varepsilon_i$ (in the bandit and full information case, respectively) cannot be improved in general.

Theorem 2.

Fix $K$ and nonnegative $\varepsilon_1,\ldots,\varepsilon_K$ (suitably bounded, so that the losses below can be taken in $[0,1]$). Then there exist fixed parameters $m_1,\ldots,m_K$ for the $K$ arms such that the following holds: For any (possibly randomized) learner strategy, there exists a loss assignment satisfying $\ell_{t,i}\in[m_i,\,m_i+\varepsilon_i]$ for all $t,i$, such that the expected regret is at least

$c\,\sqrt{T\sum_{i=1}^{K}\varepsilon_i^{2}}\quad\text{with bandit feedback,}\qquad\text{and}\qquad c\,\max_i\varepsilon_i\,\sqrt{T}\quad\text{with full information feedback,}$

where $c$ is a universal constant.

The proof is conceptually similar to the standard $\Omega(\sqrt{KT})$ regret lower bound for nonstochastic multi-armed bandits (see [6]), where the losses are generated stochastically, with one randomly-chosen and hard-to-find arm having a slightly smaller loss in expectation. However, we utilize a more involved stochastic process to generate the losses as well as to choose the better arm, which takes the values of $\varepsilon_1,\ldots,\varepsilon_K$ into account.

Remark 1.

The construction in the bandit setting is such that all arms are potentially “good” in the sense used in Corollary 1, and hence $G_t$ coincides with the full set of arms (recall $G_t$ is the set of “good” arms at time $t$). If one wishes to consider a situation where some arms are “bad”, and obtain a bound dependent on $|G_t|$, one can simply pick sufficiently large $m_i$ values for them, and ignore their contribution to the regret in the lower bound analysis.

The lower bound leaves open the possibility of removing the dependence on the “bad” arms (the term involving $K-|G_t|$ in Corollary 1) in the upper bound. This term is immaterial when the number of bad arms is comparable to, or smaller than, $|G_t|$ (e.g., if most arms are good, and $\varepsilon_{t,i}$ is about the same for all $i$), but there are certainly situations where it could be otherwise. This question is left to future work.

4 Smooth losses

As discussed in the introduction, a line of work in the online learning literature considered the situation where the loss of each arm varies slowly across time (e.g., $|\ell_{t,i}-\ell_{t',i}|$ tends to be small when $t$ and $t'$ are close to each other), and showed how to attain better regret guarantees in such a case. An orthogonal question is whether such improved performance is possible when the losses vary smoothly across arms; namely, when $|\ell_{t,i}-\ell_{t,j}|$ tends to be small for all pairs of actions $i,j$ that are similar to each other.

It turns out that this assumption can be exploited, avoiding the lower bound of [11], if the learner is given (or can compute) an “anchor point” $s_t$ at the end of round $t$, which equals the loss of some arm at round $t$, independent of the learner’s randomness at that round. Importantly, the learner need not even know which arm has this loss. For example, it is often reasonable to assume that there is always some arm which attains a minimal loss of $0$, or some arm which attains a maximal loss of $1$. In that case, instead of estimating losses in $[0,1]$, it is enough to estimate losses of the form $\ell_{t,i}-s_t$, which may lie in a much narrower range if the variability of the losses across arms is constrained to be small.

To see why this “anchor point” side-information circumvents the lower bound of [11], we briefly discuss their construction (in a slightly simplified manner): The authors consider a situation where the losses are generated stochastically and independently at each round $t$ according to $\ell_{t,i}=X_t+\varepsilon\cdot\mathbb{1}\{i\neq i^\star\}$, with $X_t$ being a standard Gaussian random variable, $\varepsilon>0$ a small constant, and $i^\star$ being some arm chosen uniformly at random. Hence, at every round, arm $i^\star$ has a loss smaller by $\varepsilon$ than all other arms. Getting an expected regret substantially smaller than $\varepsilon T$ would then amount to detecting $i^\star$. However, since the learner observes only a single loss every round, the similarity of the losses for different arms at a given round does not help much. In contrast, if the learner had access to the loss of some fixed arm (independent of the learner’s randomness), she could easily detect $i^\star$ within roughly $K$ rounds, simply by maintaining a “feasible set” $S$ of possible arms, picking arms in $S$ at random, and removing an arm $i$ from $S$ if $\ell_{t,i}-s_t$ is positive. This process ends once $S$ contains a single arm, which must be $i^\star$.
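The following toy simulation (ours, purely illustrative) mirrors the elimination argument above under the simplified construction just described; for concreteness we take the anchor point to be the minimal loss in each round, so that a positive gap to the anchor certifies that the played arm is not $i^\star$:

```python
import numpy as np

rng = np.random.default_rng(1)
K, eps = 20, 0.1
i_star = int(rng.integers(K))                      # hidden arm with the smaller loss
feasible, t = set(range(K)), 0
while len(feasible) > 1:
    t += 1
    x = rng.normal()                               # common noise term X_t
    losses = x + eps * (np.arange(K) != i_star)    # i_star is smaller by eps
    s = losses.min()                               # anchor point s_t (here: the minimal loss)
    i = int(rng.choice(sorted(feasible)))          # play one feasible arm (bandit feedback)
    if losses[i] - s > 0:                          # positive gap to the anchor: i cannot be i_star
        feasible.discard(i)
print(t, feasible == {i_star})                     # detection after roughly K rounds
```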

To formalize this setting in a flexible manner, we follow a graph-based approach, inspired by [23]. Specifically, we assume that at every round $t$, a graph over the $K$ arms, with an associated Laplacian matrix $L_t$ and smoothness parameter $\varepsilon_t$, can be defined so that the loss vector $\ell_t$ satisfies

$\ell_t^\top L_t\,\ell_t\;=\;\sum_{(i,j)\in E_t}\bigl(\ell_{t,i}-\ell_{t,j}\bigr)^{2}\;\le\;\varepsilon_t^{2},$

where $E_t$ is the graph’s edge set. The smaller $\varepsilon_t$ is, the more similar are the losses, on average. This can naturally interpolate between the standard bandit setting (where the losses need not be similar) and the extreme case where all losses are the same, in which case the regret is always trivially zero. Crucially, note that the learner need not have explicit knowledge of either $L_t$ or $\varepsilon_t$: In fact, our regret upper bounds, which depend on these quantities, will hold for any $L_t$ and $\varepsilon_t$ which are valid with respect to the vectors of actual losses (possibly the ones minimizing the bounds). The only thing we do expect the learner to know (at the end of each round $t$) is the “anchor point” $s_t$ as described above. We also note that this setting is quite distinct from the graph bandits setting of [17, 2], which also assumes a graph structure over the arms, but there the graph encodes what feedback the learner receives, as opposed to encoding similarities between the losses themselves.

We now turn to describe the algorithm and associated regret bound. The algorithm itself is very simple: run a standard multi-armed bandit algorithm suitable for our setting (in particular, Exp3 [4]) using the shifted losses $\ell_{t,i}-s_t+1$. The associated regret guarantee is formalized in Thm. 3 below.
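Concretely, one round of this procedure looks as follows (our sketch, reusing the hypothetical Exp3 interface from Sec. 3; `pull` and `get_anchor` stand in for the environment):

```python
def smooth_bandit_round(exp3, pull, get_anchor):
    """One round of Exp3 on anchor-shifted losses.

    pull(arm)    -> observed loss of the chosen arm
    get_anchor() -> anchor s_t, the loss of some (unspecified) arm in this round
    """
    arm = exp3.recommend()
    loss = pull(arm)
    s = get_anchor()
    # Shift by s_t - 1 so that the fed loss stays nonnegative (Theorem 3);
    # if s_t is the minimal loss, the "+ 1" can be dropped (Corollary 3).
    exp3.update(arm, loss - s + 1.0)
    return arm, loss
```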

Theorem 3.

Assume that in each round $t$, after choosing $I_t$, the learner is told a number $s_t$, chosen by the oblivious adversary, such that there exists some arm $j$ with $\ell_{t,j}=s_t$. Then Exp3 performing updates based on the loss vectors $\bigl(\ell_{t,1}-s_t+1,\ldots,\ell_{t,K}-s_t+1\bigr)$ achieves

$\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,I_t}\right]-\min_{k}\sum_{t=1}^{T}\ell_{t,k}\;\le\;\frac{\log K}{\eta}+\eta\,K\sum_{t=1}^{T}\mathcal{O}\!\left(1+\frac{\varepsilon_t^{2}}{\lambda_2(L_t)}\right),$

where each $L_t$ is the Laplacian of any simple and connected graph on the $K$ arms such that $\ell_t^\top L_t\,\ell_t\le\varepsilon_t^{2}$ for all $t$.

The proof is based on Euclidean-norm regret bounds for the Exp3 algorithm, combined with a careful analysis of the associated quantities based on the Laplacian constraint $\ell_t^\top L_t\,\ell_t\le\varepsilon_t^2$.

By this theorem, we get that if the step size $\eta$ is chosen optimally (based on $K$, $T$ and the quantities $\varepsilon_t^2/\lambda_2(L_t)$), then we get a regret bound of order $\sqrt{K\log K\sum_{t=1}^{T}\bigl(1+\varepsilon_t^2/\lambda_2(L_t)\bigr)}$. We note that even if some of these parameters are unknown in advance, this can be easily handled using doubling-trick arguments (see the appendices for a proof), and the same holds for our other results.
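For completeness, the following sketch (ours; the `make_alg`/`play_round` interface is hypothetical, and the budgeted quantity is only one natural choice) shows the kind of doubling-trick wrapper alluded to above: a budget on the relevant cumulative quantity is guessed, $\eta$ is tuned for that guess, and the algorithm is restarted with a doubled budget whenever the guess is exceeded.

```python
import numpy as np

def with_doubling(make_alg, K, T, play_round):
    """Doubling-trick wrapper around a step-size-tuned algorithm."""
    B, spent = 1.0, 0.0
    eta = np.sqrt(np.log(K) / B)
    alg = make_alg(eta)
    for t in range(T):
        fed = play_round(alg)        # plays one round, returns the loss value fed to the algorithm
        spent += fed ** 2            # track the cumulative squared (shifted) losses
        if spent > B:                # budget exceeded: double it, re-tune eta, restart
            B, spent = 2.0 * B, 0.0
            eta = np.sqrt(np.log(K) / B)
            alg = make_alg(eta)
```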

The bound of Theorem 3 is not fully satisfying, as it does not vanish when $\varepsilon_t=0$ for all $t$ (which, assuming the graph is connected, implies that all losses are the same). The reason is that we need to add $1$ to each loss component in order to guarantee that we do not end up with negative components when $s_t$ is subtracted from $\ell_t$. This is avoided when, in each round $t$, the revealed loss $s_t$ is the smallest component of $\ell_t$, as formalized in the following corollary.

Corollary 3.

Assume that in each round $t$, after choosing $I_t$, the learner is told $s_t=\min_i\ell_{t,i}$. Then Exp3 performing updates using the losses $\bigl(\ell_{t,1}-s_t,\ldots,\ell_{t,K}-s_t\bigr)$ achieves

$\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,I_t}\right]-\min_{k}\sum_{t=1}^{T}\ell_{t,k}\;\le\;\frac{\log K}{\eta}+\eta\,K\sum_{t=1}^{T}\mathcal{O}\!\left(\frac{\varepsilon_t^{2}}{\lambda_2(L_t)}\right),$

where each $L_t$ is the Laplacian of any simple and connected graph on the $K$ arms such that $\ell_t^\top L_t\,\ell_t\le\varepsilon_t^{2}$ for all $t$.

We leave the question of obtaining such a bound, without $s_t$ being the smallest loss, as an open problem.

We now show how the bounds stated in Theorem 3 and Corollary 3 relate to the standard Exp3 bound, which in its tightest form is of order $\sqrt{(\log K)\sum_{t=1}^{T}\|\ell_t\|^2}$ — see Lemma 2 in the supplementary material. Recall that our bounds are achieved for all choices of $L_t$ and $\varepsilon_t$ such that $\ell_t^\top L_t\,\ell_t\le\varepsilon_t^2$ for all $t$. Now assume, for each $t$, that $L_t$ is the Laplacian of the $K$-clique. Then $L_t$ has all nonzero eigenvalues equal to $K$, and so the condition is satisfied for $\varepsilon_t^2=K\|\ell_t\|^2$. As $\lambda_2(L_t)$ is also equal to $K$, we have that $\varepsilon_t^2/\lambda_2(L_t)=\|\ell_t\|^2$. Hence, when $\eta$ is tuned optimally (e.g., through the doubling trick), the bounds of Theorem 3 and Corollary 3 take, respectively, the form

$\mathcal{O}\Bigl(\sqrt{K\log K\,\textstyle\sum_{t=1}^{T}\bigl(1+\|\ell_t\|^2\bigr)}\Bigr)\qquad\text{and}\qquad\mathcal{O}\Bigl(\sqrt{K\log K\,\textstyle\sum_{t=1}^{T}\|\ell_t\|^2}\Bigr). \qquad (2)$
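The clique calculation above is easy to verify numerically (a quick check of ours, not part of the paper): every nonzero eigenvalue of the $K$-clique Laplacian equals $K$, and $\ell^\top L\,\ell=K\|\ell\|^2-(\sum_i\ell_i)^2\le K\|\ell\|^2$.

```python
import numpy as np

K = 6
L = K * np.eye(K) - np.ones((K, K))            # Laplacian of the K-clique
print(np.round(np.linalg.eigvalsh(L), 8))      # [0, K, K, ..., K], so lambda_2 = K

ell = np.random.default_rng(0).random(K)       # an arbitrary loss vector in [0, 1]^K
lhs = ell @ L @ ell
rhs = K * np.sum(ell ** 2) - np.sum(ell) ** 2
print(np.isclose(lhs, rhs))                    # True: ell' L ell = K ||ell||^2 - (sum ell)^2
```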

Finally, we show that for fixed graphs $G_t=G$ (with Laplacian $L$), the regret bound in Eq. (2) (right-hand side) is tight in the worst case up to log factors.

Theorem 4.

There exist universal constants $c,c'>0$ such that the following holds: For any randomized algorithm, any $\varepsilon>0$, any $\lambda\in(0,c']$, and any sufficiently large $K$ and $T$, there exists a $K$-node graph with Laplacian $L$ satisfying $\lambda_2(L)\ge\lambda$, and an adversary strategy, such that the expected regret (w.r.t. the algorithm’s internal randomization) is at least

$c\,\varepsilon\,\sqrt{\frac{KT}{\lambda}},$

while $\ell_t^\top L\,\ell_t\le\varepsilon^2$ for all $t$.

This theorem matches Eq. (2), assuming that $\varepsilon_t=\varepsilon$ for all $t$, and that $\lambda_2(L)$ is bounded by a constant. Note that the latter assumption is generally the interesting regime for $\lambda_2(L)$ (for example, $\lambda_2(L)\le 1$ as long as there is some node connected by a single edge). The proof is based on considering an “octopus” graph, composed of long threads emanating from one central node, and applying a standard bandit lower bound strategy on the nodes at the ends of the threads.
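To give a feel for this construction, the sketch below (ours; the exact parameters used in the proof differ) builds such an “octopus” graph and checks that its algebraic connectivity is small, which is what makes the lower bound large:

```python
import numpy as np

def octopus_edges(num_threads, thread_len):
    """Edges of an 'octopus': node 0 is the center, each thread is a path of thread_len nodes."""
    edges, node = [], 1
    for _ in range(num_threads):
        prev = 0                          # each thread starts at the central node
        for _ in range(thread_len):
            edges.append((prev, node))
            prev, node = node, node + 1   # the last node of the thread is its tip
    return edges

num_threads, thread_len = 4, 10
K = 1 + num_threads * thread_len
L = np.zeros((K, K))
for i, j in octopus_edges(num_threads, thread_len):
    L[i, i] += 1; L[j, j] += 1; L[i, j] -= 1; L[j, i] -= 1
print(K, np.linalg.eigvalsh(L)[1])        # lambda_2 is small (roughly 1 / thread_len^2)
```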

4.1 Multiple connected components

The previous results of this section need the graph represented by $L_t$ to be connected, in order for the guarantees to be non-vacuous. This is not just an artifact of the analysis: If the graph is not connected, at least some arms can have losses which are arbitrarily different from those of other arms, and the anchor point side information is not necessarily useful. Indeed, if there are multiple connected components, then $\lambda_2(L_t)=0$ and our bounds become trivial. Nevertheless, we now show it is still possible to get improved regret performance in some cases, as long as the learner is provided with anchor point information on each connected component of the graph.

We assume that at every round $t$, there is some graph defined over the $K$ arms, with edge set $E_t$. However, here we assume that this graph may have multiple connected components (indexed by $c$ in some set $C_t$). For each connected component $c$, with associated Laplacian $L_{t,c}$, we assume the learner has access to an anchor point $s_{t,c}$, corresponding to the loss of some unspecified arm in that component. Unlike the case discussed previously, here the anchor points may be different in different components, so a simple shifting of the losses (as done in Sec. 4) no longer suffices to get a good bound. However, the anchor points still allow us to compute some interval in which each loss must lie, which in turn can be plugged into the algorithmic reduction presented in Sec. 3. This is formalized in the following lemma.

Lemma 1.

For any connected component $c$, and any arm $i$ in that component, $\bigl|\ell_{t,i}-s_{t,c}\bigr|\le 2\varepsilon_t/\sqrt{\lambda_2(L_{t,c})}$.

Based on this lemma, we know that any arm in connected component $c$ has a loss value in the interval $\bigl[s_{t,c}-2\varepsilon_t/\sqrt{\lambda_2(L_{t,c})},\;s_{t,c}+2\varepsilon_t/\sqrt{\lambda_2(L_{t,c})}\bigr]$.
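To illustrate how these anchor points are converted into the side information $\{(m_{t,i},\varepsilon_{t,i})\}$ required by the reduction of Sec. 3, here is a small sketch (ours; the helper and its interface are hypothetical, and the constant in the deviation radius follows the Lemma-1-style bound as stated above):

```python
import numpy as np

def side_info_from_anchors(components, anchors, lambda2s, eps):
    """Per-arm intervals [m_i, m_i + range_i] built from per-component anchor points.

    components : list of lists of arm indices, one list per connected component
    anchors    : anchor point (loss of some arm) for each component
    lambda2s   : algebraic connectivity of each component's Laplacian
    eps        : smoothness parameter of the whole graph at this round
    """
    K = sum(len(comp) for comp in components)
    m, width = np.zeros(K), np.zeros(K)
    for comp, a, lam2 in zip(components, anchors, lambda2s):
        radius = 2.0 * eps / np.sqrt(lam2)   # deviation radius around the component's anchor
        for i in comp:
            m[i] = a - radius                # lower end of arm i's interval
            width[i] = 2.0 * radius          # interval width: the epsilon_{t,i} of Sec. 3
    return m, width
```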

Using this and applying Corollary 1, we have the following result.

Theorem 5.

For any fixed arm $k$, the algorithm described in Corollary 1, fed with the side information derived from Lemma 1, satisfies

$\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t,I_t}\right]-\sum_{t=1}^{T}\ell_{t,k}\;=\;\mathcal{O}\!\left(\sqrt{\log K\;\sum_{t=1}^{T}\left(\frac{\varepsilon_t^{2}}{\lambda_2(L_{t,c_t})}+\sum_{c}K_c\,\frac{\varepsilon_t^{2}}{\lambda_2(L_{t,c})}\right)}\right),$

where $K_c$ is the number of arms in connected component $c$, and $c_t$ is a connected component for which $\varepsilon_t^{2}/\lambda_2(L_{t,c})$ is smallest.

This allows us to get results which depend on the Laplacians $L_{t,c}$, even when the overall graph is disconnected. We note however that this theorem does not recover the results of Sec. 4 when there is only one connected component: we get a looser bound, in which an additional multiplicative factor is spurious. The reason for this looseness is that we go through a coarse upper bound on the magnitude of the losses, and lose the dependence on the Laplacian along the way. This is not just an artifact of the analysis: Recall that the algorithmic reduction proceeds by using transformations of the actual losses, and these transformations may not satisfy the same Laplacian constraints as the original losses. Getting a better algorithm with improved regret performance in this particular setting is left to future work.

References

  • [1] Jacob Abernethy, Peter L Bartlett, Rafael Frongillo, and Andre Wibisono. How to hedge an option against an adversary: Black-scholes pricing is minimax optimal. In NIPS, 2013.
  • [2] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. arXiv preprint arXiv:1409.8428, 2014.
  • [3] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
  • [4] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • [5] Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. arXiv preprint arXiv:1605.08722, 2016.
  • [6] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  • [7] Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In COLT, pages 42–1, 2012.
  • [8] Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.
  • [9] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT, pages 6–1, 2012.
  • [10] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
  • [11] Sébastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. In NIPS, 2016.
  • [12] Robert Grone, Russell Merris, and V. S. Sunder. The Laplacian spectrum of a graph. SIAM Journal on Matrix Analysis and Applications, 11(2):218–238, 1990.
  • [13] Elad Hazan and Satyen Kale. On stochastic and worst-case models for investing. In NIPS, 2009.
  • [14] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine learning, 80(2-3):165–188, 2010.
  • [15] Elad Hazan and Satyen Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12(Apr):1287–1311, 2011.
  • [16] Zohar S Karnin and Oren Anava. Multi-armed bandits: Competing with optimal sequences. In NIPS, 2016.
  • [17] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
  • [18] Ali Ajdari Rad, Mahdi Jalili, and Martin Hasler. A lower bound for algebraic connectivity based on the connection-graph-stability method. Linear Algebra and its Applications, 435(1):186–192, 2011.
  • [19] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, pages 993–1019, 2013.
  • [20] Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems, pages 810–818, 2014.
  • [21] Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. In ICML, 2014.
  • [22] Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In ICML, pages 1593–1601, 2014.
  • [23] Michal Valko, Rémi Munos, Branislav Kveton, and Tomas Kocak. Spectral bandits for smooth graph functions. In ICML, 2014.

Appendix A Proof of Thm. 1

The proof consists mainly of proving Eq. (1). The in-expectation bound follows by applying expectations on both sides of the inequality, and noting that, conditioned on rounds $1,\ldots,t-1$, the conditional expectation of $\ell_{t,I_t}$ equals $\mathbb{E}_{I_t\sim p_t}[\ell_{t,I_t}]$, and the conditional expectation of $\tilde\ell_{t,\tilde I_t}$ equals $\mathbb{E}_{\tilde I_t\sim\tilde p_t}[\tilde\ell_{t,\tilde I_t}]$. Also, the statement on the range of each $\tilde\ell_{t,i}$ is immediate from the definition of $\tilde\ell_t$ and Eq. (5) below.

We now turn to prove Eq. (1). By adding and subtracting $m_{t,a_t}$ terms, it is sufficient to prove that

$\sum_{t=1}^{T}\Bigl(\mathbb{E}_{I_t\sim p_t}\bigl[\ell_{t,I_t}\bigr]-m_{t,a_t}\Bigr)-\sum_{t=1}^{T}\bigl(\ell_{t,k}-m_{t,a_t}\bigr)\;\le\;\sum_{t=1}^{T}\mathbb{E}_{\tilde I_t\sim\tilde p_t}\bigl[\tilde\ell_{t,\tilde I_t}\bigr]-\sum_{t=1}^{T}\tilde\ell_{t,k}. \qquad (3)$

We will rely on the following facts, which are immediate from the definition of good and bad arms: Any bad arm $i$ must satisfy

$\ell_{t,i}\;\ge\;m_{t,i}\;>\;m_{t,a_t}+\varepsilon_{t,a_t}, \qquad (4)$

and any good arm $i$ must satisfy

$m_{t,a_t}\;\le\;\ell_{t,i}\;\le\;m_{t,i}+\varepsilon_{t,i}\;\le\;m_{t,a_t}+\varepsilon_{t,a_t}+\varepsilon_{t,i}. \qquad (5)$

Based on this, we have the following two claims, whose combination immediately implies Eq. (3).

Claim 1. For any fixed arm $k$ and any $t$, $\tilde\ell_{t,k}\le\ell_{t,k}-m_{t,a_t}$.

To show Claim 1, we consider separately the case where $k$ is a bad arm at round $t$, and the case where it is a good arm at round $t$. If $k$ is a bad arm, then $\tilde\ell_{t,k}=\varepsilon_{t,a_t}$, which is at most $\ell_{t,k}-m_{t,a_t}$ by Eq. (4). Otherwise, if $k$ is a good arm at round $t$, then $\tilde\ell_{t,k}=\ell_{t,k}-m_{t,a_t}$ and the observation follows by definition of $\tilde\ell_t$.

Claim 2. For all $t$,

$\mathbb{E}_{I_t\sim p_t}\bigl[\ell_{t,I_t}\bigr]-m_{t,a_t}\;\le\;\mathbb{E}_{\tilde I_t\sim\tilde p_t}\bigl[\tilde\ell_{t,\tilde I_t}\bigr],$

where $p_t$ and $\tilde p_t$ are the distributions defined in the statement of Thm. 1.

To show Claim 2, recall that if $\tilde I_t$ is a good arm, then $I_t=\tilde I_t$ and $\tilde\ell_{t,\tilde I_t}=\ell_{t,I_t}-m_{t,a_t}$, and otherwise we have $I_t=a_t$ and $\tilde\ell_{t,\tilde I_t}=\varepsilon_{t,a_t}\ge\ell_{t,a_t}-m_{t,a_t}$ (since $\ell_{t,a_t}\le m_{t,a_t}+\varepsilon_{t,a_t}$ by definition). Letting $G_t$ denote the set of good arms at round $t$, we have:

$\mathbb{E}_{\tilde I_t\sim\tilde p_t}\bigl[\tilde\ell_{t,\tilde I_t}\bigr]\;=\;\sum_{i\in G_t}\tilde p_{t,i}\bigl(\ell_{t,i}-m_{t,a_t}\bigr)+\sum_{i\notin G_t}\tilde p_{t,i}\,\varepsilon_{t,a_t}\;\ge\;\sum_{i\in G_t}\tilde p_{t,i}\bigl(\ell_{t,i}-m_{t,a_t}\bigr)+\sum_{i\notin G_t}\tilde p_{t,i}\bigl(\ell_{t,a_t}-m_{t,a_t}\bigr)\;=\;\sum_{i=1}^{K}p_{t,i}\bigl(\ell_{t,i}-m_{t,a_t}\bigr)\;=\;\mathbb{E}_{I_t\sim p_t}\bigl[\ell_{t,I_t}\bigr]-m_{t,a_t}.$

Combining the two claims above, and summing over $t=1,\ldots,T$, we get Eq. (3) as required.

Appendix B Proof of Thm. 2

Suppose the learner uses some (possibly randomized) strategy, and let $U$ be a random variable denoting its random coin flips. Our goal is to provide lower bounds on

$\sup_{\ell}\;\mathbb{E}_U\!\left[\sum_{t=1}^{T}\ell_{t,I_t}-\min_{k}\sum_{t=1}^{T}\ell_{t,k}\right],$

where the expectation is with respect to the learner’s (possibly randomized) strategy. Clearly, this is lower bounded by

$\mathbb{E}_{i^\star,\ell}\;\mathbb{E}_U\!\left[\sum_{t=1}^{T}\ell_{t,I_t}-\sum_{t=1}^{T}\ell_{t,i^\star}\right],$

where $\mathbb{E}_{i^\star,\ell}$ signifies expectation over some distribution over indices $i^\star$ and losses $\ell$. By Fubini’s theorem, this equals

$\mathbb{E}_U\;\mathbb{E}_{i^\star,\ell}\!\left[\sum_{t=1}^{T}\ell_{t,I_t}-\sum_{t=1}^{T}\ell_{t,i^\star}\right]\;\ge\;\inf_{u}\;\mathbb{E}_{i^\star,\ell}\!\left[\sum_{t=1}^{T}\ell_{t,I_t}-\sum_{t=1}^{T}\ell_{t,i^\star}\;\middle|\;U=u\right],$

where the infimum is over the learner’s random coin flips. Thus, we need to provide some distribution over indices and losses, so that for any deterministic learner,

$\mathbb{E}_{i^\star,\ell}\!\left[\sum_{t=1}^{T}\ell_{t,I_t}-\sum_{t=1}^{T}\ell_{t,i^\star}\right] \qquad (6)$

is lower bounded as stated in the theorem.

The proof will be composed of two constructions, depending on whether we are in the bandit or the full information setting, and on the magnitude of the $\varepsilon_i$ parameters.

B.1 The case with bandit feedback

For this case, we will consider the following distribution: Let $i^\star$ be distributed on $\{1,\ldots,K\}$ according to a probability distribution $q=(q_1,\ldots,q_K)$ (to be specified later). Conditioned on any $i^\star=j$, we define the distribution over losses as follows, independently for each round $t$ and index $i$:

  • If $i\neq j$, then $\ell_{t,i}$ equals $m_i+\varepsilon_i$ w.p. $1/2$, and $m_i$ w.p. $1/2$.

  • If $i=j$, then $\ell_{t,i}$ equals $m_i+\varepsilon_i$ w.p. $1/2-\delta_i$, and $m_i$ w.p. $1/2+\delta_i$, for a bias $\delta_i\in(0,1/2)$ to be chosen later.

Also, let $\mathbb{E}_j[\cdot]$ and $\mathbb{P}_j(\cdot)$ denote expectation and probability (over the space of possible losses and indices) conditioned on the event $i^\star=j$. With this construction, we note that $\mathbb{E}_j[\ell_{t,i}]=m_i+\varepsilon_i/2$ for all $i\neq j$, and $\mathbb{E}_j[\ell_{t,j}]=m_j+\varepsilon_j/2-\delta_j\varepsilon_j$. As a result, choosing the $m_i$ parameters so that $m_i+\varepsilon_i/2$ is the same for all arms, the expected per-round excess loss of playing any arm $i\neq j$ (compared to playing $j$) equals $\delta_j\varepsilon_j$, and therefore Eq. (6) equals

$\sum_{j=1}^{K}q_j\,\delta_j\varepsilon_j\;\mathbb{E}_j\bigl[T-N_j\bigr]. \qquad (7)$

Let $\mathbb{P}_0$ denote the probability distribution over the losses where, for any $t$ and $i$, $\ell_{t,i}$ is independent and equals $m_i$ or $m_i+\varepsilon_i$ with equal probability (note that this induces a probability on any event which is a deterministic function of the loss assignment, such as $\{N_j=n\}$ for some $n$). By a standard information-theoretic argument (see for instance [6, proof of Lemma 3.6]), we have that

$\mathbb{E}_j[N_j]\;\le\;\mathbb{E}_0[N_j]+T\sqrt{\tfrac{1}{2}\,\mathrm{KL}\bigl(\mathbb{P}_0\,\|\,\mathbb{P}_j\bigr)},$

where $N_j$ is the number of times arm $j$ was chosen by the learner, and