Refined Lower Bounds for Adversarial Bandits

Sébastien Gerchinovitz
Institut de Mathématiques de Toulouse
Université Toulouse 3 Paul Sabatier
Toulouse, 31062, France
sebastien.gerchinovitz@math.univ-toulouse.fr
&Tor Lattimore
Department of Computing Science
University of Alberta
Edmonton, Canada
tor.lattimore@gmail.com
Abstract

We provide new lower bounds on the regret that must be suffered by adversarial bandit algorithms. The new results show that recent upper bounds that either (a) hold with high-probability or (b) depend on the total loss of the best arm or (c) depend on the quadratic variation of the losses, are close to tight. Besides this we prove two impossibility results. First, the existence of a single arm that is optimal in every round cannot improve the regret in the worst case. Second, the regret cannot scale with the effective range of the losses. In contrast, both results are possible in the full-information setting.

 


1 Introduction

We consider the standard $K$-armed adversarial bandit problem, which is a game played over $n$ rounds between a learner and an adversary. In every round $t \in \{1,\dots,n\}$ the learner chooses a probability distribution $p_t$ over $\{1,\dots,K\}$. The adversary then chooses a loss vector $\ell_t \in [0,1]^K$, which may depend on $p_t$. Finally the learner samples an action from $p_t$ denoted by $I_t$ and observes her own loss $\ell_{t,I_t}$. The learner would like to minimise her regret, which is the difference between the cumulative loss suffered and the loss suffered by the optimal action in hindsight:
$$R_n = \sum_{t=1}^n \ell_{t,I_t} - \min_{1 \le i \le K} \sum_{t=1}^n \ell_{t,i}\,,$$

where $\ell_1,\dots,\ell_n$ is the sequence of losses chosen by the adversary. A famous strategy is called Exp3, which satisfies $\mathbb{E}[R_n] = O(\sqrt{nK\log K})$, where the expectation is taken over the randomness in the algorithm and the choices of the adversary (Auer et al., 2002). There is also a lower bound showing that for every learner there is an adversary for which the expected regret is $\Omega(\sqrt{nK})$ (Auer et al., 1995). If the losses are chosen ahead of time, then the adversary is called oblivious, and in this case there exists a learner for which $\mathbb{E}[R_n] = O(\sqrt{nK})$ (Audibert and Bubeck, 2009). One might think that this is the end of the story, but it is not so. While the worst-case expected regret is one quantity of interest, there are many situations where a refined regret guarantee is more informative. Recent research on adversarial bandits has primarily focussed on these issues, especially the questions of obtaining regret guarantees that hold with high probability as well as stronger guarantees when the losses are “nice” in some sense. While there is now a wide range of strategies with upper bounds that depend on various quantities, the literature is missing lower bounds for many cases, some of which we now provide.
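To fix ideas, the following minimal simulation (ours, not from the paper) implements the protocol above together with an Exp3-style strategy based on importance-weighted loss estimates; the learning rate is a standard textbook choice rather than the exact tuning analysed by Auer et al. (2002).

```python
# Minimal sketch of the K-armed adversarial bandit protocol with an
# Exp3-style learner; losses[t, i] in [0, 1] are fixed in advance (oblivious).
import numpy as np

def exp3_regret(losses, rng, eta=None):
    n, K = losses.shape
    eta = eta or np.sqrt(2 * np.log(K) / (n * K))    # assumed learning rate
    cum_est = np.zeros(K)                            # cumulative loss estimates
    total_loss = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum_est - cum_est.min())) # stabilised exponential weights
        p = w / w.sum()                              # distribution p_t over arms
        i = rng.choice(K, p=p)                       # sampled action I_t
        total_loss += losses[t, i]                   # only the chosen loss is observed
        cum_est[i] += losses[t, i] / p[i]            # importance-weighted estimate
    return total_loss - losses.sum(axis=0).min()     # realised regret R_n

rng = np.random.default_rng(0)
losses = rng.uniform(size=(10_000, 5))
print(exp3_regret(losses, rng))                      # typically of order sqrt(n K log K)
```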

We focus on three classes of lower bound, which are described in detail below. The first addresses the optimal regret achievable with high probability, where we show there is little room for improvement over existing strategies. Our other results concern lower bounds that depend on some kind of regularity in the losses (“nice” data). Specifically we prove lower bounds that replace $n$ in the regret bound with the loss of the best action (called first-order bounds) and also with the quadratic variation of the losses (called second-order bounds).

High-probability bounds

Existing strategies Exp3.P (Auer et al., 2002) and Exp3-IX (Neu, 2015a) are tuned with a confidence parameter $\delta \in (0,1)$ and satisfy, for all loss sequences,

$$\mathbb{P}\left(R_n \ge c\sqrt{nK\log(K/\delta)}\right) \le \delta \qquad (1)$$

for some universal constant $c > 0$. An alternative tuning of Exp3-IX or Exp3.P (Bubeck and Cesa-Bianchi, 2012) leads to a single algorithm for which, for all $\delta \in (0,1)$,

$$\mathbb{P}\left(R_n \ge c\left(\sqrt{nK\log K} + \sqrt{\frac{nK}{\log K}}\,\log(1/\delta)\right)\right) \le \delta \qquad (2)$$

The difference is that in (1) the algorithm depends on $\delta$ while in (2) it does not. The cost of not knowing $\delta$ is that the $\log(1/\delta)$ moves outside the square root. In Section 2 we prove two lower bounds showing that there is little room for improvement in either (1) or (2).
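As a purely illustrative comparison (our reading of the shapes of (1) and (2), with all constants set to one), the snippet below shows how the two guarantees scale as the confidence level shrinks: with $\delta$ known in advance the $\log(1/\delta)$ term stays inside the square root, otherwise it multiplies an additional $\sqrt{nK/\log K}$ term.

```python
# Illustrative only: delta-dependence of the two high-probability bound shapes.
import numpy as np

n, K = 100_000, 10
for delta in (1e-1, 1e-3, 1e-6):
    tuned    = np.sqrt(n * K * np.log(K / delta))                  # shape of (1)
    adaptive = (np.sqrt(n * K * np.log(K))
                + np.sqrt(n * K / np.log(K)) * np.log(1 / delta))  # shape of (2)
    print(f"delta={delta:.0e}: tuned-for-delta={tuned:10.0f}  delta-free={adaptive:10.0f}")
```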

First-order bounds

An improvement over the worst-case regret bound of $O(\sqrt{nK\log K})$ is the so-called improvement for small losses. Specifically, there exist strategies (e.g., FPL-TRIX by Neu (2015b), with earlier results by Stoltz (2005); Allenberg et al. (2006); Rakhlin and Sridharan (2013)) such that, for all loss sequences and with $L_n^* = \min_{1 \le i \le K} \sum_{t=1}^n \ell_{t,i}$ denoting the cumulative loss of the best arm,

$$\mathbb{E}\bigl[R_n\bigr] = O\Bigl(\sqrt{K L_n^* \log K}\Bigr) \quad \text{(up to lower-order additive terms)}, \qquad (3)$$

where the expectation is with respect to the internal randomisation of the algorithm (the losses $\ell_1,\dots,\ell_n$ are fixed). This result improves on the $O(\sqrt{nK\log K})$ bounds since $L_n^* \le n$ is always guaranteed and sometimes $L_n^*$ is much smaller than $n$. In order to evaluate the optimality of this bound, we first rewrite it in terms of the small-loss balls defined for all $n \ge 1$ and $L \in [0,n]$ by

$$\mathcal{B}_{n,L} \triangleq \Bigl\{(\ell_t)_{1 \le t \le n} \in \bigl([0,1]^K\bigr)^n : L_n^* \le L\Bigr\}. \qquad (4)$$
Corollary 1.

The first-order regret bound (3) of Neu (2015b) is equivalent to: for all $n \ge 1$ and $L \in [0,n]$,
$$\sup_{(\ell_t)_{t \le n} \in \mathcal{B}_{n,L}} \mathbb{E}\bigl[R_n\bigr] = O\Bigl(\sqrt{K L \log K}\Bigr) \quad \text{(up to lower-order additive terms)}.$$

The proof is straightforward. Our main contribution in Section 3 is a lower bound of order $\sqrt{KL}$ over the ball $\mathcal{B}_{n,L}$, for $L$ in a wide range. This minimax lower bound shows that we cannot hope for a better bound than (3) (up to log factors) if we only know the value of $L_n^*$.
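The quantities involved are easy to compute; the short sketch below (ours) evaluates $L_n^*$ for a loss sequence and checks membership in a small-loss ball, with the ball radius $L$ chosen arbitrarily.

```python
# Sketch: loss of the best arm L*_n and membership in a small-loss ball.
import numpy as np

def best_arm_loss(losses):
    """L*_n = min_i sum_t losses[t, i]."""
    return losses.sum(axis=0).min()

rng = np.random.default_rng(1)
losses = rng.binomial(1, 0.05, size=(1000, 4)).astype(float)  # "small" losses
L_star = best_arm_loss(losses)
L = 100.0                                                     # arbitrary ball radius
print(L_star, L_star <= L)   # does the sequence lie in the ball of radius L?
```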

Second-order bounds

Another type of improved regret bound was derived by Hazan and Kale (2011b) and involves a second-order quantity called the quadratic variation:

$$Q_n = \sum_{t=1}^n \bigl\|\ell_t - \mu_n\bigr\|_2^2\,, \qquad (5)$$

where $\mu_n = \frac{1}{n}\sum_{t=1}^n \ell_t$ is the mean of all loss vectors. (In other words, $Q_n$ is the sum of the empirical variances of all the arms.) Hazan and Kale (2011b) addressed the general online linear optimisation setting. In the particular case of adversarial $K$-armed bandits with an oblivious adversary (as is the case here), they showed that there exists an efficient algorithm such that, for some absolute constant $c > 0$ and for all loss sequences,

$$\mathbb{E}\bigl[R_n\bigr] \le c\,K^2\sqrt{Q_n} \quad \text{(up to logarithmic factors)}. \qquad (6)$$

As before we can rewrite the regret bound (6) in terms of the small-variation balls defined for all $n \ge 1$ and $Q \in [0, nK/4]$ by

$$\mathcal{B}^{\mathrm{var}}_{n,Q} \triangleq \Bigl\{(\ell_t)_{1 \le t \le n} \in \bigl([0,1]^K\bigr)^n : Q_n \le Q\Bigr\}. \qquad (7)$$
Corollary 2.

The second-order regret bound (6) of Hazan and Kale (2011b) is equivalent to: for all $n \ge 1$ and $Q \in [0, nK/4]$,
$$\sup_{(\ell_t)_{t \le n} \in \mathcal{B}^{\mathrm{var}}_{n,Q}} \mathbb{E}\bigl[R_n\bigr] = O\Bigl(K^2\sqrt{Q}\Bigr) \quad \text{(up to logarithmic factors)}.$$

The proof is straightforward because the losses are deterministic and fixed in advance by an oblivious adversary. In Section 4 we provide a lower bound of order $\sqrt{KQ}$ that holds whenever $Q$ is not too large. This minimax lower bound shows that we cannot hope for a bound better than (6) by more than a factor of order $K^{3/2}$ (and logarithmic terms) if we only know the value of $Q_n$. Closing the gap is left as an open question.
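The sketch below (ours) computes the quadratic variation of a slowly varying loss sequence and checks membership in a small-variation ball; the ball radius is arbitrary.

```python
# Sketch: quadratic variation Q_n = sum_t ||l_t - mu_n||_2^2, with mu_n the
# average loss vector, i.e. the per-arm sums of squared deviations from the mean.
import numpy as np

def quadratic_variation(losses):
    mu = losses.mean(axis=0)                 # mean loss vector mu_n
    return ((losses - mu) ** 2).sum()        # Q_n

rng = np.random.default_rng(2)
base = rng.uniform(size=(1, 5))
losses = np.clip(base + 0.01 * rng.standard_normal((2000, 5)), 0, 1)  # nearly constant
Q_n = quadratic_variation(losses)
Q = 10.0                                     # arbitrary ball radius
print(Q_n, Q_n <= Q)                         # lies in the small-variation ball?
```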

Two impossibility results in the bandit setting

We also show in Section 4 that, in contrast to the full-information setting, regret bounds involving the cumulative variance of the algorithm as in (Cesa-Bianchi et al., 2007) cannot be obtained in the bandit setting. More precisely, we prove that two consequences that hold true in the full-information case, namely: (i) a regret bound proportional to the effective range of the losses and (ii) a bounded regret if one arm performs best at all rounds, must fail in the worst case for every bandit algorithm.

Additional notation and key tools

Before the theorems we develop some additional notation and describe the generic ideas in the proofs. For $i \in \{1,\dots,K\}$ let $T_i(t)$ be the number of times action $i$ has been chosen after round $t$. All our lower bounds are derived by analysing the regret incurred by strategies when facing randomised adversaries that choose the losses for all actions from the same joint distribution in every round (sometimes independently for each action and sometimes not). $\mathcal{B}(p)$ denotes the Bernoulli distribution with parameter $p$. If $P$ and $Q$ are measures on the same probability space, then $\mathrm{KL}(P,Q)$ is the KL-divergence between them. For $p, q \in [0,1]$ we define $\mathrm{kl}(p,q) = \mathrm{KL}(\mathcal{B}(p), \mathcal{B}(q))$, and for $x \in \mathbb{R}$ we let $\mathrm{clip}_{[0,1]}(x) = \max\{0, \min\{1, x\}\}$. Our main tools throughout the analysis are the following information-theoretic lemmas. The first bounds the KL divergence between the laws of the observed losses/actions for two distributions on the losses.

Lemma 1.

Fix a randomised bandit algorithm and two probability distributions $\nu_1$ and $\nu_2$ on $[0,1]^K$. Assume the loss vectors $\ell_1,\dots,\ell_n$ are drawn i.i.d. from either $\nu_1$ or $\nu_2$, and denote by $\mathbb{P}_j$ the joint probability distribution on all sources of randomness when $\nu_j$ is used (formally, $\mathbb{P}_j = \nu_j^{\otimes n} \otimes \mathbb{P}_{\mathrm{int}}$, where $\mathbb{P}_{\mathrm{int}}$ is the probability distribution used by the algorithm for its internal randomisation). Let $t \in \{1,\dots,n\}$. Denote by $\mathcal{H}_t$ the history available at the beginning of round $t$, by $\mathbb{P}_j^{\mathcal{H}_t}$ the law of $\mathcal{H}_t$ under $\mathbb{P}_j$, and by $\nu_j^{(i)}$ the $i$th marginal distribution of $\nu_j$. Then,
$$\mathrm{KL}\Bigl(\mathbb{P}_1^{\mathcal{H}_t}, \mathbb{P}_2^{\mathcal{H}_t}\Bigr) \le \sum_{i=1}^K \mathbb{E}_{\mathbb{P}_1}\bigl[T_i(t-1)\bigr]\,\mathrm{KL}\Bigl(\nu_1^{(i)}, \nu_2^{(i)}\Bigr)\,.$$

Results of roughly this form are well known and the proof follows immediately from the chain rule for the relative entropy and the independence of the loss vectors across time (see (Auer et al., 2002) or Appendix A). One difference is that the losses need not be independent across the arms, which we heavily exploit in our proofs by using correlated losses. The second key lemma is an alternative to Pinsker’s inequality that proves useful when the Kullback-Leibler divergence is larger than a constant. It has previously been used for bandit lower bounds (in the stochastic setting) by Bubeck et al. (2013).

Lemma 2 (Lemma 2.6 in Tsybakov 2008).

Let $\mathbb{P}$ and $\mathbb{Q}$ be two probability distributions on the same measurable space. Then, for every measurable subset $A$ (whose complement we denote by $A^c$),
$$\mathbb{P}(A) + \mathbb{Q}(A^c) \ge \frac{1}{2}\exp\bigl(-\mathrm{KL}(\mathbb{P},\mathbb{Q})\bigr)\,.$$
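The following numerical sanity check (ours) compares the two tools on Bernoulli distributions with the event $A = \{X = 1\}$: Pinsker's inequality becomes vacuous once the divergence is large, whereas the bound of Lemma 2, as reconstructed above, still gives non-trivial information.

```python
# Sanity check of Pinsker vs. the Lemma 2 alternative for Bernoulli laws.
import math

def kl_bernoulli(p, q):
    """KL(B(p), B(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for p, q in [(0.5, 0.6), (0.5, 0.9), (0.01, 0.5)]:
    kl = kl_bernoulli(p, q)
    lhs = p + (1 - q)                      # P(A) + Q(A^c) for A = {X = 1}
    print(f"p={p}, q={q}: KL={kl:.3f}, "
          f"Lemma 2: {lhs:.3f} >= {0.5 * math.exp(-kl):.3f}, "
          f"Pinsker bound on |P(A) - Q(A)|: {math.sqrt(kl / 2):.3f}")
```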

2 Zero-Order High Probability Lower Bounds

We prove two new high-probability lower bounds on the regret of any bandit algorithm. The first shows that no strategy can enjoy regret smaller than order $\sqrt{nK\log(1/\delta)}$ with probability at least $1-\delta$. Upper bounds of this form have been shown for various algorithms including Exp3.P (Auer et al., 2002) and Exp3-IX (Neu, 2015a). Although this result is not very surprising, we are not aware of any existing work on this problem and the proof is less straightforward than one might expect. An added benefit of our result is that the loss sequences producing large regret have two special properties. First, the optimal arm is the same in every round, and second, the range of the losses in each round is small. These properties will be useful in subsequent analysis.

In the second lower bound we show that any algorithm whose expected regret is at most $B\sqrt{nK}$ for all loss sequences must necessarily suffer a high-probability regret of at least order $\sqrt{nK}\log(1/\delta)/B$ for some sequence of losses. The important difference relative to the previous result is that strategies with $\log(1/\delta)$ appearing inside the square root depend on a specific value of $\delta$, which must be known in advance.

Theorem 1.

Suppose and and , then there exists a sequence of losses such that

where the probability is taken with respect to the randomness in the algorithm. Furthermore can be chosen in such a way that there exists an such that for all it holds that and .

Theorem 2.

Suppose , , and there exists a strategy and constant such that for any it holds that . Let satisfy and . Then there exists for which

where the probability is taken with respect to the randomness in the algorithm.

Corollary 3.

If and , then there does not exist a strategy such that for all , , and the regret is bounded by .

The corollary follows easily by integrating the assumed high-probability bound and applying Theorem 2 for sufficiently large $n$ and small $\delta$. The proof may be found in Appendix E.

Proof of Theorems 1 and 2

Both proofs rely on a carefully selected choice of correlated stochastic losses described below. Let be a sequence of i.i.d. Gaussian random variables with mean and variance . Let be a constant that will be chosen differently in each proof and define random loss sequences where

For let be the measure on and when for all and . Informally, is the measure on the sequence of loss vectors and actions when the learner interacts with the losses sampled from the th environment defined above.
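The sketch below (ours) illustrates the construction; the Gaussian mean, the variance and the gap $\Delta$ are placeholder values rather than the quantities chosen in the proofs, but the key features are visible: a single noise variable is shared by all arms in each round, and arm $j$ is optimal in every round while the in-round range of the losses stays small.

```python
# Sketch of the correlated loss construction: shared per-round Gaussian noise,
# a gap Delta added to every arm except j, then clipping to [0, 1].
# The mean 1/2, sigma and Delta below are placeholders, not the proof's choices.
import numpy as np

def correlated_losses(n, K, j, Delta, sigma=0.25, seed=0):
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.5, sigma, size=n)           # shared noise xi_t (assumed N(1/2, sigma^2))
    losses = xi[:, None] + Delta * np.ones((n, K))
    losses[:, j] -= Delta                         # arm j gets no gap: optimal in every round
    return np.clip(losses, 0.0, 1.0)

losses = correlated_losses(n=1000, K=5, j=2, Delta=0.05)
print(losses.sum(axis=0).argmin())                # arm 2 attains the minimal cumulative loss
```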

Lemma 3.

Let and suppose and . Then and .

The proof may be found in Appendix D.

Proof of Theorem 1.

First we choose the value of that determines the gaps in the losses by . By the pigeonhole principle there exists an for which . Therefore by Lemmas 2 and 1, and the fact that the KL divergence between clipped Gaussian distributions is always smaller than without clipping (see Lemma 7 in Appendix B),

But by Lemma 3

Therefore there exists an such that

The result is completed by substituting the value of and by noting that -almost surely. ∎

Proof of Theorem 2.

By the assumption on we have . Suppose for all that

(8)

Then by the assumption in the theorem statement and the second part of Lemma 3 we have

which is a contradiction. Therefore there exists an for which Eq. (8) does not hold. Then by the same argument as the previous proof it follows that

The result is completed by substituting the value of . ∎

Remark 1.

It is possible to derive similar high-probability regret bounds with non-correlated losses. However, the correlation makes the results cleaner (we do not need an additional concentration argument to locate the optimal arm) and it is key to deriving Corollaries 4 and 5 in Section 4.

3 First-Order Lower Bound

First-order upper bounds provide an improvement over minimax bounds when the loss of the optimal action is small. Recall from Corollary 1 that first-order bounds can be rewritten in terms of the small-loss balls defined in (4). Theorem 3 below provides a new lower bound of order $\sqrt{KL}$ over these balls, which matches the best existing upper bounds up to logarithmic factors. As is standard for minimax results this does not imply a lower bound on every loss sequence. Instead it shows that we cannot hope for a better bound if we only know the value of $L_n^*$.

Theorem 3.

Let , , and , where . Then for any randomised bandit algorithm , where the expectation is taken with respect to the internal randomisation of the algorithm.

Our proof is inspired by that of Auer et al. (2002, Theorem 5.1). The key difference is that we take Bernoulli distributions with parameter close to $0$ instead of $1/2$. This way the best cumulative loss is ensured to be concentrated around its mean, and the regret lower bound can be seen to involve the variance of a binomial distribution with parameters $n$ and a success probability of order $L/n$.
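A back-of-the-envelope calculation (ours, assuming the Bernoulli parameters are of order $L/n$) makes this variance intuition explicit: the cumulative loss of a single arm is binomial, so
$$\mathrm{Var}\Bigl(\sum_{t=1}^n \ell_{t,i}\Bigr) = n\,\alpha(1-\alpha) = L\Bigl(1-\frac{L}{n}\Bigr) \le L \quad\text{for } \alpha = \frac{L}{n}\,,$$
and fluctuations of order $\sqrt{L}$ per arm set the scale at which the best arm can be confused with the others.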

First we state the stochastic construction of the losses and prove a general lemma that allows us to prove Theorem 3 and will also be useful in Section 4 to derive a lower bound in terms of the quadratic variation. Let $\alpha \in (0,1)$ and $\varepsilon \in (0,\alpha]$ be fixed and define probability distributions $\nu_1,\dots,\nu_K$ on $[0,1]^K$ such that under $\nu_j$ the following hold:

  • All random losses $\ell_{t,i}$ for $t \in \{1,\dots,n\}$ and $i \in \{1,\dots,K\}$ are independent.

  • $\ell_{t,i}$ is sampled from a Bernoulli distribution with parameter $\alpha$ if $i \ne j$, or with parameter $\alpha - \varepsilon$ if $i = j$ (see the illustrative sketch below).
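The sketch below (ours) samples from this environment; which arm receives which parameter follows our reading of the construction above.

```python
# Sketch: under nu_j, arm j has Bernoulli parameter alpha - eps and every other
# arm has parameter alpha, all losses independent, both parameters close to 0.
import numpy as np

def sample_environment(n, K, j, alpha, eps, seed=0):
    rng = np.random.default_rng(seed)
    params = np.full(K, alpha)
    params[j] = alpha - eps                       # arm j is best in expectation
    return rng.binomial(1, params, size=(n, K)).astype(float)

losses = sample_environment(n=10_000, K=4, j=1, alpha=0.05, eps=0.01)
print(losses.sum(axis=0))                         # arm 1 typically has the smallest total loss
```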

Lemma 4.

Let , , and . Consider the probability distributions  on defined above with , and set . Then for any randomised bandit algorithm , where the expectation is with respect to both the internal randomisation of the algorithm and the random loss sequence which is drawn from .

The assumption $\varepsilon \le \alpha$ above ensures that $\alpha - \varepsilon \ge 0$, so that the distributions $\nu_j$ are well defined.

Proof of Lemma 4.

We lower bound the regret by the pseudo-regret for each distribution :

(9)

where the first equality follows because since under , the conditional distribution of given is simply . To bound (9) from below, note that by Pinsker’s inequality we have for all and , , where is the joint probability distribution that makes all the i.i.d. , and and denote the laws of under and respectively. Plugging the last inequality above into (9), averaging over and using the concavity of the square root yields

(10)

where we recall that . The rest of the proof is devoted to upper-bounding . Denote by the history available at the beginning of round . From Lemma 1

(11)

where the last inequality follows by upper bounding the KL divergence by the divergence (see Appendix B). Averaging (11) over and and noting that we get

Plugging the above inequality into (10) and using the definition of yields

Proof of Theorem 3.

We show that there exists a loss sequence such that and . Lemma 4 above provides such kind of lower bound, but without the guarantee on . For this purpose we will use Lemma 4 with a smaller value of (namely, ) and combine it with Bernstein’s inequality to prove that with high probability.

Part 1: Applying Lemma 4 with (note that by assumption on ) and noting that we get that for some the probability distribution  defined with satisfies

(12)

since by assumption.

(13)

To this end, first note that . Second, note that under , the , , are i.i.d. . We can thus use Bernstein’s inequality: applying Theorem 2.10 (and a remark on p.38) of Boucheron et al. (2013) with , with , and with ), we get that, for all , with -probability at least ,

(14)

where the second last inequality is true whenever and that last is true whenever . By assumption on , these two conditions are satisfied for , which concludes the proof of (13).

Conclusion: We show by contradiction that there exists a loss sequence such that and

(15)

where the expectation is with respect to the internal randomisation of the algorithm. Imagine for a second that (15) were false for every loss sequence satisfying . Then we would have almost surely (since the internal source of randomness of the bandit algorithm is independent of ). Therefore by the tower rule for the first expectation on the r.h.s. below, we would get

(16)

where (16) follows from (13) and by noting that since . Comparing (16) and (12) we get a contradiction, which proves that there exists a loss sequence satisfying both and (15). We conclude the proof by noting that . Finally, the condition is sufficient to make the interval non empty. ∎

4 Second-Order Lower Bounds

We start by giving a lower bound on the regret in terms of the quadratic variation that is close to existing upper bounds except in the dependence on the number of arms. Afterwards we prove that bandit strategies cannot adapt to losses that lie in a small range or to the existence of an action that is always optimal.

Lower bound in terms of quadratic variation

We prove a lower bound of order $\sqrt{KQ}$ over any small-variation ball $\mathcal{B}^{\mathrm{var}}_{n,Q}$ (as defined by (7)) for $Q$ in a wide range. This minimax lower bound matches the upper bound of Corollary 2 up to a multiplicative factor of order $K^{3/2}$ (and logarithmic terms). Closing this gap is left as an open question, but we conjecture that the upper bound is loose (see also the COLT open problem by Hazan and Kale (2011a)).

Theorem 4.

Let , , and , where . Then for any randomised bandit algorithm, , where the expectation is taken with respect to the internal randomisation of the algorithm.

The proof is very similar to that of Theorem 3; it also follows from Lemma 4 and Bernstein’s inequality. It is postponed to Appendix C.

Impossibility results

In the full-information setting (where the entire loss vector is observed after each round) Cesa-Bianchi et al. (2007, Theorem 6) designed a carefully tuned exponential weighting algorithm for which the regret depends on the variation of the algorithm and the range of the losses:

(17)

where the expectation is taken with respect to the internal randomisation of the algorithm, $E_n$ denotes the effective range of the losses and $V_n$ denotes the cumulative variance of the algorithm (in each round the expert's action is drawn at random from the weight vector $p_t$). The bound in (17) is not closed-form because $V_n$ depends on the algorithm, but it has several interesting consequences (a small numerical sketch of $E_n$ and $V_n$ follows the list below):

  1. If for all the losses lie in an unknown interval with a small width , then , so that . Hence

    Therefore, though the algorithm by Cesa-Bianchi et al. (2007, Section 4.2) does not use the prior knowledge of or , it is able to incur a regret that scales linearly in the effective range .

  2. If all the losses are nonnegative, then by Corollary 3 of (Cesa-Bianchi et al., 2007) the second-order bound (17) implies the first-order bound

    (18)

    where  .

  3. If there exists an arm that is optimal at every round (i.e., for all ), then any translation-invariant algorithm with regret guarantees as in (18) above suffers a bounded regret. This is the case for the fully automatic algorithm of Cesa-Bianchi et al. (2007, Theorem 6) mentioned above. Then by the translation invariance of the algorithm all losses appearing in the regret bound can be replaced with the translated losses , so that a bound of the same form as (18) implies a regret bound of .

  4. Assume that the loss vectors are i.i.d. with a unique optimal arm in expectation (i.e., there exists such that for all ). Then using the Hoeffding-Azuma inequality we can show that the algorithm of Cesa-Bianchi et al. (2007, Section 4.2) has with high probability a bounded cumulative variance , and therefore (by (17)) incurs a bounded regret, in the same spirit as in de Rooij et al. (2014); Gaillard et al. (2014).
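To make the quantities appearing in (17) concrete, here is a small sketch (ours); the formulas below for the effective range $E_n$ and the cumulative variance $V_n$ are our assumed reading of those definitions, with $p_t$ the full-information algorithm's weight vector (a uniform baseline is used here only to exercise the code).

```python
# Sketch: effective range and cumulative variance of a weight-based algorithm.
import numpy as np

def effective_range(losses):
    return (losses.max(axis=1) - losses.min(axis=1)).max()   # largest in-round spread

def cumulative_variance(losses, weights):
    means = (weights * losses).sum(axis=1, keepdims=True)    # E_{I ~ p_t}[l_{t,I}]
    return (weights * (losses - means) ** 2).sum()           # sum_t Var_{I ~ p_t}(l_{t,I})

rng = np.random.default_rng(3)
n, K = 1000, 5
losses = 0.4 + 0.1 * rng.uniform(size=(n, K))     # losses confined to a narrow range
weights = np.full((n, K), 1.0 / K)                # placeholder uniform weights p_t
print(effective_range(losses), cumulative_variance(losses, weights))
```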

We already know that point 2 has a counterpart in the bandit setting. If one is prepared to ignore logarithmic terms, then point 4 also has an analogue in the bandit setting due to the existence of logarithmic regret guarantees for stochastic bandits (Lai and Robbins, 1985). The following corollaries show that in the bandit setting it is not possible to design algorithms to exploit the range of the losses or the existence of an arm that is always optimal. We use Theorem 1 as a general tool but the bounds can be improved to by analysing the expected regret directly (similar to Lemma 4).

Corollary 4.

Let , and . Then for any randomised bandit algorithm, , where the expectation is with respect to the randomness in the algorithm, and .

Corollary 5.

Let and . Then, for any randomised bandit algorithm, there is a loss sequence such that there exists an arm that is optimal at every round  (i.e., for all ), but , where the expectation is with respect to the randomness in the algorithm.

Proof of Corollaries 4 and 5.

Both results follow from Theorem 1 by choosing . Therefore there exists an such that , which implies (since here) that . Finally note that since and there exists an such that for all and . ∎

Acknowledgments

The authors would like to thank Aurélien Garivier and Émilie Kaufmann for insightful discussions. This work was partially supported by the CIMI (Centre International de Mathématiques et d’Informatique) Excellence program. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grants ANR-13-BS01-0005 (project SPADRO) and ANR-13-CORD-0020 (project ALICIA).

References

  • Allenberg et al. [2006] C. Allenberg, P. Auer, L. Györfi, and G. Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of ALT’2006, pages 229–243. Springer, 2006.
  • Audibert and Bubeck [2009] J. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226, 2009.
  • Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multi-armed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
  • Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, 2013.
  • Bubeck and Cesa-Bianchi [2012] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • Bubeck et al. [2013] S. Bubeck, V. Perchet, and P. Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of The 26th Conference on Learning Theory, pages 122–134, 2013.
  • Cesa-Bianchi et al. [2007] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Mach. Learn., 66(2/3):321–352, 2007.
  • de Rooij et al. [2014] S. de Rooij, T. van Erven, P. D. Grünwald, and W. M. Koolen. Follow the leader if you can, hedge if you must. J. Mach. Learn. Res., 15(Apr):1281–1316, 2014.
  • Gaillard et al. [2014] P. Gaillard, G. Stoltz, and T. van Erven. A second-order bound with excess losses. In