
# On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems

Baekjin Kim
Department of Statistics
University of Michigan
Ann Arbor, MI 48109
baekjin@umich.edu
&Ambuj Tewari
Department of Statistics
University of Michigan
Ann Arbor, MI 48109
tewaria@umich.edu
###### Abstract

We investigate the optimality of perturbation based algorithms in the stochastic and adversarial multi-armed bandit problems. For the stochastic case, we provide a unified regret analysis for both sub-Weibull and bounded perturbations when rewards are sub-Gaussian. Our bounds are instance optimal for sub-Weibull perturbations with parameter 2 that also have a matching lower tail bound, and all bounded support perturbations where there is sufficient probability mass at the extremes of the support. For the adversarial setting, we prove rigorous barriers against two natural solution approaches using tools from discrete choice theory and extreme value theory. Our results suggest that the optimal perturbation, if it exists, will be of Fréchet-type.

Preprint. Under review.

## 1 Introduction

Beginning with the seminal work of Hannan (1957), researchers have been interested in algorithms that use random perturbations to generate a distribution over available actions. Kalai and Vempala (2005) showed that the perturbation idea leads to efficient algorithms for many online learning problems with large action sets. Due to the Gumbel lemma (Hazan et al., 2017), the well known exponential weights algorithm (Freund and Schapire, 1997) also has an interpretation as a perturbation based algorithm that uses Gumbel distributed perturbations.

There have been several attempts to analyze the regret of perturbation based algorithms with specific distributions such as Uniform, Double-exponential, drop-out and random walk (see, e.g., (Kalai and Vempala, 2005; Kujala and Elomaa, 2005; Devroye et al., 2013; Van Erven et al., 2014)). These works provided rigorous guarantees but the techniques they used did not generalize to general perturbations. Recent work (Abernethy et al., 2014) provided a general framework to understand general perturbations and clarified the relation between regularization and perturbation by understanding them as different ways to smooth an underlying non-smooth potential function.

Abernethy et al. (2015) extended the analysis of general perturbations to the partial information setting of the adversarial multi-armed bandit problem. They isolated bounded hazard rate as an important property of a perturbation and gave several examples of perturbations that lead to the near optimal regret bound of $O(\sqrt{KT\log K})$. Since Tsallis entropy regularization can achieve the minimax regret of $O(\sqrt{KT})$ (Audibert and Bubeck, 2009, 2010), the question of whether perturbations can match the power of regularizers remained open for the adversarial multi-armed bandit problem.

In this paper, we build upon previous works (Abernethy et al., 2014, 2015) in two distinct but related directions. First, we provide the first general result for perturbation algorithms in the stochastic multi-armed bandit problem. We present a unified regret analysis for both sub-Weibull and bounded perturbations when rewards are sub-Gaussian. Our regret bounds are instance optimal for sub-Weibull perturbations with parameter 2 (with a matching lower tail bound), and all bounded support perturbations where there is sufficient probability mass at the extremes of the support. Since the Uniform and Rademacher distributions are instances of these bounded support perturbations, one of our results is a regret bound for a randomized version of UCB where the algorithm picks a random number in the confidence interval, or randomly chooses between the lower and upper confidence bounds, instead of always picking the upper bound. Our analysis relies on the simple but powerful observation that Thompson sampling with Gaussian priors and rewards can also be interpreted as a perturbation algorithm with Gaussian perturbations. We generalize both the upper bound and lower bound of Agrawal and Goyal (2013) in two respects: (1) from the special Gaussian perturbation to general sub-Weibull or bounded perturbations, and (2) from the special Gaussian rewards to general sub-Gaussian rewards.

Second, we return to the open problem mentioned above: is there a perturbation that gives us minimax optimality? We do not resolve it but provide rigorous proofs that there are barriers to two natural approaches to solving the open problem. (A) One cannot simply find a perturbation that is exactly equivalent to Tsallis entropy. This is surprising since Shannon entropy does have an exact equivalent perturbation, viz. Gumbel. (B) One cannot simply do a better analysis of perturbations used before (Abernethy et al., 2015) and plug the results into their general regret bound to eliminate the extra $\sqrt{\log K}$ factor. In proving the first barrier, we use a fundamental result in discrete choice theory. For the second barrier, we rely on tools from extreme value theory.

## 2 Problem Setup

In every round $t = 1, \dots, T$, a learner chooses an action $A_t$ out of $K$ arms and the environment picks a response in the form of a real-valued reward vector $g_t = (g_{t,1}, \dots, g_{t,K})$. While the entire reward vector is revealed to the learner in the full information setting, in the bandit setting the learner only receives the reward associated with his choice, and no information about the other arms is provided. Thus, we denote the reward corresponding to his choice as $g_{t,A_t}$.

In the stochastic multi-armed bandit, the rewards of arm $i$ are sampled i.i.d. from a fixed but unknown distribution with mean $\mu_i$. The adversarial multi-armed bandit is more general in that all assumptions on how rewards are assigned to arms are dropped. It only assumes that rewards are assigned by an adversary before the interaction begins; such an adversary is called an oblivious adversary. In both environments, the learner makes a sequence of decisions based on the history so far in order to maximize the cumulative reward, $\sum_{t=1}^{T} g_{t,A_t}$.

As a measure for evaluating a learner, regret is the difference between the rewards the learner would have received had he played the best action in hindsight and the rewards he actually received. Therefore, minimizing the regret is equivalent to maximizing the expected cumulative reward. We consider the expected regret, $\mathbb{E}\big[\max_i \sum_{t=1}^{T}(g_{t,i} - g_{t,A_t})\big]$, in the adversarial setting, and the pseudo regret, $\max_i \mathbb{E}\big[\sum_{t=1}^{T}(g_{t,i} - g_{t,A_t})\big]$, in the stochastic setting. Note that the two notions of regret coincide when an oblivious adversary is considered. An online algorithm is called a no-regret algorithm if, for every adversary, the expected regret with respect to every action is sub-linear in $T$.

We use FTPL (Follow The Perturbed Leader) to denote families of algorithms for both stochastic and adversarial settings. The common core of FTPL algorithms consists of adding random perturbations to the estimates of the rewards of each arm before computing the current "best arm" (or "leader"). However, the estimates used differ in the two settings: the stochastic setting uses sample means and the adversarial setting uses inverse probability weighted estimates.
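As a concrete illustration of this common core in the stochastic setting, the following minimal sketch (our own example, not the paper's Alg.1; the Gaussian perturbation and the $1/\sqrt{T_i(t)}$ scaling follow the Thompson Sampling interpretation developed in Section 3) perturbs each arm's sample mean and pulls the perturbed leader:

```python
import numpy as np

def ftpl_stochastic(means, T, sigma=1.0, rng=None):
    """Sketch of an FTPL loop for stochastic bandits: perturb each arm's
    sample mean by a scaled random draw, then pull the perturbed leader."""
    rng = np.random.default_rng(0) if rng is None else rng
    K = len(means)
    counts = np.zeros(K, dtype=int)   # T_i(t): number of pulls of arm i
    sums = np.zeros(K)                # running sum of rewards per arm
    for t in range(T):
        if t < K:                     # pull each arm once to initialize
            arm = t
        else:
            mu_hat = sums / counts
            # perturbation shrinks as 1/sqrt(T_i(t)) so exploration decays
            theta = mu_hat + sigma * rng.standard_normal(K) / np.sqrt(counts)
            arm = int(np.argmax(theta))
        reward = means[arm] + rng.standard_normal()  # 1-sub-Gaussian reward
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ftpl_stochastic([1.0, 0.0], T=2000)
```

With a gap of 1 between the two arms, the perturbed leader concentrates on the better arm after a short exploration phase.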

## 3 Stochastic Bandits

In this section, we propose FTPL algorithms for stochastic multi-armed bandits and characterize a family of perturbations that make the algorithm instance-optimal in terms of regret bounds. This work is mainly motivated by Thompson Sampling (Thompson, 1933), one of the standard algorithms in stochastic settings. We also provide a lower bound for the regret of this FTPL algorithm.

For our analysis, we assume, without loss of generality, that arm 1 is optimal, i.e., $\mu_1 = \max_i \mu_i$, and the sub-optimality gap is denoted $\Delta_i = \mu_1 - \mu_i$. Let $\hat{\mu}_i(t)$ be the average reward received from arm $i$ after round $t$, written formally as $\hat{\mu}_i(t) = \frac{1}{T_i(t)}\sum_{s=1}^{t} g_{s,A_s}\,\mathbb{1}(A_s = i)$, where $T_i(t)$ is the number of times arm $i$ has been pulled after round $t$. The regret for stochastic bandits can be decomposed as $R(T) = \sum_{i:\Delta_i > 0} \Delta_i\,\mathbb{E}[T_i(T)]$. The reward distributions are generally assumed to be sub-Gaussian with parameter 1 (Lattimore and Szepesvári, 2018).

###### Definition 1 (sub-Gaussian).

A random variable $X$ with mean $\mu$ is sub-Gaussian with parameter $\sigma$ if it satisfies $\mathbb{E}[\exp(\lambda(X - \mu))] \le \exp(\sigma^2\lambda^2/2)$ for all $\lambda \in \mathbb{R}$.

###### Lemma 1 (Hoeffding bound of sub-Gaussian (Hoeffding, 1994)).

Suppose $X_1, \dots, X_n$ are i.i.d. random variables with mean $\mu$ that are sub-Gaussian with parameter $\sigma$. Then for all $\varepsilon \ge 0$, $\mathbb{P}(\bar{X}_n \ge \mu + \varepsilon) \le \exp(-n\varepsilon^2/(2\sigma^2))$, where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
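For completeness, the bound follows from the standard Chernoff argument applied to the sub-Gaussian moment generating function (for any $\lambda > 0$):

```latex
\mathbb{P}\big(\bar{X}_n \ge \mu + \varepsilon\big)
  \;\le\; e^{-\lambda(\mu+\varepsilon)}\,\mathbb{E}\big[e^{\lambda \bar{X}_n}\big]
  \;=\; e^{-\lambda\varepsilon}\prod_{i=1}^{n}\mathbb{E}\big[e^{(\lambda/n)(X_i-\mu)}\big]
  \;\le\; \exp\!\Big(-\lambda\varepsilon + \frac{\sigma^2\lambda^2}{2n}\Big),
```

and optimizing with $\lambda = n\varepsilon/\sigma^2$ yields $\exp\!\big(-n\varepsilon^2/(2\sigma^2)\big)$.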

### 3.1 Upper Confidence Bound and Thompson Sampling

The standard algorithms in stochastic bandits are Upper Confidence Bound (UCB1) (Auer, 2002) and Thompson Sampling (Thompson, 1933). The former compares the largest plausible estimate of the mean for each arm, following the principle of optimism in the face of uncertainty, and is therefore deterministic, in contradistinction to the latter. At time $t$, UCB1 chooses an action by maximizing upper confidence bounds, $A_t = \arg\max_i \big\{\hat{\mu}_i(t-1) + \sqrt{2\log t / T_i(t-1)}\big\}$. Regarding the instance-dependent regret of UCB1, there exists some universal constant $C > 0$ such that $R(T) \le C\sum_{\Delta_i > 0}(\log T/\Delta_i + \Delta_i)$.

Thompson Sampling is a Bayesian solution based on the randomized probability matching approach (Scott, 2010). Given a prior distribution over the mean of each arm, at time $t$ it computes the posterior distribution based on the observed data, samples $\theta_i(t)$ from the posterior of each arm, and then chooses the arm $A_t = \arg\max_i \theta_i(t)$. In Gaussian Thompson Sampling, where Gaussian rewards and a Gaussian prior with infinite variance for each mean are considered, the posterior of arm $i$ is $\mathcal{N}(\hat{\mu}_i(t), 1/T_i(t))$, and the policy is to choose the index that maximizes the value $\theta_i(t)$ randomly sampled from this Gaussian posterior, as stated in Alg.1-(i). Its regret bound is restated in Theorem 2.

###### Theorem 2 (Theorem 3 (Agrawal and Goyal, 2013)).

Assume that the reward distribution of each arm is Gaussian with mean $\mu_i$ and unit variance. The Thompson Sampling policy via Gaussian priors defined in Alg.1-(i) has the following instance-dependent and independent regret bounds,

$$R(T) \le C' \sum_{\Delta_i > 0} \left(\frac{\log(T\Delta_i^2)}{\Delta_i} + \Delta_i\right), \qquad R(T) \le O\big(\sqrt{KT\log K}\big).$$

The more generic view of Thompson Sampling is via the idea of perturbation. We adopt the interpretation of Gaussian Thompson Sampling as a Follow-The-Perturbed-Leader (FTPL) algorithm via Gaussian perturbation (Lattimore and Szepesvári, 2018): the Gaussian posterior sample decomposes into the average reward of each arm plus a scaled Gaussian perturbation, $\theta_i(t) = \hat{\mu}_i(t) + Z_{it}/\sqrt{T_i(t)}$, where $Z_{it} \sim \mathcal{N}(0,1)$. In each round $t$, the FTPL algorithm chooses the action $A_t = \arg\max_i \theta_i(t)$.
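This equivalence is easy to check numerically. The sketch below (with assumed values for the empirical mean and the pull count) confirms that drawing from the posterior $\mathcal{N}(\hat{\mu}_i, 1/T_i)$ and drawing $\hat{\mu}_i + Z/\sqrt{T_i}$ with $Z \sim \mathcal{N}(0,1)$ produce the same distribution:

```python
import numpy as np

# Empirical check: a Gaussian posterior sample N(mu_hat, 1/n) has the same
# distribution as the perturbed mean mu_hat + Z/sqrt(n) with Z ~ N(0, 1).
rng = np.random.default_rng(0)
mu_hat, n, m = 0.7, 9, 1_000_000

posterior = rng.normal(mu_hat, 1.0 / np.sqrt(n), size=m)   # Thompson Sampling view
perturbed = mu_hat + rng.standard_normal(m) / np.sqrt(n)   # FTPL view

print(posterior.mean(), perturbed.mean())  # both ≈ 0.7
print(posterior.std(), perturbed.std())    # both ≈ 1/3
```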

We show that the FTPL algorithm with Gaussian perturbation under the Gaussian reward setting can be extended to sub-Gaussian rewards as well as to families of sub-Weibull and bounded perturbations. The sub-Weibull family is interesting in that it includes well known families like the sub-Gaussian and sub-Exponential as special cases. We propose perturbation based algorithms via sub-Weibull and bounded perturbations in Alg.1-(ii), (iii), and their regrets are analyzed in Theorems 3 and 5.

###### Definition 2 (sub-Weibull (Wong et al., 2019)).

A random variable $X$ with mean $\mu$ is sub-Weibull($p$) with parameter $\sigma$ if it satisfies $\mathbb{P}(|X - \mu| \ge t) \le C_a \exp(-t^p/(2\sigma^p))$ for all $t \ge 0$.

###### Theorem 3 (FTPL via sub-Weibull Perturbation, Proof in Appendix a.1).

Assume that the reward distribution of each arm is 1-sub-Gaussian with mean $\mu_i$, and that the sub-Weibull($p$) perturbation $Z$ with parameters $\sigma$ and $C_a$ satisfies the following anti-concentration inequality for some exponent $q$ and constant $C_b$,

$$\mathbb{P}(|Z| \ge t) \;\ge\; \frac{1}{C_b}\exp\!\left(-\frac{t^q}{2\sigma^q}\right), \quad \text{for } t \ge 0. \qquad (1)$$

Then the Follow-The-Perturbed-Leader algorithm via $Z$ in Alg.1-(ii) has the following instance-dependent and independent regret bounds,

 (2)

Note that the parameters $p$ and $q$ can take any admissible values, and the algorithm achieves a smaller regret bound as $q$ becomes larger. For nice distributions the two parameters coincide: $p = q = 2$ for the Gaussian and $p = q = 1$ for the Double-exponential.

###### Corollary 4 (FTPL via Gaussian Perturbation).

Assume that the reward distribution of each arm is 1-sub-Gaussian with mean $\mu_i$. The Follow-The-Perturbed-Leader algorithm via Gaussian perturbation in Alg.1-(ii) has the following instance-dependent and independent regret bounds,

$$R(T) \le C'' \sum_{\Delta_i > 0} \left(\frac{\log(T\Delta_i^2)}{\Delta_i} + \Delta_i\right), \qquad R(T) \le O\big(\sqrt{KT\log K}\big). \qquad (3)$$
##### Failure of Bounded Perturbation

Any perturbation with bounded support cannot by itself yield an optimal FTPL algorithm. Consider a two-armed bandit where the rewards of each arm are generated from Gaussian distributions with unit variance and the perturbation is uniform on a bounded interval. If, after a few initial pulls, the empirical mean of the optimal arm falls below that of the suboptimal arm by more than the width of the perturbation's support, then the perturbed estimate of the optimal arm can never exceed that of the suboptimal arm. The algorithm will never choose the optimal arm again and accordingly suffers linear regret. To overcome this limitation of bounded support, we suggest another FTPL algorithm via bounded perturbation that adds an extra logarithmic factor to the scaling of the perturbation, as stated in Alg.1-(iii).
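The lock-in phenomenon can be checked by direct arithmetic. The numbers below are hypothetical, chosen so that the empirical gap exceeds the perturbation's support:

```python
import numpy as np

# Illustration (hypothetical numbers): once the empirical gap exceeds the
# support of a bounded perturbation, the trailing arm can never win the argmax.
# Arm 0 is unlucky early on; both arms have been pulled 10 times.
mu_hat = np.array([-1.2, 0.5])   # empirical means after the initial pulls
n = np.array([10, 10])           # pull counts T_i(t)
half_width = 0.5                 # Uniform[-0.5, 0.5] perturbation

# Best possible perturbed value of arm 0 vs. worst possible value of arm 1.
# Since arm 0 is never pulled again, its value stays frozen, and arm 1's
# perturbed value concentrates near its own mean, so the lock-in persists.
best_arm0 = mu_hat[0] + half_width / np.sqrt(n[0])
worst_arm1 = mu_hat[1] - half_width / np.sqrt(n[1])
print(best_arm0 < worst_arm1)  # True: arm 0 is locked out forever
```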

###### Theorem 5 (FTPL algorithm via Bounded support Perturbation, Proof in Appendix a.3).

Assume that the reward distribution of each arm is 1-sub-Gaussian with mean $\mu_i$, and that the perturbation distribution has bounded support with sufficient probability mass at the extremes of the support. Then the Follow-The-Perturbed-Leader algorithm via this perturbation in Alg.1-(iii) has the following instance-dependent and independent regret bounds, with constants independent of $T$,

 (4)
##### Randomized Confidence Bound algorithm

Theorem 5 implies that the optimism embedded in UCB can be replaced by simple randomization. Instead of comparing upper confidence bounds, our modification compares a value randomly chosen from the confidence interval, or randomly chosen between the lower and upper confidence bounds, by introducing a Uniform or Rademacher perturbation into the UCB1 algorithm with a slightly wider confidence interval. These FTPL algorithms via Uniform and Rademacher perturbations can be regarded as randomized versions of the UCB algorithm, which we call the RCB (Randomized Confidence Bound) algorithm, and they achieve the same regret bound as UCB1. The RCB algorithm is meaningful in that it can be arrived at from two different perspectives: either as a randomized variant of UCB, or by replacing the Gaussian distribution with the Uniform in Gaussian Thompson Sampling.
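A minimal sketch of the RCB idea follows (our own illustration; the confidence width and the constant `c` are assumptions, not the paper's tuned values):

```python
import numpy as np

def rcb(means, T, c=2.0, mode="uniform", rng=None):
    """Randomized Confidence Bound sketch: replace UCB's '+width' with a
    random draw in [-width, +width] (Uniform) or {-width, +width} (Rademacher)."""
    rng = np.random.default_rng(1) if rng is None else rng
    K = len(means)
    counts, sums = np.zeros(K, dtype=int), np.zeros(K)
    for t in range(T):
        if t < K:                       # pull each arm once to initialize
            arm = t
        else:
            width = np.sqrt(c * np.log(t + 1) / counts)
            if mode == "uniform":
                eps = rng.uniform(-1.0, 1.0, K)
            else:                       # Rademacher: random sign
                eps = rng.choice([-1.0, 1.0], K)
            arm = int(np.argmax(sums / counts + eps * width))
        counts[arm] += 1
        sums[arm] += means[arm] + rng.standard_normal()  # Gaussian reward
    return counts

counts = rcb([1.0, 0.0], T=3000)
```

Setting `mode="rademacher"` randomly flips between the lower and upper confidence bounds instead of drawing inside the interval.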

The regret lower bound of the FTPL algorithm in Alg.1-(ii) is built on the work of Agrawal and Goyal (2013). Theorem 6 states that the regret lower bound depends on the lower bound on the tail probability of the perturbation. As special cases, FTPL algorithms via Gaussian ($q = 2$) and Double-exponential ($q = 1$) perturbations have matching lower and upper regret bounds.

###### Theorem 6 (Regret lower bound, Proof in Appendix a.4).

If the perturbation $Z$ has a lower bound on its tail probability of the form in (1), then the Follow-The-Perturbed-Leader algorithm via $Z$ has a matching lower bound on its expected regret.

## 4 Adversarial Bandits

In this section we study two major families of online learning algorithms, FTRL and FTPL, as two types of smoothing, and introduce the Gradient Based Prediction Algorithm (GBPA) family for solving the adversarial multi-armed bandit problem. Then, we mention an important open problem regarding the existence of an optimal FTPL algorithm. The main contributions of this section are theoretical results showing that two natural approaches to solving the open problem are not going to work. We also make some conjectures on what alternative ideas might work.

### 4.1 FTRL and FTPL as Two Types of Smoothings and An Open Problem

Following previous work (Abernethy et al., 2015), we consider a general algorithmic framework, Alg.2. There are two main ingredients of GBPA. The first ingredient is the smoothed potential whose gradient is used to map the current estimate of the cumulative reward vector to a probability distribution over arms. The second ingredient is the construction of an unbiased estimate of the rewards vector, using the reward of the pulled arm only, by inverse probability weighting. This step reduces the bandit setting to the full-information setting, so that any algorithm for the full-information setting can be immediately applied to the bandit setting.

If we did not use any smoothing and directly used the baseline potential $\Phi(G) = \max_{1 \le i \le K} G_i$, we would be running Follow The Leader (FTL) as our full information algorithm. It is well known that FTL does not have good regret guarantees (Hazan et al., 2016). Therefore, we need to smooth the baseline potential to induce stability in the algorithm. It turns out that two major algorithm families in online learning, namely Follow The Regularized Leader (FTRL) and Follow The Perturbed Leader (FTPL), correspond to two different types of smoothing.

The smoothing used by FTRL is achieved by adding a strongly convex regularizer in the dual representation of the baseline potential. That is, we set $\tilde{\Phi}(G) = \max_{p \in \Delta_{K-1}} \langle p, G \rangle - \mathcal{R}(p)$, where $\mathcal{R}$ is a strongly convex function. The well known exponential weights algorithm (Freund and Schapire, 1997) uses the Shannon entropy regularizer. GBPA with the resulting smoothed potential becomes the EXP3 algorithm (Auer et al., 2002), which achieves a near-optimal regret bound of $O(\sqrt{KT\log K})$, just logarithmically worse than the lower bound $\Omega(\sqrt{KT})$. This lower bound was matched by the Implicit Normalized Forecaster with polynomial function (Poly-INF algorithm) (Audibert and Bubeck, 2009, 2010), and later work (Abernethy et al., 2015) showed that the Poly-INF algorithm is equivalent to the FTRL algorithm via the Tsallis entropy regularizer with parameter $\alpha \in (0,1)$.
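For reference, the Tsallis entropy takes the following standard form; as a sanity check, letting $\alpha \to 1$ recovers the Shannon entropy used by exponential weights:

```latex
S_\alpha(p) \;=\; \frac{1 - \sum_{i=1}^{K} p_i^{\alpha}}{1 - \alpha},
\qquad \alpha \in (0,1),
\qquad \lim_{\alpha \to 1} S_\alpha(p) \;=\; -\sum_{i=1}^{K} p_i \log p_i .
```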

An alternate way of smoothing is stochastic smoothing, which is what FTPL algorithms use. It injects stochastic perturbations into the cumulative rewards of each arm and then finds the best arm. Given a perturbation distribution $\mathcal{D}$ and $Z = (Z_1, \dots, Z_K)$ consisting of i.i.d. draws from $\mathcal{D}$, the resulting stochastically smoothed potential is $\tilde{\Phi}(G) = \mathbb{E}_{Z}\big[\max_i\, (G_i + \eta Z_i)\big]$. Its gradient is $\nabla\tilde{\Phi}(G)_i = \mathbb{P}\big(i = \arg\max_j\, (G_j + \eta Z_j)\big)$, where $\eta$ is a scaling parameter.
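The gradient of the stochastically smoothed potential can be estimated by Monte Carlo: the $i$-th coordinate is simply the frequency with which arm $i$ is the perturbed leader. The sketch below uses Gumbel perturbations (our choice for illustration), for which the resulting choice probabilities are known to coincide with the softmax of exponential weights:

```python
import numpy as np

# Monte Carlo estimate of the gradient of the stochastically smoothed
# potential: the i-th coordinate is the probability that arm i is the
# perturbed leader. (Sketch; the perturbation distribution is a free choice.)
def choice_probabilities(G, n_samples=200_000, rng=None):
    rng = np.random.default_rng(2) if rng is None else rng
    Z = rng.gumbel(size=(n_samples, len(G)))      # e.g., Gumbel perturbation
    winners = np.argmax(G + Z, axis=1)
    return np.bincount(winners, minlength=len(G)) / n_samples

p = choice_probabilities(np.array([1.0, 1.0, 0.0]))
print(p)  # first two arms equally likely, third arm least likely
```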

In Section 4.3, we recall the general regret bound proved by Abernethy et al. (2015) for distributions with bounded hazard rate. They showed that a variety of natural perturbation distributions can yield a near-optimal regret bound of $O(\sqrt{KT\log K})$. However, none of the distributions they tried yielded the minimax optimal rate $O(\sqrt{KT})$. Since FTRL with the Tsallis entropy regularizer can achieve the minimax optimal rate in adversarial bandits, the following is an important unresolved question regarding the power of perturbations.

##### Open Problem

Is there a perturbation $\mathcal{D}$ such that GBPA with a stochastically smoothed potential using $\mathcal{D}$ achieves the optimal $O(\sqrt{KT})$ regret bound in adversarial $K$-armed bandits?

Given what we currently know, there are two very natural approaches to resolving the open question in the affirmative. Approach 1: Find a perturbation so that we get the exactly same choice probability function as the one used by FTRL via Tsallis entropy. Approach 2: Provide a tighter control on expected block maxima of random variables considered as perturbations by Abernethy et al. (2015).

### 4.2 Barrier Against First Approach: Discrete Choice Theory

The first approach is motivated by a folklore observation in online learning theory, namely, that the exponential weights algorithm (Freund and Schapire, 1997) can be viewed as FTRL via Shannon entropy regularizer or as FTPL via a Gumbel-distributed perturbation. Thus, we might hope to find a perturbation which is an exact equivalent of the Tsallis entropy regularizer. Since FTRL via Tsallis entropy is optimal for adversarial bandits, finding such a perturbation would immediately settle the open problem.

The relation between regularizers and perturbations has been theoretically studied in discrete choice theory (McFadden, 1981; Hofbauer and Sandholm, 2002). For any perturbation, there is always a regularizer which gives the same choice probability function. The converse, however, does not hold. The Williams-Daly-Zachary Theorem provides a characterization of choice probability functions that can be derived via additive perturbations.

###### Theorem 7 (Williams-Daly-Zachary Theorem (McFadden, 1981)).

Let $C$ be the choice probability function with derivative matrix $DC$. The following 4 conditions are necessary and sufficient for the existence of a perturbation such that this choice probability function can be derived via additive perturbations:
(1) $DC$ is symmetric, (2) $DC$ is positive definite, (3) $DC\,\mathbf{1} = 0$, and (4) all mixed partial derivatives of each choice probability $C_i$ are positive.

We now show that if the number of arms is greater than three, there does not exist any perturbation exactly equivalent to Tsallis entropy regularization. Therefore, the first approach to solving the open problem is doomed to failure.

###### Theorem 8 (Proof in Appendix a.5).

When $K \ge 4$, there is no stochastic perturbation that yields the same choice probability function as the Tsallis entropy regularizer.

### 4.3 Barrier Against Second Approach: Extreme Value Theory

The second approach builds on the work of Abernethy et al. (2015), who provided the state-of-the-art perturbation based algorithm for adversarial multi-armed bandits. Their framework covers all distributions with bounded hazard rate, and shows that the regret of GBPA via a perturbation with bounded hazard rate is upper bounded by a trade-off between the bound on the hazard rate and the expected block maxima, as stated below.

###### Theorem 9 (Theorem 4.2 (Abernethy et al., 2015)).

Assume the support of $\mathcal{D}$ is unbounded in the positive direction and its hazard rate $h_{\mathcal{D}}$ is bounded. Then the expected regret of GBPA($\tilde{\Phi}$) in the adversarial bandit is bounded by $\eta\,\mathbb{E}[M_K] + \frac{TK \sup h_{\mathcal{D}}}{\eta}$, where $M_K = \max_{1 \le i \le K} Z_i$. The optimal choice of $\eta$ leads to the regret bound $2\sqrt{TK \sup h_{\mathcal{D}}\,\mathbb{E}[M_K]}$.

Abernethy et al. (2015) considered several perturbations such as the Gumbel, Gamma, Weibull, Fréchet and Pareto. The best tuning of distribution parameters (to minimize upper bounds on the product $\sup h_{\mathcal{D}} \cdot \mathbb{E}[M_K]$) always leads to the bound $O(\sqrt{KT\log K})$, which is tantalizingly close to the lower bound $\Omega(\sqrt{KT})$ but does not match it. It is possible that some of their upper bounds on expected block maxima are loose, and that we can get closer to, or perhaps even match, the lower bound by simply doing a better job of bounding expected block maxima (we will not worry about the supremum of the hazard rate, since their bounds on it can easily be shown to be tight, up to constants, using elementary calculations in Appendix B.2). We show that this approach will also not work, by characterizing the asymptotic (as $K \to \infty$) behavior of block maxima of perturbations using extreme value theory. The statistical behavior of block maxima $M_n = \max_{1 \le i \le n} Z_i$, where the $Z_i$'s are a sequence of i.i.d. random variables with distribution function $F$, can be described by one of three extreme value distributions: Gumbel, Fréchet and Weibull (Coles et al., 2001; Resnick, 2013). The normalizing sequences $a_n$ and $b_n$ are explicitly characterized (Leadbetter et al., 2012). Under mild conditions, $(M_n - b_n)/a_n$ converges in distribution to an extreme value distribution $G$, and the expected block maxima behave asymptotically as $\mathbb{E}[M_n] \approx b_n + c\,a_n$ for a constant $c$ depending on $G$. See Theorems 11-13 in Appendix B for more details.

The asymptotically tight growth rates (with explicit constants for the leading term!) of the expected block maximum of some distributions are given in Table 1. They match the upper bounds on the expected block maximum in Table 1 of Abernethy et al. (2015); that is, their upper bounds are asymptotically tight. The Gumbel, Gamma and Weibull distributions are of Gumbel-type, and their expected block maxima grow with $K$ only through $\log K$ factors. This implies that Gumbel-type perturbations can never achieve the optimal regret bound despite having bounded hazard rate. The Fréchet and Pareto distributions are of Fréchet-type, and their expected block maxima grow polynomially, as $K^{1/\alpha}$ for shape parameter $\alpha$. Heuristically, if $\alpha$ is set optimally to $\log K$, the expected block maximum is independent of $K$ (since $K^{1/\log K} = e$) while the supremum of the hazard rate is upper bounded by $O(\log K)$.
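These growth rates are easy to reproduce empirically. The sketch below (our own, with Fréchet samples generated by inverse-CDF sampling at an assumed shape $\alpha = 3$) contrasts the logarithmic and polynomial growth of the expected block maximum:

```python
import numpy as np

# Empirical growth of expected block maxima E[M_n]: Gumbel-type distributions
# grow like log(n), Frechet-type (shape alpha) grow like n**(1/alpha).
# We compare block sizes n = 100 and n = 10,000.
rng = np.random.default_rng(3)
reps, alpha = 1000, 3.0

def mean_max(sampler, n):
    # average of `reps` independent block maxima of block size n
    return sampler((reps, n)).max(axis=1).mean()

gumbel = lambda shape: rng.gumbel(size=shape)
# Frechet(alpha) via inverse CDF: X = (-log U)**(-1/alpha), U ~ Uniform(0, 1)
frechet = lambda shape: (-np.log(rng.uniform(size=shape))) ** (-1.0 / alpha)

g_ratio = mean_max(gumbel, 10_000) / mean_max(gumbel, 100)
f_ratio = mean_max(frechet, 10_000) / mean_max(frechet, 100)
print(g_ratio, f_ratio)  # roughly 1.9 (logarithmic) vs. 4.6 ≈ 100**(1/3)
```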

##### Conjecture

If there exists a perturbation that achieves minimax optimal regret in adversarial multi-armed bandits, it must be of Fréchet-type.

Fréchet-type perturbations can still possibly yield the optimal regret bound in perturbation based algorithm if the expected block maximum is asymptotically bounded by a constant and the divergence term in regret analysis of GBPA algorithm can be shown to enjoy a tighter bound than what follows from the assumption of a bounded hazard rate.

The perturbation equivalent to Tsallis entropy (in the two-armed setting) is of Fréchet-type. Further evidence to support the conjecture can be found in the connection between FTRL and FTPL in the two-armed setting, where regularizers and perturbations are in one-to-one correspondence: a perturbation with distribution function $F$ corresponds to a regularizer through the distribution of the difference of two i.i.d. draws from $F$. The difference of two i.i.d. Fréchet-type distributed random variables is conjectured to be Fréchet-type. Thus, Tsallis entropy in the two-armed setting leads to a Fréchet-type perturbation, which supports our conjecture about optimal perturbations in adversarial multi-armed bandits. See Appendix C for more details.

## 5 Numerical Experiments

We present some experimental results with perturbation based algorithms (Alg.1-(ii),(iii)) and compare them to the UCB1 algorithm on simulated stochastic $K$-armed bandits. In all experiments, the number of arms $K$ is 10, the number of episodes is 1000, and the true mean rewards $\mu_i$ are randomly generated as in (Kuleshov and Precup, 2014). We consider the following four examples of 1-sub-Gaussian reward distributions, each shifted by the true mean $\mu_i$: (a) Uniform, (b) Rademacher, (c) Gaussian, and (d) Gaussian mixture. Under each reward setting, we run five different algorithms: UCB1, RCB with Uniform and Rademacher perturbations, and FTPL via Gaussian and Double-exponential perturbations, after using grid search to tune the confidence levels of the confidence based algorithms and the scale parameter of the FTPL algorithms. All tuned confidence levels and parameters are specified in Figure 1. We compare the performance of the perturbation based algorithms to UCB1 in terms of the average regret, $R(T)/T$, which converges to zero more rapidly for better algorithms.

The average regret plots in Figure 1 show a similar pattern: the FTPL algorithms via Gaussian and Double-exponential perturbations consistently perform best after their parameters are tuned, while the UCB1 algorithm works equally well for all reward distributions except the Rademacher. The RCB algorithms with Uniform and Rademacher perturbations are slightly worse than UCB1 in early stages, but perform comparably well to UCB1 after enough iterations. In the Rademacher reward case, which is discrete, RCB with the Uniform perturbation slightly outperforms UCB1.

The main contribution of this work is to establish theoretical foundations for a large family of perturbation based algorithms (including those used in this section). Our numerical results are not intended to show the superiority of perturbation methods, but to demonstrate that they are competitive with Thompson Sampling and UCB. Note that in more complex bandit problems, sampling from the posterior and optimistic optimization can prove computationally challenging. Accordingly, our work paves the way for designing efficient perturbation methods in complex settings, such as stochastic linear bandits and stochastic combinatorial bandits, that have both computational advantages and low regret guarantees. Furthermore, perturbation approaches based on the Double-exponential distribution are of special interest from a privacy viewpoint, since that distribution figures prominently in the theory of differential privacy (Dwork et al., 2014; Tossou and Dimitrakakis, 2016, 2017).

## 6 Conclusion

We provided the first general analysis of perturbations for the stochastic multi-armed bandit problem. We believe that our work paves the way for similar extension for more complex settings, e.g., stochastic linear bandits, stochastic partial monitoring, and Markov decision processes. We also showed that the open problem regarding minimax optimal perturbations for adversarial bandits cannot be solved in two ways that might seem very natural. While our results are negative, they do point the way to a possible affirmative solution of the problem. They led us to a conjecture that the optimal perturbation, if it exists, will be of Fréchet-type.

## References

• Abernethy et al. [2014] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In Conference on Learning Theory, pages 807–823, 2014.
• Abernethy et al. [2015] Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems, pages 2197–2205, 2015.
• Agrawal and Goyal [2013] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
• Audibert and Bubeck [2009] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
• Audibert and Bubeck [2010] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
• Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
• Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
• Coles et al. [2001] Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An introduction to statistical modeling of extreme values, volume 208. Springer, 2001.
• Devroye et al. [2013] Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.
• Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
• Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
• Hannan [1957] James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
• Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
• Hazan et al. [2017] Tamir Hazan, George Papandreou, and Daniel Tarlow, editors. Perturbations, Optimization and Statistics. MIT Press, 2017.
• Hoeffding [1994] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
• Hofbauer and Sandholm [2002] Josef Hofbauer and William H Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 2002.
• Kalai and Vempala [2005] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
• Kujala and Elomaa [2005] Jussi Kujala and Tapio Elomaa. On following the perturbed leader in the bandit setting. In International Conference on Algorithmic Learning Theory, pages 371–385. Springer, 2005.
• Kuleshov and Precup [2014] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
• Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2018.
• Leadbetter et al. [2012] Malcolm R Leadbetter, Georg Lindgren, and Holger Rootzén. Extremes and related properties of random sequences and processes. Springer Science & Business Media, 2012.
• McFadden [1981] Daniel McFadden. Econometric models of probabilistic choice. Structural analysis of discrete data with econometric applications, 198272, 1981.
• Resnick [2013] Sidney I Resnick. Extreme values, regular variation and point processes. Springer, 2013.
• Scott [2010] Steven L Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
• Thompson [1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
• Tossou and Dimitrakakis [2017] Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
• Tossou and Dimitrakakis [2016] Aristide CY Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
• Van Erven et al. [2014] Tim Van Erven, Wojciech Kotłowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In Conference on Learning Theory, pages 949–974, 2014.
• Wong et al. [2019] Kam Chung Wong, Zifan Li, and Ambuj Tewari. Lasso guarantees for β-mixing heavy-tailed time series. Annals of Statistics, 2019.

## Appendix A Proofs

### A.1 Proof of Theorem 3

###### Proof.

For each arm $i$, we will choose two thresholds $x_i$ and $y_i$ such that $\mu_i < x_i < y_i < \mu_1$, and define two types of events, $E^\mu_i(t) = \{\hat\mu_i(t) \le x_i\}$ and $E^\theta_i(t) = \{\theta_i(t) \le y_i\}$. Intuitively, $E^\mu_i(t)$ and $E^\theta_i(t)$ are the events that the estimate $\hat\mu_i(t)$ and the sample value $\theta_i(t)$ are not too far above the mean $\mu_i$, respectively. $\mathbb{E}[T_i(T)]$ is decomposed into the following three parts according to the events $E^\mu_i(t)$ and $E^\theta_i(t)$:

$$\mathbb{E}[T_i(T)] = \underbrace{\sum_{t=1}^{T} \mathbb{P}\big(A_t=i,\,(E^\mu_i(t))^c\big)}_{(a)} + \underbrace{\sum_{t=1}^{T} \mathbb{P}\big(A_t=i,\,E^\mu_i(t),\,(E^\theta_i(t))^c\big)}_{(b)} + \underbrace{\sum_{t=1}^{T} \mathbb{P}\big(A_t=i,\,E^\mu_i(t),\,E^\theta_i(t)\big)}_{(c)}$$

Let $\tau_k$ denote the time at which the $k$-th trial of arm $i$ happens. Set $x_i = \mu_i + \Delta_i/3$ and $y_i = \mu_1 - \Delta_i/3$.

$$(a) \le 1 + \sum_{k=1}^{T-1} \mathbb{P}\big((E^\mu_i(\tau_k+1))^c\big) \le 1 + \sum_{k=1}^{T-1} \exp\!\left(-\frac{k(x_i-\mu_i)^2}{2}\right) \le 1 + \frac{18}{\Delta_i^2}. \tag{5}$$
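The last inequality in (5) follows from a geometric-series estimate; as a short verification, assuming the threshold choice $x_i = \mu_i + \Delta_i/3$ (which matches the constant $18/\Delta_i^2$):

```latex
\sum_{k=1}^{T-1} \exp\!\left(-\frac{k(x_i-\mu_i)^2}{2}\right)
\le \sum_{k=1}^{\infty} e^{-k\Delta_i^2/18}
= \frac{1}{e^{\Delta_i^2/18}-1}
\le \frac{18}{\Delta_i^2},
```

using $e^c - 1 \ge c$ for $c > 0$.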

The probability in part (b) is upper bounded by 1 if $T_i(t)$ is less than $L_i(T) := \big[2\log(T\Delta_i^2)\big]^{2/p}\sigma^2/(y_i-x_i)^2$, and by $C_a/(T\Delta_i^2)$ otherwise. The latter can be proved as below:

$$\mathbb{P}\big(A_t=i,\,(E^\theta_i(t))^c \mid E^\mu_i(t)\big) \le \mathbb{P}\big(\theta_i(t) > y_i \mid \hat\mu_i(t) \le x_i\big) \le \mathbb{P}\left(\frac{Z_{it}}{\sqrt{T_i(t)}} > y_i - x_i \,\middle|\, \hat\mu_i(t) \le x_i\right) \le C_a \exp\!\left(-\frac{T_i(t)^{p/2}(y_i-x_i)^p}{2\sigma^p}\right) \le \frac{C_a}{T\Delta_i^2} \quad \text{if } T_i(t) \ge L_i(T).$$

The third inequality holds by the sub-Weibull($p$) assumption on the perturbation $Z_{it}$. Since $T_i(t) < L_i(T)$ can hold in at most $L_i(T)$ of the rounds in which arm $i$ is played, part (b) is bounded as
$$(b) \le L_i(T) + \sum_{t=1}^{T} \frac{C_a}{T\Delta_i^2} = L_i(T) + \frac{C_a}{\Delta_i^2}.$$
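For completeness, the threshold $L_i(T)$ can be recovered by solving the sub-Weibull tail bound for $T_i(t)$; this is a short verification, assuming the definition $L_i(T) = \big[2\log(T\Delta_i^2)\big]^{2/p}\sigma^2/(y_i-x_i)^2$ used in the final bound:

```latex
C_a \exp\!\left(-\frac{T_i(t)^{p/2}(y_i-x_i)^p}{2\sigma^p}\right) \le \frac{C_a}{T\Delta_i^2}
\;\Longleftrightarrow\;
T_i(t)^{p/2} \ge \frac{2\sigma^p \log(T\Delta_i^2)}{(y_i-x_i)^p}
\;\Longleftrightarrow\;
T_i(t) \ge \frac{\big[2\log(T\Delta_i^2)\big]^{2/p}\sigma^2}{(y_i-x_i)^2} = L_i(T).
```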

Regarding part (c), define $p_{i,t}$ as the probability $p_{i,t} = \mathbb{P}(\theta_1(t) > y_i \mid \mathcal{H}_{t-1})$, where $\mathcal{H}_{t-1}$ is defined as the history of plays until time $t-1$. Let $\delta_j$ denote the time at which the $j$-th trial of the first arm happens.

###### Lemma 10 (Lemma 1 [Agrawal and Goyal, 2013]).

For $i \neq 1$,

$$(c) = \sum_{t=1}^{T} \mathbb{P}\big(A_t=i,\,E^\mu_i(t),\,E^\theta_i(t)\big) \le \sum_{t=1}^{T} \mathbb{E}\left[\frac{1-p_{i,t}}{p_{i,t}}\, \mathbb{I}\big(A_t=1,\,E^\mu_i(t),\,E^\theta_i(t)\big)\right] \le \sum_{j=0}^{T-1} \mathbb{E}\left[\frac{1-p_{i,\delta_j+1}}{p_{i,\delta_j+1}}\right].$$
###### Proof.

See Appendix A.2. ∎

The average reward from the first arm after $j$ trials, $\hat\mu_1$, has a density function denoted by $\phi_{\hat\mu_1,j}$; note that $\theta_1(\delta_j+1) = \hat\mu_1 + Z/\sqrt{j}$ for a fresh perturbation $Z$.

$$\mathbb{E}\left[\frac{1-p_{i,\delta_j+1}}{p_{i,\delta_j+1}}\right] = \mathbb{E}\left[\frac{1}{\mathbb{P}\big(\theta_1(\delta_j+1) \ge y_i \mid \mathcal{H}_{\delta_j}\big)} - 1\right] = \int_{\mathbb{R}} \left[\frac{1}{\mathbb{P}\big(x + Z/\sqrt{j} > \mu_1 - \Delta_i/3\big)} - 1\right] \phi_{\hat\mu_1,j}(x)\, dx$$

The above integration is divided into three intervals, $(-\infty, \mu_1 - \Delta_i/3]$, $(\mu_1 - \Delta_i/3, \mu_1 - \Delta_i/6]$, and $(\mu_1 - \Delta_i/6, \infty)$. We denote them as (i), (ii), and (iii), respectively.

$$(i) = \int_{-\infty}^{\mu_1 - \Delta_i/3} \left[\left[\frac{1}{\mathbb{P}\big(Z > -\sqrt{j}(x - \mu_1 + \Delta_i/3)\big)} - C_b\right]\phi_{\hat\mu_1,j}(x) + (C_b - 1)\,\phi_{\hat\mu_1,j}(x)\right] dx$$

The first term of (i) is bounded as follows:

$$\begin{aligned}
&\int_{-\infty}^{\mu_1-\Delta_i/3}\left[\frac{1}{\mathbb{P}\big(Z > -\sqrt{j}(x-\mu_1+\Delta_i/3)\big)} - C_b\right]\phi_{\hat\mu_1,j}(x)\,dx \\
&\quad= \int_0^\infty \left[\frac{1}{\mathbb{P}(Z>u)} - C_b\right] \frac{1}{\sqrt{j}}\,\phi_{\hat\mu_1,j}\!\left(-\frac{u}{\sqrt{j}} + \mu_1 - \frac{\Delta_i}{3}\right) du && \because\ u = -\sqrt{j}\big(x - \mu_1 + \tfrac{\Delta_i}{3}\big) \\
&\quad\le \int_0^\infty \left[C_b \exp\!\left(\frac{u^q}{2\sigma^q}\right) - C_b\right] \frac{1}{\sqrt{j}}\,\phi_{\hat\mu_1,j}\!\left(-\frac{u}{\sqrt{j}} + \mu_1 - \frac{\Delta_i}{3}\right) du && \because\ \text{anti-concentration inequality} \\
&\quad= \int_0^\infty \left[\int_0^u G'(v)\,dv\right] \frac{1}{\sqrt{j}}\,\phi_{\hat\mu_1,j}\!\left(-\frac{u}{\sqrt{j}} + \mu_1 - \frac{\Delta_i}{3}\right) du && \because\ G(u) = C_b \exp\!\left(\frac{u^q}{2\sigma^q}\right) \\
&\quad\le \int_0^\infty \exp\!\left(-\frac{(v + \sqrt{j}\Delta_i/3)^2}{2}\right) G'(v)\, dv && \because\ \text{Fubini's theorem \& sub-Gaussian reward} \\
&\quad= \int_0^\infty \exp\!\left(-\frac{(v + \sqrt{j}\Delta_i/3)^2}{2}\right) \cdot \frac{C_b\, q\, v^{q-1}}{2\sigma^q} \exp\!\left(\frac{v^q}{2\sigma^q}\right) dv
\end{aligned}$$

The remaining two pieces are bounded as

$$(ii) \le \int_{\mu_1-\Delta_i/3}^{\mu_1-\Delta_i/6} 2\,\mathbb{P}\big(Z < -\sqrt{j}(x - \mu_1 + \Delta_i/3)\big)\, \phi_{\hat\mu_1,j}(x)\, dx$$

$$(iii) \le \int_{\mu_1-\Delta_i/6}^{\infty} 2\,\mathbb{P}\big(Z < -\sqrt{j}(x - \mu_1 + \Delta_i/3)\big)\, \phi_{\hat\mu_1,j}(x)\, dx \le 2\,\mathbb{P}\!\left(Z < -\frac{\sqrt{j}\Delta_i}{6}\right) \int_{\mu_1-\Delta_i/6}^{\infty} \phi_{\hat\mu_1,j}(x)\, dx \le 2\,\mathbb{P}\!\left(Z < -\frac{\sqrt{j}\Delta_i}{6}\right) \le 2 C_a \exp\!\left(-\frac{j^{p/2}\Delta_i^p}{2(6\sigma)^p}\right)$$

Summing over $j$,

$$(c) = \sum_{j=0}^{T-1} (i) + (ii) + (iii) < \frac{18 C_b (M_{q,\sigma} + 1) + 126}{\Delta_i^2} + \frac{4 C_a (6\sigma)^p}{\Delta_i^p}. \tag{6}$$

Combining parts (a), (b), and (c),

$$\mathbb{E}[T_i(T)] \le 1 + \frac{144 + C_a + 18 C_b (M_{q,\sigma}+1)}{\Delta_i^2} + \frac{4 C_a (6\sigma)^p}{\Delta_i^p} + \frac{\sigma^2\big[2\log(T\Delta_i^2)\big]^{2/p}}{(y_i - x_i)^2}$$

We obtain the following instance-dependent regret bound: there exists a constant $C'' > 0$, independent of $T$ and the gaps $\Delta_i$, such that

$$R(T) \le C'' \sum_{i:\Delta_i > 0} \left(\Delta_i + \frac{1}{\Delta_i} + \frac{1}{\Delta_i^{p-1}} + \frac{\big[\log(T\Delta_i^2)\big]^{2/p}}{\Delta_i}\right). \tag{7}$$

Optimizing over the gaps $\Delta_i$ gives the corresponding instance-independent regret bound. ∎
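As a concrete illustration of the index analyzed in Theorem 3, the following is a minimal simulation sketch, not the paper's code: each arm $i$ is scored by $\theta_i(t) = \hat\mu_i(t) + Z_{it}/\sqrt{T_i(t)}$, here with Gaussian rewards (sub-Gaussian) and a Gaussian perturbation (sub-Weibull with $p = 2$). Function and variable names are our own choices.

```python
import numpy as np

def ftpl_regret(mu, T, sigma=1.0, seed=0):
    """Simulate the perturbation-based index theta_i = mu_hat_i + Z / sqrt(T_i)
    on a stochastic bandit with Gaussian rewards; return cumulative regret."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    counts = np.zeros(K)   # T_i(t): number of pulls of each arm
    sums = np.zeros(K)     # running reward sums, so mu_hat_i = sums / counts
    regret = 0.0
    for t in range(T):
        if t < K:
            arm = t  # play each arm once to initialize the estimates
        else:
            # perturbed index: empirical mean plus scaled perturbation
            theta = sums / counts + sigma * rng.standard_normal(K) / np.sqrt(counts)
            arm = int(np.argmax(theta))
        reward = mu[arm] + rng.standard_normal()  # Gaussian (sub-Gaussian) reward
        counts[arm] += 1
        sums[arm] += reward
        regret += mu.max() - mu[arm]
    return regret
```

In line with the instance-dependent bound (7), for a fixed gap the cumulative regret in such simulations grows only polylogarithmically in $T$.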

### A.2 Proof of Lemma 10

###### Proof.

First of all, we will show that the following inequality holds for all realizations of $\mathcal{H}_{t-1}$:

$$\mathbb{P}\big(A_t = i,\, E^\theta_i(t),\, E^\mu_i(t) \mid \mathcal{H}_{t-1}\big) \le \frac{1-p_{i,t}}{p_{i,t}}\, \mathbb{P}\big(A_t = 1,\, E^\theta_i(t),\, E^\mu_i(t) \mid \mathcal{H}_{t-1}\big). \tag{8}$$

To prove the above inequality, it suffices to show the inequality in (9). This is because whether $E^\mu_i(t)$ holds is determined by the realization of the history $\mathcal{H}_{t-1}$, so we need only consider realizations where $E^\mu_i(t)$ is true; if it is not true under some $\mathcal{H}_{t-1}$, then inequality (8) trivially holds.

$$\mathbb{P}\big(A_t = i \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big) \le \frac{1-p_{i,t}}{p_{i,t}}\, \mathbb{P}\big(A_t = 1 \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big) \tag{9}$$

Considering realizations satisfying $E^\theta_i(t)$, all samples $\theta_j(t)$ should be smaller than $y_i$, including that of the optimal arm, in order for the sub-optimal arm $i$ to be chosen.

$$\begin{aligned}
\mathbb{P}\big(A_t = i \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big) &\le \mathbb{P}\big(\theta_j(t) \le y_i,\ \forall j \in [K] \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big) \\
&= \mathbb{P}\big(\theta_1(t) \le y_i \mid \mathcal{H}_{t-1}\big) \cdot \mathbb{P}\big(\theta_j(t) \le y_i,\ \forall j \in [K]\setminus\{1,i\} \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big) \\
&= (1 - p_{i,t}) \cdot \mathbb{P}\big(\theta_j(t) \le y_i,\ \forall j \in [K]\setminus\{1,i\} \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big)
\end{aligned} \tag{10}$$

The first equality above holds since $\theta_1(t)$ is independent of the other samples $\theta_j(t)$ and of the event $E^\theta_i(t)$ given $\mathcal{H}_{t-1}$. In the same way, the lower bound is obtained as below:

$$\mathbb{P}\big(A_t = 1 \mid E^\theta_i(t),\, \mathcal{H}_{t-1}\big) \ge \mathbb{P}\big(\theta_1(t) > y_i \ge \theta_j(t),\ \forall j \in [K]\setminus\{1\} \mid E^\theta_i(t),