# Tracking the Best Expert in Non-stationary Stochastic Environments

###### Abstract

We study the dynamic regret of the multi-armed bandit and experts problems in non-stationary stochastic environments. We introduce a new parameter Λ, which measures the total statistical variance of the loss distributions over T rounds of the process, and study how this amount affects the regret. We investigate the interaction between Λ and Γ, which counts the number of times the distributions change, as well as between Λ and V, which measures how far the distributions deviate over time. One striking result we find is that even when Λ, Γ, and V are all restricted to constants, the regret lower bound in the bandit setting still grows with T. The other highlight is that in the full-information setting, a constant regret becomes achievable with constant Γ and Λ, as it can be made independent of T, while with constant V and Λ, the regret still has a T^{1/3} dependency. We not only propose algorithms with upper bound guarantees, but prove their matching lower bounds as well.


Chen-Yu Wei, Yi-Te Hong, Chi-Jen Lu
Institute of Information Science, Academia Sinica, Taiwan
{bahh723, ted0504, cjlu}@iis.sinica.edu.tw

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## 1 Introduction

Many situations in our daily life require us to make repeated decisions which result in some losses corresponding to our chosen actions. This can be abstracted as the well-known online decision problem in machine learning [5]. Depending on how the loss vectors are generated, two different worlds are usually considered. In the adversarial world, loss vectors are assumed to be deterministic and controlled by an adversary, while in the stochastic world, loss vectors are assumed to be sampled independently from some distributions. In both worlds, good online algorithms are known which can achieve a regret of about √T over T time steps, where the regret is the difference between the total loss of the online algorithm and that of the best offline one. Another distinction is about the information the online algorithm can receive after each action. In the full-information setting, it gets to know the whole loss vector of that step, while in the bandit setting, only the loss value of the chosen action is received. Again, in both settings, a regret of about √T turns out to be achievable.

While the regret bounds remain in the same order in those general scenarios discussed above, things become different when some natural conditions are considered. One well-known example is that in the stochastic multi-armed bandit (MAB) problem, when the best arm (or action) is substantially better than the second best, with a constant gap between their means, then a much lower regret, of the order of log T, becomes possible. This motivates us to consider other possible conditions which allow a finer characterization of the problem in terms of the achievable regret.

In the stochastic world, most previous works focused on the stationary setting, in which the loss (or reward) vectors are assumed to be sampled from the same distribution for all time steps. With this assumption, although one needs to balance between exploration and exploitation in the beginning, after some trials, one can be confident about which action is the best and rest assured that there are no more surprises. On the other hand, the world around us may not be stationary, and existing learning algorithms for the stationary case may then no longer work. In fact, in a non-stationary world, the dilemma between exploration and exploitation persists, as the underlying distribution may drift as time evolves. How does the non-stationarity affect the achievable regret? How does one measure the degree of non-stationarity?

In this paper, we answer the above questions through the notion of dynamic regret, which measures the algorithm’s performance against an offline algorithm allowed to select the best arm at every step.

#### Related Works.

One way to measure the non-stationarity of a sequence of distributions is to count the number of times the distribution at a time step differs from its previous one. Let Γ − 1 be this number, so that the whole time horizon can be partitioned into Γ intervals, with each interval having a stationary distribution. In the bandit setting, a regret of about √(ΓT) is achieved by the EXP3.S algorithm in [2], as well as the discounted UCB and sliding-window UCB algorithms in [8]. The dependency on T can be refined in the full-information setting: AdaNormalHedge [10] and Adapt-ML-Prod [7] can both achieve regret in the form of √(Γ L*), where L* is the total first-order and second-order excess loss respectively, which is upper-bounded by T. From a slightly different Online Mirror Descent approach, [9] can also achieve a regret of about √(ΓD), where D is the sum of differences between consecutive loss vectors.

Another measure of non-stationarity, denoted by V, is to compute the differences between the means of consecutive distributions and sum them up. Note that this allows the possibility for the best arm to change frequently, with a very large Γ, while still having similar distributions with a small V. For such a measure V, [3] provided a bandit algorithm which achieves a regret of about V^{1/3}T^{2/3}. This regret upper bound is unimprovable in general even in the full-information setting, as a matching lower bound was shown in [4]. Again, [9] refined the upper bound in the full-information setting through the introduction of D, achieving an analogous bound with V replaced by a parameter D that is different from but related to V: D calculates the sum of differences between consecutive realized loss vectors, while V measures that between mean loss vectors. This makes the results of [3] and [9] incomparable. The problem stems from the fact that [9] considers the traditional adversarial setting, while [3] studies the non-stationary stochastic setting. In this paper, we will provide a framework that bridges these two seemingly disparate worlds.

#### Our Results.

We base ourselves in the stochastic world with non-stationary distributions, characterized by the parameters Γ and V. In addition, we introduce a new parameter Λ, which measures the total statistical variance of the distributions. Note that the traditional adversarial setting corresponds to the case with Λ = 0 and arbitrary Γ and V, while the traditional stochastic setting has Γ = 1 and V = 0. Clearly, with a smaller Λ, the learning problem becomes easier, and we would like to understand the tradeoff between Λ and the other parameters, including Γ, V, and T. In particular, we would like to know how the bounds described in the related works would change. Would all the dependency on T be replaced by Λ, or would only some partial dependency on T be shifted to Λ?

First, we consider the effect of the variance Λ with respect to the parameter Γ. We show that in the full-information setting, a regret of about √(ΓΛ) + Γ can be achieved, which is independent of T. On the other hand, we show a sharp contrast that in the bandit setting, the dependency on T is unavoidable, as a lower bound of the order of √(ΓT) exists. That is, even when there is no variance in the distributions, with Λ = 0, and the distributions only change once, with Γ = 2, any bandit algorithm cannot avoid a regret of about √T, while a full-information algorithm can achieve a constant regret independent of T.

Next, we study the tradeoff between Λ and V. We show that in the bandit setting, a regret of about (ΛVT)^{1/3} + √(VT) is achievable. Note that this recovers the V^{1/3}T^{2/3} regret bound of [3] as Λ is at most of the order of T, but our bound becomes better when Λ is much smaller than T. Again, one may notice the dependency on T and wonder if this can also be removed in the full-information setting. We show that in the full-information setting, the regret upper bound and lower bound are both about (ΛVT)^{1/3} + V. Our upper bound is incomparable to that of [9], since their adversarial setting corresponds to Λ = 0, and their D can be as large as Θ(T) in our setting. Moreover, we see that while the full-information regret bound is slightly better than that in the bandit setting, there is still an unavoidable dependency on T.

Our results provide a big picture of the regret landscape in terms of the parameters Γ, V, and Λ, in both the full-information and bandit settings. A table summarizing our bounds as well as previous ones is given in Appendix A in the supplementary material. Finally, let us remark that our effort mostly focuses on characterizing the achievable (minimax) regrets, and most of our upper bounds are achieved by algorithms which need knowledge of the related parameters and may not be practical. To complement this, we also propose a parameter-free algorithm, which still achieves a good regret bound and may be of independent interest.

## 2 Preliminaries

Let us first introduce some notation. For an integer K, let [K] denote the set {1, …, K}. For a vector v, let v(i) denote its i-th component. When we need to refer to a time-indexed vector v_t, we will write v_t(i) to denote its i-th component. We will use the indicator function 1{C} for a condition C, which gives the value 1 if C holds and 0 otherwise. For a vector v, we let ‖v‖_p denote its ℓ_p-norm. While the standard O(·) notation is used to hide constant factors, we will use the Õ(·) notation to hide logarithmic factors.

Next, let us describe the problem we study in this paper. Imagine that a learner is given the choice of a total of K actions and has to play iteratively for a total of T steps. At step t, the learner needs to choose an action a_t ∈ [K], and then suffers a corresponding loss ℓ_t(a_t), where the loss vector ℓ_t ∈ [0, 1]^K is independently drawn from a distribution P_t with mean vector μ_t = E[ℓ_t], which may drift over time. After that, the learner receives some feedback from the environment. In the full-information setting, the feedback is the whole loss vector ℓ_t, while in the bandit setting, only the loss ℓ_t(a_t) of the chosen action is revealed. A standard way to evaluate the learner's performance is to measure her (or his) regret, which is the difference between the total loss she suffers and that of an offline algorithm. While most prior works consider offline algorithms which can only play a fixed action for all the steps, we consider stronger offline algorithms which can take different actions in different steps. Our consideration is natural for non-stationary distributions, although the regret can become large when measured against such stronger offline algorithms. Formally, we measure the learner's performance by her expected dynamic pseudo-regret, defined as E[Σ_{t=1}^{T} ℓ_t(a_t)] − Σ_{t=1}^{T} μ_t(a_t*), where a_t* ∈ argmin_i μ_t(i) is the best action at step t. For convenience, we will simply refer to it as the regret of the learner later in our paper.
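To make the objective concrete, the following minimal Python sketch computes the dynamic pseudo-regret of a played sequence against the per-step best arm; the mean vectors and the fixed-arm policy are hypothetical, chosen only for illustration.

```python
# Dynamic pseudo-regret: the total expected loss of the played arms minus
# that of the per-step best arm. The means below are hypothetical.
def dynamic_pseudo_regret(mu, played):
    """mu: one mean-loss vector per step; played: the chosen arm indices."""
    return sum(m[a] - min(m) for m, a in zip(mu, played))

mu = [[0.25, 0.75], [0.25, 0.75], [0.75, 0.25]]  # best arm switches at step 3
print(dynamic_pseudo_regret(mu, [0, 0, 0]))      # static policy pays 0.5
print(dynamic_pseudo_regret(mu, [0, 0, 1]))      # tracking policy pays 0.0
```

Any fixed-arm comparator pays the 0.5 here, which is exactly why the dynamic benchmark is the meaningful one under non-stationary means.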

We will consider the following parameters characterizing different aspects of the environments:

Γ = 1 + Σ_{t=2}^{T} 1{P_t ≠ P_{t−1}},   V = Σ_{t=1}^{T} ‖μ_t − μ_{t−1}‖_∞,   Λ = Σ_{t=1}^{T} max_{i∈[K]} Var(ℓ_t(i)),   (1)

where we let μ_0 be the all-zero vector. Here, Γ counts the number of times the distributions switch, V measures the distance the distributions deviate, and Λ is the total statistical variance of these distributions. We will call distributions with a small Γ switching distributions, while we will call distributions with a small V drifting distributions and call V the total drift of the distributions.
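For concreteness, the sketch below computes the three measures from a sequence of mean vectors, under our reading of the definitions: Γ via changes of consecutive means, V as the ∞-norm drift with μ_0 the all-zero vector, and Λ assuming Bernoulli losses so that the variance is p(1 − p); these normalizations are assumptions made for the example.

```python
# Compute (Gamma, V, Lambda) for a sequence of mean-loss vectors.
# Bernoulli losses are assumed only so that the variance is p * (1 - p).
def measures(mu):
    gamma = 1 + sum(a != b for a, b in zip(mu[1:], mu[:-1]))   # segments
    zero = [0.0] * len(mu[0])
    v = sum(max(abs(x - y) for x, y in zip(a, b))              # sup-norm drift
            for a, b in zip(mu, [zero] + mu[:-1]))
    lam = sum(max(p * (1 - p) for p in m) for m in mu)         # total variance
    return gamma, v, lam

mu = [[0.5, 1.0], [0.5, 1.0], [1.0, 0.0]]
print(measures(mu))  # (2, 2.0, 0.5)
```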

Finally, we will need the following large deviation bound, known as empirical Bernstein inequality.

###### Theorem 2.1.

[11] Let X_1, …, X_n be independent random variables taking values in [0, 1], and let X̄ = (1/n) Σ_{s=1}^{n} X_s and V_n = (1/n) Σ_{s=1}^{n} (X_s − X̄)² denote their empirical mean and variance. Then for any δ ∈ (0, 1), with probability at least 1 − δ, we have

|X̄ − E[X̄]| ≤ √(2 V_n ln(2/δ) / n) + 7 ln(2/δ) / (3(n − 1)).
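In code, the resulting confidence width reads as follows; the constants are the standard Maurer–Pontil form, which is our assumption for the missing display above.

```python
import math

# Empirical Bernstein confidence width: sqrt(2 * Vn * ln(2/d) / n)
#   + 7 * ln(2/d) / (3 * (n - 1)), with Vn the empirical variance.
def bernstein_width(xs, delta):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

xs = [0.0, 1.0] * 50        # 100 samples with empirical variance 0.25
print(bernstein_width(xs, 0.05) < bernstein_width(xs, 0.01))  # True
```

As expected, a smaller failure probability δ yields a wider interval.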

## 3 Algorithms

We would like to characterize the achievable regret bounds for both switching and drifting distributions, in both the full-information and bandit settings. In particular, we would like to understand the interplay among the parameters Γ, V, Λ, and T defined in (1). The only known upper bound which is good enough for our purpose is that by [8] for switching distributions in the bandit setting, which is close to the lower bound in our Theorem 4.1. In subsection 3.1, we provide a bandit algorithm for drifting distributions which achieves an almost optimal regret upper bound when given the parameters Λ, V, and T. In subsection 3.2, we provide a full-information algorithm which works for both switching and drifting distributions. The regret bounds it achieves are also close to optimal, but it again needs knowledge of the related parameters. To complement this, we provide a full-information algorithm in subsection 3.3 which does not need to know the parameters but achieves slightly larger regret bounds.

### 3.1 Parameter-Dependent Bandit Algorithm

In this subsection, we consider drifting distributions parameterized by V and Λ. Our main result is a bandit algorithm which achieves a regret of about (ΛVT)^{1/3} + √(VT). As we aim to achieve smaller regret for distributions with smaller statistical variances, we adopt a variant of the UCB algorithm developed by [1], called UCB-V, which takes variances into account when building its confidence intervals.

Our algorithm divides the time steps into intervals I_1, I_2, …, each having Δ steps (for simplicity of presentation, we assume here and later in the paper that taking divisions and roots to produce blocks of time steps always yields integers; it is easy to modify our analysis to the general case without affecting the order of our regret bound), with

Δ = ⌈(ΛT / V²)^{1/3}⌉ if Λ² ≥ VT, and Δ = ⌈√(T/V)⌉ otherwise.   (2)

For each interval, our algorithm clears all the information from previous intervals and starts a fresh run of UCB-V. More precisely, before step t in an interval, it maintains for each arm i its empirical mean μ̂_{t,i}, empirical variance σ̂²_{t,i}, and size of confidence interval c_{t,i}, defined as

μ̂_{t,i} = (1/|τ_{t,i}|) Σ_{s∈τ_{t,i}} ℓ_s(i),   σ̂²_{t,i} = (1/|τ_{t,i}|) Σ_{s∈τ_{t,i}} (ℓ_s(i) − μ̂_{t,i})²,   c_{t,i} = √(2 σ̂²_{t,i} α / |τ_{t,i}|) + 7α / (3(|τ_{t,i}| − 1)),   (3)

where τ_{t,i} denotes the set of steps before t in the current interval at which arm i was played, and the width c_{t,i} instantiates the deviation bound given in Theorem 2.1, with α a suitable logarithmic confidence term. Here we use the convention that μ̂_{t,i} = σ̂²_{t,i} = 0 if |τ_{t,i}| = 0, while c_{t,i} = ∞ if |τ_{t,i}| ≤ 1. Then at step t, our algorithm selects the optimistic arm a_t ∈ argmin_{i∈[K]} (μ̂_{t,i} − c_{t,i}),

receives the corresponding loss, and updates the statistics.

Our algorithm is summarized in Algorithm 1, and its regret is guaranteed by the following, which we prove in Appendix B in the supplementary material.
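A compressed sketch of the restart scheme follows: a fresh UCB-V run per interval, choosing the arm with the smallest lower-confidence value since losses are minimized. The width constants and the two-arm drifting environment are our own illustrative choices, not the tuned values of the analysis.

```python
import math, random

# Restart UCB-V every delta_len steps: keep per-arm empirical means and
# variances within the current interval only, and pull the arm whose
# empirical-Bernstein lower-confidence value is smallest.
def restarted_ucbv(sample_loss, K, T, delta_len, conf=4.0):
    chosen = []
    for start in range(0, T, delta_len):          # a fresh run per interval
        stats = [[] for _ in range(K)]            # observed losses per arm
        for t in range(start, min(start + delta_len, T)):
            def lcb(i):
                obs = stats[i]
                n = len(obs)
                if n == 0:
                    return -math.inf              # force one pull of each arm
                mean = sum(obs) / n
                var = sum((x - mean) ** 2 for x in obs) / n
                return mean - math.sqrt(2 * var * conf / n) - 3 * conf / n
            arm = min(range(K), key=lcb)
            stats[arm].append(sample_loss(t, arm))
            chosen.append(arm)
    return chosen

random.seed(0)
# Hypothetical drifting environment: arm 0 is better in the first half,
# arm 1 in the second half.
def sample_loss(t, arm):
    means = [0.2, 0.8] if t < 500 else [0.8, 0.2]
    return means[arm] + random.uniform(-0.1, 0.1)

plays = restarted_ucbv(sample_loss, K=2, T=1000, delta_len=250)
print(sum(a == 0 for a in plays[:500]), sum(a == 1 for a in plays[500:]))
```

Because the change point here aligns with an interval boundary, the stale statistics are discarded exactly when they become misleading.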

###### Theorem 3.1.

The expected regret of Algorithm 1 is at most Õ((ΛVT)^{1/3} + √(VT)).

### 3.2 Parameter-Dependent Full-Information Algorithms

In this subsection, we provide full-information algorithms for switching and drifting distributions. In fact, they are based on an existing algorithm from [6], which is known to work in a different setting: the loss vectors are deterministic and adversarial, and the offline comparator cannot switch arms. In that setting, one of their algorithms, based on gradient descent (GD), can achieve a regret of O(√D), where D = Σ_t ‖ℓ_t − ℓ_{t−1}‖² is the deviation of the loss sequence, which is small when the loss vectors change little from step to step. Our first observation is that their algorithm can in fact work against a dynamic offline comparator which switches arms less than Γ times, for any given Γ, with its regret becoming O(√(ΓD) + Γ). Our second observation is that when Λ is small, each observed loss vector ℓ_t is likely to be close to its true mean μ_t, and when V is small, μ_t is likely to be close to μ_{t−1}. These two observations make it possible for us to adapt their algorithm to our setting.

We show the first algorithm in Algorithm 2, with the feasible set being the probability simplex. The idea is to use ℓ_{t−1} as an estimate for ℓ_t in order to move further in a possibly beneficial direction. Its regret is guaranteed by the following, which we prove in Appendix C in the supplementary material.
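A sketch of the optimistic two-step update in the spirit of [6], with the previous loss vector serving as the estimate; the Euclidean projection onto the simplex and the fixed step size are illustrative choices rather than the tuned values of Theorem 3.2.

```python
# Optimistic gradient descent on the probability simplex: step from a
# secondary iterate x_hat using the previous loss as a guess for the
# current one, then update x_hat with the real loss once it is revealed.
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, x in enumerate(u, 1):
        css += x
        if x - (css - 1.0) / i > 0:
            theta = (css - 1.0) / i
    return [max(x - theta, 0.0) for x in v]

def optimistic_gd(losses, K, eta=0.1):
    x_hat = [1.0 / K] * K
    prev = [0.0] * K                  # no guess before the first loss
    played = []
    for loss in losses:
        x = project_simplex([a - eta * g for a, g in zip(x_hat, prev)])
        played.append(x)              # the distribution actually played
        x_hat = project_simplex([a - eta * g for a, g in zip(x_hat, loss)])
        prev = loss
    return played

played = optimistic_gd([[1.0, 0.0]] * 30, K=2)   # arm 2 is always better
print(played[-1])  # essentially all weight on arm 2
```

When consecutive losses are identical, the guess is exact and the optimistic step costs nothing, which is the mechanism the two observations above exploit.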

###### Theorem 3.2.

For switching distributions parameterized by Γ and Λ, the regret of Algorithm 2, with the learning rate chosen as in Lemma C.1, is at most O(√(ΓΛ) + Γ).

Note that for switching distributions, the regret of Algorithm 2 does not depend on T, which means that it can achieve a constant regret for constant Γ and Λ. Let us remark that although using a variant based on multiplicative updates could result in a better dependency on the number of arms K, an additional logarithmic factor in T would then emerge when using existing techniques for dealing with dynamic comparators.

For drifting distributions, one can show that Algorithm 2 still works and has a good regret bound. However, a slightly better bound can be achieved as we describe next. The idea is to divide the time steps into intervals of an appropriate size Δ, depending on Λ, V, and T (with Δ = T when the drift is small), and to re-run Algorithm 2 in each interval with an adaptive learning rate. One way to have an adaptive learning rate can be found in [9], which works well when there is only one interval. A natural way to adopt it here is to reset the learning rate at the start of each interval, but this does not lead to a good enough regret bound, as it results in some constant regret at the start of every interval. To avoid this, some careful changes are needed. Specifically, in an interval, we run Algorithm 2 with a learning rate that adapts to the accumulated deviation of the observed loss vectors, where the accumulated deviation is carried over from the end of the previous interval instead of being reset. This has the benefit of having small or even no regret at the start of an interval when the loss vectors across the boundary have small or no deviation. The regret of this new algorithm is guaranteed by the following, which we prove in Appendix D in the supplementary material.

###### Theorem 3.3.

For drifting distributions parameterized by Λ and V, the regret of this new algorithm is at most Õ((ΛVT)^{1/3} + V).

### 3.3 Parameter-Free Full-Information Algorithm

The reason that our algorithm for Theorem 3.3 needs the related parameters is to set its learning rate properly. To have a parameter-free algorithm, we would like to adjust the learning rate dynamically in a data-driven way. One way of doing this can be found in [7], which is based on the multiplicative-updates variant of the mirror-descent algorithm. It achieves a static regret of about √(Σ_t r_{t,i}²) against any expert i, where r_{t,i} is its instantaneous regret for not playing expert i at step t. However, in order to work in our setting, we would like the regret bound to depend on Λ and V as seen previously. This suggests modifying the Adapt-ML-Prod algorithm of [7] using the idea of [6], which takes an estimate m_t of ℓ_t to move further in an optimistic direction.

Recall that the algorithm of [7] maintains a separate learning rate η_{t,i} for each arm i at time t, and it updates the weight w_{t,i} as well as η_{t,i} using the instantaneous regret r_{t,i} = ⟨p_t, ℓ_t⟩ − ℓ_t(i). To modify the algorithm using the idea of [6], we would like to have an estimate m_t for ℓ_t in order to move further using m_t and update the learning rates accordingly. More precisely, at step t, we now play p_t, with

(4)

which uses the estimate m_t to move further from the current weights. Then after receiving the loss vector ℓ_t, we update each weight

(5)

as well as each learning rate

(6)

Our algorithm is summarized in Algorithm 3, and we will show that it achieves a regret of about √(Σ_{t=1}^{T} (ℓ_t(i) − m_t(i))²) against arm i. It remains to choose an appropriate estimate m_t. One attempt is to have m_t = ℓ_{t−1}, but the resulting deviation Σ_t (ℓ_t(i) − ℓ_{t−1}(i))² can be of order T even for i.i.d. losses, which does not lead to a desirable bound. The other possibility is to set m_t(i) = ⟨p_t, ℓ_t⟩ for every arm i, which can be shown to yield the dependency on Λ and V that we need. However, it is not clear how to compute such an m_t, because it depends on p_t, which in turn depends on m_t itself. Fortunately, we can approximate it efficiently in the following way.

Note that the key quantity is ⟨p_t, ℓ_t⟩. Given its value x, the weights w_{t,i} and learning rates η_{t,i} can be seen as functions of x, defined according to (5) as w_{t,i}(x) and η_{t,i}(x). Then we would like to show the existence of an x that equals the value of ⟨p_t, ℓ_t⟩ recomputed from the w_{t,i}(x)'s, and to find it efficiently. For this, consider the function f which maps x to this recomputed value, with w_{t,i}(x) defined above. It is easy to check that f is a continuous function bounded in [0, 1], which implies the existence of some fixed point x with f(x) = x. Using a binary search, such an x can be approximated within error ε in O(log(1/ε)) iterations. As such a small error does not affect the order of the regret, we will ignore it for simplicity of presentation and assume that we indeed have the fixed point, and hence m_t, without error.
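The bisection step can be sketched as follows; the map f below is a stand-in for the weight-update map described above, of which we only use continuity from [0, 1] into [0, 1].

```python
# Bisection for a fixed point of a continuous f mapping [0, 1] into [0, 1]:
# g(x) = f(x) - x satisfies g(0) >= 0 and g(1) <= 0, so a sign-based
# bisection converges in about log2(1/eps) iterations. The f used in the
# example is a stand-in, not the actual weight-update map.
def fixed_point(f, eps=1e-9):
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if f(mid) - mid >= 0:    # the fixed point lies to the right of mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(fixed_point(lambda v: (v + 1.0) / 3.0), 6))  # 0.5
```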

Then we have the following regret bound (cf. [7, Corollary 4]), which we prove in Appendix E in the supplementary material.

###### Theorem 3.4.

The static regret of Algorithm 3 with respect to any arm (or expert) i is at most

Õ( √( 1 + Σ_{t=1}^{T} (ℓ_t(i) − m_t(i))² ) ),

where the Õ notation hides a factor logarithmic in K and T.

The regret in the theorem above is measured against a fixed arm. To achieve a dynamic regret against an offline algorithm which can switch arms, one can use a generic reduction to the so-called sleeping experts problem. In particular, we can use the idea in [7] by creating KT sleeping experts and running our Algorithm 3 on these experts (instead of on the arms). More precisely, each sleeping expert is indexed by some pair (s, i) of a step s and an arm i; it is asleep for steps before s and becomes awake for steps t ≥ s. At step t, the new algorithm calls Algorithm 3 for the distribution p̃_t over the experts, and computes its own distribution p_t over arms, with p_t(i) proportional to Σ_{s ≤ t} p̃_t(s, i). Then it plays p_t, receives the loss vector ℓ_t, and feeds some modified loss vector ℓ̃_t and estimate vector m̃_t to Algorithm 3 for its update. Here, we set ℓ̃_t(s, i) to the expected loss ⟨p_t, ℓ_t⟩ if expert (s, i) is asleep and to ℓ_t(i) otherwise, while we set m̃_t(s, i) to ⟨p_t, m_t⟩ if expert (s, i) is asleep and to m_t(i) otherwise. This choice allows us to relate the regret of Algorithm 3 to that of the new algorithm, which can be seen in the proof of the following theorem, given in Appendix F in the supplementary material.
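The asleep/awake loss relabeling can be sketched as follows; the function name and expert encoding are our own illustrative choices.

```python
# Sleeping-experts relabeling: expert (s, i) is awake from step s on; an
# asleep expert is fed the algorithm's own expected loss, so its regret
# stays flat until it wakes up.
def relabel(t, p, loss, experts):
    """p: distribution over arms; loss: loss vector; experts: (s, i) pairs."""
    mixed = sum(pi * li for pi, li in zip(p, loss))   # expected loss of p
    return [loss[i] if s <= t else mixed for (s, i) in experts]

experts = [(0, 0), (0, 1), (2, 0), (2, 1)]   # wake-up steps 0 and 2, two arms
print(relabel(1, [0.5, 0.5], [0.2, 0.8], experts))
```

At step 1, the two experts waking up at step 2 are still asleep, so they receive the mixed loss 0.5, while the awake experts see the real losses.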

###### Theorem 3.5.

The dynamic expected regret of the new algorithm is Õ(√(ΓΛ) + Γ) for switching distributions and Õ((ΛVT)^{1/3} + V) for drifting distributions.

## 4 Lower Bounds

We study regret lower bounds in this section. In subsection 4.1, we show that for switching distributions, there is an Ω(√(ΓT)) lower bound for bandit algorithms, even when there is no variance (Λ = 0) and there are constant loss gaps between the optimal and suboptimal arms. We also show a full-information lower bound, which almost matches our upper bound in Theorem 3.2. In subsection 4.2, we show that for drifting distributions, our upper bounds in Theorem 3.1 and Theorem 3.3 are almost tight. In particular, we show that now even for full-information algorithms, a dependency on T in the regret turns out to be unavoidable, even for small Λ and V. This provides a sharp contrast to the upper bound of our Theorem 3.2, which shows that a constant regret is in fact achievable by a full-information algorithm for switching distributions with constant Γ and Λ. For simplicity of presentation, we will only discuss the case with K = 2 actions, as it is not hard to extend our proofs to the general case.

### 4.1 Switching Distributions

In contrast to the full-information setting, the existence of switches presents a lose-lose dilemma for a bandit algorithm: in order to detect any possible switch early enough, it must explore aggressively, but this has the consequence of playing suboptimal arms too often. To fool any bandit algorithm, we will switch between two deterministic distributions, with no variance, whose mean vectors μ⁽¹⁾ and μ⁽²⁾ differ by a constant gap in each coordinate, with arm i being the optimal arm under μ⁽ⁱ⁾. Our result is the following.

###### Theorem 4.1.

The worst-case expected regret of any bandit algorithm is Ω(√(ΓT)), for any Γ ≤ T.

###### Proof.

Consider any bandit algorithm A, and let us partition the steps into Γ/2 intervals, each consisting of 2T/Γ steps. Our goal is to make A suffer in each interval an expected regret of Ω(√(T/Γ)) by switching the loss vectors at most once. As mentioned before, we will only switch between two different deterministic distributions with mean vectors μ⁽¹⁾ and μ⁽²⁾. Note that we can see these two distributions simply as two loss vectors, with μ⁽ⁱ⁾ having arm i as the optimal arm.

In what follows, we focus on one of the intervals and assume that we have chosen the distributions in all previous intervals. We would like to start the interval with the loss vector μ⁽¹⁾. Let N denote the expected number of steps A plays the suboptimal arm 2 in this interval if μ⁽¹⁾ is used for the whole interval. If N ≥ (1/4)√(2T/Γ), we can actually use μ⁽¹⁾ for the whole interval with no switch, which, by the constant gap, makes A suffer an expected regret of Ω(√(T/Γ)) in this interval. Thus, it remains to consider the case with N < (1/4)√(2T/Γ). In this case, A does not explore arm 2 often enough, and we let it pay by choosing an appropriate step to switch to the other loss vector μ⁽²⁾, which has arm 2 as the optimal one. For this, let us divide the steps of the interval into √(2T/Γ) blocks, each consisting of √(2T/Γ) steps. As N < (1/4)√(2T/Γ), there must be a block in which the expected number of steps that A plays arm 2 is at most 1/4. By a Markov inequality, the probability that A ever plays arm 2 in this block is less than 1/4. This implies that when given the loss vector μ⁽¹⁾ for all the steps till the end of this block, A never plays arm 2 in the block with probability more than 3/4. Therefore, if we make the switch to the loss vector μ⁽²⁾ at the beginning of the block, then with probability more than 3/4, A still never plays arm 2 and never notices the switch in this block. As arm 2 is the optimal one with respect to μ⁽²⁾, with a constant gap, the expected regret of A in this block is Ω(√(T/Γ)).

Now if we choose distributions in each interval as described above, then there are at most Γ periods of stationary distribution in the whole horizon, and the total expected regret of A can be made at least (Γ/2) · Ω(√(T/Γ)) = Ω(√(ΓT)), which proves the theorem. ∎

For full-information algorithms, we have the following lower bound, which almost matches our upper bound in Theorem 3.2. We provide the proof in Appendix G in the supplementary material.

###### Theorem 4.2.

The worst-case expected regret of any full-information algorithm is Ω(√(ΓΛ) + Γ).

### 4.2 Drifting Distributions

In this subsection, we show that the regret upper bounds achieved by our bandit algorithm and full-information algorithm are close to optimal by showing almost matching lower bounds. More precisely, we have the following.

###### Theorem 4.3.

The worst-case expected regret of any full-information algorithm is Ω((ΛVT)^{1/3} + V), while that of any bandit algorithm is Ω((ΛVT)^{1/3} + √(VT)).

###### Proof.

Let us first consider the full-information case. When ΛT ≤ V², we immediately have from Theorem 4.2 the regret lower bound of Ω(V), by embedding Γ = Θ(V) switches with constant gaps, which incur a total drift of O(V); this term dominates (ΛVT)^{1/3} in this case.

Thus, let us focus on the case with ΛT > V². In this case, V < (ΛVT)^{1/3}, so it suffices to prove a lower bound of Ω((ΛVT)^{1/3}). Fix any full-information algorithm A; we will show the existence of a sequence of loss distributions for A to suffer such an expected regret. Following [3], we divide the time steps into intervals of length Δ, and we set Δ = (ΛT/V²)^{1/3}. For each interval, we will pick some arm i as the optimal one and give it some loss distribution q_1, while the other arms are sub-optimal and all have some loss distribution q_0. We need q_0 and q_1 to satisfy the following three conditions: (a) q_1's mean is smaller than q_0's by ε, (b) their variances are at most σ², and (c) their KL divergence satisfies KL(q_0 ‖ q_1) ≤ c ε²/σ² for a small enough absolute constant c, with the parameters ε and σ to be specified later. Their existence is guaranteed by the following, which we prove in Appendix H in the supplementary material.

###### Lemma 4.4.

For any σ ∈ (0, 1/2] and ε ≤ σ, there exist distributions q_0 and q_1 satisfying the three conditions above.

Let P⁽ⁱ⁾ denote the joint distribution of such arm distributions, with arm i being the optimal one, and we will use the same P⁽ⁱ⁾ for all the steps in an interval. We will show that for any interval, there is some i such that using P⁽ⁱ⁾ in this way can make algorithm A suffer a large expected regret in the interval, conditioned on the distributions chosen for previous intervals. Before showing that, note that when we choose distributions in this way, their total variance is at most Tσ², while their total drift is at most 2εT/Δ. To have them bounded by Λ and V respectively, we choose σ² = Λ/T and ε = VΔ/(2T), which satisfy the condition of Lemma 4.4 with our choice of Δ.

To find the distributions, we deal with the intervals one by one. Consider any interval, and assume that the distributions for previous intervals have been chosen. Let N_i denote the number of steps A plays arm i in this interval, and let E_i[N_i] denote its expectation when P⁽ⁱ⁾ is used for every step of the interval, conditioned on the distributions of previous intervals. One can bound this conditional expectation in terms of a related one, denoted as E_0[N_i], in which every arm has the loss distribution q_0 for every step of the interval, again conditioned on the distributions of previous intervals. Specifically, using an almost identical argument to that in [2, proof of Theorem A.2], one can show that

E_i[N_i] ≤ E_0[N_i] + (Δ/2) √(Δ · KL(q_0 ‖ q_1)).   (7)

According to Lemma 4.4 and our choice of parameters, we have Δ · KL(q_0 ‖ q_1) ≤ cΔ ε²/σ² = c/4, as ε²/σ² = 1/(4Δ). Summing both sides of (7) over the arms i, and using the fact that Σ_i E_0[N_i] = Δ, we get Σ_i E_i[N_i] ≤ (3/2)Δ for a small enough c, which implies the existence of some i such that E_i[N_i] ≤ (3/4)Δ. Therefore, if we choose this distribution P⁽ⁱ⁾, the conditional expected regret of algorithm A in this interval is at least ε(Δ − E_i[N_i]) ≥ εΔ/4.

By choosing distributions inductively in this way, we can make A suffer a total expected regret of at least (T/Δ) · (εΔ/4) = εT/4 = VΔ/8 = Ω((ΛVT)^{1/3}). This completes the proof for the full-information case.

Next, let us consider the bandit case. From Theorem 4.1, with switching distributions of constant gaps and Γ = Θ(V) switches (and hence total drift O(V)), we immediately have a lower bound of Ω(√(VT)), which implies the required bound when Λ² ≤ VT. When Λ² > VT, we have V < Λ²/T, which implies that V² < ΛT, and we can then use the full-information bound of Ω((ΛVT)^{1/3}) just proved before. This completes the proof of the theorem. ∎

## References

- [1] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci., 410(19):1876–1902, 2009.
- [2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
- [3] Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems (NIPS), December 2014.
- [4] Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
- [5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- [6] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In The 25th Conference on Learning Theory (COLT), June 2012.
- [7] Pierre Gaillard, Gilles Stoltz, and Tim van Erven. A second-order bound with excess losses. In The 27th Conference on Learning Theory (COLT), June 2014.
- [8] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In The 22nd International Conference on Algorithmic Learning Theory (ALT), October 2011.
- [9] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2015.
- [10] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In The 28th Conference on Learning Theory (COLT), July 2015.
- [11] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In The 22nd Conference on Learning Theory (COLT), June 2009.

## Appendix A Summary and Comparison of Regret Bounds

In Table 1, we list the bounds we derive as well as those from previous works. As can be seen, we have completed a basic picture of the regret landscape characterized by the parameters Γ, V, and Λ, in both the full-information and bandit settings. Note that we did not provide a bandit upper bound for switching distributions because a good upper bound is already shown in [8]. Although that upper bound does not depend on Λ, which may lead one to wonder if a better bound is possible with a smaller Λ, we show that this is in fact impossible by providing a matching lower bound using distributions with Λ = 0. Let us remark that although a similar lower bound was also given in [8], it was achieved by distributions with Λ = Θ(T). Moreover, the upper bounds of previous works, such as [8, 3, 4], do not take Λ into account, so their algorithms do not seem to produce our regret bounds in terms of Λ. The full-information algorithms of [10, 7] can produce a regret bound of the form √(Γ L*) for switching distributions, with the quantity L* being the smallest accumulated loss of an expert sequence that changes at most Γ − 1 times. Finally, in the full-information setting, [9] provides a regret upper bound which resembles ours for drifting distributions but uses slightly different definitions of the parameters.

## Appendix B Proof of Theorem 3.1

To prove the theorem, we rely on the following key lemma, which we prove in Subsection B.1.

###### Lemma B.1.

Consider a time interval I of Δ steps. Let v_I and λ_I be the drift and the variance, respectively, in I. Then the expected regret of Algorithm 1 in I is at most Õ(√(λ_I + 1) + Δ · v_I).

By applying Lemma B.1 on each interval, we can bound the total regret of Algorithm 1 by

Õ( Σ_I √(λ_I + 1) + ΔV ) ≤ Õ( √((T/Δ)(Λ + T/Δ)) + ΔV ),

which follows from the Cauchy–Schwarz inequality. This gives the bound of Õ((ΛVT)^{1/3}) with Δ = (ΛT/V²)^{1/3} when Λ² ≥ VT, while the bound becomes Õ(√(VT)) with Δ = √(T/V) otherwise. Combining these two cases gives the regret bound of the theorem.

### B.1 Proof of Lemma B.1

Let us first consider a time step t and bound the regret of our algorithm at that step, which is μ_t(a_t) − μ_t(a_t*), where a_t denotes the arm chosen by our algorithm and a_t* is the best arm. Let μ̄_{t,i} denote the expected value of the empirical mean μ̂_{t,i}. Note that when arm a_t is pulled, its optimistic index μ̂_{t,a_t} − c_{t,a_t} is no larger than that of any other arm. This implies a bound on the instantaneous regret in terms of the confidence widths, provided that the deviation bounds hold for every arm, which happens with high probability by Theorem 2.1 and a union bound. Using the fact that, within the interval, every mean μ_s(i) is within the accumulated drift v_I of μ_t(i) for any arm i, we then obtain a per-step regret bounded by the confidence widths plus the drift, again with high probability.

As a result, by summing over the steps t in the interval, the total regret of our algorithm is at most

(8)

with the last term being at most a constant for our choice of the confidence parameter. It remains to bound the first term above, for which we rely on the following lemma.

###### Lemma B.2.

For any time step and any arm such that ,

Before proving the lemma, let us first use it to bound the first term in (8). Let us divide into two parts: and . Then is at most

since , and recall that we use to hide logarithmic factors. Then by combining all the bounds together, we have

Substituting this into the bound in (8) and taking its expectation, we can conclude that the expected regret of our algorithm is at most

by Jensen’s inequality and the definition of λ_I. This gives the regret bound claimed by Lemma B.1. Thus, to complete the proof, it remains to prove Lemma B.2, which we do next.

###### Proof of Lemma B.2.

Consider any t and i, and for ease of presentation, let us drop the indices involving t and i. To bound the confidence width, defined through Theorem 2.1, let us first bound the empirical variance. Note that each observed loss can be expressed as

using the Cauchy–Schwarz inequality as well as the definition and the fact . Thus, we have

By plugging this bound into the definition of the confidence width and using Õ to hide logarithmic factors, we have

since for any . This proves the lemma. ∎

## Appendix C Proof of Theorem 3.2

Recall that our Algorithm 2 is based on that of [6] which is known to have a good regret bound against any offline comparator which does not switch arms. As we consider the dynamic regret here, we need to extend the bound to work against offline comparators which can switch arms. The following lemma provides such a bound.

###### Lemma C.1.

With the choice of the learning rate η = √(Γ/D), the regret of Algorithm 2, against an offline comparator switching arms less than Γ times, is at most O(√(ΓD) + Γ).

###### Proof.

Consider any offline comparator which switches arms Γ′ < Γ times, say at steps t_1 < ⋯ < t_{Γ′}. More specifically, with the convention t_0 = 1 and t_{Γ′+1} = T + 1, the arm it plays at step t remains the same as that at step t − 1, for any t ∉ {t_1, …, t_{Γ′}}. Let u_t denote the characteristic vector of the arm it plays at step t. Then when compared against it, the expected regret of our algorithm at step t with respect to a realized loss vector ℓ_t (sampled from the distribution P_t) is ⟨p_t − u_t, ℓ_t⟩,

which according to [6] is at most