# Learning piecewise Lipschitz functions in changing environments

## Abstract

Optimization in the presence of sharp (non-Lipschitz), unpredictable (w.r.t. time and amount) changes is a challenging and largely unexplored problem of great significance. We consider the class of piecewise Lipschitz functions, which is the most general online setting considered in the literature for the problem, and arises naturally in various combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of ‘static’ regret measures the gap between the payoff accumulated and that of the best fixed point for the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to $s$ environment shifts. In this work we provide an $O(\sqrt{sdT\log T} + sT^{1-\beta})$ regret bound for $\beta$-dispersed functions, where $\beta$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $\beta \ge 1/2$ in problems of practical interest [6, 7]). We show this bound is optimal up to sub-logarithmic factors. We further show how to improve the bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems, with 15-40% relative gain over static regret based algorithms on popular benchmarks.

## 1 Introduction

Online optimization is a well-studied problem in the online learning community [16, 24]. It consists of a repeated game with $T$ iterations. At iteration $t$, the player chooses a point $\rho_t$ from a compact decision set $\mathcal{C} \subset \mathbb{R}^d$; after the choice is committed, a bounded utility function $u_t : \mathcal{C} \to [0, H]$ is revealed to the player. The goal of the player is to minimize the regret, which is defined as the difference between the online cumulative payoff (i.e. $\sum_{t=1}^{T} u_t(\rho_t)$) and the cumulative payoff using an optimal offline choice in hindsight. In many real world problems, like online routing [5, 39], detecting spam email/bots [38, 19] and ad/content ranking [40, 18], it is often inadequate to assume a fixed point will yield good payoff at all times. It is more natural to compute regret against a stronger offline baseline, say one which is allowed to switch the point a few times (say $s$ shifts), to accommodate ‘events’ which significantly change the function values for certain time periods. The switching points are neither known in advance nor explicitly stated during the course of the game. Regret against this stronger baseline is known as shifting regret [15, 26].

Shifting regret is a particularly relevant metric for online learning problems arising in the context of algorithm configuration. This is an important family of non-convex optimization problems where the goal is to decide in a data-driven way what algorithm to use from a large family of algorithms for a given problem domain. In the online setting, one has a configurable algorithm such as an algorithm for clustering data [8], and must solve a series of related problems, such as clustering news articles each day for a news reader or clustering drugstore sales information to detect disease outbreaks. For problems of this nature, significant events in the world or changing habits of buyers might require changes in algorithm parameters, and we would like the online algorithms to adapt smoothly.

We present the first results for shifting regret for non-convex utility functions which potentially have sharp discontinuities. Restricting attention to specific kinds of decision sets $\mathcal{C}$ and utility function classes yields several important problems. If $\mathcal{C}$ is a convex set and utility functions are convex functions, we get the Online Convex Optimization (OCO) problem [42], which is a generalization of online linear regression [28] and prediction with expert advice [32]. Algorithms with $O(\sqrt{sT\log NT})$ worst-case regret are known for the case of $s$ shifts for prediction with $N$ experts and OCO on the $N$-simplex [26, 15, 14] using weight-sharing or regularization. We show how to extend the result to arbitrary compact sets of experts, and more general utility functions where convexity can no longer be exploited. Our key insight is to view the regularization as simultaneously inducing multiplicative weights update with restarts matching all possible shifted expert sequences, which allows us to use the dispersion condition introduced in [7].

Intuitively, a sequence of piecewise $L$-Lipschitz functions is well-dispersed if not too many functions are non-Lipschitz in the same region in $\mathcal{C}$. An assumption like this is necessary since even for piecewise constant functions, linear regret is unavoidable in the worst case [17]. Our shifting regret bounds are $O(\sqrt{sdT\log T} + sT^{1-\beta})$, which are good for sufficiently dispersed (large enough $\beta$) functions. In a large range of applications, one can show $\beta \ge 1/2$ [7]. This allows us to obtain tight regret bounds modulo sublogarithmic terms, providing a near-optimal characterization of the problem. Our analysis also readily extends to the closely related notion of adaptive regret [25]. Note that our setting generalizes the Online Non-Convex Learning (ONCL) problem (where all functions are $L$-Lipschitz throughout) [34, 41] for which shifting regret bounds have not been studied before.

We demonstrate the effectiveness of our algorithm in solving the algorithm selection problem for a family of clustering algorithms parameterized by different ways to initialize $k$-means [9]. We consider the problem of online clustering, but unlike prior work which studies individual data points arriving in an online fashion [31, 36], we look at complete clustering instances from some distribution(s) presented sequentially. Our experiments provide the first empirical evaluation of online algorithms for piecewise Lipschitz functions; prior work is limited to theoretical analysis [7] or experiments for the batch setting [9]. In particular we show that our algorithms significantly outperform the optimal algorithm for static regret (measured by the fraction of items correctly clustered) given clustering samples from changing distributions. Our results also have applications in non-convex online problems like portfolio optimization [2, 35] and online non-convex SVMs [22]. More broadly, for applications where one needs to tune hyperparameters that are not ‘nice’, our results imply it is necessary and sufficient to look at dispersion.

Overview: We formally define the notion of ‘changing environments’ in Section 2. We then present online algorithms that perform well in these settings in Section 3. In Sections 4 and 5 we provide theoretical guarantees of low regret for our algorithms and describe efficient implementations respectively. We present a near-tight lower bound in Section 6. In Section 7 we demonstrate the effectiveness of our algorithms in algorithm configuration problems for online clustering.

## 2 Problem setup

For each of the following definitions, we consider the following repeated game. At each round $1 \le t \le T$ we are required to choose $\rho_t \in \mathcal{C}$, are presented a piecewise $L$-Lipschitz function $u_t : \mathcal{C} \to [0, H]$, and experience reward $u_t(\rho_t)$.

In this work we will study the $s$-shifted regret and ($m$-sparse, $s$-shifted) regret notions defined below.

###### Definition 1.

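The $s$-shifted regret (introduced by [26] as tracking regret) is given by

$$\mathbb{E}\left[\max_{\substack{\rho_i^* \in \mathcal{C},\\ t_0 = 1 < t_1 < \dots < t_s = T+1}} \sum_{i=1}^{s} \sum_{t=t_{i-1}}^{t_i - 1} \left(u_t(\rho_i^*) - u_t(\rho_t)\right)\right]$$

Note that for the $i$-th phase ($i \in [s]$) given by $[t_{i-1}, t_i - 1]$, the offline algorithm uses the same point $\rho_i^*$. The usual notion of regret compares the payoff of the online algorithm to the offline strategies that pick a fixed point $\rho^* \in \mathcal{C}$ for all $t \in [T]$, but here we compete against more powerful offline strategies that can use up to $s$ distinct points by switching the expert $s-1$ times. For $s = 1$, we retrieve the standard static regret.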
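To make the comparator in Definition 1 concrete, the following is a minimal sketch that computes the best offline payoff with at most $s$ phases by dynamic programming. It assumes the decision set has been replaced by a finite grid of $N$ candidate points (an assumption made here purely for illustration), and the names `best_shifted_payoff` and `shifted_regret` are hypothetical.

```python
import numpy as np

def best_shifted_payoff(U, s):
    """Best total payoff of an offline comparator that plays one fixed
    point per phase and uses at most s phases (at most s - 1 switches).

    U is a (T, N) array: U[t, j] is the payoff of grid point j at round t.
    """
    U = np.asarray(U, dtype=float)
    T, N = U.shape
    # f[k, j]: best payoff so far using at most k + 1 phases,
    # with point j played in the most recent round.
    f = np.zeros((s, N))
    for t in range(T):
        prev = f.copy()
        for k in range(s):
            carry = prev[k]  # stay on the same point
            if k > 0:
                # or start a new phase from the best point with one phase fewer
                carry = np.maximum(carry, prev[k - 1].max())
            f[k] = U[t] + carry
    return float(f[s - 1].max())

def shifted_regret(U, played, s):
    """s-shifted regret of the index sequence `played` on the payoff table U."""
    U = np.asarray(U, dtype=float)
    earned = U[np.arange(len(played)), played].sum()
    return best_shifted_payoff(U, s) - float(earned)
```

With $s = 1$ the routine returns the usual static benchmark, so the gap between `best_shifted_payoff(U, s)` and `best_shifted_payoff(U, 1)` shows how much stronger the shifting comparator is on a given instance.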

###### Definition 2.

We can extend Definition 1 with an additional constraint that the number of distinct experts used (i.e. $|\{\rho_i^* \mid 1 \le i \le s\}|$) is at most $m$. We call this ($m$-sparse, $s$-shifted) regret [13].

This restriction makes sense if we think of the adversary as likely to reuse the same experts again, or the changing environment to experience recurring events with similar payoff distributions.

Without further assumptions, no algorithm achieves sublinear regret, even when the payout functions are piecewise constant [17]. We will characterize our regret bounds in terms of the ‘dispersion’ [7, 6] of the utility functions, which roughly says that discontinuities are not too concentrated. Several other restrictions can be seen as special cases [37, 17].

###### Definition 3.

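The sequence of utility functions $u_1, \dots, u_T$ is $\beta$-dispersed for the Lipschitz constant $L$ if, for all $T$ and for all $\epsilon \ge T^{-\beta}$, at most $\tilde{O}(\epsilon T)$ functions (the soft-O notation suppresses dependence on quantities beside $\epsilon$, $T$ and $\beta$) are not $L$-Lipschitz in any ball of size $\epsilon$ contained in $\mathcal{C}$. Further, if the utility functions are obtained from some distribution, the random process generating them is said to be $\beta$-dispersed if the above holds in expectation, i.e. if for all $T$ and for all $\epsilon \ge T^{-\beta}$,

$$\mathbb{E}\left[\max_{\rho \in \mathcal{C}} \left|\{t \in [T] \mid u_t \text{ is not } L\text{-Lipschitz in } \mathcal{B}(\rho, \epsilon)\}\right|\right] \le \tilde{O}(\epsilon T)$$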

For ‘static’ regret, a continuous version of exponential weight updates gives a bound of $\tilde{O}(\sqrt{dT} + T^{1-\beta})$, which is also shown to be asymptotically optimal by Balcan et al. [7]. They further show that in several cases of practical interest one can prove dispersion with $\beta = 1/2$ and the algorithm enjoys $\tilde{O}(\sqrt{dT})$ regret. This algorithm may, however, have linear $s$-shifted regret even with a single switch ($s = 2$), and hence may not be suited to changing environments (Appendix B).
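As an illustration of the dispersion condition, the sketch below counts, for one-dimensional piecewise constant functions, the largest number of distinct functions that have a discontinuity inside some closed interval of radius $\epsilon$. It assumes each function is described only by the list of its discontinuity locations; the helper name `max_functions_in_ball` is hypothetical, and the routine is a diagnostic aid rather than part of any algorithm in this paper.

```python
from collections import Counter

def max_functions_in_ball(discontinuities, eps):
    """discontinuities[t] lists the discontinuity locations of u_t on the line.

    Returns the largest number of distinct functions with at least one
    discontinuity in some closed interval [rho - eps, rho + eps].
    """
    events = sorted((x, t) for t, xs in enumerate(discontinuities) for x in xs)
    active = Counter()  # function index -> discontinuities currently in window
    best = 0
    left = 0
    for x, t in events:
        active[t] += 1
        # Shrink the window until it fits in an interval of width 2 * eps.
        while x - events[left][0] > 2 * eps:
            old_t = events[left][1]
            active[old_t] -= 1
            if active[old_t] == 0:
                del active[old_t]
            left += 1
        best = max(best, len(active))
    return best
```

Comparing the returned count with $\epsilon T$ for several values of $\epsilon \ge T^{-\beta}$ gives a quick empirical check of whether a given sequence looks $\beta$-dispersed.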

## 3 Online algorithms with low shifting regret

In this section we describe online algorithms with good shifting regret, but defer the actual regret analysis to Section 4. First we present a discretization based algorithm that simply uses a finite expert algorithm given a discretization of $\mathcal{C}$. This algorithm gives us a reasonable target regret bound to aim for, although the discretization results in exponentially many experts.

###### Definition 4 (r-discretization).

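An $r$-discretization or $r$-net of a bounded set $\mathcal{C} \subset \mathbb{R}^d$ is a finite set of points $\mathcal{D} \subset \mathcal{C}$ such that the Euclidean distance of any point in $\mathcal{C}$ is at most $r$ from some point in $\mathcal{D}$.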
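A greedy construction of such a net can be sketched as follows. This is a minimal illustration that assumes the set is represented by a finite sample of candidate points (so the output is an $r$-net of the sample, not of a continuous set), and the function name `greedy_r_net` is hypothetical.

```python
import numpy as np

def greedy_r_net(points, r):
    """Greedily pick centers so that every input point is within r of a center.

    A point becomes a new center only if it is farther than r from all
    centers chosen so far, so the chosen centers are pairwise more than r apart.
    """
    points = np.asarray(points, dtype=float)
    centers = []
    for p in points:
        if not centers:
            centers.append(p)
            continue
        dists = np.linalg.norm(np.array(centers) - p, axis=1)
        if dists.min() > r:
            centers.append(p)
    return np.array(centers)
```

Because the centers are pairwise more than $r$ apart, a packing argument bounds their number, which is the source of the exponential dependence on the dimension discussed next.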

The number of experts here grows exponentially with $d$, so we seek more efficient algorithms. We introduce a continuous version of the fixed share algorithm (Algorithm 2). We maintain weights $w_t(\rho)$ for all points $\rho \in \mathcal{C}$ similar to the Exponential Forecaster of [7], which updates these weights in proportion to their exponentiated scaled utility $e^{\lambda u_t(\rho)}$ ($\lambda$ is a step size parameter which controls how aggressively the algorithm updates its weights). The main difference is to update the weights with a mixture of the exponential update and a constant additive boost at all points in some proportion $\alpha$ (the exploration parameter, optimal value derived in Section 4) which remains fixed for the duration of the game. This allows the algorithm to balance exploitation (the exponential update assigns high weights to points with high past utility) with exploration, which turns out to be critical for success in changing environments. We will show this algorithm has good $s$-shifted regret in Section 4. It also enjoys good adaptive regret [25] (see Appendix D).

Notice that it is not clear how to implement Algorithm 2 from its description, since we cannot store all the weights $w_t(\rho)$ or sample easily when there are uncountably many points $\rho \in \mathcal{C}$. We will show in Section 5 how to efficiently sample according to $p_t(\rho)$ without necessarily computing it exactly or storing the exact weights.
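For intuition only, the update can be simulated on a finite uniform grid, where the weights can be stored explicitly. The sketch below assumes the volume of the decision set is normalized to 1 so that a grid average stands in for the integral; the names `fixed_share_step` and `sample_point` are hypothetical, and this is not the efficient sampler developed in Section 5.

```python
import numpy as np

def fixed_share_step(w, u, lam, alpha):
    """One Fixed Share update on a uniform grid (volume normalized to 1).

    w: current weights on the grid, u: utilities revealed this round,
    lam: step size, alpha: exploration parameter.
    """
    boosted = np.exp(lam * u) * w   # exponential update
    total = boosted.mean()          # grid stand-in for the integral over the set
    # Mix the exponential update with a uniform additive boost.
    return (1.0 - alpha) * boosted + alpha * total

def sample_point(w, grid, rng):
    """Draw a grid point with probability proportional to its weight."""
    p = w / w.sum()
    return grid[rng.choice(len(grid), p=p)]
```

Setting `alpha=0` recovers a grid version of the pure exponential update, which makes it easy to compare the two behaviours on the same sequence of utilities.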

As it turns out, adding equal weights to all points for exploration does not allow us to exploit recurring environments of the ($m$-sparse, $s$-shifted) setting very well. To overcome this, we replace the uniform update with a prior consisting of a weighted mixture of all the previous probability distributions used for sampling. Notice that this includes uniformly random exploration, as the first probability distribution is uniformly random, but the weight on this distribution decreases exponentially with time according to the discount rate $\gamma$ (more precisely, it decays by a factor $e^{-\gamma}$ with each time step). The idea of discounting is common with [26], but the exponential discounting gives better asymptotic bounds and we need novel proof techniques in our setting. While exploration in Fixed Share EF (Algorithm 2) is limited to starting afresh, here it includes partial resets to explore again from all past states, with an exponentially discounted rate.

## 4 Analysis of algorithms

We will now analyse the algorithms in Section 3. At a high level, the algorithms have been designed to ensure that the optimal solution, and its neighborhood, in hindsight have a large total density. We achieve this by carefully setting the parameters, in particular the exploration parameter $\alpha$, which controls the rate at which we allow our confidence in ‘good’ experts to change. Lipschitzness and dispersion are then used to ensure that solutions sufficiently close to the optimum are also good on average.

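We first state our regret bounds. The theorems in this section share the following setting. We assume from now on that the utility functions $u_t : \mathcal{C} \to [0, H]$ are piecewise $L$-Lipschitz and $\beta$-dispersed (Definition 3), where $\mathcal{C} \subset \mathbb{R}^d$ is contained in a ball of radius $R$.

###### Theorem 4 (discretization).

Let $R^{\text{finite}}(T, s, N)$ denote the $s$-shifted regret for the finite experts problem on $N$ experts, for the algorithm used in step 2 of Algorithm 1. Then Algorithm 1 enjoys $s$-shifted regret $R^{\mathcal{C}}(T, s)$ which satisfies

$$R^{\mathcal{C}}(T, s) \le R^{\text{finite}}\left(T, s, (3RT^{\beta})^d\right) + (sH + L)\,O(T^{1-\beta})$$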

The proof of Theorem 4 is straightforward using the definition of dispersion and is deferred to Appendix A. This gives us the following target bound for our more efficient algorithms.

###### Corollary 5.

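Algorithm 1 enjoys $O\left(H\sqrt{sT(d\log(RT^{\beta}) + \log T)} + (sH + L)T^{1-\beta}\right)$ $s$-shifted regret.

###### Proof.

There are known algorithms, e.g. Fixed-Share [15, 26], which obtain $R^{\text{finite}}(T, s, N) \le O(H\sqrt{sT\log(NT)})$. Applying Theorem 4 with $N = (3RT^{\beta})^d$ gives the desired upper bound. ∎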

Under the same conditions, we will show the following bounds for our algorithms. In the following statements we give approximate values for the parameters; see the proofs in Appendix C for more precise values.

###### Theorem 4 (Fixed Share EF).

Algorithm 2 enjoys $O\left(H\sqrt{sT(d\log(RT^{\beta}) + \log(T/s))} + (sH + L)T^{1-\beta}\right)$ $s$-shifted regret for $\lambda = \sqrt{s(d\log(RT^{\beta}) + \log(T/s))/T}/H$ and $\alpha = s/T$.

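###### Theorem 4 (Generalized Share EF).

Algorithm 3 enjoys $O\left(H\sqrt{T(md\log(RT^{\beta}) + s\log(mT/s))} + (mH + L)T^{1-\beta}\right)$ ($m$-sparse, $s$-shifted) regret for $\lambda = \sqrt{(md\log(RT^{\beta}) + s\log(mT/s))/T}/H$, $\alpha = s/T$ and $\gamma = s/(mT)$.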

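We now sketch the proofs of the two theorems above. We start with some observations about the weights in Algorithm 2.

###### Lemma 6.

For Algorithm 2, $W_{t+1} = \int_{\mathcal{C}} e^{\lambda u_t(\rho)} w_t(\rho)\, d\rho$.

###### Proof.

The result follows by simply substituting the update rule (1) for $w_{t+1}(\rho)$ in $W_{t+1} = \int_{\mathcal{C}} w_{t+1}(\rho)\, d\rho$. ∎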

The update rule (1) has the uniform exploration term scaled just appropriately so that this relation is satisfied. We will now relate $W_t$ with weights resulting from pure exponential updates, i.e. $\alpha = 0$ in Algorithm 2 (also the Exponential Forecaster algorithm of [7]). The following definition corresponds to weights for running Exponential Forecaster starting at some time $\tau$.

###### Definition 7.

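For any $\rho \in \mathcal{C}$ and $\tau \le \tau'$ define $\tilde{w}(\rho; \tau, \tau')$ to be the weight of expert $\rho$, and $\tilde{W}(\tau, \tau')$ to be the normalizing constant, if we ran the Exponential Forecaster of [7] starting from time $\tau$ up till time $\tau'$, i.e. $\tilde{w}(\rho; \tau, \tau') := e^{\lambda \sum_{t=\tau}^{\tau'-1} u_t(\rho)}$ and $\tilde{W}(\tau, \tau') := \int_{\mathcal{C}} \tilde{w}(\rho; \tau, \tau')\, d\rho$.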

We consider Algorithm 4, obtained by a slight modification in the update rule (1) of Fixed Share EF (Algorithm 2) which makes it easier to analyze. Essentially we replace the deterministic $\alpha$-mixture by a randomized one, so at each turn we either explicitly ‘restart’ with probability $\alpha$ by putting the same weight on each point, or else apply the exponential update. We note that Algorithm 4 is introduced to simplify the proof of Theorem 4, and in particular does not result in low regret itself. The issue is that even though the weights are correct in expectation (Lemma 4), their ratio (the probability $p_t(\rho)$) is not. In particular, the optimal parameter value of $\alpha$ for Fixed Share EF allows the possibility of pure exponential updates over a long period of time with a constant probability in Algorithm 4, which implies linear regret (see Appendix B, Theorem 12). This also makes the implementation of Fixed Share EF somewhat trickier (Section 5).

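The expected weights of Algorithm 4 (over the coin flips used in weight setting) are the same as the actual weights of Algorithm 2 (proof in Appendix C).

###### Lemma 4 (expected weights).

Let $\hat{w}_t(\rho)$ be the weights in Algorithm 4 and $w_t(\rho)$ the weights in Algorithm 2. For each $t \in [T]$, $\mathbb{E}[\hat{w}_t(\rho)] = w_t(\rho)$ and $\mathbb{E}[\hat{W}_t] = W_t$, where the expectations are over the random restarts $z_t$.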
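The equality of expected weights can be checked numerically on a small instance by enumerating every restart pattern. The sketch below is an illustration under stated assumptions: the decision set is replaced by a uniform grid with volume normalized to 1, a restart sets every weight to the current normalized total, and the function names are hypothetical.

```python
import itertools
import numpy as np

def fixed_share_weights(U, lam, alpha):
    """Deterministic Fixed Share weights after all rounds of U (shape (T, n))."""
    w = np.ones(U.shape[1])
    for u in U:
        boosted = np.exp(lam * u) * w
        w = (1 - alpha) * boosted + alpha * boosted.mean()
    return w

def expected_random_restart_weights(U, lam, alpha):
    """Exact expectation of the randomized-restart weights, computed by
    enumerating all 2^T restart patterns (only feasible for small T)."""
    T, n = U.shape
    expected = np.zeros(n)
    for restarts in itertools.product([0, 1], repeat=T):
        w = np.ones(n)
        prob = 1.0
        for u, z in zip(U, restarts):
            boosted = np.exp(lam * u) * w
            if z:  # restart: same weight everywhere, total weight preserved
                w = np.full(n, boosted.mean())
                prob *= alpha
            else:  # plain exponential update
                w = boosted
                prob *= 1 - alpha
        expected += prob * w
    return expected
```

The two routines agree up to floating point error, while any single restart pattern can of course differ substantially from the deterministic mixture.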

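The next lemma provides intuition for looking at our algorithm as a weighted superposition of several exponential update subsequences with restarts. This novel insight establishes a tight connection between the algorithms and is crucial for our analysis.

###### Lemma 4 (weight partition).

For Algorithm 2,

$$W_{T+1} = \sum_{s \in [T]} \left[\sum_{t_0 = 1 < t_1 < \dots < t_s = T+1} \frac{\alpha^{s-1}(1-\alpha)^{T-s}}{\mathrm{Vol}(\mathcal{C})^{s-1}} \prod_{i=1}^{s} \tilde{W}(t_{i-1}, t_i)\right]$$

###### Proof Sketch.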

Intuitively, each term corresponds to the weight when we pick a number $s \in [T]$ for the number of times we start afresh with a uniformly random point, at times $t_0, \dots, t_{s-1}$, and run the regular exponential weighted forecaster in the intermediate periods. We have a weighted sum over all these terms, with a factor $\alpha$ for each time we start afresh and $(1-\alpha)$ for each time we continue with the weighted Exponential Forecaster.

Formally, we use Lemma 4 (expected weights) to note that $W_{T+1} = \mathbb{E}[\hat{W}_{T+1}]$, where the total weight $\hat{W}_{T+1} \mid s, t_s$ of Algorithm 4 at time $T+1$, given that restarts occur exactly at $t_1, \dots, t_{s-1}$, is deterministic since all weights are fixed given the exact restart times. We show in Appendix C, by an induction on $s$,

$$\hat{w}_{T+1}(\rho) \mid s, t_s = \tilde{w}(\rho; t_{s-1}, t_s) \prod_{i=1}^{s-1} \frac{\tilde{W}(t_{i-1}, t_i)}{\mathrm{Vol}(\mathcal{C})} \tag{2}$$

Integrating (2) to get $\hat{W}_{T+1} \mid s, t_s$, and noting that the probability of restarts at exactly these times is $\alpha^{s-1}(1-\alpha)^{T-s}$, completes the proof. ∎

We are now ready to prove Theorem 4. The main idea is to show that the normalization of exploration helps the total weights to provide a lower bound for the algorithm payoff. Also the total weights are competitive against the optimal payoff as they contain the exponential updates with the optimal set of switching points in Lemma 4 with a sufficiently large (‘probability’) coefficient.

###### Proof sketch of Theorem 4.

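We provide an upper and lower bound to $W_{T+1}/W_1$. The upper bound uses Lemma 6 and helps us lower bound the performance of the algorithm (see Appendix C) as

$$\frac{W_{T+1}}{W_1} \le \exp\left(\frac{P(\mathcal{A})(e^{H\lambda} - 1)}{H}\right) \tag{3}$$

where $P(\mathcal{A})$ is the expected total payoff for Algorithm 2. We now upper bound the optimal payoff $OPT$ by providing a lower bound for $W_{T+1}/W_1$. By Lemma 4 (weight partition) we have

$$W_{T+1} \ge \frac{\alpha^{s-1}(1-\alpha)^{T-s}}{\mathrm{Vol}(\mathcal{C})^{s-1}} \prod_{i=1}^{s} \tilde{W}(t^*_{i-1}, t^*_i)$$

by dropping all terms save those that ‘restart’ exactly at the OPT expert switches $t^*_0, \dots, t^*_s$. Now using $\beta$-dispersion we can show (similarly to [7], proof in Appendix C)

$$\frac{W_{T+1}}{W_1} \ge \alpha^{s-1}(1-\alpha)^{T-s} \left(\frac{1}{RT^{\beta}}\right)^{sd} \exp\left(\lambda\left(OPT - (sH + L)\,O(T^{1-\beta})\right)\right)$$

Putting this together with the upper bound (3), rearranging and optimizing the difference $OPT - P(\mathcal{A})$ for $\alpha$ and $\lambda$ concludes the proof. (See Appendix C for a full proof.) ∎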
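The final "rearranging and optimizing" step can be spelled out as follows. This is a sketch of the omitted algebra rather than the exact argument of Appendix C: it assumes $H\lambda \le 1$ (so that $e^{H\lambda} - 1 \le H\lambda + (H\lambda)^2$), uses $P(\mathcal{A}) \le HT$, writes $E := (sH + L)\,O(T^{1-\beta})$ for the dispersion loss term, and does not track constants.

$$
\begin{aligned}
\lambda(OPT - E) - sd\log(RT^{\beta}) + (s-1)\log\alpha + (T-s)\log(1-\alpha)
&\le \frac{P(\mathcal{A})(e^{H\lambda} - 1)}{H} \le P(\mathcal{A})(\lambda + H\lambda^2), \\
OPT - P(\mathcal{A})
&\le H^2 T\lambda + \frac{1}{\lambda}\left[sd\log(RT^{\beta}) + (s-1)\log\frac{1}{\alpha} + (T-s)\log\frac{1}{1-\alpha}\right] + E.
\end{aligned}
$$

The bracket is minimized at $\alpha = (s-1)/(T-1)$, where, using $\log(1+x) \le x$, it is at most $B := sd\log(RT^{\beta}) + (s-1)\log\frac{e(T-1)}{s-1}$ (the last term is read as $0$ when $s = 1$). Choosing $\lambda = \sqrt{B/T}/H$, which satisfies $H\lambda \le 1$ whenever $B \le T$, gives

$$OPT - P(\mathcal{A}) \le 2H\sqrt{TB} + E,$$

which has the form $O\left(H\sqrt{sT(d\log(RT^{\beta}) + \log(T/s))}\right) + (sH + L)\,O(T^{1-\beta})$.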

We now analyze Algorithm 3 for the sparse experts setting. We can adapt the proofs of Lemmas 6 and 4 to easily establish Lemmas 8 and 9.

###### Lemma 8.

For Algorithm 3, $W_{t+1} = \int_{\mathcal{C}} e^{\lambda u_t(\rho)} w_t(\rho)\, d\rho$.

###### Lemma 9.

Let . For Algorithm 3,

 WT+1=∑s∈[T][∑t0=1

where .

For any ,

###### Proof.

Consider the probability of last ‘reset’ (setting ) at time when computing as the expected weight of a random restart version which matches Algorithm 3 till time . ∎

We are now ready to prove Theorem 4 (Generalized Share EF). This time we show that the total weight is competitive with running exponential updates on all partitions (and in particular the optimal partition) of $[T]$ into $m$ subsets with $s$ switches. Intuitively, the property of restarting exploration from all past points crucially allows us to “jump” across intervals where a given expert was inactive (or bad).

###### Proof sketch of Theorem 4.

We provide an upper and lower bound to $W_{T+1}/W_1$, similar to the proof for Algorithm 2. Using Lemma 8 we can show that inequality (3) holds here as well. By Corollary 10 and Lemma 14 (which relates to past weights, proved in Appendix C), and $\beta$-dispersion we can show a lower bound

 WT+1W1≥αs(1−α)T(1−e−γ)s(e−γ+α(1−e−γ))−mT(RTβ)mdexp(λ(OPT−(mH+L)O(T1−β))) (4)

Putting together equations (3) and (4), rearranging and optimizing for the parameters concludes the proof. ∎

## 5 Efficient implementation of algorithms

In this section we show that the Fixed Share Exponential Forecaster algorithm (Algorithm 2) can be implemented efficiently. In particular we overcome the need to explicitly compute and update $w_t(\rho)$ (there are uncountably many $\rho$ in $\mathcal{C}$) by showing that we can sample the points according to $p_t(\rho)$ directly.

The high-level strategy is to show (Lemma 5) that $p_t(\rho)$ is a mixture of $t$ distributions which are Exponential Forecaster distributions from [7], i.e. $\tilde{p}_i(\rho) := \tilde{w}(\rho; i, t)/\tilde{W}(i, t)$ for each $1 \le i \le t$, with proportions $C_{t,i}$. As shown in [7], these distributions can be approximately sampled from (exactly in the one-dimensional case, $d = 1$). We need to sample from one of these $t$ distributions with probability $C_{t,i}$ to get the distribution $p_t$, and we can approximate these coefficients efficiently (or compute them exactly in the one-dimensional case). The rest of the section discusses how to do these approximations efficiently, and with small extra expected regret. Indeed we will asymptotically get the same bound as the exact algorithm. (Formal proofs are in Appendix E.)

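The coefficients $C_{t,i}$ have a simple form in terms of the normalizing constants $W_t$ of the rounds so far, so we first express $W_{t+1}$ in terms of the $W_i$'s from previous rounds and some $\tilde{W}(i, j)$'s.

###### Lemma 5 (weight recursion).

In Algorithm 2, for $t \ge 1$,

$$W_{t+1} = (1-\alpha)^{t-1}\,\tilde{W}(1, t+1) + \frac{\alpha}{\mathrm{Vol}(\mathcal{C})} \sum_{i=2}^{t} \left[(1-\alpha)^{t-i}\, W_i\, \tilde{W}(i, t+1)\right]$$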
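The recursion above can be sanity-checked numerically. The following sketch assumes a uniform grid stands in for the decision set with volume normalized to 1, uses 1-indexed rounds to match the statement, and the function name `weight_recursion_check` is hypothetical.

```python
import numpy as np

def weight_recursion_check(U, lam, alpha):
    """Return (direct, via_recursion) values of W_{T+1} for utilities U of
    shape (T, n) on a uniform grid with volume normalized to 1."""
    T, n = U.shape

    def W_tilde(i, j):
        # Total weight of pure exponential updates over rounds i, ..., j - 1.
        return np.exp(lam * U[i - 1:j - 1].sum(axis=0)).mean()

    # Direct simulation of Fixed Share; W[t] stores W_t for t = 1, ..., T + 1.
    w = np.ones(n)
    W = {1: 1.0}
    for t in range(1, T + 1):
        boosted = np.exp(lam * U[t - 1]) * w
        W[t + 1] = boosted.mean()
        w = (1 - alpha) * boosted + alpha * boosted.mean()

    # Right-hand side of the recursion evaluated at t = T.
    rec = (1 - alpha) ** (T - 1) * W_tilde(1, T + 1)
    rec += alpha * sum(
        (1 - alpha) ** (T - i) * W[i] * W_tilde(i, T + 1)
        for i in range(2, T + 1)
    )
    return W[T + 1], rec
```

The practical point of the recursion is that $W_{t+1}$ only requires the stored scalars $W_2, \dots, W_t$ and integrals of pure exponential-update weights, never the full weight function.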

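As indicated above, $p_t(\rho)$ is a mixture of $t$ distributions.

###### Lemma 5 (mixture coefficients).

In Algorithm 2, for $t \ge 1$, $p_t(\rho) = \sum_{i=1}^{t} C_{t,i}\, \frac{\tilde{w}(\rho; i, t)}{\tilde{W}(i, t)}$. The coefficients $C_{t,i}$ are given by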

$$C_{t,i} = \begin{cases} 1 & i = t = 1 \\ \alpha & i = t > 1 \\ (1-\alpha)\,\dfrac{W_{t-1}}{W_t}\,\dfrac{\tilde{W}(i, t)}{\tilde{W}(i, t-1)}\,C_{t-1,i} & i < t \end{cases}$$

and $(C_{t,1}, \dots, C_{t,t})$ lies on the probability simplex $\Delta^{t-1}$.

The observations above allow us to write the algorithms for efficiently implementing Fixed Share EF, for which we obtain formal guarantees in Theorem 5. We present an approximate algorithm (Algorithm 5) with the same expected regret as in Theorem 4 (and also present an exact algorithm, Algorithm 6 in Appendix E, for $d = 1$). We say Algorithm 5 gives an $(\eta, \zeta)$-estimate of Algorithm 2, i.e. with probability at least $1 - \zeta$, its expected payoff is within a factor of $e^{\eta}$ of that of Algorithm 2. Key to approximating the coefficients and sampling according to them are the efficient sampling and integration algorithms for logconcave distributions [33], which allow us to solve the problem when the $u_t$'s are concave (diminishing returns).

###### Theorem 5.

If utility functions are concave, we can approximately sample a point $\rho$ with probability $p_{t+1}(\rho)$ in time polynomial in the problem parameters (including the number of discontinuities in the $u_t$'s) for approximation parameters $\eta$ and $\zeta$, and enjoy the same regret bound as the exact algorithm.

Note that in this section we concerned ourselves with developing a polynomial-time algorithm. For special cases of practical interest, like one-dimensional piecewise constant functions, we can implement much faster algorithms, as noted in Section 7.

## 6 Lower bounds

We prove our lower bound for utility functions which are $\beta$-dispersed and 0-Lipschitz (piecewise constant). For such utility functions we have shown in Section 4 that the $s$-shifted regret is $O(\sqrt{sT\log T} + sT^{1-\beta})$. Here we will establish a lower bound of $\Omega(\sqrt{sT} + sT^{1-\beta})$. We show a range of values of $\beta$ where the stated lower bound is achieved. For $s = 1$, this improves over the lower bound construction of [7], where the lower bound is shown only for $\beta = 1/2$. In particular our results establish an almost tight characterization of static and dynamic regret under dispersion.
###### Theorem 6.

For each $\beta$ in the range considered, there exist utility functions which are $\beta$-dispersed, and the $s$-shifted regret of any online algorithm is $\Omega(\sqrt{sT} + sT^{1-\beta})$.

###### Proof.

In the first phase, for the first functions we have a single discontinuity in the interval . The functions have payoff 1 before or after (with probability $1/2$ each) their discontinuity point, and zero elsewhere. We introduce functions each for the same discontinuity point, and set the discontinuity points apart for $\beta$-dispersion. This gives us potential points inside , so we can support such functions ( since ). By Lemma 22 (Appendix F) we accumulate regret for this part of the phase in expectation. Let be the interval from among and with more payoff in the phase so far. The next function has payoff 1 only at the first or second half of (with probability $1/2$) and zero everywhere else. Any algorithm accumulates expected regret on this round. We repeat this in successively halved intervals. $\beta$-dispersion is satisfied since we use only functions in the interval of size greater than , and we accumulate an additional regret. Notice there is a fixed point used by the optimal adversary for this phase.

Finally we repeat the construction inside the largest interval with no discontinuities at the end of the last phase for the next phase. Note that at the $i$-th phase the interval size will be . Indeed at the end of the first round we have ‘unused’ intervals of size . At the $i$-th phase, we'll be repeating inside an interval of size . This allows us to run $s$ phases and get the desired lower bound (intervals must be of size at least to support the construction). ∎

## 7 Experiments

The simplest demonstration of the significance of our algorithm in a changing environment is to consider the 2-shifted regret when a single expert shift occurs. We consider an artificial online optimization problem first, and will then look at applications to online clustering. Let $\mathcal{C} = [0, 1]$.
Define utility functions

$$u^{(0)}(\rho) = \begin{cases} 1 & \text{if } \rho < \frac{1}{2} \\ 0 & \text{if } \rho \ge \frac{1}{2} \end{cases} \quad \text{and} \quad u^{(1)}(\rho) = \begin{cases} 0 & \text{if } \rho < \frac{1}{2} \\ 1 & \text{if } \rho \ge \frac{1}{2} \end{cases}$$

Now consider the instance where $u^{(0)}$ is presented for the first $T/2$ rounds and $u^{(1)}$ is presented for the remaining $T/2$ rounds. We observe constant average regret for the Exponential Forecaster algorithm, while the average Fixed Share regret decays with $T$ (Figure 1). While the example is simple and artificial, it qualitatively captures why Fixed Share dominates Exponential Forecaster here: the best expert changes and the old expert is no longer competitive (cf. Appendix B).

$k$-means++ is a celebrated algorithm [3] which shows the importance of initial seed centers in clustering using the $k$-means algorithm (also called Lloyd's method). Balcan et al. [9] generalize it to $(\alpha, 2)$-Lloyds++-clustering, which interpolates between random initial seeds (vanilla $k$-means, $\alpha = 0$), $k$-means++ ($\alpha = 2$) and farthest-first traversal ($\alpha = \infty$) [23, 20] using a single parameter $\alpha$. The clustering objective (we use the Hamming distance to the optimal clustering, i.e. the fraction of points assigned to different clusters by the algorithm and the target clustering) is a piecewise constant function of $\alpha$, and the best clustering may be obtained for a value of $\alpha$ specific to a given problem domain. In an online problem, where clustering instances arrive in a sequential fashion, determining good values of $\alpha$ becomes an online optimization problem on piecewise Lipschitz functions. Furthermore the functions are $\beta$-dispersed for $\beta = 1/2$ [9].

We perform our evaluation on four benchmark datasets to cover a range of example-set sizes and numbers of clusters: MNIST, binary images of handwritten digits with 60,000 training examples for 10 classes [21]; Omniglot, binary images of handwritten characters across 30 alphabets with 19,280 examples [30]; Omniglot_small_1, a “minimal” Omniglot split with only 5 alphabets and 2720 examples.
We consider a sequence of clustering instances drawn from the four datasets and compare our algorithms Fixed Share EF (Algorithm 2) and Generalized Share EF (Algorithm 3) with the Exponential Forecaster algorithm of [7]. At each time we sample a subset of the dataset of size . For each , we take uniformly random points from half the classes (even class labels) at times and from the remaining classes (odd class labels) at . We determine the Hamming cost of $(\alpha, 2)$-Lloyds++-clustering for , which is used as the piecewise constant loss function (or payoff is the fraction of points assigned correctly) for the online optimization game. Notice the Lipschitz constant $L = 0$ since we have piecewise constant utility, and utility function values lie in $[0, 1]$. We set exploration parameter and decay parameter in our algorithms. We plot average 2-shifted regret until time (i.e. ) and take the average over 20 runs to get smooth curves (Figure 2).

Unlike in Figure 1, the optimal clustering parameters before the shift might be relatively competitive with the new optimal parameters. So the Exponential Forecaster performance is not terrible, although our algorithms still outperform it noticeably. We observe that our algorithms have significantly lower regrets (about 15-40% relative for the datasets considered, for ) compared to the Exponential Forecaster algorithm across all datasets. We also note that the exact advantage of adding exploration to exponential updates varies with datasets and problem instances. In Appendix G we have compiled further experiments that reaffirm the strengths of our approach against different changing environments and also compare against the static setting.

###### Remark.

The applications considered, for which the algorithms have been implemented and empirically evaluated, have piecewise constant utility functions with $d = 1$. For these it is possible to simply maintain the weight on each piece of $\mathcal{C}$ at each round $t$ by using a simple interval tree data structure [17].
The tree lazily maintains the weight of each piece, supports lazy insertion of new pieces, and allows drawing a piece with probability proportional to its weight. Similar updates are possible for Algorithm 3 as well in this case. Section 5 of the paper addresses the harder problem of polynomial time implementation for arbitrary $d$ (for Algorithm 2).

## 8 Discussion and open problems

We presented approaches which trade off exploitation with exploration for the online optimization problem to obtain low shifting regret for the case of general non-convex functions with sharp but dispersed discontinuities. Optimizing for the stronger baseline of shifting regret leads to empirically better payoff, as we have shown via experiments on applications to algorithm configuration. Our focus here is on the full-information setting, which corresponds to the entire utility function being revealed at each iteration, and we present almost tight theoretical results for it. Other relevant settings include bandit and semi-bandit feedback, where the function value is revealed for only the selected point or a subset of the space containing the point. It would be interesting to obtain low shifting regret in these settings [4]. It would also be interesting to see if the present work can be extended to other interesting notions of regret [29, 27, 14, 12].

## 9 Acknowledgements

We thank Ellen Vitercik for helpful feedback. This work was supported in part by NSF grants CCF-1535967, IIS-1618714, IIS-1901403, an Amazon Research Award, a Microsoft Research Faculty Fellowship, a Bloomberg Data Science research grant, and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. Views expressed in this work do not necessarily reflect those of any funding agency.

## Appendix

## Appendix A Discretization based algorithm

Recall that $\mathcal{C}$ is contained in a ball of radius $R$. A standard greedy construction gives an $r$-discretization of size at most $(3R/r)^d$ [7].
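Given the dispersion parameter $\beta$, a natural choice is to use a $T^{-\beta}$-discretization as in Algorithm 1.

###### Proof of Theorem 4 (discretization).

We show we can round the optimal points in $\mathcal{C}$ to points in the $T^{-\beta}$-discretization $\mathcal{D}$ with a payoff loss at most $(sH + L)\,O(T^{1-\beta})$ in expectation. But in $\mathcal{D}$ we know a way to bound regret by $R^{\text{finite}}(T, s, N)$, where $N$, the number of points in $\mathcal{D}$, is at most $(3RT^{\beta})^d$.

Let $t_0, \dots, t_s$ denote the expert switching times in the optimal offline payoff, and $\rho^*_i$ be the point picked by the optimal offline algorithm in $[t_{i-1}, t_i - 1]$. Consider a ball of radius $T^{-\beta}$ around $\rho^*_i$. It must have some point $\hat{\rho}^*_i \in \mathcal{D}$. We then must have that $\{u_t \mid t_{i-1} \le t < t_i\}$ has at most $O(T^{1-\beta})$ discontinuities in this ball due to $\beta$-dispersion, which implies

$$\sum_{t=t_{i-1}}^{t_i - 1} u_t(\hat{\rho}^*_i) \ge \sum_{t=t_{i-1}}^{t_i - 1} u_t(\rho^*_i) - O(T^{1-\beta})H - L(t_i - t_{i-1})T^{-\beta}$$

Let $\hat{\rho}_t = \hat{\rho}^*_i$ for each $t_{i-1} \le t < t_i$. Summing over $i$ gives

$$\sum_{t=1}^{T} u_t(\hat{\rho}_t) \ge OPT - O(T^{1-\beta})sH - LT^{1-\beta} = OPT - (sH + L)\,O(T^{1-\beta})$$

Now the payoff of this sequence is bounded above by the payoff of the optimal sequence of experts in $\mathcal{D}$ with $s$ shifts,

$$\sum_{t=1}^{T} u_t(\hat{\rho}_t) \le OPT^{\text{finite}}$$

Let the finite experts algorithm with shifted regret bounded by $R^{\text{finite}}(T, s, N)$ choose $\rho_t$ at round $t$. Then, using the above inequalities,

$$\sum_{t=1}^{T} u_t(\rho_t) \ge OPT^{\text{finite}} - R^{\text{finite}}(T, s, N) \ge OPT - (sH + L)\,O(T^{1-\beta}) - R^{\text{finite}}(T, s, N)$$

We use this to bound the regret for the continuous case:

$$\begin{aligned} R^{\mathcal{C}}(T, s) &= OPT - \sum_{t=1}^{T} u_t(\rho_t) \\ &\le OPT - \left(OPT - (sH + L)\,O(T^{1-\beta}) - R^{\text{finite}}(T, s, N)\right) \\ &= R^{\text{finite}}(T, s, N) + (sH + L)\,O(T^{1-\beta}) \end{aligned}$$

## Appendix B Counterexamples

We will construct problem instances where some sub-optimal algorithms mentioned in the paper suffer high regret.

We first show that the Exponential Forecaster algorithm of [7] suffers linear $s$-shifted regret even for $s = 2$. This happens because pure exponential updates may accumulate high weights on well-performing experts and may take a while to adjust weights when these experts suddenly start performing poorly.

###### Lemma 11.

There exists an instance where the Exponential Forecaster algorithm of [7] suffers linear $s$-shifted regret.

###### Proof.

Let $\mathcal{C} = [0, 1]$. Define utility functions

$$u^{(0)}(\rho) = \begin{cases} 1 & \text{if } \rho < \frac{1}{2} \\ 0 & \text{if } \rho \ge \frac{1}{2} \end{cases} \quad \text{and} \quad u^{(1)}(\rho) = \begin{cases} 0 & \text{if } \rho < \frac{1}{2} \\ 1 & \text{if } \rho \ge \frac{1}{2} \end{cases}$$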

Now consider the instance where $u^{(0)}$ is presented for the first $T/2$ rounds and $u^{(1)}$ is presented for the remaining $T/2$ rounds. In the second half, with probability at least $\frac{1}{2}$, the Exponential Forecaster algorithm will select a point from $[0, \frac{1}{2})$ and accumulate a regret of 1. Thus the expected 2-shifted regret of the algorithm is at least $T/4$. Notice that the construction does not depend on the step size parameter $\lambda$. ∎
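The behaviour on this instance is easy to simulate exactly, because the weights are constant on each of the two halves of the unit interval. The sketch below tracks the probability mass on each half; it assumes an even horizon and unit volume, and the function name `expected_regret` and the parameter values in the usage note are illustrative choices rather than those used in the paper's experiments.

```python
import math

def expected_regret(T, lam, alpha):
    """Expected 2-shifted regret on the two-piece instance: the left half
    pays 1 for the first T/2 rounds, the right half pays 1 afterwards.

    alpha = 0 gives the pure exponential update; alpha > 0 gives Fixed Share.
    """
    left, right = 0.5, 0.5  # probability mass on [0, 1/2) and [1/2, 1]
    payoff = 0.0
    for t in range(T):
        u_left, u_right = (1.0, 0.0) if t < T // 2 else (0.0, 1.0)
        payoff += left * u_left + right * u_right
        # Exponential update on each half.
        bl = left * math.exp(lam * u_left)
        br = right * math.exp(lam * u_right)
        total = bl + br
        # Uniform share: each half has volume 1/2, so it receives alpha * total / 2.
        left = ((1 - alpha) * bl + alpha * total / 2) / total
        right = ((1 - alpha) * br + alpha * total / 2) / total
    # The best comparator with one switch earns payoff 1 in every round.
    return T - payoff
```

For example, comparing `expected_regret(1000, 0.1, 0.0)` with `expected_regret(1000, 0.1, 1 / 999)` shows the pure exponential update paying for roughly the whole second half, while the shared update recovers after a short delay.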

We further look at the performance of Random Restarts EF (Algorithm 4), an easy-to-implement algorithm which looks deceptively similar to Algorithm 2, against this adversary. It turns out that Random Restarts EF may not restart frequently enough for the optimal value of the exploration parameter, and may have sufficiently long chains of pure exponential updates in expectation to suffer high regret.

###### Theorem 12.

There exists an instance where Random Restarts EF (Algorithm 4) with parameters $\lambda$ and $\alpha$ as in Theorem 4 suffers linear $s$-shifted regret.

###### Proof.

The probability of pure exponential updates throughout a stretch of $T/2$ rounds is at least

$$(1-\alpha)^{T/2} = \left(1 - \frac{1}{T-1}\right)^{T/2} > \frac{1}{2}$$

for $T > 5$. By Lemma 11, this implies at least $T/4$ regret in this case, and so the expected regret of the algorithm is at least $T/8$. ∎
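The inequality in this proof is elementary but easy to check numerically. The snippet below is a small sanity check, assuming the exploration parameter is set to $1/(T-1)$ as in the display above; the function name `no_restart_probability` is hypothetical.

```python
def no_restart_probability(T):
    """Probability that T/2 consecutive rounds contain no restart
    when each round restarts independently with probability 1 / (T - 1)."""
    return (1 - 1 / (T - 1)) ** (T / 2)
```

The quantity increases with $T$ towards $e^{-1/2} \approx 0.607$, so a constant fraction of runs see no restart over half of the horizon regardless of how large $T$ is.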

## Appendix C Analysis of algorithms

In this section we will provide detailed proofs of lemmas and theorems from Section 4. We will restate them for easy reference.

###### Proof of Lemma 4.

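$\mathbb{E}[\hat{w}_t(\rho)] = w_t(\rho)$ implies $\mathbb{E}[\hat{W}_t] = W_t$ by Fubini's theorem (recall $\mathcal{C}$ is closed and bounded). $\mathbb{E}[\hat{w}_t(\rho)] = w_t(\rho)$ follows by simple induction on $t$. In the base case, $z_1$ is the empty set and $\hat{w}_1(\rho) = w_1(\rho)$. For $t > 1$,

$$\begin{aligned} \mathbb{E}[\hat{w}_t(\rho)] &= (1-\alpha)\,\mathbb{E}\left[e^{\lambda u_t(\rho)} \hat{w}_{t-1}(\rho)\right] + \frac{\alpha}{\mathrm{Vol}(\mathcal{C})}\,\mathbb{E}\left[\int_{\mathcal{C}} e^{\lambda u_t(\rho)} \hat{w}_{t-1}(\rho)\, d\rho\right] && \text{(definition of } \hat{w}_t\text{)} \\ &= (1-\alpha)\,e^{\lambda u_t(\rho)}\,\mathbb{E}[\hat{w}_{t-1}(\rho)] + \frac{\alpha}{\mathrm{Vol}(\mathcal{C})} \int_{\mathcal{C}} e^{\lambda u_t(\rho)}\,\mathbb{E}[\hat{w}_{t-1}(\rho)]\, d\rho && \text{(expectation is over } z_t\text{)} \\ &= (1-\alpha)\,e^{\lambda u_t(\rho)}\,w_{t-1}(\rho) + \frac{\alpha}{\mathrm{Vol}(\mathcal{C})} \int_{\mathcal{C}} e^{\lambda u_t(\rho)}\,w_{t-1}(\rho)\, d\rho && \text{(inductive hypothesis)} \\ &= w_t(\rho) && \text{(definition of } w_t\text{)} \end{aligned}$$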

###### Proof of relation 2 in Lemma 4.

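Recall that we wish to show that $\hat{w}_{T+1}(\rho) \mid s, t_s$ (the weights of Algorithm 4 at time $T+1$ given that restarts occur exactly at $t_1, \dots, t_{s-1}$) can be expressed as the product of the weight at $\rho$ of the regular Exponential Forecaster since the last restart times the normalized total weights accumulated over previous runs, i.e.

$$\hat{w}_{T+1}(\rho) \mid s, t_s = \tilde{w}(\rho; t_{s-1}, t_s) \prod_{i=1}^{s-1} \frac{\tilde{W}(t_{i-1}, t_i)}{\mathrm{Vol}(\mathcal{C})}$$

We show this by induction on $s$. For $s = 1$, we have no restarts and

$$\tilde{w}(\rho; t_{s-1}, t_s) \prod_{i=1}^{s-1} \frac{\tilde{W}(t_{i-1}, t_i)}{\mathrm{Vol}(\mathcal{C})} = \tilde{w}(\rho; t_0, t_1) \prod_{i=1}^{0} \frac{\tilde{W}(t_{i-1}, t_i)}{\mathrm{Vol}(\mathcal{C})} = \tilde{w}(\rho; 1, T+1) = \hat{w}_{T+1}(\rho) \mid 1, t_1$$

For $s > 1$, the last restart occurs at $t_{s-1}$. By inductive hypothesis for time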