# Achieving All with No Parameters: Adaptive NormalHedge

## Abstract

We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information. The main component of this work is an improved version of the NormalHedge.DT algorithm [23], called AdaNormalHedge. On one hand, this new algorithm ensures small regret when the competitor has small loss and almost constant regret when the losses are stochastic. On the other hand, the algorithm is able to compete with any convex combination of the experts simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves an open problem proposed by [8] and [9]. Moreover, we extend the results to the sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve previous work from different aspects, and a special case of the first application resolves another open problem proposed by [32] on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses.

## 1 Introduction

The problem of predicting with expert advice was pioneered by [22] and others two decades ago. Roughly speaking, in this problem, a player needs to choose a distribution over a set of experts on each round, and then an adversary decides and reveals the loss for each expert. The player’s loss for this round is the expected loss of the experts with respect to the distribution that he chose, and his goal is to have a total loss that is not much worse than that of any single expert, or more generally, any fixed and unknown convex combination of experts.

Beyond this classic goal, various more difficult objectives for this problem have been studied in recent years, such as: learning with an unknown number of experts and competing with all but the top small fraction of experts [8]; competing with a sequence of different combinations of the experts [19]; learning with experts who provide confidence-rated advice [3]; and achieving much smaller regret when the problem is “easy” while still ensuring worst-case robustness [10]. Different algorithms were proposed separately to solve these problems to some extent. In this work, we essentially provide a single parameter-free algorithm that achieves all these goals with absolutely no prior information and significantly improved results in some cases.

Our algorithm is a variant of [8]’s NormalHedge algorithm, and more specifically is an improved version of NormalHedge.DT [23]. We call it Adaptive NormalHedge (or AdaNormalHedge for short). NormalHedge and NormalHedge.DT provide guarantees for the so-called $\epsilon$-quantile regret simultaneously for any $\epsilon$, which essentially corresponds to competing with a uniform distribution over the top $\epsilon$-fraction of experts. Our new algorithm improves NormalHedge.DT in two respects (Section 3):

1. AdaNormalHedge can compete with not just the competitor of the specific form mentioned above, but indeed any unknown fixed competitor simultaneously, with a regret in terms of the relative entropy between the competitor and the player’s prior belief of the experts.

2. AdaNormalHedge ensures a new regret bound in terms of the cumulative magnitude of the instantaneous regrets, which is always at most the bound for NormalHedge.DT (or NormalHedge). Moreover, the power of this new form of regret is almost the same as the second order bound introduced in a recent work by [14]. Specifically, it implies 1) a small regret when the loss of the competitor is small and 2) an almost constant regret when the losses are generated randomly with a gap in expectation.

Our results resolve the open problem posed in [8] and [9] on whether a better $\epsilon$-quantile regret in terms of the loss of the expert instead of the horizon can be achieved. In fact, our results are even better and more general.

AdaNormalHedge is a simple and truly parameter-free algorithm. Indeed, it does not even need to know the number of experts in some sense. To illustrate this idea, in Section 4 we extend the algorithm and results to a setting where experts provide confidence-rated advice [3]. We then focus on a special case of this setting called the sleeping expert problem [2], where the number of “awake” experts is dynamically changing and the total number of underlying experts is indeed unknown. AdaNormalHedge is thus a very suitable algorithm for this problem. To show the power of all the abovementioned properties of AdaNormalHedge, we study the following two examples of the sleeping expert problem and use AdaNormalHedge to significantly improve previous work.

The first example is adaptive regret, that is, regret on any time interval, introduced by [15]. This can be reduced to a sleeping expert problem by adding a new copy of each original expert on each round [13]. Thus, the total number of sleeping experts is not fixed. When some information on this interval is known (such as the length, the loss of the competitor on this interval, etc.), several algorithms achieve optimal regret [15]. However, when no prior information is available, all previous work gives suboptimal bounds. We apply AdaNormalHedge to this problem. The resulting algorithm, which we call AdaNormalHedge.TV, enjoys the optimal adaptive regret in not only the adversarial case but also the stochastic case due to the properties of AdaNormalHedge.

We then extend the results to the problem of tracking the best experts where the player needs to compete with the best partition of the whole process and the best experts on each of these partitions [17]. This resolves one of the open problems in [32] on whether a single algorithm can achieve optimal shifting regret for both adversarial and stochastic losses. Note that although recent work by [27] also solves this open problem in some sense, their method requires knowing the number of partitions and other information ahead of time and also gives a worse bound for stochastic losses, while AdaNormalHedge.TV is completely parameter-free and gives optimal bounds.

We finally consider the most general case where the competitor varies over time with no constraints, which subsumes the previous two examples (adaptive regret and shifting regret). This problem was introduced in [19] and later generalized by [7]. Their algorithm (fixed share) also requires knowing some information on the sequence of competitors to optimally tune parameters. We avoid this issue by showing that while this problem seems more general and difficult, it is in fact equivalent to its special case: achieving adaptive regret. This equivalence theorem is independent of the concrete algorithms and may be of independent interest. Applying this result, we show that without any parameter tuning, AdaNormalHedge.TV automatically achieves a bound comparable to the one achieved by the optimally tuned fixed share algorithm when competing with time-varying competitors.

Concrete results and detailed comparisons on this first example can be found in Section 5. To sum up, AdaNormalHedge.TV is an algorithm that is simultaneously adaptive in the number of experts, the competitors and the way the losses are generated.

The second example we provide is predicting almost as well as the best pruning tree [16], which was also shown to be reducible to a sleeping expert problem [13]. Previous work either only considered the log loss setting, or assumed prior information on the best pruning tree is known. Using AdaNormalHedge, we again provide better or comparable bounds without knowing any prior information. In fact, due to the adaptivity of AdaNormalHedge in the number of experts, our regret bound depends on the total number of distinct traversed edges so far, instead of the total number of edges of the decision tree as in [13] which could be exponentially larger. Concrete comparisons can be found in Section 6.

Related work. While competing with any unknown competitor simultaneously is relatively easy in the log loss setting [22], it is much harder in the bounded loss setting studied here. The well-known exponential weights algorithm gives the optimal results only when the learning rate is optimally tuned in terms of the competitor [12]. [9] also studied $\epsilon$-quantile regret, but no concrete algorithm was provided. Several works consider competing with unknown competitors in a different unconstrained linear optimization setting [29]. [20] studied general adaptive online learning algorithms against time-varying competitors, but with a different and incomparable measure of the hardness of the problem. As far as we know, none of the existing algorithms enjoys all the nice properties discussed in this work at the same time as our algorithms do.

## 2 The Expert Problem and NormalHedge.DT

In the expert problem, on each round $t = 1, \ldots, T$: the player first chooses a distribution $\mathbf{p}_t$ over $N$ experts, then the adversary decides each expert’s loss $\ell_{t,i} \in [0, 1]$, and reveals these losses to the player. At the end of this round, the player suffers the weighted average loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$ with $\boldsymbol{\ell}_t = (\ell_{t,1}, \ldots, \ell_{t,N})$. We denote the instantaneous regret to expert $i$ on round $t$ by $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$, the cumulative regret by $R_{t,i} = \sum_{\tau=1}^{t} r_{\tau,i}$, and the cumulative loss by $L_{t,i} = \sum_{\tau=1}^{t} \ell_{\tau,i}$. Throughout the paper, a bold letter denotes a vector with corresponding coordinates. For example, $\mathbf{r}_t$, $\mathbf{R}_t$ and $\mathbf{L}_t$ represent $(r_{t,1}, \ldots, r_{t,N})$, $(R_{t,1}, \ldots, R_{t,N})$ and $(L_{t,1}, \ldots, L_{t,N})$ respectively.

Usually, the goal of the player is to minimize the regret to the best expert, that is, $\max_i R_{T,i}$. Here we consider a more general case where the player wants to minimize the regret to an arbitrary convex combination of experts: $R_T(\mathbf{u}) = \sum_{t=1}^{T} \mathbf{u} \cdot \mathbf{r}_t$, where the competitor $\mathbf{u}$ is a fixed unknown distribution over the experts. In other words, this regret measures the difference between the player’s loss and the loss that he would have suffered if he used a constant strategy $\mathbf{u}$ all the time. Clearly, $R_T(\mathbf{u})$ can be written as $\sum_{i=1}^{N} u_i R_{T,i}$ and can then be upper bounded appropriately by a bound on each $R_{T,i}$ (for example, $\max_i R_{T,i}$). However, our goal is to get a better and more refined bound on $R_T(\mathbf{u})$ that depends on $\mathbf{u}$. More importantly, we aim to achieve this without knowing the competitor $\mathbf{u}$ ahead of time. When it is clear from the context, we drop the subscript $T$ in $R_T(\mathbf{u})$.

In fact, in Section 5, we will consider an even more general notion of regret introduced in [19], where we allow the competitor to vary over time and to have different scales. Specifically, let $\mathbf{u}_1, \ldots, \mathbf{u}_T$ be $T$ different vectors with $N$ nonnegative coordinates (denoted by $\mathbf{u}_{1:T}$). Then the regret of the player to this sequence of competitors is $R(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{r}_t$. If all these competitors are distributions (which they are not required to be), then this regret captures a very natural and general concept of comparing the player’s strategy to any other strategy. Again, we are interested in developing low-regret algorithms that do not need to know any information of this sequence of competitors beforehand.
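As a concrete illustration of these regret definitions, the following minimal Python sketch computes the regret to a fixed competitor and to a time-varying sequence of competitors. It assumes the standard conventions that the player's loss on a round is the inner product of its distribution with the loss vector and that the instantaneous regret to an expert is the player's loss minus that expert's loss. The function names are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def instantaneous_regrets(p_t: Sequence[float], loss_t: Sequence[float]) -> List[float]:
    """r_{t,i} = (p_t . loss_t) - loss_{t,i} for every expert i."""
    player_loss = sum(p * l for p, l in zip(p_t, loss_t))
    return [player_loss - l for l in loss_t]


def regret_fixed(ps, losses, u) -> float:
    """Regret to a fixed competitor u: sum over rounds of u . r_t."""
    return sum(
        sum(u_i * r_i for u_i, r_i in zip(u, instantaneous_regrets(p_t, l_t)))
        for p_t, l_t in zip(ps, losses)
    )


def regret_time_varying(ps, losses, us) -> float:
    """Regret to a sequence of competitors: sum over rounds of u_t . r_t."""
    return sum(
        sum(u_i * r_i for u_i, r_i in zip(u_t, instantaneous_regrets(p_t, l_t)))
        for p_t, l_t, u_t in zip(ps, losses, us)
    )


# Toy run: two experts and a player that always plays the uniform distribution.
ps = [[0.5, 0.5]] * 3
losses = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(regret_fixed(ps, losses, [0.25, 0.75]))  # 0.25
```

In the toy run the per-expert cumulative regrets are $-0.5$ and $0.5$, so the regret to the competitor $(0.25, 0.75)$ is the corresponding weighted average.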

We briefly describe a recent algorithm for the expert problem, NormalHedge.DT [23] (a variant of NormalHedge [8]), before we introduce our new improved variants. On round $t$, NormalHedge.DT sets $p_{t,i} \propto \exp\left(\frac{[R_{t-1,i}+1]_+^2}{3t}\right) - \exp\left(\frac{[R_{t-1,i}-1]_+^2}{3t}\right)$, where $[x]_+ = \max\{0, x\}$. Let $\epsilon \in (0, 1]$ and competitor $\mathbf{u}^*_\epsilon$ be a distribution that puts all the mass on the $\lceil N\epsilon \rceil$-th best expert, that is, the one that ranks $\lceil N\epsilon \rceil$ among all experts according to their total loss $L_{T,i}$ from the smallest to the largest. Then the regret guarantee for NormalHedge.DT states $R(\mathbf{u}^*_\epsilon) \le O\left(\sqrt{T \ln\left(\frac{\ln T}{\epsilon}\right)}\right)$ simultaneously for all $\epsilon$, which means the algorithm suffers at most this amount of regret for all but an $\epsilon$ fraction of the experts. Note that this bound does not depend on $N$ at all. This is the first concrete algorithm with this kind of adaptive property (the original NormalHedge [8] still has a weak dependence on $N$). In fact, as we will show later, one can even extend the results to any competitor $\mathbf{u}$. Moreover, we will improve NormalHedge.DT so that it has a much smaller regret when the problem is “easy” in some sense.

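Notation. We use $[N]$ to denote the set $\{1, \ldots, N\}$, $\Delta_N$ to denote the simplex of all distributions over $[N]$, and $\mathrm{RE}(\cdot \,\|\, \cdot)$ to denote the relative entropy between two distributions. Also define $\tilde{L}_{T,i} = \sum_{t=1}^{T} [\ell_{t,i} - \hat{\ell}_t]_+$. Many bounds in this work will be in terms of $\tilde{L}(\mathbf{u}) = \mathbf{u} \cdot \tilde{\mathbf{L}}_T$, which is always at most $L(\mathbf{u}) = \mathbf{u} \cdot \mathbf{L}_T$ since trivially $[\ell_{t,i} - \hat{\ell}_t]_+ \le \ell_{t,i}$. We consider “log log” terms to be nearly constant, and use $\hat{O}(\cdot)$ notation to hide these terms. Indeed, as pointed out by [9], $\ln \ln T$ is smaller than $4$ even when $T$ is as large as the age of the universe expressed in microseconds ($\approx 4.3 \times 10^{23}$).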
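As a rough numerical check of how slowly a "log log" term grows, the sketch below evaluates $\ln \ln T$ for $T$ equal to the age of the universe in microseconds. The age of about 13.8 billion years and the Julian year length are assumptions made for this illustration only.

```python
import math

# Assumed figures, used only for this back-of-the-envelope check.
AGE_OF_UNIVERSE_YEARS = 13.8e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600
MICROSECONDS_PER_SECOND = 1e6

T = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR * MICROSECONDS_PER_SECOND
loglog_T = math.log(math.log(T))

print(f"T is about {T:.2e} microseconds")
print(f"ln ln T is about {loglog_T:.3f}")
```

Under these assumptions the horizon is on the order of $10^{23}$ rounds while $\ln \ln T$ stays just below four, which is why such terms are treated as nearly constant.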

We start by writing NormalHedge.DT in a general form. We define potential function $\Phi(R, C) = \exp\left(\frac{[R]_+^2}{3C}\right)$ with $\Phi(0, 0)$ defined to be $1$, and also a weight function with respect to this potential:

$$w(R, C) = \frac{1}{2}\left(\Phi(R + 1, C + 1) - \Phi(R - 1, C + 1)\right).$$

Then the prediction of NormalHedge.DT is simply to set $p_{t,i}$ to be proportional to $w(R_{t-1,i}, C_{t-1})$ where $C_t = t$ for all $t$. Note that $C_t$ is closely related to the regret. In fact, the regret is roughly of order $\sqrt{C_T}$ (ignoring the log term). Therefore, in order to get an expert-wise and more refined bound, we replace $C_t$ by $C_{t,i}$ for each expert so that it captures some useful information for each expert $i$. There are several possible choices for $C_{t,i}$ (discussed at the end of Appendix A), but for now we focus on the one used in our new algorithm: $C_{t,i} = \sum_{\tau=1}^{t} |r_{\tau,i}|$, that is, the cumulative magnitude of the instantaneous regrets up to time $t$. We call this algorithm AdaNormalHedge and summarize it in Algorithm ?. Note that we even allow the player to have a prior distribution $\mathbf{q}$ over the experts, which will be useful in some applications as we will see in Section 5. The theoretical guarantee of AdaNormalHedge is stated below.
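The update just described can be sketched in a few lines of Python. The sketch assumes the potential $\Phi(R, C) = \exp([R]_+^2 / (3C))$ with $\Phi(0, 0) = 1$, the weight $w(R, C) = \frac{1}{2}(\Phi(R+1, C+1) - \Phi(R-1, C+1))$, and predictions proportional to $q_i\, w(R_{t-1,i}, C_{t-1,i})$. The class name, the `run` helper, and the fallback to the prior when every weight is zero are illustrative choices rather than part of the algorithm's specification.

```python
import math
from typing import List, Sequence


def potential(R: float, C: float) -> float:
    """Phi(R, C) = exp([R]_+^2 / (3C)); equals 1 whenever R <= 0, including Phi(0, 0)."""
    r_plus = max(R, 0.0)
    if r_plus == 0.0:
        return 1.0
    return math.exp(r_plus * r_plus / (3.0 * C))


def weight(R: float, C: float) -> float:
    """w(R, C) = (Phi(R + 1, C + 1) - Phi(R - 1, C + 1)) / 2; note C + 1 >= 1."""
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))


class AdaNormalHedge:
    def __init__(self, prior: Sequence[float]) -> None:
        self.q = list(prior)
        self.R = [0.0] * len(self.q)  # cumulative regret per expert
        self.C = [0.0] * len(self.q)  # cumulative |instantaneous regret| per expert

    def predict(self) -> List[float]:
        raw = [q * weight(R, C) for q, R, C in zip(self.q, self.R, self.C)]
        total = sum(raw)
        if total <= 0.0:
            # Assumed safeguard for this sketch: fall back to the prior.
            raw, total = list(self.q), sum(self.q)
        return [x / total for x in raw]

    def update(self, p: Sequence[float], losses: Sequence[float]) -> float:
        player_loss = sum(pi * li for pi, li in zip(p, losses))
        for i, li in enumerate(losses):
            r = player_loss - li
            self.R[i] += r
            self.C[i] += abs(r)
        return player_loss


def run(learner: AdaNormalHedge, loss_rounds) -> float:
    """Play the learner on a list of loss vectors and return its total loss."""
    return sum(learner.update(learner.predict(), losses) for losses in loss_rounds)


learner = AdaNormalHedge([1 / 3, 1 / 3, 1 / 3])
total_loss = run(learner, [[0.0, 1.0, 1.0]] * 50)
print(total_loss, learner.predict())
```

No learning rate or horizon appears anywhere in the sketch: the only inputs are the prior and the observed losses. For very long runs the exponentials may need to be computed in log space, which is omitted here for brevity.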

Before we prove this theorem (see sketch at the end of this section and complete proof in Appendix A), we discuss some implications of the regret bounds and why they are interesting. First of all, the relative entropy term $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ captures how close the player’s prior is to the competitor. A bound in terms of $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ can be obtained, for example, using the classic exponential weights algorithm but requires carefully tuning the learning rate as a function of $\mathbf{u}$. Without knowing $\mathbf{u}$, as far as we know, AdaNormalHedge is the only algorithm that can achieve this.1

On the other hand, if is a uniform distribution, then using bound and the fact , we get an $\epsilon$-quantile regret bound similar to the one of NormalHedge.DT: where is uniform over the top experts in terms of their total loss .

However, the power of a bound in terms of is far more than this. [14] introduced a new second order bound that implies much smaller regret when the problem is easy. It turns out that our seemingly weaker first order bound is also enough to get the exact same results! We state these implications in the following theorem which is essentially a restatement of Theorems 9 and 11 of [14] with weaker conditions.

The proof of Theorem ? is based on the same idea as in [14], and is included in Appendix B for completeness. For AdaNormalHedge, the term is in general (or smaller for special as stated in Theorem ?). Applying Theorem ? we have O(RE(u||)) Specifically, if is uniform and assuming without loss of generality that , then by a similar argument, we have for AdaNormalHedge, for any . This answers the open question (in the affirmative) asked by [8] and [9] on whether an improvement for small loss can be obtained for -quantile regret without knowing .

On the other hand, when we are in a stochastic setting as stated in Theorem ?, AdaNormalHedge ensures in expectation (or with high probability with an extra confidence term), which does not grow with . Therefore, the new regret bound in terms of actually leads to significant improvements compared to NormalHedge.DT.

Comparison to Adapt-ML-Prod [14]. Adapt-ML-Prod enjoys a second order bound in terms of , which is always at most the term that appears in our bounds.2 However, on one hand, as discussed above, these two bounds have the same improvements when the problem is easy in several senses; on the other hand, Adapt-ML-Prod does not provide a bound in terms of for an unknown . In fact, as discussed at the end of Section A.3 of [14], Adapt-ML-Prod cannot improve by exploiting a good prior (or at least its current analysis cannot). Specifically, while the regret for AdaNormalHedge does not have an explicit dependence on and is much smaller when the prior is close to the competitor , the regret for Adapt-ML-Prod always has a multiplicative term for , which means even a good prior results in the same regret as a uniform prior! More advantages of AdaNormalHedge over Adapt-ML-Prod will be discussed in concrete examples in the following sections.

Proof sketch of Theorem ?. The analysis of NormalHedge.DT is based on the idea of converting the expert problem into a drifting game [28]. Here, we extract and simplify the key idea of their proof and also improve it to form our analysis. The main idea is to show that the weighted sum of potentials does not increase much on each round using an improved version of Lemma 2 of [23]. In fact, we show that the final potential is exactly bounded by (defined in Theorem ?). From this, assuming without loss of generality that , we have for all , which, by solving for , gives . Multiplying both sides by , summing over and applying the Cauchy-Schwarz inequality, we arrive at where we define . It remains to show that and are close by standard analysis and Stirling’s formula.

## 4 Confidence-rated Advice and Sleeping Experts

In this section, we generalize AdaNormalHedge to deal with experts that give confidence-rated advice, a setting that subsumes many interesting applications as studied by [2] and [13]. In this general setting, on each round $t$, each expert first reports its confidence $I_{t,i} \in [0, 1]$ for the current task. The player then predicts $\mathbf{p}_t$ as usual with an extra yet natural restriction that if $I_{t,i} = 0$ then $p_{t,i} = 0$. That is, the player has to ignore those experts who abstain from giving advice (by reporting zero confidence). After that, the losses $\ell_{t,i}$ for those experts who did not abstain (i.e. $I_{t,i} \neq 0$) are revealed and the player still suffers loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$. We redefine the instantaneous regret to be $r_{t,i} = I_{t,i}(\hat{\ell}_t - \ell_{t,i})$, that is, the difference between the loss of the player and expert $i$ weighted by the confidence. The goal of the player is, as before, to minimize cumulative regret to any competitor $\mathbf{u}$: $R(\mathbf{u}) = \sum_{t=1}^{T} \mathbf{u} \cdot \mathbf{r}_t$. Clearly, the classic expert problem that we have studied in previous sections is just a special case of this general setting with $I_{t,i} = 1$ for all $t$ and $i$.

Moreover, with this general form of $r_{t,i}$, AdaNormalHedge can be used to deal with this general setting with only one simple change of scaling the weights by the confidence:

$$p_{t,i} \propto q_i\, I_{t,i}\, w(R_{t-1,i}, C_{t-1,i}),$$

where $R_{t,i}$ and $C_{t,i}$ are still defined to be $\sum_{\tau=1}^{t} r_{\tau,i}$ and $\sum_{\tau=1}^{t} |r_{\tau,i}|$ respectively. The constraint $I_{t,i} = 0 \Rightarrow p_{t,i} = 0$ is clearly satisfied. In fact, Algorithm ? can be seen as a special case of this general form of AdaNormalHedge with $I_{t,i} \equiv 1$. Furthermore, the regret bounds in Theorem ? still hold without any changes, which are summarized below (proof deferred to Appendix A).

Previously, [14] studied a general reduction from an expert algorithm to a confidence-rated expert algorithm. Applying those results here gives the exact same algorithm and regret guarantee mentioned above. However, we point out that the general reduction is not always applicable. Specifically, it is invalid if there is an unknown number of experts in the confidence-rated setting (explained more in the next paragraph) while the expert algorithm in the standard setting requires knowing the number of experts as a parameter. This is indeed the case for most algorithms (including Adapt-ML-Prod and even the original NormalHedge by [8]). AdaNormalHedge naturally avoids this problem since it does not depend on $N$ at all.

Sleeping Experts. We are especially interested in the case when the confidence $I_{t,i} \in \{0, 1\}$, also called the specialist/sleeping expert problem, where $I_{t,i} = 0$ means that expert $i$ is “asleep” for round $t$ and not giving any advice. This is a natural setting where the total number of experts is unknown ahead of time. Indeed, the number of awake experts can be dynamically changing over time. An expert that has never appeared before should be thought of as being asleep for all previous rounds.

AdaNormalHedge is a very suitable algorithm to deal with this case due to its independence of the total number of experts. If an expert $i$ appears for the first time on round $t$, then by definition it will naturally start with $R_{t-1,i} = 0$ and $C_{t-1,i} = 0$. Although we state the prior $\mathbf{q}$ as a distribution, which seems to require knowing the total number of experts, it is not an issue algorithmically since $\mathbf{q}$ is only used to scale the unnormalized weights (Eq. ?). For example, if we want $\mathbf{q}$ to be a uniform distribution over $N$ experts where $N$ is unknown beforehand, then to run AdaNormalHedge we can simply treat $q_i$ in Eq. ? to be $1$ for all $i$, which clearly will not change the behavior of the algorithm anyway. In this case, if we let $N_t$ denote the total number of distinct experts that have been seen up to time $t$ and the competitor $\mathbf{u}$ concentrates on any of these experts, then the relative entropy term in the regret (up to time $t$) will be $\ln N_t$ (instead of $\ln N$), which is changing over time.
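The following Python sketch illustrates how the confidence-rated variant can run without knowing the number of experts. It assumes the same potential and weight functions as the standard algorithm, weights scaled by the reported confidence, instantaneous regrets scaled by the confidence, and an unnormalized prior of 1 for every expert. The class and method names, the dictionary-based bookkeeping, and the fallback used when all awake experts have zero weight are illustrative assumptions.

```python
import math
from typing import Dict, Hashable


def potential(R: float, C: float) -> float:
    r_plus = max(R, 0.0)
    return 1.0 if r_plus == 0.0 else math.exp(r_plus * r_plus / (3.0 * C))


def weight(R: float, C: float) -> float:
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))


class SleepingAdaNormalHedge:
    """Experts are created lazily the first time they report positive confidence."""

    def __init__(self) -> None:
        self.R: Dict[Hashable, float] = {}
        self.C: Dict[Hashable, float] = {}

    def num_seen(self) -> int:
        return len(self.R)

    def predict(self, confidence: Dict[Hashable, float]) -> Dict[Hashable, float]:
        if sum(confidence.values()) <= 0.0:
            raise ValueError("at least one expert must be awake")
        raw = {}
        for e, c in confidence.items():
            if c > 0.0:
                # A newly seen expert starts with zero regret statistics.
                self.R.setdefault(e, 0.0)
                self.C.setdefault(e, 0.0)
            raw[e] = c * weight(self.R.get(e, 0.0), self.C.get(e, 0.0))
        total = sum(raw.values())
        if total <= 0.0:
            # Assumed safeguard: spread mass according to confidence alone.
            raw, total = dict(confidence), sum(confidence.values())
        return {e: v / total for e, v in raw.items()}

    def update(self, p, confidence, losses) -> float:
        player_loss = sum(p[e] * losses[e] for e in p if p[e] > 0.0)
        for e, c in confidence.items():
            if c > 0.0:
                r = c * (player_loss - losses[e])  # confidence-weighted regret
                self.R[e] += r
                self.C[e] += abs(r)
        return player_loss


learner = SleepingAdaNormalHedge()
awake = {"a": 1.0, "b": 1.0}
p = learner.predict(awake)
learner.update(p, awake, {"a": 0.0, "b": 1.0})
print(learner.predict({"a": 1.0, "b": 1.0, "c": 1.0}))  # "c" joins with a fresh state
```

Because the prior only rescales unnormalized weights, replacing a uniform prior by the constant 1 leaves the normalized prediction unchanged, which is what lets new experts join at any round.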

Using the adaptivity of AdaNormalHedge in both the number of experts and the competitor, we provide improved results for two instances of the sleeping expert problem in the next two sections.

## 5 Time-Varying Competitors

In this section, we study a more challenging goal of competing with time-varying competitors in the standard expert setting (that is, each expert is always awake and again ), which turns out to be reducible to a sleeping expert problem. Results for this section are summarized in Table 1.

### 5.1 Special Cases: Adaptive Regret and Tracking the Best Expert

We start from a special case: adaptive regret, introduced by [15] to better capture changing environments. Formally, consider any time interval $[t_1, t_2]$, and let $R_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} r_{t,i}$ be the regret to expert $i$ on this interval (similarly define $L_{[t_1,t_2],i}$ and $C_{[t_1,t_2],i}$). The goal of the player is to obtain relatively small regret on any interval. [13] essentially introduced a way to reduce this problem to a sleeping expert problem, which was later improved by [1]. Specifically, for every pair of time $t$ and expert $i$, we create a sleeping expert, denoted by $(t, i)$, who is only awake after (and including) round $t$ and since then suffers the same loss as the original expert $i$. So we have $Nt$ sleeping experts in total on round $t$. The prediction $p_{t,i}$ is set to be the sum of all the weights of sleeping experts $(\tau, i)$ for $\tau = 1, \ldots, t$. It is clear that doing this ensures that the cumulative regret up to time $t_2$ with respect to sleeping expert $(t_1, i)$ is exactly $R_{[t_1,t_2],i}$ in the original problem.

This is a sleeping expert problem for which AdaNormalHedge is very suitable, since the number of sleeping experts keeps increasing and the total number of experts is in fact unknown if the horizon is unknown. Theorem ? implies that the resulting algorithm gives the following adaptive regret:

where is a prior over the experts and the last step is by setting the prior to be for all and .3 This prior is better than a simple uniform distribution which leads to a term instead of . We call this algorithm AdaNormalHedge.TV.4 To be concrete, on round AdaNormalHedge.TV predicts

Again, Theorem ? can be applied to get a more interpretable bound where , and a much smaller bound if the losses are stochastic on interval in the sense stated in Theorem ?.

One drawback of AdaNormalHedge.TV is that its time complexity per round is and the overall space is . However, the data streaming technique used in [15] can be directly applied here to reduce the time and space complexity to and respectively, with only an extra multiplicative factor in the regret.
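A direct, unoptimized rendering of this reduction is sketched below in Python. It keeps one copy of every original expert for every starting round, so it exhibits the per-round cost that grows with the round index discussed above and does not implement the streaming speed-up. The prior proportional to $1/s^2$ for copies started on round $s$ is an assumption of this sketch, as are the class and method names.

```python
import math
from typing import List, Sequence


def potential(R: float, C: float) -> float:
    r_plus = max(R, 0.0)
    return 1.0 if r_plus == 0.0 else math.exp(r_plus * r_plus / (3.0 * C))


def weight(R: float, C: float) -> float:
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))


class AdaNormalHedgeTV:
    def __init__(self, n_experts: int) -> None:
        self.n = n_experts
        # self.R[s - 1][i] is the regret of sleeping expert (s, i) since round s;
        # self.C[s - 1][i] is the matching sum of absolute instantaneous regrets.
        self.R: List[List[float]] = [[0.0] * n_experts]
        self.C: List[List[float]] = [[0.0] * n_experts]

    def predict(self) -> List[float]:
        raw = [0.0] * self.n
        for s, (R_s, C_s) in enumerate(zip(self.R, self.C), start=1):
            for i in range(self.n):
                # Assumed prior: copy (s, i) is scaled by 1 / s^2.
                raw[i] += weight(R_s[i], C_s[i]) / (s * s)
        total = sum(raw)  # positive, since the newest copies have weight w(0, 0) > 0
        return [x / total for x in raw]

    def update(self, p: Sequence[float], losses: Sequence[float]) -> float:
        player_loss = sum(pi * li for pi, li in zip(p, losses))
        for R_s, C_s in zip(self.R, self.C):
            for i, li in enumerate(losses):
                r = player_loss - li
                R_s[i] += r
                C_s[i] += abs(r)
        # Copies for the next round wake up with no history.
        self.R.append([0.0] * self.n)
        self.C.append([0.0] * self.n)
        return player_loss


tv = AdaNormalHedgeTV(2)
for _ in range(30):
    tv.update(tv.predict(), [0.0, 1.0])  # expert 0 is best at first
for _ in range(50):
    tv.update(tv.predict(), [1.0, 0.0])  # then expert 1 becomes best
print(tv.predict())
```

In this toy run the copies of the second expert that wake up after the change carry no penalty from the first phase, which is what allows the prediction to move to the new best expert.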

Tracking the best expert. In fact, AdaNormalHedge.TV is a solution for one of the open problems proposed by [32]. Adaptive regret immediately implies the so-called -shifting regret for the problem of tracking the best expert in a changing environment. Formally, define the -shifting regret to be where the max is taken over all and . In other words, the player is competing with the best -partition of the whole game and the best expert on each of these partitions. Let be the total loss of such best partition (that is, the max is taken over the same space), and similarly define . Since essentially is just the sum of adaptive regrets, using the bounds discussed above and the Cauchy-Schwarz inequality, we conclude that AdaNormalHedge.TV ensures Also, if the loss vectors are generated randomly on these intervals, each satisfying the condition stated in Theorem ?, then the regret is in expectation (high probability bound is similar). These bounds are optimal up to logarithmic factors [15]. This is exactly what was asked in [32]: whether there is an algorithm that can do optimally for both adversarial and stochastic losses in the problem of tracking the best expert. AdaNormalHedge.TV achieves this goal without knowing or any other information, while the solution provided by [27] needs to know , and to get the same adversarial bound and a worse stochastic bound of order .

Comparison to previous work. For adaptive regret, the FLH algorithm by [15] treats any standard expert algorithm as a sleeping expert, and has an additive term in addition to the base algorithm’s regret (when no prior information is available), which adds up to a large term for -shifting regret. Due to this extra additive regret, FLH also does not enjoy first order bounds nor small regret in the stochastic setting, even if the base algorithm that it builds on provides these guarantees. On the other hand, FLH was proposed to achieve adaptive regret for any general online convex optimization problem. We point out that using AdaNormalHedge as the master algorithm in their framework will give similar improvements as discussed here.

Adapt-ML-Prod is not directly applicable here for the corresponding sleeping expert problem since the total number of experts is unknown.

Another well-studied algorithm for this problem is “fixed share”. Several earlier works studied fixed share in the simpler “log loss” setting [18]. [7] studied a generalized fixed share algorithm for the bounded loss setting considered here. When and are known, their algorithm ensures for adaptive regret, and when , and are known, they have . No better result is provided for the stochastic setting. More importantly, when no prior information is known, which is the case in practice, the best results one can extract from their analysis are and , which are much worse than our results.

### 5.2 General Time-Varying Competitors

We finally discuss the most general goal: compete with different on different rounds. Recall where for all (note that does not even have to be a distribution). Clearly, adaptive regret and -shifting regret are special cases of this general notion. Intuitively, how large this regret is should be closely related to how much the competitor’s sequence varies. [7] introduced a distance measurement to capture this variation: where we define for all . Also let and . Fixed share is shown to ensure the following regret [7]: when and are known. No result was provided otherwise.5 Here, we show that our parameter-free algorithm AdaNormalHedge.TV actually achieves almost the same bound without knowing any information beforehand. Moreover, while the results in [7] are specific for the fixed share algorithm, we prove the following results which are independent of the concrete algorithms and may be of independent interest.

The key idea of the proof is to rewrite as a weighted sum of several adaptive regrets in an optimal way (see Appendix C for the complete proof). This theorem tells us that while playing with time-varying competitors seems to be a harder problem, it is in fact not any harder than its special case: achieving adaptive regret on any interval. Although the result is independent of the algorithms, one still cannot derive bounds on for FLH or fixed share based on their adaptive regret bounds, because when no prior information is available, the bounds on for these algorithms are of order instead of , which is not good enough. We refer the reader to Table 1 for a summary of this section.

## 6 Competing with the Best Pruning Tree

We now turn to our second application on predicting almost as well as the best pruning tree within a template tree. This problem was studied in the context of online learning by [16] using the approach of [33]. It is also called the tree expert problem in [5]. [13] proposed a generic reduction from a tree expert problem to a sleeping expert problem. Using this reduction with our new algorithm, we provide much better results compared to previous work (summarized in Table 2).

Specifically, consider a setting where on each round , the predictor has to make a decision from some convex set given some side information , and then a convex loss function is revealed and the player suffers loss . The predictor is given a template tree to consult. Starting from the root, each node of performs a test on to decide which of its children should perform the next test, until a leaf is reached. In addition to a test, each node (except the root) also makes a prediction based on . A pruning tree is a tree induced by replacing zero or more nodes (and associated subtrees) of by leaves. The prediction of a pruning tree given , denoted by , is naturally defined as the prediction of the leaf that reaches by traversing . The player’s goal is thus to predict almost as well as the best pruning tree in hindsight, that is, to minimize .

The idea of the reduction introduced by [13] is to view each edge of as a sleeping expert (indexed by ), who is awake only when traversed by , and in that case predicts , the same prediction as the child node that it connects to. The predictor runs a sleeping expert algorithm with loss , and eventually predicts where denotes the set of edges of and is the output of the expert algorithm; thus by convexity of , we have . Note that we only care about the predictions of those awake experts since otherwise is required to be zero. Now let be one of the best pruning trees, that is, , and be the number of leaves of . In the expert problem, we will set the competitor to be a uniform distribution over the terminal edges (that is, the ones connecting the leaves) of , and the prior to be a uniform distribution over all the edges. Since on each round, one and only one of those experts is awake, and its prediction is exactly , we have , and therefore .
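The edge-as-sleeping-expert view can be made concrete with a small Python sketch. It builds a toy template tree, lists the edges traversed by a given side information value (the awake experts) together with the predictions of their child nodes, and combines those predictions with a distribution over the awake edges. The `Node` type, the path-based edge identifiers, and the demo tree are hypothetical, and the distribution passed to `combine` stands in for the output of whichever sleeping expert algorithm is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[int, ...]  # path of child indices from the root to the edge's child node


@dataclass
class Node:
    prediction: Optional[float] = None               # the root makes no prediction
    test: Optional[Callable[[float], int]] = None    # returns the index of the next child
    children: List["Node"] = field(default_factory=list)


def awake_edges(root: Node, x: float) -> List[Tuple[Edge, float]]:
    """Edges traversed by x, each paired with the prediction of its child node."""
    out: List[Tuple[Edge, float]] = []
    node, path = root, ()
    while node.children:
        k = node.test(x)
        path = path + (k,)
        node = node.children[k]
        out.append((path, node.prediction))
    return out


def combine(p: Dict[Edge, float], awake: List[Tuple[Edge, float]]) -> float:
    """Final prediction: the p-weighted average of the awake experts' predictions."""
    return sum(p[e] * y for e, y in awake)


def build_demo_tree() -> Node:
    right = Node(prediction=1.0, test=lambda x: 0 if x < 5 else 1,
                 children=[Node(prediction=0.5), Node(prediction=2.0)])
    return Node(test=lambda x: 0 if x < 0 else 1,
                children=[Node(prediction=-1.0), right])


root = build_demo_tree()
seen = set()  # distinct edges traversed so far
for x in [3.0, -2.0, 3.5]:
    awake = awake_edges(root, x)
    seen.update(e for e, _ in awake)
    uniform = {e: 1.0 / len(awake) for e, _ in awake}  # placeholder weights
    print(x, combine(uniform, awake), len(seen))
```

Only edges on the traversed path are ever touched, so the bookkeeping grows with the number of distinct edges seen rather than with the size of the template tree.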

It remains to pick a concrete sleeping algorithm to apply. There are two reasons that make AdaNormalHedge very suitable for this problem. First, since is clearly unknown ahead of time, we are competing with an unknown competitor , which is exactly what AdaNormalHedge can deal with. Second, the number of awake experts is dynamically changing, and as discussed before, in this case AdaNormalHedge enjoys a regret bound that is adaptive in the number of experts seen so far. Formally, recall the notation , which in this case represents the total number of distinct traversed edges up to round . Then by Theorem ?, we have

which, by Theorem ?, implies where , which is at most the total loss of the best pruning tree . Moreover, the algorithm is efficient: the overall space requirement is , and the running time on round is where we use to denote the number of edges that traverses.

Comparison to other solutions. The work by [13] considers a variant of the exponential weights algorithm in a “log loss” setting, and is not directly applicable here (specifically it is not clear how to tune the learning rate appropriately). A better choice is the Adapt-ML-Prod algorithm by [14] (the version for the sleeping expert problem). However, there is still one issue for this algorithm: it does not give a bound in terms of for an unknown .6 So to get a bound on , the best thing to do is to use the definition and a bound on each . In short, one can verify that Adapt-ML-Prod ensures regret where is the total number of edges/experts. We emphasize that can be much larger than when the tree is huge. Indeed, while is at most times the depth of , could be exponentially large in the depth. The running time and space of Adapt-ML-Prod for this problem, however, are the same as those of AdaNormalHedge.

We finally compare with a totally different approach [16], where one simply treats each pruning tree as an expert and runs the exponential weights algorithm. Clearly the number of experts is exponentially large, and thus the running time and space are unacceptable under a naive implementation. This issue is avoided by using a clever dynamic programming technique. If and are known ahead of time, then the regret for this algorithm is by tuning the learning rate optimally. As discussed in [13], the linear dependence on in this bound is much better than the one of the form , which, in the worst case, is linear in . This was considered the main drawback of using the sleeping expert approach. However, the bound for AdaNormalHedge is , which is much smaller as discussed previously and in fact comparable to . More importantly, and are unknown in practice. In this case, no sublinear regret bound is known for this dynamic programming approach, since it relies heavily on the fact that the algorithm uses a fixed learning rate and thus the usual time-varying learning rate methods cannot be applied here. Therefore, although theoretically this approach gives small regret, it is not a practical method. The running time is also slightly worse than that of the sleeping expert approach. For simplicity, suppose every internal node has children. Then the time complexity per round is . The overall space requirement is , the same as for the other approaches. Again, see Table 2 for a summary of this section.
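To see how quickly the number of pruning-tree experts grows compared with the number of edge experts, here is a small counting sketch. It assumes a complete template tree in which every internal node has `k` children, and it assumes a pruning is obtained by replacing zero or more nodes (with their subtrees) by leaves. Under that assumption a pruning either turns the root into a leaf or keeps the root and prunes each of the `k` subtrees independently. The function names are hypothetical.

```python
def num_prunings(depth: int, k: int = 2) -> int:
    """Number of prunings of a complete k-ary tree of the given depth.

    Recurrence: P(0) = 1 and P(d) = 1 + P(d - 1) ** k.
    """
    count = 1  # a single node admits exactly one pruning
    for _ in range(depth):
        count = 1 + count ** k
    return count


def num_edges(depth: int, k: int = 2) -> int:
    """Number of edges of a complete k-ary tree of the given depth."""
    return sum(k ** level for level in range(1, depth + 1))


# Hypothetical comparison for a binary tree of depth 4.
prunings = num_prunings(4)
edges = num_edges(4)
```

Already at depth 4 a binary tree has 677 prunings but only 30 edges, and the gap widens doubly exponentially with depth, which is why treating each pruning as an expert needs the dynamic programming technique to be tractable.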

Finally, as mentioned in [13], the sleeping expert approach can be easily generalized to predicting with a decision graph. In that case, AdaNormalHedge still enjoys all the improvements discussed in this section (details omitted).

## A Complete proofs of Theorem ? and Theorem ?

We need the following two lemmas. The first one is an improved version of Lemma 2 of [23].

We first argue that , as a function of , is piecewise-convex on and . Since the value of the function is when and is at least otherwise, it suffices to only consider the case when . On the interval , we can rewrite the exponent (ignoring the constant ) as:

which is convex in . Combining with the fact that “if is convex then is also convex” proves that is convex on . Similarly when , rewriting the exponent as

completes the argument.

Now define function . Since is clearly also piecewise-convex on and , we know that the curve of is below the segment connecting points and on , and also below the segment connecting points and on . This can be mathematically expressed as:

where we use the fact . Now by Lemma 2 of [23], we have

which is at most since is nonnegative and for any . Noting that completes the proof.

The second lemma makes use of Lemma ? to show that the weighted sum of potentials does not increase much and thus the final potential is relatively small.

First note that since AdaNormalHedge predicts , we have

Now applying Lemma ? with and , multiplying the inequality by on both sides and summing over gives We then sum over and telescope to show Finally applying Lemma 14 of [14] to show completes the proof.

We are now ready to prove Theorem ? and Theorem ?.

(of Theorem ?) Assume without loss of generality. Then by Lemma ?, it must be true that for all , which, by solving for , gives . Multiplying both sides by , summing over and applying the Cauchy-Schwarz inequality, we arrive at where we define . It remains to show that and are close. Indeed, we have , which, by standard analysis, can be shown to reach its maximum when and the maximum value is . This completes the proof for Eq. .

Finally, when is in the special form as described in Theorem ?, we have . By Stirling’s formula , we arrive at , proving Eq. .

(of Theorem ?) It suffices to point out that is still in the interval and Eq. in the proof of Lemma ? still holds by the new prediction rule Eq. . The entire proof for Theorem ? applies here exactly.

The algorithm and the proof can be generalized to for any . Indeed, the only extra work is to prove the convexity of . When , we recover NormalHedge.DT exactly and get a bound on for any (in terms of ), instead of just as in the original work. It is clear that gives the smallest bound, which is why we use it in AdaNormalHedge. The ideal choice, however, should be so that a second order bound similar to the one of [14] can be obtained. Unfortunately, the function turns out not to be piecewise-convex in general, which breaks our analysis. Whether gives a low-regret algorithm and how to analyze it remain open questions.

## B Proof of Theorem ?

For the first result, the key observation is . We only consider the case when since otherwise the statement is trivial. By the condition we thus have , which by solving for gives

proving the bound we want.

For the second result, let denote the expectation conditioning on all the randomness up to round . So by the condition, we have , and thus where we define . On the other hand, by convexity we also have and thus by the concavity of the square root function. Combining the above two statements gives , and plugging this back shows . The high probability statement follows from exactly the same argument as in [14], using a martingale concentration lemma.

## C Proof of Theorem ?

Below we use to denote the set for and .

We first fix an expert and consider the regret to this expert . Let and for be positive numbers and corresponding time intervals such that . Note that this is always possible, with a trivial choice being , and ; we will however need a more sophisticated construction specified later. By the adaptive regret guarantee, we have

which, by the Cauchy-Schwarz inequality, is at most

Therefore, we need a construction of and such that is minimized. This is addressed in Lemma ? below which shows that there is in fact always an (optimal) construction such that is exactly