# Achieving All with No Parameters: Adaptive NormalHedge

Haipeng Luo
Princeton University
haipengl@cs.princeton.edu
Robert E. Schapire
Microsoft Research and Princeton University
schapire@cs.princeton.edu
###### Abstract

We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information. The main component of this work is an improved version of the NormalHedge.DT algorithm (Luo and Schapire, 2014), called AdaNormalHedge. On one hand, this new algorithm ensures small regret when the competitor has small loss and almost constant regret when the losses are stochastic. On the other hand, the algorithm is able to compete with any convex combination of the experts simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves an open problem proposed by Chaudhuri et al. (2009) and Chernov and Vovk (2010). Moreover, we extend the results to the sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve previous work from different aspects, and a special case of the first application resolves another open problem proposed by Warmuth and Koolen (2014) on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses.

## 1 Introduction

The problem of predicting with expert advice was pioneered by Littlestone and Warmuth (1994), Freund and Schapire (1997), Cesa-Bianchi et al. (1997), Vovk (1998) and others two decades ago. Roughly speaking, in this problem, a player needs to decide on a distribution over a set of experts on each round, and then an adversary decides and reveals the loss for each expert. The player’s loss for this round is the expected loss of the experts with respect to the distribution that he chose, and his goal is to have a total loss that is not much worse than that of any single expert, or more generally, any fixed and unknown convex combination of experts.

Beyond this classic goal, various more difficult objectives for this problem were studied in recent years, such as: learning with an unknown number of experts and competing with all but the top small fraction of experts (Chaudhuri et al., 2009; Chernov and Vovk, 2010); competing with a sequence of different combinations of the experts (Herbster and Warmuth, 2001; Cesa-Bianchi et al., 2012); learning with experts who provide confidence-rated advice (Blum and Mansour, 2007); and achieving much smaller regret when the problem is “easy” while still ensuring worst-case robustness (de Rooij et al., 2014; Van Erven et al., 2014; Gaillard et al., 2014). Different algorithms were proposed separately to solve these problems to some extent. In this work, we essentially provide one single parameter-free algorithm that achieves all these goals with no prior information and with significantly improved results in some cases.

Our algorithm is a variant of Chaudhuri et al. (2009)’s NormalHedge algorithm, and more specifically is an improved version of NormalHedge.DT (Luo and Schapire, 2014). We call it Adaptive NormalHedge (or AdaNormalHedge for short). NormalHedge and NormalHedge.DT provide guarantees for the so-called $\epsilon$-quantile regret simultaneously for any $\epsilon$, which essentially corresponds to competing with a uniform distribution over the top $\epsilon$-fraction of experts. Our new algorithm improves NormalHedge.DT from two aspects (Section 3):

1. AdaNormalHedge can compete with not just the competitor of the specific form mentioned above, but indeed any unknown fixed competitor simultaneously, with a regret in terms of the relative entropy between the competitor and the player’s prior belief of the experts.

2. AdaNormalHedge ensures a new regret bound in terms of the cumulative magnitude of the instantaneous regrets, which is always at most the bound for NormalHedge.DT (or NormalHedge). Moreover, the power of this new form of regret is almost the same as the second order bound introduced in a recent work by Gaillard et al. (2014). Specifically, it implies 1) a small regret when the loss of the competitor is small and 2) an almost constant regret when the losses are generated randomly with a gap in expectation.

Our results resolve the open problem asked in Chaudhuri et al. (2009) and Chernov and Vovk (2010) on whether a better $\epsilon$-quantile regret in terms of the loss of the expert instead of the horizon $T$ can be achieved. In fact, our results are even better and more general.

AdaNormalHedge is a simple and truly parameter-free algorithm. Indeed, it does not even need to know the number of experts in some sense. To illustrate this idea, in Section 4 we extend the algorithm and results to a setting where experts provide confidence-rated advice (Blum and Mansour, 2007). We then focus on a special case of this setting called the sleeping expert problem (Blum, 1997; Freund et al., 1997), where the number of “awake” experts is dynamically changing and the total number of underlying experts is indeed unknown. AdaNormalHedge is thus a very suitable algorithm for this problem. To show the power of all the abovementioned properties of AdaNormalHedge, we study the following two examples of the sleeping expert problem and use AdaNormalHedge to significantly improve previous work.

The first example is adaptive regret, that is, regret on any time interval, introduced by Hazan and Seshadhri (2007). This can be reduced to a sleeping expert problem by adding a new copy of each original expert on each round (Freund et al., 1997; Koolen et al., 2012). Thus, the total number of sleeping experts is not fixed. When some information on this interval is known (such as the length, the loss of the competitor on this interval, etc.), several algorithms achieve optimal regret (Hazan and Seshadhri, 2007; Cesa-Bianchi et al., 2012). However, when no prior information is available, all previous work gives suboptimal bounds. We apply AdaNormalHedge to this problem. The resulting algorithm, which we call AdaNormalHedge.TV, enjoys the optimal adaptive regret in not only the adversarial case but also the stochastic case due to the properties of AdaNormalHedge.

We then extend the results to the problem of tracking the best experts where the player needs to compete with the best partition of the whole process and the best experts on each of these partitions (Herbster and Warmuth, 1995; Bousquet and Warmuth, 2003). This resolves one of the open problems in Warmuth and Koolen (2014) on whether a single algorithm can achieve optimal shifting regret for both adversarial and stochastic losses. Note that although recent work by Sani et al. (2014) also solves this open problem in some sense, their method requires knowing the number of partitions and other information ahead of time and also gives a worse bound for stochastic losses, while AdaNormalHedge.TV is completely parameter-free and gives optimal bounds.

We finally consider the most general case where the competitor varies over time with no constraints, which subsumes the previous two examples (adaptive regret and shifting regret). This problem was introduced in Herbster and Warmuth (2001) and later generalized by Cesa-Bianchi et al. (2012). Their algorithm (fixed share) also requires knowing some information on the sequence of competitors to optimally tune parameters. We avoid this issue by showing that while this problem seems more general and difficult, it is in fact equivalent to its special case: achieving adaptive regret. This equivalence theorem is independent of the concrete algorithms and may be of independent interest. Applying this result, we show that without any parameter tuning, AdaNormalHedge.TV automatically achieves a bound comparable to the one achieved by the optimally tuned fixed share algorithm when competing with time-varying competitors.

Concrete results and detailed comparisons on this first example can be found in Section 5. To sum up, AdaNormalHedge.TV is an algorithm that is simultaneously adaptive in the number of experts, the competitors and the way the losses are generated.

The second example we provide is predicting almost as well as the best pruning tree (Helmbold and Schapire, 1997), which was also shown to be reducible to a sleeping expert problem (Freund et al., 1997). Previous work either only considered the log loss setting, or assumed prior information on the best pruning tree is known. Using AdaNormalHedge, we again provide better or comparable bounds without knowing any prior information. In fact, due to the adaptivity of AdaNormalHedge in the number of experts, our regret bound depends on the total number of distinct traversed edges so far, instead of the total number of edges of the decision tree as in Freund et al. (1997) which could be exponentially larger. Concrete comparisons can be found in Section 6.

#### Related work.

While competing with any unknown competitor simultaneously is relatively easy in the log loss setting (Littlestone and Warmuth, 1994; Adamskiy et al., 2012; Koolen et al., 2012), it is much harder in the bounded loss setting studied here. The well-known exponential weights algorithm gives the optimal results only when the learning rate is optimally tuned in terms of the competitor (Freund and Schapire, 1999). Chernov and Vovk (2010) also studied $\epsilon$-quantile regret, but no concrete algorithm was provided. Several works consider competing with unknown competitors in a different unconstrained linear optimization setting (Streeter and McMahan, 2012; Orabona, 2013; McMahan and Orabona, 2014; Orabona, 2014). Jadbabaie et al. (2015) studied general adaptive online learning algorithms against time-varying competitors, but with a different and incomparable measurement of the hardness of the problem. As far as we know, none of the existing algorithms enjoys all the nice properties discussed in this work at the same time as our algorithms do.

## 2 The Expert Problem and NormalHedge.DT

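In the expert problem, on each round $t = 1, \ldots, T$: the player first chooses a distribution $\mathbf{p}_t$ over $N$ experts, then the adversary decides each expert’s loss $\ell_{t,i} \in [0,1]$, and reveals these losses to the player. At the end of this round, the player suffers the weighted average loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$ with $\boldsymbol{\ell}_t = (\ell_{t,1}, \ldots, \ell_{t,N})$. We denote the instantaneous regret to expert $i$ on round $t$ by $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$, the cumulative regret by $R_{t,i} = \sum_{\tau=1}^{t} r_{\tau,i}$, and the cumulative loss by $L_{t,i} = \sum_{\tau=1}^{t} \ell_{\tau,i}$. Throughout the paper, a bold letter denotes a vector with corresponding coordinates. For example, $\mathbf{r}_t$, $\mathbf{R}_t$ and $\mathbf{L}_t$ represent $(r_{t,1}, \ldots, r_{t,N})$, $(R_{t,1}, \ldots, R_{t,N})$ and $(L_{t,1}, \ldots, L_{t,N})$ respectively.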

Usually, the goal of the player is to minimize the regret to the best expert, that is, $\max_i R_{T,i}$. Here we consider a more general case where the player wants to minimize the regret to an arbitrary convex combination of experts: $R_T(\mathbf{u}) = \sum_{t=1}^{T} \mathbf{u} \cdot \mathbf{r}_t$, where the competitor $\mathbf{u}$ is a fixed unknown distribution over the experts. In other words, this regret measures the difference between the player’s loss and the loss that he would have suffered if he used a constant strategy $\mathbf{u}$ all the time. Clearly, $R_T(\mathbf{u})$ can be written as $\sum_{i=1}^{N} u_i R_{T,i}$ and can then be upper bounded appropriately by a bound on each $R_{T,i}$ (for example, $\max_i R_{T,i}$). However, our goal is to get a better and more refined bound on $R_T(\mathbf{u})$ that depends on $\mathbf{u}$. More importantly, we aim to achieve this without knowing the competitor $\mathbf{u}$ ahead of time. When it is clear from the context, we drop the subscript $T$ in $R_T(\mathbf{u})$.
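The regret to a convex combination can be computed directly from a log of predictions and losses. The following is a minimal sketch, assuming the usual definition of instantaneous regret as the player's loss minus the expert's loss; the function name `regret_to_competitor` is hypothetical and introduced only for illustration.

```python
def regret_to_competitor(predictions, losses, u):
    """Return (R_T(u), [R_{T,1}, ..., R_{T,N}]).

    predictions: list of distributions p_t (one list of N floats per round)
    losses:      list of loss vectors l_t (one list of N floats per round)
    u:           competitor, a distribution over the N experts
    """
    n = len(u)
    cumulative = [0.0] * n  # R_{T,i} for each expert i
    for p, loss in zip(predictions, losses):
        # Player's loss on this round is the weighted average loss p_t . l_t.
        player_loss = sum(pi * li for pi, li in zip(p, loss))
        for i in range(n):
            # Instantaneous regret r_{t,i} = player loss - expert i's loss.
            cumulative[i] += player_loss - loss[i]
    # R_T(u) = sum_i u_i * R_{T,i}
    return sum(ui * Ri for ui, Ri in zip(u, cumulative)), cumulative
```

Since the regret to $\mathbf{u}$ is a weighted average of the per-expert regrets, it can never exceed the regret to the best expert, which is the trivial bound mentioned above.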

In fact, in Section 5, we will consider an even more general notion of regret introduced in Herbster and Warmuth (2001), where we allow the competitor to vary over time and to have different scales. Specifically, let $\mathbf{u}_1, \ldots, \mathbf{u}_T$ be $T$ different vectors of length $N$ with nonnegative coordinates (denoted by $\mathbf{u}_{1:T}$). Then the regret of the player to this sequence of competitors is $R(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{r}_t$. If all these competitors are distributions (which they are not required to be), then this regret captures a very natural and general concept of comparing the player’s strategy to any other strategy. Again, we are interested in developing low-regret algorithms that do not need to know any information about this sequence of competitors beforehand.

We briefly describe a recent algorithm for the expert problem, NormalHedge.DT (Luo and Schapire, 2014) (a variant of NormalHedge (Chaudhuri et al., 2009)), before we introduce our new improved variants. On round $t$, NormalHedge.DT sets $p_{t,i} \propto \exp\left(\frac{[R_{t-1,i}+1]_+^2}{3t}\right) - \exp\left(\frac{[R_{t-1,i}-1]_+^2}{3t}\right)$, where $[x]_+ = \max\{0, x\}$. Let $\epsilon \in (0,1]$ and competitor $\mathbf{u}^*_\epsilon$ be a distribution that puts all the mass on the $\lceil N\epsilon \rceil$-th best expert, that is, the one that ranks $\lceil N\epsilon \rceil$ among all experts according to their total loss $L_{T,i}$ from the smallest to the largest. Then the regret guarantee for NormalHedge.DT bounds $R(\mathbf{u}^*_\epsilon)$ simultaneously for all $\epsilon$, which means the algorithm suffers at most this amount of regret for all but an $\epsilon$ fraction of the experts. Note that this bound does not depend on $N$ at all. This is the first concrete algorithm with this kind of adaptive property (the original NormalHedge (Chaudhuri et al., 2009) still has a weak dependence on $N$). In fact, as we will show later, one can even extend the results to any competitor $\mathbf{u}$. Moreover, we will improve NormalHedge.DT so that it has a much smaller regret when the problem is “easy” in some sense.

#### Notation.

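We use $[N]$ to denote the set $\{1, \ldots, N\}$, $\Delta_N$ to denote the simplex of all distributions over $[N]$, and $\mathrm{RE}(\cdot \| \cdot)$ to denote the relative entropy between two distributions. Also define $\tilde{L}_{T,i} = \sum_{t=1}^{T} [\ell_{t,i} - \hat{\ell}_t]_+$. Many bounds in this work will be in terms of $\tilde{L}_{T,i}$, which is always at most $L_{T,i}$ since trivially $[\ell_{t,i} - \hat{\ell}_t]_+ \le \ell_{t,i}$. We consider “log log” terms to be nearly constant, and use $\hat{O}(\cdot)$ notation to hide these terms. Indeed, as pointed out by Chernov and Vovk (2010), $\ln\ln x$ is smaller than $4$ even when $x$ is as large as the age of the universe expressed in microseconds ($\approx 4.3 \times 10^{23}$).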
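As a quick sanity check of the remark that “log log” terms are nearly constant, the following sketch evaluates the double logarithm numerically. The figure of 13.8 billion years for the age of the universe and the Julian year length are assumptions of this example, not values taken from the text.

```python
import math

# Assumed inputs (not from the paper): age of the universe ~13.8e9 years,
# and one Julian year = 365.25 days.
AGE_YEARS = 13.8e9
MICROSECONDS_PER_YEAR = 365.25 * 24 * 3600 * 1e6


def log_log(x):
    """Return ln(ln(x)) for x > 1."""
    return math.log(math.log(x))


age_in_microseconds = AGE_YEARS * MICROSECONDS_PER_YEAR
double_log = log_log(age_in_microseconds)
print(f"{age_in_microseconds:.2e} microseconds, ln ln = {double_log:.3f}")
```

Under these assumptions the double logarithm stays just below 4, which is why such terms are treated as constants in the bounds.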

## 3 A New Algorithm: AdaNormalHedge

We start by writing NormalHedge.DT in a general form. We define potential function $\Phi(R, C) = \exp\left(\frac{[R]_+^2}{3C}\right)$ with $\Phi(0,0)$ defined to be $1$, and also a weight function with respect to this potential:

$$w(R, C) = \frac{1}{2}\big(\Phi(R+1, C+1) - \Phi(R-1, C+1)\big).$$

Then the prediction of NormalHedge.DT is simply to set $p_{t,i}$ to be proportional to $w(R_{t-1,i}, C_{t-1})$ where $C_t = t$ for all $t$. Note that $C_t$ is closely related to the regret. In fact, the regret is roughly of order $\sqrt{C_T}$ (ignoring the log term). Therefore, in order to get an expert-wise and more refined bound, we replace $C_t$ by $C_{t,i}$ for each expert so that it captures some useful information for each expert $i$. There are several possible choices for $C_{t,i}$ (discussed at the end of Appendix A), but for now we focus on the one used in our new algorithm: $C_{t,i} = \sum_{\tau=1}^{t} |r_{\tau,i}|$, that is, the cumulative magnitude of the instantaneous regrets up to time $t$. We call this algorithm AdaNormalHedge and summarize it in Algorithm 1. Note that we even allow the player to have a prior distribution $\mathbf{q}$ over the experts, which will be useful in some applications as we will see in Section 5. The theoretical guarantee of AdaNormalHedge is stated below.
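The description above can be sketched in a few lines of code. This is an illustrative sketch rather than a transcription of Algorithm 1: it assumes the potential $\Phi(R, C) = \exp([R]_+^2 / (3C))$ with $\Phi(0,0) = 1$, predictions proportional to $q_i\, w(R_{t-1,i}, C_{t-1,i})$, and losses in $[0,1]$. The class and method names are hypothetical, and the fallback to the prior when every weight is zero is a choice made for this sketch only.

```python
import math


def potential(R, C):
    """Phi(R, C) = exp([R]_+^2 / (3C)), with Phi(0, 0) defined to be 1."""
    if C <= 0.0:
        return 1.0  # convention Phi(0, 0) = 1; never reached via weight() below
    return math.exp(max(R, 0.0) ** 2 / (3.0 * C))


def weight(R, C):
    """w(R, C) = (Phi(R + 1, C + 1) - Phi(R - 1, C + 1)) / 2."""
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))


class AdaNormalHedge:
    """Sketch of AdaNormalHedge for a fixed set of experts."""

    def __init__(self, prior):
        self.q = list(prior)            # prior distribution over the experts
        self.R = [0.0] * len(self.q)    # cumulative regrets R_{t,i}
        self.C = [0.0] * len(self.q)    # cumulative magnitudes C_{t,i}

    def predict(self):
        raw = [q * weight(R, C) for q, R, C in zip(self.q, self.R, self.C)]
        total = sum(raw)
        if total <= 0.0:
            return list(self.q)  # fallback chosen for this sketch only
        return [x / total for x in raw]

    def update(self, p, losses):
        # Player suffers the weighted average loss under its own prediction.
        player_loss = sum(pi * li for pi, li in zip(p, losses))
        for i, loss in enumerate(losses):
            r = player_loss - loss      # instantaneous regret r_{t,i}
            self.R[i] += r
            self.C[i] += abs(r)
```

Note that no learning rate, horizon, or number of experts appears as a tuning parameter; the only input is the prior. For very long runs the exponential in `potential` can overflow a float, so a practical implementation would likely work with logarithms of the weights instead.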

###### Theorem 1.

The regret of AdaNormalHedge to any competitor $\mathbf{u} \in \Delta_N$ is bounded as follows:

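$$R(\mathbf{u}) \le \sqrt{3(\mathbf{u}\cdot\mathbf{C}_T)\big(\mathrm{RE}(\mathbf{u}\|\mathbf{q}) + \ln B + \ln(1+\ln N)\big)} = \hat{O}\Big(\sqrt{(\mathbf{u}\cdot\mathbf{C}_T)\,\mathrm{RE}(\mathbf{u}\|\mathbf{q})}\Big), \tag{1}$$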

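where $\mathbf{C}_T = (C_{T,1}, \ldots, C_{T,N})$, $B = 1 + \frac{3}{2}\sum_{i=1}^{N} q_i \big(1 + \ln(1 + C_{T,i})\big) \le \frac{5}{2} + \frac{3}{2}\ln(1+T)$. Moreover, if $\mathbf{u}$ is a uniform distribution over a subset of $[N]$, then the regret can be improved to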

$$R(\mathbf{u}) \le \sqrt{3(\mathbf{u}\cdot\mathbf{C}_T)\big(\mathrm{RE}(\mathbf{u}\|\mathbf{q}) + \ln B + 1\big)}. \tag{2}$$

Before we prove this theorem (see the sketch at the end of this section and the complete proof in Appendix A), we discuss some implications of the regret bounds and why they are interesting. First of all, the relative entropy term $\mathrm{RE}(\mathbf{u}\|\mathbf{q})$ captures how close the player’s prior is to the competitor. A bound in terms of $\mathrm{RE}(\mathbf{u}\|\mathbf{q})$ can be obtained, for example, using the classic exponential weights algorithm but requires carefully tuning the learning rate as a function of $\mathbf{u}$. Without knowing $\mathbf{u}$, as far as we know, AdaNormalHedge is the only algorithm that can achieve this. (In fact, one can also derive similar bounds for NormalHedge and NormalHedge.DT using our analysis; see the discussion at the end of Appendix A.)

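On the other hand, if $\mathbf{q}$ is a uniform distribution, then using bound (2) and the fact $\mathrm{RE}(\mathbf{u}\|\mathbf{q}) \le \ln\frac{1}{\epsilon}$, we get an $\epsilon$-quantile regret bound similar to the one of NormalHedge.DT: $R(\mathbf{u}) \le \sqrt{3(\mathbf{u}\cdot\mathbf{C}_T)\left(\ln\frac{1}{\epsilon} + \ln B + 1\right)}$, where $\mathbf{u}$ is uniform over the top $\lceil N\epsilon \rceil$ experts in terms of their total loss $L_{T,i}$.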

However, the power of a bound in terms of $\mathbf{C}_T$ is far more than this. Gaillard et al. (2014) introduced a new second order bound that implies much smaller regret when the problem is easy. It turns out that our seemingly weaker first order bound is also enough to get the exact same results! We state these implications in the following theorem, which is essentially a restatement of Theorems 9 and 11 of Gaillard et al. (2014) with weaker conditions.

###### Theorem 2.

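Suppose an expert algorithm guarantees $R(\mathbf{u}) \le \sqrt{(\mathbf{u}\cdot\mathbf{C}_T)\, A(\mathbf{u})}$ where $A(\mathbf{u})$ is some function of $\mathbf{u}$. Then it also satisfies the following:

1. Recall $\tilde{L}_{T,i} = \sum_{t=1}^{T} [\ell_{t,i} - \hat{\ell}_t]_+$ and $\tilde{\mathbf{L}}_T = (\tilde{L}_{T,1}, \ldots, \tilde{L}_{T,N})$. We have

 $$R(\mathbf{u}) \le \sqrt{2(\mathbf{u}\cdot\tilde{\mathbf{L}}_T) A(\mathbf{u})} + A(\mathbf{u}) \le \sqrt{2(\mathbf{u}\cdot\mathbf{L}_T) A(\mathbf{u})} + A(\mathbf{u}).$$

2. Suppose the loss vectors $\boldsymbol{\ell}_t$ are independent random variables and there exists an $i^*$ and some $\alpha > 0$ such that $\mathbb{E}[\ell_{t,i}] - \mathbb{E}[\ell_{t,i^*}] \ge \alpha$ for any $t$ and $i \ne i^*$. Let $\mathbf{e}_{i^*}$ be a distribution that puts all the mass on expert $i^*$. Then we have $\mathbb{E}[R(\mathbf{e}_{i^*})] \le \frac{A(\mathbf{e}_{i^*})}{\alpha}$, and with probability at least $1-\delta$ a bound of the same form holds with an extra confidence term depending on $\delta$.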

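The proof of Theorem 2 is based on the same idea as in Gaillard et al. (2014), and is included in Appendix B for completeness. For AdaNormalHedge, the term $A(\mathbf{u})$ is $3\big(\mathrm{RE}(\mathbf{u}\|\mathbf{q}) + \ln B + \ln(1+\ln N)\big)$ in general (or smaller for special $\mathbf{u}$ as stated in Theorem 1). Applying Theorem 2 we have $R(\mathbf{u}) = \hat{O}\Big(\sqrt{(\mathbf{u}\cdot\tilde{\mathbf{L}}_T)\,\mathrm{RE}(\mathbf{u}\|\mathbf{q})} + \mathrm{RE}(\mathbf{u}\|\mathbf{q})\Big)$. Specifically, if $\mathbf{q}$ is uniform and assuming without loss of generality that $L_{T,1} \le \cdots \le L_{T,N}$, then by a similar argument, we have for AdaNormalHedge that the regret to the $\lceil N\epsilon \rceil$-th best expert is $\hat{O}\Big(\sqrt{\big(\tfrac{1}{\lceil N\epsilon \rceil}\sum_{i=1}^{\lceil N\epsilon \rceil} L_{T,i}\big)\ln\tfrac{1}{\epsilon}} + \ln\tfrac{1}{\epsilon}\Big)$ for any $\epsilon$. This answers the open question (in the affirmative) asked by Chaudhuri et al. (2009) and Chernov and Vovk (2010) on whether an improvement for small loss can be obtained for $\epsilon$-quantile regret without knowing $\epsilon$.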
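To make the step behind item 1 of Theorem 2 concrete, here is a short sketch of the algebra. It is a paraphrase under the assumptions $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$, $C_{T,i} = \sum_t |r_{t,i}|$ and $\tilde{L}_{T,i} = \sum_t [\ell_{t,i} - \hat{\ell}_t]_+$, and assumes the hypothesis $R(\mathbf{u}) \le \sqrt{(\mathbf{u}\cdot\mathbf{C}_T)A(\mathbf{u})}$; the complete argument is the one in Appendix B. The first line uses the identity $|x| = x + 2[-x]_+$, the third solves the quadratic inequality in $R(\mathbf{u})$, and the last uses $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$. When $R(\mathbf{u}) < 0$ the claim is trivial because the right-hand side is nonnegative.

```latex
\begin{align*}
\mathbf{u}\cdot\mathbf{C}_T
  &= R(\mathbf{u}) + 2\,\mathbf{u}\cdot\tilde{\mathbf{L}}_T, \\
R(\mathbf{u})^2
  &\le \big(R(\mathbf{u}) + 2\,\mathbf{u}\cdot\tilde{\mathbf{L}}_T\big)\,A(\mathbf{u})
  \qquad \text{(when } R(\mathbf{u}) \ge 0\text{)}, \\
R(\mathbf{u})
  &\le \tfrac{1}{2}\Big(A(\mathbf{u})
     + \sqrt{A(\mathbf{u})^2 + 8\,(\mathbf{u}\cdot\tilde{\mathbf{L}}_T)\,A(\mathbf{u})}\Big) \\
  &\le A(\mathbf{u}) + \sqrt{2\,(\mathbf{u}\cdot\tilde{\mathbf{L}}_T)\,A(\mathbf{u})}.
\end{align*}
```

The second inequality in item 1 then follows from $\tilde{L}_{T,i} \le L_{T,i}$.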

On the other hand, when we are in a stochastic setting as stated in Theorem 2, AdaNormalHedge ensures $R(\mathbf{e}_{i^*}) = \hat{O}\big(\frac{1}{\alpha}\ln\frac{1}{q_{i^*}}\big)$ in expectation (or with high probability with an extra confidence term), which does not grow with $T$. Therefore, the new regret bound in terms of $\mathbf{C}_T$ actually leads to significant improvements compared to NormalHedge.DT.

#### Comparison to Adapt-ML-Prod (Gaillard et al., 2014).

Adapt-ML-Prod enjoys a second order bound in terms of $\sum_{t=1}^{T} r_{t,i}^2$, which is always at most the term $\sum_{t=1}^{T} |r_{t,i}|$ that appears in our bounds. (We briefly discuss the difficulty of getting a similar second order bound for our algorithm at the end of Appendix A.) However, on one hand, as discussed above, these two bounds give the same improvements when the problem is easy in several senses; on the other hand, Adapt-ML-Prod does not provide a bound in terms of $\mathrm{RE}(\mathbf{u}\|\mathbf{q})$ for an unknown $\mathbf{u}$. In fact, as discussed at the end of Section A.3 of Gaillard et al. (2014), Adapt-ML-Prod cannot improve by exploiting a good prior $\mathbf{q}$ (or at least its current analysis cannot). Specifically, while the regret for AdaNormalHedge does not have an explicit dependence on $N$ and is much smaller when the prior $\mathbf{q}$ is close to the competitor $\mathbf{u}$, the regret for Adapt-ML-Prod always has a multiplicative term that does not shrink with a better prior, which means even a good prior results in the same regret as a uniform prior! More advantages of AdaNormalHedge over Adapt-ML-Prod will be discussed in concrete examples in the following sections.

#### Proof sketch of Theorem 1.

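The analysis of NormalHedge.DT is based on the idea of converting the expert problem into a drifting game (Schapire, 2001; Luo and Schapire, 2014). Here, we extract and simplify the key idea of their proof and also improve it to form our analysis. The main idea is to show that the weighted sum of potentials does not increase much on each round using an improved version of Lemma 2 of Luo and Schapire (2014). In fact, we show that the final potential $\sum_{i=1}^{N} q_i \Phi(R_{T,i}, C_{T,i})$ is bounded by exactly $B$ (defined in Theorem 1). From this, assuming without loss of generality that $q_1 \Phi(R_{T,1}, C_{T,1}) \ge \cdots \ge q_N \Phi(R_{T,N}, C_{T,N})$, we have $q_i \Phi(R_{T,i}, C_{T,i}) \le \frac{B}{i}$ for all $i$, which, by solving for $R_{T,i}$, gives $R_{T,i} \le \sqrt{3 C_{T,i} \ln\frac{B}{i q_i}}$. Multiplying both sides by $u_i$, summing over $i$ and applying the Cauchy-Schwarz inequality, we arrive at $R(\mathbf{u}) \le \sqrt{3(\mathbf{u}\cdot\mathbf{C}_T)\big(D(\mathbf{u}\|\mathbf{q}) + \ln B\big)}$, where we define $D(\mathbf{u}\|\mathbf{q}) = \sum_{i=1}^{N} u_i \ln\frac{1}{i q_i}$. It remains to show that $D(\mathbf{u}\|\mathbf{q})$ and $\mathrm{RE}(\mathbf{u}\|\mathbf{q})$ are close by standard analysis and Stirling’s formula.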

## 4 Confidence-rated Advice and Sleeping Experts

In this section, we generalize AdaNormalHedge to deal with experts that make confidence-rated advice, a setting that subsumes many interesting applications as studied by Blum (1997) and Freund et al. (1997). In this general setting, on each round $t$, each expert first reports its confidence $I_{t,i} \in [0,1]$ for the current task. The player then predicts $\mathbf{p}_t$ as usual with an extra yet natural restriction that if $I_{t,i} = 0$ then $p_{t,i} = 0$. That is, the player has to ignore those experts who abstain from giving advice (by reporting zero confidence). After that, the losses $\ell_{t,i}$ for those experts who did not abstain (i.e. $I_{t,i} \neq 0$) are revealed and the player still suffers loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$. We redefine the instantaneous regret to be $r_{t,i} = I_{t,i}(\hat{\ell}_t - \ell_{t,i})$, that is, the difference between the loss of the player and expert $i$ weighted by the confidence. The goal of the player is, as before, to minimize cumulative regret to any competitor $\mathbf{u}$: $R(\mathbf{u}) = \sum_{t=1}^{T} \mathbf{u}\cdot\mathbf{r}_t$. Clearly, the classic expert problem that we have studied in previous sections is just a special case of this general setting with $I_{t,i} = 1$ for all $t$ and $i$.

Moreover, with this general form of $r_{t,i}$, AdaNormalHedge can be used to deal with this general setting with only one simple change of scaling the weights by the confidence:

$$p_{t,i} \propto q_i\, I_{t,i}\, w(R_{t-1,i}, C_{t-1,i}), \tag{3}$$

where $R_{t,i}$ and $C_{t,i}$ are still defined to be $\sum_{\tau=1}^{t} r_{\tau,i}$ and $\sum_{\tau=1}^{t} |r_{\tau,i}|$ respectively. The constraint that $I_{t,i} = 0$ implies $p_{t,i} = 0$ is clearly satisfied. In fact, Algorithm 1 can be seen as a special case of this general form of AdaNormalHedge with $I_{t,i} \equiv 1$. Furthermore, the regret bounds in Theorem 1 still hold without any changes, as summarized below (proof deferred to Appendix A).

###### Theorem 3.

For the confidence-rated expert problem, regret bounds (1) and (2) still hold for general AdaNormalHedge (Eq. (3)).

Previously, Gaillard et al. (2014) studied a general reduction from an expert algorithm to a confidence-rated expert algorithm. Applying those results here gives the exact same algorithm and regret guarantee mentioned above. However, we point out that the general reduction is not always applicable. Specifically, it is invalid if there is an unknown number of experts in the confidence-rated setting (explained more in the next paragraph) while the expert algorithm in the standard setting requires knowing the number of experts as a parameter. This is indeed the case for most algorithms (including Adapt-ML-Prod and even the original NormalHedge by Chaudhuri et al. (2009)). AdaNormalHedge naturally avoids this problem since it does not depend on $N$ at all.

#### Sleeping Experts.

We are especially interested in the case when $I_{t,i} \in \{0, 1\}$, also called the specialist/sleeping expert problem, where $I_{t,i} = 0$ means that expert $i$ is “asleep” for round $t$ and not giving any advice. This is a natural setting where the total number of experts is unknown ahead of time. Indeed, the number of awake experts can be dynamically changing over time. An expert that has never appeared before should be thought of as being asleep for all previous rounds.

AdaNormalHedge is a very suitable algorithm to deal with this case due to its independence of the total number of experts. If an expert $i$ appears for the first time on round $t$, then by definition it will naturally start with $R_{t-1,i} = 0$ and $C_{t-1,i} = 0$. Although we state the prior $\mathbf{q}$ as a distribution, which seems to require knowing the total number of experts, it is not an issue algorithmically since $\mathbf{q}$ is only used to scale the unnormalized weights (Eq. (3)). For example, if we want $\mathbf{q}$ to be a uniform distribution over $N$ experts where $N$ is unknown beforehand, then to run AdaNormalHedge we can simply treat $q_i$ in Eq. (3) to be $1$ for all $i$, which clearly will not change the behavior of the algorithm anyway. In this case, if we let $N_t$ denote the total number of distinct experts that have been seen up to time $t$ and the competitor $\mathbf{u}$ concentrates on any of these experts, then the relative entropy term in the regret (up to time $t$) will be $\ln N_t$ (instead of $\ln N$), which is changing over time.
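The confidence-rated update of Eq. (3) with an unknown expert pool can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the potential $\Phi(R, C) = \exp([R]_+^2/(3C))$, uses the unnormalized prior $q_i = 1$ described above, and keys experts by arbitrary hashable identifiers so that new experts can appear at any time. The class name, the dictionary-based interface, and the fallback used when all awake experts have zero weight are assumptions of this sketch. At least one expert must report positive confidence on each round.

```python
import math


def weight(R, C):
    """w(R, C) = (Phi(R+1, C+1) - Phi(R-1, C+1)) / 2, Phi(R, C) = exp([R]_+^2 / (3C))."""
    def phi(x):
        return math.exp(max(x, 0.0) ** 2 / (3.0 * (C + 1.0)))
    return 0.5 * (phi(R + 1.0) - phi(R - 1.0))


class SleepingAdaNormalHedge:
    """Sketch of Eq. (3) with q_i = 1 and experts discovered on the fly."""

    def __init__(self):
        self.R = {}  # cumulative regret per expert id
        self.C = {}  # cumulative |regret| per expert id

    def predict(self, confidence):
        """confidence maps expert id -> I_{t,i} in [0, 1] for this round."""
        for e in confidence:
            # A never-seen expert starts with R = 0 and C = 0.
            self.R.setdefault(e, 0.0)
            self.C.setdefault(e, 0.0)
        raw = {e: I * weight(self.R[e], self.C[e]) for e, I in confidence.items()}
        total = sum(raw.values())
        if total <= 0.0:
            # Fallback chosen for this sketch: weight by confidence alone.
            raw = dict(confidence)
            total = sum(raw.values())
        return {e: v / total for e, v in raw.items()}

    def update(self, p, confidence, losses):
        """losses only needs entries for experts with positive confidence."""
        awake = [e for e, I in confidence.items() if I > 0.0]
        player_loss = sum(p[e] * losses[e] for e in awake)
        for e in awake:
            r = confidence[e] * (player_loss - losses[e])
            self.R[e] += r
            self.C[e] += abs(r)
```

Because the state is a dictionary that grows as experts appear, nothing in the sketch depends on the total number of experts, which mirrors the point made in the paragraph above.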

Using the adaptivity of AdaNormalHedge in both the number of experts and the competitor, we provide improved results for two instances of the sleeping expert problem in the next two sections.

## 5 Time-Varying Competitors

In this section, we study a more challenging goal of competing with time-varying competitors in the standard expert setting (that is, each expert is always awake and again $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$), which turns out to be reducible to a sleeping expert problem. Results for this section are summarized in Table 1.

### 5.1 Special Cases: Adaptive Regret and Tracking the Best Expert

We start from a special case: adaptive regret, introduced by Hazan and Seshadhri (2007) to better capture changing environments. Formally, consider any time interval $[t_1, t_2]$, and let $R_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} r_{t,i}$ be the regret to expert $i$ on this interval (similarly define $L_{[t_1,t_2],i}$ and $C_{[t_1,t_2],i}$). The goal of the player is to obtain relatively small regret on any interval. Freund et al. (1997) essentially introduced a way to reduce this problem to a sleeping expert problem, which was later improved by Adamskiy et al. (2012). Specifically, for every pair of time $t$ and expert $i$, we create a sleeping expert, denoted by $(t, i)$, who is only awake after (and including) round $t$ and since then suffers the same loss as the original expert $i$. So we have $Nt$ sleeping experts in total on round $t$. The prediction $p_{t,i}$ is set to be the sum of the weights of all sleeping experts $(\tau, i)$ with $\tau = 1, \ldots, t$. It is clear that doing this ensures that the cumulative regret up to time $t_2$ with respect to sleeping expert $(t_1, i)$ is exactly $R_{[t_1,t_2],i}$ in the original problem.

This is a sleeping expert problem for which AdaNormalHedge is very suitable, since the number of sleeping experts keeps increasing and the total number of experts is in fact unknown if the horizon is unknown. Theorem 3 implies that the resulting algorithm gives the following adaptive regret:

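$$R_{[t_1,t_2],i} = \hat{O}\left(\sqrt{\Big(\sum_{t=t_1}^{t_2} |r_{t,i}|\Big)\ln\frac{1}{q_{(t_1,i)}}}\right) = \hat{O}\left(\sqrt{\Big(\sum_{t=t_1}^{t_2} |r_{t,i}|\Big)\ln(N t_1)}\right),$$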

where $\mathbf{q}$ is a prior over the sleeping experts and the last step is by setting the prior to be $q_{(t,i)} \propto 1/t^2$ for all $t$ and $i$. (Note that, as discussed before, the fact that $T$ is unknown and thus $\mathbf{q}$ is unknown does not affect the algorithm.) This prior is better than a simple uniform distribution, which leads to a term $\ln(N t_2)$ instead of $\ln(N t_1)$. We call this algorithm AdaNormalHedge.TV (“TV” stands for “time-varying”). To be concrete, on round $t$ AdaNormalHedge.TV predicts $p_{t,i} \propto \sum_{\tau=1}^{t} \frac{1}{\tau^2}\, w\big(R_{[\tau,t-1],i}, C_{[\tau,t-1],i}\big)$.

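Again, Theorem 2 can be applied to get a more interpretable bound $\hat{O}\Big(\sqrt{\tilde{L}_{[t_1,t_2],i}\ln(N t_1)} + \ln(N t_1)\Big)$ where $\tilde{L}_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} [\ell_{t,i} - \hat{\ell}_t]_+ \le L_{[t_1,t_2],i}$, and a much smaller bound if the losses are stochastic on interval $[t_1, t_2]$ in the sense stated in Theorem 2.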

One drawback of AdaNormalHedge.TV is that its time complexity per round is $O(Nt)$ and the overall space is $O(NT)$. However, the data streaming technique used in Hazan and Seshadhri (2007) can be directly applied here to reduce the time and space complexity to $O(N \ln t)$ and $O(N \ln T)$ respectively, with only an extra multiplicative factor in the regret.
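The reduction to sleeping experts can be sketched directly, without the streaming speed-up. The code below is a naive illustration that keeps one pair of statistics per (start round, expert) and therefore costs time linear in $Nt$ on round $t$. It assumes the potential $\Phi(R, C) = \exp([R]_+^2/(3C))$, and the prior weight $1/\tau^2$ for copies started on round $\tau$ is an assumption of this sketch, chosen so that the log of the inverse prior grows only logarithmically in the start round. The class name is hypothetical.

```python
import math


def weight(R, C):
    """w(R, C) = (Phi(R+1, C+1) - Phi(R-1, C+1)) / 2, Phi(R, C) = exp([R]_+^2 / (3C))."""
    def phi(x):
        return math.exp(max(x, 0.0) ** 2 / (3.0 * (C + 1.0)))
    return 0.5 * (phi(R + 1.0) - phi(R - 1.0))


class AdaNormalHedgeTV:
    """Naive sketch: one sleeping expert (start, i) per start round and expert."""

    def __init__(self, n_experts):
        self.n = n_experts
        self.R = []  # self.R[s][i]: regret of copy (s + 1, i) since it woke up
        self.C = []  # self.C[s][i]: cumulative |regret| of the same copy

    def predict(self):
        # Wake up N fresh copies for the current round.
        self.R.append([0.0] * self.n)
        self.C.append([0.0] * self.n)
        raw = [0.0] * self.n
        for start, (R_row, C_row) in enumerate(zip(self.R, self.C), start=1):
            for i in range(self.n):
                # Assumed prior weight 1 / start^2 for copies started at `start`.
                raw[i] += weight(R_row[i], C_row[i]) / start ** 2
        total = sum(raw)  # positive: the fresh copies have weight w(0, 0) > 0
        return [x / total for x in raw]

    def update(self, p, losses):
        player_loss = sum(pi * li for pi, li in zip(p, losses))
        for R_row, C_row in zip(self.R, self.C):
            for i in range(self.n):
                r = player_loss - losses[i]
                R_row[i] += r
                C_row[i] += abs(r)
```

In a run where the best expert changes halfway through, the fresh copies created after the change accumulate positive regret quickly and take over the prediction, which is the mechanism behind the adaptive regret guarantee. For long horizons the exponentials can overflow, so a practical version would work in log space and prune copies as in the streaming technique mentioned above.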

#### Tracking the best expert.

In fact, AdaNormalHedge.TV is a solution for one of the open problems proposed by Warmuth and Koolen (2014). Adaptive regret immediately implies the so-called $K$-shifting regret for the problem of tracking the best expert in a changing environment. Formally, define the $K$-shifting regret $R_{K\text{-Shift}}$ to be $\max \sum_{k=1}^{K} \sum_{t=t_{k-1}+1}^{t_k} r_{t,i_k}$, where the max is taken over all $i_1, \ldots, i_K \in [N]$ and $0 = t_0 < t_1 < \cdots < t_K = T$. In other words, the player is competing with the best $K$-partition of the whole game and the best expert on each of these partitions. Let $L_{K\text{-Shift}}$ be the total loss of such best partition (that is, the optimum is taken over the same space), and similarly define $\tilde{L}_{K\text{-Shift}}$. Since essentially $R_{K\text{-Shift}}$ is just the sum of $K$ adaptive regrets, using the bounds discussed above and the Cauchy-Schwarz inequality, we conclude that AdaNormalHedge.TV ensures $R_{K\text{-Shift}} = \hat{O}\Big(\sqrt{K \tilde{L}_{K\text{-Shift}} \ln(NT)} + K\ln(NT)\Big)$. Also, if the loss vectors are generated randomly on these $K$ intervals, each satisfying the condition stated in Theorem 2 with gap $\alpha$, then the regret is $\hat{O}\big(K\ln(NT)/\alpha\big)$ in expectation (the high probability bound is similar). These bounds are optimal up to logarithmic factors (Hazan and Seshadhri, 2007). This is exactly what was asked in Warmuth and Koolen (2014): whether there is an algorithm that can do optimally for both adversarial and stochastic losses in the problem of tracking the best expert. AdaNormalHedge.TV achieves this goal without knowing $K$ or any other information, while the solution provided by Sani et al. (2014) needs to know $K$ and other information ahead of time to get the same adversarial bound, and gives a worse stochastic bound.

#### Comparison to previous work.

For adaptive regret, the FLH algorithm by Hazan and Seshadhri (2007) treats any standard expert algorithm as a sleeping expert, and has an additive term in addition to the base algorithm’s regret (when no prior information is available), which adds up to a large term for $K$-shifting regret. Due to this extra additive regret, FLH enjoys neither first order bounds nor small regret in the stochastic setting, even if the base algorithm that it builds on provides these guarantees. On the other hand, FLH was proposed to achieve adaptive regret for any general online convex optimization problem. We point out that using AdaNormalHedge as the master algorithm in their framework will give improvements similar to those discussed here.

Adapt-ML-Prod is not directly applicable here for the corresponding sleeping expert problem since the total number of experts is unknown.

Another well-studied algorithm for this problem is “fixed share”. Fixed share was studied before for the simpler “log loss” setting (Herbster and Warmuth, 1998; Bousquet and Warmuth, 2003; Adamskiy et al., 2012; Koolen et al., 2012). Cesa-Bianchi et al. (2012) studied a generalized fixed share algorithm for the bounded loss setting considered here. With suitable prior knowledge to tune its parameters, their algorithm ensures optimal bounds for adaptive regret and for $K$-shifting regret. No better result is provided for the stochastic setting. More importantly, when no prior information is known, which is the case in practice, the best results one can extract from their analysis are much worse than our results.

### 5.2 General Time-Varying Competitors

We finally discuss the most general goal: compete with different $\mathbf{u}_t$ on different rounds. Recall $R(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{r}_t$ where $\mathbf{u}_t$ has nonnegative coordinates for all $t$ (note that $\mathbf{u}_t$ does not even have to be a distribution). Clearly, adaptive regret and $K$-shifting regret are special cases of this general notion. Intuitively, how large this regret is should be closely related to how much the competitor’s sequence $\mathbf{u}_{1:T}$ varies. Cesa-Bianchi et al. (2012) introduced a distance measurement to capture this variation: $V(\mathbf{u}_{1:T}) = \sum_{t=1}^{T}\sum_{i=1}^{N} [u_{t,i} - u_{t-1,i}]_+$, where we define $u_{0,i} = 0$ for all $i$. Fixed share is shown to ensure a regret bound in terms of this variation (Cesa-Bianchi et al., 2012) when the relevant quantities of the competitor sequence are known. No result was provided otherwise. (Although in Section 7.3 of Cesa-Bianchi et al. (2012) the authors mention an online tuning technique for the parameters, it only works for special cases, e.g. adaptive regret.) Here, we show that our parameter-free algorithm AdaNormalHedge.TV actually achieves almost the same bound without knowing any information beforehand. Moreover, while the results in Cesa-Bianchi et al. (2012) are specific to the fixed share algorithm, we prove the following results, which are independent of the concrete algorithms and may be of independent interest.
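To get a feel for the variation measure, the sketch below computes it for a few competitor sequences. It assumes the form $V(\mathbf{u}_{1:T}) = \sum_t \sum_i [u_{t,i} - u_{t-1,i}]_+$ with $\mathbf{u}_0 = \mathbf{0}$; the function name `variation` is hypothetical, and the input must contain at least one competitor.

```python
def variation(competitors):
    """V(u_{1:T}) = sum_t sum_i [u_{t,i} - u_{t-1,i}]_+ with u_0 = 0 (assumed form)."""
    total = 0.0
    previous = [0.0] * len(competitors[0])
    for u in competitors:
        # Only increases in a coordinate contribute to the variation.
        total += sum(max(cur - prev, 0.0) for cur, prev in zip(u, previous))
        previous = u
    return total
```

Under this definition a fixed distribution has variation 1, a sequence of unit vectors that switches experts twice has variation 3, and a single expert that is "switched on" only during one interval has variation 1, which matches the intuition that fixed competitors, shifting regret and adaptive regret are special cases.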

###### Theorem 4.

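Suppose an expert algorithm ensures $R_{[t_1,t_2],i} \le \sqrt{A \sum_{t=t_1}^{t_2} z_{t,i}}$ for any $t_1$, $t_2$ and $i$, where $z_{t,i} \ge 0$ can be anything depending on $t$ and $i$ (e.g. $|r_{t,i}|$, $\ell_{t,i}$, or constant $1$), and $A$ is a term independent of $t_1$, $t_2$ and $i$. Then this algorithm also ensures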

$$R(\mathbf{u}_{1:T}) \le \sqrt{A\, V(\mathbf{u}_{1:T}) \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{z}_t}.$$

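Specifically, for AdaNormalHedge.TV, plugging $A = \hat{O}(\ln(NT))$ and $z_{t,i} = |r_{t,i}|$ gives $R(\mathbf{u}_{1:T}) = \hat{O}\Big(\sqrt{V(\mathbf{u}_{1:T}) \ln(NT) \sum_{t=1}^{T} \mathbf{u}_t \cdot |\mathbf{r}_t|}\Big)$ where $|\mathbf{r}_t| = (|r_{t,1}|, \ldots, |r_{t,N}|)$.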

The key idea of the proof is to rewrite $R(\mathbf{u}_{1:T})$ as a weighted sum of several adaptive regrets in an optimal way (see Appendix C for the complete proof). This theorem tells us that while playing with time-varying competitors seems to be a harder problem, it is in fact not any harder than its special case: achieving adaptive regret on any interval. Although the result is independent of the algorithms, one still cannot derive bounds on $R(\mathbf{u}_{1:T})$ for FLH or fixed share based on their adaptive regret bounds, because when no prior information is available, the bounds on $R_{[t_1,t_2],i}$ for these algorithms grow with the absolute time $t_2$ rather than only with the interval length $t_2 - t_1$, which is not good enough. We refer the reader to Table 1 for a summary of this section.

## 6 Competing with the Best Pruning Tree

We now turn to our second application on predicting almost as well as the best pruning tree within a template tree. This problem was studied in the context of online learning by Helmbold and Schapire (1997) using the approach of Willems et al. (1993, 1995). It is also called the tree expert problem in Cesa-Bianchi and Lugosi (2006). Freund et al. (1997) proposed a generic reduction from a tree expert problem to a sleeping expert problem. Using this reduction with our new algorithm, we provide much better results compared to previous work (summarized in Table 2).

Specifically, consider a setting where on each round , the predictor has to make a decision from some convex set given some side information , and then a convex loss function is revealed and the player suffers loss . The predictor is given a template tree to consult. Starting from the root, each node of performs a test on to decide which of its children should perform the next test, until a leaf is reached. In addition to a test, each node (except the root) also makes a prediction based on . A pruning tree is a tree induced by replacing zero or more nodes (and associated subtrees) of by leaves. The prediction of a pruning tree given , denoted by , is naturally defined as the prediction of the leaf that reaches by traversing . The player’s goal is thus to predict almost as well as the best pruning tree in hindsight, that is, to minimize .

The idea of the reduction introduced by Freund et al. (1997) is to view each edge of as a sleeping expert (indexed by ), who is awake only when traversed by , and in that case predicts , the same prediction as the child node that it connects to. The predictor runs a sleeping expert algorithm with loss , and eventually predicts where denotes the set of edges of and is the output of the expert algorithm; thus by convexity of , we have . Note that we only care about the predictions of those awake experts since otherwise is required to be zero. Now let be one of the best pruning trees, that is, , and be the number of leaves of . In the expert problem, we will set the competitor to be a uniform distribution over the terminal edges (that is, the ones connecting the leaves) of , and the prior to be a uniform distribution over all the edges. Since on each round, one and only one of those experts is awake, and its prediction is exactly , we have , and therefore .
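The reduction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` and `SleepingAdaNormalHedge` classes, the toy tree and the absolute loss are hypothetical. The learner assumes the potential $\Phi(R, C) = \exp([R]_+^2 / (3C))$ and weight $w(R, C) = \frac{1}{2}(\Phi(R+1, C+1) - \Phi(R-1, C+1))$ defined earlier in the paper, uses a uniform prior (which cancels when normalizing), and falls back to a uniform distribution over awake experts when all weights vanish, which is an assumption of this sketch.

```python
import math

def phi(R, C):
    # Potential exp([R]_+^2 / (3C)); only ever called here with C >= 1.
    return math.exp(max(R, 0.0) ** 2 / (3.0 * C))

def weight(R, C):
    # w(R, C) = (phi(R + 1, C + 1) - phi(R - 1, C + 1)) / 2
    return 0.5 * (phi(R + 1, C + 1) - phi(R - 1, C + 1))

class SleepingAdaNormalHedge:
    """Hypothetical minimal sleeping-expert learner with a uniform prior."""
    def __init__(self):
        self.R = {}  # cumulative instantaneous regret to each expert seen
        self.C = {}  # cumulative absolute instantaneous regret

    def predict(self, awake):
        w = {e: weight(self.R.get(e, 0.0), self.C.get(e, 0.0)) for e in awake}
        total = sum(w.values())
        if total <= 0.0:  # assumption: uniform fallback when all weights are zero
            return {e: 1.0 / len(awake) for e in awake}
        return {e: w[e] / total for e in awake}

    def update(self, p, losses):
        avg = sum(p[e] * losses[e] for e in p)  # learner's expected loss
        for e in p:  # sleeping experts are not updated
            r = avg - losses[e]
            self.R[e] = self.R.get(e, 0.0) + r
            self.C[e] = self.C.get(e, 0.0) + abs(r)

class Node:
    def __init__(self, name, predict=None, test=None, children=()):
        self.name = name          # also identifies the edge entering this node
        self.predict = predict    # x -> prediction (None at the root)
        self.test = test          # x -> index of the next child (None at a leaf)
        self.children = list(children)

def awake_edges(root, x):
    """Nodes reached by x below the root; each stands for its incoming edge."""
    path, node = [], root
    while node.children:
        node = node.children[node.test(x)]
        path.append(node)
    return path

def run(root, rounds, loss):
    alg, total = SleepingAdaNormalHedge(), 0.0
    for x, y in rounds:
        preds = {n.name: n.predict(x) for n in awake_edges(root, x)}
        p = alg.predict(list(preds))
        y_hat = sum(p[e] * preds[e] for e in preds)  # combined prediction
        total += loss(y_hat, y)
        alg.update(p, {e: loss(preds[e], y) for e in preds})
    return total

def const(c):
    return lambda x: c

# Toy template tree on x in [0, 1) with constant predictions at non-root nodes.
left = Node("L", const(0.3), test=lambda x: int(x >= 0.25),
            children=[Node("LL", const(0.0)), Node("LR", const(1.0))])
right = Node("R", const(0.8))
root = Node("root", test=lambda x: int(x >= 0.5), children=[left, right])

rounds = [(0.1, 0.0), (0.3, 1.0), (0.7, 1.0), (0.1, 0.0), (0.3, 1.0)] * 20
total_loss = run(root, rounds, lambda a, b: abs(a - b))
```

Each non-root node doubles as the identifier of the edge entering it, so the awake set on a round is exactly the root-to-leaf path traversed by the side information. Experts are created lazily in the dictionaries, so memory grows only with the number of distinct edges actually traversed.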

It remains to pick a concrete sleeping algorithm to apply. There are two reasons that make AdaNormalHedge very suitable for this problem. First, since is clearly unknown ahead of time, we are competing with an unknown competitor , which is exactly what AdaNormalHedge can deal with. Second, the number of awake experts is dynamically changing, and as discussed before, in this case AdaNormalHedge enjoys a regret bound that is adaptive in the number of experts seen so far. Formally, recall the notation , which in this case represents the total number of distinct traversed edges up to round . Then by Theorem 3, we have

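$$R_{\mathcal{G}} = \hat{O}\left(m\sqrt{(\mathbf{u}\cdot\mathbf{C}_T)\,\mathrm{RE}(\mathbf{u}\,\|\,\mathbf{q})}\right) = \hat{O}\left(\sqrt{m\left(\sum_{t=1}^{T}\left|\hat{\ell}_t - f_t(P^*(x_t))\right|\right)\ln\left(\frac{N_T}{m}\right)}\right),$$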

which, by Theorem 2, implies where , which is at most the total loss of the best pruning tree . Moreover, the algorithm is efficient: the overall space requirement is , and the running time on round is where we use to denote the number of edges that traverses.

#### Comparison to other solutions.

The work by Freund et al. (1997) considers a variant of the exponential weights algorithm in a “log loss” setting, and is not directly applicable here (specifically it is not clear how to tune the learning rate appropriately). A better choice is the Adapt-ML-Prod algorithm by Gaillard et al. (2014) (the version for the sleeping expert problem). However, there is still one issue for this algorithm: it does not give a bound in terms of for an unknown . (In fact, even if it did, this term would still be dominated by a term; see the discussion at the end of Section A.3 of Gaillard et al. (2014), already mentioned in Section 3.) So to get a bound on , the best thing to do is to use the definition and a bound on each . In short, one can verify that Adapt-ML-Prod ensures regret where is the total number of edges/experts. We emphasize that can be much larger than when the tree is huge. Indeed, while is at most times the depth of , could be exponentially large in the depth. The running time and space of Adapt-ML-Prod for this problem, however, are the same as for AdaNormalHedge.

We finally compare with a totally different approach (Helmbold and Schapire, 1997), where one simply treats each pruning tree as an expert and runs the exponential weights algorithm. Clearly the number of experts is exponentially large, and thus the running time and space are unacceptable with a naive implementation. This issue is avoided by using a clever dynamic programming technique. If and are known ahead of time, then the regret for this algorithm is by tuning the learning rate optimally. As discussed in Freund et al. (1997), the linear dependence on in this bound is much better than the one of the form , which, in the worst case, is linear in . This was considered the main drawback of using the sleeping expert approach. However, the bound for AdaNormalHedge is , which is much smaller as discussed previously and in fact comparable to . More importantly, and are unknown in practice. In this case, no sublinear regret is known for this dynamic programming approach, since it relies heavily on the fact that the algorithm uses a fixed learning rate, and thus the usual time-varying learning rate methods cannot be applied here. Therefore, although theoretically this approach gives small regret, it is not a practical method. The running time is also slightly worse than that of the sleeping expert approach. For simplicity, suppose every internal node has children. Then the time complexity per round is . The overall space requirement is , the same as for the other approaches. Again, see Table 2 for a summary of this section.

Finally, as mentioned in Freund et al. (1997), the sleeping expert approach can be easily generalized to predicting with a decision graph. In that case, AdaNormalHedge still enjoys all the improvements discussed in this section (details omitted).

## References

• Adamskiy et al. (2012) Dmitry Adamskiy, Wouter M Koolen, Alexey Chernov, and Vladimir Vovk. A closer look at adaptive regret. In Algorithmic Learning Theory, pages 290–304, 2012.
• Blum (1997) Avrim Blum. Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1):5–23, 1997.
• Blum and Mansour (2007) Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
• Bousquet and Warmuth (2003) Olivier Bousquet and Manfred K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2003.
• Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
• Cesa-Bianchi et al. (1997) Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.
• Cesa-Bianchi et al. (2012) Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems 25, 2012.
• Chaudhuri et al. (2009) Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22, 2009.
• Chernov and Vovk (2010) Alexey Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.
• de Rooij et al. (2014) Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.
• Freund and Schapire (1997) Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
• Freund and Schapire (1999) Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
• Freund et al. (1997) Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, pages 334–343, 1997.
• Gaillard et al. (2014) Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Proceedings of the 27th Annual Conference on Learning Theory, 2014.
• Hazan and Seshadhri (2007) Elad Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. In Electronic Colloquium on Computational Complexity (ECCC), volume 14, 2007.
• Helmbold and Schapire (1997) David P. Helmbold and Robert E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, April 1997.
• Herbster and Warmuth (1995) Mark Herbster and Manfred Warmuth. Tracking the best expert. In Proceedings of the Twelfth International Conference on Machine Learning, pages 286–294, 1995.
• Herbster and Warmuth (1998) Mark Herbster and Manfred Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
• Herbster and Warmuth (2001) Mark Herbster and Manfred K Warmuth. Tracking the best linear predictor. The Journal of Machine Learning Research, 1:281–309, 2001.
• Jadbabaie et al. (2015) Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In The 18th International Conference on Artificial Intelligence and Statistics, 2015.
• Koolen et al. (2012) Wouter M Koolen, Dmitry Adamskiy, and Manfred K Warmuth. Putting Bayes to sleep. In Advances in Neural Information Processing Systems 25, pages 135–143, 2012.
• Littlestone and Warmuth (1994) Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
• Luo and Schapire (2014) Haipeng Luo and Robert E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Advances in Neural Information Processing Systems 27, 2014.
• McMahan and Orabona (2014) H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of the 27th Annual Conference on Learning Theory, 2014.
• Orabona (2013) Francesco Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems 26, pages 1806–1814, 2013.
• Orabona (2014) Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 27, 2014.
• Sani et al. (2014) Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems 27, 2014.
• Schapire (2001) Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, June 2001.
• Streeter and McMahan (2012) Matthew Streeter and Brendan McMahan. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems 25, pages 2402–2410, 2012.
• Van Erven et al. (2014) Tim Van Erven, Wojciech Kotlowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In Proceedings of the 27th Annual Conference on Learning Theory, 2014.
• Vovk (1998) V. G. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, April 1998.
• Warmuth and Koolen (2014) Manfred K. Warmuth and Wouter M. Koolen. Open problem: Shifting experts on easy data. In Proceedings of the 27th Annual Conference on Learning Theory, 2014.
• Willems et al. (1993) Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. Context tree weighting: A sequential universal source coding procedure for FSMX sources. In Proceedings 1993 IEEE International Symposium on Information Theory, page 59, 1993.
• Willems et al. (1995) Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

## Appendix A Complete proofs of Theorems 1 and 3

We need the following two lemmas. The first one is an improved version of Lemma 2 of Luo and Schapire (2014).

###### Lemma 1.

For any $R \in \mathbb{R}$, $C \ge 0$ and $r \in [-1, 1]$, we have

$$\Phi(R+r,\, C+|r|) \le \Phi(R, C) + w(R, C)\,r + \frac{3|r|}{2(C+1)}.$$

###### Proof.

We first argue that $\Phi(R+r, C+|r|)$, as a function of $r$, is piecewise-convex on $[-1, 0]$ and $[0, 1]$. Since the value of the function is $1$ when $R + r < 0$ and is at least $1$ otherwise, it suffices to only consider the case when $R + r \ge 0$. On the interval $[0, 1]$, we can rewrite the exponent (ignoring the constant $\frac{1}{3}$) as:

$$\frac{(R+r)^2}{C+r} = (C+r) + \frac{(R-C)^2}{C+r} + 2(R-C),$$

which is convex in $r$. Combining with the fact that “if $g(x)$ is convex then $\exp(g(x))$ is also convex” proves that $\Phi(R+r, C+|r|)$ is convex on $[0, 1]$. Similarly when $r \in [-1, 0]$, rewriting the exponent as

$$\frac{(R+r)^2}{C-r} = (C-r) + \frac{(R+C)^2}{C-r} - 2(R+C)$$

completes the argument.

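Now define the function $f(r) = \Phi(R+r, C+|r|) - w(R, C)\,r$. Since $f$ is clearly also piecewise-convex on $[-1, 0]$ and $[0, 1]$, we know that the curve of $f$ is below the segment connecting points $(-1, f(-1))$ and $(0, f(0))$ on $[-1, 0]$, and also below the segment connecting points $(0, f(0))$ and $(1, f(1))$ on $[0, 1]$. This can be mathematically expressed as: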

$$f(r) \le \max\bigl\{f(0) + (f(0) - f(-1))\,r,\; f(0) + (f(1) - f(0))\,r\bigr\} = f(0) + (f(1) - f(0))\,|r|,$$

where we use the fact $f(-1) = f(1)$. Now by Lemma 2 of Luo and Schapire (2014), we have

$$f(1) - f(0) = \frac{1}{2}\bigl(\Phi(R+1, C+1) + \Phi(R-1, C+1)\bigr) - \Phi(R, C) \le \frac{1}{2}\left(\exp\left(\frac{4}{3(C+1)}\right) - 1\right),$$

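which is at most $\frac{3}{2(C+1)}$ since $C$ is nonnegative and $e^x - 1 \le \frac{e^{4/3} - 1}{4/3}\,x < \frac{9}{4}x$ for any $x \in [0, \frac{4}{3}]$. Noting that $f(0) = \Phi(R, C)$ completes the proof. ∎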
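As a numerical sanity check of Lemma 1 (not a substitute for the proof), the following Python sketch evaluates the gap between the two sides of the inequality on a small grid. It assumes the potential $\Phi(R, C) = \exp([R]_+^2 / (3C))$ and the weight function $w(R, C) = \frac{1}{2}(\Phi(R+1, C+1) - \Phi(R-1, C+1))$ from the main text; the helper name `lemma1_gap` and the grid are illustrative, and the grid is restricted to $C \ge 0.5$ and $|R| \le C$, the regime that arises when $R$ and $C$ are cumulative sums of $r$ and $|r|$.

```python
import math

def phi(R, C):
    # Potential exp([R]_+^2 / (3C)); called only with C > 0 in this check.
    return math.exp(max(R, 0.0) ** 2 / (3.0 * C))

def w(R, C):
    # Weight function (phi(R + 1, C + 1) - phi(R - 1, C + 1)) / 2.
    return 0.5 * (phi(R + 1, C + 1) - phi(R - 1, C + 1))

def lemma1_gap(R, C, r):
    """Right-hand side minus left-hand side of Lemma 1; expected to be >= 0."""
    lhs = phi(R + r, C + abs(r))
    rhs = phi(R, C) + w(R, C) * r + 3 * abs(r) / (2 * (C + 1))
    return rhs - lhs

grid_C = [0.5, 1, 2, 5, 10, 50]
grid_r = [k / 20 for k in range(-20, 21)]  # r in [-1, 1] in steps of 0.05

# Smallest gap over the grid, with R ranging over [-C, C] in steps of C / 10.
worst = min(
    lemma1_gap(C * s / 10, C, r)
    for C in grid_C
    for s in range(-10, 11)
    for r in grid_r
)
```

A negative value of `worst` beyond floating-point tolerance would contradict the lemma under these assumed definitions. At $r = 0$ both sides coincide, so the gap is exactly zero there.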

The second lemma makes use of Lemma 1 to show that the weighted sum of potentials does not increase much and thus the final potential is relatively small.