Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods.
Tim van Erven Department of Mathematics VU University De Boelelaan 1081a 1081 HV Amsterdam, the Netherlands firstname.lastname@example.org Peter Grünwald Centrum Wiskunde & Informatica (CWI) Science Park 123, P.O. Box 94079 1090 GB Amsterdam, the Netherlands email@example.com Wouter M. Koolen CWI and Department of Computer Science Royal Holloway, University of London Egham Hill, Egham, Surrey TW20 0EX, United Kingdom firstname.lastname@example.org Steven de Rooij Centrum Wiskunde & Informatica (CWI) Science Park 123, P.O. Box 94079 1090 GB Amsterdam, the Netherlands email@example.com
1 Introduction
Decision-theoretic online learning (DTOL) is a framework to capture learning problems that proceed in rounds. It was introduced by Freund and Schapire [1] and is closely related to the paradigm of prediction with expert advice [2, 3, 4]. In DTOL an agent is given access to a fixed set of $K$ actions, and at the start of each round must make a decision by assigning a probability to every action. Then all actions incur a loss from the range $[0,1]$, and the agent’s loss is the expected loss of the actions under the probability distribution it produced. Losses add up over rounds and the goal for the agent is to minimize its regret after $T$ rounds, which is the difference in accumulated loss between the agent and the action that has accumulated the least amount of loss.
The most commonly studied strategy for the agent is called the Hedge algorithm [1, 5]. Its performance crucially depends on a parameter $\eta$ called the learning rate. Different ways of tuning the learning rate have been proposed, which all aim to minimize the regret for the worst possible sequence of losses the actions might incur. If $T$ is known to the agent, then the learning rate may be tuned to achieve worst-case regret bounded by $\sqrt{T \ln(K)/2}$, which is known to be optimal as $T$ and $K$ become large [4]. Nevertheless, by slightly relaxing the problem, one can obtain better guarantees. Suppose for example that the cumulative loss $L_T^*$ of the best action is known to the agent beforehand. Then, if the learning rate is set appropriately, the regret is bounded by $\sqrt{2 L_T^* \ln K} + \ln K$, which has the same asymptotics as the previous bound in the worst case (because $L_T^* \le T$) but may be much better when $L_T^*$ turns out to be small. Similarly, Hazan and Kale [6] obtain a bound of order $\sqrt{\mathrm{VAR}_T^{\max} \ln K}$ for a modification of Hedge if the cumulative empirical variance $\mathrm{VAR}_T^{\max}$ of the best expert is known. In applications it may be unrealistic to assume that $T$ or (especially) $L_T^*$ or $\mathrm{VAR}_T^{\max}$ is known beforehand, but at the cost of slightly worse constants such problems may be circumvented using either the doubling trick (setting a budget on the unknown quantity and restarting the algorithm with a double budget when the budget is depleted) [4, 7, 6], or a variable learning rate that is adjusted each round [4, 8].
Bounding the regret in terms of $L_T^*$ or $\mathrm{VAR}_T^{\max}$ is based on the idea that worst-case performance is not the only property of interest: such bounds give essentially the same guarantee in the worst case, but a much better guarantee in a plausible favourable case (when $L_T^*$ or $\mathrm{VAR}_T^{\max}$ is small). In this paper, we pursue the same goal for a different favourable case. To illustrate our approach, consider the following simplistic example with two actions: let be such that . Then in odd rounds the first action gets loss and the second action gets loss ; in even rounds the actions get losses and , respectively. Informally, this seems like a very easy instance of DTOL, because the cumulative losses of the actions diverge and it is easy to see from the losses which action is the best one. In fact, the Follow-the-Leader strategy, which puts all probability mass on the action with smallest cumulative loss, gives a regret of at most in this case; the worst-case bound is very loose by comparison, and so is , which is of the same order . On the other hand, for Follow-the-Leader one cannot guarantee sublinear regret for worst-case instances. (For example, if one out of two actions yields losses and the other action yields losses , its regret will be at least .) To get the best of both worlds, we introduce an adaptive version of Hedge, called AdaHedge, that automatically adapts to the difficulty of the problem by varying the learning rate appropriately. As a result we obtain constant regret for the simplistic example above and other ‘easy’ instances of DTOL, while at the same time guaranteeing $O(\sqrt{L_T^* \ln K})$ regret in the worst case.
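The linear worst-case regret of Follow-the-Leader can be checked numerically. The following Python sketch is illustrative only: the particular loss sequence (a first round with losses (1/2, 0), followed by strictly alternating losses), the function name `follow_the_leader_regret`, and the tie-breaking rule (lowest index wins) are assumptions made for this demonstration and are not taken from the text.

```python
def follow_the_leader_regret(loss_vectors):
    """Regret of Follow-the-Leader, which plays the action with the smallest
    cumulative loss so far (ties broken in favour of the lowest index)."""
    K = len(loss_vectors[0])
    cum = [0.0] * K          # cumulative loss of every action
    ftl_loss = 0.0           # cumulative loss of Follow-the-Leader
    for loss in loss_vectors:
        leader = min(range(K), key=lambda k: cum[k])
        ftl_loss += loss[leader]
        cum = [c + l for c, l in zip(cum, loss)]
    return ftl_loss - min(cum)

T = 100
# Hypothetical adversarial sequence: after the first round the current
# leader is always the action that is about to incur loss 1.
hard = [(0.5, 0.0)] + [(0.0, 1.0) if t % 2 == 0 else (1.0, 0.0)
                       for t in range(T - 1)]
ftl_regret = follow_the_leader_regret(hard)
print(ftl_regret)  # grows linearly with T
```

On this sequence the leader changes every round, so Follow-the-Leader pays loss 1 in every round after the first while the best action pays roughly half of that.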
It remains to characterise what we consider easy problems, which we will do in terms of the probabilities produced by Hedge. As explained below, these may be interpreted as a generalisation of Bayesian posterior probabilities. We measure the difficulty of the problem in terms of the speed at which the posterior probability of the best action converges to one. In the previous example, this happens at an exponential rate, whereas for worst-case instances the posterior probability of the best action does not converge to one at all.
In the next section we describe a new way of tuning the learning rate, and show that it yields essentially optimal performance guarantees in the worst case. To construct the AdaHedge algorithm, we then add the doubling trick to this idea in Section 3, and analyse its worst-case regret. In Section 4 we show that AdaHedge in fact incurs much smaller regret on easy problems. We compare AdaHedge to other instances of Hedge by means of a simulation study in Section 5. The proof of our main technical lemma is postponed to Section 6, and open questions are discussed in the concluding Section 7. Finally, longer proofs are only available as Additional Material in the full version at arXiv.org.
2 Tuning the Learning Rate
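Let the available actions be indexed by $k \in \{1, \ldots, K\}$. At the start of each round $t = 1, 2, \ldots$ the agent is to assign a probability $w_t^k$ to each action $k$ by producing a vector $w_t = (w_t^1, \ldots, w_t^K)$ with nonnegative components that sum up to $1$. Then every action $k$ incurs a loss $\ell_t^k \in [0,1]$, which we collect in the loss vector $\ell_t = (\ell_t^1, \ldots, \ell_t^K)$, and the loss of the agent is $w_t \cdot \ell_t$. After $T$ rounds action $k$ has accumulated loss $L_T^k = \sum_{t=1}^T \ell_t^k$, and the agent’s regret is
$$R(T) = \sum_{t=1}^T w_t \cdot \ell_t - L_T^*,$$
where $L_T^* = \min_k L_T^k$ is the cumulative loss of the best action.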
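The Hedge algorithm chooses the weights $w_{t+1}^k$ proportional to $e^{-\eta L_t^k}$, where $\eta > 0$ is the learning rate. As is well-known, these weights may essentially be interpreted as Bayesian posterior probabilities on actions, relative to a uniform prior and pseudo-likelihoods $e^{-\eta L_t^k}$ [9, 10, 4]:
$$w_{t+1}^k = \frac{e^{-\eta L_t^k}}{\sum_{k'} e^{-\eta L_t^{k'}}} = \frac{\frac{1}{K} e^{-\eta L_t^k}}{B_t}, \qquad \text{where} \qquad B_t = \sum_{k} \frac{1}{K} e^{-\eta L_t^k} \qquad (1)$$
is a generalisation of the Bayesian marginal likelihood. And like the ordinary marginal likelihood, $B_t$ factorizes into sequential per-round contributions:
$$B_t = \prod_{s=1}^{t} w_s \cdot e^{-\eta \ell_s}, \qquad (2)$$
where $e^{-\eta \ell_s}$ denotes the vector with components $e^{-\eta \ell_s^k}$.
We will sometimes write $w_t(\eta)$ and $B_t(\eta)$ instead of $w_t$ and $B_t$ in order to emphasize the dependence of these quantities on $\eta$.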
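As a sanity check on the posterior interpretation, the following Python sketch computes Hedge weights proportional to $e^{-\eta L}$ and verifies numerically that the product of the per-round factors $w_t \cdot e^{-\eta \ell_t}$ equals the uniform-prior average $\frac{1}{K}\sum_k e^{-\eta L_T^k}$. The function names `hedge_weights` and `run_hedge`, the example losses and the value of `eta` are hypothetical choices for illustration.

```python
import math

def hedge_weights(cum_losses, eta):
    """Hedge weights, proportional to exp(-eta * cumulative loss)."""
    shift = min(cum_losses)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-eta * (L - shift)) for L in cum_losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def run_hedge(loss_vectors, eta):
    """Return (Hedge's cumulative loss, log of the product of per-round
    factors w_t . exp(-eta * loss_t), final cumulative losses)."""
    K = len(loss_vectors[0])
    cum = [0.0] * K
    hedge_loss = 0.0
    log_B = 0.0
    for loss in loss_vectors:
        w = hedge_weights(cum, eta)
        hedge_loss += sum(wk * lk for wk, lk in zip(w, loss))
        log_B += math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss)))
        cum = [c + l for c, l in zip(cum, loss)]
    return hedge_loss, log_B, cum

losses = [[0.2, 0.9, 0.5], [0.1, 0.4, 1.0], [0.3, 0.8, 0.0]]
eta = 0.7
hedge_loss, log_B, cum = run_hedge(losses, eta)
# Direct evaluation of the "marginal likelihood" under a uniform prior.
log_B_direct = math.log(sum(math.exp(-eta * L) for L in cum) / len(cum))
print(abs(log_B - log_B_direct))  # agreement up to rounding error
```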
The Learning Rate and the Mixability Gap
A key quantity in our and previous analyses is the gap between the per-round loss of the Hedge algorithm and the per-round contribution to the negative logarithm of the “marginal likelihood” $B_T$, which we call the mixability gap:
$$\delta_t = w_t \cdot \ell_t - \Big( -\tfrac{1}{\eta} \ln\big( w_t \cdot e^{-\eta \ell_t} \big) \Big).$$
In the setting of prediction with expert advice, the subtracted term coincides with the loss incurred by the Aggregating Pseudo-Algorithm (APA) which, by allowing the losses of the actions to be mixed with optimal efficiency, provides an idealised lower bound for the actual loss of any prediction strategy. The mixability gap measures how closely we approach this ideal. As the same interpretation still holds in the more general DTOL setting of this paper, we can measure the difficulty of the problem, and tune $\eta$, in terms of the cumulative mixability gap:
$$\Delta_T = \sum_{t=1}^T \delta_t = \sum_{t=1}^T w_t \cdot \ell_t + \tfrac{1}{\eta} \ln B_T.$$
We proceed to list some basic properties of the mixability gap. First, it is nonnegative and bounded above by a constant that depends on $\eta$:
Lemma 1. For any $t$ and $\eta > 0$ we have $0 \le \delta_t \le \eta/8$.
Proof. The lower bound follows by applying Jensen’s inequality to the concave function $\ln$, the upper bound from Hoeffding’s bound on the cumulant generating function [4, Lemma A.1]. ∎
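The two-sided bound on the mixability gap (nonnegative by Jensen, at most $\eta/8$ by Hoeffding's bound for losses in $[0,1]$) is easy to probe numerically. The sketch below draws random weight vectors, loss vectors and learning rates; the helper name `mixability_gap`, the sampling ranges and the random seed are arbitrary assumptions.

```python
import math
import random

def mixability_gap(w, loss, eta):
    """Per-round Hedge loss minus the mix loss -(1/eta) ln(w . exp(-eta*loss))."""
    hedge_loss = sum(wk * lk for wk, lk in zip(w, loss))
    mix_loss = -math.log(sum(wk * math.exp(-eta * lk)
                             for wk, lk in zip(w, loss))) / eta
    return hedge_loss - mix_loss

rng = random.Random(0)
min_gap = float("inf")   # smallest gap seen (should not be negative)
worst_ratio = 0.0        # largest observed value of gap / (eta / 8)
for _ in range(10000):
    K = rng.randint(2, 6)
    raw = [rng.random() + 1e-9 for _ in range(K)]
    w = [r / sum(raw) for r in raw]            # random probability vector
    loss = [rng.random() for _ in range(K)]    # losses in [0, 1]
    eta = rng.uniform(0.01, 5.0)
    gap = mixability_gap(w, loss, eta)
    min_gap = min(min_gap, gap)
    worst_ratio = max(worst_ratio, gap / (eta / 8))

print(min_gap, worst_ratio)
```

The upper bound is approached when the weight is split evenly between an action with loss 0 and an action with loss 1 and $\eta$ is small.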
Further, the cumulative mixability gap $\Delta_T$ can be related to $L_T^*$ via the following upper bound, proved in the Additional Material:
Lemma 2. For any $T$ and $\eta \in (0,1]$ we have $\Delta_T \le \frac{\eta L_T^* + \ln K}{e-1}$.
This relationship will make it possible to provide worst-case guarantees similar to what is possible when $\eta$ is tuned in terms of $L_T^*$. However, for easy instances of DTOL this inequality is very loose, in which case we can prove substantially better regret bounds. We could now proceed by optimizing the learning rate $\eta$ given the rather awkward assumption that $\Delta_T$ is bounded by a known constant $b$ for all $\eta$, which would be the natural counterpart to an analysis that optimizes $\eta$ when a bound on $L_T^*$ is known. However, as $\Delta_T$ varies with $\eta$ and is unknown a priori anyway, it makes more sense to turn the analysis on its head and start by fixing $\eta$. We can then simply run the Hedge algorithm until the smallest $T$ such that $\Delta_T$ exceeds an appropriate budget $b(\eta)$, which we set to
$$b(\eta) = \Big( \frac{1}{\eta} + \frac{1}{e-1} \Big) \ln K. \qquad (3)$$
When at some point the budget is depleted, i.e. $\Delta_T \ge b(\eta)$, Lemma 2 implies that
$$\eta \ge \sqrt{(e-1)\, \frac{\ln K}{L_T^*}}, \qquad (4)$$
so that, up to a constant factor, the learning rate used by AdaHedge is at least as large as the learning rates proportional to $\sqrt{\ln(K)/L_T^*}$ that are used in the literature. On the other hand, it is not too large, because we can still provide a bound of order $O(\sqrt{L_T^* \ln K})$ on the worst-case regret:
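Lemma 3. Suppose the agent runs Hedge with learning rate $\eta \in (0,1]$, and after $T$ rounds has just used up the budget (3), i.e. $b(\eta) \le \Delta_T < b(\eta) + \eta/8$. Then its regret is bounded by
$$R_{\mathrm{Hedge}(\eta)}(T) < \sqrt{\frac{4}{e-1}} \sqrt{L_T^* \ln K} + \frac{1}{e-1} \ln K + \frac{1}{8}.$$
Proof. The cumulative loss of Hedge is bounded by
$$\sum_{t=1}^T w_t \cdot \ell_t = \Delta_T - \frac{1}{\eta} \ln B_T < b(\eta) + \frac{\eta}{8} - \frac{1}{\eta} \ln B_T \le \frac{1}{e-1} \ln K + \frac{1}{8} + \frac{2}{\eta} \ln K + L_T^*, \qquad (5)$$
where we have used the bound $B_T \ge \frac{1}{K} e^{-\eta L_T^*}$. Plugging in (4) completes the proof. ∎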
3 The AdaHedge Algorithm
We now introduce the AdaHedge algorithm by adding the doubling trick to the analysis of the previous section. The doubling trick divides the rounds in segments $i = 1, \ldots, m$, and on each segment restarts Hedge with a different learning rate $\eta_i$. For AdaHedge we set $\eta_1 = 1$ initially, and scale down the learning rate by a factor of $\phi > 1$ for every new segment, such that $\eta_i = \phi^{1-i}$. We monitor $\Delta_t$, measured only on the losses in the $i$-th segment, and when it exceeds its budget $b_i = b(\eta_i)$ a new segment is started. The factor $\phi$ is a parameter of the algorithm. Theorem 5 below suggests setting its value to the golden ratio $\phi = (1+\sqrt{5})/2 \approx 1.62$ or simply to $\phi = 2$.
The regret of AdaHedge is determined by the number of segments it creates: the fewer segments there are, the smaller the regret.
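Lemma 4. Suppose that after $T$ rounds, the AdaHedge algorithm has started $m$ new segments. Then its regret is bounded by
$$R_{\mathrm{AdaHedge}}(T) < 2 \ln K \, \frac{\phi^m - 1}{\phi - 1} + m \Big( \frac{1}{e-1} \ln K + \frac{1}{8} \Big).$$
Proof. The regret per segment is bounded as in (5). Summing over all $m$ segments, and plugging in $\sum_{i=1}^m 1/\eta_i = \sum_{i=0}^{m-1} \phi^i = (\phi^m - 1)/(\phi - 1)$ gives the required inequality. ∎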
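The segment-and-restart scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the budget $(1/\eta + 1/(e-1)) \ln K$, the initial learning rate $\eta_1 = 1$ and the rule $\eta_{i+1} = \eta_i/\phi$ are reconstructed from the surrounding analysis, and the function name `adahedge` and both test sequences are hypothetical.

```python
import math

def adahedge(loss_vectors, phi=2.0):
    """Doubling-trick AdaHedge sketch. Returns (regret, number of segments)."""
    K = len(loss_vectors[0])

    def budget(eta):
        # Assumed budget on the cumulative mixability gap within a segment.
        return (1.0 / eta + 1.0 / (math.e - 1.0)) * math.log(K)

    eta = 1.0
    seg_cum = [0.0] * K     # cumulative losses within the current segment
    seg_delta = 0.0         # cumulative mixability gap within the segment
    total_cum = [0.0] * K   # cumulative losses over all rounds
    agent_loss = 0.0
    segments = 1
    for loss in loss_vectors:
        if seg_delta >= budget(eta):
            # Budget depleted: restart Hedge with a smaller learning rate.
            eta /= phi
            seg_cum = [0.0] * K
            seg_delta = 0.0
            segments += 1
        shift = min(seg_cum)
        unnorm = [math.exp(-eta * (c - shift)) for c in seg_cum]
        z = sum(unnorm)
        w = [u / z for u in unnorm]
        hedge_loss = sum(wk * lk for wk, lk in zip(w, loss))
        mix_loss = -math.log(sum(wk * math.exp(-eta * lk)
                                 for wk, lk in zip(w, loss))) / eta
        agent_loss += hedge_loss
        seg_delta += hedge_loss - mix_loss
        seg_cum = [c + l for c, l in zip(seg_cum, loss)]
        total_cum = [c + l for c, l in zip(total_cum, loss)]
    return agent_loss - min(total_cum), segments

T = 1000
easy = [(0.2, 0.8)] * T                                    # one action clearly better
hard = [(0.0, 1.0) if t % 2 == 0 else (1.0, 0.0) for t in range(T)]
easy_regret, easy_segments = adahedge(easy)
hard_regret, hard_segments = adahedge(hard)
print(easy_regret, easy_segments, hard_regret, hard_segments)
```

On the easy sequence the weights concentrate quickly, the mixability gap is summable, and the first budget is never exhausted; on the alternating sequence the weights keep oscillating and several restarts occur.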
Using (4), one can obtain an upper bound on the number of segments that leads to the following guarantee for AdaHedge:
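Theorem 5. Suppose the agent runs AdaHedge for $T$ rounds. Then its regret is bounded by
$$R_{\mathrm{AdaHedge}}(T) \le \frac{\phi \sqrt{\phi^2 - 1}}{\phi - 1} \sqrt{\frac{4}{e-1}} \sqrt{L_T^* \ln K} + O\big( \ln(L_T^* + 2) \ln K \big).$$
For details see the proof in the Additional Material. The value for $\phi$ that minimizes the leading factor is the golden ratio $\phi = (1+\sqrt{5})/2$, for which $\frac{\phi \sqrt{\phi^2 - 1}}{\phi - 1} \approx 3.33$, but simply taking $\phi = 2$ leads to a very similar factor of $\frac{\phi \sqrt{\phi^2 - 1}}{\phi - 1} \approx 3.46$.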
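The claim about the golden ratio can be checked with a short grid search. The sketch assumes that the leading factor has the form $\phi\sqrt{\phi^2-1}/(\phi-1)$, which is a reconstruction consistent with the doubling analysis (setting the derivative of its logarithm to zero gives $\phi^2 - \phi - 1 = 0$); the function name `leading_factor` and the grid are arbitrary choices.

```python
import math

def leading_factor(phi):
    """Assumed phi-dependent factor in front of sqrt(L* ln K)."""
    return phi * math.sqrt(phi * phi - 1.0) / (phi - 1.0)

golden = (1.0 + math.sqrt(5.0)) / 2.0
grid = [1.01 + 0.0001 * i for i in range(20000)]   # phi from 1.01 to about 3.01
best_phi = min(grid, key=leading_factor)

print(best_phi, leading_factor(golden), leading_factor(2.0))
```

The minimum is quite flat, which is why the simpler choice $\phi = 2$ costs only a few percent in the leading constant.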
4 Easy Instances
While the previous sections reassure us that AdaHedge performs well for the worst possible sequence of losses, we are also interested in its behaviour when the losses are not maximally antagonistic. We will characterise such sequences in terms of convergence of the Hedge posterior probability of the best action:
$$w_t^* = \max_k w_t^k.$$
(Recall that $w_t^k$ is proportional to $e^{-\eta L_{t-1}^k}$, so $w_t^*$ corresponds to the posterior probability of the action with smallest cumulative loss.) Technically, this is expressed by the following refinement of Lemma 1, which is proved in Section 6.
Lemma 6. For any $t$ and $\eta \in (0,1]$ we have $\delta_t \le (e-2)\, \eta\, (1 - w_t^*)$.
This lemma, which may be of independent interest, is a variation on Hoeffding’s bound on the cumulant generating function. While Lemma 1 leads to a bound on $\Delta_T$ that grows linearly in $T$, Lemma 6 shows that $\Delta_T$ may grow much slower. In fact, if the posterior probabilities $w_t^*$ converge to $1$ sufficiently quickly, then $\Delta_T$ is bounded, as shown by the following lemma. Recall that $L_T^* = \min_k L_T^k$.
Let and be positive constants, and let . Suppose that for there exists a single action that achieves minimal cumulative loss , and for the cumulative losses diverge as . Then for all
where is a constant that does not depend on or .
The lemma is proved in the Additional Material. Together with Lemmas 1 and 6, it gives an upper bound on , which may be used to bound the number of segments started by AdaHedge. This leads to the following result, whose proof is also delegated to the Additional Material.
Let denote the round in which AdaHedge starts its -th segment, and let denote the cumulative loss of action in that segment.
In the simplistic example from the introduction, we may take and , such that (6) is satisfied for any . Taking large enough to ensure that , we find that AdaHedge never starts more than segments. Let us also give an example of a probabilistic setting in which Lemma 8 applies:
Let and be constants, and let be a fixed action. Suppose the loss vectors are independent random variables such that the expected differences in loss satisfy
Then, with probability at least , AdaHedge starts at most
segments and consequently its regret is bounded by a constant:
This shows that the probabilistic setting of the theorem is much easier than the worst case, for which only a bound on the regret of order is possible, and that AdaHedge automatically adapts to this easier setting. The proof of Theorem 9 is in the Additional Material. It verifies that the conditions of Lemma 8 hold with sufficient probability for , and and as in the theorem.
5 Experiments
We compare AdaHedge to other hedging algorithms in two experiments involving simulated losses.
5.1 Hedging Algorithms
Follow-the-Leader. This algorithm is included because it is simple and very effective if the losses are not antagonistic, although as mentioned in the introduction its regret is linear in the worst case.
Hedge with fixed learning rate. We also include Hedge with a fixed learning rate
$$\eta = \sqrt{2 \ln(K) / L_T^*}, \qquad (9)$$
which achieves the regret bound $\sqrt{2 L_T^* \ln K} + \ln K$. (Footnote 1: Cesa-Bianchi and Lugosi use $\eta = \ln\big(1 + \sqrt{2 \ln(K)/L_T^*}\big)$, but the same bound can be obtained for the simplified expression we use.) Since $\eta$ is a function of $L_T^*$, the agent needs to use post-hoc knowledge to use this strategy.
Hedge with doubling trick. The common way to apply the doubling trick to $L_T^*$ is to set a budget on $L_T^*$ and multiply it by some constant at the start of each new segment, after which $\eta$ is optimized for the new budget [4, 7]. Instead, we proceed the other way around and with each new segment first divide $\eta$ by $\phi = 2$ and then calculate the new budget such that (9) holds when $L_t^*$ reaches the budget. This way we keep the same invariant ($\eta$ is never larger than the right-hand side of (9), with equality when the budget is depleted), and the frequency of doubling remains logarithmic in $L_T^*$ with a constant determined by $\phi$, so both approaches are equally valid. However, controlling the sequence of values of $\eta$ allows for easier comparison to AdaHedge.
AdaHedge (Algorithm 3). Like in the previous algorithm, we set $\phi = 2$. Because of how we set up the doubling, both algorithms now use the same sequence of learning rates $\eta_1, \eta_2, \ldots$; the only difference is when they decide to start a new segment.
Hedge with variable learning rate. Rather than using the doubling trick, this algorithm, described in , changes the learning rate each round as a function of . This way there is no need to relearn the weights of the actions in each block, which leads to a better worst-case bound and potentially better performance in practice. Its behaviour on easy problems, as we are currently interested in, has not been studied.
5.2 Generating the Losses
In both experiments we choose losses in $\{0,1\}$. The experiments are set up as follows.
I.I.D. losses. In the first experiment, all losses for all actions are independent, with distribution depending only on the action: the probabilities of incurring loss are , , and , respectively. The results are then averaged over repetitions of the experiment.
Correlated losses. In the second experiment, the loss vectors are still independent, but no longer identically distributed. In addition there are dependencies within the loss vectors , between the losses for the available actions: each round is hard with probability , and easy otherwise. If round is hard, then action yields loss with probability and action yields loss with probability . If the round is easy, then the probabilities are flipped and the actions yield loss with the same probabilities. The results are averaged over repetitions.
5.3 Discussion and Results
Figure 1 shows the results of the experiments above. We plot the regret (averaged over repetitions of the experiment) as a function of the number of rounds, for each of the considered algorithms.
In the first considered regime, the accumulated losses for each action diverge linearly with high probability, so that the regret of Follow-the-Leader is bounded. Based on Theorem 9 we expect AdaHedge to incur bounded regret also; this is confirmed in Figure 1. Hedge with a fixed learning rate shows much larger regret. This happens because the learning rate, while it optimizes the worst-case bound, is much too small for this easy regime. In fact, if we included more rounds, the learning rate would be set to an even smaller value, clearly showing the need to determine the learning rate adaptively. The doubling trick provides one way to adapt the learning rate; indeed, we observe that the regret of Hedge with the doubling trick is initially smaller than the regret of Hedge with fixed learning rate. However, unlike AdaHedge, the algorithm never detects that its current value of $\eta$ is working well; instead it keeps exhausting its budget, which leads to a sequence of clearly visible bumps in its regret. Finally, it appears that the Hedge algorithm with variable learning rate also achieves bounded regret. This is surprising, as the existing theory for this algorithm only considers its worst-case behaviour, and the algorithm was not designed to do specifically well in easy regimes.
In the second simulation we investigate the case where the mean cumulative loss of two actions is extremely close: within of one another. If the losses of the actions were independent, such a small difference would be dwarfed by random fluctuations in the cumulative losses, which would be of order $O(\sqrt{T})$. Thus the two actions can only be distinguished because we have made their losses dependent. Depending on the application, this may actually be a more natural scenario than complete independence as in the first simulation; for example, we can think of the losses as mistakes of two binary classifiers, say, two naive Bayes classifiers with different smoothing parameters. In such a scenario, losses will be dependent, and the difference in cumulative loss will be much smaller than . In the previous experiment, the posterior weights of the actions converged relatively quickly for a large range of learning rates, so that the exact value of the learning rate was most important at the start (e.g., from rounds onward Hedge with fixed learning rate does not incur much additional regret any more). In this second setting, using a high learning rate remains important throughout. This explains why in this case Hedge with variable learning rate can no longer keep up with Follow-the-Leader. The results for AdaHedge are also interesting: although Theorem 9 does not apply in this case, we may still hope that $\Delta_t$ grows slowly enough that the algorithm does not start too many segments. This turns out to be the case: over the repetitions of the experiment, AdaHedge started only segments on average, which explains its excellent performance in this simulation.
6 Proof of Lemma 6
Our main technical tool is Lemma 6. Its proof requires the following intermediate result:
Lemma 10. For any $\eta > 0$ and any time $t$, the function $f(\ell_t) = \ln\big( w_t \cdot e^{-\eta \ell_t} \big)$ is convex.
This may be proved by observing that $f$ is the convex conjugate of the Kullback-Leibler divergence. An alternative proof based on log-convexity is provided in the Additional Material.
Proof of Lemma 6.
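We need to bound $\delta_t = w_t \cdot \ell_t + \frac{1}{\eta} \ln\big( w_t \cdot e^{-\eta \ell_t} \big)$, which is a convex function of $\ell_t$ by Lemma 10. As a consequence, its maximum is achieved when $\ell_t$ lies on the boundary of its domain, such that the losses $\ell_t^k$ are either $0$ or $1$ for all $k$, and in the remainder of the proof we will assume (without loss of generality) that this is the case. Now let $\alpha = w_t \cdot \ell_t$ be the posterior probability of the actions with loss $1$. Then
$$\delta_t = \alpha + \frac{1}{\eta} \ln\big( (1 - \alpha) + \alpha e^{-\eta} \big) = \alpha + \frac{1}{\eta} \ln\big( 1 + \alpha (e^{-\eta} - 1) \big).$$
Using $\ln x \le x - 1$ and $e^{-\eta} \le 1 - \eta + \frac{1}{2} \eta^2$, we get $\delta_t \le \frac{1}{2} \alpha \eta$, which is tight for $\alpha$ near $0$. For $\alpha$ near $1$, rewrite
$$\delta_t = \alpha - 1 + \frac{1}{\eta} \ln\big( e^{\eta} (1 - \alpha) + \alpha \big)$$
and use $\ln x \le x - 1$ and $e^{\eta} \le 1 + \eta + (e-2) \eta^2$ for $\eta \le 1$ to obtain $\delta_t \le (e-2)(1 - \alpha) \eta$. Combining the bounds, we find
$$\delta_t \le (e-2)\, \eta \min\{\alpha, 1 - \alpha\}.$$
Now, let $k$ be an action such that $w_t^k = w_t^*$. Then $\ell_t^k = 0$ implies $\alpha \le 1 - w_t^*$. On the other hand, if $\ell_t^k = 1$, then $\alpha \ge w_t^*$ so $1 - \alpha \le 1 - w_t^*$. Hence, in both cases $\min\{\alpha, 1 - \alpha\} \le 1 - w_t^*$, which completes the proof. ∎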
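The refined bound on the mixability gap can also be probed numerically. The sketch below assumes the bound has the form $\delta_t \le (e-2)\,\eta\,(1 - \max_k w_t^k)$ for $\eta \in (0,1]$, which is what the two-case argument above yields; the helper name `gap_and_refined_bound`, the sampling ranges and the seed are hypothetical.

```python
import math
import random

def gap_and_refined_bound(w, loss, eta):
    """Return the mixability gap and the assumed bound (e-2)*eta*(1 - max w)."""
    hedge_loss = sum(wk * lk for wk, lk in zip(w, loss))
    mix_loss = -math.log(sum(wk * math.exp(-eta * lk)
                             for wk, lk in zip(w, loss))) / eta
    bound = (math.e - 2.0) * eta * (1.0 - max(w))
    return hedge_loss - mix_loss, bound

rng = random.Random(42)
max_violation = float("-inf")   # largest observed value of gap - bound
for _ in range(10000):
    K = rng.randint(2, 6)
    raw = [rng.random() + 1e-9 for _ in range(K)]
    w = [r / sum(raw) for r in raw]
    # Mix interior losses with 0/1 losses, where the bound is tightest.
    if rng.random() < 0.5:
        loss = [float(rng.randint(0, 1)) for _ in range(K)]
    else:
        loss = [rng.random() for _ in range(K)]
    eta = rng.uniform(0.01, 1.0)
    gap, bound = gap_and_refined_bound(w, loss, eta)
    max_violation = max(max_violation, gap - bound)

print(max_violation)  # not positive, up to rounding error
```

Unlike the $\eta/8$ bound, this bound vanishes as the weight of the leading action approaches one, which is what makes a summable cumulative gap possible on easy instances.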
7 Conclusion and Future Work
We have presented a new algorithm, AdaHedge, that adapts to the difficulty of the DTOL learning problem. This difficulty was characterised in terms of convergence of the posterior probability of the best action. For hard instances of DTOL, for which the posterior does not converge, it was shown that the regret of AdaHedge is of the optimal order $O(\sqrt{L_T^* \ln K})$; for easy instances, for which the posterior converges sufficiently fast, the regret was bounded by a constant. This behaviour was confirmed in a simulation study, where the algorithm outperformed existing versions of Hedge.
A surprising observation in the experiments was the good performance of Hedge with a variable learning rate on some easy instances. It would be interesting to obtain matching theoretical guarantees, like those presented here for AdaHedge. A starting point might be to consider how fast the posterior probability of the best action converges to one, and plug that into Lemma 6.
The authors would like to thank Wojciech Kotłowski for useful discussions. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886, and by NWO Rubicon grant 680-50-1010. This publication only reflects the authors’ views.
References
[1] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
[2] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[3] V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.
[4] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[5] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[6] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 57–67, 2008.
[7] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
[8] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.
[9] V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
[10] D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.
[11] A. N. Shiryaev. Probability. Springer-Verlag, 1996.
Appendix A Additional Material
Proof of Lemma 2.
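Lemma A.3 in [4] gives the bound $\ln \mathbb{E}[e^{sX}] \le (e^s - 1)\, \mathbb{E}[X]$ for any random variable $X$ taking values in $[0,1]$ and any $s \in \mathbb{R}$. Defining $X = \ell_t^k$ with $k$ drawn according to the distribution $w_t$, setting $s = -\eta$ and dividing by the negative factor $-\eta$, we obtain
$$-\frac{1}{\eta} \ln\big( w_t \cdot e^{-\eta \ell_t} \big) \ge \frac{1 - e^{-\eta}}{\eta}\, w_t \cdot \ell_t.$$
It follows that
$$\Delta_T \le \Big( \frac{\eta}{1 - e^{-\eta}} - 1 \Big) \Big( -\frac{1}{\eta} \ln B_T \Big) \le \Big( \frac{\eta}{1 - e^{-\eta}} - 1 \Big) \Big( L_T^* + \frac{\ln K}{\eta} \Big) \le \frac{\eta L_T^* + \ln K}{e-1},$$
where the second inequality uses $B_T \ge \frac{1}{K} e^{-\eta L_T^*}$ and the last inequality uses that $\frac{1}{1 - e^{-\eta}} - \frac{1}{\eta}$ is increasing in $\eta$, so that it is at most $\frac{1}{e-1}$ for $\eta \in (0,1]$. ∎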
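The resulting inequality can be checked on random data. The sketch below assumes the bound reads $\Delta_T \le (\eta L_T^* + \ln K)/(e-1)$ for $\eta \in (0,1]$, as obtained from the derivation above; the function name `cumulative_mixability_gap`, the sampling ranges and the seed are hypothetical.

```python
import math
import random

def cumulative_mixability_gap(loss_vectors, eta):
    """Run Hedge with a uniform prior and fixed eta.
    Returns (cumulative mixability gap, cumulative loss of the best action)."""
    K = len(loss_vectors[0])
    cum = [0.0] * K
    delta = 0.0
    for loss in loss_vectors:
        shift = min(cum)
        unnorm = [math.exp(-eta * (c - shift)) for c in cum]
        z = sum(unnorm)
        w = [u / z for u in unnorm]
        hedge_loss = sum(wk * lk for wk, lk in zip(w, loss))
        mix_loss = -math.log(sum(wk * math.exp(-eta * lk)
                                 for wk, lk in zip(w, loss))) / eta
        delta += hedge_loss - mix_loss
        cum = [c + l for c, l in zip(cum, loss)]
    return delta, min(cum)

rng = random.Random(1)
max_excess = float("-inf")   # largest observed value of Delta_T - bound
for _ in range(200):
    K = rng.randint(2, 5)
    T = rng.randint(1, 200)
    eta = rng.uniform(0.05, 1.0)
    losses = [[rng.random() for _ in range(K)] for _ in range(T)]
    delta, best = cumulative_mixability_gap(losses, eta)
    bound = (eta * best + math.log(K)) / (math.e - 1.0)
    max_excess = max(max_excess, delta - bound)

print(max_excess)  # negative: the bound holds with room to spare on random losses
```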
Proof of Theorem 5.
In order to apply Lemma 4 we will need to bound the number of segments . To this end, let denote the cumulative loss of the best action on the -th segment. That is, if the -th segment spans rounds , then . If , then the theorem is true by Lemma 4, so suppose that . Then we know that the budgets for the first segments have been depleted, so that for these segments (4) applies, giving:
Solving for , we find
Substitution in Lemma 4 gives
where the last step uses . Rearranging yields the theorem. ∎
Proof of Lemma 7.
For between and we have
where the integral can be evaluated using the variable substitution and the fact that . ∎
Proof of Lemma 8.
be a version of measured only on the rounds since the start of the -th segment. We bound the first terms in this sum using Lemma 1 and the remaining terms using Lemmas 6 and 7 (Footnote 2: Lemma 7 applies to a single run of the Hedge algorithm. As AdaHedge restarts Hedge at the start of every new segment, the times in Lemma 7 should be interpreted relative to the current segment, .), which gives
for any .
We will argue that the budget in segment is never depleted: , for which it is sufficient to show that
which is true by definition of . ∎
Proof of Theorem 9.
We will show that the conditions of Lemma 8 (with the same , and ) are satisfied with probability at least . For , let denote the event that for , let denote the event that , and let denote the intersection of these events. Using (7), it can be seen that on , as required by (6). Hence we need to show that the probability of is at least , or equivalently that the probability of the complementary event is at most .
By Hoeffding’s inequality  the probabilities of the complementary events and may each be bounded by . And hence by the union bound the probability of is bounded by . Again by the union bound it follows that
We require this probability to be bounded by , for which it is sufficient that
By definition of , this is implied by
which holds for our choice (8) of . To show that AdaHedge starts at most segments, it remains to verify the other condition of Lemma 8, which is that . This follows from (10) upon observing that (7) implies so that .
Finally, the bound on the regret is obtained by plugging into Lemma 4. ∎
Proof of Lemma 10.
Within this proof, let us drop the subscript from and , and define the function for every action . Let and be arbitrary loss vectors, and let also be arbitrary. Then it is sufficient to show that
where . Towards this end, we start by observing that is log-convex:
Inequality 11 now follows from the general fact that a convex combination of log-convex functions is itself log-convex, which we will proceed to prove: using first (12) and then applying Hölder’s inequality (see e.g. ) one obtains
from which (11) follows by taking natural logarithms. ∎