
# Towards Minimax Online Learning with Unknown Time Horizon

## Abstract

We consider online learning when the time horizon is unknown. We apply a minimax analysis, beginning with the fixed horizon case, and then moving on to two unknown-horizon settings: one that assumes the horizon is chosen randomly according to some known distribution, and one that allows the adversary full control over the horizon. For the random horizon setting with restricted losses, we derive a fully optimal minimax algorithm. For the adversarial horizon setting, we prove a nontrivial lower bound showing that the adversary obtains strictly more power than when the horizon is fixed and known. Based on the minimax solution of the random horizon setting, we then propose a new adaptive algorithm which “pretends” that the horizon is drawn from a distribution from a special family, but no matter how the actual horizon is chosen, the worst-case regret is of the optimal rate. Furthermore, our algorithm can be combined and applied in many ways, for instance, to online convex optimization, follow the perturbed leader, the exponential weights algorithm, and first order bounds. Experiments show that our algorithm outperforms many other existing algorithms in an online linear optimization setting.

## 1 Introduction

We study online learning problems with unknown time horizon with the aim of developing algorithms and approaches for the realistic case that the number of time steps is initially unknown.

We first adopt the standard Hedge setting Freund & Schapire (1997) where the learner chooses a distribution over N actions on each round, and the losses for each action are then selected by an adversary. The learner incurs loss equal to the expected loss of the actions under the distribution it chose for this round, and its goal is to minimize the regret, that is, the difference between its cumulative loss and that of the best action after T rounds.

Various algorithms are known to achieve the optimal (up to a constant) upper bound on the regret. Most of them assume that the horizon T is known ahead of time, especially those which are minimax optimal. When the horizon is unknown, the so-called doubling trick Cesa-Bianchi et al. (1997) is a general technique to make a learning algorithm adaptive and still achieve low regret uniformly for any T. The idea is to first guess a horizon, and once the actual horizon exceeds this guess, double it and restart the algorithm. Although it is widely applicable in theory, the doubling trick is aesthetically inelegant and intuitively wasteful, since it repeatedly restarts itself, entirely forgetting all the preceding information. Other approaches have also been proposed, as we discuss shortly.
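
The restart structure of the doubling trick can be sketched as follows (a minimal illustration; the base-algorithm factory `make_algorithm` is a hypothetical placeholder for any fixed-horizon algorithm):

```python
def doubling_trick(total_rounds, make_algorithm):
    """Run a fixed-horizon base algorithm in epochs of doubling length.

    At the start of each epoch the base algorithm is restarted from
    scratch with the new guessed horizon, forgetting everything learned
    so far -- the wasteful aspect criticized in the text.
    Returns the list of (guessed horizon, rounds actually played).
    """
    t, guess, epochs = 0, 1, []
    while t < total_rounds:
        alg = make_algorithm(guess)           # fresh restart for this epoch
        steps = min(guess, total_rounds - t)  # play until the guess is exceeded
        epochs.append((guess, steps))
        t += steps
        guess *= 2                            # double the guessed horizon
    return epochs
```

For example, 10 actual rounds are covered by epochs with guessed horizons 1, 2, 4, 8, the last epoch being cut short.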

In this paper, we study the problem of learning with unknown horizon in a game-theoretic framework. We consider a number of variants of the problem, and make progress toward a minimax solution. Based on this approach, we give a new general technique which can also make other minimax or non-minimax algorithms adaptive and achieve low regret in a very general online learning setting. The resulting algorithm is still not exactly optimal, but it makes use of all the previous information on each round and achieves much lower regret in experiments.

We view the Hedge problem as a repeated game between the learner and the adversary. Abernethy et al. (2008b), and Abernethy & Warmuth (2010) proposed an exact minimax optimal solution for a slightly different game with binary losses, assuming that the loss of the best action is at most some fixed constant. They derived the solution under a very simple type of loss space; that is, on each round only one action suffers one unit loss. We call this the basis vector loss space. As a preliminary of this paper, we also derive a similar minimax solution under this simple loss space for our setting where the horizon is fixed and known to the learner ahead of time.

We then move on to the primary interest of this paper, that is, the case when the horizon is unknown to the learner. We study this unknown horizon setting in the minimax framework, with the aim of ultimately deriving game-theoretically optimal algorithms. Two types of models are studied. The first one assumes the horizon is chosen according to some known distribution, and the learner’s goal is to minimize the expected regret. We show the exact minimax solution for the basis vector loss space in this case. It turns out that the distribution the learner should choose on each round is simply the conditional expectation of the distributions the learner would have chosen for the fixed horizon case.

The second model we study gives the adversary the power to decide the horizon on the fly, which is possibly the most adversarial case. In this case, we no longer use the regret as the performance measure, for otherwise the adversary would obviously choose an infinite horizon. Instead, we use a scaled regret to measure the performance. Specifically, we scale the regret at time T by the optimal regret under fixed and known horizon T. The exact optimal solution in this case is unfortunately not found and remains an open problem, even for the extremely simple case of two actions. However, we give a lower bound for this setting showing that the optimal scaled regret is strictly greater than in the fixed horizon game. That is, the adversary does obtain strictly more power if allowed to pick the horizon.

We then propose our new adaptive algorithm based on the minimax solution in the random horizon setting. One might doubt how realistic a random horizon is in practice. Even if the true horizon is indeed drawn from a fixed distribution, how can we know this distribution? We address these problems at the same time. Specifically, we prove that no matter how the horizon is chosen, if we assume it is drawn from a distribution from a special family, and let the learner play in a way similar to the one in the random horizon setting, then the worst-case regret at any time (not the expected regret) can still be of the optimal order. In other words, although the learner is behaving as if the horizon is random, its regret will be small even if the horizon is actually controlled by an adversary. Moreover, the results hold for not just the Hedge problem, but a general online learning setting that includes many interesting problems.

Our idea can be combined not only with the minimax algorithm, but also with the “follow the perturbed leader” algorithm and the exponential weights algorithm. In addition, our technique can deal not only with an unknown horizon, but also with other unknown information such as the loss of the best action, thus leading to a first order regret bound that depends on the loss of the best action Cesa-Bianchi & Lugosi (2006). Like the doubling trick, this seems to be a quite general way to make an algorithm adaptive. Furthermore, we conduct experiments showing that our algorithm outperforms many existing algorithms, including the doubling trick, in an online linear optimization setting within an ℓ_2 ball, where our algorithm has an explicit closed form.

The rest of the paper is organized as follows. We define the Hedge setting formally in Section 2, and derive the minimax solution for the fixed horizon setting as the preliminary of this paper in Section 3. In Section 4, we study two unknown horizon settings in the minimax framework. We then turn to a general online learning setting and present our new adaptive algorithm in Section 5. Implementation issues, experiments, and applications are discussed in Section 6. We omit most of the proofs due to space limitations, but all details can be found in the supplementary material.

#### Related work

Besides the doubling trick, other adaptive algorithms have been studied Auer et al. (2002); Gentile (2003); Yaroshinsky et al. (2004); Chaudhuri et al. (2009). Auer et al. (2002) showed that for algorithms such as the exponential weights algorithm Littlestone & Warmuth (1994); Freund & Schapire (1997, 1999), where a learning rate η should be set as a function of the horizon, typically in the form η = c/√T for some constant c, one can simply set the learning rate adaptively as η_t = c/√t, where t is the current number of rounds. In other words, this algorithm always pretends that the current round is the last round. Although this idea works with the exponential weights algorithm, we remark that assuming the current round is the last round does not always work. Specifically, one can show that it will fail if applied to the minimax algorithm (see Section 6.4). In another approach to online learning with unknown horizon, Chaudhuri et al. (2009) proposed an adaptive algorithm based on a novel potential function reminiscent of the half-normal distribution.

Other performance measures different from the usual regret have been studied before. Foster & Vohra (1998) introduced internal regret, comparing the loss of an online algorithm to the loss of a modified algorithm which consistently replaces one action by another. Herbster & Warmuth (1995), and Bousquet & Warmuth (2003), compared the learner’s loss with the best k-shifting expert, while Hazan & Seshadhri (2007) studied the usual regret within any time interval. To the best of our knowledge, the form of scaled regret that we study is new. Lower bounds on anytime regret in terms of the quadratic variation of any loss sequence (instead of the worst-case sequence this paper considers) were studied by Gofer & Mansour (2012).

## 2 Repeated Games

We first consider the following repeated game between a learner and an adversary. The learner has access to N actions. On each round t = 1, 2, …: (1) the learner chooses a distribution P_t over the N actions; (2) the adversary reveals the loss vector Z_t ∈ L, where Z_{t,i} is the loss for action i for this round, and the loss space L is a subset of [0,1]^N; (3) the learner suffers loss P_t · Z_t for this round.

Notice that the adversary can choose the losses on round t with full knowledge of the history P_{1:t} and Z_{1:t−1}, that is, all the previous choices of the learner and the adversary (we use the notation u_{1:t} to denote the multiset {u_1, …, u_t}). We also denote the cumulative losses up to round t for the learner and the actions by ∑_{s=1}^t P_s · Z_s and M_t = ∑_{s=1}^t Z_s respectively. The goal for the learner is to minimize the difference between its total loss and that of the best action at the end of the game. In other words, the goal of the learner is to minimize Reg(P_{1:T}, Z_{1:T}), where we define the regret function Reg(P_{1:t}, Z_{1:t}) = ∑_{s=1}^t P_s · Z_s − min_i M_{t,i} for any t and any loss sequence. The number of rounds T is called the horizon.

Regarding the loss space L, perhaps the simplest one is {e_1, …, e_N}, the set of standard basis vectors in N dimensions. Playing with this loss space means that on each round, the adversary chooses one single action to incur one unit of loss. In order to convey the intuition of our main results, we mainly focus on this basis vector loss space in Sections 3 and 4, but we return to the most general case later.

## 3 Minimax Solution for Fixed Horizon

Although our primary interest in this paper is the case when the horizon is unknown to the learner, we first present some preliminary results on the setting where the horizon is known to both the learner and the adversary ahead of time. These will later be useful for the unknown horizon case.

If we treat the learner as an algorithm that takes the information of previous rounds as input, and outputs a distribution that the learner is going to play with, then finding the optimal solution in this fixed horizon setting can be viewed as solving the minimax expression

 min_{P_1} max_{Z_1} ⋯ min_{P_T} max_{Z_T} Reg(P_{1:T}, Z_{1:T}). (1)

Alternatively, we can recursively define:

 V(M,0) ≜ −min_i M_i ;  V(M,r) ≜ min_{P∈Δ(N)} max_{Z∈L} (P·Z + V(M+Z, r−1)),

where M ∈ R^N is a loss vector, r is a nonnegative integer, and Δ(N) is the N-dimensional simplex. By a simple argument, one can show that the value of V(M,r) is the regret of a game with r rounds starting from the situation that each action i has initial loss M_i, assuming both the learner and the adversary play optimally. In fact, the value of Eq. (1) is exactly V(0,T), and the optimal learner algorithm is the one that chooses the P which realizes the minimum in the definition of V(M,r) when the actions’ cumulative loss vector is M and there are r rounds left. We call V(0,T) the value of the game.

As a concrete illustration of these ideas, we now consider the basis vector loss space, that is, L = {e_1, …, e_N}. It turns out that under this loss space, the value function V has a nice closed form. Similar to the results from Cesa-Bianchi et al. (1997) and Abernethy et al. (2008b), we show that V can be expressed in terms of a random walk. Suppose R(M,r) is the expectation of the loss of the best action if the adversary chooses each Z_t uniformly at random from {e_1, …, e_N} for the remaining r rounds, starting from loss vector M. Formally, R can be defined in a recursive way: R(M,0) = min_i M_i and R(M,r) = (1/N) ∑_{i=1}^N R(M+e_i, r−1). The connection between V and R, and the optimal algorithm, are then shown by the following theorem.

###### Theorem 1.

If L = {e_1, …, e_N}, then for any vector M and nonnegative integer r,

 V(M,r) = r/N − R(M,r).

Moreover, there is a constant c_N depending only on N such that the value of the game satisfies

 V(0,T) ≤ c_N √T. (2)

Moreover, on round t, the optimal learner algorithm is the one that chooses weight P_{t,i} = 1/N + R(M+e_i, r−1) − R(M,r) for each action i, where M = M_{t−1} is the current cumulative loss vector and r is the number of remaining rounds, that is, r = T − t + 1.

Theorem 1 tells us that under the basis vector loss space, the best way to play is to assume that the adversary is playing uniformly at random, because r/N and R(M,r) are exactly the resulting expected losses for the learner and for the best action respectively. In practice, computing R exactly needs exponential time. However, we can estimate it by sampling (see similar work in Abernethy et al., 2008b). Note that c_N is decreasing in N. So contrary to the regret bound for the general loss space, which is increasing in N, here the value of the game is of order √T with a constant that does not grow with N.
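
To make this concrete, the sketch below computes R by brute force for tiny games, estimates it by sampling as suggested above, and forms the weights of Theorem 1 (in the equalized form P_i = 1/N + R(M+e_i, r−1) − R(M,r), our reading of the theorem). The exact recursion enumerates all N^r continuations, so it is only meant for small N and r:

```python
import random

def R_exact(M, r):
    # Expected loss of the best action after r more uniformly random
    # unit losses, starting from cumulative loss vector M (brute force).
    if r == 0:
        return min(M)
    N = len(M)
    return sum(R_exact(M[:i] + [M[i] + 1] + M[i+1:], r - 1)
               for i in range(N)) / N

def R_sample(M, r, trials=20000, rng=None):
    # Monte Carlo estimate of the same quantity.
    rng = rng or random.Random(0)
    N, total = len(M), 0.0
    for _ in range(trials):
        cnt = [0] * N
        for _ in range(r):
            cnt[rng.randrange(N)] += 1
        total += min(m + c for m, c in zip(M, cnt))
    return total / trials

def minimax_weights(M, r):
    # Weights of Theorem 1: P_i = 1/N + R(M + e_i, r - 1) - R(M, r).
    N = len(M)
    base = R_exact(M, r)
    return [1.0 / N + R_exact(M[:i] + [M[i] + 1] + M[i+1:], r - 1) - base
            for i in range(N)]
```

The weights always sum to one by the recursive definition of R, and the sampling estimate converges to the exact value.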

## 4 Playing without Knowing the Horizon

We turn now to the case in which the horizon is unknown to the learner, which is often more realistic in practice. There are several ways of modeling this setting. For example, the horizon can be chosen ahead of time according to some fixed distribution, or it can even be chosen by the adversary. We will discuss these two variants separately.

### 4.1 Random Horizon

Suppose the horizon T is chosen according to some fixed distribution p which is known to both the learner and the adversary. Before the game starts, a random T is drawn from p, and neither the learner nor the adversary knows the actual value of T. The game stops after T rounds, and the learner aims to minimize the expectation of the regret. Using our earlier notation, the problem can be formally defined as

 min_{P_1} max_{Z_1} min_{P_2} max_{Z_2} ⋯ E_{T∼p}[Reg(P_{1:T}, Z_{1:T})],

where we assume the expectation is always finite. We sometimes omit the subscript T∼p for simplicity.

Continuing the example in Section 3 of the basis vector loss space, we can again show the exact minimax solution, which has a strong connection with the one for the fixed horizon setting.

###### Theorem 2.

If L = {e_1, …, e_N}, then the minimax expected regret is

 E_T[V(0,T)]. (3)

Moreover, on round t, the optimal learner plays with the distribution P_t = E[P_t^T | T ≥ t], where P_t^T is the optimal distribution the learner would play on round t if the horizon were known to be T, that is, the one given by Theorem 1 with r = T − t + 1 rounds remaining.

Eq. (3) tells us that if the horizon is drawn from some distribution, then even though the learner does not know the actual horizon before playing the game, as long as the adversary does not know this information either, the learner can still do as well in expectation as when they are both aware of the horizon.
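
In code, the prescription of Theorem 2 is just a posterior mixture of fixed-horizon plays. The sketch below uses a toy stand-in for the fixed-horizon strategy P_t^T, since computing the true minimax play is a separate (and expensive) subroutine:

```python
def conditional_mixture(t, prior, opt_dist):
    """P_t = E[P_t^T | T >= t]: mix the fixed-horizon plays over the
    posterior of the horizon, given that the game reached round t.

    prior    -- dict mapping horizon T to Pr[T] (finite support)
    opt_dist -- function (t, T) -> distribution played with known horizon T
    """
    support = {T: p for T, p in prior.items() if T >= t}
    mass = sum(support.values())
    mix = None
    for T, p in support.items():
        d = opt_dist(t, T)
        if mix is None:
            mix = [0.0] * len(d)
        for i, w in enumerate(d):
            mix[i] += (p / mass) * w
    return mix

def toy_opt(t, T):
    # Hypothetical stand-in for P_t^T, NOT the true minimax strategy:
    # play more uniformly when many rounds remain.
    r = T - t + 1
    return [0.5 + 0.1 / r, 0.5 - 0.1 / r]
```

The output is always a valid distribution, and horizons already ruled out by reaching round t get zero posterior weight.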

However, so far this model does not seem very useful in practice, for several reasons. First of all, the horizon might not be chosen according to a distribution. Even if it is, this distribution is probably unknown. Secondly, what we really care about is performance that holds uniformly for any horizon, rather than the expected regret. Last but not least, one might conjecture that a result similar to Theorem 2 should hold for other more general loss spaces, which is in fact not true (see Example 1 in the supplementary file), making the result seem even less useful.

Fortunately, we address all these problems and develop new adaptive algorithms based on the result in this section. We discuss these in Section 5 after first introducing the fully adversarial model.

### 4.2 Adversarial Horizon

The most adversarial setting is the one where the horizon is completely controlled by the adversary. That is, we let the adversary decide whether to continue or stop the game on each round according to the current situation. However, notice that the value of the game is increasing in the horizon. So if the adversary can determine the horizon and its goal is still to maximize the regret, then the problem would not make sense, because the adversary would clearly choose to play the game forever and never stop, leading to infinite regret. One reasonable way to address this issue is to scale the regret by the value V(0,T) of the fixed horizon game, so that the scaled regret indicates how many times worse the regret is compared to the one that is optimal given the horizon. Under this setting, the corresponding minimax value, which we denote by ν, is

 ν ≜ min_{P_1} max_{Z_1} min_{P_2} max_{Z_2} ⋯ Reg(P_{1:T}, Z_{1:T}) / V(0,T),

where the adversary also decides, after each round, whether to stop the game at the current horizon T.

Unfortunately, finding the minimax solution in this setting seems to be quite challenging, even for the simplest case N = 2. It is clear, however, that ν is at most some constant, due to the existence of adaptive algorithms such as the doubling trick, which achieve the optimal regret bound up to a constant factor without knowing T. Another clear fact is ν ≥ 1, since it is impossible for the learner to do better than in the case when it is aware of the horizon. Below, we derive a nontrivial lower bound that is strictly greater than 1, thus proving that the adversary does gain strictly more power when it can stop the game whenever it wants.

###### Theorem 3.

If N = 2 and L = {e_1, e_2}, then ν is strictly greater than 1. That is, there is a constant ν > 1 such that for every algorithm, there exists an adversary and a horizon T for which the regret of the learner after T rounds is at least ν V(0,T).

## 5 A New General Adaptive Algorithm

We study next how the random-horizon algorithm of Section 4.1 can be used when the horizon is entirely unknown, and furthermore, for a much more general class of online learning problems. In Theorem 2, we proposed an algorithm that simply takes the conditional expectation of the distributions we would have played if the horizon were given. Notice that even though it is derived from the random horizon setting, it can still be used in any setting as an adaptive algorithm, in the sense that it does not require the horizon as a parameter. However, to use this algorithm, we should ask two questions: What distribution p should we use? And what can we say about the algorithm’s performance for an arbitrary horizon, instead of in expectation?

As a first attempt, suppose we use a uniform distribution over {1, …, T_max}, where T_max is a huge integer. From what we observe in some numerical calculations, P_t tends to be a uniform distribution in this case. Clearly it cannot be a good algorithm if on each round it just places equal weight on each action regardless of the actions’ behaviors. In fact, one can verify that the exponential distribution (that is, Pr[T = t] proportional to e^{−λt} for some constant λ > 0) also does not work. These examples show that even though this algorithm gives us the optimal expected regret, it can still suffer a big regret on a particular trial of the game, which we definitely want to avoid.

Nevertheless, it turns out that there does exist a family of distributions that can guarantee the regret to be of order √T for any T. Moreover, this is true for a very general online learning problem that includes the Hedge setting we have been discussing. Before stating our results, we first formally describe this general setting, which is sometimes called the online convex optimization problem Zinkevich (2003); Shalev-Shwartz (2011). Let S be a compact convex set, and F be a set of convex functions defined on S. On each round t: (1) the learner chooses a point x_t ∈ S; (2) the adversary chooses a loss function f_t ∈ F; (3) the learner suffers loss f_t(x_t) for this round. The regret after T rounds is defined by

 Reg(x_{1:T}, f_{1:T}) = ∑_{t=1}^T f_t(x_t) − min_{x∈S} ∑_{t=1}^T f_t(x).

It is clear that the Hedge problem is a special case of the above setting with S being the probability simplex Δ(N), and F being the set of linear functions defined by points in the loss space, that is, F = {x ↦ x·Z : Z ∈ L}. Similarly, to study the minimax algorithm we define the following function of the multiset M of loss functions we have encountered and the number of remaining rounds r:

 V_{S,F}(M,0) ≜ −min_{x∈S} ∑_{f∈M} f(x) ;  V_{S,F}(M,r) ≜ min_{x∈S} max_{f∈F} (f(x) + V_{S,F}(M⊎{f}, r−1)),

where ⊎ denotes multiset union. We omit the subscripts of V whenever there is no confusion. Let x_t^T be the output of the minimax algorithm on round t when the horizon is T. In other words, x_t^T realizes the minimum in the definition of V(f_{1:t−1}, T−t+1). We will adapt the idea of Section 4.1 and study the adaptive algorithm that outputs x_t = E[x_t^T | T ≥ t] on round t for a distribution p on the horizon. One mild assumption needed is

###### Assumption 1.

V(M, r) ≤ V(M, r+1) for any multiset M of functions in F and any integer r ≥ 0.

Roughly speaking, this assumption implies that the game is in the adversary’s favor: playing more rounds leads to greater regret. It holds for the Hedge setting with the basis vector loss space (see Property 7 in the supplementary file). In fact, it also holds as long as F contains the zero function f_0 ≡ 0. To see this, simply observe that

 V(M,r) = min_{x∈S} max_{f∈F} (f(x) + V(M⊎{f}, r−1)) ≥ V(M⊎{f_0}, r−1) ≥ … ≥ V(M⊎{f_0,…,f_0}, 0) = V(M,0).

So the assumption is mild and will hold for all the examples we consider.

Below, we first give a general upper bound on the regret that holds for any distribution p and has no dependence on the choices of the adversary. After that, we will show which distributions make this bound O(√T_s) for any stopping time T_s.

###### Theorem 4.

Let V̄_t(M) ≜ E[V(M, T−t+1) | T ≥ t] and q_t ≜ Pr[T = t | T ≥ t]. Suppose Assumption 1 holds, and on round t the learner chooses x_t = E[x_t^T | T ≥ t], where x_t^T is the output of the minimax algorithm as described above. Then for any T_s, the regret after T_s rounds is at most

 V̄_1(∅) + ∑_{t=1}^{T_s} q_t V̄_{t+1}(∅).

To prove Theorem 4, we first show the following lemma.

###### Lemma 1.

For any r ≥ 0 and multisets M_1 and M_2,

 V(M_1⊎M_2, r) − V(M_1, 0) ≤ V(M_2, r). (5)
###### Proof.

If r = 0, then Eq. (5) holds since

 min_{x∈S} ∑_{f∈M_1} f(x) + min_{x∈S} ∑_{f∈M_2} f(x) ≤ min_{x∈S} ∑_{f∈M_1⊎M_2} f(x).

Now assume Eq. (5) holds for r−1. By induction one has

 V(M_1⊎M_2, r) − V(M_1, 0) = min_{x∈S} max_{f∈F} (f(x) + V(M_1⊎M_2⊎{f}, r−1)) − V(M_1, 0) ≤ min_{x∈S} max_{f∈F} (f(x) + V(M_2⊎{f}, r−1)) = V(M_2, r),

concluding the proof. ∎

###### Proof of Theorem 4.

By the definition of x_t^T, we have

 V(f_{1:t−1}, T−t+1) = max_{f∈F} (f(x_t^T) + V(f_{1:t−1}⊎{f}, T−t)) ≥ f_t(x_t^T) + V(f_{1:t}, T−t).

Therefore, by convexity and the fact that x_t = E[x_t^T | T ≥ t] ∈ S, the loss of the algorithm on round t is

 f_t(x_t) = f_t(E[x_t^T | T ≥ t]) ≤ E[f_t(x_t^T) | T ≥ t] ≤ E[V(f_{1:t−1}, T−t+1) − V(f_{1:t}, T−t) | T ≥ t] = V̄_t(f_{1:t−1}) − q_t V(f_{1:t}, 0) − (1−q_t) V̄_{t+1}(f_{1:t}) ≤ V̄_t(f_{1:t−1}) − V̄_{t+1}(f_{1:t}) + q_t V̄_{t+1}(∅),

where the last inequality holds because V̄_{t+1}(f_{1:t}) − V(f_{1:t}, 0) ≤ V̄_{t+1}(∅) by Lemma 1. We conclude the proof by summing up over t = 1, …, T_s and pointing out that V(f_{1:T_s}, 0) ≤ V̄_{T_s+1}(f_{1:T_s}) by Assumption 1. ∎

As a direct corollary, we now show an appropriate choice of p. We assume that the optimal regret under the fixed horizon setting is of order √T. That is:

###### Assumption 2.

For any r, V(∅, r) ≤ c√r for some constant c that might depend on S and F.

This is proven to be true in the literature for all the examples we consider, especially when F consists of linear functions.

###### Theorem 5.

Under Assumption 2 and the same conditions as Theorem 4, if Pr[T = t] ∝ t^{−d} where d > 3/2 is a constant, then for any T_s, the regret after T_s rounds is at most

 (Γ(d−3/2)/Γ(d)) (d−1)² c √(π T_s) + o(√(T_s)),

where Γ is the gamma function. Choosing d around 2.3 approximately minimizes the main term in the bound, leading to regret approximately 3c√(T_s).

Theorem 5 tells us that pretending that the horizon is drawn from such a distribution always achieves low regret, even if the actual horizon is chosen adversarially. Also notice that the constant in the bound for the √T_s term is less than the one for the doubling trick with the fixed horizon optimal algorithm, which is (√2/(√2−1))c ≈ 3.41c Cesa-Bianchi & Lugosi (2006). We will see in Section 6.1 an experiment showing that our algorithm performs much better than the doubling trick.

It is straightforward to apply our new algorithm to different instances of the online convex optimization framework. Examples include Hedge with the basis vector loss space, predicting with expert advice Cesa-Bianchi et al. (1997), and online linear optimization within an ℓ_2 ball Abernethy et al. (2008a) or an ℓ_∞ ball McMahan & Abernethy (2013). These are examples where minimax algorithms for fixed horizon are already known. In theory, however, our algorithm is still applicable when the minimax algorithm is unknown, such as Hedge with a general loss space.

## 6 Implementation and Applications

In this section, we discuss implementation issues of our new algorithm, and also show that the idea of using a “pretend prior distribution” is much more applicable in online learning than we have discussed so far.

### 6.1 Closed Form of the Algorithm

Among the examples listed at the end of Section 5, we are especially interested in online linear optimization within an ℓ_2 ball, since our algorithm enjoys an explicit closed form in this case. Specifically, we consider the following problem (all norms are ℓ_2 norms): take S = {x : ‖x‖ ≤ 1} and F = {x ↦ z·x : ‖z‖ ≤ 1}. In other words, the adversary also chooses a point in the unit ball on each round, which we denote by z_t. Abernethy et al. (2008a) showed a simple but exactly minimax optimal algorithm for the fixed horizon setting (when the dimension is at least 3): on each round t, choose

 x_t^T = −W_{t−1}/√(‖W_{t−1}‖² + (T−t+1)), (6)

where W_t = ∑_{s=1}^t z_s. This strategy guarantees the regret to be at most √T. To make this algorithm adaptive, we again assign a distribution to the horizon. However, in order to get an explicit form for x_t, a continuous distribution on T is necessary. This may not seem to make sense at first glance, since the horizon is always an integer, but keep in mind that the random variable T is merely an artifact of our algorithm, and Eq. (6) is well defined with T being a real number. As long as the output of the learner is in the set S, our algorithm is valid. The analysis for our algorithm also holds with minor changes. Specifically, we show the following:

###### Theorem 6.

Let T be a continuous random variable with probability density p(T) = 1/T² for T ∈ [1, ∞). If the learner chooses x_t = E[x_t^T | T ≥ t] on round t, where x_t^T is defined by Eq. (6), then the regret after T_s rounds is O(√(T_s)) for any T_s. Moreover, x_t has the following explicit form:

 x_t = ( t·tanh^{−1}(√(1−t/c)) / (c−t)^{3/2} − √c/(c−t) ) W_{t−1}  if c ≠ t ;  x_t = −(2t/(3c^{3/2})) W_{t−1}  otherwise, (7)

where c = ‖W_{t−1}‖² + 1.

The algorithm we are proposing in Eq. (7) looks quite inexplicable if one does not realize that it comes from the expression E[x_t^T | T ≥ t] with an appropriate distribution. Yet the algorithm not only enjoys a low theoretical regret bound as shown in Theorem 6, but also achieves very good performance in simulated experiments.
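
As a sanity check, the closed form can be compared against a direct numerical evaluation of E[x_t^T | T ≥ t]. The sketch below assumes our reading of Theorem 6: pretend density p(T) = 1/T² and c = ‖W_{t−1}‖² + 1; for c < t it uses the arctan form obtained by analytic continuation of Eq. (7):

```python
import math

def x_closed(t, W):
    # Eq. (7), with c = ||W_{t-1}||^2 + 1 (our reconstruction).
    c = sum(w * w for w in W) + 1.0
    if abs(c - t) < 1e-12:
        coef = -2.0 * t / (3.0 * c ** 1.5)
    elif c > t:
        coef = (t * math.atanh(math.sqrt(1.0 - t / c)) / (c - t) ** 1.5
                - math.sqrt(c) / (c - t))
    else:  # c < t: real form of the analytic continuation
        coef = (-t * math.atan(math.sqrt(t / c - 1.0)) / (t - c) ** 1.5
                + math.sqrt(c) / (t - c))
    return [coef * w for w in W]

def x_numeric(t, W, n=100000):
    # x_t = -W * E[(||W||^2 + T - t + 1)^(-1/2) | T >= t] with the
    # conditional density t/T^2 on [t, inf); substituting u = t/T gives
    # a finite integral, evaluated here by the midpoint rule.
    w2 = sum(w * w for w in W)
    I = 0.0
    for k in range(n):
        u = (k + 0.5) / n
        I += 1.0 / math.sqrt(w2 + t / u - t + 1.0)
    I /= n
    return [-I * w for w in W]
```

Both branches of the closed form agree with the numerical expectation to high accuracy, including at the boundary case c = t.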

To show this, we conduct an experiment that compares the regrets of four algorithms at any time step within 1000 rounds against an adversary that chooses points in the unit ball uniformly at random. The results are shown in Figure 1, where each data point is the maximum regret over 1000 randomly generated adversaries for the corresponding algorithm and horizon. The four algorithms are: the minimax algorithm in Eq. (6) (OPT); the one we proposed in Theorem 6 (DIST); online gradient descent, a general algorithm for online optimization (see Zinkevich, 2003) (OGD); and the doubling trick with the minimax algorithm (DOUBLE). Note that OPT is not really an adaptive algorithm: it “cheats” by knowing the horizon in advance, and thus performs best at the end of the game. We include this algorithm merely as a baseline. Figure 1 shows that our algorithm DIST achieves consistently much lower regret than any other adaptive algorithm, including OGD, which seems to enjoy a better constant in its regret bound (see Zinkevich, 2003). Moreover, for the first 450 rounds or so, our algorithm performs even better than OPT, implying that using the optimal algorithm with a large guess of the horizon is inferior to our algorithm. Finally, we remark that although the doubling trick is widely applicable in theory, in experiments it is beaten by most of the other algorithms.

### 6.2 Randomized Play and Efficient Implementation

Implementation is an issue for our algorithm when there is no closed form for x_t = E[x_t^T | T ≥ t], which is usually the case. One way to address this problem is to compute the sum of sufficiently many leading terms of the series defining this expectation, which can be a good estimate since the weight of each term decreases rapidly.

However, there is another more natural way to deal with the implementation issue when we are in a similar setting but allowed to play randomly. Specifically, consider a modified Hedge setting where on each round t, the learner can bet on one and only one action I_t, and then the loss vector Z_t is revealed, with the learner suffering loss Z_{t,I_t} for this round. It is well known that in this kind of problem, randomization is necessary for the learner to achieve sub-linear regret. That is, I_t is a random variable, and Z_t is decided without knowing the actual draw of I_t. In addition, suppose P_t, the conditional distribution of I_t given the past, depends only on Z_{1:t−1}, and the learner achieves sub-linear regret in the usual Hedge setting (sometimes called pseudo-regret):

 ∑_{t=1}^T P_t · Z_t − min_i M_{T,i} ≤ c_N √T (8)

(recall M_T = ∑_{t=1}^T Z_t) for any T and a constant c_N. Then the learner also achieves sub-linear regret with high probability in the randomized setting. That is, with probability at least 1−δ, the actual regret satisfies:

 ∑_{t=1}^T Z_{t,I_t} − min_i M_{T,i} ≤ c_N √T + √((T/2) ln(1/δ)). (9)

We refer the interested reader to Lemma 4.1 of Cesa-Bianchi & Lugosi (2006) for more details.

Therefore, in this setting we can implement our algorithm in an efficient way: on round t, first draw a horizon T according to the conditional distribution of the horizon given T ≥ t, then draw I_t according to P_t^T. It is clear that the marginal distribution of I_t under this process is exactly P_t = E[P_t^T | T ≥ t]. Hence, Eq. (8) is satisfied by Theorem 5, and as a result Eq. (9) holds.
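
This two-stage sampling scheme is easy to implement. The sketch below reuses a toy stand-in for the fixed-horizon distribution P_t^T (hypothetical, for illustration only):

```python
import random

def sample_action(t, prior, opt_dist, rng):
    """Draw T ~ Pr[. | T >= t], then I_t ~ P_t^T.

    The marginal law of I_t is exactly P_t = E[P_t^T | T >= t], so the
    mixture distribution never has to be computed explicitly.
    """
    support = [(T, p) for T, p in prior.items() if T >= t]
    mass = sum(p for _, p in support)
    x = rng.random() * mass          # inverse-CDF draw of the horizon
    for T, p in support:
        x -= p
        if x <= 0:
            break
    dist = opt_dist(t, T)            # fixed-horizon play for the drawn T
    y = rng.random()                 # inverse-CDF draw of the action
    for i, w in enumerate(dist):
        y -= w
        if y <= 0:
            return i
    return len(dist) - 1

def toy_opt(t, T):
    # Hypothetical stand-in for P_t^T (not the true minimax play).
    r = T - t + 1
    return [0.5 + 0.1 / r, 0.5 - 0.1 / r]
```

Averaged over many draws, the frequency of each action matches the explicit mixture E[P_t^T | T ≥ t].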

### 6.3 Combining with the FPL algorithm

Even if we have an efficient randomized implementation, or sometimes even have a closed form of the output, it is still too constrained if we can only apply our technique to minimax algorithms since they are usually difficult to derive and sometimes even inefficient to implement. It turns out, however, that the “pretend prior distribution” idea is applicable for many other non-minimax algorithms, which we will discuss from this section on.

Continuing the randomized setting discussed in the previous section, we study the well-known “follow the perturbed leader” (FPL) algorithm Kalai & Vempala (2005), which chooses I_t ∈ argmin_i (M_{t−1,i} + ξ_{t,i}), where ξ_t is a random perturbation vector drawn from some distribution. This distribution sometimes requires the horizon as a parameter. If this is the case, applying our technique has a simple Bayesian interpretation: put a prior distribution on an unknown parameter of another distribution. Working out the marginal distribution of ξ_t then gives an adaptive variant of FPL.

For simplicity, consider drawing ξ_t uniformly at random from a hypercube [0, Δ]^N (see Chapter 4.3 of Cesa-Bianchi & Lugosi, 2006). If Δ is set on the order of √(TN), then the pseudo-regret is upper bounded by 2√(TN) (whose dependence on N is not optimal). Now again let T be a random variable with probability density proportional to T^{−d}, and let ξ_t be obtained by first drawing T given T ≥ t, and then drawing a point uniformly from the corresponding hypercube. We show the following:
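
The adaptive variant has a particularly simple randomized implementation: rather than working with the marginal density explicitly, one can draw a pretend horizon from the conditional power-law distribution by inverse transform and then use the fixed-horizon perturbation scale. A sketch, where the scale √(bTN) and the constants b and d are illustrative knobs rather than the paper's exact tuning:

```python
import math
import random

def fpl_action(M, delta, rng):
    # Follow the perturbed leader with Uniform[0, delta] perturbations.
    pert = [m + rng.uniform(0.0, delta) for m in M]
    return min(range(len(M)), key=pert.__getitem__)

def adaptive_fpl_action(t, M, rng, b=1.0, d=2.0):
    # Pretend the horizon has conditional density proportional to T^(-d)
    # on [t, inf) given T >= t; its inverse CDF is T = t * (1-U)^(-1/(d-1)).
    U = rng.random()
    T = t * (1.0 - U) ** (-1.0 / (d - 1.0))
    delta = math.sqrt(b * T * len(M))   # fixed-horizon scale for drawn T
    return fpl_action(M, delta, rng)
```

With a tiny perturbation range the rule reduces to plain "follow the leader", which gives a quick sanity check.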

###### Lemma 2.

If the perturbation range given horizon T is Δ_T = √(bTN) for some constant b > 0, then the marginal density function of each coordinate of ξ_t is

 q_t(ξ) = (1/Z_t) min{1, (btN/ξ²)^{d−1/2}},  ξ ≥ 0. (10)

The normalization factor is Z_t = (d−1/2)√(btN)/(d−1).

###### Theorem 7.

Suppose on round t, the learner chooses

 I_t ∈ argmin_i (M_{t−1,i} + ξ_{t,i}),

where the coordinates of ξ_t are drawn independently with density function (10). Then the pseudo-regret after T_s rounds is at most

 ( (d−1)/(√b (d−1/2)) + √b (d−1)/(2(d−3/2)) ) · 2√(T_s N).

Choosing b and d appropriately minimizes the main term in the bound, leading to a bound of about 2√(2 T_s N).

By the exact same argument as before, the actual regret is bounded by the same quantity plus √((T_s/2) ln(1/δ)) with probability 1−δ.

### 6.4 Generalizing the Exponential Weights Algorithm

Now we come back to the usual Hedge setting and consider another popular non-minimax algorithm (note that it is trivial to generalize the results to the randomized setting). When dealing with the most general loss space [0,1]^N, the minimax algorithm is unknown even for the fixed horizon setting. However, generalizing the weighted majority algorithm of Littlestone & Warmuth (1994), Freund & Schapire (1997, 1999) presented an algorithm using exponential weights that can deal with this general loss space and achieve an O(√(T ln N)) bound on the regret. The algorithm takes the horizon T as a parameter, and on round t, it simply chooses P_{t,i} proportional to exp(−η M_{t−1,i}), where η is a learning rate of the form c/√T. It is shown that the regret of this algorithm is at most of order √(T ln N). Auer et al. (2002) proposed a way to make this algorithm adaptive by simply setting a time-varying learning rate η_t = c/√t, where t is the current round, leading to a regret bound of order √(T ln N) for any T (see Chapter 2.5 of Bubeck, 2011). In other words, the algorithm always treats the current round as the last round. Below, we show that our “pretend distribution” idea can also be used to make this exponential weights algorithm adaptive, and is in fact a generalization of the adaptive learning rate algorithm of Auer et al. (2002).
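
For reference, the time-varying-learning-rate baseline of Auer et al. (2002) is only a few lines. The sketch below runs exponential weights with a learning-rate rule such as η_t = √(8 ln N / t) on an arbitrary loss sequence and reports the realized quantities (the specific rate constant is a standard textbook choice, used here purely for illustration):

```python
import math

def exp_weights_adaptive(losses, eta_of_t):
    """Exponential weights with a time-varying learning rate.

    losses   -- list of loss vectors Z_1, ..., Z_T with entries in [0, 1]
    eta_of_t -- learning-rate rule, e.g. t -> sqrt(8 * ln(N) / t)
    Returns (learner's total expected loss, final cumulative action losses).
    """
    N = len(losses[0])
    M = [0.0] * N          # cumulative losses of the actions
    total = 0.0
    for t, Z in enumerate(losses, start=1):
        eta = eta_of_t(t)
        w = [math.exp(-eta * m) for m in M]
        s = sum(w)
        P = [x / s for x in w]                     # weights on this round
        total += sum(p * z for p, z in zip(P, Z))  # learner's expected loss
        M = [m + z for m, z in zip(M, Z)]
    return total, M
```

On a periodic two-action sequence of 200 rounds, the realized regret stays well below the O(√(T ln N)) guarantee.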

###### Theorem 8.

Let , and , where is a constant. If on round , the learner assigns weight to each action , where , then for any , the regret after rounds is at most

$$\left(\frac{\sqrt{b}\,(d-1)}{4(d-1/2)} + \frac{d-1}{(d-3/2)\sqrt{b}}\right)\sqrt{T_s \ln N} + o\!\left(\sqrt{T_s \ln N}\right).$$

Setting minimizes the main term, which approaches as .

Note that if , our algorithm simply becomes the one of Auer et al. (2002), because is if and otherwise. Therefore, our algorithm can be viewed as a generalization of the idea of treating the current round as the last round. However, we emphasize that our way of dealing with the unknown horizon is more broadly applicable: if we instead try to make a minimax algorithm adaptive by treating each round as the last round, one can construct an adversary that forces linear, and therefore grossly suboptimal, regret, whereas our approach yields nearly optimal regret. (See Examples 2 and 3 in the supplementary file for details.)

### 6.5 First Order Regret Bound

So far, all the regret bounds we have discussed are in terms of the horizon; these are also called zeroth order bounds. More refined bounds have been studied in the literature (Cesa-Bianchi & Lugosi, 2006). For example, the first order bound for Hedge, which depends on the loss of the best action at the end of the game, is usually of order . Again, using the exponential weights algorithm with a slightly different learning rate , one can show that the regret is at most . Here, is prior information on the loss sequence, playing a role similar to the horizon. To avoid relying on this information, which is unavailable in practice, one can again use techniques like the doubling trick or a time-varying learning rate. Alternatively, we show that the “pretend distribution” technique can also be used here. Again, it makes more sense to place a continuous distribution on the loss of the best action instead of a discrete one.

###### Theorem 9.

Let , , , and be a continuous random variable with probability density . If on round , the learner assigns weight to each action , where , then for any , the regret after rounds is at most

$$\frac{3(d-7/6)(d-1)}{(d-3/2)(d-1/2)}\sqrt{m^* \ln N} + \left(1 + (d-1)\ln(m^*+1)\right)\ln N + o\!\left(\sqrt{m^* \ln N}\right),$$

where is the loss of the best action after rounds. Setting minimizes the main term, which becomes .
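To illustrate the idea of adapting to the best action's loss rather than to the horizon, the following sketch drives the learning rate by the current cumulative loss of the best action; the rate `eta` is a simple stand-in chosen for illustration, not the rate or the pretend density of Theorem 9.

```python
import numpy as np

def first_order_hedge(losses):
    """Exponential weights whose learning rate tracks the current best loss,
    an illustrative first-order variant (not the algorithm of Theorem 9).

    losses : (T, N) array with entries in [0, 1].
    Returns the learner's expected total loss and the best action's total loss.
    """
    T, N = losses.shape
    M = np.zeros(N)
    expected = 0.0
    for t in range(T):
        m = M.min()                           # loss of the best action so far
        eta = np.sqrt(np.log(N) / (m + 1.0))  # shrinks as the best loss grows
        w = np.exp(-eta * (M - m))
        p = w / w.sum()
        expected += p @ losses[t]
        M += losses[t]
    return expected, M.min()
```

When the best action suffers little loss, the rate stays large and the weights concentrate quickly, which is the behavior a first order bound rewards.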

Throughout the proofs, we denote the set by .

## Appendix A Proof of Theorem 1

We first state a few properties of the function :

###### Proposition 1.

For any vector $M$ of $N$ dimensions and integer $r \ge 0$, the following properties hold:

###### Property 1.

for any real number and .

###### Property 2.

is non-decreasing in for each .

###### Property 3.

If $r \ge 1$, then $R(M,r) - R(M,r-1) \le \frac{1}{N}$.

###### Property 4.

If $r \ge 1$, and $P_i = \frac{1}{N} + R(M+e_i, r-1) - R(M,r)$ for each $i \in [N]$, then $P$ is a distribution in the simplex $\Delta(N)$.

###### Proof of Proposition 1.

We omit the proofs of Properties 1 and 2, since they are straightforward. We prove Property 3 by induction. For the base case , let . If , then is for and otherwise. If , then is simply for all . In either case, we have

$$R(M,1) = \frac{1}{N}\sum_{i=1}^{N} R(M+e_i,0) \le \frac{1}{N}\left(1 + \sum_{i=1}^{N} R(M,0)\right) = \frac{1}{N} + R(M,0),$$

proving the base case. Now for $r \ge 2$, by the definition of $R$ and induction,

$$R(M,r) - R(M,r-1) = \frac{1}{N}\sum_{j=1}^{N}\left(R(M+e_j,r-1) - R(M+e_j,r-2)\right) \le \frac{1}{N}\sum_{j=1}^{N}\frac{1}{N} = \frac{1}{N},$$

completing the induction. For Property 4, it suffices to prove $P_i \ge 0$ for each $i$ and $\sum_{i=1}^{N} P_i = 1$. The first part can be shown using Properties 2 and 3:

$$P_i = \frac{1}{N} + R(M+e_i,r-1) - R(M,r) \ge \frac{1}{N} + R(M,r-1) - \left(\frac{1}{N} + R(M,r-1)\right) = 0.$$

The second part is also easy to show by definition of :

$$\sum_{i=1}^{N} P_i = 1 + \sum_{i=1}^{N} R(M+e_i,r-1) - N R(M,r) = 1 + N R(M,r) - N R(M,r) = 1.$$
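The properties above can also be checked numerically for small instances, assuming $R$ is given by the recursion $R(M,0) = \min_i M_i$ and $R(M,r) = \frac{1}{N}\sum_j R(M+e_j, r-1)$ used throughout this proof:

```python
def R(M, r):
    """R(M, r) via the recursion R(M, 0) = min_i M_i and
    R(M, r) = (1/N) * sum_j R(M + e_j, r - 1), matching the base case above."""
    if r == 0:
        return min(M)
    N = len(M)
    return sum(R(tuple(m + (k == j) for k, m in enumerate(M)), r - 1)
               for j in range(N)) / N

N, r = 3, 4
M = (2, 0, 3)  # an arbitrary small loss vector

# Property 3: consecutive increments of R in r are at most 1/N.
increment = R(M, r) - R(M, r - 1)

# Property 4: P_i = 1/N + R(M + e_i, r - 1) - R(M, r) forms a distribution.
P = [1 / N + R(tuple(m + (k == i) for k, m in enumerate(M)), r - 1) - R(M, r)
     for i in range(N)]
```

Here `increment` lands in $[0, 1/N]$ and `P` is nonnegative and sums to one, exactly as Properties 2 through 4 predict.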

###### Proof of Theorem 1.

We first prove by induction that $V(M,r) = \frac{r}{N} - R(M,r)$ for any $M$ and $r$. The base case $r = 0$ is trivial by definition. For $r \ge 1$,

$$\begin{aligned} V(M,r) &= \min_{P \in \Delta(N)} \max_{Z \in L_S} \left(P \cdot Z + V(M+Z, r-1)\right) \\ &= \min_{P \in \Delta(N)} \max_{i \in [N]} \left(P_i + V(M+e_i, r-1)\right) && (L_S = \{e_1,\ldots,e_N\}) \\ &= \min_{P \in \Delta(N)} \max_{i \in [N]} \left(P_i + \tfrac{r-1}{N} - R(M+e_i, r-1)\right). && \text{(by induction)} \end{aligned}$$

Denote $P_i + \frac{r-1}{N} - R(M+e_i, r-1)$ by $g(P,i)$. Notice that the average of $g(P,i)$ over all $i$ does not depend on $P$: $\frac{1}{N}\sum_{i=1}^{N} g(P,i) = \frac{r}{N} - R(M,r)$. Therefore, $\max_i g(P,i) \ge \frac{r}{N} - R(M,r)$ for any $P$, and

$$V(M,r) = \min_P \max_i g(P,i) \ge \frac{r}{N} - R(M,r). \tag{11}$$

On the other hand, from Proposition 1, we know that $P^*$, defined by $P^*_i = \frac{1}{N} + R(M+e_i, r-1) - R(M,r)$, is a valid distribution. Also,

$$V(M,r) = \min_P \max_i g(P,i) \le \max_i g(P^*,i) = \max_i \left(\frac{r}{N} - R(M,r)\right) = \frac{r}{N} - R(M,r). \tag{12}$$

So from Eq. (11) and (12) we have $V(M,r) = \frac{r}{N} - R(M,r)$; moreover, $P^*$ realizes the minimum, and is thus the optimal strategy.

It remains to prove . Let $Z_1, \ldots, Z_T$ be independent uniform random variables taking values in $\{e_1, \ldots, e_N\}$. By what we proved above,

$$V(\mathbf{0},T) = \frac{T}{N} - \mathbb{E}\left[\min_{i \in [N]} \sum_{t=1}^{T} Z_{t,i}\right] = \mathbb{E}\left[\max_{i \in [N]} \sum_{t=1}^{T} \left(\frac{1}{N} - Z_{t,i}\right)\right].$$

Let $y_{t,i} = \frac{1}{N} - Z_{t,i}$. Then each $y_{t,i}$ is a random variable that takes value $\frac{1}{N} - 1$ with probability $\frac{1}{N}$ and $\frac{1}{N}$ with probability $1 - \frac{1}{N}$. Also, for a fixed $i$, $y_{1,i}, \ldots, y_{T,i}$ are independent (note that this is not true for $y_{t,1}, \ldots, y_{t,N}$ for a fixed $t$). It is shown in Lemma 3.3 of Berend & Kontorovich (2013) that each $y_{t,i}$ satisfies

$$\mathbb{E}\left[\exp(\lambda y_{t,i})\right] \le \exp\left(\frac{\lambda^2 \sigma^2}{2}\right), \quad \forall \lambda > 0,$$

where $\sigma^2$ is the variance of $y_{t,i}$. So if we let $Y_i = \sum_{t=1}^{T} y_{t,i}$, by the independence of each term, we have

$$\mathbb{E}\left[\exp(\lambda Y_i)\right] = \mathbb{E}\left[\prod_{t=1}^{T} \exp(\lambda y_{t,i})\right] = \prod_{t=1}^{T} \mathbb{E}\left[\exp(\lambda y_{t,i})\right] \le \exp\left(\frac{\lambda^2 \sigma^2 T}{2}\right).$$
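For completeness, the standard Jensen-plus-union argument turns such a moment-generating-function bound into a bound on the expected maximum (a routine step, sketched here under the bound just derived):

```latex
\mathbb{E}\Big[\max_{i \in [N]} Y_i\Big]
  \;\le\; \frac{1}{\lambda}\ln \mathbb{E}\Big[\max_{i \in [N]} e^{\lambda Y_i}\Big]
  \;\le\; \frac{1}{\lambda}\ln \sum_{i=1}^{N} \mathbb{E}\big[e^{\lambda Y_i}\big]
  \;\le\; \frac{\ln N}{\lambda} + \frac{\lambda \sigma^2 T}{2},
```

and choosing $\lambda = \sqrt{2\ln N / (\sigma^2 T)}$ gives $\mathbb{E}[\max_i Y_i] \le \sigma\sqrt{2T\ln N}$.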