
# Optimal Non-Asymptotic Lower Bound on the Minimax Regret of Learning with Expert Advice

Francesco Orabona (francesco@orabona.com)
David Pal (dpal@yahoo-inc.com)

Yahoo Labs
New York, NY, USA
###### Abstract

We prove non-asymptotic lower bounds on the expectation of the maximum of independent Gaussian variables and the expectation of the maximum of independent symmetric random walks. Both lower bounds recover the optimal leading constant in the limit. A simple application of the lower bound for random walks is an (asymptotically optimal) non-asymptotic lower bound on the minimax regret of online learning with expert advice.

## 1 Introduction

Let $X_1, X_2, \dots, X_d$ be i.i.d. Gaussian random variables $N(0, \sigma^2)$. It is easy to prove that (see Appendix A)

$$\mathbb{E}\Big[\max_{1 \le i \le d} X_i\Big] \le \sigma\sqrt{2\ln d} \qquad \text{for any } d \ge 1. \tag{1}$$

It is also well known that

$$\lim_{d\to\infty} \frac{\mathbb{E}\big[\max_{1\le i\le d} X_i\big]}{\sigma\sqrt{2\ln d}} = 1. \tag{2}$$

In Section 2, we prove a non-asymptotic lower bound on $\mathbb{E}\big[\max_{1\le i\le d} X_i\big]$. The leading term of the lower bound is asymptotically $\sigma\sqrt{2\ln d}$. In other words, the lower bound implies (2).
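As a quick numerical sanity check (our own illustration, not part of the paper), the snippet below estimates $\mathbb{E}\big[\max_{1\le i\le d} X_i\big]$ by Monte Carlo and verifies that it stays below the upper bound (1); the helper name `mc_expected_max_gauss` and the sample sizes are arbitrary choices.

```python
import math
import random

def mc_expected_max_gauss(d, sigma=1.0, trials=5000, seed=0):
    """Monte Carlo estimate of E[max of d i.i.d. N(0, sigma^2) variables]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(d))
    return total / trials

for d in (2, 16, 256):
    est = mc_expected_max_gauss(d)
    upper = math.sqrt(2 * math.log(d))  # bound (1) with sigma = 1
    assert est <= upper
    print(d, round(est, 3), "<=", round(upper, 3))
```

The ratio between the estimate and the bound slowly creeps toward $1$ as $d$ grows, in line with (2).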

The discrete analog of a Gaussian random variable is the symmetric random walk. Recall that a symmetric random walk $Z^{(n)}$ of length $n$ is a sum of $n$ i.i.d. Rademacher variables, which have probability distribution $\Pr[+1] = \Pr[-1] = \frac12$. We consider $d$ independent symmetric random walks $Z^{(n)}_1, Z^{(n)}_2, \dots, Z^{(n)}_d$ of length $n$. Analogously to (1), it is easy to prove that (see Appendix A)

$$\mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i\Big] \le \sqrt{2n\ln d} \qquad \text{for any } n \ge 0 \text{ and any } d \ge 1. \tag{3}$$

Note that $\sigma$ in (1) is replaced by $\sqrt{n}$. By the central limit theorem, as $n \to \infty$, $Z^{(n)}/\sqrt{n}$ converges in distribution to $N(0,1)$. From this fact, it is possible to prove the analog of (2),

$$\lim_{d\to\infty}\lim_{n\to\infty} \frac{\mathbb{E}\big[\max_{1\le i\le d} Z^{(n)}_i\big]}{\sqrt{2n\ln d}} = 1. \tag{4}$$

We prove a non-asymptotic lower bound on $\mathbb{E}\big[\max_{1\le i\le d} Z^{(n)}_i\big]$. As in the Gaussian case, the leading term of the lower bound asymptotically matches (4).
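Again purely as a sanity check (ours, not the paper's), a small simulation compares the expected maximum of $d$ random walks to the upper bound (3); the parameters below are arbitrary.

```python
import math
import random

def mc_expected_max_walk(d, n, trials=1000, seed=0):
    """Monte Carlo estimate of E[max of d independent symmetric random walks of length n]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best = -n  # the smallest value a length-n walk can take
        for _ in range(d):
            z = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
            best = max(best, z)
        total += best
    return total / trials

d, n = 16, 100
est = mc_expected_max_walk(d, n)
upper = math.sqrt(2 * n * math.log(d))  # bound (3)
assert 0 < est <= upper
print(round(est, 2), "<=", round(upper, 2))
```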

In Section 4, we show a simple application of the lower bound on $\mathbb{E}\big[\max_{1\le i\le d} Z^{(n)}_i\big]$ to the problem of learning with expert advice. This problem has been extensively studied in the online learning literature; see (Cesa-BianchiL06). Our bound is optimal in the sense that for large $n$ and large $d$ it recovers the right leading constant.

## 2 Maximum of Gaussians

A crucial step towards lower bounding $\mathbb{E}\big[\max_{1\le i\le d} X_i\big]$ is a good lower bound on the tail of a single Gaussian. The standard way of deriving such bounds is via bounds on the so-called Mill's ratio. The Mill's ratio of a random variable $X$ with density function $f$ is the ratio $\Pr[X \ge x]/f(x)$. (Mill's ratio has applications in economics; a simple problem where it shows up is that of setting the optimal price for a product: given a distribution of prices that customers are willing to pay, the goal is to choose the price that brings the most revenue.) It is clear that a lower bound on the Mill's ratio yields a lower bound on the tail $\Pr[X \ge x]$.

Without loss of generality it suffices to lower bound the Mill's ratio of $N(0,1)$, since the Mill's ratio of $N(0,\sigma^2)$ can be obtained by rescaling. Recall that the probability density of $N(0,1)$ is $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and its cumulative distribution function is $\Phi(x) = \int_{-\infty}^x \phi(t)\,dt$. The Mill's ratio for $N(0,1)$ can therefore be expressed as $\frac{1-\Phi(x)}{\phi(x)}$. A lower bound on the Mill's ratio of $N(0,1)$ was proved by (Boyd-1959).

###### Lemma 1 (Mill’s ratio for standard Gaussian (Boyd-1959)).

For any $x \ge 0$,

$$\frac{1-\Phi(x)}{\phi(x)} = \exp\Big(\frac{x^2}{2}\Big)\int_x^\infty \exp\Big(-\frac{t^2}{2}\Big)\,dt \;\ge\; \frac{\pi}{(\pi-1)x+\sqrt{x^2+2\pi}} \;\ge\; \frac{\pi}{\pi x+\sqrt{2\pi}}.$$

The second inequality in Lemma 1 is our simplification of Boyd's bound. By simple algebra it is equivalent to the inequality $\sqrt{x^2+2\pi} \le x+\sqrt{2\pi}$, which follows from the subadditivity of the square root, $\sqrt{a+b} \le \sqrt{a}+\sqrt{b}$, applied with $a = x^2$ and $b = 2\pi$, and hence holds for any $x \ge 0$.
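The two lower bounds in Lemma 1 can be checked numerically against the exact Mill's ratio, computed via the complementary error function (a check we add for illustration; it is not part of the paper):

```python
import math

def mills_ratio(x):
    """(1 - Phi(x)) / phi(x) for the standard Gaussian, via erfc."""
    return math.sqrt(math.pi / 2) * math.exp(x * x / 2) * math.erfc(x / math.sqrt(2))

def boyd_lower(x):
    """Boyd's lower bound from Lemma 1."""
    return math.pi / ((math.pi - 1) * x + math.sqrt(x * x + 2 * math.pi))

def simple_lower(x):
    """The simplified bound pi / (pi x + sqrt(2 pi))."""
    return math.pi / (math.pi * x + math.sqrt(2 * math.pi))

for x in (0.5, 1.0, 2.0, 3.0):
    assert mills_ratio(x) >= boyd_lower(x) >= simple_lower(x)
```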

###### Corollary 2 (Lower Bound on Gaussian Tail).

Let $X \sim N(0,\sigma^2)$ and $x \ge 0$. Then,

$$\Pr[X \ge x] \;\ge\; \exp\Big(-\frac{x^2}{2\sigma^2}\Big)\,\frac{1}{\sqrt{2\pi}\,\frac{x}{\sigma}+2}.$$
###### Proof.

We have

$$\Pr[X \ge x] = \frac{1}{\sigma\sqrt{2\pi}}\int_x^\infty \exp\Big(-\frac{t^2}{2\sigma^2}\Big)\,dt = \frac{1}{\sqrt{2\pi}}\int_{x/\sigma}^\infty \exp\Big(-\frac{t^2}{2}\Big)\,dt \;\ge\; \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{x^2}{2\sigma^2}\Big)\,\frac{\pi}{\pi\frac{x}{\sigma}+\sqrt{2\pi}} \quad \text{(by Lemma 1).} \qquad \blacksquare$$
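As an illustration (ours, not the paper's), Corollary 2 can be verified against the exact Gaussian tail $\Pr[X \ge x] = \frac12\,\mathrm{erfc}\big(\frac{x}{\sigma\sqrt{2}}\big)$:

```python
import math

def gauss_tail(x, sigma=1.0):
    """Exact Pr[X >= x] for X ~ N(0, sigma^2)."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2)))

def tail_lower_bound(x, sigma=1.0):
    """The lower bound of Corollary 2."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * x / sigma + 2)

for sigma in (0.5, 1.0, 3.0):
    for x in (0.0, 0.7, 1.5, 3.0):
        assert gauss_tail(x, sigma) >= tail_lower_bound(x, sigma)
```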

Equipped with the lower bound on the tail, we prove a lower bound on the maximum of Gaussians.

###### Theorem 3 (Lower Bound on Maximum of Independent Gaussians).

Let $X_1, X_2, \dots, X_d$ be independent Gaussian random variables $N(0,\sigma^2)$. For any $d \ge 2$,

$$\mathbb{E}\Big[\max_{1\le i\le d} X_i\Big] \;\ge\; \sigma\Big(1-\exp\Big(-\frac{\sqrt{\ln d}}{6.35}\Big)\Big)\Big(\sqrt{2\ln d-2\ln\ln d}+\sqrt{\tfrac{2}{\pi}}\Big)-\sqrt{\tfrac{2}{\pi}}\,\sigma \tag{5}$$

$$\ge\; 0.13\,\sigma\sqrt{\ln d}-0.7\,\sigma. \tag{6}$$
###### Proof.

Let $A$ be the event that at least one of the $X_i$ is greater than $C\sigma\sqrt{\ln d}$, where $C = C(d) = \sqrt{2-\frac{2\ln\ln d}{\ln d}}$. We denote by $\bar{A}$ the complement of this event. We have

$$\begin{aligned}
\mathbb{E}\Big[\max_{1\le i\le d} X_i\Big] &= \mathbb{E}\Big[\max_{1\le i\le d} X_i \;\Big|\; A\Big]\Pr[A]+\mathbb{E}\Big[\max_{1\le i\le d} X_i \;\Big|\; \bar{A}\Big]\Pr[\bar{A}] \\
&\ge \mathbb{E}\Big[\max_{1\le i\le d} X_i \;\Big|\; A\Big]\Pr[A]+\mathbb{E}\big[X_1 \;\big|\; X_1 \le C\sigma\sqrt{\ln d}\,\big]\Pr[\bar{A}] \\
&\ge \mathbb{E}\Big[\max_{1\le i\le d} X_i \;\Big|\; A\Big]\Pr[A]+\mathbb{E}[X_1 \mid X_1 \le 0]\,\Pr[\bar{A}] \\
&= \mathbb{E}\Big[\max_{1\le i\le d} X_i \;\Big|\; A\Big]\Pr[A]-\sigma\sqrt{\tfrac{2}{\pi}}\,\Pr[\bar{A}] \\
&\ge C\sigma\sqrt{\ln d}\;\Pr[A]-\sigma\sqrt{\tfrac{2}{\pi}}\,\big(1-\Pr[A]\big) \\
&= \sigma\Big(C\sqrt{\ln d}+\sqrt{\tfrac{2}{\pi}}\Big)\Pr[A]-\sigma\sqrt{\tfrac{2}{\pi}}
\end{aligned} \tag{7}$$

where we used that $\mathbb{E}[X_1 \mid X_1 \le 0] = -\mathbb{E}[|X_1|] = -\sigma\sqrt{2/\pi}$.

It remains to lower bound $\Pr[A]$, which we do as follows:

$$\begin{aligned}
\Pr[A] &= 1-\Pr[\bar{A}] = 1-\big(\Pr[X_1 \le C\sigma\sqrt{\ln d}]\big)^d = 1-\big(1-\Pr[X_1 > C\sigma\sqrt{\ln d}]\big)^d \\
&\ge 1-\exp\big(-d\cdot\Pr[X_1 \ge C\sigma\sqrt{\ln d}]\big) \\
&\ge 1-\exp\Bigg(-d\exp\Big(-\frac{C^2\ln d}{2}\Big)\frac{1}{\sqrt{2\pi}\,C\sqrt{\ln d}+2}\Bigg) \\
&= 1-\exp\Bigg(-\frac{d^{\,1-C^2/2}}{C\sqrt{2\pi\ln d}+2}\Bigg)
\end{aligned} \tag{8}$$

where in the first inequality we used the elementary inequality $(1-x)^d \le \exp(-dx)$, valid for all $x \le 1$, and in the second inequality we used Corollary 2.

Since $C^2 = 2-\frac{2\ln\ln d}{\ln d}$, we have $d^{\,1-C^2/2} = d^{\frac{\ln\ln d}{\ln d}} = \ln d$. Substituting this into (8), we get

$$\Pr[A] \ge 1-\exp\Big(-\frac{\ln d}{C\sqrt{2\pi\ln d}+2}\Big) = 1-\exp\Bigg(-\frac{\sqrt{\ln d}}{C\sqrt{2\pi}+\frac{2}{\sqrt{\ln d}}}\Bigg). \tag{9}$$

The function $C(d)$ is decreasing on the interval $(1, e^e)$, increasing on $(e^e, \infty)$, and $\lim_{d\to\infty} C(d) = \sqrt{2}$. From these properties we can deduce that $C(d)\sqrt{2\pi}+\frac{2}{\sqrt{\ln d}} \le 6.35$ for any $d \ge 3$ (for $d = 2$ the right-hand side of (5) is negative, so the bound holds trivially). Therefore, we have

$$\Pr[A] \ge 1-\exp\Big(-\frac{\sqrt{\ln d}}{6.35}\Big). \tag{10}$$

Inequalities (7) and (10) together imply bound (5). Bound (6) is obtained from (5) by noticing that

$$\begin{aligned}
&\sigma\Big(1-e^{-\frac{\sqrt{\ln d}}{6.35}}\Big)\Big(\sqrt{2\ln d-2\ln\ln d}+\sqrt{\tfrac{2}{\pi}}\Big)-\sqrt{\tfrac{2}{\pi}}\,\sigma \\
&\qquad= \sigma\Big(1-e^{-\frac{\sqrt{\ln d}}{6.35}}\Big)\sqrt{2\ln d-2\ln\ln d} - e^{-\frac{\sqrt{\ln d}}{6.35}}\sqrt{\tfrac{2}{\pi}}\,\sigma \\
&\qquad\ge 0.1227\,\sigma\sqrt{2\ln d-2\ln\ln d} - 0.7\,\sigma \\
&\qquad= 0.1227\,\sigma\sqrt{\ln d}\;C(d) - 0.7\,\sigma
\end{aligned}$$

where we used that $1-e^{-\sqrt{\ln d}/6.35} \ge 0.1227$ and $e^{-\sqrt{\ln d}/6.35}\sqrt{2/\pi} \le 0.7$ for any $d \ge 2$. Since $C(d)$ has its minimum at $d = e^e$, it follows that $C(d) \ge \sqrt{2-2/e} \ge 1.12$ for any $d \ge 2$, so $0.1227\,C(d) \ge 0.13$. ∎
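A Monte Carlo check of the sandwich given by (1) and (6) (our own sketch; the constants $0.13$ and $0.7$ are those of Theorem 3):

```python
import math
import random

def mc_expected_max(d, sigma=1.0, trials=2000, seed=1):
    """Monte Carlo estimate of E[max of d i.i.d. N(0, sigma^2) variables]."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(0.0, sigma) for _ in range(d))
               for _ in range(trials)) / trials

sigma = 2.0
for d in (2, 64, 1024):
    est = mc_expected_max(d, sigma)
    lower = 0.13 * sigma * math.sqrt(math.log(d)) - 0.7 * sigma  # bound (6)
    upper = sigma * math.sqrt(2 * math.log(d))                   # bound (1)
    assert lower <= est <= upper
```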

## 3 Maximum of Random Walks

The general strategy for proving a lower bound on $\mathbb{E}\big[\max_{1\le i\le d} Z^{(n)}_i\big]$ is the same as in the previous section. The main task is to lower bound the tail of a symmetric random walk $Z^{(n)}$ of length $n$. Note that

$$B_n = \frac{Z^{(n)}+n}{2}$$

is a Binomial random variable with distribution $\text{Binomial}(n, \frac12)$. We follow the same approach used in (Orabona13). First we lower bound the tail of $B_n$ with a bound due to (McKay1989).

###### Lemma 4 (Bound on Binomial Tail).

Let $n \ge 1$ and $k$ be integers satisfying $\frac{n}{2} \le k \le n$. Define $x = \frac{2k-n}{\sqrt{n}}$. Then, $B_n$ satisfies

$$\Pr[B_n \ge k] \;\ge\; \sqrt{n}\,\binom{n-1}{k-1}\,2^{-n}\,\frac{1-\Phi(x)}{\phi(x)}.$$

We lower bound the binomial coefficient using Stirling's approximation of the factorial. The lower bound on the binomial coefficient will be expressed in terms of the Kullback-Leibler divergence between two Bernoulli distributions with biases $p$ and $q$. Abusing notation somewhat, we write the divergence as

$$D(p\,\|\,q) = p\ln\Big(\frac{p}{q}\Big)+(1-p)\ln\Big(\frac{1-p}{1-q}\Big).$$

The result is the following lower bound on the tail of Binomial.

###### Theorem 5 (Bound on Binomial Tail).

Let $n \ge 1$ and $k$ be integers satisfying $\frac{n}{2} \le k \le n$. Define $x = \frac{2k-n}{\sqrt{n}}$. Then, $B_n$ satisfies

$$\Pr[B_n \ge k] \;\ge\; \frac{\exp\big(-nD\big(\frac{k}{n} \,\|\, \frac12\big)\big)}{e^{1/6}\sqrt{2\pi}}\cdot\frac{1-\Phi(x)}{\phi(x)}.$$
###### Proof.

Lemma 4 implies that

$$\Pr[B_n \ge k] \ge \sqrt{n}\,\binom{n-1}{k-1}\,2^{-n}\,\frac{1-\Phi(x)}{\phi(x)}.$$

Since $k\binom{n}{k} = n\binom{n-1}{k-1}$, we can write the binomial coefficient as

$$\binom{n-1}{k-1} = \frac{k}{n}\binom{n}{k}.$$

We bound the binomial coefficient $\binom{n}{k}$ by using Stirling's formula for the factorial. We use the explicit upper and lower bounds due to (Robbins-1955), valid for any $n \ge 1$:

$$\sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^n e^{\frac{1}{12n+1}} \;<\; n! \;<\; \sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^n e^{\frac{1}{12n}}.$$

Using Stirling's approximation, for any $0 < k < n$,

$$\begin{aligned}
\binom{n}{k} = \frac{n!}{k!\,(n-k)!} &> \frac{\sqrt{2\pi n}\; n^n e^{-n}}{\sqrt{2\pi k}\; k^k e^{-k} e^{1/12}\cdot\sqrt{2\pi(n-k)}\;(n-k)^{n-k} e^{-(n-k)} e^{1/12}} \\
&= \frac{1}{e^{1/6}\sqrt{2\pi}}\Big(\frac{n}{n-k}\Big)^{n-k}\Big(\frac{n}{k}\Big)^k\sqrt{\frac{n}{k(n-k)}} \\
&= \frac{1}{e^{1/6}\sqrt{2\pi}}\; 2^n\exp\Big(-n\,D\Big(\frac{k}{n} \,\Big\|\, \frac12\Big)\Big)\sqrt{\frac{n}{k(n-k)}}
\end{aligned}$$

where in the last equality we used the definition of $D$. Combining all the inequalities gives

$$\begin{aligned}
\Pr[B_n \ge k] &\ge \sqrt{n}\cdot\frac{k}{n}\cdot\frac{1}{e^{1/6}\sqrt{2\pi}}\;2^n\exp\Big(-n\,D\Big(\frac{k}{n} \,\Big\|\, \frac12\Big)\Big)\sqrt{\frac{n}{k(n-k)}}\;2^{-n}\,\frac{1-\Phi(x)}{\phi(x)} \\
&= \frac{1}{e^{1/6}\sqrt{2\pi}}\exp\Big(-n\,D\Big(\frac{k}{n} \,\Big\|\, \frac12\Big)\Big)\frac{1-\Phi(x)}{\phi(x)}\sqrt{\frac{k}{n-k}} \\
&\ge \frac{1}{e^{1/6}\sqrt{2\pi}}\exp\Big(-n\,D\Big(\frac{k}{n} \,\Big\|\, \frac12\Big)\Big)\frac{1-\Phi(x)}{\phi(x)}
\end{aligned}$$

for $\frac{n}{2} \le k < n$, where the last step uses $k \ge n-k$. For $k = n$, we verify the statement of the theorem by direct substitution. The left-hand side is $2^{-n}$. Since $D(1\,\|\,\frac12) = \ln 2$ and $\frac{1-\Phi(\sqrt{n})}{\phi(\sqrt{n})} \le \frac{1}{\sqrt{n}} \le 1$, it is easy to see that the right-hand side is smaller than $2^{-n}$. ∎
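Theorem 5 can be checked exactly for small $n$, since the binomial tail is computable in closed form (a verification we add for illustration; it is not part of the paper):

```python
import math

def mills_ratio(x):
    """(1 - Phi(x)) / phi(x) for the standard Gaussian."""
    return math.sqrt(math.pi / 2) * math.exp(x * x / 2) * math.erfc(x / math.sqrt(2))

def kl_bernoulli(p, q):
    """D(p || q) for Bernoulli distributions, with the convention 0 ln 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def binom_tail(n, k):
    """Exact Pr[B_n >= k] for B_n ~ Binomial(n, 1/2)."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

def thm5_lower(n, k):
    """Right-hand side of Theorem 5."""
    x = (2 * k - n) / math.sqrt(n)
    return (math.exp(-n * kl_bernoulli(k / n, 0.5))
            / (math.exp(1 / 6) * math.sqrt(2 * math.pi)) * mills_ratio(x))

for n in (10, 50, 200):
    for k in range(n // 2, n + 1):
        assert binom_tail(n, k) >= thm5_lower(n, k)
```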

For $k$ close to $\frac{n}{2}$, the divergence $D\big(\frac{k}{n} \,\|\, \frac12\big)$ can be approximated by $2\big(\frac{k}{n}-\frac12\big)^2$. We define the function $\psi$ as

$$\psi(x) = \frac{D\big(\frac12+x \,\|\, \frac12\big)}{2x^2}.$$

It is the ratio of the divergence and its quadratic approximation. The function $\psi$ satisfies the following properties:

• $\psi$ is decreasing on $[-\frac12, 0)$ and increasing on $(0, \frac12]$,

• its minimum value is $\lim_{x\to0}\psi(x) = 1$,

• its maximum value is $\psi(-\frac12) = \psi(\frac12) = 2\ln 2$.
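The properties above are easy to confirm numerically (our check, with $\psi$ evaluated on a grid):

```python
import math

def kl_bernoulli(p, q):
    """D(p || q) for Bernoulli distributions, with the convention 0 ln 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def psi(x):
    """psi(x) = D(1/2 + x || 1/2) / (2 x^2) for 0 < |x| <= 1/2."""
    return kl_bernoulli(0.5 + x, 0.5) / (2 * x * x)

grid = [i / 1000 for i in range(1, 501)]
vals = [psi(x) for x in grid]
assert all(a <= b for a, b in zip(vals, vals[1:]))                 # increasing on (0, 1/2]
assert all(abs(psi(-x) - psi(x)) < 1e-12 for x in (0.1, 0.25, 0.5))  # psi is even
assert abs(vals[0] - 1.0) < 1e-3                                   # psi(x) -> 1 as x -> 0
assert abs(psi(0.5) - 2 * math.log(2)) < 1e-12                     # maximum value 2 ln 2
```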

Using the definition of and Theorem 5, we have the following Corollary.

###### Corollary 6.

Let $n$ be a positive integer and let $t \in [1, \frac{n}{2}]$ be a real number. Then $B_n$ satisfies

$$\Pr\Big[B_n \ge \frac12 n+t-1\Big] \;\ge\; e^{-1/6}\exp\Big(-2\,\psi\Big(\frac{t}{n}\Big)\frac{t^2}{n}\Big)\,\frac{1}{\sqrt{2\pi}\,\frac{2t}{\sqrt{n}}+2}.$$
###### Proof.

By Theorem 5 and Lemma 1, we have

$$\begin{aligned}
\Pr\Big[B_n \ge \frac12 n+t-1\Big] &= \Pr\Big[B_n \ge \Big\lceil \frac12 n+t-1\Big\rceil\Big] \\
&\ge \frac{\exp\Big(-n\,D\Big(\frac{\lceil \frac12 n+t-1\rceil}{n} \,\Big\|\, \frac12\Big)\Big)}{e^{1/6}\sqrt{2\pi}}\cdot\frac{\pi}{\pi\,\frac{2\lceil \frac12 n+t-1\rceil-n}{\sqrt{n}}+\sqrt{2\pi}} \\
&\ge \frac{\exp\Big(-n\,D\Big(\frac{\frac12 n+t}{n} \,\Big\|\, \frac12\Big)\Big)}{e^{1/6}\sqrt{2\pi}}\cdot\frac{\pi}{\pi\,\frac{2(\frac12 n+t)-n}{\sqrt{n}}+\sqrt{2\pi}} \\
&= \frac{\exp\Big(-n\,D\Big(\frac12+\frac{t}{n} \,\Big\|\, \frac12\Big)\Big)}{e^{1/6}\sqrt{2\pi}}\cdot\frac{\pi}{\pi\,\frac{2t}{\sqrt{n}}+\sqrt{2\pi}} \\
&= e^{-1/6}\exp\Big(-2\,\psi\Big(\frac{t}{n}\Big)\frac{t^2}{n}\Big)\frac{1}{\sqrt{2\pi}\,\frac{2t}{\sqrt{n}}+2}. \qquad\blacksquare
\end{aligned}$$
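Corollary 6 can likewise be verified exactly for small $n$ (an illustrative check of ours; `cor6_lower` implements the right-hand side):

```python
import math

def kl_bernoulli(p, q):
    """D(p || q) for Bernoulli distributions, with the convention 0 ln 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def psi(x):
    """psi(x) = D(1/2 + x || 1/2) / (2 x^2)."""
    return kl_bernoulli(0.5 + x, 0.5) / (2 * x * x)

def binom_tail(n, k):
    """Exact Pr[B_n >= k] for B_n ~ Binomial(n, 1/2)."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

def cor6_lower(n, t):
    """Right-hand side of Corollary 6."""
    return (math.exp(-1 / 6) * math.exp(-2 * psi(t / n) * t * t / n)
            / (math.sqrt(2 * math.pi) * 2 * t / math.sqrt(n) + 2))

for n in (10, 50, 200):
    for t in (1.0, 2.5, n / 4, n / 2):
        k = math.ceil(n / 2 + t - 1)
        assert binom_tail(n, k) >= cor6_lower(n, t)
```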
###### Theorem 7 (Lower Bound on Maximum of Independent Symmetric Random Walks).

Let $Z^{(n)}_1, \dots, Z^{(n)}_d$ be independent symmetric random walks of length $n$. If $d \ge 2$ and $n \ge 2.56\ln d$, then

$$\mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i\Big] \;\ge\; \frac{1-\exp\Big(-\frac{\sqrt{\ln d}}{3.1\sqrt{2\pi}}\Big)}{\sqrt{\psi\Big(\frac{1.6\sqrt{\ln d}}{2\sqrt{n}}\Big)}}\,\sqrt{n}\,\Big(\sqrt{2\ln d-2\ln\ln d}-1\Big)-\sqrt{n} \;\ge\; 0.09\sqrt{n\ln d}-2\sqrt{n}.$$
###### Proof.

Define the event $A$ to be the event that at least one of the $Z^{(n)}_i$ is greater than or equal to $2t-2$, where $t = \frac12\,C(d,n)\sqrt{n\ln d}$ and $C(d,n) = f(d)\Big/\sqrt{\psi\Big(\frac{1.6\sqrt{\ln d}}{2\sqrt{n}}\Big)}$ with $f(d) = \sqrt{2-\frac{2\ln\ln d}{\ln d}}$.

We upper and lower bound $C(d,n)$. Since $C(d,n)$ is the ratio of $f(d)$ and $\sqrt{\psi\big(\frac{1.6\sqrt{\ln d}}{2\sqrt{n}}\big)}$, it suffices to bound $f$ and $\psi$. We already know that $1 \le \psi(x) \le 2\ln 2$ for all $x \in (0, \frac12]$. The function $f$ is decreasing on $(1, e^e)$, increasing on $(e^e, \infty)$, and $\lim_{d\to\infty} f(d) = \sqrt{2}$. It has a unique minimum at $d = e^e$, where $f(e^e) = \sqrt{2-2/e}$. Therefore, $f(d) \ge f(e^e)$ for all $d$, and from the unimodality of $f$ we obtain an upper bound on $f(d)$ as well. From this we can conclude that

$$0.95 \;\le\; \frac{f(e^e)}{\sqrt{2\ln 2}} \;\le\; C(d,n) \;\le\; f(2) \;\le\; 1.6. \tag{11}$$

If $d \ge 2$ and $n \ge 2.56\ln d$, this implies that

$$1 \;\le\; t \;\le\; \frac{n}{2}, \tag{12}$$

so that Corollary 6 applies.

Recalling the definition of the event $A$, we have

$$\begin{aligned}
\mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i\Big] &= \mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i \;\Big|\; A\Big]\Pr[A]+\mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i \;\Big|\; \bar{A}\Big]\Pr[\bar{A}] \\
&\ge \mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i \;\Big|\; A\Big]\Pr[A]+\mathbb{E}\Big[Z^{(n)}_1 \;\Big|\; \bar{A}\Big]\Pr[\bar{A}] \\
&= \mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i \;\Big|\; A\Big]\Pr[A]+\mathbb{E}\Big[Z^{(n)}_1 \;\Big|\; Z^{(n)}_1 < 2t-2\Big]\Pr[\bar{A}] \\
&\ge (2t-2)\Pr[A]+\mathbb{E}\Big[Z^{(n)}_1 \;\Big|\; Z^{(n)}_1 \le 0\Big]\Pr[\bar{A}].
\end{aligned}$$

We lower bound $\mathbb{E}\big[Z^{(n)}_1 \,\big|\, Z^{(n)}_1 \le 0\big]$. Using the fact that the distribution of $Z^{(n)}_1$ is symmetric and has zero mean,

$$\begin{aligned}
\mathbb{E}\big[Z^{(n)}_1 \,\big|\, Z^{(n)}_1 \le 0\big] &= \sum_{k=-n}^{0} k\,\Pr\big[Z^{(n)}_1 = k \,\big|\, Z^{(n)}_1 \le 0\big] = \frac{1}{\Pr[Z^{(n)}_1 \le 0]}\sum_{k=-n}^{0} k\,\Pr\big[Z^{(n)}_1 = k\big] \\
&\ge 2\sum_{k=-n}^{0} k\,\Pr\big[Z^{(n)}_1 = k\big] && \text{(by symmetry of } Z^{(n)}_1\text{)} \\
&= -\sum_{k=-n}^{n} |k|\,\Pr\big[Z^{(n)}_1 = k\big] && \text{(again, by symmetry of } Z^{(n)}_1\text{)} \\
&= -\mathbb{E}\big[|Z^{(n)}_1|\big] = -\mathbb{E}\Big[\sqrt{\big(Z^{(n)}_1\big)^2}\Big] \\
&\ge -\sqrt{\mathbb{E}\big[\big(Z^{(n)}_1\big)^2\big]} && \text{(by concavity of } \sqrt{\cdot}\,\text{)} \\
&= -\sqrt{\operatorname{Var}\big(Z^{(n)}_1\big)} = -\sqrt{n}.
\end{aligned}$$

Now let us focus on $\Pr[A]$. Note that $B_n = \frac{Z^{(n)}_1+n}{2}$ is a binomial random variable with distribution $\text{Binomial}(n, \frac12)$. Similarly to the proof of Theorem 3, we can lower bound $\Pr[A]$ as

$$\begin{aligned}
\Pr[A] &= 1-\Pr[\bar{A}] = 1-\Big(\Pr\Big[Z^{(n)}_1 < 2t-2\Big]\Big)^d = 1-\Big(1-\Pr\Big[B_n \ge \tfrac12 n+t-1\Big]\Big)^d \\
&\ge 1-\exp\Big(-d\cdot\Pr\Big[B_n \ge \tfrac12 n+t-1\Big]\Big) \\
&\ge 1-\exp\Bigg(-d\,e^{-1/6}\exp\Big(-2\,\psi\Big(\frac{t}{n}\Big)\frac{t^2}{n}\Big)\frac{1}{\sqrt{2\pi}\,\frac{2t}{\sqrt{n}}+2}\Bigg),
\end{aligned}$$

where the last inequality is Corollary 6.

We now use the fact that $t = \frac12\,C(d,n)\sqrt{n\ln d}$ implies that $2\,\psi\big(\frac{t}{n}\big)\frac{t^2}{n} = \frac{C^2(d,n)}{2}\,\psi\big(\frac{t}{n}\big)\ln d \le \frac{C^2(d,n)}{2}\,\psi\big(\frac{1.6\sqrt{\ln d}}{2\sqrt{n}}\big)\ln d$, using $C(d,n) \le 1.6$ and the monotonicity of $\psi$. Hence, we obtain

$$\begin{aligned}
\Pr[A] &\ge 1-\exp\Bigg(-e^{-1/6}\,\frac{d^{\,1-\frac{C^2}{2}\psi\big(\frac{1.6\sqrt{\ln d}}{2\sqrt{n}}\big)}}{1.6\sqrt{2\pi\ln d}+2}\Bigg) = 1-\exp\Bigg(-e^{-1/6}\,\frac{\ln d}{1.6\sqrt{2\pi\ln d}+2}\Bigg) \\
&\ge 1-\exp\Bigg(-e^{-1/6}\,\frac{\sqrt{\ln d}}{2.6\sqrt{2\pi}}\Bigg) \ge 1-\exp\Bigg(-\frac{\sqrt{\ln d}}{3.1\sqrt{2\pi}}\Bigg)
\end{aligned}$$

where in the second inequality we used the fact that $2 \le \sqrt{2\pi\ln d}$ for $d \ge 2$, and in the last inequality that $e^{-1/6}/2.6 \ge 1/3.1$. Putting everything together, we obtain the stated bound. ∎

## 4 Learning with Expert Advice

Learning with Expert Advice is an online problem where in each round $t$ an algorithm chooses (possibly randomly) an action $I_t \in \{1, 2, \dots, d\}$ and then it receives the losses $\ell_{t,1}, \ell_{t,2}, \dots, \ell_{t,d} \in [0,1]$ of the actions. This repeats for $n$ rounds. The goal of the algorithm is to have a small cumulative loss of the actions it has chosen. The difference between the algorithm's loss and the loss of the best fixed action in hindsight is called regret. Formally,

$$\text{Regret}^{(d)}(n) = \sum_{t=1}^n \ell_{t,I_t} - \min_{1\le i\le d}\sum_{t=1}^n \ell_{t,i}.$$

There are algorithms that, given the number of rounds $n$ as an input, achieve regret at most $\sqrt{\frac{n}{2}\ln d}$ for any sequence of losses.
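One classical such algorithm is the exponentially weighted average forecaster (Hedge) analyzed in (Cesa-BianchiL06); with learning rate $\eta = \sqrt{8\ln d/n}$ its expected regret is at most $\sqrt{\frac{n}{2}\ln d}$. The sketch below is our own minimal implementation; the function name `hedge` and the random loss sequence are illustrative choices.

```python
import math
import random

def hedge(loss_rows, eta):
    """Exponentially weighted average forecaster.

    loss_rows: list of per-round loss vectors with entries in [0, 1].
    Returns (expected loss of the algorithm, loss of the best fixed expert)."""
    d = len(loss_rows[0])
    cum = [0.0] * d  # cumulative loss of each expert
    alg_loss = 0.0
    for losses in loss_rows:
        weights = [math.exp(-eta * c) for c in cum]
        total = sum(weights)
        probs = [w / total for w in weights]  # sampling distribution over experts
        alg_loss += sum(p * l for p, l in zip(probs, losses))
        cum = [c + l for c, l in zip(cum, losses)]
    return alg_loss, min(cum)

n, d = 500, 8
rng = random.Random(0)
rows = [[rng.random() for _ in range(d)] for _ in range(n)]
eta = math.sqrt(8 * math.log(d) / n)
alg, best = hedge(rows, eta)
assert alg - best <= math.sqrt(n / 2 * math.log(d))  # regret guarantee
```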

###### Theorem 8.

Let $d \ge 2$ and $n \ge 2.56\ln d$. For any algorithm for learning with expert advice there exists a sequence of losses $\ell_{t,i} \in \{0,1\}$, $t = 1, 2, \dots, n$, $i = 1, 2, \dots, d$, such that

$$\text{Regret}^{(d)}(n) \;\ge\; \frac{1-\exp\Big(-\frac{\sqrt{\ln d}}{3.1\sqrt{2\pi}}\Big)}{\sqrt{\psi\Big(\frac{1.6\sqrt{\ln d}}{2\sqrt{n}}\Big)}}\cdot\frac{\sqrt{n}}{2}\Big(\sqrt{2\ln d-2\ln\ln d}-1\Big)-\frac{\sqrt{n}}{2}.$$
###### Proof.

Proceeding as in the proof of Theorem 3.7 in (Cesa-BianchiL06), we only need to show that the worst-case regret is lower bounded as

$$\sup_{\{\ell_{t,i}\}}\ \text{Regret}^{(d)}(n) \;\ge\; \frac12\,\mathbb{E}\Big[\max_{1\le i\le d} Z^{(n)}_i\Big],$$

where $Z^{(n)}_1, \dots, Z^{(n)}_d$ are independent symmetric random walks of length $n$. The theorem then follows from Theorem 7. ∎

The theorem gives a non-asymptotic lower bound, while at the same time recovering the optimal leading constant of the asymptotic lower bound in (Cesa-BianchiL06).

## Appendix A Upper Bounds

We say that a random variable $X$ is $\sigma$-sub-Gaussian (for some $\sigma > 0$) if

$$\mathbb{E}\big[e^{sX}\big] \le \exp\Big(\frac{\sigma^2 s^2}{2}\Big) \qquad \text{for all } s \in \mathbb{R}. \tag{13}$$

It is straightforward to verify that $X \sim N(0,\sigma^2)$ is $\sigma$-sub-Gaussian. Indeed, for any $s \in \mathbb{R}$,

$$\mathbb{E}\big[e^{sX}\big] = \int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\Big(-\frac{x^2}{2\sigma^2}\Big)e^{sx}\,dx = \exp\Big(\frac{s^2\sigma^2}{2}\Big)\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\Big(-\frac{(x-s\sigma^2)^2}{2\sigma^2}\Big)dx = \exp\Big(\frac{s^2\sigma^2}{2}\Big).$$

We now show that a Rademacher random variable $Y$ (with distribution $\Pr[Y=+1] = \Pr[Y=-1] = \frac12$) is $1$-sub-Gaussian. Indeed, for any $s \in \mathbb{R}$,

$$\mathbb{E}\big[e^{sY}\big] = \frac{e^s+e^{-s}}{2} = \frac12\sum_{k=0}^\infty\frac{s^k}{k!}+\frac12\sum_{k=0}^\infty\frac{(-1)^k s^k}{k!} = \sum_{k=0}^\infty\frac{s^{2k}}{(2k)!} \le \sum_{k=0}^\infty\frac{s^{2k}}{k!\,2^k} = \exp\Big(\frac{s^2}{2}\Big).$$
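Equivalently, the inequality above says $\cosh(s) \le e^{s^2/2}$, which is easy to spot-check numerically (our illustration, not part of the paper):

```python
import math

# E[exp(sY)] = cosh(s) for a Rademacher Y; bound (13) with sigma = 1 demands cosh(s) <= exp(s^2/2)
for i in range(-50, 51):
    s = i / 5.0  # s ranges over [-10, 10]
    assert math.cosh(s) <= math.exp(s * s / 2)
```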

If $Y_1, Y_2, \dots, Y_n$ are independent $\sigma$-sub-Gaussian random variables, then $\sum_{i=1}^n Y_i$ is $\sigma\sqrt{n}$-sub-Gaussian. This follows from

$$\mathbb{E}\Big[e^{s\sum_{i=1}^n Y_i}\Big] = \prod_{i=1}^n \mathbb{E}\big[e^{sY_i}\big] \le \exp\Big(\frac{n\sigma^2 s^2}{2}\Big).$$

This property proves that the symmetric random walk of length $n$ is $\sqrt{n}$-sub-Gaussian.