# Fast rates with high probability in exp-concave statistical learning

## Abstract

We present an algorithm for the statistical learning setting with a bounded exp-concave loss in $d$ dimensions that obtains excess risk $O(d \log(1/\delta)/n)$ with probability at least $1 - \delta$. The core technique is to boost the confidence of recent in-expectation $O(d/n)$ excess risk bounds for empirical risk minimization (ERM), without sacrificing the rate, by leveraging a Bernstein condition which holds due to exp-concavity. We also show that with probability $1 - \delta$ the standard ERM method obtains excess risk $O(d(\log n + \log(1/\delta))/n)$. We further show that a regret bound for any online learner in this setting translates to a high probability excess risk bound for the corresponding online-to-batch conversion of the online learner. Lastly, we present two high probability bounds for the exp-concave model selection aggregation problem that are quantile-adaptive in a certain sense. The first bound is obtained by a purely exponential weights type algorithm, achieves a nearly optimal rate, and has no explicit dependence on the Lipschitz continuity of the loss. The second bound requires Lipschitz continuity but obtains the optimal rate.

## 1 Introduction

In the statistical learning problem, a learning agent observes a sample of $n$ points drawn i.i.d. from an unknown distribution $P$ over an outcome space $\mathcal{Z}$. The agent then seeks an action $f$ in an action space $\mathcal{F}$ that minimizes their expected loss, or risk, $\mathbb{E}_{Z \sim P}[\ell(f, Z)]$, where $\ell$ is a loss function $\ell : \mathcal{F} \times \mathcal{Z} \to \mathbb{R}$. Several recent works have studied this problem in the situation where the loss is exp-concave and bounded, $\mathcal{F}$ and $\mathcal{Z}$ are subsets of $\mathbb{R}^d$, and $\mathcal{F}$ is convex. Mahdavi et al. (2015) were the first to show that there exists a learner for which, with probability at least $1 - \delta$, the excess risk decays at the rate $d(\log n + \log(1/\delta))/n$. Via new algorithmic stability arguments applied to empirical risk minimization (ERM), Koren and Levy (2015) and Gonen and Shalev-Shwartz (2016) discarded the $\log n$ factor to obtain a rate of $d/n$, but their bounds only hold in expectation. All three works highlighted the open problem of obtaining a high probability excess risk bound with the rate $d \log(1/\delta)/n$. Whether this is possible is far from a trivial question in light of a result of Audibert (2008): when learning over a finite class with bounded $\eta$-exp-concave losses, the progressive mixture rule (a Cesàro mean of pseudo-Bayesian estimators) with learning rate $\eta$ obtains expected excess risk $O(1/n)$ but, for any learning rate, these rules suffer from severe deviations of order $\sqrt{\log(1/\delta)/n}$.

This work resolves the high probability question: we present a learning algorithm with an excess risk bound (Corollary 5) which has rate $d \log(1/\delta)/n$ with probability at least $1 - \delta$. ERM also obtains $O\bigl((d \log n + \log(1/\delta))/n\bigr)$ excess risk, a fact that apparently was not widely known although it follows from results in the literature. To vanquish the $\log n$ factor at the small price of a $\log(1/\delta)$ factor, it suffices to run a two-phase ERM method based on a confidence-boosting device. The key to our analysis is connecting exp-concavity to the central condition of van Erven et al. (2015), which in turn implies a Bernstein condition. We then exploit the variance control of the excess loss random variables afforded by the Bernstein condition to boost the “boosting the confidence” trick of Schapire (1990).

In the next section, we discuss a brief history of the work in this area. In Section 3, we formally define the setting and describe the previous in-expectation bounds. We present the results for standard ERM and our confidence-boosted ERM method in Sections 4 and 5 respectively. Section 6 extends the results of Kakade and Tewari (2009) to exp-concave losses, showing that under a bounded loss assumption a regret bound for any online exp-concave learner transfers to a high probability excess risk bound via an online-to-batch conversion. This extension comes at no additional technical price: it is a consequence of the variance control implied by exp-concavity, control leveraged by Freedman’s inequality for martingales to obtain a fast rate with high probability. This result continues the line of work of Cesa-Bianchi et al. (2001) and Kakade and Tewari (2009) and accordingly is about the generalization ability of online exp-concave learning algorithms. One powerful consequence of this result is a new guarantee for model selection aggregation: we present a method (Section 7) for the model selection aggregation problem over finite classes with exp-concave losses that obtains a rate of with high probability, with no explicit dependence on the Lipschitz continuity of the loss function. All previous bounds of which we are aware have explicit dependence on the Lipschitz continuity of the problem. Moreover, the bound is a quantile-like bound in that it improves with the prior measure on a subclass of nearly optimal hypotheses.

## 2 A history of exp-concave learning

Learning under exp-concave losses with finite classes dates back to the seminal work of Vovk (1990) and the game of prediction with expert advice, with the first explicit treatment for exp-concave losses due to Kivinen and Warmuth (1999). Vovk (1990) showed that if a game is -mixable (which is implied by -exp-concavity), one can guarantee that the worst-case individual sequence regret against the best of experts is at most . An online-to-batch conversion then implies an in-expectation excess risk bound of the same order in the stochastic i.i.d. setting.

Audibert (2008) showed that when learning over a finite class with exp-concave losses, no progressive mixture rule can obtain a high probability excess risk bound of order better than . ERM fares even worse, with a lower bound of in expectation (Juditsky et al., 2008). Audibert (2008) overcame the deviations shortcoming of progressive mixture rules via his empirical star algorithm, which first runs ERM on , obtaining , and then runs ERM a second time on the star convex hull of with respect to . This algorithm achieves with high probability; the rate was only proved for squared loss with targets and predictions in , but it was claimed that the result can be extended to general, bounded losses satisfying smoothness and strong convexity as a function of predictions . Under similar assumptions, Lecué and Rigollet (2014) proved that a method, $Q$-aggregation, also obtains this rate but can further take into account a prior distribution.

For convex classes, such as the class $\mathcal{F}$ we consider here, Hazan et al. (2007) designed the Online Newton Step (ONS) and Exponentially Weighted Online Optimization (EWOO) algorithms. Both have $O(d \log n)$ regret over $n$ rounds, which, after online-to-batch conversion, yields $O(d \log n / n)$ excess risk in expectation. Until recently, it was unclear whether one could obtain a similar high probability result; however, Mahdavi et al. (2015) showed that an online-to-batch conversion of ONS enjoys excess risk bounded by with high probability. While this resolved the statistical complexity of learning up to $\log n$ factors, ONS (though efficient) can have a high computational cost of even in simple cases like learning over the unit ball, and in general its complexity may be as high as per projection step (Hazan et al., 2007; Koren, 2013).

If one hopes to eliminate the $\log n$ factor, the additional hardness of the online setting makes it unlikely that one can proceed via an online-to-batch conversion approach. Moreover, computational considerations suggest circumventing ONS anyway. In this vein, as we discuss in the next section, both Koren and Levy (2015) and Gonen and Shalev-Shwartz (2016) recently established in-expectation excess risk bounds for a lightly penalized ERM algorithm and ERM itself respectively, without resorting to an online-to-batch conversion. Notably, both works developed arguments based on algorithmic stability, thereby circumventing the typical reliance on chaining-based arguments to discard $\log n$ factors. Table 1 summarizes what is known and our new results.

## 3 Rate-optimal in-expectation bounds

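We now describe the setting more formally. In this work $\mathcal{F}$ is always assumed to be convex, except in Section 7, which studies the model selection aggregation problem for countable classes. We say a function $g : A \to \mathbb{R}$ has diameter $C$ if $\sup_{x, y \in A} |g(x) - g(y)| \le C$. Assume for each outcome $z \in \mathcal{Z}$ that the loss map $f \mapsto \ell(f, z)$ is $\eta$-exp-concave, i.e. $f \mapsto e^{-\eta \ell(f, z)}$ is concave over $\mathcal{F}$. We further assume, for each outcome $z$, that the loss $\ell(\cdot, z)$ has diameter $B$. We adopt the notation $\ell_f(z) := \ell(f, z)$. Given a sample of $n$ points drawn i.i.d. from an unknown distribution $P$ over $\mathcal{Z}$, our objective is to select a hypothesis $f \in \mathcal{F}$ that minimizes the excess risk $\mathbb{E}_{Z \sim P}[\ell_f(Z)] - \inf_{f' \in \mathcal{F}} \mathbb{E}_{Z \sim P}[\ell_{f'}(Z)]$. We assume that there exists $f^* \in \mathcal{F}$ satisfying $\mathbb{E}_{Z \sim P}[\ell_{f^*}(Z)] = \inf_{f \in \mathcal{F}} \mathbb{E}_{Z \sim P}[\ell_f(Z)]$; this assumption also was made by Gonen and Shalev-Shwartz (2016) and Kakade and Tewari (2009).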

Let be an algorithm, defined for a function class as a mapping ; we drop the subscript when it is clear from the context. Our starting point will be an algorithm which, when provided with a sample of i.i.d. points, satisfies an expected risk bound of the form

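$$\mathbb{E}_{\mathbf{Z} \sim P^n}\Bigl[\mathbb{E}_{Z \sim P}\bigl[\ell_{\mathcal{A}(\mathbf{Z})}(Z) - \ell_{f^*}(Z)\bigr]\Bigr] \le \psi(n). \tag{1}$$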

Koren and Levy (2015) and Gonen and Shalev-Shwartz (2016) both established in-expectation bounds of the form (1) that obtain a rate of in the case when , each in a slightly different setting. Koren and Levy (2015) assume, for each outcome , that the loss has diameter and is -smooth for some , i.e. for all , the gradient is -Lipschitz:

$$\|\nabla_f \ell(f, z) - \nabla_f \ell(f', z)\|_2 \le \beta \|f - f'\|_2.$$

They also use a 1-strongly convex regularizer with diameter . Under these assumptions, they show that ERM run with the weighted regularizer has expected excess risk at most

$$\psi(n) = \frac{1}{n}\left(\frac{24 \beta d}{\eta} + 100 B d + R\right).$$

It is not known if the smoothness assumption is necessary to eliminate the factor.

Gonen and Shalev-Shwartz (2016) work in a slightly different setting that captures all known exp-concave losses. They assume that the loss is of the form , for . They further assume, for each , that the mapping is -strongly convex and -Lipschitz, but they do not assume smoothness. They show that standard, unregularized ERM has expected excess risk at most

$$\psi(n) = \frac{2 L^2 d}{\alpha n} = \frac{2 d}{\eta n},$$

where $\eta = \alpha / L^2$; the purpose of the rightmost expression is that the loss is $\eta$-exp-concave. Although this bound ostensibly is independent of the loss’s diameter $B$, the dependence may be masked by $\eta$: for logistic loss, , while squared loss admits the more favorable .

## 4 A high probability bound for ERM

As a warm-up to proving a high probability excess risk bound, we first show that ERM itself obtains excess risk with high probability; here and elsewhere, if is omitted the dependence is . That ERM satisfies such a bound was largely implicit in the literature, and so we make this result explicit. The closest such result, Theorem 1 of Mahdavi and Jin (2014), does not apply as it relies on an additional assumption (see their Assumption (I)). Our assumptions subtly differ from elsewhere in this work. We assume that satisfies and that, for each outcome , the loss is -Lipschitz and . The first two assumptions already imply the last for . All these assumptions were made by Mahdavi and Jin (2014) and Koren and Levy (2015), sometimes implicitly, and while Gonen and Shalev-Shwartz (2016) only make the Lipschitz assumption, for all known -exp-concave losses the constant depends on (which itself typically will depend on ).

The first, critical observation is that exp-concavity implies good concentration properties of the excess loss random variable. This is easiest to see by way of the -central condition, which the excess loss satisfies. This concept, studied by van Erven et al. (2015) and first introduced by van Erven et al. (2012) as “stochastic mixability”, is defined as follows.

###### Definition (Central condition)

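We say that $(\ell, P, \mathcal{F})$ satisfies the $\eta$-central condition for some $\eta > 0$ if there exists a comparator $f^* \in \mathcal{F}$ such that, for all $f \in \mathcal{F}$,

$$\mathbb{E}_{Z \sim P}\bigl[e^{-\eta(\ell_f(Z) - \ell_{f^*}(Z))}\bigr] \le 1.$$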

Jensen’s inequality implies that if this condition holds, the corresponding $f^*$ must be a risk minimizer. It is known (van Erven et al., 2015, Section 4.2.2) that in our setting $(\ell, P, \mathcal{F})$ satisfies the $\eta$-central condition.

###### Lemma

Let be convex. Take to be a loss function , and assume that, for each , the map is -exp-concave. Then, for all distributions over , if there exists an that minimizes the risk under , then satisfies the -central condition.

With the central condition in our grip, Theorem 7 of Mehta and Williamson (2014) directly implies an bound for ERM; however, a far simpler version of that result yields much smaller constants. The proof of the version below, in the appendix for completeness, only makes use of an -net of in the norm, which induces an -net of in the sup norm.

###### Theorem

Let be a convex set satisfying . Suppose, for all , that the loss is -exp-concave and -Lipschitz. Let . Then if , with probability at least , ERM learns a hypothesis with excess risk bounded as

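$$\mathbb{E}_{Z \sim P}\bigl[\ell_{\hat{f}}(Z) - \ell_{f^*}(Z)\bigr] \le \frac{1}{n}\left(8\left(B \vee \frac{1}{\eta}\right)\left(d \log(16 L R n) + \log\frac{1}{\delta}\right) + 1\right). \tag{2}$$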

## 5 Boosting the confidence for high probability bounds

The two existing excess risk bounds mentioned in Section 3 decay at the rate $1/n$. A naïve application of Markov’s inequality unsatisfyingly yields excess risk bounds of order $\psi(n)/\delta$ that hold with probability $1 - \delta$. In this section, we present and analyze our meta-algorithm, ConfidenceBoost, which boosts these in-expectation bounds to hold with probability at least $1 - \delta$ at the price of a $\log(1/\delta)$ factor. This method is essentially the “boosting the confidence” trick of Schapire (1990); the novelty lies in a refined analysis that exploits a Bernstein-type condition to improve the rate in the final high probability bound from the typical $O(1/\sqrt{n})$ to the desired $O(1/n)$.

Our analysis of ConfidenceBoost actually applies more generally than the exp-concave learning setting, requiring only that satisfy an in-expectation bound of the form (1), the loss have bounded diameter for each , and the problem satisfy a -Bernstein condition.

###### Definition (Bernstein condition)

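We say that $(\ell, P, \mathcal{F})$ satisfies the $(C, q)$-Bernstein condition for some $C > 0$ and $q \in (0, 1]$ if there exists a comparator $f^* \in \mathcal{F}$ such that, for all $f \in \mathcal{F}$,

$$\mathbb{E}_{Z \sim P}\bigl[(\ell_f(Z) - \ell_{f^*}(Z))^2\bigr] \le C \, \mathbb{E}_{Z \sim P}\bigl[\ell_f(Z) - \ell_{f^*}(Z)\bigr]^q.$$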

Before getting to ConfidenceBoost, we first show that the exp-concave learning setting satisfies the Bernstein condition with the best exponent, , and so is a special case of the more general setting we analyze. Recall from Lemma 4 that the -central condition holds for . The next lemma, which adapts a result of van Erven et al. (2015), shows that the -central condition, together with boundedness of the loss, implies that a Bernstein condition holds.

###### Lemma (Central to Bernstein)

Let $X$ be a random variable taking values in $[-B, B]$. Assume that $\mathbb{E}[e^{-\eta X}] \le 1$. Then $\mathbb{E}[X^2] \le 4(1/\eta + B) \, \mathbb{E}[X]$.
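As a quick sanity check of how the central condition controls the second moment, the sketch below evaluates both sides on a finite-support random variable. It assumes the constant $4(1/\eta + B)$ that appears in the conditional variance lemma of Section 6; the function names and the two-point example are illustrative choices, not part of the original analysis.

```python
import math
from typing import Sequence


def central_condition_holds(values: Sequence[float], probs: Sequence[float], eta: float) -> bool:
    """Check E[exp(-eta * X)] <= 1 for a finite-support random variable X."""
    return sum(p * math.exp(-eta * x) for x, p in zip(values, probs)) <= 1.0


def bernstein_slack(values: Sequence[float], probs: Sequence[float], eta: float, B: float) -> float:
    """Return 4 (1/eta + B) E[X] - E[X^2]; nonnegative when the bound holds."""
    mean = sum(p * x for x, p in zip(values, probs))
    second_moment = sum(p * x * x for x, p in zip(values, probs))
    return 4.0 * (1.0 / eta + B) * mean - second_moment


# Hypothetical excess loss: +1 with probability 0.8, -1 with probability 0.2.
values, probs = [1.0, -1.0], [0.8, 0.2]
print(central_condition_holds(values, probs, eta=1.0))
print(bernstein_slack(values, probs, eta=1.0, B=1.0))
```

When too much mass sits on the negative value (for example probabilities 0.7 and 0.3 with $\eta = 1$), the central condition itself fails, so the lemma makes no claim.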

### Boosting the boosting-the-confidence trick.

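First, consider running $\mathcal{A}$ on a sample $\mathbf{Z}_1$ of $n$ i.i.d. points. The excess risk random variable $\mathbb{E}_Z[\ell_{\mathcal{A}(\mathbf{Z}_1)}(Z) - \ell_{f^*}(Z)]$ is nonnegative, and so Markov’s inequality and the expected excess risk being bounded by $\psi(n)$ imply that

$$\Pr\Bigl(\mathbb{E}_Z\bigl[\ell_{\mathcal{A}(\mathbf{Z}_1)}(Z) - \ell_{f^*}(Z)\bigr] \ge e \cdot \psi(n)\Bigr) \le \frac{1}{e}.$$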

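Now, let $\mathbf{Z}_1, \ldots, \mathbf{Z}_K$ be independent samples, each of size $n$. Running $\mathcal{A}$ on each sample yields $\hat{f}_1, \ldots, \hat{f}_K$. Applying Markov’s inequality as above, combined with independence, implies that with probability at least $1 - e^{-K}$ there exists $j \in [K]$ such that $\mathbb{E}_Z[\ell_{\hat{f}_j}(Z) - \ell_{f^*}(Z)] \le e \cdot \psi(n)$. Let us call this good event $\textsc{good}$.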
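The independence step can be written out explicitly. In the short derivation below, $K$ denotes the number of independent runs and $\hat{f}_j$ the hypothesis returned on the $j$-th sample; both symbols are shorthand assumed here for the quantities described in the text.

```latex
\begin{align*}
\Pr\Bigl(\forall j \in [K] :\
  \mathbb{E}_Z\bigl[\ell_{\hat{f}_j}(Z) - \ell_{f^*}(Z)\bigr] \ge e \cdot \psi(n)\Bigr)
 &= \prod_{j=1}^{K}
    \Pr\Bigl(\mathbb{E}_Z\bigl[\ell_{\hat{f}_j}(Z) - \ell_{f^*}(Z)\bigr] \ge e \cdot \psi(n)\Bigr)
    && \text{(independent samples)} \\
 &\le \left(\frac{1}{e}\right)^{K} = e^{-K}
    && \text{(Markov, applied to each run)}.
\end{align*}
% Hence K >= log(1/delta') runs make the failure probability at most delta'.
```

So the number of runs only needs to grow logarithmically in the inverse failure probability, which is where the $\log(1/\delta)$ price enters.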

Our quest is now to show that on event good, we can identify any of the hypotheses approximately satisfying , where by “approximately” we mean up to some slack that weakens the order of our resulting excess risk bound by a multiplicative factor of at most . As we will see, it suffices to run ERM over this finite subclass using a fresh sample. The proposed meta-algorithm is presented in Algorithm 1.
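The two-phase procedure just described can be sketched in a few lines. Since Algorithm 1 is not reproduced here, the interface below is hypothetical: `learner` stands for the base algorithm, and the choices $K = \lceil \log(2/\delta) \rceil$ and "half of the data for phase one" are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, TypeVar

H = TypeVar("H")  # hypothesis type
Z = TypeVar("Z")  # outcome type


def confidence_boost(
    sample: Sequence[Z],
    learner: Callable[[Sequence[Z]], H],
    loss: Callable[[H, Z], float],
    delta: float,
) -> H:
    """Two-phase confidence boosting (illustrative sketch, assumed parameters)."""
    n = len(sample)
    # Assumed choice: K = ceil(log(2 / delta)) independent runs.
    K = max(1, math.ceil(math.log(2.0 / delta)))
    # Assumed split: roughly half of the data is shared among the K runs.
    n_run = n // (2 * K)
    if n_run == 0:
        raise ValueError("sample too small for the requested confidence")

    # Phase 1: run the base learner on K disjoint sub-samples.
    candidates: List[H] = [
        learner(sample[j * n_run:(j + 1) * n_run]) for j in range(K)
    ]

    # Phase 2: ERM over the K candidates using the remaining, fresh points.
    fresh = sample[K * n_run:]

    def empirical_risk(f: H) -> float:
        return sum(loss(f, z) for z in fresh) / len(fresh)

    return min(candidates, key=empirical_risk)


# Toy usage: estimate a location parameter under squared loss.
def mean_learner(points: Sequence[float]) -> float:
    return sum(points) / len(points)


def squared_loss(f: float, z: float) -> float:
    return (f - z) ** 2


data = [float(i) for i in range(100)]
chosen = confidence_boost(data, mean_learner, squared_loss, delta=0.1)
```

In the toy run, the second phase simply returns the candidate whose empirical risk on the held-out points is smallest; the analysis in this section explains why that selection costs only $O(1/n)$ under a Bernstein condition.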

### Analysis.

From here on out, we treat the initial sample of size as fixed and unhat the estimators above, referring to them as . Without loss of generality, we further assume that they are sorted in order of increasing risk (breaking ties arbitrarily). Our goal now is to show that running ERM on the finite class yields low excess risk with respect to comparator . A typical analysis of the boosting the confidence trick would apply Hoeffding’s inequality to select a risk minimizer optimal to resolution , but this is not good enough here. As a further boost to the trick, this time with respect to its resolution, we will establish that a Bernstein condition holds over a particular subclass of with high probability, which will in turn imply that ERM obtains excess risk over .

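We first establish an approximate Bernstein condition for $\{f_1, \ldots, f_K\}$. Since $\mathbb{E}[\ell_{f_1} - \ell_{f^*}] \le \mathbb{E}[\ell_{f_j} - \ell_{f^*}]$ for all $j \in [K]$, from the $(C, q)$-Bernstein condition,

$$\begin{aligned}
\|\ell_{f_j} - \ell_{f_1}\|_{L_2(P)}^2
&\le C\bigl(3 \, \mathbb{E}[\ell_{f_j} - \ell_{f^*}]^q + \mathbb{E}[\ell_{f_1} - \ell_{f^*}]^q\bigr) \\
&= C\bigl(3 \, (\mathbb{E}[\ell_{f_j} - \ell_{f_1}] + \mathbb{E}[\ell_{f_1} - \ell_{f^*}])^q + \mathbb{E}[\ell_{f_1} - \ell_{f^*}]^q\bigr) \\
&\le C\bigl(3 \, \mathbb{E}[\ell_{f_j} - \ell_{f_1}]^q + 4 \, \mathbb{E}[\ell_{f_1} - \ell_{f^*}]^q\bigr),
\end{aligned}$$

where the last step follows because the map $x \mapsto x^q$ is concave and hence subadditive. We call this bound an approximate Bernstein condition because, on event $\textsc{good}$, for all $j \in [K]$:

$$\|\ell_{f_j} - \ell_{f_1}\|_{L_2(P)}^2 \le C\bigl(3 \, \mathbb{E}[\ell_{f_j} - \ell_{f_1}]^q + 4 \, (e \cdot \psi(n))^q\bigr).$$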

Define the class as the set . Then with probability , the problem satisfies the -Bernstein condition.

We now analyze the outcome of running ERM on using a fresh sample of points. The next lemma shows that ERM performs favorably under a Bernstein condition, a well-known result.

###### Lemma

Let be a finite class of functions and assume without loss of generality that is a risk minimizer. Let be a subclass for which, for all :

$$\mathbb{E}\bigl[(\ell_f - \ell_{f_1})^2\bigr] \le C \, \mathbb{E}[\ell_f - \ell_{f_1}]^q,$$

and almost surely. Then, with probability at least , ERM run on will not select any function in whose excess risk satisfies

$$\mathbb{E}[\ell_f - \ell_{f_1}] \ge \left(\frac{2\left(C + \frac{B^{2-q}}{3}\right)\log\frac{|\mathcal{G}'| - 1}{\delta}}{n}\right)^{1/(2-q)}.$$

Applying Lemma 5 with and , with probability at least over the fresh sample, ERM selects a function falling in one of two cases:

• ;

• (using ).

We now run ConfidenceBoost with on a sample of points, with and ; for simplicity, we assume that divides . Taking the failure probability for the ERM phase to be , ConfidenceBoost admits the following guarantee.

###### Theorem

Let satisfy the -Bernstein condition, and assume for all that the loss has diameter . Impose any necessary assumptions such that algorithm obtains a bound of the form (1). Then, with probability at least , ConfidenceBoost run with , , and learns a hypothesis with excess risk at most

 (3)

The next result for exp-concave learning is immediate.

###### Corollary

Applying Theorem 5 with the algorithm of Koren and Levy (2015) and their assumptions (with ), the bound in Theorem 5 specializes to

$$O\left(\frac{\log\frac{1}{\delta}}{n}\left(\frac{d\beta}{\eta} + dB + R\right)\right). \tag{4}$$

Similarly taking the algorithm of Gonen and Shalev-Shwartz (2016) and their assumptions yields

$$O\left(\frac{\log\frac{1}{\delta}}{n}\left(\frac{d}{\eta} + B\right)\right). \tag{5}$$

### Remarks.

As we saw from Lemmas 4 and 5, in the exp-concave setting a Bernstein condition holds for the class . A natural inquiry is if one could use this Bernstein condition to show directly a high probability fast rate of for ERM. Indeed, under strong convexity (which is strictly stronger than exp-concavity), Sridharan et al. (2009) show that a similar bound for ERM is possible; however, they used strong convexity to bound a localized complexity. It is unclear if exp-concavity can be used to bound a localized complexity, and the Bernstein condition alone seems insufficient; such a bound may be possible via ideas from the local norm analysis of Koren and Levy (2015). While we think controlling a localized complexity from exp-concavity is a very interesting and worthwhile direction, we leave this to future work, and for now only conjecture that ERM also enjoys excess risk bounded by with high probability. This conjecture is from analogy to the empirical star algorithm of Audibert (2008), which for convex reduces to ERM itself; note that the conjectured effect of is additive rather than multiplicative.

## 6 Online-to-batch-conversion

The present section’s purpose is to show that if one is willing to accept the additional factor in a high probability bound, then it is sufficient to use an online-to-batch conversion of an online exp-concave learner whose worst-case cumulative regret (over rounds) is logarithmic in . Using such a conversion, it is easy to get an excess risk bound with the additional factor that holds in expectation. The key difficulty is making such a bound hold with high probability. This result provides an alternative to the high probability result for ERM in Section 4.

Mahdavi et al. (2015) previously considered an online-to-batch conversion of ONS and established the first explicit high probability excess risk bound in the exp-concave statistical learning setting. Their analysis is elegant but seems to be intimately coupled to ONS; it consequently is unclear if their analysis can be used to grasp excess risk bounds by online-to-batch conversions of other online exp-concave learners. This leads to our next point and a new path: it is possible to transfer regret bounds to high probability excess risk bounds via online-to-batch conversion for general online exp-concave learners. Our analysis builds strongly on the analysis of Kakade and Tewari (2009) in the strongly convex setting.

We first consider a different, related setting: online convex optimization (OCO) under a -bounded, -strongly convex loss that is -Lipschitz with respect to the action. An OCO game unfolds over rounds. An adversary first selects a sequence of convex loss functions . In round , the online learner plays , the environment subsequently reveals cost function , and the learner suffers loss . Note that the adversary is oblivious, and so the learner does not necessarily need to randomize. Because we are interested in analyzing the statistical learning setting, we constrain the adversary to play a sequence of points , inducing cost functions .

Consider an online learner that sequentially plays actions in response to , so that depends on . The (cumulative) regret is defined as

$$\sum_{t=1}^{n} \ell_{f_t}(z_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell_f(z_t).$$

When the losses are bounded, strongly convex, and Lipschitz, Kakade and Tewari (2009) showed that if an online algorithm has regret on an i.i.d. sequence , online-to-batch conversion by simple averaging of the iterates admits the following guarantee.

###### Theorem (Cor. 5, Kakade and Tewari (2009))

For all , assume that is bounded by , -strongly convex, and -Lipschitz. Then with probability at least the action satisfies excess risk bound

Under various assumptions, there are OCO algorithms that obtain worst-case regret (under all sequences ) . For instance, Online Gradient Descent (Hazan et al., 2007) admits the regret bound , where is an upper bound on the gradient.
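The online-to-batch conversion by simple averaging, together with the cumulative regret it is measured against, can be sketched as follows. This is a minimal illustration for real-valued actions; `update` is a hypothetical callback standing in for an arbitrary online learner, and regret is computed against a finite list of comparators rather than an infimum over the whole action space.

```python
from typing import Callable, List, Sequence, Tuple


def online_to_batch(
    init: float,
    update: Callable[[float, float, int], float],
    sample: Sequence[float],
) -> Tuple[float, List[float]]:
    """Run an online learner once over the sample and average its iterates."""
    iterates: List[float] = []
    f = init
    for z in sample:
        iterates.append(f)                 # f_t is committed before z_t is revealed
        f = update(f, z, len(iterates))    # produce f_{t+1} from f_t, z_t, and t
    return sum(iterates) / len(iterates), iterates


def cumulative_regret(
    iterates: Sequence[float],
    sample: Sequence[float],
    loss: Callable[[float, float], float],
    comparators: Sequence[float],
) -> float:
    """Learner's cumulative loss minus that of the best fixed comparator."""
    learner_loss = sum(loss(f, z) for f, z in zip(iterates, sample))
    best_fixed = min(sum(loss(f, z) for z in sample) for f in comparators)
    return learner_loss - best_fixed


# Toy learner: the running mean of past points (follow-the-leader for squared loss).
def running_mean(f: float, z: float, t: int) -> float:
    return f + (z - f) / t


def squared_loss(f: float, z: float) -> float:
    return (f - z) ** 2


averaged, its = online_to_batch(0.0, running_mean, [2.0, 2.0, 2.0, 2.0])
regret = cumulative_regret(its, [2.0, 2.0, 2.0, 2.0], squared_loss, comparators=[2.0])
```

The averaged action is the hypothesis whose excess risk the regret-to-risk results bound; the guarantee depends on the regret actually incurred on the sample, not only on a worst-case bound.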

What if we relax strong convexity to exp-concavity? As we will see, it is possible to extend the analysis of Kakade and Tewari (2009) to -exp-concave losses. Of course, such a regret-to-excess-risk bound conversion is useful only if we have online algorithms and regret bounds to start with. Indeed, at least two such algorithms and bounds exist, due to Hazan et al. (2007):

• ONS, with , where is a bound on the gradient and is a bound on the diameter of the action space.

• Exponentially Weighted Online Optimization (EWOO), with . The better regret bound comes at the price of not being computationally efficient. EWOO can be run in randomized polynomial time, but the regret bound then holds only in expectation (which is insufficient for an online-to-batch conversion).

We now show how to extend the analysis of Kakade and Tewari (2009) to exp-concave losses. While similar results can be obtained from the work of Mahdavi et al. (2015) for the specific case of ONS, our analysis is agnostic of the base algorithm. A particular consequence is that our analysis also applies to EWOO, which, although highly impractical, offers a better regret bound. Moreover, our analysis applies to any future online learning algorithms which may have improved guarantees and computational complexities. The key insight is that exp-concavity implies a variance inequality similar to Lemma 1 of Kakade and Tewari (2009), a pivotal result of that work that unlocks Freedman’s inequality for martingales (Freedman, 1975). Let denote the sequence .

###### Lemma (Conditional variance control)

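Define the martingale difference sequence

$$\xi_t := \mathbb{E}_Z\bigl[\ell_{f_t}(Z) - \ell_{f^*}(Z)\bigr] - \bigl(\ell_{f_t}(Z_t) - \ell_{f^*}(Z_t)\bigr).$$

Then

$$\mathrm{Var}\bigl[\xi_t \mid Z_1^{t-1}\bigr] \le 4\left(\frac{1}{\eta} + B\right)\mathbb{E}_Z\bigl[\ell_{f_t}(Z) - \ell_{f^*}(Z)\bigr].$$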

###### Proof

Observe that . Treating the sequence as fixed and also treating as a fixed parameter , the above conditional variance equals , where the randomness lies entirely in . Then, Lemma 5 implies that .

The next corollary is from a retrace of the proof of Theorem 2 of Kakade and Tewari (2009).

###### Corollary

For all , let be bounded by and -exp-concave with respect to the action . Then with probability at least , for any , the excess risk of is at most

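$$\frac{R_n}{n} + 4\sqrt{\left(\frac{1}{\eta} + B\right)\log\frac{4 \log n}{\delta}} \cdot \frac{\sqrt{R_n}}{n} + 16\left(\frac{1}{\eta} + B\right)\frac{\log\frac{4 \log n}{\delta}}{n}.$$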

In particular, an online-to-batch conversion of EWOO yields excess risk of order

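$$\frac{d \log n}{\eta n} + \frac{\sqrt{d \log n}}{n}\left(\sqrt{\frac{(\log\log n) B}{\eta}} + \sqrt{\left(\frac{1}{\eta^2} + \frac{B}{\eta}\right)\log\frac{1}{\delta}}\right) + \frac{(\log\log n) B + B \log\frac{1}{\delta}}{n}.$$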

By proceeding similarly one can get a guarantee for ONS, under the additional assumptions that has bounded diameter and that, for all , the gradient has bounded norm.

### Obtaining o(logn) excess risk.

The worst-case regret bounds in this online setting have a factor, but when the environment is stochastic and the distribution satisfies some notion of easiness the actual regret can be . In such situations the excess risk similarly can be because our excess risk bounds depend not on worst-case regret bounds but rather the actual regret. We briefly explore one scenario where such improvement is possible. Suppose that the loss is also -smooth; then, in situations when the cumulative loss of is small, the analysis of Orabona et al. (2012, Theorem 1) for ONS yields a more favorable regret bound: they show a regret bound of order . As a simple example, consider the case when the problem is realizable in the sense that almost surely. Then the regret bound is constant and the rate with respect to for the excess risk in Corollary 6 is .

## 7 Model selection aggregation

In the model selection aggregation problem for exp-concave losses, we are given a countable class of functions from an input space to an output space and a loss ; for each , the mapping is -exp-concave. The loss is a supervised loss, as in supervised classification and regression, unlike the more general loss functions used in the rest of the paper which fit into Vapnik’s general setting of the learning problem (Vapnik, 1995). The random points now decompose into an input-output pair . We often use the notation . The goal is the same as in the stochastic exp-concave optimization problem, but now fails to be convex (and the exp-concavity assumption slightly differs).

After Audibert (2008) showed that the progressive mixture rule cannot obtain fast rates with high probability, several works developed methods that departed from progressive mixture rules and gravitated instead toward ERM-style rules, starting with the empirical star algorithm of Audibert (2008) and a subsequent method of Lecué and Mendelson (2009) which runs ERM over the convex hull of a data-dependent subclass. Lecué and Rigollet (2014) extended these results to take into account a prior on the class using their $Q$-aggregation procedure. All the methods require Lipschitz continuity of the loss and are for finite classes, although we believe that $Q$-aggregation combined with a suitable prior extends to countable classes. In this section, we present an algorithm that carefully composes exponential weights-type algorithms and still obtains a fast rate with high probability for the model selection aggregation problem. One incarnation can do so with the fast rate of for finite , by relying on Boosted ERM. Another, “pure” version, based on exponential weights-type procedures alone, can get a rate of with no explicit dependence on the Lipschitz continuity of the loss. To our knowledge, this is the first fast rate high probability bound for model selection aggregation that lacks explicit dependence on the Lipschitz constant of the loss. Both results hold more generally, allowing for countable classes, taking into account a prior distribution over , and providing a quantile-like improvement when there is a low quantile with close to optimal risk.

Since is countable and hence not convex, algorithms for stochastic exp-concave optimization do not directly apply. Our approach is to apply stochastic exp-concave optimization to the convex hull of a certain small cardinality and data-dependent subset of . The first phase of obtaining this subset makes use of the progressive mixture rule. We offer two variants for the second phase: PM-EWOO (Algorithm 2) and PM-CB (Algorithm 3). In the algorithms, and are online-to-batch conversions of the progressive mixture rule and EWOO respectively, is ConfidenceBoost, and is ERM.
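For intuition about the first phase, here is a minimal sketch of the progressive mixture rule over a finite class: it forms the exponential weights posterior after each prefix of the sample and returns the Cesàro mean of those posteriors. The loss-matrix interface and the function name are assumptions made for illustration; the resulting weights define a mixture over hypotheses, whose predictions would be averaged when the prediction space is convex.

```python
import math
from typing import List, Sequence


def progressive_mixture_weights(
    losses: Sequence[Sequence[float]],
    prior: Sequence[float],
    eta: float,
) -> List[float]:
    """Cesàro mean of exponential-weights posteriors over a finite class.

    losses[t][i] is the loss of hypothesis i on the (t+1)-th sample point.
    """
    n, m = len(losses), len(prior)
    cumulative = [0.0] * m
    averaged = [0.0] * m
    for t in range(n + 1):
        # Posterior after the first t points; shift exponents for stability.
        shift = min(cumulative)
        unnorm = [prior[i] * math.exp(-eta * (cumulative[i] - shift)) for i in range(m)]
        total = sum(unnorm)
        for i in range(m):
            averaged[i] += unnorm[i] / total / (n + 1)
        if t < n:
            for i in range(m):
                cumulative[i] += losses[t][i]
    return averaged


# Toy usage: hypothesis 0 always suffers loss 0, hypothesis 1 always loss 1.
weights = progressive_mixture_weights([[0.0, 1.0], [0.0, 1.0]], [0.5, 0.5], eta=1.0)
```

Averaging over the $n + 1$ posteriors (including the prior itself) is what distinguishes this rule from simply using the final posterior.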

Our interest in PM-EWOO is two-fold: (i) it is a “purely” exponential weights type method in that it is based only on the progressive mixture rule and EWOO; (ii) it does not require any Lipschitz assumption on the loss function, unlike all previous work.

###### Theorem

Let $\mathcal{F}$ be a countable class and $\pi$ a prior distribution over $\mathcal{F}$. Assume that for each the loss is -exp-concave. Further assume that for all in the support of . Then with probability at least , PM-EWOO run with , and learns a hypothesis satisfying

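$$\mathbb{E}_{Z \sim P}\bigl[\ell_{\hat{f}}(Z) - \ell_{f^*}(Z)\bigr] \le e \cdot \mathrm{BayesRed}_\eta\!\left(\frac{n}{2\lceil \log\frac{2}{\delta} \rceil}, \pi\right) + \theta_{\mathrm{ew}}(\delta, n),$$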
 with θ\textscew(δ,n)=O⎛⎜ ⎜⎝√B(log1δ+√log1δlogn)ηn+Bloglognδ)n⎞⎟ ⎟⎠.

Here, is the -generalized Expected Bayesian Redundancy (Takeuchi and Barron, 1998; Grünwald, 2012), defined as

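$$\inf_{\rho \in \Delta(\mathcal{F})} \left\{ \mathbb{E}_Z\bigl[\mathbb{E}_{f \sim \rho}[\ell_f(Z)] - \ell_{f^*}(Z)\bigr] + \frac{D(\rho \,\|\, \pi)}{\eta (n + 1)} \right\},$$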

for the KL-divergence. The bound can be rewritten as a quantile-like bound; for all :

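$$\mathbb{E}_{Z \sim P}\bigl[\ell_{\hat{f}}(Z) - \mathbb{E}_{f \sim \rho}[\ell_f(Z)]\bigr] \le (e - 1) \, \mathrm{gap}(\rho, f^*) + \frac{2 e \lceil \log\frac{2}{\delta} \rceil D(\rho \,\|\, \pi)}{\eta n} + \theta_{\mathrm{ew}}(\delta, n),$$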

where $\mathrm{gap}(\rho, f^*) := \mathbb{E}_Z\bigl[\mathbb{E}_{f \sim \rho}[\ell_f(Z)] - \ell_{f^*}(Z)\bigr]$. This bound enjoys a quantile-like improvement when $\mathrm{gap}(\rho, f^*)$ is small. For instance, if there is a set of large prior measure which has excess risk close to , then Theorem 7 pays for the complexity; in contrast, Theorem A of Lecué and Rigollet (2014) pays a higher complexity price of .

Lastly, we provide a simpler bound by specializing to the case of concentrated entirely on . Then

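$$\mathbb{E}_{Z \sim P}\bigl[\ell_{\hat{f}}(Z) - \ell_{f^*}(Z)\bigr] \le \frac{2 e \lceil \log\frac{2}{\delta} \rceil \log\frac{1}{\pi(f^*)}}{\eta n} + \theta_{\mathrm{ew}}(\delta, n).$$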
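For readers who want the intermediate steps, the following derivation sketches how the quantile-like form and its specialization follow from the bound stated in terms of the Bayesian redundancy. It assumes that $\mathrm{gap}(\rho, f^*)$ denotes $\mathbb{E}_Z[\mathbb{E}_{f \sim \rho}[\ell_f(Z)] - \ell_{f^*}(Z)]$ and abbreviates $K = \lceil \log(2/\delta) \rceil$; both are shorthand introduced here.

```latex
\begin{align*}
e \cdot \mathrm{BayesRed}_\eta\!\left(\tfrac{n}{2K}, \pi\right)
  &\le e \cdot \mathrm{gap}(\rho, f^*)
     + \frac{e \, D(\rho \,\|\, \pi)}{\eta\left(\tfrac{n}{2K} + 1\right)}
  && \text{(any fixed } \rho \text{ upper bounds the infimum)} \\
  &\le e \cdot \mathrm{gap}(\rho, f^*)
     + \frac{2 e K \, D(\rho \,\|\, \pi)}{\eta n}
  && \text{(drop the } +1 \text{ in the denominator)}.
\end{align*}
% Subtract gap(rho, f^*) from both sides of the theorem's bound, using
%   E[l_{\hat f} - l_{f^*}] - gap(rho, f^*) = E[l_{\hat f} - E_{f ~ rho} l_f],
% to obtain the (e - 1) gap(rho, f^*) term of the quantile-like form.
% Specialization: if rho is a point mass on f^*, then gap(rho, f^*) = 0 and
%   D(rho || pi) = log(1 / pi(f^*)).
```

The factor $e - 1$ on the gap term is the reason the bound degrades when no low-gap $\rho$ has small divergence from the prior.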

Theorem 7 does not explicitly require Lipschitz continuity of the loss, but the rate is suboptimal due to the extra factor. The next result obtains the correct rate by using ConfidenceBoost for the second stage of the procedure.

###### Theorem

Take the assumptions of Theorem 7, but instead assume that for each the loss is -strongly convex and -Lipschitz (so -exp-concavity holds). Then with probability at least , PM-CB run with , and learns a hypothesis satisfying

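$$\mathbb{E}_{Z \sim P}\bigl[\ell_{\hat{f}}(Z) - \ell_{f^*}(Z)\bigr] \le e \cdot \mathrm{BayesRed}_\eta\!\left(\frac{n}{4\lceil \log\frac{3}{\delta} \rceil}, \pi\right) + \theta_{\mathrm{cb}}(\delta, n),$$

with

$$\theta_{\mathrm{cb}}(\delta, n) = O\left(\frac{\bigl(\log\frac{1}{\delta}\bigr)^2}{\eta n} + \frac{B \log\frac{1}{\delta}}{n}\right).$$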

The proofs of the two theorems above are nearly identical and left to the appendix. We sketch a proof here, as it uses a novel reduction of the second phase to a low-dimensional stochastic exp-concave optimization problem. For simplicity, we restrict to the case of finite , uniform prior , and competing with . A naïve approach is to run a stochastic exp-concave optimization method on the convex hull of , but this suffers an excess risk bound scaling as rather than . We instead start with an initial procedure that drastically reduces the set of candidates to a set of . To this end, note that an online-to-batch conversion of the progressive mixture rule run on samples obtains expected excess risk at most . Hence, independent runs yield a hypothesis with the same bound inflated by a factor with probability at least (we assume that this high probability event holds hereafter). At this point, it seems that we have replaced the original problem with an isomorphic one, as we do not know which yields the desired candidate , and the corresponding subclass is still clearly non-convex. However, by taking the convex hull of this set of predictors and reparameterizing the problem, we arrive at a stochastic -exp-concave optimization problem over the -dimensional simplex; the best predictor in the convex hull is clearly at least as good as the best one in . Thus, our analyses of EWOO and ConfidenceBoost apply and the results follow.

## 8 Discussion and Open Problems

We presented the first high probability excess risk bound for exp-concave statistical learning. The key to proving this bound was the connection between exp-concavity and the central condition, a connection which suggests that exp-concavity implies a low noise condition. Here, low noise can be interpreted either in terms of the central condition, by the exponential decay of the negative tail of the excess loss random variables, or in terms of the Bernstein condition, by the variance of the excess loss of a hypothesis being controlled by its excess risk. All our results for stochastic exp-concave optimization were based on this low noise interpretation of exp-concavity. In contrast, the previous in-expectation results of Koren and Levy (2015) and Gonen and Shalev-Shwartz (2016) used the geometric/convexity interpretation of exp-concavity, which we further boosted to high probability results using the low noise interpretation. It would be interesting to get a high probability result that proceeds purely from a low noise interpretation or purely from a geometric/convexity one.

Many results flowing from algorithmic stability often only yield in-expectation bounds, with high probability bounds stemming either from (i) a posthoc confidence boosting procedure — typically involving Hoeffding’s inequality, which “slows down” fast rate results; or (ii) quite strong stability notions — e.g. uniform stability allows one to apply McDiarmid’s inequality to a single run of the algorithm (Bousquet and Elisseeff, 2002). Is it a limitation of algorithmic stability techniques that high probability fast rates seem to be out of reach without a posthoc confidence boosting procedure, or are we simply missing the right perspective? One reason to avoid a confidence boosting procedure is that the resulting bounds suffer from a multiplicative factor rather than the lighter effect of an additive factor in bounds like Theorem 4. As we mentioned earlier, we conjecture that the basic ERM method obtains a high probability rate, and a potential path to show this rate would be to control a localized complexity as done by Sridharan et al. (2009) but using a more involved argument based on exp-concavity rather than strong convexity.

We also developed high probability quantile-like risk bounds for model selection aggregation, one with an optimal rate and another with a slightly suboptimal rate but no explicit dependence on the Lipschitz continuity of the loss. However, our bound form is not yet a full quantile-type bound; it degrades when the gap term is large, while the bound of Lecué and Rigollet (2014) does not have this problem. Yet, our bound provides an improvement when there is a neighborhood around with large prior mass, which the bound of Lecué and Rigollet cannot do. It is an open problem to get a bound with the best of both worlds.

## Appendix A Proofs for Stochastic Exp-Concave Optimization

###### Proof (of Lemma 4)

The exp-concavity of for each implies that, for all and all distributions over :

It therefore holds that for all distributions over , for all distributions over , there exists (from convexity of ) satisfying

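$$\mathbb{E}_{Z \sim P}\bigl[\ell(f^*, Z)\bigr] \le \mathbb{E}_{Z \sim P}\left[-\frac{1}{\eta}\log \mathbb{E}_{f \sim Q}\bigl[e^{-\eta \ell(f, Z)}\bigr]\right].$$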

This condition is equivalent to stochastic mixability as well as the pseudoprobability convexity (PPC) condition, both defined by van Erven et al. (2015). To be precise, for stochastic mixability, in Definition 4.1 of van Erven et al. (2015), take their and both equal to our , their equal to , and