# A Tight Excess Risk Bound via a Unified PAC-Bayesian–Rademacher–Shtarkov–MDL Complexity

Peter D. Grünwald (pdg@cwi.nl), Centrum Wiskunde & Informatica and Leiden University

Nishant A. Mehta (nmehta@uvic.ca)

###### Abstract

We present a novel notion of complexity that interpolates between and generalizes some classic complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to L2(P) entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi, who treated the log-loss case with L∞ entropy. Together, these results recover optimal bounds for VC-type and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: ‘easiness’ (Bernstein) conditions and model complexity.


## 1 Introduction

We simultaneously address four questions of learning theory:

1. We establish a precise relation between Rademacher complexities for arbitrary bounded losses and the minimax cumulative log-loss regret, also known as the Shtarkov integral and normalized maximum likelihood (NML) complexity.

2. We bound this minimax regret in terms of L2 entropy. Past results were based on L∞ entropy.

3. We introduce a new type of complexity that enables a unification of data-dependent PAC-Bayesian and empirical-process-type excess risk bounds into a single clean bound; this bound recovers minimax optimal rates for large classes under Bernstein ‘easiness’ conditions.

4. We extend the link between excess risk bounds for arbitrary losses and codelengths of Bayesian codes to general codes.

All four results are part of the chain of bounds in Figure 1. The arrow stands for ‘bounded in terms of’; the precise bounds (which may hold in probability and expectation or may even be an equality) are given in the respective results in the paper. Red arrows indicate results that are new.

We start with a family of predictors for an arbitrary loss function, which may be, for example, log-loss, squared error loss or 0-1 loss, and an estimator which on each sample outputs a distribution on this family; classic deterministic estimators such as ERM are represented by an estimator that outputs the Dirac measure on the selected predictor. The main bound, Theorem 3, bounds the annealed excess risk of a fixed but arbitrary estimator in terms of its empirical risk on the training data plus a novel notion of complexity (formulas for this complexity and all other concepts in the paper are summarized in the Glossary on page 6). The annealed excess risk is a proxy for (and lower bound on) the actual excess risk, the expected loss difference between predicting with the estimator and predicting with the actual risk minimizer within the family. A second bound (Corollary 3, based on Lemma 3, itself from Grünwald (2012)) bounds the actual excess risk in terms of the annealed excess risk, so that we get a true excess risk bound for the estimator. The complexity depends on a luckiness function, which can be chosen freely; different choices lead to different complexities and excess risk bounds. For a nonconstant luckiness function, the complexity becomes data-dependent; in particular, for luckiness functions of a specific form involving the density of a ‘prior’ distribution on the family of predictors, the complexity is, by Proposition 2.4, (strictly) upper bounded by the information complexity of Zhang (2006a, b), which involves a Kullback-Leibler (KL) divergence term. Information complexity generalizes earlier complexity notions and accompanying bounds from the information theory literature such as (extended) stochastic complexity (Rissanen, 1989; Yamanishi, 1998) and resolvability (Barron and Cover, 1991; Barron et al., 1998), as well as excess risk bounds from the PAC-Bayesian literature (Audibert, 2004; Catoni, 2007). Together, these bounds recover and strengthen Zhang’s bounds.
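The role of the KL divergence term can be illustrated on a finite class. The following is a minimal sketch, assuming a generic KL-penalized objective (expected cumulative loss under a posterior plus KL divergence to the prior, divided by a learning rate `eta`); it is not the paper's exact definition of information complexity, and the function names and toy numbers are hypothetical. It checks the standard fact that the generalized Bayesian (Gibbs) posterior minimizes this objective, with minimum value equal to a log-mixture term.

```python
import math

def kl_divergence(rho, prior):
    """KL(rho || prior) for two discrete distributions given as lists."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)

def kl_penalized_objective(rho, prior, cum_losses, eta):
    """Expected cumulative loss under rho plus KL(rho || prior) / eta."""
    expected_loss = sum(r * l for r, l in zip(rho, cum_losses))
    return expected_loss + kl_divergence(rho, prior) / eta

def gibbs_posterior(prior, cum_losses, eta):
    """Generalized Bayesian posterior: rho(f) proportional to prior(f) * exp(-eta * loss(f))."""
    shift = min(cum_losses)  # subtract the minimum loss for numerical stability
    weights = [p * math.exp(-eta * (l - shift)) for p, l in zip(prior, cum_losses)]
    total = sum(weights)
    return [w / total for w in weights]

def log_mixture_value(prior, cum_losses, eta):
    """Closed-form minimum: -(1/eta) * log sum_f prior(f) * exp(-eta * loss(f))."""
    shift = min(cum_losses)
    mix = sum(p * math.exp(-eta * (l - shift)) for p, l in zip(prior, cum_losses))
    return shift - math.log(mix) / eta

# Hypothetical toy numbers: three predictors with a uniform prior.
prior = [1 / 3, 1 / 3, 1 / 3]
cum_losses = [4.0, 5.5, 9.0]
eta = 0.5
rho = gibbs_posterior(prior, cum_losses, eta)
```

Evaluating `kl_penalized_objective(rho, prior, cum_losses, eta)` and `log_mixture_value(prior, cum_losses, eta)` gives the same number up to floating-point error, while any other posterior (for instance the prior itself) gives a value at least as large.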

For a constant luckiness function, the complexity is independent of the data and turns out (Section 2.2) to be equal to the minimax cumulative individual-sequence regret for sequential prediction with log-loss relative to a family of probability measures defined in terms of the family of predictors, also known as the log-Shtarkov integral or NML (Normalized Maximum Likelihood) complexity. NML complexity has been much studied in the MDL (minimum description length) literature (Rissanen, 1996; Grünwald, 2007).
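As a concrete instance of NML complexity, the sketch below computes the log-Shtarkov sum for the Bernoulli model on binary sequences of length `n`. The Bernoulli model is an assumption chosen for illustration (the paper's family of measures is built from a general loss), and the function names are hypothetical. The sum of maximized likelihoods over all sequences is computed once by grouping sequences by their number of ones and once by brute-force enumeration.

```python
import math
from itertools import product

def bernoulli_nml_complexity(n):
    """Log-Shtarkov sum for the Bernoulli model, grouping sequences by their count of ones."""
    total = 0.0
    for k in range(n + 1):
        theta = k / n  # maximum-likelihood parameter for a sequence with k ones
        # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here
        total += math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)
    return math.log(total)

def bernoulli_nml_complexity_bruteforce(n):
    """Same quantity by enumerating all 2^n binary sequences (small n only)."""
    total = 0.0
    for seq in product([0, 1], repeat=n):
        k = sum(seq)
        theta = k / n
        total += theta ** k * (1 - theta) ** (n - k)
    return math.log(total)
```

For `n = 1` both sequences have maximized likelihood 1, so the complexity is `log 2`; the grouped and brute-force versions agree for any small `n`, and the value grows with `n`.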

##### Problem A: NML and Rademacher

NML complexity can itself be bounded in terms of a new complexity we introduce, H-local complexity, which is further bounded in terms of Rademacher complexity (Theorem 4.2 and Corollary 1). Both Rademacher and NML complexities are used as penalties in model selection (albeit with different motivations), and the close conceptual similarity between NML and Rademacher complexity has been noted by several authors (e.g. Grünwald (2007); Zhu et al. (2009); Roos (2016)). For example, as shown by Grünwald (2007, Open Problem 19, page 583), in classification problems both the empirical Rademacher complexity for 0-1 loss and the NML complexity of a family of conditional distributions can be expressed simply in terms of a (possibly transformed) minimized cumulative loss that is uniformly averaged over all possible values of the data to be predicted, thereby measuring how well the model can fit random noise. Theorem 4.2 and Corollary 1 establish, for the first time, a precise and tight link between NML and Rademacher complexity. The proofs extend a technique due to Opper and Haussler (1999), who bound NML complexity in terms of L∞ entropy using an empirical process result of Yurinskiĭ (1976). By using Talagrand’s inequality instead, we get a bound in terms of Rademacher complexity.
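The idea of measuring how well a model can fit random noise can be sketched directly for a finite class. The following minimal example, with hypothetical function and variable names, computes the empirical Rademacher complexity exactly by averaging the best achievable correlation with every possible sign vector; this is feasible only for small samples and is meant as an illustration of the definition, not as the analysis used in the paper.

```python
from itertools import product

def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite class.

    function_values: list of tuples, one per function, giving its values on
    the n sample points. Averages sup_f (1/n) * sum_i sigma_i * f(x_i) over
    all 2^n sign vectors sigma (feasible for small n only).
    """
    n = len(function_values[0])
    total = 0.0
    for sigma in product([-1, 1], repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, values)) for values in function_values)
    return total / (n * 2 ** n)

# Hypothetical classes on n = 4 points, with values in {-1, +1}.
single = [(1, -1, 1, 1)]                          # one function: cannot adapt to noise
all_patterns = list(product([-1, 1], repeat=4))   # every labelling: fits any noise
```

A class containing a single function has complexity 0, since its correlation with random signs averages out, while a class realizing every sign pattern on the sample attains the maximal value 1.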

##### Problem B: Bounding NML Complexity with L2(P) Entropy and Empirical L2 Entropy

If the class is of VC-type or has polynomial empirical entropy, the Rademacher complexity can be further bounded (Theorem 4.3) in terms of the empirical L2 entropy; if the class admits polynomial entropy with bracketing, a further bound holds (Theorem 4.3) in terms of this entropy with bracketing. These latter two results are well known, due to Koltchinskii (2011) and Massart and Nédélec (2006) respectively, but in conjunction with the preceding bounds they become of significant interest for log-loss individual sequence prediction. Whereas previous bounds on minimax log-loss regret were invariably in terms of L∞ entropy (Opper and Haussler, 1999; Cesa-Bianchi and Lugosi, 2001; Rakhlin and Sridharan, 2015), the aforementioned two results allow us to obtain bounds in terms of L2(P) entropy and empirical L2 entropy, where P can be any member of the class under consideration. Unlike the latter two works, however, our results are restricted to static experts that treat the data as i.i.d.

##### Problem C: Unifying data-dependent and empirical process-type excess risk bounds

As lamented by Audibert (2004; 2009), despite their considerable appeal, standard PAC-Bayesian and KL excess risk bounds do not lead to the right rates for large classes, i.e. classes with polynomial entropy. On the other hand, standard Rademacher complexity generalization and excess risk bound analyses are not easily extended to penalized estimators or to generalized Bayesian estimators that are based on updating a prior distribution; handling logarithmic loss also appears difficult. Yet our main bound shows that a single bound capturing all these applications does exist: by varying the luckiness function one can get both (a strict strengthening of) the KL bounds and a Rademacher complexity-type excess risk bound. In this way, via the chain of bounds