PAC-Bayes Un-Expected Bernstein Inequality
We present a new PAC-Bayesian generalization bound. Standard bounds contain a complexity term which dominates unless the empirical error of the learning algorithm’s randomized predictions vanishes. We manage to replace this empirical error by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (for large enough sample sizes). Theoretically, unlike existing bounds, our new bound can be expected to converge to zero faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and excess risk bounds — for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein’s, but with the squared random variable taken outside of its expectation.
PAC-Bayesian generalization bounds (alquier2018simpler, ; catoni2003pac, ; Catoni07, ; germain2009pac, ; germain2015risk, ; guedj2019primer, ; McAllester98, ; McAllester99, ; McAllester02, ) have recently received renewed interest in the context of deep neural networks (dziugaite2017computing, ; neyshabur2017pac, ; zhou2018, ). In particular, Zhou et al. (zhou2018, ) and Dziugaite and Roy (dziugaite2017computing, ) showed that, by extending an idea due to Langford and Caruana LangfordC02 (), one can obtain nontrivial (but still not very strong) generalization bounds on real-world datasets such as MNIST and ImageNet. Since nontrivial generalization bounds are even harder to obtain using alternative methods, there remains a strong interest in improved PAC-Bayesian bounds. In this paper, we provide a considerably improved bound whenever the employed learning algorithm is sufficiently stable on the given data.
Standard bounds all have a complexity term on the right, in the form of a Kullback-Leibler divergence between a prior and a posterior, multiplied by the posterior expected loss on the training sample. The latter only vanishes if there is a sufficiently large neighborhood around the ‘center’ of the posterior at which the training error is 0. In the two papers dziugaite2017computing (); zhou2018 () mentioned above, this is not the case. For example, the various deep net experiments reported by Dziugaite and Roy (dziugaite2017computing, , Table 1) all have a non-negligible empirical error, so that the complexity term is multiplied by a non-negligible factor. Furthermore, the complexity increases substantially with the sample size, making the bound converge to zero at a rate slower than the usual one.
In this paper, we provide a bound (Theorem 1) in which the empirical error is replaced by a second-order term — a term which goes to zero in many cases in which the empirical error does not. This can be viewed as an extension of an earlier second-order approach by Tolstikhin and Seldin tolstikhin2013pac () (TS from now on): they also replace the empirical error, but by a term that, while usually smaller, will tend to be larger than ours. Specifically, as they write, in classification settings (our primary interest), their replacement is not much smaller than the empirical error itself. In contrast, our second-order term can be very close to zero in classification even when the empirical error is large. While the TS bound is based on an ‘empirical’ Bernstein inequality due to maurer2009empirical (), our bound is based on a different modification of Bernstein’s moment inequality, in which the squared random variable is taken outside of its expectation. We call this result, which is of independent interest, the un-expected Bernstein inequality — see Lemma 10.
The second-order term in our bound goes to zero — and our bound improves on existing bounds — whenever the employed learning algorithm is relatively stable on the given data; for example, if the predictor learned on an initial segment (say, the first half) of the dataset performs similarly (i.e. assigns similar losses to the same samples) to the predictor based on the full data. This improvement is reflected in our experiments where, except for very small sample sizes, we consistently outperform existing bounds both on a toy classification problem with label noise and on standard UCI datasets Dua:2019 (). Of course, the importance of stability for generalization has been recognized before in landmark papers such as bousquet2002stability (); mukherjee2006learning (); shalev2010learnability (). However, the data-dependent stability notion occurring in our bound seems very different from any of the notions discussed in those papers.
Theoretically, a further contribution is that we connect our PAC-Bayesian generalization bound to excess risk bounds: we show (Theorem 4) that our generalization bound can be of comparable size to excess risk bounds, up to an irreducible term that is independent of model complexity. The excess risk bound that can be attained for any given problem depends both on the complexity of the set of predictors and on the inherent ‘easiness’ of the problem. The latter is often measured in terms of the exponent of the Bernstein condition that holds for the given problem (bartlett2006empirical, ; erven2015fast, ; grunwald2016fast, ), which generalizes the exponent in the celebrated Tsybakov margin condition (bartlett2006convexity, ; tsybakov2004optimal, ) (this has been a main topic in two recent NeurIPS workshops on ‘learning from easy data’). The larger the exponent, the faster the excess risk converges. In Section 5, we essentially show that the rate at which the (often dominating) second-order term goes to zero can also be bounded by a quantity that gets smaller as the exponent gets larger. In contrast, previous PAC-Bayesian bounds do not have such a property.
Contents. In Section 2, we introduce the problem setting and provide a first, simplified version of our theorem. Section 3 gives our main bound. Experiments are presented in Section 4, followed by theoretical motivation in Section 5. The proof of our main bound is provided in Section 6, where we first present the convenient ESI language for expressing stochastic inequalities, and (our main tool) the un-expected Bernstein inequality in Lemma 10. The paper ends with an outlook on future work.
2 Problem Setting, Background, and Simplified Version of Our Bound
Setting and Notation.
Let the data be i.i.d. random variables taking values in some set. Let a hypothesis set be given, together with a bounded loss function assigning to each hypothesis the loss it makes on a given sample point. We call any such tuple a learning problem. For a given hypothesis, we denote its risk (expected loss on a test sample of size 1) and its empirical error (average loss on the training sample). For any distribution on the hypothesis set, we extend both notions by additionally taking expectations over the hypothesis.
For any and any variables in , we denote and , with the convention that . We follow a similar rule for and . As is customary in PAC-Bayesian works, a learning algorithm is a (computable) function that, upon observing the input sample, outputs a ‘posterior’ distribution on the hypothesis set. The posterior could be a Gibbs or a generalized-Bayesian posterior, but also the output of any other algorithm. A ‘prior’ is a distribution on the hypothesis set which has to be specified in advance, before seeing the data. Finally, we denote by KL the Kullback-Leibler divergence between the posterior and the prior.
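To make the posterior-expected quantities above concrete, here is a minimal numerical sketch for a finite hypothesis set (function names and the discrete setting are our own illustration, not the paper’s): the posterior-expected empirical error averages per-hypothesis empirical errors under the posterior, and the KL term compares posterior to prior.

```python
import numpy as np

def gibbs_empirical_error(losses, posterior):
    """Posterior-expected empirical error: average the per-hypothesis
    empirical errors (row means of `losses`) under `posterior`.

    losses: (H, n) array, losses[j, i] = loss of hypothesis j on sample i.
    posterior: (H,) array of probabilities over the H hypotheses.
    """
    per_hypothesis = losses.mean(axis=1)
    return float(np.asarray(posterior, float) @ per_hypothesis)

def kl_discrete(rho, pi):
    """KL(rho || pi) for discrete distributions, using 0 * log(0 / x) = 0."""
    rho, pi = np.asarray(rho, float), np.asarray(pi, float)
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))
```
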
Comparing Bounds. Both existing state-of-the-art PAC-Bayes bounds and ours essentially take the following form: there exist constants, and a function logarithmic in the sample size and the confidence parameter, such that, with probability at least 1 − δ over the sample, it holds that,
where the remaining quantities are sample-dependent and may differ from one bound to another. Existing classical bounds that, after slight relaxations, take on this form are due to Langford and Seeger (langford2003pac, ; Seeger02, ), Catoni catoni2007pac (), Maurer maurer2004note (), and Tolstikhin and Seldin (TS) tolstikhin2013pac () (see the latter for a nice overview). In all these cases the choices essentially coincide, except that for the TS bound the second-order quantity is equal to the empirical loss variance. Our bound in Theorem 1 also fits (1) (after a relaxation), but with considerably different choices for these quantities.
Of special relevance in our experiments is the bound due to Maurer maurer2004note (), which, as noted by TS tolstikhin2013pac (), tightens the PAC-Bayes-kl inequality due to Seeger seeger2002pac (), and is one of the tightest known generalization bounds in the literature. It can be stated as follows: for any prior, any confidence level, and any learning algorithm, with probability at least 1 − δ,
where kl denotes the binary Kullback-Leibler divergence. Applying a standard relaxation of kl to (2) yields a bound of the form (1) (see tolstikhin2013pac () for more details). Note also that using Pinsker’s inequality together with (2) implies McAllester’s classical PAC-Bayesian bound McAllester98 ().
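The ‘inversion’ of the binary KL that turns (2) into an explicit bound on the risk can be carried out numerically by bisection; a sketch (function names are ours):

```python
import math

def binary_kl(q, p):
    """kl(q || p) for Bernoulli parameters, with the convention 0 log 0 = 0."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def kl_inverse_upper(q, budget, tol=1e-9):
    """Largest p in [q, 1) with kl(q || p) <= budget, found by bisection.
    This is the risk upper bound implied by a PAC-Bayes-kl statement."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, with zero empirical error the inversion reduces to solving −log(1 − p) ≤ budget.
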
and at most , where is an upper bound on the loss. Like TS’s and Catoni’s bounds, but unlike McAllester’s and Maurer’s, our logarithmic factor grows with the sample size. A larger difference is that our complexity term is a sum of two KL divergences in which the prior is ‘informed’: it is really the posterior based on half the sample. Our experiments confirm that this tends to be much smaller than the standard complexity term. Note that one can make the existing bounds conceptually closer to ours by learning a classifier based on the first half of the data, and using a prior ‘centered’ at that classifier to obtain a PAC-Bayes bound on the second half — an idea pioneered by ambroladze2007tighter (). In additional experiments reported in Appendix A, we found that while this improves existing bounds in some cases, it also worsens them in others, and the conclusions of Section 4 (the experiments) remain the same.
In light of the preceding observation, we regard the fact that our bound features the second-order term instead of the empirical error as the more fundamental difference to other approaches. Only TS tolstikhin2013pac () have a quantity that is somewhat reminiscent of ours: in their case, it is the empirical loss variance. The crucial difference is that the empirical loss variance cannot be close to zero unless a sizeable posterior region of hypotheses has empirical error almost constant on most data instances. For classification with 0-1 loss, this is a strong condition, since the empirical loss variance of a hypothesis equals its empirical error times one minus that error, which is only close to zero if the empirical error is itself close to 0 or 1. In contrast, our term can go to zero even if the empirical error and variance do not. This can be witnessed in our experiments in Section 4. In Section 5, we argue more formally that, under a Bernstein condition, our second-order term can be much smaller than the empirical error. Note, finally, that our term has a 2-fold cross-validation flavor, but in contrast to a cross-validation error, for it to be small it is sufficient that the losses are similar, not that they are small.
The price we pay for all this (and which does not show up in other bounds) is the right-most, irreducible remainder term in (1), of order at most one over the square root of the sample size (up to log-factors). Note, however, that this term is decoupled from the complexity, and thus it is not affected by the complexity growing with the sample size. We call this an ‘irreducible’ term because, by the central limit theorem, a term of this order is unavoidable in any PAC-Bayesian bound: this is the case even if the hypothesis set is a singleton and there is no learning, unless the loss has zero variance.
3 Main Bound
We now present our main result in its most general form. Let and , for , where is an upper-bound on the loss .
[Main Theorem] Let be i.i.d. with . Let and be any distribution with support on a finite or countable grid . For any , and any learning algorithms , we have,
with probability at least , where , , and are the random variables defined by:
While the result holds in general, in the remainder of this paper we assume for simplicity that the sample size is even and that the split point is at the halfway mark. We will also be using the grid and distribution defined by
Roughly speaking, this choice ensures that the infima in (3) are attained within the grid. Using the relaxation in (3) and tuning the parameters within the grid defined in (4) leads to a bound of the form (1). Furthermore, we see that the expression in the simplified version of Theorem 1 given in the previous section follows when the algorithms are chosen as deterministic estimators, represented by Dirac distributions. It is clear that Theorem 1 is considerably more general than its corollary above: when predicting a point in the second term, we could use an estimator that depends not only on the first half of the sample but also on the part of the second half seen so far, and analogously for the first term. We can thus base our bound on a sum of errors achieved by online estimators that converge to the final posterior based on the full data. Doing this would likely improve our bounds, but is computationally demanding, and so we did not try it in our experiments.
4 Experiments
In this section, we experimentally compare our bound in Theorem 1 to those of TS tolstikhin2013pac (), Catoni (Catoni07, , Theorem 1.2.8), and Maurer in (2) (code available at https://github.com/bguedj/PAC-Bayesian-Un-Expected-Bernstein-Inequality). For the latter, given the empirical error and the RHS of (2), we solve for an upper bound on the risk by ‘inverting’ the binary KL divergence. We note that TS tolstikhin2013pac () do not claim that their bound is better than Maurer’s in classification (in fact, they do better in other settings).
Setting. We consider both synthetic and real-world datasets for binary classification, and we evaluate bounds using the 0-1 loss. Hypotheses are linear classifiers over the feature space, and the error associated with a hypothesis on a sample is the 0-1 loss of the corresponding sign prediction. We learn our hypotheses using regularized logistic regression; given a sample, we compute
For , and , we choose algorithm in Theorem 1 such that
Given a sample, we set the ‘posterior’ to be an isotropic Gaussian centered at the learned weight vector with a chosen variance. The prior distribution is a zero-mean Gaussian with the same variance.
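With both posterior and prior being isotropic Gaussians of equal variance, the KL term in the bounds has a simple closed form: the squared distance between the means divided by twice the variance. A sketch (function name ours, under the equal-variance assumption above):

```python
import numpy as np

def kl_isotropic_gaussians(w_post, w_prior, sigma2):
    """KL( N(w_post, sigma2*I) || N(w_prior, sigma2*I) ).

    For equal isotropic covariances, the trace and log-determinant terms
    cancel, leaving ||w_post - w_prior||^2 / (2 * sigma2).
    """
    diff = np.asarray(w_post, float) - np.asarray(w_prior, float)
    return float(diff @ diff / (2.0 * sigma2))
```
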
Parameters. We fix the confidence parameter for all datasets, and (approximately) solve (5) using the BFGS algorithm. For each bound, we pick the posterior variance which minimizes it on the given data. In order for the bounds to still hold with probability at least 1 − δ, we replace δ on the RHS of each bound by δ divided by the number of candidate variance values (this follows from an application of the union bound). We choose the prior variance to be the value that was best on average for the bounds we compare against. We choose the grid in Theorem 1 as in (4). Finally, we approximate Gaussian expectations using Monte Carlo sampling.
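The Monte Carlo approximation of a Gaussian expectation mentioned above can be sketched as follows for the posterior-expected 0-1 error (a generic illustration; the sample sizes and interface are our assumptions):

```python
import numpy as np

def mc_gibbs_01_error(w_hat, sigma2, X, y, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the Gaussian-posterior-expected 0-1 error:
    draw weight vectors w ~ N(w_hat, sigma2 * I) and average the 0-1
    error of the sign predictions sign(X @ w) over the draws."""
    rng = np.random.default_rng(seed)
    W = w_hat + np.sqrt(sigma2) * rng.standard_normal((n_samples, len(w_hat)))
    preds = np.sign(X @ W.T)                      # (n, n_samples)
    errors = (preds != y[:, None]).mean(axis=0)   # 0-1 error per sampled w
    return float(errors.mean())
```
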
Synthetic data. We generate synthetic data for a fixed dimension and sample sizes between 800 and 8000. For a given sample size, we 1) draw feature vectors [resp. noise bits] identically and independently from a multivariate Gaussian distribution [resp. a Bernoulli distribution]; and 2) set each label based on the first few coordinates of the corresponding feature vector, flipped when the noise bit is on. Figure 2 shows the results averaged over 10 independent runs for each sample size.
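A hedged sketch of such a generator, assuming the clean label is the sign of the sum of the first few coordinates and the Bernoulli draw flips it (the exact labeling rule and constants of the paper’s experiment are not recoverable from this excerpt):

```python
import numpy as np

def make_synthetic(n, d=30, d_signal=5, flip_prob=0.1, seed=0):
    """Generate n points: standard Gaussian features in R^d; the label is
    the sign of the sum of the first d_signal coordinates, flipped with
    probability flip_prob (Bernoulli label noise). All constants here are
    illustrative assumptions, not the paper's exact experiment."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    clean = np.sign(X[:, :d_signal].sum(axis=1))
    flips = rng.random(n) < flip_prob
    y = np.where(flips, -clean, clean)
    return X, y
```
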
UCI datasets. For the second experiment, we use several UCI datasets. These are listed in Table 2 (where Breast-C. stands for Breast Cancer). We encode categorical variables as appropriate 0-1 vectors. This effectively increases the dimension of the input space (reported in Table 2). After removing any rows (i.e. instances) containing missing features and performing the encoding, the input data are scaled such that every column has values between -1 and 1. We used a 5-fold train-test split (the training set size is reported in Table 2), and the results in Table 2 are averages over 5 runs. We only compare with Maurer’s bound, since the other bounds were worse than both Maurer’s and ours on all datasets.
Discussion. As the dimension of the input space increases, the complexity — and thus all the PAC-Bayes bounds discussed in this paper — grows larger. Our bound suffers less from this increase since, for a large enough sample size, the second-order term is small enough (see Figure 2) to absorb any increase in the complexity. In fact, for large enough sample sizes, the irreducible (complexity-free) term in our bound becomes the dominant one. This, combined with the fact that the second-order term becomes small for the 0-1 loss at large enough sample sizes (see Figure 2), makes our bound tighter than the others.
Adding a regularization term in the objective (5) is important, as it stabilizes the learned hypotheses; a similar effect is achieved with methods like gradient descent, as they essentially have a ‘built-in’ regularization. For very small sample sizes, the regularization in (5) may not be enough to ensure that the hypotheses learned on half and on all of the data are close, in which case the second-order term need not be small. In particular, this is the case for the Haberman and the Breast Cancer datasets, where the advantage of our bound is not fully leveraged, and Maurer’s bound is smaller.
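A minimal sketch of the regularized logistic regression fit described around (5), solved with BFGS via scipy (the exact objective normalization and regularization constant are our assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(X, y, lam=0.01):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)) + lam * ||w||^2
    with BFGS, as a stand-in for the paper's regularized objective (5).
    Labels y are in {-1, +1}."""
    n, d = X.shape

    def objective(w):
        margins = y * (X @ w)
        # log(1 + exp(-m)) computed stably via logaddexp(0, -m)
        return np.mean(np.logaddexp(0.0, -margins)) + lam * (w @ w)

    res = minimize(objective, np.zeros(d), method="BFGS")
    return res.x
```

The regularization term keeps the learned weight vector stable across sub-samples, which is exactly what the second-order term in the bound rewards.
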
5 Theoretical Motivation of the Bound
In this section, we study the behavior of our bound (3) under a Bernstein condition:
[Bernstein Condition (BC)] The learning problem satisfies the -Bernstein condition, for and , if for all ,
where is the risk minimizer within the closure of the hypothesis class.
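For reference, the Bernstein condition with a given exponent is commonly stated as follows (our reconstruction in standard notation; the displayed formula did not survive extraction here):

```latex
% beta-Bernstein condition: for constants B >= 1 and beta in [0, 1],
% with h^* the risk minimizer (in the closure of the class),
\mathbb{E}\!\left[\big(\ell_h(Z) - \ell_{h^*}(Z)\big)^2\right]
\;\le\; B \left(\mathbb{E}\!\left[\ell_h(Z) - \ell_{h^*}(Z)\right]\right)^{\beta}
\qquad \text{for all } h \in \mathcal{H}.
```

Larger exponents mean the excess-loss variance shrinks faster as a hypothesis approaches the risk minimizer, which is what enables faster rates.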
The Bernstein condition audibert2004pac (); bartlett2006convexity (); bartlett2006empirical (); erven2015fast (); koolen2016combining () essentially characterizes the ‘easiness’ of the learning problem; it implies that the variance of the excess loss random variable gets smaller as the risk of a hypothesis gets closer to that of the risk minimizer. For bounded loss functions, the BC with exponent 0 always holds. The BC with exponent 1 (the ‘easiest’ learning setting) is also known as the Massart noise condition massart2006risk (); it holds in our experiment with synthetic data in Section 4, and also, e.g., whenever the hypothesis class is convex and the loss is exp-concave erven2015fast (); mehta2017fast (). For more examples of learning settings where a BC holds, see (koolen2016combining, , Section 3).
Our aim in this section is to give an upper-bound on the infimum term involving in (3), under a BC, in terms of the complexity and the excess risks , , and , where for a distribution , the excess risk is defined by
In the next theorem, we denote and , for . To simplify the presentation further (and for consistency with Section 4), we assume that is chosen such that
In addition to the ‘ESI’ tools provided in Section 6 and Lemma 10, the proof of Theorem 4, presented in Appendix C, also uses an ‘ESI version’ of the Bernstein condition due to (koolen2016combining, ).
First note that the only terms in our main bound (3), other than the infimum on the LHS of (7), are the empirical error and a complexity-free term which is typically smaller than the complexity term (e.g. when the dimension of the hypothesis class is large enough). The complexity term is often the dominating one in other PAC-Bayesian bounds.
Now consider the remaining term in our main bound, which matches the infimum term on the LHS of (7), and let us choose the second algorithm as per Remark 2. Suppose that, with high probability (w.h.p.), the complexity term converges to 0 (otherwise no PAC-Bayesian bound would converge to 0); then the sum of the last two terms on the RHS of (7) converges to 0 w.h.p. at a rate faster than, or equal to, that of the complexity term, depending on the Bernstein exponent. Thus, in light of Theorem 4, to argue that our bound can be better than others, it remains to show that there exist algorithms for which the sum of the excess risks on the RHS of (7) is smaller than the complexity term.
One choice of estimator with small excess risk is the Empirical Risk Minimizer (ERM). If one chooses the first algorithm such that it outputs a Dirac at the ERM on a given sample, then under a BC with a given exponent, and for ‘parametric’ hypothesis classes (such as the linear classifiers in Sec. 4), the corresponding excess risks are small w.h.p. audibert2004pac (); grunwald2016fast (). However, taking the final posterior to be a Dirac is not allowed, since otherwise the KL divergence to the prior would be infinite. Instead, one can choose the final posterior to be a generalized-Bayes/Gibbs posterior. In this case too, under a BC and for parametric classes, the excess risk is small w.h.p. for clever choices of prior audibert2004pac (); grunwald2016fast ().
6 Detailed Analysis
We start this section by presenting the convenient ESI notation and use it to present our main technical Lemma 10 (proofs of the ESI results are in Appendix B). We then continue with a proof of Theorem 1.
Definition 5 can be extended to the case where is also a random variable, in which case the expectation in (8) needs to be replaced by the expectation over the joint distribution of (, , ). When no ambiguity can arise, we omit from the ESI notation. Besides simplifying notation, ESIs are useful in that they simultaneously capture “with high probability” and “in expectation” results:
[ESI Implications] For fixed , if then . For both fixed and random , if , then , , with probability at least .
In the next proposition, we present two results concerning transitivity and additive properties of ESI:
[ESI Transitivity and Chain Rule] (a) Let be any random variables on (not necessarily independent). If for some , , for all , then
(b) Suppose now that are i.i.d. and let be any real-valued function. If for some , , for all and all , then .
We now give a basic PAC-Bayesian result for the ESI context (the proof, slightly different from standard change-of-measure arguments, is in Appendix B):
[ESI PAC-Bayes] Fix and be any family of random variables such that for all , . Let be any distribution on and let be a learning algorithm. We have:
In many applications (especially for our main result) it is desirable to work with a random (i.e. data-dependent) in the ESI inequalities: one can obtain tighter bounds by tuning in ‘hindsight’.
[ESI from fixed to random ] Let be a countable subset of and let be a prior distribution over . Given a countable collection of random variables satisfying , for all fixed , we have, for arbitrary estimator with support on ,
The following key lemma, which is of independent interest, is central to our main result:
[Key result: un-expected Bernstein] Let be a random variable bounded from above by almost surely, and let . For all , we have
The result is tight: for every , there exists a distribution so that (12) does not hold.
Note that the un-expected Bernstein inequality in Lemma 10 has the squared random variable lifted out of the expectation. In Appendix E, we prove (13) and compare it to standard versions of Bernstein’s inequality. We also compare (12) to the related but distinct empirical Bernstein inequality due to (maurer2009empirical, , Theorem 4). The detailed proof of Lemma 10 with the tight constants is (as far as we know) significantly harder than Bernstein’s, and is postponed to the appendix. But it is easy to give a proof for bounded random variables with suboptimal constants:
Proof Sketch of Lemma 10 for with suboptimal constants.
Let be such that , e.g., . We apply the standard Bernstein inequality (13) twice, once with itself and once with (in the role of ), giving us, for ,
Proof of Theorem 1.
Let and . For , define
Since is bounded from above by , Lemma 10 implies that and ,
Since are i.i.d. we can chain the ESIs above using Proposition 7-(b) to get:
We now apply Proposition 7-(a) to chain these two ESIs; this yields
With the discrete prior on , we have for any (see Proposition 9),