PAC-Bayes unleashed: generalisation bounds with unbounded losses
Abstract
We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0,1]). In order to relax this assumption, we propose a new notion called HYPE (standing for HYPothesis-dependent rangE), which effectively allows the range of the loss to depend on each predictor. Based on this new notion we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.
1 Introduction
Since its emergence in the late 90s, the PAC-Bayes theory (see the seminal papers by STW1997 and McAllester1998; McAllester1999, or the recent survey by guedj2019primer) has been a powerful tool to obtain generalisation bounds and derive efficient learning algorithms. PAC-Bayes bounds were originally meant for binary classification problems (seeger2002; langford2005tutorial; catoni2007) but the literature now includes many contributions involving any bounded loss function (without loss of generality, with values in [0,1]), not just the binary loss. Generalisation bounds are helpful to ensure that a learning algorithm will perform well on future similar batches of data. Our goal is to provide new PAC-Bayesian generalisation bounds holding for unbounded loss functions, and thus extend the usability of PAC-Bayes to a much larger class of learning problems.
Some ways to circumvent the bounded-range assumption on the losses have been addressed in the recent literature. For instance, one approach assumes sub-Gaussian or sub-exponential tails of the loss (alquier2016; germain2016); however, this requires the knowledge of additional parameters. Other works have also looked into the analysis of heavy-tailed losses: e.g., Alquier_2017 proposed a polynomial moment-dependent bound with divergences, while holland2019 devised an exponential bound which assumes that the second (uncentred) moment of the loss is bounded by a constant (with a truncated risk estimator, as recalled in Sec. 5). A somewhat related approach was also explored by kuzborskij2019efron, who do not assume boundedness of the loss, but instead control higher-order moments of the generalisation gap through the Efron-Stein variance proxy.
We investigate a different route here. We introduce the HYPothesis-dependent rangE (HYPE) condition, which means that the loss is upper bounded by a term which does not depend on the data but only on the chosen predictor for the considered learning problem. We designed this condition to be easy to verify in practice, given an explicit formulation of the loss function. Our purpose is to bring our framework to the attention of the largest machine learning community, and HYPE is intended as an easy-to-use, friendly condition to yield theoretical guarantees, even for a less theoretically-oriented audience. Our regression example illustrates that purpose, and shows that a mere use of the triangle inequality is enough to check that HYPE is satisfied in a naive learning problem.
Classical PAC-Bayes bounds (see, e.g., McAllester1999; seeger2002) have been designed with few technical conditions. For instance, besides assuming a bounded loss function with values in the interval [0,1], McAllester's bound only requires absolute continuity between two densities. We intend to keep the same level of streamlined clarity in our assumptions, and hope that practitioners can readily check whether our results apply to their particular learning problem.
Our contributions are twofold.
(i) We propose PAC-Bayesian bounds holding with unbounded loss functions, therefore overcoming a limitation of the mainstream PAC-Bayesian literature, in which a bounded loss is usually assumed. (ii) We analyse the bound, its implications, the limitations of our assumptions, and their usability by practitioners. We hope this will extend the PAC-Bayes framework into a widely usable tool for a significantly wider range of problems, such as unbounded regression or reinforcement learning problems with unbounded rewards.
Outline.
Sec. 2 introduces our notation and the definition of the HYPE condition. Sec. 3 provides a general PAC-Bayesian bound, which is valid for any learning problem complying with a mild assumption. The novelty of our approach lies in the proof technique: we adapt the notion of self-bounding function, introduced by boucheron2000 and further developed in boucheron_concentration_inequalities2003; boucheron_concentration_inequalities2009. For the sake of completeness, we present in Sec. 4 how our approach (designed for the unbounded case) behaves in the bounded case. This section is not the core of our work but rather serves as a safety check and particularises our bound to more classical PAC-Bayesian assumptions. Sec. 5 introduces the notion of softening functions and particularises Sec. 3's PAC-Bayesian bound. In particular, we make explicit all terms in the right-hand side. Sec. 6 extends our results to linear regression (which has been studied from the perspective of PAC-Bayes in the literature, most recently by shalaeva2019). Finally, Sec. 7 contains numerical experiments to illustrate the behaviour of our bounds in the aforementioned linear regression problem.
We defer the following material to the appendix: Appendix A contains additional numerical experiments for the bounded case. Appendix B presents related works in detail. We reproduce in Appendix C a naive approach which inspired our study, for the sake of completeness. Appendix D contains a non-trivial corollary to Thm. 5.5. Finally, Appendix E contains all proofs of the original claims we make in the paper.
2 Notation
The learning problem is specified by a data space $\mathcal{Z}$, a set $\mathcal{H}$ of predictors, and a loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}^+$. We will denote by $S = (z_1, \dots, z_m)$ a size-$m$ dataset, where each $z_i$ is sampled from the same data-generating distribution $\mu$ over $\mathcal{Z}$. For any predictor $h \in \mathcal{H}$, we define the empirical risk $\hat{R}_S(h)$ and the theoretical risk $R(h)$ as
$$\hat{R}_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h, z_i), \qquad R(h) = \mathbb{E}_{z \sim \mu}\left[\ell(h, z)\right]$$
respectively; $\mathbb{E}_{z \sim \mu}$ denotes the expectation under $\mu$, and $\mathbb{E}_S$ the expectation under the distribution of the sample $S$. We define the generalisation gap $\mathrm{gap}(h) = R(h) - \hat{R}_S(h)$. We now introduce the key concept to our analysis.
Definition 2.1 (Hypothesis-dependent range (HYPE) condition).
A loss function $\ell$ is said to satisfy the hypothesis-dependent range (HYPE) condition if there exists a function $K : \mathcal{H} \to \mathbb{R}^+$ such that $\sup_{z \in \mathcal{Z}} \ell(h, z) \le K(h)$ for any predictor $h$. We then say that $\ell$ is HYPE($K$) compliant.
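As a concrete illustration of Definition 2.1 (anticipating the regression example of Sec. 6), the following sketch checks the HYPE condition for linear predictors with the absolute loss. All names, bounds and numerical values are illustrative assumptions, not taken from the paper:

```python
import math

# Assumed setup: linear predictors h_theta(x) = <theta, x> on inputs with
# ||x|| <= R_x and targets |y| <= B_y, with absolute loss |<theta, x> - y|.
# The triangle inequality and Cauchy-Schwarz give
#     loss(theta, (x, y)) <= R_x * ||theta|| + B_y =: K(theta),
# a bound depending only on the predictor -- exactly the HYPE condition.

R_x, B_y = 2.0, 1.0  # assumed data bounds

def loss(theta, x, y):
    pred = sum(t * xi for t, xi in zip(theta, x))
    return abs(pred - y)

def K(theta):
    # hypothesis-dependent range of the loss
    return R_x * math.sqrt(sum(t * t for t in theta)) + B_y

# sanity check on a few admissible data points (||x|| <= R_x, |y| <= B_y)
theta = [0.5, -1.0]
points = [([2.0, 0.0], -1.0), ([0.0, -2.0], 1.0), ([1.0, 1.0], 0.5)]
hype_holds = all(loss(theta, x, y) <= K(theta) for x, y in points)
```

Note that $K$ here is explicit and cheap to evaluate, which is the point of the condition: it can be checked from the formula of the loss alone, without any assumption on the tail of the data distribution.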
Let $\mathcal{M}_1(\mathcal{H})$ be the set of probability distributions on $\mathcal{H}$. For $Q, P \in \mathcal{M}_1(\mathcal{H})$, the notation $Q \ll P$ stands for $Q$ absolutely continuous with respect to $P$ (i.e. $Q(A) = 0$ if $P(A) = 0$ for any element $A$ of the considered $\sigma$-algebra).
We now recall a result from germain2009. Note that, while implicit in many PAC-Bayes works (including theirs), we make explicit that both the prior and the posterior must be absolutely continuous with respect to each other. We discuss this restriction below.
Theorem 2.2 (Adapted from germain2009, Theorem 2.1).
For any $P \in \mathcal{M}_1(\mathcal{H})$ with no dependency on data, for any convex function $D : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, for any $m \in \mathbb{N}$ and for any $\delta \in (0, 1]$, we have with probability at least $1 - \delta$ over size-$m$ samples $S$, for any $Q \in \mathcal{M}_1(\mathcal{H})$ such that $Q \ll P$ and $P \ll Q$:
$$D\!\left(\mathbb{E}_{h \sim Q}\big[\hat{R}_S(h)\big],\ \mathbb{E}_{h \sim Q}\big[R(h)\big]\right) \le \frac{1}{m}\left(\mathrm{KL}(Q \,\|\, P) + \ln\frac{1}{\delta} + \ln \mathbb{E}_{h \sim P}\, \mathbb{E}_S\!\left[e^{m D(\hat{R}_S(h),\, R(h))}\right]\right).$$
The proof is deferred to Sec. E.1. Note that the proof in germain2009 does require $P \ll Q$ although it is not explicitly stated: we highlight this in our proof. While $Q \ll P$ is classical and necessary for the KL divergence to be meaningful, $P \ll Q$ appears to be more restrictive. In particular, we have to choose $P$ such that it has the exact same support as $Q$ (e.g., choosing a Gaussian and a truncated Gaussian is not possible). However, we can still apply our theorem when $P$ and $Q$ belong to the same parametric family of distributions, e.g. both 'full-support' Gaussian or Laplace distributions, among others.
Note also that alquier2016, adapting a result from catoni2007, only require $Q \ll P$. This comes at the expense of a Hoeffding-type assumption (alquier2016, Definition 2.3): the exponential moment of the generalisation gap is assumed to be bounded by a function depending only on hyperparameters (such as the dataset size or parameters given by Hoeffding's assumption). Our analysis does not require this assumption, which might prove restrictive in practice.
Our Thm. 2.2 may be seen as a basis to recover many classical PAC-Bayesian bounds: for instance, a suitable choice of the convex function $D$ recovers McAllester's bound (as recalled in guedj2019primer, Theorem 1). To get a usable bound, the outstanding task is to bound the exponential moment $\mathbb{E}_{h \sim P}\, \mathbb{E}_S\!\left[e^{m D(\hat{R}_S(h), R(h))}\right]$. Note that a previous attempt has been made in germain2016, as described in Sec. B.1.
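To make this concrete, here is a minimal sketch evaluating the right-hand side of a McAllester-style bound numerically. The specific constants follow the form recalled in the survey literature and are an assumption here, not a verbatim transcription of Thm. 2.2:

```python
import math

def mcallester_rhs(kl, m, delta):
    # Right-hand side of a McAllester-style bound on the gap between
    # expected theoretical and empirical risk, for a loss in [0, 1]:
    #     sqrt( (KL(Q||P) + ln(2*sqrt(m)/delta)) / (2*m) )
    # kl: KL divergence between posterior Q and prior P
    # m: sample size; delta: confidence parameter
    return math.sqrt((kl + math.log(2.0 * math.sqrt(m) / delta)) / (2.0 * m))

gap = mcallester_rhs(kl=5.0, m=10_000, delta=0.05)
```

The bound tightens as $m$ grows and loosens with the KL term, which is the trade-off every experiment in Sec. 7 and Appendix A has to negotiate.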
3 Exponential moment via self-bounding functions
Our goal is to control the exponential moment of the generalisation gap for a fixed predictor $h$. The technique we use is based on the notion of self-bounding functions defined in boucheron_concentration_inequalities2009.
Definition 3.1 (boucheron_concentration_inequalities2009).
A function $f : \mathcal{X}^m \to \mathbb{R}$ is said to be $(a, b)$-self-bounding, with $a, b \ge 0$, if there exists for every $i \in \{1, \dots, m\}$ a function $f_i : \mathcal{X}^{m-1} \to \mathbb{R}$ such that for all $x \in \mathcal{X}^m$ and all $i$:
$$0 \le f(x) - f_i(x^{(i)}) \le 1$$
and
$$\sum_{i=1}^m \left(f(x) - f_i(x^{(i)})\right) \le a f(x) + b,$$
where for all $i$, the removal of the $i$th entry is $x^{(i)} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_m)$. We denote by $\mathrm{SB}(a, b)$ the class of functions that satisfy this definition.
In boucheron_concentration_inequalities2009, the following bound has been presented to deal with the exponential moment of a self-bounding function. Let $x_+ := \max(x, 0)$ denote the positive part of $x$. We define $x/0 = +\infty$ when $x > 0$.
Theorem 3.2 (boucheron_concentration_inequalities2009).
Let $Z = f(X_1, \dots, X_m)$, where $X_1, \dots, X_m$ are independent (not necessarily identically distributed) $\mathcal{X}$-valued random variables. We assume that $\mathbb{E}[Z] < +\infty$. If $f \in \mathrm{SB}(a, b)$, then defining $c := (3a - 1)/6$, for any $0 \le \lambda \le 1/c_+$ we have:
$$\ln \mathbb{E}\left[e^{\lambda (Z - \mathbb{E}[Z])}\right] \le \frac{\left(a\, \mathbb{E}[Z] + b\right) \lambda^2}{2\left(1 - c_+ \lambda\right)}.$$
Next, we deal with the exponential moment over $S$ appearing in Thm. 2.2. To do so, we propose the following theorem:
Theorem 3.3.
Let $h \in \mathcal{H}$ be a fixed predictor and $\lambda > 0$. If the loss function $\ell$ is HYPE($K$) compliant, then we have:
Proof.
We define the function as
We also define . Then, notice that . We first prove that for any . Indeed, for all , we define:
where for any and for any . Then, since for all , we have
Moreover, because for any , we then have:
Since this holds for any , this proves that the function is self-bounding.
Now, to complete the proof, we will use Thm. 3.2. Because the function defined above is self-bounding, we have for all :
And since :
∎
Comparing our Thm. 3.3 with the naive result shown in Appendix C shows the strength of our approach: the trade-off lies in the fact that we are now 'only' controlling instead of , but we traded, on the right-hand side of the bound, the large exponent for , the latter being much smaller when e.g. .
Now, without any additional assumptions, the self-bounding function theory has provided us with a first step in our study of the exponential moment. For convenient cross-referencing, we state the following rewriting of Thm. 2.2.
Theorem 3.4.
Let the loss $\ell$ be HYPE($K$) compliant. For any $P$ with no data dependency, for any $\lambda > 0$ and for any $\delta \in (0, 1]$, we have with probability at least $1 - \delta$ over size-$m$ samples $S$, for any $Q$ such that $Q \ll P$ and $P \ll Q$:
4 Safety check: the bounded loss case
At this stage, the reader might wonder whether this new approach allows us to recover known results in the bounded case: the answer is yes.
Throughout this section, we study the case where $K$ is bounded by some constant $C$. We provide a bound, valid for any choice of 'prior' $P$ and 'posterior' $Q$ such that $Q \ll P$ and $P \ll Q$, which is an immediate corollary of Thm. 3.4.
Proposition 4.1.
Let $\ell$ be HYPE($K$) compliant, with constant $K \equiv C$, and let $\lambda > 0$. Then we have, for any $P$ with no data dependency, with probability $1 - \delta$ over random samples, for any $Q$ such that $Q \ll P$ and $P \ll Q$:
Remark 4.2.
We elaborate on Proposition 4.1 to evaluate the robustness of our approach, for instance by comparing it with the PAC-Bayesian bound found in germain2016. This discussion can be found in Sec. B.1, where the bound from germain2016 is introduced in detail.
Remark 4.3.
A naive remark first: in order to control the rate of convergence of all the terms of the bound in Proposition 4.1 (as is often the case in classical PAC-Bayesian bounds), only one choice of $\lambda$ is in fact of interest. However, one may notice that the exponential-moment factor is not optimisable while the KL one is. Hence, if that factor appears to be too big in practice, one wants the ability to attenuate its influence as much as possible, which may lead to considering other values of $\lambda$. The following lemma deals with this question.
Lemma 4.4.
For any given , the function reaches its minimum at
Proof.
Explicitly computing the derivative and solving for its root provides the result. ∎
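Lemma 4.4 gives a closed-form minimiser. As a sketch of what this optimisation looks like numerically, assume (for illustration only) that the bound has the generic shape $g(\lambda) = A/\lambda + B\lambda$, with a KL-driven term decreasing in $\lambda$ and an exponential-moment term increasing in it; for this shape the analytic optimum is $\sqrt{A/B}$, which a crude grid search recovers:

```python
import math

# Hypothetical bound shape g(lambda) = A/lambda + B*lambda, with A, B > 0.
# This is an assumed stand-in for the function of Lemma 4.4, whose exact
# expression is not reproduced here.

def g(lam, A, B):
    return A / lam + B * lam

def argmin_grid(A, B, lo=1e-3, hi=100.0, steps=100_000):
    # brute-force minimisation over a uniform grid of lambda values
    best_lam, best_val = lo, g(lo, A, B)
    for k in range(1, steps + 1):
        lam = lo + (hi - lo) * k / steps
        val = g(lam, A, B)
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam

A, B = 5.0, 0.02                  # illustrative constants
lam_star = argmin_grid(A, B)
closed_form = math.sqrt(A / B)    # analytic minimiser for this shape
```

In practice one would of course use the closed form; the grid search is only there to confirm it and to handle bound shapes without an analytic minimiser.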
Remark 4.5.
Our Lemma 4.4 indicates that, if we have already fixed a 'prior' $P$ and a 'posterior' $Q$, then taking $\lambda$ as in Lemma 4.4 offers us the optimised value of the bound given in Proposition 4.1. We numerically show in Appendix A's first experiment that optimising $\lambda$ leads to significantly better results.
Now the only remaining question is how to optimise the KL divergence. To do so, we may need to fix an 'informed prior' to minimise the KL divergence with an interesting posterior. This idea has been studied by lever2010; lever2013 and more recently by mhammedi2019; rivasplata2019, among others. We will adapt it to our problem in the simplest way.
We will now introduce, for even $m$, the splits $S_1 = (z_1, \dots, z_{m/2})$ and $S_2 = (z_{m/2+1}, \dots, z_m)$.
Proposition 4.6.
Let $\ell$ be HYPE($K$) compliant, with constant $K \equiv C$, and $\lambda > 0$. Then we have, for any 'priors' $P_1$ (possibly dependent on $S_2$) and $P_2$ (possibly dependent on $S_1$), with probability $1 - \delta$ over random size-$m$ samples $S$, for any $Q$ such that $Q \ll P_1$, $P_1 \ll Q$, $Q \ll P_2$ and $P_2 \ll Q$:
Proof.
Let $P_1$, $P_2$ and $Q$ be as stated in the theorem. We first notice that, by using Proposition 4.1 on the two halves of the sample, we obtain with probability at least $1 - \delta/2$:
and also with probability at least $1 - \delta/2$:
Hence, by a union bound, with probability at least $1 - \delta$ both inequalities hold, and the result follows by adding them and dividing by 2. ∎
Remark 4.7.
One can notice that the main difference between Proposition 4.6 and Proposition 4.1 lies in the implicit PAC-Bayesian paradigm that the prior must not depend on the data. With this last proposition, we implicitly allow each prior to depend on half of the sample, which can in practice lead to far more accurate priors. We numerically illustrate this fact in Appendix A's second experiment.
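The benefit of data-dependent priors can be seen in a toy one-dimensional sketch (the mean-estimation task, the distributions and all numbers below are assumptions for illustration): a Gaussian prior centred on an estimate from one half of the sample is far closer, in KL, to a posterior built on the other half than a data-free prior centred at 0.

```python
import random

# KL(N(mu_q, s^2) || N(mu_p, s^2)) = (mu_q - mu_p)^2 / (2 s^2)
# for two 1-d Gaussians sharing the same variance.
def kl_gauss(mu_q, mu_p, sigma=1.0):
    return (mu_q - mu_p) ** 2 / (2.0 * sigma ** 2)

random.seed(1)
sample = [3.0 + random.gauss(0.0, 1.0) for _ in range(1000)]
first_half, second_half = sample[:500], sample[500:]

post_mean = sum(second_half) / len(second_half)          # posterior from S2
naive_prior_mean = 0.0                                   # data-free prior
informed_prior_mean = sum(first_half) / len(first_half)  # prior from S1

kl_naive = kl_gauss(post_mean, naive_prior_mean)
kl_informed = kl_gauss(post_mean, informed_prior_mean)
```

Since the KL term enters the bound directly, the informed prior translates into a tighter certificate at the cost of halving the sample available to each empirical term, which is exactly the trade-off of Proposition 4.6.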
5 PAC-Bayesian bounds with a smoothed estimator
We now move on to control the right-hand side term in Thm. 3.4 when $K$ is not constant. A first step is to consider a transformed estimate of the risk, inspired by the truncated estimator from catoni2012challenging, also used in catoni2017 and more recently studied by holland2019. The following is inspired by the results of holland2019, which we summarise in Sec. B.2.
The idea is to modify the estimator by introducing a threshold $s > 0$ and a function $\psi$ which will attenuate the influence of the empirical losses that exceed $s$.
Definition 5.1 (risks).
For every , , for any , we define the empirical risk and the theoretical risk as follows:
where .
We now focus on what we call softening functions, i.e. functions that temper high values of the loss function $\ell$.
Definition 5.2 (Softening function).
We say that $\psi$ is a softening function if:
1. ,
2. $\psi$ is non-decreasing,
3. .
We let $\Psi$ denote the set of all softening functions.
Remark 5.3.
Notice that those three assumptions ensure that is continuous at . For instance, the functions and are in . In Sec. B.2 we compare these softening functions with those used by holland2019.
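As an illustration (the paper's exact conditions on $\psi$ are elided above, so this specific choice is an assumption), the clipping function $\psi(x) = \min(x, 1)$ is one natural candidate: the identity below 1 and constant above. With a threshold $s$, the rescaled softened loss caps every individual loss at $s$:

```python
# Assumed candidate softening function: psi(x) = min(x, 1).
def psi(x):
    return min(x, 1.0)

def softened_empirical_risk(losses, s):
    # each loss l is replaced by s * psi(l / s), i.e. clipped at s,
    # then the clipped losses are averaged
    return sum(s * psi(l / s) for l in losses) / len(losses)

losses = [0.2, 0.7, 3.0, 42.0]      # illustrative losses, one of them huge
plain = sum(losses) / len(losses)
soft = softened_empirical_risk(losses, s=1.0)
```

The example makes the trade-off of this section tangible: the two losses above the threshold both contribute exactly $s$, so the estimator is robust to the outlier 42.0 but can no longer distinguish it from 3.0.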
Using , for a fixed threshold , the softened loss function verifies for any :
because $\psi$ is non-decreasing. In this way, the exponential moment in Thm. 3.4 becomes far more controllable. The trade-off lies in the fact that softening (instead of taking the loss directly) deteriorates our ability to distinguish between two bad predictions when both of them exceed the threshold. For instance, if we choose $\psi$ to be the identity below the threshold and constant above it, then for two losses exceeding the threshold we cannot tell how far one is from the other; we can only affirm that both exceed it.
We now move on to the following lemma, which controls the shortfall between and for all , for a given and . To do so, we assume that admits a finite moment under any posterior distribution:
(1) 
For instance, if we work in and if is polynomial in (where denotes the Euclidean norm), then this assumption holds if we consider Gaussian priors and posteriors.
Lemma 5.4.
Assume that Eq. 1 holds, and let , . We have:
Proof.
Let , . We have, for :
Finally, by crudely bounding the probability by 1, we get:
Hence the result by integrating over with respect to . ∎
Finally, we present the following theorem, which provides a PAC-Bayesian inequality bounding the theoretical risk by the softened empirical risk:
Theorem 5.5.
Let $\ell$ be HYPE($K$) compliant and assume that Eq. 1 holds. Then for any $P$ with no data dependency, for any $\psi \in \Psi$, for any $\lambda > 0$ and for any $\delta \in (0, 1]$, we have with probability at least $1 - \delta$ over size-$m$ samples $S$, for any $Q$ such that $Q \ll P$ and $P \ll Q$:
Proof.
Remark 5.6.
Remark 5.7.
For the sake of clarity, we establish in Appendix D a corollary of Thm. 5.5 (with an assumption stronger than Eq. 1) which is easier to compare to the result of holland2019.
6 The linear regression problem
We now focus on the celebrated linear regression problem and see how our theory translates to that particular learning problem. We assume that the data is a size-$m$ sample drawn independently under the distribution $\mu$, where for all $i$, $z_i = (x_i, y_i)$, with .
Our goal here is to find the most accurate predictor with respect to the loss function , where . We will make the following mild assumption: there exists such that for all drawn under :
where is the norm associated to the classical inner product of . Under this assumption we note that for all drawn according to , we have:
Thus we define for . If we first restrict ourselves to the framework of Sec. 3, we want to use Thm. 3.4, and in doing so, our goal is to bound the exponential moment. The shape of invites us to consider a Gaussian prior. Indeed, we notice that if with , then . Notice that we cannot take just any Gaussian prior; however, with a small , the condition may become quite loose. Thus, we have the following:
Theorem 6.1.
Let and . If the loss is HYPE($K$) compliant with , with , then for any Gaussian prior with , , we have with probability at least $1 - \delta$ over size-$m$ samples $S$, for any $Q$ such that $Q \ll P$ and $P \ll Q$:
where .
The proof is deferred to Sec. E.2. To compare our result with those found in the literature, we can fix . Doing so, we lose the dependency in for the choice of the variance of the prior (which now only depends on ), but we recover the classic decreasing factor .
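Evaluating the right-hand side of Thm. 6.1 in practice requires the KL divergence between the Gaussian posterior and prior. For isotropic Gaussians sharing the same variance (the case sketched in this section), it reduces to a scaled squared distance between the means; a minimal sketch with illustrative values:

```python
# KL( N(mu_q, s^2 I) || N(mu_p, s^2 I) ) = ||mu_q - mu_p||^2 / (2 s^2)
# for two isotropic Gaussians with the same variance s^2.
def kl_isotropic_gaussians(mu_q, mu_p, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    return sq_dist / (2.0 * sigma ** 2)

# illustrative posterior mean, prior mean (at the origin) and variance
kl = kl_isotropic_gaussians([1.0, 2.0, 0.0], [0.0, 0.0, 0.0], sigma=0.5)
```

This is why the constraint on the prior variance matters: a small variance shrinks the exponential-moment term but inflates this KL term through the $1/(2\sigma^2)$ factor.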
Remark 6.2.
Notice that so far we did not use Sec. 5 even though we could (because is polynomial in and we consider Gaussian priors and posteriors, so Eq. 1 is satisfied). Doing so, we obtained a bound which appears to depend linearly on the dimension . In practice may be too big, and in this case, introducing an adapted softening function (one can think for instance of ) is a powerful tool to attenuate the weight of the exponential moment. This also extends the class of authorised Gaussian priors by avoiding having to stick with a variance , .
7 Numerical experiments for linear regression
Setting.
In this section we apply Thm. 6.1 to a concrete linear regression problem. The situation is as follows: we want to approximate the function where . We assume that lies in a hypercube centred at the origin of half-side , e.g. the set . Doing so, we have .
Furthermore, we assume that the data is drawn inside a hypercube of half-side . Doing so, we have for any data .
For any data , we define and we set . As described in Sec. 6, we set . We then remark that for any :
Then we can define and to apply Thm. 6.1. We also define , which is the set of candidate measures for this learning problem. Recall that in practice, given a fixed , we are only allowed to consider priors whose variance satisfies . We want to learn an optimised predictor given a dataset . To do so, we consider synthetic data.
Synthetic data.
We draw under a Gaussian (with mean 0 and standard deviation ) truncated to the hypercube centred at the origin of half-side . We then generate synthetic data according to the following process: for a fixed sample size , we draw under a Gaussian (with mean 0 and standard deviation ) truncated to the hypercube centred at the origin of half-side .
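The truncated Gaussians above can be sampled by simple rejection: keep redrawing from the untruncated Gaussian until the draw falls inside the hypercube. The dimension, standard deviation and half-side below are illustrative assumptions, not the experiment's actual settings:

```python
import random

def truncated_gaussian(dim, sd, r, rng):
    # rejection sampling from N(0, sd^2 I) restricted to the
    # hypercube [-r, r]^dim centred at the origin
    while True:
        x = [rng.gauss(0.0, sd) for _ in range(dim)]
        if all(abs(c) <= r for c in x):
            return x

rng = random.Random(42)
data = [truncated_gaussian(dim=3, sd=1.0, r=2.0, rng=rng) for _ in range(200)]
```

Rejection sampling is only efficient when the hypercube captures most of the Gaussian mass; for a half-side of two standard deviations per coordinate, as here, the acceptance rate stays high.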
Experiment.
First, we fix . Our goal here is to obtain a generalisation bound for our problem. We fix arbitrarily, for a fixed , and , and we define our naive prior as . For a fixed dataset , we define our posterior as , with chosen to minimise the bound among candidates. Note that all the previously defined parameters depend on , which is why we choose for step a fixed integer (in practice step = 8 or 16) and take the value minimising the bound among the candidates as well. Fig. 1 contains two plots, one with , the other with . Each plot shows the right-hand side term of Thm. 6.1 with an optimised value for each step.
Discussion.
To the best of our knowledge, this is the first attempt to numerically compute PAC-Bayes bounds for unbounded problems, making a direct comparison to other results impossible. We stress, though, that obtaining numerical values for the bound without assuming a bounded loss is a significant first step. Furthermore, we consider a rather hard problem: is not linear, so we cannot rely on a linear approximation fitting the data perfectly, and the larger the dimension, the larger the error, as illustrated by Fig. 1. Thus, for any posterior , the quantity is potentially large in practice and our bound might not be tight. Finally, notice that optimising $\lambda$ (instead of taking the value recovering a classic convergence rate) leads to a significantly better bound. A numerical example of this assertion is presented in Appendix A. We aim to conduct further studies considering the convergence rate as a hyperparameter to optimise, rather than selecting the same rate for all terms in the bound.
8 Conclusion
The main goal of this paper is to expand the PAC-Bayesian theory to learning problems with unbounded losses, under the HYPE condition. We plan next to particularise our general theorems to more specific situations, starting with the kernel PCA setting.
References
Appendix A Additional experiments for the bounded loss case
Our experimental framework is inspired by the work of mhammedi2019.
Settings. We generate synthetic data for classification, and we use the 0-1 loss. Here, the data space is with . The set of predictors is also . And for , we define the loss as , where
We want to learn an optimised predictor given a dataset . To do so we use regularised logistic regression and we compute:
(2) 
where is a fixed regularisation parameter. We also define
which is the set of considered measures for this learning problem.
Parameters. We set . We approximately solve Eq. 2 using the minimize function of the optimisation module in Python, with the Powell method. To approximate Gaussian expectations, we use Monte Carlo sampling.
Synthetic data
We generate synthetic data for according to the following process: for a fixed sample size , we draw under the multivariate Gaussian distribution and we compute, for all : , where is the vector formed by the first digits of .
Normalisation trick
Given the predictors' shape, we notice that for any :
Thus, the value of the prediction is exclusively determined by the sign of the inner product, and this quantity is not influenced by the norm of the vector.
Then, for any sample , we call the normalisation trick the act of considering instead of in our calculations. This process does not deteriorate the quality of the prediction and considerably improves the value of the KL divergence.
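The normalisation trick can be sketched as follows (the predictor and the points are illustrative): rescaling a linear predictor does not change the sign of the inner product, so predictions are unchanged, while the posterior mean can be kept at unit norm:

```python
import math

def sign(v):
    return 1 if v >= 0 else -1

def predict(w, x):
    # linear classifier: the prediction only depends on sign(<w, x>)
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

def normalise(w):
    # replace w by w / ||w||: same predictions, smaller-norm mean
    n = math.sqrt(sum(wi * wi for wi in w))
    return [wi / n for wi in w]

w = [3.0, -4.0]                       # ||w|| = 5
points = [[1.0, 1.0], [-2.0, 0.5], [0.3, -0.9]]
same = all(predict(w, x) == predict(normalise(w), x) for x in points)
```

For Gaussian posteriors centred at the predictor, shrinking the centre's norm directly shrinks the squared-distance term in the KL divergence, which is why the trick tightens the bound at no cost in accuracy.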
First experiment
Our goal here is to highlight the point discussed in Remark 4.3, i.e. the influence of the parameter $\lambda$ in Proposition 4.1.
We fix arbitrarily and we define our naive prior as .
For a fixed dataset , we define our posterior as , with