PAC-Bayes under potentially heavy tails

Abstract

We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, only assuming that the first three moments of the loss distribution are bounded, the learning algorithm derived from this estimator achieves nearly sub-Gaussian statistical error, up to the quality of the prior.

1 Introduction

More than two decades ago, the origins of PAC-Bayesian learning theory were developed with the goal of strengthening traditional PAC learning guarantees by explicitly accounting for prior knowledge [19, 14, 7]. Subsequent work developed finite-sample risk bounds for “Bayesian” learning algorithms which specify a distribution over the model [15]. These bounds are controlled using the empirical risk and the relative entropy between “prior” and “posterior” distributions, and hold uniformly over the choice of the latter, meaning that the guarantees hold for data-dependent posteriors, hence the naming. Furthermore, choosing the posterior to minimize PAC-Bayesian risk bounds leads to practical learning algorithms which have seen numerous successful applications [3].

Following this framework, a tremendous amount of work has been done to refine, extend, and apply the PAC-Bayesian framework to new learning problems. Tight risk bounds for bounded losses are due to Seeger [17] and Maurer [13], with the former work applying them to Gaussian processes. Bounds constructed using the loss variance in a Bernstein-type inequality were given by Seldin et al. [18], with a data-dependent extension derived by Tolstikhin and Seldin [20]. As stated by McAllester [16], virtually all the bounds derived in the original PAC-Bayesian theory “only apply to bounded loss functions.” This technical barrier was solved by Alquier et al. [3], who introduce an additional error term depending on the concentration of the empirical risk about the true risk. This technique was subsequently applied to the log-likelihood loss in the context of Bayesian linear regression by Germain et al. [12], and further systematized by Bégin et al. [5]. While this approach lets us deal with unbounded losses, naturally the statistical error guarantees are only as good as the confidence intervals available for the empirical mean deviations. In particular, strong assumptions on all of the moments of the loss are essentially unavoidable using the traditional tools espoused by Bégin et al. [5], which means the “heavy-tailed” regime, where all we assume is that a few higher-order moments are finite (say finite variance and/or finite kurtosis), cannot be handled. A new technique for deriving PAC-Bayesian bounds even under heavy-tailed losses is introduced by Alquier and Guedj [2]; their lucid procedure provides error rates even under heavy tails, but as the authors recognize, the rates are highly sub-optimal due to direct dependence on the empirical risk, leading in turn to sub-optimal algorithms derived from these bounds.

In this work, while keeping many core ideas of Bégin et al. [5] intact, using a novel approach we obtain exponential tail bounds on the excess risk using PAC-Bayesian bounds that hold even under heavy-tailed losses. Our key technique is to replace the empirical risk with a new mean estimator inspired by the dimension-free estimators of Catoni and Giulini [10], designed to be computationally convenient. We review some key theory in section 2 before introducing the new estimator in section 3. In section 4 we apply this estimator to the PAC-Bayes setting, deriving a new robust optimal Gibbs posterior. Most detailed proofs are relegated to section A.1 in the appendix.

2 PAC-Bayesian theory based on the empirical mean

Let us begin by briefly reviewing the best available PAC-Bayesian learning guarantees under general losses. Denote by $z_1, \dots, z_n$ a sequence of independent observations distributed according to a common distribution $\mu$ on a space $\mathcal{Z}$. Denote by $\mathcal{H}$ a model/hypothesis class, from which the learner selects a candidate $h \in \mathcal{H}$ based on the $n$-sized sample. The quality of this choice can be measured in a pointwise fashion using a loss function $l : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}_{+}$, assumed to be non-negative. The learning task is to achieve a small risk, defined by $R(h) := \mathbb{E}_{\mu} l(h; z)$. Since the underlying distribution is inherently unknown, the canonical proxy is the empirical risk

$$\widehat{R}(h) := \frac{1}{n} \sum_{i=1}^{n} l(h; z_i), \quad h \in \mathcal{H}.$$

Let $\pi$ and $\rho$ respectively denote “prior” and “posterior” distributions on the model $\mathcal{H}$. The so-called Gibbs risk induced by $\rho$, as well as its empirical counterpart, are given by

$$\mathbb{E}_{h \sim \rho} R(h) = \mathbb{E}_{h \sim \rho}\, \mathbb{E}_{\mu}\, l(h; z), \qquad \mathbb{E}_{h \sim \rho} \widehat{R}(h) = \mathbb{E}_{h \sim \rho} \frac{1}{n} \sum_{i=1}^{n} l(h; z_i).$$

When our losses are almost surely bounded, lucid guarantees are available.

Theorem 1 (PAC-Bayes under bounded losses [15, 5]).

Assume $0 \le l(h; z) \le 1$ for all $h \in \mathcal{H}$ and $z \in \mathcal{Z}$, and fix an arbitrary prior $\pi$ on $\mathcal{H}$. For any confidence level $\delta \in (0, 1)$, we have with probability no less than $1 - \delta$ over the draw of the sample that

$$\mathbb{E}_{h \sim \rho} R(h) \le \mathbb{E}_{h \sim \rho} \widehat{R}(h) + \sqrt{\frac{K(\rho \| \pi) + \log(2\sqrt{n}/\delta)}{2n}},$$

uniformly in the choice of $\rho$.

Since the “good event” on which the inequality in Theorem 1 holds is valid for any choice of $\rho$, the result holds even when $\rho$ depends on the sample, which justifies calling it a posterior distribution. Optimizing this upper bound with respect to $\rho$ leads to the so-called optimal Gibbs posterior, which takes a form that is readily characterized (cf. Remark 14).
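To make this characterization concrete, the following is the standard textbook computation (with our notation $\lambda$ and $\rho^{\star}$; the paper's Remark 14 is assumed to state an analogous form): bounds of the Theorem 1 type are, for a fixed trade-off parameter $\lambda > 0$, minimized by an exponential re-weighting of the prior.

```latex
% Minimizing (empirical Gibbs risk) + (relative entropy)/lambda over rho:
\[
\rho^{\star}
= \operatorname*{arg\,min}_{\rho}
  \left\{ \mathbb{E}_{h \sim \rho} \widehat{R}(h)
          + \frac{K(\rho \,\|\, \pi)}{\lambda} \right\},
\qquad
\frac{d\rho^{\star}}{d\pi}(h)
= \frac{\exp\!\big( -\lambda \widehat{R}(h) \big)}
       {\mathbb{E}_{h' \sim \pi} \exp\!\big( -\lambda \widehat{R}(h') \big)}.
\]
```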

The above results fall apart when the loss is unbounded, and meaningful extensions become challenging when exponential moment bounds are not available. As highlighted in section 1 above, over the years, the analytical machinery has evolved to provide general-purpose PAC-Bayesian bounds even under heavy-tailed data. The following theorem of Alquier and Guedj [2] extends the strategy of Bégin et al. [5] to obtain bounds under the weakest conditions we know of.

Theorem 2 (PAC-Bayes under heavy-tailed losses [2]).

Take any $p > 1$ and set $q = p / (p - 1)$. For any confidence level $\delta \in (0, 1)$, we have with probability no less than $1 - \delta$ over the draw of the sample that

$$\mathbb{E}_{h \sim \rho} R(h) \le \mathbb{E}_{h \sim \rho} \widehat{R}(h) + \left( \frac{\mathcal{M}_q}{\delta} \right)^{1/q} \left( D_{\phi_p}(\rho \| \pi) + 1 \right)^{1/p},$$

uniformly in the choice of $\rho$, where $\mathcal{M}_q := \mathbb{E}_{h \sim \pi}\, \mathbb{E}_{\mu} \big| \widehat{R}(h) - R(h) \big|^{q}$, and $D_{\phi_p}(\rho \| \pi)$ denotes the $\phi$-divergence induced by $\phi_p(u) = u^p$.

For concreteness, consider the case of $p = q = 2$, where $D_{\phi_2}(\rho \| \pi) = \chi^2(\rho \| \pi)$, and assume that the variance of the loss is $\mu$-finite, namely that

$$V(h) := \mathbb{E}_{\mu}\, l(h; z)^2 - R(h)^2 < \infty, \quad h \in \mathcal{H}.$$

From Proposition 4 of Alquier and Guedj [2], we have $\mathcal{M}_2 \le \mathbb{E}_{h \sim \pi} V(h) / n$. It follows that on the high-probability event, we have

$$\mathbb{E}_{h \sim \rho} R(h) \le \mathbb{E}_{h \sim \rho} \widehat{R}(h) + \sqrt{\frac{\mathbb{E}_{h \sim \pi} V(h)}{n\, \delta} \left( \chi^2(\rho \| \pi) + 1 \right)}.$$

While the rate in $n$ and the dependence on a divergence between $\rho$ and $\pi$ are similar, note that the dependence on the confidence level is polynomial, here $O(\delta^{-1/2})$; compare this with the logarithmic dependence $O(\sqrt{\log(1/\delta)})$ available in Theorem 1 above when the losses were bounded.
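To make the gap concrete, a quick numerical comparison of the two confidence penalties at $\delta = 0.001$, with all other factors held equal (a sketch of ours, not from the original text):

```python
import numpy as np

# Multiplicative penalty paid for confidence delta, other factors equal.
delta = 1e-3
print(np.sqrt(1.0 / delta))          # polynomial dependence: about 31.6
print(np.sqrt(np.log(1.0 / delta)))  # logarithmic dependence: about 2.6
```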

For comparison, our main result of section 4 is a uniform bound on the Gibbs risk: with probability no less than $1 - \delta$, we have

$$\mathbb{E}_{h \sim \rho} R(h) \le \mathbb{E}_{h \sim \rho} \widetilde{R}_n(h) + O\!\left( \frac{K(\rho \| \pi) + \log(1/\delta)}{\sqrt{n}} \right) + V_n(\pi),$$

uniformly in the choice of $\rho$, where $\widetilde{R}_n(h)$ is an estimator of $R(h)$ built from the robust mean estimator defined in section 3, $V_n(\pi)$ is a term depending on the quality of the prior $\pi$, and the key constants are moment bounds $m_2$ and $m_3$ such that for all $h \in \mathcal{H}$ we have $\mathbb{E}_{\mu}\, l(h; z)^2 \le m_2$ and $\mathbb{E}_{\mu}\, l(h; z)^3 \le m_3$. As long as the first three moments are finite, this guarantee holds, and thus both sub-Gaussian and heavy-tailed losses (e.g., with infinite higher-order moments) are permitted. Given any valid $m_2$ and $m_3$, the PAC-Bayesian upper bound above can be minimized in $\rho$ based on the data, and thus an optimal Gibbs posterior can also be computed in practice. In section 4, we characterize this “robust posterior.”

3 A new estimator using smoothed Bernoulli noise

Notation

In this section, we are dealing with the specific problem of robust mean estimation, thus we specialize our notation here slightly. Data observations will be $x_1, \dots, x_n$, assumed to be independent copies of the random variable $x \sim \mu$. Denote the index set $[n] := \{1, \dots, n\}$. Write $\mathcal{P}$ for the set of all probability measures defined on the underlying measurable space. Write $K(\rho \| \nu)$ for the relative entropy between measures $\rho$ and $\nu$ (also known as the KL divergence; definition in appendix). We shall typically suppress the underlying space and $\sigma$-field in the notation when they are clear from the context. Let $\psi$ be a bounded, non-decreasing function such that for some $b > 0$ and all $u \in \mathbb{R}$,

$$-\log\left( 1 - u + \frac{u^2}{b} \right) \le \psi(u) \le \log\left( 1 + u + \frac{u^2}{b} \right). \tag{1}$$

As a concrete and analytically useful example, we shall use the piecewise polynomial function of Catoni and Giulini [10], defined by

$$\psi(u) := \begin{cases} 2\sqrt{2}/3, & u > \sqrt{2} \\ u - u^3/6, & -\sqrt{2} \le u \le \sqrt{2} \\ -2\sqrt{2}/3, & u < -\sqrt{2}, \end{cases} \tag{2}$$

which satisfies (1) with $b = 2$, and is pictured in Figure 1 with the two key bounds.

Figure 1: Graph of the Catoni function $\psi$ defined in (2), with the two key bounds from (1).
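For concreteness, a minimal numerical sketch of (2), checking the two-sided bound (1) with $b = 2$ on a grid; the function name psi is ours:

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def psi(u):
    """Catoni-Giulini soft truncation (2): cubic near zero, constant tails."""
    u = np.asarray(u, dtype=float)
    out = u - u**3 / 6.0                      # polynomial part on [-sqrt(2), sqrt(2)]
    out = np.where(u > SQRT2, 2.0 * SQRT2 / 3.0, out)
    out = np.where(u < -SQRT2, -2.0 * SQRT2 / 3.0, out)
    return out

# Numerical check of the two-sided bound (1) with b = 2.
u = np.linspace(-10.0, 10.0, 100001)
assert np.all(psi(u) <= np.log(1.0 + u + u**2 / 2.0) + 1e-12)
assert np.all(psi(u) >= -np.log(1.0 - u + u**2 / 2.0) - 1e-12)
```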

Estimator definition

We consider a straightforward procedure, in which the data are subject to a soft truncation after re-scaling, defined by

$$\widehat{x} := \frac{s}{n} \sum_{i=1}^{n} \psi\left( \frac{x_i}{s} \right), \tag{3}$$

where $s > 0$ is a re-scaling parameter. Depending on the setting of $s$, this function can very closely approximate the sample mean, and indeed modifying this scaling parameter controls the bias of this estimator in a direct way, which can be quantified as follows. As the scale $s$ grows, note that

$$\lim_{s \to \infty} s\, \psi\left( \frac{x_i}{s} \right) = x_i,$$

which implies that, taking expectation with respect to the sample and letting $s \to \infty$, in the limit this estimator is unbiased, with

$$\lim_{s \to \infty} \mathbb{E}_{\mu}\, \widehat{x} = \mathbb{E}_{\mu}\, x.$$

On the other hand, taking $s$ closer to zero implies that more observations will be truncated. Taking $s$ small enough, we have

$$\widehat{x} = \frac{2\sqrt{2}\, s}{3n} \left( |I_{+}| - |I_{-}| \right),$$

which converges to zero as $s \to 0$. Here the positive/negative index sets are $I_{+} := \{ i \in [n] : x_i > 0 \}$ and $I_{-} := \{ i \in [n] : x_i < 0 \}$. Thus taking $s$ too small means that only the signs of the observations matter, and the absolute value of the estimator tends to become too small.
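In code, the estimator (3) and the role of $s$ look as follows (a minimal sketch; catoni_mean is our name, and psi is the function from the previous snippet):

```python
import numpy as np
# assumes psi() from the previous snippet

def catoni_mean(x, s):
    """Soft-truncation mean estimator (3): re-scale, truncate, average, re-scale."""
    x = np.asarray(x, dtype=float)
    return s * np.mean(psi(x / s))

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=1000) + 5.0   # heavy tails, true mean 5.0

for s in (0.01, 1.0, 100.0):
    print(f"s = {s:6.2f}: estimate = {catoni_mean(x, s):.4f}")
# s -> infinity recovers the sample mean; s -> 0 shrinks the estimate toward
# (2*sqrt(2)/3) * s * (|I+| - |I-|) / n, i.e., toward zero.
```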

High-probability deviation bounds for $\widehat{x}$

We are interested in high-probability bounds on the deviations $|\widehat{x} - \mathbb{E}_{\mu}\, x|$ under the weakest possible assumptions on the underlying data distribution. To obtain such guarantees in a straightforward manner, we make the simple observation that the estimator defined in (3) can be related to an estimator with smoothed noise as follows. Let $\epsilon_1, \dots, \epsilon_n$ be an iid sample of noise with distribution $\epsilon \sim \mathrm{Bernoulli}(\theta)$ for some $0 < \theta < 1$. Then, taking expectation with respect to the noise sample, one has that

(4)

This simple observation becomes useful to us in the context of the following technical fact.

Lemma 3.

Assume we are given some independent data $x_1, \dots, x_n$, assumed to be copies of the random variable $x \sim \mu$. In addition, let $\epsilon_1, \dots, \epsilon_n$ similarly be independent observations of “strategic noise,” with a distribution $\nu$ that we can design. Fix an arbitrary prior distribution $\nu_0$, and consider a function of the data and noise, assumed to be bounded and measurable. Write $K(\nu \| \nu_0)$ for the Kullback–Leibler divergence between the distributions $\nu$ and $\nu_0$. It follows that with probability no less than $1 - \delta$ over the random draw of the sample, we have

uniformly in the choice of $\nu$, where the expectation on the left-hand side is over the noise sample.

The special case of interest here is that of Bernoulli noise. Using (1) and Lemma 3, with Bernoulli prior $\nu_0$ and Bernoulli posterior $\nu$, it follows that on the high-probability event, uniformly in the choice of $\nu$, we have

(5)

where we have used the closed form available for the relative entropy $K(\nu \| \nu_0)$ in the Bernoulli case. Dividing both sides by $n/s$ and optimizing the resulting bound as a function of $s$ yields a closed-form expression for $s$ depending on the second moment, the confidence $\delta$, and $n$. Analogous arguments yield lower bounds on the same quantity. Taking these facts together, we have the following proposition, which says that assuming only a finite second moment $\mathbb{E}_{\mu}\, x^2 < \infty$, the proposed estimator achieves exponential tail bounds scaling with the second non-central moment.

Proposition 4 (Concentration of deviations).

Scaling with $s = \sqrt{n\, \mathbb{E}_{\mu} x^2 / (2 \log(2 \delta^{-1}))}$, the estimator defined in (3) satisfies

$$\left| \widehat{x} - \mathbb{E}_{\mu}\, x \right| \le \sqrt{\frac{2\, \mathbb{E}_{\mu} x^2 \log(2 \delta^{-1})}{n}} \tag{6}$$

with probability at least $1 - \delta$.

Proof of Proposition 4.

First, note that the upper bound derived from (5) holds uniformly in the choice of the posterior on a high-probability event. Setting the posterior parameter equal to that of the prior and solving for the optimal setting of $s$ is just calculus. It remains to obtain a corresponding lower bound on $\widehat{x}$. To do so, consider the analogous setting of Bernoulli $\nu_0$ and $\nu$, but this time applied to the negated observations $-x_1, \dots, -x_n$, with the prior and posterior parameters chosen accordingly. Using (1) and Lemma 3 again, we have

where we note that the relevant moments are unchanged under negation. This yields a high-probability lower bound in the desired form, since an upper bound on $-\widehat{x}$ is equivalent to a lower bound on $\widehat{x}$. However, since we have changed the prior in this case, the high-probability event here need not be the same as that for the upper bound, and as such, we must take a union bound over these two events to obtain the desired final result. ∎

Remark 5.

While the above bound (6) depends on the true second moment $\mathbb{E}_{\mu}\, x^2$, as is clear from the proof outlined above, the result is easily extended to hold for any valid upper bound on this moment, which is what will inevitably be used in practice.
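As a sanity check of Proposition 4 (in the reconstructed form (6) above), the following simulation of ours compares the soft-truncation estimator against the sample mean on heavy-tailed Lomax data, whose first two moments are available in closed form:

```python
import numpy as np
# assumes psi() and catoni_mean() from the previous snippets

rng = np.random.default_rng(1)
n, delta, trials = 500, 0.05, 2000
a = 3.0            # Lomax(3): mean 0.5, E[x^2] = 1.0, infinite third moment
mean_true, m2 = 0.5, 1.0

s = np.sqrt(n * m2 / (2.0 * np.log(2.0 / delta)))      # scale from Proposition 4
level = np.sqrt(2.0 * m2 * np.log(2.0 / delta) / n)    # deviation level in (6)

err_robust, err_mean = [], []
for _ in range(trials):
    x = rng.pareto(a, size=n)
    err_robust.append(abs(catoni_mean(x, s) - mean_true))
    err_mean.append(abs(np.mean(x) - mean_true))

q = 100.0 * (1.0 - delta)
print("deviation level (6): ", level)
print("robust 95th pct err: ", np.percentile(err_robust, q))
print("mean   95th pct err: ", np.percentile(err_mean, q))
```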

Centered estimates

Note that the bound (6) depends on the second moment of the underlying data; this is in contrast to M-estimators, which due to a natural “centering” of the data typically have tail bounds depending on the variance. This results in a sensitivity to the absolute location of the distribution: e.g., on a distribution with unit variance, performance with mean zero will tend to be much better than on a distribution whose mean is far from zero. Fortunately, a simple centering strategy works well to alleviate this sensitivity, as follows.

Without loss of generality, assume that the first $k$ observations are used for constructing a shifting device, with the remaining $n - k$ points left for running the usual routine on shifted data. More concretely, define

$$\widehat{x}_{\mathrm{shift}} := \frac{s_k}{k} \sum_{i=1}^{k} \psi\left( \frac{x_i}{s_k} \right), \tag{7}$$

with $s_k$ scaled as in Proposition 4, using $k$ in place of $n$. From (6) in Proposition 4, we have

$$\left| \widehat{x}_{\mathrm{shift}} - \mathbb{E}_{\mu}\, x \right| \le \beta_k := \sqrt{\frac{2\, \mathbb{E}_{\mu} x^2 \log(2 \delta^{-1})}{k}}$$

on an event with probability no less than $1 - \delta$, over the draw of the $k$-sized sub-sample. Using this, we shift the remaining data points as $\widetilde{x}_i := x_i - \widehat{x}_{\mathrm{shift}}$ for $i = k+1, \dots, n$. Note that on this event, the second moment of this data is bounded as follows:

$$\mathbb{E}_{\mu}\, \widetilde{x}^2 = \mathrm{var}_{\mu}(x) + \left( \mathbb{E}_{\mu}\, x - \widehat{x}_{\mathrm{shift}} \right)^2 \le \mathrm{var}_{\mu}(x) + \beta_k^2.$$

Passing these shifted points through (3), with analogous second moment bounds used for scaling, we have

$$\widehat{\widetilde{x}} := \frac{s'}{n - k} \sum_{i = k+1}^{n} \psi\left( \frac{\widetilde{x}_i}{s'} \right). \tag{8}$$

Shifting the resulting output back to the original location by adding $\widehat{x}_{\mathrm{shift}}$, and conditioning on the first sub-sample, we have by (6) again that

$$\left| \widehat{\widetilde{x}} - \left( \mathbb{E}_{\mu}\, x - \widehat{x}_{\mathrm{shift}} \right) \right| \le \sqrt{\frac{2 \left( \mathrm{var}_{\mu}(x) + \beta_k^2 \right) \log(2 \delta^{-1})}{n - k}}$$

with probability no less than $1 - \delta$ over the draw of the remaining $n - k$ points. Defining the centered estimator as $\widehat{x}_{\mathrm{cent}} := \widehat{\widetilde{x}} + \widehat{x}_{\mathrm{shift}}$, and taking a union bound over the two “good events” on the independent sample subsets, we may thus conclude that

$$\left| \widehat{x}_{\mathrm{cent}} - \mathbb{E}_{\mu}\, x \right| \le \sqrt{\frac{2 \left( \mathrm{var}_{\mu}(x) + \beta_k^2 \right) \log(2 \delta^{-1})}{n - k}} \tag{9}$$

with probability no less than $1 - 2\delta$, where probability is over the draw of the full $n$-sized sample. While one takes a hit in terms of the sample size, the variance now appears in place of the raw second moment, working to combat sensitivity to the distribution location.
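A sketch of the two-stage centered estimator (7)–(9), with the splitting fraction and function names of our choosing:

```python
import numpy as np
# assumes psi() and catoni_mean() from the previous snippets

def catoni_mean_centered(x, m2_bound, var_bound, delta, frac=0.5):
    """Centered variant (7)-(9): estimate a shift on the first k points,
    run the soft-truncation estimator on the shifted remainder, shift back."""
    x = np.asarray(x, dtype=float)
    k = int(len(x) * frac)
    log_term = 2.0 * np.log(2.0 / delta)

    # Step 1: crude location estimate on the first k points, as in (7).
    s_k = np.sqrt(k * m2_bound / log_term)
    shift = catoni_mean(x[:k], s_k)

    # Step 2: the shifted data has second moment close to the variance,
    # so the scale for (8) uses var_bound plus the squared shift error.
    beta_k2 = m2_bound * log_term / k
    s_rest = np.sqrt((len(x) - k) * (var_bound + beta_k2) / log_term)
    return shift + catoni_mean(x[k:] - shift, s_rest)
```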

4 PAC-Bayesian bounds for heavy-tailed data

An important and influential paper due to D. McAllester gave the following theorem as a motivating result. For clarity to the reader, we give a slightly modified version of his result.

Theorem 6 (McAllester [14], Preliminary Theorem 2).

Let $\pi$ be a prior probability distribution over $\mathcal{H}$, assumed countable, and such that $\pi(h) > 0$ for all $h \in \mathcal{H}$. Consider the pattern recognition task with data $z = (x, y)$ and the zero-one classification error $l(h; z) = I\{h(x) \ne y\}$. Then with probability no less than $1 - \delta$, for any choice of $h \in \mathcal{H}$, we have

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{\log(1/\pi(h)) + \log(1/\delta)}{2n}}.$$

Proof.

For clean notation, denote the empirical risk as

$$\widehat{R}(h) := \frac{1}{n} \sum_{i=1}^{n} I\{h(x_i) \ne y_i\}.$$

Using a classical Chernoff bound specialized to the case of Bernoulli observations (Lemma 18), we have that for any $h \in \mathcal{H}$ and $\varepsilon > 0$, it holds that

$$\mathbb{P}\left( R(h) > \widehat{R}(h) + \varepsilon \right) \le \exp\left( -2 n \varepsilon^2 \right).$$

Rearranging terms, it follows immediately that for any $\delta' \in (0, 1)$, with probability no less than $1 - \delta'$, we have

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{\log(1/\delta')}{2n}}.$$

The desired result follows from a union bound: setting $\delta' = \delta\, \pi(h)$ for each $h \in \mathcal{H}$,

$$\mathbb{P}\left( \exists\, h \in \mathcal{H} : R(h) > \widehat{R}(h) + \sqrt{\frac{\log(1/\pi(h)) + \log(1/\delta)}{2n}} \right) \le \sum_{h \in \mathcal{H}} \delta\, \pi(h) = \delta.$$

The event on the left-hand side of the above inequality is precisely that of the hypothesis, namely the “bad event” on which the sample is such that the risk exceeds the given bound for some candidate $h \in \mathcal{H}$. ∎
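As an illustration of how Theorem 6 is used, the following sketch (entirely ours: threshold classifiers, a uniform prior, and synthetic labels) computes the bound for each candidate and selects the minimizer:

```python
import numpy as np

# Toy use of Theorem 6: threshold classifiers h_t(x) = I{x > t} on synthetic
# data, with a uniform prior over a finite grid of thresholds.
rng = np.random.default_rng(2)
n, delta = 200, 0.05
x = rng.normal(size=n)
noise = (rng.random(n) < 0.1).astype(int)          # 10% label flips
y = ((x > 0.3).astype(int) + noise) % 2

thresholds = np.linspace(-2.0, 2.0, 41)
prior = np.full(thresholds.size, 1.0 / thresholds.size)

emp_risk = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])
bound = emp_risk + np.sqrt((np.log(1.0 / prior) + np.log(1.0 / delta)) / (2 * n))

j = int(np.argmin(bound))
print(f"threshold {thresholds[j]:.2f}: empirical risk {emp_risk[j]:.3f}, "
      f"risk bound {bound[j]:.3f}")
```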

One quick glance at the proof of this theorem shows that the bounded nature of the observations plays a crucial role in deriving excess risk bounds of the above form, as it is used to obtain concentration inequalities for the empirical risk about the true risk. While analogous concentration inequalities hold under slightly weaker assumptions, when considering the potentially heavy-tailed setting, one simply cannot guarantee that the empirical risk is tightly concentrated about the true risk, which prevents direct extensions of such theorems. With this in mind, we take a different approach, one that does not require the empirical mean to be well-concentrated.

Our motivating pre-theorem

The basic idea of our approach is very simple: instead of using the sample mean, bound the off-sample risk using a more robust estimator which is easy to compute directly, and which allows risk bounds even under unbounded, potentially heavy-tailed losses. Define a new approximation of the risk by

$$\widetilde{R}_n(h) := \frac{s}{n} \sum_{i=1}^{n} \psi\left( \frac{l(h; z_i)}{s} \right), \tag{10}$$

for each $h \in \mathcal{H}$. Note that this is just a direct application of the robust estimator defined in (3) to the case of a loss which depends on the choice of candidate $h$. As a motivating result, we basically re-prove McAllester’s result (Theorem 6) under much weaker assumptions on the loss, using the statistical properties of the new risk estimator (10), rather than relying on classical Chernoff inequalities.

Theorem 7 (Pre-theorem).

Let $\pi$ be a prior probability distribution over $\mathcal{H}$, assumed countable. Assume that $\pi(h) > 0$ for all $h \in \mathcal{H}$, and that $\mathbb{E}_{\mu}\, l(h; z)^2 \le m_2 < \infty$ for all $h \in \mathcal{H}$. Setting the scale in (10) to $s(h) = \sqrt{n\, m_2 / (2 \log(2 / (\delta\, \pi(h))))}$, then with probability no less than $1 - \delta$, for any choice of $h \in \mathcal{H}$, we have

$$R(h) \le \widetilde{R}_n(h) + \sqrt{\frac{2\, m_2 \left( \log(1/\pi(h)) + \log(2/\delta) \right)}{n}}.$$

Proof.

We start by making use of the pointwise deviation bound given in Proposition 4, which tells us that with high probability

$$R(h) \le \widetilde{R}_n(h) + \sqrt{\frac{2\, m_2 \log(2 \delta^{-1})}{n}}$$

for any pre-fixed $h \in \mathcal{H}$. Replacing $\delta$ with $\delta\, \pi(h)$ gives the key error level

$$\varepsilon(h) := \sqrt{\frac{2\, m_2 \log\!\big( 2 / (\delta\, \pi(h)) \big)}{n}},$$

and using the union bound argument in the proof of Theorem 6, we have

$$\mathbb{P}\left( \exists\, h \in \mathcal{H} : R(h) > \widetilde{R}_n(h) + \varepsilon(h) \right) \le \sum_{h \in \mathcal{H}} \delta\, \pi(h) = \delta.$$

∎

Remark 8.

We note that all quantities on the right-hand side of the bound in Theorem 7 are easily computed based on the sample, except for the second moment bound $m_2$, which in practice must be replaced with an empirical estimate. With an empirical estimate of $m_2$ in place, the upper bound can easily be used to derive a learning algorithm.
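Following Remark 8, a minimal sketch of the resulting learning algorithm over a finite class, assuming the reconstructed form of the Theorem 7 bound above and a plug-in moment estimate (function names are ours):

```python
import numpy as np
# assumes psi() from the earlier snippet

def robust_risk(losses, s):
    """Robust risk approximation (10): soft-truncate losses at scale s, average."""
    losses = np.asarray(losses, dtype=float)
    return s * np.mean(psi(losses / s))

def select_candidate(loss_matrix, prior, delta):
    """Minimize the Theorem 7-style bound over a finite class.

    loss_matrix[j, i] holds the loss of candidate j on observation i;
    prior[j] is the prior mass pi(h_j)."""
    n = loss_matrix.shape[1]
    m2_hat = np.mean(loss_matrix ** 2, axis=1).max()  # plug-in for m2 (Remark 8)
    log_term = np.log(2.0 / (delta * prior))          # per-candidate confidence share
    s = np.sqrt(n * m2_hat / (2.0 * log_term))        # scale for (10), per candidate
    robust = np.array([robust_risk(loss_matrix[j], s[j])
                       for j in range(loss_matrix.shape[0])])
    bound = robust + np.sqrt(2.0 * m2_hat * log_term / n)
    return int(np.argmin(bound)), bound
```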

Uncountable model case

Next we extend the previous motivating theorem to a more general result on a potentially uncountable model $\mathcal{H}$, using stochastic learning algorithms, as has become standard in the PAC-Bayes literature. We need a few technical conditions, listed below:

  1. Bounds on lower-order moments. For all $h \in \mathcal{H}$, we require $\mathbb{E}_{\mu}\, l(h; z)^2 \le m_2 < \infty$ and $\mathbb{E}_{\mu}\, l(h; z)^3 \le m_3 < \infty$.

  2. Bounds on the risk. For all $h \in \mathcal{H}$, we require $R(h) \le R_{\max} < \infty$.

  3. Large enough confidence. We require $\log(\delta^{-1}) \le \sqrt{n}$.

These conditions are quite reasonable, and easily realized under heavy-tailed data, with just lower-order moment assumptions on the loss distribution and, say, a compact class $\mathcal{H}$. The new terms that appear in our bounds and that do not appear in previous works are $\mathbb{E}_{h \sim \rho} \widetilde{R}_n(h)$ and $V_n(\pi)$. The former is the expectation of the proposed robust estimator with respect to the posterior $\rho$, and the latter is a term that depends directly on the quality of the prior $\pi$.

Theorem 9.

Let $\pi$ be a prior distribution on the model $\mathcal{H}$. Let the three assumptions listed above hold. Setting the scale in (10) to $s = \sqrt{n\, m_2 / (2 \log(\delta^{-1}))}$, then with probability no less than $1 - \delta$ over the random draw of the sample, it holds that

$$\mathbb{E}_{h \sim \rho} R(h) \le \mathbb{E}_{h \sim \rho} \widetilde{R}_n(h) + \sqrt{\frac{2\, m_2}{n}} \left( K(\rho \| \pi) + \log \frac{1}{\delta} \right) + V_n(\pi)$$

for any choice of probability distribution $\rho$ on $\mathcal{H}$.

Remark 10.

As is evident from the statement of Theorem 9, the convergence rate is clear for all terms but $V_n(\pi)$. In our proof, we use a modified version of the elegant and now-standard strategy formulated by Bégin et al. [5]. A glance at the proof shows that under this strategy, there is essentially no way to avoid dependence on such a prior-dependent term. Since the underlying random variable is bounded over the random draw of the sample and the draw of $h \sim \pi$, the bounds still hold and are non-trivial. That said, $V_n(\pi)$ may indeed increase as $n \to \infty$, potentially spoiling the rate, and even consistency in the worst case. Clearly $V_n(\pi)$ presents no troubles if it remains small on a high-probability event, but note that this essentially amounts to asking for a prior that on average realizes bounds that are better than we can guarantee for any posterior through the above analysis. Such a prior may indeed exist, but if it were known, then that would eliminate the need for doing any learning at all. If the deviations of the estimator are truly sub-Gaussian [6], then the desired rate can easily be obtained. However, impossibility results from Devroye et al. [11] suggest that under just a few finite moment assumptions, such an estimator cannot be constructed. As such, here we see a clear limitation of the established PAC-Bayes analytical framework under potentially heavy-tailed data. Since the change of measures step in the proof is fundamental to the basic argument, it appears that concessions will have to be made, either in the form of slower rates, deviations larger than the relative entropy, or weaker dependence on the prior.

Remark 11.

Note that while in its tightest form the above bound requires knowledge of $m_2$, we may set the scale $s$ used to define $\widetilde{R}_n$ using any valid upper bound $\overline{m}_2 \ge m_2$, under which the above bound still holds as-is, using known quantities. Furthermore, for reference, the term $V_n(\pi)$ in the above bound is controlled in terms of $\overline{\sigma}^2$, an upper bound on the variance of the loss over $\mathcal{H}$.

Proof of Theorem 9.

To begin, let us recall a useful “change of measures” inequality, which can be immediately derived from our proof of Theorem 19. In particular, recall from identity (24) that given some prior $\pi$ and a bounded, measurable function $f : \mathcal{H} \to \mathbb{R}$, constructing $\pi_f$ such that almost everywhere $[\pi]$ one has

$$\frac{d\pi_f}{d\pi} = \frac{\exp(f)}{\mathbb{E}_{h \sim \pi} \exp(f(h))},$$

it follows that

$$K(\rho \| \pi_f) = K(\rho \| \pi) - \mathbb{E}_{h \sim \rho} f(h) + \log \mathbb{E}_{h \sim \pi} \exp(f(h))$$

whenever $K(\rho \| \pi) < \infty$. In the case where $K(\rho \| \pi) = \infty$, upper bounds are of course meaningless. Re-arranging, observe that since $K(\rho \| \pi_f) \ge 0$, it follows that

$$\mathbb{E}_{h \sim \rho} f(h) \le K(\rho \| \pi) + \log \mathbb{E}_{h \sim \pi} \exp(f(h)). \tag{11}$$

This inequality given in (11) is deterministic, holds for any choice of $\rho$, and is a standard technical tool in deriving PAC-Bayes bounds.
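For completeness, the short computation behind the identity and the inequality (11), under the construction of $\pi_f$ given above:

```latex
% Expand K(rho || pi_f) using d(pi_f)/d(pi) = exp(f) / E_pi[exp(f)]:
\begin{align*}
K(\rho \,\|\, \pi_f)
  &= \mathbb{E}_{h \sim \rho} \log \frac{d\rho}{d\pi_f}(h)
   = \mathbb{E}_{h \sim \rho}
     \left[ \log \frac{d\rho}{d\pi}(h) - f(h)
            + \log \mathbb{E}_{h' \sim \pi} e^{f(h')} \right] \\
  &= K(\rho \,\|\, \pi) - \mathbb{E}_{h \sim \rho} f(h)
     + \log \mathbb{E}_{h' \sim \pi} e^{f(h')} \;\ge\; 0,
\end{align*}
% and re-arranging the final non-negativity yields (11).
```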

We shall introduce a minor modification to this now-standard strategy in order to make the subsequent results more lucid. Instead of $\pi_f$ as just characterized above, define $\widetilde{\pi}_f$ such that almost everywhere $[\pi]$, we have

$$\frac{d\widetilde{\pi}_f}{d\pi} = \frac{\exp(f)}{a_n},$$

where $a_n > 0$ is a function of the sample size $n$, which increases monotonically as $n \to \infty$ (e.g., setting $a_n = \sqrt{n}$). To explicitly construct such a measure, one can define it by

$$\widetilde{\pi}_f(A) := \frac{1}{a_n} \int_{A} \exp(f(h))\, d\pi(h), \quad A \in \mathcal{A},$$

where $(\mathcal{H}, \mathcal{A})$ is our measurable space of interest. In this paper, we always have $f \ge 0$, implying that $\exp(f) \ge 1$ and thus $\widetilde{\pi}_f(\mathcal{H}) > 0$. Also by assumption, since $f$ is bounded over $\mathcal{H}$, we have

$$\widetilde{\pi}_f(\mathcal{H}) \le \frac{1}{a_n} \sup_{h \in \mathcal{H}} \exp(f(h)) < \infty,$$

and so $\widetilde{\pi}_f$ is a finite measure. Note however that both $\widetilde{\pi}_f(\mathcal{H}) \le 1$ and $\widetilde{\pi}_f(\mathcal{H}) > 1$ are possible, so in general $\widetilde{\pi}_f$ need not be a probability measure. By construction, we have $\widetilde{\pi}_f \ll \pi$. Since $\exp(f(h)) > 0$ for all $h \in \mathcal{H}$, we have $\pi \ll \widetilde{\pi}_f$ as well, and the measurability of $f$ implies the measurability of $d\widetilde{\pi}_f / d\pi$. Using the chain rule (Lemma 15), it follows that for any $\rho \ll \pi$,

$$K(\rho \| \pi) = \mathbb{E}_{h \sim \rho} \log \frac{d\rho}{d\widetilde{\pi}_f}(h) + \mathbb{E}_{h \sim \rho} \log \frac{d\widetilde{\pi}_f}{d\pi}(h).$$

As such, we have $\rho \ll \widetilde{\pi}_f$, and by the Radon–Nikodym theorem, we may write $d\rho / d\widetilde{\pi}_f$, since such a function is unique almost everywhere $[\widetilde{\pi}_f]$. As long as $K(\rho \| \pi) < \infty$, we have $\rho \ll \pi$, which in turn implies $\rho \ll \widetilde{\pi}_f$, so that with use of the chain rule and Radon–Nikodym, we have

$$\frac{d\rho}{d\pi} = \frac{d\rho}{d\widetilde{\pi}_f} \cdot \frac{d\widetilde{\pi}_f}{d\pi} = \frac{d\rho}{d\widetilde{\pi}_f} \cdot \frac{\exp(f)}{a_n}.$$

Taking the two ends of this string of equalities, by Radon–Nikodym it holds that

$$\frac{d\rho}{d\widetilde{\pi}_f} = a_n \exp(-f)\, \frac{d\rho}{d\pi}$$

almost everywhere $[\pi]$, and thus almost everywhere $[\widetilde{\pi}_f]$ as well. Following the argument of Theorem 19, we have that

$$\mathbb{E}_{h \sim \rho} \log \frac{d\rho}{d\widetilde{\pi}_f}(h) = K(\rho \| \pi) - \mathbb{E}_{h \sim \rho} f(h) + \log a_n.$$

The tradeoff for using $\widetilde{\pi}_f$, which need not be a probability, comes in deriving a lower bound on the left-hand side. In Lemma 16 we showed how the relative entropy between probability measures is non-negative. Non-negativity does not necessarily hold for general measures, but analogous lower bounds can be readily derived for our special case as

$$\mathbb{E}_{h \sim \rho} \log \frac{d\rho}{d\widetilde{\pi}_f}(h) \ge -\log \widetilde{\pi}_f(\mathcal{H}),$$

where the inequality uses the fact that $\rho$ is a probability measure, via Jensen’s inequality applied to the convex function $t \mapsto t \log t$. Taking this with our decomposition of $K(\rho \| \pi)$, we have

$$\mathbb{E}_{h \sim \rho} f(h) \le K(\rho \| \pi) + \log a_n + \log \widetilde{\pi}_f(\mathcal{H}), \tag{12}$$

which amounts to a revised inequality based on change of measures, analogous to (11).

To keep notation clean, write

$$\Lambda := \widetilde{\pi}_f(\mathcal{H}) = \frac{1}{a_n}\, \mathbb{E}_{h \sim \pi} \exp(f(h)).$$

Noting that $\Lambda$ is random with dependence on the sample, via Markov’s inequality we have

$$\Lambda \le \frac{\mathbb{E} \Lambda}{\delta} \tag{13}$$

with probability no less than $1 - \delta$. Here probability and the expectation $\mathbb{E}$ are with respect to the sample. Since $\psi$ is bounded, as long as $a_n > 0$, we have $\mathbb{E} \Lambda < \infty$, which lets us use the change of measures inequality in a meaningful way. Now, for any choice of $\rho$ on $\mathcal{H}$, observe that we have

$$\mathbb{E}_{h \sim \rho} f(h) \le K(\rho \| \pi) + \log a_n + \log \Lambda \le K(\rho \| \pi) + \log\!\left( \frac{\mathbb{E}_{h \sim \pi}\, \mathbb{E} \exp(f(h))}{\delta} \right)$$

with probability no less than $1 - \delta$. The first inequality follows from the modified change of measures (12), the second inequality follows from (13), and the final interchange of integration operations is valid using Fubini’s theorem [4]. Note that the “good event” depends only on the prior $\pi$ (fixed in advance) and not on $\rho$. Thus, the above inequality holds on the good event, uniformly in $\rho$.

It remains to bound the per-candidate moment term $\mathbb{E} \exp(f(h))$, for an arbitrary fixed $h \in \mathcal{H}$. Start by breaking up the one-sided deviations as

writing the two resulting terms as $A$ and $B$ for convenience. We will take the terms $A$ and $B$ one at a time. First, note that the function $\psi$ can be written

(14)

Again for notational simplicity, write $u := l(h; z)/s$ for the re-scaled loss, where $h \in \mathcal{H}$ is arbitrary. We are assuming non-negative losses, so that $u \ge 0$, meaning that only the right-hand branches of (14) are active. We use this, as well as the moment assumptions, in addition to (14), in order to bound the expectation of our estimator from below, as follows.

By assumption 2, the risk is bounded, implying that this lower bound is non-trivial. Next, we obtain a one-sided bound on the tails of the loss by

Note that the first inequality makes use of a condition which is implied by the bounds assumed on the lower-order moments, namely assumption 1.

Returning to the lower bound on the expectation of our estimator, using Hölder’s inequality in conjunction with the tail bound we just obtained, we get an upper bound in the form of

This means we can now say

which, re-arranged and written more succinctly, gives us the desired form of the bound.