Efron-Stein PAC-Bayesian Inequalities

Abstract

We prove semi-empirical concentration inequalities for random variables which are given as possibly nonlinear functions of independent random variables. These inequalities describe the concentration of a random variable in terms of the data/distribution-dependent Efron-Stein (ES) estimate of its variance, and they do not require any additional assumptions on the moments. In particular, this allows us to state semi-empirical Bernstein-type inequalities for general functions of unbounded random variables, which gives user-friendly concentration bounds for cases where related methods (e.g. bounded differences) might be more challenging to apply. We extend these results to Efron-Stein PAC-Bayesian inequalities which hold for arbitrary probability kernels that define a random, data-dependent choice of the function of interest. Finally, we demonstrate a number of applications, including PAC-Bayesian generalization bounds for unbounded loss functions, empirical Bernstein-type generalization bounds, new truncation-free bounds for off-policy evaluation with Weighted Importance Sampling (WIS), and off-policy PAC-Bayesian learning with WIS.

1 Introduction

In the following we will be concerned with bounds on the upper tail probability of

where composed of independent random elements is distributed according to some probability measure , is some space, and is either a fixed measurable function or a function that is randomly chosen as a function of .

We first consider a simpler case of such concentration inequalities, when is a fixed function and the user can choose from a number of different ways to study the behavior of (see Boucheron et al. (2013) for a comprehensive survey on the topic). Perhaps the two most popular methods used in learning theory are the martingale method (Azuma, 1967; McDiarmid, 1998) and the information-theoretic entropy method (Boucheron et al., 2003; Maurer, 2019). Both of these give many well-known and useful inequalities: the first family includes the celebrated Azuma-Hoeffding and so-called bounded-differences inequalities popularized by McDiarmid (1998), while the second family is mostly known for the powerful exponential Efron-Stein inequality, which yields many prominent concentration inequalities as special cases (for instance, inequalities for self-bounding functions and Talagrand’s convex distance inequality).

Roughly speaking, a common feature of both families is that they relate concentration of around zero to the sensitivity of to coordinatewise perturbations, expressed through the Efron-Stein (ES) variance proxy

(1)

where and notation means that the th element of is replaced by , where is an independent copy of .
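To make the coordinatewise-perturbation idea concrete, here is a minimal Monte Carlo sketch of the classical Efron-Stein bound, Var(f(S)) <= (1/2) sum_k E[(f(S) - f(S^(k)))^2], where S^(k) replaces the k-th element by an independent copy. The exact form and normalization of the proxy in Eq. 1 may differ from this classical version; the function names, the choice f = max, and the uniform sampler are illustrative assumptions rather than anything taken from the paper.

```python
import random
import statistics


def efron_stein_estimate(f, sampler, n, trials=20000, seed=0):
    """Monte Carlo estimate of 0.5 * sum_k E[(f(S) - f(S^(k)))^2] (classical form)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = [sampler(rng) for _ in range(n)]
        fs = f(s)
        for k in range(n):
            s_k = list(s)
            s_k[k] = sampler(rng)  # replace the k-th element by an independent copy
            total += (fs - f(s_k)) ** 2
    return 0.5 * total / trials


def variance_estimate(f, sampler, n, trials=20000, seed=1):
    """Plain Monte Carlo estimate of Var(f(S))."""
    rng = random.Random(seed)
    values = [f([sampler(rng) for _ in range(n)]) for _ in range(trials)]
    return statistics.pvariance(values)


def uniform(rng):
    """Hypothetical data distribution: Uniform(0, 1)."""
    return rng.random()


es = efron_stein_estimate(max, uniform, n=5)
var = variance_estimate(max, uniform, n=5)
print(f"Efron-Stein estimate: {es:.4f}, variance estimate: {var:.4f}")
```

For this nonlinear choice of f the Efron-Stein quantity sits visibly above the variance, which is the slack that a variance-proxy-based bound has to pay.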

For example, an inequality closely related to the well-known bounded-differences (or McDiarmid’s) inequality follows from a conservative upper bound on (1): assuming that almost surely for some positive constant , one has

This inequality is of course rather pessimistic since it neglects information about moments of . A tangible step forward in proving less conservative inequalities was made in the context of the so-called entropy method. In particular, one of the central achievements of the entropy method is the following ‘exponential ES inequality’:

(2)

This inequality bounds the Moment-Generating Function (MGF) of through the MGF of its ES variance proxy, implying that by controlling the latter, one can obtain tail bounds involving moments. For instance, if for any choice of the distribution , satisfies for some constants , it is called a weakly self-bounding function and we can employ Eq. 2 to show that the first moment of and constants control the tail behavior of :

(3)

Thus, whenever is decreasing in (for instance, when is an average of random variables), one gets a dominating lower-order term when the first moment is small enough. This behavior is reminiscent of the classical Bernstein inequality and has proved useful in a number of applications, such as generalization bounds with localization (Bartlett et al., 2002; Srebro et al., 2010; Catoni, 2007) and empirical Bernstein-type inequalities (Maurer and Pontil, 2009; Tolstikhin and Seldin, 2013).

It is natural to ask whether we can get similar inequalities with higher order moments. Indeed, a recent line of work by Maurer (2019) and Maurer and Pontil (2018) introduced Bernstein-type concentration inequalities for general functions where, in place of the variance, one has an expected ES variance proxy (note that one still controls the variance of indirectly thanks to the Efron-Stein inequality ). However, this comes at the cost of controlling the first and the second moments of . Formally, if for any choice of distribution , satisfies

almost surely for some , we have a tail bound

(4)

Thus, to have a Bernstein-type behavior of the bound, and should be of a lower order. Note the connection of Eq. 4 to the exponential ES inequality 2: the condition on outlined above is sufficient to control the second and higher order moments of , thus giving a concentration inequality for .

Despite their generality, all of these bounds implicitly control moments of , which makes them difficult to apply in some cases. The pair cannot depend on the sample and typically one would require boundedness of to obtain a well-behaved . One way to avoid these limitations is to revisit the exponential ES inequality and analyze the MGF of the ES variance proxy in an application-specific way, for example by assuming a subexponential behavior of (Abou-Moustafa and Szepesvári, 2019). However, in general this requires knowledge of additional parameters (such as the scale and variance factor of the underlying subexponential distribution).

1.1 Our Contribution

Semi-empirical Efron-Stein Inequalities. In this paper we prove concentration inequalities without making a priori assumptions on the moments of the ES variance proxy; instead, we state bounds on the upper tail probability in terms of the semi-empirical ES variance proxy

(5)

Note that is semi-empirical since it depends on both the distribution and the sample . Another property of is that it is asymmetric w.r.t. the sample : in general, depends on the order of elements in the sequence , due to the conditional expectation. However, as we discuss in the following section (see applications), this does not affect sums and only weakly affects normalized sums.
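The order dependence can be seen on a toy example. The sketch below computes, by exact enumeration over a small discrete distribution, a proxy of the assumed form sum_k E[(f(S) - f(S^(k)))^2 | X_1, ..., X_k], where the expectation is over the not-yet-conditioned elements and the independent copy. This form is an assumption about Eq. 5 (the paper's definition may, for instance, use a different conditioning or normalization), and all names and numerical values are hypothetical.

```python
from itertools import product
from math import prod


def semi_empirical_proxy(f, sample, support, probs):
    """Exact sum_k E[(f(S) - f(S^(k)))^2 | X_1..X_k] for i.i.d. discrete data.

    ASSUMPTION: this is an illustrative form of the semi-empirical proxy,
    not necessarily the exact definition used in the paper.
    """
    n = len(sample)
    total = 0.0
    for k in range(n):
        head = list(sample[: k + 1])  # observed prefix (k is 0-indexed here)
        # Average over the unobserved tail X_{k+2}, ..., X_n ...
        for tail in product(range(len(support)), repeat=n - k - 1):
            p_tail = prod(probs[j] for j in tail)
            s = head + [support[j] for j in tail]
            # ... and over the independent copy replacing the k-th element.
            for j, x_new in enumerate(support):
                s_k = list(s)
                s_k[k] = x_new
                total += p_tail * probs[j] * (f(s) - f(s_k)) ** 2
    return total


def plain_sum(s):
    return sum(s)


def normalized_sum(s):
    """A self-normalized statistic, loosely mimicking a weighted average."""
    return sum(x * x for x in s) / sum(s)


support, probs = [1.0, 3.0], [0.5, 0.5]
for f in (plain_sum, normalized_sum):
    forward = semi_empirical_proxy(f, [1.0, 3.0], support, probs)
    backward = semi_empirical_proxy(f, [3.0, 1.0], support, probs)
    print(f.__name__, forward, backward)
```

In this toy case the two orderings give the same value for the plain sum but different values for the normalized statistic, matching the qualitative remark above.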

Our first result (Theorem 1) gives an exponential bound

(6)

This inequality does not require boundedness of the random variables, nor of ; the only crucial assumption is independence of the elements of from each other. Observe that Eq. 6 essentially depends on and a positive free parameter , which must be selected by the user. For instance, a problem-agnostic choice of for any gives us, w.p. (with probability) at least ,

This recovers a Bernstein-type behavior, that is, the dominance of the lower-order term whenever (a variance proxy) is small enough. The price we pay for such a simple choice of is a logarithmic term. In general, one can achieve an even sharper bound if the range of is known (or can be guessed): in this case, we can take a union bound over some discretized range of , and select minimizing the bound. In addition, we show a version of the bound that does not involve and is thus scale-free. This version of our inequality, however, depends on :

PAC-Bayesian Semi-Empirical Efron-Stein Inequalities. So far we have presented concentration inequalities which hold for fixed functions . However, in many learning-theoretic applications we are interested in concentration w.r.t. a class of functions, for example when potentially depends on the data. In the following we extend our results to the class , where is some parameter space. We focus on the stochastic, PAC-Bayesian model, where functions are parameterized by a random variable given some probability kernel from to . For example, in the statistical learning setting, the predictor is parameterized by sampled from called the posterior, while represents an empirical loss of the predictor (we discuss this in the upcoming section). In particular, defining a -dependent deviation

and a semi-empirical ES variance proxy

we show (in Theorem 3, Section 4.2) that for an arbitrary probability kernel from to and an arbitrary probability measure on called the prior, w.p. at least for any , ,

(7)

where the Kullback-Leibler (KL) divergence between and (assuming that ) captures the effective capacity of under respective measures. As before, we also have a -free version, which holds w.p. at least for any :

Once again, these results do not require boundedness of random variables, nor of , and the concentration is essentially controlled by the expected variance-proxy . Next, we discuss several specializations of our results and note several key connections to the literature on the PAC-Bayesian analysis.

1.2 Applications

Now we discuss several applications of our inequalities. Throughout this section we assume that inequalities hold for any and any , unless stated otherwise.

Generalization bounds. PAC-Bayesian literature often discusses bounds on the generalization gap, which is a special case covered by our results. In this scenario, is defined as an average of some non-negative loss functions, incurred by the predictor of interest parameterized by on a given example. Here, is sampled from the posterior (a density over the parameter space ) chosen by the learner. In particular, let and let be some loss function with co-domain . Then, we define the population loss and the empirical loss as

respectively, and taking , the generalization gap is defined as .

The vast majority of the PAC-Bayesian literature (e.g. McAllester, 1998; Seeger, 2002; Langford and Shawe-Taylor, 2003; Maurer, 2004) assumes that the loss function is bounded, i.e. w.l.o.g. . In such a case, and taking , Eq. 7 immediately implies that w.p. at least ,

This basic corollary tightens classical results by replacing the term with a universal constant, but slightly loses in terms of a multiplicative constant. The technical assumption on boundedness of the loss is not easy to avoid, and the usual way to circumvent it is to resort to a sub-exponential behavior of the relevant quantities, such as an empirical loss or a generalization gap (Alquier et al., 2016; Germain et al., 2016). Recently, a few works have also looked into the PAC-Bayesian analysis for heavy-tailed losses: Alquier and Guedj (2018) proposed a polynomial moment-dependent bound with -divergence, while Holland (2019) devised an exponential bound which assumes that the second (uncentered) moment of the loss is bounded by a constant.

Here, without any of those assumptions, we obtain (creftype 1) a high-probability semi-empirical generalization bound for unbounded loss functions (),

(8)

where . This result is close in spirit to the localized bounds of Catoni (2007); Langford and Shawe-Taylor (2003); Tolstikhin and Seldin (2013): for a small variance proxy (here sum of squared losses) we get the dominance of a lower-order term .

Finally, while the bound of Eq. 8 is semi-empirical (note that we condition only on ), by additionally assuming boundedness of the loss, it implies a fully empirical result (Theorem 5):

Such empirical Bernstein bounds (Audibert et al., 2007; Maurer and Pontil, 2009) in the PAC-Bayesian setting were first investigated by Tolstikhin and Seldin (2013). The bound we present here is similar to the one of Tolstikhin and Seldin (2013), but differs slightly since we consider the sum of squared losses (with co-domain ) rather than the sample variance. Nevertheless, we recover a similar behavior, that is, a “fast” order for a small enough variance proxy.

Off-policy Evaluation with Weighted Importance Sampling (WIS). Consider the stochastic decision making model where the pair of random variables called the action-reward pair is distributed according to some unknown joint probability measure . In such a model, also known as a stochastic bandit feedback model (see Lattimore and Szepesvári (2018) for a detailed treatment of the subject), an agent takes action by sampling it from a discrete probability distribution called the target policy and observes a realization of reward . In the off-policy setting of this model, the learner only observes a tuple of actions and rewards generated by sampling each action from another fixed discrete probability distribution called the behavior policy, while corresponding rewards are distributed as .

In the off-policy evaluation problem, our goal is to estimate an expected reward, or the value of a fixed target policy ,

by relying on , where actions and rewards are collected by policy . Since observations are collected by another policy, we face a distribution mismatch problem, and an estimator of the value is typically designed using a variant of importance sampling, while aiming to maintain a good bias-variance trade-off. In this paper we study the Weighted Importance Sampling (WIS) (or self-normalized importance sampling) estimator

where importance weights are defined as a ratio of policies given an action, . The WIS estimator is known for a relatively low variance in practice (Hesterberg, 1995), yet it concentrates well even when importance weights are unbounded, since all of its moments are bounded, which makes it appealing for confidence estimation.
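As a concrete reference point, the following sketch implements the standard self-normalized (weighted) importance sampling estimator, sum_i W_i R_i / sum_j W_j with W_i = target(A_i) / behavior(A_i). The function name, the dictionary-based policy representation, and the toy data are illustrative assumptions.

```python
def wis_estimate(actions, rewards, target, behavior):
    """Weighted (self-normalized) importance sampling estimate of a policy's value.

    `target` and `behavior` map each action to its probability under the
    target and behavior policy, respectively (hypothetical representation).
    """
    weights = [target[a] / behavior[a] for a in actions]
    normalizer = sum(weights)
    if normalizer == 0.0:
        raise ValueError("all importance weights are zero; estimate is undefined")
    return sum(w * r for w, r in zip(weights, rewards)) / normalizer


# Toy logged data: two actions, rewards in [0, 1].
behavior = {0: 0.5, 1: 0.5}
target = {0: 0.9, 1: 0.1}
actions = [0, 1]
rewards = [1.0, 0.0]
print(wis_estimate(actions, rewards, target, behavior))
```

Because the weights are normalized by their own sum, the estimate always lies in the range of the observed rewards, which is one informal way to see why its moments stay bounded even when individual weights are large.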

In this paper we show a lower bound on the value of the target policy when employing WIS, which partially captures the variance of an estimator. We prove (Theorem 7) that w.p. at least ,

where

Here acts as a variance proxy and can be easily computed since the distribution of the importance weights is known. Note that the bound can be further improved by taking a tighter variance proxy (Theorem 6, creftype 1) at an additional computational cost. The computationally efficient version of that we discuss here states the rate of concentration in terms of the variance of the importance weights.

The presented high-probability results do not require boundedness of importance weights, nor any form of truncation, which is prevalent in the literature on (weighted) importance sampling (Swaminathan and Joachims, 2015; Bottou et al., 2013; Thomas et al., 2015a). Indeed, it is not hard to see that by truncating the weights we can apply standard concentration inequalities. However, truncation biases the estimator and in practice requires careful tuning of the truncation level to guarantee a good bias-variance trade-off. We avoid this precisely because our concentration inequalities do not require control of higher moments through boundedness. While other general Bernstein-type concentration inequalities (e.g. Eqs. 4 and 3) could be used for such problems, they would require truncation of importance weights.

Off-policy Learning with WIS. An off-policy learning problem is a natural extension of the evaluation problem discussed earlier. Here, instead of evaluating a fixed target policy, our goal is to select a policy from a given class which maximizes the value. In this paper we propose a PAC-Bayesian lower bound on the value by specializing our ES PAC-Bayesian inequalities. In particular, we consider a class of parametric target policies for some parameter space . As before, we assume that the parameter is sampled from the posterior (typically chosen by the learner after observing the data), where is some probability kernel from to . Note that the probability measure depends on the tuple of observed action-reward pairs generated as described before, and importance weights are now defined w.r.t. the random parameter , that is .

We show (Theorem 8) that for an arbitrary probability kernel from to and an arbitrary probability measure over , w.p. at least ,

where the effective capacity of the policy class is represented by the KL divergence in

and the variance proxy of an estimator is

Here captures the bias of an estimator, while is the variance proxy. As in the case of evaluation, the variance proxy is not fully empirical; however, it can be computed exactly since the distribution of importance weights is known (it is given by the behavior policy and the target policy ). The variance proxy presented here is also closely related to the Effective Sample Size (ESS) encountered in Monte-Carlo theory (Owen, 2013, Chap. 9; Elvira et al., 2018), which provides a problem-dependent convergence rate of the WIS estimator. When all importance weights are equal to one (perfectly matching policies), ESS is of order ; on the other hand, ESS approaches when importance weights concentrate on a single action. The role of ESS in variance-dependent off-policy problems was also observed by Metelli et al. (2018), although in the context of polynomial bounds for fixed target policies.
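For intuition about the two regimes just described, the sketch below computes the commonly used effective sample size (sum_i w_i)^2 / sum_i w_i^2. This is the textbook form; the paper's variance proxy is only described as closely related to it, so the formula and the function name here are assumptions for illustration.

```python
def effective_sample_size(weights):
    """Textbook ESS: (sum of weights)^2 / (sum of squared weights)."""
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    if sum_w2 == 0.0:
        raise ValueError("ESS is undefined when all weights are zero")
    return sum_w * sum_w / sum_w2


# Perfectly matching policies: every weight equals one, so ESS equals the sample size.
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))
# All of the weight on a single observation: ESS collapses to one.
print(effective_sample_size([4.0, 0.0, 0.0, 0.0]))
```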

Thus, maximizing the presented lower bound w.r.t. a (parametric) probability measure gives a way to learn a target policy maximizing the value, while maintaining a bias-variance trade-off of the WIS estimator. This idea is well-known in the off-policy learning literature: typically this is done through empirical Bernstein bounds by employing the importance sampling estimator (Thomas et al., 2015b) and sometimes the WIS estimator (Swaminathan and Joachims, 2015). However, all these techniques require some form of weight truncation. As in the off-policy evaluation case, the trade-off between the bias and the variance has to be carefully controlled by tuning the level of truncation. Our results provide an alternative route, free from additional tuning. Truncation-free uniform convergence bounds for importance sampling were also derived by Cortes et al. (2010); here, however, we explore a PAC-Bayesian analysis approach.

1.3 Proof Ideas

Concentration inequalities shown in this paper are to some extent based on the inequalities for self-normalized estimators by de la Peña et al. (2008). In particular, we use inequalities derived through the method of mixtures (de la Peña et al., 2008, Chap. 12.2.1), which hold for the pair of random variables satisfying the condition . Such random variables are called a canonical pair, and our semi-empirical ES inequalities follow by proving that indeed forms a canonical pair. We do so by applying an exponential decomposition inspired by the proof of the Azuma-Hoeffding inequality to the condition stated above, while is represented by the Doob martingale difference sequence. Note that a similar technique was also applied by Rakhlin and Sridharan (2017) in the context of self-normalized martingales.

Our PAC-Bayesian inequalities follow a more involved argument. As in the classical PAC-Bayesian analysis we start from the Donsker-Varadhan change-of-measure inequality applied to the function for , and note that the log-MGF of at the prior parameter is bounded by due to the canonical pair argument. The rest of the proof is dedicated to the tuning of . One possibility here would be to apply the union bound argument of Seldin et al. (2012), however, since is unbounded, this allows us to take a more straightforward path. In particular, we employ the method of mixtures (used in (de la Peña et al., 2008, Chap. 12.2.1) to derive the aforementioned inequalities) to achieve analytic tuning of the bound w.r.t. . The idea behind the method of mixtures is to integrate the parameter of interest under some analytically-integrable probability measure. The choice of the Gaussian density with variance (recall that is a free parameter) and the Gaussian integration w.r.t.  leads to the concentration inequalities. To the best of our knowledge, this is the first application of the method of mixtures in PAC-Bayesian analysis, which is an alternative technique to analytical (union bound) (Seldin et al., 2012; Tolstikhin and Seldin, 2013) and empirical tuning of as in (Thiemann et al., 2016).

Finally, the described applications follow from the analysis of the semi-empirical ES variance proxy. In the case of the generalization error, such analysis is straightforward: our main bounds are obtained by observing multiple cancellations in . In the case of WIS, the result follows from a stability analysis of the estimator: given the removal of the -th importance weight, the difference of estimators is bounded by .

1.4 Additional Related Work

Our PAC-Bayesian bounds are related to the martingale bounds of Seldin et al. (2012). In particular, our results need to be compared to the PAC-Bayes-Bernstein inequality for martingales (Seldin et al., 2012, Theorem 8). In principle, their inequality could be applied to the Doob martingale difference sequence to prove a concentration bound. However, this would hold only for bounded difference sequences and a restricted family of probability kernels, and would yield an inequality with a different ‘less empirical’ ES variance proxy (with conditioning up to elements in expectation). The technique of Seldin et al. (2012); Tolstikhin and Seldin (2013); Thiemann et al. (2016) at its heart relies on the self-bounding property of the variance proxy to control the log-MGF term arising due to the PAC-Bayesian analysis. This control is possible thanks to inequalities obtained through the entropy method. On the other hand, the self-bounding property in these cases holds for a limited range of , and the method of mixtures applied in our proofs cannot be used here without introducing superfluous error terms (because of the clipped Gaussian integration). Another direction in controlling the log-MGF, related to the empirical Bernstein inequalities for martingales, was explored in the online learning literature (Cesa-Bianchi et al., 2007; Wintenberger, 2017) by linearization of .

PAC-Bayesian analysis in learning theory is not restricted to the generalization gap discussed in Section 1.2. Several works (Maurer, 2004; Seeger, 2002) have investigated generalization by giving upper bounds on the KL divergence of a Bernoulli variable (assuming that the loss function is bounded on ), which are clearly tighter than the difference bounds (due to Pinsker’s inequality). In this paper we forego this setting for the sake of generality; however, we suspect that KL-Bernoulli ES bounds can be derived for bounded loss functions.

A number of works have also looked into PAC-Bayesian bounds (and bounds for the closely related Gibbs predictors) on the excess risk , e.g. (Catoni, 2007; Alquier et al., 2016; Kuzborskij et al., 2019; Grünwald and Mehta, 2019). A tantalizing line of research would be to investigate the use of our semi-empirical results in the context of an excess risk analysis.

Finally, in recent years many works (Dziugaite and Roy, 2017; Neyshabur et al., 2018; Rivasplata et al., 2018; Mhammedi et al., 2019) have observed that PAC-Bayesian bounds tend to give less conservative numerical estimates of the generalization gap for neural networks compared to the alternative techniques based on the concentration of empirical processes. Semi-empirical bounds proposed in this paper offer opportunities for sharpening these results in a data-dependent fashion.

Organization. In Section 3 we prove concentration inequalities for the fixed function, where the proof crucially relies on Lemma 1, whose proof is deferred to Appendix A. In Section 4.2 we present the PAC-Bayesian extension of the concentration inequalities and present a major part of its proof, which is one of the main contributions of this paper. Finally, proofs for generalization bounds are presented in Section B.1, while proofs for the WIS value bounds are presented in Section B.3 and Section B.4.

2 Preliminaries

Throughout this paper, we use to indicate that there exists a universal constant such that holds uniformly over all arguments. We use to indicate a version of which also suppresses logarithmic factors. Notation is used to denote the positive part of the real number . We use notation to denote a family of probability measures supported on a set . If and are densities over such that and , the Kullback-Leibler (KL) divergence between and is defined as . We say that a random variable is -subgaussian if for every .
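For readers who prefer a computational view, the sketch below evaluates the KL divergence in the simplest case of two discrete distributions, using the standard formula KL(p || q) = sum_i p_i ln(p_i / q_i). The discrete setting, the function name, and the example numbers are illustrative assumptions; the paper works with general densities.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length probability lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * ln(0 / q_i) is taken to be zero
        if q_i == 0.0:
            raise ValueError("p must be absolutely continuous w.r.t. q")
        total += p_i * math.log(p_i / q_i)
    return total


posterior = [0.5, 0.5]
prior = [0.25, 0.75]
print(kl_divergence(posterior, prior))
```

The absolute-continuity check mirrors the requirement that the posterior puts no mass where the prior has none.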

3 Semi-Empirical Concentration Inequalities

In this section we prove semi-empirical concentration inequalities. Recall that we focus on measurable functions of a random tuple , where elements of are distributed independently from each other and . In this section we prove bounds on the upper tail probability of .

Theorem 1.

Let the semi-empirical Efron-Stein variance proxy be defined as

(9)

Then, for any , with probability at least and any ,

In addition, for any , with probability at least we have

Note the similarity to the Efron-Stein inequality, which bounds the variance of with . The proof of Theorem 1 combines the argument used in the proof of McDiarmid’s/Azuma-Hoeffding’s inequality with a concentration inequality due to de la Peña et al. (2008). To state this inequality, recall that a pair of random variables is called a canonical pair if and

(10)
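To illustrate the definition, assume that condition (10) is the standard canonical-pair condition of de la Peña et al., namely E[exp(lambda * A - lambda^2 * B^2 / 2)] <= 1 for all real lambda. The sketch below verifies it exactly for a textbook example, a Rademacher sum A = sum_i eps_i a_i with B^2 = sum_i a_i^2, by enumerating all sign patterns. The example pair, the function name, and the coefficients are assumptions chosen for illustration and are not the pair studied in the paper.

```python
from itertools import product
import math


def canonical_pair_mgf(coeffs, lam):
    """Exact E[exp(lam * A - lam^2 * B^2 / 2)] for A = sum_i eps_i * a_i.

    The eps_i are independent uniform random signs and B^2 = sum_i a_i^2.
    """
    b2 = sum(a * a for a in coeffs)
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=len(coeffs)):
        a_val = sum(e * a for e, a in zip(signs, coeffs))
        total += math.exp(lam * a_val - 0.5 * lam * lam * b2)
    return total / 2 ** len(coeffs)


coeffs = [0.5, 1.0, 2.0]
for lam in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(lam, canonical_pair_mgf(coeffs, lam))
```

Every printed value stays at or below one, since cosh(x) <= exp(x^2 / 2) for each factor of the product.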

The result of de la Peña et al. shows that if is a canonical pair then has a (random) subgaussian behavior with variance proxy :

Theorem 2 (Theorem 12.4 of de la Peña et al. (2008)).

Let be a canonical pair. Then, for any ,

In addition, for all and ,

Thus, provided that is a canonical pair, it is easy to see that Theorem 1 follows from Theorem 2 applied to . It therefore remains to show that is a canonical pair, which is established by the following lemma, whose proof is presented in Appendix A:

Lemma 1.

is a canonical pair.

4 PAC-Bayesian Bounds

In this section we prove a PAC-Bayesian version of Theorem 2, which will consequently imply a PAC-Bayesian bound for classes of functions.

Let denote an index set. We call the collection of pairs of random variables indexed by elements of a canonical family if, for each , is a canonical pair. In this section we prove new PAC-Bayesian inequalities for families of canonical pairs that arise as functions of a common random element. Below we let denote a probability kernel from to . Also, for brevity, given , we will also use to denote .

Theorem 3.

For some space , let be a canonical family and , jointly distributed with and taking values in . Fix an arbitrary probability kernel from to and an arbitrary probability measure over , and let . Then, for any , with probability at least we have that

(11)

In addition, for all and with probability at least we have

(12)

The proof (given in Appendix A) largely relies on the following theorem, which allows us to bound a moment-generating function of a random variable . Note that the crucial part is to show that this random variable is subgaussian (as shown in (14)).

Theorem 4.

Under the same conditions as in Theorem 3, for any , we have

(13)

Furthermore, for all ,

(14)

The proof combines PAC-Bayesian ideas with the method of mixtures as described by de la Peña et al. (2008, Section 12.2.1).
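The core computation behind the method of mixtures is a Gaussian integral. Assuming the mixing density is a zero-mean Gaussian with variance 1/y^2 (the parametrization used by de la Peña et al.; the paper's own scaling of the free parameter may differ), integrating exp(lambda * a - lambda^2 * b2 / 2) over lambda gives y / sqrt(b2 + y^2) * exp(a^2 / (2 (b2 + y^2))). The sketch below compares this closed form with a numerical integral; the function names and the test values of a, b2, and y are illustrative assumptions.

```python
import math


def mixture_integral_closed_form(a, b2, y):
    """Closed form of the Gaussian mixture of exp(lam * a - lam^2 * b2 / 2)."""
    c = b2 + y * y
    return y / math.sqrt(c) * math.exp(a * a / (2.0 * c))


def mixture_integral_numeric(a, b2, y, steps=20000):
    """Trapezoidal approximation of the same integral over lam."""
    c = b2 + y * y
    center, width = a / c, 1.0 / math.sqrt(c)  # location and scale of the integrand
    lo, hi = center - 12.0 * width, center + 12.0 * width
    h = (hi - lo) / steps

    def integrand(lam):
        # Mixing density: N(0, 1 / y^2), i.e. precision y^2.
        density = y / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (lam * y) ** 2)
        return math.exp(lam * a - 0.5 * lam * lam * b2) * density

    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + i * h) for i in range(1, steps))
    return total * h


a, b2, y = 1.5, 2.0, 1.0
print(mixture_integral_closed_form(a, b2, y), mixture_integral_numeric(a, b2, y))
```

The appeal of this choice is that the free parameter lambda disappears analytically, so no union bound over a grid of lambda values is needed.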

4.1 Proof of Theorem 4

We start by applying the following change-of-measure lemma, which is the basis of the PAC-Bayesian analysis.

Lemma 2 (Donsker and Varadhan (1975); Dupuis and Ellis (1997); Gray (2011)).

Let be probability measures on and let , . Then, for any measurable function we have

The lemma with , , and for a fixed implies

Exponentiation of both sides gives

and taking expectation we have

since is a canonical pair for a fixed by assumption. Now we apply the method of mixtures with respect to the Gaussian distribution. Multiplying both sides by for some , integrating w.r.t. , and applying Fubini’s theorem gives

We perform Gaussian integration and arrive at

(15)

which finishes the proof of Eq. 13. For the second part, we consider the following standard lemma:

Lemma 3.

Let be a nonnegative valued random variable such that is finite. Then, for any , holds.

Setting , the lemma gives Eq. 14 provided that we show that . For this, let . Introduce the abbreviations , so that . Note that . By Cauchy-Schwarz,

(16)

Observe that a.s. and that by Eq. 13. Now, we have

and finally, by subadditivity of and Jensen’s inequality,

where the last inequality follows by taking . Thus, applying Lemma 3 with completes the proof.
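The change-of-measure step at the start of this proof can be sanity-checked numerically. Assuming Lemma 2 is the standard Donsker-Varadhan inequality, E_q[h] <= KL(q || p) + ln E_p[exp(h)], the sketch below evaluates the gap between the two sides for discrete distributions and confirms that it vanishes for the Gibbs distribution q proportional to p * exp(h). All names and numbers are hypothetical.

```python
import math


def kl(q, p):
    """KL(q || p) for discrete distributions with p_i > 0 wherever q_i > 0."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0.0)


def donsker_varadhan_gap(q, p, h):
    """KL(q || p) + ln E_p[exp(h)] - E_q[h]; nonnegative by Donsker-Varadhan."""
    expected_h = sum(q_i * h_i for q_i, h_i in zip(q, h))
    log_mgf = math.log(sum(p_i * math.exp(h_i) for p_i, h_i in zip(p, h)))
    return kl(q, p) + log_mgf - expected_h


def gibbs(p, h):
    """Distribution proportional to p * exp(h), which attains equality."""
    weights = [p_i * math.exp(h_i) for p_i, h_i in zip(p, h)]
    z = sum(weights)
    return [w / z for w in weights]


prior = [0.2, 0.3, 0.5]
h = [1.0, -0.5, 2.0]
posterior = [0.6, 0.3, 0.1]
print(donsker_varadhan_gap(posterior, prior, h))
print(donsker_varadhan_gap(gibbs(prior, h), prior, h))
```

The inequality holds for every posterior simultaneously, which is what lets the proof handle a data-dependent kernel after the expectation is taken under the prior.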

4.2 PAC-Bayesian Efron-Stein Inequalities

Now, we apply Theorem 3 to get concentration inequalities for classes of functions. In particular, consider the class of functions parametrized by some space , that is . Furthermore, we will assume that , where is a probability kernel from to and that where is a probability distribution over . We let be a random element that shares a common distribution with and which is independent of . Finally, we are interested in bounds on the deviation , where

which hold simultaneously for any choice of and , and which are controlled by the -dependent version of a semi-empirical Efron-Stein variance proxy

Then, by Lemma 1, is a canonical pair for any fixed . Hence, form a canonical family and the conclusions of the previous section hold for it.

Acknowledgements. We are grateful to András György for many insightful comments.

Appendix A Proofs for semi-empirical concentration inequalities

Lemma 1 (restated). is a canonical pair.

Proof.

Let stand for . The Doob martingale decomposition of gives

where and the last equality follows from the elementary identity .

Observe that

and where . Assume for now that for , the inequalities

(17)

hold. Then, using an argument similar to the proof of McDiarmid’s inequality, we get

Thus, it remains to prove Eq. 17. For this, fix and introduce a Rademacher variable such that and is independent of . Let . Using that , we get