# Novel Change of Measure Inequalities and PAC-Bayesian Bounds

## Abstract

PAC-Bayesian theory has received growing attention in the machine learning community. Our work extends PAC-Bayesian theory by introducing several novel change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. First, we show how the variational representation for $f$-divergences leads to novel change of measure inequalities. Second, we propose a multiplicative change of measure inequality for $\alpha$-divergences, which leads to tighter bounds under some technical conditions. Finally, we present several PAC-Bayesian bounds for various classes of random variables, by using our novel change of measure inequalities.

## 1 Introduction

The Probably Approximately Correct (PAC) learning framework was introduced by [15] and [7]. This framework allows us to evaluate the efficiency of machine learning algorithms, and several extensions have been proposed to date (see e.g., [13, 8, 9, 14, 1]). The core of these theoretical results is summarized by a change of measure inequality. A change of measure inequality is an expectation inequality involving two probability measures, where the expectation with respect to one measure is upper-bounded by the divergence between the two measures and the moments with respect to the other measure. The change of measure inequality also plays a major role in information theory. For instance, [5] derived robust uncertainty quantification bounds for statistical estimators of interest with change of measure inequalities.

There exist change of measure inequalities based on specific divergences, such as the KL-divergence [7], the Rényi divergence [2] and the $\chi^2$ divergence [6, 3]. The aforementioned works were proposed for specific purposes. A comprehensive study on change of measure inequalities has not been performed yet. Our work proposes several novel and general change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. It is a well-known fact that the $f$-divergence can be variationally characterized as the maximum of an optimization problem rooted in the convexity of the function $f$. This variational representation has been recently used in various applications of information theory, such as $f$-divergence estimation [10] and quantification of the bias in adaptive data analysis [4]. Recently, [12] showed that the variational representation of the $f$-divergence can be tightened when constrained to the space of probability densities.

Our main contributions are as follows:

• By using the variational representation for $f$-divergences, we derive several change of measure inequalities. We perform the analysis for the constrained regime (to the space of probability densities) as well as the unconstrained regime.

• We present a multiplicative change of measure inequality for the family of $\alpha$-divergences. This generalizes the previous results [3] for one of the members of the $\alpha$-divergence family, i.e., the $\chi^2$ divergence.

• We also generalize prior results for the Hammersley-Chapman-Robbins inequality [6] from the particular $\chi^2$ divergence to the family of $\alpha$-divergences.

• We provide new PAC-Bayesian bounds from our novel change of measure inequalities for bounded, sub-Gaussian, sub-exponential and bounded-variance loss functions. Our results pertain to important machine learning prediction problems, such as regression, classification and structured prediction.

## 2 Change of Measure Inequalities

In this section, we formalize the definition of $f$-divergences and present the constrained representation (to the space of probability measures) as well as the unconstrained representation. Then, we provide different change of measure inequalities for several divergences. We also provide multiplicative bounds as well as a generalized Hammersley-Chapman-Robbins bound. Table 1 summarizes our results.

### 2.1 Change of Measure Inequality from the Variational Representation of f-divergences

Let $f$ be a convex function. The convex conjugate $f^*$ of $f$ is defined by:

$$f^*(y) = \sup_{x \in \mathbb{R}} \bigl(xy - f(x)\bigr). \tag{1}$$

The definition of $f^*$ yields the following Young-Fenchel inequality

$$f(x) \ge xy - f^*(y)$$

which holds for any $x$ and $y$.

Using the notation of convex conjugates, the $f$-divergence and its variational representation are defined as follows.

###### Definition 1 (f-divergence and its variational representation).

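Let $\mathcal{H}$ be any arbitrary domain. Let $P$ and $Q$ denote probability measures over the Borel $\sigma$-field on $\mathcal{H}$. Additionally, let $f$ be a convex and lower semi-continuous function that satisfies $f(1) = 0$. Let $\phi : \mathcal{H} \to \mathbb{R}$ be a real-valued function. Given such functions, the $f$-divergence from $P$ to $Q$ is defined as

$$D_f(Q \| P) := \mathbb{E}_P\left[f\left(\frac{dQ}{dP}\right)\right] = \sup_{\phi} \; \mathbb{E}_Q[\phi] - \mathbb{E}_P[f^*(\phi)]$$

For simplicity, here and in what follows, we denote $\mathbb{E}_P[\cdot] \equiv \mathbb{E}_{h \sim P}[\cdot]$ and $\mathbb{E}_Q[\cdot] \equiv \mathbb{E}_{h \sim Q}[\cdot]$. Many common divergences, such as the KL-divergence and the Hellinger divergence, are members of the family of $f$-divergences, coinciding with a particular choice of $f$. Table 2 presents the definition of each divergence with the corresponding generator $f$.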

Based on Definition 1, we can obtain the following inequality:

$$\begin{aligned}
\mathbb{E}_Q[\phi] &\le D_f(Q \| P) + \mathbb{E}_P[f^*(\phi)] \\
&\le D_f(Q \| P) + \sup_{p} \; \mathbb{E}_P[\phi p] - \mathbb{E}_P[f(p)]
\end{aligned}$$

[12] shows that this variational representation for $f$-divergences can be tightened. The famous Donsker-Varadhan representation for the KL-divergence, which is used in most PAC-Bayesian bounds, can actually be derived from this tighter representation.

###### Theorem 1 (Change of measure inequality from the constrained variational representation for f-divergences [12]).

Let $f$, $P$, $Q$, and $\phi$ be defined as in Definition 1. Let $\Delta(\mu)$ denote the space of probability densities with respect to $\mu$, where the norm is defined as $\|g\|_1 := \int_{\mathcal{H}} |g| \, d\mu$, given a measure $\mu$ over $\mathcal{H}$. The general form of the change of measure inequality for $f$-divergences is given by

$$\begin{aligned}
\mathbb{E}_Q[\phi] &\le D_f(Q \| P) + (\mathbb{I}^R_{f,P})^*(\phi) \\
(\mathbb{I}^R_{f,P})^*(\phi) &= \sup_{p \in \Delta(P)} \mathbb{E}_P[\phi p] - \mathbb{E}_P[f(p)]
\end{aligned}$$

where $p$ is constrained to be a probability density function.

However, it is not always easy to find a closed-form solution for Theorem 1, since doing so requires variational calculus, and in some cases no closed-form solution exists. In such cases, we can use the following corollary, which yields looser bounds but only requires finding a convex conjugate.

###### Corollary 1 (Change of measure inequality from the unconstrained variational representation for f-divergences).

Let $f$, $P$, $Q$ and $\phi$ be defined as in Theorem 1. By Definition 1, we have

$$\mathbb{E}_Q[\phi] \le D_f(Q \| P) + \mathbb{E}_P[f^*(\phi)]$$

###### Proof.

Removing the supremum and rearranging the terms in Definition 1, we prove our claim. ∎

By choosing a suitable function $f$ and deriving the constrained maximization term with the help of variational calculus, we can create an upper bound based on the corresponding divergence $D_f(Q \| P)$. First, we discuss the case of the $\chi^2$ divergence.

###### Lemma 2 (Change of measure inequality from the constrained representation of the χ2-divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \chi^2(Q \| P) + \mathbb{E}_P[\phi] + \frac{1}{4}\mathrm{Var}_P[\phi]$$

###### Proof.

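By Theorem 1, we want to find

$$(\mathbb{I}^R_{f,P})^*(\phi) = \sup_{p \in \Delta(P)} \int_{\mathcal{H}} \phi p - (p-1)^2 \, dP = \sup_{p \in \Delta(P)} \mathbb{E}_P\left[\phi p - (p-1)^2\right] \tag{2}$$

where $\phi$ is a measurable function on $\mathcal{H}$. In order to find the supremum on the right hand side, we consider the following Lagrangian:

$$\begin{aligned}
L(p, \lambda) &= \mathbb{E}_P\left[\phi p - (p-1)^2\right] + \lambda\left(\mathbb{E}_P[p] - 1\right) \\
&= \int_{\mathcal{H}} \phi p - (p-1)^2 \, dP + \lambda\left(\int_{\mathcal{H}} p \, dP - 1\right)
\end{aligned}$$

where $\lambda \in \mathbb{R}$ and $p$ is constrained to be a probability density over $\mathcal{H}$. Taking the functional derivative with respect to $p$, dropping the common factor $dP$ that multiplies all terms, and setting the result to zero, we have $\phi - 2(p-1) + \lambda = 0$. Thus, we have $p = \frac{\phi + \lambda}{2} + 1$. Since $\mathbb{E}_P[p]$ is constrained to be $1$, we have $\mathbb{E}_P\left[\frac{\phi + \lambda}{2} + 1\right] = 1$. Then, we obtain $\lambda = -\mathbb{E}_P[\phi]$ and the optimum $p = \frac{\phi - \mathbb{E}_P[\phi]}{2} + 1$. Plugging it in Equation (2), we have

$$(\mathbb{I}^R_{f,P})^*(\phi) = \mathbb{E}_P\left[\phi\left(\frac{\phi - \mathbb{E}_P[\phi]}{2} + 1\right) - \left(\frac{\phi - \mathbb{E}_P[\phi]}{2}\right)^2\right]$$

which simplifies to $\mathbb{E}_P[\phi] + \frac{1}{4}\mathrm{Var}_P[\phi]$. ∎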
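The optimization in this proof can be sanity-checked numerically on a small discrete space. The sketch below is only an illustration: the four-point measure `P` and the function `phi` are hypothetical values, chosen so that the stationary point $p^* = \frac{\phi - \mathbb{E}_P[\phi]}{2} + 1$ stays nonnegative. It evaluates the objective of Equation (2) at $p^*$, compares it with $\mathbb{E}_P[\phi] + \frac{1}{4}\mathrm{Var}_P[\phi]$, and checks that randomly drawn feasible densities do not exceed it.

```python
import random

def expect(weights, values):
    """Expectation of `values` under the discrete measure `weights`."""
    return sum(w * v for w, v in zip(weights, values))

def objective(P, phi, p):
    """E_P[phi * p - (p - 1)^2], the quantity maximized in Equation (2)."""
    return sum(w * (v * d - (d - 1.0) ** 2) for w, v, d in zip(P, phi, p))

# Hypothetical four-point example; phi is kept small so that p_star >= 0.
P = [0.1, 0.2, 0.3, 0.4]
phi = [0.5, -0.2, 0.1, 0.3]

mean_phi = expect(P, phi)
var_phi = expect(P, [(v - mean_phi) ** 2 for v in phi])

# Stationary point of the Lagrangian: p* = (phi - E_P[phi]) / 2 + 1.
p_star = [(v - mean_phi) / 2.0 + 1.0 for v in phi]
value_at_optimum = objective(P, phi, p_star)
closed_form = mean_phi + var_phi / 4.0

# Random feasible densities (p >= 0, E_P[p] = 1) should never beat p*.
rng = random.Random(0)
best_random = float("-inf")
for _ in range(2000):
    raw = [rng.random() + 1e-9 for _ in P]
    scale = expect(P, raw)
    candidate = [r / scale for r in raw]
    best_random = max(best_random, objective(P, phi, candidate))
```

Because the objective is concave in $p$ and the constraint is linear, a feasible stationary point is a global maximizer, which is why the random search cannot improve on it.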

The bound in Lemma 2 is slightly tighter than the one without the constraint. The change of measure inequality without the constraint is given as follows.

###### Lemma 3 (Change of measure inequality from the unconstrained representation of the Pearson χ2-divergence).

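Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \chi^2(Q \| P) + \mathbb{E}_P[\phi] + \frac{1}{4}\mathbb{E}_P[\phi^2]$$

###### Proof.

Notice that the $\chi^2$-divergence is obtained by setting $f(x) = (x-1)^2$. In order to find the convex conjugate $f^*(y)$ from Equation (1), let us denote $g(x) = xy - (x-1)^2$. We need to find the supremum of $g(x)$ with respect to $x$. By differentiating with respect to $x$ and setting the derivative to zero, we have $x = \frac{y}{2} + 1$. Thus, plugging $x = \frac{y}{2} + 1$ into $g(x)$, we obtain $f^*(y) = y + \frac{y^2}{4}$. Plugging $f(x)$ and $f^*(y)$ in Corollary 1, we prove our claim. ∎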
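The conjugate computed in this proof can be checked by brute force. The following sketch approximates $f^*(y) = \sup_x (xy - f(x))$ for $f(x) = (x-1)^2$ on a finite grid and compares it with the closed form $y + \frac{y^2}{4}$. It also tests the Young-Fenchel inequality on a few points. The grid range, step count and test points are arbitrary choices made for illustration.

```python
def f(x):
    """Generator of the Pearson chi-square divergence."""
    return (x - 1.0) ** 2

def conjugate_on_grid(func, y, lo=-10.0, hi=10.0, steps=20001):
    """Approximate f*(y) = sup_x (x*y - f(x)) by a grid search on [lo, hi]."""
    best = float("-inf")
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        best = max(best, x * y - func(x))
    return best

def closed_form_conjugate(y):
    """Closed form derived in the proof: f*(y) = y + y^2 / 4."""
    return y + y * y / 4.0

ys = [-3.0, -1.0, 0.0, 0.5, 2.0]
conjugate_gaps = [abs(conjugate_on_grid(f, y) - closed_form_conjugate(y)) for y in ys]

# Young-Fenchel: f(x) >= x*y - f*(y) for every pair (x, y).
young_fenchel_ok = all(
    f(x) >= x * y - closed_form_conjugate(y) - 1e-12
    for x in [-2.0, 0.0, 0.3, 1.0, 4.0]
    for y in ys
)
```

A grid search can only underestimate a supremum, so a small gap here indicates that the maximizer $x = \frac{y}{2} + 1$ lies inside the grid and that the closed form is consistent with Equation (1).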

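As might be apparent, the bound in Lemma 2 is tighter than the one in Lemma 3 by $\frac{1}{4}(\mathbb{E}_P[\phi])^2$ because $\mathrm{Var}_P[\phi] = \mathbb{E}_P[\phi^2] - (\mathbb{E}_P[\phi])^2 \le \mathbb{E}_P[\phi^2]$.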
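To make the comparison between the constrained and unconstrained $\chi^2$ bounds concrete, the sketch below evaluates both on a small discrete example. The measures `P`, `Q` and the function `phi` are hypothetical values chosen for illustration only.

```python
def expect(weights, values):
    """Expectation of `values` under the discrete measure `weights`."""
    return sum(w * v for w, v in zip(weights, values))

def chi2(Q, P):
    """Pearson chi-square divergence: sum over points of (q - p)^2 / p."""
    return sum((q - p) ** 2 / p for q, p in zip(Q, P))

# Hypothetical discrete example.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.4, 0.3, 0.2, 0.1]
phi = [1.0, 0.5, -0.5, 2.0]

lhs = expect(Q, phi)
mean_p = expect(P, phi)
second_moment = expect(P, [v * v for v in phi])
var_p = second_moment - mean_p ** 2

bound_constrained = chi2(Q, P) + mean_p + var_p / 4.0            # Lemma 2
bound_unconstrained = chi2(Q, P) + mean_p + second_moment / 4.0  # Lemma 3
gap = bound_unconstrained - bound_constrained
```

In this example `gap` equals `mean_p ** 2 / 4`, so the two bounds coincide exactly when $\mathbb{E}_P[\phi] = 0$.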

Next, we discuss the case of the total variation divergence.

###### Lemma 4 (Change of measure inequality from the constrained representation of the total variation divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le TV(Q \| P) + \mathbb{E}_P[\phi]$$

for $\phi \in [0, 1]$.

###### Proof.

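By Theorem 1, we want to find

$$(\mathbb{I}^R_{f,P})^*(\phi) = \sup_{p \in \Delta(P)} \mathbb{E}_P\left[\phi p - \frac{1}{2}|p - 1|\right]$$

In order to find the supremum on the right hand side, we consider the following Lagrangian

$$L(p, \lambda) = \mathbb{E}_P\left[\phi p - \frac{1}{2}|p - 1|\right] + \lambda\left(\mathbb{E}_P[p] - 1\right) =
\begin{cases}
\mathbb{E}_P\left[\phi p - \frac{1}{2}p + \frac{1}{2}\right] + \lambda\left(\mathbb{E}_P[p] - 1\right) & \text{if } 1 \le p \\
\mathbb{E}_P\left[\phi p + \frac{1}{2}p - \frac{1}{2}\right] + \lambda\left(\mathbb{E}_P[p] - 1\right) & \text{if } 0 \le p < 1
\end{cases}$$

Then, it is not hard to see that if $|\phi| \le \frac{1}{2}$, $L$ is maximized at $p = 1$, otherwise $L \to \infty$ as $p \to \infty$. Thus, Lemma 4 holds for $|\phi| \le \frac{1}{2}$. If we add $\frac{1}{2}$ to both sides, then $\phi + \frac{1}{2}$ is bounded between 0 and 1. Since shifting $\phi$ by a constant changes $\mathbb{E}_Q[\phi]$ and $\mathbb{E}_P[\phi]$ by the same amount, the inequality holds for $\phi \in [0, 1]$, as we claimed in the lemma. ∎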

Interestingly, we can obtain the same bound on the total variation divergence even if we use the unconstrained variational representation.
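The total variation bound is easy to probe numerically. The sketch below, which uses hypothetical discrete measures `P` and `Q`, samples random functions with values in $[0, 1]$ and records the smallest slack $TV(Q \| P) + \mathbb{E}_P[\phi] - \mathbb{E}_Q[\phi]$. It also evaluates the indicator of the set where $Q$ puts more mass than $P$, for which the inequality holds with equality. The factor $\frac{1}{2}$ in `total_variation` matches the generator $\frac{1}{2}|p - 1|$ used in the proof above.

```python
import random

def expect(weights, values):
    """Expectation of `values` under the discrete measure `weights`."""
    return sum(w * v for w, v in zip(weights, values))

def total_variation(Q, P):
    """TV(Q || P) = (1/2) * sum |q - p|, matching the generator |x - 1| / 2."""
    return 0.5 * sum(abs(q - p) for q, p in zip(Q, P))

# Hypothetical discrete example.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.4, 0.3, 0.2, 0.1]
tv = total_variation(Q, P)

# Smallest slack over random phi with values in [0, 1].
rng = random.Random(1)
worst_slack = min(
    tv + expect(P, phi) - expect(Q, phi)
    for phi in ([rng.random() for _ in P] for _ in range(1000))
)

# The indicator of {q > p} attains the bound.
indicator = [1.0 if q > p else 0.0 for q, p in zip(Q, P)]
tight_slack = tv + expect(P, indicator) - expect(Q, indicator)
```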

Next, we state our result for $\alpha$-divergences.

###### Lemma 5 (Change of measure inequality from the unconstrained representation of the α-divergence).

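Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le D_\alpha(Q \| P) + \frac{(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha}\mathbb{E}_P\left[\phi^{\frac{\alpha}{\alpha-1}}\right] + \frac{1}{\alpha(\alpha-1)}$$

###### Proof.

The corresponding convex function for the $\alpha$-divergence is defined as $f(x) = \frac{x^\alpha - 1}{\alpha(\alpha-1)}$. Applying the same procedure as in Lemma 3, we get the convex conjugate $f^*(y) = \frac{(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha} y^{\frac{\alpha}{\alpha-1}} + \frac{1}{\alpha(\alpha-1)}$. Plugging $f(x)$ and $f^*(y)$ in Corollary 1, we prove our claim. ∎

We can obtain the bounds based on the squared Hellinger divergence $H^2(Q \| P)$, the reverse KL-divergence $\overline{\mathrm{KL}}(Q \| P)$ and the Neyman $\chi^2$-divergence $\overline{\chi^2}(Q \| P)$ in a similar fashion.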

###### Lemma 6 (Change of measure inequality from the unconstrained representation of the squared Hellinger divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le H^2(Q \| P) + \mathbb{E}_P\left[\frac{1}{(\phi-1)^2}\right] + 1$$

###### Proof.

The corresponding convex function $f(x)$ for the squared Hellinger divergence is the generator listed in Table 2. Applying the same procedure as in Lemma 3, we get the convex conjugate $f^*(y)$. Plugging $f(x)$ and $f^*(y)$ in Corollary 1, we prove our claim. ∎

Similarly, we obtain the following bound for the reverse-KL divergence.

###### Lemma 7 (Change of measure inequality from the unconstrained representation of the reverse KL-divergence).

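Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \overline{\mathrm{KL}}(Q \| P) + \mathbb{E}_P\left[\log\left(\frac{1}{1-\phi}\right)\right]$$

where $\phi < 1$.

###### Proof.

The corresponding convex function for the reverse KL-divergence is defined as $f(x) = -\log(x)$. Applying the same procedure as in Lemma 3, we get the convex conjugate $f^*(y) = \log\left(-\frac{1}{y}\right) - 1$. Plugging $f(x)$ and $f^*(y)$ in Corollary 1, we have

$$\mathbb{E}_Q[\psi] \le \overline{\mathrm{KL}}(Q \| P) + \mathbb{E}_P\left[\log\left(-\frac{1}{\psi}\right)\right] - 1$$

where $\psi < 0$. Letting $\psi = \phi - 1$, we prove our claim. ∎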

Finally, we prove our result for the Neyman divergence based on a similar approach.

###### Lemma 8 (Change of measure inequality from the unconstrained representation of the Neyman χ2-divergence).

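Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \overline{\chi^2}(Q \| P) + 2 - 2\,\mathbb{E}_P\left[\sqrt{1-\phi}\right]$$

where $\phi < 1$.

###### Proof.

The corresponding convex function for the Neyman $\chi^2$-divergence is defined as $f(x) = \frac{(x-1)^2}{x}$. Applying the same procedure as in Lemma 3, we get the convex conjugate $f^*(y) = 2 - 2\sqrt{1-y}$. Plugging $f(x)$ and $f^*(y)$ in Corollary 1, we prove our claim. ∎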
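The conjugates behind the reverse KL and Neyman bounds can be checked by a grid search over $x > 0$. The sketch below assumes the standard generators $f(x) = -\log x$ and $f(x) = \frac{(x-1)^2}{x}$. These are assumptions in the sense that they are inferred from the stated bounds rather than read from Table 2. The grid range and test values of $y$ are arbitrary.

```python
import math

def conjugate_on_grid(func, y, hi=10.0, steps=100000):
    """Approximate sup over x in (0, hi] of x*y - func(x) by a grid search."""
    best = float("-inf")
    for i in range(1, steps + 1):
        x = hi * i / steps
        best = max(best, x * y - func(x))
    return best

def reverse_kl_generator(x):
    return -math.log(x)

def reverse_kl_conjugate(y):
    """Closed form for y < 0: log(-1/y) - 1."""
    return math.log(-1.0 / y) - 1.0

def neyman_generator(x):
    return (x - 1.0) ** 2 / x

def neyman_conjugate(y):
    """Closed form for y < 1: 2 - 2*sqrt(1 - y)."""
    return 2.0 - 2.0 * math.sqrt(1.0 - y)

reverse_kl_gaps = [
    abs(conjugate_on_grid(reverse_kl_generator, y) - reverse_kl_conjugate(y))
    for y in [-4.0, -2.0, -0.5]
]
neyman_gaps = [
    abs(conjugate_on_grid(neyman_generator, y) - neyman_conjugate(y))
    for y in [-3.0, 0.0, 0.75]
]
```

The test values are chosen so that each maximizer, $x = -\frac{1}{y}$ for the reverse KL case and $x = \frac{1}{\sqrt{1-y}}$ for the Neyman case, falls well inside the grid.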

### 2.2 Multiplicative Change of Measure Inequality for α-divergences

First, we state a known result for the $\chi^2$ divergence.

###### Lemma 9 (Multiplicative change of measure inequality for the χ2-divergence [3]).

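Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \sqrt{\left(\chi^2(Q \| P) + 1\right)\mathbb{E}_P[\phi^2]}$$

First, we note that the $\chi^2$ divergence is an $\alpha$-divergence for $\alpha = 2$. Next, we generalize the above bound for any $\alpha$-divergence.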

###### Lemma 10 (Multiplicative change of measure inequality for the α-divergence).

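Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For any admissible $\alpha$, we have

$$\mathbb{E}_Q[\phi] \le \left(\alpha(\alpha-1)D_\alpha(Q \| P) + 1\right)^{\frac{1}{\alpha}}\left(\mathbb{E}_P\left[|\phi|^{\frac{\alpha}{\alpha-1}}\right]\right)^{\frac{\alpha-1}{\alpha}}$$

###### Proof.

$$\begin{aligned}
\mathbb{E}_Q[\phi] &= \int_{\mathcal{H}} \phi \, dQ \\
&= \int_{\mathcal{H}} \phi \frac{dQ}{dP} \, dP \\
&\le \left(\int_{\mathcal{H}} \left|\frac{dQ}{dP}\right|^{\alpha} dP\right)^{\frac{1}{\alpha}}\left(\int_{\mathcal{H}} |\phi|^{\frac{\alpha}{\alpha-1}} \, dP\right)^{\frac{\alpha-1}{\alpha}} \\
&= \left(\alpha(\alpha-1)D_\alpha(Q \| P) + 1\right)^{\frac{1}{\alpha}}\left(\mathbb{E}_P\left[|\phi|^{\frac{\alpha}{\alpha-1}}\right]\right)^{\frac{\alpha-1}{\alpha}}
\end{aligned}$$

The third line is due to Hölder's inequality. ∎

By choosing $\alpha = 2$, we have the same bound as Lemma 9 where $\chi^2(Q \| P) = 2D_2(Q \| P)$.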
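The multiplicative bound can be evaluated directly on a discrete space. The sketch below assumes the generator $f(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}$, which is inferred from the identity $\int |dQ/dP|^\alpha \, dP = \alpha(\alpha-1)D_\alpha(Q \| P) + 1$ used in the proof. It restricts attention to $\alpha > 1$, where Hölder's inequality applies with conjugate exponents $\alpha$ and $\frac{\alpha}{\alpha-1}$. The measures and `phi` are hypothetical.

```python
import math

def expect(weights, values):
    """Expectation of `values` under the discrete measure `weights`."""
    return sum(w * v for w, v in zip(weights, values))

def alpha_divergence(Q, P, alpha):
    """D_alpha(Q || P) = E_P[((dQ/dP)^alpha - 1) / (alpha * (alpha - 1))]."""
    total = sum(p * ((q / p) ** alpha - 1.0) for q, p in zip(Q, P))
    return total / (alpha * (alpha - 1.0))

def multiplicative_bound(Q, P, phi, alpha):
    """Right-hand side of the multiplicative inequality, for alpha > 1."""
    conj = alpha / (alpha - 1.0)  # Hoelder conjugate exponent of alpha
    density_term = (alpha * (alpha - 1.0) * alpha_divergence(Q, P, alpha) + 1.0) ** (1.0 / alpha)
    moment_term = expect(P, [abs(v) ** conj for v in phi]) ** (1.0 / conj)
    return density_term * moment_term

# Hypothetical discrete example.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.4, 0.3, 0.2, 0.1]
phi = [1.0, 0.5, -0.5, 2.0]

lhs = expect(Q, phi)
bounds = {alpha: multiplicative_bound(Q, P, phi, alpha) for alpha in (1.5, 2.0, 3.0)}

# For alpha = 2 the bound reduces to sqrt((chi2 + 1) * E_P[phi^2]).
chi2 = sum((q - p) ** 2 / p for q, p in zip(Q, P))
lemma9_bound = math.sqrt((chi2 + 1.0) * expect(P, [v * v for v in phi]))
```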

By Lemma 10, we can obtain the change of measure inequality for specific cases of the $\alpha$-divergence. Next we provide the result for the squared Hellinger divergence.

###### Lemma 11 (Multiplicative change of measure inequality for the squared Hellinger divergence).

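Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \left(1 - H^2(Q \| P)\right)^2 \mathbb{E}_P\left[\left|\frac{1}{\phi}\right|\right]$$

###### Proof.

Letting $\alpha = \frac{1}{2}$ in Lemma 10 and using $H^2(Q \| P) = \frac{1}{4}D_{1/2}(Q \| P)$ completes the proof. ∎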

### 2.3 A Generalized Hammersley-Chapman-Robbins (HCR) Inequality

The HCR inequality is a famous information theoretic inequality for the $\chi^2$-divergence.

###### Lemma 12 (HCR inequality [6]).

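Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \mathbb{E}_P[\phi] + \sqrt{\mathrm{Var}_P[\phi]\,\chi^2(Q \| P)}$$

First, we note that the $\chi^2$ divergence is an $\alpha$-divergence for $\alpha = 2$. Next, we generalize the above bound for any $\alpha$-divergence.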

###### Lemma 13 (The generalization of HCR inequality.).

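Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For any admissible $\alpha$, we have

$$\mathbb{E}_Q[\phi] \le \mathbb{E}_P[\phi] + \left(\alpha(\alpha-1)D_\alpha(Q \| P) + 1\right)^{\frac{1}{\alpha}}\left(\mathbb{E}_P\left[|\phi - \mu_P|^{\frac{\alpha}{\alpha-1}}\right]\right)^{\frac{\alpha-1}{\alpha}}$$

where $\mu_P = \mathbb{E}_P[\phi]$.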

###### Proof.

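Consider the covariance of $\phi$ and $\frac{dQ}{dP}$.

$$\mathrm{Cov}_P\left(\phi, \frac{dQ}{dP}\right) = \int_{\mathcal{H}} \phi \frac{dQ}{dP} \, dP - \int_{\mathcal{H}} \phi \, dP \int_{\mathcal{H}} \frac{dQ}{dP} \, dP = \mathbb{E}_Q[\phi] - \mathbb{E}_P[\phi]$$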

On the other hand,

 (3)

which completes the proof. ∎

By choosing $\alpha = 2$, we obtain

$$\mathbb{E}_Q[\phi] \le \mathbb{E}_P[\phi] + \sqrt{\mathrm{Var}_P[\phi]\left(\chi^2(Q \| P) + 1\right)}$$

We can immediately see that this bound is quite similar to, and slightly looser than, the HCR inequality. This is because the third step in the derivation of Equation (3) loosens the bound relative to the HCR inequality.
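The relation between the HCR inequality and its $\alpha = 2$ generalization can be seen on a small example. The sketch below uses hypothetical discrete measures and computes both right-hand sides. The only difference between them is the extra $+1$ under the square root.

```python
import math

def expect(weights, values):
    """Expectation of `values` under the discrete measure `weights`."""
    return sum(w * v for w, v in zip(weights, values))

def chi2(Q, P):
    """Pearson chi-square divergence: sum over points of (q - p)^2 / p."""
    return sum((q - p) ** 2 / p for q, p in zip(Q, P))

# Hypothetical discrete example.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.4, 0.3, 0.2, 0.1]
phi = [1.0, 0.5, -0.5, 2.0]

lhs = expect(Q, phi)
mean_p = expect(P, phi)
var_p = expect(P, [(v - mean_p) ** 2 for v in phi])

hcr_bound = mean_p + math.sqrt(var_p * chi2(Q, P))
generalized_bound = mean_p + math.sqrt(var_p * (chi2(Q, P) + 1.0))
```

When $Q$ is close to $P$, $\chi^2(Q \| P)$ is near zero and the HCR bound approaches $\mathbb{E}_P[\phi]$, while the generalized bound stays at least $\sqrt{\mathrm{Var}_P[\phi]}$ above it.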

Finally, we present a result for the squared Hellinger divergence.

###### Lemma 14 (Change of measure inequality for the squared Hellinger divergence from Lemma 13).

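Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\mathbb{E}_Q[\phi] \le \mathbb{E}_P[\phi] + \left(1 - H^2(Q \| P)\right)^2 \mathbb{E}_P\left[\left|\frac{1}{\phi - \mu_P}\right|\right]$$

where $\mu_P = \mathbb{E}_P[\phi]$.

###### Proof.

Letting $\alpha = \frac{1}{2}$ in Lemma 13 and using $H^2(Q \| P) = \frac{1}{4}D_{1/2}(Q \| P)$ completes the proof. ∎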

## 3 Applicability to PAC-Bayesian Theory

In this section, we explore the applicability of our change of measure inequalities. We consider an arbitrary input space $\mathcal{X}$ and an output space $\mathcal{Y}$. The samples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are input-output pairs. Each example $(x, y)$ is drawn according to a fixed, but unknown, distribution $D$ on $\mathcal{X} \times \mathcal{Y}$. Let $\ell$ denote a generic loss function. The risk $R_D(h)$ of any predictor $h : \mathcal{X} \to \mathcal{Y}$ is defined as the expected loss induced by samples drawn according to $D$. Given a training set $S$ of $m$ samples, the empirical risk $R_S(h)$ of any predictor $h$ is defined by the empirical average of the loss. That is

$$\begin{aligned}
R_D(h) &= \mathbb{E}_{(x,y) \sim D}\, \ell(h(x), y) \\
R_S(h) &= \frac{1}{|S|}\sum_{(x,y) \in S} \ell(h(x), y)
\end{aligned}$$

In the PAC-Bayesian framework, we consider a hypothesis space $\mathcal{H}$ of predictors, a prior distribution $P$ on $\mathcal{H}$, and a posterior distribution $Q$ on $\mathcal{H}$. The prior is specified before exploiting the information contained in $S$, while the posterior is obtained by running a learning algorithm on $S$. The PAC-Bayesian theory usually studies the stochastic Gibbs predictor $G_Q$. Given a distribution $Q$ on $\mathcal{H}$, $G_Q$ predicts an example $x$ by drawing a predictor $h$ according to $Q$, and returning $h(x)$. The risk of $G_Q$ is then defined as follows. For any probability distribution $Q$ on a set of predictors, the Gibbs risk $R_D(G_Q)$ is the expected risk of the Gibbs predictor $G_Q$ relative to $D$. Hence,

$$R_D(G_Q) = \mathbb{E}_{(x,y) \sim D}\, \mathbb{E}_{h \sim Q}\, \ell(h(x), y) \tag{4}$$

Usual PAC-Bayesian bounds give guarantees on the generalization risk $R_D(G_Q)$. Typically, these bounds rely on the empirical risk $R_S(G_Q)$ defined as follows.

$$R_S(G_Q) = \frac{1}{|S|}\sum_{(x,y) \in S} \mathbb{E}_{h \sim Q}\, \ell(h(x), y) \tag{5}$$
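For a finite hypothesis set, the empirical Gibbs risk in Equation (5) is a short computation. The sketch below is purely illustrative: the threshold classifiers, the posterior weights `Q`, the sample `S` and the 0-1 loss are hypothetical choices, not part of the setting above. It also confirms that averaging over $h \sim Q$ inside the sum gives the same value as the $Q$-weighted average of the individual empirical risks.

```python
def zero_one_loss(prediction, label):
    """0-1 loss: 1 for a mistake, 0 otherwise."""
    return 0.0 if prediction == label else 1.0

def empirical_risk(h, sample, loss):
    """R_S(h): average loss of a single predictor on the sample."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def empirical_gibbs_risk(hypotheses, Q, sample, loss):
    """R_S(G_Q): average over the sample of the Q-expected loss."""
    total = 0.0
    for x, y in sample:
        total += sum(q * loss(h(x), y) for h, q in zip(hypotheses, Q))
    return total / len(sample)

# Hypothetical finite hypothesis set: threshold classifiers on the real line.
hypotheses = [lambda x, c=c: int(x >= c) for c in (0.25, 0.5, 0.75)]
Q = [0.2, 0.5, 0.3]  # posterior weights over the three thresholds
S = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 1), (0.8, 1), (0.9, 1)]

gibbs = empirical_gibbs_risk(hypotheses, Q, S, zero_one_loss)
mixture = sum(q * empirical_risk(h, S, zero_one_loss) for h, q in zip(hypotheses, Q))
```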

Next, we present novel PAC-Bayesian bounds based on various change of measure inequalities depending on different assumptions on the loss function $\ell$. Our results pertain to important machine learning prediction problems, such as regression, classification and structured prediction.

### 3.1 Bounded Loss Function

First, let us assume that the loss function is bounded, i.e., $\ell(\cdot, \cdot) \in [0, R]$ for $R > 0$. For this scenario, we cannot use the change of measure inequalities with the total variation, reverse KL and Neyman divergences anymore because they require $\phi$ to lie in a restricted range (see Lemmas 4, 7 and 8).

###### Corollary 2 (The PAC-Bayesian bounds for bounded loss function).

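Let $P$ be a fixed prior distribution over a hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over a hypothesis space $\mathcal{H}$, let $R_D(G_Q)$ and $R_S(G_Q)$ be the Gibbs risk and the empirical Gibbs risk as in Equations (4) and (5) respectively. For the sample size $m$ and $t > 0$, with probability at least $1 - \delta$, simultaneously for all posterior distributions $Q$, we have

$$\begin{aligned}
R_D(G_Q) &\le R_S(G_Q) + \frac{1}{t}\mathrm{KL}(Q \| P) + R\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \\
R_D(G_Q) &\le R_S(G_Q) + \frac{1}{t}\chi^2(Q \| P) + R\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} + \frac{R^2 t}{8m}\log\frac{2}{\delta} \\
R_D(G_Q) &\le R_S(G_Q) + R\sqrt{\left(\chi^2(Q \| P) + 1\right)\frac{1}{2m}\log\frac{2}{\delta}}
\end{aligned}$$

###### Proof.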

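Suppose that we have a convex function $\Delta$ that measures the discrepancy between the observed empirical Gibbs risk $R_S(G_Q)$ and the true Gibbs risk $R_D(G_Q)$ on distribution $Q$. Given that, the purpose of the PAC-Bayesian theorem is to upper-bound the discrepancy $t\Delta(R_D(G_Q), R_S(G_Q))$ for any $t > 0$. Let $\phi_D(h) := t\Delta(R_D(h), R_S(h))$, where the subscript of $\phi_D$ shows the dependency on the data distribution $D$. Let $\Delta(q, p) = |q - p|$. By applying Jensen's inequality as the first step,

$$\begin{aligned}
t\Delta(R_D(G_Q), R_S(G_Q)) &= t\Delta\left(\mathbb{E}_{h \sim Q} R_D(h), \mathbb{E}_{h \sim Q} R_S(h)\right) \\
&\le \mathbb{E}_{h \sim Q}\, t\Delta(R_D(h), R_S(h)) \\
&= \mathbb{E}_{h \sim Q}\, \phi_D(h)
\end{aligned} \tag{6}$$

By Hoeffding's inequality, for any $\epsilon > 0$ and any $h \in \mathcal{H}$,

$$\begin{aligned}
\Pr(\phi_D(h) \ge \epsilon) &= \Pr\left(|R_S(h) - R_D(h)| \ge \frac{\epsilon}{t}\right) \\
&\le 2e^{-\frac{2m\epsilon^2}{R^2 t^2}}
\end{aligned}$$

Setting $\delta = 2e^{-\frac{2m\epsilon^2}{R^2 t^2}}$, we have

$$\phi_D(h) \underset{1-\delta}{\le} Rt\sqrt{\frac{1}{2m}\log\frac{2}{\delta}}$$

The symbol $\underset{1-\delta}{\le}$ denotes that the inequality holds with probability at least $1 - \delta$. The second line of the probability bound above holds due to Hoeffding's inequality. Also note that

$$\begin{aligned}
\mathbb{E}_{h \sim P}[\phi_D(h)] &\le \sup_{(x,y) \sim D}\left\{\mathbb{E}_{h \sim P}[\phi_D(h)]\right\} \\
&\le \mathbb{E}_{h \sim P}\left[\sup_{(x,y) \sim D} \phi_D(h)\right] \\
&\underset{1-\delta}{\le} Rt\sqrt{\frac{1}{2m}\log\frac{2}{\delta}}
\end{aligned} \tag{7}$$

The second line follows from Jensen's inequality and the convexity of the supremum. Similarly,

$$\log \mathbb{E}_{h \sim P}\left[e^{\phi_D(h)}\right] \underset{1-\delta}{\le} Rt\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \tag{8}$$

Furthermore, the variance is bounded as follows.

$$\begin{aligned}
\mathrm{Var}_{h \sim P}[\phi_D(h)] &= \mathbb{E}_{h \sim P}[\phi_D(h)^2] - \mathbb{E}_{h \sim P}[\phi_D(h)]^2 \\
&\underset{1-\delta}{\le} \frac{R^2 t^2}{2m}\log\frac{2}{\delta} + 0 = \frac{R^2 t^2}{2m}\log\frac{2}{\delta}
\end{aligned} \tag{9}$$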

By applying Equations (6), (7), (8) and (9) to the Donsker-Varadhan representation, Lemma 4, Lemma 2 and Lemma 9 respectively, we prove our claims. ∎
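The three bounds of Corollary 2 are simple closed-form expressions, so they can be evaluated side by side. The sketch below transcribes them into small helper functions. The function names and the numerical inputs one would pass in (empirical Gibbs risk, divergence values, `m`, `delta`, `t`, `R`) are hypothetical and only serve as an illustration.

```python
import math

def hoeffding_radius(m, delta, R=1.0):
    """R * sqrt(log(2/delta) / (2m)), the deviation term shared by the bounds."""
    return R * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def bound_kl(emp_risk, kl, m, delta, t, R=1.0):
    """First bound of Corollary 2 (KL-divergence)."""
    return emp_risk + kl / t + hoeffding_radius(m, delta, R)

def bound_chi2_additive(emp_risk, chi2, m, delta, t, R=1.0):
    """Second bound of Corollary 2 (chi-square, additive form)."""
    extra = R * R * t * math.log(2.0 / delta) / (8.0 * m)
    return emp_risk + chi2 / t + hoeffding_radius(m, delta, R) + extra

def bound_chi2_multiplicative(emp_risk, chi2, m, delta, R=1.0):
    """Third bound of Corollary 2 (chi-square, multiplicative form)."""
    return emp_risk + R * math.sqrt((chi2 + 1.0) * math.log(2.0 / delta) / (2.0 * m))
```

As a usage example, `bound_kl(0.1, 2.0, 1000, 0.05, t=50.0)` adds `2.0 / 50.0` and the Hoeffding radius to an empirical Gibbs risk of `0.1`. Note that $t$ trades off the divergence term against the $\frac{R^2 t}{8m}\log\frac{2}{\delta}$ term in the additive $\chi^2$ bound, while the multiplicative bound has no free parameter.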

### 3.2 Sub-Gaussian Loss Function

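Next, we relax the constraints on the loss function. We consider the problem where the loss $\ell$ is sub-Gaussian. First, we define sub-Gaussianity.

###### Definition 2.

A random variable $Z$ is said to be sub-Gaussian with the expectation $\mathbb{E}[Z] = \mu$ and variance proxy $\sigma^2$ if for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}\left[e^{\lambda(Z - \mu)}\right] \le e^{\frac{\lambda^2 \sigma^2}{2}}$$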
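A standard example of a sub-Gaussian variable is a Rademacher variable, which takes the values $\pm 1$ with probability $\frac{1}{2}$ each. Its moment generating function is $\cosh(\lambda)$, which is dominated by $e^{\lambda^2/2}$, so it satisfies Definition 2 with $\mu = 0$ and $\sigma^2 = 1$. The sketch below checks this on a grid of $\lambda$ values. The grid itself is an arbitrary choice.

```python
import math

def rademacher_mgf(lam):
    """E[exp(lam * Z)] for Z uniform on {-1, +1}, which equals cosh(lam)."""
    return 0.5 * math.exp(lam) + 0.5 * math.exp(-lam)

def sub_gaussian_envelope(lam, sigma2=1.0):
    """Right-hand side of Definition 2: exp(lam^2 * sigma^2 / 2)."""
    return math.exp(lam * lam * sigma2 / 2.0)

lams = [i / 10.0 for i in range(-50, 51)]
mgf_dominated = all(
    rademacher_mgf(lam) <= sub_gaussian_envelope(lam) * (1.0 + 1e-12)
    for lam in lams
)
```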

Next, we present our PAC-Bayesian bounds.

###### Corollary 3 (The PAC-Bayesian bounds for sub-Gaussian loss function).

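Let $P$ be a fixed prior distribution over a hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over a hypothesis space $\mathcal{H}$, let $R_D(G_Q)$ and $R_S(G_Q)$ be the Gibbs risk and the empirical Gibbs risk as in Equations (4) and (5) respectively. For the sample size $m$ and $t > 0$, with probability at least $1 - \delta$, simultaneously for all posterior distributions $Q$, we have

$$\begin{aligned}
R_D(G_Q) &\le R_S(G_Q) + \frac{1}{t}\mathrm{KL}(Q \| P) + \sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}} \\
R_D(G_Q) &\le R_S(G_Q) + \frac{1}{t}\chi^2(Q \| P) + \sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}} + \frac{\sigma^2 t}{2m}\log\frac{2}{\delta} \\
R_D(G_Q) &\le R_S(G_Q) + \sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}\left(\chi^2(Q \| P) + 1\right)}
\end{aligned}$$

###### Proof.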

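Employing Chernoff's bound, the tail bound for sub-Gaussian random variables [11] is given as follows

$$\Pr\left(|\bar{Z} - \mu| \ge \epsilon\right) \le 2e^{-\frac{m\epsilon^2}{2\sigma^2}} \tag{10}$$

Setting $\bar{Z} = R_S(h)$, $\mu = R_D(h)$ and $\delta = 2e^{-\frac{m\epsilon^2}{2\sigma^2}}$ in the tail bound in Equation (10), for any $h \in \mathcal{H}$, we have

$$\phi_D(h) \underset{1-\delta}{\le} t\sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}}$$

where $\phi_D(h)$ is defined as in Equation (6). Now that we have the upper bound for $\phi_D(h)$, we can apply the same procedure as in Corollary 2. ∎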

### 3.3 Sub-Exponential Loss Function

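Next, we consider a more general class where the loss $\ell$ is sub-exponential. First, we define sub-exponentiality.

###### Definition 3.

A random variable $Z$ is said to be sub-exponential with the expectation $\mathbb{E}[Z] = \mu$ and parameters $\sigma^2$ and $\beta > 0$, if

$$\mathbb{E}\left[e^{\lambda(Z - \mu)}\right] \le e^{\frac{\lambda^2 \sigma^2}{2}}, \quad \forall \lambda : |\lambda| < \frac{1}{\beta}$$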

Next, we provide our PAC-Bayesian bounds.

###### Corollary 4 (The PAC-Bayesian bounds for sub-exponential loss function).

Let $P$ be a fixed prior distribution over a hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over a hypothesis space $\mathcal{H}$, let $R_D(G_Q)$ and $R_S(G_Q)$ be the Gibbs risk and the empirical Gibbs risk as in Equations (4) and (5) respectively. For the sample size $m$ and $t > 0$, with probability at least $1 - \delta$, simultaneously for all posterior distributions $Q$, we have

For , we have

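$$\begin{aligned}
R_D(G_Q) &\le R_S(G_Q) + \frac{1}{t}\mathrm{KL}(Q \| P) + \sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}} \\
R_D(G_Q) &\le R_S(G_Q) + \frac{1}{t}\chi^2(Q \| P) + \sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}} + \frac{\sigma^2 t}{2m}\log\frac{2}{\delta} \\
R_D(G_Q) &\le R_S(G_Q) + \sqrt{\frac{2\sigma^2}{m}\log\frac{2}{\delta}\left(\chi^2(Q \| P) + 1\right)}
\end{aligned}$$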

For , we have

$$R_D(G_Q) \le R_S(G_Q) + \frac{1}{t}\mathrm{KL}(Q \| P) + 2\beta\log\frac{2}{\delta}$$
 RD(GQ)≤RS(GQ)+