Novel Change of Measure Inequalities and PAC-Bayesian Bounds
Abstract
PAC-Bayesian theory has received growing attention in the machine learning community. Our work extends the PAC-Bayesian theory by introducing several novel change of measure inequalities for two families of divergences: f-divergences and α-divergences. First, we show how the variational representation for f-divergences leads to novel change of measure inequalities. Second, we propose a multiplicative change of measure inequality for α-divergences, which leads to tighter bounds under some technical conditions. Finally, we present several PAC-Bayesian bounds for various classes of random variables, by using our novel change of measure inequalities.
1 Introduction
The Probably Approximately Correct (PAC) learning framework was introduced by [15] and [7]. This framework allows us to evaluate the efficiency of machine learning algorithms, and several extensions have been proposed to date (see e.g., [13, 8, 9, 14, 1]). The core of these theoretical results is summarized by a change of measure inequality: an expectation inequality involving two probability measures, where the expectation with respect to one measure is upper-bounded by the divergence between the two measures and the moments with respect to the other measure. The change of measure inequality also plays a major role in information theory. For instance, [5] derived robust uncertainty quantification bounds for statistical estimators of interest with change of measure inequalities.
There exist change of measure inequalities based on specific divergences, such as the KL-divergence [7], the Rényi divergence [2] and the χ² divergence [6, 3]. The aforementioned works were proposed for specific purposes, and a comprehensive study of change of measure inequalities has not been performed yet. Our work proposes several novel and general change of measure inequalities for two families of divergences: f-divergences and α-divergences. It is a well-known fact that the f-divergence can be variationally characterized as the maximum of an optimization problem rooted in the convexity of the generator function f. This variational representation has recently been used in various applications of information theory, such as divergence estimation [10] and quantification of the bias in adaptive data analysis [4]. Recently, [12] showed that the variational representation of the f-divergence can be tightened when constrained to the space of probability densities.
Our main contributions are as follows:

- By using the variational representation for f-divergences, we derive several change of measure inequalities. We perform the analysis for the constrained regime (restricted to the space of probability densities) as well as the unconstrained regime.

- We present a multiplicative change of measure inequality for the family of α-divergences. This generalizes the previous result [3] for one member of the α-divergence family, namely the χ² divergence.

- We also generalize the prior Hammersley-Chapman-Robbins inequality [6] from the particular case of the χ² divergence to the family of α-divergences.

- We provide new PAC-Bayesian bounds from our novel change of measure inequalities for bounded, sub-Gaussian, sub-exponential and bounded-variance loss functions. Our results pertain to important machine learning prediction problems, such as regression, classification and structured prediction.
2 Change of Measure Inequalities
Table 1: Summary of our change of measure inequalities. Each reference states an upper bound on E_Q[φ] that holds for every Q and a fixed P.

Bound Type                   Divergence          Condition       Reference
Constrained variational      KL                  —               [7]
representation               Pearson χ²          —               Lemma 2
                             Total variation     φ ∈ [0, 1]      Lemma 4
Unconstrained variational    KL                  —               [7]
representation               Pearson χ²          —               Lemma 3
                             Total variation     φ ∈ [0, 1]      Lemma 5
                             Squared Hellinger   —               Lemma 6
                             Reverse KL          φ < 0           Lemma 7
                             Neyman χ²           φ < 1           Lemma 8
Multiplicative               Pearson χ²          —               [3]
                             α                   α > 1           Lemma 10
                             Squared Hellinger   —               Lemma 11
Generalized HCR              Pearson χ²          —               [6]
                             α                   α > 1           Lemma 13
                             Squared Hellinger   —               Lemma 14
Table 2: Common f-divergences, with probability measures Q and P defined on a common space, densities q and p with respect to a base measure μ, and the corresponding generator f.

Divergence          Formula                                Corresponding Generator f(t)
KL                  KL(Q, P) = ∫ q log(q/p) dμ             t log t
Reverse KL          KL(P, Q) = ∫ p log(p/q) dμ             −log t
Pearson χ²          χ²(Q, P) = ∫ (q − p)²/p dμ             (t − 1)²
Neyman χ²           χ²_N(Q, P) = ∫ (q − p)²/q dμ           (1 − t)²/t
Total Variation     TV(Q, P) = (1/2) ∫ |q − p| dμ          (1/2)|t − 1|
Squared Hellinger   H²(Q, P) = ∫ (√q − √p)² dμ             (√t − 1)²
In this section, we formalize the definition of f-divergences and present the constrained variational representation (restricted to the space of probability densities) as well as the unconstrained representation. Then, we provide different change of measure inequalities for several divergences. We also provide multiplicative bounds as well as a generalized Hammersley-Chapman-Robbins bound. Table 1 summarizes our results.
2.1 Change of Measure Inequality from the Variational Representation of f-divergences
Let f : ℝ → ℝ be a convex function. The convex conjugate f* of f is defined by:
f*(y) = sup_t ( ty − f(t) ).   (1)
The definition of f* yields the following Young-Fenchel inequality
ty ≤ f(t) + f*(y),
which holds for any t and y.
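As a quick numerical sanity check of the Young-Fenchel inequality, the following sketch (our own illustration, assuming only NumPy) verifies ty ≤ f(t) + f*(y) on a grid for the generator f(t) = (t − 1)² of the Pearson χ² divergence, whose conjugate f*(y) = y + y²/4 is derived later in Lemma 3:

```python
import numpy as np

# Generator of the Pearson chi-squared divergence and its convex conjugate
f = lambda t: (t - 1) ** 2
f_star = lambda y: y + y ** 2 / 4  # derived in Lemma 3

# Young-Fenchel: t*y <= f(t) + f*(y) for all t, y
ts = np.linspace(-5, 5, 101)
ys = np.linspace(-5, 5, 101)
T, Y = np.meshgrid(ts, ys)
assert np.all(T * Y <= f(T) + f_star(Y) + 1e-9)
```

Equality is attained at t = 1 + y/2, the maximizer in the definition of f*.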
Using the notation of convex conjugates, the f-divergence and its variational representation are defined as follows.
Definition 1 (f-divergence and its variational representation).
Let Θ be an arbitrary domain. Let P and Q denote probability measures over the Borel σ-field on Θ. Additionally, let f : (0, ∞) → ℝ be a convex and lower semi-continuous function that satisfies f(1) = 0, and let φ : Θ → ℝ be a real-valued function. Given such functions, the f-divergence between Q and P is defined as
D_f(Q, P) = E_P[ f(dQ/dP) ] = sup_φ ( E_Q[φ] − E_P[f*(φ)] ).
For simplicity, here and in what follows, we denote E_P[·] := E_{θ∼P}[·] and E_Q[·] := E_{θ∼Q}[·]. Many common divergences, such as the KL-divergence and the Hellinger divergence, are members of the family of f-divergences, each coinciding with a particular choice of f. Table 2 presents the definition of each divergence with the corresponding generator f.
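For discrete distributions with full support, every f-divergence reduces to D_f(Q, P) = Σ_θ p(θ) f(q(θ)/p(θ)). The following sketch (our own illustration, not from the paper) checks that the generator-based formula matches the direct closed forms for the KL and total variation divergences:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))  # distribution P on 5 outcomes
q = rng.dirichlet(np.ones(5))  # distribution Q

def f_divergence(f, q, p):
    # D_f(Q, P) = E_P[f(dQ/dP)] for discrete distributions with full support
    return float(np.sum(p * f(q / p)))

kl_gen = f_divergence(lambda t: t * np.log(t), q, p)
kl_direct = float(np.sum(q * np.log(q / p)))
assert abs(kl_gen - kl_direct) < 1e-12

tv_gen = f_divergence(lambda t: 0.5 * np.abs(t - 1), q, p)
tv_direct = 0.5 * float(np.sum(np.abs(q - p)))
assert abs(tv_gen - tv_direct) < 1e-12
```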
Based on Definition 1, we can obtain the following inequality, valid for any measurable φ:
E_Q[φ] ≤ D_f(Q, P) + E_P[f*(φ)].
[12] shows that this variational representation for f-divergences can be tightened. The famous Donsker-Varadhan representation for the KL-divergence, which is used in most PAC-Bayesian bounds, can actually be derived from this tighter representation.
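The Donsker-Varadhan bound for the KL case, E_Q[φ] ≤ KL(Q, P) + log E_P[e^φ], can be checked numerically on discrete distributions; the sketch below is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # "prior" P on 5 outcomes
q = rng.dirichlet(np.ones(5))   # "posterior" Q
phi = rng.normal(size=5)        # arbitrary test function

kl = float(np.sum(q * np.log(q / p)))              # KL(Q, P)
lhs = float(q @ phi)                               # E_Q[phi]
rhs = kl + float(np.log(np.sum(p * np.exp(phi))))  # Donsker-Varadhan bound
assert lhs <= rhs + 1e-12
```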
Theorem 1 (Change of measure inequality from the constrained variational representation for f-divergences [12]).
Let f, P, Q, and φ be defined as in Definition 1. Let Δ(μ) denote the space of probability densities with respect to a measure μ over Θ, where the norm is defined as ∥g∥₁ = ∫_Θ |g| dμ. The general form of the change of measure inequality for f-divergences is given by
E_Q[φ] ≤ D_f(Q, P) + sup_{g ∈ Δ(μ)} ( ∫_Θ φ g dμ − D_f(gμ, P) ),
where gμ denotes the measure with density g, and g is constrained to be a probability density function.
However, it is not always easy to find a closed-form solution for Theorem 1, as it requires resorting to variational calculus, and in some cases there is no closed-form solution. In such cases, we can use the following corollary, which yields looser bounds but only requires finding a convex conjugate.
Corollary 1 (Change of measure inequality from the unconstrained variational representation for f-divergences).
Let f, P, Q, and φ be defined as in Definition 1. Then
E_Q[φ] ≤ D_f(Q, P) + E_P[f*(φ)].
Proof.
Removing the supremum and rearranging the terms in Definition 1, we prove our claim. ∎
By choosing an appropriate function f and solving the constrained maximization term with the help of variational calculus, we can obtain an upper bound based on the corresponding f-divergence. First, we discuss the case of the Pearson χ² divergence.
Lemma 2 (Change of measure inequality from the constrained representation of the Pearson χ² divergence).
Let P, Q and φ be defined as in Theorem 1. We have
E_Q[φ] ≤ χ²(Q, P) + E_P[φ] + (1/4) Var_P[φ].
Proof.
By Theorem 1, we want to find
sup_g ( ∫_Θ φ g dμ − ∫_Θ (g − p)²/p dμ ),   (2)
where g is a measurable function on Θ and p denotes the density of P with respect to μ. In order to find the supremum on the right-hand side, we consider the following Lagrangian:
L(g, λ) = ∫_Θ φ g dμ − ∫_Θ (g − p)²/p dμ + λ ( 1 − ∫_Θ g dμ ),
where λ ∈ ℝ and g is constrained to be a probability density over Θ. Taking the functional derivative with respect to g and setting it to zero, we have φ − 2(g − p)/p − λ = 0. Thus, we have g = p(1 + (φ − λ)/2). Since g is constrained to satisfy ∫_Θ g dμ = 1, we have λ = E_P[φ]. Then, we obtain the optimum g* = p(1 + (φ − E_P[φ])/2). Plugging it in Equation (2), we have
E_P[φ] + (1/2) Var_P[φ] − (1/4) Var_P[φ],
which simplifies to E_P[φ] + (1/4) Var_P[φ]. ∎
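The bound of Lemma 2, in the form E_Q[φ] ≤ χ²(Q, P) + E_P[φ] + Var_P[φ]/4, can be checked numerically on random discrete distributions; the sketch below (our own illustration) also confirms that Var_P[φ] ≤ E_P[φ²], the quantity appearing in the unconstrained variant:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))
phi = rng.normal(size=6)

chi2 = float(np.sum((q - p) ** 2 / p))    # Pearson chi-squared divergence
e_p, e_q = float(p @ phi), float(q @ phi)
var_p = float(p @ (phi - e_p) ** 2)

# Constrained bound: E_Q[phi] <= chi2(Q,P) + E_P[phi] + Var_P[phi]/4
assert e_q <= chi2 + e_p + var_p / 4 + 1e-12
# Never looser than the unconstrained version, since Var_P[phi] <= E_P[phi^2]
assert var_p <= float(p @ phi ** 2) + 1e-12
```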
The bound in Lemma 2 is slightly tighter than the one without the constraint, since Var_P[φ] ≤ E_P[φ²]. The change of measure inequality without the constraint is given as follows.
Lemma 3 (Change of measure inequality from the unconstrained representation of the Pearson χ² divergence).
Let P, Q and φ be defined as in Theorem 1. We have
E_Q[φ] ≤ χ²(Q, P) + E_P[φ] + (1/4) E_P[φ²].
Proof.
Notice that the Pearson χ² divergence is obtained by setting f(t) = (t − 1)². In order to find the convex conjugate from Equation (1), let us denote g(t) = ty − (t − 1)². We need to find the supremum of g with respect to t. By differentiating with respect to t and setting the derivative to zero, we have t = 1 + y/2. Thus, plugging this in for t, we obtain f*(y) = y + y²/4. Plugging f and f* in Corollary 1, we prove our claim. ∎
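The unconstrained Pearson bound, E_Q[φ] ≤ χ²(Q, P) + E_P[φ] + E_P[φ²]/4, follows pointwise from the Young-Fenchel inequality, so it can be checked directly (our own illustration, discrete case):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))
phi = rng.normal(size=8)

chi2 = float(np.sum((q - p) ** 2 / p))
# Unconstrained bound: E_Q[phi] <= chi2(Q,P) + E_P[phi] + E_P[phi^2]/4,
# a pointwise consequence of Young-Fenchel with f(t) = (t-1)^2
lhs = float(q @ phi)
rhs = chi2 + float(p @ phi) + float(p @ phi ** 2) / 4
assert lhs <= rhs + 1e-12
```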
Next, we discuss the case of the total variation divergence.
Lemma 4 (Change of measure inequality from the constrained representation of the total variation divergence).
Let P, Q and φ be defined as in Theorem 1, with φ taking values in [0, 1]. We have
E_Q[φ] ≤ E_P[φ] + TV(Q, P).
Proof.
By Theorem 1, we want to find
sup_g ( ∫_Θ φ g dμ − (1/2) ∫_Θ |g − p| dμ ).
In order to find the supremum on the right-hand side, we consider the following Lagrangian
L(g, λ) = ∫_Θ φ g dμ − (1/2) ∫_Θ |g − p| dμ + λ ( 1 − ∫_Θ g dμ ).
Then, it is not hard to see that if |φ − λ| ≤ 1/2 everywhere, L is maximized at g = p with value E_P[φ]; otherwise, L → ∞ as g grows on the set where φ − λ > 1/2. Such a λ exists whenever the range of φ has length at most 1. Thus, Lemma 4 holds for such φ. If we add a constant on both sides, we may take φ to be bounded between 0 and 1, as we claimed. ∎
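The total variation bound for [0, 1]-valued functions, E_Q[φ] ≤ E_P[φ] + TV(Q, P), admits a direct numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(7))
q = rng.dirichlet(np.ones(7))
phi = rng.uniform(0.0, 1.0, size=7)   # phi must take values in [0, 1]

tv = 0.5 * float(np.sum(np.abs(q - p)))   # total variation divergence
# E_Q[phi] <= E_P[phi] + TV(Q, P) for phi in [0, 1]
assert float(q @ phi) <= float(p @ phi) + tv + 1e-12
```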
Interestingly, we can obtain the same bound on the total variation divergence even if we use the unconstrained variational representation.
Next, we state this result for the total variation divergence.
Lemma 5 (Change of measure inequality from the unconstrained representation of the total variation divergence).
Let P, Q and φ be defined as in Theorem 1, with φ taking values in [−1/2, 1/2]. We have
E_Q[φ] ≤ E_P[φ] + TV(Q, P).
Proof.
The total variation divergence is obtained by setting f(t) = (1/2)|t − 1|, whose convex conjugate is f*(y) = y for |y| ≤ 1/2 and +∞ otherwise. Plugging f and f* in Corollary 1, we prove our claim. ∎
We can obtain the bounds based on the squared Hellinger divergence, the reverse KL-divergence and the Neyman χ² divergence in a similar fashion.
Lemma 6 (Change of measure inequality from the unconstrained representation of the squared Hellinger divergence).
Let P, Q and φ be defined as in Theorem 1, with φ < 1. We have
E_Q[φ] ≤ H²(Q, P) + E_P[ φ / (1 − φ) ].
Proof.
The squared Hellinger divergence is obtained by setting f(t) = (√t − 1)², whose convex conjugate is f*(y) = y/(1 − y) for y < 1. Plugging f and f* in Corollary 1, we prove our claim. ∎
Similarly, we obtain the following bound for the reverse KL-divergence.
Lemma 7 (Change of measure inequality from the unconstrained representation of the reverse KL-divergence).
Let P, Q and φ be defined as in Theorem 1. We have
E_Q[−ψ] ≤ KL(P, Q) − 1 − E_P[log ψ],
where ψ = −φ > 0.
Proof.
The corresponding convex function for the reverse KL-divergence is f(t) = −log t. Applying the same procedure as in Lemma 3, we get the convex conjugate f*(y) = −1 − log(−y) for y < 0. Plugging f and f* in Corollary 1, we have
E_Q[φ] ≤ KL(P, Q) + E_P[−1 − log(−φ)],
where φ < 0. Letting ψ = −φ, we prove our claim. ∎
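The reverse KL bound before the substitution ψ = −φ, namely E_Q[φ] ≤ KL(P, Q) + E_P[−1 − log(−φ)] for φ < 0, can be checked numerically (our own illustration, discrete case):

```python
import numpy as np

rng = np.random.default_rng(4)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
phi = -rng.uniform(0.1, 3.0, size=5)   # phi must be strictly negative

rkl = float(np.sum(p * np.log(p / q)))   # reverse KL of (Q, P) = KL(P || Q)
lhs = float(q @ phi)
rhs = rkl + float(p @ (-1.0 - np.log(-phi)))
assert lhs <= rhs + 1e-12
```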
Finally, we prove our result for the Neyman χ² divergence based on a similar approach.
Lemma 8 (Change of measure inequality from the unconstrained representation of the Neyman χ² divergence).
Let P, Q and φ be defined as in Theorem 1, with φ < 1. We have
E_Q[φ] ≤ χ²_N(Q, P) + 2 − 2 E_P[√(1 − φ)].
Proof.
The Neyman χ² divergence is obtained by setting f(t) = (1 − t)²/t, whose convex conjugate is f*(y) = 2 − 2√(1 − y) for y < 1. Plugging f and f* in Corollary 1, we prove our claim. ∎
2.2 Multiplicative Change of Measure Inequality for α-divergences
First, we state a known result for the χ² divergence.
Lemma 9 (Multiplicative change of measure inequality for the χ² divergence [3]).
Let P, Q and φ be defined as in Theorem 1. We have
E_Q[φ] ≤ √( (χ²(Q, P) + 1) · E_P[φ²] ).
We note that the χ² divergence is an α-divergence with α = 2. Next, we generalize the above bound to any α-divergence.
Lemma 10 (Multiplicative change of measure inequality for the α-divergence).
Let P, Q and φ be defined as in Theorem 1. For any α > 1 and β = α/(α − 1), we have
E_Q[φ] ≤ E_P[(dQ/dP)^α]^{1/α} · E_P[|φ|^β]^{1/β}.
Proof.
E_Q[φ] = E_P[(dQ/dP) φ] ≤ E_P[(dQ/dP) |φ|] ≤ E_P[(dQ/dP)^α]^{1/α} · E_P[|φ|^β]^{1/β}.
The last step is due to Hölder’s inequality. ∎
By choosing α = 2, we recover the bound of Lemma 9, since E_P[(dQ/dP)²] = χ²(Q, P) + 1.
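Lemma 10 is, at its core, an application of Hölder's inequality to the density ratio dQ/dP. A numerical sketch (our own illustration, discrete case) checks the bound for several values of α and verifies that α = 2 recovers the second moment χ²(Q, P) + 1 of Lemma 9:

```python
import numpy as np

rng = np.random.default_rng(5)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))
phi = rng.normal(size=6)

def holder_bound(alpha):
    # E_Q[phi] <= E_P[(dQ/dP)^alpha]^(1/alpha) * E_P[|phi|^beta]^(1/beta),
    # with 1/alpha + 1/beta = 1 (Hoelder's inequality)
    beta = alpha / (alpha - 1.0)
    moment = float(p @ (q / p) ** alpha) ** (1.0 / alpha)
    return moment * float(p @ np.abs(phi) ** beta) ** (1.0 / beta)

for alpha in (1.5, 2.0, 3.0):
    assert float(q @ phi) <= holder_bound(alpha) + 1e-12

# alpha = 2 recovers Lemma 9: E_P[(dQ/dP)^2] = chi2(Q, P) + 1
chi2 = float(np.sum((q - p) ** 2 / p))
assert abs(float(p @ (q / p) ** 2) - (chi2 + 1.0)) < 1e-12
```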
From Lemma 10, we can obtain the change of measure inequality for specific cases of the α-divergence. Next, we provide the result for the squared Hellinger divergence.
Lemma 11 (Multiplicative change of measure inequality for the squared Hellinger divergence).
Let P, Q and φ be defined as in Theorem 1.
Proof.
Letting α = 1/2 in Lemma 10 and rearranging completes the proof. ∎
2.3 A Generalized Hammersley-Chapman-Robbins (HCR) Inequality
The HCR inequality is a famous information-theoretic inequality for the χ² divergence: for any P, Q and φ,
(E_Q[φ] − E_P[φ])² ≤ χ²(Q, P) · Var_P[φ].
As noted earlier, the χ² divergence is an α-divergence with α = 2. Next, we generalize the HCR inequality to any α-divergence.
Lemma 13 (The generalization of the HCR inequality).
Proof.
Consider the covariance of dQ/dP and φ under P:
Cov_P(dQ/dP, φ) = E_Q[φ] − E_P[φ].
On the other hand,
(3) 
which completes the proof. ∎
By choosing α = 2, we obtain a bound that is quite similar to, and slightly looser than, the HCR inequality. This is because the third step of the proof of Equation (3) makes it a little looser than the HCR inequality.
Finally, we present a result for the squared Hellinger divergence.
Lemma 14 (Change of measure inequality for the squared Hellinger divergence from Lemma 13).
Proof.
Letting α = 1/2 in Lemma 13 and rearranging completes the proof. ∎
3 Applicability to PAC-Bayesian Theory
In this section, we explore the applicability of our change of measure inequalities. We consider an arbitrary input space X and an output space Y. The samples z = (x, y) ∈ X × Y are input-output pairs. Each example is drawn according to a fixed, but unknown, distribution D on X × Y. Let ℓ denote a generic loss function. The risk R(h) of any predictor h is defined as the expected loss induced by samples drawn according to D. Given a training set S of m samples, the empirical risk R_S(h) of any predictor h is defined by the empirical average of the loss. That is,
R(h) = E_{z∼D}[ℓ(h, z)],   R_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i).
In the PAC-Bayesian framework, we consider a hypothesis space H of predictors, a prior distribution P on H, and a posterior distribution Q on H. The prior is specified before exploiting the information contained in S, while the posterior is obtained by running a learning algorithm on S. PAC-Bayesian theory usually studies the stochastic Gibbs predictor G_Q. Given a distribution Q on H, G_Q predicts an example x by drawing a predictor h according to Q, and returning h(x). The risk of G_Q is then defined as follows. For any probability distribution Q on a set of predictors, the Gibbs risk R(G_Q) is the expected risk of the Gibbs predictor relative to Q. Hence,
R(G_Q) = E_{h∼Q}[R(h)].   (4)
Usual PAC-Bayesian bounds give guarantees on the generalization risk R(G_Q). Typically, these bounds rely on the empirical Gibbs risk defined as follows:
R_S(G_Q) = E_{h∼Q}[R_S(h)].   (5)
Next, we present novel PAC-Bayesian bounds based on various change of measure inequalities, depending on different assumptions on the loss function ℓ. Our results pertain to important machine learning prediction problems, such as regression, classification and structured prediction.
3.1 Bounded Loss Function
First, let us assume that the loss function is bounded, i.e., ℓ(h, z) ∈ [0, 1] for all h and z. For this scenario, we cannot use the change of measure inequalities based on the total variation, reverse KL and Neyman χ² divergences, because those inequalities require φ to be bounded in a specific range.
Corollary 2 (The PAC-Bayesian bounds for bounded loss functions).
Let P be a fixed prior distribution over a hypothesis space H. For a given posterior distribution Q over H, let R(G_Q) and R_S(G_Q) be the Gibbs risk and the empirical Gibbs risk as in Equations (4) and (5), respectively. For the sample size m and δ ∈ (0, 1], with probability at least 1 − δ, simultaneously for all posterior distributions Q, we have
Proof.
Suppose that we have a convex function Δ : [0, 1] × [0, 1] → ℝ that measures the discrepancy between the observed empirical Gibbs risk and the true Gibbs risk on distribution D. Given that, the purpose of the PAC-Bayesian theorem is to upper-bound the discrepancy Δ(R_S(G_Q), R(G_Q)) for any Q. Let φ(h) = Δ(R_S(h), R(h)), where the subscript of R_S shows the dependency on the data distribution D. By applying Jensen’s inequality as the first step,
(6) 
By Hoeffding’s inequality, for any h and any ε > 0,
Setting ε accordingly, we have
The symbol ≤_δ denotes that the inequality holds with probability at least 1 − δ. The second line holds due to Hoeffding’s inequality. Also note that
(7) 
The second line follows from Jensen’s inequality and the convexity of the supremum. Similarly,
(8) 
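The key analytic ingredient in the proof above is Hoeffding's lemma: a [0, 1]-valued random variable X with mean θ satisfies E[e^{λ(X−θ)}] ≤ e^{λ²/8}. For a Bernoulli(θ) loss the moment generating function has a closed form, so the inequality can be checked exactly on a grid (our own illustration):

```python
import numpy as np

# Hoeffding's lemma for a [0, 1]-valued loss: for X ~ Bernoulli(theta),
# E[exp(lam * (X - theta))] <= exp(lam^2 / 8)
for theta in np.linspace(0.05, 0.95, 19):
    for lam in np.linspace(-10, 10, 201):
        mgf = theta * np.exp(lam * (1 - theta)) + (1 - theta) * np.exp(-lam * theta)
        assert mgf <= np.exp(lam ** 2 / 8) + 1e-9
```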
3.2 Sub-Gaussian Loss Function
Next, we relax the constraints on the loss function. We consider the problem where ℓ is sub-Gaussian. First, we define sub-Gaussianity.
Definition 2.
A random variable X is said to be sub-Gaussian with expectation μ and variance proxy σ² if, for any λ ∈ ℝ,
E[e^{λ(X − μ)}] ≤ e^{λ²σ²/2}.
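A minimal concrete instance (our own illustration): a Rademacher variable (P[X = 1] = P[X = −1] = 1/2) has mean 0 and is sub-Gaussian with variance proxy σ² = 1, since E[e^{λX}] = cosh(λ) ≤ e^{λ²/2}:

```python
import numpy as np

# Rademacher variable: E[exp(lam * X)] = cosh(lam) <= exp(lam^2 / 2)
lams = np.linspace(-20, 20, 401)
assert np.all(np.cosh(lams) <= np.exp(lams ** 2 / 2) + 1e-9)
```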
Next, we present our PACBayesian bounds.
Corollary 3 (The PAC-Bayesian bounds for sub-Gaussian loss functions).
Let P be a fixed prior distribution over a hypothesis space H. For a given posterior distribution Q over H, let R(G_Q) and R_S(G_Q) be the Gibbs risk and the empirical Gibbs risk as in Equations (4) and (5), respectively. For the sample size m and δ ∈ (0, 1], with probability at least 1 − δ, simultaneously for all posterior distributions Q, we have
Proof.
Employing Chernoff’s bound, the tail bound probability for sub-Gaussian random variables [11] is given as follows:
P(X − μ ≥ t) ≤ exp( −t² / (2σ²) ).   (10)
Setting the corresponding random variable and parameters in the tail bound in Equation (10), for any h, we have
where Δ is defined as in Equation (6). Now we have the upper bound for the tail probability, so we can apply the same procedure as in Corollary 2. ∎
3.3 Sub-Exponential Loss Function
Next, we consider a more general class, where ℓ is sub-exponential. First, we define sub-exponentiality.
Definition 3.
A random variable X is said to be sub-exponential with expectation μ and parameters ν and b if, for any |λ| < 1/b,
E[e^{λ(X − μ)}] ≤ e^{λ²ν²/2}.
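A minimal concrete instance (our own illustration): for X ~ Exponential(1), the moment generating function of X − 1 is e^{−λ}/(1 − λ) for λ < 1, and X is sub-exponential with parameters (ν, b) = (2, 2):

```python
import numpy as np

# X ~ Exponential(1) has mean 1; E[exp(lam*(X - 1))] = exp(-lam)/(1 - lam)
# for lam < 1, and this is at most exp(lam^2 * nu^2 / 2) with nu = 2
# whenever |lam| < 1/b = 1/2
lams = np.linspace(-0.49, 0.49, 99)
mgf = np.exp(-lams) / (1.0 - lams)
assert np.all(mgf <= np.exp(lams ** 2 * 4.0 / 2.0) + 1e-12)
```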
Next, we provide our PACBayesian bounds.
Corollary 4 (The PAC-Bayesian bounds for sub-exponential loss functions).
Let P be a fixed prior distribution over a hypothesis space H. For a given posterior distribution Q over H, let R(G_Q) and R_S(G_Q) be the Gibbs risk and the empirical Gibbs risk as in Equations (4) and (5), respectively. For the sample size m and δ ∈ (0, 1], with probability at least 1 − δ, simultaneously for all posterior distributions Q, we have
For 0 ≤ t ≤ ν²/b, we have
For t > ν²/b, we have