
# Learning from i.i.d. data under model misspecification

## Abstract

This paper introduces a new approach to learning from i.i.d. data under model misspecification. This approach casts the problem of learning as minimizing the expected code length of a Bayesian mixture code. To solve this problem, we build on PAC-Bayes bounds, information theory and a new family of second-order Jensen bounds. The key insight of this paper is that the use of standard (first-order) Jensen bounds in learning is suboptimal when our model class is misspecified (i.e. it does not contain the data generating distribution). As a consequence of this insight, this work provides strong theoretical arguments explaining why the Bayesian posterior is not optimal for making predictions under model misspecification, because the Bayesian posterior is directly related to the use of first-order Jensen bounds. We then argue for the use of second-order Jensen bounds, which lead to new families of learning algorithms. In this work, we introduce novel variational and ensemble learning methods based on the minimization of a novel family of second-order PAC-Bayes bounds over the expected code length of a Bayesian mixture code. Using this new framework, we also provide novel hypotheses of why parameters in a flat minimum generalize better than parameters in a sharp minimum.


## 1 Introduction

The learning problem in machine learning is how to design machines that find patterns in a finite data sample which generalize, i.e. which apply to unseen data samples. In this work, we argue that learning approaches based on empirical risk minimization (Vapnik, 1992) or Bayesian learning (Murphy, 2012) are not optimal strategies for solving this generalization problem under model misspecification and i.i.d. data. For the sake of simplicity, our analysis is focused on the use of the log-loss and unsupervised settings, but it also applies to general loss functions and supervised settings.

We assume we have a finite training data set of $n$ i.i.d. samples, $D=\{\bmx_1,\ldots,\bmx_n\}$, where $\bmx_i\in\mathcal{X}$. The samples in $D$ are generated from some unknown data generative distribution denoted $\nu(\bmx)$. We also assume we have a parametric probability distribution over $\mathcal{X}$ denoted by $p(\bmx|\bmtheta)$, parametrized by some parameter vector $\bmtheta\in\bmTheta$. We denote $\mathcal{P}$ the family of all possible distributions parametrized by some $\bmtheta\in\bmTheta$. Learning under model misspecification means that $\nu\notin\mathcal{P}$.

In empirical risk minimization (ERM), under the log-loss, the learning problem consists in finding the parameter $\bmtheta^\star$ which defines the probabilistic model $p(\bmx|\bmtheta^\star)$ closest to $\nu(\bmx)$ in terms of Kullback-Leibler (KL) distance,

$$\bmtheta^\star=\arg\min_{\bmtheta} KL(\nu(\bmx),p(\bmx|\bmtheta))$$

which can be shown to be equivalent to minimizing the so-called expected risk,

$$\bmtheta^\star=\arg\min_{\bmtheta}\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{p(\bmx|\bmtheta)}\right]$$

As we do not have access to the data generative distribution, we have to minimize a proxy objective based on the data sample $D$, which is known as the empirical risk,

$$\bmtheta_{ERM}=\arg\min_{\bmtheta}\frac{1}{n}\sum_{i=1}^{n}\ln\frac{1}{p(\bmx_i|\bmtheta)}$$

Usually, a regularization term is also included to improve the quality of the solution. This approach has an information-theoretic interpretation based on the construction of optimal two-part codes for compressing the training data (Grünwald, 2007). The statistical consistency of this approach, i.e. how close $\bmtheta_{ERM}$ and $\bmtheta^\star$ are, has been extensively studied by the machine learning community, usually under i.i.d. data assumptions (Vapnik, 1992; Nagarajan and Kolter, 2019).
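As an illustration of the ERM objective above, the following sketch (a toy Bernoulli model; the data and grid are made up for illustration) minimizes the empirical risk under the log-loss by grid search and recovers the sample mean, the known closed-form minimizer for this model:

```python
import math

def empirical_risk(theta, data):
    """(1/n) * sum_i ln 1/p(x_i | theta) for a Bernoulli model."""
    n = len(data)
    return -sum(x * math.log(theta) + (1 - x) * math.log(1 - theta)
                for x in data) / n

data = [1, 0, 1, 1, 0, 1, 1, 0]            # finite i.i.d. sample
grid = [i / 1000 for i in range(1, 1000)]  # candidate parameters in (0, 1)
theta_erm = min(grid, key=lambda t: empirical_risk(t, data))

# For the Bernoulli model, the empirical-risk minimizer is the sample mean.
print(theta_erm)  # 0.625, i.e. 5/8
```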

In this work, we approach the learning problem in a more Bayesian way because we consider distributions over the parameters instead of single point estimates. Let us start by introducing the so-called predictive posterior distribution,

$$p(\bmx)=\int p(\bmx|\bmtheta)\rho(\bmtheta)d\bmtheta=\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]$$

where $\rho(\bmtheta)$ is a probability distribution over $\bmTheta$ defining this predictive posterior.

The learning problem we study here is to find the probability distribution $\rho$ which defines the predictive posterior distribution closest to $\nu(\bmx)$ in terms of Kullback-Leibler (KL) distance,

$$\rho^\star=\arg\min_{\rho} KL(\nu(\bmx),\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)])$$

This problem is equivalent to finding the probability distribution $\rho$ which defines a Bayesian mixture code (Grünwald, 2007) with the smallest code length, in expectation, for the data samples generated from $\nu$. As we do not have access to the data generative distribution, we employ PAC-Bayes bounds (McAllester, 1999) to bound this expected code length using our finite data sample. The learning problem then reduces to finding the distribution $\rho$ which minimizes this bound over the generalization performance.

We show in this work that, under perfect model specification, the Bayesian posterior is an optimal choice because it minimizes a (first-order) PAC-Bayes bound over this expected code length. However, we also show that, under model misspecification, this (first-order) PAC-Bayes bound is suboptimal and, in consequence, the Bayesian posterior is suboptimal too for defining predictive posterior distributions that generalize. We then argue for the use of second-order PAC-Bayes bounds to design better learning algorithms.

After this theoretical analysis, we introduce two new families of learning algorithms based on the minimization of these second-order PAC-Bayes bounds. The first one is a variational learning algorithm quite close to standard approaches (Blei et al., 2017) but with a modified evidence lower bound functional. The second one is a simple ensemble learning algorithm (Dietterich, 2000) which can be seen as a particle-based variational inference method that includes a novel diversity term (Kuncheva and Whitaker, 2003), and which allows the whole ensemble of models to be learned jointly using a gradient-based optimization approach. Some experiments on toy data samples illustrate these learning algorithms. The code to reproduce these experiments can be found in the following public repository: https://github.com/PGM-Lab/PAC2BAYES.

The insights of this work may help to explain why the use of Bayesian approaches does not consistently provide significant performance gains with respect to maximum a posteriori (MAP) or (regularized) empirical risk minimization (ERM) estimates in large-sample regimes. We argue that in these large-sample regimes the Bayesian posterior tends to collapse around a single parameter, so MAP/ERM estimates provide a good approximation of this highly peaked posterior and, in consequence, similar prediction performance without the complexity of approximate Bayesian machinery. However, when we consider this new learning framework, we argue that the use of the so-called PAC-Bayesian posterior may provide significant improvements in prediction performance.

We finally exploit the insights obtained in this work to provide novel hypotheses of why parameters in a flat minimum generalize better than parameters in a sharp minimum.

## 2 Background

### 2.1 Bayesian Learning

The solution to the learning problem under Bayesian settings (Bernardo and Smith, 2009) is the Bayesian posterior, which is computed by Bayes' rule,

$$p(\bmtheta|D)=\frac{\pi(\bmtheta)\prod_{i=1}^{n}p(\bmx_i|\bmtheta)}{\int\pi(\bmtheta)\prod_{i=1}^{n}p(\bmx_i|\bmtheta)d\bmtheta}$$

where $\pi(\bmtheta)$ is known as the prior distribution.

If a new observation $\bmx'$ arrives, we compute the posterior predictive distribution to make predictions about $\bmx'$,

$$p(\bmx'|D)=\int p(\bmx'|\bmtheta)p(\bmtheta|D)d\bmtheta$$
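As a minimal sketch of these two computations, assuming an illustrative discretized parameter grid (Bernoulli likelihood, uniform prior), Bayes' rule and the posterior predictive reduce to finite sums:

```python
import math

# Illustrative discrete grid over theta for a Bernoulli model.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5                      # uniform prior pi(theta)
data = [1, 1, 0, 1, 1]

# Likelihood p(D|theta) = prod_i p(x_i|theta).
lik = [math.prod(t if x else 1 - t for x in data) for t in thetas]

# Bayes' rule: posterior proportional to prior times likelihood.
evidence = sum(p * l for p, l in zip(prior, lik))          # p(D)
posterior = [p * l / evidence for p, l in zip(prior, lik)]  # p(theta|D)

# Posterior predictive for a new observation x' = 1: sum_theta theta * p(theta|D).
pred_1 = sum(po * t for po, t in zip(posterior, thetas))
```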

The main issue when applying Bayesian machine learning (Murphy, 2012) in real settings is the computation of the (multi-dimensional) integral of the Bayesian posterior. Usually, the computation of this integral is not tractable and we have to resort to approximate methods. Variational inference (VI) (Blei et al., 2017) is a popular method in Bayesian machine learning to compute an approximation of the Bayesian posterior. In standard VI settings, we choose a tractable family of probability distributions over $\bmTheta$, denoted by $\mathcal{Q}$, and the learning problem consists in finding the probability distribution $q\in\mathcal{Q}$ which is closest to the Bayesian posterior in terms of (inverse) KL distance,

$$\arg\min_{q\in\mathcal{Q}}KL(q(\bmtheta),p(\bmtheta|D))\qquad(1)$$

For solving the above minimization problem, we exploit the following equality,

$$\ln p(D)=\mathcal{L}(q)+KL(q(\bmtheta),p(\bmtheta|D)),\qquad(2)$$

where $p(D)$ is the marginal likelihood of the training data, and $\mathcal{L}(q)$ is defined as

$$\mathcal{L}(q)=\mathbb{E}_{q(\bmtheta)}[\ln p(D|\bmtheta)]-KL(q,\pi),$$

which is known as the evidence lower bound (ELBO) function because it is a lower bound of the marginal likelihood of the training data,

$$\ln p(D)\geq\mathcal{L}(q).$$

So, due to the equality of Equation (2), the minimization problem of Equation (1) is equivalent to the following maximization problem,

$$\arg\max_{q\in\mathcal{Q}}\mathbb{E}_{q(\bmtheta)}[\ln p(D|\bmtheta)]-KL(q,\pi)\qquad(3)$$

There is a myriad of powerful methods for solving this maximization problem for many different probabilistic models (see (Zhang et al., 2018) for a recent review).
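The decomposition in Equation (2) can be checked numerically. The following sketch (a discretized Bernoulli model with a uniform prior, chosen for illustration) verifies that $\ln p(D)=\mathcal{L}(q)+KL(q,p(\bmtheta|D))$ holds exactly for an arbitrary $q$:

```python
import math

# Discrete grid over theta; uniform prior.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1 / len(thetas)] * len(thetas)
data = [1, 1, 0, 1]

def log_lik(theta):  # ln p(D | theta)
    return sum(math.log(theta if x else 1 - theta) for x in data)

# Exact posterior via Bayes' rule.
joint = [p * math.exp(log_lik(t)) for p, t in zip(prior, thetas)]
evidence = sum(joint)                      # p(D)
posterior = [j / evidence for j in joint]  # p(theta|D)

q = [0.05, 0.15, 0.4, 0.3, 0.1]            # an arbitrary distribution over theta

# ELBO: E_q[ln p(D|theta)] - KL(q, prior).
elbo = sum(qi * log_lik(t) for qi, t in zip(q, thetas)) \
     - sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
kl_q_post = sum(qi * math.log(qi / po) for qi, po in zip(q, posterior))

# Equation (2): ln p(D) = ELBO + KL(q, posterior), for any q.
assert abs(math.log(evidence) - (elbo + kl_q_post)) < 1e-9
```

Since the left-hand side is constant in $q$, maximizing the ELBO is the same as driving $KL(q,p(\bmtheta|D))$ to zero, which is exactly the content of Lemma 1 below.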

But the ELBO function also provides an insightful characterization of the Bayesian posterior,

###### Lemma 1.

If the Bayesian posterior belongs to $\mathcal{Q}$, i.e. $p(\bmtheta|D)\in\mathcal{Q}$, then the Bayesian posterior can be characterized as the maximum of the ELBO function,

$$p(\bmtheta|D)=\arg\max_{q\in\mathcal{Q}}\mathbb{E}_{q(\bmtheta)}[\ln p(D|\bmtheta)]-KL(q,\pi)$$
###### Proof.

It follows by looking at the equality of Equation (2). As the left-hand side of the equality is constant, when maximizing $\mathcal{L}(q)$ we also minimize the $KL(q(\bmtheta),p(\bmtheta|D))$ term. As this KL term is always positive or zero, we can deduce that the maximum of $\mathcal{L}(q)$ is attained for $q(\bmtheta)=p(\bmtheta|D)$. ∎

### 2.2 PAC-Bayesian Analysis

The PAC-Bayes framework (McAllester, 1999) provides data-dependent generalization guarantees for Gibbs or randomized classifiers. As in frequentist learning theories, the main assumption in the PAC-Bayes framework is that the training data sample $D$ is generated from some data generative distribution that we denote $\nu(\bmx,\bmy)$. Thus, we assume that $D$ is an i.i.d. sample of $n$ observations, which we denote as $D\sim\nu^n(\bmx,\bmy)$. Following a frequentist perspective, we consider a loss function $\ell(\bmtheta,\bmx,\bmy)$. Then, we define the empirical loss on the sample $D$, denoted $\hat{L}(\bmtheta,D)$, and the generalization loss for the hypothesis $\bmtheta$, denoted $L(\bmtheta)$, as

$$\hat{L}(\bmtheta,D)=\frac{1}{n}\sum_{i=1}^{n}\ell(\bmtheta,\bmx_i,\bmy_i)\qquad L(\bmtheta)=\mathbb{E}_{\nu(\bmx,\bmy)}[\ell(\bmtheta,\bmx,\bmy)]$$

The PAC-Bayesian theory provides probably approximately correct generalization bounds on the (unknown) expected loss given the empirical estimate and some other parameters, which include a prior $\pi$ over the hypothesis space independent of the training sample. The following result, due to (Catoni, 2007), states a widely used PAC-Bayes bound,

###### Theorem 1.

(Catoni, 2007, Theorem 1.2.6) For any prior distribution $\pi$ over $\bmTheta$ and for any $C>0$ and $\xi\in(0,1)$, with probability at least $1-\xi$ over draws of training data $D\sim\nu^n(\bmx,\bmy)$, for all mixture distribution $\rho$ over $\bmTheta$,

$$\mathbb{E}_{\rho(\bmtheta)}[L(\bmtheta)]\leq\frac{1}{1-e^{-C}}\left(C\,\mathbb{E}_{\rho(\bmtheta)}[\hat{L}(\bmtheta,D)]+\frac{KL(\rho,\pi)+\ln\frac{1}{\xi}}{n}\right)$$

The above result shows how an optimal distribution $\rho$ should optimize the trade-off between the empirical expected loss and the KL divergence between $\rho$ and the prior $\pi$. This last term acts as a regularizer, preventing the distribution $\rho$ from collapsing around a single parameter minimizing the empirical loss.
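To get a feel for this trade-off, the following sketch evaluates the right-hand side of Catoni's bound for made-up values of the empirical loss, the KL term, $C$ and $\xi$ (all numbers are illustrative; the bound assumes a bounded loss, scaled to $[0,1]$ as is done later in the paper by normalizing with $B$):

```python
import math

def catoni_bound(emp_loss, kl, n, C=1.0, xi=0.05):
    """Right-hand side of Catoni's PAC-Bayes bound (Theorem 1)."""
    return (1 / (1 - math.exp(-C))) * (C * emp_loss + (kl + math.log(1 / xi)) / n)

# More data tightens the bound; a larger KL(rho, pi) loosens it.
loose = catoni_bound(emp_loss=0.2, kl=5.0, n=100)
tight = catoni_bound(emp_loss=0.2, kl=5.0, n=10000)
assert tight < loose
assert catoni_bound(0.2, kl=0.0, n=100) < catoni_bound(0.2, kl=5.0, n=100)
```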

For the rest of the paper, unless otherwise stated, we will only work with the so-called log-loss, so we will use the notation introduced in this section to refer to the log-loss.

### 2.3 Minimum Description Length and Information Theory

The Minimum Description Length (MDL) principle (Grünwald, 2007) is a generic method for model selection which relates learning with data compression. MDL builds on the idea that any pattern in the data can be exploited to compress the data, and that the more we are able to compress the data, the more we learn about it (Grünwald, 2007). So, among all the possible models, we should choose the one which best compresses the training data.

Information theory (Shannon, 1948) provides the well-known one-to-one relationship between a probabilistic model and a coding scheme: $\ln\frac{1}{p(\bmx|\bmtheta)}$ defines the number of bits needed to optimally encode the observation $\bmx$ according to the coding scheme defined by $p(\bmx|\bmtheta)$. So, according to MDL, we should select the hypothesis that compresses the training data using the fewest bits, i.e. the hypothesis $\bmtheta$ which minimizes $L(D,\bmtheta)=\sum_{i=1}^{n}\ln\frac{1}{p(\bmx_i|\bmtheta)}$, the number of bits needed to encode the training data set $D$ according to hypothesis $\bmtheta$. Moreover, according to MDL, we must also consider the coding length of the hypothesis (to prevent overfitting), which can be done by defining a prior $\pi(\bmtheta)$ over the hypothesis space $\bmTheta$: $\ln\frac{1}{\pi(\bmtheta)}$ defines the number of bits needed to encode $\bmtheta$ and induces a penalty accounting for the complexity of the hypothesis. The optimal hypothesis according to this two-part coding scheme is the one that minimizes the following quantity,

$$\min_{\bmtheta\in\bmTheta}L(D,\bmtheta)+\ln\frac{1}{\pi(\bmtheta)}$$

Bayesian mixture codes (Grünwald, 2007) are an alternative class of coding schemes. A Bayesian mixture code would assign the following code length to the training data set $D$,

$$\bar{L}(D,\pi)=-\ln\mathbb{E}_{\pi(\bmtheta)}\left[\prod_{i=1}^{n}p(\bmx_i|\bmtheta)\right]=-\ln p(D)$$

where $\pi(\bmtheta)$ plays the role of the mixture distribution of the Bayesian code.

Bayesian codes control the complexity of the hypothesis space by taking expectations or, equivalently, marginalizing out over $\bmtheta$. For finite hypothesis spaces, Bayesian mixture codes always provide better coding schemes than two-part coding schemes.
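This last claim can be checked directly: since the mixture sum dominates each of its terms, $-\ln\sum_{\bmtheta}\pi(\bmtheta)p(D|\bmtheta)\leq\min_{\bmtheta}[-\ln p(D|\bmtheta)+\ln\frac{1}{\pi(\bmtheta)}]$. A toy numeric sketch (Bernoulli likelihood over an illustrative three-point hypothesis space):

```python
import math

thetas = [0.2, 0.5, 0.8]       # finite hypothesis space
prior = [1 / 3] * 3            # uniform prior / mixture distribution
data = [1, 1, 1, 0, 1]

def lik(theta):                # p(D | theta)
    return math.prod(theta if x else 1 - theta for x in data)

# Two-part code: encode the best hypothesis, then the data given it.
two_part = min(-math.log(lik(t)) - math.log(p) for t, p in zip(thetas, prior))
# Bayesian mixture code: -ln p(D).
mixture = -math.log(sum(p * lik(t) for t, p in zip(thetas, prior)))

assert mixture <= two_part     # the mixture code never does worse
```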

### 2.4 The Dirac-delta Function

The Dirac-delta function $\delta_{\bmtheta_0}(\bmtheta)$ (Friedman, 1990), with parameter $\bmtheta_0$, is a generalized function with the following property,

$$\int\delta_{\bmtheta_0}(\bmtheta)f(\bmtheta)d\bmtheta=f(\bmtheta_0)\qquad(4)$$

for any continuous function $f$. In consequence, the Dirac-delta function can be seen as the density function of a probability distribution, because it is positive and, by the above property, it integrates to one. This Dirac-delta distribution is a degenerate probability distribution which always samples the same value $\bmtheta_0$. We also have that the entropy of a Dirac-delta distribution is minus infinity (Friedman, 1990).

## 3 The learning problem

We approach the following learning problem. There is a computing machine, called the learner, which is observing the outside world through a digital sensor.

###### Assumption 1.

The sensor readings are i.i.d. from an unknown but fixed distribution $\nu(\bmx)$, known as the data generative distribution.

There is another machine, called the receiver, connected to the learner through a noiseless channel. The learner wants to send to the receiver the information it is getting from the outside world. As the communication channel between them has a limited bandwidth, the learner must compress the information before sending it to the receiver.

For compressing the sensor information, the learner chooses a parametric probability distribution family $\mathcal{P}=\{p(\bmx|\bmtheta):\bmtheta\in\bmTheta\}$, and wants to build a Bayesian mixture code for computing the code length of an observation $\bmx$, denoted by $\bar{\ell}(\bmx,\rho)$,

$$\bar{\ell}(\bmx,\rho)=\ln\frac{1}{\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]}\qquad(5)$$

where $\rho(\bmtheta)$, a probability distribution over $\bmTheta$, is the mixture distribution of the Bayesian code.

We denote $\bar{L}(\rho)$ the expected code length of the Bayesian mixture code defined by $\rho$,

$$\bar{L}(\rho)=\mathbb{E}_{\nu(\bmx)}[\bar{\ell}(\bmx,\rho)]\qquad(6)$$

The learning problem we study in this paper is how the learner, using only a finite set of i.i.d. observations $D$, can find a mixture distribution $\rho^\star$ which minimizes this expected code length for the observations coming from $\nu$, i.e. the outside world,

$$\begin{aligned}\rho^\star&=\arg\min_{\rho}\bar{L}(\rho)\qquad(7)\\&=\arg\min_{\rho}\mathbb{E}_{\nu(\bmx)}[\bar{\ell}(\bmx,\rho)]\\&=\arg\min_{\rho}\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]}\right]\end{aligned}$$
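The following sketch illustrates this learning problem with a made-up misspecified example: the data generative distribution $\nu$ over $\{0,1,2\}$ puts mass only on $0$ and $2$, while the model class is Binomial$(2,\theta)$. No single $\theta$ reproduces $\nu$, but a two-component mixture $\rho$ drives the expected code length close to the entropy $H(\nu)$:

```python
import math

nu = {0: 0.5, 2: 0.5}  # data generative distribution over {0, 1, 2}

def binom2(x, theta):
    """Binomial(2, theta) probability mass function."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2][x]

def code_length(predictive):
    """Expected code length E_nu[ln 1/p(x)] of a predictive distribution."""
    return sum(w * -math.log(predictive(x)) for x, w in nu.items())

# Best single parameter (a point-mass rho): theta = 0.5 by symmetry.
point = code_length(lambda x: binom2(x, 0.5))

# A two-component mixture rho = 0.5*delta_{0.01} + 0.5*delta_{0.99}.
mix = code_length(lambda x: 0.5 * binom2(x, 0.01) + 0.5 * binom2(x, 0.99))

entropy = math.log(2)           # H(nu)
assert entropy <= mix < point   # the mixture code compresses strictly better
```

Under misspecification, mixing over parameters is not just a convenience: it strictly improves the achievable expected code length over any single-parameter code.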

As the learner has a limited memory capacity, this imposes several restrictions when learning. The first one is on the number of parameters defining the distribution $p(\bmx|\bmtheta)$. In this way, we are under limited modelling capacity, because more powerful models tend to require a higher number of parameters, which we need to store in memory. For example, (Sutskever and Hinton, 2008) show that deep sigmoid neural networks can approximate any distribution over binary vectors to arbitrary accuracy, even when the width of each layer is limited to the length of the binary vectors, but these networks need exponential depth. So, a key assumption in this work is the following,

###### Assumption 2.

The learner operates under model misspecification, $\nu\notin\mathcal{P}$.

Moreover, we also assume that the model class is non-identifiable,

###### Assumption 3.

The model class is non-identifiable. We denote $\bmTheta_{\bmtheta'}$ the set of parameters which define the same distribution as $p(\bmx|\bmtheta')$, i.e. if $\bmtheta\in\bmTheta_{\bmtheta'}$ then $p(\bmx|\bmtheta)=p(\bmx|\bmtheta')$.

Another consequence derived from the limited memory capacity of the learner is an upper bound $B$ on the maximum code length the learner can assign to any observation in the support of the distribution $\nu$. In this way, we assume that

###### Assumption 4.

The learner operates in a setting where

$$0\leq\ln\frac{1}{p(\bmx|\bmtheta)}\leq B,$$

where $B$ is the upper bound.

Although this assumption is violated in many statistical learning settings, we argue it is not a strong assumption in many real machine learning settings when combined with the following assumption. This is the so-called digital sensor assumption, which states that the sensor readings are composed of finite-precision numbers,

###### Assumption 5.

The sensor provides readings in the form of vectors of real numbers under a finite-precision representation. It means that the support of the distribution $\nu$ is contained in $\mathcal{X}_F$, which denotes the space of real-number vectors of dimension $d$ that can be represented under a finite-precision scheme using $F$ bits to encode each element of the vector. So, we have that $\mathcal{X}_F$ is a finite set. Then, the number of bits needed to store a sensor reading is $dF$ bits.

Mathematically, $\nu$ can be defined as a probability mass function satisfying,

$$\mathbb{E}_{\nu(\bmx)}[g(\bmx)]=\sum_{\bmx'\in\mathcal{X}_F}w_{\bmx'}g(\bmx')\qquad(8)$$

for any function $g$, where the $w_{\bmx'}$ are positive scalar values parameterizing the distribution. They satisfy $w_{\bmx'}\geq 0$ and $\sum_{\bmx'\in\mathcal{X}_F}w_{\bmx'}=1$.

In Appendix A, we show how we can easily define many standard machine learning models satisfying Assumption 4, mainly because the observation space is naturally bounded due to Assumption 5 (i.e. $\mathcal{X}_F$ is a finite set), and because we are free to constrain the parameter space $\bmTheta$. On the other hand, the definition of $\nu$ in terms of a probability mass function over vectors of finite-precision numbers has no practical consequences when designing learning algorithms. This definition will only be used for the theoretical results of this work.

Finally, the learner always has to choose a uniquely decodable code (Cover and Thomas, 2012) to guarantee error-free communication with the receiver. So, the following assumption states that the family of Bayesian mixture codes considered for learning/coding satisfies the Kraft inequality (Cover and Thomas, 2012),

###### Assumption 6.

Any mixture distribution $\rho$ over $\bmTheta$ used by the learner to define a Bayesian mixture code satisfies the following property,

$$\sum_{\bmx'\in\mathcal{X}_F}\mathbb{E}_{\rho(\bmtheta)}[p(\bmx'|\bmtheta)]\leq 1.$$

This last assumption implies the Kraft inequality for the corresponding Bayesian mixture code, because

$$\sum_{\bmx'\in\mathcal{X}_F}e^{-\bar{\ell}(\bmx',\rho)}=\sum_{\bmx'\in\mathcal{X}_F}\mathbb{E}_{\rho(\bmtheta)}[p(\bmx'|\bmtheta)]\leq 1.$$

This assumption is also linked with Assumption 4 because it implies that $\bar{\ell}(\bmx,\rho)\geq 0$. We also have that, by definition, $\int p(\bmx|\bmtheta)d\bmx=1$, because $p(\bmx|\bmtheta)$ is a proper density. So, one should expect that $\sum_{\bmx'\in\mathcal{X}_F}\mathbb{E}_{\rho(\bmtheta)}[p(\bmx'|\bmtheta)]\approx 1$, because this sum is an approximation of that integral. Again, this assumption has no practical consequences when designing novel machine learning algorithms; it will only be used for the theoretical results of this work.

## 4 The three gaps of learning

We describe here the three gaps the learner must overcome to optimally solve the learning problem described in the previous section (Equation (7)). These three gaps are associated with three inequalities,

$$H(\nu)\leq\bar{L}(\rho)\leq\bar{L}_J(\rho)\leq\bar{L}_{PB}(D,\rho)$$

where $H(\nu)$ denotes the entropy of the data generative distribution, $\bar{L}(\rho)$ the expected code length of the Bayesian mixture code defined by the mixture distribution $\rho$, $\bar{L}_J(\rho)$ is the so-called Jensen bound, which depends on $\rho$, and $\bar{L}_{PB}(D,\rho)$ is the PAC-Bayes bound, which depends on $\rho$ and the observed data set $D$.

The learner has to minimize $\bar{L}(\rho)$ to maximize the level of compression of the observations coming from the outside world that are sent to the receiver. But the learner cannot perform this minimization over $\bar{L}(\rho)$ directly, because it has no access to the data generative distribution $\nu$. So, the learning strategy is to upper bound this function with some other function $\bar{L}_{PB}(D,\rho)$ which depends on the observed data set $D$. The learner will then perform the minimization over this bounding function with the hope that, by minimizing it, it will also minimize the $\bar{L}(\rho)$ function. In the next sections, we detail this learning strategy, present and analyze these bounding functions and their associated gaps, and discuss the role they play in this learning process.

### 4.1 The Kullback-Leibler gap

From information theory (Cover and Thomas, 2012), we know that the expected code length is bounded below by the entropy of the data generative distribution, denoted by $H(\nu)$,

###### Theorem 2.

Let us denote $H(\nu)$ the entropy of the distribution $\nu$; then we have that for any mixture distribution $\rho$ on $\bmTheta$,

$$H(\nu)\leq\bar{L}(\rho),$$

where the equality only holds if the data generative distribution and the posterior predictive distribution are equal.

###### Proof.

We have that,

$$\begin{aligned}KL(\nu(\bmx),p(\bmx))&=\mathbb{E}_{\nu(\bmx)}\left[-\ln\frac{p(\bmx)}{\nu(\bmx)}\right]=\sum_{\bmx'\in\mathcal{X}_F}-w_{\bmx'}\ln\frac{p(\bmx')}{w_{\bmx'}}\\&\geq-\ln\sum_{\bmx'\in\mathcal{X}_F}w_{\bmx'}\frac{p(\bmx')}{w_{\bmx'}}=-\ln\sum_{\bmx'\in\mathcal{X}_F}p(\bmx')\geq 0\end{aligned}$$

where the equality in the second step follows from Equation (8), the inequality in the third step is Jensen's inequality, and the last inequality follows from Assumption 6.

As we have proved that the KL distance between $\nu(\bmx)$ and $p(\bmx)$ is always greater than or equal to zero, we also have

$$\begin{aligned}KL(\nu(\bmx),p(\bmx))&=\mathbb{E}_{\nu(\bmx)}[\ln\nu(\bmx)]+\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]}\right]\\&=-H(\nu(\bmx))+\mathbb{E}_{\nu(\bmx)}[\bar{\ell}(\bmx,\rho)]\\&=-H(\nu(\bmx))+\bar{L}(\rho)\geq 0\end{aligned}$$

and the statement follows from the last inequality. ∎

We define the Kullback-Leibler gap as the difference between the expected code length achieved by the mixture distribution $\rho^\star$ minimizing $\bar{L}(\rho)$ (see Equation (7)) and the overall optimal code length defined by the entropy of $\nu$, $H(\nu)$. This gap corresponds to the Kullback-Leibler distance between the data generative distribution and the posterior predictive distribution defined by $\rho^\star$,

$$KL(\nu(\bmx),\mathbb{E}_{\rho^\star(\bmtheta)}[p(\bmx|\bmtheta)])=\bar{L}(\rho^\star)-H(\nu).\qquad(9)$$

#### The KL gap is null under perfect model specification

Under perfect model specification (i.e. $\nu\in\mathcal{P}$), the following result shows that an optimal mixture distribution is a Dirac-delta distribution centered on the parameter vector matching the data generative distribution,

###### Lemma 2.

Under perfect model specification (i.e. $\nu\in\mathcal{P}$), by definition, there exists a $\bmtheta^\star_\nu$ such that $p(\bmx|\bmtheta^\star_\nu)=\nu(\bmx)$. In this case, the mixture distribution defined as a Dirac-delta distribution centered on this parameter vector,

$$\rho^\star_\nu(\bmtheta)=\delta_{\bmtheta^\star_\nu}(\bmtheta),$$

is a minimizer of $\bar{L}(\rho)$, and

$$H(\nu)=\bar{L}(\rho^\star)=\bar{L}(\rho^\star_\nu),$$

where $\rho^\star$ is a minimizer of $\bar{L}(\rho)$, as defined in Equation (7). We can say that $\rho^\star_\nu$ is an optimal mixture distribution because it achieves the best possible expected code length, which is $H(\nu)$.

###### Proof.

If $p(\bmx|\bmtheta^\star_\nu)=\nu(\bmx)$, then the KL distance between $\nu(\bmx)$ and $\mathbb{E}_{\rho^\star_\nu(\bmtheta)}[p(\bmx|\bmtheta)]$ is equal to zero, and the equality $\bar{L}(\rho^\star_\nu)=H(\nu)$ follows from Equation (9). Then, we have $\bar{L}(\rho^\star_\nu)=H(\nu)\leq\bar{L}(\rho)$ for any $\rho$, where the inequality follows from Theorem 2. In consequence, $\rho^\star_\nu$ is a global minimizer of $\bar{L}(\rho)$. ∎

The above result can be extended to distributions over the non-identifiable parameters in $\bmTheta_{\bmtheta^\star_\nu}$,

###### Corollary 1.

Any distribution $\rho'$ whose support is contained in $\bmTheta_{\bmtheta^\star_\nu}$, the set of parameters defining the same distribution as $\bmtheta^\star_\nu$, is a global minimizer of $\bar{L}(\rho)$, and

$$H(\nu)=\bar{L}(\rho^\star)=\bar{L}(\rho').$$
###### Proof.

By definition, we have that if $\bmtheta\in\bmTheta_{\bmtheta^\star_\nu}$, then $p(\bmx|\bmtheta)=\nu(\bmx)$. In consequence, $\mathbb{E}_{\rho'(\bmtheta)}[p(\bmx|\bmtheta)]=\nu(\bmx)$, which implies that $\bar{L}(\rho')=H(\nu)$. ∎

So, the KL gap is null under perfect model specification.

#### The role of the KL gap in learning

The KL gap defines the first limitation the learner encounters when learning. First, there is a lower-bound limit on the quality of the coding the learner can achieve, and this limit is defined by the entropy of the data generative distribution, i.e. the nature of the outside world. But the KL gap also tells us about the loss the learner incurs when learning under a limited modelling capacity (c.f. Assumption 2), which is caused by the inherently limited memory capacity of the learner. So, the only way to address this gap is by increasing the modelling and memory capacity of the learner.

### 4.2 The Jensen gap

In this section, we analyze the consequences on our learning problem when employing Jensen bounds, i.e. the consequences of having to move an expectation out of a function. In the first part, we start by analyzing what happens when we try to minimize standard (first-order) Jensen bounds. And, in the second part, we repeat this same analysis for second-order Jensen bounds. We then show that the Jensen gap is null under perfect model specification, i.e. there are no consequences for applying Jensen bounds in this case, and we finish this section discussing the implications of this gap on learning.

#### First-order Jensen bounds

Let us denote $L(\bmtheta)$ the expected code length achieved by the parameter $\bmtheta$,

$$L(\bmtheta)=\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{p(\bmx|\bmtheta)}\right].$$

As the negative logarithm is a convex function, we can apply Jensen's inequality to derive an upper bound over $\bar{L}(\rho)$ using the expected value of $L(\bmtheta)$ with respect to $\rho$.

Let us denote $\bar{L}_J(\rho)$ the so-called first-order Jensen bound,

$$\bar{L}_J(\rho)=\mathbb{E}_{\rho(\bmtheta)}[L(\bmtheta)]\qquad(10)$$

Then, we have that

###### Lemma 3.

Any distribution $\rho$ over $\bmTheta$ satisfies the following inequality,

$$H(\nu)\leq\bar{L}(\rho)\leq\bar{L}_J(\rho).$$
###### Proof.

This result is a straightforward application of the Jensen inequality over the function,

$$\begin{aligned}\bar{L}(\rho)&=-\mathbb{E}_{\nu(\bmx)}[\ln\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]]\\&\leq-\mathbb{E}_{\nu(\bmx)}[\mathbb{E}_{\rho(\bmtheta)}[\ln p(\bmx|\bmtheta)]]=\mathbb{E}_{\rho(\bmtheta)}\left[\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{p(\bmx|\bmtheta)}\right]\right]\end{aligned}$$
∎
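Lemma 3 can be verified numerically. In this sketch (a Bernoulli model and a two-point mixture, both chosen for illustration), the exact expected code length is strictly smaller than its first-order Jensen bound:

```python
import math

nu = {0: 0.5, 1: 0.5}       # data generative distribution
rho = {0.2: 0.5, 0.8: 0.5}  # mixture distribution over theta

def p(x, theta):
    """Bernoulli(theta) probability mass function."""
    return theta if x == 1 else 1 - theta

# L_bar(rho) = E_nu[ -ln E_rho[p(x|theta)] ]  (exact expected code length)
L_bar = sum(w * -math.log(sum(r * p(x, t) for t, r in rho.items()))
            for x, w in nu.items())

# L_J(rho) = E_rho[ E_nu[ -ln p(x|theta) ] ]  (expectation moved out of the log)
L_J = sum(r * sum(w * -math.log(p(x, t)) for x, w in nu.items())
          for t, r in rho.items())

assert L_bar <= L_J  # Jensen's inequality: the first-order bound holds
```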

Due to model non-identifiability, there could potentially be many different minimizers of the Jensen bound, but the following two results clearly characterize a mixture distribution attaining the minimum,

###### Lemma 4.

Let us define $\bmtheta^\star_J$ a minimizer of $L(\bmtheta)$,

$$\bmtheta^\star_J=\arg\min_{\bmtheta}L(\bmtheta)$$

The mixture distribution $\rho^\star_J$, defined as a Dirac-delta distribution centered around $\bmtheta^\star_J$, is a minimizer of the Jensen bound,

$$\rho^\star_J=\arg\min_{\rho}\bar{L}_J(\rho),$$

and the density $p(\bmx|\bmtheta^\star_J)$ minimizes the KL distance with respect to $\nu(\bmx)$,

$$\bmtheta^\star_J=\arg\min_{\bmtheta}KL(\nu(\bmx),p(\bmx|\bmtheta))$$
###### Proof.

We first have that $L(\bmtheta)$ is well defined and non-negative because, due to Assumption 4, $0\leq\ln\frac{1}{p(\bmx|\bmtheta)}\leq B$. We also have that $\bar{L}_J(\rho)=\mathbb{E}_{\rho(\bmtheta)}[L(\bmtheta)]\geq L(\bmtheta^\star_J)$, because $L(\bmtheta)\geq L(\bmtheta^\star_J)$ for all $\bmtheta$. In consequence, no mixture distribution can attain a Jensen bound smaller than $L(\bmtheta^\star_J)$. So, $\rho^\star_J=\delta_{\bmtheta^\star_J}$ will always be a minimizer of the Jensen bound, because $\bar{L}_J(\rho^\star_J)=L(\bmtheta^\star_J)$. The last KL equality of this lemma follows because $KL(\nu(\bmx),p(\bmx|\bmtheta))=-H(\nu)+L(\bmtheta)$, and $H(\nu)$ is constant with respect to $\bmtheta$, so it does not influence the minimization. ∎

###### Corollary 2.

Any distribution $\rho'$ whose support is contained in $\bmTheta_{\bmtheta^\star_J}$, the set of parameters defining the same distribution as $\bmtheta^\star_J$, is a minimizer of the Jensen bound $\bar{L}_J(\rho)$.

###### Proof.

By definition, we have that if $\bmtheta\in\bmTheta_{\bmtheta^\star_J}$, then $p(\bmx|\bmtheta)=p(\bmx|\bmtheta^\star_J)$. In consequence, $\bar{L}_J(\rho')=L(\bmtheta^\star_J)$. ∎

In this sense, we define the Jensen gap as the loss in expected code length the learner incurs when using $\rho^\star_J$ instead of $\rho^\star$ for coding,

$$\bar{L}(\rho^\star_J)-\bar{L}(\rho^\star)$$

Figure 1 graphically illustrates this bound for toy probabilistic models, both in the case of perfect model specification and model miss-specification.

#### Second-order Jensen bounds

Here, we exploit a recent result (Liao and Berg, 2019) for deriving a tighter (second-order) Jensen bound. We will refer to this bound as the $J_2$ Jensen bound, denoted by $\bar{L}_{J_2}(\rho)$,

$$\bar{L}_{J_2}(\rho)=\mathbb{E}_{\rho(\bmtheta)}[L(\bmtheta)]-V(\rho)\qquad(11)$$

where $V(\rho)$ denotes the expected (normalized) variance of $p(\bmx|\bmtheta)$ with respect to $\rho$,

$$V(\rho)=\mathbb{E}_{\nu(\bmx)}\left[\frac{1}{2\max_{\bmtheta}p(\bmx|\bmtheta)^2}\mathbb{E}_{\rho(\bmtheta)}\left[(p(\bmx|\bmtheta)-p(\bmx))^2\right]\right].$$

where the maximization operation has to be performed only over the support of $\rho$. The following result proves that $\bar{L}_{J_2}(\rho)$ also bounds the $\bar{L}(\rho)$ function,

###### Theorem 3.

Any distribution $\rho$ over $\bmTheta$ satisfies the following inequality,

$$H(\nu)\leq\bar{L}(\rho)\leq\bar{L}_{J_2}(\rho).$$
###### Proof.

Applying Taylor's theorem to $\ln y$ about $\mu$ with a mean-value form of the remainder gives,

$$\ln y=\ln\mu+\frac{1}{\mu}(y-\mu)-\frac{1}{2g(y)^2}(y-\mu)^2,$$

where $g(y)$ is a real value between $y$ and $\mu$. Therefore, taking $y=p(\bmx|\bmtheta)$ and $\mu=p(\bmx)=\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]$, and noting that the linear term vanishes in expectation,

$$\mathbb{E}_{\rho(\bmtheta)}[\ln p(\bmx|\bmtheta)]=\ln\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]-\mathbb{E}_{\rho(\bmtheta)}\left[\frac{1}{2g(p(\bmx|\bmtheta))^2}(p(\bmx|\bmtheta)-p(\bmx))^2\right]$$

Rearranging, we have

$$\begin{aligned}-\ln\mathbb{E}_{\rho(\bmtheta)}[p(\bmx|\bmtheta)]&=-\mathbb{E}_{\rho(\bmtheta)}[\ln p(\bmx|\bmtheta)]-\mathbb{E}_{\rho(\bmtheta)}\left[\frac{1}{2g(p(\bmx|\bmtheta))^2}(p(\bmx|\bmtheta)-p(\bmx))^2\right]\\&\leq-\mathbb{E}_{\rho(\bmtheta)}[\ln p(\bmx|\bmtheta)]-\frac{1}{2\max_{\bmtheta}p(\bmx|\bmtheta)^2}\mathbb{E}_{\rho(\bmtheta)}\left[(p(\bmx|\bmtheta)-p(\bmx))^2\right],\end{aligned}$$

where the inequality is derived from the fact that $(p(\bmx|\bmtheta)-p(\bmx))^2$ is always positive and $g(p(\bmx|\bmtheta))$, which is a real number between $p(\bmx|\bmtheta)$ and $p(\bmx)$, is upper bounded by $\max_{\bmtheta}p(\bmx|\bmtheta)$.

Finally, the result of the theorem is derived by taking expectation with respect to $\nu(\bmx)$ on both sides of the above inequality. ∎

We can see that the $J_2$ Jensen bound is tighter than the first-order Jensen bound because it introduces a quadratic or variance term which is always positive. This avoids the problems of the first-order Jensen bound: when minimizing this new bound, we will not necessarily end up with a Dirac-delta distribution, because the variance term pushes the minimum away from this extreme solution.
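A numeric sketch of the chain $H(\nu)\leq\bar{L}(\rho)\leq\bar{L}_{J_2}(\rho)\leq\bar{L}_J(\rho)$, using an illustrative misspecified setup ($\nu$ over $\{0,2\}$, a Binomial$(2,\theta)$ model, and a two-point mixture $\rho$; all values are made up for illustration):

```python
import math

nu = {0: 0.5, 2: 0.5}       # data generative distribution
rho = {0.3: 0.5, 0.7: 0.5}  # two-point mixture over theta

def binom2(x, t):
    """Binomial(2, t) probability mass function."""
    return [(1 - t) ** 2, 2 * t * (1 - t), t ** 2][x]

def pred(x):                # posterior predictive E_rho[p(x|theta)]
    return sum(r * binom2(x, t) for t, r in rho.items())

# Exact expected code length and the first-order Jensen bound.
L_bar = sum(w * -math.log(pred(x)) for x, w in nu.items())
L_J = sum(r * sum(w * -math.log(binom2(x, t)) for x, w in nu.items())
          for t, r in rho.items())

# Variance term V(rho); the max runs over the support of rho only.
V = sum(w * sum(r * (binom2(x, t) - pred(x)) ** 2 for t, r in rho.items())
        / (2 * max(binom2(x, t) for t in rho) ** 2)
        for x, w in nu.items())
L_J2 = L_J - V

H_nu = math.log(2)
assert H_nu <= L_bar <= L_J2 <= L_J  # the full chain of inequalities holds
```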

The following result shows that the $J_2$ Jensen bound is not affected by model non-identifiability (c.f. Assumption 3). That is, we cannot trick the minimization of the $J_2$ Jensen bound by taking a Dirac-delta distribution minimizing the first-order Jensen bound and then expanding its support to other parameters defining the same probability density: even in this case, the variance term of the $J_2$ Jensen bound is zero. This is formalized in the following result.

###### Lemma 5.

For any distribution $\rho'$ whose support is contained in $\bmTheta_{\bmtheta^\star_J}$, the set of parameters defining the same distribution as $\bmtheta^\star_J$, the variance term of the $J_2$ Jensen bound is equal to zero, $V(\rho')=0$.

###### Proof.

Because the support of $\rho'$ is contained in $\bmTheta_{\bmtheta^\star_J}$, we have that $p(\bmx|\bmtheta)=p(\bmx|\bmtheta^\star_J)$ for any $\bmtheta$ in the support of $\rho'$. In consequence, $p(\bmx)=\mathbb{E}_{\rho'(\bmtheta)}[p(\bmx|\bmtheta)]=p(\bmx|\bmtheta^\star_J)$, because $p(\bmx|\bmtheta)$ is constant over the support of $\rho'$. So, $\mathbb{E}_{\rho'(\bmtheta)}[(p(\bmx|\bmtheta)-p(\bmx))^2]=0$ and $V(\rho')=0$. ∎

The following result tells us that the expected code length of a mixture distribution minimizing the $J_2$ Jensen bound is always at least as good as that of a mixture distribution minimizing the first-order Jensen bound,

###### Lemma 6.

Let us denote $\rho^\star_{J_2}$ and $\rho^\star_J$ two mixture distributions minimizing the $J_2$ Jensen bound, $\bar{L}_{J_2}(\rho)$, and the first-order Jensen bound, $\bar{L}_J(\rho)$, respectively. The following inequalities hold,

$$H(\nu)\leq\bar{L}(\rho^\star)\leq\bar{L}(\rho^\star_{J_2})\leq\bar{L}(\rho^\star_J).$$
###### Proof.

Let us define $\Delta$ the space of all the mixture distributions which are a Dirac-delta distribution. Then, we have that the minimum of the $J_2$ Jensen bound over the mixture distributions $\rho\in\Delta$ can be written as,

$$\min_{\rho\in\Delta}\mathbb{E}_{\rho(\bmtheta)}[L(\bmtheta)]-V(\rho)=\min_{\rho\in\Delta}\mathbb{E}_{\rho(\bmtheta)}[L(\bmtheta)]=\mathbb{E}_{\rho^\star_J(\bmtheta)}[L(\bmtheta)],$$

where the first equality follows because if $\rho\in\Delta$ then $V(\rho)=0$, and the second equality follows from Lemma 4. We also have that

$$\mathbb{E}_{\rho^\star_{J_2}(\bmtheta)}[L(\bmtheta)]-V(\rho^\star_{J_2})\leq\mathbb{E}_{\rho^\star_J(\bmtheta)}[L(\bmtheta)]$$

because, by definition, the left-hand side of the inequality is the minimum of the $J_2$ Jensen bound over all mixture distributions $\rho$, while the right-hand side is the minimum of the $J_2$ Jensen bound restricted to the mixture distributions $\rho\in\Delta$.

By chaining the above inequality with the $J_2$ Jensen bound inequality of Theorem 3, we have

$$\bar{L}(\rho^\star_{J_2})\leq\mathbb{E}_{\rho^\star_J(\bmtheta)}[L(\bmtheta)]$$

Finally, we have that,

$$\begin{aligned}\mathbb{E}_{\rho^\star_J(\bmtheta)}[L(\bmtheta)]&=L(\bmtheta^\star_J)=\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{p(\bmx|\bmtheta^\star_J)}\right]\\&=\mathbb{E}_{\nu(\bmx)}\left[\ln\frac{1}{\mathbb{E}_{\rho^\star_J(\bmtheta)}[p(\bmx|\bmtheta)]}\right]=\bar{L}(\rho^\star_J),\end{aligned}$$

where the first equality follows from the property of Dirac-delta distributions (see Equation (4)), the second equality follows from the definition of $L(\bmtheta)$, the third equality follows again from the property of Dirac-delta distributions, and the last equality follows from the definition of $\bar{L}(\rho)$.

The inequality $H(\nu)\leq\bar{L}(\rho^\star)$ follows from Theorem 2, and $\bar{L}(\rho^\star)\leq\bar{L}(\rho^\star_{J_2})$ follows because, by definition, $\rho^\star$ is a global minimizer of $\bar{L}(\rho)$. ∎

We define the Jensen gap in this case as the loss incurred by the application of the $J_2$ Jensen bound,

$$\bar{L}(\rho^\star_{J_2})-\bar{L}(\rho^\star)$$

i.e. the difference between the expected code length achieved by the mixture distribution minimizing the $J_2$ Jensen bound and the expected code length achieved by the optimal mixture distribution $\rho^\star$. Figure 1 graphically illustrates this gap for a toy probabilistic model.

#### The Jensen gap is null under perfect model specification

In the case of perfect model specification, we have that both the first-order Jensen bound and the $J_2$ Jensen bound are tight, and that $\rho^\star_J$ and $\rho^\star_{J_2}$ are optimal,

###### Lemma 7.

If $\nu\in\mathcal{P}$, then we have,

$$H(\nu)=\bar{L}(\rho^\star)=\bar{L}(\rho^\star_{J_2})=\bar{L}(\rho^\star_J),$$

and, then, $\rho^\star_J$ and $\rho^\star_{J_2}$ are both optimal mixture distributions.

###### Proof.

If $\nu\in\mathcal{P}$, then, by definition, $\min_{\bmtheta}KL(\nu(\bmx),p(\bmx|\bmtheta))=0$. In consequence, $p(\bmx|\bmtheta^\star_J)=\nu(\bmx)$, because $\bmtheta^\star_J$ is a minimizer of this KL distance according to Lemma 4. Then, by Equation (9), we have $\bar{L}(\rho^\star_J)=H(\nu)$. And $\bar{L}(\rho^\star)=H(\nu)$ because of Lemma 2. Finally, the lemma statement follows by also considering Lemma 6, which states that $\bar{L}(\rho^\star)\leq\bar{L}(\rho^\star_{J_2})\leq\bar{L}(\rho^\star_J)$. ∎

Lemma 7 shows that the Jensen gap is null under perfect model specification. So, minimizing the $J_2$ Jensen bound will provide the same result as minimizing the first-order Jensen bound, and both will be optimal. Again, this is graphically illustrated in Figure 1 for a toy probabilistic model.

#### The role of the Jensen gap in learning

This gap defines a second limitation the learner encounters when learning. It comes directly from the cost of moving an expectation out of a function. We have seen in this section that first-order Jensen bounds can induce the learner to pick a suboptimal solution. The introduction of a second-order bound alleviates this problem and leads to better solutions (see Lemma 6). The size of this gap, and the quality of the learning, will depend on the quality of the Jensen bound we employ. The only way to address this gap is by using tighter Jensen bounds. Later, in Section 5.1, we will introduce a new second-order Jensen bound which is tighter, but more elaborate, than the second-order Jensen bound described here.

### 4.3 The PAC-Bayes gap

In this last step, we introduce the PAC-Bayes bound. As we saw in Section 2.2, PAC-Bayes bounds are not deterministic; they are probabilistic in the sense that they only hold with a given probability $1-\xi$, with $\xi\in(0,1)$, over draws of the training data $D$.

The PAC-Bayes bound for $\bar{L}_J(\rho)$ follows directly from the application of Theorem 1. We denote $\bar{L}_{PB}(D,\rho)$ this first-order PAC-Bayes bound, which is defined as,

$$\bar{L}_{PB}(D,\rho)=\frac{B}{1-e^{-B}}\left(\mathbb{E}_{\rho(\bmtheta)}[\hat{L}(D,\bmtheta)]+\frac{KL(\rho,\pi)+\ln\frac{1}{\xi}}{n}\right)$$

where $\pi$ is a prior distribution over $\bmTheta$, $B$ is the bound of the log-loss (see Assumption 4), and $\hat{L}(D,\bmtheta)$ is the empirical log-loss of $\bmtheta$,

$$\hat{L}(D,\bmtheta)=\frac{1}{n}\sum_{i=1}^{n}\ln\frac{1}{p(\bmx_i|\bmtheta)}=-\frac{1}{n}\ln p(D|\bmtheta)$$

Then, we prove that $\bar{L}_{PB}(D,\rho)$ bounds $\bar{L}_J(\rho)$,

###### Lemma 8.

For any prior distribution $\pi$ over $\bmTheta$ and for any $\xi\in(0,1)$, with probability at least $1-\xi$ over draws of training data $D\sim\nu^n(\bmx)$, for all mixture distribution $\rho$ over $\bmTheta$,

$$H(\nu)\leq\bar{L}(\rho)\leq\bar{L}_J(\rho)\leq\bar{L}_{PB}(D,\rho)$$
###### Proof.

It follows directly from the application of Theorem 1 to the normalized log-loss $\frac{1}{B}\ln\frac{1}{p(\bmx|\bmtheta)}$. ∎

The derivation of the PAC-Bayes bound for $\bar{L}_{J_2}(\rho)$ is a bit more involved. We first need to rewrite the $\bar{L}_{J_2}(\rho)$ function,

###### Lemma 9.

For any distribution $\rho$ over $\bmTheta$, the $J_2$ Jensen bound can be expressed as follows,

$$\bar{L}_{J_2}(\rho)=\mathbb{E}_{\rho(\bmtheta,\bmtheta')}\left[\mathbb{E}_{\nu(\bmx)}[\ell(\bmx,\bmtheta,\bmtheta')]\right],$$

where $\rho(\bmtheta,\bmtheta')=\rho(\bmtheta)\rho(\bmtheta')$ denotes the product distribution of two independent copies, and $\ell(\bmx,\bmtheta,\bmtheta')$ is defined as

$$\ell(\bmx,\bmtheta,\bmtheta')=\ln\frac{1}{p(\bmx|\bmtheta)}-$$