# PAC-Bayesian Transportation Bound

Kohei Miyaguchi
IBM Research - Tokyo
kohei.miyaguchi3@ibm.com
###### Abstract

We present a new generalization error bound, the PAC-Bayesian transportation bound, unifying the PAC-Bayesian analysis and the generic chaining method from the viewpoint of optimal transportation. The proposed bound is the first PAC-Bayesian framework that characterizes the cost of de-randomization of stochastic predictors for any Lipschitz loss function. As an example, we give an upper bound on the de-randomization cost of spectrally normalized neural networks (NNs) to evaluate how much randomness contributes to the generalization of NNs.

Preprint. Under review.

## 1 Introduction

The goal of statistical learning is to acquire a predictor $f$ that (approximately) minimizes a risk function $r$ in an inductive way. In doing so, one is only allowed to access some proxy function $\hat{r}_S$ based on noisy data $S$. Therefore, the goal of statistical learning theory is to describe the behavior of the deviation of the proxy from the true risk,

$$\Delta_S(f) := r(f) - \hat{r}_S(f). \tag{1}$$

In particular, we are interested in computable upper bounds on the tail or expectation of $\Delta_S(f)$, since adding such a bound to the proxy $\hat{r}_S(f)$ bounds the true risk $r(f)$ by a computable function, with high probability or in expectation, respectively.

The PAC-Bayesian analysis (McAllester, 1999; Catoni, 2007) is one of the frameworks of such statistical learning theory based on the strong duality of the Kullback–Leibler (KL) divergence (Donsker and Varadhan, 1975). Below, we highlight two major advantages of the PAC-Bayesian approach.

• It is tight and transparent. The gap of the resulting bound is optimal in the sense of the strong duality. Moreover, one can easily interpret the meaning of each term in the bound and where it comes from.

• It is usable in practical situations. It has also been confirmed repeatedly in the literature that it produces non-vacuous bounds on the generalization of complex predictive models such as deep and large-scale neural networks (Dziugaite and Roy, 2017; Zhou et al., 2019).

In particular, the second point explains the growing attention that the PAC-Bayesian theory has recently attracted in the machine learning community (Neyshabur et al., 2018; Dziugaite and Roy, 2018; Mou et al., 2018; Nagarajan and Kolter, 2019).

The issue we address here is that the PAC-Bayesian analysis cannot handle deterministic predictors. More precisely, it provides upper bounds on the expectation of the deviation function with respect to distributions $Q$ over predictors, namely $\mathbb{E}_{f\sim Q}[\Delta_S(f)]$, and the upper bound diverges in almost all deterministic settings $Q=\delta_f$. This is problematic from the following two viewpoints.

• (I1) Stochastic predictors are rarely used in practice. (Although some algorithms, including stochastic gradient descent, are stochastic in their nature, they are often not used in the way the PAC-Bayesian framework suggests.) The use of deterministic predictors is fairly common, and their performance is not even close to what is predicted by the naïve PAC-Bayesian theory.

• (I2) More importantly, the contribution of the stochasticity is unexplained. Previous studies have pointed out repeatedly, from both empirical and theoretical perspectives, that introducing stochasticity into prediction improves the predictive performance, sometimes drastically (Welling and Teh, 2011; Srivastava et al., 2014; Neelakantan et al., 2015; Russo and Zou, 2015). However, it is unclear whether it is also the stochasticity that makes the tightness of the PAC-Bayesian bounds possible, since deterministic predictors cannot be accurately described with the naïve PAC-Bayesian theory.

To address these issues (I1) and (I2), we present a new theoretical analysis that unifies the PAC-Bayesian analysis and the chaining method. Chaining (Dudley, 1967; Talagrand, 2001) is the technique that gives the tightest known upper bound (up to a constant factor) on the supremum of the deviation function, $\sup_{f\in\mathcal{F}}\Delta_S(f)$, and is understood as a process of discretization refinement that starts from finitely discretized models and reaches the limit of continuous models. We extend this idea to a process of noise shrinking that starts from stochastic predictors and reaches the limit of deterministic predictors. As a result, we obtain a risk bound that is interpreted as the KL-weighted cost of transportation over predictor manifolds. Our contributions are summarized as follows.

• We derive the PAC-Bayesian transportation bound, which unifies the conventional PAC-Bayesian bounds and the chaining bounds in a continuous way. It allows us, for the first time, to assess the effect of stochasticity in prediction and relate the PAC-Bayesian stochastic predictors to deterministic ones.

• As an example, we give (an upper bound of) the noise reduction cost of spectrally normalized neural networks (NNs) and the corresponding de-randomized generalization error bound. The bound is as tight as the conventional PAC-Bayesian risk bound for such NNs with margin loss functions (Neyshabur et al., 2018) up to a logarithmic factor, and hence indicates that the same convergence rate can be achieved without stochasticity in this specific setting.

The rest of the paper is organized as follows. First, the problem setting is detailed in Section 2. Then, in Section 3, the main result is presented with a proof sketch. The interpretation and comparison to previous studies are also included here. In Section 4, we utilize the proposed bound for analysing the risk of neural networks. Finally, we give several concluding remarks in Section 5.

## 2 Problem Setting

In this section, we first overview the PAC-Bayesian framework to specify our focus, and then introduce the mathematical notation to describe the precise problem we are concerned with.

### 2.1 The PAC-Bayesian Framework in a Nutshell

The goal of the PAC-Bayesian analysis is to derive high-probability upper bounds on the deviation function given by (1). This is difficult because the predictors $f$ are learned from the data $S$, which creates a nontrivial statistical dependency between these two variables. To avoid this problem, the PAC-Bayesian theory suggests two key steps, namely linearization and decoupling.

In the first step, linearization, the predictors are generalized to be stochastic, i.e., upon prediction, a random predictor $f$ is drawn from some probability measure $Q$, called a posterior, which can depend on the data $S$ in a nontrivial way. The performance of such stochastic predictors is measured by the expectation $Qr = \mathbb{E}_{f\sim Q}[r(f)]$, and hence the deviation is also studied in the form of an expectation, $Q\Delta_S$. Note that the ordinary deterministic predictors are special instances of stochastic ones, as they are recovered by taking the Dirac's delta measures $Q=\delta_f$. In this way, any kind of interaction between $f$ and $S$ is now formulated as the bilinear pairing of a predictor $Q$ and a data-dependent function $\Delta_S$.

Now, in the second step, the bilinear pairing is decoupled with Fenchel–Young type inequalities, namely $Q\Delta_S \le \xi(Q) + \xi^*(\Delta_S)$, which allow us to deal with predictors and data separately. Here, $(\xi,\xi^*)$ denotes a pair of Fenchel conjugate functions. Specifically, the standard PAC-Bayesian analysis exploits the (strong) duality between the KL divergence and the log-integral-exp function: for any $\beta>0$,

$$Q\Delta_S \le \frac{1}{\beta}\left\{D_{\mathrm{KL}}(Q,Q_0) + \ln Q_0\left[e^{\beta\Delta_S}\right]\right\}, \tag{2}$$

where $Q_0$ is any data-independent distribution called a prior (see Appendix in the supplemental material for the proof). Finally, the data-dependent term, $\ln Q_0[e^{\beta\Delta_S}]$, is bounded with concentration inequalities, such as Hoeffding's inequality and Bernstein's inequality, and we obtain high-probability upper bounds on $Q\Delta_S$ as desired.

Remark   The upper bound (2) is optimal in the sense that it cannot be uniformly improved, since it is derived from the strong Fenchel duality. Moreover, it can be thought of as a generalization of the union bound technique, which is recovered by letting the priors $Q_0$ be discrete measures and the posteriors $Q$ be Dirac's delta measures. Hence, by taking non-Dirac posteriors, we obtain tighter upper bounds on $Q\Delta_S$.

Our focus   However, the bound (2) has a flaw: it is meaningless when $Q$ is not absolutely continuous with respect to $Q_0$, since the KL term diverges. In particular, if one takes a non-atomic prior $Q_0$, e.g., a Gaussian measure, then the bound diverges for every Dirac's delta posterior $Q=\delta_f$. More generally, when the model space is continuous, almost every deterministic predictor is ruled out under the naïve PAC-Bayesian bound (2). This is the problem we focus on in this paper.
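The divergence described above can be illustrated numerically. The following is a minimal sketch under assumptions of our own (a four-point discrete prior and a one-dimensional standard Gaussian prior, with hypothetical helper names `kl_discrete` and `kl_gauss_1d`); it is not taken from the paper. It shows that a Dirac posterior under a discrete prior pays the union-bound penalty $\ln(1/Q_0(\{f\}))$, whereas under a Gaussian prior the KL term grows without bound as the posterior noise shrinks.

```python
import math

def kl_discrete(q, q0):
    """KL(q || q0) for finite distributions given as lists of probabilities."""
    total = 0.0
    for qi, q0i in zip(q, q0):
        if qi == 0.0:
            continue  # the convention 0 * ln(0 / x) = 0
        if q0i == 0.0:
            return math.inf  # q is not absolutely continuous w.r.t. q0
        total += qi * math.log(qi / q0i)
    return total

def kl_gauss_1d(mu, sigma, mu0=0.0, sigma0=1.0):
    """Closed-form KL(N(mu, sigma^2) || N(mu0, sigma0^2))."""
    return (math.log(sigma0 / sigma)
            + (sigma ** 2 + (mu - mu0) ** 2) / (2.0 * sigma0 ** 2) - 0.5)

# Discrete prior, Dirac posterior: KL equals ln(1 / Q0({f})), the union-bound penalty.
prior = [0.5, 0.25, 0.125, 0.125]   # illustrative prior (assumption)
dirac = [0.0, 0.0, 1.0, 0.0]        # deterministic predictor: third atom
kl_union = kl_discrete(dirac, prior)

# Gaussian prior, Gaussian posterior shrinking towards a Dirac mass at 0.3:
# the KL term increases like ln(1 / sigma) and diverges in the limit.
kl_shrinking = [kl_gauss_1d(0.3, s) for s in (1.0, 1e-1, 1e-3, 1e-6)]
```

In this sketch the Gaussian KL values increase monotonically as the standard deviation decreases, which is the behavior that rules out deterministic predictors in (2).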

### 2.2 Mathematical Formulation

Conventions   For any measurable spaces , we denote by the space of probability distributions over . For any binary function , let and denote the partially applied functions such that and . Moreover, if the function is measurable, we denote its expectation with respect to and by , where and are arbitrary measurable spaces. We also reserve another notation for expectations: the expectation with respect to any random variables may be denoted by , or just if any confusion is unlikely.

Basic notation   Let $z_1,\ldots,z_n$ be i.i.d. random variables corresponding to single observations subject to an unknown distribution $P$. Let $S=(z_1,\ldots,z_n)$ be the collection of such variables, $S\sim P^n$, from which we want to learn a good predictor. Let $\mathcal{F}$ be a measurable space of predictors, such as neural networks and SVMs with their parameters unspecified, and let $f\in\mathcal{F}$ denote predictors with specific parameters.

Let $\hat{r}(f,z)$ be the loss of the prediction made by the predictor $f$ upon the observation $z$. Also, we define the risk of the predictor by $r(f) := P\hat{r}(f) = \mathbb{E}_{z\sim P}[\hat{r}(f,z)]$.

Problem statement   Our objective here is to find a predictor $f$ with small risk $r(f)$. However, since $r$ is inaccessible as is, we leverage the empirical risk measure to approximate the true objective $r$. Let $P_S$ be the empirical distribution with respect to the sample $S$. Then, the empirical risk of $f$ is given by

$$P_S[\hat{r}(f)] = \frac{1}{n}\sum_{i=1}^{n}\hat{r}(f,z_i),$$

which is a random function whose expectation coincides with the true risk, $\mathbb{E}_{S\sim P^n}P_S[\hat{r}(f)] = r(f)$. As it fluctuates around the mean, we are motivated to study the tail probability of the deviation. Define the deviation function (of a single observation) by

$$\Delta(f,z) := r(f) - \hat{r}(f,z). \tag{3}$$

Then, we want to find a tight upper bound on the sample-averaged deviation function of posterior distributions $Q$, including deterministic ones $Q=\delta_f$, in the form of

$$QP_S\Delta \le U(Q,S) + Z(S), \tag{4}$$

where $U(Q,S)$ is a computable function and $Z(S)$ is a negligible random variable independent of $Q$ that satisfies either (i) a high-probability bound or (ii) an expectation bound.

Finally, we conclude this section by introducing the definition of the Kullback–Leibler (KL) divergence, which is one of the key quantities in the PAC-Bayesian analysis.

###### Definition 1 (The Kullback–Leibler (KL) divergence)

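Let $Q$ and $Q_0$ be probability distributions over $\mathcal{F}$, where $\mathcal{F}$ is a measurable space. The KL divergence between $Q$ and $Q_0$ is given by

$$D_{\mathrm{KL}}(Q,Q_0) := Q\left[\ln\frac{\mathrm{d}Q}{\mathrm{d}Q_0}\right],$$

when $Q$ is absolutely continuous with respect to $Q_0$. Otherwise, $D_{\mathrm{KL}}(Q,Q_0)=\infty$.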

## 3 Main Result

We now present the main result. In this section, we only show the main theorem and its proof sketch. The full version with the rigorous proof is relegated to the supplemental material.

### 3.1 PAC-Bayesian Transportation Bound

Our goal is to find upper bounds in the form of (4) that are applicable to both stochastic and deterministic predictors. To this end, we combine the PAC-Bayesian framework with the chaining method. Chaining is understood as a process of relating one predictor to its neighbors and moving towards some destination through a chain of predictors. We extend this idea and take the infinitesimal limit in which the distance between consecutive predictors in the chain, measured in an appropriate distance over $\mathcal{F}$, tends to zero. As a result, we get a continuous transportation of predictors instead of the chain.

Riemannian structure   To facilitate the continuous transportation of posterior distributions, we endow the predictor space $\mathcal{F}$ with the structure of a Riemannian manifold. For simplicity, we assume that $\mathcal{F}$ is a differentiable manifold diffeomorphic to $\mathbb{R}^d$ and that there exists a bijective local coordinate $\varphi:\mathcal{F}\to\mathbb{R}^d$. Let $g$ be a smooth Riemannian metric associated with $\varphi$ that satisfies uniform boundedness conditions over all $f\in\mathcal{F}$. Let $d_g$ be the corresponding geodesic distance on $\mathcal{F}$.

The following is our main assumption.

###### Assumption 2 (Lipschitz loss)

The loss function $\hat{r}$ is $L$-Lipschitz continuous with respect to the geodesic distance $d_g$, i.e.,

$$|\hat{r}(f,z)-\hat{r}(g,z)| \le L\,d_g(f,g),\quad \forall f,g\in\mathcal{F},$$

$P$-almost surely in $z$.

Under Assumption 2, the loss function $\hat{r}$ (and hence the deviation function $\Delta$) is weakly differentiable almost everywhere by Rademacher's theorem. Thus, in addition to the base metric $g$, we also consider a data-dependent (possibly non-smooth) metric,

$$\hat{g}(f) := \frac{1}{2}(P+P_S)\left[\nabla\Delta(f)\nabla\Delta(f)^\top\right],$$

where $\nabla$ is the weak gradient operator on $\mathcal{F}$. Note that $\hat{g}$ is inaccessible in general since it depends on $P$. However, it measures the variations of predictors in terms of the fluctuation of the deviation function, thereby quantifying the 'true' distance in $\mathcal{F}$. On the other hand, the base metric $g$ can be thought of as representing one's prior belief on the true distance.

Space of transportation   Now, let us introduce the mathematical objects for the process of noise shrinkage. Consider the measurable space of all smooth curves $\gamma:[0,1]\to\mathcal{F}$ endowed with the uniform distance. We denote by $W$ a distribution over this space, which represents a transportation of predictors from stochastic ones to deterministic (or less stochastic) ones. The velocity of the transportation at time $t\in[0,1]$ relative to a metric $g$ is measured by $v_g(t;W)$, whereas the length of the transportation is written as $\int_0^1 v_g(t;W)\,\mathrm{d}t$.

Let $Q^W_t$ be the snapshot distribution of $W$ at time $t$ (i.e., if $\gamma\sim W$, then $\gamma(t)\sim Q^W_t$ for all $t\in[0,1]$), and let

$$F^W_S(t) := Q^W_t P_S\Delta. \tag{5}$$

Thus, the deviation of the final snapshot at $t=1$ is decomposed as

$$Q^W_1 P_S\Delta = \underbrace{Q^W_0 P_S\Delta}_{\text{Initial deviation}} + \underbrace{F^W_S(1)-F^W_S(0)}_{\text{Additional deviation with transportation } W}, \tag{6}$$

where the first term is bounded with some standard concentration inequalities (where $Q^W_0$ is fixed) or with the naïve PAC-Bayesian analysis (where $Q^W_0$ is also learned from the data $S$). Therefore, our focus is to show an upper bound on the second term.

The following is our main theorem.

###### Theorem 3 (Transportation bound)

Fix a confidence level $\delta$ and a prior distribution $Q_0$. Suppose that Assumption 2 holds. Then, for all transportations $W$,

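$$
\begin{aligned}
F^W_S(1)-F^W_S(0) \le{}& \underbrace{2\int_0^1 v_{\hat{g}}(t;W)\sqrt{\frac{D_{\mathrm{KL}}(Q^W_t,Q_0)+\ln\frac{C}{\delta}}{n-1}}\,\mathrm{d}t}_{\text{Cost of chaining } \Gamma_C(W,Q_0)} \\
&+ \underbrace{\int_0^1 v_g(t;W)\left(\sqrt{\frac{2\,Q^W_t\operatorname{tr}(\hat{g}g^{-1})}{n-1}}+\frac{4}{n^{3/2}}\right)\mathrm{d}t}_{\text{Cost of transportation } \Gamma_T(W)} + \underbrace{Z(Q_0,S)}_{\text{Negligible noise}},
\end{aligned}
\tag{7}
$$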

where .

Here, $Z(Q_0,S)$ is a random variable that is controlled both with high probability (high-probability bound) and in expectation (expectation bound).

Remark (Interpretation)   The main components of the bound (7) are two weighted transportation costs corresponding to $U(Q,S)$ in Inequality (4), whereas the remaining term $Z(Q_0,S)$ is the prior-dependency-aware version of $Z(S)$ and is negligible. The first term is a KL-weighted cost of transportation with respect to the data-dependent metric $\hat{g}$. It is analogous to the key quantity of the chaining bound known as Dudley's entropy integral. Thus, it can be interpreted as the cost of chaining. On the other hand, the second term is a trace-weighted cost with respect to the predefined metric $g$. Since the trace factor is bounded as $\operatorname{tr}(\hat{g}g^{-1})\le L^2$ (see Proposition 11 in the supplemental material), it is bounded by a quantity proportional to the length of transportation, $\int_0^1 v_g(t;W)\,\mathrm{d}t$. Therefore, it can be interpreted as the additional cost for transportation.
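The last step of this remark can be written out explicitly. The display below is a worked sketch that assumes the second term of (7) is read as $\Gamma_T(W)=\int_0^1 v_g(t;W)\big(\sqrt{2\,Q^W_t\operatorname{tr}(\hat{g}g^{-1})/(n-1)}+4/n^{3/2}\big)\,\mathrm{d}t$ and uses only the trace bound $\operatorname{tr}(\hat{g}g^{-1})\le L^2$ of Proposition 11.

```latex
\Gamma_T(W)
  = \int_0^1 v_g(t;W)
      \left(\sqrt{\frac{2\,Q^W_t \operatorname{tr}(\hat{g} g^{-1})}{n-1}}
            + \frac{4}{n^{3/2}}\right)\mathrm{d}t
  \le \left(L\sqrt{\frac{2}{n-1}} + \frac{4}{n^{3/2}}\right)
      \int_0^1 v_g(t;W)\,\mathrm{d}t .
```

Under this reading, the cost of transportation is of order $L/\sqrt{n}$ times the length of the transportation.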

Remark (Distance induced by optimal transportation)   The inequality (7) bounds the additional generalization error incurred by shrinking the noise in predictors. If we optimize it with respect to the transportation $W$ with the source $Q^W_0$ and the destination $Q^W_1$ fixed, a distance function between distributions over $\mathcal{F}$ is given as

$$d_{Q_0}(Q_0,Q_1) := \inf_{W:\,Q^W_0=Q_0,\;Q^W_1=Q_1}\left\{\Gamma_C(W,Q_0)+\Gamma_T(W)\right\}. \tag{8}$$

Substituting this into (6) and decomposing the deviation function, we obtain the following.

###### Corollary 4 (Transportation-based risk bound)

Fix a prior $Q_0$. Then, for all posteriors $Q_1$,

$$\underbrace{Q_1 r}_{\text{Risk}} \le \underbrace{Q_1 P_S\hat{r}}_{\text{Empirical fitness}} + \underbrace{d_{Q_0}(Q_0,Q_1)}_{\text{Noise reduction cost}} + \underbrace{Q_0 P_S\Delta + Z(Q_0,S)}_{\text{Reference deviation}},$$

where $d_{Q_0}$ is given by (8) and $Z(Q_0,S)$ is defined in Theorem 3.

For each reference predictor $Q_0$, there is a trade-off between the empirical fitness and the noise reduction cost in terms of the amount of noise contained in the prediction $Q_1$. As a result, the optimal amount of noise can be determined by minimizing the RHS of the above inequality. However, the distance function $d_{Q_0}$ is intractable in general. In Section 4, we give an example of upper bounds on it.

### 3.2 Discussion

In this subsection, we discuss the difference between Theorem 3 and related existing results.

Comparison with Audibert and Bousquet (2007)   An attempt to handle deterministic prediction within the PAC-Bayesian framework has been made earlier by Audibert and Bousquet (2007). In particular, they had already accomplished the goal of tightly bounding the risks of the deterministic predictors with the PAC-Bayesian analysis assisted with the idea of chaining.

The essential difference is that their result is based on the chaining flow over the minimum covering tree of $\mathcal{F}$, whereas ours is based on the one over the entire predictor space $\mathcal{F}$ (with the continuous limit). This entails three apparent differences between the two.

Firstly, because of the freedom in transportation flows, we have to include the additional cost $\Gamma_T(W)$, which is not in the previous bound. However, its impact is not serious because, by the trace bound, it costs at most $O(L/\sqrt{n})$ times the length of the transportation, and the length is likely to be bounded if the parameter spaces are bounded. This is also confirmed in Section 4.

Secondly, in the previous result, the initial point of chaining should be a fixed deterministic predictor because the flow has to be tree-shaped, i.e., there must be no more than one root point. Therefore, it is not directly applicable for relating the deviations of two (stochastic or deterministic) predictors $Q_0$ and $Q_1$ on the basis of their closeness, as we have done in Corollary 4.

Finally, the previous bound contains the KL divergence between discretized posterior and prior distributions, where the discretization is based on the minimum $\epsilon$-nets of the predictor space $\mathcal{F}$. Thus, it is difficult to evaluate their bound directly in practice.

Comparison to the generic chaining bound   Since the derivation of Theorem 3 involves the continuous extension of chaining, it is interesting to compare it with the original chaining bound. The generic chaining method (Talagrand, 2001) is an improvement of the vanilla chaining bound and is known to give rate-optimal upper bounds on the supremum of the deviation function (the original statement is found in Proposition 2.4 of Talagrand (2001)).

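Let the sample-averaged deviation $P_S\Delta$ be a subgaussian process with respect to a metric $g_0$. Then, the generic chaining bound implies that, for all priors $Q_0$,

$$\mathbb{E}_{S\sim P^n}\left[\sup_{f\in\mathcal{F}}\delta_f P_S\Delta\right] \le \sup_{f\in\mathcal{F}}\frac{K}{\sqrt{n}}\int_0^\infty\sqrt{\ln\frac{1}{Q_0(B_{g_0}(f,r))}}\,\mathrm{d}r, \tag{9}$$

where $B_{g_0}(f,r)$ is the ball centered at $f$ with $g_0$-radius $r$ and $K$ is some constant.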

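This can be considered as the counterpart of the chaining cost $\Gamma_C(W,Q_0)$. For ease of comparison, let $\hat{g}$ be constant over $\mathcal{F}$ so that the geodesics become lines, and let $Q_0$ be a uniform distribution. Taking a transportation $W_f$ towards $\delta_f$ whose snapshots are the restrictions of $Q_0$ to $\hat{g}$-balls centered at $f$ with radii shrinking linearly from $r_0$ (so that the velocity is at most $2r_0$ and the KL term equals the negative log-mass of the ball), we have

$$
\begin{aligned}
\Gamma_C(W_f,Q_0) &= 2\int_0^1 v_{\hat{g}}(t;W_f)\sqrt{\frac{D_{\mathrm{KL}}(Q^{W_f}_t,Q_0)+\ln(C/\delta)}{n-1}}\,\mathrm{d}t \\
&\le \int_0^1\frac{4r_0}{\sqrt{n-1}}\sqrt{\ln\frac{C/\delta}{Q_0(B_{\hat{g}}(f,r_0t))}}\,\mathrm{d}t
 = \frac{4}{\sqrt{n-1}}\int_0^{r_0}\sqrt{\ln\frac{C/\delta}{Q_0(B_{\hat{g}}(f,r))}}\,\mathrm{d}r,
\end{aligned}
$$

which is similar to (9). Moreover, if we take $\delta=1$ (the expectation bound), Corollary 4 with $Q_1=\delta_f$ implies that

$$\mathbb{E}_{S\sim P^n}\left[\sup_{f\in\mathcal{F}}\delta_f P_S\Delta\right] \le \sup_{f\in\mathcal{F}}\frac{4}{\sqrt{n-1}}\int_0^{r_0}\sqrt{\ln\frac{C}{Q_0(B_{\hat{g}}(f,r))}}\,\mathrm{d}r + O\!\left(r_0\sqrt{\frac{d}{n}}\right)$$

as $n\to\infty$. Note that the second term is the contribution of $\Gamma_T(W_f)$, which is of no larger order than the first term.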
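To get a feel for the size of the entropy integral above, the following minimal sketch evaluates it numerically in one idealized case. All modelling choices are assumptions made for illustration and are not from the paper: the metric is Euclidean, the prior is uniform on a ball of radius `radius` in dimension `dim` centered at $f$, so that $Q_0(B(f,r))=(r/\texttt{radius})^{\texttt{dim}}$, and the unspecified ratio $C/\delta$ is passed in as the hypothetical parameter `c_over_delta`.

```python
import math

def chaining_cost(r0, radius, dim, n, c_over_delta, steps=100_000):
    """Midpoint-rule estimate of
        4 / sqrt(n - 1) * int_0^r0 sqrt(ln(c_over_delta / Q0(B(f, r)))) dr
    under the assumption Q0(B(f, r)) = (r / radius) ** dim."""
    if not (0.0 < r0 <= radius):
        raise ValueError("need 0 < r0 <= radius")
    if c_over_delta < 1.0:
        raise ValueError("need c_over_delta >= 1 so that the logarithm is non-negative")
    h = r0 / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h  # midpoint of the k-th cell, always > 0
        log_term = math.log(c_over_delta) + dim * math.log(radius / r)
        total += math.sqrt(log_term) * h
    return 4.0 / math.sqrt(n - 1) * total

# Illustrative values only (assumptions, not taken from the paper).
example_cost = chaining_cost(r0=1.0, radius=2.0, dim=10, n=10_001, c_over_delta=20.0)
```

The integrand has only a logarithmic singularity at $r=0$, so the integral is finite even though the KL term itself diverges in the deterministic limit; this is the sense in which the divergence is averaged out along the transportation.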

Finally, it is worth pointing out that the flexibility of transportation in Theorem 3 is much greater than that of the generic chaining. Specifically, it is allowed in our framework to stop the chaining before reaching the deterministic limit while it is not considered in the context of generic chaining.

### 3.3 Proof Sketch of Theorem 3

Before going to the proof sketch, we give a new insight on the essence of the chaining bound, which forms the foundation of our proof. Chaining is basically a sophisticated way of applying union bounds. As the union bound is a special case of the PAC-Bayesian bound, chaining must also work around the divergence problem of (2).

The key idea here is to divide and conquer. More precisely, instead of applying the Fenchel–Young inequality directly, we first decompose the deviation function into a telescoping sum,

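$$F^W_S(1)-F^W_S(0) = \sum_{i=1}^{\infty}\left\{F^W_S(t_i)-F^W_S(t_{i-1})\right\}, \tag{10}$$

where $t_0=0$ and $(t_i)_{i\ge 0}$ is an increasing sequence of times constructed to satisfy $t_i\to 1$. This is the 'divide' part.

As for the 'conquer' part, we handle each of the summands separately. Let $Q^W_{s,t}$ be the joint snapshot distribution of $W$ at times $s$ and $t$, i.e., $(\gamma(s),\gamma(t))\sim Q^W_{s,t}$ if $\gamma\sim W$. Also, let $X_S(f,g) := P_S\Delta(g)-P_S\Delta(f)$ be the increment function of $P_S\Delta$. Then, the summands can be seen as the bilinear pairing of $Q^W_{t_{i-1},t_i}$ and $X_S$. Applying the Fenchel–Young inequality with a series of conjugate pairs $(\xi_i,\xi_i^*)$, we have

$$F^W_S(1)-F^W_S(0) = \sum_{i=1}^{\infty}Q^W_{t_{i-1},t_i}X_S \le \sum_{i=1}^{\infty}\left\{\xi_i\!\left(Q^W_{t_{i-1},t_i}\right)+\xi_i^*(X_S)\right\}.$$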

As a result, with an appropriate choice of the conjugate series (i.e., the way of applying union bounds), the diverging behavior of the KL divergence is averaged out within the infinite summation.

To prove Theorem 3, on the other hand, we extend the divide-and-conquer scheme to the infinitesimal limit. This is well-explained with the fundamental theorem of calculus,

 FWS(1)−FWS(0)=∫10˙FWS(t)dt, (11)

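which is analogous to (10). Here, $\dot{F}^W_S$ denotes the derivative of $F^W_S$. The derivative is then bounded with the Fenchel–Young inequality,

$$
\dot{F}^W_S(t) = \lim_{u\to 0}\frac{(Q^W_{t+u}-Q^W_t)P_S\Delta}{u}
\le \lim_{u\to 0}\frac{\xi_u(Q^W_{t,t+u})+\xi_u^*(X_S)}{u}
= \underbrace{\lim_{u\to 0}\frac{\xi_u(Q^W_{t,t+u})}{u}}_{A(t;W)} + \underbrace{\lim_{u\to 0}\frac{\xi_u^*(X_S)}{u}}_{Z(S)},
$$

where $\{(\xi_u,\xi_u^*)\}_{u>0}$ is a continuous sequence of Fenchel conjugate pairs. Finally, putting this back into (11), we obtain the desired result, since the integral of $A(t;W)$ over $t\in[0,1]$ yields the two transportation costs in (7) and $Z(S)$ yields its noise term.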

## 4 Example: Spectrally Normalized Neural Networks

Finally, we demonstrate the effectiveness of the transportation bound by recovering the spectrally normalized risk bound of neural networks presented by Neyshabur et al. (2018) under weaker assumptions.

Let be a sequence of -Lipschitz activation functions. Let be a set of -depth neural networks such that , where are a collection of matrices whose spectral norms are no greater than . Let and be the spaces of the inputs and the teacher signals, respectively. We denote by the maximum scale of inputs. Let be the space of observations. Assume that the loss function is given by , where is a -Lipschitz continuous function defined on , such as hinge loss and logistic loss. Let be the norm of network and be the total dimensionality. Let be the distribution of networks induced by the normal distribution with mean and variance . Let be the restriction of to the Frobenius ball .

###### Theorem 5 (Noise reduction cost for neural networks)

Assume . Then, there exists a prior such that, for all and ,

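$$d_{Q_0}\big(C_d(f,r_1),C_d(f,r_0)\big) \le \hat{L}_0BR\sqrt{\frac{8e^2mK^2}{n-1}}\int_{r_1B}^{r_0B}\left\{\sqrt{\ln\left(1+\frac{1}{s^2}\right)+O\!\left(\frac{\ln\left(\frac{nd}{\delta}\ln L_0\right)}{d}\right)}+O\!\left(\frac{1}{\sqrt{mK}}\right)\right\}\mathrm{d}s,$$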

where , and .

The proof is found in the supplemental material. Theorem 5 tells us how much additional deviation we incur if we reduce the scale of noise from $r_0$ to $r_1$. Interestingly, the order of the cost, $\hat{L}_0BR\sqrt{mK^2/n}$ up to logarithmic factors, coincides with the deviation bound given by Neyshabur et al. (2018) (Thm. 1), where the margin classification error is assumed in order to deal with the noise. On the other hand, Theorem 5 implies that, even if we weaken the margin assumption to the Lipschitz assumption, we can still enjoy the same rate of deviation bound (up to a logarithmic factor) by shrinking the noise.
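The claim that the noise can be shrunk at finite cost can be checked on the leading integrand of Theorem 5. The following minimal sketch evaluates $\int_{a}^{1}\sqrt{\ln(1+1/s^2)}\,\mathrm{d}s$ for decreasing lower limits $a$; it deliberately drops the prefactor and the $O(\cdot)$ terms, so it illustrates only the convergence of the integral, not the full bound. The function name `leading_integral` and the upper limit 1 are assumptions made for illustration.

```python
import math

def leading_integral(lower, upper, steps=200_000):
    """Midpoint-rule estimate of int_lower^upper sqrt(ln(1 + 1 / s^2)) ds."""
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        s = lower + (k + 0.5) * h  # midpoint, strictly positive even if lower == 0
        total += math.sqrt(math.log1p(1.0 / (s * s))) * h
    return total

# Shrinking the final noise scale towards zero: the integral keeps growing
# but converges, because the singularity at s = 0 is only logarithmic.
values = [leading_integral(lo, 1.0) for lo in (1e-1, 1e-2, 1e-3, 0.0)]
```

The successive values increase by smaller and smaller amounts, which is consistent with taking the deterministic limit of the noise scale at a finite noise reduction cost.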

Letting the final noise scale vanish ($r_1\to 0$) and combining with Corollary 4, we obtain a risk bound for the deterministic case as follows.

###### Corollary 6 (Deviation bound of spectrally normalized neural networks)

Assume that and for all and . Then, for simultaneously all ,

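$$\delta_fP_S\Delta = O\!\left(\hat{L}_0BR\sqrt{\frac{mK^2\ln^2 m}{n}}+\hat{L}_0\sqrt{\frac{\ln\left(\frac{n}{\delta}\ln L_0\right)}{n}}\right)$$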

with probability at least , where and are given as in Theorem 5.

The proof is also given in the supplemental material. The point is that, in this specific case, the stochasticity of the PAC-Bayesian predictor is not necessary to achieve the same rate of deviation (Neyshabur et al., 2018) as long as we deal with Lipschitz loss functions. Note also that this result corresponds to that of Bartlett et al. (2017), albeit slightly looser. Hence, it can be seen as an extension of their result that is capable of handling stochastic predictors, which may result in better risk bounds by optimizing the fitness-transportation trade-off given by Corollary 4.

## 5 Concluding Remarks

We have presented the PAC-Bayesian transportation bound, unifying the PAC-Bayesian analysis and the chaining analysis under an infinitesimal limit. It allows us to evaluate the cost of noise reduction. As an example, we have given an upper bound on the noise reduction cost of neural networks, which shows that the noise (and hence the margin) is not necessary in the recently proposed PAC-Bayesian risk bound for spectrally normalized neural networks.

As future work, we highlight two possible directions. One is the macroscopic characterization of the transportation, including the geometry of the underlying metric space, which may give tighter transportation bounds. The other is the microscopic characterization, such as the differential equation that governs the optimal transportation, which hopefully links the transportation bound with a new class of first-order optimization methods.

## References

• Audibert and Bousquet (2007) Audibert, J.-Y. and Bousquet, O. (2007). Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–889.
• Bartlett et al. (2017) Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
• Catoni (2007) Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248.
• Donsker and Varadhan (1975) Donsker, M. D. and Varadhan, S. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1–47.
• Dudley (1967) Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330.
• Dziugaite and Roy (2017) Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017.
• Dziugaite and Roy (2018) Dziugaite, G. K. and Roy, D. M. (2018). Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1376–1385.
• McAllester (1999) McAllester, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363.
• Mou et al. (2018) Mou, W., Wang, L., Zhai, X., and Zheng, K. (2018). Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 605–638.
• Nagarajan and Kolter (2019) Nagarajan, V. and Kolter, Z. (2019). Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations.
• Neelakantan et al. (2015) Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2015). Adding gradient noise improves learning for very deep networks. CoRR, abs/1511.06807.
• Neyshabur et al. (2018) Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2018). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations.
• Russo and Zou (2015) Russo, D. and Zou, J. (2015). How much does your data exploration overfit? Controlling bias via information usage. arXiv preprint arXiv:1511.05219.
• Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
• Talagrand (2001) Talagrand, M. (2001). Majorizing measures without measures. Annals of Probability, pages 411–417.
• Tropp (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434.
• Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
• Zhou et al. (2019) Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2019). Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In International Conference on Learning Representations.

The following contents are only included in the supplemental material.

We provide here the proofs for Theorem 3, Theorem 5 and Corollary 6 respectively. First, in Section A, we prove the main result, Theorem 3. Then, in Section B, we give the proofs of the statements on the generalization error of neural networks, Theorem 5 and Corollary 6.

## Appendix A Detailed Proof of Theorem 3

In this section, we present the detailed proof of Theorem 3. To this end, we first introduce useful notation for manipulating high-probability bounds (§A.1) and Riemannian metrics (§A.2). Then, we show some known results of the PAC-Bayesian analysis (with an improvement on the major constant), upon which our new analysis is built (§A.3). Finally, we present the main part of the proof of Theorem 3 (§A.4).

Note that, under Assumption 2, is -Lipschitz with respect to the Euclidean distance, too. This implies that for all . Also, since Lipschitz functions over Euclidean spaces can be uniformly approximated by Lipschitz smooth functions arbitrarily well (e.g., consider the smoothing with Gaussian kernels), we can assume the Lipschitz smoothness of and and Lipschitz continuity of without loss of generality.

### A.1 Exponential class of random variables

The following is the definition of a class of random variables whose upper tail is exponentially decaying.

###### Definition 7 (Exponential class)

We say a random variable $Z$ is of $(\lambda,\delta)$-exponential class if and only if

$$\mathbb{E}\left[e^{Z/\lambda}\right]\le\delta$$

for $\lambda>0$ and $\delta>0$. We also denote this by $Z\in\mathcal{E}^\delta_\lambda$.

The exponential class satisfies several desirable properties for the systematic calculation of both high-probability upper bounds and expectation bounds.

###### Proposition 8 (Expectation and tail probability bound)

If $Z\in\mathcal{E}^\delta_\lambda$, then

$$\mathbb{E}Z\le\lambda\ln\delta$$

and

$$Z\le 0$$

with probability at least $1-\delta$.

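Proof  The first inequality follows from Jensen's inequality, $e^{\mathbb{E}Z/\lambda}\le\mathbb{E}e^{Z/\lambda}\le\delta$. By Markov's inequality, we have

$$Z = \lambda\ln e^{Z/\lambda}\le\lambda\ln\frac{\mathbb{E}e^{Z/\lambda}}{\delta}\le 0$$

with probability at least $1-\delta$.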
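Proposition 8 can be sanity-checked in closed form on a Gaussian example. The sketch below is an illustration under our own assumptions (a Gaussian $Z$ and the hypothetical helper `gaussian_in_exponential_class`): it picks the mean so that $\mathbb{E}[e^{Z/\lambda}]=\delta$ holds with equality and then evaluates both conclusions of the proposition exactly, using the Gaussian moment generating function and `math.erfc` for the tail.

```python
import math

def gaussian_in_exponential_class(lam, delta, sigma):
    """For Z ~ N(mean, sigma^2) with mean chosen so that E[exp(Z / lam)] = delta,
    return (mean, mgf, tail) where mgf = E[exp(Z / lam)] and tail = P(Z > 0)."""
    # Gaussian MGF: E[exp(Z / lam)] = exp(mean / lam + sigma^2 / (2 lam^2)).
    mean = lam * math.log(delta) - sigma ** 2 / (2.0 * lam)
    mgf = math.exp(mean / lam + sigma ** 2 / (2.0 * lam ** 2))
    # P(Z > 0) = P(N(0, 1) > -mean / sigma) = 0.5 * erfc(-mean / (sigma * sqrt(2))).
    tail = 0.5 * math.erfc(-mean / (sigma * math.sqrt(2.0)))
    return mean, mgf, tail

# Illustrative parameters (assumptions): lam = 1, delta = 0.05, sigma = 1.
mean, mgf, tail = gaussian_in_exponential_class(lam=1.0, delta=0.05, sigma=1.0)
```

In this example the expectation bound $\mathbb{E}Z\le\lambda\ln\delta$ holds with slack $\sigma^2/(2\lambda)$, and the tail probability $P(Z>0)$ is well below $\delta$, as Markov's inequality guarantees.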

###### Proposition 9 (Reproductive property)

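Let $a,b>0$. Then,

$$a\mathcal{E}^\delta_\lambda = \mathcal{E}^\delta_{a\lambda},\qquad \mathcal{E}^\delta_\lambda+\lambda\ln b = \mathcal{E}^{b\delta}_\lambda,\qquad \mathcal{E}^\delta_{\lambda_1}+\mathcal{E}^\delta_{\lambda_2} = \mathcal{E}^\delta_{\lambda_1+\lambda_2}.$$

Moreover, if $Z_i\in\mathcal{E}^{\delta_i}_\lambda$ for all $i\ge 1$, then

$$\sup_{i\ge 1}Z_i\in\mathcal{E}^\delta_\lambda,$$

where $\delta=\sum_{i\ge 1}\delta_i$.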

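Proof  The first two equalities are trivial. In the third equality, the inclusion $\supset$ is seen from the fact that, for all $Z\in\mathcal{E}^\delta_{\lambda_1+\lambda_2}$, $Z=\frac{\lambda_1}{\lambda_1+\lambda_2}Z+\frac{\lambda_2}{\lambda_1+\lambda_2}Z$ and $\frac{\lambda_i}{\lambda_1+\lambda_2}Z\in\mathcal{E}^\delta_{\lambda_i}$ for $i=1,2$ by the first equality. The opposite inclusion is shown as, for all $Z_1\in\mathcal{E}^\delta_{\lambda_1}$ and $Z_2\in\mathcal{E}^\delta_{\lambda_2}$,

$$\mathbb{E}e^{\frac{Z_1+Z_2}{\lambda_1+\lambda_2}}\le\left(\mathbb{E}e^{Z_1/\lambda_1}\right)^{\frac{\lambda_1}{\lambda_1+\lambda_2}}\left(\mathbb{E}e^{Z_2/\lambda_2}\right)^{\frac{\lambda_2}{\lambda_1+\lambda_2}}\le\delta$$

by Hölder's inequality. As for the last inclusion, note that $e^{Z_i/\lambda}\ge 0$ for all $i\ge 1$, so that the supremum is dominated by the sum. Thus, it is shown by the monotone convergence theorem,

$$\mathbb{E}e^{\sup_{i\ge 1}Z_i/\lambda} = \mathbb{E}\sup_{i\ge 1}e^{Z_i/\lambda}\le\mathbb{E}\sum_{i=1}^{\infty}e^{Z_i/\lambda} = \sum_{i=1}^{\infty}\mathbb{E}e^{Z_i/\lambda}\le\delta.$$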

###### Proposition 10 (Completeness)

Let be a sequence of random variables with some envelope for , i.e., for all almost surely. Then, if strongly,

 Z∈Eδλ.

Proof  It is a direct consequence of the dominated convergence theorem.

### A.2 Useful Properties of the Metrics

The following propositions state some useful properties of the metrics $g$ and $\hat{g}$.

###### Proposition 11 (Trace bound)

For all $f\in\mathcal{F}$,

$$\operatorname{tr}\left(\hat{g}(f)g^{-1}(f)\right)\le L^2.$$

Proof  Let and . Recall that the Lipschitz property in Assumption 2 implies that

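$$
\begin{aligned}
L^2 &\ge \limsup_{v\to 0}\frac{\left|\hat{r}(\varphi^{-1}(\varphi(f)+v),z)-\hat{r}(f,z)\right|^2}{v^\top g(f)v}
 = \sup_{v\in\mathbb{R}^d:\,v^\top g(f)v=1}\left|v^\top\nabla\hat{r}(f,z)\right|^2 \\
&= \sup_{w\in\mathbb{R}^d:\,\|w\|=1}\left|w^\top g^{-\frac{1}{2}}(f)\nabla\hat{r}(f,z)\right|^2
 = \nabla\hat{r}(f,z)^\top g^{-1}(f)\nabla\hat{r}(f,z)
\end{aligned}
$$

$P$-almost surely. Therefore,

$$
\begin{aligned}
\operatorname{tr}\left(\hat{g}(f)g^{-1}(f)\right)
&= \frac{1}{2}(P+P_S)\operatorname{tr}\left(g^{-1}(f)\left[\nabla\Delta(f,z)\nabla\Delta(f,z)^\top\right]\right) \\
&= \frac{1}{2}(P+P_S)\operatorname{tr}\left(g^{-1}(f)\left[\nabla\hat{r}(f,z)\nabla\hat{r}(f,z)^\top-\nabla r(f)\nabla r(f)^\top\right]\right) \\
&\le \frac{1}{2}(P+P_S)\operatorname{tr}\left(\nabla\hat{r}(f,z)^\top g^{-1}(f)\nabla\hat{r}(f,z)\right)
\le L^2.
\end{aligned}
$$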

###### Proposition 12 (Dominance over L2-distance)

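For all $f,g\in\mathcal{F}$,

$$\frac{1}{2}(P+P_S)\left(\Delta(f)-\Delta(g)\right)^2\le d^2_{\hat{g}}(f,g).$$

Proof  Let $\tilde{d}(f,g) := \sqrt{\frac{1}{2}(P+P_S)(\Delta(f)-\Delta(g))^2}$. Let $\gamma$ be a geodesic from $f$ to $g$ with respect to $\hat{g}$. Then, by the triangle inequality with respect to $\tilde{d}$, we have

$$\tilde{d}(f,g)\le\sum_{i=1}^{k}\tilde{d}(\gamma(t_{i-1}),\gamma(t_i))\;\xrightarrow{k\to\infty}\;\int_0^1\sqrt{\gamma'(t)^\top\hat{g}(\gamma(t))\gamma'(t)}\,\mathrm{d}t = d_{\hat{g}}(\gamma(0),\gamma(1)),$$

where $t_i=i/k$ for all $i=0,\ldots,k$.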

### A.3 Basic PAC-Bayesian bound

In this subsection, we introduce a couple of fundamental inequalities of the PAC-Bayesian analysis, namely, the strong Fenchel duality between the KL divergence and the log-integral-exp function, and one of its applications.

###### Lemma 13 (Strong duality of the KL divergence, Donsker and Varadhan (1975))

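For any measurable space $\mathcal{F}$ and all probability distributions $Q,Q_0$ over $\mathcal{F}$,

$$D_{\mathrm{KL}}(Q,Q_0) = \sup_{X:\mathcal{F}\to\mathbb{R}}\left\{QX-\ln Q_0\left[e^X\right]\right\},$$

where the supremum is taken over all the measurable functions $X$. Therefore, we have

$$QX\le D_{\mathrm{KL}}(Q,Q_0)+\ln Q_0\left[e^X\right]$$

and it cannot be improved uniformly.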

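Proof  Assume that $Q$ is not absolutely continuous with respect to $Q_0$. Then, the LHS diverges. Note that there exists a measurable set $A$ such that $Q(A)>0$ and $Q_0(A)=0$. Hence, the RHS also diverges by taking $X=c\mathbb{1}_A$ with $c\to\infty$.

On the other hand, if $Q$ is absolutely continuous with respect to $Q_0$, there exists the Radon–Nikodym derivative $\frac{\mathrm{d}Q}{\mathrm{d}Q_0}$. Now, let $Y := X-\ln\frac{\mathrm{d}Q}{\mathrm{d}Q_0}$. Then we have

$$QX-\ln Q_0\left[e^X\right] = D_{\mathrm{KL}}(Q,Q_0)+QY-\ln Q\left[e^Y\right]\le D_{\mathrm{KL}}(Q,Q_0)\qquad\text{(Jensen's inequality)}.$$

The equality holds if $Y$ is constant $Q$-almost surely.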
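The duality can be verified numerically on a finite space. The following minimal sketch uses two illustrative four-point distributions and hypothetical helper names (`kl`, `dv_objective`), all of which are assumptions for the example; it checks that the objective $QX-\ln Q_0[e^X]$ attains the KL divergence at $X=\ln\frac{\mathrm{d}Q}{\mathrm{d}Q_0}$ and stays below it for another test function.

```python
import math

def kl(q, q0):
    """KL(q || q0) for finite distributions with q absolutely continuous w.r.t. q0."""
    return sum(qi * math.log(qi / q0i) for qi, q0i in zip(q, q0) if qi > 0.0)

def dv_objective(x, q, q0):
    """Donsker-Varadhan objective Q X - ln Q0[exp(X)] for X given as a list of values."""
    mean_q = sum(qi * xi for qi, xi in zip(q, x))
    log_mgf = math.log(sum(q0i * math.exp(xi) for q0i, xi in zip(q0, x)))
    return mean_q - log_mgf

q0 = [0.4, 0.3, 0.2, 0.1]   # illustrative prior (assumption)
q = [0.1, 0.2, 0.3, 0.4]    # illustrative posterior (assumption)

x_opt = [math.log(qi / q0i) for qi, q0i in zip(q, q0)]  # X = ln dQ/dQ0 attains the supremum
x_other = [1.0, -0.5, 0.25, 2.0]                        # an arbitrary test function

gap_opt = kl(q, q0) - dv_objective(x_opt, q, q0)        # zero up to rounding
gap_other = kl(q, q0) - dv_objective(x_other, q, q0)    # non-negative
```

Adding a constant to $X$ leaves the objective unchanged, which matches the equality condition that $Y$ be constant $Q$-almost surely.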

Utilizing Lemma 13, we present one of the basic PAC-Bayesian inequalities as follows.

###### Lemma 14 (PAC-Bayesian bound)

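Let $X$ be an arbitrary centered random function such that . Fix any positive number $\epsilon$ and any prior distribution $Q_0$. Then, there exists $Z(\epsilon,Q_0,S)$ such that, for all $Q$,

$$QP_SX\le\frac{Q(P+P_S)X^2}{2n\epsilon}+\epsilon D_{\mathrm{KL}}(Q,Q_0)+Z(\epsilon,Q_0,S) \tag{12}$$

$P^n$-almost surely, where the noise term is bounded as

$$Z(\epsilon,Q_0,S)\le\ln Q_0\left[e^{\epsilon^{-1}P_SX}\right]. \tag{13}$$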

Proof  Let