###### Abstract

Empirically, the PAC-Bayesian analysis is known to produce tight risk bounds for practical machine learning algorithms. However, in its naïve form, it can only deal with stochastic predictors, while such predictors are rarely used and deterministic predictors often perform well in practice. To fill this gap, we develop a new generalization error bound, the PAC-Bayesian transportation bound, unifying the PAC-Bayesian analysis and the chaining method in view of the optimal transportation. It is the first PAC-Bayesian bound that relates the risks of any two predictors according to their distance, and it is capable of evaluating the cost of de-randomization of stochastic predictors faced with continuous loss functions. As an example, we give an upper bound on the de-randomization cost of spectrally normalized neural networks (NNs) to evaluate how much randomness contributes to the generalization of NNs.

PAC-Bayesian Transportation Bound

Kohei Miyaguchi

IBM Research

## 1 Introduction

The goal of statistical learning is to acquire the predictor that (approximately) minimizes a risk function $r$ in an inductive way. In doing so, one is only allowed to access some proxy function $\hat{r}_S$ based on noisy data $S$. Therefore, the goal of statistical learning theory is to describe the behavior of the deviation of the proxy from the true risk,

$$\Delta_S(f) := r(f) - \hat{r}_S(f). \tag{1}$$

In particular, we are interested in computable high-probability upper bounds on , say , to bound the true risk with a computable function, .

The PAC-Bayesian analysis (McAllester, 1999; Catoni, 2007) is one of the frameworks of such statistical learning theory based on the strong duality of the Kullback–Leibler (KL) divergence (Donsker and Varadhan, 1975). Below, we highlight two major advantages of the PAC-Bayesian approach.

• It is tight and transparent. The gap of the resulting bound is optimal in the sense of the strong duality. Moreover, one can easily interpret the meaning of each term in the bound and where it comes from.

• It is usable in practical situations. It has also been confirmed repeatedly in the literature that it produces non-vacuous bounds on the generalization of complex predictive models such as deep and large-scale neural networks (Dziugaite and Roy, 2017; Zhou et al., 2019).

In particular, the second point explains the recent growing attention that the PAC-Bayesian theory has attracted in the machine learning community (Neyshabur et al., 2018; Dziugaite and Roy, 2018; Mou et al., 2018; Nagarajan and Kolter, 2019).

The issue we address here is that it cannot handle deterministic predictors. More precisely, it provides upper bounds on the expectation of the deviation function with respect to distributions over , namely , and the upper bound diverges for almost all the deterministic settings . This is problematic from the following two viewpoints.

• (I1) Stochastic predictors are rarely used in practice. The use of deterministic predictors is fairly common, and their performance is not even close to what is predicted by the naïve PAC-Bayesian theory. (Footnote 1: Although some algorithms, including stochastic gradient descent, are stochastic in nature, they are often not used in the way the PAC-Bayesian framework suggests.)

• (I2) More importantly, the contribution of the stochasticity is unexplained. In previous studies, it has been pointed out repeatedly from both empirical and theoretical perspectives that introducing stochasticity into prediction (sometimes drastically) improves the predictive performance (Welling and Teh, 2011; Srivastava et al., 2014; Neelakantan et al., 2015; Russo and Zou, 2015). However, it is unclear whether it is also the stochasticity that makes the tightness of the PAC-Bayesian bounds possible, since deterministic predictors cannot be accurately described with the naïve PAC-Bayesian theory.

To address these issues (I1) and (I2), we present a new theoretical analysis that unifies the PAC-Bayesian analysis and the chaining method. Chaining (Dudley, 1967; Talagrand, 2001) is the technique that gives the tightest known upper bound (up to a constant factor) on the supremum of the deviation function, , and is understood as a process of discretization refinement starting from finitely discretized models to reach the limit of continuous models . We extend this idea to the process of noise shrinking starting from stochastic predictors to reach the limit of deterministic predictors . As a result, we obtain a risk bound, namely the PAC-Bayesian transportation bound, which is interpreted as the KL-weighted cost of transportation over the predictor space .

In particular, our contribution is summarized as follows.

• The proposed bound is the first general bound that allows us to relate the risks of any two stochastic or deterministic predictors. More specifically, it allows us, for the first time, to assess the effect of stochasticity in prediction by comparing stochastic predictors with deterministic ones.

• To demonstrate the effectiveness of the bound, we give an upper bound of the noise reduction cost of neural networks (NNs) and the corresponding de-randomized generalization error bound. The resulting risk bound is as tight as the conventional PAC-Bayesian risk bound for randomized NNs (Neyshabur et al., 2018) up to a logarithmic factor, and hence indicates that the stochasticity is not essential in this specific setting.

The rest of the paper is organized as follows. First, the problem setting is detailed in Section 2. Then, in Section 3, the main result is presented with a proof sketch. The interpretation and comparison to previous studies are also included here. In Section 4, we utilize the proposed bound for analysing the risk of neural networks. Finally, we give several concluding remarks in Section 5.

## 2 Problem Setting

In this section, we first overview the PAC-Bayesian framework to specify our focus, and then introduce the mathematical notation to describe the precise problem we are concerned with.

### 2.1 The PAC-Bayesian Framework in a Nutshell

The goal of the PAC-Bayesian analysis is to derive high-probability upper bounds on the deviation function given by (1). The difficulty in achieving this goal is that predictors are learned from data, and thus there is a nontrivial statistical dependency between these two variables. To avoid this problem, the PAC-Bayesian theory suggests that one follow two key steps, namely, linearization and decoupling.

In the first step, linearization, the predictors are generalized to be stochastic, i.e., upon prediction, a random predictor is drawn from some probability measure , called a posterior, which can depend on the data in a nontrivial way. The performance of such stochastic predictors is measured by the expectation , and hence the deviation is also studied in the form of expectation, . Note that the ordinary deterministic predictors are special instances of stochastic ones as they are recovered by taking Dirac’s delta measures . In this way, any kind of interaction between and is now formulated as the bilinear pairing of a predictor and a data-dependent function .

Now, in the second step, the bilinear pairing is decoupled with the Fenchel–Young type inequalities, namely , which allow us to deal with predictors and data separately. Here, denotes a pair of Fenchel conjugate functions. Specifically, the standard PAC-Bayesian analysis exploits the strong duality between the KL divergence and the log-integral-exp function,

$$Q\Delta_S \le \beta^{-1} D_{\mathrm{KL}}(Q, U) + \beta^{-1} \ln U\!\left[e^{\beta \Delta_S}\right], \tag{2}$$

, where is any data-independent distribution called a prior (see Appendix B.1 for the proof). Finally, the data-dependent term, , is bounded with concentration inequalities, such as Hoeffding’s inequality and Bernstein’s inequality, and we obtain high-probability upper bounds on as desired.
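To make the duality behind (2) concrete, the following minimal sketch checks the inequality numerically on a finite predictor space. The four-point prior `u`, the deviation values `delta`, and the temperature `beta` are made-up illustration values, not quantities from this paper. The sketch also checks that the Gibbs posterior, proportional to the prior times $e^{\beta\Delta_S}$, attains equality, which is what strong duality asserts.

```python
import math

def kl_divergence(q, u):
    """KL(q || u) for finite distributions; assumes u has full support."""
    return sum(qi * math.log(qi / ui) for qi, ui in zip(q, u) if qi > 0.0)

def dv_sides(q, u, delta, beta):
    """Return (lhs, rhs) of inequality (2) for a fixed deviation vector."""
    lhs = sum(qi * di for qi, di in zip(q, delta))              # Q[Delta_S]
    log_mgf = math.log(sum(ui * math.exp(beta * di) for ui, di in zip(u, delta)))
    rhs = (kl_divergence(q, u) + log_mgf) / beta
    return lhs, rhs

def gibbs_posterior(u, delta, beta):
    """Posterior proportional to u * exp(beta * delta)."""
    weights = [ui * math.exp(beta * di) for ui, di in zip(u, delta)]
    z = sum(weights)
    return [w / z for w in weights]

u = [0.25, 0.25, 0.25, 0.25]         # toy prior over four predictors
delta = [0.30, -0.10, 0.05, 0.20]    # made-up deviations Delta_S(f)
beta = 5.0

lhs, rhs = dv_sides([0.7, 0.1, 0.1, 0.1], u, delta, beta)
eq_lhs, eq_rhs = dv_sides(gibbs_posterior(u, delta, beta), u, delta, beta)
print(f"arbitrary posterior: {lhs:.4f} <= {rhs:.4f}")
print(f"Gibbs posterior:     {eq_lhs:.4f} == {eq_rhs:.4f}")
```

Any other posterior on the same support leaves a positive gap between the two sides.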

##### Our Focus

Unfortunately, the bound (2) is meaningless when $Q$ is not absolutely continuous with respect to $U$, since the KL term diverges. In particular, if one takes non-atomic priors , e.g., Gaussian measures, then it diverges with any Dirac’s delta posterior . More generally, when the model space is continuous, almost every deterministic predictor is ruled out under the naïve PAC-Bayesian bound (2). This is the problem we focus on in this paper.
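The divergence can be observed numerically. The sketch below uses the standard closed form of the KL divergence between two spherical Gaussians and shrinks the posterior scale towards zero, i.e., towards a Dirac measure. The dimension, the scales, and the function name are illustrative assumptions.

```python
import math

def kl_spherical_gaussians(mean_sq_dist, s, s0, dim):
    """KL( N(w, s^2 I) || N(w0, s0^2 I) ) in `dim` dimensions.

    mean_sq_dist is the squared Euclidean distance |w - w0|^2.
    """
    scale_part = dim * (math.log(s0 / s) + s * s / (2.0 * s0 * s0) - 0.5)
    mean_part = mean_sq_dist / (2.0 * s0 * s0)
    return scale_part + mean_part

scales = [1.0, 1e-1, 1e-2, 1e-4, 1e-8]
kls = [kl_spherical_gaussians(0.0, s, 1.0, dim=10) for s in scales]
for s, kl in zip(scales, kls):
    print(f"posterior scale {s:8.1e}  ->  KL = {kl:8.2f}")
```

The KL term grows like the dimension times $\ln(1/s)$ as the posterior scale $s$ shrinks, so no finite bound survives in the deterministic limit.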

### 2.2 Mathematical Formulation

##### Conventions

For any measurable spaces , we denote by the space of probability distributions over . For any binary function , let and denote the partially applied functions such that and . Moreover, if the function is measurable, we denote its expectation with respect to and by , where and are arbitrary measurable spaces. We also reserve another notation for expectations: the expectation with respect to any random variables may be denoted by , or just if any confusion is unlikely.

##### Basic Notation and Assumptions

Let be i.i.d. random variables corresponding to single observations subject to an unknown distribution . Let be the collection of such variables, , from which we want to learn a good predictor. Let be a measurable space of predictors, such as neural networks and SVMs with their parameters unspecified, and let denote predictors with specific parameters. We assume that is (a subset of) a separable Hilbert space with inner product and norm , e.g., -dimensional parameter spaces or infinite-dimensional function spaces.

Let be the loss of the prediction made by predictors upon observations , accompanied with the Fréchet derivative with respect to for all . Also, we define the risk of the predictors by .

To facilitate the analysis of transportation later, we also assume that is endowed with an inverse metric , where is the set of symmetric nonnegative linear operators on . (Footnote 2: Equipped with , can be thought of as a Riemannian manifold. However, we do not assume the smoothness or the invertibility of as standard Riemannian manifolds do.) It defines a local norm at each point , for all and otherwise . The metric is used to bound the variation of .

###### Assumption 1 (Lipschitz condition)

The deviation function is -Lipschitz continuous with respect to , i.e.,

$$\lim_{\rho \to 0} \sup_{|g - f|_{T_f} \le \rho} \frac{|\Delta(g, z) - \Delta(f, z)|}{\rho} \le L_\Delta$$

for all and .

Note that the standard Lipschitz condition is recovered if is identity for all . (Footnote 3: For example, deviation of squared loss function , , satisfies Assumption 1 with and if .) However, by appropriately choosing , we can handle a broader class of deviation functions beyond standard Lipschitz ones.

##### Problem Statement

Our objective here is to find the predictor with small risk . However, since is inaccessible as is, we leverage the empirical risk measure to approximate the true objective . Let be the empirical distribution with respect to the sample . Then, the empirical risk of is given by which is a random function whose expectation coincides with the true risk, . As it fluctuates around the mean, we are motivated to study the tail probability of the deviation . Define the deviation function (of single observation) by

$$\Delta(f, z) := r(f) - \ell(f, z). \tag{3}$$

Since , we want to find a tight high-probability upper bound on the sample-averaged deviation of posterior distributions in the form of

$$Q P_S \Delta \le U(Q, S) + Z(S), \tag{4}$$

where is a computable function and is a negligible random variable independent of satisfying that with some confidence level .

To conclude this section, we introduce the Kullback–Leibler (KL) divergence, which is used to measure the complexity of predictors in the PAC-Bayesian analysis.

###### Definition 2 (The KL divergence)

Let where is a measurable space. The KL divergence between and is given by

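$$D_{\mathrm{KL}}(Q, U) := Q\!\left[\ln \frac{dQ}{dU}\right],$$

when $Q$ is absolutely continuous with respect to $U$. Otherwise, $D_{\mathrm{KL}}(Q, U) = +\infty$. Moreover,

$$H_\delta(Q, U) := D_{\mathrm{KL}}(Q, U) + \ln \frac{1}{\delta}$$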

denotes the PAC-Bayesian complexity of posteriors with respect to priors with confidence level .
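As a small illustration of Definition 2, the sketch below evaluates the complexity $H_\delta$ on a finite space, including the branch in which the posterior is not absolutely continuous with respect to the prior. The three-point distributions and the function names are hypothetical examples.

```python
import math

def kl_divergence(q, u):
    """KL(q || u) on a finite space, with the convention 0 * ln(0 / u) = 0."""
    total = 0.0
    for qi, ui in zip(q, u):
        if qi == 0.0:
            continue
        if ui == 0.0:
            return math.inf      # q puts mass where u has none
        total += qi * math.log(qi / ui)
    return total

def pac_bayes_complexity(q, u, delta):
    """H_delta(q, u) = KL(q || u) + ln(1 / delta)."""
    return kl_divergence(q, u) + math.log(1.0 / delta)

uniform = [1.0 / 3.0] * 3
h_supported = pac_bayes_complexity([0.5, 0.5, 0.0], uniform, delta=0.05)
h_off_support = pac_bayes_complexity([0.0, 0.0, 1.0], [0.5, 0.5, 0.0], delta=0.05)
print(h_supported, h_off_support)
```

The first value equals $\ln 1.5 + \ln 20$, while the second is infinite because the posterior charges a point to which the prior assigns zero mass.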

## 3 Main Result

In Section 3.1, we introduce the analytical tools necessary for stating our main result. Then we present the main result with a few remarks in Section 3.2. Finally, in Section 3.3, we compare it with relevant existing results and give a proof sketch. The rigorous proof is included in Appendix B.

### 3.1 Posterior Flow and Velocity Measures

Our goal is to find the upper bounds in the form of (4) applicable to both stochastic and deterministic predictors . To this end, we combine the PAC-Bayesian framework with the chaining method. The chaining is understood as a process of relating one predictor to its neighbors and moving towards some destination through the chain . We extend this idea and take the infinitesimal limit where is an appropriate distance over . As a result, we get the continuous transportation of predictors instead of the chain.

To facilitate the analysis of the transportation of posterior distributions, we introduce ordinary differential equations (ODEs) over that transport posteriors, which we call posterior flows. Let be a time index indicating the progress of transportation. Let be the initial posterior distribution. Let be a random time-indexed vector field on , where is a probability space representing the source of randomness in transportation itself.

###### Definition 3 (Posterior Flow)

We say is a posterior flow if the solution of the following ODE exists almost surely,

$$df_t = \xi_t(f_t, \omega_0)\,dt, \quad f_0 \sim Q_0,\ \omega_0 \sim P_0, \tag{5}$$

and the corresponding snapshot distributions are well-defined, i.e., for all . Moreover, is the mean posterior flow of if

$$\mu_t(f) = \mathbb{E}_{f_0 \sim Q_0,\, \omega_0 \sim P_0}\left[\xi_t(f, \omega_0) \mid f_t = f\right].$$
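A posterior flow can be simulated by transporting samples. The sketch below integrates the ODE (5) with the explicit Euler method for scalar predictors; the contraction field, the target `F_STAR`, and the Gaussian initial posterior are illustrative assumptions rather than choices made in this paper. Because the chosen field is deterministic, it coincides with its own mean posterior flow.

```python
import math
import random

def simulate_flow(xi, initial_particles, omegas, n_steps, t_end=1.0):
    """Euler-integrate df_t = xi(t, f_t, omega0) dt and return all snapshots."""
    dt = t_end / n_steps
    particles = list(initial_particles)
    snapshots = [list(particles)]
    for k in range(n_steps):
        t = k * dt
        particles = [f + xi(t, f, w0) * dt for f, w0 in zip(particles, omegas)]
        snapshots.append(list(particles))
    return snapshots

F_STAR = 0.5  # hypothetical destination predictor

def contraction_field(t, f, omega0):
    """Deterministic field pulling every particle towards F_STAR."""
    return -(f - F_STAR)

def spread(particles):
    """Mean absolute distance of the particles from F_STAR."""
    return sum(abs(f - F_STAR) for f in particles) / len(particles)

rng = random.Random(0)
initial = [rng.gauss(2.0, 1.0) for _ in range(500)]   # samples from Q_0
omegas = [None] * len(initial)                        # unused by this field
snapshots = simulate_flow(contraction_field, initial, omegas, n_steps=200)

ratio = spread(snapshots[-1]) / spread(snapshots[0])
print(f"spread shrinks by {ratio:.4f}; exact solution gives {math.exp(-1.0):.4f}")
```

Each entry of `snapshots` is a sample from the corresponding snapshot distribution, so expectations under it can be approximated by averaging over particles.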

Note that all the posteriors , , are completely identified once we specify the initial condition and the flow . When it is clear from the context, we may omit and refer to as a posterior flow. Let

$$D_t(\xi; S) := Q^\xi_t P_S \Delta$$

be the deviation function of the posterior generated with at time . In our analysis, the deviation is characterized with respect to two velocity measures of posterior flows.

The first one is given by the 2-Wasserstein distance (Villani, 2008) between and at the limit of .

###### Definition 4 (Wasserstein velocity)

The Wasserstein velocity of at time is defined as

$$W_t(\xi) := \sqrt{\mathbb{E}_{f \sim Q^\xi_t} |\mu_t(f)|^2_{T_f}}.$$

The Wasserstein velocity is determined only by the metric structure of the predictor space , which indirectly reflects the continuity of the deviation function through the Lipschitz condition (Assumption 1). On the other hand, the second velocity measure reflects the structure of in a more direct manner.

###### Definition 5 (Deviation-based velocity)

The deviation-based velocity of with respect to at time is defined as

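$$V_t(\xi; S) := \sqrt{Q^\xi_t \frac{P + P_S}{2} \langle \mu_t, \nabla\Delta \rangle^2} = \sqrt{Q^\xi_t \langle \mu_t, \Lambda_S \mu_t \rangle},$$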

where .

Note that we have . This reveals that the roles of these two velocities are indeed complementary to each other: while is tighter and offers a finer characterization of the posterior flow in a data- and distribution-dependent way, it is guaranteed to be bounded by in the worst case, in just the same way as loss functions are bounded by a constant in the standard PAC-Bayesian analysis.
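The relation between the two velocities can be checked on synthetic data. The sketch below assumes that the metric is the identity, that the gradients of the deviation have norm at most $L_\Delta$, and that the worst-case relation in question is $V_t \le L_\Delta W_t$, which then follows from the Cauchy–Schwarz inequality. All samples are random stand-ins, not outputs of a real model.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def estimate_velocities(mus, grads):
    """Monte Carlo estimates (V, W) from paired samples.

    mus[i]   : flow velocity mu_t(f_i) at a sampled predictor f_i
    grads[i] : gradient of Delta(f_i, z_i) at a sampled observation z_i
    """
    n = len(mus)
    w = math.sqrt(sum(dot(m, m) for m in mus) / n)
    v = math.sqrt(sum(dot(m, g) ** 2 for m, g in zip(mus, grads)) / n)
    return v, w

rng = random.Random(1)
dim, lipschitz = 8, 2.0

def random_gradient():
    """Synthetic gradient with Euclidean norm at most `lipschitz`."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = lipschitz * rng.random() / math.sqrt(dot(g, g))
    return [scale * x for x in g]

mus = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2000)]
grads = [random_gradient() for _ in range(2000)]
v_hat, w_hat = estimate_velocities(mus, grads)
print(f"V = {v_hat:.3f} <= L * W = {lipschitz * w_hat:.3f}")
```

In this synthetic setting the deviation-based estimate is far below the worst-case value, because random gradients are rarely aligned with the flow direction.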

### 3.2 PAC-Bayesian Transportation Bound

To evaluate the deviation function , we utilize the fundamental theorem of calculus,

$$D_t(\xi; S) = D_0(\xi; S) + \int_0^t \frac{d}{du} D_u(\xi; S)\,du. \tag{6}$$

This motivates us to seek an upper bound on the infinitesimal increment .

###### Theorem 6 (Transportation bound)

Fix any prior distribution . Then, the increment of the deviation is bounded by

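$$\frac{d}{dt} D_t(\xi; S) \le 2 V_t(\xi; S) \sqrt{\frac{H_\delta(Q^\xi_t, U) + c(n)}{n}} + \frac{2 L_\Delta W_t(\xi)}{\sqrt{n}} =: \iota_t(\xi; S, U, \delta) \tag{7}$$

with probability on the draw of simultaneously for all posterior flows and all . Here, .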

The proof is given in Appendix B.3.
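The right-hand side of (7) is simple arithmetic once the two velocities and the KL term are available. The helper below is a minimal sketch: the function name and argument names are ours, and the sub-polynomial factor $c(n)$ is exposed as a plain parameter `c_n` with a placeholder default of zero rather than its actual definition.

```python
import math

def increment_bound(v_t, w_t, kl, n, delta, lipschitz, c_n=0.0):
    """Evaluate the increment bound (7).

    v_t, w_t  : deviation-based and Wasserstein velocities at time t
    kl        : KL divergence between the time-t posterior and the prior
    n         : sample size
    delta     : confidence level entering H_delta = kl + ln(1 / delta)
    lipschitz : the constant L_Delta of Assumption 1
    c_n       : placeholder for c(n); 0.0 is NOT the paper's value
    """
    complexity = kl + math.log(1.0 / delta)
    chaining_cost = 2.0 * v_t * math.sqrt((complexity + c_n) / n)
    transport_cost = 2.0 * lipschitz * w_t / math.sqrt(n)
    return chaining_cost + transport_cost

example = increment_bound(v_t=0.5, w_t=1.0, kl=4.0, n=4, delta=1.0, lipschitz=1.0)
print(example)
```

Keeping the two summands separate in the code mirrors their distinct roles as chaining cost and transportation cost.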

##### Remark (Interpretation)

The upper bound (7) consists of two terms, each of which is interpreted in a different way.

Ignoring the sub-polynomial factor , the first term is the deviation-based velocity times the model complexity per sample . This is analogous to the key quantity of the chaining bound known as Dudley’s entropy integral, where corresponds to the metric entropy. Thus, it can be interpreted as the chaining cost. Note that, under some regularity conditions, the entropy term is roughly evaluated to be of the same order as the dimensionality of model . Since the velocity factor is bounded by in the worst case, the integration over recovers the traditional dimensionality-dependent bounds as long as the length of the transportation is bounded with respect to 2-Wasserstein distance. On the contrary, our bound can be significantly tighter than traditional ones if is much smaller than and/or the entropy term is much smaller than the dimensionality . In particular, the effect of is unique to our bound; the results shown in Section 4 are owing to the fact that with the number of neurons per layer and the depth of neural networks.

On the other hand, the second term is proportional to . Therefore, after integrating it over , it is proportional to the 2-Wasserstein length of posterior transportation, hence it can be thought of as a pure transportation cost. It is negligible in comparison to the first term since it is independent of the complexity of the model such as the dimensionality. Moreover, we note that the bound (7) is not sensitive to the choice of metric since it does not appear in the chaining-cost term.

##### Remark (Distance induced by optimal flow)

The inequality (7) induces a metric on the space of posterior distributions . If we integrate it over and optimize it with respect to with the source and the destination fixed, a distance function over is given as

$$d_{S,U,\delta}(Q, Q') := \inf_{\xi:\, Q^\xi_0 = Q,\ Q^\xi_1 = Q'} \int_0^1 \iota_t(\xi; S, U, \delta)\,dt \tag{8}$$

for all , which we call the optimal transportation (OT) distance of predictors. Putting this back to (6) with Theorem 6 and decomposing the deviation function, we obtain a relative risk bound in terms of the closeness of two predictors.

###### Corollary 7 (Transportation-based risk bound)

Fix a prior . Then,

 Qr\tiny\rm\bf Risk ≤Q^rS\tiny\rm\bf Emp.\;% Risk+dS,U,δ(Q,Q0)% \tiny\rm\bf OT distance+Q0PSΔ\tiny\rm\bf Reference deviation

with probability for simultaneously all posteriors .

This corollary has two distinct implications. The first implication is that (i) the OT distance bounds the risk of predictors by itself. If the reference predictor is fixed, then the reference deviation is of order , independent of the complexity of , and just negligible. Hence the deviation of is shown to be governed by the OT distance from an arbitrary fixed predictor. For example, if is -subgaussian, then we have

$$Q P_S \Delta \le d_{S,U,\delta/2}(Q, U) + \sigma \sqrt{\frac{\ln(2/\delta)}{n}}, \tag{9}$$

taking in Corollary 7.

The second implication is that (ii) the OT distance can be used to describe the cost of de-randomization. Consider as any stochastic predictors and let be any “less stochastic” predictors. Then, Corollary 7 implies a trade-off relationship of the empirical fitness and the OT distance. Specifically, the fitness is likely to get worse if the amount of noise contained in the prediction is increased, whereas the OT distance can be decreased as is stochastic. As a result, the optimal amount of the noise can be determined by minimizing the RHS, or one may simply take to be deterministic, , paying the cost of . This further implies that the transportation bound and the conventional PAC-Bayesian bound can be combined to produce better risk bounds that cannot be achieved by themselves alone.

Note that the OT distance is intractable in general due to the infimum. We discuss the problem of the computational tractability in the next remark. Moreover, in Section 4, we give an example of upper bounds on it.

##### Remark (Computational tractability)

We note that (7) and (8) are computationally tractable with some numerical approximation methods.

As for the increment bound (7), the only inaccessible quantity is , or more specifically, the data-dependent metric in it. Recall that we have a trivial computable upper bound, . To get tighter bounds, one must exploit the problem-dependent structure of the loss function that characterizes .

Once we get such an upper bound, we may evaluate with Monte Carlo sampling approximation, choosing an appropriate posterior flow . To illustrate this, consider the transportation from an arbitrary initial distribution to the delta measures , . One of the simplest such posterior flows is the linear contraction flow . In this case, the posterior is nothing but the initial one shrunk by a factor of towards . Therefore, as long as we can draw samples from , the velocities in the increment bound can be evaluated as and so on.

On the other hand, to compute the noise reduction cost (8), one has to evaluate the integral of and take the infimum over . As for the infimum, we can just ignore it and compute an upper bound with a concrete instance of . As for the integral, one may approximate the integral with finite sums. More precisely, the time line is discretized with and the integral is approximated with the summation .
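The two approximations above can be sketched together. The code below assumes the contraction takes the form $f_t = f^* + (1 - t)(f_0 - f^*)$, an identity metric, a Gaussian initial posterior that also serves as the prior, the worst-case substitution of $L_\Delta W_t$ for the deviation-based velocity, and $c(n) = 0$. It uses a midpoint variant of the finite-sum approximation. All numerical values and names are illustrative.

```python
import math
import random

def wasserstein_velocity(samples_f0, f_star):
    """Monte Carlo estimate of W_t under the identity metric.

    Under the assumed contraction every particle moves with the constant
    velocity -(f0 - f_star), so the estimate does not depend on t.
    """
    sq_norms = [sum((a - b) ** 2 for a, b in zip(f0, f_star)) for f0 in samples_f0]
    return math.sqrt(sum(sq_norms) / len(sq_norms))

def kl_shrunk_gaussian(t, dim):
    """KL( N(f_star, (1-t)^2 s0^2 I) || N(f_star, s0^2 I) ); s0 cancels out."""
    s = 1.0 - t
    return dim * (math.log(1.0 / s) + (s * s - 1.0) / 2.0)

def integrated_increment(w, dim, n, delta, lipschitz, n_grid):
    """Midpoint-rule sum of the increment bound over t in [0, 1]."""
    h = 1.0 / n_grid
    total = 0.0
    for k in range(n_grid):
        t = (k + 0.5) * h
        complexity = kl_shrunk_gaussian(t, dim) + math.log(1.0 / delta)
        chaining = 2.0 * lipschitz * w * math.sqrt(complexity / n)  # V_t <= L * W_t
        transport = 2.0 * lipschitz * w / math.sqrt(n)
        total += (chaining + transport) * h
    return total

rng = random.Random(0)
dim, s0 = 5, 0.3
f_star = [0.0] * dim
samples = [[rng.gauss(0.0, s0) for _ in range(dim)] for _ in range(4000)]

w_hat = wasserstein_velocity(samples, f_star)
coarse = integrated_increment(w_hat, dim, n=1000, delta=0.05, lipschitz=1.0, n_grid=50)
fine = integrated_increment(w_hat, dim, n=1000, delta=0.05, lipschitz=1.0, n_grid=5000)
print(f"W estimate {w_hat:.3f}, exact s0 * sqrt(dim) = {s0 * math.sqrt(dim):.3f}")
print(f"integrated bound: {coarse:.4f} (50 steps) vs {fine:.4f} (5000 steps)")
```

The result is an upper bound on the OT distance for this particular flow only; a better flow or a tighter velocity estimate can only decrease it.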

### 3.3 Comparison with Existing Bounds

In this subsection, we discuss the difference between Theorem 6 and related existing results.

##### Audibert and Bousquet (2007)

An attempt to handle deterministic prediction within the PAC-Bayesian framework has been made earlier by Audibert and Bousquet (2007). In particular, they had already accomplished the goal of tightly bounding the risks of the deterministic predictors with the PAC-Bayesian analysis assisted with the idea of chaining. However, it cannot be utilized to relate conventional PAC-Bayesian bounds with deterministic predictors.

The essential difference is that their result is based on the chaining flow over the minimum covering tree of , whereas ours is based on the one over the entire predictor space with any direction as long as it can be expressed in the form of ODE. This entails three apparent differences between the two.

Firstly, because of the freedom in transportation flows, we have to include the additional cost , which does not appear in the previous bound. However, its impact is not serious because it costs at most without any dependency on the model complexity, provided the Wasserstein length of posterior transportation is bounded. (Footnote 4: This is likely to be the case if the diameter of model is bounded. This is also confirmed in the example given in Section 4.)

Secondly, in the previous result, the initial point of chaining should be a fixed deterministic predictor because the flow has to be tree-shaped, i.e., there must be no more than one root point. Therefore, it is not directly applicable for relating the deviations of two (stochastic or deterministic) predictors on the basis of their closeness, as we have done in Corollary 7.

Finally, the previous bound contains the KL divergence between discretized posterior and prior distributions, where the discretization is based on the minimum -nets of the predictor space . Thus, it is difficult to evaluate their bound directly in practice. On the other hand, our increment bound can be evaluated once a computable upper bound on is given.

##### Chaining Method (Proof Sketch)

We also compare our bound with the conventional chaining bound. This gives a rough sketch of how we prove the main theorem (Theorem 6).

We start with a new insight on the essence of the chaining bound, which forms the foundation of our bound. The chaining is basically a sophisticated way of applying union bounds. As the union bound can be thought of as a special case of the PAC-Bayesian bound (e.g., take the prior as a counting measure and the posterior as a Dirac’s delta), it must also work around the divergence problem of (2).

The key idea of chaining is to divide and conquer. More precisely, instead of applying the Fenchel–Young inequality directly, we first decompose the deviation function into a telescoping sum,

$$Q P_S \Delta = Q_0 P_S \Delta + \sum_{i=1}^{\infty} (Q_i - Q_{i-1}) P_S \Delta, \tag{10}$$

where the posterior sequence is constructed to satisfy as in the sense of weak convergence. This is the ‘dividing’ part.

As for the ‘conquering’ part, we handle each of the summands separately. Let be a joint distribution of a pair of predictors and whose marginals are corresponding to and respectively, i.e., and . Also, let be the increment function of . Then, the summands can be seen as the bilinear pairing of and . Applying the Fenchel–Young inequality with a series of conjugate pairs , we have

$$Q P_S \Delta - Q_0 P_S \Delta = \sum_{i=1}^{\infty} Q_{i-1,i} X_S \le \sum_{i=1}^{\infty} \left\{ \zeta_i(Q_{i-1,i}) + \zeta_i^*(X_S) \right\}. \tag{11}$$

As a result, with an appropriate choice of joint distributions and the conjugate series (i.e., the way of applying union bounds), the diverging behavior of the KL divergence is averaged out within the infinite summation.

On the other hand, Theorem 6 is an infinitesimal version of the conquering part, bounding

$$\frac{d}{dt} D_t(\xi; S) = \lim_{u \to 0} \frac{(Q^\xi_{t+u} - Q^\xi_t) P_S \Delta}{u},$$

whereas the dividing part is owing to the fundamental theorem of calculus (6), which is the continuous counterpart of (10). The derivative is then bounded with the Fenchel–Young inequality in the same spirit of (11), where the joint distributions are turned into the posterior flow utilizing the chain rule

$$\frac{d}{dt} D_t(\xi; S) = Q^\xi_t \langle \mu_t, P_S \nabla\Delta \rangle.$$

##### Standard PAC-Bayesian Bounds

We also highlight two differences in our bound compared to the standard PAC-Bayesian bound.

Firstly, of course, it allows us to avoid the diverging KL phenomenon. Note that the conventional PAC-Bayesian risk bound claims that

$$Q P_S \Delta \le \tilde{O}\!\left( \sigma \sqrt{\frac{H_\delta(Q, U)}{n}} \right), \tag{12}$$

where is a variance-like scale factor of . On the other hand, our bound (9) is roughly equivalent (ignoring the model-independent terms) to

$$Q P_S \Delta \le \tilde{O}\!\left( \int_0^1 dt\, V_t(\xi; S) \sqrt{\frac{H_\delta(Q^\xi_t, U)}{n}} \right), \tag{13}$$

where is taken to be the transportation from to . Since the velocity measures the rate of change in the deviation function in the -sense, it plays a similar role as that of . Therefore, the essential difference is that the posterior can change over time in (13). As a result, the effect of the entropy is averaged and it remains finite even if as . This is how our bound works around the diverging problem the conventional bound (12) suffers from.
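The averaging effect can be seen in a toy case that is not taken from the analysis above. Assume, purely for illustration, a $d$-dimensional Gaussian prior $U = N(w, \sigma_0^2 I_d)$ and the linear contraction of $U$ itself towards $\delta_w$, so that $Q_t = N(w, (1-t)^2 \sigma_0^2 I_d)$. The KL term then diverges at $t = 1$, while its square root remains integrable:

```latex
\begin{align*}
D_{\mathrm{KL}}(Q_t, U)
  &= d\left[\ln\frac{1}{1-t} + \frac{(1-t)^2 - 1}{2}\right]
   \ge d\left[\ln\frac{1}{1-t} - \frac{1}{2}\right]
   \xrightarrow{\; t \to 1 \;} \infty, \\
D_{\mathrm{KL}}(Q_t, U)
  &\le d \ln\frac{1}{1-t}, \\
\int_0^1 \sqrt{D_{\mathrm{KL}}(Q_t, U)}\,dt
  &\le \sqrt{d} \int_0^1 \sqrt{\ln\frac{1}{s}}\,ds
   = \sqrt{d}\;\Gamma\!\left(\tfrac{3}{2}\right)
   = \frac{\sqrt{\pi d}}{2} < \infty .
\end{align*}
```

Here the substitution $s = 1 - t$ is used in the last line, together with the identity $\int_0^1 (\ln(1/s))^{a-1}\,ds = \Gamma(a)$ for $a = 3/2$.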

Secondly, our bound can be significantly tighter than conventional ones even if does not diverge. This is explained with how these two bounds react to the change of the diameter of the model space . In the conventional bound, is not necessarily related to as it measures the scale of the absolute value of the deviation function . On the other hand, the scaling factor of our bound, , linearly scales with the diameter since the velocity of transportation reflects the distance over directly. Therefore, if is sufficiently small, our bound can be much tighter than conventional bounds.

## 4 De-Randomizing Spectrally Normalized Neural Networks

We demonstrate the effectiveness of the transportation bound by recovering the risk bound of spectrally normalized neural networks presented by Neyshabur et al. (2018) under weaker assumptions.

Let be a sequence of -Lipschitz activation functions satisfying the homogeneity condition  (consider the ReLU activation for example). Let be a set of -depth neural networks such that , where is a collection of matrices. Note that the total dimensionality of the networks is . To introduce the structure of Hilbert space into , we identify the network with its parameter (hence is the Frobenius norm of the Kronecker product ). Let and be the space of the inputs and the teacher signals, respectively, and let be the space of observations. We denote by the maximum scale of inputs. Assume that the loss function is given by , where is a -Lipschitz continuous function defined on , such as hinge loss and logistic loss.

We also introduce two characteristics of networks essential to the subsequent analysis. Let be the depth-normalized spectral radius and let be the total spectral radius of . Here, denotes the operator norm of .
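The quantities involved can be computed from the weight matrices alone. The sketch below is hedged in two ways: it reads the depth-normalized spectral radius as the geometric mean of the layer-wise operator norms, which is our assumption about the definition, and it estimates operator norms by power iteration with a random start, which can fail in the measure-zero case where the start is orthogonal to the top singular direction. The function names and the toy networks are hypothetical.

```python
import math
import random

def operator_norm(mat, n_iter=200, seed=0):
    """Largest singular value of `mat` via power iteration on mat^T mat."""
    rows, cols = len(mat), len(mat[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(n_iter):
        mv = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        mtmv = [sum(mat[i][j] * mv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in mtmv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in mtmv]
    mv = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in mv))

def frobenius_norm(weights):
    """Norm of the whole parameter, i.e., all layers flattened together."""
    return math.sqrt(sum(x * x for mat in weights for row in mat for x in row))

def depth_normalized_radius(weights):
    """Geometric mean of layer-wise operator norms (assumed definition)."""
    logs = [math.log(operator_norm(mat)) for mat in weights]
    return math.exp(sum(logs) / len(logs))

identity_net = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]   # K = 3, m = 2
rank_one_net = [[[1.0, 0.0], [0.0, 0.0]] for _ in range(3)]

ratio_identity = frobenius_norm(identity_net) / depth_normalized_radius(identity_net)
ratio_rank_one = frobenius_norm(rank_one_net) / depth_normalized_radius(rank_one_net)
print(ratio_identity, ratio_rank_one)
```

Under this reading, layers with identical singular values give the ratio $\sqrt{mK}$ between the parameter norm and the depth-normalized radius, whereas rank-one layers give $\sqrt{K}$.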

Now, as an application of the transportation bound, we present an upper bound on the de-randomization cost of Gaussian posteriors on . For simplicity, we only consider the spherical Gaussian distributions with some scale correction factors,

$$\mathcal{N}(w, \rho) := N\!\left(w, \frac{\rho^2 \bar{L}^2(w)}{m K^2 \gamma_m^2} I_d\right), \tag{14}$$

where , denotes the mean and denotes the normalized scale. Then, we connect the deviation of with that of the deterministic counterpart .

###### Theorem 8 (De-randomization cost of NNs)

Consider the stochastic predictor given in (14). Then, there exists a prior such that

 dS,U,δ(N(w,ρ),δw) ≤4eρR(w)√mK2n⎡⎢ ⎢ ⎢ ⎢⎣|w|¯L(w)I(√mρKγm)+ρ(1+√c2m)Kγm⎤⎥ ⎥ ⎥ ⎥⎦,

for all and . Here, we define and .

The proof is found in Appendix C.1. Theorem 8 tells us how much additional cost we have to pay if we are to remove the noise of scale . Note that the first term is dominant if is large and is moderately small. Specifically, taking , we have

 dS,U,δ(N(w,ρ2),δw) =~O(R(w)|w|¯L(w)√mK2n)

since .

Combining the above result with existing PAC-Bayesian bounds, we can prove deviation bounds of the deterministic NNs. Inserting the result of Theorem 8 with into that of Corollary 7, we obtain the following risk bound.

###### Corollary 9 (Risk of spectrally normalized NNs)

Suppose that for all and . Then, for large ,

$$P_S \Delta(f_w) = \tilde{O}\!\left( R(w) \frac{|w|}{\bar{L}(w)} \sqrt{\frac{m K^2}{n}} \right)$$

with high probability for simultaneously all .

The proof is also given in Appendix C.2. Note that the confidence level parameter is still present but is absorbed into the order notation .

The bound is much smaller than the VC-dimension-based bound  (Harvey et al., 2017) when and is moderate. This is the case when the singular values of each weight matrix decay fast (i.e., almost low rank). However, in the worst case with the identical singular values, we have and the bound is not tighter anymore.

Note that this is the same rate as the state-of-the-art PAC-Bayesian bound (Neyshabur et al., 2018), which is derived under the margin assumption on to remove the stochastic noise of PAC-Bayesian predictors. On the other hand, our requirement on the loss functions is the -Lipschitz continuity, which is strictly weaker than the margin assumption. In other words, the corollary implies that the stochasticity of the PAC-Bayesian predictor is not necessary to achieve the same rate as long as the loss is Lipschitz continuous.

Note also that this result gives a similar upper bound as in Bartlett et al. (2017), albeit slightly looser. One possible advantage of our bounds relative to theirs is that, according to Corollary 7, it gives risk bounds of both deterministic and stochastic predictors. Hence, it may result in better risk bounds by optimizing the trade-off with respect to the amount of noise.

## 5 Concluding Remarks

We have presented the PAC-Bayesian transportation bound, unifying the PAC-Bayesian analysis and the chaining analysis with an infinitesimal limit. It allows us to relate existing PAC-Bayesian bounds to the risk of deterministic predictors by evaluating the cost of noise reduction. As an example, we have given an upper bound on the noise reduction cost of neural networks, which gives a negative answer to the necessity of the noise (and hence margins) in the recently proposed PAC-Bayesian risk bound for spectrally normalized neural networks.

One of the most significant implications of the transportation bound, besides the de-randomization viewpoint, is that it allows us to evaluate the risk of predictors not only with what they are, but also with how we found them, regarding the processes of parameter search as transportations. This may be useful to analyze the generalization errors of iterative algorithms like SGD.

As future work, we highlight two possible directions. One is the characterization of the optimal posterior flow, aiming to understand the geometry of the metric space . Although a simple linear flow suffices to prove the results presented in this paper, this direction of study may give tighter transportation bounds. The other is to study whether the optimal posterior flow can be approximated with a function of empirical values, which hopefully links the transportation bound with new optimization algorithms.

## References

• Audibert and Bousquet (2007) Audibert, J.-Y. and Bousquet, O. (2007). Combining PAC-Bayesian and generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–889.
• Bartlett et al. (2017) Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
• Boucheron et al. (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
• Catoni (2007) Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248.
• Donsker and Varadhan (1975) Donsker, M. D. and Varadhan, S. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1–47.
• Dudley (1967) Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330.
• Dziugaite and Roy (2017) Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017.
• Dziugaite and Roy (2018) Dziugaite, G. K. and Roy, D. M. (2018). Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1376–1385.
• Harvey et al. (2017) Harvey, N., Liaw, C., and Mehrabian, A. (2017). Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Conference on Learning Theory, pages 1064–1068.
• McAllester (1999) McAllester, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363.
• Mou et al. (2018) Mou, W., Wang, L., Zhai, X., and Zheng, K. (2018). Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 605–638.
• Nagarajan and Kolter (2019) Nagarajan, V. and Kolter, Z. (2019). Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. In International Conference on Learning Representations.
• Neelakantan et al. (2015) Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2015). Adding gradient noise improves learning for very deep networks. CoRR, abs/1511.06807.
• Neyshabur et al. (2018) Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2018). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations.
• Russo and Zou (2015) Russo, D. and Zou, J. (2015). How much does your data exploration overfit? Controlling bias via information usage. arXiv preprint arXiv:1511.05219.
• Seldin et al. (2012) Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., and Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093.
• Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
• Talagrand (2001) Talagrand, M. (2001). Majorizing measures without measures. Annals of Probability, pages 411–417.
• Tolstikhin and Seldin (2013) Tolstikhin, I. O. and Seldin, Y. (2013). PAC-Bayes-empirical-Bernstein inequality. In Advances in Neural Information Processing Systems, pages 109–117.
• Tropp et al. (2015) Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230.
• Villani (2008) Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
• Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
• Zhou et al. (2019) Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2019). Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In International Conference on Learning Representations.

In this appendix, we provide the proofs for Theorem 6, Theorem 8 and Corollary 9 respectively. First, in Section A, we introduce the norm-based notation of the velocities and for convenience. Second, in Section B, we prove the main result, Theorem 6. Then, in Section C, we give the proofs of the statements on the generalization error of neural networks, namely Theorem 8 and Corollary 9.

## Appendix A Norm-Based Notation of Velocities

In this section, we introduce the formal notion of metric and metric field as tools for measuring the velocity.

##### Metrics

Let denote the set of the symmetric nonnegative linear mappings on , i.e., if , and for all . We refer to such as a metric, as it defines a distance on as follows. Let denote the Mahalanobis distance with respect to , i.e., for all .

##### Metric Fields

Next, we consider metric fields on , i.e., mappings from to . Just as metrics define norms of vectors, metric fields define norms of vector fields.

###### Definition 10

Let . Let be a vector field on and be a metric field on . Then, the -norm of with respect to and is given by

$$\|\mu\|_{\Lambda,Q} := \big\| |\mu|_{\Lambda} \big\|_{L_2(Q)} = \sqrt{Q|\mu|_{\Lambda}^{2}},$$

where denotes the function .

Here, we implicitly assume that and are regular enough in the sense that there exists an appropriate sigma algebra on that ensures the measurability of the integrand .
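To make Definition 10 concrete, the following is a minimal numerical sketch for a finitely supported distribution, so that the integral becomes a weighted sum. It assumes that the pointwise norm is the Mahalanobis form $|x|_A = \sqrt{x^\top A x}$; the names `mahalanobis_norm`, `field_norm`, and the toy fields `mu` and `Lam` are hypothetical and chosen only for illustration.

```python
import math

def mahalanobis_norm(x, A):
    """Return sqrt(x^T A x) for a vector x and a symmetric PSD matrix A (nested lists)."""
    d = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))
    return math.sqrt(max(quad, 0.0))

def field_norm(points, weights, mu, Lam):
    """(Lam, Q)-norm of the vector field mu, where Q puts mass weights[k] on points[k]."""
    second_moment = sum(
        w * mahalanobis_norm(mu(f), Lam(f)) ** 2 for f, w in zip(points, weights)
    )
    return math.sqrt(second_moment)

# Toy example (all values are illustrative assumptions).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
weights = [0.5, 0.25, 0.25]

def mu(f):
    """A vector field: one tangent vector per point f."""
    return (1.0, f[1])

def Lam(f):
    """A metric field: one symmetric PSD matrix per point f."""
    return [[1.0 + f[0], 0.0], [0.0, 1.0]]

norm_mu = field_norm(points, weights, mu, Lam)  # sqrt(0.5*1 + 0.25*2 + 0.25*5) = 1.5
```

Because the norm is an $L_2(Q)$ norm of pointwise Mahalanobis norms, scaling `mu` by a constant scales `norm_mu` by the absolute value of that constant.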

Finally, we introduce two metric fields, and , utilized in the main analysis. One is (i) a metric field induced from the distribution of the derivatives of loss functions, defined as follows.

###### Definition 11 (Data-dependent metric fields)

Let be metric fields given by

$$\Lambda_S(f) := \frac{P+P_S}{2}\big[\nabla\Delta(f)\otimes\nabla\Delta(f)\big],$$

where denotes the tensor product such that , , for all .

The other is (ii) a metric field denoted by , which is expected to be faithful to the data-independent structure of the loss function in the sense of the following assumption, which is equivalent to Assumption 1.

###### Assumption 12 (Lipschitz condition, norm based)

The loss function is Lipschitz continuous with respect to , i.e., .

Note that, under this assumption, Assumption 1 holds with . Moreover, the two velocities are written as

$$V_t(\xi;S) = \|\mu_t\|_{\Lambda_S,\,Q^{\xi}_{t}}, \qquad W_t(\xi) = \|\mu_t\|_{\Sigma^{-1},\,Q^{\xi}_{t}}.$$

This way, it is easier to see that these velocity measures satisfy homogeneity and the triangle inequality.

## Appendix B Complete Proof of Theorem 6

In this section, we present the detailed proofs for the main result. First, in Section B.1, we show an abstract PAC-Bayesian bound, where the notion of centrality functions plays a crucial role, and then introduce several instances of centrality functions in Section B.2. Finally, in Section B.3, we specialize the abstract bound to our transportation setting and prove the main theorem.

### B.1 Abstract PAC-Bayesian Bound

In this subsection, we introduce fundamental inequalities of the PAC-Bayesian analysis, namely, the strong Fenchel duality between the KL divergence and the log-integral-exp function, and its applications.

###### Lemma 13 (Donsker and Varadhan (1975))

For any measurable space and all ,

$$D_{\mathrm{KL}}(Q,U) = \sup_{X:\mathcal{F}\to\mathbb{R}} QX - \ln U[e^{X}],$$

where the supremum is taken over all the measurable functions . Therefore, we have

$$QX \le D_{\mathrm{KL}}(Q,U) + \ln U[e^{X}]$$

and it is impossible to improve it uniformly.

Proof  Assume that is not absolutely continuous with respect to . Then, the LHS diverges. Note that there exists such that and . Hence, the RHS also diverges by taking with .

On the other hand, if is absolutely continuous with respect to , there exists the Radon–Nikodym derivative . Now, let . Then we have

$$QX - \ln U[e^{X}] \le D_{\mathrm{KL}}(Q,U). \qquad \text{(Jensen's inequality)}$$

The equality holds if is constant -almost surely.
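As a sanity check of Lemma 13, the sketch below evaluates both sides of the duality on a three-point space, where integrals reduce to finite sums. The distributions `q` and `u` and the test function `x_other` are arbitrary illustrative choices, and the function names are hypothetical. The choice $X = \ln(dQ/dU)$ attains the supremum, while any other function gives a value no larger than the KL divergence.

```python
import math

def kl(q, u):
    """KL divergence D_KL(Q, U) between finite distributions given as lists."""
    return sum(qi * math.log(qi / ui) for qi, ui in zip(q, u) if qi > 0)

def dv_objective(x, q, u):
    """Donsker-Varadhan objective Q X - ln U[e^X] for a function x on a finite space."""
    mean_x = sum(qi * xi for qi, xi in zip(q, x))
    log_mgf = math.log(sum(ui * math.exp(xi) for ui, xi in zip(u, x)))
    return mean_x - log_mgf

q = [0.7, 0.2, 0.1]            # posterior Q (illustrative)
u = [1 / 3, 1 / 3, 1 / 3]      # prior U (illustrative)

x_star = [math.log(qi / ui) for qi, ui in zip(q, u)]  # log Radon-Nikodym derivative
x_other = [1.0, -0.5, 0.3]                            # an arbitrary competitor

value_star = dv_objective(x_star, q, u)    # equals kl(q, u) up to rounding
value_other = dv_objective(x_other, q, u)  # no larger than kl(q, u)
```

Adding a constant to `x_star` leaves the objective unchanged, which matches the condition that equality holds whenever the difference from the log-density ratio is constant.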

Utilizing Lemma 13, we present an abstract PAC-Bayesian inequality. To this end, we introduce the notion of centrality.

###### Definition 14 (Centrality)

We say a stochastic process is -central, , if

$$P\big[e^{X(f)-\eta(f)}\big] \le 1$$

for all . Moreover, we call a centrality function of .

The centrality functions allow us to estimate how stochastic processes deviate from their expected values. For example, is -subgaussian if and only if is -central for all . In particular, the centrality is essential in the PAC-Bayesian analysis as showcased in the following lemma.
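The link between subgaussianity and centrality can be illustrated with a Rademacher variable. The sketch below assumes the usual moment-generating-function definition of subgaussianity and takes $X_\lambda(z) = \lambda z$ with $z$ uniform on $\{-1,+1\}$, which is 1-subgaussian by Hoeffding's lemma. The function name `centrality_mgf` and the grid `lams` are hypothetical. With $\eta = \lambda^2/2$ the centrality condition holds for every $\lambda$, whereas the smaller candidate $\eta = \lambda^2/4$ fails.

```python
import math

def centrality_mgf(lam, eta):
    """P[exp(X - eta)] for X(z) = lam * z with z uniform on {-1, +1}."""
    return 0.5 * (math.exp(lam - eta) + math.exp(-lam - eta))

lams = [-3.0, -1.0, -0.1, 0.0, 0.5, 2.0]

# eta = lam^2 / 2 is a valid centrality function: cosh(lam) <= exp(lam^2 / 2).
central_values = [centrality_mgf(lam, lam * lam / 2) for lam in lams]

# eta = lam^2 / 4 is too small: the expectation exceeds 1 at lam = 0.5.
too_small = centrality_mgf(0.5, 0.5 * 0.5 / 4)
```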

###### Lemma 15 (Abstract PAC-Bayesian bound)

Let be an arbitrary -central process. Fix any prior distributions . Then, simultaneously for all ,

$$QP_SX \le QP_S\eta + \frac{1}{n}H_\delta(Q,U) \tag{15}$$

with probability on the draw of , where denotes the complexity of with respect to .

Proof  To combine Lemma 13 with the centrality property of , we consider . Thus, Lemma 13 implies that

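$$\begin{aligned} QP_SX &= QP_S\eta + \frac{1}{n}QP_S nX' \\ &\le QP_S\eta + \frac{1}{n}D_{\mathrm{KL}}(Q,U) + \frac{1}{n}\ln U\big[e^{P_S nX'}\big]. \end{aligned}$$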

Now, the centrality condition combined with Markov’s inequality yields

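$$\begin{aligned} U\big[e^{P_S nX'}\big] &\le \frac{1}{\delta}\,\mathbb{E}_{S\sim P^n} U\big[e^{P_S nX'}\big] && \text{(Markov)} \\ &= \frac{1}{\delta}\prod_{i=1}^{n} PU\big[e^{X'}\big] \le \frac{1}{\delta} && \text{(centrality)} \end{aligned}$$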

with probability , i.e.,

$$QP_SX \le QP_S\eta + \frac{1}{n}H_\delta(Q,U).$$
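The two steps of this proof can be checked exactly on a small example by enumerating every sample. The sketch below is only an illustration under stated assumptions: data points are Rademacher, the predictor class has three elements with $X(f,z) = \lambda_f z$, the centrality function is $\eta(f) = \ln\cosh\lambda_f$ (so the centrality condition holds with equality), and the complexity is taken to be $H_\delta(Q,U) = D_{\mathrm{KL}}(Q,U) + \ln(1/\delta)$. All names and numerical values are hypothetical.

```python
import itertools
import math

lams = [0.5, 1.0, 2.0]                           # hypothetical predictor class
etas = [math.log(math.cosh(l)) for l in lams]    # eta(f) = ln cosh(lam_f)
prior = [1 / 3, 1 / 3, 1 / 3]                    # prior U
n, delta = 6, 0.1

def kl(q, u):
    """KL divergence between finite distributions."""
    return sum(qi * math.log(qi / ui) for qi, ui in zip(q, u) if qi > 0)

def exp_moment(sample):
    """U[exp(P_S n X')] with X' = X - eta, for one sample S."""
    total = sum(sample)
    return sum(u * math.exp(l * total - n * e) for u, l, e in zip(prior, lams, etas))

def gibbs(sample):
    """Posterior proportional to U * exp(P_S n X'), the worst case for the bound."""
    total = sum(sample)
    w = [u * math.exp(l * total - n * e) for u, l, e in zip(prior, lams, etas)]
    z = sum(w)
    return [wi / z for wi in w]

def gap(q, sample):
    """Right-hand side minus left-hand side of (15) for posterior q and sample S."""
    mean_z = sum(sample) / n
    lhs = sum(qi * l * mean_z for qi, l in zip(q, lams))
    rhs = sum(qi * e for qi, e in zip(q, etas)) + (kl(q, prior) + math.log(1 / delta)) / n
    return rhs - lhs

samples = list(itertools.product([-1, 1], repeat=n))

# Centrality step: the expectation over S of the exponential moment is at most 1.
mean_moment = sum(exp_moment(s) for s in samples) / len(samples)

# Markov step: the event {U[exp(P_S n X')] <= 1/delta} has probability >= 1 - delta.
good = [s for s in samples if exp_moment(s) <= 1 / delta]
prob_good = len(good) / len(samples)

# On the good event, (15) holds simultaneously for all posteriors tried here.
posteriors = [[1, 0, 0], [0, 1, 0], [0, 0, 1], prior]
worst_gap = min(gap(q, s) for s in good for q in posteriors + [gibbs(s)])
```

The Gibbs posterior is included because it attains equality in Lemma 13, so it is the posterior for which the inequality is tightest on each sample.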

One can confirm that Lemma 15 is an abstraction of the PAC-Bayesian analysis, as it yields one of the most standard PAC-Bayesian bounds. (Footnote 5: The following theorem is complementary; it only shows the connection between Lemma 15 and ordinary PAC-Bayesian bounds, and hence readers may skip it.)

###### Theorem 16

Let be an arbitrary -subgaussian process, i.e., for all and . Fix any prior distributions . Then, simultaneously for all ,

$$QP_S\Delta \le \sigma\sqrt{\frac{2H_\delta(Q,U)+\ln 2\sqrt{e}n}{n-2}},$$

with probability on the draw of .

Proof  Let and take the prior and the posterior . Here, is a normal distribution with mean and variance and is the standard Cauchy distribution given by

$$\frac{U_1(dt)}{dt} = \frac{1}{\pi(1+t^2)}.$$

Since is -subgaussian, is by definition -central for . Hence Lemma 15 with the posterior and prior implies that, with probability ,

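$$\begin{aligned} \frac{1}{\sigma}QP_S\Delta &= \frac{1}{\bar\lambda}\tilde{Q}P_S\frac{X}{\sigma} \\ &\le \frac{1}{\bar\lambda}\tilde{Q}P_S\eta + \frac{1}{n\bar\lambda}H_\delta(\tilde{Q},\tilde{U}) \\ &= \frac{\bar\lambda+\frac{\rho^2}{\bar\lambda}}{2} + \frac{1}{n\bar\lambda}\big[H_\delta(Q,U)+D_{\mathrm{KL}}(N,U_1)\big] \end{aligned}$$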

for simultaneously all and . Now, take and observe