Bayesian fractional posteriors

Anirban Bhattacharya (anirbanb@stat.tamu.edu), Department of Statistics, Texas A&M University
Debdeep Pati (debdeep@stat.fsu.edu), Department of Statistics, Florida State University
Yun Yang (yyang@stat.fsu.edu), Department of Statistics, Florida State University
(Note: Authors are arranged in alphabetical order)
Abstract

We consider the fractional posterior distribution that is obtained by updating a prior distribution via Bayes theorem with a fractional likelihood function, a usual likelihood function raised to a fractional power. First, we analyze the contraction property of the fractional posterior in a general misspecified framework. Our contraction results only require a prior mass condition on certain Kullback-Leibler (KL) neighborhood of the true parameter (or the KL divergence minimizer in the misspecified case), and obviate constructions of test functions and sieves commonly used in the literature for analyzing the contraction property of a regular posterior. We show through a counterexample that some condition controlling the complexity of the parameter space is necessary for the regular posterior to contract, rendering additional flexibility on the choice of the prior for the fractional posterior. Second, we derive a novel Bayesian oracle inequality based on a PAC-Bayes inequality in misspecified models. Our derivation reveals several advantages of averaging based Bayesian procedures over optimization based frequentist procedures. As an application of the Bayesian oracle inequality, we derive a sharp oracle inequality in the convex regression problem under an arbitrary dimension. We also illustrate the theory in Gaussian process regression and density estimation problems.

Keywords: Posterior contraction; Rényi divergence; Misspecified models; PAC-Bayes; Oracle inequality; Convex regression.

1 Introduction

The use of fractional likelihoods has generated renewed attention in Bayesian statistics in recent years, where one raises a likelihood function to a fractional power, and combines the resulting fractional likelihood with a prior distribution via the usual Bayes formula to arrive at a power posterior or fractional posterior distribution. Applications of fractional posteriors have been diverse, ranging from fractional Bayes factors in objective Bayesian model selection [48], data-dependent priors for sparse estimation [42, 41], to marginal likelihood approximation [19] and posterior simulation [20]. Fractional posteriors are a special instance of Gibbs posteriors [31] or quasi-posteriors [15], where the negative exponent of a loss function targeted towards a specific parameter of interest is used as a surrogate for the likelihood function; see [10] for a general framework for updating prior beliefs using Gibbs posteriors.

The recent surge of interest in fractional posteriors can be largely attributed to their empirically demonstrated robustness to misspecification [25, 46]. For correctly specified, or well-specified (non)parametric models, there is now a rich body of literature [23, 53, 24] guaranteeing concentration of the posterior distribution around minimax neighborhoods of the true data generating distribution. While it can be argued that the primary objective of Bayesian nonparametric models is to relax parametric assumptions to capture finer aspects of the data, susceptibility to model misspecification remains a potent concern. First, in many practical situations, it may be unreasonable to assume that all aspects of the data generating distribution can be captured adequately via a probabilistic model, and fitting models of increasingly high complexity additionally carries the risk of overfitting. Second, even though Bayesian nonparametric models can be arbitrarily flexible, they still rely on parametric building blocks, for example, component specific distributions in mixture models, and small perturbations to these assumptions can lead to fairly drastic differences in the inference drawn from the model fit [46].

There is a comparatively smaller literature on large sample behavior of nonparametric Bayesian procedures under misspecification [32, 18, 50], where the general aim is to establish sufficient conditions under which the usual posterior distribution concentrates around the nearest Kullback–Leibler (KL) point to the truth inside the parameter space. However, these conditions are considerably more stringent than those in case of well-specified models, so that verification can be fairly nontrivial, along with comparatively limited scope of applicability. In fact, [26] empirically demonstrate through a detailed simulation study that even convergence to the nearest KL point may not take place in misspecified models. They instead recommend using a fractional posterior, with a data-driven approach to choose the fractional power; see also [25]. More recently, [46] proposed a coarsened posterior approach to combat model misspecification, where one conditions on neighborhoods of the empirical distribution rather than on the observed data to apply Bayes formula. When the neighborhood of the empirical distribution is chosen based on the KL divergence, [46] exhibited that the resulting coarsened posterior essentially takes the form of a fractional posterior.

These observations compel us to systematically study the concentration properties of fractional posteriors. While [59] established consistency of power posteriors for well-specified models (see also [43] for certain rate results), we derive rates of convergence for the fractional posterior for general non-i.i.d. models in a misspecified model framework. The sufficient conditions for the fractional posterior to concentrate at the nearest KL point turn out to be substantially simpler than those in the existing literature on misspecified models. We state our concentration results for a class of Rényi divergence measures in a non-asymptotic environment, which, in particular, imply Hellinger concentration in properly specified settings. The effect of flattening the likelihood shows up in the leading constant in the rate. The sub-exponential nature of the posterior tails allows us to additionally derive posterior moment bounds.

As one of our contributions, we show that the contraction rate of the fractional posterior is entirely determined by the prior mass assigned to appropriate KL neighborhoods of the true distribution, bypassing the construction of sieves and testing arguments in the existing theory [4, 23, 32]. One practically important consequence is that concentration results can be established for the fractional posterior for a much broader class of priors compared to the regular posterior. We provide several examples of the use of heavy-tailed hyperpriors in density estimation and regression, where the fractional posterior provably concentrates at a (near) minimax rate, while the regular posterior has inconclusive behavior. Another novel application of our result lies in shape constrained function estimation. Obtaining metric entropy estimates in such problems poses a stiff technical challenge and constitutes an active area of research [28]. The fractional posterior obviates the need to obtain such entropy estimates en route to deriving concentration bounds.

As a second contribution, we develop oracle inequalities for the fractional posterior based on a new PAC-Bayes inequality [11, 12, 51, 44, 16, 27] in a fully general Bayesian model. Many previous results on PAC-Bayes type inequalities are specifically tailored to classification (bounded loss, [11, 12, 51]) or regression (squared loss, [51, 39, 16, 27]) problems. Moreover, in the machine learning literature, a PAC-Bayes inequality is primarily used as a computational tool for controlling the generalization error by optimizing its upper bound over a restricted class of “posterior” distributions [11, 12]. There is a need to develop a general PAC-Bayes inequality and an accompanying general theory for analyzing the Bayesian risk that can be applied to a broader class of statistical problems. In this paper, we derive an oracle type inequality for Bayesian procedures, which will be referred to as a Bayesian oracle inequality (BOI), based on a new PAC-Bayes inequality. Similar to the local Rademacher complexity [5] or local Gaussian complexity [6] in a frequentist oracle inequality (FOI) for penalized empirical risk minimization procedures [33, 34], a BOI also involves a penalty term, which we refer to as local Bayesian complexity, that characterizes the local complexity of the parameter space. Roughly speaking, the local Bayesian complexity is defined as the inverse sample size times the negative logarithm of the prior mass assigned to a certain Kullback–Leibler neighborhood around the (pseudo) true parameter. In the special case when the prior distribution is close to being “uniform” over the parameter space, the local Bayesian complexity becomes the inverse sample size times a local covering entropy, and our BOI recovers the convergence rates derived from local covering conditions [38]. Moreover, our BOI naturally leads to sharp oracle inequalities when the model is misspecified. For example, when applied to convex regression, we derive a sharp oracle inequality with minimax-optimal (up to logarithmic factors) excess risk bound that extends the recent sharp oracle inequality obtained in [8] from dimension one to general dimension $d$.

Last but not least, our analysis reveals several potential advantages of averaging based Bayesian procedures over optimization based frequentist procedures. First, due to the averaging nature of a Bayesian procedure, our average case analysis leading to a BOI is significantly simpler than a common worst case analysis leading to a FOI. For example, a local average type excess risk bound from a Bayesian procedure allows us to use simple probability tools, such as Markov’s inequality and Chebyshev’s inequality, to obtain a high probability bound for the excess risk, since the expectation operation exchanges with the averaging (integration) operation. This is different from a local supremum type excess risk from an optimization procedure, where more sophisticated empirical process tools are exploited to obtain a high probability bound for the excess risk [40, 5, 54, 56], due to the non-exchangeability between the expectation operation and the supremum operation. For further details about the comparison between BOI and FOI, please refer to Section 3.3. Second, a Bayesian procedure naturally leads to adaptation to unknown hyperparameters or tuning parameters. We show that by placing a hyper-prior that distributes proper weights to different levels of the hyperparameter, a BOI adaptively leads to the optimal rate corresponding to the best choice of the hyperparameter.

The rest of the paper is organized as follows. The main results of the paper are stated in §3, with contraction results in §3.2, and the PAC-Bayesian inequality and Bayesian oracle inequality in §3.3. Applications to convex regression, Gaussian process regression and density estimation are discussed in §4. All proofs are deferred to §5.

2 Preliminaries

We begin by introducing notation, and then briefly review Rényi divergences as our key metric characterizing the contraction of fractional posteriors.

2.1 Notation

Let and denote the space of continuous functions and the Hölder space of -smooth functions , respectively, endowed with the supremum norm . For , the Hölder space consists of functions that have bounded mixed partial derivatives up to order , with the partial derivatives of order being Lipschitz continuous of order . Let and respectively denote the and norm on with respect to the Lebesgue measure (i.e., the uniform distribution). To distinguish the norm with respect to the Lebesgue measure on , we use the notation .

For a finite set , let denote the cardinality of . The set of natural numbers is denoted by . denotes for some constant . denotes the -covering number of the set with respect to the metric . The -dimensional simplex is denoted by . stands for the identity matrix. Let denote a multivariate normal density with mean and covariance matrix (or a diagonal matrix with squared elements of on the diagonal, when is a -vector).

2.2 Rényi divergences

Let $P$ and $Q$ be probability measures on a common probability space with a dominating measure $\mu$, and let $p = dP/d\mu$ and $q = dQ/d\mu$. The Hellinger distance , where denotes the Hellinger affinity. Let $D(p, q) = \int p \log(p/q)\, d\mu$ denote the Kullback–Leibler (KL) divergence between $p$ and $q$. For any $\alpha \in (0, 1)$, let

$$
D_\alpha(p, q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1 - \alpha} \, d\mu \tag{1}
$$

denote the Rényi divergence of order $\alpha$. Let us also denote $A_\alpha(p, q) = \int p^{\alpha} q^{1 - \alpha} \, d\mu$, which we shall refer to as the $\alpha$-affinity. When $\alpha = 1/2$, the $\alpha$-affinity equals the Hellinger affinity. We recall some important inequalities relating the above quantities; additional details and proofs can be found in [57].

(R1) $0 \le A_\alpha(p, q) \le 1$ for any $\alpha \in (0, 1)$, which in particular implies that $D_\alpha(p, q) \ge 0$ for any $\alpha \in (0, 1)$.

(R2) using the inequality for .

(R3) For fixed $p$ and $q$, $D_\alpha(p, q)$ is increasing in the order $\alpha$. Moreover, the following two-sided inequality shows the equivalence of $D_\alpha$ and $D_\beta$ for $0 < \alpha \le \beta < 1$:

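$$
\frac{\alpha}{\beta} \, \frac{1 - \beta}{1 - \alpha} \, D_\beta \le D_\alpha \le D_\beta, \qquad 0 < \alpha \le \beta < 1.
$$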

(R4) By an application of L’Hospital’s rule, $\lim_{\alpha \to 1^-} D_\alpha(p, q) = D(p, q)$.
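The quantities in (1) and the properties (R3) and (R4) are easy to check numerically. The following is a minimal sketch, assuming discrete distributions with full support so that the integral in (1) becomes a finite sum; the function names and the two example distributions are hypothetical and chosen only for illustration.

```python
import numpy as np

def alpha_affinity(p, q, alpha):
    """alpha-affinity: sum_x p(x)^alpha * q(x)^(1 - alpha) for discrete p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p**alpha * q**(1.0 - alpha)))

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha in (0, 1), as in display (1)."""
    return np.log(alpha_affinity(p, q, alpha)) / (alpha - 1.0)

def kl_divergence(p, q):
    """KL divergence sum_x p(x) log(p(x) / q(x)); assumes full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two illustrative distributions on three points (assumed, not from the text).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
divs = [renyi_divergence(p, q, a) for a in alphas]

for a, d in zip(alphas, divs):
    print(f"alpha = {a:.1f}: D_alpha = {d:.4f}")
print(f"KL divergence: {kl_divergence(p, q):.4f}")
```

The printed divergences increase with the order, and evaluating `renyi_divergence` at an order close to one approaches the KL divergence, in line with (R3) and (R4).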

3 Contraction and Bayesian oracle inequalities for fractional posteriors

In this section, we present our main results. To begin with, we introduce the background, including the definition of fractional posterior distributions in Bayesian procedures, in §3.1. Then we present our results on the contraction of fractional posterior distributions in §3.2, and Bayesian oracle inequalities based on PAC-Bayes type bounds in §3.3.

3.1 Background

We will present our theory on the large sample properties of fractional posteriors in its full generality by allowing the model to be misspecified and the observations, denoted by $X^{(n)}$, to be neither identically nor independently distributed (abbreviated as non-i.i.d.) [24]. Our non-i.i.d. result can be applied to models with nonindependent observations such as Gaussian time series and Markov processes, or models with independent, nonidentically distributed (i.n.i.d.) observations such as Gaussian regression and density regression.

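More specifically, let $(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P^{(n)}_\theta : \theta \in \Theta)$ be a sequence of statistical experiments with observations $X^{(n)}$, where $\theta$ is the parameter of interest in arbitrary parameter space $\Theta$, and $n$ is the sample size. For each $\theta$, let $P^{(n)}_\theta$ admit a density $p^{(n)}_\theta$ relative to a $\sigma$-finite measure $\mu^{(n)}$. Assume that $(x, \theta) \mapsto p^{(n)}_\theta(x)$ is jointly measurable relative to $\mathcal{A}^{(n)} \otimes \mathcal{B}$, where $\mathcal{B}$ is a $\sigma$-field on $\Theta$. We place a prior distribution $\Pi_n$ on $\Theta$, and define the fractional likelihood of order $\alpha \in (0, 1)$ to be the usual likelihood raised to power $\alpha$: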

$$
L_{n,\alpha}(\theta) = \big[ p^{(n)}_\theta(X^{(n)}) \big]^{\alpha}. \tag{2}
$$

Let $\Pi_{n,\alpha}(\cdot \mid X^{(n)})$ denote the posterior distribution obtained by combining the fractional likelihood $L_{n,\alpha}$ with the prior $\Pi_n$, that is, for any measurable set $B \subseteq \Theta$,

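$$
\Pi_{n,\alpha}(B \mid X^{(n)}) = \frac{\int_B L_{n,\alpha}(\theta) \, \Pi_n(d\theta)}{\int_\Theta L_{n,\alpha}(\theta) \, \Pi_n(d\theta)} = \frac{\int_B e^{-\alpha r_n(\theta, \theta^\dagger)} \, \Pi_n(d\theta)}{\int_\Theta e^{-\alpha r_n(\theta, \theta^\dagger)} \, \Pi_n(d\theta)}, \tag{3}
$$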

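where $r_n(\theta, \theta^\dagger) = \log \{ p^{(n)}_{\theta^\dagger}(X^{(n)}) / p^{(n)}_\theta(X^{(n)}) \}$ is the negative log-likelihood ratio between $\theta$ and any other fixed parameter value $\theta^\dagger$. For example, we may choose $\theta^\dagger$ as the parameter $\theta_0$ associated with the true data generating distribution, abbreviated as the true parameter. Clearly, $\Pi_{n,1}$ denotes the usual posterior distribution.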
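To make definition (3) concrete, consider a conjugate toy model. This is a minimal sketch, assuming i.i.d. Bernoulli observations with a Beta(a, b) prior (an example chosen here for illustration, with hypothetical function names): raising the likelihood to the power alpha simply discounts the observed counts, so the fractional posterior is Beta(a + alpha s, b + alpha (n - s)), where s is the number of successes. The grid computation evaluates the normalized ratio in (3) directly.

```python
import numpy as np

def fractional_beta_posterior(successes, n, alpha, a=1.0, b=1.0):
    """Closed-form fractional posterior parameters for the Beta-Bernoulli model."""
    return a + alpha * successes, b + alpha * (n - successes)

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

def grid_fractional_posterior(successes, n, alpha, a=1.0, b=1.0, m=20001):
    """Evaluate formula (3) on a grid: prior times likelihood**alpha, normalized."""
    theta = np.linspace(1e-6, 1.0 - 1e-6, m)
    log_lik = successes * np.log(theta) + (n - successes) * np.log1p(-theta)
    log_prior = (a - 1.0) * np.log(theta) + (b - 1.0) * np.log1p(-theta)
    log_post = alpha * log_lik + log_prior
    weights = np.exp(log_post - log_post.max())  # subtract max for stability
    return theta, weights / weights.sum()

# 30 successes in 50 trials, uniform prior (illustrative numbers).
for alpha in (0.5, 1.0):
    a_post, b_post = fractional_beta_posterior(30, 50, alpha)
    mean, var = beta_mean_var(a_post, b_post)
    print(f"alpha = {alpha}: Beta({a_post}, {b_post}), mean = {mean:.4f}, var = {var:.5f}")
```

In this example the fractional posterior with alpha = 0.5 is centered near the same value as the regular posterior (alpha = 1) but is more spread out, which is the flattening effect of the fractional power.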

We allow the model to be misspecified by allowing $\theta_0$ to lie outside the parameter space $\Theta$. In misspecified models, the point $\theta^*$ in $\Theta$ that minimizes the KL divergence from $p^{(n)}_{\theta_0}$, that is,

$$
\theta^* := \arg\min_{\theta \in \Theta} D\big( p^{(n)}_{\theta_0}, p^{(n)}_{\theta} \big), \tag{4}
$$

plays the role of $\theta_0$ in well-specified models [32]. In fact, we will show that the fractional posterior distribution tends to contract towards $\theta^*$ as $n \to \infty$. We use the divergence

$$
D^{(n)}_{\theta_0, \alpha}(\theta, \theta^*) := \frac{1}{\alpha - 1} \log A^{(n)}_{\theta_0, \alpha}(\theta, \theta^*), \tag{5}
$$

referred to as the $\alpha$-divergence with respect to $p^{(n)}_{\theta_0}$, or simply $D^{(n)}_{\theta_0, \alpha}$, to measure the closeness between any $\theta \in \Theta$ and $\theta^*$, where

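$$
A^{(n)}_{\theta_0, \alpha}(\theta, \theta^*) := \int \bigg( \frac{p^{(n)}_{\theta}}{p^{(n)}_{\theta^*}} \bigg)^{\alpha} p^{(n)}_{\theta_0} \, d\mu^{(n)}
$$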

is an $\alpha$-affinity between $p^{(n)}_{\theta}$ and $p^{(n)}_{\theta^*}$ with respect to $p^{(n)}_{\theta_0}$.

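Remark: In the well-specified case where $\theta^* = \theta_0$, $A^{(n)}_{\theta_0, \alpha}$ reduces to the usual $\alpha$-affinity defined in §2.2, and $D^{(n)}_{\theta_0, \alpha}$ becomes the Rényi divergence of order $\alpha$ between $p^{(n)}_{\theta}$ and $p^{(n)}_{\theta_0}$: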

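$$
D^{(n)}_{\alpha}(\theta, \theta_0) = D_\alpha\big( p^{(n)}_{\theta}, p^{(n)}_{\theta_0} \big) = \frac{1}{\alpha - 1} \log \int \big\{ p^{(n)}_{\theta} \big\}^{\alpha} \big\{ p^{(n)}_{\theta_0} \big\}^{1 - \alpha} \, d\mu^{(n)}. \tag{6}
$$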

Note that we drop $\theta_0$ from the subscript when $\theta^* = \theta_0$.

In general, continues to define a divergence measure that satisfies for and in a variety of statistical problems. For example, in the normal means problem , defines a divergence measure if the parameter space for the mean is a convex set in ; see §4.1 for more details. The convexity condition is satisfied by a broad class of problems, including sparse problems, isotonic regression, and convex regression [14]. In the density estimation context, the following Lemma shows that defines a divergence measure if the parameter space of densities is convex.

Lemma 3.1 (Property of α-divergences).

If is convex (Footnote 1: Given any , and , there exists such that ) or is an interior point of , then for any . Therefore, defines a divergence that satisfies for and .

When , the proof of the lemma implies that if and only if on the support of , since is a strictly concave function on .

We will primarily focus on the following two cases in this paper.

Independent and identically distributed observations:

When are i.i.d. observations, equals the -fold product measure , where is the common distribution for the observations. also takes a product form as , with the common -field. The fractional likelihood function is

$$
L_{n,\alpha}(\theta) = \prod_{i=1}^{n} \{ p_\theta(X_i) \}^{\alpha}, \tag{7}
$$

where is the common density indexed by . The negative log-likelihood ratio becomes the sum of individual log density ratios. Moreover, the -affinity and divergence can be simplified as and , where and respectively are the -affinity and divergence for .

Independent observations:

In this case as well, are independent observations. However, the th observation has an index-dependent distribution , which possesses a density relative to a -finite measure on . Thus, we take the measure equal to the product measure on the product measurable space . The fractional likelihood function takes a product form as

$$
L_{n,\alpha}(\theta) = \prod_{i=1}^{n} \{ p_{\theta, i}(X_i) \}^{\alpha}, \tag{8}
$$

and the negative log-likelihood ratio . The -affinity and divergence can be decomposed, respectively, as and , where and are the -affinity and divergence associated with the th observation .
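The product and sum decompositions used in the two independent-data cases follow in one line from definition (5) and Fubini’s theorem. As a sketch, write $A_{\theta_0, \alpha, i}$ and $D_{\theta_0, \alpha, i}$ for the affinity and divergence of the $i$th observation and $\mu_i$ for its dominating measure (this per-observation notation is an assumption made here for illustration):

$$
\begin{aligned}
A^{(n)}_{\theta_0, \alpha}(\theta, \theta^*)
&= \int \prod_{i=1}^{n} \bigg( \frac{p_{\theta, i}(x_i)}{p_{\theta^*, i}(x_i)} \bigg)^{\alpha} p_{\theta_0, i}(x_i) \, \mu_1(dx_1) \cdots \mu_n(dx_n)
 = \prod_{i=1}^{n} A_{\theta_0, \alpha, i}(\theta, \theta^*), \\
D^{(n)}_{\theta_0, \alpha}(\theta, \theta^*)
&= \frac{1}{\alpha - 1} \log \prod_{i=1}^{n} A_{\theta_0, \alpha, i}(\theta, \theta^*)
 = \sum_{i=1}^{n} D_{\theta_0, \alpha, i}(\theta, \theta^*).
\end{aligned}
$$

When the observations are also identically distributed, every factor is the same, so the affinity becomes an $n$th power of the single-observation affinity and the divergence becomes $n$ times the single-observation divergence. This is why the results below are stated for the divergence scaled by $1/n$.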

3.2 General concentration bounds

In this subsection, we consider the asymptotic behavior of fractional posterior distributions and corresponding Bayes estimators based on non-i.i.d. observations under the general misspecified framework. We give general results on the rate of contraction of the fractional posterior measure towards the KL minimizer relative to the -divergence .

For any , we define a specific kind of KL neighborhood of with radius as

 (9)

It is standard practice to make assumptions on the prior mass assigned to such KL neighborhoods to obtain the rate of posterior concentration in misspecified models [32]. With these notations, we present a nonasymptotic upper bound for the posterior probability assigned to complements of -divergence neighborhoods of with respect to .

Theorem 3.2 (Contraction of fractional posterior distributions).

Fix . Recall from (4). Assume that satisfies and

 (10)

Then, for any and ,

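$$
\Pi_{n,\alpha}\bigg( \frac{1}{n} D^{(n)}_{\theta_0, \alpha}(\theta, \theta^*) \ge \frac{D + 3t}{1 - \alpha} \, \varepsilon_n^2 \,\bigg|\, X^{(n)} \bigg) \le e^{-t n \varepsilon_n^2}
$$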

holds with probability at least .

Theorem 3.2 characterizes the contraction of the fractional posterior measure where the posterior of exhibits a sub-exponentially decaying tail. As a direct consequence, we have the following corollary that characterizes the fractional posterior moments of .

Corollary 3.3 (Fractional posterior moments).

Under the conditions of Theorem 3.2, we have that for any ,

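$$
\int \Big\{ \frac{1}{n} D^{(n)}_{\theta_0, \alpha}(\theta, \theta^*) \Big\}^{k} \, \Pi_{n,\alpha}(d\theta \mid X^{(n)}) \le \frac{C_1}{(1 - \alpha)^{k}} \, \varepsilon_n^{2k},
$$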

holds with probability at least , where are some positive constants depending on .

Implications for well-specified models. While Theorem 3.2 and Corollary 3.3 apply generally to the misspecified setting, it is instructive to first consider their implications in the well-specified setting, i.e., when the data generating parameter . Setting in Theorem 3.2 implies that the fractional posterior increasingly concentrates on -sized neighborhoods of the true parameter . In particular, given (R2) and (R3), Theorem 3.2 implies that for any , the rate of concentration of the fractional posterior in the Hellinger metric is . Similar concentration results for the usual posterior distribution in the Hellinger metric were established in [53, 23] for the i.i.d. case, and in [24] for the non-i.i.d. case. Since the prior mass condition (10) appears as one of the sufficient conditions there as well, the fractional posterior achieves the same rate of concentration as the usual posterior (albeit up to constants) in all the examples considered in these works, which is typically minimax up to a logarithmic term for appropriately chosen priors. In addition to the prior mass condition (10), the sufficient conditions of [24] additionally require the construction of sieves whose -entropy in the Hellinger metric is stipulated to grow in the order , and at the same time, the prior probability assigned to the complement of the sieve is required to be exponentially small, i.e., . The existence of such sieves with suitable control over their metric entropy is a crucial ingredient of their theory, as it guarantees existence of exponentially consistent test functions [9, 37] to test the true density against complements of Hellinger neighborhoods of the form .

An important distinction for the fractional posterior in Theorem 3.2 is that the prior mass condition alone is sufficient to guarantee optimal concentration. This is important for at least two distinct reasons. First, the condition of exponentially decaying prior mass assigned to the complement of the sieve implies fairly strong restrictions on the prior tails and essentially rules out heavy-tailed prior distributions on hyperparameters. In contrast, a much broader class of prior choices leads to provably optimal posterior behavior for the fractional posterior. Second, obtaining tight bounds on the metric entropy in non-regular parameter spaces, for example, in shape-constrained regression problems, can be a substantially nontrivial exercise [28], which is entirely circumvented using the fractional posterior approach. Specific examples of either kind are provided in §4.

While it may be argued that the conditions on the entropy and complement probability of the sieve are only sufficient conditions, a counterexample from [4] suggests that some control on the complexity of the parameter space is also necessary to ensure the consistency of a regular posterior when the model space is well-specified. Specifically, in their example, the posterior tends to put all its mass on a set of distributions that are away from the true data generating distribution with respect to the Hellinger metric, even though the prior assigns positive probability over any -KL ball around the true parameter. As an implication, the fractional posterior can still achieve a certain rate of contraction for this problem even though the regular posterior is not consistent. In fact, the rate of concentration of the fractional posterior for this problem, since their prior satisfies for some constant . Therefore, a combination of Theorem 3.2 and the counterexample in [4] shows that the fractional posterior has an annealing effect that can flatten the potential peculiar spikes in the regular posterior that are far away from the true parameter. However, this additional flexibility of the fractional posterior comes at a price—when the regular posterior contracts, then the -fractional posterior will sacrifice a factor of in the rate of contraction.

The following theorem shows that for fixed $n$, the fractional posterior will almost surely converge to the regular posterior ($\alpha = 1$) as $\alpha \to 1$.

Theorem 3.4 (Regular posterior as a limit of fractional posteriors).

For each , we have

This theorem implies that although, for a fixed $\alpha < 1$, the fractional posterior has the annealing effect of flattening the posterior, it will eventually converge to the regular posterior as $\alpha \to 1$ almost surely. This observation also justifies the empirical observation [21] that parallel tempering can boost the convergence of the posterior when the posterior contracts. However, when the posterior is ill-behaved (it is inconsistent or multimodal), we need a very fine grid for the design of $\alpha$ in the parallel tempering algorithm, since otherwise all fractional posteriors will only exhibit the one big mode around and miss the rest.
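The limiting behavior can be seen directly in a conjugate example. This is a minimal sketch, assuming i.i.d. N(theta, 1) observations with a N(0, tau^2) prior, for which the fractional posterior is normal with precision 1/tau^2 + alpha n. The model, the summary statistics, and the use of total variation as the distance are assumptions made here for illustration, not the setting of Theorem 3.4.

```python
import numpy as np

def normal_fractional_posterior(xbar, n, alpha, tau2=1.0):
    """Mean and variance of the fractional posterior in the normal-normal model."""
    precision = 1.0 / tau2 + alpha * n
    return alpha * n * xbar / precision, 1.0 / precision

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def tv_distance(m1, v1, m2, v2, grid):
    """Total variation distance between two normals via a Riemann sum."""
    dx = grid[1] - grid[0]
    diff = np.abs(normal_pdf(grid, m1, v1) - normal_pdf(grid, m2, v2))
    return 0.5 * float(np.sum(diff)) * dx

xbar, n = 0.8, 25                      # illustrative sample mean and sample size
grid = np.linspace(-2.0, 3.0, 200001)
m_reg, v_reg = normal_fractional_posterior(xbar, n, 1.0)   # regular posterior

alphas = (0.5, 0.9, 0.99, 0.999)
tv = [tv_distance(*normal_fractional_posterior(xbar, n, a), m_reg, v_reg, grid)
      for a in alphas]

for a, d in zip(alphas, tv):
    print(f"alpha = {a}: TV distance to regular posterior = {d:.5f}")
```

With the data held fixed, the distance to the regular posterior shrinks as alpha approaches one.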

Implications for misspecified models.

A key reference for Bayesian asymptotics in infinite-dimensional misspecified models is [32], where sufficient conditions analogous to the well-specified case were provided for the posterior to concentrate around $\theta^*$. The primary technical difficulty in showing such a result compared to the well-specified case is the construction of test functions, for which [32] proposed a novel solution. Akin to the well-specified case for the regular posterior, the sufficient conditions of [32] consist of a prior thickness condition as in Theorem 3.2, and conditions on entropy numbers. However, the entropy number conditions (equations (2.2) and (2.5) in [32]) for the misspecified case are substantially harder to verify. Their Lemma 2.1 gives a simpler sufficient condition that relates their entropy number condition to ordinary entropy numbers. Further, in their Lemma 2.3, exploiting convexity of the parameter space, they established that the sufficient conditions of their Lemma 2.1 are satisfied by a weighted Hellinger distance

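$$
h_w^2\big( \theta^{(n)}, \theta^{*(n)} \big) = \frac{1}{4} \int \Big( \sqrt{p^{(n)}_{\theta^*}} - \sqrt{p^{(n)}_{\theta}} \Big)^2 \, \frac{p^{(n)}_{\theta_0}}{p^{(n)}_{\theta^*}} \, d\mu^{(n)},
$$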

which then amounts to obtaining entropy numbers in the weighted Hellinger metric. Such an exercise typically requires further assumptions on the behavior of $p^{(n)}_{\theta_0} / p^{(n)}_{\theta^*}$. For example, if $\| p^{(n)}_{\theta_0} / p^{(n)}_{\theta^*} \|_\infty$ is finite, the ordinary Hellinger metric dominates the weighted Hellinger metric and it suffices to obtain covering numbers with respect to the ordinary Hellinger metric. Under this assumption, the authors proceeded to derive convergence rates for the regular posterior in a density estimation problem using Dirichlet process mixture priors. However, this assumption precludes the true density from having heavier tails than those prescribed by the model. For example, if the true density has heavier tails than the class of densities specified by the model, the assumption is not satisfied. Typically, in the misspecified case, controlling the prior mass (10) in Theorem 3.2 requires certain tail conditions on $p^{(n)}_{\theta_0}$. However, Theorem 3.2 obviates the need to verify any entropy conditions for the fractional posterior. It thus avoids the need to assume $\| p^{(n)}_{\theta_0} / p^{(n)}_{\theta^*} \|_\infty < \infty$, unless required to verify the prior mass condition.

For $\alpha = 1/2$, our divergence measure dominates the weighted Hellinger distance in which [32] derive their convergence rate for the density estimation problem in Theorem 3.1. This can be readily seen from

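$$
\begin{aligned}
4 h_w^2\big( \theta^{(n)}, \theta^{*(n)} \big)
&= 1 + \int \frac{p^{(n)}_{\theta}}{p^{(n)}_{\theta^*}} \, p^{(n)}_{\theta_0} \, d\mu^{(n)} - 2 \int \bigg( \frac{p^{(n)}_{\theta}}{p^{(n)}_{\theta^*}} \bigg)^{1/2} p^{(n)}_{\theta_0} \, d\mu^{(n)} \\
&\le 2 \bigg[ 1 - \int \bigg( \frac{p^{(n)}_{\theta}}{p^{(n)}_{\theta^*}} \bigg)^{1/2} p^{(n)}_{\theta_0} \, d\mu^{(n)} \bigg]
\le D_{1/2}\big( \theta^{(n)}, \theta^{*(n)} \big),
\end{aligned}
$$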

where the last inequality follows from the inequality $\log x \le x - 1$ for $x > 0$, and the penultimate inequality follows from Lemma 3.1.

3.3 PAC-Bayes bounds and Bayesian oracle inequalities

In many problems, the performance of a (pseudo) Bayesian approach can be characterized via PAC-Bayes type inequalities [51, 44, 27]. A typical PAC-Bayes inequality takes the form

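$$
\int R(\theta, \theta_0) \, \Pi_{n,\alpha}(d\theta \mid X^{(n)}) \le \int S_n(\theta, \theta_0) \, \rho(d\theta) + \frac{1}{\kappa_n} D(\rho, \Pi_n) + \mathrm{Rem}, \qquad \forall \text{ probability measure } \rho \ll \Pi_n,
$$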

where $R$ is a statistical risk function, $\kappa_n$ is some tuning parameter, Rem is some remainder term, and $S_n$ is some function that measures the discrepancy between $\theta$ and $\theta_0$ on the support of $\rho$. We present a PAC-Bayes inequality for the fractional posterior distribution, where the risk function $R$ is a multiple of the $\alpha$-Rényi divergence $D^{(n)}_{\alpha}$ in (6), and $S_n$ is a multiple of the negative log-likelihood ratio $r_n$.

Theorem 3.5 (PAC-Bayes inequality relative to θ0).

Fix $\alpha \in (0, 1)$. Then, for any $\varepsilon \in (0, 1)$,

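$$
\int \Big\{ \frac{1}{n} D^{(n)}_{\alpha}(\theta, \theta_0) \Big\} \, \Pi_{n,\alpha}(d\theta \mid X^{(n)}) \le \frac{\alpha}{n(1 - \alpha)} \int r_n(\theta, \theta_0) \, \rho(d\theta) + \frac{1}{n(1 - \alpha)} D(\rho, \Pi_n) + \frac{1}{n(1 - \alpha)} \log(1/\varepsilon), \qquad \forall \text{ probability measure } \rho \ll \Pi_n, \tag{11}
$$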

with probability at least $1 - \varepsilon$.

Theorem 3.5 immediately implies an oracle type inequality for the Bayes estimator by using the convexity of and applying Jensen’s inequality,

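$$
\frac{1}{n} D^{(n)}_{\alpha}(\hat{\theta}_B, \theta_0) \le \frac{\alpha}{n(1 - \alpha)} \int r_n(\theta, \theta_0) \, \rho(d\theta) + \frac{1}{n(1 - \alpha)} D(\rho, \Pi_n) + \frac{1}{n(1 - \alpha)} \log(1/\varepsilon), \tag{12}
$$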

for all probability measures $\rho \ll \Pi_n$. We call this inequality a Bayesian oracle inequality.

Let us compare the Bayesian oracle inequality (BOI) with frequentist oracle inequalities (FOI) [33, 34]. For convenience, we assume that the observations are i.i.d., and use $P_n$ to represent the empirical measure $n^{-1} \sum_{i=1}^{n} \delta_{X_i}$. For a function $f$, define

$$
P_n f = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \quad \text{and} \quad P_{\theta_0} f = E_{\theta_0} f(X). \tag{13}
$$

Under this notation, a typical FOI takes the form

$$
P_{\theta_0} f_{\hat{\theta}} \le c \inf_{\theta \in \Theta} P_{\theta_0} f_{\theta} + \Psi_n(r_n), \tag{14}
$$

for some leading constant , where is the estimator of , for example, obtained by empirical risk minimization [34, 7]. Here is a class of functions indexed by , such as, a certain loss function evaluated at . The term will be referred to as the approximation error term, reflecting the smallest loss incurred by approximating from . The second term in the display is an excess risk term that reflects certain local complexity measure of , such as the local Rademacher complexity [5] or local Gaussian complexity [6]. typically serves as a high probability upper bound to the supremum of the localized empirical process,

 (15)

up to some other remainder terms, where is a critical radius obtained as the fixed point of certain function depending on .

Now let us look at the BOI (12), which can be rewritten as

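$$
\frac{1}{n} D^{(n)}_{\alpha}(\hat{\theta}_B, \theta_0) \le \frac{\alpha}{n(1 - \alpha)} \inf_{\theta \in \Theta} P_{\theta_0} r_{\theta} + \frac{\alpha}{n(1 - \alpha)} \int \{ P_n r_{\theta} - P_{\theta_0} r_{\theta} \} \, \rho(d\theta) \tag{16}
$$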

where $r_\theta$ is the log density ratio. We observe that the first term on the right hand side of (16) is the approximation error term, and the rest serves as the excess risk term. However, the excess risk term in BOI has two distinctions from that in FOI. First, different from the FOI that induces localization via either an iterative procedure [35] or solving the fixed point of a certain function [5], a BOI induces localization via picking a measure $\rho$ concentrating around the best approximation that balances between the average approximation error and a penalty on the size of localization . Second, in FOI the stochastic term characterizing the local complexity is based on a worst case analysis by taking the supremum as in (15), while BOI bounds the stochastic term based on an average case analysis via the average fluctuation

$$
\int \{ P_n r_{\theta} - P_{\theta_0} r_{\theta} \} \, \rho(d\theta).
$$

Because we can exchange the expectation with integration, this local average form allows us to use simple probability tools, such as Markov’s inequality and Chebyshev’s inequality, to obtain bounds for the excess risk. This is different from the local supremum form (15), where expectation does not exchange with supremum, and we need much more sophisticated empirical process tools such as chaining and peeling techniques to bound the excess risk (see, for example, [40, 5, 54, 56]).
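The contrast between the average and supremum forms can be illustrated by simulation. This is a minimal sketch, assuming a N(theta, 1) location model with truth theta_0 = 0, for which the log density ratio is r_theta(x) = theta^2/2 - theta x and hence the fluctuation P_n r_theta - P_{theta_0} r_theta equals -theta times the sample mean. The measure rho (uniform on a grid in [0, 1]), the sample size, and the threshold are hypothetical choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 2000
thetas = np.linspace(0.0, 1.0, 101)   # support of rho, with uniform weights

# Sample means of n standard normal observations, one per replicate.
xbar = rng.standard_normal((reps, n)).mean(axis=1)

# Fluctuation P_n r_theta - P_0 r_theta = -theta * xbar for every theta.
fluct = -np.outer(xbar, thetas)             # shape (reps, len(thetas))
avg_fluct = fluct.mean(axis=1)              # integral against rho
sup_fluct = np.abs(fluct).max(axis=1)       # supremum over theta

# Chebyshev: the averaged fluctuation has mean 0 and variance mean(theta)^2 / n.
var_avg = thetas.mean() ** 2 / n
t = 3.0 * np.sqrt(var_avg)
chebyshev_bound = var_avg / t ** 2
empirical = float(np.mean(np.abs(avg_fluct) >= t))

print(f"Chebyshev bound on P(|average fluctuation| >= t): {chebyshev_bound:.4f}")
print(f"Empirical frequency over {reps} replicates:        {empirical:.4f}")
print(f"Mean |average| = {np.abs(avg_fluct).mean():.4f}, mean sup = {sup_fluct.mean():.4f}")
```

Only the variance of a single averaged quantity is needed to control the average form, whereas the supremum form would in general require a uniform bound over the whole index set; in this one-parameter toy model that difference is mild, but it is the mechanism described above.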

As a simple illustration of applying Chebyshev’s inequality to BOI or inequality (11) in Theorem 3.5 to obtain an explicit risk bound for the Bayes estimator, we present the following corollary. Recall the definition of the KL neighborhood defined in (9).

Corollary 3.6.

Suppose satisfies and . With probability at least ,

 (17)

In particular, if we let $\varepsilon_n$ be the Bayesian critical radius that is a stationary point of

$$
\frac{-\log \Pi_n\big( B_n(\theta_0, \varepsilon; \theta_0) \big)}{n \varepsilon} = \varepsilon,
$$

then with probability at least ,

The main idea of the proof is to choose the probability measure as , the restriction of the prior to . Under this choice, we have , and can be bounded by applying Chebyshev’s inequality. If higher moment constraints on the likelihood ratio are also included in the definition of in (9), then the probability bound for (17) to hold can be boosted (for details, see Section 2 in [22]).

According to Corollary 3.6, the overall risk bound in (17) is a balance between two terms: an approximation error term and a local complexity measure term . For this reason, we will refer to the second term as the local Bayesian complexity. The local Bayesian complexity reflects the compatibility between the prior distribution and the parameter space: if is close to a uniform distribution over , then is roughly the logarithm of the number of -balls needed to cover a neighborhood of , and therefore is related to the local covering entropy. On the other hand, if some prior knowledge about is available, then we can incorporate this knowledge to increase the prior mass around , which may significantly boost the rate of convergence of the Bayes estimator. This observation is consistent with our previous intuition that averaging based (average case analysis) Bayesian approaches can sometimes be better than optimization based (worst case analysis) frequentist approaches. For example, when a certain hyperparameter or tuning parameter, such as the regularity of a function class or the sparsity level of a regression model, is unknown, a Bayesian procedure naturally achieves adaptation to those unknown parameters by placing a prior on them that distributes proper weights to different levels of the hyperparameter (see our examples in Section 4). In contrast, a common way to select a tuning parameter in frequentist methods is via cross-validation or data-splitting. These approaches use only a portion of the data for estimation, after learning the tuning parameter from the rest, which may not be the most efficient way to use the data.
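The Bayesian critical radius can be computed numerically in simple models. This is a minimal sketch, assuming a well-specified N(theta, 1) model with a N(0, tau^2) prior, and assuming (as a simplification of the neighborhood in (9)) that the KL neighborhood is defined by the KL constraint alone, which here reads |theta - theta_0| <= sqrt(2) eps. The function names, the true value theta_0 = 0.5, and the bisection bracket are hypothetical choices.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prior_mass(eps, theta0=0.5, tau=1.0):
    """N(0, tau^2) prior mass of the interval |theta - theta0| <= sqrt(2) * eps."""
    r = math.sqrt(2.0) * eps
    return normal_cdf((theta0 + r) / tau) - normal_cdf((theta0 - r) / tau)

def bayesian_critical_radius(n, theta0=0.5, tau=1.0):
    """Solve -log(prior mass) / (n * eps) = eps for eps by bisection."""
    def g(eps):
        # g is decreasing in eps: positive for tiny eps, negative for large eps.
        return -math.log(prior_mass(eps, theta0, tau)) - n * eps ** 2
    lo, hi = 1e-8, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for n in (100, 1000, 10000):
    eps_n = bayesian_critical_radius(n)
    print(f"n = {n:6d}: eps_n = {eps_n:.4f}, n * eps_n^2 = {n * eps_n ** 2:.3f}")
```

In this parametric example the local Bayesian complexity at the critical radius, which equals the squared radius, decreases with the sample size, while n times the squared radius grows only slowly.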

Although Theorem 3.5 is useful for obtaining a BOI, when transformed into form (14) the resulting leading constant of the approximation error term in the BOI is typically strictly larger than 1, resulting in a non-sharp oracle inequality. Here, we call an oracle inequality sharp if the leading constant in (14) is 1; see, for example, [17]. To resolve this issue for the PAC-Bayes inequality in Theorem 3.5, we consider a second class of PAC-Bayes inequalities that directly characterize the closeness between and the best approximation of from .

Theorem 3.7 (PAC-Bayes inequality relative to θ∗).

Fix . Then, for any ,

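$$\int \Big\{\frac{1}{n} D^{(n)}_{\theta_0,\alpha}(\theta,\theta^*)\Big\}\,\Pi_{n,\alpha}(d\theta \mid X^{(n)}) \;\le\; \frac{\alpha}{n(1-\alpha)} \int r_n(\theta,\theta^*)\,\rho(d\theta) + \frac{1}{n(1-\alpha)}\,D(\rho,\Pi_n) + \frac{1}{n(1-\alpha)}\log(1/\varepsilon), \quad \forall \text{ probability measure } \rho \ll \Pi_n, \tag{18}$$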

with probability at least .

Just as Corollary 3.6 gives a concrete Bayesian risk bound characterizing the closeness between and , we have the following counterpart for and .

Corollary 3.8.

For any satisfying and , with probability at least ,

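$$\int \Big\{\frac{1}{n} D^{(n)}_{\theta_0,\alpha}(\theta,\theta^*)\Big\}\,\Pi_{n,\alpha}(d\theta \mid X^{(n)}) \;\le\; \frac{D\alpha}{1-\alpha}\,\varepsilon^2 + \Big\{-\frac{1}{n(1-\alpha)}\log \Pi_n\big(B_n(\theta^*,\varepsilon;\theta_0)\big)\Big\}. \tag{19}$$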

In particular, if we let $\varepsilon_n$ be the Bayesian critical radius, that is, a stationary point of

$$\frac{-\log \Pi_n\big(B_n(\theta^*,\varepsilon;\theta_0)\big)}{n\varepsilon} = \varepsilon,$$

then with probability at least ,

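$$\int \Big\{\frac{1}{n} D^{(n)}_{\theta_0,\alpha}(\theta,\theta^*)\Big\}\,\Pi_{n,\alpha}(d\theta \mid X^{(n)}) \;\le\; \frac{D\alpha+1}{1-\alpha}\,\varepsilon_n^2.$$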
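To make the critical radius concrete, the following minimal sketch solves the fixed-point equation numerically and checks that, at the solution, the two terms of (19) add up to the constant in the last display. It assumes a hypothetical prior mass function, $-\log \Pi_n(B_n) = d \log(R/\varepsilon)$ (a uniform prior on a $d$-dimensional ball, with the neighborhood treated as a Euclidean ball); the function names, the bisection solver, and the numerical values of `n`, `alpha`, and `D` are illustrative choices, not part of the paper.

```python
import math

def neg_log_prior_mass(eps, dim=5, radius=1.0):
    """Hypothetical -log Pi_n(B_n(theta*, eps; theta0)): uniform prior on a
    dim-dimensional ball of the given radius, B_n treated as a Euclidean eps-ball."""
    return dim * math.log(radius / eps)

def critical_radius(n, neg_log_mass=neg_log_prior_mass, lo=1e-12, hi=1.0, iters=200):
    """Bisection for the root of g(eps) = n * eps**2 - neg_log_mass(eps),
    which is equivalent to -log Pi_n(B_n) / (n * eps) = eps."""
    def g(e):
        return n * e * e - neg_log_mass(e)
    # For the default mass function g is increasing, negative at lo, positive at hi.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bound_19(eps, n, alpha, D, neg_log_mass=neg_log_prior_mass):
    """Right-hand side of (19) for a given radius eps."""
    return D * alpha / (1 - alpha) * eps**2 + neg_log_mass(eps) / (n * (1 - alpha))

# Illustrative values only.
n, alpha, D = 1000, 0.5, 2.0
eps_n = critical_radius(n)
rhs_19 = bound_19(eps_n, n, alpha, D)
simplified = (D * alpha + 1) / (1 - alpha) * eps_n**2
print(eps_n, rhs_19, simplified)
```

At the critical radius the prior mass term equals $\varepsilon_n^2/(1-\alpha)$, so the two printed bounds coincide up to floating-point error. Any other decreasing function of `eps` can be passed as `neg_log_mass`, provided the bracket `[lo, hi]` contains a sign change.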

We now illustrate how Corollary 3.8 leads to a sharp oracle inequality in the misspecified case (a concrete example is provided in Section 4.1). As noted previously, an oracle inequality is sharp in the misspecified case if the leading constant in front of the model space approximation term is 1, i.e., for some distance metric . In statistical learning theory, the regret [58, 49] of an estimator is defined as . A benchmark for comparing the regrets of different estimators is the minimax regret, defined as . Regret bounds (misspecified case) are substantially harder to obtain than minimax risk bounds (well-specified case), and the rate of the minimax regret can differ from that of the minimax risk [49]. Our general technique for deriving a sharp oracle inequality for the Bayes estimator will imply that the Bayes estimator achieves the minimax regret.

Suppose we are interested in a certain metric , the square of which is weaker than the average -divergence , that is

$$\frac{1}{n} D^{(n)}_{\theta_0,\alpha}(\theta,\theta^*) \;\ge\; c_\alpha\, d_n^2(\theta,\theta^*), \quad \theta\in\Theta,$$

where is some positive constant that may depend on . For simplicity, we assume that is also the minimizer of over . If this is not the case, then we can always add an extra remainder term to the upper bound that characterizes the difference between and the best approximation of from relative to . Under these assumptions, Corollary 3.8 implies that with high probability, the Bayes estimator satisfies

$$d_n(\hat\theta_B,\theta^*) \;\le\; c'_\alpha\,\varepsilon_n,$$

where is the Bayesian critical radius. Now adding to both sides of this inequality and applying the triangle inequality, we obtain

$$d_n(\hat\theta_B,\theta_0) \;\le\; \inf_{\theta\in\Theta} d_n(\theta,\theta_0) + c'_\alpha\,\varepsilon_n,$$

which is a sharp oracle inequality. Sometimes, we may be interested in obtaining an oracle inequality for the squared loss , when is a vector space and is induced by an inner product, denoted by . This is a more intricate problem, as the trivial bound renders the oracle inequality non-sharp. However, it is usually true when is a convex set that

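$$\frac{1}{n} D^{(n)}_{\theta_0,\alpha}(\theta,\theta^*) \;\ge\; c_\alpha\big(d_n^2(\theta,\theta^*) + 2\langle\theta-\theta^*,\,\theta^*-\theta_0\rangle_n\big), \quad \forall\ \theta\in\Theta.$$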

For example, this inequality holds for regression with fixed design, where is the empirical norm (details can be found in Section 4.1). Again, by applying Corollary 3.8 and adding to both sides of this inequality, we obtain

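$$d_n^2(\hat\theta_B,\theta_0) = d_n^2(\hat\theta_B,\theta^*) + 2\langle\hat\theta_B-\theta^*,\,\theta^*-\theta_0\rangle_n + d_n^2(\theta^*,\theta_0) \;\le\; \inf_{\theta\in\Theta} d_n^2(\theta,\theta_0) + c'_\alpha\,\varepsilon_n^2,$$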

which is a sharp oracle inequality for the squared loss . As an application of this technique, we derive a sharp oracle inequality for estimating a convex function in Theorem 4.2 when is not necessarily convex.
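The step from Corollary 3.8 to the squared-loss bound can be spelled out as follows. This is a sketch under an assumption not restated in this passage, namely that the Bayes estimator is the fractional posterior mean, $\hat\theta_B = \int \theta\,\Pi_{n,\alpha}(d\theta \mid X^{(n)})$. The first inequality uses Jensen's inequality for the convex map $\theta \mapsto d_n^2(\theta,\theta^*)$ together with linearity of the inner product, the second uses the lower bound on the average divergence, and the third uses Corollary 3.8 at the Bayesian critical radius $\varepsilon_n$:

```latex
% Assumption (labelled, not from this passage):
%   \hat\theta_B = \int \theta \, \Pi_{n,\alpha}(d\theta \mid X^{(n)}).
\[
\begin{aligned}
d_n^2(\hat\theta_B,\theta^*) + 2\langle \hat\theta_B-\theta^*,\ \theta^*-\theta_0\rangle_n
&\le \int \big\{ d_n^2(\theta,\theta^*) + 2\langle \theta-\theta^*,\ \theta^*-\theta_0\rangle_n \big\}\,
      \Pi_{n,\alpha}(d\theta \mid X^{(n)}) \\
&\le \frac{1}{c_\alpha} \int \Big\{ \frac{1}{n}\, D^{(n)}_{\theta_0,\alpha}(\theta,\theta^*) \Big\}\,
      \Pi_{n,\alpha}(d\theta \mid X^{(n)}) \\
&\le \frac{D\alpha+1}{c_\alpha(1-\alpha)}\;\varepsilon_n^2 .
\end{aligned}
\]
```

Adding $d_n^2(\theta^*,\theta_0) = \inf_{\theta\in\Theta} d_n^2(\theta,\theta_0)$ to both sides then gives the displayed oracle inequality, with one admissible choice of constant being $c'_\alpha = (D\alpha+1)/\{c_\alpha(1-\alpha)\}$. The leading constant in front of the infimum stays equal to 1 because the cross term is controlled jointly with $d_n^2(\hat\theta_B,\theta^*)$ rather than bounded separately.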

Comparisons with previous work:

The PAC-Bayes type result most relevant to ours, such as Theorem 3.5, is Theorem 1 in [16], which focuses on the regression setting , where is the unknown regression function to be estimated, ’s are the fixed design points and ’s are i.i.d. zero-mean noise terms, corresponding to i.n.i.d. observations. They propose to use the posterior mean under the following quasi-likelihood function as the estimator,

$$L_{n,\beta}(f) = \exp\Big\{-\frac{1}{2\beta}\sum_{i=1}^{n}\big(Y_i - f(x_i)\big)^2\Big\},$$

where, in their terminology, is a temperature parameter. In the special case when and , this function reduces to the likelihood function. They establish a PAC-Bayes inequality

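$$E^{(n)}_{\theta_0}\big[\|\hat f - f_0\|_n^2\big] \;\le\; \int \|f - f_0\|_n^2\,\rho(df) + \frac{\beta}{n}\,D(\rho,\Pi_n), \quad \forall \text{ probability measure } \rho \ll \Pi_n,$$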

when , where is the corresponding posterior mean. Therefore, their quasi-likelihood approach can be viewed as a special case of our fractional posterior with . Their proof is specialized to the empirical loss and also requires the log-likelihood function to take a sum-of-squares form. In contrast, our PAC-Bayes inequality generalizes the results in [16] to a broader class of models. Moreover, the posterior expectation in