Rates of contraction for posterior distributions in \bolds{L^{r}}-metrics, \bolds{1\leq r\leq\infty}

# Rates of contraction for posterior distributions in \boldsLr-metrics, \bolds1≤r≤∞

## Abstract

The frequentist behavior of nonparametric Bayes estimates, more specifically, rates of contraction of the posterior distributions to shrinking -norm neighborhoods, , of the unknown parameter, are studied. A theorem for nonparametric density estimation is proved under general approximation-theoretic assumptions on the prior. The result is applied to a variety of common examples, including Gaussian process, wavelet series, normal mixture and histogram priors. The rates of contraction are minimax-optimal for , but deteriorate as increases beyond . In the case of Gaussian nonparametric regression a Gaussian prior is devised for which the posterior contracts at the optimal rate in all -norms, .

[
\kwd
\doi

10.1214/11-AOS924 \volume39 \issue6 2011 \firstpage2883 \lastpage2911 \newproclaimconditionCondition \newproclaimdefinitionDefinition \newproclaimremarkRemark

\runtitle

and uniform consistency of Bayes estimates

{aug}

A]\fnmsEvarist \snmGiné\correflabel=e1]gine@math.uconn.edu and B]\fnmsRichard \snmNickllabel=e2]r.nickl@statslab.cam.ac.uk

class=AMS] \kwd[Primary ]62G20 \kwd[; secondary ]62G07 \kwd62G08. Rate of contraction \kwdposterior \kwdnonparametric hypothesis testing.

## 1 Introduction

In finite-dimensional statistical models the Bernstein–von Mises theorem provides a frequentist justification of the use of Bayesian methods. In the case of infinite-dimensional models, consistency properties in weak metrics hold under relatively mild conditions; see Schwartz [28]. Consistency in stronger metrics was considered by Barron, Schervish and Wasserman [1] and by Ghosal, Ghosh and Ramamoorthi [9], and, shortly after, Ghosal, Ghosh and van der Vaart [10] and Shen and Wasserman [30] developed techniques that allow us to prove frequentist rates of contraction of the posterior to the true infinite-dimensional parameter in the Hellinger metric, if the prior is suitably chosen according to the structure of the nonparametric problem at hand. This led to further progress recently; we refer to [11, 12, 32, 34] and the references therein.

This literature has been successful in generalizing the scope of these techniques to a variety of different statistical models, and has naturally focussed on consistency and rates of contraction results in the Hellinger distance. For instance, if is the unknown density to be estimated, and if is the posterior based on a prior and a sample with joint law , results of the kind

 Π(p\dvtxh(p,p0)≥εn|X1,…,Xn)→0in Pn0 probability (1)

were established, where is the Hellinger metric and where . Such posterior contraction results are known to imply the same frequentist consistency rate , also in the metric , for the associated formal Bayes estimators.

In this article we investigate the question of how to generalize results of this kind to more general loss-functions than the Hellinger metric, with a particular focus on -norms, . Such results are of interest for a variety of reasons, for example, the construction of simultaneous confidence bands, or for plug-in procedures that require control of nonparametric remainder terms (e.g., in the proof of the Bernstein–von Mises theorem in semiparametric models in Castillo [6]). They are also of interest with a view on a more unified understanding of nonparametric Bayes procedures that complements the existing -type results for standard frequentist methods.

The main challenge in extending the theory to the -case, except for specific conjugate situations discussed below, rests in generalizing the Le Cam–Birgé testing theory for the Hellinger metric to more general situations. A main ingredient of the proof of a result of the kind (1) is that, in testing problems of the form

 H0\dvtxp=p0againstHA\dvtxp∈{p\dvtxh(p,p0)≥εn}, (2)

universal tests with concentration bounds on type-II errors of the type  exist, under assumptions on the size, or entropy, of the “alternative” space defining . This fact is rooted in the subtle connection between nonparametric testing problems and the Hellinger metric as highlighted in the work of Le Cam [21] and Birgé [2]. A main contribution of this article is the development of a new approach to testing problems of the kind (2) based on concentration properties of linear centered kernel-type density estimators, derived from empirical process techniques. While this approach can only be used if one has sufficient control of the approximation properties of the support of the prior, it can be generalized to arbitrary -metrics, including the supremum norm . The concentration properties of these tests depend on the geometry of the -norm and deteriorate as , which is, in a sense, dual to the fact that the minimax testing rate in the sense of Ingster [20] approaches the minimax rate of estimation as .

While our main results can be viewed as “abstract” in that they replace the entropy conditions in [10] for sieve sets by general approximation-theoretic conditions (see Theorems 2 and 3 below), our findings become most transparent by considering specific examples, selected in an attempt to reflect the spectrum of situations that can arise in Bayesian nonparametrics: In Section 2 we study the “ideal” situation of a simple uniform wavelet prior on a Hölder ball, the “supersmooth” situation of mixtures of normals, the case of random histograms based on a Dirichlet process where no uniform bound on the -norm of the support of the prior is available, as well as Gaussian process priors of the kind studied in [32]. The general conclusion is that if is -smooth, then the rate of contraction obtained in the -norm for a posterior based on an adequately chosen prior of smoothness is, up to factors, and with ,

 (1n)(α−1/2+1/¯r)/(2α+1). (3)

So as soon as our proof retrieves the minimax optimal rate, but for the rate deteriorates by a genuine power of . As approaches infinity this effect becomes more lenient and vanishes in the limit.

We currently have no proof of the fact that our general theorem gives the right rate for Bayesian posteriors if —similar problems are known with nonparametric maximum likelihood estimators in -metrics (cf. the proof of Proposition 6 in [27]). While we do not settle the issue of optimality of our rates for in this article, we also prove in Theorem 1 below that in nonparametric Gaussian regression the minimax rate of contraction can be obtained by certain diagonal Gaussian wavelet priors, in all -norms simultaneously. We believe that this result is closely tied to the fact that the posterior is then itself Gaussian, and conjecture that our rates cannot be substantially improved in the nonconjugate situation.

## 2 Main results

Let be a class of probability densities on or , and let be a random sample drawn from some unknown probability density with joint law the first coordinate projections of the infinite product probability measure . Suppose one is given a prior probability distribution defined on some -algebra of . The posterior is the random probability measure

 Π(B|X1,…,Xn)=∫B∏ni=1p(Xi)dΠ(p)∫P∏ni=1p(Xi)dΠ(p),B∈B.

We wish to analyze contraction properties of the posterior distribution under certain regularity conditions on and , and these regularity properties can be conveniently characterized by wavelet theory.

### 2.1 Function spaces and wavelets

For or , , we shall write , the norm on the space of bounded continuous real-valued functions defined on . We shall use wavelet theory throughout; see [26, 19]. Let be the scaling function and wavelet of a multiresolution analysis of the space of square integrable real-valued functions on . We shall say that the wavelet basis is -regular if are -times continuously differentiable on . For instance we can take Daubechies wavelets on of sufficiently large order (see [26]) and define the translated scaling functions and wavelets

 ϕk=ϕ(⋅−k),ψℓk=2ℓ/2ψ(2ℓ(⋅)−k),ℓ∈N∪{0},k∈Z, (4)

which form an orthonormal basis of .

For we consider the orthonormal wavelet bases of constructed in Theorem 4.4 of Cohen, Daubechies and Vial [8]. Each such basis is built from a Daubechies scaling function and its corresponding wavelet , of order , starting at a fixed resolution level such that (see Theorem 4.4 in [8]): the that are supported in the interior of are all kept, and suitable boundary corrected wavelets are added, so that the still form an orthonormal basis for . While formula (4) now only applies to the “interior” wavelets, one can still write for every ; cf. page 73 in [8] and also after Condition 3 below. {definition} Let or , and let , , , . Let be bounded, compactly supported -regular scaling function and wavelet, respectively, and denote by and the wavelet coefficients of . The Besov spa-ce  is defined as the set of functions where

 ∥f∥s,p,q:=∥∥α(⋅)(f)∥∥p+(∞∑ℓ=0(2ℓ(s+1/2−1/p)∥∥βℓ(⋅)(f)∥∥p)q)1/q

with the obvious modification in case . {remark} We note the following standard embeddings/identifications we shall use (cf. [26, 19]): for the Hölder (-Zygmund in case integer) spaces on , we have . Moreover where  are the standard -Sobolev spaces. We also have the “Sobolev-type” imbeddings for . Finally, if , then for every , where , with , .

### 2.2 Uniform wavelet series

Let us consider first the case where an a priori upper bound on the Hölder norm is available, so that the prior can be chosen to have bounded support in . An example is obtained, for example, by uniformly distributing wavelet coefficients on a Hölder ball. Let be a -regular CDV-wavelet basis for , let be i.i.d.  random variables, and define, for , the random wavelet series

 Uα(x)=∑ku0kϕk(x)+∞∑ℓ=J0∑k2−ℓ(α+1/2)uℓkψℓk(x), (5)

which has trajectories in , almost surely (in view of Definition 2.1 and Remark 2.1). Since moreover , and since the exponential map has bounded derivatives on bounded subsets of , the same applies to the random density

 pU,α(x):=eUα(x)∫10eUα(y)dy,

whose induced law on we denote by . Our general results below imply the following proposition, which, since is bounded away from zero, implies the same contraction rate in Hellinger distance . Note moreover that the result for could be obtained from interpolation properties of -spaces.

###### Proposition 1

Let be i.i.d. on with density satisfying . Let , , and suppose . Then there exist finite positive constants such that, as ,

 Πα{p∈P\dvtx∥p−p0∥r≥Mn−(α−1/2+1/¯r)/(2α+1)(logn)η|X1,…,Xn} (6) →PN00.

### 2.3 Dirichlet mixtures

Consider first, as in [9, 13, 12], a normal mixture prior , defined as follows: for the standard normal density, set:

(-) ,

(-) the Dirichlet-process with base measure , and a probability measure,

(-) , where is a probability distribution with compact support in .

###### Proposition 2

Let be i.i.d. on with density where and where is supported in . Suppose that has a positive continuous density in a neighborhood of , and that the base measure has compact support and a continuous density on an interval containing . Then there exist finite positive constants such that

 Πα{p∈P\dvtx∥p−p0∥∞≥M(logn)η√n∣∣X1,…,Xn}→PN00as n→∞. (7)

Consider next a random histogram based on a Dirichlet process, similar to the priors studied in [29]: for let be a Dirichlet-distribution on the -dimensional unit simplex, with all parameters equal to one. Consider the dyadic random histogram with resolution level

 2j∑k=1αjk2j1{(k−12j,k2j]}(x),{ajk}∼Dirj,x∈[0,1],

and denote its law on the space of probability densities by . Note that this prior is not concentrated uniformly (in ) on bounded densities (despite the densities in the support being uniformly bounded for fixed ).

###### Proposition 3

Let be i.i.d. on with density , satisfying on . Let be such that , let , and let either or . Then for some , as

 Missing or unrecognized delimiter for \bigr (8) →PN00.

### 2.4 Gaussian process priors

We now study a variety of Gaussian process priors that were considered in the nonparametric Bayes literature recently; see [32, 34] for references. To reduce technicalities we shall restrict ourselves to integrated Brownian motions, but see also the remark below. {definition} Let , , be a (sample-continuous version of) standard Brownian motion. For , , setting , being the integer part of , is defined as the -fold integral

 Bα(t) = ∫t0∫t[α]−10⋯∫t20∫t10B(s)dsdt1⋯dt[α]−1 = 1([α]−1)!∫t0(t−s)[α]−1B(s)ds,t∈[0,1],

where for the multiple integral is understood to be only .

Following [23, 32], and as before Proposition 1, we would like to define our prior on densities as the probability law of the random process

 eBα∫10eBα(t)dt, (9)

but we must make two corrections: first, since a.s., , would impose unwanted conditions on the value at zero of the density, we should release at zero, that is, take , where are i.i.d.  variables independent of ; see [32]. In order to deal with bounded densities, we introduce a second modification to (9), and define our prior (on the Borel sets of ) as

 Π=L(e¯Bα∫10e¯Bα(t)dt∣∣∥¯Bα∥∞≤c), (10)

where is a fixed arbitrary positive constant. This prior works as follows: if is a measurable set of continuous densities on , then

 Π(A)=Pr{e¯Bα/∫e¯Bα∈A,∥¯Bα∥∞≤c}/Pr{∥¯Bα∥∞≤c},

and clearly the denominator is strictly positive for all ; see Proposition 7 below.

###### Proposition 4

Let , and assume (a) , and (b) is bounded and bounded away from zero, say, . Let be the prior defined by (10) where is as in (a) and is as in (b). Then, if are i.i.d. with common law of density , there exists s.t.

 Π{p∈P\dvtx∥p−p0∥r≥Mn−(α−1/2+1/¯r)/(2α+1)(logn)(1/2)1{r=∞}|X1,…,Xn} →0

in -probability as .

As remarked before Proposition 1, a contraction result in the Hellinger distance follows as well, and the case could be obtained from interpolation.

The result in Proposition 4 extrapolates to fractional multiple integrals of Brownian motion (Riemann–Liouville processes) of any real valued index , and it also extends to the related fractional Brownian motion processes (see, e.g., [32] for definitions), but, for conciseness and clarity of exposition, we refrain from carrying out these extensions.

### 2.5 Sharp rates in the Gaussian conjugate situation

We currently have no proof that the rates obtained in the previous subsections are optimal for these priors as soon as . While we conjecture that Bayesian posteriors may suffer from suboptimal contraction rates in density estimation problems in -loss, , we finally show here that in the much simpler conjugate situation of nonparametric regression with Gaussian errors, sharp rates in all  norms can be obtained at least for certain diagonal wavelet priors. The proof of this result follows from a direct analysis of the posterior distribution, available in closed form due to conjugacy.

Given a noise level , we observe

 dY(n)(t)=f(t)dt+1√ndB(t),t∈[0,1], (11)

for , where is Brownian motion on . This model is well known to be asymptotically equivalent to nonparametric regression with fixed, equally-spaced design and Gaussian errors.

Consider priors on defined on a -regular CDV-wavelet basis as

 Π=L(N∑k=0gkϕk+∞∑ℓ=J02ℓ−1∑k=0√μℓgℓkψℓk) (12)

in , with the ’s i.i.d. and with . Such a prior is designed for -smooth . As is easily seen, the series in (12) converges uniformly almost surely.

###### Theorem 1

Let , and let be the Gaussian prior on defined by (12) based on a CDV wavelet basis of of smoothness at least . Let , let and suppose we observe . Then there exists and depending only on the wavelet basis, and such that, for every , and for all ,

 EY(n)0Π(f\dvtx∥f−f0∥r>Mεn|Y(n)0)≤n−C2(M−M0)2. (13)

This rate of convergence is sharp (in case up to the -term) in view of the usual minimax lower bounds and since the contraction rate implies the same rate of convergence for the formal Bayes estimator to (using Anderson’s lemma and the fact that the posterior is a random Gaussian measure on , as inspection of the proof shows). One may even apply the usual thresholding techniques to the posterior mean to obtain a Bayesian rate adaptive estimator of by proceeding as in [17, 25].

## 3 General contraction theorems for density estimates in Lr-loss, 1≤r≤∞

We shall, in our main results, use properties of various approximation schemes in function spaces, based on integrating a localized kernel-type function against functions , . Let, in slight abuse of notation, for , be the space of -integrable functions, , normed by . Recall the notion of -variation of a function (e.g., as before Lemma 1 in [17]). {condition} Let or . The sequence of operators , is called an admissible approximating sequence if it satisfies one of the following conditions:

(a) (convolution kernel case): , where is of bounded -variation for some finite , right (or left) continuous, and satisfies for some .

(b) (multiresolution projection case): , the sum extending over any subset of , where has bounded -variation for some finite and satisfies, in addition, as well as for every and some function for which for some .

(c) (multiresolution case, ): is the projection kernel of a Cohen–Daubechies–Vial (CDV) wavelet basis.

Condition (a) is a standard assumption on kernels, condition (b) is satisfied for most wavelet basis on , such as Daubechies, Meyer or spline wavelets, by using standard wavelet theory (e.g., [19]). For part (c) we note the following: as in the case of the whole line, an orthonormal basis of is obtained from -fold dilates of the basic linear span , for every (page 73 in [8]). In this case, has dimension , and a basis consists of: (i)  left edge functions , , where is a modification of , which is still bounded and of bounded support; (ii) right edge functions , , also modifications of bounded and of bounded support, and then the “interior” usual translations of dilations of , , . The projection kernel corresponds to the projection onto the three orthogonal components of (the linear spans, respectively, of the left edge functions , the right edge functions , and the interior functions ). The first two spaces have dimension and the third, . By Lemma 8.6 in [19], there exist bounded, compactly supported nonnegative functions such that , for all . We call this function a majorizing kernel of the interior part of .

Let be i.i.d. with law and density .

###### Theorem 2

Let or , let be a set of probability densities on , and let be priors defined on some -algebra of for which the maps are measurable for all . Let and let as be a sequence of positive numbers such that as . Let

 δn=εn(nε2n)1/2−1/(2r)γn (14)

for some sequence satisfying . Let be any sequence satisfying for some fixed , and let be an admissible approximator sequence. Let be a sequence of subsets of

 {p∈P\dvtx∥KJn(p)−p∥r≤C(K)δn,∥p∥μw≤D}, (15)

where is a constant that depends only on the operator kernel , is a fixed constant, and where if , if .

Assume there exists such that, for every large enough:

• (1) and (2) . Let be s.t. and s.t. if . If as , then there exists such that (16) in -probability.

• Note that the moment condition in (15) is void if or if . If the rate can be taken to be or, more generally, . For one only has at best , which is always slower than (since ). In case the rate interpolates between these two rates without, however, requiring .

In the case where is bounded, and if it is known that the posterior concentrates on a fixed sup-norm ball with probability approaching one, we can refine the rates in the above theorem for , and retrieve the (in applications of the theorem often optimal) rate for . The following theorem can be applied with , in which case conditions (a) and (b) require the rate to be fast enough (which in applications typically entails that a minimal degree of smoothness of has to be assumed).

###### Theorem 3

Let be as in Theorem 2. Let , and let as be a sequence of positive numbers such that as . Let , and set

 δn=εn(nε2n)1/2−1/¯rγn (17)

for some sequence . Assume either:

(a) that and that or

(b) that and that .

Let be defined as in Theorem 2, assume that conditions (1) and (2) in that theorem are satisfied, and that, in addition,

(3) there exists such that

 Πn(p∈P\dvtx∥p∥∞>B|X1,…,Xn)→0

as in -probability.

Let be s.t. and such that for some if . If as , then there exists s.t.

 Πn{p∈P\dvtx∥p−p0∥r≥Mδn|X1,…,Xn}→0as n→∞ (18)

in -probability.

### 3.1 Lr-norm inequalities

A main step in the proof of Theorems 2 and 3 [see (3.2) below] is the construction of nonparametric tests for -alternatives, , that have sufficiently good exponential bounds on the type-two errors. For this we first derive sharp concentration inequalities for -norms of centered density estimators. It is convenient to observe that the degree of concentration of a kernel-type density estimator around its expectation in  depends on , as can already be seen from comparing the known cases in [16, 14] for kernel estimators and [17] for wavelets. These results are derived from Talagrand’s inequality [31] for empirical processes: let be i.i.d. with law on a measurable space , let be a -centered (i.e., for all ) countable class of real-valued measurable functions on , uniformly bounded by the constant , and set for any . Let be any positive number such that , and set . Then, Bousquet’s [5] version of Talagrand’s inequality, with constants, is as follows (see Theorem 7.3 in [5]): for every ,

 Missing or unrecognized delimiter for \Biggr (19)

This applies to our situation as follows: let be i.i.d. with density on with respect to Lebesgue measure , , and let be a kernel-type estimator with as in Condition 3. Its expectation equals , and we wish to derive sharp exponential bounds for the quantity for . In case this can be achieved by studying the empirical process indexed by

 K={Kj(x,⋅)−Kj(p0)(x)\dvtxx∈T},

and in case we shall view as a sample average of the centered -valued random variables , and reduce the problem to an empirical process as follows: let be conjugate to , that is, . By the Hahn–Banach theorem, the separability of implies that there is a countable subset of the unit ball of such that

 Extra close brace or missing open brace

for all . We thus have , where is the empirical measure, and where

 K={x↦∫Tf(t)Kj(t,x)dt−∫Tf(t)Kj(p0)(t)dt\dvtxf∈B0}.

To apply (19) with the countable class we need to find suitable bounds for the envelope and the weak variances . We will also apply (19) in the case , and note that the corresponding empirical process suprema are over countable subsets of , by the continuity property of in the convolution kernel case, and by finiteness of the -variation of the scaling function in the wavelet case (Remark 2 in [17]).

#### Envelope and variance bounds for K

We first consider Condition 3(a), the convolution kernel case: let us write in abuse of notation and for . (One naturally replaces  by the Banach space of finite signed measures if in the arguments below.) The class then equals

 K={x↦Kj∗f(x)−E(Kj∗f(X))\dvtxf∈B0}.

The bound for the envelope is seen to be of size : by Hölder’s inequality

 ∥Kj∗f∥∞≤∥Kj∥r∥f∥s≤C(K,r)2j(1−1/r)≡U, (20)

a bound that remains true when since . To bound the variances, for densities , we have

 E(Kj∗f)(X)2≤∥p0∥r∥Kj∗f∥22s≤C′(K,r)∥p0∥r2j(1−1/r)≡σ2 (21)

from Hölder’s inequality and since , for is bounded up to constants by , by using Young’s inequality for .

The last estimate can be refined if is known to be bounded, where we recall that , to yield

 E(Kj∗f)(X)2≤C(p0)2j(1−2/¯r)≡σ2, (22)

where is bounded on uniformly bounded sets of densities. To see this, consider first and thus : then Young’s inequality gives, as above,

 E(Kj∗f)(X)2≤∥p0∥∞∥Kj∗f∥22≤C∥p0∥∞2j(1−2/r)=σ2.

If , then , so by Hölder’s inequality

 E(Kj∗f)(X)2≤∥Kj∗f∥2s∥p0∥s/(s−2)≤C(p0)∥Kj∥21∥f∥2s≤C(p0,K).

For Condition 3(b), so in the multiresolution case for , the arguments as in (a) and obvious modifications give the same bounds for in view of the estimate , which allows us to compare wavelet projections to convolutions and proceed as above.

For Condition 3(c), note that, by the comments following the statement of Condition 3, the projection kernels have the form where