Catia Scricciolo Department of Decision Sciences, Bocconi University, Via Röntgen 1, 20136 Milano, Italy
Abstract

In the need for low assumption inferential methods in infinite-dimensional settings, Bayesian adaptive estimation via a prior distribution that does not depend on the regularity of the function to be estimated nor on the sample size is valuable. We elucidate relationships among the main approaches followed to design priors for minimax-optimal rate-adaptive estimation meanwhile shedding light on the underlying ideas.

keywords:
Adaptive estimation, Empirical Bayes, Gaussian process priors, Kernel mixture priors, Nonparametric credibility regions, Posterior distributions, Rates of convergence, Sieve priors
journal:

1 Introduction

Nonparametric curve estimation is a fundamental problem that has been intensively studied in a Bayesian framework only in the last decade, with more than a ten-years delay over the ponderous progress made in the frequentist literature where rates for point estimators have been developed in many aspects: adaptation, sharp minimax adaptive constants etc., see, e.g., Goldenshluger and Lepski (2012) for recent progress in the area. Bayesian adaptive estimation is a main theme: it accounts for designing a prior probability measure on a function space so that the posterior distribution contracts at “the truth” at optimal rate, in the minimax sense, relative to the distance defining the risk. The rate then has the desirable property of automatically adapting to the unknown regularity level of the estimandum: the correct rate stems, whichever the true value of the regularity parameter, even if knowledge of it is not available to be exploited in the definition of the prior. As the amount of data grows, the posterior distribution learns from the data so that the derived estimation procedure, despite lack of knowledge of the smoothness, performs as well as if the regularity level were known and this information could be incorporated into the prior. In this sense, adaptation may be regarded as an oracle property of the prior distribution providing a frequentist large-sample validation of it and, above all, a success of Bayesian nonparametric methods for low assumption inference in infinite-dimensional settings.

Early influential contributions to Bayesian adaptation are due to Belitser and Ghosal (2003) and Huang (2004). The former article deals with the prototypical problem of adaptive estimation of the mean of an infinite-dimensional normal distribution which is assumed to live in a Sobolev space of unknown smoothness level; the latter provides general sufficient conditions for adaptive density and regression estimation which are then applied to illustrate full exact minimax-optimal rate adaptation in density and regression estimation over Sobolev spaces using log-spline models and full minimax-optimal rate adaptation in density estimation over Besov spaces with the Haar basis but at the price of an extra logarithmic term. A third breakthrough contribution is given in the article of van der Vaart and van Zanten (2009), where adaptation is considered in the statistical settings of density estimation, regression and classification by introducing as a prior for the functional parameter a re-scaling of the sample paths of a smooth Gaussian random field on , , by an independent gamma random variable. These three articles are paradigmatic of the main approaches followed for Bayesian adaptation:

• the approach that considers the regularity level as a hyper-parameter and puts a prior on it;

• the approach that puts a prior on a discrete random variable which may represent the model dimension, the dimension of the space where the function is projected or the number of basis functions used in the approximation;

• the approach based on the re-scaling of a smooth Gaussian random field.

Approach (a), which considers hierarchical models with regularity hyper-parameter, is proposed in Belitser and Ghosal (2003), where the unknown regularity level is endowed with a prior supported on at most countably many values. The overall prior is then a mixture of priors on different models indexed by the regularity parameter and leads to exact optimal posterior contraction rates simultaneously for all regularity levels. The same philosophy is followed in Scricciolo (2006), where full exact optimal rate adaptive estimation of log-densities in Sobolev ellipsoids is achieved by considering only a finite number of competing models. In both articles, the key ideas are the following:

• the posterior probability of selecting a coarser model than the best one asymptotically vanishes;

• the posterior distribution resulting from the prior restricted to bigger models asymptotically accumulates on a fixed ellipsoid in the correct space;

• the posterior distribution corresponding to the restricted prior concentrates on Hellinger/-balls around the truth at optimal rate.

In both articles, full minimax-optimal rate adaptation is achieved when the prior on the regularity level can only take countably many values, while continuous spectrum adaptation is obtained at the price of a genuine power of in Belitser and Ghosal (2003) and of an extra logarithmic factor in Lian (2014). In the latter article, adaptation to the regularity level of the Besov space where the true signal of a Gaussian white noise model is assumed to live is achieved, up to a log-factor, over the full scale of possible regularity values by considering a spike-and-slab type prior, with a point mass at zero mixed with a Gaussian distribution, on the single wavelet coefficients of the signal and a prior on a parameter related to the regularity of the space, but the overall prior is restricted to a fixed Besov ellipsoid. Another extension of Belitser and Ghosal (2003) to continuous spectrum is Knapik et al. (2012). Also the Bayesian adaptation scheme proposed by Ghosal et al. (2003) and Lember and van der Vaart (2007) can be ascribed to approach (a). It puts a prior on every model of a collection, each one expressing a qualitative prior guess on the true density, possibly a regularity parameter, and next combines these priors into an overall prior by equipping the abstract model indices with special sample-size-dependent prior weights giving more relevance to “smaller” models, that is, those with faster convergence rates. Illustrations include finite discrete priors based on nets and priors on finite-dimensional models for adaptive estimation over scales of Banach spaces like Hölder spaces. A closely related problem is that of model selection which is dealt with using similar ideas in Ghosal et al. (2008), where it is shown that the posterior distribution gives negligible weights to models that are bigger than the one that best approximates the true density from a given list, thus automatically selecting the optimal one.

Approach (b) that considers hierarchical models with dimension reduction hyper-parameter is followed in Huang (2004) and relies on the construction of a fairly simple compound prior called “sieve prior” by Shen and Wasserman (2001). A sieve prior is a mixture of priors,

 Π=∞∑k=1ρ(k)Πk,

with , and, where every single prior is supported on a space of densities which is typically finite-dimensional and can be represented as . As previously mentioned, the index may represent the dimension of the space where the function is projected, the number of basis functions for the approximation or the model dimension. A sieve prior can be thought of as generated in two steps: first the index of a model is selected with probability , next a probability measure is generated from the chosen model according to a prior on it. Such finite-dimensional models may arise from the approximation of a collection of target densities through a set of basis functions (e.g., trigonometric functions, splines or wavelets), where a model of dimension is generated by a selection of basis functions. This adaptive scheme is based on a set of assumptions such that they give control in terms of covering numbers of the local structure of each , they guarantee the existence of a model receiving enough prior weight , the existence of a density close to and of neighborhoods of this approximating density being charged enough prior mass by . Several examples treated in Huang (2004) using scales of finite-dimensional models are covered with different priors in Lember and van der Vaart (2007). Further references on adaptive curve estimation via sieve priors are Scricciolo (2008) and Arbel et al. (2013). Bayesian adaptive procedures via sieve priors on the unit interval include piecewise constant and polygonally smoothed priors based on the Dirichlet process as in Scricciolo (2007), Bernstein-Dirichlet polynomials as in Kruijer and van der Vaart (2008), mixtures of beta densities as in Rousseau (2010). Other contributions clearly belonging to this category, while not being Dirichlet mixtures, are de Jonge and van Zanten (2010, 2012), Ray (2013) and Belitser and Serra (2013). The underlying idea is that of considering a sequence of positive projection kernels so that, at each “resolution” level, the Dirichlet process filtered through the kernel results in a density. Considering instead a “convolution-type” kernel, with usual conversion from bin-width to bandwidth, fully rate-adaptive density estimation over locally Hölder classes on the real line can be performed using finite Dirichlet location mixtures of analytic exponential power densities as proposed by Kruijer et al. (2010). Mixture models with priors on the mixing distribution admitting an infinite discrete representation, like the Dirichlet process or more general stick-breaking priors, avoid choosing a truncation level for the number of mixing components, while updating it in a fully Bayes way is computationally intensive. Fully rate-adaptive density estimation over Sobolev or analytic regularity scales can be performed using Dirichlet process mixtures of Gaussian densities as shown in Scricciolo (2014). The extension to the multivariate setting is due to Shen et al. (2013).

Theoretical properties of approach (c) based on re-scaling are investigated in van der Vaart and van Zanten (2009), Szabó et al. (2013a) and Castillo et al. (2014). Computational aspects are studied in Agapiou et al. (2013). The method is applied in many practical articles, cf. van der Vaart and van Zanten (2007) for some references.

Almost all the above described schemes for Bayesian adaptation yield rates with extra logarithmic terms. The issue of whether in Bayesian nonparametrics logarithmic terms could be removed in posterior contraction rates has been settled in the affirmative by Gao and Zhou (2013) using a novel block prior and getting a rate-optimal posterior contraction result over a continuum of regularity levels for curve estimation over Sobolev or Besov ellipsoids in a general framework covering various statistical settings such as density estimation, white noise model, Gaussian sequence model, Gaussian regression and spectral density estimation.

Except for the article of Huang (2004) and those dealing with re-scaling, all previously mentioned contributions fall within the same approach for deriving posterior contraction rates as developed by Ghosal et al. (2000), Shen and Wasserman (2001). We expose the main underlying ideas in the case of independent and identically distributed (i.i.d.) observations, the case of dependent, non-identically distributed observations adding only technical difficulties, see Ghosal and van der Vaart (2007a) for the non-i.i.d. case. Let denote the observation at the th stage which consists of i.i.d. replicates from a probability measure that possesses density with respect to (w.r.t.) some dominating measure on a sample space . Let be the collection of all probability measures on that possess densities w.r.t. , equipped with a semi-metric , typically the Hellinger or the -distance. Giné and Nickl (2011) have provided sufficient conditions for assessing posterior contraction rates in the full scale of -metrics, , in an abstract setting using a different strategy of attack to the problem. Also the recent work of Hoffmann et al. (2013) deals with -metrics and gives “adapted” conditions for posterior contraction rates with the help of modulus of continuity. The contribution of Castillo (2014) is focussed on sup-norm posterior contraction rates based on yet another approach oriented to specific statistical settings like the Gaussian white noise model for non-conjugate priors and density estimation using priors on log-densities or random dyadic histograms. Let be a prior probability measure on . The posterior probability of any Borel set writes as

 Π(B∣X(n))=∫B∏ni=1(fP/f0)(Xi)Π(dP)∫F∏ni=1(fP/f0)(Xi)Π(dP),

where . A sequence such that is said to be (an upper bound on) the posterior contraction rate, relative to , if for a sufficiently large constant (or a slowly varying sequence ),

 Π(P:d(fP,f0)>Mϵn∣X(n))→0 (1)

-almost surely or in -probability, where stands for the joint law of the first coordinate projections of the infinite product probability measure . The intuition behind the notion of rate of convergence, as stated in (1), is that the radius of a -ball around is large enough to prevent escape of mass as the posterior shrinks to point mass at . In order to show convergence in (1), it is enough

(i) to bound above the numerator of the ratio defining the probability in (1) by a term of the order ,

(ii) to bound below the denominator of the ratio defining the probability in (1) by a term of the order ,

where are finite suitable constants and are sequences such that and , (for real numbers and , we denote by their maximum and by their minimum. Also we write “” and “” for inequalities valid up to a constant multiple which is universal or inessential for our purposes). The posterior contraction rate is then defined as . This double sequence version of the theorem is introduced in Ghosal and van der Vaart (2001). The exponential upper bound in (i) can be shown by considering an appropriate sieve set which is almost the support of the prior , in the sense that the complement receives exponentially small prior mass

 Π(Fcn)≲e−(c3+2)n~ϵ2n,

as proposed by Barron (1988a), meanwhile controlling the complexity of by the covering or packing number when appropriate tests exist, that is,

 logD(¯ϵn,Fn,d)≲n¯ϵ2n,

where denotes the -packing number of , namely, the maximum number of points in such that the distance between each pair is at least . The exponential lower bound in (ii) is implied by the condition that Kullback-Leibler type neighborhoods of receive enough prior mass

 Π(BKL(P0;~ϵ2n))≳exp(−c3n~ϵ2n),

where , for the Kullback-Leibler divergence and the second moment of . A condition which is originated from Schwartz (1965).

The analysis of the asymptotic behavior of posterior distributions in terms of contraction rates details more comprehensively the impact of the prior on the posterior than the analysis of the speed at which the expected squared error between and the predictive density

 ^fn(⋅)=∫FfP(⋅)Π(dP∣X(n)),

as measured by the risk , where denotes expectation under , converges to zero as . If is (an upper bound on) the posterior contraction rate and the posterior probability in (1) converges to zero at least at the order , then is (an upper bound on) the rate of convergence of the Bayes’ estimator, provided is bounded and its square convex. The posterior contraction rate is related to the minimax rate of convergence over the density function class which belongs to. Let denote a density function class indexed by a parameter related to the regularity of its elements.

Definition 1.

A positive sequence is said to be the minimax rate of convergence over if there exist universal constants , possibly depending on the regularity parameter , such that the minimax risk over , that is, , where stands here for any density estimator based on observations, satisfies

 c≤liminfn→∞ϵ−2n,βinf^fnsupf∈FβEnf[d2(^fn,f)]≤limsupn→∞ϵ−2n,βinf^fnsupf∈FβEnf[d2(^fn,f)]≤C,

where denotes expectation under . An estimator is said to be adaptive in the minimax sense over the collection of function spaces if there exists a constant , possibly depending on , such that

 ∀β>0,supf∈FβEnf[d2(^f∗n,f)]≤C1ϵ2n,β.

Since the rate of convergence of an estimator cannot be faster than the minimax rate over the considered density function class, the posterior contraction rate cannot be faster than the minimax rate. So, if the posterior distribution achieves the minimax rate, then also the Bayes’ estimator has minimax-optimal convergence rate and is adaptive. Furthermore, by taking the center of the smallest ball accumulating at least of the posterior mass gives a point estimator with the same rate of convergence as the posterior contraction rate without requiring convexity of , see Section 4 in Belitser and Ghosal (2003). The study of posterior contraction rates may thus play an ancillary role in allowing to appeal to general theoretical results, see Theorem 2.1 and Theorem 2.2 in Ghosal et al. (2000) or Theorem 2 and Theorem 4 in Shen and Wasserman (2001).

In this overview, while trying to convey the main underlying ideas, we attempt at providing an account of the state of the art on Bayesian adaptation and an update of existing monographs on the theme like the one by (Ghosal, 2010, Ch. 2) and the dissertation by Shen (2013) of which we point out the contributions of Ch. 3 devoted to curve estimation using random series priors. For a variety of reasons, here we focus on Bayesian adaptation by mixtures, this having the two-fold meaning of modeling the data-generating density by mixtures and of using compound priors that are themselves mixtures of priors like sieve priors. We try to set up a unifying framework useful for understanding the large-sample behavior of commonly used priors as well as possibly being the starting point for the development of new results. Interest in mixtures is doubly motivated by the fact that they naturally arise in many contexts as models for observations of phenomena with multiple underlying factors and by their flexibility, due to which they may provide good approximation schemes for function estimation. For instance, the combination of a Gaussian convolution kernel with a Dirichlet process prior constitutes one of the most popular Bayesian schemes for density estimation on the real line. As pointed out in Shen (2013), results concerning the approximation of densities by Gaussian mixtures pave the way to the achievement of results on the estimation of density derivatives which are important because involved in relevant statistical quantities such as the score function and the Fisher information. Another important problem for which mixtures are well-suited is that of estimating multivariate (possibly anisotropic) densities, see Shen et al. (2013). A closely related problem is that of the estimation of mixing distributions. While the problem has been extensively studied from a frequentist perspective using deconvolution kernel-type estimators, Bayesian nonparametric deconvolution has been hardly investigated so far, except for the recent article of Nguyen (2013) and the manuscripts by Sarkar et al. (2013), who derive adaptive posterior convergence rates for Bayesian density deconvolution with supersmooth errors, and by Donnet et al. (2014) where both the ordinary and the supersmooth cases are treated in a fully Bayes as well as in an empirical Bayes context.

In Section 2, we provide a survey of results on Bayesian adaptation for the most popular schemes for density estimation by mixtures. For a more comprehensive overview of the diverse contexts and fields of application of mixture models, the reader may consult Marin et al. (2005). The focus of the article is on fully Bayes adaptation techniques, but some lights on empirical Bayes adaptation methods and on adaptive nonparametric credibility regions are shed in Section 3.

Mixtures of probability distributions naturally arise in many contexts as models for observations of phenomena with multiple latent factors, so that modeling by mixtures is well motivated in such situations. On other side, in a Bayesian set-up, mixtures can be the building block for constructing priors on spaces of densities using a model-based approach since, by endowing the mixing distribution of a mixed density with a probability measure, a prior distribution can be induced on a space of probability measures possessing densities w.r.t. some dominating measure. Furthermore, a well-chosen mixture model may provide an approximation scheme for density estimation resulting in minimax-optimal convergence rates. This approach, which has the virtue of combining conceptual simplicity of the scheme with flexibility of the model due to the wide range of possible choices for the kernel, has been initiated by Ferguson (1983), who used a Dirichlet process prior for the mixing distribution and derived the expressions for the resulting posterior distribution and Bayes’ density estimator or predictive density, see also Lo (1984).

Given a kernel , namely, a jointly measurable mapping from to such that, for every fixed , is a probability density on w.r.t. , a way for defining a prior is that of modeling the random probability density w.r.t. as

 x↦fP(x)=∫ΘK(x;θ)P(dθ), (2)

where the mixing probability measure is endowed with a prior . So, conditionally on , the observations are i.i.d. according to . A way to structurally describe observations from a kernel mixture prior is via the following hierarchical model:

 Xi∣θi,Pind∼K(⋅;θi),i=1,…,n,θi∣Piid∼P,i=1,…,n,P∼Π.

In the original formulation of Ferguson (1983), the combination of a Gaussian kernel and a Dirichlet process has been proposed for density estimation on the real line and the mixture model is called Dirichlet process mixture of Gaussian densities. This is the most popular Bayesian scheme for density estimation on the real line, but the need may arise for the use of different kernels because the empirical distribution of many phenomena fails to conform to a Gaussian distribution, thus leading to the search for other models. An alternative when the discrepancy lies in the tails can be represented by exponential power distributions, where the tail thickness is governed by a shape parameter. For example, the normal-Laplace distribution, resulting from the convolution of independent normal and Laplace components, behaves like the normal in the middle of its range and like the Laplace in the tails. Its use in the study of high frequency price data is pointed out in Reed (2006). Rates of contraction for density estimation using Dirichlet mixtures of exponential power densities are derived in Scricciolo (2011). Another possibility is that of employing a kernel belonging to the family of (symmetric) stable laws, which includes the Cauchy. Unlike exponential power distributions, these distributions have heavy (polynomially decaying) tails and arise in many applications. For compactly supported data, the combinations of a Dirichlet distribution with Bernstein polynomials (Petrone, 1999), triangular densities (Perron and Mengersen, 2001; McVinish et al., 2009), histograms or polygons (Scricciolo, 2007) have been suggested. Some of them are illustrated in the examples below.

In this survey, we are mostly interested in nonparametric mixtures, that is, in the case where the number of the underlying components is unknown and infinite, and want to consider their theoretical properties. Given a random sample of i.i.d. observations from the “true” distribution , we are interested in studying frequentist asymptotic properties of the posterior distribution as the sample size tends to infinity, the focus being on adaptation to unknown smoothness. Consider observations from a density on , or on some subset thereof, belonging to a model . For example, could be the space of density functions on that are Hölder -smooth. Recall that, for , a density (or, more generally, a function) defined on a set is said to be Hölder -smooth if it is differentiable up to the order and the derivative is (uniformly) Hölder continuous with exponent ,

 |f(⌊β⌋)(x)−f(⌊β⌋)(y)|≤L|x−y|β−⌊β⌋,∀x,y∈X, (3)

where is a finite constant, possibly depending on and . For later use, we introduce the notation

 [f]β:=supx≠y|f(⌊β⌋)(x)−f(⌊β⌋)(y)||x−y|β−⌊β⌋

to denote the smallest constant for which (3) is satisfied. Let stand for the class of (Lebesgue) densities on that are Hölder -smooth. Consider a scale of models . The value of the regularity parameter of is typically unknown. The problem is that of designing a prior supported on such that the posterior, hence the entailed Bayes’ estimator, has the remarkable fine property of being self-adaptive to , in the sense that, as the value of varies, one need not change the prior to guarantee that the corresponding posterior achieves the minimax-optimal contraction rate simultaneously over all classes of the collection. The rate of convergence thus has the property of adapting automatically to the unknown smoothness level of . In other terms, the correct rate stems, whatever the true value of , even if is not involved in the definition of the prior. Henceforth, stands for the minimax-optimal rate of convergence relative to the -metric over the class .

Definition 2.

The posterior distribution corresponding to a prior measure on concentrates adaptively over if, for some finite constant (or some slowly varying sequence ),

 ∀β∈B,supf0∈FβEn0[Π(P:∥fP−f0∥1>Mϵn,β∣X(n))]→0.

As mentioned in Section 1, approach (b), which is based on hierarchical models with dimension reduction hyper-parameter, relies on the construction of the so-called sieve priors. A sieve prior is a mixture of priors , where is supported on some set of densities with generic element that can be a kernel mixture. The overall prior induces a prior on which (almost surely) selects probability measures with densities . The choice of the densities is motivated by the fact that they possess some approximation property for “regular” densities, relative to some -metric, . In fact, if is positive for all but finitely many and is fully supported on the -dimensional standard simplex , then every probability measure with density which is the -limit of a sequence of densities , that is, , is in the -support of . The approximation property of densities is crucial to assess the prior concentration rate , which is a main determinant of the posterior contraction rate at “regular” densities. In fact, the main challenge when proving adaptation lies in finding a finite mixing distribution, with a suitable number of support points, such that the corresponding kernel mixture approximates the sampling density, in the Kullback-Leibler divergence, with an error of the correct order. Mixtures are constructed so that their approximation properties guarantee that, under natural conditions on the priors of the hierarchy, the prior mass in Kullback-Leibler type neighborhoods around the sampling density is bounded below by the probability of the mixing weights taking values in a simplex of appropriate dimension, say , depending on the true value of the regularity parameter and the approximation error ,

 Π(BKL(P0;ϵ2))≥ρ(k0)Πk0(N(f∗k0;ϵ2)),

where is an Euclidean ball centered at the best approximation to in . This crucial step can be better understood from the following examples.

Example 1.

(Random histograms and Bernstein polynomials). Random histograms are a common nonparametric model-based mixture prior. For every , let be the partition of into intervals (bins) of equal length for , where and for . Given the number of bins, for any , let the -regular histogram be defined as , , where the are the mixing weights for the densities , with the indicator function of the cell . The prior can be constructed by randomizing the number of bins and the mixing weights . First the index is selected with probability , next a probability measure is generated from the chosen model according to a prior for , the prior being typically chosen to be a Dirichlet distribution on the -dimensional simplex with parameters , i.e., . The mixing weights may be thought of the form , where is a random probability measure distributed according to a Dirichlet process with base measure , in symbols, . A piecewise constant prior can be structurally described as follows. Defined the function , the hierarchical model is

 Xi∣k,P,θiind∼Nk(θi)=k\mathds1Aj(θi),k(θi),i=1,…,k,θi∣k,Piid∼P,i=1,…,k,P∣k∼Πkk∼ρ, (4)

where identifies the bin containing the point , i.e., . We now clarify how, conditionally on the number of bins, the random density can be written as a kernel mixture in the form (2). Taken , for every , consider the discretization of the base measure , with for . The measure defines a random probability measure supported on , with random weights having prior expectation , for . A piecewise constant prior is then the probability distribution of the random density , where

 hwk(⋅)=∫10k\mathds1Akθ,k(⋅)πk(dθ)

is a mixture as in (2), with kernel . The Bayes’ estimator yielded by a piecewise constant prior has the following structure

which evidentiates that the posterior expected density is still a histogram with updated weights, see equation (3) in Scricciolo (2007) for the complete explicit expression of . Consistency of the posterior distribution of a piecewise constant Dirichlet prior concentrated on the -regular dyadic histograms is addressed in Barron et al. (1999), see also Barron (1988a). The main idea is to show that the prior satisfies Schwartz (1965)’s prior positivity condition. The posterior is consistent in the Hellinger or the -metric at any density such that , for and all , with . Bayesian adaptive density estimation via a piecewise constant prior is studied in Scricciolo (2007). The capability of the posterior distribution to achieve minimax-optimal contraction rates, possibly up to a logarithmic factor, depends on the approximation error of a density by histograms: the sup-norm approximation error of a density by a -regular histogram-shaped density is of the order , which is at most only proportional to the inverse of the bin-width . For , we have , where , with , is the histogram-shaped density based on . Thus, as stated in Proposition 1 below, piecewise constant priors can achieve minimax rates, up to a logarithmic factor, only up to Hölder regularity . It is known from the vast literature on density estimation on the unit interval that the minimax -risk over Hölder smoothness classes satisfies

 R(p)n(H(β,L))≍L1/(2β+1)×{n−β/(2β+1),for 1≤p<∞,(n/logn)−β/(2β+1),for p=∞,

where denotes the Hölder class of order , consisting of densities on such that the derivative exists and . Note that, except for the case where , the rate does not depend on . In what follows, the previously introduced sequence specifies as for -metrics, when . Before reporting a result on adaptation, it is worth mentioning some recent findings on non-adaptive posterior contraction rates in -metrics for random dyadic histograms with a sample size-dependent number of bins, for densities of Hölder regularity . Giné and Nickl (2011) obtain the minimax rate , up to a logarithmic factor, for -metrics, with ; while Castillo (2014) gets the exact minimax sup-norm rate .

Proposition 1 (Scricciolo (2007)).

Let the density , with , be bounded away from zero on . Let be a piecewise constant prior, with for all and constants , and with the base measure of the Dirichlet process possessing a continuous and positive density on . Then, for large enough, with -probability one. Consequently, .

Since piecewise constant priors can attain minimax rates in the -metric only up to Hölder regularity , they are not appropriate for estimating smoother than Lipschitz densities. One may compare the performance of random histograms with that of random Bernstein polynomials. A Bernstein-Dirichlet prior has the same structure as a piecewise constant prior described in (4), but with and . The Dirichlet process mixture of Bernstein polynomials as a nonparametric prior is introduced in Petrone (1999). Weak and Hellinger posterior consistency are investigated in Petrone and Wasserman (2002), while convergence rates relative to the Hellinger or the -distance are analyzed in Ghosal (2001). Although the sub-optimal rate found by Ghosal (2001) for estimating twice continuously differentiable densities is only an upper bound on the posterior contraction rate, it indicates, following from Proposition 1, that random histograms, despite their simple structure, possess better approximation properties than random Bernstein polynomials, whose use in Bayesian adaptive estimation of densities with Hölder regularity has been considered by Kruijer and van der Vaart (2008). They find the sub-optimal rate , up to a logarithmic factor. As remarked by the authors themselves, sub-optimality of the rate can be understood from sub-optimality of Bernstein polynomials as an approximation scheme. In fact, in terms of approximation of Hölder regular functions, they are sub-optimal in yielding an approximation error of the order , whereas polynomials of degree of best approximation have an error of the order only. We incidentally note that, as discussed in the following example, the same sub-optimality phenomenon is observed for polygons which, in principle, are introduced to overcome limitations of histograms, but turn out to suffer from the same deficiency when . The authors employ coarsened Bernstein polynomials to get the nearly optimal rate for densities of Hölder regularity . Adaptation in the Hellinger metric over the full scale of Hölder classes of regularity can be achieved by using suitably constructed mixtures of beta densities, see Rousseau (2010).

Example 2.

(Random polygons). A polygonally smoothed prior, introduced in Scricciolo (2007), is a model-based hierarchical prior having the same structure as a piecewise constant prior, but with a continuous polygon-shaped, in lieu of a histogram-shaped, conditional density of the observations. The polygon can be regarded as the result of a histogram smoothing performed by joining the heights at mid-bin points , for , with straight lines,

 pwk(x)=w1,kk\mathds1A−1,k(x)+k−1∑j=1[k(cj+1,k−x)wj,k+k(x−cj,k)wj+1,k]k\mathds1A+j,k∪A−j+1,k(x)+wk,kk\mathds1A+k,k(x),x∈[0,1],

where, for every , the symbols and stand for the left and right equal length sub-intervals of , respectively. Any density can be uniformly approximated by a -regular polygon-shaped density based on with an error of the order , that is, . If is Hölder -regular, with , the approximation error near the endpoints of , where inherits the structure of a histogram, is only of the order , as for Lipschitz densities. Thus, extra regularity conditions on , aimed at compensating for the poor approximation quality of the polygon near the unit interval endpoints, can be considered to guarantee the correct order of the approximation error. For , possible boundary conditions on are , as , and as , where and , see also Scricciolo (2007).

Proposition 2.

Let the density , with . For , suppose further that satisfies the boundary conditions and . Then, .

This approximation result, whose proof is deferred to A, is the key ingredient for proving that the posterior distribution corresponding to a polygonally smoothed prior is rate-adaptive over a scale of Hölder classes of regularity .

Theorem 1.

Let the density , with , and . For , suppose further that satisfies the boundary conditions and . Let be a polygonally smoothed prior, with for all and constants , and with the base measure of the Dirichlet process having a continuous and positive density on . Then, for a sufficiently large constant , . Consequently, .

Estimating any density of Hölder regularity with the Bayes’ estimator entailed by a polygonally smoothed prior we may pay, at most, a price of a -factor, the convergence rate being self-adaptive to : as the regularity parameter varies with , one need not change the prior to guarantee that the Bayes’ estimator achieves, up to a multiplicative logarithmic term, the minimax rate of convergence over a scale of Hölder classes of regularity . For any , instead, the error made in uniformly approximating a density by the -regular polygon-shaped density is only of the order and the posterior contraction rate we find is as for densities that are only twice differentiable. Hereafter, we show that the minimax -rate for the Hölder smoothness class is a lower bound on the contraction rate of the posterior distribution of a polygonally smoothed prior at densities in , where the subclass

 F3={f|f:[0,1]→R+,∥f∥1=1,f′′ bounded away from 0 on an interval I⊂(0,1) and f′′′ bounded}

has been employed for an analogous purpose by McVinish et al. (2005, 2009). We consider densities in that also satisfy the above boundary conditions and . An example of such a density is , , (see, e.g., Scricciolo, 2007, Remark 5).

Proposition 3.

Let the density satisfy the boundary conditions and and . Let be a polygonally smoothed prior, with for all and constants , and with the base measure of the Dirichlet process having a continuous and positive density on . Then, in -probability.

The assertion implies that the minimax -rate has a too small order of magnitude to be the radius of an -ball around that is able to capture almost all the mass when the posterior weakly converges to a point mass at . Thus, random polygons can only get minimax rates of convergence, up to a logarithmic factor, over a scale of Hölder classes up to regularity : they are not appropriate for estimating smoother than twice differentiable densities because they are structurally not able to exploit additional regularity.

It is interesting to investigate the relationship between the Bayes’ estimator of a polygonally smoothed prior and a frequentist counterpart, the so-called smooth Barron-type density estimator proposed by Beirlant et al. (2002):

 fPn(x)\coloneqq(1−an)pwμnkn(x)+an,x∈[0,1],

where, for such that and , the sequence has generic term and is the -regular frequency polygon constructed with weights that are the relative frequencies of the observations falling into the bins , that is, , for , where stands for the empirical measure associated with the sample , i.e., for every measurable set . Thus, is a convex combination of the frequency polygon and the uniform density on and, as the sample size increases, it shrinks towards the frequency polygon which converges pointwise to . The smooth Barron-type density estimator is a modification of the histogram-based Barron estimator (Barron, 1988a)

 fBn(x)\coloneqq(1−an)hwμnkn(x)+an,x∈[0,1], (5)

in fact, is obtained by replacing the histogram with the frequency polygon in (5). The smooth Barron-type density estimator can be given an interpretation in terms of the Bayes’ rule analogous to that of the Barron estimator presented by Barron et al. (1992) in Remark 5. Suppose that are i.i.d. observations from a distribution corresponding to a probability measure that is given a prior by assigning a prior to the bin probabilities which is a Dirichlet distribution with parameters all equal to one, i.e., . Then, the posterior distribution of the cell probabilities, given the data, is still Dirichlet with parameters , for . Let , with the posterior expectation of the cell probability , that is, , for , which may be interpreted as the relative frequency of the cell with one additional fictitious observation. Then, the posterior expectation of a polygon constructed with the bin probabilities is , with . Therefore, it is a convex combination of the polygonally smoothed empirical distribution function and the prior guess which is the uniform distribution on . Therefore,

 fPn(x)=\mathdsE[pwkn(x)∣X(n)],x∈[0,1],

namely, the smooth Barron-type density estimator corresponds to the Bayes’ estimator of a statistician who assumes observations were generated from and takes a Dirichlet distribution with one a priori expected observation per cell as a prior for the cell probabilities. In fact, in evaluating the expectation , the posterior distribution of is computed assuming that were i.i.d. observations from . A Bayesian statistician believing that the observations were generated from a density, possibly a polygon, would, instead, first induce a prior on the space of polygon-shaped densities from the prior distribution for (or the mixing weights) and then compute the corresponding posterior.

Barron’s modification of the histogram estimator is motivated by the search for consistency in stronger information divergence criteria than the -distance, which is needed for applications in information transfer and communication as illustrated in Barron et al. (1992). The smooth Barron-type density estimator is in turn a modification of the Barron estimator to overcome discontinuities of the histogram. The following result, which provides the order of the approximation error of any density , with Hölder regularity , by the smooth Barron-type density estimator in the expected -divergence, where , complements Theorem 4.1 of Beirlant et al. (2002), where only the case of a twice continuously differentiable density is treated. In what follows, we denote by the smooth Barron-type density estimator corresponding to the choice .

Proposition 4.

Let the density , with , and . For , suppose further that satisfies the boundary conditions and . Then, . The choice