On Bayesian supremum norm contraction rates

Abstract

Building on ideas from Castillo and Nickl [Ann. Statist. 41 (2013) 1999–2028], a method is provided to study nonparametric Bayesian posterior convergence rates when “strong” measures of distances, such as the sup-norm, are considered. In particular, we show that likelihood methods can achieve optimal minimax sup-norm rates in density estimation on the unit interval. The introduced methodology is used to prove that commonly used families of prior distributions on densities, namely log-density priors and dyadic random density histograms, can indeed achieve optimal sup-norm rates of convergence. New results are also derived in the Gaussian white noise model as a further illustration of the presented techniques.

Ann. Statist. 42 (2014), no. 5, 2058–2091. DOI: 10.1214/14-AOS1253.

Running title: On Bayesian sup-norm rates.

Supported in part by ANR Grant “Banhdits” ANR-2010-BLAN-0113-03.

AMS subject classifications: Primary 62G20; secondary 62G05, 62G07.

Keywords: Bayesian nonparametrics, contraction rates, supremum norm.

1 Introduction

In the fundamental contributions by Ghosal, Ghosh and van der Vaart [13], Shen and Wasserman [32] and Ghosal and van der Vaart [15], a general theory is developed to study the behaviour of Bayesian posterior distributions. A main tool is provided by the existence of exponentially powerful tests between a point and the complement of a ball for some distance. The use of some important distances, such as the Hellinger distance between probability measures, indeed guarantees the existence of such tests. The theory often also allows extensions to other metrics, for instance $L^p$-type distances, but the question of dealing with arbitrary metrics has been left essentially open so far. Although a general theory might be harder to obtain, it is natural to consider such a problem first in simple, canonical, statistical settings, such as Gaussian white noise or density estimation. This is the starting point of Giné and Nickl [16], the first paper to provide tools to obtain rates in strong norms, such as the $L^\infty$-norm. Exponential inequalities for frequentist estimators are used in [16] as a way to build appropriate tests, and this enables one to obtain some sup-norm rates in density estimation. In the case where the true density is itself supersmooth and a kernel mixture is used as a prior, the nearly parametric minimax rate is attained, at least up to a possible logarithmic term; see also the work by Scricciolo [31] for related results. In the general case where the true density belongs to a Hölder class, a sup-norm rate is obtained which differs from the minimax rate by a power of $n$. On the other hand, by using explicit computations, the authors in [16] show that in the Gaussian white noise model with conjugate Gaussian priors, minimax sup-norm rates are attainable, which raises the natural question of whether this is still possible in density estimation, or in nonconjugate regression settings. This nontrivial question also arises for other likelihood methods, such as nonparametric maximum likelihood estimation; see Nickl [26].

From a general statistical perspective, density estimation in supremum norm is a central problem from both theoretical and practical points of view. The problem has been the object of much interest in the framework of minimax theory. Lower bounds for density estimation in sup-norm can be found in Hasminskii [19], upper bounds in Ibragimov and Hasminskii [21] for density estimation and in Stone [33] for regression. We refer to Goldenshluger and Lepski [17] for an overview of current work in this area. From the practical perspective, sup-norm properties are of course very desirable, since the statement that two curves in a simulation picture look close is naturally, and often implicitly, made in a sup-norm sense.

Here, we establish that minimax optimal sup-norm rates of convergence in density estimation are attainable by common and natural Bayes procedures. The methodology we introduce is in fact related to a programme initiated in [6] and continued in [7], namely nonparametric Bernstein–von Mises type results, as discussed below. In [7], we use the results of the present paper to derive nonparametric Bernstein–von Mises theorems in density estimation, as well as Donsker-type results for the posterior distribution function. The testing approach commonly used to establish posterior rates is replaced here by tools from semiparametric Bernstein–von Mises results (testing is still typically useful to establish preliminary rates); see [6] for an overview of references. We split the distance of interest into simpler pieces, each piece being a semiparametric functional to study. One novelty of the paper consists in providing well-chosen uniform approximation schemes for the various influence functions appearing at the semiparametric level when estimating those simple functionals.

Two natural families of nonparametric priors are considered for density estimation: priors on log-densities; see, for example, Ghosal, Ghosh and van der Vaart [13], Scricciolo [29], Tokdar and Ghosh [35], van der Vaart and van Zanten [38, 3], Rivoirard and Rousseau [27], and random (dyadic) histogram priors; see, for example, Barron [1], Barron, Schervish and Wasserman [2], Walker [39], Ghosal and van der Vaart [15], Scricciolo [30], Giné and Nickl [16] and the recent semiparametric treatment in [8]. Both classes are relevant for applications and priors of these types have been studied from the implementation perspective; see, for example, Lenk [23], Tokdar [34] and references therein for the use of logistic Gaussian process priors, and Leonard [24], Gasparini [11] for random histogram priors.

New results are also derived in the Gaussian white noise model, in the spirit of [6], for nonconjugate priors.

While working on this paper, we learned of the work by Marc Hoffmann, Judith Rousseau and Johannes Schmidt-Hieber [20], which independently obtains sup-norm properties for different priors. Their method is different from ours, and both approaches shed light on different specific aspects of the problem. In Gaussian white noise, adaptive results over Hölder classes are obtained in [20] for a class of sparse priors. In Theorem 1 below, the sup-norm minimax rate for fixed regularity is obtained for canonical priors without sparsity enforcement. The authors also give insight into the interplay between loss function and posterior rate, as well as an upper bound result for fairly abstract sieve-type priors, which are shown to attain the adaptive sup-norm rate in density estimation. This is an interesting existence result, but no method is provided to investigate sup-norm rates for general given priors. Although for simplicity we limit ourselves here to the fixed regularity case, the present paper suggests such a method and demonstrates its applicability by dealing with several commonly used classes of prior distributions. Clearly, there is still much to do in the understanding of posterior rates for strong measures of loss, and we hope that future contributions will go further in the different directions suggested by both the present paper and [20].

Let $L^2=L^2[0,1]$ and $L^\infty=L^\infty[0,1]$, respectively, denote the space of square integrable functions with respect to Lebesgue measure on $[0,1]$ and the space of measurable bounded functions on $[0,1]$. These spaces are equipped with their usual norms, respectively denoted $\|\cdot\|_2$ (with $\langle\cdot,\cdot\rangle_2$ the associated inner product) and $\|\cdot\|_\infty$. Let $\mathcal{C}^\alpha[0,1]$ denote the class of Hölder functions on $[0,1]$ with Hölder exponent $\alpha>0$.

For any $\alpha>0$ and any $n\ge 2$, denote by $\varepsilon_{n,\alpha}$ the rate

(1) $\qquad \varepsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}.$

The typical minimax rate over a ball of the Hölder space $\mathcal{C}^\alpha[0,1]$, $\alpha>0$, for the sup-norm is

(2) $\qquad \varepsilon^*_{n,\alpha} = \Big(\frac{\log n}{n}\Big)^{\alpha/(2\alpha+1)}.$

Let us also set, omitting the dependence on $\alpha$ in the notation,

(3) $\qquad L_n$ defined by $\quad 2^{L_n} \le \Big(\frac{n}{\log n}\Big)^{1/(2\alpha+1)} < 2^{L_n+1}.$
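The cut-off in (3) is the usual bias–variance balancing point; as a one-line sanity check (a sketch using only the definitions (2) and (3) above):

```latex
% With 2^{L_n} \asymp (n/\log n)^{1/(2\alpha+1)}, the approximation error
% 2^{-L_n \alpha} and the noise level sqrt(L_n 2^{L_n}/n) at the cut-off
% both match the minimax sup-norm rate (2):
\[
2^{-L_n\alpha} \asymp \Big(\frac{\log n}{n}\Big)^{\alpha/(2\alpha+1)}
  = \varepsilon^*_{n,\alpha},
\qquad
\sqrt{\frac{L_n 2^{L_n}}{n}}
  \asymp \sqrt{\frac{\log n}{n}\Big(\frac{n}{\log n}\Big)^{1/(2\alpha+1)}}
  = \varepsilon^*_{n,\alpha},
\]
% using L_n \asymp \log n for the second equivalence.
```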

For a statistical model indexed by a function $f$ belonging to some class to be specified, and associated observations $X^{(n)}$, denote by $f_0$ the “true” function and by $E_{f_0}$ the expectation under the law of $X^{(n)}$ when $f=f_0$. Given a prior $\Pi$ on a set of possible $f$'s, denote by $\Pi[\cdot\mid X^{(n)}]$ the posterior distribution and by $E^{\Pi}[\cdot\mid X^{(n)}]$ the expectation operator under the law $\Pi[\cdot\mid X^{(n)}]$.

2 Prologue

Let us start with a simple example in Gaussian white noise which will serve as a slightly naive yet useful illustration of the main technique of proof.

Let $f$ be an element of $L^2[0,1]$. Suppose one observes

(4) $\qquad dX^{(n)}(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \qquad t\in[0,1],$

where $W$ is standard Brownian motion. Let $\{\psi_{lk}\}$ be a wavelet basis on the interval $[0,1]$. Here, we take the basis constructed in [10]; see below for precise definitions. The model (4) is statistically equivalent to observing the projected observations onto the basis $\{\psi_{lk}\}$,

$X_{lk} = f_{lk} + \frac{1}{\sqrt{n}}\,\varepsilon_{lk},$

where $f_{lk} = \langle f,\psi_{lk}\rangle_2$ and the $\varepsilon_{lk}$ are i.i.d. standard normal. Denote $\hat f_{lk} = X_{lk}$, an efficient frequentist estimator of the wavelet coefficient $f_{lk}$.
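For intuition, the sequence-space form of (4) is easy to simulate. The following minimal sketch (not from the paper; the parameter values and the coefficient-wise construction of $f_0$ are illustrative assumptions) draws the projected observations and evaluates the sup-norm error of the projection estimator through the localisation bound used repeatedly below:

```python
# A minimal numerical sketch of the sequence-space form of model (4),
# with a hypothetical "true" f0 given by coefficients satisfying (5).
import numpy as np

rng = np.random.default_rng(0)
n, alpha, B = 10_000, 1.0, 1.0
# cut-off L_n with 2^{L_n} ~ (n / log n)^{1/(2 alpha + 1)}, as in (3)
L_n = int(np.floor(np.log2((n / np.log(n)) ** (1 / (2 * alpha + 1)))))

# hypothetical true coefficients saturating the bound in (5)
f0 = [B * 2.0 ** (-l * (0.5 + alpha)) * rng.choice([-1.0, 1.0], 2 ** l)
      for l in range(L_n + 1)]

# projected observations X_{lk} = f_{0,lk} + eps_{lk} / sqrt(n)
X = [f0_l + rng.standard_normal(f0_l.shape) / np.sqrt(n) for f0_l in f0]

# sup-norm error of the projection estimator, bounded via the
# localisation property ||sum_k c_k psi_{lk}||_inf <~ 2^{l/2} max_k |c_k|
err = sum(2.0 ** (l / 2) * np.max(np.abs(X[l] - f0[l]))
          for l in range(L_n + 1))
print(L_n, err, (np.log(n) / n) ** (alpha / (2 * alpha + 1)))
```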

2.1 A first example

Suppose the coefficients $f_{0,lk}=\langle f_0,\psi_{lk}\rangle_2$ of the true function satisfy, for some $\alpha>0$ that we suppose to be known in this first example and some constant $B>0$,

(5) $\qquad \sup_{l\ge0}\,\max_{k}\, 2^{l(1/2+\alpha)}\,|f_{0,lk}| \le B.$

Define a prior $\Pi$ on $f$ via an independent product prior on its coordinates onto the considered basis. The component $f_{lk}$ is assumed to be sampled from a prior with density $\varphi_{lk}$ with respect to Lebesgue measure on $\mathbb{R}$, where, for $\alpha$ and $B$ as in (5),

(6) $\qquad \varphi_{lk} = $ the density of the uniform distribution on $[-B2^{-l(1/2+\alpha)},\, B2^{-l(1/2+\alpha)}].$

This type of prior was considered in [16], Section 2.2, and provides a simple example of a random function with bounded $\alpha$-Hölder norm.
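To make the construction concrete, here is a minimal sketch of one draw from the prior (6) (levels truncated at an arbitrary $L$ for simulation; all numerical values are illustrative assumptions):

```python
# One draw from the uniform wavelet prior (6): each f_{lk} is uniform on
# [-B 2^{-l(1/2+alpha)}, B 2^{-l(1/2+alpha)}], independently over (l, k).
import numpy as np

rng = np.random.default_rng(1)
alpha, B, L = 1.0, 1.0, 12
f = [rng.uniform(-B * 2.0 ** (-l * (0.5 + alpha)),
                 B * 2.0 ** (-l * (0.5 + alpha)), size=2 ** l)
     for l in range(L + 1)]
# every draw satisfies sup_l max_k 2^{l(1/2+alpha)} |f_{lk}| <= B, the
# property behind the bounded alpha-Hoelder norm mentioned above
print(max(2.0 ** (l * (0.5 + alpha)) * np.max(np.abs(f[l]))
          for l in range(L + 1)))
```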

Proposition 1

Consider observations from the model (4). Let $f_0$ satisfy (5) and let the prior $\Pi$ be chosen according to (6). Then there exists $M>0$ such that for $\varepsilon^*_{n,\alpha}$ defined by (2),

$E_{f_0}\Pi\big[f: \|f-f_0\|_\infty > M\varepsilon^*_{n,\alpha} \,\big|\, X^{(n)}\big] \to 0 \qquad \text{as } n\to\infty.$

Uniform wavelet priors thus lead to the minimax rate of convergence in sup-norm. The result has a fairly simple proof, as we now illustrate, and is new, to the best of our knowledge.

Let $L_n$ be defined in (3). Denote by $f^{L_n}$ the orthogonal projection of $f$ in $L^2$ onto the linear span of $\{\psi_{lk},\ l\le L_n\}$, and $f_0^{L_n}$ the projection of $f_0$ onto the same space. Then

$f - f_0 = (f^{L_n} - \hat f^{L_n}) + (\hat f^{L_n} - f_0^{L_n}) + (f - f^{L_n}) - (f_0 - f_0^{L_n}),$

where $\hat f^{L_n} = \sum_{l\le L_n}\sum_k \hat f_{lk}\psi_{lk}$ is the projection estimator onto the basis with cut-off $L_n$. Note that the previous equality as such is an equality in $L^2$. However, if the wavelet series of $f$ into the basis $\{\psi_{lk}\}$ is absolutely convergent $\Pi$-almost surely (which is the case for all priors considered in this paper), we also have pointwise equality for Lebesgue-almost every $t$ in $[0,1]$, $\Pi$-almost surely, and similarly for $f_0$. Now,

$\|f - f_0\|_\infty \le \underbrace{\|f^{L_n} - \hat f^{L_n}\|_\infty}_{(i)} + \|\hat f^{L_n} - f_0^{L_n}\|_\infty + \underbrace{\|f - f^{L_n}\|_\infty}_{(ii)} + \|f_0 - f_0^{L_n}\|_\infty.$

We have $f_0 - f_0^{L_n} = \sum_{l>L_n}\sum_k f_{0,lk}\psi_{lk}$. Using (5) and the localisation property of the wavelet basis (see below), one obtains

$\|f_0 - f_0^{L_n}\|_\infty \lesssim \sum_{l>L_n} 2^{l/2}\max_k |f_{0,lk}| \lesssim \sum_{l>L_n} 2^{-l\alpha} \lesssim 2^{-L_n\alpha} \lesssim \varepsilon^*_{n,\alpha},$

where $\lesssim$ means less or equal to up to some universal constant. The term $\|\hat f^{L_n} - f_0^{L_n}\|_\infty$ depends on the randomness of the observations only,

$\|\hat f^{L_n} - f_0^{L_n}\|_\infty \lesssim \sum_{l\le L_n} 2^{l/2}\max_k \frac{|\varepsilon_{lk}|}{\sqrt{n}}.$

This is bounded under $E_{f_0}$ by a constant times $\sqrt{L_n 2^{L_n}/n}\lesssim\varepsilon^*_{n,\alpha}$; see Lemma 7 for a proof in the more difficult case of empirical processes.

Term (i). By definition, $f^{L_n}-\hat f^{L_n}$ has coordinates $f_{lk}-\hat f_{lk}$, $l\le L_n$, in the basis $\{\psi_{lk}\}$, so using the localisation property of the wavelet basis as above, one obtains $\|f^{L_n}-\hat f^{L_n}\|_\infty \lesssim \sum_{l\le L_n} 2^{l/2}\max_k|f_{lk}-\hat f_{lk}|$. For any $\lambda>0$, via Jensen's inequality and bounding the maximum by the sum, using $E^{\Pi}[\cdot\mid X^{(n)}]$ as a shorthand notation for the posterior expectation, for any $l\le L_n$,

$E^{\Pi}\Big[\max_k|f_{lk}-\hat f_{lk}| \,\Big|\, X^{(n)}\Big] \le \frac{1}{\lambda\sqrt{n}}\log \sum_k \Big(E^{\Pi}\big[e^{\lambda\sqrt{n}(f_{lk}-\hat f_{lk})}\mid X^{(n)}\big] + E^{\Pi}\big[e^{-\lambda\sqrt{n}(f_{lk}-\hat f_{lk})}\mid X^{(n)}\big]\Big).$

Simple computations presented in Lemma 1 yield a sub-Gaussian behaviour for the Laplace transform of $\sqrt{n}(f_{lk}-\hat f_{lk})$ under the posterior distribution, which is bounded above by $C_0e^{\lambda^2/2}$ for a constant $C_0$ independent of $l$ and $k$. From this deduce, for any $\lambda>0$ and $l\le L_n$,

$E^{\Pi}\Big[\max_k|f_{lk}-\hat f_{lk}| \,\Big|\, X^{(n)}\Big] \le \frac{(l+1)\log 2 + \log C_0}{\lambda\sqrt{n}} + \frac{\lambda}{2\sqrt{n}}.$

The choice $\lambda = \sqrt{2\{(l+1)\log 2 + \log C_0\}}$ leads us to the bound

$E^{\Pi}\big[\|f^{L_n}-\hat f^{L_n}\|_\infty \,\big|\, X^{(n)}\big] \lesssim \sum_{l\le L_n} 2^{l/2}\sqrt{\frac{l+1}{n}} \lesssim \sqrt{\frac{L_n 2^{L_n}}{n}} \lesssim \varepsilon^*_{n,\alpha}.$

Term (ii). Under the considered prior, the wavelet coefficients $f_{lk}$ of $f$ are bounded by $B2^{-l(1/2+\alpha)}$, so using again the localisation property of the wavelet basis,

$\|f - f^{L_n}\|_\infty \lesssim \sum_{l>L_n} 2^{l/2}\max_k|f_{lk}| \lesssim \sum_{l>L_n} 2^{-l\alpha} \lesssim 2^{-L_n\alpha} \lesssim \varepsilon^*_{n,\alpha}.$

This concludes the proof of Proposition 1.
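As a complement, the following numerical sketch mimics the mechanism of the proof (hypothetical parameter values; it uses the elementary fact that under the uniform prior (6) each coordinate's posterior is the normal $N(X_{lk},1/n)$ truncated to the prior support):

```python
# Numerical illustration of Proposition 1 (illustrative parameters only):
# under the uniform prior (6), the posterior of f_{lk} is N(X_{lk}, 1/n)
# truncated to [-b, b] with b = B 2^{-l(1/2+alpha)}.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)
n, alpha, B = 10_000, 1.0, 1.0
L_n = int(np.log2((n / np.log(n)) ** (1 / (2 * alpha + 1))))

post_sup = 0.0                                    # localisation-based bound
for l in range(L_n + 1):
    b = B * 2.0 ** (-l * (0.5 + alpha))
    f0 = rng.uniform(-b, b, size=2 ** l)          # true coefficients
    X = f0 + rng.standard_normal(2 ** l) / np.sqrt(n)
    a_std, b_std = (-b - X) * np.sqrt(n), (b - X) * np.sqrt(n)
    draw = truncnorm.rvs(a_std, b_std, loc=X, scale=1 / np.sqrt(n),
                         random_state=rng)        # one posterior draw
    post_sup += 2.0 ** (l / 2) * np.max(np.abs(draw - f0))

# the accumulated bound is of the order of the minimax rate (2)
print(post_sup, (np.log(n) / n) ** (alpha / (2 * alpha + 1)))
```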

Although fairly simple, the previous example reveals some important facts, some of which are well known from the frequentist analysis of the problem, some being specific to the Bayesian approach. The previous proof shows two regimes of frequencies: “low frequency” and “high frequency.” In the low frequency regime, the estimator $\hat f_{lk}$ of $f_{lk}$ is satisfactory, and the concentration of the posterior distribution around this efficient frequentist estimator is desirable. This is reminiscent of the Bernstein–von Mises (BvM) property, see van der Vaart [37], Chapter 10, which states that in regular parametric problems with unknown parameter $\theta$, the posterior distribution is asymptotically Gaussian, concentrating at rate $1/\sqrt{n}$ and centered around an efficient estimator of $\theta$.

Here are a few words on the general philosophy of the results, specifically in the Bayesian context. Such a method was used as a building block in [6]. The idea is to split the distance of interest into small pieces. For the sup-norm, those pieces can, for instance, involve the wavelet coefficients $\langle f,\psi_{lk}\rangle_2$, but not necessarily, as will be seen for log-density priors. In this case, this split is obtained, for instance, from the inequality

$\|f - f_0\|_\infty \lesssim \sum_{l} 2^{l/2}\max_k |\langle f - f_0, \psi_{lk}\rangle_2|,$

which holds for localised bases $\{\psi_{lk}\}$. Note that $f\mapsto\langle f,\psi_{lk}\rangle_2$ can be seen as a semiparametric functional; see, for example, [37], Chapter 25 for an introduction to semiparametrics and the notions of efficiency and efficient influence functions. Next, one analyses each piece separately, with different regimes of indexes often arising, requiring specific techniques for each of them.

  • the BvM-regime: semiparametric bias. For “low frequencies,” what is typically needed is a concentration of the posterior distribution for the functional of interest, say $\langle f,\psi_{lk}\rangle_2$, at rate $1/\sqrt{n}$ around a semiparametrically efficient estimator of the functional. This is at the heart of the proof of semiparametric BvM results, hence the use of BvM techniques. In particular, a sharp control of the bias will be essential. Regarding the BvM property, although the precise Gaussian shape will not be needed here, one needs uniformity in all frequencies in the considered regime. This requires nontrivial strengthenings of BvM-type results, the semiparametric efficient influence function of the functional of interest, which can be, for instance, a re-centered version of $\psi_{lk}$, being typically unbounded as $l$ grows.

  • Taking care of uniformity issues in the approximation of the efficient influence functions by the prior may require various approximation regimes depending on $l$. For log-density priors, we will indeed see various regimes of indexes $l$ arise in the obtained bounds for the bias.

  • The high-frequency bias corresponds to frequencies where the prior should make the likelihood negligible. This part can be difficult to handle, too, especially for unbounded priors.

In the example above for uniform priors in white noise, most of the previous steps are either almost trivial or at least can be carried out by considering the explicit expression of the posterior, but for different priors or in different sampling situations some of the previous steps may become significantly harder, as we will see below.

2.2 Wavelet basis and Besov spaces

Central to our investigations is the tool provided by localised bases of $L^2[0,1]$. We refer to the lecture notes by Härdle, Kerkyacharian, Picard and Tsybakov [18] for an introduction to wavelets. Two bases will be used in the sequel.

The Haar basis on $[0,1]$ is defined by $\psi_{-1,0}=\mathbf{1}_{[0,1]}$ and $\psi_{lk}=2^{l/2}\psi(2^l\cdot-k)$, with $\psi=\mathbf{1}_{(0,1/2]}-\mathbf{1}_{(1/2,1]}$, for any integer $l\ge0$ and $0\le k<2^l$. The supports of Haar wavelets form dyadic partitions of $[0,1]$, corresponding to the intervals $(k2^{-l},(k+1)2^{-l}]$ for $0\le k<2^l$, where the interval is closed to the left when $k=0$.
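A direct implementation of these definitions (a small sketch; the grid and coefficient values are illustrative choices) can be used to check the partition and localisation properties quoted below:

```python
# Haar wavelets psi_{lk}(x) = 2^{l/2} psi(2^l x - k) on [0, 1], with
# psi = 1_{(0,1/2]} - 1_{(1/2,1]}; supports tile [0,1] dyadically.
import numpy as np

def haar(l, k, x):
    """Evaluate psi_{lk} at the points x (psi_{-1,0} is the constant 1)."""
    if l == -1:
        return np.ones_like(x)
    y = 2.0 ** l * x - k                      # rescale to the unit interval
    return 2.0 ** (l / 2) * (np.where((y > 0) & (y <= 0.5), 1.0, 0.0)
                             - np.where((y > 0.5) & (y <= 1.0), 1.0, 0.0))

x = np.linspace(1e-6, 1.0, 2 ** 12)           # grid avoiding the endpoint 0
l = 4
# at a fixed level, supports are disjoint: at most one wavelet is nonzero
active = np.stack([np.abs(haar(l, k, x)) > 0 for k in range(2 ** l)])
assert np.all(active.sum(axis=0) <= 1)
# localisation: ||sum_k c_k psi_{lk}||_inf equals 2^{l/2} max_k |c_k| here
c = np.random.default_rng(5).standard_normal(2 ** l)
s = sum(c[k] * haar(l, k, x) for k in range(2 ** l))
print(np.max(np.abs(s)), 2.0 ** (l / 2) * np.max(np.abs(c)))
```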

The boundary-corrected basis of Cohen, Daubechies and Vial [10] will be referred to as the CDV basis. Similar to the Haar basis, the CDV basis enables a treatment on compact intervals, but at the same time can be chosen sufficiently smooth. A few properties are lost, essentially simple explicit expressions, but the most convenient localisation properties and characterisations of spaces are maintained. Below we recall some useful properties of the CDV basis. We denote this basis $\{\psi_{lk}\}$, with indexes $l\ge0$ and $0\le k<2^l$ (with respect to the original construction in [10], one starts at a sufficiently large level $l_0$, with $l_0$ fixed large enough; for simplicity, up to renumbering, one can start the indexing at $l=0$). Let the smoothness index $S$ of the basis be fixed, large enough.

  • $\{\psi_{lk}\}$ forms an orthonormal basis of $L^2[0,1]$.

  • Each $\psi_{lk}$ has support $S_{lk}\subset[0,1]$, with diameter at most a constant (independent of $l,k$) times $2^{-l}$, and $\|\psi_{lk}\|_\infty\lesssim 2^{l/2}$. The $\psi_{lk}$'s are in the Hölder class $\mathcal{C}^S$.

  • At fixed level $l$, given a fixed wavelet $\psi_{lk}$ with support $S_{lk}$,

    • the number of wavelets of the level $l$ with support intersecting $S_{lk}$ is bounded by a universal constant (independent of $l$),

    • the number of wavelets of a level $l'>l$ with support intersecting $S_{lk}$ is bounded by $2^{l'-l}$ times a universal constant.

    The following localisation property holds: $\|\sum_k x_{lk}\psi_{lk}\|_\infty \lesssim 2^{l/2}\max_k |x_{lk}|$, where the inequality is up to a fixed universal constant.

  • The constant function equal to $1$ on $[0,1]$ is orthogonal to high-level wavelets, in the sense that $\int_0^1 \psi_{lk} = 0$ whenever $l\ge l_1$, for a large enough constant $l_1$.

  • The basis characterises the Besov spaces $\mathcal{B}^\alpha_{\infty\infty}[0,1]$, any $0<\alpha<S$, in terms of wavelet coefficients. That is, $f$ belongs to $\mathcal{B}^\alpha_{\infty\infty}[0,1]$ if and only if

    (7) $\qquad \sup_{l\ge0}\,\max_{0\le k<2^l} 2^{l(1/2+\alpha)}\,|\langle f,\psi_{lk}\rangle_2| < \infty.$

We note that orthonormality of the basis is not essential. Other nonorthonormal, multi-resolution dictionaries could be used instead, up to some adaptation of the proofs, as long as coefficients in the expansion of $f$ can be recovered from inner products. Also, recall that $\mathcal{B}^\alpha_{\infty\infty}[0,1]$ coincides with the Hölder space $\mathcal{C}^\alpha[0,1]$ when $\alpha$ is not an integer, and that when $\alpha$ is an integer the inclusion $\mathcal{C}^\alpha[0,1]\subset\mathcal{B}^\alpha_{\infty\infty}[0,1]$ holds. If the Haar wavelet is considered, the fact that $f$ is in $\mathcal{C}^\alpha[0,1]$, $0<\alpha\le1$, implies that the supremum in (7) is finite.
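For the last claim, the standard one-line computation runs as follows (a sketch; $x_{lk}$ denotes an arbitrary point of the support $I_{lk}$ of the Haar wavelet $\psi_{lk}$):

```latex
% For f in C^alpha[0,1] with 0 < alpha <= 1 and the Haar basis: psi_{lk}
% has zero mean, support I_{lk} of length 2^{-l}, and |psi_{lk}| = 2^{l/2}
% on its support, so that int |psi_{lk}| = 2^{-l/2}. Hence
\[
|\langle f,\psi_{lk}\rangle_2|
  = \Big|\int_{I_{lk}} \{f(x)-f(x_{lk})\}\,\psi_{lk}(x)\,dx\Big|
  \le \|f\|_{\mathcal{C}^\alpha}\, 2^{-l\alpha}\int_{I_{lk}}|\psi_{lk}|
  = \|f\|_{\mathcal{C}^\alpha}\, 2^{-l(1/2+\alpha)},
\]
% so sup_{l,k} 2^{l(1/2+alpha)} |<f,psi_{lk}>| <= ||f||_{C^alpha} < infty,
% which is precisely the finiteness of the supremum in (7).
```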

3 Main results

3.1 Gaussian white noise

Consider priors defined as coordinate-wise products of priors on the coordinates $f_{lk}$, specified by a fixed density $\varphi$ and scalings $\sigma_l$, as in Section 2.1. The next result allows for a much broader class of priors than the uniform one considered there.

Let $\varphi$ be a continuous density with respect to Lebesgue measure on $\mathbb{R}$. We assume that $\varphi$ is (strictly) positive on $\mathbb{R}$ and that it satisfies

(8)

Consider a scaling $\sigma_l$ for the prior equal to, for the constant appearing in (8),

(9)
Theorem 1

Let $X^{(n)}$ be observations from (4). Suppose $f_0$ belongs to $\mathcal{B}^\alpha_{\infty\infty}[0,1]$, for some $\alpha>0$. Let the prior $\Pi$ be a product prior defined through $\varphi$ and $\sigma_l$ satisfying (8), (9). Then there exists $M>0$ such that for $\varepsilon^*_{n,\alpha}$ defined by (2),

$E_{f_0}\Pi\big[f: \|f-f_0\|_\infty > M\varepsilon^*_{n,\alpha} \,\big|\, X^{(n)}\big] \to 0 \qquad \text{as } n\to\infty.$

Theorem 1 can be seen as a generalisation to nonconjugate priors of Theorem 1 in [16]. Possible choices for $\varphi$ cover several commonly used classes of prior distributions, such as so-called exponential power (EP) distributions, see, for example, Choy and Smith [9], Walker and Gutiérrez-Peña [40] and references therein, as well as some of the univariate Kotz-type distributions, see, for example, Nadarajah [25]. Other choices of prior distributions are possible, sometimes up to some adaptations. For instance, Proposition 1 provides a result in the case of a uniform distribution. If one allows for some extra logarithmic term in the rate, Laplace (double-exponential) distributions can be used, as well as distributions without the control from below on the tail in (8), provided the scaling is chosen appropriately, as can be checked by following the steps of the proof of Theorem 1. As a special case, the latter include all sub-Gaussian distributions. Also note that Theorem 1 as such applies to canonical priors, in that they do not depend on $n$. Results for truncated priors, which set $f_{lk}=0$ for $l$ above a threshold, can be obtained along the same lines, with slightly simpler proofs.

Further consequences of Theorem 1 include the minimaxity in sup-norm of several Bayesian estimators. The result for the posterior mean immediately follows from a convexity argument. One can also check that the posterior coordinate-wise median is minimax. Details are omitted.

3.2 Density estimation

Consider independent and identically distributed observations

(10) $\qquad X^{(n)} = (X_1,\ldots,X_n), \qquad X_i \stackrel{\mathrm{i.i.d.}}{\sim} f,$

with unknown density function $f$ on $[0,1]$. We use the same notation $X^{(n)}$ for observations as in the white noise model: it will always be clear from the context which model we are referring to. Let $\mathcal{F}$ be the set of densities on $[0,1]$ which are bounded away from $0$ and $\infty$. In other words, for any $f$ in $\mathcal{F}$ one can write $0<\rho\le f\le D<\infty$, with constants $\rho, D$ possibly depending on $f$. In the sequel, we assume that the “true” $f_0$ belongs to $\mathcal{F}\cap\mathcal{C}^\alpha[0,1]$, for some $\alpha>0$. The assumption that the density is bounded away from $0$ and $\infty$ is made for simplicity. Allowing the density to tend to $0$, for example at the boundary of $[0,1]$, would be an interesting extension, but would presumably induce technicalities not related to our point here. Let $h$ denote the Hellinger distance between densities on $[0,1]$.

Log-density priors

Define the prior $\Pi$ on densities as follows. Given a sufficiently smooth CDV-wavelet basis $\{\psi_{lk}\}$, consider the prior induced by, for any $x$ in $[0,1]$ and $L_n$ defined in (3),

(11) $\qquad T(x) = \sum_{l=0}^{L_n}\ \sum_{0\le k<2^l} \sigma_l\, \zeta_{lk}\,\psi_{lk}(x),$

(12) $\qquad f_T(x) = \frac{e^{T(x)}}{\int_0^1 e^{T(u)}\,du},$

where the $\zeta_{lk}$ are i.i.d. random variables of density $\varphi$ with respect to Lebesgue measure on $\mathbb{R}$ and the $\sigma_{lk}$ are positive reals which for simplicity we make depend on $l$ only, that is $\sigma_{lk}=\sigma_l$. We consider the choices $\varphi=\varphi_G$, the standard Gaussian density, and $\varphi=\varphi_H$, where $\varphi_H$ is any density such that its logarithm is Lipschitz on $\mathbb{R}$. We refer to the latter as the “log-Lipschitz case.” For instance, the $\zeta_{lk}$'s can be Laplace-distributed or have heavier tails, such as, for given parameters and a normalising constant,

(13)
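To fix ideas, here is a minimal sketch of one draw from a prior of the form (11)–(12), assuming, purely for illustration, a Haar basis, Gaussian $\zeta_{lk}$ and the hypothetical scaling $\sigma_l = 2^{-l(1/2+\alpha)}$ (the admissible scalings are those satisfying (14) below):

```python
# One draw from a log-density prior of the form (11)-(12): expand a
# random series T, then normalise f = e^T / int_0^1 e^T. Illustrative
# assumptions: Haar basis, Gaussian zeta, sigma_l = 2^{-l(1/2+alpha)}.
import numpy as np

def haar(l, k, x):
    """Haar wavelet psi_{lk} evaluated on a grid x in [0, 1)."""
    left, mid, right = k / 2 ** l, (k + 0.5) / 2 ** l, (k + 1) / 2 ** l
    return 2.0 ** (l / 2) * (((x >= left) & (x < mid)).astype(float)
                             - ((x >= mid) & (x < right)).astype(float))

rng = np.random.default_rng(2)
alpha, L_n = 1.0, 6
x = np.linspace(0.0, 1.0, 2 ** 10, endpoint=False)

T = np.zeros_like(x)                      # T = sum_l sum_k sigma_l zeta psi
for l in range(L_n + 1):
    sigma_l = 2.0 ** (-l * (0.5 + alpha))
    for k in range(2 ** l):
        T += sigma_l * rng.standard_normal() * haar(l, k, x)

f = np.exp(T) / np.mean(np.exp(T))        # grid approximation of (12)
print(f.min(), f.max(), f.mean())         # a density bounded away from 0
```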

Suppose the prior parameters $\sigma_l$ satisfy, for some constants,

(14)

Typically, see the examples below, such priors in (12), under $\varphi=\varphi_G$ or $\varphi=\varphi_H$ and (14), attain the rate $\varepsilon_{n,\alpha}$ in (1) in terms of Hellinger loss, up to logarithmic terms. For some rate $\varepsilon_n\to0$, suppose

(15)

If (15) holds for some $\varepsilon_n$, we denote the corresponding rate and cut-off accordingly, with $L_n$ as in (3).

Theorem 2

Consider observations from model (10). Suppose $f_0$ belongs to $\mathcal{F}\cap\mathcal{C}^\alpha[0,1]$, with $\alpha\ge1$. Let $\Pi$ be the prior on densities defined by (12), with $\varphi=\varphi_G$ or $\varphi=\varphi_H$. Suppose that the $\sigma_l$ satisfy (14) and that (15) holds. Then, for $\varepsilon^*_{n,\alpha}$ defined by (2) and any $M_n\to\infty$, it holds, as $n\to\infty$,

$E_{f_0}\Pi\big[f: \|f-f_0\|_\infty > M_n\varepsilon^*_{n,\alpha} \,\big|\, X^{(n)}\big] \to 0.$

In the case $\alpha<1$, the same holds with $\varepsilon^*_{n,\alpha}$ replaced by $\varepsilon^*_{n,\alpha}(\log n)^{\kappa}$, for some $\kappa>0$.

Theorem 2 implies that log-density priors, for many natural choices of prior on the coefficients, achieve the precise optimal minimax rate of estimation over Hölder spaces under sup-norm loss, as soon as the regularity $\alpha$ is at least $1$.

In the case $\alpha<1$, examination of the proof reveals that the presented techniques provide the sup-norm rate up to logarithmic terms. So, although the minimax rate is not exactly attained for those low regularities, the obtained rate improves on the intermediate rate obtained in [16] for slightly different priors. In the next subsection, a prior is proposed which attains the minimax rate for the sup-norm in the case $\alpha\le1$.

Let us give some examples of prior distributions satisfying the assumptions of Theorem 2. In the Gaussian case, natural choices of the sequence $\sigma_l$ satisfy both (14) and (15). In the log-Lipschitz case, the density in (13), combined with suitable $\sigma_l$, satisfies (14)–(15). Both claims follow from minor adaptations of Theorem 4.5 in [38] and Theorem 2.1 in [27], respectively; see Lemma 8. In both the Gaussian and log-Lipschitz cases, we in fact expect (15) to hold true for many other choices of $\sigma_l$ under (14), although such a general statement in Hellinger distance is not yet available in the literature, to the best of our knowledge.

Random dyadic histograms

Associated to the regular dyadic partition of $[0,1]$ at level $L$, given by $I_0^L=[0,2^{-L}]$ and $I_k^L=(k2^{-L},(k+1)2^{-L}]$ for $1\le k\le 2^L-1$, is a natural notion of histogram,

$\mathcal{H}_L = \Big\{h=\sum_{k=0}^{2^L-1} h_k\mathbf{1}_{I_k^L},\ (h_0,\ldots,h_{2^L-1})\in\mathbb{R}^{2^L}\Big\},$

the set of all histograms with $2^L$ regular bins on $[0,1]$. Let $\mathcal{S}_L$ denote the unit simplex in $\mathbb{R}^{2^L}$. Further denote

$\mathcal{H}_L^1 = \Big\{h=\sum_{k=0}^{2^L-1} 2^L\omega_k\mathbf{1}_{I_k^L},\ \omega=(\omega_0,\ldots,\omega_{2^L-1})\in\mathcal{S}_L\Big\}.$

The set $\mathcal{H}_L^1$ is the subset of $\mathcal{H}_L$ consisting of histograms which are densities on $[0,1]$. Let $\mathcal{H}^1=\bigcup_{L\ge0}\mathcal{H}_L^1$ be the set of all histograms which are densities on $[0,1]$.

A simple way to specify a prior on $\mathcal{H}^1$ is to set $L$ deterministic and to fix a distribution for $\omega$. Set $L=L_n$ as defined in (3). Choose some fixed constants and let

(16)

for any admissible index $k$, where $\mathcal{D}$ denotes the Dirichlet distribution on the simplex $\mathcal{S}_{L_n}$. Unlike what the notation suggests, the coefficients of the Dirichlet distribution are allowed to depend on $n$ through $L_n$.
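For concreteness, here is a minimal sketch of one draw from such a random dyadic histogram prior (the uniform Dirichlet parameters and the numerical values are illustrative assumptions):

```python
# One draw from the random dyadic histogram prior: sample probabilities
# omega ~ Dirichlet, then form the density f = sum_k omega_k 2^{L_n} 1_{I_k}
# on the regular dyadic partition with 2^{L_n} bins.
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 10_000, 0.5
L_n = int(np.floor(np.log2((n / np.log(n)) ** (1 / (2 * alpha + 1)))))
m = 2 ** L_n                    # number of bins at the cut-off level

a = np.ones(m)                  # uniform choice of Dirichlet parameters
omega = rng.dirichlet(a)        # random probabilities of the m bins
heights = m * omega             # histogram heights; the density has mass 1

x = rng.uniform(size=5)         # evaluate the sampled density at points
print(heights[np.minimum((x * m).astype(int), m - 1)])
```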

Theorem 3

Let $\alpha\in(0,1]$ and suppose $f_0$ belongs to $\mathcal{F}\cap\mathcal{C}^\alpha[0,1]$. Let $\Pi$ be the prior on $\mathcal{H}^1_{L_n}$ defined by (16). Then, for $\varepsilon^*_{n,\alpha}$ defined by (2) and any $M_n\to\infty$, it holds, as $n\to\infty$,

$E_{f_0}\Pi\big[f: \|f-f_0\|_\infty > M_n\varepsilon^*_{n,\alpha} \,\big|\, X^{(n)}\big] \to 0.$

According to Theorem 3, random dyadic histograms achieve the precise minimax rate in sup-norm over Hölder balls. Condition (16) is quite mild. For instance, the uniform choice of Dirichlet parameters is allowed, as well as a variety of others; for instance, one can take the parameters to originate from a measure $\bar\alpha$ on the interval $[0,1]$, of finite total mass, by which we mean that the parameter attached to a given bin is the $\bar\alpha$-measure of that bin. If $\bar\alpha$ has, say, a fixed continuous and positive density with respect to Lebesgue measure on $[0,1]$, then (16) is satisfied as soon as the total mass is suitably controlled.

Further examples

A referee of the paper, whom we thank for the suggestion, has asked whether the proposed technique would work for other priors, more specifically for non-$n$-dependent priors in density estimation. Although not considered here for lack of space, we would like to mention the important class of Pólya tree priors; see, for example, Lavine [22]. For well-chosen parameters, it can be shown that these priors achieve supremum-norm consistency in density estimation (consistency in the weaker Hellinger sense was studied, e.g., in [2]) and minimax rates of convergence in the sup-norm can be obtained. In particular, this class contains canonical (i.e., non-$n$-dependent) priors that achieve such optimal rates in density estimation. This will be studied elsewhere.

3.3 Discussion

We have introduced new tools which allow one to obtain optimal minimax rates of contraction in strong distances for posterior distributions. The essence of the technique is to view the problem semiparametrically as the uniform study of a collection of semiparametric Bayes concentration results, very much in the spirit of the nonparametric Bernstein–von Mises results studied in [6]. For the sake of clarity, we refrain from carrying out further extensions in the present paper, but briefly mention a few applications. From the sup-norm rates, optimal results, up to logarithmic terms, in $L^p$-metrics can be immediately obtained by interpolation. Adaptation to the unknown $\alpha$ could also be considered. This will be the object of future work. However, note that “fixed $\alpha$” nonparametric results as such are already very desirable in strong norms. They can, for instance, be used in the study of remainder terms of semiparametric functional expansions or of LAN-expansions, for example, to check the conditions of application of semiparametric Bernstein–von Mises theorems as in [4]. In this semiparametric perspective, adaptation to $\alpha$ is in fact not always desirable, since posteriors for functionals may behave pathologically when an adaptive prior on the nuisance is chosen; see [27] and [8], where it is shown that too large discrepancies in smoothness between the semiparametric functional and the unknown function can lead to undesirable bias. Also, we expect the present methodology to give results in a broad variety of statistical models and/or for different classes of priors. Indeed, it reduces the problem of the strong-distance rate to two parts: (1) the uniform semiparametric study of functionals and (2) the high-frequency bias. The first part is very much related to obtaining (uniform) semiparametric Bernstein–von Mises (BvM) results. So, any advance in BvM theory for classes of priors will automatically lead to advances in (1). As for (2), the studied examples suggest that for frequencies above the cut-off the posterior behaves essentially as the prior itself. So, contrary to the BvM-regime (1), where the prior washes out asymptotically, one does not expect a universal behaviour for this part. However, showing that the posterior is close to the prior provides a possible method of proof.

4 Proofs

4.1 Gaussian white noise

Lemma 1

Let $X^{(n)}$ follow model (4). Let $f_0$ satisfy (5) and let the prior be chosen according to (6). There exists a constant $C_0>0$ such that for any real $\lambda$, any $l\le L_n$ and $0\le k<2^l$, with $L_n$ defined in (3),

$E^{\Pi}\big[e^{\lambda\sqrt{n}(f_{lk}-\hat f_{lk})} \,\big|\, X^{(n)}\big] \le C_0 e^{\lambda^2/2}.$

Proof.

The proof is similar to the first lines of the proof of Theorem 5 in [6]: one uses Bayes' formula to express the posterior expectation in the lemma. Next, using (3) and (5), one checks that the relevant ratio is suitably bounded, for any $l\le L_n$ and $k$. For such $(l,k)$'s, since $\varphi_{lk}$ is the uniform density on its support, the expression involving $\varphi_{lk}$ in the next line is constant, and thus can be removed from the expression, leading to

Since the $\varepsilon_{lk}$ are standard normal, simple calculations show that the expectation of the inverse of the quantity under brackets is bounded by a universal constant, as in [6], pages 2015–2016.

Proof of Theorem 1. Small $l$. Let us first consider indexes $(l,k)$ with $l\le L_n$. For any real $\lambda$, consider the analogue of the quantity in Lemma 1. Using the fact that $\varphi$ is bounded,

Introduce the set, for any possibly $n$-dependent sequence,

(17)

Choose the sequence such that, with our choices of $\sigma_l$ and taking $n$ large enough, the set contains the relevant interval. First restricting the integral in the denominator to this set, and next using the tail condition on $\varphi$ and the fact that $\varphi$ is bounded below on it, one gets

The maximal inequality argument from Section 2.1 then directly yields the desired bound.

Large $l$. Let us now consider the case $l>L_n$. For any real $\lambda$, set,

To bound the denominator, first restrict the integral to the set as defined in (17). Set

next apply Jensen's inequality with the logarithm function to get, with $D$ the diameter of the set and $c$ some constant,

where we have used that in (17). Below we shall also use that

To bound the numerator from above, split the integrating set into the set from (17) and its complement, and write the integrals over each respective set separately. Using the previous bound, the definition of the set in (17), and Fubini's theorem,