Sharp sup-norm Bayesian curve estimation
Sup-norm curve estimation is a fundamental statistical problem and, in principle, a premise for the construction of confidence bands for infinite-dimensional parameters. In a Bayesian framework, Giné and Nickl (2011) proposed an approach to sup-norm concentration of the posterior measure that involves solving a testing problem by exploiting concentration properties of kernel and projection-type density estimators around their expectations. Whether this approach can yield minimax-optimal rates beyond conjugate-prior settings is herein settled in the affirmative: sharp rates are obtained for common prior-model pairs such as random histograms and Dirichlet Gaussian or Laplace mixtures, which can be employed for density, regression or quantile estimation.
Keywords: McDiarmid’s inequality, Nonparametric hypothesis testing, Posterior distributions, Sup-norm rates
The study of the frequentist asymptotic behaviour of Bayesian nonparametric (BNP) procedures initially focused on the Hellinger or $L^1$-distance loss, see Shen and Wasserman (2001) and Ghosal et al. (2000), but an extension and generalization of the results to $L^r$-distance losses, $1 \le r \le \infty$, has been the object of two recent contributions by Giné and Nickl (2011) and Castillo (2014). Sup-norm estimation has attracted particular attention because it may constitute the premise for the construction of confidence bands whose geometric structure can be easily visualized and interpreted. Furthermore, as shown in the example of Section 3.2, the study of sup-norm posterior contraction rates for density estimation can be motivated as an intermediate step towards the final assessment of convergence rates for quantile estimation.
While the contribution of Castillo (2014) has a more prior-model specific flavour, the article by Giné and Nickl (2011) aims at a unified understanding of the drivers of the asymptotic behaviour of BNP procedures. It develops a new approach to the testing problem involved, constructing nonparametric tests with good exponential bounds on the type-one and type-two error probabilities; these bounds rely on concentration properties of kernel and projection-type density estimators around their expectations.
Although the approach of Giné and Nickl (2011) is useful only when a fine control of the approximation properties of the prior support is possible, it has the merit of replacing the entropy condition for sieve sets with approximation conditions. However, the result, as presented in their Theorem 2 (Theorem 3), can only retrieve minimax-optimal rates for -losses when , while rates deteriorate by a genuine power of , in fact , for . Thus, the question remains open whether their approach can give the right rates for for non-conjugate priors, with the sub-optimal rates possibly being only an artifact of the proof. We herein settle this issue in the affirmative by refining their result and proof and by showing in concrete examples that this approach retrieves the right rates.
2 Main result
In this section, we describe the set-up and present the main contribution of this note. Let be a collection of probability measures on a measurable space that possess densities with respect to some σ-finite dominating measure . Let be a sequence of priors on , where is a σ-field on for which the maps are jointly measurable relative to . Let be i.i.d. (independent, identically distributed) observations from a common law with density on with respect to , . For a probability measure on and an -measurable function , , let denote the integral , where, unless otherwise specified, the set of integration is understood to be the whole domain. When this notation is applied to the empirical measure associated with a sample , namely the discrete uniform measure on the sample values, this yields . For each , let be a kernel or projection-type density estimator based on at resolution level , with as in Definition 1 below. Its expectation is then equal to , where we have used the notation . To refine the result of Giné and Nickl (2011), we use concentration properties of around its expectation, obtained by applying McDiarmid’s inequality for functions with bounded differences.
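For reference, McDiarmid's bounded-differences inequality and the way it can be applied to the sup-norm deviation of a kernel-type estimator are sketched below. The symbols $g$, $c_i$, $K_j$, $\hat f_n$ and $Z$ are assumed here for illustration and need not coincide with the notation of this note; the computation assumes a kernel with $\sup_{x,y}|K_j(x,y)| \le 2^j \|K\|_\infty$.

```latex
% McDiarmid's inequality: let X_1, ..., X_n be independent and let g satisfy
% the bounded-differences condition with constants c_1, ..., c_n, i.e.
\[
  \sup_{x_1,\dots,x_n,\,x_i'}
  \bigl| g(x_1,\dots,x_i,\dots,x_n) - g(x_1,\dots,x_i',\dots,x_n) \bigr| \le c_i ,
  \qquad i = 1,\dots,n .
\]
% Then, for every t > 0,
\[
  P\bigl( g(X_1,\dots,X_n) - E\,g(X_1,\dots,X_n) \ge t \bigr)
  \le \exp\Bigl( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \Bigr) .
\]
% Illustration (assumed notation): for
%   \hat f_n(x) = n^{-1} \sum_{i=1}^n K_j(x, X_i),
%   Z = \sup_x | \hat f_n(x) - E \hat f_n(x) |,
% replacing one observation changes Z by at most
\[
  c_i = \frac{2 \cdot 2^j \|K\|_\infty}{n},
  \qquad\text{so that}\qquad
  \sum_{i=1}^n c_i^2 = \frac{4 \cdot 2^{2j} \|K\|_\infty^2}{n},
\]
% and therefore
\[
  P\bigl( Z - E Z \ge t \bigr)
  \le \exp\Bigl( -\frac{n t^2}{2 \cdot 2^{2j} \|K\|_\infty^2} \Bigr) .
\]
```

The bound controls the fluctuation of $Z$ around its mean only; the mean itself has to be bounded separately.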
The following definition, which corresponds to Condition 5.1.1 in Giné and Nickl (2015), is essential for the main result.
Definition 1. Let , or . The sequence of operators
is called an admissible approximating sequence if it satisfies one of the following conditions:
a) convolution kernel case, : , integrates to $1$ and is of bounded -variation for some finite and right (left)-continuous;
b) multi-resolution projection case, : , with as above or , where define an -regular wavelet basis, have bounded -variation for some and are uniformly continuous, or define the Haar basis, see Chapter 4, ibidem;
c) multi-resolution case, : is the projection kernel at resolution of a Cohen-Daubechies-Vial (CDV) wavelet basis, see Chapter 4, ibidem;
d) multi-resolution case, : is the projection kernel at resolution of the periodization of a scaling function satisfying b), see (4.126) and (4.127), ibidem.
A useful property of -regular wavelet bases is the following: there exists a non-negative measurable function such that for all , that is, is dominated by a bounded and integrable convolution kernel .
In order to state the main result, we recall that a sequence of positive real numbers is slowly varying at if, for each , it holds that . Also, for , let be the space of -integrable functions, , equipped with the norm .
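The notion of slow variation can be checked numerically. The sketch below assumes the standard definition (the ratio of the sequence at index floor(lambda * n) to its value at n tends to 1 for each fixed lambda > 0); the function names `ratio`, `log_seq` and `power_seq` are hypothetical and chosen only for this illustration. It contrasts a logarithmic sequence, which is slowly varying, with a power sequence, which is not.

```python
import math


def ratio(seq, n, lam):
    """Return seq(floor(lam * n)) / seq(n); it tends to 1 when seq is slowly varying."""
    return seq(math.floor(lam * n)) / seq(n)


def log_seq(n):
    """Logarithmic sequence: slowly varying at infinity."""
    return math.log(n)


def power_seq(n):
    """Power sequence n**0.1: regularly, but not slowly, varying."""
    return n ** 0.1


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**12):
        # The first ratio drifts towards 1, the second stays near 2**0.1.
        print(n, ratio(log_seq, n, 2.0), ratio(power_seq, n, 2.0))
```

Logarithmic factors of this kind are what "up to logarithmic factors" refers to when rates are compared.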
Theorem 1. Let and be sequences of positive real numbers such that , and . For each and a slowly varying sequence , let . Suppose that, for as in Definition 1, with , and that integrate for some in cases a) and b),
and, for a constant , sets , where only depends on , we have
Then, for sufficiently large ,
If the convergence in (2) holds for , then, for each . , where .
The assertion, whose proof is reported in Appendix A, is an in-probability statement that the posterior mass outside a sup-norm ball of radius a large multiple of is negligible. The theorem provides the same sufficient conditions for deriving sup-norm posterior contraction rates that are minimax-optimal, up to logarithmic factors, as in Giné and Nickl (2011). Condition (ii), which is borrowed from Ghosal et al. (2000), is the essential one: the prior concentration rate is the only determinant of the posterior contraction rate at densities having sup-norm approximation error of the same order against a kernel-type approximant, provided the prior support is almost the set of densities with the same approximation property.
In this section, we apply Theorem 1 to some prior-model pairs used for (conditional) density or regression estimation, including random histograms and Dirichlet Gaussian or Laplace mixtures, which have been selected to reflect cases for which the issue of obtaining sup-norm posterior rates was still open. We do not consider Gaussian priors or wavelet series because these examples have been successfully worked out in Castillo (2014) by a different approach. We furthermore exhibit an example illustrating that obtaining sup-norm posterior contraction rates for density estimation can be motivated as an intermediate step towards the final assessment of convergence rates for estimating single quantiles.
3.1 Density estimation
Example 1 (Random dyadic histograms).
For , consider a partition of into intervals (bins) of equal length and , . Let denote the Dirichlet distribution on the -dimensional unit simplex with all parameters equal to . Consider the random histogram Denote by the induced law on the space of probability measures with Lebesgue density on . Let be i.i.d. observations from a density on . Then, the Bayes’ density estimator, that is the posterior expected histogram, has expression where identifies the bin containing , i.e., , and stands for the number of observations falling into . Let denote the class of Hölder continuous functions on with exponent . Let be the minimax rate of convergence over .
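The posterior expected histogram can be computed in closed form thanks to Dirichlet-multinomial conjugacy. The sketch below is a minimal illustration under stated assumptions: the data lie in [0, 1], the partition has 2**j bins of equal length, and the Dirichlet prior has a common parameter `alpha` (set to 1.0 by default as an assumption, since the value is not recoverable from the text above). The function names are hypothetical.

```python
def posterior_mean_histogram(sample, j, alpha=1.0):
    """Bin heights of the posterior mean histogram on [0, 1].

    Assumes 2**j equal-length bins and a Dirichlet(alpha, ..., alpha) prior
    on the bin probabilities (alpha is an assumed, hypothetical default).
    """
    m = 2 ** j                      # number of bins
    n = len(sample)
    counts = [0] * m
    for x in sample:
        k = min(int(x * m), m - 1)  # index of the bin containing x
        counts[k] += 1
    # Posterior mean of each bin probability is (count + alpha) / (n + m * alpha);
    # dividing by the bin length 1 / m gives the density height.
    return [m * (c + alpha) / (n + m * alpha) for c in counts]


def evaluate(heights, x):
    """Evaluate the histogram density with the given bin heights at x in [0, 1]."""
    m = len(heights)
    return heights[min(int(x * m), m - 1)]


if __name__ == "__main__":
    heights = posterior_mean_histogram([0.1, 0.2, 0.9], j=1)
    print(heights, evaluate(heights, 0.75))
```

Each height is a convex combination of the empirical bin frequency and the uniform prior mean, so the estimator integrates to one by construction.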
Proposition 1. Let be i.i.d. observations from a density , with , satisfying on . Let be such that . Then, for sufficiently large , . Consequently, .
Proof. The first part of the assertion, which concerns posterior contraction rates, immediately follows from Theorem 1 combined with the proof of Proposition 3 of Giné and Nickl (2011), whose result, together with that of Theorem 3 in Castillo (2014), is herein improved to the minimax-optimal rate for every . The second part of the assertion, which concerns convergence rates for the histogram density estimator, is a consequence of Jensen’s inequality and convexity of , combined with the fact that the prior is supported on densities uniformly bounded above by and that the proof of Theorem 1 yields the exponential order for the convergence of the posterior probability of the complement of an -ball around , in symbols, whence . ∎
Example 2 (Dirichlet-Laplace mixtures).
Consider, as in Scricciolo (2011) and Gao and van der Vaart (2015), a Laplace mixture prior
thus defined. For , ,
the density of a Laplace distribution, let
denote a mixture of Laplace densities with mixing distribution ,
, the Dirichlet process with base measure , for and a probability measure on .
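A draw from a Dirichlet process mixture of Laplace densities can be approximated by truncated stick-breaking. The sketch below is only an illustration under assumptions not stated in the text: the Laplace kernel is taken with unit scale, exp(-|x|)/2, mixed over its location; the stick-breaking series is truncated at a fixed number of atoms; and the names `draw_dp_weights_atoms`, `mixture_density`, `total_mass` and `base_sampler` are hypothetical.

```python
import math
import random


def laplace_density(x, location):
    """Unit-scale Laplace density centred at `location` (scale 1 is an assumption)."""
    return 0.5 * math.exp(-abs(x - location))


def draw_dp_weights_atoms(total_mass, base_sampler, truncation, rng):
    """Truncated stick-breaking draw from a Dirichlet process.

    total_mass   : positive mass of the base measure
    base_sampler : function rng -> atom, sampling from the normalised base measure
    truncation   : number of atoms kept (an approximation, not an exact draw)
    """
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, total_mass)  # stick-breaking proportion
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - v
    weights[-1] += remaining  # put the leftover mass on the last atom
    return weights, atoms


def mixture_density(x, weights, atoms):
    """Evaluate the location mixture of Laplace densities at x."""
    return sum(w * laplace_density(x, a) for w, a in zip(weights, atoms))


if __name__ == "__main__":
    rng = random.Random(0)
    w, a = draw_dp_weights_atoms(2.0, lambda r: r.uniform(-1.0, 1.0), 50, rng)
    print(sum(w), mixture_density(0.0, w, a))
```

Under the unit-scale assumption every such mixture is bounded above by 1/2, whatever the mixing distribution.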
Proposition 2. Let be i.i.d. observations from a density , with supported on a compact interval . If has support on with continuous Lebesgue density bounded away from $0$ and $\infty$, then, for sufficiently large , . Consequently, for the Bayes’ estimator we have .
Proof. It is known from Proposition 4 in Gao and van der Vaart (2015) that the small-ball probability estimate in condition (ii) of Theorem 1 is satisfied for . For the bias condition, we take to be the support of and show that, for and any symmetric density with finite second moment, we have uniformly over the support of . Indeed, by applying Lemma 1 with , for each we obtain , which implies that both conditions (1) and (i) are satisfied. The assertion on the Bayes’ estimator follows from the same arguments laid out for random histograms together with the fact that uniformly in . ∎
Example 3 (Dirichlet-Gaussian mixtures).
Consider, as in Ghosal and van der Vaart (2001, 2007), Shen et al. (2013) and Scricciolo (2014), a Gaussian mixture prior
thus defined. For the standard normal density, let
denote a mixture of Gaussian densities with mixing distribution ,
, the Dirichlet process with base measure , for and a probability measure on , which has continuous and positive density as , for some constants and ,
which has continuous and positive density on such that, for constants , , for all in a neighborhood of . Let denote the class of Hölder continuous functions on with exponent . Let be the minimax rate of convergence over . For any real , let stand for the largest integer strictly smaller than .
Proposition 3. Let be i.i.d. observations from a density such that condition is satisfied for . Then, for sufficiently large , .
Proof. Let be a convolution kernel such that , , and , the Fourier transform has . Let . For every , , where the constant does not depend on . Thus, . For the bias condition, let , with and a suitable constant . For every and uniformly in , because as . Now, because , which implies that the remaining mass condition is satisfied. ∎
3.2 Quantile estimation
For , consider the problem of estimating the -quantile of the population distribution function from observations . For any (possibly unbounded) interval and function on , define the Hölder norm as
Let denote the space of continuous and bounded functions on and ,
Proposition 4. Suppose that, given , there are constants so that and
Consider a prior concentrated on probability measures having densities . If, for sufficiently large , the posterior probability , then, there exists so that .
Proof. We preliminarily make the following remark. Let , . For , let be the -quantile of . By Lagrange’s mean value theorem, there exists a point between and so that . Consequently,
If , then
In order to upper bound , by appealing to relationship (4), we can separately control and . Let the kernel function be such that
, , and ,
its Fourier transform has .
By Lemma 5.2 in Dattner et al. (2013),
with . Write
By inequality (5), we have . By the same reasoning, . We now consider . Taking into account that and
for some point between and (clearly, depends on ),
Then, , which implies that . It follows that . Taking into account that , and , choosing , we have . If then, under condition (3), for every . In fact, for any interval that includes the point so that it also includes the intermediate point between and , for any we have for every . It follows that . The proof is concluded by noting that, by virtue of (4), . The assertion then follows. ∎
Proposition 4 considers local Hölder regularity of , which seems natural for estimating single quantiles. Clearly, requirements on are automatically satisfied if is globally Hölder regular and, in this case, the minimax-optimal sup-norm rate is so that the rate for estimating single quantiles is . The conditions on the random density are automatically satisfied if the prior is concentrated on probability measures possessing globally Hölder regular densities.
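The mean-value argument used in the proof, which links the distance between quantiles to the sup-norm distance between distribution functions, can be illustrated numerically. The sketch below rests on assumptions made only for this example: the two laws are a standard normal and a normal shifted by `shift`, the mean value theorem is applied to the shifted distribution function, and the sup-norm distance is approximated on a finite grid. All function names are hypothetical.

```python
import math


def normal_cdf(x, mean=0.0):
    """Distribution function of a unit-variance normal law with the given mean."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))


def normal_pdf(x, mean=0.0):
    """Density of a unit-variance normal law with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def quantile(cdf, tau, lo=-20.0, hi=20.0, iters=200):
    """tau-quantile of a continuous, strictly increasing cdf, found by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def quantile_gap_and_bound(tau, shift, grid_size=2001):
    """Return (|q - q0|, sup|F - F0| / min f), with the minimum taken between q0 and q."""
    F0 = lambda x: normal_cdf(x)
    F = lambda x: normal_cdf(x, shift)
    f = lambda x: normal_pdf(x, shift)
    q0, q = quantile(F0, tau), quantile(F, tau)
    # Grid approximation of the sup-norm distance between F and F0.
    grid = [-10.0 + 20.0 * i / (grid_size - 1) for i in range(grid_size)]
    sup_dist = max(abs(F(x) - F0(x)) for x in grid)
    # Lower bound for the density at the intermediate point between q0 and q.
    a, b = min(q0, q), max(q0, q)
    min_f = min(f(a + (b - a) * i / 100) for i in range(101))
    return abs(q - q0), sup_dist / min_f


if __name__ == "__main__":
    gap, bound = quantile_gap_and_bound(tau=0.5, shift=0.1)
    print(gap, bound)  # the gap does not exceed the bound
```

The bound degrades as the density near the quantile gets smaller, which is consistent with requiring the density to be bounded away from zero around the quantile of interest.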
Appendix A Proof of Theorem 1
Using the remaining mass condition (i) and the small-ball probability estimate (ii), by the proof of Theorem 2.1 in Ghosal et al. (2000), it is enough to construct, for each , a test for the hypothesis
with large enough, where is the indicator function of the rejection region of , such that
where , the constant being that appearing in (i) and (ii). By assumption (1), there exists a constant such that . Define . For a constant , define the event and the test . For
, the triangle inequality