
# Bayesian optimal adaptive estimation using a sieve prior

Julyan Arbel1,2, Ghislaine Gayraud1,3 and Judith Rousseau1,4
1Laboratoire de Statistique, CREST, France
2INSEE, France
3LMAC, Université de Technologie de Compiègne, France
4Université Paris Dauphine, France
###### Abstract

We derive rates of contraction of posterior distributions on nonparametric models resulting from sieve priors. The aim of the paper is to provide general conditions for obtaining posterior rates when the parameter space has a general structure, and rate adaptation when the parameter space is, e.g., a Sobolev class. The conditions employed, although standard in the literature, are combined in a different way. The results are applied to density, regression, nonlinear autoregression and Gaussian white noise models. In the latter we have also considered a loss function which is different from the usual $\ell_2$ norm, namely the pointwise loss. In this case it is possible to prove that the adaptive Bayesian approach is strongly suboptimal under this loss, and we provide a lower bound on the rate.

Keywords adaptation, minimax criteria, nonparametric models, rate of contraction, sieve prior, white noise model.

## 1 Introduction

The asymptotic behaviour of posterior distributions in nonparametric models has received growing consideration in the literature over the last ten years. Many different models have been considered, ranging from the problem of density estimation in i.i.d. models (Barron et al., 1999; Ghosal et al., 2000), to sophisticated dependent models (Rousseau et al., 2012). For these models, different families of priors have also been considered, where the most common are Dirichlet process mixtures (or related priors), Gaussian processes (van der Vaart and van Zanten, 2008), or series expansions on a basis (such as wavelets, see Abramovich et al., 1998).

In this paper we focus on a family of priors called sieve priors, introduced as compound priors and discussed by Zhao (1993, 2000), and further studied by Shen and Wasserman (2001). It is defined for models $\left(P_\theta^{(n)},\ \theta\in\Theta\right)$, $n\ge1$, where $\Theta\subseteq\mathbb{R}^{\mathbb{N}}$, the set of sequences. Let $\mathcal{A}$ be a $\sigma$-field associated to $\Theta$. The observations are denoted $X^n$, where the asymptotics are driven by $n$. The probability measures $P_\theta^{(n)}$ are dominated by some reference measure $\mu$, with density $p_\theta^{(n)}$. Remark that such an infinite-dimensional parameter $\theta$ can often characterize a functional parameter, or a curve, $f=f_\theta$. For instance, in regression, density or spectral density models, $f_\theta$ represents a regression function, a log density or a log spectral density respectively, and $\theta$ represents its coordinates in an appropriate basis (e.g., a Fourier, a wavelet, a log spline, or an orthonormal basis in general). In this paper we study frequentist properties of the posterior distributions as $n$ tends to infinity, assuming that data are generated by a measure $P_{\theta_0}^{(n)}$, $\theta_0\in\Theta$. We study in particular rates of contraction of the posterior distribution and rates of convergence of the risk.

A sieve prior is expressed as

$$\theta\sim\Pi(\cdot)=\sum_{k=1}^{\infty}\pi(k)\,\Pi_k(\cdot), \qquad (1)$$

where $\pi(k)\ge0$, $\sum_{k=1}^{\infty}\pi(k)=1$, and the $\Pi_k$'s are prior distributions on so-called sieve spaces $\Theta_k=\mathbb{R}^k$. Set $\theta_k=(\theta_1,\dots,\theta_k)$, the finite-dimensional vector of the first $k$ entries of $\theta$. Essentially, the whole prior is seen as a hierarchical prior, see Figure 1. The hierarchical parameter $k$, called the threshold parameter, has prior $\pi$. Conditionally on $k$, the prior on $\theta$ is $\Pi_k$, which is supposed to have mass only on $\Theta_k$ (this amounts to saying that the priors on the remaining entries $\theta_j$, $j>k$, are point masses at 0). We assume that $\Pi_k$ is an independent prior on the coordinates $\theta_j$, $1\le j\le k$, of $\theta_k$, with a unique probability density $g$ once rescaled by positive scales $\tau_j$. Using the same notation for probability and density with respect to Lebesgue measure, we have

$$\forall\,\theta_k\in\Theta_k,\quad \Pi_k(\theta_k)=\prod_{j=1}^{k}\frac{1}{\tau_j}\,g\!\left(\frac{\theta_j}{\tau_j}\right). \qquad (2)$$

Note that the quantities $\pi$ and $\tau_j$ could depend on $n$. Although not purely Bayesian, data dependent priors are quite common in the literature. For instance, Ghosal and van der Vaart (2007) use a similar prior with a deterministic cutoff in their Application 7.6.
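To fix ideas, a draw from a sieve prior of the form (1)-(2) can be simulated in two stages: first the threshold parameter $k$ from $\pi$, then the $k$ rescaled coordinates from $g$. The sketch below is illustrative only: the Poisson choice for $\pi$, the standard Gaussian $g$ and the scales $\tau_j=j^{-q}$ are assumptions for this example (a Poisson-Gaussian pair of this kind is used later in Equation (14)).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sieve_prior(lam=2.0, q=1.0):
    """One draw from the hierarchical sieve prior (1)-(2).

    Illustrative choices: pi = Poisson(lam) shifted to k >= 1,
    g = standard Gaussian, scales tau_j = j^{-q}."""
    k = rng.poisson(lam) + 1                 # threshold parameter k ~ pi, k >= 1
    tau = np.arange(1, k + 1) ** (-q)        # scales tau_j
    theta_k = tau * rng.standard_normal(k)   # theta_j / tau_j ~ g = N(0, 1)
    return theta_k                           # entries beyond k are implicitly 0

draws = [sample_sieve_prior() for _ in range(1000)]
mean_dim = np.mean([len(d) for d in draws])  # close to lam + 1 = 3
```

The hierarchical structure is explicit here: the random dimension $k$ does the model selection, and the remaining coordinates are point masses at 0.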

We will also consider the case where the prior is truncated to an $\ell_1$ ball of radius $r_1$ (see the nonlinear AR(1) model application in Section 2.3.3):

$$\forall\,\theta_k\in\Theta_k,\quad \Pi_k(\theta_k)\propto\prod_{j=1}^{k}\frac{1}{\tau_j}\,g\!\left(\frac{\theta_j}{\tau_j}\right)\mathbb{1}\!\left(\sum_{j=1}^{k}|\theta_j|\le r_1\right). \qquad (3)$$

The posterior distribution is defined by, for all measurable sets $B$ of $\Theta$,

$$\Pi\left(B\,\big|\,X^n\right)=\frac{\int_B p_\theta^{(n)}(X^n)\,d\Pi(\theta)}{\int_\Theta p_\theta^{(n)}(X^n)\,d\Pi(\theta)}. \qquad (4)$$

Given the sieve prior $\Pi$, we study the rate of contraction of the posterior distribution in probability with respect to a semimetric $d_n$ on $\Theta$. This rate is defined as the best possible (i.e. the smallest) sequence $\epsilon_n$ such that

$$\Pi\left(\theta:\ d_n^2(\theta,\theta_0)\ge M\epsilon_n^2\,\big|\,X^n\right)\xrightarrow[n\to\infty]{}0,$$

in $P_{\theta_0}^{(n)}$-probability, for a positive constant $M$ which can be chosen as large as needed. We also derive convergence rates for the posterior loss $\Pi\left(d_n^2(\theta,\theta_0)\,\big|\,X^n\right)$ in probability.

The posterior concentration rate is optimal when it coincides with the minimax rate of convergence, when $\theta_0$ belongs to a given functional class, associated to the same semimetric $d_n$. Typically these minimax rates of convergence are defined for functional classes indexed by a smoothness parameter: Sobolev, Hölder, or more generally Besov spaces.

The objective of this paper is to find mild generic assumptions on the sieve prior of the form (1), on the models and on $\theta_0$, such that the procedure adapts to the optimal rate in the minimax sense, both for the posterior distribution and for the risk. Results in the Bayesian nonparametrics literature about contraction rates are usually of two kinds. Firstly, general assumptions on priors and models allow one to derive rates, see for example Shen and Wasserman (2001); Ghosal et al. (2000); Ghosal and van der Vaart (2007). Secondly, other papers focus on a particular prior and obtain contraction rates in a particular model, see for instance Belitser and Ghosal (2003) in the white noise model, De Jonge and van Zanten (2010) in regression, and Scricciolo (2006) in density. The novelty of this paper is that our results hold for a family of priors (sieve priors) without a specific underlying model, and can be applied to different models.

An additional interesting property that is sought at the same time as convergence rates is adaptation. This means that, once a loss function (a semimetric $d_n$ on $\Theta$) and a collection of classes of different smoothnesses for the parameter are specified, one constructs a procedure which is independent of the smoothness, but which is rate optimal (under the given loss), within each class. Indeed, the optimal rate naturally depends on the smoothness of the parameter, and standard straightforward estimation techniques usually use it as an input. This is all the more important given that relatively few instances in the Bayesian literature are available in this area. That property is often obtained when the unknown smoothness is assumed to belong to a discrete set, see for example Belitser and Ghosal (2003). There exist some results in the context of density estimation by Huang (2004), Scricciolo (2006), Ghosal et al. (2008), van der Vaart and van Zanten (2009), Rivoirard and Rousseau (2012a), Rousseau (2010) and Kruijer et al. (2010), in regression by De Jonge and van Zanten (2010), and in spectral density estimation by Rousseau and Kruijer (2011). What enables adaptation in our results is the thresholding induced by the prior on $k$: the posterior distribution of the parameter $k$ concentrates around values that are the typical efficient size of models of the true smoothness.

As seen from our assumptions in Section 2.1 and from the general results (Theorem 1 and Corollary 1), adaptation is relatively straightforward under sieve priors defined by (1) when the semimetric $d_n$ is a global loss function which acts like the Kullback-Leibler divergence, the $\ell_2$ norm in the regression problem, or the Hellinger distance in the density problem. If the loss function (or the semimetric) acts differently, then the posterior distribution (or the risk) can be quite different (suboptimal). This is illustrated in Section 3.2 for the white noise model (16) when the loss is a local loss function, as in the case of the estimation of $f_\theta(x)$ at a given point $x$. This phenomenon has been encountered also by Rousseau and Kruijer (2011). It is not merely a Bayesian issue: Cai et al. (2007) show that an optimal estimator under global loss cannot be locally optimal at each point in the white noise model. The penalty between global and local rates is at least a $\log n$ term. Abramovich et al. (2004) and Abramovich et al. (2007a) obtain similar results with Bayesian wavelet estimators in the same model.

The paper is organized as follows. Section 2 first provides a general result on rates of contraction for the posterior distribution in the setting of sieve priors. We also derive a result in terms of posterior loss, and show that the rates are adaptive optimal for Sobolev smoothness classes. The section ends with applications to density, regression and nonlinear autoregression function estimation. In Section 3, we study more precisely the case of the white noise model, which is a benchmark model. We study in detail the difference between global and pointwise losses in this model, and provide a lower bound for the latter loss, showing that sieve priors lead to suboptimal contraction rates. Proofs are deferred to the Appendix.

### Notations

We use the following notations. Vectors are written in bold letters, for example $\boldsymbol{\theta}$ or $\boldsymbol{\theta}_0$, while light-face is used for their entries, like $\theta_j$ or $\theta_{0j}$. We denote by $\theta_k$ the projection of $\theta$ on its first $k$ coordinates, and by $p_\theta^{(n)}$ and $p_{\theta_k}^{(n)}$ the densities of the observations in the corresponding models. We denote by $d_n$ a semimetric, by $\|\cdot\|_2$ the $\ell_2$ norm (on vectors) or the $L_2$ norm (on curves $f$), and by $\|\cdot\|_{2,k}$ the $\ell_2$ norm restricted to the first $k$ coordinates of a parameter. Expectations $E_\theta^{(n)}$ and $E_0^{(n)}$ are defined with respect to $P_\theta^{(n)}$ and $P_{\theta_0}^{(n)}$ respectively. The same notation $\Pi$ is used for posterior probability or posterior expectation. The expected posterior risk and the frequentist risk relative to $d_n^2$ are both denoted $R_n^{d_n}$ (for an estimator $\hat\theta$ of $\theta$ in the frequentist case), where the mention of $d_n$ might be omitted (cf. Robert, 2007, Section 2.3). We denote by $\varphi$ the standard Gaussian probability density.

Let $K(p,q)$ denote the Kullback-Leibler divergence $\int p\log(p/q)\,d\mu$, and let $V_{m,0}(p,q)$ denote the centered moment $\int p\left|\log(p/q)-K(p,q)\right|^m d\mu$, with $m\ge2$.

Define two additional divergences $\tilde K$ and $\tilde V_{m,0}$, which are the analogous quantities with expectations taken with respect to $P_{\theta_0}^{(n)}$.

We denote by $C$ a generic constant whose value is of no importance, and we use $\lesssim$ for inequalities up to a multiplicative constant.

## 2 General case

In this section we give a general theorem which provides an upper bound on posterior contraction rates $\epsilon_n$. Throughout the section, we assume that the sequence of positive numbers $\epsilon_n$, or $\epsilon_n(\beta)$ when we point to a specific value of smoothness $\beta$, is such that $\epsilon_n\to0$ and $n\epsilon_n^2\to\infty$.

We introduce the following numbers

$$j_n=\left\lfloor j_0\,n\epsilon_n^2/\log n\right\rfloor,\qquad k_n=\left\lfloor M_0\,j_n\log n/L(n)\right\rfloor, \qquad (5)$$

for positive constants $j_0$ and $M_0$, where $L$ is a slowly varying function, introduced in Equation (6) below, such that $L(n)\le\log n$, hence $j_n\lesssim k_n$. We use $k_n$ to define the following approximation subsets of $\Theta$,

$$\Theta_{k_n}(Q)=\left\{\theta\in\Theta_{k_n}:\ \|\theta\|_{2,k_n}\le n^Q\right\},$$

for $Q>0$. Note that the prior $\Pi$ actually charges a union of spaces of dimension $k$, $k\ge1$, so that $\Theta_{k_n}(Q)$ can be seen as a union of spaces of dimension at most $k_n$. Lemma 2 provides an upper bound on the prior mass of the complement $\Theta_{k_n}(Q)^c$.

It has been shown (Ghosal et al., 2000; Ghosal and van der Vaart, 2007; Shen and Wasserman, 2001) that an efficient way to derive rates of contraction of posterior distributions is to bound from above the numerator of (4) using tests, and to bound from below its denominator using an approximation of $p_0^{(n)}$ based on a value close to $\theta_0$. The latter is done in Lemma 3, where we use $j_n$ to define $\theta_{0j_n}$, the finite component approximation of $\theta_0$, and we show that the prior mass of the following Kullback-Leibler neighbourhoods $B_n(m)$ of $\theta_0$, $m\ge2$, is lower bounded by an exponential term:

$$B_n(m)=\left\{\theta:\ K\!\left(p_0^{(n)},p_\theta^{(n)}\right)\le 2n\epsilon_n^2,\ V_{m,0}\!\left(p_0^{(n)},p_\theta^{(n)}\right)\le 2^{m+1}\left(n\epsilon_n^2\right)^{m/2}\right\}.$$

Define two neighbourhoods of $\theta_{0j_n}$ in the sieve space $\Theta_{j_n}$: $\tilde B_n(m)$, similar to $B_n(m)$ but using $\tilde K$ and $\tilde V_{m,0}$, and $A_n(H_1)$, an $\ell_2$ ball of radius $n^{-H_1}$, $H_1>0$:

$$\tilde B_n(m)=\left\{\theta\in\Theta_{j_n}:\ \tilde K\!\left(p_{0j_n}^{(n)},p_\theta^{(n)}\right)\le n\epsilon_n^2,\ \tilde V_{m,0}\!\left(p_{0j_n}^{(n)},p_\theta^{(n)}\right)\le\left(n\epsilon_n^2\right)^{m/2}\right\},$$
$$A_n(H_1)=\left\{\theta\in\Theta_{j_n}:\ \left\|\theta_{0j_n}-\theta\right\|_{2,j_n}\le n^{-H_1}\right\}.$$

### 2.1 Assumptions

The following technical assumptions are involved in the subsequent analysis, and are discussed at the end of this section. Recall that the true parameter is $\theta_0$, under which the observations $X^n$ have density $p_0^{(n)}$.

**Condition on $\theta_0$ and $j_n$.** For $n$ large enough and for some $m\ge2$,

$$K\!\left(p_0^{(n)},p_{0j_n}^{(n)}\right)\le n\epsilon_n^2 \quad\text{and}\quad V_{m,0}\!\left(p_0^{(n)},p_{0j_n}^{(n)}\right)\le\left(n\epsilon_n^2\right)^{m/2}.$$

**Comparison between norms.** The following inclusion holds in $\Theta_{j_n}$:

$$\exists H_1>0\ \text{s.t.}\ A_n(H_1)\subset\tilde B_n(m).$$

**Comparison between $d_n$ and the $\ell_2$ norm.** There exist three non-negative constants $D_0$, $D_1$ and $D_2$ such that, for any two $\theta,\theta'\in\Theta_{k_n}$,

$$d_n(\theta,\theta')\le D_0\,k_n^{D_1}\left\|\theta-\theta'\right\|_{2,k_n}^{D_2}.$$

**Test condition.** There exist two positive constants $\zeta$ and $c_1$ such that, for every $\theta_1\in\Theta_{k_n}$, there exists a test $\phi_n(\theta_1)$ which satisfies

$$E_0^{(n)}\left(\phi_n(\theta_1)\right)\le e^{-c_1nd_n^2(\theta_0,\theta_1)} \quad\text{and}\quad \sup_{d_n(\theta,\theta_1)<\zeta d_n(\theta_0,\theta_1)}E_\theta^{(n)}\left(1-\phi_n(\theta_1)\right)\le e^{-c_1nd_n^2(\theta_0,\theta_1)}.$$

**On the prior $\Pi$.** There exist positive constants $a$ and $b$ such that $\pi$ satisfies

$$\forall k=1,2,\dots,\quad e^{-akL(k)}\le\pi(k)\le e^{-bkL(k)}, \qquad (6)$$

where $L$ is the slowly varying function introduced in Equation (5); there exist positive constants $G_1$, $G_2$, $G_3$, $G_4$ and $\alpha$ such that $g$ satisfies

$$\forall\theta\in\mathbb{R},\quad G_1e^{-G_2|\theta|^\alpha}\le g(\theta)\le G_3e^{-G_4|\theta|^\alpha}. \qquad (7)$$

The scales $\tau_j$ defined in Equation (2) satisfy the following conditions, for positive constants $\tau_0$, $H_2$ and $C$:

$$\max_{j\ge1}\tau_j\le\tau_0, \qquad (8)$$
$$\min_{j\le k_n}\tau_j\ge n^{-H_2}, \qquad (9)$$
$$\sum_{j=1}^{j_n}\left|\theta_{0j}\right|^\alpha/\tau_j^\alpha\le C\,j_n\log n. \qquad (10)$$
###### Remark 1.
• The first two conditions (on $\theta_0$ and on the comparison between norms) are local, in that they need to be checked at the true parameter only. They are useful to prove that the prior puts sufficient mass around Kullback-Leibler neighbourhoods of the true probability. The condition on $\theta_0$ and $j_n$ is a limiting factor to the rate: it characterizes $\epsilon_n$ through the capacity of approximation of $p_0^{(n)}$ by $p_{0j_n}^{(n)}$: the smoother $\theta_0$, the closer $p_0^{(n)}$ and $p_{0j_n}^{(n)}$, and the faster $\epsilon_n$. In many models, these conditions are ensured because $K$ and $V_{m,0}$ can be written locally (meaning around $\theta_0$) in terms of the $\ell_2$ norm directly. Smoothness assumptions are then typically required to control $\left\|\theta_0-\theta_{0j_n}\right\|_2$.

It is the case for instance for Sobolev and Besov smoothnesses (cf. Equation (12)). The control is expressed with a power of $j_n$, whose comparison to $n\epsilon_n^2$ provides in turn a tight way to tune the rate (cf. the proof of Proposition 1).

Note that the constant $H_1$ in the comparison between norms can be chosen as large as needed: if the inclusion holds for a specified positive constant $H_1$, then it does for any $H\ge H_1$. This makes the condition quite loose. A more stringent, if simpler, version of it is the following.

**Comparison between norms (stronger version).** For any $\theta\in\Theta_{j_n}$,

$$\tilde K\!\left(p_{0j_n}^{(n)},p_\theta^{(n)}\right)\le Cn\left\|\theta_{0j_n}-\theta\right\|_{2,j_n}^2 \quad\text{and}\quad \tilde V_{m,0}\!\left(p_{0j_n}^{(n)},p_\theta^{(n)}\right)\le Cn^{m/2}\left\|\theta_{0j_n}-\theta\right\|_{2,j_n}^m.$$

This is satisfied in the Gaussian white noise model (see Section 3).

• The comparison between $d_n$ and the $\ell_2$ norm is generally mild. The reverse inequality is more stringent, since $d_n$ may be bounded, as is the case with the Hellinger distance. The condition is satisfied in many common situations, see for example the applications later on. Technically, this condition allows one to switch from a covering number (or entropy) in terms of the $\ell_2$ norm to a covering number in terms of the semimetric $d_n$.

• The test condition is common in the Bayesian nonparametric literature. A review of different models and their corresponding tests is given in Ghosal and van der Vaart (2007), for example. The tests strongly depend on the semimetric $d_n$.

• The last condition concerns the prior. Equations (6) and (7) state that the tails of $\pi$ and $g$ have to be at least exponential or of exponential type. For instance, if $\pi$ is the geometric distribution, $L(k)\propto1$, and if it is the Poisson distribution, $L(k)\propto\log k$ (both are slowly varying functions). Laplace and Gaussian distributions are covered by (7), with $\alpha=1$ and $\alpha=2$ respectively. These equations aim at controlling the prior mass of $\Theta_{k_n}(Q)^c$, the complement of $\Theta_{k_n}(Q)$ in $\Theta$ (see Lemma 2). The case where the scale $\tau_j$ depends on $j$ is considered in Babenko and Belitser (2009, 2010) in the white noise model. Here the constraints on the $\tau_j$ are rather mild, since they are allowed to go to zero polynomially as a function of $n$, and must be upper bounded. Rivoirard and Rousseau (2012a) study a family of scales that decrease polynomially with $j$. Here the prior is more general and encompasses both frameworks. Equations (6)-(10) are needed in Lemmas 2 and 3 for bounding $\Pi\left(\Theta_{k_n}(Q)^c\right)$ from above and $\Pi\left(B_n(m)\right)$ from below, respectively. A smoothness assumption on $\theta_0$ is usually required for Equation (10).
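As a quick numerical sanity check of Equation (6) in the Poisson case with $L(k)=\log k$: the ratio $-\log\pi(k)/(k\log k)$ should remain bounded between two positive constants $b$ and $a$. The parameter $\lambda=2$ and the range of $k$ below are arbitrary illustrations:

```python
import math

lam = 2.0  # illustrative Poisson parameter

def log_pi(k):
    """log of the Poisson(lam) probability pi(k) = e^{-lam} lam^k / k!."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

# Equation (6) with L(k) = log k reads e^{-a k log k} <= pi(k) <= e^{-b k log k},
# i.e. b <= -log pi(k) / (k log k) <= a for k large.
ratios = [-log_pi(k) / (k * math.log(k)) for k in range(5, 61)]
```

By Stirling's formula the ratio tends to 1, so any $b<1$ and $a>1$ work for $k$ large enough.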

### 2.2 Results

#### 2.2.1 Concentration and posterior loss

The following theorem provides an upper bound for the rate of contraction of the posterior distribution.

###### Theorem 1.

If the conditions of Section 2.1 hold, then for $M$ large enough and for the sequences introduced in Equation (5),

$$E_0^{(n)}\Pi\left(\theta:\ d_n^2(\theta,\theta_0)\ge M\frac{\log n}{L(n)}\epsilon_n^2\,\Big|\,X^n\right)=O\!\left(\left(n\epsilon_n^2\right)^{-m/2}\right)\xrightarrow[n\to\infty]{}0.$$
###### Proof.

See the Appendix. ∎

The convergence of the posterior distribution at the rate $(\log n/L(n))\,\epsilon_n^2$ implies that the expected posterior risk converges (at least) at the same rate, when $d_n^2$ is bounded.

###### Corollary 1.

Under the assumptions of Theorem 1, with a value of $m$ in the conditions of Section 2.1 such that $\left(n\epsilon_n^2\right)^{-m/2}\le\epsilon_n^2\log n/L(n)$, and if $d_n^2$ is bounded on $\Theta$, then the expected posterior risk given $\Pi$ and $d_n^2$ converges at least at the same rate,

$$R_n^{d_n}=E_0^{(n)}\Pi\left(d_n^2(\theta,\theta_0)\,\big|\,X^n\right)=O\!\left(\frac{\log n}{L(n)}\epsilon_n^2\right).$$
###### Proof.

Denote by $D$ the bound on $d_n^2$, i.e. $d_n^2(\theta,\theta')\le D$ for all $\theta,\theta'\in\Theta$. We have

$$R_n^{d_n}\le M\frac{\log n}{L(n)}\epsilon_n^2+E_0^{(n)}\Pi\left(\mathbb{1}\!\left(d_n^2(\theta,\theta_0)\ge M\frac{\log n}{L(n)}\epsilon_n^2\right)d_n^2(\theta,\theta_0)\,\Big|\,X^n\right)$$
$$\le M\frac{\log n}{L(n)}\epsilon_n^2+D\,E_0^{(n)}\Pi\left(\theta:\ d_n^2(\theta,\theta_0)\ge M\frac{\log n}{L(n)}\epsilon_n^2\,\Big|\,X^n\right),$$

so the result follows by Theorem 1 and the assumption on $m$. ∎

###### Remark 2.

The condition on $m$ in Corollary 1 requires $n\epsilon_n^2$ to grow as a power of $n$. When $\theta_0$ has Sobolev smoothness $\beta$, this is the case since $\epsilon_n$ is typically of order $n^{-\beta/(2\beta+1)}$ (up to logarithmic factors), so that $n\epsilon_n^2$ is of order $n^{1/(2\beta+1)}$, and the condition on $m$ boils down to $m\ge4\beta$. When $\theta_0$ is smoother, e.g. in a Sobolev space with exponential weights, the rate is typically of order $\sqrt{\log n/n}$, so $n\epsilon_n^2$ grows only logarithmically. Then a common way to proceed is to resort to an exponential inequality for controlling the denominator of the posterior distribution in Equation (4) (see e.g. Rivoirard and Rousseau, 2012b).

###### Remark 3.

We can note that this result is meaningful from a non-Bayesian point of view as well. Indeed, let $\hat\theta$ be the posterior mean estimate $\Pi(\theta\,|\,X^n)$ of $\theta_0$. Then, if $d_n^2(\cdot,\theta_0)$ is convex, we have by Jensen's inequality

$$d_n^2(\hat\theta,\theta_0)\le\Pi\left(d_n^2(\theta,\theta_0)\,\big|\,X^n\right),$$

so the frequentist risk converges at the same rate:

$$E_0^{(n)}\left(d_n^2(\hat\theta,\theta_0)\right)\le E_0^{(n)}\Pi\left(d_n^2(\theta,\theta_0)\,\big|\,X^n\right)=R_n^{d_n}=O\!\left(\frac{\log n}{L(n)}\epsilon_n^2\right).$$

Note that we have no result for general pointwise estimates, for instance for the MAP. The latter was studied in Abramovich et al. (2007b, 2010).

When considering a given class of smoothness for the parameter $\theta_0$, the minimax criterion implies an optimal rate of convergence. Posterior (resp. risk) adaptation means that the posterior distribution (resp. the risk) concentrates at the optimal rate for a range of possible smoothness values.

We consider here Sobolev classes for univariate problems, defined by

$$\Theta_\beta(L_0)=\left\{\theta:\ \sum_{j=1}^{\infty}\theta_j^2\,j^{2\beta}\le L_0\right\},\qquad L_0>0, \qquad (11)$$

with smoothness parameter $\beta$ and radius $L_0$. If $\theta_0\in\Theta_\beta(L_0)$, then one has the following bound:

$$\left\|\theta_0-\theta_{0j_n}\right\|_2^2=\sum_{j=j_n+1}^{\infty}\theta_{0j}^2\,j^{2\beta}j^{-2\beta}\le L_0\,j_n^{-2\beta}. \qquad (12)$$

Donoho and Johnstone (1998) give the global (i.e. under the $\ell_2$ loss) minimax rate $n^{-\beta/(2\beta+1)}$ attached to the Sobolev class of smoothness $\beta$. We show that, under an additional condition relating $K$, $V_{m,0}$ and the $\ell_2$ norm, the upper bound on the rate of contraction can be chosen equal to the optimal rate, up to a $\log n$ term.
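The tail bound (12) is easy to verify numerically. The sequence below is a hypothetical element of the Sobolev ball (11), chosen only for illustration:

```python
import numpy as np

beta, L0, jn, J = 1.0, 1.0, 20, 100_000  # illustrative values
j = np.arange(1, J + 1)

# A sequence theta0 inside Theta_beta(L0): weights normalised so that
# sum_j theta_{0j}^2 j^{2 beta} = L0 exactly (truncated at J).
w = j ** -1.1
theta2 = (L0 * w / w.sum()) * j ** (-2.0 * beta)  # theta_{0j}^2

sobolev = (theta2 * j ** (2 * beta)).sum()  # should equal L0
tail = theta2[j > jn].sum()                 # ||theta0 - theta0jn||_2^2
bound = L0 * jn ** (-2.0 * beta)            # right-hand side of (12)
```

The inequality `tail <= bound` holds because each tail term satisfies $j^{-2\beta}\le j_n^{-2\beta}$ for $j>j_n$, exactly the argument of (12).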

###### Proposition 1.

Let $L_0$ denote a positive fixed radius, and let $\beta_1\le\beta_2$ be two positive smoothness values. If, for $n$ large enough, there exists a positive constant $C_0$ such that

$$\sup_{\beta_1\le\beta\le\beta_2}\ \sup_{\theta_0\in\Theta_\beta(L_0)}K\!\left(p_0^{(n)},p_{0j_n}^{(n)}\right)\le C_0\,n\left\|\theta_0-\theta_{0j_n}\right\|_2^2,\quad\text{and}$$
$$\sup_{\beta_1\le\beta\le\beta_2}\ \sup_{\theta_0\in\Theta_\beta(L_0)}V_{m,0}\!\left(p_0^{(n)},p_{0j_n}^{(n)}\right)\le C_0^m\,n^{m/2}\left\|\theta_0-\theta_{0j_n}\right\|_2^m, \qquad (13)$$

and if the remaining conditions of Section 2.1 hold with constants independent of $\beta$ in the set $[\beta_1,\beta_2]$, then for $M$ sufficiently large,

$$\sup_{\beta_1\le\beta\le\beta_2}\ \sup_{\theta_0\in\Theta_\beta(L_0)}E_0^{(n)}\Pi\left(\theta:\ d_n^2(\theta,\theta_0)\ge M\frac{\log n}{L(n)}\epsilon_n^2(\beta)\,\Big|\,X^n\right)\xrightarrow[n\to\infty]{}0,$$

with

$$\epsilon_n(\beta)=\epsilon_0\left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+1}},$$

and $\epsilon_0$ depending on $L_0$ and the constants involved in the assumptions, but not depending on $\beta$.

###### Remark 4.

In the standard case where $d_n$ is the $\ell_2$ norm, $\epsilon_n(\beta)$ is the optimal rate of contraction, up to a $\log n$ term (which is quite common in Bayesian nonparametric computations).

###### Proof.

Let $\beta\in[\beta_1,\beta_2]$ and $\theta_0\in\Theta_\beta(L_0)$. Then $\theta_0$ satisfies Equation (12), and Condition (13) implies that

$$K\!\left(p_0^{(n)},p_{0j_n}^{(n)}\right)\le C_0L_0\,n\,j_n^{-2\beta},\qquad V_{m,0}\!\left(p_0^{(n)},p_{0j_n}^{(n)}\right)\le C_0^mL_0^{m/2}\,n^{m/2}\,j_n^{-m\beta}.$$

For given $\beta$ and $\theta_0$, the result of Theorem 1 holds if the condition on $\theta_0$ and $j_n$ is satisfied. This is the case if we choose $\epsilon_n$ such that $j_n^{-2\beta}\lesssim\epsilon_n^2$, provided that the bounds in the conditions of Section 2.1 and in Equation (13) are uniform over $[\beta_1,\beta_2]$. Combined with $j_n=\lfloor j_0\,n\epsilon_n^2/\log n\rfloor$, it gives $\epsilon_n(\beta)=\epsilon_0(\log n/n)^{\beta/(2\beta+1)}$ as a tight choice, with $\epsilon_0$ not depending on $\beta$. So the bound is uniform in $\beta\in[\beta_1,\beta_2]$, which concludes the proof. ∎
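The balance used in the proof can be checked numerically: solving $\epsilon_n^2=j_n^{-2\beta}$ with $j_n=n\epsilon_n^2/\log n$ (taking $j_0=1$ and ignoring the floor, purely for illustration) recovers the exponent $\beta/(2\beta+1)$ of Proposition 1.

```python
import math

def solve_eps2(n, beta):
    """Bisection for eps^2 solving eps^2 = jn^{-2 beta} with jn = n eps^2 / log n."""
    f = lambda e2: e2 - (n * e2 / math.log(n)) ** (-2.0 * beta)
    lo, hi = 1e-12, 1.0  # f(lo) < 0 < f(hi), and f is increasing
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return lo

n, beta = 10 ** 8, 1.0
eps = math.sqrt(solve_eps2(n, beta))
exponent = math.log(eps) / math.log(math.log(n) / n)
# exponent is close to beta / (2 beta + 1) = 1/3 for beta = 1
```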

### 2.3 Examples

In this section, we apply our contraction results of Sections 2.2.1 and 2.2.2 to a series of models. The Gaussian white noise example is studied in detail in Section 3. We suppose in each model that $\theta_0\in\Theta_\beta(L_0)$, where $\Theta_\beta(L_0)$ is defined in Equation (11).

Throughout, we consider the following prior on $\theta$ (or on a curve space, through the coefficients of the functions in a basis). Let the prior distribution on $k$ be Poisson with parameter $\lambda$, and, given $k$, let the prior distribution on the rescaled coordinates $\theta_j/\tau_j$ be standard Gaussian:

$$k\sim\text{Poisson}(\lambda),\qquad \frac{\theta_j}{\tau_j}\,\Big|\,k\sim\mathcal{N}(0,1),\quad j=1,\dots,k,\ \text{independently}. \qquad (14)$$

It satisfies Equation (6) with function $L(k)\propto\log k$ and Equation (7) with $\alpha=2$. Choose then $\tau_j=j^{-q}$, $j\ge1$, with $q\ge0$. This sequence is decreasing and bounded from above by $\tau_0=1$, so Equation (8) is satisfied. Additionally,

$$\min_{j\le k_n}\tau_j=\tau_{k_n}=k_n^{-q}\ge n^{-H_2}$$

for $H_2$ large enough, so Equation (9) is checked. Since $\theta_0\in\Theta_\beta(L_0)$,

$$\sum_{j=1}^{j_n}\theta_{0j}^2/\tau_j^2=\sum_{j=1}^{j_n}\theta_{0j}^2\,j^{2q}=\sum_{j=1}^{j_n}\theta_{0j}^2\,j^{2\beta}j^{2q-2\beta}\le j_n\sum_{j=1}^{j_n}\theta_{0j}^2\,j^{2\beta}\le j_nL_0,$$

as soon as $2q-2\beta\le1$. Hence by choosing $q\le\beta_1+1/2$, Equation (10) is verified for all $\beta\ge\beta_1$. The prior thus satisfies the prior condition of Section 2.1.
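The chain of inequalities above can be replayed numerically for a hypothetical $\theta_0$ on the Sobolev ball; the specific sequence and the choice $q=\beta$ (so that $2q-2\beta\le1$) are illustrations only:

```python
import numpy as np

beta, q, L0, jn = 1.0, 1.0, 1.0, 50  # illustrative values with q <= beta + 1/2
j = np.arange(1, jn + 1)

# theta0 with sum_j theta_{0j}^2 j^{2 beta} <= L0:
w = j ** -1.1
theta2 = (L0 * w / w.sum()) * j ** (-2.0 * beta)

lhs = (theta2 * j ** (2.0 * q)).sum()         # sum theta_{0j}^2 / tau_j^2, tau_j = j^{-q}
mid = (theta2 * j ** (2.0 * beta) * j).sum()  # uses j^{2q - 2 beta} <= j
rhs = jn * (theta2 * j ** (2.0 * beta)).sum() # <= jn * L0
```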

Since the prior condition is satisfied, we will show in the three examples that the other conditions of Section 2.1 and Condition (13) hold, so that Proposition 1 applies: the posterior distribution attains the optimal rate of contraction, up to a $\log n$ term, that is $\epsilon_n(\beta)=\epsilon_0(\log n/n)^{\beta/(2\beta+1)}$, for a distance $d_n$ which is specific to each model. This rate is adaptive over the range of smoothness $[\beta_1,\beta_2]$.

#### 2.3.1 Density

Let us consider the density model in which the density $p$ is unknown, and we observe i.i.d. data

$$X_i\sim p,\quad i=1,2,\dots,n,$$

where $p$ belongs to $\mathcal{F}$,

$$\mathcal{F}=\left\{p\ \text{density on }[0,1]:\ p(0)=p(1)\ \text{and}\ \log p\in L_2(0,1)\right\}.$$

The equality $p(0)=p(1)$ is mainly used for ease of computation. We define the parameter $\theta$ of such a density, and write $p=p_\theta$, through the coefficients of $\log p$ in the Fourier basis $(\psi_j)_{j\ge1}$, i.e. it can be represented as

$$\log p_\theta(x)=\sum_{j=1}^{\infty}\theta_j\psi_j(x)-c(\theta),$$

where $c(\theta)$ is a normalizing constant. We assign a prior to $p$ by assigning the sieve prior of Equation (14) to $\theta$.

A natural choice of metric in this model is the Hellinger distance $h$. Lemma 2 in Ghosal and van der Vaart (2007) shows the existence of tests satisfying the test condition with the Hellinger distance.
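For concreteness, the log-density expansion above can be evaluated numerically. The trigonometric basis convention and the coefficient values below are assumptions for illustration; the normalising constant $c(\theta)$ is computed by quadrature:

```python
import numpy as np

def psi(j, x):
    """A common trigonometric Fourier basis on [0,1] (illustrative convention)."""
    m = (j + 1) // 2
    if j % 2 == 1:
        return np.sqrt(2.0) * np.cos(2 * np.pi * m * x)
    return np.sqrt(2.0) * np.sin(2 * np.pi * m * x)

N = 200_001
grid = (np.arange(N) + 0.5) / N  # midpoint quadrature nodes on [0, 1]

def density(theta):
    """p_theta = exp(sum_j theta_j psi_j - c(theta)) evaluated on the grid."""
    s = sum(t * psi(j + 1, grid) for j, t in enumerate(theta))
    c = np.log(np.exp(s).mean())  # normalising constant c(theta)
    return np.exp(s - c)

p = density([0.5, -0.3, 0.1])  # hypothetical coefficients
mass = p.mean()                # integral of p over [0, 1]
```

Because the basis is periodic, the resulting density automatically satisfies $p(0)=p(1)$, matching the constraint in $\mathcal{F}$.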

Rivoirard and Rousseau (2012b) study this model in detail (Section 4.2.2) in order to derive a Bernstein-von Mises theorem for the density model. They prove that the condition on $\theta_0$, the comparison between norms, and Condition (13) are valid in this model (see the proof of their Condition (C) for the former two and (13), and the proof of their Condition (B) for the latter). With $d_n=h$, the comparison between $d_n$ and the $\ell_2$ norm can be checked as well.

#### 2.3.2 Regression

Consider now the following nonparametric regression model:

$$X_i=f(t_i)+\sigma\xi_i,\quad i=1,\dots,n,$$

with $t_i$ the regular fixed design in $[0,1]$ and $\xi_i$ i.i.d. centered Gaussian errors with variance 1. The unknown-$\sigma$ case is studied in an unpublished paper by Rousseau and Sun. They endow $\sigma$ with an Inverse Gamma (conjugate) prior. They show that this one-dimensional parameter adds a term in the Kullback-Leibler divergence but does not alter the rates, by considering three different cases for $\sigma$.

We consider now in more detail the known-$\sigma$ case. Denote by $\theta$ the coefficients of a regression function $f$ in the Fourier basis $(\psi_j)_{j\ge1}$, so that for all $t\in[0,1]$, $f$ can be represented as $f_\theta(t)=\sum_{j\ge1}\theta_j\psi_j(t)$. We assign a prior to $f$ by assigning the sieve prior of Equation (14) to $\theta$.

Let $P_t^n$ be the empirical measure of the covariates $t_i$, and define the square of the empirical norm by $\|f\|_{P_t^n}^2=\frac{1}{n}\sum_{i=1}^{n}f^2(t_i)$. We use $d_n(f,f')=\|f-f'\|_{P_t^n}$.

Let $\theta\in\Theta_{k_n}$ and $f_\theta$ the corresponding regression function. Basic algebra (see for example Lemma 1.7 in Tsybakov, 2009) provides, for any two indices $j$ and $k$ smaller than $n$,

$$\frac{1}{n}\sum_{i=1}^{n}\psi_j(t_i)\psi_k(t_i)=\delta_{jk},$$

where $\delta_{jk}$ stands for the Kronecker delta. Hence

$$\|f\|_{P_t^n}^2=\frac{1}{n}\sum_{i=1}^{n}\sum_{j,k}\theta_j\theta_k\psi_j(t_i)\psi_k(t_i)=\|\theta\|_2^2=\|f\|_2^2, \qquad (15)$$

where the last equality is Parseval's. It ensures the comparison between $d_n$ and the $\ell_2$ norm, with $D_0=1$, $D_1=0$ and $D_2=1$.
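The discrete orthonormality behind Equation (15) can be checked directly on the regular design; the trigonometric basis convention and the coefficient vector are illustrative assumptions:

```python
import numpy as np

n, kmax = 1024, 8
t = np.arange(1, n + 1) / n  # regular design t_i = i/n

def psi(j, x):
    """Trigonometric Fourier basis on [0,1] (illustrative convention)."""
    m = (j + 1) // 2
    if j % 2 == 1:
        return np.sqrt(2.0) * np.cos(2 * np.pi * m * x)
    return np.sqrt(2.0) * np.sin(2 * np.pi * m * x)

# Empirical Gram matrix (1/n) sum_i psi_j(t_i) psi_k(t_i): the identity,
# for indices small relative to n.
Psi = np.column_stack([psi(j, t) for j in range(1, kmax + 1)])
gram = Psi.T @ Psi / n

theta = np.array([0.5, -0.3, 0.1, 0.2, 0.0, -0.1, 0.05, 0.02])
f_vals = Psi @ theta
emp_norm2 = (f_vals ** 2).mean()  # ||f_theta||^2 under the empirical measure
```

The empirical norm of $f_\theta$ then coincides with $\|\theta\|_2^2$ up to floating-point error, which is exactly identity (15).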

The densities of the observations $X_i$ are denoted $p_{f,i}$, $i=1,\dots,n$, and their product $p_f^{(n)}$. The quantity $f_{0j_n}$ denotes the truncated version of $f_0$ to its first $j_n$ terms in the Fourier basis.

We have $2K(p_{f_0,i},p_{f,i})=\sigma^{-2}\left(f_0(t_i)-f(t_i)\right)^2$, and a similar expression for $V_{m,0}$ involving the (non-centered) $m$-th moment of a standard Gaussian variable. So using Equation (15) we get

$$2K\!\left(p_{f_0}^{(n)},p_f^{(n)}\right)=V_{2,0}\!\left(p_{f_0}^{(n)},p_f^{(n)}\right)=n\sigma^{-2}\left\|f_0-f\right\|_{P_t^n}^2=n\sigma^{-2}\left\|\theta_0-\theta\right\|_2^2,$$

which proves Condition (13).

Additionally, both $\tilde K$ and $\tilde V_{m,0}$ are upper bounded by similar expressions. Then, using (15) again, the bound is less than

 nσ−2(n−