# The semi-parametric Bernstein-von Mises theorem for regression models with symmetric errors

Minwoo Chae, Yongdai Kim and Bas Kleijn
Department of Mathematics, University of Texas at Austin
Department of Statistics, Seoul National University
Korteweg-de Vries Institute for Mathematics, University of Amsterdam
July 3, 2019
###### Abstract

In a smooth semi-parametric model, the marginal posterior distribution for a finite dimensional parameter of interest is expected to be asymptotically equivalent to the sampling distribution of any efficient point-estimator. The assertion leads to asymptotic equivalence of credible and confidence sets for the parameter of interest and is known as the semi-parametric Bernstein-von Mises theorem. In recent years, it has received much attention and has been applied in many examples. We consider models in which errors with symmetric densities play a role; more specifically, it is shown that the marginal posterior distributions of regression coefficients in the linear regression and linear mixed effect models satisfy the semi-parametric Bernstein-von Mises assertion. As a consequence, Bayes estimators in these models achieve frequentist inferential optimality, as expressed e.g. through Hájek’s convolution and asymptotic minimax theorems. Conditions for the prior on the space of error densities are relatively mild and well-known constructions like the Dirichlet process mixture of normal densities and random series priors constitute valid choices. Particularly, the result provides an efficient estimate of regression coefficients in the linear mixed effect model, for which no other efficient point-estimator was known previously.

## 1 Introduction

In this paper, we give an asymptotic, Bayesian analysis of models with errors that are distributed symmetrically. The observations are modeled by,

$$X = \boldsymbol{\mu} + \boldsymbol{\epsilon}, \tag{1.1}$$

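where $X$ is the vector of observations, $\boldsymbol{\mu}$ its mean vector and $\boldsymbol{\epsilon}$ the error vector. Here the mean vector $\boldsymbol{\mu}$ is non-random and parametrized by a finite dimensional parameter $\theta$, and the distribution of the error vector $\boldsymbol{\epsilon}$ is symmetric in the sense that $\boldsymbol{\epsilon}$ has the same distribution as $-\boldsymbol{\epsilon}$. Since the error has a symmetric but otherwise unknown distribution, the model is semi-parametric. Examples of models of the form (1.1) are the symmetric location model (where every component of $\boldsymbol{\mu}$ equals a common location $\theta$), and the linear regression model (where $\mu_i = \theta^T Z_i$ for given covariates $Z_i$). Moreover, the form (1.1) includes models with dependent errors, like linear mixed effect models.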
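To make the model concrete, the following is a minimal simulation sketch of the linear regression instance of (1.1). The error law (a symmetric two-component normal mixture), the coefficient values and the covariate design are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_symmetric_errors(n, rng):
    """Draw n errors from a symmetric two-component normal mixture.

    Hypothetical choice: 0.5*N(1.5, 0.25) + 0.5*N(-1.5, 0.25), so that
    eps and -eps have the same distribution.
    """
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * rng.normal(loc=1.5, scale=0.5, size=n)

n, p = 2000, 2
theta = np.array([1.0, -2.0])             # hypothetical regression coefficients
Z = rng.uniform(-1.0, 1.0, size=(n, p))   # covariates, treated as fixed once drawn
eps = sample_symmetric_errors(n, rng)
X = Z @ theta + eps                       # model (1.1) with mu_i = theta^T Z_i

# Least squares is consistent here but not efficient in general,
# because it ignores the shape of the error density.
theta_ls = np.linalg.lstsq(Z, X, rcond=None)[0]
print("least-squares estimate:", theta_ls)
print("odd sample moments of the errors:", eps.mean(), (eps ** 3).mean())
```

Odd sample moments of the errors are close to zero, which is the empirical footprint of the symmetry assumption.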

The main goal of this paper is to prove the semi-parametric Bernstein-von Mises (BvM) assertion for models of the form (1.1) with symmetric error distributions. Roughly speaking we show that the marginal posterior distribution of the parameter of interest is asymptotically normal, centered on an efficient estimator with variance equal to the inverse Fisher information matrix. As a result, statistical inference based on the posterior distribution satisfies frequentist criteria of optimality.

Various sets of sufficient conditions for the semi-parametric BvM theorem based on the full LAN (local asymptotic normality) expansion (i.e. the LAN expansion with respect to both the finite and infinite dimensional parameters [25]) have been developed in [29, 3, 7]. The full LAN expansion, however, is conceptually inaccessible and technically difficult to verify. Because the models we consider are adaptive [4], we can consider a simpler type of LAN expansion that involves only the parameter of interest, albeit that the expansion must be valid under data distributions that differ slightly from the one on which the expansion is centred. We call this property misspecified LAN and prove that it holds for the models of the form (1.1) and that, together with other regularity conditions, it implies the semi-parametric BvM assertion.

While the BvM theorem for parametric Bayesian models is well established (e.g. [23, 21]), the semi-parametric BvM theorem is still being studied very actively: initial examples [9, 11] of simple semi-parametric problems with simple choices for the prior demonstrated failures of marginal posteriors to display BvM-type asymptotic behaviour. Subsequently, positive semi-parametric BvM results have been established in these and various other examples, including models in survival analysis ([19, 18]), multivariate normal regression models with growing numbers of parameters ([5, 17, 12]) and discrete probability measures ([6]). More delicate notions like finite sample properties and second-order asymptotics are considered in [26, 30, 38].

Regarding models of the form (1.1), there is a sizable amount of literature on efficient point-estimation in the symmetric location problem ([2, 31, 27]) and linear regression models ([4]). By contrast, to date no efficient point-estimator for the regression coefficients in the linear mixed effect model has been found; the semi-parametric BvM theorem proved below, however, implies that the Bayes estimator is efficient! To the authors’ best knowledge, this paper provides the first efficient semi-parametric estimator in the linear mixed effect model. A numerical study given in section 5 supports the view that the Bayes estimator is superior to previous methods of estimation.

This paper is organized as follows: section 2 proves the semi-parametric BvM assertion for all smooth adaptive models (c.f. the misspecified LAN expansion). In sections 3 and 4 we study the linear regression model and linear mixed effect model, respectively. For each, we consider two common choices for the nuisance prior, a Dirichlet process mixture and a series prior, and we show that both lead to validity of the BvM assertion. Results of numerical studies are presented in section 5.

#### Notation and conventions

For two real values $a$ and $b$, $a \wedge b$ and $a \vee b$ are the minimum and maximum of $a$ and $b$, respectively, and $a_n \lesssim b_n$ signifies that $a_n$ is smaller than $b_n$ up to a constant multiple independent of $n$. Lebesgue measures are denoted by ; $|\cdot|$ represents the Euclidean norm on $\mathbb{R}^d$. The capitals $P_\eta$, $P_{\theta,\eta}$, etc. denote the probability measures associated with densities that we write in lower case, $p_\eta$, $p_{\theta,\eta}$, etc. (where it is always clear from the context which dominating measure is involved). The corresponding log densities are indicated with $\ell_\eta$, $\ell_{\theta,\eta}$, etc. Hellinger and total-variational metrics are defined as and , respectively. The expectation of a random variable $X$ under a probability measure $P$ is denoted by $PX$. The notation $P_0$ always represents the true probability which generates the observation and $X^o = X - P_0 X$ is the centered version of a random variable $X$. The indicator function for a set $A$ is denoted $1_A$. For a class of measurable functions $\mathcal{F}$, the quantities $N(\epsilon, \mathcal{F}, d)$ and $N_{[\,]}(\epsilon, \mathcal{F}, d)$ represent the $\epsilon$-covering and $\epsilon$-bracketing numbers [33] with respect to a (semi)metric $d$.

## 2 Misspecified LAN and the semi-parametric BvM theorem

In this section, we prove the semi-parametric BvM theorem for smooth adaptive models, i.e. those that satisfy the misspecified LAN expansion defined below.

### 2.1 Misspecified local asymptotic normality

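Consider a sequence of statistical models $\{P^{(n)}_{\theta,\eta}\}$ on measurable spaces, parametrized by a finite dimensional parameter of interest $\theta \in \Theta$ and an infinite dimensional nuisance parameter $\eta \in H$. Assume that $\Theta$ is a subset of $\mathbb{R}^p$, $H$ is a metric space equipped with the associated Borel $\sigma$-algebra and $P^{(n)}_{\theta,\eta}$ has density $p^{(n)}_{\theta,\eta}$ with respect to some $\sigma$-finite measures dominating the model.

Let $X^{(n)}$ be a random element of the $n$-th sample space following $P_0^{(n)}$ and assume that $P_0^{(n)} = P^{(n)}_{\theta_0,\eta_0}$ for some $\theta_0 \in \Theta$ and $\eta_0 \in H$. We say that a sequence of statistical models satisfies the misspecified LAN expansion if there exists a sequence of vector-valued (componentwise) $L_2(P_0^{(n)})$-functions $(g_{n,\eta})$, a sequence $(H_n)$ of measurable subsets of $H$ and a sequence $(V_{n,\eta})$ of $p \times p$-matrices such that,

$$\sup_{h \in K} \sup_{\eta \in H_n} \left| \log \frac{p^{(n)}_{\theta_n(h),\eta}}{p^{(n)}_{\theta_0,\eta}}(X^{(n)}) - \frac{h^T}{\sqrt{n}} g_{n,\eta}(X^{(n)}) + \frac{1}{2} h^T V_{n,\eta} h \right| = o_{P_0}(1), \tag{2.1}$$

for every compact $K \subset \mathbb{R}^p$, where $\theta_n(h)$ equals $\theta_0 + h/\sqrt{n}$. When we know $\eta_0$, property (2.1) is nothing but the usual parametric LAN expansion, where we set $H_n = \{\eta_0\}$. We refer to (2.1) as the misspecified LAN expansion because the base for the expansion is $(\theta_0, \eta)$ while rest-terms go to zero under $P_0^{(n)}$, which corresponds to the point $(\theta_0, \eta_0)$.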
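As a numerical illustration of (2.1), the sketch below evaluates the remainder of the expansion in the symmetric location model for one fixed misspecified nuisance value. All concrete choices are assumptions made for the example: the true error density $\eta_0$ is standard normal, the misspecified $\eta$ is the standard logistic density, $\theta_0 = 0$, the score is taken as the $\theta$-derivative of the log-likelihood, and $V_{n,\eta}$ is computed as $P_{\eta_0}(s_\eta s_{\eta_0})$ in line with the form of (3.4).

```python
import numpy as np

def log_logistic(x):
    """Log of the standard logistic density (the misspecified eta)."""
    return -x - 2.0 * np.logaddexp(0.0, -x)

def score_logistic(x):
    """d/dtheta of log eta(x - theta) at theta = 0 for the logistic density."""
    return np.tanh(x / 2.0)

rng = np.random.default_rng(1)
n = 100_000
X = rng.standard_normal(n)     # data from theta0 = 0 and eta0 = N(0, 1)

# V = P_{eta0}[ s_eta(x) * s_eta0(x) ] with s_eta0(x) = x, by a Riemann sum.
grid = np.linspace(-10.0, 10.0, 200_001)
dx = grid[1] - grid[0]
phi = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)
V = np.sum(score_logistic(grid) * grid * phi) * dx

def lan_remainder(h):
    """Left-hand side of (2.1) for a single h and the fixed logistic eta."""
    t = h / np.sqrt(n)
    log_lr = np.sum(log_logistic(X - t) - log_logistic(X))
    linear = t * np.sum(score_logistic(X))
    return log_lr - linear + 0.5 * h ** 2 * V

remainders = [lan_remainder(h) for h in (-2.0, -1.0, 1.0, 2.0)]
print("V =", V)
print("remainders:", remainders)
```

The remainders are small even though the likelihood is built from the wrong error density, which is the behaviour that symmetry is expected to deliver. The sketch fixes a single $\eta$ and therefore does not probe the uniformity over $H_n$ that (2.1) requires.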

Note that the misspecified LAN expansion is simpler than the full LAN expansion used in [29, 3, 7]. Although the misspecified LAN expansion (2.1) can be applied only to the adaptive cases, the verification of (2.1) is not easy due to misspecification and the required uniformity of convergence. LAN expansions have been shown to be valid even under misspecification: in [21] for example, smoothness in misspecified parametric models is expressed through a version of local asymptotic normality under the true distribution of the data, with a likelihood expansion around points in the model where the Kullback-Leibler (KL)-divergence with respect to $P_0$ is minimal. In models with symmetric error, the point of minimal KL-divergence equals exactly $\theta_0$, provided that the misspecified $\eta$ is close enough to $\eta_0$ in the sense of . This allows the usual LAN expansion at $\theta_0$ for fixed $\eta$, that is, the left-hand side of (2.1) is expected to be of order $o_{P_0}(1)$. By choosing localizations $H_n$ appropriately, the family of score functions is shown to be a Donsker class, which validates (2.1) in models with symmetric errors, where , and . The score function $g_{n,\eta}$ is not necessarily the pointwise derivative of the log-likelihood, but in most examples (including the models considered in this paper), $g_{n,\eta} = \dot\ell^{(n)}_{\theta_0,\eta}$, where $\dot\ell^{(n)}_{\theta,\eta} = \partial \ell^{(n)}_{\theta,\eta} / \partial\theta$. From now on, since it conveys the natural meaning of derivative, we use the notation $\dot\ell^{(n)}_{\theta_0,\eta}$ instead of $g_{n,\eta}$.

### 2.2 The semi-parametric Bernstein-von Mises theorem

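We use a product prior $\Pi = \Pi_\Theta \times \Pi_H$ on the Borel $\sigma$-algebra of $\Theta \times H$ and denote the posterior distribution by $\Pi(\cdot \mid X^{(n)})$. Note that the misspecified LAN property gives rise to an expansion of the log-likelihood that applies only locally in sets $\Theta_n \times H_n$, where $\Theta_n = \{\theta_0 + h/\sqrt{n} : h \in K\}$ (for some compact $K \subset \mathbb{R}^p$ and appropriate $H_n \subset H$). So for the semi-parametric BvM theorem, the score function as well as $V_{n,\eta}$ must ‘behave nicely’ on $\Theta_n \times H_n$ and the posterior distribution must concentrate inside $\Theta_n \times H_n$. Technically, these requirements are expressed by the following two conditions. For a matrix $A$, $\|A\|$ represents the operator norm of $A$, defined as $\sup_{x \neq 0} |Ax|/|x|$, and if $A$ is a square matrix, $\rho_{\min}(A)$ and $\rho_{\max}(A)$ denote the minimum and maximum eigenvalues of $A$, respectively.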

Condition A. (Equicontinuity and non-singularity)

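$$\sup_{\eta \in H_n} \left| \dot\ell^{(n)}_{\theta_0,\eta}(X^{(n)}) - \dot\ell^{(n)}_{\theta_0,\eta_0}(X^{(n)}) \right| = o_{P_0}(n^{1/2}), \tag{2.2}$$

$$\sup_{\eta \in H_n} \| V_{n,\eta} - V_{n,\eta_0} \| = o(1), \tag{2.3}$$

$$0 < \liminf_{n \to \infty} \rho_{\min}(V_{n,\eta_0}) \le \limsup_{n \to \infty} \rho_{\max}(V_{n,\eta_0}) < \infty. \tag{2.4}$$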

Condition B. (Posterior localization)

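$$P_0^{(n)} \Pi(H_n \mid X^{(n)}) \to 1, \tag{2.5}$$

$$P_0^{(n)} \Pi\left( \sqrt{n}\,|\theta - \theta_0| > M_n \mid X^{(n)} \right) \to 0, \quad \text{for every } M_n \uparrow \infty. \tag{2.6}$$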

Conditions like (2.2) and (2.3) are to be expected in the context of semi-parametric estimation (see, e.g., Theorem 25.54 of [34]). Condition (2.2) amounts to asymptotic equicontinuity and is implied whenever scores form a Donsker class, a well-known sufficient condition in semi-parametric efficiency (see [34]). Condition (2.3) is implied whenever the $L_2$-norm of the difference between scores at $(\theta_0, \eta)$ and $(\theta_0, \eta_0)$ vanishes as $\eta$ converges to $\eta_0$ in Hellinger distance, cf. (3.12); it controls variations of the information matrix as $\eta$ converges to $\eta_0$ with $n$. Condition (2.4) guarantees that the Fisher information matrix does not develop singularities as the sample size goes to infinity.

Condition (2.5) formulates a requirement of posterior consistency in the usual sense, and sufficient conditions are well-known [28, 1, 36, 20]. Condition (2.6) requires an $n^{-1/2}$ rate of convergence for the marginal posterior distribution of the parameter of interest. Though some authors remark that (2.6) appears to be rather too strong [38], (2.6) is clearly a necessary condition (since it follows directly from the BvM assertion). The proof of condition (2.6) is technically demanding and forms the most difficult part of this analysis and most others [3].

We say the prior $\Pi_\Theta$ is thick at $\theta_0$ if it has a strictly positive and continuous Lebesgue density in a neighborhood of $\theta_0$. The following theorem states the BvM theorem for semi-parametric models that are smooth in the sense of the misspecified LAN expansion.

###### Theorem 2.1.

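Consider statistical models $\{P^{(n)}_{\theta,\eta} : \theta \in \Theta, \eta \in H\}$ with a product prior $\Pi = \Pi_\Theta \times \Pi_H$. Assume that $\Pi_\Theta$ is thick at $\theta_0$ and that (2.1) as well as Conditions A and B hold. Then,

$$\sup_B \left| \Pi\left( \sqrt{n}(\theta - \theta_0) \in B \mid X^{(n)} \right) - N_{\Delta_n, V_{n,\eta_0}^{-1}}(B) \right| \to 0, \tag{2.7}$$

in $P_0^{(n)}$-probability, where,

$$\Delta_n = \frac{1}{\sqrt{n}} V_{n,\eta_0}^{-1} \dot\ell^{(n)}_{\theta_0,\eta_0}(X^{(n)}).$$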
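To see what the centring sequence $\Delta_n$ looks like in a familiar case, the sketch below evaluates it for linear regression with standard normal errors, an assumption made purely for illustration. Under that assumption the score is $\sum_i Z_i (X_i - \theta_0^T Z_i)$ and $V_{n,\eta_0} = n^{-1} \sum_i Z_i Z_i^T$, so the limiting normal law in (2.7), expressed on the $\theta$ scale, is centred at $\theta_0 + \Delta_n/\sqrt{n}$, which coincides with the ordinary least-squares estimator. The data-generating values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 3
theta0 = np.array([0.5, -1.0, 2.0])        # hypothetical true coefficients
Z = rng.uniform(-1.0, 1.0, size=(n, p))    # covariates, treated as fixed
X = Z @ theta0 + rng.standard_normal(n)    # assumed N(0, 1) errors

# Score at (theta0, eta0) for N(0, 1) errors: sum_i Z_i (X_i - theta0^T Z_i).
score = Z.T @ (X - Z @ theta0)

# Information matrix V_{n, eta0} = (1/n) sum_i Z_i Z_i^T (error variance is 1).
V = Z.T @ Z / n

# Delta_n = n^{-1/2} V^{-1} score, as in Theorem 2.1.
delta_n = np.linalg.solve(V, score) / np.sqrt(n)

# Centre of the approximating normal distribution on the theta scale.
center = theta0 + delta_n / np.sqrt(n)
ols = np.linalg.lstsq(Z, X, rcond=None)[0]

print("theta0 + Delta_n / sqrt(n):", center)
print("ordinary least squares:     ", ols)
```

The two printed vectors agree up to floating-point error, which matches the reading of (2.7) as centring the posterior on an efficient estimator.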
###### Proof.

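Note first that (2.5) implies that $\Pi_H(H_n) > 0$ for large enough $n$. Let $\Pi_{H_n}$ be the probability measure obtained by restricting $\Pi_H$ to $H_n$ and next re-normalizing, and $\Pi_{H_n}(\cdot \mid X^{(n)})$ be the corresponding posterior distribution. Then, for any measurable set $B$ in $\Theta$,

$$\begin{aligned} \Pi(\theta \in B \mid X^{(n)}) &= \Pi(\theta \in B, \eta \in H_n \mid X^{(n)}) + \Pi(\theta \in B, \eta \in H_n^c \mid X^{(n)}) \\ &= \Pi_{H_n}(\theta \in B \mid X^{(n)}) \, \Pi(\eta \in H_n \mid X^{(n)}) + \Pi(\theta \in B, \eta \in H_n^c \mid X^{(n)}), \end{aligned}$$

so we have,

$$\sup_B \left| \Pi(\theta \in B \mid X^{(n)}) - \Pi_{H_n}(\theta \in B \mid X^{(n)}) \right| \to 0,$$

in $P_0^{(n)}$-probability. Therefore it is sufficient to prove the BvM assertion with the priors $\Pi_{H_n}$.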

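In particular,

$$\Pi_{H_n}\left( \sqrt{n}\,|\theta - \theta_0| > M_n \mid X^{(n)} \right) = \frac{\Pi\left( \sqrt{n}\,|\theta - \theta_0| > M_n, \ \eta \in H_n \mid X^{(n)} \right)}{\Pi(\eta \in H_n \mid X^{(n)})}, \tag{2.8}$$

converges to 0 in $P_0^{(n)}$-probability by (2.5) and (2.6). Using (2.1), (2.2) and (2.3), we obtain,

$$\sup_{h \in K} \sup_{\eta \in H_n} \left| \log \frac{p^{(n)}_{\theta_n(h),\eta}}{p^{(n)}_{\theta_0,\eta}}(X^{(n)}) - \frac{h^T}{\sqrt{n}} \dot\ell^{(n)}_{\theta_0,\eta_0}(X^{(n)}) + \frac{1}{2} h^T V_{n,\eta_0} h \right| = o_{P_0}(1), \tag{2.9}$$

for every compact $K \subset \mathbb{R}^p$. Let,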

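$$b_1(h) = \inf_{\eta \in H_n} \frac{p^{(n)}_{\theta_n(h),\eta}(X^{(n)})}{p^{(n)}_{\theta_0,\eta}(X^{(n)})}, \quad \text{and} \quad b_2(h) = \sup_{\eta \in H_n} \frac{p^{(n)}_{\theta_n(h),\eta}(X^{(n)})}{p^{(n)}_{\theta_0,\eta}(X^{(n)})}.$$

Then, trivially, we have,

$$b_1(h) \le \frac{\int p^{(n)}_{\theta_n(h),\eta}(X^{(n)}) \, d\Pi_{H_n}(\eta)}{\int p^{(n)}_{\theta_0,\eta}(X^{(n)}) \, d\Pi_{H_n}(\eta)} \le b_2(h), \tag{2.10}$$

and the quantity,

$$\sup_{h \in K} \left| \log b_k(h) - \frac{h^T}{\sqrt{n}} \dot\ell^{(n)}_{\theta_0,\eta_0}(X^{(n)}) + \frac{1}{2} h^T V_{n,\eta_0} h \right|,$$

is bounded above by the left-hand side of (2.9) for $k = 1, 2$. As a result,

$$\sup_{h \in K} \left| \log \frac{\int p^{(n)}_{\theta_n(h),\eta}(X^{(n)}) \, d\Pi_{H_n}(\eta)}{\int p^{(n)}_{\theta_0,\eta}(X^{(n)}) \, d\Pi_{H_n}(\eta)} - \frac{h^T}{\sqrt{n}} \dot\ell^{(n)}_{\theta_0,\eta_0}(X^{(n)}) + \frac{1}{2} h^T V_{n,\eta_0} h \right| = o_{P_0}(1), \tag{2.11}$$

because $|c| \le |a| \vee |b|$ for all real numbers $a$, $b$ and $c$ with $a \le c \le b$. The remainder of the proof is (almost) identical to the proof for parametric models [23, 21], replacing the parametric likelihood by $\theta \mapsto \int p^{(n)}_{\theta,\eta}(X^{(n)}) \, d\Pi_{H_n}(\eta)$ as in [3], details of which can be found in Theorem 3.1.1 of [8]. ∎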
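The sandwich step (2.10) and the elementary inequality used to reach (2.11) can be checked on a toy example. In the sketch below the nuisance prior is replaced by a discrete distribution on finitely many hypothetical nuisance values, and the likelihood values are arbitrary positive numbers; none of these quantities come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50                                     # number of hypothetical nuisance values

den = rng.uniform(0.1, 2.0, size=m)        # stand-in for p_{theta0, eta_j}(X)
ratio = rng.uniform(0.5, 1.5, size=m)      # stand-in for the likelihood ratios
num = ratio * den                          # stand-in for p_{theta_n(h), eta_j}(X)
weights = rng.dirichlet(np.ones(m))        # discrete stand-in for the nuisance prior

# Ratio of integrated likelihoods, the middle term of (2.10).
integrated = np.sum(weights * num) / np.sum(weights * den)
b1, b2 = ratio.min(), ratio.max()
print(b1, integrated, b2)

# Step from (2.9) to (2.11): with a <= c <= b we have |c| <= max(|a|, |b|),
# applied to the logarithms after subtracting a common quantity q.
q = 0.1                                    # arbitrary stand-in for the quadratic expansion
lhs = abs(np.log(integrated) - q)
rhs = max(abs(np.log(b1) - q), abs(np.log(b2) - q))
print(lhs, rhs)
```

The integrated ratio is a weighted average of the individual ratios, with weights proportional to the prior mass times the denominator likelihood, which is why it cannot leave the interval $[b_1(h), b_2(h)]$.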

## 3 Semi-parametric BvM for linear regression models

Let be the set of all continuously differentiable densities defined on (for some ) such that and for every . Equip with the Hellinger metric. We consider a model for data satisfying,

$$X_i = \theta^T Z_i + \epsilon_i, \quad \text{for } i = 1, \ldots, n, \tag{3.1}$$

where the $Z_i$'s are $p$-dimensional non-random covariates and the errors $\epsilon_i$ are assumed to form an i.i.d. sample from a distribution with density $\eta$. We prove the BvM theorem for the regression coefficient $\theta$.

Let denote the probability measure with density and . Also let be the probability measure with density and . Let represent the product measure and let . With slight abuse of notation, we treat and as either functions of or the corresponding random variables when they are evaluated at . For example, represents either the function or the random vector . We treat and similarly.

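Let $\theta_0$ and $\eta_0$ be the true regression coefficient and error density in the model (3.1). Define specialized KL-balls in $\Theta \times H$ of the form,

$$B_n(\epsilon) = \left\{ (\theta, \eta) : \sum_{i=1}^n K(p_{\theta_0,\eta_0,i}, p_{\theta,\eta,i}) \le n\epsilon^2, \ \sum_{i=1}^n V(p_{\theta_0,\eta_0,i}, p_{\theta,\eta,i}) \le C_2 n \epsilon^2 \right\}, \tag{3.2}$$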

where , , and is some positive constant (see [14]). Define the mean Hellinger distance on by,

$$h_n^2\big((\theta_1, \eta_1), (\theta_2, \eta_2)\big) = \frac{1}{n} \sum_{i=1}^n h^2(p_{\theta_1,\eta_1,i}, p_{\theta_2,\eta_2,i}). \tag{3.3}$$
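A small sketch of (3.3) under simplifying assumptions: both parameter pairs share the same normal error density $N(0, \sigma^2)$, so that each summand only depends on the mean shift $(\theta_1 - \theta_2)^T Z_i$, and the squared Hellinger distance is taken as $h^2(p, q) = \int (\sqrt{p} - \sqrt{q})^2$ without a factor $1/2$. Both the error law and the normalization are assumptions of this example; the function names are hypothetical.

```python
import numpy as np

def hellinger_sq_numeric(m1, m2, sigma=1.0):
    """Riemann-sum approximation of the squared Hellinger distance
    between N(m1, sigma^2) and N(m2, sigma^2)."""
    grid = np.linspace(-20.0, 20.0, 40_001)
    dx = grid[1] - grid[0]
    norm = sigma * np.sqrt(2.0 * np.pi)
    p = np.exp(-0.5 * ((grid - m1) / sigma) ** 2) / norm
    q = np.exp(-0.5 * ((grid - m2) / sigma) ** 2) / norm
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

def hellinger_sq_closed(m1, m2, sigma=1.0):
    """Closed form for equal-variance normals: 2 * (1 - exp(-(m1 - m2)^2 / (8 sigma^2)))."""
    return 2.0 * (1.0 - np.exp(-(m1 - m2) ** 2 / (8.0 * sigma ** 2)))

def mean_hellinger_sq(theta1, theta2, Z, sigma=1.0):
    """Mean squared Hellinger distance (3.3) when only theta differs."""
    shifts = Z @ (theta1 - theta2)         # mean shift for each observation
    return np.mean(hellinger_sq_closed(shifts, 0.0, sigma))

rng = np.random.default_rng(4)
Z = rng.uniform(-1.0, 1.0, size=(200, 2))  # hypothetical fixed design
theta1 = np.array([1.0, -2.0])
theta2 = np.array([1.3, -2.2])
value = mean_hellinger_sq(theta1, theta2, Z)
print("mean squared Hellinger distance:", value)
```

Averaging over $i$ is what makes the distance meaningful for independent but not identically distributed observations, since each $Z_i$ produces a different shift.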

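Let $Z_n = n^{-1} \sum_{i=1}^n Z_i Z_i^T$ and,

$$V_{n,\eta} = \frac{1}{n} P_0^{(n)} \left[ \dot\ell^{(n)}_{\theta_0,\eta} \, \dot\ell^{(n)T}_{\theta_0,\eta_0} \right]. \tag{3.4}$$

It is easy to see that $V_{n,\eta} = v_\eta Z_n$, where $v_\eta = P_{\eta_0}(s_\eta s_{\eta_0})$.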

We say that a sequence of real-valued stochastic processes , (), is asymptotically tight if it is asymptotically tight in the space of bounded functions on with the uniform norm [33]. A vector-valued stochastic process is asymptotically tight if each of its components is asymptotically tight.

###### Theorem 3.1.

Suppose that for some constant , and . The prior for is a product , where is thick at . Suppose also that there exist an , a sequence with , and partitions and such that and

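$$\begin{aligned} \log N(\epsilon_n/36, \Theta_{n,1} \times H_{n,1}, h_n) &\le n\epsilon_n^2, \\ \log \Pi(B_n(\epsilon_n)) &\ge -\tfrac{1}{4} n \epsilon_n^2, \\ \log\big( \Pi_\Theta(\Theta_{n,2}) + \Pi_H(H_{n,2}) \big) &\le -\tfrac{5}{2} n \epsilon_n^2, \end{aligned} \tag{3.5}$$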

for all . For some , with , let and assume that there exist a continuous -function and an such that,

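$$\sup_{|y| < \epsilon_0} \sup_{\eta \in H^N} \left| \frac{\ell_\eta(x + y) - \ell_\eta(x)}{y} \right| \vee \left| \frac{s_\eta(x + y) - s_\eta(x)}{y} \right| \le Q(x), \tag{3.6}$$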

where . Furthermore, assume that the sequence of stochastic processes,

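$$\left\{ \frac{1}{\sqrt{n}} \left( \dot\ell^{(n)}_{\theta,\eta} - P_0^{(n)} \dot\ell^{(n)}_{\theta,\eta} \right) : |\theta - \theta_0| < \epsilon_0, \ \eta \in H^N \right\}, \tag{3.7}$$

indexed by $(\theta, \eta)$ is asymptotically tight. Then the assertion of the BvM theorem 2.1 holds for $\theta$.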

Since the observations are not i.i.d., we consider the mean Hellinger distance $h_n$ as in [14]. Conditions (3.5) are required for the convergence rate of $h_n((\theta, \eta), (\theta_0, \eta_0))$ to be $\epsilon_n$, which in turn implies that the convergence rates of $|\theta - \theta_0|$ and $h(\eta, \eta_0)$ are $\epsilon_n$ (cf. Lemma 3.1). In fact, we only need to prove (3.5) with arbitrary rate because the so-called no-bias condition holds trivially by the symmetry, which plays an important role in proving (2.1)-(2.3) as in frequentist literature (see Chapter 25 of [35]). Condition (3.6), which is technical in nature, is easily satisfied. For a random design, (3.7) is asymptotically tight if and only if the class of score functions forms a Donsker class, and sufficient conditions for the latter are well established in empirical process theory. Since observations are not i.i.d. due to the non-randomness of covariates, (3.7) does not converge in distribution to a Gaussian process. Here, asymptotic tightness of (3.7) merely assures that the supremum of its norm is of order $O_{P_0}(1)$. Asymptotic tightness holds under a finite bracketing integral condition (where the definition of the bracketing number is extended to non-i.i.d. observations in a natural way). For sufficient conditions for asymptotic tightness with non-i.i.d. observations, readers are referred to section 2.11 of [33].

We prove Theorem 3.1 by checking the misspecified LAN condition as well as Conditions A and B, whose proofs are sketched in the three following subsections respectively. Detailed proofs are provided in the appendix.

### 3.1 Proof of Misspecified LAN

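Note that $P_0^{(n)} \dot\ell^{(n)}_{\theta_0,\eta} = 0$ for every $\eta \in H$ by the symmetry of $\eta$ and $\eta_0$. This enables writing the left-hand side of (2.1) as,

$$\log \frac{p^{(n)}_{\theta_n(h),\eta}}{p^{(n)}_{\theta_0,\eta}}(X^{(n)}) - \frac{h^T}{\sqrt{n}} \dot\ell^{(n)}_{\theta_0,\eta}(X^{(n)}) + \frac{1}{2} h^T V_{n,\eta} h = A_n(h, \eta) + B_n(h, \eta),$$

where,

$$\begin{aligned} A_n(h, \eta) &= \left( \ell^{(n)}_{\theta_n(h),\eta} - \ell^{(n)}_{\theta_0,\eta} - \frac{h^T}{\sqrt{n}} \dot\ell^{(n)}_{\theta_0,\eta} \right)^o, \\ B_n(h, \eta) &= P_0^{(n)} \left( \ell^{(n)}_{\theta_n(h),\eta} - \ell^{(n)}_{\theta_0,\eta} \right) + \frac{1}{2} h^T V_{n,\eta} h. \end{aligned} \tag{3.8}$$

It suffices to prove that $A_n(h, \eta)$ and $B_n(h, \eta)$ converge to zero uniformly over $h \in K$ and $\eta \in H_n$, in $P_0^{(n)}$-probability, for every compact set $K$.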

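Note that $A_n(h, \eta)$ is equal to,

$$\frac{h^T}{\sqrt{n}} \int_0^1 \left( \dot\ell^{(n)}_{\theta_n(th),\eta} - \dot\ell^{(n)}_{\theta_0,\eta} \right)^o dt,$$

by Taylor expansion, so for a compact set $K$, we have,

$$\sup_{h \in K} \sup_{\eta \in H^N} |A_n(h, \eta)| \lesssim \sup_{h \in K} \sup_{\eta \in H^N} \left| \frac{1}{\sqrt{n}} \left( \dot\ell^{(n)}_{\theta_n(h),\eta} - \dot\ell^{(n)}_{\theta_0,\eta} \right)^o \right|. \tag{3.9}$$

For fixed $h$ and $\eta$, $n^{-1/2} ( \dot\ell^{(n)}_{\theta_n(h),\eta} - \dot\ell^{(n)}_{\theta_0,\eta} )^o$ converges to zero in probability because its mean is zero and its variance is bounded by,

$$\begin{aligned} \frac{1}{n} \sum_{i=1}^n P_0 \left| \dot\ell_{\theta_n(h),\eta,i} - \dot\ell_{\theta_0,\eta,i} \right|^2 &\lesssim \frac{1}{n} \sum_{i=1}^n P_0 \left| s_\eta(X_i - \theta_n(h)^T Z_i) - s_\eta(X_i - \theta_0^T Z_i) \right|^2 \\ &\le \frac{1}{n} \sum_{i=1}^n \left| (\theta_n(h) - \theta_0)^T Z_i \right|^2 \cdot P_{\eta_0} Q^2 \lesssim \frac{P_{\eta_0} Q^2}{n}, \end{aligned}$$

which converges to zero as $n \to \infty$. In turn, this pointwise convergence to zero implies uniform convergence to zero of the right-hand side of (3.9), since (3.7) is asymptotically tight. Thus the supremum of $A_n(h, \eta)$ over $h \in K$ and $\eta \in H^N$ is of order $o_{P_0}(1)$.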

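For $B_n(h, \eta)$, we prove in Section A.1.1 that,

$$\sup_{\eta \in H^N} \left| \frac{1}{n} P_0^{(n)} \left( \ell^{(n)}_{\theta,\eta} - \ell^{(n)}_{\theta_0,\eta} \right) + \frac{1}{2} (\theta - \theta_0)^T V_{n,\eta} (\theta - \theta_0) \right| = o(|\theta - \theta_0|^2), \tag{3.10}$$

as $\theta \to \theta_0$. Consequently, the supremum of $B_n(h, \eta)$ over $h \in K$ and $\eta \in H^N$ converges to zero. ∎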
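The opening observation of this proof, that the misspecified score has mean zero under the truth whenever both densities are symmetric, can be checked numerically. The sketch below is an illustration under assumed densities: the misspecified $\eta$ is standard logistic (with score function $\tanh(x/2)$, up to the sign convention), the symmetric truth is a two-component normal mixture, and a shifted normal serves as an asymmetric counterexample. None of these choices are taken from the paper.

```python
import numpy as np

grid = np.linspace(-12.0, 12.0, 24_001)    # grid symmetric around zero
dx = grid[1] - grid[0]

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def score_logistic(x):
    """Odd score function of the standard logistic density."""
    return np.tanh(x / 2.0)

# Symmetric truth: an even density, so (odd score) * (even density) integrates to 0.
eta0_sym = 0.5 * normal_pdf(grid, -1.5, 0.5) + 0.5 * normal_pdf(grid, 1.5, 0.5)
# Asymmetric alternative: the same integral no longer vanishes.
eta0_asym = normal_pdf(grid, 0.5, 1.0)

mean_score_sym = np.sum(score_logistic(grid) * eta0_sym) * dx
mean_score_asym = np.sum(score_logistic(grid) * eta0_asym) * dx

print("P_eta0 s_eta, symmetric truth: ", mean_score_sym)
print("P_eta0 s_eta, asymmetric truth:", mean_score_asym)
```

With a symmetric truth the expected score vanishes for every symmetric $\eta$, so no bias term survives in the decomposition into $A_n$ and $B_n$; with an asymmetric truth the same integral is clearly non-zero.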

### 3.2 Proof of Condition A

For given , let $d_2$ be the metric on $H$ defined by,

$$d_2^2(\eta, \eta_0) = P_{\eta_0} (s_\eta - s_{\eta_0})^2. \tag{3.11}$$

In Section A.1.2, it is shown that,

$$\lim_{n \to \infty} \sup_{\eta \in H_n} d_2(\eta, \eta_0) = 0. \tag{3.12}$$

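Let $a \in \mathbb{R}^p$ be a non-zero vector and let $\sigma_n^2 = a^T Z_n a$. Because $\rho_{\min}(Z_n)$ is bounded away from zero in the tail by assumption, $\sigma_n^2$ is bounded away from zero for large enough $n$, and so the scaled process,

$$\left\{ \frac{a^T}{\sqrt{n} \sigma_n} \left( \dot\ell^{(n)}_{\theta_0,\eta} - P_0^{(n)} \dot\ell^{(n)}_{\theta_0,\eta} \right) : \eta \in H^N \right\}, \tag{3.13}$$

is asymptotically tight by the asymptotic tightness of (3.7). Furthermore, it converges weakly (in the space of bounded functions with the uniform norm) to a tight Gaussian process because it converges marginally to a Gaussian distribution by the Lindeberg-Feller theorem. To see this, the variance of (3.13) for fixed $\eta$ is equal to $P_{\eta_0} s_\eta^2$ for every $n$. In addition,

$$\begin{aligned} \frac{1}{n \sigma_n^2} \sum_{i=1}^n P_0 |a^T \dot\ell_{\theta_0,\eta,i}|^2 \, 1\{ |a^T \dot\ell_{\theta_0,\eta,i}| > \sqrt{n} \sigma_n \epsilon \} &= \frac{1}{n \sigma_n^2} \sum_{i=1}^n |a^T Z_i|^2 \, P_{\eta_0} s_\eta^2 \, 1\{ |s_\eta| \ge \sqrt{n} \epsilon \sigma_n / |a^T Z_i| \} \\ &\lesssim \frac{1}{n} \sum_{i=1}^n P_{\eta_0} s_\eta^2 \, 1\{ |s_\eta| \ge \sqrt{n} \epsilon \sigma_n / |a^T Z_i| \} \\ &\le P_{\eta_0} s_\eta^2 \, 1\{ |s_\eta| \gtrsim \sqrt{n} \epsilon \} = o(1), \end{aligned}$$

for every $\epsilon > 0$ and large enough $n$. By the weak convergence of (3.13) to a tight Gaussian process, (3.13) is uniformly $d_2$-equicontinuous in probability (see Section 1.5 of [33]), because,

$$P_0 \left[ \frac{a^T}{\sqrt{n} \sigma_n} \left( \dot\ell^{(n)}_{\theta_0,\eta} - \dot\ell^{(n)}_{\theta_0,\eta'} \right) \right]^2 = \frac{1}{n \sigma_n^2} \sum_{i=1}^n a^T Z_i Z_i^T a \, P_{\eta_0} (s_\eta - s_{\eta'})^2 = d_2^2(\eta, \eta'),$$

for every $\eta, \eta' \in H^N$. Since for every , by the definition of asymptotic equicontinuity, we have,

$$\sup \left\{ \left| \frac{a^T \left( \dot\ell^{(n)}_{\theta_0,\eta} - \dot\ell^{(n)}_{\theta_0,\eta_0} \right)}{\sigma_n} \right| : d_2(\eta, \eta_0) < \delta_n, \ \eta \in H^N \right\} = o_{P_0}(n^{1/2}),$$

for every $\delta_n \downarrow 0$. Since $\sigma_n$ is bounded away from zero for large $n$ and $a$ is arbitrary, (3.12) implies (2.2).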

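For (2.3), note that,

$$\| V_{n,\eta} - V_{n,\eta_0} \| = \| (v_\eta - v_{\eta_0}) Z_n \| = |v_\eta - v_{\eta_0}| \cdot \| Z_n \| = \rho_{\max}(Z_n) \cdot |v_\eta - v_{\eta_0}|,$$

and $\rho_{\max}(Z_n) = O(1)$ because covariates are bounded. Since,

$$|v_\eta - v_{\eta_0}| = \left| P_{\eta_0} (s_\eta - s_{\eta_0}) s_{\eta_0} \right| \lesssim d_2(\eta, \eta_0),$$

by the Cauchy-Schwarz inequality, we have $\| V_{n,\eta} - V_{n,\eta_0} \| \lesssim d_2(\eta, \eta_0)$, and thus (3.12) implies (2.3).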

Finally, since and , (2.4) holds trivially because . ∎
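The Cauchy-Schwarz step used for (2.3) can be made explicit as $|v_\eta - v_{\eta_0}| \le d_2(\eta, \eta_0) \, (P_{\eta_0} s_{\eta_0}^2)^{1/2}$. The sketch below checks this numerically for one assumed pair of densities: the truth $\eta_0$ is standard normal (score function $x$) and $\eta$ is standard logistic (score function $\tanh(x/2)$), both under the convention $s_\eta = -(\log \eta)'$. The sign convention and the densities are assumptions of the example; the inequality does not depend on the sign.

```python
import numpy as np

grid = np.linspace(-12.0, 12.0, 24_001)
dx = grid[1] - grid[0]
phi = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)   # eta0 = N(0, 1)

def expect_eta0(values):
    """Riemann-sum approximation of P_{eta0} f for f evaluated on the grid."""
    return np.sum(values * phi) * dx

s_eta0 = grid                     # score function of N(0, 1), assumed sign convention
s_eta = np.tanh(grid / 2.0)       # score function of the standard logistic density

v_eta = expect_eta0(s_eta * s_eta0)        # v_eta = P_{eta0}(s_eta s_eta0)
v_eta0 = expect_eta0(s_eta0 ** 2)          # v_eta0 = P_{eta0} s_eta0^2
d2 = np.sqrt(expect_eta0((s_eta - s_eta0) ** 2))   # d_2(eta, eta0) from (3.11)

lhs = abs(v_eta - v_eta0)
rhs = d2 * np.sqrt(v_eta0)
print("|v_eta - v_eta0| =", lhs)
print("d_2 * sqrt(v_eta0) =", rhs)
```

Because $V_{n,\eta} = v_\eta Z_n$, a bound on the scalar $|v_\eta - v_{\eta_0}|$ transfers directly to the operator norm of $V_{n,\eta} - V_{n,\eta_0}$ once the covariates are bounded.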

### 3.3 Proof of Condition B

We need the following lemma, the proof of which is found in Section A.1.3.

###### Lemma 3.1.

Under the conditions in Theorem 3.1, there exists such that for every sufficiently small and large enough , and imply and .

Under the conditions in Theorem 3.1, it is well known (see Theorem 4 of [14]) that,

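$$P_0^{(n)} \Pi\left( (\theta, \eta) \in \Theta_{n,1} \times H_{n,1} : h_n\big((\theta, \eta), (\theta_0, \eta_0)\big) \le M_n \epsilon_n \,\middle|\, X^{(n)} \right) \to 1, \tag{3.14}$$

for every $M_n \uparrow \infty$. Thus Lemma 3.1 implies (2.5).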

For (2.6), let be a sufficiently small constant and be a real sequence such that and . Also, let