The Semiparametric Bernstein–von Mises Theorem for Models with Symmetric Error
Submitted in fulfillment of the requirements
for the degree of
Doctor of Philosophy
Department of Statistics
College of Natural Sciences
Seoul National University
In a smooth semiparametric model, the marginal posterior distribution of the finite dimensional parameter of interest is expected to be asymptotically equivalent to the sampling distribution of frequentist efficient estimators. This is the assertion of the so-called Bernstein-von Mises theorem, which has recently been proved in many interesting semiparametric models. In this thesis, we consider the semiparametric Bernstein-von Mises theorem in models with symmetric errors. The simplest example is the symmetric location model, which has a one-dimensional location parameter and an unknown symmetric error. Linear regression and random effects models are also included, provided the error distribution is symmetric. The condition required of nonparametric priors on the error distribution is very mild, and the well-known Dirichlet process mixture of normals satisfies it. As a consequence, Bayes estimators in these models satisfy frequentist criteria of optimality such as the Hájek-Le Cam convolution theorem. The proof of the main result requires that the expected log likelihood ratio has a certain quadratic expansion, which is a special property of symmetric densities. One of the main contributions of this thesis is to provide an efficient estimator of the regression coefficients in the random effects model, in which it was not previously known how to estimate the coefficients efficiently because full likelihood inference is difficult. Our theorems imply that the posterior mean or median is efficient, and the results of numerical studies also show the superiority of Bayes estimators. For practical use of our main results, efficient Gibbs sampler algorithms based on symmetrized Dirichlet process mixtures are provided.
Semiparametric Bernstein-von Mises theorem,
Linear regression with symmetric error,
mixture of normal densities,
Dirichlet process mixture
- 1 Introduction
- 2 Literature reviews
- 3 Main results
- 3.1 Semiparametric Bernstein-von Mises theorem
- 3.2 Quadratic expansion of the expected log likelihood ratio
- 3.3 Examples
- 4 Numerical studies
- 5 Conclusion
- A Miscellanies
Chapter 1 Introduction
It is a fundamental problem in statistics to make an optimal decision for a given statistical problem. Every statistical inference is based on the observed data, but we rarely know the sampling distribution of a given estimator with finite samples. As a result, it is rarely possible in practice to find an optimal estimator. In many interesting examples, however, the sampling distribution of an estimator converges to a specific distribution as the number of observations increases, and it is possible to estimate this limit. Therefore statistical inferences and theories of optimality are usually based on these asymptotic properties. For example, Fisher conjectured that the maximum likelihood estimator would be efficient, and in the middle of the 20th century many statisticians solved this problem under different assumptions.
In this thesis, we prove that statistical inferences based on Bayesian posterior distributions are efficient in some semiparametric problems. More specifically, we prove the semiparametric Bernstein-von Mises (BvM) theorem in some models which have symmetric errors. In these models, the observation can be represented by
where and . Here is non-random and can be parametrized by the location parameter or the regression coefficient with explanatory variables. The error distribution is assumed to be symmetric in the sense that , where means that the distributions of both sides are the same. Since the error distribution is completely unknown except for its symmetry, these are semiparametric estimation problems. The symmetric location model, linear regression with unknown error, and the random effects model are all of this form, and each has useful implications. The assertion of the semiparametric BvM theorem is, roughly speaking, that the marginal posterior distribution for the parameter of interest is asymptotically normal, centered on an efficient estimator, with variance equal to the inverse of the Fisher information matrix. As a result, statistical inferences based on the posterior distribution satisfy frequentist criteria of optimality.
Before the 1970s, putting a prior, which is always a delicate and difficult problem in Bayesian analysis, posed conceptual, mathematical, and practical difficulties in infinite dimensional models. The discovery of Dirichlet processes by Ferguson was a breakthrough. This prior is easy to elicit, has a large support, and the posterior distribution is analytically tractable. After this discovery, there has been growing interest in Bayesian nonparametric statistics, and over the last few decades there has been remarkable development in many fields of science and industry. Useful models, priors and efficient computational algorithms have been developed in broad areas, and convenient statistical software has been provided to analyze data of various forms. In particular, the development of Markov chain Monte Carlo algorithms, along with the improvement of computing technologies, has boosted Bayesian methodologies because they are very flexible and can be applied to complex and highly structured data, while frequentist methods may have difficulty analyzing such data. More recently, there has been considerable progress on the asymptotic behavior of posterior distributions.
While the BvM theorem for parametric Bayesian models is well established (e.g. Le Cam, Kleijn and van der Vaart), the non- or semiparametric BvM theorem has been actively studied since Cox and Freedman gave negative examples of it. BvM theorems have been proved for various models, including survival models (Kim and Lee, Kim), Gaussian regression models with an increasing number of parameters (Bontemps, Johnstone, Ghosal et al.), and discrete probability measures (Boucheron and Gassiat). In addition, general sufficient conditions for non- or semiparametric BvM theorems are given by Shen, Castillo, Bickel and Kleijn, and Castillo and Rousseau. Those sufficient conditions, however, are rather abstract and not easy to verify. In particular, it is difficult to apply these general theories to models with unknown errors, in which the quadratic expansion of the likelihood ratio is not straightforward. More recently, Castillo and Nickl [13, 14] have established fully infinite-dimensional BvM theorems by considering weaker topologies than the classical spaces.
We consider the semiparametric BvM theorem in models of the form (1.1). There is a vast literature on frequentist efficient estimation in these models. For example, for the symmetric location model, where 's are i.i.d. with mean , we refer to Beran, Stone, Sacks and references therein. A more elegant and practical method using kernel density estimation can be found in Park. This approach can be easily extended to estimate the regression coefficient in the linear regression model. Bickel also provides an efficient estimator for the linear regression model.
Bayesian analysis of the symmetric location model has also received much attention since Diaconis and Freedman showed that a careless choice of a prior on leads to an inconsistent posterior. Posterior consistency of the symmetric location model with a Polya tree prior is proved by Ghosal et al., posterior consistency of more general regression models has been studied by Amewou-Atisso et al. and Tokdar, and the posterior convergence rate with a Dirichlet process mixture prior has been derived by Ghosal and van der Vaart. But the efficiency of the Bayes estimators, that is, the semiparametric BvM theorem, has not yet been proved in such models. We prove that it holds when the error distribution is endowed with a Dirichlet process mixture of normals prior. Furthermore, we show that the Bayes estimators in random effects models, where the error and random effects distributions are unknown except that they are symmetric about the origin, are also efficient. In the random effects model, full likelihood inference is known to be difficult because the likelihood can only be obtained by integrating out the random effects.
The remainder of the thesis is organized as follows. In Chapter 2, we review three topics in asymptotic statistics which are prerequisites for our main results. In Section 2.1, we introduce local asymptotic normality and the associated frequentist optimality theories. Some empirical process techniques are given in Section 2.2, and the last section provides asymptotic theories of nonparametric Bayesian statistics. The main results are given in Chapter 3. The first section proves a general semiparametric BvM theorem which requires two conditions: the integral local asymptotic normality and convergence of the marginal posterior at the parametric rate. These two conditions are studied in more depth in the following subsections, where it is required that the expectation of the log likelihood ratio allows a certain quadratic expansion. Section 3.2 proves this condition using the property of symmetric densities. The last section of this chapter provides the three examples mentioned above: the location, linear regression and random intercept models. Some numerical studies, which show the superiority of Bayes estimators in random effects models, are provided in Chapter 4. A useful Gibbs sampler algorithm is given in the first section of this chapter. A real dataset is also analyzed in Section 4.3. Concluding remarks and future work are given in Chapter 5, and miscellanies that are required for the main theorems and examples are given in the Appendix. Section A.1 is devoted to proving posterior consistency when the model is slightly misspecified and observations are independent but not identically distributed. Some technical lemmas for semiparametric mixture models, such as bounded entropy and prior positivity conditions, are given in Section A.2. The last section presents properties of symmetrized Dirichlet processes and Gibbs sampler algorithms using symmetrized Dirichlet process mixtures.
Before going further, we introduce the notation used in this thesis. For a real-valued function defined on a subset of , the first, second and third derivatives are denoted by , and , respectively. If the domain of is a subset of for , then and denote the gradient vector and Hessian matrix. Also, and denote the first and second order partial derivatives of with respect to the corresponding indices. The Euclidean norm is denoted by . For a matrix , represents the operator norm, defined as , of , and if is a square matrix, and denote the minimum and maximum eigenvalues of . The capital letters etc. are the probability measures corresponding to the densities denoted by lowercase letters , etc., and vice versa. The corresponding log densities are written with the letter , etc. The Hellinger and total variation metrics between two probability measures and are defined by
and , respectively, where is a measure dominating both and . Let be the Kullback-Leibler divergence. The metrics and Kullback-Leibler divergence are sometimes denoted like, for example, using the corresponding densities. The expectation of a random variable under a probability measure is denoted by . The notation always represents the true probability which generates the observation. Finally, is the probability measure of the multivariate normal distribution with mean and variance , and denotes the univariate normal density with mean 0 and variance .
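As a small numerical companion to these definitions, the following sketch computes the three quantities for discrete distributions, where the dominating measure is the counting measure. The normalizations are assumptions made for illustration (squared Hellinger distance without a factor 1/2, total variation as the full L1 distance between densities) and may differ from the conventions of this thesis by constant factors; the function names are hypothetical.

```python
import math

def hellinger_sq(p, q):
    # Squared Hellinger distance: sum of (sqrt(p_i) - sqrt(q_i))^2.
    # Assumed normalization: no factor 1/2.
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def total_variation_l1(p, q):
    # Total variation taken as the L1 distance between the densities
    # (assumed normalization; equals 2 * sup_A |P(A) - Q(A)|).
    return sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    # Kullback-Leibler divergence K(P, Q) = sum of p_i * log(p_i / q_i).
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
h2 = hellinger_sq(p, q)
d_v = total_variation_l1(p, q)
kl = kl_divergence(p, q)
print(h2, d_v, kl)
```

Under these normalizations the standard inequalities h² ≤ d_V ≤ 2h and h² ≤ K hold, which is why the three quantities can often be used interchangeably when only rates matter.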
Chapter 2 Literature reviews
This chapter briefly reviews three topics in asymptotic statistics. Each topic is closely related to our main results and provides essential techniques for the proofs in this thesis. Section 2.1 introduces some results derived from local asymptotic normality, which is a key property in classical asymptotic theory. In Section 2.2, modern empirical process theory is reviewed. The last section is devoted to introducing Bayesian asymptotics, including the parametric BvM theorem and theories for infinite dimensional models.
2.1 Local asymptotic normality
A sequence of statistical models is locally asymptotically normal if, roughly speaking, the likelihood ratio behaves like that for a normal location parameter. This implies that the likelihood ratio admits a certain quadratic expansion. An important example is a smooth parametric model, the so-called regular parametric model. If a model is locally asymptotically normal, estimating the model parameter can be understood as a problem of estimating the normal mean in an asymptotic sense. As a result, it satisfies some asymptotic optimality criteria such as the convolution theorem and the local asymptotic minimax theorem. There is a large literature on local asymptotic normality and related asymptotic theories. Here we refer to two well-known books, Bickel et al. and van der Vaart, which contain many references and examples.
In this section, we only consider i.i.d. models because they contain all the essentials of local asymptotic normality. For i.i.d. models, a sequence of statistical models can be represented as a collection of probability measures for a single observation. An extension to non-i.i.d. models, including both finite and infinite dimensional models, is well established in McNeney and Wellner. Consider a statistical model parametrized by a finite dimensional parameter and assume that is an open subset of . The model is called locally asymptotically normal, or simply LAN, at if there exists a function such that and for every converging sequence in ,
as , where . The function and matrix are called the score function and Fisher information matrix, respectively. Le Cam formulated the first version of the LAN property as early as 1953 in his thesis. This original version can be found, for example, in Le Cam and Yang. Note that the likelihood ratio of the normal location model with a single observation is given by
where is the multivariate normal density with mean and variance . Since the term in (2.1) converges in distribution to the normal distribution , it is clear that the local log likelihood ratio (2.1) converges in distribution to the log likelihood ratio of the normal location model in which . The name LAN originated from this fact.
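The exactness of the quadratic expansion in the normal location model can be checked numerically. The sketch below assumes a univariate N(θ, 1) model (so the Fisher information is 1) and compares the local log likelihood ratio at θ + h/√n with h·Δ_n − h²/2, where Δ_n is the normalized sum of scores; the variable names and the chosen values of n, θ and h are illustrative assumptions.

```python
import math
import random

def normal_logpdf(x, mean):
    # Log density of N(mean, 1) at x.
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mean) ** 2

random.seed(0)
n, theta, h = 1000, 0.3, 1.7
xs = [random.gauss(theta, 1.0) for _ in range(n)]

# Local log likelihood ratio between theta + h / sqrt(n) and theta.
local = theta + h / math.sqrt(n)
log_lr = sum(normal_logpdf(x, local) - normal_logpdf(x, theta) for x in xs)

# Quadratic form h * Delta_n - h^2 / 2, with the score of N(theta, 1)
# equal to x - theta and Fisher information equal to 1.
delta_n = sum(x - theta for x in xs) / math.sqrt(n)
quadratic = h * delta_n - 0.5 * h ** 2

lan_gap = abs(log_lr - quadratic)
print(lan_gap)
```

In this model the remainder term vanishes identically, so the gap is pure floating-point error; in a general smooth model the same comparison would leave a remainder that tends to zero in probability.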
One important result is that every smooth parametric model is LAN. Here the smoothness of a model can be expressed in terms of quadratic mean differentiability. A model is called differentiable in quadratic mean at if it is dominated by a σ-finite measure and there exists an -function such that
as . This is actually the Hadamard (equivalently Fréchet) differentiability of the root density which can be established by pointwise differentiability plus a convergence theorem for integrals. A proof of the following theorem can be found in Theorem 7.2 of van der Vaart .
Theorem 2.1.1.
Assume that is open in and is differentiable in quadratic mean at . Then, , exists, and the LAN assertion (2.1) holds.
A more general statement of LAN can be found in Strasser. With the help of the LAN property, Fisher's early concept of efficiency can be sharpened and elaborated upon. We state three optimality theorems by Le Cam and Hájek, which can be derived from the LAN property. Besides the original references, we refer to Chapter 8 of van der Vaart as a good textbook treatment. An estimator sequence is called regular at if, for every ,
for some probability distribution . Here denotes the distribution of when follows the probability measure and represents convergence in distribution. Note that the limit distribution does not depend on , and this is the key assumption for regularity of an estimator. Let be the convolution operator. The most important theorem about asymptotic optimality is arguably the Hájek-Le Cam convolution theorem (Hájek, Le Cam), stated as follows.
Theorem 2.1.2 (Convolution).
Assume that is open in and is LAN at with the nonsingular Fisher information matrix . Then for any regular estimator sequence for , there exist probability distributions such that
where is the limit distribution in (2.2).
Theorem 2.1.2 says that, within the class of all regular estimators, the normal distribution is the best possible limit distribution. However, some estimator sequences of interest, such as shrinkage estimators, are not regular. A typical example is the Hodges superefficient estimator
for the normal location parameter. Here is an arbitrary positive constant which is strictly smaller than 1. In this case, is √n-consistent, that is , and asymptotically normal, but superefficient at 0 (its asymptotic variance there is smaller than that of the MLE). Interestingly, the set of superefficiency is of Lebesgue measure zero, and this can be proved in general situations (Le Cam).
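Superefficiency at 0 can be illustrated by a small Monte Carlo experiment. The sketch below assumes the common form of the Hodges estimator, which equals the sample mean when its absolute value is at least n^{-1/4} and shrinks it by a constant factor a < 1 otherwise; the threshold, the value a = 0.5, and the function names are assumptions for illustration. Since the sample mean of N(θ, 1) observations is N(θ, 1/n), it is drawn directly.

```python
import math
import random

def hodges(xbar, n, a=0.5):
    # Keep the sample mean away from zero, shrink it by the factor a near zero.
    # The threshold n^(-1/4) is an assumed (standard) choice.
    return xbar if abs(xbar) >= n ** (-0.25) else a * xbar

def scaled_risk(theta, n, reps=20000, a=0.5, seed=1):
    # Monte Carlo estimate of n * E(T_n - theta)^2 under N(theta, 1) data.
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, sd)  # sample mean is N(theta, 1/n)
        total += (hodges(xbar, n, a) - theta) ** 2
    return n * total / reps

risk_at_zero = scaled_risk(0.0, n=10_000)
risk_at_one = scaled_risk(1.0, n=10_000)
print(risk_at_zero, risk_at_one)
```

With these settings the scaled risk at θ = 0 is close to a² = 0.25, below the value 1 attained by the MLE, whereas at θ = 1 the estimator coincides with the sample mean and the scaled risk is close to 1.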
Theorem 2.1.3.
Assume that is open in and is LAN at with the nonsingular Fisher information matrix . Let be an estimator sequence such that converges to a limit distribution under every . Then, there exist probability distributions such that
for Lebesgue almost every .
Though the set of superefficiency is a null set, the above theorem may not be fully satisfactory because it gives no information about particular parameters, which may be important as in Hodges' example. Furthermore, an estimator sequence is required to be √n-consistent in Theorem 2.1.3. The following theorem, which can be found in Theorem 8.11 of van der Vaart, is a refined version of the so-called local asymptotic minimax theorem (Hájek, Le Cam et al.). A function is called a bowl-shaped loss if the sublevel sets are convex and symmetric about the origin. It is called subconvex if, moreover, these sets are closed.
Theorem 2.1.4 (Local asymptotic minimax).
Assume that is open in and is LAN at with the nonsingular Fisher information matrix . Then, for any estimator sequence and bowl-shaped loss function ,
where the first supremum is taken over all finite subsets of .
According to the three theorems above we conclude that the normal distribution is the best possible limit distribution. An estimator sequence is called efficient or best regular if it is regular and
as . A well-known fact (see, for example, van der Vaart) is that every efficient estimator is asymptotically linear, as stated in the following theorem.
Theorem 2.1.5.
An estimator sequence is efficient if and only if
So far we have studied asymptotic optimality of an estimator sequence in a smooth parametric model. The two theorems, the convolution theorem and local asymptotic minimax theorem, have natural extensions in infinite dimensional models. Typically an infinite dimensional parameter is not estimable at rate (van der Vaart ). It is possible, however, to estimate some finite dimensional parameters at this rate even in an infinite dimensional model. The central limit theorem, by which mean parameters are estimable at parametric rate, is a representative example. Under regularity conditions, moreover, some estimators can be shown to be asymptotically optimal in the sense of the convolution theorem and local asymptotic minimax theorem as in parametric models.
We first define the tangent set and tangent space. For a given statistical model containing , consider a one-dimensional submodel passing through at and differentiable in quadratic mean. By the differentiability we get the score function at from this submodel. Letting range over the collection of all such submodels, we obtain the collection of score functions, which is called the tangent set of the model at . The closed linear span of the tangent set in , denoted by , is called the tangent space of at .
Since our main interest in Chapter 3 is to estimate a finite dimensional parameter in a semiparametric model, we only consider the information bound for a semiparametric model , where is the finite dimensional parameter of interest and is the infinite dimensional nuisance parameter. For more general theory, readers are referred to two books: van der Vaart and Bickel et al. Fix , and define two submodels and . Assume that is differentiable in quadratic mean and let be the score function at . Then it is easy to show that is equal to the set of all , where ranges over . The function defined by
is called the efficient score function and the matrix is the efficient information matrix, where is the orthogonal projection onto in . For defining the information for estimating , if , then it is enough to consider one-dimensional smooth (differentiable in quadratic mean) submodels of type
for . An estimator sequence is regular for estimating if it is regular in every such submodel, that is
for some which does not depend on . The following two theorems are extensions of the convolution theorem and local asymptotic minimax theorem, respectively, to semiparametric models.
Theorem 2.1.6 (Convolution).
Assume that , is convex and is nonsingular. Then, every limit distribution of a regular sequence of estimators can be written for some probability distribution .
Theorem 2.1.7 (Local asymptotic minimax).
Assume that , is convex and is nonsingular. Then for any estimator sequence and subconvex loss function ,
where the first supremum is taken over all finite index sets of one-dimensional smooth submodels, denoted by , of type (2.3).
As in parametric models, the normal distribution can be considered as the best possible limit distribution. A regular estimator sequence is called efficient or best regular if it is regular and its limit distribution is . An efficient estimator is asymptotically linear as in Theorem 2.1.5, replacing the score function and information matrix by the efficient score function and efficient information matrix.
An estimator sequence is efficient if and only if
Roughly speaking, the information bound of a semiparametric model is equal to the infimum of the information bounds of all smooth parametric submodels. If there is a smooth parametric submodel whose information bound achieves this infimum, it is the hardest submodel. Formally, in a smooth semiparametric model, if there exists a submodel which has as the score function at , it is called a least favorable submodel at . There may be more than one least favorable submodel, or none may exist. Typically, if a maximizer of the map is smooth in , it constitutes a least favorable submodel (Severini and Wong; Murphy and van der Vaart).
We finish this section with the notion of adaptiveness. A smooth semiparametric model is called (locally) adaptive (at ) if in . By definition, the efficient score function and efficient information matrix are equal to the ordinary score function and information matrix in adaptive models. Therefore the information bound for the semiparametric model and that for the parametric model , in which the true nuisance parameter is known, are the same.
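The classical example of adaptiveness is the symmetric location model: the location score is an odd function, while score functions of perturbations that keep the density symmetric are even, so the two are orthogonal and the projection in the efficient score vanishes. The sketch below checks this numerically for a logistic error density, which is chosen purely for illustration; the even direction x² − π²/3 and the helper names are likewise assumptions, not objects taken from this thesis.

```python
import math

def logistic_density(x):
    # Standard logistic density, written as 1 / (4 cosh^2(x / 2)).
    return 1.0 / (4.0 * math.cosh(x / 2.0) ** 2)

def location_score(x):
    # Score for the location parameter at 0: -f'(x) / f(x) = tanh(x / 2), an odd function.
    return math.tanh(x / 2.0)

def even_direction(x):
    # An even function with mean zero under the logistic law (variance pi^2 / 3),
    # standing in for a nuisance score that preserves symmetry.
    return x * x - math.pi ** 2 / 3.0

def integrate(func, lo=-40.0, hi=40.0, steps=80_000):
    # Composite trapezoidal rule on a grid symmetric about the origin.
    width = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * width)
    return total * width

inner_product = integrate(
    lambda x: location_score(x) * even_direction(x) * logistic_density(x)
)
fisher_information = integrate(
    lambda x: location_score(x) ** 2 * logistic_density(x)
)
print(inner_product, fisher_information)
```

The inner product is zero up to rounding, and the Fisher information for location equals 1/3 for the logistic density, so in this example not knowing the symmetric error density costs nothing asymptotically.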
2.2 Empirical processes
In this section we review modern empirical process theory, which plays an important role in the proofs given in Chapter 3. We assume that readers are familiar with weak convergence of probability measures in metric spaces. Also, we do not state any measurability conditions, because the formulation of these would require too many digressions. For all details about this section and further reading, including historical remarks and examples, we refer to the monograph of van der Vaart and Wellner.
Consider a sample of random elements in a measurable space , where is endowed with a semimetric (under a semimetric, the distance between two distinct points can be equal to 0). Let be the empirical measure and be the empirical process, where denotes the Dirac measure at point . Consider a collection of measurable functions . With the notation , if
in -probability, is called a Glivenko-Cantelli class, or simply Glivenko-Cantelli class. Under the condition for every , the empirical process can be viewed as an -valued random element. If this map converges weakly to a tight Borel measurable element in , it is called a Donsker class, or -Donsker to be more complete.
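The classical instance of the Glivenko-Cantelli property is the class of indicator functions of half-lines, for which the supremum above is the Kolmogorov distance between the empirical and true distribution functions. The sketch below computes this supremum for Uniform(0, 1) samples; the choice of distribution, the sample sizes and the function name are illustrative assumptions.

```python
import random

def sup_distance_uniform(sample):
    # sup over t of |F_n(t) - t| for a sample from Uniform(0, 1).
    # The supremum is attained just before or at an order statistic.
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(2)
distances = {
    n: sup_distance_uniform([rng.random() for _ in range(n)])
    for n in (100, 10_000)
}
print(distances)
```

By the Dvoretzky-Kiefer-Wolfowitz inequality, the probability that this supremum exceeds ε is at most 2·exp(−2nε²), so for n = 10,000 it is smaller than 0.05 except on an event of negligible probability.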
The Donsker property is very important and closely related to the notion of tightness. Before going further, we introduce some definitions and theorems about stochastic processes in spaces of bounded functions. A sequence of -valued stochastic processes is asymptotically tight if for every there exists a compact set such that
for every . Here is defined by the set . This is slightly weaker than uniform tightness but enough to assure the weak convergence. For an index set , weak convergence in is characterized as asymptotic tightness plus convergence of marginals as stated in the following theorem.
Theorem 2.2.1.
A sequence of -valued stochastic processes converges weakly to a tight limit if and only if is asymptotically tight and the marginals converge weakly to a limit for every finite subset of .
Asymptotic tightness is quite a complicated concept, and it is closely related to equicontinuity of the sample paths of stochastic processes. For a semimetric space , a sequence of -valued stochastic processes is said to be asymptotically uniformly -equicontinuous in probability if for every there exists a such that
The following theorem describes the relationship between asymptotic tightness and asymptotic uniform equicontinuity of sample paths.
Theorem 2.2.2.
A sequence of stochastic processes indexed by is asymptotically tight if and only if is asymptotically tight in for every and there exists a semimetric on such that is totally bounded and is asymptotically uniformly -equicontinuous in probability. If, moreover, converges weakly to , then almost all paths are uniformly -continuous and the semimetric can be taken equal to any semimetric for which this is true and is totally bounded.
A stochastic process is called Gaussian if each of its finite-dimensional marginals has a multivariate normal distribution on Euclidean space. For a given stochastic process , define a semimetric on by
When the limit process in Theorem 2.2.2 is Gaussian, can always be used to establish asymptotic equicontinuity of a sequence .
A Gaussian process in is tight if and only if is totally bounded and almost all paths are uniformly -continuous.
Now we return to empirical processes on . By the central limit theorem, a marginal distribution converges weakly to a normal distribution. Therefore if the stochastic process is asymptotically tight, then is a Donsker class by Theorem 2.2.1. Since asymptotic tightness is conceptually equivalent to the uniform equicontinuity of sample paths by Theorem 2.2.2, we can expect from the Arzelà-Ascoli theorem that the Donsker property can be determined by the covering number. The covering number of with respect to a semimetric is the minimal number of balls of radius needed to cover the set . Given two functions and , the bracket is the set of all functions with . An -bracket is a bracket with . The bracketing number is the minimum number of -brackets needed to cover . Then it is easy to show that
for every . Define
for . A collection of functions is a Donsker class if the covering number or bracketing number is suitably bounded. We only introduce conditions on bracketing numbers and refer to Section 2.6 of van der Vaart and Wellner  for conditions on covering numbers.
Theorem 2.2.4.
is -Donsker if .
The condition of Theorem 2.2.4 is very simple and is satisfied in many interesting examples. For classes of smooth functions on a Euclidean space, we can find an upper bound for bracketing numbers. To define such classes let, for a given function and ,
where the suprema are taken over all in the interior of with , the value is the greatest integer strictly smaller than , and for each vector of integers is the differential operator
These are well-known -Hölder norms. Let be the set of all continuous functions with .
Let be a partition into cubes of uniformly bounded size, and be a class of functions such that the restrictions of onto belong to for every and some fixed . Then, there exists a constant depending only on and the uniform bound on the diameter of the sets such that
Theorem 2.2.4 concerns the empirical process for different , but each time with the same indexing class . This is enough for many applications, but sometimes it may be necessary to allow the class to change with . The following theorem is a modification of Theorem 2.2.4 for this purpose.
Theorem 2.2.6.
Let be a class of measurable functions indexed by a totally bounded semimetric space satisfying
and assume that there exists a function such that , and for all . If for every and converges pointwise on , then the sequence converges weakly to a tight Gaussian process.
Theorems 2.2.4 and 2.2.6 only consider empirical processes of i.i.d. observations. We finish this section with an extension of the Donsker theorem to the case of independent but not identically distributed processes. The following theorem is an extension of the Jain-Marcus central limit theorem (Jain and Marcus), and the proof can be found in Theorems 2.11.9 and 2.11.11 of van der Vaart and Wellner.
Theorem 2.2.7.
For each , let be independent stochastic processes indexed by an arbitrary index set . Suppose that there exist independent random variables , and a semimetric such that
for every ,
Furthermore assume that
for every . Then the sequence is asymptotically uniformly -equicontinuous in -probability. Moreover, it converges to a tight Gaussian process provided the sequence of covariance functions converges pointwise on .
2.3 Bayesian asymptotics
Over the last few decades, there has been remarkable activity in the development of nonparametric Bayesian statistics. This section reviews some frequentist properties of Bayesian procedures in infinite dimensional models. There are books on nonparametric Bayesian statistics such as Ghosh and Ramamoorthi and Hjort et al., but they are not fully satisfactory because many important theories and examples have been developed quite recently. Here we focus on the asymptotic behavior of posterior distributions when i.i.d. observations are given.
Let be a random sample in a metric space with the Borel σ-algebra . Consider a statistical model , where the parameter space is equipped with a metric . Let be a prior on , that is, a probability measure on the Borel σ-algebra of . Any version of the conditional distribution of given is called a posterior distribution and denoted by . We assume that there exists a σ-finite measure on dominating all . In this case, using Bayes' rule, the posterior distribution is given by
for all .
A prior and data yield the posterior, and this subjective approach does not require any consideration of what happens if further data arise. However, one may be interested in the asymptotic behavior of the posterior distribution, which can be seen as a frequentist viewpoint. A frequentist typically assumes that there exists a true distribution which generates the observations . Throughout this section, we assume that for some , and under this assumption the posterior distribution is expected to concentrate around the true parameter .
Before going to infinite-dimensional models, we begin with parametric models. In a smooth parametric model, the posterior distribution is asymptotically normal, centered on a best regular estimator, with variance equal to the inverse of the Fisher information matrix. This is the so-called BvM theorem, which was proved by many authors. The following theorem is considerably more elegant than the results of early authors, and proofs can be found, for example, in Le Cam and in Le Cam and Yang.
Theorem 2.3.1 (Bernstein-von Mises).
Assume that a parametric model is differentiable in quadratic mean at with nonsingular Fisher information matrix . Furthermore suppose that for every there exists a sequence of tests such that
If the prior has continuous and positive density in a neighborhood of , then the corresponding posterior distributions satisfy
in -probability, where is a best regular estimator and the supremum is taken over all Borel sets.
Since best regular estimators are asymptotically equivalent up to terms, the centering sequence in the BvM theorem can be any best regular estimator sequence. An important consequence of the BvM theorem is that the posterior mean is an efficient estimator and Bayesian credible sets are asymptotically equivalent to frequentists’ confidence intervals. This implies that statistical inferences based on the posterior distribution are as optimal as those based on maximum likelihood estimators.
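The normal approximation can be checked numerically in a toy case. The sketch below is an illustration only, not part of the thesis: it assumes a Bernoulli model with a uniform prior, so that the posterior is Beta(1 + s, 1 + n - s), and it approximates the total variation distance between this posterior and the normal law centered at the maximum likelihood estimator with variance equal to the inverse Fisher information divided by n. The function names and the midpoint-rule grid are hypothetical choices.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def tv_posterior_vs_normal(n, successes, grid=20000):
    """Approximate total variation distance between the Beta posterior
    (uniform prior, an assumption of this sketch) and N(mle, mle(1-mle)/n).
    Requires 0 < successes < n so that the normal variance is positive."""
    mle = successes / n
    sd = math.sqrt(mle * (1.0 - mle) / n)
    a, b = 1 + successes, 1 + n - successes
    h = 1.0 / grid
    l1 = 0.0
    # Midpoint rule for the L1 distance between the two densities on (0, 1).
    for i in range(grid):
        x = (i + 0.5) * h
        l1 += abs(beta_pdf(x, a, b) - normal_pdf(x, mle, sd)) * h
    # The posterior puts no mass outside (0, 1); the normal law does.
    outside = normal_cdf(0.0, mle, sd) + 1.0 - normal_cdf(1.0, mle, sd)
    return 0.5 * (l1 + outside)


if __name__ == "__main__":
    for n, s in ((20, 6), (200, 60), (2000, 600)):
        print(n, round(tv_posterior_vs_normal(n, s), 4))
```

Holding the success fraction fixed at 0.3 while n grows, the printed distances should shrink, which is the qualitative content of the theorem in this one-dimensional example.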
A sequence of tests is called uniformly consistent for testing versus if
as . Le Cam’s version of the BvM theorem requires the existence of uniformly consistent tests for testing versus for every . Such tests certainly exist if there exist estimators that are uniformly consistent, that is,
for every .
Theorem 2.3.1 is quite general and applies to most smooth parametric models. As in frequentist theory, however, it does not generalize fully to nonparametric estimation problems. In fact, many nonparametric priors do not work well, in the sense that the posterior mass does not concentrate around the true parameter. An important counterexample was found by Diaconis and Freedman [17, 18], who proved that the posterior distribution may be inconsistent even if a very natural nonparametric prior is used. Doss [20, 21, 22] found similar phenomena in the median estimation problem. Before introducing this example, we define posterior consistency rigorously and state an important theorem about consistency proved by Doob . The sequence of posteriors is said to be consistent at (with respect to a metric ) if for every
as . The definition of consistency differs in some texts, where it is defined using almost-sure convergence rather than convergence in probability. More precisely, we say that the posterior is almost-surely consistent at if for every
-almost-surely. Furthermore we say that a sequence is the convergence rate of the posterior distribution at (with respect to a metric ) if for any , we have that
in -probability. As with the definition of posterior consistency, the convergence rate of the posterior can also be defined using almost-sure convergence. Now we state the theorem by Doob .
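The definition of consistency can be illustrated in a conjugate toy model. The sketch below is an illustration under stated assumptions, not an example from the thesis: observations are N(theta, 1), the prior on theta is standard normal, so the posterior is normal with mean sum(x) / (n + 1) and variance 1 / (n + 1), and the posterior mass outside an eps-neighborhood of the true value is available in closed form. The function names, the true value 1.0, and the seed are hypothetical.

```python
import math
import random


def normal_cdf(x, mean, sd):
    """Distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def posterior_mass_outside(data, theta0, eps):
    """Posterior mass of {theta : |theta - theta0| >= eps} when
    X_i ~ N(theta, 1) and theta ~ N(0, 1) a priori (conjugate update)."""
    n = len(data)
    post_mean = sum(data) / (n + 1.0)
    post_sd = math.sqrt(1.0 / (n + 1.0))
    inside = (normal_cdf(theta0 + eps, post_mean, post_sd)
              - normal_cdf(theta0 - eps, post_mean, post_sd))
    return 1.0 - inside


def simulate(theta0=1.0, eps=0.1, sizes=(10, 100, 1000, 10000), seed=0):
    """Posterior mass outside the eps-ball for nested samples of growing size."""
    rng = random.Random(seed)
    data = [rng.gauss(theta0, 1.0) for _ in range(max(sizes))]
    return {n: posterior_mass_outside(data[:n], theta0, eps) for n in sizes}


if __name__ == "__main__":
    for n, mass in simulate().items():
        print(n, mass)
```

Replacing the fixed eps by a multiple of n ** -0.5 would turn the same computation into a check of the parametric convergence rate in this model.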
Theorem 2.3.2.
Suppose that and are both complete and separable metric spaces, and the model is identifiable. Then there exists , with such that is consistent at every .
Doob’s theorem looks very useful, but it says nothing about posterior consistency at a specific . Although the set of inconsistency is a -null set, it may not be negligible when is an infinite-dimensional parameter space. As mentioned above, the Diaconis-Freedman counterexample was a surprising discovery in the Bayesian nonparametric community, as was the case with Hodges’ superefficient estimator. Before the discovery of this counterexample, it was believed that most priors work well except in some abnormal examples. To explain the Diaconis-Freedman example, we need to mention the Dirichlet process (Ferguson ) prior, which is often considered a starting point of Bayesian nonparametrics. Dirichlet processes are widely used in many fields of science and industry as priors for unknown probability distributions. The definitions of Dirichlet processes and their symmetrized version are given in Section A.3. In the statement of the following theorem, we slightly abuse notation for which is used for the location parameter, not the whole parameter, in a semiparametric location problem.
Theorem 2.3.3.
Consider i.i.d. observations from the well-specified model
where follows an unknown distribution . For the prior, has the standard normal density, and is independently drawn from the symmetrized Dirichlet process with mean the standard Cauchy distribution. Then the posterior is inconsistent at and for some which has infinitely differentiable density , which is compactly supported and symmetric about 0, with a strict maximum at .
An example of inconsistent in Theorem 2.3.3 is illustrated in Figure 2.1. With this , the posterior mass for concentrates around two distinct points for some . To prove posterior consistency at a specific point , the condition by Schwartz  can be a very useful tool. It requires that the prior mass of every Kullback-Leibler neighborhood of the true parameter be positive. Furthermore, a uniformly consistent sequence of tests is required.
Theorem 2.3.4.
Let be a prior on , and assume that the model is dominated by a common σ-finite measure. If for every ,
and there exists a uniformly consistent sequence of tests for testing versus , then the posterior is almost-surely consistent.
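The prior positivity condition in this theorem is usually written in terms of the Kullback-Leibler divergence. A sketch in assumed notation ($K$ for the divergence, $\mu$ for the dominating measure, $p_\theta$ for the densities, none of which is fixed by the text here):

```latex
% Kullback-Leibler support condition, required for every \epsilon > 0
% (assumed notation).
\[
  \Pi\bigl(\theta : K(p_{\theta_0}, p_\theta) < \epsilon\bigr) > 0,
  \qquad
  K(p_{\theta_0}, p_\theta)
  = \int p_{\theta_0} \log \frac{p_{\theta_0}}{p_\theta}\, d\mu .
\]
```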
There are many interesting examples satisfying Schwartz’s condition. Barron et al.  found a sufficient condition, based on bracketing numbers, for consistency with respect to the Hellinger distance. Some extensions to semiparametric models and non-i.i.d. models can be found, for example, in Amewou-Atisso et al.  and Wu and Ghosal . More recently, Walker  found a new sufficient condition for posterior consistency.
Many statisticians are not fully satisfied with posterior consistency alone and want to know how fast the posterior converges to the true parameter. As an extension of Schwartz’s theorem, Ghosal et al.  found sufficient conditions which ensure a certain rate of posterior convergence. Let denote the -packing number of , that is, the maximal number of points in such that the distance between every pair is at least . This is related to the covering number by the inequalities
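In the usual notation, which is an assumption here, $D(\epsilon, \Theta, d)$ denotes the packing number and $N(\epsilon, \Theta, d)$ the covering number, the minimal number of open $d$-balls of radius $\epsilon$ needed to cover $\Theta$. The standard comparison between the two is sketched below. The first inequality holds because the points of a maximal packing are centers of a cover, and the second because a ball of radius $\epsilon/2$ contains at most one point of a packing.

```latex
% Standard comparison of covering and packing numbers (assumed notation).
\[
  N(\epsilon, \Theta, d) \;\le\; D(\epsilon, \Theta, d) \;\le\; N(\epsilon/2, \Theta, d) .
\]
```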
The following general theorem given in Ghosal et al.  is very intuitive and interpretable.
Theorem 2.3.5.
Let be the metric on defined by or . Suppose that for a sequence with and , a constant and sets , we have
Then for sufficiently large , we have that
A sequence is a sieve for . Condition (2.6) requires that the model is not too big. The logarithm of the covering number is called the entropy, and it is often interpreted as the complexity of the model (Birgé , Le Cam ). Under certain conditions, a rate satisfying (2.6) gives the optimal rate of convergence relative to the Hellinger metric. Condition (2.6) ensures the existence of certain tests and could be replaced by a testing condition. Condition (2.7) requires that the prior mass around the true parameter is not too small, and this is a refined version of condition (2.5). Roughly speaking, condition (2.7) says that the prior mass should be spread fairly uniformly over the support of the prior.
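For reference, the conditions and conclusion of this type of theorem are commonly stated as follows. This is a sketch in assumed notation ($\epsilon_n$ for the rate, $\Theta_n$ for the sieve, $C$ and $M$ for constants, $P_{\theta_0} f$ for the expectation of $f$ under the true distribution), and the constants in the thesis may differ. The three displays are, in order, the entropy condition, the condition on the prior mass outside the sieve, and the prior mass condition around the true parameter.

```latex
% Conditions in the commonly stated form (assumed notation and constants).
\[
  \log D(\epsilon_n, \Theta_n, d) \le n \epsilon_n^2 ,
\]
\[
  \Pi(\Theta \setminus \Theta_n) \le \exp\{-n \epsilon_n^2 (C + 4)\} ,
\]
\[
  \Pi\Bigl(\theta :
     -P_{\theta_0} \log \frac{p_\theta}{p_{\theta_0}} \le \epsilon_n^2,\;
     P_{\theta_0} \Bigl(\log \frac{p_\theta}{p_{\theta_0}}\Bigr)^2 \le \epsilon_n^2
  \Bigr) \ge \exp(-n \epsilon_n^2 C) .
\]
% Conclusion, for sufficiently large M:
\[
  \Pi\bigl(\theta : d(\theta, \theta_0) \ge M \epsilon_n \mid X_1,\dots,X_n\bigr) \to 0
  \quad \text{in } P_{\theta_0}^n\text{-probability}.
\]
```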
An important application of Theorem 2.3.5 is to Dirichlet process mixture priors for density estimation problems. Ghosal and van der Vaart  found a tight entropy bound for classes of mixtures of normal densities and obtained the Hellinger convergence rate when the true density is a mixture of normals. Note that this is nearly the parametric rate. Even when the true density is not a mixture of normal densities, a Dirichlet process mixture of normals prior works well if the prior mass for the standard deviation of the normal kernel concentrates around zero as . When the true density is twice continuously differentiable, Ghosal and van der Vaart  proved that a Dirichlet process mixture of normals prior gives the Hellinger convergence rate , which is almost the same as the optimal rate of kernel density estimation.
The conditions in Theorem 2.3.5 may be slightly stronger than required, and more refined versions are given in Ghosal et al. . Shen and Wasserman  independently found similar sufficient conditions for the posterior convergence rate around the same time. More recently, Walker et al.  developed new conditions as an extension of Walker  and provided an example which gives a better convergence rate than previous works. When the model is misspecified, Kleijn and van der Vaart  proved that the posterior converges to the parameter in the support at minimal Kullback-Leibler divergence to the true parameter, at rate as if it were in the support.
Chapter 3 Main results
3.1 Semiparametric Bernstein-von Mises theorem
Consider a sequence of statistical models parametrized by a finite-dimensional parameter of interest and an infinite-dimensional parameter which is usually considered a nuisance parameter. Assume that is an open subset of and has the density with respect to a σ-finite measure . Let be a random element which follows and assume that for some and . We consider a product prior on and denote the posterior distribution by . Assume that is thick at , that is, it has a positive continuous Lebesgue density in a neighborhood of . Also is allowed to depend on , but we abbreviate the notation in for notational simplicity. For a given prior distribution on , let
be the integrated likelihood, where . We begin this section with the statement of a general BvM theorem. The proof is almost identical to that of Theorem 2.1 in Kleijn and van der Vaart  upon replacement of parametric likelihoods with integrated likelihoods. Hereafter, some quantities in proofs may not be measurable, in which case expectations are to be understood as outer integrals, defined through measurable majorants. We refer to Part I of van der Vaart and Wellner  for details.
Assume that the model is endowed with the product prior , where is thick at , and
for every real sequence with . Furthermore, suppose that for given sequences of uniformly tight random vectors and non-random positive definite matrices satisfying , the integrated likelihood (3.1) satisfies
for any compact . Then,