
Bernstein-von Mises Theorems for Functionals of Covariance Matrix

Abstract

We provide a general theoretical framework for deriving Bernstein-von Mises theorems for matrix functionals. The conditions on the functionals and the priors are explicit and easy to check. Results are obtained for various functionals, including entries of the covariance matrix, entries of the precision matrix, quadratic forms, the log-determinant, and eigenvalues, in the Bayesian Gaussian covariance/precision matrix estimation setting, as well as for Bayesian linear and quadratic discriminant analysis.

Keywords. Bernstein-von Mises Theorem, Bayes Nonparametrics, Covariance Matrix.


1 Introduction

The celebrated Bernstein-von Mises (BvM) theorem [20, 3, 29, 21, 27] justifies Bayesian methods from a frequentist point of view; it bridges the gap between Bayesians and frequentists. Consider a parametric model $\{P_\theta:\theta\in\Theta\}$ with $\Theta\subset\mathbb{R}^d$, and a prior distribution $\Pi$ on $\Theta$. Suppose we have i.i.d. observations $X^{(n)}=(X_1,\ldots,X_n)$ from the product measure $P_{\theta_0}^n$. Under some weak assumptions, the Bernstein-von Mises theorem shows that the conditional distribution of

$$\sqrt{n}\,(\theta-\hat\theta)\ \big|\ X^{(n)}$$

is asymptotically $N(0,V_{\theta_0})$ under the distribution $P_{\theta_0}^n$, with some centering $\hat\theta$ and covariance $V_{\theta_0}$, when $n\to\infty$. In a locally asymptotically normal (LAN) family, the centering $\hat\theta$ can be taken to be the maximum likelihood estimator (MLE) and $V_{\theta_0}$ to be the inverse of the Fisher information matrix. An immediate consequence of the Bernstein-von Mises theorem is that the distributions of

$$\sqrt{n}\,(\theta-\hat\theta)\ \big|\ X^{(n)}\qquad\text{and}\qquad\sqrt{n}\,(\hat\theta-\theta_0)$$

are asymptotically the same under the sampling distribution $P_{\theta_0}^n$. Note that the first one, known as the posterior, is of interest to Bayesians, and the second one is of interest to frequentists in large-sample theory. Applications of the Bernstein-von Mises theorem include constructing confidence sets from Bayesian methods with frequentist coverage guarantees.

Despite the success of BvM results in the classical parametric setting, little is known about the high-dimensional case, where the unknown parameter is of increasing or even infinite dimension. The pioneering works of [11] and [13] (see also [17]) showed that, in general, BvM may fail in non-classical settings. Despite these negative results, further work on suitable notions of nonparametric BvM has provided positive answers. See, for example, [22, 8, 9, 24]. In this paper, we consider the question of whether it is possible to have BvM results for matrix functionals, such as matrix entries and eigenvalues, when the dimension $p$ of the matrix grows with the sample size $n$.

This paper provides some positive answers to this question. To be specific, we consider a multivariate Gaussian likelihood and put a prior on the covariance matrix. We prove that the posterior distribution has a BvM behavior for various matrix functionals, including entries of the covariance matrix, entries of the precision matrix, quadratic forms, the log-determinant, and eigenvalues. All of these conclusions are obtained from a general theoretical framework we provide in Section 2, where we propose explicit, easy-to-check conditions on both functionals and priors. We illustrate the theory with both conjugate and non-conjugate priors. A slight extension of the general framework leads to BvM results for discriminant analysis. Both linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are considered.

This work is inspired by a growing interest in studying the BvM phenomenon for a low-dimensional functional of the whole parameter, that is, the asymptotic distribution of

$$\sqrt{n}\,\big(f(\theta)-\hat f\big)\ \big|\ X^{(n)},$$

with $f$ being a map from $\Theta$ to $\mathbb{R}^d$, where $d$ does not grow with $n$. A special case is the semiparametric setting, where $\theta=(\mu,\eta)$ contains both a parametric part $\mu$ and a nonparametric part $\eta$, and the functional takes the form $f(\theta)=\mu$. The works in this field were pioneered by [19] in a right-censoring model and by [26] for a general theory in the semiparametric setting. However, the conditions provided by [26] for BvM to hold are hard to check in specific examples. To the best of our knowledge, the first general framework for semiparametric BvM with conditions cleanly stated and easy to check is the beautiful work of [7], in which recent advances in Bayesian nonparametrics such as [2] and [15] are nicely absorbed. [25] proves BvM for linear functionals in settings where the limiting distribution is a mixture of normals instead of a normal. At the time this paper was drafted, the most up-to-date theory was due to [10], which provides conditions for BvM to hold for general functionals. The general framework we provide for matrix-functional BvM is greatly inspired by the framework developed in [10] for functionals in nonparametric models. However, the theory in this paper is different from theirs, since we can take advantage of the structure of the Gaussian likelihood and avoid unnecessary expansion and approximation. Hence, in the covariance matrix functional case, our assumptions can be significantly weaker.

The paper is organized as follows. In Section 2, we state the general theoretical framework of our results. It is illustrated with two priors, one conjugate and one non-conjugate. Section 3 considers specific examples of matrix functionals and the associated BvM results. The extension to discriminant analysis is developed in Section 4. Finally, we devote Section 5 to a discussion of the assumptions and possible generalizations. Most of the proofs are gathered in Section 6.

1.1 Notation

Given a matrix $A=(a_{ij})$, we use $\|A\|$ to denote its spectral norm and $\|A\|_F$ to denote its Frobenius norm. The norm $\|\cdot\|$, when applied to a vector, is understood to be the usual Euclidean vector norm. Let $S^{p-1}$ be the unit sphere in $\mathbb{R}^p$. For any $a,b\in\mathbb{R}$, we use the notation $a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b)$. The probability $P$ stands for the sampling distribution of the data, and $E$ is for the corresponding expectation. In most cases, we use $\Sigma$ to denote a covariance matrix and $\Omega$ to denote a precision matrix (including those with superscripts or subscripts), with $\Sigma_0$ and $\Omega_0=\Sigma_0^{-1}$ reserved for the truth. The notation $\mathbb{P}$ is for a generic probability, whenever the distribution is clear from the context. We use $o_P(1)$ and $O_P(1)$ to denote stochastic orders under the sampling distribution of the data. We use $C,C',c,\ldots$ to indicate constants throughout the paper; they may differ from line to line.

2 A General Framework

Consider i.i.d. samples $X_1,\ldots,X_n$ drawn from $N(0,\Sigma)$, where $\Sigma$ is a $p\times p$ covariance matrix with inverse $\Omega=\Sigma^{-1}$, the precision matrix. A Bayesian method puts a prior $\Pi$ on the precision matrix $\Omega$, and the posterior distribution is defined as

$$\Pi\big(B\mid X^{(n)}\big)=\frac{\int_B\exp\big(\ell_n(\Omega)\big)\,d\Pi(\Omega)}{\int\exp\big(\ell_n(\Omega)\big)\,d\Pi(\Omega)},$$

where $\ell_n(\Omega)$ is the log-likelihood of $N(0,\Omega^{-1})$ defined as

$$\ell_n(\Omega)=\frac{n}{2}\log\det(\Omega)-\frac{1}{2}\sum_{i=1}^nX_i^T\Omega X_i.$$

We deliberately omit the additive normalizing constant $-\frac{np}{2}\log(2\pi)$ in $\ell_n(\Omega)$ for simplicity; it does not affect the definition of the posterior distribution. Note that specifying a prior on the precision matrix $\Omega$ is equivalent to specifying a prior on the covariance matrix $\Sigma=\Omega^{-1}$. The goal of this work is to show that the asymptotic distribution of a functional $f(\Sigma)$ (or $g(\Omega)$) under the posterior distribution is approximately normal, i.e.,

$$\sqrt{n}\,\big(f(\Sigma)-\hat f\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ N(0,V),$$

as $n$ and $p$ grow jointly, with some appropriate centering $\hat f$ and variance $V$. In this paper, we choose the centering $\hat f$ to be the sample version of the functional, $\hat f=f(\hat\Sigma)$, where $\Sigma$ is replaced by the sample covariance $\hat\Sigma=\frac{1}{n}\sum_{i=1}^nX_iX_i^T$, and compare the BvM results with the classical asymptotic normality of $f(\hat\Sigma)$ in the frequentist sense. Other centerings $\hat f$, including bias corrections of the sample version, will be considered in future work.
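
For concreteness, here is a minimal numerical sketch of this log-likelihood; the function and variable names are ours, not the paper's, and it is only an illustration of the formula above:

```python
import numpy as np

def gaussian_log_likelihood(omega, X):
    """Evaluate l_n(Omega) = (n/2) log det(Omega) - (1/2) sum_i X_i' Omega X_i,
    omitting the additive constant -(n p / 2) log(2 pi), as in the text."""
    n = X.shape[0]
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:
        raise ValueError("Omega must be symmetric positive definite")
    # sum_i X_i' Omega X_i equals n * tr(Sigma_hat Omega) with Sigma_hat = X'X/n
    quad = np.einsum("ni,ij,nj->", X, omega, X)
    return 0.5 * n * logdet - 0.5 * quad
```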

We first provide a framework for approximately linear functionals, and then use the general theory to derive results for specific examples of priors and functionals. For clarity of presentation, we consider the cases of functionals of $\Sigma$ and functionals of $\Omega$ separately. Though a functional of $\Sigma$ is also a functional of $\Omega$, we treat the two cases separately, since a functional may be “more linear” in $\Sigma$ than in $\Omega$, or the other way around.

2.1 Functional of Covariance Matrix

Let us first consider a functional $f(\Sigma)$ of the covariance matrix. The functional is required to be approximately linear in a neighborhood of the truth $\Sigma_0$. We assume there is a set $A_n$ of covariance matrices satisfying

(1)

for any sequence $\epsilon_n$, on which $f$ is approximately linear in the sense that there exists a symmetric matrix $F$ such that

(2)

The main result is stated in the following theorem.

Theorem 2.1.

Suppose the linearization (2) holds. If, for a given prior $\Pi$, the following two conditions are satisfied:

  1. $\Pi\big(\Sigma\in A_n\mid X^{(n)}\big)\to1$ in $P$-probability,

  2. for any fixed $t\in\mathbb{R}$,

$$\frac{\int\exp\big(\ell_n(\Omega_t)\big)\,d\Pi(\Omega)}{\int\exp\big(\ell_n(\Omega)\big)\,d\Pi(\Omega)}\to1\quad\text{in $P$-probability}$$

  for the perturbed precision matrix $\Omega_t=\Omega+\frac{2t}{\sqrt{nV}}F$,

then the posterior distribution of $\sqrt{n}\,\big(f(\Sigma)-\hat f\big)\big/\sqrt{V}$ converges in total variation to $N(0,1)$ in $P$-probability, where $\hat f=f(\hat\Sigma)$ and $V=2\operatorname{tr}\big((F\Sigma_0)^2\big)$.

The theorem gives explicit conditions on both the prior and the functional. The first condition says that the posterior distribution concentrates on a neighborhood of the truth under the spectral norm, on which the functional is approximately linear. The second condition says that the bias caused by the shifted parameter $\Omega_t$ can be absorbed by the posterior distribution. Under both conditions, Theorem 2.1 shows that the asymptotic posterior distribution of $\sqrt{n}\,\big(f(\Sigma)-\hat f\big)$ is $N(0,V)$.
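
To see why condition 2 takes this form, here is a heuristic sketch under the linearization (2), with the perturbation $\Omega_t$ as reconstructed above; the rigorous argument is deferred to Section 6. A second-order expansion of the Gaussian log-likelihood gives

$$\ell_n(\Omega_t)-\ell_n(\Omega)=\frac{n}{2}\log\det\big(I+\tfrac{2t}{\sqrt{nV}}\,\Sigma F\big)-\frac{t\sqrt{n}}{\sqrt{V}}\operatorname{tr}\big(\hat\Sigma F\big)\approx t\,\sqrt{\frac{n}{V}}\,\big(f(\Sigma)-\hat f\big)-\frac{t^2}{2}.$$

Hence the ratio in condition 2 is approximately the posterior expectation of $\exp\big(ts-t^2/2\big)$ with $s=\sqrt{n/V}\,\big(f(\Sigma)-\hat f\big)$; requiring it to tend to one is exactly requiring the posterior moment generating function of $s$ to converge to $e^{t^2/2}$, the moment generating function of $N(0,1)$.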

2.2 Functional of Precision Matrix

We state a corresponding theorem for functionals of the precision matrix in this section. The condition for linear approximation is slightly different. Consider a functional $g(\Omega)$, and let $B_n$ be a set of precision matrices satisfying

(3)

for some integer $k$ and any sequence $\epsilon_n$. We assume the functional is approximately linear on $B_n$ in the sense that there exists a symmetric matrix $G$ such that

(4)

The main result is stated in the following theorem.

Theorem 2.2.

Suppose the linearization (4) holds, together with an additional growth condition on $p$ relative to $n$ (the extra condition discussed in Remark 2.1). If, for a given prior $\Pi$, the following conditions are satisfied:

  1. $\Pi\big(\Omega\in B_n\mid X^{(n)}\big)\to1$ in $P$-probability,

  2. for any fixed $t\in\mathbb{R}$,

$$\frac{\int\exp\big(\ell_n(\Omega_t)\big)\,d\Pi(\Omega)}{\int\exp\big(\ell_n(\Omega)\big)\,d\Pi(\Omega)}\to1\quad\text{in $P$-probability}$$

  for the perturbed precision matrix $\Omega_t=\Omega+\frac{2t}{\sqrt{nV}}\,\Omega G\Omega$,

then the posterior distribution of $\sqrt{n}\,\big(g(\Omega)-\hat g\big)\big/\sqrt{V}$ converges in total variation to $N(0,1)$ in $P$-probability, where $\hat g=g(\hat\Omega)$ with $\hat\Omega=\hat\Sigma^{-1}$, and $V=2\operatorname{tr}\big((G\Omega_0)^2\big)$.

Remark 2.1.

The extra growth condition on $p$ does not appear in Theorem 2.1. We show in Section 5.3 that this condition is indeed sharp for Theorem 2.2, by comparing with the asymptotics of the MLE.

2.3 Priors

In this section, we provide examples of priors. In particular, we consider both a conjugate prior and a non-conjugate prior. Note that the result for a conjugate prior can be derived by directly exploiting the explicit form of the posterior, without applying our general theory. However, the general framework provided in this paper handles both conjugate and non-conjugate priors in a unified way.

Wishart Prior

Consider the Wishart prior $W_p(I,\delta)$ on $\Omega$ with density function

$$\pi(\Omega)\propto\det(\Omega)^{(\delta-p-1)/2}\exp\Big(-\frac{1}{2}\operatorname{tr}(\Omega)\Big),\qquad(5)$$

supported on the set of symmetric positive semi-definite matrices.
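
Since the Wishart prior is conjugate to the Gaussian likelihood, the posterior is again Wishart with updated parameters. The following is a minimal sampling sketch, assuming the identity scale matrix in (5) (our assumption) and SciPy's parametrization of the Wishart distribution:

```python
import numpy as np
from scipy.stats import wishart

def wishart_posterior_draws(X, delta, n_draws=1000, seed=0):
    """Draw Omega from the posterior under the conjugate Wishart prior.

    Prior:      Omega ~ W_p(I, delta)   (identity scale matrix, our assumption)
    Likelihood: X_i ~ N(0, Omega^{-1}), i.i.d., X of shape (n, p)
    Posterior:  Omega | X ~ W_p((I + X'X)^{-1}, delta + n)
    """
    n, p = X.shape
    scale_post = np.linalg.inv(np.eye(p) + X.T @ X)
    return wishart.rvs(df=delta + n, scale=scale_post,
                       size=n_draws, random_state=seed)
```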

Lemma 2.1.

Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then, for any integer $\delta$ in the admissible range, the prior satisfies the two conditions in Theorem 2.1 for some set $A_n$. If the extra growth assumption of Theorem 2.2 is made, the two conditions in Theorem 2.2 are also satisfied for some set $B_n$.

Remark 2.2.

In the proof of Lemma 2.1 (Section 6.2), we set $A_n$ to be a spectral-norm ball centered at the truth,

$$A_n=\big\{\Sigma:\ \|\Sigma-\Sigma_0\|\le C\,r_n\big\},$$

for some sufficiently large constant $C>0$, where $r_n\to0$ denotes the posterior contraction rate of $\|\Sigma-\Sigma_0\|$ obtained there.

Gaussian Prior

Consider a Gaussian prior on $\Omega$ with density function

(6)

supported on the following set of well-conditioned symmetric matrices,

$$\big\{\Omega=\Omega^T:\ \tau\le\lambda_{\min}(\Omega)\le\lambda_{\max}(\Omega)\le\tau^{-1}\big\},$$

for some constant $\tau\in(0,1)$.

Lemma 2.2.

Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. The Gaussian prior defined above then satisfies the two conditions in Theorem 2.1 for some appropriate set $A_n$. If the extra growth assumption of Theorem 2.2 is made, the two conditions in Theorem 2.2 are also satisfied for some appropriate set $B_n$.

Remark 2.3.

In the proof of Lemma 2.2 (Section 6.3), we set $A_n$ to be a spectral-norm ball centered at the truth, analogous to Remark 2.2, for some sufficiently large constant $C>0$.

3 Examples of Matrix Functionals

We consider various examples of functionals in this section. The two conditions of Theorem 2.1 and Theorem 2.2 are satisfied by the Wishart prior and the Gaussian prior, as shown in Lemma 2.1 and Lemma 2.2, respectively. Hence, it suffices to check the approximate linearity of the functional with respect to $\Sigma$ or $\Omega$ for the BvM result to hold. Among the four examples we consider, the first two are exactly linear and the last two are approximately linear. In the examples below, $Z$ always denotes a standard normal random variable, $Z\sim N(0,1)$.

3.1 Entry-wise Functional

We consider the elementwise functionals $f(\Sigma)=\sigma_{ij}$ and $g(\Omega)=\omega_{ij}$. Note that these two functionals are linear with respect to $\Sigma$ and $\Omega$, respectively. For $f(\Sigma)=\sigma_{ij}$, we write

$$\sigma_{ij}=\operatorname{tr}(E_{ji}\Sigma),$$

where the matrix $E_{ij}$ is the $(i,j)$-th basis element of $\mathbb{R}^{p\times p}$, with $1$ in its $(i,j)$-th entry and $0$ elsewhere. For $g(\Omega)=\omega_{ij}$, we write

$$\omega_{ij}=\operatorname{tr}(E_{ji}\Omega).$$

Note that $\operatorname{tr}(E_{ji}\Sigma)=\operatorname{tr}\big(\tfrac{1}{2}(E_{ij}+E_{ji})\Sigma\big)$ by symmetry. Hence, the corresponding matrices $F$ and $G$ in the linear expansions of $f$ and $g$ are both $\tfrac{1}{2}(E_{ij}+E_{ji})$. In view of Theorem 2.1 and Theorem 2.2, the asymptotic variance for $f(\Sigma)=\sigma_{ij}$ is

$$V_f=2\operatorname{tr}\big((F\Sigma_0)^2\big)=\sigma_{ii}\sigma_{jj}+\sigma_{ij}^2,$$

where the entries on the right-hand side are those of $\Sigma_0$. The asymptotic variance for $g(\Omega)=\omega_{ij}$ is

$$V_g=2\operatorname{tr}\big((G\Omega_0)^2\big)=\omega_{ii}\omega_{jj}+\omega_{ij}^2,$$

with $\omega_{ij}$ denoting the entries of $\Omega_0$.

Plugging these quantities in Theorem 2.1, Theorem 2.2, Lemma 2.1, and Lemma 2.2, we have the following Bernstein-von Mises results.

Corollary 3.1.

Consider the Wishart prior in (5) with integer $\delta$. Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then we have

$$\sqrt{n}\,\big(\sigma_{ij}-\hat\sigma_{ij}\big)\Big/\sqrt{\sigma_{ii}\sigma_{jj}+\sigma_{ij}^2}\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\sigma_{ij}$ is the $(i,j)$-th element of the sample covariance $\hat\Sigma$. If we additionally assume the extra growth condition of Theorem 2.2, then

$$\sqrt{n}\,\big(\omega_{ij}-\hat\omega_{ij}\big)\Big/\sqrt{\omega_{ii}\omega_{jj}+\omega_{ij}^2}\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\omega_{ij}$ is the $(i,j)$-th element of $\hat\Omega=\hat\Sigma^{-1}$.

Corollary 3.2.

Consider the Gaussian prior in (6). Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then we have

$$\sqrt{n}\,\big(\sigma_{ij}-\hat\sigma_{ij}\big)\Big/\sqrt{\sigma_{ii}\sigma_{jj}+\sigma_{ij}^2}\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

If we additionally assume the extra growth condition of Theorem 2.2, then

$$\sqrt{n}\,\big(\omega_{ij}-\hat\omega_{ij}\big)\Big/\sqrt{\omega_{ii}\omega_{jj}+\omega_{ij}^2}\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\sigma_{ij}$ and $\hat\omega_{ij}$ are defined in Corollary 3.1.
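
As a quick numerical illustration of Corollary 3.1, the following sketch samples the posterior under the conjugate Wishart prior and compares the posterior spread of $\sigma_{ij}$ with the BvM variance $(\sigma_{ii}\sigma_{jj}+\sigma_{ij}^2)/n$; the choices of $n$, $p$, $\delta$, and $\Sigma_0$ are illustrative, not the paper's:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
n, p, i, j = 2000, 5, 0, 1
delta = p + 2  # illustrative prior degrees of freedom

# Truth: an equicorrelation covariance matrix (illustrative choice).
Sigma0 = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
S_hat = X.T @ X / n  # sample covariance in the mean-zero model

# Under the Wishart prior W_p(I, delta) on Omega, the posterior of
# Sigma = Omega^{-1} is inverse-Wishart with the parameters below.
draws = invwishart.rvs(df=delta + n, scale=np.eye(p) + X.T @ X,
                       size=5000, random_state=1)
post = draws[:, i, j]

# BvM: posterior of sigma_ij is approx N(hat_sigma_ij, (s_ii s_jj + s_ij^2)/n).
V = Sigma0[i, i] * Sigma0[j, j] + Sigma0[i, j] ** 2
print("posterior mean:", post.mean(), "vs centering:", S_hat[i, j])
print("n * posterior variance:", n * post.var(), "vs BvM variance:", V)
```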

3.2 Quadratic Form

Consider the functionals $f(\Sigma)=v^T\Sigma v$ and $g(\Omega)=v^T\Omega v$ for some $v\in S^{p-1}$. The corresponding matrices $F$ and $G$ are both $vv^T$. It is easy to see that $\operatorname{tr}\big((vv^T\Sigma_0)^2\big)=(v^T\Sigma_0v)^2$. The asymptotic variances are

$$V_f=2\big(v^T\Sigma_0v\big)^2\qquad\text{and}\qquad V_g=2\big(v^T\Omega_0v\big)^2.$$
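
For completeness, the trace identity behind these variance formulas is a one-line worked computation:

$$\operatorname{tr}\big((vv^T\Sigma_0)^2\big)=\operatorname{tr}\big(vv^T\Sigma_0\,vv^T\Sigma_0\big)=\big(v^T\Sigma_0v\big)\operatorname{tr}\big(vv^T\Sigma_0\big)=\big(v^T\Sigma_0v\big)^2,$$

and the same computation with $\Omega_0$ in place of $\Sigma_0$ gives $V_g$.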

Plugging these representations in Theorem 2.1, Theorem 2.2, Lemma 2.1 and Lemma 2.2, we have the following Bernstein-von Mises results.

Corollary 3.3.

Consider the Wishart prior in (5) with integer $\delta$. Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then we have

$$\sqrt{n}\,\big(v^T\Sigma v-v^T\hat\Sigma v\big)\Big/\big(\sqrt{2}\,v^T\Sigma_0v\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

If we additionally assume the extra growth condition of Theorem 2.2, then

$$\sqrt{n}\,\big(v^T\Omega v-v^T\hat\Omega v\big)\Big/\big(\sqrt{2}\,v^T\Omega_0v\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

Corollary 3.4.

Consider the Gaussian prior in (6). Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then we have

$$\sqrt{n}\,\big(v^T\Sigma v-v^T\hat\Sigma v\big)\Big/\big(\sqrt{2}\,v^T\Sigma_0v\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

If we additionally assume the extra growth condition of Theorem 2.2, then

$$\sqrt{n}\,\big(v^T\Omega v-v^T\hat\Omega v\big)\Big/\big(\sqrt{2}\,v^T\Omega_0v\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

Remark 3.1.

The entry-wise functional and the quadratic form are both special cases of the functional $f(\Sigma)=u^T\Sigma v$ for some $u,v\in S^{p-1}$. It is direct to apply the general framework to this functional and obtain the result

$$\sqrt{n}\,\big(u^T\Sigma v-u^T\hat\Sigma v\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ N\Big(0,\ \big(u^T\Sigma_0v\big)^2+\big(u^T\Sigma_0u\big)\big(v^T\Sigma_0v\big)\Big)\quad\text{in $P$-probability}.$$

Similarly, for the functional $g(\Omega)=u^T\Omega v$ for some $u,v\in S^{p-1}$, we have

$$\sqrt{n}\,\big(u^T\Omega v-u^T\hat\Omega v\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ N\Big(0,\ \big(u^T\Omega_0v\big)^2+\big(u^T\Omega_0u\big)\big(v^T\Omega_0v\big)\Big)\quad\text{in $P$-probability}.$$

Both results can be derived under the same conditions as Corollary 3.3 and Corollary 3.4.

3.3 Log Determinant

In this section, we consider the log-determinant functional, that is, $f(\Sigma)=\log\det(\Sigma)$. Unlike the entry-wise functional and the quadratic form, we do not need to treat $g(\Omega)=\log\det(\Omega)$ separately, because of the simple observation

$$\log\det(\Omega)=-\log\det(\Sigma).$$

The following lemma establishes the approximate linearity of $f$.

Lemma 3.1.

Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $\|\Sigma-\Sigma_0\|$ is sufficiently small. Then for any such $\Sigma$, we have

$$\Big|\log\det(\Sigma)-\log\det(\Sigma_0)-\operatorname{tr}\big(\Sigma_0^{-1}(\Sigma-\Sigma_0)\big)\Big|\le C\,\|\Sigma-\Sigma_0\|_F^2$$

for some constant $C>0$.

By Lemma 3.1, the corresponding matrix is $F=\Sigma_0^{-1}$. The asymptotic variance of $\log\det(\Sigma)$ is

$$V=2\operatorname{tr}\big((\Sigma_0^{-1}\Sigma_0)^2\big)=2p.$$
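
Note what this collapse buys: the variance depends only on the dimension, not on $\Sigma_0$, so the normalization in the corollaries below is fully explicit,

$$\sqrt{\frac{n}{2p}}\,\big(\log\det(\Sigma)-\log\det(\hat\Sigma)\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z.$$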

Corollary 3.5.

Consider the Wishart prior in (5) with integer $\delta$. Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then we have

$$\sqrt{\frac{n}{2p}}\,\big(\log\det(\Sigma)-\log\det(\hat\Sigma)\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\Sigma$ is the sample covariance matrix.

Proof.

By Theorem 2.1 and Lemma 2.1, we only need to check the approximate linearity of the functional. According to the proof of Lemma 2.1, the set $A_n$ on which linearity is needed is a spectral-norm ball of radius $Cr_n$ around $\Sigma_0$ (Remark 2.2). This implies

$$\sup_{\Sigma\in A_n}\|\Sigma-\Sigma_0\|_F^2\le p\,\sup_{\Sigma\in A_n}\|\Sigma-\Sigma_0\|^2\le C^2p\,r_n^2$$

for some constant $C>0$. Therefore,

$$\sqrt{\frac{n}{2p}}\,\sup_{\Sigma\in A_n}\|\Sigma-\Sigma_0\|_F^2\le C'\sqrt{np}\,r_n^2=o(1)$$

for some constant $C'>0$, under the assumed growth condition on $p$. By Lemma 3.1, we have

$$\sqrt{\frac{n}{2p}}\,\sup_{\Sigma\in A_n}\Big|\log\det(\Sigma)-\log\det(\Sigma_0)-\operatorname{tr}\big(\Sigma_0^{-1}(\Sigma-\Sigma_0)\big)\Big|=o(1),$$

and the approximate linearity holds. ∎

Corollary 3.6.

Consider the Gaussian prior in (6). Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, and that $p$ grows sufficiently slowly with $n$. Then we have

$$\sqrt{\frac{n}{2p}}\,\big(\log\det(\Sigma)-\log\det(\hat\Sigma)\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\Sigma$ is the sample covariance matrix.

Proof.

The proof of this corollary is the same as the proof of the last one for the Wishart prior. The only difference is that the choice of $A_n$, according to the proof of Lemma 2.2, is a spectral-norm ball with a possibly different radius. Therefore,

$$\sqrt{\frac{n}{2p}}\,\sup_{\Sigma\in A_n}\|\Sigma-\Sigma_0\|_F^2=o(1)$$

under the assumption, and the approximate linearity holds. ∎

One immediate consequence of the result is the Bernstein-von Mises theorem for the entropy functional of $N(0,\Sigma)$, defined as

$$H(\Sigma)=\frac{1}{2}\log\det(2\pi e\,\Sigma)=\frac{p}{2}\log(2\pi e)+\frac{1}{2}\log\det(\Sigma).$$

Then it is direct that

$$\sqrt{\frac{2n}{p}}\,\big(H(\Sigma)-H(\hat\Sigma)\big)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

3.4 Eigenvalues

In this section, we consider the eigenvalue functional. In particular, let $\lambda_1(\Sigma)\ge\cdots\ge\lambda_p(\Sigma)$ be the eigenvalues of the matrix $\Sigma$ in decreasing order. We investigate the posterior distribution of $\lambda_j(\Sigma)$ for each $j$. Define the eigen-gap

$$\delta_j=\min\big\{\lambda_{j-1}(\Sigma_0)-\lambda_j(\Sigma_0),\ \lambda_j(\Sigma_0)-\lambda_{j+1}(\Sigma_0)\big\},$$

with the convention $\lambda_0(\Sigma_0)=+\infty$ and $\lambda_{p+1}(\Sigma_0)=-\infty$.

The asymptotic order of $\delta_j$ plays an important role in the theory. The following lemma characterizes the approximate linearity of $\lambda_j(\Sigma)$.

Lemma 3.2.

Assume the eigenvalues of $\Sigma_0$ are bounded, and that $\|\Sigma-\Sigma_0\|$ is sufficiently small compared with the eigen-gap $\delta_j$. Then for any such $\Sigma$, we have

$$\Big|\lambda_j(\Sigma)-\lambda_j(\Sigma_0)-u_j^T(\Sigma-\Sigma_0)u_j\Big|\le C\,\frac{\|\Sigma-\Sigma_0\|^2}{\delta_j}$$

for some constant $C>0$, where $u_j$ is the $j$-th eigenvector of $\Sigma_0$.

Lemma 3.2 implies that the corresponding $F$ in the linear expansion of $\lambda_j(\Sigma)$ is $F=u_ju_j^T$, and the asymptotic variance is

$$V=2\operatorname{tr}\big((u_ju_j^T\Sigma_0)^2\big)=2\lambda_j(\Sigma_0)^2.$$

We also consider eigenvalues of the precision matrix. With slight abuse of notation, we define the eigengap of $\Omega_0$ to be

$$\tilde\delta_j=\min\big\{\lambda_{j-1}(\Omega_0)-\lambda_j(\Omega_0),\ \lambda_j(\Omega_0)-\lambda_{j+1}(\Omega_0)\big\}.$$

The approximate linearity of $\lambda_j(\Omega)$ is established in the following lemma.

Lemma 3.3.

Assume $\|\Omega-\Omega_0\|$ is sufficiently small compared with $\tilde\delta_j$. Then for any such $\Omega$, we have

$$\Big|\lambda_j(\Omega)-\lambda_j(\Omega_0)-\tilde u_j^T(\Omega-\Omega_0)\tilde u_j\Big|\le C\,\frac{\|\Omega-\Omega_0\|^2}{\tilde\delta_j}$$

for some constant $C>0$, where $\tilde u_j$ is the $j$-th eigenvector of $\Omega_0$.

Similarly, Lemma 3.3 implies that the corresponding $G$ in the linear expansion of $\lambda_j(\Omega)$ is $G=\tilde u_j\tilde u_j^T$, and the asymptotic variance is

$$V=2\operatorname{tr}\big((\tilde u_j\tilde u_j^T\Omega_0)^2\big)=2\lambda_j(\Omega_0)^2.$$
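
Before stating the corollaries, here is a small numerical sanity check of the $2\lambda_j^2$ variance, again via the conjugate Wishart posterior; the choices of $n$, $p$, $\delta$, and the well-separated spectrum are illustrative assumptions:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
n, p = 5000, 4
delta = p + 2  # illustrative prior degrees of freedom

Sigma0 = np.diag([4.0, 2.0, 1.0, 0.5])  # eigen-gap delta_1 = 2
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

# Posterior of Sigma under the conjugate Wishart prior W_p(I, delta) on Omega.
draws = invwishart.rvs(df=delta + n, scale=np.eye(p) + X.T @ X,
                       size=4000, random_state=3)
lam1 = np.array([np.linalg.eigvalsh(S)[-1] for S in draws])  # top eigenvalue

S_hat = X.T @ X / n
lam1_hat = np.linalg.eigvalsh(S_hat)[-1]

# BvM prediction: posterior of lambda_1(Sigma) approx N(lambda_1(S_hat), 2*lambda_1^2/n).
print("posterior mean:", lam1.mean(), "vs centering:", lam1_hat)
print("n * posterior variance:", n * lam1.var(),
      "vs 2 * lambda_1^2:", 2 * Sigma0[0, 0] ** 2)
```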

Plugging the above lemmas into our general framework, we get the following corollaries.

Corollary 3.7.

Consider the Wishart prior in (5) with integer $\delta$. Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, that $p$ grows sufficiently slowly with $n$, and that the eigen-gap $\delta_j$ does not shrink too fast. Then we have

$$\sqrt{\frac{n}{2}}\,\big(\lambda_j(\Sigma)-\lambda_j(\hat\Sigma)\big)\Big/\lambda_j(\Sigma_0)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\Sigma$ is the sample covariance matrix. If we instead assume the corresponding gap condition for $\tilde\delta_j$, the eigengap of $\Omega_0$, then

$$\sqrt{\frac{n}{2}}\,\big(\lambda_j(\Omega)-\lambda_j(\hat\Omega)\big)\Big/\lambda_j(\Omega_0)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

Proof.

We only need to check the approximate linearity. According to Lemma 2.1 (see Remark 2.2), the choice of $A_n$ is a spectral-norm ball of radius $Cr_n$ around $\Sigma_0$ for some constant $C>0$. The eigen-gap assumption implies

$$\sqrt{n}\,\frac{r_n^2}{\delta_j}=o\big(\lambda_j(\Sigma_0)\big)$$

on the set $A_n$. By Lemma 3.2 and Lemma 3.3, we have

$$\sqrt{n}\,\sup_{\Sigma\in A_n}\Big|\lambda_j(\Sigma)-\lambda_j(\Sigma_0)-u_j^T(\Sigma-\Sigma_0)u_j\Big|=o\big(\lambda_j(\Sigma_0)\big)$$

and the corresponding statement for $\lambda_j(\Omega)$ on $B_n$, so the approximate linearity holds. ∎

Corollary 3.8.

Consider the Gaussian prior in (6). Assume the eigenvalues of $\Sigma_0$ are bounded away from zero and infinity, that $p$ grows sufficiently slowly with $n$, and that the eigen-gap $\delta_j$ does not shrink too fast. Then we have

$$\sqrt{\frac{n}{2}}\,\big(\lambda_j(\Sigma)-\lambda_j(\hat\Sigma)\big)\Big/\lambda_j(\Sigma_0)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability},$$

where $\hat\Sigma$ is the sample covariance matrix. If we instead assume the corresponding gap condition for $\tilde\delta_j$, the eigengap of $\Omega_0$, then

$$\sqrt{\frac{n}{2}}\,\big(\lambda_j(\Omega)-\lambda_j(\hat\Omega)\big)\Big/\lambda_j(\Omega_0)\ \Big|\ X^{(n)}\ \rightsquigarrow\ Z\quad\text{in $P$-probability}.$$

Proof.

We only need to check the approximate linearity. According to Lemma 2.2 (see Remark 2.3), the choice of the set $A_n$ is