Wishart distributions for decomposable covariance graph models

# Wishart distributions for decomposable covariance graph models

[ [    [ [ University of Florida and Stanford University Department of Statistics
University of Florida
102 Griffin-Floyd Hall
Gainesville, Florida 32606
USA
Department of Statistics
Stanford University
Sequoia Hall
390 Serra Mall
Stanford, California 94305-4065
USA
\smonth7 \syear2009\smonth6 \syear2010
\smonth7 \syear2009\smonth6 \syear2010
\smonth7 \syear2009\smonth6 \syear2010
###### Abstract

Gaussian covariance graph models encode marginal independence among the components of a multivariate random vector by means of a graph . These models are distinctly different from the traditional concentration graph models (often also referred to as Gaussian graphical models or covariance selection models) since the zeros in the parameter are now reflected in the covariance matrix , as compared to the concentration matrix . The parameter space of interest for covariance graph models is the cone of positive definite matrices with fixed zeros corresponding to the missing edges of . As in Letac and Massam [Ann. Statist. 35 (2007) 1278–1323], we consider the case where is decomposable. In this paper, we construct on the cone a family of Wishart distributions which serve a similar purpose in the covariance graph setting as those constructed by Letac and Massam [Ann. Statist. 35 (2007) 1278–1323] and Dawid and Lauritzen [Ann. Statist. 21 (1993) 1272–1317] do in the concentration graph setting. We proceed to undertake a rigorous study of these “covariance” Wishart distributions and derive several deep and useful properties of this class. First, they form a rich conjugate family of priors with multiple shape parameters for covariance graph models. Second, we show how to sample from these distributions by using a block Gibbs sampling algorithm and prove convergence of this block Gibbs sampler. Development of this class of distributions enables Bayesian inference, which, in turn, allows for the estimation of , even in the case when the sample size is less than the dimension of the data (i.e., when “”), otherwise not generally possible in the maximum likelihood framework. Third, we prove that when is a homogeneous graph, our covariance priors correspond to standard conjugate priors for appropriate directed acyclic graph (DAG) models. This correspondence enables closed form expressions for normalizing constants and expected values, and also establishes hyper-Markov properties for our class of priors. We also note that when is homogeneous, the family of Letac and Massam [Ann. Statist. 35 (2007) 1278–1323] is a special case of our covariance Wishart distributions. Fourth, and finally, we illustrate the use of our family of conjugate priors on real and simulated data.

[
\kwd
\doi

10.1214/10-AOS841 \volume39 \issue1 2011 \firstpage514 \lastpage555 \newproclaimremarkRemark \newproclaimdefinitionDefinition

\runtitle

Wishart distributions

{aug}

A]\fnmsKshitij \snmKhare\thanksreft1label=e1]kdkhare@stat.ufl.edu and B]\fnmsBala \snmRajaratnam\thanksreft2\correflabel=e2]brajarat@stanford.edu \thankstextt1Supported in part by the B. C. and E. J. Eaves Stanford graduate fellowship. \thankstextt2Supported in part by NSF Grants DMS-05-05303, DMS-09-06392 and Grant SUFSC08-SUSHSTF09-SMSCVISG0906.

class=AMS] \kwd62H12 \kwd62C10 \kwd62F15.

Graphical model \kwdGaussian covariance graph model \kwdWishart distribution \kwddecomposable graph \kwdGibbs sampler.

## 1 Introduction

Due to recent advances in science and information technology, there has been a huge influx of high-dimensional data from various fields such as genomics, environmental sciences, finance and the social sciences. Making sense of all the many complex relationships and multivariate dependencies present in the data, formulating correct models and developing inferential procedures is one of the major challenges in modern day statistics. In parametric models, the covariance or correlation matrix (or its inverse) is the fundamental object that quantifies relationships between random variables. Estimating the covariance matrix in a sparse way is crucial in high-dimensional problems and enables the detection of the most important relationships. In this light, graphical models have served as tools to discover structure in high-dimensional data.

The primary aim of this paper is to develop a new family of conjugate prior distributions for covariance graph models (a subclass of graphical models) and study the properties of this family of distributions. It is shown in this paper that these properties are highly attractive for Bayesian inference in high-dimensional settings. In covariance graph models, specific entries of the covariance matrix are restricted to be zero, which implies marginal independence in the Gaussian case. Covariance graph models correspond to curved exponential families and are distinctly different from the well-studied concentration graph models, which, in turn, correspond to natural exponential families.

A rich framework for Bayesian inference for natural exponential families has been established in the last three decades, starting with the seminal and celebrated work of Diaconis and Ylvisaker dicnsylskr () that laid the foundations for constructing conjugate prior distributions for natural exponential family models. The Diaconis–Ylvisaker (henceforth referred to as “DY”) conjugate priors are characterized by posterior linearity of the mean. An analogous framework for curved exponential families is not available in the literature.

Concentration graph models (or covariance selection models) were one of the first graphical models to be formally introduced to the statistics community. These models reflect conditional independencies in multivariate probability distributions by means of a graph. In the Gaussian case, they induce sparsity or zeros in the inverse covariance matrix and correspond to natural exponential families. In their pioneering work, Dawid and Lauritzen dawidlrtzn () developed the DY prior for this class of models. In particular, they introduced the hyper-inverse Wishart as the DY conjugate prior for concentration graph models. In a recent major contribution to this field, a rich family of conjugate priors that subsumes the DY class has been developed by Letac and Massam ltcmssmwdg (). Both the hyper-inverse Wishart priors and the “Letac–Massam” priors have attractive properties which enable Bayesian inference, with the latter allowing multiple shape parameters and hence being suitable in high-dimensional settings. Bayesian procedures corresponding to these Letac–Massam priors have been derived in a decision theoretic framework in the recent work of Rajaratnam, Massam and Carvalho rajmasscar ().

Consider an undirected333We shall use dotted edges for our graphs, in keeping with the notation in the literature; bi-directed edges have also been used for representing covariance graphs. graph with a finite set of vertices (of size ) and a finite set of edges between these vertices, that is, . The Gaussian covariance graph model corresponding to the graph is the collection of -variate Gaussian distributions with covariance matrix such that whenever . This class of models was first formally introduced by Cox and Wermuth cxwrmthcvg (), cxwrmthmai (). In the frequentist setting, maximum likelihood estimation in covariance graph models has been a topic of interest in recent years. Many iterative methods that obtain the maximum likelihood estimate have been proposed in the literature. The graphical modeling software MIM in Edwards edwardsigm () fits these models by using the “dual likelihood method” from Kauermann krmnndlldm (). In Wermuth, Cox and Marchetti wrmthcxmrt (), the authors derive asymptotically efficient approximations to the maximum likelihood estimate in covariance graph models for exponential families. Chaudhuri, Drton and Richardson chaudrtric () propose an iterative conditional fitting algorithm for maximum likelihood estimation in this class of models. Covariance graph models have also been used in applications in Butte et al. butaslgoko (), Grzebyk, Wild and Chouaniere grzwildcho (), Mao, Kschischang and Frey maokscfrey () and others.

Although Gaussian covariance graph models are simple and intuitive to understand, no comprehensive theoretical framework for Bayesian inference for this class of models has been developed in the literature. In that sense, Bayesian inference for covariance graph models has been an open problem since the introduction of these models by Cox and Wermuth cxwrmthcvg (), cxwrmthmai () more than fifteen years ago. The main difficulty is that these models give rise to curved exponential families. The zero restrictions on the entries of the covariance matrix translate into complicated restrictions on the corresponding entries of the natural parameter, . Hence, the sparseness in does not translate into sparseness in and thus a covariance graph model cannot be viewed as a concentration graph model. No general theory is available for Bayesian inference in curved exponential families for continuous random variables, akin to the Diaconis–Ylvisaker dicnsylskr () or standard conjugate theory for natural exponential families.

There are several desirable properties that one might want when constructing a class of priors, but one of the foremost requirements is to be able to compute quantities such as the mean or mode of the posterior distribution, either in closed form or by sampling from the posterior distribution by a simple mechanism. This is especially important in high-dimensional situations, where computations are complex and can become infeasible very quickly. Another desirable and related feature is conjugacy, that is, the class of priors is such that the posterior distribution also belongs to this class. Among other things, this increases the prospects of obtaining closed form Bayes estimators and can also add to the interpretability of the hyper-parameters. The class of Wishart distributions developed by Letac and Massam ltcmssmwdg () (and later used for flexible Bayesian inference for concentration graph models by Rajaratnam, Massam and Carvalho rajmasscar ()), known as the family of distributions, are not appropriate for the covariance graph setting. There is the additional option of using the class as priors for this situation. We, however, establish that the posterior distribution fails to belong to the same class and there are no known results for computing the posterior mean or mode, either in closed form or by sampling from the posterior distribution.

A principal objective of this paper is to develop a framework for Bayesian inference for Gaussian covariance graph models. We proceed to construct a rich and flexible class of conjugate Wishart distributions, with multiple shape parameters, on the space of positive definite matrices with fixed zeros, that corresponds to a decomposable graph . This class of distributions is specified up to a normalizing constant, and conditions under which this normalizing constant can be evaluated in closed form are derived. We explore the distributional properties of our class of priors and, in particular, show that the parameter can be partitioned into blocks so that the conditional distribution of each block, given the others, is tractable. Based on this property, we propose a block Gibbs sampling algorithm to simulate from the posterior distribution. We proceed to formally prove the convergence of this block Gibbs sampler. Our priors yield proper inferential procedures, even in the case when the sample size is less than the dimension of the data, whereas maximum likelihood estimation is, in general, only possible when (in fact, in the homogeneous case, it can be shown that the condition is actually also necessary, thus highlighting the fact that results from the concentration graph setting do not carry over to the covariance model setting). We also show that our covariance Wishart distributions are, in the decomposable nonhomogeneous case, very different from the Letac–Massam priors and . However, when the underlying graph is homogeneous, the Letac–Massam priors are a special case of our distributions. We establish, in the homogeneous setting, a correspondence between the covariance priors in this paper and the natural conjugate priors for appropriate directed acyclic graph (DAG) models. This correspondence helps us to explicitly evaluate quantities like the normalizing constant and the posterior mean of the covariance matrix in closed form. In this scenario, we also show that our class of priors satisfies the strong directed hyper-Markov property (as introduced in Dawid and Lauritzen dawidlrtzn () for concentration graph models). It should be pointed out that these aforementioned results for homogeneous graphs can also be established directly, without exploiting the correspondence with the DAG models. The direct approach is self-contained, whereas the latter invokes an external result which states that for the restrictive class of homogeneous graphs, covariance graph models and DAGs are Markov equivalent.

We noted above that for concentration graph models or the traditional Gaussian graphical models, a rich theory has been established by Dawid and Lauritzen dawidlrtzn (), who derive the single parameter DY conjugate prior for these models, and by Letac and Massam ltcmssmwdg (), who derive a larger flexible class with multiple shape parameters. In essence, this paper is the analog of the results in the two aforementioned papers in the covariance graph model setting, with parallel results, all of which are contained in a single comprehensive piece. Hence, this work completes the powerful theory that has been developed in the mathematical statistics literature for decomposable models.

We also point out that a class of priors in the recent work ghrmnslvgs () is a special case of our class of flexible covariance Wishart distributions.444This is in a similar spirit to the way in which the HIW prior of Dawid and Lauritzen dawidlrtzn () is a special case of the generalized family of Wishart distributions proposed by Letac and Massam ltcmssmwdg () for the concentration graph setting. Our family allows multiple shape parameters, as compared to a single shape parameter, and hence yields a richer class suitable to high-dimensional problems. Moreover, we show that their iterative algorithm to sample from the posterior is different from ours. Since the authors do not undertake a theoretical investigation of the convergence properties of their algorithm, it is not clear if it does indeed converge to the desired distribution. On the other hand, we proceed to formally prove that our algorithm converges to the desired distribution. The remaining sections of this paper are considerably different from ghrmnslvgs () since we undertake a rigorous probabilistic analysis of our conjugate Wishart distributions for covariance graph models, whereas they give a useful and novel treatment of latent variables and mixed graph models in a machine learning context.

This paper is structured as follows. Section 2 introduces the required preliminaries and notation. In Section 3, the class of covariance Wishart distributions is formally constructed. Conjugacy to the class of covariance graph models and sufficient conditions for integrability are established. Comparison with the Letac–Massam priors, which are not, in general, conjugate in the covariance graph setting, is also undertaken. In Section 4, a block Gibbs sampler which enables sampling from the posterior distribution is proposed and the corresponding conditional distributions are derived. Thereafter, a formal proof of convergence of this block Gibbs sampler is provided. In Section 5, we restrict ourselves to the case when is a homogeneous graph. We examine the distributional properties of our class of priors in this section and prove that the covariance priors introduced in this paper correspond to natural conjugate priors for DAG models in the homogeneous setting. This correspondence helps in establishing closed form expressions for normalizing constants, expected values and hyper-Markov properties for our class of priors for homogeneous. Finally, we illustrate the use of our family of conjugate priors and the methodology developed in this paper on a real example, as well as on simulated data. The Appendix contains the proofs of some of the results stated in the main text.

## 2 Preliminaries

In this section, we give the necessary notation, background and preliminaries that are needed in subsequent sections.

### 2.1 Modified Cholesky decomposition

If is a positive definite matrix, then there exists a unique decomposition

 Σ=LDLT, (1)

where is a lower-triangular matrix with diagonal entries equal to 1 and a diagonal matrix with positive diagonal entries. This decomposition of is referred to as the modified Cholesky decomposition of (see pourahmadi ()). We now provide a formula that explicitly computes the inverse of a lower-triangular matrix with ’s on the diagonal, such as those that appear in (1).

###### Proposition 1

Let be an lower-triangular matrix with ’s on the diagonal. Let

 A=m⋃r=2{\boldsτ\dvtx\boldsτ∈{1,2,…,m}r,τi<τi−1 ∀2≤i≤r}

and

 L\boldsτ=dim(\boldsτ)∏i=2Lτi−1τi∀\boldsτ∈A,

where denotes the length of the vector . Then, , where is lower-triangular matrix with ’s on the diagonal and, for ,

 Nij=∑τ∈A,τ1=i,τdimτ=j(−1)dim(τ)−1dim(τ)∏i=2Lτi−1τi.

The proof is provided in the Appendix.

An undirected graph is a pair , where is a permutation555The ordering in is emphasized here since the elements of will later correspond to rows or columns of matrices. of the set denoting the set of vertices of . The set denotes the set of edges in the graph. If vertices and are such that then we say that there is an edge between and . It is also understood that implies that , that is, the edges are undirected. Although the dependence of on the particular ordering in is often suppressed, the reader should bear in mind that unlike traditional graphs, the graphs defined above are not equivalent up to permutation of the vertices666This has been done for notational convenience, as will be seen later. modulo the edge structure. Below, we describe two classes of graphs which play a central role in this paper.

### 2.2 Decomposable graphs

An undirected graph is said to be decomposable if any induced subgraph does not contain a cycle of length greater than or equal to four. The reader is referred to Lauritzen lrtzngphmd () for all of the common notions of graphical models (and, in particular, decomposable graphs) that we will use here. One such important notion is that of a perfect order of the cliques. Every decomposable graph admits a perfect order of its cliques. Let be one such perfect order of the cliques of the graph . The history for the graph is given by and

 Hj=C1∪C2∪⋯∪Cj,j=2,3,…,k,

and the minimal separators of the graph are given by

 Sj=Hj−1∩Cj,j=2,3,…,k.

Let

 Rj=Cj∖Hj−1for j=2,3,…,k.

Let denote the number of distinct separators and denote the multiplicity of , that is, the number of such that . Generally, we will denote by the set of cliques of a graph and by its set of separators.

Now, let be an arbitrary positive definite matrix with zero restrictions according to ,777It is emphasized here that the ordering of the vertices reflected in plays a crucial role in the definitions and results that follow. that is, whenever . It is known that if is decomposable, then there exists an ordering of the vertices such that if is the modified Cholesky decomposition corresponding to this ordering, then, for ,

 Lij=0whenever (i,j)∉E. (2)

Although the ordering is not unique in general, the existence of such an ordering characterizes decomposable graphs (see plsnpwrsmh ()). A constructive way to obtain such an ordering is given as follows. Label the vertices in descending order, starting with vertices in , with vertices belonging to a particular set being ordered arbitrarily (see lrtzngphmd (), plsnpwrsmh (), wermuthcsp () for more details).

### 2.3 The spaces Pg, Qg and LG

An -dimensional Gaussian covariance graph model888A brief overview of the literature in this area is provided in the Introduction. can be represented by the class of multivariate normal distributions with fixed zeros in the covariance parameter (i.e., marginal independencies) described by a given graph . That is, if , then the th and th components of the multivariate random vector are marginally independent. Without loss of generality, we can assume that these models have mean zero and are characterized by the parameter set of positive definite covariance matrices such that whenever the edge is not in . Following the notation in ltcmssmwdg (), rajmasscar () for decomposable, we define to be the space on which the free elements of the precision matrices (or inverse covariance matrices) live.

More formally, let denote the set of symmetric matrices of order , the cone of positive definite matrices (abbreviated as “”), the linear space of symmetric incomplete matrices with missing entries , and the projection of into . The parameter set of the precision matrices of Gaussian covariance graph models can also be described as the set of incomplete matrices . The entries are not free parameters of the precision matrix for Gaussian covariance graph models (see ltcmssmwdg (), rajmasscar () for details). We are therefore led to consider the two cones

 PG = {y∈M+m|yij=0,(i,j)∉E}, (3) QG = {x∈IG|xCi>0,i=1,…,k}, (4)

where , and denotes the linear space of symmetric matrices with zero entries . Furthermore Grone et al. Grone84 () prove that for decomposable, the spaces and are isomorphic (once more, see ltcmssmwdg (), rajmasscar () for details).

We now introduce new spaces and (the modified Cholesky space) that will be needed in our subsequent analysis999These spaces are not defined in ltcmssmwdg (), rajmasscar ().:

 LG = {L\dvtxLij=0 whenever i0 ∀1≤i≤m}.

We define the mapping as follows:

 ψ(L,D)=LDLT. (5)

This mapping plays an important role in our analysis and shall be studied later.

### 2.4 Homogeneous graphs

A graph is defined to be homogeneous if, for all , either

 {u\dvtxu=j or (u,j)∈E}⊆{u\dvtxu=i or (u,i)∈E}

or

 {u\dvtxu=i or (u,i)∈E}⊆{u\dvtxu=j or (u,j)∈E}.

Equivalently, a graph is said to be homogeneous if it is decomposable and does not contain the graph , denoted by , as an induced subgraph. Homogeneous graphs have an equivalent representation in terms of directed rooted trees, called Hasse diagrams. The reader is referred to ltcmssmwdg () for a detailed account of the properties of homogeneous graphs. We write whenever

 {u\dvtxu=j or (u,j)∈E}⊆{u\dvtxu=i or (u,i)∈E}.

Denote by the equivalence relation on defined by

 iRj⇔i→j and j→i.

Let denote the equivalence class in containing . The Hasse diagram of is defined as a directed graph with vertex set and edge set consisting of directed edges with for if the following holds: and such that , , .

If is a homogeneous graph, then the Hasse diagram described above is a directed rooted tree such that the number of children of a vertex is never equal to one. It was proven in ltcmssmwdg () that there is a one-to-one correspondence between the set of homogeneous graphs and the set of directed rooted trees with vertices weighted by positive integers [], such that no vertex has exactly one child. Also, when , we say that and are twins in the Hasse diagram of . Figure 1 provides an example of a homogeneous graph with seven vertices and the corresponding Hasse diagram.

The following proposition for homogeneous graphs plays an important role in our analysis.

###### Proposition 2

If is a homogeneous graph, then there exists an ordering of the vertices, such that, for this ordering:

1. , where is the modified Cholesky decomposition of ;

2. .

The proof of this proposition is well known and so is omitted for the sake of brevity (see adsnwgnrwd (), khrrjtmwdg (), roverchldn ()). We now describe a procedure for ordering the vertices, under which Proposition 2 holds. Given a homogeneous graph , we first construct the Hasse diagram for . The vertices are labeled in descending order, starting from the root of the tree. If the equivalence class at any node has more than one element, then they are labeled in any order. Hereafter, we shall refer to this ordering scheme as the Hasse perfect vertex elimination scheme. For example, if we apply this ordering procedure to the graph in Figure 1, then the resulting labels are .

### 2.5 Vertex ordering

Let be an undirected decomposable graph with vertex set and edge set . Let denote the permutation group associated with . For any , let , where if and only if . Let denote the subset of permutations of such that, for any with , . Hence, for every , the mapping defined in (5) is a bijection from to . In particular, the ordering corresponding to any perfect vertex elimination scheme lies in (see Section 2.2). If is homogeneous, let denote the subset of permutations of such that . In particular, any ordering of the vertices corresponding to the Hasse perfect vertex elimination scheme lies in (see Section 2.4). The above defines a nested triplet of permutations of given by .

## 3 Wishart distributions for covariance graphs

Let be an undirected decomposable graph with vertex set and edge set . We assume that the vertices in are ordered so that . The covariance graph model associated with is the family of distributions

 G = {Nm(0,Σ)\dvtxΣ∈PG} ≅ {Nm(0,LDLT)\dvtx(L,D)∈\boldsΘG}.

Consider the class of measures on with density [with respect to]

 ˜πU,\boldsα(L,D)=e−(tr((LDLT)−1U)+∑mi=1αilogDii)/2,θ=(L,D)∈\boldsΘG. (6)

These measures are parameterized by a positive definite matrix and a vector with nonnegative entries. Let us first establish some notation:

Let

 zG(U,\boldsα):=∫e−(tr((LDLT)−1U)+∑mi=1αilogDii)/2dLdD.

If , then can be normalized to obtain a probability measure. A sufficient condition for the existence of a normalizing constant for is provided in the following proposition.

###### Theorem 1

Let and . Then,

 ∫\boldsΘGe−(tr((LDLT)−1U)+∑mi=1αilogDii)/2dLdD<∞

if

 αi>|N≺(i)|+2∀i=1,2,…,m.

As the proof of this proposition is rather long and technical, it is deferred to the Appendix. The normalizing constant is not generally available in closed form. Let us consider a simple example to illustrate the difficulty of computing the normalizing constant explicitly.

Let , that is, the path on four vertices, or . Note that this is a decomposable (but not homogeneous) graph. The restrictions on are . Let and . Then, after integrating out the elements (recognizing them as inverse-gamma integrals) and transforming the entries of to the independent entries of (as in the proof of Proposition 1), the normalizing constant reduces to an integral of the form

 ∫R3 (U22+2U12x1+U11x21)−1 ×(U11x21x22+U22x22+U33+2U12x1x22+2U13x1x2+2U23x2)−1 ×(U11x21x22x23+U22x22x23+U33x23+U44+2U12x1x22x23 (+2U13x1x2x23+2U14x1x2x3+2U23x2x23+2U24x2x3+U34x3)−1dx.

The above integral does not seem to be computable by standard techniques for general . Despite this inherent difficulty, we propose a novel method which allows sampling from this rich family of distributions (see Section 4).

We will show later that the condition in Theorem 1 is necessary and sufficient for the existence of a normalizing constant for homogeneous graphs. Moreover, in this case, the normalizing constant can be computed in closed form. We denote by the normalized version of whenever . The following lemma shows that the family is a conjugate family for Gaussian covariance graph models.

###### Lemma 1

Let be a decomposable graph, where vertices in are ordered so that . Let be an i.i.d. sample from , where . Let denote the empirical covariance matrix. If the prior distribution on is , then the posterior distribution of is given by , where and .

{pf}

The likelihood of the data is given by

 f(y1,y2,…,yn∣L,D)=1(√2π)nme−(tr((LDLT)−1(nS))+nlog|D|)/2.

Using as a prior for , the posterior distribution of given the data is

 πU,\boldsα(L,D∣Y1,Y2,…,Yn) ∝e−(tr((LDLT)−1(nS+U))+∑mi=1(n+αi)logDii)/2,\boldsθ∈\boldsΘG.

Hence, the posterior distribution belongs to the same family as the prior, that is,

 πU,\boldsα(⋅∣Y1,Y2,…,Yn)=π˜U,˜\boldsα(⋅),

where and .

{remark*}

If we assume that the observations have unknown mean , that is, are i.i.d. with , then

 ˜S:=1nn∑i=1(Yi−¯Y)(Yi−¯Y)T

is the minimal sufficient statistic for . Here, has a Wishart distribution with parameter and degrees of freedom. Hence, if we assume a prior for , then the posterior distribution is given by

 πU,\boldsα(⋅∣˜S)=π˜U,˜\boldsα(⋅),

where and .

{remark*}

Note that, as with the distributions in ltcmssmwdg (), rajmasscar (), the functional form of the prior distribution depends on the ordering of the vertices specified—but this is not as restrictive as it first appears. In this sense, an ordering is essentially another “parameter” to be specified and thus can also be viewed as imposing extra information. We return to this point in the examples section where we investigate the impact of ordering on a real-world example (see Section 6). But, more importantly, given a perfect ordering of the vertices, any rearrangement of the vertices within the residuals will still preserve the zeros between and and will thus be sufficient for our purposes. In this sense, the covariance Wishart distributions introduced in this paper do not actually depend on a full ordering of the vertices. In fact, for the class of decomposable graphs, any perfect ordering is sufficient, that is, any ordering that is used in ltcmssmwdg () will also be relevant for the covariance Wishart distributions defined above. In this sense, these decomposable covariance Wishart distributions are very flexible, especially since we are working in the curved exponential family setting and are still able to use any ordering that is appropriate for the ltcmssmwdg () distributions which address the natural exponential family (NEF) concentration graph situation. The technical reason why any perfect ordering will suffice is that any perfect ordering will preserve the zeros between and the matrix from its Cholesky decomposition plsnpwrsmh (), roverchldn (). Moreover, from an applications perspective, since matrix operations are not invariant with respect to ordering of the nodes, an ordering that facilitates calculations is desirable. All that the ordering does is to relabel the vertices, but the edge structure is completely and fully retained. To further clarify what is meant, if one has a list of a genes/proteins called ABLIM1, BCL6, etc. and their names are replaced with the numbers 1, 2, 3, etc., the problem can first be analyzed with the integer labels and one can then go back to the original labels after the analysis is done. So, in many applications, the ordering is not a real restriction.

### 3.1 Induced prior on Pg and Qg

The prior on (the modified Cholesky space) induces a prior on (the covariance matrix space) and . We provide an expression for the induced priors on these spaces in order to compare our Wishart distributions with other classes of distributions. Note that since the vertices have been ordered so that , the transformation

 ψ\dvtx\boldsΘG→M+m

defined by

 ψ(L,D)=LDLT=:Σ

is a bijection from to . The lemma below provides the required Jacobians for deriving the induced priors on and . The reader is referred to Section 2.2 for notation on decomposable graphs. Note that if is a matrix, then denotes its determinant, while if is a set, then denotes its cardinality.

###### Lemma 2 ((Jacobians of transformations))
1. The Jacobian of the transformation from to is

 m∏i=1Djj(Σ)−nj.

Here, denotes that is a function of , and for .

2. The absolute value of the Jacobian of the bijection from to is

 ∏C∈C|xC|−|C|−1∏S∈S|xS|(|S|+1)ν(S).
{pf}

The first part is a direct consequence of a result in roveratond () and the proof of the second part can be found in roverchldn ().

These Jacobians allow us to compute the induced priors on and . The induced prior corresponding to on is given by

 ˜πPGU,\boldsα(Σ)∝e−(tr(Σ−1U)+∑mi=1(2ni+αi)logDii(Σ))/2,Σ∈PG. (7)

We first note that the traditional inverse Wishart distribution (see muirheadwd ()) with parameters and is a special case of (7) when is the complete graph and . We also note that the -inverse Wishart priors introduced in ghrmnslvgs () have a one-dimensional shape parameter and are a very special case of our richer class . The single shape parameter is given by the relationship .101010There is an interesting parallel here that becomes apparent from our derivations above. In the concentration graph setting, the single shape parameter hyper-inverse Wishart (HIW) prior of Dawid and Lauritzen dawidlrtzn () is a special case of the multiple shape parameter class of priors introduced by Letac and Massam ltcmssmwdg (), in the sense that (see rajmasscar () for notation). In a similar spirit, we discover that the single shape parameter class of priors in ghrmnslvgs () is a special case of the multiple shape parameter class of priors introduced in this paper, in the sense that .

We now proceed to derive the induced prior on . Let denote the image of in and let denote (see ltcmssmwdg (), rajmasscar () for more details). Using the second part of Lemma 2, the induced prior corresponding to on is given by

 ˜πQGU,\boldsα(x) ∝ e−(tr(^xU)+∑mi=1(2ni+αi)logDii((ˆx)−1))/2 ×∏S∈S|xS|(|S|+1)ν(S)∏C∈C|xC||C|+1,x∈QG.

### 3.2 Comparison with the Letac–Massam priors

We now carefully compare our class of priors to those proposed in Letac and Massam ltcmssmwdg (). In ltcmssmwdg (), the authors construct two classes of distributions, named and , on the spaces and , respectively, for decomposable (see ltcmssmwdg (), Section 3.1). These distributions are generalizations of the Wishart distribution on these convex cones and have been found to be very useful for high-dimensional Bayesian inference, as illustrated in rajmasscar (). These priors lead to corresponding classes of inverse Wishart distributions (on ) and (on ), that is, whenever , and whenever . In ltcmssmwdg (), it is shown that the family of distributions yields a family of conjugate priors in the Gaussian concentration graph setting, that is, when .

As the priors of ltcmssmwdg () are defined on the space , in principle, they can potentially serve as priors111111The use of this class of nonconjugate priors for Bayesian inference in covariance graph models was already explored in lmwdgvpcvg (). in the covariance graph setting since the parameter of interest lives in . Let us examine this class more carefully, first with a view to understanding their use in the covariance graph setting and second to compare them to our priors. Following the notation for decomposable graphs in Section 2.2 and in ltcmssmwdg (), the density of the distribution is given by

 IWU,\boldsα,\boldsβQG(Σ)∝etr(Σ−1U)/2∏C∈C|(Σ−1)C|α(C)+(c+1)/2∏S∈S|(Σ−1)S|ν(S)(β(S)+(s+1)/2),Σ∈PG,

where , and , and , are real numbers. The posterior density of under this prior is given by

 πIWU,\boldsα,\boldsβ(Σ∣Y1,Y2,…,Yn) ∝ e−tr(Σ−1(U+nS))/2 ×∏C∈C|(Σ−1)C|α(C)+(c+1)/2+n/2∏S∈S|(Σ−1)S|ν(S)(β(S)+(s+1)/2)+nν(S)/2.

However, may not, in general, be in , which is a crucial assumption in the analysis in ltcmssmwdg (). Hence, the conjugacy breaks down.

We now investigate similarities and differences between our class of priors and the class. Since the density is defined only for , a pertinent question is whether our class of priors has the same functional form when . We discover that this is not the case and demonstrate this through an example. Consider the -chain . One can easily verify that the terms are identical in both priors. We now show that the remaining terms are not identical. If is the modified Cholesky decomposition of , then, for this particular graph with , , and , , the expression that is not in the exponential term for the density is of the form

 ∏3i=1|(Σ−1)Ci|αi∏3i=2|(Σ−1)Si|βi = (1D11)α1(1D22+L232D33+L232L243D44)α1−β1 ×(1D22)α2(1D33+L243D44)α2−β2(1D33D44)α3.

This expression is clearly different from the term, other than the exponent in , which is a product of different powers of .

However, an interesting property emerges when is homogeneous. Note that, in this case, for any clique and any separator ,

 |(Σ−1)C|=∏i∈C1Dii,|(Σ−1)S|=∏i∈S1Dii.

Hence, when is homogeneous, the class is contained in the class . The containment is strict because need not be in for our class . Also, in , the exponent of and is the same if , that is, the shape parameter, is shared for vertices in the same equivalence class, as defined by the relation . We, however, note that the difference in the number of shape parameters is not a major difference, due to the result of Consonni and Veronese cnsnvrnscr (), together with fact that for the (and, correspondingly, for the ), each one of the blocks has a Wishart distribution (see Theorem 4.5 of ltcmssmwdg ()).

We therefore note that in the restrictive case when is homogeneous and when , the two classes of distributions and have the same functional form. The fact that we do not restrict is an important difference since, even in the homogeneous case, they yield a larger class of distributions on the homogeneous cone compared to those in Andersson and Wojnar adscnvcmcs (), resulting in nonsuperficial consequences for inference in covariance graph models.121212We note, however, that the distributions in adscnvcmcs () are quite general since the authors consider other homogeneous cones and not just .

## 4 Sampling from the posterior distribution

In this section, we study the properties of our family of distributions and thereby provide a method that allows us to generate samples from the posterior distribution corresponding to the priors defined in Section 3. In particular, we prove that can be partitioned into blocks so that the conditional distribution of each block given the others is a standard distribution in statistics and hence easy to sample from. We can therefore generate samples from the posterior distribution by using the block Gibbs sampling algorithm.

### 4.1 Distributional properties and the block Gibbs sampler

Let us introduce some notation before deriving the required conditional distributions. Let be a decomposable graph such that . For a lower-triangular matrix with diagonal entries equal to ,

 Lu⋅ := uth row of L,u=1,2,…,m, L⋅v := vth column of L,v=1,2,…,m, LG⋅v := (Luv)u>v,(u,v)∈E,v=1,2,…,m−1.

So, is the th column of without the components which are specified to be zero under the model (and without the th diagonal entry, which is ). In terms of this notation, the parameter space can be represented as

 \boldsΘG={(LG⋅1,LG⋅2,LG⋅3,…,LG⋅m−1,D)\dvtx (8) \boldsΘG={Lij∈R,∀1≤j0,∀1≤i≤m}.

Suppose that for some positive definite and with nonnegative entries. The posterior distribution is then , where . In the following proposition, we derive the distributional properties which provide the essential ingredients for our block Gibbs sampling procedure.

###### Theorem 2

Using the notation above, the conditional distributions of each component of [as in (4.1)] given the other components and the data are as follows:

1.  LG⋅v∣(L∖LG⋅v,D,Y1,Y2,…,Yn)∼N(\boldsμv,G,Mv,G)∀v=1,2,…,m−1,

where

 μv,Gu := μvu+∑u′>v:(u′,v)∈E∑w>v:(w,v)∉Eor wv:(u′,v)∈E∑w>v:(w,v)∉Eor wv,(u,v)∈E, μvu := (L−1˜U)vu(L−1˜U(LT)−1)vv∀u such that L−1vu=0, (Mv,G)−1uu′ := (L−1˜U(LT)−1)vv(LDLT)−1uu′∀u,u′>v,(u,v),(u′,v)∈E;
2.  Dii∣L,Y1,Y2,…,Ym∼IG(˜αi2−1,(L−1˜U(LT)−1)ii2),

independently for , where “” represents the inverse-gamma distribution.

{remark*}

The notation in the definition of above means indices for which is as a function of entries of .

Deriving the required conditional distributions in Theorem 2 entails careful analysis. We first state two lemmas which are essential for deriving these distributions.

Let , . Then,

 ∂L−1