The ratio of normalizing constants for Bayesian graphical Gaussian model selection

# The ratio of normalizing constants for Bayesian graphical Gaussian model selection

Tilburg University
H. Massam
York University
massamh@yorku.ca
G. Letac
Université de Toulouse
gerard.letac@math.univ-toulouse.fr
###### Abstract

The ratio of normalizing constants for the -Wishart distribution, for two graphs differing by an edge , has long been a bottleneck in the search for efficient model selection in the class of graphical Gaussian models. We give an accurate approximation to this ratio under two assumptions: first we assume that the scale of the prior is the identity, second we assume that the set of paths between the two ends of are disjoint. The first approximation does not represent a restriction since this is what statisticians use. The second assumption is a real restriction but we conjecture that similar results are also true without this second assumption. We shall prove it in subsequent work.

This approximation is simply a ratio of Gamma functions and thus need no simulation. We illustrate the efficiency and practical impact of our result by comparing model selection in the class of graphical Gaussian models using this approximation and using current Metropolis-Hastings methods. We work both with simulated data and a complex high-dimensional real data set. In the numerical examples, we do not assume that the paths between the two end points of edge are disjoint.

Keywords: Bayesian Inference; Gaussian Graphical Models; Model Selection; G-Wishart distribution; Normalizing Constant; Birth-death MCMC.

## 1 Introduction

Given an undirected graph where is a finite set and is the set of undirected edges, we say that the Gaussian variable is Markov with respect to if is independent of given all the other variables whenever there no edge in . It is well-known that in that case the precision matrix belongs to the cone of positive definite matrices with whenever . One can then define the graphical Gaussian model Markov with respect to a given graph as the family of distributions

 NG={N(0,Σ)∣K=Σ−1∈PG}.

Graphical Gaussian models form nowadays one of the basic tools used to analyze high-dimensional complex continuous data. Model selection in this class of models has thus been the topic of much research both from the frequentist and Bayesian point of view. The model is defined by both the graph and the precision matrix . Model search in the frequentist framework is done by maximizing a penalized likelihood (see Friedman et al. (2008)): this yields simultaneously the best (in that sense) and by determining which entries of the covariance matrix are zero and estimating the others. In the Bayesian framework, model search has traditionally been based on the comparison of the posterior distribution of each model, each model being represented by a graph . The selected models are the models with the highest posterior probabilities and the corresponding is then estimated through sampling of the posterior distribution of . The conjugate prior for is the -Wishart as defined by Roverato (2002). The density of the -Wishart can be written as

 f(K∣G)=1IG(δ,D)|K|δ−22exp{−12⟨K,D⟩}1PG(K) (1)

where denotes the determinant of , is the inner product of and in the space of symmetric matrices and is the normalizing constant of the -Wishart which is finite for and . Given a sample from the Gaussian distribution in , let . The marginal density of , which is also the unnormalized posterior density of given , is then

 f(x∣G)=(2π)−np/2IG(δ+n,D+S)IG(δ,D)∫PG|K|δ+n−22exp{−12⟨K,(D+S)⟩}dK.

To do model selection, one must run a Markov chain or some stochastic search to move through the space of graphs, see (Jones et al., 2005, Scott and Carvalho, 2008, Wong et al., 2003). This approach is not feasible for high-dimensional data for two reasons: the Markov chains are slow and the prior and posterior normalizing constants for each graph visited are hard to evaluate numerically.

Another approach to Bayesian model selection is to run a Markov chain on the joint space of , see (Giudici and Green, 1999) or , see (Dobra and Lenkoski, 2011, Dobra et al., 2011, Wang and Li, 2012, Mohammadi and Wit, 2015). The joint posterior distribution of is

 f(K,G∣x) = |K|δ+n−22(2π)−np/2IG(δ,D)exp{−12⟨K,(D+S)⟩}

The advantage of this approach is that the posterior normalizing constant does not come into play. To compute the acceptance probabilities in the chain, the only quantity which is computationally expensive is the ratio of prior normalizing constants

 IG−e(δ,Ip)IG(δ,Ip) (2)

where is the graph obtained from by removing one edge and . A recent paper Uhler et al. (2014) gives the exact analytic expression of these integrals. However, this expression is complicated and impossible to implement at the present time. The purpose of this paper is to give an accurate approximation to (2) which, itself, leads to accurate model selection with minimal scale-free computational burden.

In practice, the normalizing constants in (2) are computed in two ways, either using the Laplace approximation to an integral or the method given by Atay-Kayis and Massam (2005). The Laplace approximation is accurate only if the density of the prior distribution is similar in shape to a multivariate normal distribution and, like for the Wishart distribution, this requires that the shape parameter be high. Since, in order to minimize the impact of the prior distribution on inference, is traditionally chosen to be , which is not high, this approximation is not accurate. Atay-Kayis and Massam (2005) expressed as the product of a constant and an expected value. For , this expression is

 IG(δ,Ip)=p∏i=12δ+νi2(2π)νi/2Γ(δ+νi2)E(fE(ψE)) (3)

where, for a given order of the vertices, is the number of neighbours of vertex which have a numbering larger than or equal to , is the incomplete upper triangular matrix with entries the free entries of the Cholesky decomposition , is a function depending on and more particularly on . The expected value is taken with respect to a product of independent standard normal and chi-square distributions. Evaluating (2) is therefore equivalent to evaluating

 E(fE−e(ψE−e))E(fE(ψE)), (4)

where is the edge set of the graph obtained from by removing the edge . We will show that, under reasonable conditions of sparsity, we have the following approximation of (2)

 IG−e(δ,Ip)IG(δ,Ip)≈12√πΓ(δ+d2)Γ(δ+d+12)

where is the number of minimal paths of length 2 linking the two end points of . Under the assumption that the minimal paths (that is paths without a chord) linking the two vertices of in are disjoint, we show in equation (22) below that the accuracy of this approximation depends on the number and on the length of the paths linking the two end vertices of the edge . We illustrate the accuracy of our approximation through simulations, using numerous configurations, and we show that if the number of paths of length 3 or more is not too large, say , the approximation is of the order of . A relative error of this order does not truly affect the model search. We will follow the method of Mohammadi and Wit (2015) to do a model search using both real and simulated data. We will see that using this approximation actually yields better precision for the identification of the true model than if we actually evaluate (4). This means that graphical Gaussian model selection can be done for high-dimensional data in a fast, scale-free and accurate manner with minimal computational burden.

Though, in the numerical examples considered, the assumption that the minimal paths linking the two end vertices of are disjoints is not satisfied, the approximation seems to work and the model search gives good results. We conjecture that an approximation similar to (22) holds in general. This is the topic of future work.

## 2 The ratio of prior normalizing constants

Let be an undirected graph and the graph obtained by removing a given edge from . Without loss of generality, in this section, we will assume that the numbering of the vertices is such that . The aim of this section is to express ratio (4) in such a way that, in the next section, we can deduce an approximation to it. This is done in Proposition 2.1. As we shall see below, the accuracy of this approximation depends on the number of minimal paths between the two vertices and of . A path between and is said to be minimal if it has no chord.

In a first step, we recall how (3) was obtained in Atay-Kayis and Massam (2005). Let be an upper triangular matrix with positive diagonal elements such is the Cholesky decomposition of . Let be the complement of in the set of all possible edges in a graph with vertex set , i.e. indexes the missing edges of . Similarly denotes the missing edges of . It was shown that

 ψE={ψij,(i,j)∈E}

are free variables in the sense that there is a 1-1 correspondence between and and that can be expressed in terms of and are thus non-free variables. Then, with a change of variables from to , we have

 |K|δ−22exp{−12⟨K,D⟩}1PG(dK)= (5) 2δ+νi2(2π)νi/2Γ(δ+νi2)e−12∑(i,j)∈¯¯¯Eψ2ij(∏(i,j)∈E1√2πe−ψ2ij2dψij) p∏i=11Γ((δ+νi)/2)(ψ2ii2)(δ+νi)/2)e−ψ2ii2d(ψ2ii).

This expression immediately yields (3) with

 fE(ψE)=e−12∑(i,j)∈¯¯¯Eψ2ij (6)

where the expected value is taken with respect to a product of independent for and distribution for . It remains to give the expression of the non-free variables in terms of the free ones. Equation (31) in Atay-Kayis and Massam (2005), in the particular case where , yields

 ψij=−∑i−1r=1ψriψrj2,i

This equation shows that each non-free variables is equal to multiplied by the sum, over the rows which are numbered to of the products of the entries in the column and the column .

It is well-known (see formula (22) in Atay-Kayis and Massam (2005)) that, if is a perfect sequence of prime components of with corresponding separators and induced graphs , then can be decomposed as

 IG(δ,D)=∏kj=1IGPj(δ,DPj)∏kj=2IGSj(δ,DSj), (8)

where and are the submatrices of indexed by the vertices of and respectively. In a second step, we show that the ratio in (2) can be reduced to the ratio

 IG−epq(δ,DG−epq)IGpq(δ,DGpq).

where is the graph induced by and all the vertices contained in the minimal path between and and not containing the edge while is the graph obtained from by adding the edge . We say that a path between and is minimal if this path has no chord. Indeed, if a vertex is not on a path linking and , it is either not linked to either or or it is linked to one of them but not both. In the first case, the graph is disconnected and we clearly only have to consider the connected part of containing and . In the second case, assuming, without loss of generality, that is not linked to , means that there is a separator between the prime components containing and the prime components containing . Since in both and , the prime component are the same, their corresponding terms will cancel out in (8) above and so will the terms for the corresponding separators. Finally if belongs to a path between and but not on a minimal path, then such that is a chord. Then the graph induced by the path and the chord is a prime component which will appear in both the factorization of and . This proves that in our study of (2), it is sufficient to consider a graph which contains only the minimal paths between and . To keep notation simple, from now in this paper, we will write and for and respectively.

In a third step, we will set up the numbering of the vertices in a convenient way. From now on, we assume that is the graph induced by the minimal paths between and . Let be the set of these minimal paths. A path between and with vertices between and will be written

 λ={q,1λ,2λ,…,ℓλ,p}.

We let , and be, respectively, the set of edges, the set of interior vertices of and the set of interior points deprived of , i.e.

 Eλ={(1λ,q),(1λ,2λ),…,((l−1)λ,ℓλ),(ℓλ,p)},Vλ={1λ,2λ,…,ℓλ},V(−1)λ=Vλ∖{1λ}.

If is the total number of minimum paths, we set an arbitrary order of the path where, for convenience, we list the paths of length 2, i.e. last. Within each path , we order the vertices in following the path. The vertices and are ranked last so that the order of the vertices is

 1λ1,…,ℓλ1,1λ2,…,ℓλ2,…,…,1λL,…,ℓλL,q,p. (9)

We are now ready to state the following.

###### Proposition 2.1.

let , and be as defined above. Assume that the paths between and are disjoint, that is, have no edge in common. Let the order of the vertices in be as given in (9). The ratio of prior normalizing constants in (2) is equal to

 IG−e(δ,Ip)IG(δ,Ip)=12√πΓ(δ/2)Γ((δ+1)/2)E(exp−12(∑(i,j)∈¯Eψ2ij+ψ2e))E(exp−12∑(i,j)∈¯Eψ2ij), (10)

where

 ψe=∑λ∈Λ(−1)ℓλ∏a∈Eλψaψqq∏v∈V(−1)λψvv. (11)
###### Proof.

Since is equal to the cardinality of , since the only change between and is the edge and since the neighbours of are the same in and , given the order (9), only is different in and . Indeed while . This yields immediately equation (10). Let us now prove (11). From (7) and given (9), we have

 ψe=ψqp=−∑λ∈Λ∑iλ∈VΛψiλqψiλpψqq. (12)

We want to compute the entries in terms of

 ψE=(ψiλiλ,iλ∈Vλ,ψiλ(i+1)λ,iλ∈Vλ∖{ℓλ},ψ1λ,q,ψℓλ,p,λ∈Λ).

We now note three important facts. First, the elements of the first row of the matrix are all zero except for those corresponding to the edges of the path , i.e.

 ψ1λ1,v=0,v∈∪λ∈ΛVλ,v≠1λ1,2λ1,q. (13)

Second, as a consequence of (13) and (7), the remaining non-free entries in all the columns of except for the columns and , are equal to zero, i.e., for and for ranked before ,

 ψiλ,jλ′=0,1λ

Third, due to the first entry of column being free, none of the entries of column are necessarily zero. However, for each , using iteratively (7), we see that the entries of column are zero except for the last one which is a free variable,i.e.

 ψjλ,p=0,jλ

Applying (15) and (14), equation (12) yields

 ψe=ψ∗qp=−∑λ∈Λψlλqψlλpψqq. (16)

The entries are free. The entries are obtained by successively applying (7), (14) and the fact that are free. That is

 ψlλq = −ψ(l−1)λ,lλψ(l−1)λ,qψ(l−1)λ,(l−1)λ=+ψ(l−1)λ,lλψ(l−2)λ,lλψ(l−2)λ,qψ(l−1)λ,(l−1)λψ(l−2)λ,(l−2)λ (17) = …=(−1)lλ−1ψ1λ,q∏l−1j=1ψjλ,(j+1)λ∏lj=2ψjλ,jλ=(−1)lλ−1ψ1λ,q∏l−1j=1ψjλ,(j+1)λ∏l−1j=1ψ(j+1)λ,(j+1)λ, = (−1)lλ−1ψ1λ,ql−1∏j=1ψjλ,(j+1)λψ(j+1)λ,(j+1)λ. (18)

Equalities (16) and (17) together yield

 ψqp=∑λ∈Λ(−1)lλψ1λ,qψlλ,p∏l−1j=1ψjλ,(j+1)λψqq∏lj=2ψjλ,jλ=∑λ∈Λ(−1)lλψlλ,pψ1λ,qψqql−1∏j=1ψjλ,(j+1)λψ(j+1)λ,(j+1)λ,

which is identical to (11). ∎

We immediately illustrate the ranking of the vertices, our notation and the expressions in (17) and (11) with an example.

###### Example 2.1.

Consider the graph of Figure 1 where for simplicity, we write for

The matrix is as follows.

 11213141122213qpψ1111ψ112100000ψ11q0ψ2121ψ21310000∗0ψ3131ψ3141000∗0ψ4141000∗ψ41pψ1212ψ12220ψ12q0ψ22220∗ψ22pψ1313ψ13qψ13pψqq∗ψpp

where the entries marked with a are the non-free entries and are given as follows:

 ψ21q=−ψ1121ψ11qψ2121,ψ31q=ψ2131ψ1121ψ11qψ2121ψ3131,ψ41q=−ψ3141ψ2131ψ1121ψ11qψ2121ψ3131ψ4141 ψ22q=−ψ1222ψ12qψ2222 ψqp=−1ψqq(ψ13qψ13q+ψ22qψ22p+ψ41qψ41p) =−ψ13qψ13qψqq+ψ1222ψ12qψ22pψqqψ2222−ψ41pψ3141ψ2131ψ1121ψ11qψqqψ2121ψ3131ψ4141.

The paths of length 2, that is, with will play a special role in the approximation of the ratio (4) (and thus (2)) that we seek to approximate. We thus write (11) as

 ψe=ψqp=−A+b (19)

where

 A=∑λ∈Λ,lλ=1ψlλ,q,ψℓλ,pψqq,b=1ψqq∑λ∈Λ,ℓλ≥2(−1)ℓλψ1λ,qψℓλ,p∏ℓλ−1j=1ψjλ,(j+1)λ∏ℓλ−1j=1ψ(j+1)λ,(j+1)λ. (20)

## 3 An approximation to the ratio of normalizing constants

### 3.1 The results

Following (5) and (6), in order to approximate the ratio (10), we need to approximate

 E(exp−12(∑(i,j)∈¯Eψ2ij+ψ2e))E(exp−12∑(i,j)∈¯Eψ2ij)=E(e−12∑(i,j)∈¯Eψ2ije−A22e−Ab−b22)E(e−12∑(i,j)∈¯Eψ2ij). (21)

From (5), we see that is the sum of ratios of product of independent normal by the square root of a chi-square distribution with degrees of freedom and similarly is the sum of ratios of products of standard normals by a product of square roots of chi-square distributions. To obtain an approximation to (21), we proceed in two steps. We will first show that is a good approximation to We will then see that is independent of and thus

 E(e−12∑(i,j)∈¯Eψ2ije−A22e−Ab−b22)E(e−12∑(i,j)∈¯Eψ2ij)≈E(e−12∑(i,j)∈¯Eψ2ije−A22)E(e−12∑(i,j)∈¯Eψ2ij)=E(e−A22),

which can easily be obtained explicitly. The following theorem is the main result of this paper and yields an accurate approximation to ratio (2).

###### Theorem 3.1.

For as defined above and for and as defined in (19) and (20), we have

 0≤1−E(exp−12(∑(i,j)∈¯Eψ2ij+ψ2e))E(exp−12(∑(i,j)∈¯Eψ2ij+A2))≤B, (22)

where

 B=2πr(δ)δδ+2(∑λ∈Λ2+Rℓλδ)Rδ+d−1,

where is the number of paths of length 2, i.e. with , and

 Rδ=Γ(δ2)√πΓ(δ+12),r(δ)=Γ(12(δ+1))2Γ(12δ)Γ(12(δ+2)).

Moreover, is independent of with

 E(e−A22)=Γ((δ+d)/2)Γ((δ+1)/2)Γ(δ/2)Γ((δ+d+1)/2) (23)

and, with an accuracy given by (22),

 IG−e(δ,D)IG(δ,D)≈12√πΓ(δ+d2)Γ(δ+d+12). (24)

The approximation (24) gives an analytically explicit expression for ratio (2) and thus removes the need to do the simulations that were previously necessary to the evaluation of (4) in model search. This makes the search algorithm scale free. We will illustrate this fact in Section 4. Before doing so and before giving the proof of Theorem 3.1 above, we give, in Table 1 the value of under different scenarios for graphs having five different paths between and .

For ,

 B=83π∑λ∈Λ2+RℓλδRδ+d.

The computations in Table 1 and Figure 2 have been done with replications and iterations for the Monte-Carlo algorithm.

### 3.2 The proof

We express here and in terms of

 A = −∑λ∈Λ,lλ=1ψlλ,q,ψlλ,pψqq=−1ψqq∑λ∈Λ,lλ=1ψlλ,q,ψlλ,p=A1√Qδ, b = 1ψqq∑λ∈Λ,lλ≥2(−1)lλψ1λ,qψlλ,plλ−1∏jλ=1ψjλ,(j+1)λψ(j+1)λ,(j+1)λ=b1√Qδ=∑λ∈Λ,lλ≥2b1λ√Qδ,

where

 A1=∑λ∈Λ,lλ=1ψlλ,q,ψlλ,p,b1λ=∑λ∈Λ,lλ≥2(−1)lλψ1λ,qψlλ,plλ−1∏jλ=1ψjλ,(j+1)λψ(j+1)λ,(j+1)λ,Qδ=ψ2qq.

All the entries of appearing above are independent and those appearing in are different from those appearing in . Thus , and are stochastically independent. For ease of notation, from now on, we will write for the set of paths in that have . We define and as follows:

 D=∑(i,j)∈¯Eψ2ij=∑λ∈Λ2+lλ∑k=2{(−1)k−1ψ1λ,qk−1∏jλ=1ψjλ,(j+1)λψ(j+1)λ,(j+1)λ}2=∑λ∈Λ2+Dλ

To prove (22), we thus have to find an upper bound for

 |1−I1I2|=|1−E(e−