An empirical Bayes procedure for the selection of Gaussian graphical models
Abstract
A new methodology for model determination in decomposable graphical Gaussian models
(Dawid and Lauritzen, 1993) is developed. The Bayesian paradigm is used and, for each given graph,
a hyper inverse Wishart prior distribution on the covariance matrix is considered. This prior distribution
depends on hyperparameters.
It is well known that the posterior distribution of the models is sensitive to the
specification of these hyperparameters, and no completely satisfactory method for choosing them is available.
In order to avoid this problem, we suggest adopting an empirical Bayes strategy, that is, a strategy in which
the values of the hyperparameters are determined using the data. Typically, the hyperparameters are fixed to
their maximum likelihood estimates. In order to calculate these maximum likelihood estimates, we suggest
a Markov chain Monte Carlo version of the Stochastic Approximation EM algorithm.
Moreover, we introduce a new sampling scheme
in the space of graphs that improves the add and delete proposal of Armstrong et al. (2009).
We illustrate the efficiency of this new scheme on simulated and real datasets.
Keywords: Gaussian graphical models, decomposable models, empirical Bayes, Stochastic Approximation EM, Markov Chain Monte Carlo
1 Gaussian graphical models in a Bayesian context
Statistical applications in genetics, sociology, biology, etc., often lead to complicated interaction patterns between variables. Graphical models have proved to be powerful tools for representing the conditional independence structure of a multivariate distribution: the nodes represent the variables and the absence of an edge between two vertices indicates some conditional independence between the associated variables.
Our paper presents a new approach for estimating the graph structure in Gaussian graphical models. A very large literature deals with this issue in the Bayesian paradigm: Dawid and Lauritzen (1993); Madigan and Raftery (1994); Giudici and Green (1999); Jones et al. (2005); Armstrong et al. (2009); Carvalho and Scott (2009). From a frequentist point of view, one can see Drton and Perlman (2004).
We suggest here an empirical Bayes approach: the parameters of the prior are estimated from the data. Parametric empirical Bayes methods have a long history, with major developments evolving in the sequence of papers by Efron and Morris (1971, 1972b, 1972a, 1973a, 1973b, 1976b, 1976a). Empirical Bayes estimation falls outside the Bayesian paradigm. However, it has proven to be an effective technique for constructing estimators that perform well under both Bayesian and frequentist criteria. Moreover, in the case of decomposable Gaussian graphical models, it gives a default and objective way of constructing prior distributions. The theory and applications of empirical Bayes methods are given by Morris (1983).
In this section, we first recall some results on Gaussian graphical models, and then justify the use of the empirical Bayes strategy.
1.1 Background on Gaussian graphical models
Let $G = (V, E)$ be an undirected graph with set of vertices $V = \{1, \ldots, p\}$ and set of edges
$E \subseteq \{(i, j) : i, j \in V,\ i < j\}$.
Using the notations of Giudici and Green (1999), we first recall the definition of a decomposable graph.
A graph or subgraph is said to be complete if all pairs of its vertices are joined by edges.
Moreover, a complete subgraph that is not contained within another complete subgraph is called a clique.
Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be the set of the cliques of an undirected graph $G$.
An order $(C_1, \ldots, C_k)$ of the cliques is said to be perfect if, for every $i = 2, \ldots, k$, there exists $j < i$
such that
$$S_i = C_i \cap \left( C_1 \cup \cdots \cup C_{i-1} \right) \subseteq C_j.$$
$\mathcal{S} = \{S_2, \ldots, S_k\}$ is the set of separators associated to the perfect order $(C_1, \ldots, C_k)$.
An undirected graph admitting a perfect order is said to be decomposable. Let $\mathcal{D}_p$ denote the set of decomposable graphs with
$p$ vertices. For more details, one can refer to Dawid and Lauritzen (1993), Lauritzen (1996) (Chapters 2, 3 and 5)
or Giudici and Green (1999).
The graph drawn in Figure 1 – used as a benchmark in the numerical Section 4.2 – is decomposable. Indeed, its set of cliques, taken with the associated separators, can be arranged in a perfect order.
Note that, with $p$ vertices, the total number of possible graphs is $2^M$, $M = p(p-1)/2$ being the number of possible edges. The total number of decomposable graphs with $p$ vertices can be calculated for moderate values of $p$. For instance, if $p = 6$, among the $32{,}768$ possible graphs, $18{,}154$ are decomposable (around $55\%$); if $p = 8$, then $30{,}888{,}596$ of the $268{,}435{,}456$ possible graphs are decomposable (around $12\%$).
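These counts can be checked by brute force for small $p$, since a graph is decomposable if and only if it is chordal. The following sketch (our own illustration, assuming the networkx library is available; it is not part of the methodology developed here) enumerates all labelled graphs on $p = 4$ vertices and counts the chordal ones:

```python
# Brute-force check of the decomposable-graph counts: a graph is
# decomposable iff it is chordal, so enumerate all 2^(p(p-1)/2)
# labelled graphs on p vertices and count the chordal ones.
from itertools import combinations
import networkx as nx

p = 4
possible_edges = list(combinations(range(p), 2))  # M = p(p-1)/2 = 6

n_decomposable = 0
for k in range(len(possible_edges) + 1):
    for edge_subset in combinations(possible_edges, k):
        G = nx.Graph()
        G.add_nodes_from(range(p))
        G.add_edges_from(edge_subset)
        if nx.is_chordal(G):
            n_decomposable += 1

print(n_decomposable, "decomposable graphs out of", 2 ** len(possible_edges))
# expected: 61 decomposable graphs out of 64
```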
A pair $(A, B)$ of subsets of the vertex set $V$ of an undirected graph $G$ is said to form a decomposition of $G$ if (1) $V = A \cup B$, (2) $A \cap B$ is complete and (3) $A \cap B$ separates $A$ from $B$, i.e. any path from a vertex in $A$ to a vertex in $B$ goes through $A \cap B$.
To each vertex $i \in V$, we associate a random variable $y_i$. For $A \subseteq V$, $y_A$ denotes the collection of random variables $(y_i)_{i \in A}$. To simplify the notation, we set $y = y_V$. The probability distribution of $y$ is said to be Markov with respect to $G$ if, for any decomposition $(A, B)$ of $G$, $y_A$ is independent of $y_B$ given $y_{A \cap B}$. A graphical model is a family of distributions on $y$ verifying the Markov property with respect to a graph.
A Gaussian graphical model, also called a covariance selection model (see Dempster (1972)), is such that

(1) $\quad y \mid \mu, \Sigma \sim \mathcal{N}_p(\mu, \Sigma),$

where $\mathcal{N}_p(\mu, \Sigma)$ denotes the $p$-variate Gaussian distribution with expectation $\mu$ and symmetric positive definite covariance matrix $\Sigma$. $\Sigma$ has to ensure the Markov property with respect to $G$. In the Gaussian case, $\mathcal{N}_p(\mu, \Sigma)$ is Markov with respect to $G$ if and only if
$$\left( \Sigma^{-1} \right)_{ij} = 0 \quad \text{for all } (i, j) \notin E,$$
where $\Sigma^{-1}$ denotes the inverse of the matrix $\Sigma$; $\Sigma^{-1}$ is called the concentration matrix.
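As a small numerical illustration of this equivalence (a sketch using numpy only; the matrix below is an arbitrary toy example, not taken from the paper), a zero entry of the concentration matrix corresponds to a zero partial correlation, even though the marginal covariance of the same pair is nonzero:

```python
# Toy illustration: K[0, 2] = 0 encodes the conditional independence of
# y_0 and y_2 given y_1, although their marginal covariance is nonzero.
import numpy as np

K = np.array([[2.0, 0.6, 0.0],
              [0.6, 2.0, 0.5],
              [0.0, 0.5, 2.0]])               # concentration matrix, no edge (0, 2)
Sigma = np.linalg.inv(K)                      # covariance matrix

print(Sigma[0, 2])                            # nonzero: marginal dependence
print(-K[0, 2] / np.sqrt(K[0, 0] * K[2, 2]))  # partial correlation given y_1: exactly 0
```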
In the following, we suppose that we observe a sample $y = (y^1, \ldots, y^n)$ from model (1) with mean parameter $\mu$ set to zero, the data being expressed as deviations from the sample mean. This centering strategy is standard in the literature; however, the technique developed here can be easily extended to the case $\mu \neq 0$.
The density of $y^i$ is a function of multivariate Gaussian densities on the cliques and separators of $G$. More precisely, let $\mathcal{C}$ and $\mathcal{S}$ denote respectively the sets of the cliques and separators of $G$ corresponding to a perfect order for $G$. We have:

(2) $\quad f(y^i \mid \Sigma, G) = \dfrac{\prod_{C \in \mathcal{C}} \varphi_{|C|}\left( y^i_C \mid \Sigma_C \right)}{\prod_{S \in \mathcal{S}} \varphi_{|S|}\left( y^i_S \mid \Sigma_S \right)},$

where, for every subset of vertices $A \subseteq V$, $|A|$ denotes its cardinal, $\Sigma_A$ is the restriction of $\Sigma$ to $A$, i.e. $\Sigma_A = (\Sigma_{uv})_{u, v \in A}$, and $y^i_A = (y^i_u)_{u \in A}$. $\varphi_{|A|}(\cdot \mid \Sigma_A)$ is the $|A|$-variate Gaussian density with mean $0$ and symmetric positive definite covariance matrix $\Sigma_A$.
From a Bayesian perspective, we are interested in the posterior probabilities

(3) $\quad \pi(G \mid y) \propto f(y \mid G)\, \pi(G), \qquad f(y \mid G) = \int f(y \mid \Sigma, G)\, \pi(\Sigma \mid G)\, d\Sigma,$

for specific priors $\pi(\Sigma \mid G)$ and $\pi(G)$. In the following, we discuss the choice of these prior distributions.
1.2 Prior distributions specification
Prior and posterior distributions for the covariance matrix
Conditionally on $G$, we set a Hyper-Inverse Wishart (HIW) distribution as prior distribution on $\Sigma$:
$$\Sigma \mid G \sim \mathcal{HIW}_G(\delta, \Phi),$$
where $\delta > 0$ is the degree of freedom and $\Phi$ is a symmetric positive definite location matrix. This distribution is the unique hyper-Markov distribution such that, for every clique $C \in \mathcal{C}$, $\Sigma_C \sim \mathcal{IW}(\delta, \Phi_C)$ with density

(4) $\quad \pi(\Sigma_C \mid \delta, \Phi_C) = h(\delta, \Phi_C)^{-1}\, |\Sigma_C|^{-(\delta + 2|C|)/2} \exp\left\{ -\tfrac{1}{2} \operatorname{tr}\left( \Sigma_C^{-1} \Phi_C \right) \right\},$

where $h(\delta, \Phi_C)$ is the normalizing constant:

(5) $\quad h(\delta, \Phi_C) = \Gamma_{|C|}\!\left( \tfrac{\delta + |C| - 1}{2} \right) \left| \tfrac{\Phi_C}{2} \right|^{-(\delta + |C| - 1)/2},$

where $|\cdot|$ and $\operatorname{tr}(\cdot)$ denote respectively the determinant and the trace, and $\Gamma_k$ is the multivariate gamma function with parameter $k$:
$$\Gamma_k(a) = \pi^{k(k-1)/4} \prod_{i = 1}^{k} \Gamma\!\left( a + \tfrac{1 - i}{2} \right).$$
The full joint density is:

(6) $\quad \pi(\Sigma \mid G, \delta, \Phi) = \dfrac{\prod_{C \in \mathcal{C}} \pi(\Sigma_C \mid \delta, \Phi_C)}{\prod_{S \in \mathcal{S}} \pi(\Sigma_S \mid \delta, \Phi_S)}.$
Conditionally on $G$, the HIW distribution is conjugate. The posterior distribution of $\Sigma$ is given by (Giudici, 1996):

(7) $\quad \Sigma \mid y, G \sim \mathcal{HIW}_G\left( \delta + n,\ \Phi + y^{\mathsf{T}} y \right),$

where $y$ is the $n \times p$ matrix of observations, $y^{\mathsf{T}}$ denoting the transpose of $y$.
Moreover, for such a prior distribution, the marginal likelihood for any graph $G$ is a simple function of the prior and posterior HIW normalizing constants (Giudici, 1996):

(8) $\quad f(y \mid G, \delta, \Phi) = (2\pi)^{-np/2}\, \dfrac{h_G(\delta + n, \Phi + y^{\mathsf{T}} y)}{h_G(\delta, \Phi)},$

where $h_G(\delta, \Phi)$ is the normalizing constant of the HIW distribution, which can be computed explicitly for decomposable graphs from the normalizing constants (5) of the inverse Wishart clique and separator densities:
$$h_G(\delta, \Phi) = \dfrac{\prod_{C \in \mathcal{C}} h(\delta, \Phi_C)}{\prod_{S \in \mathcal{S}} h(\delta, \Phi_S)}.$$
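The following sketch (our own illustration, assuming numpy and scipy; the clique and separator lists are supplied by hand) evaluates the log of the marginal likelihood (8) through the normalizing constants (5), using scipy.special.multigammaln for the multivariate gamma function:

```python
# Closed-form log marginal likelihood log f(y | G) of (8) for a decomposable
# graph given by its cliques and separators in a perfect order.
import numpy as np
from scipy.special import multigammaln


def log_h(delta, Phi):
    """log of the inverse Wishart normalizing constant (5) on one clique/separator."""
    d = Phi.shape[0]
    a = (delta + d - 1) / 2.0
    return multigammaln(a, d) - a * np.linalg.slogdet(Phi / 2.0)[1]


def log_h_G(delta, Phi, cliques, separators):
    """log h_G: ratio of clique and separator constants for a decomposable graph."""
    out = sum(log_h(delta, Phi[np.ix_(C, C)]) for C in cliques)
    out -= sum(log_h(delta, Phi[np.ix_(S, S)]) for S in separators)
    return out


def log_marginal(y, delta, Phi, cliques, separators):
    """log f(y | G, delta, Phi) of equation (8)."""
    n, p = y.shape
    post = Phi + y.T @ y
    return (-0.5 * n * p * np.log(2 * np.pi)
            + log_h_G(delta + n, post, cliques, separators)
            - log_h_G(delta, Phi, cliques, separators))


# Toy usage: p = 3 chain graph 0 - 1 - 2, cliques {0,1} and {1,2}, separator {1}.
rng = np.random.default_rng(0)
y = rng.standard_normal((50, 3))
print(log_marginal(y, 1.0, np.eye(3), cliques=[[0, 1], [1, 2]], separators=[[1]]))
```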
Prior and posterior distributions for the graphs
The prior distribution in the space of decomposable graphs has been widely discussed in the literature. The naive choice is to use the standard uniform prior distribution:
$$\pi(G) \propto \mathbb{1}_{\{G \in \mathcal{D}_p\}}.$$
One great advantage of this choice is that it simplifies the calculus, but it can be criticized. Indeed, with $p$ vertices, the number of possible edges is equal to $M = p(p-1)/2$ and, in the case of a uniform prior over all graphs, the prior number of edges has its mode around $M/2$, which is typically too large.
An alternative to this prior is to set a Bernoulli distribution of parameter $r$ on the inclusion or not of each edge (Jones et al., 2005; Carvalho and Scott, 2009):

(9) $\quad \pi(G \mid r) \propto r^{k(G)} (1 - r)^{M - k(G)},$

where $k(G)$ is the number of edges of $G$. The parameter $r$ has to be calibrated. If $r = 1/2$, this prior reduces to the uniform one.
In the following, we consider this prior distribution and give an empirical estimation of $r$.
Using (8) and (9), we deduce easily that the density of the posterior distribution in the space of decomposable graphs satisfies:

(10) $\quad \pi(G \mid y, \delta, \Phi, r) \propto \dfrac{h_G(\delta + n, \Phi + y^{\mathsf{T}} y)}{h_G(\delta, \Phi)}\, r^{k(G)} (1 - r)^{M - k(G)}\, \mathbb{1}_{\{G \in \mathcal{D}_p\}}.$
This posterior distribution is known to be sensitive to the specification of the hyperparameters $\delta$, $\Phi$ and $r$ (see Jones et al. (2005); Armstrong et al. (2009)). To tackle this problem, various strategies have been developed. In the following, we supply a short review of these methods and offer an alternative one.
Choice of the hyperparameters $\delta$, $\Phi$ and $r$
In a fully Bayesian context, as proposed by Giudici and Green (1999), a hierarchical prior modelling can be used. In this approach, $\delta$ and $\Phi$ are considered as random quantities and a prior distribution is assigned to those parameters ($r$ is fixed to $1/2$). This strategy does not completely solve the problem since the prior distributions on $\delta$ and $\Phi$ also depend on hyperparameters which are difficult to calibrate.
Another strategy consists in fixing the values of $\delta$, $\Phi$ and $r$, as in Jones et al. (2005). In that paper, $r$ is set to $2/(p - 1)$, encouraging sparse graphs. They choose $\delta = 3$, which is the minimal integer such that the first moment of the prior distribution on $\Sigma$ exists. Finally, they set $\Phi = \tau I_p$ and, using the fact that the mode of the marginal prior for each variance term is equal to $\tau/(\delta + 2)$, $\tau$ is fixed to $\delta + 2$ if the data set is standardized.
An intermediate strategy is suggested by Armstrong et al. (2009). First, they fix the value of $\delta$ (more precisely, they treat the scale of the prior as unknown with a uniform prior distribution, a situation which corresponds to fixing $\delta$ implicitly), asserting that such a value gives a suitably noninformative prior for $\Sigma$. Then, they consider different possibilities for $\Phi$, all of the form $\Phi = \tau D$ where the matrix $D$ is fixed. In all cases, for the hyperparameter $\tau$, they use a uniform prior distribution on an interval $(0, T)$ where $T$ is very large. Finally, they also use a hierarchical prior on $G$: $G \mid r$ distributed according to (9) with $r \sim \mathcal{U}[0, 1]$, which leads to
$$\pi(G) = \frac{1}{M + 1} \binom{M}{k(G)}^{-1}$$
by integration, $\binom{M}{k}$ being the binomial coefficient.
This hierarchical prior on $G$ is also used in Carvalho and Scott (2009). In that paper, they suggest a HIW $g$-prior approach with $g = 1/n$; this approach consists of fixing $\delta = gn$ and $\Phi = g\, y^{\mathsf{T}} y$.
From our point of view, $\delta$ measures the amount of information in the prior relative to the sample (see (7)): we suggest setting $\delta = 1$, so that the prior weight is the same as the weight of one observation. As pointed out by Jones et al. (2005), for this particular choice, the first moment of the prior distribution on $\Sigma$ does not exist. However, for any $\delta > 0$, the prior distribution is proper, and we fail to see any argument in favour of the existence of a first moment.
The structure of $\Phi$ can be discussed and various forms exist in the literature (see Armstrong et al. (2009) for instance). In this paper, we standardise the data and use $\Phi = \tau I_p$. This choice leads to sparse graphs: on average, each variable has major interactions with a relatively small number of other variables. In that context, $\tau$ plays the role of a shrinkage factor and has to be carefully chosen on the appropriate scale.
In this paper, we recommend the use of an empirical Bayes strategy, fixing $(\tau, r)$ to its maximum likelihood estimate, whose computation is a challenging issue. To tackle this point, a Markov chain Monte Carlo (MCMC) version of the Stochastic Approximation EM (SAEM) algorithm is used.
2 An empirical Bayes procedure via the SAEM-MCMC algorithm
In the following, we set $\theta = (\tau, r)$. In order to compute the maximum likelihood estimate of $\theta$, we need to optimize in $\theta$ the following function:

(11) $\quad \ell(y; \theta) = \log \sum_{G \in \mathcal{D}_p} f(y \mid G, \delta, \tau I_p)\, \pi(G \mid r).$
If the number of vertices $p$ is greater than the moderate values mentioned in Section 1.1, the number of decomposable graphs is so huge that it is not possible
to calculate the sum over $\mathcal{D}_p$. In that case, we consider the use of the Expectation-Maximization (EM) algorithm
developed by Dempster et al. (1977), noting that the data $y$ are the partial observations of the complete data $(y, \Sigma, G)$. However, for such a data augmentation scheme, the E-step of the EM algorithm is not explicit and
we have to resort to a stochastic version of the EM algorithm, such as:
- the MCEM or MCMC-EM algorithms, where the E-step is replaced by some Monte Carlo approximation (McLachlan and Krishnan, 2008);
- the SAEM algorithm introduced by Delyon et al. (1999), where the E-step is divided into a simulation step and a stochastic approximation step;
- the SAEM-MCMC algorithm (Kuhn and Lavielle, 2004), which extends the SAEM scheme, the "exact" simulation step being replaced by a simulation from an ergodic Markov chain.
The SEM, MCEM and SAEM methods require simulating a realization from the conditional distribution of $(\Sigma, G)$ given $y$ and $\theta$, which we are not able to do exactly. We therefore use the SAEM-MCMC algorithm, which only requires some realizations from an ergodic Markov chain with stationary distribution $\pi(\Sigma, G \mid y; \theta)$. In a first part, we recall the EM algorithm principles and present the SAEM-MCMC scheme. In a second part, we detail its application to Gaussian graphical models and prove its convergence.
2.1 The Stochastic Approximation version of the EM algorithm
The EM algorithm is competitive when the maximization of the function
$$\theta \mapsto Q(\theta \mid \theta') = \mathbb{E}\left[ \log f(y, \Sigma, G; \theta) \mid y; \theta' \right]$$
is easier than the direct maximization of the marginal likelihood (11). The EM algorithm is a two-step iterative procedure. More precisely, at the $m$-th iteration, the E-step consists of evaluating $Q_m(\theta) = Q(\theta \mid \widehat{\theta}_{m-1})$, while the M-step updates $\widehat{\theta}_{m-1}$ by maximizing $Q_m(\theta)$.
For complicated models where the E-step is untractable, Delyon et al. (1999) introduced the Stochastic Approximation EM algorithm (SAEM), replacing the E-step by a stochastic approximation of $Q_m(\theta)$. At iteration $m$, the E-step is divided into a simulation step (S-step) of the missing data $(\Sigma_m, G_m)$ under the posterior distribution $\pi(\Sigma, G \mid y; \widehat{\theta}_{m-1})$ and a stochastic approximation step (SA-step):
$$Q_m(\theta) = Q_{m-1}(\theta) + \gamma_m \left[ \log f(y, \Sigma_m, G_m; \theta) - Q_{m-1}(\theta) \right],$$
where $(\gamma_m)_{m \geq 1}$ is a sequence of positive numbers decreasing to zero. When the joint distribution of $(y, \Sigma, G)$ belongs to the exponential family, the SA-step reduces to a stochastic approximation of the minimal sufficient statistics. The M-step remains the same. One of the benefits of the SAEM algorithm is its low sensitivity to the initialization $\widehat{\theta}_0$, due to the stochastic approximation of the SA-step.
In Gaussian graphical models, we cannot generate directly a realization from the conditional distribution of $(\Sigma, G)$ given $y$ and $\theta$. For such cases, Kuhn and Lavielle (2004) suggest replacing the simulation step by an MCMC scheme, which consists of generating realizations from an ergodic Markov chain with stationary distribution $\pi(\Sigma, G \mid y; \widehat{\theta}_{m-1})$ and using the last simulation in the SAEM algorithm. Kuhn and Lavielle (2004) prove the convergence of the sequence of estimates provided by this SAEM-MCMC algorithm towards a maximum of the function (11), under general conditions, for the exponential family.
2.2 The SAEM-MCMC algorithm on Gaussian graphical models
In this section, we detail the application of the SAEM-MCMC algorithm to the Gaussian graphical model introduced in Section 1.2. More precisely, we give the expression of the complete log-likelihood and of the minimal sufficient statistics. Lavielle and Lebarbier (2001) applied the same methodology to a change-point problem.
The complete log-likelihood can be decomposed into three terms:

(12) $\quad \log f(y, \Sigma, G; \theta) = \log f(y \mid \Sigma, G) + \log \pi(\Sigma \mid G; \tau) + \log \pi(G; r).$

On the right-hand side of equation (12), the first quantity is independent of $\theta$; thus, it will not take part in its estimation. Using the fact that we only consider decomposable graphs and the definition of the Hyper-Inverse Wishart distribution, the second term of the right-hand side of Equation (12) can be developed:
$$\log \pi(\Sigma \mid G; \tau) = \sum_{C \in \mathcal{C}} \log \pi(\Sigma_C \mid \delta, \tau I_{|C|}) - \sum_{S \in \mathcal{S}} \log \pi(\Sigma_S \mid \delta, \tau I_{|S|}).$$
Furthermore,
$$\log \pi(G; r) = k(G) \log r + (M - k(G)) \log(1 - r).$$
As a consequence, there exist a function $g$ of $(y, \Sigma, G)$, independent of $\theta$, and functions $\psi$ and $\Psi$ of $\theta$ such that

(13) $\quad \log f(y, \Sigma, G; \theta) = g(y, \Sigma, G) - \psi(\theta) + \left\langle S(\Sigma, G), \Psi(\theta) \right\rangle,$

where $\langle \cdot, \cdot \rangle$ is the scalar product of $\mathbb{R}^3$. Finally, following (13), the complete likelihood function belongs to the exponential family and the minimal sufficient statistic $S(\Sigma, G) = (S_1, S_2, S_3)(\Sigma, G)$ is such that:
$$S_1(\Sigma, G) = \sum_{C \in \mathcal{C}} \operatorname{tr}\left( \Sigma_C^{-1} \right) - \sum_{S \in \mathcal{S}} \operatorname{tr}\left( \Sigma_S^{-1} \right), \quad S_2(\Sigma, G) = \sum_{C \in \mathcal{C}} |C| (\delta + |C| - 1) - \sum_{S \in \mathcal{S}} |S| (\delta + |S| - 1), \quad S_3(\Sigma, G) = k(G).$$
In an exponential model, the SA-step of the SAEM-MCMC algorithm reduces to the approximation of the minimal sufficient statistics. Thus, we can now write the three steps of the SAEM-MCMC algorithm. Let $(\gamma_m)_{m \geq 1}$ be a sequence of positive numbers such that $\sum_m \gamma_m = \infty$ and $\sum_m \gamma_m^2 < \infty$.
Algorithm 1 SAEM-MCMC algorithm
(1) Initialize $\widehat{\tau}_0$, $\widehat{r}_0$, $(\Sigma_0, G_0)$ and $s_0$; set $m = 1$.
(2) At iteration $m$,
[S-step] generate $(\Sigma_m, G_m)$ from a few iterations of an MCMC procedure – detailed in Section 3 – with $\pi(\Sigma, G \mid y; \widehat{\theta}_{m-1})$ as stationary distribution;
[SA-step] update $s_{m-1}$ using a stochastic approximation scheme:
$$s_m = s_{m-1} + \gamma_m \left[ S(\Sigma_m, G_m) - s_{m-1} \right];$$
[M-step] maximize the joint log-likelihood (13):
$$\widehat{\theta}_m = \arg\max_{\theta} \left\{ -\psi(\theta) + \left\langle s_m, \Psi(\theta) \right\rangle \right\}.$$
(3) Set $m \leftarrow m + 1$ and return to (2) until convergence.
The convergence of the sequence of estimates supplied by this SAEM-MCMC algorithm is ensured by the results of Kuhn and Lavielle (2004). Indeed, first, the complete likelihood belongs to the exponential family and the regularity assumptions required by Kuhn and Lavielle (2004) (assumptions M1–M5 and SAEM2) are easily verified. Secondly, the convergence requires the ergodicity of the Markov chain generated at the S-step towards its stationary distribution, that is, the distribution of $(\Sigma, G)$ given $y$. Finally, the properties of $(\gamma_m)_{m \geq 1}$ allow us to apply the results of Kuhn and Lavielle (2004), and we conclude that the sequence of estimates converges almost surely towards a (local) maximum of the function (11).
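For concreteness, the following Python sketch summarizes Algorithm 1. It is our own illustration, not the authors' code: mcmc_step and suff_stat are assumed routines standing respectively for the Section 3 sampler and for the statistics $S_1, S_2, S_3$ of (13), and the closed forms used in the M-step comments ($\widehat{\tau}_m = s_{m,2}/s_{m,1}$, $\widehat{r}_m = s_{m,3}/M$) follow from maximizing (13) under the exponential-family form given above.

```python
# Schematic SAEM-MCMC loop (Algorithm 1). Assumptions: mcmc_step performs a
# few MH transitions targeting pi(Sigma, G | y; tau, r) (Section 3) and
# suff_stat returns the minimal sufficient statistics (S1, S2, S3) of (13);
# M = p(p-1)/2 is the number of possible edges.
import numpy as np

def saem_mcmc(y, M, mcmc_step, suff_stat, n_iter=500, M1=250, tau0=1.0, r0=0.5):
    tau, r = tau0, r0
    Sigma, G = None, None      # missing data, e.g. initialized by a backward search
    s = None                   # stochastic approximation of (S1, S2, S3)
    for m in range(1, n_iter + 1):
        gamma = 1.0 if m <= M1 else 1.0 / (m - M1)        # step sizes, see Section 4.2
        Sigma, G = mcmc_step(Sigma, G, y, tau, r)         # [S-step]
        stat = np.asarray(suff_stat(Sigma, G), dtype=float)
        s = stat if s is None else s + gamma * (stat - s) # [SA-step]
        tau = s[1] / s[0]      # [M-step] closed form: tau maximizes (13)
        r = s[2] / M           # [M-step] closed form: r = average edge frequency
    return tau, r
```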
3 A new Metropolis-Hastings sampler
At each iteration of the SAEM algorithm, a couple $(\Sigma, G)$ has to be generated under the posterior distribution $\pi(\Sigma, G \mid y; \theta)$. As described in Giudici and Green (1999), Brooks et al. (2003) and Wong et al. (2003), this simulation can be achieved using a variable dimension MCMC scheme like the reversible jump algorithm. In the case of an HIW prior distribution on $\Sigma$, the marginal likelihood is available in closed form (8) and, therefore, there is no need to resort to a variable dimension MCMC scheme.
At iteration $m$ of the SAEM algorithm, the simulation of $(\Sigma_m, G_m)$ can be achieved through the following two-step procedure:
[S1-step] generate $G_m$ from $\pi(G \mid y; \widehat{\theta}_{m-1})$;
[S2-step] generate $\Sigma_m$ from $\pi(\Sigma \mid y, G_m; \widehat{\tau}_{m-1})$.
According to (7), the second step [S2-step] resolves into the simulation of HIW distributions, the principle of which is detailed in Carvalho et al. (2007).
For the first step [S1-step], we have to resort to an MCMC algorithm, but not one of variable dimension, since the chain is generated in the space of decomposable graphs with $p$ vertices.
To sample from the posterior in the space of graphs, Armstrong et al. (2009) use the fact that the marginal likelihood is available in closed form and introduce a Metropolis-Hastings (MH) algorithm. At each iteration, their add and delete MH proposal consists of picking uniformly at random an edge such that the current graph with or without this edge stays decomposable, and deducing the proposed graph by deleting the generated edge from the current graph if it contains this edge, or adding the generated edge otherwise.
Let $G$ be the current graph, $d(G)$ the set of decomposable graphs derived from $G$ by removing an edge and $a(G)$ the set of decomposable graphs derived from $G$ by adding an edge. For pedagogical reasons, we present here an add and delete MH sampler slightly different from the one of Armstrong et al. (2009). In our proposal, we first decide at random whether we try to delete or to add an edge. The two schemes have exactly the same properties. Our add and delete algorithm is initialized on a graph $G^{(0)}$ and the following procedure is repeated until convergence is reached.
Algorithm 2 Add and delete MH proposal
At iteration $t$,
1. Choose at random (with probability $1/2$) to delete or add an edge to $G^{(t)}$.
2. If deleting an edge, enumerate $d(G^{(t)})$ and generate $G'$ according to the uniform distribution on $d(G^{(t)})$.
3. If adding an edge, enumerate $a(G^{(t)})$ and generate $G'$ according to the uniform distribution on $a(G^{(t)})$.
4. Calculate the MH acceptance probability $\alpha(G^{(t)}, G')$ such that $\pi(G \mid y, \delta, \tau, r)$ is the invariant distribution of the Markov chain.
5. With probability $\alpha(G^{(t)}, G')$, accept $G'$ and set $G^{(t+1)} = G'$; otherwise, reject $G'$ and set $G^{(t+1)} = G^{(t)}$.
The acceptance probability is equal to $\alpha(G^{(t)}, G') = \min(1, \rho)$ where
$$\rho = \frac{\pi(G' \mid y, \delta, \tau, r)}{\pi(G^{(t)} \mid y, \delta, \tau, r)} \times \frac{q(G', G^{(t)})}{q(G^{(t)}, G')},$$
with $q(G, \tilde{G}) = \frac{1}{2} |d(G)|^{-1}$ if $\tilde{G} \in d(G)$ and $q(G, \tilde{G}) = \frac{1}{2} |a(G)|^{-1}$ if $\tilde{G} \in a(G)$.
Note that, because in general $|d(G^{(t)})| \neq |a(G')|$, the proposal distribution is not symmetric. The ratio of posterior probabilities is evaluated with formula (10).
The enumerations of $d(G)$ and $a(G)$ are not obvious and can be time-consuming. To tackle this point, we apply the results of Giudici and Green (1999) characterizing the set of moves (add and delete) which preserve the decomposability of the graph. These criteria lead to a fast enumeration.
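As an illustration of these enumerations, the naive sketch below (our own, assuming networkx; it tests chordality after each tentative move instead of applying the faster local criteria of Giudici and Green (1999)) computes the sets $d(G)$ and $a(G)$:

```python
# Naive enumeration of the legal moves: d(G) (single-edge deletions keeping
# the graph decomposable) and a(G) (single-edge additions doing the same).
# Decomposability is tested via chordality; Giudici and Green (1999) give
# local criteria avoiding this brute-force test.
from itertools import combinations
import networkx as nx

def legal_deletions(G):
    moves = []
    for e in list(G.edges):
        H = G.copy()
        H.remove_edge(*e)
        if nx.is_chordal(H):
            moves.append(e)
    return moves

def legal_additions(G):
    moves = []
    for u, v in combinations(G.nodes, 2):
        if not G.has_edge(u, v):
            H = G.copy()
            H.add_edge(u, v)
            if nx.is_chordal(H):
                moves.append((u, v))
    return moves
```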
Armstrong et al. (2009) prove that this scheme (in Armstrong et al. (2009), the step on the space of graphs represents a Gibbs step of a hybrid sampler: as already explained, they consider a hierarchical model where the hyperparameter $r$ is a random variable) is more efficient than the variable dimension sampler of Brooks et al. (2003), which is itself an improvement of the reversible jump algorithm proposed by Giudici and Green (1999). Their proposal is clearly irreducible and, therefore, the theoretical convergence of the produced Markov chain towards the stationary distribution is ensured, following standard results on MH schemes.
However, in practice, the space of decomposable graphs is so large that the chain may take quite some time to reach the invariant distribution. To improve this point, we introduce a data-driven MH kernel which uses the information contained in the inverse of the empirical covariance matrix. To justify this choice, recall that, because of the Gaussian graphical model properties, if the inverse empirical covariance between vertices $i$ and $j$ is near zero, we can presume that there is no edge between vertices $i$ and $j$. Then, during the MH iterations, if the current graph contains an edge between vertices $i$ and $j$, it is legitimate to propose removing this edge. The same type of reasoning applies if the absolute value of the inverse empirical covariance between vertices $i$ and $j$ is large: in that case, if during the MH iterations the current graph does not contain an edge between vertices $i$ and $j$, it is legitimate to propose adding this edge. With this proposal, once the random choice to add or delete an edge has been made, the proposed graph is not chosen uniformly within the class of decomposable graphs but according to the values of the inverse empirical covariances.
Let $\widehat{K}$ denote the inverse empirical covariance matrix: $\widehat{K} = \left( y^{\mathsf{T}} y / n \right)^{-1}$. $G \setminus (i, j)$ (respectively $G \cup (i, j)$) denotes the graph where the edge $(i, j)$ has been removed (respectively added).
The data-driven kernel is the following one:
Algorithm 3 Data-driven MH proposal
At iteration $t$,
1. Choose at random to delete or add an edge to $G^{(t)}$.
2. If deleting an edge, enumerate $d(G^{(t)})$ and generate $G' = G^{(t)} \setminus (i, j)$ according to a distribution on $d(G^{(t)})$ whose weights decrease with $|\widehat{K}_{ij}|$.
3. If adding an edge, enumerate $a(G^{(t)})$ and generate $G' = G^{(t)} \cup (i, j)$ according to a distribution on $a(G^{(t)})$ whose weights increase with $|\widehat{K}_{ij}|$.
4. Calculate the MH acceptance probability $\alpha(G^{(t)}, G')$ such that $\pi(G \mid y, \delta, \tau, r)$ is the invariant distribution of the Markov chain.
5. With probability $\alpha(G^{(t)}, G')$, accept $G'$ and set $G^{(t+1)} = G'$; otherwise, reject $G'$ and set $G^{(t+1)} = G^{(t)}$.
The algorithm is initialized on a graph $G^{(0)}$ and the procedure is repeated until convergence is reached.
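A possible implementation of the data-driven weights is sketched below (numpy only). The precise functional form of the weights is a design choice; the version shown here, with weights proportional to $|\widehat{K}_{ij}|$ for additions and inversely proportional to it for deletions, reproduces the qualitative behaviour described above but is an assumption of ours, not necessarily the exact scheme of the paper:

```python
# Data-driven proposal weights over the legal moves: additions of edge (i, j)
# are favoured when |K_hat[i, j]| is large, deletions when it is small.
# The exact weighting used in the paper may differ; this is one natural choice.
import numpy as np

def proposal_probabilities(K_hat, moves, adding, eps=1e-8):
    """moves: list of edges (i, j); adding: True for a(G), False for d(G)."""
    w = np.array([abs(K_hat[i, j]) for (i, j) in moves]) + eps
    if not adding:
        w = 1.0 / w
    return w / w.sum()
```

The resulting proposal probabilities then enter the acceptance ratio $\rho$ in place of the uniform weights of Algorithm 2, so that the invariant distribution is preserved.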
Finally, in view of some numerical experiments, and in order to keep the good exploration properties of the standard MH kernel, we propose to use in practice a combination of the standard add and delete MH kernel and the data-driven kernel presented above. This point is detailed in the next section.
4 Numerical experiments
In this part, we illustrate the statistical performance of our methodology on three different data sets. The second one is a simulated example which highlights the convergence properties of the SAEM-MCMC algorithm. The first and third examples appeared in Whittaker (1990) and have been widely used to evaluate the statistical performance of graphical model methodologies; see for instance Giudici and Green (1999) and Armstrong et al. (2009). Through these two examples, the importance of the choice of the hyperparameters and the efficiency of the new MCMC sampler are underlined.
4.1 The Frets' heads dataset (Whittaker, 1990)
Frets' heads dataset contains head measurements on the first and the second adult son in a sample of 25 families. The variables are the head length of the first son, the head breadth of the first son, the head length of the second son and the head breadth of the second son. Among the 64 possible graphs on these $p = 4$ variables, 61 are decomposable.
We compare three different prior distributions on $(\Sigma, G)$:
- We first consider the prior distribution suggested by Jones et al. (2005), i.e. $\delta = 3$ and $\Phi = \tau I_p$ with $\tau$ fixed by their heuristic, together with $r = 2/(p - 1)$.
- In a second experiment, we use the prior distribution proposed in Carvalho and Scott (2009), i.e. the HIW $g$-prior with $g = 1/n$, which results in $\delta = 1$ and $\Phi = y^{\mathsf{T}} y / n$.
- Finally, we use our prior distribution, i.e. $\delta = 1$, $\Phi = \tau I_p$ and a Bernoulli prior of parameter $r$ on the edges of $G$. Using the SAEM algorithm described previously, we estimate $\tau$ and $r$ by maximum likelihood.
On this example, there are only 61 decomposable graphs and so we are able to compute exactly the posterior probabilities for every prior distribution. We are then interested in comparing the posterior probabilities of the five most probable decomposable graphs under the three prior distributions described above. The results are summarized in Table 1.
The empirical Bayes estimate of $\tau$ is quite smaller than the value provided by the heuristic of Jones et al. (2005). As a consequence, the posterior probabilities of graphs are really different. Moreover, the approach of Carvalho and Scott (2009) gives results which agree with neither of the two other methods. The way the hyperparameters $\delta$ and $\tau$ are chosen is essential, since it drastically influences the results.
4.2 Simulated datasets
We consider artificial datasets simulated according to model (1) with the graph of Figure 1; the sample size $n$, the shrinkage factor $\tau$ and the edge parameter $r$ are set to fixed known values.
The SAEM-MCMC algorithm has been performed on each dataset in order to estimate the hyperparameter $\theta = (\tau, r)$. The algorithm is arbitrarily initialized with $\widehat{\tau}_0$ and $\widehat{r}_0$. Given $(\widehat{\tau}_0, \widehat{r}_0)$, the graph $G_0$ is initialized with a standard backward procedure based on the posterior probabilities $\pi(G \mid y; \widehat{\theta}_0)$.
The step sizes of the stochastic approximation scheme are chosen as recommended by Kuhn and Lavielle (2005): $\gamma_m = 1$ during the first $M_1$ iterations and $\gamma_m = 1/(m - M_1)$ during the subsequent ones. The initial guesses of $\tau$ and $r$ could be far from a local maximum of the likelihood function, and the first $M_1$ iterations with $\gamma_m = 1$ allow the sequence of estimates to converge to a neighborhood of a local maximum. Subsequently, the smaller step sizes during the additional iterations ensure the almost sure convergence of the algorithm to a local maximum of the likelihood function. At the S-step of the algorithm, the Markov chain supplied by the MCMC procedure is kept short during the first $M_1$ iterations of the SAEM scheme and made longer for the remaining iterations.
Figure 2 illustrates the convergence of the parameter estimates on arbitrarily chosen datasets. The estimated sequences are represented as functions of the iteration number. During the first iterations of SAEM, the parameter estimates fluctuate, reflecting the Markov chain construction. After the first $M_1$ iterations, the curves become smooth but still continue to converge towards a neighborhood of a local maximum of the likelihood function, and convergence is obtained after a moderate number of additional iterations.
Considering all the datasets, the relative bias for the parameter $\tau$ is negligible and the relative root mean square error (RMSE) is small. Note that the same study has been conducted with a uniform prior on $G$: in that case, the algorithm only involves the parameter $\tau$ and the corresponding RMSE is of the same order.
4.3 The Fowl bones dataset (Whittaker, 1990)
This dataset concerns six bone measurements taken from white leghorn fowl. The variables are skull length, skull breadth, humerus (wings), ulna (wings), femur (legs) and tibia (legs). On such a dataset, the determination of the best decomposable Gaussian graphical model results in finding the best graph within the $18{,}154$ decomposable graphs (around $55\%$ of the $32{,}768$ possible graphs).
Using this example, we aim at illustrating the fact that a careful choice of the transition kernel in the MCMC algorithm ensures a better exploration of the support of the posterior distribution. To do so, we compare the performance of the add and delete proposal of Armstrong et al. (2009) to that of the data-driven one.
In a first step, we use the SAEM-MCMC algorithm to calibrate the values of $\tau$ and $r$, obtaining the estimates $\widehat{\tau}$ and $\widehat{r}$.
In a second step, using these fixed values of $\tau$ and $r$, we generate two Markov chains of the same length. The first one is simulated using only the add and delete kernel. For the second one, we use exclusively the add and delete kernel during the first iterations: this burn-in phase allows a large exploration of the decomposable graphs space. During the remaining iterations, we alternately and systematically use the add and delete and data-driven kernels.
To illustrate the performance of this new kernel, we compute exactly the posterior probabilities $\pi(G \mid y, \widehat{\tau}, \widehat{r})$
for each decomposable graph with $p = 6$ vertices. We concentrate our efforts on the graphs with non-negligible posterior probability
(a small subset of the $18{,}154$ decomposable graphs), assuming that the other ones are of small interest because they are nearly never reached by the Markov chains.
For each graph of interest $G$, we count the number of times each Markov chain reached it (after having removed the burn-in period). We finally obtain an estimate of the posterior probability from each chain:
$$\widehat{\pi}(G \mid y) = \frac{1}{T - T_0} \sum_{t = T_0 + 1}^{T} \mathbb{1}_{\{G^{(t)} = G\}},$$
where $T$ is the chain length and $T_0$ the length of the burn-in period.
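This estimate is a simple occupation frequency; a minimal sketch (assuming the visited graphs are stored in a hashable form, e.g. frozensets of edges):

```python
# Occupation-frequency estimate of the posterior probabilities from the
# chain of visited graphs, after discarding the burn-in period.
from collections import Counter

def estimated_posterior(chain, burn_in):
    visits = Counter(chain[burn_in:])
    total = len(chain) - burn_in
    return {g: c / total for g, c in visits.items()}
```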
These values are compared to the exact ones. In Figure 3, we plot the estimated densities of the relative errors
$$\frac{\widehat{\pi}(G \mid y) - \pi(G \mid y)}{\pi(G \mid y)}$$
for the add and delete chain in solid line, and for the chain using the data-driven kernel in dashed line.
We note that the density corresponding to the errors involved by the data-driven kernel is more concentrated around the value $0$; the large errors in the add and delete density are due to the graphs with small probabilities. Thus, the new kernel explores the posterior distribution more efficiently. The acceptance rate is also higher for the data-driven chain (see Figure 4).
5 Conclusion and discussion
An empirical Bayes strategy estimating the prior hyperparameters of a Gaussian graphical model through a SAEM-MCMC algorithm has been introduced.
This proposal does not depend on any calibration parameter and can be viewed as a default option for decomposable graphical model determination.
Empirical studies show the relevance of the proposed approach and the good properties of the introduced algorithms.
However, Scott and Berger (2010) have recently found considerable differences between fully Bayes and empirical Bayes strategies in the context of variable selection. It would be very interesting to investigate, from both theoretical and practical perspectives, such a discrepancy in the case of decomposable graphical model selection.
Acknowledgments
The authors are grateful to Marc Lavielle for very helpful discussions. The authors wish to thank the Associate Editor and two reviewers, whose suggestions were very helpful in improving the presentation of this work. This work has been supported by the Agence Nationale de la Recherche (ANR, 212 rue de Bercy, 75012 Paris) through the 2009-2012 project Big'MC.
References
 Armstrong et al. (2009) Armstrong, H., Carter, C., Wong, K., and Kohn, R. (2009). Bayesian covariance matrix estimation using a mixture of decomposable graphical models. Statistics and Computing, 19(3):303–316.
 Brooks et al. (2003) Brooks, S. P., Giudici, P., and Roberts, G. O. (2003). Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. R. Stat. Soc. Ser. B Stat. Methodol., 65(1):3–55.
 Carvalho et al. (2007) Carvalho, C., Massam, H., and West, M. (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika, 94:647–659.
 Carvalho and Scott (2009) Carvalho, C. and Scott, J. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96:497–512.
 Celeux and Diebolt (1992) Celeux, G. and Diebolt, J. (1992). A stochastic approximation type EM algorithm for the mixture problem. Stochastics Stochastics Rep., 41(1-2):119–134.
 Dawid and Lauritzen (1993) Dawid, A. P. and Lauritzen, S. L. (1993). Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist., 21(3):1272–1317.
 Delyon et al. (1999) Delyon, B., Lavielle, M., and Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27:94–128.
 Dempster (1972) Dempster, A. (1972). Covariance selection. Biometrics, 28:157–175.
 Dempster et al. (1977) Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39(1):1–38. With discussion.
 Diebolt and Celeux (1993) Diebolt, J. and Celeux, G. (1993). Asymptotic properties of a stochastic EM algorithm for estimating mixing proportions. Comm. Statist. Stochastic Models, 9(4):599–613.
 Drton and Perlman (2004) Drton, M. and Perlman, M. (2004). Model selection for Gaussian concentration graphs. Biometrika, 91(3):591–602.
 Efron and Morris (1971) Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators. I. The Bayes case. J. Amer. Statist. Assoc., 66:807–815.
 Efron and Morris (1972a) Efron, B. and Morris, C. (1972a). Empirical Bayes on vector observations: an extension of Stein’s method. Biometrika, 59:335–347.
 Efron and Morris (1972b) Efron, B. and Morris, C. (1972b). Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Amer. Statist. Assoc., 67:130–139.
 Efron and Morris (1973a) Efron, B. and Morris, C. (1973a). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc. Ser. B, 35:379–421.
 Efron and Morris (1973b) Efron, B. and Morris, C. (1973b). Stein’s estimation rule and its competitors—an empirical Bayes approach. J. Amer. Statist. Assoc., 68:117–130.
 Efron and Morris (1976a) Efron, B. and Morris, C. (1976a). Families of minimax estimators of the mean of a multivariate normal distribution. Ann. Statist., 4(1):11–21.
 Efron and Morris (1976b) Efron, B. and Morris, C. (1976b). Multivariate empirical Bayes and estimation of covariance matrices. Ann. Statist., 4(1):22–32.
 Giudici (1996) Giudici, P. (1996). Learning in graphical Gaussian models. In Bayesian statistics, 5 (Alicante, 1994), Oxford Sci. Publ., pages 621–628. Oxford Univ. Press, New York.
 Giudici and Green (1999) Giudici, P. and Green, P. J. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86(4):785–801.
 Jones et al. (2005) Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statist. Sci., 20(4):388–400.
 Kuhn and Lavielle (2004) Kuhn, E. and Lavielle, M. (2004). Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM Probab. Stat., 8:115–131.
 Kuhn and Lavielle (2005) Kuhn, E. and Lavielle, M. (2005). Maximum likelihood estimation in nonlinear mixed effects models. Comput. Statist. Data Anal., 49:1020–1038.
 Lauritzen (1996) Lauritzen, S. (1996). Graphical models, volume 17 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York. Oxford Science Publications.
 Lavielle and Lebarbier (2001) Lavielle, M. and Lebarbier, E. (2001). An application of MCMC methods to the multiple change-points problem. Signal Processing, 81:39–53.
 Letac and Massam (2007) Letac, G. and Massam, H. (2007). Wishart distributions for decomposable graphs. Ann. Statist., 35(3):1278–1323.
 Madigan and Raftery (1994) Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s Window. Journal of the American Statistical Association, 89:1335–1346.
 McLachlan and Krishnan (2008) McLachlan, G. and Krishnan, T. (2008). The EM algorithm and extensions. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.
 Morris (1983) Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications (with discussion). J. Amer. Statist. Assoc., 78(381):47–65.
 Roverato (2002) Roverato, A. (2002). Hyper inverse Wishart distribution for nondecomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scand. J. Statist., 29(3):391–411.
 Scott and Berger (2010) Scott, J. and Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5):2587–2619.
 Whittaker (1990) Whittaker, J. (1990). Graphical models in applied multivariate statistics. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons Ltd., Chichester.
 Wong et al. (2003) Wong, F., Carter, C. K., and Kohn, R. (2003). Efficient estimation of covariance selection models. Biometrika, 90(4):809–830.