Rank Likelihood for Bayesian Nonparanormal Graphical Models
Abstract
Gaussian graphical models, where it is assumed that the variables of interest jointly follow a multivariate normal distribution with a sparse precision matrix, have been used to study intrinsic dependence among variables, but the normality assumption may be restrictive in many settings. A nonparanormal graphical model is a semiparametric generalization of a Gaussian graphical model for continuous variables where it is assumed that the variables follow a Gaussian graphical model only after some unknown smooth monotone transformation. We consider a Bayesian approach for the nonparanormal graphical model using a rank likelihood which remains invariant under monotone transformations, thereby avoiding the need to put a prior on the transformation functions. On the underlying precision matrix of the transformed variables, we consider a horseshoe prior on its Cholesky decomposition and use an efficient posterior Gibbs sampling scheme. We present a posterior consistency result on the rankbased transformation. We study the numerical performance of the proposed method through a simulation study and apply it on a real data set.
Keywords: Horseshoe prior, Gaussian copula graphical models, structure learning, semiparametric model
1 Introduction
Graphical models are useful mathematical tools that model complex dependence relationships between variables. Under the Gaussianity assumption, the graph of relations is completely determined by the zeros in the inverse covariance matrix, also known as the precision matrix. If the th entry of the precision matrix is zero, then the th and the th variables are conditionally independent given all other variables; see Lauritzen (1996) and Edwards (2000) for more information on the properties of Gaussian graphical models (GGMs).
If the variables are not normally distributed, an adjustment needs to be made if one wants to use the statistical properties of a Gaussian graphical model. The standard procedure is to transform the data, typically by the logarithmic or the square root transformation. With such transformations, one has to check each transformation to see if the data appear close to normal, checking which can be tedious and the method is somewhat adhoc. From a modeling perspective, an easier solution is to leave the transformation functions unspecified and estimate them instead. The nonparanormal graphical model (Liu et al., 2009) consists of estimating the transformation functions using a truncated marginal empirical distribution function, and then estimating the precision matrix of the transformed variables using the graphical lasso assuming sparsity. A Bayesian approach for nonparanormal graphical models, developed by Mulgrave and Ghosal (2018, 2017), uses a random series Bsplines prior to estimate the transformation functions and induces sparsity on the offdiagonal entries of the precision matrix by spikeandslab or continuous shrinkage priors, either directly or through the Cholesky decomposition.
However, one important requirement of the nonparanormal model is that the transformation functions need to be estimated. In order to avoid estimating the transformation functions, alternative rankbased procedures can be employed to transform the data to normally distributed random variables. Dabrowska and Doksum (1988) studies the properties of the likelihood function based on transformed observations in a general class called transformation models. An example is given by the rank likelihood which is the joint distribution of the ranks of the observations. The rank likelihood is invariant under monotone transformations of the observations. Thus one can ignore the transformations and focus on the main parameter of interest, the precision matrix, or equivalently, the inverse correlation matrix. The rank likelihood has been used for semiparametric copula estimation (Hoff, 2007) and for ROC curve estimation (Gu and Ghosal, 2009; Gu et al., 2014). The use of the rank likelihood results in a gain in robustness and simplification of estimation. Models which can be represented by transformations on Gaussian graphical models are generally known as Gaussian copula graphical models (GCGMs) and have been explored in the Bayesian literature (Pitt et al., 2006; Dobra and Lenkoski, 2011; Liu et al., 2012; Mohammadi and Wit, 2017). In the frequentist literature, nonparametric rankbased correlation coefficient estimators have been used to construct Gaussian copula graphical models (Liu et al., 2012; Xue and Zou, 2012). These models can also address binary or ordinal data, but we do not pursue this direction here.
Once the transformed variables have been obtained, one needs to estimate a sparse precision matrix in order to learn the structure of the graphical model. In the frequentist literature, the popular algorithm to use is the graphical lasso (Friedman et al., 2008). Numerous algorithms have been proposed to solve this problem (Meinshausen and Buhlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Rothman et al., 2008; Banerjee et al., 2008; d’Aspremont et al., 2008; Lu, 2009; Scheinberg et al., 2010; Witten et al., 2011; Mazumder and Hastie, 2012b). Alternative penalties used to estimate the sparse precision matrix include adaptive LASSO and SCAD penalties (Fan et al., 2009), and LASSO and grouped LASSO penalties (Friedman et al., 2010). In the Bayesian literature, graphical models can be learned with the use of continuous shrinkage priors. Priors such as the double exponential (Wang, 2012; Peterson et al., 2013), uniform shrinkage priors (Wang and Pillai, 2013), and normal spikeandslab priors (Wang, 2015; Peterson et al., 2016; Li and McCormick, 2017; Li et al., 2017) can characterize zero and nonzero elements of the precision matrix with continuous distributions that have a mass at zero and heavy tails. More recently, horseshoe priors have been employed for sparse precision estimation (Mulgrave and Ghosal, 2017; Williams et al., 2018). We estimate a sparse precision matrix using a Cholesky decomposition to naturally incorporate the positive definite matrix, a horseshoe prior to regularize matrix, and a loss procedure to threshold the matrix. Our methods differ from other Bayesian GCGMs which use the rank likelihood to transform the random variables. Dobra and Lenkoski (2011) used a GWishart prior (Roverato, 2002) and Markov Chain Monte Carlo to estimate the sparse precision matrix to construct a GCGM. Mohammadi et al. (2017) estimated a sparse precision matrix using a GWishart prior and birthanddeath Markov chain Monte Carlo (MCMC) (Mohammadi and Wit, 2015). Li and McCormick (2017) put a normal spikeandslab prior on the precision matrix and used an expectationconditional maximization algorithm for estimation.
The paper is organized as follows. In the next section, we state model and priors. In Section 3, we review the posterior computation. In Section 4, we review the thresholding procedure and in Section 5, we discuss our tuning procedure. We derive a posterior consistency result in Section 6 and in Section 7, we review the results of the simulation study. Lastly, in Section 8, we describe an application with a gene expression data.
2 Model and Priors
2.1 Estimation of Transformed Variables
Given a set of observed continuous variables distributed from unknown marginal distributions, there exist monotone increasing transformation functions , such that the distribution of the continuous transformed variables is
where is the correlation matrix. We make this model identifiable by centering the variables and setting the covariance matrix equal to the correlation matrix. We wish to make inference on and not on the transformation functions . Let represent the set of ranks of . Then and use the index for the ranks . Since the transformation functions are increasing, implies that , where is the index for the position of the th rank of . Then for a given dataset , the transformed variables must lie in the set
Then following Dabrowska and Doksum (1988) and the notation of Hoff (2007), we calculate the rank likelihood as
(1) 
Thus, this likelihood depends on the parameter and not on the nuisance functions .
In order to sample , we reparameterize the model in terms of the nonidentifiable inverse covariance matrix , but focus our posterior inference on the identifiable inverse correlation matrix. Thus , where and are the square roots of the diagonal elements of . The rank likelihood is scale invariant so the nonidentifiable and identifiable models lead to the same posterior distribution, .
2.2 Estimation of Inverse Correlation Matrix
We put a horseshoe prior on and sample using a regressionbased Cholesky decomposition method discussed in Mulgrave and Ghosal (2017). Denote the Cholesky decomposition of as , where is a lower triangular matrix. Define the coefficients and the precision as . Then the multivariate Gaussian model leads to the set of independent regression problems,
where are the regression coefficients for and . Denoting to be the th column of , the matrix formed by columns of greater than , and , we may write the regression relation in the vector form as
which gives rise to the likelihood.
We use a standard conjugate noninformative prior on the variances with improper density proportional to . We enforce a sparsity constraint along the rows of the lower triangular matrix in order to ensure that the probability that an entry is nonzero (i.e. sparsity) remains roughly the same over different rows. We choose Prob(nonzero in th row), and tune the value of to cover a range of four orders of magnitude, i.e. ; see Mulgrave and Ghosal (2017) for the more information on the sparsity constraint.
We use a horseshoe prior described in Neville et al. (2014)
(2)  
for . We denote the inverse gamma distribution as IG.
According to van der Pas et al. (2014), the global scale parameter is roughly equivalent with the probability of a nonzero element. We enforce the sparsity constraint by replacing with . Thus, since we are working with the squared parameter, the factor in the variance term for is . The prior on leads to an induced prior on .
3 Posterior Computation
We obtain samples of by employing the following Gibbs sampler:

Sample
For ,

Compute the ranks of , where represents the ranks.
For

Compute and , where is the index for the position of the th rank of . Let and .

Compute


Sample where
denotes a univariate truncated normal distribution with mean , variance , and truncation limits and . Sampling from the truncated normal distribution is implemented with the fast sampling function trandn in MATLAB that uses minimax tilting (Botev, 2017).



Sample :
For

Sample the variables , where
Since sampling from this normal distribution can be expensive with large , we used an exact sampling algorithm for Gaussian priors that uses data augmentation (Bhattacharya et al., 2016):

Sample and , where

set , where

solve , where

set


Sample

Sample

Sample

Sample

Sample
Sample 
Compute

Compute


Set
These steps are repeated until convergence.
4 Thresholding
4.0.1 01 Loss Procedure
We find the posterior partial correlation using the inverse correlation matrices from the Gibbs sampler of the horseshoe prior (2) and the posterior partial correlation using the standard conjugate Wishart prior. The posterior partial correlation using the matrices from the Gibbs sampler is defined as
where is the th sample of Markov chain Monte Carlo (MCMC) draws after burnin from the posterior distribution of , , . The posterior partial correlation using the standard conjugate Wishart prior is found by starting with the latent observation, which is obtained from the MCMC output. We put a standard Wishart prior on the inverse correlation matrix, , where is the identity matrix. By conjugacy, the posterior is , where . We compute the mean of the posterior distribution given , . Finally, we find the posterior partial correlation coefficients
where is the th entry of at the th MCMC iteration.
We link these two posterior partial correlations for the 01 loss method. We claim the event if and only if
(3) 
for and . The rationale for this thresholding procedure is that we are comparing the regularized precision matrix to the nonregularized precision matrix from the Wishart prior. If the absolute value of the partial correlation coefficient from the regularized precision matrix is similar in size or larger than the absolute value of the partial correlation coefficient from the Wishart precision matrix, then the edge matrix should have an edge. If the absolute value of the partial correlation coefficient from the regularized precision matrix is much smaller than the absolute value of the coefficient from the Wishart matrix, then the edge matrix should not have an edge. The precision matrix from the Wishart prior serves as a means of comparison to determine whether the element of the regularized precision matrix is truly large or small.
5 Choice of Prior Parameters
For the sparsity constraint on the inverse correlation matrix , we need to select the value of the parameter . We solve a convex constrained optimization problem in order to use the Bayesian Information Criterion (BIC), originally described in Dahl et al. (2005) and developed further in Mulgrave and Ghosal (2018). First, we find the Bayes estimate of the inverse correlation matrix, . We also find the average of the transformed variables, , where , , are obtained from the MCMC output. Then, using the sum of squares matrix, , we solve the following to obtain the maximum likelihood estimate of the inverse correlation matrix, ,
where stands for the constraint that an element of is zero if and only if the estimated edge matrix from the MCMC sampler has that element zero. The estimated edge matrix from the MCMC sampler will be described in more detail in Subsection 7.1. For computational simplicity, in the code, we represent this problem as an unconstrained optimization problem as described in Dahl et al. (2005).
Lastly, we calculate , where , the sum of the number of diagonal elements and the number of edges in the estimated edge matrix, and . We select the that results in the smallest BIC.
6 Posterior Consistency
In this section, we show that in the fixed dimensional setting (i.e. is fixed), the rankbased posterior distribution of is consistent at its true value for almost all with respect to the Lebesgue measure under the only assumption that the prior distribution for has positive density on the space of inverse correlation matrices, i.e., the collection of all positive definite matrices such that all diagonal elements of are .
Theorem 6.1.
Assume that has prior density a.e. over the space of positive definite matrices, with respect to the Lebesgue measure and that the dimension is fixed. Then for a.e. , and for any neighborhood of , we have that
(4) 
where denotes the joint distribution of all ’s and ranks with as the true value of and denotes the underlying normality restoring transformations.
Proof.
The proof is based on an application of Doob’s Theorem (Ghosal and van der Vaart, 2017), Section 6.2. Doob’s theorem is a very general posterior consistency result, which only requires that in the joint distribution of all parameters and observables, the parameter can be a.s. written as a function of all observables of all stages, and then concludes that posterior consistency holds for almost all parameters with respect to the prior distribution. Here observations are the rank information for each variable, where stands for the rank of the th observation in the th component, , , . We follow some arguments given in Gu and Ghosal (2009). Let , the “population quantile” of the th observation regarding the th variable, , . Note that . By Theorem a on page 157 of Hájek and Šidák (1967), for any ,
(5) 
As such, is an inprobability limit of measurable random variables, where is the field generated by . Thus, for any ,
(6) 
and hence, is an measurable random variable, where , and so is . Therefore it suffices to show that can be written as the almost sure limit of a sequence of functions of .
Let stand for the th element of . Then clearly almost surely. Thus , and hence is expressible as an almost sure limit of . Thus, by Doob’s theorem, the consistency of the posterior (4) at holds a.e. . However, as the prior density is positive throughout the parameter space, it also follows that the posterior (4) at holds a.e. . ∎
Usually, the main criticism against a posterior consistency result obtained by applying Doob’s theorem is that the exceptional set where consistency may fail may be “large”, since it only needs to be null with respect to the prior, which is somewhat arbitrary. However, in the present application, since the parameter space for the parameter of interest is finite dimensional, where we have the Lebesgue measure as a benchmark measure, the exceptional set of points where the posterior may be inconsistent is characterized as Lebesgue null, which can be regarded as “small”. It is important to note that the normality restoring transformations are taken to be fixed, and no prior is assigned on them. Since the underlying procedure is unaffected by the transformations, the fact that these transformations are unknown does not matter. Note that the fixed dimensionality of the variables is essential, since the posterior consistency result is applicable only if the parameter space is fixed.
7 Simulation Results
We conduct a simulation study to compare the performance of the proposed rank likelihood method with the Bayesian method based on GCGM (Mohammadi et al., 2017), the empirical method (Liu et al., 2009) in a nonparanormal graphical model, the Bayesian method (Mulgrave and Ghosal, 2017) in a nonparanormal graphical model, and the empirical method (Liu et al., 2012) based on GCGM. Both the proposed rank likelihood method, indicated as Rank Likelihood, and the Bayesian method of Mulgrave and Ghosal (2017) use a horseshoe prior on the Cholesky decomposition of the precision matrix and MCMC estimation. The latter, indicated as Bsplines, uses a random series Bsplines prior to estimate the transformation functions. The Bayesian method based on GCGM, indicated as Bayesian Copula, uses the rank likelihood to transform the random variables and puts a GWishart prior on the inverse correlation matrix, a uniform prior on the graph, and estimates the sparse matrix using a birthanddeath MCMC (Mohammadi and Wit, 2015). The empirical method in a nonparanormal graphical model, indicated as Truncation, uses a truncated marginal empirical distribution function to transform the variables and the graphical lasso to estimate the sparse precision matrix. The empirical method based on GCGM, indicated as SKEPTIC, uses the Spearman’s rho to transform the variables and estimates the sparse precision matrix with the graphical lasso.
The random variables, , are simulated from a multivariate normal distribution such that for . The means are selected from an equally spaced grid between 0 and 5 with length . We consider nine different combinations of and sparsity for :

, AR(4) model;

, , AR(4) model;

, , AR(4) model;

, AR(1) model;

, , AR(1) model;

, , AR(1) model;

, , sparsity = nonzero entries in the offdiagonals;

, , sparsity = nonzero entries in the offdiagonals;

, , sparsity = nonzero entries in the offdiagonals;
where the AR(4), AR(1), and star models are described by the relations

AR(4) model: ;

AR(1) model: ,
The percent sparsity levels for are computed using lower triangular matrices that have diagonal entries normally distributed with mean 1 and standard deviation 0.1, and nonzero offdiagonal entries normally distributed with mean 0 and standard deviation 1.
The observed variables are constructed from the simulated variables . The functions used to construct the observed variables were four cumulative distribution functions (c.d.f.s): asymmetric Laplace, extreme value, logistic, and stable. We could choose any values of the parameters for the c.d.f.s, but instead of selecting them ourselves, we automatically choose the values of the parameters to be the maximum likelihood estimates with the mle function in MATLAB. The values of the parameters for each of the c.d.f.s are the maximum likelihood estimates for the parameters of the corresponding distributions (asymmetric Laplace, extreme value, logistic, and stable), using the variables .
We follow the procedure in Mulgrave and Ghosal (2018) to estimate the transformation functions for the Bsplines method. The hyperparameters for the normal prior are chosen to be and . To choose the number of basis functions, we use the Aikaike Information Criterion. Samples from the truncated multivariate normal posterior distributions for the Bspline coefficients are obtained using the exact Hamiltonian Monte Carlo (exact HMC) algorithm (Pakman and Paninski, 2014). After finding the initial coefficient values , we construct initial values for using the observed variables. These initial values are used to find initial values for , and for the algorithm.
For the Rank Likelihood method, we initialize the algorithm by using the ranks of the observed variables. First, we create an matrix of ranks, . Then, we divide the columns by and convert to normal random variables using the inverse standard normal c.d.f. , and we take the transpose to obtain an matrix. Finally, we obtain an initial value for the inverse correlation matrix using
For both the Rank Likelihood and the Bsplines methods, the hyperparameters for the horseshoe prior are initialized with ones. To impose the sparsity constraint on the Cholesky decomposition of matrices for the Bsplines and the Rank Likelihood methods, we consider four values for tuning: . We select the graphic with value of having the lowest BIC. The 01 loss procedure is used to threshold the matrices for the Bsplines and the Rank Likelihood methods and construct the corresponding edge matrices. The codes for these methods were written in MATLAB.
For the simulation study, we run 100 replications for each of the nine combinations and assess structure learning and parameter estimation for each replication. We collect MCMC samples for inference after discarding a burnin of and we do not apply thinning. The Bayesian Copula method is implemented using the R package BDGraph (Mohammadi and Wit, 2015, 2017; Dobra and Mohammadi, 2017; Mohammadi and Wit, 2018) using the option “gcgm”. Bayesian model averaging is used for inverse correlation matrix and graph selection. The default option in the BDGraph package selects the graph formed by links having estimated posterior probabilities greater than . The Truncation and SKEPTIC methods are implemented using the R package huge (Zhao et al., 2015) using the “truncation” and “skeptic” options respectively. For both empirical methods, the graphical lasso method is used for the graph estimation based on transformed variables and the default lossless screening method (Witten et al., 2011; Mazumder and Hastie, 2012a) is applied. A sequence of 100 regularization parameters, , starting from to , in the logscale. We define , where is the correlation matrix from the SKEPTIC method and is the matrix constructed for the Truncation method after scaling the transformed variables and converting it to a correlation matrix. Note that is the dimension. The method used to select the graphical model along the regularization path was Generalized Stability Approach to Regularization Selection (Müller et al., 2016), or GStARS, implemented using the R package pulsar. We use upper and lower bounds to reduce the computational burden and the number of random subsamples taken for graph reestimation was 100. All codes are provided in the Supplementary Material.
7.1 Performance Assessment
For the Rank Likelihood method, we find the Bayes estimate of the inverse correlation matrix , and average it over MCMC iterations. For the Bsplines method, the Bayes estimate of the precision matrix is , using the MCMC samples. The median probability model (Berger and Barbieri, 2004) is used to find the Bayes estimate of the edge matrix for both the Rank Likelihood and Bsplines methods. We find the estimated edge matrix by first using the 01 loss procedure to threshold the MCMC inverse correlation and precision matrices, and then we take the mean of the thresholded matrices. If the offdiagonal element of the mean is greater than , the element is registered as an edge; else it is registered as a noedge.
We compute the specificity (SP), sensitivity (SE), and Matthews Correlation Coefficient (MCC), previously used for assessing the accuracy of classification procedures (Baldi et al., 2000), to assess the performance of the graphical structure learning. They are defined as follows:
where TP stands for true positives, TN stands for true negatives, FP stands for false positives, and FN stands for false negatives. Specificity and sensitivity values are between 0 and 1, where 1 is the best value. MCC values are between and 1 and 1 is the best value.
We also assess the strength of parameter estimation. We consider the scaled loss function, the average absolute distance. The scaled loss is defined as
For the Rank Likelihood, SKEPTIC, and Bayesian Copula methods, is the estimated inverse correlation matrix and is the true inverse correlation matrix. For the Truncation and Bsplines method, is the estimated inverse covariance matrix and is the true inverse covariance matrix. The results are presented in Figures 1–4. Note that Percent refers to the 10% model for dimension , 5% model for dimension and 2% model for dimension .
The Rank Likelihood method performs consistently better than the Bsplines method in terms of structure learning and parameter estimation. In particular, the Rank Likelihood method appears to be more sensitive and specific to signals than the Bsplines method. In addition, the scaled loss of the Rank Likelihood method is significantly better than the scaled loss of the Bsplines method. The Bayesian Copula method performs similar to or better than SKEPTIC and Truncation methods in terms of structure learning for all models considered. The Bayesian Copula method performs similar to the SKEPTIC and Truncation methods with regard to parameter estimation. The proposed Rank Likelihood method performs the best at structure learning for all models considered except for the AR(4) model at dimension , at which the Bayesian Copula model performs best. For parameter estimation, the proposed Rank Likelihood model generally outperforms all competing models.
Thus overall, compared to competing methods, the Rank Likelihood method performs equivalently or better for structure learning and parameter estimation for all models excluding the AR(4) model at dimension . However, the Rank Likelihood model has good performance when considering structure learning and parameter estimation together, compared to the competing models.
8 Real Data Application
We demonstrate the methods on a gene expression data set originally referenced in Stranger et al. (2007) with Gene Expression Omnibus database Series Accession Number GSE6536 19 and supported in funding in part by the US National Institutes of Health ENDGAME. Data are collected to measure the gene expression in Blymphocyte cells from inhabitants in Utah with European ancestry. The interest is on the single nucleotide polymorphisms that are found in the 5’ untranslated region of messenger RNA with a minor allele frequency . Following Bhadra and Mallick (2013), of the 47,293 total available probes, we considered the 100 most variable probes that correspond to different Illumina TargetID transcripts. The data for these 100 transcripts are available in the R package BDGraph (Mohammadi and Wit, 2015, 2017; Dobra and Mohammadi, 2017; Mohammadi and Wit, 2018). The data consist of unrelated individuals and transcripts. The variables in the data are continuous but do not appear Gaussian. A Bayesian estimate based on a Gaussian graphical model using a spikeandslab type prior constructed by Bhadra and Mallick (2013) detected 55 edges.
To construct the graph using our method, we convert the original values to be between 0 and 1 using the affine transform . We use the identity matrix as the initial matrix for the covariance and inverse covariance matrices for the Rank Likelihood and Bsplines methods. The Rank Likelihood method results in 252 edges and the Bsplines method results in 99 edges. Convergence of the Rank Likelihood method can be obtained in about 28 minutes and for the Bsplines method in about 60 minutes for a given for these data on a laptop computer. Using the same setup as in the simulation study, the SKEPTIC method results in no edges and the Truncation method results in 363 edges. The Bayesian Copula method results in 834 edges. The proposed Rank Likelihood and Bsplines methods results in the sparsest models, with the Bsplines method as the most sparse model. The graphs of the proposed methods are shown in Figure 1. The graphs of the Bayesian Copula and truncation methods are shown in Figure 2. Since the SKEPTIC method resulted in no edges, it is not included in the comparison. Plots are made with the circularGraph function in MATLAB.
SUPPLEMENTAL MATERIALS
 GitHub Repository:

The code for the methods described in this paper can be found in the following GitHub repository: https://github.com/jnj2102/RankLikelihood.
References
 Baldi et al. (2000) Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F., and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5):412–424.
 Banerjee et al. (2008) Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516.
 Berger and Barbieri (2004) Berger, J. O. and Barbieri, M. M. (2004). Optimal predictive model selection. The Annals of Statistics, 32(3):870–897.
 Bhadra and Mallick (2013) Bhadra, A. and Mallick, B. K. (2013). Joint highdimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2):447–457.
 Bhattacharya et al. (2016) Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). Fast sampling with Gaussian scalemixture priors in highdimensional regression. Biometrika, 103(4):985–991.
 Botev (2017) Botev, Z. I. (2017). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):125–148.
 Dabrowska and Doksum (1988) Dabrowska, D. M. and Doksum, K. A. (1988). Partial likelihood in transformation models with censored data. Scandinavian Journal of Statistics, 15(1):1–23.
 Dahl et al. (2005) Dahl, J., Roychowdhury, V., and Vandenberghe, L. (2005). Maximum likelihood estimation of Gaussian graphical models: numerical implementation and topology selection. Technical report. University of California, Los Angeles.
 d’Aspremont et al. (2008) d’Aspremont, A., Banerjee, O., and El Ghaoui, L. (2008). Firstorder methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66.
 Dobra and Lenkoski (2011) Dobra, A. and Lenkoski, A. (2011). Copula Gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics, 5(2A):969–993.
 Dobra and Mohammadi (2017) Dobra, A. and Mohammadi, R. (2017). Loglinear model selection and human mobility. arXiv preprint arXiv:1711.02623.
 Edwards (2000) Edwards, D. (2000). Introduction to Graphical Modelling. Springer Texts in Statistics. Springer New York.
 Fan et al. (2009) Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2):521–541.
 Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.
 Friedman et al. (2010) Friedman, J. S., Hastie, T. J., and Tibshirani, R. (2010). Applications of the lasso and grouped lasso to the estimation of sparse graphical models.
 Ghosal and van der Vaart (2017) Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics (44). Cambridge University Press, Cambridge.
 Gu and Ghosal (2009) Gu, J. and Ghosal, S. (2009). Bayesian ROC curve estimation under binormality using a rank likelihood. Journal of Statistical Planning and Inference, 139(6):2076–2083.
 Gu et al. (2014) Gu, J., Ghosal, S., and Kleiner, D. E. (2014). Bayesian ROC curve estimation under verification bias. Statistics in Medicine, 33(29):5081–5096.
 Hoff (2007) Hoff, D., P. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, 1(1):265–283.
 Hájek and Šidák (1967) Hájek, J. and Šidák, Z. (1967). Theory of Rank Tests. Academic Press, New York.
 Lauritzen (1996) Lauritzen, S. L. (1996). Graphical models. Number 17 in Oxford statistical science series. Clarendon Press ; Oxford University Press, Oxford : New York.
 Li and McCormick (2017) Li, Z. and McCormick, T. H. (2017). An expectation conditional maximization approach for Gaussian graphical models. arXiv preprint arXiv:1709.06970.
 Li et al. (2017) Li, Z. R., McCormick, T. H., and Clark, S. J. (2017). Bayesian latent Gaussian graphical models for mixed data with marginal prior information. arXiv preprint arXiv:1711.00877.
 Liu et al. (2012) Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. (2012). Highdimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326.
 Liu et al. (2009) Liu, H., Lafferty, J. D., and Wasserman, L. A. (2009). The nonparanormal: semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295–2328.
 Lu (2009) Lu, Z. (2009). Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807–1827.
 Mazumder and Hastie (2012a) Mazumder, R. and Hastie, T. (2012a). Exact covariance thresholding into connected components for largescale graphical lasso. Journal of Machine Learning Research, 13:781–794.
 Mazumder and Hastie (2012b) Mazumder, R. and Hastie, T. (2012b). The graphical lasso: new insights and alternatives. Electronic Journal of Statistics, 6(0):2125–2149.
 Meinshausen and Buhlmann (2006) Meinshausen, N. and Buhlmann, P. (2006). Highdimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
 Mohammadi et al. (2017) Mohammadi, A., Abegaz, F., van den Heuvel, E., and Wit, E. C. (2017). Bayesian modelling of Dupuytren disease by using Gaussian copula graphical models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 66(3):629–645.
 Mohammadi and Wit (2015) Mohammadi, A. and Wit, E. C. (2015). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1):109–138.
 Mohammadi and Wit (2017) Mohammadi, R. and Wit, E. C. (2017). BDgraph: an R package for Bayesian structure learning in graphical models. arXiv preprint arXiv:1501.05108.
 Mohammadi and Wit (2018) Mohammadi, R. and Wit, E. C. (2018). BDgraph: Bayesian structure learning in graphical models using birthdeath MCMC. R package version 2.47.
 Mulgrave and Ghosal (2017) Mulgrave, J. and Ghosal, S. (2017). Cholesky decomposition for Bayesian nonparanormal graphical models. Manuscript in preparation.
 Mulgrave and Ghosal (2018) Mulgrave, J. J. and Ghosal, S. (2018). Bayesian inference in nonparanormal graphical models. arXiv preprint arXiv:1806.04334.
 Müller et al. (2016) Müller, C. L., Bonneau, R. A., and Kurtz, Z. D. (2016). Generalized stability approach for regularized graphical models. preprint.
 Neville et al. (2014) Neville, S. E., Ormerod, J. T., and Wand, M. P. (2014). Mean field variational Bayes for continuous sparse signal shrinkage: Pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113–1151.
 Pakman and Paninski (2014) Pakman, A. and Paninski, L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.
 Peterson et al. (2013) Peterson, C., Vannucci, M., Karakas, C., Choi, W., Ma, L., and MaleticSavatic, M. (2013). Inferring metabolic networks using the Bayesian adaptive graphical lasso with informative priors. Statistics and Its Interface, 6(4):547–558.
 Peterson et al. (2016) Peterson, C. B., Stingo, F. C., and Vannucci, M. (2016). Joint Bayesian variable and graph selection for regression models with networkstructured predictors. Statistics in Medicine, 35(7):1017–1031.
 Pitt et al. (2006) Pitt, M., Chan, D., and Kohn, R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika, 93(3):537–554.
 Rothman et al. (2008) Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2(0):494–515.
 Roverato (2002) Roverato, A. (2002). Hyper inverse Wishart distribution for nondecomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3):391–411.
 Scheinberg et al. (2010) Scheinberg, K., Ma, S., and Goldfarb, D. (2010). Sparse inverse covariance selection via alternating linearization methods. In Proceedings of the 23rd International Conference on Neural Information Processing Systems  Volume 2, NIPS’10, pages 2101–2109, USA. Curran Associates Inc.
 Stranger et al. (2007) Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavaré, S., Deloukas, P., and Dermitzakis, E. T. (2007). Population genomics of human gene expression. Nature Genetics, 39(10):1217–1224.
 van der Pas et al. (2014) van der Pas, S. L., Kleijn, B. J. K., and van der Vaart, A. W. (2014). The horseshoe estimator: posterior concentration around nearly black vectors. Electron. J. Statist., 8(2):2585–2618.
 Wang (2012) Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4):867–886.
 Wang (2015) Wang, H. (2015). Scaling it up: stochastic search structure learning in graphical models. Bayesian Analysis, 10(2):351–377.
 Wang and Pillai (2013) Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22(3):689–707.
 Williams et al. (2018) Williams, D. R., Piironen, J., Vehtari, A., and Rast, P. (2018). Bayesian estimation of Gaussian graphical models with projection predictive selection. arXiv preprint arXiv:1801.05725.
 Witten et al. (2011) Witten, D. M., Friedman, J. H., and Simon, N. (2011). New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892–900.
 Xue and Zou (2012) Xue, L. and Zou, H. (2012). Regularized rankbased estimation of highdimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571.
 Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.
 Zhao et al. (2015) Zhao, T., Li, X., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2015). huge: highdimensional undirected graph estimation. R package version 1.2.7.