A Stepwise Approach for HighDimensional Gaussian Graphical Models
Abstract
We present a stepwise approach to estimate high dimensional Gaussian graphical models . We exploit the relation between the partial correlation coefficients and the distribution of the prediction errors, and parametrize the model in terms of the Pearson correlation coefficients between the prediction errors of the nodes’ best linear predictors. We propose a novel stepwise algorithm for detecting pairs of conditionally dependent variables. We show that the proposed algorithm outperforms existing methods such as the graphical lasso and CLIME in simulation studies and real life applications. In our comparison we report different performance measures that look at different desirable features of the recovered graph and consider several model settings.
Keywords: Covariance Selection; Gaussian Graphical Model; Forward and Backward Selection; Partial Correlation Coefficient.
1 Introduction
Highdimensional Gaussian graphical models (GGM) are widely used in practice to represent the linear dependency between variables. The underlying idea in GGM is to measure linear dependencies by estimating partial correlations to infer whether there is an association between a given pair of variables, conditionally on the remaining ones. Moreover, there is a close relation between the nonzero partial correlation coefficients and the nonzero entries in the inverse of the covariance matrix. Covariance selection procedures take advantage of this fact to estimate the GGM conditional dependence structure given a sample (Dempster, 1972; Lauritzen, 1996; Edwards, 2000).
When the dimension is larger than the number of observations, the sample covariance matrix is not invertible and the maximum likehood estimate (MLE) of does not exist. When , but close to , is invertible but illconditioned, increasing the estimation error (Ledoit and Wolf, 2004). To deal with this problem, several covariance selection procedures have been proposed based on the assumption that the inverse of the covariance matrix, , called precision matrix, is sparse.
We present an approach to perform covariance selection in a high dimensional GGM based on a forwardbackward algorithm called graphical stepwise (GS). Our procedure takes advantage of the relation between the partial correlation and the Pearson correlation coefficient of the residuals.
Existing methods to estimate the GGM can be classified in three classes: nodewise regression methods, maximum likelihood methods and limited order partial correlations methods. The nodewise regression method was proposed by Meinshausen and Bühlmann (2006). This method estimates a lasso regression for each node in the graph. See for example Peng et al. (2009), Yuan (2010), Liu and Wang (2012), Zhou et al. (2011) and Ren et al. (2015). Penalized likelihood methods include Yuan and Lin (2007), Banerjee et al. (2008), Friedman et al. (2008), Johnson et al. (2011) and Ravikumar et al. (2011) among others. Cai et al. (2011) propose an estimator called CLIME that estimates precision matrices by solving the dual of an penalized maximum likelihood problem. Limited order partial correlation procedures use lower order partial correlations to test for conditional independence relations. See Spirtes et al. (2000), Kalisch and Bühlmann (2007), Rütimann et al. (2009), Liang et al. (2015) and Huang et al. (2016).
The rest of the article is organized as follows. Section 2 introduces the stepwise approach along with some notation. Section 3 gives simulations results and a real data example. Section 4 presents some concluding remarks. The Appendix shows a detailed description of the crossvalidation procedure used to determine the required parameters in our stepwise algorithm and gives some additional results from our simulation study.
2 Stepwise Approach to Covariance Selection
2.1 Definitions and Notation
In this section we review some definitions and technical concepts needed later on. Let be a graph where is the set of nodes or vertices and is the set of edges. For simplicity we assume that . We assume that the graph is undirected, that is, if and only if . Two nodes and are called connected, adjacent or neighbors if .
A graphical model (GM) is a graph such that indexes a set of variables and is defined by:
(2.1) 
Here denotes conditional independence.
Given a node , its neighborhood is defined as
(2.2) 
Notice that gives the nodes directly connected with and therefore a GM can be effectively described by giving the system of neighborhoods .
We further assume that , where is a positivedefinite covariance matrix. In this case the graph is called a Gaussian graphical model (GGM). The matrix is called precision matrix.
2.2 Conditional dependence in a GGM
In a GGM the set of edges represents the conditional dependence structure of the vector . To represent this dependence structure as a statistical model it is convenient to find a parametrization for .
In this subsection we introduce a convenient parametrization of using well known results from classical multivariate analysis. For an exhaustive treatment of these results see, for instance, Anderson (2003), Cramér (1999), Lauritzen (1996) and Eaton (2007).
Given a subset of , denotes the vector of variables with subscripts in in increasing order. For a given pair of nodes , set , and . Note that has multivariate normal distribution with mean and covariance matrix
(2.3) 
such that has dimension , has dimension and so on. The matrix in (2.3) is a partition of a permutation of the original covariance matrix , and will be also denoted by , after a small abuse of notation.
Moreover, we set
Then, by (B.2) of Lauritzen (1996), the blocks can be written explicitly in terms of and . In particular
where
is the submatrix of (with rows and and columns and ). Hence,
and, in consequence, the partial correlation between and can be expressed as
(2.5) 
This gives the standard parametrization of in terms of the support of the precision matrix
(2.6) 
We now introduce another parametrization of , which we need to define and implement our proposed method. We consider the regression error for the regression of on ,
and let and denote the entries of (i.e. ). The regression error is independent of and has normal distribution with mean and covariance matrix with elements denoted by
(2.7) 
A straightforward calculation shows that
See Cramér (1999, Section 23.4).
Therefore, by this equality, (2.2) and (2.5), the partial correlation coefficient and the conditional correlation are equal
Summarizing, the problem of determining the conditional dependence structure in a GGM (represented by ) is equivalent to finding the pairs of nodes of that belong to the set
(2.8) 
which is equal to the support of the precision matrix, , defined by (2.6).
Remark 1.
As noticed above, under normality, partial and conditional correlation are the same. However, in general they are different concepts (Lawrance, 1976).
Remark 2.
Let be the regression coefficient of in the regression of versus and, similarly let be the regression coefficient of in the regression of versus . Then it follows that . This allows for another popular parametrization for . Moreover, let be the error term in the regression of the variable on the remaining ones. Then by Lemma 1 in Peng et al. (2009) we have that and .
2.3 The Stepwise Algorithm
Conditionally on its neighbors, is independent of all the other variables. Formally, for all ,
(2.9) 
Therefore, given a system of neighborhoods and (and so ), the partial correlation between and can be obtained by the following procedure: (i) regress on and compute the regression residual ; regress on and compute the regression residual ; (ii) calculate the Pearson correlation between and
This reasoning motivates the graphical stepwise algorithm (GSA). It begins with the family of empty neighborhoods, for each . There are two basic steps, the forward and the backward steps. In the forward step, the algorithm adds a new edge if the largest absolute empirical partial correlation between the variables is above the given threshold . In the backward step the algorithm deletes an edge if the smallestt absolute empirical partial correlation between the variables is below the given threshold . A step by step description of GSA is as follows:
Graphical Stepwise Algorithm

Input: the (centered) data and the forward and backward thresholds and

Initialization. : set .

Iteration Step. Given we compute as follows.

Forward. For each do the following.
For each calculate the partial correlations as follows.

Regress the variable on the variables with subscript in the set and compute the regression residuals

Regress the variables on the variables with subscript in the set and compute the regression residuals

Obtain the partial correlation by calculating the Pearson correlation between and
If
set for
If


Backward. For each do the following.
For each calculate the partial correlation as follows.

Regress the variables on the variables with subscript in the set and compute the regression residuals

Regress the variable on the variables with subscript in the set and compute the regression residuals

Compute the partial correlation by calculating the Pearson correlation between and
If
set .



Output

A collection of estimated neighborhoods , .

The set of estimated edges .

An estimate of , with defined as follow: in the case for where is the vector of the prediction errors in the regression of the variable on In the case we must distinguish two cases, if then otherwise (see Remark 2).

2.4 Thresholds selection by crossvalidation
Let be the matrix with rows , , corresponding to observations. We randomly partition the dataset into disjoint subsets of approximately equal sizes, the subset being of size and . For every , let be the validation subset, and its complement , the training subset. For every and for every pair of threshold parameters let be the estimated neighborhoods given by GSA using the training subset. For every let be the estimated coefficient of the regression of the variable on the neighborhood .
Consider now the validation subset. So, for every , using , we obtain the vector of predicted values . If we predict each observation of by the sample mean of the observations in the dataset of this variable.
Then, we define the –fold cross–validation function as
where the L2norm or euclidean distance in . Hence the –fold cross–validation forward–backward thresholds , is
where is a grid of ordered pairs in over which we perform the search. For a detail description see the Appendix.
2.5 Example
To illustrate the algorithm we consider the GGM with 16 edges given in the first panel of Figure 1. We draw independent observations from this model (see the next section for details). The values for the threshold parameters and are determined by fold crossvalidation. The figure also displays the selected pairs of edges at each step in a sequence of successive updates of , for and the final step , showing that the estimated graph is identical to the true graph.
3 Numerical results and real data example
We conducted extensive Monte Carlo simulations to investigate the performance of GS. In this section we report some results from this study and a numerical experiment using real data.
3.1 Monte Carlo simulation study
Simulated Models
We consider three dimension values and three different models for :

Model 1. Autoregressive model of orden , denoted . In this case for .

Model 2. Nearest neighbors model of order 2, denoted . For each node we randomly select two neighbors and choose a pair of symmetric entries of using the NeighborOmega function of the R package Tlasso.

Model 3. Block diagonal matrix model with blocks of size , denoted BG. For and , we use and blocks, respectively. Each block, of size , has diagonal elements equal to and offdiagonal elements equal to .
For each and each model we generate random samples of size . These graph models are widely used in the genetic literature to model gene expression data. See for example Lee and Liu (2015) and Lee and Ghi (2006). Figure 2 displays graphs from Models 13 with nodes.
Methods
We compare the performance of GS with Graphical lasso (Glasso) and Constrained minimization for inverse matrix estimation (CLIME) proposed by Friedman et al. (2008) and Cai et al. (2011) respectively. Therefore, the methods compared in our simulation study are:

The proposed method GS with the forward and backward thresholds, , estimated by fold crossvalidation on a grid of values in , as described in Subsection 2.4. The computing algorithm is available by request.

The Glasso estimate obtained by solving the penalizedlikelihood problem:
(3.1) In our simulations and examples we use the Rpackage CVglasso with the tuning parameter selected by fold crossvalidation (the package default).

The CLIME estimate obtained by symmetrization of the solution of
(3.2) where is the sample covariance, is the identity matrix, is the elementwise norm, and is a tuning parameter. For computations, we use the Rpackage clime with the tuning parameter selected by fold crossvalidation (the package default).
To evaluate the ability of the methods for finding the pairs of edges, for each replicate, we compute the Matthews correlation coefficient (Matthews, 1975)
(3.3) 
the and the , where TP, TN, FP and FN are, in this order, the number of true positives, true negatives, false positives and false negatives, regarding the identification of the nonzero offdiagonal elements of . Larger values of MCC, Sensitivity and Specificity indicate a better performance (Fan et al., 2009; Baldi et al., 2000).
For every replicate, the performance of as an estimate for is measured by (where denotes the Frobenius norm) and by the normalized KullbackLeibler divergence defined by where
is the the KullbackLeibler divergence between and .
Results
Table 1 shows the MCC performance for the three methods under Models 13. GS clearly outperforms the other two methods while CLIME just slightly outperforms Glasso. Cai et al. (2011) pointed out that a procedure yielding a more sparse is preferable because this facilitates interpretation of the data. The sensitivity and specificity results, reported in Table 4 in Appendix, show that in general GS is more sparse than the CLIME and Glasso, yielding fewer false positives (more specificity) but a few more false negatives (less sensitivity). Table 2 shows that under models and the three methods achieve fairly similar performances for estimating . However, under model BG, GS clearly outperforms the other two.
Figure 3 display the heatmaps of the number of nonzero links identified in the replications under model . Notice that among the three compared methods, the GS sparsity patterns best match those of the true model. Figures 4 and 5 in the Appendix lead to similar conclusions for models and BG.
Model  GS  Glasso  CLIME  

50  0.741 (0.009)  0.419 (0.016)  0.492 (0.006)  
100  0.751 (0.004)  0.433 (0.020)  0.464 (0.004)  
150  0.730 (0.004)  0.474 (0.017)  0.499 (0.003)  
50  0.751 (0.004)  0.404 (0.014)  0.401 (0.007)  
100  0.802 (0.005)  0.382 (0.006)  0.407 (0.005)  
150  0.695 (0.007)  0.337 (0.008)  0.425 (0.003)  
50  0.898 (0.005)  0.356 (0.009)  0.482 (0.005)  
BG  100  0.857 (0.005)  0.348 (0.004)  0.461 (0.002) 
150  0.780 (0.008)  0.314 (0.003)  0.408 (0.003) 



GS  Glasso  CLIME  

Model  
50  0.70  3.82  0.64  3.90  0.63  3.91  
(0.00)  (0.00)  (0.00)  ( 0.02)  (0.00)  (0.01)  
100  0.83  5.73  0.80  5.72  0.79  5.75  
(0.00)  (0.00)  (0.00)  (0.02)  (0.00)  (0.01)  
150  1.25  7.16  1.17  7.21  1.17  7.25  
(0.00)  (0.00)  (0.00)  (0.02)  (0.00)  (0.01)  
50  0.99  6.98  0.99  6.65  0.99  6.64  
(0.00)  (0.00)  (0.00)  (0.01)  (0.00)  (0.00)  
100  0.10  10.11  1.00  9.64  1.00  9.601  
(0.00)  (0.00)  (0.00)  (0.009)  (0.000)  (0.005)  
150  1.00  12.37  1.00  11.90  1.00  11.79  
(0.00)  (0.00)  (0.00)  (0.01)  (0.00)  (0.00)  
BG  50  0.46  1.44  0.85  5.45  0.82  5.03 
(0.00)  (0.00)  (0.00)  (0.10)  (0.00)  (0.05)  
100  0.71  2.94  0.93  9.16  0.92  8.71  
(0.00)  (0.00)  (0.00)  (0.07)  (0.00)  (0.02)  
150  0.88  6.10  0.96  11.59  0.96  11.42  
(0.00)  (0.00)  (0.00)  (0.06)  (0.00)  (0.02) 
3.2 Analysis of Breast Cancer Data
In preoperative chemoterapy, the complete eradication of all invasive cancer cells is referred to as pathological complete response, abbreviated as pCR. It is known in medicine that pCR is associated with the longterm cancerfree survival of a patient. Gene expression profiling (GEP) – the measurement of the activity (expression level) of genes in a patient – could in principle be a useful predictor for the patient’s pCR.
Using normalized gene expression data of patients in stages IIII of breast cancer, Hess et al. (2006) aim to identify patients that may achieve pCR under sequential anthracycline paclitaxel preoperative chemotherapy. When a patient does not achieve pCR state, he is classified in the group of residual disease (RD), indicating that cancer still remains. Their data consist of 22283 gene expression levels for 133 patients, with 34 pCR and 99 RD. Following Fan et al. (2009) and Cai et al. (2011) we randomly split the data into a training set and a testing set. The testing set is formed by randomly selecting 5 pCR patients and 16 RD patients (roughly of the subjects) and the remaining patients form the training set. From the training set, a two sample ttest is performed to select the 50 most significant genes. The data is then standardized using the standard deviation estimated from the training set.
We apply a linear discriminant analysis (LDA) to predict whether a patient may achieve pathological complete response (pCR), based on the estimated inverse covariance matrix of the gene expression levels. We label with the pCR group and the RD group and assume that data are normally distributed, with common covariance matrix and different means . From the training set, we obtain , and for the test data compute the linear discriminant score as follows
(3.4) 
where is the proportion of group subjects in the training set. The classification rule is
(3.5) 
For every method we use 5fold cross validation on the training data to select the tuning constants. We repeat this scheme 100 times.
Table 3 displays the means and standard errors (in brackets) of Sensitivity, Specificity, MCC and Number of selected Edges using over the 100 replications. Considering the MCC, GS is slightly better than CLIME and CLIME than Glasso. While the three methods give similar performance considering the Specificity, GS and CLIME improve over Glasso in terms of Sensitivity.
GS  CLIME  Glasso  

Sensitivity  0.798 (0.02)  0.786 (0.02)  0.602 (0.02) 
Specificity  0.784 (0.01)  0.788 (0.01)  0.767 (0.01) 
MCC  0.520 (0.02)  0.516 (0.02)  0.334 (0.02) 
Number of Edges  54 (2)  4823 (8)  2103 (76) 
4 Concluding remarks
This paper introduces a stepwise procedure, called GS, to perform covariance selection in high dimensional Gaussian graphical models. Our method uses a different parametrization of the Gaussian graphical model based on Pearson correlations between the bestlinearpredictors prediction errors. The GS algorithm begins with a family of empty neighborhoods and using basic steps, forward and backward, adds or delete edges until appropriate thresholds for each step are reached. These thresholds are automatically determined by cross–validation.
GS is compared with Glasso and CLIME under different Gaussian graphical models (, and BG) and using different performance measures regarding network recovery and sparse estimation of the precision matrix . GS is shown to have good support recovery performance and to produce simpler models than the other two methods (i.e. GS is a parsimonious estimation procedure).
We use GS for the analysis of breast cancer data and show that this method may be a useful tool for applications in medicine and other fields.
Acknowledgements
The authors thanks the generous support of NSERC, Canada, the Institute of Financial Big Data, University Carlos III of Madrid and the CSIC, Spain.
Appendix A Appendix
a.1 Selection of the thresholds parameters by crossvalidation
Let be the matrix with rows , , corresponding to observations. For each , let denote the jth–column of the matrix .
We randomly partition the dataset into disjoint subsets of approximately equal size, the subset being of size and . For every , let be the validation subset, and its complement , the training subset.
For every and threshold parameters let be the estimated neighborhoods given by GSA using the training subset with . Consider for every node the estimated neighborhood and let be the estimated coefficient of the regression of on , represented in (A.2) (red colour).
Consider the validation subset with , and for every let and define the vector of predicted values
where is the matrix with rows , represented in (A.2) (in blue colour). If the neighborhood we define
where is the mean of the sample of observations .
We define the –fold cross–validation function as
where the L2norm or euclidean distance in . Hence the –fold cross–validation forward–backward thresholds , is
(A.1) 
where is a grid of ordered pairs in over which we perform the search.
(A.2) 
Remark 3.
Matrix (A.2) represents, for every node the comparison between estimated and predicted values for crossvalidation. is computed using the observations and the matrix with rows , in the training subset (red colour). Based on the validation set is computed using and compared with (in blue color).
a.2 Complementary simulation results
GS  Glasso  CLIME  

Model  Sensitivity  Specificity  MCC  Sensitivity  Specificity  MCC  Sensitivity  Specificity  MCC  
50  0.756  0.988  0.741  0.994  0.823  0.419  0.988  0.891  0.492  
(0.015)  (0.002)  (0.009)  (0.002)  (0.012)  (0.016)  (0.002)  (0.003)  (0.006)  
100  0.632  0.999  0.751  0.989  0.897  0.433  0.983  0.934  0.464  
(0.007)  (0.000)  (0.004)  (0.002)  (0.009)  (0.020)  (0.002)  (0.001)  (0.004)  
150  0.607  0.999  0.730  0.981  0.943  0.474  0.972  0.964  0.499  
(0.006)  (0.000)  (0.004)  (0.002)  (0.007)  (0.017)  (0.002)  (0.001)  (0.003)  
50  0.632  0.999  0.751  0.971  0.864  0.404  0.984  0.875  0.401  
(0.007)  (0.000)  (0.004 )  (0.004)  (0.010)  (0.014)  (0.003)  (0.004)  (0.007)  
100  0.730  0.999  0.802  0.987  0.924  0.382  0.985  0.937  0.407  
(0.008)  (0.000)  (0.005)  (0.002)  (0.004)  (0.006)  (0.002)  (0.001)  (0.005)  
150  0.555  0.999  0.695  0.952  0.936  0.337  0.934  0.965  0.425  
(0.017)  (0.000)  (0.007)  (0.004)  (0.002)  (0.008)  ( 0.003)  (0.001)  (0.003)  
50  0.994  0.981  0.898  0.867  0.697  0.356  0.962  0.807  0.482  
(0.002)  (0.001)  (0.005)  (0.032)  (0.021)  (0.009)  (0.004)  (0.005)  (0.005)  
BG  100  0.949  0.989  0.857  0.569  0.908  0.348  0.818  0.920  0.4615 
(0.007)  (0.000)  (0.005)  (0.039)  (0.011)  ( 0.004)  (0.005)  (0.005)  (0.002)  
150  0.782  0.994  0.780  0.426  0.952  0.314  0.626  0.959  0.408  
(0.021)  (0.000)  (0.008)  (0.035)  (0.006)  (0.003)  (0.006)  (0.001)  (0.003) 






References
 Anderson (2003) Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. John Wiley.
 Baldi et al. (2000) Baldi, P., S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen (2000). Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 16(5), 412–424.
 Banerjee et al. (2008) Banerjee, O., L. El Ghaoui, and A. d’Aspremont (2008). Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research 9, 485–516.
 Bühlmann and Van De Geer (2011) Bühlmann, P. and S. Van De Geer (2011). Statistics for highdimensional data: methods, theory and applications. Springer Science & Business Media.
 Cai et al. (2011) Cai, T., W. Liu, and X. Luo (2011). A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106(494), 594–607.
 Cramér (1999) Cramér, H. (1999). Mathematical Methods of Statistics. Princeton University Press.
 Dempster (1972) Dempster, A. P. (1972). Covariance selection. Biometrics, 157–175.
 Eaton (2007) Eaton, M. L. (2007). Multivariate Statistics : A Vector Space Approach. Institute of Mathematical Statistics.
 Edwards (2000) Edwards, D. (2000). Introduction to Graphical Modelling. Springer Science & Business Media.
 Fan et al. (2009) Fan, J., Y. Feng, and Y. Wu (2009). Network exploration via the adaptive lasso and scad penalties. The Annals of Applied Statistics 3(2), 521–541.
 Friedman et al. (2008) Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
 Hess et al. (2006) Hess, K. R., K. Anderson, W. F. Symmans, V. Valero, N. Ibrahim, J. A. Mejia, D. Booser, R. L. Theriault, A. U. Buzdar, P. J. Dempsey, et al. (2006). Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. Journal of Clinical Oncology 24(26), 4236–4244.
 Huang et al. (2016) Huang, S., J. Jin, and Z. Yao (2016). Partial correlation screening for estimating large precision matrices, with applications to classification. The Annals of Statistics 44(5), 2018–2057.
 Johnson et al. (2011) Johnson, C. C., A. Jalali, and P. Ravikumar (2011). Highdimensional sparse inverse covariance estimation using greedy methods. arXiv preprint arXiv:1112.6411.
 Kalisch and Bühlmann (2007) Kalisch, M. and P. Bühlmann (2007). Estimating highdimensional directed acyclic graphs with the pcalgorithm. The Journal of Machine Learning Research 8, 613–636.
 Lauritzen (1996) Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
 Lawrance (1976) Lawrance, A. J. (1976). On conditional and partial correlation. The American Statistician 30(3), 146–149.
 Ledoit and Wolf (2004) Ledoit, O. and M. Wolf (2004). A wellconditioned estimator for largedimensional covariance matrices. Journal of Multivariate Analysis 88(2), 365–411.
 Lee and Ghi (2006) Lee, H. and J. Ghi (2006). Gradient directed regularization for sparse gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2), 302â317.
 Lee and Liu (2015) Lee, W. and Y. Liu (2015). Joint estimation of multiple precision matrices with common structures. Journal of Machine Learning Research 16(1), 1035â1062.
 Liang et al. (2015) Liang, F., Q. Song, and P. Qiu (2015). An equivalent measure of partial correlation coefficients for highdimensional gaussian graphical models. Journal of the American Statistical Association 110(511), 1248–1265.
 Liu and Wang (2012) Liu, H. and L. Wang (2012). Tiger: A tuninginsensitive approach for optimally estimating gaussian graphical models. arXiv preprint arXiv:1209.2437.
 Matthews (1975) Matthews, B. (1975). Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta 405(2), 442â451.
 Meinshausen and Bühlmann (2006) Meinshausen, N. and P. Bühlmann (2006). Highdimensional graphs and variable selection with the lasso. The Annals of Statistics 34(3), 1436–1462.
 Peng et al. (2009) Peng, J., P. Wang, N. Zhou, and J. Zhu (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104(486), 735–746.
 Ravikumar et al. (2011) Ravikumar, P., M. J. Wainwright, G. Raskutti, B. Yu, et al. (2011). Highdimensional covariance estimation by minimizing penalized logdeterminant divergence. Electronic Journal of Statistics 5, 935–980.
 Ren et al. (2015) Ren, Z., T. Sun, C.H. Zhang, H. H. Zhou, et al. (2015). Asymptotic normality and optimalities in estimation of large gaussian graphical models. The Annals of Statistics 43(3), 991–1026.
 Rütimann et al. (2009) Rütimann, P., P. Bühlmann, et al. (2009). High dimensional sparse covariance estimation via directed acyclic graphs. Electronic Journal of Statistics 3, 1133–1160.
 Spirtes et al. (2000) Spirtes, P., C. N. Glymour, and R. Scheines (2000). Causation, Prediction, and Search. MIT press.
 Yuan (2010) Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research 11, 2261–2286.
 Yuan and Lin (2007) Yuan, M. and Y. Lin (2007). Model selection and estimation in the gaussian graphical model. Biometrika 94(1), 19–35.
 Zhou et al. (2011) Zhou, S., P. Rütimann, M. Xu, and P. Bühlmann (2011). Highdimensional covariance estimation based on gaussian graphical models. The Journal of Machine Learning Research 12, 2975–3026.