Regularized rank-based estimation of high-dimensional nonparanormal graphical models\thanksrefT1
Abstract
A sparse precision matrix can be directly translated into a sparse Gaussian graphical model under the assumption that the data follow a joint normal distribution. This neat property makes high-dimensional precision matrix estimation very appealing in many applications. However, in practice we often face non-normal data, and variable transformation is often used to achieve normality. In this paper we consider the nonparanormal model, which assumes that the variables follow a joint normal distribution after a set of unknown monotone transformations. The nonparanormal model is much more flexible than the normal model while retaining the good interpretability of the latter, in that each zero entry in the sparse precision matrix of the nonparanormal model corresponds to a pair of conditionally independent variables. In this paper we show that the nonparanormal graphical model can be efficiently estimated by using a rank-based estimation scheme which does not require estimating these unknown transformation functions. In particular, we study the rank-based graphical lasso, the rank-based neighborhood Dantzig selector and the rank-based CLIME. We establish their theoretical properties in the setting where the dimension is nearly exponentially large relative to the sample size. It is shown that the proposed rank-based estimators work as well as their oracle counterparts defined with the oracle data. Furthermore, the theory motivates us to consider the adaptive versions of the rank-based neighborhood Dantzig selector and the rank-based CLIME, which are shown to enjoy graphical model selection consistency without assuming the irrepresentable condition needed for the oracle and rank-based graphical lasso. Simulated and real data are used to demonstrate the finite-sample performance of the rank-based estimators.
DOI: 10.1214/12-AOS1041. The Annals of Statistics, Volume 40, Issue 5 (2012), pages 2541-2571.
Rank-based estimation of nonparanormal graphical models \thankstextT1Some results in this paper were reported in a research proposal funded by the Office of Naval Research in November 2010. Supported in part by NSF Grant DMS-0846068 and a grant from ONR.
AMS subject classifications: Primary 62G05, 62G20; secondary 62F12, 62J07. Keywords: CLIME, Dantzig selector, graphical lasso, nonparanormal graphical model, rate of convergence, variable transformation.
1 Introduction.
Estimating covariance or precision matrices is of fundamental importance in multivariate statistical methodology and applications. In particular, when data follow a joint normal distribution, the precision matrix can be directly translated into a Gaussian graphical model. The Gaussian graphical model serves as a noncausal, structured approach to exploring complex systems consisting of Gaussian random variables, and it finds many interesting applications in areas such as gene expression genomics and the study of macroeconomic determinants [Friedman (2004), Wille et al. (2004), Dobra, Eicher and Lenkoski (2010)]. The precision matrix $\Omega=\Sigma^{-1}$ plays a critical role in Gaussian graphical models because the zero entries in $\Omega$ precisely capture the desired conditional independencies, that is, $\Omega_{jk}=0$ if and only if $X_j$ and $X_k$ are conditionally independent given all the remaining variables [Lauritzen (1996), Edwards (2000)].
The sparsity pursuit in precision matrices was initially considered by Dempster (1972) as the covariance selection problem. Multiple testing methods have been employed for network exploration in Gaussian graphical models [Drton and Perlman (2004)]. With rapid advances in high-throughput technology (e.g., microarray, functional MRI), estimation of a sparse graphical model has become increasingly important in the high-dimensional setting. Some well-developed penalization techniques have been used for estimating sparse Gaussian graphical models. In a highly cited paper, Meinshausen and Bühlmann (2006) proposed the neighborhood selection scheme, which tries to discover, for each variable, the smallest index set of neighbors such that, conditioning on these neighbors, the variable is independent of all the remaining ones. Meinshausen and Bühlmann (2006) further proposed to use the lasso [Tibshirani (1996)] to fit each neighborhood regression model. Afterwards, one can summarize the zero patterns by aggregation via union or intersection. Yuan (2010) considered the Dantzig selector [Candes and Tao (2007)] as an alternative to the lasso penalized least squares in the neighborhood selection scheme. Peng et al. (2009) proposed the joint neighborhood lasso selection. Penalized likelihood methods have been studied for Gaussian graphical modeling [Yuan and Lin (2007)]. Friedman, Hastie and Tibshirani (2008) developed a fast blockwise coordinate descent algorithm [Banerjee, El Ghaoui and d'Aspremont (2008)], called the graphical lasso, for efficiently solving the lasso penalized Gaussian graphical model. The rate of convergence under the Frobenius norm was established by Rothman et al. (2008). Ravikumar et al. (2011) obtained the convergence rates under the elementwise $\ell_\infty$ norm and the spectral norm. Lam and Fan (2009) studied the nonconvex penalized Gaussian graphical model, where a nonconvex penalty such as the SCAD [Fan and Li (2001)] replaces the lasso penalty in order to overcome the bias issue of lasso penalization. Zhou et al. (2011) proposed a hybrid method for estimating sparse Gaussian graphical models: they first infer a sparse graphical structure via thresholded neighborhood selection and then estimate the precision matrix of the submodel by maximum likelihood. Cai, Liu and Luo (2011) recently proposed a constrained $\ell_1$-minimization estimator called CLIME for estimating sparse precision matrices and established its convergence rates under the elementwise $\ell_\infty$ norm and the Frobenius norm.
Table 1: Numbers of genes (out of 39) rejecting the null hypothesis of normality

             Critical value   Cramér-von Mises   Lilliefors   Shapiro-Francia
Raw data     0.05             30                 30           35
             0.05/39          24                 26           28
Log data     0.05             29                 24           33
             0.05/39          14                 12           16
Although the normality assumption can be relaxed if we only focus on estimating a precision matrix, it plays an essential role in making the neat connection between a sparse precision matrix and a sparse Gaussian graphical model. Without normality, we ought to be very cautious when translating a good sparse precision matrix estimator into an interpretable sparse Gaussian graphical model. However, the normality assumption often fails in reality. For example, the observed data are often skewed or have heavy tails. To illustrate the issue of non-normality in real applications, let us consider the gene expression data used to construct the isoprenoid genetic regulatory network in Arabidopsis thaliana [Wille et al. (2004)], including 16 genes from the mevalonate (MVA) pathway in the cytosol, 18 genes from the plastidial (MEP) pathway in the chloroplast and 5 genes encoding proteins in the mitochondria. This dataset contains gene expression measurements of 39 genes assayed on Affymetrix GeneChip microarrays. It was analyzed by Wille et al. (2004), Li and Gui (2006) and Drton and Perlman (2007) in the context of Gaussian graphical modeling after taking the log-transformation of the data. However, the normality assumption is still inappropriate even after the log-transformation. To show this, we conducted normality tests at the significance level 0.05, as summarized in Table 1. It is clear that at most 9 out of 39 genes would pass any of the three normality tests on the raw data. Even after the log-transformation, a substantial number of genes reject the null hypothesis of normality, and with the Bonferroni correction a number of genes still fail to pass any normality test. Figure 1 plots the histograms of two key isoprenoid genes, MECPS in the MEP pathway and MK in the MVA pathway, after the log-transformation, clearly showing the non-normality of the data even after the log-transformation.
Using transformation to achieve normality is a classical idea in statistical modeling. The celebrated Box-Cox transformation is widely used in regression analysis. However, any parametric modeling of the transformation suffers from model misspecification, which could lead to misleading inference. In this paper we take a nonparametric transformation strategy to handle the non-normality issue. Let $F$ be the CDF of a continuous random variable $X$ and $\Phi^{-1}$ be the inverse of the CDF of $N(0,1)$. Consider the transformation from $X$ to $Z$ by $Z=\Phi^{-1}(F(X))$. Then it is easy to see that $Z$ is standard normal regardless of $F$. Motivated by this simple fact, we consider modeling the data by the following nonparanormal model:
The nonparanormal model: $X=(X_1,\ldots,X_p)$ follows a $p$-dimensional nonparanormal distribution if there exists a vector of unknown univariate monotone increasing transformations, denoted by $f=(f_1,\ldots,f_p)$, such that the transformed random vector follows a multivariate normal distribution with mean 0 and covariance $\Sigma$,

(1) $Z=\bigl(f_1(X_1),\ldots,f_p(X_p)\bigr)\sim N_p(0,\Sigma),$

where without loss of generality the diagonals of $\Sigma$ are equal to 1.
Note that model (1) implies that $f_j(X_j)$ is a standard normal random variable. Thus, $f_j$ must be $\Phi^{-1}(F_j)$, where $F_j$ is the CDF of $X_j$. The marginal normality is always achieved by such transformations, so model (1) basically assumes that those marginally normal-transformed variables are jointly normal as well. We follow Liu, Lafferty and Wasserman (2009) in calling model (1) the nonparanormal model, but model (1) is in fact a semiparametric Gaussian copula model. The semiparametric Gaussian copula model is a nice combination of flexibility and interpretability, and it has generated a lot of interest in statistics, econometrics and finance; see Song (2000), Chen and Fan (2006), Chen, Fan and Tsyrennikov (2006) and references therein. Let $\Omega=\Sigma^{-1}$. By the joint normality assumption on $Z$, we know that $\Omega_{jk}=0$ if and only if $Z_j$ and $Z_k$ are conditionally independent given the remaining $Z$ variables. Interestingly, because each $f_j$ is monotone increasing, conditional independence among the $Z$ variables is equivalent to conditional independence among the corresponding $X$ variables. Therefore, a sparse $\Omega$ can be directly translated into a sparse graphical model representing the original variables.
In this work we primarily focus on estimating $\Omega$, which is then used to construct a nonparanormal graphical model. As for the nonparametric transformation functions, by the expression $f_j=\Phi^{-1}(F_j)$ we have a natural estimator for the transformation function of the $j$th variable, $\hat f_j=\Phi^{-1}(\tilde F_j)$, where $\tilde F_j$ is a Winsorized empirical CDF of the $j$th variable. Note that the Winsorization is used to avoid infinite values and to achieve a better bias-variance tradeoff; see Liu, Lafferty and Wasserman (2009) for a detailed discussion. In this paper we show that we can directly estimate $\Omega$ without estimating these nonparametric transformation functions at all. This statement may seem a bit surprising, because a natural estimation scheme is a two-stage procedure: first estimate the transformations and then apply a well-developed sparse Gaussian graphical model estimation method to the transformed data. Liu, Lafferty and Wasserman (2009) have actually studied this "plug-in" estimation approach. They proposed a Winsorized estimator of the nonparametric transformation function and used the graphical lasso in the second stage. They established the convergence rate of the "plug-in" estimator when the dimension $p$ is restricted to a polynomial order of the sample size $n$. However, Liu, Lafferty and Wasserman (2009) did not obtain a satisfactory rate of convergence for the "plug-in" approach, because for the Gaussian graphical model the rate of convergence can be established even when $p$ grows almost exponentially fast with $n$ [Ravikumar et al. (2011)]. As noted in Liu, Lafferty and Wasserman (2009), it is very challenging, if not impossible, to push the theory of the "plug-in" approach to handle exponentially large dimensions. One might ask whether using a better estimator for the transformation functions could improve the rate of convergence such that $p$ could be allowed to be nearly exponentially large relative to $n$. This is a legitimate direction for research. We do not pursue this direction in this work.
Instead, we show that a rank-based estimation approach achieves exactly the same goal without estimating these transformation functions at all.
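For concreteness, the "plug-in" transformation estimate $\hat f_j=\Phi^{-1}(\tilde F_j)$ discussed above can be sketched as follows. This is a minimal numpy/scipy sketch, not the paper's code; the truncation level follows the form suggested by Liu, Lafferty and Wasserman (2009), and the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def winsorized_ecdf(x, delta=None):
    """Winsorized empirical CDF evaluated at the sample points.

    Clipping to [delta, 1 - delta] keeps Phi^{-1} finite; the truncation
    level delta = 1 / (4 n^{1/4} sqrt(pi log n)) follows the choice
    suggested by Liu, Lafferty and Wasserman (2009)."""
    n = len(x)
    if delta is None:
        delta = 1.0 / (4.0 * n ** 0.25 * np.sqrt(np.pi * np.log(n)))
    ranks = np.argsort(np.argsort(x)) + 1      # ranks 1..n
    ecdf = ranks / n                           # empirical CDF at the data
    return np.clip(ecdf, delta, 1.0 - delta)

def estimate_transform(x):
    """Plug-in estimate f_hat(x_i) = Phi^{-1}(F_tilde(x_i))."""
    return norm.ppf(winsorized_ecdf(x))

rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=500))   # log-normal sample: non-normal marginal
z = estimate_transform(x)          # approximately standard normal scores
```

Because the estimate depends on the data only through ranks, it is automatically monotone in the observed values.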
Our estimator is constructed in two steps. First, we propose using the adjusted Spearman's rank correlation to get a nonparametric sample estimate of $\Sigma$. In the second step, we compute a sparse precision matrix estimator from this rank-based sample estimate of $\Sigma$. For that purpose, we consider several regularized rank estimators, including the rank-based graphical lasso, the rank-based neighborhood Dantzig selector and the rank-based CLIME. The complete methodological details are presented in Section 2. In Section 3 we establish theoretical properties of the proposed rank-based estimators, regarding both precision matrix estimation and graphical model selection. In particular, we are motivated by the theory to consider the adaptive versions of the rank-based neighborhood Dantzig selector and the rank-based CLIME, which can select the true support set with overwhelming probability without assuming the stringent irrepresentable condition required for the oracle and rank-based graphical lasso. Section 4 contains numerical results and Section 5 has some concluding remarks. Technical proofs are presented in the Appendix.
A referee informed us in his/her review report that Liu et al. (2012) also independently used the rank-based correlation in the context of nonparametric Gaussian graphical model estimation. A major focus of Liu et al. (2012) is the numerical demonstration of the robustness property of the rank-based methods using both Spearman's rho and Kendall's tau when data are contaminated. In the present paper we provide a systematic analysis of the rank-based estimators, and our theoretical analysis further leads to the rank-based adaptive Dantzig selector and the rank-based adaptive CLIME, in order to achieve improved sparsity recovery properties. Our theoretical analysis of the rank-based adaptive Dantzig selector is of independent interest. Although the theory is established for the rank-based estimators using Spearman's rho, the same analysis can be easily adapted to prove the theoretical properties of the rank-based estimators using Kendall's tau rank correlation.
2 Methodology.
We first introduce some necessary notation. For a matrix $A=(A_{jk})$, we define its entrywise $\ell_1$ norm as $\|A\|_1=\sum_{j,k}|A_{jk}|$ and its entrywise $\ell_\infty$ norm as $\|A\|_\infty=\max_{j,k}|A_{jk}|$. For a vector $a=(a_1,\ldots,a_m)$, we define its $\ell_1$ norm as $\|a\|_1=\sum_j|a_j|$ and its $\ell_\infty$ norm as $\|a\|_\infty=\max_j|a_j|$. To simplify notation, for index sets $I$ and $J$, define $A_{IJ}$ as the submatrix of $A$ with row indexes $I$ and column indexes $J$, and define $a_I$ as the subvector of $a$ with indexes $I$. Let $V=\{1,\ldots,p\}$ be the index set of the variables. Denote by $\Sigma_{-j,-j}$ the submatrix of $\Sigma$ with both the $j$th row and column removed, and denote by $\Sigma_{-j,j}$ the vector including all the covariances associated with the $j$th variable. In the same fashion, we can also define $\Omega_{-j,-j}$, $\Omega_{-j,j}$ and so on.
2.1 The “oracle” procedures.
Suppose an oracle knows the underlying transformation vector; then the oracle could easily recover the "oracle data" by applying these true transformations, that is, $z_i=f(x_i)=(f_1(x_{i1}),\ldots,f_p(x_{ip}))$ for $i=1,\ldots,n$. Before presenting our rank-based estimators, it is helpful to revisit the "oracle" procedures that are defined based on the "oracle data."

The oracle graphical lasso. Let $\hat\Sigma^{o}$ be the sample covariance matrix of the "oracle" data; then the "oracle" log-profile-likelihood becomes $\log\det(\Omega)-\operatorname{tr}(\hat\Sigma^{o}\Omega)$. The "oracle" graphical lasso solves the following penalized likelihood problem:

(2) $\hat\Omega^{o}_{\mathrm{glasso}}=\mathop{\arg\min}_{\Omega\succ 0}\ \operatorname{tr}(\hat\Sigma^{o}\Omega)-\log\det(\Omega)+\lambda\sum_{j\neq k}|\Omega_{jk}|.$
The oracle neighborhood lasso selection. Under the nonparanormal model, for each $j$, the "oracle" variable $Z_j$ given $Z_{-j}$ is normally distributed, which can be written as $Z_j=Z_{-j}^{\mathsf T}\theta^{(j)}+\varepsilon_j$ with $\theta^{(j)}=(\Sigma_{-j,-j})^{-1}\Sigma_{-j,j}$ and $\varepsilon_j\sim N(0,\sigma_j^2)$ independent of $Z_{-j}$. Notice that $\theta^{(j)}$ and $\sigma_j^2$ are closely related to the precision matrix $\Omega$, that is, $\theta^{(j)}=-\Omega_{-j,j}/\Omega_{jj}$ and $\sigma_j^2=1/\Omega_{jj}$. Thus, for the $j$th variable, $\theta^{(j)}$ and $\Omega_{-j,j}$ share the same sparsity pattern. Following Meinshausen and Bühlmann (2006), the oracle neighborhood lasso selection obtains the solution from the following lasso penalized least squares problem:

(3) $\hat\theta^{(j)}=\mathop{\arg\min}_{\theta}\ \frac{1}{2n}\sum_{i=1}^n\bigl(z_{ij}-z_{i,-j}^{\mathsf T}\theta\bigr)^2+\lambda\|\theta\|_1,$

and then the sparsity pattern of $\Omega$ can be estimated by aggregating the neighborhood support sets of $\hat\theta^{(j)}$ ($j=1,\ldots,p$) via intersection or union.
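The aggregation step above can be sketched as follows; this is a minimal numpy sketch with function and variable names of our own choosing, not code from the paper.

```python
import numpy as np

def aggregate_neighborhoods(supports, rule="union"):
    """Combine per-variable neighborhood supports into one undirected graph.

    supports[j] is the set of indices selected when variable j is regressed
    on the rest; 'union' keeps edge (j, k) if either regression selects it,
    'intersection' only if both do [Meinshausen and Buhlmann (2006)]."""
    p = len(supports)
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        for k in supports[j]:
            adj[j, k] = True
    if rule == "union":
        adj = adj | adj.T
    elif rule == "intersection":
        adj = adj & adj.T
    else:
        raise ValueError("rule must be 'union' or 'intersection'")
    np.fill_diagonal(adj, False)   # no self-loops in the graph
    return adj

# toy example: variable 0 selects 1; variable 1 selects 0 and 2;
# variable 2 selects nothing
supports = [{1}, {0, 2}, set()]
union_graph = aggregate_neighborhoods(supports, "union")
inter_graph = aggregate_neighborhoods(supports, "intersection")
```

Union keeps the asymmetric edge 1-2; intersection drops it because variable 2's regression did not select variable 1.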

The oracle neighborhood Dantzig selector. Following Yuan (2010), the lasso least squares in (3) can be replaced with the Dantzig selector

(5) $\hat\theta^{(j)}_{\mathrm{DS}}=\mathop{\arg\min}_{\theta}\ \|\theta\|_1\quad\mbox{subject to}\quad\Bigl\|\frac{1}{n}\sum_{i=1}^n z_{i,-j}\bigl(z_{ij}-z_{i,-j}^{\mathsf T}\theta\bigr)\Bigr\|_\infty\le\lambda.$

Then the sparsity pattern of $\Omega$ can be similarly estimated by aggregating via intersection or union. Furthermore, we notice that

$\frac{1}{n}\sum_{i=1}^n z_{i,-j}\bigl(z_{ij}-z_{i,-j}^{\mathsf T}\theta\bigr)=\hat\Sigma^{o}_{-j,j}-\hat\Sigma^{o}_{-j,-j}\theta.$

Then (5) can be written in the following equivalent form:

(6) $\hat\theta^{(j)}_{\mathrm{DS}}=\mathop{\arg\min}_{\theta}\ \|\theta\|_1\quad\mbox{subject to}\quad\bigl\|\hat\Sigma^{o}_{-j,j}-\hat\Sigma^{o}_{-j,-j}\theta\bigr\|_\infty\le\lambda.$
The oracle CLIME. Following Cai, Liu and Luo (2011), we can estimate the precision matrix by solving a constrained $\ell_1$-minimization problem,

(7) $\hat\Omega^{o}_{\mathrm{clime}}=\mathop{\arg\min}_{\Omega}\ \|\Omega\|_1\quad\mbox{subject to}\quad\|\hat\Sigma^{o}\Omega-I\|_\infty\le\lambda.$

Cai, Liu and Luo (2011) compared the CLIME with the graphical lasso and showed that the CLIME enjoys nice theoretical properties without assuming the irrepresentable condition of Ravikumar et al. (2011) for the graphical lasso.
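Each column of the CLIME estimator is the solution of a small linear program. A minimal sketch of one column using scipy's linprog, with the standard positive/negative split $\omega=u-v$, $u,v\ge 0$; the function name and the identity-covariance sanity check are our own illustration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(sigma_hat, j, lam):
    """Solve min ||w||_1 s.t. ||sigma_hat @ w - e_j||_inf <= lam
    as a linear program by splitting w = u - v with u, v >= 0."""
    p = sigma_hat.shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    c = np.ones(2 * p)                    # objective: sum(u) + sum(v) = ||w||_1
    block = np.hstack([sigma_hat, -sigma_hat])
    A_ub = np.vstack([block, -block])     # encodes |sigma_hat w - e_j| <= lam
    b_ub = np.concatenate([lam + e_j, lam - e_j])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

# sanity check: with an identity covariance and lam = 0.1, the constraint
# |w - e_0| <= 0.1 is tightest at w = (0.9, 0, 0), which minimizes ||w||_1
w = clime_column(np.eye(3), 0, lam=0.1)
```

Solving all $p$ columns and symmetrizing gives the full CLIME estimate.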
2.2 The proposed rank-based estimators.
The existing theoretical results in the literature can be directly applied to these oracle estimators. However, the "oracle data" are unavailable, and thus the above-mentioned "oracle" procedures are not genuine estimators. Naturally, we wish to construct a genuine estimator that can mimic the oracle estimator. To this end, we can derive an alternative estimate of $\Sigma$ based on the actual data and then feed this genuine estimate to the graphical lasso, the neighborhood selection or the CLIME. To implement this natural idea, we propose a rank-based estimation scheme. Note that $\Sigma$ can be viewed as the correlation matrix of $Z$ as well, because the diagonals of $\Sigma$ are 1. Let $x_{1j},\ldots,x_{nj}$ be the observed values of variable $X_j$. We convert them to ranks, denoted by $r_{1j},\ldots,r_{nj}$. Spearman's rank correlation $\tilde r_{jk}$ is defined as Pearson's correlation between $(r_{1j},\ldots,r_{nj})$ and $(r_{1k},\ldots,r_{nk})$; it is a nonparametric measure of dependence between two variables. It is important to note that $r_{1j},\ldots,r_{nj}$ are also the ranks of the "oracle" data $f_j(x_{1j}),\ldots,f_j(x_{nj})$, because $f_j$ is monotone increasing. Therefore, $\tilde r_{jk}$ is identical to the Spearman's rank correlation between the "oracle" variables $Z_j$ and $Z_k$. In other words, in the framework of rank-based estimation, we can treat the observed data as the "oracle" data and avoid estimating the nonparametric transformation functions. We note here that one may consider other rank correlation measures, such as Kendall's tau. To fix ideas, we use Spearman's rank correlation throughout this paper.
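The key fact above is that ranks are invariant under strictly increasing maps, so Spearman's rho computed from the observed data equals that computed from the oracle data exactly. A minimal numpy/scipy check of this invariance (variable names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
# "oracle" bivariate normal data with correlation 0.6
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=200)

# apply two different strictly increasing transformations to each margin,
# producing the "observed" non-normal data
x1 = np.exp(z[:, 0])      # log-normal margin
x2 = z[:, 1] ** 3         # cubic margin

rho_oracle, _ = spearmanr(z[:, 0], z[:, 1])
rho_observed, _ = spearmanr(x1, x2)
# ranks are unchanged by monotone increasing maps, so the two agree exactly
```

Pearson's correlation of the observed data, by contrast, would change under these transformations.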
The nonparanormal model implies that $(Z_j,Z_k)$ follows a bivariate normal distribution with correlation parameter $\sigma_{jk}$. Then a classical result due to Kendall (1948) [see also Kruskal (1958)] shows that

(8) $\lim_{n\to\infty}E(\tilde r_{jk})=\frac{6}{\pi}\arcsin\Bigl(\frac{\sigma_{jk}}{2}\Bigr),$

which indicates that $\tilde r_{jk}$ is a biased estimator of $\sigma_{jk}$. To correct the bias, Kendall (1948) suggested using the adjusted Spearman's rank correlation

(9) $\hat r_{jk}=2\sin\Bigl(\frac{\pi}{6}\tilde r_{jk}\Bigr).$

Combining (8) and (9), we see that $\hat r_{jk}$ is an asymptotically unbiased estimator of $\sigma_{jk}$. Naturally, we define the rank-based sample estimate of $\Sigma$ as $\hat R=(\hat r_{jk})_{1\le j,k\le p}$ with $\hat r_{jj}=1$.
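The construction of the adjusted rank correlation matrix can be sketched as follows; this is a minimal numpy/scipy sketch with a function name of our own choosing.

```python
import numpy as np
from scipy.stats import rankdata

def adjusted_spearman_matrix(X):
    """Rank-based estimate of Sigma: entrywise 2*sin(pi/6 * rho_S), where
    rho_S is Spearman's rank correlation (Kendall's 1948 bias adjustment)."""
    ranks = np.apply_along_axis(rankdata, 0, X)   # column-wise ranks
    rho_s = np.corrcoef(ranks, rowvar=False)      # Pearson correlation of ranks
    R_hat = 2.0 * np.sin(np.pi / 6.0 * rho_s)     # bias adjustment (9)
    np.fill_diagonal(R_hat, 1.0)                  # 2*sin(pi/6) = 1 on diagonal
    return R_hat

rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=300)
R_hat = adjusted_spearman_matrix(X)               # close to the identity here
```

The same matrix is obtained whether one feeds in the observed data or any monotone transformation of its columns.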
In Section 3 we show that $\hat R$ is a good estimate of $\Sigma$. Then we naturally come up with the following rank-based estimators of $\Omega$, by using the graphical lasso, the neighborhood Dantzig selector and the CLIME:

The rank-based graphical lasso:

(10) $\hat\Omega_{\mathrm{glasso}}=\mathop{\arg\min}_{\Omega\succ 0}\ \operatorname{tr}(\hat R\Omega)-\log\det(\Omega)+\lambda\sum_{j\neq k}|\Omega_{jk}|.$
The rank-based neighborhood Dantzig selector: a rank-based estimate of $\theta^{(j)}$ can be obtained by solving

(11) $\hat\theta^{(j)}=\mathop{\arg\min}_{\theta}\ \|\theta\|_1\quad\mbox{subject to}\quad\bigl\|\hat R_{-j,j}-\hat R_{-j,-j}\theta\bigr\|_\infty\le\lambda.$

The support of $\Omega_{-j,j}$ can be estimated from the support of $\hat\theta^{(j)}$ via aggregation by union or intersection. We can also construct a rank-based precision matrix estimator $\check\Omega$ with

$\check\Omega_{jj}=\bigl(1-\hat R_{j,-j}\hat\theta^{(j)}\bigr)^{-1}$

and

$\check\Omega_{-j,j}=-\check\Omega_{jj}\hat\theta^{(j)}$

($j=1,\ldots,p$), using the relations $\sigma_j^2=1/\Omega_{jj}$ and $\theta^{(j)}=-\Omega_{-j,j}/\Omega_{jj}$. We can symmetrize $\check\Omega$ by solving the following optimization problem [Yuan (2010)]:

$\hat\Omega=\mathop{\arg\min}_{\Omega:\ \Omega=\Omega^{\mathsf T}}\ \|\Omega-\check\Omega\|_{\ell_1}.$

Theoretical analysis of the rank-based neighborhood Dantzig selector in Section 3.2 motivates us to consider using the adaptive Dantzig selector in the rank-based neighborhood estimation in order to achieve better graphical model selection performance. See Section 3.2 for more details.

The rank-based CLIME:

(12) $\hat\Omega_{\mathrm{clime}}=\mathop{\arg\min}_{\Omega}\ \|\Omega\|_1\quad\mbox{subject to}\quad\|\hat R\Omega-I\|_\infty\le\lambda.$

Let $e_1,\ldots,e_p$ be the natural basis of $R^p$. By Lemma 1 in Cai, Liu and Luo (2011), the above optimization problem can be further decomposed into $p$ subproblems of vector minimization,

(13) $\hat\omega_j=\mathop{\arg\min}_{\omega}\ \|\omega\|_1\quad\mbox{subject to}\quad\|\hat R\omega-e_j\|_\infty\le\lambda$

for $j=1,\ldots,p$. Then $(\hat\omega_1,\ldots,\hat\omega_p)$ is exactly equivalent to $\hat\Omega_{\mathrm{clime}}$. Note that $\hat\Omega_{\mathrm{clime}}$ could be asymmetric. Following Cai, Liu and Luo (2011), we consider the symmetrized estimator whose $(j,k)$ entry keeps whichever of $(\hat\Omega_{\mathrm{clime}})_{jk}$ and $(\hat\Omega_{\mathrm{clime}})_{kj}$ has the smaller absolute value. In the original paper, Cai, Liu and Luo (2011) proposed to use hard thresholding for graphical model selection. Borrowing the basic idea from Zou (2006), we propose an adaptive version of the rank-based CLIME in order to achieve better graphical model selection. See Section 3.3 for more details.
2.3 Rank-based neighborhood lasso?
One might consider the rank-based neighborhood lasso, defined as follows:

(14) $\hat\theta^{(j)}_{\mathrm{lasso}}=\mathop{\arg\min}_{\theta}\ \frac{1}{2}\theta^{\mathsf T}\hat R_{-j,-j}\theta-\hat R_{-j,j}^{\mathsf T}\theta+\lambda\|\theta\|_1.$

However, there is a technical problem with the above definition. The Spearman's rank correlation matrix $\tilde R=(\tilde r_{jk})$ is always positive semidefinite, but the adjusted correlation matrix $\hat R$ could become indefinite. To the best of our knowledge, Devlin, Gnanadesikan and Kettenring (1975) were the first to point out the indefiniteness issue of the estimated rank correlation matrix. A toy example illustrates the point: one can construct a Spearman rank correlation matrix that is positive definite, while the entrywise adjustment in (9) turns it into an indefinite matrix with a negative eigenvalue. The negative eigenvalues make (14) an ill-defined optimization problem. Fortunately, the positive definiteness issue does not cause any problem for the graphical lasso, the Dantzig selector or the CLIME. Notice that the diagonal elements of $\hat R$ are strictly positive, and thus Lemma 3 in Ravikumar et al. (2011) implies that the rank-based graphical lasso always has a unique positive definite solution for any regularization parameter $\lambda>0$. The rank-based neighborhood Dantzig selector and the rank-based CLIME are still well defined even when $\hat R$ becomes indefinite, and the corresponding optimization algorithms also tolerate the indefiniteness of $\hat R$. One might consider a positive definite correction of $\hat R$ for implementing the neighborhood lasso estimator. However, the resulting estimator should behave similarly to the rank-based neighborhood Dantzig selector, because the lasso penalized least squares and the Dantzig selector generally work very similarly [Bickel, Ritov and Tsybakov (2009), James, Radchenko and Lv (2009)].
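A concrete instance of this phenomenon (our own toy example, not necessarily the one in the original paper): an equicorrelation Spearman matrix that is barely positive definite becomes indefinite after the entrywise adjustment in (9), because $|2\sin(\pi r/6)|>|r|$ for $0<|r|<1$.

```python
import numpy as np

# an equicorrelation "Spearman" matrix that is (barely) positive definite:
# eigenvalues are 1 + 2*rho = 0.02 and 1 - rho = 1.49 (twice)
rho = -0.49
R_tilde = np.full((3, 3), rho)
np.fill_diagonal(R_tilde, 1.0)

# the entrywise adjustment 2*sin(pi/6 * r) inflates |r|, pushing the
# near-singular matrix past the boundary into indefiniteness
R_hat = 2.0 * np.sin(np.pi / 6.0 * R_tilde)
np.fill_diagonal(R_hat, 1.0)

min_eig_tilde = np.linalg.eigvalsh(R_tilde).min()   # positive
min_eig_hat = np.linalg.eigvalsh(R_hat).min()       # negative: indefinite
```

This is exactly why (14) can be ill-defined while the Dantzig selector and CLIME formulations, which never require a positive definite input, remain usable.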
3 Theoretical properties.
For a vector $a$, let $\|a\|_{\min}$ denote the minimum absolute value, that is, $\|a\|_{\min}=\min_j|a_j|$. For a matrix $A$, we define the following matrix norms: the matrix $\ell_1$ norm $\|A\|_{\ell_1}=\max_k\sum_j|A_{jk}|$, the matrix $\ell_\infty$ norm $\|A\|_{\ell_\infty}=\max_j\sum_k|A_{jk}|$ and the Frobenius norm $\|A\|_F=(\sum_{j,k}A_{jk}^2)^{1/2}$. For any symmetric matrix, its matrix $\ell_1$ norm coincides with its matrix $\ell_\infty$ norm. Denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the smallest and largest eigenvalues of $A$, respectively. Define $\Sigma$ as the true covariance matrix, and let $\Omega=\Sigma^{-1}$ be its inverse. Let $S=\{(j,k): j\neq k,\ \Omega_{jk}\neq 0\}$ be the true support set of the off-diagonal elements of $\Omega$. Let $d$ be the maximal degree of the underlying graph corresponding to $S$, and let $s=|S|$ be the total degree over the whole graph.
In this section we establish theoretical properties for the proposed rank-based estimators. The main conclusion drawn from these theoretical results is that the rank-based graphical lasso, neighborhood Dantzig selector and CLIME work as well as their oracle counterparts in terms of the rates of convergence. We first provide useful concentration bounds concerning the accuracy of the rank-based sample correlation matrix.
Lemma 1
Fix any $\varepsilon>0$, and let $n\ge 3$. Then there exists some absolute constant $c>0$ such that we have the following concentration bounds:

$\Pr\bigl(|\hat r_{jk}-\sigma_{jk}|>\varepsilon\bigr)\le 2\exp(-cn\varepsilon^2)\quad\mbox{for each }(j,k),\qquad\Pr\Bigl(\max_{j,k}|\hat r_{jk}-\sigma_{jk}|>\varepsilon\Bigr)\le 2p^2\exp(-cn\varepsilon^2).$
Lemma 1 is a key ingredient of our theoretical analysis. It basically shows that the rank-based sample estimate $\hat R$ of $\Sigma$ works as well as the usual sample covariance estimator of $\Sigma$ based on the "oracle data."
3.1 Rank-based graphical lasso.
Denote by $\omega_{\min}=\min_{(j,k)\in S}|\Omega_{jk}|$ the minimal entry of $\Omega$ on its support in absolute scale. Define $\kappa_\Sigma=\|\Sigma\|_{\ell_\infty}$ and $\kappa_\Gamma=\|(\Gamma_{SS})^{-1}\|_{\ell_\infty}$, where $\Gamma$ is defined as the Kronecker product $\Gamma=\Sigma\otimes\Sigma$.
Theorem 1
Assume the irrepresentable condition $\max_{e\in S^c}\|\Gamma_{eS}(\Gamma_{SS})^{-1}\|_1\le 1-\alpha$ holds for some $\alpha\in(0,1]$.

(a) Elementwise maximal bound: if the regularization parameter $\lambda$ is chosen at least of the order of the concentration level in Lemma 1, then with probability tending to 1, the rank-based graphical lasso estimator satisfies $\|\hat\Omega_{\mathrm{glasso}}-\Omega\|_\infty\le C\lambda$, where the constant $C$ depends only on $\alpha$ and $\kappa_\Gamma$.

(b) Graphical model selection consistency: picking a regularization parameter as in (a) and additionally requiring $\omega_{\min}>2C\lambda$, with probability tending to 1, $\hat\Omega_{\mathrm{glasso}}$ is sign consistent, satisfying $\operatorname{sign}(\hat\Omega_{jk})=\operatorname{sign}(\Omega_{jk})$ for any $(j,k)\in S$ and $\hat\Omega_{jk}=0$ for any $(j,k)\notin S$.
In Theorem 1, the assumed condition on $\Gamma=\Sigma\otimes\Sigma$ is also referred to as the irrepresentable condition for studying the theoretical properties of the graphical lasso [Ravikumar et al. (2011)]. We can obtain a straightforward understanding of Theorem 1 by considering its asymptotic consequences.
Corollary 1
Assume that there is a constant $M$ such that $\kappa_\Sigma\le M$ and $\kappa_\Gamma\le M$. Suppose that $\alpha$ and the maximal degree $d$ are both fixed constants.

(a) Rates of convergence: assume $\log p=o(n)$, and pick a regularization parameter $\lambda$ such that $\lambda\asymp\sqrt{\log p/n}$. Then we have

$\|\hat\Omega_{\mathrm{glasso}}-\Omega\|_\infty=O_P\bigl(\sqrt{\log p/n}\bigr).$

Furthermore, the convergence rates in both the Frobenius and the spectral norm can also be obtained as follows:

$\|\hat\Omega_{\mathrm{glasso}}-\Omega\|_F=O_P\bigl(\sqrt{(s+p)\log p/n}\bigr),\qquad\|\hat\Omega_{\mathrm{glasso}}-\Omega\|_2=O_P\bigl(\sqrt{\log p/n}\bigr).$

(b) Graphical model selection consistency: assume $\omega_{\min}$ is also fixed and $\log p=o(n)$. Pick a $\lambda\asymp\sqrt{\log p/n}$. Then, with probability tending to 1, $\operatorname{sign}(\hat\Omega_{jk})=\operatorname{sign}(\Omega_{jk})$ for all $(j,k)\in S$ and $\hat\Omega_{jk}=0$ for all $(j,k)\notin S$.
Under the same conditions as in Theorem 1 and Corollary 1, by the results in Ravikumar et al. (2011), we know that the conclusions of Theorem 1 and Corollary 1 hold for the oracle graphical lasso. In other words, the rank-based graphical lasso estimator is comparable to its oracle counterpart in terms of rates of convergence.
3.2 Rank-based neighborhood Dantzig selector.
We define $\theta^{(j)}$ and $\sigma_j^2$ as in Section 2.1, so that $\theta^{(j)}=-\Omega_{-j,j}/\Omega_{jj}$ and $\sigma_j^2=1/\Omega_{jj}$. For each variable $j$, define the corresponding active set $A_j=\{k\neq j:\Omega_{jk}\neq 0\}$, with the maximal cardinality $q=\max_j|A_j|$. Then we can organize $\hat R_{-j,-j}$ and $\hat R_{-j,j}$ in block form with respect to $A_j$, and likewise we can partition $\Sigma_{-j,-j}$ and $\Sigma_{-j,j}$ with respect to $A_j$.
Theorem 2
Pick the regularization parameter $\lambda$ such that it dominates the concentration level in Lemma 1. Then, with probability tending to 1, there exists a constant $C$, depending on $\lambda_{\min}(\Sigma)$, $\|\theta^{(j)}\|_1$ and $q$ only, such that

$\|\hat\theta^{(j)}-\theta^{(j)}\|_1\le Cq\lambda\qquad\mbox{for each }j.$
Corollary 2
Suppose that $\lambda_{\min}(\Sigma)$, $\|\theta^{(j)}\|_1$ and $q$ are all fixed. Let $\log p=o(n)$, and pick $\lambda$ such that $\lambda\asymp\sqrt{\log p/n}$. Then we have

$\max_{1\le j\le p}\|\hat\theta^{(j)}-\theta^{(j)}\|_1=O_P\bigl(\sqrt{\log p/n}\bigr).$
Yuan (2010) established the rates of convergence of the neighborhood Dantzig selector under the $\ell_1$ norm, which can be directly applied to the oracle neighborhood Dantzig selector under the nonparanormal model. Comparing Theorem 2 and Corollary 2 to the results in Yuan (2010), we see that the rank-based neighborhood Dantzig selector and the oracle neighborhood Dantzig selector achieve the same rates of convergence.
The Dantzig selector and the lasso are closely related [Bickel, Ritov and Tsybakov (2009), James, Radchenko and Lv (2009)]. Similarly to the lasso, the Dantzig selector tends to over-select. Zou (2006) proposed the adaptive weighting idea to develop the adaptive lasso, which improves the selection performance of the lasso and corrects its bias. The very same idea can be used to improve the selection performance of the Dantzig selector, which leads to the adaptive Dantzig selector [Dicker and Lin (2009)]. We can extend the rank-based Dantzig selector to the rank-based adaptive Dantzig selector. Given adaptive weights $w^{(j)}=(w^{(j)}_1,\ldots,w^{(j)}_{p-1})$, consider
(15) $\hat\theta^{(j)}_{\mathrm{ada}}=\mathop{\arg\min}_{\theta}\ \|w^{(j)}\circ\theta\|_1\quad\mbox{subject to}\quad\bigl|\hat R_{-j,j}-\hat R_{-j,-j}\theta\bigr|\preceq\lambda' w^{(j)},$

where $\circ$ denotes the Hadamard product, and $\preceq$ denotes the set of entrywise inequalities, for ease of notation. In both our theoretical analysis and numerical implementation, we utilize the optimal solution $\hat\theta^{(j)}$ of the rank-based Dantzig selector (11) to construct the adaptive weights by

(16) $w^{(j)}_k=\bigl(|\hat\theta^{(j)}_k|+\varepsilon_n\bigr)^{-1},\qquad k=1,\ldots,p-1,$

where $\varepsilon_n>0$ is a small constant preventing division by zero.
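The weight construction can be sketched as follows, assuming the "inverse absolute value of the initial estimate" form; the guard constant eps here is our illustrative choice, not necessarily the exact constant used in (16).

```python
import numpy as np

def adaptive_weights(theta_init, eps=None):
    """Adaptive weights built from an initial Dantzig-selector fit.

    The weight is large where the initial estimate is small, so the weighted
    l1 penalty shrinks likely-null coordinates harder [Zou (2006)].  The
    additive constant eps (a hypothetical choice here) guards against
    division by zero when the initial estimate is exactly zero."""
    n = len(theta_init)
    if eps is None:
        eps = 1.0 / n
    return 1.0 / (np.abs(theta_init) + eps)

# initial rank-based Dantzig estimate for one neighborhood (toy numbers)
theta_init = np.array([1.5, 0.0, -0.02, 0.8])
w = adaptive_weights(theta_init, eps=0.01)
# coordinate 1 (estimated zero) receives by far the largest weight
```

Feeding these weights into (15) penalizes the coordinates that the initial fit already judged to be null far more heavily than the strong ones.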
Note that the support of $\theta^{(j)}$ exactly coincides with that of $\Omega_{-j,j}$, which is further equivalent to the active set $A_j$. We partition the relevant vectors and matrices with respect to $A_j$ as before, for $j=1,\ldots,p$.
Theorem 3
For each $j$, we pick $\lambda$ in (11) and $\lambda'$ in (15) to dominate the concentration level in Lemma 1, and we choose the adaptive weights $w^{(j)}$ as in (16). Then, with probability tending to 1, for each $j$ the rank-based adaptive Dantzig selector finds the unique solution $\hat\theta^{(j)}_{\mathrm{ada}}$ with $\operatorname{supp}(\hat\theta^{(j)}_{\mathrm{ada}})=A_j$ and $\operatorname{sign}(\hat\theta^{(j)}_{\mathrm{ada}})=\operatorname{sign}(\theta^{(j)})$, and thus the rank-based neighborhood adaptive Dantzig selector is consistent for graphical model selection.
Corollary 3
Suppose $\lambda_{\min}(\Sigma)$, $q$, $\min_{k\in A_j}|\theta^{(j)}_k|$ and $\|\theta^{(j)}\|_1$ ($j=1,\ldots,p$) are all constants. Assume that $\log p=o(n)$. Pick the tuning parameters $\lambda$ and $\lambda'$ such that $\lambda\asymp\sqrt{\log p/n}$ and $\lambda'\asymp\sqrt{\log p/n}$. Then, with probability tending to 1, for each $j$ the rank-based adaptive Dantzig selector with weights as in (16) finds the unique optimal solution $\hat\theta^{(j)}_{\mathrm{ada}}$ with $\operatorname{supp}(\hat\theta^{(j)}_{\mathrm{ada}})=A_j$ and $\operatorname{sign}(\hat\theta^{(j)}_{\mathrm{ada}})=\operatorname{sign}(\theta^{(j)})$, and thus the rank-based neighborhood adaptive Dantzig selector is consistent for graphical model selection.
The sign consistency of the adaptive Dantzig selector is similar to that of the adaptive lasso [van de Geer, Bühlmann and Zhou (2011)]. Based on Theorem 2, we construct the adaptive weights in (16), which is critical for the success of the rank-based adaptive Dantzig selector in the high-dimensional setting. It is important to mention that the rank-based adaptive Dantzig selector does not require the strong irrepresentable condition, which the rank-based graphical lasso needs in order to have the sparsity recovery property. Our treatment of the adaptive Dantzig selector is fundamentally different from that of Dicker and Lin (2009), who focused on the canonical linear regression model and constructed the adaptive weights as the inverses of the absolute values of the ordinary least squares estimator. Their theoretical results only hold in the classical fixed-$p$ setting, whereas in our problem $p$ can be much bigger than $n$. The choice of adaptive weights in (16) plays a critical role in establishing the graphical model selection consistency of the adaptive Dantzig selector in the high-dimensional setting where $p$ is nearly exponentially large relative to $n$. Our technical analysis uses key ideas such as strong duality and complementary slackness from linear optimization theory [Bertsimas and Tsitsiklis (1997), Boyd and Vandenberghe (2004)].
3.3 Rank-based CLIME.
Compared to the graphical lasso, the CLIME can enjoy nice theoretical properties without assuming the irrepresentable condition [Cai, Liu and Luo (2011)]. This continues to hold when comparing the rank-based graphical lasso and the rank-based CLIME.
Theorem 4
Recall that $\|\Omega\|_{\ell_1}$ is the matrix $\ell_1$ norm of $\Omega$. Pick a regularization parameter $\lambda$ dominating $\|\Omega\|_{\ell_1}$ times the concentration level in Lemma 1. Then, with probability tending to 1,

$\|\hat\Omega_{\mathrm{clime}}-\Omega\|_\infty\le 4\|\Omega\|_{\ell_1}\lambda.$

Moreover, assume that $\log p=o(n)$, and suppose $\|\Omega\|_{\ell_1}$ is a fixed constant. Pick a regularization parameter satisfying $\lambda\asymp\sqrt{\log p/n}$. Then we have

$\|\hat\Omega_{\mathrm{clime}}-\Omega\|_\infty=O_P\bigl(\sqrt{\log p/n}\bigr).$
Theorem 4 is parallel to Theorem 6 in Cai, Liu and Luo (2011), which can be used to establish the rate of convergence of the oracle CLIME.
To improve graphical model selection performance, Cai, Liu and Luo (2011) suggested an additional thresholding step, applying the elementwise hard-thresholding rule to $\hat\Omega_{\mathrm{clime}}$,

(17) $\hat\Omega^{\mathrm{thr}}_{jk}=\hat\Omega_{jk}\,\mathbf 1\bigl\{|\hat\Omega_{jk}|\ge\tau\bigr\},$

where $\tau$ is the threshold, whose choice is guided by the elementwise bound given in Theorem 4. Here we show that consistent graphical model selection can be achieved by an adaptive version of the rank-based CLIME. Given an adaptive weight matrix $W=(W_{jk})$, we define the rank-based adaptive CLIME as follows:

(18) $\hat\Omega_{\mathrm{ada}}=\mathop{\arg\min}_{\Omega}\ \|W\circ\Omega\|_1\quad\mbox{subject to}\quad\bigl|\hat R\Omega-I\bigr|\preceq\lambda,$

where $\preceq$ is a simplified expression for the set of entrywise inequalities $|(\hat R\Omega-I)_{jk}|\le\lambda$ (for all $j,k$). Write $W=(w_1,\ldots,w_p)$ columnwise. By Lemma 1 in Cai, Liu and Luo (2011), the above linear programming problem in (18) is exactly equivalent to $p$ vector minimization subproblems,

$\hat\omega^{\mathrm{ada}}_j=\mathop{\arg\min}_{\omega}\ \|w_j\circ\omega\|_1\quad\mbox{subject to}\quad\|\hat R\omega-e_j\|_\infty\le\lambda,\qquad j=1,\ldots,p.$
In both our theory and implementation, we utilize the rank-based CLIME's optimal solution $\hat\Omega_{\mathrm{clime}}$ to construct an adaptive weight matrix by

(19) $W_{jk}=\bigl(|(\hat\Omega_{\mathrm{clime}})_{jk}|+\varepsilon_n\bigr)^{-1},$

where $\varepsilon_n>0$ is a small constant preventing division by zero.
We now prove the graphical model selection consistency of the rank-based adaptive CLIME. Denote as