Abstract
We consider a method to jointly estimate sparse precision matrices and their underlying graph structures using dependent highdimensional datasets. We present a penalized maximum likelihood estimator which encourages both sparsity and similarity in the estimated precision matrices where tuning parameters are automatically selected by controlling the expected number of false positive edges. We also incorporate an extra step to remove edges which represent an overestimation of triangular motifs. We conduct a simulation study to show that the proposed methodology presents consistent results for different combinations of sample size and dimension. Then, we apply the suggested approaches to a highdimensional real case study of gene expression data with samples in two medical conditions, healthy and colon cancer tissues, to estimate a common network of genes as well as the differentially connected genes that are important to the disease. We find denser graph structures for healthy samples than for tumor samples, with groups of genes interacting together in the shape of clusters.
Keywords: joint graphical lasso, high dimension, clustering, gene expression, tuning parameters
Joint Estimation of Sparse Networks with
application to Paired Gene Expression data
Adria Caballe Mestres, Natalia Bochkina and Claus Mayer
University of Edinburgh & Maxwell Institute, Scotland, UK
Biomathematics & Statistics Scotland, Scotland, UK
1 Motivating problem in genomic data
Genomic data produced by highthroughput technology are nowadays easy to collect and store generating many statistical questions. For instance here we want to use a dataset where genomic profiles are obtained for individuals in different classes. It is publicly available in the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress/) and it is formally presented in Hinoue et al. (2012). It contains the gene expression information of 25 patients in two samples (tissues) for each gene/patient: the expression in a colon cancerous tissue and the expression in its adjacent healthy tissue. In total, there are more than 24k genes.
One of the challenges in the analysis of this data is to understand how genes interact between each other in a cell as well as detecting which groups of such connections vary from a healthy to a cancer state. This can be formulated by an estimation problem of sparse conditional dependence networks which are fully characterized by their underlying precision matrices (inverse of covariance/correlation matrices). Two genes are said to be conditionally independent given all the remaining genes if their correspondent coefficient in the precision matrix is zero.
This type of estimation problem is extensively studied in the field of statistics and also bioinformatics when the data is highdimensional (dimension is larger than the sample size) in which case maximum likelihood estimators are not suitable (Pourahmadi, 2007). Methods that address this issue to estimate a single precision matrix include sparsitypenalization approaches known as graphical lasso (Meinshausen and Bühlmann, 2006; Friedman et al., 2007; Peng et al., 2009). A natural extension is applied to jointly estimate multiple precision matrices by using an additional penalization term that encourages the similarity between such matrices. For instance, Guo et al. (2011) use a grouplasso penalization (GGL) or Danaher et al. (2014) incorporate a fusedlasso penalization option (FGL). The FGL method, which we will consider as the current method, yields better graph recovery rates than estimating the matrices separately when these are expected to be similar. However, it is designed under the assumption of independence between datasets.
Motivated by real data, in this paper we extend the methodology presented in Danaher et al. (2014) for a more general case of estimating jointly two sparsesimilar precision matrices whose datasets are dependent. A related proposal is given in Wit and Abbruzzo (2015) who estimate a joint precision matrix that can reflect several time points in a disease process. The divergence with respect to our method is in the interpretation of the differential network. Our aim is to find out which gene associations are (not) common between populations whereas they consider such matrix design as a constrain prior to estimation.
In Section 2 we describe the estimation problem, in Section 3 we propose a strategy to select the tuning parameters originated by the penalization terms and in Section 4 we discuss an issue in the current algorithm to estimate triangular motifs structures. In Section 5 we apply all the methodology to simulated datasets given different models to generate the data. Finally, in Section 6 we estimate conditional dependence structures for the motivating application to colon cancer gene expression data. The proposed methodology is implemented within the R package ldstatsHD (Caballe, 2016).
2 Weighted fused graphical lasso
2.1 Problem set up and crosscorrelation
We assume that the data are independent and identically distributed (iid) observations from a Gaussian model, , , with dimension and sample size , assuming, without a loss of generality, that the mean is zero. The matrix represents the joint conditional dependence structure for and , and it is defined by
(1) 
The objective is to estimate the precision matrices , and with the assumptions of sparsity and similarity among the two matrices.
If (which means that must also be ), then pairs and are independent with both and . In contrast, if contains at least one nonzero element, then and are dependent (e.g. in paired data) and the equality does not hold.
There are various types of dependence structures defined in either or . For instance, structures defined in the crosscovariance matrix motivated by an additive model (, with diagonal matrix , ) or a multiplicative model (). Both such cases, which coincide when , can be described by a Kronecker products formulation (Fan et al., 2008; Srivastava et al., 2008). Wit and Abbruzzo (2015) make a further simplification in the dependence structure by characterizing the crossprecision matrix . They assume that is a diagonal matrix. Hence, that any variable of the first dataset is independent from any variable of the other dataset if , , once conditioning on variables and (i.e. see Figure 1 for the underlying graphical representation). The main results we give in the next subsections can be applied for any type of dependence structure. However, for the real data analysis and also simulations we assume that the cross partial correlation matrix is diagonal.
2.2 Weighted fused graphical lasso for two dependent datasets
We propose to use a weightedfused graphical lasso (WFGL) maximum likelihood estimator for the joint precision matrix:
(2) 
with
(3) 
where is the sparsity tuning parameter, is the similarity tuning parameter, and is a matrix to weight for each coefficient of the differential precision matrix. In case for all pairs , then the maximization problem coincides with the fused graphical lasso (FGL) presented in Danaher et al. (2014). As novelty, in the next subsection we define weights that account for the dependence structure between the two datasets.
The maximization problem in (2) and (3) can be solved by the ADMMtype algorithm (Boyd, 2010) described in Algorithm 1. The main difference with respect to the FGL algorithm is in step 6, where different similarity penalties are considered to estimate the differential network.
(4) 
(5) 
(6) 
(7) 
2.3 Weights in the similarity penalization term
We consider the weights as a way to marginally normalize the initial estimated differential precision matrix which can be found in step 6 of Algorithm 1 just before thresholding. Our objective is to adapt the similaritypenalty parameter for each pair such that the probability to recover differential edges is independent of the relationship between variables in the two datasets.
We define the partial correlation matrix (scaled of estimated precision matrix ) and the Fisher transformation function , . We propose to use weights described by
(8) 
where . Individual variances expressed by , for both , could also be included in the weights. However, it can be proved that elements with large variances are, generally, more likely to contain nonzero coefficients than elements with low variances. Hence, correcting in this case might increase the number of false positive edges.
The asymptotic expression for the correlation of Fisher transform sample correlation coefficients (scaled of sample covariance matrix elements) is derived in Elston (1975) and Olkin and Finn (1990), among others, and it is only function of the true correlation coefficients. Similarly, we derive the asymptotic expression for the correlation of Fisher transform sample partial correlation coefficients :
(9) 
We shall remark that this excludes the perfect dependence case where and in (9). For instance, we consider unit weights for the matrix diagonal. Furthermore, if we assume a diagonal dependence structure in , the expression can be simplified by
(10) 
We propose two estimators for :

Regressionbased estimator (Regbased):
(11) where and are estimates for and respectively. These can be found using eq. (5) on the initial iteration of the ADMM Algorithm 1.
Moreover, and can be computed by considering a regressiontype partial correlation coefficient estimation. For instance, the expression for variable is defined by , with regression coefficients for .

Regressionbased simplified estimator (Regbasedsim):
(12) for same regressionbased estimators of and as well as partial correlation estimators and as defined in the Regbased estimator, thus using the main terms in eq. (10).
3 Selection of tuning parameters
3.1 Combination of two regularization parameters
The joint estimation problem described in Section 2.2 requires the selection of two regularization parameters: (sparsity) and (similarity), and the combination of the two characterizes the estimated network sizes (both common network and differential network). In terms of the differential network , the same number of nonzero estimated differential elements, say , can be achieved for many different combinations of the two parameters. For example, if we want to estimate differential edges, these can be found by the two extremes: (1) setting and selecting such that we have a maximum of edges for each graphs (); (2) setting and find such that (in this case, ). In between, there are infinitely many combinations of ’s that reach the same value for with the total number of edges being an upper bound for the number of differential edges.
In Caballe et al. (2016) we discussed different ways of choosing sparsity penalization parameters that encourage certain network characteristics, i.e. clustering structure or connectivity of the estimated networks. These could also be applied for the joint estimation algorithm once the parameter is fixed. Furthermore, here we want to propose an alternative procedure that transforms the problem of selecting regularization parameters and to setting the desired expected proportion of false positive edges (EFPR) using parameters (sparsity) and (similarity). This is possible to do directly (no resampling) and fast for the nature of the ADMM recursive algorithm presented in Section 2.2, that, for every iteration, obtains a dense estimation of the precision matrices before thresholding (see step 4 and 6). By having the whole dense matrix we can approximate a distribution that represents estimated coefficients whose true values are zero. In contrast, for graphical lasso algorithms in which the thresholding step is applied row by row using a regression based approach (Friedman et al., 2007), the EFPR is commonly controlled using subsampling methods (Meinshausen and Bühlman, 2010), which increases considerably the computational cost.
3.2 Selection of the expected false positive rate
We define the sets and the set with and . The objective is to set significance levels such that
Here we use by default.
The main characteristic of our proposed procedure is the adjustment of the penalization parameters and in every iteration of the joint estimation algorithm depending on the significance levels and as well as the estimated precision matrices and (which are described in step 6 of Algorithm 1). We define as a vector of size with the standardized estimated partial correlation coefficient differences at iteration and as a vector of size with the updated estimated partial correlation coefficients in both populations. We assume that coefficients with follow a normal distribution with mean zero and variance and that coefficients with follow a normal distribution with also mean zero and variance . In both cases, normality assumption is justified for sufficiently large sample size in our simulated data study and the analysis is presented in the supplementary material.
The pairs of variable where similarity or sparsity conditions hold are unknown without any prior information and variance parameters cannot be estimated by their sample estimators using elements in such defined sets. However, we assume that (i) most of the coefficients are zero in any of the two matrices (strong sparsity) and (ii) most of the coefficients are equal between the two precision matrices (strong similarity). Hence, we propose to use robust estimators for the variances and using all the partial correlation coefficients (differences). In this way, we reduce the importance of large coefficients which can be generated by true nonzero partial correlation coefficients. Next, we describe three of the most popular estimators for the variance in the robust statistics literature which are fully described in Rousseeuw and Croux (1993):

Median absolute deviation around the median:
where .

Interquartile range:
where with quanitle .

Rousseeuw and Croux (RC) mad alternative:
where .
For each iteration , another way to select the pair is by fixing significance levels and considering where is the upper critical value for the standard normal distribution and is determined by
with , and using independent standard normal distributed random variables and . We approximate by Monte Carlo. Default values for and as or could be used. Note that the estimated variances are found using coefficients, but in case is very small, Student’s tquantiles could be used instead of to approximate the CI.
3.3 Uncertainty in the differential network
Differential network estimators incorporate the variability of the two individual estimated networks and tend to be much more uncertain that the underlying estimated common network. Here we propose to perform a permuted samples based approach to assess the uncertainty in the number of estimated differential edges. We permute the data as follows to ensure that the dependence structure between datasets is maintained: where and
(13) 
with . Given the new permuted data, a weighted fused graphical lasso estimate can be found by solving eq. (2) using the same combination for ’s as for the original estimate. By repeating this permutation and estimation process times, we can compute a confidence region for the number of estimated differential edges (distinguishing between the two populations) under the hypothesis of equality in the two precision matrices: .
4 Overestimation of triangular motifs
4.1 Problem and toy example
We discovered that the overestimation of triangles is a major issue here. If there are 3 nodes A,B,C and we already know that pairs A,B and A,C, are connected, then a connection between B and C is more often falsely predicted than expected. The reason for this is that the ADMMtype algorithm presented in Algorithm 1 uses a regularization for the eigenvalues of the covariance/correlation matrix to approximate its inverse denoted by (see eq. (5)). It can be proved that when then and when then . In the second such scenario, which happens when is small in comparison to , the estimated coefficients are biased.
We illustrate this using a toy graph structure example described by:
hence, here the edge is the one missing to do a triangle. Assuming that the edges and have the same strength, we can express the correlation matrix and its inverse by
To show the behavior of the regularized precision matrix estimator defined at (5) we simulate data from a multivariate normal distribution with mean vector equal to zero and covariance matrix equal to . In Figure 2 we show the trend of (true edge), (false edge) and (false triangle edge) for different sample sizes and over 1000 simulations. We shall see that the is shrunk towards zero for small as expected, also is centered at zero as expected but is biased. The true , but for large enough the expected value of estimated is different from zero.
Danaher et al. (2014) make an additional consideration to the formula in (5). They suggest to use as a vector with the weights of the classes and default values equal to for all classes. The reason is that even though using as sample size reduces the bias, it also gives much larger variances for edges equal to zero than using the default weights and, therefore, it can produce more false positive edges. However, using weights equal to has as main problem precisely the detection of false positive triangular motifs. This is reflected in Figure 2(d), where the regularized inverse produces even a larger bias for the false triangle edge than in previous cases.
4.2 Reducing overestimation of triangular motifs
Here we focus on the hypothesis testing problem defined by H: not a triangle and H: triangle. The null hypothesis holds if any of the three partial correlation coefficients associated to the three edges that make the triangle is zero. Note that this contains multiple scenarios under and finding a reliable null distribution is not suitable without prior information. For instance, there are 7 configurations of the graph structure which can be considered under the null hypothesis (1 for all zero values, 3 for only one nonzero value and also 3 for two nonzero values) and in each one of them the correlation between estimated coefficients is different. Since the ADMM algorithm does find well conditionally dependence structures, we simplify the testing problem by assuming that under two edges are present, thus that there is a pair of variables, say , that is conditionally independent to the rest (the missing one to complete the triangle).
To test the existence of this motifs structure we employ the scaled inverse of the correlation matrix (which determines the partial correlation matrix) that involve the three nodes in the triangle. We use the Fisher transformation function , on the estimated partial correlation coefficient such that is approximately normal with mean value zero and variance (Fisher, 1924). In case the pair was known, then the pvalue of the test would be calculated by
(14) 
where defines the standard normal with cumulative distribution . We approximate a pvalue for the test in case the position of the pair is unknown by applying (14) on the minimum estimated coefficient in absolute value . This results to a conservative pvalue: for example if pair , then it is immediate to see that
For large sample sizes (or very large true nonzero partial correlation coefficients), then the equality holds.
Here, we assess the weakest edges of all the observed triangular structures separately and we eliminate those with small pvalues (default threshold equal to as described in Section 3). In case one edge is tested more than once, we only count its smallest pvalue. Nevertheless, multiple testing correction and another interpretation for overlapping triangles could be used instead.
5 Simulated data analysis
5.1 Models used to generate the data
We generate data from multivariate normal distributions with zero mean vector and several almostblock diagonal precision matrices, where each block (or cluster) has a powerlaw underlying graph structure (defined below) and there are some extra random connections between blocks. The nonzero partial correlation coefficients are simulated by
(15) 
Then, we regularize by , with such that the condition number of is less than the number of nodes, so obtaining a positive definite matrix (Cai et al., 2011). Initially we consider . However, we also include differential edges using additional block diagonal structures. For instance, we use two block diagonal structures , so that we merge to and to .
As for Wit and Abbruzzo (2015) we define by a diagonal matrix. Nevertheless, we consider that these elements in the diagonal can be different, for instance we use for diagonal elements (chosen randomly) and for the other .
Powerlaw networks assume that the variable , which denotes the fraction of nodes in the network that has degree , follows a powerlaw distribution
where , a constant and the normalizing function is the Riemann zeta function. Following Peng et al. (2009), provides a good representation of biological networks.
We generate datasets with several dimension sizes 200, 300, 400 and sample sizes 25, 100, 250, 500. In Figure 3 we present the graphical representation of some of the the generated networks.
5.2 Differential network recovery
In this section we focus on the recovery of differential edges by using two joint graphical lasso algorithms in the simulated datasets: FGL (Danaher et al., 2014) and WFGL (proposed). In order to make the methods comparable we select estimated graphs (or and ) that have the same number of common edges and differential edges in the two approaches. We first select the pair for the WFGL approach by setting the expected false positive rate by the parameters following the strategy proposed in Section 3. Then we find ’s such that the FGL graphs have the same sizes as WFGL. In total we use 200 iterations for each model, 4 different sample sizes and three dimension sizes 200, 300, 400.
To compare the performance of the methods, we propose to use the Youden’s index defined by
where and are the number of true positives and false positive of the estimated differential graphs with and method . Then we compute
which defines the Youden’s index differences between the two methods to estimate the joint networks. In Table 1 we present the average difference (with a ttest pvalue) and also the average sign of the differences (with a Wilcoxon test pvalue). The proposed method that assumes a dependence structure achieves better TPFP ratios for the differential network than the original FGL in most of the models when is fairly large. For small , there are no significant differences between the two algorithms even when there exists a dependence structure in the data.
p= 200  p=300  p=400  

(pval)  (pval)  (pval)  (pval)  (pval)  (pval)  
25  .18 (0.07)  .06 (0.09)  .04 (0.48)  .01 (0.50)  .12 (0.11)  .05 (0.11) 
100  .25 (0.04)  .07 (0.06)  .16 (0.10)  .04 (0.20)  .26 (0.03)  .08 (0.06) 
250  .26 (0.02)  .10 (0.02)  .27 (0.05)  .09 (0.04)  .32 (0.02)  .11 (0.01) 
500  .24 (0.05)  .08 (0.05)  .15 (0.12)  .06 (0.07)  .19 (0.07)  .07 (0.08) 
Even though the correction for dependent datasets does not improve in great measure the differential network recovery rates, assuming that differential edges can occur with same probability independently of the values produces a fairer procedure in which edges with high correlation have similar chances to be recovered as edges with low correlation.
We show this using the model defined by a dimension and several sample sizes. We separate pairs of variables in two groups: and . For all pairs , we compute using (Indep.) as well as (paired) with estimated by the Regbasedsim method discussed in Section 2.3 (see Section 1 in the supplementary material for comparison between estimators). Then we rank the values and we denote them by such that for and for . In Figure 4 we show the differences of the average ranks in the two groups, i.e. . We can see that the independent method encourages recovery of differential edges with small (seen in the plot by large negative rank differences) and this bias is corrected by the dependent data adjustment, which for relatively large sample size gives very similar ranks in the two groups.
5.3 Evaluation of tuning parameter selection
In Figure 5 we compare the expected proportion of false positive edges determined by the value of against the observed false positive rate (with median and confidence) using the RCmad estimator to approximate (see Section 2 in the supplementary material for comparison between estimators). To draw the confidence interval we replicate the procedure in 100 simulated datasets for different sample sizes and dimension sizes. The approximated false positive rate is close to the true one, given by , and it is only for very small that the true value is not always included in the confidence interval. A similar analysis is applied to the other tuning parameter in the supplementary material.
5.4 Testing and removing triangle motifs
As we discussed in section 4, using the eigenvalue decomposition regularization forces a bias to some non existing edges in the true network. These ones are the missing edges to form closed structures. For example, triangle structures are the ones that suffer the most this bias. This is illustrated in Table 2 where we present the average TPFP behavior for the weakest edge of estimated triangles for models with different sample sizes, dimensions sizes and significance levels distinguishing by triangles in a common network and triangles in a differential network. The initial estimated triangles contain more false positive than true positives increasingly with and . This is corrected by our triangle detection procedure (particularly for common edges), which without losing many true positive edges, reduces notably the number of false positives.
common edges  differential edges  

n 

dimension p=200  
NO  3.5326.40  6.0340.04  6.3961.02  5.7557.06  .383.68  0.764.30  0.514.75  0.485.10 
00  0.090.53  1.720.97  3.381.26  00  00.12  0.070.48  0.120.76  
00.02  0.611.25  2.932.40  4.213.21  00  0.030.49  0.170.94  0.291.14  
00.08  1.211.87  3.674.30  4.555.37  00  0.090.82  0.191.27  0.371.47  
dimension p=300  
NO  5.9260.20  9.4374.25  8.1291.25  7.23114.64  .519.13  0.848.35  0.677.76  0.388.20 
00  0.300.71  2.331.09  4.371.87  00  00.16  0.090.67  0.080.92  
00.02  1.031.60  3.653.89  5.195.86  00  0.020.65  0.251.22  0.211.43  
0.040.10  1.933.28  4.527.48  5.6311.11  00.02  0.081.05  0.361.83  0.282.09  
dimension p=400  
NO  11.90232.4  18.36241.2  16.43259.4  13.29274.3  .5617.20  1.1417.86  0.9216.69  0.6417.7 
00.08  0.71.7  4.313.36  7.495.46  00  00.44  0.050.98  0.251.21  
00.12  2.095.09  6.7413.22  9.2319.8  00  0.021.26  0.301.79  0.402.52  
0.010.37  3.7512.19  8.3027.03  9.9538.5  00  0.141.95  0.432.85  0.474.27 
6 Network analysis of colon cancer gene expression data
We apply the methods to a real case study. A gene expression dataset which can be downloaded at http://www.ebi.ac.uk/arrayexpress/ and is presented in Hinoue et al. (2012). A total of 25 patients are examined, the gene expression profiling is obtained in each one of them for a colorectal tumor sample and its healthy adjacent colonic tissue: in total there are 50 samples and 24,526 genes.
6.1 Reduction of the number of genes to be analysed and
clustering
We reduce the dimension size of the dataset by considering two filters with the objective to select highly correlated genes and differentially correlated genes. For the first filter we use a simple statistic, the adjusted squared correlation, that measures the global strength of gene connections by
(16) 
with
where is the sample correlation coefficients for pair of variables . We compare the statistic for each gene (independently for the two medical conditions) against a null distribution, , . To account for gene dependencies we approximate an empirical null distribution by simulating independent observations from a normal distribution and then finding the adjusted square correlations between simulated observations and all remaining genes in the real data. For the second filter we consider a sum of squares based statistic that uses the differences between Fisher transform healthy and tumor sample correlations:
(17) 
where and are the Fisher transform function applied to the sample correlation coefficients for tumor and healthy genes respectively. For each gene we compare the mean square Fisher transform correlation differences with the expected value under the null hypothesis, say , . The null distribution is approximated using permuted samples.
In both presented tests we use a false discovery rate correction (Benjamini and Hochberg, 1995) for the pvalues to account for multiple testing and a threshold of such that we select genes :
where are the adjusted sum of square square test pvalues for the healthy dataset, are the adjusted sum of square square test pvalues for the tumor dataset and are the adjusted differential sum of square square test pvalues using both tumor and healthy samples.
The total length of the reduced genes is which is a reduction of the of the variables. We further use a clustering procedure on the reduced dataset to estimate joint networks separately for different groups of genes. We consider the hierarchical clustering algorithm presented in Müllner (2013) since it provides a fast procedure even for very large dimensions. We use 1 minus the matrix of correlations for healthy genes as dissimilarity matrix to find 4 large clusters of size 2582, 4958, 3409, 214 genes respectively. In Figure 6 we present the heat map of the average square correlation between and within clusters. Note that the darkest squares are given in the diagonal indicating large within cluster correlation magnitudes in comparison to between correlation magnitudes. Moreover, cluster 2 is quite correlated with cluster 1 and 3.
6.2 Network estimation of cancer and healthy gene expression data
We fit four weighted fused graphical lasso models corresponding to the 4 clusters of genes defined in Section 6.1. We use significant levels and to tune the penalization parameters. For we set the underlying expected number of false positive edges (EFP) with EFP = respectively for each cluster. Then, with . In terms of we use three different levels which are specified in Table 3.
Precisely, in Table 3 we show the number of estimated edges common to the two medical conditions and the number of differential edges: healthy for edges only present in the network for healthy samples; and tumor for edges only present in the network for tumor samples. The total number of edges is much larger than the expected number of false positives which suggests certain strength in the results. Moreover, we observe that the number of differential edges is remarkably larger for healthy samples than for tumor samples for all clusters.
Cluster 1  Cluster 2  
0.001  0.01  0.05  0.001  0.01  0.05  
common  1,386  1,338  1,305  1,971  1,913  1,927 
healthy only  26  70  518  53  220  958 
tumor only  1  2  20  3  22  116 
Cluster 3  Cluster 4  
0.001  0.01  0.05  0.01  0.03  0.05  
common  2,542  2,315  2,024  124  123  116 
healthy only  104  355  1,129  0  1  2 
tumor only  5  28  111  0  0  0 
Differential network uncertainty is assessed by applying the permutation process proposed in Section 3.3 which estimates WFGL for data under the hypothesis of . We use