Multivariate Gaussian Network Structure Learning
1 Abstract
We consider a graphical model in which a multivariate normal vector is associated with each node of the underlying graph, and we estimate the graphical structure. We minimize a loss function, obtained by regressing the vector at each node on those at the remaining nodes, under a group penalty. We show that the proposed estimator can be computed by a fast convex optimization algorithm. We show that, as the sample size increases, the estimated regression coefficients converge to their true values and the correct graphical structure is recovered with probability tending to one. Extensive simulations demonstrate the superiority of the proposed method over comparable procedures. We apply the technique to two real datasets: the first identifies gene and protein networks arising in cancer cell lines, and the second reveals the connections among different industries in the US.
2 Introduction
Finding structural relations in a network of random variables is a problem of significant interest in modern statistics. The intrinsic dependence between variables in a network is appropriately described by a graphical model, where two nodes are connected by an edge if and only if the two corresponding variables are conditionally dependent given all other variables. If the joint distribution of all variables is multivariate normal with precision matrix $\Omega$, conditional independence between the variables located at two nodes is equivalent to a zero at the corresponding entry of $\Omega$. In a relatively large network of variables, conditional independence is generally abundant, meaning that edges are sparsely present in the corresponding graph. Thus in a Gaussian graphical model, the structural relation can be learned from a sparse estimate of $\Omega$, which can be naturally obtained by a regularization method with a lasso-type penalty. Friedman et al. [2] and Banerjee et al. [1] proposed the graphical lasso (glasso) estimator by minimizing the sum of the negative log-likelihood and the $\ell_1$ norm of $\Omega$, and its convergence properties were studied by Rothman et al. [7]. A closely related method was proposed by Yuan & Lin [10]. An alternative to the graphical lasso is an approach based on regressing each variable on the others, since an off-diagonal entry of $\Omega$ is zero if and only if the corresponding coefficient in the regression of one variable on all the others is zero. Equivalently, this can be described as using a pseudolikelihood obtained by multiplying the one-dimensional conditional densities of each variable given the rest, instead of the actual likelihood obtained from the joint normality of all variables. This approach scales better with dimension, since the optimization problem is split into several lower-dimensional optimization problems.
The approach was pioneered by Meinshausen & Bühlmann [5], who imposed a lasso-type penalty on each regression problem to obtain sparse estimates of the regression coefficients, and showed that the correct edges are selected with probability tending to one. However, a major drawback of their approach is that the estimators of the two symmetric entries of the precision matrix may not be simultaneously zero (or nonzero), which may lead to logical inconsistency when selecting edges based on the estimated values. Peng et al. [6] proposed the Sparse PArtial Correlation Estimation (space) method, which takes the symmetry of the precision matrix into account. The method is shown to lead to convergence and correct edge selection with high probability, but it may be computationally challenging. A weighted version of space was considered by Khare et al. [3], who showed that a specific choice of weights guarantees convergence of the iterative algorithm due to the convexity of the objective function in its arguments. Khare et al. [3] named their estimator the CONvex CORrelation selection methoD (concord), and proved that it inherits the theoretical convergence properties of space. Through extensive simulations and numerical illustrations, they showed that concord has good accuracy for reasonable sample sizes and can be computed very efficiently.
However, in many situations, such as when multiple characteristics are measured, the variables at different nodes may be multivariate. The methods described above apply only when all variables are univariate. Even if these methods are applied by treating each component of the multivariate variables as a separate one-dimensional variable, ignoring the group structure may be undesirable, since all component variables refer to the same subject. For example, we may be interested in the connections among different industries in the US, and may wish to see whether the GDP of one industry has some effect on that of other industries. The data is available for 8 regions, and we want to take regions into consideration, since significant differences in relations may exist because of regional characteristics, which cannot be captured using national data alone. The only paper which appears to address multidimensional variables in a graphical model context is Kolar et al. [4], who pursued a likelihood-based approach. In this article, we propose a method based on a pseudolikelihood obtained from multivariate regression on other variables. We formulate a multivariate analog of concord, to be called mconcord, because of the computational advantages of concord in univariate situations. Our regression-based approach appears to be more scalable than the likelihood-based approach of Kolar et al. [4]. Moreover, we provide theoretical justification by studying large-sample convergence properties of our proposed method, while such properties have not been established for the procedure introduced by Kolar et al. [4].
The paper is organized as follows. Section 3 introduces the mconcord method and describes its computational algorithm. Asymptotic properties of mconcord are presented in Section 4. Section 5 illustrates the performance of mconcord, compared with other methods mentioned above. In Section 6, the proposed method is applied to two real data sets on gene/protein profiles and GDP respectively. Proofs are presented in Section 7 and in the appendix.
3 Method description
3.1 Model and estimation procedure
Consider a graph with $p$ nodes, where the $i$th node has an associated $d_i$-dimensional random variable $X_i$, $i = 1, \ldots, p$. Assume that the combined vector $X = (X_1^\top, \ldots, X_p^\top)^\top$ has a multivariate normal distribution with zero mean and covariance matrix $\Sigma$. Let the precision matrix be denoted by $\Omega = \Sigma^{-1}$, which can also be written as a block matrix $((\Omega_{ij}))$ with blocks of dimension $d_i \times d_j$. The primary interest is in the graph describing the conditional dependence (or independence) between $X_i$ and $X_j$ given the remaining variables. We are typically interested in the situation where $p$ is relatively large and the graph is sparse, that is, most pairs $X_i$ and $X_j$, $i \ne j$, are conditionally independent given all other variables. When $X_i$ and $X_j$ are conditionally independent given the other variables, there is no edge connecting nodes $i$ and $j$ in the underlying graph; otherwise there is an edge. Under the assumed multivariate normality of $X$, it follows that there is an edge between nodes $i$ and $j$ if and only if $\Omega_{ij}$ is a nonzero matrix. Therefore the problem of identifying the underlying graphical structure reduces to estimating $\Omega$ under the sparsity constraint that most off-diagonal blocks of the grand precision matrix are zero.
Suppose that we observe $n$ independent and identically distributed (i.i.d.) samples from the graphical model, where for each node the $n$ observations of each component are collected into a vector. Following the estimation strategies used in univariate Gaussian graphical models, we may propose a sparse estimator of $\Omega$ by minimizing a loss function obtained from the conditional densities of each $X_i$ given the remaining variables, together with a penalty term. However, since sparsity refers to off-diagonal blocks rather than individual elements, the lasso-type penalty used in univariate methods like space or concord should be replaced by a group-lasso type penalty involving the sum of the Frobenius norms of the off-diagonal blocks $\Omega_{ij}$. A multivariate analog of the loss used in a weighted version of space is given by
(1) 
where the $w_{ij}$ are nonnegative weights satisfying $w_{ij} = w_{ji}$, reflecting the symmetry of the precision matrix. Writing the quadratic term in the above expression as
and, as in concord, choosing the weights so as to make the optimization problem convex in its arguments, we can rewrite the quadratic term of the loss function accordingly. Applying the group penalty, we finally arrive at the objective function
(2) 
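To make the group penalty in (2) concrete, the following sketch evaluates the sum of Frobenius norms of the off-diagonal blocks of a candidate precision matrix. The function name, the block layout via `dims`, and the use of numpy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def block_frobenius_penalty(Omega, dims, lam):
    """Group penalty: lam times the sum of Frobenius norms of the
    off-diagonal blocks of Omega, where block (i, j) has shape
    (dims[i], dims[j]).  A sketch of the penalty term in (2)."""
    offsets = np.concatenate(([0], np.cumsum(dims)))
    p = len(dims)
    penalty = 0.0
    for i in range(p):
        for j in range(i + 1, p):  # each unordered pair counted once
            block = Omega[offsets[i]:offsets[i + 1],
                          offsets[j]:offsets[j + 1]]
            penalty += np.linalg.norm(block, 'fro')
    return lam * penalty
```

Setting an entire off-diagonal block to zero removes its contribution, which is exactly how the group penalty encourages edge-wise (rather than entry-wise) sparsity.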
3.2 Algorithm
To obtain a minimizer of (2), we cyclically minimize it with respect to the blocks of arguments, one block at a time. For each fixed block, suppressing the terms not involving any of its elements, we may write the objective function as
where . Without loss of generality, we assume and rewrite the expression as
where the two matrices are specified so that the relevant columns contain the corresponding regressor variables and all other columns are zero. This leads to the following algorithm.
Algorithm:
Initialization: For , and , set the initial values and .
Iteration: For all and , repeat the following steps until certain convergence criterion is satisfied:

Step 1: Calculate the vectors of errors for :

Step 2: Regress the errors on the specified variables to obtain
by the proximal gradient algorithm described as follows:
Given , and , compute
Set and repeat

,

if , set ; else set ,

replace by ,
until .


Step 3: For and , update to
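The key operation inside the proximal gradient updates of Step 2 is the group (block) soft-thresholding operator, the proximal map of a Frobenius-norm penalty on a block. The sketch below is a generic illustration of that operator, not the paper's code; the function name is ours.

```python
import numpy as np

def block_soft_threshold(B, t):
    """Proximal operator of t * ||.||_F applied to a block B: shrink the
    whole block toward zero, returning exactly zero when its Frobenius
    norm is at most t.  This is the group analogue of the scalar
    soft-thresholding used in univariate concord."""
    nrm = np.linalg.norm(B, 'fro')
    if nrm <= t:
        return np.zeros_like(B)
    return (1.0 - t / nrm) * B
```

Because the block is shrunk as a whole, all entries of an off-diagonal block are set to zero simultaneously, so the estimated edge set is read off unambiguously.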
If the total number of variables at all nodes is less than or equal to the available sample size, then the objective function is strictly convex, there is a unique solution to the minimization problem (2), and the iterative scheme converges to the global minimum (Tseng [8]). If the total number of variables exceeds the sample size, the objective function need not be strictly convex, and hence a unique minimum is not guaranteed. Nevertheless, as in univariate concord, the algorithm converges to a global minimum. This follows by arguing as in the proof of Theorem 1 of Khare et al. [3], after observing that the objective function of mconcord differs from that of concord in only two aspects: the loss function does not involve off-diagonal entries of the diagonal blocks, and the penalty function has grouping. Neither of these affects the structure of the algorithm described by Equation (33) of Khare et al. [3].
4 Large Sample Properties
In this section, we study large sample properties of the proposed mconcord method. As in the univariate concord method, we consider the estimator obtained from the minimization problem
with a general weight and a suitably consistent estimator of plugged in for all , , and for some suitable sequence . Existence of such an estimator is also shown.
Introduce the notation
(3) 
where and , , . Let and respectively stand for true values of and respectively. All probability and expectation statements made below are understood under the distributions obtained from the true parameter values. Let and be the expected first and second order partial derivatives of at the true parameter respectively. Also let stand for the row vector and for the matrix , where . Note that .
Let , and . We further define that , and thus there are elements in . Let . The following assumptions will be made throughout.

The weights satisfy and and grow at most like a power of .

There exist constants depending on the true parameter value such that the minimum and maximum eigenvalues of the true covariance matrix are bounded below and above by them, respectively.

There exists a constant such that for all , where is a columnvector with elements , .

There is an estimator of , satisfying for every with probability tending to 1.
The following result concludes that Condition C3 holds if the total dimension is less than a fraction of the sample size.
Proposition 1
Suppose that for some . Let stand for the vector of regression residuals of on . Then the estimator , where , satisfies Condition C3.
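The residual-based estimator in Proposition 1 can be sketched as follows: regress one column of the data matrix on all remaining columns and average the squared residuals. The function name is ours, and the exact normalization used in the paper is not recoverable from this text; dividing by the sample size $n$ is one plausible choice.

```python
import numpy as np

def residual_variance(X, j):
    """Sketch of the estimator in Proposition 1: ordinary least squares
    regression of the j-th column of the n-by-d data matrix X on the
    remaining columns, returning the average squared residual.
    Normalization by n is an assumption, not the paper's choice."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid) / X.shape[0]
```

When the target column lies exactly in the span of the others, the residuals vanish and the estimator returns (numerically) zero.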
We adapt the approach in Peng et al. [6] to the multivariate Gaussian setting. The approach consists of first showing that if the estimator is restricted to the correct model, then it converges to the true parameter at a certain rate as the sample size increases to infinity. The next step consists of showing that with high probability no edge is falsely selected. These two conclusions combined yield the result.
Theorem 1
Let , and as . Then the following events hold with probability tending to one:

there exists a solution of the restricted problem
(4) 
(estimation consistency) for any sequence , any solution of the restricted problem (4) satisfies
Theorem 2
Suppose that for some , , , and as . Then with probability tending to one, the solution of (4) satisfies where .
Theorem 3
Assume that the sequences and satisfy the conditions in Theorem 2. Then with probability tending to one, there exists a minimizer of which satisfies

(estimation consistency) for any sequence , ,

(selection consistency) if for some , whenever , then , where .
5 Simulation
In this section, two simulation studies are conducted to examine the performance of mconcord and compare it with space, concord, glasso, and multi (the method of Kolar et al. [4]) with regard to estimation accuracy and model selection. For space, concord, and glasso, all components of each node are treated as separate univariate nodes, and we put an edge between two nodes as long as there is at least one nonzero entry in the corresponding submatrix.
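The edge-recovery rule just described (for the univariate competitors) can be sketched as follows; the function name, the `dims` layout, and the tolerance are our illustrative assumptions.

```python
import numpy as np

def detected_edges(Omega_hat, dims, tol=1e-8):
    """Declare an edge between nodes i and j when the corresponding
    off-diagonal block of the estimated precision matrix has at least
    one entry exceeding tol in absolute value."""
    offsets = np.concatenate(([0], np.cumsum(dims)))
    p = len(dims)
    edges = []
    for i in range(p):
        for j in range(i + 1, p):
            block = Omega_hat[offsets[i]:offsets[i + 1],
                              offsets[j]:offsets[j + 1]]
            if np.max(np.abs(block)) > tol:
                edges.append((i, j))
    return edges
```

As the text notes, a single spurious nonzero entry anywhere in a block is enough to trigger an edge under this rule, which is why the univariate methods tend to over-select.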
5.1 Estimation Accuracy Comparison
In the first study, we evaluate the performance of each method over a range of values of the tuning parameter. Five random networks, with densities 44%, 21%, 6%, 2%, and 2%, are generated, and each node has a multivariate Gaussian variable associated with it. Based on each network, we construct a precision matrix with nonzero blocks corresponding to edges in the network. Elements of the diagonal blocks are set to random numbers. If nodes $i$ and $j$ are not connected, the corresponding off-diagonal blocks are entirely zero. If nodes $i$ and $j$ are connected, the $(i,j)$th block has elements taking positive and negative values with equal probabilities, so that both strong and weak signals are included; the $(j,i)$th block is obtained by symmetry. Finally, we add a multiple of the identity matrix to the precision matrix to make it positive definite, where the multiplier is the absolute value of the smallest eigenvalue plus 0.5. Using each precision matrix, we generate 50 independent datasets of i.i.d. samples, with sample sizes chosen according to the network size. Results are given in Figures 1 to 5, which show the number of correctly detected edges versus the total number of detected edges, averaged across these datasets.
We observe that, for all methods, the number of detected edges decreases as the tuning parameter increases. mconcord consistently outperforms its counterparts, detecting more correct edges than the other methods for the same total number of detected edges, especially for larger networks or higher dimensions. In all scenarios, space, concord, and glasso give very similar results. For larger networks and dimensions, multi performs better than the univariate methods.
The better performance of mconcord over space, concord, and glasso is largely due to the fact that mconcord is designed for multivariate networks: treating the precision matrix in blocks makes it more likely to catch an edge even when the signal is comparatively weak. In contrast, the univariate approaches tend to select more unwanted edges, since there is a high probability that at least one element of a block is estimated as nonzero purely by randomness.
In high-dimensional settings, regression-based methods have a simpler quadratic loss function and are computationally faster and more efficient than penalized likelihood methods, which optimize with respect to the entire precision matrix at once. The running time of mconcord is about one-third of that of multi. The higher numerical accuracy of regression-based methods over penalized likelihood methods was often observed in the univariate setting, and hence is expected to carry over to the multivariate setting as well.
5.2 Model Selection Comparison
In the second study, we compare the model selection performance of the above approaches. We fix the dimension and conduct simulation studies for several combinations of network size and sample size, with densities varying from 41% to 1%. The precision matrices are generated using the same technique as in the first study. The tuning parameter is selected by 5-fold cross-validation for all methods. We also studied the performance of the Bayesian Information Criterion (BIC) for model selection, but BIC does not appear to work in the multivariate setting; in most cases it tends to choose the smallest model, in which no edge is detected. We compare sensitivity (TPR), precision (PPV), and Matthews correlation coefficient (MCC), defined by
where TP, TN, FP, and FN denote true positives (the number of edges correctly detected), true negatives (the number of non-edges correctly excluded), false positives (the number of edges detected but absent in the true model), and false negatives (the number of edges falsely excluded), respectively. For each network, all final numbers are averaged across 30 independent datasets.
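The three selection metrics can be computed directly from the four counts; the following sketch uses their standard definitions (the function name is ours).

```python
from math import sqrt

def selection_metrics(tp, tn, fp, fn):
    """Sensitivity (TPR), precision (PPV), and Matthews correlation
    coefficient (MCC) from edge-selection counts: TP/FP are edges
    detected correctly/incorrectly, TN/FN are non-edges/edges
    excluded correctly/incorrectly."""
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return tpr, ppv, mcc
```

Unlike TPR alone, which can be inflated simply by selecting many edges, MCC penalizes false positives as well, which is why it is the headline comparison in Table 1.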
mconcord  space  concord  glasso  multi  

(i)  50  ()  58(34)  70(35)  85(42)  378(157)  217(89) 
TPR(PPV)  0.19(0.57)  0.20(0.50)  0.24(0.49)  0.89(0.42)  0.50(0.41)  
MCC  0.14  0.08  0.09  0.04  0.01  
(ii)  50  ()  105(57)  47(10)  46(9)  805(105)  612(69) 
TPR(PPV)  0.42(0.54)  0.07(0.21)  0.07(0.20)  0.77(0.13)  0.50(0.11)  
MCC  0.42  0.06  0.05  0.08  0.01  
100  ()  191(64)  286(58)  280(59)  923(122)  525(69)  
TPR(PPV)  0.47(0.34)  0.42(0.20)  0.43(0.21)  0.89(0.13)  0.50(0.13)  
MCC  0.30  0.16  0.17  0.11  0.05  
(iii)  100  ()  248(87)  202(40)  267(51)  2389(274)  2501(211) 
TPR(PPV)  0.21(0.35)  0.10(0.20)  0.12(0.19)  0.65(0.11)  0.50(0.08)  
MCC  0.22  0.08  0.09  0.10  0.00  
200  ()  613(200)  814(170)  1005(196)  1066(204)  2380(201)  
TPR(PPV)  0.48(0.33)  0.41(0.21)  0.47(0.20)  0.49(0.19)  0.48(0.08)  
MCC  0.33  0.20  0.20  0.20  0.00  
(iv)  100  ()  481(112)  84(12)  133(18)  5657(306)  4797(240) 
TPR(PPV)  0.18(0.23)  0.02(0.14)  0.03(0.14)  0.50(0.05)  0.39(0.05)  
MCC  0.18  0.04  0.05  0.08  0.06  
200  ()  1250(300)  892(143)  976(151)  6357(426)  4392(226)  
TPR(PPV)  0.49(0.24)  0.23(0.16)  0.24(0.15)  0.69(0.07)  0.37(0.05)  
MCC  0.31  0.16  0.16  0.14  0.06  
(v)  100  ()  764(129)  31(3)  54(6)  14283(326)  10229(259) 
TPR(PPV)  0.16(0.17)  0.00(0.10)  0.00(0.11)  0.42(0.02)  0.33(0.03)  
MCC  0.16  0.02  0.03  0.06  0.06  
200  ()  2063(378)  396(62)  404(53)  16092(480)  9648(240)  
TPR(PPV)  0.48(0.18)  0.08(0.16)  0.07(0.13)  0.61(0.03)  0.31(0.02)  
MCC  0.29  0.11  0.09  0.10  0.06 
Table 1 shows that substantial gains are achieved by accounting for the multivariate structure in mconcord, compared with the univariate methods space and concord, in terms of both sensitivity and precision, except for one setting in which these two methods score slightly better TPR by selecting more edges. Both glasso and multi select very dense models in nearly all cases, and as a consequence their TPRs are higher. However, in terms of MCC, which accounts for both correct and incorrect selections, mconcord performs consistently better than all the other methods.
6 Application
6.1 Gene/Protein Network Analysis
According to the NCI website https://dtp.cancer.gov/discovery_development/nci60, “the US National Cancer Institute (NCI) 60 human tumor cell lines screening has greatly served the global cancer research community for more than 20 years. The screening method was developed in the late 1980s as an in vitro drugdiscovery tool intended to supplant the use of transplantable animal tumors in anticancer drug screening. It utilizes 60 different human tumor cell lines to identify and characterize novel compounds with growth inhibition or killing of tumor cell lines, representing leukemia, melanoma and cancers of the lung, colon, brain, ovary, breast, prostate, and kidney cancers”.
We apply our method to a dataset from the wellknown NCI60 database, which consists of protein profiles (normalized reversephase lysate arrays for 94 antibodies) and gene profiles (normalized RNA microarray intensities from Human Genome U95 Affymetrix chipset for more than 17000 genes). Our analysis will be restricted to a subset of 94 genes/proteins for which both types of profiles are available. These profiles are available across the same set of 60 cancer cell lines. Each geneprotein combination is represented by its Entrez ID, which is a unique identifier common for a protein and a corresponding gene that encodes this protein.
Three networks are studied: a network based on protein measurements alone, a network based on gene measurements alone, and a gene-protein multivariate network. For the protein-only and gene-only networks we use concord, and for the gene-protein network we use mconcord. The tuning parameter is selected by 5-fold cross-validation for all three networks.
From the geneprotein network 531 edges are selected. For the protein network, 798 edges are selected and for the gene network, 784 edges are selected. Protein and geneprotein networks share 313 edges, while gene and geneprotein networks share 287 edges. However, protein and gene networks only share 167 edges. Table 2 provides summary statistics for these networks.
Protein network  Gene network  Geneprotein network  
Number of edges  798  784  531 
Density (%)  18  18  12 
Maximum degree  24  24  20 
Average node degree  16.98  16.68  11.30 
In Table 3, we also list the 20 most highly connected nodes in each of the three networks. Among them, the gene-protein network and the protein network share 11, the gene-protein network and the gene network share 10, while the protein network and the gene network share only 6.
Geneprotein network  Protein network  Gene network  

Entrez ID  Gene name  Entrez ID  Gene name  Entrez ID  Gene name 
302  ANXA2  4179  CD46  2064  ERBB2 
7280  TUBB2A  983  CDK1  5605  MAP2K2 
1398  CRK  3265  HRAS  307  ANXA4 
4255  MGMK  3716  JAK1  5578  PRKCA 
5578  PRKCA  10270  AKAP8  1173  AP2M1 
5925  RB1  354  KLK3  1828  DSG1 
9564  BCAR1  1019  CDK4  4179  CD46 
307  ANXA4  6776  STAT5A  9961  MVP 
354  KLK3  9564  BCAR1  1000  CDH2 
2064  ERBB2  1398  CRK  2932  GSK3B 
4163  MCC  3667  IRS1  4176  MCM7 
6778  STAT6  4830  NME1  4436  MSH2 
7299  TYR  307  ANXA4  5970  RELA 
1173  AP2M1  1173  AP2M1  999  CDH1 
983  CDK1  2017  CTTN  1001  CDH3 
1001  CDH3  4255  MGMT  1398  CRK 
1499  CTNNB1  1001  CDH3  2335  FN1 
3716  JAK1  1020  CDK5  5925  RB1 
4179  CD46  3308  HSPA4  7280  TUBB2A 
4830  NME1  4176  MCM7  7299  TYR 
6.2 GDP Network Analysis
In this analysis, we apply our method to the regional GDP data obtained from the U.S. Department of Commerce website https://www.bea.gov/index.html, which contains GDP data for the following 20 industries, with labels: (1) utilities (uti), (2) construction (cons), (3) manufacturing (manu), (4) durable goods manufacturing (durable), (5) nondurable goods manufacturing (nondu), (6) wholesale trade (wholesale), (7) retail trade (retail), (8) transportation and warehousing (trans), (9) information (info), (10) finance and insurance (finance), (11) real estate and rental and leasing (real), (12) professional, scientific and technical services (prof), (13) management of companies and enterprises (manage), (14) administrative and waste management services (admin), (15) educational services (edu), (16) health care and social assistance (health), (17) arts, entertainment and recreation (arts), (18) accommodation and food services (food), (19) other services except government (other), and (20) government (gov).
The data is available from the first quarter of 2005 to the second quarter of 2016. Data from the third quarter of 2008 to the fourth quarter of 2009 is excluded to reduce the impact of the financial crisis of that period. The data covers 8 regions of the US: New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont), Mideast (Delaware, D.C., Maryland, New Jersey, New York and Pennsylvania), Great Lakes (Illinois, Indiana, Michigan, Ohio and Wisconsin), Plains (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota and South Dakota), Southeast (Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, Virginia and West Virginia), Southwest (Arizona, New Mexico, Oklahoma and Texas), Rocky Mountain (Colorado, Idaho, Montana, Utah and Wyoming), and Far West (Alaska, California, Hawaii, Nevada, Oregon and Washington).
We reduce serial correlation in the time series by taking differences of consecutive observations. A multivariate network consisting of 20 nodes with 8 attributes each is studied. After selecting the tuning parameter by 5-fold cross-validation, 47 edges are detected, with a density of 24.7% and an average node degree of 4.7. The 5 most connected industries are retail trade, transportation, wholesale trade, accommodation and food services, and professional and technical services. The network is shown in Figure 4(a). Hubs comprising wholesale trade and retail trade are clearly visible, which is natural for the consumer-driven economy of the US. Both of these nodes are connected to transportation, as both industries rely heavily on transporting goods. Another noticeable feature is that education is connected with government: since education is partly a service provided by government, it is natural that both the quality and the GDP of educational services are influenced by government.
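The differencing preprocessing mentioned above can be sketched in a few lines; the function name is ours, and this is a minimal illustration rather than the paper's pipeline.

```python
def quarterly_differences(series):
    """First differences of consecutive quarterly observations, used
    here to reduce serial correlation in a GDP series before fitting
    the network."""
    return [b - a for a, b in zip(series, series[1:])]
```

The differenced series has one fewer observation than the original; this is applied separately to each industry-region series.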
For comparison, the univariate network using the nationwide GDP data alone is also studied using concord. With the tuning parameter again chosen by 5-fold cross-validation, 95 edges are selected, with a density of 50% and an average node degree of 9.5. The 5 most connected industries are administrative and waste management services, accommodation and food services, wholesale trade, professional and technical services, and health care and social assistance. The network is shown in Figure 4(b). The more modest degree of connectivity in the multivariate network seems more interpretable.
7 Proof of the theorems
We rewrite (3) as , where .
For any subset , the KarushKuhnTucker (KKT) condition characterizes a solution of the optimization problem
A vector is a solution if and only if for any
The following lemmas will be needed in the proof of Theorems 1–3. Their proofs are deferred to the Appendix.
Lemma 1
The following properties hold.

For all and , .

If for all and , then is convex in and is strictly convex with probability one.

For every index with , .

All entries of are bounded and bounded below. Also, there exist constants such that

There exist constants such that
Lemma 2

There exists a constant , such that for all and , , .

There exist constants such that for any ,

There exists a positive constant , such that for all ,
where .

For any , for some constant .
Lemma 3
There exists a constant , such that for any and , , .
Lemma 4
Lemma 5
If