Multivariate Gaussian Network Structure Learning

# Multivariate Gaussian Network Structure Learning

Xingqi Du
Department of Statistics, North Carolina State University,
5109 SAS Hall, Campus Box 8203, Raleigh, North Carolina 27695, USA
xdu8@ncsu.edu

Subhashis Ghosal
Department of Statistics, North Carolina State University,
5109 SAS Hall, Campus Box 8203, Raleigh, North Carolina 27695, USA
sghosal@ncsu.edu

## 1 Abstract

We consider a graphical model where a multivariate normal vector is associated with each node of the underlying graph and estimate the graphical structure. We minimize a loss function obtained by regressing the vector at each node on those at the remaining ones under a group penalty. We show that the proposed estimator can be computed by a fast convex optimization algorithm. We show that as the sample size increases, the estimated regression coefficients and the correct graphical structure are correctly estimated with probability tending to one. By extensive simulations, we show the superiority of the proposed method over comparable procedures. We apply the technique on two real datasets. The first one is to identify gene and protein networks showing up in cancer cell lines, and the second one is to reveal the connections among different industries in the US.

## 2 Introduction

Finding structural relations in a network of random variables is a problem of significant interest in modern statistics. The intrinsic dependence between variables in a network is appropriately described by a graphical model, where two nodes are connected by an edge if and only if the two corresponding variables and are conditionally dependent given all other variables. If the joint distribution of all variables is multivariate normal with precision matrix , the conditional independence between the variable located at node and that located at node is equivalent of having zero at the th entry of . In a relatively large network of variables, generally conditional independence is abundant, meaning that in the corresponding graph edges are sparsely present. Thus in a Gaussian graphical model, the structural relation can be learned from a sparse estimate of , which can be naturally obtained by regularization method with a lasso-type penalty. Friedman et al.  and Banerjee et al.  proposed the graphical lasso (glasso) estimator by minimizing the sum of the negative log-likelihood and the -norm of , and its convergence property was studied by Rothman et al. . A closely related method was proposed by Yuan & Lin . An alternative to the graphical lasso is an approach based on regression of each variable on others, since is zero if and only if the regression coefficient of in regressing on other variables is zero. Equivalently this can be described as using a pseudo-likelihood obtained by multiplying one-dimensional conditional densities of given for all instead of using the actual likelihood obtained from joint normality of . The approach is better scalable with dimension since the optimization problem is split into several optimization problems in lower dimensions. The approach was pioneered by Meinshausen & Bühlmann , who imposed a lasso-type penalty on each regression problem to obtain sparse estimates of the regression coefficients, and showed that the correct edges are selected with probability tending to one. However, a major drawback of their approach is that the estimator of and that of may not be simultaneously zero (or non-zero), and hence may lead to logical inconsistency while selecting edges based on the estimated values. Peng et al.  proposed the Sparse PArtial Correlation Estimation (space) by taking symmetry of the precision matrix into account. The method is shown to lead to convergence and correct edge selection with high probability, but it may be computationally challenging. A weighted version of space was considered by Khare et al. , who showed that a specific choice of weights guarantees convergence of the iterative algorithm due to the convexity of the objective funtion in its arguments. Khare et al.  named their estimator the CONvex CORrelation selection methoD (concord), and proved that the estimator inherits the theoretical convergence properties of space. By extensive simulation and numerical illustrations, they showed that concord has good accuracy for reasonable sample sizes and can be computed very efficiently.

However, in many situations, such as if multiple characteristics are measured, the variables at different nodes may be multivariate. The methods described above apply only in the context when all variables are univariate. Even if the above methods are applied by treating each component of these variables as separate one-dimensional variables, ignoring their group structure may be undesirable, since all component variables refer to the same subject. For example, we may be interested in the connections among different industries in the US, and may like to see if the GDP of one industry has some effect on that of other industries. The data is available for 8 regions, and we want to take regions into consideration, since significant difference in relations may exist because of regional characteristics, which are not possible to capture using only national data. It seems that the only paper which addresses multi-dimensional variables in a graphical model context is Kolar et al. , who pursued a likelihood based approach. In this article, we propose a method based on a pseudo-likelihood obtained from multivariate regression on other variables. We formulate a multivariate analog of concord, to be called mconcord, because of the computational advantages of concord in univariate situations. Our regression based approach appears to be more scalable than the likelihood based approach of Kolar et al. . Moreover, we provide theoretical justification by studying large sample convergence properties of our proposed method, while such properties have not been established for the procedure introduced by Kolar et al. .

The paper is organized as follows. Section 3 introduces the mconcord method and describes its computational algorithm. Asymptotic properties of mconcord are presented in Section 4. Section 5 illustrates the performance of mconcord, compared with other methods mentioned above. In Section 6, the proposed method is applied to two real data sets on gene/protein profiles and GDP respectively. Proofs are presented in Section 7 and in the appendix.

## 3 Method description

### 3.1 Model and estimation procedure

Consider a graph with nodes, where at the th node there is an associated -dimensional random variable , . Let . Assume that has multivariate normal distribution with zero mean and covariance matrix , where , , , . Let the precision matrix be denoted by , which can also be written as a block-matrix . The primary interest is in the graph which describes the conditional dependence (or independence) between and given the remaining variables. We are typically interested in the situation where is relatively large and the graph is sparse, that is, most pairs and , , , are conditionally independent given all other variables. When and are conditionally independent given other variables, there will be no edge connecting and in the underlying graph; otherwise there will be an edge. Under the assumed multivariate normality of , it follows that there is an edge between and if and only if is a non-zero matrix. Therefore the problem of identifying the underlying graphical structure reduces to estimating the matrix under the sparsity constraint that most off-diagonal blocks in the grand precision matrix are zero.

Suppose that we observe independent and identically distributed (i.i.d.) samples from the graphical model, which are collectively denoted by , while stands for the sample of many -variate observations at node and stands for the vector of observations of the th component at node , , . Following the estimation strategies used in univariate Gaussian graphical models, we may propose a sparse estimator for by minimizing a loss function obtained from the conditional densities of given , , for each and a penalty term. However, since sparsity refers to off-diagonal blocks rather than individual elements, the lasso-type penalty used in univariate methods like space or concord should be replaced by a group-lasso type penalty, involving the sum of the Frobenius-norms of each off-diagonal block . A multivariate analog of the loss used in a weighted version of space is given by

 Ln(ω,σ,Y)=12p∑i=1Ki∑k=1(−logσik+wikn∥∥Yik+∑j≠iKj∑l=1ωijklσikYjl∥∥22), (1)

where , are nonnegative weights and due to the symmetry of precision matrix. Writing the quadratic term in the above expression as

 wik∥∥Yik+∑j≠iKj∑l=1ωijklσikYjl∥∥22=wik(σik)2∥∥σikYik+∑j≠iKj∑l=1ωijklYjl∥∥22,

and, as in concord choosing to make the optimization problem convex in the arguments, we can write the quadratic term in the loss function as . Applying the group penalty we finally arrive at the objective function

 12p∑i=1Ki∑k=1(−logσii+1n∥∥σikYik+∑j≠iKj∑l=1ωijklYjl∥∥22)+λ∑i

### 3.2 Algorithm

To obtain a minimizer of (2), we periodically minimize it with respect to the arguments of , , . For each fixed , , suppressing the terms not involving any element of , we may write the objective function as

 12n(Ki∑k=1∥σikYik+∑j′≠iKj′∑l=1ωij′klYj′l∥22+Kj∑l=1∥σjlYjl+∑i′≠jKi′∑k=1ωi′jlkYik∥22)+λ∥ωij∥2,

where . Without loss of generality, we assume and rewrite the expression as

 12n(Ki∑k=1∥σikYik+B1jkωij+∑j′>i,j′≠jB1j′kωij′+∑j′jB1i′lωi′j+∑i′

where and are matrices specified as follows: th columns of are , the th columns of are , and other columns are zero. This leads to the following algorithm.

Algorithm:

Initialization: For , and , set the initial values and .

Iteration: For all and , repeat the following steps until certain convergence criterion is satisfied:

Step 1: Calculate the vectors of errors for :

 rijk=^σikYik+∑j′i,j′≠jB1j′k^ωij′, rjil=^σjlYjl+∑i′>jB1i′l^ωji′+∑i′

Step 2: Regress the errors on the specified variables to obtain

 ^ωij = argmin[12n{ωTij(Ki∑k=1BT1jkB1jk+Kj∑l=1BT2ilB2il)ωij +2(Ki∑k=1rTijkB1jk+Kj∑l=1rTjilB2il)ωij}+λ∥ωij∥2],

by the proximal gradient algorithm described as follows:

Given , and , compute

 f(ω(t)ij) = +2(Ki∑k=1r(t+1)TijkB1jk+Kj∑l=1r(t+1)TjilB2il)ω(t)ij] g = 1n(Ki∑k=1(BTjkBjkω(t)ij+r(t+1)TijkBjk)+Kj∑l=1(BTilBilω(t)ij+r(t+1)TjilBil))

Set and repeat

• ,

• if , set ; else set ,

• replace by ,

until .

Step 3: For and , update to

 −YTik(∑jiB1jk^ωij)+√(YTik(∑jiB1jk^ωij))2+2nYTikYik2YTikYik.

If the total number of variables at all nodes is less than or equal to the available sample size , then the objective function is strictly convex, there is a unique solution to the minimization problem (2) and the iterative scheme converges to the global minimum (Tseng ). However, if , the objective function need not be strictly convex, and hence a unique minimum is not guaranteed. However, as in univariate concord, the algorithm converges to a global minimum. This follows by arguing as in the proof of Theorem 1 of Kolar et al.  after observing that the objective function of mconcord differs from that of concord only in two aspects — the loss function does not involve off-diagonal entries of diagonal blocks, and the penalty function has grouping, neither of which affect the structure of the concord described by Equation (33) of Kolar et al. .

## 4 Large Sample Properties

In this section, we study large sample properties of the proposed mconcord method. As in the univariate concord method, we consider the estimator obtained from the minimization problem

 12p∑i=1Ki∑k=1(−log^σik+wikn∥Yik+λ∑j≠iKj∑l=1ωijkl^σikYjl∥22)+λn∑i

with a general weight and a suitably consistent estimator of plugged in for all , , and for some suitable sequence . Existence of such an estimator is also shown.

Introduce the notation

 (3)

where and , , . Let and respectively stand for true values of and respectively. All probability and expectation statements made below are understood under the distributions obtained from the true parameter values. Let and be the expected first and second order partial derivatives of at the true parameter respectively. Also let stand for the row vector and for the matrix , where . Note that .

Let , and . We further define that , and thus there are elements in . Let . The following assumptions will be made throughout.

• The weights satisfy and and grow at most like a power of .

• There exist constants depending on the true parameter value such that the minimum and maximum eigenvalues of the true covariance satisfies .

• There exists a constant such that for all , where is a column-vector with elements , .

• There is an estimator of , satisfying for every with probability tending to 1.

The following result concludes that Condition C3 holds if the total dimension is less than a fraction of the sample size.

###### Proposition 1

Suppose that for some . Let stand for the vector of regression residuals of on . Then the estimator , where , satisfies Condition C3.

We adapt the approach in Peng et al.  to the multivariate Gaussian setting. The approach consists of first showing that if the estimator is restricted to the correct model, then it converges to the true parameter at a certain rate as the sample size increases to infinity. The next step consists of showing that with high probability no edge is falsely selected. These two conclusions combined yield the result.

###### Theorem 1

Let , and as . Then the following events hold with probability tending to :

1. there exists a solution of the restricted problem

 argminω:ωAc=0Ln(ω,^σ,Y)+λn∑i
2. (estimation consistency) for any sequence , any solution of the restricted problem (4) satisfies

###### Theorem 2

Suppose that for some , , , and as . Then with probability tending to , the solution of (4) satisfies where .

###### Theorem 3

Assume that the sequences and satisfy the conditions in Theorem 2. Then with probability tending to , there exists a minimizer of which satisfies

1. (estimation consistency) for any sequence , ,

2. (selection consistency) if for some , whenever , then , where .

## 5 Simulation

In this section, two simulation studies are conducted to examine the performance of mconcord and compare with space, concord, glasso and multi, the method of Kolar et al.  in regards of estimation accuracy and model selection. For space, concord and glasso, all components of each node are treated as separate univariate nodes, and we put an edge between two nodes as long as there is at least one non-zero entry in the corresponding submatrix.

### 5.1 Estimation Accuracy Comparison

In the first study, we evaluate the performance of each method at a series of different values of the tuning parameter . Four random networks with (44% density), (21% density), (6% density), (2% density) and (2% density) nodes are generated, and each node has a -dimensional Gaussian variable associate with it, . Based on each network, we construct a precision matrix, with non-zero blocks corresponding to edges in the network. Elements of diagonal blocks are set as random numbers from . If node and node are not connected, then the entire th and th blocks would take values zero. If node and node are connected, the th block would have elements taking values in with equal probabilities so that both strong and weak signals are included. The th block can be obtained by symmetry. Finally, we add to the precision matrix to make it positive-definite, where is the absolute value of the smallest eigenvalue plus 0.5 and is the identity matrix. Using each precision matrix, we generate 50 independent datasets consisting of (for the and networks) and (for the , and networks) i.i.d. samples. Results are given in Figure 1 to Figure 5. All figures show the number of correctly detected edges () versus the number of total detected edges (), averaged across the 30 independent datasets.

We can observe that for all methods, decreases when we increase . It can be seen that mconcord consistently outperforms its counterparts, as it detects more correct edges than the other methods for the same number of total edges detected, especially when we have large or large . In all scenarios, space, concord and glasso give very similar results. With large and , multi performs better than univariate methods.

The better performance of moncord over space, concord and glasso is largely due to the fact that mconcord is designed for multivariate network, and treating the precision matrix by different blocks is more likely to catch an edge even when the signal is comparably weak. On the contrary, the univariate approaches tend to select more unwanted edges since there is high probability that there is at least on non-zero element in the block due to randomness.

In high dimensional settings, regression based methods have simpler quadratic loss function and are computationally faster and more efficient than that of penalized likelihood methods, which optimize with respect to the entire precision matrix at once. The running time for mconcord is about one-third of that for multi. The higher numerical accuracy of regression based methods over penalized likelihood methods were often observed in the univariate setting, and hence is expected to continue in the multivariate setting as well.

### 5.2 Model Selection Comparison

Next in the second study, we compare the model selection performance of the above approaches. We fix , and conduct simulation studies for several combinations of and with different densities which vary from 41% to 1%. The precision matrices are generated using the same technique as in the first study. The tuning parameter is selected using a 5-fold cross-validation for all methods. We also studied the performance of the Bayesian Information Criterion (BIC) for model selection, but it seems that BIC does not work in the multi dimensional settings. In fact, BIC in most cases tends to choose the smallest model where no edge can be detected. Here we compare sensitivity (TPR), precision (PPV) and Matthew’s Correlation Coefficient (MCC) defined by

 TPR=TPTP+FN,PPV=TPTP+FP,MCC=TP×TN−FP×FN√(TP+FP)(TP+FN)(TN+FP% )(TN+FN),

where TP, TN, FP and FN denote true positives (number of edges correctly detected), true negatives (number of edges correctly excluded), false positives (number of edges detected but absent in the true model) and false negatives (number of edges falsely excluded). For each network, all final numbers are averaged across 30 independent datasets.

Table 1 shows that substantial gain is achieved by considering the multivariate aspect in mconcord compared with the univariate methods space and concord in regards of both sensitivity and precision, except for the case and where these two methods score slightly better TPR due to more selection of edges. Both glasso and multi select very dense models in nearly all cases, and as a consequence their TPR are higher. However, in terms of MCC which accounts for both correct and incorrect selections, mconcord performs consistently better than all the other methods.

## 6 Application

### 6.1 Gene/Protein Network Analysis

According to the NCI website https://dtp.cancer.gov/discovery_development/nci-60, “the US National Cancer Institute (NCI) 60 human tumor cell lines screening has greatly served the global cancer research community for more than 20 years. The screening method was developed in the late 1980s as an in vitro drug-discovery tool intended to supplant the use of transplantable animal tumors in anticancer drug screening. It utilizes 60 different human tumor cell lines to identify and characterize novel compounds with growth inhibition or killing of tumor cell lines, representing leukemia, melanoma and cancers of the lung, colon, brain, ovary, breast, prostate, and kidney cancers”.

We apply our method to a dataset from the well-known NCI-60 database, which consists of protein profiles (normalized reverse-phase lysate arrays for 94 antibodies) and gene profiles (normalized RNA microarray intensities from Human Genome U95 Affymetrix chip-set for more than 17000 genes). Our analysis will be restricted to a subset of 94 genes/proteins for which both types of profiles are available. These profiles are available across the same set of 60 cancer cell lines. Each gene-protein combination is represented by its Entrez ID, which is a unique identifier common for a protein and a corresponding gene that encodes this protein.

Three networks are studied: a network based on protein measurements alone, a network based on gene measurements alone, and a gene-protein multivariate network. For protein alone and gene alone networks, we use concord, and for gene-protein network, we use mconcord. The tuning parameter is selected using 5-fold cross-validation for all three networks.

From the gene-protein network 531 edges are selected. For the protein network, 798 edges are selected and for the gene network, 784 edges are selected. Protein and gene-protein networks share 313 edges, while gene and gene-protein networks share 287 edges. However, protein and gene networks only share 167 edges. Table 2 provides summary statistics for these networks.

In Table 3, we also list the top 20 most connected components for all three networks. Among them, the gene-protein network and the protein network share 11, the gene-protein network and the gene network share 10, while the protein network and the gene network share only 6.

### 6.2 GDP Network Analysis

In this analysis, we apply our method to the regional GDP data obtained from U.S. Department of Commerce website https://www.bea.gov/index.html, which contains GDP data including the following 20 different industries with labels: {enumerate*}[(1)]

utilities (uti),

construction (cons) ,

Manufacturing (manu),

Durable goods manufacturing (durable),

nondurable goods manufacturing (nondu),

transportation and warehousing (trans),

information (info),

finance and insurance (finance),

real estate and rental and leasing (real),

professional, scientific and technical services (prof),

management of companies and enterprises (manage),

educational services (edu),

health care and social assistance (health),

arts, entertainment and recreation (arts),

accommodation and food services (food),

other services except government (other) and

government (gov).

The data is available from the first quarter of 2005 to the second quarter of 2016. Data from the third quarter of 2008 to the forth quarter of 2009 is eliminated to reduce the impact of the financial crisis of that period. The data is in 8 regions in the US, including New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont), Mideast (Delaware, D.C., Maryland, New Jersey, New York and Pennsylvania), Great Lakes (Illinois, Indiana, Michigan, Ohio and Wisconsin), Plains (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota and South Dakota), Souteast (Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, Virginia and West Virginia), Southwest (Arizona, New Mexico, Oklahoma and Texas), Rocky Mountain (Colorado, Idaho, Montana, Utah and Wyoming) and Far West (Alaska, California, Hawaii, Nevada, Oregon and Washington).

We reduce correlation in the time series data by taking differences of the consecutive observations. A multivariate network consisting of 20 nodes and 8 attributes for each node is studied. After using 5-fold cross-validation to select the tuning parameter , 47 edges are detected, with density of 24.7% and average node degree of 4.7. The 5 most connected industries are retail trade, transportation, wholesale trade, accommodation and food services, and professional and technical services. The network is shown in Figure 4(a). It is obvious to see hubs comprising of wholesale trade and retail trade. This is very natural for the consumer-driven economy of the US. Both of these two nodes are connected to transportation, as both of these industries heavily rely on transporting goods. Another noticeable fact is that education is connected with government. As part of the services provided by government, it is natural that the quality as well as GDP of educational services can both be influenced by government.

The univariate network using the nationwide GDP data only is also studied for comparison using concord. For the tuning parameter , 5-fold cross-validation is applied, and 95 networks are selected, with density of 50% and average node degree of 9.5. The 5 most connected industries are administrative and waste management services, accommodation and food services, wholesale trade, professional and technical services and health care and social assistance. The network is shown in Figure 4(b). The more modest degree of connections in the multivariate network seems to be more interpretable.

## 7 Proof of the theorems

We rewrite (3) as , where .

For any subset , the Karush-Kuhn-Tucker (KKT) condition characterizes a solution of the optimization problem

 argminω:ωSc=0⎧⎪⎨⎪⎩Ln(ω,^σ,Y)+λn∑1≤i

A vector is a solution if and only if for any

 L′n,ijkl(^ω,^σ,Y)=−λn^ωijkl√∑k′,l′^ω2ijk′l′, if ∃1≤k≤Ki,1≤l≤Kj,^ωijkl≠0 |L′n,ijkl(^ω,^σ,Y)|≤λn, if ^ωijkl=0,k=1,…,Ki,l=1,…,Kj.

The following lemmas will be needed in the proof of Theorems 13. Their proofs are deferred to the Appendix.

###### Lemma 1

The following properties hold.

1. For all and , .

2. If for all and , then is convex in and is strictly convex with probability one.

3. For every index with , .

4. All entries of are bounded and bounded below. Also, there exist constants such that

 ¯σ0≤min{¯σik:1≤i≤p,1≤k≤Ki}≤max{¯σik:1≤i≤p,1≤k≤Ki}≤¯σ∞.
5. There exists constants , such that

 0<ΛLmin(¯ω,¯σ)≤λmin(¯L′′(¯ω,¯σ))≤λmax(¯L′′(¯ω,¯σ))≤ΛLmax(¯ω,¯σ)<∞.
###### Lemma 2
1. There exists a constant , such that for all and , , .

2. There exists constants , such that for any ,

 Var(L′ijkl(¯ω,¯σ,Y))≤M1,Var(L′′ijkl,ijkl(¯ω,¯σ,Y))≤M2.
3. There exists a positive constant , such that for all ,

 L′′ijkl,ijkl(¯ω,¯σ)−L′′ijkl,A−ijkl(¯ω,¯σ)[L′′A−ijkl,A−ijkl(¯ω,¯σ)]−1L′′Aijkl,ijkl(¯ω,¯σ)≥g,

where .

4. For any , for some constant .

###### Lemma 3

There exists a constant , such that for any and , , .

###### Lemma 4

Let the conditions of Theorem 2 hold. Then for any sequence ,

 max1≤i
 maxi

hold with probability tending to .

If