# Clustering Gaussian Graphical Models

###### Abstract

We derive an efficient method to perform clustering of nodes in Gaussian graphical models directly from sample data. Nodes are clustered based on the similarity of their network neighborhoods, with edge weights defined by partial correlations. In the limited-data scenario, where the covariance matrix would be rank-deficient, we are able to make use of matrix factors, and never need to estimate the actual covariance or precision matrix. We demonstrate the method on functional MRI data from the Human Connectome Project. A matlab implementation of the algorithm is provided.

###### keywords:

Gaussian Graphical Models, Unsupervised Learning, Graph Embedding, Conditional correlation## 1 Introduction

Gaussian graphical models whittaker_graphical_2009 are a popular and principled approach to describing networks, and are directly related to variable prediction via linear regression pourahmadi_covariance_2011 . The focus is often on graphical model edges described by partial correlations which are zero, identifying pairs of nodes which can be treated as conditionally independent baba_partial_2004 . For example, the graphical LASSO friedman_sparse_2008 imposes a sparse regularization penalty on the precision matrix estimate, seeking a network which trades off predictive accuracy for sparsity. This provides a network which interpretable and efficient to use. However it is not clear that sparse solutions generalize better to new data than do dense solutions williams_back_2019 .

Clustering of networks, on the other hand, is usually approached from a very different perspective. A graph is formed via some simple relationship such as affinity or univariate correlation. Then this network is used as a starting point for graph-theoretical approaches to partitioning. A dominant method in this category is spectral clustering von_luxburg_tutorial_2007 , commonly described as a continuous relaxation of the normalized cut algorithm von_luxburg_tutorial_2007 for partitioning graphs. In addition, other interpretations have been noted for spectral clustering such as random walks and smooth embeddings meila_spectral_2015 , or sometimes involving minor variants of the algorithm such as by normalizing the Laplacian differently meila_learning_2002 . However when the true underlying graph is based on correlations between samples of variables, the principled statistical basis for relating nodes in the network is obfuscated under multiple degrees of approximation; first the correlation estimate is approximated with a binary (or approximately binary) edge; second the partitioning of this binary graph is approximated with a continuous relaxation. Sparse graph methods impose yet another approximation, as residual accuracy is traded off versus sparsity.

In this paper we show how clustering of partial correlations can be efficiently performed directly from the data, without need for computing the full graphical model or forming a sparse approximation. We provide a partial-correlation-based estimate for distance between nodes. Then we provide an efficient algorithm for computing these distances directly from the data. Finally we demonstrate the clustering on functional MRI data from the Human Connectome Project, a case where the network is far too large to fit in memory. Matlab code to implement the method is provided in an appendix.

## 2 Method

We model the data signal at the th node as the zero-mean Gaussian random variable . The partial correlation between and , is the Pearson correlation between the residuals after regressing out all other nodes except and from both. Rather than directly performing this computationally-intensive calculation on each pair of variables using data, there are two general categories of methods used. The first category of methods exploits the relationship with the precision matrix, the inverse of the sample covariance matrix. Note, of course, that if the dense network is too large to fit into memory, the so too will be the covariance and precision matrices. The second category of methods, which we will consider here, exploits the relationship between the partial correlation and the regression coefficients for the neighborhood selection problem meinshausen_high-dimensional_2006 . These regression coefficients are defined as the solutions to the linear system

(1) |

where is the th variable and is the residual. From these we can estimate as pourahmadi_covariance_2011 ,

(2) |

using the residual variances . A common alternative formulation exploits the symmetry of the partial correlation (i.e., that by definition) and use the geometric mean to cancel the residual variances as in schafer_shrinkage_2005

(3) |

This also has the advantage of enforcing symmetry in sample estimates. If the signs of and differ, is typically set to zero.

We describe the sample version of the regression problem of Eq. (1) with the linear system

(4) |

where is the matrix with the th column set to zeros; the vector is the estimates of the regression coefficients, where is the estimate of (defining ); and is the vector of samples of the residual . The sample residual variances are then

(5) |

which we form into a vector with . Following dillon_computation_2019 , we write the sample version of Eq. (2) as

(6) |

where is a diagonal matrix with , and we have formed with as columns. contains our sample-based estimates of the partial correlations, with describing the partial correlation between nodes and . Again, we can avoid calculating the residual variances using the method of Eq. (3) as follows,

(7) |

using the Hadamard product and element-wise exponential , and where the sign function is taken element-wise. We can in turn compute from the resolution matrix using a regularized generalized inverse , giving (dillon_computation_2019 ),

(8) |

where is a diagonal matrix with on the diagonal, and is with its diagonal set to zero. Then, combining Eqs. (9) and (8) we get

(9) |

In cases where and are too large to fit in memory, we can compute columns as needed,

(10) | ||||

(11) |

where we have defined and as the th column of the identity. This requires that we calculate and store a regularized pseudoinverse of our data matrix, which is of the same size as our original data matrix. Additionally we can pre-compute the diagonal of and the vector. The diagonal of is equal to the sum of its eigenvectors squared, so can be computed as a by-product of truncated-SVD regularization. In extremely-large-scale situations, the diagonal can be computed using even more efficient techniques maccarthy_efficient_2011 ; trampert_resolution_2013 . The vector can then be computed via

(12) | ||||

(13) |

Alternatively, combining Eqs. (7) and (8) we get the geometric mean formulation,

(14) |

where is the vector with elements . To enforce the sign test, we set equal to zero when . For this version of the estimate we compute columns on the fly as,

(15) | ||||

(16) |

where we have defined as the vector with elements , and . An advantage of this version, in addition to enforcing symmetry, is that the scaling only requires the diagonal of , not the residual . A disadvantage is potentially the need to enforce a sign test. Generally we can write either Eq. (16) or Eq. (11) as

(17) |

for appropriate definitions of and .

### 2.1 Distance Calculation

Now we will demonstrate how to compute distances using partial correlation estimates. In dillon_regularized_2017 we used distances between pairs of columns of as a form of connectivity-based distance in a network clustering algorithm, and demonstrated how the distance could be computed efficiently even for large datasets where could not fit in memory. The basic idea was to use the factors and in calculations, rather than precomputing itself, as follows,

(18) | ||||

(19) |

In dillon_spectral_2018 it was noted that if SVD-truncation was used as the regularization method, then the resolution matrix can be written as

(20) | ||||

(21) |

where is a truncated version of the identity, with zeros for the columns corresponding to discarded singular values. Then the resolution distance can be written as

(22) | ||||

(23) |

where is the th row of . So is the th row of the matrix formed by the first singular vectors of . As these singular vectors of are the same as the singular vectors of the covariance matrix, this distance is essentially a spectral embedding of the graph formed by correlations between samples.

We can similarly define conditional forms of distances between columns of , which may be implemented as

(24) | ||||

(25) |

wherein we can see the similarity to the resolution form of the spectral clustering distance in Eq. (23).

### 2.2 Partial correlation clustering

Next we demonstrate the use of the partial correlation distance in clustering, by similarly extending -means clustering. A basic -means clustering algorithm for clustering the columns of some matrix is given in Algorithm 1.

For a small dataset we may form the matrix and directly apply this algorithm to its columns. However, for large datasets we cannot fit in memory so must maintain factors as in Eq. (25) In particular, the squared distances between a given center and a column of can be calculated as

(26) |

Since we are only concerned with the class index of the cluster with the minimum distance to each column, we do not need to compute the term, so we can compute the labels as

(27) |

By forming a matrix with weighted cluster centers as columns, and a weighted data matrix with as columns, we can efficiently compute the first part of the cross term for all and as , a by matrix. The second part of the cross term can be computed by (element-wise) multiplying each row of by a vector who’s th element is .

Similar tactics can be used to efficiently compute the mean over columns in each cluster, by noting that the mean over a set of columns with as the set of column indices can be written as

(28) |

So in general, we see that clustering of can be implemented whenever the dataset itself is small enough to implement -means clustering, taking roughly double the storage space and computational resources (rather than the number of variables squared). In the Appendix we provide matlab matlab_version_2010 code for an efficient implementation of the clustering algorithm, taking advantage of fast broadcast routines for matrix computations.

## 3 Results

Fig. 1 gives a demonstration of the algorithm applied to a functional Magnetic Resonance Imaging (fMRI) dataset for a subject from the Human Connectome Project van_essen_human_2012 , compared to other clustering approaches. The data was preprocessed by applying spatial smoothing with a 5mm kernel, and SVD-truncation-based regularization to achieve a cutoff of 30 percent of singular values. The data contains 96854 time series with 1200 time samples each, resulting in a data matrix of size . Each time-series describes the blood-oxygen-level dependent (BOLD) activity in one voxel of the brain, so a network formed by comparing these signals provides an estimate of the functional connectivity of the brain. A clustering of this dense network would produce clusters of nodes with similar connectivity, estimating the modularity of function in the brain. The dense network describing the relationships between all pairs of voxels, however, would require a dense matrix of size , which is far too large to fit in computer memory. However the limited rank of this matrix means we only need store the pseudoinverse. The clustering algorithm took 9 seconds on a desktop computer.

We see that clustering of produces much more modular segmentation of the regions of the brain, particularly compared to the conceptually-similar approach of clustering the network of univariate correlations of the data instead.

Next we considered the effect of the difference between Eqs. (23) and (25) for this dataset. In Fig. 2 we plot the percentage of nodes which fall into different clusters using the two different distance calculations, as increasing amounts of spatial smoothing are applied.

Spatial smoothing creates correlations between neighboring variables. Therefore, the removal of the th and th variables will have less effect, as for example, the same information is increasingly included in the th and th variables due to smoothing. Interestingly, we see that even with no spatial smoothing applied, clustering of partial correlations for this data agrees within 94 percent of the simpler spectral estimate.

## 4 Discussion

We showed how partial correlations can be clustered by providing an appropriate correction to a simple spectral distance calculation. We also showed that this could be computed efficiently by utilizing the data matrix and its pseudoinverse. As the latter may be computed by computing singular vectors of the data matrix, the proposed method can be performed with comparable efficiency to spectral clustering. We also demonstrated the approach on a functional brain imaging dataset, where we found a strong agreement with a spectral distance calculation. This suggests another perspective on the success of spectral clustering methods–as an approximation to clustering of a Gaussian graphical model for the data. A benefit of the proposed approach and perspective is its principled basis; rather than perform a series of approximations to convert a covariance selection problem into a intractable graph parcellation problem which must be relaxed, we can directly cluster variables based on their partial correlations.

The drawbacks of the proposed approach include the moderately increased computational effort, requiring double the storage, and approximately two matrix-vector products instead of one. Still, this is only a factor of two. The handling of negative values is another difficulty of the method, as will all similar statistical methods. Though our approach to address it has the advantage of leveraging an approach used in bioinformatic analyses for several years.

Finally, there are a number of potential extensions and directions for research based on this result. Instead of a simple -means stage, we might apply a more sophisticated clustering algorithm such as fuzzy or hierarchical clustering. Given the principled statistical basis for our distance calculation, we might also utilize more sophisticated statistical metrics. For example, the penalized regression may be viewed as a maximum a posteriori estimate; we might extend this idea to compute a kind of Bayesian clustering based on computing moments of the distribution of each variable, rather than a point estimate.

## 5 Appendix: matlab implementation of clustering

In this appendix we demonstrate how the clustering of a dense matrix can be performed efficiently when the data matrix is of limited rank. We make use of the more efficient form of the regularized pseudoinverse for an underdetermined matrix, as well as efficient broadcast methods where possible.

A = randn(500,100000); % data matrix lambda = 1; % regularization parameter k = 100; % number of clusters [rows_A,cols_A] = size(A); % standardize data columns A = bsxfun(@minus,A,mean(A)); A = bsxfun(@times,A,1./sum(A.^2).^.5); % compute diagonal of R via sum of squared eigenvectors [uA,sA,vA] = svd(A,’econ’); r = sum(vA(:,1:rank(A)).^2,2)’; % compute pseudoinverse efficiently iA_lambda = A’*inv(A*A’-lambda*eye(rows_A)); % compute scaling vectors (symmetric version) s = 1./(1-r(:)); z = abs(s(:)).^.5; zeta = sign(s).*abs(s(:)).^.5; Az = bsxfun(@times,A,z(:)’); % randomly assign columns to clusters initially c = ceil(rand(cols_A,1)*k); n_change = inf while (n_change>0) M = sparse(1:cols_A,c,1,cols_A,k,cols_A); % cols of M are masks of clusters M = bsxfun(@times, M, 1./sum(M,1)); % now M is averaging operator P_c_1 = iA_lambda*(Az*M); % first part of cluster center calc P_c_2 = bsxfun(@times,M,r.*zeta); % second park (peak removal) P_c = bsxfun(@times,A_c_1-A_c_2,z(:)); % cluster centers Pz2_c = sum(P_c.^2,1); % squared term from distance Cz = bsxfun(@times,P_c,z(:)); % weighted cluster centers D_ct1 = (Cz’*iA_lambda)*Az; % first part of cross-term D_ct2 = bsxfun(@times,Cz’,r’.*zeta(:)’); % second part of cross term D_ct = D_ct1-D_ct2; % cross-term Dz = bsxfun(@minus,D_ct,.5*Pz2_c’); % dist metric (sans unnecessary term) c_old = c; [D_max,c(:)] = max(Dz,[],1); % c is arg of max n_change = sum(c~=c_old); disp(n_change); end;

## References

## References

- (1) Kunihiro Baba, Ritei Shibata, and Masaaki Sibuya. Partial Correlation and Conditional Correlation as Measures of Conditional Independence. Australian & New Zealand Journal of Statistics, 46(4):657–664, 2004.
- (2) Keith Dillon. On the Computation and Applications of Large Dense Partial Correlation Networks. arXiv:1903.07181 [cs, stat], March 2019. arXiv: 1903.07181.
- (3) Keith Dillon and Yu-Ping Wang. A regularized clustering approach to brain parcellation from functional MRI data. In Wavelets and Sparsity XVII, volume 10394, page 103940E. International Society for Optics and Photonics, August 2017.
- (4) Keith Dillon and Yu-Ping Wang. Spectral Resolution Clustering for Brain Parcellation. arXiv:1810.04026 [cs, q-bio], October 2018. arXiv: 1810.04026.
- (5) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, July 2008.
- (6) J. K. MacCarthy, B. Borchers, and R. C. Aster. Efficient stochastic estimation of the model resolution matrix diagonal and generalized cross–validation for large geophysical inverse problems. Journal of Geophysical Research: Solid Earth, 116(B10), 2011.
- (7) MATLAB. version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts, 2010.
- (8) Marina Meila. Spectral Clustering: a Tutorial for the 2010’s. In Christian Hennig, Marina Meila, Fionn Murtagh, and Roberto Rocci, editors, Handbook of Cluster Analysis, Handbooks of Modern Statistical Methods, pages 125–144. Chapman and Hall, December 2015.
- (9) Marina Meila and Jianbo Shi. Learning segmentation by random walks. In In NIPS 13, 2002.
- (10) Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3):1436–1462, June 2006.
- (11) Mohsen Pourahmadi. Covariance Estimation: The GLM and Regularization Perspectives. Statistical Science, 26(3):369–387, August 2011.
- (12) Juliane Schäfer and Korbinian Strimmer. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
- (13) Jeannot Trampert, Andreas Fichtner, and Jeroen Ritsema. Resolution tests revisited: the power of random numbers. Geophysical Journal International, 192(2):676–680, February 2013.
- (14) D. C. Van Essen, K. Ugurbil, E. Auerbach, D. Barch, T. E. J. Behrens, R. Bucholz, A. Chang, L. Chen, M. Corbetta, S. W. Curtiss, S. Della Penna, D. Feinberg, M. F. Glasser, N. Harel, A. C. Heath, L. Larson-Prior, D. Marcus, G. Michalareas, S. Moeller, R. Oostenveld, S. E. Petersen, F. Prior, B. L. Schlaggar, S. M. Smith, A. Z. Snyder, J. Xu, E. Yacoub, and WU-Minn HCP Consortium. The Human Connectome Project: a data acquisition perspective. NeuroImage, 62(4):2222–2231, October 2012.
- (15) Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.
- (16) Joe Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley Publishing, 2009.
- (17) Donald R. Williams and Philippe Rast. Back to the basics: Rethinking partial correlation network methodology. The British Journal of Mathematical and Statistical Psychology, June 2019.