Phase Transitions and a Model Order Selection Criterion for Spectral Graph Clustering
One of the longstanding open problems in spectral graph clustering (SGC) is the so-called model order selection problem: automated selection of the correct number of clusters. This is equivalent to the problem of finding the number of connected components or communities in an undirected graph. We propose automated model order selection (AMOS), a solution to the SGC model selection problem under a random interconnection model (RIM) using a novel selection criterion that is based on an asymptotic phase transition analysis. AMOS can more generally be applied to discovering hidden block diagonal structure in symmetric non-negative matrices. Numerical experiments on simulated graphs validate the phase transition analysis, and real-world network data is used to validate the performance of the proposed model selection procedure.
Undirected graphs are widely used for network data analysis, where nodes can represent entities or data samples, and the existence and strength of edges can represent relations or affinity between nodes. For attributional data (e.g., multivariate data samples), such a graph can be constructed by calculating and thresholding the similarity measure between nodes. For relational data (e.g., friendships), the edges reveal the interactions between nodes. The goal of graph clustering is to group the nodes into clusters of high similarity. Applications of graph clustering, also known as community detection [1, 2], include but are not limited to graph signal processing [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], multivariate data clustering [13, 14, 15], image segmentation [16, 17], structural identifiability in physical systems , and network vulnerability assessment .
Spectral clustering [13, 14, 15] is a popular method for graph clustering, which we refer to as spectral graph clustering (SGC). It works by transforming the graph adjacency matrix into a graph Laplacian matrix , computing its eigendecomposition, and performing K-means clustering  on the eigenvectors to partition the nodes into clusters. Although heuristic methods have been proposed to automatically select the number of clusters [22, 13, 14], rigorous theoretical justifications on the selection of the number of eigenvectors for clustering are still lacking and little is known about the capabilities and limitations of spectral clustering on graphs.
The contributions of this paper are twofold. First, we analyze the performance of spectral clustering on undirected unweighted graphs generated by a random interconnection model (RIM), where each cluster can have arbitrary internal connectivity structure and the inter-cluster edges are assumed to be random. Under the RIM, we establish a breakdown condition on the ability to identify correct clusters using SGC. Furthermore, when all of the cluster interconnection probabilities are identical, a model we call the homogeneous RIM, this breakdown condition specifies a critical phase transition threshold on the inter-cluster connection probability . When this interconnection probability is below the critical phase transition threshold, SGC can perfectly detect the clusters. On the other hand, when the interconnection probability is above the critical phase transition threshold, SGC fails to identify the clusters. This breakdown condition and phase transition analysis apply to weighted graphs as well, where the critical phase transition threshold depends not only on the interconnection probability but also on the weights of the interconnection edges.
Second, we show that the phase transition results for the homogeneous RIM can be used to bound the phase transitions of SGC for the inhomogeneous RIM. This leads to a method for automatically selecting the number of clusters in SGC, which we call automated model order selection (AMOS). AMOS works by sequentially increasing the model order while running multi-stage tests for testing for RIM structure. Specifically, for a given model order and an estimated cluster membership map obtained from SGC, we first test for local RIM structure for a single cluster pair using a binomial test of homogeneity. This is repeated for all cluster pairs and, if they pass the RIM test, we proceed to the second stage of testing, otherwise we increase the model order and start again. The second stage consists of testing whether the RIM is globally homogeneous or inhomogeneous. This is where the phase transition results are used - if any of the estimated inter-cluster connection probabilities exceed the critical phase transition threshold the model order is increased. In this manner, the outputs from AMOS are the clustering results from SGC of minimal model order that are deemed reliable.
Simulation results on both unweighted and weighted graphs generated by different network models validate our phase transition analysis. Comparing to other graph clustering methods, experiments on real-world network datasets show that the AMOS algorithm indeed outputs clusters that are more consistent with the ground-truth meta information. For example, when applied to network data with longitude and latitude meta information, such as the Internet backbone map across North American and Europe, and the Minnesota road map, the clusters identified by the AMOS algorithm are more consistent with known geographic separations.
The rest of this paper is organized as follows. Sec. II discusses previous work on phase transition and model order selection for graph clustering. Sec. III introduces the RIM and the mathematical formulation of SGC. Sec. IV describes the breakdown condition and phase transition analysis of SGC, including unweighted and weighted graphs. Sec. V summarizes the proposed AMOS algorithm for SGC. Sec. VI discusses numerical experiments and comparisons on simulated graphs and real-world datasets. Sec. VII concludes this paper.
Ii Related Work
Ii-a Phase transitions in graph clustering
In recent years, researchers have established phase transitions in the accuracy of graph clustering under a diverse set of network models [23, 24, 25, 26, 2, 27, 28]. A widely used network model is the stochastic block model (SBM) , where the edge connections within and between clusters are independent Bernoulli random variables. Under the SBM, a phase transition on the cluster interconnectivity probability separates clustering accuracy into two regimes: a regime where correct graph clustering is possible, and a regime where correct graph clustering is impossible. The critical values that separate these two regimes are called phase transition thresholds. A summary of phase transition analysis under the SBM can be found in . In this paper, we establish the phase transition analysis of SGC under a more general network model, which we call the random interconnection model (RIM). The RIM does not impose any distributional assumptions on the within-cluster connectivity structure, but assumes the between-cluster edges are generated by a SBM. The formal definition of the RIM is introduced in Sec. III-A. The RIM introduced in this paper is a direct generalization of the model introduced in , which is a special case of an unweighted graph with two clusters.
Ii-B Model order selection criterion
Most existing model selection algorithms specify an upper bound on the number of clusters and then select based on optimizing some objective function, e.g., the goodness of fit of the -cluster model for . In , the objective is to minimize the sum of cluster-wise Euclidean distances between each data point and the centroid obtained from K-means clustering. In , the objective is to maximize the gap between the -th largest and the -th largest eigenvalue. In , the authors propose to minimize an objective function that is associated with the cost of aligning the eigenvectors with a canonical coordinate system. In [30, 31, 32, 33], model selection is cast as a multiscale community detection problem. In , the authors propose to iteratively divide a cluster based on the leading eigenvector of the modularity matrix until no significant improvement in the modularity measure can be achieved. The Louvain method in  uses a greedy algorithm for modularity maximization. In , the authors use the integrated classification likelihood (ICL) criterion  for graph clustering based a random graph mixture model. In , the authors use the degree-corrected SBM  and Monte Carlo sampling techniques for graph clustering. In [40, 41], the authors propose to use the eigenvectors of the nonbacktracking matrix for graph clustering, where the number of clusters is determined by the number of real eigenvalues with magnitude larger than the square root of the largest eigenvalue. Different from these approaches, this paper not only establishes a new model order selection criterion based on the phase transition analysis, but also provides multi-stage statistical tests for determining clustering reliability of SGC.
Iii Random Interconnection Model (RIM) and Spectral Clustering
Iii-a Random interconnection model (RIM)
Consider an undirected graph where its connectivity structure is represented by an binary symmetric adjacency matrix , where is the number of nodes in the graph. if there exists an edge between the node pair (), and otherwise . An unweighted undirected graph is completely specified by its adjacency matrix , while a weighted undirected graph is specified by a nonnegative matrix , where its nonzero entries denote the weight of an edge. In the next section, Theorems 1, 2 and 3 apply to unweighted undirected graphs while Theorem 4 extends these theorems to weighted undirected graphs.
Assume there are clusters in the graph and denote the size of cluster by . The size of the largest and smallest cluster is denoted by and , respectively. Let denote the adjacency matrix representing the internal edge connections in cluster and let () be an matrix representing the adjacency matrix of inter-cluster edge connections between the cluster pair (). The matrix is symmetric and for all . Using these notations, the adjacency matrix of the entire graph can be represented by a block structure, which is
The proposed random interconnection model (RIM) assumes that: (1) the adjacency matrix is associated with a connected graph of nodes but is otherwise arbitrary; (2) the matrices are random and mutually independent, and each has i.i.d. Bernoulli distributed entries with Bernoulli parameter . We call this model a homogeneous RIM when all random interconnections have equal probability, i.e., for all . Otherwise, the model is called an inhomogeneous RIM. In the next section, Theorems 1 and 3 apply to general RIM while Theorems 2 and 4 are restricted to the homogeneous RIM.
The stochastic block model (SBM)  is a special case of the RIM in the sense that the RIM does not impose any distributional constraints on . In contrast, under the SBM is a Erdos-Renyi random graph with some edge connection probability .
Iii-B Spectral clustering
Let be the -element column vector of ones (zeros) and let be the diagonal degree matrix, where is the degree vector of the graph. The graph Laplacian matrix of the entire graph is defined as , and similarly the graph Laplacian matrix of is denoted by . Let denote the -th smallest eigenvalue of . Then since by definition, and if the entire graph is connected. is also known as the algebraic connectivity of the graph as it is a lower bound on the node and edge connectivity of a connected graph .
To partition the nodes in the graph into () clusters, spectral clustering uses the eigenvectors associated with the smallest eigenvalues of . Each node can be viewed as a -dimensional vector in the subspace spanned by these eigenvectors. K-means clustering  is then implemented on the -dimensional vectors to group the nodes into clusters. Vector normalization of the obtained -dimensional vectors or degree normalization of the adjacency matrix can be used to stabilize K-means clustering [13, 14, 15].
For analysis purposes, throughout this paper we will focus on the case where the observed graph is connected. If the graph is not connected, the connected components can be easily found and the proposed algorithm can be applied to each connected component separately. Since the smallest eigenvalue of is always and the associated eigenvector is , only the higher order eigenvectors will affect the clustering results. By the Courant-Fischer theorem , the eigenvectors associated with the smallest nonzero eigenvalues of , represented by the columns of the eigenvector matrix , are the solution of the minimization problem
where the optimal value of (III-B) is the sum of the second to the -th smallest eigenvalues of , and is the identity matrix. The constraints in (III-B) impose orthonormality and centrality on the eigenvectors.
Iv Breakdown Condition and Phase Transition Analysis
In this section we establish a mathematical condition (Theorem 1) under which SGC fails to accurately identify clusters under the RIM. Furthermore, under the homogeneous RIM assumption of identical interconnection probability governing the entries of the matrices in (1), the condition leads to (Theorem 2) a critical phase transition threshold where, if SGC correctly identifies the communities with probability one while if SGC fails. The phase transition analysis developed in this section will be used to establish an automated model order selection algorithm for SGC in Sec. V. The proofs of the main theorems (Theorems 1, 2 and 3) are given in the appendix, and the proofs of extended theorems and corollaries are given in the supplementary material.
|expression||limit value of|
In the sequel, there are a number of limit theorems stated about the behavior of random matrices and vectors whose dimensions go to infinity as the sizes of the clusters goes to infinity while their relative sizes are held constant. Throughout this paper, the convergence of a real matrix is defined with respect to the spectral norm , defined as , where denotes the Euclidean norm of the vector . Let denote the singular value decomposition of , where denotes the -th largest singular value of , and are the associated left and right singular vectors, and denotes the rank of . For any two matrices and of the same dimension, we write if as for all , the spectral norm , equivalently , converges to zero. By Weyl’s inequality [45, 46], implies and asymptotically have the same singular values, i.e., for all , and for all . Furthermore, the Davis-Kahan theorem [47, 46] establishes that under some mild condition on the gap of singular values of and , implies and asymptotically have the same singular vectors (identical up to sign), i.e., and for all . If is a random matrix and is a given matrix, then is shorthand for almost surely. In particular, if the dimension of grows with , then for simplicity we often write , where is a matrix of infinite dimension. For example, let denote the identity matrix. If as , then for simplicity we write , where is the identity matrix of infinite dimension. While this infinite dimensional notation is non-rigorous, its use in place of the more cumbersome notation greatly simplifies the presentation. For vectors, we say converges to if as . Similarly, for a vector , if as , where is a vector of increasing dimension, we use the notation , where is the infinite dimensional limit of . Table I summarizes the limit expressions presented in this paper.
Theorem 1 (Breakdown condition).
Let be the cluster partitioned eigenvector matrix associated with the graph Laplacian matrix obtained by solving (III-B), where with its rows indexing the nodes in cluster . Let be the matrix with -th element
The following holds almost surely as and . If , then , , and hence spectral graph clustering cannot be successful.
Since the eigenvalues of depend only on the RIM parameters and whereas the eigenvalues of depend not only on these parameters but also on the internal adjacency matrices , Theorem 1 specifies how the graph connectivity structure affects the success of SGC.
For the special case of homogeneous RIM, where , for all , Theorem 2 establishes the existence of a phase transition in the accuracy of SGC as the interconnection probability increases. A similar phase transition likely exists for the inhomogeneous RIM (i.e., ’s are not identical), but an inhomogeneous extension of Theorem 2 is an open problem. Nonetheless, Theorem 3 shows that the homogeneous RIM phase transition threshold in Theorem 2 can be used to bound clustering accuracy when the RIM is inhomogeneous.
Theorem 2 (Phase transition).
Let be the cluster partitioned eigenvector matrix associated with the graph Laplacian matrix obtained by solving (III-B),
where with its rows indexing the nodes in cluster .
Let and assume .
Under the homogeneous RIM in (1) with constant interconnection probability ,
there exists a critical value such that the following holds almost surely as and :
In particular, if and , .
Furthermore, reordering the indices in decreasing cluster size so that , we have
where is a diagonal matrix.
(c) , where and . In particular, when .
Theorem 2 (a) establishes a phase transition of the partial eigenvalue sum at some critical value , called the critical phase transition threshold. When the quantity converges to . When the slope in of changes and the intercept depends on the cluster having the smallest partial eigenvalue sum. When all clusters have the same size (i.e., ) so that , undergoes a slope change from to .
Theorem 2 (b) establishes that renders the entries of the matrix incoherent, making it impossible for SGC to separate the clusters. On the other hand, makes coherent, and hence the row vectors in the eigenvector matrix possess cluster-wise separability. This is stated as follows.
Corollary 1 (Separability of the row vectors in the eigenvector matrix when ).
Under the same assumptions as in Theorem 2, when , the following properties of hold almost surely as and :
(a) The columns of are constant vectors.
(b) Each column of has at least two nonzero cluster-wise constant components, and these constants have alternating signs such that their weighted sum equals due to the property .
(c) No two columns of have the same sign on the cluster-wise nonzero components.
These properties imply that for the rows in corresponding to different nodes are identical (Corollary 1 (a)), while the row vectors in and , , corresponding to different clusters are distinct (Corollary 1 (b) and (c)). Therefore, the within-cluster distance between any pair of row vectors in each is zero, whereas the between-cluster distance between any two row vectors of different clusters is nonzero. This means that as and the ground-truth clusters become the optimal solution to K-means clustering, and hence K-means clustering on these row vectors can group the nodes into correct clusters. Note that when , from Theorem 2 (b) the row vectors of corresponding to the same cluster sum to a zero vector. This means that the entries of each column in have alternating signs and the centroid of the row vectors of each cluster is the origin. Therefore, K-means clustering on the rows of yields incorrect clusters.
Furthermore, as a demonstration of the breakdown condition in Theorem 1, observe that when , Theorem 1 implies that is a diagonal matrix . From (-B) in the appendix we know that for almost surely when . Therefore, under the homogeneous RIM, SGC can only be successful when is below .
Theorem 2 (c) provides upper and lower bounds on the critical threshold value for the phase transition to occur when . These bounds are determined by the cluster having the smallest partial eigenvalue sum , the number of clusters , and the size of the largest and smallest cluster ( and ). When all cluster sizes are identical (i.e., ), these bounds become tight. Based on Theorem 2 (c), the following corollary specifies the properties of and the connection to algebraic connectivity of each cluster.
Corollary 2 (Properties of and its connection to algebraic connectivity).
Let , and , and let , and denote their limit value, respectively.
Under the same assumptions as in Theorem 2, the following statements hold almost surely as and :
(a) If , then .
(b) If , then .
The following corollary specifies the bounds on the critical value for some special types of clusters. These results provide theoretical justification of the intuition that strongly connected clusters, e.g., complete graphs, have high critical threshold value, and weakly connected clusters, e.g., star graphs, have low critical threshold value.
Corollary 3 (bounds on the critical value for special type of cluster).
Under the same assumptions as in Theorem 2, the following statements hold almost surely as and :
(a) If each cluster is a complete graph, then .
(b) If each cluster is a star graph, then .
Furthermore, in the special case of a SBM, where each adjacency matrix corresponds to a Erdos-Renyi random graph with edge connection probability , under the same assumptions as in Theorem 2 we can show that almost surely,
The next corollary summarizes the results from Theorem 2 for the case of to elucidate the phase transition phenomenon. Note that it follows from Corollary 4 (b) that below the phase transition () the rows in corresponding to different clusters are constant vectors with entries of opposite signs, and thus K-means clustering is capable of yielding correct clusters. On the other hand, above the phase transition () the entries corresponding to each cluster have alternating signs and the centroid of each cluster is the origin, and thus K-means clustering fails.
Corollary 4 (Special case of Theorem 2 when ).
When , let , let and assume . Then there exists a critical value such that the following holds almost surely as and .
(c) , where and .
The above phase transition analysis can also be applied to the inhomogeneous RIM for which the ’s are not constant. Let and . The corollary below shows that under the inhomogeneous RIM when is below , which is the critical threshold value specified by Theorem 2 for the homogeneous RIM, the smallest nonzero eigenvalues of the graph Laplacian matrix lie within the internal with probability one.
Corollary 5 (Bounds on the smallest nonzero eigenvalues of under the inhomogeneous RIM).
Under the RIM with interconnection probabilities , let , , and let be the critical threshold value of the homogeneous RIM specified by Theorem 2. If , the following statement holds almost surely as and :
In particular, Corollary 5 implies that the normalized algebraic connectivity of the inhomogeneous RIM is between and almost surely as and .
For graphs following the inhomogeneous RIM, Theorem 3 below establishes that accurate clustering is possible if it can be determined that . As defined in Theorem 2, let be the eigenvector matrix of under the inhomogeneous RIM, and let be the eigenvector matrix of the graph Laplacian of another random graph, independent of , generated by a homogeneous RIM with cluster interconnectivity parameter . We can specify the distance between the subspaces spanned by the columns of and by inspecting their principal angles . Since and both have orthonormal columns, the vector of principal angles between their column spaces is , where is the -th largest singular value of a real rectangular matrix . Let , and let be defined entrywise. When , the following theorem provides an upper bound on the Frobenius norm of , which is denoted by .
Theorem 3 (Distance between column spaces spanned by and ).
Under the RIM with interconnection probabilities , let be the critical threshold value for the homogeneous RIM specified by Theorem 2, and define . For a fixed , let denote the limit of . If and as , the following statement holds almost surely as and :
Furthermore, let . If ,
As established in Corollary 1, under the homogeneous RIM when the row vectors of the eigenvector matrix are perfectly cluster-wise separable as and . Under the inhomogeneous RIM, Theorem 3 establishes that cluster separability can still be expected provided that is small and . As a result, we can bound the clustering accuracy under the inhomogeneous RIM by inspecting the upper bound (4) on . Note that if , we can obtain a tighter upper bound on (4).
Next we extend Theorem 2 to undirected weighted random graphs obeying the homogeneous RIM. The edges within each cluster are assumed to have nonnegative weights and the weights of inter-cluster edges are assumed to be independently drawn from a common nonnegative bounded distribution. Let denote the symmetric nonnegative weight matrix of the entire graph. Then the corresponding graph Laplacian matrix is defined as , where is the diagonal matrix of nodal strengths of the weighted graph. Similarly, the symmetric graph Laplacian matrix of each cluster can be defined. The following theorem establishes a phase transition phenomenon for such weighted graphs. Specifically, the critical value depends not only on the inter-cluster edge connection probability but also on the mean of inter-cluster edge weights.
Theorem 4 (Phase transition in weighted graphs).
Under the same assumptions as in Theorem 2, further assume
the weight matrix is symmetric, nonnegative and bounded, and
the weights of the upper triangular part of
are independently drawn from a common nonnegative bounded distribution with mean . Let and .
Then there exists a critical value such that the following holds almost surely as and :
where is a diagonal matrix.
(c) , where and .
V Automated Model Order Selection (AMOS) Algorithm for Spectral Graph Clustering
Based on the phase transition analysis in Sec. IV, we propose an automated model order selection (AMOS) algorithm for selecting the number of clusters in spectral graph clustering (SGC). This algorithm produces p-values of hypothesis tests for testing the RIM and phase transition. In particular, under the homogeneous RIM, we can estimate the critical phase transition threshold for each putative cluster found and use this estimate to construct a test of reliability of the cluster. The statistical tests in the AMOS algorithm are implemented in two phases. The first phase is to test the RIM assumption based on the interconnectivity pattern of each cluster (Sec. V-B), and the second phase is to test the homogeneity and variation of the interconnectivity parameter for every cluster pair and in addition to making comparisons to the critical phase transition threshold (Sec. V-C). The flow diagram of the proposed algorithm is displayed in Fig. 1, and the algorithm is summarized in Algorithm 2. The AMOS package is publicly available for download111https://github.com/tgensol/AMOS. Next we explain the functionality of each block in the diagram.
V-a Input network data and spectral clustering
The input network data is a matrix that can be a symmetric adjacency matrix , a degree-normalized symmetric adjacency matrix , a symmetric weight matrix , or a normalized symmetric weight matrix , where and are assumed invertible. Spectral clustering is then implemented on the input data to produce clusters , where is the -th identified cluster with number of nodes and number of edges . Initially is set to . The AMOS algorithm works by iteratively increasing and performing spectral clustering on the data until the output clusters meet a level of significance criterion specified by the RIM test and phase transition estimator.
V-B RIM test via p-value for local homogeneity testing
Given clusters obtained from spectral clustering with model order , let be the interconnection matrix of edges connecting clusters and . Our goal is to compute a p-value to test the hypothesis that the matrix in (1) satisfies the RIM. More specifically, we are testing the null hypothesis that is a realization of a random matrix with i.i.d. Bernoulli entries (RIM) and the alternative hypothesis that is not a realization of a random matrix with i.i.d. Bernoulli entries entries (not RIM), for all , . Since the RIM homogeneity model for the interconnection matrices will only be valid when the clusters have been correctly identified, this RIM test can be used to test the quality of a graph clustering algorithm.
To compute a p-value for the RIM we use the V-test  for homogeneity testing of the row sums or column sums of . Specifically, given independent binomial random variables, the V-test tests that they are all identically distributed. For concreteness, here we apply the V-test to the row sums. Given a candidate set of clusters, the V-test is applied independently to each of the interconnection matrices .
For any interconnection matrix the test statistic of the V-test converges to a standard normal distribution as , and the p-value for the hypothesis that the row sums of are i.i.d. is p-value, where is the cumulative distribution function (cdf) of the standard normal distribution. The proposed V-test procedure is summarized in Algorithm 1. The RIM test on rejects the null hypothesis if p-value, where is the desired single comparison significance level. Since the ’s are independent, the p-value threshold parameter can be easily translated into a multiple comparisons significance level for detecting homogeneity of all ’s. It can also be translated into a threshold for testing the homogeneity of at least one of these matrices using family-wise error rate Bonferroni corrections or false discovery rate analysis [50, 51].
V-C A cluster quality measure for RIM
Once the identified clusters pass the RIM test, one can empirically determine the reliability of the clustering using the phase transition analysis introduced in the previous section. In a nutshell, if the estimate of falls below the critical phase transition threshold then, by Theorem 3, the results of the clustering algorithm can be declared reliable if the clustering quality measure is small. This is the basis for the proposed AMOS procedure under the assumption of inhomogeneous RIM. For homogeneous RIM models an alternative procedure is proposed. The AMOS algorithm (Fig. 1) runs a serial process of homogeneous and inhomogeneous RIM phase transition tests. Each of these is considered separately in what follows.
Homogeneous RIM phase transition test:
The following plug-in estimators are used to evaluate the RIM parameters and the critical phase transition threshold under the homogeneous RIM. Let be the number of inter-cluster edges between clusters and (i.e., the number of nonzero entries in ). Then under the inhomogeneous RIM is the maximum likelihood estimator (MLE) of . Under the homogeneous RIM, , and the MLE of is , where is the number of edges in the graph. We use the estimates and