# Simultaneous Dimensionality and Complexity Model Selection for Spectral Graph Clustering

###### Abstract

Our problem of interest is to cluster vertices of a graph by identifying underlying community structure. Among various vertex clustering approaches, spectral clustering is one of the most popular methods because it is easy to implement while often outperforming more traditional clustering algorithms. However, there are two inherent model selection problems in spectral clustering, namely estimating both the embedding dimension and number of clusters. This paper attempts to address the issue by establishing a novel model selection framework specifically for vertex clustering on graphs under a stochastic block model. The first contribution is a probabilistic model which approximates the distribution of the extended spectral embedding of a graph. The model is constructed based on a theoretical result of asymptotic normality of the informative part of the embedding, and on a simulation result providing a conjecture for the limiting behavior of the redundant part of the embedding. The second contribution is a simultaneous model selection framework. In contrast with the traditional approaches, our model selection procedure estimates embedding dimension and number of clusters simultaneously. Based on our conjectured distributional model, a theorem on the consistency of the estimates of model parameters is stated and proven. The theorem provides statistical support for the validity of our method. Heuristic algorithms via the simultaneous model selection framework for vertex clustering are proposed, with good performance shown in the experiment on synthetic data and on the real application of connectome analysis.

Keywords: Adjacency spectral embedding, stochastic block model, random dot product graph, BIC, Model-based clustering.

## 1 Introduction

A mathematical graph encodes the relationships between objects in a network as edges between vertices. The analysis of such networks is of importance in many fields ranging from sociology (Lazer et al., 2009) and ecology (Proulx et al., 2005) to political science (Ward et al., 2011) and neuroscience (Bullmore and Bassett, 2011). One of the most important tasks in the analysis of such a graph is to identify its community structure. This is essentially a vertex clustering problem, in which the set of vertices is to be partitioned into nonoverlapping groups (called clusters) according to their similarities in the underlying communities. Numerous heuristic methodologies have been proposed for vertex clustering, including divisive approaches by iteratively removing edges based on number of shortest paths (Girvan and Newman, 2002; Newman and Girvan, 2004), methods of optimizing a function called “modularity” which evaluates the quality of a partition (Bickel and Chen, 2009; Blondel et al., 2008; Newman, 2006), algorithms employing a random walk to infer structural properties of networks (Pons and Latapy, 2005; Rosvall and Bergstrom, 2008), to name just a few. Among various vertex clustering approaches, we are interested here in the so-called spectral clustering methods, for their easy implementation, theoretical consistency and good empirical performance.

Spectral clustering makes use of the spectral decomposition of some kind of similarity matrix that measures the similarities between vertices. Numerous spectral clustering algorithms based on decomposing the adjacency matrix, one natural similarity matrix of the graph, have been proposed to solve the vertex clustering problem (Rohe et al., 2011; Sussman et al., 2012; Qin and Rohe, 2013; Lei et al., 2015). Basically, the so-called adjacency spectral embedding (ASE) is first derived by factorizing the adjacency matrix; then a traditional Gaussian mixture model (GMM) clustering approach is applied on the ASE. Although the GMM ASE methods exhibit good performance, there are two inherent model selection problems that need to be addressed in order to perform this clustering. The first is determining the number of top eigenvectors whose rows are the low-dimensional points on which the GMM method is applied. Since these top eigenvectors comprise the adjacency spectral embedding, we call this number the embedding dimension. The second is determining the number of clusters, essential for GMM.

The first model selection problem of determining the embedding dimension has received much attention over the years. In more general scenarios we call the corresponding eigenvectors variables, thus the problem is one of variable selection. A comprehensive review of the variable selection approaches in the model-based clustering framework has been provided in Fop and Murphy (2018) Handcock et al. (2007). The necessity of variable selection is based on the fact that only a subset of the variables of the high-dimensional data are informative and important to the subsequent statistical inference. Using all the variables may lead to unnecessary computational cost, and may also decrease the performance of the clustering owing to the irrelevance of extraneous variables. Therefore, the selection of variables which optimize the clustering structure is of great importance. Considering the overwhelming number of methods for variable selection, we do not attempt to give a concise review of the literature. However, among various techniques for variable selection arguably the best-known methodology of principal component analysis (PCA) (Jolliffe, 2011) is worth mentioning. In PCA, singular values of the data matrix are used to measure the importance of the variables, and the variables corresponding to relatively small singular values are discarded. For a broad review of the many stopping rules of PCA, we refer the readers to Jackson (1993). Unfortunately, there are no best rules in the task of dimension reduction in general due to the bias-variance tradeoff. Roughly speaking, heuristic approaches are usually not theoretically reliable because they all need to determine a threshold which is highly subjective, while statistical approaches usually rely on an overly strong distributional assumption that the data does not often satisfy in many applications (Jackson, 1993).

The second model selection problem, namely determining the number of clusters, is also a widely studied problem. As numerous approaches have been proposed on this topic, we refer the readers to the detailed reviews in Milligan and Cooper (1985) and Hardy (1996). One substantial category of such methods is the information criterion approach. These methods evaluate and compare the so-called information criterion, usually some kind of penalized likelihood, on finite mixture models with different number of mixture components and model complexity to perform model selection. Various information criteria are proposed. To list a few: Akaike information criterion (AIC) (Akaike, 1998), Bayesian information criterion (BIC) (Schwarz et al., 1978), an entropy criterion (NEC) (Celeux and Soromenho, 1996), integrated completed likelihood (ICL) (Biernacki et al., 2000) and cross-validated likelihood (Smyth, 2000). Of these, we are mostly interested in BIC since it is a well-studied and easily implemented approach. Moreover, the consistency of estimation for number of components using BIC is theoretically supported in Keribin (2000). The practical performance of BIC approaches in model selection have also been highly rated by a large number of works (Roeder and Wasserman, 1997; Stanford and Raftery, 1997; Dasgupta and Raftery, 1998; Campbell et al., 1999).

The traditional way to address both of the model selection problems in spectral clustering is to execute corresponding approaches successively. That is, one applies spectral embedding with the dimension given by some dimension reduction technique in the first step, and then one proceeds to the model selection technique on the embedded data to estimate the number of clusters in the second step. This consecutive procedure of model selection suffers from three drawbacks. First, there are no best methods for estimating the embedding dimension. Even if we choose one of the modern and commonly used scree plot methods (Zhu and Ghodsi, 2006), in comparison the result is still not robust for limited data size. Second, the latter model selection procedure, namely estimating the number of clusters, completely depends on the result of the former one, because no information of the discarded variables will pass through. This may cause an accumulation of errors when the former procedure performs poorly, even if the latter procedure is reliable. Third, the original data is truncated before applying the clustering algorithm, which means it may not be possible to take advantage of any useful information contained in the discarded dimensions to improve the clustering result. Therefore, jointly addressing these two model selection problems is desirable.

In this paper, we propose a novel simultaneous dimensionality and complexity model selection framework for spectral graph clustering. This is inspired by breakthrough work on model selection in the framework of model-based clustering proposed in Raftery and Dean (2006). In that work, all of the variables are taken into consideration in a family of finite mixture models, which describes the distributional behavior of the raw data. The model selection procedure is conducted by comparing different models in the same family via the Bayes factor, the ratio of the posterior probability of the model given the observations. The authors utilize a BIC-based approximation to the Bayes factor that is much easier to compute. A remarkable highlight of this framework is the simultaneity of selecting variables and number of clusters, which overcomes the drawbacks of the consecutive model selection procedure. However, the method is not applicable to the current spectral vertex clustering task in the sense that neither the distributional model nor the greedy variable selection algorithm is appropriate with respect to the graph context. This inspires the development of a reliable model for spectral embedding with both signal and noise dimensions and a novel methodology for vertex clustering on the graphs with heterogeneous community structure. Note that our spectral graph clustering setting allows the use of the standard BIC rather than the Bayes factor approximation used in Raftery and Dean (2006).

A simultaneously-developed, related approach to ours, using a Bayesian modeling perspective, is presented in Sanna Passino and Heard (2019). The basic model utilized is the same as in this paper, with the extra complexity of the prior distributions on the parameters and their associated hyper-parameters. These are fit using Markov Chain Monte Carlo, rather than the frequentist perspective presented herein. One advantage of the frequentist approach is the relative ease in handling very large graphs. We also propose a 2-step procedure, which while it does not produce a maximum likelihood solution, seems empirically to perform as well as the full maximum likelihood procedure. Of course Bayesian inference has its advantages (e.g. uncertainty quantification) and variational methods allow for scalable Bayesian inference. Still, as pointed out in Sanna Passino and Heard (2019), our distributional model and likelihood methods motivate their approach.

The paper is organized as follows. In Section 2, we review the existing methodology of GMM ASE. In Section 3, we define the extended ASE and provide a specific GMM model to characterize the potential distribution of extended ASE based on our simulation results. In Section 4, we propose the simultaneous model selection framework, as well as two heuristic algorithms specifically tailored for graphs under a stochastic block model. In Section 5, the results of simulation and real data application on connectomes are presented. We conclude the paper by remarking in Section 6 on some extensions of our approach.

## 2 Background

In this section we review the ubiquitous spectral graph clustering with sequential model selection. We first introduce models for random graphs, then we summarize the existing methodology.

### 2.1 Random dot product graph and stochastic block model

One of the simplest generative models for a random graph is the inhomogeneous Erdős-Rényi (IER) graph (Erdos and Rényi, 1960) on vertices, where the edges are independent each with their own probabilities given by an matrix called the edge probability matrix. To more effectively depict the heterogeneous attributes of the vertices in the graph, we consider the so-called random dot product graph (RDPG) (Young and Scheinerman, 2007), one instance of the well-studied class of latent position graphs. The definition of RDPG is as follows:

###### Definition 1 (Random dot product graph (RDPG)).

Let be a subset of satisfying for all . Let be latent vectors, and be the latent position matrix such that the th row of is . If the edges of an undirected graph are generated according to an edge probability matrix , then we say is a random dot product graph (RDPG) with latent position matrix , denoted by . That is, , the entry of the adjacency matrix , is independently Bernoulli distributed with parameter , i.e.

(1) |

for all .

In the RDPG, the probability of the connection between vertex and , namely the -th entry of the edge probability matrix , depends on the inner product of the corresponding latent positions (Hoff et al., 2002; Hoff, 2005, 2008). In the context of vertex clustering, vertices from the same group are supposed to share common connection attributes. Therefore, we may further assume vertices from the same group have the same latent position. This leads to the stochastic block model (Holland et al., 1983), in which the set of vertices is partitioned into groups, called blocks. The connectivity of the graph is parameterized by the block connectivity probability matrix , which determines solely the edge probability within and between blocks. The formal definition of SBM is given below:

###### Definition 2 (Stochastic block model (SBM)).

Let be the graph of interest with vertices, be the block connectivity probability matrix, and be the vector of prior block probability such that . is called a -block stochastic block model (SBM) graph, denoted by SBM, if there is a random vector , called the block memberships, that assigns vertex to block with probability . Mathematically, are i.i.d. random variables with categorical distribution with parameter , i.e.

(2) |

for all and . Furthermore, the edges are generated according to an edge probability matrix , whose -th entry is . Equivalently, , the entry of the adjacency matrix , is independently Bernoulli distributed with parameter , i.e.

(3) |

for all .

In addition, it is convenient in some cases to consider that the block membership vector is not random but fixed. We call such graph a SBM conditioned on block memberships, denoted by SBM. If an undirected graph and is positive semidefinite, then can be represented by an RDPG with at most distinct latent positions. In this case, all vertices in the same block have the same latent vectors. This builds a connection between the SBM with positive semidefinite block connectivity probability matrix and the RDPG. For the relationship between an SBM with non-positive semi-definite block connectivity probability matrix and a generalized RDPG, we refer the readers to Rubin-Delanchy et al. (2017).

### 2.2 Spectral graph clustering via adjacency spectral embedding

Given an observed SBM graph, our inference task is to identify the underlying memberships of the vertices corresponding to the blocks to which they belong. That is, if , our goal is to infer the graph parameter from the observed adjacency matrix . Among various techniques, spectral clustering methods (Von Luxburg, 2007) are effective, well-studied and computationally feasible approaches through which the vertices of a graph are mapped to points in the Euclidean space. These Euclidean points are the data on which the traditional clustering methods can be applied to finalize the clustering procedure. Spectral clustering performs spectral decomposition on some “similarity” matrix that represents the graph. There are two natural similarity matrices, namely the adjacency matrix and the Laplacian matrix of the graph. While the choice between adjacency matrix and Laplacian matrix is always debatable, it has been shown that neither of them dominates the other in all cases (Tang et al., 2018). See in particular Priebe et al. (2019). In this paper, we focus on the spectral method using the adjacency matrix for ease of analysis. The reason is for this is that some properties of the top eigenvectors of the adjacency matrix, known as the adjacency spectral embedding (ASE), have been analyzed in the literature (Sussman et al., 2012, 2014; Lyzinski et al., 2017), where it has been proven that the rows of ASE converge to the corresponding underlying latent positions.

There are two model selection problems in spectral clustering of an SBM graph. One is to estimate the embedding dimension , while the other is to estimate the number of blocks . The traditional solution of these two model selection problems proceeds in a successively, namely applying variable selection or dimension reduction techniques to estimate (let the estimate be ) first, then applying model selection techniques on the data with dimensional ASE to estimate . Since the two model selection procedures are executed in sequence, we name this framework consecutive model selection (CMS). As a concrete and commonly used solution of CMS, one can apply the so-called scree plot method (for example, an effective method to locate the “elbow” in the scree plot is proposed in Zhu and Ghodsi (2006)) to estimate the embedding dimension, then apply the BIC approach (Keribin, 2000) on the spectral embedding to select the number of clusters for the subsequent GMM clustering. We denote such CMS approach BIC ZG for the purpose of comparison.

## 3 Models for Extended Adjacency Spectral Embedding

### 3.1 Extended adjacency spectral embedding

In real data applications, the rank of the edge probability matrix , namely the ideal embedding dimension , is unknown, because is unobserved and we observe only the adjacency matrix which is a noisy version of . To address the problem of estimating , we hereby define the extended adjacency spectral embedding (extended ASE) with a constant embedding dimension as follows:

###### Definition 3 (Extended adjacency spectral embedding (extended ASE)).

Let be an undirected graph of interest with vertices, and be its symmetric adjacency matrix. Let the spectral decomposition of be

(4) |

Here, is a diagonal matrix with eigenvalues of on its diagonal in descending order. That is, with . is an orthogonal matrix whose columns are the corresponding eigenvectors of . For a given integer satisfying , called embedding dimension, the extended adjacency spectral embedding (extended ASE) of with dimension is given by

(5) |

where is the submatrix of consisting of its first columns, and is the submatrix of consisting of its first rows and columns.

In practice, can be taken as a loose upper bound for . Such an upper bound is usually easily obtained, either from first principle assumptions about the problem, a requirement for a minimal number of vertices in a cluster, or other external information. Typically one considers the scree plot – the plot of the decreasing eigenvalues against , and looks for an “elbow” or uses a profile likelihood (Zhu and Ghodsi (2006), or similar method. Since the number of clusters is a bound on the rank of , one could use the maximum value of as the estimate for . Alternatively, one can look at scatter plots of the embedding to look for dimensions in which no clustering is apparent; this is perilous, but can provide some information to add to that of scree plots or other information. Finally, if one has a Bayesian inclination, and can suggest a reasonable prior on , this prior can be used to determine a reasonable choice for . The number of blocks is upper bounded by the rank of , but since this is unavailable, we must rely on other methods. Once again, the information that would be used in a Bayesian prior on the number of blocks can be used for an upper bound on . A great advantage of our approach is that we are happy with (perhaps greatly) over-estimating with , as our methodology allows this first choice to be remedied later, whereas conventional spectral clustering is constrained to proceed with the first embedding dimension.

In this paper, we assume is always given without estimation. From the formula, it is trivial that the first columns of is the regular ASE . The extended ASE can be partitioned into two parts as , where and . We call the first dimensions the informative part, while we call the remaining dimensions the redundant part. If we consider the spectral decomposition of , the unperturbed version of , then all of the latent position information is contained in the first dimensions, justifying our terminology. We notice that all existing methods using ASE can be applied to the extended ASE simply by truncating to an estimated embedding dimension . Moreover, as our main result in the paper, the extended ASE can be used to perform simultaneous model selection and vertex clustering without first estimating the embedding dimension .

### 3.2 Distributional results for the extended ASE

We seek a model-based clustering approach to perform vertex clustering directly on the extended ASE. In the framework of model-based clustering, both the informative part and the redundant part need to be parameterized so as to make the models comparable. For this purpose, we need to provide a model for the entire extended ASE. A remarkable distributional result for the informative part of the extended ASE is available (Athreya et al., 2016; Tang et al., 2018). In Athreya et al. (2016), a central limit theorem for the rows of ASE for the RDPG is presented and proven. This result justifies model-based clustering for identifying the block memberships in SBM via ASE. In Tang et al. (2018), the central limit theorem of ASE is restated in a stronger version, in the sense that its proof does not need an assumption that has been made in Athreya et al. (2016). Basically, the theorem states that any row of the ASE of an RDPG asymptotically follows a multivariate normal distribution centered at its conditional latent position (up to rotational nonidentifiability). Specifically, let . Considering the latent positions themselves follow an i.i.d. categorical distribution into distinct possible -dimensional vectors according to , the unconditioned version of the theorem claims that any row of ASE converges in distribution to a mixture of multivariate normals, with mixing probabilities . The theorem gives a complete formula for the covariance matrix of each multivariate normal component, thus fully characterizing the marginal distributional behavior of the informative part of the extended ASE.

To obtain empirical support of the distributional behavior of the redundant part of extended ASE, we have conducted a collection of simulations. In the simulation, we generate random graphs according to the stochastic block model SBM, where , for -block graphs, and , for -block graphs. The number of vertices considered, , is varied during the simulation study. Notice that the true embedding dimension for the -block case and for -block case. We apply the extended ASE to the adjacency matrix according to definition 3. As defined, the extended ASE is partitioned into informative part and redundant part by . denotes the -th row of , which corresponds to the -th vertex with block membership .

Figure 1 depicts the informative part for the two different block models described above, entirely in agreement with theory. Colors indicate block membership. Each graph was drawn with vertices. The observations of the distributional behavior of are as follows.

Observation 1: The within-block sample mean of tends to a zero vector as increases. Figure 2 shows the results for the sample mean for the 2-block model. Denoted by , the sample mean of the redundant part in dimension for all in block 1 is calculated by

(6) |

where is the number of vertices assigned to block , and is the -th entry of . We plot the sample mean values for each dimension from Monte Carlo replicates in Figure 2a. For larger , the points are closer to zero in general. The upper tick-mark for the plots is , while for it is . In Figure 2b we plot the means for Monte Carlo replicates for various values of for the first redundant vector, . While this proves nothing, even for this one simulation setting, the boxplots in Figure 2b certainly support our Observation 1.

Observation 2: The within-block sample variance of each dimension of tends to a constant as increases. Figure 3 shows the results for the sample variances. For each dimension , the sample variance of for all in block- is calculated by

(7) |

where is the number of vertices assigned to block and is corresponding sample mean in block-. We plot the sample variance values for each dimension from Monte Carlo replicates in Figure 3a. In each panel the boxplots indicate the variances for the two blocks, the lower set in each panel corresponding to the first block, the upper to the second. For relatively small graphs (, top panel), the variances are clearly different across the dimensions, indicating that for these graphs a modified model may be appropriate – although the extra measurement error inherent in these more complex models may argue for using the simpler model of a constant variance, particularly if it is assumed that there are a small number of redundant variables. For larger graphs (, bottom) we see that the variances appear constant and distinct for the different blocks. To investigate the structure of the variance for a range of graph orders, we show in Figure 3b the medians of the variances for Monte Carlo replicates for various values of . Again we see that for small there is clearly a difference in the variances for the different dimensions, but this difference becomes less pronounced for larger . Again, these empirical results, narrow though they may be, certainly support our Observation 2.

Observation 3: The within-block sample covariance matrix of tends to be diagonal, and the covariance between informative and redundant dimensions tend to be zero, for large . Figure 4 shows the results for the sample covariance matrix. The sample covariance matrix of for all in block- is calculated by

(8) |

where is the number of vertices assigned to block , is the -th row of extended ASE (but regarded as a column vector), and is the corresponding sample mean in block . We plot the sample covariance matrix for in Figure 3a and in Figure 3b. The matrix contains both informative dimensions and redundant dimensions. We observe that the diagonal values in the matrix of redundant dimensions concentrates on a constant for , which is consistent with the result shown in Figure 3. The off-diagonal values in the matrix of redundant dimensions tend to zero as increases. For , the covariance matrix presents a block diagonal structure, partitioned by the true embedding dimension . So we conclude that the within-block sample covariance matrix of tends to be diagonal for large . Moreover, the covariance between informative and redundant dimensions tend to be zero, for large . Further evidence that the off-diagonal terms are zero is given in Figure 5, depicting box plots of all the off-diagonal terms in the redundant part for Monte Carlo replicates for the 2-block model, for various values of between and .

Observation 4: The within-block sample variances are distinct for different blocks. The evidence for this observation is found in Figure 3b, as discussed above.

The above results provide the impetus for several observations about the distribution of the ASE of a stochastic block model. Although we did not perform extensive simulations herein, the plot in Figure 1 suggests (and the theoretical results mentioned support) a GMM model with general mean vectors and covariance matrices is appropriate for the informative part of the embedding. For the redundant part, asymptotically, it appears that the group-conditional means are zero, and the group-conditional covariances are diagonal. There is some evidence that (again asymptotically) the group-conditional variances for the redundant are the same, although for small , and in particularly for large , these variances may well be different, as suggested in Figure 3b. For our simulations, “small ” seems to be in the range of ten thousand or so, but of course this would depend on the specific structure of the stochastic block model. Since the extra group-conditional information in these variance differences is likely to be small, we will assume the simpler model where all the redundant variances are equal within groups, but this is an area for future investigation.

The above simulations illustrate well the reasons for our assumptions about the model. In order to see that these assumptions are not specific to the specific cases we consider, we ran a much more extensive set of simulations. A representative example is depicted in Figure 6. If our observations are correct, we should expect that the first two means are arbitrary but different, and that the are different for . We also expect to see for , and , . All of these expectations are borne out in the figure.

### 3.3 Probability models for extended ASE

For the extended ASE of an SBM graph, the theoretical result for its informative part and the conjecture for its redundant part are discussed above. There are few theoretical results about the redundant part in the literature. However, a distributional model is important and necessary for model-based clustering. So we provide a finite mixture model for the extended ASE. Although it hasn’t been proven analytically at this point, we believe the model is asymptotically close to the truth – wrong but potentially useful, in George Box’s aphorism – based on both our observations of the large sample behavior of redundant part, and the performance on the subsequent inference task based on this model.

We first state our conjectures on the distribution of the redundant part of the extended ASE as follows. We consider a -block SBM graph. Any row of is asymptotically multivariate Gaussian distributed conditioned on its block membership. That is, for any ,

(9) |

approximately if is sufficiently large. If we consider the sample statistics from the simulations to be a good estimation of the Gaussian parameters, we can further specify the model. By Observation 1, we may assume for all ; by Observation 2 and Observation 3, we assume , where is the identity matrix; by Observation 4, we assume different if the conditioned block membership is different. So now our conjecture becomes

(10) |

By combining this conjecture with the theoretical results for the informative dimensions, we propose a Gaussian mixture model (GMM) for the extended ASE as follows.

###### Model 1 (GMM for extended ASE of undirected graphs).

Let

(11) |

be a family of density functions for a dimensional GMM random vector, where are the mixing probabilities, are the mean vectors, and are the covariance matrices. Furthermore, these parameters satisfy

(12) |

(13) |

and

(14) |

where is a positive semidefinite matrix, and is a identity matrix. In this notation, denotes the parameters , specifically , which belongs to the parameter space . To avoid trivialities and to ensure that is well-defined, we assume that and that if then Similarly, we restrict to be the smallest value, , such that the components all have spherical covariances and means.

We establish our probability model for the extended ASE of . Let the extended ASE be , then our conjecture states, for any ,

(15) |

approximately for sufficiently large , where is the density function defined in Model 1, is the true dimension of latent position, is the true number of blocks, and is the true underlying collection of parameters of the GMM. This conjecture states that the rows of the extended ASE are identically distributed as in the GMM, but we haven’t assumed that they are independent. In fact, it has been shown that the rows of ASE are not independent (Athreya et al., 2016; Tang et al., 2018). However, for ease of analysis we will proceed in the consistency theorem and in the calculation of BICs by ignoring dependency, because the later experimental results show that the independent assumption is tractable and acceptable. See Tang et al. (2017) for one recent treatment of the dependency.

## 4 Simultaneous Model Selection

### 4.1 Simultaneous model selection framework

The idea of simultaneous model selection is inspired by the basis of model comparison in Raftery and Dean (2006). Assume and are models that both describe the same random vector. By Bayes’ theorem, the posterior probability of the model is proportional to the product of the prior and the integrated likelihood, i.e. for

(16) |

where we call the integrated likelihood because it can be obtained by integrating over all the unknown parameters in the model, i.e.

(17) |

Since usually we assume no preference on the two models, we can ignore the prior probability term and just compare the integrated likelihoods. However, computing the integrated likelihood is impractical. Thus, we use the Bayesian information criterion (BIC).

Now we consider and as the model parameters in the vertex clustering problem. Let be the probability density function of the model which characterizes the rows of extended ASE . We assume two models differ from each other if and only if they have distinct model parameters. (Gaussian mixture models have well-known non-identifiabilities – in particular, labeling of components is arbitrary – which are of no practical concern in most GMM inference tasks.) So selecting a model from the family is equivalent to determining the pair of model parameters. Now we can recast the model selection problem in the simultaneous model selection framework as follows. Provided we have a family of -dimensional distributional models, each with a distinct pair of model parameters that determine the structure of the model, the model selection problem is to choose a model by comparing the BIC values evaluated on the observed throughout all pairs.

In the framework of simultaneous model selection (SMS), a probability model for the rows of the extended ASE is needed. The model parameter should play a similar role to the embedding dimension, which separates the informative dimension and redundant dimension in the extended ASE. The model parameter should be the number of mixture components in the model. If we have such a family of models that well approximates the distribution of the extended ASE with an appropriate , we can apply our SMS procedure. Fortunately, Model 1 exactly satisfies these requirements. To see this, let be the random graph and be the corresponding extended ASE. Let be the dimension of the latent position vectors, and let given by the dimension of be the number of blocks in the SBM. In Model 1, is the model parameter which decides the size of the informative part and is the model parameter which decides the number of components. Most importantly, the rows of approximately follow the distribution by the existing theorem and our conjecture. Therefore if we use this family of models in the SMS procedure, we expect the BIC value will be maximized with model parameter . In fact, if we assume that the rows of do asymptotically follow the distribution in the model, we can prove the consistency of the model parameter estimates obtained via our SMS procedure.

### 4.2 Consistency of model parameter estimates

We first define some notation. Let

(18) |

be a family of GMM density functions for a dimensional random vector, as defined in Model 1, where are the model parameters which determine a specific density function. For given constants and , let be a set of given parameters in the density function (18). We define

(19) |

for all . Here, is the Kullback-Leibler divergence of density from density , defined as

(20) |

Notice that this definition is self-consistent on , because and equality holds if and only if almost everywhere, by the properties of KL divergence. We say the model (18) is identifiable on the density , if for all , . In other words, there are no identical density functions from the model with different . (Again: GMMs do have some well-known and unconcerning non-identifiabilities, but the quotient space obtained therefrom is suitable for our purposes.) Let denote the BIC evaluated on with model , i.e.

(21) |

where is the number of parameters in the model, is the number of observations in , and is the maximum likelihood estimator (MLE) of the parameters from optimizing the loglikelihood

(22) |

where is the parameter space of the model with given .

Using the notation defined above, we here state our theoretical result as follows.

###### Theorem 1 (Consistency of model parameter estimates).

Let be a sequence of random matrices, where each element is a matrix with rows of -dimensional random vectors. If

a) every row in are independently identically distributed according to (18), with parameter , i.e. for an arbitrary ,

(23) |

i.i.d. for all ;

b) the model is identifiable on density ;

c) for all , the parameter space is a compact metric space;

then the estimates of model parameters given by

(24) |

(with a constant ) will converge to the truth, i.e.

(25) |

as .

We leave the proof of the theorem for the appendix. This strong theoretical support in addition to the advantages of SMS motivate us to conduct vertex clustering via SMS.

### 4.3 Algorithms based on SMS

We present a model-based clustering algorithm via simultaneous model selection with the Gaussian mixture model. We call our method MCG – model-based clustering by Gaussian Mixture Models. The entire procedure of MCG algorithm consists of three phases. First, the “parameter fitting” phase. We compute the maximum likelihood estimators of the GMM for each pair of . The MLEs are used to complete the density function while evaluating the likelihood on the data. Second, the “model selection” phase. We compute the BIC values for all pairs, then choose the one with the largest BIC as the model parameter given the data. Finally, the “clustering” phase. The likelihoods of all the data points are evaluated on the selected model with fitted parameters. Labels are assigned to each point by the maximize a posterior rule. A summary of the algorithm MCG is shown in algorithm 1.

Although we believe that simultaneous model selection has its advantage compared to consecutive model selection, it is uncertain whether including the redundant dimensions of the extended ASE in the clustering procedure is preferable or not. The reasoning can be explained by two aspects. First, the redundant dimensions may contain little information for the clustering – as indicated by the model, and by the simulations in the preceding section, only a single variance term contains potential clustering information. Second, choosing a smaller dimension in the clustering task may lead to better performance, especially for a small number of observations, due to the bias-variance tradeoff (Jain et al., 2000). This motivates a variation in the third phase in MCG algorithm. To be specific, in phase 1 and phase 2, Model 1 and the extended ASE are utilized just to find the estimate of embedding dimension. In phase 3, we can now truncate the extended ASE to the dimension which is estimated by the SMS procedure. In this context, redundant dimensions do not take part in the clustering procedure. Thus we may apply the traditional model-based clustering algorithm with regular GMM on the truncated embedding . Notice that the embedding dimension is determined by the SMS procedure, so the clustering results could be remarkably different than the algorithm under CMS framework. We call this algorithm MCEG, inspired by model-based clustering by GMM with embedding dimension determined via SMS. The outline of the steps of MCEG is shown in algorithm 2.

Because Algorithm 1 uses the full embedding, while Algorithm 2 uses only the dimensional, or “reduced” embedding, henceforth we will drop the use of the acronyms and refer to Algorithm 1 as the full algorithm and Algorithm 2 as the reduced algorithm.

There is another implementation that we consider here, which we refer to as the “two-step” algorithm. It is nearly equivalent to the reduced model – and in fact is identical to it, in those cases where the two models produce the same .

The loop in Algorithm 3 is over alone. There is indeed a loop over in the model based clustering, which halts when the BIC value decreases. Note that this approach does not necessarily produce the maximum likelihood solution; however, in the event the estimates of and are the same as those of the reduced model, the resulting models are identical. This eliminates the extra step of fitting the GMM to the dimensional embedding, for a computational advantage. This approach makes explicit that the informative part of the model need not be the full, unconstrained model described above; by using “mclust” explicitly in step 1, we have the ability to use the full range of models implemented in the package. In this paper we consider only the full model described above and do not investigate modifications that would allow for a wider range of constrained models, but it would be trivial to implement these within the two-step procedure.

## 5 Experimental Results

We evaluate the performance of the full and reduced algorithms by experiments on both synthetic and real data. First, we investigate the performance of the method for jointly estimating the embedding dimension and the number of components using the 2- and 3-block models with probability matrices defined by the matrices described above.

Next we compare our methods with the BIC ZG methods, the combination of a ubiquitous GMM approach via consecutive model based clustering. Since a constant is required as an input in the ZG algorithm in order to determine which elbow of the scree plot is taken, we denote by ZG and BIC ZG the algorithms for given . The job of deciding the ordinal number of the elbow is always subjective in practice, so we will consider ZG1, ZG2 and ZG3, the ZG algorithm which takes the 1st, 2nd and 3rd elbows respectively, at the same time in competition. Notice that even if one ZG (or corresponding BIC ZG) method outperforms our proposed SMS method in a specific setting, it does not mean that the ZG algorithm is superior to ours because the optimal will be different in a different setting. We will see this in the simulation. We apply the mclust R package (Fraley and Raftery, 2002) to perform the BIC algorithm. Additionally, we also perform two well-known heuristic vertex clustering methods for comparison. One is the Louvain algorithm proposed in Blondel et al. (2008); the other is the Walktrap algorithm proposed in Pons and Latapy (2005).

There are numerous criteria to evaluate the performance of a clustering result, including Jaccard (Jaccard, 1912), rand index (Hubert and Arabie, 1985), normalized mutual information (Danon et al., 2005) and variation of information (Meilă, 2007). Of these, we choose the well known adjusted rand index (ARI) as the measure of the similarity between the clustering result and the ground truth labels – in the simulations the ground truth is known, and it is this that is used to compute the ARI. As a corrected-for-chance version of the rand index, ARI normalizes the rand index so that the expected value of that between a random cluster and the ground truth is zero. The maximum value of ARI is , which indicates perfect agreement of two partitions. So a larger ARI means the clustering is performing better.

### 5.1 Numerical results on synthetic data

First, we ran Monte Carlo simulations for the two- and three-block models with the two matrices defining the block probabilities, as described above. We used in this simulation.

The results are depicted in Figure 7. There is a slight tendency, particularly for smaller graphs, for the method to choose slightly larger than the rank of . Since the model was developed under asymptotic assumptions, this is likely due to the slight model-mismatch that may occur for small . Recall from Figure 3 that the assumption that the redundant variables all have the same within-class variance is not met for small . However, the correct is chosen nearly always (all but once for the 2-block model, and all but times for the 3-block model, out of a total of simulations for each model).

Next we generate a graph from a stochastic block model SBM() by specifying the block probability matrix , prior block probability and number of vertices . The adjacency matrix represents . Then we apply the extended adjacency spectral embedding on the graph, denoted by . For simplicity, we fix . Let , , . We vary to change the angle between two latent vectors.

For this experiment we consider only the reduced model, which we refer to as “our model”. Figure 8 shows the difference of ARI (computed using the ground truth provided by the simulations) between our model and the BIC ZG methods in boxplots. All of the differences are paired for Monte Carlo trials. A point appearing to be bigger than means our model has a higher ARI than the corresponding BIC ZG method in that Monte Carlo replica. Figure 8(left panel) shows the result under the setting with . We find our model outperforms BIC ZG1 and BIC ZG3, and has many ties with BIC ZG2. We did a sign test for the paired differences of ARI, where the null hypothesis is that the two methods are equally good or BIC ZG is better ( with respect to Binomial distribution), and the alternative hypothesis is that our method is better (). We ignore ties in the sign test. The p-values for our model comparing to BIC ZG1, BIC ZG2 and BIC ZG3 are , and respectively. The small p-values suggest that our model is superior, surely, to BIC ZG1 and BIC ZG3. In Figure 8(right panel), we use in the matrix. Similarly, the p-values of a sign test for our model compared to BIC ZG1, BIC ZG2 and BIC ZG3 are , and respectively. In this case, our model has similar performance to BIC ZG1, but outperforms BIC ZG2 and BIC ZG3. In both cases, our model has joint best performance with respect to ARI. In contrast, none of the BIC ZG methods win in both cases. Considering that in practice we need to fix an elbow in BIC ZG methods without knowing the ground truth, the reduced model is a more robust algorithm.

Figure 9 shows the distribution of ARI for the reduced method for a range of values for in the matrix. The gray boxes in the figure indicate those for which more extensive experiments are reported, comparing the reduced and full methods to other methods. The reason for the first low box with median near % is the fact that for very small there is a high probability of the graphs being disconnected, in which case simple spectral embedding is inappropriate.

Figure 10 shows the mean of ARI for all methods, including the existing heuristic Louvain and Walktrap algorithms. The random graph with vertices is generated from a 2-block SBM with block probability matrix [0.2, ; , 0.1]. The parameter is varying from to . We observe that the Louvain and Walktrap algorithms do not perform well for large , so we may conclude that these two heuristic vertex clustering algorithms are not suitable for specific SBM graphs. To have a detailed look, Figure 11 shows the mean of ARI for MCEG and ZGs. In Figure 11(a), all methods have decreasing ARI as increases. This is because the angle between two latent vectors become smaller, so the cluster centers get closer. Out of all the methods, our proposed MCEG performs well for all ’s. In Figure 11(b), the mean of is plotted. We can see that MCEG is the closest one to zero, which means it estimates better than the other methods.

### 5.2 Demonstration on connectomes

We demonstrate the performance of the MCG (full model) and MCEG (reduced model) algorithms via the SMS procedure on real data sets of human connectomes, a collection of brain graphs induced from brain neuronal connections. Basically, the connectome describes the network of the brain consisting of neurons (or collections thereof) as vertices and synapses (or structural connections) as edges. It is fundamentally helpful to unlock the structural and functional unknowns in the human brain in cognitive neuroscience and neuropsychology by studying the topological properties of the connectome. The raw data is collected by diffusion magnetic resonance imaging (dMRI), which can represent the structural connectivity within the brain. The macro-scale connectomes are estimated by NeuroData’s MRI to graphs (NDMG) pipeline (Kiar et al., 2018), which is designed to produce robust and biologically plausible connectomes across studies, individules and scans. As the output of the NDMG pipeline, the brain graphs are generated. The vertices of the graph represent regions of interest (ROI) identified by spatial proximity, and the edges of the graph represent the connection between ROIs via tensor-based fiber streamlines. Specifically, there is an edge for a pair of ROIs if and only if there is a streamline passing between them. The graph is undirected since the raw dMRI data doesn’t have direction information. In our demonstration, we pick the graphs in study BNU1 with parcellation DS01216. For more details of the data set, we refer the readers to Kiar et al. (2018).

This specific data set with parcellation DS01216 consists of 114 connectomes (57 subjects with 2 scans each), with 1215 vertices for each graph. There are two attributes for each vertex, the region of interest, in the graph. One attribute is the hemisphere, which could be either left, right or other. The other attribute is tissue, which could be either gray, white or other. For ease of illustration, we consider only the regions in the left or right hemisphere, and in gray or white tissue. So we get an induced subgraph from the original connectome by deleting the vertices with label “other” in hemisphere or tissue attributes. Then we extract the largest connected component of that subgraph so as to support the spectral embedding. This yields 114 connected undirected graphs, with approximately 760 vertices for each graph. Each vertex has been assigned two labels, one represents the hemisphere and the other represents the tissue. These are treated as the ground truth for the clustering structure in the graph. We apply clustering methods on the 114 graphs to compare our MCG/MCEG algorithms via SMS framework with BIC ZG algorithm via CMS framework. The ARI is calculated by comparing the clustering results with three separate versions of ground truth, namely hemisphere, tissue, and the combination of the two. Specifically, each vertex of a connectome is assigned a label left or right from the 2-cluster attribute hemisphere, and a label gray or white from the 2-cluster attribute tissue. We also assign a label (left-gray, left-white, right-gray or right-white) from the 4-cluster attribute by combining the hemisphere and tissue.

Figure 12 presents the estimates of the model parameter pair for 114 connectomes. The red dots represent the results from simultaneous model selection, and other colors are the results from BIC ZG. Consequently, there are four points for each graph, representing the pair of estimates by SMS, BIC ZG1, BIC ZG2 and BIC ZG3 respectively. The coordinates of the points are slightly perturbed so as to view the occlusion. We observe that the estimated by BIC ZG methods are spread out up to the boundary . In contrast, our SMS method gives a smaller and more concentrated estimate of number of clusters.

For each graph and one specific algorithm, we have three ARIs indicating the clustering accuracy for three different attributes. We are interested in how well our MCG/MCEG algorithms perform compared with the traditional BIC ZG algorithms. As an example, Figure 13 shows the result of the paired difference of ARIs between MCG/MCEG and BIC ZG methods. Attribute hemisphere (left or right) is considered when computing ARI. Fixing two algorithms in competition, the differences of ARI are taken pair-wise for all 114 graphs. We plot the histogram of those differences. More positive values in the histogram indicates stronger evidence that MCG/MCEG outperforms BIC ZGs, since higher ARI indicates that clustering result is closer to the ground truth. From Figure 13(a)-(c) we claim that MCG dominates all BIC ZGs, following the observation that obviously more difference values are positive. In Figure 13(d)-(f), although the number of positive values is close to that of negative ones, the reduced model, MCEG, still wins against BIC ZGs slightly because of higher ARIs on average. Table 1 gives the results on all three attributes, where the number of graphs (out of 114) on which ARI of MCG/MCEG is strictly larger than existing methods is reported in the column “#win”. Here we also consider the Louvain and Walktrap methods in the competition. We calculate a p-value by conducting a binomial test: , , where is the probability that MCG/MCEG wins. The p-value evaluates the confidence of whether our method performs better in the connectome data set. The results show that MCG dominates in all cases against BIC ZGs. In addition, MCG/MCEG outperforms all other methods with respect to tissue attribute. Notice that the Louvain method demonstrates good performance for hemisphere and 4-block attributes, but it does not work well (with very small ARI values) for the tissue attribute. An analogous “two-truths” phenomenon has been discovered in the work of Priebe et al. (2019), where the authors find that Laplacian spectral embedding (LSE) better captures the hemisphere affinity structure while ASE better captures the tissue core-periphery structure. So in this manner Louvain is good at detecting the hemisphere affinity structure but is bad at detecting the tissue core-periphery structure. This is similar to the behavior of LSE on connectome clustering (Priebe et al., 2019).

MCG | MCEG | ||||
---|---|---|---|---|---|

#win | p-value | #win | p-value | ||

Hemisphere | BIC ZG1 | 93 | 5.8e-13 | 66 | 0.037 |

BIC ZG2 | 86 | 7.6e-9 | 62 | 0.151 | |

BIC ZG3 | 85 | 2.4e-8 | 64 | 0.080 | |

Louvain | 20 | 1 | 22 | 1 | |

Walktrap | 70 | 5.6e-3 | 52 | 0.800 | |

Tissue | BIC ZG1 | 69 | 0.009 | 41 | 0.998 |

BIC ZG2 | 102 | 0 | 81 | 1.6e-6 | |

BIC ZG3 | 82 | 5.9e-7 | 56 | 0.537 | |

Louvain | 110 | 0 | 101 | 0 | |

Walktrap | 114 | 0 | 112 | 0 | |

4-block | BIC ZG1 | 82 | 5.9e-7 | 57 | 0.463 |

BIC ZG2 | 89 | 1.8e-10 | 60 | 0.256 | |

BIC ZG3 | 86 | 7.6e-9 | 67 | 0.024 | |

Louvain | 32 | 1 | 34 | 1 | |

Walktrap | 59 | 0.320 | 94 | 1.2e-13 |

## 6 Conclusion

This paper attempts to address the issue by establishing a novel model selection framework specifically for vertex clustering on stochastic block model graphs.

In the first part of the paper, we propose the extended adjacency spectral embedding (extended ASE), in which the embedding is performed with a fixed dimension. Under the framework of model-based clustering, we propose a family of specific Gaussian mixture models (GMM) to parameterize the entire extended ASE. The basis of the model comprises of a state-of-the-art distributional result for the informative dimensions, as well as evidence from principled simulations for the redundant dimensions.

In the second part of the paper, we propose a simultaneous model selection (SMS) framework to address the issue occurring in the consecutive model selection. The framework is specifically tailored for the vertex clustering task on stochastic block model graphs. In contrast with consecutive model selection, our SMS identifies the embedding dimension, mixture complexity and membership of each vertex simultaneously. Moreover, we state and prove a theorem on the consistency of model parameter estimates. The theorem claims that the estimates in the model selection procedure given by our SMS method converge to the underlying truth for large graphs, provided the extended ASE follows the distribution in our proposed model. Based on SMS, we also develop two heuristic algorithms to solve the vertex clustering problem. The effectiveness of the algorithms are demonstrated in simulations and a real data experiment.

Note that the rank of , the “best” embedding dimension for clustering, and the number of clusters interact in a rather complex manner. In Figure 1, the right plot shows that is likely to be the correct embedding dimension (and this will be even more obvious for larger values of ) and yet both the rank of the matrix and the number of groups is . Directly utilizing a measure of clustering performance in the parameter selection and modeling is an area for further research.

We have focused on the so called “hard clustering” procedure in which each vertex is assigned to a unique cluster. However, the use of Gaussian mixture model clustering allows for “soft clustering” whereby the likelihood ratio is used as the assignment probability, rather taking the to provide a hard threshold.

We have also focused exclusively on undirected graphs. The same technique is valid for directed graphs (see the discussion in Sanna Passino and Heard (2019)), wherein the singular value decomposition is used, and it is proposed as the embedding. However, it is not clear that the same dimension is appropriate for the vectors as the . We suggest that there are several other approaches to consider: , clustering and separately via ASE spectral clustering, or some method of jointly clustering the two embeddings. These are areas for future research.

## Acknowledgements

This work was partially supported by DARPA D3M contract FA8750-17-2-0112 and by the Naval Innovative Science and Engineering program at the Naval Surface Warfare Center, Dahlgren Division.

The required R source codes and the data to reproduce the results in this manuscript can be found in https://github.com/youngser/dhatkhat.

## 7 Appendix

For the proof of Theorem 1, we begin with the following lemma.

###### Lemma 1.

Proof. By the definition of Kullback-Leibler divergence in (20),

(27) |

So we can prove the lemma by showing

(28) |

and

(29) |

as . (28) is the direct result of the law of large number. (29) is the result of theorem 2.2 in White (1982) then followed by Slutsky’s theorem.

Now we show the proof of Theorem 1 as follows.