Comparative analysis on the selection of number of clusters in community detection
We conduct a comparative analysis on various estimates of the number of clusters in community detection. An exhaustive comparison would require testing all possible combinations of frameworks, algorithms, and assessment criteria. In this paper, we focus on the framework based on the stochastic block model and investigate the performance of greedy algorithms, statistical inference, and spectral methods. For the assessment criteria, we consider modularity, the map equation, the Bethe free energy, prediction errors, and isolated eigenvalues. The analysis makes apparent the tendencies toward overfitting and underfitting that the assessment criteria and algorithms exhibit. In addition, we propose the alluvial diagram as a suitable tool to visualize statistical-inference results, which can be useful for determining the number of clusters.
Community detection is a coarse-graining process for networks. Whereas the original dataset, given as a network, possesses quite rich information, it is often too microscopic for its important structures to be interpreted. For better interpretability, a community detection algorithm summarizes (i.e., clusters or partitions) the dataset by aggregating the vertices and edges of densely connected components. That is, the detailed relational information of similar vertices is discarded, while an important macroscopic structure is preserved. A set of aggregated vertices is regarded as a cluster, or a community.
A straightforward approach, or framework, for community detection is to optimize an objective function that evaluates the quality of a clustering. Another popular approach is based on statistical inference and considers a generative model of a network. It is often formulated using the so-called stochastic block model Holland et al. (1983); Wang and Wong (1987); Karrer and Newman (2011) as the generative model, which is a random graph with a modular structure; the community structure can then be inferred by fitting the network to the model. While these two approaches may seem very different, the former can sometimes be formulated as a limiting case (the zero-temperature limit in physics terminology) of the latter, and in this paper community detection is mainly explained in terms of the latter approach.
For these frameworks, a number of algorithms have been proposed, such as greedy algorithms Fortunato (2010); Blondel et al. (2008); Rosvall and Bergstrom (2008), spectral methods Luxburg (2007); Newman (2006a); Krzakala et al. (2013); Saade et al. (2014a), and inference algorithms such as expectation-maximization (EM) algorithms Daudin et al. (2008); Latouche et al. (2012); Decelle et al. (2011a); Hayashi et al. (2016); Kawamoto and Kabashima (2017) and Monte Carlo methods Nowicki and Snijders (2001); Peixoto (2014); Newman and Reinert (2016), to name a few. In this paper, the following are considered: the Louvain method Blondel et al. (2008) and Infomap Rosvall and Bergstrom (2008) for the greedy algorithms, the modularity matrix Newman (2006b, a) and the non-backtracking matrix Krzakala et al. (2013) for the spectral methods, and the EM algorithm with belief propagation (BP) Decelle et al. (2011b); Kawamoto and Kabashima (2017) for the inference algorithm.
When community detection is performed, the number of clusters needs to be determined. In other words, the model complexity needs to be specified, i.e., model selection needs to be performed. This process consists of determining the partition that describes the modular structure most efficiently, or evaluating whether an obtained partition is statistically significant. Just as the quality of a partition into a given number of clusters varies depending on the very definition of a cluster, the appropriateness of the number of clusters also varies depending on the principle followed. Meanwhile, it is difficult to decide which principle to apply to a given dataset. Therefore, it is important to investigate the typical behavior and biases of each criterion. For example, some criteria may behave very differently from others in some cases, or some criteria may be more sensitive to the accuracy of a particular algorithm. Moreover, the dangers of underfitting and overfitting are often not symmetric. In the case of community detection, it is often safer to underfit than to overfit, because the former only implies a different level of coarse graining, while the latter implies the detection of fictitious small clusters.
In this paper, a comparative analysis of various criteria that estimate the number of clusters is conducted. This analysis is distinct from other comparative analyses in the following sense. Whereas the performance of community detection depends entirely on the framework (or objective function), algorithm, and assessment criterion used, a benchmark test often employs only one specific combination of them. For example, Infomap, a greedy algorithm for the map equation Rosvall and Bergstrom (2008, 2011), is a popular algorithm and frequently appears in benchmark tests. However, when the performance of a certain objective function or assessment criterion is compared with that of the map equation, it is fair to use a common algorithm. For this reason, the performance of various assessment criteria using the same statistical inference algorithm is compared. In addition, the performance of the same assessment criterion on the same dataset using different algorithms is investigated. Therefore, when a criterion is ill-behaved, it can be determined whether this is due to the criterion itself or to the algorithm used.
As observed below, the validation curves of the assessment criteria sometimes change very gradually. In such a case, it is not easy to determine the plausible number of clusters from the assessment criteria alone, and a finer inspection of the partition actually obtained is required. For this purpose, a visualization technique called the alluvial diagram Rosvall and Bergstrom (2010) is proposed as a suitable tool; not only because it shows how the network is partitioned, but also because it allows the significance of a partition obtained from an inference algorithm to be evaluated.
The remainder of the paper is organized as follows. First, stochastic block models are defined in Sec. II to set the basic framework. Second, the algorithms used to determine the cluster assignments and to estimate the number of clusters are introduced in Sec. III. Third, the assessment criteria of the number of clusters are explained in Sec. IV. Then, in Sec. V, the results of the comparative analyses are shown. In Sec. VI, how the alluvial diagram helps determine the number of clusters is explained. Finally, Sec. VII is devoted to the summary and discussion.
II. Stochastic block models
Community detection based on a stochastic block model is considered as the basic framework. The sets of vertices and edges are denoted as V and E, respectively, and their cardinalities as N and M.
II.1 Standard stochastic block model
The most standard version of stochastic block models is constructed as follows. We first consider a set of vertices without edges. For each vertex, we randomly specify the cluster assignment σ_i, where i is the index of a vertex and the number of clusters q is given as an input. The probability of the cluster sizes can also be specified: it is a prior distribution of the cluster assignments and is expressed by a multinomial distribution. Then, the edges are generated randomly according to the cluster assignments of each vertex pair, where the connection probability is specified by an element of the affinity matrix ω; the edge probability distribution of a vertex pair is given by the Bernoulli distribution. Thus, the likelihood of the stochastic block model is given as
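In a standard notation, which we also adopt below (A for the adjacency matrix, σ_i for the cluster assignment of vertex i, γ_r for the prior probability of cluster r, and ω for the affinity matrix; the specific symbols are our choice, as the original display equation was not preserved), this likelihood reads

```latex
P(A, \sigma \mid \gamma, \omega)
  = \prod_{i} \gamma_{\sigma_i}
    \prod_{i<j} \omega_{\sigma_i \sigma_j}^{A_{ij}}
    \left(1 - \omega_{\sigma_i \sigma_j}\right)^{1 - A_{ij}}.
```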
When a higher connection probability is provided for pairs of vertices with the same cluster assignment than for pairs of vertices with different cluster assignments, i.e., ω_in > ω_out, a set of random graphs with a community structure can be generated. Generating the stochastic block model is a forward problem, and community detection is its inverse problem, i.e., the inference of σ, γ, and ω.
II.2 Degree-corrected stochastic block model with restricted model-parameter space
Whereas the standard stochastic block model is very flexible, it is often not suitable for fitting real-world networks, mainly because it can only have a binomial degree distribution, which does not hold for many datasets. To resolve this problem, the so-called degree-corrected stochastic block model Karrer and Newman (2011) was proposed. Following Ref. Karrer and Newman (2011), the Bernoulli distribution for the edge probability of each vertex pair is approximated by the Poisson distribution, which is justified when the network is sparse. In this model, it is assumed that the mean of the Poisson distribution depends on the degrees of the vertex pair as well as on the affinity matrix. Hence, for a given affinity matrix ω and number of clusters q, the likelihood of the degree-corrected stochastic block model is given as
where d_i is the degree of vertex i. Here and hereafter, the uniform prior distribution for the cluster assignments is considered for simplicity. Moreover, we neglect the existence of self-loops, which is also justified when the network is sparse. The log likelihood reads
where we neglected a constant term.
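In the same notation (d_i for the degree of vertex i), a standard form of the Poisson likelihood and of the resulting log likelihood, which we sketch here since the original display equations were not preserved, is

```latex
P(A, \sigma \mid \omega)
  \propto \prod_{i<j}
      \left(d_i d_j\, \omega_{\sigma_i \sigma_j}\right)^{A_{ij}}
      e^{-d_i d_j\, \omega_{\sigma_i \sigma_j}},
\qquad
\log P(A, \sigma \mid \omega)
  = \sum_{i<j} \left( A_{ij} \log \omega_{\sigma_i \sigma_j}
      - d_i d_j\, \omega_{\sigma_i \sigma_j} \right) + \text{const.}
```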
While the stochastic block model of Eq. (3) is able to express various modular structures, hereafter, we restrict our interest to the community structure. The affinity matrix is then restricted to the form
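With δ_rs denoting the Kronecker delta, this community-structure restriction is commonly written as

```latex
\omega_{rs} = \omega_{\mathrm{in}}\, \delta_{rs}
  + \omega_{\mathrm{out}} \left(1 - \delta_{rs}\right),
\qquad \omega_{\mathrm{in}} > \omega_{\mathrm{out}}.
```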
In other words, only whether a vertex pair is assigned to the same cluster or not is distinguished. This restriction of the inference algorithm using BP was proposed in Ref. Zhang and Moore (2014). One reason this restriction is employed is that it considerably reduces the computational cost, so that the performance on large networks with many clusters can be evaluated. Another reason is that some of the assessment criteria compared in this paper are specialized to the community structure, and their generalizations to arbitrary modular structures are not known.
The log likelihood Eq. (3) is then simplified to
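Substituting the restricted affinity matrix, a sketch of this simplified log likelihood (in the notation assumed above) is

```latex
\log P(A, \sigma \mid \omega_{\mathrm{in}}, \omega_{\mathrm{out}})
  = \sum_{i<j} \left[ A_{ij} \log\frac{\omega_{\mathrm{in}}}{\omega_{\mathrm{out}}}
      - d_i d_j \left(\omega_{\mathrm{in}} - \omega_{\mathrm{out}}\right) \right]
      \delta_{\sigma_i \sigma_j}
  + \sum_{i<j} \left( A_{ij} \log \omega_{\mathrm{out}}
      - d_i d_j\, \omega_{\mathrm{out}} \right) + \text{const.}
```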
Note that the second sum does not depend on the cluster assignment σ. We consider this stochastic block model for community detection.
III. Community detection algorithms
III.1 Statistical inference
The goal of community detection is to determine the set of cluster assignments σ; it is a hidden variable, and the model parameter ω is learned for a given number of clusters q. Fitting the network to the stochastic block model is carried out by maximizing the marginal log likelihood or, equivalently, minimizing the free energy,
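In the notation above, this marginal-likelihood maximization and the corresponding free energy can be sketched as

```latex
\hat{\omega} = \operatorname*{argmax}_{\omega} \log P(A \mid \omega)
  = \operatorname*{argmin}_{\omega} F(\omega),
\qquad
F(\omega) = -\log \sum_{\sigma} P(A, \sigma \mid \omega).
```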
Unfortunately, this optimization problem is computationally difficult in general, and a number of approximate methods have been proposed in the literature. In this paper, the EM algorithm is employed, which is a popular method for fitting the stochastic block model. To reach the minimum of the free energy, the EM algorithm iteratively carries out the optimization of the distribution of the hidden variable σ with a fixed model parameter ω (E step) and the optimization of ω with a fixed distribution of σ (M step). For the E step, we use the BP algorithm, which will be explained later in this section. Thus, we obtain the probability distribution of the cluster assignment of each vertex such that Eq. (6) is expected to be minimized. Hereafter, we often omit the number of clusters q in the argument; it is always given as an input, and we try various values of q for model assessment.
When the affinity matrix is fixed as a constant in the E step, the free energy reads
Here, Q(σ), γ, and β are the modularity function, resolution parameter Reichardt and Bornholdt (2006), and inverse temperature, respectively. This indicates that modularity maximization can be regarded as a special case of inference using the stochastic block model; the partition with the maximum modularity coincides with the result of the statistical inference when the entropic effect is ignored, i.e., β → ∞. This is known as the maximum a posteriori (MAP) estimate Nishimori (2001). The connection between likelihood maximization and modularity maximization was first discussed in Ref. Newman (2013) in the context of spectral graph partitioning; the above relation was pointed out in Ref. Zhang and Moore (2014), which discusses a finite-temperature formulation of modularity maximization. It is known that β also plays the role of a resolution parameter Schülke and Ricci-Tersenghi (2015) that controls the typical scale of clusters.
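A sketch of this correspondence, in the notation assumed above: the σ-dependent part of the restricted log likelihood yields a Gibbs posterior of the form

```latex
P(\sigma \mid A) \propto e^{\beta M Q_{\gamma}(\sigma)},
\qquad
Q_{\gamma}(\sigma) = \frac{1}{M} \sum_{i<j}
  \left( A_{ij} - \gamma\, \frac{d_i d_j}{2M} \right) \delta_{\sigma_i \sigma_j},
```

with the identifications β = log(ω_in/ω_out) and γ = 2M(ω_in − ω_out)/β.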
In Refs. Zhang and Moore (2014); Schülke and Ricci-Tersenghi (2015), γ is set to unity and β is treated as an input parameter, which corresponds to fitting a network with a fixed affinity matrix. However, it is more natural to learn these parameters instead. The learning of ω_in and ω_out can be carried out in a straightforward manner: they are obtained as the values that minimize the free energy, Eq. (7). The derivatives with respect to the model parameters yield
where δ_rs is the Kronecker delta, the angle brackets denote the average with respect to the current estimate of the posterior distribution, and the hat notation indicates an estimated quantity. Let n_r denote the number of nodes within cluster r. As mentioned in Ref. Zhang et al. (2015), if we assume a posterior distribution under which n_r does not fluctuate significantly,
Note that the update of model parameters only costs linear time; therefore, it is not a bottleneck in the algorithm.
To evaluate the probability of the cluster assignment of each vertex, BP is used (see Refs. Mézard and Montanari (2009); Decelle et al. (2011b, a); Zhang and Moore (2014) for details). The marginal probability of the cluster assignment of vertex i can be obtained by marginalizing the likelihood Eq. (II.2). Using the tree approximation, which is valid for sparse networks, it can be expressed as
where ∂i denotes the set of neighboring vertices of vertex i. In Eq. (15), the cavity bias from vertex j to vertex i indicates the marginal probability of the assignment of vertex j without the contribution from vertex i, and the prefactor is the normalization factor. Assuming that the cavity bias from a non-adjacent vertex is well approximated by its full marginal, we can further approximate the messages and marginals as
It should be noted that the iteration of Eqs. (13), (14), (16), and (17) does not minimize the free energy Eq. (6) itself, but minimizes its approximated quantity called the Bethe free energy. We show the specific form of the Bethe free energy in Sec. IV.1.
The critical values of β for the stochastic block model have been discussed in Refs. Zhang and Moore (2014); Schülke and Ricci-Tersenghi (2015). There are three phases, depending on the value of β and the strength of the community structure: the retrieval phase, the paramagnetic phase, and the spin-glass phase. In the retrieval phase, the fixed point of BP with the minimum Bethe free energy correctly indicates the community structure. In the paramagnetic phase, BP converges to the so-called factorized state as the minimum of the Bethe free energy. In the factorized state, the marginal probability distribution of the cluster assignment is uniform for every vertex; in other words, every vertex has an equal probability of joining any cluster. Therefore, the resulting partition does not exhibit any community structure. Finally, the spin-glass phase is the phase in which BP typically does not converge. This is also the case in which a statistically significant community structure cannot be retrieved. In the case of the standard stochastic block model with equal-size clusters, for a given number of clusters q, the critical value of β between the paramagnetic phase and the spin-glass phase, obtained from the stability of the factorized state against a random perturbation, is
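With c denoting the average degree, this critical value, as given in Ref. Zhang and Moore (2014) (reproduced here since the original display equation was not preserved), takes the form

```latex
\beta^{*}(q) = \log\!\left( \frac{q}{\sqrt{c} - 1} + 1 \right).
```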
where c is the average degree. The lower-bound estimate of β that prevents BP from going into the paramagnetic phase is given by:
In practice, it cannot be uniquely determined whether BP belongs to the retrieval, paramagnetic, or spin-glass phase, because real-world networks do not precisely emulate the stochastic block model. However, these values work as reference values of β for obtaining an intuition regarding which phase BP belongs to. In Ref. Zhang and Moore (2014), it is suggested that the critical value between the paramagnetic and spin-glass phases should be used as an input, because BP is expected to belong to the retrieval phase at this value.
The effect of the absence of model-parameter learning can be interpreted as follows. Given that the model only distinguishes whether a pair of vertices is in the same cluster or not, the specific values of ω_in and ω_out may not be crucial for the resulting cluster assignment. Conversely, when other statistical quantities such as the likelihood or cross-validation errors are considered, erroneous model-parameter estimates may cause a large bias. As observed in Sec. V, the results of the criteria that depend only on the cluster assignments (e.g., modularity and the minimum description length of the map equation) are not very sensitive to model-parameter learning, while the criteria that utilize the model parameters (e.g., the Bethe free energy and cross-validation errors) are ill-behaved without learning.
III.2 Greedy algorithms
In the previous section, the free energy minimization based on the stochastic block model was considered. In the limit β → ∞, it reduces to the maximization of the modularity function in Eq. (8), or the energy minimization. In this case, the probability distribution with respect to σ is no longer considered, and the goal is to find the best cluster assignment of each vertex.
While a number of algorithms have been proposed in the literature, greedy algorithms, such as the Louvain method Blondel et al. (2008), are perhaps the most widely used in practice. Another greedy algorithm for community detection that we analyze is Infomap Rosvall and Bergstrom (2008), which optimizes the map equation Rosvall and Bergstrom (2008) (see Sec. IV.3 for the details of the map equation). In such algorithms, we assign a unique cluster label to each vertex at the beginning, i.e., σ_i = i, and then merge and update the assignments by referring to the neighboring vertices, so as to improve the value of the objective function. Note that the number of clusters is determined automatically during the optimization process. Although these greedy algorithms are extremely fast, as we will observe in Sec. V.1, they tend to overfit heavily when the algorithm is trapped in a local extremum of the energy landscape. The situation is particularly severe when the landscape is glassy Good et al. (2010).
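As a minimal illustration of the local-move strategy (a sketch only, not the actual Louvain or Infomap implementation; the Louvain method additionally aggregates clusters into supervertices and repeats, and the function and variable names here are ours), one greedy pass for modularity can be written as follows. Each vertex is tentatively moved to the candidate cluster with the largest modularity gain:

```python
def greedy_modularity_pass(adj, labels, gamma=1.0):
    """One pass of greedy local moves: each vertex is relabeled to the
    (neighboring) cluster with the largest modularity gain.
    `adj` is an adjacency list; `labels` is modified in place.
    Returns the number of vertices that changed cluster."""
    n = len(adj)
    deg = [len(adj[i]) for i in range(n)]
    two_m = float(sum(deg))           # 2M, twice the number of edges
    deg_sum = [0.0] * n               # sum of degrees per cluster label
    for i in range(n):
        deg_sum[labels[i]] += deg[i]
    moved = 0
    for i in range(n):
        old = labels[i]
        deg_sum[old] -= deg[i]        # temporarily detach vertex i
        links = {old: 0.0}            # edges from i to each candidate cluster
        for j in adj[i]:
            links[labels[j]] = links.get(labels[j], 0.0) + 1.0
        # modularity gain (up to a constant factor) of joining cluster c
        best = max(links, key=lambda c: links[c] - gamma * deg[i] * deg_sum[c] / two_m)
        if best != old:
            moved += 1
        labels[i] = best
        deg_sum[best] += deg[i]       # reattach
    return moved
```

Repeating such passes until no vertex moves converges to a local optimum, which is exactly where the overfitting discussed above can occur.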
III.3 Spectral methods
Other commonly used algorithms are spectral methods. In this section, the focus is on the case of two clusters (bisection). Let us consider the case of modularity maximization. While maximizing modularity is originally a discrete optimization problem, the assignments are relaxed to a real-valued vector with a spherical normalization constraint, i.e.,
Here, B is a matrix whose elements are given as
This matrix is known as the modularity matrix Newman (2006b, a). If Eq. (22) is rewritten using a Lagrange multiplier, an eigenvalue problem is obtained, and the leading eigenvector is expected to be correlated with the optimum assignments. Although the modularity matrix is dense, because its density is due solely to the rank-one second term in Eq. (23), its leading eigenvalues and eigenvectors can be computed efficiently using power iteration.
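As a sketch of this rank-one trick (the function name is ours), the modularity matrix B = A − k kᵀ/2M is applied as a matrix-vector product Bv = Av − k(kᵀv)/2M rather than stored densely, and a shifted power iteration extracts the algebraically largest eigenpair:

```python
import numpy as np

def leading_modularity_eigenpair(adj_list, iters=2000, seed=0):
    """Shifted power iteration for the leading eigenpair of the modularity
    matrix B = A - k k^T / 2M, never forming B as a dense matrix."""
    n = len(adj_list)
    k = np.array([len(nb) for nb in adj_list], dtype=float)
    two_m = k.sum()

    def matvec(v):
        # A v exploits sparsity; the rank-one correction costs O(N)
        av = np.array([sum(v[j] for j in nb) for nb in adj_list])
        return av - k * (k @ v) / two_m

    # shift so that the algebraically largest eigenvalue of B dominates in
    # magnitude (every eigenvalue of B is bounded by 2 k_max in modulus)
    shift = 2.0 * k.max() + 1.0
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    for _ in range(iters):
        w = matvec(v) + shift * v
        v = w / np.linalg.norm(w)
    return v @ matvec(v), v     # Rayleigh quotient and eigenvector
```

For a graph of two triangles joined by a single edge, the leading eigenvector splits the two triangles by sign, illustrating the bisection described above.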
A more classical spectral method is based on the minimization of an objective function called the normalized cut Shi and Malik (2000). For a bisection, the normalized cut is defined as
Analogous to the case of modularity, the continuous relaxation with the spherical normalization constraint yields the eigenvalue problem with respect to the normalized Laplacian , which is defined as:
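In the standard definition, with D the diagonal degree matrix,

```latex
L = I - D^{-1/2} A\, D^{-1/2}.
```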
where I is the identity matrix and D is the diagonal matrix of degrees (see Ref. Luxburg (2007) for details).
The optimizations of the normalized cut and modularity look different. However, it is known that minimizing the normalized cut is equivalent to maximizing modularity with a special choice of the resolution parameter Kawamoto and Kabashima (2015a) at the level of discrete optimization. Moreover, when the problem is relaxed to a continuous optimization, it is also possible to formulate the modularity maximization as the eigenvalue problem of the normalized Laplacian Newman (2013); it can be done by imposing a degree-dependent normalization constraint instead of the spherical normalization constraint and setting the resolution parameter to unity. These relationships are summarized in Fig. 1.
In general, the leading eigenvectors are expected to be correlated with the optimum q-way partition. Thus, to determine the assignment of each vertex, those eigenvectors have to be rounded, e.g., using the k-means method. However, the focus of this paper is only on the eigenvalues, because they are sufficient to estimate the number of clusters.
While the above two spectral methods are based on energy minimization, a spectral method related to the Bethe free energy minimization was proposed in Ref. Krzakala et al. (2013). The matrix that appears in this method is called the non-backtracking matrix, and it is derived from the linear stability analysis of the BP algorithm that minimizes the Bethe free energy. The non-backtracking matrix is not symmetric, and its specific form is given as:
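Indexed by the 2M directed edges, the standard definition reads

```latex
B_{(i \to j),\,(k \to l)} = \delta_{jk} \left( 1 - \delta_{il} \right),
```

that is, a non-backtracking walk may step from the directed edge i→j to k→l only if j = k and l ≠ i.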
A symmetric variant of the non-backtracking matrix is also proposed in Ref. Saade et al. (2014a).
In Sec. IV.4, we will discuss the properties of the spectra of the above matrices and their use as the assessment criteria of the number of clusters.
IV. Assessment criteria of the number of clusters
In this section, we explain the assessment criteria of the number of clusters. To determine the number of clusters using an algorithm in which it is given as an input, we assess the quality of the clustering based on a criterion as we sweep the input value q. It should be noted that, although the input value q is the maximum number of clusters that the vertices can be assigned to, the resulting partition might have fewer than q clusters.
IV.1 Bethe free energy
In the present framework of statistical inference, the most natural assessment is to measure the free energy and observe its saturation as the number of clusters q is incremented. When the network is generated by a block model with q* clusters, the marginal likelihood will not increase for q > q*. Therefore, we expect that a parsimonious number of clusters can be selected from the saturation of the free energy.
As mentioned above, the algorithm using BP does not minimize the free energy itself. Instead, it minimizes the Bethe free energy as an approximated quantity, and it can be written in terms of the cavity bias and affinity matrix as follows.
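Schematically, the Bethe free energy decomposes into vertex and edge contributions built from the BP normalization factors (a generic form; the model-specific expression determines Z^i and Z^ij in terms of the cavity biases and the affinity matrix):

```latex
F_{\mathrm{Bethe}} = -\sum_{i} \log Z^{i}
  + \sum_{(i,j) \in E} \log Z^{ij}.
```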
Some simple algebra shows that the Bethe free energy here is related to the Bethe free energy defined in Ref. Zhang and Moore (2014) by a simple transformation.
IV.2 Modularity
While modularity appeared in Sec. III.1 as an objective function at a fixed inverse temperature, it was originally defined as an assessment criterion of the number of clusters Newman and Girvan (2004). In modularity, the strength of a community structure is measured by comparing, within each cluster, the actual network against a randomized counterpart. Although the performance of modularity is not considered state of the art, it has been extensively studied and is used as a baseline in many benchmark tests.
Precisely speaking, while the sum in Eq. (8) is taken over every distinct vertex pair, in the original definition it is taken over all combinations of vertices (including pairs of a vertex with itself), although this does not cause a qualitative difference unless the self-loops are significant. The modularity function of Eq. (8), evaluated with the partition obtained by free energy minimization, is sometimes distinguished as the retrieval modularity; however, it is referred to as modularity in this paper for simplicity. The partition selected is the one with the maximum modularity, or the parsimonious one among the partitions with a high modularity.
From the viewpoint of statistical inference, modularity maximization corresponds to a maximum likelihood estimate; i.e., there is no penalty term. In principle, it can still assess the number of clusters because the degrees of freedom of the affinity matrix are restricted as in Eq. (4), and thus the model with a larger q does not contain the model with a smaller q as a subset. In addition, if we tune the resolution parameter γ, the likelihood varies, and the optimum value of q changes.
IV.3 The map equation
Another popular criterion is the map equation Rosvall and Bergstrom (2008, 2011), in which the strength of the community structure is measured in terms of the minimum description length of a random walk. The map equation encodes the moves of a random walker on a given network using multiple codebooks. Specifically, it considers a codebook that encodes moves between clusters, as well as codebooks that encode moves within each cluster. Given that the codewords of different codebooks can overlap, a proper assignment of clusters will compress the description length of a random walk. Moreover, by using the codebooks of superclusters, i.e., clusters of clusters, a hierarchical extension can be formulated naturally. The map equation also has the interesting feature that it allows flow information to be taken into account, e.g., the directedness of edges, although we do not address this point in this paper (see Refs. Rosvall and Bergstrom (2008, 2011) for more details).
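For reference, the two-level map equation has the standard form (H denotes the Shannon entropy, q↷ the per-step probability that the walker switches clusters, and p_c^↻ the rate of use of the codebook of cluster c):

```latex
L(\mathsf{M}) = q_{\curvearrowright} H(\mathcal{Q})
  + \sum_{c=1}^{q} p_{\circlearrowright}^{c} H\!\left(\mathcal{P}^{c}\right).
```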
The excellent performance of the map equation and its greedy implementation (Infomap) has been demonstrated in numerous articles. As with modularity, one can use the minimum description length of the map equation for model assessment only, and perform community detection based on another objective function. It should be noted that the characterization of a cluster in the map equation is not equivalent to that of the stochastic block model. However, when densely connected components exist in a network, the minimum description length of a random walk is further compressed by clustering them; thus, it is expected that an optimal partition in the sense of modularity is also a good partition in the sense of the map equation.
It is also debatable whether we should consider the hierarchical nature of the map equation Rosvall and Bergstrom (2011). The map equation is naturally formulated as a hierarchical clustering, and the fundamental two-level method can be regarded as a truncation of the general multilevel method. Nevertheless, we measure the minimum description length of the two-level method and compare it with the other model assessment criteria, because it is not always possible to measure the minimum description length in the sense of the multilevel map equation. Whereas the multilevel map equation assumes a hierarchical structure, in the case of the partitions obtained by the inference algorithm considered in this paper, for example, each partition with a different value of q is independent and is not constrained to constitute a hierarchical structure.
IV.4 Spectral band
The spectra of the matrices introduced in Sec. III.3 can be used to estimate the number of clusters. In the case of a uniform random graph, in the infinite graph size limit, the spectrum of the corresponding matrix exhibits a non-zero spectral density within a finite range. In other words, a spectral band can be observed, as exemplified in Fig. 2; in the case of a symmetric matrix, this is often referred to as the semicircle law Mehta (2004). The spectral band stems purely from the random nature of a network; if a characteristic structure exists in a network, eigenvalues outside of the spectral band, i.e., isolated eigenvalues, will be observed. As mentioned in Sec. III.3, because the leading eigenvectors are expected to be correlated with the optimum partition, the number of statistically significant clusters can be estimated by counting the isolated eigenvalues.
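As a sketch of this counting procedure for the non-backtracking matrix (the function name is ours; the 2N × 2N companion-matrix reduction follows the Ihara–Bass relation used in Ref. Krzakala et al. (2013), and the bulk boundary is taken at the square root of the spectral radius), assuming a dense numpy adjacency matrix:

```python
import numpy as np

def count_isolated_eigenvalues(A):
    """Estimate the number of clusters by counting the real eigenvalues of
    the non-backtracking operator lying outside the bulk of its spectrum.
    The nontrivial non-backtracking eigenvalues are obtained from a
    2N x 2N companion matrix (Ihara-Bass formula), so the 2M x 2M
    edge-indexed matrix is never built explicitly."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    companion = np.block([
        [A, np.eye(n) - D],
        [np.eye(n), np.zeros((n, n))],
    ])
    ev = np.linalg.eigvals(companion)
    rho = np.abs(ev).max()        # spectral radius
    bulk = np.sqrt(rho)           # bulk-boundary estimate
    real = ev[np.abs(ev.imag) < 1e-6].real
    return int(np.sum(real > bulk * (1.0 + 1e-6)))
```

For instance, on a regular graph composed of two densely connected halves joined by a few edges, two real eigenvalues lie outside the bulk, matching the two planted communities; for a single clique, only the leading eigenvalue is isolated.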
While it is sometimes possible to locate the boundary of the spectral band by visual inspection, this is not always reliable, and a quantitative estimate is preferable. The estimate of the boundary of the spectral band of the modularity matrix was first derived in Ref. Nadakuditi and Newman (2012); the estimate was then generalized to random networks with arbitrary expected degrees Nadakuditi and Newman (2013) and arbitrary degree sequences Zhang et al. (2014). These results, however, assume that the average degree is sufficiently large. A result that is applicable to sparse networks is derived in Ref. Kawamoto and Kabashima (2015a); although it is still a mean-field result, it yields the estimate for a random network with an arbitrary degree sequence and is exact when the network is regular.
Although the boundary estimate of the spectral band of the normalized Laplacian is also possible using the mean-field approximation Kawamoto and Kabashima (2015b), it is known that the normalized Laplacian seriously suffers from the emergence of localized eigenvectors; such localized eigenvectors consist of a few elements with very large values, while most of the elements are close to zero. These localized eigenvectors are not correlated with the optimum partition and can deteriorate the estimate of the number of clusters. For this reason, we do not pursue the assessment using the spectrum of the normalized Laplacian in this paper.
On the other hand, the non-backtracking matrix tends to avoid the emergence of localized eigenvectors, and its spectral band has a clear boundary at the square root of the spectral radius Saade et al. (2014b). As seen in Sec. V.2, the assessment using the non-backtracking matrix performs well in many cases. Note, however, that the non-backtracking matrix is not completely free from localized eigenvectors Kawamoto (2016); Pastor-Satorras and Castellano (2016).
IV.5 Prediction errors
Finally, we explain the cross-validation estimates of prediction errors, which are also useful to estimate the number of clusters (see Ref. Kawamoto and Kabashima (2017) for the detailed derivations). Although evaluating cross-validation errors is computationally demanding in general, the leave-one-out cross-validation (LOOCV) is an exceptional case and the corresponding errors can be obtained efficiently using the result of BP Kawamoto and Kabashima (2017).
We consider four types of cross-validation errors: the Bayes prediction error, the Gibbs prediction error, the MAP estimate of the Gibbs prediction error, and the Gibbs training error. We refer to the adjacency matrix without the knowledge of the element A_ij as A^{\ij}. Given A^{\ij}, the joint probability of the cluster assignments σ_i and σ_j is
Then, the prediction probability that vertices i and j are connected is:
Note that the two-point partition function here is the normalization factor in Eq. (19) and is not equivalent to the two-point partition function defined in Ref. Zhang and Moore (2014), which does not have a probabilistic interpretation. The cross-entropy error function with respect to this prediction probability, which is referred to as the Bayes prediction error of LOOCV, is:
Using the sparsity of the network, it can be approximated as:
where we neglected a higher-order term. The Bayes prediction error should be the appropriate choice for assessing models in terms of the predictability of edges when the network is generated by the stochastic block model. However, this is often not the case. Hence, the Gibbs prediction error is considered, which is a rougher estimate of the prediction error than the Bayes prediction error. While the probability with respect to σ_i and σ_j is marginalized when the cross-entropy error function is measured in the Bayes prediction error, here a specific choice of σ_i and σ_j is made first, and the average is taken later. Thus, we have:
where we again neglected a higher-order term. By replacing the posterior distribution with a delta function peaked at its maximum, the MAP estimate of the Gibbs prediction error is obtained.
The Gibbs training error can be derived in the same manner. For the training error, we include the information of A_ij in the probability with respect to σ_i and σ_j. Thus, we have
Again, a higher-order term was neglected.
Note that the complexity of computing the Bethe free energy and the cross-validation errors is considerably reduced by restricting the parameter space of the stochastic block model. While the general stochastic block model requires a computation of , it reduces to under the restriction of Eq. (4).
V. Comparative analysis
In this section, we conduct a comparative analysis of the assessment of the number of clusters using synthetic and real-world networks. For the synthetic networks, the planted number of clusters is denoted as .
V.1 Assessment using the greedy algorithms
We first examine the performance of the greedy algorithms on the standard stochastic block model. Figures 3a and 3b show the number of clusters detected using the Louvain method and the two-level Infomap, respectively. The horizontal axes represent the strength of the community structure . The Louvain method is a hierarchical clustering algorithm that aims to optimize modularity, while the (two-level) Infomap is a non-hierarchical clustering algorithm that aims to optimize the map equation. For the implementations, we used Ref. lou () for the Louvain method and Ref. inf () for the Infomap.
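As a concrete illustration of the objective that the Louvain method greedily optimizes, the following is a minimal sketch of the Newman–Girvan modularity computation; the toy graph and partition are our own illustration, not taken from the experiments above.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman-Girvan modularity Q = sum_c (e_c - a_c^2), where e_c is the
    fraction of edges inside cluster c and a_c is the fraction of edge
    ends attached to cluster c, for an undirected simple graph."""
    m = len(edges)                 # number of edges
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    e_in = defaultdict(float)      # fraction of edges inside each cluster
    a = defaultdict(float)         # fraction of edge ends attached to each cluster
    for u, v in edges:
        if communities[u] == communities[v]:
            e_in[communities[u]] += 1.0 / m
    for v, k in degree.items():
        a[communities[v]] += k / (2.0 * m)
    return sum(e_in[c] - a[c] ** 2 for c in a)

# Two triangles joined by a single edge: a clean two-cluster structure.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(modularity(edges, partition))   # 6/7 - 1/2 ~ 0.357
```

A greedy algorithm such as the Louvain method repeatedly moves vertices between clusters whenever the move increases this quantity.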
All instances considered in this section have . Because the stochastic block model is exactly the model assumed in the inference algorithm, the assessments by the Bethe free energy and some of the prediction errors are known to be very accurate Mossel et al. (2014); Massoulié (2014); Kawamoto and Kabashima (2017), even when the planted modular structure is very weak.
When the average degree is sufficiently high and the community structure is strong (i.e., ), both algorithms correctly detect two clusters. However, when the networks are sparse and the community structure is weak (i.e., ), those algorithms tend to largely overfit. Moreover, as shown in Figs. 3c and 3d, the detected number of clusters increases as the network becomes larger. A non-hierarchical clustering algorithm for modularity Clauset et al. (2004); fas () and the multilevel Infomap Rosvall and Bergstrom (2011) were also tested. Although we confirmed a slight tendency of the hierarchical clusterings to prevent overfitting, we did not observe significant differences.
It should be noted that detecting too many clusters does not readily imply the failure of the algorithm. For example, when the result consists of a few large clusters and many very small clusters, the significant clusters can be extracted via visual inspection. This is actually the case for instances with strong community structures. Otherwise, the cluster sizes can be broadly distributed, and such a visual inspection may fail. Such situations are exemplified in Fig. 4 for the Infomap.
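The visual inspection described above amounts to examining the cluster-size distribution of the detected partition. A minimal sketch (the toy labels are our own illustration):

```python
from collections import Counter

def cluster_size_distribution(labels):
    """Return the cluster sizes in decreasing order, given a
    vertex -> cluster-label mapping."""
    return sorted(Counter(labels.values()).values(), reverse=True)

# A partition with two large clusters and ten singletons, qualitatively
# similar to what greedy optimization can return on sparse networks.
labels = {v: 0 for v in range(40)}
labels.update({v: 1 for v in range(40, 75)})
labels.update({v: v for v in range(75, 85)})   # ten singleton clusters
print(cluster_size_distribution(labels))       # [40, 35, 1, 1, ..., 1]
```

When the head of this distribution dominates, the small clusters can be discarded by inspection; when the sizes are broadly distributed, no such cutoff is apparent.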
V.2 Assessment using the inference algorithm
In this section, we analyze the performance of the various assessment criteria based on the statistical inference algorithm that were explained in Sec. III.1. First, we analyze the performance on synthetic networks known as LFR networks Lancichinetti and Fortunato (2009). The LFR network is a random graph model that has a power-law degree distribution and, as a planted structure, a power-law cluster size distribution. The parameters of the LFR networks considered are listed in Table 1. Although the LFR network is often analyzed as a random graph that emulates typical real-world networks, in this paper we do not argue whether the parameter sets investigated are “realistic” or not. In fact, it is not obvious whether the LFR network really emulates typical real-world networks because, as can be seen from Fig. 4, a broad cluster size distribution can arise spuriously.
Figure 6 shows the result for an instance with a strong community structure (small mixing parameter , in terms of the LFR network). Although the network has vertices with very large degrees, the cluster sizes are set to be almost equal. For this network, all the criteria support values close to , although the criterion based on the non-backtracking matrix slightly overfits.
While the values of the model parameters are learned in Fig. 6, we show the result for the same network without the model-parameter learning in Fig. 7. As we can see from Figs. 6 and 7, modularity and the map equation behave similarly in both cases. As we mentioned at the end of Sec. III.1, this may be due to the fact that the cluster assignment is not very sensitive to specific values of the model parameters, at least when the network has a strong community structure. Conversely, the performance of the Bethe free energy and the cross-validation errors changes qualitatively, indicating that the learning step cannot be skipped. Note that skipping the learning step does not necessarily make the algorithm computationally more efficient: with an incorrect choice of the affinity matrix , it is more difficult to fit the network, and it turns out that BP requires more sweeps until convergence. Therefore, it is more beneficial to optimize the model parameters. The rest of the results in this paper are generated by the algorithm with model-parameter learning.
Results for the LFR networks with weak community structures are shown in Fig. 8. Figures 8a–c represent the results for networks with narrowly peaked planted cluster size distributions, whereas Figs. 8d–f represent the results for networks with broad planted cluster size distributions. Although it is difficult to thoroughly examine the effect of the cluster size distribution, we can at least confirm that the performance of the present algorithm and assessment criteria is not very sensitive to it.
In the case of sparse networks, in which the average degree is of , if the planted community structure is too weak, it becomes fundamentally impossible to retrieve the planted structure. In other words, the network becomes statistically indistinguishable from a uniform random graph. The critical strength of the community structure is called the detectability threshold, or the detectability limit Decelle et al. (2011a); Moore (2017); Foo (b). In terms of the spectral method, this is the point at which the leading eigenvalues become buried in the spectral band. In terms of the other assessment criteria, the slope of a validation curve becomes flat, or the value at becomes the minimum. In the case of the stochastic block model with equally sized clusters, this threshold is given by the value of that corresponds to in Eq. (20), and the paramagnetic phase is observed beyond the detectability threshold.
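For the symmetric stochastic block model with equally sized clusters, the detectability condition takes a well-known closed form, |c_in − c_out| > q√c, where c_in and c_out are the planted intra- and inter-cluster degree parameters, q is the number of clusters, and c is the average degree (Decelle et al. 2011a). A minimal sketch (the function and variable names are our own):

```python
import math

def average_degree(c_in, c_out, q):
    """Average degree c = (c_in + (q - 1) * c_out) / q for q equally sized clusters."""
    return (c_in + (q - 1) * c_out) / q

def detectable(c_in, c_out, q):
    """Planted structure is detectable iff |c_in - c_out| > q * sqrt(c)
    (Decelle et al. 2011a, symmetric stochastic block model)."""
    c = average_degree(c_in, c_out, q)
    return abs(c_in - c_out) > q * math.sqrt(c)

# q = 2, average degree c = 3: the threshold sits at |c_in - c_out| = 2*sqrt(3) ~ 3.46.
print(detectable(5.0, 1.0, 2))   # gap 4 exceeds the threshold: detectable
print(detectable(4.0, 2.0, 2))   # gap 2 is below the threshold: undetectable
```

Both example networks have the same average degree; only the gap between the intra- and inter-cluster connectivity decides on which side of the threshold they fall.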
For the network in Fig. 8a, all the criteria we consider behave reasonably, supporting values close to . For the network in Fig. 8b, the assessment criteria other than the Gibbs prediction error and its MAP estimate still support values near . Indeed, in the case of the stochastic block model, it is known that the Gibbs prediction error tends to underfit near the detectability threshold Kawamoto and Kabashima (2017). Although the value of the (information-theoretic) detectability threshold for the LFR network is not known, the network in Fig. 8c may be beyond the detectability threshold. The Gibbs prediction error and the Bayes prediction error are already minimized or saturated at (not shown in the figure). The Bethe free energy exhibits a monotonic behavior, while the values of the other criteria fluctuate violently; this implies that the landscapes of the objective functions are glassy.
More importantly, while we observed in Figs. 3 and 5 that the estimates by modularity and the map equation largely overfit when the greedy algorithms are used, the results in Fig. 8 indicate that these criteria behave reasonably when statistical inference is used. Therefore, the shortcomings that we observed in Fig. 3 were not flaws of the criteria themselves, but of the greedy algorithms and the MAP-estimate (i.e., ) framework. In fact, when the assumed model is correct, it is known that the MAP estimate overfits compared to the estimate with the optimal inverse temperature Zhang and Moore (2014); Moore (2017); Peixoto (2017). The contribution of this paper is that it confirms that this overfitting occurs with the greedy algorithms near the detectability threshold.
While the results in Figs. 8d and 8f are similar to those in Figs. 8a–c, the result in Fig. 8e is qualitatively different. As the input value of is increased, at some point, BP converges to the factorized state Foo (c); as a result, the estimated value of jumps, and some prediction errors become constant. Convergence to the factorized state is a desirable feature of BP; it implies that BP has reached the detectability threshold and that there is no significant structure. Note, however, that it is often difficult to determine whether BP is actually in the paramagnetic phase or in the retrieval phase. Given that the factorized state always exists as a fixed point of BP in the retrieval phase, it is possible that BP is trapped in a local minimum of the Bethe free energy, while a correct initial state would converge to the global minimum.
As the final example using the LFR network, we consider a case in which the assessment seems to fail because of the EM algorithm. Consider a network that has a broad degree distribution, as in Figs. 6 and 7, in addition to a weak community structure and a broad planted cluster size distribution. As shown in Fig. 9, while the spectrum of the non-backtracking matrix yields an estimate of near , such estimates cannot be obtained via the other criteria, because BP converges to the factorized state at , although the values of the criteria vary significantly as they approach this value. In this case, we can hardly conclude that there are no statistically significant structures beyond , and it is more natural to conclude that BP converged to a local minimum of the Bethe free energy. Note that, even if we accept the estimate of , we cannot obtain a result with 27 clusters; recall that the input value of is the maximum number of clusters allowed, and the actual number of clusters that can be obtained is much less than 27. Readers might wonder what factor dominates the performance of the EM algorithm in the LFR network. Although the degree distribution seems to be an important factor, because there are many model parameters in the LFR network, it is difficult to precisely determine the parameter dependencies experimentally. Note that a thorough investigation of the phase space of a particular model is not the goal of this paper; instead, we investigate generic behaviors in community detection.
Table 2. Real-world network datasets:
- karate club Zachary (1977)
- dolphins Lusseau et al. (2003)
- Les Miserables Knuth (1993)
- football Girvan and Newman (2002)
- political books Newman (2006b)
- network science Newman (2006a)
- C. elegans Duch and Arenas (2005)
- political blogs Adamic and Glance (2005)
- protein Rual et al. (2005); kon (2016a); Kunegis (2013)
- power grid Watts and Strogatz (1998)
- Facebook ego Leskovec and Mcauley (2012)
- chess che (); kon (2016b); Kunegis (2013)
- US airports Opsahl et al. (2010)
- ca-HepTh Leskovec et al. (2007)
- Enron Leskovec et al. (2009); Klimt and Yang (2004)
- Epinions Richardson et al. (2003)
Let us next examine the performance of the assessment criteria on real-world networks. The basic information about each dataset is listed in Table 2, and the results of the assessment by each criterion are shown in Figs. 10(a)–(d), 11(e)–(h), 12(i)–(l), and 13(m)–(p). For some networks, the inference algorithm converges to the factorized state at some point as the input value of is increased; as far as we investigated, in many cases, the convergence to this trivial BP fixed point either supports a plausible value of or does not affect the assessment.
It is known that the Bethe free energy tends to largely overfit for real-world networks Decelle et al. (2011a); Kawamoto and Kabashima (2017) when an affinity matrix of full degrees of freedom is used. However, with a restricted affinity matrix, the assessment by the Bethe free energy does not seem to be very wrong.
Unlike the cases of the synthetic networks, the behaviors of the assessment criteria often differ considerably from each other. For example, modularity tends to support a relatively small value of , while the map equation tends to support a relatively large value. Assessment by the Bethe free energy and the prediction errors can be close to either of them, and we cannot identify a consistent tendency. Note again that the inference algorithm does not optimize the minimum description length of the map equation; the partition is obtained such that the marginal likelihood is maximized, and the minimum description length is measured only as a criterion for the assessment of the number of clusters. Another way to utilize the assessment by the minimum description length is to check whether the resolution limit Fortunato and Barthélemy (2007); Kawamoto and Rosvall (2015) affects the result. The estimates of the number of clusters by modularity and by the map equation can differ considerably when many small clusters exist, because their resolution limits are very different Kawamoto and Rosvall (2015). When both modularity and the map equation support the same number of clusters , it implies that the network is resolution limit-free Foo (d).
V.3 Assessment using the spectral methods
Finally, we examine the performance of the assessment of the number of clusters using the spectral methods explained in Secs. III.3 and IV.4. The results of the estimates using the modularity matrix and the non-backtracking matrix are listed in Tables 1 and 2. The estimates using the non-backtracking matrix are also shown in Figs. 6–13 as dashed lines.
Note that, because the leading eigenvalues can be computed quickly for sparse networks, the assessment using the spectral methods is the easiest to conduct. Overall, the assessment using the modularity matrix tends to underfit, while the assessment using the non-backtracking matrix tends to overfit, compared with the other criteria. Furthermore, for real-world networks, the spectral band of the modularity matrix often cannot be clearly observed; therefore, in many cases, it is also difficult to visually assess the number of clusters from the spectral density.
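As a sketch of the non-backtracking estimate, one counts the real eigenvalues that lie outside the bulk of the spectrum, whose radius is approximately the square root of the leading eigenvalue (Krzakala et al. 2013). The graph, function names, and tolerances below are our own illustration; a production implementation would use sparse eigensolvers rather than the dense construction shown here.

```python
import numpy as np

def nonbacktracking_estimate(edges):
    """Estimate the number of clusters as the count of real eigenvalues of
    the non-backtracking matrix B lying outside the bulk, whose radius is
    approximately sqrt(lambda_1) (Krzakala et al. 2013)."""
    # Each undirected edge (u, v) yields the directed edges u->v and v->u.
    directed = []
    for u, v in edges:
        directed.append((u, v))
        directed.append((v, u))
    n = len(directed)
    B = np.zeros((n, n))
    # B[(u->v), (w->x)] = 1 iff w == v and x != u (walk continues, no backtracking).
    for k, (u, v) in enumerate(directed):
        for l, (w, x) in enumerate(directed):
            if w == v and x != u:
                B[k, l] = 1.0
    eig = np.linalg.eigvals(B)
    radius = np.sqrt(np.max(eig.real))        # bulk radius ~ sqrt(leading eigenvalue)
    is_real = np.abs(eig.imag) < 1e-8
    return int(np.sum(is_real & (eig.real > radius + 1e-8)))

# Two 5-cliques joined by a single edge: a clear two-community structure.
clique1 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
clique2 = [(i + 5, j + 5) for i, j in clique1]
edges = clique1 + clique2 + [(4, 5)]
print(nonbacktracking_estimate(edges))
```

For this toy graph, the leading eigenvalue and the second real eigenvalue both sit well outside the bulk, reflecting the two planted communities; on real-world networks, additional structures may push further eigenvalues out of the band, which is one plausible source of the overfitting discussed below.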
The assessment using the non-backtracking matrix is often very accurate in the sense that it coincides with the planted value of an LFR network. An analysis with various values of was also conducted in Ref. Darst et al. (2014). It is difficult to determine what exactly causes the overfitting in the case of real-world networks; one possibility is that the community structure is not the only structure that contributes to the eigenvalues outside of the spectral band, and those unknown structures cause overfitting when we focus on community detection.
VI. Assessment through visualization
As we have observed, the validation curves of the criteria often saturate gradually, particularly when the community structure is weak. In such a case, a criterion supports a range of values of instead of a single plausible value. Therefore, a finer inspection is required as a final step if one wishes to determine the most plausible value of .
Visualizing how a network is partitioned for each input value of can be helpful for assessing the number of clusters. The alluvial diagram Rosvall and Bergstrom (2010) is a suitable tool for this purpose. It was originally introduced as a diagram to indicate the time evolution of a community structure; here, we visualize the change in the partition for different values of rather than the change over time. In the alluvial diagram, the results of community detection are aligned horizontally. For each partition, the set of vertices in the same cluster is expressed as a vertical bundle, and the same vertices in different partitions are smoothly connected. The alluvial diagram can be generated at Ref. map () using .smap files.
The alluvial diagram also uses color tone to express the significance of the cluster assignments; the vertices with insignificant assignments are drawn in faint colors. We assess the cluster assignment of vertex as insignificant if is less than a threshold; here, we regard as a significant assignment.
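The significance test above can be sketched as follows: a vertex is flagged as significantly assigned when the largest entry of its marginal assignment distribution exceeds a threshold. Since the threshold used in the paper is elided here, the value 0.5 below is our own illustrative choice, as are the toy marginals.

```python
def significant_assignments(marginals, threshold=0.5):
    """Flag a vertex as significantly assigned when the largest entry of its
    marginal assignment distribution (e.g., a BP marginal) exceeds the
    threshold. The default of 0.5 is an illustrative choice."""
    return {v: max(probs) >= threshold for v, probs in marginals.items()}

# Vertex 0 is confidently assigned to cluster 0; vertex 1 is nearly uniform
# over three clusters and would be drawn in a faint color.
marginals = {0: [0.90, 0.05, 0.05], 1: [0.40, 0.30, 0.30]}
print(significant_assignments(marginals))   # {0: True, 1: False}
```

In the alluvial diagrams below, the faintly colored bundles correspond exactly to the vertices flagged False by such a test.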
The alluvial diagrams of four real-world networks are shown in Fig. 14. As we can see from the political books and political blogs networks, the actual partition may remain the same as we increase the input value of . We can confirm that the partition with exists stably in the political books network, and the partition with is also consistent with the partitions with fewer clusters; i.e., it is a refinement of the partition with , and the highlighted clusters in the middle belong to different clusters in the partition with . For the political blogs network, although modularity and the map equation support or , we can confirm that, in the end, the dominant structure does not change from the partition with .
In the case of the protein network, for any choice of , only a fraction of the vertices belong to a specific cluster significantly; in other words, the network does not have a global community structure. Whereas the vertices with insignificant assignments exhibit random-like behavior, the vertices with significant assignments roughly exhibit a hierarchical structure. According to the Gibbs prediction error and its MAP estimate in Fig. 12(i), the partitions with or are supported. Although we cannot clearly see a qualitative difference in the alluvial diagram, a careful look shows that, from the partition with , a small fraction of the vertices with significant assignments start to exhibit a non-hierarchical structure.
The way the US airports network is partitioned also differs from the other networks. When we focus on the vertices with significant assignments, the resulting partitions do not constitute a hierarchical structure for small values of , while they roughly do for large values of . As various assessment criteria support the range of , it seems plausible to select within this range.
VII. Summary and Discussion
We conducted a comparative analysis of the assessment of the number of clusters in community detection. Although we examined the performance of various algorithms and assessment criteria, an exhaustive analysis requires testing all possible combinations of frameworks, algorithms, and assessment criteria. For example, an important missing piece is the Monte Carlo method Nowicki and Snijders (2001); Peixoto (2014); Newman and Reinert (2016). The Monte Carlo method usually incorporates a prior probability distribution with respect to the affinity matrix , which plays the role of a penalty in the assessment of the number of clusters. Therefore, a comparative analysis including the Monte Carlo method would be a comparison not only of different algorithms but also of different frameworks. In a broader sense, we should note that community detection also has other fundamental issues, as discussed in Ref. Peel et al. (2017).
We confirmed that the assessment using the EM algorithm with BP and the corresponding prediction errors provides plausible estimates for various synthetic and real-world networks, while the greedy algorithms tend to largely overfit. Note that it is not trivial that the BP algorithm performs reasonably for real-world networks, because the emergence of the so-called hard phase Decelle et al. (2011a) may deteriorate the performance when the planted number of clusters is large. Furthermore, the EM algorithm may be trapped in a local minimum, depending on the choice of the initial condition of the model parameters.
We also observed that the estimate of using the modularity matrix tends to underfit, while the estimate using the non-backtracking matrix tends to overfit. To the best of our knowledge, the assessment using the boundary of the spectral band of the modularity matrix has not been investigated in the literature.
Finally, we proposed utilizing the alluvial diagram for the assessment of . Although the alluvial diagram is obviously not applicable to networks with a very large , it is otherwise very useful, particularly when the network does not clearly have a global community structure.
For the LFR networks and the real-world networks, we do not know the number of clusters that is statistically significant; it may not coincide with the planted number of clusters or the number of clusters in the metadata. Therefore, we rely on the consistency among various criteria and algorithms for the plausibility of the assessment. The tendencies of overfitting and underfitting that we investigated in this paper represent the relative performance among the criteria and algorithms. Although there is no ground truth in a real-world network, estimating the number of clusters is a practical problem; at the end of the day, a practitioner selects a certain value (or a set of values) as .
T. K. thanks Jean-Gabriel Young for useful comments. This work was supported by Japan Society for the Promotion of Science KAKENHI Grants No. 26011023 (TK) and No. 25120013 (YK).
- P. W. Holland, K. B. Laskey, and S. Leinhardt, Soc. Networks 5, 109 (1983).
- Y. J. Wang and G. Y. Wong, Journal of the American Statistical Association 82, 8 (1987).
- B. Karrer and M. E. J. Newman, Phys. Rev. E 83, 016107 (2011).
- S. Fortunato, Physics Reports 486, 75 (2010).
- V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Journal of Statistical Mechanics: Theory and Experiment 2008, P10008 (2008).
- M. Rosvall and C. Bergstrom, Proc. Natl. Acad. Sci. U.S.A. 105, 1118 (2008).
- U. von Luxburg, Statistics and Computing 17, 395 (2007).
- M. E. J. Newman, Phys. Rev. E 74, 036104 (2006a).
- F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, Proc. Natl. Acad. Sci. U.S.A. 110, 20935 (2013).
- A. Saade, F. Krzakala, and L. Zdeborová, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (MIT Press, Cambridge, MA, USA, 2014a), NIPS’14, pp. 406–414.
- J. J. Daudin, F. Picard, and S. Robin, Statistics and Computing 18, 173 (2008).
- P. Latouche, E. Birmelé, and C. Ambroise, Statistical Modelling, SAGE Publications 12, 93 (2012).
- A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Phys. Rev. E 84, 066106 (2011a).
- K. Hayashi, T. Konishi, and T. Kawamoto, arXiv:1602.02256 (2016).
- T. Kawamoto and Y. Kabashima, Sci. Rep. 7, 3327 (2017).
- K. Nowicki and T. A. B. Snijders, Journal of the American Statistical Association 96, 1077 (2001).
- T. P. Peixoto, Phys. Rev. E 89, 012804 (2014).
- M. E. J. Newman and G. Reinert, Phys. Rev. Lett. 117, 078301 (2016).
- M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 103, 8577 (2006b).
- A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Phys. Rev. Lett. 107, 065701 (2011b).
- M. Rosvall and C. T. Bergstrom, PLoS ONE 6, e18209 (2011).
- M. Rosvall and C. T. Bergstrom, PLoS ONE 5, 1 (2010).
- P. Zhang and C. Moore, Proc. Natl. Acad. Sci. U.S.A. 111, 18144 (2014).
- J. Reichardt and S. Bornholdt, Phys. Rev. E 74, 016110 (2006).
- H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: an Introduction (Oxford University Press, 2001).
- M. E. J. Newman, Phys. Rev. E 88, 042822 (2013).
- C. Schülke and F. Ricci-Tersenghi, Phys. Rev. E 92, 042804 (2015).
- A formal derivation of the EM algorithm is the one using the variational expression, e.g., Ref. Kawamoto and Kabashima (2017).
- X. Zhang, T. Martin, and M. E. J. Newman, Phys. Rev. E 91, 032803 (2015).
- M. Mézard and A. Montanari, Information, physics, and computation (Oxford University Press, 2009).
- B. H. Good, Y.-A. de Montjoye, and A. Clauset, Phys. Rev. E 81, 046106 (2010).
- J. Shi and J. Malik, IEEE Trans. Pattern Anal. Mach. Intell. 22, 888 (2000).
- T. Kawamoto and Y. Kabashima, Europhys. Lett. 112, 40007 (2015a).
- M. E. J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
- M. L. Mehta, Random Matrices (Elsevier, 2004), 3rd ed.
- R. R. Nadakuditi and M. E. J. Newman, Phys. Rev. Lett. 108, 188701 (2012).
- R. R. Nadakuditi and M. E. J. Newman, Phys. Rev. E 87, 012803 (2013).
- X. Zhang, R. R. Nadakuditi, and M. E. J. Newman, Phys. Rev. E 89, 042816 (2014).
- T. Kawamoto and Y. Kabashima, Phys. Rev. E 91, 062803 (2015b).
- A. Saade, F. Krzakala, and L. Zdeborová, Europhys. Lett. 107, 50005 (2014b).
- T. Kawamoto, Journal of Statistical Mechanics: Theory and Experiment 2016, 023404 (2016).
- R. Pastor-Satorras and C. Castellano, Sci. Rep. 6 (2016).
- E. Mossel, J. Neeman, and A. Sly, Probab. Theory Relat. Fields pp. 1–31 (2014).
- L. Massoulié, in Proceedings of the 46th Annual ACM Symposium on Theory of Computing (ACM, New York, 2014), STOC ’14, pp. 694–703.
- A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004).
- A. Lancichinetti and S. Fortunato, Phys. Rev. E 80, 056117 (2009).
- C. Moore, arXiv preprint arXiv:1702.00467 (2017).
- Precisely speaking, the algorithmic detectability threshold that a certain algorithm fails to retrieve the planted structure is different from the information-theoretic detectability threshold that it is fundamentally impossible to retrieve the planted structure.
- T. P. Peixoto, arXiv preprint arXiv:1705.10225 (2017).
- Of course, this can also be confirmed directly by observing the marginal distributions themselves.
- W. W. Zachary, Journal of Anthropological Research 33, 452 (1977).
- D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396 (2003).
- D. E. Knuth, The Stanford GraphBase: a platform for combinatorial computing, vol. 37 (Addison-Wesley Reading, 1993).
- M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 99, 7821 (2002).
- J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005).
- L. A. Adamic and N. Glance, in Proceedings of the 3rd International Workshop on Link Discovery (ACM, New York, NY, USA, 2005), LinkKDD ’05, pp. 36–43.
- J.-F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, and N. Ayivi-Guedehoussou, Nature pp. 1173–1178 (2005).
- Human protein (vidal) network dataset – KONECT (2016a), URL http://konect.uni-koblenz.de/networks/maayan-vidal.
- J. Kunegis, in Proc. Int. Conf. on World Wide Web Companion (2013), pp. 1343–1350, URL http://userpages.uni-koblenz.de/~kunegis/paper/kunegis-koblenz-network-collection.pdf.
- D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
- J. Leskovec and J. J. Mcauley, in NIPS (Curran Associates, Inc., 2012), pp. 539–547.
- Kaggle, ’Chess ratings - Elo versus the Rest of the World.’ https://www.kaggle.com/c/chess/data/ (2010).
- Chess network dataset – KONECT (2016b), URL http://konect.uni-koblenz.de/networks/chess.
- T. Opsahl, F. Agneessens, and J. Skvoretz, Social Networks 32, 245 (2010).
- J. Leskovec, J. Kleinberg, and C. Faloutsos, ACM Trans. Knowl. Discov. Data 1 (2007), ISSN 1556-4681.
- J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, Internet Mathematics 6, 29 (2009).
- B. Klimt and Y. Yang, in CEAS 2004 - First Conference on Email and Anti-Spam, July 30-31, 2004, Mountain View, California, USA (2004).
- M. Richardson, R. Agrawal, and P. Domingos, Trust Management for the Semantic Web (Springer Berlin Heidelberg, Berlin, Heidelberg, 2003), pp. 351–368.
- S. Fortunato and M. Barthélemy, Proc. Natl. Acad. Sci. U.S.A. 104, 36 (2007).
- T. Kawamoto and M. Rosvall, Phys. Rev. E 91, 012809 (2015).
- Although the word “resolution limit-free” is sometimes used for an algorithm, it seems more appropriate to say that a network is resolution limit-free for a given set of algorithms.
- R. K. Darst, Z. Nussinov, and S. Fortunato, Phys. Rev. E 89, 032809 (2014).
- L. Peel, D. B. Larremore, and A. Clauset, Sci. Adv. 3 (2017).