Measuring robustness of community structure in complex networks
The theory of community structure is a powerful tool for analyzing real networks, since it can considerably simplify their topological and functional analysis. However, because community detection methods involve random factors and real social networks obtained from complex systems always contain erroneous edges, evaluating the robustness of community structure is an urgent and important task. In this letter, we employ the critical threshold of the resolution parameter in the Hamiltonian function, $\gamma_C$, to measure the robustness of a network. Using spectral theory, a rigorous proof shows that the proposed index is inversely proportional to the robustness of community structure. Furthermore, by utilizing the co-evolution model, we provide a new efficient method for computing the value of $\gamma_C$. The research can be applied to broad clustering problems in network analysis and data mining due to its solid mathematical basis and experimental results.
Community structure detection is a hot topic in social network studies and has attracted much attention from various scientific fields. Generally, a community refers to a group of nodes that are more densely connected to each other than to the rest of the network. A well-known approach to this problem is the concept of modularity, proposed by Newman et al. to quantify the quality of a network’s partition. Optimizing modularity is effective for community structure detection and has been widely used in many real networks. However, as pointed out by Fortunato et al., modularity suffers from the resolution limit problem, which calls into question the reliability of the communities detected through optimization methods. Complementary to the modularity concept, many efforts have been devoted to understanding the properties of dynamic processes taking place on the underlying networks. Specifically, researchers have begun to investigate the correlation between community structure and dynamic systems, such as synchronization and random walk processes.
In the real world, network topology changes over time. The analysis of community structure in evolving networks has been regarded as a “Holy Grail” by network scientists. A famous example is the karate club network constructed by Wayne Zachary in the 1970s. During the course of his study, a dispute arose between the club’s administrator and its principal karate teacher over whether to raise club fees, and the club eventually split into two smaller clubs, centered around the administrator (node 1) and the teacher (node 33), as shown in Fig. 1. It can be assumed that the relationships between members of the karate club were not robust at the very beginning, so that a small perturbation could completely change the topology. Why initially loose relationships evolve into or out of community structure is a very interesting question, since community structure has a great impact on human organizational structure, rumor and epidemic spreading, the effect of network attacks, and congestion control.
Given a network, it is meaningless to detect communities when the community structure is not robust: if a small change in the network, for example an edge added here or there, can completely change the outcome (significance or stability) of the community structure, then we argue that the network is not robust and the result cannot be trusted. In this letter we focus on this imperative task and prove that the critical threshold of the resolution parameter in the Hamiltonian function, $\gamma_C$, can measure the robustness of a network.
For any given network, robustness information can be derived from $\gamma_C$ directly and conveniently, without using any particular partition algorithm. A rigorous proof based on spectral theory then shows that the proposed index is inversely proportional to the robustness of community structure. Furthermore, to calculate the value of $\gamma_C$, a new efficient method is provided based on the co-evolution model. The method can be applied to broad clustering problems in network analysis and data mining due to its solid mathematical basis and efficiency.
2. Potts model and the critical resolution parameter
The Potts model is a powerful thermodynamic method that has been widely applied to uncover community structure in networks. We use the multiresolution Potts method to carry out the study. Given a network $G$ and its adjacency matrix $A$, community structure can be determined by minimizing the infinite-range $q$-state Hamiltonian function:
where $\sigma_i$ represents the community (state) to which node (spin) $i$ belongs, $\gamma$ is the resolution parameter, $p_{ij}$ represents the expected number of edges between nodes $i$ and $j$ in the null model, and $J$ is the coupling matrix whose entries $J_{ij}$ represent the interaction strength between nodes $i$ and $j$.
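As an illustration, the sketch below evaluates a Potts Hamiltonian of this kind for a given partition. It is a minimal sketch under stated assumptions, not the authors' code: it assumes the Reichardt-Bornholdt form $H = -\sum_{i \neq j} (A_{ij} - \gamma p_{ij})\,\delta(\sigma_i, \sigma_j)$ and the configuration null model $p_{ij} = k_i k_j / 2m$, and the function name `potts_hamiltonian` is hypothetical.

```python
import numpy as np

def potts_hamiltonian(A, sigma, gamma):
    """Energy of a partition under an assumed Reichardt-Bornholdt Hamiltonian.

    A     : symmetric 0/1 adjacency matrix
    sigma : community label of each node
    gamma : resolution parameter
    """
    A = np.asarray(A, dtype=float)
    sigma = np.asarray(sigma)
    k = A.sum(axis=1)                 # node degrees
    two_m = k.sum()                   # twice the number of edges
    P = np.outer(k, k) / two_m        # assumed null model: p_ij = k_i k_j / 2m
    same = sigma[:, None] == sigma[None, :]
    np.fill_diagonal(same, False)     # the sum runs over i != j only
    return -float(((A - gamma * P) * same).sum())

# Two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

h_split = potts_hamiltonian(A, [0, 0, 0, 1, 1, 1], gamma=1.0)
h_single = potts_hamiltonian(A, [0, 0, 0, 0, 0, 0], gamma=1.0)
print(h_split, h_single)
```

At `gamma=1.0` the two-triangle partition has the lower energy, so it is preferred over keeping all nodes in one community. Raising `gamma` penalizes large groups more strongly, which is the multiresolution behavior discussed in the text.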
In the Potts model, the resolution parameter $\gamma$ is an important indicator of the dynamics of community structure. By tuning the value of $\gamma$, we can detect community structure at multiple scales. Specifically, as $\gamma$ increases, a network is divided into more and smaller communities, as shown in Fig. 1. If we define $\gamma_C$ as the minimal value of $\gamma$ at which the network is divided into $q$ communities, then $\gamma_C$ can naturally be used to indicate the stability or significance of the $q$-community structure. For example, Fig. 1(c) shows a strong 4-community structure in one network and a weak 4-community structure in another, and it can easily be estimated that $\gamma_C$ is smaller for the former. Based on the analysis above, the following theorem can be obtained:
Theorem 1. If the Hamiltonian function with $\gamma = \gamma_C$ divides the network into $q$ communities, this partition is the weakest one that just meets the definition of community structure, i.e. the number of intra-community edges is equal to the number of inter-community edges.
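The balance condition in Theorem 1 can be checked for any given partition by counting edges. The following is a minimal sketch, assuming that the totals over all communities are compared; the function name `count_intra_inter` and the toy graph are hypothetical.

```python
def count_intra_inter(edges, sigma):
    """Count intra-community edges per community and inter-community edges in total."""
    intra = {}
    inter = 0
    for u, v in edges:
        if sigma[u] == sigma[v]:
            intra[sigma[u]] = intra.get(sigma[u], 0) + 1
        else:
            inter += 1
    return intra, inter

# Two triangles joined by a single edge, split into their natural communities.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
sigma = [0, 0, 0, 1, 1, 1]
intra, inter = count_intra_inter(edges, sigma)

# Under the assumed reading of Theorem 1, the weakest admissible partition
# is the one where these two totals coincide.
at_threshold = sum(intra.values()) == inter
print(intra, inter, at_threshold)
```

In this toy example the partition is far from the threshold, since six edges are internal and only one crosses between communities.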
The proof of Theorem 1 is straightforward. In addition, the profiles of networks with different scales and types of connectivity can be compared using $\gamma_C$. These differences are defined as “network distances”. For example, the Hamiltonian function evaluated at $\gamma_C$ is used to measure the distance between two networks:
where each term is the Hamiltonian function evaluated with parameter $\gamma_C$ in the corresponding network. A “network distance” of this form can be applied regardless of differences in connectivity between networks, such as size, type of degree distribution and sparsity, and it is convenient for analyzing the information hidden behind the topology. However, estimating the value of $\gamma_C$ is difficult and, up to now, has only been possible by testing with optimization methods. In this study, we use $\gamma_C$ to directly quantify the robustness of a given network. Since few studies have examined how $\gamma_C$ changes dynamically, we focus on this novel issue and reveal the relationship between $\gamma_C$ and a network’s robustness in the next section.
3. The relationship between $\gamma_C$ and the robustness
In this section, a typical case is studied to prove that $\gamma_C$ is able to quantify the robustness directly. For an undirected and unweighted graph $G$ with $N$ nodes and $L$ edges, the topology is characterized by the associated adjacency matrix $A$. The graph is partitioned into $q$ communities, labeled by $i = 1, \dots, q$. We denote the number of inner links connecting pairs of members inside community $i$ as $l_i$, and the number of inter-community links, i.e. the number of links connecting a member of one community to a member of another community, as $l_{inter}$. With this notation, $L = \sum_{i=1}^{q} l_i + l_{inter}$.
Next, the hyper-graph associated to network is defined as the weighted directed -clique in which each node corresponds to one community in . In , the connection linking node to node is weighted by , where represents the number of links of that connect members of the community with members of the community , represents the number of inner links in the source community . The corresponding Laplacian matrix is asymmetric, but can be written as a product , where is a symmetric zero row-sum matrix with off-diagonal elements and diagonal ones and . Then is
The spectrum of is non-negative real values, as is zero row-sum. Then the smallest eigenvalue of , , is zero, while the second smallest one . The method we proposed to measure the dynamic quality of community structure is defined as follows:
In fact, is an inherent evaluation function to compute the significance of community structure, based on spectral theory . is able to quantify the connectivity of the hyper-graph, and therefore measure the extent to which different communities are bounded and interacted. It should be noticed that both and are properly normalized, so that even if the network links were associated to cohesive forces, the two quantities would be one dimensional. The maximum of corresponds to a topology in which the community structure is most significant, and thus crucial to this study.
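The spectral construction can be sketched numerically. The example below is a hedged illustration only: it assumes that the weight from community $a$ to community $b$ is the number of links between them divided by the number of inner links of the source community $a$, and that the Laplacian is the weighted out-degree matrix minus the weight matrix. The function name `community_laplacian_spectrum` is hypothetical.

```python
import numpy as np

def community_laplacian_spectrum(edges, sigma, q):
    """Sorted eigenvalues of the Laplacian of the community-level weighted digraph."""
    inner = np.zeros(q)         # inner links per community
    cross = np.zeros((q, q))    # links between each pair of communities
    for u, v in edges:
        a, b = sigma[u], sigma[v]
        if a == b:
            inner[a] += 1
        else:
            cross[a, b] += 1
            cross[b, a] += 1
    if np.any(inner == 0):
        raise ValueError("every community needs at least one inner link")
    W = cross / inner[:, None]            # assumed weight: w_ab = l_ab / l_a
    lap = np.diag(W.sum(axis=1)) - W      # zero row-sum, generally asymmetric
    # lap = D^{-1} S with S symmetric, so its eigenvalues are real.
    return np.sort(np.linalg.eigvals(lap).real)

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
sigma = [0, 0, 0, 1, 1, 1]
spectrum = community_laplacian_spectrum(edges, sigma, q=2)
print(spectrum)
```

The smallest eigenvalue is zero because the matrix has zero row sums, and the second smallest one grows as the communities become more strongly interconnected relative to their inner links.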
Let us then consider the case in which the communities are all cliques with equal size . The number of intra-community links , as well as the number of inter-community links are the same for all communities. Then, as and , . Under these assumptions, the Hamiltonian function of Section 2 can be simplified to the following expression:
In addition, since , according to the matrix identity, there is . Integrating and with , the function is derived as follows:
To study the dynamic characteristics of networks, a particular protocol is adopted by increasing at each step from 0 to , the value at which and are zero. Given a fully modularized configuration (in which and ), we conduct successive rewiring processes, i.e. in each step an intra-community link corresponding to each community is deleted, and inter-community links are formed by connecting those pairs of nodes (each one in different communities) which lost their intra-link. In this way, at the -th rewiring, there are and . Accordingly, the partial derivative of , and the dynamical change of inter-community edges are calculated as follows:
When , . The function of reaches the maximum since the second derivative of is indeed negative. According to the formation of the maximum of inter-community edges , when , . In this case, no inter-community edges exist and the original network cannot be perturbed anymore (increasing will decrease ). On the contrary, when , there is . At this time, the network is indeed perturbable because increasing will also increase until all edges are inter-community edges. In this situation, each node forms a single community of its own. In a special intermediate case, when , there is , and the number of intra-community edges equals the number of inter-community edges. According to the definition, this is exactly the threshold at which community structure emerges. As explained above, this is the critical value , which is closely related to robustness: if the value of increases, the size of the imperturbable (un-robust) area also increases accordingly, as shown in Fig. 2. This conclusion is summarized in the following theorem:
Theorem 2. The larger $\gamma_C$ is, the lower the robustness of a given network with $q$ communities, and vice versa.
For any given network, robustness information can be derived from $\gamma_C$ conveniently, without any particular algorithm, since $\gamma_C$ is determined only by the network’s topology.
4. A novel method to calculate the critical threshold $\gamma_C$
As proved by Theorem 2, $\gamma_C$ can be used to quantify a network’s robustness. However, calculating $\gamma_C$ is a difficult task and few studies have addressed this problem. Fortunately, Theorem 1 provides a feasible route: $\gamma_C$ can be obtained by calculating $\gamma$ for a community structure in which the number of intra-community edges equals the number of inter-community edges. In this part, a two-stage method is proposed to calculate $\gamma_C$. The detailed procedures are described as follows.
4.1. The relationship between $\gamma$ and community edge density
In unweighted graphs, the “inverse adjacency matrix” is defined to have entry 1 if there is no edge between nodes $i$ and $j$, and entry 0 otherwise. In general, we set . Since the inner sum of , i.e. , is the number of edges in community , and the sum of , i.e. , is the number of missing edges in community , the Hamiltonian function of Section 2 can be rewritten as
where the inner sum of has been rewritten in terms of the number of existing edges , is the maximal number of edges that community could have, and the number of missing edges is . With minor rearrangement, the Hamiltonian function can be transformed into an “edge density” form:
where the edge density of community is defined as .
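For concreteness, the edge density of each community can be computed directly from an edge list and a partition. This is a minimal sketch, assuming the density is the number of existing inner edges divided by the maximal number of edges $n(n-1)/2$ for a community of $n$ nodes; the function name `community_edge_densities` is hypothetical.

```python
from collections import Counter

def community_edge_densities(edges, sigma):
    """Edge density of every community: existing inner edges / maximal inner edges."""
    sizes = Counter(sigma)      # number of nodes per community
    inner = Counter()           # number of existing inner edges per community
    for u, v in edges:
        if sigma[u] == sigma[v]:
            inner[sigma[u]] += 1
    densities = {}
    for label, n in sizes.items():
        max_edges = n * (n - 1) // 2
        # A single-node community has no possible edges; report density 0.
        densities[label] = inner[label] / max_edges if max_edges else 0.0
    return densities

# A triangle (fully connected) and a path on four nodes (half of its possible edges).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (4, 5), (5, 6), (2, 3)]
sigma = [0, 0, 0, 1, 1, 1, 1]
densities = community_edge_densities(edges, sigma)
print(densities)
```

The number of missing edges in a community is then its maximal edge count minus its existing edge count, which is the quantity the inverse adjacency matrix sums up.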
If the energy of a given community is attractive and has a binding pattern, the term must be positive. Rearranging the edge-density form above provides a relationship between the resolution parameter and the critical (minimum) edge density :
Based on this relationship, two important inferences can be obtained:
4.2. The co-evolution model
We set , and substitute into the Hamiltonian function of Section 2:
Then, the term is extracted and an equivalent function is derived
In the equivalent function above, we consider as a special probability, and the value of lies between 0 and 1. To optimize it, the following steps are needed: at each step, a vertex is picked randomly. If its degree , nothing happens. For , (i) with probability , a random neighbor of is selected and node is put into the same community as , i.e. set ; (ii) otherwise, with probability , an edge attached to vertex is selected and the other end of this edge is rewired to a randomly chosen vertex in the same community as . This process continues until no edge connects individuals in different communities.
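The update rule described above can be sketched as a small simulation. This is a minimal illustration rather than the authors' implementation: the function name `coevolve`, the parameter `p_adopt` (the probability of step (i)) and the step cap `max_steps` are assumptions introduced here, and the sketch keeps the graph simple by rewiring only to same-community vertices that are not already neighbors.

```python
import random

def coevolve(edges, labels, p_adopt, rng, max_steps=100_000):
    """Run the adopt-or-rewire dynamics until no edge joins two communities."""
    n = len(labels)
    labels = list(labels)
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def discordant():
        # Number of edges whose endpoints carry different labels.
        return sum(1 for u in range(n) for v in adj[u]
                   if u < v and labels[u] != labels[v])

    for _ in range(max_steps):
        if discordant() == 0:
            return adj, labels, True
        i = rng.randrange(n)
        if not adj[i]:
            continue                          # isolated vertex: nothing happens
        j = rng.choice(sorted(adj[i]))        # random neighbor / attached edge
        if rng.random() < p_adopt:
            labels[i] = labels[j]             # step (i): join the neighbor's community
        else:
            # Step (ii): move the far end of edge (i, j) into i's own community.
            candidates = [w for w in range(n)
                          if w != i and labels[w] == labels[i] and w not in adj[i]]
            if candidates:
                w = rng.choice(candidates)
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(w)
                adj[w].add(i)
    return adj, labels, discordant() == 0

# A ring of 20 nodes with two contiguous spin states, pure rewiring (p_adopt = 0).
ring = [(i, (i + 1) % 20) for i in range(20)]
start = [0] * 10 + [1] * 10
adj, final, converged = coevolve(ring, start, p_adopt=0.0, rng=random.Random(0))
print(converged, sum(len(a) for a in adj) // 2)
```

Every rewiring removes one edge and adds one, so the number of edges is conserved throughout. With `p_adopt=0.0` the labels never change and the graph simply fragments along the initial labels, which corresponds to the pure-rewiring extreme.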
This dynamical evolutionary process can be considered a special case of the famous Holme-Newman model. There are two extremes corresponding to the value of . When , only rewiring steps (step ii) occur. Once all of the edges have been touched, the graph has been split into components, each consisting of individuals who share the same label. Because none of the states have changed, the components are small (i.e., their sizes follow a Poisson distribution with mean ). According to classical results for the coupon collector’s problem, approximately updates are required. In contrast, for , the system reduces to the voter model on a static graph. If we suppose that the initial graph is an ER random graph in which each vertex has average degree , then there is a “giant component” that contains a positive fraction of the vertices, and the second largest component is small, having only vertices. The voter model on the giant component will reach a consensus in steps.
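The coupon collector estimate can be made concrete with a short calculation. This is only a rough sketch: it assumes the items (here, edges) are drawn uniformly at random, whereas the model picks a vertex first and then one of its edges, so the numbers indicate the scaling rather than the exact count. The function names are hypothetical.

```python
import math

def expected_draws(m):
    """Exact expected number of uniform draws needed to touch all m items: m * H_m."""
    return m * sum(1.0 / k for k in range(1, m + 1))

def leading_order(m):
    """Leading-order approximation m * ln(m) of the same quantity."""
    return m * math.log(m)

m = 10_000
ratio = expected_draws(m) / leading_order(m)
print(expected_draws(m), leading_order(m), ratio)
```

For large `m` the exact expectation exceeds `m * ln(m)` only by a term linear in `m`, so the ratio approaches 1 slowly.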
To pursue the study, a two-state Potts model (with the two spin states called 0 and 1) is used instead of a number of states proportional to the size of the graph. This model is also called the Ising model. As in the Holme-Newman model, the final fraction of nodes with the minority spin state undergoes a discontinuous transition at a value that does not depend on the initial density. Fig. 3 shows results of simulations of our method starting from an initial ER random graph with nodes and average degree . Spin values are initially assigned randomly with the probability of state 1 given by the fraction , and . The figure shows the final fraction of nodes with the minority spin state from five scenarios for each . Although the fraction of nodes with state 1 is smaller than that of state 0, this minority state reaches a stable state instead of being “assimilated” or “swallowed” by state 0. In community analysis, this phenomenon is equivalent to being free of the “resolution limit” problem pointed out by Fortunato et al., in which modularity has an intrinsic scale beyond which small but well-defined communities cannot be detected by maximizing the modularity.
4.3. The computation of $\gamma_C$
As analyzed above, is proportional to the value of . If is defined as the critical threshold of when in the community structure, then we can estimate that the value of is proportional to the critical threshold , using the relationship . Here, we focus on the case and find a solution that can be extended to more general cases.
Methodologically, we use mean-field theory with a Markov dynamical process to compute . First, let be the number of edges between adjacent nodes with states and , and let be the number of oriented triples of adjacent sites with states , respectively. is equal to ; for example, in the 0-1-0 case, all such triples will be counted twice. However, the approach is limited to dense graphs, where the general statistics are the numbers of homomorphisms of some small graphs (labeled by ones and zeros in our case) into the random graph being studied. It is common to use the pair approximation (PA), which in essence assumes that the equilibrium state is a Markov chain: , where is the number of vertices in state 0, and is the clustering coefficient of the network. Using mean-field theory and algebraic transformation, the following theorem can be deduced:
Theorem 3. Defining as the average degree, as the fraction of nodes with minority spin state, as the number of nodes, as the clustering coefficient, then, the critical threshold and the number of inter-community edges satisfies and , respectively.
Proof. The calculations presented here are inspired by similar equations in Kimura and Hayakawa. According to the mechanism of our model, by considering all of the possible changes, the partial differential equations can be established as follows:
The fact that this notation is more natural than dividing by 2 to eliminate overcounting can be seen by observing that, if is the degree of node , there are and . Also, , where is the number of edges, and the sum of the three differential equations above is zero.
Taking steady solution of these equations and the pair approximation as before, we get
When , only intra-community edges exist, and we have . The threshold can then be obtained:
Using the preceding equation and , we have
The number of inter-community edges satisfies
The proof is completed.
The approach is limited to dense graphs. As , is proportional to the value of , and then we can get . This approximation is simple and convenient to compute in large-scale networks by using sampling techniques. Although is derived from the two-state Ising model, it can be directly applied to networks with more than 2 communities, since the elements and are determined only by network topology, without using any partition algorithm.
We test the index on both the classical GN benchmark presented by Girvan and Newman and the more challenging LFR benchmark proposed by Lancichinetti, Fortunato and Radicchi. The GN network has 128 nodes divided into 4 communities of 32 nodes each. Each node is connected on average to $z_{in}$ nodes of its own group and $z_{out}$ nodes of the rest of the network. The total degree of each node is kept constant at $z_{in} + z_{out} = 16$. In the LFR benchmark, each node is given a degree taken from a power-law distribution with a positive exponent. Additionally, each node shares a fraction $\mu$ of its links with the other nodes of the network, where $\mu$ is the mixing parameter. The clarity of the community structure can be adjusted through the mixing parameter $\mu$.
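A GN-style benchmark can be generated with a simple planted-partition sketch. This is an illustrative approximation, not the generator used for the reported results: edges are drawn independently so that each node has on average `degree - z_out` neighbors in its own group and `z_out` neighbors elsewhere, with the standard GN values of 4 groups, 32 nodes per group and average degree 16. The function name `gn_benchmark` is hypothetical.

```python
import random

def gn_benchmark(z_out, rng, n_groups=4, group_size=32, degree=16):
    """Planted-partition graph in the style of the GN benchmark."""
    n = n_groups * group_size
    sigma = [v // group_size for v in range(n)]        # planted community labels
    p_in = (degree - z_out) / (group_size - 1)          # intra-group edge probability
    p_out = z_out / (n - group_size)                    # inter-group edge probability
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if sigma[u] == sigma[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, sigma

edges, sigma = gn_benchmark(z_out=4, rng=random.Random(1))
mean_degree = 2 * len(edges) / len(sigma)
print(len(sigma), len(edges), mean_degree)
```

Increasing `z_out` moves links from inside the groups to between them, which blurs the planted communities while leaving the expected degree unchanged.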
As is well known, the communities become fuzzier and thus more difficult to identify when $z_{out}$ and $\mu$ increase. Hence, the robustness of the community structure tends to weaken and the index increases. The numerical results for both benchmarks are shown in Fig. 4. The figure indicates that the index works well in these networks: when the community structure is very clear, the index is near 0.2-0.3; when the network is nearly random, the corresponding index is very close to 1. Thus, this method shows a great ability to characterize the properties of modular structure: the lower the index is, the more robust the community structure will be.
In order to verify our method, it is also applied to three well-known artificial networks: the ER random graph, the BA scale-free network, and network, where in each case the number of nodes is 10,000 and the average degree is 3. The experimental result indicates that model is the most robust one. We also test the method on real networks; the results are shown and analyzed in the Supplementary Material.
In summary, this letter presents a new community analysis method that uncovers the connection between the robustness of community structure and the critical threshold of the resolution parameter, $\gamma_C$. Based on the theoretical analysis, a novel computation method is developed to quantify $\gamma_C$ using co-evolution theory. Its effectiveness and efficiency are demonstrated and verified, and it can be applied to broad problems in network analysis and data mining due to its solid mathematical basis and efficiency.
We are grateful to the anonymous reviewers for their valuable suggestions. The authors are separately supported by NSFC grants 71401194, 71403304, 11131009, 91324203 and “121” Youth Development Fund of CUFE grants QBJ1410.
- M. E. J. Newman, Phys. Rev. E, 69 (2004) 066133.
- M. Girvan, M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A., 99 (2002) 7821-7826.
- S. Fortunato, M. Barthelemy, Proc. Natl. Acad. Sci. U.S.A., 104 (2007) 36.
- A. Arenas, A. Diaz-Guilera, C. J. Perez-Vicente, Phys. Rev. Lett., 96 (2006) 114102.
- W. E, T. Li, E. Vanden-Eijnden, Proc. Natl. Acad. Sci. U.S.A., 105 (2008) 7907-7912.
- X. S. Zhang, R. S. Wang, Y. Wang, J. Wang, Y. Qiu, L. Wang, L. Chen, Eur. Phys. Lett., 87 (2009) 38002.
- W. W. Zachary, J. Anthropol. Res., 33 (1977) 452-473.
- J. Reichardt, S. Bornholdt, Phys. Rev. Lett., 93 (2004) 218701.
- H. J. Li, Y. Wang, L. Y. Wu, Z. P. Liu, L. Chen, X. S. Zhang, Eur. Phys. Lett., 97 (2012) 48005.
- B. Karrer, E. Levina, M. E. J. Newman, Phys. Rev. E, 77 (2008) 046119.
- P. Holme, M. E. J. Newman, Phys. Rev. E, 74 (2006) 056108.
- D. Kimura, Y. Hayakawa, Phys. Rev. E, 78 (2008) 016103.
- F. Papadopoulos, M. Kitsak, M. Serrano, M. Boguna, D. Krioukov, Nature, 489 (2012) 537-540.
- A. Lancichinetti, S. Fortunato, F. Radicchi, Phys. Rev. E, 78 (2008) 046110.
- R. Durrett, J. P. Gleeson et al., Proc. Natl. Acad. Sci. U.S.A., 109 (2012) 3682-3687.
- Please download the supplementary material file from the website http://doc.aporc.org/wiki/Robustness