Hierarchical Multiresolution Method to Overcome the Resolution Limit in Complex Networks
The analysis of the modular structure of networks is a major challenge in complex networks theory. The validity of the modular structure obtained is essential to confront the problem of the topology-functionality relationship. Recently, several authors have studied the resolution limit of different community detection algorithms, which makes it impossible to detect natural modules when very different topological scales coexist in the network. Existing multiresolution methods are not a panacea: they also fail in extreme situations. Here, we present a new hierarchical multiresolution scheme that works even when the network decomposition is very close to the resolution limit. The idea is to apply the multiresolution method separately to optimal subgraphs of the network, focusing the analysis on each part independently. We also propose a new algorithm that reduces the computational cost of screening the mesoscale in search of the resolution parameter that best splits every subgraph. The hierarchical algorithm is able to solve a difficult benchmark proposed in [Lancichinetti & Fortunato, 2011], encouraging the further analysis of hierarchical methods based on the modularity quality function.
Keywords: Complex networks, community structure, multiple resolution, modularity.
The quality function called modularity has been widely used in the assessment of the modular structure of networks [Girvan & Newman, 2002; Newman & Girvan, 2004; Newman, 2004a; Clauset et al., 2004; Duch & Arenas, 2005; Danon et al., 2005] and for data clustering and exploration [Newman, 2006; Granell et al., 2011]. Modularity is a global descriptor of a complex network that measures the difference between a given partition of the network and the same partition in an ensemble of randomized versions of the original network that preserve the strength of every node. The optimization of modularity is coherently related to the definition of modules in the network: a module is defined as the result of the optimal modularity partition. In 2007, Fortunato & Barthélemy [Fortunato & Barthélemy, 2007] pointed out a drawback of this function, namely a resolution limit (generalized later in [Kumpula et al., 2007; Good et al., 2010]) beyond which the optimization of modularity is unable to identify certain modules, even those easily detectable at first sight, such as cliques almost disconnected from the rest of the network. This effect is known as the resolution limit of modularity. The problem arises because modularity fixes a global scale that may be appropriate for some networks but not for others, and it is especially unsuitable for networks in which large and small dense communities coexist. After this work, a multiresolution method was introduced in [Arenas et al., 2008], which preserved the use of modularity with the addition of a parameter to control the resistance of nodes to form communities. The idea is that the analysis of communities may be performed at different scales of description, and the resolution limit is overcome just by moving to the right scale. Other methods to overcome the resolution limit are found in [Reichardt & Bornholdt, 2004; Pons & Latapy, 2011; Traag et al., 2011; Berry et al., 2011; Ronhovde & Nussinov, 2009, 2010].
A recent work by [Lancichinetti & Fortunato, 2011] shows that even the methods designed to avoid the resolution limit do have a resolution limit, and proposes the use of OSLOM [Lancichinetti et al., 2011], an algorithm composed of several approaches, to really avoid this problem. The proof that multiresolution schemes still have a resolution limit is carried out analytically for the RB (after Reichardt-Bornholdt) method, and extended qualitatively, using examples, to the AFG (after Arenas-Fernández-Gómez) method and the recent CPM (Constant Potts Model) method.
We have performed extensive simulations using the AFG method and conclude that the authors of [Lancichinetti & Fortunato, 2011] are right: the AFG method also has a resolution limit. The benchmark they propose (see Fig. 1) consists of a giant Erdős-Rényi (ER) network and two small cliques, connected between them by just one link. With the current formulation of the AFG method, it is impossible to separate this benchmark into one cluster for the giant ER network and one cluster for each of the cliques. Even though the synthetic benchmarks where multiresolution methods may fail are far from the structure of real networks, it is still challenging to investigate what the problems are and how to solve them.
In this paper we focus on the AFG method, analyzing its performance in resolution-limiting situations and proposing alternatives to eliminate or, if that is not possible, minimize the effect of this limit. The alternative we present is a hierarchical application of the resolution screening that avoids the resolution limit. The hierarchical application of a multiresolution method consists of focusing the screening on different clusters of the network as soon as these clusters are detected.
2 Multiresolution AFG method
In a previous work, the authors proposed a method that allows the full screening of the topological structure at different resolution levels using the original formulation and semantics of modularity as defined in [Girvan & Newman, 2002]. The original modularity allows the comparison of different partitions of the network. Given a network partitioned into communities, $C_i$ being the community to which node $i$ is assigned, the mathematical definition of modularity is
$$Q = \frac{1}{2w}\sum_i\sum_j\left(w_{ij} - \frac{w_i w_j}{2w}\right)\delta(C_i, C_j),$$
where $w_{ij}$ is the weight of the link between nodes $i$ and $j$ (zero if no link exists), $w_i = \sum_j w_{ij}$ is the strength of node $i$ and $2w = \sum_i w_i$ is the total strength of the network [Newman, 2004a]. The Kronecker delta function $\delta(C_i, C_j)$ takes the value 1 if nodes $i$ and $j$ are in the same community and 0 otherwise. Several authors have attacked the problem of modularity optimization, with considerable success, by proposing different heuristics [Newman, 2004b; Clauset et al., 2004; Guimerà & Amaral, 2005; Duch & Arenas, 2005; Pujol et al., 2006; Newman, 2006], see [Fortunato, 2010] for a review.
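As an illustration of the definition above, the following minimal sketch evaluates modularity for a given partition. The function name `modularity`, the list-of-lists weight matrix and the toy network (two triangles joined by one link) are assumptions made for this example only; they are not taken from the original work.

```python
def modularity(weights, communities):
    """Modularity Q of a partition of an undirected weighted network.

    weights:     symmetric square matrix (list of lists), weights[i][j] = w_ij
    communities: communities[i] is the community label C_i of node i
    """
    n = len(weights)
    strength = [sum(row) for row in weights]      # w_i
    total = sum(strength)                         # 2w
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:  # Kronecker delta
                q += weights[i][j] - strength[i] * strength[j] / total
    return q / total


# Hypothetical toy network: two triangles joined by the single link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    W[a][b] = W[b][a] = 1.0

print(modularity(W, [0, 0, 0, 1, 1, 1]))  # one community per triangle: 5/14
print(modularity(W, [0] * 6))             # single community: 0 up to rounding
```

The double loop is quadratic in the number of nodes, which is enough for small examples; practical optimizers accumulate the same quantity community by community.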
The AFG method was designed to evaluate the community structure of networks using a kind of magnifying glass of the topology [Arenas et al., 2008]. The mathematical form of this prescription is given by
$$Q_r(C) = Q(C; W_r), \qquad W_r = W + rI,$$
that is, the modularity of partition $C$ evaluated on $W_r$, where the resistance $r$ is the parameter controlling the resolution of the partitions we want to find, and $W_r$ is the new weights matrix after the addition of a self-loop with value $r$ to each node of the original weights matrix $W$ ($I$ being the identity matrix). When $r$ is zero, we recover the standard modularity $Q$. The definition of $Q_r$ preserves the original semantics of modularity.
A refinement of the AFG method may be found in [Granell et al., 2011, 2012 in press], where the original formulation of modularity Eq. (2) is replaced by its extension to networks with positive and negative weights [Gómez et al., 2009; Traag & Bruggeman, 2009]. Although the differences are usually small, this is necessary since the access to the macroscale needs the use of negative values of the resistance, even if the original network has only positive weights. Thus, the adequate formulation of modularity Eq. (2) for undirected weighted signed networks which should be used is
$$Q = \frac{1}{2w^+ + 2w^-}\sum_i\sum_j\left[w_{ij} - \left(\frac{w_i^+ w_j^+}{2w^+} - \frac{w_i^- w_j^-}{2w^-}\right)\right]\delta(C_i, C_j),$$
where
$$w_i^+ = \sum_j \max\{0, w_{ij}\}, \qquad w_i^- = \sum_j \max\{0, -w_{ij}\}$$
are the positive and negative strengths of node $i$, and
$$2w^+ = \sum_i w_i^+, \qquad 2w^- = \sum_i w_i^-$$
are the positive and negative total strengths respectively. Please note that these four strengths are defined to be non-negative. The extension to directed networks [Arenas et al., 2007] is simply obtained by the substitutions in Eq. (2)
$$w_i^+ w_j^+ \to w_i^{+,\mathrm{out}} w_j^{+,\mathrm{in}}, \qquad w_i^- w_j^- \to w_i^{-,\mathrm{out}} w_j^{-,\mathrm{in}}.$$
For the sake of simplicity, we will refer to the undirected case for the rest of the paper. In the particular case that the original network does not have negative weights, and $r < 0$, so that $w_i^- = |r|$ and $2w^- = N|r|$, Eq. (2) reads
$$Q_r(C) = \frac{1}{2w + N|r|}\left[\sum_i\sum_j\left(w_{ij} - \frac{w_i w_j}{2w}\right)\delta(C_i, C_j) - |r|\left(N - \frac{1}{N}\sum_s n_s^2\right)\right],$$
where $N$ is the total number of nodes, $n_s$ is the number of nodes in community $s$ of the partition $C$, and the node and total strengths ($w_i$ and $2w$) refer to the original network before the addition of the self-loops. It is interesting to realize that, since all the negative strengths are equal to the absolute value of $r$, the contribution of the resistance to modularity is equivalent to a constant Potts model [Traag et al., 2011].
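The reduction to original strengths can be checked numerically. The sketch below assumes the signed modularity takes the standard form of [Gómez et al., 2009], with the positive and negative null-model terms normalized by their own total strengths, and compares it with a closed form in which the resistance only enters through $|r|\,(N - \sum_s n_s^2/N)$. The function names, the list-of-lists representation and the toy network are hypothetical choices for this example.

```python
def signed_modularity(weights, communities):
    """Modularity for undirected weighted signed networks (assumed standard form)."""
    n = len(weights)
    pos = [sum(max(0.0, x) for x in row) for row in weights]   # w_i^+
    neg = [sum(max(0.0, -x) for x in row) for row in weights]  # w_i^-
    tot_pos, tot_neg = sum(pos), sum(neg)                      # 2w^+, 2w^-
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] != communities[j]:
                continue
            null = 0.0
            if tot_pos > 0:
                null += pos[i] * pos[j] / tot_pos
            if tot_neg > 0:
                null -= neg[i] * neg[j] / tot_neg
            q += weights[i][j] - null
    return q / (tot_pos + tot_neg)


def with_resistance(weights, r):
    """Return W_r = W + r*I: a self-loop of weight r is added to every node."""
    return [[x + (r if i == j else 0.0) for j, x in enumerate(row)]
            for i, row in enumerate(weights)]


def closed_form(weights, communities, r):
    """Q_r for r < 0 and a non-negative, loop-free W, using original strengths only."""
    n = len(weights)
    total = sum(map(sum, weights))                  # 2w
    q0 = signed_modularity(weights, communities)    # equals Q when W has no negative weights
    sizes = {}
    for c in communities:
        sizes[c] = sizes.get(c, 0) + 1
    penalty = n - sum(s * s for s in sizes.values()) / n
    return (total * q0 - abs(r) * penalty) / (total + n * abs(r))


# Hypothetical toy network: two triangles joined by the single link 2-3.
W = [[0.0] * 6 for _ in range(6)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a][b] = W[b][a] = 1.0
part = [0, 0, 0, 1, 1, 1]

for r in (-0.5, -1.0, -5 / 3):
    print(r, signed_modularity(with_resistance(W, r), part), closed_form(W, part, r))
```

Both columns should agree up to rounding, which is the sense in which the resistance acts as a size-dependent penalty added to the original modularity.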
Resolving the substructure of networks using a single parameter, as proposed in the AFG method, still suffers from a resolution problem. As pointed out by [Lancichinetti & Fortunato, 2011], when modules of very different sizes coexist, multiresolution methods tend to break the larger groups before finding the smaller ones. The phenomenon is easy to understand with an example: imagine an image of a life-size elephant and an ant. To see the details of the ant we have to get so close to the image that the elephant disintegrates into smaller parts, and only part of the elephant is seen when focusing on the ant. In terms of modularity, we are trying to unravel those areas which are denser in links than other areas of the network. A way to determine whether we could have resolution problems is to plot a link density map and check for sharp contrasts. If very different topological scales coexist, there will also be jumps in the clustering coefficient. In the example provided by [Lancichinetti & Fortunato, 2011], which consists of an ER network of 400 nodes with average degree 100 linked to two cliques of 13 nodes only by a single link between them (see Fig. 1), the clustering coefficient presents a drastic separation of scales, see Fig. 2. This indicates small zones of the network that are very densely connected and a wide area that is not so dense, corresponding to the cliques and the ER network, respectively.
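The separation of scales in the clustering coefficient can be reproduced with a short sketch. The generator below is an assumption about the benchmark wiring: every pair of the three subgraphs is joined by one link, and the end nodes of those links are chosen arbitrarily. The function names and the random seed are likewise hypothetical.

```python
import random


def benchmark(n_er=400, avg_degree=100, clique_size=13, seed=0):
    """ER graph plus two cliques, stored as a list of neighbor sets (assumed wiring)."""
    rng = random.Random(seed)
    n = n_er + 2 * clique_size
    adj = [set() for _ in range(n)]

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    p = avg_degree / (n_er - 1)               # ER link probability
    for a in range(n_er):
        for b in range(a + 1, n_er):
            if rng.random() < p:
                link(a, b)
    for start in (n_er, n_er + clique_size):  # the two cliques
        for a in range(start, start + clique_size):
            for b in range(a + 1, start + clique_size):
                link(a, b)
    link(0, n_er)                             # ER - first clique (assumption)
    link(1, n_er + clique_size)               # ER - second clique (assumption)
    link(n_er + 1, n_er + clique_size + 1)    # clique - clique (assumption)
    return adj


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves linked."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    # Each link between two neighbors of v is counted twice in this sum.
    linked_pairs = sum(len(adj[v] & adj[u]) for u in adj[v]) / 2
    return linked_pairs / (k * (k - 1) / 2)


adj = benchmark()
cc = [local_clustering(adj, v) for v in range(len(adj))]
print("mean clustering, ER nodes:    ", sum(cc[:400]) / 400)
print("mean clustering, clique nodes:", sum(cc[400:]) / 26)
```

With these assumptions the ER nodes cluster around the link probability, roughly 0.25, while the clique nodes stay close to 1, which is the kind of sharp contrast that signals potential resolution problems.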
3 Hierarchical Multiresolution method
Our approach to solving the resolution problem takes advantage of the capability of the AFG method to find meaningful communities from the initial steps of the mesoscale analysis. More precisely, we propose the use of an iterative scheme which combines the optimization of modularity close to the macroscale of the network with its splitting into subgraphs, one for each of the previously found communities.
Supposing that our network is undirected, weighted, with positive weights and no self-loops, the prescription of our algorithm is the following:
1. Start out from the macroscale partition $C_{\mathrm{macro}}$, which has only one community containing all nodes. Then, find the upper bound of this macroscale, which is the minimum value of the resistance parameter ($r_{\min}$) needed to find a partition of the network with optimal modularity formed by more than one community.
2. Split the network into the subgraphs defined by the partition just found.
3. Repeat the previous steps with each subgraph until no further subdivisions are needed.
This algorithm defines a hierarchical organization of the nodes, where the values of $r_{\min}$ at each splitting define the ultrametric distances between nodes, i.e. the heights in the dendrogram at which every pair of nodes first meet.
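The recursion can be sketched on a network small enough for exhaustive search. This is an illustration under stated assumptions, not the implementation used in this work: the network is unweighted and loop-free, every partition is enumerated instead of running a modularity heuristic, the first split of a subgraph is taken as the partition $C$ with the lowest resistance $-2wQ(C)/(N - \sum_s n_s^2/N)$ at which its modularity vanishes, and a subgraph is treated as a leaf when no partition has positive modularity. All names are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of the list `items` (exhaustive; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        yield [[first]] + part


def first_split(adj, nodes):
    """Return (r, blocks) for the subgraph induced by `nodes`, or (None, None)
    when no partition has positive modularity (e.g. a clique or a single node)."""
    inside = set(nodes)
    degree = {v: len(adj[v] & inside) for v in nodes}  # strengths inside the subgraph
    total = sum(degree.values())                       # 2w of the subgraph
    n = len(nodes)
    best_r, best_blocks = None, None
    if total == 0:
        return best_r, best_blocks
    for blocks in set_partitions(list(nodes)):
        if len(blocks) < 2:
            continue
        two_w_q = 0.0                                  # 2w * Q(C)
        for block in blocks:
            members = set(block)
            internal = sum(len(adj[v] & members) for v in block)
            strength = sum(degree[v] for v in block)
            two_w_q += internal - strength ** 2 / total
        if two_w_q <= 1e-12:
            continue                                   # cannot beat one community for r <= 0
        r0 = -two_w_q / (n - sum(len(b) ** 2 for b in blocks) / n)
        if best_r is None or r0 < best_r - 1e-12:
            best_r, best_blocks = r0, blocks
    return best_r, best_blocks


def hierarchy(adj, nodes):
    """Recursively split `nodes`; leaves are sorted node lists."""
    r, blocks = first_split(adj, nodes)
    if r is None:
        return sorted(nodes)
    return {"r": r, "children": [hierarchy(adj, b) for b in blocks]}


# Hypothetical toy network: a chain of three triangles joined by links 2-3 and 5-6.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6)]
adj = [set() for _ in range(9)]
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

tree = hierarchy(adj, list(range(9)))
print(tree)
```

In this toy chain the root separates one end triangle from the other two, which then split at a higher resistance, so the values of `r` grow from the root towards the leaves as expected for dendrogram heights. By symmetry either end triangle may be the one separated first; which one is returned depends on the enumeration order.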
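The calculations of $r_{\min}$, the upper bound of the macroscale, and of the corresponding partition $C_{\min}$ may be performed simultaneously, therefore avoiding the costly scanning of the whole mesoscale between the lower and upper bounds of the resistance [Granell et al., 2011, 2012 in press]. This is a consequence of the following properties:
- The value of $r_{\min}$ is negative, with the only exception in which the network is just a clique.
- $Q_r(C_{\mathrm{macro}}) = 0$, $\forall r$, where $C_{\mathrm{macro}}$ is the partition with a single community containing all nodes, because:
$$\sum_i\sum_j w_{ij} = 2w^+ - 2w^- = \sum_i\sum_j\left(\frac{w_i^+ w_j^+}{2w^+} - \frac{w_i^- w_j^-}{2w^-}\right),$$
with $w_i^\pm$ and $2w^\pm$ the positive and negative node and total strengths. In fact, modularity Eq. 2 is always zero for $C_{\mathrm{macro}}$, no matter the network or the value of the self-loops.
- Since $Q_r(C_{\mathrm{macro}}) = 0$ and modularity is a continuous and monotonically increasing function of the resistance for any given partition $C$, the optimal partition at $r_{\min}$ must satisfy $Q_{r_{\min}}(C_{\min}) = 0$.
- For any given partition $C$, the minimum meaningful value of the resistance is the one for which $Q_r(C) = 0$. Thus, Eq. (2) leads to
$$r_0(C) = -\frac{2w\,Q(C)}{N - \frac{1}{N}\sum_s n_s^2}, \tag{13}$$
where $Q(C)$ and $2w$ are the modularity and the total strength of the original network, $N$ is the number of nodes and $n_s$ is the number of nodes in community $s$ of $C$.
- The upper bound of the macroscale is given by
$$r_{\min} = \min_C r_0(C),$$
and $C_{\min}$ is the partition which minimizes $r_0(C)$.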
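As a worked example of the zero-modularity condition, consider a hypothetical toy network made of two triangles joined by a single link, partitioned into its two triangles, and assume that for negative resistance the condition $Q_r(C) = 0$ gives $r_0(C) = -2w\,Q(C) / (N - \frac{1}{N}\sum_s n_s^2)$ in terms of the original strengths. Each triangle has internal weight 6 (three links counted in both directions) and total strength 7.

```latex
% Two triangles joined by one link: N = 6 nodes, 7 links, 2w = 14.
% Partition C: one community per triangle, n_1 = n_2 = 3.
\begin{align*}
2w\,Q(C) &= 2\left(6 - \frac{7^2}{14}\right) = 5, \\
N - \frac{1}{N}\sum_s n_s^2 &= 6 - \frac{3^2 + 3^2}{6} = 3, \\
r_0(C) &= -\frac{5}{3} \approx -1.67 .
\end{align*}
```

For any resistance above $-5/3$ this partition has positive modularity and therefore beats the single-community partition, whose modularity is always zero.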
All these properties may be combined in the following fast-tracking resistance (FTR) algorithm to find the upper bound of the macroscale:
1. Optimize modularity at $r = 0$, to obtain partition $C'$.
2. Calculate $r' = r_0(C')$, the resistance at which the modularity of $C'$ vanishes, using Eq. (13).
3. Optimize modularity at $r'$, to obtain the current partition $C''$.
4. If $C'' = C'$ or $C'' = C_{\mathrm{macro}}$, then $r_{\min} = r'$ and $C_{\min} = C'$.
5. Otherwise, let $C' = C''$ and go back to the second step.
In practice, this algorithm converges in a few steps. It stops when a value of $r'$ is found such that the optimization of modularity does not produce any new partition. In this case, the modularity of both $C'$ and $C_{\mathrm{macro}}$ is zero, and no known partition can be used to obtain a better upper bound of the macroscale. Of course, we cannot claim that we have found the “real” $r_{\min}$, since no optimization heuristic can guarantee finding the global maximum of modularity (this is known to be an NP-hard problem, see [Brandes et al., 2008]), but this is the best approximation one may obtain. To exemplify the functioning of the FTR algorithm we show in Fig. 3 its application to the first hierarchical splitting of the Zachary karate club network [Zachary, 1977].
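The steps of the FTR algorithm can be sketched as follows. This is a minimal illustration under assumptions: the network is unweighted and loop-free, an exhaustive search over all partitions stands in for the modularity optimization heuristic (so it only works for very small networks), and $Q_r$ for $r \le 0$ is compared through its numerator $2wQ(C) - |r|\,(N - \sum_s n_s^2/N)$, since the denominator is the same for every partition. All function names are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of the list `items` (exhaustive; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        yield [[first]] + part


def canonical(blocks):
    """Order-independent representation of a partition."""
    return sorted(sorted(b) for b in blocks)


def two_w_q(adj, blocks):
    """2w * Q(C) for an unweighted, loop-free network stored as neighbor sets."""
    total = sum(len(nb) for nb in adj)              # 2w
    value = 0.0
    for block in blocks:
        members = set(block)
        internal = sum(len(adj[v] & members) for v in block)
        strength = sum(len(adj[v]) for v in block)
        value += internal - strength ** 2 / total
    return value


def size_penalty(blocks, n):
    """N - (1/N) * sum_s n_s^2, the factor multiplying |r| when r < 0."""
    return n - sum(len(b) ** 2 for b in blocks) / n


def optimize(adj, r, partitions):
    """Stand-in optimizer: partition with the largest Q_r, for r <= 0."""
    n = len(adj)
    best, best_value = [list(range(n))], 0.0        # single community: Q_r = 0
    for blocks in partitions:
        value = two_w_q(adj, blocks) - abs(r) * size_penalty(blocks, n)
        if value > best_value + 1e-9:
            best, best_value = blocks, value
    return canonical(best)


def fast_tracking_resistance(adj):
    """Return (resistance of the first split, partition, number of iterations)."""
    n = len(adj)
    partitions = list(set_partitions(list(range(n))))
    macro = [list(range(n))]
    current = optimize(adj, 0.0, partitions)                    # step 1
    if current == macro:
        return None, macro, 0                                   # no split for r <= 0
    steps = 0
    while True:
        steps += 1
        r = -two_w_q(adj, current) / size_penalty(current, n)   # step 2
        found = optimize(adj, r, partitions)                    # step 3
        if found == current or found == macro:                  # step 4
            return r, current, steps
        current = found                                         # step 5


# Hypothetical toy network: a chain of three triangles joined by links 2-3 and 5-6.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6)]
adj = [set() for _ in range(9)]
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

r_min, split, steps = fast_tracking_resistance(adj)
print(r_min, split, steps)
```

On this toy chain the partition found at zero resistance (one community per triangle) is not the first split: the second iteration finds a coarser partition in two communities that survives down to a lower resistance, which illustrates why the loop is needed.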
4 Results and discussion
We have applied the hierarchical multiresolution method described above to the benchmark proposed by [Lancichinetti & Fortunato, 2011] shown in Fig. 1. We use the FTR algorithm to speed up the process of finding the minimal $r$ at which every subgraph splits. The aim is to find the partition into three communities in which the giant ER network and each clique are separated. These three communities should contain the nodes labeled 1 to 400, 401 to 413 and 414 to 426, respectively.
As stated in the method, we have started out from the macroscale of the network, which contains the 426 nodes. The optimal partition splits in two communities at a value of the resistance parameter $r = -12.5$, yielding a community formed by the nodes from 1 to 400 and another community containing the 26 nodes corresponding to the two cliques. Performing the hierarchical method on the two communities obtained, we find that the community containing the 26 nodes rapidly splits in two communities of 13 nodes, at a value of the resistance equal to $-11.69$. The community containing 400 nodes splits in two at a much greater value of the resistance parameter, $-8.97$. After that, the hierarchical multiresolution method is applied to every community found, until no further divisions are needed. The results of this example are shown in a dendrogram representation in Fig. 4.
Observing this figure, we find that there is a region of the resistance parameter in which the three communities we were hoping to find coexist. This happens because the two cliques form their own communities well before the community of 400 nodes is split in two. Note that this result cannot be obtained using the original multiresolution AFG method exploring the whole mesoscale, because of the resolution limit emerging from the coexistence of very different topological scales. The rationale behind the success of the hierarchical method in this situation is the following: the separation of the network into optimal subgraphs, each one split and independently analyzed through the multiresolution scheme, reduces the global resolution limit. This resolution limit depends on the number of nodes and the number of links in the whole structure. The multiresolution method is able to focus the attention on lower scales while other parts of the network are being screened independently at larger values of the resistance $r$.
We have presented a hierarchical multiresolution method able to cope with networks where the resolution limit would make other schemes fail, finding the natural communities as defined by [Fortunato & Barthélemy, 2007]. The method is boosted by a mechanism that determines, in a few steps, the resolution parameter at which to optimize modularity. The results on the difficult separation of the benchmark proposed in [Lancichinetti & Fortunato, 2011] are encouraging and open the door to further investigation of modularity-based community detection methods that escape from the implicit resolution limit.
We acknowledge support from the Spanish Ministry of Science and Innovation FIS2009-13730-C02-02 and the Generalitat de Catalunya SGR-00838-2009.
- Arenas et al. [2007] Arenas, A., Duch, J., Fernández, A. & Gómez, S. [2007] “Size reduction of complex networks preserving modularity,” New J. Phys. 9, 176.
- Arenas et al. [2008] Arenas, A., Fernández, A. & Gómez, S. [2008] “Analysis of the structure of complex networks at different resolution levels,” New J. Phys. 10, 053039.
- Berry et al. [2011] Berry, J. W., Hendrickson, B., LaViolette, R. A. & Phillips, C. A. [2011] “Tolerating the community detection resolution limit with edge weighting,” Phys. Rev. E 83, 056119.
- Brandes et al. [2008] Brandes, U., Delling, D., Gaertler, M., Goerke, R., Hoefer, M., Nikoloski, Z. & Wagner, D. [2008] “On modularity clustering,” IEEE Trans. Knowl. Data Eng. 20, 172.
- Clauset et al. [2004] Clauset, A., Newman, M. E. J. & Moore, C. [2004] “Finding community structure in very large networks,” Phys. Rev. E 70, 066111.
- Danon et al. [2005] Danon, L., Díaz-Guilera, A., Duch, J. & Arenas, A. [2005] “Comparing community structure identification,” J. Stat. Mech. P09008.
- Duch & Arenas [2005] Duch, J. & Arenas, A. [2005] “Community identification using extremal optimization,” Phys. Rev. E 72, 027104.
- Fortunato [2010] Fortunato, S. [2010] “Community detection in graphs,” Phys. Rep. 486, 75.
- Fortunato & Barthélemy [2007] Fortunato, S. & Barthélemy, M. [2007] “Resolution limit in community detection,” Proc. Natl. Acad. Sci. USA 104, 36.
- Girvan & Newman [2002] Girvan, M. & Newman, M. E. J. [2002] “Community structure in social and biological networks,” Proc. Natl. Acad. Sci. USA 99, 7821.
- Gómez et al. [2009] Gómez, S., Jensen, P. & Arenas, A. [2009] “Analysis of community structure in networks of correlated data,” Phys. Rev. E 80, 016114.
- Good et al. [2010] Good, B. H., de Montjoye, Y.-A. & Clauset, A. [2010] “Performance of modularity maximization in practical contexts,” Phys. Rev. E 81, 046106.
- Granell et al. [2011] Granell, C., Gómez, S. & Arenas, A. [2011] “Mesoscopic analysis of networks: applications to exploratory analysis and data clustering,” Chaos 21, 016102.
- Granell et al. [2012 in press] Granell, C., Gómez, S. & Arenas, A. [2012 in press] “Unsupervised clustering analysis: a multiscale complex networks approach,” Int. J. Bifurcat. Chaos.
- Guimerà & Amaral [2005] Guimerà, R. & Amaral, L. A. N. [2005] “Cartography of complex networks: modules and universal roles,” J. Stat. Mech. P02001.
- Kumpula et al. [2007] Kumpula, J. M., Saramaki, J., Kaski, K. & Kertesz, J. [2007] “Limited resolution and multiresolution methods in complex network community detection,” Fluctuation Noise Letters 7, 209.
- Lancichinetti & Fortunato [2011] Lancichinetti, A. & Fortunato, S. [2011] “Limits of modularity maximization in community detection,” ArXiv e-prints, arXiv:1107.1155.
- Lancichinetti et al. [2011] Lancichinetti, A., Radicchi, F., Ramasco, J. J. & Fortunato, S. [2011] “Finding statistically significant communities in networks,” PLoS ONE 6, e18961.
- Newman [2004a] Newman, M. E. J. [2004a] “Analysis of weighted networks,” Phys. Rev. E 70, 056131.
- Newman [2004b] Newman, M. E. J. [2004b] “Fast algorithm for detecting community structure in networks,” Phys. Rev. E 69, 066133.
- Newman [2006] Newman, M. E. J. [2006] “Modularity and community structure in networks,” Proc. Natl. Acad. Sci. USA 103, 8577.
- Newman & Girvan [2004] Newman, M. E. J. & Girvan, M. [2004] “Finding and evaluating community structure in networks,” Phys. Rev. E 69, 026113.
- Pons & Latapy [2011] Pons, P. & Latapy, M. [2011] “Post-processing hierarchical community structures: Quality improvements and multi-scale view,” Theor. Comput. Sci. 412, 892.
- Pujol et al. [2006] Pujol, J. M., Béjar, J. & Delgado, J. [2006] “Clustering algorithm for determining community structure in large networks,” Phys. Rev. E 74, 016107.
- Reichardt & Bornholdt [2004] Reichardt, J. & Bornholdt, S. [2004] “Detecting fuzzy community structures in complex networks with a Potts model,” Phys. Rev. Lett. 93, 218701.
- Ronhovde & Nussinov [2009] Ronhovde, P. & Nussinov, Z. [2009] “Multiresolution community detection for megascale networks by information-based replica correlations,” Phys. Rev. E 80, 016109.
- Ronhovde & Nussinov [2010] Ronhovde, P. & Nussinov, Z. [2010] “Local resolution-limit-free Potts model for community detection,” Phys. Rev. E 81, 046114.
- Traag & Bruggeman [2009] Traag, V. A. & Bruggeman, J. [2009] “Community detection in networks with positive and negative links,” Phys. Rev. E 80, 036115.
- Traag et al. [2011] Traag, V. A., Dooren, P. V. & Nesterov, Y. [2011] “Narrow scope for resolution-limit-free community detection,” Phys. Rev. E 84, 016114.
- Zachary [1977] Zachary, W. W. [1977] “An information flow model for conflict and fission in small groups,” J. Anthropol. Res. 33, 452–473.