A New Comparative Definition of Community and Corresponding Identifying Algorithm

Yanqing Hu, Hongbin Chen, Peng Zhang, Menghui Li, Zengru Di, Ying Fan (author for correspondence: yfan@bnu.edu.cn)

Department of Systems Science, School of Management,
Center for Complexity Research,
Beijing Normal University, Beijing 100875, P.R.China
Abstract

In this paper, a new comparative definition of community in networks is proposed and the corresponding detection algorithm is given. A community is defined as a set of nodes in which each node’s degree inside the community is not smaller than its degree toward any other community. In the algorithm, the attractive force of a community on a node is defined as the number of connections between them. A self-organizing process based on this attractive force then detects the best communities without any extra parameter. Several artificial and real-world networks, including the Zachary karate club network and the College football network, are analyzed. The algorithm works well in detecting communities and also gives a natural description of network division and group formation.

Keywords: Complex Network, Community Structure, Comparative Definition

PACS: 89.75.Hc, 89.75.Fb

1 Introduction

Many physicists have become interested in the study of networks describing the topologies of a wide variety of systems [1, 2, 3], such as the World Wide Web [4], social and communication networks [5, 6], biochemical networks [7] and many more. Many networks are found to divide naturally into communities. Nodes belonging to a tight-knit community are more likely to have other properties in common. In the World Wide Web, community analysis has uncovered thematic clusters. In biochemical or neural networks, communities may be functional groups. As a result, the identification of communities has been the focus of many recent efforts, and many different algorithms have been proposed [8, 9, 10, 11, 12, 13, 14, 17, 22, 23, 24, 27, 28, 30, 31] (see [9] for a review).

Communities within networks can loosely be defined as subsets of nodes that are more densely linked to each other than to the rest of the network. Modularity [26] was presented as an index of community structure and is now widely accepted [9, 14, 16, 30] as a measure of the quality of a division into communities. Modularity was introduced by Newman and Girvan as follows:

$$Q = \sum_{r} \left( e_{rr} - a_r^{2} \right) \tag{1}$$

where $e_{rr}$ is the fraction of links that connect two nodes inside community $r$, $a_r$ is the fraction of links that have one or both vertices inside community $r$, and the sum extends over all communities in a given network. This index provides a quantitative measurement for deciding the best division of a network: the larger the value of $Q$, the more accurate the partition into communities. Maximizing modularity can therefore also be used to detect communities, and there are already many algorithms that maximize $Q$, such as Extremal Optimization (EO) [30], the Greedy algorithm [12] and other optimization algorithms. There are also many other algorithms for identifying communities in complex networks, such as the GN algorithm [22, 26], the random walks method [10], the edge clustering coefficient method [8] and spectral analysis [8]. When a method can only produce a dendrogram of the community structure, the best partition is usually obtained by maximizing the modularity $Q$. Unfortunately, the modularity maximization problem has been proved to be NP-complete [33]. Moreover, it has been proved that modularity may fail to identify modules smaller than a scale that depends on the total number of links of the network and on the degree of interconnectedness of the modules, even in cases where the modules are unambiguously defined [19].
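To make the modularity index concrete, the following minimal Python sketch evaluates it for a given division of an undirected network: for each community it takes the fraction of links inside the community minus the square of the fraction of link ends attached to the community, and sums over communities. The function name `modularity`, the edge-list input and the `community_of` mapping are assumptions made for this illustration, and counting link ends is the usual Newman-Girvan convention rather than something spelled out in the text above.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a division of an undirected network.

    edges: iterable of (u, v) pairs, each undirected link listed once.
    community_of: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    inside = defaultdict(int)     # links with both ends in the community
    link_ends = defaultdict(int)  # link ends attached to the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        link_ends[cu] += 1
        link_ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    # Per community: (fraction of links inside) - (fraction of link ends)^2.
    return sum(inside[r] / m - (link_ends[r] / (2 * m)) ** 2 for r in link_ends)


# Two triangles joined by a single link, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))  # 5/14, about 0.357
```

Putting all six nodes into one community gives a value of zero, which is why a positive value signals a division that is better than no division at all.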

There are also other community definitions based on the topology of networks, such as self-referring definitions and comparative definitions. The basic self-referring definition is a clique, defined as a subgroup of a graph containing more than two nodes where all the nodes are connected to each other by means of links in both directions. In other words, it is a fully connected subgraph. This is a particularly strong definition and is rarely fulfilled in real sparse networks for larger groups [20]. Another self-referring community definition is the k-core, defined as a subgraph in which each node is adjacent to at least a minimum number, $k$, of the other nodes in the subgraph. It is weaker than a clique, but it is very hard to find the optimal $k$ when we want to detect the best partition of the network. Comparative definitions are based on the comparison of links. There are three kinds of comparative definitions: the LS-set, the strong community and the weak community. An LS-set is defined as a set of nodes in which each of its subsets has more ties to its complement within the set than outside [32]. The LS-set definition is also quite stringent, and detecting all the LS-sets in a network is a very hard problem. In order to relax the constraints, Radicchi et al. [27] proposed the strong definition and the weak definition. In a strong community, each node has more connections within the community than with the rest of the network; in a weak community, the sum of all degrees within the community is larger than the sum of all degrees toward the rest of the network. Based on these comparative definitions, a self-contained algorithm similar to the GN algorithm was developed for finding strong or weak communities in a network, but it is computationally very costly.

In this article, following the basic idea of comparative definitions, we define a community as a set of nodes in which each node’s degree inside the community is larger than or at least equal to its degree toward any other community. This definition differs from other comparative definitions. The strong, weak and LS-set definitions compare the degree inside the community with the degree toward the whole rest of the network, whereas our definition compares the degree inside the community with the degree toward each other community separately.

How, then, can communities be detected in a network on the basis of our definition? Obviously, whether a node belongs to a community is determined by its connections. We define the attractive force of a community on a node by the links that connect them. Employing a self-organizing process based on this attractive force, we can detect community structures without any extra parameter. The algorithm also gives a natural description of the influence of a community on a node and of the group formation process: as communities form, each individual continuously chooses and changes its position according to its friends until the partition becomes clear.

This paper is organized as follows. Section 2 gives our comparative definition of communities in networks. Section 3 presents the corresponding algorithm in detail. Section 4 applies the definition and the algorithm to ad hoc networks, the Zachary karate club network and the College football network. Some concluding remarks are given in Section 5.

2 Quantitative Definitions of Community

2.1 Previous comparative definition

The most important comparative definitions of community are the strong and weak definitions proposed by Radicchi et al. [27]. Suppose there is a network $G$ which has $n$ nodes and which is represented mathematically by an adjacency matrix $A$ with elements $A_{ij} = 1$ if there is an edge from $i$ to $j$ and $A_{ij} = 0$ otherwise.

Definition of Community in a Strong Sense. The subnetwork $U$ is a community in a strong sense if for any node $i$ belonging to $U$ we have

$$\sum_{j \in U} A_{ij} > \sum_{j \notin U} A_{ij} \tag{2}$$

Definition of Community in a Weak Sense. The subnetwork $U$ is a community in a weak sense if we have

$$\sum_{i \in U} \sum_{j \in U} A_{ij} > \sum_{i \in U} \sum_{j \notin U} A_{ij} \tag{3}$$

Obviously, the strong definition concerns the situation of every node, whereas the weak definition takes the community as a whole. From the strong (weak) definition it follows easily that if $U_1$ and $U_2$ satisfy the strong (weak) definition, then $U_1 \cup U_2$ also satisfies the strong (weak) definition. Radicchi et al. [27] call this property self-contained and use a self-contained algorithm, similar to the GN algorithm, to find strong or weak communities in a network. However, this algorithm is computationally very costly.
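The two definitions above can be checked directly on a small network. The sketch below is only an illustration under assumed conventions: the network is stored as a dict `adj` mapping each node to the set of its neighbours, and the function names are hypothetical. The strong test compares inside and outside degree node by node, while the weak test compares the two sums over all members.

```python
def inside_outside_degree(adj, community, node):
    """Return (links from node into community, links from node to the rest)."""
    inside = len(adj[node] & community)
    return inside, len(adj[node]) - inside


def is_strong_community(adj, community):
    """Every member has strictly more links inside than outside."""
    community = set(community)
    return all(
        k_in > k_out
        for k_in, k_out in (inside_outside_degree(adj, community, i) for i in community)
    )


def is_weak_community(adj, community):
    """Summed inside degree strictly exceeds summed outside degree."""
    community = set(community)
    total_in = total_out = 0
    for i in community:
        k_in, k_out = inside_outside_degree(adj, community, i)
        total_in += k_in
        total_out += k_out
    return total_in > total_out


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the link 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(is_strong_community(adj, {0, 1, 2}))     # True
print(is_strong_community(adj, {0, 1, 2, 3}))  # False: node 3 has 1 link inside, 2 outside
print(is_weak_community(adj, {0, 1, 2, 3}))    # True: 8 inside link ends against 2 outside
```

The last two lines show a set that is a community in the weak sense but not in the strong sense, because a single node can fail the node-level test while the totals still pass.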

2.2 New community definition

Inspired by the above strong and weak definitions, we define the community as follows.

Definition of Community: if $U_1, U_2, \dots, U_m$ are communities of $G$, they should satisfy that

(4)

and

$$\sum_{j \in U_k} A_{ij} \ge \sum_{j \in U_l} A_{ij} \quad \text{for every node } i \in U_k \text{ and every } l \neq k \tag{5}$$

where $A_{ij}$ is the element of the adjacency matrix of $G$. This definition can be summarized as follows: in a community, each node’s degree inside the community should not be smaller than the node’s degree toward any other community. Like the strong definition, our definition focuses on the situation of each node. However, instead of comparing the degree inside the community with the degree toward the whole rest of the network, our definition compares the degree inside the community with the degree toward each other community separately. Obviously, our definition is weaker than the strong definition. We can also give a further, most weak community definition: the sum of all degrees inside the community should not be smaller than the sum of degrees toward any one other community. Like the weak definition, the most weak definition focuses on the community as a whole instead of the single node. The difference is that the weak definition compares the sum of degrees inside the community with the sum of degrees toward the whole rest of the network, whereas the most weak definition compares it with the sum of degrees toward each other community. In the following discussion, we only deal with the new definition given by formula (5).
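A short check of the node-level comparison may help to see why the new definition is weaker than the strong one. The Python sketch below assumes the network is given as a dict `adj` of neighbour sets and the communities as a list of node sets; the function names and the toy network of four triangles are hypothetical and chosen only for illustration.

```python
def satisfies_new_definition(adj, communities):
    """Check the node-level comparison against every other community.

    adj: dict mapping each node to the set of its neighbours.
    communities: list of node sets (they may overlap).
    """
    communities = [set(c) for c in communities]
    for k, own in enumerate(communities):
        for i in own:
            degree_inside = len(adj[i] & own)
            for l, other in enumerate(communities):
                if l != k and len(adj[i] & other) > degree_inside:
                    return False
    return True


def build_adjacency(edges):
    """Turn a list of undirected links into a dict of neighbour sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Four triangles a, b, c, d; node a0 has one extra link to each other triangle.
triangles = {name: [name + str(i) for i in range(3)] for name in "abcd"}
edges = []
for t in triangles.values():
    edges += [(t[0], t[1]), (t[1], t[2]), (t[0], t[2])]
edges += [("a0", "b0"), ("a0", "c0"), ("a0", "d0")]
adj = build_adjacency(edges)
partition = [set(t) for t in triangles.values()]

# a0 has 2 links inside its triangle and 3 links outside it in total,
# so the strong condition fails for a0, yet no single other community
# attracts more than 1 of its links.
print(satisfies_new_definition(adj, partition))  # True
```

In this example the triangle containing `a0` is not a community in the strong sense, but the division into four triangles still satisfies the comparison with each other community taken separately.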

3 Algorithm

In order to detect the community structure under our new definition, we initially set each node together with a random half of its neighbors to be a community. We then define an attractive force through the connections among nodes and let the communities self-organize under these forces. When the community structure becomes fixed, the surviving communities form the best partition, which naturally satisfies the above definition.

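Let $F_{ik}$ denote the attractive force of community $U_k$ on node $i$. It can be calculated by the formula

$$F_{ik} = \sum_{j \in U_k} A_{ij} \tag{6}$$

where $A_{ij}$ is the adjacency matrix element, so that $F_{ik}$ is the number of connections between node $i$ and the members of $U_k$.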

Then our algorithm is defined as follows.

  1. Initially, set each node together with a random half of its neighbors to be a community. If the number of neighbors is odd, the half is rounded to a whole number of randomly chosen neighbors. If two or more communities are identical, keep only one of them. After this first step the network is therefore covered by at most $n$ overlapping communities, where $n$ is the number of nodes in the network.

  2. Calculate the attractive force of every community on every node.

  3. Move every node, all at the same time, into the community or communities that exert the largest attractive force on it.

  4. Check all communities; if two or more communities are identical, keep only one of them.

  5. Repeat steps 2 to 4 until a sufficient number of steps has been performed or the partition is fixed.
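The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors’ implementation: the network is a dict `adj` of neighbour sets without isolated nodes, an odd number of neighbours is rounded up when taking the random half, empty communities are dropped, and the names `detect_communities`, `max_steps` and `seed` are hypothetical.

```python
import random


def detect_communities(adj, max_steps=100, seed=None):
    """Self-organizing community detection driven by attractive forces.

    adj: dict mapping each node to the set of its neighbours (undirected,
         no isolated nodes, sortable node labels).
    Returns a list of node sets, which may overlap.
    """
    rng = random.Random(seed)

    # Step 1: each node plus a random half of its neighbours; the set
    # removes duplicate communities. Rounding up is an assumption.
    communities = set()
    for node, nbrs in adj.items():
        half = rng.sample(sorted(nbrs), (len(nbrs) + 1) // 2)
        communities.add(frozenset([node, *half]))

    for _ in range(max_steps):
        current = list(communities)
        members = [set() for _ in current]
        for node, nbrs in adj.items():
            # Step 2: force = number of links between node and community.
            forces = [len(nbrs & c) for c in current]
            strongest = max(forces)
            # Step 3: join every community that attains the largest force.
            # All nodes are reassigned from the same snapshot `current`.
            for k, force in enumerate(forces):
                if force == strongest:
                    members[k].add(node)
        # Step 4: drop duplicates (and, as an assumption, empty communities).
        updated = {frozenset(c) for c in members if c}
        # Step 5: stop once the partition no longer changes.
        if updated == communities:
            break
        communities = updated
    return [set(c) for c in communities]


# Two triangles joined by the link 2-3; the result depends on the random start.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(detect_communities(adj, seed=0))
```

Because every node is reassigned from the same snapshot of the communities, the update is simultaneous, and the `max_steps` bound guards against runs in which the partition keeps changing.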

The time complexity of our algorithm is . Step runs in time , step in , step in , step in and the repeated time in step is uncertain, where is the average degree. According to the numerical experiments in artificial networks, around repeating steps, the partition will be fixed. So we think the time complexity is . It is lower than many algorithms for detecting community structures.

Even though our definition of communities is not self-contained in the way the strong and weak definitions are, more than one partition may satisfy it. We therefore keep some stochastic factors in the initial partition and run the algorithm several times. We can then report the average result or choose the best of all the partitions. Here we introduce another indicator for evaluating partitions. We think the best partition should have more connections inside the communities and fewer connections outside the communities. We therefore use the ratio of the average connection density inside the communities to the connection density outside the communities to measure how reasonable a partition is. This measurement is defined as follows. Suppose the network contains $n$ nodes and $m$ connections and is partitioned into $c$ communities. Let $n_i$ denote the number of nodes in the $i$th community and $m_i$ the number of connections in the $i$th community. Then the average connection density inside the communities is

$$\rho_{in} = \frac{1}{c} \sum_{i=1}^{c} \frac{m_i}{n_i (n_i - 1)/2} \tag{7}$$

and the connection density outside the communities is

$$\rho_{out} = \frac{m - \sum_{i=1}^{c} m_i}{n (n - 1)/2 - \sum_{i=1}^{c} n_i (n_i - 1)/2} \tag{8}$$

The measurement can then be defined as $R = \rho_{in} / \rho_{out}$, with the case of a single community, in which $\rho_{out}$ is undefined, treated separately. Obviously, a larger $R$ means a more reasonable partition.
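As a rough illustration of such an indicator, the sketch below computes the average inside density (links inside a community divided by the number of node pairs inside it, averaged over communities) and divides it by the outside density (links between communities divided by the number of node pairs between communities). The exact formulas, the name `density_ratio`, the restriction to disjoint communities, the density 0 given to single-node communities and the refusal of the single-community case are all assumptions of this sketch.

```python
def density_ratio(adj, communities):
    """Ratio of average inside density to outside density.

    adj: dict mapping each node to the set of its neighbours (undirected).
    communities: list of disjoint node sets covering the network.
    """
    communities = [set(c) for c in communities]
    if len(communities) < 2:
        raise ValueError("the single-community case is not covered by this sketch")
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2

    def pairs(size):
        return size * (size - 1) // 2

    inside_links = [sum(len(adj[i] & c) for i in c) // 2 for c in communities]
    # Average over communities of (links inside) / (node pairs inside);
    # a single-node community is given density 0 here (an assumption).
    densities = [
        links / pairs(len(c)) if len(c) > 1 else 0.0
        for links, c in zip(inside_links, communities)
    ]
    rho_in = sum(densities) / len(communities)
    outside_pairs = pairs(n) - sum(pairs(len(c)) for c in communities)
    rho_out = (m - sum(inside_links)) / outside_pairs
    return float("inf") if rho_out == 0 else rho_in / rho_out


# Two triangles joined by the link 2-3, split into the two triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(density_ratio(adj, [{0, 1, 2}, {3, 4, 5}]))  # about 9
```

In the example both triangles have inside density 1, while only 1 of the 9 possible links between them exists, which gives a ratio of about 9.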

4 Application in ad hoc and Real Networks

4.1 Algorithm on artificial networks

In order to test our algorithm, we apply it to computer-generated random networks with a well-known predetermined community structure and to some real networks. The accuracy of the algorithm is evaluated by the similarity function of [29]. Each artificial network consists of nodes divided into equally sized communities. Edges between two nodes are introduced with different probabilities depending on whether the two nodes belong to the same community or not: every node has on average $z_{in}$ links to its fellows in the same community and $z_{out}$ links to the outer world, with the total $z_{in} + z_{out}$ kept fixed. For each given out degree $z_{out}$, we produce 20 realizations of networks. For each network, we first run the algorithm once and report the average accuracy over the 20 networks as One-run in Fig. 1. We then run the algorithm 15 times for each network and choose the best partition with the aid of the indicator defined in Section 3. The average accuracy over the 20 networks is shown as Multi-runs in Fig. 1. Comparing our algorithm with the GN algorithm [22, 26], we find that the accuracy of the One-run algorithm is similar to that of GN and the accuracy of the Multi-runs algorithm is better than that of GN. Moreover, the GN algorithm needs an extra index and has a high time complexity, whereas our algorithm does not need any extra parameter and has a lower time complexity.
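A benchmark network of this kind can be generated along the following lines. The sketch is an assumption-laden illustration: the function name `planted_partition` and its parameters are hypothetical, the link probabilities are chosen so that a node has on average `z_in` links inside its own community and `z_out` links outside it, and the parameter values in the usage line are arbitrary rather than those of the experiments.

```python
import random
from itertools import combinations


def planted_partition(n_groups, group_size, z_in, z_out, seed=None):
    """Random network with equally sized, predetermined communities.

    Requires n_groups >= 2 and group_size >= 2.
    Returns (adj, group_of): neighbour sets and the true community of each node.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    group_of = {node: node // group_size for node in range(n)}
    # Expected inside degree z_in over group_size - 1 possible fellows,
    # expected outside degree z_out over n - group_size possible outsiders.
    p_in = z_in / (group_size - 1)
    p_out = z_out / (n - group_size)
    adj = {node: set() for node in range(n)}
    for u, v in combinations(range(n), 2):
        p = p_in if group_of[u] == group_of[v] else p_out
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj, group_of


# Arbitrary illustrative values: 4 groups of 8 nodes, 5 inside and 2 outside links on average.
adj, group_of = planted_partition(4, 8, z_in=5, z_out=2, seed=42)
print(sum(len(nbrs) for nbrs in adj.values()) / len(adj))  # average degree, near 7
```

Raising `z_out` while keeping `z_in + z_out` fixed makes the communities fuzzier, which is the quantity varied along the horizontal axis of such accuracy tests.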

Figure 1: The accuracy of our algorithm and of the GN algorithm. The accuracy of the one-run algorithm is similar to that of the GN algorithm. The best partition of the multi-runs algorithm, chosen with the aid of the indicator, is better than the GN result when the out degree becomes larger. For the multi-runs results the algorithm is run 15 times on each network. Each point is the average over 20 realizations of networks.
Figure 2: The evolution of the community degree during the process of the algorithm. The results are for the one-run algorithm. When the community structure is not very fuzzy, the one-run algorithm satisfies our community definition very well.

We also test to what extent the partition satisfies our definition as the algorithm proceeds. For a given partition $U_1, \dots, U_m$ of the node set $V$, we define its community degree as the ratio

$$D = \frac{|V'|}{|V|} \tag{9}$$

where $V'$ denotes the subset of $V$ in which each node satisfies the requirement of our definition of community, that is, the node’s degree inside its own community is larger than or equal to its degree toward any other community. The numerical experiments show that when the community structure is not very fuzzy, the algorithm finally produces a partition that satisfies our definition very well, and the community degree tends to 1. When the community structure is very fuzzy, it is hard to find a partition that satisfies the definition exactly.
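The community degree can be computed as the fraction of nodes whose degree inside each of their own communities is at least their degree toward every other community. The following Python sketch shows one way to do this; the name `community_degree` and the dict-of-sets network representation are assumptions of the example, and a node that belongs to no community is counted as not satisfying the definition.

```python
def community_degree(adj, communities):
    """Fraction of nodes that satisfy the node-level community requirement.

    adj: dict mapping each node to the set of its neighbours.
    communities: list of node sets (they may overlap).
    """
    communities = [set(c) for c in communities]
    satisfied = 0
    for i, nbrs in adj.items():
        links = [len(nbrs & c) for c in communities]
        own = [k for k, c in enumerate(communities) if i in c]
        # Node i qualifies if it belongs to some community and each of its own
        # communities holds at least as many of its links as any other one.
        if own and all(links[k] >= max(links) for k in own):
            satisfied += 1
    return satisfied / len(adj)


# Two triangles joined by the link 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(community_degree(adj, [{0, 1, 2}, {3, 4, 5}]))  # 1.0
print(community_degree(adj, [{0, 1, 2, 3}, {4, 5}]))  # 5/6: only node 3 is misplaced
```

In the second division node 3 has one link inside its assigned community and two links toward the other one, so it is the only node that fails the requirement.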

Furthermore, Santo Fortunato and Marc Barthelemy [19] recently proved that modularity may fail to identify small communities and gave a kind of network, shown in Fig. 3, as an example. We test our algorithm on this kind of network. Provided the clique contained in each circle is large enough, our algorithm always detects all the predetermined communities (circles).

Figure 3: The circles represent communities in which each pair of nodes is connected. The circles are connected to each other by the minimal number of links. The plot is taken from [19].

4.2 Zachary karate club network

When applying our algorithm to real networks, we first use the popular Zachary karate club network [21], which is considered a simple benchmark for community-finding methodologies [22, 23, 24, 25, 26, 28]. This network was constructed from data collected by observing the members of a karate club over a period of years and considering the friendships between members. Our algorithm detects the communities shown in Fig. 4. The partition is reasonable compared with the actual division of the club members.

As mentioned above, there may be many partitions that satisfy the requirement of our definition and the final partition is related to the initial conditions. For the karate club network, if we think the club division is caused by some leaders, such as leaders (nodes) , and set the leaders and their random half neighbors as initial partition, then our algorithm will divide the network into communities. That is consistent with the real division. If we set be the leaders, our algorithm will also partition the network into communities which is the same as the result without leaders. It is very interesting, with the process of group formation, nodes and nodes combine and are in the same community respectively. The other community don’t contain any nodes of . It implies that, if some leaders have contradictions and want to divide the network, some nodes will not always follow the leaders and may form other groups (see Fig.4).

Figure 4: The community structure of Zachary Karate club network. Our algorithm detects communities which are depicted by circles, squares and triangles. When we set as leaders, the partition is the same. But if we set as leaders, the network will be divided into communities. Circles represent a community and the rest is another one, which corresponds to the actual division.

4.3 College football network

We also apply our algorithm to the College football network, which was provided by Newman. The network is a representation of the schedule of Division I games for the season. Nodes in the network represent teams and edges represent regular-season games between the two teams they connect. What makes this network interesting is that it incorporates a known community structure: the teams are divided into conferences [22], and games are more frequent between members of the same conference than between members of different conferences. Our overlapping algorithm identifies the conference structure with a high degree of success. Among the communities we detect, five are detected exactly, the average accuracy is 0.8115 and no node is overlapping. The GN algorithm combined with the $Q$ function [26] gives its best partition with $Q = 0.2998$; it divides the football teams into communities with an average accuracy of 0.6713. The results are shown in Tab. 4.3.

Table 4.3: The accuracy of each detected community compared with its real-world counterpart.

Conference name      Accuracy    GN accuracy
Atlantic Coast       1           1
Big East             0.8000      0.8889
Big10                1           1
Big12                1           0.9231
Conference USA       0.9000      0.9000
IA Independents      0           0
Mid American         0.8667      0.8667
Mountain West        1           0
Pac10                1           0.5556
SEC                  1           0.7500
Sunbelt              0.4444      0.4444
Western Athletic     0.7273      0.7273
Average accuracy     0.8115      0.6713
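The last row of the table can be verified from the twelve conference rows. The short Python check below simply averages the two columns as they are printed in the table; the variable names are chosen for this example only.

```python
# Per-conference accuracies copied from Table 4.3, in table order.
ours = [1, 0.8000, 1, 1, 0.9000, 0, 0.8667, 1, 1, 1, 0.4444, 0.7273]
gn = [1, 0.8889, 1, 0.9231, 0.9000, 0, 0.8667, 0, 0.5556, 0.7500, 0.4444, 0.7273]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


print(round(mean(ours), 4))  # 0.8115
print(round(mean(gn), 4))    # 0.6713
```

Both averages are unweighted means over the twelve conferences, so a small conference counts as much as a large one.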

5 Conclusion and discussion

In this paper, we present a new comparative definition of community and the corresponding algorithm. In a community, each node’s degree inside the community should be larger than or equal to the node’s degree toward any other community. We then introduce the concept of attractive force and develop a self-organizing algorithm based on the comparison of attractive forces. The algorithm can detect community structures without any extra parameter. In order to choose the best partition from several possible results, we also define an indicator to evaluate partitions. We apply the algorithm to artificial networks and to real-world networks such as the Zachary karate club network and the College football network. The algorithm works well on all of these networks. Furthermore, our community definition and identification algorithm can easily be generalized to weighted and directed networks.

Moreover, our algorithm can be used to predict network division when there are contradictions between some leaders. In the algorithm, we can initially set some leaders together with a random half of their neighbors to be the communities. The self-organizing process then gives a natural description of the leaders’ influence. We think this partition technique has great potential for analyzing network structure.

In Section 2, we give the most weak community definition: the sum of all degrees inside the community should not be smaller than the sum of degrees toward any other community. From the viewpoint of statistical physics, we think the most weak definition is also reasonable. Here we propose the open problem of finding an algorithm to detect communities based on the most weak definition.

Acknowledgement

The authors thank Professor M. E. J. Newman very much for providing the College Football network data. This work is partially supported by the 985 Project and NSFC under the grant No., No. and No..

References

  • [1] R. Albert, A.-L. Barabasi, Rev. Mod. Phys. 74, 47 (2002).
  • [2] M. E. J. Newman, SIAM Rev. 45, 167-256 (2003).
  • [3] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175-308 (2006).
  • [4] R. Albert, H. Jeong, A.-L. Barabasi, Nature 401, 130 (1999).
  • [5] S. Redner, Eur. Phys. J. B 4, 131 (1998).
  • [6] M. E. J. Newman, Proc. Natl. Acad. Sci. 98, 404 (2001).
  • [7] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai and A.-L. Barabasi, Nature 407, 651 (2000).
  • [8] L. Danon, J. Duch, A. Arenas, and A. Diaz-Guilera, arXiv:cond-mat/0505245 (2005).
  • [9] S. Lehmann, L. K. Hansen, arXiv:physics/0701348 (2007).
  • [10] M. Latapy and P. Pons, in Proceedings of the 20th International Symposium on Computer and Information Sciences, ISCIS’05, LNCS 3733, 284-293 (2005).
  • [11] F. Wu and B. A. Huberman, Eur. Phys. J. B 38, 331-338 (2004).
  • [12] A. Clauset, Phys. Rev. E 72, 026132 (2005).
  • [13] S. Muff, F. Rao and A. Caflisch, Phys. Rev. E 72, 056107 (2005).
  • [14] M. E. J. Newman, Proc. Natl. Acad. Sci. 103, 8577-8582 (2006).
  • [15] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, Hodder Arnold, London, 4th edition (2001).
  • [16] M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
  • [17] C. P. Massen and J. P. K. Doye, Phys. Rev. E 71, 046101 (2005).
  • [18] A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori, Physica A 352, 669 (2005).
  • [19] S. Fortunato and M. Barthelemy, Proc. Natl. Acad. Sci. 104, 36 (2007).
  • [20] G. Palla, I. Derenyi, I. Farkas, T. Vicsek, Nature 435, 814-818 (2005).
  • [21] W. Zachary, J. Anthropol. Res. 33, 452 (1977).
  • [22] M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. 99, 7821-7826 (2002).
  • [23] M. E. J. Newman, Phys. Rev. E 69, 066133 (2004).
  • [24] M. E. J. Newman and E. A. Leicht, Proc. Natl. Acad. Sci. 104, 9564-9569 (2007).
  • [25] L. Donetti and M. A. Munoz, J. Stat. Mech. P10012 (2004).
  • [26] M. E. J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
  • [27] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, Proc. Natl. Acad. Sci. 101, 2658 (2004).
  • [28] J. P. Bagrow and E. M. Bollt, Phys. Rev. E 72, 046108 (2005).
  • [29] Y. Fan, M. Li, P. Zhang, J. Wu, Z. Di, Physica A 377, 363-372 (2007).
  • [30] J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005).
  • [31] J. Reichardt and S. Bornholdt, Phys. Rev. Lett. 93, 218701 (2004).
  • [32] S. Wasserman, K. Faust, Social Network Analysis, Cambridge Univ. Press, Cambridge, U.K. (1994).
  • [33] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D. Wagner, arXiv:physics/0608255 (2006).