Community detection using preference networks
Abstract
Community detection is the task of identifying clusters or groups of nodes in a network where nodes within the same group are more connected to each other than to nodes in different groups. It has practical uses in identifying similar functions or roles of nodes in many biological, social and computer networks. With the availability of very large networks in recent years, performance and scalability of community detection algorithms have become crucial; if the time complexity of an algorithm is high, it cannot run on large networks. In this paper, we propose a new community detection algorithm, which has a local approach and is able to run on large networks. It has a simple and effective method: given a network, the algorithm constructs a preference network of nodes where each node has a single outgoing edge showing the node it prefers to be in the same community with. In such a preference network, each connected component is a community. Selection of the preferred node is performed using similarity-based metrics that can be calculated in the 1-neighborhood of nodes. We use two alternatives for this purpose: the number of common neighbors of the selector node and its neighbors, and the spread capability of neighbors around the selector node, which is calculated by the gossip algorithm of Lind et al. Our algorithm is tested on both computer-generated LFR networks and real-life networks with ground-truth community structure. It can identify communities accurately and quickly. It is local, scalable and suitable for distributed execution on large networks.
PACS numbers: 89.75.Hc, 89.65.Ef, 89.75.Fb
I Introduction
Community detection is one of the key areas in complex networks that has attracted great attention in the last decade. In network science, a network is seen as a system and individual nodes as agents or elements of the system where they are connected with ties Wasserman and Faust (1994). Mobile communication networks, scientific collaboration networks, patent networks, protein interaction networks and brain networks are examples of network representation of corresponding systems Onnela et al. (2007); Newman (2001a); Leskovec et al. (2005); Chen and Yuan (2006); Sporns (2013). Interaction of agents in a network can create emergent structures like communities. A community is defined as grouping of nodes in a network such that nodes in the same group have more connections with each other than with the nodes in the rest of the network Girvan and Newman (2002). There have been many algorithms proposed so far for community detection and there is a comprehensive survey by Fortunato on community detection Fortunato (2010).
While many algorithms perform well on small networks with hundreds or thousands of nodes, only a few of them can run on very large networks of millions or billions of nodes due to performance and time-complexity issues. If a community detection algorithm has to deal with the whole network during its execution steps or needs to optimize a global value (e.g. network modularity), it becomes computationally expensive to run it on large networks. Besides their large sizes, real-life networks also evolve over time, i.e. their structure and size can change while a community detection algorithm is still running.
In recent years, local community detection algorithms have been proposed to overcome the challenges of large networks. Local algorithms are scalable and suitable for distributed execution, i.e. they can work on separate parts of the network locally and then merge results for the whole network. Their local nature also makes it possible to identify more granular structures, which is useful for finding subtle communities, especially in networks of loosely connected groups of nodes.
In this paper, we propose a new community detection algorithm with a local approach. Our assumption is that each node in the network chooses to be a member of a community in order to be in the same group with some preferred nodes. Whether it is the common things they share, the common enemies they avoid, the common features they have or the common ones they follow, being a member of a community is meaningful only when friends or preferred nodes are together there. Such communities can be constructed by asking each node who they would like to be with, and then grouping nodes together according to the given answers. How should a node decide on its preferred node: a popular node, a hub node connecting others, or a node with the most common friends? We think that using the similarity of nodes is a good way to make the decision, i.e. each node should select the neighbor with whom it has the most similarity.
When we are given a network dataset, generally we only have the knowledge of nodes and edges; no meta-data describing the nodes or common features of nodes (i.e. similarity) may exist. So, using the connectivity information (edges) among the nodes in the network, we can derive some features of individual nodes (e.g. centrality, degree) and some similarity measures between nodes (e.g. the number of common neighbors shared by two nodes). In order to stay local, we should limit our attention to local metrics, i.e. metrics regarding a node and its local neighborhood only, not further. Alternatively, we can try other methods to decide on the preferred node, e.g. selection of the neighbor with the highest degree, or selection of the neighbor with the highest clustering coefficient. We will go into the details of the metrics for preferred node selection later.
The outline of the paper is as follows. We give background information about community detection and the methods we use in our algorithm. We then explain our community detection approach in detail. We compare our algorithm with other known algorithms on both real-life networks and computer-generated networks. We finish with the conclusion section.
II Background
II.1 Local approach for community detection
Community formation is something local by nature. But algorithms to detect communities may use global information rather than local information. For example, one community detection algorithm Girvan and Newman (2002) iterates over the network. In each iteration, it calculates the number of shortest paths passing through each edge and removes the edge with the largest number. Such calculations require information of the entire network. This global approach, which is fine for networks of small sizes, is not feasible for very large networks.
In recent years several local community detection algorithms have been proposed Newman (2004); Rosvall and Bergstrom (2007); Raghavan et al. (2007); Blondel et al. (2008); Gregory (2010); Lancichinetti et al. (2011); De Meo et al. (2014); Eustace et al. (2015). One popular method is the label propagation algorithm Raghavan et al. (2007), which has linear time-complexity; in it, nodes decide on their communities according to the majority label of their neighbors. Local community detection algorithms generally discover communities based on local interactions of nodes or local metrics calculated in their 1-neighborhood. Some algorithms merge nodes into communities based on the optimization of a local metric Eustace et al. (2015). Besides their scalability, these approaches are also suitable for large networks evolving over time. A local algorithm can handle what it already has (i.e. a portion of the network at a certain time) and continue with what comes later; it does not need a snapshot of the whole network at once.
II.2 Triangles and communities in networks
In his book, Simmel Simmel (1950) argued that a strong social tie could not exist without being part of a triangle, i.e. a relation among three people who all know each other. People who have common friends are more likely to create friendships; they form triangles. There is a correlation between triangles and communities in social networks; many triangles exist within communities while very few or no triangles exist between nodes of different communities Radicchi et al. (2004). A triangle is the smallest cycle, having size 3. There are studies that investigate cycles of size 4 or more Lind and Herrmann (2007), but we focus on triangles in this study.
The clustering coefficient (CC) is equal to the probability that two nodes that are both neighbors of the same third node will be neighbors of one another Newman (2001b). This metric compares the number of existing triangles around a node to all possible triangles. A high clustering coefficient means many triangles and clustering around a node;

$$CC_i = \frac{\Delta_i}{\tau_i}$$

where $\Delta_i$ is the number of triangles around node $i$ and $\tau_i$ is the number of triplets where $i$ is in the center. A triplet is formed by three nodes and two edges, i.e. a triplet $(j, i, k)$ centered at $i$ has edges $(j, i)$ and $(i, k)$.
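As a minimal illustration of this definition, the local clustering coefficient can be computed directly from an adjacency-set representation of the network (the `adj` dictionary and function name below are our own illustrative choices, not from the paper):

```python
# Local clustering coefficient CC_i = (# triangles around i) / (# triplets centered at i),
# computed from a plain adjacency-set representation.

def clustering_coefficient(adj, i):
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no triplet can be centered at a node of degree < 2
    # Each pair of neighbors of i forms a triplet centered at i;
    # the pair closes a triangle iff the two neighbors are themselves connected.
    triangles = sum(1 for u in neighbors for v in neighbors
                    if u < v and v in adj[u])
    triplets = k * (k - 1) / 2
    return triangles / triplets

# Toy graph: nodes 1, 2, 3 form a triangle; node 4 hangs off node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
```

For node 1, only one of its three neighbor pairs is connected, so its clustering coefficient is 1/3; node 2 sits in a closed triangle and gets 1.0.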
Radicchi et al. Radicchi et al. (2004) proposed a community detection algorithm based on triangles and clustering coefficient. In recent years, local community detection algorithms have focused more on the similarity of nodes, especially at the local level, i.e. the 1-neighborhood of nodes. A node-similarity-based approach was applied to several known community detection algorithms by Xiang et al. Xiang et al. (2016), who presented the improvements achieved on those algorithms by using various node similarity metrics. Some examples of those similarity metrics are the number of common neighbors and Jaccard similarity Jaccard (1901).
The number of common neighbors of nodes $i$ and $j$ is $|\Gamma(i) \cap \Gamma(j)|$, where $\Gamma(i)$ denotes the 1-neighborhood of $i$. The number of common neighbors equals the number of triangles formed on the two nodes, e.g. in Fig. 15, four triangles are formed on the two nodes by their four common neighbors.
Jaccard similarity is the fraction of the common neighbors of $i$ and $j$ to the union of their 1-neighborhoods, given as

$$J(i, j) = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|}.$$
All of these metrics are related to friendship transitivity and triangles.
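The two similarity metrics above can be sketched in a few lines over adjacency sets; this is a minimal sketch under our own naming, not the paper's implementation:

```python
def common_neighbors(adj, i, j):
    """Number of common neighbors of i and j, i.e. |Γ(i) ∩ Γ(j)|."""
    return len(adj[i] & adj[j])

def jaccard(adj, i, j):
    """Jaccard similarity: |Γ(i) ∩ Γ(j)| / |Γ(i) ∪ Γ(j)|."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

# Small undirected graph stored as adjacency sets.
adj = {"a": {"b", "c", "d"},
       "b": {"a", "c", "d"},
       "c": {"a", "b"},
       "d": {"a", "b"}}
```

Here nodes "a" and "b" share the two common neighbors "c" and "d", giving a Jaccard similarity of 2/4 = 0.5.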
A new metric, spread capability, is proposed as a similarity metric in this paper. This metric is calculated by using the gossip algorithm of Lind et al. Lind et al. (2007). A gossip about a victim node $v$ is initiated by one of its neighbors, node $o$ (the originator), which spreads the gossip to its common friends with $v$, i.e. gossip about $v$ is meaningful to friends of $v$ only. Nodes hearing the gossip behave the same way and propagate it further in the 1-neighborhood of $v$ until no further spread is possible. To measure how effectively the gossip is spread, they calculate the spread factor of victim $v$ by originator $o$ as

$$\sigma_{vo} = \frac{|W_{vo}|}{k_v}$$

where $W_{vo}$ is the set of neighbors of $v$ who heard the gossip originated by $o$, and $k_v$ is the degree of $v$. Lind et al. calculated the spread factors for each originator and averaged them to get the spread factor of $v$, i.e.

$$\sigma_v = \frac{1}{k_v} \sum_{o \in \Gamma(v)} \sigma_{vo}.$$

As we will explain in more detail in the next section, we use the spread factor in a different way in our algorithm; instead of the average gossip spread factor $\sigma_v$ of a node, we focus on the individual $\sigma_{vo}$ values, which show the contribution of each originator to that average. We call $\sigma_{vo}$ the spread capability of $o$ around $v$. Spread capability is directly related to the connectivity of $o$ and its position in the neighborhood of $v$. So each originator $o$ can have a different spread capability around $v$, and these values can be used as a similarity measure between $v$ and $o$ from the perspective of node $v$. Note that $0 < \sigma_{vo} \le 1$.
The spread capability metric is similar to the number of common neighbors and the clustering coefficient, but it carries additional information, see Fig. 4. It contains the number of common neighbors (triangles) between the victim $v$ and the originator $o$; moreover, it includes the number of other triangles around $v$ formed with its neighbors along the spreading pathway of the gossip originated by $o$. We call such an adjacent group of triangles a triangle cascade, where all triangles are cornered at the same node (i.e. $v$) and are adjacent to each other through common edges.

In Fig. 15, a group of neighbors forms a triangle cascade cornered at $v$. On this triangle cascade, the gossip about $v$ originated by $o$ is first spread to the common neighbors of $o$ and $v$ directly. By using the cascade, the gossip is then propagated step by step to neighbors of $v$ that are not directly connected to $o$. Although such a node is not a neighbor of $o$, it still plays a role in spreading the gossip by means of the triangle cascade. Note that all originators on the same cascade reach the same set of hearers and thus have equal spread factors. This property becomes very useful to reduce the computation of the gossip spread factor on a cascade: once $\sigma_{vo}$ is calculated for one originator $o$, the same value holds for every other originator on that cascade. See further discussion in SI Tas .
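A minimal sketch of the gossip spread described above, assuming the convention that the originator itself counts among the hearers (the function and variable names are our own, not from the paper): the gossip performs a breadth-first search restricted to the victim's 1-neighborhood.

```python
from collections import deque

def spread_capability(adj, v, o):
    """Fraction of v's neighbors reached by a gossip about v that starts at
    originator o and travels only along edges inside v's 1-neighborhood."""
    neighborhood = adj[v]
    heard = {o}              # the originator knows the gossip
    queue = deque([o])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in neighborhood and w not in heard:
                heard.add(w)
                queue.append(w)
    return len(heard) / len(neighborhood)

# Toy neighborhood of victim "v": "a" and "b" are connected to each other,
# while "c" is attached to "v" only, so it forms its own one-node cascade.
adj = {"v": {"a", "b", "c"},
       "a": {"v", "b"},
       "b": {"v", "a"},
       "c": {"v"}}
```

The cascade property from the text is visible here: originators "a" and "b" lie on the same cascade and yield the same spread capability, while "c" yields a smaller one.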
II.3 Method for comparison of two partitions
The success of a community detection algorithm lies in recovering the ground-truth communities. Since every node belongs to exactly one community in our algorithm, being in the same community is an equivalence relation. Hence, the community structure of a network is a partition of its set of nodes. Suppose we have the ground-truth partition. A community detection algorithm produces another partition; we then need a way of comparing these two partitions. For this purpose, Normalized Mutual Information (NMI) can be used Danon et al. (2005). NMI is a metric of how far apart (or close) two partitions are: if the NMI of two partitions is close to 1, they are very similar, i.e. the number of communities and the members of communities in the two partitions are similar; when it is close to 0, the two partitions are very different from each other.
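For reference, NMI can be computed directly from two node-to-label mappings; the sketch below uses the Danon et al. normalization $2 I(A;B) / (H(A) + H(B))$ and assumes both partitions cover the same node set (a stdlib-only illustration, not the authors' code):

```python
from collections import Counter
from math import log

def nmi(part_a, part_b):
    """Normalized mutual information between two partitions, each given
    as a dict mapping node -> community label."""
    nodes = list(part_a)
    n = len(nodes)
    ca = Counter(part_a[u] for u in nodes)
    cb = Counter(part_b[u] for u in nodes)
    joint = Counter((part_a[u], part_b[u]) for u in nodes)
    h_a = -sum(c / n * log(c / n) for c in ca.values())   # entropy of A
    h_b = -sum(c / n * log(c / n) for c in cb.values())   # entropy of B
    mi = sum(c / n * log((c / n) / (ca[x] / n * cb[y] / n))
             for (x, y), c in joint.items())              # mutual information
    if h_a + h_b == 0:
        return 1.0  # both partitions are a single community
    return 2 * mi / (h_a + h_b)

# Two toy partitions of the same four nodes.
p = {1: "x", 2: "x", 3: "y", 4: "y"}
q = {1: "a", 2: "a", 3: "a", 4: "a"}
```

Identical partitions give NMI 1, while comparing against the all-in-one partition gives 0, matching the interpretation in the text.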
II.4 Method for testing of algorithms
We need to test our algorithm on different networks. The first one is the Zachary karate club network Zachary (1977). It is a well-known network dataset of a karate club whose members were divided into two groups after a dispute over lesson prices: the first group continued with the president of the club while the second group continued with the instructor. The dataset has a ground-truth community structure, so we can compare it with the communities identified by our algorithm.
As a second group of tests, we use real-life networks provided by SNAP Leskovec and Krevl (2014), namely the DBLP dataset, the Amazon co-purchase network, the Youtube network and the European-email network, which have available ground-truth communities. We run several known community detection algorithms, namely Newman’s fast greedy algorithm Newman (2004), Infomap Rosvall and Bergstrom (2007), Louvain Blondel et al. (2008) and Label Propagation (LPA) Raghavan et al. (2007), on these network datasets and compare their results with the results of our algorithm. We compare the partitions identified by each algorithm with the ground-truth partitions of these networks using NMI. We also measure the execution times of all the algorithms (we use a standard laptop computer with a 2.2 GHz Intel Core i7 processor with 4 cores).
When we do not have the ground-truth community structure, we cannot benchmark the results of a newly proposed community detection algorithm. One can try a comparative analysis by running a set of algorithms on a network dataset and making pairwise comparisons between the partitions they identify. This is not a good way of quality testing for an algorithm, because there is no universally “best” community detection algorithm that can be used as a gold standard. We carried out such a comparative analysis; however, we could only see how close or far each algorithm is to any other algorithm in terms of the number of identified communities or NMI values. For that reason, without ground-truth, a quality test of an algorithm in terms of finding correct communities cannot be done.
Many real-life network datasets do not have a ground-truth community structure. In such cases, computer-generated networks like LFR benchmark networks Lancichinetti et al. (2008) with a planted community structure can be helpful. The LFR algorithm generates networks with ground-truth community structure using a parameter vector that includes $N$, the number of nodes, and $\mu$, the mixing parameter. We investigate the response of community detection algorithms to datasets generated with various mixing values. As $\mu$ increases, the community structure becomes more blurry and difficult to detect. Being nondeterministic, LFR can generate different networks for the same parameter vector. In order to avoid potential bias of an algorithm toward a single network, we generate 100 LFR networks for each parameter vector and report the averages.
III Our Approach
Our approach is based on building a preference network where each connected component is declared a community. Given a network, we build its corresponding preference network using the preference of each node for other nodes to be in the same community with. Every node prefers to be in the same community with certain nodes, and we simply try to satisfy these requests. In this study we implement the case where each node is allowed to select only one node, the most preferred node to be with. It is relatively easy to extend this approach to nodes preferring two, three or more nodes. First, we describe how to satisfy such requests by means of a preference network. Then we investigate ways to decide which node or nodes to be with.
III.1 Preference network
Let $G = (V, E)$ be an undirected network where $V$ and $E$ are the sets of nodes and of edges, respectively. Define a prefer function $p: V \to V$ such that $p(i) = j$ iff node $i$ prefers to be in the same community with node $j$. If we connect each $i$ to $p(i)$, $p$ clearly induces a new directed network on $V$, but we will use the corresponding undirected network. Using $p$, we define a new undirected network $G_p = (V, E_p)$ such that nodes $i$ and $j$ are connected, i.e. $\{i, j\} \in E_p$, iff either $p(i) = j$ or $p(j) = i$. We call this network the preference network. We consider the components of $G_p$ as communities. Hence we satisfy the rule that every node is in the same community with its preferred node. Note that the preference network is not a tree, since it may have cycles, as in the case where node $a$ prefers node $b$, $b$ prefers node $c$, and $c$ prefers $a$. See SI for an algorithm that extracts communities using the preference network Tas .
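Once every node has named its preferred node, extracting the communities amounts to finding the connected components of the preference network. A minimal union-find sketch (our own illustration; the paper's SI gives its own algorithm):

```python
def communities_from_preferences(prefer):
    """prefer: dict mapping each node to its preferred node.
    Returns the connected components of the undirected preference
    network, i.e. the communities, via union-find."""
    parent = {u: u for u in prefer}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Each preference contributes one undirected edge u -- prefer[u].
    for u, v in prefer.items():
        parent[find(u)] = find(v)

    groups = {}
    for u in prefer:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())

# Toy example: 1 -> 2 -> 3 -> 1 form a preference cycle; 4 and 5 prefer each other.
prefer = {1: 2, 2: 3, 3: 1, 4: 5, 5: 4}
```

Note that the cycle 1, 2, 3 still ends up in one component, illustrating the remark that the preference network need not be a tree.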
III.2 Deciding which node to be with
Now we can investigate a selection method for the preferred node. First of all, with the given definition, nothing restricts a node from preferring any other node in the network, even one it is not connected to. For example, a node could prefer the node with the largest betweenness centrality. This view is too general and requires global information.
We restrict the selection of the preferred node to the local neighborhood of every node. We calculate a score $s_i(j)$ for each node $j$ in the local neighborhood $N(i)$ around $i$, with respect to $i$. Then $i$ selects the node with the highest score (detailed later) as its preferred node. That is, we define the prefer function as

$$p(i) = \operatorname*{arg\,max}_{j \in N(i)} s_i(j)$$

where $s_i(j)$ is the score of node $j$ with respect to node $i$. In tie situations, i.e. when two or more neighbors have the highest score, node $i$ selects one of them randomly. Note that the score of $j$ depends on the node $i$. The score can be interpreted as a measure of how “important” node $j$ is for $i$. If node $j$ is the only connection of $i$, it has to have a very big value. If $i$ has many neighbors, then $j$ may not be very important for $i$. Hence the very same node usually has different scores with respect to different nodes, i.e. in general $s_i(j) \neq s_k(j)$.

The local neighborhood can be defined in a number of ways. One may define it as the set of nodes whose distance to $i$ is not more than $d$, or as the set of nodes whose distance to $i$ is exactly $d$. We may also include node $i$ itself in the local neighborhood; in that case $i$ may prefer to be in the same community with itself. For this study, we take $N(i)$ to be the 1-neighborhood of $i$, i.e. the set of nodes whose distance to $i$ is exactly 1, which is denoted by $\Gamma(i)$.
III.3 Candidates of score metric
There are a number of candidates for the calculation of the score $s_i(j)$ of node $j$ with respect to node $i$.

(i) The simplest one is to assign a random number to each neighbor of $i$ as its score. This is probably not a good choice, since a random function decides independently of node $i$, node $j$ and their relation with each other.

(ii) Nodes with more connections are usually considered to be more important in a network. So, as a second choice, we can use the degree of the nodes, i.e. $s_i(j) = k_j$. The degree $k_j$ of node $j$ is also independent of $i$, so we do not incorporate what $i$ thinks of $j$ in the score.

(iii) A third candidate is the clustering coefficient of $j$, which is an indication of how densely connected its immediate neighborhood is, i.e. $s_i(j) = CC_j$. This is again a value that does not directly depend on $i$, but it may have a meaning to $i$ and $j$, i.e. there is a chance that a high clustering coefficient of $j$ is at least partly due to triangles shared by $i$ and $j$.

(iv) Having common neighbors is an important feature in social networks. From the definition of community, members inside a community should have more edges among themselves, which leads to more common neighbors of nodes inside a community. The number of common neighbors of $i$ and $j$ is the number of triangles having $i$ and $j$ as two corners. For this reason, as a fourth candidate, we can use the number of common neighbors of $i$ and $j$, i.e. $s_i(j) = |\Gamma(i) \cap \Gamma(j)|$.

(v) As a fifth candidate, we can use the spread capability of a neighbor $j$ around node $i$, i.e. $s_i(j) = \sigma_{ij}$. It contains both the number of common neighbors and triangle cascades, as discussed earlier.

(vi) As the sixth candidate, we can use the Jaccard similarity of $i$ and $j$ as the score, i.e. $s_i(j) = J(i, j)$.
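Putting the pieces together, preferred-node selection is an argmax of a pluggable score function over a node's neighbors, with random tie-breaking as described above. This is a sketch under our own naming (using candidate (iv), common neighbors, as the score):

```python
import random

def preferred_node(adj, i, score):
    """Pick i's preferred neighbor: the one maximizing score(i, j),
    breaking ties uniformly at random."""
    best = max(score(i, j) for j in adj[i])
    candidates = [j for j in adj[i] if score(i, j) == best]
    return random.choice(candidates)

def cn_score(adj):
    """Score candidate (iv): number of common neighbors of i and j."""
    return lambda i, j: len(adj[i] & adj[j])

# Toy graph: "a" shares two neighbors ("c" and "e") with "b",
# but only one with "c" and one with "e".
adj = {"a": {"b", "c", "e"},
       "b": {"a", "c", "e"},
       "c": {"a", "b"},
       "e": {"a", "b"}}
```

Swapping `cn_score` for a degree-, clustering-coefficient- or Jaccard-based score changes only the lambda, which is why the six candidates above can be compared within one framework.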
IV Results and Discussion
IV.1 Selection of best score metric
We first analyze the alternative score metrics in our algorithm and try to find which one performs better in community detection. We run our algorithm on generated LFR networks Lancichinetti et al. (2008) using all six score metrics as the method of preferred node selection, and measure NMI values and execution times. The results of our algorithm using the six score metrics on LFR networks generated with increasing mixing values $\mu$ are in Fig. 7. We observe that the number of common neighbors is the best score metric among the six alternatives; it has the best NMI values and can identify the exact community structure on networks generated with low mixing values. The spread capability score metric has the second best results; it is better than Jaccard similarity, which finds 4-5 times as many communities as the ground-truth of the LFR networks. The other three metrics, namely random score assignment, degree and clustering coefficient, can identify communities to a degree but are not as successful as the ones mentioned above. In this second group of metrics, clustering coefficient is the best one, and our algorithm using the clustering coefficient score can find communities on networks generated with low $\mu$. Interestingly, the random score metric can identify a group of communities successfully on these networks. In general, our algorithm using the random and degree-based score metrics finds fewer communities than the ground-truth.
Naturally, the simpler score metrics require less computation time: the execution times of our algorithm using random score, degree and clustering coefficient are all lower, as given in Fig. 7(b). On the other hand, calculation of common neighbors, spread capability and Jaccard similarity requires more computation time, as these metrics are calculated for each pair of connected nodes (i.e. per edge); however, these metrics give better results in terms of community detection. Hence, we select the two best performing score metrics for our algorithm, namely common neighbors and spread capability, denoted as PCN and PSC, respectively. We use these score metrics in our algorithm for comparative analysis with other known algorithms on generated and real-life networks.
IV.2 Results on networks
IV.2.1 Zachary karate club network
We run our algorithm on the Zachary karate club network and compare the identified communities with those of the ground-truth. Our algorithm with the common neighbors score metric, namely PCN, identifies two communities as seen in Fig. 8. Only node 9 is misidentified by our algorithm. Node 9 actually has more connections within its identified community, so the ground-truth metadata may not reflect its actual community. All the other nodes are identified correctly.
IV.2.2 Large real-life networks
We run our algorithm with both score metrics, PCN and PSC, on large networks with ground-truth communities provided by SNAP Leskovec and Krevl (2014). For comparative analysis, Infomap, Louvain, LPA and Newman’s fast greedy algorithm are also run on these networks. We omit Newman’s algorithm on the Youtube network dataset since it could not finish due to long execution time. Results are presented in Table 1. On all four real-life networks, the numbers of communities found by PCN, PSC, Infomap and LPA are close to each other and not far from the ground-truth (one exception is the Youtube network). On all of these networks, our algorithm finds more communities because of its local nature.
In general, the performance of Louvain and Newman’s algorithm on large real-life networks is low. Their NMI scores are very low and the numbers of communities they identify are far from those of the ground-truth; they find very few communities compared to the ground-truth. These two algorithms perform better on the European-email network. The Infomap and LPA algorithms generally have better NMI values than the other algorithms; however, LPA finds only two communities in the European-email network, where there are 42 ground-truth communities. Our algorithm, with both score metrics (PCN and PSC), performs well on most of the networks with good NMI values. However, it performs poorly on the Youtube network, where all the other algorithms have similarly bad results. This may be due to the very small clustering coefficient of the Youtube network, i.e. no trivial community structure is available.
Network | N | E | CC | # communities | | | | | | | NMI | | | | | | execution time (ms) | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
 | | | | GT | PCN | PSC | Inf | LPA | Lvn | NM | PCN | PSC | Inf | LPA | Lvn | NM | PCN | PSC | Inf | LPA | Lvn | NM
European-email | 1,005 | 16,064 | 0.40 | 42 | 35 | 32 | 38 | 2 | 25 | 28 | 0.34 | 0.17 | 0.62 | 0.01 | 0.54 | 0.46 | 192 | 133 | 133 | 40 | 69 | 187 |
DBLP | 317,080 | 1,049,866 | 0.63 | 13,477 | 28,799 | 28,798 | 30,811 | 36,291 | 565 | 3,165 | 0.58 | 0.57 | 0.65 | 0.64 | 0.13 | 0.16 | 4,652 | 3,879 | 35,753 | 106,410 | 8,217 | 4,362,272 |
Amazon | 334,863 | 925,872 | 0.40 | 75,149 | 36,514 | 36,519 | 35,139 | 23,869 | 248 | 1,474 | 0.58 | 0.59 | 0.60 | 0.54 | 0.11 | 0.11 | 2,911 | 3,453 | 43,253 | 83,532 | 8,017 | 1,422,590 |
Youtube | 1,134,890 | 2,987,624 | 0.08 | 8,385 | 78,021 | 78,053 | 102,125 | 83,256 | 9,616 | - | 0.07 | 0.08 | 0.13 | 0.07 | 0.06 | - | 105,528 | 421,593 | 188,037 | 1,362,241 | 52,798 | - |
IV.2.3 Generated networks
We perform a similar comparative analysis on two sets of generated LFR networks of different sizes, as reported in the corresponding figures. We present the detailed results of the algorithms on LFR networks of 5,000 nodes in Table 2. As described earlier, we generate 100 LFR networks per $\mu$ value, run the algorithms on all 100 generated datasets and average the results for each algorithm. On the first set of LFR networks, our algorithm with common neighbors (PCN) is among the top 3 best performing algorithms according to the NMI values; on most of the networks, Infomap and our algorithm find the best results and LPA is in third place. Our algorithm with spread capability (PSC) has lower NMI values but still performs better than Newman’s algorithm. On the networks generated with higher mixing values, Infomap and LPA tend to find a small number of communities and sometimes group all the nodes into a single community. Louvain and Newman’s fast algorithm also find very few communities on these networks. However, our algorithm, with both score metrics, can still find communities successfully. The NMI values of our algorithm are better than those of the other algorithms, and the number of communities found by our method does not differ much from the ground-truth compared to the other algorithms.
On the second set of LFR networks, our algorithm has better results compared to its performance on the previous set. However, with the spread capability score metric it finds more granular communities, which leads to a greater number of communities compared to the ground-truth. Newman’s algorithm and the Louvain algorithm find very few communities; they tend to merge communities, which may be a consequence of the resolution limit Fortunato and Barthélemy (2007).
Infomap and LPA are both successful on large networks when the mixing parameter is low; however, their quality degrades with increasing mixing parameter, while our algorithm can still identify communities successfully.
One of the main differences between our algorithm and LPA is that we do not assign community labels to nodes; instead we keep the information of who prefers whom to be in the same community, using a preference network. During the execution of LPA, a node updates its community label according to the majority of the labels of its neighbors. However, when all or some of those neighbors later update their labels to a different community, the node falls apart from them and ends up in a different community, even though it wanted to be in the same community with them and updated its label accordingly. Using a preference network, we preserve all the preferences made by each node throughout the execution of the algorithm (because we never update any label), and the aggregation of all these preferences eventually leads to a good community structure.
IV.3 Performance of the algorithm
In this section we discuss the performance and time-complexity of our algorithm. The details are given in SI Tas . We use common neighbors as edge weights for PCN and spread capability for PSC. Given a network $G = (V, E)$ where the maximum degree of nodes is $k_{max}$, calculating the edge weights has $O(|E| \, k_{max})$ and $O(|E| \, k_{max}^2)$ time-complexity for PCN and PSC, respectively. As obtaining the communities on the preference network requires $O(|V|)$ time (the preference network has at most $|V|$ undirected edges), the overall time-complexity of our algorithm is $O(|E| \, k_{max})$ for PCN and $O(|E| \, k_{max}^2)$ for PSC.
Our algorithm is fast and suitable for very large networks even in a single-processor environment where calculations are done sequentially. Its speed can be improved further by parallel execution. As multiprocessor systems become widely available, how well an algorithm can be distributed over parallel processors becomes an important topic. Our algorithm requires calculations such as the number of common neighbors or spread capability, which relate to the edges around a node, so it can be distributed over many processors easily. In this case each processor handles the local calculations around a subset of the nodes in the network; each node then has the scores to decide its preferred node in a parallel and fast way. Network data can be redundantly replicated to all processors in order to avoid the computational overhead of a data-splitting process. Note that obtaining the components is not easily distributed in a parallel fashion, since component discovery in the preference network cannot be split into independent tasks; however, this step needs fewer computation steps and has less impact on overall performance.
Network | N | μ | E | CC | # communities | | | | | | | NMI | | | | | | execution time (ms) | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
 | | | | | GT | PCN | PSC | Inf | LPA | Lvn | NM | PCN | PSC | Inf | LPA | NM | Lvn | PCN | PSC | Inf | LPA | Lvn | NM
LFR-1 | 5,000 | 0.1 | 38,868 | 0.51 | 101 | 105 | 240 | 101 | 100 | 89 | 64 | 0.99 | 0.94 | 0.99 | 0.99 | 0.93 | 0.99 | 127 | 77 | 256 | 66 | 127 | 493 |
LFR-2 | 5,000 | 0.2 | 38,955 | 0.37 | 101 | 108 | 254 | 101 | 100 | 81 | 31 | 0.99 | 0.92 | 0.99 | 0.99 | 0.78 | 0.98 | 125 | 78 | 263 | 66 | 136 | 879 |
LFR-3 | 5,000 | 0.3 | 38,871 | 0.25 | 101 | 111 | 266 | 101 | 98 | 73 | 18 | 0.99 | 0.90 | 0.99 | 0.99 | 0.64 | 0.97 | 132 | 83 | 288 | 79 | 153 | 1,453 |
LFR-4 | 5,000 | 0.4 | 38,930 | 0.16 | 101 | 121 | 283 | 101 | 96 | 64 | 12 | 0.97 | 0.86 | 0.99 | 0.99 | 0.55 | 0.95 | 123 | 85 | 299 | 88 | 165 | 2,028 |
LFR-5 | 5,000 | 0.5 | 38,973 | 0.10 | 100 | 142 | 294 | 100 | 91 | 53 | 9 | 0.93 | 0.80 | 0.99 | 0.98 | 0.46 | 0.93 | 130 | 86 | 357 | 115 | 194 | 2,609 |
LFR-6 | 5,000 | 0.6 | 38,973 | 0.05 | 100 | 185 | 300 | 103 | 74 | 41 | 11 | 0.81 | 0.69 | 0.99 | 0.81 | 0.30 | 0.87 | 147 | 94 | 483 | 186 | 252 | 3,628 |
LFR-7 | 5,000 | 0.7 | 38,969 | 0.02 | 101 | 243 | 280 | 159 | 1 | 25 | 14 | 0.62 | 0.52 | 0.88 | 0.00 | 0.14 | 0.47 | 130 | 92 | 781 | 104 | 281 | 3,101 |
LFR-8 | 5,000 | 0.8 | 38,923 | 0.01 | 100 | 269 | 243 | 227 | 1 | 12 | 13 | 0.40 | 0.33 | 0.35 | 0.00 | 0.06 | 0.10 | 131 | 105 | 1,165 | 91 | 274 | 2,491 |
LFR-9 | 5,000 | 0.9 | 38,986 | 0.01 | 102 | 278 | 238 | 76 | 1 | 12 | 13 | 0.30 | 0.25 | 0.09 | 0.00 | 0.04 | 0.04 | 134 | 107 | 921 | 89 | 309 | 2,388 |
LFR-10 | 5,000 | 1.0 | 38,947 | 0.01 | 101 | 285 | 242 | 81 | 1 | 12 | 13 | 0.27 | 0.23 | 0.18 | 0.00 | 0.03 | 0.03 | 145 | 102 | 982 | 90 | 323 | 2,451 |
V Conclusion
We propose a new local community detection algorithm with two variants, PCN and PSC, which builds a preference network using two different node-similarity metrics, namely the number of common neighbors and the gossip spread capability. Although it uses only local information, its performance is good, especially when the community structure is not easily detectable. On LFR networks generated with higher values of the mixing parameter μ, our algorithm performs better than all the other algorithms used for comparison in this paper. On such networks, algorithms like Infomap and LPA merge all the nodes into a single community: they get stuck in a local optimum and fail to identify communities. Our algorithm identifies communities in many large real-life networks accurately and quickly.
We think that building a preference network to identify communities is a simple and powerful approach. With it, we preserve the preference of every node for being in the same community with another node, whether that node is highly connected or has only a few connections. This prevents the loss of granular community information, especially in very large networks. Even with random score assignment, our algorithm can identify many communities on generated LFR networks.
Due to its local nature, our algorithm is scalable and fast: it needs only a single pass over the whole network to construct the preference network, and the similarity metric used for this construction can be evaluated in the 1-neighborhood of each node. It can run on very large networks without loss of quality or performance.
We have not implemented a distributed or parallel version of our algorithm; however, it is suitable for parallel processing in a distributed environment. It can be deployed as agents on different parts of a large real-life network that evolves over time. On such a network, collecting the data of the whole network is costly in time, space and computation, while information about small parts of the network can easily be obtained and analyzed by a nearby agent for community detection. Agents can identify the community structure of a particular area without knowing the rest of the network, which is valuable information at that scale and can be used in real time by systems like peer-to-peer networks.
Source code of our algorithm is available online at:
https://github.com/murselTasginBoun/CDPN
Acknowledgments
Thanks to Mark Newman, Vincent Blondel and Martin Rosvall for the source codes of their community detection algorithms. Thanks to Mark Newman, Jure Leskovec and Vladimir Batagelj for the network datasets used.
This work was partially supported by the Turkish State Planning Organization (DPT) TAM Project (2007K120610).
VI Supplementary Information
The performance of the algorithm is discussed in two ways: the single-processor case deals with time complexity, while the other concerns the scalability of the algorithm to multiple processors running in parallel.
VI.1 Complexity on a single processor
In this section we approximate the time-complexity of the algorithm for real-life networks, which are sparse. The algorithm is composed of three steps. (i) For every edge $e_{ij}$ in the network, we assign a weight $w_{ij}$. We use two metrics for $w_{ij}$, namely the number of common neighbors and the spread capability. (ii) Then we construct a directed network, where each node is connected to exactly one neighbor, the one for which the weight is maximum. (iii) Finally, we group the nodes into communities using the directed network. First we investigate the complexity of calculating the edge weights; then the complexity of obtaining communities from the directed network is investigated.
VI.1.1 Complexity of obtaining common neighbors
Let $e_{ij}$ be the edge connecting $v_i$ and $v_j$, and let $d_i$, $d_j$ be the degrees of $v_i$ and $v_j$, respectively. In order to find out whether a neighbor $v_k$ of $v_i$ is also a neighbor of $v_j$, we need to search for $v_k$ in the neighbors of $v_j$. Comparing $v_k$ with each neighbor of $v_j$ would require $d_j$ comparisons. If we keep the neighbors of $v_j$ in a hash, which provides direct access, then the complexity of searching for $v_k$ in the hash is $O(1)$. Since there are $d_i$ neighbors of $v_i$, finding the common neighbors of $v_i$ and $v_j$ requires $d_i$ searches in the hash. Note that if $d_j < d_i$, then it is better to search the neighbors of $v_j$ in the hash of the neighbors of $v_i$. Hence the complexity of finding the common neighbors of $v_i$ and $v_j$ is $O(\min(d_i, d_j))$. This is the complexity of calculating the weight of a single edge in the network. The total number of comparisons required for all common neighbors is obtained by summing over all the edges. That is,
$$\sum_{e_{ij} \in E} \min(d_i, d_j) \leq E \, d_{max},$$
where $d_{max}$ is the maximum degree in the network. We get the worst-case complexity of $O(N^3)$ if the network is a complete graph, where $E = N(N-1)/2$ and $d_{max} = N-1$. This is not a problem for a community detection algorithm, since there is no community structure in a complete graph. Fortunately, real-life large networks are far from complete graphs. Although the number of nodes is very large, real-life networks are highly sparse, i.e., $E \ll N^2$, and their nodes are connected to a very small fraction of the other nodes, i.e., $d_{max} \ll N$. Note also that for networks with a power-law degree distribution, $d_{max}$ is extremely high compared to the degree of the majority of the nodes. So for real networks we can consider $\min(d_i, d_j)$ as constant, and the complexity becomes $O(E)$.
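As a concrete illustration of the hash-based search over the smaller neighborhood, the following Python sketch (the adjacency-dict representation and names are illustrative, not from the paper's released code) counts common neighbors in $O(\min(d_i, d_j))$ expected time:

```python
def common_neighbors(adj, u, v):
    """Count the common neighbors of u and v.

    `adj` maps each node to a set (hash) of its neighbors, so each
    membership test is O(1) on average; iterating over the smaller
    neighborhood gives O(min(d_u, d_v)) expected time.
    """
    small, large = adj[u], adj[v]
    if len(small) > len(large):      # always scan the smaller neighborhood
        small, large = large, small
    return sum(1 for x in small if x in large)
```

For example, in a small graph where node a neighbors {b, c, d} and node c neighbors {a, b, d}, the common neighbors of a and c are b and d, so the count is 2.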
VI.1.2 Complexity of obtaining spread capability
Suppose we want to calculate the spread capability of gossip originator $v_i$ around victim $v_j$, which is given as $s_{ij} = n_{ij}/d_j$, where the numerator $n_{ij}$ is the number of neighbors of $v_j$ that eventually receive the gossip. The denominator is simply the degree $d_j$ of the victim node $v_j$; the numerator needs to be calculated. We use the common neighbors algorithm to obtain the spread capability as follows. In the first wave, all common neighbors of $v_i$ and $v_j$ receive the gossip from $v_i$. If node $v_k$ is among the common neighbors of $v_i$ and $v_j$, it receives the gossip. Then $v_k$ starts its own wave, i.e., a triangular cascade, which passes the gossip to all the nodes in the common neighbors of $v_k$ and $v_j$. Any node that receives the gossip starts its own wave. As seen, we repeatedly use the common neighbors algorithm to propagate the gossip from one node to another.
The following observation about triangular cascades enables us to do the calculation once and reuse it. Note that the selection of the originator makes no difference for a given cascade. As visualized in Fig. 15, if $v_j$ receives the gossip initiated by $v_i$, then, in return, $v_i$ receives the gossip initiated by $v_j$. Therefore the two cascades reach the same set of nodes, which implies $n_{ij} = n_{ji}$. Hence we do the calculation of the cascade only once for the entire cascade, and reuse it for the remaining nodes in the cascade. This observation drastically reduces the number of calculations around $v_j$ if there are triangular cascades, which is the case in networks with community structure.
Whether all neighbors of $v_j$ are in one cascade or there are multiple cascades, every edge incident to $v_j$ has to be checked for common neighbors once, and we repeat that $d_j$ times. Hence the complexity per node is $O(d_j \, d_{max})$. We need to do this for every node, so the complexity becomes $O(E \, d_{max})$, i.e., $O(E)$ if we consider $d_{max}$ as a constant for real networks.
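The cascade described above can be sketched as a breadth-first traversal restricted to the victim's neighborhood. This is an illustrative sketch, not the paper's code; in particular, counting the originator itself in the numerator is an assumption of this sketch:

```python
from collections import deque

def spread_capability(adj, i, j):
    """Fraction of j's neighbors reached by a gossip started at i about j.

    Assumption of this sketch: the originator i (a neighbor of j) counts
    as knowing the gossip. The gossip passes only between nodes that are
    both neighbors of the victim j, i.e., a BFS restricted to adj[j].
    """
    victims_nbrs = adj[j]
    reached = {i}                    # originator knows the gossip
    queue = deque([i])
    while queue:
        k = queue.popleft()
        # k passes the gossip to the common neighbors of k and j
        for x in adj[k] & victims_nbrs:
            if x not in reached:
                reached.add(x)
                queue.append(x)
    return len(reached & victims_nbrs) / len(victims_nbrs)
```

On a toy graph made of two triangles {0,1,2} and {3,4,5} joined by the bridge 2-3, a gossip about victim 2 started by originator 0 reaches nodes 0 and 1 out of the three neighbors of node 2, while one started by node 3 reaches no one else, illustrating how the score favors neighbors inside the same community.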
VI.1.3 Complexity of obtaining communities
Now we investigate the complexity of extracting the communities once the preferred-node function is given. We assume that nodes have unique IDs. Initially we put all nodes in a list, mark them as unvisited and label them with their unique IDs. We process the unvisited nodes in the list one by one and terminate when all the nodes have been visited, as follows (see algorithm \procCommunity-Extraction and Fig. 16). We get an unvisited node $v$ from the list, mark it as visited and push it onto a stack. Then we move to the preferred node of $v$, i.e., $v \leftarrow \mathrm{preferred}(v)$, and repeat the process. Eventually $v$ becomes a node that has already been visited. We preserve its community label in $c$, i.e., $c \leftarrow \mathrm{label}(v)$. We label all the nodes in the stack with $c$ while popping them from the stack. Then we repeat the process with a new unvisited node, if there is any.
Note that the algorithm passes over every node twice: once when unvisited nodes are visited and pushed onto the stack, and once more when they are popped from the stack. So the complexity is $O(N)$.
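The stack-based extraction can be sketched as follows; the names `preferred` and `label` are illustrative, and community labels are node IDs as described above:

```python
def extract_communities(preferred):
    """Group nodes into communities given their preferred-node map.

    `preferred` maps every node to its single preferred neighbor; each
    connected component of this functional graph is one community, and
    every node ends up labeled with the ID of a node in its component.
    """
    label, visited = {}, set()
    for start in preferred:
        if start in visited:
            continue
        stack, v = [], start
        # follow the preference chain until an already-visited node
        while v not in visited:
            visited.add(v)
            stack.append(v)
            v = preferred[v]
        c = label.get(v, v)          # inherit its label, or use its own ID
        while stack:
            label[stack.pop()] = c
    return label
```

Each node is pushed once and popped once, matching the $O(N)$ bound above.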
VI.1.4 Overall complexity
Considering all the major parts of our algorithm, the overall complexity includes the edge-weight calculation and obtaining communities using the preference network. The two edge-weight metrics require different numbers of calculations, i.e., calculating the number of common neighbors, used in PCN, costs $O(\min(d_i, d_j))$ per edge, while calculating the spread capability, used in PSC, costs $O(d_j \, d_{max})$ per node; both total $O(E \, d_{max})$, which is $O(E)$ for sparse real-life networks.
Since obtaining communities from the preference network has $O(N)$ time-complexity regardless of the weight metric, the overall time-complexity of the PCN algorithm is $O(E + N)$, and the overall time-complexity of the PSC algorithm is also $O(E + N)$.
VI.2 Complexity on parallel processors
Using local information for community detection makes the algorithm a good candidate for parallel execution. Let us consider the case of using common neighbors as edge weights. Our algorithm has three steps of operation: (i) calculate the edge weights, either as the number of common neighbors or as the spread capability; (ii) connect each node to the neighbor with the highest edge weight, its preferred node; (iii) identify communities using the preference network. Step (iii) is not a good candidate for parallel execution, but steps (i) and (ii) are perfect candidates, since each processor can do its calculations without exchanging data with another processor.
Suppose we have $P$ processors, which can run in parallel. We can then get up to a $P$-fold speed-up: we dispatch the entire network data to all processors, and each processor calculates the weights of its share of the edges and then the preferred node for each of its nodes.
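A minimal sketch of steps (i) and (ii) run in parallel, assuming the adjacency data is replicated to every worker (here, shared memory stands in for replication; the thread-based workers and all names are illustrative, not the paper's code):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_edge_weights(adj, n_workers=4):
    """Step (i) in parallel: each edge weight is an independent task.

    `adj` maps each node to a set of its neighbors and is visible to
    every worker; the weight is the common-neighbor count, and each
    undirected edge is stored once with u < v.
    """
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    def weight(e):
        u, v = e
        return e, len(adj[u] & adj[v])
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(weight, edges))

def preferred_nodes(adj, weights):
    """Step (ii): each node points to its maximum-weight neighbor."""
    def w(u, v):
        return weights[(u, v) if u < v else (v, u)]
    return {u: max(adj[u], key=lambda v: w(u, v)) for u in adj}
```

On the two-triangle toy graph, each node's preferred node stays inside its own triangle, so the two communities emerge as the components of the preference network. Step (iii) would then run on the collected `preferred_nodes` output, which is the part that does not parallelize easily.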
References
- Wasserman and Faust (1994) S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications, vol. 8 (Cambridge University Press, 1994), ISBN 9780521387071.
- Onnela et al. (2007) J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási, Proceedings of the National Academy of Sciences 104, 7332 (2007).
- Newman (2001a) M. E. Newman, Proceedings of the National Academy of Sciences 98, 404 (2001a).
- Leskovec et al. (2005) J. Leskovec, J. Kleinberg, and C. Faloutsos, in Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (ACM, 2005), pp. 177–187.
- Chen and Yuan (2006) J. Chen and B. Yuan, Bioinformatics 22, 2283 (2006).
- Sporns (2013) O. Sporns, Current Opinion in Neurobiology 23, 162 (2013).
- Girvan and Newman (2002) M. Girvan and M. E. Newman, Proceedings of the National Academy of Sciences 99, 7821 (2002).
- Fortunato (2010) S. Fortunato, Physics Reports 486, 75 (2010).
- Newman (2004) M. E. Newman, Physical Review E 69, 066133 (2004).
- Rosvall and Bergstrom (2007) M. Rosvall and C. T. Bergstrom, Proceedings of the National Academy of Sciences 104, 7327 (2007).
- Raghavan et al. (2007) U. N. Raghavan, R. Albert, and S. Kumara, Physical Review E 76, 036106 (2007).
- Blondel et al. (2008) V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Journal of Statistical Mechanics: Theory and Experiment 2008, P10008 (2008).
- Gregory (2010) S. Gregory, New Journal of Physics 12, 103018 (2010).
- Lancichinetti et al. (2011) A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato, PloS One 6, e18961 (2011).
- De Meo et al. (2014) P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti, Journal of Computer and System Sciences 80, 72 (2014).
- Eustace et al. (2015) J. Eustace, X. Wang, and Y. Cui, Physica A: Statistical Mechanics and its Applications 436, 665 (2015).
- Simmel (1950) G. Simmel, The Sociology of Georg Simmel, vol. 92892 (Simon and Schuster, 1950), ISBN 1296031292.
- Radicchi et al. (2004) F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, Proceedings of the National Academy of Sciences of the United States of America 101, 2658 (2004).
- Lind and Herrmann (2007) P. G. Lind and H. J. Herrmann, New Journal of Physics 9, 228 (2007), ISSN 1367-2630.
- Newman (2001b) M. E. Newman, Physical Review E 64, 025102 (2001b).
- Xiang et al. (2016) J. Xiang, K. Hu, Y. Zhang, M.-H. Bao, L. Tang, Y.-N. Tang, Y.-Y. Gao, J.-M. Li, B. Chen, and J.-B. Hu, Journal of Statistical Mechanics: Theory and Experiment 2016, 033405 (2016).
- Jaccard (1901) P. Jaccard, Bulletin del la Société Vaudoise des Sciences Naturelles 37, 547 (1901).
- Lind et al. (2007) P. G. Lind, L. R. da Silva, J. S. Andrade Jr, and H. J. Herrmann, Physical Review E 76, 036117 (2007).
- (24) Supplementary information for community detection using preference networks.
- Danon et al. (2005) L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment 2005, P09008 (2005).
- Zachary (1977) W. W. Zachary, Journal of Anthropological Research 33, 452 (1977).
- Leskovec and Krevl (2014) J. Leskovec and A. Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data (2014).
- Lancichinetti et al. (2008) A. Lancichinetti, S. Fortunato, and F. Radicchi, Physical Review E 78, 046110 (2008).
- Fortunato and Barthélemy (2007) S. Fortunato and M. Barthélemy, Proceedings of the National Academy of Sciences 104, 36 (2007).