MRAttractor: Detecting Communities from Large-Scale Graphs

Nguyen Vo, Kyumin Lee, Thanh Tran Computer Science Department, Worcester Polytechnic Institute
Worcester, MA, USA
Email: {nkvo,kmlee,tdtran}@wpi.edu
Abstract

Detecting groups of users who have similar opinions, interests, or social behavior has become an important task for many applications. A recent study showed that Attractor, a community detection algorithm based on dynamic distance, outperformed other community detection algorithms such as Spectral clustering, Louvain and Infomap, achieving higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). However, Attractor often takes a long time to detect communities, requiring many iterations. To overcome this drawback and handle large-scale graphs, in this paper we propose MRAttractor, an advanced version of Attractor designed to run on a MapReduce framework. In particular, we (i) apply a sliding window technique to reduce the running time while keeping the same community detection quality; (ii) design and implement the Attractor algorithm for a MapReduce framework; and (iii) evaluate MRAttractor's performance on synthetic and real-world datasets. Experimental results show that our algorithm significantly reduced running time and was able to handle large-scale graphs.

I Introduction

A community can be viewed as a group of vertices that are densely connected compared to the rest of a network. Detecting organizational groups of these vertices paves the way for understanding the underlying structure of complex networks. As a result, numerous algorithms have been proposed in the last decade, such as [1, 2, 3, 4]. Recently, Shao et al. [5] proposed an algorithm called Attractor, which utilized the viewpoint of dynamic distance between linked nodes in a graph to find high-quality communities. Instead of optimizing a specific objective, Attractor relies on three types of interactions to dynamically change the distance between vertices. Despite its superiority over multiple baselines such as Louvain, Ncut and Infomap, this algorithm takes many iterations for edge distances to converge, and in some cases they may not converge at all [6].

Furthermore, there has been growing interest in processing large-scale graphs such as online social networks (e.g., the Facebook friendship network and the Twitter follower network) and extracting communities from them. However, most existing community detection algorithms, including Attractor, were not able to handle large graphs or were designed for a single machine.

In this paper, we were interested in improving Attractor so that it can handle large-scale graphs, producing high-quality communities as quickly as possible while ensuring edge distances converge. Initially, we considered designing, improving, and implementing Attractor on well-known graph processing frameworks such as GraphX, Pregel and Pegasus [7, 8, 9, 10, 11]. But these graph processing frameworks are not well suited for communication between unconnected nodes [7] because they share a similar vertex-centric paradigm where only connected vertices can directly communicate with each other. Therefore, we decided to design and implement Attractor on top of a well-known MapReduce framework, the Hadoop system [12], which can take advantage of distributed computing power.

However, we faced the following key challenges when we began designing MRAttractor, our advanced version of Attractor for Hadoop: (i) how to compute dynamic interactions in a distributed computing environment, where only a partial graph is loaded on each slave node; (ii) how to force edge distances to converge with minimum overhead of network communication and disk I/O, given that edge distances fluctuate over time and convergence takes a long time on some datasets [6]; and (iii) how to mitigate the skewness issue of parallel computing (i.e., a task on one slave node taking much longer than tasks on the other slave nodes). In particular, researchers observed that the original Attractor takes a long time on some datasets due to fluctuating edge distances [6]. If we simply implemented Attractor for Hadoop, we would face the same problem, causing large overhead in the Hadoop system.

By overcoming and resolving these challenges, in this paper, we propose MRAttractor which consists of three main components: (i) Jaccard distance initialization; (ii) dynamic interaction computation by proposing our graph partition algorithms and applying a sliding window technique; and (iii) community extraction.

Our contributions in this paper are as follows:

  • We applied a sliding window technique to ensure edge distances converge, reduce running time, and still achieve the same quality of extracted communities as Attractor.

  • We designed and implemented MRAttractor, an improved version of Attractor, to reduce running time and handle large-scale graphs.

  • We evaluated the performance of MRAttractor on both synthetic and real-world datasets. Our results showed that MRAttractor was able to handle large-scale graphs and significantly reduced running time.

  • We publicly share our source code and datasets at http://bit.ly/mrattractor for the research community.

II Related Work

Community detection has been studied for a long time to unveil the hierarchical structure and hidden modules of complex networks. Many different algorithms have been proposed [13, 14], categorized into (i) statistical inference based methods [15], (ii) optimization-based methods, which are often designed to optimize a specific objective such as modularity [3], normalized cut [2] and betweenness [13], and (iii) algorithms based on dynamical processes [4]. To complement the existing approaches, [5] recently proposed an algorithm called Attractor, which is based on dynamic distance between linked nodes. This algorithm has been investigated and extended in [6, 16]. Despite Attractor's high precision, it is less efficient, requiring many iterations to converge [6].

Other researchers have focused on detecting communities from large-scale graphs. One direction was to design and implement algorithms for a MapReduce framework. Tsironis et al. [17] proposed a MapReduce spectral clustering algorithm employing an eigensolver and a parallel k-means algorithm [18]. Louvain [19] and community detection algorithms based on label propagation [20] or propinquity dynamics [21] were also developed for Hadoop. Another direction was to design and implement community detection algorithms for other frameworks. For example, [22, 23] developed community detection algorithms based on the vertex-centric paradigm of Pregel [8]. In particular, Saltz et al. [22] developed an algorithm to optimize the Weighted Community Clustering metric. Ling et al. [23] proposed a modularity-based algorithm called FastCD on top of GraphX. Another work [24] employed PMETIS to parallelize the first iteration of the Louvain algorithm.

III Background: Attractor

In this section, we briefly summarize how Attractor [5] works as background knowledge so that readers can follow how our MRAttractor works in the next section. Table I presents notations frequently used in the rest of this paper.

Notation             Meaning
G = (V, E)           The undirected graph input to MRAttractor
V, E                 Vertices and edges of the input graph
d(u, v)              Distance of edge (u, v)
Γ(u)                 u's neighbors, Γ(u) = {x | (u, x) ∈ E} ∪ {u}
Γd(u)                u's neighbors and their associated distances
S(u)                 The star graph with center vertex u
deg(u)               Degree of vertex u, deg(u) = |Γ(u)| − 1
⟨k; val⟩             A key-value pair where k is the key and val is the value
DI(u, v), CI(u, v)   Direct, common and exclusive interaction
and EI(u, v)         between linked nodes u and v, respectively
h(·)                 Hash function for graph partitioning
Δ(u, v, x)           Triangle with three edges (u, v), (v, x), (u, x)
Λ(x, u, v)           Wedge, where (x, u), (u, v) ∈ E and (x, v) ∉ E
p                    The number of partitions of graph G
w                    The sliding window vector of an edge
λ, θ and ε           Cohesive parameter, threshold of the sliding window, and
                     upper bound on the number of non-converged edges, respectively
TABLE I: The notations used in this paper

Attractor consists of three main steps. Firstly, it initializes the Jaccard distance of directly linked nodes as follows:

d(u, v) = 1 − |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|    (1)

where Γ(u) = {x | (u, x) ∈ E} ∪ {u} and Γ(v) = {x | (v, x) ∈ E} ∪ {v} are u's neighbors and v's neighbors respectively.

Secondly, it dynamically changes edge distances by computing the direct interaction (DI), common interaction (CI) and exclusive interaction (EI). These interactions are called dynamic interactions. The idea behind dynamic interactions is that the more a pair of vertices interacts, the more their distance is reduced (i.e., they attract each other).

DI(u, v) measures the direct influence of the linked nodes u and v and is defined based on f, the sine function, as follows:

DI(u, v) = f(1 − d(u, v)) / deg(u) + f(1 − d(u, v)) / deg(v),  where f(x) = sin(x)    (2)

CI(u, v) measures the influence from the common neighbors of u and v, denoted as CN(u, v). Its main concept is that if each x ∈ CN(u, v) has a small d(x, u) and a small d(x, v), then u and v are likely to be in the same group.

CI(u, v) = Σ_{x ∈ CN(u, v)} CI_x(u, v)    (3)

where CI_x(u, v) is equal to the following expression:

CI_x(u, v) = (1 − d(x, v)) · f(1 − d(x, u)) / deg(u) + (1 − d(x, u)) · f(1 − d(x, v)) / deg(v)    (4)

EI(u, v) measures the influence from exclusive neighbors. Its main concept is that each exclusive neighbor x of u attracts u to move toward x. If x and v have high similarity, the movement of u toward x will reduce d(u, v). Otherwise, the distance will increase. The same concept applies to each exclusive neighbor y of v. The EI of u and v is measured as follows:

EI(u, v) = Σ_{x ∈ EN(u)} EI_x(u, v) + Σ_{y ∈ EN(v)} EI_y(v, u)    (5)

where EN(u) and EN(v) are the sets of exclusive neighbors of u and v respectively. EI_x(u, v) is defined below:

EI_x(u, v) = ρ(x, v) · f(1 − d(x, u)) / deg(u)    (6)

where ρ(x, v) is the influence of vertex x on v. Given the cohesive parameter λ, ρ(x, v) is computed based on sim(x, v), the similarity of the unconnected nodes x and v:

ρ(x, v) = sim(x, v) if sim(x, v) ≥ λ;  ρ(x, v) = sim(x, v) − λ otherwise    (7)

We measure the similarity of x and v, sim(x, v), over their common neighbors CN(x, v) as follows:

sim(x, v) = [ Σ_{k ∈ CN(x, v)} ((1 − d(x, k)) + (1 − d(k, v))) ] / [ Σ_{k ∈ Γ(x)} (1 − d(x, k)) + Σ_{k ∈ Γ(v)} (1 − d(k, v)) ]    (8)

After computing DI, CI and EI for each edge (u, v), the new distance at timestamp t+1 is updated as follows:

d^{t+1}(u, v) = d^t(u, v) − (DI^t(u, v) + CI^t(u, v) + EI^t(u, v))    (9)

The Attractor algorithm loops until every edge distance has converged (i.e., its distance becomes either 0 or 1). Thirdly, Attractor removes the edges with distance 1 and finds connected components with breadth-first search. Each connected component is an identified community.
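To make these update rules concrete, the following single-machine Python sketch runs one iteration of the dynamic-interaction update on a toy graph. It is an illustration, not the paper's implementation: the helper names are ours, and the similarity of unconnected nodes follows our reading of Eq. 8.

import math

# Toy undirected graph as an adjacency dict. gamma(u) below includes u itself,
# matching the neighborhood definition used in Eq. 1.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
LAMBDA = 0.5  # cohesive parameter (lambda)

def gamma(u):
    return adj[u] | {u}

def deg(u):
    return len(adj[u])

# Jaccard distance initialization (Eq. 1).
d = {(u, v): 1 - len(gamma(u) & gamma(v)) / len(gamma(u) | gamma(v))
     for (u, v) in edges}

def dist(u, v):
    return d[(u, v)] if (u, v) in d else d[(v, u)]

def sim(x, v):
    # Similarity of unconnected x and v (our reading of Eq. 8): closeness to
    # common neighbors, normalized by the weights of both neighborhoods.
    common = adj[x] & adj[v]
    num = sum((1 - dist(x, k)) + (1 - dist(k, v)) for k in common)
    den = (sum(1 - dist(x, k) for k in adj[x])
           + sum(1 - dist(k, v) for k in adj[v]))
    return num / den if den else 0.0

def rho(x, v):
    s = sim(x, v)
    return s if s >= LAMBDA else s - LAMBDA  # Eq. 7

def one_iteration():
    new_d = {}
    for (u, v) in edges:
        duv = dist(u, v)
        # Direct interaction (Eq. 2).
        di = math.sin(1 - duv) / deg(u) + math.sin(1 - duv) / deg(v)
        # Common interaction (Eqs. 3-4): shared neighbors pull u and v together.
        ci = sum((1 - dist(x, v)) * math.sin(1 - dist(x, u)) / deg(u)
                 + (1 - dist(x, u)) * math.sin(1 - dist(x, v)) / deg(v)
                 for x in adj[u] & adj[v])
        # Exclusive interaction (Eqs. 5-6): exclusive neighbors drag u (or v)
        # toward or away from the other endpoint, depending on rho.
        ei = sum(rho(x, v) * math.sin(1 - dist(x, u)) / deg(u)
                 for x in adj[u] - adj[v] - {v})
        ei += sum(rho(y, u) * math.sin(1 - dist(y, v)) / deg(v)
                  for y in adj[v] - adj[u] - {u})
        # Distance update (Eq. 9), clamped to [0, 1].
        new_d[(u, v)] = min(1.0, max(0.0, duv - (di + ci + ei)))
    return new_d

print(one_iteration())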

IV MRAttractor

Fig. 1: Flowchart of the three major components of MRAttractor: (1) Jaccard distance initialization; (2) dynamic interactions, which loop through generating star graphs, computing the three types of interaction, and updating distances and sliding windows until the number of non-converged edges is small enough to finish on the master node; and (3) extracting communities.

In this section, we describe MRAttractor, our proposed distributed version of Attractor. It not only produces the same results as Attractor, but also significantly reduces the running time on both a single machine and a distributed system.

MRAttractor consists of three main components. The first initializes the Jaccard distance. The second computes dynamic interactions and makes edge distances converge; it consists of four phases: generating star graphs, computing three types of interaction, updating distances, and running the final convergence of all edges on the master node. The third component extracts communities. Figure 1 shows the three main components of MRAttractor. We explain each component in detail in the following subsections.

IV-A Jaccard Distance Initialization

For each vertex u, we find its neighbors Γ(u) and sort these neighbors in increasing order of their indexes. Then, for each edge (u, v) of graph G, we can find the common neighbors of u and v with a linear merge of the two sorted lists and compute the Jaccard distance (see Eq. 1) with complexity O(deg(u) + deg(v)).
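A minimal sketch of this step (the function name and toy input are ours): with both neighbor lists sorted by vertex index, the common-neighbor count is a linear merge, giving the O(deg(u) + deg(v)) cost per edge.

def jaccard_distance(nbrs_u, nbrs_v):
    # nbrs_u, nbrs_v: neighbor lists sorted by vertex index, for a linked
    # pair (u, v); Gamma(u) = neighbors of u plus u itself, as in Eq. 1.
    i = j = common = 0
    while i < len(nbrs_u) and j < len(nbrs_v):
        if nbrs_u[i] == nbrs_v[j]:
            common += 1; i += 1; j += 1
        elif nbrs_u[i] < nbrs_v[j]:
            i += 1
        else:
            j += 1
    # u and v are linked, so u is in Gamma(v) and v is in Gamma(u):
    # both belong to the intersection of the closed neighborhoods.
    common += 2
    union = (len(nbrs_u) + 1) + (len(nbrs_v) + 1) - common
    return 1 - common / union

# Edge (1, 2) in a toy graph: Gamma(1) = {1,2,3}, Gamma(2) = {1,2,3,4} -> 0.25
print(jaccard_distance([2, 3], [1, 3, 4]))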

IV-B Computing Dynamic Interactions

After initializing all edge distances of G, we move on to the second major component of MRAttractor, which consists of three MapReduce phases (i.e., generating star graphs, computing three types of interaction, and updating distances based on sliding windows), plus processing on the master node to make all edge distances converge.

1: Map: Input ⟨(u, v); d(u, v)⟩
2: emit ⟨u; (v, d(u, v))⟩; emit ⟨v; (u, d(u, v))⟩
3: Reduce: Input ⟨u; list of (v, d(u, v))⟩
4: Sort the list in increasing order of the index of u's neighbors.
5: emit ⟨u; S(u)⟩
Algorithm 1 MR1: Generating Star Graphs

IV-B1 Generating Star Graphs

Algorithm 1 processes each edge (u, v) and its distance. Then, in the reduce step, we sort Γ(u) based on the index of u's neighbors and output u's star graph S(u). Note that a star graph S(u) is a tree of deg(u) + 1 nodes where the center vertex u has degree deg(u), while the other vertices have degree 1. Sorting helps us find the common and exclusive neighbors of two linked nodes in linear time. In total, there will be |V| star graphs output from the reduce instances of Algorithm 1.
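The following Python sketch emulates Algorithm 1 with an in-memory dictionary standing in for Hadoop's shuffle (names and toy input are ours): the map step emits each edge under both endpoints, and the reduce step sorts each vertex's neighbor list to form its star graph.

from collections import defaultdict

edges = {(1, 2): 0.25, (1, 3): 0.25, (2, 3): 0.4, (2, 4): 0.4, (3, 4): 0.4}

# Map: for each edge (u, v) with distance d, emit <u; (v, d)> and <v; (u, d)>.
shuffle = defaultdict(list)
for (u, v), dist in edges.items():
    shuffle[u].append((v, dist))
    shuffle[v].append((u, dist))

# Reduce: sort each neighbor list by vertex index and emit the star graph S(u).
star_graphs = {u: sorted(nbrs) for u, nbrs in shuffle.items()}

for u, s in sorted(star_graphs.items()):
    print(f"S({u}) = {s}")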

Fig. 2: Original graph G where |V| = 12 and |E| = 16. Each edge is associated with the Jaccard distance in Equation 1. This graph is partitioned into four subgraphs G_{012}, G_{013}, G_{023} and G_{123} using the hash function h(u) = u mod p with p = 4. The solid lines are main edges and the dotted lines are rear edges.

IV-B2 Computing Three Types of Interactions

Direct interaction (DI), common interaction (CI) and exclusive interaction (EI) are the three interactions we need to compute. The hardest task is computing EI(u, v) of an edge (u, v), because EI depends on ρ(x, v), which is derived from the similarity of the unconnected nodes x and v where x ∈ EN(u) (see Eq. 6 and Eq. 7). Well-known large-scale graph processing frameworks [7, 8, 10] provide limited support for direct communication between two unconnected nodes (e.g., vertex x and vertex v), leading to difficulties in computing sim(x, v). Therefore, we propose DecGP, an algorithm to efficiently compute the dynamic interactions of every edge in G. Our proposed DecGP algorithm was inspired by a graph partitioning algorithm introduced in [25], which showed that graph partitioning helps mitigate the skewness issue by evenly distributing workload to each reducer. The third challenge mentioned in Section I is resolved by our graph partitioning algorithm.

Our graph partitioning algorithm uses a hash function h(·), which maps each vertex to the range [0, p−1], to partition the original graph G into p disjoint partitions V_0, …, V_{p−1} such that V = V_0 ∪ … ∪ V_{p−1} and V_i ∩ V_j = ∅ for i ≠ j. From these partitions, we form overlapping subgraphs G_{ijk} where 0 ≤ i < j < k ≤ p−1. An edge (u, v) is called an outer edge if u and v are in different partitions (i.e., h(u) ≠ h(v)). Otherwise, it is an inner edge. Intuitively, in each smaller-size subgraph G_{ijk}, we compute the DI, CI and EI of its edges. Let's take the example graph of 12 vertices and 16 edges in Figure 2. By using the hash function h(u) = u mod p with p = 4, this graph is partitioned into 4 subgraphs G_{012}, G_{013}, G_{023} and G_{123}. The value on each edge is its Jaccard distance. Although our algorithm shares a similar methodology with [25], there are key differences as follows:

  • We reduce the complexity of finding the subgraphs G_{ijk} that contain a given edge from O(p³) to O(p²).

  • In our algorithm, we also include additional edges called rear edges in the subgraphs to compute the exclusive interaction, while in [25] subgraphs only contain main edges. In each subgraph G_{ijk}, an edge (u, v) is called a main edge if h(u), h(v) ∈ {i, j, k}; otherwise it is called a rear edge. In Figure 2, the subgraphs G_{012}, G_{013}, G_{023} and G_{123} contain rear edges denoted as dotted edges and main edges denoted as solid edges.

  • In each subgraph G_{ijk}, we compute partial values of DI(u, v), CI(u, v) and EI(u, v) for every main edge, instead of counting the number of triangles as in [25].

These differences are described as follows:

(i) Reducing the complexity of finding the subgraphs G_{ijk} that contain an edge.

Given an edge (u, v), the graph partitioning algorithm of Suri et al. [25] takes O(p³) to find the subgraphs that the edge belongs to. However, our proposed algorithm only takes O(p²) for this task (see Lemma IV.1 and the associated proof).

Lemma IV.1.

For each edge (u, v) in the original graph G:

  1. If edge (u, v) is an inner edge, there will be (p−1)(p−2)/2 distinct subgraphs G_{ijk} containing it.

  2. If edge (u, v) is an outer edge, there will be p−2 subgraphs containing this edge.

Proof.

(1) If edge (u, v) is an inner edge (e.g., h(u) = h(v) = i), it will be emitted to G_{sort(i, j, k)} for each j and k such that j, k ≠ i and j ≠ k. Therefore, there are C(p−1, 2) = (p−1)(p−2)/2 subgraphs containing the inner edge (u, v).

(2) If edge (u, v) is an outer edge (e.g., h(u) = i ≠ j = h(v)), it will be output to G_{sort(i, j, k)} for each k such that k ≠ i and k ≠ j. So, there are p−2 subgraphs containing this edge. ∎

Due to Lemma IV.1, finding the subgraphs that contain an edge can be implemented in quadratic complexity, as shown in Algorithm 2.

1: function FindSubGraphs(u, v)
2:     T ← ∅
3:     if h(u) = h(v) then
4:         for j ← 0 to p−1 with j ≠ h(u) do
5:             for k ← j+1 to p−1 with k ≠ h(u) do
6:                 T ← T ∪ {sort(h(u), j, k)}
7:     else
8:         for k ← 0 to p−1 with k ≠ h(u) and k ≠ h(v) do
9:             T ← T ∪ {sort(h(u), h(v), k)}
10:    return T
Algorithm 2 Finding the subgraphs G_{ijk} containing an edge (u, v)
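A Python sketch of FindSubGraphs under the hash h(u) = u mod p (the helper names and toy checks are ours); the prints at the bottom confirm the counts stated in Lemma IV.1 for p = 4.

from itertools import combinations

P = 4                      # number of partitions
h = lambda u: u % P        # hash function mapping vertices to [0, P-1]

def find_subgraphs(u, v):
    # Return the sorted triples (i, j, k) of subgraphs G_ijk containing (u, v).
    hu, hv = h(u), h(v)
    triples = set()
    if hu == hv:  # inner edge: choose the other two partitions freely
        for j, k in combinations([q for q in range(P) if q != hu], 2):
            triples.add(tuple(sorted((hu, j, k))))
    else:         # outer edge: choose only the third partition
        for k in (q for q in range(P) if q not in (hu, hv)):
            triples.add(tuple(sorted((hu, hv, k))))
    return triples

# Lemma IV.1: an inner edge lands in (P-1)(P-2)/2 subgraphs, an outer one in P-2.
print(len(find_subgraphs(4, 8)))   # h = 0, 0 -> inner edge: 3 subgraphs
print(len(find_subgraphs(1, 2)))   # h = 1, 2 -> outer edge: 2 subgraphs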

(ii) Adding rear edges to the subgraphs G_{ijk}. What is the motivation for adding rear edges?

Let's consider only the main edges of the subgraphs G_{ijk}. Once again, an edge (u, v) in subgraph G_{ijk} is a main edge if h(u), h(v) ∈ {i, j, k}; otherwise, it is a rear edge. For each main edge (u, v) associated with its distance d(u, v) in each subgraph G_{ijk}, we can load vertices' degrees into memory, compute DI(u, v) based on Eq. 2, and compute CI_x(u, v) for every common vertex x based on Eq. 4. But how can we compute EI(u, v)? Let's look at the main edge (12, 10) in Figure 2 as an example. We can see that vertex 9 is an exclusive neighbor of vertex 12. EI_9(12, 10) depends on ρ(9, 10) (see Equation 6). To compute ρ(9, 10), we need to measure the similarity sim(9, 10) in Eq. 8 and then ρ(9, 10) in Eq. 7. To accurately compute sim(9, 10), we need both star graph S(9) and star graph S(10) to find their common neighbors. But in the corresponding subgraph, without considering rear edges, we would be missing the edges of S(9) whose other endpoints hash outside the subgraph's partitions, which are needed to compute sim(9, 10). A similar analysis applies to vertex 2, an exclusive neighbor of vertex 10: without rear edges we would be missing an edge needed to compute sim(2, 12). Motivated by these observations, and to guarantee the correctness of MRAttractor, we add rear edges, denoted as dotted edges in Figure 2, to each subgraph so that exclusive interactions are computed correctly. We resolved the first challenge mentioned in Section I by adding rear edges.

1: Map: Input: ⟨u; S(u)⟩
2: T ← ∅ ▷ Set of triples (i, j, k)
3: for each v ∈ Γ(u) \ {u} do
4:     T ← T ∪ FindSubGraphs(u, v)
5: for each (i, j, k) ∈ T do
6:     emit ⟨(i, j, k); S(u)⟩
7: Reduce: Input: ⟨(i, j, k); list of S(u)⟩
8: Read the value of p from the Hadoop configuration object.
9: D ← an empty dictionary of star graphs
10: M ← ∅ ▷ Set of main edges
11: for each S(u) routed by key (i, j, k) do
12:     Insert S(u) into dictionary D
13:     for each v ∈ Γ(u) \ {u} do
14:         if h(v) ∈ {i, j, k} then ▷ (u, v) is a main edge
15:             M ← M ∪ {(u, v)}
16: for each (u, v) ∈ M with d(u, v) ∉ {0, 1} do
17:     s ← 0
18:     s ← s + ComputeDI(u, v, d(u, v), p)
19:     s ← s + ComputeCI(u, v, d(u, v), Γ(u), Γ(v), p, (i, j, k))
20:     for each x ∈ Γ(u) \ Γ(v) with x ≠ v do
21:         if h(x) ∈ {i, j, k} then
22:             Γ(x) ← D[x] ▷ Retrieve neighbors of vertex x
23:             s ← s + ComputeEI(x, u, v, Γ(x), Γ(v), p, λ)
24:     for each y ∈ Γ(v) \ Γ(u) with y ≠ u do
25:         if h(y) ∈ {i, j, k} then
26:             Γ(y) ← D[y] ▷ Retrieve neighbors of vertex y
27:             s ← s + ComputeEI(y, v, u, Γ(y), Γ(u), p, λ)
28:     emit ⟨(u, v); s⟩
Algorithm 3 MR2: DecGP - Computing 3 Types of Interactions

(iii) Computing DI, CI and EI of main edges. By adding rear edges to G_{ijk}, we are now able to compute the DI, CI and EI of the main edges in each subgraph G_{ijk}. Algorithm 3 shows the pseudocode of DecGP for computing the DI, CI and EI of every main edge in G_{ijk}. It is a MapReduce algorithm consisting of a map function and a reduce function. In particular, each map instance handles a star graph S(u) output from Algorithm 1. It first finds the distinct subgraphs G_{ijk}, each represented as a sorted triple (i, j, k), that contain at least one of the edges (u, v) with v ∈ Γ(u), by applying the function FindSubGraphs. Second, it emits the star graph S(u) to each such subgraph G_{ijk} (see lines [2-6] of DecGP). By emitting the whole star graph to the subgraph, we ensure that there are always enough edges to compute the similarity of unconnected nodes.

Moving on to each reduce instance of Algorithm 3: each reduce instance receives a list of star graphs routed by the sorted key (i, j, k) of subgraph G_{ijk}, which can be viewed as an adjacency-list representation of subgraph G_{ijk}. In lines [9-15] of Algorithm 3, we find the set of main edges M, which is used for computing DI, CI and EI in lines [16-27]. After computing the DI, CI and EI of every non-converged main edge in G_{ijk}, each reduce instance emits key-value pairs ⟨(u, v); s⟩ where the key is a main edge (u, v) and the value s is the aggregated DI, CI and EI of edge (u, v). The three main functions ComputeDI, ComputeCI and ComputeEI are explained below.

Computing DI. Algorithm 4 shows how we compute the direct interaction. DI(u, v) of each main edge (u, v) is computed based on Eq. 2. However, this edge will be re-computed in multiple subgraphs according to Lemma IV.1, so we need to scale DI(u, v) down to guarantee correctness. In particular, if (u, v) is an inner edge, we scale it down by a factor of (p−1)(p−2)/2. Otherwise, we scale it down by a factor of p−2.

1: function ComputeDI(u, v, d(u, v), p)
2:     s ← sin(1 − d(u, v)) / deg(u) + sin(1 − d(u, v)) / deg(v) ▷ Eq. 2
3:     if h(u) = h(v) then
4:         s ← s / ((p−1)(p−2)/2)
5:     else
6:         s ← s / (p−2)
7:     return s
Algorithm 4 Direct interaction of the edge (u, v)
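A Python sketch of the scaled direct interaction (Eq. 2 plus the correction from Lemma IV.1; the signature and names are ours). Dividing by the number of subgraphs that re-process the edge makes the partial values sum to exactly one copy of DI(u, v):

import math

def compute_di(d_uv, deg_u, deg_v, h_u, h_v, p):
    # Partial direct interaction of edge (u, v) inside one subgraph G_ijk.
    di = math.sin(1 - d_uv) / deg_u + math.sin(1 - d_uv) / deg_v  # Eq. 2
    if h_u == h_v:
        di /= (p - 1) * (p - 2) / 2  # inner edge: seen in (p-1)(p-2)/2 subgraphs
    else:
        di /= p - 2                  # outer edge: seen in p-2 subgraphs
    return di

# With p = 4, an inner edge is processed in 3 subgraphs; the 3 scaled partial
# values sum back to the unscaled DI of Eq. 2.
p = 4
partial = compute_di(0.25, 3, 4, 0, 0, p)
print(3 * partial, math.sin(0.75) / 3 + math.sin(0.75) / 4)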

Computing CI. Algorithm 5 shows how we compute the common interaction of a main edge (u, v). The input of this function is the main edge (u, v), its distance d(u, v), u's neighbors Γ(u), v's neighbors Γ(v), the number of partitions p and the key (i, j, k) of subgraph G_{ijk}. For each common neighbor x of u and v, we only consider x such that h(x) ∈ {i, j, k}, because we only care about the main edges of G_{ijk}. We compute CI_x(u, v) based on Eq. 4. However, we need to scale it down, since the three vertices u, v and x form a triangle Δ(u, v, x) and, as Suri et al. [25] pointed out, Δ can be repeated in several subgraphs, as shown in Lemma IV.2. In particular, if the three vertices of Δ(u, v, x) are in the same partition, we scale down by a factor of (p−1)(p−2)/2. If Δ has two nodes in the same partition and the third node belongs to a different partition, CI_x(u, v) is scaled down by a factor of p−2.

1: function ComputeCI(u, v, d(u, v), Γ(u), Γ(v), p, (i, j, k))
2:     s ← 0
3:     for each x ∈ Γ(u) ∩ Γ(v) with x ∉ {u, v} and h(x) ∈ {i, j, k} do
4:         c_u ← (1 − d(x, v)) · sin(1 − d(x, u)) / deg(u); c_v ← (1 − d(x, u)) · sin(1 − d(x, v)) / deg(v)
5:         c ← c_u + c_v ▷ Eq. 4
6:         if h(u) = h(v) = h(x) then
7:             c ← c / ((p−1)(p−2)/2)
8:         else if |{h(u), h(v), h(x)}| = 2 then
9:             c ← c / (p−2)
10:        s ← s + c
11:    return s
Algorithm 5 Common interaction of the edge (u, v)
Lemma IV.2.

Given a triangle Δ(u, v, x) of the original graph G:

  1. If the three nodes are in the same partition, Δ will appear in (p−1)(p−2)/2 different subgraphs G_{ijk}.

  2. If two nodes are in the same partition and one node belongs to a different partition, Δ will appear in p−2 different subgraphs.

  3. If the three nodes are in three different partitions, there is only one subgraph containing Δ.

Proof.

(1) When u, v and x are in the same partition V_i, Δ will appear in the subgraphs G_{sort(i, j, k)} where j, k ≠ i and j ≠ k. In total, there are C(p−1, 2) = (p−1)(p−2)/2 different subgraphs.

(2) Without loss of generality, we assume that h(u) = h(v) = i and h(x) = j ≠ i. Δ will appear in the subgraphs G_{sort(i, j, k)} where k ≠ i and k ≠ j. Therefore, there are p−2 distinct subgraphs containing Δ.

(3) Finally, when the three nodes are in three different partitions i, j and k, only the subgraph G_{sort(i, j, k)} contains this Δ. ∎

1: function ComputeEI(x, u, v, d(x, u), Γ(x), Γ(v), p, λ)
2:     s ← 0
3:     Compute sim(x, v) based on Γ(x) and Γ(v) ▷ Eq. 8
4:     ρ(x, v) ← sim(x, v)
5:     if sim(x, v) < λ then
6:         ρ(x, v) ← sim(x, v) − λ ▷ Eq. 7
7:     s ← ρ(x, v) · sin(1 − d(x, u)) / deg(u) ▷ Eq. 6
8:     if h(x) = h(u) = h(v) then
9:         s ← s / ((p−1)(p−2)/2)
10:    else if |{h(x), h(u), h(v)}| = 2 then
11:        s ← s / (p−2)
12:    return s
Algorithm 6 Exclusive interaction of vertex x on edge (u, v)

Computing EI. For each main edge (u, v) of G_{ijk}, we sequentially process each exclusive neighbor x ∈ EN(u) (see lines [20-23] of Algorithm 3) and each exclusive neighbor y ∈ EN(v) (see lines [24-27] of Algorithm 3). Once again, we only pay attention to vertices x and y such that h(x) ∈ {i, j, k} and h(y) ∈ {i, j, k}, since main edges are our target.

Before describing the details of the computation, we present the definition of a wedge, or two-hop path, in Definition 1.

Definition 1 (Wedge).

In G, three nodes x, u and v form a wedge, denoted as Λ(x, u, v), if (x, u), (u, v) ∈ E and there is no edge between x and v. A wedge is also called a two-hop path [26].

As we can see, the three nodes x, u and v form a wedge Λ(x, u, v). Similarly, the three nodes y, v and u create Λ(y, v, u). Without loss of generality, we only explain how we compute EI_x(u, v), the effect of exclusive neighbor x on the distance of edge (u, v), based on Λ(x, u, v).

In the function ComputeEI of Algorithm 6, we input the wedge's vertices, x's neighbors, v's neighbors, the number of partitions p and the cohesive parameter λ [5]. To begin with, we compute sim(x, v) (see Eq. 8), the similarity of vertices x and v, based on Γ(x) and Γ(v). Then, we derive ρ(x, v) and compute EI_x(u, v) as in Eq. 6. Computing EI_x(u, v) in each subgraph faces a duplication problem, since the wedge Λ(x, u, v) can appear in several different subgraphs G_{ijk}. Therefore, we need to scale it down appropriately. Lemma IV.3 gives the number of subgraphs in which a wedge can appear. In particular, if Λ has its three nodes in the same partition, we scale down by a factor of (p−1)(p−2)/2. If two of the nodes of Λ are in the same partition but the third is in another partition, we scale down by a factor of p−2 (see lines [8-11] of Algorithm 6).

Lemma IV.3.

For each wedge Λ(x, u, v):

  1. If the three nodes x, u and v are placed in the same partition, it will appear in (p−1)(p−2)/2 different subgraphs G_{ijk}.

  2. If two nodes are in the same partition and the other one belongs to a different partition, it will appear in p−2 different subgraphs.

  3. If each vertex is in a different partition, it will belong to only one subgraph G_{ijk}.

Proof.

(1) When the three nodes are in the same partition, h(x) = h(u) = h(v). Edges (x, u) and (u, v) are two inner edges and always appear together. Due to Lemma IV.1, inner edges appear in (p−1)(p−2)/2 subgraphs. Therefore, Λ(x, u, v) will appear in (p−1)(p−2)/2 different subgraphs G_{ijk}.

(2) Without loss of generality, we assume that h(x) = h(u) and h(v) ≠ h(u). In each subgraph where the outer edge (u, v) appears, the inner edge (x, u) also appears, leading to the existence of Λ(x, u, v) in that subgraph. We know that the outer edge appears in p−2 different subgraphs based on Lemma IV.1. Therefore, Λ(x, u, v) will appear in p−2 different subgraphs.

(3) If the vertices are in three different partitions, h(x) ≠ h(u), h(u) ≠ h(v) and h(x) ≠ h(v). Thus, only the subgraph G_{sort(h(x), h(u), h(v))} has this wedge. ∎
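Since a wedge lives in exactly the subgraphs that contain both of its edges, Lemma IV.3 can be sanity-checked by intersecting the triples of the two edges. A small Python sketch (the hash and helper names are ours, with p = 4):

from itertools import combinations

P = 4
h = lambda u: u % P

def edge_triples(a, b):
    # Triples (i, j, k) of the subgraphs containing edge (a, b), per Lemma IV.1.
    ha, hb = h(a), h(b)
    if ha == hb:
        return {tuple(sorted((ha, j, k)))
                for j, k in combinations([q for q in range(P) if q != ha], 2)}
    return {tuple(sorted((ha, hb, k))) for k in range(P) if k not in (ha, hb)}

def wedge_triples(x, u, v):
    # A wedge (x, u, v) appears exactly in the subgraphs holding both edges.
    return edge_triples(x, u) & edge_triples(u, v)

# Lemma IV.3 with P = 4: all three nodes in one partition -> 3 subgraphs,
# two in one partition -> 2, all in different partitions -> 1.
print(len(wedge_triples(0, 4, 8)))   # h = 0, 0, 0 -> 3
print(len(wedge_triples(0, 4, 1)))   # h = 0, 0, 1 -> 2
print(len(wedge_triples(0, 1, 2)))   # h = 0, 1, 2 -> 1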

IV-B3 Updating Edge Distances Based on Sliding Window

Next, we move on to updating the distance of every edge (u, v) in the original graph G, together with its sliding window w, based on the DI(u, v), CI(u, v) and EI(u, v) computed by Algorithm 3.

In Section I, we mentioned three key challenges in designing and implementing MRAttractor. The second challenge was how to make all edge distances converge with minimum overhead of network communication and disk I/O. To overcome it, we use a sliding window technique [27].

1: Map: Input: ⟨(u, v); s⟩, ⟨(u, v); d^t(u, v)⟩ and ⟨(u, v); w⟩
2: emit ⟨(u, v); val⟩ where val is either s, d^t(u, v) or w
3: Reduce: Input: ⟨(u, v); list of values⟩
4: Read the settings L, the maximum size of the sliding window, and θ
5: s ← 0; d^t ← null; w ← null
6: for each val in the list do
7:     if val is w then
8:         w ← val
9:     else if val is d^t(u, v) then
10:        d^t ← val
11:    else if val is s then
12:        s ← s + val
13: if d^t = 0 or d^t = 1 then
14:     return ▷ This edge has converged. No need to process it
15: if s ≠ 0 then
16:     d^{t+1}(u, v) ← d^t(u, v) − s
17:     status ← 1 if d^{t+1}(u, v) > d^t(u, v), otherwise −1
18:     w[t mod L] ← status ▷ Set position of vector w
19:     if |w| < L then
20:         |w| ← |w| + 1 ▷ The window grows until it is full
21:     n₊ ← the number of 1s in the sliding window
22:     n₋ ← the number of −1s in the sliding window
23:     if |w| = L then ▷ The sliding window of edge (u, v) is full
24:         if status = 1 and n₊ ≥ θ·L then
25:             d^{t+1}(u, v) ← 1
26:         if status = −1 and n₋ ≥ θ·L then
27:             d^{t+1}(u, v) ← 0
28:     if d^{t+1}(u, v) > 1 then
29:         d^{t+1}(u, v) ← 1
30:     if d^{t+1}(u, v) < 0 then
31:         d^{t+1}(u, v) ← 0
32:     emit ⟨(u, v); d^{t+1}(u, v)⟩
33:     emit ⟨(u, v); w⟩
Algorithm 7 MR3: Updating Distances and Sliding Window

Our sliding window model works as follows. For each edge (u, v), we use a vector w indicating the status of the edge: w_i = 1 means that d(u, v) increased at iteration i, and w_i = −1 means it decreased. By using a sliding window, we only keep the last L statuses of each edge (i.e., |w| ≤ L) to observe its increasing/decreasing trend over the last L iterations, which is more reliable for reflecting the convergence trend of an edge. Then, we predict its final distance (i.e., 0 or 1). To decide whether an edge has converged, we use a threshold θ. In particular, if the last status of edge (u, v) is −1 and there are at least θ·L negative values in vector w, we decide that the edge will eventually converge to 0 (e.g., when L = 10, θ = 0.6 and the last status is −1, if at least 6 statuses in w are −1, the edge distance is set to 0). If the last status of edge (u, v) is 1 and there are at least θ·L positive values in vector w, we set d(u, v) = 1. Otherwise, the edge has still not converged, and we continue to compute its dynamic interactions as shown in Figure 1.
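A minimal sketch of this decision rule (the function and parameter names are ours, mirroring L and θ above):

def window_decision(w, L, theta):
    # Decide convergence from the last L status values of an edge.
    # w: list of +1/-1 statuses, most recent last; returns 0.0, 1.0, or None.
    if len(w) < L:
        return None                      # window not full yet: keep iterating
    last = w[-1]
    if last == 1 and w.count(1) >= theta * L:
        return 1.0                       # upward trend: force distance to 1
    if last == -1 and w.count(-1) >= theta * L:
        return 0.0                       # downward trend: force distance to 0
    return None

# Example from the text: L = 10, theta = 0.6, last status -1, at least six -1s.
w = [1, -1, -1, 1, -1, -1, 1, -1, -1, -1]
print(window_decision(w, L=10, theta=0.6))  # -> 0.0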

Algorithm 7 shows our pseudocode for updating edge distances. The map instances of Algorithm 7 process three types of input: (1) ⟨(u, v); s⟩, the key-value pairs generated by computing dynamic interactions; recall that s is the aggregated sum of the DI, CI and EI of edge (u, v) output by the reduce instances of Algorithm 3. (2) ⟨(u, v); d^t(u, v)⟩, the edge and its distance in the previous iteration; at the first iteration, d^0(u, v) is the Jaccard distance. (3) ⟨(u, v); w⟩, the sliding window of edge (u, v); at the first iteration, w is an empty vector. Note that, for each edge (u, v), there is only one pair ⟨(u, v); d^t(u, v)⟩, one pair ⟨(u, v); w⟩ and possibly multiple pairs ⟨(u, v); s⟩.

Each map instance of Algorithm 7 simply outputs a key-value pair where the key is an edge (u, v) and the value is either s, d^t(u, v) or the sliding window vector w.

Each reduce instance of Algorithm 7 receives the values routed by key (u, v) and performs two tasks for edge (u, v): (1) computing its new distance and (2) updating its sliding window vector. To begin with, in lines [5-12], we sum all partial interaction values into s. We can verify that s = DI(u, v) + CI(u, v) + EI(u, v). After computing s, we derive d^{t+1}(u, v), the distance of edge (u, v) at timestamp t+1, based on Eq. 9. Next, in lines [15-27], we update the sliding window vector w. Due to the modulo function, |w| ≤ L. Finally, we emit the pair ⟨(u, v); d^{t+1}(u, v)⟩ for the new distance of edge (u, v) and the pair ⟨(u, v); w⟩ for the new sliding window vector. These key-value pairs act as the input for the next iteration of MRAttractor.

IV-B4 Running on the Master Node

After deriving the new distances of all edges in the original graph G, we check how many edges are still non-converged (i.e., their distance is neither 0 nor 1). If the number of non-converged edges is smaller than a threshold ε, we continue our computation on the master node, which controls the slave nodes. There are two reasons why we do this. Firstly, the Attractor algorithm suffers from long-tail iterations because some edges converge slowly [6]. Secondly, after multiple iterations, the number of non-converged edges is very small and can be handled efficiently on a single computer. A well-known problem of MapReduce on Hadoop is the overhead of I/O operations [28]. Therefore, by running on a single computer, we avoid unnecessary overhead of the Hadoop framework. In this work, we used the same value of ε for all testing networks. The second challenge mentioned in Section I is resolved by the sliding window model together with running on the master node.
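The overall driver can be sketched as follows (a toy simulation: the stubbed functions stand in for the real Hadoop jobs and the local Attractor loop, and are not MRAttractor's actual code):

import random

def run_mapreduce_round(open_edges):
    # Stub standing in for one MR1 -> MR2 -> MR3 round (Algorithms 1, 3, 7);
    # here it just converges a random subset of edges.
    for e in list(open_edges):
        if random.random() < 0.3:
            open_edges.discard(e)
    return len(open_edges)

def finish_on_master(open_edges):
    # Stub standing in for the sequential Attractor loop run on the master node.
    open_edges.clear()

def driver(edges, epsilon):
    open_edges = set(edges)
    while open_edges:
        remaining = run_mapreduce_round(open_edges)
        if 0 < remaining < epsilon:
            finish_on_master(open_edges)  # skip further Hadoop job overhead
    return "all edges converged"

print(driver([(u, u + 1) for u in range(100)], epsilon=20))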

IV-B5 Complexity Analysis

We now show the correctness of computing dynamic interactions and analyze DecGP's complexity, since it is the most time-consuming part.

Lemma IV.4.

For each edge (u, v) of G, its DI(u, v), CI(u, v) and EI(u, v) are computed correctly in each loop.

Proof.

In each subgraph G_{ijk}, we compute partial values of DI, CI and EI for every main edge with appropriate scaling, as shown in Algorithms 3, 4, 5 and 6. After computing the dynamic interactions, we aggregate the DI, CI and EI of every edge of the original graph in the reduce instances of Algorithm 7. Since we apply the scaling correctly, the aggregated value of every edge (u, v) is exactly equal to DI(u, v) + CI(u, v) + EI(u, v), leading to the correctness of our computation. ∎

Datasets        |V|        |E|        Classes    AVD     CC
Karate          34         78         2          4.588   0.571
Football        115        613        12         10.661  0.403
Polbooks        105        441        3          8.400   0.488
Amazon          334,863    925,872    top 5000   5.530   0.397
Collaboration   9,875      25,973     unknown    5.260   0.472
Friendship      58,228     214,078    unknown    7.353   0.172
Road            1,088,092  1,541,898  unknown    2.834   0.046
TABLE II: Networks with and without labels, their average degree (AVD) and average clustering coefficient (CC).
Lemma IV.5.

For each setting of p:

  1. The expected number of main edges in each subgraph G_{ijk} is O(|E|/p²).

  2. The expected number of key-value pairs is O(p·|E|) over all reduce instances of Algorithm 3.