# Simple Distributed Graph Clustering using Modularity and Map Equation

Supported by DFG grants WA654/19-2 and WA654/22-2.

###### Abstract

We study large-scale, distributed graph clustering. Given an undirected, weighted graph, our objective is to partition the nodes into disjoint sets called clusters. Each cluster should contain many internal edges. Further, there should only be few edges between clusters. We study two established formalizations of this internally-dense-externally-sparse principle: modularity and map equation. We present two versions of a simple distributed algorithm to optimize both measures. They are based on Thrill, a distributed big data processing framework that implements an extended MapReduce model. The algorithms for the two measures, DSLM-Mod and DSLM-Map, differ only slightly. Adapting them for similar quality measures is easy. In an extensive experimental study, we demonstrate the excellent performance of our algorithms on real-world and synthetic graph clustering benchmark graphs.

## I Introduction

Graph clustering is a well-researched topic [9, 11] with many applications, such as community detection in social networks. Modern social networks contain hundreds of millions of users and billions of friendship relationships between them. We model the users as nodes and the friendships as edges. The resulting graphs are huge and do not fit into the main memory of a single machine. We therefore study distributed extensions of established single-machine clustering algorithms, which enables us to efficiently compute clusterings of huge graphs.

We consider the problem of clustering a graph into disjoint clusters such that every node is part of exactly one cluster. While there is no universally accepted definition of a good clustering, it is commonly accepted that clusters should be internally densely and externally sparsely connected. We consider optimizing two established quality measures that formalize this concept: modularity [23] and map equation [25].

### I-A Related Work

Existing distributed algorithms follow one of two approaches. The first is to partition the graph into one part per machine. Each part is then clustered independently on one machine. Finally, all clusters are contracted and the resulting coarser graph is assumed to fit into the memory of a single machine. The second approach consists of distributing the clustering algorithm itself.

In [29], the input node id range is chunked into equally sized parts. This can work well, but is problematic if input node ids do not reflect the graph structure. In [27], the input graph is partitioned using the non-distributed, parallel graph partitioning algorithm ParMETIS [17]. This approach does not rely on the input ids but requires that the graph fits into the memory of one machine to perform the partitioning.

Following the second approach, [24] have introduced a distributed extension of the Louvain algorithm [4] that uses MPI. Similarly, [21] have presented an algorithm that uses the GraphX [13] framework. Another algorithm, GossipMap [2], uses the GraphLab [22] framework. Nearly all of these algorithms heuristically optimize modularity. The only exception is GossipMap, which heuristically optimizes the map equation.

Other community detection formalizations have been considered. For example, EgoLP [6] is a distributed algorithm to find overlapping clusters.

The algorithms presented in this work follow the second approach. We implemented them within the Thrill framework [3].

### I-B Contribution

We introduce two distributed graph clustering algorithms, DSLM-Mod and DSLM-Map, that optimize modularity and map equation, respectively. They are the first formulation of clustering algorithms based on modularity and map equation in an extended version of the classical MapReduce model [7]. This model allows us to formulate the algorithms more simply than existing approaches, for example those based on GraphX. While we only show algorithms for optimizing modularity and map equation, they are easy to extend to other density-based quality measures. Our algorithms are also the first graph clustering algorithms based on Thrill [3], a distributed big data processing framework written in C++ that implements an extended MapReduce model. Using Thrill allows our algorithms to be efficient in memory and CPU use despite their simplicity. As our evaluation shows, a few hosts of a distributed system are enough to run them on huge graphs.

Compared to related work, we provide a much more in-depth evaluation of the quality of the communities found. Our evaluation considers not only quality scores, where a very small change of the score can correspond to very different clusterings [14], but also the similarity to ground truth communities on synthetic benchmark graphs with billions of edges. We show that an evaluation on small benchmark graphs, as is typical in related work, is not enough because some problems only occur in larger graphs. Our results show that our algorithms scale well and that optimizing the map equation achieves perfect results on the LFR benchmark graphs [19] we use.

Similarly to most related work, we need to make implicit assumptions about the structure of the graph. We assume that the data associated with a cluster fits into the memory of a single compute node. This data consists of the nodes in the cluster and all edges incident to these nodes. In practice, this is only a limitation when a network has huge clusters. In many scenarios, such as the social networks, web graphs, and road networks that we consider in our experiments, this is no limitation because clusters do not become that large. Our algorithms can be modified to avoid this restriction, but this would lead to higher running times.

### I-C Outline

## II Basics

Conceptually, our algorithms work on undirected graphs. However, we represent all graphs as symmetric, directed graphs. Denote by $(x, y)$ a directed edge from tail node $x$ to head node $y$. The edge $(y, x)$ is the back-edge of $(x, y)$. Further, $n$ is the number of nodes and $m$ is the number of directed edges. Unless stated otherwise, there are no multi-edges. Loop edges are an exception: every node has two of them, and one is the back-edge of the other. We require two loop edges to ensure that no edge is its own back-edge.

All our input graphs are unweighted. We represent them as graphs with constant edge weight 1. In intermediate steps of our algorithms, graphs with non-uniform weights appear. An edge and its back-edge must have the same weight. We describe all our algorithms for weighted graphs.

A cluster is a node subset. A clustering is a set of clusters such that each node is part of exactly one cluster.

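The weighted degree $\deg(x)$ of a node $x$ is the sum of the weights of all outgoing edges of $x$. As there is a forward and a backward loop edge, the weight of loop edges is counted twice in the degree of their node. The volume $\mathrm{vol}(C)$ of a set of nodes $C$ is the sum of their weighted degrees. The internal volume $\mathrm{vol}_{\mathrm{int}}(C)$ of a set of nodes $C$ is the sum of the weights of all (directed) edges for which both endpoints are in $C$. The cut $\mathrm{cut}(C, D)$ between two sets of nodes $C$, $D$ is the sum of the weights of all edges $(x, y)$ such that $x \in C$ and $y \in D$. As a simplification, we write $\mathrm{cut}(v, C)$ for $\mathrm{cut}(\{v\}, C)$, $\mathrm{cut}(C)$ for the cut of $C$ with the rest of the graph, and $\mathrm{cut}(\mathcal{C})$ for the sum of the cuts of all clusters of a clustering $\mathcal{C}$. Note that $\mathrm{vol}(C) = \mathrm{vol}_{\mathrm{int}}(C) + \mathrm{cut}(C)$.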

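Many approaches exist to formalize the quality of a clustering. In this work, we study two popular ones: modularity [23] and map equation [25]. The modularity of a clustering $\mathcal{C}$ is defined as

$$Q(\mathcal{C}) := \sum_{C \in \mathcal{C}} \left( \frac{\mathrm{vol}_{\mathrm{int}}(C)}{\mathrm{vol}(V)} - \frac{\mathrm{vol}(C)^2}{\mathrm{vol}(V)^2} \right),$$

where $V$ is the set of all nodes, $\mathrm{vol}(C)$ is the volume of cluster $C$, and $\mathrm{vol}_{\mathrm{int}}(C)$ is its internal volume.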

Higher modularity values indicate better clusterings. However, sometimes higher modularity values can also be achieved by merging small but actually clearly distinct clusters. This effect is called resolution limit [10].
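As a minimal sketch of the modularity definition, the following Python code evaluates it on a symmetric, directed edge list. It assumes the standard form in which each cluster contributes its internal volume divided by the total volume, minus its squared volume fraction. The function names, the edge-list format, and the two-triangle example graph are illustrative assumptions and not part of the implementation described in this paper.

```python
from collections import defaultdict


def symmetric(undirected_edges):
    """Turn undirected (x, y, w) edges into a symmetric directed edge list."""
    directed = []
    for x, y, w in undirected_edges:
        directed.append((x, y, w))
        directed.append((y, x, w))  # back-edge with the same weight
    return directed


def modularity(edges, cluster_of):
    """Modularity of a clustering; edges are directed (tail, head, weight)."""
    total = sum(w for _, _, w in edges)  # volume of the whole graph
    volume = defaultdict(float)          # sum of weighted degrees per cluster
    internal = defaultdict(float)        # weight of edges inside each cluster
    for x, y, w in edges:
        volume[cluster_of[x]] += w
        if cluster_of[x] == cluster_of[y]:
            internal[cluster_of[x]] += w
    return sum(internal[c] / total - (volume[c] / total) ** 2 for c in volume)


# Hypothetical example: two triangles joined by a single edge.
edges = symmetric([(0, 1, 1), (1, 2, 1), (0, 2, 1),
                   (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)])
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}
q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

On this example, splitting the graph into the two triangles scores higher than placing all nodes into a single cluster.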

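In contrast, the map equation has a resolution limit that is in practice orders of magnitude smaller [18]. Clusterings are better when they have a lower map equation score. To simplify its definition, we set $\mathrm{plogp}(x) := x \log x$. The definition is

$$L(\mathcal{C}) := \mathrm{plogp}\left(\frac{\mathrm{cut}(\mathcal{C})}{\mathrm{vol}(V)}\right) - 2 \sum_{C \in \mathcal{C}} \mathrm{plogp}\left(\frac{\mathrm{cut}(C)}{\mathrm{vol}(V)}\right) + \sum_{C \in \mathcal{C}} \mathrm{plogp}\left(\frac{\mathrm{cut}(C) + \mathrm{vol}(C)}{\mathrm{vol}(V)}\right) - \sum_{v \in V} \mathrm{plogp}\left(\frac{\deg(v)}{\mathrm{vol}(V)}\right).$$

Here, $V$ is the set of all nodes, $\deg(v)$ is the weighted degree of $v$, $\mathrm{vol}(C)$ is the volume of $C$, $\mathrm{cut}(C)$ is the cut of $C$ with the rest of the graph, and $\mathrm{cut}(\mathcal{C})$ is the sum of the cuts of all clusters.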

The last term is independent of the clustering and therefore does not affect the optimization. In DSLM-Map, we consider a simplified map equation where this term is omitted.
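The effect of omitting the last term can be illustrated with a small Python sketch. It assumes the usual two-level map equation written with $x \log x$ terms over normalized cut and volume values, and it uses base-2 logarithms, which is an assumption. The function names and the example graph are hypothetical.

```python
import math
from collections import defaultdict


def plogp(x):
    """x * log2(x), with 0 * log(0) taken as 0."""
    return x * math.log2(x) if x > 0 else 0.0


def map_equation(edges, cluster_of, with_node_term=True):
    """Two-level map equation on a symmetric directed edge list."""
    total = sum(w for _, _, w in edges)
    degree = defaultdict(float)
    volume = defaultdict(float)
    cut = defaultdict(float)
    for x, y, w in edges:
        degree[x] += w
        volume[cluster_of[x]] += w
        if cluster_of[x] != cluster_of[y]:
            cut[cluster_of[x]] += w
    score = plogp(sum(cut.values()) / total)
    score -= 2 * sum(plogp(cut[c] / total) for c in volume)
    score += sum(plogp((cut[c] + volume[c]) / total) for c in volume)
    if with_node_term:
        # This term depends only on the node degrees, not on the clustering.
        score -= sum(plogp(d / total) for d in degree.values())
    return score


# Hypothetical example: two triangles joined by a single edge.
undirected = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
edges = ([(x, y, 1.0) for x, y in undirected]
         + [(y, x, 1.0) for x, y in undirected])
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}

full_two = map_equation(edges, two_clusters)
full_one = map_equation(edges, one_cluster)
simple_two = map_equation(edges, two_clusters, with_node_term=False)
simple_one = map_equation(edges, one_cluster, with_node_term=False)
```

Both variants rank the two clusterings identically, because the omitted term shifts every score by the same constant.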

Optimizing modularity is NP-hard [5], but modularity can be optimized heuristically in practice [4]. For the map equation, only heuristic optimization algorithms are known to us [25].

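To compare clusterings, either with ground truth communities or with a clustering calculated by a different algorithm, we use two measures: the normalized mutual information (NMI) and the adjusted Rand index (ARI). NMI [12] is based on the mutual information, which is normalized by the entropy such that the value is between 0 and 1. The entropy of a clustering $\mathcal{C}$ of $n$ nodes is defined as follows:

$$H(\mathcal{C}) := - \sum_{C \in \mathcal{C}} \frac{|C|}{n} \log \frac{|C|}{n}$$

The mutual information of two clusterings $\mathcal{C}$ and $\mathcal{D}$ can be defined similarly:

$$I(\mathcal{C}, \mathcal{D}) := \sum_{C \in \mathcal{C}} \sum_{D \in \mathcal{D}} \frac{|C \cap D|}{n} \log \frac{n \, |C \cap D|}{|C| \, |D|}$$

where $0 \log 0$ is defined as 0.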

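The normalized mutual information is then simply the mutual information $I$, normalized by the entropy $H$ of both clusterings:

$$\mathrm{NMI}(\mathcal{C}, \mathcal{D}) := \frac{2 \, I(\mathcal{C}, \mathcal{D})}{H(\mathcal{C}) + H(\mathcal{D})}$$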

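The adjusted Rand index [16] is based on counting the pairs of nodes that are either in the same cluster in both clusterings or in different clusters in both clusterings. This Rand index is then adjusted for chance such that when one of the two clusterings is random, the adjusted Rand index has an expected value of 0. For two clusterings $\mathcal{C}$ and $\mathcal{D}$ of $n$ nodes, the adjusted Rand index is defined as follows:

$$\mathrm{ARI}(\mathcal{C}, \mathcal{D}) := \frac{\sum_{C \in \mathcal{C}} \sum_{D \in \mathcal{D}} \binom{|C \cap D|}{2} - t}{\frac{1}{2} \left( \sum_{C \in \mathcal{C}} \binom{|C|}{2} + \sum_{D \in \mathcal{D}} \binom{|D|}{2} \right) - t}, \qquad t := \frac{\sum_{C \in \mathcal{C}} \binom{|C|}{2} \cdot \sum_{D \in \mathcal{D}} \binom{|D|}{2}}{\binom{n}{2}}$$

In contrast to NMI, the adjusted Rand index can also be negative. For both measures, the maximum value of 1 indicates that both clusterings are identical.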
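Both measures can be computed from cluster sizes and pairwise overlaps alone. The following Python sketch assumes that NMI is normalized by the arithmetic mean of the two entropies and that the adjusted Rand index follows the common pair-counting formula. Both choices, as well as all function names, are assumptions made for illustration. Clusterings are given as label lists of equal length with at least two nodes.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(k / n * math.log(k / n) for k in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # only non-empty overlaps, so no 0 * log(0)
    return sum(k / n * math.log(k * n / (size_a[i] * size_b[j]))
               for (i, j), k in joint.items())


def nmi(a, b):
    h = entropy(a) + entropy(b)
    return 2 * mutual_information(a, b) / h if h > 0 else 1.0


def pairs(k):
    """Number of unordered pairs among k items."""
    return k * (k - 1) // 2


def adjusted_rand_index(a, b):
    same_both = sum(pairs(k) for k in Counter(zip(a, b)).values())
    same_a = sum(pairs(k) for k in Counter(a).values())
    same_b = sum(pairs(k) for k in Counter(b).values())
    expected = same_a * same_b / pairs(len(a))
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)


# Identical clusterings up to relabeling, and two "orthogonal" clusterings.
same_nmi = nmi([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9])
same_ari = adjusted_rand_index([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9])
cross_nmi = nmi([0, 0, 1, 1], [0, 1, 0, 1])
cross_ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The second pair of clusterings shows the difference in value range: its NMI is 0, while its adjusted Rand index is negative.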

### II-A Thrill

Thrill [3] is a distributed C++ big data processing framework. It can distribute the program execution over multiple machines and over multiple threads within a machine. Each thread is called a worker. Every worker executes the same program. Thrill provides distributed operations which synchronize the workers.

If there is not enough main memory available, Thrill can use local storage such as an SSD as external memory (EM) to store parts of the processed data. This makes it possible to work with datasets larger than the combined main memory of all hosts.

Data is maintained in distributed immutable arrays (DIA). Thrill can maintain several DIAs simultaneously. Distributed Thrill operations are applied to the DIAs. For example, Thrill contains a sort operation, whose input is a DIA and whose output is a new sorted DIA. Similarly, the zip operation combines two DIAs of the same length into one DIA where each element is a pair of the two original elements.

Usually, the elements of DIAs are structures of multiple primitive types. For example, graphs can be represented as a list of edges. Internally, this list is stored as a DIA of pairs of tail and head node IDs.

DIAs of structures are similar to SQL-like tables. The components of the structure are the table columns and the DIA elements are the table rows.

Thrill supports DIAs with elements of non-uniform size, as long as each element fits into the memory of a worker. This allows elements to be arrays.

Apart from zip and sort, we use the following operations Thrill supports on these DIAs: The simplest is the map operation, which applies a function to each element. The return values are the elements of the new DIA. The flatmap operation is similar, but instead of returning just one value, the function may emit 0, 1, or more elements. A flatmap in Thrill is very similar to the map operation in the original MapReduce model.

A DIA can be aggregated by a key component. All elements with the same key are combined and put into an array. An aggregation is much more efficient if the keys form a consecutive sequence. In that case the result will also be sorted by the key. For example, aggregating by the first component results in . In Thrill, this operation is called group to index.

In addition, there is the group by key operation, which can handle arbitrary keys. It is very similar to the reduce operation in the original MapReduce model. Our algorithms always use the index-based operation unless noted otherwise.
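The semantics of these operations can be mimicked with plain Python lists standing in for DIAs. This is only a single-machine sketch with hypothetical function names. It is not Thrill's C++ API and it ignores distribution entirely.

```python
def flat_map(dia, fn):
    """Apply fn to each element; fn may return zero, one, or more outputs."""
    return [out for element in dia for out in fn(element)]


def group_to_index(dia, key, size):
    """Aggregate by a key from the consecutive range 0..size-1.

    The result has one array per key and is ordered by key.
    """
    groups = [[] for _ in range(size)]
    for element in dia:
        groups[key(element)].append(element)
    return groups


def group_by_key(dia, key):
    """Aggregate by an arbitrary hashable key."""
    groups = {}
    for element in dia:
        groups.setdefault(key(element), []).append(element)
    return list(groups.values())


# An edge list as pairs of tail and head node IDs.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
by_tail = group_to_index(edges, key=lambda e: e[0], size=3)
heads = flat_map(by_tail, lambda group: [head for _, head in group])
by_name = group_by_key([("a", 1), ("b", 2), ("a", 3)], key=lambda p: p[0])
zipped = list(zip([10, 20, 30], by_tail))  # zip joins DIAs of equal length
```

Because the keys of `group_to_index` are array positions, no hashing or sorting is needed, which hints at why the index-based variant is the cheaper one.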

### II-B Data Representation

We represent a graph and its clustering using three DIAs: graph DIA, degree DIA, and clustering DIA.

A graph DIA contains triples $(v, N_v, w_v)$, where $v$ is a node ID and $N_v$ and $w_v$ are equally-sized arrays. For every index $i$, there is an edge $(v, N_v[i])$ with weight $w_v[i]$. We refer to the pair of $N_v$ and $w_v$ as the neighborhood of $v$.

The degree DIA can be derived from the graph DIA using a map operation. However, it is more compact and therefore using it can be more efficient. It contains triples $(v, d_v, l_v)$, where $d_v$ is the weighted degree of $v$ and $l_v$ is the weight of one loop edge of $v$.

A clustering is a DIA of pairs $(v, c)$. The interpretation is that node $v$ is in cluster $c$.

We store all three DIAs sorted by $v$. This enables us to quickly join them by using a zip operation.

As a space optimization, we omit loops from graph DIAs if their weight is 0.
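A small Python sketch of this representation, with lists standing in for the three DIAs, is shown below. The function name and the way loop edges are detected are assumptions made for illustration.

```python
def build_dias(num_nodes, edges):
    """Build graph, degree, and clustering lists from directed (x, y, w) edges."""
    neighbors = [[] for _ in range(num_nodes)]
    weights = [[] for _ in range(num_nodes)]
    degree = [0] * num_nodes
    loop_weight = [0] * num_nodes
    for x, y, w in sorted(edges):
        neighbors[x].append(y)
        weights[x].append(w)
        degree[x] += w
        if x == y:
            loop_weight[x] = w  # both loop edges of x carry this weight
    graph_dia = [(v, neighbors[v], weights[v]) for v in range(num_nodes)]
    degree_dia = [(v, degree[v], loop_weight[v]) for v in range(num_nodes)]
    clustering_dia = [(v, v) for v in range(num_nodes)]  # singleton clustering
    return graph_dia, degree_dia, clustering_dia


# Path 0 - 1 - 2 as a symmetric directed graph with unit weights.
edges = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1)]
graph_dia, degree_dia, clustering_dia = build_dias(3, edges)

# All three lists are sorted by node ID, so a zip joins them element-wise.
joined = list(zip(graph_dia, clustering_dia))
```

Since position `v` in each list belongs to node `v`, the join needs no lookup structure.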

## III Algorithm

The basis of our algorithm is the Louvain algorithm [4], a fast algorithm for modularity optimization that delivers high-quality results. The original Infomap algorithm proposed for optimizing the map equation [25] is based on the scheme of the Louvain algorithm, but introduces additional steps to improve the quality.

Initially, every node is in its own cluster. This is called a singleton clustering.

In the local moving phase, the Louvain algorithm works in rounds. In each round, it iterates in a random order over all nodes $v$. For every $v$, it considers $v$'s current cluster and all clusters $C$ such that there is an edge from $v$ to a node of $C$. For all these clusters, the difference in quality score that results from moving $v$ into the cluster is computed. If $v$'s current cluster does not maximize this value, $v$ is moved into a cluster that maximizes it, resolving ties uniformly at random. The local moving phase stops when no node is moved or a maximum number of rounds is reached.

After the local moving phase, the contraction phase starts. All nodes inside a cluster are merged into a single large node. An edge from $x$ to $y$ is transformed into an edge from $x$'s cluster to $y$'s cluster. This operation results in parallel edges and loop edges. We group all parallel edges into one edge whose weight is the sum of their weights. Loop edges are kept to ensure that the quality score of the singleton clustering on the contracted graph is the same as that of the clustering on the original graph. On the contracted graph, the algorithm is applied recursively.

The Louvain algorithm terminates if the contraction phase is entered and every node is in its own cluster.

### III-A Synchronous Local Moving

In the Louvain algorithm, the $i$-th move depends on the clustering after the first $i-1$ moves are executed. This data dependency makes parallelization difficult. We therefore split a round into sub-rounds. Every node picks a random sub-round in which it is active. Only active nodes may be moved within a sub-round. The moves in the first sub-round are performed with respect to the initial clustering. The moves in the $i$-th sub-round are performed with respect to the clustering after the first $i-1$ sub-rounds were executed. We call this scheme synchronous local moving because all active nodes are moved synchronously and in parallel.

### III-B Computing Quality Score Differences

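To evaluate whether a node $v$ should be moved from its cluster $C$ to a cluster $D$, we need to know the resulting difference $\Delta$ in the quality measure. For the calculation of $\Delta$, let $C \setminus v$ denote the cluster $C$ without node $v$ and $D \cup v$ the cluster $D$ with node $v$. Both modularity and map equation allow computing $\Delta$ using just information about $C \setminus v$ and $D \cup v$ and, for map equation, how the sum of all cuts changes. The latter can also be obtained easily if we know the value before the move. We only need the values for the affected clusters because most terms in the modularity and the map equation formula are just sums over all clusters. When moving a node, only the parts for the affected clusters $C$ and $D$ change. Thus, all other terms cancel out when calculating the difference. Only the first term in the map equation is different as it is a sum of all cut values. However, this sum can be quickly updated, too. For modularity, we obtain the following formula:

$$\Delta Q = 2 \left( \frac{\mathrm{cut}(v, D) - \mathrm{cut}(v, C \setminus v)}{\mathrm{vol}(V)} - \frac{\deg(v) \left( \mathrm{vol}(D) - \mathrm{vol}(C \setminus v) \right)}{\mathrm{vol}(V)^2} \right)$$

Here, $\deg(v)$ is the weighted degree of $v$, $\mathrm{cut}(v, X)$ is the total weight of the edges from $v$ to nodes in $X$, $\mathrm{vol}(X)$ is the volume of $X$, and $V$ is the set of all nodes.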

Since the absolute difference is not required to select the best cluster for a given node, this formula can be simplified further. Specifically, if the set of possible clusters is guaranteed to contain the current cluster, we can drop all parts in the formula referring to the current cluster . This will allow optimizations later on.
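The following Python sketch illustrates this simplification. It assumes that the modularity difference for moving a node consists of a cut term and a degree-times-volume term per affected cluster, as follows from the standard modularity definition. The function names are hypothetical. The numbers belong to a hypothetical graph of two triangles joined by one edge (total volume 14), where the node at one end of the joining edge (degree 3) considers leaving its triangle.

```python
def delta_modularity(deg_v, cut_v_target, cut_v_current,
                     vol_target, vol_current_without_v, total):
    """Modularity change when v moves from its current cluster to a target."""
    return 2 * ((cut_v_target - cut_v_current) / total
                - deg_v * (vol_target - vol_current_without_v) / total ** 2)


def bid_score(deg_v, cut_v_cluster, vol_cluster_without_v, total):
    """Simplified score: parts of the current cluster and the factor 2 dropped.

    Comparing these scores across candidate clusters, including the current
    one, selects the same cluster as comparing the full differences.
    """
    return cut_v_cluster / total - deg_v * vol_cluster_without_v / total ** 2


# Node with degree 3: two edges into its own triangle (volume 4 without the
# node) and one edge into the other triangle (volume 7).
delta = delta_modularity(3, cut_v_target=1, cut_v_current=2,
                         vol_target=7, vol_current_without_v=4, total=14)
stay_score = bid_score(3, cut_v_cluster=2, vol_cluster_without_v=4, total=14)
move_score = bid_score(3, cut_v_cluster=1, vol_cluster_without_v=7, total=14)
```

Here `delta` is negative and `stay_score` exceeds `move_score`, so both criteria agree that the node should stay in its triangle.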

For the map equation, this formula contains more terms. Due to the logarithms fewer terms cancel out:

Note that for obtaining the updated volume and cut values of $C \setminus v$ and $D \cup v$, we only need to know the previous cut and volume values of $C$ and $D$, the degree of $v$, and the cut of $v$ with $C$ and $D$. The latter can easily be obtained by iterating over the neighbors of $v$. In the sequential algorithms, the cut and volume of each cluster as well as the global cut are cached and updated after every move. For the parallelization, we re-calculate all of them after each sub-round.

### III-C Distributed Synchronous Local Moving (DSLM)

In this subsection, we describe one round of our distributed synchronous local moving algorithm.

We use a global hash function $h$ to determine when a node is active. It maps a node ID $v$ together with the number of completed rounds $r$ onto a sub-round $h(v, r)$. Node $v$ is active in sub-round $h(v, r)$.
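A possible shape of such a hash function is sketched below in Python. The use of BLAKE2 from the standard library and the function name are assumptions made for illustration. Any deterministic hash that all workers can evaluate without communication would serve the same purpose.

```python
import hashlib


def active_sub_round(node_id, completed_rounds, num_sub_rounds=4):
    """Map a node ID and a round counter to a sub-round index."""
    data = f"{node_id}:{completed_rounds}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_sub_rounds


# Every worker computes the same assignment without any communication.
schedule = [[v for v in range(1000) if active_sub_round(v, 0) == s]
            for s in range(4)]
```

Including the round counter in the hash input means that a node is generally active in different sub-rounds across rounds.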

A sub-round is composed of a bidding and a compare step. In the bidding step, the clusters place bids for active nodes. In the compare step, the active nodes compare their bids. Every active node picks the best bid and becomes part of the corresponding cluster.

Bids must carry sufficient information to allow nodes to compare them. A bid of cluster $C$ for node $v$ contains: (a) the volume of $C \setminus v$, (b) the cut weight between $v$ and $C \setminus v$, and (c) the cut weight between $C \setminus v$ and the remaining graph.

The bidding step starts by zipping the clustering DIA and graph DIA. It then aggregates the elements corresponding to a cluster. The result is a DIA with one large element per cluster $C$. This element contains all nodes in cluster $C$ and the neighborhoods of these nodes. Recall that every element must fit into the working space of a single worker. We can therefore compute the measures (a), (b), and (c) using a non-distributed algorithm inside a worker. Using a flatmap, our algorithm emits for every cluster $C$ bids for all active nodes inside and adjacent to $C$. It can determine which nodes are active as the hash function is globally known. The generated bid DIA consists of quintuples: the node $v$, the cluster $C$, and the three measures (a), (b), and (c). Each quintuple is a bid of cluster $C$ for active node $v$.

The compare step aggregates the bid DIA by node. After the aggregation, the result is zipped with the degree DIA. Given the information in the bids and its weighted degree, each active node can determine which bid is the best. Using a map operation, our algorithm computes for every node the best bid and returns the corresponding cluster. The result is a partial clustering DIA that contains the new cluster for every active node. The DIA is padded to a full clustering DIA by adding an invalid cluster ID for every non-active node. We zip this DIA with the original clustering DIA. Using a map operation, we pick the new cluster for active nodes and the old cluster for the other nodes. The result is an updated clustering DIA. This DIA is the input of the next sub-round.
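The two steps can be sketched sequentially in Python for the modularity case, with dictionaries standing in for the DIAs. This is a hypothetical single-machine illustration, not the Thrill implementation. The scoring rule assumes the standard simplified modularity gain, and ties are broken deterministically in favor of the current cluster instead of randomly.

```python
from collections import defaultdict


def sub_round(graph, cluster_of, active):
    """graph: node -> [(neighbor, weight)], symmetric; active: set of nodes."""
    degree = {v: sum(w for _, w in nbrs) for v, nbrs in graph.items()}
    total = sum(degree.values())
    members = defaultdict(list)
    for v, c in cluster_of.items():
        members[c].append(v)

    # Bidding step: each cluster bids for active nodes inside and adjacent.
    bids = defaultdict(list)
    for c, nodes in members.items():
        volume = sum(degree[x] for x in nodes)
        cut_with = {v: 0.0 for v in nodes if v in active}
        for x in nodes:
            for y, w in graph[x]:
                if y in active and y != x:
                    cut_with[y] = cut_with.get(y, 0.0) + w
        for v, cut in cut_with.items():
            vol_without_v = volume - degree[v] if cluster_of[v] == c else volume
            bids[v].append((c, vol_without_v, cut))

    # Compare step: each active node picks the bid with the best score.
    new_cluster_of = dict(cluster_of)
    for v, offers in bids.items():
        def score(bid):
            c, vol, cut = bid
            gain = cut / total - degree[v] * vol / total ** 2
            return (gain, c == cluster_of[v])
        new_cluster_of[v] = max(offers, key=score)[0]
    return new_cluster_of


# Hypothetical example: two triangles joined by the edge 2 - 3.
graph = defaultdict(list)
for x, y in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    graph[x].append((y, 1.0))
    graph[y].append((x, 1.0))
clustering = {0: 0, 1: 0, 2: 2, 3: 3, 4: 3, 5: 3}
updated = sub_round(graph, clustering, active={2})
unchanged = sub_round(graph, clustering, active=set())
```

In this example node 2 receives three bids, one from its own singleton cluster and one from each neighboring cluster, and joins the cluster of nodes 0 and 1.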

#### Implementation Details and Optimizations

If modularity is optimized, our algorithm can be improved in two ways. The first optimization is that the modularity computation does not depend on the cut weight between a cluster and the remaining graph, which is measure (c) of a bid. We therefore omit its computation and a bid is a quadruple.

The second optimization exploits that, given two modularity bids, a node can determine which is better, without knowing what other bids exist. Thrill has an aggregation operation which allows reducing elements pairwise into a new element. Recall that for the computation we can omit the parts of the equation referring to the current cluster if the current cluster is guaranteed to be in the set of potential clusters. This allows combining elements immediately as they arrive from other workers rather than first aggregating all elements.

For the pairwise reduction the degree of the node needs to be available. This would not be possible by zipping the degree DIA later on. To make the degrees available, they are kept not as a DIA but as a plain array on the worker where they will be required. This is possible because we know on which worker each node will end up.

### III-D Distributed Contraction

The contraction phase has three goals: (a) compactify cluster IDs into a consecutive ID range, (b) replace all node IDs by the cluster IDs, and (c) combine multi-edges.

Our algorithm starts by zipping the graph DIA and the clustering DIA. It then aggregates the combined DIA by the cluster ID. This aggregation is not index based. To compactify IDs, it then replaces the cluster IDs by the element positions in the aggregated DIA. From the resulting DIA, two DIAs are derived.

The first DIA is a new clustering DIA. It contains pairs of a cluster and a node. Our algorithm needs two steps for this: First, it drops the adjacency information, getting a DIA where each item consists of a cluster ID and an array of node IDs. We store this intermediate clustering DIA for the unpacking phase. Then, it uses a flatmap operation to expand this into one item per node. Afterwards, this new clustering DIA is sorted by node ID. This new clustering DIA is similar to the input clustering DIA, except that cluster IDs are consecutive.

The second DIA contains $(c, y, w)$ triples. For every triple, there is an edge from cluster $c$ to node $y$ with weight $w$. The DIA is obtained by dropping the tail nodes. As an optimization, we perform a map operation that groups all outgoing edges of a cluster. This map operation removes some but not all multi-edges. After this map operation, for every combination of $c$ and $y$, there is at most one triple.

In the next step, the second DIA is aggregated by the head node $y$. The result is zipped with the new clustering DIA. Afterwards, $y$ is dropped. The result is a graph DIA with $(c, N, W)$ triples, where $c$ is now the cluster of the dropped node. As in every graph DIA, the last two components $N$ and $W$ are arrays with a common index. However, there may be several elements with the same $c$. We therefore aggregate by $c$. Finally, we perform a map operation that groups multi-edges. The result is the contracted graph DIA.
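The outcome of the contraction can be reproduced with a short sequential Python sketch. It works on a plain edge list rather than on DIAs. The function name is hypothetical, and representing the loops of a contracted node as two loop edges of equal weight is an assumption based on the graph model described earlier.

```python
from collections import defaultdict


def contract(edges, cluster_of):
    """Contract a symmetric directed (x, y, w) edge list along a clustering."""
    # (a) Compactify cluster IDs into the consecutive range 0..k-1.
    new_id = {c: i for i, c in enumerate(sorted(set(cluster_of.values())))}
    members = [[] for _ in new_id]  # kept for the unpacking phase
    for v in sorted(cluster_of):
        members[new_id[cluster_of[v]]].append(v)
    # (b) Replace node IDs by cluster IDs and (c) sum up multi-edges.
    weight = defaultdict(float)
    for x, y, w in edges:
        weight[(new_id[cluster_of[x]], new_id[cluster_of[y]])] += w
    coarse_edges = []
    for (a, b), w in sorted(weight.items()):
        if a == b:
            # Two loop edges, one being the back-edge of the other.
            coarse_edges += [(a, a, w / 2), (a, a, w / 2)]
        else:
            coarse_edges.append((a, b, w))
    return coarse_edges, members


# Hypothetical example: two triangles joined by one edge, cluster IDs 7 and 42.
undirected = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
edges = ([(x, y, 1.0) for x, y in undirected]
         + [(y, x, 1.0) for x, y in undirected])
cluster_of = {0: 7, 1: 7, 2: 7, 3: 42, 4: 42, 5: 42}
coarse_edges, members = contract(edges, cluster_of)
```

The total edge weight is preserved, so volumes and cuts of the singleton clustering on the contracted graph match those of the clustering on the original graph.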

### III-E Distributed Unpacking

To unpack the clustering calculated in a level, we simply zip the clustering DIA of that level with the intermediate clustering DIA (pairs of a cluster ID and an array of node IDs) stored in the previous contraction phase. A flatmap operation then assigns the cluster ID of the contracted node to all original nodes, resulting in a new clustering DIA. After sorting it by node, it can be unpacked again on the next level.
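In a sequential Python sketch with lists standing in for the DIAs, unpacking one level reduces to a zip followed by a flatmap-style expansion and a sort. The function name and the example values are hypothetical.

```python
def unpack(coarse_clustering, members):
    """coarse_clustering[i]: cluster of contracted node i;
    members[i]: original nodes that were merged into contracted node i."""
    fine = [(v, c)
            for nodes, c in zip(members, coarse_clustering)
            for v in nodes]
    return sorted(fine)  # sorted by node ID, ready for the next level


# Three contracted nodes; the first two ended up in the same cluster 5.
members = [[0, 2], [1], [3, 4]]
coarse_clustering = [5, 5, 9]
fine_clustering = unpack(coarse_clustering, members)
```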

## IV Experiments

In this section, we present an experimental evaluation of our algorithm DSLM.
The source code of our implementation is publicly available on GitHub (https://github.com/kit-algo/distributed_clustering_thrill).
We evaluate its running time and the quality of the obtained clustering.
We use real-world and synthetic LFR data [19].

We first describe our experimental setup. This includes our testing machines, the configuration of the tested algorithms, and the synthetic instance generation process. Afterwards, we present weak and strong scaling experiments on LFR data to evaluate the running time scaling behavior. The next set of experiments evaluates the quality of the clusterings on LFR data. It compares the results against the LFR ground truth. The final set of experiments evaluates the performance of our algorithm on established real-world benchmark data.

### IV-A Test Machines

All running time experiments were run on a compute cluster system. Each compute node has two 14-core Intel Xeon E5-2660 v4 processors (Broadwell) with a default frequency of 2 GHz, 128 GiB RAM and a 480 GiB SSD. The compute nodes are connected by an InfiniBand 4X FDR interconnect. We use the TCP back-end of Thrill due to problems with the combination of multithreading and OpenMPI. We use Thrill's default parameters, except for the block size, which determines the size of data packages sent between the hosts. Preliminary experiments found that a block size of 128 KiB instead of the default 2 MiB yields the best results.

### IV-B Algorithm Configurations

In a preliminary study [28], we observed that using four sub-rounds is enough. Using more does not significantly improve quality but makes the execution slower. We therefore use four sub-rounds in every experiment. In each local moving phase, we perform at most eight rounds. Unless denoted otherwise, all experiments were performed with 16 threads per host.

We evaluate three variants of our algorithm: DSLM-Mod, DSLM-Mod w/o Cont., and DSLM-Map. DSLM-Mod and DSLM-Mod w/o Cont. optimize modularity. DSLM-Map optimizes map equation. The difference between DSLM-Mod and DSLM-Mod w/o Cont. is that the former performs the contraction phase while the latter stops after the first local moving phase. We include DSLM-Mod w/o Cont. because we observed that on many synthetic instances stopping early significantly improved the quality and decreased the running time.

We compare our algorithms to their sequential counterparts. For the modularity optimization we use our own implementation of the sequential Louvain algorithm [4]. For map equation we use the original Infomap implementation [25] with the option to optimize two-level map equation.

We apply a preprocessing step to all graphs. This includes removing degree zero nodes and making the id space consecutive. We randomize the node order. This ensures that our algorithms are independent of input order and improves load balancing. The preprocessed graph is emitted in a binary format, split into several files of fixed size.

### IV-C Synthetic Instance Generation

Our synthetic test data is generated using the established LFR benchmark generation scheme [19]. Unfortunately, the original implementation is too slow and requires too much memory to generate sufficiently large test data. We therefore use the external memory generator of [15] which implements the same scheme but can generate larger graphs. The LFR generator produces benchmark graphs with properties similar to real-world graphs. Further, it gives a ground truth clustering to which a calculated clustering can be compared.

The node degrees and the cluster sizes of LFR benchmark graphs are drawn from power law distributions. Nodes are distributed to clusters at random. A fraction $\mu$ of the neighbors of every node are in a different cluster; all other neighbors are in the same cluster. Every cluster forms a random graph with degrees $(1 - \mu) \cdot \deg(v)$. An additional random graph with degrees $\mu \cdot \deg(v)$ connects the clusters. For details, we refer the reader to the original description [19]. We choose the parameters for the LFR graph generation by trying to mimic the properties of a social network. We use the properties reported in [26] as an orientation. For all experiments we choose a minimum degree of 50 and a maximum degree of 10 000 with a power law exponent of 2. This leads to an average degree of approximately 264. Our largest LFR graph with 512 million nodes has 67.6 billion (undirected) edges. For the communities, we choose 50 as minimum and 12 000 as maximum size with a power law exponent of 1. Unless otherwise noted, we set the mixing parameter $\mu$ to 0.4. This means that in contrast to previous experiments on the behavior of clustering algorithms on larger LFR benchmark instances [8], our cluster sizes are not scaled as we increase the graph size and the diameter of the clusters remains small. This avoids the field of view limit experienced in previous work [8].
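The stated average degree and edge count can be checked with a few lines of Python. The sketch assumes a discrete power law whose probability is proportional to $k^{-2}$ on the integers from 50 to 10 000. The generator used for the experiments may realize the distribution differently, so this is only a plausibility check.

```python
def power_law_mean(k_min, k_max, exponent):
    """Mean of a discrete power law with P(k) proportional to k^(-exponent)."""
    ks = range(k_min, k_max + 1)
    norm = sum(k ** -exponent for k in ks)
    return sum(k * k ** -exponent for k in ks) / norm


average_degree = power_law_mean(50, 10_000, 2)
# Every undirected edge contributes to the degree of two nodes.
undirected_edges = 512e6 * average_degree / 2
```

Under this assumption, the computed mean is close to 264 and the edge estimate for 512 million nodes is close to 67.6 billion, consistent with the figures above.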

### IV-D Weak Scaling

For the weak scaling experiments, we use LFR graphs with 16, 32, 64, 128, 256 and 512 million nodes. We execute them on 1, 2, 4, 8, 16, and 32 hosts respectively, using 16 workers on each host. Figure 1 shows the running time of our distributed algorithms.

With a linear time algorithm and perfect scaling, we would expect the running time to remain constant as we increase the graph size and the number of hosts. For DSLM-Mod w/o Cont., the running time indeed does not increase much. The running time of the full DSLM-Mod and DSLM-Map algorithms, however, increases approximately linearly while the number of hosts is scaled exponentially. A possible explanation for this is that the contraction does not scale well. Also, DSLM-Map is approximately a factor of two slower than DSLM-Mod. This is expected as the optimizations described in Section III-C are not applicable to DSLM-Map.

### IV-E Strong Scaling

To complement our weak scaling experiments, we also conducted a set of strong scaling experiments. Thrill supports distributing the execution over machines and threads inside a machine. To investigate how these two parallelism variants scale, we depict in Figure 2 two strong scaling experiments.

We ran several experiments using a fixed LFR graph. For the first experiment, we used only one worker per host. For the second experiment, we used 16 workers per host as in all other experiments. Our figures report the absolute running time and the efficiency relative to the running time with one host.

Using more threads and workers decreases the running time. However, the speedup follows an interesting pattern. When using 16 threads per host the efficiency drops below one, i.e., our algorithm does not scale perfectly. However, when using one thread per host, we observe a slightly superlinear speedup.

The superlinear speedup can be explained by Thrill's external memory capabilities. With one host, the graph reaches the limits of what Thrill can handle in internal memory, and external memory is used at times. With two or more hosts, no external memory is needed anymore, which causes the observed speedup. We attribute the decreased scalability with more threads to increased communication overhead, as each worker needs to communicate with all other workers.

We conclude that our algorithms scale well with the number of hosts even though having more threads per host negatively impacts the scalability. Fortunately, even though the efficiency is suboptimal, using more hosts and threads per host always results in faster execution times. We therefore use 16 threads per host in all other experiments in this paper.

### IV-F Quality

First, we cluster a set of LFR graphs with varying sizes, ranging from 1M to 512M nodes with exponentially increasing size. Figure 3 depicts the similarities of the clusterings found by our algorithms compared to the ground truth. From the plot, we observe that DSLM-Map finds a clustering close to the ground truth. A detailed investigation revealed that it finds it even perfectly. Similarly, DSLM-Mod w/o Cont. achieves similar results. However, there are small differences to the ground truth. Unfortunately, DSLM-Mod with Cont. fails to find a clustering similar to the ground truth on the larger instances. The NMI is much less sensitive to merged clusters than the ARI. As the NMI is not going down to 0 but is still relatively high, a likely explanation for this phenomenon is that DSLM-Mod merges clusters that are separate in the ground truth. This also explains why omitting the contraction helps: These unwanted merge operations likely happen after the contraction step.

Additionally, we use smaller LFR graphs with 1M nodes but with varying mixing parameter to compare the quality of the communities found by the distributed algorithms with their sequential counterparts. Figure 4 shows both the results concerning how well the ground truth is found and the quality in terms of the quality measures modularity and map equation. All reported values are averages over three runs.

As the parallel versions of the algorithms can actually move nodes into clusters such that the quality is decreased, it is surprising to see that DSLM-Mod without contraction and DSLM-Map outperform their sequential counterparts by a significant margin. Again, NMI scores are significantly higher than the ARI scores, in particular for the sequential Louvain and DSLM-Mod with contraction. A likely cause is therefore again that many clusters have been merged. One might think that this is due to the resolution limit and that the DSLM-Mod without contraction gets just stuck in a local maximum. However, looking at the modularity scores this is definitely not the case. DSLM-Mod without contraction delivers better modularity scores. For Infomap and DSLM-Map, the situation is similar for all graphs up to . For , DSLM-Map seems to indeed get stuck in a local minimum (note that smaller map equation scores are better). For , a trivial solution yields a better map equation score than the ground truth clustering.

Overall, for these LFR benchmark graphs our distributed synchronous local moving scheme seems to be superior to sequential local moving. When examining sequential local moving algorithms, we noticed that high-degree nodes attract many nodes in the first local moving round. After a few nodes join their cluster, many others follow. In contrast, with synchronous local moving, 25% of the nodes have the possibility to join the cluster of another node before any cluster sizes come into play. Apparently, this synchronous moving helps to avoid such accumulation effects.
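The following sketch illustrates the idea of one synchronous moving round for modularity on an unweighted graph without self-loops. It is a simplified, single-machine illustration and not the DSLM implementation. The function name `synchronous_round`, the random selection of the active nodes, and the toy graph are assumptions. The key property is that all active nodes decide against the same frozen snapshot of the clustering, so no node sees the moves of the others within the round.

```python
import random
from collections import defaultdict


def synchronous_round(adj, cluster, fraction, rng):
    """Let roughly `fraction` of the nodes pick their best cluster at once.

    adj maps each node to a list of neighbors, cluster maps nodes to labels.
    All decisions use the old clustering; moves are applied together.
    """
    two_m = sum(len(neighbors) for neighbors in adj.values())
    volume = defaultdict(int)
    for v, c in cluster.items():
        volume[c] += len(adj[v])

    # Assumed selection rule: each node is active with probability `fraction`.
    active = [v for v in sorted(adj) if rng.random() < fraction]
    new_cluster = dict(cluster)
    for v in active:
        degree = len(adj[v])
        links = defaultdict(int)  # number of edges from v into each cluster
        for u in adj[v]:
            links[cluster[u]] += 1
        own = cluster[v]
        # Score is proportional to the modularity gain of joining a cluster.
        best = own
        best_score = links[own] - degree * (volume[own] - degree) / two_m
        for c, k in links.items():
            if c == own:
                continue
            score = k - degree * volume[c] / two_m
            if score > best_score:
                best, best_score = c, score
        new_cluster[v] = best
    return new_cluster


# Toy graph: two triangles joined by the edge (2, 3), all nodes singletons.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {v: v for v in adj}
after = synchronous_round(adj, start, 0.25, random.Random(42))
print(after)
```

Because every active node only sees the snapshot, a node cannot follow a cluster that has just grown within the same round, which is the accumulation effect described above for sequential local moving.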

Further research is needed to see if this is a phenomenon particular to this kind of LFR graphs or if synchronous local moving could be a way to avoid local maxima when optimizing such quality functions.

### IV-G Real-World Graphs

To assess whether our results on LFR benchmark graphs also hold for real-world graphs, we perform experiments on a set of different real-world graphs. From the Stanford Large Network Dataset Collection [20], we include an Amazon co-purchase network, where nodes represent products that are connected when they are frequently purchased together (com-amazon), and four social networks (com-youtube, com-lj, com-orkut and com-friendster) of the online platforms YouTube, LiveJournal, Orkut and Friendster. From the 10th DIMACS Implementation Challenge [1], we include three web graphs, where nodes represent URLs and edges represent links between them (in-2004, uk-2002 and uk-2007-05), and the road network of Europe from OpenStreetMap (osm-europe). The sizes of the graphs we selected range from 334 thousand nodes and 925 thousand edges up to 105 million nodes and 3.3 billion edges.

| Instance | # Nodes | # Edges | # Hosts | Modularity: sequential Louvain [s] | Modularity: DSLM-Mod with Cont. [s] | Modularity: DSLM-Mod w/o Cont. [s] | Map Equation: sequential Infomap [s] | Map Equation: DSLM-Map [s] |
|---|---|---|---|---|---|---|---|---|
| com-amazon | 334 863 | 925 872 | 2 | 1.2 | 5.7 | 1.0 | 23.8 | 4.6 |
| com-youtube | 1 134 890 | 2 987 624 | 2 | 6.9 | 9.6 | 3.9 | 113.1 | 14.7 |
| in-2004 | 1 382 867 | 13 591 473 | 4 | 16.5 | 13.4 | 4.0 | 131.2 | 11.7 |
| com-lj | 3 997 962 | 34 681 189 | 8 | 94.8 | 31.4 | 11.4 | 1 093.8 | 45.3 |
| osm-europe | 50 912 018 | 54 054 660 | 8 | 1 607.3 | 156.8 | 45.9 | 12 h | 164.1 |
| com-orkut | 3 072 441 | 117 185 083 | 8 | 164.8 | 45.3 | 34.3 | 2 478.9 | 83.5 |
| uk-2002 | 18 483 186 | 261 787 258 | 8 | 529.4 | 48.0 | 20.3 | 5 614.0 | 52.1 |
| com-friendster | 65 608 366 | 1 806 067 135 | 16 | 5 499.1 | 993.3 | 1 093.1 | OOM | 1 143.8 |
| uk-2007-05 | 105 153 952 | 3 301 876 564 | 16 | 7 260.0 | 162.5 | 108.8 | OOM | 220.3 |

We clustered these graphs both with the sequential baseline algorithms and with our distributed algorithms. Table I lists the sizes of the graphs, the number of hosts we used for our distributed algorithms and the algorithm running times. Unfortunately, the original Infomap implementation was not able to cluster all instances. On some instances, it terminated because 128 GB of RAM did not suffice (OOM), or the execution times were prohibitively large (more than 12 hours).

The modularity-based algorithms are nearly always faster than the map-equation-based algorithms. However, this advantage is smaller than on the LFR graphs. On one instance, namely in-2004, DSLM-Map is even slightly faster than DSLM-Mod.

The distributed algorithms have inherent communication overhead compared to the non-distributed, sequential ones. But for the instances where we used four or more hosts, they outperform their sequential counterparts. DSLM-Map outperforms Infomap on all graphs. This is likely due to the additional refinements implemented in Infomap.

Surprisingly, on com-friendster DSLM-Mod w/o Cont. is slower than DSLM-Mod. We observe that on com-friendster the randomization of the local moving can have a significant impact on the algorithm convergence. In this case different convergence in local moving dominated the performance of the entire algorithm.

| Instance | Modularity: sequential Louvain | Modularity: DSLM-Mod with Cont. | Modularity: DSLM-Mod w/o Cont. | Map Equation: sequential Infomap | Map Equation: DSLM-Map | ARI: Louvain 1 - Louvain 2 | ARI: Louvain - DSLM-Mod with Cont. | ARI: Louvain - DSLM-Mod w/o Cont. | ARI: Infomap 1 - Infomap 2 | ARI: Infomap - DSLM-Map |
|---|---|---|---|---|---|---|---|---|---|---|
| com-amazon | 0.926 | 0.924 | 0.662 | 5.240 | 5.309 | 0.645 | 0.561 | 0.006 | 0.881 | 0.780 |
| com-youtube | 0.718 | 0.721 | 0.594 | 8.448 | 8.544 | 0.843 | 0.790 | 0.397 | 0.952 | 0.823 |
| in-2004 | 0.980 | 0.980 | 0.879 | 6.287 | 6.298 | 0.867 | 0.832 | 0.325 | 0.939 | 0.970 |
| com-lj | 0.752 | 0.748 | 0.572 | 9.900 | 9.980 | 0.614 | 0.539 | 0.126 | 0.977 | 0.949 |
| osm-europe | 0.999 | 0.999 | 0.486 | 12 h | 4.350 | 0.618 | 0.587 | 0.000 | 12 h | 12 h |
| com-orkut | 0.667 | 0.642 | 0.537 | 11.825 | 11.896 | 0.497 | 0.585 | 0.314 | 0.949 | 0.839 |
| uk-2002 | 0.990 | 0.990 | 0.877 | 6.458 | 6.469 | 0.719 | 0.658 | 0.047 | 0.984 | 0.964 |
| com-friendster | 0.622 | 0.617 | 0.575 | OOM | 14.788 | 0.633 | 0.499 | 0.355 | OOM | OOM |
| uk-2007-05 | 0.996 | 0.996 | 0.907 | OOM | 8.057 | 0.856 | 0.817 | 0.276 | OOM | OOM |

In Table II, we measure the quality of the clusterings found on the real-world networks. We report the modularity and the map equation values, respectively. Recall that modularity is maximized, whereas map equation is minimized. Additionally, we compare clusterings using the adjusted Rand index (ARI). A higher value means that the clusterings are more similar. Identical clusterings have a similarity of 1.
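To make the opposite optimization directions concrete, the following sketch evaluates both measures on a toy graph. It uses the standard definitions of modularity and of the two-level map equation for unweighted, undirected graphs; the notation may differ from the one used earlier in the paper, and the graph and function names are illustrative assumptions.

```python
from collections import defaultdict
from math import log2


def plogp(x):
    return x * log2(x) if x > 0 else 0.0


def cluster_stats(edges, cluster):
    """Degrees plus per-cluster volume, internal edge count and cut size."""
    degree = defaultdict(int)
    internal = defaultdict(int)
    cut = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if cluster[u] == cluster[v]:
            internal[cluster[u]] += 1
        else:
            cut[cluster[u]] += 1
            cut[cluster[v]] += 1
    volume = defaultdict(int)
    for v, d in degree.items():
        volume[cluster[v]] += d
    return degree, volume, internal, cut


def modularity(edges, cluster):
    m = len(edges)
    _, volume, internal, _ = cluster_stats(edges, cluster)
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)


def map_equation(edges, cluster):
    two_m = 2 * len(edges)
    degree, volume, _, cut = cluster_stats(edges, cluster)
    q = {c: cut[c] / two_m for c in volume}     # exit probability per cluster
    p = {c: volume[c] / two_m for c in volume}  # visit rate per cluster
    return (plogp(sum(q.values()))
            - 2 * sum(plogp(x) for x in q.values())
            - sum(plogp(d / two_m) for d in degree.values())
            + sum(plogp(q[c] + p[c]) for c in volume))


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "a" for v in range(6)}

for name, clustering in [("two triangles", two_clusters), ("single cluster", one_cluster)]:
    print(name, round(modularity(edges, clustering), 3),
          round(map_equation(edges, clustering), 3))
```

On this toy graph, splitting the nodes into the two triangles gives the higher modularity (about 0.357 compared to 0) and the lower map equation value (about 2.32 compared to 2.56), so both measures prefer the same clustering while being optimized in opposite directions.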

Interestingly, the differences in modularity between Louvain and DSLM-Mod are tiny, but the results of DSLM-Mod w/o Cont. are considerably worse (for example, 0.662 compared to 0.924 on com-amazon). This is a significant difference between the tested real-world instances and the LFR synthetic instances. On the LFR instances, DSLM-Mod w/o Cont. achieves the best results, whereas on the real-world instances DSLM-Mod wins. The modularity of the sequential Louvain clusterings is equal to or slightly higher than that of DSLM-Mod on all instances except com-youtube.

On the instances where Infomap terminates, DSLM-Map and Infomap achieve very similar results. Infomap achieves slightly smaller values. It is therefore slightly superior.

These results suggest that the clusterings found by DSLM-Mod and DSLM-Map are essentially as good as those found by the baseline algorithms. However, this does not imply that the same clusterings are found. For example, many different clusterings can exist with the same modularity value. To investigate whether the same clusterings are found, we compared the computed clusterings using the ARI.

First, we compare two runs of the sequential baselines with different random seeds. As the obtained similarities are not 1, we observe that the baseline algorithms do not consistently find the same clustering. In particular, the Louvain algorithm finds vastly different clusterings depending on the random seed. Comparing two runs of the baseline algorithms gives an upper bound on the similarity that we can expect when comparing sequential and distributed algorithms. Except for DSLM-Mod w/o Cont., the achieved similarities of our distributed algorithms are very similar to the baseline. The values are only slightly worse. However, this result was expected, as the quality scores are also slightly inferior.

In general, map equation algorithms achieve significantly higher similarity values. Further, on social networks the adjusted Rand indices are smaller than on web graphs. This is probably due to web graphs having a more pronounced community structure, which leads to fewer good clusterings. The algorithms therefore automatically find more similar clusterings.

Our experiments show that on real-world graphs DSLM-Mod and DSLM-Map both achieve results comparable to the sequential Louvain and Infomap algorithms. The clustering quality is in general a bit lower, but the difference between the sequential and the parallel algorithms is in a similar range as the variance between two runs of the sequential algorithm. DSLM-Mod w/o Cont., which performed really well on the LFR benchmark graphs, is unsuited for real-world graphs.

### IV-H Comparison with Related Work

We compare our algorithms with GossipMap [2] on the uk-2007-05 instance. The authors of GossipMap report a running time of slightly less than 2 500 s using 128 threads on 16 hosts. DSLM-Mod requires 162.5 s and DSLM-Map requires 220.3 s using 256 threads on 16 hosts. We do not know the exact processors the GossipMap experiments were conducted with, which makes a direct comparison difficult. Nevertheless, these numbers strongly indicate that our much simpler algorithms are at least competitive.
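As a rough back-of-the-envelope check of these numbers, the following sketch computes the running time ratios. The GossipMap time is only known as "slightly less than 2 500 s", so the ratios are approximate upper bounds. Scaling by the thread count assumes perfectly linear scaling and equally fast cores, which is a simplification made for illustration and not a measured result.

```python
# Reported numbers for uk-2007-05, taken from the text above.
gossipmap_seconds = 2500.0   # "slightly less than" this value
gossipmap_threads = 128
dslm_threads = 256
dslm_seconds = {"DSLM-Mod": 162.5, "DSLM-Map": 220.3}

ratios = {}
for name, seconds in dslm_seconds.items():
    wall_clock_ratio = gossipmap_seconds / seconds
    # Crude normalization: penalize our runs for using twice the threads.
    thread_adjusted_ratio = wall_clock_ratio * gossipmap_threads / dslm_threads
    ratios[name] = (wall_clock_ratio, thread_adjusted_ratio)
    print(f"{name}: {wall_clock_ratio:.1f}x wall clock, "
          f"{thread_adjusted_ratio:.1f}x per thread")
```

Even under this pessimistic normalization, DSLM-Map, the direct counterpart of GossipMap since both optimize the map equation, remains more than five times faster.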

## V Conclusion

We have introduced two distributed graph clustering algorithms, DSLM-Mod and DSLM-Map, that optimize modularity and map equation, respectively. They are based on the Thrill framework, which allows for a simple formulation. In an extensive experimental evaluation, we have shown that on LFR benchmark graphs, DSLM-Map achieves excellent results, even better than the sequential Infomap algorithm. For DSLM-Mod, we also evaluated a variant without contraction, which performs very well on LFR benchmark graphs. The full DSLM-Mod algorithm with contraction fails to recover the ground truth on LFR benchmark graphs, similar to the sequential Louvain algorithm, but significantly outperforms the variant without contraction on the real-world graphs. On real-world graphs, both distributed algorithms find slightly worse clusterings than the sequential algorithms. However, the clusterings found do not differ from the sequential results significantly more than two runs of the sequential algorithms differ from each other due to random fluctuations. Concerning the running time, we have shown that our distributed algorithms scale reasonably well. Overall, our algorithms are very fast and quite memory efficient: for an LFR graph with 512 million nodes and 67.6 billion edges on a compute cluster with 32 compute nodes with 128 GB RAM each, DSLM-Map needs 67 minutes and DSLM-Mod needs only 38 minutes.

Possible directions for future work are to further explore the scaling behavior of the Thrill framework and how to improve the scaling of our algorithms, in particular the contraction step. Revisiting the local moving phase after the contraction when optimizing modularity seems worthwhile to improve DSLM-Mod’s performance on synthetic LFR benchmark graphs. Fortunately, DSLM-Mod works well on real-world graphs as it is.

## References

- [1] David A. Bader, Henning Meyerhenke, Peter Sanders, Christian Schulz, Andrea Kappes, and Dorothea Wagner. Benchmarking for graph clustering and partitioning. In Jon Rokne and Reda Alhajj, editors, Encyclopedia of Social Network Analysis and Mining, pages 73–82. Springer, 2013.
- [2] Seung-Hee Bae and Bill Howe. GossipMap: a distributed community detection algorithm for billion-edge directed graphs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 27:1–27:12. ACM Press, 2015.
- [3] Timo Bingmann, Michael Axtmann, Emanuel Jöbstel, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, and Peter Sanders. Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++. Technical report, arXiv, 2016. arXiv:1608.05634.
- [4] Vincent Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), 2008.
- [5] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Höfer, Zoran Nikoloski, and Dorothea Wagner. On Modularity Clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2):172–188, February 2008.
- [6] Nazar Buzun, Anton Korshunov, Valeriy Avanesov, Ilya Filonenko, Ilya Kozlov, Denis Turdakov, and Hangkyu Kim. EgoLP: Fast and Distributed Community Detection in Billion-Node Social Networks. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshops, pages 533–540. IEEE Computer Society, 2014.
- [7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107–113, 2008.
- [8] Scott Emmons, Stephen G. Kobourov, Mike Gallant, and Katy Börner. Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale. PLoS ONE, 11(7):1–18, July 2016.
- [9] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3–5):75–174, 2010.
- [10] Santo Fortunato and Marc Barthélemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1):36–41, 2007.
- [11] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016.
- [12] Ana L. Fred and Anil K. Jain. Robust Data Clustering. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), pages 128–136. IEEE Computer Society, June 2003.
- [13] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. GraphX: Graph Processing in a Distributed Dataflow Framework. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, pages 599–613, 2014.
- [14] Benjamin H. Good, Yves-Alexandre de Montjoye, and Aaron Clauset. Performance of modularity maximization in practical contexts. Physical Review E, 81(046106), 2010.
- [15] Michael Hamann, Ulrich Meyer, Manuel Penschuck, and Dorothea Wagner. I/O-efficient Generation of Massive Graphs Following the LFR Benchmark. In Proceedings of the 19th Meeting on Algorithm Engineering and Experiments (ALENEX’17), pages 58–72. SIAM, 2017.
- [16] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
- [17] George Karypis and Vipin Kumar. A Parallel Algorithm for Multilevel Graph Partitioning and Sparse Matrix Ordering. Journal of Parallel and Distributed Computing, 48:71–95, January 1998.
- [18] Tatsuro Kawamoto and Martin Rosvall. Estimating the resolution limit of the map equation in community detection. Physical Review E, 91:012809, 2015.
- [19] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, October 2008.
- [20] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection, June 2014.
- [21] Xiao Ling, Jiahai Yang, Dan Wang, Jinfeng Chen, and Liyao Li. Fast Community Detection in Large Weighted Networks Using GraphX in the Cloud. In 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, pages 1–8. IEEE, 2016.
- [22] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. In Proceedings of the 38th International Conference on Very Large Databases (VLDB 2012), 2012.
- [23] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(026113):1–16, 2004.
- [24] Xinyu Que, Fabio Checconi, Fabrizio Petrini, Teng Wang, and Weikuan Yu. Lightning-fast community detection in social media: A scalable implementation of the Louvain algorithm. Technical report, Department of Computer Science and Software Engineering, Auburn University, 2013. Tech. Rep. AU-CSSE-PASL/13-TR01.
- [25] Martin Rosvall, Daniel Axelsson, and Carl T. Bergstrom. The map equation. The European Physical Journal Special Topics, 178(1):13–23, 2009.
- [26] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The Anatomy of the Facebook Social Graph. Technical report, arXiv, 2011. arXiv:1111.4503.
- [27] Charith Wickramaarachchi, Marc Frincu, Patrick Small, and Viktor K. Prasanna. Fast parallel algorithm for unfolding of communities in large graphs. In Proceedings of the 2014 IEEE High Performance Extreme Computing Conference, pages 1–6. IEEE, 2014.
- [28] Tim Zeitz. Engineering Distributed Graph Clustering using MapReduce. Master’s thesis, Karlsruhe Institute of Technology.
- [29] Jianping Zeng and Hongfeng Yu. A study of graph partitioning schemes for parallel graph community detection. Parallel Computing, 58:131–139, 2016.