DAOC: Stable Clustering of Large Networks
Abstract
Clustering is a crucial component of many data mining systems involving the analysis and exploration of various data. Data diversity calls for clustering algorithms to be accurate while providing stable (i.e., deterministic and robust) results on arbitrary input networks. Moreover, modern systems often operate on large datasets, which implicitly constrains the complexity of the clustering algorithm. Existing clustering techniques are only partially stable, however, as they guarantee either determinism or robustness. To address this issue, we introduce DAOC, a Deterministic and Agglomerative Overlapping Clustering algorithm. DAOC leverages a new technique called Overlap Decomposition to identify fine-grained clusters in a deterministic way capturing multiple optima. In addition, it leverages a novel consensus approach, Mutual Maximal Gain, to ensure robustness and further improve the stability of the results while still being capable of identifying micro-scale clusters. Our empirical results on both synthetic and real-world networks show that DAOC yields stable clusters while being on average 25% more accurate than state-of-the-art deterministic algorithms without requiring any tuning. Our approach has the ambition to greatly simplify and speed up data analysis tasks involving iterative processing (need for determinism) as well as data fluctuations (need for robustness) and to provide accurate and reproducible results.
I Introduction
Clustering is a fundamental part of data mining with a wide applicability to statistical analysis and exploration of physical, social, biological and information systems. Modeling and analyzing such systems often involves processing large complex networks [1]. Clustering large networks is intricate in practice, and should ideally provide stable results in an efficient way in order to make the process easier for the data scientist.
Stability is pivotal for many data mining tasks since it makes it possible to better understand whether the results are caused by the evolving structure of the network, by evolving node ids (updated labels, coordinate shifts or node reordering), or by some fluctuations in the application of non-deterministic algorithms. Stability of the results involves both determinism and robustness. We use the term deterministic in the strictest sense, denoting algorithms that a) do not involve any stochastic operations and b) produce results invariant to the node processing order. Robustness ensures that clustering results gracefully evolve with small perturbations or changes in the input network [19]. It prevents sudden changes in the output for dynamic networks and provides the ability to tolerate noise and outliers in the input data [10].
Clustering a network is usually not a one-off project but an iterative process, where the results are visually explored and refined multiple times. The visual exploration of large networks requires considering the specificities of human perception [4, 32], which is good at handling fine-grained hierarchies of clusters. In addition, those hierarchies should be stable across iterations such that the user can compare previous results with new results. This calls for results that are both stable and fine-grained.
In this paper, we introduce a novel clustering method called DAOC to address the aforementioned issues. To the best of our knowledge, DAOC
II Related Work
A great diversity of clustering algorithms can be found in the literature. Below, we give an overview of prior methods achieving robust results, before describing deterministic approaches and outlining a few widely used algorithms that are neither robust nor deterministic but were inspirational for our method.
Robust clustering algorithms typically leverage consensus or ensemble techniques [14, 45, 23, 31]. They identify clusters using consensus functions (e.g., majority voting) by processing an input network multiple times and varying either the parameters of the algorithm, or the clustering algorithm itself. However, such algorithms typically a) are unable to detect fine-grained structures due to the lack of consensus therein, b) are stochastic and c) are inapplicable to large networks due to their high computational cost. We describe some prominent and scalable consensus clustering algorithms below.
 Order Statistics Local Optimization Method (OSLOM) [21] is one of the first widely used consensus clustering algorithms, which accounts for weights of the network links and yields overlapping clusters with a hierarchical structure. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations. OSLOM scales near linearly on sparse networks but has a relatively high computational complexity at each iteration, making it inapplicable to large real-world networks (as we show in Section V).
 Core Groups Graph Clustering Randomized Greedy (CGGC[i]_RG) [38] is a fast and accurate ensemble clustering algorithm. It applies a generic procedure of ensemble learning called Core Groups Graph Clustering (CGGC) to determine several weak graph (network) clusterings and then to form a strong clustering from their maximal overlap. The algorithm has a near-linear computational complexity with the number of edges due to the sampling and local optimization strategies applied at each iteration. However, this algorithm is designed for unweighted graphs and produces flat and non-overlapping clusters only, which limits its applicability and yields low accuracy on large complex networks as we show in Section V.
 Fast Consensus technique was recently proposed and works on top of state-of-the-art clustering algorithms including Louvain (FCoLouv), Label Propagation (FCoLPM) and Infomap (FCoIMap) [43]. The technique initializes a consensus matrix and then iteratively refines it until convergence as follows. First, the input network is clustered by the original algorithm multiple times. Each consensus value of the matrix is evaluated as the fraction of the runs in which the respective pair of nodes belongs to the same cluster. The consensus matrix is formed using pairs of co-clustered adjacent nodes and extended with closed triads instead of all nodes in the produced clusters, which significantly reduces the amount of computation. The formed matrix is filtered with a threshold and then again clustered multiple times by the original clustering algorithm, producing a refined consensus matrix. This refinement process is repeated until all runs produce identical clusters (i.e., until all values in the consensus matrix are either zero or one) up to a given precision. The Fast Consensus technique, however, lacks a convergence guarantee and relies on three parameters having a strong impact on its computational complexity.
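The consensus-matrix step described above can be illustrated with a minimal Python sketch. It is not the Fast Consensus implementation: the function names, the dictionary-based sparse matrix, the direction of the threshold filter (keeping values at or above the threshold) and the omission of the closed-triad extension are all simplifying assumptions made for illustration.

```python
def consensus_values(edges, partitions):
    """Fraction of runs in which both endpoints of each link share a cluster.

    edges: list of (u, v) pairs of adjacent nodes.
    partitions: list of dicts (one per clustering run) mapping node -> cluster id.
    """
    runs = len(partitions)
    values = {}
    for u, v in edges:
        together = sum(1 for labels in partitions if labels[u] == labels[v])
        values[(u, v)] = together / runs
    return values


def filter_by_threshold(values, threshold):
    """Keep only the pairs whose consensus value reaches the threshold (assumed rule)."""
    return {pair: value for pair, value in values.items() if value >= threshold}


def has_converged(values, precision):
    """True when every consensus value is within `precision` of zero or one."""
    return all(v <= precision or v >= 1.0 - precision for v in values.values())


# Three hypothetical runs of some base algorithm on a 4-node path 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
partitions = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 0, 1: 0, 2: 0, 3: 1},
    {0: 0, 1: 0, 2: 1, 3: 1},
]
values = consensus_values(edges, partitions)
kept = filter_by_threshold(values, threshold=0.5)
print(values)
print(sorted(kept))
print(has_converged(values, precision=0.01))
```

Storing values only for adjacent pairs mirrors the sparsity argument in the text: the cost of one refinement round stays proportional to the number of links rather than to the number of node pairs.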
Deterministic clustering algorithms and, in general, non-stochastic ones (i.e., algorithms relaxing the determinism constraint) are typically not robust and are sensitive to both a) initialization [42, 5, 44, 18] (including the order in which the nodes are processed) and b) minor changes in the input network, which may significantly affect the clustering results [10, 23]. Non-stochastic algorithms also often yield less precise results, getting stuck on the same local optimum until the input is updated. Multiple local optima often exist due to the degeneracy phenomenon, which is explained in Section III and has to be specifically addressed to create deterministic clustering algorithms that are both robust and accurate. We describe below some of the well-known deterministic algorithms.
 Clique Percolation method (CPM) [11] is probably the first deterministic clustering algorithm supporting overlapping clusters and capable of providing fine-grained results. Sequential algorithm for fast clique percolation (SCP) [20] is a CPM-based algorithm, which detects clique clusters in a single run and produces a dendrogram of clusters. SCP produces deterministic and overlapping clusters at various scales, and shows a linear dependency of the computational complexity with the number of k-cliques in the network. However, SCP relies on a number of parameters and has an exponential worst-case complexity in dense networks, which significantly limits its practical applicability.
 pSCAN [7] is a fast overlapping clustering algorithm for “exact structural graph clustering” (i.e., it is deterministic and input-order independent). First, it identifies core graph vertices (network nodes), which always belong to exactly one cluster, forming initially disjoint clusters. The remaining nodes are then assigned to the initial clusters, yielding overlapping clusters. pSCAN relies on two input parameters, $\varepsilon$ and $\mu$. The results it produces are very sensitive to those parameters, whose optimal values are hard to guess for arbitrary input networks.
Inspirational algorithms for our method
 Louvain [3] is a commonly used clustering algorithm that performs modularity optimization using a local search technique on multiple levels to coarsen clusters. It introduces modularity gain as an optimization function. The algorithm is parameter-free, returns a hierarchy of clusters, and has a near-linear runtime complexity with the number of network links. However, the resulting clusters are not stable and depend on the order in which the nodes are processed. Similarly to Louvain, our method is a greedy agglomerative clustering algorithm, which uses modularity gain as its optimization function. However, the cluster formation process in DAOC differs substantially, addressing the aforementioned issues of the Louvain algorithm.
 DBSCAN [13] is a density-based clustering algorithm suitable for processing data with noise. It regroups points that are close in space given the maximal distance between the points and the minimal number of points within an area. DBSCAN is limited in discovering a large variety of clusters because of its reliance on a density parameter. It has a strong dependency on input parameters, and lacks a principled way to determine optimal values for these parameters [8]. However, we adopt the DBSCAN idea of cluster formation based on the extension of the densest region to prevent early coarsening and to produce a fine-grained hierarchical structure.
III Preliminaries
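Table I: Notations
#i  Node <i> of the network (graph)
$Q$  Modularity
$\Delta Q_{i,j}$  Modularity Gain between #i and #j
$\#i \sim \#j$  Items #i and #j (nodes or clusters) are neighbors (adjacent, linked)
$\Delta Q_{max_i}$  Maximal Modularity Gain for #i: $\Delta Q_{max_i} = \max \{\Delta Q_{i,j} \mid \#i \sim \#j\}$
$MMG_{i,j}$  Mutual Maximal Gain: $MMG_{i,j} = \{\Delta Q_{i,j} \mid \Delta Q_{max_i} = \Delta Q_{max_j} = \Delta Q_{i,j},\ \#i \sim \#j\}$
$ccs_i$  Mutual clustering candidates of #i (by $MMG$)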
A clustering algorithm is applied to a network to produce groups of nodes that are called clusters (also known as communities, modules, partitions or covers). Clusters represent groups of tightly coupled nodes with loose intergroup connections [35], where the group structure is defined by the clustering optimization function. The resulting clusters can be overlapping, which happens when they share some common nodes, the overlap. The input network (graph) can be weighted and directed, where a node (vertex) weight is represented as a weighted link (edge) to the node itself (a self-loop). The main notations used in this paper are listed in Table I.
Clustering algorithms can be classified by the kind of input data they operate on: a) graphs specified by pairwise relations (networks) or b) attributed graphs (e.g., vertices specified by coordinates in a multidimensional space). These two types of input data cannot be unambiguously converted into each other, at least unless one agrees on some customized and specific conversion function. Hence, their respective clustering algorithms are not (directly) comparable. In this paper, we focus on clustering algorithms working on graphs specified by pairwise relations (networks), which are also known as community structure discovery algorithms.
Modularity ($Q$) [34] is a standard measure of clustering quality that is equal to the difference between the density of the links in the clusters and the expected density:
$$Q = \frac{1}{2w} \sum_{i,j} \left( w_{i,j} - \frac{w_i w_j}{2w} \right) \delta(C_i, C_j) \tag{1}$$
where $w_{i,j}$ is the accumulated weight of the arcs between nodes #i and #j, $w_i$ is the accumulated weight of all arcs of #i, $w$ is the total weight of the network, $C_i$ is the cluster to which #i is assigned, and the Kronecker delta $\delta(C_i, C_j)$ is a function that is equal to $1$ when #i and #j belong to the same cluster (i.e., $C_i = C_j$), and $0$ otherwise.
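As an illustration of how modularity is evaluated in practice, the following Python sketch computes $Q$ for a small undirected weighted network. It is a sketch under stated assumptions rather than the DAOC implementation: the helper names are hypothetical, the network is stored as a symmetric dictionary of dictionaries, the sum runs over ordered node pairs, and the example contains no self-loops.

```python
from collections import defaultdict


def build_weights(edges):
    """Build a symmetric weight map from (u, v, weight) triples (no self-loops assumed)."""
    weights = defaultdict(dict)
    for u, v, w in edges:
        weights[u][v] = weights[u].get(v, 0.0) + w
        weights[v][u] = weights[v].get(u, 0.0) + w
    return weights


def modularity(weights, clusters):
    """Modularity of a non-overlapping clustering.

    weights: symmetric dict of dicts, weights[i][j] is the link weight between i and j.
    clusters: dict mapping each node to its cluster id.
    """
    # Accumulated weight of all links of each node.
    node_weight = {i: sum(nbrs.values()) for i, nbrs in weights.items()}
    # Twice the total link weight of the network (each link is seen from both ends).
    two_w = sum(node_weight.values())
    q = 0.0
    for i in weights:
        for j in weights:
            if clusters[i] == clusters[j]:  # Kronecker delta
                q += weights[i].get(j, 0.0) - node_weight[i] * node_weight[j] / two_w
    return q / two_w


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single link 2-3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
weights = build_weights(edges)
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single_cluster = {node: "a" for node in weights}
print(modularity(weights, two_triangles))
print(modularity(weights, single_cluster))
```

For this toy network, splitting along the bridging link gives $Q = 5/14 \approx 0.357$, whereas placing all nodes into one cluster gives $Q = 0$, which matches the intuition that modularity rewards link density above the expected one.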
Modularity is applicable to non-overlapping cases only. However, there exist modularity extensions that handle overlaps [17, 9]. The main intuition behind such modularity extensions is to quantify the degree of a node membership among multiple clusters either by replacing the Kronecker delta (see (1)) with a similarity measure [33, 41] or by integrating a belonging coefficient [37, 24, 25] directly into the definition of modularity. Although both old and new measures are named modularity, they generally have different values even when applied to the same clusters [16], resulting in incompatible outcomes. Some modularity extensions are equivalent to the original modularity when applied to non-overlapping clusters [33, 41, 25]. However, the implementations of these extensions introduce an excessive number of additional parameters [33, 41] and/or increase the computational time by orders of magnitude [25], which significantly complicates their application to large networks.
Modularity gain ($\Delta Q$) [3] captures the difference in modularity when merging two nodes #i and #j into the same cluster, providing a computationally efficient way to optimize Modularity:
(2) 
We use modularity gain ($\Delta Q$) as an underlying optimization function for our meta-optimization function MMG (introduced in Section IV-A2).
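To make the notion of modularity gain concrete, the sketch below compares a closed-form pairwise gain with the brute-force difference in modularity before and after merging two singleton clusters. The code is illustrative only: the function names are hypothetical, the network is assumed undirected and stored as a symmetric dictionary of dictionaries, and the closed form is derived for this storage convention, so its notation may differ from the paper's equation.

```python
def node_weights(weights):
    """Accumulated link weight of each node."""
    return {i: sum(nbrs.values()) for i, nbrs in weights.items()}


def modularity(weights, clusters):
    """Modularity of a non-overlapping clustering over ordered node pairs."""
    w_node = node_weights(weights)
    two_w = sum(w_node.values())
    q = 0.0
    for i in weights:
        for j in weights:
            if clusters[i] == clusters[j]:
                q += weights[i].get(j, 0.0) - w_node[i] * w_node[j] / two_w
    return q / two_w


def modularity_gain(weights, i, j):
    """Closed-form gain of merging singleton clusters i and j (symmetric storage assumed)."""
    w_node = node_weights(weights)
    two_w = sum(w_node.values())
    # The merge adds the two ordered pairs (i, j) and (j, i) to the modularity sum.
    return 2.0 * (weights[i].get(j, 0.0) - w_node[i] * w_node[j] / two_w) / two_w


def brute_force_gain(weights, i, j):
    """Difference in modularity when i and j move from singletons into one cluster."""
    before = {node: node for node in weights}
    after = dict(before)
    after[j] = before[i]
    return modularity(weights, after) - modularity(weights, before)


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single link 2-3, unit weights.
weights = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0, 5: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {3: 1.0, 4: 1.0},
}
for pair in [(0, 1), (2, 3), (0, 5)]:
    print(pair, modularity_gain(weights, *pair), brute_force_gain(weights, *pair))
```

The closed form needs only the weights of the two nodes and of their shared link, which is why gain-based optimization avoids recomputing modularity over the whole network at every step. In this example the gain is larger inside a triangle than across the bridge, and negative for the non-adjacent pair.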
Degeneracy is a phenomenon linked to the clustering optimization function, appearing when multiple distinct clusterings (i.e., results of the clustering process) share the same globally maximal value of the optimization function while being structurally different [15]. This phenomenon is inherent to any optimization function and implies that a network node might yield the maximal value of the optimization function while being a member of multiple clusters, which is the case when an overlap occurs. This prevents the derivation of accurate results by deterministic clustering algorithms without considering overlaps. To cope with degeneracy, typically multiple stochastic clusterings are produced and combined, which is called an ensemble or consensus clustering and provides robust but coarse-grained results [15, 2]. Degeneracy of the optimization function, together with the aforementioned computational drawback of modularity extensions, motivated us to introduce a new overlap decomposition technique, OD (see Section IV-B1). OD makes it possible to consider and process overlaps efficiently using algorithms having an optimization function designed for the non-overlapping case. It produces accurate, robust and fine-grained results in a deterministic way as we show in our experimental evaluation (see Section V).
IV Method
We introduce a novel clustering algorithm, DAOC, to perform a stable (i.e., both robust and deterministic) clustering of the input network, producing a fine-grained hierarchy of overlapping clusters. DAOC is a greedy algorithm that uses an agglomerative clustering approach with a local search technique (inspired by Louvain [3]) and extended with two novel techniques. Namely, we first propose a novel (micro) consensus technique called Mutual Maximal Gain (MMG) for the robust identification of node membership in the clusters, which is performed in a deterministic and fine-grained manner. In addition to MMG, we also propose a new overlap decomposition (OD) technique to cope with the degeneracy of the optimization function. OD forms stable and fine-grained clusters in a deterministic way from the nodes preselected by MMG.
Algorithm 1 gives a high-level description of our method. It takes as input a directed and weighted network with self-loops specifying node weights. The resulting hierarchy of clusters is built iteratively starting from the bottom level (the most fine-grained level). One level of the hierarchy is generated at each iteration of our clustering algorithm. A clustering iteration consists of the following steps listed on lines 4–5:
At the end of each iteration, links are (re)computed for the formed clusters (initCluster procedure) and for the non-clustered nodes (propagated nodes in the initNode procedure). Both the non-clustered nodes and the formed clusters are treated as input nodes for the following iteration. The algorithm terminates when the iteration does not produce any new cluster.
The clustering process yields a hierarchy of overlapping clusters in a deterministic way independent of the node processing order, since all clustering operations a) consist solely of non-stochastic, uniform and local operations, and b) process each node independently, relying on immutable data evaluated on previous steps only. The algorithm is guaranteed to converge since a) the optimization function is bounded (as outlined in Section IV-A1) and monotonically increasing during the clustering process, and b) the number of formed clusters does not exceed the number of clustered nodes at each iteration (as explained in Section IV-B2).
IV-A Identification of the Clustering Candidates
The clustering candidates are the nodes that are likely to be grouped into clusters in the current iteration. The clustering candidates are identified for each node (nd) in two steps as listed in Algorithm 2. First, for each node the adjacent nodes having the maximal non-negative value of the optimization function optfn are stored in the nd.ccs sequence, see lines 3–11. Then, the preselected nd.ccs are reduced to the mutual candidates by the mcands procedure, and the filtered-out nodes are marked as propagated. The latter step is combined with a cluster formation operation in our implementation to avoid redundant passes over all nodes. The mcands procedure implements our Mutual Maximal Gain (MMG) consensus approach described in Section IV-A2, which is a meta-optimization technique that can be applied on top of any optimization function that satisfies a set of constraints described in the following paragraph.
Optimization Function
In order to be used in our method, the optimization function optfn should be a) applicable to pairwise node comparison, i.e., defined for any adjacent pair of nodes; b) commutative, i.e., independent of the order of the two compared nodes; and c) bounded on the non-negative range, where positive values indicate some quality improvement in the structure of the forming cluster. There exist various optimization functions satisfying these constraints besides modularity and inverse conductance (see the list in [6], for instance).
Our DAOC algorithm uses modularity gain, $\Delta Q$ (see (2)), as an optimization function. We chose modularity (gain) optimization because of the following advantages. First, modularity maximization (under certain conditions) is equivalent to the provably correct but computationally expensive methods of graph partitioning, spectral clustering and to the maximum likelihood method applied to the stochastic block model [36, 30]. Second, there are known and efficient algorithms for modularity maximization, including the Louvain algorithm [3], which are accurate and have a near-linear computational complexity.
MMG Consensus Approach
We propose a novel (micro) consensus approach, called Mutual Maximal Gain (MMG), that requires only a single pass over the input network, is more efficient and yields much more fine-grained results compared to state-of-the-art techniques.
Definition 1 (Mutual Maximal Gain (MMG))
MMG is a value of the optimization function (in our case modularity gain) for two adjacent nodes #i and #j, and is defined in cases where these nodes mutually reach the maximal value of the optimization function (i.e., reach consensus on the maximal value) when considering each other:
$$MMG_{i,j} = \{\Delta Q_{i,j} \mid \Delta Q_{max_i} = \Delta Q_{max_j} = \Delta Q_{i,j},\ \#i \sim \#j\} \tag{3}$$
where $\sim$ denotes adjacency of #i and #j, and $\Delta Q_{max_i}$ is the maximal modularity gain for #i:
$$\Delta Q_{max_i} = \max \{\Delta Q_{i,j} \mid \#i \sim \#j\} \tag{4}$$
where $\Delta Q_{i,j}$ is the modularity gain for #i and #j (see (2)).
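MMG exists in any finite network, which can be easily proven by contradiction as follows. Assume that no pair of adjacent nodes has MMG. Then each node reaches its maximal gain with a neighbor whose own maximal gain is strictly larger (it cannot be smaller because the optimization function is commutative, and it cannot be equal because the gain would then be mutual). Following such neighbors in a finite network eventually revisits a node, which creates a cycle with a strictly increasing maximal gain, i.e., the maximal gain of some node would be strictly larger than itself, which yields a contradiction.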
MMG evaluation is deterministic and the resulting nodes are quasi-uniform clustering candidates, in the sense that inside each connected component they share the same maximal value of modularity gain. MMG takes into account fine-grained clusters as it operates on pairs of nodes, unlike conventional consensus approaches, where micro-clusters either require lots of re-executions of the consensus algorithm, or cannot be captured at all.
MMG does not always guarantee optimal clustering results but reduces degeneracy due to the applied consensus approach.
According to (3), all nodes having MMG to #i have the same value of the maximal modularity gain, i.e., form the overlap in #i. Overlaps processing is discussed in the following section.
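The selection of mutual clustering candidates can be sketched in a few lines of Python. This is a minimal illustration of the MMG idea rather than the mcands procedure itself: the function name and the input format (one gain value per unordered pair of adjacent nodes) are assumptions, gains are compared with exact equality for brevity, and negative gains are discarded in line with the non-negativity requirement on candidates.

```python
from collections import defaultdict


def mutual_candidates(gains):
    """Return, for each node, the neighbors with which it has Mutual Maximal Gain.

    gains: dict mapping an (i, j) pair of adjacent nodes to their (commutative) gain,
           with each unordered pair listed once.
    """
    # Step 1: maximal non-negative gain reachable by each node.
    best = {}
    for (i, j), gain in gains.items():
        if gain < 0:
            continue
        for node in (i, j):
            if node not in best or gain > best[node]:
                best[node] = gain

    # Step 2: keep a pair only if the gain is maximal for both of its nodes.
    ccs = defaultdict(set)
    for (i, j), gain in gains.items():
        if gain >= 0 and best.get(i) == gain and best.get(j) == gain:
            ccs[i].add(j)
            ccs[j].add(i)
    return dict(ccs)


# Path a-b-c-d: only b and c agree on their maximal gain.
path_gains = {("a", "b"): 0.3, ("b", "c"): 0.5, ("c", "d"): 0.2}
print(mutual_candidates(path_gains))

# Star with center x: equal gains to p and q make x an overlapping node.
star_gains = {("x", "p"): 0.4, ("x", "q"): 0.4}
print(mutual_candidates(star_gains))
```

Both passes only read the input gains and never depend on the iteration order of the pairs, which reflects why the candidate selection is deterministic. In the star example the center obtains two mutual candidates with the same gain, which is exactly the overlap situation handled in the next section.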
IV-B Clusters Formation with Overlap Decomposition
Clusters are formed from candidate nodes selected by MMG as listed in Algorithm 3: a) nodes having a single mutual clustering candidate (cc) form the respective cluster directly as shown on line 9, b) otherwise the overlap is processed. There are three possible ways of clustering an overlapping node in a deterministic way: a) split the node into fragments to have one fragment per cc of the node and group each resulting fragment with the respective cc into the dedicated cluster (see lines 11–12), or b) group the node together with all its nd.ccs items into a single cluster (i.e., coarsening on line 19), or c) propagate the node to be processed on the following clustering iteration if its clustering would yield a negative value of the optimization function. Each node fragment created by the overlap decomposition is treated as a virtual node representing the belonging degree (i.e., the fuzzy relation) of the original node to multiple clusters. Virtual nodes are used to avoid the introduction of the fuzzy relation for all network nodes (i.e., to avoid an additional complex node attribute), reducing memory consumption and execution time, and not affecting the input network itself. In order to get the most effective clustering result, we evaluate the first two aforementioned options and select the one maximizing the optimization function, $\Delta Q$. Then, we form the cluster(s) by the merge or mergeOvp procedures as follows. The mergeOvp procedure groups each node fragment (i.e., the virtual node created by the overlap decomposition) together with its respective mutual clustering candidate. This results in either a) an extension of the existing cluster to which the candidate already belongs, or b) the creation of a new cluster and its addition to the cls list. The merge procedure groups the node with all its clustering candidates either a) merging together the respective clusters of the candidates if they exist, or b) creating a new cluster and adding it to the cls list.
Node splitting is the most challenging process, which is performed only if the accumulated gain from the decomposed node fragments to each of the respective mutual clustering candidates (gainEach procedure) exceeds the gain of grouping the whole node with all of its candidates (gainAll procedure). The node splitting involves: a) the estimation of the node fragmentation impact on the clustering convergence (odAccept procedure given in Section IV-B2) and b) the evaluation of the weights for both the forming fragment and for the links between the fragments of the splitting node as described in Section IV-B1.
Overlap Decomposition (OD)
An overlap occurs when a node has multiple mutual clustering candidates (ccs). To evaluate the modularity gain when clustering the node with each of its ccs, the node is split into identical and fully interconnected fragments, one per cc, sharing the node weight and original node links. However, since the objective of the clustering is the maximization of the modularity gain: a) the node splitting itself is acceptable only if it does not decrease the value of the optimization function, and b) the decomposed node can be composed back from the fragments only if the composition does not decrease this value either. Hence, to have a reversible decomposition, the splitting must leave the value of the optimization function for the decomposing node unchanged.
The outlined constraints for an isolated node, which does not have any link to other nodes, can formally be expressed as:
(5) 
where is the weight of the original node being decomposed into fragments, is the weight of each node fragment and is the weight of each link between the fragments. since the modularity of the isolated node is zero (see (1)). The solution of (5) is:
(6) 
Nodes in the network typically have links, which get split equally between the fragments of the node:
(7) 
where is the weight of each fragment #ik
of the node #i
, is the weight of the link .
Example 1 (Overlap Decomposition)
The input network on the left-hand side of Fig. 1 has node with neighbor nodes being ccs of . These neighbor nodes form the respective clusters overlapping in . is decomposed into fragments to evaluate the overlap. Node has an internal weight equal to (which can be represented via an additional edge to itself) and three edges of weight each. The overlapping clusters are evaluated using (7) as equivalent and virtual non-overlapping clusters formed using the new fragments of the overlapping node:
Constraining Overlap Decomposition
Overlap decomposition (OD) does not affect the value of the optimization function for the node being decomposed, hence it does not affect the convergence of the optimization function during the clustering. However, OD increases the complexity of the clustering when the number of produced clusters exceeds the number of clustered nodes decomposed into multiple fragments. This complexity increase should be identified and prevented to avoid indefinite increases in terms of clustering time.
In what follows, we infer a formal condition that guarantees the non-increasing complexity of OD. We decompose a node of degree into fragments, . Each forming cluster that has an overlap in this node owns one fragment (see Fig. 1) and shares at most links to the non-ccs neighbors of the node. The number of links between the fragments resulting in the node split is . The aggregated number of resulting links should not exceed the degree of the node being decomposed to retain the same network complexity, therefore:
(8) 
The solution of (8) is , namely: .
If a node being decomposed has a degree or a node has more than ccs then, before falling back to the coarse cluster formation, we apply the following heuristic inspired by the DBSCAN algorithm [13]. We evaluate the intersection of nd.ccs with each (maxIntersectOrig procedure on line 15 of Algorithm 3) and group the node with its clustering candidate(s) yielding the maximal intersection if the latter contains at least half of the nd.ccs. In such a way, we try to prevent an early coarsening and obtain more fine-grained and accurate results.
IV-C Complexity Analysis
The computational complexity of DAOC on sparse networks is , where m is the number of links in the network. All links of each node () are processed for iterations. In the worst case, the number of iterations is equal to the number of nodes n (instead of ) and the number of mutual candidates is equal to the node degree d instead of 1. Thus, the theoretical worst-case complexity is and occurs only in a hypothetical dense symmetric network having equal MMG for all links (and, hence, requiring overlap decomposition) on each clustering iteration and in case the number of clusters is decreased at each iteration by one only. The memory complexity is .
V Experimental Evaluation
V-A Evaluation Environment
Our evaluation was performed using an open-source parallel isolation benchmarking framework, Clubmark
Features \ Algs  Daoc  Scp  Lvn  Fcl  Osl2  Gnx  Psc  Cgr  Cgri  Scd  Rnd 

Hierarchical  +  +  +  
Multiscale  +  +  +  +  +  
Deterministic  +  +  +  
Overlapping clusters  +  +  +  +  +  
Weighted links  +  +  +  +  +  +  
Parameterfree  +!  +  *  *  *  *  *  *  +  
Consensus/Ensemble  +  +  +  +  + 
Deterministic includes inputorder invariance;
+! the feature is available, still the ability to force custom parameters is provided;
* the feature is partially available, parameters tuning might be required for specific cases;
the feature is available in theory but is not supported by the original implementation of the algorithm.
We compare DAOC against almost a dozen state-of-the-art clustering algorithms listed in Table II (the original implementations of all algorithms except Louvain are included into Clubmark and are executed as precompiled or JIT-compiled applications or libraries) and described in the following papers: SCP [20], Lvn(Louvain
V-B Stability Evaluation
We evaluate stability in terms of both robustness and determinism for the consensus (ensemble) and deterministic clustering algorithms listed in Table II. Determinism (non-stochasticity and input-order independence) evaluation is performed on synthetic and real-world networks below, where we quantify the standard deviation of the clustering accuracy. To evaluate stability in terms of robustness, we quantify the deviation of the clustering accuracy in response to small perturbations of the input network. The clustering accuracy on each iteration is measured relative to the clustering yielded by the same algorithm at the previous perturbation iteration. For each clustering algorithm, the accuracy is evaluated only for the middle level (scale or hierarchical level), since it is crucial to take the same clustering scale to quantify structural changes in the forming clusters of evolving networks. Robust clustering algorithms are expected to have their accuracy gradually evolving (without any surges) relative to the previous perturbation iteration. In addition, the clustering algorithms sensitive enough to capture the structural changes are expected to have their accuracy monotonically decreasing since the relative network reduction (perturbation) is increased at each iteration: the same number of deleted links represents a larger fraction of the already reduced network on each following iteration.
We evaluate robustness and sensitivity (i.e., the ability to capture small structural changes) on synthetic networks with nodes forming overlapping clusters generated by the LFR framework [22]. We generate a synthetic network with ten thousand nodes having an average degree of 20 and a fixed mixing parameter. This network is shuffled (the links and nodes are reordered) 4 times to evaluate the input order dependence of the algorithms. Small perturbations of the input data are performed by gradually reducing the number of links in the network by 2% of the original network size (i.e., 10 × 1000 × 20 × 0.02 = 4000 links) starting from 1% and ending at 15%. The link removal is performed a) randomly to retain the original distributions of the network links and their weights but b) respecting the constraint that each node retains at least a single link. This constraint prevents the formation of disconnected regions. Our perturbation does not include any random modification of the link weights or the creation of new links since it would affect the original distributions of the network links and their weights, causing surges in the clustering results.
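The link removal step of this perturbation procedure can be sketched as follows. The snippet is an assumption-laden illustration, not the code used in the evaluation: the function name is hypothetical, links are treated as distinct unordered pairs without self-loops, weights are ignored, and a seeded generator is used only to make the sketch reproducible.

```python
import random


def remove_links(edges, count, rng):
    """Randomly remove up to `count` links so that every node retains at least one link.

    edges: list of distinct (u, v) pairs without self-loops.
    rng: a random.Random instance controlling the removal order.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    order = list(edges)
    rng.shuffle(order)  # uniform random order of removal attempts

    removed = set()
    for u, v in order:
        if len(removed) == count:
            break
        # Skip the link if its removal would leave an endpoint without links.
        if degree[u] > 1 and degree[v] > 1:
            degree[u] -= 1
            degree[v] -= 1
            removed.add((u, v))
    return [edge for edge in edges if edge not in removed]


# Complete graph on 6 nodes (15 links); remove 5 of them.
nodes = range(6)
complete = [(u, v) for u in nodes for v in nodes if u < v]
reduced = remove_links(complete, count=5, rng=random.Random(42))
print(len(complete), len(reduced))
```

Applying the function repeatedly to its own output, with the count fixed to a share of the original network size, reproduces the gradual reduction scheme described above. The per-node degree check is what enforces the constraint that each node keeps at least a single link.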
The evaluations of stability in terms of robustness (absence of surges in response to small perturbation of the input network) and sensitivity (ability to capture small structural changes) are shown in Fig. 2. Absolute accuracy values relative to the previous link reduction iteration are shown in Fig. 2(a). The results demonstrate that, as expected, all deterministic clustering algorithms except DAOC (i.e., pSCAN and SCP) result in surges and hence are not robust. We also obtain some unexpected results. First, pSCAN, which is nominally “exact” (i.e., non-stochastic and input-order independent), actually shows significant deviations in accuracy. Second, the clusterings produced by OSLOM2 using default parameters and by FCoLouv using a fixed number of consensus partitions are prone to surges. Hence, OSLOM2 and FCoLouv cannot be classified as robust algorithms according to the obtained results despite being consensus clustering algorithms. Fig. 2(b) illustrates the sensitivity of the algorithms, where the relative accuracy values compared to the previous perturbation iteration are shown. Sensitive algorithms have monotonically decreasing results for the subsequent link reduction, which corresponds to positive values on this plot. The stable algorithms (CGGC_RG, CGGCi_RG and DAOC) are highlighted with a bolder line width on the figure. These results demonstrate that, while being robust, CGGC_RG and CGGCi_RG are not always able to capture structural changes in the network, i.e., they are less sensitive than DAOC. Overall, the results show that only our algorithm, DAOC, is stable (it is deterministic, including input-order independence, and robust) and at the same time is able to capture even small structural changes in the input network.
V-C Effectiveness and Efficiency Evaluation
Our performance evaluation was performed both
a) on weighted undirected synthetic networks with overlapping ground-truth clusters produced by the LFR framework integrated into Clubmark [26], and
b) on large real-world networks having overlapping and nested ground-truth communities.
A number of accuracy measures exist for the evaluation of overlapping clusters. We are aware of only two families of accuracy measures applicable to large overlapping clusterings, i.e., having a near-linear computational complexity in the number of nodes: the F1 family [27] and generalized NMI (GNMI) [12, 27]. However, mutual information-based measures are biased towards large numbers of clusters, while GNMI does not have any bounded computational complexity in general. Therefore, we evaluate clustering accuracy with F1h [27], a modification of the popular average F1-score (F1a) [48, 40] providing indicative values in the range , since the artificial clusters formed from all combinations of the input nodes yield and .
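To make the difference between the two measures concrete, the sketch below computes both on small examples. It assumes the following definitions, which should be checked against [27] and [48]: for each cluster of one clustering, take the greatest F1 against any cluster of the other clustering and average these values; F1a is the arithmetic mean of the two directional averages, and F1h is their harmonic mean. The implementation is a brute-force quadratic illustration with hypothetical function names, not the near-linear evaluation used in the paper.

```python
from itertools import combinations


def f1(a, b):
    """F1-score between two clusters given as sets of node ids."""
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(b)
    recall = common / len(a)
    return 2 * precision * recall / (precision + recall)


def avg_best_f1(xs, ys):
    """Average over the clusters of xs of the greatest F1 against any cluster of ys."""
    return sum(max(f1(x, y) for y in ys) for x in xs) / len(xs)


def f1a(xs, ys):
    """Arithmetic mean of the two directional averages (assumed F1a definition)."""
    return (avg_best_f1(xs, ys) + avg_best_f1(ys, xs)) / 2


def f1h(xs, ys):
    """Harmonic mean of the two directional averages (assumed F1h definition)."""
    fxy, fyx = avg_best_f1(xs, ys), avg_best_f1(ys, xs)
    return 0.0 if fxy + fyx == 0 else 2 * fxy * fyx / (fxy + fyx)


# Two overlapping ground-truth clusters on five nodes.
truth = [{1, 2, 3}, {3, 4, 5}]
# Artificial clustering made of all non-empty combinations of the nodes.
nodes = [1, 2, 3, 4, 5]
powerset = [set(c) for r in range(1, len(nodes) + 1) for c in combinations(nodes, r)]
```

With the artificial clustering, every ground-truth cluster has an exact match, so one directional average equals 1 and the arithmetic mean cannot drop below 0.5, whereas the harmonic mean is pulled down by the poor matches in the other direction. This is a property of the assumed definitions, shown here only for a five-node toy example.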
First, we evaluate accuracy for all the deterministic algorithms listed in Table II on synthetic networks, and then evaluate both accuracy and efficiency for all clustering algorithms on realworld networks. Our algorithm, DAOC, shows the best accuracy among the deterministic clustering algorithms on synthetic networks, outperforming others on each network and being more accurate by 25% on average according to Fig. 3.
Moreover, DAOC also has the best accuracy on average among all evaluated algorithms on large real-world networks, as shown in Fig. 4(a). Being parameter-free, our algorithm yields good accuracy on both synthetic and real-world networks, unlike some other algorithms that perform well on some datasets but poorly on others.
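Table III: Memory consumption of the evaluated algorithms on the real-world networks.

| Nets\Algs | Daoc | Scp* | Lvn | Fcl | Osl2 | Gnx | Psc | Cgr | Cgri | Scd | Rnd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| amazon | 238 | 3,237 | 339 | 3,177 | 681 | 3,005 | 155 | 247 | 1,055 | 37 | 337 |
| dblp | 225 | 3,909 | 373 | 3,435 | 717 | 2,879 | 167 | 247 | 1,394 | 36 | 373 |
| youtube | 737 | 4,815 | 1,052 | – | – | 8,350 | 508 | 830 | 3,865 | 131 | 1,050 |
| livejournal | 5,038 | – | 10,939 | – | – | – | 4,496 | 4,899 | 11,037 | 761 | – |

– denotes that the algorithm was terminated for violating the execution constraints;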
* the memory consumption and execution time for SCP are reported for a single clique size, since they grow exponentially with the clique size on dense networks, though accuracy was evaluated for varying clique sizes.
Besides being accurate, DAOC consumes the least amount of memory among the evaluated hierarchical algorithms (Louvain, OSLOM2), as shown in Table III. In particular, DAOC consumes 2x less memory than Louvain on the largest evaluated real-world network (livejournal) and 3x less memory than OSLOM2 on dblp, while producing much more fine-grained hierarchies of clusters with almost an order of magnitude more levels than other algorithms. Moreover, among the evaluated overlapping clustering algorithms, only pSCAN and DAOC are able to cluster the livejournal network within the specified execution constraints; the missing bars in Fig. 4(b) correspond to the algorithms that we had to terminate.
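The memory ratios quoted above can be re-derived from Table III with a few lines of arithmetic. The sketch below copies the relevant table entries and assumes that the column abbreviations Lvn and Osl2 stand for Louvain and OSLOM2; the dictionary layout and the helper name `memory_ratio` are hypothetical.

```python
# Entries copied from Table III (same units as in the table).
memory = {
    "dblp": {"Daoc": 225, "Osl2": 717},
    "livejournal": {"Daoc": 5038, "Lvn": 10939},
}


def memory_ratio(network, other, reference="Daoc"):
    """How many times more memory `other` uses than `reference` on `network`."""
    row = memory[network]
    return row[other] / row[reference]


louvain_ratio = memory_ratio("livejournal", "Lvn")  # about 2.17
oslom2_ratio = memory_ratio("dblp", "Osl2")         # about 3.19
```

Both values are consistent with the rounded factors of 2x and 3x stated in the text.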
VI Conclusions
In this paper, we presented a new clustering algorithm, DAOC, which is stable and at the same time provides a unique combination of features, yielding a fine-grained hierarchy of overlapping clusters in a fully automatic manner. We experimentally compared our approach on a number of different datasets and showed that, while being parameter-free and efficient, it yields accurate and stable results on any input network. DAOC builds on a new (micro) consensus technique, MMG, and a novel overlap decomposition approach, OD, which are both applicable on top of non-overlapping clustering algorithms and make it possible to produce overlapping and robust clusters. DAOC is released as an open-source clustering library implemented in C++ that includes various cluster analysis features not mentioned in this paper and that is integrated with several data mining applications (e.g., StaTIX [28] or DAOR [29] embeddings). In future work, we plan to design an approximate version of MMG to obtain near-linear execution times on dense networks, and to parallelize DAOC to take advantage of modern hardware architectures and further expand the applicability of our method.
Footnotes
 https://github.com/eXascaleInfolab/daoc
 https://github.com/eXascaleInfolab/clubmark
 http://igraph.org/c/doc/igraphCommunity.html
 https://snap.stanford.edu/data/#communities
References
[1] (2002-04) Linked: the new science of networks.
[2] (2013) Robust detection of dynamic community structure in networks. Chaos 23 (1), pp. 013142.
[3] (2008-10) Fast unfolding of communities in large networks. J Stat Mech. 2008 (10), pp. P10008.
[4] (2014) Perception, Cognition, and Effectiveness of Visualizations with Applications in Science and Engineering. Ph.D. Thesis, Harvard University.
[5] (2015) Linear, deterministic, and order-invariant initialization methods for the k-means clustering algorithm. In Partitional Clustering Algorithms, pp. 79–98.
[6] (2013) Comparing different modularization criteria using relational metric. In GSI, pp. 180–187.
[7] (2016) PSCAN: fast and exact structural graph clustering. In ICDE, pp. 253–264.
[8] (2018) Neighbourhood contrast: a better means to detect clusters than density. In PAKDD.
[9] (2015) Fuzzy overlapping community quality metrics. SNAM 5 (1), pp. 40:1–40:14.
[10] (1997) Robust clustering methods: a unified view. IEEE Transactions on Fuzzy systems 5 (2).
[11] (2005-04) Clique percolation in random networks. Phys. Rev. Lett. 94, pp. 160202.
[12] (2012) Comparing network covers using mutual information. CoRR abs/1202.0425.
[13] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pp. 226–231.
[14] (2003) Robust data clustering. In CVPR, pp. 128–136.
[15] (2010) Performance of modularity maximization in practical contexts. Phys. Rev. E 81 (4), pp. 046106.
[16] (2008) A fast algorithm to find overlapping communities in networks. In ECML PKDD, pp. 408–423. ISBN 9783540874782.
[17] (2011) Fuzzy overlapping communities in networks. J Stat Mech. 2011 (02), pp. P02017.
[18] (2017) Clustering based on dominant set and cluster expansion. In PAKDD.
[19] (2008-04) Robustness of community structure in networks. Phys. Rev. E 77, pp. 046119.
[20] (2008) Sequential algorithm for fast clique percolation. Phys. Rev. E 78 (2).
[21] (2011) Finding Statistically Significant Communities in Networks. PLoS ONE 6. arXiv:1012.2363.
[22] (2009-07) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys. Rev. E 80.
[23] (2012) Consensus clustering in complex networks. Sci. Rep. 2.
[24] (2010) Modularity measure of networks with overlapping communities. EPL 90 (1), pp. 18001.
[25] (2010) Fuzzy modularity and fuzzy community structure in networks. Eur. Phys. J. B 77 (4), pp. 547–557.
[26] (2018) Clubmark: a parallel isolation framework for benchmarking and profiling clustering algorithms on NUMA architectures. In ICDMW, pp. 1481–1486.
[27] (2019) Accuracy evaluation of overlapping and multi-resolution clustering algorithms on large datasets. In BigComp, pp. 1–8.
[28] (2018) StaTIX — statistical type inference on linked data. In BigData.
[29] (2019) Bridging the gap between community and node representations: graph embedding via community detection. In IEEE BigData.
[30] (2016) Equivalence between modularity optimization and maximum likelihood methods for community detection. Phys. Rev. E 94.
[31] (2018) Consensus community detection in multilayer networks using parameter-free graph pruning. In PAKDD.
[32] (1956) The magical number seven, plus or minus two: some limits on our capacity for processing information. The Psychol. Rev. 63 (2).
[33] (2008-01) Fuzzy communities and the concept of bridgeness in complex networks. Phys. Rev. E 77, pp. 016107.
[34] (2004) Finding and evaluating community structure in networks. Phys. Rev. E 69 (2), pp. 026113.
[35] (2003) The structure and function of complex networks. SIAM Rev. 45 (2).
[36] (2013) Spectral methods for network community detection and graph partitioning. Phys. Rev. E 88 (4), pp. 042822.
[37] (2009-03) Extending the definition of modularity to directed graphs with overlapping communities. J Stat Mech. 3, pp. 24. arXiv:0801.1647.
[38] (2013) An ensemble learning strategy for graph clustering. Contemp. Math., Vol. 588, pp. 187–206.
[39] (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, pp. 814–818.
[40] (2014) High quality, scalable and parallel community detection for large real graphs. WWW ’14, pp. 225–236. ISBN 9781450327442.
[41] (2009) Quantifying and identifying the overlapping community structure in networks. J Stat Mech. 2009 (07), pp. P07042.
[42] (2007) In search of deterministic methods for initializing k-means and Gaussian mixture clustering. Intelligent Data Analysis 11 (4), pp. 319–338.
[43] (2019-04) Fast consensus clustering in complex networks. Phys. Rev. E 99, pp. 042301.
[44] (2011) Automatically finding clusters in normalized cuts. Pattern Recognition 44 (7).
[45] (2011) A survey of clustering ensemble algorithms. Int. J. Pattern Recogn. 25 (03), pp. 337–372.
[46] (2013-08) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput. Surv. 45 (4), pp. 1–35.
[47] (2011) SLPA: uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In ICDMW, pp. 344–349.
[48] Overlapping community detection at scale: a nonnegative matrix factorization approach. In WSDM ’13, pp. 587–596.
[49] (2015) Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. ISSN 0219-1377.