Unifying Sparsest Cut, Cluster Deletion, and Modularity Clustering Objectives with Correlation Clustering
We present and analyze a new framework for graph clustering based on a specially weighted version of correlation clustering that unifies several existing objectives and satisfies a number of attractive theoretical properties. Our framework, which we call LambdaCC, relies on a single resolution parameter $\lambda$, which implicitly controls both the edge density and sparsest cut of all output clusters. We prove that our new clustering objective interpolates between the cluster deletion problem and the minimum sparsest cut problem as we vary $\lambda$, and is also closely related to the well-studied maximum modularity objective. We provide several algorithms for optimizing our new objective, including a 5-approximation for the case where $\lambda > 1/2$, and also the first constant factor approximation algorithm for the NP-hard cluster deletion problem. We demonstrate the effectiveness of our framework and algorithms in finding communities in several real-world networks.
Identifying groups of related entities in a network is a ubiquitous task across scientific disciplines. This task is often called graph clustering, or community detection, and can be used to find similar proteins in a protein interaction network, group related organisms in a food web, identify communities in a social network, and classify web documents, among numerous other applications.
Defining the right notion of a “good” community in a graph is an important precursor to developing successful algorithms for graph clustering. In general, a good clustering is one in which nodes inside clusters are more densely connected to each other than to the rest of the graph. However, no consensus exists as to the best way to determine the quality of network clusterings, and recent results show there cannot be such a consensus for the multiple possible reasons people may cluster data. Common objective functions studied by theoretical computer scientists include normalized cut, sparsest cut, conductance, and edge expansion, all of which measure some version of the cut-to-size ratio for a single cluster in a graph. Other standards of clustering quality put a greater emphasis on the internal density of clusters, such as the cluster deletion objective, which seeks to partition a graph into completely connected sets of nodes (cliques) by removing the fewest edges possible.
Arguably the most widely used multi-cluster objective for community detection is modularity, introduced by Newman and Girvan. Modularity measures the difference between the true number of edges inside the clusters of a given partitioning (“inner edges”) and the expected number of inner edges, where the expectation is calculated with respect to a specific random graph model.
A limited number of results have begun to unify distinct clustering measures by introducing objective functions that are closely related to modularity and depend on a tunable clustering resolution parameter [14, 32]. Reichardt and Bornholdt developed an approach based on finding the minimum-energy state of an infinite range Potts spin glass. The resulting Hamiltonian function they study is viewed as a clustering objective with a resolution parameter $\gamma$, which can be used as a heuristic for detecting overlapping and hierarchical community structure in a network. When $\gamma = 1$, the authors prove an equivalence between minimizing the Hamiltonian and finding the maximum modularity partitioning of a network. Later, Delvenne et al. introduced a measure called the stability of a clustering, which generalizes modularity and also is related to the normalized cut objective and Fiedler’s spectral clustering method for certain values of an input parameter.
The inherent difficulty of obtaining clusterings that are provably close to the optimal solution puts these objective functions at a disadvantage. Although both the stability and the Hamiltonian-Potts objectives provide useful interpretations for community detection, there are no approximation guarantees for either: all current algorithms are heuristics. Furthermore, it is known that maximizing modularity itself is not only NP-hard, but is also NP-hard to approximate to within any constant factor.
In this paper, we introduce a new clustering framework based on a specially weighted version of correlation clustering. Our partitioning objective for signed networks lies “between” the family of complete instances and the most general correlation clustering instances. Our framework comes with several novel theoretical properties and leads to many connections between clustering objectives that were previously not seen to be related. In summary, we provide:
A novel framework LambdaCC for community detection that is related to modularity and the Hamiltonian, but is more amenable to approximation results.
A proof that our framework interpolates between the sparsest-cut objective and the cluster-deletion problem, as we increase a single resolution parameter, $\lambda$.
Several successful algorithms for optimizing our new objective function in both theory and practice, including the first constant-factor approximation for cluster deletion.
A demonstration of our methods in a number of web-based clustering applications, including social network analysis, mining cliques in collaboration networks, and detecting ground truth communities in an email network.
2 Background and Related Work
Let $G = (V, E)$ be an undirected and unweighted graph on $n = |V|$ nodes, with $m = |E|$ edges. For all $v \in V$, let $d_v$ be node $v$’s degree. Given $S \subseteq V$, let $\bar{S} = V \setminus S$ be the complement of $S$ and $\mathrm{vol}(S) = \sum_{v \in S} d_v$ be its volume. For every two disjoint sets of vertices $S, T \subseteq V$, $\mathrm{cut}(S, T)$ indicates the number of edges between $S$ and $T$. If $T = \bar{S}$, we write $\mathrm{cut}(S) = \mathrm{cut}(S, \bar{S})$. Let $E_S$ denote the interior edge set of $S$. The edge density of a cluster $S$ is $\rho(S) = |E_S| / \binom{|S|}{2}$, the ratio of the number of edges to the number of pairs of nodes in $S$. By convention, the density of a single node is 1. We now present background and related work that is foundational to our results, including definitions for several common clustering objectives.
2.1 Correlation Clustering
An instance of correlation clustering is given by a signed graph in which every pair of nodes $i$ and $j$ possesses two non-negative weights, $w_{ij}^+$ and $w_{ij}^-$, to indicate how similar and how dissimilar $i$ and $j$ are, respectively. Typically only one of these weights is nonzero for each pair $(i, j)$. The objective can be expressed as an integer linear program (ILP):
$$\begin{array}{lll} \text{minimize} & \sum_{i<j} w_{ij}^+ x_{ij} + w_{ij}^- (1 - x_{ij}) & \\ \text{subject to} & x_{ij} \leq x_{ik} + x_{jk} & \text{for all } i, j, k \\ & x_{ij} \in \{0, 1\} & \text{for all } i, j. \end{array} \tag{1}$$
In the above formulation, $x_{ij}$ represents “distance”: $x_{ij} = 0$ indicates that nodes $i$ and $j$ are clustered together, while $x_{ij} = 1$ indicates separating the two nodes. Including triangle-inequality constraints ensures the output of the above ILP defines a valid clustering of the nodes. This objective counts the total weight of disagreements between the signed weights in the graph and a given clustering of its nodes. The disagreement (or “mistake”) weight of a pair $(i, j)$ is $w_{ij}^-$ if the nodes are clustered together, but $w_{ij}^+$ if they are separated. We can equivalently define the agreement weight to be $w_{ij}^+$ if $i, j$ are clustered together, but $w_{ij}^-$ if they are separated. The optimal clusterings for maximizing agreements and minimizing disagreements are identical, but it is more challenging to approximate the latter objective.
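As a small illustration of how disagreements are counted, the following sketch evaluates the objective of ILP (1) for a fixed clustering. The function name, the dictionary-based weight encoding, and the toy triangle instance are assumptions made for this example only.

```python
def disagreement_weight(w_plus, w_minus, labels):
    """Total weight of disagreements of a clustering.

    w_plus, w_minus: dicts mapping a pair (i, j) to a non-negative weight.
    labels: dict mapping each node to its cluster id.
    """
    total = 0.0
    for pair in set(w_plus) | set(w_minus):
        i, j = pair
        x = 0 if labels[i] == labels[j] else 1  # x_ij: 0 = together, 1 = separated
        total += w_plus.get(pair, 0.0) * x + w_minus.get(pair, 0.0) * (1 - x)
    return total


# Toy instance: a triangle with two similar pairs and one dissimilar pair.
w_plus = {(0, 1): 1.0, (1, 2): 1.0}
w_minus = {(0, 2): 1.0}
together = {0: "a", 1: "a", 2: "a"}
split = {0: "a", 1: "a", 2: "b"}
singletons = {0: "a", 1: "b", 2: "c"}
```

On this instance no clustering is mistake-free: keeping all three nodes together violates the negative pair, while any split cuts at least one positive pair.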
Correlation clustering was introduced by Bansal et al., who proved the problem is NP-complete. They gave a polynomial-time approximation scheme for the maximization version and a constant-factor approximation for minimizing disagreements in $\pm 1$-weighted graphs. Subsequently, Charikar et al. gave a 4-approximation for minimizing disagreements and proved APX-hardness of this variant. They also described an $O(\log n)$ approximation for minimization in general weighted graphs, a result proved independently by two different groups, who showed that minimizing disagreements is equivalent to minimum multicut [15, 19].
The problem has also been studied for the case where edges carry both positive and negative weights, satisfying probability constraints: for all pairs $(i, j)$, $w_{ij}^+ + w_{ij}^- = 1$. Ailon et al. gave a $2.5$-approximation for this version of the problem based on an LP-relaxation, and additionally developed a very fast algorithm, called Pivot, that gives a constant-factor approximation in expectation. Currently the best-known approximation factor for correlation clustering on $\pm 1$ instances is slightly smaller than $2.06$, obtained by a careful rounding of the canonical LP relaxation.
2.2 Modularity and the Hamiltonian
One very popular measure of clustering quality is modularity, introduced in its most basic form by Newman and Girvan. We more closely follow the presentation of modularity given by Newman. The modularity of an underlying clustering is:
$$Q(x) = \frac{1}{2m} \sum_{i \neq j} (A_{ij} - P_{ij})(1 - x_{ij}), \tag{2}$$
where $A_{ij} = 1$ if nodes $i$ and $j$ are adjacent, and zero otherwise, and $x_{ij}$ is again the binary variable indicating “distance” between $i$ and $j$ in the corresponding clustering. The value $P_{ij}$ represents the probability of an edge existing between $i$ and $j$ in a specific random graph model. The intent of this measure is to reward clusterings in which the actual number of edges inside a cluster is greater than the expected number of edges in the cluster, as determined by the choice for $P_{ij}$. Although there are many options, it is standard in the literature to set $P_{ij} = \frac{d_i d_j}{2m}$, since this preserves both the degree distribution and the expected number of edges between the original graph and null model. Many generalizations have been introduced for modularity, including an extension to multislice networks, which allow one to study the evolution of communities in a network over time.
By slightly editing the modularity function, we obtain the Hamiltonian objective of Reichardt and Bornholdt:
$$\mathcal{H}(x) = -\sum_{i \neq j} (A_{ij} - \gamma P_{ij})(1 - x_{ij}). \tag{3}$$
The primary difference between this and modularity is the inclusion of a clustering resolution parameter $\gamma$. If we fix $\gamma = 1$, minimizing (3) is equivalent to maximizing modularity. When varied, this parameter controls how much a clustering is penalized for putting two non-adjacent nodes together or separating adjacent nodes.
The Hamiltonian objective is in turn closely related to the stability of a clustering as defined by Delvenne et al., another generalization of modularity. Roughly speaking, the stability of a partition measures the likelihood that a random walker, beginning at a node and following outgoing edges uniformly at random, will end up in the cluster it started in after a random walk of length $t$. This $t$ serves as a resolution parameter, since the walker will tend to “wander” farther when $t$ is increased, leading to the formation of larger clusters when the stability is maximized. Delvenne et al. showed that objective (3) is equivalent to a linearized version of the stability measure for a specific range of time steps.
2.3 Sparsest Cut and Normalized Cut
One measure of cluster quality in an unsigned network is the sparsest cut score, defined for a set $S \subseteq V$ to be $\phi(S) = \frac{\mathrm{cut}(S)}{|S|} + \frac{\mathrm{cut}(S)}{|\bar{S}|} = n \frac{\mathrm{cut}(S)}{|S||\bar{S}|}$. Smaller values for $\phi(S)$ are desirable, since they indicate that $S$, in spite of its size, is only loosely connected to the rest of the graph. This measure differs by at most a factor of two from the related edge expansion measure: $\frac{\mathrm{cut}(S)}{\min\{|S|, |\bar{S}|\}}$. If we replace $|S|$ with $\mathrm{vol}(S)$ in these two objectives, we obtain the normalized cut and the conductance measure respectively. In our work we focus on a multiplicative scaling of the sparsest cut objective that we call the scaled sparsest cut: $\psi(S) = \phi(S)/n = \frac{\mathrm{cut}(S)}{|S||\bar{S}|}$, which is identical to sparsest cut in terms of multiplicative approximations. The best known approximation for finding the minimum sparsest cut of a graph is an $O(\sqrt{\log n})$-approximation algorithm due to Arora et al.
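To make the scaled sparsest cut concrete, the sketch below evaluates cut(S) / (|S| |V \ S|) and finds its minimum by exhaustive search on a toy graph (two triangles joined by one edge). The function names and the example graph are illustrative assumptions, and brute force is only feasible for very small graphs.

```python
from itertools import combinations

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]


def scaled_sparsest_cut(edges, S, n):
    """cut(S) / (|S| * |complement of S|) for a nonempty proper subset S."""
    S = set(S)
    crossing = sum(1 for u, v in edges if (u in S) != (v in S))
    return crossing / (len(S) * (n - len(S)))


def min_scaled_sparsest_cut(edges, n):
    """Exhaustive search; S and its complement score the same, so |S| <= n/2 suffices."""
    best_score, best_set = float("inf"), None
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            score = scaled_sparsest_cut(edges, S, n)
            if score < best_score:
                best_score, best_set = score, set(S)
    return best_score, best_set
```

Here the best set is one of the triangles, with a single crossing edge and 3 × 3 node pairs across the cut.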
2.4 Cluster Deletion
Cluster deletion is the problem of finding a minimum number of edges in $G$ to be deleted in order to convert $G$ into a disjoint set of cliques. This can be viewed as a stricter version of correlation clustering, in which we want to minimize disagreements, but we are strictly prohibited from making mistakes at negative edges. This problem was first studied by Ben-Dor et al., later formalized in the work of Natanzon et al., who proved it is NP-hard, and Shamir et al., who showed it is APX-hard. The latter studied the problem in conjunction with other related edge-modification problems, including cluster completion and cluster editing.
Numerous fixed parameter tractability results are known for cluster deletion [7, 22, 23, 13], as well as many results regarding special graphs for which the problem can be solved in polynomial time [20, 10, 16, 9]. Dessmark et al. proved that by iteratively finding maximum cliques in the graph, one can achieve a 2-approximation for this objective, though in general this procedure is NP-hard.
3 Theoretical Results
Our novel clustering framework takes an unsigned graph $G = (V, E)$ and converts it into a signed graph $G' = (V, E^+, E^-)$ on the same set of nodes, $V$, for a fixed clustering resolution parameter $\lambda \in (0, 1)$. Partitioning $G'$ with respect to the correlation clustering objective will then induce a clustering on $G$. To construct the signed graph, we first introduce a node weight $w_v$ for each $v \in V$. If $(i, j) \in E$, we place a positive edge between nodes $i$ and $j$ in $G'$, with weight $(1 - \lambda w_i w_j)$. For $(i, j) \notin E$, we place a negative edge between $i$ and $j$ in $G'$, with weight $\lambda w_i w_j$. We consider two different choices for node weights $w_v$: setting $w_v = 1$ for all $v$ (standard) or choosing $w_v = d_v$ (degree-weighted). In Figure 1 we illustrate the process of converting $G$ into the LambdaCC signed graph, $G'$.
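The construction can be sketched in a few lines. The code below builds the positive and negative edge weights for both node-weight choices and evaluates the weight of disagreements of a clustering; the function names and the dictionary representation are assumptions made for illustration, not part of the framework itself.

```python
from itertools import combinations


def lambdacc_signed_graph(n, edges, lam, degree_weighted=False):
    """Return (pos, neg): dicts of positive / negative pair weights of the signed graph."""
    edge_set = {frozenset(e) for e in edges}
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    w = deg if degree_weighted else [1] * n  # node weights w_v
    pos, neg = {}, {}
    for i, j in combinations(range(n), 2):
        if frozenset((i, j)) in edge_set:
            pos[(i, j)] = 1 - lam * w[i] * w[j]  # adjacent pair: positive edge
        else:
            neg[(i, j)] = lam * w[i] * w[j]      # non-adjacent pair: negative edge
    return pos, neg


def lambdacc_objective(pos, neg, labels):
    """Weight of disagreements: cut positive edges plus negative edges kept together."""
    cost = sum(wt for (i, j), wt in pos.items() if labels[i] != labels[j])
    cost += sum(wt for (i, j), wt in neg.items() if labels[i] == labels[j])
    return cost
```

For the path 0-1-2 with $\lambda = 0.25$ and unit node weights, both edges get weight 0.75 and the single non-edge gets weight 0.25, so keeping all three nodes together costs 0.25.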
3.1 Connection to Modularity
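Despite a significant difference in approach and interpretation, the clustering that minimizes disagreements is the same clustering that minimizes the Hamiltonian objective (3), for a certain choice of parameters. To see this, express the weight of disagreements, $\mathrm{LamCC}(x)$, for a clustering $x$ of the signed graph in terms of edges and non-edges of $G$:
$$\mathrm{LamCC}(x) = \sum_{(i,j) \in E} (1 - \lambda w_i w_j)\, x_{ij} + \sum_{(i,j) \notin E} \lambda w_i w_j (1 - x_{ij}).$$
By introducing node-adjacency indicators $A_{ij}$, this becomes:
$$\mathrm{LamCC}(x) = \sum_{i<j} (A_{ij} - \lambda w_i w_j A_{ij}) - \sum_{i<j} (A_{ij} - \lambda w_i w_j)(1 - x_{ij}).$$
Choosing $w_v = d_v$ and $\lambda = \gamma/(2m)$, and writing the Hamiltonian $\mathcal{H}(x)$ as a sum over ordered pairs $i \neq j$, we see that:
$$\mathrm{LamCC}(x) = \sum_{i<j} \left(A_{ij} - \frac{\gamma}{2m} d_i d_j A_{ij}\right) + \frac{\mathcal{H}(x)}{2}, \tag{4}$$
where the first term is just a constant. This theorem follows:
Theorem 1. Minimizing disagreements for the LambdaCC objective is equivalent to minimizing $\mathcal{H}(x)$.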
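The equivalence can be checked numerically. The sketch below assumes the Hamiltonian is written as a sum over ordered pairs with the null model $d_i d_j / (2m)$, and that degree-weighted LambdaCC uses $\lambda = \gamma/(2m)$; under those assumptions the difference between the LambdaCC disagreement weight and half the Hamiltonian should be the same for every clustering. All function names and the toy graph are hypothetical.

```python
from itertools import combinations

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]


def degrees(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def hamiltonian(n, edges, labels, gamma):
    """-sum over ordered pairs i != j of (A_ij - gamma d_i d_j / 2m)(1 - x_ij)."""
    m, deg = len(edges), degrees(n, edges)
    adj = {frozenset(e) for e in edges}
    total = 0.0
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:  # 1 - x_ij is 1 only for co-clustered pairs
            a_ij = 1 if frozenset((i, j)) in adj else 0
            total += 2 * (a_ij - gamma * deg[i] * deg[j] / (2 * m))  # (i,j) and (j,i)
    return -total


def degree_weighted_lambdacc(n, edges, labels, lam):
    """Weight of disagreements with node weights w_v = d_v."""
    deg = degrees(n, edges)
    adj = {frozenset(e) for e in edges}
    cost = 0.0
    for i, j in combinations(range(n), 2):
        w = lam * deg[i] * deg[j]
        is_edge = frozenset((i, j)) in adj
        same = labels[i] == labels[j]
        if is_edge and not same:
            cost += 1 - w  # separated positive edge
        elif not is_edge and same:
            cost += w      # negative edge inside a cluster
    return cost


def offsets(n, edges, gamma, clusterings):
    """LamCC(x) - H(x)/2 for each clustering; expected to be one shared constant."""
    lam = gamma / (2 * len(edges))
    return [degree_weighted_lambdacc(n, edges, c, lam) - hamiltonian(n, edges, c, gamma) / 2
            for c in clusterings]
```

On the toy graph with $\gamma = 1$, the shared constant is $m - \lambda \sum_{(i,j) \in E} d_i d_j = 7 - 41/14$.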
The choice $w_v = d_v$ is reminiscent of the graph null model most commonly used for modularity and the Hamiltonian. This best highlights the similarity between these objectives and degree-weighted LambdaCC. On the other hand, standard LambdaCC (setting $w_v = 1$ for every $v \in V$) leads to strong connections between the sparsest cut objective and cluster deletion. This version corresponds to solving a correlation clustering problem where all positive edges have equal weight, $(1 - \lambda)$, while all negative edges have equal weight, $\lambda$. Writing $E^+$ and $E^-$ for the positive and negative edges of the signed graph, the objective function for minimizing disagreements is
$$\min_x \; \sum_{(i,j) \in E^+} (1 - \lambda)\, x_{ij} + \sum_{(i,j) \in E^-} \lambda (1 - x_{ij}), \tag{5}$$
where we include the same constraints as in ILP (1). This is a strict generalization of the unit-weight correlation clustering problem ($\lambda = 1/2$), indicating the problem in general is NP-hard (though it admits several approximation algorithms). If $\lambda$ is $0$ or $1$, the problem is trivial to solve: put all nodes in one cluster or put each node in a singleton cluster, respectively. By selecting values for $\lambda \in (0, 1)$, we uncover more subtle connections between identifying sparse cuts and finding dense subgraphs in the same network.
3.2 Connection to Sparsest Cut
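Given $G$ and $\lambda$, the weight of positive-edge mistakes in the LambdaCC objective made by a two-clustering $\{S, \bar{S}\}$ equals the weight of edges crossing the cut: $(1 - \lambda)\,\mathrm{cut}(S)$. To compute the weight of negative-edge mistakes, we take the weight of all negative edges in the entire network, $\lambda\left(\binom{n}{2} - |E|\right)$, and then subtract the weight of negative edges between $S$ and $\bar{S}$: $\lambda\left(|S||\bar{S}| - \mathrm{cut}(S)\right)$. Adding together all terms we find that the LambdaCC objective for this clustering is
$$\mathrm{cut}(S) - \lambda |S||\bar{S}| + \lambda \binom{n}{2} - \lambda |E|. \tag{6}$$
Note that if we minimize (6) over all 2-clusterings, we solve the decision version of the minimum scaled sparsest cut problem: a few steps of algebra confirm that there is some set $S \subseteq V$ with $\frac{\mathrm{cut}(S)}{|S||\bar{S}|} < \lambda$ if and only if (6) is less than $\lambda\left(\binom{n}{2} - |E|\right)$.
In a similar way we can show that objective (5) is equivalent to
$$\min \; \frac{1}{2} \sum_{i=1}^{k} \mathrm{cut}(S_i) - \frac{\lambda}{2} \sum_{i=1}^{k} |S_i||\bar{S}_i| + \lambda\left(\binom{n}{2} - |E|\right), \tag{7}$$
where we minimize over all clusterings of $G$ with arbitrarily many clusters $k$. In this case, optimally solving objective (7) will tell us whether we can find a clustering $\{S_1, S_2, \ldots, S_k\}$ such that
$$\frac{\sum_{i=1}^{k} \mathrm{cut}(S_i)}{\sum_{i=1}^{k} |S_i||\bar{S}_i|} < \lambda.$$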
Hence LambdaCC can be viewed as a multi-cluster generalization of the decision version of minimum sparsest cut. We now prove an even deeper connection between sparsest cut and LambdaCC. Using degree-weighted LambdaCC yields an analogous result for normalized cut.
Theorem 2. Let $\lambda^*$ be the minimum scaled sparsest cut for a graph $G$.
(1) For all $\lambda > \lambda^*$, optimal solution (7) partitions $G$ into two or more clusters, each of which has scaled sparsest cut $\leq \lambda$. There exists some $\lambda' > \lambda^*$ such that the optimal clustering for LambdaCC is the minimum sparsest cut partition.
(2) For $\lambda \leq \lambda^*$, it is optimal to place all nodes into a single cluster.
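Proof. Statement (1) Let $S^*$ be some optimal sparsest cut-inducing set in $G$, i.e., $\frac{\mathrm{cut}(S^*)}{|S^*||\bar{S}^*|} = \lambda^*$. The LambdaCC objective corresponding to $\{S^*, \bar{S}^*\}$ is
$$\mathrm{cut}(S^*) - \lambda |S^*||\bar{S}^*| + \lambda\left(\binom{n}{2} - |E|\right). \tag{8}$$
When minimizing objective (7), we can always obtain a score of $\lambda\left(\binom{n}{2} - |E|\right)$ by placing all nodes into a single cluster. Note however that the score of clustering $\{S^*, \bar{S}^*\}$ in expression (8) is strictly less than $\lambda\left(\binom{n}{2} - |E|\right)$ for all $\lambda > \lambda^*$. Even if $\{S^*, \bar{S}^*\}$ is not optimal, this means that when $\lambda > \lambda^*$, we can do strictly better than placing all nodes into one cluster. In this case let $\mathcal{C}^*$ be the optimal LambdaCC clustering and consider two of its clusters: $S_i$ and $S_j$. The weight of disagreements between $S_i$ and $S_j$ is equal to the number of positive edges between them times the weight of a positive edge: $(1 - \lambda)\,\mathrm{cut}(S_i, S_j)$. Should we form a new clustering by merging $S_i$ and $S_j$, these positive disagreements will disappear; in turn, we would introduce new mistakes of total weight $\lambda |S_i||S_j| - \lambda\,\mathrm{cut}(S_i, S_j)$, being negative edges between the clusters. Because we assumed $\mathcal{C}^*$ is optimal, we know that we cannot decrease the objective by merging two of the clusters, implying that
$$(1 - \lambda)\,\mathrm{cut}(S_i, S_j) - \left(\lambda |S_i||S_j| - \lambda\,\mathrm{cut}(S_i, S_j)\right) = \mathrm{cut}(S_i, S_j) - \lambda |S_i||S_j| \leq 0.$$
Given this, we fix an arbitrary cluster $S_i$ and perform a sum over all other clusters to see that
$$\sum_{j \neq i} \mathrm{cut}(S_i, S_j) - \lambda \sum_{j \neq i} |S_i||S_j| \leq 0 \implies \mathrm{cut}(S_i) - \lambda |S_i||\bar{S}_i| \leq 0 \implies \frac{\mathrm{cut}(S_i)}{|S_i||\bar{S}_i|} \leq \lambda,$$
proving the desired upper bound on scaled sparsest cut.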
Since $G$ is a finite graph, there are a finite number of scaled sparsest cut scores that can be induced by a subset of $V$. Let $\tilde{\lambda}$ be the second-smallest scaled sparsest cut score achieved, so $\tilde{\lambda} > \lambda^*$. If we choose $\lambda'$ with $\lambda^* < \lambda' < \tilde{\lambda}$, then the optimal LambdaCC clustering produces at least two clusters, since $\lambda' > \lambda^*$, and each cluster has scaled sparsest cut at most $\lambda' < \tilde{\lambda}$. By our selection of $\lambda'$, all clusters returned must have scaled sparsest cut exactly equal to $\lambda^*$, which is only possible if the clustering returned has two clusters. Hence this clustering is a minimum sparsest cut partition of the network.
Statement (2) If $\lambda < \lambda^*$, forming a single cluster must be optimal, otherwise we could invoke Statement (1) to assert the existence of some nontrivial cluster with scaled sparsest cut less than or equal to $\lambda < \lambda^*$, contradicting the minimality of $\lambda^*$. If $\lambda = \lambda^*$, forming a single cluster or using the clustering $\{S^*, \bar{S}^*\}$ yield the same objective score, which is again optimal for the same reason.
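The threshold behaviour around the minimum scaled sparsest cut can be observed by brute force on a tiny example. The sketch below enumerates every partition of a 6-node toy graph (two triangles joined by one edge, whose minimum scaled sparsest cut is 1/9) and minimizes the standard LambdaCC objective exactly. The helper names are hypothetical and the approach is exponential, so it is only meant as a sanity check.

```python
from itertools import combinations

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]


def partitions(n):
    """Yield every partition of range(n) as a list of cluster labels."""
    def extend(labels, k):  # k = number of clusters used so far
        if len(labels) == n:
            yield list(labels)
            return
        for c in range(k + 1):  # join an existing cluster or open a new one
            labels.append(c)
            yield from extend(labels, max(k, c + 1))
            labels.pop()
    yield from extend([], 0)


def standard_lambdacc(n, edges, labels, lam):
    """Disagreement weight with positive weight 1 - lam and negative weight lam."""
    adj = {frozenset(e) for e in edges}
    cost = 0.0
    for i, j in combinations(range(n), 2):
        same = labels[i] == labels[j]
        if frozenset((i, j)) in adj:
            cost += 0.0 if same else 1 - lam
        else:
            cost += lam if same else 0.0
    return cost


def optimal_clustering(n, edges, lam):
    """Exact minimizer by exhaustive enumeration (tiny graphs only)."""
    return min(partitions(n), key=lambda p: standard_lambdacc(n, edges, p, lam))
```

With $\lambda = 0.10 < 1/9$ the single cluster is optimal, while $\lambda = 0.15 > 1/9$ already yields the two triangles, which is the minimum sparsest cut partition of this graph.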
3.3 Connection to Cluster Deletion
For large $\lambda$ our problem becomes more similar to cluster deletion. We can reduce any cluster deletion problem to correlation clustering by taking the input graph $G$ and introducing a negative edge of weight “$\infty$” between every pair of non-adjacent nodes. This guarantees that optimally solving correlation clustering will yield clusters that all correspond to cliques in $G$. Furthermore, the weight of disagreements will be the number of edges in $G$ that are cut, i.e., the cluster deletion score. We can obtain a generalization of cluster deletion by instead choosing the weight of each negative edge to be $\alpha < \infty$. The corresponding objective is
$$\min_x \; \sum_{(i,j) \in E^+} x_{ij} + \sum_{(i,j) \in E^-} \alpha (1 - x_{ij}). \tag{9}$$
If we substitute $\alpha = \lambda/(1 - \lambda)$ we see this differs from objective (5) only by a multiplicative constant, and is therefore equivalent in terms of approximation. When $\alpha > 1$, putting dissimilar nodes together will be more expensive than cutting positive edges, so we would expect that the clustering which optimizes the LambdaCC objective will separate $G$ into dense clusters that are “nearly” cliques. We formalize this with a simple theorem.
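Theorem 3. If $\mathcal{C}$ minimizes the LambdaCC objective for the unsigned network $G$, then the edge density of every cluster in $\mathcal{C}$ is at least $\lambda$.
Proof. Take a cluster $S \in \mathcal{C}$, with interior edge set $E_S$, and consider what would happen if we broke apart $S$ so that each of its nodes were instead placed into its own singleton cluster. This means we are now making mistakes at every positive edge previously in $S$, which increases the weight of disagreements by $(1 - \lambda)|E_S|$. On the other hand, there are no longer negative mistakes between nodes in $S$, so the LambdaCC objective would simultaneously decrease by $\lambda\left(\binom{|S|}{2} - |E_S|\right)$. The total change in the objective made by pulverizing $S$ is
$$(1 - \lambda)|E_S| - \lambda\left(\binom{|S|}{2} - |E_S|\right) = |E_S| - \lambda \binom{|S|}{2},$$
which must be nonnegative, since $\mathcal{C}$ is optimal, so $|E_S| / \binom{|S|}{2} \geq \lambda$.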
Corollary. Let $G$ have $m$ edges. For every $\lambda > m/(m+1)$, optimizing LambdaCC is equivalent to optimizing cluster deletion.
Proof. All output clusters must have density at least $\lambda > m/(m+1)$, which is only possible if the density is actually one, since $m$ is the total number of edges in the graph. Therefore all clusters are cliques and the LambdaCC and cluster deletion objectives differ only by a multiplicative constant $(1 - \lambda)$.
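Both the density bound and the cluster deletion regime can be sanity-checked by exhaustive search on a small graph. The sketch below uses a hypothetical 5-node example (two triangles sharing an edge, plus a pendant node, so $m = 6$) and an exact brute-force solver; names and graph are assumptions for illustration, and the threshold $m/(m+1) = 6/7$ is taken from the argument above.

```python
from itertools import combinations

# Hypothetical example: triangles {0,1,2} and {1,2,3} share edge (1,2); node 4 hangs off 3.
KITE = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]


def partitions(n):
    """Yield every partition of range(n) as a list of cluster labels."""
    def extend(labels, k):
        if len(labels) == n:
            yield list(labels)
            return
        for c in range(k + 1):
            labels.append(c)
            yield from extend(labels, max(k, c + 1))
            labels.pop()
    yield from extend([], 0)


def standard_lambdacc(n, edges, labels, lam):
    adj = {frozenset(e) for e in edges}
    cost = 0.0
    for i, j in combinations(range(n), 2):
        same = labels[i] == labels[j]
        if frozenset((i, j)) in adj:
            cost += 0.0 if same else 1 - lam
        else:
            cost += lam if same else 0.0
    return cost


def optimal_clustering(n, edges, lam):
    return min(partitions(n), key=lambda p: standard_lambdacc(n, edges, p, lam))


def cluster_densities(n, edges, labels):
    """Edge density of each cluster; singletons have density 1 by convention."""
    out = []
    for c in set(labels):
        members = [v for v in range(n) if labels[v] == c]
        if len(members) == 1:
            out.append(1.0)
            continue
        inner = sum(1 for u, v in edges if labels[u] == c and labels[v] == c)
        out.append(inner / (len(members) * (len(members) - 1) / 2))
    return out
```

For $\lambda = 0.9 > 6/7$ the optimum is the clique partition $\{0,1,2\}, \{3,4\}$, which deletes two edges and is also the cluster deletion optimum for this graph.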
3.4 Equivalences and Approximations
We summarize the equivalence relationships between LambdaCC and other objectives in Figure 2. Accompanying this, Table 1 outlines the best-known approximation results both for maximizing agreements and minimizing disagreements for the standard LambdaCC signed graph. For degree-weighted LambdaCC, the best-known approximation factors for all $\lambda$ are $O(\log n)$ [15, 19, 11] for minimizing disagreements, and a constant factor for maximizing agreements. Thus, LambdaCC is more amenable to approximation than modularity (and relatives) because of additive constants.
We present several new algorithms, tailored specifically to our LambdaCC framework; some come with approximation guarantees, some are designed for efficiency.
4.1 LP Relaxation Algorithm for $\lambda > 1/2$
Our first algorithm relies on solving the LP-relaxation of ILPs (1,5), replacing the binary constraint $x_{ij} \in \{0, 1\}$ with the linear constraint $x_{ij} \in [0, 1]$. Our rounding scheme and proof technique are similar to those developed by Charikar et al. for correlation clustering. Our procedure, outlined in Algorithm 1, is called fiveLP, based on the following approximation result.
Theorem 4. Algorithm fiveLP gives a factor-5 approximation for LambdaCC for all $\lambda > 1/2$.
Proof. We prove the 5-approximation holds for objective (9) when $\alpha > 1$, since this objective is equivalent to LambdaCC in terms of approximations. In other words, consider a correlation clustering instance where each positive edge has weight $1$ and each negative edge has weight $\alpha > 1$. Solving this LP gives a lower bound on the optimal LambdaCC score. We show that both for singleton and non-singleton clusters formed by fiveLP, the number of mistakes made at each cluster is within a factor five of the LP cost corresponding to that cluster. Recall that each cluster is formed around some node , and .
If is a singleton, we know that . In this case we make at most mistakes, which would happen if all edges between and are positive. Given that for every , we know that . Let and . The LP cost associated with this cluster is
so we account for the errors within a factor .
Negative-edge mistakes in non-singleton clusters.
Consider a negative edge inside a cluster of the form . If that edge is , the LP cost is . For every other negative edge, , where , the LP cost is . Either way, the LP has paid at least for each negative-edge mistake.
Positive-edge mistakes in non-singleton clusters.
Positive edges from to satisfy , so this type of edge pays for itself easily. The other edges we need to account for are all edges where and . We will charge all edges of this form to the node that lies outside .
First, if , then and the positive edge pays for itself within factor .
Now, fix some where . Let and , and let be the number of positive edges, while is the number of negative edges. The number of positive mistakes we are charging to is exactly , and we have
|LP cost at|
where the last inequality follows from the fact that the average distance from to is less than . Thus the LP cost is bounded by a linear function, , where . Hence the coefficient of , while the coefficient of , so is within a factor of the LP cost.
4.2 LP Relaxation Algorithm for Cluster Deletion.
We slightly alter fiveLP in the following ways for all :
For all $(i, j) \in E^-$, force constraints $x_{ij} = 1$ in the LP.
When rounding, select arbitrary and set (rather than using ).
Make a singleton if the average distance from to is , otherwise cluster with .
We call this algorithm fourCD, and show that it is the first constant-factor approximation algorithm for cluster deletion.
Theorem 5. Algorithm fourCD returns a 4-approximation to Cluster Deletion.
Proof. First, fourCD forms only cliques. Should the cluster formed around $u$ not be a singleton, for every $v, w \in T$, $x_{vw} \leq x_{uv} + x_{uw}$ with $x_{uv}, x_{uw} < 1/2$, so $x_{vw} < 1$. Since this distance is strictly less than $1$, and since we forced $x_{ij} = 1$ for non-adjacent nodes, the nodes must be adjacent. We therefore only need to account for positive-edge mistakes. The remainder of the proof follows directly from the same steps used to prove Theorem 4, as well as the original proof of Charikar et al. for correlation clustering. We include full details here for completeness:
If we cluster $u$ as a singleton, then the number of mistakes we make between $u$ and $T$ is exactly $|T|$, as these are all positive neighbors of $u$. Since $u$ was made a singleton cluster, we know that $\sum_{v \in T} x_{uv} \geq |T|/4$, so these positive mistakes are paid for within factor four. Finally, note that every positive edge $(u, v)$ for $v \notin T$ has LP cost at least $1/2$, so all mistakes are paid for within factor two.
No negative mistakes are made, so we only need to account for positive mistakes. As mentioned in the previous case, edges for pay for themselves within factor two. For where , if and , we know , so the edge pays for itself within factor four.
Now consider a single node such that , and then consider all with . Again, use the notation and with and . Now we bound the weight of positive mistakes as a function of the LP cost associated with . Thanks to the constraint for all , , therefore, relying also on the reasoning for fiveLP:
For , the coefficient of , while the coefficient of . Therefore the number of mistakes, , is paid for within factor four.
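The rounding step of a fourCD-style procedure can be sketched as follows. This is not the authors' Algorithm; it assumes the LP has already been solved, that the ball around each pivot contains nodes at distance strictly less than 1/2, and that a pivot becomes a singleton when the average distance to its ball is at least 1/4. Both thresholds are assumptions read off from the factor-two and factor-four accounting in the proof, and the function name and input format are hypothetical.

```python
def round_cluster_deletion(nodes, x, radius=0.5, singleton_avg=0.25):
    """Round LP distances into clusters.

    nodes: iterable of node ids.
    x: dict mapping an unordered pair, stored once as (u, v), to its LP distance.
    radius, singleton_avg: assumed thresholds (see the lead-in above).
    """
    def dist(u, v):
        return x[(u, v)] if (u, v) in x else x[(v, u)]

    remaining = list(nodes)
    clusters = []
    while remaining:
        u = remaining[0]  # "arbitrary" pivot: here simply the first remaining node
        T = [v for v in remaining[1:] if dist(u, v) < radius]
        if T and sum(dist(u, v) for v in T) / len(T) >= singleton_avg:
            cluster = [u]       # ball is too spread out: keep the pivot alone
        else:
            cluster = [u] + T   # otherwise cluster the pivot with its ball
        clusters.append(cluster)
        remaining = [v for v in remaining if v not in cluster]
    return clusters
```

When the LP solution is already integral and encodes a clique partition, this rounding simply returns that partition; a fractional solution whose ball is too spread out leads to singleton clusters instead.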
4.3 Strategies for Solving the LP Relaxation
The LP-relaxation for correlation clustering involves $O(n^3)$ triangle-inequality constraints. We can nonetheless compute this relaxation for graphs with up to a few thousand nodes, via the following two strategies. The first is to solve the LP on a subset of the constraints, then iteratively update the constraint set and re-solve the LP as needed, until convergence. The second approach employs the triangle-fixing procedure of Dhillon et al. for the related metric nearness problem.
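The first strategy needs a separation step that, given the current LP solution, reports which triangle inequalities are violated so they can be added before re-solving. A minimal sketch of that step follows; the LP solve itself is omitted, and the function name and pair-keyed input format are assumptions for illustration.

```python
from itertools import combinations


def violated_triangles(n, x, tol=1e-9):
    """List triples (a, b, c) whose constraint x_ab <= x_ac + x_bc is violated.

    x: dict mapping each pair (i, j) with i < j to its current LP value.
    """
    def d(a, b):
        return x[(a, b)] if a < b else x[(b, a)]

    violated = []
    for i, j, k in combinations(range(n), 3):
        # Each triple yields three inequalities, one per choice of the "long" side.
        for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
            if d(a, b) > d(a, c) + d(b, c) + tol:
                violated.append((a, b, c))
    return violated
```

In a constraint-generation loop one would add the returned inequalities to the LP, re-solve, and stop once the list comes back empty.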
4.4 Scalable Heuristic Algorithms
To complement the previous approximation-driven approaches, we provide fast greedy local heuristics for the LambdaCC objective. The first of these is GrowCluster, which iteratively selects an unclustered node uniformly at random and forms a cluster around it by greedily aggregating adjacent nodes, until there is no more improvement to the LambdaCC objective.
A variant of this, called GrowClique, is specifically designed for cluster deletion. It monotonically improves the LambdaCC objective, but differs in that at each iteration it randomly selects unclustered nodes, and greedily grows cliques around each of these seeds. The resulting cliques may overlap: at each iteration we select only the largest of such cliques. Pseudocode for GrowCluster and GrowClique is given in Algorithms 2 and 3, respectively.
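A GrowCluster-style heuristic for standard LambdaCC can be sketched as below. This is an illustrative reading of the description above, not the authors' Algorithm 2: it assumes that adding a node $v$ to the current cluster $S$ changes the objective by $\lambda|S| - \mathrm{cut}(v, S)$, so growth continues while the best candidate has $\mathrm{cut}(v, S) - \lambda|S| > 0$. The function name, the tie-breaking rule, and the seeding are assumptions.

```python
import random

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]


def grow_cluster(n, edges, lam, seed=0):
    """Greedy sketch: grow one cluster at a time around a random unclustered node."""
    rng = random.Random(seed)
    nbrs = {v: set() for v in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    unclustered = set(range(n))
    clusters = []
    while unclustered:
        S = {rng.choice(sorted(unclustered))}
        while True:
            # Candidates: unclustered nodes adjacent to the current cluster.
            frontier = ({v for u in S for v in nbrs[u]} & unclustered) - S
            best = max(sorted(frontier), key=lambda v: len(nbrs[v] & S), default=None)
            # Assumed gain of adding v: cut(v, S) - lam * |S|; stop when not positive.
            if best is None or len(nbrs[best] & S) - lam * len(S) <= 0:
                break
            S.add(best)
        clusters.append(S)
        unclustered -= S
    return clusters
```

The output depends on the random seeds: on the toy graph with $\lambda = 0.4$, some runs return the two triangles and others return $\{0,1,2,3\}$ and $\{4,5\}$.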
Finally, since the LambdaCC and Hamiltonian objectives are equivalent, we can use previously developed algorithms and software for modularity-like objectives with a resolution parameter. In particular we employ adaptations of the Louvain method, an algorithm developed by Blondel et al. It iteratively visits each node in the graph and moves it to an adjacent cluster, if such a move gives a locally maximum increase in the modularity score. This continues until no move increases modularity, at which point the clusters are aggregated into super-nodes and the entire process is repeated on the aggregated network. By adapting the original Louvain method to make greedy local moves based on the LambdaCC objective, rather than modularity, we obtain a scalable algorithm that is known to provide good approximations for a related objective, and additionally adapts well to changes in our parameter $\lambda$. We refer to this as Lambda-Louvain. Both standard and degree-weighted versions of the algorithm can be achieved by employing existing generalized Louvain algorithms (e.g., the GenLouvain algorithm of Jeub et al. http://netwiki.amath.unc.edu/GenLouvain/).
Our heuristic algorithms satisfy the following guarantee:
Theorem 6. For every $\lambda$, standard (respectively, degree-weighted) Lambda-Louvain either places all nodes in one cluster, or produces clusters that have scaled sparsest cut (respectively, scaled normalized cut) bounded above by $\lambda$. The same holds true for GrowCluster.
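Proof. Note that by design, when Lambda-Louvain terminates there will be no two clusters which can be merged to yield a better objective score. Just as in the proof of statement (1) for Theorem 2, for the standard LambdaCC objective this means that for any pair of clusters $S_i$ and $S_j$ we have
$$\mathrm{cut}(S_i, S_j) - \lambda |S_i||S_j| \leq 0. \tag{10}$$
We then fix $S_i$, perform a sum over all other clusters, and get the desired result:
$$\mathrm{cut}(S_i) - \lambda |S_i||\bar{S}_i| \leq 0 \implies \frac{\mathrm{cut}(S_i)}{|S_i||\bar{S}_i|} \leq \lambda.$$
If we are using degree-weighted Lambda-Louvain, when the algorithm terminates we know that all pairs of clusters satisfy
$$\mathrm{cut}(S_i, S_j) - \lambda\,\mathrm{vol}(S_i)\,\mathrm{vol}(S_j) \leq 0,$$
and the corresponding result for scaled normalized cut holds.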
Though slightly less obvious, it is also true that none of the output clusters of GrowCluster (if there are at least two) could be merged to yield a better objective score. Notice that this is certainly true of the first cluster $S_1$ formed by GrowCluster: we stop growing when we find that (for standard-weighted LambdaCC)
$$\mathrm{cut}(v, S_1) - \lambda |S_1| \leq 0$$
for all other nodes $v$ in the graph. Therefore, given any other subset of nodes $T \subseteq V \setminus S_1$ (including sets of nodes making up other clusters that the algorithm will output), we see
$$\mathrm{cut}(S_1, T) - \lambda |S_1||T| = \sum_{v \in T} \left(\mathrm{cut}(v, S_1) - \lambda |S_1|\right) \leq 0.$$
Therefore when we form the second cluster $S_2$ with GrowCluster, we already know that $\mathrm{cut}(S_1, S_2) - \lambda |S_1||S_2| \leq 0$, and similar reasoning shows that $\mathrm{cut}(S_i, S_j) - \lambda |S_i||S_j| \leq 0$ will hold for any cluster $S_j$ with $j > i$ that will be subsequently formed. In this way we see that inequality (10) will also hold between all pairs of clusters output by GrowCluster, so the rest of the result follows. The same steps will also work for degree-weighted LambdaCC.
We begin by comparing our new methods against existing correlation clustering algorithms on several small networks. This shows our algorithms for LambdaCC are superior to common alternatives. We then study how well-known graph partitioning algorithms implicitly optimize the LambdaCC objective for various $\lambda$. In the following experiments, we explore applications of our methods in ground truth community detection, clique detection in collaboration and gene networks, and social network analysis.
5.1 LambdaCC on Small Networks
Our first experiment shows that Lambda-Louvain is the best general-purpose correlation clustering method for minimizing the LambdaCC objective. We test this on four small networks: Karate, Les Mis, Polbooks, and Football. Figure 3 shows the performance of our algorithms, as well as Pivot and ICM, for a range of $\lambda$ values. Pivot is the very fast algorithm of Ailon et al., which selects a uniform random node and clusters it with all its neighbors. ICM is the energy-minimization heuristic algorithm of Bagon and Galun.
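The Pivot procedure described above is short enough to sketch directly. The version below treats the edges of the unsigned graph as the positive edges and repeatedly clusters a uniformly random unclustered node with its unclustered neighbors; the function name and seeding are assumptions made for this example.

```python
import random


def pivot(n, edges, seed=0):
    """Pivot sketch: pick a random unclustered node, cluster it with its unclustered neighbors."""
    rng = random.Random(seed)
    nbrs = {v: set() for v in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    unclustered = set(range(n))
    clusters = []
    while unclustered:
        p = rng.choice(sorted(unclustered))  # uniform random pivot
        cluster = {p} | (nbrs[p] & unclustered)
        clusters.append(cluster)
        unclustered -= cluster
    return clusters
```

Note that this sketch never looks at $\lambda$: it only uses which pairs are adjacent, so on a graph made of disjoint cliques it recovers the cliques for any choice of pivots.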
We find that fiveLP gives much better than a 5-approximation in practice. Pivot is much faster, but performs poorly for $\lambda$ close to zero or one. ICM is also much quicker than solving the LP relaxation, but is still limited in scalability as it is intended for correlation clustering problems where most edge weights are zero, which is not the case for LambdaCC. On the other hand, GrowCluster and Lambda-Louvain are scalable and give good approximations for all input networks and values of $\lambda$.
5.2 Standard Clustering Algorithms
Many existing clustering algorithms implicitly optimize different parameter regimes of the LambdaCC objective. We show this by running several clustering algorithms on a 1000-node synthetic graph generated from the BTER model. We do the same on the largest component (4158 nodes) of the ca-GrQc collaboration network from the arXiv e-print website. We then compute LambdaCC objective scores for each algorithm for a range of $\lambda$ values. We first cluster each graph using Graclus (forming two clusters), Infomap, and Louvain. To form dense clusterings, we also partition the networks by recursively extracting the maximum clique (called RMC), and by recursively extracting the maximum quasi-clique (RMQC), i.e., the largest set of nodes with inner edge density bounded below by some fixed threshold. The last two procedures must solve an NP-hard objective at each step, but for reasonably sized graphs there is available clique and quasi-clique detection software [33, 26].
In Figure 4 we report for each λ the ratio between each clustering’s objective score and the LambdaCC LP-relaxation lower bound. We also display adjusted Rand index (ARI) scores between the Lambda-Louvain clustering and the output of other algorithms. We note that the ARI scores peak in the same regime where each algorithm best optimizes LambdaCC. Typically the ARI peaks are higher for larger λ. This can be explained by realizing that when λ is small, fewer clusters are formed, and we would expect there to be many ways to partition the graph into a small number of clusters such that different clusterings share a very similar structure, even if the individual clusters themselves do not match. On the whole, the plots in Figure 4 illustrate that our framework and algorithm effectively interpolate between several well-established strategies in graph partitioning, and can serve as a good proxy for any clustering task for which any one of these algorithms is known to be effective.
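For readers unfamiliar with the ARI, the following self-contained sketch computes it from two label vectors using the standard pair-counting formula. The function name `adjusted_rand_index` is illustrative; the experiments may well have used a library routine instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings given as label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs grouped together in both clusterings (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together in each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # both clusterings are trivial and identical in structure
    return (sum_cells - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

A score of 1 means the two partitions are identical up to relabeling, while scores near 0 indicate agreement no better than chance.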
By performing multiple runs of Graclus and varying the number of partitions formed by this algorithm, we can show that Graclus can approximately optimize different parameter regimes of LambdaCC. In Figure (a) we show how the Graclus objective scores change as we increase the number of clusters from to over . As the number of clusters increases, the algorithm performs better and better for large λ and worse for smaller λ. Figure (b) shows that something similar occurs for RMQC when we vary the minimum density of quasi-cliques from to . As the inner-edge density increases, the performance of RMQC essentially converges to the performance of RMC.
5.3 Ground Truth Community Detection
Determining how to choose the best value for a resolution parameter is a challenging yet important task in graph clustering. Recently, Jeub et al. proposed a new technique for sampling values of the resolution parameter for the Hamiltonian objective, to avoid resorting to ad hoc methods [jeub2017multiresolution]. In this experiment, we explore a different approach, by considering the relationship between our parameter, λ, and the structure of ground-truth communities in a network. In particular, we cluster the email-EU graph, which encodes email correspondence between 1005 faculty members organized into 42 departments (the ground truth) at a European research institution [leskovec2007graph]. In order to learn as much as we can about the relationship between our resolution parameter and the ground truth, we purposely select the value of λ that empirically leads to the best Adjusted Rand Index (ARI) scores between the ground-truth clustering and the output of degree-weighted Lambda-Louvain. We find that when , our method’s ARI score is much higher than the scores obtained by algorithms that optimize a more rigid objective function (Table 2). This highlights the potential benefit our framework can provide when given the right parameter, and shows the importance of developing good techniques for appropriately selecting λ.
The insight in this experiment comes from comparing normalized cut scores in different clusterings. Running Lambda-Louvain with yields clusters with scaled normalized cut between and , similar to the ground-truth clusters, which exhibit scores between and . Interestingly, these values are roughly an order of magnitude smaller than our choice of λ. This is consistent with our result in Theorem 6: λ is an upper bound on the scaled normalized cut scores of all clusters formed by Lambda-Louvain. This suggests a general strategy for setting λ when there is some prior knowledge about ground-truth structure. If we are given a target scaled normalized cut score for the output clusters we desire, we know not to set λ equal to or lower than this target. Rather, we choose a resolution parameter that is not too far from the target, but is still a generous upper bound. In future work, we aim to continue researching, both theoretically and experimentally, how to determine the right upper bound more precisely a priori.
5.4 Cliques in Large Collaboration Networks
Identifying large cliques in a network is an important task in graph mining, though it can be computationally challenging due to the NP-completeness of detecting maximum cliques. We show that GrowClique is a successful and scalable alternative for this task by using it to cluster two large collaboration networks: one formed from a snapshot of the author-paper DBLP dataset in 2007, and the other generated using actor-movie information from the NotreDame actors dataset [Barabasi]. The original data in both cases is a bipartite network indicating which ‘players’ (i.e., authors or actors) have parts in different ‘projects’ (papers or movies, respectively). We transform each bipartite network into a graph in which nodes are players and edges represent collaboration on a project.
At each iteration, GrowClique grows (possibly overlapping) cliques from random seeds and selects the largest to be included in the final output. We compare against RMC, an expensive method which provably returns a -approximation to the optimal cluster deletion objective. We also design ProjectClique, a method that looks at the original bipartite network and recursively identifies the project associated with the largest number of players not yet assigned to a cluster. These players form a clique in the collaboration network, so ProjectClique clusters them together, then repeats the procedure on the remaining nodes.
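The ProjectClique baseline described above is simple enough to sketch directly. The version below is an illustrative reading of that description, not the authors' code: it assumes the bipartite data is given as a mapping from each project to its players, breaks ties between equally large projects arbitrarily, and uses the hypothetical name `project_clique`.

```python
def project_clique(projects):
    """Greedy clique partition driven by the original bipartite data.

    projects maps each project (paper, movie, ...) to an iterable of its players.
    Returns a list of clusters; each is a set of players who share one project
    and therefore form a clique in the collaboration graph.
    """
    # Players of each project that have not yet been assigned to a cluster.
    remaining = {proj: set(players) for proj, players in projects.items()}
    clusters = []
    while remaining:
        # Project with the largest number of still-unassigned players.
        best = max(remaining, key=lambda proj: len(remaining[proj]))
        if not remaining[best]:
            break  # every player has been assigned
        cluster = set(remaining[best])
        clusters.append(cluster)
        # Remove the newly clustered players from every project.
        for proj in remaining:
            remaining[proj] -= cluster
    return clusters


example = {
    "paper1": ["a", "b", "c"],
    "paper2": ["c", "d"],
    "paper3": ["d", "e", "f", "g"],
}
print(project_clique(example))
```

This sketch rescans all projects at every step for clarity; a version intended for networks of the size used here would likely maintain the project sizes in a priority queue instead.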
Table 3 shows that GrowClique outperforms ProjectClique in both cases, and slightly outperforms RMC on the actor network. Our method is therefore competitive against two algorithms that in some sense have an unfair advantage over it: ProjectClique employs knowledge not available to GrowClique regarding the original bipartite dataset, and RMC performs very well mainly because it solves an NP-hard problem at each step.
5.5 Clustering Yeast Genes
The study of cluster deletion and cluster editing (which is equivalent to -correlation clustering) was originally motivated by applications to clustering genes using expression patterns [5, 35]. Standard LambdaCC is a natural framework for this, since it generalizes both objectives and interpolates between them as λ ranges from to . We cluster genes of the Saccharomyces cerevisiae yeast organism using microarray expression data collected by Kemmeren et al. [kemmeren2014large]. With the expression values from the dataset, we compute correlation coefficients between all pairs of genes. We threshold these at to obtain a small disconnected graph of nodes corresponding to unique genes, which we cluster with fourCD. For this cluster deletion experiment, our algorithm returns the optimal solution: solving the LP-relaxation returns a solution that is in fact integral. We validate each clique of size at least three returned by fourCD against known gene-association data from the Saccharomyces Genome Database (SGD) and the String Consortium Database (see Table 4). With one exception, these cliques match groups of genes that are known to be strongly associated, according to at least one validation database. The exception is a cluster with four genes (YHR093W, YIL171W, YDR490C, and YOR225W), three of which, according to the SGD, are not known to be associated with any Gene Ontology term. We conjecture that this may indicate a relationship between genes not previously known to be related.
| Clique # | Size | Shared GO term | Term | String |
|---|---|---|---|---|
| 4 | 4 | vitamin metabolic process | 0.7 | 980 |
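The graph construction used in this experiment (pairwise correlation followed by thresholding) can be illustrated as follows. This is a sketch under several assumptions: it uses the Pearson coefficient, keeps pairs whose correlation is at or above the threshold, and the gene names, expression values, and the threshold 0.9 in the example are made up for illustration and are not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined for constant profiles; treat as none
    return cov / sqrt(var_x * var_y)


def threshold_graph(expression, threshold):
    """Edges between genes whose expression profiles correlate strongly.

    expression maps a gene name to its list of expression values.
    Returns a sorted list of (gene, gene) edges; genes without any edge
    simply do not appear in the result.
    """
    return [
        (g, h)
        for g, h in combinations(sorted(expression), 2)
        if pearson(expression[g], expression[h]) >= threshold
    ]


# Hypothetical profiles: g1 and g2 move together, g3 moves oppositely.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
print(threshold_graph(profiles, threshold=0.9))
```

The resulting edge list is the unsigned graph that a cluster deletion method would then partition into cliques.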
5.6 Social Network Analysis with LambdaCC
Clustering a social network using a range of resolution parameters can reveal valuable insights about how links are formed in the network. Here we examine several graphs from the Facebook100 dataset, each of which represents the induced subgraph of the Facebook network corresponding to a US university at some point in 2005. The networks come with anonymized meta-data, reporting attributes such as major and graduation year for each node. While meta-data attributes are not expected to correspond to ground-truth communities in the network, we do expect them to play a role in how friendship links and communities are formed. In this experiment we illustrate strong correlations between the link structure of the networks and the dorm, graduation year, and student/faculty status meta-data attributes. We also see how these correlations are revealed, to different degrees, depending on our choice of λ.
Given a Facebook subgraph with nodes, we cluster it with degree-weighted Lambda-Louvain for a range of λ values between and . In this clustering, we refer to two nodes in the same cluster as an interior pair. We measure how well a meta-data attribute correlates with the clustering by calculating the proportion of interior pairs that share the same value for . This value, denoted by , can also be interpreted as the probability of selecting an interior pair uniformly at random and finding that they agree on attribute . To determine whether the probability is meaningful, we compare it against a null probability : the probability that a random interior pair agree on a fake meta-data attribute . We assign to each node a value for the fake attribute by performing a random permutation on the vector storing values for true attribute . In this way, we can compare each true attribute against a fake attribute that has exactly the same proportion of nodes with each attribute value, but does not impart any true information regarding each node.
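The interior-pair statistic and its permutation-based null can be computed as in the sketch below. The names `interior_pair_agreement` and `permuted_attribute` are illustrative, the example data is invented, and a single permutation is drawn; whether the reported null values come from one permutation or an average over several is not specified here.

```python
import random
from collections import Counter
from math import comb


def interior_pair_agreement(clusters, attribute):
    """Fraction of same-cluster node pairs that share an attribute value."""
    agree = total = 0
    for cluster in clusters:
        total += comb(len(cluster), 2)
        # Pairs that agree are pairs drawn from the same attribute value.
        counts = Counter(attribute[node] for node in cluster)
        agree += sum(comb(c, 2) for c in counts.values())
    return agree / total if total else 0.0


def permuted_attribute(attribute, rng):
    """Fake attribute: the same multiset of values, randomly reassigned to nodes."""
    nodes = sorted(attribute)
    values = [attribute[node] for node in nodes]
    rng.shuffle(values)
    return dict(zip(nodes, values))


# Hypothetical clustering and dorm labels for five nodes.
clusters = [{0, 1, 2}, {3, 4}]
dorm = {0: "A", 1: "A", 2: "B", 3: "C", 4: "C"}

observed = interior_pair_agreement(clusters, dorm)
null = interior_pair_agreement(clusters, permuted_attribute(dorm, random.Random(0)))
print(observed, null)
```

Counting agreeing pairs per attribute value avoids enumerating all interior pairs, which matters when small resolution values produce very large clusters.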
In Figure 6 we plot results for each of the three attributes on four Facebook networks, as λ is varied. In all cases, we see significant differences between and . In general, and reach a peak at small values of λ when clusters are large, whereas is highest when λ is large and clusters are small. This indicates that the first two attributes are more highly correlated with large sparse communities in the network, whereas sharing a dorm is more correlated with smaller, denser communities. Caltech, a small residential university, is an exception to these trends and exhibits a much stronger correlation with the dorm attribute, even for very small λ.
We have introduced a new clustering framework that unifies several other commonly used objectives and offers many attractive theoretical properties. We prove that our objective function interpolates between the sparsest cut objective and the cluster deletion problem, as we vary a single input parameter, λ. We give a 5-approximation algorithm for our objective when , and a related method which is the first constant-factor approximation for the cluster deletion problem. We also give scalable procedures for greedily improving our objective, which are successful in a wide variety of clustering applications. These methods are easily modified to add must-cluster and cannot-cluster constraints, which makes them amenable to many applications. In future work, we will continue exploring approximations when .
This work was supported by several funding agencies. Nate Veldt and David Gleich are supported by NSF award IIS-154648; David Gleich is additionally supported by NSF awards CCF-1149756 and CCF-093937, as well as the DARPA Simplex program and the Sloan Foundation. Anthony Wirth is supported by the Australian Research Council. We also thank Flavio Chierichetti for several helpful conversations.
- [1] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008.
- [2] S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2), 2009.
- [3] S. Bagon and M. Galun. Large scale correlation clustering optimization. arXiv, cs.CV:1112.2903, 2011.
- [4] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.
- [5] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.
- [6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
- [7] S. Böcker and P. Damaschke. Even faster parameterized cluster deletion and cluster editing. Information Processing Letters, 111(14):717–721, 2011.
- [8] L. Bohlin, D. Edler, A. Lancichinetti, and M. Rosvall. Community detection and visualization of networks with the map equation framework. In Measuring Scholarly Impact, pages 3–34. Springer, 2014.
- [9] F. Bonomo, G. Duran, A. Napoli, and M. Valencia-Pabon. A one-to-one correspondence between potential solutions of the cluster deletion problem and the minimum sum coloring problem, and its application to P4-sparse graphs. Information Processing Letters, 115:600–603, 2015.
- [10] F. Bonomo, G. Duran, and M. Valencia-Pabon. Complexity of the cluster deletion problem on subclasses of chordal graphs. Theoretical Computer Science, 600:59–69, 2015.
- [11] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005. Learning Theory 2003.
- [12] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 219–228. ACM, 2015.
- [13] P. Damaschke. Bounded-Degree Techniques Accelerate Some Parameterized Graph Algorithms, pages 98–109. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
- [14] J.-C. Delvenne, S. N. Yaliraki, and M. Barahona. Stability of graph communities across time scales. Proceedings of the National Academy of Sciences, 107(29):12755–12760, 2010.
- [15] E. D. Demaine and N. Immorlica. Correlation clustering with partial information. In S. Arora, K. Jansen, J. D. P. Rolim, and A. Sahai, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques: 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2003 and 7th International Workshop on Randomization and Approximation Techniques in Computer Science, RANDOM 2003, Princeton, NJ, USA, August 24-26, 2003. Proceedings, pages 1–13, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
- [16] A. Dessmark, J. Jansson, A. Lingas, E.-M. Lundell, and M. Person. On the approximability of maximum and minimum edge clique partition problems. International Journal of Foundations of Computer Science, 18(02):217–226, 2007.
- [17] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.
- [18] T. N. Dinh, X. Li, and M. T. Thai. Network clustering via maximizing modularity: Approximation algorithms and theoretical limits. In Proceedings of the 2015 IEEE International Conference on Data Mining (ICDM), pages 101–110. IEEE, 2015.
- [19] D. Emanuel and A. Fiat. Correlation clustering – minimizing disagreements on arbitrary weighted graphs. In G. Di Battista and U. Zwick, editors, Algorithms - ESA 2003: 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003. Proceedings, pages 208–220, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
- [20] Y. Gao, D. R. Hare, and J. Nastos. The cluster deletion problem for cographs. Discrete Mathematics, 313(23):2763–2771, 2013.
- [21] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
- [22] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Fixed-parameter algorithms for clique generation. In R. Petreschi, G. Persiano, and R. Silvestri, editors, Algorithms and Complexity: 5th Italian Conference, CIAC 2003, Rome, Italy, May 28–30, 2003. Proceedings, pages 108–119, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
- [23] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, 39(4):321–347, 2004.
- [24] D. E. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, Reading, MA, 1993.
- [25] V. Krebs. Books about US Politics, 2004. Hosted at UCI Data Repository.
- [26] G. Liu and L. Wong. Effective pruning techniques for mining quasi-cliques. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part II, pages 33–49, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
- [27] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876–878, 2010.
- [28] A. Natanzon, R. Shamir, and R. Sharan. Complexity classification of some edge modification problems. In International Workshop on Graph-Theoretic Concepts in Computer Science, pages 65–77. Springer, 1999.
- [29] M. E. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
- [30] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(026113), 2004.
- [31] L. Peel, D. B. Larremore, and A. Clauset. The ground truth about metadata and community detection in networks. Science Advances, 3(5), 2017.
- [32] J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Physical Review E, 74(016110), 2006.
- [33] R. A. Rossi, D. F. Gleich, and A. H. Gebremedhin. Parallel maximum clique algorithms with applications to network analysis. SIAM Journal on Scientific Computing, 37(5):C589–C616, 2015.
- [34] C. Seshadhri, T. G. Kolda, and A. Pinar. Community structure and scale-free collections of Erdős-Rényi graphs. Physical Review E, 85(5), May 2012.
- [35] R. Shamir, R. Sharan, and D. Tsur. Cluster graph modification problems. Discrete Applied Mathematics, 144:173–182, 2004.
- [36] S. Sra, J. Tropp, and I. S. Dhillon. Triangle fixing algorithms for the metric nearness problem. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 361–368. MIT Press, 2005.
- [37] C. Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 526–527. Society for Industrial and Applied Mathematics, 2004.
- [38] W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.