Unifying Sparsest Cut, Cluster Deletion, and Modularity Clustering Objectives with Correlation Clustering

Nate Veldt (Purdue University, Mathematics Department), David F. Gleich (Purdue University, Computer Science Department), and Anthony Wirth (The University of Melbourne, Computing and Information Systems School)
Abstract

We present and analyze a new framework for graph clustering based on a specially weighted version of correlation clustering that unifies several existing objectives and satisfies a number of attractive theoretical properties. Our framework, which we call LambdaCC, relies on a single resolution parameter λ, which implicitly controls both the edge density and sparsest cut of all output clusters. We prove that our new clustering objective interpolates between the cluster deletion problem and the minimum sparsest cut problem as we vary λ, and is also closely related to the well-studied maximum modularity objective. We provide several algorithms for optimizing our new objective, including a 5-approximation for the case where λ ≥ 1/2, and also the first constant-factor approximation algorithm for the NP-hard cluster deletion problem. We demonstrate the effectiveness of our framework and algorithms in finding communities in several real-world networks.

1 Introduction

Identifying groups of related entities in a network is a ubiquitous task across scientific disciplines. This task is often called graph clustering, or community detection, and can be used to find similar proteins in a protein interaction network, group related organisms in a food web, identify communities in a social network, and classify web documents, among numerous other applications.

Defining the right notion of a “good” community in a graph is an important precursor to developing successful algorithms for graph clustering. In general, a good clustering is one in which nodes inside clusters are more densely connected to each other than to the rest of the graph. However, no consensus exists as to the best way to determine the quality of network clusterings, and recent results show that no such consensus is possible, given the many distinct reasons people cluster data [31]. Common objective functions studied by theoretical computer scientists include normalized cut, sparsest cut, conductance, and edge expansion, all of which measure some version of the cut-to-size ratio for a single cluster in a graph. Other standards of clustering quality put a greater emphasis on the internal density of clusters, such as the cluster deletion objective, which seeks to partition a graph into completely connected sets of nodes (cliques) by removing the fewest edges possible.

Arguably the most widely used multi-cluster objective for community detection is modularity, introduced by Newman and Girvan [30]. Modularity measures the difference between the true number of edges inside the clusters of a given partitioning (“inner edges”) and the expected number of inner edges, where expectation is calculated with respect to a specific random graph model.

There are a limited number of results which have begun to unify distinct clustering measures by introducing objective functions that are closely related to modularity and depend on a tunable clustering resolution parameter [14, 32]. Reichardt and Bornholdt developed an approach based on finding the minimum-energy state of an infinite range Potts spin glass. The resulting Hamiltonian function they study is viewed as a clustering objective with a resolution parameter γ, which can be used as a heuristic for detecting overlapping and hierarchical community structure in a network. When γ = 1, the authors prove an equivalence between minimizing the Hamiltonian and finding the maximum modularity partitioning of a network [32]. Later, Delvenne et al. introduced a measure called the stability of a clustering, which generalizes modularity and also is related to the normalized cut objective and Fiedler’s spectral clustering method for certain values of an input parameter [14].

The inherent difficulty of obtaining clusterings that are provably close to the optimal solution puts these objective functions at a disadvantage. Although both the stability and the Hamiltonian-Potts objectives provide useful interpretations for community detection, there are no approximation guarantees for either: all current algorithms are heuristics. Furthermore, it is known that maximizing modularity itself is not only NP-hard, but is also NP-hard to approximate to within any constant factor [18].

Our Contributions

In this paper, we introduce a new clustering framework based on a specially weighted version of correlation clustering [4]. Our partitioning objective for signed networks lies “between” the family of ±1-weighted complete instances and the most general correlation clustering instances. Our framework comes with several novel theoretical properties and leads to many connections between clustering objectives that were previously not seen to be related. In summary, we provide:

  1. A novel framework LambdaCC for community detection that is related to modularity and the Hamiltonian, but is more amenable to approximation results.

  2. A proof that our framework interpolates between the sparsest-cut objective and the cluster-deletion problem, as we increase a single resolution parameter, λ.

  3. Several successful algorithms for optimizing our new objective function in both theory and practice, including the first constant-factor approximation for cluster deletion.

  4. A demonstration of our methods in a number of real-world clustering applications, including social network analysis, mining cliques in collaboration networks, and detecting ground truth communities in an email network.

2 Background and Related Work

Let G = (V, E) be an undirected and unweighted graph on n = |V| nodes, with m = |E| edges. For all i ∈ V, let d_i be node i’s degree. Given S ⊆ V, let S̄ = V \ S be the complement of S and vol(S) = ∑_{i ∈ S} d_i be its volume. For every two disjoint sets of vertices S, T ⊆ V, cut(S, T) indicates the number of edges between S and T. If T = S̄, we write cut(S) = cut(S, S̄). Let E_S denote the interior edge set of S, the set of edges with both endpoints in S. The edge density of a cluster S is density(S) = |E_S| / (|S|(|S|−1)/2), the ratio between the number of edges and the number of pairs of nodes in S. By convention, the density of a single node is 1. We now present background and related work that is foundational to our results, including definitions for several common clustering objectives.

2.1 Correlation Clustering

An instance of correlation clustering is given by a signed graph where every pair of nodes i and j possesses two non-negative weights, w⁺_ij and w⁻_ij, indicating how similar and how dissimilar i and j are, respectively. Typically only one of these weights is nonzero for each pair (i, j). The objective can be expressed as an integer linear program (ILP):

minimize   ∑_{i<j} [ w⁺_ij x_ij + w⁻_ij (1 − x_ij) ]
subject to   x_ij ≤ x_ik + x_kj   for all i, j, k
             x_ij ∈ {0, 1}   for all i < j        (1)

In the above formulation, x_ij represents “distance”: x_ij = 0 indicates that nodes i and j are clustered together, while x_ij = 1 indicates separating the two nodes. Including the triangle-inequality constraints ensures the output of the above ILP defines a valid clustering of the nodes. This objective counts the total weight of disagreements between the signed weights in the graph and a given clustering of its nodes. The disagreement (or “mistake”) weight of a pair (i, j) is w⁻_ij if the nodes are clustered together, but w⁺_ij if they are separated. We can equivalently define the agreement weight to be w⁺_ij if i and j are clustered together, but w⁻_ij if they are separated. The optimal clusterings for maximizing agreements and minimizing disagreements are identical, but it is more challenging to approximate the latter objective.
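To make the bookkeeping concrete, the following short Python sketch (ours, not part of the paper) evaluates the disagreement weight of a candidate clustering of a signed graph; the dictionary-based representation of the weights is an assumption made for illustration.

```python
def disagreement_weight(w_plus, w_minus, cluster_of):
    """Total disagreement weight of a clustering of a signed graph.

    w_plus, w_minus: dicts mapping node pairs (i, j), i < j, to non-negative
    similarity / dissimilarity weights (hypothetical representation).
    cluster_of: dict mapping each node to its cluster label."""
    total = 0.0
    for (i, j), wp in w_plus.items():
        if cluster_of[i] != cluster_of[j]:   # separated similar pair
            total += wp
    for (i, j), wm in w_minus.items():
        if cluster_of[i] == cluster_of[j]:   # co-clustered dissimilar pair
            total += wm
    return total
```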

Correlation clustering was introduced by Bansal et al., who proved the problem is NP-complete [4]. They gave a polynomial-time approximation scheme for the maximization version and a constant-factor approximation for minimizing disagreements in ±1-weighted complete graphs. Subsequently, Charikar et al. gave a factor-4 approximation for minimizing disagreements and proved APX-hardness of this variant. They also described an O(log n) approximation for minimization in general weighted graphs [11], a result proved independently by two other groups, who showed that minimizing disagreements is equivalent to minimum multicut [15, 19].

The problem has also been studied for the case where edges carry both positive and negative weights satisfying probability constraints: for all pairs (i, j), w⁺_ij + w⁻_ij = 1. Ailon et al. gave a 2.5-approximation for this version of the problem based on an LP relaxation, and additionally developed a very fast algorithm, called Pivot, that in expectation gives a 3-approximation [1]. Currently the best-known approximation factor for correlation clustering on ±1 complete instances is slightly smaller than 2.06, obtained by a careful rounding of the canonical LP relaxation [12].

2.2 Modularity and the Hamiltonian

One very popular measure of clustering quality is modularity, introduced in its most basic form by Newman and Girvan [30]. We more closely follow the presentation of modularity given by Newman [29]. The modularity Q of an underlying clustering is:

Q = (1 / 2m) ∑_{i,j} (A_ij − P_ij)(1 − x_ij)        (2)

where A_ij = 1 if nodes i and j are adjacent, and zero otherwise, and x_ij is again the binary variable indicating “distance” between i and j in the corresponding clustering. The value P_ij represents the probability of an edge existing between i and j in a specific random graph model. The intent of this measure is to reward clusterings in which the actual number of edges inside a cluster is greater than the expected number of edges in the cluster, as determined by the choice of P_ij. Although there are many options, it is standard in the literature to set P_ij = d_i d_j / (2m), since this preserves both the degree distribution and the expected number of edges between the original graph and null model. Many generalizations have been introduced for modularity, including an extension to multislice networks, which allows one to study the evolution of communities in a network over time [27].
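As a concrete reference point, here is a minimal Python sketch (ours, not from the paper) that computes modularity using a standard equivalent cluster-level form, Q = ∑_c [ e_c/m − (vol_c / 2m)² ], assuming an undirected simple graph given as an edge list.

```python
from collections import defaultdict

def modularity(edges, cluster_of):
    """Newman-Girvan modularity, Q = sum_c [ e_c/m - (vol_c / 2m)^2 ],
    where e_c is the number of edges inside cluster c and vol_c is the
    sum of the degrees of its nodes."""
    m = len(edges)
    inner = defaultdict(int)   # edges with both endpoints in the cluster
    vol = defaultdict(int)     # sum of degrees per cluster
    for (u, v) in edges:
        vol[cluster_of[u]] += 1
        vol[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            inner[cluster_of[u]] += 1
    return sum(inner[c] / m - (vol[c] / (2.0 * m)) ** 2 for c in vol)
```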

By slightly editing the modularity function, we obtain the Hamiltonian objective of Reichardt and Bornholdt [32]:

H(γ) = − ∑_{i<j} (A_ij − γ P_ij)(1 − x_ij)        (3)

The primary difference between this and modularity is the inclusion of a clustering resolution parameter γ. If we fix γ = 1, minimizing (3) is equivalent to maximizing modularity. When varied, this parameter controls how much a clustering is penalized for putting two non-adjacent nodes together or for separating adjacent nodes.

The Hamiltonian objective is in turn closely related to the stability of a clustering as defined by Delvenne et al., another generalization of modularity [14]. Roughly speaking, the stability of a partition measures the likelihood that a random walker, beginning at a node and following outgoing edges uniformly at random, will end up in the cluster it started in after a random walk of length t. This t serves as a resolution parameter, since the walker will tend to “wander” farther when t is increased, leading to the formation of larger clusters when the stability is maximized. Delvenne et al. showed that objective (3) is equivalent to a linearized version of the stability measure for a specific range of time steps t [14].

2.3 Sparsest Cut and Normalized Cut

One measure of cluster quality in an unsigned network G is the sparsest cut score, defined for a set S ⊂ V to be cut(S)(1/|S| + 1/|S̄|) = n · cut(S) / (|S||S̄|). Smaller values are desirable, since they indicate that S, in spite of its size, is only loosely connected to the rest of the graph. This measure differs by at most a factor of two from the related edge expansion measure: cut(S)/min(|S|, |S̄|). If we replace set sizes with volumes in these two objectives, we obtain the normalized cut and the conductance measure respectively. In our work we focus on a multiplicative scaling of the sparsest cut objective that we call the scaled sparsest cut: cut(S)/(|S||S̄|), which is identical to sparsest cut in terms of multiplicative approximations. The best known approximation for finding the minimum sparsest cut of a graph is an O(√(log n))-approximation algorithm due to Arora et al. [2].
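The following small Python helper (ours, for illustration) computes the scaled sparsest cut of a vertex set under the definition above; it assumes the graph is given as an edge list and that S is a proper nonempty subset of the n nodes.

```python
def scaled_sparsest_cut(edges, S, n):
    """Scaled sparsest cut of vertex set S in a graph with n nodes:
    cut(S) / (|S| * |V \\ S|)."""
    S = set(S)
    cut = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return cut / (len(S) * (n - len(S)))
```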

2.4 Cluster Deletion

Cluster deletion is the problem of finding a minimum number of edges in G to be deleted in order to convert G into a disjoint set of cliques. This can be viewed as a stricter version of correlation clustering, in which we want to minimize disagreements, but we are strictly prohibited from making mistakes at negative edges. This problem was first studied by Ben-Dor et al. [5], later formalized in the work of Natanzon et al. [28], who proved it is NP-hard, and Shamir et al. [35], who showed it is APX-hard. The latter studied the problem in conjunction with other related edge-modification problems, including cluster completion and cluster editing.

Numerous fixed parameter tractability results are known for cluster deletion [7, 22, 23, 13], as well as many results regarding special graphs for which the problem can be solved in polynomial time [20, 10, 16, 9]. Dessmark et al. proved that by iteratively finding maximum cliques in the graph, one can achieve a 2-approximation for this objective [16], though in general this procedure is NP-hard.

3 Theoretical Results

Our novel clustering framework takes an unsigned graph G = (V, E) and converts it into a signed graph G_λ on the same set of nodes, for a fixed clustering resolution parameter λ ∈ (0, 1). Partitioning G_λ with respect to the correlation clustering objective will then induce a clustering on G. To construct the signed graph, we first introduce a node weight w_i for each i ∈ V. If (i, j) ∈ E, we place a positive edge between nodes i and j in G_λ, with weight 1 − λ w_i w_j. For (i, j) ∉ E, we place a negative edge between i and j in G_λ, with weight λ w_i w_j. We consider two different choices for node weights: setting w_i = 1 for all i (standard) or choosing w_i = d_i (degree-weighted). In Figure 1 we illustrate the process of converting G into the LambdaCC signed graph G_λ.

Figure 1: We convert a toy graph (left) into a signed graph for standard (middle) and degree-weighted (right) LambdaCC. Dashed red lines indicate negative edges. Partitioning the signed graph via correlation clustering induces a clustering on the original unsigned graph.
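A minimal Python sketch of this construction, following our reconstruction of the edge weights above (1 − λ w_i w_j for edges, λ w_i w_j for non-edges); the function name and data representation are illustrative, not from the paper.

```python
from itertools import combinations

def lambdacc_signed_graph(nodes, edges, lam, degree_weighted=False):
    """Build the LambdaCC signed graph: for (i,j) in E a positive edge of
    weight 1 - lam*w_i*w_j, for (i,j) not in E a negative edge of weight
    lam*w_i*w_j, with w_i = 1 (standard) or w_i = d_i (degree-weighted)."""
    deg = {v: 0 for v in nodes}
    E = set()
    for (u, v) in edges:
        deg[u] += 1
        deg[v] += 1
        E.add((min(u, v), max(u, v)))
    w = {v: (deg[v] if degree_weighted else 1) for v in nodes}
    w_plus, w_minus = {}, {}
    for i, j in combinations(sorted(nodes), 2):
        if (i, j) in E:
            w_plus[(i, j)] = 1.0 - lam * w[i] * w[j]
        else:
            w_minus[(i, j)] = lam * w[i] * w[j]
    return w_plus, w_minus
```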

3.1 Connection to Modularity

Despite a significant difference in approach and interpretation, the clustering that minimizes LambdaCC disagreements is the same clustering that minimizes the Hamiltonian objective (3), for a certain choice of parameters. To see this, express the weight of disagreements, LamCC(C), for a clustering C of G_λ in terms of edges and non-edges of G:

LamCC(C) = ∑_{(i,j) ∈ E} (1 − λ w_i w_j) x_ij + ∑_{(i,j) ∉ E} λ w_i w_j (1 − x_ij).

By introducing node-adjacency indicators A_ij, this becomes:

LamCC(C) = ∑_{i<j} [ A_ij (1 − λ w_i w_j) x_ij + (1 − A_ij) λ w_i w_j (1 − x_ij) ].

Choosing w_i = d_i and λ = γ/(2m), we see that:

LamCC(C) = ∑_{i<j} A_ij (1 − λ d_i d_j) − ∑_{i<j} (A_ij − γ P_ij)(1 − x_ij) = ∑_{i<j} A_ij (1 − λ d_i d_j) + H(γ),        (4)

where the first term is just a constant. This theorem follows:

Theorem 1

Minimizing disagreements for the degree-weighted LambdaCC objective with λ = γ/(2m) is equivalent to minimizing the Hamiltonian H(γ).

The choice w_i = d_i is reminiscent of the graph null model most commonly used for modularity and the Hamiltonian. This best highlights the similarity between these objectives and degree-weighted LambdaCC. On the other hand, standard LambdaCC (setting w_i = 1 for every i) leads to strong connections between the sparsest cut objective and cluster deletion. This version corresponds to solving a correlation clustering problem where all positive edges have equal weight, 1 − λ, while all negative edges have equal weight, λ. The objective function for minimizing disagreements is

minimize   ∑_{(i,j) ∈ E} (1 − λ) x_ij + ∑_{(i,j) ∉ E} λ (1 − x_ij)        (5)

where we include the same constraints as in ILP (1). This is a strict generalization of the unit-weight correlation clustering problem [4] (recovered, up to scaling, at λ = 1/2), indicating the problem in general is NP-hard (though it admits several approximation algorithms). If λ is 0 or 1, the problem is trivial to solve: put all nodes in one cluster or put each node in a singleton cluster, respectively. By selecting intermediate values of λ, we uncover more subtle connections between identifying sparse cuts and finding dense subgraphs in the same network.
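For illustration, the following Python sketch (ours) evaluates objective (5) as reconstructed above directly on the unsigned graph, using the fact that positive mistakes are cut edges of weight 1 − λ and negative mistakes are co-clustered non-adjacent pairs of weight λ.

```python
from collections import defaultdict

def standard_lambdacc(nodes, edges, lam, cluster_of):
    """Evaluate objective (5) directly on the unsigned graph: each cut edge
    costs (1 - lam), each co-clustered non-adjacent pair costs lam."""
    E = {(min(u, v), max(u, v)) for (u, v) in edges}
    cut_edges = sum(1 for (u, v) in E if cluster_of[u] != cluster_of[v])
    sizes = defaultdict(int)
    for v in nodes:
        sizes[cluster_of[v]] += 1
    inner_pairs = sum(s * (s - 1) // 2 for s in sizes.values())
    inner_edges = len(E) - cut_edges
    return (1 - lam) * cut_edges + lam * (inner_pairs - inner_edges)
```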

3.2 Connection to Sparsest Cut

Given λ and S ⊂ V, the weight of positive-edge mistakes in the LambdaCC objective made by the two-clustering {S, S̄} equals the weight of edges crossing the cut: (1 − λ) cut(S). To compute the weight of negative-edge mistakes, we take the weight of all negative edges in the entire network, λ(n(n−1)/2 − m), and then subtract the weight of negative edges between S and S̄: λ(|S||S̄| − cut(S)). Adding together all terms we find that the LambdaCC objective for this clustering is

(1 − λ) cut(S) + λ(n(n−1)/2 − m) − λ(|S||S̄| − cut(S)) = cut(S) − λ|S||S̄| + λ(n(n−1)/2 − m).        (6)

Note that if we minimize (6) over all 2-clusterings, we solve the decision version of the minimum scaled sparsest cut problem: a few steps of algebra confirm that there is some set S with cut(S)/(|S||S̄|) < λ if and only if the minimum of (6) is less than λ(n(n−1)/2 − m), the objective score of placing all nodes in one cluster.
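For completeness, here are those steps of algebra, written in LaTeX with respect to our reconstruction of (6) above; they only rearrange terms of that expression.

```latex
\begin{align*}
\mathrm{cut}(S) - \lambda |S||\bar{S}| + \lambda\Big(\tfrac{n(n-1)}{2} - m\Big)
   &< \lambda\Big(\tfrac{n(n-1)}{2} - m\Big) \\
\iff \quad \mathrm{cut}(S) &< \lambda\, |S||\bar{S}| \\
\iff \quad \frac{\mathrm{cut}(S)}{|S||\bar{S}|} &< \lambda .
\end{align*}
```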

In a similar way we can show that objective (5) is equivalent to

minimize over clusterings C = {S_1, S_2, ..., S_k}:   (1/2) ∑_{i=1}^{k} ( cut(S_i) − λ|S_i||S̄_i| ) + λ(n(n−1)/2 − m)        (7)

where we minimize over all clusterings C of V with arbitrarily many clusters k. In this case, optimally solving objective (7) will tell us whether we can find a clustering such that ∑_i cut(S_i) < λ ∑_i |S_i||S̄_i|, i.e., one that beats the single-cluster score λ(n(n−1)/2 − m).

Hence LambdaCC can be viewed as a multi-cluster generalization of the decision version of minimum sparsest cut. We now prove an even deeper connection between sparsest cut and LambdaCC. Using degree-weighted LambdaCC yields an analogous result for normalized cut.

Theorem 2

Let λ* = min_{S ⊂ V} cut(S)/(|S||S̄|) be the minimum scaled sparsest cut for a graph G.

  1. For all λ > λ*, the optimal solution to (7) partitions V into two or more clusters, each of which has scaled sparsest cut at most λ. There exists some λ′ > λ* such that, for λ ∈ (λ*, λ′), the optimal clustering for LambdaCC is the minimum sparsest cut partition.

  2. For λ ≤ λ*, it is optimal to place all nodes into a single cluster.

Proof

Statement (1) Let S* be some optimal sparsest cut-inducing set in G, i.e., cut(S*)/(|S*||S̄*|) = λ*. The LambdaCC objective corresponding to the two-clustering {S*, S̄*} is

cut(S*) − λ|S*||S̄*| + λ(n(n−1)/2 − m).        (8)

When minimizing objective (7), we can always obtain a score of λ(n(n−1)/2 − m) by placing all nodes into a single cluster. Note however that the score of the clustering {S*, S̄*} in expression (8) is strictly less than λ(n(n−1)/2 − m) for all λ > λ*. Even if {S*, S̄*} is not optimal, this means that when λ > λ*, we can do strictly better than placing all nodes into one cluster. In this case let C be the optimal LambdaCC clustering and consider two of its clusters: A and B. The weight of disagreements between A and B is equal to the number of positive edges between them times the weight of a positive edge: (1 − λ) cut(A, B). Should we form a new clustering by merging A and B, these positive disagreements will disappear; in turn, we would introduce new mistakes at the negative edges between the clusters, of total weight λ(|A||B| − cut(A, B)). Because we assumed C is optimal, we know that we cannot decrease the objective by merging two of the clusters, implying that

(1 − λ) cut(A, B) ≤ λ(|A||B| − cut(A, B)),   i.e.,   cut(A, B) ≤ λ|A||B|.

Given this, we fix an arbitrary cluster A ∈ C and perform a sum over all other clusters to see that

cut(A) = ∑_{B ∈ C, B ≠ A} cut(A, B) ≤ λ ∑_{B ≠ A} |A||B| = λ|A||Ā|,

proving the desired upper bound on scaled sparsest cut.

Since G is a finite graph, there are a finite number of scaled sparsest cut scores that can be induced by a subset of V. Let λ′ be the second-smallest scaled sparsest cut score achieved, so λ′ > λ*. If we set λ ∈ (λ*, λ′), then the optimal LambdaCC clustering produces at least two clusters, since λ > λ*, and each cluster has scaled sparsest cut at most λ < λ′. By our selection of λ′, all clusters returned must have scaled sparsest cut exactly equal to λ*, which is only possible if the clustering returned has two clusters. Hence this clustering is a minimum sparsest cut partition of the network.

Statement (2) If λ < λ*, forming a single cluster must be optimal, otherwise we could invoke the argument from Statement (1) to assert the existence of some nontrivial cluster with scaled sparsest cut less than or equal to λ, contradicting the minimality of λ*. If λ = λ*, forming a single cluster or using the clustering {S*, S̄*} yields the same objective score, which is again optimal for the same reason.

3.3 Connection to Cluster Deletion

For large λ our problem becomes more similar to cluster deletion. We can reduce any cluster deletion problem to correlation clustering by taking the input graph G and introducing a negative edge of weight “∞” between every pair of non-adjacent nodes. This guarantees that optimally solving correlation clustering will yield clusters that all correspond to cliques in G. Furthermore, the weight of disagreements will be the number of edges in G that are cut, i.e., the cluster deletion score. We can obtain a generalization of cluster deletion by instead choosing the weight of each negative edge to be λ/(1 − λ). The corresponding objective is

minimize   ∑_{(i,j) ∈ E} x_ij + ∑_{(i,j) ∉ E} (λ / (1 − λ)) (1 − x_ij)        (9)

If we scale (9) by (1 − λ), we see this differs from objective (5) only by a multiplicative constant, and is therefore equivalent in terms of approximation. When λ > 1/2, putting dissimilar nodes together will be more expensive than cutting positive edges, so we would expect that the clustering which optimizes the LambdaCC objective will separate G into dense clusters that are “nearly” cliques. We formalize this with a simple theorem.

Theorem 3

If C minimizes the LambdaCC objective for the unsigned network G, then the edge density of every cluster in C is at least λ.

Proof

Take a cluster S ∈ C and consider what would happen if we broke apart S so that each of its nodes were instead placed into its own singleton cluster. This means we are now making mistakes at every positive edge previously in S, which increases the weight of disagreements by (1 − λ)|E_S|. On the other hand, there are no longer negative mistakes between nodes in S, so the LambdaCC objective would simultaneously decrease by λ(|S|(|S|−1)/2 − |E_S|). The total change in the objective made by pulverizing S is

(1 − λ)|E_S| − λ(|S|(|S|−1)/2 − |E_S|) = |E_S| − λ|S|(|S|−1)/2,

which must be nonnegative, since C is optimal, so density(S) = |E_S| / (|S|(|S|−1)/2) ≥ λ.

Corollary 1

Let G have m edges. For every λ > m/(m+1), optimizing LambdaCC is equivalent to optimizing cluster deletion.

Proof

All output clusters must have density at least λ > m/(m+1), which is only possible if the density is actually one: a cluster that is not a clique has density at most m/(m+1), since m is the total number of edges in the graph. Therefore all clusters are cliques and the LambdaCC and cluster deletion objectives differ only by the multiplicative constant (1 − λ).

3.4 Equivalences and Approximations

We summarize the equivalence relationships between LambdaCC and other objectives in Figure 2. Accompanying this, Table 1 outlines the best-known approximation results both for maximizing agreements and minimizing disagreements for the standard LambdaCC signed graph. For degree-weighted LambdaCC, the best-known approximation factors for all λ are O(log n) [15, 19, 11] for minimizing disagreements, and a constant factor for maximizing agreements [37]. Thus, LambdaCC is more amenable to approximation than modularity and its relatives, whose equivalence with LambdaCC holds only up to additive constants that do not preserve approximation guarantees.

Figure 2: LambdaCC is equivalent to several other objectives for specific values of λ. Values λ* and λ′ are not known a priori, but can be obtained by solving LambdaCC for increasingly smaller values of λ.
Max-Agree  [37] PTAS [4]  [37]
Min-Dis.  [11, 15, 19]  [12]
Table 1: The best approximation factors known for standard LambdaCC, for different ranges of λ, both for minimizing disagreements and maximizing agreements. We contribute two constant-factor approximations for minimizing disagreements when λ ≥ 1/2.

4 Algorithms

We present several new algorithms, tailored specifically to our LambdaCC framework; some come with approximation guarantees, some are designed for efficiency.

4.1 LP Relaxation Algorithm for λ ≥ 1/2

Our first algorithm relies on solving the LP relaxation of ILPs (1, 5), replacing the binary constraint x_ij ∈ {0, 1} with the linear constraint 0 ≤ x_ij ≤ 1. Our rounding scheme and proof technique are similar to those developed by Charikar et al. for ±1 correlation clustering [11]. Our procedure, outlined in Algorithm 1, is called fiveLP, based on the following approximation result.

Input: Signed graph G_λ, λ
Output: Clustering C of V
Solve the LP relaxation of ILPs (1, 5), obtaining distances x_ij
C ← ∅, W ← V
while W ≠ ∅ do
     Choose a pivot u ∈ W arbitrarily
     T ← the set of nodes in W \ {u} within a fixed LP distance of u
     if the average distance between u and T is above a fixed threshold then
          S ← {u}
     else
          S ← T ∪ {u}
     Add S to C, update W ← W \ S
Algorithm 1 fiveLP
Theorem 4

Algorithm fiveLP gives a factor-5 approximation for LambdaCC for all λ ≥ 1/2.

Proof

We prove the 5-approximation holds for objective (9) when λ ≥ 1/2, since this objective is equivalent to LambdaCC in terms of approximations. In other words, consider a correlation clustering instance where each positive edge has weight 1 and each negative edge has weight λ/(1 − λ) ≥ 1. Solving this LP gives a lower bound on the optimal LambdaCC score. We show that both for singleton and non-singleton clusters formed by fiveLP, the number of mistakes made at each cluster is within a factor five of the LP cost corresponding to that cluster. Recall that each cluster is formed around some pivot node u, together with the set T of nodes lying within the fixed LP-distance ball around u.

Singleton Clusters.

If  is a singleton, we know that . In this case we make at most  mistakes, which would happen if all edges between  and  are positive. Given that for every , we know that . Let and . The LP cost associated with this cluster is

so we account for these errors within the required factor.

Negative-edge mistakes in non-singleton clusters.

Consider a negative edge inside a cluster of the form . If that edge is , the LP cost is . For every other negative edge, , where , the LP cost is . Either way, the LP has paid at least  for each negative-edge mistake.

Positive-edge mistakes in non-singleton clusters.

Positive edges from  to  satisfy , so this type of edge pays for itself easily. The other edges we need to account for are all edges where  and . We will charge all edges of this form to the node  that lies outside .

First, if , then and the positive edge pays for itself within factor .

Now, fix some where . Let and , and let be the number of positive  edges, while is the number of negative  edges. The number of positive mistakes we are charging to  is exactly , and we have

LP cost at

where the last inequality follows from the fact that the average distance from  to  is less than . Thus the LP cost is bounded by a linear function, , where . Hence the coefficient of , while the coefficient of , so  is within a factor  of the LP cost.

This 5-approximation greatly improves upon the O(log n) approximation obtained by applying algorithms designed for general weighted correlation clustering [11, 15, 19, 37].

4.2 LP Relaxation Algorithm for Cluster Deletion.

We slightly alter fiveLP in the following ways:

  • For all (i, j) ∉ E, force the constraint x_ij = 1 in the LP.

  • When rounding, select an arbitrary pivot u ∈ W and set T = {v ∈ W : x_uv < 1/2} (rather than using the ball from fiveLP).

  • Make u a singleton if the average distance from u to T is at least 1/4; otherwise cluster u with T.

We call this algorithm fourCD, and show that it is the first constant-factor approximation algorithm for cluster deletion.

Theorem 5

Algorithm fourCD returns a 4-approximation to Cluster Deletion.

Proof

First, fourCD forms only cliques. Should the cluster formed around u not be a singleton, then for every pair v, w ∈ T we have x_uv < 1/2 and x_uw < 1/2, so x_vw ≤ x_uv + x_uw < 1. Since this distance is strictly less than 1, and since we forced x_vw = 1 for non-adjacent nodes, the nodes v and w must be adjacent. We therefore only need to account for positive-edge mistakes. The remainder of the proof follows directly from the same steps used to prove Theorem 4, as well as the original proof of Charikar et al. [11] for ±1 correlation clustering. We include full details here for completeness:

Singleton Clusters

If we cluster u as a singleton, then the number of mistakes we make between u and T is exactly |T|, as these are all positive neighbors of u. Since u was made a singleton cluster, we know that the average LP distance from u to T is at least 1/4, so these positive mistakes are paid for within factor four. Finally, note that every positive edge (u, v) with v outside T ∪ {u} has LP cost greater than 1/2, so all such mistakes are paid for within factor two.

Clusters

No negative mistakes are made, so we only need to account for positive mistakes. As mentioned in the previous case, edges (u, w) with w outside the cluster pay for themselves within factor two. For an edge (v, w) where v ∈ T and w ∉ T ∪ {u}, if x_uv < 1/4, then x_vw ≥ x_uw − x_uv > 1/2 − 1/4 = 1/4, so the edge pays for itself within factor four.

Now consider a single node such that , and then consider all with . Again, use the notation and with and . Now we bound the weight of positive mistakes as a function of the LP cost associated with . Thanks to the constraint for all , , therefore, relying also on the reasoning for fiveLP:

For , the coefficient of , while the coefficient of . Therefore the number of mistakes, , is paid for within factor four.

4.3 Strategies for Solving the LP Relaxation

The LP relaxation for correlation clustering involves O(n³) triangle-inequality constraints. We can nonetheless compute this relaxation for graphs with up to a few thousand nodes, via the following two strategies. The first is to solve the LP on a subset of the constraints, then iteratively update the constraint set and re-solve the LP as needed, until convergence. The second approach employs the triangle-fixing procedure of Dhillon et al. for the related metric nearness problem [36].
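As an illustration of the first strategy, here is a rough constraint-generation sketch (ours, not the authors' implementation), using the cvxpy modeling library as an assumed dependency; the pair-keyed weight dictionaries follow the earlier sketches.

```python
import itertools
import cvxpy as cp

def solve_lambdacc_lp(nodes, w_plus, w_minus, batch=5000, tol=1e-6):
    """Constraint generation for the correlation clustering LP relaxation:
    solve with a partial set of triangle inequalities, add violated ones,
    and re-solve until none remain."""
    idx = {p: k for k, p in enumerate(itertools.combinations(sorted(nodes), 2))}
    x = cp.Variable(len(idx))
    obj = cp.Minimize(
        sum(w * x[idx[p]] for p, w in w_plus.items())
        + sum(w * (1 - x[idx[p]]) for p, w in w_minus.items()))
    constraints = [x >= 0, x <= 1]

    def pair(i, j):
        return idx[(i, j)] if i < j else idx[(j, i)]

    while True:
        cp.Problem(obj, constraints).solve()
        xv = x.value
        violated = []
        for i, j, k in itertools.combinations(sorted(nodes), 3):
            # check x_ab <= x_ac + x_cb for the three ways to split the triple
            for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
                if xv[pair(a, b)] > xv[pair(a, c)] + xv[pair(c, b)] + tol:
                    violated.append(x[pair(a, b)] <= x[pair(a, c)] + x[pair(c, b)])
            if len(violated) >= batch:
                break
        if not violated:
            return {p: xv[k] for p, k in idx.items()}
        constraints += violated
```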

4.4 Scalable Heuristic Algorithms

To complement the previous approximation-driven approaches, we provide fast greedy local heuristics for the LambdaCC objective. The first of these is GrowCluster, which iteratively selects an unclustered node uniformly at random and forms a cluster around it by greedily aggregating adjacent nodes, until there is no more improvement to the LambdaCC objective.

A variant of this, called GrowClique, is specifically designed for cluster deletion. It also monotonically improves the LambdaCC objective, but differs in that at each iteration it randomly selects several unclustered seed nodes and greedily grows cliques around each of these seeds. The resulting cliques may overlap: at each iteration we select only the largest of these cliques. Pseudocode for GrowCluster and GrowClique is given in Algorithms 2 and 3, respectively.

Input: G = (V, E), λ
Output: a clustering C of G
C ← ∅, W ← V
while W ≠ ∅ do
     1. Choose a uniformly random v ∈ W, set S ← {v}
     2. For all u ∈ W \ S, compute the benefit of merging u into S, where e(u, S) is the number of edges between u and S:
               (For standard LambdaCC: gain(u, S) = e(u, S) − λ|S|)
               (For degree-weighted: gain(u, S) = e(u, S) − λ d_u vol(S))
     3. Set u* ← argmax_{u} gain(u, S)
     while gain(u*, S) > 0 do
          S ← S ∪ {u*}, W ← W \ {u*}
          Update the gains and recompute u*
     Add cluster S to C, update W ← W \ S
Algorithm 2 GrowCluster
Input: G = (V, E), number of seeds k
Output: a clustering C of G where all clusters are cliques
C ← ∅, W ← V
while W ≠ ∅ do
     for i = 1 to k do
          Select a random seed node v ∈ W, set Q_i ← {v}
          Set N ← {u ∈ W : u is adjacent to every node of Q_i}
          while N ≠ ∅ do
               Q_i ← Q_i ∪ {u} for any u ∈ N
               Update N ← {u ∈ W : u is adjacent to every node of Q_i}
     S ← the largest of the cliques Q_1, ..., Q_k
     Add cluster S to C, update W ← W \ S
Algorithm 3 GrowClique
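To make the greedy step concrete, here is a short Python sketch (ours, not the authors' code) of GrowCluster for standard LambdaCC; the merge gain e(v, S) − λ|S| is derived from objective (5) as reconstructed earlier, and the data representation is an assumption for illustration.

```python
import random
from collections import defaultdict

def grow_cluster(nodes, edges, lam, seed=None):
    """GrowCluster sketch for standard LambdaCC: repeatedly pick a random
    unclustered node and greedily add nodes while the objective improves.
    Merging u into S saves (1 - lam) per edge between u and S and costs
    lam per non-adjacent pair created, so the net gain is e(u, S) - lam*|S|."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for (u, v) in edges:
        adj[u].add(v)
        adj[v].add(u)
    unclustered = set(nodes)
    clustering = []
    while unclustered:
        v0 = rng.choice(sorted(unclustered))
        S = {v0}
        unclustered.remove(v0)
        while True:
            best, best_gain = None, 0.0
            for u in unclustered:
                gain = len(adj[u] & S) - lam * len(S)
                if gain > best_gain:
                    best, best_gain = u, gain
            if best is None:
                break
            S.add(best)
            unclustered.remove(best)
        clustering.append(S)
    return clustering
```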

Finally, since the LambdaCC and Hamiltonian objectives are equivalent, we can use previously developed algorithms and software for modularity-like objectives with a resolution parameter. In particular we employ adaptations of the Louvain method, an algorithm developed by Blondel et al. [6]. It iteratively visits each node in the graph and moves it to an adjacent cluster, if such a move gives a locally maximum increase in the modularity score. This continues until no move increases modularity, at which point the clusters are aggregated into super-nodes and the entire process is repeated on the aggregated network. By adapting the original Louvain method to make greedy local moves based on the LambdaCC objective, rather than modularity, we obtain a scalable algorithm that is known to provide good approximations for a related objective, and additionally adapts well to changes in our parameter . We refer to this as Lambda-Louvain. Both standard and degree-weighted versions of the algorithm can be achieved by employing existing generalized Louvain algorithms (e.g., the GenLouvain algorithm of Jeub et al. http://netwiki.amath.unc.edu/GenLouvain/).

Our heuristic algorithms satisfy the following guarantee:

Theorem 6

For every λ, standard (respectively, degree-weighted) Lambda-Louvain either places all nodes in one cluster, or produces clusters that have scaled sparsest cut (respectively, scaled normalized cut) bounded above by λ. The same holds true for GrowCluster.

Proof

Note that by design, when Lambda-Louvain terminates there will be no two clusters which can be merged to yield a better objective score. Just as in the proof of Statement (1) of Theorem 2, for the standard LambdaCC objective this means that for any pair of clusters A and B we have

cut(A, B) ≤ λ|A||B|.        (10)

We then fix A, perform a sum over all other clusters, and get the desired result:

cut(A) ≤ λ|A||Ā|,   i.e.,   cut(A)/(|A||Ā|) ≤ λ.

If we are using degree-weighted Lambda-Louvain, when the algorithm terminates we know that all pairs of clusters satisfy

cut(A, B) ≤ λ vol(A) vol(B),

and the corresponding result for scaled normalized cut holds.

Though slightly less obvious, it is also true that none of the output clusters of GrowCluster (if there are at least two) could be merged to yield a better objective score. Notice that this is certainly true of the first cluster S_1 formed by GrowCluster: we stop growing it when we find that (for standard LambdaCC)

e(v, S_1) − λ|S_1| ≤ 0

for all other nodes v in the graph, where e(v, S_1) is the number of edges between v and S_1. Therefore, given any other subset of nodes B (including sets of nodes making up other clusters that the algorithm will output), summing this inequality over v ∈ B shows

cut(S_1, B) ≤ λ|S_1||B|.

Therefore when we form the second cluster S_2 with GrowCluster, we already know that cut(S_1, S_2) ≤ λ|S_1||S_2|, and similar reasoning shows that the analogous bound will hold for any cluster that is subsequently formed. In this way we see that inequality (10) will also hold between all pairs of clusters output by GrowCluster, so the rest of the result follows. The same steps will also work for degree-weighted LambdaCC.

5 Experiments

We begin by comparing our new methods against existing correlation clustering algorithms on several small networks. This shows our algorithms for LambdaCC are superior to common alternatives. We then study how well-known graph partitioning algorithms implicitly optimize the LambdaCC objective for various . In the following experiments, we explore applications of our methods in ground truth community detection, clique detection in collaboration and gene networks, and social network analysis.

5.1 LambdaCC on Small Networks

Our first experiment shows that Lambda-Louvain is the best general-purpose correlation clustering method for minimizing the LambdaCC objective. We test this on four small networks: Karate [38], Les Mis [24], Polbooks [25], and Football [21]. Figure 3 shows the performance of our algorithms, as well as Pivot and ICM, for a range of λ values. Pivot is the very fast algorithm of Ailon et al. [1], which selects a uniformly random node and clusters it with all of its positive neighbors. ICM is the energy-minimization heuristic algorithm of Bagon and Galun [3].

(a) Karate, (b) Les Mis, (c) Polbooks, (d) Football
Figure 3: We optimize the standard LambdaCC objective with five correlation clustering algorithms on four small networks. The y-axis reports the ratio between each algorithm’s score and the lower bound on the optimal objective determined by solving the LP relaxation. Lambda-Louvain (black) and GrowCluster (red) perform well for all λ in addition to being the most scalable algorithms. In each plot, a yellow vertical line indicates the optimal scaled sparsest cut value, λ*, for that network.

We find that fiveLP gives much better than a 5-approximation in practice. Pivot is much faster, but performs poorly for λ close to zero or one. ICM is also much quicker than solving the LP relaxation, but is still limited in scalability, as it is intended for correlation clustering problems where most edge weights are zero, which is not the case for LambdaCC. On the other hand, GrowCluster and Lambda-Louvain are scalable and give good approximations for all input networks and values of λ.

5.2 Standard Clustering Algorithms

Many existing clustering algorithms implicitly optimize different parameter regimes of the LambdaCC objective. We show this by running several clustering algorithms on a 1000-node synthetic graph generated from the BTER model [34]. We do the same on the largest component (4158 nodes) of the ca-GrQc collaboration network from the arXiv e-print website. We then compute LambdaCC objective scores for each algorithm over a range of λ values. We first cluster each graph using Graclus [17] (forming two clusters), Infomap [8], and Louvain [6]. To form dense clusterings, we also partition the networks by recursively extracting the maximum clique (called RMC), and by recursively extracting the maximum quasi-clique (RMQC), i.e., the largest set of nodes with inner edge density above a fixed bound. The last two procedures must solve an NP-hard objective at each step, but for reasonably sized graphs there is available clique and quasi-clique detection software [33, 26].
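For reference, a small sketch (ours) of the RMC procedure using the networkx library as an assumed dependency; enumerating maximal cliques is exponential in the worst case, so, as noted above, this is only practical for reasonably sized graphs.

```python
import networkx as nx

def recursive_max_clique(G):
    """RMC sketch: repeatedly remove a maximum clique of the remaining graph
    (found here by enumerating maximal cliques) until no nodes remain.
    G is assumed to be an undirected networkx Graph."""
    H = G.copy()
    clusters = []
    while H.number_of_nodes() > 0:
        clique = max(nx.find_cliques(H), key=len)
        clusters.append(set(clique))
        H.remove_nodes_from(clique)
    return clusters
```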

In Figure 4 we report, for each λ, the ratio between each clustering’s objective score and the LambdaCC LP-relaxation lower bound. We also display adjusted Rand index (ARI) scores between the Lambda-Louvain clustering and the output of the other algorithms. We note that the ARI scores peak in the same regime where each algorithm best optimizes LambdaCC. Typically the ARI peaks are higher for larger λ. This can be explained by realizing that when λ is small, fewer clusters are formed, and we would expect there to be many ways to partition the graph into a small number of clusters such that different clusterings share a very similar structure, even if the individual clusters themselves do not match. On the whole, the plots in Figure 4 illustrate that our framework and algorithm effectively interpolate between several well-established strategies in graph partitioning, and can serve as a good proxy for any clustering task for which any one of these algorithms is known to be effective.

(a) BTER graph with 1000 nodes
(b) Largest component of ca-GrQc
(c) ARI scores (BTER)
(d) ARI scores (ca-GrQc)
Figure 4: On top we illustrate the performance of well-known clustering algorithms in approximating the LambdaCC objective on (a) one synthetic and (b) one real-world graph. The bowl-shaped curves indicate that each algorithm implicitly optimizes the LambdaCC objective in a different parameter regime. The y-axis reports the ratio between each clustering’s objective score and the LP-relaxation lower bound. Lambda-Louvain interpolates between all the clustering strategies seen here, always giving a good approximation ratio. In the lower plots we show the ARI score between each clustering and Lambda-Louvain for both the BTER graph (c) and ca-GrQc (d). These show peaks in the same parameter regime where each algorithm is most successful at approximating LambdaCC.

By performing multiple runs of Graclus and varying the number of partitions it forms, we can show that Graclus approximately optimizes different parameter regimes of LambdaCC. In Figure 5(a) we show how the Graclus objective scores change as we increase the number of clusters. As the number of clusters increases, the algorithm performs better and better for large λ and worse for smaller λ. Figure 5(b) shows that something similar occurs for RMQC when we vary the minimum density of the quasi-cliques. As the inner-edge density increases, the performance of RMQC essentially converges to the performance of RMC.

(a) Multiple runs of Graclus on the caGrQc network, with increasingly many clusters
(b) Multiple runs of RMQC on the caGrQc network, with increasingly higher quasi-clique density
Figure 5: As we increase the number of clusters formed by Graclus (left), the algorithm does better for large values of λ and worse for small values. The algorithm seems particularly well-suited to optimizing the LambdaCC objective for very small values of λ. Darker curves represent a larger number of clusters formed. In the right plot we vary the density of the quasi-cliques formed by RMQC; darker curves represent a larger density. As density increases, the curves converge to the performance of RMC, shown in blue.

5.3 Ground Truth Community Detection

Determining how to choose the best value for a resolution parameter is a challenging yet important task in graph clustering. Recently, Jeub et al. proposed a new technique for sampling values of the resolution parameter for the Hamiltonian objective, to avoid resorting to ad hoc methods [jeub2017multiresolution]. In this experiment, we explore a different approach, by considering the relationship between our parameter λ and the structure of ground-truth communities in a network. In particular, we cluster the email-EU graph, which encodes email correspondence between 1005 faculty members organized into 42 departments (the ground truth) at a European research institution [leskovec2007graph]. In order to learn as much as we can about the relationship between our resolution parameter and the ground truth, we purposely select the value of λ that empirically leads to the best adjusted Rand index (ARI) scores between the ground-truth clustering and the output of degree-weighted Lambda-Louvain. We find that, at this best value of λ, our method’s ARI score is much higher than the scores obtained by algorithms that optimize a more rigid objective function (Table 2). This highlights the potential benefit our framework can provide when given the right parameter, and shows the importance of developing good techniques for appropriately selecting λ.

The insight in this experiment comes from comparing normalized cut scores across clusterings. Running Lambda-Louvain with this value of λ yields clusters whose scaled normalized cut scores are similar to those of the ground-truth clusters. Interestingly, these values are roughly an order of magnitude smaller than our choice of λ. This is consistent with our result in Theorem 6 that λ is an upper bound on the scaled normalized cut scores of all clusters formed by Lambda-Louvain. This suggests a general strategy for setting λ when there is some prior knowledge about ground-truth structure. If we are given a target scaled normalized cut score for the output clusters we desire, we know not to set λ equal to or lower than this target. Rather, we choose a resolution parameter that is not too far from the target, but is still a generous upper bound. In future work, we aim to continue researching both theoretically and experimentally how to more precisely determine a priori the right upper-bound λ to use.

Lam-Louv Metis Graclus Louvain InfoMap
0.587 0.359 0.393 0.264 0.273
Table 2: To cluster the email-EU network, we run each method 20 times and report the median ARI score between each clustering and the ground truth. At the best value of λ, degree-weighted Lambda-Louvain easily outperforms other methods that optimize a more rigid objective function. We have Metis [karypis1998fast] form 27 clusters and Graclus form 13, since these settings yield the best results for each algorithm.

5.4 Cliques in Large Collaboration Networks

Identifying large cliques in a network is an important task in graph mining, though it can be computationally challenging due to the NP-completeness of finding maximum cliques. We show that GrowClique is a successful and scalable alternative for this task by using it to cluster two large collaboration networks: one formed from a snapshot of the author-paper DBLP dataset in 2007, and the other generated using actor-movie information from the Notre Dame actors dataset [Barabasi]. The original data in both cases is a bipartite network indicating which ‘players’ (i.e., authors or actors) take part in different ‘projects’ (papers or movies, respectively). We transform each bipartite network into a graph in which nodes are players and edges represent collaboration on a project.

At each iteration GrowClique grows several (possibly overlapping) cliques from random seeds and selects the largest to be included in the final output. We compare against RMC, an expensive method which provably returns a 2-approximation to the optimal cluster deletion objective [16]. We also design ProjectClique, a method that looks at the original bipartite network and recursively identifies the project associated with the largest number of players not yet assigned to a cluster. These players form a clique in the collaboration network, so ProjectClique clusters them together, then repeats the procedure on the remaining nodes.
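A brief Python sketch (ours) of the ProjectClique heuristic just described; the input format, a mapping from project ids to player sets, is an assumption for illustration.

```python
def project_clique(players, projects):
    """ProjectClique sketch: repeatedly pick the project containing the most
    not-yet-clustered players and cluster those players together (they form a
    clique in the collaboration graph); leftover players become singletons.
    `projects` maps each project id to the set of players involved in it."""
    unclustered = set(players)
    clustering = []
    while unclustered and projects:
        best = max(projects.values(), key=lambda ps: len(ps & unclustered))
        members = best & unclustered
        if not members:
            break
        clustering.append(members)
        unclustered -= members
    clustering.extend({p} for p in unclustered)
    return clustering
```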

Table 3 shows that GrowClique outperforms ProjectClique in both cases, and slightly outperforms RMC on the actor network. Our method is therefore competitive against two algorithms that in some sense have an unfair advantage over it: ProjectClique employs knowledge not available to GrowClique regarding the original bipartite dataset, and RMC performs very well mainly because it solves an NP-hard problem at each step.

Dataset Nodes Edges GC PC RMC
Actors 341,185 10,643,420 8,085,286 8,086,715 8,087,241
DBLP 526,303 1,616,814 945,489 946,295 944,087
Table 3: Cluster deletion scores for GrowClique (GC), ProjectClique (PC) and RMC on two collaboration networks. GrowClique is agnostic to the underlying player-project bipartite network, and does not solve an NP-hard objective at each iteration, yet returns very good results.

5.5 Clustering Yeast Genes

The study of cluster deletion and cluster editing (the latter is equivalent to ±1 correlation clustering) was originally motivated by applications to clustering genes using expression patterns [5, 35]. Standard LambdaCC is a natural framework for this, since it generalizes both objectives and interpolates between them as λ ranges from 1/2 to 1. We cluster genes of the Saccharomyces cerevisiae yeast organism using microarray expression data collected by Kemmeren et al. [kemmeren2014large]. From the expression values in the dataset, we compute correlation coefficients between all pairs of genes. We threshold these to obtain a small disconnected graph whose nodes correspond to unique genes, which we cluster with fourCD. For this cluster deletion experiment, our algorithm returns the optimal solution: solving the LP relaxation returns a solution that is in fact integral. We validate each clique of size at least three returned by fourCD against known gene-association data from the Saccharomyces Genome Database (SGD) and the String Consortium Database (see Table 4). With one exception, these cliques match groups of genes that are known to be strongly associated, according to at least one validation database. The exception is a cluster with four genes (YHR093W, YIL171W, YDR490C, and YOR225W), three of which, according to the SGD, are not known to be associated with any Gene Ontology term. We conjecture that this may indicate a relationship between genes not previously known to be related.

Clique # | Size | Shared GO term | Term % | String
1 6 nucleus 34.3 0
2 4 nucleus 34.3 202
3 4 N/A - 0
4 4 vitamin metabolic process 0.7 980
5 3 cytoplasm 67.0 990
6 3 cytoplasm 67.0 998
7 3 N/A - 962
8 3 cytoplasm 67.0 996
9 3 N/A - 973
10 3 transposition 1.7 0
Table 4: We list cliques of size at least three in the optimal clustering (found by fourCD) of the network of yeast genes. We validate each cluster using the SGD GO Slim Mapper tool, which identifies GO terms (functions, processes, or components of the organism) for a given gene. We list one GO term shared by all genes in the cluster, if one exists. The Term % column reports the percentage of all genes in the organism associated with this term. A low percentage indicates a cluster of genes that share a process, component, or function that is not widely shared among other genes. The final column shows the minimum String association score between every pair of genes in the cluster (higher is better). Any non-zero score is a strong indication of gene association, as the majority of String scores between genes of S. cerevisiae are zero. All clusters, except the third, either have a high minimum String score or are all associated with a specific GO term.

5.6 Social Network Analysis with LambdaCC

Clustering a social network using a range of resolution parameters can reveal valuable insights about how links are formed in the network. Here we examine several graphs from the Facebook100 dataset, each of which represents the induced subgraph of the Facebook network corresponding to a US university at some point in 2005. The networks come with anonymized meta-data, reporting attributes such as major and graduation year for each node. While meta-data attributes are not expected to correspond to ground-truth communities in the network [31], we do expect them to play a role in how friendship links and communities are formed. In this experiment we illustrate strong correlations between the link structure of the networks and the dorm, graduation year, and student/faculty status meta-data attributes. We also see how these correlations are revealed, to different degrees, depending on our choice of .

Given a Facebook subgraph, we cluster it with degree-weighted Lambda-Louvain for a range of λ values. In this clustering, we refer to two nodes in the same cluster as an interior pair. We measure how well a meta-data attribute correlates with the clustering by calculating the proportion of interior pairs that share the same value for that attribute. This value can also be interpreted as the probability that an interior pair selected uniformly at random agrees on the attribute. To determine whether this probability is meaningful, we compare it against a null probability: the probability that a random interior pair agrees on a fake meta-data attribute. We assign each node a value for the fake attribute by performing a random permutation on the vector storing values for the true attribute. In this way, we compare each true attribute against a fake attribute that has exactly the same proportion of nodes with each attribute value, but does not impart any true information about each node.
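The following Python sketch (ours, not the paper’s code) computes both the attribute-agreement probability over interior pairs and its permutation-based null, as described above.

```python
import random
from itertools import combinations

def interior_agreement(cluster_of, attr, trials=1, seed=0):
    """Return (true probability, null probability) that a random interior pair
    (two nodes sharing a cluster) also shares an attribute value; the null is
    obtained by randomly permuting attribute values across nodes."""
    nodes = list(cluster_of)

    def agreement(a):
        same_cluster = agree = 0
        for u, v in combinations(nodes, 2):
            if cluster_of[u] == cluster_of[v]:
                same_cluster += 1
                agree += (a[u] == a[v])
        return agree / same_cluster if same_cluster else float('nan')

    rng = random.Random(seed)
    nulls = []
    for _ in range(trials):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        fake = {u: attr[v] for u, v in zip(nodes, shuffled)}
        nulls.append(agreement(fake))
    return agreement(attr), sum(nulls) / len(nulls)
```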

In Figure 6 we plot results for each of the three attributes on four Facebook networks, as λ is varied. In all cases, we see significant differences between the true probabilities and their null counterparts. In general, the graduation-year and student/faculty probabilities reach a peak at small values of λ, when clusters are large, whereas the dorm probability is highest when λ is large and clusters are small. This indicates that the first two attributes are more highly correlated with large sparse communities in the network, whereas sharing a dorm is more correlated with smaller, denser communities. Caltech, a small residential university, is an exception to these trends and exhibits a much stronger correlation with the dorm attribute, even for very small λ.

(a) Swarthmore
(b) Yale
(c) Cornell
(d) Caltech
Figure 6: On four university Facebook graphs, we illustrate that the dorm (red), graduation year (green), and student/faculty (S/F) status (blue) meta-data attributes all correlate highly with the clustering found by Lambda-Louvain, for each λ. Above the x-axis we show the number of clusters formed, which strictly increases with λ. The y-axis reports the probability that two nodes sharing a cluster also share an attribute value. Each attribute curve is compared against a null probability, shown as a dashed line of the same color. The large gaps between each attribute curve and its null probability indicate that the link structure of all networks is highly correlated with these attributes. In general, probabilities for graduation year and S/F status are highest for small λ, whereas dorm has a higher correlation with smaller, denser communities in the network. Caltech is an exception to the general trend; see the main text for discussion.

6 Discussion

We have introduced a new clustering framework that unifies several other commonly used objectives and offers many attractive theoretical properties. We prove that our objective function interpolates between the sparsest cut objective and the cluster deletion problem as we vary a single input parameter, λ. We give a 5-approximation algorithm for our objective when λ ≥ 1/2, and a related method which is the first constant-factor approximation for the cluster deletion problem. We also give scalable procedures for greedily improving our objective, which are successful in a wide variety of clustering applications. These methods are easily modified to add must-cluster and cannot-cluster constraints, which makes them amenable to many applications. In future work, we will continue exploring approximations for λ < 1/2.

Acknowledgements

This work was supported by several funding agencies: Nate Veldt and David Gleich are supported by NSF award IIS-154648, David Gleich is additionally supported by NSF awards CCF-1149756 and CCF-093937 as well as the DARPA Simplex program and the Sloan Foundation. Anthony Wirth is supported by the Australian Research Council. We also thank Flavio Chierichetti for several helpful conversations.

References

  • [1] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008.
  • [2] S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2), 2009.
  • [3] S. Bagon and M. Galun. Large scale correlation clustering optimization. arXiv, cs.CV:1112.2903, 2011.
  • [4] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.
  • [5] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of computational biology, 6(3-4):281–297, 1999.
  • [6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
  • [7] S. Böcker and P. Damaschke. Even faster parameterized cluster deletion and cluster editing. Information Processing Letters, 111(14):717 – 721, 2011.
  • [8] L. Bohlin, D. Edler, A. Lancichinetti, and M. Rosvall. Community detection and visualization of networks with the map equation framework. In Measuring Scholarly Impact, pages 3–34. Springer, 2014.
  • [9] F. Bonomo, G. Duran, A. Napoli, and M. Valencia-Pabon. A one-to-one correspondence between potential solutions of the cluster deletion problem and the minimum sum coloring problem, and its application to p4-sparse graphs. Information Processing Letters, 115:600–603, 2015.
  • [10] F. Bonomo, G. Duran, and M. Valencia-Pabon. Complexity of the cluster deletion problem on subclasses of chordal graphs. Theoretical Computer Science, 600:59–69, 2015.
  • [11] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360 – 383, 2005. Learning Theory 2003.
  • [12] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near optimal lp rounding algorithm for correlation clustering on complete and complete -partite graphs. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 219–228. ACM, 2015.
  • [13] P. Damaschke. Bounded-Degree Techniques Accelerate Some Parameterized Graph Algorithms, pages 98–109. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
  • [14] J.-C. Delvenne, S. N. Yaliraki, and M. Barahona. Stability of graph communities across time scales. Proceedings of the National Academy of Sciences, 107(29):12755–12760, 2010.
  • [15] E. D. Demaine and N. Immorlica. Correlation clustering with partial information. In S. Arora, K. Jansen, J. D. P. Rolim, and A. Sahai, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques: 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2003 and 7th International Workshop on Randomization and Approximation Techniques in Computer Science, RANDOM 2003, Princeton, NJ, USA, August 24-26, 2003. Proceedings, pages 1–13, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
  • [16] A. Dessmark, J. Jansson, A. Lingas, E.-M. Lundell, and M. Person. On the approximability of maximum and minimum edge clique partition problems. International Journal of Foundations of Computer Science, 18(02):217–226, 2007.
  • [17] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11), 2007.
  • [18] T. N. Dinh, X. Li, and M. T. Thai. Network clustering via maximizing modularity: Approximation algorithms and theoretical limits. In Proceedings of the 2015 IEEE International Conference on Data Mining (ICDM), pages 101–110. IEEE, 2015.
  • [19] D. Emanuel and A. Fiat. Correlation clustering – minimizing disagreements on arbitrary weighted graphs. In G. Di Battista and U. Zwick, editors, Algorithms - ESA 2003: 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003. Proceedings, pages 208–220, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
  • [20] Y. Gao, D. R. Hare, and J. Nastos. The cluster deletion problem for cographs. Discrete Mathematics, 313(23):2763–2771, 2013.
  • [21] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
  • [22] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Fixed-parameter algorithms for clique generation. In R. Petreschi, G. Persiano, and R. Silvestri, editors, Algorithms and Complexity: 5th Italian Conference, CIAC 2003, Rome, Italy, May 28–30, 2003. Proceedings, pages 108–119, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
  • [23] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, 39(4):321–347, 2004.
  • [24] D. E. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, Reading, MA, 1993.
  • [25] V. Krebs. Books about US Politics, 2004. Hosted at UCI Data Repository.
  • [26] G. Liu and L. Wong. Effective pruning techniques for mining quasi-cliques. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part II, pages 33–49, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
  • [27] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876–878, 2010.
  • [28] A. Natanzon, R. Shamir, and R. Sharan. Complexity classification of some edge modification problems. In International Workshop on Graph-Theoretic Concepts in Computer Science, pages 65–77. Springer, 1999.
  • [29] M. E. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74(3):036104, 2006.
  • [30] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 69(026113), 2004.
  • [31] L. Peel, D. B. Larremore, and A. Clauset. The ground truth about metadata and community detection in networks. Science Advances, 3(5), 2017.
  • [32] J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Physical Review E, 74(016110), 2006.
  • [33] R. A. Rossi, D. F. Gleich, and A. H. Gebremedhin. Parallel maximum clique algorithms with applications to network analysis. SIAM Journal on Scientific Computing, 37(5):C589–C616, 2015.
  • [34] C. Seshadhri, T. G. Kolda, and A. Pinar. Community structure and scale-free collections of Erdős-Rényi graphs. Physical Review E, 85(5), May 2012.
  • [35] R. Shamir, R. Sharan, and D. Tsur. Cluster graph modification problems. Discrete Applied Mathematics, 144:173–182, 2004.
  • [36] S. Sra, J. Tropp, and I. S. Dhillon. Triangle fixing algorithms for the metric nearness problem. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 361–368. MIT Press, 2005.
  • [37] C. Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 526–527. Society for Industrial and Applied Mathematics, 2004.
  • [38] W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.