# Two provably consistent divide and conquer clustering algorithms for large networks

## Abstract

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms that perform clustering on a number of small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly reduce the computational cost of traditional algorithms, including spectral clustering, semi-definite programs, modularity based methods, likelihood based methods, etc., without losing accuracy, and at times even improving it. These algorithms are also, by nature, parallelizable. Thus, exploiting the facts that most traditional algorithms are accurate and the corresponding optimization problems are much simpler on small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.

## 1 Glossary of notation

For the convenience of the reader, we collect here some of the more frequently used notation in the paper, and provide a summarizing phrase for each, as well as the page number at which the notation first appears.

## 2 Introduction

Community detection, also known as community extraction or network clustering, is a central problem in network inference. A wide variety of real world problems, ranging from finding protein-protein complexes in gene networks [13] to studying the consequence of social cliques on adolescent behavioral patterns [15], depend on detecting and analyzing communities in networks. In most of these problems, one observes co-occurrences or interactions between pairs of entities, i.e. pairwise edges, and possibly additional node attributes. The goal is to infer the hidden community memberships.

Many of these real world networks are massive, and hence it is crucial to develop and analyze scalable algorithms for community detection. We will first talk about methodology that uses only the network connections for inference. These methods can be divided into roughly two types. The first type consists of methods which are derived independently of any model assumption. These typically involve the formulation of a global optimization problem; examples include normalized cuts [28], Spectral Clustering [24], etc.

On the other end, statisticians often devise techniques under model assumptions. The simplest statistical model for networks with communities is the Stochastic Blockmodel (SBM) [14]. The key idea in an SBM is to enforce stochastic equivalence, i.e. two nodes in the same latent community have identical probabilities of connection to all nodes in the network. There are many extensions of the SBM. The Degree Corrected Stochastic Blockmodel [16] allows one to model varied degrees in the same community, whereas a standard SBM does not. Mixed membership blockmodels [2] allow a node to belong to multiple communities, whereas in an SBM, a node can belong to exactly one cluster.

For an SBM generating a network with $n$ nodes and $K$ communities, one has a hidden community/cluster membership matrix $Z \in \{0,1\}^{n \times K}$, where $Z_{ik} = 1$ if node $i$ is in community $k$. Given these memberships, the link formation probabilities are given as

$$\mathbb{P}(A_{ij} = 1 \mid Z_{ik} = 1, Z_{jk'} = 1) = B_{kk'},$$

where $B$ is a $K \times K$ symmetric parameter matrix of probabilities. The elements of $B$ may decay to zero as $n$ grows to infinity, to model sparse networks.
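To make the generative model concrete, here is a minimal sketch of sampling an undirected network from a stochastic blockmodel. The function name `sample_sbm`, the list-of-lists representation, and the example probabilities are assumptions made for illustration; they do not come from the paper.

```python
import random

def sample_sbm(labels, B, seed=0):
    """Sample a symmetric 0/1 adjacency matrix without self-loops.

    labels[i] is the community of node i, and B[k][l] is the probability
    of an edge between a node of community k and a node of community l.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is connected independently, given the memberships.
            if rng.random() < B[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return A

# Two balanced communities, dense within blocks and sparse across them.
labels = [0] * 30 + [1] * 30
B = [[0.5, 0.05], [0.05, 0.5]]
A = sample_sbm(labels, B)
```

Scaling every entry of `B` by a factor that shrinks with the number of nodes would mimic the sparse regimes discussed in the text.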

Typically the goal is to estimate the latent memberships consistently. A method outputting an estimate $\hat{Z}$ is called strongly consistent if $\mathbb{P}(\hat{Z} = Z\Pi) \to 1$ for some permutation matrix $\Pi$, as $n \to \infty$. A weaker notion of consistency is when the fraction of misclustered nodes goes to zero as $n$ goes to infinity. Typically, most consistency results are derived in the regime where the average degree of the network grows faster than the logarithm of $n$. This is often called the semi-dense regime. When the average degree is bounded, we are in the sparse regime.

There is a plethora of algorithms for community detection. These include likelihood-based methods [3], modularity based methods [29], spectral methods [26], semi-definite programming (SDP) based approaches [8], etc. Among these, spectral methods are scalable since the main bottleneck is computing the top eigenvectors of a large and often sparse matrix. While the theoretical guarantees of Spectral Clustering are typically proven in the semi-dense regime [21], a regularized version of it has been shown to perform better than a random predictor for sparse networks [17]. Profile likelihood methods [6] involve greedy search over all possible membership matrices, which makes them computationally expensive. Semi-definite programs are robust to outliers [8], are shown to be strongly consistent in the dense regime [4], and yield a small but non-vanishing error in the sparse regime [12]. However, semi-definite programs are slow and typically only scale to thousands of nodes, not millions of nodes.

Methods like spectral clustering on geodesic distances [5] are provably consistent in the semi-dense case, and can give a small error in sparse cases. However, they require computing shortest paths between all pairs of nodes, which can pose serious problems for both computation and storage for very large graphs.

Monte Carlo methods [29], which are popular tools in Bayesian frameworks, are typically not scalable. More scalable alternatives such as variational methods [10] do not have provable guarantees for consistency, and often suffer from bad local minima.

So far we have discussed community detection methods which only look at the network connections, and not node attributes, which are often also available and may possess useful information on the community structure (see, e.g., [22]). There are extensions of the methods mentioned earlier which accommodate node attributes, e.g., modularity based [34], spectral [7], SDP based [32], etc. These methods come with theoretical guarantees and have good performance in moderately sized networks. While existing Bayesian methods [20] are more amenable to incorporating covariates in the inference procedure, they often are computationally expensive and lack rigorous theoretical guarantees.

While the above-mentioned algorithms are diverse and each has unique aspects, scaling them to very large datasets requires different computational tools tailored to different algorithmic settings. For example, while stochastic variational updates may be suitable to scale Bayesian methods, pseudo likelihood methods are better optimized using row sums of edges inside different node blocks.

In this article, we propose a divide and conquer approach to community detection. The idea is to apply a community detection method on small subgraphs of a large graph, and then stitch the results together. If we could achieve this, we would be able to scale up any community detection method (which may involve covariates as well as the network structure) that is computationally feasible on small graphs, but is difficult to execute on very large networks. This would be especially useful for computationally expensive community detection methods (such as SDPs, modularity based methods, Bayesian methods etc.). Another possible advantage concerns variational likelihood methods (such as mean-field) with a large number (depending on $n$) of local parameters, which typically have an optimization landscape riddled with local minima. For smaller graphs there are fewer parameters to fit, and the optimization problem often becomes easier.

Clearly, the principal difficulty in doing this is matching the possibly conflicting label assignments from different subgraphs (see Figure ?(a) for an example). This immediately rules out a simple-minded averaging of estimates $\hat{Z}_S$ of cluster membership matrices, for various subgraphs $S$, as a viable stitching method.

In this regard, we propose two different stitching algorithms. The first is called Piecewise Averaged Community Estimation (PACE), in which we focus on estimating the clustering matrix $C = ZZ^\top$, which is labeling invariant, since the $(i,j)$-th element of this matrix being one simply means that nodes $i$ and $j$ belong to the same cluster, whereas the value zero means $i$ and $j$ belong to two different clusters. Thus we first compute estimates of the corresponding submatrices of $C$ for various subgraphs and then average over these matrices to obtain an estimate $\hat{C}$ of $C$. Finally we apply some computationally cheap clustering algorithm like approximate $K$-means, DGCluster, spectral clustering etc. on $\hat{C}$ to recover an estimate of $Z$.

We also propose another algorithm called Global Alignment of Local Estimates (GALE), where we first take a sequence of subgraphs, such that any two consecutive subgraphs on this sequence have a large intersection, and then traverse through this sequence, aligning the clustering based on a subgraph with an averaged clustering of the union of all its predecessor subgraphs in the sequence, which have already been aligned. The alignment is done via an algorithm called Match which identifies the right permutation to align two clusterings on two subgraphs by computing the confusion matrix of these two clusterings restricted to the intersection of the two subgraphs. Whereas a naive approach would entail searching through all permutations, Match finds the right permutation in time. Once this alignment step is complete, we get an averaged clustering of the union of all subgraphs (which covers all the vertices). By design GALE works with estimates of cluster membership matrices directly to output an estimate of , and thus, unlike PACE, avoids the extra overhead of recovering such an estimate from .

The rest of the paper is organized as follows. In Section 3 we describe our algorithms. In Section 4 we state our main results and some applications. Section 5 contains simulations and real data experiments. In Section 7 we provide proofs of our main results, while relegating some of the details to Appendix B. Finally, in Section 6 we conclude with some discussions and directions for future work.

## 3 Two divide and conquer algorithms

As we discussed in the introduction, the main issue with divide and conquer algorithms for clustering is that one has to somehow match up various potentially conflicting label assignments. In this vein we propose two algorithms to accomplish this task. Both algorithms first compute the clustering on small patches of a network; these patches can be induced subgraphs of random subsamples of nodes, or neighborhoods. However, the stitching procedures are different.

### 3.1 PACE: an averaging algorithm

Suppose $A$ is the adjacency matrix of a network with true cluster membership of its nodes being given by the $n \times K$ matrix $Z$, where there are $K$ clusters. Set $C = ZZ^\top$ to be the clustering matrix whose $(i,j)$-th entry is the indicator of whether nodes $i, j$ belong to the same cluster. Given $A$ we will perform a local clustering algorithm to obtain an estimate $\hat{C}$ of $C$, from which an estimate $\hat{Z}$ of the cluster memberships may be reconstructed.

The parameter in PACE seems to reduce variance in estimation quality as it discards information from less credible sources — if a pair of nodes has appeared in only a few subgraphs, we do not trust what the patching has to say about them. Setting seems to work well in practice (this choice is also justified by our theory).
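The averaging step can be sketched as follows. This is only an illustration of the idea described above, not the authors' listing: it assumes each subgraph clustering is given as a dict from node index to a local label, uses a hypothetical parameter `tau` for the minimum number of subgraphs in which a pair must co-occur, and sets the diagonal to one by convention.

```python
from itertools import combinations

def pace(n, subgraph_labels, tau):
    """Average co-clustering indicators over subgraphs (illustrative sketch).

    subgraph_labels: list of dicts {node: local cluster label}; the labels
    of different subgraphs need not be aligned with one another.
    tau: pairs seen together in fewer than tau subgraphs are set to 0.
    """
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    seen = [[0] * n for _ in range(n)]      # times i and j shared a subgraph
    for lab in subgraph_labels:
        for i, j in combinations(sorted(lab), 2):
            seen[i][j] += 1
            seen[j][i] += 1
            if lab[i] == lab[j]:
                together[i][j] += 1
                together[j][i] += 1
    C_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        C_hat[i][i] = 1.0  # assumption: a node is always clustered with itself
        for j in range(n):
            if i != j and seen[i][j] > 0 and seen[i][j] >= tau:
                C_hat[i][j] = together[i][j] / seen[i][j]
    return C_hat

# Nodes 0, 1 form one community and 2, 3, 4 another; the second subgraph
# uses swapped local labels, which does not affect the average.
subgraphs = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {1: 1, 2: 0, 3: 0, 4: 0},
    {0: 0, 1: 0, 4: 1},
]
C_hat = pace(5, subgraphs, tau=1)
```

Because only "same cluster or not" is recorded, conflicting label names across subgraphs never need to be matched, which is the point of working with the clustering matrix.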

A slight variant of Algorithm ? is where we allow subgraph and/or node-pair specific weights in the computation of the final estimate, i.e.

where now equals . We may call this estimator w-PACE standing for weighted-PACE. If the weights are all equal, w-PACE becomes equivalent to ordinary PACE. There are natural nontrivial choices, including

• which will place more weight on estimates based on large subgraphs,

• , where denotes the degree of node in subgraph (this will put more weight on pairs which have high degree in ).

The first prescription above is intimately related to the following sampling scheme for ordinary PACE: pick subgraphs with probability proportional to their sizes. For instance, in Section 5.2 we analyze the political blog data of [1] where neighborhood subgraphs are chosen by selecting their roots with probability proportional to degree.

In real world applications, it might make more sense to choose these weights based on domain knowledge (for instance, it may be that certain subnetworks are known to be important). Another (minor) advantage of having weights is that when and , we have and so if , then

i.e. w-PACE becomes the estimator based on the full graph. This is for example true with , because is typically much smaller than . However, ordinary PACE lacks this property unless , in fact, with , the estimate returned by PACE is identically . Anyway, in what follows, we will stick with ordinary PACE because of its simplicity.

Before we discuss how to reconstruct an estimate of from , let us note that we may obtain a binary matrix by thresholding at some level (for example, ):

This thresholding does not change consistency properties (see Lemma ?). Looking at a plot of this matrix gives a good visual representation of the community structure. In what follows, we work with unthresholded .

Reconstruction of : How do we actually reconstruct from ? The key is to note that members of the same community have identical rows in and that, thanks to PACE, we have gotten hold of a consistent estimate of . Thus we may use any clustering algorithm on the rows of to recover the community memberships. Another option would be to run spectral clustering on the matrix itself. However, as the rows of are -vectors, most clustering algorithms will typically have a running time of 2 at best. Indeed, the main computational bottleneck of any distance based clustering algorithm in a high dimensional situation like the present one, is computing which takes bit operations. However, since we have gotten a good estimate of , we can project the rows of onto lower dimensions, without distorting the distances too much. The famous Johnson-Lindenstrauss Lemma for random projections says that by projecting onto dimensions, one can keep, with probability at least , the distances between projected vectors within a factor of of the true distances. Choosing as inverse polylog we need to project onto polylog dimensions and this would then readily bring the computational cost of any distance based algorithm down from to .

Following the discussion of the above paragraph, we first do a random projection of the rows of onto dimensions and then apply a (distance-based) clustering algorithm.

As for , we may use approximate -means or any other distance based clustering algorithm, e.g., DGCluster, a greedy algorithm presented in Appendix A as Algorithm ?.
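The projection-then-cluster recipe can be sketched as below. This is a minimal illustration under stated assumptions: a dense Gaussian random projection (one standard Johnson-Lindenstrauss construction), a plain Lloyd-style $K$-means with farthest-point initialization standing in for "approximate $K$-means", and hypothetical names `random_project` and `kmeans`. It is not the DGCluster algorithm of the appendix.

```python
import math
import random

def random_project(rows, d, seed=0):
    """Project each row onto d dimensions with a scaled Gaussian matrix."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Entries are N(0, 1) / sqrt(d), so squared norms are kept in expectation.
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
         for _ in range(n_cols)]
    return [[sum(x * R[j][t] for j, x in enumerate(row)) for t in range(d)]
            for row in rows]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=20):
    """Lloyd iterations with farthest-point initialization (illustrative)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Rows of an exact clustering matrix: identical within a community.
truth = [0, 0, 0, 1, 1, 1, 1]
C_hat = [[1.0 if truth[i] == truth[j] else 0.0 for j in range(7)]
         for i in range(7)]
labels = kmeans(random_project(C_hat, d=3), k=2)
```

Each distance computation now costs time proportional to the projected dimension `d` rather than to the number of nodes, which is where the savings come from.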

### 3.2 GALE: a sequential algorithm

First we introduce a simple algorithm for computing the best permutation to align labels of one clustering () to another () of the same set of nodes (with fixed ordering) in a set . The idea is to first compute the confusion matrix between two clusterings. Note that if the two labelings each have low error with respect to some unknown true labeling, then the confusion matrix will be close to a diagonal matrix up to permutations. The following algorithm essentially finds a best permutation to align one clustering to another.
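One way to realize the confusion-matrix idea is a greedy assignment: repeatedly take the largest remaining entry of the confusion matrix and pair its row and column. The sketch below is an assumption about how such a matching step could look, with hypothetical function names; it is not claimed to be the exact listing or running time of the Match algorithm.

```python
def confusion_matrix(a, b, K):
    """M[r][c] counts the nodes with label r in clustering a and c in b."""
    M = [[0] * K for _ in range(K)]
    for x, y in zip(a, b):
        M[x][y] += 1
    return M

def match(a, b, K):
    """Return perm so that perm[b[i]] agrees with a[i] as often as the
    greedy pairing allows (illustrative sketch)."""
    M = confusion_matrix(a, b, K)
    # Sort all K*K entries once, largest counts first.
    entries = sorted(((M[r][c], r, c) for r in range(K) for c in range(K)),
                     reverse=True)
    perm = [None] * K
    used_rows = set()
    for _, r, c in entries:
        if r not in used_rows and perm[c] is None:
            perm[c] = r          # label c of b is renamed to label r of a
            used_rows.add(r)
    return perm

a = [0, 0, 0, 1, 1, 2, 2, 2]
b = [2, 2, 2, 0, 0, 1, 1, 0]   # same structure, different label names
perm = match(a, b, K=3)
aligned_b = [perm[y] for y in b]
```

Here `perm` comes out as `[1, 2, 0]`, and `aligned_b` disagrees with `a` only at the last node. The sketch sorts $K^2$ entries instead of searching over all $K!$ permutations.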

Now we present our sequential algorithm which aligns labelings across different subgraphs. The idea is to first fix the indexing of the nodes; cluster the subgraphs (possibly with a parallel implementation) using some algorithm, and then align the clusterings along a sequence of subgraphs. To make things precise, we make the following definition.

After we construct a traversal, we travel through this traversal such that at any step, we align the current subgraph’s labels using the Match algorithm (Algorithm ?) on its intersection with the union of the previously aligned subgraphs. At the end, all subgraph labellings are aligned to the labeling of the starting subgraph. Now we can simply take an average or a majority vote between these.

Implementation details: Constructing a traversal of the subgraphs can be done using a depth first search of the super-graph of subgraphs. For our implementation, we start with a large enough subgraph (the parent), pick another subgraph that has a large overlap with it (the child), align it and note that this subgraph has been visited. Now recursively find another unvisited child of the current subgraph, and so on. It is possible that a particular path did not cover all vertices, and hence it is ideal to estimate clusterings with multiple traversals with different starting subgraphs and then align all these clusterings, and take an average. This is what we do for real networks. We also note that at any step, if we find a poorly clustered subgraph, then this can give a bad permutation which may deteriorate the quality of aligning the subsequent subgraphs on the path. In order to avoid this we use a self validation routine. Let be intersection of current subgraph with union of the previously visited subgraphs. After aligning the current subgraph’s clustering, we compute the classification accuracy of the current labeling of with the previous labeling of . If this accuracy is large enough, we use this subgraph, and if not we move to the next subgraph on the path. For implementation, we use a threshold of .
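The sequential alignment with self validation can be sketched as follows. Several choices here are assumptions made for illustration: the traversal is passed in as an already ordered list, a brute-force search over permutations stands in for the matching step (reasonable only for small numbers of clusters), previously aligned labels are combined by majority vote, and `min_accuracy` is a hypothetical name for the validation threshold.

```python
from collections import Counter, defaultdict
from itertools import permutations

def best_perm(ref, cur, K):
    """Brute-force stand-in for the matching step (small K only)."""
    return max(permutations(range(K)),
               key=lambda p: sum(r == p[c] for r, c in zip(ref, cur)))

def gale(sequence, K, min_accuracy):
    """Align subgraph clusterings along a traversal (illustrative sketch).

    sequence: list of dicts {node: local label}, in traversal order.
    min_accuracy: a subgraph is skipped if, after alignment, it agrees with
    the current labels on less than this fraction of the overlap.
    """
    votes = defaultdict(Counter)
    for node, lab in sequence[0].items():
        votes[node][lab] += 1
    for cur in sequence[1:]:
        common = sorted(set(cur) & set(votes))
        if not common:
            continue  # nothing to align against
        ref = [votes[v].most_common(1)[0][0] for v in common]
        loc = [cur[v] for v in common]
        perm = best_perm(ref, loc, K)
        acc = sum(r == perm[c] for r, c in zip(ref, loc)) / len(common)
        if acc < min_accuracy:
            continue  # self validation: drop poorly aligned subgraphs
        for node, lab in cur.items():
            votes[node][perm[lab]] += 1
    # Majority vote over all aligned subgraphs containing each node.
    return {v: c.most_common(1)[0][0] for v, c in votes.items()}

# Nodes 0-2 and 3-5 form two communities; later subgraphs use swapped labels.
sequence = [
    {0: 0, 1: 0, 3: 1, 4: 1},
    {1: 1, 2: 1, 4: 0, 5: 0},
    {0: 1, 2: 1, 3: 0, 5: 0},
]
labels = gale(sequence, K=2, min_accuracy=0.8)
```

All labels end up expressed in the naming of the first subgraph, which plays the role of the starting subgraph of the traversal.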

Computational time and storage: The main computational bottleneck in GALE is in building a traversal through the random subgraphs. Let be the time for computing the clusterings for subgraphs in parallel. A naive implementation would require computing intersections between all pairs of -subsets. As we will show in our theoretical analysis, we take , where (here , where is the size of the -th cluster) and . Taking , the computation of intersections takes time. Further, a naive comparison for computing subsets similar or close to a given one would require time for each subset leading to computation. However, for building a traversal one only needs access to subsets with large overlap with a given subset, which is a classic example of nearest neighbor search in Computer Science.

One such method is the widely used and theoretically analyzed technique of Locality Sensitive Hashing (LSH). A hash function maps any data object of an arbitrary size to another object of a fixed size. In our case, we map the characteristic vector of a subset to a number. The idea of LSH is to compute hash functions for two subsets and such that the two functions are the same with high probability if and are “close”. In fact, the amount of overlap normalized by is simply the cosine similarity between the characteristic vectors and of the two subsets, for which efficient hashing schemes using random projections exist [9], with

For LSH schemes, one needs to build hash tables, for some , that governs the approximation quality. In each hash table, a “bucket” corresponding to an index stores all subsets which have been hashed to this index. For any query point, one evaluates hash functions and examines subsets hashed to those buckets in the respective hash tables. Now for these subsets, the distance is computed exactly. The preprocessing time is , with storage being , and total query time being . This brings down the running time added to the algorithm specific from sub-quadratic time to nearly linear time, i.e. .
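A minimal sketch of the random-projection hashing idea is given below. It hashes the characteristic vector of a subset by the signs of its inner products with random Gaussian directions; for one such sign, two vectors at angle $\theta$ collide with probability $1 - \theta/\pi$. The function names, the number of tables, and the number of bits per table are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def make_hash(n, bits, rng):
    """One hash function: `bits` random hyperplanes in n dimensions."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(bits)]
    def h(subset):
        # The inner product with a 0/1 characteristic vector is a sum
        # of the plane's coordinates over the subset.
        return tuple(sum(p[v] for v in subset) >= 0 for p in planes)
    return h

def build_tables(subsets, n, n_tables, bits, seed=0):
    """Hash every subset into a bucket of each table."""
    rng = random.Random(seed)
    hashes = [make_hash(n, bits, rng) for _ in range(n_tables)]
    tables = [defaultdict(list) for _ in range(n_tables)]
    for idx, s in enumerate(subsets):
        for h, table in zip(hashes, tables):
            table[h(s)].append(idx)
    return hashes, tables

def query(subset, subsets, hashes, tables):
    """Collect candidates from matching buckets, then rank them by exact overlap."""
    candidates = set()
    for h, table in zip(hashes, tables):
        candidates.update(table.get(h(subset), []))
    return sorted(candidates,
                  key=lambda i: -len(set(subset) & set(subsets[i])))

# Sliding windows of size 10 over 50 nodes; neighbors overlap in 5 nodes.
subsets = [list(range(i, i + 10)) for i in range(0, 40, 5)]
hashes, tables = build_tables(subsets, n=50, n_tables=4, bits=6)
ranked = query(subsets[0], subsets, hashes, tables)
```

Exact overlaps are computed only for the candidates that share a bucket with the query, rather than for all pairs of subsets.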

Thus, for other nearly linear time clustering algorithms GALE may not lead to computational savings. However, for algorithms like Profile Likelihood or SDP which are well known to be computationally intensive, GALE can lead to a significant computational saving without requiring a lot of storage.

### 3.3 Remarks on sampling schemes

With PACE we have mainly used random -subgraphs, -hop neighborhoods, and onion neighborhoods, but many other subgraph sampling schemes are possible. Examples include choosing roots of hop neighborhoods with probability proportional to degree, or sampling roots from high degree nodes (we have done this in our analysis of the political blog data, in Section 5.2). As discussed earlier, this weighted sampling scheme is related to w-PACE. A natural question regarding -hop neighborhoods is how many hops to use. While we do not have a theory for this yet, because of the “small world phenomenon” we expect not to need many hops; typically in moderately sparse networks, - hops should be enough. However, an adaptive procedure (e.g., cross-validation type) for choosing would be welcome. Also, since neighborhood size increases exponentially with hop size, an alternative to choosing full hop-neighborhoods is to choose a smaller hop-neighborhood and then add some (but not all) randomly chosen neighbors of the already chosen vertices. Other possibilities include sampling a certain proportion of edges at random, and then considering the subgraph induced by the participating nodes. We leave all these possibilities for future work.
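Three of the sampling schemes mentioned above can be sketched in a few lines. The graph is assumed to be stored as a dict from each node to the set of its neighbors, and the names `m` (subgraph size), `h` (number of hops), and the function names are hypothetical choices for this illustration.

```python
import random
from collections import deque

def random_subgraph_nodes(n, m, rng):
    """Node set of a uniformly random induced subgraph on m of n nodes."""
    return sorted(rng.sample(range(n), m))

def hop_neighborhood(adj, root, h):
    """All nodes within h hops of root, found by breadth-first search."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # do not expand beyond h hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def degree_weighted_root(adj, rng):
    """Pick a root with probability proportional to its degree.

    Assumes at least one node has positive degree.
    """
    nodes = sorted(adj)
    return rng.choices(nodes, weights=[len(adj[v]) for v in nodes], k=1)[0]

# A path 0 - 1 - 2 - 3 - 4, plus an isolated node 5.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: set()}
rng = random.Random(0)
nodes = random_subgraph_nodes(6, 3, rng)
ball = hop_neighborhood(adj, root=0, h=2)      # {0, 1, 2}
root = degree_weighted_root(adj, rng)          # never the isolated node 5
```

Removing the root from a 1-hop neighborhood would give the node set of an ego network in the sense used later in the paper.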

We have analyzed GALE under the random sampling scheme. For any other scheme, one will have to understand the behavior of the intersection of two samples or neighborhoods. For example, if one takes -hop neighborhoods, for sparse graphs, each neighborhood predominantly has nodes from mainly one cluster. Hence GALE often suffers with this scheme. We show this empirically in Section 5, where GALE’s accuracy is much improved under a random -subgraph sampling scheme.

### 3.4 Beyond community detection

The ideas behind PACE and GALE are not restricted to community detection and can be modified for application in other interesting problems, including general clustering problems, co-clustering problems ([27]), mixed membership models, among others (these will be discussed in an upcoming article). In fact, [19] took a similar divide and conquer approach for matrix completion.

## 4 Main results

In this section we will state and discuss our main results on PACE and GALE, along with a few applications.

### 4.1 Results on PACE

Let and be two clusterings (of objects into clusters). Usually, their discrepancy is measured by

where is the permutation group on . If are the corresponding binary matrices, then a related measure of discrepancy between the two clusterings is . It is easy to see that . (To elaborate, let be the permutation matrix corresponding to the permutation , i.e. . Then , if and only if i.e. .) For our purposes, however, a more useful measure of discrepancy would be the normalized Frobenius squared distance between the corresponding clustering matrices and , i.e.

Now we compare these two notions of discrepancy.
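Both discrepancies are easy to compute on a toy example. The sketch below assumes the first is the fraction of misclustered nodes, minimized over relabelings by brute force (feasible only for a small number of clusters), and the second is the squared Frobenius distance between the two clustering matrices divided by the squared number of nodes; this normalization and the function names are assumptions made for illustration.

```python
from itertools import permutations

def misclustering_rate(a, b, K):
    """Fraction of nodes on which a and b disagree, under the best relabeling of b."""
    n = len(a)
    return min(sum(x != p[y] for x, y in zip(a, b))
               for p in permutations(range(K))) / n

def clustering_matrix_distance(a, b):
    """Squared Frobenius distance between the clustering matrices, over n^2.

    Entries of a clustering matrix are 0/1, so the squared distance counts
    the ordered pairs on which the two "same cluster" indicators differ.
    """
    n = len(a)
    differing = sum((a[i] == a[j]) != (b[i] == b[j])
                    for i in range(n) for j in range(n))
    return differing / n ** 2

a = [0, 0, 0, 1, 1, 1]
b = [1, 1, 1, 0, 0, 1]   # labels swapped, and the last node misplaced
rate = misclustering_rate(a, b, K=2)        # 1/6
dist = clustering_matrix_distance(a, b)     # 10/36
```

In this example a single misplaced node changes its relation to every other node, so ten ordered pairs differ between the two clustering matrices. The second measure needs no search over permutations, since clustering matrices are labeling invariant.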

Incidentally, if the cluster sizes are equal, i.e. , then one can show that

Although we do not have a lower bound on in terms of , Lemma A.1 of [30] gives us (with , ) that there exists an orthogonal matrix such that

where we used the fact that The caveat here is that the matrix need not be a permutation matrix.

To prove consistency of PACE we have to assume that the clustering algorithm we use has some consistency properties. For example, it will suffice to assume that for a randomly chosen subgraph (under our subgraph selection procedure), 3 is small. The following is our main result on the expected misclustering rate of PACE.

The first term in ( ?) essentially measures the performance of the clustering algorithm we use on a randomly chosen subgraph. The second term measures how well we have covered the full graph by the chosen subgraphs, and depends only on the subgraph selection procedure. The effect of the algorithm we use is felt through the first term only.

We can now specialize Theorem ? to various subgraph selection schemes. First, we consider randomly chosen -subgraphs, which is an easy corollary.

Notice that the constant in can be made as close to as one desires, which means that the above bound is essentially optimal.

Full -hop neighborhood subgraphs are much harder to analyze and will not be pursued here. However, ego networks, which are -hop neighborhoods minus the root node (see Figure ?(b)), are easy to deal with. One can also extend our analysis to -hop onion neighborhoods which are recursively defined as follows: is just the ego network of vertex ; in general, the -th shell , and , where the operation denotes superposition of networks. Here, for ease of exposition, we choose to work with ego networks (-hop onion neighborhoods).

We will now use existing consistency results on several clustering algorithms , in conjunction with the above bounds to see what conditions (i.e. conditions on the model parameters, and etc.) are required for PACE to be consistent. We first consider -approximate adjacency spectral clustering (ASP) of [18] as . We will use stochastic block model as the generative model and for simplicity will assume that the link probability matrix has the following simple form

We now quote a slightly modified version of Corollary 3.2 of [18] for this model.

The proof of Corollary ? follows from Corollary ? and an estimate for given in (Equation 16), which is obtained using Lemma ?. In order for the first term of to go to zero we need , i.e. . Thus for balanced block sizes (i.e. ) we need to have . So, qualitatively, for large or small or a small separation between the blocks, has to be large, which is natural to expect. In particular, for fixed and , this shows that we need subgraphs of size , and many of them to achieve consistency (here the average degree ). Let and , where both . Let us see what computational gain we get from this. Spectral clustering on the full graph has complexity , while the complexity of PACE with spectral clustering is

So if , then the complexity would be , which is essentially . When the gain is small.

Note however that for a parallel implementation, with each source processing out of the subgraphs, we may get a significant boost in running time, at least in terms of constants; the running time would be .

The proof of Corollary ? follows from Corollary ? and an estimate for given in (Equation 17), which is obtained using Lemma ?. For the right hand side in to go to zero (assuming fixed, balanced block sizes), we need . In terms of average degree this means that we need , and . That with ego neighborhoods we can not go down to is not surprising, since these ego networks are rather sparse in this case. One needs to use larger neighborhoods. Anyway, writing , , where both , the complexity of adjacency spectral clustering, in this case becomes and with processing units gets further down to .

Although from our analysis, it is not clear why PACE with spectral clustering should work well for sparse settings, in numerical simulations, we have found that in various regimes PACE with (regularized) spectral clustering vastly outperforms ordinary (regularized) spectral clustering (see Table 4).

It seems that the reason why PACE works well in sparse settings lies in the weights . With -hop neighborhoods as the chosen subgraphs, if , where is the geodesic distance on , then . It is known that spectral clustering on the matrix of geodesic distances works well in sparse settings ([5]). PACE seems to inherit that property through , although we do not presently have a rigorous proof of this.

We conclude this section with an illustration of PACE with random -subgraphs using SDP as the algorithm . We shall use the setting of Theorem 1.3 of [12] for the illustration, stated here with slightly different notation. Let SDP-GV denote the following SDP [12]

For the simple two parameter blockmodel with equal community sizes, we have , the average degree of the nodes (note that ). The assumptions of Corollary ? are satisfied when

This is exactly similar to what we saw for spectral clustering (take , and ). In particular, when the average degree , and , we need and for PACE to succeed. However, in the bounded degree regime, the advantage is negligible, only from a potentially smaller constant, because we need . Again, from our numerical results, we expect that with -hop subgraphs, PACE will perform much better.

### 4.2 Results on GALE

We denote the unnormalized misclustering error between estimated labels and the true labels , () of the same set of nodes as Note that since are binary, the . As discussed earlier, the number of misclustered nodes will be half of this number.

The main idea of our algorithm is simple. Every approximately accurate clustering of a set of nodes is only accurate up to a permutation, which can never be recovered without the true labels. However we can align a labeling to a permuted version of the truth, where the permutation is estimated from another labeling of the same set of vertices. This is done by calculating the confusion matrix between two clusterings. We call two clusterings aligned if cluster from one clustering has a large overlap with cluster from the other clustering. If the labels are “aligned” and the clusterings agree, this confusion matrix will be a matrix with large diagonal entries. This idea is used in the Match algorithm, where we estimate the permutation matrix to align one clustering to another.

Now we present our main result. We prove consistency of a slightly modified and weaker version of Algorithm ?. In Algorithm ?, at every step of a traversal, we apply the Match algorithm on the intersection of the current subgraph and the union of all subgraphs previously aligned to estimate the permutation of the yet unaligned current subgraph. However, in the theorem presented below we use the intersection between the unaligned current subgraph with the last aligned subgraph. Empirically it is better to use the scheme presented in Algorithm ? since it increases the size of the intersection which requires weaker conditions on the clustering accuracy of any individual subgraph. We leave this for future work.

We now formally define our estimator . Let . Let denote the aligned clustering of subgraph and let . Define

The entries of will be fractions, but as we show in Lemma ?, rounding it to a binary matrix will not change consistency properties.

Note that GALE depends on the spanning tree we use and, in particular, on the traversal of that spanning tree. Let be the set of all spanning trees of a graph . For , let be the set of all traversals of . Let be the outcome of GALE on the traversal of .

Again, the constant can be taken as close to as one desires. Thus the above bound is also essentially optimal.

We will now illustrate Theorem ? with several algorithms . We begin with a result on -approximate adjacency spectral clustering.

We see that the first term is exactly the same as the first term in Corollary ?. This, for balanced graphs, again imposes the condition . In particular, if and we are in a dense well separated regime, with , , then we need . If , and , then we need . In both cases, we need . Thus in the regime where average degree is like there is still some computational advantage for very large networks (also factoring in parallelizability); however, for moderately sized networks, GALE may not lead to much computational advantage.

Now we present an exact recovery result with SDP as the base algorithm . We shall use a result from [33] on an SDP which they call SDP-. Let . Also let denote the vector of the cluster proportions .

Assuming that any subsequent clustering of the exactly recovered scaled clustering matrix gives the exact clustering back (for example, our distance based naive algorithm NaiveCluster will do this), we have the following corollary.

Note that, in the above bound can be taken to be greater than . This means that, with high probability, the proportion of misclustered nodes is less than and hence zero, leading to exact recovery. As for computational complexity, note that the separation condition , with replaced by , restricts how small can be. Consider the simple SBM with balanced block sizes for concreteness. In this case, the separation condition essentially dictates, as in the case of spectral clustering, that . Thus the remarks made earlier on how large or should be chosen apply here as well.

As discussed earlier in Section 3, even a naive implementation of GALE will only result in an running time in addition to the time () required to cluster the random -subgraphs, whereas a more careful implementation will only add a time to that is nearly linear in . Since SDPs are notoriously time intensive to solve, this gives us a big saving.

## 5Simulations and real-data analysis

In Table 1 we present a qualitative comparison of PACE and GALE with four representative global community detection methods: Profile Likelihood (PL), Mean Field Likelihood (MFL), Spectral Clustering (SC) and Semidefinite Programming (SDP).

### 5.1Simulations: comparison against traditional algorithms

For simulations we will use the following simple block model:

where is the dimensional identity matrix and is the matrix of all ones. Here will be the degree density, and will measure the relative separation between the within block and between block connection probabilities, i.e. and . If the blocks have prior probabilities , then the average degree , under this model is given by

In particular, if the model is balanced, i.e. for all , then
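As a rough illustration of the average-degree computation, the sketch below uses the standard fact that, in a stochastic block model with block connection matrix B and block prior probabilities π, a node has expected degree (n − 1)·πᵀBπ (some texts use n in place of n − 1). The helper names and the within-block and between-block probabilities `p_within` and `p_between` are hypothetical choices made for this example; they are not the paper's parametrization in terms of the degree density and the relative separation.

```python
def simple_block_matrix(K, p_within, p_between):
    """K x K connection matrix: p_within on the diagonal, p_between elsewhere."""
    return [[p_within if a == b else p_between for b in range(K)]
            for a in range(K)]


def expected_average_degree(n, B, pi):
    """Expected degree of a node in an n-node SBM: (n - 1) * pi^T B pi."""
    K = len(pi)
    quad = sum(pi[a] * B[a][b] * pi[b] for a in range(K) for b in range(K))
    return (n - 1) * quad


# Balanced two-block example (illustrative numbers only).
B = simple_block_matrix(K=2, p_within=0.10, p_between=0.02)
d_balanced = expected_average_degree(n=1001, B=B, pi=[0.5, 0.5])
```

In the balanced case the quadratic form reduces to (p_within + (K − 1)·p_between)/K, which is what the example evaluates.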

In order to understand and emphasize the role of PACE and GALE in reducing computational time while maintaining good clustering accuracy, we use different settings of sparsity for different methods. For recovering from in PACE, we have used random projection plus k-means (abbreviated as RPKmeans below), and spectral clustering (SC). We also want to point out that, for sparse unbalanced networks, GALE may return more than clusters, typically when a very small fraction of nodes has not been visited. However, it is possible that the unvisited nodes carry important information about the network structure. For example, all subgraphs may be chosen from the larger clusters, thereby leaving the smallest cluster unvisited. We take care of this by computing the smallest error, over all permutations of GALE’s cluster labels, relative to the ground truth. This essentially treats the smallest cluster returned by GALE as misclustered. In real and simulated networks we have almost never seen GALE return a large number of unvisited nodes.
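The permutation-minimized error described above can be sketched as follows. This is a minimal brute-force version, assuming labels are integers starting at 0; the function name `misclustering_error` is hypothetical. When the predicted clustering uses more labels than the ground truth, the surplus labels can never match a true label, so the nodes carrying them are counted as errors.

```python
from itertools import permutations


def misclustering_error(truth, pred):
    """Smallest fraction of mislabeled nodes over all relabelings of `pred`.

    Labels are assumed to be integers 0, 1, 2, ...  If `pred` uses more
    labels than `truth`, the extra labels are mapped to values that never
    occur in `truth`, so those nodes always count as misclustered.
    """
    n = len(truth)
    m = max(max(truth), max(pred)) + 1  # number of labels to permute
    best = n
    for perm in permutations(range(m)):
        errors = sum(1 for t, p in zip(truth, pred) if perm[p] != t)
        best = min(best, errors)
    return best / n


truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 2]  # relabeled, plus one node in an extra cluster
err = misclustering_error(truth, pred)
```

Enumerating all permutations is only practical for a handful of clusters; for larger numbers of clusters an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the confusion matrix would be the usual replacement.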

SDP with ADMM: Interior point methods for SDPs are not very fast in practice. We have solved SDPs using an ADMM based implementation of [32]. From Table ? we see that PACE and GALE significantly reduce the running time of SDP without losing accuracy too much. In fact, if we use spectral clustering to estimate from in the last step of PACE, we get zero misclustering error (ME).

Mean Field Likelihood: From Table 3 we see that our implementation of mean field on the full graph did not converge to an acceptable solution even after five and a half hours, while both PACE and GALE return much better solutions in about two minutes. In fact, with spectral clustering in the last step of PACE, the misclustering error is only 0.14, which is quite good. This raises the question of whether this improvement is due to spectral clustering alone. We show in the next simulation that in certain settings, even when spectral clustering is used as the base algorithm, PACE and GALE lead to significant improvements in terms of accuracy and running time.

Regularized Spectral Clustering: In sparse unbalanced settings, regularized spectral clustering with PACE and GALE performs significantly better than regularized spectral clustering on the full graph. In fact, with spectral clustering used in the last step of PACE, we can hit about 5% error or below, which is quite remarkable. See Table 4. In Section 5.2 we will see that PACE and GALE also add stability to spectral clustering (in terms of clustering degree 1 vertices).

Profile Likelihood with tabu search: Optimizing profile likelihood (PL) or likelihood modularity ([6]) for community detection is a combinatorial problem, and as such hard to scale, even if we ignore the problem of local minima. In Table 5 we compare running time of profile likelihood (optimized using tabu search) and its divide and conquer versions. We see that the local methods significantly cut down the running time of PL without losing accuracy too much.

We also applied profile likelihood on 5000 node graphs with 20 workers. Although PACE and GALE finished in about 22 minutes, the global method did not finish in 3 days. So, here we present results on 1000 node networks instead.

### 5.2Real data analysis

Political blog data:

This is a directed network (see Figure 1) of hyperlinks between blogs (2004) that are either liberal or conservative ([1]); we have ground truth labels available for comparison, are liberal, are conservative. We convert it into an undirected network by putting an edge between blogs and if there is at least one directed edge between them.

The resulting network has lots of isolated nodes and isolated edges. The degree distribution is also quite heterogeneous (so a degree-corrected model would be more appropriate). We focus on the largest connected component. We use Laplacian spectral clustering (row normalized, to correct for degree heterogeneity), with PACE.
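The preprocessing described above (symmetrizing the directed hyperlink network and restricting to the largest connected component) can be sketched as follows. This is a minimal standard-library version with hypothetical function names, assuming the network is given as a list of directed `(source, target)` pairs; dropping self-loops is an assumption of the sketch, not something stated in the text.

```python
from collections import deque


def symmetrize(directed_edges):
    """Undirected adjacency sets: i ~ j if i->j or j->i is present."""
    adj = {}
    for i, j in directed_edges:
        adj.setdefault(i, set())
        adj.setdefault(j, set())
        if i != j:  # assumption: self-loops are dropped
            adj[i].add(j)
            adj[j].add(i)
    return adj


def largest_connected_component(adj):
    """Node set of the largest connected component, found by BFS."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


edges = [(1, 2), (2, 1), (2, 3), (4, 5), (6, 6)]
adj = symmetrize(edges)
lcc = largest_connected_component(adj)
```

In the toy example, node 6 is isolated and the pair 4, 5 forms an isolated edge, so only nodes 1, 2 and 3 survive.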

Tables 6 and 7 show that PACE and GALE add stability (possibly in eigenvector computation) to spectral clustering. Indeed, with PACE and GALE we are able to cluster “leaf” vertices (i.e. vertices of degree 1) with significantly more accuracy.

## 6Discussion

To summarize, we have proposed two divide-and-conquer type algorithms for community detection, PACE and GALE, which can lead to significant computational advantages without sacrificing accuracy. The main idea behind these methods is to compute the clustering for each individual subgraph and then “stitch” them together to produce a global clustering of the entire network. The main challenge of such a stitching procedure comes from the fundamental problem of unidentifiability of label assignments. That is, if two subgraphs overlap, the clustering assignment of a pair of nodes in the overlap may be inconsistent between the two subgraphs.
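The unidentifiability issue can be seen in a toy example. In the sketch below (names are illustrative), two label vectors describe the same partition of five nodes but disagree at every position, whereas the pairwise "same cluster" indicator matrix is identical for both.

```python
def clustering_matrix(labels):
    """n x n 0/1 matrix whose (i, j) entry is 1 iff i and j share a label."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


labels_a = [0, 0, 1, 1, 2]
labels_b = [2, 2, 0, 0, 1]  # same partition, labels permuted

# Comparing raw labels suggests total disagreement ...
label_disagreements = sum(a != b for a, b in zip(labels_a, labels_b))

# ... while the pairwise co-membership structure is identical.
same_partition = clustering_matrix(labels_a) == clustering_matrix(labels_b)
```

This is why a stitching procedure has to work either with pairwise co-memberships or with an explicit alignment of labels across subgraphs.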

PACE addresses this problem by estimating the clustering matrix for each subgraph and then estimating the global clustering matrix by averaging over the subgraphs. GALE takes a different approach by using overlaps between two subgraphs to calculate the best alignment between the cluster memberships of nodes in the subgraphs. We prove that, in addition to being computationally much more efficient than base methods which typically run in time, these methods have accuracy at least as good as the base algorithm’s typical accuracy on the type of subgraphs used, with high probability. Experimentally, we show something more interesting: we identify parameter regimes where a local implementation of a base algorithm based on PACE or GALE in fact outperforms the corresponding global algorithm. One example of this is the mean field algorithm, which typically suffers from bad local optima for large networks. Empirically, we have seen that on a smaller subgraph, with a reasonable number of restarts, it finds a local optimum that is often highly correlated with the ground truth. PACE and GALE take advantage of this phenomenon to improve on accuracy/running time significantly. Another example is Regularized Spectral Clustering on sparse unbalanced networks. We intend to theoretically investigate this further in future work.
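The two stitching ideas can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `pace_average` averages per-subgraph co-membership indicators over the subgraphs containing each pair, zeroing out pairs seen in fewer than a hypothetical threshold `tau` of subgraphs, and `align_labels` picks, by brute force, the label permutation that maximizes agreement on the overlap with a reference labeling. Function names, the threshold parameter and the data layout are all assumptions made for this example.

```python
from itertools import permutations

import numpy as np


def pace_average(n, subgraphs, labelings, tau=1):
    """Average subgraph clustering matrices into an n x n estimate.

    subgraphs[l] lists the (distinct) node indices of subgraph l and
    labelings[l] gives their cluster labels.  Pairs that co-occur in
    fewer than tau >= 1 subgraphs get estimate 0.
    """
    sums = np.zeros((n, n))
    counts = np.zeros((n, n))
    for nodes, labels in zip(subgraphs, labelings):
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]).astype(float)
        block = np.ix_(nodes, nodes)
        sums[block] += same
        counts[block] += 1.0
    estimate = np.zeros((n, n))
    seen = counts >= tau
    estimate[seen] = sums[seen] / counts[seen]
    return estimate


def align_labels(ref, new, overlap, K):
    """Relabel `new` (node -> label) to best agree with `ref` on `overlap`."""
    best = max(permutations(range(K)),
               key=lambda perm: sum(perm[new[v]] == ref[v] for v in overlap))
    return {v: best[c] for v, c in new.items()}


C_hat = pace_average(4, [[0, 1, 2], [1, 2, 3]], [[0, 0, 1], [1, 1, 0]])
aligned = align_labels({0: 0, 1: 0, 2: 1}, {1: 1, 2: 0, 3: 0}, [1, 2], K=2)
```

In the toy run, nodes 1 and 2 are separated by the first subgraph and merged by the second, so their averaged entry is 0.5; nodes 0 and 3 never co-occur and get 0. The averaging step needs no label alignment at all, while the alignment step propagates one consistent labeling from subgraph to subgraph.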

Finally, working with many subgraphs naturally leads to the question of self consistency of the underlying algorithm. This is often crucial in real world clustering problems with no available ground truth labels. We intend to explore this direction further for estimating model parameters like the number of clusters, algorithmic parameters like the size and number of subgraphs, number of hops to be used for the neighborhood subgraphs, etc. Currently, these are all picked a priori based on the degree considerations. It may also be possible to choose between different models (e.g., standard blockmodels, degree corrected models, dot product models etc.) by examining which model leads to the most self consistent results. We leave this for future work.

In conclusion, our algorithms are, to the best of our knowledge, the first divide-and-conquer type algorithms used for community detection. We also believe that the basic principles of our methods will have a broad impact on a range of clustering and estimation algorithms that are computationally intensive.

## 7Proofs

### 7.1Results on PACE

Since both are -valued, we can safely replace the count by Frobenius norm squared, i.e.

Now, note that for all permutation matrices . Thus

But is the maximum eigenvalue of which is diagonal with its maximum diagonal entry being the size of the largest cluster under . Thus equals the size of the largest cluster under and so is trivially upper bounded by . The same goes for . Therefore we get

Squaring this, and taking infimum over all permutation matrices in the right hand side, we obtain the claimed inequality.
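Two elementary facts used in the argument above can be checked numerically. The sketch below (with illustrative names, using NumPy) verifies that for 0/1-valued matrices the number of differing entries equals the squared Frobenius norm of the difference, and that for a cluster membership matrix Z the Gram matrix ZᵀZ is diagonal with the cluster sizes on its diagonal, so the squared spectral norm of Z equals the size of the largest cluster.

```python
import numpy as np


def membership_matrix(labels, K):
    """n x K 0/1 matrix with a single 1 per row, in the column of the label."""
    Z = np.zeros((len(labels), K), dtype=int)
    Z[np.arange(len(labels)), labels] = 1
    return Z


Z1 = membership_matrix([0, 0, 0, 1, 1, 2], K=3)
Z2 = membership_matrix([0, 0, 1, 1, 1, 2], K=3)

# Fact 1: for 0/1 matrices, mismatch count = squared Frobenius norm.
mismatch_count = int(np.count_nonzero(Z1 != Z2))
frobenius_sq = float(np.linalg.norm(Z1 - Z2, "fro") ** 2)

# Fact 2: Z^T Z is diagonal with the cluster sizes, so the largest
# eigenvalue of Z^T Z (the squared spectral norm of Z) is the size of
# the largest cluster.
gram = Z1.T @ Z1
largest_cluster = int(gram.diagonal().max())
spectral_norm_sq = float(np.linalg.norm(Z1, 2) ** 2)
```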

Now we will prove Theorem ?. The proof will be broken down into two propositions. First we decompose

Note that , and , so that . Therefore

We will estimate and separately.

Let . Then . So, by an application of Cauchy-Schwarz, we have

Note that . On the other hand, since the subgraphs were chosen independently using the same sampling scheme, the are identically distributed. Therefore, taking expectations we get

where is a randomly chosen subgraph under our subgraph selection scheme.

Since , we have , and by taking expectations we get

Combining Propositions ? and ?, we get . Finally, note that

For this sampling scheme and with , Binomial so that we have, using the Chernoff bound for binomial lower tail, that

Finally, we get ( ?) by plugging in these parameter values and estimates in ( ?).
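For intuition on the binomial lower-tail step, the following sketch compares the exact lower tail of a binomial variable with the standard multiplicative Chernoff bound P(X ≤ (1 − δ)μ) ≤ exp(−δ²μ/2), where μ = np and 0 < δ < 1. The exact form and constants of the bound used in the paper may differ; the parameter values here are purely illustrative.

```python
from math import comb, exp, floor


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def chernoff_lower_tail_bound(n, p, delta):
    """Standard bound on P(X <= (1 - delta) * n * p), for 0 < delta < 1."""
    return exp(-(delta**2) * n * p / 2)


n, p, delta = 200, 0.3, 0.5
threshold = floor((1 - delta) * n * p)  # integer part of (1 - delta) * mu
exact_tail = binomial_cdf(threshold, n, p)
bound = chernoff_lower_tail_bound(n, p, delta)
```

The exact tail is far smaller than the bound in this example; the bound is loose, but it is explicit and decays exponentially in μ, which is all the argument needs.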

The most crucial thing to observe here is that if one removes the root node and its adjacent edges from a -hop neighborhood, then the remaining “ego network” has again a blockmodel structure. Indeed, let be a random ego neighborhood of size with root , i.e. . Then conditional on being ’s neighbors, and the latent cluster memberships, edges in are independently generated, i.e. for , and , we have

This is because the “spoke” edges are independent of . Therefore, conditional on , this ego-subgraph is one instantiation of a block model with the same parameters on vertices.
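The construction of an ego-subgraph with its root removed can be sketched as follows, assuming a 1-hop neighborhood (the hop count is elided in the text above) and a graph stored as a dictionary of adjacency sets without self-loops. The function name is hypothetical.

```python
def ego_subgraph_without_root(adj, root):
    """Induced subgraph on the neighbors of `root`.

    The root and its "spoke" edges are removed, so only edges among the
    neighbors themselves remain.
    """
    neighbors = set(adj[root]) - {root}
    return {v: adj[v] & neighbors for v in neighbors}


adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4},
    4: {3},
}
ego = ego_subgraph_without_root(adj, root=0)
```

Which nodes enter the subgraph is determined by the spoke edges alone; the edges kept inside it are edges among neighbors, which is the independence the argument relies on.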

Now for ego networks, , where is the total number of ego-subgraphs containing both and . Notice that

that is, is the sum of independent Bernoulli random variables

So

and we have, by the Chernoff bound, that

In order to apply Theorem ? we need the following two ingredients, which we will now work out.

1. estimate of , and

2. estimate of .

(i) Estimate of . Note that . So

Since , and are independent, we have, by Chernoff’s inequality, that

where Therefore, the same upper bound holds for . In particular, for we have

Similarly, using Chernoff’s inequality for Binomial upper tail, we can show that, for ,

(ii) Estimate of . Recall that . Then