Efficient Distributed Community Detection in the Stochastic Block Model

Efficient Distributed Community Detection in the Stochastic Block Model

Reza Fathi Department of Computer Science
University of Houston
Houston, Texas 77204, USA
Email: rfathi@uh.edu
   Anisur Rahaman Molla Cryptology and Security Research Unit
Indian Statistical Institute
Kolkata 700108, India
Email: molla@isical.ac.in
   Gopal Pandurangan Department of Computer Science
University of Houston
Houston, Texas 77204, USA
Email: gopalpandurangan@gmail.com
Abstract

Designing effective algorithms for community detection is an important and challenging problem in large-scale graphs, studied extensively in the literature. Various solutions have been proposed, but many of them are centralized with expensive procedures (requiring full knowledge of the input graph) and have a large running time.

In this paper, we present a distributed algorithm for community detection in the stochastic block model (also called planted partition model), a widely-studied and canonical random graph model for community detection and clustering. Our algorithm called CDRW(Community Detection by Random Walks) is based on random walks, and is localized and lightweight, and easy to implement. A novel feature of the algorithm is that it uses the concept of local mixing time to identify the community around a given node.

We present a rigorous theoretical analysis that shows that the algorithm can accurately identify the communities in the stochastic block model and characterize the model parameters where the algorithm works. We also present experimental results that validate our theoretical analysis. We also analyze the performance of our distributed algorithm under the CONGEST distributed model as well as the -machine model, a model for large-scale distributed computations, and show that it can be efficiently implemented.

community detection, random walk, local mixing, conductance, planted partition model.
A. R. Molla was supported by DST Inspire Faculty Award research grant DST/INSPIRE/04/2015/002801.G. Pandurangan was supported, in part, by NSF grants CCF-1527867, CCF-1540512, IIS-1633720, CCF-1717075, and BSF award 2016419.

I introduction

Finding communities in networks (graphs) is an important problem and has been extensively studied in the last two decades, e.g., see the surveys [1, 10, 17, 18] and other references in Section II. At a high level, the goal is to identify subsets of vertices of the given graph so that each subset represents a “community.” While there are differences in how communities are defined exactly (e.g., subsets defining a community may overlap or not), a uniform property that underlies most definitions is that there are more edges connecting nodes within a subset than edges connecting to outside the subset. Thus, this captures the fact that a community is somehow more “well-connected” with respect to itself compared to the rest of the graph. One way to express this property formally is by using the notion of conductance (defined in Section I-C) which (informally) measures the ratio of the total degree going out of the subset to the total degree among all nodes in the subset. A community subset will have generally low conductance (also sometime referred to as a “sparse” cut). Another way to measure this is by using the notion of modularity [35] which (informally) measures how well connected a subset is compared to the random graph that can be embedded within the set. A vertex subset that has a high modularity value can be considered a community according to this measure.

Designing effective algorithms for community detection in graphs is an important and challenging problem. With the rise of massive graphs such as the Web graph, social networks, biological networks, finding communities (or clusters) in large-scale graphs efficiently is becoming even more important [10, 15, 18, 20, 35]. In particular, understanding the community structure is a key issue in many applications in various complex networks including biological and social networks (see e.g., [10] and the references therein).

Various solutions have been proposed (cf. Section II), but many of them are centralized with expensive procedures (requiring full knowledge of the input graph) and have a large running time [10]. In particular, the problem of detecting (identifying) communities in the stochastic block model (SBM) [1, 10] has been extensively studied in the literature (cf. Section II). The stochastic block model (defined formally in Section I-B), also known as the planted partition model (PPM) is a widely-used random graph model in community detection and clustering studies (see e.g., [1, 8, 25]). Informally, in the PPM model, we are given a graph on nodes which are partitioned into a set of communities (each is a random graph on vertices) and these communities are interconnected by random edges. The total number of edges within a community (intra-community edges) is typically much more than the number of edges between communities (inter-community edges). The main goal is to devise algorithms to identify the communities that are “planted” in the graph. Several algorithms have been devised for this model, but as mentioned earlier, they are mostly centralized (with some exceptions — cf. Section II) with large running time. There are few distributed algorithms (see e.g., [10, 27]) but either they are shown to work only when the number of communities is small (typically 2) [10]) or when the communities are very dense[27]). In particular, to the best of our knowledge, (prior to this work) there is no rigorous analysis of community detection algorithms in distributed large-scale graph processing models such as MapReduce and Massively Parallel Computing (MPC) model [24], and the -machine model [26].

I-a Our Contributions

In this paper, we present a novel distributed algorithm, called CDRW (Community Detection via Random Walks) for detecting communities in the PPM model[12].111Throughout this paper, we use the terms stochastic block model (SBM) and planted partition model (PPM) interchangeably. Our algorithm is based on the recently proposed local mixing paradigm [33] (see Section I-C for a formal definition) to detect community structure in sparse (bounded-degree) graphs. Informally, a local mixing set is one where a random walk started at some node in the set mixes well with respect to this set. The intuition in using this concept for community detection is that since a community is well-connected, it has good expansion within the community and hence a random walk started at a node in the community mixes well within the community. The notion of “mixes well” is captured by the fact that the random walk reaches close to the stationary distribution when restricted to the nodes in the community subset [33]. Since the main tool for this algorithm uses random walks which is local and lightweight, it is easy to implement this algorithm in a distributed manner. We will analyze the performance of the algorithm in two distributed computing models, namely the standard CONGEST model of distributed computing [40] and the -machine model [26], which is a model for large-scale distributed computations. We show that CDRW can be implemented efficiently in both models (cf. Theorem 6 and Section III-B). The -machine model implementation is especially suitable for large-scale graphs and thus can be used in community detection in large SBM graphs. In particular, we show that the round complexity in the -machine model (cf. Section III-B) scales quadractically (i.e., ) in the number of machines when the graph is sparse and linearly (i.e.,) in general.

As is usual in community detection, a main focus is analyzing the effectiveness of the algorithm in finding communities. We present a rigorous theoretical analysis that shows that CDRW algorithm can accurately identify the communities in the PPM, which is a popular and widely-studied random graph model for community detection analysis [1]. A PPM model (cf. Section I-B) is a parameterized random graph model which has a built-in community structure. Each community has high expansion within the community and forms a low conductance subset (and hence relatively less edges go outside the community); the expansion, conductance, and edge density can be controlled by varying the parameters. CDRW does well when the number of intra-community edges is much more than the number of inter-community edges (these are controlled by the parameters of the model). Our theoretical analysis (cf. Theorem 6 for the precise statement) quantitatively characterizes when CDRW does well vis-a-vis the parameters of the model. Our results improve over previous distributed algorithms that have been proposed for the PPM model ([10]) both in the number of communities that can be provably detected as well as range of parameters where accurate detection is possible; they also improve on previous results that provably work only on dense PPM graphs [27] (details in Section II).

We also present extensive simulations of our algorithm in the PPM model under various parameters. Experimental results on the model validate our theoretical analysis; in fact experiments show that CDRW works relatively well in identifying communities even under less stringent parameters.

I-B Model, Definitions, and Preliminaries

I-B1 Graph Model

We describe the stochastic block model (SBM) [21], a well-studied random graph model that is used in community detection analysis. Before that we recall the Erdős-Rényi random graph model [16], a basic random graph model, that the SBM model builds on. In the Erdős-Rényi random graph model, also known as the model, each of possible edges is present in the graph independently with probability . The random graph has close to uniform degree (the expected degree of a node is ) and a well-known fact about is that if the graph is connected with high probability222Throughout this paper, by “with high probability” we mean, “with probability at least .” and its degree is concentrated at its expectation. In the SBM model, the vertices are partitioned into disjoint community sets ( is a parameter). Let denote the community that node belongs to. If two nodes and belong to the same community (i.e., ), they connect independently with probability (independent of other nodes). If they belong to different communities, they connect independently with probability . A SBM model has a separable community structure [1] if it has a higher value of intra- than inter-community connectivity probability. A commonly used version of SBM model that is called Planted Partition Model (PPM), denoted as [11, 21] is usually used as a benchmark. A symmetric with blocks is composed of exclusive set of nodes in , where . In , each possible edge is independently present with probability if both ends of and belong to the same block , otherwise with probability . Figure 1 shows a PPM of blocks (), each containing nodes. For the PPM model, each of the communities induce a -vertex subgraph which is a random graph. The goal of community detection in the PPM (or SBM) model is to identify the community sets. This problem has been widely studied [1].

(a) A PPM graph shown without its ground-truth communities.
(b) Redrawing of the same PPM graph showing the communities.
Fig. 1: Both of the above graphs are the same PPM graph. The graph size is , number of communities , the existence probability of intra(inter) community edges is ( as in [1, Figure 1]. We highlight ground truth communities in different colors in Figure0(b).

I-B2 Distributed Computing Models

We consider two distributed computing models and analyze the complexity of the algorithm’s implementation under both the models.

CONGEST model. This is a standard and widely-studied CONGEST model of distributed computing [37, 40], which captures the bandwidth constraints inherent in real-world networks. The distributed network is modelled as an undirected and unweighted graph , where and , where nodes represent processors (or computing entities) and the edges represent communication links. In this paper, will be a graph belonging to the PPM model. Each node is associated with a distinct label or ID (e.g., its IP address). Nodes communicate through the edges in synchronous rounds; in every round, each node is allowed to send a message of size at most bits to each of its neighbors. The message will arrive to the receiver nodes at the end of the current round. Initially every node has limited knowledge; it only knows its own ID and its neighbors IDs. In addition, it may known additional parameters of the graph such as (where is the diameter). The two standard complexity of an algorithm are the time and message complexity in the CONGEST model. While time complexity measures the number of rounds taken by the algorithm, the message complexity measures the total number of messages exchanged during the course of the algorithm.

-machine model. The -machine model (a.k.a. Big Data model) is a model for large-scale distributed computing introduced in [26] and studied in various papers [3, 23, 26, 38, 39]. In this model, the input graph (or more generally, any other type of data) is distributed across a group of machines that are pairwise interconnected via a communication network. The machines jointly perform computations on an arbitrary -vertex, -edge input graph (where typically ) distributed among the machines (randomly or in a balanced fashion). The communication is point-to-point via message passing. Machines do not share any memory and have no other means of communication. The computation advances in synchronous rounds, and each link is assumed to have a bandwidth of bits per round, i.e., bits can be transmitted over each link in each round; unless otherwise stated, we assume (where is the input size) [38, 39]. The goal is to minimize the round complexity, i.e., the number of communication rounds required by the computation.333The communication cost is assumed to be the dominant cost – which is typically the case in Big Data computations — and hence the goal of minimizing the number of communication rounds [44].

Initially, the entire graph is not known by any single machine, but rather partitioned among the machines in a “balanced” fashion, i.e., the nodes and/or edges of are partitioned approximately evenly among the machines. We assume a vertex-partition model, whereby vertices, along with information of their incident edges, are partitioned across machines. Specifically, the type of partition that we will assume throughout is the random vertex partition (RVP), that is, each vertex of the input graph is assigned randomly to one machine. (This is the typical way used by many real systems, such as Pregel [31], to initially distribute the input graph among the machines.) If a vertex is assigned to machine we say that is the home machine of . A convenient way to implement the RVP model is through hashing: each vertex (ID) is hashed to one of the machines. Hence, if a machine knows a vertex ID, it also knows where it is hashed to. It can be shown that the RVP model results in (essentially) a balanced partition of the graph: each machine gets vertices and edges, where is the maximum degree.

At the end of the computation, for the community detection problem, the community that each vertex belongs to will be output by some machine.

I-C Random Walk Preliminaries and Local Mixing Set

Our algorithm is based on the mixing property of a random walk in a graph. We use the notion of local mixing set of a random walk, introduced in [33], to identify communities in a graph. Let us define random walk preliminaries, local mixing time, and local mixing set as defined in [33]. Given an undirected graph and a source node , a simple random walk is defined as: in each step, the walk goes from the current node to a random neighbor i.e., from the current node , the probability of moving to node is if , otherwise , where is the degree of . Let be the probability distribution at time starting from the source node . Then is the initial distribution with probability at the node and zero at all other nodes. The can be seen as the matrix-vector multiplication between and , where is the transpose of the transition matrix of . Let be the probability that the random walk be in node after steps. When it’s clear from the context we omit the source node from the notations and denote it by only. The stationary distribution (a.k.a steady-state distribution) is the probability distribution which doesn’t change anymore (i.e., it has converged). The stationary distribution of an undirected connected graph is a well-defined quantity which is , where is the degree of node . We denote the stationary distribution vector by , i.e., for each node . The stationary distribution of a graph is fixed irrespective of the starting node of a random walk, however, the number of steps (i.e., time) to reach to the stationary distribution could be different for different starting nodes. The time to reach to the stationary distribution is called the mixing time of a random walk with respect to the source node . The mixing time corresponding to the source node is denoted by . The mixing time of the graph, denoted by , is the maximum mixing time among all (starting) nodes in the graph.

Definition 1.

(–mixing time for source and –mixing time of the graph)
Define , where is the norm. Then is called the -near mixing time for any in . The mixing time of the graph is denoted by and is defined by . It is clear that .

We sometime omit from the notations when it is understood from the context. For any set , we define is the volume of i.e., . Therefore, is the volume of the vertex set. The conductance of the set is denoted by and defined by where is the set of edges between and . The conductance of the graph is .

Let us define a vector over the set of vertices as follows: if , and otherwise.

Notice that is the stationary distribution of a random walk over the graph , and is the restriction of the distribution on the subgraph induced by the set . Recall that we defined as the probability distribution over of a random walk of length , starting from some source vertex . Let us denote the restriction of the distribution over a subset by and define it as: if and otherwise.

It is clear that is not a probability distribution over the set as the sum could be less than .

Informally, local mixing set, with respect to a source node , means that there exists some (large-enough) subset of nodes containing such that the random walk probability distribution becomes close to the stationary distribution restricted to (as defined above) quickly. The time that a random walk mixes locally on a set is called as local mixing time which is formally defined below.

Definition 2.

(Local Mixing Set and Local Mixing Time)
Consider a vertex . Let be a positive constant and be a fixed parameter. We first define the notion of local mixing in a set . Let be a fixed subset containing of size at least . Let be the restricted probability distribution over after steps of a random walk starting from and be as defined above. Define the mixing time with respect to set as . We say that the random walk locally mixes in if exists and well-defined. (Note that a walk may not locally mix in a given set , i.e., there exists no time such that ; in this case we can take to be .)

The local mixing time with respect to is defined as , where the minimum is taken over all subsets (containing ) of size at least , where the random walk starting from locally mixes. A set where the minimum is attained (there may be more than one) is called the local mixing set. The local mixing time of the graph, (for given and ), is .

From the above definition, it is clear that always exists (and well-defined) for every fixed , since in the worst-case, it equals the mixing time of the graph; this happens when (for every ). We note that, crucially, in the above definition of local mixing time, the minimum is taken over subsets of size at least , and thus, in many graphs, local mixing time can be substantially smaller than the mixing time when (i.e., the local mixing can happen much earlier in some set of size than the mixing time).

Ii related works

There has been extensive work on community detection in graphs, see, e.g., the surveys of [1, 2, 17, 18]. Here we focus mainly on related works in distributed community detection and in the SBM, especially the model.

Dyer and Frieze [14] show that if then the minimum edge-bisection is the one that separates the two classes and present an algorithm that gives the bisection in expected time. Jerrum and Sorkin improved this bound for some range of and by using simulated annealing. Further improvements and more efficient algorithms were obtained in [12, 34]. We note that all the above algorithms are centralized and based on expensive procedures such as simulated annealing and spectral graph computations: all of them require the full knowledge of the graph.

The work of Clementi et. al [10] is notable because they present a distributed protocol based on the popular Label Propagation approach and prove that, when the ratio is larger than (for an arbitrarily small constant ), the protocol finds the right planted partition in time. Note that however, they consider only two communities in their PPM model. We also note that this ratio can be significantly weaker compared to the ratio (for identifying all the communities) derived in our Theorem 6 which is where is the number of communities (which can be much smaller compared to , the total number of vertices). Also our algorithm works for any number of communities.

Random walks have been successfully used for graph processing in many ways. Community detection algorithms use the statistics of random walks to infer structure of graphs as clusters or communities. Netwalk [46] defined a proximity measure between neighboring vertices of a graph. Initialized each node as a community, it merges two communities with the lowest proximity index into a community in iterations. But it is an expensive method running in time complexity. Similarly, Walktrap by Pons et al [42] defined a distance measure between nodes using random walks. Then they merged similar nodes into a community in iterations. The logic behind their method is that random walks get trapped inside densely connected parts of a graph which can capture communities. Their algorithm is centralized and has expensive run-time of in worst case.

Some works uses linear dynamics of graphs to perform basic network processing tasks such as reaching self-stabilizing consensus in faulty distributed systems [6, 36] or Spectral Partitioning [13, 29, 41]. They work on connected non-bipartite graphs. Becchetti et al [4] defines averaging dynamics, in which each node updates its value to the average of its neighbors, in iterations. It partitions a graph into two clusters using the sign of last updates. Another interesting research by Becchetti et. al. [5] used random walk to average values of two nodes when randomly any two nodes meet and showed that it ends in detecting communities. The convergence time of the averaging dynamics on a graph is the mixing time of a random walk [45]. These methods works well on graphs with good expansion [22] and are slower on sparse cut graphs. Label Propagation Algorithm (LPA) [43] is another updating method which converges to detecting communities by applying majority rule. Each node initially belongs to its own community. At each iteration, each node joins a community having majority among its neighbors, applying a tie-breaking policy. Recently Kothapalli et al provided a theoretical analysis for its behavior [27] on dense PPM graphs ( and ). In comparison, our algorithm works even for the more challenging case of sparse graphs (, i.e., the near the connectivity threshold). A major drawback of LPA algorithm is the lack of convergence guarantee. For example, it can run forever on a bipartite graph where each part gets a different label (each community is specified by a label).

Although this paper uses the notion of local mixing time introduced in [33], there are substantial differences. In [33], the authors consider only the local mixing time which is essentially the existence of a mixing set of certain size, but not the set of nodes where the random walk mixes. The computation of the local mixing set is more challenging. A key idea of this paper is to use this notion to identify communities. For this, the algorithm and approach of [33] has to be modified substantially.

Iii Algorithm for Community Detection

We design a random walk based community detection algorithm (cf. Algorithm 1). Given a graph and a node, the algorithm finds a community containing the node in a distributed fashion. We show the efficiency and effectiveness of the algorithm both theoretically and experimentally on random graphs and stochastic block model.

Outline of the Algorithm. We use the concept of local mixing set, introduced by Molla and Pandurangan [33], to identify community in a graph. A local mixing set of a random walk is a subset of the vertex set where the random walk probability mixes fast, see formal definition in the Section I-B. Intuitively, a random walk probability mixes fast over a subset where nodes are well-connected among themselves. The idea is to use the concept of local mixing set to identify a community — a subset where nodes are well-connected inside the set and less-connected outside. That is, if a random walk starts from a node inside a community, its probability distribution is likely to mix fast inside the community nodes and less amount of probability will go outside of the set. Thus the high level idea of our approach is to perform a random walk from a given source node and find the largest subset (of nodes) where the random walk probability mixes quickly. We extend the distributed algorithm from [33] to find a largest mixing set in the following way. In each step of the random walk, we keep track the size of the largest mixing set. When the size of the largest mixing set is not increasing significantly with the increase of the length of the random walk, we stop and output the largest mixing set as the community containing the source node.

Algorithm in Detail. Given an undirected graph , our algorithm randomly selects a node and outputs a community set containing . It maintains a set, called as which contains all the remaining nodes of excluding the nodes in . Then another random node gets selected from the set and we compute the community containing that node in , and so on. This way all the different communities are computed one by one. The set is initialized by in the beginning. The algorithm stops when the set becomes empty.

Now we describe how the algorithm computes the community set in from a given node . The algorithm performs a random walk from the source node and computes the probability distribution at each step of the random walk. The probability distribution starting from the source node is computed locally by each node as follows: Initially, at round , the probability distribution is: at node , and at all other nodes , . At the start of a round , each node sends to its -neighbors and at the end of the round , each node computes . This local flooding approach essentially simulates the probability distribution of each step of the random walk starting from a source node. Moreover, this deterministic flooding approach can be used to compute the probability distribution of length from the previous distribution in one round only– simply by resuming the flooding from the last step. The full algorithm can be found in Algorithm 1 in [33]. Then at each step , our algorithm computes a largest mixing set . The largest mixing set is computed as follows: Each node knows its value. The algorithm gradually increases the size of a candidate local mixing set starting from size .444In the pseudocode we assume the size of each community is at least . First each node locally calculates its value as , where is the average volume of the set . Note that any node can compute when it knows the “size”– and hence can compute locally. However, it’s difficult to compute unless it knows the set (i.e., the nodes in ) and the degree distribution of the nodes in . Computing nodes in and their degree distribution is expensive in terms of time. That’s why we consider instead in the localized algorithm.

Then the source node collects smallest of values and checks if their sum is less than (mixing condition). For this each node may sends its to the source nodes via upcasting through a BFS tree rooted at . (A BFS tree is computed from at the beginning of the algorithm). However, the upcast may take time in the worst case due to the congestion in the BFS tree. A better approach is used in [33], which is to do a binary search on , as follows: All the nodes send and (the minimum and maximum respectively among all ) to the root through a convergecast process (e.g., see [40]). This will take time proportional to the depth of the BFS tree. Then can count the number of nodes whose value is less than via a couple of broadcast and convergecast. In fact, broadcasts the value to all the nodes via the BFS tree and then the nodes whose value is less than (say, the qualified nodes), reply back with value through the convergecast. Depending on whether the number of qualified nodes is less than or greater than , the root updates the value (by again collecting or in the reduced set) and iterates the process until the count is exactly . Note that there might be multiple nodes with the same value. We can make them distinct by adding a ‘very’ small random number to each of the such that the addition doesn’t affect the mixing condition. The detailed approach and analysis can be found in [33].

Once the node gets smallest s, it checks if their sum is less than . If true, then these nodes whose sum value is less than gives a candidate mixing set and its size is . Then we increase the set size and check if there is a larger mixing set. If the mixing condition doesn’t satisfy, then there is no mixing set of size . The algorithm iterates the checking process few more times by increasing the size of and checking if there is a mixing set of larger size. If not, then the algorithm stops for this length and stores the largest mixing set at . This way, the algorithm finds the largest mixing set at the step of the random walk. Note that we can increase the candidate mixing set size by each time. This will increase the time complexity of the algorithm by a factor of the “size of the largest mixing set”. Instead we increase the size of the mixing set by a factor of in each iteration. This will only add a factor of to the time complexity. The reason why we increase by a factor of instead of doubling is discussed in [33] (see, Lemma 3 in [33]). The correctness of the all the above tests is also analyzed in [33].

Then the algorithm checks if the size of the largest mixing set at step increases significantly than mixing set in the previous step . This is checked locally by the source node as the source node has the information of the largest mixing set of the current and previous steps. If the size doesn’t increase by a factor , i.e., if , then the algorithm stops and output as the community set . Otherwise, the algorithm increases the length by and checks for . The parameter is chosen to be the conductance of the graph which essentially measures the vertex expansion of the graph.

Input: An undirected graph .
Output: Set of Detected Communities .

1:
2:
3:while  do There exist nodes not assigned to any communities yet
4:       pick a random node from
5:       computes a BFS tree of depth via flooding
6:      Set . Size of the local mixing set initialize by . Assume community size is
7:      Set and for all other nodes .
8:      for  do Length of the random walk
9:            Each node whose , does the following in parallel:
10:                       (i) Send to all the neighbors .
11:                       (ii) Compute the sum of the received values from its neighbors and sets it as .
12:            for  do
13:                 Each node computes the difference locally
14:                  computes the sum of smallest values using binary search method discussed in the detail description of the algorithm.
15:                  checks if the sum is less than , i.e., if .
16:                 If “true”, then checks for the next size of the mixing set
17:                 Else, sets to be the largest set which satisfies the mixing condition. broadcasts an indicator message to all the nodes via BFS tree. The nodes whose value gives the smallest values belong to the largest mixing set .             
18:            if  then
19:                 Break the for-loop.                   
20:      
21:      
22:      
23:Return
Algorithm 1 Community-Detection-by-Random-Walks (CDRW)

Iii-a Analysis

We analyze the algorithm and show that it correctly identifies communities in the planted partition model (PPM) – graphs. The graph is formed by connecting several communities of graphs (see the definition in Section I-B). Let us first analyze the Algorithm 1 on the random graph . We then extend the analysis to the stochastic block model .

On Graphs. Suppose the algorithm is executed on the standard random (almost regular) graph , defined in Section I-B. Since is an expander graph, the random walk starting from any node mixes over the vertex set very fast; in fact in steps. Given any node , we show that our algorithm computes the community as the complete vertex set . More precisely, we show that the size of the largest mixing set increases on a higher rate (than the considered threshold) after each step of the random walk, when the length of the walk is . Since is the mixing time of , the random walk probability reaches the stationary distribution after steps, for a sufficiently large constant .

Let be the probability that the walk is at after steps (starting from a source node ). It is known that in a regular graph is bounded by:

(1)

where is the second largest eigenvalue (absolute value) of the transition matrix of 555The bound follows from the standard bound in general graphs [30]. In a regular graph, for all . Note that is not exactly a regular graph, but very close to regular (especially if for a large enough constant ). It can be shown that in . For simplicity we assume that is a regular graph as this little changes in the degree or in the probability distribution doesn’t affect the lemmas.. Hence, the above bound on the probability distribution holds in graphs. It is further known that in a random -regular graph, the second largest eigenvalue is bounded by [19]:

(2)

Let be the set of nodes that are within the distance from the . The distance is measured by the hop distance between nodes. Let’s call a ball of radius centered at . We now show that after steps of the random walk, the largest mixing set is in a graph.

Lemma 1.

Let a random walk start from a source node in a graph. Then for any length which is less than the mixing time of the random walk, the largest mixing set is the ball with high probability.

Proof.

Assume , since is less than the mixing time . It is known that the size of the ball in a random graph is bounded by with high probability (cf. Lemma 2 in [9]). To prove the lemma we show that the random walk probability mixes inside the ball and doesn’t mix on the ball of radius larger than . Recall that the condition of locally mixing on a subset is , (since is regular graph). Using the above bound of , (Equ 1, III-A) and (since in expectation in ) we have:

This shows that the random walk of length mixes over the nodes in . Now we show that it doesn’t mix on for . Again from Equation 1 and III-A,

Thus the largest mixing set is . ∎

Now we show that our algorithm outputs the full vertex set as the community in graphs.

Lemma 2.

Given a random regular expander graph , the Algorithm 1 outputs the vertex set as a single community with high probability.

Proof.

It follows from the previous lemma that when is less than the mixing time of , then the largest local mixing set is . Therefore, in each step of the random walk, the size of the mixing set is increased by a factor . Hence, by the condition of the Algorithm 1, it doesn’t stop and continue to look for a community set for the larger lengths of the random walk. It means, until the length of the random walk reaches to the mixing time of the graph , the algorithm continue its execution. When the length reaches the mixing time, then the random walk will mix the full vertex set . Then the algorithm stops and outputs as a single community set (as the size of the mixing set won’t increase anymore for larger lengths). ∎

On Graphs. Let us now analyze the algorithm on the planted partition model i.e., on a random graph. A random graph is formed by equal size blocks where each component is a random graph (see the definition in Section I-B). We show that the algorithm correctly identifies each block as a community. Suppose the randomly selected node belongs to some block . The induced subgraph on is a graph i.e., the nodes inside are connected to each other with probability . Further each node in is connected to every node outside of with probability . Thus the random walk may go out of the set at some point. We show that the probability of going out of is very small when the length of the walk is smaller than the mixing time of graph, which is .

Lemma 3.

Given a graph and a node in some block , the probability that a random walk starting from stays inside is at least until when .

Proof.

We show that in each step, the probability that the random walk goes outside of is . For any , the number of neighbors of in is and the number of neighbors in is in expectation. Thus the probability that the random walk goes outside of the block is . This is when . Thus in steps, the probability that walk goes outside of the block is . That is the random walk stays inside with probability at least . ∎

Now we show that the random walk probability will mix over in steps.

Lemma 4.

Given a graph and a node , a random walk starting from will mix over the nodes in after steps with high probability.

Proof.

We show that after steps of the walk, the amount of probability goes out of is very little and that the remaining probability will mix inside . The expected number of outgoing edges from any subset of the block is . In each step the amount of probability goes out of is , as is the degree of a node, each edge carries fraction of the probability. We have for . Thus in steps, the amount of probability goes out of is . Hence fraction of the probability remains inside and it will mix over the nodes in after steps as shown in the above Lemma 1 and 2. ∎

Thus it follows from the above lemma that the largest mixing set is after steps of the random walk. Further, it is shown in Lemma 4 of [33] that the random walk keeps mixing in until steps. In other words, remains the largest local mixing set for at least another steps. Thus the size of the largest local mixing set will not increase from in the further few steps of the walk after the mixing time . Hence the algorithm outputs as a community with high probability. Since we sample the source node from the different blocks, each time our algorithm outputs a new community until all the blocks are identified as separate communities.

The value measures the rate of change of the size of the largest mixing set in each step. When the largest mixing set reaches a community , the vertex expansion becomes which is the conductance of the graph. If the largest mixing set doesn’t reach the community, the size increases in higher rate than . Hence we take to be in our algorithm to stop and output the community. We assume that is given as input, or it can be computed using a distributed algorithm, e.g., [28].

Complexity of the Algorithm in the CONGEST model. Let us first analyze the distributed time complexity of the Algorithm 1 which computes a community corresponding to a given source node. We will focus on the CONGEST model first. The algorithm first computes a BFS tree of depth from the source node. This takes rounds. Note that the diameter of a graph is ; hence the BFS tree covers all the nodes in the community containing the source node. The algorithm then iterates for the length of the walk, . In each iteration:

  • The algorithm probability distribution . As we discussed before, it takes rounds to compute from .

  • collects the sum of smallest s through the BFS tree using binary search method. It takes rounds. This is done for all the potential candidate set of size , where . It may take rounds in the worst case. Hence the total time taken is rounds.

  • Checking if the sum of differences is less than and also checking the community condition is done locally at .

Thus the total time required is , which is bounded by .

Message Complexity of the Algorithm. Let us calculate the number of messages used by the algorithm during the execution in a graph. The degree of a node is in expectation. Hence the number of edges in the graph is . In the worst case, the the algorithm runs over all the edges in the graph. Thus the message complexity of the algorithm for computing a single community is bounded by (time complexity) (the number of edges involved during the execution), which would be in expectation. That is the message complexity of the Algorithm 1 is .

Therefore we have the following main result.

Theorem 5.

Consider a stochastic block model with blocks, where and . Given a node in the graph, there is a distributed algorithm (cf. Algorithm 1) that computes the block containing as a community with high probability in rounds and incurs messages in expectation.

The CDRW algorithm can be used to detect all the communities in the PPM graphs one by one. In that case the running time would be times the time of detecting one community, which is . The message complexity in this case would be in expectation. Thus we have the following theorem.

Theorem 6.

Given a stochastic block model with blocks, where and , there is a distributed algorithm (cf. Algorithm 1) that correctly computes each block as a community with high probability and outputs all the communities in rounds and incurs expected messages.

Iii-B Complexity in the -machine model.

As mentioned earlier, in the -machine model, the input (SBM) graph is partitioned across the machines according to the random vertex partition (RVP) model (cf. Section I-B). The algorithm can be implemented in the -machine model by simulating the corresponding CONGEST model algorithm. Note that since each vertex and its incident edges are assigned to a machine (i.e., its “home” machine — cf. Section I-B), the machine simply simulates the code executed by the vertex in the CONGEST model. If a vertex sends a message to its neighbor in the CONGEST model, then the home machine of sends the same message to the home machine of (addressing it to ). If and have the same (home) machine, then no communication is incurred, otherwise there will be communication along the link that connects these two home machines. This type of simulation is detailed in [26]. Hence one can use the Conversion Theorem (part a) of [26] to compute the round complexity of the CDRW implementation in the -machine model which depends on the message complexity and time complexity of CDRW in the CONGEST model. If and are the message and time complexities (respectively) in the CONGEST model, then in the -machine model then by the Conversion Theorem, the above simulation will give a round complexity of , where is the maximum degree of the graph.666 notation hides a multiplicative and additive factor. For the SBM model, . Hence plugging in the message complexity and time complexity from the CONGEST model analysis, we have that the round complexity in the -machine model is .

Iv Experimental Results

In this section we experimentally analyze the performance of our algorithm in the PPM model under various parameters. In particular, we show how accurately our algorithm can identify the communities in the PPM model. As an important special case, we also analyze the case when , i.e., there is only one community — in other words, the whole graph is a random graph. In this case, we expect the algorithm to output the whole graph as one community.

Since in the PPM model, we know the ground-truth communities, we use F-score metric [32] to measure the accuracy of the detected communities. Let be the set of detected communities by CDRW algorithm and be the ground-truth communities (each is a ground-truth community). Let be the detected community by CDRW using seed node and be the ground-truth community that seed node belongs to. Then the precision is the percentage of truly detected members in detected community defined as and recall is the percentage of truly detected members from the ground truth community defined as . Both precision and recall return a high value when a method detects communities well. For example, if all the detected members belong to the ground-truth community of the seed node, then its precision is equal to ; and if all the ground-truth community members of the seed node are included in the detected community, then its recall value is equal to . We utilize F-score as our accuracy measurement metric which reflects both precision and recall of a result. F-score of a detected community is defined as: . Then the total F-score is equal to the average F-score of all detected communities: . Again a higher F-score value means a better detection of communities.

Fig. 2: Community detection accuracy of CDRW algorithm on random graphs. It shows even when the graph is sparse, when is small and as close to the connectivity threshold as possible, its accuracy is still high. The vertical line shows that when the size is big enough (), the accuracy becomes almost .

The first challenge for any community detection (CD) algorithm is detecting a random graph as a single community. This challenge becomes harder when the graph becomes sparse and it gets closer to the connectivity threshold of a random graph (i.e. , s.t. ) [7]. In the first experiment we show that our CDRW algorithm detects almost the whole graph as a single community resulting in a high F-score accuracy value, see Figure 2. Figure 2 shows that when we increase the size of graph , the accuracy of our algorithm increases as well. For example, for the accuracy metric becomes almost , meaning that almost all the nodes of the graph is detected as a single community. It also shows that when increases (graph gets denser), the accuracy also increases. So in the remaining experiments on PPM graphs, we choose two lowest values of and for generating its random parts in order to give more challenging input graphs to the CDRW algorithm.

Fig. 3: Performance of CDRW algorithm on PPM graphs when there are two parts/communities (). We fixed the size of the graph to , each planted partition is of size . It shows that CDRW works well for small values of and when is small enough.

After showing that CDRW works well on random graphs, now we consider PPM graphs. At first we fix the number of communities to two () so that we can consider the effect of various values of and . This will show us the threshold for the ratio of where CDRW works well. As we showed in Figure 2, when the size of each random graph is big enough (), CDRW detects a single community well. Therefore we set the size of to which makes each ground-truth community big enough (). When considering PPM graphs with and , as the connectivity probability for intra- and inter-community edges, CD algorithms face hardship in detecting communities when is small and is relatively high. But the ratio can not be arbitrarily small because it causes the two communities blend into each other and the graph looses its community structure. Figure 3 shows accuracy of CDRW for different values of and . We highlight that it shows even for sparse parted graphs: for , CDRW detects the two communities with a high F-score value (more than ) for and . In other words, our CDRW algorithm works well even on sparse parted PPM graphs when the ratio is as small as (). Notice that when , the two ground truth communities of the PPM graph are as sparse as possible, i.e close to its connectivity threshold. In the latter example, for instance, when , a partition has in expectation intra and inter community edges. It means the ratio of inter to intra community edges () is high and equal to .

(a) .
(b) .
Fig. 4: Varying the number of ground-truth communities to see its effect on the accuracy of our CDRW algorithm. It shows that when we increase the number of communities, the accuracy decreases slightly. This is expected because the number of inter-community edges increases. Comparing Sub-figure 3(a) and 3(b), we see that if we fix the number of communities, then the accuracy gets higher when the size of the communities becomes larger.

We now consider the effect of increasing the number of ground-truth communities () in order to see its effect on the accuracy of our CDRW algorithm, see Figure 4. We do it in two ways. First, we fix the size of each community to and vary the number of communities. The size of graph is . Figure 3(a) shows that our CDRW algorithm works well when we reasonably increase the number of communities. Second, we fix the size of graph to a number so that the size of each community is when the number of communities is the biggest (), see Figure 3(b). Then, when the number of communities becomes lower, the size of communities gets bigger. By comparing Sub-figures 3(a) and 3(b), we see that when the number of communities are the same, the accuracy is higher when the size of each community is bigger.

V Conclusion

We proposed a distributed algorithm, CDRW, for community detection that works well for the PPM model ( random graph), a standard random graph model used extensively in clustering and community detection studies. Our CDRW algorithm is relatively simple and localized; it uses random walk and mixing time property of graphs to detect communities. We provide a rigorous theoretical analysis of our algorithm on the random graph and characterize its performance vis-a-vis the parameters of the model. In particular, our main result is that it correctly identifies the communities provided , where is the number of communities. Our algorithm takes rounds and hence is quite fast when is relatively small. We point out that our algorithm can also be extended to find communities even faster (by finding communities in parallel), assuming we know an (estimate) of . For future work, it will be interesting to study the performance of this algorithm on other graph models and can be a starting point to design and analyze community detection algorithms that perform well in the more challenging case of real-world graphs.

References

  • [1] E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.
  • [2] E. Abbe and C. Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 670–688. IEEE, 2015.
  • [3] S. Bandyapadhyay, T. Inamdar, S. Pai, and S. V. Pemmaraju. Near-optimal clustering in the -machine model. In Proceedings of the 19th International Conference on Distributed Computing and Networking (ICDCN), 2018.
  • [4] L. Becchetti, A. Clementi, E. Natale, F. Pasquale, and L. Trevisan. Find your place: Simple distributed algorithms for community detection. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 940–959. SIAM, 2017.
  • [5] L. Becchetti, A. E. F. Clementi, P. Manurangsi, E. Natale, F. Pasquale, P. Raghavendra, and L. Trevisan. Average whenever you meet: Opportunistic protocols for community detection. In Y. Azar, H. Bast, and G. Herman, editors, 26th Annual European Symposium on Algorithms, ESA 2018, August 20-22, 2018, Helsinki, Finland, volume 112 of LIPIcs, pages 7:1–7:13. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
  • [6] F. Bénézit, P. Thiran, and M. Vetterli. Interval consensus: from quantized gossip to voting. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 3661–3664. IEEE, 2009.
  • [7] B. Bollobás. Random graphs. In Modern graph theory, pages 215–252. Springer, 1998.
  • [8] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in sparse graphs: A spectral algorithm with optimal rate of recovery. In Conference on Learning Theory, pages 391–423, 2015.
  • [9] F. Chung and L. Lu. The diameter of sparse random graphs. Advances in Applied Mathematics, 26(4):257–279, 2001.
  • [10] A. Clementi, M. Di Ianni, G. Gambosi, E. Natale, and R. Silvestri. Distributed community detection in dynamic graphs. Theoretical Computer Science, 584:19–41, 2015.
  • [11] A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. In Randomization, Approximation, and Combinatorial Optimization. Algorithms and Techniques, pages 221–232. Springer, 1999.
  • [12] A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures & Algorithms, 18(2):116–140, 2001.
  • [13] W. Donath and A. Hoffman. Algorithms for partitioning of graphs and computer logic based on eigenvectors of connections matrices. IBM Technical Disclosure Bulletin, 15, 1972.
  • [14] M. E. Dyer and A. M. Frieze. The solution of some random np-hard problems in polynomial expected time. Journal of Algorithms, 10(4):451–489, 1989.
  • [15] D. Easley and J. Kleinberg. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press, 2010.
  • [16] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5:17–61, 1960.
  • [17] S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
  • [18] S. Fortunato and D. Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016.
  • [19] J. Friedman. A proof of Alon’s second eigenvalue conjecture and related problems. American Mathematical Society, Providence, RI, USA, 2008.
  • [20] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826, 2002.
  • [21] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
  • [22] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
  • [23] T. Inamdar, S. Pai, and S. V. Pemmaraju. Large-scale distributed algorithms for facility location with outliers. In The 22nd International Conference on Principles of Distributed Systems (OPODIS), 2018.
  • [24] H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In Proceedings of the 21st annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 938–948, 2010.
  • [25] B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107, 2011.
  • [26] H. Klauck, D. Nanongkai, G. Pandurangan, and P. Robinson. Distributed computation of large-scale graph problems. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 391–410, 2015.
  • [27] K. Kothapalli, S. V. Pemmaraju, and V. Sardeshmukh. On the analysis of a label propagation algorithm for community detection. In International Conference on Distributed Computing and Networking, pages 255–269. Springer, 2013.
  • [28] F. Kuhn and A. R. Molla. Distributed sparse cut approximation. In OPODIS, 2015.
  • [29] J. Lei, A. Rinaldo, et al. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
  • [30] L. Lovász. Random walks on graphs: A survey. Combinatorics, 2:1–46, 1993.
  • [31] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM International Conference on Management of Data (SIGMOD), pages 135–146, 2010.
  • [32] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
  • [33] A. R. Molla and G. Pandurangan. Local mixing time: Distributed computation and applications. In Proc. of 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 743–752, 2018.
  • [34] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv preprint arXiv:1202.1499, 2012.
  • [35] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.
  • [36] A. Olshevsky and J. N. Tsitsiklis. Convergence speed in distributed consensus and averaging. SIAM review, 53(4):747–772, 2011.
  • [37] G. Pandurangan. Distributed Network Algorithms. url: https://sites.google.com/site/gopalpandurangan/dna, 2016.
  • [38] G. Pandurangan, P. Robinson, and M. Scquizzato. Fast distributed algorithms for connectivity and MST in large graphs. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA, pages 429–438, 2016.
  • [39] G. Pandurangan, P. Robinson, and M. Scquizzato. On the distributed complexity of large-scale graph computations. In Proceedings of the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 405–414, 2018.
  • [40] D. Peleg. Distributed computing: a locality-sensitive approach. SIAM, Philadelphia, PA, USA, 2000.
  • [41] R. Peng, H. Sun, and L. Zanetti. Partitioning well-clustered graphs: Spectral clustering works! In Conference on Learning Theory, pages 1423–1455, 2015.
  • [42] P. Pons and M. Latapy. Computing communities in large networks using random walks. J. Graph Algorithms Appl., 10(2):191–218, 2006.
  • [43] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3):036106, 2007.
  • [44] D. A. Reed and J. Dongarra. Exascale computing and big data. Commun. ACM, 58(7):56–68, 2015.
  • [45] D. Shah et al. Gossip algorithms. Foundations and Trends® in Networking, 3(1):1–125, 2009.
  • [46] H. Zhou and R. Lipowsky. Network brownian motion: A new method to measure vertex-vertex proximity and to identify communities and subcommunities. In International conference on computational science, pages 1062–1069. Springer, 2004.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
354640
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description