
Distributed Asynchronous Averaging for Community Detection

Luca Becchetti (Sapienza Università di Roma, becchetti@dis.uniroma1.it) · Andrea Clementi (Università di Roma Tor Vergata, [clementi,pasquale]@mat.uniroma2.it) · Pasin Manurangsi (U.C. Berkeley, [pasin,raghavendra,luca]@berkeley.edu; this material is based upon work supported by the National Science Foundation under Grants No. 1540685 and No. 1655215) · Emanuele Natale (Max Planck Institute for Informatics, enatale@mpi-inf.mpg.de) · Francesco Pasquale (Università di Roma Tor Vergata) · Prasad Raghavendra (U.C. Berkeley) · Luca Trevisan (U.C. Berkeley)
Abstract

Consider the following probabilistic process on a graph $G = (V, E)$. After initially labeling each vertex by a real number, say randomly chosen in $\{-1, 1\}$, we repeatedly pick a random edge and replace the values of its endpoints by their average.

Suppose the process is run on a graph $G$ exhibiting a community structure, such as two expanders joined by a sparse cut: is there a phase of the process in which its state reflects the underlying community structure? Moreover, can nodes learn that structure via a simple, local procedure?

The above questions arise since the expected action of the one-edge-at-a-time averaging corresponds to the repeated application of the transition matrix of a lazy random walk on $G$, and it is known that, for certain graph classes, the resulting evolution of the state allows one to uncover the underlying community structure.

We answer the first question in the affirmative for a class of regular clustered graphs that includes the regular stochastic block model. Addressing the question (in this restricted class as well) requires studying the concentration of the averaging process around its expectation. In turn, this calls for a deeper understanding of concentration properties of the product of certain random matrices around its expectation. These properties (albeit in different flavors) emerge both in the regime in which the sparsity of the cut is $o(1)$ (with constant expansion within each community), and when the sparsity is constant. The analysis in the latter regime is the technically hardest part of this work, because we have to establish concentration results up to inverse-polynomial errors.

As for the second question, since nodes do not share a common clock, it is not immediate to translate the above results into distributed clustering protocols. To this purpose, we show that concentration holds over a long time window and most nodes are able to select a local time within this window. This results in the first asynchronous distributed algorithms that require logarithmic or polylogarithmic work per node (depending on the sparsity of the cut) and that approximately recover community structure.

Keywords: Distributed Community Detection, Distributed Averaging Processes, Spectral Analysis, Probabilistic Algorithms.

1 Introduction

Consider the following distributed process on an undirected regular graph $G = (V, E)$. Each node holds a real number (which we call the state of the node); at each time step, a random node $u$ gets activated (in an essentially equivalent continuous-time model, each node has a clock that ticks at random intervals with a Poisson distribution of average 1, and when the clock of node $u$ ticks, $u$ becomes activated; for $t$ sufficiently large, the behavior of the continuous-time process for $t$ units of time and the behavior of the discrete-time process for $t n$ steps are roughly equivalent), it selects a random neighbor $v$, and then nodes $u$ and $v$ update their state to their average.

This can be regarded as an asynchronous distributed protocol, in which no global clock exists (nodes only know how many times they have been activated as endpoints of a selected edge), and which runs on an anonymous network, i.e., nodes are not aware of their own or their neighbors' identities. Furthermore, all nodes run the same process at all times.
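To make the process concrete, the following minimal simulation sketch (in Python; the graph encoding and all names are our own, not part of the protocol) runs the one-edge-at-a-time averaging on an adjacency-list representation of a regular graph:

```python
import random

def averaging_process(adj, steps, rng=random.Random(0)):
    """Simulate the one-edge-at-a-time averaging process.

    adj: adjacency lists of an undirected regular graph, {node: [neighbors]}.
    Each step: a uniformly random node is activated, it picks a uniformly
    random neighbor, and both endpoints replace their values by the average.
    (On a regular graph this is the same as sampling a uniform random edge.)
    """
    x = {u: rng.choice([-1.0, 1.0]) for u in adj}   # random initial state
    nodes = list(adj)
    for _ in range(steps):
        u = rng.choice(nodes)             # activated node
        v = rng.choice(adj[u])            # random neighbor
        x[u] = x[v] = (x[u] + x[v]) / 2   # both endpoints take the average
    return x
```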

The long-term behavior of this process is well understood: for each initial global state $\mathbf{x}$, assuming the graph is connected, the global state converges to all nodes holding the average of their initial states. A variant of an argument of Boyd et al. [4] shows that convergence occurs in time $O\!\left(\frac{n \log n}{\lambda_2}\right)$, where $\lambda_2$ is the second smallest eigenvalue of the normalized Laplacian of $G$.

Suppose now that $G$ exhibits a community structure, which in the simplest case consists of two equal-size expanders (the two “communities” of the graph), connected by a sparse cut, and that we run our averaging process starting, for example, from an initial random global state. We might reasonably expect to see a faster convergence toward a local average within each community, and a slower convergence toward the global average over the entire graph, with a transient phase in which all nodes within the same community hold values that are close to each other, and there is a gap between the local averages of the two communities. If this were the case, the global state during the transient phase would be correlated with the indicator of the cut between the communities. This intuition suggests the main questions we address in this paper:

Is there a phase in which the global state carries information about community structure? If so, how strong is the corresponding “signal”? Finally, can nodes leverage local history to learn the underlying community structure?

Our main motivation is the design of simple, lightweight, asynchronous protocols for fully-decentralized community detection. By “lightweight” we mean protocols that require minimalistic assumptions as to network capabilities (e.g., nodes possess no identities and can only exchange one-to-one messages in a single time step), while performing their task with minimal work, storage and communication per node (all of them at most logarithmic or polylogarithmic in $n$ in our applications).

1.1 Distributed community detection

In the following, we denote by $P$ the transition matrix of the random walk on the regular graph $G$, by $L = I - P$ its normalized Laplacian matrix and by $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \le 2$ the eigenvalues of $L$. Recall that the eigenvalues of $P$ are $1 - \lambda_1 \ge 1 - \lambda_2 \ge \cdots \ge 1 - \lambda_n$.

It is easy to see that the expected action of the averaging process on the global state $\mathbf{x}$ in one step is $\left(\left(1 - \frac{1}{n}\right) I + \frac{1}{n} P\right)\mathbf{x}$, and so the expected action of the process is that of a lazy random walk on $G$, slowed down by a factor $n$.
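This identity is easy to check numerically; a small sketch of our own (using numpy, with a 4-cycle as an arbitrary regular example graph):

```python
import numpy as np

def expected_step_matrix(edges, n):
    """Average of the per-edge update matrices W_e = I - (1/2) q q^T,
    with q = e_u - e_v, over a uniformly random edge."""
    EW = np.zeros((n, n))
    for (u, v) in edges:
        q = np.zeros(n)
        q[u], q[v] = 1.0, -1.0
        EW += np.eye(n) - 0.5 * np.outer(q, q)
    return EW / len(edges)

# 4-cycle: 2-regular graph on n = 4 nodes
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4
P = np.array([[0, .5, 0, .5],
              [.5, 0, .5, 0],
              [0, .5, 0, .5],
              [.5, 0, .5, 0]])  # random walk matrix A/d
assert np.allclose(expected_step_matrix(edges, n),
                   (1 - 1 / n) * np.eye(n) + P / n)
```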

Since the random walk on $G$ mixes in time $O\!\left(\frac{\log n}{\lambda_2}\right)$, it is intuitive that, as proved by Boyd et al., the averaging process “mixes” in time $O\!\left(\frac{n \log n}{\lambda_2}\right)$, although this is not a trivial fact, because it is possible for the action of a sequence of random operations to deviate significantly from expectation (for example, imagine that we pick a random node and then with probability 1/2 we leave its state unchanged, and with probability 1/2 we change its sign; then the expected action is $\left(1 - \frac{1}{n}\right) I$, and, after order $n \log n$ steps, the expected action shrinks the global state by a polynomial factor, while the actual process never shrinks the global state). Indeed, Boyd et al. need a second-moment argument to prove their result.
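For the sign-flip example in the parenthetical remark, the expected action can be computed explicitly (a short derivation of our own): the update matrix is $I$ with probability $\frac12$ and $I - 2 e_u e_u^{\top}$ with probability $\frac12$, for a uniformly random node $u$, so

```latex
\mathbb{E}[W]
  = \frac{1}{n}\sum_{u}\Big(\tfrac12\, I + \tfrac12\,\big(I - 2\, e_u e_u^{\top}\big)\Big)
  = I - \frac{1}{n}\sum_{u} e_u e_u^{\top}
  = \Big(1 - \tfrac{1}{n}\Big)\, I .
```

Hence $\mathbb{E}[W]^t$ shrinks every vector by $(1 - 1/n)^t$, which is polynomially small for $t = \Theta(n \log n)$, while every realized update matrix preserves the norm of the state.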

The assumption that $G$ consists of two equal-size expanders connected by a sparse cut is essentially equivalent to the assumption that $\lambda_2$ is small (depending on the sparsity of the cut), that $\lambda_3$ is large (depending on the expansion of the two expanders) and that the indicator vector of the cut separating the two expanders is close to an eigenvector of $\lambda_2$.

If we start from a random vector $\mathbf{x}$, and then perform a suitable number $t$ of applications of the random walk matrix of $G$ to it, the result is a vector that essentially is a mixture of the eigenvectors of $\lambda_1$ and $\lambda_2$. From this vector, we can approximately recover the cut that separates the two communities. Becchetti et al. [2] use these ideas to devise a distributed algorithm for community detection; in the basic version of their algorithm, each node holds a state that is a real value, and at each step of the algorithm each node updates its state to the average of the states of its neighbors. The action of one step of the algorithm corresponds to one application of the random walk matrix to the global state. One key contribution in [2] is a local strategy that allows nodes to infer the community that they belong to based on their local history. Namely, nodes only need to keep track of whether their state is increasing or decreasing with time. This strategy crucially relies on the synchronicity of the protocol.

Note that a similar phenomenon occurs in the asynchronous protocol in expectation: for any fixed initial state $\mathbf{x}$, the expected state reached after running $t$ steps of the one-edge-at-a-time averaging protocol is $\left(\left(1 - \frac{1}{n}\right) I + \frac{1}{n} P\right)^t \mathbf{x}$, which, for suitable $t$, is again almost a mixture of eigenvectors of $\lambda_1$ and $\lambda_2$, as can be easily checked. This naturally raises the following questions:

  1. Does the expected behavior also approximately occur with high, or even constant, probability?

  2. After reaching a global state that approximately lies in the eigenspace of the first two eigenvalues, can nodes asynchronously recover the underlying community structure?

A positive response to the above questions would yield the first asynchronous protocol for distributed community detection in which each node exchanges $O(\log n)$ messages, regardless of graph density, with work and storage per node also logarithmic in $n$. Note that the total work, communication and storage would all be $O(n \log n)$, which, in dense graphs, is sublinear in the size of the graph. In comparison, the protocol of [2] is synchronous and each node exchanges messages with all of its $d$ neighbors at every round, with $d$ the degree of the graph (in this discussion we take $G$ to be regular, although the result of [2] applies to “almost-regular” graphs in which the degrees of the nodes fall within a small range). This can be fairly large for dense graphs.

We next discuss the meaning of recovering the “underlying community structure” in a distributed setting, which can come in stronger or weaker flavors.

Ideally, we would like the protocol to reach a state in which, at least with high probability, each node can use a simple rule to assign itself one of two possible labels, so that labelling within each community is consistent and nodes in different communities are assigned different labels. Achieving this corresponds to exact reconstruction. The next best guarantee is weak reconstruction (see Definition 2.2). In this case, with high probability the above property is true for all but a small fraction of misclassified nodes. In this paper, we introduce a third notion, which we call community-sensitive labeling: in this case, there is a predicate that can be applied to pairs of labels so that, for all but a small fraction of outliers, the labels of any two nodes within the same community satisfy the predicate, whereas the converse occurs for pairs of nodes from different communities (note that a weak reconstruction protocol entails a community-sensitive labeling: in that case, the predicate is simply true when two labels are the same). In this paper, we use binary sequences as labels, while the predicate we consider is true for a label pair whenever their Hamming distance is below a certain threshold.

Note that the weaker notion of community-sensitive labeling we introduce allows nodes to locally tell “friends” in their community from “foes” in the other community, which is the main application of distributed community detection.
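As an illustration, with binary labels the “friends vs. foes” test is just a Hamming-distance threshold; a minimal sketch (the threshold value here is our own choice, for illustration only):

```python
def same_community(label_u, label_v, threshold=2):
    """Predicate on a pair of binary labels: True iff their Hamming
    distance is below the threshold."""
    assert len(label_u) == len(label_v)
    return sum(a != b for a, b in zip(label_u, label_v)) < threshold

# With 8-bit labels and threshold 2:
# same_community("00110101", "00110111")  -> True  (distance 1: "friends")
# same_community("00110101", "11001010")  -> False (distance 8: "foes")
```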

1.2 Our results

We are able to positively answer the above questions under the following regularity condition: the graph $G$ is $d$-regular, and there is a partition of $V$ into two equal-size subsets $V_1$ and $V_2$ such that the subgraph induced by either set is $a$-regular; hence, the graph induced by the edges that cross the partition is $b$-regular for $b = d - a$ (see Definition 2.1).

This assumption implies that the indicator of the cut is an eigenvector of the normalized Laplacian with eigenvalue $\frac{2b}{d}$; we further assume that $\frac{2b}{d} < \lambda_3$, implying that the indicator of the cut is an eigenvector of $\lambda_2$.

We denote by $\Pi_1$ the projection on the eigenspace of $\lambda_1$, by $\Pi_2$ the projection on the eigenspace of $\lambda_2$ and by $\Pi_{\perp}$ the projection on the eigenspace of the remaining eigenvalues $\lambda_3, \ldots, \lambda_n$.

For any initial vector $\mathbf{x}$ and any $t$, the random variable $\mathbf{x}^{(t)}$ denotes the global state reached by the process after $t$ steps, starting from the initial global state $\mathbf{x}^{(0)} = \mathbf{x}$. Given any node $u$, we call $\mathbf{x}^{(t)}(u)$ (i.e., the component of $\mathbf{x}^{(t)}$ corresponding to node $u$) the state of $u$ at time $t$. If we denote by $\mathcal{W}$ the distribution over matrices that describes the action of one step of the algorithm, then $\mathbf{x}^{(t)}$ is sampled by i.i.d. sampling $W^{(1)}, \ldots, W^{(t)}$ from $\mathcal{W}$ and then computing $\mathbf{x}^{(t)} = W^{(t)} \cdots W^{(1)} \mathbf{x}$. Recall that if $W$ is a random matrix distributed according to $\mathcal{W}$, then

$$\mathbb{E}[W] = \left(1 - \frac{1}{n}\right) I + \frac{1}{n} P.$$

Thus, $\mathbb{E}[W]$ has the same eigenvectors as $P$, and to every eigenvector of $L$ with eigenvalue $\lambda_i$ there corresponds an eigenvector of $\mathbb{E}[W]$ with eigenvalue $1 - \frac{\lambda_i}{n}$. Finally, recall that

$$\mathbb{E}\left[\mathbf{x}^{(t)}\right] = \left(\mathbb{E}[W]\right)^t \mathbf{x}.$$

1.2.1 The case of small $\lambda_2$

Our first result is that there is an absolute constant $c$ such that, for most initial vectors $\mathbf{x}$ and for $t \ge c\,\frac{n \log n}{\lambda_3}$, we have (see Section 3.1 for rigorous statements and arguments)

(1)

In order to provide some intuition about this result, in the lines that follow we present a simple first-moment analysis, showing that $\mathbb{E}\left[\mathbf{x}^{(t)}\right]$ is indeed close to $\Pi_1 \mathbf{x} + \left(1 - \frac{\lambda_2}{n}\right)^t \Pi_2 \mathbf{x}$. We have

$$\mathbb{E}\left[\mathbf{x}^{(t)}\right] = \Pi_1 \mathbf{x} + \left(1 - \frac{\lambda_2}{n}\right)^t \Pi_2 \mathbf{x} + \mathbf{e}^{(t)},$$

where

$$\left\|\mathbf{e}^{(t)}\right\| \le \left(1 - \frac{\lambda_3}{n}\right)^t \|\mathbf{x}\| \le \frac{1}{\operatorname{poly}(n)}\,\|\mathbf{x}\|$$

whenever $t \ge c\,\frac{n \log n}{\lambda_3}$ for a sufficiently large absolute constant $c$. So

$$\mathbb{E}\left[\mathbf{x}^{(t)}\right] \approx \Pi_1 \mathbf{x} + \left(1 - \frac{\lambda_2}{n}\right)^t \Pi_2 \mathbf{x},$$

provided that at least an inverse-polynomial fraction of the mass of $\mathbf{x}$ is in the eigenspace of the second eigenvalue, which is true for most initial vectors $\mathbf{x}$.

Unfortunately, proving (1) is considerably more challenging. In fact, given (1) and the above derivations, under the aforementioned assumptions on $t$ and $\mathbf{x}$, it is possible to show that $\mathbf{x}^{(t)}$ is concentrated around its expectation, thus proving that the overall process has small variance.

If $\lambda_2$ is sufficiently smaller than $\lambda_3$, then (1) implies that, after $t = \Theta\!\left(\frac{n \log n}{\lambda_3}\right)$ steps, we have

$$\mathbf{x}^{(t)} \approx \Pi_1 \mathbf{x} + \Pi_2 \mathbf{x}.$$

This means that the global state at time $t$ is essentially the projection of the initial state onto the eigenspace spanned by the first and second eigenvectors of $L$. This in turn implies that, over the randomness of the initial global state, the sign of the state of a node at time $t$ is correlated with the community the node belongs to.

We then show that, if $\lambda_2$ is sufficiently smaller than $\lambda_3$, with high probability a large fraction of the vertices have states that possess a certain “good” property (informally, a node is good at some time if its sign agrees with that of the average, computed over the nodes of the community the node belongs to) over the entire duration of a long time window (see Lemmas 3.8 and 4.6). Proving this claim for a particular time within the aforementioned window follows from (1) (more precisely, Theorem 3.1), via a counting argument. Yet, proving this claim over the entire window is considerably harder and does not follow from a union-bound argument, since our second-moment argument does not imply a small enough error probability. Rather, we need a careful analysis of how many nodes can become “bad” over the window under consideration.

On the other hand, the aforementioned good property has the following implications:

  • For any two good nodes belonging to different communities, the signs of the nodes’ states differ over the entire duration of the above window with probability at least $\frac{1}{2} - o(1)$ (over the randomness of the initial state).

  • For any two good nodes in the same community, the signs of the nodes’ states are the same over the entire duration of the above window with probability $1 - o(1)$.

In the asynchronous setting we consider, each node runs the averaging protocol for a given number of activations. Upon being activated for the $T$-th time, a node sets its label to the sign of its current state. The threshold $T$ is chosen in such a way that, with high probability, the global time at which a node sets its label falls within the aforementioned global window.

Then, running a suitable number $\ell$ of parallel copies of the protocol (in a carefully interleaved way that ensures their independence), and setting the label of a node to the binary string resulting from the concatenation of the $\ell$ bits generated by the various parallel instantiations of the algorithm, yields a community-sensitive labeling (see Theorem 4.1 in Section 4).
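A sketch of the resulting local labeling rule, from a single node’s perspective (all names are ours; `histories` would be filled in by the interleaved parallel runs of the averaging protocol):

```python
def node_label(histories, T):
    """Binary label of one node: one bit per parallel copy of the
    protocol, each bit being the sign of the node's state at its
    T-th activation in that copy."""
    return "".join("1" if h[T] >= 0 else "0" for h in histories)
```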

We note that our algorithm runs for $O(n \log n)$ global steps, requiring $O(\log n)$ work per node, thus with a parallel time that is faster than the mixing time of a random walk on the graph.

1.2.2 The case of large $\lambda_2$

Our next results concern the case in which $\lambda_2$ and $\lambda_3$ are different constants. In this case we show that, after order of $n \log n$ steps, the component $\mathbf{y}^{(t)}$ of the state in the eigenspace of $\lambda_2$ is much bigger (in norm) than the component $\mathbf{z}^{(t)}$ orthogonal to the first two eigenvectors (see Theorem 5.1 and Corollaries 5.2 and 5.3).

Setting up the stage.

Before discussing our techniques, it may help to take a closer look at the effect of the expected action. Recall that

$$\mathbb{E}\left[\mathbf{y}^{(t)}\right] = \left(1 - \frac{\lambda_2}{n}\right)^t \mathbf{y}^{(0)}$$

and

$$\left\|\mathbb{E}\left[\mathbf{z}^{(t)}\right]\right\| \le \left(1 - \frac{\lambda_3}{n}\right)^t \left\|\mathbf{z}^{(0)}\right\|.$$

Furthermore, with high probability, the lengths of $\mathbf{y}^{(0)}$ and $\mathbf{z}^{(0)}$ are polynomially related. Then, as $t$ grows, $\left\|\mathbb{E}\left[\mathbf{y}^{(t)}\right]\right\|$ exceeds $\left\|\mathbb{E}\left[\mathbf{z}^{(t)}\right]\right\|$ by a polynomial factor.

Our proof proceeds by setting up a two-parameter recursion that expresses $\mathbb{E}\left[\|\mathbf{y}^{(t+1)}\|^2\right]$ and $\mathbb{E}\left[\|\mathbf{z}^{(t+1)}\|^2\right]$ in terms of $\mathbb{E}\left[\|\mathbf{y}^{(t)}\|^2\right]$ and $\mathbb{E}\left[\|\mathbf{z}^{(t)}\|^2\right]$ (Lemma 5.7), which we then solve in the proof of Lemma 5.8, showing upper bounds on $\mathbb{E}\left[\|\mathbf{y}^{(t)}\|^2\right]$ and $\mathbb{E}\left[\|\mathbf{z}^{(t)}\|^2\right]$. The upper bound on $\mathbb{E}\left[\|\mathbf{z}^{(t)}\|^2\right]$ is good enough to argue, using Markov’s inequality, that $\|\mathbf{z}^{(t)}\|$ is small with high probability. To complete the argument, we have to show that $\|\mathbf{y}^{(t)}\|$ exceeds a certain threshold with high probability. To this purpose, we combine the upper bound on $\mathbb{E}\left[\|\mathbf{y}^{(t)}\|^2\right]$ with an exact calculation (see Observation 5.4) of $\mathbb{E}\left[\mathbf{y}^{(t)}\right]$, which allows us to apply Chebyshev’s inequality.

Extending the basic protocol.

Our bounds in Lemma 5.8 only allow us to prove that $\|\mathbf{y}^{(t)}\|$ is bigger than $\|\mathbf{z}^{(t)}\|$ when $\lambda_3$ is a much larger constant than $\lambda_2$. To address this issue, we introduce a lazy variant of the basic averaging protocol: a protocol in which the two nodes $u$ and $v$ active at a given time step update their states to $(1 - \delta)\, x_u + \delta\, x_v$ and to $(1 - \delta)\, x_v + \delta\, x_u$, respectively, where $\delta$ is a laziness parameter which is equal to $\frac12$ in the basic averaging protocol discussed so far. We are then able to show that $\|\mathbf{y}^{(t)}\|$ has a high probability of being bigger than $\|\mathbf{z}^{(t)}\|$ even when $\lambda_2$ and $\lambda_3$ are arbitrarily close constants (in this case, the laziness parameter will depend on $\lambda_3 - \lambda_2$).
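In code, the lazy update is a one-line change with respect to the basic protocol; a sketch (the function name is ours):

```python
def lazy_average(x_u, x_v, delta):
    """One lazy averaging step for the active pair (u, v): each endpoint
    moves a delta-fraction toward the other's value. delta = 1/2 recovers
    the basic averaging protocol."""
    return ((1 - delta) * x_u + delta * x_v,
            (1 - delta) * x_v + delta * x_u)
```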

We note that the above bounds imply that $\|\mathbf{z}^{(t)}\|$ vanishes relative to $\|\mathbf{y}^{(t)}\|$, decaying like $\epsilon(n)\,\|\mathbf{y}^{(t)}\|$, where $\epsilon(n)$ goes to zero inverse-polynomially in $n$ for the values of $t$ that we are interested in. Our results also imply that, with high probability, $\|\mathbf{y}^{(t)}\|$ remains bigger than $\|\mathbf{z}^{(t)}\|$ over a very long time interval, i.e., until $t$ is in the order of $n \log n$. Thus, nodes can easily select a local time that falls, with high probability, in the global time window in which the above property holds.

Recovering community structure.

After $\Theta(n \log n)$ global steps, the global state correlates with the community structure, in the sense that the global state is, up to a small error (i.e., the component $\mathbf{z}^{(t)}$), a linear combination of the all-one vector (that is, the component $\bar{x}\,\mathbf{1}$, where $\bar{x}$ is the average of the states) and the indicator of the cut between the communities (that is, the component $\mathbf{y}^{(t)}$). Unfortunately, the latter component is polynomially smaller than the former, so we need to remove the component parallel to $\mathbf{1}$ in order to uncover the community structure.

We achieve this by resorting to a trick introduced, in the synchronous and deterministic setting, by Becchetti et al. [2]: each node subtracts the current state from the state that it had at a previous time. Since the component parallel to $\mathbf{1}$ does not change over time, it is canceled by this operation. Also, the component in the eigenspace of $\lambda_3, \ldots, \lambda_n$ is always small after a certain time. As a result, the difference between the two states is mostly due to changes in the projection on the eigenspace of $\lambda_2$. In the synchronous deterministic case (in which each node updates to the average of all neighbors), this component changes at every step, so that one picks up the signal in the difference. In our case, we can argue that the component in the eigenspace of $\lambda_2$ decreases appreciably after order of $n$ global steps.

The overall protocol is described in Section 6. Each node, when activated, performs a lazy averaging update. When a node is activated for the $(c \log n)$-th time, where $c$ is a fixed constant, then it has high confidence that the global time is $\Theta(n \log n)$ and that the component of the global state in the space orthogonal to $\mathbf{1}$ and $\chi$ is small; at this point, the node makes a copy of its current state in a separate local memory location. In subsequent activations, the node continues to run the averaging process as before, but it also computes the difference between the current state and the copy of the older state. After $\Theta(\log n)$ further activations, that is, after order of $n \log n$ further global steps, the component in the eigenspace of $\lambda_2$ of the global state will have decreased appreciably, and the difference between the current state of the node and the stored state will be correlated with the community that the node belongs to. Of course this is just a rough intuition, because different nodes will make a copy of their state at different global times, so the collection of stored copies of the various nodes does not correspond to any global state at any time. See Theorem 6.1 for a rigorous statement that formalizes the above intuition.
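The following per-node sketch captures the snapshot-and-difference rule just described (all attribute and parameter names are ours; T1 and T2 stand for the two activation thresholds discussed above):

```python
def on_activation(node, T1, T2):
    """Called each time `node` takes part in a lazy averaging step,
    after its state has been updated."""
    node.count += 1
    if node.count == T1:
        node.snapshot = node.state          # copy of the older state
    elif node.count == T1 + T2:
        # the component parallel to the all-one vector cancels out here
        diff = node.state - node.snapshot
        node.community_bit = 1 if diff >= 0 else 0
```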

If we run several instantiations of the protocol in parallel, we obtain, as before, a community-sensitive labeling. The advantage over the “vanilla” protocol is that the error probability of the protocol (the probability that the reconstruction fails to have the required properties) can be pushed from a small constant to an inverse-polynomial factor, at the cost of an extra $O(\log n)$ factor in the communication complexity, memory and running time (see Corollary 6.9 in Section 6).

1.3 Comparison to previous work

We have already outlined a comparison of our results with those in [2]. The advantage of our work over [2] is that, for the first time, it applies to an asynchronous model and that the communication complexity per node does not depend on the degree of the graph. The advantage of [2] is that their analysis works in graphs that are not regular and extends to the case of more than two communities. Furthermore, in graphs in which the indicator of the cut between communities is an eigenvector of $L$, the algorithm of [2] performs exact reconstruction.

Figure 1 provides a summary of the features of our algorithm with respect to [2]. The lines labeled as “CSL version” refer to the version of our algorithms in which we run several instances of a basic protocol in parallel so as to achieve community-sensitive labeling.

                          This work (small $\lambda_2$)   This work (large $\lambda_2$)   [2]
Protocol Type             Asynchronous                    Asynchronous                    Synchronous
Eigenvalue Requirement
Average Work Per Node
  (CSL Version)
Fraction of Outliers
  (CSL Version)
Probability of Failure
  (CSL Version)
Multiple Clusters         No                              No                              Yes
Almost Regular Graphs     No                              No                              Yes

Figure 1: Comparison of our result with [2], choosing appropriate parameters in Theorem 4.1.

If we do not restrict to simple asynchronous protocols (for instance, Sun and Zanetti [14] present a synchronous distributed algorithm in which, at each step, nodes construct a matching and then update their states to the average along the edges of such matching; under suitable spectral assumptions, the analysis in [14] claims a synchronous algorithm running in a logarithmic number of steps and able to perform approximate reconstruction with multiple communities; Sun and Zanetti recently discovered a gap in their analysis (personal communication), and they retracted the claims in [14]), other techniques for community detection and spectral clustering exist, which are not based on averaging. In particular, Kempe and McSherry showed that the top $k$ eigenvectors of the adjacency matrix of the underlying graph can be computed in a distributed fashion [10]. These eigenvectors can then be used to partition the graph; in our setting, since we assume that the indicator of the cut is the second eigenvector of the graph, applying Kempe and McSherry’s algorithm with $k = 2$ immediately reveals the underlying partition. Again, we note here that the downside of this algorithm is that it is synchronous and quite complex. In particular, the algorithm requires the distributed computation of eigenvectors, for which the work per node may be a bottleneck, while our first algorithm only requires logarithmic work per node, a difference that can become significant for very sparse cuts.

At a technical level, we note that our analysis establishes concentration results for products of certain i.i.d. random matrices; concentration of such products has been studied in the ergodic theory literature [6, 9], but under assumptions that are not met in our setting, and with convergence rates that are not suitable for our applications.

While we only focused on decentralized settings so far, we note that the question of community detection, especially in stochastic block models, has been extensively studied in the centralized (non-distributed) setting and this regime is, by now, very well understood. In the centralized setting, the focus of most studies on stochastic block models is on determining the threshold at which weak recovery becomes possible, rather than simplicity or running time of the algorithm (as most algorithms are already reasonably simple and efficient). After a remarkable line of work [7, 13, 11, 12], such a threshold has now been precisely determined.

Concerning stochastic block models, we remark that the class of graphs to which our analysis applies includes w.h.p. graphs sampled from the regular stochastic block model [2, 5, 13] with appropriate parameters. The regular stochastic block model is defined as follows: given parameters $n$, $a$ and $b$, a random graph on $2n$ nodes is obtained by partitioning the nodes into two equal-sized communities $V_1$ and $V_2$ and then sampling a random $a$-regular graph over each of $V_1$ and $V_2$. A random bipartite $b$-regular graph is then sampled between $V_1$ and $V_2$. The final graph is the union of the two edge sets. This model can be instantiated in different ways depending on how we sample the random regular graphs (for example, via the uniform distribution over regular graphs, or by taking the disjoint union of random matchings). Thus, by definition, every graph sampled from the regular stochastic block model with parameters $n$, $a$ and $b$ is a $(2n, a + b, b)$-clustered regular graph according to our Definition 2.1.
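A sketch of one possible instantiation, taking unions of random perfect matchings (one of the options mentioned above; names and interface are our own, and this simple version may occasionally create parallel edges):

```python
import random

def regular_sbm(n, a, b, rng=random.Random(0)):
    """Sample an (a+b)-regular two-community graph on 2n nodes (n even):
    the union of a random perfect matchings inside each community and
    b random perfect matchings across the cut."""
    V1, V2 = list(range(n)), list(range(n, 2 * n))
    edges = []
    for side in (V1, V2):            # a matchings inside each community
        for _ in range(a):
            p = side[:]
            rng.shuffle(p)
            edges += [(p[i], p[i + 1]) for i in range(0, n, 2)]
    for _ in range(b):               # b matchings across the cut
        p = V2[:]
        rng.shuffle(p)
        edges += list(zip(V1, p))
    return edges
```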

It turns out that the hypothesis on $\lambda_3$ required by the analysis in Subsection 1.2.1 is (with high probability) satisfied when the inner degree $a$ is a sufficiently large constant (see also [3, 5]). We thus have that the averaging protocol returns a community-sensitive labeling in this range of parameters of the regular stochastic block model, with performances similar to those described in Subsection 1.2.1.

Moreover, the spectral condition required by the analysis in Subsection 1.2.2 is (with high probability) satisfied in the regular stochastic block model with parameters such that $\lambda_3 - \lambda_2 \ge \gamma$, where $\gamma$ is any positive constant. Thus, in this parameter range, the lazy averaging protocol (with high probability) returns a weak reconstruction with performances similar to those described in Subsection 1.2.2.

1.4 Roadmap

The remainder of this paper is organized as follows.

After presenting some preliminaries in Section 2, the analysis of the averaging process for the case of small $\lambda_2$ is described in Section 3. In Section 4, the analysis of the previous section drives the design of an accurate, asynchronous protocol for community-sensitive labeling for small $\lambda_2$. So, Sections 3 and 4 together offer a formal presentation of the results sketched in Subsection 1.2.1.

In Section 5, we address the case of large $\lambda_2$ and derive bounds for $\|\mathbf{y}^{(t)}\|$ and $\|\mathbf{z}^{(t)}\|$. These bounds are used in the following Section 6 to derive a suitable clustering criterion, which results in a second distributed protocol for weak reconstruction that proves effective for large values of $\lambda_2$. Then, we show how to transform the weak reconstruction protocol to get a good community-sensitive labeling for the same class of graphs. So, Sections 5 and 6 together provide a rigorous presentation of the results summarized in Subsection 1.2.2.

Due to the considerable length of this paper, we decided to move the proofs of some technical lemmas stated in the first part into a separate appendix.

2 Preliminaries

We study the weighted version of the Averaging process described in the introduction. At each round an edge of the graph is sampled uniformly at random and the two endpoints of the sampled edge execute the following algorithm. Notice that this process can be seen as a distributed algorithm in the asynchronous communication model [4].

Averaging   (for a node $u$ that is one of the two endpoints of an active edge)

Initialization:

If it is the first time $u$ is active, then pick $x_u \in \{-1, 1\}$ u.a.r.

Update:

Send $x_u$ to the other endpoint of the active edge and then update

$x_u \leftarrow (1 - \delta)\, x_u + \delta\, x_v$, where $x_v$ is the value received from the other endpoint.

Algorithm 1: Updating rule for a node $u$ of an active edge, where $\delta$ is the parameter measuring the weight given to the neighbor’s value
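A direct transcription of Algorithm 1 from a single node’s perspective (a sketch; the messaging layer is abstracted into a `send`/`receive` pair driven by whoever samples the active edge):

```python
import random

class AveragingNode:
    """State and updating rule of one node running Algorithm 1."""

    def __init__(self, delta, rng=random.Random(0)):
        self.delta = delta  # weight given to the neighbor's value
        self.rng = rng
        self.x = None       # state; initialized at the first activation

    def send(self):
        """Initialization + send, executed when an incident edge is active."""
        if self.x is None:
            self.x = self.rng.choice([-1.0, 1.0])  # pick u.a.r.
        return self.x

    def receive(self, x_v):
        """Update with the value received from the other endpoint."""
        self.x = (1 - self.delta) * self.x + self.delta * x_v

# One global step on an active edge (u, v):
#   xu, xv = u.send(), v.send()
#   u.receive(xv); v.receive(xu)
```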

For a graph $G$ with $n$ nodes and adjacency matrix $A$, let $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal matrix with the degrees of the nodes (for a $d$-regular graph, $L = I - \frac{1}{d} A$, with $d$ also the average degree). We here consider the following class of graphs.

Definition 2.1 (Clustered regular graphs).

Let $n$ be an even integer and let $d$ and $b$ be two positive integers with $b \le d$.

  • An $(n, d, b)$-clustered regular graph is a graph $G = (V, E)$ over node set $V = V_1 \cup V_2$, with $|V_1| = |V_2| = n/2$, such that: (i) every node has degree $d$ and (ii) every node in $V_1$ has $b$ neighbors in $V_2$ and every node in $V_2$ has $b$ neighbors in $V_1$.

  • An $(n, d, b)$-clustered regular graph is said to have good inner expansion if the spectrum of its normalized Laplacian satisfies $\lambda_2 = \frac{2b}{d}$ and $\lambda_3 \ge \frac{1}{2}$ (we have fixed here the value $\frac12$ but, in our analysis, any absolute constant in $(0, 1)$ would work as well).

We recall that, for an $(n, d, b)$-clustered regular graph $G$, the all-one vector $\mathbf{1}$ and the cut vector $\chi = \mathbf{1}_{V_1} - \mathbf{1}_{V_2}$ (for $S \subseteq V$, we use $\mathbf{1}_S$ to denote the indicator vector of $S$, i.e., $\mathbf{1}_S(u) = 1$ if $u \in S$ and $\mathbf{1}_S(u) = 0$ otherwise) are eigenvectors of $L$ with eigenvalues $0$ and $\frac{2b}{d}$, respectively.
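This can be verified directly from the degree structure (a short check of our own): for $u \in V_1$, node $u$ has $d - b$ neighbors in $V_1$ (where $\chi = +1$) and $b$ neighbors in $V_2$ (where $\chi = -1$), so

```latex
(L\chi)(u) \;=\; \chi(u) - \frac{1}{d}\sum_{v \sim u} \chi(v)
           \;=\; 1 - \frac{(d - b) - b}{d}
           \;=\; \frac{2b}{d}
           \;=\; \frac{2b}{d}\,\chi(u),
```

and symmetrically for $u \in V_2$; hence $L\chi = \frac{2b}{d}\,\chi$. The computation for $\mathbf{1}$ is analogous and gives $L\mathbf{1} = \mathbf{0}$.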

Next, we recall the notion of weak reconstruction [2].

Definition 2.2 (Weak Reconstruction).

A function $f : V \to \{1, 2\}$ is said to be an $\epsilon$-weak reconstruction of $G$ if there exist subsets $W_1 \subseteq V_1$ and $W_2 \subseteq V_2$, each of size at least $(1 - \epsilon)\,\frac{n}{2}$, such that $f(W_1) \cap f(W_2) = \emptyset$.

A community-sensitive labeling is a function that assigns each node a binary word, i.e., a signature, so that nodes from the same community are assigned labels with small Hamming distance, while nodes from different communities receive labels that have large Hamming distance. Such binary labels introduce a notion of similarity between nodes of the graph, in fact behaving like profiles that reflect community membership, hence the phrase Community-Sensitive Labeling we use to refer to our approach. More formally, if $h(\cdot, \cdot)$ denotes the Hamming distance between two binary strings, we introduce the following notion of distributed community detection.

Definition 2.3 (Community-sensitive labeling).

Let $G = (V, E)$ be a graph, $(V_1, V_2)$ a partition of $V$ and let $\epsilon \in (0, 1)$. A function $c : V \to \{0, 1\}^{\ell}$, for some $\ell \in \mathbb{N}$, is an $\epsilon$-community-sensitive labeling for partition $(V_1, V_2)$ of $V$ if, for a subset $\tilde{V} \subseteq V$ with $|\tilde{V}| \ge (1 - \epsilon)\,|V|$ of the nodes, two constants $0 \le c_1 < c_2 \le 1$ exist such that for all $u \in \tilde{V} \cap V_i$ and $v \in \tilde{V} \cap V_j$ we have

$$h(c(u), c(v)) \le c_1 \ell \ \text{ if } i = j \qquad \text{and} \qquad h(c(u), c(v)) \ge c_2 \ell \ \text{ if } i \ne j.$$

We will need the following technical lemmas.

Lemma 2.4 (Projection on the first two eigenvectors).

For all $n$, for a random $\mathbf{x} \in \{-1, 1\}^n$, with probability at least we have,

Proof.

Note that $\mathbf{1}^{\top} \mathbf{x} = \sum_{u \in V_1} \mathbf{x}(u) + \sum_{u \in V_2} \mathbf{x}(u)$ and $\chi^{\top} \mathbf{x} = \sum_{u \in V_1} \mathbf{x}(u) - \sum_{u \in V_2} \mathbf{x}(u)$. Using properties of the binomial distribution (see e.g. Appendix A.1), it is easy to see that

The above event implies the bound in the statement. Since $\sum_{u \in V_1} \mathbf{x}(u)$ and $\sum_{u \in V_2} \mathbf{x}(u)$ are independent sums of Rademacher random variables, they have the same chances of being positive or negative, thus with probability at least $\frac{1}{2}$ we will have the desired sign. ∎

Definition 2.5 ($\epsilon$-typical initial vectors).

We say that a starting vector $\mathbf{x}$ is $\epsilon$-typical if its projection in the direction of $\mathbf{1}$ has norm at most

By Chernoff bounds, it is easy to see that a starting vector is $\epsilon$-typical with probability at least .

Lemma 2.6.

For all $n$, for a random $\mathbf{x} \in \{-1, 1\}^n$, with probability at least we have,

Proof.

The proof follows since, deterministically we have:

This quantity is clearly a sum of independent Rademacher variables. Then, a standard concentration argument (see e.g. Theorem A.1.2 in [1]) allows us to conclude:

where the first equality follows from the definition of , since . ∎

3 Evolution of the State: the Case of Small $\lambda_2$

We begin with an analysis of the basic averaging process when $\lambda_2$ is small. On the one hand, this second-moment analysis discloses clustering properties of the studied process that can be exploited well before the mixing time (see Section 4). On the other hand, it uses technical tools that apply to far more general settings than the regular case and may be of independent interest.

In this section we consider the Averaging process described in Algorithm 1, with $\delta = \frac12$. For readability’s sake, we here rename $\mathbf{y}^{(t)}$ the component of the state vector $\mathbf{x}^{(t)}$ in the eigenspace of the second eigenvalue of the normalized Laplacian and $\mathbf{z}^{(t)}$ its component in the space orthogonal to $\mathbf{1}$ and $\chi$. If we also name $\bar{x} = \frac{1}{n} \sum_{u} \mathbf{x}^{(0)}(u)$ the (time-invariant) average of the states, we can write

$$\mathbf{x}^{(t)} = \bar{x}\,\mathbf{1} + \mathbf{y}^{(t)} + \mathbf{z}^{(t)}. \qquad (2)$$

In Subsection 3.1 we show (see Theorem 3.1) that there is a time window in which the component $\mathbf{z}^{(t)}$ is close to $\mathbf{0}$. Moreover, this time window begins right after the inner mixing time of the two communities (i.e., after $O\!\left(\frac{n \log n}{\lambda_3}\right)$ steps), which may occur much earlier than the global mixing time (which is $O\!\left(\frac{n \log n}{\lambda_2}\right)$).

In Subsection 3.2, we will first show how to derive a pointwise bound on the values of the nodes from the global bound given by Theorem 3.1 (see Corollary 3.7). Such a pointwise bound holds for a relatively large fraction of “good” nodes, but only for any fixed global round in the time window specified by Theorem 3.1. In order to devise an asynchronous protocol, we then improve the above analysis in the following way: we prove (see Lemma 3.8) that there is a relatively large fraction of nodes that remain “good” for a whole time window which is long enough to allow the nodes to compute an accurate community-sensitive labeling of the input graph (see Section 4).

3.1 Second moment analysis

In this subsection we prove the following theorem.

Theorem 3.1 (Second moment analysis).

Let $G$ be an $(n, d, b)$-clustered regular graph with good inner expansion and let $\mathbf{x}^{(t)}$ be the state vector obtained when all nodes execute the Averaging process described in Algorithm 1 with $\delta = \frac12$. For every $t$ it holds that

where $\mathbf{y}^{(t)}$ and $\mathbf{z}^{(t)}$ are the components of $\mathbf{x}^{(t)}$ in the eigenspace of $\lambda_2$ and in the space orthogonal to $\mathbf{1}$ and $\chi$, respectively.

We prove Theorem 3.1 by bounding the lengths of the projections of $\mathbf{x}^{(t)}$ into the eigenspace of $\lambda_2$ and into the space orthogonal to $\mathbf{1}$ and $\chi$, i.e. $\mathbf{y}^{(t)}$ and $\mathbf{z}^{(t)}$, and tracking their evolution over time.

From the fact that the random matrices $W$ are symmetric and idempotent ($W^{\top} = W$ and $W^2 = W$), we get the following upper bound on the expected squared norms of $\mathbf{y}^{(t)}$ and $\mathbf{z}^{(t)}$ at the next step as a function of their squared norms at the current step. For readability’s sake, in the following proofs of this section we use $\mathbf{y}$ and $\mathbf{z}$ for the random variables $\mathbf{y}^{(t+1)}$ and $\mathbf{z}^{(t+1)}$ conditional on the state at round $t$ being $\mathbf{x}$.
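For the active edge $(u, v)$ and $\delta = \frac12$, the update matrix is $W = I - \frac12 (e_u - e_v)(e_u - e_v)^{\top}$, and idempotency can be checked in one line (a verification of our own), using $(e_u - e_v)^{\top}(e_u - e_v) = 2$:

```latex
W^2 \;=\; I - (e_u - e_v)(e_u - e_v)^{\top}
        + \tfrac14\,(e_u - e_v)\,\underbrace{(e_u - e_v)^{\top}(e_u - e_v)}_{=\,2}\,(e_u - e_v)^{\top}
    \;=\; I - \tfrac12\,(e_u - e_v)(e_u - e_v)^{\top} \;=\; W,
```

while symmetry is immediate since $(e_u - e_v)(e_u - e_v)^{\top}$ is symmetric.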

Lemma 3.2.

Let $\mathbf{x}$ be an arbitrary vector of states. After one step of Algorithm 1 it holds that

Proof.

Since the random matrix $W$ is symmetric and idempotent, it holds that

$$\mathbb{E}\left[\|W \mathbf{x}\|^2\right] = \mathbb{E}\left[\mathbf{x}^{\top} W^{\top} W \mathbf{x}\right] = \mathbf{x}^{\top}\, \mathbb{E}[W]\, \mathbf{x}.$$

The squared norm of $\mathbf{y}$ at the next step can be lower bounded as a function of the squared norms of $\mathbf{y}$ and $\mathbf{z}$ at the current time step as follows. If the underlying graph is an $(n, d, b)$-clustered regular graph, we can get an upper bound as well.

Lemma 3.3.

Let $\mathbf{x}$ be an arbitrary vector of states. After one step of Algorithm 1 it holds that

Moreover, if the underlying graph is an $(n, d, b)$-clustered regular graph with good inner expansion, we also have that

Proof.

Let $(u, v)$ be the random edge sampled at step $t$, let $W$ be the corresponding random matrix, and let $\mathbf{q} = e_u - e_v$, so that $W = I - \frac{1}{2} \mathbf{q} \mathbf{q}^{\top}$. As for the lower bound we have

where the last equality follows since $\frac{1}{n} \chi \chi^{\top}$ is the projector along the direction of $\chi$, which in turn is $L$’s second eigenvector.

As for the upper bound, it holds that

(3)

Next, note that we have

(4)

where we used the fact that $\chi(u) = 1$ if $u$ belongs to the first community and $\chi(u) = -1$ when $u$ belongs to the second community. We further get that

(5)

where it is understood that if the sampled edge belongs to the cut, then $u \in V_1$ and $v \in V_2$, and where, to derive the last equality, we recall that $\lambda_2 = \frac{2b}{d}$.

Finally, we get that

(6)

where the last inequality follows by observing that the quantity at hand is the Rayleigh quotient of the unnormalized Laplacian of a bipartite $b$-regular graph, whose largest possible eigenvalue is $2b$. The thesis follows by using (4), (5), and (6) in (3). ∎

Lemma 3.4.

Let $\mathbf{x}$ be an arbitrary vector of states. After one step of Algorithm 1 it holds that

Proof.

From Pythagoras’ Theorem and Lemmas 3.2 and 3.3, we get

Finally, by unrolling the double recursion, we get that the expected squared norms of $\mathbf{y}^{(t)}$ and $\mathbf{z}^{(t)}$ at round $t$ satisfy the following inequality.

Lemma 3.5.

Let $G$ be an $(n, d, b)$-clustered regular graph with good inner expansion. For every starting state