Improved Distributed Expander Decomposition and Nearly Optimal Triangle Enumeration
Abstract
An (ε, φ)-expander decomposition of a graph G = (V, E) is a clustering of the vertices V = V₁ ∪ ⋯ ∪ V_x such that (1) each cluster V_i induces a subgraph with conductance at least φ, and (2) the number of intercluster edges is at most ε|E|. This decomposition has a wide range of applications in the centralized setting, including approximation algorithms for unique games, algorithms for flow and cut problems, and dynamic graph algorithms. Recently, the first application of expander decomposition in distributed computing was found. Chang, Pettie, and Zhang [SODA'19] showed that a variant of expander decomposition can be computed efficiently in the CONGEST model, and they used it to show that triangle enumeration can be solved in Õ(n^{1/2}) rounds, improving upon the Õ(n^{3/4})-round algorithm by Izumi and Le Gall [PODC'17]. It is conceivable that expander decomposition will find more applications in distributed computing.
In this paper, we give an improved distributed expander decomposition, and obtain a nearly optimal distributed triangle enumeration algorithm in the CONGEST model. Specifically, we construct an (ε, φ)-expander decomposition with φ = (ε/log n)^{2^{O(k)}} in O(n^{2/k} · poly(1/φ, log n)) rounds for any ε ∈ (0, 1) and positive integer k. For example, a (0.01, 1/polylog(n)) expander decomposition can be computed in O(n^γ) rounds, for any arbitrarily small constant γ > 0, and a (1/n^{o(1)}, 1/n^{o(1)}) expander decomposition only requires n^{o(1)} rounds to compute, which is optimal up to subpolynomial factors. Previously, the algorithm by Chang, Pettie, and Zhang can construct a (1/6, 1/polylog(n)) expander decomposition using Õ(n^{1−δ}) rounds for any δ ∈ (0, 1), with the caveat that the algorithm is allowed to throw away some edges into an extra part which forms a subgraph with arboricity at most n^δ. Our algorithm does not have this caveat.
By slightly modifying the distributed algorithm for routing on expanders by Ghaffari, Kuhn and Su [PODC'17], we obtain a triangle enumeration algorithm using Õ(n^{1/3}) rounds. This matches the Ω̃(n^{1/3}) lower bound by Izumi and Le Gall [PODC'17] and Pandurangan, Robinson and Scquizzato [SPAA'18], which holds even in the CONGESTED CLIQUE model. To the best of our knowledge, this provides the first nontrivial example of a distributed problem that has essentially the same complexity (up to a polylogarithmic factor) in both CONGEST and CONGESTED CLIQUE.
The key technique in our proof is the first distributed approximation algorithm for finding a low conductance cut that is as balanced as possible. Previous distributed sparse cut algorithms do not have this nearly most balanced guarantee.^1

^1 Kuhn and Molla [22] previously claimed that their approximate sparse cut algorithm also has the nearly most balanced guarantee, but this claim turns out to be incorrect [4, Footnote 3].
1 Introduction
In this paper, we consider the task of finding an expander decomposition of a distributed network in the CONGEST model of distributed computing. Roughly speaking, an expander decomposition of a graph is a clustering of the vertices such that (1) each component induces a high conductance subgraph, and (2) the number of intercomponent edges is small. This natural bicriteria optimization problem of finding a good expander decomposition was introduced by Kannan, Vempala, and Vetta [19], and was further studied in many subsequent works [38, 28, 30, 2, 40, 27, 33].^2 The expander decomposition has a wide range of applications, and it has been applied to solving linear systems [39], unique games [1, 40, 32], minimum cut [20], and dynamic algorithms [26].

^2 The existence of the expander decomposition was first (implicitly) exploited in the context of property testing [14].
Recently, Chang, Pettie, and Zhang [4] applied this technique to the field of distributed computing, and they showed that a variant of expander decomposition can be computed efficiently in CONGEST. Using this decomposition, they showed that triangle detection and enumeration can be solved in Õ(n^{1/2}) rounds.^3 The previous state-of-the-art bounds for triangle detection and enumeration were Õ(n^{2/3}) and Õ(n^{3/4}), respectively, due to Izumi and Le Gall [16]. Later, Daga et al. [7] exploited this decomposition and obtained the first algorithm for computing the edge connectivity of a graph exactly using a sublinear number of rounds.

^3 The Õ(·) notation hides any polylogarithmic factor.
Specifically, the variant of the decomposition in [4] is as follows. If we allow one extra part that induces a subgraph of arboricity at most n^δ,^4 then in Õ(n^{1−δ}) rounds we can construct an expander decomposition in CONGEST such that each component has conductance 1/polylog(n) and the number of intercomponent edges is at most |E|/6.

^4 The arboricity of a graph is the minimum number α such that its edge set can be partitioned into α forests.
A major open problem left by the work [4] is to design an efficient distributed algorithm constructing an expander decomposition without the extra low-arboricity part. In this work, we show that this is possible. A consequence of our new expander decomposition algorithm is that triangle enumeration can be solved in Õ(n^{1/3}) rounds, nearly matching the Ω̃(n^{1/3}) lower bound [16, 29] up to a polylogarithmic factor.
The CONGEST Model.
In the CONGEST model of distributed computing, the underlying distributed network is represented as an undirected graph G = (V, E), where each vertex corresponds to a computational device, and each edge corresponds to a bidirectional communication link. Each vertex v has a distinct Θ(log n)-bit identifier ID(v). The computation proceeds in synchronized rounds. In each round, each vertex can perform unlimited local computation, and may send a distinct O(log n)-bit message to each of its neighbors. Throughout the paper we only consider the randomized variant of CONGEST. Each vertex is allowed to generate unlimited local random bits, but there is no global randomness. We say that an algorithm succeeds with high probability (w.h.p.) if its failure probability is at most 1/poly(n).
The CONGESTED CLIQUE model is a variant of CONGEST that allows all-to-all communication, and the LOCAL model is a variant of CONGEST that allows messages of unbounded length.
Terminology.
Before we proceed, we review the graph terminology related to the expander decomposition. Consider a graph G = (V, E). For a vertex subset S ⊆ V, we write vol(S) to denote Σ_{v∈S} deg(v). Note that by default the degree is with respect to the original graph G. We write ∂(S) = E(S, V∖S), and let E(S, V∖S) be the set of edges {u, v} with u ∈ S and v ∈ V∖S. The sparsity or conductance of a cut (S, V∖S) is defined as Φ(S) = |∂(S)| / min{vol(S), vol(V∖S)}. The conductance Φ(G) of a graph G is the minimum value of Φ(S) over all vertex subsets ∅ ⊊ S ⊊ V. Define the balance of a cut S by bal(S) = min{vol(S), vol(V∖S)} / vol(V). We say that S is a most-balanced cut of G of conductance at most φ if bal(S) is maximized among all cuts S′ of G with conductance Φ(S′) ≤ φ. We have the following relation [17] between the mixing time τ_mix(G) and the conductance Φ(G):

Θ(1/Φ(G)) ≤ τ_mix(G) ≤ Θ(log n / Φ(G)²).
Let S be a vertex set. Denote by E(S) the set of all edges whose two endpoints are both within S. We write G[S] to denote the subgraph induced by S, and we write G{S} to denote the graph resulting from adding deg_G(v) − deg_{G[S]}(v) self-loops to each vertex v ∈ S. Note that the degree of each vertex v ∈ S in both G and G{S} is identical. As in [35], each self-loop of v contributes 1 to deg(v) in the calculation of vol(·). Observe that we always have Φ(G{S}) ≤ Φ(G[S]).
Let v ∈ V be a vertex. Denote by N(v) the set of neighbors of v. We also write N(S) = (∪_{v∈S} N(v)) ∖ S. Note that deg(v) = |N(v)|, as G is a simple graph. The notations deg(·), vol(·), N(·), and Φ(·) depend on the underlying graph. When the choice of the underlying graph is not clear from the context, we use a subscript to indicate the underlying graph we refer to.
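All quantities above are straightforward to compute explicitly, which may help make the definitions concrete. The following sketch (hypothetical helper names; brute force, so for tiny illustrative graphs only) computes vol, ∂, Φ, bal, and a most-balanced cut of conductance at most φ exactly:

```python
from itertools import combinations

def vol(adj, S):
    # vol(S): sum of degrees (taken in the whole graph G) over vertices in S.
    return sum(len(adj[v]) for v in S)

def cut_edges(adj, S):
    # ∂(S): edges with exactly one endpoint in S, listed from the S side.
    S = set(S)
    return [(u, v) for u in S for v in adj[u] if v not in S]

def conductance(adj, S):
    # Φ(S) = |∂(S)| / min(vol(S), vol(V \ S)).
    rest = set(adj) - set(S)
    return len(cut_edges(adj, S)) / min(vol(adj, S), vol(adj, rest))

def balance(adj, S):
    # bal(S) = min(vol(S), vol(V \ S)) / vol(V).
    rest = set(adj) - set(S)
    return min(vol(adj, S), vol(adj, rest)) / vol(adj, adj)

def most_balanced_cut(adj, phi):
    # Brute force over all nonempty proper subsets: among cuts of conductance
    # at most phi, return one maximizing bal(S); None if G is a phi-expander.
    verts, best = sorted(adj), None
    for r in range(1, len(verts)):
        for S in map(set, combinations(verts, r)):
            if conductance(adj, S) <= phi:
                if best is None or balance(adj, S) > balance(adj, best):
                    best = S
    return best

# Running example: two triangles joined by a bridge; the bridge is the
# sparsest cut, and {0, 1, 2} is a most-balanced sparse cut.
barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```

On `barbell`, the cut S = {0, 1, 2} has vol(S) = 7, |∂(S)| = 1, Φ(S) = 1/7, and bal(S) = 1/2, the maximum possible balance.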
Expander Decomposition.
An (ε, φ)-expander decomposition of a graph G = (V, E) is defined as a partition V = V₁ ∪ ⋯ ∪ V_x of the vertex set satisfying the following conditions.

For each component V_i, we have Φ(G{V_i}) ≥ φ.

The number of intercomponent edges, i.e., edges whose two endpoints belong to different components, is at most ε|E|.
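To make the two conditions concrete, here is a small brute-force checker (hypothetical names; exponential time, for tiny examples only) that verifies whether a given partition is an (ε, φ)-expander decomposition. Note that conductance inside a cluster V_i is measured in G{V_i}: cut edges are restricted to the cluster, while volumes use degrees from G, since the self-loops preserve degrees and never cross a cut.

```python
from itertools import combinations

def vol(adj, S):
    # Degrees are always taken in G; the self-loops of G{V_i} preserve them.
    return sum(len(adj[v]) for v in S)

def cluster_conductance(adj, part):
    # Φ(G{part}) by brute force: cut edges stay inside the cluster, while
    # volumes use the original degrees.
    part, best = set(part), float("inf")
    for r in range(1, len(part)):
        for S in map(set, combinations(sorted(part), r)):
            cut = sum(1 for u in S for v in adj[u] if v in part - S)
            best = min(best, cut / min(vol(adj, S), vol(adj, part - S)))
    return best

def is_expander_decomposition(adj, parts, eps, phi):
    m = sum(len(adj[v]) for v in adj) // 2
    label = {v: i for i, part in enumerate(parts) for v in part}
    # Each intercomponent edge is counted exactly once via the label ordering.
    crossing = sum(1 for u in adj for v in adj[u] if label[u] < label[v])
    if crossing > eps * m:
        return False
    # Singleton clusters admit no nontrivial cut, so they pass vacuously.
    return all(cluster_conductance(adj, p) >= phi for p in parts if len(p) > 1)

barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```

On the two-triangle graph, the partition {0, 1, 2}, {3, 4, 5} has one intercomponent edge and each cluster has conductance 2/3, so it is, e.g., a (0.2, 0.5)-expander decomposition, while the partition {0, 1}, {2, 3}, {4, 5} cuts too many edges.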
The main contribution of this paper is the following result.
Theorem 1. Let ε ∈ (0, 1), and let k be a positive integer. An (ε, φ)-expander decomposition with φ = (ε/log n)^{2^{O(k)}} can be constructed in O(n^{2/k} · poly(1/φ, log n)) rounds, w.h.p.
The proof of Theorem 1 is in Section 2. We emphasize that the number of rounds does not depend on the diameter of G. There is a tradeoff between the two parameters k and φ. For example, an expander decomposition with ε = 0.01 and φ = 1/polylog(n) can be constructed in O(n^γ) rounds by setting k = Θ(1/γ) in Theorem 1, for any arbitrarily small constant γ > 0. If we are allowed to have ε = 1/n^{o(1)} and spend n^{o(1)} rounds, then we can achieve φ = 1/n^{o(1)}.
Distributed Triangle Finding.
Variants of the triangle finding problem have been studied in the literature [3, 4, 8, 9, 10, 29, 16]. In the triangle detection problem, it is required that at least one vertex report a triangle if the graph has at least one triangle. In the triangle enumeration problem, it is required that each triangle of the graph be reported by at least one vertex. Both of these problems can be solved in O(1) rounds in LOCAL. It is the bandwidth constraint of CONGEST and CONGESTED CLIQUE that makes these problems nontrivial.
It is important that a triangle {u, v, w} is allowed to be reported by a vertex x ∉ {u, v, w}. If it is required that each triangle {u, v, w} be reported by a vertex in {u, v, w}, then there is an Ω̃(n) lower bound [16] for triangle enumeration, in both CONGEST and CONGESTED CLIQUE. To achieve a round complexity of Õ(n^{1/3}), it is necessary that some triangles are reported by vertices not in the triangle.
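The Ω̃(n) bound for this restricted variant follows from a standard information-theoretic counting argument, sketched here as a plausibility check under the stated bandwidth assumptions (the exact statement and constants in [16] may differ):

```latex
% A vertex v of degree Theta(n) can lie in Theta(n^2) triangles. To report all
% triangles through v, it must learn, for Theta(n^2) pairs of its neighbors,
% whether the pair is adjacent:
\text{bits needed at } v = \Omega(n^2).
% Even in CONGESTED CLIQUE, v receives at most n-1 messages of O(log n) bits
% per round:
\text{bits received per round at } v = O(n \log n).
% Hence reporting every triangle at one of its own vertices requires
\Omega\!\left(\frac{n^2}{n \log n}\right)
  = \Omega\!\left(\frac{n}{\log n}\right)
  = \widetilde{\Omega}(n) \text{ rounds.}
```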
Dolev, Lenzen, and Peled [8] showed that triangle enumeration can be solved deterministically in Õ(n^{1/3}) rounds in CONGESTED CLIQUE. This algorithm is optimal up to polylogarithmic factors, as it matches the Ω̃(n^{1/3})-round lower bound [16, 29] in CONGESTED CLIQUE. Interestingly, if we only want to detect one triangle or count the number of triangles, then Censor-Hillel et al. [3] showed that the round complexity in CONGESTED CLIQUE can be improved to O(n^{1−2/ω+o(1)}) = O(n^{0.158}), where ω < 2.373 is the exponent of the complexity of matrix multiplication [23].
For the CONGEST model, Izumi and Le Gall [16] showed that the triangle detection and enumeration problems can be solved in Õ(n^{2/3}) and Õ(n^{3/4}) rounds, respectively. These upper bounds were later improved to Õ(n^{1/2}) by Chang, Pettie, and Zhang using a variant of expander decomposition [4].
A consequence of Theorem 1 is that triangle enumeration (and hence detection) can be solved in Õ(n^{1/3}) rounds, almost matching the Ω̃(n^{1/3}) lower bound [16, 29], which holds even in CONGESTED CLIQUE. To the best of our knowledge, this provides the first nontrivial example of a distributed problem that has essentially the same complexity (up to a polylogarithmic factor) in both CONGEST and CONGESTED CLIQUE, i.e., allowing nonlocal communication links does not help. In contrast, many other graph problems can be solved much more efficiently in CONGESTED CLIQUE than in CONGEST; see e.g., [18, 13].
Theorem 2. Triangle enumeration can be solved in Õ(n^{1/3}) rounds in CONGEST, w.h.p.
1.1 Prior Work on Expander Decomposition
In the centralized setting, the first polynomial-time algorithm for constructing an expander decomposition is by Kannan, Vempala, and Vetta [19]. Afterward, Spielman and Teng [37, 38] significantly improved the running time to be near-linear in m, the number of edges. In near-linear time, they can construct a "weak" expander decomposition. Their weak expander decomposition only has the following weaker guarantee: each part U in the partition might not induce an expander, and we only know that U is contained in some unknown expander. That is, there exists some W ⊇ U where Φ(G{W}) ≥ φ. Although this guarantee suffices for many applications (e.g. [21, 6]), some other applications [26, 5], including the triangle enumeration algorithm of [4], crucially need the fact that each part in the decomposition induces an expander.
Nanongkai and Saranurak [25] and, independently, Wulff-Nilsen [41] gave fast algorithms without weakening the guarantee as in [37, 38]. In [25], their algorithm finds an expander decomposition in time m^{1+o(1)}. Although the tradeoff is worse in [41], their high-level approaches are in fact the same. They gave the same black-box reduction from constructing an expander decomposition to finding a nearly most balanced sparse cut. The difference only comes from the quality of their nearly most balanced sparse cut algorithms. Our distributed algorithm will also follow this high-level approach.
Most recently, Saranurak and Wang [33] gave an (O(φ log³ n), φ)-expander decomposition algorithm with running time O(m log⁴ n / φ). This is optimal up to a polylogarithmic factor when φ ≥ 1/polylog(n). We do not use their approach, as their trimming step seems to be inherently sequential and very challenging to parallelize or to implement in the distributed setting.
The only previous expander decomposition algorithm in the distributed setting is by Chang, Pettie, and Zhang [4]. Their distributed algorithm gives an expander decomposition with an extra part inducing a subgraph of arboricity at most n^δ in Õ(n^{1−δ}) rounds in CONGEST. Our distributed algorithm significantly improves upon this work.
1.2 Technical Overview
For convenience, in this section we call a cut S with conductance Φ(S) at most φ a sparse cut. To give a high-level idea, the most straightforward algorithm for constructing an expander decomposition of a graph G is as follows. Find a sparse cut S. If such a cut does not exist, then return the current component as a part in the partition. Otherwise, recurse on both sides S and V∖S, and so the edges in ∂(S) become intercluster edges. To see the correctness, once the recursion stops at a component U, we know that Φ(G{U}) ≥ φ. Also, the total number of intercluster edges is at most O(φ m log n) because (1) each sparse cut S contributes at most φ · vol(S) intercluster edges, where S is the smaller side, and these can be charged to the edges in the smaller side of the cut, and (2) each edge can be in the smaller side of a cut at most O(log n) times, as the volume of the component containing it halves each time.
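The recursion just described can be written down directly. The sketch below uses hypothetical names, and a brute-force search stands in for the sparse cut subroutine, which is exactly the step that the paper makes efficient and nearly balanced; conductance is measured as in G{verts}, with degrees from G and cut edges restricted to the current component:

```python
from itertools import combinations

def vol(adj, S):
    return sum(len(adj[v]) for v in S)

def find_sparse_cut(adj, verts, phi):
    # Brute force: any cut of the current component with conductance < phi.
    for r in range(1, len(verts)):
        for S in map(set, combinations(sorted(verts), r)):
            cut = sum(1 for u in S for v in adj[u] if v in verts - S)
            if cut / min(vol(adj, S), vol(adj, verts - S)) < phi:
                return S
    return None

def decompose(adj, verts, phi):
    # If no sparse cut exists, the component is a phi-expander: output it.
    # Otherwise recurse on both sides; the cut edges become intercluster edges.
    S = find_sparse_cut(adj, verts, phi)
    if S is None:
        return [verts]
    return decompose(adj, S, phi) + decompose(adj, verts - S, phi)

barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```

On the two-triangle example with φ = 0.3, the recursion removes the bridge and stops, outputting the two triangles as 0.3-expanders.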
This straightforward approach has two efficiency issues: (1) checking whether a sparse cut exists does not admit fast distributed algorithms (computing the conductance of a graph is in fact NP-hard), and (2) a sparse cut can be very unbalanced, and hence the recursion depth can be as large as Θ(n). Thus, even if we ignore the time spent on finding cuts, the round complexity due to the recursion depth is too high. At a high level, all previous algorithms (both centralized and distributed) handle the two issues in the same way up to some extent. First, they instead use approximate sparse cut algorithms, which either find some sparse cut of conductance at most φ or certify that there is no cut of conductance at most φ′, where φ′ ≪ φ. Second, they find a cut with some guarantee on the balance of the cut, i.e., the smaller side of the cut should be sufficiently large.
Let us contrast our approach with the only previous distributed expander decomposition algorithm, by Chang, Pettie, and Zhang [4]. They gave an approximate sparse cut algorithm such that the smaller side of the cut has Ω(n^δ) vertices for some constant δ ∈ (0, 1), so the recursion depth is Õ(n^{1−δ}). They guarantee this property by "forcing" the graph to have minimum degree at least n^δ, so any sparse cut must contain Ω(n^δ) vertices (this uses the fact that the graph is simple). To force the graph to have high degree, they keep removing vertices with degree at most n^δ at every step of the algorithm. Throughout the whole algorithm, the removed parts form a graph with arboricity at most n^δ. This explains why their decomposition outputs the extra part that induces a low-arboricity subgraph. With some other ideas on distributed implementation, they obtained the round complexity of Õ(n^{1−δ}), roughly matching the recursion depth.
In this paper, we avoid this extra low-arboricity part. The key component is the following. Instead of just guaranteeing that the smaller side of the cut has Ω(n^δ) vertices, we give the first efficient distributed algorithm for computing a nearly most balanced sparse cut. Suppose there is a sparse cut of conductance φ′ with balance β; then our sparse cut algorithm returns a sparse cut with balance Ω(β) whose conductance φ is not much larger than φ′. Intuitively, given that we can find a nearly most balanced sparse cut efficiently, the recursion depth can be made very small. This intuition can be made formal using the ideas in the centralized setting from Nanongkai and Saranurak [25] and Wulff-Nilsen [41]. Our main technical contribution is twofold. First, we give the first distributed algorithm for computing a nearly most balanced sparse cut, which is our key algorithmic tool. Second, in order to obtain a fast distributed algorithm, we must modify the centralized approach of [25, 41] for constructing an expander decomposition. In particular, we need to run a low diameter decomposition whenever we encounter a graph with high diameter, as our distributed algorithm for finding a nearly most balanced sparse cut is fast only on graphs with low diameter.
Sparse Cut Computation.
At a high level, our distributed nearly most balanced sparse cut algorithm is a distributed implementation of the sequential algorithm of Spielman and Teng [38]. The algorithm of [38] involves sequential iterations of Nibble with a random starting vertex on the remaining subgraph. Roughly speaking, the Nibble procedure aims at finding a sparse cut by simulating a random walk. The idea is that if the starting vertex belongs to some sparse cut S, then it is likely that most of the probability mass will be trapped inside S. Chang, Pettie, and Zhang [4] showed that simultaneous iterations of an approximate version of Nibble with random starting vertices can be implemented efficiently in CONGEST in poly(1/φ, log n) rounds, where φ is the target conductance. A major difference between this work and [4] is that the expander decomposition algorithm of [4] does not need any requirement on the balance of the cut in its sparse cut computation.
Note that the sequential iterations of Nibble in the nearly most balanced sparse cut algorithm of [38] cannot be completely parallelized. For example, it is possible that the union of all outputs of Nibble equals the entire graph. Nonetheless, we show that this process can be partially parallelized at the cost of worsening the conductance guarantee by a polylogarithmic factor.
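The trapped-mass intuition behind Nibble is easy to see numerically. Below, a lazy random walk (stay put with probability 1/2) started inside one side of the barbell graph keeps most of its probability mass on that side for many steps, and only in the long run approaches the stationary mass vol(S)/vol(V) = 1/2. This is an illustration only, with hypothetical helper names; it is not the approximate distributed Nibble of [4, 38]:

```python
def lazy_walk_step(adj, p):
    # One step of the lazy walk: keep half the mass at each vertex, and
    # spread the other half evenly over its neighbors.
    q = {v: p[v] / 2 for v in adj}
    for u in adj:
        for v in adj[u]:
            q[v] += p[u] / (2 * len(adj[u]))
    return q

barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = {v: 1.0 if v == 0 else 0.0 for v in barbell}   # point mass at vertex 0
history = []
for step in range(800):
    p = lazy_walk_step(barbell, p)
    history.append(sum(p[v] for v in (0, 1, 2)))   # mass still in S = {0, 1, 2}
```

After a handful of steps the mass in S is still well above 3/4, because the walk can only leak through the single bridge edge; after many steps it settles near 1/2, the stationary mass of S.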
Theorem 1.2 (Nearly most balanced sparse cut). Given a parameter φ ∈ (0, 1), there is an O(D · poly(1/φ, log n))-round algorithm, where D is the diameter of the graph under consideration, that achieves the following w.h.p.

In case Φ(G) ≤ φ, the algorithm is guaranteed to return a cut C with balance bal(C) = Ω(β) and conductance Φ(C) ≤ √φ · polylog(n), where β is defined as bal(S*), where S* is a most-balanced sparse cut of G of conductance at most φ.

In case Φ(G) > φ, the algorithm either returns C = ∅ or returns a cut C with conductance Φ(C) ≤ √φ · polylog(n).
The proof of Theorem 1.2 is in Appendix A. We note again that this is the first distributed sparse cut algorithm with a nearly most balanced guarantee. The problem of finding a sparse cut in the distributed setting has been studied prior to the work of [4]. Given that there is a cut of conductance φ and balance b, the algorithm of Das Sarma, Molla, and Pandurangan [34] finds a cut of conductance at most Õ(√φ) in a number of rounds that depends polynomially on 1/b and 1/φ, in CONGEST. The round complexity was later improved by Kuhn and Molla [22]. These prior works have the following drawbacks: (1) their running time depends on the balance b, which can be as small as 1/poly(n), and (2) their output cuts are not guaranteed to be nearly most balanced (see footnote 1).
Low Diameter Decomposition.
The runtime of our distributed sparse cut algorithm (Theorem 1.2) is proportional to the diameter of the graph. To avoid running this algorithm on a high diameter graph, we employ a low diameter decomposition to decompose the current graph into components of small diameter.
The low diameter decomposition algorithm of Miller, Peng, and Xu [24] can already be implemented efficiently in CONGEST. However, there is one subtle issue: the guarantee that the number of intercluster edges is at most O(β|E|) holds only in expectation. In the sequential or parallel computation models, we can simply repeat the procedure several times and take the best result. In CONGEST, this however takes at least Ω(diameter) rounds per repetition, which is inefficient when the diameter is large.
We provide a technique that allows us to achieve this guarantee with high probability without spending Ω(diameter) rounds, so we can ensure that the number of intercluster edges is small with high probability in our expander decomposition algorithm.^5 Intuitively, the main barrier to overcome is the high dependence among the events that an edge has its endpoints in different clusters. Our strategy is to first compute a partition of the vertex set in such a way that one side already induces a low diameter clustering, and the edges incident to the other side satisfy the property that, if we run the low diameter decomposition algorithm of [24], the events that they are intercluster have sufficiently small dependence. Then we can use a variant of the Chernoff bound with bounded dependence [31] to bound the number of intercluster edges with high probability.

^5 We remark that the triangle enumeration algorithm of [4] still works even if the guarantee on the number of intercluster edges in the expander decomposition holds only in expectation.
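For intuition, the clustering of [24] can be sketched in a few lines: every vertex u draws an exponential shift δ_u ~ Exp(β), and each vertex v joins the cluster of the center u maximizing δ_u − dist(u, v); each edge then ends up intercluster with probability O(β) and cluster radii are O(log n / β) w.h.p. The sequential sketch below uses hypothetical names and centralized BFS distances; the real algorithm grows all clusters in parallel, and our modification for a high-probability guarantee is not shown:

```python
import random
from collections import deque

def bfs_dist(adj, src):
    # Hop distances from src by BFS.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def low_diameter_decomposition(adj, beta, rng):
    # Each vertex u draws a shift delta_u ~ Exp(beta); vertex v joins the
    # cluster of the center u maximizing delta_u - dist(u, v).
    delta = {u: rng.expovariate(beta) for u in adj}
    dist = {u: bfs_dist(adj, u) for u in adj}
    center = {}
    for v in adj:
        center[v] = max(
            adj, key=lambda u: (delta[u] - dist[u].get(v, float("inf")), -u)
        )
    return center
```

Running this on a small path graph yields a partition of the vertices into clusters around the (randomly shifted) winning centers.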
Theorem 1.3 (Low diameter decomposition). Let β ∈ (0, 1). There is an O(β^{−1} · polylog(n))-round algorithm that finds a partition V = V₁ ∪ ⋯ ∪ V_x of the vertex set satisfying the following conditions w.h.p.

Each component V_i has diameter O(β^{−1} log n).

The number of intercomponent edges is at most O(β|E|).
Triangle Enumeration.
Incorporating our expander decomposition algorithm (Theorem 1) with the triangle enumeration algorithm of [4, 11], we immediately obtain an n^{1/3+o(1)}-round algorithm for triangle enumeration. This round complexity can be further improved to Õ(n^{1/3}) by adjusting the routing algorithm of Ghaffari, Kuhn, and Su [11] on graphs of small mixing time. The main observation is that their algorithm can be viewed as a distributed data structure with a tradeoff between the query time and the preprocessing time. In particular, for any given constant γ > 0, it is possible to achieve polylogarithmic query overhead by spending O(n^γ) rounds on preprocessing.
2 Expander Decomposition
The goal of this section is to prove Theorem 1.
Theorem 1 (restated). Let ε ∈ (0, 1), and let k be a positive integer. An (ε, φ)-expander decomposition with φ = (ε/log n)^{2^{O(k)}} can be constructed in O(n^{2/k} · poly(1/φ, log n)) rounds, w.h.p.
For the sake of convenience, we denote by f(φ) = √φ · polylog(n) the conductance bound associated with Theorem 1.2: when we run the nearly most balanced sparse cut algorithm of Theorem 1.2 with conductance parameter φ, if the output subset C is nonempty, then it has Φ(C) ≤ f(φ). We note that f(φ) ≥ φ.
Let ε and k be the parameters specified in Theorem 1. We define the following parameters that are used in our algorithm.
 Nearly Most Balanced Sparse Cut:

We define φ₁ in such a way that when we run the nearly most balanced sparse cut algorithm with this conductance parameter, any nonempty output C must satisfy Φ(C) ≤ f(φ₁) ≤ ε/(c log n) for a sufficiently large constant c. For each 2 ≤ i ≤ k, we define φ_i so that f(φ_i) ≤ φ_{i−1}; in particular, φ₁ > φ₂ > ⋯ > φ_k.
 Low Diameter Decomposition:

The parameter β for the low diameter decomposition is chosen as follows. Let L be the upper bound on the recursion depth of Phase 1 established in Lemma 1 below. Then we define β = Θ(ε/L), so that the edges removed by all invocations of the low diameter decomposition total at most ε|E|/3.
We show that an expander decomposition with conductance parameter φ = φ_k can be constructed in the claimed number of rounds. We will later see that φ_k is the smallest conductance parameter we ever use for applying the nearly most balanced sparse cut algorithm.
Algorithm.
Our algorithm has two phases. In the algorithm there are three places where we remove edges from the graph, and they are tagged with Remove1, Remove2, and Remove3 for convenience. Whenever we remove an edge {u, v}, we add a self-loop at both u and v, and so the degree of a vertex never changes throughout the algorithm. We never remove self-loops.
At the end of the algorithm, V is partitioned into the connected components induced by the remaining edges. To prove the correctness of the algorithm, we will show that the number of removed edges is at most ε|E|, and that Φ(G{U}) ≥ φ for each component U.
We emphasize that we do not remove the cut edges in Step 2b of Phase 1.
Lemma 1.
The depth of the recursion of Phase 1 is at most L = O(log n).
Proof.
Suppose some component is still entering a deeper level of the recursion of Phase 1. Then, according to the volume threshold specified in Step 2b, we infer that the volume of this component exceeds vol(V) = 2|E| by our choice of parameters, which is impossible. ∎
Intuitively, in Phase 2 we keep calling the nearly most balanced sparse cut algorithm to find a cut C and remove it. If we find a cut C whose volume exceeds the threshold θ_i of the current level i, then we make good progress. If vol(C) ≤ θ_i, then we learn from Theorem 1.2 that the volume of the most balanced sparse cut of conductance at most φ_i is at most O(θ_i), and so we move on to the next level by setting i ← i + 1.
The maximum possible level is i = k. Since by definition φ_k is the smallest conductance parameter we use, there is no possibility to increase i beyond k. Once we reach i = k, we will repeatedly run the nearly most balanced sparse cut algorithm until we get C = ∅ and quit.
When we remove a cut C in Phase 2, each v ∈ C becomes an isolated vertex with deg(v) self-loops, as all edges incident to C have been removed, and so in the final decomposition we have V_j = {v} for some j. We emphasize that we only do this edge removal when vol(C) ≤ θ_i. Lemma 2 bounds the volume of the cuts found during Phase 2.
Lemma 2.
For each 1 ≤ i ≤ k, define C_i as the union of all subsets C found in Phase 2 when the level equals i. Then either C_i = ∅ or vol(C_i) = O(θ_i).
Proof.
We first consider the case of i = 1. Observe that the graph entering Phase 2 satisfies the property that its most balanced sparse cut of conductance at most the Phase 1 parameter has volume O(θ₁), since otherwise it does not meet the condition for entering Phase 2. Note that all cuts we find during Phase 2 when the level equals 1 have conductance at most f(φ₁), which is at most the Phase 1 parameter, and so the union of them is also a cut of that conductance. This implies that vol(C₁) = O(θ₁).

The proof for the case of i > 1 is exactly the same, as the condition for increasing the level to i is to have found a cut of volume at most θ_{i−1}. Let G′ be the graph under consideration in the iteration when we increase the level to i. The existence of such a cut of G′ implies that the most balanced sparse cut of G′ of conductance at most φ_{i−1} has volume O(θ_i). Similarly, note that all cuts we find when the level equals i have conductance at most f(φ_i) ≤ φ_{i−1}, and so the union of them is also a cut of G′ with conductance at most φ_{i−1}. This implies that vol(C_i) = O(θ_i). ∎
Conductance of Remaining Components.
For each vertex v ∈ V, there are two possible ways for v to end the algorithm:

During Phase 1 or Phase 2, the output of the nearly most balanced sparse cut algorithm on the component U that v belongs to is C = ∅. In this case, U becomes a component in the final decomposition V = V₁ ∪ ⋯ ∪ V_x. If φ′ is the conductance parameter used in that run of the nearly most balanced sparse cut algorithm, then Φ(G{U}) ≥ φ′. Note that φ′ ≥ φ_k = φ.

During Phase 2, v ∈ C for the output C of the nearly most balanced sparse cut algorithm. In this case, {v} itself becomes a component in the final decomposition. Trivially, we have Φ(G{{v}}) ≥ φ, as a singleton component admits no nontrivial cut.
Therefore, we conclude that each component U in the final decomposition satisfies Φ(G{U}) ≥ φ.
Number of Removed Edges.
There are three places in the algorithm where we remove edges. We show that, for each i ∈ {1, 2, 3}, the number of edges removed due to Remove-i is at most ε|E|/3, and so the total number of intercomponent edges in the final decomposition is at most ε|E|.

By Lemma 1, the depth of the recursion of Phase 1 is at most L. For each depth d = 1 to L, the number of edges removed due to the low diameter decomposition algorithm during depth d of the recursion is at most O(β|E|). By our choice of β, the number of edges removed due to Remove1 is at most ε|E|/3.

For each edge removed due to the nearly most balanced sparse cut algorithm in Phase 1, we charge the cost of the edge removal to vertex-edge pairs (v, e) in the following way. If vol(S) ≤ vol(V∖S), then for each v ∈ S, and for each edge e incident to v, we charge the amount Φ(S) to (v, e); otherwise, for each v ∈ V∖S, and for each edge e incident to v, we charge the amount Φ(S) to (v, e). Note that each pair (v, e) is charged at most O(log n) times throughout the algorithm, since the volume of the side containing v at least halves between consecutive charges, and the amount per charge is at most f(φ₁) ≤ ε/(c log n). Therefore, the number of edges removed due to Remove2 is at most ε|E|/3 by our choice of φ₁.

By Lemma 2, the summation of vol(C) over all cuts C that are found and removed during Phase 2 due to Remove3 is at most Σ_{i=1}^{k} vol(C_i) ≤ ε|E|/3.
Round Complexity.
During Phase 1, each vertex participates at most L times in the nearly most balanced sparse cut algorithm and in the low diameter decomposition algorithm. By our choice of the parameters φ_i and β, the round complexity of both algorithms is poly(1/φ, log n), as we note that whenever we run the nearly most balanced sparse cut algorithm, the diameter of each connected component is at most O(β^{−1} log n).
For Phase 2, Lemma 2 guarantees that the algorithm can stay at each level i for at most a bounded number of iterations: if we neither increase the level nor quit Phase 2 for too many iterations, then we would have vol(C_i) exceeding the bound of Lemma 2, which is impossible. Therefore, the round complexity of Phase 2 can be upper bounded by the number of levels k, times the maximum number of iterations per level, times the cost of one nearly most balanced sparse cut computation.
During Phase 2, it is possible that the remaining graph is disconnected or has a large diameter, but we are fine, since we can use all edges of G, including the removed ones, for communication during a sparse cut computation, and the diameter of the component of G under consideration is at most O(β^{−1} log n).
3 Triangle Enumeration
Theorem 2 (restated). Triangle enumeration can be solved in Õ(n^{1/3}) rounds in CONGEST, w.h.p.
Chang, Pettie, and Zhang [4] showed that given an (ε, φ)-expander decomposition with ε a sufficiently small constant and φ ≥ 1/polylog(n), there is an algorithm A that finds an edge subset E′ ⊆ E with |E′| ≤ |E|/2 such that each triangle in G is detected by some vertex during the execution of A, except the triangles whose three edges are all within E′. The algorithm A has to solve Õ(1) instances of the following routing problem in each component V_i. Given a set of routing requests, where each vertex v is a source or a destination for at most Õ(deg(v)) messages of O(log n) bits, the goal is to deliver all messages to their destinations. Ghaffari, Kuhn, and Su [11] showed that this routing problem can be solved in τ_mix(G) · 2^{O(√log n)} rounds. This was later improved in CONGEST by Ghaffari and Li [12].
Applying our distributed expander decomposition algorithm (Theorem 1), we can find an expander decomposition with a sufficiently small constant ε and φ = 1/polylog(n) in O(n^γ) rounds, for a small constant γ > 0, by selecting k to be a sufficiently large constant. The mixing time of each component is then at most O(log n / φ²) = polylog(n). Then we apply the above algorithm A, and it takes n^{1/3+o(1)} rounds with the routing algorithm of Ghaffari and Li [12]. After that, we recurse on the edge set E′, and we are done enumerating all triangles after O(log |E|) iterations. This concludes the n^{1/3+o(1)}-round algorithm for triangle enumeration.
To improve the complexity to Õ(n^{1/3}), we make the observation that the routing algorithm of [11] can be seen as a distributed data structure with the following properties.
 Parameters:

The parameter k is a positive integer that specifies the depth of the hierarchical structure in the routing algorithm. Given k, each level of the hierarchy reduces the problem size by a factor of roughly m^{1/k}, where m is the total number of edges.
 Preprocessing Time:

The data structure can be built in m^{O(1/k)} · poly(τ_mix(G), log n) rounds; the cost decreases as k grows.
 Query Time:

After building the data structure, each routing task can be solved in 2^{O(k)} · Õ(τ_mix(G)) rounds [11, Lemma 3.4].
The parameter k can be chosen as any positive integer. In [11], they used k = Θ(√log n) to balance the preprocessing time and the query time, showing that the routing task can be solved in τ_mix(G) · 2^{O(√log n)} rounds. This round complexity was later improved in [12]. We however note that the algorithm of [12] does not admit a tradeoff as above. The main reason is their special treatment of the base layer of the hierarchical structure. In [12], the base layer is a random graph of polylogarithmic degree, and simulating one round of communication in the base layer already costs a number of rounds proportional to the mixing time of the original graph G.
In the triangle enumeration algorithm A, we need to query this distributed data structure Õ(n^{1/3}) times. It is possible to set k to be a large enough constant so that the preprocessing costs only o(n^{1/3}) rounds, while each query still takes only polylog(n) rounds. This implies that the triangle enumeration problem can be solved in Õ(n^{1/3}) rounds.
Acknowledgment
We thank Seth Pettie for very useful discussion.
References
 [1] S. Arora, B. Barak, and D. Steurer. Subexponential algorithms for unique games and related problems. J. ACM, 62(5):42:1–42:25, Nov. 2015.
 [2] S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. J. ACM, 56(2):5:1–5:37, Apr. 2009.
 [3] K. Censor-Hillel, P. Kaski, J. H. Korhonen, C. Lenzen, A. Paz, and J. Suomela. Algebraic methods in the congested clique. Distributed Computing, 2016.
 [4] Y.-J. Chang, S. Pettie, and H. Zhang. Distributed triangle detection via expander decomposition. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 821–840, 2019.
 [5] T. Chu, Y. Gao, R. Peng, S. Sachdeva, S. Sawlani, and J. Wang. Graph sparsification, spectral sketches, and faster resistance computation, via short cycle decompositions. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 361–372, 2018.
 [6] M. B. Cohen, J. A. Kelner, J. Peebles, R. Peng, A. B. Rao, A. Sidford, and A. Vladu. Almost-linear-time algorithms for Markov chains and new spectral primitives for directed graphs. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 410–419, 2017.
 [7] M. Daga, M. Henzinger, D. Nanongkai, and T. Saranurak. Distributed edge connectivity in sublinear time. arXiv preprint arXiv:1904.04341, 2019. To appear at STOC’19.
 [8] D. Dolev, C. Lenzen, and S. Peled. “Tri, tri again”: Finding triangles and small subgraphs in a distributed setting. In Proceedings 26th International Symposium on Distributed Computing (DISC), pages 195–209, 2012.
 [9] A. Drucker, F. Kuhn, and R. Oshman. On the power of the congested clique model. In Proceedings 33rd ACM Symposium on Principles of Distributed Computing (PODC), pages 367–376, 2014.
 [10] O. Fischer, T. Gonen, F. Kuhn, and R. Oshman. Possibilities and impossibilities for distributed subgraph detection. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 153–162, New York, NY, USA, 2018. ACM.
 [11] M. Ghaffari, F. Kuhn, and H.-H. Su. Distributed MST and routing in almost mixing time. In Proceedings 37th ACM Symposium on Principles of Distributed Computing (PODC), pages 131–140, 2017.
 [12] M. Ghaffari and J. Li. New distributed algorithms in almost mixing time via transformations from parallel algorithms. In U. Schmid and J. Widder, editors, Proceedings 32nd International Symposium on Distributed Computing (DISC), volume 121 of Leibniz International Proceedings in Informatics (LIPIcs), pages 31:1–31:16, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
 [13] M. Ghaffari and K. Nowicki. Congested clique algorithms for the minimum cut problem. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC ’18, pages 357–366, New York, NY, USA, 2018. ACM.
 [14] O. Goldreich and D. Ron. A sublinear bipartiteness tester for bounded degree graphs. Combinatorica, 19(3):335–373, Mar 1999.
 [15] B. Haeupler and D. Wajc. A faster distributed radio broadcast primitive. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing (PODC), pages 361–370. ACM, 2016.
 [16] T. Izumi and F. Le Gall. Triangle finding and listing in CONGEST networks. In Proceedings 37th ACM Symposium on Principles of Distributed Computing (PODC), pages 381–389, 2017.
 [17] M. Jerrum and A. Sinclair. Approximating the permanent. SIAM Journal on Computing, 18(6):1149–1178, 1989.
 [18] T. Jurdziński and K. Nowicki. MST in O(1) rounds of congested clique. In Proceedings 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2620–2632, 2018.
 [19] R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. J. ACM, 51(3):497–515, May 2004.
 [20] K.-i. Kawarabayashi and M. Thorup. Deterministic edge connectivity in near-linear time. J. ACM, 66(1):4:1–4:50, Dec. 2018.
 [21] J. A. Kelner, Y. T. Lee, L. Orecchia, and A. Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5–7, 2014, pages 217–226, 2014.
 [22] F. Kuhn and A. R. Molla. Distributed sparse cut approximation. In Proceedings 19th International Conference on Principles of Distributed Systems (OPODIS), pages 10:1–10:14, 2015.
 [23] F. Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC ’14, pages 296–303, New York, NY, USA, 2014. ACM.
 [24] G. L. Miller, R. Peng, and S. C. Xu. Parallel graph decompositions using random shifts. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 196–203. ACM, 2013.
 [25] D. Nanongkai and T. Saranurak. Dynamic spanning forest with worst-case update time: adaptive, Las Vegas, and O(n^{1/2-ε})-time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19–23, 2017, pages 1122–1129, 2017.
 [26] D. Nanongkai, T. Saranurak, and C. Wulff-Nilsen. Dynamic minimum spanning forest with subpolynomial worst-case update time. In Proceedings of IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 950–961. IEEE, 2017.
 [27] L. Orecchia and N. K. Vishnoi. Towards an SDP-based approach to spectral methods: A nearly-linear-time algorithm for graph partitioning and decomposition. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23–25, 2011, pages 532–545, 2011.
 [28] L. Orecchia and Z. A. Zhu. Flow-based algorithms for local graph clustering. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’14, pages 1267–1286, Philadelphia, PA, USA, 2014. Society for Industrial and Applied Mathematics.
 [29] G. Pandurangan, P. Robinson, and M. Scquizzato. On the distributed complexity of large-scale graph computations. In Proceedings 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.
 [30] M. Pǎtraşcu and M. Thorup. Planning for fast connectivity updates. In Proceedings 48th IEEE Symposium on Foundations of Computer Science (FOCS), pages 263–271, 2007.
 [31] S. V. Pemmaraju. Equitable coloring extends Chernoff-Hoeffding bounds. In M. Goemans, K. Jansen, J. D. P. Rolim, and L. Trevisan, editors, Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 285–296, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
 [32] P. Raghavendra and D. Steurer. Graph expansion and the unique games conjecture. In Proceedings 42nd ACM Symposium on Theory of Computing (STOC), pages 755–764, 2010.
 [33] T. Saranurak and D. Wang. Expander decomposition and pruning: Faster, stronger, and simpler. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2616–2635, 2019.
 [34] A. D. Sarma, A. R. Molla, and G. Pandurangan. Distributed computation of sparse cuts via random walks. In Proceedings 16th International Conference on Distributed Computing and Networking (ICDCN), pages 6:1–6:10, 2015.
 [35] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings 40th ACM Symposium on Theory of Computing (STOC), pages 563–568, 2008.
 [36] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings 36th Annual ACM Symposium on Theory of Computing (STOC), pages 81–90, 2004.
 [37] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM J. Comput., 40(4):981–1025, 2011.
 [38] D. A. Spielman and S.-H. Teng. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM J. Comput., 42(1):1–26, 2013.
 [39] D. A. Spielman and S.-H. Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.
 [40] L. Trevisan. Approximation algorithms for unique games. Theory of Computing, 4(5):111–128, 2008.
 [41] C. Wulff-Nilsen. Fully-dynamic minimum spanning forest with improved worst-case update time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19–23, 2017, pages 1130–1143, 2017.
Appendix A Nearly Most Balanced Sparse Cut
The goal of this section is to prove the following theorem.
We will prove this theorem by adapting the nearly most balanced sparse cut algorithm of Spielman and Teng [36] (there are many versions of the paper [36]; we refer to https://arxiv.org/abs/cs/0310051v9) to the CONGEST model in a white-box manner. Before presenting the proof, we highlight the major differences between this work and the sequential algorithm of [36]. The Nibble procedure itself is not suitable for a distributed implementation, so we follow the idea of [4] to consider an approximate version of Nibble (Section A.2) and use the distributed implementation described in [4] (Section A.5). The nearly most balanced sparse cut algorithm of Spielman and Teng [36] involves repeated iterations of Nibble, each with a random starting vertex, on the remaining subgraph. We will show that this sequential process can be partially parallelized at the cost of worsening the conductance guarantee by a polylogarithmic factor (Section A.4).
Terminology.
Given a parameter, we define the following functions as in [36].
Let A be the adjacency matrix of the graph G = (V, E). We assume a 1-to-1 correspondence between V and {1, 2, …, n}. In a lazy random walk, the walk stays at the current vertex with probability 1/2 and otherwise moves to a uniformly random neighbor of the current vertex. The matrix realizing this walk can be expressed as M = (1/2)(A D^{-1} + I), where D is the diagonal matrix with (d(1), d(2), …, d(n)) on the diagonal.
Let p_t^v = M^t χ_v be the probability distribution of the lazy random walk that begins at v and walks for t steps. In the limit, as t → ∞, p_t^v approaches the stationary distribution, in which each vertex u carries probability mass proportional to its degree d(u), so it is natural to measure p_t^v(u) relative to this baseline.
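As an illustration of the convergence claim above, the following Python sketch (an illustrative example, not part of the algorithm of [36]; the graph and iteration count are arbitrary) builds the lazy-walk matrix M = (1/2)(AD^{-1} + I) for a small graph and checks that the walk distribution approaches the degree distribution d(u)/2m.

```python
import numpy as np

# Illustrative sketch (not from [36]): the lazy random walk converges to the
# degree distribution. The 4-vertex graph below is arbitrary.

# Adjacency matrix of a 4-cycle 0-1-2-3-0 with the chord (0, 2)
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

d = A.sum(axis=1)                              # vertex degrees d(u)
M = 0.5 * (A @ np.diag(1.0 / d) + np.eye(4))   # stay w.p. 1/2, else move to a random neighbor

p = np.zeros(4)                                # start the walk at vertex v = 0 ...
p[0] = 1.0                                     # ... i.e. p_0 = chi_v
for _ in range(200):                           # p_t = M^t chi_v
    p = M @ p

stationary = d / d.sum()                       # d(u) / 2m, since the degrees sum to 2m
print(np.allclose(p, stationary, atol=1e-9))   # True: the walk mixes to the degree distribution
```

Since the graph is connected and the walk is lazy (hence aperiodic), convergence is guaranteed; 200 steps is far more than enough on four vertices.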
Let p : V → ℝ_{≥0} be any function. The truncation operation rounds p(u) to zero if it falls below a threshold that depends on d(u): we define [p]_ε by [p]_ε(u) = p(u) if p(u) ≥ 2εd(u), and [p]_ε(u) = 0 otherwise.
As in [36], for any vertex set S, we define the vector χ_S by χ_S(u) = 1 if u ∈ S and χ_S(u) = 0 if u ∉ S, and we define the vector ψ_S by ψ_S(u) = d(u)/vol(S) if u ∈ S and ψ_S(u) = 0 if u ∉ S. In particular, χ_v is a probability distribution on V that has all its probability mass on the vertex v, and ψ_V is the degree distribution of G. That is, ψ_V(u) = d(u)/(2m).
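The degree-scaled truncation can be sketched in a few lines of Python. The degrees, mass vector, and ε below are arbitrary example values; the threshold 2εd(u) follows the definition in [36].

```python
import numpy as np

# Illustrative sketch of truncation: [p]_eps(u) = p(u) if p(u) >= 2*eps*d(u), else 0.
# The degrees, mass vector, and eps are arbitrary example values.

def truncate(p, d, eps):
    """Zero out the entries of p that fall below the degree-scaled threshold."""
    q = p.copy()
    q[q < 2 * eps * d] = 0.0
    return q

d = np.array([3.0, 2.0, 3.0, 2.0])      # degrees d(u)
p = np.array([0.50, 0.02, 0.40, 0.08])  # some probability mass vector

r = truncate(p, d, eps=0.01)
# per-vertex thresholds 2*eps*d = [0.06, 0.04, 0.06, 0.04]; only p(1) = 0.02 is zeroed
print(r.tolist())  # [0.5, 0.0, 0.4, 0.08]
```

Note that truncation only removes mass and never increases any entry, which is the property the analysis relies on.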
A.1 Nibble
We first review the Nibble algorithm of [36], which computes the following sequence of vectors with truncation parameter ε: q_0 = χ_v and r_0 = [q_0]_ε, and for each t ≥ 1, q_t = M r_{t-1} and r_t = [q_t]_ε.
We define ρ_t(u) = q_t(u)/d(u) as the normalized probability mass at u at time t. Due to truncation, for all u and t, we have r_t(u) ≤ q_t(u) and q_t(u) ≤ p_t^v(u).
We define π_t as a permutation of V such that ρ_t(π_t(1)) ≥ ρ_t(π_t(2)) ≥ ⋯ ≥ ρ_t(π_t(n)). That is, we order the vertices by their ρ_t value, breaking ties arbitrarily (e.g., by comparing IDs). We write S_j^t to denote the set of vertices {π_t(1), …, π_t(j)}. For example, S_j^t is the set of the top j vertices with the highest ρ_t value.
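The sweep ordering just described can be illustrated as follows; the degrees and mass vector are arbitrary example values, and ties in the normalized mass are broken by vertex index (standing in for IDs).

```python
import numpy as np

# Illustrative sketch of the sweep ordering: sort vertices by normalized mass
# rho_t(u) = q_t(u) / d(u) and take prefixes of the ordering as the sweep sets.
# Degrees and the mass vector are arbitrary example values.

d = np.array([4.0, 2.0, 4.0, 2.0])      # degrees d(u)
q = np.array([0.40, 0.26, 0.24, 0.20])  # mass vector q_t at some step t

rho = q / d                              # normalized mass: [0.1, 0.13, 0.06, 0.1]
perm = np.argsort(-rho, kind="stable").tolist()  # decreasing rho; stable sort breaks ties by index

sweep_sets = [set(perm[:j]) for j in range(1, len(perm) + 1)]  # top-j vertex sets
print(perm)  # [1, 0, 3, 2]
```

Here vertex 1 has the largest normalized mass, vertices 0 and 3 tie and are ordered by index, and vertex 2 comes last; the sweep sets are the four prefixes of this ordering.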
Note that the definition of Nibble is exactly the same as the one presented in [36].
Definition 1.
Define as the subset of such that if we start the lazy random walk from , then for at least one of . For any edge , define .
Intuitively, if an edge does not lie in this set, then it does not participate in Nibble and both of its endpoints are absent from the output of Nibble; in particular, membership in this set is a necessary condition. The following auxiliary lemma establishes upper bounds on the two quantities defined above. This lemma will be applied to bound the amount of congestion when we execute multiple instances of Nibble in parallel. Intuitively, if these quantities are small, then we can afford to run many instances of Nibble in parallel for random starting vertices sampled from the degree distribution.
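Sampling starting vertices from the degree distribution can be sketched as follows; the edge list, random seed, and number of samples are illustrative, not taken from the paper.

```python
import random

# Illustrative sketch: draw starting vertices with probability proportional to
# degree, i.e. vertex u is chosen with probability d(u)/2m. The edge list,
# seed, and sample count are arbitrary.

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
deg = {}
for u, v in edges:                     # each edge contributes 1 to both endpoint degrees
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

vertices = sorted(deg)
weights = [deg[u] for u in vertices]   # random.choices normalizes by 2m internally

random.seed(0)
starts = random.choices(vertices, weights=weights, k=5)  # 5 independent starting vertices
print(all(s in deg for s in starts))   # True
```

Each sample is drawn independently, matching the setting where many Nibble instances are launched in parallel from independently sampled starting vertices.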
Lemma 3.
The following formulas hold for each vertex and each edge .
In particular, these two quantities are both upper bounded by .
Proof.
In this proof we use a superscript to indicate the starting vertex of the lazy random walk. We write . Then . Thus, to prove the lemma, it suffices to show that . This inequality follows from the fact that , as follows.