# Fast Distributed Algorithms for Testing Graph Properties

We initiate a thorough study of distributed property testing – producing algorithms for the approximation problems of property testing in the CONGEST model. In particular, for the so-called dense graph testing model we emulate sequential tests for nearly all graph properties having 1-sided tests, while in the general and sparse models we obtain faster tests for triangle-freeness, cycle-freeness and bipartiteness, respectively. In addition, we show a logarithmic lower bound for testing bipartiteness and cycle-freeness, which holds even in the stronger LOCAL model.

In most cases, aided by parallelism, the distributed algorithms have a much shorter running time as compared to their counterparts from the sequential querying model of traditional property testing. The simplest property testing algorithms allow a relatively smooth transitioning to the distributed model. For the more complex tasks we develop new machinery that may be of independent interest.

## 1 Introduction

The performance of many distributed algorithms naturally depends on properties of the underlying network graph. Therefore, an inherent goal is to check whether the graph, or some given subgraph, has certain properties. However, in some cases this is known to be hard, such as in the CONGEST model [31]. In this model, computation proceeds in synchronous rounds, in each of which every vertex can send an $O(\log n)$-bit message to each of its neighbors. Lower bounds of the form $\tilde{\Omega}(\sqrt{n} + D)$ on the number of rounds are known for verifying many global graph properties, where $n$ is the number of vertices in the network and $D$ is its diameter (see, e.g., Das-Sarma et al. [36]).

To overcome such difficulties, we adapt the relaxation used in graph property testing, as first defined in [18, 20], to the distributed setting. That is, rather than aiming for an exact answer to the question of whether the graph $G$ satisfies a certain property $P$, we settle for distinguishing the case of satisfying $P$ from the case of being $\epsilon$-far from it, for an appropriate measure of being far.

Apart from its theoretical interest, this relaxation is motivated by the common scenario of having distributed algorithms for some tasks that perform better given a certain property of the network topology, or given that the graph almost satisfies that property. For example, Hirvonen et al. [24] show an algorithm for finding a large cut in triangle-free graphs (with additional constraints), and for finding a $(1-\epsilon)$-approximation if at most an $\epsilon$ fraction of all edges are part of a triangle. Similarly, Pettie and Su [32] provide fast algorithms for coloring triangle-free graphs.

We construct fast distributed algorithms for testing various graph properties. An important byproduct of this study is a toolbox that we believe will be useful in other settings as well.

### 1.1 Our contributions

We provide a rigorous study of property testing methods in the realm of distributed computing under the CONGEST model. We construct 1-sided error distributed $\epsilon$-tests, in which if the graph satisfies the property then all vertices output accept, and if it is $\epsilon$-far from satisfying the property then at least one vertex outputs reject with probability at least $2/3$. The standard amplification method of invoking such a test $t$ times, and having a vertex output reject if there is at least one invocation in which it should output reject, gives rejection with higher probability at the price of a multiplicative factor of $t$ in the number of rounds.
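The amplification step can be made concrete with a short calculation. The sketch below is illustrative only: it assumes a single invocation rejects an $\epsilon$-far graph with probability at least $2/3$ (the traditional threshold), and the function names are hypothetical, not taken from the paper. Because the test is 1-sided, repetition never causes a satisfying graph to be rejected.

```python
import math

def amplified_reject_probability(single_run_reject: float, t: int) -> float:
    """Probability that at least one of t independent invocations rejects."""
    return 1.0 - (1.0 - single_run_reject) ** t

def repetitions_needed(single_run_reject: float, target_reject: float) -> int:
    """Smallest t such that t independent invocations reject with probability >= target."""
    if not (0.0 < single_run_reject < 1.0 and 0.0 < target_reject < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    miss = 1.0 - single_run_reject  # chance that one invocation fails to reject
    return max(1, math.ceil(math.log(1.0 - target_reject) / math.log(miss)))

if __name__ == "__main__":
    t = repetitions_needed(2 / 3, 0.99)
    print(t, amplified_reject_probability(2 / 3, t))
```

The number of rounds grows linearly in the returned `t`, matching the multiplicative cost described above.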

The definition of a graph being $\epsilon$-far from satisfying a property is roughly one of the following (see Section 2 for precise definitions): (1) changing any $\epsilon n^2$ entries in the adjacency matrix does not give a graph that satisfies the property (dense model), or (2) changing any $\epsilon \cdot \max\{n, m\}$ entries in the adjacency matrix does not give a graph that satisfies the property, where $m$ is the number of edges (general model). A particular case here is when the degrees are bounded by some constant $d$, and any resulting graph must comply with this restriction as well (sparse model).

In a sequential $\epsilon$-test, access to the input is provided by queries, whose type depends on the model. In the dense model these are asking whether two vertices are neighbors, and in the general and sparse models these can be either asking what the degree of a vertex $v$ is, or asking what the $i$-th neighbor of $v$ is (the ordering of neighbors is arbitrary). While a sequential $\epsilon$-test can touch only a small handful of vertices with its queries, in a distributed test the lack of ability to communicate over large distances is offset by having all vertices operating in parallel.

Our first contribution is a general scheme for a near-complete emulation in the distributed context of $\epsilon$-tests originating from the dense graph model (Section 3). This makes use of the fact that in the dense model all (sequential) testing algorithms can be made non-adaptive, which roughly means that queries do not depend on responses to previous queries (see Section 2 for definition). In fact, such tests can be made to have a very simple structure, allowing the vertices in the distributed model to “band together” for an emulation of the test. There is only one additional technical condition (which we define below), since in the distributed model we cannot handle properties whose counter-examples can be “split” into disjoint graphs. For example, the distributed model cannot hope to handle the property of the graph having no disjoint union of two triangles, a property for which there exists a test in the dense model.

• Any $\epsilon$-test in the dense graph model for a non-disjointed property that makes $q$ queries can be converted to a distributed $\epsilon$-test that takes $O(q^2)$ communication rounds.

We next move away from the dense graph model to the sparse and general models, which are sometimes considered to be more realistic. In the general model, there exists no test for the property of containing no triangle that makes a number of queries independent of the number of graph vertices [2]. Here the distributed model can do better, because the reason for this deficiency is addressed by having all vertices operate concurrently. In Section 4 we adapt the interim lemmas used in the best testing algorithm constructed in [2], and construct a distributed algorithm whose number of rounds is independent of $n$.

• Algorithm 4 is a distributed $\epsilon$-test in the general graph model for the property of containing no triangles, that requires $O(\epsilon^{-2})$ rounds.

The sparse and general models inherently require adaptive property testing algorithms, since there is no other way to trace a path from a given vertex forward, or follow its neighborhood. Testing triangle freeness sequentially uses adaptivity only to a small degree. However, other problems in the sparse and general models, such as the one we explore next, have a high degree of adaptivity built into their sequential algorithms, and we need to take special care for emulating it in the distributed setting.

In the sparse model (degrees bounded by a constant $d$), we adapt ideas from the bipartiteness testing algorithm of [19], in which we search for odd-length cycles. Here again the performance of a distributed algorithm surpasses that of the sequential test (a number of rounds polylogarithmic in $n$ vs. a number of queries which is $\Omega(\sqrt{n})$ – a lower bound that is given in [20]). The following is proved in Section 5.

• Algorithm 5.1 is a distributed $\epsilon$-test in the bounded degree graph model for the property of being bipartite, that requires $O(\mathrm{poly}(\epsilon^{-1}\log(n\epsilon^{-1})))$ rounds.

In the course of proving Theorem 5.2 we develop a method that we consider to be of independent interest. The algorithm works by performing random walks concurrently (two starting from each vertex). The parallel execution of random walks despite the congestion restriction is achieved by making sure that the walks have a uniform stationary distribution, and then showing that congestion is “close to average”, which for the uniform stationary distribution is constant.

In Section 6 we show a fast test for cycle-freeness. This makes use of a combinatorial lemma that we prove, about cycles that remain in the graph after removing edges independently with probability $\epsilon/2$. The following summarizes our result for testing cycle-freeness.

• Algorithm 6.2 is a distributed $\epsilon$-test in the general graph model for the property of being cycle-free, that requires $O(\log n/\epsilon)$ rounds.

We also prove lower bounds for testing bipartiteness and cycle-freeness (matching the upper bound for the latter). Roughly speaking, these are obtained by using the probabilistic method with alterations to construct graphs which are far from being bipartite or cycle-free, but all of their cycles are of length that is at least logarithmic. This technique bears some similarity to the classic result by Erdös [13], which showed the existence of graphs with large girth and large chromatic number. The following are given in Section 7.

• Any distributed $\frac{1}{100}$-test for the property of being bipartite requires $\Omega(\log n)$ rounds of communication.

• Any distributed $\frac{1}{100}$-test for the property of being cycle-free requires $\Omega(\log n)$ rounds of communication.

The paper is organized as follows. The remainder of this section consists of related work and historical background on property testing. Section 2 contains formal definitions and some mathematical tools. The emulation of sequential tests for the dense model is given in Section 3. In Section 4 we give our distributed test for triangle-freeness. In Section 5 we provide a distributed test for bipartiteness, along with our new method of executing many random walks, and in Section 6 we give our test for cycle-freeness. Section 7 gives our logarithmic lower bounds for testing bipartiteness and cycle-freeness. We conclude with a short discussion in Section 8.

### 1.2 Related work

The only previous work that directly relates to our distributed setting is due to Brakerski and Patt-Shamir [8]. They show a tolerant property testing algorithm for finding large (linear in size) near-cliques in the graph. An $\epsilon$-near clique is a set of vertices for which all but an $\epsilon$-fraction of the pairs of vertices have an edge between them. The algorithm is tolerant, in the sense that it finds a linear near-clique if there exists a linear $\epsilon^3$-near clique. That is, the testing algorithm considers two thresholds of being close to having the property (in this case – containing a linear size clique). We are unaware of any other work on property testing in this distributed setting.

Testing in a different distributed setting was considered in Arfaoui et al. [5]. They study testing for cycle-freeness, in a setting where each vertex may collect information of its entire neighborhood up to some distance, and send a short string of bits to a central authority who then has to decide whether the graph is cycle-free or not.

Related to having information being sent to, or received by, a central authority, is the concept of proof-labelling schemes, introduced by Korman et al. [27] (for extensions see, e.g., Baruch et al. [6]). In this setting, each vertex is given some external label, and by exchanging labels the vertices need to decide whether a given property of the graph holds. This is different from our setting in which no information other than vertex IDs is available. Another setting that is related to proof-labelling schemes, but differs from our model, is the prover-verifier model of Foerster et al. [15].

Sequential property testing has the goal of computing without processing the entire input. The wider family of local computation algorithms (LCA) is known to have connections with distributed computing, as shown by Parnas and Ron [30] and later used by others. A recent study by Göös et al. [23] proves that under some conditions, the fact that a centralized algorithm can query distant vertices does not help with speeding up computation. However, they consider the LOCAL model, and their results apply to certain properties that are not influenced by distances.

Finding induced subgraphs is a crucial task and has been studied in several different distributed models (see, e.g., [26, 12, 9, 11]). Notice that for finding subgraphs, having many instances of the desired subgraph can help speed up the computation, as in [11]. This is in contrast to algorithms that perform faster if there are no or only few instances, as explained above, which is why we test for, e.g., the property of being triangle-free, rather than for the property of containing triangles. (Notice that these are not the same, and in fact every graph with $3/\epsilon$ or more vertices is $\epsilon$-close to having a triangle.)

Parallelizing many random walks was addressed in [1], where the question of graph covering via random walks is discussed. It is shown there that for certain families of graphs there is a substantial speedup in the time it takes for $k$ walks starting from the same vertex to cover the graph, as compared to a single walk. No edge congestion constraints are taken into account. In [37], it is shown how to perform, under congestion, a single random walk of length $\ell$ in $\tilde{O}(\sqrt{\ell D})$ rounds, and $k$ random walks in $\tilde{O}(\sqrt{k \ell D} + k)$ rounds, where $D$ is the diameter of the graph. Our method has no dependence on the diameter, allowing us to perform a multitude of short walks much faster.

### 1.3 Historical overview

The first papers to consider the question of property testing were [7] and [35]. The original motivations for defining property testing were its connection to some Computerized Learning models, and the ability to leverage some properties to construct Probabilistically Checkable Proofs (PCPs – this is related to property testing through the areas of Locally Testable Codes and Locally Decodable Codes, LTCs and LDCs). Other motivations since then have entered the fray, and foremost among them are sublinear-time algorithms, and other big-data considerations. Since virtually no property can be decided without reading the entire input, property testing introduces a notion of the allowable approximation to the original problem. In general, the algorithm has to distinguish inputs satisfying the property from inputs that are $\epsilon$-far from it. For more information on the general scheme of “classical” property testing, consult the surveys [33, 14, 21].

The older of the graph testing models discussed here is the dense model, as defined in the seminal work of Goldreich, Goldwasser and Ron [18]. The dense graph model has historically kick-started combinatorial property testing in earnest, but it has some shortcomings. Its main one is the distance function, which makes sense only if we consider graphs having many edges (hence the name “dense model”) – any graph with $o(n^2)$ edges is indistinguishable in this model from an empty graph.

The stricter and at times more plausible distance function is one which is relative to the actual number of edges, rather than the maximum $\binom{n}{2}$. The general model was defined in [2], while the sparse model was defined already in [20]. The main difference between the sparse and the general graph models is that in the former there is also a guaranteed upper bound $d$ on the degrees of the vertices, which is given to the algorithm in advance (the query complexity may then depend on $d$, either explicitly, or more commonly implicitly by considering $d$ to be a constant).

## 2 Preliminaries

### 2.1 Additional background on property testing

While the introduction provided rough descriptions of the different property testing models, here we provide more formal definitions. The dense model for property testing is defined as follows.

###### Definition 2.1 (dense graph model [18]).

The dense graph model considers as objects graphs that are given by their adjacency matrix. Hence it is defined by the following features.

• Distance: Two graphs with $n$ vertices each are considered to be $\epsilon$-close if one can be obtained from the other by deleting and inserting at most $\epsilon n^2$ edges (this is, up to a constant factor, the same as the normalized Hamming distance).

• Querying scheme: A single query of the algorithm consists of asking whether two vertices $u, v$ form an edge of $G$ or not.

• Allowable properties: All properties have to be invariant under permutations of the input that pertain to graph isomorphisms (a prerequisite for them being graph properties).

The number of vertices $n$ is given to the algorithm in advance.

As discussed earlier, the sparse and general models for property testing relate the distance function to the actual number of edges in the graph. They are formally defined as follows.

###### Definition 2.2 (sparse [20] and general [2] graph models).

These two models consider as objects graphs given by their adjacency lists. They are defined by the following features.

• Distance: Two graphs with $n$ vertices and $m$ edges (e.g. as defined by the denser of the two) are considered to be $\epsilon$-close if one can be obtained from the other by deleting and inserting at most $\epsilon \cdot \max\{n, m\}$ edges.

• Querying scheme: A single query consists of either asking what is the degree of a vertex $v$, or asking what is the $i$'th neighbor of $v$ (the ordering of neighbors is arbitrary).

• Allowable properties: All properties have to be invariant under graph isomorphisms (which here translate to a relabeling that affects both the vertex order and the neighbor ids obtained in neighbor queries), and reordering of the individual neighbor lists (as these orderings are considered arbitrary).
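To make the two distance notions concrete, the following sketch compares two labeled graphs on the same vertex set. It is an illustration under stated assumptions: the closeness thresholds are taken to be $\epsilon n^2$ for the dense model and $\epsilon \cdot \max\{n, m\}$ for the general model, with $m$ the edge count of the denser graph, and all function names are hypothetical.

```python
def edge_set(edges):
    """Normalize an iterable of undirected edges into a set of frozensets."""
    return {frozenset(e) for e in edges}

def edit_count(edges_a, edges_b):
    """Number of edge insertions and deletions turning one graph into the other."""
    return len(edge_set(edges_a) ^ edge_set(edges_b))

def is_close_dense(edges_a, edges_b, n, eps):
    """Dense model: distance is measured against n^2 (assumed threshold eps * n^2)."""
    return edit_count(edges_a, edges_b) <= eps * n * n

def is_close_general(edges_a, edges_b, n, eps):
    """General model: distance is measured against max(n, m) (assumed threshold)."""
    m = max(len(edge_set(edges_a)), len(edge_set(edges_b)))
    return edit_count(edges_a, edges_b) <= eps * max(n, m)

if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    path = [(0, 1), (1, 2)]
    print(is_close_dense(triangle, path, n=3, eps=0.2))
    print(is_close_general(triangle, path, n=3, eps=0.2))
```

The same single-edge difference counts as close in the dense model but not in the general model, which is the sense in which the latter distance function is stricter.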

In this paper, we mainly refer to the distance functions of these models, and less so to the querying scheme, since the latter will be replaced by the processing scheme provided by the distributed computation model. Note that most property testing models get one bit in response to a query, e.g., “yes/no” in response to “is $uv$ an edge” in the dense graph model. However, the sparse and general models may receive $\log n$ bits of information for one query, e.g., an id of a neighbor of a vertex. Also, the degree of a vertex, which can be given as an answer to a query in the general model, takes $\log n$ bits. Since the distributed CONGEST model allows passing a vertex id or a vertex degree along an edge in $O(1)$ rounds, we can equally relate to all three graph models.

Another important point is the difference between 1-sided and 2-sided testing algorithms, and the difference between non-adaptive and adaptive algorithms.

###### Definition 2.3 (types of algorithms).

A property testing algorithm is said to have 1-sided error if there is no possibility of error on accepting satisfying inputs. That is, an input that satisfies the property will be accepted with probability $1$, while an input $\epsilon$-far from the property will be rejected with a probability that is high enough (traditionally this means a probability of at least $2/3$). A 2-sided error algorithm is also allowed to reject satisfying inputs, as long as the probability for a correct answer is high enough (traditionally at least $2/3$).

A property testing algorithm is said to be non-adaptive if it decides all its queries in advance (i.e. based only on its internal coin tosses and before receiving the results of any query), while only its accept/reject output may depend on the actual input. An adaptive algorithm may make each query in turn based on the results of its previous queries (and, as before, possible internal coin tosses).

In the following we address both adaptive and non-adaptive algorithms. However, we restrict ourselves to 1-sided error algorithms, since the notion of 2-sided error is not a good match for our distributed computation model.

### 2.2 Mathematical background

An important role in our analyses is played by the Multiplicative Chernoff Bound (see, e.g., [29]), hence we state it here for completeness.

###### Fact 2.4.

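Suppose that $X_1, \dots, X_n$ are independent random variables taking values in $\{0, 1\}$. Let $X$ denote their sum and let $\mu = E[X]$ denote its expected value. Then, for any $\delta > 0$ (with $\delta < 1$ in the first bound),

$$\Pr[X<(1-\delta)\mu]<\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}, \qquad \Pr[X>(1+\delta)\mu]<\left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}.$$

Some convenient variations of the bounds above are:

$$\Pr[X\geq(1+\delta)\mu]<e^{-\delta\mu/3} \ \ (\delta \geq 1), \qquad \Pr[X\geq(1+\delta)\mu]<e^{-\delta^{2}\mu/3} \ \ (0<\delta<1), \qquad \Pr[X\leq(1-\delta)\mu]<e^{-\delta^{2}\mu/2} \ \ (0<\delta<1).$$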

## 3 Distributed emulation of sequential tests in the dense model

We begin by showing that under a certain assumption of being non-disjointed, which we define below, a property $P$ that has a sequential test in the dense model that requires $q$ queries can be tested in the distributed setting within $O(q^2)$ rounds. We prove this by constructing an emulation that translates sequential tests to distributed ones. For this we first introduce a definition of a witness graph and then adapt [22, Theorem 2.2], restricted to 1-sided error tests, to our terminology.

###### Definition 3.1.

Let $P$ be a property of graphs with $n$ vertices. Let $G'$ be a graph with $k \leq n$ vertices. We say that $G'$ is a witness against $P$, if it is not an induced subgraph of any graph that satisfies $P$.

Notice that if $G'$ has an induced subgraph $H$ that is a witness against $P$, then by the above definition $G'$ is also a witness against $P$.

The work of [22] transforms tests of graphs in the dense model to a canonical form where the query scheme is based on vertex selection. This is useful in particular for the distributed model, where the computational work is essentially based in the vertices. We require the following special case for 1-sided error tests.

###### Lemma 3.2 ([22, Theorem 2.2]).

Let $P$ be a property of graphs with $n$ vertices. If there exists a 1-sided error $\epsilon$-test for $P$ with query complexity $q(n, \epsilon)$, then there exists a 1-sided error $\epsilon$-test for $P$ that uniformly selects a set of $q' = 2q(n, \epsilon)$ vertices, and accepts if and only if the induced subgraph is not a witness against $P$.

Our emulation leverages Lemma 3.2 under an assumption on the property $P$, which we define as follows.

###### Definition 3.3.

We say that $P$ is a non-disjointed property if for every graph $G$ that does not satisfy $P$ and an induced subgraph $G'$ of $G$ such that $G'$ is a witness against $P$, $G'$ has some connected component which is also a witness against $P$. We call such components witness components.

We are now ready to formally state our main theorem for this section.

###### Theorem 3.4.

Any $\epsilon$-test in the dense graph model for a non-disjointed property that makes $q$ queries can be converted to a distributed $\epsilon$-test that takes $O(q^2)$ communication rounds.

The following lemma essentially says that not satisfying a non-disjointed property cannot rely on subgraphs that are not connected, which is exactly what we need to forbid in a distributed setting.

###### Lemma 3.5.

The property $P$ is a non-disjointed property if and only if all minimal witnesses that are induced subgraphs of $G$ are connected.

Here minimal refers to the standard terminology, which means that no proper induced subgraph is a witness against $P$.

###### Proof.

First, if $P$ is non-disjointed and $G$ does not satisfy $P$, then for every subgraph $G'$ of $G$ that is a witness against $P$, $G'$ has a witness component. If $G'$ is minimal then it must be connected, since otherwise it contains a connected component which is a witness against $P$, which contradicts the minimality of $G'$.

For the other direction, if all the minimal witnesses that are induced subgraphs of $G$ are connected, then every induced subgraph $G'$ that is a witness against $P$ is either minimal, in which case it is connected, or is not minimal, in which case there is a subgraph $H$ of $G'$ which is connected and a minimal witness against $P$. The connected component $C$ of $G'$ which contains $H$ is a witness against $P$ (otherwise $H$ is not a witness against $P$), and hence it follows that $P$ is non-disjointed. ∎

Next, we give the distributed test (Algorithm 3.5). The test has an outer loop in which each vertex picks itself with probability $5q/n$, collects in an inner loop the edges between picked vertices in its neighborhood up to a bounded size, and rejects if it identifies a witness against $P$. The outer loop repeats two times because not only does the sequential test have an error probability, but also with some small probability we may randomly pick too many or not enough vertices in order to emulate it. Repeating the main loop twice reduces the error probability back to below $1/3$. In the inner loop, each vertex collects its neighborhood of picked vertices and checks if its connected component is a witness against $P$. To limit communication, this is done only for components of picked vertices that are sufficiently small: if a vertex detects that it is part of a component with too many edges then it accepts and does not participate until the next iteration of the outer loop.

###### Algorithm 3.5 (emulation algorithm with input $q$ for property $P$).

Variables: $U_v$ edges known to $v$, $U'_v$ edges to update and send (temporary variables).

    perform 2 times
        reset the state for all vertices
        for each vertex v simultaneously
            vertex v picks itself with probability 5q/n
            if v is picked then
                notify all neighbors that v is picked
                set U'_v = {(v, u) in E : u is picked} and U_v = {}
                perform 10q times
                    # at each iteration U_v is a subgraph of v's connected component
                    U'_v = U'_v \ U_v      # only need recently discovered edges
                    U_v = U_v ∪ U'_v       # add them to U_v
                    if |U_v| <= 100q^2 then      # don't operate if there are too many edges
                        send U'_v to all picked neighbours of v      # propagate known edges
                    wait until the time bound for all other vertices to finish this iteration
                    set U'_v to the union of edge sets received from neighbors
                if U_v ∪ U'_v is a witness against P then
                    vertex v outputs reject (ending all operations)
            else
                wait until the time bound for all other vertices to finish this iteration of the outermost loop
    every vertex that did not reject outputs accept
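The following is a centralized sketch of the emulation, meant only to illustrate the control flow; it is not a CONGEST implementation and does not model message passing. It assumes each vertex is picked with probability $5q/n$, replaces the per-component edge cap with a single global cutoff of $10q$ picked vertices (a simplification), and takes the witness check as a caller-supplied predicate. All names, including `emulate` and `contains_triangle`, are hypothetical.

```python
import random
from collections import deque

def picked_components(adj, picked):
    """Connected components of the subgraph induced by the picked vertices."""
    seen, components = set(), []
    for start in picked:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in picked and w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

def emulate_once(adj, q, is_witness, rng):
    """One outer iteration; returns True if some vertex would output reject."""
    n = len(adj)
    p = min(1.0, 5 * q / n)  # assumed picking probability
    picked = {v for v in adj if rng.random() < p}
    if len(picked) > 10 * q:
        return False  # simplification: an oversized sample is accepted outright
    for component in picked_components(adj, picked):
        edges = {frozenset((u, w)) for u in component for w in adj[u] if w in component}
        if is_witness(component, edges):
            return True
    return False

def emulate(adj, q, is_witness, seed=0, iterations=2):
    """Run the outer loop (two iterations by default) and report rejection."""
    rng = random.Random(seed)
    return any(emulate_once(adj, q, is_witness, rng) for _ in range(iterations))

def contains_triangle(vertices, edges):
    """Example witness predicate for triangle-freeness."""
    neighbors = {v: set() for v in vertices}
    for u, w in (tuple(e) for e in edges):
        neighbors[u].add(w)
        neighbors[w].add(u)
    return any(neighbors[u] & neighbors[w] for u, w in (tuple(e) for e in edges))
```

Since a component of a graph that satisfies the property is never a witness, the sketch never rejects such a graph, mirroring the 1-sided guarantee.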

To analyze the algorithm, we begin by proving that there is a constant probability for the number of picked vertices to be sufficient and not too large.

###### Lemma 3.6.

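The probability that the number of vertices picked by the algorithm is between $q$ and $10q$ is more than $2/3$.

###### Proof.

For every $v \in V$, we denote by $X_v$ the indicator variable for the event that vertex $v$ is picked. Note that these are all independent random variables. Using the notation $X = \sum_{v \in V} X_v$ gives that $E[X] = 5q$, because each vertex is picked with probability $5q/n$. Using the Chernoff Bound from Fact 2.4 with $\delta = 4/5$ and $\mu = 5q$, we can bound the probability of having too few picked vertices:

$$\Pr[X<q]=\Pr[X<(1-\delta)\mu]<\left(\frac{e^{-4/5}}{(1/5)^{1/5}}\right)^{5q}=\left(\frac{5}{e^{4}}\right)^{q}<\frac{1}{10}.$$

For bounding the probability that there are too many picked vertices, we use the other direction of the Chernoff Bound with $\delta = 1$ and $\mu = 5q$, giving:

$$\Pr[X>10q]=\Pr[X>(1+\delta)\mu]<\left(\frac{e}{2^{2}}\right)^{5q}=\left(\frac{e^{5}}{2^{10}}\right)^{q}<\frac{2}{10}.$$

Thus, with probability at least $2/3$ it holds that $q \leq X \leq 10q$. ∎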
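The two tail bounds can be checked numerically. This is a sanity-check sketch, assuming the expected number of picked vertices is $\mu = 5q$ and the parameters $\delta = 4/5$ (lower tail) and $\delta = 1$ (upper tail); the function names are hypothetical.

```python
import math

def chernoff_lower(mu, delta):
    """Multiplicative Chernoff bound on Pr[X < (1 - delta) * mu], for 0 < delta < 1."""
    return (math.exp(-delta) / (1.0 - delta) ** (1.0 - delta)) ** mu

def chernoff_upper(mu, delta):
    """Multiplicative Chernoff bound on Pr[X > (1 + delta) * mu], for delta > 0."""
    return (math.exp(delta) / (1.0 + delta) ** (1.0 + delta)) ** mu

def failure_bound(q):
    """Bound on the probability that the sample size falls outside [q, 10q]."""
    mu = 5 * q  # assumed expectation: each of n vertices picked with probability 5q/n
    return chernoff_lower(mu, 4 / 5) + chernoff_upper(mu, 1.0)

if __name__ == "__main__":
    for q in (1, 2, 5, 10):
        print(q, failure_bound(q))
```

The worst case is $q = 1$, where the two terms are $5/e^4$ and $e^5/2^{10}$, and their sum stays below $1/3$.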

Now, we can use the guarantees of the sequential test to obtain the guarantees of our algorithm.

###### Lemma 3.7.

Let $P$ be a non-disjointed graph property. If $G$ satisfies $P$ then all vertices output accept in Algorithm 3.5. If $G$ is $\epsilon$-far from satisfying $P$, then with probability at least $2/3$ there exists a vertex that outputs reject.

###### Proof.

First, assume that $G$ satisfies $P$. Vertex $v$ outputs reject only if it is part of a witness against $P$, which is, by definition, a component that cannot be extended to some $H$ that satisfies $P$. However, every component is an induced subgraph of $G$ itself, which does satisfy $P$, and thus every component can be extended to $G$. This implies that no vertex outputs reject.

Now, assume that $G$ is $\epsilon$-far from satisfying $P$. Since the sequential test rejects with probability at least $2/3$, the probability that a sample of at least $q$ vertices induces a graph that cannot be extended to a graph that satisfies $P$ is at least $2/3$. Because $P$ is non-disjointed, the induced subgraph must have a connected witness against $P$. We note that a sample of more than $q$ vertices does not reduce the rejection probability. Hence, if we denote by $A$ the event that the subgraph induced by the picked vertices has a connected witness against $P$, then $\Pr[A] \geq 2/3$, conditioned on that at least $q$ vertices were picked.

However, a sample that is too large may cause a vertex to output accept because it cannot collect its neighborhood. We denote by $B$ the event that the number of vertices sampled is between $q$ and $10q$, and by Lemma 3.6 its probability is at least $2/3$. We bound $\Pr[A \cap B]$ using Bayes’ Theorem, obtaining $\Pr[A \cap B] = \Pr[A \mid B] \Pr[B] \geq (2/3)^2 = 4/9$. Since the outer loop consists of $2$ independent iterations, this gives a probability of at least $1 - (1 - 4/9)^2 = 56/81 > 2/3$ for having a vertex that outputs reject. ∎

We now address the round complexity. Each vertex only sends and receives information from its $O(q)$-neighborhood about edges between the chosen vertices. If too many vertices are chosen we detect this and accept. Otherwise we only communicate the chosen vertices and their edges, which requires $O(q^2)$ communication rounds using standard pipelining. Together with Lemma 3.7, this proves Theorem 3.4.

### 3.1 Applications: k-colorability and perfect graphs

Next, we provide some examples of usage of Theorem 3.4. A result by Alon and Shapira [4] states that all graph properties closed under induced subgraphs are testable in a number of queries that depends only on $\epsilon$. We note that, except for certain specific properties for which there are ad-hoc proofs, the dependence is usually a tower function in $1/\epsilon$ or worse (asymptotically larger).

From this, together with Lemma 3.2 and Theorem 3.4, we deduce that if $P$ is a non-disjointed property closed under induced subgraphs, then it is testable, for every fixed $\epsilon$, in a constant number of communication rounds.

#### Example – k-colorability:

The property of being $k$-colorable is testable in a distributed manner by our algorithm. All minimal graphs that are witnesses against $P$ (not $k$-colorable) are connected, and therefore according to Lemma 3.5 it is a non-disjointed property. It is closed under induced subgraphs, and by [3] there exists a 1-sided error $\epsilon$-test for $k$-colorability that uniformly picks $O(k \log k / \epsilon^2)$ vertices, and its number of queries is the square of this expression (note that the polynomial dependency was already known by [18]). Our emulation implies a distributed 1-sided error $\epsilon$-test for $k$-colorability that requires $O(\mathrm{poly}(\epsilon^{-1} k \log k))$ rounds.

#### Example – perfect graphs:

A graph $G$ is said to be perfect if for every induced subgraph $H$ of $G$, the chromatic number of $H$ equals the size of the largest clique in $H$. Another characterization of a perfect graph is via forbidden subgraphs: a graph is perfect if and only if it does not have odd holes (induced cycles of odd length at least $5$) or odd anti-holes (the complement graph of an odd hole) [10]. Both odd holes and odd anti-holes are connected graphs. Since these are all minimal witnesses against the property, according to Lemma 3.5 it is a non-disjointed property. Using the result of Alon-Shapira [4] we know that the property of a graph being perfect is testable. Our emulation implies a distributed 1-sided error $\epsilon$-test for being a perfect graph that requires a number of rounds that depends only on $\epsilon$.

## 4 Distributed test for triangle-freeness

In this section we show a distributed $\epsilon$-test for triangle-freeness. Notice that since triangle-freeness is a non-disjointed property, Theorem 3.4 gives a distributed $\epsilon$-test for triangle-freeness under the dense model with a number of rounds that is $O(q^2)$, where $q$ is the number of queries required for a sequential $\epsilon$-test for triangle-freeness. However, for triangle-freeness, the known number of queries is a tower function in $\log(1/\epsilon)$ [16].

Here we leverage the inherent parallelism that we can obtain when checking the neighbors of a vertex, and show a test for triangle-freeness that requires only $O(\epsilon^{-2})$ rounds (Algorithm 4). Importantly, our algorithm works not only for the dense graph model, but for the general graph model (where distances are relative to the actual number of edges), which subsumes it. In the sequential setting, a test for triangle-freeness in the general model requires a number of queries that is some constant power of $n$ by [2]. Our proof actually follows the groundwork laid in [2] for the general graph model – their algorithm picks a vertex and checks two of its neighbors for being connected, while we perform the check for all vertices in parallel.

###### Algorithm 4 (triangle-freeness test).

    for each vertex v simultaneously
        perform O(ε^{-2}) times
            pick w_1, w_2 in N(v), w_1 ≠ w_2, uniformly at random
            send w_2 to w_1      # ask w_1 if it is a neighbor of w_2
            for each w_u sent by u in N(v)      # asked by u if v is a neighbor of w_u
                if w_u in N(v) then send "yes" to u
                else send "no" to u
            if received "yes" from w_1 then reject (ending all operations)
        accept (for vertices that did not reject)

###### Theorem 4.1.

Algorithm 4 is a distributed -test in the general graph model for the property of containing no triangles, that requires rounds.

Our line of proof follows that of [2], by distinguishing edges that connect two high-degree vertices from those that do not. Formally, let , where is the number of edges in the graph, and denote . We say that an edge is light if or , and otherwise, we say that it is heavy. That is, the set of heavy edges is . We begin with the following simple claim about the number of heavy edges.

###### Claim 4.2.

The number of heavy edges, , is at most .

###### Proof.

The number of heavy edges is . Since , we get that . This gives that . ∎
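The claim can be sanity-checked numerically on small graphs. The sketch below assumes the threshold $b = 2\sqrt{m/\epsilon}$ (inferred from the identity $b^2 = 4\epsilon^{-1}m$ used later in this section) and treats an edge as heavy when both endpoint degrees exceed $b$; the helper name `heavy_edges` and the example graph are hypothetical.

```python
import math


def heavy_edges(edges, eps):
    """Return the edges whose two endpoints both have degree above the threshold b."""
    m = len(edges)
    b = 2 * math.sqrt(m / eps)  # assumed threshold, matching b^2 = 4m/eps
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return [(u, v) for u, v in edges if deg[u] > b and deg[v] > b]


# Two adjacent hubs "a" and "b" that share ten leaves: 21 edges in total.
edges = [("a", "b")] + [(h, i) for h in ("a", "b") for i in range(10)]
heavy = heavy_edges(edges, eps=1.0)
```

Here only the hub-to-hub edge is heavy, well within the budget of $\epsilon m/2 = 10.5$ edges.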

Next, we fix an iteration of the algorithm. Every vertex chooses two neighbors . Let , where is the first of the two vertices chosen by the low-degree vertex . Let , and let . We say that an edge is matched if is in the same triangle as . If is matched then is a triangle that is detected by .

We begin with the following lemma that states that if $G$ is $\epsilon$-far from being triangle-free, then in any iteration we can bound the expected number of matched edges from below by $\epsilon^2/8$. Let $Y$ be the number of matched edges.

###### Lemma 4.3.

The expected number of matched edges in a single iteration of the algorithm, $\mathbb{E}[Y]$, is at least $\epsilon^2/8$.

###### Proof.

For every , let be a random variable indicating whether is matched. Then , giving the following bound:

$$\mathbb{E}[Y \mid A_T] = \mathbb{E}\Big[\sum_{e \in A_T} Y_e \,\Big|\, A_T\Big] = \sum_{e \in A_T} \Pr[e \text{ is matched}] \ge |A_T|/b, \tag{1}$$

where the last inequality follows because a light edge in is chosen by a vertex with degree at most , hence the third triangle vertex gets picked with probability at least .

Next, we argue that . To see why, for every edge , let be a random variable indicating whether . Let . Then,

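$$\mathbb{E}[|A_T|] = \mathbb{E}[X] = \mathbb{E}\Big[\sum_{e \in T} X_e\Big] = \sum_{e \in T} \mathbb{E}[X_e] = \sum_{e \in T} \Pr[e \in A] \ge |T|/b, \tag{2}$$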

where the last inequality follows because a light edge has at least one endpoint with degree at most . Hence, this edge gets picked by it with probability at least .

It remains to bound from below, for which we claim that . To prove this, first notice that, since is -far from being triangle free, it has at least triangle edges, since otherwise we can just remove all of them and make the graph triangle free with less than edge changes. By Claim 4.2, the number of heavy edges satisfies . Subtracting this from the number of triangle edges gives that at least edges are light triangle edges, i.e.,

$$|T| \ge \epsilon m/2. \tag{3}$$

Finally, by Inequalities (1), (2) and (3), using iterated expectation we get:

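$$\mathbb{E}[Y] = \mathbb{E}_{A_T}\big[\mathbb{E}[Y \mid A_T]\big] \ge \mathbb{E}\Big[\frac{|A_T|}{b}\Big] \ge \frac{|T|}{b^2} \ge \frac{\epsilon m}{2} \cdot \frac{1}{4\epsilon^{-1} m} = \epsilon^2/8.$$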

We can now prove the correctness of our algorithm, as follows.

###### Lemma 4.4.

If is triangle-free then all vertices output accept in Algorithm 4. If is -far from being triangle-free, then with probability at least 2/3 there exists a vertex that outputs reject.

###### Proof.

If is triangle free then in each iteration receives “no” from and after all iterations it returns accept.

Assume that is -far from being triangle-free. Let be an indicator variable for the event that vertex detects a triangle at iteration . First, we note that the indicators are independent, since a vertex detecting a triangle does not affect the chance of another vertex detecting a triangle (note that the graph is fixed), and the iterations are done independently. Now, let , and notice that is the total number of detections over all iterations. Lemma 4.3 implies that for a fixed , it holds that , which sums to:

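$$\mathbb{E}[Z] = \mathbb{E}\Big[\sum_{i=1}^{32\epsilon^{-2}} \sum_{v \in V} Z_{i,v}\Big] = \sum_{i=1}^{32\epsilon^{-2}} \mathbb{E}\Big[\sum_{v} Z_{i,v}\Big] \ge \sum_{i=1}^{32\epsilon^{-2}} \epsilon^2/8 = 4.$$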

Using the Chernoff Bound from Fact 2.4 with $\delta = 3/4$ and $\mu = 4$ gives

$$\Pr[Z<1] \le \Pr[Z<(1-\delta)\mu] < \left(\frac{e^{-3/4}}{(1-(3/4))^{(1-(3/4))}}\right)^{4} = \frac{4}{e^3} < 1/3,$$

and hence with probability at least $2/3$ at least one triangle is detected and the associated vertex outputs reject, which completes the proof. ∎
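The arithmetic in this proof is easy to check numerically. The snippet below is only a sketch: it evaluates the lower-tail expression $\left(e^{-\delta}/(1-\delta)^{1-\delta}\right)^{\mu}$ that appears in the displayed bound, with an arbitrary example value of `eps`; the function and variable names are hypothetical.

```python
import math


def chernoff_lower_tail(delta, mu):
    """Evaluate (e^{-delta} / (1 - delta)^{1 - delta})^mu."""
    return (math.exp(-delta) / (1 - delta) ** (1 - delta)) ** mu


eps = 0.1                                        # arbitrary example value
iterations = 32 / eps ** 2                       # number of iterations
expected_detections = iterations * eps ** 2 / 8  # lower bound on E[Z]
bound = chernoff_lower_tail(0.75, 4)             # equals 4 / e^3, roughly 0.199
```

The computed bound is below $1/3$, which is what the success probability of at least $2/3$ requires.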

In every iteration, each vertex initiates only two messages of size bits, one sent to and one sent back by . Since there are iterations, this implies that the number of rounds is as well. This, together with Lemma 4.4, completes the proof of Theorem 4.1.

## 5 Distributed bipartiteness test for bounded degree graphs

In this section we show a distributed -test for being bipartite for graphs with degrees bounded by . Our test builds upon the sequential test of [19] and, as in the case of triangle freeness, takes advantage of the ability to parallelize queries. While the number of queries of the sequential test is  [20], the number of rounds in the distributed test is only polylogarithmic in and polynomial in . As in [19], we assume that is a constant, and omit it from our expressions (it is implicit in the notation for below).

Let us first outline the algorithm of [19], since our distributed test borrows from its framework and our analysis is in part derived from it. The sequential test basically tries to detect odd cycles. It consists of iterations, in each of which a vertex is selected uniformly at random and random walks of length are performed starting from the source . If, in any iteration with a chosen source , there is a vertex which is reached by an even prefix of a random walk and an odd prefix of a random walk (possibly the same walk), then the algorithm rejects, as this indicates the existence of an odd cycle. Otherwise, the algorithm accepts. To obtain an -test the parameters are chosen to be , , and .

The main approach of our distributed test is similar, except that a key ingredient is that we can afford to perform much fewer random walks from every vertex, namely . This is because we can run random walks in parallel originating from all vertices at once. However, a crucial challenge that we need to address is that several random walks may collide on an edge, violating its congestion bound. To address this issue, our central observation is that lazy random walks (chosen to have a uniform stationary distribution) provide for a very low probability of having too many of these collisions at once. The main part of the analysis is in showing that with high probability there will never be too many walks concurrently in the same vertex, so we can comply with the congestion bound. We begin by formally defining the lazy random walks that we use.

###### Definition 5.1.

A lazy random walk over a graph with degree bound is a random walk, that is, a (memory-less) sequence of random variables taking values from the vertex set , where the transition probability is if is an edge of , if , and in all other cases.
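To make the definition concrete, the following sketch assumes the usual parameterization of such lazy walks on graphs with degree bound $d$: move to each neighbor with probability $1/(2d)$ and stay put with the remaining probability $1 - \deg(u)/(2d)$. These specific values, and the helper names `lazy_transition_matrix` and `lazy_step`, are assumptions for illustration. Exact rational arithmetic is used so that the uniform stationary distribution can be verified by checking that all column sums equal 1.

```python
import random
from fractions import Fraction


def lazy_transition_matrix(adj, d):
    """Build P[u][v] for the assumed lazy walk; requires d >= maximum degree."""
    verts = sorted(adj)
    P = {u: {v: Fraction(0) for v in verts} for u in verts}
    for u in verts:
        for v in adj[u]:
            P[u][v] = Fraction(1, 2 * d)
        P[u][u] = 1 - Fraction(len(adj[u]), 2 * d)
    return P


def lazy_step(adj, d, u, rng):
    """Sample one step: each of the 2d slots has probability 1/(2d)."""
    nbrs = sorted(adj[u])
    idx = rng.randrange(2 * d)
    return nbrs[idx] if idx < len(nbrs) else u
```

Because the matrix is symmetric with rows summing to 1, its columns also sum to 1, so the uniform distribution is preserved by one step of the walk.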

The stationary distribution for the lazy random walk of Definition 5.1 is uniform [34, Section 8]. Next, we describe a procedure to handle one iteration of moving the random walks (Algorithm 5.1), followed by our distributed test for bipartiteness using lazy random walks from every vertex concurrently (Algorithm 5.1).

{algorithm}

[htbp] Move random walks once with input \KwVars walks residing in (multiset), history of walks through \KwIn, the maximum congestion per vertex allowed \tcpyeach walk is characterized by where is the number of actual moves and is the origin vertex \Simulvertex \If(\tcpy*[h]give up if exceeded the maximum allowed) \Forevery in draw next destination (according to the lazy walk scheme)
\If(\tcpy*[h]walk exits ) send to
remove from wait until the maximum time for all other vertices to process up to walks

It is quite immediate that Algorithm 5.1 takes communication rounds.

{algorithm}

[htbp] Distributed bipartiteness test \KwVars walks residing in (multiset), history of walks through \Perform \Simulvertex initialize and with two copies of the walk \Perform move walks using Algorithm 5.1 with input \Simulvertex \If contains and for some , even and odd reject (ending all operations) \tcpyodd cycle found accept (for vertices that did not reject)
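The detection rule can be illustrated with a centralized simulation. This is a sketch under several assumptions: it ignores the congestion cap entirely, it takes the walk length and the number of iterations as plain inputs instead of the parameters of the test, and it uses the lazy step that moves to each neighbor with probability $1/(2d)$. The function name `bipartite_walk_test` is hypothetical.

```python
import random


def bipartite_walk_test(adj, d, walk_len, iterations, rng):
    """Return True (accept) unless some vertex sees both parities from one origin."""
    neighbors = {v: sorted(nbrs) for v, nbrs in adj.items()}
    for _iteration in range(iterations):
        # history[v] holds pairs (origin, parity of the number of actual moves).
        history = {v: {(v, 0)} for v in adj}
        # Two walks per vertex, each stored as (origin, position, moves).
        walks = [(s, s, 0) for s in adj for _copy in range(2)]
        for _step in range(walk_len):
            advanced = []
            for origin, pos, moves in walks:
                idx = rng.randrange(2 * d)  # each slot has probability 1/(2d)
                if idx < len(neighbors[pos]):
                    pos, moves = neighbors[pos][idx], moves + 1
                history[pos].add((origin, moves % 2))
                advanced.append((origin, pos, moves))
            walks = advanced
        for seen in history.values():
            even = {o for o, parity in seen if parity == 0}
            odd = {o for o, parity in seen if parity == 1}
            if even & odd:
                return False  # same origin reached this vertex with both parities
    return True
```

On a bipartite graph the parity of the number of moves from a fixed origin to a fixed vertex is determined by the bipartition, so the function can never reject.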

Our main result here is that Algorithm 5.1 is indeed a distributed -test for bipartiteness.

###### Theorem 5.2.

Algorithm 5.1 is a distributed -test in the bounded degree graph model for the property of being bipartite, that requires rounds.

The number of communication rounds is immediate from the algorithm – it is dominated by the calls to Algorithm 5.1, making a total of rounds, which is indeed . To prove the rest of Theorem 5.2 we need some notation, and a lemma from [19] that bounds from below the probabilities for detecting odd cycles if is -far from being bipartite.

Given a source , if there is a vertex which is reached by an even prefix of a random walk from and an odd prefix of a random walk from , we say that walks and detect a violation. Let be the probability that, out of random walks of length starting from , there are two that detect a violation. Using this notation, is the probability that the sequential algorithm outlined in the beginning rejects in an iteration in which is chosen. Since we are only interested in walks of length , we denote . A good vertex is a vertex for which this probability is bounded as follows.

###### Definition 5.3.

A vertex is called good if .

In [19] it was proved that being far from bipartite implies having many good vertices.

###### Lemma 5.4 ([19]).

If is -far from being bipartite then at least an -fraction of the vertices are good.

In contrast to [19], we do not perform random walks from every vertex in each iteration, but rather only . Hence, what we need for our analysis is a bound on . To this end, we use as a parameter, and express in terms of and .

###### Lemma 5.5.

For every vertex , .

###### Proof.

Fix a source vertex . For every , let be the probability of walks from detecting a violation. Because different walks are independent, we conclude that for every it holds that . Let be the event of walks detecting a violation. We have

$$p_s(K) = \Pr\Big[\bigcup_{i,j} A_{i,j}\Big] \le \sum_{i,j} \Pr[A_{i,j}] = p_s(2)\,K(K-1)/2,$$

which implies that . ∎

Using this relationship between and and , we prove that our algorithm is an -test. First we prove this for the random walks themselves, ignoring the possibility that Algorithm 5.1 will skip moving random walks due to its condition in Line .

###### Lemma 5.6.

If is -far from being bipartite, and we perform iterations of starting random walks of length from every vertex, then the probability that no violation is detected is bounded by .

###### Proof.

Assume that is -far from being bipartite. By Lemma 5.4, at least vertices are good, which means that for each of these vertices , . This implies that . Now, let be a random variable indicating whether there are two random walks starting at that detect a violation. Let . We prove that . First, we bound for some fixed :

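$$
\begin{aligned}
\mathbb{E}[X] &= \mathbb{E}\Big[\sum_{i=1}^{\eta} \sum_{s \in V} X_{i,s}\Big] = \sum_{i=1}^{\eta} \sum_{s \in V} \mathbb{E}[X_{i,s}] \\
&= \sum_{i=1}^{\eta} \sum_{s \in V} p_s(2) \ge \sum_{i=1}^{\eta} \sum_{s \in V} \frac{2 p_s(K)}{K(K-1)} \\
&= \frac{2}{K(K-1)} \sum_{i=1}^{\eta} \sum_{s \in V} p_s(K) \ge \frac{2}{K(K-1)} \sum_{i=1}^{\eta} \frac{n\epsilon}{160} \\
&= \frac{\eta n \epsilon}{80 K(K-1)} \ge \frac{\eta n \epsilon}{80 K^2}.
\end{aligned}
$$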

For it holds that . Using the Chernoff Bound of Fact 2.4 with and gives:

$$\Pr[X<1] \le \Pr[X<(1-\delta)\mu] < \left(\frac{e^{-3/4}}{(1-(3/4))^{(1-(3/4))}}\right)^{4} = \frac{4}{e^3} < 1/4,$$

which completes the proof. ∎

As explained earlier, the main hurdle on the road to prove Theorem 5.2 is in proving that the allowed congestion will not be exceeded. We prove the following general claim about the probability for lazy random walks of length from each vertex to exceed a maximum congestion factor of walks allowed in each vertex at the beginning of each iteration. Here, an iteration is a sequence of rounds in which all walks are advanced by one step (whether or not they actually switch vertices).

###### Lemma 5.7.

With probability at least , running lazy random walks of length originating from every vertex will not exceed the maximum congestion factor of walks allowed in each vertex at the beginning of each iteration, if .

We show below that plugging , and in Lemma 5.7, together with Lemma 5.6, gives the correctness of Algorithm 5.1.

To prove Lemma 5.7, we argue that it is unlikely for any vertex to have more than walks in any iteration. Given that this is indeed the case in every iteration, the lemma follows by a union bound. We denote by the random variable whose value is the number of random walks at vertex at the beginning of the -th iteration. That is, it is equal to the size of the set in the description of the algorithm.

###### Lemma 5.8.

For every vertex and every iteration it holds that .

###### Proof.

Let us first define random variables for our walks. Enumerating our walks ( from each of the vertices) arbitrarily, let denote the sequence corresponding to the ’th walk, that is, is the vertex where the ’th walk is stationed at the beginning of the ’th iteration. In particular, .

Now let us define new random variables in the following manner: First, we choose uniformly at random a permutation . Then we set for all and . The main thing to note is that for any fixed , is a random walk (as it is equal to one of the random walks ). But also, for every , is uniformly distributed over the vertex set of , because we started with exactly random walks from every vertex. Additionally, since the uniform distribution is stationary for our lazy walks, this means that the unconditional distribution of each is also uniform.

Now, since is a permutation, it holds that . The expectation (by linearity of expectation) is thus . ∎
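The statement that the expected load of every vertex stays constant can also be verified exactly on a small example. The sketch below assumes the lazy step with probability $1/(2d)$ per neighbor and starts with `k` walks at every vertex; it propagates expected walk counts with rational arithmetic. The function name `expected_occupancy` and the example graph are hypothetical.

```python
from fractions import Fraction


def expected_occupancy(adj, d, k, steps):
    """Expected number of walks at each vertex after the given number of lazy steps."""
    occ = {v: Fraction(k) for v in adj}  # k walks start at every vertex
    for _ in range(steps):
        nxt = {v: Fraction(0) for v in adj}
        for u in adj:
            stay = 1 - Fraction(len(adj[u]), 2 * d)
            nxt[u] += occ[u] * stay
            for v in adj[u]:
                nxt[v] += occ[u] * Fraction(1, 2 * d)
        occ = nxt
    return occ


# A star with three leaves is far from regular, yet the expectation stays at k.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
occupancy = expected_occupancy(star, d=3, k=2, steps=5)
```

Every entry of `occupancy` equals `k`, in line with the lemma.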

We can now prove Lemma 5.7.

###### Proof of Lemma 5.7.

We first claim that for every iteration and every vertex , with probability at least it holds that . To show this, first fix some . Let be the indicator variable for the event of walk residing at vertex at the beginning of iteration , where . Then , and the variables , where , are all independent. We use the Chernoff Bound of Fact 2.4 with and as proven in Lemma 5.8, obtaining:

$$\Pr[X_{v,i} > k+\gamma] = \Pr[X_{v,i} > (\gamma/k+1)k]$$

Applying the union bound over all vertices and all iterations , we obtain that with probability at least it holds that for all and . ∎

###### Lemma 5.9.

If is bipartite then all vertices output accept in Algorithm 5.1. If is -far from being bipartite, then with probability at least there exists a vertex that outputs reject.

###### Proof.

If is bipartite then all vertices output accept in Algorithm 5.1, because there are no odd cycles and thus no violation detecting walks.

If is -far from bipartite, we use Lemma 5.6, in conjunction with Lemma 5.7 with parameters , and as used by Algorithm 5.1. By a union bound the probability to accept will be bounded by (assuming ), providing for the required bound on the rejection probability. ∎

Lemma 5.9, with the communication complexity analysis of Algorithm 5.1, gives Theorem 5.2.

## 6 Distributed test for cycle-freeness

In this section, we give a distributed algorithm to test if a graph with edges is cycle-free or if at least edges have to be removed to make it so. Intuitively, in order to search for cycles, one can run a breadth-first search (BFS) and have a vertex output reject if two different paths reach it. The downside of this exact solution is that its running time depends on the diameter of the graph. To overcome this, a basic approach would be to run a BFS from each vertex of the graph, but for shorter distances. However, running multiple BFSs simultaneously is expensive, due to the congestion on the edges. Instead, we use a simple prioritization rule that drops BFS constructions with lower priority, which makes sure that one BFS remains alive.

Our technique consists of three parts. First, we make the graph sparser, by removing each of its edges independently with probability . We denote the sampled graph by and prove that if is far from being cycle-free then so is , and in particular, contains a cycle.

Then, we run a partial BFS over from each vertex, while prioritizing by ids: each vertex keeps only the BFS that originates in the vertex with the smallest id and drops the rest of the BFSs. The length of this procedure is according to a threshold . This gives detection of a cycle that is contained in a component of with a low diameter of up to , if such a cycle exists, since a surviving BFS covers the component. Such a cycle is also a cycle in . If no such cycle exists in , then has some component with diameter larger than . For large components, we take each surviving BFS that reached some vertex at a certain distance , and from we run a new partial BFS in the original graph . These BFSs are again prioritized, this time according to the distance . Our main tool here is proving a claim that says that with high probability, if there is a shortest path in of length between two vertices, then there is a cycle in between them of length at most . This allows our BFSs on to find such a cycle.

We start with the following combinatorial lemma that shows the above claim.

###### Lemma 6.1.

Given a graph , let be obtained by deleting each edge in with probability , independently of other edges. Then, with probability at least , every vertex that has a vertex at a distance , has a closed path passing through it in , that contains a simple cycle, of length at most .

###### Proof.

First, we show that for every pair of vertices in that are at a distance of , one of the shortest paths between and is removed in the graph with high probability. For a pair of vertices and at a distance in , the probability that a shortest path is not removed is , which is at most . Therefore, by a union bound over all pairs of vertices, with probability at least , at least one edge is removed from at least one shortest path between every pair of vertices that are at a distance of . Conditioned on this, we prove the lemma.

Now, suppose that and are two vertices in at a distance of . Let be this shortest path in . Suppose is the shortest path between and in . If , then this path is no longer present in (and thus distinct from ) and is a closed path in , passing through that has a simple cycle of length at most . If , then there are at least two shortest paths between and in of length , the one in and one that was removed, which we choose for . Therefore, is a closed path passing through of length at most , and hence contains a simple cycle of length at most in it. ∎

Next, we prove that indeed there is a high probability that contains a cycle if is far from being cycle-free.

###### Claim 6.2.

If is -far from being cycle-free, then with probability at least , is -far from being cycle-free.

###### Proof.

The graph is obtained from by deleting each edge with probability independently of other edges. The expected number of edges that are deleted is . Therefore, by the Chernoff Bound from Fact 2.4, the probability that at least edges are deleted is at most , and the claim follows. ∎

We now describe a multiple-BFS algorithm that takes as input a length and a priority condition over vertices, and starts performing a BFS from each vertex of the graph. This is done for steps, in each of which a vertex keeps only the BFS with the highest priority while dropping the rest. Each vertex also maintains a list of BFSs that have passed through it. The list is a list of -tuples , where is the id of the root of the BFS, is the depth of in this BFS tree and is the id of the parent of in the BFS tree. Initially, each vertex sets to include a BFS starting from itself, and then continues this BFS by sending the tuple to all its neighbors, where is the identifier of the vertex . In an intermediate step, each vertex may receive a BFS tuple from each of its neighbors. The vertex then adds these BFS tuples to the list and chooses one among according to the priority condition , proceeding with the respective BFS and discontinuing the rest. Even when a BFS is discontinued, the information that the BFS reached is stored in the list .

Algorithm 6.2 gives a formal description of the breadth-first search that we use in the testing algorithm for cycle-freeness.

{algorithm}

[htbp] BFS with a priority condition \KwInLength , Priority condition \KwVars list of BFS tuples passing through \Simulvertex Initialize to .
Send to all neighbors of . \Perform times \Simulvertex \If receives from its neighbors Add to .

Select from according to over

Send to all neighbors of except .
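The round structure of this procedure can be sketched as a synchronous simulation. Several details below are assumptions made for the example: tuples are stored as `(root id, depth, parent vertex)`, the priority condition is passed as a function `better(a, b)` on root ids, and a cycle witness is taken to be two tuples at one vertex with the same root but different parents. The names `prioritized_bfs` and `sees_two_paths` are hypothetical.

```python
def prioritized_bfs(adj, ids, length, better):
    """Simulate `length` rounds; returns the list of BFS tuples kept at each vertex."""
    lists = {v: {(ids[v], 0, None)} for v in adj}
    current = {v: (ids[v], 0, None) for v in adj}  # the BFS each vertex forwards next
    for _ in range(length):
        inbox = {v: [] for v in adj}
        for v, (root, depth, parent) in current.items():
            for w in adj[v]:
                if w != parent:  # never send a BFS back to where it came from
                    inbox[w].append((root, depth + 1, v))
        current = {}
        for v, received in inbox.items():
            if not received:
                continue
            lists[v].update(received)  # discontinued BFSs are still recorded
            best = received[0]
            for candidate in received[1:]:
                if better(candidate[0], best[0]):
                    best = candidate
            current[v] = best  # only the highest-priority BFS continues
    return lists


def sees_two_paths(tuples):
    """Assumed witness rule: one root reached this vertex through two different parents."""
    parents_by_root = {}
    for root, _depth, parent in tuples:
        parents_by_root.setdefault(root, set()).add(parent)
    return any(len(parents) > 1 for parents in parents_by_root.values())
```

With `better = lambda a, b: a < b` the lowest root id wins. On a 4-cycle the BFS rooted at the smallest id reaches the opposite vertex from both sides after two rounds, while on a path no vertex ever sees the same root through two parents.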

We now give more informal details of the test for cycle-freeness. By Lemma 6.1, we know that if there is a vertex in that has a vertex at a distance of , then there is a closed path in starting from that contains a cycle of length . In the first part, each vertex gets its name as its vertex id, and performs a BFS on the graph in the hope of finding a cycle. The BFS is performed using Algorithm 6.2, where the priority condition in the intermediate steps is selecting the BFS with the lowest origin id. If the cycle is present in a component of diameter at most in , then it is discovered during this BFS. To check if there is a cycle, one needs to find if there are appropriate tuples and in , for some vertex .

If no cycle is discovered in this step, then we change the ids of the vertices in the following way: The id of each vertex is now a tuple where is the largest depth at which occurs in a BFS tree among all the breadth-first searches that reached . We perform a BFS in using Algorithm 6.2, where the priority condition is to pick the BFS whose root has the lexicographically highest id. If there is some vertex with , then the highest priority vertex is such a vertex, and by Lemma 6.1, the BFS starting from that vertex will detect a cycle in .

Algorithm 6.2 gives a formal description of the tester for cycle-freeness.

{algorithm}

[htbp] Cycle-freeness test \KwVars list of BFS tuples passing through , vertex identifier \tcpyConstruct by deleting edges with probability . \Simulvertex For each neighbor , mark the edge with probability for deletion.

Send each marked edge to its corresponding .

Set . \Simulvertex Delete all edges incident on that have been marked for deletion. \tcpySearch for cycles in small diameter components. \UseAlgorithm 6.2 perform BFS on for steps, with the priority condition being choosing the BFS with the lowest root id. \Simulvertex If contains two tuples and , output reject.

Set where is the highest among all tuples in . \UseAlgorithm 6.2 perform BFS on for steps, with the priority condition being choosing the BFS with the lexicographically highest root id. \Simulvertex If contains two tuples and , output reject. \Simulvertex output accept, if did not output reject  yet.

We now prove the correctness of the algorithm.

###### Theorem 6.3.

Algorithm 6.2 is a distributed -test in the general graph model for the property of being cycle-free, that requires rounds.

###### Proof.

Notice that a vertex in Algorithm 6.2 outputs reject only when it detects a cycle. Therefore, if $G$ is cycle-free, then every vertex outputs accept with probability $1$.

Suppose that is -far from being cycle-free. Notice that, with probability at least , the assertion of Lemma 6.1 holds. Furthermore, from Claim 6.2, we know that is -far from being cycle-free, with probability , and hence contains at least one cycle. This cycle could be in a component of diameter less than or it could be in a component of diameter at least in . We analyse the two cases separately.

Suppose there is a cycle in a component of of diameter at most . Let be the vertex with the smallest id in . In Algorithm 6.2, the BFS starting at is always propagated at any intermediate vertex due to the priority condition. Furthermore, since the diameter of is at most , this BFS reaches all vertices of . Hence, this BFS detects the cycle and at least one vertex in outputs reject.

On the other hand, if the cycle is present in a component in of diameter at least , then after Step 6.2 of the algorithm, each vertex gets the length of the longest path from the origin, among all the BFSs that reached , as the first component of its id. The vertex that gets the lexicographically highest id in the component has a vertex that is at least away in , since the radius of the component is at least half the diameter. Therefore, by Lemma 6.1, it is part of a cycle of length at most in . Hence, the vertex with the highest priority in the BFS on is a vertex that has a vertex at a distance of at least in , and there is a walk through that contains a simple cycle of length at most . At least one vertex on this simple cycle will output reject when Algorithm 6.2 is run on .

The number of rounds is since Algorithm 6.2 performs two breadth-first searches in the graph with this number of rounds.∎

## 7 Lower bounds for testing bipartiteness and cycle-freeness

In this section, we prove that any distributed algorithm for -testing bipartiteness or cycle-freeness in bounded-degree graphs requires rounds of communication. We construct bounded-degree graphs that are -far from being bipartite, such that all cycles are of length . We argue that any distributed algorithm that runs in rounds does not detect a witness for non-bipartiteness. We also show that the same construction proves that every distributed algorithm for -testing cycle-freeness requires rounds of communication. Formally, we prove the following theorem.

###### Theorem 7.1.

Any distributed -test for the property of being bipartite requires rounds of communication.

To prove Theorem 7.1, we show the existence of a graph that is far from being bipartite, but all of its cycles are at least of logarithmic length. Since in rounds of a distributed algorithm, the output of every vertex cannot depend on vertices that are at distance greater than from it, no vertex can detect a cycle in in less than rounds, which proves Theorem 7.1. To prove the existence of we use the probabilistic method with alterations, and prove the following.

###### Lemma 7.2.

Let be a random graph on vertices where each edge is present with probability . Let be obtained by removing all edges incident with vertices of degree greater than , and one edge from each cycle of length at most . Then with probability at least , is -far from being bipartite.
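The lower-bound argument relies on the constructed graph having no short cycles, and for small instances this can be checked directly by computing the girth. The helper below is a generic sketch and not part of the construction: it runs a BFS from every vertex of a simple undirected graph and records the shortest cycle closed by a non-tree edge. The name `girth` and the adjacency-dictionary input are assumptions.

```python
from collections import deque


def girth(adj):
    """Length of a shortest cycle in a simple undirected graph, or None for a forest."""
    best = None
    for s in adj:
        dist = {s: 0}
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    parent[w] = u
                    queue.append(w)
                elif parent[u] != w:
                    # A non-tree edge closes a cycle of length at most this value.
                    length = dist[u] + dist[w] + 1
                    if best is None or length < best:
                        best = length
    return best
```

Taking the minimum over all BFS roots makes the result exact, since a root lying on a shortest cycle attains its length.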

Since a graph that is -far from being bipartite is also -far from being cycle-free, we immediately obtain the same lower bound for testing cycle-freeness, as follows.

###### Theorem 7.3.

Any distributed -test for the property of being cycle-free requires rounds of communication.

The rest of this section is devoted to proving Lemma 7.2. We need to show three properties of : (a) that it is far from being bipartite, (b) that it does not have small cycles, and (c) that its maximum degree is bounded. We begin with the following definition, which is similar in spirit to being far from satisfying a property and which will assist us in our proof.

###### Definition 7.4.

A graph is -removed from being bipartite if at least edges have to be removed from to make it bipartite.

Note that a graph with maximum degree , is -far from being bipartite if it is -removed from being bipartite.

Let be a random graph on vertices where for each pair of vertices, an edge is present with probability . The expected number of edges in the graph is