We study which property testing and sublinear time algorithms can be transformed into graph streaming algorithms for random order streams. Our main result is that for bounded degree graphs, any property that is constant-query testable in the adjacency list model can be tested with constant space in a single pass over a random order stream. Our result is obtained by estimating the distribution of local neighborhoods of the vertices on a random order graph stream using constant space.
We then show that our approach can also be applied to constant-time approximation algorithms for bounded degree graphs in the adjacency list model. As an example, we obtain a constant-space single-pass random order streaming algorithm for approximating the size of a maximum matching with additive error $\epsilon n$ ($n$ is the number of nodes).
Our result establishes for the first time that a large class of sublinear algorithms can be simulated in random order streams, while $\Omega(n)$ space is needed for many graph streaming problems for adversarial orders.
Very large and complex networks abound. Prominent examples include gene regulatory networks, health/disease networks, and online social networks like Facebook, Google+, LinkedIn and Twitter. The interconnectivity of neurons in the human brain, relations in database systems, and chip designs are further examples. Some of these networks are so large that it may be hard to store them completely in main memory, and some may be too large to be stored at all. However, these networks contain valuable information that we want to reveal. For example, social networks can provide insights into the structure of our society, and the structure in gene regulatory networks might yield insights into diseases. Thus, we need algorithms that can analyze the structure of these networks quickly.
One way to approach this problem is to design graph streaming algorithms [HRR98, AMS96]. A graph streaming algorithm gets access to a stream of edges in some order and exactly or approximately solves problems on the graph defined by the stream. The challenge is that a graph streaming algorithm should use space sublinear in the size of the graph. We will focus on algorithms that make only one pass over the graph stream, unless we explicitly say otherwise. It has been shown that many natural graph problems require $\Omega(n)$ space in the adversarial order model, where $n$ is the number of nodes in the graph and the edges can arrive in arbitrary order (see, e.g., [FKM05, FKM08]), and thus most previous work has focused on the semi-streaming model, in which the algorithms are allowed to use $O(n \cdot \mathrm{poly}\log n)$ space. However, in many interesting applications the graphs are sparse, so they can be fully stored in the semi-streaming model, which makes this model useless in this setting. This raises the question whether there are at least some natural conditions under which one can solve graph problems with space $o(n)$, possibly even $\mathrm{poly}\log n$ or constant.
One such condition that has recently received increasing attention is that the edges arrive in random order, i.e., in the order of a uniformly random permutation of the edges (e.g., [CCM08, KMM12, KKS14]). Uniformly random or near-uniformly random ordering is a natural assumption and can arise in many contexts. Indeed, previous work has shown that some problems that are hard for adversarial streams can be solved in the random order model. Konrad et al. [KMM12] gave single-pass semi-streaming algorithms for maximum matching for bipartite and general graphs with approximation ratio strictly larger than $1/2$ in the random order semi-streaming model, while no such approximation algorithm is known in the adversarial order model. Kapralov et al. [KKS14] gave a polylogarithmic approximation algorithm in polylogarithmic space for estimating the size of maximum matching of an unweighted graph in one pass over a random order stream. Assadi et al. [AKL17] recently showed that in the adversarial order and dynamic model where edges can be both inserted and deleted, any polylogarithmic approximation algorithm of maximum matching size requires space. On the other hand, Chakrabarti et al. [CCM08] presented an $\Omega(n)$ space lower bound for any single-pass algorithm for graph connectivity in the random order streaming model, which is very close to the optimal space lower bound in the adversarial order model [SW15]. In general, it is unclear which graph problems can be solved in random order streams using much smaller space than what is required for adversarially ordered streams.
An independent area of research is property testing, where with certain query access to an object (e.g., random vertices or neighbors of a vertex for graphs), there are algorithms that can determine if the object satisfies a certain property or is far from having such a property [RS96, GGR98, GR02]. The area of property testing has seen fundamental results, including testing various general graph properties. For example, it has been shown that many interesting properties (including connectivity, planarity, minor-freeness, hyperfiniteness) of bounded degree graphs can be tested with a constant number of queries [GR02, BSS10, NS13]. Another closely related area of research is that of constant-time (or, in general, sublinear-time) approximation algorithms, where we are given query access to an object (for example a graph) and the goal is to approximate the objective value of an optimal solution. For example, in bounded degree graphs, one can approximate the cost of the optimal solution with constant query complexity for some fundamental optimization problems (e.g., minimum spanning tree weight [CRT05], maximal matching size [NO08]; see also Section 1.3).
A fundamental question is if such results from property testing and constant-time approximation algorithms will lead to better graph streaming algorithms. Huang and Peng [HP16] recently considered the problem of estimating the minimum spanning tree weight and property testing for general graphs in dynamic and adversarial order model. They showed that a number of properties (e.g., connectivity, cycle-freeness) of general -vertex graphs can be tested with space complexity and one can -approximate the weight of minimum spanning tree with similar space guarantee. Furthermore, there exist space lower bounds for these problems that hold even in the insertion-only model [HP16].
1.1 Overview of Results
In this paper we provide a general framework that transforms bounded-degree graph property testing to very space-efficient random order streaming algorithms.
To formally state our main result, we first review some basic definitions of graph property testing. A graph property is a property that is invariant under graph isomorphism. Let $G=(V,E)$ be a graph with maximum degree upper bounded by a constant $d$; we also call $G$ a $d$-bounded graph. In the adjacency list model for (bounded-degree) graph property testing, we are given query access to the adjacency list of the input $d$-bounded graph $G$. That is, for any vertex $v$ and index $i \le d$, one can query the $i$th neighbor (if it exists) of vertex $v$ in constant time. Given a property $\Pi$, we are interested in testing if a graph $G$ satisfies $\Pi$ or is $\epsilon$-far from satisfying $\Pi$ while making as few queries as possible, where $G$ is said to be $\epsilon$-far from satisfying $\Pi$ if one has to insert/delete more than $\epsilon d n$ edges to make it satisfy $\Pi$. We call a property constant-query testable if there exists a testing algorithm (also called a tester) for this property such that the number of performed queries depends only on the parameters $\epsilon, d$ and is independent of the size of the input graph.
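To make the query model concrete, the following minimal Python sketch wraps a bounded-degree graph in an oracle that answers neighbor queries and counts them. The class name `AdjacencyListOracle`, the 0-based index convention, and the dictionary representation are illustrative assumptions and not part of the model's definition.

```python
class AdjacencyListOracle:
    """Query access to a graph of maximum degree at most d (hypothetical helper)."""

    def __init__(self, adj, d):
        # adj maps each vertex to the list of its neighbors.
        if any(len(nbrs) > d for nbrs in adj.values()):
            raise ValueError("graph is not d-bounded")
        self.adj = adj
        self.d = d
        self.queries = 0  # number of neighbor queries made so far

    def neighbor(self, v, i):
        """Return the i-th neighbor of v (0-based), or None if it does not exist."""
        self.queries += 1
        nbrs = self.adj.get(v, [])
        return nbrs[i] if 0 <= i < len(nbrs) else None


# A path 0 - 1 - 2, which is 2-bounded.
oracle = AdjacencyListOracle({0: [1], 1: [0, 2], 2: [1]}, d=2)
first = oracle.neighbor(1, 0)    # first neighbor of vertex 1
missing = oracle.neighbor(0, 1)  # vertex 0 has only one neighbor
```

A tester in this model is charged one unit per call to `neighbor`, so `oracle.queries` is the quantity that a constant-query tester keeps independent of the graph size.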
Given a graph property $\Pi$, we are interested in approximately testing it in a single-pass stream with a goal similar to the above. That is, the algorithm uses little space and, with high constant probability, accepts the input graph $G$ if it satisfies $\Pi$ and rejects $G$ if it is $\epsilon$-far from satisfying $\Pi$ (see Section 4 for formal definitions). Our main result is as follows.
Theorem 1.1. Any $d$-bounded graph property that is constant-query testable in the adjacency list model can be tested in the uniformly random order streaming model with constant space.
To the best of our knowledge, this is the first non-trivial graph streaming algorithm with constant space complexity (measured in the number of words, where a word is a space unit large enough to encode an ID of any vertex in the graph). By the constructions in [HP16], there exist graph properties (e.g., connectivity and cycle-freeness) of $d$-bounded graphs such that any single-pass streaming algorithm in the insertion-only and adversarial order model must use space. In contrast to this lower bound, our main result implies that $d$-bounded connectivity and cycle-freeness can be tested in constant space in the random order stream model, since they are constant-query testable in the adjacency list model [GR02].
Our approach also works for simulating constant-time approximation algorithms as graph streaming algorithms with constant space. For a minimization (resp., maximization) optimization problem and an instance , we let denote the value of some optimal solution of . We call a value an -approximation for the problem , if for any instance , it holds that (resp., ). For example, it is known that there exists a constant-query algorithm for -approximating the maximal matching size of any -vertex -bounded graph [NO08]. That is, the number of queries made by the algorithm is independent of and only depends on . As an application, we show:
Let and be constants. Then there exists an algorithm that uses constant space in the random order model, and with probability , -approximates the size of some maximal matching in -bounded graphs.
We also remark that in a similar way, many other sublinear time algorithms for bounded degree graphs can be simulated in random order streams. Finally, our results can actually be extended to a model which requires weaker assumptions on the randomness of the order of edges in the stream, but we describe our results for the uniformly random order model, and leave the remaining details for later.
1.2 Technical Overview
The local neighborhood of depth $k$ of a vertex $v$ is the subgraph rooted at $v$ and induced by all vertices of distance at most $k$ from $v$. We call such a rooted subgraph a $k$-disc. Suppose that we are given a sufficiently large graph $G$ whose maximum degree is constant. This means that for any constant $k$, a $k$-disc centered at an arbitrary vertex $v$ in $G$ has constant size. Now assume that there exists an algorithm $\mathcal{A}$ that, independent of the labeling of the vertices of $G$, accesses $G$ by querying random vertices and exploring their $k$-discs. We observe that any constant-query property tester (see for example [GR11, CPS16]) falls within the framework of such an algorithm. If instead of the graph $G$ we are given the distribution of $k$-discs of the vertices of $G$, we can use this distribution to simulate the algorithm $\mathcal{A}$ and output with high probability the same result as executing $\mathcal{A}$ on $G$ itself. Thus, the problem of developing constant-query property testers in random order streams can be reduced to the problem of designing streaming algorithms that approximate the distribution of $k$-discs in $G$.
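As an illustration of the disc notion described above, the following Python sketch computes the disc of a given depth around a vertex when the whole graph is available in memory, using a breadth-first search that is cut off at that depth. The function name `k_disc` and the dictionary-based graph representation are assumptions made for this example only.

```python
from collections import deque


def k_disc(adj, root, k):
    """Return (vertices, edges) of the subgraph induced by all vertices
    within distance at most k from root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        if dist[x] == k:
            continue  # do not explore beyond depth k
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    vertices = set(dist)
    # Induced subgraph: keep every edge with both endpoints inside the disc.
    edges = {frozenset((u, w)) for u in vertices for w in adj[u] if w in vertices}
    return vertices, edges


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
vertices, edges = k_disc(adj, root=0, k=1)
```

In this example the disc of depth 1 around vertex 0 contains the edge between vertices 1 and 2 even though neither endpoint is the root, because the disc is an induced subgraph. When every vertex has at most d neighbors, the disc has at most 1 + d + ... + d^k vertices, which is a constant for constant d and k.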
The main technical contribution of this paper is an algorithm that given a random order stream of edges of an underlying -bounded degree graph , approximates the distribution of -discs of up to an additive error of . We would like to mention that if the edges arrive in adversarial order, any algorithm that approximates the distribution of -discs of requires almost linear space [VY11, HP16], hence the assumption of random order streams (or something similar) is necessary to obtain our result.
Now in order to approximate the distribution of -discs of the graph we do the following. We proceed by sampling vertices uniformly at random and then perform a BFS for each sampled vertex using the arrival of edges along the stream . Note that the new edges of the stream that do not connect to the currently explored vertices are discarded. Let us call the -disc that is observed by doing such a BFS from some vertex to be . Due to possibility of missing edges during the BFS, this subgraph may be different from the true -disc rooted at .
If we are allowed to use two passes of the stream, then one can collect the -disc of in the first pass, and then verify if the collected disc is the true -disc of in the second pass (see Section 3.1). However, if we are restricted to a single pass, then it is more challenging to detect or verify if some edges have been missed in a collected disc. Fortunately, since the edges arrive in a uniformly random order, we can infer the conditional probability . That is, given the true rooted subgraph , we can compute the conditional probability of seeing a rooted subgraph in a random order stream when the true -disc is .
We define the partial order on the set of -discs given by whenever is a root-preserving isomorphic subgraph of . For every two -discs and with we compute the conditional probability . Using the set of all conditional probabilities we can estimate or approximate the distribution of -discs of the graph whose edges are revealed according to the stream . In order to simplify the analysis of our algorithm, we require a natural independence condition for non-intersecting -discs. Finally, we use the approximated distribution of -discs to simulate the algorithm by the machinery that we explained above.
We remark that the idea of using a partial order to compute a distribution of $k$-discs in bounded degree graphs was first used in [CPS16]. However, the setting in [CPS16] was quite different, as it dealt with directed graphs where an edge can only be seen from one side (and the sample sizes required in that paper were only slightly sublinear in $n$).
1.3 Other Related Work
Feigenbaum et al. [FKSV02] initiated the study of property testing in streaming model, and they gave efficient testers for some properties of a sequence of data items (rather than graphs as we consider here). Bury and Schwiegelshohn [BS15] gave a lower bound of on the space complexity of any algorithm that -approximates the size of maximum matching in adversarial streams. Kapralov et al. [KKS15] showed that in random streams, space is necessary to distinguish if a graph is bipartite or -far from being bipartite. Previous work has extensively studied streaming graph algorithms in both the insertion-only and dynamic models, see the recent survey [McG14].
In the framework of -bounded graph property testing, it is now known that many interesting properties are constant-query testable in the adjacency list model, including -edge connectivity, cycle-freeness, subgraph-freeness [GR02], -vertex connectivity [YI08], minor-freeness [HKNO09, BSS10], matroids related properties [ITY12, TY15], hyperfinite properties [NS13], subdivision-freeness [KY13]. Constant-time approximation algorithms in -bounded graphs are known to exist for a number of fundamental optimization problems, including -approximating the weight of minimum spanning tree [CRT05], -approximating the size of maximal/maximum matching [NO08, YYI12], -approximating the minimum vertex cover size [PR07, MR09, ORRR12], -approximating the minimum dominating set size [PR07, NO08]. For -bounded minor-free graphs, there are constant-time -approximation algorithms for the size of minimum vertex cover, minimum dominating set and maximum independent set [HKNO09].
Let $G=(V,E)$ be an $n$-vertex graph with maximum degree upper bounded by some constant $d$, where we often identify $V$ as $[n] := \{1, \ldots, n\}$. We also call such a graph a $d$-bounded graph. In this paper, we will assume the algorithms have the knowledge of $n$ and $d$. We assume that $G$ is represented as a sequence of edges, which we denote as Stream.
Let $k \ge 0$. The $k$-disc around a vertex $v$ is the subgraph rooted at vertex $v$ and induced by the vertices within distance at most $k$ from $v$. Note that for an $n$-vertex graph, there are exactly $n$ $k$-discs. Let $\mathcal{H}_k = \{\Delta_1, \ldots, \Delta_N\}$ be the set of all $k$-disc isomorphism types, where $N$ is the number of all such types (and is thus a constant). In the following, we will refer to a $k$-disc of some vertex $v$ in the graph as $\mathrm{disc}_k(v)$ and a $k$-disc type as $\Delta$. Note that for every vertex $v$, there exists a unique $k$-disc type $\Delta \in \mathcal{H}_k$ such that $\mathrm{disc}_k(v)$ is isomorphic to $\Delta$, denoted as $\mathrm{disc}_k(v) \cong \Delta$. (Throughout the paper, we call two rooted graphs $H_1, H_2$ isomorphic to each other if there is a root-preserving isomorphism from the vertex set of $H_1$ to the vertex set of $H_2$.)
We further assume that all the elements in are ordered according to the natural partial order among -disc types. More specifically, for any two -disc types , we let (or equivalently, ) denote that is root-preserving isomorphic to some subgraph of . Then we order all the -disc types such that if , then . Let denote all the indices , except itself, such that .
Locally random order streams.
Let denote the set of all permutations (or orderings) over the edge set . Note that each determines the order of edges arriving from the stream. Let denote a probability distribution over . In particular, we let denote the uniform distribution over . Given a stream of edges, we define the observed -disc of from the stream, denoted as , to be the subgraph rooted at and induced by all edges that are sequentially collected from the stream and the endpoints of which are within distance at most to . This is formally defined in the following algorithm Stream_-disc.
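The algorithm listing referred to above is not reproduced in this text, so the following Python sketch is only one simplified reading of the definition of an observed disc, not the authors' pseudocode. It assumes a simple undirected graph, keeps an arriving edge when it extends the currently explored region without exceeding depth `k` or when it joins two already explored vertices, and recomputes distances after every kept edge. The names `distances` and `observed_disc` are hypothetical.

```python
from collections import deque


def distances(root, edges, k):
    """BFS distances from root over the given edges, truncated at depth k."""
    adj = {}
    for e in edges:
        u, w = tuple(e)
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    dist, queue = {root: 0}, deque([root])
    while queue:
        x = queue.popleft()
        if dist[x] < k:
            for y in adj.get(x, ()):
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
    return dist


def observed_disc(root, stream, k):
    """Edges collected in one pass; edges that arrive too early are lost."""
    kept, dist = set(), {root: 0}
    for u, w in stream:
        du, dw = dist.get(u), dist.get(w)
        extends = (du is not None and du < k) or (dw is not None and dw < k)
        closes = du is not None and dw is not None
        if extends or closes:
            kept.add(frozenset((u, w)))
            dist = distances(root, kept, k)
    return kept


# Path 0 - 1 - 2 with root 0 and depth 2, under two different arrival orders.
in_order = observed_disc(0, [(0, 1), (1, 2)], k=2)
reversed_order = observed_disc(0, [(1, 2), (0, 1)], k=2)
```

In the second order, the edge between vertices 1 and 2 arrives before vertex 1 has been reached from the root, so it is discarded and the observed disc is a proper subgraph of the true one. Since only edges inside the disc are stored, the memory used is bounded by the disc size.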
Now we formally define a locally random distribution on the order of edges.
Let . Let be a -bounded graph. Let be a distribution over all the orderings of edges in . Let be a set of real numbers in . We call a locally random -distribution over with respect to -disc types, if for sampled from , the following conditions are satisfied:
(Conditional probabilities) For any vertex with -disc isomorphic to , the probability that its observed -disc is , for any such that .
(Independence of disjoint -discs) For any two disjoint -discs and , their observed -discs and are independent.
Note that the set cannot be an arbitrary set, as there might be no distribution satisfying the above condition. On the other hand, if there indeed exists a distribution satisfying the condition with numbers in , then we call the set realizable. In the following, we call a stream a locally random order stream if there exists a family of realizable sets , such that the edge order is sampled from some locally random -distribution with respect to -disc types, for any integer . We have the following lemma.
Let . For any , there exists , such that for , any -bounded -vertex graph , the uniform permutation over is a locally random -distribution over with respect to -disc types, for some realizable . Furthermore, if we let , , then , .
Note that for any vertex with , the probability that the observed -disc of is isomorphic to is exactly the fraction of orderings such that , where . We use such a fraction, which is a fixed real number, to define . Observe that for an ordering sampled from , it directly satisfies the second condition Item 2 in Definition 2.1. Since there are at most edges in any -disc, the probability of observing a full -disc is at least , that is, . Furthermore, since the -disc type might contain at most different subgraphs that are isomorphic to , it holds that for any such that . This completes the proof of the lemma. ∎
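The proof identifies each conditional probability with a fraction of edge orderings. For very small discs this fraction can be computed by brute force, as in the Python sketch below. It enumerates all orderings of the edges of a given disc and records which labelled edge set is observed; grouping by isomorphism type is omitted for brevity. The collection rule in `observed_disc` is a simplified assumption (see the comments), and all function names are hypothetical.

```python
from collections import Counter, deque
from fractions import Fraction
from itertools import permutations
from math import factorial


def distances(root, edges, k):
    """BFS distances from root over the given edges, truncated at depth k."""
    adj = {}
    for e in edges:
        u, w = tuple(e)
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    dist, queue = {root: 0}, deque([root])
    while queue:
        x = queue.popleft()
        if dist[x] < k:
            for y in adj.get(x, ()):
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
    return dist


def observed_disc(root, stream, k):
    """Assumed rule: keep an edge if it extends or closes the explored region."""
    kept, dist = set(), {root: 0}
    for u, w in stream:
        du, dw = dist.get(u), dist.get(w)
        if (du is not None and du < k) or (dw is not None and dw < k) \
                or (du is not None and dw is not None):
            kept.add(frozenset((u, w)))
            dist = distances(root, kept, k)
    return kept


def observation_distribution(root, disc_edges, k):
    """Exact probability of each observed edge set under a uniform edge order."""
    counts = Counter(
        frozenset(observed_disc(root, order, k)) for order in permutations(disc_edges)
    )
    total = sum(counts.values())
    return {obs: Fraction(c, total) for obs, c in counts.items()}


path = [(0, 1), (1, 2)]
dist_path = observation_distribution(0, path, k=2)
full_path = frozenset(frozenset(e) for e in path)
star = [(0, 1), (0, 2)]
dist_star = observation_distribution(0, star, k=1)
```

For the path, exactly one of the two orderings reveals the whole disc, while for the star every ordering does. In general at least one ordering (a breadth-first one) reveals the full disc, so with m edges in the disc the probability of observing it completely is at least 1/m!.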
The above lemma shows that the uniformly random order stream is a special case of a locally random order stream. Another natural class of locally random order streams is given by $\ell$-wise independent permutations of edges for any $\ell = \omega(1)$ (i.e., any function that tends to infinity as $n$ goes to infinity) for $n$-vertex bounded degree graphs, but for our qualitative purposes here, it suffices to consider uniformly random order streams.
3 Approximating the $k$-Disc Type Distribution
In this section, we show how to approximate the distribution of $k$-disc types of any $d$-bounded graph in locally random order streams.
Recall that for any , we let be the constant denoting the number of all possible -disc isomorphism types. For any , let be the set of vertices from with -disc isomorphic to in the input graph , that is, . Note that is the fraction of vertices with -disc isomorphic to .
3.1 A Two-Pass Algorithm
We start with a discussion of a two-pass algorithm for approximating the distribution of $k$-disc types. The main idea is that in the first pass we can collect or observe the $k$-disc from any vertex $v$, and then in the second pass, we check if the observed $k$-disc is the true $k$-disc of $v$ or not. We can then use the statistics of the observed true $k$-discs to estimate the distribution of $k$-disc types.
Slightly more formally, we first sample a large constant number of vertices and let denote the set of sampled vertices. Then in the first pass, for each vertex , we invoke the algorithm Stream_-disc to collect the observed -disc of , denoted as , from the stream. In the second pass, for each vertex , we collect all the incident edges to . Then we let denote the subgraph spanned by all edges (collected in the second pass) incident to vertices within distance at most to . We check if is isomorphic to . It is not hard to see that the true -disc of is observed if and only if is isomorphic to . For each -disc type , we could then use the fraction of vertices in such that the true -disc of is observed and is isomorphic to , to define an estimator for . One should note that the naive estimator needs to be normalized appropriately by some probabilities and that there are dependencies between different variables, if one samples more than one starting vertex. Similar technical challenges also appear in our single-pass algorithm, for which we give detailed analysis in the following section. We omit further discussion on the two-pass algorithm here.
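The two-pass idea can be sketched in Python as follows. This is a variant for illustration, not the isomorphism test described above: instead of rebuilding a subgraph in the second pass and comparing it with the observed disc, the hypothetical function `no_edge_missed` rescans the stream and reports whether any edge that the first pass would have needed is absent from the collected set. The collection rule in `observed_disc` is likewise a simplified assumption.

```python
from collections import deque


def distances(root, edges, k):
    """BFS distances from root over the given edges, truncated at depth k."""
    adj = {}
    for e in edges:
        u, w = tuple(e)
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    dist, queue = {root: 0}, deque([root])
    while queue:
        x = queue.popleft()
        if dist[x] < k:
            for y in adj.get(x, ()):
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
    return dist


def belongs(dist, u, w, k):
    """True if edge (u, w) extends or closes the region described by dist."""
    du, dw = dist.get(u), dist.get(w)
    extends = (du is not None and du < k) or (dw is not None and dw < k)
    return extends or (du is not None and dw is not None)


def observed_disc(root, stream, k):
    """First pass: collect the observed disc of root."""
    kept, dist = set(), {root: 0}
    for u, w in stream:
        if belongs(dist, u, w, k):
            kept.add(frozenset((u, w)))
            dist = distances(root, kept, k)
    return kept


def no_edge_missed(root, kept, stream, k):
    """Second pass: True if the collected edges already form the true disc."""
    dist = distances(root, kept, k)
    return all(
        frozenset((u, w)) in kept or not belongs(dist, u, w, k) for u, w in stream
    )


good = [(0, 1), (1, 2), (2, 3)]
bad = [(1, 2), (2, 3), (0, 1)]
good_ok = no_edge_missed(0, observed_disc(0, good, 2), good, 2)
bad_ok = no_edge_missed(0, observed_disc(0, bad, 2), bad, 2)
```

On the path 0 - 1 - 2 - 3 with root 0 and depth 2, the first order collects the true disc and the check succeeds, while in the second order the edge between vertices 1 and 2 is lost in the first pass and detected in the second.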
3.2 A Single-Pass Algorithm
In the following, we present our single-pass algorithm for approximating the distribution of $k$-disc types. We have the following lemma.
Let be a -bounded graph presented in a locally random order stream defined by a -distribution over with respect to -disc types, for some integer . Let , . Then for any constant , there exists a single-pass streaming algorithm that uses space, and with probability , for any , approximates the fraction of vertices with -disc isomorphic to in with additive error .
Our algorithm is as follows. We first sample a constant number of vertices, which are called centers. Then for each center $v$, we collect the observed $k$-disc of $v$ from the stream. Then we postprocess all the collected edges and use the corresponding empirical distribution of $k$-disc types of all centers to estimate the distribution of $k$-disc types of the input graph. The formal description is given in Algorithm 2.
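The listing of Algorithm 2 is not reproduced in this text. As a hedged illustration of what the postprocessing step could look like, the Python sketch below recovers type fractions from observed fractions by substitution along the partial order. It assumes that the conditional probabilities are known exactly, that types are indexed so that larger disc types come first, and that the expected observed fraction of a type is the sum, over all true types containing it, of the corresponding conditional probability times the true fraction. The name `invert_observations` and the matrix `lam` are hypothetical.

```python
def invert_observations(y, lam):
    """Recover type fractions x from observed fractions y.

    lam[i][j] is the probability of observing type i when the true type is j.
    Assumption: lam[i][j] == 0 for j > i, i.e. larger types have smaller indices,
    so the system y = lam * x is lower triangular.
    """
    x = []
    for i in range(len(y)):
        # Mass of type i that was produced by strictly larger true types.
        from_larger = sum(lam[i][j] * x[j] for j in range(i))
        x.append((y[i] - from_larger) / lam[i][i])
    return x


# Toy instance with two types: type 0 is seen in full half of the time,
# and otherwise looks like the smaller type 1.
lam = [[0.5, 0.0],
       [0.5, 1.0]]
true_x = [0.4, 0.6]
y = [sum(lam[i][j] * true_x[j] for j in range(2)) for i in range(2)]
estimate = invert_observations(y, lam)
```

With exact expectations as input, the substitution returns the true fractions. With empirical fractions from sampled centers, each division by `lam[i][i]` amplifies the sampling error, which is why a lower bound on the probability of observing a full disc matters for the analysis.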
Note that since there are vertices in and only edges that belong to the -discs of these vertices will be collected by our algorithm, the space complexity of the algorithm is , which is constant.
Now we show the correctness of the algorithm.
We let denote that is the set of vertices sampled uniformly at random from . For any , let be the set of vertices from with -disc isomorphic to in the input graph , that is, . Note that .
Let . By Chernoff bound and our setting of which satisfy that , we have the following claim.
For any , .
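The exact constants of the claim are not recoverable from this text. As a generic illustration of the kind of calculation involved, the Python sketch below uses the standard Hoeffding inequality together with a union bound to choose a number of sampled centers such that, for every disc type simultaneously, the fraction of sampled vertices of that type is within `eta` of the true fraction with probability at least `1 - delta`. The function name, the parameter names, and the assumption of independent uniform sampling with replacement are illustrative and are not the paper's parameter setting.

```python
import math


def sample_size(num_types, eta, delta):
    """Smallest s with 2 * num_types * exp(-2 * s * eta**2) <= delta."""
    return math.ceil(math.log(2 * num_types / delta) / (2 * eta ** 2))


s = sample_size(num_types=10, eta=0.1, delta=0.05)
failure_bound = 2 * 10 * math.exp(-2 * s * 0.1 ** 2)
```

The resulting sample size depends only on the number of types, the accuracy, and the failure probability, and not on the number of vertices, which is what allows a constant number of centers.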
We assume for now that is a fixed set with vertices. We let denote that the edge ordering is sampled from . For any , let be the indicator random variable of the event that the observed -disc of is isomorphic to for . Note that if . Let denote the fraction of vertices in with observed -disc isomorphic to . By definition, it holds that , and furthermore, . Let .
We have the following claim.
For any , it holds that .
We prove the claim by induction. For , it holds that . Assuming that the claim holds for , and we prove it holds for as well. By definition, we have that
where the second to last equation follows from the induction. ∎
We can now bound the variance of as shown in the following claim.
For any , it holds that .
Recall that . Note that for each , by the independence assumption on , the random variable can only correlate with the corresponding variables for vertices that are within distance at most from . The number of such vertices is at most . Let denote the distance between in the graph . Then we have that
where the first inequality follows from the fact that , and that for any two vertices with , are independent.
Then we have that
We next prove that each is concentrated around its expectation with high probability.
For any , it holds that .
We prove the claim by induction. For , it holds that
where the last inequality follows from our choice of and which satisfy that . Now let us consider arbitrary , assuming that the claim holds for any . First, with probability (over the randomness that ) at least , it holds that for all , . This further implies that with probability at least ,
Now note that
where the last inequality follows from our choice of and which satisfy that .
Therefore, with probability (over ) at least , it holds that
Now with probability (over both and ) at least , it holds that
Finally, with probability at least , it holds that for all , . This completes the proof of the lemma. ∎
4 Constant-Space Property Testing
In this section, we show how to transform constant-query property testers in the adjacency list model to constant-space property testers in the random order stream model in a single pass and prove our main result Theorem 1.1. (Our transformation also works in the locally random order model as defined in Definition 2.1, but for simplicity, we only state our result in the uniformly random order model.)
Let $\Pi = (\Pi_n)_{n \in \mathbb{N}}$ be a property of $d$-bounded graphs, where $\Pi_n$ is a property of graphs with $n$ vertices. We say that $\Pi$ is testable with query complexity $q$, if for every $\epsilon$, $d$ and $n$, there exists an algorithm that performs $q = q(n, \epsilon, d)$ queries to the adjacency list of the graph, and with probability at least $2/3$, accepts any $n$-vertex $d$-bounded graph satisfying $\Pi$, and rejects any $n$-vertex $d$-bounded graph that is $\epsilon$-far from satisfying $\Pi$. If $q = q(\epsilon, d)$ is a function independent of $n$, then we call $\Pi$ constant-query testable.
Similarly, we can define constant-space testable properties in graph streams.
Let $\Pi = (\Pi_n)_{n \in \mathbb{N}}$ be a property of $d$-bounded graphs, where $\Pi_n$ is a property of graphs with $n$ vertices. We say that $\Pi$ is testable with space complexity $q$, if for every $\epsilon$, $d$ and $n$, there exists an algorithm that performs a single pass over an edge stream of an $n$-vertex $d$-bounded graph $G$, uses $q = q(n, \epsilon, d)$ space, and with probability at least $2/3$, accepts $G$ if it satisfies $\Pi$, and rejects $G$ if it is $\epsilon$-far from satisfying $\Pi$. If $q = q(\epsilon, d)$ is a function independent of $n$, then we call $\Pi$ constant-space testable.
The proof of Theorem 1.1 is based on the following known fact: every constant-query property tester can be simulated by some canonical tester which only samples a constant number of vertices, and explores the -discs of these vertices, and then makes deterministic decisions based on the explored subgraph. This implies that it suffices to approximate the distribution of -disc types of the input graph to test the corresponding property. Formally, we will use the following lemma relating the constant-time testable properties and their -disc distributions. For any graph , let denote the subgraph spanned by the union of -discs rooted at uniformly sampled vertices from . The following lemma is implied by Lemma 3.2 in [CPS16] (which was built on [GT03] and [GR11]). (The result in [CPS16] is stated for -bounded directed graphs, while it also holds in the undirected case.)
Let be any