Sublinear-Time Algorithms for Counting Star Subgraphs with Applications to Join Selectivity Estimation

We study the problem of estimating the value of sums of the form $S_p \triangleq \sum_i \binom{x_i}{p}$ when one has the ability to sample $x_i \geq 0$ with probability proportional to its magnitude. When $p = 2$, this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when $\{x_i\}$ is the degree sequence of a graph, which corresponds to counting the number of $p$-stars in a graph when one has the ability to sample edges randomly.

Our algorithm for a -multiplicative approximation of has query and time complexities . Here, is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation.

For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds [Gonen, Ron, and Shavitt. SIAM J. Comput., 25 (2011), pp. 1365-1411]. With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where , and , our upper bound is , in contrast to their lower bound when no random edge queries are available.

In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.

1 Introduction

We study the problem of approximately estimating $S_p \triangleq \sum_i \binom{x_i}{p}$ when one has the ability to sample $x_i \geq 0$ with probability proportional to its magnitude. To solve this problem we design sublinear-time algorithms, which compute such an approximation while looking at only a tiny fraction of the input, rather than scanning the entire data set.

We consider two primary motivations for this problem. The first is that in undirected graphs, if $x_i$ is the degree of vertex $i$ then $S_p = \sum_i \binom{x_i}{p}$ counts the number of $p$-stars in the graph. Thus, estimating $S_p$ when one has the ability to sample $x_i$ with probability proportional to its magnitude corresponds to estimating the number of $p$-stars when one has the ability to sample vertices with probability proportional to their degrees (which is equivalent to having the ability to sample edges uniformly). This problem is an instance of the more general subgraph counting problem in which one wishes to estimate the number of occurrences of a subgraph $H$ in a graph $G$. The subgraph counting problem has applications in many different fields, including the study of biological, internet and database systems. For example, detecting and counting subgraphs in protein interaction networks is used to study molecular pathways and cellular processes across species [SIKS06].

The second application of interest is that the problem of estimating $S_2$ corresponds to estimating the selectivity of join and self-join operations in databases when one has the ability to sample rows of the tables uniformly. For example, note that if we set $x_i$ as the number of occurrences of value $i$ in the column being joined, then $\sum_i x_i^2$ is precisely the number of records in the join of the table with itself on that column. When performing a query in a database, a program called a query optimizer is used to determine the most efficient way of performing the database query. In order to make this determination, it is useful for the query optimizer to know basic statistics about the database and about the query being performed. For example, queries that return a very large number of records are usually serviced most efficiently by doing simple linear scans over the data, whereas queries that return a smaller number of records may be better serviced by using an index [HILM09]. As such, being able to estimate the selectivity (number of records returned compared to the maximum possible number) of a query can be useful information for a query optimizer to have. In the more general case of estimating the selectivity of a join between two different tables (which can be modeled with a directed graph), the query optimizer can use this information to decide on the most efficient order in which to execute a sequence of joins, which is a common task.

In the “typical” regime in which we wish to estimate given that , our algorithm has a running time of which is very small compared to the total amount of data. Furthermore, in the case of selectivity estimation, this number can be much less than the number of distinct values in the column being joined on, which results in an even smaller number of queries than would be necessary if one were using an index to compute the selectivity.

We believe that our query-based framework can be realized in many systems. One possible way to implement random edge queries is as follows: because edges normally take most of the space for storing graphs, accessing a random memory location where the adjacency list is stored would readily give a random edge. Random edge queries allow us to implement a source of weighted vertex samples, where a vertex is output with probability proportional to its weight (magnitude). Weighted sampling is used in [MPX07, BBS09] to find sublinear algorithms for approximating the sum of numbers (allowing only uniform sampling results in a linear lower bound). We later use this as a subroutine in our algorithm.

Throughout the rest of the paper, we will mostly use graph terminology when discussing this problem. However, we emphasize that all our results are fully general and apply to the problem of estimating $S_p$ even when one does not assume that the input is a graph.

1.1 Our Contribution

Prior theoretical work on this problem considered only the graph version and assumed the ability to sample vertices uniformly rather than edges. Specifically, prior studies of sublinear-time algorithms for graph problems usually consider the model where the algorithm is allowed to query the adjacency list representation of the graph: it may make neighbor queries (by asking “what is the $i$-th neighbor of a vertex $v$”) and degree queries (by asking “what is the degree of vertex $v$”).

We propose a stronger model of sublinear-time algorithms for graph problems which allows random edge queries. Next, for undirected graphs, we construct an algorithm which uses only degree queries and random edge queries. This algorithm and its analysis are discussed in Section 3. For the problem of computing an approximation satisfying , our algorithm has query and time complexities . Although our algorithm is described in terms of graphs, it also applies to the more general case when one wants to estimate $S_p$ without any assumptions about graph structure. Thus, it also applies to the problem of self-join selectivity estimation.

We then establish some relationships between $m$ and other parameters so that we may compare the performance of this algorithm to the related work of Gonen et al. [GRS11] more directly. We also provide lower bounds for our proposed model in Section 4, which are mostly tight up to polylogarithmic factors. This comparison is given in Table 1. We emphasize that even though these lower bounds are stated for graphs, they also apply to the problem of self-join selectivity estimation.

To understand this table, first note that these algorithms require more samples when is small (i.e., stars are rare). As increases, the complexity of each algorithm decreases until—at some point—the number of required samples drops to . Our algorithm is able to obtain this better complexity of for a larger range of values of than that of the algorithm given in [GRS11]. Specifically, our algorithm is more efficient for , and has the same asymptotic bound for up to . Once , it is unknown whether the degree and random edge queries alone can provide the same query complexity. Nonetheless, if we have access to all three types of queries, we may combine the two algorithms to obtain the best of both cases as illustrated in the last column.

range of permitted types of queries
neighbor, degree degree, random edge all types of queries
([GRS11]) (this paper) (this paper)
Table 1: Summary of the query and time complexities for counting $p$-stars on undirected graphs, given different sets of allowed queries. $\epsilon$ is assumed to be constant. Adjacent cells in the same column with the same contents have been merged.

We also consider a variant of the counting stars problem on directed graphs in Appendix D. If one only needs to count “stars” where all edges are either pointing into or away from the center, this is essentially still the undirected case. We then consider counting directed paths of length two, and discover that allowing random edge queries does not provide an efficient algorithm in this case. In particular, we show that any constant factor multiplicative approximation of requires queries even when all three types of queries are allowed. However, when the ratio between the in-degree and the out-degree on every vertex is bounded, we solve this special case in sublinear time via a reduction to the undirected case where degree queries and random edge queries are allowed.

This variant of the counting stars problem can also be used for approximating join selectivity. For a directed graph, we aim at estimating the quantity . On the other hand in the database context, we wish to compute the quantity , where and denote the number of occurrences of a label in the column we join on, from the first and the second table, respectively. Thus, applying simple changes in variables, the algorithms from Appendix D can be applied to the problem of estimating join selectivity as well.
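As a small illustration of the database-side quantity, the sketch below computes the size of an equi-join from per-label occurrence counts and checks it against a brute-force nested loop. The names `x`, `y`, `left_column`, and `right_column` are assumptions made for this example, not notation taken from Appendix D; in the graph view, a label plays the role of a vertex and its two counts play the roles of its in-degree and out-degree.

```python
from collections import Counter


def join_size_by_counts(left_column, right_column):
    """Join size as a sum over labels of (count in left) * (count in right)."""
    x = Counter(left_column)   # occurrences of each label in the first table
    y = Counter(right_column)  # occurrences of each label in the second table
    return sum(count * y[label] for label, count in x.items())


def join_size_nested_loop(left_column, right_column):
    """Same quantity by brute force: count all matching row pairs."""
    return sum(1 for a in left_column for b in right_column if a == b)


left = ["a", "a", "b", "c", "c", "c"]
right = ["a", "c", "c", "d"]
print(join_size_by_counts(left, right))    # 2*1 + 1*0 + 3*2 = 8
print(join_size_nested_loop(left, right))  # 8
```

The count-based form makes it plain why the estimation problem only depends on the two occurrence sequences and not on any other table contents.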

1.2 Our Approaches

In order to approximate the number of stars in the undirected case, we convert the random edge queries into weighted vertex sampling, where the probability of sampling a particular vertex is proportional to its degree. We then construct an unbiased estimator that approximates the number of stars using the degree of the sampled vertex as a parameter. The analysis of this part is roughly based on the variance bounding method used in [AMS96], which aims to approximate the frequency moment in a streaming model. The number of samples required by this algorithm depends on $S_p$, which is not known in advance. Thus we create a guessed value of $S_p$ and iteratively update this parameter until it becomes accurate.

To demonstrate lower bounds in the undirected case, we construct new instances to prove tight bounds for the case in which our model is more powerful than the traditional model. In other cases, we provide a new proof to show that the ability to sample uniformly random edges does not necessarily allow better performance in counting stars. Our proof is based on applying Yao’s principle and providing an explicit construction of the hard instances, which unifies multiple cases together and greatly simplifies the approach of [GRS11].¹

¹ One useful technique for giving lower bounds on sublinear time algorithms, pioneered by [BBM12], is to make use of a connection between lower bounds in communication complexity and lower bounds on sublinear time algorithms. More specifically, by giving a reduction from a communication complexity problem to the problem we want to solve, a lower bound on the communication complexity problem yields a lower bound on our problem. In the past, this approach has led to simpler and cleaner sublinear time lower bounds for many problems. Attempts at such an approach for reducing the set-disjointness problem in communication complexity to our estimation problem on graphs run into the following difficulties: First, as explained in [Gol13], the straightforward reduction adds a logarithmic overhead, thereby weakening the lower bound by the same factor. Second, the reduction seems to work only in the case of sparse graphs. Although it is not clear if these difficulties are insurmountable, it seems that it will not give a simpler argument than the approach that we present in this work.

For the directed case, we prove the lower bound using a standard construction and Yao’s principle. As for the upper bound when the in-degree and out-degree ratios are bounded, we use rejection sampling to adjust the sampling probabilities so that we may apply the unbiased estimator method from the undirected case.

1.3 Related Work

Motivated by applications in a variety of areas, the subgraph detection and counting problem and its variations have been studied in many different works, often under different terminology such as network motif counting or pathway querying (e.g., [MSOI02, PCJ04, Wer06, SIKS06, SSRS06, GK07, HBPS07, HA08, ADH08]). As this problem is NP-hard in general, many approaches have been developed to count subgraphs more efficiently for certain families of subgraphs or input graphs (e.g., [DLR95, AYZ97, FG04, KIMA04, ADH08, AG09, VW09, Wil09, GS09, KMPT10, AGM12, AFS12, FLR12]). As for applications to database systems, the problem of approximating the size of the resulting table of a join query or a self-join query in various contexts has been studied in [SS94, HNSS96, AGMS99]. Selectivity and query optimization have been considered, e.g., in [PI97, LKC99, GTK01, MHK07, HILM09].

Other works on sublinear-time algorithms for counting stars include [GRS11], which aims to approximate the number of stars, and [Fei06, GR08], which aim to approximate the number of edges (or equivalently, the average degree). Note that [GRS11] also shows impossibility results for approximating triangles and paths of length three in sublinear time when given uniform edge sampling, limiting us from studying more sophisticated subgraphs. Recent works by Eden, Levi and Ron [ELR15] and Seshadhri [Ses15] provide sublinear time algorithms to approximate the number of triangles in a graph. However, their model uses adjacency matrix queries and neighbor queries. The problem of counting subgraphs has also been studied in the streaming model (e.g., [BYKS02, BFL06, BBCG08, MMPS11, KMSS12]). There is also a body of work on sublinear-time algorithms for approximating various graph parameters (e.g., [PR07, NO08, YYI09, HKNO09, ORRR12]).

Abstracting away the graphical context of counting stars, we may view our problem as finding a parameter of a distribution: edge or vertex sampling can be treated as sampling according to some distribution. In vertex sampling, we have a uniform distribution and in edge sampling, the probabilities are proportional to the degree. The number of stars can be written as a function of the degrees. Aside from our work, there are a number of other studies that make use of combined query types for estimating a parameter of a distribution. Weighted and uniform sampling are considered in [MPX07, BBS09]. Their algorithms may be adapted to approximate the number of edges in the context of approximating graph parameters when given weighted vertex sampling, which we will later use in this paper. A closely related problem in the context of distributions is the task of approximating frequency moments, mainly studied in the streaming model (e.g., [AMS96, CK04, IW05, BGKS06]). On the other hand, the combination of weighted sampling and probability distribution queries is also considered (e.g., [CR14]).

2 Preliminaries

In this paper, we construct algorithms to approximate the number of stars in a graph under different types of query access to the input graph. As we focus on the case of simple undirected graphs, we explain this model here and defer the description for the directed case to Appendix D.

2.1 Graph Specification

Let $G = (V, E)$ be the input graph, assumed to be simple and undirected. Let $n$ and $m$ denote the number of vertices and edges, respectively. The value $n$ is known to the algorithm. Each vertex $v \in V$ is associated with a unique ID from $\{1, \ldots, n\}$. Let $d(v)$ denote the degree of $v$.

Let $p \geq 2$ be a constant integer. A $p$-star is a subgraph of size $p + 1$, where one vertex, called the center, is adjacent to the other $p$ vertices. For example, a $2$-star is an undirected path of length 2. Note that a vertex may be a center for many stars, and a set of $p + 1$ vertices may form multiple stars. Let $S_p$ denote the number of occurrences of distinct stars in the graph.
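To make the definition concrete, the following sketch counts stars in two ways on a tiny graph: by summing a binomial coefficient over the degree sequence, and by explicitly enumerating every center together with every set of leaves. Both function names are hypothetical, and vertices are numbered from 0 here purely for convenience.

```python
from itertools import combinations
from math import comb


def count_stars_by_degrees(n, edges, p):
    """Sum over vertices of C(degree, p): each vertex is the center of that many p-stars."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(comb(d, p) for d in deg)


def count_stars_brute_force(n, edges, p):
    """Enumerate (center, set of p neighbors) pairs explicitly."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for c in range(n) for _ in combinations(sorted(adj[c]), p))


# Vertex 1 is adjacent to 0, 2 and 3; no other edges.
edges = [(0, 1), (1, 2), (1, 3)]
print(count_stars_by_degrees(4, edges, 2), count_stars_brute_force(4, edges, 2))  # 3 3
print(count_stars_by_degrees(4, edges, 3), count_stars_brute_force(4, edges, 3))  # 1 1
```

The agreement between the two counts is what lets the rest of the analysis work with the degree sequence alone.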

Our goal is to construct a randomized algorithm that outputs a value that is within a $(1 \pm \epsilon)$-multiplicative factor of the actual number of stars $S_p$. More specifically, given a parameter $\epsilon > 0$, the algorithm must give an approximated value $\hat{S}_p$ satisfying the inequality $(1 - \epsilon) S_p \leq \hat{S}_p \leq (1 + \epsilon) S_p$ with success probability at least $2/3$.

2.2 Query Access

The algorithm may access the input graph by querying the graph oracle, which answers the following types of queries. First, the neighbor queries: given a vertex $v$ and an index $i$, the $i$-th neighbor of $v$ is returned if $i \leq d(v)$; otherwise, a special symbol $\bot$ is returned. Second, the degree queries: given a vertex $v$, its degree $d(v)$ is returned. Lastly, the random edge queries: a uniformly random edge $\{u, v\} \in E$ is returned. The query complexity of an algorithm is the total number of queries of any type that the algorithm makes throughout the process of computing its answer.
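A minimal sketch of such an oracle is given below, assuming an in-memory edge list and 0-based vertex IDs. The class name `GraphOracle`, the use of `None` for an out-of-range neighbor index, and the `queries` counter are choices made for this illustration only.

```python
import random


class GraphOracle:
    """Toy oracle exposing neighbor, degree, and random edge queries, with a query counter."""

    def __init__(self, n, edges, seed=None):
        self.edges = list(edges)
        self.adj = [[] for _ in range(n)]
        for u, v in self.edges:
            self.adj[u].append(v)
            self.adj[v].append(u)
        self.rng = random.Random(seed)
        self.queries = 0  # total number of queries of any type

    def neighbor(self, v, i):
        """Return the i-th neighbor of v (1-indexed), or None if i exceeds the degree."""
        self.queries += 1
        if 1 <= i <= len(self.adj[v]):
            return self.adj[v][i - 1]
        return None

    def degree(self, v):
        self.queries += 1
        return len(self.adj[v])

    def random_edge(self):
        """Return an edge chosen uniformly at random."""
        self.queries += 1
        return self.rng.choice(self.edges)


oracle = GraphOracle(4, [(0, 1), (1, 2), (1, 3)], seed=0)
print(oracle.degree(1), oracle.neighbor(1, 2), oracle.neighbor(0, 2))  # 3 2 None
print(oracle.queries)  # 3
```

Counting every call in one place mirrors the definition of query complexity as the total number of queries of any type.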

Combining these queries, we may implement various useful sampling processes. We may perform a uniform edge sampling using a random edge query, and a uniform vertex sampling by simply picking a random index from $\{1, \ldots, n\}$. We may also perform a weighted vertex sampling where each vertex is obtained with probability proportional to its degree as follows: uniformly sample a random edge, then randomly choose one of the endpoints with probability $1/2$ each. Since any vertex $v$ is incident with $d(v)$ edges, the probability that $v$ is chosen is exactly $d(v)/2m$, as desired.
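The weighted vertex sampling step can be sketched as follows. The first function draws one sample; the second enumerates all (edge, endpoint) outcomes to compute the exact output distribution, so that the claim "probability proportional to degree" can be checked without relying on randomness. Function names and the edge-list representation are assumptions for this example.

```python
import random
from collections import Counter
from fractions import Fraction


def weighted_vertex_sample(edges, rng):
    """Pick a uniformly random edge, then one of its two endpoints with probability 1/2."""
    u, v = rng.choice(edges)
    return u if rng.random() < 0.5 else v


def exact_sampling_distribution(edges):
    """Exact probability of each vertex: every (edge, endpoint) pair has mass 1/(2m)."""
    m = len(edges)
    prob = Counter()
    for u, v in edges:
        prob[u] += Fraction(1, 2 * m)
        prob[v] += Fraction(1, 2 * m)
    return dict(prob)


edges = [(0, 1), (1, 2), (1, 3)]  # degrees: 1, 3, 1, 1 and m = 3
print(exact_sampling_distribution(edges))
print(weighted_vertex_sample(edges, random.Random(0)))
```

In this example vertex 1 has degree 3 out of a total degree of 6, so it receives probability 1/2, while each leaf receives 1/6.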

2.3 Queries in the Database Model

Now we explain how the above queries in our graph model have direct interpretations in the database model. Consider the column we wish to join on. For each valid label $i$, let $x_i$ be the number of rows containing this label. We assume the ability to sample rows uniformly at random. This gives us a label $i$ with probability proportional to $x_i$, which is a weighted sample from the distribution of labels. We also assume that we can quickly compute the number of other rows sharing the same label with a given row (analogous to making a degree query). For example, this could be done quickly using an index on the column. Note that if one has an index that is augmented with appropriate information, one can compute the selectivity of a self-join query exactly in time roughly $O(n)$, where $n$ is the number of distinct elements in the column. However, our methods can give runtimes that are asymptotically much smaller than this.
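The correspondence can be illustrated on a toy column. The sketch below builds an index of label counts (standing in for degree queries), samples a row uniformly (standing in for a weighted label sample), and checks the identity that the self-join size equals twice the number of unordered pairs of rows sharing a label plus the number of rows. All names are hypothetical, and the exact linear scan is only there as a baseline.

```python
import random
from collections import Counter
from math import comb


def self_join_size(column):
    """Exact self-join size: each label with x rows contributes x * x output records."""
    return sum(x * x for x in Counter(column).values())


def pairs_sharing_label(column):
    """Number of unordered pairs of rows with equal labels, i.e. the sum of C(x, 2)."""
    return sum(comb(x, 2) for x in Counter(column).values())


def sample_label(column, rng):
    """A uniformly random row yields its label with probability proportional to its count."""
    return rng.choice(column)


column = ["a", "a", "a", "b", "b", "c"]
index = Counter(column)  # plays the role of a degree query: index[label]
print(self_join_size(column))                          # 9 + 4 + 1 = 14
print(2 * pairs_sharing_label(column) + len(column))   # 2 * 4 + 6 = 14
print(index[sample_label(column, random.Random(0))])   # count of the sampled label
```

The identity is why estimating the pair count for this column is enough to recover the self-join size, given that the number of rows is known.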

3 Upper Bounds for Counting Stars in Undirected Graphs

In this section we establish an algorithm for approximating the number of stars, $S_p$, of an undirected input graph. We focus on the case where only degree queries and random edge queries are allowed. This illustrates that even without utilizing the underlying structure of the input graph, we are still able to construct a sublinear approximation algorithm that outperforms other algorithms under the traditional model in certain cases.

3.1 Unbiased Estimator Subroutine

Our algorithm uses weighted vertex sampling to find stars. Intuitively, the number of samples required by the algorithm should be larger when stars are rare because it takes more queries to find them. While the query complexity of the algorithm depends on the actual value of $S_p$, our algorithm does not know this value in advance. In order to overcome this issue, we devise a subroutine which, given a guess $\tilde{S}_p$ for the value of $S_p$, will give a $(1 \pm \epsilon)$ approximation of $S_p$ if $\tilde{S}_p$ is close enough to $S_p$, or tell us that $\tilde{S}_p$ is much larger than $S_p$. Then, we start with the maximum possible value of $S_p$ and guess multiplicatively smaller and smaller values for it until we find one that is close enough to $S_p$, so that our subroutine is able to correctly output a $(1 \pm \epsilon)$ approximation.

Our subroutine works by computing the average value of an unbiased estimator of $S_p$ after drawing enough weighted vertex samples. To construct the unbiased estimator, notice first that the number of $p$-stars centered at a vertex $v$ is $\binom{d(v)}{p}$.² Thus, $S_p = \sum_{v \in V} \binom{d(v)}{p}$.

² For our counting purpose, if $x < p$ then we define $\binom{x}{p} = 0$.

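Next, we define the unbiased estimator and give the corresponding algorithm. First, let $X$ be the random variable representing the degree of a random vertex obtained through weighted vertex sampling, as explained in Section 2.2. Recall that a vertex $v$ is sampled with probability $d(v)/2m$. We define the random variable $Y = \frac{2m}{X} \binom{X}{p}$ so that $Y$ is an unbiased estimator for $S_p$; that is,

$$E[Y] = \sum_{v \in V} \frac{d(v)}{2m} \cdot \frac{2m}{d(v)} \binom{d(v)}{p} = \sum_{v \in V} \binom{d(v)}{p} = S_p.$$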

1:procedure Unbiased-Estimate()
2:     
3:     for  to  do
4:          weighted sampled vertex obtained from a random edge query
5:          obtained from a degree query
6:               
7:     
8:      return
Algorithm 1 Subroutine for Computing given with success probability 2/3
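The sketch below illustrates the estimator behind this subroutine under stated assumptions: each weighted vertex sample of degree d contributes the value (2m/d)·C(d, p), and the subroutine returns the average. The number of samples is left as a plain parameter `num_samples` because the paper's exact choice is not reproduced here, and precomputed degrees stand in for degree queries. `exact_expectation` checks unbiasedness by exact enumeration rather than by simulation.

```python
import random
from fractions import Fraction
from math import comb


def degree_sequence(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def estimator_value(d, m, p):
    """Value contributed by one weighted sample of degree d: (2m / d) * C(d, p)."""
    return Fraction(2 * m, d) * comb(d, p)


def exact_expectation(n, edges, p):
    """Expectation of the estimator when a vertex of degree d is drawn with probability d / (2m)."""
    m = len(edges)
    return sum(Fraction(d, 2 * m) * estimator_value(d, m, p)
               for d in degree_sequence(n, edges) if d > 0)


def unbiased_estimate(n, edges, p, num_samples, rng):
    """Average of the estimator over num_samples weighted vertex samples (assumed parameter)."""
    m = len(edges)
    deg = degree_sequence(n, edges)  # stands in for degree queries
    total = Fraction(0)
    for _ in range(num_samples):
        u, v = rng.choice(edges)                 # random edge query
        w = u if rng.random() < 0.5 else v       # random endpoint
        total += estimator_value(deg[w], m, p)
    return float(total / num_samples)


edges = [(0, 1), (1, 2), (1, 3), (2, 3)]  # degrees 1, 3, 2, 2, so there are 5 two-stars
print(exact_expectation(4, edges, 2))      # 5
print(unbiased_estimate(4, edges, 2, 1000, random.Random(0)))
```

On a regular graph every sample contributes the same value, so the average equals the true count regardless of the random choices; on irregular graphs the spread of the contributions is what drives the number of samples needed.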

Clearly, the output of Algorithm 1 satisfies . We claim that the number of samples in Algorithm 1 is sufficient to provide two desired properties: the algorithm returns an -approximation of if is in the correct range; or, if is too large, the anomaly will be evident as the output will be much smaller than . In particular, we may distinguish between these two cases by comparing against , as specified through the following lemma.

Lemma 3.1

For , with probability at least 2/3:

  1. If , then Algorithm 1 outputs such that ;
    moreover, if then .

  2. If , then Algorithm 1 outputs such that .

The first item of Lemma 3.1 can be proved by bounding the variance of the estimator using various identities of binomial coefficients and then applying Chebyshev’s Inequality, while the second item is a simple application of Markov’s Inequality. Detailed proofs for these statements can be found in Appendix B.
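One way such a variance bound might go is sketched below. This is our own back-of-the-envelope derivation, not the proof from Appendix B: the symbols $Y$ (the value contributed by one weighted sample), $\bar{Y}$ (the average of $k$ samples), and the constant $6$ are assumptions of the sketch, and it treats only the case $S_p > 0$.

```latex
% Second moment of one sample, where a vertex v is drawn with probability d(v)/2m:
\[
  E[Y^2] \;=\; \sum_{v:\, d(v) \ge 1} \frac{d(v)}{2m}
      \left( \frac{2m}{d(v)} \binom{d(v)}{p} \right)^{2}
  \;=\; 2m \sum_{v:\, d(v) \ge 1} \frac{1}{d(v)} \binom{d(v)}{p}^{2} .
\]
% Since \binom{d}{p} \le d^p, we have d \ge \binom{d}{p}^{1/p}, so each term is at most
% \binom{d(v)}{p}^{2 - 1/p}; and \sum_v a_v^q \le (\sum_v a_v)^q for a_v \ge 0, q \ge 1:
\[
  \mathrm{Var}[Y] \;\le\; E[Y^2]
  \;\le\; 2m \sum_{v} \binom{d(v)}{p}^{2 - 1/p}
  \;\le\; 2m \, S_p^{\,2 - 1/p} .
\]
% Chebyshev's inequality for the average of k independent samples:
\[
  \Pr\!\left[\, |\bar{Y} - S_p| \ge \epsilon S_p \,\right]
  \;\le\; \frac{\mathrm{Var}[Y]}{k \, \epsilon^{2} S_p^{2}}
  \;\le\; \frac{2m}{k \, \epsilon^{2} S_p^{1/p}} ,
\]
% which is at most 1/3 once k \ge 6m / (\epsilon^2 S_p^{1/p}).
```

Under these assumptions the number of samples scales like $m / (\epsilon^2 S_p^{1/p})$, which matches the intuition that rarer stars require more samples.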

3.2 Full Algorithm

Our full algorithm proceeds by first setting to , the maximum possible value of given by the complete graph. We then use Algorithm 1 to check if ; if this is the case, we reduce then proceed to the next iteration. Otherwise, Algorithm 1 should already give an -approximation to (with constant probability). We note that if , we may replace it with without increasing the asymptotic complexity.

Since the process above may take up to iterations, we must amplify the success probability of Algorithm 1 so that the overall success probability is still at least . To do so, we simply make multiple calls to Algorithm 1 then take the median of the returned values. Our full algorithm can be described as Algorithm 2 below.

1:procedure Count-Stars()
2:     
3:     loop
4:         for  to  do
5:                        
6:         
7:         if  then
8:              
9:              return          
10:               
Algorithm 2 Algorithm for Approximating
Theorem 3.2

Algorithm 2 outputs such that with probability at least . The query complexity of Algorithm 2 is .

Proof:  If we assume that the events from Lemma 3.1 hold, then the algorithm will take at most iterations. By choosing , the Chernoff bound (Theorem A.3) implies that, except with probability , more than half of the return values of Algorithm 1 satisfy the desired property, and so does the median . By the union bound, the total failure probability is at most 1/3.

Now it is safe to assume that the events from both items of Lemma 3.1 hold. In case , our algorithm will detect this event because , implying that we never stop and return an inaccurate approximation. On the other hand, if , our algorithm computes and must terminate. Since we only halve on each iteration, when first occurs, we have . As a result, our algorithm must terminate with the desired approximation before the value is halved again. Thus, Algorithm 2 returns satisfying with probability at least , as desired.

Recall that the number of samples required by Algorithm 1 may only increase when decreases. Thus we may use the number of samples in the last round of Algorithm 2, where , as the upper bound for each previous iteration. Therefore, each of the iterations takes samples, achieving the claimed query complexity.
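The overall guess-and-halve loop can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 2: the acceptance test `z >= guess`, the sample-count constant `c`, and the fixed number of `repetitions` are placeholders chosen for this example (the analysis above uses its own threshold and a number of repetitions that grows with the number of iterations), and precomputed degrees stand in for degree queries.

```python
import math
import random
import statistics


def count_stars(n, edges, p, eps, rng, repetitions=5, c=6.0):
    """Guess-and-halve sketch; the threshold and constants are illustrative assumptions."""
    m = len(edges)
    if m == 0:
        return 0.0
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1

    def unbiased_estimate(guess):
        # Assumed sample count: grows as the guess shrinks, like m / (eps^2 * guess^(1/p)).
        k = max(1, math.ceil(c * m / (eps ** 2 * guess ** (1.0 / p))))
        total = 0.0
        for _ in range(k):
            u, v = rng.choice(edges)                  # random edge query
            d = deg[u if rng.random() < 0.5 else v]   # degree of a random endpoint
            total += (2 * m / d) * math.comb(d, p)
        return total / k

    guess = float(n * math.comb(n - 1, p))  # star count of the complete graph
    while guess >= 1.0:
        # Median of several independent runs to boost the success probability.
        z = statistics.median(unbiased_estimate(guess) for _ in range(repetitions))
        if z >= guess:        # placeholder acceptance test
            return z
        guess /= 2.0          # the guess was too large: halve it and retry
    return 0.0


cycle = [(i, (i + 1) % 5) for i in range(5)]   # 5 two-stars, one per vertex
print(count_stars(5, cycle, 2, 0.5, random.Random(1)))  # 5.0
```

On the 5-cycle every sample contributes exactly 5, so the loop halves the guess from 30 down to 3.75 and then returns 5.0 whatever the random choices are; on irregular graphs the output is random and only approximately correct.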

3.3 Removing the Dependence on $m$

As described above, Algorithm 1 picks its number of samples and defines the unbiased estimator based on $m$, the number of edges. Nonetheless, it is possible to remove this assumption of having prior knowledge of $m$ by instead computing its approximation. Furthermore, we will bound $m$ in terms of $n$ and $S_p$, so that we can also relate the performance of our algorithm to previous studies on this problem such as [GRS11], as done in Table 1.

3.3.1 Approximating $m$

We briefly discuss how to apply our algorithm when $m$ is unknown by first computing an approximation of $m$. Using weighted vertex sampling, we may simulate the algorithm from [MPX07] or [BBS09] that computes an -approximation to the sum of degrees using weighted samples. More specifically, we cite the following theorem:

Theorem 3.3

([MPX07]) Let be variables, and define a distribution that returns with probability . There exists an algorithm that computes a -approximation of using samples from .

Thus, we simulate the sampling process from by drawing a weighted vertex sample , querying its degree, and feeding to this algorithm. We will need to decrease used in this algorithm and our algorithm by a constant factor to account for the additional error. Below we show that our complexities are at least which is already for , and thus this extra step does not affect our algorithm’s performance asymptotically.

3.3.2 Comparing $m$ to $n$ and $S_p$

For comparison of performances, we will now show some bounds relating $m$ to $n$ and $S_p$. Notice that the function $f(x) = \binom{x}{p}$ is convex with respect to $x$.³ Then by applying Jensen’s inequality (Theorem A.4) to this function, we obtain

$$S_p = \sum_{v \in V} \binom{d(v)}{p} \geq n \binom{2m/n}{p}.$$

³ We may use the binomial coefficients for non-integral value in the inequalities. These can be interpreted through alternative formulations of binomial coefficients using falling factorials or analytic functions.

First, let us consider the case where the stars are very rare, namely when . The inequality above implies that . Substituting this formula back into the bound from Theorem 3.2 yields the query complexity .

Now we consider the remaining case where . If , then the query complexity from Theorem 3.2 becomes . Otherwise we have , which allows us to apply the following bound on our binomial coefficient:

This inequality implies that , also yielding the query complexity .
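The Jensen step used in this subsection can be sanity-checked numerically. The sketch below takes the bound in the form "star count is at least n times the binomial coefficient of the average degree 2m/n" (our reading of that step) and evaluates the binomial at non-integral arguments through a falling factorial. Setting the extension to zero below p − 1 is an assumption made here so that the extended function is convex and agrees with the usual convention at integers.

```python
from fractions import Fraction
from math import comb, factorial


def binom_real(x, p):
    """Convex extension of C(x, p): zero below p - 1, falling factorial over p! otherwise."""
    if x < p - 1:
        return Fraction(0)
    prod = Fraction(1)
    for j in range(p):
        prod *= (x - j)
    return prod / factorial(p)


def jensen_lower_bound(n, m, p):
    """n times the extended binomial coefficient evaluated at the average degree 2m / n."""
    return n * binom_real(Fraction(2 * m, n), p)


def star_count(degrees, p):
    return sum(comb(d, p) for d in degrees)


star_degrees = [4, 1, 1, 1, 1]    # star on 5 vertices, m = 4
cycle_degrees = [2, 2, 2, 2, 2]   # 5-cycle, m = 5
print(star_count(star_degrees, 2), jensen_lower_bound(5, 4, 2))    # 6 12/5
print(star_count(cycle_degrees, 2), jensen_lower_bound(5, 5, 2))   # 5 5
```

The bound is tight for regular graphs, where every degree equals the average, and loose for skewed degree sequences such as the star.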

Compared to [GRS11], our algorithm achieves a better query complexity when , where the rare stars are more likely to be found via edge sampling rather than uniform vertex sampling or traversing the graph. Our algorithm also performs no worse than their algorithm does for any as large as . Moreover, due to the simplicity of our algorithm, the dependence on of our query complexity is only for any value of , while that of their algorithm is as large as in certain cases. This dependence on may be of interest to some applications, especially when stars are rare whilst an accurate approximation of is crucial.

3.4 Allowing Neighbor Queries

We now briefly discuss how we may improve our algorithm when neighbor queries are allowed (in addition to degree queries and random edge queries). For the case when , it is unknown whether our algorithm alone achieves better performance than [GRS11] (see Table 1). However, their algorithm has the same basic framework as ours, namely that it also starts by setting to the maximum possible number of stars, then iteratively halves this value until it is in the correct range, allowing the subroutine to correctly compute a -approximation of . As a result, we may achieve the same performance as theirs in this regime by simply letting Algorithm 2 call the subroutine from [GRS11] when . We will later show tight lower bounds (up to polylogarithmic factors) for the case where all three types of queries are allowed, which is a stronger model than the one previously studied in their work.

4 Lower Bounds for Counting Stars in Undirected Graphs

In this section, we establish the lower bounds summarized in the last two columns of Table 1. We give lower bounds that apply even when the algorithm is permitted to sample random edges. Our first lower bound is proved in Section 4.1; while this is the simplest case, it provides useful intuition for the proofs of subsequent bounds. In order to overcome the new obstacle of powerful queries in our model, for larger values of we create an explicit scheme for constructing families of graphs that are hard to distinguish by any algorithm even when these queries are present. Using this construction scheme, our approach obtains the bounds for all remaining ranges for as special cases of a more general bound, and the general bound is proved via a straightforward application of Yao’s principle and a coupling argument. Our lower bounds are tight (up to polylogarithmic factors) for all cases except for the bottom middle cell in Table 1.

4.1 Lower Bound for

Theorem 4.1

For any constant , any (randomized) algorithm for approximating to a multiplicative factor via neighbor queries, degree queries and random edge queries with probability of success at least requires total number of queries for any .

Proof:  We now construct two families of graphs, namely and , such that any and drawn from each respective family satisfy and for some parameter . We construct as follows: for a subset of size , we create a union of a -regular graph on and a -regular graph on , and add the resulting graph to . To construct all graphs in , we repeat this process for every subset of size . is constructed a little differently: rather than using a -regular graph on , we use a star of size on this set instead. We add a union between a star on and a -regular graph on of any possible combination to .

By construction, every contains no -stars, whereas every has -stars. For any algorithm to distinguish between and , when given a graph , it must be able to detect some vertex in with probability at least . Otherwise, if we randomly generate a small induced subgraph according to the uniform distribution in conditional on not having any vertex or edge in , the distribution would be identical to the uniform distribution in . Furthermore, notice that cannot be reached via traversal using neighbor queries as it is disconnected from . The probability of sampling such a vertex or edge from each query is . Thus, samples are required to achieve a constant factor approximation with probability 2/3.

4.2 Overview of the Lower Bound Proof for

Since graphs with large contain many edges, we must modify our approach above to allow graphs from the first family to contain stars. We construct two families of graphs and such that the number of stars of graphs from these families differ by some multiplicative factor ; any algorithm aiming to approximate within a multiplicative factor of must distinguish between these two families with probability at least . We create representations of graphs that explicitly specify their adjacency list structure. Each contains vertices of degree , while the remaining vertices are isolated. For each , we modify our representation from by connecting each of the remaining vertices to neighbors, so that these vertices contribute sufficient stars to establish the desired difference in . We hide these additional edges in carefully chosen random locations while ensuring minimal disturbance to the original graph representation; our representations are still so similar that any algorithm may not detect them without making sufficiently many queries. Moreover, we define a coupling for answering random edge queries so that the same edges are likely to be returned regardless of the underlying graph.

While the proof of [GRS11] also uses similar families of graphs, our analysis deviates substantially from theirs as follows. Firstly, we apply Yao’s principle, which allows us to prove the lower bounds on randomized algorithms by instead showing the lower bound on deterministic algorithms on our carefully chosen distribution of input instances. (See e.g. [MR10] for more information on Yao’s principle.) Secondly, rather than constructing two families of graphs via random processes, we construct our graphs with adjacency list representations explicitly, satisfying the above conditions for each lower bound we aim to prove. This allows us to avoid the difficulties in [GRS11] regarding the generation of potential multiple edges and self-loops in the input instances. Thirdly, we define the distribution of our instances based on the permutation of the representations of these two graphs, and the location we place the edges in that are absent in . We also apply the coupling argument, so that the distribution of these permutations we apply on these graphs, as well as the answers to random edge queries, are as similar as possible. As long as the small difference between these graphs is not discovered, the interaction between the algorithm and our oracle must be exactly the same. We show that with probability , the algorithm and our oracle behave in exactly the same way whether the input instance corresponds to or . Simplifying the arguments from [GRS11], we completely bypass the algorithm’s ability to make use of graph structures. Our proof only requires some conditions on the parameters ; this allows us to show the lower bounds for multiple ranges of simply by choosing appropriate parameters.

We provide the full details in Section C. The main results of our constructions are given as the following theorems. We note that these lower bounds also apply when only a subset of these three types of queries is provided. This concludes all of our lower bounds in Table 1.

Theorem 4.2

For any constant , any (randomized) algorithm for approximating to a multiplicative factor via neighbor queries, degree queries and random edge queries with probability of success at least requires total number of queries for any .

Theorem 4.3

For any constant , any (randomized) algorithm for approximating to a multiplicative factor via neighbor queries, degree queries and random edge queries with probability of success at least requires total number of queries for any .

5 Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. CCF-1217423, CCF-1065125, CCF-1420692, and CCF-1122374. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We thank Peter Haas and Samuel Madden for helpful discussions.

References

  • [ADH08] Noga Alon, Phuong Dao, Iman Hajirasouliha, Fereydoun Hormozdiari, and S Cenk Sahinalp. Biomolecular network motif counting and discovery by color coding. Bioinformatics, 24(13):i241–i249, 2008.
  • [AFS12] Omid Amini, Fedor V Fomin, and Saket Saurabh. Counting subgraphs via homomorphisms. SIAM Journal on Discrete Mathematics, 26(2):695–717, 2012.
  • [AG09] Noga Alon and Shai Gutner. Balanced hashing, color coding and approximate counting. In Parameterized and Exact Computation, pages 1–16. Springer, 2009.
  • [AGM12] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: sparsification, spanners, and subgraphs. In Proceedings of the 31st symposium on Principles of Database Systems, pages 5–14. ACM, 2012.
  • [AGMS99] Noga Alon, Phillip B Gibbons, Yossi Matias, and Mario Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 10–20. ACM, 1999.
  • [AMS96] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM, 1996.
  • [AYZ97] Noga Alon, Raphael Yuster, and Uri Zwick. Finding and counting given length cycles. Algorithmica, 17(3):209–223, 1997.
  • [BBCG08] Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16–24. ACM, 2008.
  • [BBM12] Eric Blais, Joshua Brody, and Kevin Matulef. Property testing lower bounds via communication complexity. Computational Complexity, 21(2):311–358, 2012.
  • [BBS09] Tuğkan Batu, Petra Berenbrink, and Christian Sohler. A sublinear-time approximation scheme for bin packing. Theoretical Computer Science, 410(47):5082–5092, 2009.
  • [BFL06] Luciana S Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 253–262. ACM, 2006.
  • [BGKS06] Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh, and Chandan Saha. Simpler algorithm for estimating frequency moments of data streams. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 708–713. ACM, 2006.
  • [BYKS02] Ziv Bar-Yossef, Ravi Kumar, and D Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 623–632. Society for Industrial and Applied Mathematics, 2002.
  • [CK04] Don Coppersmith and Ravi Kumar. An improved data stream algorithm for frequency moments. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 151–156. Society for Industrial and Applied Mathematics, 2004.
  • [CR14] Clément Canonne and Ronitt Rubinfeld. Testing probability distributions underlying aggregated data. arXiv preprint arXiv:1402.3835, 2014.
  • [DLR95] Richard A Duke, Hanno Lefmann, and Vojtěch Rödl. A fast approximation algorithm for computing the frequencies of subgraphs in a given graph. SIAM Journal on Computing, 24(3):598–620, 1995.
  • [ELR15] Talya Eden, Amit Levi, and Dana Ron. Approximately counting triangles in sublinear time. arXiv preprint arXiv:1504.00954, 2015.
  • [Fei06] Uriel Feige. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM Journal on Computing, 35(4):964–984, 2006.
  • [FG04] Jörg Flum and Martin Grohe. The parameterized complexity of counting problems. SIAM Journal on Computing, 33(4):892–922, 2004.
  • [FLR12] Fedor V Fomin, Daniel Lokshtanov, Venkatesh Raman, Saket Saurabh, and BV Rao. Faster algorithms for finding and counting subgraphs. Journal of Computer and System Sciences, 78(3):698–706, 2012.
  • [GK07] Joshua A Grochow and Manolis Kellis. Network motif discovery using subgraph enumeration and symmetry-breaking. In Research in Computational Molecular Biology, pages 92–106. Springer, 2007.
  • [Gol13] Oded Goldreich. On the communication complexity methodology for proving lower bounds on the query complexity of property testing. Electronic Colloquium on Computational Complexity (ECCC), 20:73, 2013.
  • [GR08] Oded Goldreich and Dana Ron. Approximating average parameters of graphs. Random Structures & Algorithms, 32(4):473–493, 2008.
  • [GRS11] Mira Gonen, Dana Ron, and Yuval Shavitt. Counting stars and other small subgraphs in sublinear-time. SIAM Journal on Discrete Mathematics, 25(3):1365–1411, 2011.
  • [GS09] Mira Gonen and Yuval Shavitt. Approximating the number of network motifs. Internet Mathematics, 6(3):349–372, 2009.
  • [GTK01] Lise Getoor, Benjamin Taskar, and Daphne Koller. Selectivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, pages 461–472. ACM, 2001.
  • [HA08] David Hales and Stefano Arteconi. Motifs in evolving cooperative networks look like protein structure networks. Networks and Heterogeneous Media, 3(2):239, 2008.
  • [HBPS07] Fereydoun Hormozdiari, Petra Berenbrink, Nataša Pržulj, and S Cenk Sahinalp. Not all scale-free networks are born equal: the role of the seed graph in ppi network evolution. PLoS computational biology, 3(7):e118, 2007.
  • [HILM09] Peter J Haas, Ihab F Ilyas, Guy M Lohman, and Volker Markl. Discovering and exploiting statistical properties for query optimization in relational databases: A survey. Statistical Analysis and Data Mining: The ASA Data Science Journal, 1(4):223–250, 2009.
  • [HKNO09] Avinatan Hassidim, Jonathan A Kelner, Huy N Nguyen, and Krzysztof Onak. Local graph partitions for approximation and testing. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 22–31. IEEE, 2009.
  • [HNSS96] Peter J Haas, Jeffrey F Naughton, S Seshadri, and Arun N Swami. Selectivity and cost estimation for joins based on random sampling. Journal of Computer and System Sciences, 52(3):550–569, 1996.
  • [IW05] Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 202–208. ACM, 2005.
  • [KIMA04] Nadav Kashtan, Shalev Itzkovitz, Ron Milo, and Uri Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 20(11):1746–1758, 2004.
  • [KMPT10] Mihail N Kolountzakis, Gary L Miller, Richard Peng, and Charalampos E Tsourakakis. Efficient triangle counting in large graphs via degree-based vertex partitioning. In Algorithms and Models for the Web-Graph, pages 15–24. Springer, 2010.
  • [KMSS12] Daniel M Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. Counting arbitrary subgraphs in data streams. In Automata, Languages, and Programming, pages 598–609. Springer, 2012.
  • [LKC99] Ju-Hong Lee, Deok-Hwan Kim, and Chin-Wan Chung. Multi-dimensional selectivity estimation using compressed histogram information. In ACM SIGMOD Record, volume 28, pages 205–214. ACM, 1999.
  • [MHK07] Volker Markl, Peter J Haas, Marcel Kutsch, Nimrod Megiddo, Utkarsh Srivastava, and Tam Minh Tran. Consistent selectivity estimation via maximum entropy. The VLDB journal, 16(1):55–76, 2007.
  • [MMPS11] Madhusudan Manjunath, Kurt Mehlhorn, Konstantinos Panagiotou, and He Sun. Approximate counting of cycles in streams. In Algorithms–ESA 2011, pages 677–688. Springer, 2011.
  • [MPX07] Rajeev Motwani, Rina Panigrahy, and Ying Xu. Estimating sum by weighted sampling. In Automata, Languages and Programming, pages 53–64. Springer, 2007.
  • [MR10] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Chapman & Hall/CRC, 2010.
  • [MSOI02] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
  • [NO08] Huy N Nguyen and Krzysztof Onak. Constant-time approximation algorithms via local improvements. In Foundations of Computer Science, 2008. FOCS’08. IEEE 49th Annual IEEE Symposium on, pages 327–336. IEEE, 2008.
  • [ORRR12] Krzysztof Onak, Dana Ron, Michal Rosen, and Ronitt Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1123–1131. SIAM, 2012.
  • [PCJ04] N Pržulj, Derek G Corneil, and Igor Jurisica. Modeling interactome: scale-free or geometric? Bioinformatics, 20(18):3508–3515, 2004.
  • [PI97] Viswanath Poosala and Yannis E Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, volume 97, pages 486–495, 1997.
  • [PR07] Michal Parnas and Dana Ron. Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theoretical Computer Science, 381(1):183–196, 2007.
  • [Ses15] C Seshadhri. A simpler sublinear algorithm for approximating the triangle count. arXiv preprint arXiv:1505.01927, 2015.
  • [SIKS06] Jacob Scott, Trey Ideker, Richard M Karp, and Roded Sharan. Efficient algorithms for detecting signaling pathways in protein interaction networks. Journal of Computational Biology, 13(2):133–144, 2006.
  • [SS94] Arun Swami and K Bernhard Schiefer. On the estimation of join result sizes. Springer, 1994.
  • [SSRS06] Tomer Shlomi, Daniel Segal, Eytan Ruppin, and Roded Sharan. Qpath: a method for querying pathways in a protein-protein interaction network. BMC bioinformatics, 7(1):199, 2006.
  • [VW09] Virginia Vassilevska and Ryan Williams. Finding, minimizing, and counting weighted subgraphs. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 455–464. ACM, 2009.
  • [Wer06] Sebastian Wernicke. Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 3(4):347–359, 2006.
  • [Wil09] Ryan Williams. Finding paths of length k in O*(2^k) time. Information Processing Letters, 109(6):315–318, 2009.
  • [YYI09] Yuichi Yoshida, Masaki Yamamoto, and Hiro Ito. An improved constant-time approximation algorithm for maximum matchings. In Proceedings of the 41st annual ACM symposium on Theory of computing, pages 225–234. ACM, 2009.

Appendix A Useful Inequalities

This section provides standard inequalities that we use throughout our paper. These inequalities exist in many variations, but here we only present the formulations which are most convenient for our purposes.

Theorem A.1

(Chebyshev’s Inequality) For any random variable $X$ and $a > 0$,

$$\Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big] \le \frac{\mathrm{Var}[X]}{a^2}.$$

Theorem A.2

(Markov’s Inequality) For any non-negative random variable $X$ and $a > 0$,

$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}.$$

Theorem A.3

(Chernoff Bound) Let be independent Poisson random variables such that for all , and let . Then for any ,

Theorem A.4

(Jensen’s Inequality) For any real convex function $f$ with $\mathbb{E}[X]$ in its domain,

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)].$$

Appendix B Proof of Lemma 3.1

Proof:  Let us first consider the first item. Since , we will focus on establishing an upper bound of . We compute

where the first inequality holds because . Rearranging the terms, we have the following relationship:

Now let us consider our average . Since are identically distributed, we have

By Chebyshev’s inequality (Theorem A.1), we have

In order to achieve the desired value such that with error probability , it is sufficient to take samples. Recall the assumption that satisfying . Thus, the number of required samples to achieve such bound with probability is

For the second item, we apply Markov’s Inequality (Theorem A.2) to the given condition to obtain

implying the desired success probability.

Lastly, we substitute to obtain the relationship between and , which establishes the condition for deciding whether the given is much larger than , as desired.

Appendix C Proof of Lower Bounds for Undirected Graphs with

In this section we provide the proof of the lower bounds claimed in Section 4.2. Firstly, to properly describe the adjacency list representation of the input graphs, we introduce the notion of graph representation. Next, we state a main lemma (Lemma C.1) that establishes the constraints on the parameters that allow us to create hard instances. We then move on to describe our constructions, including both the distribution for applying Yao’s principle, and the implementation of the oracle for answering random edge queries. We prove our main lemma for our construction, and lastly, we give the appropriate parameters that complete the proof of our lower bounds.

C.1 Graph Representations

Consider the following representation of an adjacency list for an undirected graph . Let us say that each vertex has ports numbered attached, where the port of vertex is identified by a pair , which is used as an index for . imposes a perfect matching between these ports; namely, indicates that ports and are matched to each other, and this implies as well. We use to define the adjacency list of our graph; that is, if then the neighbor of is (and vice versa). Note that there can be many such representations of , and some perfect matchings between ports may yield graphs with parallel edges or self-loops. Furthermore, each edge is associated with a unique pair of matched cells.
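The port-based representation above can be sketched in code. This is a minimal illustration under assumed conventions: ports are 0-indexed pairs `(vertex, port)`, ports are assigned in edge-insertion order, and the helper names `build_matching`, `neighbor`, and `degree` are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

Port = Tuple[int, int]  # (vertex, port index), both 0-based in this sketch


def build_matching(edges: List[Tuple[int, int]]) -> Dict[Port, Port]:
    """Assign the next free port at each endpoint and match the two ports symmetrically."""
    next_port: Dict[int, int] = {}
    matching: Dict[Port, Port] = {}
    for u, v in edges:
        pu = (u, next_port.get(u, 0))
        next_port[u] = pu[1] + 1
        pv = (v, next_port.get(v, 0))
        next_port[v] = pv[1] + 1
        matching[pu] = pv
        matching[pv] = pu
    return matching


def neighbor(matching: Dict[Port, Port], v: int, i: int) -> Optional[int]:
    """Neighbor query: the vertex at the other end of port i of v, or None if the port is unused."""
    other = matching.get((v, i))
    return None if other is None else other[0]


def degree(matching: Dict[Port, Port], v: int) -> int:
    """Degree query: the number of matched ports attached to v."""
    return sum(1 for (u, _) in matching if u == v)
```

Because the matching is stored in both directions, each edge corresponds to exactly one pair of matched cells, as in the text.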

C.2 Main Lemma

Our proof proceeds in two steps. First, we show the following lemma that applies to certain parameters of graphs.

Lemma C.1

Let be positive parameters satisfying the following properties: and are even, and . Let , and define the following two families of graphs on vertices:

  • : all graphs containing vertices of degree and isolated vertices;

  • : all graphs containing vertices of degree and vertices of degree .

Let and . Then, there exists a distribution of representations of graphs from such that for any deterministic algorithm that makes at most total neighbor queries, degree queries and random edge queries, on the graph representation randomly drawn from , cannot correctly identify whether the given representation is of a graph from or with probability at least .

By applying Yao’s principle, the following corollary is implied.

Corollary C.2

Let be parameters satisfying the properties specified in Lemma C.1. Let and . If and for some constant , then any (randomized) algorithm for approximating to a multiplicative factor via neighbor queries, degree queries and random edge queries with probability of success at least requires queries for .

As a second step, we propose a few sets of parameters for different ranges of . Applying Corollary C.2, this yields lower bounds for the remaining ranges of .

C.3 Our Constructions

C.3.1 Construction of

We prove this lemma by explicitly constructing the distribution.

Construction of graph representations for . We now define the representation for the graph as follows. We let be the vertices with degree . Let us refer to the pair of consecutive columns (with indices and ) as the slab. Then, in the slab, we match each cell on the left column with the cell at distance below on the right column. Figure 1 illustrates the matching of cells in the first few columns of . More formally, for each integer and , we match the cells and in .

Since is even, this construction fills the entire table of . We wish to claim that we do not create any parallel edges with this construction. Clearly, this is true within a slab. For different slabs, recall that we map cells in the slab with those at vertical distance away. Thus, it suffices to note that no pair of slabs uses the same distance mod . Equivalently, we can note that as the maximum distance is and by our assumption, the set of distances for are all disjoint. That is, our construction creates no parallel edges or self-loops.
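The slab construction can be checked mechanically on small instances. The sketch below makes assumptions that the extracted text does not confirm: the table has `n` rows and `d` columns (1-indexed columns, 0-indexed rows), slab `j` consists of columns `2j-1` and `2j`, the matching distance in slab `j` is `j`, and `d < n`. All function names are hypothetical.

```python
def build_slab_matching(n: int, d: int) -> dict:
    """Match cell (r, 2j-1) with cell ((r + j) % n, 2j) for each slab j = 1..d/2."""
    assert d % 2 == 0 and d < n
    matching = {}
    for j in range(1, d // 2 + 1):
        for r in range(n):
            left = (r, 2 * j - 1)
            right = ((r + j) % n, 2 * j)
            matching[left] = right
            matching[right] = left
    return matching


def is_simple(matching: dict) -> bool:
    """Return True if the matched cells define no self-loops and no parallel edges."""
    seen = set()
    for p, q in matching.items():
        if p > q:
            continue  # visit each matched pair once
        u, v = p[0], q[0]
        if u == v:
            return False  # self-loop
        edge = (min(u, v), max(u, v))
        if edge in seen:
            return False  # parallel edge
        seen.add(edge)
    return True
```

Under these assumptions every row has all `d` of its cells matched, and distinct slabs use distinct distances modulo `n`, which is the property the argument above relies on.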

Figure 1: first few columns of

Construction of graph representations for . Next, for each integer and , we define a graph with corresponding representation by modifying as follows. First, recall that we need to add neighbors to the previously isolated vertices . These neighbors are represented as a table of size in ; it is shown as the green rectangle in Figure 2(a), which is not present in . We match the cells in this new table to a subtable of size , shown as the yellow rectangle in Figure 2(a). The top-left cell of this subtable corresponds to the index in ; note that if or , this subtable may wrap around as shown in Figure 2(b). Since and , the dimensions of this yellow rectangle do not exceed those of the original table in .

Figure 2: Comparison between tables and . (a) and (b) show two different possibilities for depending on the values of and .

Now we explain how we match the cells. We map cells between the yellow and green subtables in a transposed fashion. That is, the cell with index (relative to the green table) is mapped to the yellow cell with index (relative to the yellow subtable), as shown in Figure 3 (a). This method guarantees that no two rows contain two pairs of matched cells between them. As a result, we do not create any parallel edges or self-loops.
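The transposed mapping can be illustrated with relative coordinates only. This sketch assumes a green table with `rows` rows and `cols` columns and a yellow subtable of the transposed shape; offsets and wrap-around are ignored, and the labels `"green"` and `"yellow"` are purely illustrative.

```python
def transposed_matching(rows: int, cols: int) -> dict:
    """Pair green cell (i, j) with yellow cell (j, i); green is rows x cols, yellow is cols x rows."""
    return {("green", i, j): ("yellow", j, i) for i in range(rows) for j in range(cols)}


def row_pairs_are_distinct(matching: dict) -> bool:
    """Check that each (green row, yellow row) pair is joined by at most one matched pair of cells."""
    pairs = [(g[1], y[1]) for g, y in matching.items()]
    return len(pairs) == len(set(pairs))
```

Since green cell `(i, j)` lands in yellow row `j`, each green row meets each yellow row exactly once, which is why no parallel edges arise between the two subtables.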

Figure 3: matchings in

As we place the yellow subtable, some edges originally in may now have only one endpoint in the yellow subtable. We refer to the cells in the table that correspond to such edges as unmatched. Since is even and we set our offset to , every slab either does not overlap with the yellow subtable, or overlaps in the exact same rows for both columns of the slab. Thus, the only edges that have one endpoint in the yellow subtable are those that go from a cell above it to one in it. Roughly speaking, we still map the cells in the same way but ignore the distance it takes to skip over the yellow subtable. More formally, in the slab, we pair unmatched cells from the left and right columns that are at vertical distance away (instead of ), as shown with the red edges in Figure 3 (b).

Now the set of distances between the cells corresponding to an edge in the slab are , since distances can be measured both by going down and by going up and looping around. From our assumption, and , so no distance is shared by multiple slabs, and hence there are no parallel edges or self-loops.

Permutation of graph representations. Let be a permutation over . (A permutation over is a bijection .) Given a graph representation , we define as a new representation of the same underlying graph, such that the indices of the vertices are permuted according to . We may consider this operation as an interface to the original oracle. Namely, any query made on a vertex index is translated into a query for index to the original oracle. If a vertex index is an answer from the oracle, then we return instead.
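The permutation can be implemented as a thin wrapper around the original oracle. In this sketch the direction of translation is an assumption: an incoming public index is mapped back through the inverse permutation, and any vertex returned by the original oracle is mapped forward through `pi`. The class name `PermutedOracle` and the plain adjacency-list input are hypothetical.

```python
from typing import List, Optional


class PermutedOracle:
    """Answer degree and neighbor queries on a relabeled copy of a fixed adjacency list."""

    def __init__(self, adjacency: List[List[int]], pi: List[int]) -> None:
        # adjacency[v] lists the neighbors of original vertex v, in port order.
        # pi[v] is the public index under which original vertex v is exposed.
        self.adjacency = adjacency
        self.pi = pi
        self.inverse = [0] * len(pi)
        for old, new in enumerate(pi):
            self.inverse[new] = old

    def degree(self, x: int) -> int:
        """Degree of the vertex with public index x."""
        return len(self.adjacency[self.inverse[x]])

    def neighbor(self, x: int, i: int) -> Optional[int]:
        """Public index of the i-th neighbor of the vertex with public index x, if any."""
        neighbors = self.adjacency[self.inverse[x]]
        return self.pi[neighbors[i]] if i < len(neighbors) else None
```

The algorithm only ever sees public indices, so a uniformly random `pi` hides which original vertex a fresh index refers to.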

The distribution . Let denote the set of all permutations over . We define formally as follows: for any permutation , the representation corresponding to is drawn from with probability , and each representation corresponding to is drawn with probability for every . In other words, to draw a random instance from , we flip an unbiased coin to choose between families and . We obtain a representation if we choose ; otherwise we pick a random representation for . Lastly, we apply a random permutation to this representation.

C.3.2 Answering Random Edge Queries

Notice that Yao’s principle allows us to remove randomness used by the algorithm, but the randomness of the oracle remains for the random edge queries. For any representation we draw from , the oracle must return an edge uniformly at random for each random edge query. Nonetheless, we may choose our own implementation of the oracle as long as this condition is ensured. We apply a coupling argument that imposes dependencies between the behaviors of our oracle when the underlying graph is from and when it is from . Let and denote the number of edges of graphs from and , respectively.

Our oracle works differently depending on which family the graph comes from. The following describes the behavior of our oracle for a single query, and note that all queries should be evaluated independently.

Query to . We simply return an edge chosen uniformly at random. That is, we pick a random matched pair of cells in , and return the vertices corresponding to the rows of those cells.

Query to . Let denote the number of edges shared by both and . With probability , we return the same edge we choose for . Otherwise, we return an edge chosen uniformly at random from the set of edges in but not in .
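One way to realize such a coupling is sketched below. It is an illustration under stated assumptions, not necessarily the paper's exact oracle: edges are canonical tuples, `E1` and `E2` are the edge lists of the two graphs, and when the edge drawn for the first graph is not shared, the sketch falls back to a uniformly random shared edge so that the second marginal stays uniform. The function name `coupled_edges` is hypothetical.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def coupled_edges(E1: List[Edge], E2: List[Edge], rng: random.Random) -> Tuple[Edge, Edge]:
    """Return (e1, e2) with e1 uniform over E1 and e2 uniform over E2, coupled to coincide often."""
    shared_set = set(E1) & set(E2)
    shared = sorted(shared_set)
    only_second = sorted(set(E2) - set(E1))
    e1 = rng.choice(E1)
    if rng.random() < len(shared) / len(E2):
        # Reuse e1 if it is also an edge of the second graph (assumed fallback otherwise).
        e2 = e1 if e1 in shared_set else rng.choice(shared)
    else:
        e2 = rng.choice(only_second)
    return e1, e2
```

With this choice each shared edge is returned for the second graph with probability 1/|E2|, and so is each edge outside the first graph, so both marginals are uniform while the two answers agree whenever the first draw is shared and the coin selects the shared branch.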

Our oracle clearly returns an edge chosen uniformly at random from the corresponding representation. The benefit of using this coupling oracle is that we increase the probability that the same edge is returned to . By our construction, the cells in that are modified to obtain are fully contained within the subtable of size obtained by extending the yellow subtable to include more rows above and below. Thus, our oracle may only return a different edge with probability

C.4 Proof of Lemma C.1

Recall that we consider a deterministic algorithm that makes at most queries. We may describe the behavior between and the oracle with its query-answer history. Notice that since is deterministic, if every answer that receives from the oracle is the same, then must return the same answer, regardless of the underlying graph. Our general approach is to show that for most permutations , running with instance will result in the same query-answer history as running with for most random parameters and . If these histories are equivalent, then may answer correctly for only roughly half of the distribution.

Throughout this section, we refer to our indices before applying to the representation. We bound the probability that the query-answer histories are different using an inductive argument as follows. Suppose that at some point during the execution of , the history only contains vertices of indices from , and all cells in the history are matched in the same way in both and . This inductive hypothesis restricts the possible parameters and to those that yield the same history up to this point. We now consider the probability that the next query-answer pair differs, and aim to bound this probability by .

Firstly, we consider a degree query. By our hypothesis, for a vertex of index outside to be queried, must specify a vertex it has not chosen before. Notice that may learn about up to 2 vertices from each query-answer pair, so at least vertices have never appeared in the history. Since we pick a random permutation for our construction, the probability that the queried vertex has index outside is . As , we have and our probability simplifies to at most

Next, we consider a neighbor query. From the argument above, with probability