Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs \titlenoteThe authors would like to acknowledge support from NSF CCF-1344364, NSF CCF-1344179, ARO YIP W911NF-14-1-0258, DARPA XDATA, and research gifts by Google and Docomo. This work will be presented in part at KDD’15.


Ethan R. Elenberg
The University of Texas
Austin, Texas 78712, USA
Karthikeyan Shanmugam
The University of Texas
Austin, Texas 78712, USA
elenberg@utexas.edu karthiksh@utexas.edu
   Michael Borokhovich
The University of Texas
Austin, Texas 78712, USA
Alexandros G. Dimakis
The University of Texas
Austin, Texas 78712, USA
michaelbor@utexas.edu dimakis@austin.utexas.edu
Abstract

We study the problem of approximating the 3-profile of a large graph. 3-profiles are generalizations of triangle counts that specify the number of times a small graph appears as an induced subgraph of a large graph. Our algorithm uses the novel concept of 3-profile sparsifiers: sparse graphs that can be used to approximate the full 3-profile counts for a given large graph. Further, we study the problem of estimating local and ego 3-profiles, two graph quantities that characterize the local neighborhood of each vertex of a graph.

Our algorithm is distributed and operates as a vertex program over the GraphLab PowerGraph framework. We introduce the concept of edge pivoting which allows us to collect 2-hop information without maintaining an explicit 2-hop neighborhood list at each vertex. This enables the computation of all the local 3-profiles in parallel with minimal communication.

We test our implementation in several experiments scaling up to 640 cores on Amazon EC2. We find that our algorithm can estimate the 3-profile of a graph in approximately the same time as triangle counting. For the harder problem of ego 3-profiles, we introduce an algorithm that can estimate the profiles of hundreds of thousands of vertices in parallel, in the timescale of minutes.


Categories and Subject Descriptors

G.2.2 [Graph Theory]: Graph Algorithms; C.2.4 [Distributed Systems] Distributed Applications

Keywords

3-profiles; Graph Sparsifiers; Motifs; Graph Engines; GraphLab; Distributed Systems; Graph Analytics

1 Introduction

Given a small integer k (e.g. k = 3 or k = 4), the k-profile of a graph G is a vector with one coordinate for each distinct k-node graph H_i (see Figure 1 for k = 3). Each coordinate counts the number of times H_i appears as an induced subgraph of G. For example, the graph K_4 (the complete graph on 4 vertices) has the 3-profile [0, 0, 0, 4] since it contains 4 triangles and no other (induced) 3-node subgraphs. The graph C_5 (the cycle on 5 vertices, i.e. a pentagon) has the 3-profile [0, 5, 5, 0]. Note that the sum of the 3-profile is always (n choose 3), the total number of 3-node subgraphs.
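As a concrete (and deliberately naive) illustration, the definition can be checked by brute-force enumeration of vertex triples. The helper below is our own O(n^3) toy sketch, not the paper's distributed algorithm; it reproduces the pentagon example:

```python
from itertools import combinations

def three_profile(vertices, edges):
    """Count induced 3-node subgraphs (empty, edge, wedge, triangle)
    by brute-force enumeration of all vertex triples."""
    edge_set = {frozenset(e) for e in edges}
    profile = [0, 0, 0, 0]
    for triple in combinations(vertices, 3):
        # The number of edges present among the three vertices selects H_0..H_3.
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        profile[k] += 1
    return profile

# K_4: every triple induces a triangle.
k4 = three_profile(range(4), [e for e in combinations(range(4), 2)])
# C_5 (pentagon): 5 single edges and 5 wedges, no triangles or empty triples.
c5 = three_profile(range(5), [(i, (i + 1) % 5) for i in range(5)])
```

Note that in each case the four counts sum to (n choose 3), consistent with the definition.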

One can see k-profiles as a generalization of triangle (as well as other motif) counting problems. They are increasingly popular for graph analytics both for practical and theoretical reasons. They form a concise graph description that has found several applications for the web [4, 25], social networks [39], and biological networks [27] and seem to be empirically useful. Theoretically, they connect to the emerging theory of graph homomorphisms, graph limits and graphons [6, 39, 21].

Figure 1: Subgraphs in the 3-profile of a graph G. We call them H_0, H_1, H_2, H_3 (empty, edge, wedge, triangle). The 3-profile of a graph G counts how many times each of H_0, …, H_3 appears in G.

In this paper we introduce a novel distributed algorithm for estimating the 3-profiles of massive graphs. In addition to estimating the (global) 3-profile, we address two more general problems. One is calculating the local 3-profile for each vertex v. This assigns a vector to each vertex v that counts how many times v participates in each subgraph H_i. These local vectors contain a higher resolution description of the graph and are used to obtain the global 3-profile (simply by rescaled addition as we will discuss).

The second related problem is that of calculating the ego 3-profile for each vertex v. This is the 3-profile of the graph induced by the neighbors of v, also called the ego graph of v. The 3-profile of the ego graph of v can be seen as a projection of the vertex into a 4-dimensional coordinate system [39]. This is a very interesting idea of viewing a big graph as a collection of small dense graphs, in this case the ego graphs of the vertices. Note that calculating the ego 3-profiles for a set of vertices of a graph is different from (in fact, significantly harder than) calculating local 3-profiles.

Contributions: Our first contribution is a provable edge sub-sampling scheme: we establish sharp concentration results for estimating the entire 3-profile of a graph. This allows us to randomly discard most edges of the graph and still have 3-profile estimates that are provably within a bounded error with high probability. Our analysis is based on modeling the transformation from original to sampled graph as a one-step Markov chain with transitions expressed as a function of the sampling probability. Our result is that a random sampling of edges forms a 3-profile sparsifier, i.e. a subgraph that preserves the elements of the 3-profile with sufficient probability concentration. Our result is a generalization of the triangle sparsifiers by Tsourakakis et al. [38]. Our proof relies on a result by Kim and Vu [15] on concentration of multivariate polynomials, similarly to [38]. Unfortunately, the Kim and Vu concentration holds only for a class of polynomials called totally positive and some terms in the 3-profile do not satisfy this condition. For that reason, the proof of [38] does not directly extend beyond triangles. Our technical innovation involves showing that it is still possible to decompose our polynomials as combinations of totally positive polynomials using a sequence of variable changes.

Our second innovation deals with designing an efficient, distributed algorithm for estimating 3-profiles on the sub-sampled graph. We rely on the Gather-Apply-Scatter model used in GraphLab PowerGraph [9] but, more generally, our algorithm fits the architecture of most graph engines. We introduce the concept of edge pivoting which allows us to collect 2-hop information without maintaining an explicit 2-hop neighborhood list at each vertex. This enables the computation of all the local 3-profiles in parallel. Each edge requires only information from its endpoints and each vertex only computes quantities using data from incident edges. For the problem of ego 3-profiles, we show how to calculate them by combining edge pivot equations and local clique counts.

We implemented our algorithm in GraphLab and performed several experiments scaling up to 640 cores on Amazon EC2. We find that our algorithm can estimate the 3-profile of a graph in approximately the same time as triangle counting. Specifically, we compare against the PowerGraph triangle counting routine and find that computing the full 3-profile takes only slightly more time. For the significantly harder problem of ego 3-profiles, we were able to compute (in parallel) the 3-profiles of hundreds of thousands of ego graphs in the timescale of several minutes. We compare our parallel ego 3-profile algorithm to a simple baseline that operates on each ego graph sequentially and observe tremendous scalability benefits, as expected. Our datasets involve social network and web graphs with edges ranging in number from tens of millions to over one billion. We present results on both overall runtimes and network communication on multicore and distributed systems.

2 Related Work

In this section, we describe several related topics and discuss differences in relation to our work.

Graph Sub-Sampling: Random edge sub-sampling is a natural way to quickly obtain estimates for graph parameters. For the case of triangle counting, such subgraphs are called triangle sparsifiers [38]. Related ideas were explored in the Doulion algorithm [36, 37, 38] with increasingly strong concentration bounds. The recent work by Ahmed et al. [1] develops subgraph estimators for clustering coefficient, triangle count, and wedge count in a streaming sub-sampled graph. Other recent work [32, 13, 33, 5, 14] uses random sampling to estimate parts of the 3- and 4-profile. These methods do not account for a distributed computation model and require more complex sampling rules. As discussed, our theoretical results build on [38] to define the first 3-profile sparsifiers, sparse graphs that are a fortiori triangle sparsifiers.

Triangle Counting in Graph Engines: Graph engines (e.g. Pregel, GraphLab, Galois, GraphX, see [30] for a comparison) are frameworks for expressing distributed computation on graphs in the language of vertex programs. Triangle counting algorithms [31, 4] form one of the standard graph analytics tasks for such frameworks [9, 30]. In [7], the authors list triangles efficiently by partitioning the graph into components and processing each component in parallel. Typically, it is much harder to perform graph analytics over the MapReduce framework, but some recent work [26, 35] has used clever partitioning and provided theoretical guarantees for triangle counting.

Matrix Formulations: Fast matrix multiplication has been used for certain types of subgraph counting. Alon et al. proposed a cycle counting algorithm which uses the trace of a matrix power on high degree vertices [2]. Some of our edge pivot equations have appeared in [16, 17, 40], all in a centralized setting. Related approximation schemes [36] and randomized algorithms [40] depend on centralized architectures and computing matrix powers of very large matrices.

Frequent Subgraph Discovery: The general problem of finding frequent subgraphs, also known as motifs or subgraph isomorphisms, is to find the number of occurrences of a small query graph within a larger graph. Typically frequent subgraph discovery algorithms offer pruning rules to eliminate false positives early in the search [41, 19, 10, 28, 29]. This is most applicable when subgraphs have labelled vertices or directed edges. For these problems, the number of unique isomorphisms grows much larger than in our application.

In [39], subgraphs were queried on the ego graphs of users. While enumerating all 3-sets and sampling 4-sets of neighbors can be done in parallel, forming the ego subgraphs requires checking for edges between neighbors. This suggests that a graph engine implementation would be highly preferable to an Apache Hive system. Our algorithms simultaneously compute the ego subgraphs and their profiles, reducing the amount of communication between nodes. Our algorithm is suitable for both NUMA multicore and distributed architectures, but our implementation focus in this paper is on GraphLab.

Graphlets: First described in [27], graphlets generalize the concept of vertex degree to include the connected subgraphs a particular vertex participates in with its neighbors. Unique graphlets are defined at a vertex based on its degree in the subgraph. Graphlet frequency distributions (GFDs) have proven extremely useful in the field of bioinformatics. Specifically, GFD analysis of protein interaction networks helps to design improved generative models [11], accurate similarity measures [27], and better features for classification [24, 34]. Systems that use our edge pivot equations (in a different form) appear in prior literature for calculating GFDs [12, 22] but not for enabling distributed computation.

3 Unbiased 3-profile Estimation

In this section, we are interested in estimating the number of 3-node subgraphs of type H_0, H_1, H_2, and H_3, as depicted in Figure 1, in a given graph G. Let the estimated counts be denoted \hat{n}_0, \hat{n}_1, \hat{n}_2, \hat{n}_3. Let the actual counts be n_0, n_1, n_2, n_3. This set of counts is called the 3-profile of the graph, denoted with the following vector:

x(G) = [ n_0(G), n_1(G), n_2(G), n_3(G) ]^T    (1)

Because the vector is a scaled probability distribution (its entries sum to the total number of 3-node subgraphs), there are only three degrees of freedom. Therefore, it suffices to calculate three of the entries; the fourth follows from the sum constraint.
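Since the four entries sum to (n choose 3), any one entry can be recovered from the other three. A minimal sketch (the helper name `complete_profile` is ours, for illustration only):

```python
from math import comb

def complete_profile(n_vertices, n1, n2, n3):
    """Recover the empty-triple count n0 from the constraint that the
    four 3-profile entries sum to C(n, 3)."""
    n0 = comb(n_vertices, 3) - n1 - n2 - n3
    return [n0, n1, n2, n3]
```

For the pentagon C_5, `complete_profile(5, 5, 5, 0)` recovers the empty-triple count 0 from the other three entries.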

In the case of a large graph, the computational difficulty of estimating the 3-profile depends on the total number of edges in the large graph. We would like to estimate each of the 3-profile counts within a small multiplicative factor. So we first sub-sample the set of edges in the graph, keeping each edge independently with probability p. We compute all 3-profile counts of the sub-sampled graph exactly. Let Y_0, Y_1, Y_2, Y_3 denote the exact 3-profile counts of the random sub-sampled graph. We relate the sub-sampled 3-profile counts to the original ones through a one-step Markov chain involving transition probabilities. The sub-sampling process is the random step in the chain. Any specific subgraph is preserved with some probability and otherwise transitions to one of the other subgraphs. For example, a triangle is preserved with probability p^3. Figure 2 illustrates the other transition probabilities.

Figure 2: Edge sampling process.

In expectation, this yields the following linear system:

E[Y_0] = n_0 + (1-p) n_1 + (1-p)^2 n_2 + (1-p)^3 n_3
E[Y_1] = p n_1 + 2p(1-p) n_2 + 3p(1-p)^2 n_3        (2)
E[Y_2] = p^2 n_2 + 3p^2 (1-p) n_3
E[Y_3] = p^3 n_3

from which we obtain unbiased estimators for each entry of the 3-profile:

\hat{n}_3 = Y_3 / p^3    (3)
\hat{n}_2 = Y_2 / p^2 - 3(1-p) \hat{n}_3    (4)
\hat{n}_1 = Y_1 / p - 2(1-p) \hat{n}_2 - 3(1-p)^2 \hat{n}_3    (5)
\hat{n}_0 = Y_0 - (1-p) \hat{n}_1 - (1-p)^2 \hat{n}_2 - (1-p)^3 \hat{n}_3    (6)
Lemma 1

The vector (\hat{n}_0, \hat{n}_1, \hat{n}_2, \hat{n}_3) is an unbiased estimator of (n_0, n_1, n_2, n_3).

{proof}

By substituting (3)–(6) into (2), clearly E[\hat{n}_i] = n_i for i = 0, 1, 2, 3.
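The whole pipeline can be sketched sequentially in a few lines. The code below is our own illustration (single machine, brute-force counting of the sampled graph); the closed-form inversions are written out from the transition probabilities stated above (a triangle survives with probability p^3, degrades to a wedge with probability 3p^2(1-p), and so on), and should be read as a reconstruction of the estimators (3)–(6), not a transcription of the paper's implementation:

```python
import random
from itertools import combinations

def sampled_estimate(vertices, edges, p, seed=0):
    """Sub-sample each edge independently with probability p, count the
    3-profile of the sampled graph exactly, then invert the expected
    sampling transitions to produce unbiased estimates."""
    rng = random.Random(seed)
    kept = {frozenset(e) for e in edges if rng.random() < p}
    y = [0, 0, 0, 0]  # sampled counts Y_0..Y_3
    for t in combinations(vertices, 3):
        k = sum(frozenset(pair) in kept for pair in combinations(t, 2))
        y[k] += 1
    q = 1 - p
    # Invert the transition system top-down, starting from triangles.
    n3 = y[3] / p**3
    n2 = y[2] / p**2 - 3 * q * n3
    n1 = y[1] / p - 2 * q * n2 - 3 * q**2 * n3
    n0 = y[0] - q * n1 - q**2 * n2 - q**3 * n3
    return [n0, n1, n2, n3]

# Sanity check: with p = 1 the estimate equals the exact 3-profile.
pentagon = [(i, (i + 1) % 5) for i in range(5)]
exact = sampled_estimate(range(5), pentagon, p=1.0)
```

With p < 1 the estimates are random but unbiased; averaging over many seeds recovers the true counts in expectation.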

We now turn to prove concentration bounds for the above estimators. We introduce some notation for this purpose. Let Y = Y(t_1, …, t_n) be a real polynomial function of real random variables t_1, …, t_n. For a multi-set A of variables, let ∂_A Y denote the partial derivative of Y with respect to the variables in A, and define

E_{≥1}(Y) = max_{|A| ≥ 1} E[∂_A Y],   E_{≥0}(Y) = max( E[Y], E_{≥1}(Y) ).    (7)

Further, we call a polynomial totally positive if the coefficients of all the monomials involved are non-negative. We state the main technical tool we use to obtain our concentration results.

Proposition 1 (Kim-Vu Concentration [15])

Let Y be a random totally positive Boolean polynomial in n Boolean random variables with degree at most k. Then

P( |Y − E[Y]| > a_k ( E_{≥0}(Y) E_{≥1}(Y) )^{1/2} λ^k ) = O( exp(−λ + (k−1) log n) )    (8)

for any λ > 1, where a_k = 8^k (k!)^{1/2}.

The above proposition was used to analyze subgraph counts of Erdős-Rényi random ensembles G(n, p) in [15]. Later, it was used to derive concentration bounds for triangle sparsifiers in [38]. Here, we extend §4.3 of [15] to the 3-profile estimation process on an arbitrary edge-sampled graph.

Theorem 1

(Generalization of triangle sparsifiers to 3-profile sparsifiers) Let (n_0, n_1, n_2, n_3) be the 3-profile of a graph G with m edges, and let ε > 0. Let (Y_0, Y_1, Y_2, Y_3) be the 3-profile of the subgraph obtained by sampling each edge in G with probability p. Let Δ_1, Δ_2, and Δ_3 be the sizes of the largest collections of H_1's, wedges, and triangles that share a common edge. Define \hat{n}_0, …, \hat{n}_3 according to (3)–(6). If p and ε satisfy:

(9)

then each estimate \hat{n}_i is within a (1 ± ε) multiplicative factor of n_i, with high probability.

{proof}

The full proof can be found in the appendix. Note that, as we mentioned, this proof uses a new polynomial decomposition technique to apply the Kim-Vu concentration.

The sampling probability in Theorem 1 depends poly-logarithmically on the number of edges and linearly on the fraction of each subgraph which occurs on a common edge. For example, if all of the wedges in G depend on a single edge, i.e. Δ_2 = n_2, then condition (9) suggests the presence of that particular edge in the sampled graph will dominate the overall sparsifier quality.

4 Local 3-profile Calculation

In this section, we describe how to obtain two types of 3-profiles for a given graph in a deterministic manner. These algorithms are distributed and can be applied independently of the edge sampling described in Section 3.

The key to our approach is to identify subgraphs at a vertex based on the degree with which it participates in the subgraph. From the perspective of a given vertex v, there are actually six distinct 3-node subgraphs up to isomorphism, as given in Figure 3. Let n_0(v), n_1^d(v), n_1^e(v), n_2^e(v), n_2^c(v), and n_3(v) denote the corresponding local subgraph counts at v (empty, edge with v disjoint, edge with v incident, wedge with v as endpoint, wedge with v as center, triangle). We will first outline an approach that calculates these counts and then add across different vertex perspectives to calculate the final scalars (n_0, n_1, n_2, and n_3). It is easy to see that the global counts can be obtained from these local counts by summing across vertices:

n_0 = (1/3) Σ_v n_0(v),   n_1 = Σ_v n_1^d(v),   n_2 = Σ_v n_2^c(v),   n_3 = (1/3) Σ_v n_3(v)    (10)

Figure 3: Unique 3-subgraphs from a vertex perspective (white vertex corresponds to v).

4.1 Distributed Local 3-profile

We will now give our approach for calculating the local 3-profile counts of each vertex v using only local information combined with the global vertex and edge counts.

Scatter: We assume that every edge (a, b) has access to the neighborhood sets of both a and b, i.e. N(a) and N(b). Therefore, intersection sizes are first calculated at every edge, i.e. |N(a) ∩ N(b)|. Each edge computes the following scalars and stores them:

(11)

The computational effort at every edge is at most proportional to the maximum degree of the graph, for the neighborhood intersection size.

Gather: In the next round, each vertex v "gathers" the above scalars in the following way:

(12)

Here, relations (a) and (b) are because triangles and wedges from center v are double counted. (c) comes from noticing that each triangle and wedge from endpoint v excludes an extra edge from forming an H_1. In this gather stage, the communication complexity is proportional to the number of edges, under the assumption that the graph is stored over different machines. The corresponding distributed algorithm is described in Algorithm 1.
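The scatter/gather pattern can be sketched sequentially. The code below is our own simplified, single-machine illustration covering the triangle and wedge counts only; the per-vertex formulas are derived from the double-counting argument above rather than copied from equation (12):

```python
def local_counts(vertices, edges):
    """Edge-pivoting sketch: each edge stores one scalar (the size of its
    endpoints' common neighborhood), then each vertex combines only these
    per-edge scalars with degrees to obtain its local counts."""
    nbrs = {v: set() for v in vertices}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    # Scatter: one scalar per edge, computable from the two endpoint lists.
    common = {frozenset((a, b)): len(nbrs[a] & nbrs[b]) for a, b in edges}
    out = {}
    for v in vertices:
        d = len(nbrs[v])
        # Gather: each triangle at v is seen once per each of its 2 incident edges.
        tri = sum(common[frozenset((v, u))] for u in nbrs[v]) // 2
        wedge_center = d * (d - 1) // 2 - tri  # neighbor pairs minus triangles
        # Paths v-u-w; each triangle is counted twice here, so subtract 2*tri.
        wedge_end = sum(len(nbrs[u]) - 1 for u in nbrs[v]) - 2 * tri
        out[v] = (tri, wedge_center, wedge_end)
    return out

# C_5: every vertex centers 1 wedge and ends 2 wedges, with no triangles.
lc = local_counts(range(5), [(i, (i + 1) % 5) for i in range(5)])
```

Note that each vertex touches only degrees and per-edge scalars, mirroring the edge pivoting idea: no explicit 2-hop neighbor list is ever built.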

4.2 Distributed Ego 3-profile

In this section, we give an approach to compute ego 3-profiles for a set of vertices in G. For each vertex v, the algorithm returns a 3-profile corresponding to that vertex's ego graph, the subgraph induced by the neighborhood set N(v), including edges between neighbors and excluding v itself. Formally, our goal is to compute the 3-profile of the induced subgraph on N(v) for every ego vertex v. Clearly, this can be accomplished in two steps repeated serially on all ego vertices: first obtain the ego subgraph and then pass it as input to Algorithm 1, summing over the ego vertices to get a global count. The serial implementation is provided in Algorithm 2. We note that this was essentially done in [39], where ego subgraphs were extracted from a common graph separately from 3-profile computations.

Figure 4: 4-subgraphs for ego 3-profiles (white vertex corresponds to v).

Instead, Algorithm 3 provides a parallel implementation which solves the problem by finding cliques in parallel for all ego vertices. The main idea behind this approach is to realize that calculating the 3-profile on the induced subgraph is exactly equivalent to computing specific 4-node subgraph frequencies among v and 3 of its neighbors, enumerated as in Figure 4. Now, the aim is to calculate these counts, effectively part of a local 4-profile.

Scatter: We assume that every edge (a, b) has already computed the scalars from (4.1). Additionally, every edge also computes the list N(a) ∩ N(b) instead of only its size. The computational complexity is still proportional to the maximum degree.

Gather: First, the vertex v "gathers" the following scalars, forming three edge pivot equations in four unknown variables:

(13)

By choosing two subgraphs that the edge participates in, and then summing over neighbors, these equations gather implicit connectivity information 2 hops away from v. However, note that there are only three equations in four variables and we must count one of them directly, namely the number of 4-cliques. Therefore, at the same gather step, the vertex v also creates the list of edges between its neighbors. Essentially, this is the list of edges in the subgraph induced by N(v). This requires worst case communication proportional to the number of edges in the ego graph, independent of the number of machines.

Scatter: Now, at the next scatter stage, each edge (a, b) accesses the pair of lists computed by its endpoints. Each edge computes the number of 4-cliques it is a part of, defined as follows:

(14)

This incurs additional computation time at each edge proportional to the sizes of the exchanged lists.

Gather: In the final gather stage, every vertex v accumulates these scalars to obtain its 4-clique count, requiring one more round of communication. As in the previous section, a scaling accounts for repeated counting. Finally, the vertex solves the equations (13) using the 4-clique count.
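For intuition, the serial strategy (Algorithm 2's approach) reduces to inducing the neighborhood subgraph and reusing a 3-profile count on it. A toy sketch (helper name is ours; brute-force counting stands in for Algorithm 1):

```python
from itertools import combinations

def ego_three_profile(vertices, edges, v):
    """Induce the subgraph on N(v) -- v itself excluded -- and count its
    3-profile by brute force, mimicking the serial ego approach."""
    nbrs = {u: set() for u in vertices}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    ego = sorted(nbrs[v])
    edge_set = {frozenset(e) for e in edges}
    profile = [0, 0, 0, 0]
    for t in combinations(ego, 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(t, 2))
        profile[k] += 1
    return profile

# 6-vertex wheel: hub 0 joined to a 5-cycle on vertices 1..5.
wheel = [(0, i) for i in range(1, 6)] + [(i, i % 5 + 1) for i in range(1, 6)]
hub = ego_three_profile(range(6), wheel, 0)   # ego graph of the hub is C_5
rim = ego_three_profile(range(6), wheel, 1)   # ego graph of a rim vertex
```

The hub's ego graph is the pentagon, so its ego 3-profile is [0, 5, 5, 0]; a rim vertex has three neighbors forming a single wedge. The parallel algorithm above computes the same quantities without ever materializing the induced subgraphs one at a time.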

5 Implementation and Results

In this section, we describe the implementation and the experimental results of the 3-prof, Ego-par and Ego-ser algorithms. We implement our algorithms on GraphLab v2.2 (PowerGraph) [9]. The performance (running time and network usage) of our 3-prof algorithm is compared with the Undirected Triangles Count Per Vertex (hereinafter referred to as trian) algorithm shipped with GraphLab. We show that in time and network usage comparable to the built-in trian algorithm, our 3-prof can calculate all the local and global 3-profiles. Then, we compare our parallel implementation of the ego 3-profile algorithm, Ego-par, with the naive serial implementation, Ego-ser. It appears that our parallel approach is much more efficient and scales much better than the serial algorithm. The sampling approach, introduced for the 3-prof algorithm, yields promising results – reduced running time and network usage while still providing excellent accuracy. We support our findings with several experiments over various datasets and systems.

Vertex Programs: Our algorithms are implemented using a standard GAS (gather, apply, scatter) model [9]. We implement the three functions gather(), apply(), and scatter() to be executed by each vertex. Then we signal subsets of vertices to run in a specific order.

  Input: Graph G with vertices V, edges E
  Gather: For each vertex v, union over incident edges of the ‘other’ endpoint of each edge, forming the neighbor list.
  Apply: Store the gather as vertex data v.nb, size automatically stored.
  Scatter: For each edge, compute and store scalars in (4.1).
  Gather: For each vertex v, sum the edge scalar data over incident edges.
  Apply: For each vertex v, calculate and store the quantities described in (12).
  return  [v: v.n0 v.n1 v.n2 v.n3]
Algorithm 1 3-prof
  Input: Graph G with vertices V, edges E, set of ego vertices
  for each ego vertex v do
     Signal v and its neighbors.
     Include an edge if both its endpoints are signaled.
     Run Algorithm 1 on the graph induced by the neighbors and edges between them.
  end for
  return  [v: vego.n0 vego.n1 vego.n2 vego.n3]
Algorithm 2 Ego-ser
  Input: Graph G with vertices V, edges E, set of ego vertices
  Gather: For each vertex v, union over incident edges of the ‘other’ endpoint of each edge, forming the neighbor list.
  Apply: Store the gather as vertex data v.nb, size automatically stored.
  Scatter: For each edge, compute and store as edge data:
      Scalars in (4.1).
      The list of common neighbors.
  Gather: For each vertex v, sum edge data over incident edges:
      Accumulate LHS of (13).
      Accumulate the list of edges between neighbors of v.
  Apply: Form the equations in (13) using the gathered scalars.
  Scatter: Scatter to all edges.
      Compute the 4-clique counts as in (14).
  Gather: Sum edge data of neighbors at v.
  Apply: Compute the ego 3-profile counts by solving (13).
  return  [v: vego.n0 vego.n1 vego.n2 vego.n3]
Algorithm 3 Ego-par

The Systems: We perform the experiments on three systems. The first system is a single power server, further referred to as Asterix. The server is equipped with 256 GB of RAM and two Intel Xeon E5-2699 v3 CPUs, 18 cores each. Since each core has two hardware threads, up to 72 logical cores are available to the GraphLab engine.

The next two systems are EC2 clusters on AWS (Amazon Web Services) [3]. One is comprised of 12 m3.2xlarge machines, each having 30 GB RAM and 8 virtual CPUs. Another system is a cluster of 20 c3.8xlarge machines, each having 60 GB RAM and 32 virtual CPUs.

The Data: In our experiments we used five real graphs. These graphs represent different datasets: social networks (LiveJournal and Twitter), citations (DBLP), knowledge content (Wikipedia), and WWW structure (PLD – pay level domains). Graph sizes are summarized in Table 1.

Name Vertices Edges (undirected)
Twitter [18]
PLD [23]
LiveJournal [20]
Wikipedia [8]
DBLP [20]
Table 1: Datasets

5.1 Results

Experimental results are averaged over multiple runs.

Local 3-profile vs. triangle count: The first result is that our 3-prof is able to compute all the local 3-profiles in almost the same time as GraphLab's built-in trian computes the local triangles (i.e., the number of triangles including each vertex). Let us start with the first AWS cluster with less powerful machines (m3.2xlarge). In Figure 5 (a) we can see that for the LiveJournal graph, for each sampling probability p and for each number of nodes (i.e., machines in the cluster), 3-prof achieves running times comparable to trian. Notice also the benefit in running time achieved by sampling: we can reduce running time almost by half without significantly sacrificing accuracy (which will be discussed shortly). While the running time decreases as the number of nodes grows (more computing resources become available), the network usage becomes higher (see Figure 5 (c)) due to the extensive inter-machine communication in GraphLab. We can also see that sampling can significantly reduce network usage. In Figures 5 (b) and (d), we can see similar behavior for the Wikipedia graph: the running time and network usage of 3-prof are comparable to trian.

Next, we conduct the experiments on the second AWS cluster with more powerful (c3.8xlarge) machines. For LiveJournal, we note modest improvements in running time for nearly the same network bandwidth observed in Figure 5. On this system we were able to run 3-prof and trian on the much larger PLD graph. In Figures 6 (b) and (d) we compare the running time and network usage of both algorithms. For the large PLD graph, the benefit of sampling can be seen clearly; lowering the sampling probability reduces both the running time and the network usage of 3-prof by a significant factor. Figure 7 shows the performance of 3-prof and trian on the LiveJournal and Wikipedia graphs. We can see that the behavior of running times and the network usage of the 3-prof algorithm is consistently comparable to trian across the various graphs, sampling, and system parameters.

Let us now show results of the experiments performed on a single powerful machine (Asterix). Figure 11 (a) shows the running times of 3-prof and trian for the Twitter and PLD graphs. We can see that on the largest graph in our dataset (Twitter), the running time of 3-prof is only modestly larger than that of trian, and for the PLD graph the difference is even smaller. Twitter takes roughly twice as long to compute as PLD, implying that these algorithms have running time proportional to the graph's number of edges.

Finally, we show that while the sampling approach can significantly reduce the running time and network usage, it has a negligible effect on the accuracy of the solution. Notice that the sampling accuracy refers to the global 3-profile count (i.e., the sum of all the local 3-profiles over all vertices in a graph). In Figure 12 we show the accuracy of each scalar in the 3-profile. As the accuracy metric, we use the ratio between the exact count (obtained by running 3-prof with p = 1) and the estimated count (i.e., the output of 3-prof when p < 1). It can be seen that for all three graphs, all the 3-profile ratios are very close to 1. E.g., for the PLD graph, even at the smallest sampling probability tested, the accuracy stays within a few percent of the ideal value of 1. Error bars mark one standard deviation from the mean, which remains small across all graphs. As p decreases, the triangle estimator suffers the greatest loss in both accuracy and consistency.

(a)
(b)
(c)
(d)
Figure 5: AWS m3_2xlarge cluster. 3-prof vs. trian algorithms for LiveJournal and Wikipedia datasets (averaged over multiple runs). 3-prof achieves comparable performance to triangle counting. (a,b) – Running time for various numbers of nodes (machines) and various sampling probabilities p. (c,d) – Network bytes sent by the algorithms for various numbers of nodes and various sampling probabilities p.
(a)
(b)
(c)
(d)
Figure 6: AWS c3_8xlarge cluster. 3-prof vs. trian algorithms for LiveJournal and PLD datasets (averaged over multiple runs). 3-prof achieves comparable performance to triangle counting. (a,b) – Running time for various numbers of nodes (machines) and various sampling probabilities p. (c,d) – Network bytes sent by the algorithms for various numbers of nodes and various sampling probabilities p.
(a)
(b)
Figure 7: AWS c3_8xlarge cluster with 20 nodes. 3-prof vs. trian results for LiveJournal and Wikipedia datasets (averaged over multiple runs). (a) – Running time for both graphs for various sampling probabilities p. (b) – Network bytes sent by the algorithms for both graphs for various sampling probabilities p.
(a)
(b)
Figure 8: AWS c3_8xlarge cluster. Ego-par vs. Ego-ser results for LiveJournal and PLD datasets (averaged over multiple runs). Running time of Ego-par scales well with the number of ego centers, while Ego-ser scales linearly.

Ego 3-profiles: The next set of experiments evaluates the performance of our Ego-par algorithm for counting ego 3-profiles. We show the performance of Ego-par for various graphs and systems and also compare it to the naive serial algorithm Ego-ser. Let us start with the AWS system (c3.8xlarge machines). In Figure 8 we see the running time of Ego-ser and Ego-par on the LiveJournal graph. The task was to find ego 3-profiles of 100, 1K, and 10K randomly selected nodes. Since the running time depends on the size and structure of each induced subgraph, Ego-ser and Ego-par operated on the same list of ego vertices. While for 100 random vertices Ego-ser performed well (and even achieved the same running time as Ego-par for the PLD graph), its performance drastically degraded for a larger number of vertices. This is due to its iterative nature – it finds ego 3-profiles of the vertices one at a time and is not scalable. Note that the open bars mean that this experiment was not finished. The numbers above them are extrapolations, which are reasonable due to the serial design of Ego-ser.

On the contrary, the Ego-par algorithm scales extremely well and computes ego 3-profiles for 100, 1K, and 10K vertices in almost the same time. In Figure 9 (a), we can see that as the number of nodes (i.e., machines) increases, the running time of Ego-par decreases since its parallel design allows it to use additional computational resources. However, Ego-ser cannot benefit from more resources and its running time even increases when more machines are used. The increase in running time of Ego-ser is due to the increase in network usage when using more machines (see Figure 9 (b)). The network usage of Ego-par also increases, but this algorithm compensates by leveraging additional computational power. In Figure 10, we can see that Ego-par performs well even when finding ego 3-profiles for all the LiveJournal vertices (4.8M vertices).

Finally, in Figure 11 (b) and (c), we can see the comparison of Ego-par and Ego-ser on the PLD and DBLP graphs on the Asterix machine. For both graphs, we see very good scaling of Ego-par, while the running time of Ego-ser scales linearly with the size of the ego vertices list.

(a)
(b)
(c)
Figure 9: AWS c3_8xlarge cluster. Ego-par vs. Ego-ser results for LiveJournal and Wikipedia datasets (averaged over multiple runs). Running time of Ego-par decreases with the number of machines due to its parallel design. Running time of Ego-ser does not decrease with the number of machines due to its iterative nature. Network usage increases for both algorithms with the number of machines.
(a)
(b)
Figure 10: AWS c3_8xlarge cluster with 20 nodes. Ego-par results for LiveJournal dataset (averaged over multiple runs). The algorithm scales well for various numbers of ego centers and even for the full ego centers list. (a) – Running time. (b) – Network bytes sent by the algorithm.
(a)
(b)
(c)
Figure 11: Asterix machine. Results for Twitter, PLD, and DBLP datasets. (a) – Running time of 3-prof vs. trian for various sampling probabilities p. (b,c) – Running time of Ego-par vs. Ego-ser for various numbers of ego centers. Results are averaged over multiple runs.
(a)
(b)
(c)
Figure 12: Global 3-profile accuracy achieved by the 3-prof algorithm for various graphs and for each profile count. Results are averaged over multiple iterations; error bars indicate one standard deviation. The metric is the ratio between the exact profile count (when p = 1) and the given output for p < 1. All results are very close to the optimum value of 1.

6 Conclusions

In summary, we have reduced several 3-profile problems to triangle and 4-clique finding in a graph engine framework. Our concentration theorem and experimental results confirm that local 3-profile estimation via sub-sampling is comparable in runtime and accuracy to local triangle counting.

This paper offers several directions for future work. First, both the local 3-profile and the ego 3-profile can be used as features to classify vertices in social or bioinformatic networks. Additionally, we hope to extend our theory and algorithmic framework to larger subgraphs, as well as to special classes of input graphs. Our edge sampling Markov chain and unbiased estimators should extend readily to the 4-vertex case. The equations in (13) are useful for counting local or global 4-profiles in a centralized setting, as shown recently in [17, 40]. Tractable distributed algorithms using similar edge pivot equations remain future work. Our observed dependence on the 4-clique count suggests that an improved graph engine-based clique counting subroutine would improve the parallel algorithm's performance.

References

  • [1] N. K. Ahmed, N. Duffield, J. Neville, and R. Kompella. Graph sample and hold: A framework for big-graph analytics. In KDD, 2014.
  • [2] N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algorithmica, 17(3):209–223, Mar. 1997.
  • [3] Amazon web services. http://aws.amazon.com, 2015.
  • [4] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD, 2008.
  • [5] M. A. Bhuiyan, M. Rahman, M. Rahman, and M. Al Hasan. Guise: Uniform sampling of graphlets for large graph analysis. IEEE 12th International Conference on Data Mining, pages 91–100, Dec. 2012.
  • [6] C. Borgs, J. Chayes, and K. Vesztergombi. Counting graph homomorphisms. Topics in Discrete Mathematics, pages 315–371, 2006.
  • [7] S. Chu and J. Cheng. Triangle listing in massive networks and its applications. In Proc. 17th ACM SIGKDD, page 672, New York, New York, USA, 2011.
  • [8] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1–25, 2011.
  • [9] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17–30, 2012.
  • [10] W. Han and J. Lee. Turbo: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In SIGMOD, pages 337–348, 2013.
  • [11] F. Hormozdiari, P. Berenbrink, N. Przulj, and S. C. Sahinalp. Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS computational biology, 3(7):e118, July 2007.
  • [12] T. Hočevar and J. Demšar. A combinatorial approach to graphlet counting. Bioinformatics, 30(4):559–65, Feb. 2014.
  • [13] M. Jha, C. Seshadhri, and A. Pinar. A space efficient streaming algorithm for triangle counting using the birthday paradox. In KDD, pages 589–597, 2013.
  • [14] M. Jha, C. Seshadhri, and A. Pinar. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. 2014.
  • [15] J. H. Kim and V. H. Vu. Concentration of multivariate polynomials and its applications. Combinatorica, 20(3):417–434, 2000.
  • [16] T. Kloks, D. Kratsch, and H. Müller. Finding and counting small induced subgraphs efficiently. Information Processing Letters, 74(3-4):115–121, May 2000.
  • [17] M. Kowaluk, A. Lingas, and E.-M. Lundell. Counting and detecting small subgraphs via equations. SIAM Journal of Discrete Mathematics, 27(2):892–909, 2013.
  • [18] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proc. 19th International World Wide Web Conference, pages 591–600, New York, NY, USA, 2010. ACM.
  • [19] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. Proc. VLDB Endowment, 6(2):133–144, Dec. 2012.
  • [20] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [21] L. Lovász. Large Networks and Graph Limits, volume 60. American Mathematical Soc., 2012.
  • [22] D. Marcus and Y. Shavitt. RAGE - A rapid graphlet enumerator for large networks. Computer Networks, 56(2):810–819, Feb. 2012.
  • [23] R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. In Proc. 23rd International World Wide Web Conference, Web Science Track. ACM, 2014.
  • [24] T. Milenkovik and N. Przulj. Uncovering biological network function via graphlet degree signatures. Cancer Informatics, 6:257–273, 2008.
  • [25] D. O’Callaghan, M. Harrigan, J. Carthy, and P. Cunningham. Identifying discriminating network motifs in YouTube spam. Feb. 2012.
  • [26] R. Pagh and C. E. Tsourakakis. Colorful triangle counting and a MapReduce implementation. Information Processing Letters, 112(7):277–281, Mar. 2012.
  • [27] N. Przulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):177–183, 2007.
  • [28] P. Ribeiro, F. Silva, and L. Lopes. Efficient parallel subgraph counting using g-tries. In IEEE International Conference on Cluster Computing, pages 217–226. IEEE, Sept. 2010.
  • [29] M. Saltz, A. Jain, A. Kothari, A. Fard, J. A. Miller, and L. Ramaswamy. DualIso: An algorithm for subgraph pattern matching on very large labeled graphs. IEEE International Congress on Big Data, pages 498–505, June 2014.
  • [30] N. Satish, N. Sundaram, M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD, pages 979–990, 2014.
  • [31] T. Schank. Algorithmic Aspects of Triangle-Based Network Analysis. PhD thesis, 2007.
  • [32] C. Seshadhri, A. Pinar, and T. G. Kolda. Triadic measures on graphs: The power of wedge sampling. In Proc. SIAM Conference on Data Mining, pages 10–18, 2013.
  • [33] C. Seshadhri, A. Pinar, and T. G. Kolda. Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Statistical Analysis and Data Mining, 7(4):294–307, 2014.
  • [34] N. Shervashidze, K. Mehlhorn, and T. H. Petri. Efficient graphlet kernels for large graph comparison. In Proc. 20th International Conference on Artificial Intelligence and Statistics, pages 488–495, 2009.
  • [35] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proc. 20th International World Wide Web Conference, page 607, 2011.
  • [36] C. E. Tsourakakis. Fast counting of triangles in large real networks: Algorithms and laws. In IEEE International Conference on Data Mining, 2008.
  • [37] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: Counting triangles in massive graphs with a coin. In SIGKDD, 2009.
  • [38] C. E. Tsourakakis, M. Kolountzakis, and G. L. Miller. Triangle sparsifiers. Journal of Graph Theory and Applications, 15(6):703–726, 2011.
  • [39] J. Ugander, L. Backstrom, M. Park, and J. Kleinberg. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In 22nd International World Wide Web Conference, 2013.
  • [40] V. V. Williams, J. Wang, R. Williams, and H. Yu. Finding four-node subgraphs in triangle time. SODA, pages 1671–1680, 2014.
  • [41] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In International Conference on Data Mining, 2002.

Appendix A Proof of Theorem 1

Let m be the total number of edges in the original graph G. If e is an edge in the original graph, let t_e be its random indicator after sampling: t_e = 1 if e is sampled and t_e = 0 otherwise. Let H_0, H_1, H_2, and H_3 denote the sets of distinct subgraphs of each kind (anti-clique, edge, wedge, and triangle, respectively). Let h_0, h_1, h_2, and h_3 denote, respectively, an anti-clique with no edges, a subgraph with a single edge e_1, a wedge with two edges e_1, e_2, and a triangle with edges e_1, e_2, e_3 in the original graph G. Our estimators (3)-(6) are functions of the t_e's, and each can be written as a polynomial of degree at most 3 in the variables t_e.

(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)

Note that the newly defined polynomials have the following expectations:

We observe that, even after a change of variables, some of the polynomials above are not totally positive. This means that Proposition 1 cannot be applied to them directly. The strategy we adopt is to split each estimator into several polynomials, each of which is totally positive, and then apply Proposition 1 to each of them separately (the total positivity of these pieces is proved below). Substituting the above equations into (3)-(6), we obtain the following system of equations connecting the estimators and the set of totally positive polynomials:

(26)
(27)
(28)
(29)

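The unbiasedness underlying these estimators can be checked numerically on a toy graph: since each edge survives independently with probability p, the expected number of triples with j surviving edges is a binomial mixture of the true 3-profile counts. The following self-contained sketch (our own; the toy graph and names are illustrative) verifies this identity exactly by enumerating all edge subsets of a small graph.

```python
from itertools import combinations
from math import comb

def three_profile(vertices, edges):
    """Count vertex triples by the number of induced edges: [n0, n1, n2, n3]."""
    eset = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for trio in combinations(sorted(vertices), 3):
        k = sum(frozenset(pair) in eset for pair in combinations(trio, 2))
        counts[k] += 1
    return counts

# Toy graph: a 4-cycle plus one chord.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
p = 0.3
n = three_profile(V, E)

# Exact E[X_j]: enumerate all 2^m sampled edge subsets, weighted by probability.
m = len(E)
expected = [0.0] * 4
for mask in range(1 << m):
    kept = [E[i] for i in range(m) if mask >> i & 1]
    w = p ** len(kept) * (1 - p) ** (m - len(kept))
    y = three_profile(V, kept)
    for j in range(4):
        expected[j] += w * y[j]

# Binomial mixture prediction: E[X_j] = sum_k C(k,j) p^j (1-p)^(k-j) n_k.
predicted = [sum(comb(k, j) * p**j * (1 - p)**(k - j) * n[k]
                 for k in range(j, 4)) for j in range(4)]
assert all(abs(a - b) < 1e-12 for a, b in zip(expected, predicted))
print("E[X] =", [round(v, 4) for v in expected])
```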
Let D_1, D_2, and D_3 be the maximum numbers of edges, wedges, and triangles, respectively, containing a given edge in the original graph G, and let D be the maximum of D_1, D_2, and D_3 over all edges e. We now show concentration results for the totally positive polynomials alone.

Lemma 2

Define variables . Then is totally positive in . With respect to the variables , .

{proof}

We have the expectation of the following partial derivatives, up to the third order:

From the above equations, we have for a nonempty graph. To satisfy , it is sufficient to have

(30)

This is because with probability .

Lemma 3

is totally positive in . With respect to the variables , .

{proof}

We have the expectation of the following partial derivatives, up to the third order:

From the above equations, we have . implies

(31)
Lemma 4

is totally positive in . With respect to the variables , .

{proof}

We have the expectation of the following partial derivatives, up to the second order:

From the above equations, we have . implies

(32)
Lemma 5

is totally positive in . With respect to the variables , .

{proof}

We have the expectation of the following partial derivatives, up to the second order:

From the above equations, we have . implies

(33)
Lemma 6

is totally positive in . With respect to the variables ,