Parallel and Streaming Algorithms for K-Core Decomposition


Hossein Esfandiari    Silvio Lattanzi    Vahab Mirrokni
Abstract

The $k$-core decomposition is a fundamental primitive in many machine learning and data mining applications. We present the first distributed and the first streaming algorithms to compute and maintain an approximate $k$-core decomposition with provable guarantees. Our algorithms achieve rigorous bounds on space complexity while bounding the number of passes or the number of rounds of computation. We do so by presenting a new powerful sketching technique for $k$-core decomposition, and then by showing that it can be computed efficiently in both the streaming and the MapReduce models. Finally, we confirm the effectiveness of our sketching technique empirically on a number of publicly available graphs.


1 Introduction

A wide range of data mining, machine learning and social network analysis problems can be modeled as graph mining tasks on large graphs. The ability to analyze layers of connectivity is useful to understand the hierarchical structure of the input data and the role of nodes in different networks. A commonly used technique for this task is the $k$-core decomposition: a $k$-core of a graph is a maximal subgraph where every node has induced degree at least $k$. The $k$-core decomposition has many real-world applications, from understanding dynamics in social networks (Bhawalkar et al., 2012) to graph visualization (Alvarez-Hamelin et al., 2005), from describing protein functions based on protein-protein networks (Altaf-Ul-Amin et al., 2006) to computing network centrality measures (Healy et al., 2006). $k$-cores are also widely used as a sub-routine for community detection algorithms (Chester et al., 2012; Mitzenmacher et al., 2015) or for finding dense clusters in graphs (Lee et al., 2010; Mitzenmacher et al., 2015). As a graph-theoretic tool, the $k$-core decomposition has been used to solve the densest subgraph problem (Lee et al., 2010; Bahmani et al., 2012; Epasto et al., 2015; Esfandiari et al., 2015).

The $k$-core decomposition is often used as a feature in machine learning systems, with applications in network analysis, spam detection and biology. Furthermore, in comparison with other density-based measures such as the densest subgraph, it has the advantage of assigning a score to every node in the network. Finally, the $k$-core decomposition induces a hierarchical clustering on the entire network. For many applications in machine learning and in data mining, it is therefore important to be able to compute it efficiently on large graphs.

In the past decade, with the increasing size of the data sets available in various applications, the need for developing scalable algorithms has become more important. By definition, the process of computing a $k$-core decomposition is sequential: in order to find the $k$-core, one can keep removing all nodes of degree less than $k$ from the remaining graph until no such node is left. As a result, computing $k$-cores of big graphs in distributed systems is a challenging task. In fact, while the $k$-core decomposition has been studied extensively in the literature and many efficient decentralized and streaming heuristics have been developed for this problem (Montresor et al., 2013; Sarayuce et al., 2015), developing a distributed or a streaming algorithm with provable guarantees for the $k$-core decomposition problem has remained unsolved. One difficulty in tackling the problem is that the simple non-adaptive sampling techniques used for similar problems such as densest subgraph (Lee et al., 2010; Esfandiari et al., 2015; Bahmani et al., 2012; Epasto et al., 2015; Bhattacharya et al., 2016) do not work here (see Related Work for details). In this paper, we tackle this problem and present the first parallel and streaming algorithms for this problem with provable approximation guarantees. We do so by defining an approximate notion of $k$-core, and by providing an adaptive space-efficient sketching technique that can be used to compute an approximate $k$-core decomposition efficiently. Roughly speaking, a $(1+\epsilon)$-approximate $k$-core is an induced subgraph that includes the $k$-core, and such that the induced degree of every node is at least $k/(1+\epsilon)$.

Our Contributions. As a foundation for all our results, we provide a powerful sketching technique to compute a $(1+\epsilon)$-approximate $k$-core for all $k$ simultaneously. Our sketch is adaptive in nature and it is based on a novel iterative edge sampling strategy. In particular, we design a sketch of size $\tilde{O}(n)$ that can be constructed in $O(\log n)$ rounds of sampling (throughout the paper, we use the notation $\tilde{O}(\cdot)$ to denote the fact that poly-logarithmic factors are ignored).

We then show the first application of our sketching technique in designing a parallel algorithm for computing the $k$-core decomposition. More precisely, we present a MapReduce-based algorithm to compute a $(1+\epsilon)$-approximate $k$-core decomposition of a graph in $O(\log n)$ rounds of computation, where the load of each machine is $\tilde{O}(n)$, for any constant $\epsilon > 0$.

Moreover, we show that one can implement our sketch for $k$-core decomposition in a streaming setting in one pass using only $\tilde{O}(n)$ space. In particular, we present a one-pass streaming algorithm for the $(1+\epsilon)$-approximate $k$-core decomposition of graphs with $\tilde{O}(n)$ space.

Finally, we show experimentally the efficiency and accuracy of our sketching algorithm on a number of real-world networks.

Related Work. The $k$-core decomposition problem is related to the densest subgraph problem. Streaming and turnstile algorithms for the densest subgraph problem have been studied extensively in the past (Lee et al., 2010; Esfandiari et al., 2015; Bahmani et al., 2012; Epasto et al., 2015; Bhattacharya et al., 2016). While these problems are related, the theoretical results known for the densest subgraph problem are not directly applicable to the $k$-core decomposition problem.

There are two types of algorithms for the densest subgraph problem in the streaming setting. The first type simulates the process of iteratively removing vertices with small degrees (Bahmani et al., 2012; Epasto et al., 2015; Bhattacharya et al., 2016). All of these results are based on the fact that we only need logarithmically many rounds of probing to find a $2(1+\epsilon)$-approximation of the densest subgraph (Bahmani et al., 2012). However, this cannot be used to approximate the coreness numbers.

The second type of algorithms performs a (non-adaptive) single pass and uses uniform sampling of the edges (Esfandiari et al., 2015; Mitzenmacher et al., 2015; McGregor et al., 2015). These results are based on the fact that the density of the optimum solution is proportional to the sampling rate with high probability, where the probability of failure is exponentially small. There are two obstacles toward applying this approach to approximating a $k$-core decomposition. First, by using uniform sampling it is not possible to obtain a $(1+\epsilon)$-approximation of the coreness number for nodes of constant degree (unless we sample all edges with probability one). Second, in order to achieve $\tilde{O}(n)$ space, we can only sample $\tilde{O}(1)$ edges per vertex. Hence, the probability that the degree of a vertex in the sampled graph is not proportional to the sampling rate is not exponentially small anymore. Therefore, it is not possible to union bound over exponentially many feasible solutions. To overcome this issue, we analyze the combinatorial correlation between feasible solutions and carefully pick polynomially many feasible solutions that approximate all of the feasible solutions. To the best of our knowledge, this is the first work that analyzes the combinatorial correlation of different feasible solutions on a graph.

In recent years the $k$-core decomposition problem has received a lot of attention (Bhawalkar et al., 2012; Montresor et al., 2013; Aksu et al., 2014; Sarayuce et al., 2015; Zhang et al., 2017); nevertheless, we do not know of any previous distributed algorithm with small bounded memory and number of rounds. A recent related paper is (Sarayuce et al., 2015), where the authors present a streaming algorithm for the $k$-core decomposition problem. While the authors report good empirical results for their algorithm, they do not provide guarantees for this problem; e.g., they do not prove an upper bound on the memory complexity of the algorithm. Finally, we note that Montresor et al. (Montresor et al., 2013) provide a distributed algorithm for this problem in a vertex-centric model. However, their model is different from our, more classic, MapReduce setting, and their bound on the number of rounds is linear, whereas we achieve a logarithmic bound.

2 Preliminaries

In this section, we introduce the main definitions and the computational models that we consider in the paper. We start by defining the $k$-core and by introducing the concept of approximate $k$-core. Then we describe the MapReduce and streaming models.

Approximate $k$-core. Let $G=(V,E)$ be a graph with $n$ nodes and $m$ edges. Let $H$ be a subgraph of $G$; for any node $v$ we denote by $d_H(v)$ the degree of the node $v$ in $H$, and for any subset of nodes $U \subseteq V$ we denote by $d_U(v)$ the degree of $v$ in the subgraph induced by $U$. A $k$-core is a maximal subgraph $H$ of $G$ such that for every node $v$ in $H$ we have $d_H(v) \geq k$. Note that for any $k$ the $k$-core is unique and it may possibly be disconnected. We say that a vertex has coreness number $k$ if it belongs to the $k$-core but it does not belong to the $(k+1)$-core. We denote the coreness number of node $v$ in the graph $G$ with $c_G(v)$ (we drop the subscript notation when the graph is clear from the context).

We define the core labeling of a graph $G$ as the labeling where every vertex is labeled with its coreness number. It is worth noting that this labeling is unique and that it defines a hierarchical decomposition of $G$.

In this paper we are interested in computing a good approximation of the core labeling of a graph efficiently in the MapReduce and in the streaming model. For this reason, we introduce the concept of approximate $k$-core. We define a $(1+\epsilon)$-approximation to the $k$-core of $G$ to be a subgraph $H$ of $G$ that contains the $k$-core of $G$ and such that for every node $v$ in $H$ we have $d_H(v) \geq \frac{k}{1+\epsilon}$. In other words, a $(1+\epsilon)$-approximation to the $k$-core of $G$ is a subgraph of the $\lceil k/(1+\epsilon) \rceil$-core of $G$ and a supergraph of the $k$-core of $G$. In Figure 1 we present the 3-core for a small graph and a $\frac{3}{2}$-approximate 3-core.

Figure 1: Example of a 3-core and a $\frac{3}{2}$-approximate 3-core.

Similarly, a $(1+\epsilon)$-approximate core-labeling of a graph $G$ is a labeling of the vertices in $G$ where each vertex is labelled with a number between its coreness number and its coreness number multiplied by $(1+\epsilon)$.

In the paper we often refer to the classic greedy algorithm (Matula & Beck, 1983) (also known as the peeling algorithm) to compute the coreness numbers. The algorithm works as follows: nodes are removed from the graph iteratively. In particular, in iteration $k$ of the algorithm all nodes with degree smaller than or equal to $k$ are removed iteratively and they are assigned coreness number $k$. It is possible to show that the algorithm computes the correct coreness number of all nodes in the graph and that it can be implemented in linear time.
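To make the peeling algorithm concrete, the following Python sketch implements it directly from the description above (a minimal reference implementation for illustration; the rescan of the remaining nodes at the start of each iteration is kept for readability and is not the linear-time variant).

def coreness(adj):
    """Coreness numbers via the classic peeling algorithm (Matula & Beck, 1983).

    adj: dict mapping each node to the set of its neighbours.
    Returns a dict mapping each node to its coreness number.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Iteration k: repeatedly remove every node whose residual degree
        # is at most k and assign it coreness number k.
        to_remove = [v for v in remaining if deg[v] <= k]
        while to_remove:
            v = to_remove.pop()
            if v not in remaining:
                continue  # already peeled during this iteration
            core[v] = k
            remaining.remove(v)
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] <= k:
                        to_remove.append(u)
        k += 1
    return core

For instance, on a triangle with a pendant edge, adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}, the function returns coreness 2 for the triangle nodes and 1 for the pendant node.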

MapReduce model. Here we briefly recall the main aspects of the model of Karloff et al. (Karloff et al., 2010) for the MapReduce framework (Dean & Ghemawat, 2010).

In the MapReduce model, the computation happens in parallel in several rounds. In each round, data is analyzed on each machine in parallel and then the outputs of the computations are shuffled between the machines. The model has two main restrictions, one on the total number of machines and another on the memory available on each machine. More specifically, given an input of size $N$ and a small constant $\delta > 0$, in the model there are $O(N^{1-\delta})$ machines, each with $O(N^{1-\delta})$ memory available. Note that the total amount of memory available to the entire system is $O(N^{2-2\delta})$.

The efficiency of an algorithm is measured by the number of “rounds” needed by the algorithm to terminate. Classes of algorithms of particular interest are the ones that run in a constant or poly-logarithmic number of rounds.

Streaming model. We also analyze the approximate core labelling problem in the streaming model (Munro & Paterson, 1980). In this model the input consists of an undirected graph presented as a stream of edges. The goal of our algorithm is to obtain a good approximation of the core labelling at the end of the stream using only small memory ($\tilde{O}(n)$).


3 Sketching $k$-Cores

In this section we present a sketch to compute an approximate core labelling that uses only $\tilde{O}(n)$ space. Compared with previous sketching techniques for similar problems (Lee et al., 2010; Esfandiari et al., 2015; Bahmani et al., 2012; Epasto et al., 2015; Bhattacharya et al., 2016), our sketch samples different areas of the graph with different, carefully selected, probabilities.

The main idea behind the sketch is to sample edges more aggressively in denser areas of the graph and less aggressively in sparser areas. More specifically, the algorithm works as follows: we start by sampling edges with some small probability, $p_0$, so that the resulting sampled graph, $G_0$, is sparse. We then compute the coreness numbers for the vertices in $G_0$. The key observation is that if a vertex has logarithmic coreness number in $G_0$, we can precisely estimate its coreness number in the input graph $G$. Furthermore, we can show that if a vertex has large enough coreness number in the input graph, it will have at least logarithmic coreness number in $G_0$. So using this technique we can efficiently detect all nodes with sufficiently high coreness number. To compute the coreness numbers of the rest of the nodes in the graph, we first remove from the graph the nodes for which we already have a good estimate and then we iterate the same approach. In particular, we double our sampling probability and sample edges again. Interestingly, we can show that by sampling edges adaptively, we can iteratively estimate the coreness of all nodes in the graph by analyzing only sparse subgraphs.

We are now ready to describe our sketching algorithm in detail. We start by describing a basic subroutine that estimates a modified version of the coreness number, which we dub ExclusiveCorenessLabeling. The subroutine takes as input a subgraph, $H$, and a subset $S$ of the vertices, and it runs a modified version of the classic peeling algorithm (Matula & Beck, 1983) to compute the coreness numbers. The main difference between ExclusiveCorenessLabeling and the peeling algorithm in (Matula & Beck, 1983) is that we do not compute labels for the nodes in $S$ and we do not remove them from the subgraph $H$. The pseudocode for ExclusiveCorenessLabeling is presented in Algorithm 1.

Input: A graph $H$ with vertex set $V$ and a set $S \subseteq V$.
  Initialize $\ell \leftarrow 0$
  Initialize the labels of all nodes in $V \setminus S$ as unassigned
  while $V \setminus S \neq \emptyset$ do
    while there exists $v \in V \setminus S$ with $d_H(v) \leq \ell$ do
      Let $v$ be a vertex in $V \setminus S$ with $d_H(v) \leq \ell$
      Set the label of $v$ to $\ell$
      Remove $v$ from $V$
      Remove the edges incident to $v$ from $H$
    end while
    $\ell \leftarrow \ell + 1$
  end while
Algorithm 1 ExclusiveCorenessLabeling
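A direct Python rendering of Algorithm 1, in the same style as the peeling sketch of Section 2, is the following (again a minimal illustration, not an optimized implementation):

def exclusive_coreness_labeling(adj, S):
    """Algorithm 1: peeling that never labels or removes the nodes in S.

    adj: dict mapping each node of H to the set of its neighbours.
    S: set of excluded nodes; they keep contributing to the degrees.
    Returns a dict mapping every node outside S to its label.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    label = {}
    k = 0
    while remaining - S:
        to_remove = [v for v in remaining - S if deg[v] <= k]
        while to_remove:
            v = to_remove.pop()
            if v not in remaining:
                continue  # already peeled at this level
            label[v] = k
            remaining.remove(v)
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if u not in S and deg[u] <= k:
                        to_remove.append(u)
        k += 1
    return label

Note that with S empty the function degenerates to the classic peeling algorithm and returns exactly the coreness numbers.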

During the execution of the algorithm we use the subroutine ExclusiveCorenessLabeling to compute a labelling for the subset of the nodes of $G$ for which we do not already have a good estimate of the coreness number.

Now we can formally present our algorithm. We start by sampling the graph with probability $p_0$; in this way, we obtain a sparse graph $G_0$. Then we run ExclusiveCorenessLabeling with $H = G_0$ and $S = \emptyset$ to obtain a labeling of the nodes in $G_0$. Let $\ell_0(v)$ be the label of vertex $v$ in this labeling. If a vertex has $\ell_0(v) \geq \gamma$, where $\gamma = \frac{c \log n}{\epsilon^2}$ for a specific constant $c$, we can estimate its coreness number in $G$ precisely. Intuitively this is true because we sample the edges independently, so we can use concentration results to bound its coreness number. Hence, in the first round of our algorithm we can compute a precise estimate of the coreness number for all nodes with $\ell_0(v) \geq \gamma$.

In the rest of the execution of our algorithm we recurse on the remaining nodes. To do so, we add the nodes with a good estimate to the set $S$ and we remove from $G$ the edges in the subgraph induced by the nodes in $S$. Then we double the sampling probability, setting $p_1 = 2 p_0$, and sample again. Similarly, we obtain a new subgraph $G_1$ and we run ExclusiveCorenessLabeling with $H = G_1$ and $S$ equal to the current set $S$, obtaining a labeling $\ell_1$ of the nodes in $G_1$. Also in this case, if a vertex has $\ell_1(v) \geq \gamma$, we can estimate its coreness number in $G$ precisely.

We iterate this algorithm for $O(\log n)$ steps. In the remainder of the section we first present the pseudocode of our sketching algorithm (Algorithm 2), then we show that at the end of the execution of the algorithm we have a good estimate of the coreness number of every node in $G$. Finally, we argue that in every iteration the sampled graphs are sparse, so the algorithm uses only a small amount of memory at any point in time.

We start by providing the pseudocode of the algorithm in Algorithm 2.

Input: A graph $G$ with $n$ vertices and a parameter $\epsilon$.
  Initialize $S \leftarrow \emptyset$
  Initialize $p_0 \leftarrow \gamma/n$, where $\gamma = \frac{c \log n}{\epsilon^2}$ for a large enough constant $c$
  for $i = 0$ to $L = O(\log n)$ do
    Let $G_i$ be a subgraph of $G$ with the edges sampled independently with probability $p_i$
    Run ExclusiveCorenessLabeling$(G_i, S)$ and denote the label of vertex $v$ on $G_i$ by $\ell_i(v)$
    for each vertex $v \notin S$ do
      if $\ell_i(v) \geq \gamma$ then
        // Node $v$ has sufficiently high degree to estimate its coreness number.
        if $\ell_i(v) \leq 2\gamma$ then
          Set the label of vertex $v$ to $\ell_i(v)/p_i$
          Add $v$ to $S$
        else
          Set the label of vertex $v$ to $2\gamma/p_i$
          Add $v$ to $S$
        end if
      end if
    end for
    Remove from $G$ the edges of $G$ induced by $S$
    Set $p_{i+1} \leftarrow 2\, p_i$
  end for
Algorithm 2 A sketch-based algorithm to compute an approximate core-labeling.
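The following Python sketch puts the pieces together (it reuses exclusive_coreness_labeling from above; the threshold gamma, the initial probability and the termination rule are assumptions standing in for the constants of the analysis, so this is an illustration of the sampling scheme rather than the exact Algorithm 2):

import math
import random

def core_sketch_labels(adj, eps, c=16, seed=0):
    """Adaptive-sampling core labeling in the spirit of Algorithm 2.

    adj: dict node -> set of neighbours (node ids assumed hashable and
    comparable); eps: accuracy parameter in (0, 1).
    Returns a dict with an approximate coreness label for every node.
    """
    rng = random.Random(seed)
    n = max(len(adj), 2)
    gamma = c * math.log(n) / eps ** 2       # settling threshold (assumed)
    p = min(1.0, gamma / n)                  # initial sampling rate (assumed)
    S, labels = set(), {}
    edges = {(u, v) for u in adj for v in adj[u] if u < v}
    while len(S) < len(adj):
        # Sample every surviving edge independently with probability p.
        sampled_adj = {v: set() for v in adj}
        for u, v in edges:
            if rng.random() < p:
                sampled_adj[u].add(v)
                sampled_adj[v].add(u)
        ell = exclusive_coreness_labeling(sampled_adj, S)
        for v, l in ell.items():
            # Settle v when its sampled label concentrates; once p reaches 1
            # the labels are exact, so everyone left settles as well.
            if l >= gamma or p == 1.0:
                labels[v] = l if p == 1.0 else min(l, 2 * gamma) / p
                S.add(v)
        # Edges induced by settled nodes are never sampled again.
        edges = {e for e in edges if e[0] not in S or e[1] not in S}
        p = min(1.0, 2 * p)                  # double the sampling rate
    return labels

Since every sampled graph is sparse (Lemma 3.6 below), the memory needed at any point in time is dominated by the current sample rather than by the input graph.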

We are now ready to prove the main properties of our sketching technique. We start by stating two technical lemmas whose proofs follow from applications of concentration bounds and are presented in the appendix. The main goal of the lemmas is to relate the degree of a vertex in a subgraph of $G$ to its degree in the corresponding sampled subgraph.

Lemma 3.1.

Let $G$ be a graph and let $0 < \epsilon \leq 1/3$ and $0 < p \leq 1$ be two arbitrary numbers. Let $\gamma$ be a function of $n$ such that $\gamma \geq \frac{c \log n}{\epsilon^2}$, for a large enough constant $c$, and let $G'$ be a subgraph of $G$ that contains each edge of $G$ independently with probability $p$. Then for all vertices $v$ the following statements hold, with probability $1 - O(1/n^2)$: (i) if $d_G(v) \geq \frac{\gamma}{p}$ we have $(1-\epsilon)\, p\, d_G(v) \leq d_{G'}(v) \leq (1+\epsilon)\, p\, d_G(v)$, (ii) if $d_G(v) < \frac{\gamma}{p}$ we have $d_{G'}(v) \leq (1+\epsilon)\gamma$. Furthermore, if the graph is directed the same claims hold for the in-degree ($d^-$) and the out-degree ($d^+$) of a node $v$.

Lemma 3.2.

Let $G$ be a graph and let $0 < \epsilon \leq 1/3$ and $0 < p \leq 1$ be two arbitrary numbers. Let $\gamma$ be a function of $n$ such that $\gamma \geq \frac{c \log n}{\epsilon^2}$, for a large enough constant $c$, and let $G'$ be a subgraph of $G$ that contains each edge of $G$ independently with probability $p$. For all vertices $v$ the following statements hold, with probability $1 - O(1/n^2)$: (i) if $d_{G'}(v) \geq (1+\epsilon)\gamma$ we have $d_G(v) \geq \frac{\gamma}{p}$, (ii) if $d_{G'}(v) < (1+\epsilon)\gamma$ we have $d_G(v) \leq \frac{(1+3\epsilon)\gamma}{p}$. In addition, in the first case we have $(1-\epsilon)\, d_G(v) \leq \frac{d_{G'}(v)}{p} \leq (1+\epsilon)\, d_G(v)$. Furthermore, if the graph is directed the same claims hold for the in-degree ($d^-$) and the out-degree ($d^+$) of a node $v$.

In the remainder of this section we assume that the statements of Lemma 3.1 and Lemma 3.2 hold and, using this assumption, we prove the main properties of our algorithm.

We start by comparing the labels computed by Algorithm 1 with the coreness numbers of its input graph. Recall that we denote the coreness number of node $v$ in the graph $G$ with $c_G(v)$.

Lemma 3.3.

Let $G$ be an arbitrary graph and let $S$ be an arbitrary set of its vertices. Let $G'$ be an arbitrary graph on the set of vertices $S$, and let $\hat{G} = G \cup G'$. Let $\ell(v)$ be the label computed by ExclusiveCorenessLabeling$(G, S)$. Then for each vertex $v \notin S$ we have: (i) $\ell(v) \geq c_{\hat{G}}(v)$, (ii) if $c_{\hat{G}}(v) \leq \min_{u \in S} c_{\hat{G}}(u)$, we have $\ell(v) = c_{\hat{G}}(v)$.

Proof.

By the definition of the coreness number, if we iteratively remove from $\hat{G}$ all vertices with degree less than $c_{\hat{G}}(v)$, vertex $v$ is not removed from the graph. Furthermore, note that Algorithm 1 never removes the vertices of $S$, so every vertex outside $S$ is removed by Algorithm 1 no earlier than by this peeling process. Thus, we have $\ell(v) \geq c_{\hat{G}}(v)$, as desired.

Note that, if we set $S = \emptyset$, Algorithm 1 acts as the greedy algorithm that computes the coreness numbers. Moreover, notice that if $c_{\hat{G}}(v) \leq \min_{u \in S} c_{\hat{G}}(u)$, the classic peeling algorithm does not remove any of the vertices in $S$ until it considers nodes with degree smaller than or equal to $c_{\hat{G}}(v)$, so up to that point it behaves exactly as Algorithm 1. Therefore we have $\ell(v) = c_{\hat{G}}(v)$, which proves the second statement of the lemma. ∎

We are now ready to state the two main lemmas proving the quality of the solution computed by our sketching technique.

Lemma 3.4.

For all $i$ such that $0 \leq i \leq L$ and for any node $v$ added to $S$ in round $i$, we have with probability $1 - O(1/n)$ that: $c_G(v) \geq \frac{\gamma}{(1+\epsilon)\, p_i}$.

Furthermore, for all $i$ such that $0 \leq i \leq L$, we have with probability $1 - O(1/n)$ that every node $v$ not in $S$ at the end of round $i$ satisfies: $c_G(v) \leq \frac{(1+3\epsilon)\,\gamma}{p_i}$.

Lemma 3.5.

Algorithm 2 computes a $(1+O(\epsilon))$-approximate core labeling, with probability $1 - O(1/n)$.

The proofs of these lemmas are presented in the appendix.

Now we give a lemma that bounds the total number of edges stored in the sketches $G_0, \ldots, G_L$.

Lemma 3.6.

The total number of edges in the graphs $G_0, \ldots, G_L$ produced by Algorithm 2 is upper bounded by $\tilde{O}(n)$, with probability $1 - O(1/n)$.

Proof.

In the proof, we assume that the statement of Lemma 3.4 holds, and that the statements of Lemma 3.1 and Lemma 3.2 hold for all the sampled graphs and for all the relevant choices of $\gamma$, $\epsilon$ and $p_i$.

Consider an arbitrary $i > 0$. From Lemma 3.4, we have that for any node $v$ not in $S$ at the beginning of round $i$ the coreness number of $v$ is bounded by $\frac{(1+3\epsilon)\gamma}{p_{i-1}}$. Now consider an orientation of the edges left at the beginning of round $i$ where every edge is oriented towards its endpoint of smallest coreness number, breaking ties in such a way that the in-degree of every node $v$ is upper bounded by $c_G(v)$ (note that such an orientation exists; in fact, it can be obtained by orienting every edge towards the endpoint that is first removed by the classic peeling algorithm used to compute the coreness numbers). Furthermore, note that every edge left at the beginning of round $i$ is incident to a node of coreness number at most $\frac{(1+3\epsilon)\gamma}{p_{i-1}} = \frac{2(1+3\epsilon)\gamma}{p_i}$, so using Lemma 3.1 we have that the in-degree of every node in $G_i$ is bounded by $O(\gamma)$. So, summing over all the in-degrees, we get that the number of edges in $G_i$ is bounded by $O(n\gamma)$. For $i = 0$ the same bound holds directly from Lemma 3.1, since $p_0 = \gamma/n$ and every degree is at most $n$. We conclude the proof by noticing that there are at most $O(\log n)$ different sampled graphs $G_i$, so the total memory used is $O(n \gamma \log n) = \tilde{O}(n)$. ∎

Putting together Lemma 3.5 and Lemma 3.6 we get the main theorem of this section.

Theorem 3.7.

Algorithm 2 computes a $(1+O(\epsilon))$-approximate core labeling and the total space used by the algorithm is $\tilde{O}(n)$, with probability $1 - O(1/n)$ (rescaling $\epsilon$ yields a $(1+\epsilon)$-approximate core labeling within the same asymptotic space).

4 MapReduce and Streaming Algorithms

In this section we show how to compute our sketch efficiently using a MapReduce or a streaming algorithm.

4.1 MapReduce algorithm

Here, we show how to implement the sketch introduced in Section 3 in the MapReduce model. In this way we obtain an efficient MapReduce algorithm for dense graphs. It is important to note that we only use $\tilde{O}(n)$ memory for each machine, so our algorithm also works in more restrictive parallel models such as the massively parallel model (Andoni et al., 2014; Im et al., 2017).

Recall that the main limitations of the MapReduce model are on the number of machines and on the available memory of each machine. Our algorithm runs for $O(\log n)$ rounds (for simplicity, here we present a version that uses two MapReduce rounds per iteration of Algorithm 2, although the number of rounds can be reduced). In the first round of MapReduce, the edges are sampled in parallel with probability $p_0$. In this way, we obtain a graph $G_0$ that we analyze in the second round on a single machine (note that we can do this because from Lemma 3.6 we know that for all $i$ the number of edges in $G_i$ is bounded by $\tilde{O}(n)$). At the end of the second round, we obtain the labeling of the nodes with high coreness number and we add them to the set $S$. In the third round we send the set $S$ to all the machines and we sample in parallel, with probability $p_1$, the edges not induced by $S$. In this way, we obtain $G_1$, which in the fourth round is analyzed by a single machine to obtain the labelling of a few additional nodes that are added to $S$. By iterating this process for $O(\log n)$ rounds, we obtain an approximation of the coreness number of each node. The pseudocode for the MapReduce algorithm is presented in Algorithm 3.

Input: A graph $G$ with $n$ vertices and a parameter $\epsilon$.
  Initialize $S \leftarrow \emptyset$
  Initialize $p_0 \leftarrow \gamma/n$, where $\gamma = \frac{c \log n}{\epsilon^2}$
  for $i = 0$ to $L = O(\log n)$ do
    // First round of MapReduce
    Send $S$ to all machines
    Let $E_i$ be the set of edges of $G$ that are not contained in the graph induced by $G$ on $S$
    Sample with probability $p_i$, in parallel using all the machines, the edges in $E_i$
    // Second round of MapReduce
    Send all the sampled edges to a single machine
    Let $G_i$ be the sampled subgraph of $G$
    Run ExclusiveCorenessLabeling$(G_i, S)$ and denote the label of vertex $v$ on $G_i$ by $\ell_i(v)$
    for each vertex $v \notin S$ do
      if $\ell_i(v) \geq \gamma$ then
        // Node $v$ has sufficiently high degree to estimate its coreness number.
        if $\ell_i(v) \leq 2\gamma$ then
          Set the label of vertex $v$ to $\ell_i(v)/p_i$
          Add $v$ to $S$
        else
          Set the label of vertex $v$ to $2\gamma/p_i$
          Add $v$ to $S$
        end if
      end if
    end for
    Set $p_{i+1} \leftarrow 2\, p_i$
  end for
Algorithm 3 A MapReduce algorithm to compute a $(1+\epsilon)$-approximate core-labeling.
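To illustrate the round structure, the following Python fragment simulates one iteration of Algorithm 3 (names and data layout are ours, for illustration only):

import random

def simulate_iteration(edge_partitions, S, p, seed=0):
    """One iteration of Algorithm 3: a map round in which every machine
    samples its surviving edges, followed by a reduce round in which a
    single machine gathers the sample.

    edge_partitions: list of edge lists, one per simulated machine.
    Returns the sampled edges gathered on the single reducer.
    """
    rng = random.Random(seed)
    gathered = []
    for machine_edges in edge_partitions:      # map: machines in parallel
        for u, v in machine_edges:
            if u in S and v in S:
                continue                       # edge induced by S: skipped
            if rng.random() < p:
                gathered.append((u, v))        # shuffled to the reducer
    # Reduce: the single machine holding `gathered` (which is sparse by
    # Lemma 3.6) runs ExclusiveCorenessLabeling and updates S and the
    # labels exactly as in Algorithm 2.
    return gathered

The key point is that the only global communication per iteration is broadcasting $S$ and shuffling the sampled, sparse edge set to one machine.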

By Theorem 3.7 presented in the previous section we obtain the following corollary.

Corollary 4.1.

Let $G$ be a graph such that $m = \Omega(n^{1+\delta})$, for some constant $\delta > 0$. Then there is an algorithm that computes w.h.p. a $(1+\epsilon)$-approximate core-labeling of the graph in the MapReduce model using $O(\log n)$ rounds of MapReduce.

4.2 Semi-streaming algorithms

Next we show an application of our sketch in the streaming setting. We consider the setting where edges are only added to the graph. The main idea behind our streaming algorithm is to maintain at any point in time the sketch presented in Section 3, which requires only $\tilde{O}(n)$ space. In the remainder of the section we describe how we can maintain the sketch in streaming.

When an edge is added to $G$, we check by sampling whether it is in $G_0$. If this is the case, we recompute the labeling of $G_0$ and, if one of the endpoints of the edge is added to $S_0$, we update the rest of the sketch to reflect this change. Then, if both endpoints of the edge are not contained in $S_0$, we check by sampling whether the edge is contained in $G_1$. Also in this case, if it is in $G_1$, we recompute the labeling of $G_1$ and modify the sketch accordingly. We continue this procedure until both endpoints of the edge are contained in the current set $S_i$. Notice that, by inserting edges, the sets $S_i$ may only grow. Hence, if at some point both endpoints of an edge are in $S_i$, they remain in $S_i$ after further insertions.

The pseudocode for the streaming algorithm is presented in Algorithm 4 (note that here, for simplicity, we recompute the core labels after the insertion of an edge $(u,v)$; however, one might instead recurse over the neighborhoods of $u$ and $v$ and update the core labels locally).

By Theorem 3.7 presented in the previous section we obtain the following corollary.

Corollary 4.2.

There exists a one-pass streaming algorithm that computes w.h.p. a $(1+\epsilon)$-approximate core-labeling of the input graph using $\tilde{O}(n)$ space.

Input: A stream of edges and a parameter $\epsilon$.
  Initialize $S_i \leftarrow \emptyset$ and $G_i \leftarrow (V, \emptyset)$, for $i = 0, \ldots, L$
  Initialize $p_i \leftarrow 2^i \gamma / n$, for $i = 0, \ldots, L$, where $\gamma = \frac{c \log n}{\epsilon^2}$
  Insertion of an edge $(u,v)$:
  Draw $x$ uniformly at random from $[0,1]$
  for $i = 0$ to $L$ do
    if $u \notin S_i$ or $v \notin S_i$ then
      if $x \leq p_i$ then
        Add $(u,v)$ to $G_i$
        Run ExclusiveCorenessLabeling$(G_i, S_i)$ and denote the label of vertex $w$ on $G_i$ by $\ell_i(w)$
        for each vertex $w \notin S_i$ do
          if $\ell_i(w) \geq \gamma$ then
            if $\ell_i(w) \leq 2\gamma$ then
              Set the label of vertex $w$ to $\ell_i(w)/p_i$
              Add $w$ to $S_j$ for all $j \geq i$
            else
              Set the label of vertex $w$ to $2\gamma/p_i$
              Add $w$ to $S_j$ for all $j \geq i$
            end if
          end if
        end for
        for $j = i$ to $L$ do
          Remove from $G_j$ any edge induced by $S_j$
        end for
      end if
    else
      Break
    end if
  end for
Algorithm 4 A streaming algorithm to compute a $(1+\epsilon)$-approximate core-labeling.
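A compact Python rendering of this update rule follows (again reusing exclusive_coreness_labeling; the single uniform draw reused across levels yields nested samples because the $p_i$ are increasing, and, as in the text, labels are recomputed from scratch on every sampled insertion; thresholds are the same assumptions as in the earlier sketches):

import math
import random

class StreamingCoreSketch:
    """Insertion-only maintenance of the sketch (in the spirit of Algorithm 4).

    nodes: collection of vertex ids (the vertex set is known upfront).
    """

    def __init__(self, nodes, eps, c=16, seed=0):
        self.rng = random.Random(seed)
        n = max(len(nodes), 2)
        self.gamma = c * math.log(n) / eps ** 2
        self.levels = max(1, int(math.log2(max(n / self.gamma, 1.0))) + 2)
        self.p = [min(1.0, (self.gamma / n) * 2 ** i) for i in range(self.levels)]
        self.G = [{v: set() for v in nodes} for _ in range(self.levels)]
        self.S = [set() for _ in range(self.levels)]  # settled nodes per level
        self.labels = {}

    def insert(self, u, v):
        x = self.rng.random()              # one coin; samples are nested
        for i in range(self.levels):
            if u in self.S[i] and v in self.S[i]:
                break                      # both endpoints already settled
            if x <= self.p[i]:             # the edge belongs to sample G_i
                self.G[i][u].add(v)
                self.G[i][v].add(u)
                ell = exclusive_coreness_labeling(self.G[i], self.S[i])
                for w, l in ell.items():
                    if l >= self.gamma:    # w settles from this level on
                        self.labels[w] = min(l, 2 * self.gamma) / self.p[i]
                        for j in range(i, self.levels):
                            self.S[j].add(w)
                    elif self.p[i] == 1.0:
                        self.labels[w] = l # exact for never-settled nodes

For brevity the code does not prune the edges induced by the settled sets from the stored samples, which the space bound of Lemma 3.6 requires; a production implementation would perform that cleanup and update the labels locally instead of recomputing them.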

5 Experiments

In this section, we analyze the performance of our sketch in practice. First we describe our datasets. Next we discuss the implementation of the sketch presented in Section 3 and of our MapReduce algorithm. Then we study the scalability and the accuracy of our sketch. In particular, we analyze the trade-off between the quality of the approximation and the space used by the sketch.

Datasets.

We apply our sketch to eight real-world graphs available in SNAP, the Stanford Large Network Dataset Library (Leskovec & Sosič, 2016): Enron (Klimt & Yang, 2004), Epinions (Richardson et al., 2003), Slashdot (Leskovec et al., 2009), Twitter (McAuley & Leskovec, 2012), Amazon (Yang & Leskovec, 2015), Youtube (Yang & Leskovec, 2015), LiveJournal (Yang & Leskovec, 2015) and Orkut (Yang & Leskovec, 2015), with respectively 36692, 75879, 82168, 81306, 334863, 1134890, 3997962 and 3072441 nodes and 183831, 508837, 948464, 1768149, 925872, 2987624, 34681189 and 117185083 edges.

Implementation details.

In order to have an efficient implementation of our sketch, we modify Algorithm 2 slightly. More specifically, we replace the settling condition “if $\ell_i(v) \geq \gamma$” with “if $\ell_i(v) \geq T$”, where $T$ is a parameter of our implementation. Furthermore, we also modify the update of the sampling probability to “$p_{i+1} \leftarrow M \cdot p_i$”, where $M$ is a modifiable multiplicative factor (that in Algorithm 2 is fixed to 2). We also slightly modify our MapReduce algorithm to iteratively remove, in parallel, all nodes of small degree before sending the remaining graph to a single machine.

Metrics.

To study the scalability of the algorithm we implement our MapReduce algorithm in a distributed setting and we analyze the running time on different graphs using a fixed number of machines. To evaluate the quality of our sketch, we consider the quality of the approximation and the space used.

For the quality of the approximation, we report the median error and the error at the 60th, 70th, 80th and 90th percentiles for our algorithm. In the interest of space, we report the errors only on nodes with sufficiently high coreness number, because high coreness numbers are harder to approximate, and for almost all the nodes of smaller coreness the errors are close to 0.

For space, we consider the maximum size of any sampled graph and the sum of their sizes. Note that the first quantity bounds the memory used by our distributed algorithm or by a multi-pass streaming algorithm, while the second one bounds the memory used by a single-pass streaming algorithm.

Scalability Results.

In Figure 2 we present the results of our scalability experiments (in the experiments we fix the parameters $T$ and $M$). On the $x$ axis we order the graphs based on their number of edges; on the $y$ axis we show the relative running time on the different graphs. Note that in the figure the $x$ axis is in logscale and the $y$ axis is in linear scale, so the running time of our algorithm grows sublinearly in the number of edges of the graph, showing that our algorithm is able to leverage parallelization to obtain good performance.

Figure 2: Running time of the distributed algorithm on graphs of increasing size.

For comparison we also run a simple iterative algorithm to estimate the coreness numbers, which resembles an adaptation of the algorithms presented in (Lee et al., 2010; Esfandiari et al., 2015; Bahmani et al., 2012; Epasto et al., 2015; Bhattacharya et al., 2016) for densest subgraph. The adapted algorithm works as follows: it removes from the graph, in parallel, all nodes with degree below a threshold $t$ and estimates their coreness number as $t$. Then, when no node with degree smaller than $t$ is left, it increases $t$ by a multiplicative factor and recurses on the remaining graph. Interestingly, we observe that this adapted algorithm is an order of magnitude slower than our distributed algorithm, so we could run it only on relatively small graphs like Amazon (note that the parallel version of the simple iterative algorithm is particularly slow in practice because it needs several parallel rounds to complete).
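For reference, here is a Python sketch of this baseline as we understand it from the description above (the initial threshold and the growth factor are assumptions; the inner loop mimics the parallel peeling phases):

def baseline_threshold_peeling(adj, growth=1.5):
    """Iteratively peel all nodes below a growing threshold t, estimating
    the coreness of every node removed in a phase by the current t.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    estimate = {}
    t = 1.0                                    # assumed initial threshold
    while remaining:
        changed = True
        while changed:                         # one parallel peeling phase
            low = [v for v in remaining if deg[v] < t]
            changed = bool(low)
            for v in low:                      # removals happen "in parallel"
                estimate[v] = t
                remaining.discard(v)
            for v in low:                      # then degrees are updated
                for u in adj[v]:
                    if u in remaining:
                        deg[u] -= 1
        t *= growth                            # geometric threshold growth
    return estimate

Each execution of the inner while loop corresponds to one parallel round, which is exactly where the observed slowdown comes from: long removal cascades translate into long chains of rounds.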

Accuracy Results.

All the reported numbers are averages over several runs of our algorithm. In all our experiments we either fix $M$ to 2 and vary $T$, or fix $T$ and vary $M$. In Table 1 we present the space used by our algorithm when we vary the value of $T$.

Maximum number of edges in a single sampled graph:
Graph      T=2      T=3      T=4      T=5
Enron      59300    93482    116110   142557
Epinions   80791    120193   147513   169037
Slashdot   132308   203789   258083   302763
Twitter    58967    107164   148610   197321

Total number of edges over all sampled graphs:
Graph      T=2      T=3      T=4      T=5
Enron      229549   337574   413380   470765
Epinions   322622   436049   515461   575731
Slashdot   537006   799586   975408   1118110
Twitter    299734   501805   682251   848405

Table 1: Number of edges stored by the sketch as a function of $T$ (with $M = 2$).

There are a few interesting things to note. First, the size of the maximum sampled graph is always significantly smaller than the size of the input graph, and in some cases it is more than one order of magnitude smaller (for example in the Twitter case). Interestingly, note that the relative size of the maximum sampled graph decreases with the size of the input graph. This suggests that the sketch would be even more effective when applied to larger graphs. The total size of the sketch is also smaller than the size of the graph in many cases (for instance, the sketch for Twitter is always smaller than half of the size of the input graph). This implies that we can compute an approximation of the coreness numbers without processing most of the edges in the input graph.

In Figure 3, we report the approximation error of our algorithm. First we note that, as $T$ increases, the approximation error decreases, as predicted by our theorems. It is also interesting to note that the median error is always small and decreases further for the largest values of $T$, and that even the error at the 90th percentile remains moderate. Overall, our sketch provides a good approximation of the coreness numbers.

Figure 3: The approximation error of our sketch on (a) Enron, (b) Epinions, (c) Slashdot and (d) Twitter. We show the error at the median and at the 60th, 70th, 80th and 90th percentiles as a function of $T$, restricting our attention to nodes with high coreness number.

Now we focus on the effect of $M$ on our sketch. In Table 2, we present the space used by our algorithm as a function of $M$. Note that, as $M$ decreases, the maximum size of a single sampled graph decreases, but the total size of the sketch increases (this is due to the increased number of sampled graphs). This suggests that we should use a small $M$ in distributed settings, where we have tighter per-machine space constraints, and a larger $M$ when we want to design single-pass streaming algorithms.

Maximum number of edges in a single sampled graph:
Graph      M=1.2     M=1.4     M=1.6     M=2
Enron      52202     67611     79075     85013
Epinions   93059     101692    112514    122134
Slashdot   128649    154429    171347    193257
Twitter    51774     67314     81233     92842

Total number of edges over all sampled graphs:
Graph      M=1.2     M=1.4     M=1.6     M=2
Enron      740240    485841    398151    355671
Epinions   1023529   644102    517660    455226
Slashdot   1759205   1146974   938933    837956
Twitter    903449    650040    561669    521174

Table 2: Number of edges stored by the sketch as a function of $M$ (with fixed $T$).

Finally, it is interesting to note that, as shown in Figure 4, the quality of the approximation is not much influenced by the scaling factor $M$.

Figure 4: The approximation error of our sketch on (a) Enron, (b) Epinions, (c) Slashdot and (d) Twitter. We show the error at the median and at the 60th, 70th, 80th and 90th percentiles as a function of $M$, restricting our attention to nodes with high coreness number.

6 Conclusions and future work

In this paper we introduce a new sketching technique for computing the core-labeling of a graph. In particular, we design efficient MapReduce and streaming algorithms to approximate the coreness numbers of all the nodes in a graph. We also confirm the effectiveness of our sketch via an empirical study. The most interesting open problem in the area is to design a fully dynamic algorithm (Italiano et al., 1999) that maintains the core-labeling of a graph using only $\tilde{O}(1)$ operations per update (edge addition or deletion).

References

  • Aksu et al. (2014) Aksu, H., Canim, M., Chang, Y., Korpeoglu, I., and Ulusoy, Ö. Distributed k-core view materialization and maintenance for large dynamic graphs. IEEE Trans. Knowl. Data Eng., 26(10):2439–2452, 2014.
  • Altaf-Ul-Amin et al. (2006) Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K., and Kanaya, S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics, 7:207, 2006.
  • Alvarez-Hamelin et al. (2005) Alvarez-Hamelin, J. I., Dall’Asta, L., Barrat, A., and Vespignani, A. k-core decomposition: a tool for the visualization of large scale networks. CoRR, abs/cs/0504107, 2005. URL http://arxiv.org/abs/cs/0504107.
  • Andoni et al. (2014) Andoni, A., Nikolov, A., Onak, K., and Yaroslavtsev, G. Parallel algorithms for geometric graph problems. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pp. 574–583, 2014.
  • Bahmani et al. (2012) Bahmani, B., Kumar, R., and Vassilvitskii, S. Densest subgraph in streaming and mapreduce. PVLDB, 5(5):454–465, 2012.
  • Bhattacharya et al. (2016) Bhattacharya, S., Henzinger, M., and Nanongkai, D. New deterministic approximation algorithms for fully dynamic matching. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp. 398–411, 2016.
  • Bhawalkar et al. (2012) Bhawalkar, K., Kleinberg, J. M., Lewi, K., Roughgarden, T., and Sharma, A. Preventing unraveling in social networks: The anchored k-core problem. In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part II, pp. 440–451, 2012.
  • Chester et al. (2012) Chester, S., Gaertner, J., Stege, U., and Venkatesh, S. Anonymizing subsets of social networks with degree constrained subgraphs. In International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2012, Istanbul, Turkey, 26-29 August 2012, pp. 418–422, 2012.
  • Dean & Ghemawat (2010) Dean, J. and Ghemawat, S. Mapreduce: a flexible data processing tool. Commun. ACM, 53(1):72–77, 2010.
  • Dubhashi & Panconesi (2009) Dubhashi, D. P. and Panconesi, A. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009. ISBN 978-0-521-88427-3.
  • Epasto et al. (2015) Epasto, A., Lattanzi, S., and Sozio, M. Efficient densest subgraph computation in evolving graphs. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pp. 300–310, 2015.
  • Esfandiari et al. (2015) Esfandiari, H., Hajiaghayi, M., and Woodruff, D. P. Applications of uniform sampling: Densest subgraph and beyond. arXiv preprint arXiv:1506.04505, 2015.
  • Healy et al. (2006) Healy, J., Janssen, J. C. M., Milios, E. E., and Aiello, W. Characterization of graphs using degree cores. In Algorithms and Models for the Web-Graph, Fourth International Workshop, WAW 2006, Banff, Canada, November 30 - December 1, 2006. Revised Papers, pp. 137–148, 2006.
  • Im et al. (2017) Im, S., Moseley, B., and Sun, X. Efficient massively parallel methods for dynamic programming. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pp. 798–811, 2017.
  • Italiano et al. (1999) Italiano, G. F., Eppstein, D., and Galil, Z. Dynamic graph algorithms. Algorithms and Theory of Computation Handbook, 1999.
  • Karloff et al. (2010) Karloff, H. J., Suri, S., and Vassilvitskii, S. A model of computation for mapreduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pp. 938–948, 2010.
  • Klimt & Yang (2004) Klimt, B. and Yang, Y. Introducing the enron corpus. In CEAS, 2004.
  • Lee et al. (2010) Lee, V. E., Ruan, N., Jin, R., and Aggarwal, C. A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pp. 303–336, 2010.
  • Leskovec & Sosič (2016) Leskovec, J. and Sosič, R. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1, 2016.
  • Leskovec et al. (2009) Leskovec, J., Lang, K. J., Dasgupta, A., and Mahoney, M. W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.
  • Matula & Beck (1983) Matula, D. W. and Beck, L. L. Smallest-last ordering and clustering and graph coloring algorithms. J. ACM, 30(3):417–427, 1983.
  • McAuley & Leskovec (2012) McAuley, J. J. and Leskovec, J. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pp. 548–556, 2012.
  • McGregor et al. (2015) McGregor, A., Tench, D., Vorotnikova, S., and Vu, H. T. Densest subgraph in dynamic graph streams. In International Symposium on Mathematical Foundations of Computer Science, pp. 472–482. Springer, 2015.
  • Mitzenmacher et al. (2015) Mitzenmacher, M., Pachocki, J., Peng, R., Tsourakakis, C., and Xu, S. C. Scalable large near-clique detection in large-scale networks via sampling. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 815–824, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3664-2.
  • Montresor et al. (2013) Montresor, A., Pellegrini, F. D., and Miorandi, D. Distributed k-core decomposition. IEEE Trans. Parallel Distrib. Syst., 24(2):288–300, 2013.
  • Munro & Paterson (1980) Munro, J. I. and Paterson, M. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315–323, 1980.
  • Richardson et al. (2003) Richardson, M., Agrawal, R., and Domingos, P. M. Trust management for the semantic web. In The Semantic Web - ISWC 2003, Second International Semantic Web Conference, Sanibel Island, FL, USA, October 20-23, 2003, Proceedings, pp. 351–368, 2003.
  • Sarayuce et al. (2015) Sarayuce, A. E., Gedik, B., Jacques-Silva, G., Wu, K.-L., and Catalyurek, U. V. Streaming algorithms for k-core decomposition. PVLDB, pp. 433–444, 2015.
  • Yang & Leskovec (2015) Yang, J. and Leskovec, J. Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst., 42(1):181–213, 2015.
  • Zhang et al. (2017) Zhang, Y., Yu, J. X., Zhang, Y., and Qin, L. A fast order-based approach for core maintenance. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pp. 337–348, 2017.

Appendix

Appendix A Concentration bounds

Before giving formal proofs of Lemma 3.1 and Lemma 3.2, we recall a useful form of the Chernoff bound (for an exhaustive treatment of concentration of measure see (Dubhashi & Panconesi, 2009)).

Theorem A.1 (Chernoff bound).

Let $X = \sum_{i=1}^{n} X_i$, where the $X_i$, for $i \in [n]$, are independently distributed random variables in $[0,1]$. Then, for $0 < \epsilon < 1$, we have

$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\, \mathbb{E}[X]\,\big] \leq 2\exp\Big(-\frac{\epsilon^2\, \mathbb{E}[X]}{3}\Big).$

Now we are ready to prove our lemmas.

Proof.

(Proof of Lemma 3.1.) To prove this lemma we show that, for a fixed vertex $v$, the statement of the lemma holds with probability $1 - O(1/n^3)$. Then, by applying the union bound over the $n$ vertices, we obtain the lemma.

Let $d_{G'}(v)$ be the degree of a vertex $v$ in $G'$. Note that each neighbor of $v$ in $G$ exists in $G'$ with probability $p$. Thus we have $\mathbb{E}[d_{G'}(v)] = p\, d_G(v)$.

First assume $d_G(v) \geq \frac{\gamma}{p}$. By the Chernoff bound we have

$\Pr\big[\,|d_{G'}(v) - p\, d_G(v)| \geq \epsilon\, p\, d_G(v)\,\big] \leq 2\exp\Big(-\frac{\epsilon^2\, p\, d_G(v)}{3}\Big) \leq 2\exp\Big(-\frac{\epsilon^2 \gamma}{3}\Big),$

which is $O(1/n^3)$ assuming $\gamma \geq \frac{c \log n}{\epsilon^2}$ for a large enough constant $c$. This proves the first statement of the lemma.

Next assume $d_G(v) < \frac{\gamma}{p}$. Again by the Chernoff bound (applied after padding the sum with dummy variables so that its expectation is exactly $\gamma$) we have

$\Pr\big[\,d_{G'}(v) \geq (1+\epsilon)\gamma\,\big] \leq 2\exp\Big(-\frac{\epsilon^2 \gamma}{3}\Big),$

which again is $O(1/n^3)$ assuming $\gamma \geq \frac{c \log n}{\epsilon^2}$. This proves the second statement of the lemma and completes the proof for undirected graphs.

The proof for directed graphs follows from the same argument. ∎

Proof.

(Proof of Lemma 3.2.) To prove this lemma we show that whenever Lemma 3.1 holds, the statements of this lemma hold as well. Pick a vertex $v$. First assume $d_{G'}(v) \geq (1+\epsilon)\gamma$. In this case, by the second statement of Lemma 3.1 we must have $d_G(v) \geq \frac{\gamma}{p}$, since otherwise we would have $d_{G'}(v) \leq (1+\epsilon)\gamma$. Furthermore, by the first statement of Lemma 3.1 we then have $(1-\epsilon)\, p\, d_G(v) \leq d_{G'}(v) \leq (1+\epsilon)\, p\, d_G(v)$. Those two facts together directly show the first statement of this lemma for vertex $v$, including the estimate $(1-\epsilon)\, d_G(v) \leq \frac{d_{G'}(v)}{p} \leq (1+\epsilon)\, d_G(v)$.

Moreover, suppose that $d_G(v) \geq \frac{(1+3\epsilon)\gamma}{p}$; then by the first statement of Lemma 3.1 we have $d_{G'}(v) \geq (1-\epsilon)(1+3\epsilon)\gamma \geq (1+\epsilon)\gamma$, where the last inequality holds since $\epsilon \leq 1/3$. Thus, if $d_{G'}(v) < (1+\epsilon)\gamma$ we have $d_G(v) \leq \frac{(1+3\epsilon)\gamma}{p}$, which shows the second statement of the lemma for vertex $v$.

Otherwise, assume $d_G(v) < \frac{\gamma}{p}$. By Lemma 3.1 we know that $d_{G'}(v) \leq (1+\epsilon)\gamma$. Thus, in this case the condition of the first statement of the lemma does not hold. Moreover, we have $d_G(v) < \frac{\gamma}{p} \leq \frac{(1+3\epsilon)\gamma}{p}$, which shows the second statement of the lemma.

Finally, the claims for the in-degree and the out-degree of a node in a directed graph follow from the same argument applied to the corresponding sets of incident edges. ∎

Appendix B An Example That Requires $\Omega(n)$ Rounds of Probing

Here we show that $\Omega(n)$ rounds of probing may be required to approximate the coreness numbers of a graph by iteratively removing all vertices of degree below a fixed threshold.

Consider the following graph on $n$ vertices $v_1, \ldots, v_n$, and let $q = \lceil\sqrt{n}\rceil$. For any $i$ between $1$ and $n - q$, vertex $v_i$ is connected to the vertices $v_{i+1}$, $v_{i+2}$ and $v_{i+3}$ (the edges of the last few chain vertices land in the clique described next). Finally, the last $q$ vertices form a clique, $K_q$. Note that in this graph all the nodes but the last $q$ have coreness number $3$, and the last $q$ have coreness number $q-1$. Now we show that for any choice of a threshold $t$, if we iteratively remove all vertices of degree less than $t$ for $o(n)$ rounds, we do not approximate the coreness numbers up to a factor $2$.

If $t \leq 3$, no vertex is ever removed, so the probing cannot be used to distinguish between nodes with coreness number $3$ and nodes with coreness number $q-1$. For $3 < t \leq 6$ (the only thresholds that certify a $2$-approximation for the chain vertices), the probing starts with the full graph and after round $r$ the remaining graph is the one induced on the vertices with index at least $\Omega(r)$, since each round only peels the constantly many leftmost surviving chain vertices. Hence, in order to distinguish nodes with coreness number $3$ from nodes with coreness number $q-1$, it requires $\Omega(n)$ rounds of probing. Finally, if $t > 6$, all the chain vertices get deleted in the first round, but such a threshold only certifies that their coreness number is less than $t$, which is not a factor-$2$ approximation of $3$.
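The construction is easy to reproduce; the snippet below builds one consistent instantiation of it (our choice of constants) and checks the coreness numbers with the coreness function from Section 2:

import math

def lower_bound_example(n):
    """Appendix B example: a chain where v_i links to the next three
    vertices, ending in a clique on the last q = isqrt(n) nodes.
    Nodes are 1..n; returns an adjacency dict usable with coreness().
    """
    q = math.isqrt(n)
    adj = {v: set() for v in range(1, n + 1)}
    for i in range(1, n - q + 1):          # chain part
        for j in (i + 1, i + 2, i + 3):
            if j <= n:
                adj[i].add(j)
                adj[j].add(i)
    clique = range(n - q + 1, n + 1)       # clique part
    for u in clique:
        for w in clique:
            if u != w:
                adj[u].add(w)
    return adj

adj = lower_bound_example(400)             # q = 20
core = coreness(adj)
assert core[1] == 3 and core[200] == 3     # chain vertices
assert core[400] == 19                     # clique vertices: q - 1

Peeling this instance with any threshold t in {4, 5, 6} indeed removes only a constant number of chain vertices per round, so the clique is exposed only after a linear number of rounds.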

Appendix C Sketch Quality

We now give a few additional definitions that we use in our proofs. Let $S_i$ be the set $S$ at the end of the $i$-th iteration of the algorithm, and let $H_i$ be a graph that contains each edge of $G$ induced by $S_{i-1}$ independently with probability $p_i$. Let us define $\hat{G}_i = G_i \cup H_i$. Note that $\hat{G}_i$ contains each edge of $G$ independently with probability $p_i$. Furthermore, for a graph $H$ and a value $k$, we define $H^{\geq k}$ as the subgraph of $H$ induced by the nodes with coreness number at least $k$ in $H$. Finally, for a graph $H$ and a vertex $v$, we denote by $R_H(v)$ the subgraph of $H$ induced by the nodes removed after $v$ (including $v$ itself) when the classic peeling algorithm is run on $H$. We are now ready to prove an upper bound and a lower bound on the coreness number of the nodes that are added to $S$ in round $i$.

Lemma C.1 (Lemma 3.4 restated).

For all $i$ such that $0 \leq i \leq L$ and for any node $v$ added to $S$ in round $i$, we have with probability $1 - O(1/n)$ that: $c_G(v) \geq \frac{\gamma}{(1+\epsilon)\, p_i}$.

Furthermore, for all $i$ such that $0 \leq i \leq L$, we have with probability $1 - O(1/n)$ that every node $v$ not in $S$ at the end of round $i$ satisfies: $c_G(v) \leq \frac{(1+3\epsilon)\,\gamma}{p_i}$.

Proof.

By using the union bound, and by fixing the failure probabilities appropriately, we have that Lemma 3.1 and Lemma 3.2 hold for all the graphs $\hat{G}_i$ and for all the relevant choices of $\gamma$, $\epsilon$ and $p_i$, with probability $1 - O(1/n)$. In the rest of the proof we assume that both lemmas hold.

To prove the second statement of the lemma, we pick any vertex $v$ with $c_G(v) > \frac{(1+3\epsilon)\gamma}{p_i}$ and show that $v$ is included in $S_i$. Therefore, every vertex $v \notin S_i$ has $c_G(v) \leq \frac{(1+3\epsilon)\gamma}{p_i}$, as desired.

Let $k = c_G(v)$. Each vertex of $G^{\geq k}$ has degree at least $k$ in $G^{\geq k}$, so by applying Lemma 3.1 to $G^{\geq k}$, for any vertex $u$ of $G^{\geq k}$ the degree of $u$ in the sampled copy of $G^{\geq k}$ inside $\hat{G}_i$ is at least $(1-\epsilon)\, p_i\, k$. This means that the degree of all the vertices of $G^{\geq k}$ in this sampled subgraph, including $v$, is at least $(1-\epsilon)\, p_i\, k > (1-\epsilon)(1+3\epsilon)\gamma \geq (1+\epsilon)\gamma$. Therefore, we have $c_{\hat{G}_i}(v) \geq (1+\epsilon)\gamma \geq \gamma$. Recall that $\hat{G}_i = G_i \cup H_i$, where $H_i$ is a graph on the vertices of $S_{i-1}$. By applying Lemma 3.3 we have $\ell_i(v) \geq c_{\hat{G}_i}(v) \geq \gamma$. Thus, $v$ is added to $S$ in round $i$, if it was not added in an earlier round. This proves the second statement.

Now we show the first statement, the lower bound, by contradiction. Without loss of generality, suppose that the first vertex that contradicts the lemma is settled in level $i$. Let $k$ be the minimum coreness number of any vertex settled in level $i$ and let $v$ be the first such vertex removed by the peeling algorithm when it is run on the entire graph $G$. Note that we have $c_G(v) = k$. Now we assume by contradiction that $k < \frac{\gamma}{(1+\epsilon)\, p_i}$.

Let $u$ be the first vertex removed by Algorithm 1, among the vertices that received the same label as $v$, and let $W$ be the subgraph of $G$ induced by the vertices in $S_{i-1}$ and the vertices of $R_G(u)$. By the choice of $v$ and $u$, and by the definition of $W$, we have $d_W(u) \leq k$.

Now, by applying Lemma 3.1 to $W$ with the parameter $\frac{\gamma}{1+\epsilon}$ in place of $\gamma$, either we have $d_W(u) \geq \frac{\gamma}{(1+\epsilon)\, p_i}$, which is impossible since $d_W(u) \leq k < \frac{\gamma}{(1+\epsilon)\, p_i}$, or the degree of $u$ in the sampled copy of $W$ inside $\hat{G}_i$ is smaller than $(1+\epsilon) \cdot \frac{\gamma}{1+\epsilon} = \gamma$. But now, note that the degree of $u$ when it receives its label from Algorithm 1 is bounded by its degree in the sampled copy of $W$. Hence, the label assigned to $u$ is strictly less than $\gamma$, so $u$ is not settled in round $i$, which contradicts the existence of $v$ and completes the proof. ∎

We are now ready to state the approximation guarantee of our sketch in Lemma 3.5.

Lemma C.2 (Lemma 3.5 restated).

Algorithm 2 computes a $(1+O(\epsilon))$-approximate core labeling, with probability $1 - O(1/n)$.

Proof.

This proof is similar to the proof of Lemma 3.4; however, here we use Lemma 3.4 itself and, for each fixed round $i$, we bound the core-label of each settled vertex $v$. We assume that the statement of Lemma 3.4 holds and that the statements of Lemma 3.1 and Lemma 3.2 hold for all the graphs $\hat{G}_i$ and for all the relevant choices of $\gamma$, $\epsilon$ and $p_i$. Indeed, by fixing the failure probabilities appropriately, these hold with probability $1 - O(1/n)$.

Pick an arbitrary $i$ such that $0 < i \leq L$, and an arbitrary vertex $v$ settled in round $i$, and let $k = c_G(v)$. Note that from Lemma 3.4 we have that every vertex $u \notin S_{i-1}$ satisfies $c_G(u) \leq \frac{(1+3\epsilon)\gamma}{p_{i-1}}$, and thus for each vertex $u$ of $G^{\geq k}$ outside $S_{i-1}$ we have

$k \leq c_G(u) \leq \frac{(1+3\epsilon)\gamma}{p_{i-1}}.$    (1)

Now, by applying the first statement of Lemma 3.1 to $G^{\geq k}$, for any vertex $u$ of $G^{\geq k}$ the degree of $u$ in the sampled copy of $G^{\geq k}$ inside $\hat{G}_i$ is at least $(1-\epsilon)\, p_i\, d_{G^{\geq k}}(u)$. This, together with inequality (1), gives us that the degree of all of the vertices in this sampled subgraph, including $v$, is at least $(1-\epsilon)\, p_i\, k$. Therefore, we have $c_{\hat{G}_i}(v) \geq (1-\epsilon)\, p_i\, k$. Recall that $\hat{G}_i = G_i \cup H_i$, where $H_i$ is a graph on the vertices of $S_{i-1}$. By applying Lemma 3.3 we have $\ell_i(v) \geq (1-\epsilon)\, p_i\, k$. Thus, the label Algorithm 2 assigns to $v$ is either at least $\frac{(1-\epsilon)\, p_i\, k}{p_i} = (1-\epsilon)\, k$, or it is the capped value $\frac{2\gamma}{p_i}$. In the former case, the label that Algorithm 2 assigns to $v$ is clearly lower bounded by $(1-\epsilon)\, c_G(v)$. Using Lemma 3.4, we have $k \leq \frac{(1+3\epsilon)\gamma}{p_{i-1}} = \frac{2(1+3\epsilon)\gamma}{p_i}$, and thus, in the latter case the label that Algorithm 2 assigns to $v$ is lower bounded by $\frac{2\gamma}{p_i} \geq \frac{k}{1+3\epsilon}$.

Now pick $i = 0$ and an arbitrary vertex $v$ settled in round $0$, and let $k = c_G(v)$. First note that in round $0$ we have $S = \emptyset$ and $\hat{G}_0 = G_0$. Furthermore, we have by Lemma 3.3 that $\ell_0(v) \geq c_{G_0}(v)$, and the argument above applies verbatim. Thus, also in this case the label Algorithm 2 assigns to $v$ is lower bounded by $\min\{(1-\epsilon)\, k, \frac{k}{1+3\epsilon}\}$. This concludes the proof of the lower bound on the labels assigned by Algorithm 2.

Next we show that, for an arbitrary round $i$, the label that Algorithm 2 assigns to a vertex settled in round $i$ is upper bounded by $(1+O(\epsilon))$ times its coreness number. By way of contradiction, let us assume that there exists some vertex $v$ such that the label that Algorithm 2 assigns to $v$ is strictly larger than this bound. Without loss of generality, let us assume that $v$ is the first such vertex removed by the peeling algorithm when it is run on $G$. Let us assume that $v$ is settled in round $i$ and let $k = c_G(v)$.

First note that we can assume $\ell_i(v) \leq 2\gamma$: otherwise the label assigned to $v$ is the capped value $\frac{2\gamma}{p_i}$, and since a label $\ell_i(v) > 2\gamma$ certifies, by the same concentration arguments, that $c_G(v) \geq \frac{2\gamma}{(1+\epsilon)\, p_i}$, the capped label is at most $(1+\epsilon)\, c_G(v)$. So in the rest of the proof we can restrict our attention to the case $\ell_i(v) \leq 2\gamma$. Furthermore, note that by Lemma 3.4 any node in $S_{i-1}$ has coreness number at least $\frac{\gamma}{(1+\epsilon)\, p_{i-1}}$, so $c_G(u) \geq c_G(v)$ for all $u \in S_{i-1}$. Now let $u$ be the first vertex removed by Algorithm 1 among those that received the same label as $v$ by this algorithm. Let $W$ be the subgraph of $\hat{G}_i$ induced by the vertices in $S_{i-1}$ and the vertices that are removed after $u$ (including $u$ itself) by Algorithm 1. By the way we picked $u$, for any

First note that , otherwise the label assigned to cannot be bigger than by the condition in lines 18-20(Note that ). So in the rest of the proof we can restrict our attention to the case when . Furthermore note that by Lemma 3.4 for any node in have at least , so , for all . Now let be the vertex in that is the first vertex removed by Algorithm 1 and that received the same label as by this algorithm. Let to be the subgraph of induced by the vertices in and the vertices in that are removed after by Algorithm 1. By the way of picking , for any