Locally Estimating Core Numbers


Michael P. O’Brien, Blair D. Sullivan
Department of Computer Science
North Carolina State University
Raleigh, North Carolina, 27607
Email: {mpobrie3,blair_sullivan}@ncsu.edu
Abstract

Graphs are a powerful way to model interactions and relationships in data from a wide variety of application domains. In this setting, entities represented by vertices at the “center” of the graph are often more important than those associated with vertices on the “fringes”. For example, central nodes tend to be more critical in the spread of information or disease and play an important role in clustering/community formation. Identifying such “core” vertices has recently received additional attention in the context of network experiments, which analyze the response when a random subset of vertices is exposed to a treatment (e.g. inoculation, free product samples, etc.). Specifically, the likelihood of having many central vertices in any exposure subset can have a significant impact on the experiment.

We focus on using $k$-cores and core numbers to measure the extent to which a vertex is central in a graph. Existing algorithms for computing the core number of a vertex require the entire graph as input, an unrealistic scenario in many real world applications. Moreover, in the context of network experiments, the subgraph induced by the treated vertices is only known in a probabilistic sense. We introduce a new method for estimating the core number based only on the properties of the graph within a region of radius $r$ around the vertex, and prove an asymptotic error bound of our estimator on random graphs. Further, we empirically validate the accuracy of our estimator for small values of $r$ on a representative corpus of real data sets. Finally, we evaluate the impact of improved local estimation on an open problem in network experimentation posed by Ugander et al.

I. Introduction

In a graph modeling complex interactions between data instances, the connectivity of the vertices often yields useful insights about the properties of the represented entities. For example, in a social network, it is important to distinguish between a person who is a member of a relatively large, tight-knit community, and a person who exists on the periphery without much involvement with any cohesive communities. This type of connectivity is often not well-correlated with the vertex degree (raw number of connections), leading to the introduction of several more sophisticated metrics. Here, we focus on the core number [1]. To build intuition, consider the setting of a graph representing friendships in a social network – the vertices with large core numbers form a set where people tend to have many common friends, whereas the subset of vertices with small core numbers form a group where most people do not know one another. The applications of core numbers are numerous, well studied, and permeate a variety of domains, such as community detection [2, 3], virus propagation [4], data pruning [5], and graph visualization [6].

Existing algorithms for computing core numbers run in linear time with respect to the number of vertices and edges of the graph [1], but require the entire graph as input and simultaneously calculate the core number for every vertex. There are a number of scenarios in which current approaches are insufficient. First, one might only be concerned with the properties of a small subset of query vertices (for example, vertices which have some common metadata, such as age or physical location, of interest to the user that is not represented explicitly in the graph). If the query vertices constitute a small fraction of the entire graph, it would be much more time- and memory-efficient to locally estimate the core numbers of those specific vertices rather than using the global algorithm, even though the latter is linear. Second, the entire graph may be unknown and/or infeasible to obtain due to scale, privacy, or business strategy reasons. For example, suppose one wanted to investigate patterns of phone calls between cell phone users. It would be possible to contact a small set of people and ask them to voluntarily log their calls for one month. However, without access to the telecommunication companies’ private records, expanding the domain to the national or global level would not be possible. Finally, current methods for determining core numbers cannot be applied to solve problems in the domain of network experimentation, as we describe below.

A network treatment experiment is a randomized experiment in which subjects are divided into two groups: those receiving treatment and those receiving none (or a placebo). Network treatments differ from other randomized experiments in that the effects of the treatment (or lack thereof) on a given subject are assumed to be dependent on the experiences of other subjects in the experiment. Randomly assigning subjects into two groups (treated versus untreated) is equivalent to randomly partitioning the vertices of a graph into two sets. Previous work has given methods to calculate degree probabilities of a randomly partitioned graph [7], but an analogous algorithm for the core numbers is an open problem. Given the results of Kitsak et al. on the importance of core numbers in spreading information [4], an algorithm to predict the likelihood of a given subject having a large core number would help researchers better understand the impact of their experimental design. For example, researchers conducting market testing on a product that relies on social interaction (such as a new social networking site, online game, etc.) would have a greater ability to see whether early access to their product will generate widespread excitement among the test subjects. Additionally, if certain groups of vertices (say, females participating in the experiment) are more likely to be core exposed than the others, we can reduce the bias of the estimate of the treatment effect by upweighting the probability that underrepresented vertices (males) are treated.

All of the challenges mentioned above could be addressed by estimating the core number of a vertex based on local graph properties. The existence of an accurate non-global estimator is intuitively well-grounded, as it has been shown that addition and deletion of edges can only affect the core numbers of a limited subset of vertices [8]. This suggests that in spite of the fact that computing core numbers exactly requires knowledge of the whole graph, the core number of a given vertex may often depend on a much smaller subgraph. Moreover, even if a core number estimate has a small error, it may still be useful in applications. In particular, it is often sufficient to delineate between vertices with a “large” core number and those with a “small” core number. That is to say, while a vertex in the 50-core is substantially more well-connected than one in the 2-core, it may be functionally identical to a vertex in the 51-core in downstream analysis.

This work introduces a new local estimator of the core number at a specified vertex, which allows a user to tune the balance between accuracy and computational complexity by varying the size of the local region around the query vertex that it considers. We prove that in an Erdös-Rényi random graph, the error of our approximation at each vertex asymptotically almost surely grows arbitrarily slowly with the size of the graph. We also empirically evaluate the estimates with respect to the actual core numbers on a representative corpus of real-world graphs of varying sizes. The results on these graphs demonstrate that high accuracy can be achieved even when considering only a small local region. Finally, we show how our estimators can be applied to address the aforementioned open problem in network treatment experiments. Specifically, we give an algorithm to tighten the upper bound on the core number of a vertex given in [7], and evaluate the impact empirically on a sample experiment.

II. Background and Definitions

In this paper, all graphs are simple, undirected, and unweighted. Unless otherwise specified, $G$ denotes a graph with vertex set $V$ and edge set $E$, where $|V| = n$ and $|E| = m$. The number of edges incident to a vertex $v$ is the degree of $v$, denoted $\deg(v)$. We also assume that $\deg(v) \geq 1$ for all $v \in V$ (isolated vertices are not relevant to our algorithms and analysis). The notation $G[S]$ denotes the subgraph of $G$ induced on the vertices $S \subseteq V$. In other words, $G[S]$ is the graph with vertex set $S$ and edge set $\{uv \in E : u, v \in S\}$. Given a function $f$ on the vertices, we say a set of vertices $u_1, \ldots, u_k$ is $f$-ordered if $f(u_i) \geq f(u_{i+1})$ for all $1 \leq i < k$.

To quantify the idea of a “central” vertex, we formalize the notion of being in a well-connected subgraph:

Definition 1 ([1])

The $k$-core of $G$, denoted $C_k(G)$, is the maximal induced subgraph of $G$ with minimum degree at least $k$.

Clearly, a graph with minimum degree at least $k$ also has minimum degree at least $k'$ for $k' \leq k$, so $C_k(G) \subseteq C_{k'}(G)$. Thus, for sake of specificity, we measure the degree to which a vertex is central by the deepest core in which it participates:

Definition 2

The core number of $v$, denoted $K(v)$, is the largest $k$ such that $v \in C_k(G)$.

The term core structure will be used broadly to describe all properties of or relating to the $k$-cores of $G$. A common global metric measures the depth of this structure:

Definition 3

The degeneracy of $G$ is the largest $k$ for which $C_k(G) \neq \emptyset$; a graph with degeneracy at most $k$ is said to be $k$-degenerate.

Fig. 1: Sample graph with its $k$-cores and core numbers labeled.

The core number of a vertex in a graph can be found by performing Algorithm 1 (from [1]), which finds a core decomposition of $G$. Beginning with $k = 1$, the algorithm deletes vertices with degree at most $k$ (and their incident edges) until there are no such vertices remaining. The removal of edges incident to a vertex may cause neighbors whose degrees were initially larger than $k$ to have their degree reduced to at most $k$. In this case, those neighbors would also be deleted in the degree-$k$ phase. Once all vertices remaining in $G$ have degree greater than $k$, $k$ is incremented by 1 and the process repeats until $G$ no longer has any vertices. The core number of a vertex is the value of $k$ when it is removed from the graph, and the degeneracy of $G$ is the value of $k$ when the last vertex is removed. Since each vertex and edge is removed exactly once, a core decomposition can be completed in $O(n + m)$ time.

1: input: Graph $G = (V, E)$
2: output: $K(v)$ for all $v \in V$
3: $k \leftarrow 1$
4: while $V \neq \emptyset$ do
5:     while $\exists\, v \in V$ with $\deg(v) \leq k$ do
6:         $K(v) \leftarrow k$
7:         $E \leftarrow E \setminus \{uv : u \in N_1(v)\}$
8:         $V \leftarrow V \setminus \{v\}$
9:     end while
10:     $k \leftarrow k + 1$
11: end while
Algorithm 1 Core decomposition
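For concreteness, the peeling procedure translates directly into Python; the sketch below assumes an adjacency-dict representation, and the function name core_decomposition is ours. (This is the simple $O(n \cdot k_{\max} + m)$ variant, not the bucket-based $O(m)$ implementation of [1].)

def core_decomposition(adj):
    """Algorithm 1 on a graph given as {vertex: set of neighbors}.
    Returns {vertex: core number}."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # peel a copy
    K = {}
    k = 1
    while adj:
        # Delete vertices of degree at most k until none remain.
        stack = [v for v, nbrs in adj.items() if len(nbrs) <= k]
        while stack:
            v = stack.pop()
            if v not in adj:
                continue  # already removed
            K[v] = k
            for u in adj.pop(v):
                adj[u].discard(v)
                if len(adj[u]) <= k:
                    stack.append(u)
        k += 1
    return K

On the triangle {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}, this returns core number 2 for every vertex.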

The basic $k$-core decomposition has been tailored to meet additional constraints. Montresor et al. [9] and Jakma et al. [10] proposed methods by which the core decomposition could be computed in a distributed fashion. Li and Yu [11] and, later, Saríyüce et al. [8] described ways to update the core numbers of vertices in a dynamic graph without recomputing the full core decomposition each time a vertex or edge is added. Finally, Cheng et al. [12] gave an alternate implementation of the core decomposition for systems with insufficient memory to store the entire graph at once.

III. Local Estimation and Theory

In order to estimate core numbers efficiently without knowledge of the entire graph, we will restrict the domain of our algorithms to a localized subset around a vertex:

Definition 4

The $r$-neighborhood of a vertex $v$, denoted $N_r(v)$, is the set of vertices at distance at most $r$ from $v$ (we use the typical shortest-path distance function throughout).

The estimation algorithms will vary the size of their input by allowing $r$ to range from zero to $\Delta(G)$, the diameter of $G$ (the maximum distance among all pairs of vertices).

III-A. Neighborhood-based estimation

A relatively naïve approach to local estimation would be to compute a core decomposition of the subgraph of $G$ induced on the $r$-neighborhood of $v$ and use the resulting core number of $v$ as the estimate:

Definition 5

Let the induced estimator, $\hat{I}_r(v)$, be the core number of $v$ in $G[N_r(v)]$.

By increasing $r$, the induced subgraph captures a progressively larger fraction of the graph, improving the estimate until $\hat{I}_r(v) = K(v)$ (which is guaranteed to happen at $r = \Delta(G)$, but could happen for significantly smaller $r$). Note that when $r = 1$, the estimator will only be close to the core number if the neighbors of $v$ are highly interconnected (so there is a subtle relationship to the clustering coefficient).

Lemma 1

Let $G$ be a graph. For all $v \in V$ and $0 \leq r < \Delta(G)$, $0 = \hat{I}_0(v) \leq \hat{I}_r(v) \leq \hat{I}_{r+1}(v) \leq \hat{I}_{\Delta(G)}(v) = K(v)$.

Proof: First we will establish the boundary conditions on our inequality. Because $N_0(v) = \{v\}$, $\hat{I}_0(v) = 0$. Since $N_{\Delta(G)}(v) = V$, $\hat{I}_{\Delta(G)}(v) = K(v)$. Additionally, $N_r(v) \subseteq N_{r+1}(v)$ for all $r$. This implies that for all $u \in N_r(v)$, the degree of $u$ in the subgraph induced by the $r$-neighborhood of $v$ can be no greater than the degree of $u$ in the $(r+1)$-neighborhood of $v$. Thus $v$ cannot participate in a deeper core with respect to the $r$-neighborhood than it does with respect to the $(r+1)$-neighborhood, which makes $\hat{I}_r(v) \leq \hat{I}_{r+1}(v)$.

In order to make a more sophisticated estimation, let us consider the information gained as $r$ increases. At $r = 0$, we assume that $\deg(v)$ is known. Because the $k$-core requires all vertices to have minimum degree $k$, $K(v)$ cannot be greater than $\deg(v)$. Thus, $\deg(v)$ itself can be an estimate of $K(v)$. Expanding out to $r = 1$ allows information about $v$'s immediate neighbors to be utilized. Suppose the core numbers of the neighbors of $v$ were known. It would then be possible to compute $K(v)$ precisely using the following two lemmas (previously shown by Montresor et al.):

Lemma 2 ([9])

A vertex $v$ is in the $k$-core of a graph if and only if $v$ has at least $k$ neighbors in the $k$-core.

We now give a closed-form algebraic expression for the largest $k$ satisfying Lemma 2.

Lemma 3 ([9])

Let $u_1, \ldots, u_{\deg(v)}$ be the $K$-ordered neighbors of $v$. Then
$$K(v) = \max_{1 \leq i \leq \deg(v)} \min\left(K(u_i),\, i\right).$$

Proof: By the definition of core number and Lemma 2, $K(v)$ is the largest $k$ such that $v$ has at least $k$ neighbors with core number at least $k$. For each $i$ ($1 \leq i \leq \deg(v)$), $v$ has $i$ neighbors with core numbers at least $K(u_i)$, since $K(u_1) \geq \cdots \geq K(u_i)$. If $K(u_i) \geq i$, $v$ has at least $i$ neighbors in the $i$-core. Otherwise, $v$ has at least $K(u_i)$ neighbors in the $K(u_i)$-core. This shows that the core number of $v$ must be at least the minimum of $K(u_i)$ and $i$ for every $i$ (and thus at least the maximum over $i$). Equality follows easily by contradiction.

Note that $K(v)$ is the maximum value of the minimum of two functions of $i$: $f_1(i) = i$ and $f_2(i) = K(u_i)$. With respect to $i$, $f_1$ is monotonically increasing and $f_2$ is monotonically non-increasing. From a geometric perspective, then, the maximum of their minimums occurs at the intersection of the curves $y = f_1(i)$ and $y = f_2(i)$ (as stylized in Figure 2).
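The closed form of Lemma 3 is short enough to state in code; the following Python sketch (the function name core_from_neighbors is ours) takes the core numbers of $v$'s neighbors and returns $K(v)$:

def core_from_neighbors(neighbor_cores):
    # Lemma 3: K(v) = max_i min(K(u_i), i), with the neighbors sorted
    # in non-increasing order of core number and i 1-indexed.
    ordered = sorted(neighbor_cores, reverse=True)
    return max(min(c, i) for i, c in enumerate(ordered, start=1))

For neighbor core numbers [5, 1, 3, 3], the ordered list is [5, 3, 3, 1], the pairwise minimums are 1, 2, 3, 1, and the function returns 3.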

Fig. 2: $K(v)$ is the value at the intersection of the two functions $f_1(i) = i$ and $f_2(i) = K(u_i)$.

Although the core numbers of the neighbors of a vertex may not be known a priori, the reasoning behind Lemma 3 gives useful insight into the behavior of $K(v)$. As shown by Cheng et al. [12], an upper bound on $K(v)$ can be achieved if an upper bound on $K(u)$ is known for all $u \in N_1(v)$.

Theorem 1 ([12])

Let $G$ be a graph, $v \in V$, and $f$ any function satisfying $f(u) \geq K(u)$ for all $u \in N_1(v)$. Let $v$'s neighbors $u_1, \ldots, u_{\deg(v)}$ be $f$-ordered. Then
$$K(v) \leq \max_{1 \leq i \leq \deg(v)} \min\left(f(u_i),\, i\right).$$

Proof: Substituting $f(u_i)$ for $K(u_i)$ in the expression from Lemma 3 can only increase the right hand side, giving an upper bound on $K(v)$.

We base our second estimator on the idea of incorporating iterative upper bounds on the core numbers of a vertex’s neighbors:

Definition 6

Let the propagating estimator, $\hat{P}_r(v)$, be the estimator of $K(v)$ given by the recurrence
$$\hat{P}_r(v) = \begin{cases} \deg(v) & \text{if } r = 0, \\ \max_{1 \leq i \leq \deg(v)} \min\left(\hat{P}_{r-1}(u_i),\, i\right) & \text{otherwise,} \end{cases}$$
where $u_1, \ldots, u_{\deg(v)}$ are the $\hat{P}_{r-1}$-ordered neighbors of $v$.

Pseudocode for computing $\hat{P}_r(v)$ is given in Algorithm 2. Essentially, Algorithm 2 first computes the coarsest upper bound ($\hat{P}_0(u) = \deg(u)$) for those vertices at distance at most $r$ from $v$. Those estimates are used in conjunction with Theorem 1 to compute a slightly finer upper bound, $\hat{P}_1$, for those vertices at distance at most $r - 1$ from $v$. This process “propagates” inwards towards $v$ until its immediate neighbors have $\hat{P}_{r-1}$ values, which are used as the upper bounds in formulating $\hat{P}_r(v)$. The computational complexity of finding $\hat{P}_r(v)$ is linear in the product of $r$ and the number of edges in $G[N_r(v)]$ (see Theorem 2). Since Algorithm 1 is also linear with respect to the number of edges in the graph, the computational complexity of computing $\hat{P}_r(v)$ is comparable to that of computing $K(v)$ exactly for small $r$.

1: input: Graph $G$, vertex $v$, radius $r$
2: output: $\hat{P}_r(v)$
3: if $r = 0$ then
4:     return $\deg(v)$
5: else
6:     for $u \in N_1(v)$ do
7:         Compute $\hat{P}_{r-1}(u)$
8:     end for
9:     $\hat{P}_{r-1}$-order $N_1(v)$ as $u_1, \ldots, u_{\deg(v)}$
10:     $best \leftarrow 0$
11:     for $i = 1$ to $\deg(v)$ do
12:         $x \leftarrow \min(\hat{P}_{r-1}(u_i),\, i)$
13:         if $x > best$ then
14:             $best \leftarrow x$
15:         end if
16:     end for
17:     return $best$
18: end if
Algorithm 2 Algorithm for computing $\hat{P}_r(v)$
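The recurrence also admits a compact memoized implementation; the Python sketch below (assuming an adjacency-dict representation; the function name is ours) computes the same values as Algorithm 2, trading the level-by-level propagation for recursion on (vertex, depth) pairs:

from functools import lru_cache

def propagating_estimate(adj, v, r):
    # P-hat_r(v) per Definition 6, for a graph given as {vertex: set of neighbors}.
    @lru_cache(maxsize=None)
    def p_hat(u, depth):
        if depth == 0:
            return len(adj[u])  # P-hat_0(u) = deg(u)
        # Apply the closed form of Theorem 1 to the neighbors' bounds.
        bounds = sorted((p_hat(w, depth - 1) for w in adj[u]), reverse=True)
        return max(min(b, i) for i, b in enumerate(bounds, start=1))
    return p_hat(v, r)

Because each (vertex, depth) pair is cached, the total work matches the dynamic program analyzed in the proof of Theorem 2 below.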
Theorem 2

For a given vertex $v$, $\hat{P}_r(v)$ can be computed in $O(r \cdot |E_r|)$ time using Algorithm 2, where $E_r$ is the edge set of $G[N_r(v)]$.

Proof: Fix $r' \leq r$ and $u \in N_{r - r'}(v)$. Assume $\hat{P}_{r'-1}(w)$ is known for all $w \in N_1(u)$. Since $\hat{P}_{r'-1}$ can only take integer values between 0 and the maximum degree, the sorting (line 9) can be done in $O(\deg(u))$ time using a bucket sort. Once sorted, each neighbor of $u$ is visited once (line 11) to find the minimum, which can also be done in $O(\deg(u))$ time. Thus computing $\hat{P}_{r'}(u)$ from the values of $\hat{P}_{r'-1}$ has complexity $O(\deg(u))$.

Using dynamic programming, we can compute and store $\hat{P}_0(u)$ for all $u \in N_r(v)$, which in turn can be used to compute $\hat{P}_1(u)$ for all $u \in N_{r-1}(v)$, and so on (through $r$ iterations) until we have computed $\hat{P}_r(v)$. The $j$th such iteration requires $O\left(\sum_{u \in N_{r-j}(v)} \deg(u)\right)$ operations. Since $\sum_{u \in N_{r-1}(v)} \deg(u) \leq 2|E_r|$, the total running time is $O(r \cdot |E_r|)$.

Unlike $\hat{I}_r(v)$, the estimate $\hat{P}_r(v)$ is a decreasing upper bound on $K(v)$ as $r$ increases:

Theorem 3

$K(v) \leq \hat{P}_{r+1}(v) \leq \hat{P}_r(v)$, for any $r \geq 0$.

Proof: We first prove that $K(v) \leq \hat{P}_r(v)$ for all $r$ by induction on $r$. The base case $K(v) \leq \deg(v) = \hat{P}_0(v)$ holds, since the core number is always bounded by the degree. Assume $K(u) \leq \hat{P}_{r-1}(u)$ for all $u$. Then $\hat{P}_{r-1}$ is an upper bound as in Theorem 1, and substituting the right hand side using Definition 6, we have $K(v) \leq \hat{P}_r(v)$.

We now prove that $\hat{P}_{r+1}(v) \leq \hat{P}_r(v)$ for all $v$ and $r$ by induction on $r$. Combining Definition 6 with Lemma 3, we see that $\hat{P}_r(v)$ is the maximum of the minimum of the functions $f_1(i) = i$ and $f_2(i) = \hat{P}_{r-1}(u_i)$ of $i$. Since the maximum of $f_1$ is $\deg(v)$, $\hat{P}_r(v) \leq \deg(v)$. Because $\hat{P}_0(v) = \deg(v)$, the base case $\hat{P}_1(v) \leq \hat{P}_0(v)$ is satisfied. Suppose that for some $r$, $\hat{P}_r(u) \leq \hat{P}_{r-1}(u)$ for all $u$, and let $k = \hat{P}_{r+1}(v)$. By Definition 6, $v$ has at least $k$ neighbors $u$ that satisfy $\hat{P}_r(u) \geq k$. Each such $u$ also satisfies $\hat{P}_{r-1}(u) \geq \hat{P}_r(u) \geq k$, meaning $v$ can have no fewer than $k$ neighbors that satisfy $\hat{P}_{r-1}(u) \geq k$. Thus $\hat{P}_r(v) \geq k = \hat{P}_{r+1}(v)$.

III-B. Structures leading to error

One natural question is whether either $\hat{I}_r$ or $\hat{P}_r$ has bounded error (i.e., is a constant-factor approximation of the core number). Unfortunately, there are extremal constructions forcing unbounded error for both estimators; both are based on $T_{b,\ell}$, the complete $b$-ary tree with $\ell$ levels (labelled so that level $i$ has $b^{i-1}$ vertices), rooted at a vertex $t$ (Figure 3).

Lemma 4

For all and integers , there exists a graph and vertex so that .

{proof}

First note that since is a tree, it is 1-degenerate. We show that the root vertex has estimators with unbounded error. For any vertex with level number in , every vertex not equal to in ’s -neighborhood has degree . As a result, , and (since has degree and will have propagated inwards). Thus, for any , we have .

Fig. 3: $T_{b,\ell}$ is in blue. $T'_{b,\ell}$ is $T_{b,\ell}$ plus the vertices $x_1, \ldots, x_b$ (red) and the dashed edges.
Lemma 5

For all $r \geq 0$ and integers $M$, there exists a graph $G$ and vertex $v$ so that $K(v) - \hat{I}_r(v) \geq M$.

Proof: Consider the graph $T'_{b,\ell}$ created by adding $b$ vertices $x_1, \ldots, x_b$ to $T_{b,\ell}$ and then connecting each of them to each of the leaves of $T_{b,\ell}$ (see Figure 3). Then the root $t$ is a vertex of minimum degree in $T'_{b,\ell}$ and has core number $b$. Any induced subgraph of $T'_{b,\ell}$ that does not include at least one $x_i$ is a tree, making it 1-degenerate. Since the $x_i$ are at distance $\ell$ from the root, $\hat{I}_r(t) \leq 1$ whenever $r < \ell$. Taking $b \geq M + 1$ and $\ell > r$ completes the proof.

Note that in $T_{b,\ell}$, $\hat{I}_r(t) = K(t) = 1$ for any $r \geq 1$. Likewise, in $T'_{b,\ell}$, $\hat{P}_r(t) = K(t) = b$ for any $r$. Despite the fact that the errors of both estimators can theoretically be arbitrarily large, structures causing egregious errors (like $T_{b,\ell}$ and $T'_{b,\ell}$) are unlikely to occur in real world networks; we provide evidence to support this claim in the next sections.
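These constructions are also easy to verify computationally; a sketch using networkx follows (balanced_tree builds the complete $b$-ary tree, and induced_estimate is our helper implementing Definition 5):

import networkx as nx

def induced_estimate(G, v, r):
    # I-hat_r(v): core number of v within G[N_r(v)].
    return nx.core_number(nx.ego_graph(G, v, radius=r))[v]

b, height = 4, 4
T = nx.balanced_tree(b, height)  # complete b-ary tree; the root is node 0
root = 0
print(nx.core_number(T)[root], induced_estimate(T, root, 2))  # 1 1: trees are benign for I-hat

# T': attach b new vertices to every leaf of T, as in Lemma 5.
Tp = T.copy()
leaves = [u for u in T if T.degree(u) == 1]
Tp.add_edges_from((("x", i), u) for i in range(b) for u in leaves)
print(nx.core_number(Tp)[root], induced_estimate(Tp, root, 2))  # 4 1: unbounded error for I-hat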

III-C. Expected behavior on random graphs

In order to better understand the errors generated when approximating the core number with $\hat{P}_1$, we analyze its behavior on a well-studied random graph model.

Definition 7 ([13])

Erdös-Rényi random graphs, denoted $G(n, p)$, are the family of graphs with $n$ vertices constructed by placing an edge between each pair of vertices independently at random with probability $p$.

To avoid confusion, we use $G(n, p)$ to denote the set of all Erdös-Rényi random graphs with $n$ vertices and edge probability $p$, and $G \in G(n, p)$ for a specific instance. Since all graphs on $n$ vertices occur in $G(n, p)$ with non-zero probability (when $0 < p < 1$), analysis typically focuses on whether a graph property is very likely (or unlikely) to occur as the size of an Erdös-Rényi random graph grows large. In keeping with prior work ([14, 15, 16]), we assume the average degree is fixed, letting $p = c/n$ for a constant $c$. Under this assumption, we use the following notion of “very likely”:

Definition 8

A random event $A_n$ is said to happen asymptotically almost surely (a.a.s.) if $\lim_{n \to \infty} \Pr[A_n] = 1$.

Specifically, we focus our attention on the growth of the error term $\hat{P}_1(v) - K(v)$ as $n \to \infty$ by deriving probabilistic expressions for $\hat{P}_1(v)$ and $K(v)$ for any $v$, then demonstrating how each term grows with $n$ compared to a function in $\omega(1)$ (recall a function $g$ is $\omega(1)$ if $\lim_{n \to \infty} g(n) = \infty$).

Theorem 4

Suppose $p$ is $c/n$ for a constant $c$. Then for any $g \in \omega(1)$ and any vertex $v$, $\hat{P}_1(v) - K(v) \leq g(n)$ asymptotically almost surely.

Proof: Fix a vertex $v$, and let $N^{>}$, $N^{=}$, and $N^{<}$ be defined to be the subsets of $N_1(v)$ with vertices of degree greater than, equal to, or less than $k$, respectively. We first evaluate $\Pr[\hat{P}_1(v) = k]$. By Definition 6, if $\hat{P}_1(v) = k$, $v$ has at least $k$ neighbors with degree at least $k$ but fewer than $k + 1$ neighbors with degree at least $k + 1$ (or else $\hat{P}_1(v) > k$). Therefore, $\hat{P}_1(v) = k$ implies $|N^{>}| \leq k$ and $|N^{>}| + |N^{=}| \geq k$.

As $n$ tends to infinity, the probability that any two neighbors of $v$ have an edge between them approaches 0. Therefore, the degrees of $v$'s neighbors can be treated as independent, identical distributions in the limit. Since $n$ is large and $c$ is fixed, this distribution is asymptotically Poisson with mean $c$. If $\phi$ and $\Phi$ denote the Poisson probability mass function and cumulative distribution function, respectively, with mean $c$, then conditioned on $\deg(v) = d$:

$$\Pr\left[\hat{P}_1(v) = k \mid \deg(v) = d\right] = \sum_{a=0}^{k} \binom{d}{a} \left(1 - \Phi(k)\right)^{a} \sum_{b=\max(0,\, k-a)}^{d-a} \binom{d-a}{b}\, \phi(k)^{b}\, \Phi(k-1)^{d-a-b}, \qquad (1)$$

where $a = |N^{>}|$ and $b = |N^{=}|$. By computing Equation 1 at each value of $d$ for which $k$ is a possible value for the core number, we have:

$$\Pr\left[\hat{P}_1(v) = k\right] = \sum_{d \geq k} \phi(d)\, \Pr\left[\hat{P}_1(v) = k \mid \deg(v) = d\right]. \qquad (2)$$

Pittel et al. [14] demonstrated that in $G(n, c/n)$, the proportion of vertices in the $k$-core is a.a.s. a function of $c$ and $k$ but not of $n$. Moreover, for any vertex $v$, $K(v)$ is a.a.s. bounded by a constant (equivalently, $O(1)$ in $n$). If $\hat{P}_1(v)$ were also bounded by a constant, then the error term would be a.a.s. $O(1)$. However, since the Poisson random variables in Equations 1 and 2 are only parameterized by $c$ and not by $n$, the proportion of vertices in $G$ with $\hat{P}_1(v) = k$ is a.a.s. convergent to some non-zero constant for each fixed $k$. Thus, the probability of $\hat{P}_1(v)$ attaining an arbitrarily large fixed value does not vanish as $n$ grows large for a fixed (constant) $k$.

Let $g$ be a function of $n$ in $\omega(1)$. By Stirling's approximation of the factorial,

$$\phi(g(n)) = \frac{c^{g(n)} e^{-c}}{g(n)!} \approx \frac{e^{-c}}{\sqrt{2 \pi g(n)}} \left(\frac{c e}{g(n)}\right)^{g(n)}.$$

Then as $n$ grows large, $g(n) \to \infty$ and $\phi(g(n)) \to 0$. In Equation 1, the probability that a neighbor of vertex $v$ has degree at least $g(n)$ (that is, $1 - \Phi(g(n) - 1)$) is asymptotically zero, and consequently $\Pr[\hat{P}_1(v) \geq g(n)]$ also vanishes in the limit. This implies that a.a.s. $\hat{P}_1(v) < g(n)$ for any error function $g \in \omega(1)$. Using the result of [14] that $K(v)$ is a.a.s. $O(1)$, we have a.a.s. $\hat{P}_1(v) - K(v) \leq g(n)$.

IV. Experimental Results

In the previous section, the behavior of the propagating estimator was analyzed from a theoretical perspective. In order to enhance this picture, we present computational results on a corpus of real data.

IV-A. Methods

The estimators $\hat{I}_r$ and $\hat{P}_r$ were evaluated on nine real-world graphs that appear in the following section. (We also tested several additional graphs, which gave qualitatively similar results and are thus omitted for length; the results can be found in the arXiv version of this paper.) Not only do these graphs cover a variety of domains, but they also are structurally dissimilar. The graphs vary in size, density, core structure, and diameter (see Table I and Figure 4).

Graph | Description | $n$ | $m$ | Max. degree | Degeneracy | Diameter
Amazon [17] | Co-purchases | 334863 | 925872 | 549 | 6 | 47
AS [17] | Autonomous systems | 6214 | 12232 | 1397 | 12 | 9
ca-AstroPh [17] | Academic citations | 17903 | 196972 | 504 | 56 | 14
DBLP [17] | Academic citations | 317080 | 1049866 | 343 | 113 | 23
Enron [17] | Email correspondence | 33696 | 180811 | 1383 | 43 | 13
Facebook [18] | Facebook friendship | 36371 | 1590655 | 6312 | 81 | 7
Gnutella [17] | Peer-to-peer filesharing | 26498 | 65359 | 355 | 5 | 11
H. sapiens [19] | Protein-protein interaction | 18625 | 146322 | 9777 | 47 | 10
WPG [20] | Western US power grid | 4941 | 6594 | 19 | 5 | 46
TABLE I: Summary statistics for real-world graphs

We computed the core number $K(v)$, as well as the values of $\hat{I}_r(v)$ and $\hat{P}_r(v)$, for each vertex, letting $r$ vary from 0 to $\Delta(G)$. (Code and data are available at https://dl.dropboxusercontent.com/u/32167511/core_number_estimate.zip.) To compare the accuracy of the estimators among vertices, we normalize by the true core number at each vertex. We refer to this metric as the core number estimate ratio. When the estimator ($\hat{I}_r$ or $\hat{P}_r$) is exactly equal to the core number, the core number estimate ratio is 1, its optimal value. Since $\hat{P}_r(v)$ is an upper bound on $K(v)$, its core number estimate ratio is always at least one and becomes less optimal the larger it gets; the opposite is true for $\hat{I}_r(v)$, a lower bound.
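As a sketch of the bookkeeping (K and est are dicts, with our own names, holding the true core numbers and the estimates at some fixed radius):

# Core number estimate ratio (the accuracy metric above): the estimate
# normalized by the true core number; 1.0 is optimal.
ratios = {v: est[v] / K[v] for v in K}  # K(v) >= 1 since there are no isolated vertices
fraction_optimal = sum(1 for x in ratios.values() if x == 1.0) / len(ratios)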

IV-B. Results

We first turn our attention to how often the estimators achieve optimal core number estimate ratios. Figure 5 shows how the proportion of vertices with ratio one grows as $r$ increases from zero to four. In all the graphs, the core number estimate ratio for $\hat{P}_r$ is optimal at least as often as that of $\hat{I}_r$ at $r = 1$. Additionally, the proportion of vertices with optimal estimate ratios is large in all the graphs. While the propagating estimator does not have as pronounced an advantage over the induced estimator at larger values of $r$, the number of vertices with optimal estimate ratios still grows noticeably.

Fig. 4: Core number distribution for the real world networks in Table I.
Fig. 5: Proportion of vertices with optimal core number estimate ratios for the propagating (solid green) and induced (dashed blue) estimators as a function of $r$. Panels (a)–(i): Amazon, AS, ca-AstroPh, DBLP, Enron, Facebook, Gnutella, H. sapiens, WPG.

We also examined the distribution of core number estimate ratios among those vertices where the estimate was not exact. The change in this distribution as $r$ ranges from zero to four is shown in Figure 6, demonstrating that not only are the sub-optimal estimates closely centered around the optimal ratio of 1, but also that increasing $r$ can significantly decrease the size of the “tail” of the distribution (thereby improving the core number estimates of those vertices with the least optimal ratios).

Fig. 6: Number of vertices with core number estimate ratios less optimal than a given threshold, from $r = 0$ (lightest line) to $r = 4$ (darkest line). $\hat{P}_r$ is shown in green while $\hat{I}_r$ is shown in blue. Because the number of vertices with optimal ratios is frequently large (see Figure 5), the vertices with optimal ratios may not appear within the limits of the plot, in order to better capture the distribution of those vertices with suboptimal ratios. Panels (a)–(i): Amazon, AS, ca-AstroPh, DBLP, Enron, Facebook, Gnutella, H. sapiens, WPG.

Although Figures 5 and 6 suggest that $\hat{I}_r$ and $\hat{P}_r$ can accurately estimate the core numbers in real world graphs using only knowledge of the $r$-neighborhood for a small value of $r$, it is important to understand how the size of the $r$-neighborhood impacts the behavior of the estimates. The purpose of having a localized estimate is to reduce the size of the input needed to compute the core number of a vertex. If the average $r$-neighborhood encompasses most of the graph, then not only is this purpose defeated, but we also may not be able to judge whether the accuracy of the localized estimates is only due to having knowledge of the entire graph (as opposed to any theoretical merits of the algorithms themselves). The mean and variance of the proportion of vertices in the $r$-neighborhood are shown in Table II. The rate of growth of $r$-neighborhood sizes varies significantly among the nine graphs, which suggests that picking a value of $r$ to maintain appropriately small $r$-neighborhoods is highly dependent on the structure of the graph. Nonetheless, the average size of the neighborhood is below ten percent of the entire graph for all datasets at $r = 1$, and in all but one (namely Facebook, which we know to be significantly different from the other networks) at $r = 2$.

Graph | Avg. | Var. | Avg. | Var. | Avg. | Var. | Avg. | Var.
Amazon |
AS |
ca-AstroPh |
DBLP |
Enron |
Facebook |
Gnutella |
H. sapiens |
WPG |
TABLE II: Proportion of vertices in $N_r(v)$, with one Avg./Var. column pair per radius. Values below a small threshold are rounded to 0.

Another natural way to measure the relative amount of information in the $r$-neighborhood is to normalize $r$ by the diameter $\Delta(G)$. Figure 7 shows that the average proportion of vertices in the $r$-neighborhood is approximately uniform across all graphs when $r$ is expressed as a fraction of $\Delta(G)$. In particular, there is a significant increase in the rate of growth of the $r$-neighborhood size once $r$ reaches a moderate fraction of the diameter. Thus, as one might expect, neighborhood-based core number estimates seem most appropriate when $r$ is a small fraction of the diameter.

Fig. 7: Average proportion of vertices in $N_r(v)$ as a function of $r / \Delta(G)$. Panels (a)–(i): Amazon, AS, ca-AstroPh, DBLP, Enron, Facebook, Gnutella, H. sapiens, WPG.

To see the effect of this normalized setting on accuracy, consider Figure 8. After normalizing by the diameter, the variation between graphs is less pronounced. Even when $r$ is small compared to $\Delta(G)$, the optimal core number estimate ratio can be achieved. It is worth noting that Facebook is a stark exception, in which a significant proportion of vertices cannot achieve an optimal core number estimate ratio even when $r = \Delta(G)$. This graph has many vertices whose degree greatly exceeds their core number, and a small diameter. Thus, many vertices have very inaccurate $\hat{P}_0$-values, which propagate inwards and remain uncorrected due to the small number of refinements performed on the estimate. Ultimately, we conclude that $\hat{P}_r$ best achieves its goal of accurately estimating the core number using only a small local section of the graph when the graph has a large diameter and the ratio $r / \Delta(G)$ is small.

Fig. 8: Proportion of vertices with optimal core number estimate ratios for the propagating estimator (solid green) and the induced estimator (dashed blue) as a function of $r$, with the $x$-axis normalized by the diameter. Panels (a)–(i): Amazon, AS, ca-AstroPh, DBLP, Enron, Facebook, Gnutella, H. sapiens, WPG.

V. Application to Network Experimentation

We now turn to the domain of network experiments and use the $\hat{P}_1$ estimator to address an open problem given in [7].

V-A. Problem Statement

Recall from the introduction that a network treatment experiment is a random experiment in which some subjects are given a treatment and the rest are not. It differs from other experiments in that the effects of the treatment are assumed to be dependent on interactions between subjects, which can be modeled by a graph. The general goal is to measure the subjects’ experiences in a hypothetical universe where the entire graph is treated by observing the experience of the subject when only some of the graph is treated. Ugander et al. [7] focused on local properties of the vertices to compare these two scenarios. In particular, they identified two useful ways to concretely measure the experience of a subject via graph properties:

Definition 9 ([7])

A vertex $v$ experiences absolute $k$-degree exposure if $v$ and at least $k$ of $v$'s neighbors receive treatment.

Definition 10 ([7])

A vertex $v$ experiences absolute $k$-core exposure to a treatment condition if $v$ belongs to the $k$-core of the graph $G[T]$, where $T$ is the set of treated vertices.

We will use $E^{d}_{k}(v)$ and $E^{c}_{k}(v)$ to denote the events that a vertex $v$ experiences absolute $k$-degree and absolute $k$-core exposure, respectively.

In order to reduce variance in later sections of their analysis, Ugander et al. first cluster the graph and then assign treatment randomly to the clusters (as opposed to individual subjects). If a cluster is chosen to be treated, all vertices in the cluster receive treatment; otherwise none of them do. Ugander et al. utilize a 3-net clustering that is formed by growing balls of radius two centered at randomly selected vertices until every vertex is covered by some ball. The procedures for computing the probabilities of $E^{d}_{k}(v)$ and $E^{c}_{k}(v)$ are independent of the method by which the graph was clustered, so we choose to omit further detail here and refer the reader to [7] for details.

Once the graph is clustered, a recursive function can be used to compute the probability that vertex $v$ experiences absolute $k$-degree exposure. We follow the notation of [7]. Let $n_v$ be the number of clusters that contain at least one vertex in $N_1(v)$, indexed so that $v$ resides in the highest numbered cluster. If $p$ is the probability that a cluster is treated and $w_i$ is the number of edges from $v$ to the vertices in cluster $i$, then

$$\Pr\left[E^{d}_{k}(v)\right] = p \cdot f(n_v - 1,\; k - w_{n_v}), \qquad (3)$$

where the function $f$ is defined as

$$f(i, j) = \begin{cases} \mathbb{1}\left[j \leq 0\right] & \text{if } i = 0, \\ p \cdot f(i - 1,\; j - w_i) + (1 - p) \cdot f(i - 1,\; j) & \text{otherwise,} \end{cases}$$

where $\mathbb{1}[\cdot]$ denotes the indicator function (evaluates to 1 if the Boolean expression is true and 0 otherwise).

The function $f$ defined above recursively visits each cluster containing a neighbor of $v$ and considers the probability that $v$ is $k$-degree exposed in the first $i$ clusters, conditioned on whether cluster $i$ receives treatment. If cluster $i$ is treated, $v$ needs to have $j - w_i$ treated neighbors in the first $i - 1$ clusters; otherwise, it needs $j$ such neighbors. It follows from Definition 9 that if $v$ is $k$-degree exposed, cluster $n_v$ is necessarily treated. This also implies that all of $v$'s neighbors in the same cluster are necessarily treated as well. Thus, we are ultimately concerned with finding $k - w_{n_v}$ treated neighbors in the remaining $n_v - 1$ clusters that contain a neighbor of $v$. Using dynamic programming, we can compute $\Pr[E^{d}_{k}(v)]$ for all $k \leq \deg(v)$ in $O(n_v \cdot \deg(v))$ time.
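A memoized Python sketch of Equation 3 (the function name degree_exposure_prob is ours; w is the list $w_1, \ldots, w_{n_v}$ of edge counts, with $v$'s own cluster stored last as in the text):

from functools import lru_cache

def degree_exposure_prob(w, p, k):
    # Pr[v is absolute k-degree exposed] per Equation 3. Cluster n_v
    # (w[-1]) must itself be treated, which also treats all of v's
    # neighbors inside it; f searches the remaining clusters.
    @lru_cache(maxsize=None)
    def f(i, j):
        if i == 0:
            return 1.0 if j <= 0 else 0.0  # indicator [j <= 0]
        return p * f(i - 1, j - w[i - 1]) + (1 - p) * f(i - 1, j)
    return p * f(len(w) - 1, k - w[-1])

For example, degree_exposure_prob([1, 2, 1], 0.5, 3) evaluates to 0.25: the vertex's own cluster must be treated (probability 0.5), and the two remaining required treated neighbors can only come from the cluster contributing two edges (probability 0.5).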

V-B. Estimating $k$-core exposure probabilities

In [7], Ugander et al. left computing the exact probability of absolute $k$-core exposure as an open problem, since the core decomposition requires knowledge of the entire graph. They instead defer to the fact that the absolute $k$-core exposure probability is bounded from above by the absolute $k$-degree exposure probability, and use the latter in lieu of the former. This is problematic because there may not be a consistent relationship between $\Pr[E^{d}_{k}(v)]$ and $\Pr[E^{c}_{k}(v)]$. For example, if $v$ is a vertex with degree $k$ and core number 1, then $\Pr[E^{c}_{k}(v)] = 0$ for $k > 1$, independent of $\Pr[E^{d}_{k}(v)]$, which may be far from zero. Although this is an extreme case, there are more general cases where the two probabilities are not correlated. Specifically, vertices that require a large value of $r$ before $\hat{P}_r(v)$ approaches $K(v)$ (as in Figure 9) can have many treated neighbors without having a large core number.

Fig. 9: A graph with a subset of its vertices treated. Although four of the vertices have all of their neighbors treated, they only have core number 1 with respect to the treated subgraph.

Recall that $\hat{P}_0(v)$ is the degree of $v$. As we have shown, even expanding the scope to $r = 1$ and computing $\hat{P}_1(v)$ can yield a considerably more accurate estimate of the core number than the degree. Therefore, a tighter bound on the core exposure probability can be achieved by examining the degree exposure probability of $v$'s neighbors. To capture this, we introduce a $\hat{P}_1$-related condition we call neighbor-degree exposure.

Definition 11

A vertex $v$ experiences absolute $k$-neighbor-degree exposure if at least $k$ of $v$'s neighbors experience absolute $k$-degree exposure.

We denote the event that vertex $v$ is absolute $k$-neighbor-degree exposed with $E^{nd}_{k}(v)$. Algorithm 3 gives a method for computing $\Pr[E^{nd}_{k}(v)]$, which can then be used to estimate (specifically, find an upper bound on) $\Pr[E^{c}_{k}(v)]$.

1: input: Graph $G$, vertex $v$, clustering $\mathcal{C}$, exposure probability $p$, desired exposure level $k$
2: output: $\Pr[E^{nd}_{k}(v)]$
3: Let $C_v$ denote the cluster containing $v$
4: $B \leftarrow \{C \in \mathcal{C} : C \cap N_2(v) \neq \emptyset\}$
5: $P \leftarrow 0$
6: for $S \subseteq B$ do
7:     $x \leftarrow$ number of neighbors of $v$ that are absolute $k$-degree exposed when exactly the clusters in $S$ are treated
8:     if $x \geq k$ then
9:         $P \leftarrow P + p^{|S|} (1 - p)^{|B| - |S|}$
10:     end if
11: end for
12: return $P$
Algorithm 3 Absolute $k$-neighbor-degree exposure probability of $v$

The algorithm iterates through all subsets of the clusters containing a vertex in $v$'s 2-neighborhood and determines whether treating them yields a scenario where $v$ has at least $k$ neighbors that are absolute $k$-degree exposed. If so, line 9 adds the probability of that configuration occurring to the final probability. Because it enumerates all possible subsets of $B$, Algorithm 3 will run in $O(2^{|B|})$ time in the worst case. While this algorithm is exponential in the number of clusters, Ugander et al. assume that the graph satisfies restricted growth conditions (namely, that there is a constant $\kappa$ such that $|N_{r+1}(v)| \leq \kappa |N_r(v)|$ for all $v$ and $r$). In this case, the number of clusters that contain vertices from $N_2(v)$ does not grow with respect to the size of the graph [7], which bounds the running time by a constant.
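A brute-force Python sketch of Algorithm 3, without the pruning discussed below (adjacency given as {vertex: set of neighbors}, cluster as a dict mapping each vertex to a cluster id; the function name is ours):

from itertools import combinations

def neighbor_degree_exposure_prob(adj, cluster, v, p, k):
    # Pr[E^nd_k(v)]: enumerate treated subsets of the clusters touching N_2(v).
    n2 = {v} | set(adj[v]) | {w for u in adj[v] for w in adj[u]}
    B = sorted({cluster[u] for u in n2})
    total = 0.0
    for size in range(len(B) + 1):
        for treated in map(set, combinations(B, size)):
            # A neighbor u is absolute k-degree exposed iff u is treated
            # and at least k of u's neighbors are treated (Definition 9).
            exposed = sum(
                1 for u in adj[v]
                if cluster[u] in treated
                and sum(cluster[w] in treated for w in adj[u]) >= k
            )
            if exposed >= k:
                total += p ** size * (1 - p) ** (len(B) - size)
    return total

Each subset $S$ contributes $p^{|S|}(1-p)^{|B|-|S|}$, the probability that exactly the clusters in $S$ are treated among those in $B$.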

In graphs failing the restricted growth requirements, the running time can still be improved. Note that if treating a specific subset $S$ of clusters on line 6 does not yield $k$ vertices in $N_1(v)$ that are $k$-degree exposed, then treating any $S' \subseteq S$ also cannot yield at least $k$ vertices in $N_1(v)$ that are $k$-degree exposed. Thus, if the subsets of $B$ are enumerated in decreasing order of their sizes, we can prune the search space to avoid needless computation. Moreover, the clustering algorithm can be biased towards selecting 3-net clusterings that minimize $|B|$. For example, one possible bias would be to select the centers of the balls with probability proportional to their degrees.

We applied Algorithm 3 and Equation 3 to the WPG data set and binned the data based on the difference $\Pr[E^{d}_{k}(v)] - \Pr[E^{nd}_{k}(v)]$, as shown in Figure 10. It is particularly noteworthy that multiple vertices have a neighbor-degree exposure probability of zero but a non-zero probability of degree exposure. Moreover, many of those vertices have their degree exposure probability maximized (equal to 1). Thus, the empirical data confirm that the absolute degree exposure probability may be a misleading estimate of the absolute core exposure probability.

Fig. 10: $\Pr[E^{d}_{k}(v)] - \Pr[E^{nd}_{k}(v)]$ for the WPG graph, shown in panels (a) and (b). Vertices for which the difference is zero are omitted.

Finally, we consider a second approach for improving the approximation of $\Pr[E^{c}_{k}(v)]$ that, like $\hat{P}_1$, tightens an upper bound by using bounds on the core numbers of the vertices in $N_1(v)$. We examine those vertices $u \in N_1(v)$ that satisfy $\hat{P}_1(u) < k$. These vertices cannot contribute to $v$'s membership in the $k$-core of the treated subgraph, so we can disregard them when computing $\Pr[E^{d}_{k}(v)]$. Thus, we can use Equation 3 with a modified neighborhood to get a tighter upper bound on $\Pr[E^{c}_{k}(v)]$. Figure 11 shows that pruning can decrease the probability for a majority of the vertices (in fact, many probabilities decrease from 1 to 0). This further bolsters our argument that $\Pr[E^{d}_{k}(v)]$ is only weakly correlated with $\Pr[E^{c}_{k}(v)]$, but using information from $v$'s neighbors can yield a much tighter upper bound at minimal additional cost.

Fig. 11: Histogram of differences between $\Pr[E^{d}_{k}(v)]$ before and after pruning. The $x$-axis gives the difference in probability, while the $y$-axis gives the proportion of vertices occurring in that bin. For clarity, only those vertices which are not pruned are considered in the plot. Panels (a)–(i): Amazon, AS, ca-AstroPh, DBLP, Enron, Facebook, Gnutella, H. sapiens, WPG.

VI. Conclusions and Future Work

We introduced $\hat{P}_r$, a novel method of estimating the core number of a vertex that uses only the data available in the $r$-neighborhood of the vertex. We formally proved that in an Erdös-Rényi graph, the error of $\hat{P}_1$ a.a.s. grows more slowly than any unbounded function of the size of the graph. After computing $\hat{P}_r$ on a representative corpus of real-world networks, we demonstrated that a high-accuracy estimate of the core number can be achieved using a limited subset of the graph. Finally, we described two ways in which the estimators could be used to improve calculations in network treatment experiments.

There are a number of natural extensions to this research. Algorithm 2 computes $\hat{P}_{r-1}(u)$ for each neighbor $u$ of $v$, which in turn requires calculating $\hat{P}_{r-2}$ for each of their neighbors, and so forth. However, since $\hat{P}_r(v)$ is geometrically the value at the intersection of the functions $f_1(i) = i$ and $f_2(i) = \hat{P}_{r-1}(u_i)$, refining the core number estimates at the “first” vertices (small $i$) and “last” vertices (large $i$) may not affect where the curves intersect. Thus, computational complexity could possibly be reduced by only refining the estimates of vertices near the intersection.

There may also be use for $\hat{P}_r$ in graph property testing. Property testing refers to using an easily computable graph property in order to give an estimate of a less tractable property. For example, the hyperbolicity of a graph informally measures the extent to which a graph is tree-like [21]. As was discussed in Section III-B, tree-like structures with high degree but low degeneracy can lead to large errors in $\hat{P}_r$. Therefore, a large error in $\hat{P}_r(v)$ may indicate that $v$ participates in a structure with low hyperbolicity. Since the hyperbolicity is expensive to compute exactly ($O(n^4)$ with standard methods), it would be significantly faster to indirectly flag such vertices by computing $\hat{P}_r$ and $\hat{I}_r$ at every vertex and observing their difference.

Acknowledgments

The authors thank Johan Ugander for introducing them to the problem of calculating the $k$-cores in a network experiment during a workshop at the Statistical and Applied Mathematical Sciences Institute (SAMSI) and for providing helpful comments and discussion that improved the manuscript. This work was supported in part by the National Consortium for Data Science Faculty Fellows Program and the Defense Advanced Research Projects Agency under SPAWAR Systems Center, Pacific Grant N66001-14-1-4063. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of DARPA, SSC Pacific, or the NCDS.

References

  • [1] V. Batagelj and M. Zaversnik, “An O(m) algorithm for cores decomposition of networks,” CoRR, 2003.
  • [2] C. Giatsidis, D. M. Thilikos, and M. Vazirgiannis, “Evaluating cooperation in communities with the $k$-core structure,” in Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on.   IEEE, 2011, pp. 87–93.
  • [3] D. W. Matula and L. L. Beck, “Smallest-last ordering and clustering and graph coloring algorithms,” J. ACM, vol. 30, no. 3, pp. 417–427, Jul. 1983.
  • [4] M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley, and H. Makse, “Identification of influential spreaders in complex networks,” Nature Physics, vol. 6, no. 11, pp. 888–893, Aug 2010.
  • [5] G. Bader and C. W. V. Hogue, “An automated method for finding molecular complexes in large protein interaction networks,” BMC Bioinformatics, vol. 4, no. 1, pp. 1–27, 2003.
  • [6] J. I. Alvarez-Hamelin, A. Barrat, and A. Vespignani, “Large scale networks fingerprinting and visualization using the $k$-core decomposition,” in Advances in Neural Information Processing Systems 18.   MIT Press, 2006, pp. 41–50.
  • [7] J. Ugander, B. Karrer, L. Backstrom, and J. M. Kleinberg, “Graph cluster randomization: network exposure to multiple universes,” CoRR, 2013.
  • [8] A. E. Saríyüce, B. Gedik, G. Jacques-Silva, K. Wu, and U. V. Çatalyürek, “Streaming algorithms for $k$-core decomposition,” Proc. VLDB Endow., vol. 6, no. 6, pp. 433–444, Apr. 2013.
  • [9] A. Montresor, F. D. Pellegrini, and D. Miorandi, “Distributed k-core decomposition,” Parallel and Distributed Systems, IEEE Transactions on, vol. 24, no. 2, pp. 288–300, 2013.
  • [10] P. Jakma, M. Orczyk, C. S. Perkins, and M. Fayed, “Distributed $k$-core decomposition of dynamic graphs,” in Proceedings of the 2012 ACM conference on CoNEXT student workshop, ser. CoNEXT Student ’12.   New York, NY, USA: ACM, 2012, pp. 39–40.
  • [11] R. H. Li and J. X. Yu, “Efficient core maintenance in large dynamic graphs,” CoRR, 2012.
  • [12] J. Cheng, Y. Ke, S. Chu, and M. T. Ozsu, “Efficient core decomposition in massive networks,” in Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ser. ICDE ’11.   Washington, DC, USA: IEEE Computer Society, 2011, pp. 51–62.
  • [13] P. Erdös and A. Rényi, “On the evolution of random graphs,” Publ.Math.Inst.Hung.Acad.Sci, vol. 5, pp. 17–61, 1960.
  • [14] B. Pittel, J. Spencer, and N. Wormald, “Sudden emergence of a giant $k$-core in a random graph,” Journal of Combinatorial Theory, Series B, vol. 67, no. 1, pp. 111–151, 1996.
  • [15] S. Janson and M. J. Luczak, “Asymptotic normality of the $k$-core in random graphs,” The Annals of Applied Probability, vol. 18, no. 3, pp. 1085–1137, 2008.
  • [16] T. Luczak, “Size and connectivity of the k-core of a random graph,” Discrete Math., vol. 91, no. 1, pp. 61–68, Jul. 1991.
  • [17] “Stanford large network dataset collection.” [Online]. Available: http://snap.stanford.edu/data/
  • [18] A. L. Traud, P. J. Mucha, and M. A. Porter, “Social structure of Facebook networks,” Physica A, vol. 391, pp. 4165–4180, 2012.
  • [19] “Biological general repository for interaction datasets.” [Online]. Available: http://www.thebiogrid.org
  • [20] “Ilab interdisciplinary research institute.” [Online]. Available: http://www.ilabsite.org/?page_id=12
  • [21] N. Cohen, D. Coudert, and A. Lancin, “Exact and approximate algorithms for computing the hyperbolicity of large-scale graphs,” Laboratoire de Recherche en Informatique, Tech. Rep., 2012-09-25 2012, iD: hal-00735481, version 4.
Graph | Description | $n$ | $m$ | Max. degree | Degeneracy | Diameter
A. thaliana [19] | Protein-protein interaction | 6854 | 16615 | 1308 | 15 | 14
Amazon [17] | Co-purchases | 334863 | 925872 | 549 | 6 | 47
AS [17] | Autonomous systems | 6214 | 12232 | 1397 | 12 | 9
ca-AstroPh [17] | Academic citations | 17903 | 196972 | 504 | 56 | 14
DBLP [17] | Academic citations | 317080 | 1049866 | 343 | 113 | 23
Enron [17] | Email correspondence | 33696 | 180811 | 1383 | 43 | 13
Facebook [18] | Facebook friendship | 36371 | 1590655 | 6312 | 81 | 7
Facebook 2 [18] | Facebook friendship | 1657 | 61049 | 577 | 60 | 6
Facebook 3 [18] | Facebook friendship | 1446 | 59589 | 375 | 60 | 6
Facebook 4 [18] | Facebook friendship | 2672 | 65244 | 405 | 43 | 7
Facebook 5 [18] | Facebook friendship | 2250 | 84386 | 670 | 58 | 6
Gnutella [17] | Peer-to-peer filesharing | 26498 | 65359 | 355 | 5 | 11
H. sapiens [19] | Protein-protein interaction | 18625 | 146322 | 9777 | 47 | 10
WPG [20] | Western US power grid | 4941 | 6594 | 19 | 5 | 46
TABLE III: Summary statistics for all real-world graphs tested, including those omitted from the main text (see Section IV-A)