Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs


Abstract

Graphs (networks) are an important tool to model data in different domains, including social networks, bioinformatics and the world wide web. Most of the networks formed in these domains are directed graphs, where all the edges have a direction and they are not symmetric. Betweenness centrality is an important index widely used to analyze networks. In this paper, given a directed network G and a vertex r ∈ V(G), we first propose a new exact algorithm to compute the betweenness score of r. Our algorithm pre-computes a set RF(r), which is used to prune a huge amount of computations that do not contribute to the betweenness score of r. The time complexity of our exact algorithm depends on |RF(r)| and it is respectively O(|RF(r)| · (n + m)) and O(|RF(r)| · (m + n log n)) for unweighted graphs and weighted graphs with positive weights, where n and m are the numbers of vertices and edges of the network. |RF(r)| is bounded from above by n and in most cases, it is a small constant. Then, for the cases where |RF(r)| is large, we present a simple randomized algorithm that samples from RF(r) and performs computations for only the sampled elements. We show that this algorithm provides an (ε, δ)-approximation of the betweenness score of r. Finally, we perform extensive experiments over several real-world datasets from different domains, for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. Our experiments reveal that in most cases, our algorithm significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Our experiments also show that our proposed algorithm computes betweenness scores of all vertices in sets of sizes 5, 10 and 15 much faster and more accurately than the most efficient existing algorithms.

Keywords:
Networks, directed graphs, betweenness centrality, exact algorithm, approximate algorithm

1 Introduction

Graphs (networks) provide an important tool to model data in different domains, including social networks, bioinformatics, road networks, the world wide web and communication systems. A property seen in most of these real-world networks is that the ties between vertices do not always represent reciprocal relations [24]. As a result, the networks formed in these domains are directed graphs where any edge has a direction and the edges are not always symmetric.

Centrality is a structural property of vertices (or edges) in the network that quantifies their relative importance. For example, it determines the importance of a person within a social network, or a road within a road network. Freeman [13] introduced and defined betweenness centrality of a vertex as the number of shortest paths from all (source) vertices to all others that pass through that vertex. He used it for measuring the control of a human over the communications among others in a social network [13]. Betweenness centrality is also used in some well-known algorithms for clustering and community detection in social and information networks [15].

Although there exist polynomial time and space algorithms for betweenness centrality computation, the algorithms are expensive in practice. Currently, the most efficient existing exact method is Brandes’s algorithm [7]. The time complexity of this algorithm is O(nm) for unweighted graphs and O(nm + n² log n) for weighted graphs with positive weights, where n and m are the number of vertices and the number of edges of the network, respectively. This means this algorithm is not applicable, even for mid-size networks.

However, there are observations that may improve computation of betweenness centrality in practice. In several applications it is sufficient to compute betweenness score of only one or a few vertices. For instance, this index might be computed for only core vertices of communities in social/information networks [32] or for only hubs in communication networks. Another example, discussed in [1], is handling cascading failures. It has been shown that the failure of a vertex with a higher betweenness score may cause greater collapse of the network [30]. Therefore, failed vertices should be recovered in the order of their betweenness scores. This means it is required to compute betweenness scores of (only) failed vertices. Note that these vertices are not necessarily those that have the highest betweenness scores in the network. Hence, algorithms that identify vertices with the highest betweenness scores [27] are not applicable.

In the current paper, we exploit this observation to design more effective exact and approximate algorithms for computing betweenness centrality in large directed graphs. Our algorithms are based on computing the set of reachable vertices for a given vertex r. On the one hand, this set can be computed very efficiently. On the other hand, it indicates the potential source vertices whose dependency scores on r are non-zero; as a result, it helps us to avoid a huge amount of computations that do not contribute to the betweenness score of r.

In this paper, our key contributions are as follows.

  • Given a directed graph G and a vertex r ∈ V(G), we present an efficient exact algorithm to compute the betweenness score of r. The algorithm is based on pre-computing the set of reachable vertices of r, denoted by RF(r). RF(r) can be computed in O(n + m) time for both unweighted graphs and weighted graphs with positive weights. The time complexity of the whole exact algorithm depends on the size of RF(r) and it is respectively O(|RF(r)| · (n + m)) and O(|RF(r)| · (m + n log n)) for unweighted graphs and weighted graphs with positive weights. |RF(r)| is bounded from above by n and in most cases, it can be considered as a small constant (see Section 5). Hence, in many cases, the time complexity of our proposed exact algorithm for unweighted graphs is linear in the size of the network, and it is O(m + n log n) for weighted graphs with positive weights.

  • In the cases where |RF(r)| is large, our exact algorithm might be intractable in practice. To address this issue, we present a simple randomized algorithm that samples elements from RF(r) and performs computations for only the sampled elements. We show that this algorithm provides an (ε, δ)-approximation of the betweenness score of r.

  • In order to evaluate the empirical efficiency of our proposed algorithms, we perform extensive experiments over several real-world datasets from different domains. In our experiments, we introduce a procedure that first computes RF(r). Then if the size of RF(r) is less than some threshold (e.g., 1000), it employs the exact algorithm. Otherwise, it exploits the randomized algorithm. We evaluate this procedure for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. We show that for randomly chosen vertices, our proposed procedure always significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Furthermore, for the vertices that have the highest betweenness scores, over most of the datasets our algorithm outperforms the most efficient existing algorithms.

  • While our algorithm is primarily designed to estimate the betweenness score of only one vertex, in our experiments we also consider settings where betweenness scores of sets of vertices are computed. Our experiments reveal that in such cases, our proposed algorithm efficiently computes betweenness scores of all vertices in sets of sizes 5, 10 and 15 and it significantly outperforms the existing algorithms.

The rest of this paper is organized as follows. In Section 2, preliminaries and necessary definitions related to betweenness centrality are introduced. A brief overview on related work is given in Section 3. In Section 4, we present our exact and approximate algorithms and their analysis. In Section 5, we empirically evaluate our proposed algorithm and show its high efficiency and accuracy, compared to the existing algorithms. Finally, the paper is concluded in Section 6.

2 Preliminaries

In this section, we present definitions and notations widely used in the paper. We assume that the reader is familiar with basic concepts in graph theory. Throughout the paper, G(V, E) refers to a graph (network). For simplicity, we assume that G is a directed, connected and loop-free graph without multi-edges. Throughout the paper, we assume that G is an unweighted graph, unless it is explicitly mentioned that G is weighted. V(G) and E(G) refer to the set of vertices and the set of edges of G, respectively, and we denote their sizes by n and m. For a vertex v ∈ V(G), the number of head ends adjacent to v is called its in degree, and the number of tail ends adjacent to v is called its out degree.

A shortest path (also called a geodesic path) from u to v is a path whose length is minimum, among all paths from u to v. For two vertices u, v ∈ V(G), if G is unweighted, by d(u, v) we denote the length (the number of edges) of a shortest path connecting u to v. If G is weighted, d(u, v) denotes the sum of the weights of the edges of a shortest path connecting u to v. By definition, d(v, v) = 0. Note that in directed graphs, d(u, v) is not necessarily equal to d(v, u). For s, t ∈ V(G), σ_{st} denotes the number of shortest paths between s and t, and σ_{st}(v) denotes the number of shortest paths between s and t that also pass through v. Betweenness centrality of a vertex v is defined as:

BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}     (1)

A notion which is widely used for counting the number of shortest paths in a graph is the directed acyclic graph (DAG) containing all shortest paths starting from a vertex s (see e.g., [7]). In this paper, we refer to it as the shortest-path-DAG, or SPD in short, rooted at s. For every vertex s in graph G, the SPD rooted at s is unique, and it can be computed in O(n + m) time for unweighted graphs and in O(m + n log n) time for weighted graphs with positive weights [7].
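As an illustration, the following C++ sketch builds the SPD rooted at a vertex s with a single BFS in the unweighted case. The adjacency-list representation and the names SPD and build_spd are assumptions of this sketch, not notation from the paper.

```cpp
// A minimal sketch (unweighted case): one BFS from the source s that builds the
// SPD rooted at s, i.e. distances, shortest-path counts sigma and predecessor lists.
#include <vector>
#include <queue>

struct SPD {
    std::vector<long long> dist;            // d(s, v); -1 if v is unreachable from s
    std::vector<double> sigma;              // number of shortest s-v paths
    std::vector<std::vector<int>> pred;     // predecessors of v in the SPD
    std::vector<int> order;                 // vertices in non-decreasing distance from s
};

SPD build_spd(const std::vector<std::vector<int>>& adj, int s) {
    int n = (int)adj.size();
    SPD spd{std::vector<long long>(n, -1),
            std::vector<double>(n, 0.0),
            std::vector<std::vector<int>>(n),
            {}};
    spd.dist[s] = 0;
    spd.sigma[s] = 1.0;
    std::queue<int> q;
    q.push(s);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        spd.order.push_back(v);
        for (int w : adj[v]) {
            if (spd.dist[w] < 0) {                   // w discovered for the first time
                spd.dist[w] = spd.dist[v] + 1;
                q.push(w);
            }
            if (spd.dist[w] == spd.dist[v] + 1) {    // edge (v, w) lies on a shortest path
                spd.sigma[w] += spd.sigma[v];
                spd.pred[w].push_back(v);
            }
        }
    }
    return spd;
}
```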

Brandes [7] introduced the notion of the dependency score of a source vertex s on a vertex v, which is defined as:

\delta_{s\bullet}(v) = \sum_{t \in V(G) \setminus \{s, v\}} \delta_{st}(v)     (2)

where \delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}. We have:

BC(v) = \sum_{s \in V(G) \setminus \{v\}} \delta_{s\bullet}(v)     (3)

Brandes [7] showed that dependency scores of a source vertex on different vertices in the network can be computed using a recursive relation, defined as the following:

\delta_{s\bullet}(v) = \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left( 1 + \delta_{s\bullet}(w) \right)     (4)

where P_s(w) contains the predecessors of w in the SPD rooted at s. As mentioned in [7], given the SPD rooted at s and using Equation 4, for unweighted graphs and weighted graphs with positive weights, the dependency scores of s on all other vertices can be computed in O(n + m) time and O(m + n log n) time, respectively.
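Building on the build_spd sketch above, the accumulation of Equation 4 can be sketched as follows: vertices are swept in reverse BFS order, so every edge of the SPD is processed once.

```cpp
// A minimal sketch of the accumulation phase of Brandes' algorithm (Equation 4).
// Assumes the SPD struct and build_spd from the previous sketch.
#include <vector>

std::vector<double> dependency_scores(const SPD& spd) {
    int n = (int)spd.sigma.size();
    std::vector<double> delta(n, 0.0);
    // process vertices in order of non-increasing distance from the source
    for (int i = (int)spd.order.size() - 1; i >= 0; --i) {
        int w = spd.order[i];
        for (int v : spd.pred[w]) {
            delta[v] += (spd.sigma[v] / spd.sigma[w]) * (1.0 + delta[w]);
        }
    }
    return delta;   // delta[v] is the dependency of the source on v
}
```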

3 Related work

Brandes [7] introduced an efficient algorithm for computing betweenness centrality of all vertices, which is performed in O(nm) and O(nm + n² log n) times for unweighted and weighted networks with positive weights, respectively. Çatalyürek et al. [9] presented the compression and shattering techniques to improve the efficiency of Brandes’s algorithm for large graphs. During compression, vertices with known betweenness scores are removed from the graph and during shattering, the graph is partitioned into smaller components. Holme [18] showed that betweenness centrality of a vertex is highly correlated with the fraction of time that the vertex is occupied by the traffic of the network. Barthelemy [4] showed that many scale-free networks [3] have a power-law distribution of betweenness centrality.

3.1 Generalization to sets

Everett and Borgatti [12] defined group betweenness centrality as a natural extension of betweenness centrality for sets of vertices. Group betweenness centrality of a set is defined as the number of shortest paths passing through at least one of the vertices in the set [12]. The other natural extension of betweenness centrality is co-betweenness centrality. Co-betweenness centrality is defined as the number of shortest paths passing through all vertices in the set. Kolaczyk et al. [19] presented an algorithm for co-betweenness centrality computation of sets of size 2. Chehreghani [10] presented efficient algorithms for co-betweenness centrality computation of any set or sequence of vertices in weighted and unweighted networks. Puzis et al. [25] proposed an efficient algorithm for computing successive group betweenness centrality. The same authors in [26] presented two algorithms for finding a most prominent group. A most prominent group of a network is a set of vertices of minimum size, so that every shortest path in the network passes through at least one of the vertices in the set. The first algorithm is based on a heuristic search and the second one is based on an iterative greedy choice of vertices.

3.2 Approximate algorithms

Brandes and Pich [8] proposed an approximate algorithm based on selecting a set of source vertices and computing the dependency scores of them on the other vertices in the graph. They used various strategies for selecting the source vertices, including: MaxMin, MaxSum and MinSum [8]. In the method of [2], some source vertices are selected uniformly at random, and their dependency scores are computed and scaled for all vertices. Geisberger et al. [14] presented an algorithm for approximate ranking of vertices based on their betweenness scores. In this algorithm, the method for aggregating dependency scores changes so that vertices do not profit from being near the selected source vertices. Chehreghani [11] proposed a randomized framework for unbiased estimation of the betweenness score of a single vertex, and he proposed a non-uniform sampler for selecting the source vertices used in the estimation.

Riondato and Kornaropoulos [27] presented shortest path samplers for estimating betweenness centrality of all vertices or the vertices that have the highest betweenness scores in a graph. They determined the number of samples needed to approximate the betweenness with the desired accuracy and confidence by means of the VC-dimension theory [31]. Recently, Riondato and Upfal [28] introduced algorithms for estimating betweenness scores of all vertices in a graph. They also discussed a variant of the algorithm that finds the top-k vertices. They used Rademacher averages [29] to determine the number of required samples. Finally, Borassi and Natale [6] presented the KADABRA algorithm, which uses balanced bidirectional BFS (bb-BFS) to sample shortest paths. In bb-BFS, a BFS is performed from each of the two endpoints s and t, in such a way that they are likely to explore about the same number of edges.

3.3 Dynamic graphs

Lee et al. [20] proposed an algorithm to efficiently update betweenness centralities of vertices when a new edge is added to the graph. They reduced the search space by finding a candidate set of vertices whose betweenness centralities can be updated. Bergamini et al. [5] presented approximate algorithms that update betweenness scores of all vertices when an edge is inserted or the weight of an edge decreases. They used the algorithm of [27] as the building block. Hayashi et al. [16] proposed a fully dynamic algorithm for estimating betweenness centrality of all vertices in a large dynamic network. Their algorithm is based on two data structures: a hypergraph sketch that keeps track of SPDs, and a two-ball index that helps to identify the parts of the hypergraph sketches that require updates.

4 Computing betweenness centrality in directed graphs

In this section, we present our exact and approximate algorithms for computing betweenness centrality of a given vertex r in a large directed graph G. First, in Section 4.1, we introduce reachable vertices and show that they are sufficient to compute the betweenness score of r. Then, in Sections 4.2 and 4.3, we respectively present our exact and approximate algorithms.

4.1 Reachable vertices

Let G be a directed graph and r ∈ V(G). Suppose that we want to compute the betweenness score of r. To do so, as Brandes’s algorithm [7] suggests, for each vertex s ∈ V(G), we may form the SPD rooted at s and compute the dependency score of s on r. The betweenness score of r will be the sum of all the dependency scores. However, it is possible that in a directed graph, for many vertices s, there is no path from s to r and as a result, the dependency score of s on r is 0. An example of this situation is depicted in Figure 1. In the graph of this figure, suppose that we want to compute the betweenness score of vertex r. If we form the SPD rooted at a vertex s from the hachured part, after visiting the parts of the graph indicated by hachures, we find out that there is no shortest path from s to r and hence, δ_{s•}(r) is 0. The same holds for all vertices in the hachured part of the graph, i.e., the dependency scores of these vertices on r are 0. The question arising here is whether there exists an efficient way to detect the vertices whose dependency scores on r are 0 (so that we can avoid forming SPDs rooted at them). In the rest of this section, we try to answer this question. We first introduce a (usually small) subset of vertices, called reachable vertices and denoted by RF(r), that are sufficient to compute the betweenness score of r. Then, we discuss how this set can be computed efficiently.

Figure 1: In Figure 1(a), the dependency scores of the vertices in the hachured part of the graph (and also r) on r are 0. Figure 1(b) presents the reverse graph of the graph of Figure 1(a). Figure 1(c) shows how RF(r) is computed.
Definition 1

Let G be a directed graph and r ∈ V(G). We say r is reachable from a vertex v ∈ V(G) if there is a (directed) path from v to r. The set of vertices from which r is reachable is denoted by RF(r).

Proposition 1

Let G be a directed graph and r ∈ V(G). If the out degree of r is 0, BC(r) is 0, too. Otherwise, we have:

BC(r) = \sum_{v \in RF(r)} \delta_{v\bullet}(r)     (5)
Proof

If the out degree of r is 0, there is no shortest path in the graph that leaves r; as a result, BC(r) is 0. To prove that Equation 5 holds, we need to prove that for any vertex v ∉ RF(r) (with v ≠ r), the dependency score of v on r is 0. Obviously, this holds, because there is no path from v to r and as a result, no shortest path starting from v can pass over r.

Proposition 1 suggests that for computing the betweenness score of r, we first check whether the out degree of r is greater than 0 and, if so, we compute RF(r). The betweenness score of r is then exactly computed using Equation 5.

If RF(r) is already known, this procedure can significantly improve the computation of the betweenness centrality of r. The reason is that, as our experiments show, in real-world directed networks |RF(r)| is usually significantly smaller than |V(G)|. However, computing RF(r) directly from Definition 1 can be computationally expensive, as in the worst case it requires the same amount of time as computing the betweenness score of r. This motivates us to try to define a set R(r) that satisfies the following properties: (i) RF(r) ⊆ R(r) and (ii) R(r) can be computed effectively, in a time much faster than computing RF(r) directly. Condition (i) implies that each vertex whose dependency score on r is greater than 0 belongs to R(r) and, as a result, BC(r) = Σ_{v ∈ R(r)} δ_{v•}(r). In the following, we present a definition of R(r) and a simple and efficient algorithm to compute it.

Definition 2

Let G be a directed graph. The reverse graph of G, denoted by G^R, is a directed graph such that: (i) V(G^R) = V(G), and (ii) (u, v) ∈ E(G^R) if and only if (v, u) ∈ E(G).

For example, the graph of Figure 1(b) presents the reverse graph of the graph of Figure 1(a).

Definition 3

Let G be a directed graph and r ∈ V(G). We define R(r) as the set that contains any vertex v such that there is a path from r to v in G^R.

Proposition 2

Let G be a directed graph and r ∈ V(G). We have: RF(r) = R(r).

Proof

The proof is straightforward from the definitions of RF(r) and R(r). For each v ∈ V(G), if v ∈ RF(r), then there is a path from v to r in G and, as a result, there is a path from r to v in G^R. Hence, v ∈ R(r) and therefore, RF(r) ⊆ R(r). In a similar way, we can show that R(r) ⊆ RF(r). Therefore, we have: RF(r) = R(r).

An advantage of the above definition of R(r) is that it can be efficiently computed as follows:

  1. first, by flipping the direction of the edges of G, G^R is constructed;

  2. then, if G is weighted, the weights of the edges are ignored;

  3. finally, a breadth-first search (BFS) or a depth-first search (DFS) on G^R starting from r is performed. All the vertices that are met during the BFS (or DFS), except r, are added to R(r).

In fact, while computing RF(r) directly requires solving the multi-source shortest path problem (MSSP), computing R(r) is reduced to the single-source shortest path problem (SSSP), which can be addressed much faster. Figure 1 shows an example of this procedure, where in order to compute R(r), we first generate G^R (Figure 1(b)) and then run a BFS (or DFS) starting from r (Figure 1(c)). The set of vertices that are met during the traversal, except r, forms R(r).

For a vertex r, each of the steps of the procedure of computing R(r), for both unweighted graphs and weighted graphs, can be performed in O(n + m) time. Hence, the time complexity of the procedure of computing R(r), for both unweighted graphs and weighted graphs, is O(n + m). Therefore, R(r) can be computed in a time much faster than computing the betweenness score of r. Furthermore, Proposition 2 says that R(r) contains all the members of RF(r). This means that both of the afore-mentioned conditions are satisfied. Since by Proposition 2 the two sets are equal, in the rest of the paper we simply refer to the computed set as RF(r).
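A minimal C++ sketch of this reverse-BFS procedure is given below; it builds G^R explicitly and then runs a plain BFS from r, which matches the O(n + m) bound discussed above. The adjacency-list representation and the function name reachable_from_set are assumptions of this sketch.

```cpp
// A minimal sketch of computing RF(r): reverse the edges (ignoring weights) and
// run a BFS from r in the reverse graph; every vertex reached, except r itself,
// can reach r in the original graph.
#include <vector>
#include <queue>

std::vector<int> reachable_from_set(const std::vector<std::vector<int>>& adj, int r) {
    int n = (int)adj.size();
    // build the reverse graph G^R
    std::vector<std::vector<int>> radj(n);
    for (int u = 0; u < n; ++u)
        for (int w : adj[u])
            radj[w].push_back(u);
    // BFS on G^R starting from r
    std::vector<char> seen(n, 0);
    std::vector<int> rf;
    std::queue<int> q;
    seen[r] = 1;
    q.push(r);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int u : radj[v]) {
            if (!seen[u]) {
                seen[u] = 1;
                rf.push_back(u);   // u != r, because r is already marked as seen
                q.push(u);
            }
        }
    }
    return rf;   // RF(r)
}
```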

4.2 The exact algorithm

In this section, using the notions and definitions presented in Section 4.1, we propose an effective algorithm to compute the exact betweenness score of a given vertex r in a directed graph G.

Algorithm 1 presents the high level pseudo code of the E-BCD algorithm, proposed for computing the exact betweenness score of r in G. After checking whether or not the out degree of r is 0, the algorithm follows two main steps: (i) computing RF(r) (Lines 7-12 of Algorithm 1), where we use the procedure described in Section 4.1 to compute RF(r); and (ii) computing BC(r) (Lines 13-18 of Algorithm 1), where for each vertex s ∈ RF(r), we form the SPD rooted at s, compute the dependency scores of s on the other vertices and add the value of δ_{s•}(r) to the betweenness score of r. Note that if G is weighted, while in the first step the weights of its edges are ignored, in the second step, during forming SPDs and computing dependency scores, we take the weights into account.

Note also that in Algorithm 1, after computing RF(r), techniques proposed to improve exact betweenness centrality computation, such as compression and shattering [9], can be used to improve the efficiency of the second step. This means that the algorithm proposed here is orthogonal to techniques such as shattering and compression and therefore, they can be merged.

Complexity analysis

On the one hand, as mentioned before, the time complexity of the first step is O(n + m). On the other hand, the time complexity of each iteration of the loop in Lines 15-18 is O(n + m) for unweighted graphs and O(m + n log n) for weighted graphs with positive weights. As a result, the time complexity of E-BCD is O(|RF(r)| · (n + m)) for unweighted graphs and O(|RF(r)| · (m + n log n)) for weighted graphs with positive weights.

1:  E-BCD
2:  Input. A directed network G(V, E) and a vertex r ∈ V(G).
3:  Output. Betweenness score of r.
4:  if out degree of r is 0 then
5:     return 0.
6:  end if
7:  {Compute RF(r):}
8:  RF(r) ← ∅.
9:  G^R ← compute the reverse graph of G.
10:  If G is weighted, ignore the weights of the edges of G^R.
11:  Perform a BFS or DFS on G^R starting from r.
12:  Add to RF(r) all the visited vertices, except r.
13:  {Compute BC(r):}
14:  BC(r) ← 0.
15:  for all vertices s ∈ RF(r) do
16:     Form the SPD rooted at s and compute the dependency scores of s on the other vertices.
17:     BC(r) ← BC(r) + δ_{s•}(r).
18:  end for
19:  return BC(r).
Algorithm 1 High level pseudo code of the algorithm of computing exact betweenness score in directed graphs.
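For illustration, a compact C++ sketch of E-BCD, composed from the helper sketches given earlier (reachable_from_set, build_spd, dependency_scores), is shown below; it is only an outline of Algorithm 1 under the assumptions of those sketches, not the authors' implementation.

```cpp
// A minimal sketch of E-BCD for the unweighted case.
// Assumes SPD, build_spd, dependency_scores and reachable_from_set from the previous sketches.
#include <vector>

double exact_betweenness(const std::vector<std::vector<int>>& adj, int r) {
    if (adj[r].empty()) return 0.0;                 // out degree of r is 0
    std::vector<int> rf = reachable_from_set(adj, r);
    double bc = 0.0;
    for (int s : rf) {                              // only sources that can reach r
        SPD spd = build_spd(adj, s);                // forward phase
        std::vector<double> delta = dependency_scores(spd);
        bc += delta[r];                             // dependency of s on r
    }
    return bc;
}
```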

4.3 The approximate algorithm

For a vertex r, |RF(r)| is always smaller than |V(G)| and, as our experiments (reported in Section 5) show, the difference is usually significant. Therefore, E-BCD is usually significantly more efficient than the existing exact algorithms such as Brandes’s algorithm [7]. However, in some cases, the size of RF(r) can be large (see again Section 5). To make the algorithm tractable for the cases where RF(r) is large, in this section we propose a randomized algorithm that picks some elements of RF(r) uniformly at random and only processes these vertices.

Algorithm 2 shows the high level pseudo code of our randomized algorithm, called A-BCD. Similar to E-BCD, A-BCD first computes RF(r). Then, at each iteration i (1 ≤ i ≤ T), A-BCD picks a vertex s_i from RF(r) uniformly at random, forms the SPD rooted at s_i and computes δ_{s_i•}(r). In the end, the betweenness score of r is estimated as the sum of the computed dependency scores on r multiplied by |RF(r)| / T.

1:  A-BCD
2:  Input. A directed network G(V, E), a vertex r ∈ V(G) and the number of samples T.
3:  Output. Estimated betweenness score of r.
4:  if out degree of r is 0 then
5:     return 0.
6:  end if
7:  {Compute RF(r):}
8:  RF(r) ← ∅.
9:  G^R ← compute the reverse graph of G.
10:  If G is weighted, ignore the weights of the edges of G^R.
11:  Perform a BFS or DFS on G^R starting from r.
12:  Add to RF(r) all visited vertices, except r.
13:  {Estimate BC(r):}
14:  S ← 0.
15:  for i ← 1 to T do
16:     Select a vertex s_i ∈ RF(r) uniformly at random.
17:     Form the SPD rooted at s_i and compute the dependency scores of s_i on the other vertices.
18:     S ← S + δ_{s_i•}(r).
19:  end for
20:  return \tilde{B}(r) ← (|RF(r)| / T) · S.
Algorithm 2 High level pseudo code of the algorithm of computing approximate betweenness score in directed graphs.
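Similarly, the sampling loop of Algorithm 2 can be sketched as follows, reusing the earlier helper sketches; the final rescaling by |RF(r)| / T is what makes the estimate unbiased (see Proposition 3 below). The use of std::mt19937 as the random source is an assumption of this sketch.

```cpp
// A minimal sketch of A-BCD: sample T sources uniformly from RF(r),
// accumulate their dependencies on r, and rescale by |RF(r)| / T.
// Assumes SPD, build_spd, dependency_scores and reachable_from_set from the previous sketches.
#include <vector>
#include <random>

double approx_betweenness(const std::vector<std::vector<int>>& adj, int r, int T,
                          std::mt19937& gen) {
    if (adj[r].empty()) return 0.0;
    std::vector<int> rf = reachable_from_set(adj, r);
    if (rf.empty()) return 0.0;
    std::uniform_int_distribution<int> pick(0, (int)rf.size() - 1);
    double sum = 0.0;
    for (int i = 0; i < T; ++i) {
        int s = rf[pick(gen)];                      // source chosen uniformly at random
        SPD spd = build_spd(adj, s);
        sum += dependency_scores(spd)[r];           // dependency of s on r
    }
    return (double)rf.size() * sum / (double)T;     // unbiased estimate of BC(r)
}
```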

Complexity analysis

Similar to E-BCD, on the one hand, the time complexity of the RF(r) computation step is O(n + m). On the other hand, the time complexity of each iteration of the loop in Lines 15-19 of Algorithm 2 is O(n + m) for unweighted graphs and O(m + n log n) for weighted graphs with positive weights. As a result, the time complexity of A-BCD is O(T · (n + m)) for unweighted graphs and O(n + m + T · (m + n log n)) for weighted graphs with positive weights, where T is the number of iterations (samples).

Error bound

Using Hoeffding’s inequality [17], we can simply derive an error bound for the estimated value of the betweenness score of r. First, in Proposition 3, we prove that in Algorithm 2 the expected value of the output \tilde{B}(r) is BC(r). Then, in Proposition 4, we provide an error bound for \tilde{B}(r).

Proposition 3

In Algorithm 2, we have: E[\tilde{B}(r)] = BC(r).

Proof

For each i, 1 ≤ i ≤ T, we define a random variable X_i as follows: X_i = |RF(r)| · δ_{s_i•}(r), where s_i is the source vertex sampled in the i-th iteration. We have:

E[X_i] = \sum_{s \in RF(r)} \frac{1}{|RF(r)|} \cdot |RF(r)| \cdot \delta_{s\bullet}(r) = \sum_{s \in RF(r)} \delta_{s\bullet}(r) = BC(r).

The random variable \tilde{B}(r) is the average of the T independent random variables X_1, …, X_T. Therefore, we have:

E[\tilde{B}(r)] = \frac{1}{T} \sum_{i=1}^{T} E[X_i] = BC(r).

Proposition 4

In Algorithm 2, let Δ be the maximum dependency score that a vertex may have on r. For a given ε > 0, we have:

\Pr\left[ \left| \tilde{B}(r) - BC(r) \right| \geq \epsilon \right] \leq 2 \exp\left( - \frac{2 T \epsilon^2}{|RF(r)|^2 \Delta^2} \right)     (6)
Proof

The proof is done using Hoeffding’s inequality [17]. Let X_1, …, X_T be independent random variables bounded by the intervals [a_i, b_i], i.e., a_i ≤ X_i ≤ b_i (1 ≤ i ≤ T). Let also \bar{X} = \frac{1}{T}(X_1 + … + X_T). Hoeffding [17] showed that:

\Pr\left[ \left| \bar{X} - E[\bar{X}] \right| \geq \epsilon \right] \leq 2 \exp\left( - \frac{2 T^2 \epsilon^2}{\sum_{i=1}^{T} (b_i - a_i)^2} \right)     (7)

Similar to the proof of Proposition 3, for each i, 1 ≤ i ≤ T, we define the random variable X_i as follows: X_i = |RF(r)| · δ_{s_i•}(r). Note that in Algorithm 2 the vertices s_1, …, s_T are chosen independently; as a result, the random variables X_1, …, X_T are independent, too. Hence, we can use Hoeffding’s inequality, where the X_i’s are the variables |RF(r)| · δ_{s_i•}(r), \bar{X} is \tilde{B}(r), E[\bar{X}] is BC(r), each a_i is 0 and each b_i is |RF(r)| · Δ. Putting these values into Inequality 7 yields Inequality 6.

Inequality 6 says that for given values ε > 0 and 0 < δ < 1, if T is chosen such that

T \geq \frac{|RF(r)|^2 \Delta^2 \ln(2/\delta)}{2 \epsilon^2}     (8)

then Algorithm 2 estimates the betweenness score of r within an additive error ε, with a probability of at least 1 − δ. The difference between Inequality 8 and the number of samples required by the methods that uniformly sample source vertices from the set of all vertices (e.g., [8]) is that in the latter case, the lower bound on the number of samples is a function of |V(G)| instead of |RF(r)|. As mentioned earlier, for most of the vertices, |RF(r)| is much smaller than |V(G)|.
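To make the required number of samples concrete, the short derivation below (a sketch under the notation of Propositions 3 and 4) shows how Inequality 8 follows from Inequality 6; the bound for uniform sampling from all vertices is stated only for comparison and follows from the same Hoeffding argument with |RF(r)| replaced by n.

```latex
% From Inequality 6, requiring the failure probability to be at most \delta:
2\exp\!\left(-\frac{2T\epsilon^{2}}{|RF(r)|^{2}\Delta^{2}}\right) \le \delta
\;\Longleftrightarrow\;
T \ge \frac{|RF(r)|^{2}\,\Delta^{2}\,\ln(2/\delta)}{2\epsilon^{2}} .
% Sampling sources uniformly from V(G) instead gives, by the same argument,
% T \ge \frac{n^{2}\,\Delta^{2}\,\ln(2/\delta)}{2\epsilon^{2}},
% i.e. the two sample sizes differ by a factor of \left(|RF(r)|/n\right)^{2}.
```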

5 Experimental results

We perform extensive experiments on several real-world networks to assess the quantitative and qualitative behavior of our proposed exact and approximate algorithms. The experiments are done on an Intel processor clocked at 2.6 GHz with 16 GB main memory, running Ubuntu Linux 16.04 LTS. The program is compiled by the GNU C++ compiler 5.4.0 using optimization level 3.


Dataset # vertices # edges
Amazon10 262,111 1,234,877
Com-amazon11 334,863 925,872
Com-dblp12 317,080 1,049,866
Email-EuAll13 224,832 340,795
P2p-Gnutella3114 62,586 147,892
Slashdot15 82,144 549,202
Soc-sign-epinions16 131,828 841,372
Web-NotreDame17 325,729 1,497,134
Table 1: Summary of real-world datasets.

We test the algorithms over several real-world datasets from different domains, including the amazon product co-purchasing network [21], the com-dblp co-authorship network [33], the com-amazon network [33], the p2p-Gnutella31 peer-to-peer network [23], the slashdot technology-related news network [22] and the soc-sign-epinions who-trust-whom online social network [22]. All the networks are treated as directed graphs. Table 1 summarizes the specifications of our real-world networks.

As mentioned before, for a directed graph G and a vertex r ∈ V(G), both of our proposed exact and approximate algorithms first compute RF(r), which can be done very effectively. Then, based on the size of RF(r), one may decide to use either the exact algorithm or the approximate algorithm. Hence, in our experiments, we use the following procedure:

  • first, compute RF(r);

  • then, if |RF(r)| is less than a threshold MAX, run E-BCD; otherwise, run A-BCD with a fixed number T of samples.

We refer to this procedure as BCD. The value of MAX depends on the amount of time one wants to spend for computing betweenness centrality. In the experiments reported here, we set MAX to 1000. We compare our method against the most efficient existing algorithm for approximating betweenness centrality, which is KADABRA [6].
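A minimal C++ sketch of this BCD dispatch, built on the helper sketches of Section 4, is shown below; the threshold MAX and the number of samples T are passed in as parameters, and for brevity the helpers recompute RF(r) internally.

```cpp
// A minimal sketch of the BCD procedure used in the experiments.
// Assumes reachable_from_set, exact_betweenness and approx_betweenness from the previous sketches.
#include <vector>
#include <random>

double bcd(const std::vector<std::vector<int>>& adj, int r,
           int max_size, int T, std::mt19937& gen) {
    std::vector<int> rf = reachable_from_set(adj, r);
    if ((int)rf.size() < max_size)
        return exact_betweenness(adj, r);           // E-BCD
    return approx_betweenness(adj, r, T, gen);      // A-BCD with T samples
}
```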

For a vertex v, its empirical approximation error is defined as:

err(v) = \frac{\left| BC(v) - \tilde{B}(v) \right|}{BC(v)} \times 100     (9)

where \tilde{B}(v) is the calculated approximate score.

Dataset | Vertex | BC | RF size | RF size / n | KADABRA samples | KADABRA Time | KADABRA Error (%) | E/A | BCD Time | T_RF | BCD Error (%)
Amazon 13645 19613.1 47187 0.1800 16739 19.14 100 A 2.60 0.26 0.26
91289 87523.6 150 0.0005 100 E 0.67 0.29 0
17054 35752.6 533 0.0020 100 E 1.26 0.29 0
231249 10449.4 4 0.00001 100 E 0.11 0.30 0
246486 1837.58 34 0.0001 100 E 0.17 0.30 0
Com-amazon 202389 1486.8 13 0.00003 15036 27.70 100 E 0.14 0.27 0
263212 364 3 0.000008 100 E 0.12 0.27 0
81097 11 14 0.00004 100 E 0.15 0.27 0
13732 1701.51 616 0.0018 100 E 1.41 0.28 0
29825 139 15 0.00004 100 E 0.15 0.27 0
Com-dblp 4456 10153 2092 0.0065 17873 26.14 100 A 5.74 0.26 1.10
278950 34326.5 11 0.00003 100 E 0.13 0.27 0
244680 232994 22 0.00006 100 E 0.21 0.27 0
21141 1957.93 73 0.0002 100 E 0.48 0.27 0
129908 303543 41 0.0001 100 E 0.53 0.29 0
Email-EuAll 25362 1869.16 2 0.000008 17066 16.01 100 E 0.03 0.08 0
16682 2269.29 64 0.0002 100 E 0.14 0.08 0
8796 241434 21181 0.0942 100 A 1.88 0.07 1.72
50365 3 2 0.000008 100 E 0.03 0.07 0
2139 503650 111674 0.4966 100 A 1.78 0.08 3.59
P2p-Gnutella31 46263 12655.2 2 0.00003 16401 6.88 100 E 0.03 0.04 0
34547 3538.79 173 0.0027 100 E 0.95 0.04 0
54609 27824.9 3 0.00004 100 E 0.03 0.04 0
37518 6175.2 24141 0.3857 100 A 2.44 0.06 11.31
9781 4582130 3 0.00004 100 E 0.02 0.04 0
Slashdot0902 20825 15940.9 21 0.0002 17421 7.95 100 E 0.17 0.16 0
47806 15891.7 3 0.00003 100 E 0.06 0.15 0
48251 21744 3 0.00003 100 E 0.05 0.15 0
20969 43067 369 0.0044 100 E 2.30 0.17 0
57099 6165.01 2 0.00002 100 E 0.05 0.15 0
Soc-sign-epinions 2740 2352.43 36393 0.2760 19099 11.28 100 A 4.57 0.17 55.34
24080 9198.78 2621 0.0198 100 A 4.60 0.15 18.48
38349 75201.9 35 0.0002 100 E 0.24 0.14 0
82156 8802 34 0.0002 100 E 0.19 0.14 0
38266 8052 3 0.00002 100 E 0.04 0.14 0
Web-NotreDame 21026 140 9 0.00002 19908 27.29 100 E 0.08 0.25 0
133847 9003.53 797 0.0024 100 E 1.84 0.25 0
307622 4212.33 44 0.0001 100 E 0.18 0.25 0
176211 2157.42 30 0.00009 100 E 0.14 0.25 0
307134 3079.5 123 0.0003 100 E 0.35 0.25 0
Table 2: Empirical evaluation of BCD against KADABRA for randomly chosen vertices. The values of λ and δ are 0.1 and 0.01, respectively. All the reported times are in seconds. A-BCD is run with a fixed number of samples.

5.1 Results

Table 2 reports the results of our first set of experiments. For KADABRA, we have set δ and λ to 0.01 and 0.1, respectively (see footnote 18). Over each dataset, we choose 5 vertices at random and run our algorithm for each of these vertices. This table has a column, called ”E/A”, where ”E” means the computed score by our proposed algorithm is exact (hence, the approximation error is 0) and ”A” means that |RF(r)| is larger than MAX and therefore, the approximate algorithm has been employed. For the BCD algorithm, we measure both ”Time” and ”T_RF”, where ”T_RF” is the time of computing RF(r) and ”Time” is the running time of the other parts of the algorithm. The total running time of BCD is the sum of ”Time” and ”T_RF”. As can be seen in Table 2, for most of the randomly picked vertices, RF(r) is very small and it can be computed very efficiently. This gives exact results in a very short time, at most a few seconds in total. In all these cases, KADABRA, while spending considerably more time, simply estimates the scores as 0 (therefore, its approximation error is 100%). The randomly picked vertices belong to different ranges of betweenness scores, including high, medium and low.

Dataset | Vertex | KADABRA samples | KADABRA Time | Estimated BC | Error (%)
Amazon 13645 47330 53.98 0 100
91289 0 100
17054 0 100
231249 0 100
246486 0 100
Com-amazon 202389 42207 58.76 0 100
263212 0 100
81097 0 100
13732 0 100
29825 0 100
Com-dblp 4456 50667 77.40 0 100
278950 0 100
244680 0 100
21141 0 100
129908 0 100
Email-EuAll 25362 48079 43.43 0 100
16682 0 100
8796 0 100
50365 0 100
2139 0 100
P2p-Gnutella31 46263 47631 18.12 0 100
34547 0 100
54609 85638.90 568.32
37518 0 100
9781 4281925.72 6.52
Slashdot0902 20825 50776 22.38 0 100
47806 0 100
48251 0 100
20969 0 100
57099 0 100
Soc-sign-epinions 2740 53667 30.54 0 100
24080 0 100
38349 0 100
82156 0 100
38266 0 100
Web-NotreDame 21026 51015 73.92 0 100
133847 0 100
307622 0 100
176211 0 100
307134 0 100
Table 3: Empirical evaluation of KADABRA for λ = 0.01 and δ = 0.01.

After seeing these experimental results, someone may be interested in the following questions:

  • The accuracy of KADABRA depends on the value of λ. Can decreasing the value of λ improve the accuracy of KADABRA and make it comparable to BCD?

  • KADABRA is more efficient for the vertices that have the highest betweenness scores, and since most of the randomly chosen vertices do not have a very high betweenness score, KADABRA does not show a good performance compared to BCD. What is the efficiency of BCD, compared to KADABRA, for the vertices that have the highest betweenness scores?

  • In the experiments reported in Table 2, BCD is used to estimate betweenness score of only one vertex. However, in practice it might be required to estimate betweenness scores of a given set of vertices. How efficient is BCD in this setting?

In the rest of this section, we answer these questions.

Q1

To answer Q1, we run KADABRA with λ = 0.01. The results are reported in Table 3. In this setting, most of the scores estimated by KADABRA are still 0. There are only two exceptions, where, however, the approximation error is high. Note that the running time of KADABRA in this setting is considerably more than in the case of λ = 0.1 and, as a result, more than the running time of BCD.

Q2

To answer Q2, over each dataset, we examine the algorithms for the vertex that has the highest betweenness score (see footnote 19). The results are reported in Table 4. KADABRA can be optimized to estimate betweenness centrality of only the top-k vertices, where k is an input parameter. In the experiments of this part, we use this optimized version of KADABRA with k = 1 and refer to it as KADABRA-TOP-1. In KADABRA-TOP-1, we consider three values for λ: 0.01, 0.005 and 0.0005, and in all the cases, we set δ to 0.1. Similar to the other experiments, we run BCD with the same setting as before. In all the experiments of this part, the size of RF(r) becomes larger than MAX, hence, the scores computed by BCD are approximate scores. In Table 4, in three cases the error of KADABRA-TOP-1 is not reported. The reason is that in these cases the vertex that has the highest betweenness score is not among the vertices considered by KADABRA-TOP-1 as a top-score vertex. Hence, KADABRA-TOP-1 does not report any value for it.

In this setting, none of the algorithms outperforms the other one in all the cases. More precisely, while for some values of λ KADABRA-TOP-1 has a better accuracy as well as a higher running time, in some other cases the story is the other way around. Nevertheless, we can investigate the datasets one by one. Over amazon, for all values of λ, BCD has a better approximation error than KADABRA-TOP-1. In particular, for λ = 0.0005, KADABRA-TOP-1 takes much more time but produces less accurate output. Hence, we can argue that over amazon BCD outperforms KADABRA-TOP-1. The same holds for com-amazon, email-EuAll and web-NotreDame, and over all these datasets, BCD outperforms KADABRA-TOP-1. Over com-dblp, for λ = 0.005, KADABRA-TOP-1 outperforms BCD in terms of both accuracy and running time. This also happens over soc-sign-epinions for λ = 0.01 and λ = 0.005. Hence, one may argue that over these two datasets KADABRA-TOP-1 outperforms BCD. Over p2p-Gnutella31 and slashdot0902, on the one hand, for λ = 0.01 and λ = 0.005, BCD shows a better accuracy, however, it is slightly slower. On the other hand, for λ = 0.0005, KADABRA-TOP-1 shows a better accuracy, however, it takes much more time. Altogether, we can say that for estimating betweenness scores of the vertices that have the highest scores, in most of the datasets BCD works better than KADABRA-TOP-1.

Dataset | Vertex | BC | RF size | RF size / n | λ | KADABRA-TOP-1 samples | KADABRA-TOP-1 Time | KADABRA-TOP-1 Error (%) | BCD Time | T_RF | BCD Error (%)
Amazon 2804 16066000 162707 0.6207 0.01 16181 0.26 - 2.38 0.29 1.35
0.005 45320 0.56 71.69
0.0005 1459502 16.65 3.01
Com-amazon 28081 378550 3812 0.0113 0.01 14619 0.14 - 2.31 0.28 0.52
0.005 40590 0.21 -
0.0005 1249908 3.86 28.90
Com-dblp 49124 24821300 70561 0.2225 0.01 17303 0.64 17.04 6.27 0.27 9.77
0.005 48411 1.62 7.96
0.0005 1581635 54.11 6.79
Email-EuAll 2387 15943100 102596 0.4563 0.01 16588 0.10 33.79 1.76 0.08 3.37
0.005 46123 0.17 17.50
0.0005 1471932 3.87 4.04
P2p-Gnutella31 9781 4580850 36141 0.5774 0.01 13618 0.32 57.61 1.78 0.04 2.59
0.005 40909 1.00 6.51
0.0005 1515822 38.31 0.32
Slashdot0902 18238 8531850 19153 0.2331 0.01 16962 0.99 11.96 3.90 0.10 3.37
0.005 44847 2.52 5.87
0.0005 1718486 103.89 0.16
Soc-sign-epinions 27463 26116100 9880 0.0749 0.01 18601 1.10 2.25 5.43 0.12 2.30
0.005 51502 2.97 0.23
0.0005 2398143 143.92 1.61
Web-NotreDame 7137 323101000 233965 0.7182 0.01 19448 0.18 1.30 2.71 0.235 0.26
0.005 49456 0.30 7.56
0.0005 779273 3.93 2.22
Table 4: Empirical evaluation of BCD against KADABRA-TOP-1 for vertices with the highest betweenness scores. The value of δ is 0.1. All the reported times are in seconds. A-BCD is run with a fixed number of samples.

Q3

To answer Q3, we select a random set of vertices and run BCD for each vertex in the set. The results are reported in Table 5, where the set contains 5, 10 or 15 vertices. Over all the datasets and for each set of vertices, we report the average, maximum and minimum errors of the vertices. For all the datasets, the minimum error is always 0. In Table 5, ”T_RF” is the total time of computing RF(r) for all the vertices in the set and ”Time” is the total time of the other steps of computing betweenness scores of all the vertices in the set. Therefore, the total running time of BCD for a given dataset and a given set is the sum of ”Time” and ”T_RF”. Comparing the results presented in Table 5 with the results presented in Table 3 reveals that for estimating betweenness scores of a set of vertices, BCD significantly outperforms KADABRA (with λ = 0.01). While in most cases the total running time of BCD is less than the running time of KADABRA (even when the size of the set is 15), BCD gives much more accurate results. Note that even when λ in KADABRA is set to 0.01, in many cases BCD is faster than KADABRA. In particular, over datasets such as amazon, com-amazon, email-EuAll and web-NotreDame, even for the sets of size 15, BCD is faster than KADABRA and it always produces much more accurate results.

Dataset | Set size | Error (%): Avg. | Error (%): Max. | Error (%): Min. | Time | T_RF | RF size: Avg. | RF size: Max. | RF size: Min.
Amazon 5 1.47 7.10 0 4.81 1.44 9581.6 47187 4
10 0.73 7.10 0 7.42 3.21 4818.4 47187 1
15 0.88 7.10 0 9.74 4.98 3497.798 47187 1
Com-amazon 5 0 0 0 1.98 1.36 132.2 616 3
10 0 0 0 4.92 3.43 91.2 616 2
15 0 0 0 7.07 5.48 65.93 616 1
Com-dblp 5 0.22 1.10 0 7.09 1.36 447.8 2092 11
10 3.47 19.45 0 20.71 3.08 24483.6 227218 1
15 2.32 19.45 0 28.81 4.92 21351.33 227218 1
Email-EuAll 5 1.06 3.59 0 3.86 0.38 26584.6 111674 2
10 1.39 7.95 0 9.76 0.78 19020.9 111674 2
15 0.93 7.95 0 13.52 1.27 12742.8 111674 2
P2p-Gnutella31 5 2.26 11.31 0 3.47 0.22 4864.2 24141 2
10 7.26 39.17 0 23.09 0.46 5493.6 24141 2
15 6.79 39.17 0 33.27 0.72 8637.73 28122 2
Slashdot0902 5 0 0 0 2.62 0.78 79.6 369 2
10 5.04 50.48 0 11.37 1.38 3784.3 26802 1
15 4.92 50.48 0 14.93 1.99 6662.86 62089 1
Soc-sign-epinions 5 13.37 48.37 0 9.64 0.74 7817.2 36393 3
10 9.68 48.37 0 17.71 1.52 20302.7 109520 1
15 9.38 48.37 0 28.46 2.28 15538.86 109520 1
Web-NotreDame 5 0 0 0 2.58 1.25 200.6 797 9
10 0 0 0 6.89 2.44 231.5 1092 9
15 0.03 0.30 0 13.16 3.62 414.46 2610 1
Table 5: Empirical evaluation of estimating betweenness scores of a set of vertices. All the reported times are in seconds. A-BCD is run with a fixed number of samples.

5.2 Discussion

Our extensive experiments reveal that BCD usually significantly outperforms KADABRA. This is due to the huge pruning that it applies on the set of source vertices that are used to form SPDs and compute dependency scores. Note that in all the cases, RF(r) is computed very efficiently, hence, it does not impose a considerable load on the algorithm. In the case of estimating the betweenness score of the vertex with the highest betweenness score, over two datasets one may argue that KADABRA outperforms BCD. This has two reasons. On the one hand, in these cases the ratio |RF(r)| / n is large; as a result, many SPDs are computed by BCD. On the other hand, the SPDs contain many vertices of the graph; as a result, computing them is more expensive.

In the end, it is worth mentioning that while the size of RF(r) is an important factor in the efficiency of the algorithm, it is not the only factor. For example, both graphs of Figure 2 have the same number of vertices and sets RF(r) of comparable size. However, in the graph of Figure 2(a), each SPD contains only a few vertices and is computed and processed very quickly, whereas in the graph of Figure 2(b), each SPD spans a large part of the graph, so computing and processing it is much more expensive. Therefore, computing BC(r) for the graph of Figure 2(b) takes considerably more time.

Figure 2: Using BCD, BC(r) is computed much faster in the graph of Figure 2(a) than in the graph of Figure 2(b).

6 Conclusion

In this paper, we studied the problem of computing the betweenness score of a vertex in large directed graphs. First, given a directed network G and a vertex r ∈ V(G), we proposed a new exact algorithm to compute the betweenness score of r. Our algorithm first computes a set RF(r), which is used to prune a huge amount of computations that do not contribute to the betweenness score of r. The time complexity of our exact algorithm is respectively O(|RF(r)| · (n + m)) and O(|RF(r)| · (m + n log n)) for unweighted graphs and weighted graphs with positive weights. Then, for the cases where |RF(r)| is large, we presented a simple randomized algorithm that samples from RF(r) and performs computations for only the sampled elements. We showed that this algorithm provides an (ε, δ)-approximation of the betweenness score of r. Finally, we performed extensive experiments over several real-world datasets from different domains, for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. Our experiments revealed that in most cases, our algorithm significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Our experiments also showed that our proposed algorithm computes betweenness scores of all vertices in sets of sizes 5, 10 and 15 much faster and more accurately than the most efficient existing approximate algorithms.

Footnotes

  1. email: mostafa.chehreghani@gmail.com
  2. email: albert.bifet@telecom-paristech.fr
  3. email: talel.abdessalem@telecom-paristech.fr
  10. http://snap.stanford.edu/data/amazon0302.html
  11. http://snap.stanford.edu/data/com-Amazon.html
  12. http://snap.stanford.edu/data/com-DBLP.html
  13. https://snap.stanford.edu/data/email-EuAll.html
  14. http://snap.stanford.edu/data/p2p-Gnutella31.html
  15. http://snap.stanford.edu/data/soc-sign-Slashdot090221.html
  16. http://snap.stanford.edu/data/soc-sign-epinions.html
  17. https://snap.stanford.edu/data/web-NotreDame.html
  18. For given values of λ and δ, KADABRA computes the normalized betweenness of the vertices of the graph within an error λ with a probability of at least 1 − δ. The normalized betweenness of a vertex is its betweenness score divided by n(n − 1). Therefore, we multiply the scores computed by KADABRA by n(n − 1).
  19. We already find this vertex using the exact algorithm.

References

  1. Manas Agarwal, Rishi Ranjan Singh, Shubham Chaudhary, and Sudarshan Iyengar. Betweenness Ordering Problem : An Efficient Non-Uniform Sampling Technique for Large Graphs. CoRR, abs/1409.6470, 2014.
  2. D. A. Bader, S. Kintali, K. Madduri, and M. Mihail. Approximating betweenness centrality. In Proceedings of 5th International Conference on Algorithms and Models for the Web-Graph (WAW), pages 124–137, 2007.
  3. A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
  4. M. Barthelemy. Betweenness centrality in large complex networks. The Europ. Phys. J. B - Condensed Matter, 38(2):163–168, 2004.
  5. Elisabetta Bergamini, Henning Meyerhenke, and Christian Staudt. Approximating betweenness centrality in large evolving networks. In Ulrik Brandes and David Eppstein, editors, Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments, ALENEX 2015, San Diego, CA, USA, January 5, 2015, pages 133–146. SIAM, 2015.
  6. Michele Borassi and Emanuele Natale. KADABRA is an adaptive algorithm for betweenness via random approximation. In Piotr Sankowski and Christos D. Zaroliagis, editors, 24th Annual European Symposium on Algorithms, ESA 2016, August 22-24, 2016, Aarhus, Denmark, volume 57 of LIPIcs, pages 20:1–20:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.
  7. U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
  8. U. Brandes and C. Pich. Centrality estimation in large networks. Intl. Journal of Bifurcation and Chaos, 17(7):303–318, 2007.
  9. Ümit V. Çatalyürek, Kamer Kaya, Ahmet Erdem Sariyüce, and Erik Saule. Shattering and compressing networks for betweenness centrality. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA., pages 686–694. SIAM, 2013.
  10. Mostafa Haghir Chehreghani. Effective co-betweenness centrality computation. In Seventh ACM International Conference on Web Search and Data Mining (WSDM), pages 423–432, 2014.
  11. Mostafa Haghir Chehreghani. An efficient algorithm for approximate betweenness centrality computation. Comput. J., 57(9):1371–1382, 2014.
  12. M. Everett and S. Borgatti. The centrality of groups and classes. Journal of Mathematical Sociology, 23(3):181–201, 1999.
  13. L. C. Freeman. A set of measures of centrality based upon betweenness, sociometry. Social Networks, 40:35–41, 1977.
  14. R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 90–100, 2008.
  15. M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Natl. Acad. Sci. USA, 99:7821–7826, 2002.
  16. Takanori Hayashi, Takuya Akiba, and Yuichi Yoshida. Fully dynamic betweenness centrality maintenance on massive networks. Proceedings of the VLDB Endowment (PVLDB), 9(2):48–59, 2015.
  17. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
  18. P. Holme. Congestion and centrality in traffic flow on complex networks. Adv. Complex. Syst., 6(2):163–176, 2003.
  19. E. D. Kolaczyk, D. B. Chua, and M. Barthelemy. Group-betweenness and co-betweenness: Inter-related notions of coalition centrality. Social Networks, 31(3):190–203, 2009.
  20. M. J. Lee, J. Lee, J. Y. Park, R. H. Choi, and C.-W. Chung. Qube: A quick algorithm for updating betweenness centrality. In Proceedings of the 21st World Wide Web Conference (WWW), pages 351–360, 2012.
  21. Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1), 2007.
  22. Jure Leskovec, Daniel P. Huttenlocher, and Jon M. Kleinberg. Signed networks in social media. In Elizabeth D. Mynatt, Don Schoner, Geraldine Fitzpatrick, Scott E. Hudson, W. Keith Edwards, and Tom Rodden, editors, Proceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010, Atlanta, Georgia, USA, April 10-15, 2010, pages 1361–1370. ACM, 2010.
  23. Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.
  24. M. E. J. Newman. The structure and function of complex networks. SIAM REVIEW, 45:167–256, 2003.
  25. R. Puzis, Y. Elovici, and S. Dolev. Fast algorithm for successive computation of group betweenness centrality. Phys. Rev. E, 76(5):056709, 2007.
  26. R. Puzis, Y. Elovici, and S. Dolev. Finding the most prominent group in complex networks. AI Commun., 20(4):287–296, 2007.
  27. Matteo Riondato and Evgenios M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, 30(2):438–475, 2016.
  28. Matteo Riondato and Eli Upfal. Abra: Approximating betweenness centrality in static and dynamic graphs with rademacher averages. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1145–1154, New York, NY, USA, 2016. ACM.
  29. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
  30. George Stergiopoulos, Panayiotis Kotzanikolaou, Marianthi Theocharidou, and Dimitris Gritzalis. Risk mitigation strategies for critical infrastructures based on graph centrality analysis. International Journal of Critical Infrastructure Protection, 10:34 – 44, 2015.
  31. V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probab. and its Applications, 16(2):264–280, 1971.
  32. Y. Wang, Z. Di, and Y. Fan. Identifying and characterizing nodes important to community structure using the spectrum of the graph. PLoS ONE, 6(11):e27418, 2011.
  33. Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Mohammed Javeed Zaki, Arno Siebes, Jeffrey Xu Yu, Bart Goethals, Geoffrey I. Webb, and Xindong Wu, editors, 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012, pages 745–754. IEEE Computer Society, 2012.