Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs
Abstract
Graphs (networks) are an important tool for modeling data in different domains, including social networks, bioinformatics and the world wide web. Most of the networks formed in these domains are directed graphs, where every edge has a direction and edges are not necessarily symmetric. Betweenness centrality is an important index widely used to analyze networks. In this paper, given a directed network $G$ and a vertex $r \in V(G)$, we first propose a new exact algorithm to compute the betweenness score of $r$. Our algorithm precomputes a set $\mathcal{RV}(r)$, which is used to prune a huge amount of computation that does not contribute to the betweenness score of $r$. The time complexity of our exact algorithm depends on $|\mathcal{RV}(r)|$: it is $O(|\mathcal{RV}(r)| \cdot |E(G)|)$ for unweighted graphs and $O(|\mathcal{RV}(r)| \cdot |E(G)| + |\mathcal{RV}(r)| \cdot |V(G)| \log |V(G)|)$ for weighted graphs with positive weights. $|\mathcal{RV}(r)|$ is bounded from above by $|V(G)| - 1$ and in most cases it is a small constant. Then, for the cases where $\mathcal{RV}(r)$ is large, we present a simple randomized algorithm that samples from $\mathcal{RV}(r)$ and performs computations for only the sampled elements. We show that this algorithm provides an $(\epsilon, \delta)$-approximation of the betweenness score of $r$. Finally, we perform extensive experiments over several real-world datasets from different domains, for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. Our experiments reveal that in most cases, our algorithm significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Our experiments also show that our proposed algorithm computes the betweenness scores of all vertices in sets of sizes 5, 10 and 15 much faster and more accurately than the most efficient existing algorithms.
Keywords:
Networks, directed graphs, betweenness centrality, exact algorithm, approximate algorithm
1 Introduction
Graphs (networks) provide an important tool for modeling data in different domains, including social networks, bioinformatics, road networks, the world wide web and communication systems. A property seen in most of these real-world networks is that the ties between vertices do not always represent reciprocal relations [24]. As a result, the networks formed in these domains are directed graphs, where every edge has a direction and the edges are not always symmetric.
Centrality is a structural property of vertices (or edges) in the network that quantifies their relative importance. For example, it determines the importance of a person within a social network, or of a road within a road network. Freeman [13] introduced and defined the betweenness centrality of a vertex as the number of shortest paths from all (source) vertices to all others that pass through that vertex. He used it to measure the control of a person over the communications among others in a social network [13]. Betweenness centrality is also used in some well-known algorithms for clustering and community detection in social and information networks [15].
Although there exist polynomial time and space algorithms for betweenness centrality computation, these algorithms are expensive in practice. Currently, the most efficient existing exact method is Brandes's algorithm [7]. The time complexity of this algorithm is $O(nm)$ for unweighted graphs and $O(nm + n^2 \log n)$ for weighted graphs with positive weights, where $n$ and $m$ are the number of vertices and the number of edges of the network, respectively. This means the algorithm is not applicable in practice, even for mid-size networks.
However, there are observations that may improve computation of betweenness centrality in practice. In several applications it is sufficient to compute betweenness score of only one or a few vertices. For instance, this index might be computed for only core vertices of communities in social/information networks [32] or for only hubs in communication networks. Another example, discussed in [1], is handling cascading failures. It has been shown that the failure of a vertex with a higher betweenness score may cause greater collapse of the network [30]. Therefore, failed vertices should be recovered in the order of their betweenness scores. This means it is required to compute betweenness scores of (only) failed vertices. Note that these vertices are not necessarily those that have the highest betweenness scores in the network. Hence, algorithms that identify vertices with the highest betweenness scores [27] are not applicable.
In the current paper, we exploit this observation to design more effective exact and approximate algorithms for computing betweenness centrality in large directed graphs. Our algorithms are based on computing the set of reachable vertices for a given vertex $r$. On the one hand, this set can be computed very efficiently. On the other hand, it indicates the potential source vertices whose dependency scores on $r$ are nonzero; as a result, it helps us avoid a huge amount of computation that does not contribute to the betweenness score of $r$.
In this paper, our key contributions are as follows.

Given a directed graph $G$ and a vertex $r \in V(G)$, we present an efficient exact algorithm to compute the betweenness score of $r$. The algorithm is based on precomputing the set of reachable vertices of $r$, denoted by $\mathcal{RV}(r)$. $\mathcal{RV}(r)$ can be computed in $O(|E(G)|)$ time for both unweighted graphs and weighted graphs with positive weights. The time complexity of the whole exact algorithm depends on the size of $\mathcal{RV}(r)$: it is $O(|\mathcal{RV}(r)| \cdot |E(G)|)$ for unweighted graphs and $O(|\mathcal{RV}(r)| \cdot |E(G)| + |\mathcal{RV}(r)| \cdot |V(G)| \log |V(G)|)$ for weighted graphs with positive weights. $|\mathcal{RV}(r)|$ is bounded from above by $|V(G)| - 1$ and in most cases it can be considered a small constant (see Section 5). Hence, in many cases, the time complexity of our proposed exact algorithm is linear in $|E(G)|$ for unweighted graphs, and it is $O(|E(G)| + |V(G)| \log |V(G)|)$ for weighted graphs with positive weights.

In the cases where $\mathcal{RV}(r)$ is large, our exact algorithm might be intractable in practice. To address this issue, we present a simple randomized algorithm that samples elements from $\mathcal{RV}(r)$ and performs computations for only the sampled elements. We show that this algorithm provides an $(\epsilon, \delta)$-approximation of the betweenness score of $r$.

In order to evaluate the empirical efficiency of our proposed algorithms, we perform extensive experiments over several real-world datasets from different domains. In our experiments, we introduce a procedure that first computes $\mathcal{RV}(r)$. Then, if the size of $\mathcal{RV}(r)$ is less than some threshold (e.g., 1000), it employs the exact algorithm. Otherwise, it uses the randomized algorithm. We evaluate this procedure for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. We show that for randomly chosen vertices, our proposed procedure always significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Furthermore, for the vertices that have the highest betweenness scores, our algorithm outperforms the most efficient existing algorithms over most of the datasets.

While our algorithm is intuitively designed to estimate betweenness score of only one vertex, in our experiments we consider the settings where betweenness scores of sets of vertices are computed. Our experiments reveal that in such cases, our proposed algorithm efficiently computes betweenness scores of all vertices in the sets of sizes 5, 10 and 15 and it significantly outperforms the existing algorithms.
The rest of this paper is organized as follows. In Section 2, preliminaries and necessary definitions related to betweenness centrality are introduced. A brief overview on related work is given in Section 3. In Section 4, we present our exact and approximate algorithms and their analysis. In Section 5, we empirically evaluate our proposed algorithm and show its high efficiency and accuracy, compared to the existing algorithms. Finally, the paper is concluded in Section 6.
2 Preliminaries
In this section, we present definitions and notations widely used in the paper. We assume that the reader is familiar with basic concepts in graph theory. Throughout the paper, $G$ refers to a graph (network). For simplicity, we assume that $G$ is a directed, connected and loop-free graph without multi-edges. Throughout the paper, we assume that $G$ is an unweighted graph, unless it is explicitly mentioned that $G$ is weighted. $V(G)$ and $E(G)$ refer to the set of vertices and the set of edges of $G$, respectively. For a vertex $v \in V(G)$, the number of head ends adjacent to $v$ is called its in-degree, and the number of tail ends adjacent to $v$ is called its out-degree.
A shortest path (also called a geodesic path) from $u \in V(G)$ to $v \in V(G)$ is a path whose length is minimum among all paths from $u$ to $v$. For two vertices $u, v \in V(G)$, if $G$ is unweighted, we denote by $d(u,v)$ the length (the number of edges) of a shortest path from $u$ to $v$. If $G$ is weighted, $d(u,v)$ denotes the sum of the weights of the edges of a shortest path from $u$ to $v$. By definition, $d(u,u) = 0$. Note that in directed graphs, $d(u,v)$ is not necessarily equal to $d(v,u)$. For $s, t, v \in V(G)$, $\sigma_{st}$ denotes the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ denotes the number of shortest paths from $s$ to $t$ that also pass through $v$. Betweenness centrality of a vertex $v$ is defined as:
$$BC(v) = \sum_{s,t \in V(G) \setminus \{v\}} \frac{\sigma_{st}(v)}{\sigma_{st}} \tag{1}$$
A notion widely used for counting the number of shortest paths in a graph is the directed acyclic graph (DAG) containing all shortest paths starting from a vertex $s$ (see e.g., [7]). In this paper, we refer to it as the shortest-path-DAG, or SPD for short, rooted at $s$. For every vertex $s$ in graph $G$, the SPD rooted at $s$ is unique, and it can be computed in $O(|E(G)|)$ time for unweighted graphs and in $O(|E(G)| + |V(G)| \log |V(G)|)$ time for weighted graphs with positive weights [7].
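Brandes [7] introduced the notion of the dependency score of a vertex $s \in V(G)$ on a vertex $v \in V(G) \setminus \{s\}$, which is defined as:
$$\delta_{s\bullet}(v) = \sum_{t \in V(G) \setminus \{v, s\}} \delta_{st}(v) \tag{2}$$
where $\delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}$. We have:
$$BC(v) = \sum_{s \in V(G) \setminus \{v\}} \delta_{s\bullet}(v) \tag{3}$$
Brandes [7] showed that the dependency scores of a source vertex on the other vertices in the network can be computed using the following recursive relation:
$$\delta_{s\bullet}(v) = \sum_{w : v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_{s\bullet}(w)\right) \tag{4}$$
where $P_s(w)$ contains the predecessors of $w$ in the SPD rooted at $s$. As mentioned in [7], given the SPD rooted at $s$ and using Equation 4, the dependency scores of $s$ on all other vertices can be computed in $O(|E(G)|)$ time for unweighted graphs and in $O(|E(G)| + |V(G)| \log |V(G)|)$ time for weighted graphs with positive weights.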
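The recursion of Equation 4 can be illustrated with a minimal Python sketch for the unweighted case. It is not the authors' implementation: the adjacency-dictionary input format and the function name `single_source_dependencies` are assumptions made here for illustration. A BFS from the source builds the SPD (distances, path counts and predecessor lists), and the dependency scores are then accumulated from the farthest vertices back towards the source.

```python
from collections import deque

def single_source_dependencies(adj, s):
    """Dependency scores of source s on every vertex reachable from s.

    adj: dict mapping each vertex to a list of its out-neighbours
         (hypothetical input format; unweighted, no multi-edges).
    """
    dist = {s: 0}
    sigma = {s: 1}      # sigma[v]: number of shortest paths from s to v
    preds = {s: []}     # preds[w]: predecessors of w in the SPD rooted at s
    order = []          # vertices in non-decreasing distance from s
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj.get(v, ()):
            if w not in dist:               # w is discovered for the first time
                dist[w] = dist[v] + 1
                sigma[w] = 0
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[v] + 1:      # edge (v, w) belongs to the SPD
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in order}
    for w in reversed(order):               # farthest vertices first
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta.pop(s)                            # the source has no dependency on itself
    return delta

# Diamond a -> {b, c} -> d, followed by d -> e.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(single_source_dependencies(graph, "a"))
```

In this small graph, both shortest paths from `a` to `e` pass through `d`, and each of `b` and `c` lies on half of the shortest paths to `d` and to `e`, so each of the three intermediate vertices receives a dependency score of 1.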
3 Related work
Brandes [7] introduced an efficient algorithm for computing betweenness centrality of a vertex, which runs in $O(|V(G)| \cdot |E(G)|)$ and $O(|V(G)| \cdot |E(G)| + |V(G)|^2 \log |V(G)|)$ time for unweighted networks and weighted networks with positive weights, respectively. Çatalyürek et al. [9] presented the compression and shattering techniques to improve the efficiency of Brandes's algorithm for large graphs. During compression, vertices with known betweenness scores are removed from the graph and during shattering, the graph is partitioned into smaller components. Holme [18] showed that betweenness centrality of a vertex is highly correlated with the fraction of time that the vertex is occupied by the traffic of the network. Barthelemy [4] showed that many scale-free networks [3] have a power-law distribution of betweenness centrality.
3.1 Generalization to sets
Everett and Borgatti [12] defined group betweenness centrality as a natural extension of betweenness centrality for sets of vertices. Group betweenness centrality of a set is defined as the number of shortest paths passing through at least one of the vertices in the set [12]. The other natural extension of betweenness centrality is cobetweenness centrality. Cobetweenness centrality is defined as the number of shortest paths passing through all vertices in the set. Kolaczyk et. al. [19] presented an time algorithm for cobetweenness centrality computation of sets of size 2. Chehreghani [10] presented efficient algorithms for cobetweenness centrality computation of any set or sequence of vertices in weighted and unweighted networks. Puzis et. al. [25] proposed an time algorithm for computing successive group betweenness centrality, where is the size of the set. The same authors in [26] presented two algorithms for finding most prominent group. A most prominent group of a network is a set vertices of minimum size, so that every shortest path in the network passes through at least one of the vertices in the set. The first algorithm is based on a heuristic search and the second one is based on iterative greedy choice of vertices.
3.2 Approximate algorithms
Brandes and Pich [8] proposed an approximate algorithm based on selecting source vertices and computing dependency scores of them on the other vertices in the graph. They used various strategies for selecting the source vertices, including: MaxMin, MaxSum and MinSum [8]. In the method of [2], some source vertices are selected uniformly at random, and their dependency scores are computed and scaled for all vertices. Geisberger et. al. [14] presented an algorithm for approximate ranking of vertices based on their betweenness scores. In this algorithm, the method for aggregating dependency scores changes so that vertices do not profit from being near the selected source vertices. Chehreghani [11] proposed a randomized framework for unbiased estimation of the betweenness score of a single vertex. Then, to estimate betweenness score of vertex , he proposed a nonuniform sampler, defined as follows:
where .
Riondato and Kornaropoulos [27] presented shortest path samplers for estimating betweenness centrality of all vertices or of the vertices that have the highest betweenness scores in a graph. They determined the number of samples needed to approximate the betweenness with the desired accuracy and confidence by means of VC-dimension theory [31]. Recently, Riondato and Upfal [28] introduced algorithms for estimating betweenness scores of all vertices in a graph. They also discussed a variant of the algorithm that finds the top-$k$ vertices. They used Rademacher averages [29] to determine the number of required samples. Finally, Borassi and Natale [6] presented the KADABRA algorithm, which uses balanced bidirectional BFS (bb-BFS) to sample shortest paths. In bb-BFS, a BFS is performed from each of the two endpoints $s$ and $t$, in such a way that the two searches are likely to explore about the same number of edges.
3.3 Dynamic graphs
Lee et. al. [20] proposed an algorithm to efficiently update betweenness centralities of vertices when the graph obtains a new edge. They reduced the search space by finding a candidate set of vertices whose betweenness centralities can be updated. Bergamini et. al. [5] presented approximate algorithms that update betweenness scores of all vertices when an edge is inserted or the weight of an edge decreases. They used the algorithm of [27] as the building block. Hayashi et. al. [16] proposed a fully dynamic algorithm for estimating betweenness centrality of all vertices in a large dynamic network. Their algorithm is based on two data structures: hypergraph sketch that keeps track of SPDs, and twoball index that helps to identify the parts of hypergraph sketches that require updates.
4 Computing betweenness centrality in directed graphs
In this section, we present our exact and approximate algorithms for computing betweenness centrality of a given vertex $r$ in a large directed graph. First, in Section 4.1, we introduce reachable vertices and show that they are sufficient to compute the betweenness score of $r$. Then, in Sections 4.2 and 4.3, we present our exact and approximate algorithms, respectively.
4.1 Reachable vertices
Let $G$ be a directed graph and $r \in V(G)$. Suppose that we want to compute the betweenness score of $r$. To do so, as Brandes's algorithm [7] suggests, for each vertex $v \in V(G) \setminus \{r\}$ we may form the SPD rooted at $v$ and compute the dependency score of $v$ on $r$. The betweenness score of $r$ is then the sum of all these dependency scores. However, in a directed graph it is possible that for many vertices $v$ there is no path from $v$ to $r$ and, as a result, the dependency score of $v$ on $r$ is 0. An example of this situation is depicted in Figure 1. In the graph of this figure, suppose that we want to compute the betweenness score of vertex $r$. If we form the SPD rooted at a vertex $v$ in the hatched part of the graph, then after visiting the hatched part we find out that there is no shortest path from $v$ to $r$ and hence $\delta_{v\bullet}(r)$ is 0. The same holds for all vertices in the hatched part of the graph, i.e., the dependency scores of these vertices on $r$ are 0. The question arising here is whether there exists an efficient way to detect the vertices whose dependency scores on $r$ are 0, so that we can avoid forming the SPDs rooted at them. In the rest of this section, we try to answer this question. We first introduce a (usually small) subset of vertices, called reachable vertices and denoted by $\mathcal{RV}(r)$, that are sufficient to compute the betweenness score of $r$. Then, we discuss how this set can be computed efficiently.
Definition 1
Let $G$ be a directed graph and $r, v \in V(G)$. We say $r$ is reachable from $v$ if there is a (directed) path from $v$ to $r$. The set of vertices from which $r$ is reachable is denoted by $\mathcal{RF}(r)$.
Proposition 1
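Let $G$ be a directed graph and $r \in V(G)$. If the out-degree of $r$ is $0$, then $BC(r)$ is $0$, too. Otherwise, we have:
$$BC(r) = \sum_{v \in \mathcal{RF}(r)} \delta_{v\bullet}(r) \tag{5}$$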
Proof
If the out-degree of $r$ is $0$, there is no shortest path in the graph that leaves $r$; as a result, $BC(r)$ is $0$. To prove that Equation 5 holds, by Equation 3 we need to prove that for any $v \in V(G) \setminus (\mathcal{RF}(r) \cup \{r\})$, the dependency score of $v$ on $r$ is $0$. This holds because there is no path from $v$ to $r$ and, as a result, no shortest path starting from $v$ can pass through $r$.
Proposition 1 suggests that for computing the betweenness score of $r$, we first check whether the out-degree of $r$ is greater than $0$ and if so, we compute $\mathcal{RF}(r)$. The betweenness score of $r$ is then computed exactly using Equation 5.
If $\mathcal{RF}(r)$ is already known, this procedure can significantly improve the computation of betweenness centrality of $r$. The reason is that, as our experiments show, in real-world directed networks $|\mathcal{RF}(r)|$ is usually significantly smaller than $|V(G)|$. However, computing $\mathcal{RF}(r)$ can be computationally expensive as, in the worst case, it requires the same amount of time as computing the betweenness score of $r$. This motivates us to define a set $\mathcal{RV}(r)$ that satisfies the following properties: (i) $\mathcal{RF}(r) \subseteq \mathcal{RV}(r)$, and (ii) $\mathcal{RV}(r)$ can be computed much faster than $BC(r)$. Condition (i) implies that each vertex whose dependency score on $r$ is greater than $0$ belongs to $\mathcal{RV}(r)$ and, as a result, $BC(r) = \sum_{v \in \mathcal{RV}(r)} \delta_{v\bullet}(r)$. In the following, we present a definition of $\mathcal{RV}(r)$ and a simple and efficient algorithm to compute it.
Definition 2
Let $G$ be a directed graph. The reverse graph of $G$, denoted by $R(G)$, is a directed graph such that: (i) $V(R(G)) = V(G)$, and (ii) $(u,v) \in E(R(G))$ if and only if $(v,u) \in E(G)$.
Definition 3
Let $G$ be a directed graph and $r \in V(G)$. We define $\mathcal{RV}(r)$ as the set that contains every vertex $v \in V(G) \setminus \{r\}$ such that there is a path from $r$ to $v$ in $R(G)$.
Proposition 2
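Let $G$ be a directed graph and $r \in V(G)$. We have: $\mathcal{RF}(r) = \mathcal{RV}(r)$.
Proof
The proof is straightforward from the definitions of $\mathcal{RF}(r)$ and $\mathcal{RV}(r)$. For each $v \in V(G) \setminus \{r\}$, if $v \in \mathcal{RF}(r)$, then there is a path from $v$ to $r$ in $G$ and, as a result, there is a path from $r$ to $v$ in $R(G)$. Hence, $v \in \mathcal{RV}(r)$ and therefore, $\mathcal{RF}(r) \subseteq \mathcal{RV}(r)$. In a similar way, we can show that $\mathcal{RV}(r) \subseteq \mathcal{RF}(r)$. Therefore, we have: $\mathcal{RF}(r) = \mathcal{RV}(r)$.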
An advantage of the above definition of $\mathcal{RV}(r)$ is that it can be efficiently computed as follows:

first, by flipping the direction of the edges of $G$, $R(G)$ is constructed,

then, if $G$ is weighted, the weights of the edges are ignored,

finally, a breadth-first search (BFS) or a depth-first search (DFS) on $R(G)$ starting from $r$ is performed. All the vertices that are met during the BFS (or DFS), except $r$, are added to $\mathcal{RV}(r)$.
In fact, while for $\mathcal{RF}(r)$ we are required to solve the multi-source shortest path problem (MSSP), for $\mathcal{RV}(r)$ this is reduced to the single-source shortest path problem (SSSP), which can be addressed much faster. Figure 1 shows an example of this procedure, where in order to compute $\mathcal{RV}(r)$, we first generate $R(G)$ and then run a BFS (or DFS) starting from $r$. The vertices that are met during the traversal, except $r$, form $\mathcal{RV}(r)$.
For a vertex $r \in V(G)$, each of the steps of the procedure of computing $\mathcal{RV}(r)$, for both unweighted graphs and weighted graphs, can be performed in $O(|E(G)|)$ time. Hence, the time complexity of computing $\mathcal{RV}(r)$ for both unweighted graphs and weighted graphs is $O(|E(G)|)$. Therefore, $\mathcal{RV}(r)$ can be computed much faster than the betweenness score of $r$. Furthermore, Proposition 2 says that $\mathcal{RV}(r)$ contains all the members of $\mathcal{RF}(r)$. Together, these mean that both of the aforementioned conditions are satisfied.
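The three steps above (reverse the edges, ignore weights, traverse from the target vertex) can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions made here: the graph is given as a dictionary from each vertex to a list of its out-neighbours, and the name `reachable_vertices` is hypothetical.

```python
from collections import deque

def reachable_vertices(adj, r):
    """Vertices that can reach r in the input graph, excluding r itself."""
    # Step 1: build the reverse graph by flipping every edge.
    # Edge weights, if any, play no role here and are simply not stored.
    rev = {}
    for u, outs in adj.items():
        for w in outs:
            rev.setdefault(w, []).append(u)
    # Step 2: BFS from r on the reverse graph.
    seen = {r}
    queue = deque([r])
    while queue:
        v = queue.popleft()
        for u in rev.get(v, ()):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    seen.discard(r)         # r itself is not part of the set
    return seen

# Path a -> b -> r -> c -> d: only a and b can reach r.
graph = {"a": ["b"], "b": ["r"], "r": ["c"], "c": ["d"], "d": []}
print(sorted(reachable_vertices(graph, "r")))
```

Each edge is inspected once while building the reverse graph and at most once during the traversal, which matches the linear bound stated above.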
4.2 The exact algorithm
In this section, using the notions and definitions presented in Section 4.1, we propose an effective algorithm to compute the exact betweenness score of a given vertex $r$ in a directed graph $G$.
Algorithm 1 presents the high-level pseudocode of the EBCD algorithm proposed for computing the exact betweenness score of $r$ in $G$. After checking whether or not the out-degree of $r$ is $0$, the algorithm follows two main steps: (i) computing $\mathcal{RV}(r)$ (Lines 7-12 of Algorithm 1), where we use the procedure described in Section 4.1; and (ii) computing $BC(r)$ (Lines 13-18 of Algorithm 1), where for each vertex $v \in \mathcal{RV}(r)$, we form the SPD rooted at $v$, compute the dependency scores of $v$ on the other vertices and add the value of $\delta_{v\bullet}(r)$ to the betweenness score of $r$. Note that if $G$ is weighted, the weights of its edges are ignored in the first step, whereas in the second step, when forming SPDs and computing dependency scores, we take the weights into account.
Note also that in Algorithm 1, after computing $\mathcal{RV}(r)$, techniques proposed to improve exact betweenness centrality computation, such as compression and shattering [9], can be used to improve the efficiency of the second step. This means the algorithm proposed here is orthogonal to techniques such as shattering and compression and therefore, they can be combined.
Complexity analysis
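On the one hand, as mentioned before, the time complexity of the first step is $O(|E(G)|)$. On the other hand, the time complexity of each iteration in Lines 15-18 is $O(|E(G)|)$ for unweighted graphs and $O(|E(G)| + |V(G)| \log |V(G)|)$ for weighted graphs with positive weights. As a result, the time complexity of EBCD is $O(|\mathcal{RV}(r)| \cdot |E(G)|)$ for unweighted graphs and $O(|\mathcal{RV}(r)| \cdot |E(G)| + |\mathcal{RV}(r)| \cdot |V(G)| \log |V(G)|)$ for weighted graphs with positive weights.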
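The two steps of the exact algorithm can be put together in a compact Python sketch for unweighted graphs. It is a sketch under stated assumptions rather than a transcription of Algorithm 1: the adjacency-dictionary format and the names `reachable_vertices`, `dependency_on` and `ebcd` are hypothetical, and multi-edges are assumed to be absent, as in Section 2.

```python
from collections import deque

def reachable_vertices(adj, r):
    """Step (i): vertices that can reach r, found by BFS on the reverse graph."""
    rev = {}
    for u, outs in adj.items():
        for w in outs:
            rev.setdefault(w, []).append(u)
    seen, queue = {r}, deque([r])
    while queue:
        v = queue.popleft()
        for u in rev.get(v, ()):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen - {r}

def dependency_on(adj, s, r):
    """Dependency score of source s on r, via BFS and back-accumulation."""
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
    order, queue = [], deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w], sigma[w], preds[w] = dist[v] + 1, 0, []
                queue.append(w)
            if dist[w] == dist[v] + 1:      # edge (v, w) is in the SPD
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = dict.fromkeys(order, 0.0)
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    return delta.get(r, 0.0)

def ebcd(adj, r):
    """Exact betweenness score of r, summing over reachable vertices only."""
    if not adj.get(r):                      # out-degree 0: the score is 0
        return 0.0
    # Step (ii): one SPD per vertex that can reach r.
    return sum(dependency_on(adj, v, r) for v in reachable_vertices(adj, r))

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(ebcd(graph, "d"))
```

In this toy graph only `a`, `b` and `c` can reach `d`, so three SPDs are formed instead of four; on large directed networks the saving is the gap between the number of reachable vertices and the total number of vertices.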
4.3 The approximate algorithm
For a vertex $r \in V(G)$, $|\mathcal{RV}(r)|$ is always smaller than $|V(G)|$ and, as our experiments (reported in Section 5) show, the difference is usually significant. Therefore, EBCD is usually significantly more efficient than existing exact algorithms such as Brandes's algorithm [7]. However, in some cases, the size of $\mathcal{RV}(r)$ can be large (see again Section 5). To make the algorithm tractable for the cases where $\mathcal{RV}(r)$ is large, in this section we propose a randomized algorithm that picks some elements of $\mathcal{RV}(r)$ uniformly at random and only processes these vertices.
Algorithm 2 shows the high-level pseudocode of our randomized algorithm, called ABCD. Similar to EBCD, ABCD first computes $\mathcal{RV}(r)$. Then, at each iteration $t$ ($1 \le t \le T$), ABCD picks a vertex $v_t$ from $\mathcal{RV}(r)$ uniformly at random, forms the SPD rooted at $v_t$ and computes $\delta_{v_t\bullet}(r)$. In the end, the betweenness of $r$ is estimated as the sum of the computed dependency scores on $r$ multiplied by $\frac{|\mathcal{RV}(r)|}{T}$.
Complexity analysis
Similar to EBCD, on the one hand, the time complexity of the $\mathcal{RV}(r)$ computation step is $O(|E(G)|)$. On the other hand, the time complexity of each iteration in Lines 15-19 of Algorithm 2 is $O(|E(G)|)$ for unweighted graphs and $O(|E(G)| + |V(G)| \log |V(G)|)$ for weighted graphs with positive weights. As a result, the time complexity of ABCD is $O(T \cdot |E(G)|)$ for unweighted graphs and $O(T \cdot |E(G)| + T \cdot |V(G)| \log |V(G)|)$ for weighted graphs with positive weights, where $T$ is the number of iterations (samples).
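The sampling step can be sketched in Python as follows. To keep the example short, the per-source dependency computation is abstracted behind a callback: `dependency_on_r` is a hypothetical function that returns the dependency score of a vertex on the target vertex (in practice it would form the SPD rooted at that vertex). The function name `abcd_estimate` and the toy scores are likewise assumptions made for illustration.

```python
import random

def abcd_estimate(rv, dependency_on_r, num_samples, rng=None):
    """Estimate the betweenness score of the target from uniform samples of rv.

    rv:              iterable of the reachable vertices of the target
    dependency_on_r: hypothetical callback, v -> dependency score of v on the target
    num_samples:     number of iterations (samples)
    """
    rng = rng or random.Random()
    rv = list(rv)
    if not rv:
        return 0.0
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice(rv)              # uniform, independent, with replacement
        total += dependency_on_r(v)
    # Scale the sum of sampled dependency scores by |rv| / num_samples.
    return total * len(rv) / num_samples

# Toy dependency scores (made up for the example); their exact sum is 6.0.
scores = {"a": 0.0, "b": 1.0, "c": 5.0}
estimate = abcd_estimate(scores, scores.__getitem__, 20000, random.Random(7))
print(estimate)
```

Because vertices are drawn with replacement, the samples are independent, which is what the error analysis of this section relies on.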
Error bound
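Using Hoeffding's inequality [17], we can derive an error bound for the estimated value of the betweenness score of $r$. Let $\widehat{BC}(r)$ denote the estimate returned by Algorithm 2. First, in Proposition 3, we prove that the expected value of $\widehat{BC}(r)$ is $BC(r)$. Then, in Proposition 4, we provide an error bound for $\widehat{BC}(r)$.
Proposition 3
In Algorithm 2, we have: $\mathbb{E}\left[\widehat{BC}(r)\right] = BC(r)$.
Proof
For each $t$, $1 \le t \le T$, let $v_t$ be the vertex sampled at iteration $t$, and define the random variable $X_t$ as follows: $X_t = |\mathcal{RV}(r)| \cdot \delta_{v_t\bullet}(r)$. We have:
$$\mathbb{E}[X_t] = \sum_{v \in \mathcal{RV}(r)} \frac{1}{|\mathcal{RV}(r)|} \cdot |\mathcal{RV}(r)| \cdot \delta_{v\bullet}(r) = \sum_{v \in \mathcal{RV}(r)} \delta_{v\bullet}(r) = BC(r).$$
The random variable $\widehat{BC}(r)$ is the average of the $T$ independent random variables $X_t$. Therefore, we have:
$$\mathbb{E}\left[\widehat{BC}(r)\right] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[X_t] = BC(r).$$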
Proposition 4
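In Algorithm 2, let $\widehat{BC}(r)$ be the estimated betweenness score and let $K$ be the maximum dependency score that a vertex may have on $r$. For a given $\epsilon \in \mathbb{R}^+$, we have:
$$\mathbb{P}\left[\left|BC(r) - \widehat{BC}(r)\right| \ge \epsilon\right] \le 2 \exp\left(-2T \left(\frac{\epsilon}{K \cdot |\mathcal{RV}(r)|}\right)^2\right) \tag{6}$$
Proof
The proof uses Hoeffding's inequality [17]. Let $X_1, \ldots, X_n$ be independent random variables bounded by the interval $[a, b]$, i.e., $a \le X_i \le b$ ($1 \le i \le n$). Let also $\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n)$. Hoeffding [17] showed that:
$$\mathbb{P}\left[\left|\mathbb{E}[\bar{X}] - \bar{X}\right| \ge \epsilon\right] \le 2 \exp\left(-2n \left(\frac{\epsilon}{b - a}\right)^2\right) \tag{7}$$
Similar to the proof of Proposition 3, for each $t$, $1 \le t \le T$, we define the random variable $X_t$ as follows: $X_t = |\mathcal{RV}(r)| \cdot \delta_{v_t\bullet}(r)$, where $v_t$ is the vertex sampled at iteration $t$. Note that in Algorithm 2 the vertices $v_t$ are chosen independently; as a result, the random variables $X_t$ are independent, too. Hence, we can use Hoeffding's inequality, where the $X_i$'s are the $X_t$'s, $\bar{X}$ is $\widehat{BC}(r)$, $n$ is $T$, $a$ is $0$ and $b$ is $K \cdot |\mathcal{RV}(r)|$. By Proposition 3, $\mathbb{E}[\bar{X}] = BC(r)$. Putting these values into Inequality 7 yields Inequality 6.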
Inequality 6 says that for given values $\epsilon \in \mathbb{R}^+$ and $\delta \in (0,1)$, if $T$ is chosen such that
$$T \ge \frac{K^2 \cdot |\mathcal{RV}(r)|^2}{2 \epsilon^2} \ln\left(\frac{2}{\delta}\right) \tag{8}$$
then Algorithm 2 estimates the betweenness score of $r$ within an additive error $\epsilon$ with probability at least $1 - \delta$. The difference between Inequality 8 and the number of samples required by the methods that sample uniformly from the set of all vertices (e.g., [8]) is that in the latter case, the lower bound on the number of samples is a function of $|V(G)|$ instead of $|\mathcal{RV}(r)|$. As mentioned earlier, for most of the vertices, $|\mathcal{RV}(r)| \ll |V(G)|$.
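As a worked illustration of the sample-size bound, the following Python sketch computes the smallest integer number of samples for which a Hoeffding bound of the form $2\exp(-2T(\epsilon/(K \cdot s))^2)$ drops to at most $\delta$, where $s$ is the size of the sampled set. The function name `required_samples` and its parameter names are assumptions made here, and the numeric inputs are made up for the example.

```python
import math

def required_samples(eps, delta, max_dependency, rv_size):
    """Smallest T with 2 * exp(-2 * T * (eps / (K * s))**2) <= delta.

    eps:            additive error tolerated on the betweenness score
    delta:          allowed failure probability, in (0, 1)
    max_dependency: K, an upper bound on any single dependency score on the target
    rv_size:        s, the number of vertices that are sampled from
    """
    width = max_dependency * rv_size        # range of each scaled sample, b - a
    return math.ceil(width ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

# Example: K = 1, 10 reachable vertices, additive error 1, failure probability 0.1.
T = required_samples(eps=1.0, delta=0.1, max_dependency=1.0, rv_size=10)
print(T)  # 150
```

The bound grows quadratically with the size of the sampled set, which is why sampling from the reachable vertices rather than from all vertices reduces the number of samples needed for the same additive error.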
5 Experimental results
We perform extensive experiments on several realworld networks to assess the quantitative and qualitative behavior of our proposed exact and approximate algorithms. The experiments are done on an Intel processor clocked at 2.6 GHz with 16 GB main memory, running Ubuntu Linux 16.04 LTS. The program is compiled by the GNU C++ compiler 5.4.0 using optimization level 3.
Table 1: Summary of the real-world networks.
Dataset            # vertices    # edges
Amazon             262,111       1,234,877
Comamazon          334,863       925,872
Comdblp            317,080       1,049,866
EmailEuAll         224,832       340,795
P2pGnutella31      62,586        147,892
Slashdot           82,144        549,202
Socsignepinions    131,828       841,372
WebNotreDame       325,729       1,497,134
We test the algorithms over several real-world datasets from different domains, including the amazon product co-purchasing network [21], the comdblp co-authorship network [33], the comamazon network [33], the p2pGnutella31 peer-to-peer network [23], the slashdot technology-related news network [22] and the socsignepinions who-trusts-whom online social network [22]. All the networks are treated as directed graphs. Table 1 summarizes the specifications of our real-world networks.
As mentioned before, for a directed graph $G$ and a vertex $r \in V(G)$, both of our proposed exact and approximate algorithms first compute $\mathcal{RV}(r)$, which can be done very efficiently. Then, based on the size of $\mathcal{RV}(r)$, one may decide to use either the exact algorithm or the approximate algorithm. Hence, in our experiments, we use the following procedure:

first, compute $\mathcal{RV}(r)$,

then, if the size of $\mathcal{RV}(r)$ is less than a threshold $\tau$, run EBCD; otherwise, run ABCD with $\tau$ as the number of samples.
We refer to this procedure as BCD. The value of $\tau$ depends on the amount of time one wants to spend on computing betweenness centrality. In the experiments reported here, we set $\tau$ to 1000. We compare our method against the most efficient existing algorithm for approximating betweenness centrality, which is KADABRA [6].
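For a vertex $v$, its empirical approximation error (in percent) is defined as:
$$Error(v) = \frac{\left|BC(v) - \widehat{BC}(v)\right|}{BC(v)} \times 100 \tag{9}$$
where $\widehat{BC}(v)$ is the calculated approximate score.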
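Table 2 (columns): Dataset  Randomly chosen vertex  BC score  $|\mathcal{RV}(r)|$  $|\mathcal{RV}(r)|/|V(G)|$  KADABRA # samples  KADABRA time  KADABRA error (%)  BCD E/A  BCD time  BCD $\mathcal{RV}$ time  BCD error (%)
(E/A: the exact algorithm (EBCD) or the approximate algorithm (ABCD) was used. KADABRA's number of samples and time are listed once per dataset, in its first row.)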
Amazon  13645  19613.1  47187  0.1800  16739  19.14  100  A  2.60  0.26  0.26 
91289  87523.6  150  0.0005  100  E  0.67  0.29  0  
17054  35752.6  533  0.0020  100  E  1.26  0.29  0  
231249  10449.4  4  0.00001  100  E  0.11  0.30  0  
246486  1837.58  34  0.0001  100  E  0.17  0.30  0  
Comamazon  202389  1486.8  13  0.00003  15036  27.70  100  E  0.14  0.27  0 
263212  364  3  0.000008  100  E  0.12  0.27  0  
81097  11  14  0.00004  100  E  0.15  0.27  0  
13732  1701.51  616  0.0018  100  E  1.41  0.28  0  
29825  139  15  0.00004  100  E  0.15  0.27  0  
Comdblp  4456  10153  2092  0.0065  17873  26.14  100  A  5.74  0.26  1.10 
278950  34326.5  11  0.00003  100  E  0.13  0.27  0  
244680  232994  22  0.00006  100  E  0.21  0.27  0  
21141  1957.93  73  0.0002  100  E  0.48  0.27  0  
129908  303543  41  0.0001  100  E  0.53  0.29  0  
EmailEuAll  25362  1869.16  2  0.000008  17066  16.01  100  E  0.03  0.08  0 
16682  2269.29  64  0.0002  100  E  0.14  0.08  0  
8796  241434  21181  0.0942  100  A  1.88  0.07  1.72  
50365  3  2  0.000008  100  E  0.03  0.07  0  
2139  503650  111674  0.4966  100  A  1.78  0.08  3.59  
P2pGnutella31  46263  12655.2  2  0.00003  16401  6.88  100  E  0.03  0.04  0 
34547  3538.79  173  0.0027  100  E  0.95  0.04  0  
54609  27824.9  3  0.00004  100  E  0.03  0.04  0  
37518  6175.2  24141  0.3857  100  A  2.44  0.06  11.31  
9781  4582130  3  0.00004  100  E  0.02  0.04  0  
Slashdot0902  20825  15940.9  21  0.0002  17421  7.95  100  E  0.17  0.16  0 
47806  15891.7  3  0.00003  100  E  0.06  0.15  0  
48251  21744  3  0.00003  100  E  0.05  0.15  0  
20969  43067  369  0.0044  100  E  2.30  0.17  0  
57099  6165.01  2  0.00002  100  E  0.05  0.15  0  
Socsignepinions  2740  2352.43  36393  0.2760  19099  11.28  100  A  4.57  0.17  55.34 
24080  9198.78  2621  0.0198  100  A  4.60  0.15  18.48  
38349  75201.9  35  0.0002  100  E  0.24  0.14  0  
82156  8802  34  0.0002  100  E  0.19  0.14  0  
38266  8052  3  0.00002  100  E  0.04  0.14  0  
WebNotreDame  21026  140  9  0.00002  19908  27.29  100  E  0.08  0.25  0 
133847  9003.53  797  0.0024  100  E  1.84  0.25  0  
307622  4212.33  44  0.0001  100  E  0.18  0.25  0  
176211  2157.42  30  0.00009  100  E  0.14  0.25  0  
307134  3079.5  123  0.0003  100  E  0.35  0.25  0 
5.1 Results
Table 2 reports the results of our first set of experiments.
For KADABRA, we have set $\epsilon$ and $\delta$ to 0.01 and 0.1, respectively.
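Table 3 (columns): Dataset  Vertex  KADABRA # samples  KADABRA time  Estimated BC  Error (%)
(KADABRA's number of samples and time are listed once per dataset, in its first row.)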
Amazon  13645  47330  53.98  0  100 
91289  0  100  
17054  0  100  
231249  0  100  
246486  0  100  
Comamazon  202389  42207  58.76  0  100 
263212  0  100  
81097  0  100  
13732  0  100  
29825  0  100  
Comdblp  4456  50667  77.40  0  100 
278950  0  100  
244680  0  100  
21141  0  100  
129908  0  100  
EmailEuAll  25362  48079  43.43  0  100 
16682  0  100  
8796  0  100  
50365  0  100  
2139  0  100  
P2pGnutella31  46263  47631  18.12  0  100 
34547  0  100  
54609  85638.90  568.32  
37518  0  100  
9781  4281925.72  6.52  
Slashdot0902  20825  50776  22.38  0  100 
47806  0  100  
48251  0  100  
20969  0  100  
57099  0  100  
Socsignepinions  2740  53667  30.54  0  100 
24080  0  100  
38349  0  100  
82156  0  100  
38266  0  100  
WebNotreDame  21026  51015  73.92  0  100 
133847  0  100  
307622  0  100  
176211  0  100  
307134  0  100 
After seeing these experimental results, someone may be interested in the following questions:

Q1. The accuracy of KADABRA depends on the value of $\epsilon$. Can decreasing the value of $\epsilon$ improve the accuracy of KADABRA and make it comparable to BCD?

Q2. KADABRA is more efficient for the vertices that have the highest betweenness scores, and since most of the randomly chosen vertices do not have a very high betweenness score, KADABRA does not perform well compared to BCD. What is the efficiency of BCD, compared to KADABRA, for the vertices that have the highest betweenness scores?

Q3. In the experiments reported in Table 2, BCD is used to estimate the betweenness score of only one vertex. However, in practice it might be required to estimate the betweenness scores of a given set of vertices. How efficient is BCD in this setting?
In the rest of this section, we answer these questions.
Q1
To answer Q1, we run KADABRA with $\epsilon = 0.005$. The results are reported in Table 3. In this setting, most of the scores estimated by KADABRA are still $0$. There are only two exceptions, for which the approximation error is nevertheless high. Note that the running time of KADABRA in this setting is considerably higher than in the case of $\epsilon = 0.01$ and, as a result, higher than the running time of BCD.
Q2
To answer Q2, over each dataset, we examine the algorithms for the vertex that has the highest betweenness score.
In this setting, neither algorithm outperforms the other in all cases. More precisely, while for some values of $\epsilon$ KADABRATOP1 has a better accuracy as well as a higher running time, in other cases it is the other way around. Nevertheless, we can investigate the datasets one by one. Over amazon, for all values of $\epsilon$, BCD has a better approximation error than KADABRATOP1. In particular, for $\epsilon = 0.0005$, KADABRATOP1 takes much more time but produces less accurate output. Hence, we can argue that over amazon BCD outperforms KADABRATOP1. The same holds for comamazon, emailEuAll and webNotreDame, and over all these datasets BCD outperforms KADABRATOP1. Over comdblp, for $\epsilon = 0.005$, KADABRATOP1 outperforms BCD in terms of both accuracy and running time. This also happens over socsignepinions for $\epsilon = 0.01$ and $\epsilon = 0.005$. Hence, one may argue that over these two datasets KADABRATOP1 outperforms BCD. Over p2pGnutella31 and slashdot0902, on the one hand, for $\epsilon = 0.01$ and $\epsilon = 0.005$ BCD shows a better accuracy but is slightly slower. On the other hand, for $\epsilon = 0.0005$, KADABRATOP1 shows a better accuracy but takes much more time. Altogether, we can say that for estimating betweenness scores of the vertices that have the highest scores, BCD works better than KADABRATOP1 on most of the datasets.
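| Dataset | Vertex with the highest BC | | | | | KADABRATOP1 samples | KADABRATOP1 Time | KADABRATOP1 Error () | BCD Time | | BCD Error () |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon | 2804 | 16066000 | 162707 | 0.6207 | 0.01 | 16181 | 0.26 | | 2.38 | 0.29 | 1.35 |
| | | | | | 0.005 | 45320 | 0.56 | 71.69 | | | |
| | | | | | 0.0005 | 1459502 | 16.65 | 3.01 | | | |
| Comamazon | 28081 | 378550 | 3812 | 0.0113 | 0.01 | 14619 | 0.14 | | 2.31 | 0.28 | 0.52 |
| | | | | | 0.005 | 40590 | 0.21 | | | | |
| | | | | | 0.0005 | 1249908 | 3.86 | 28.90 | | | |
| Comdblp | 49124 | 24821300 | 70561 | 0.2225 | 0.01 | 17303 | 0.64 | 17.04 | 6.27 | 0.27 | 9.77 |
| | | | | | 0.005 | 48411 | 1.62 | 7.96 | | | |
| | | | | | 0.0005 | 1581635 | 54.11 | 6.79 | | | |
| EmailEuAll | 2387 | 15943100 | 102596 | 0.4563 | 0.01 | 16588 | 0.10 | 33.79 | 1.76 | 0.08 | 3.37 |
| | | | | | 0.005 | 46123 | 0.17 | 17.50 | | | |
| | | | | | 0.0005 | 1471932 | 3.87 | 4.04 | | | |
| P2pGnutella31 | 9781 | 4580850 | 36141 | 0.5774 | 0.01 | 13618 | 0.32 | 57.61 | 1.78 | 0.04 | 2.59 |
| | | | | | 0.005 | 40909 | 1.00 | 6.51 | | | |
| | | | | | 0.0005 | 1515822 | 38.31 | 0.32 | | | |
| Slashdot0902 | 18238 | 8531850 | 19153 | 0.2331 | 0.01 | 16962 | 0.99 | 11.96 | 3.90 | 0.10 | 3.37 |
| | | | | | 0.005 | 44847 | 2.52 | 5.87 | | | |
| | | | | | 0.0005 | 1718486 | 103.89 | 0.16 | | | |
| Socsignepinions | 27463 | 26116100 | 9880 | 0.0749 | 0.01 | 18601 | 1.10 | 2.25 | 5.43 | 0.12 | 2.30 |
| | | | | | 0.005 | 51502 | 2.97 | 0.23 | | | |
| | | | | | 0.0005 | 2398143 | 143.92 | 1.61 | | | |
| WebNotreDame | 7137 | 323101000 | 233965 | 0.7182 | 0.01 | 19448 | 0.18 | 1.30 | 2.71 | 0.235 | 0.26 |
| | | | | | 0.005 | 49456 | 0.30 | 7.56 | | | |
| | | | | | 0.0005 | 779273 | 3.93 | 2.22 | | | |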
Q3
To answer Q3, we select a random set of vertices and run BCD for each vertex in the set. The results are reported in Table 5, where the set contains 5, 10 or 15 vertices. Over all the datasets and for each set of vertices, we report the average, maximum and minimum errors of the vertices. For all the datasets, the minimum error is always 0. In Table 5, "" is the total time of computing of all the vertices in the set and "Time" is the total time of the other steps of computing the betweenness scores of all the vertices in the set. Therefore, the total running time of BCD for a given dataset and a given set is the sum of "Time" and "". Comparing the results presented in Table 5 with those presented in Table 3 reveals that for estimating the betweenness scores of a set of vertices, BCD significantly outperforms KADABRA (where is ). In most cases the total running time of BCD is less than the running time of KADABRA (even when the size of the set is 15), and BCD also gives much more accurate results. Note that even when in KADABRA is set to 0.01, in many cases BCD is faster than KADABRA. In particular, over datasets such as amazon, comamazon, emailEuAll and webNotreDame, even for the sets of size 15, BCD is faster than KADABRA and it always produces much more accurate results.
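| Dataset | Set size | Error () Avg. | Error () Max. | Error () Min. | Time | | size Avg. | size Max. | size Min. |
|---|---|---|---|---|---|---|---|---|---|
| Amazon | 5 | 1.47 | 7.10 | 0 | 4.81 | 1.44 | 9581.6 | 47187 | 4 |
| | 10 | 0.73 | 7.10 | 0 | 7.42 | 3.21 | 4818.4 | 47187 | 1 |
| | 15 | 0.88 | 7.10 | 0 | 9.74 | 4.98 | 3497.798 | 47187 | 1 |
| Comamazon | 5 | 0 | 0 | 0 | 1.98 | 1.36 | 132.2 | 616 | 3 |
| | 10 | 0 | 0 | 0 | 4.92 | 3.43 | 91.2 | 616 | 2 |
| | 15 | 0 | 0 | 0 | 7.07 | 5.48 | 65.93 | 616 | 1 |
| Comdblp | 5 | 0.22 | 1.10 | 0 | 7.09 | 1.36 | 447.8 | 2092 | 11 |
| | 10 | 3.47 | 19.45 | 0 | 20.71 | 3.08 | 24483.6 | 227218 | 1 |
| | 15 | 2.32 | 19.45 | 0 | 28.81 | 4.92 | 21351.33 | 227218 | 1 |
| EmailEuAll | 5 | 1.06 | 3.59 | 0 | 3.86 | 0.38 | 26584.6 | 111674 | 2 |
| | 10 | 1.39 | 7.95 | 0 | 9.76 | 0.78 | 19020.9 | 111674 | 2 |
| | 15 | 0.93 | 7.95 | 0 | 13.52 | 1.27 | 12742.8 | 111674 | 2 |
| P2pGnutella31 | 5 | 2.26 | 11.31 | 0 | 3.47 | 0.22 | 4864.2 | 24141 | 2 |
| | 10 | 7.26 | 39.17 | 0 | 23.09 | 0.46 | 5493.6 | 24141 | 2 |
| | 15 | 6.79 | 39.17 | 0 | 33.27 | 0.72 | 8637.73 | 28122 | 2 |
| Slashdot0902 | 5 | 0 | 0 | 0 | 2.62 | 0.78 | 79.6 | 369 | 2 |
| | 10 | 5.04 | 50.48 | 0 | 11.37 | 1.38 | 3784.3 | 26802 | 1 |
| | 15 | 4.92 | 50.48 | 0 | 14.93 | 1.99 | 6662.86 | 62089 | 1 |
| Socsignepinions | 5 | 13.37 | 48.37 | 0 | 9.64 | 0.74 | 7817.2 | 36393 | 3 |
| | 10 | 9.68 | 48.37 | 0 | 17.71 | 1.52 | 20302.7 | 109520 | 1 |
| | 15 | 9.38 | 48.37 | 0 | 28.46 | 2.28 | 15538.86 | 109520 | 1 |
| WebNotreDame | 5 | 0 | 0 | 0 | 2.58 | 1.25 | 200.6 | 797 | 9 |
| | 10 | 0 | 0 | 0 | 6.89 | 2.44 | 231.5 | 1092 | 9 |
| | 15 | 0.03 | 0.30 | 0 | 13.16 | 3.62 | 414.46 | 2610 | 1 |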
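The per-set statistics reported for a set of vertices can be reproduced with a few lines of code. The following is only a sketch: it assumes that the error of a vertex is the relative error of its estimated score with respect to its exact score, expressed in percent, and the function names `percent_error` and `summarize_errors` are hypothetical.

```python
def percent_error(exact, estimate):
    """Relative error in percent (assumed error measure, see the lead-in)."""
    if exact == 0:
        # An exact score of zero is only matched by an estimate of zero.
        return 0.0 if estimate == 0 else float("inf")
    return abs(estimate - exact) / exact * 100.0


def summarize_errors(exact_scores, estimated_scores):
    """Average, maximum and minimum error over the vertices of a set.

    Both arguments map a vertex to its score; the set of vertices is
    taken from the keys of estimated_scores.
    """
    errors = [
        percent_error(exact_scores[v], estimated_scores[v])
        for v in estimated_scores
    ]
    return {
        "avg": sum(errors) / len(errors),
        "max": max(errors),
        "min": min(errors),
    }
```

For instance, with exact scores `{"a": 100, "b": 200}` and estimates `{"a": 100, "b": 150}`, the per-vertex errors are 0 and 25, so the average is 12.5, the maximum is 25 and the minimum is 0.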
5.2 Discussion
Our extensive experiments reveal that BCD usually significantly outperforms KADABRA. This is due to the heavy pruning applied to the set of source vertices that are used to form SPDs and compute dependency scores. Note that in all the cases, is computed very efficiently; hence, it does not impose a considerable load on the algorithm. In the case of estimating the betweenness score of the vertex with the highest betweenness score, one may argue that over two datasets KADABRA outperforms BCD. There are two reasons for this. First, in these cases the ratio is large, so BCD computes many SPDs. Second, the SPDs contain many vertices of the graph, so computing them is more expensive.
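The effect of pruning source vertices can be illustrated with a minimal sketch. It is not the BCD implementation: it assumes an unweighted directed graph stored as a dictionary that maps every vertex to its list of out-neighbours, and it uses the simplest possible pruning rule, namely discarding every source from which the target vertex is unreachable (such a source has a dependency score of zero on the target), in place of the precomputed set used by BCD. The names `reverse_reachable`, `dependency_on` and `betweenness_of` are hypothetical.

```python
from collections import deque


def reverse_reachable(adj, r):
    """Vertices that can reach r, found by a BFS on the reversed graph."""
    rev = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            rev[v].append(u)
    seen = {r}
    queue = deque([r])
    while queue:
        v = queue.popleft()
        for u in rev[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    seen.discard(r)  # r itself is never used as a source
    return seen


def dependency_on(adj, s, r):
    """Dependency score of source s on r (BFS plus Brandes-style accumulation)."""
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
    order = []
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:  # first visit of w
                dist[w] = dist[v] + 1
                sigma[w] = 0
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in order}
    for w in reversed(order):  # farthest vertices first
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    return delta.get(r, 0.0)  # 0 if r is not reachable from s


def betweenness_of(adj, r, prune=True):
    """Exact betweenness score of r, optionally skipping useless sources."""
    sources = reverse_reachable(adj, r) if prune else set(adj) - {r}
    return sum(dependency_on(adj, s, r) for s in sources)
```

On the toy graph `{"a": ["r"], "c": ["r"], "r": ["b"], "b": [], "d": ["b"]}`, only `a` and `c` can reach `r`, so the pruned variant forms two SPDs instead of four, and both variants return the same score for `r`.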
Finally, it is worth mentioning that while the size of is an important factor in the efficiency of the algorithm, it is not the only factor. For example, both graphs of Figure 2 have vertices, the size of in Figure 2 is and the size of in Figure 2 is . However, in Figure 2 each SPD is computed and processed in time, whereas in Figure 2 each SPD is computed and processed in time. Therefore, while in Figure 2 is computed in time, in Figure 2 it is computed in time.
6 Conclusion
In this paper, we studied the problem of computing betweenness scores in large directed graphs. First, given a directed network and a vertex , we proposed a new exact algorithm to compute the betweenness score of . Our algorithm first computes a set , which is used to prune a huge amount of computations that do not contribute to the betweenness score of . The time complexity of our exact algorithm is respectively and for unweighted graphs and weighted graphs with positive weights. Then, for the cases where is large, we presented a simple randomized algorithm that samples from and performs computations for only the sampled elements. We showed that this algorithm provides an approximation of the betweenness score of . Finally, we performed extensive experiments over several real-world datasets from different domains, for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. Our experiments revealed that in most cases, our algorithm significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Our experiments also showed that our proposed algorithm computes the betweenness scores of all vertices in sets of sizes 5, 10 and 15 much faster and more accurately than the most efficient existing approximate algorithms.
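The idea of performing computations for only a sample of the candidate sources can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm analysed in the paper: it draws sources uniformly at random with replacement and rescales the average dependency score by the number of candidates, which gives an unbiased estimate of the sum of the dependency scores over all candidates. The function name `estimate_betweenness` and its parameters (`candidates`, `dependency`, `k`, `rng`) are hypothetical, and `dependency` stands for any routine that returns the dependency score of a source on the target vertex.

```python
import random


def estimate_betweenness(candidates, dependency, k, rng=None):
    """Estimate the sum of dependency(s) over all candidate sources.

    candidates: sources that may contribute to the score of the target.
    dependency: callable returning the dependency score of one source.
    k:          number of samples, drawn uniformly with replacement.
    rng:        optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    candidates = list(candidates)
    if not candidates or k <= 0:
        return 0.0
    total = 0.0
    for _ in range(k):
        s = rng.choice(candidates)  # one SPD would be formed per sample
        total += dependency(s)
    # Each sample has expectation (sum of all dependencies) / len(candidates),
    # so rescaling the sample mean by len(candidates) is unbiased.
    return len(candidates) * total / k
```

When every candidate has the same dependency score the estimate is exact for any `k`; in general its variance shrinks as `k` grows, at the price of forming more SPDs.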
Footnotes
 email: mostafa.chehreghani@gmail.com
 email: albert.bifet@telecomparistech.fr
 email: talel.abdessalem@telecomparistech.fr
 http://snap.stanford.edu/data/amazon0302.html
 http://snap.stanford.edu/data/com-Amazon.html
 http://snap.stanford.edu/data/com-DBLP.html
 https://snap.stanford.edu/data/email-EuAll.html
 http://snap.stanford.edu/data/p2p-Gnutella31.html
 http://snap.stanford.edu/data/soc-sign-Slashdot090221.html
 http://snap.stanford.edu/data/soc-sign-epinions.html
 https://snap.stanford.edu/data/web-NotreDame.html
 For given values of and , KADABRA computes the normalized betweenness of the vertices of the graph within an error with a probability at least . The normalized betweenness of a vertex is its betweenness score divided by . Therefore, we multiply the scores computed by KADABRA by .
 We have already found this vertex using the exact algorithm.
References
 Manas Agarwal, Rishi Ranjan Singh, Shubham Chaudhary, and Sudarshan Iyengar. Betweenness Ordering Problem: An Efficient Non-Uniform Sampling Technique for Large Graphs. CoRR, abs/1409.6470, 2014.
 D. A. Bader, S. Kintali, K. Madduri, and M. Mihail. Approximating betweenness centrality. In Proceedings of 5th International Conference on Algorithms and Models for the WebGraph (WAW), pages 124–137, 2007.
 A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
 M. Barthelemy. Betweenness centrality in large complex networks. The Europ. Phys. J. B - Condensed Matter, 38(2):163–168, 2004.
 Elisabetta Bergamini, Henning Meyerhenke, and Christian Staudt. Approximating betweenness centrality in large evolving networks. In Ulrik Brandes and David Eppstein, editors, Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments, ALENEX 2015, San Diego, CA, USA, January 5, 2015, pages 133–146. SIAM, 2015.
 Michele Borassi and Emanuele Natale. KADABRA is an adaptive algorithm for betweenness via random approximation. In Piotr Sankowski and Christos D. Zaroliagis, editors, 24th Annual European Symposium on Algorithms, ESA 2016, August 22–24, 2016, Aarhus, Denmark, volume 57 of LIPIcs, pages 20:1–20:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.
 U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
 U. Brandes and C. Pich. Centrality estimation in large networks. Intl. Journal of Bifurcation and Chaos, 17(7):2303–2318, 2007.
 Ümit V. Çatalyürek, Kamer Kaya, Ahmet Erdem Sariyüce, and Erik Saule. Shattering and compressing networks for betweenness centrality. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2–4, 2013, Austin, Texas, USA, pages 686–694. SIAM, 2013.
 Mostafa Haghir Chehreghani. Effective co-betweenness centrality computation. In Seventh ACM International Conference on Web Search and Data Mining (WSDM), pages 423–432, 2014.
 Mostafa Haghir Chehreghani. An efficient algorithm for approximate betweenness centrality computation. Comput. J., 57(9):1371–1382, 2014.
 M. Everett and S. Borgatti. The centrality of groups and classes. Journal of Mathematical Sociology, 23(3):181–201, 1999.
 L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.
 R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 90–100, 2008.
 M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99:7821–7826, 2002.
 Takanori Hayashi, Takuya Akiba, and Yuichi Yoshida. Fully dynamic betweenness centrality maintenance on massive networks. Proceedings of the VLDB Endowment (PVLDB), 9(2):48–59, 2015.
 Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
 P. Holme. Congestion and centrality in traffic flow on complex networks. Adv. Complex. Syst., 6(2):163–176, 2003.
 E. D. Kolaczyk, D. B. Chua, and M. Barthelemy. Group-betweenness and co-betweenness: Interrelated notions of coalition centrality. Social Networks, 31(3):190–203, 2009.
 M. J. Lee, J. Lee, J. Y. Park, R. H. Choi, and C.-W. Chung. QUBE: A quick algorithm for updating betweenness centrality. In Proceedings of the 21st World Wide Web Conference (WWW), pages 351–360, 2012.
 Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1), 2007.
 Jure Leskovec, Daniel P. Huttenlocher, and Jon M. Kleinberg. Signed networks in social media. In Elizabeth D. Mynatt, Don Schoner, Geraldine Fitzpatrick, Scott E. Hudson, W. Keith Edwards, and Tom Rodden, editors, Proceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010, Atlanta, Georgia, USA, April 10–15, 2010, pages 1361–1370. ACM, 2010.
 Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.
 M. E. J. Newman. The structure and function of complex networks. SIAM REVIEW, 45:167–256, 2003.
 R. Puzis, Y. Elovici, and S. Dolev. Fast algorithm for successive computation of group betweenness centrality. Phys. Rev. E, 76(5):056709, 2007.
 R. Puzis, Y. Elovici, and S. Dolev. Finding the most prominent group in complex networks. AI Commun., 20(4):287–296, 2007.
 Matteo Riondato and Evgenios M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, 30(2):438–475, 2016.
 Matteo Riondato and Eli Upfal. ABRA: Approximating betweenness centrality in static and dynamic graphs with Rademacher averages. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1145–1154, New York, NY, USA, 2016. ACM.
 Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
 George Stergiopoulos, Panayiotis Kotzanikolaou, Marianthi Theocharidou, and Dimitris Gritzalis. Risk mitigation strategies for critical infrastructures based on graph centrality analysis. International Journal of Critical Infrastructure Protection, 10:34–44, 2015.
 V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probab. and its Applications, 16(2):264–280, 1971.
 Y. Wang, Z. Di, and Y. Fan. Identifying and characterizing nodes important to community structure using the spectrum of the graph. PLoS ONE, 6(11):e27418, 2011.
 Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Mohammed Javeed Zaki, Arno Siebes, Jeffrey Xu Yu, Bart Goethals, Geoffrey I. Webb, and Xindong Wu, editors, 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10–13, 2012, pages 745–754. IEEE Computer Society, 2012.