Distributed Local Multi-Aggregation and Centrality Approximation

Benjamin Dissler (ETH Zurich, disslerb@ethz.ch)
Stephan Holzer (MIT, holzer@mit.edu)
Roger Wattenhofer (ETH Zurich, wattenhofer@ethz.ch)

Part of the work was done at ETH Zurich. At MIT the author was supported by the following grants: AFOSR Contract Number FA9550-13-1-0042, NSF Award 0939370-CCF, NSF Award CCF-1217506, NSF Award number CCF-AF-0937274.

Abstract

We study local aggregation and graph analysis in distributed environments using the message passing model. We provide a flexible framework in which each node of a set $S$, a subset of all nodes $V$ in the network, can perform a large range of common aggregation functions in its $k$-neighborhood. We study this problem in the CONGEST model, where in each synchronous round, every node can transmit a different (but short) message to each of its neighbors. While the $k$-neighborhoods of nodes in $S$ might overlap and aggregation could cause congestion in this model, we present an algorithm that needs time $O(|S|+k)$ even when each of the nodes in $S$ performs a different aggregation on its $k$-neighborhood. The framework is not restricted to aggregation trees, so it can be used for more advanced graph analysis. We demonstrate this by providing efficient approximations of centrality measures and of minimum routing cost trees.

1 Introduction

Data aggregation and analysis is one of the most basic tasks at the heart of many distributed systems, and the question of how to aggregate and analyze information and networks as efficiently as possible arises daily. The result is a huge body of work, ranging from theoretical to practical aspects, that focuses on optimizing e.g. speed, space, bandwidth, energy, fault-tolerance and accuracy. As already stated in [21], “the database community classifies aggregation functions into three categories: distributive (max, min, sum, count), algebraic (plus, minus, average, variance), and holistic (median, smallest, or largest value). Combinations of these functions are believed to support a wide range of reasonable aggregation queries.” Often one is also interested in computing a combination of these such as, e.g., “What is the average of the 10% largest values?” [21].

However, most of this work considers the case in which only one node in the network aggregates information (and then often broadcasts it to all other nodes). In reality, many nodes of a large network are independently interested in aggregating data at the same time and restricted to their local neighborhood. That is, all nodes in a subset $S \subseteq V$ want to perform a (possibly different) aggregation, where $V$ is the set of nodes in the network. For example, 1) a few nodes in a network want to know if certain data is stored in their neighborhood, 2) cars participating in a vehicular ad hoc network (VANET) want to aggregate information on traffic, safety warnings and parking spots from their local neighborhood, or 3) nodes that turned idle search for the busiest nodes in their respective neighborhoods to take work from them. A simple approach is to just let all nodes in $S$ perform their aggregation at the same time; however, this might lead to congestion and a worst case runtime of .

In this paper we present a general framework for local multi-aggregation that allows a time- and message-optimal implementation of a wide range of aggregation functions. We do this in a general setting, where nodes in $S$ aggregate data from their $k$-neighborhood using shortest paths, and achieve a runtime of $O(|S|+k)$ to do so.

We show how to perform aggregation and graph analysis by aggregating through all possible shortest paths between any pair of nodes (not only using one path as it is usually done in an aggregation-tree). This is a powerful tool which enables us to provide an efficient approximation of Betweenness Centrality. We also present efficient approximations of Closeness Centrality and Minimum Routing Cost Trees.

1.1 Our Contribution

We provide a framework for local multi-aggregation that takes care of scheduling the messages sent between nodes in an efficient way. One only needs to specify how nodes process incoming messages, that is, to design algorithms, depending on the aggregation function at hand, that transform received messages into new messages to be sent. Using our framework, one can aggregate information not only by using a tree, but by using all possible shortest paths from a root node to any other node. This has advantages for advanced computations, as we demonstrate later. Thus we show two variations of the algorithm:

  1. only one (shortest) --path is needed for each to perform the aggregation.

  2. all shortest --paths are needed for each to perform the aggregation.

The second version is of interest e.g. for computing betweenness centrality, a measure that depends on all shortest paths, not only a single one. (Footnote 1: In the algorithm we approximate the number of all shortest paths starting in a certain sampled set of nodes. Note that it does not seem possible to approximate centrality measures without performing these independent aggregations.)

To perform independent (possibly different) aggregations in the $k$-neighborhood of each of the nodes in $S$, our algorithm takes time $O(|S|+k)$, which we show to be optimal (see Remark 3.3). As an example of an aggregation function that can be plugged into our framework, we consider computing the maximum value stored in the $k$-neighborhood of each of the nodes. Further aggregation functions can be implemented in a similar way. Different root nodes in $S$ can perform different aggregations at the same time. Based on this, we show how to approximate centrality measures and minimum routing cost trees and obtain the following theorems:

Theorem 1.1.

Algorithm 1 DLGcomp computes valid directed leveled graphs with depth , in time . And Algorithm 2 DLGagr aggregates information through directed leveled graphs with depth , in time .

Theorem 1.2.

The tree variations of Algorithms 1 DLGcomp and 2 DLGagr compute trees in time and aggregate information through these trees, in time .

Theorem 1.3.

Algorithm 3 computes an estimation of the betweenness centrality for at least nodes. For all those nodes the following holds: for and , if the betweenness centrality of a node is for some constant , then with probability at least its betweenness centrality can be -approximated with computation of samples of source nodes. Algorithm 3 runs in time .

Theorem 1.4.

In a weighted graph , when the weights denote the distance between two nodes, Algorithm 6 approximates closeness centrality of all nodes with an inverse additive error of and runs in time.

Theorem 1.5.

Algorithm 7 MRCT computes a -approximation of the S-MRCT problem in time , when is uniform or corresponds to the time needed to transmit a message through edge .

Theorem 1.6.

Algorithm MRCTrand computes a -approximation of the S-MRCT problem in time , when is uniform or corresponds to the time needed to transmit a message through edge .

The algorithms and proofs are stated as follows: Theorem 1.1 in Section 3.1, Theorem 1.2 in Section 3.2, Theorem 1.3 in Section 4.1, Theorem 1.4 in Section 4.2, and Theorem 1.6 in Section 4.3.

1.2 Related Work

(Local multi-)aggregation

The authors of [25] survey the general area of data aggregation in wireless sensor networks. It is known [2] that local multi-aggregation is useful in generalizations of Zone-Based Intrusion Detection Systems. These systems detect local changes of the topology (assuming the graph does not change too fast) and react to them. The authors of [19] perform local multi-aggregation in sensor networks to detect basic changes of the local environment, such as variations of temperature or concentration of toxic gas. If the value in one node passes a threshold, it starts a local aggregation. They also provide applications to fire fighters and emergency services in case of a flood. In [19] changes of areal objects are detected by the cooperation of sensors in the locality of the changes and later reported to a central node.

Centrality

Classical applications of Closeness Centrality [4, 5] and Betweenness Centrality [14] are in the area of social network analysis [28]. E.g. Closeness Centrality indicates how fast information spreads, and Betweenness Centrality indicates how easily it can be manipulated by certain nodes. However, these findings can often be transferred to networks based on electric devices. One example is simple routing protocols, where malicious nodes with high centrality have the power to manipulate large parts of the communication. Eppstein and Wang [12] showed a fast approximation algorithm that derives the closeness centrality of a node by sampling only a few nodes and computing shortest paths from the sampled nodes to all nodes. Brandes [7] provided a fast algorithm for computing betweenness centrality recursively, which Bader et al. [3] extended to an even faster approximation using an adaptive sampling technique.

We continue by reviewing a few applications in network analysis and design independent of social networks: For distance approximation, when using the landmark method [23, 24], it is mentioned in [1] that a modification of [26] chooses landmarks using local closeness centrality. According to [27], this is implemented e.g. in the Trinity Graph Engine. A relation to network flows is established e.g. in [6] for Closeness as well as Betweenness Centrality. The authors of [9] propose a routing scheme based on Betweenness Centrality.

MRCT

Distributed approximations of MRCT in the CONGEST model were studied in [15, 20, 10]. The authors of [20] showed how to use results of [13] to obtain a randomized -approximation of the MRCT in time . Observe that this might be even in a graph with hop diameter . In [15] two algorithms were presented for unweighted graphs and weighted graphs, where the weight of an edge corresponds to the traversal time of the edge. A deterministic one, that computes a -approximation in linear time and a randomized one, that computes w.h.p. an -approximation in time (and variations thereof). A lower bound of for any randomized -error algorithm that computes a -approximation was shown in [10] for weighted graphs. For a more detailed overview on work related to MRCT computations in various computational models, we refer the reader to [15].

Shortest Paths

At the heart of our algorithms is an algorithm to compute directed layered graphs rooted in nodes in parallel. This algorithm is a modification of the $S$-Shortest Paths algorithm of [17]. Furthermore, our modification extends this $S$-Shortest Paths algorithm to the weighted setting and to compute shortest paths only up to a certain distance that can be provided as input. Note that the -source detection algorithm by Lenzen and Peleg [22] can be extended to weighted graphs in a similar way and could be modified to compute directed layered graphs as well, but we developed our method that resulted in this paper independently and simultaneously [11, 16].

2 Model, Definitions and Problems

Model:

Our network is represented by an undirected graph . Nodes correspond to processors (computers, sensor nodes or routers); two nodes are connected by an edge from set if they can communicate directly with each other. We denote the number of nodes of a graph by , and the number of its edges by . Furthermore we assume that each node has a unique ID in the range of , i.e. each node can be represented by bits. Initially the nodes know only the IDs of their immediate neighborhood, the number of nodes in and the node with the lowest ID further denoted as node . (Footnote 2: Note that for computing the schedule in our framework we do not need to know and the node with smallest ID. These assumptions are only necessary for our applications such as centrality computation. The runtime of these applications depends on the diameter of the graph such that computing these values does not affect the runtime.)

An unweighted shortest path in between two nodes and is a -path consisting of the minimum number of edges contained in any -path. Denote by the unweighted distance between two nodes and in , which is the length of an unweighted shortest -path in . We also say and are hops apart. By we denote a graph’s weight function and by the non-negative weight of an edge in . For an edge we often write to denote . For arbitrary nodes and , by we define the weighted distance between and , that is the weight of a shortest weighted path connecting and . In case we consider a subgraph of a graph , we denote by and the distances with respect to using only edges in .

In our model we consider only symmetric edge weights, i.e. an edge has in both directions the same weight . We denote a graph with weighted edges as . By we denote the -neighborhood of a node . That is in an unweighted graph all nodes in that can be reached from using hops. And in a weighted graph all nodes that can be reached from using paths with at most weight . We use the convention that . Given a set , set denotes the -neighborhood of . During the paper we denote the degree of a node by .
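The unweighted $k$-neighborhood can be illustrated with a depth-bounded breadth-first search. The following Python sketch is a centralized illustration only, not part of the distributed framework. The adjacency-dictionary representation `adj`, the function name `k_neighborhood`, and the choice to count a node as part of its own neighborhood are assumptions made for this example.

```python
from collections import deque

def k_neighborhood(adj, v, k):
    """Return the set of nodes within at most k hops of v (v included, by assumption)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

# Path graph 1 - 2 - 3 - 4 - 5 (hypothetical example input)
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(k_neighborhood(path, 3, 1))
```

In the weighted case the same idea would apply with a shortest-path search bounded by total weight instead of hop count.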

Definition 2.1.

The weighted eccentricity of a node is the largest weighted distance to any other node in the graph, that is .
The unweighted (hop) eccentricity of a node is the largest hop distance to any other node in the graph, that is .

Definition 2.2.

The weighted diameter of a graph is the maximum weighted distance between any two nodes of the graph.
The unweighted diameter (or hop diameter) of a graph is the maximum number of hops between any two nodes of the graph.

Observe that for unweighted graphs. In weighted graphs it is true that as well as .
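As a centralized illustration of Definitions 2.1 and 2.2 in the unweighted case, the sketch below computes hop eccentricities and the hop diameter by running a breadth-first search from every node. It assumes a connected graph given as an adjacency dictionary; the function names are hypothetical and chosen for this example only.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def hop_eccentricity(adj, v):
    """Largest hop distance from v to any other node (graph assumed connected)."""
    return max(hop_distances(adj, v).values())

def hop_diameter(adj):
    """Maximum hop eccentricity over all nodes."""
    return max(hop_eccentricity(adj, v) for v in adj)

# Path graph 1 - 2 - 3 - 4 - 5 (hypothetical example input)
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(hop_eccentricity(path, 3), hop_diameter(path))
```

On any connected graph the eccentricity of a node is at most the diameter, and the diameter is at most twice that eccentricity, which the example input also satisfies.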

In this paper we need a graph substructure to describe shortest paths from a root node to all other nodes. Such a representation can be found e.g. in a tree. However, for applications of our framework to e.g. calculating the Betweenness Centrality of a node in Section 4.1, we need to consider all shortest paths between two nodes, not just a single one as provided by a tree. Therefore we need to define a graph structure which clearly indicates all possible shortest paths between a root node and all other nodes .

Definition 2.3 (Tree ).

Given a node , we denote the spanning tree of that results from performing a breadth-first search (BFS) starting at by .

Definition 2.4 (Directed Leveled Graph (DLG) , unweighted).

Given a node of a graph , partition all nodes into disjoint subsets with if . Add a directed edge to a set for every edge with and . That is, every shortest path from to any other node in is a directed path in . We say is rooted in . A graphic is provided in Figure 1.

In the weighted case, we partition all nodes into disjoint subsets with if . Add a directed edge to set for every edge with and . That is, every shortest weighted path from to any other node in is a directed weighted path in . A graphic is provided in Figure 2.

Figure 1: on the left side and its leveled graph rooted in on the right.
Figure 2: on the left side and its leveled graph rooted in on the right.

Definition 2.5 (parent, children, ancestors and leaves in a DLG).

Given a node , we define a parent of in a DLG as any neighbor of connected by a directed edge to . As a node can have several parents, we often consider a set of parents, formally defined as: . Children are all neighbors of connected by a directed edge to . A leaf node in a DLG is a node without any children.
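To make Definitions 2.4 and 2.5 concrete, the following sketch builds the unweighted DLG of a root centrally: a breadth-first search assigns levels, and every graph edge that connects consecutive levels becomes a directed DLG edge. The adjacency-dictionary input and the names `build_dlg`, `parents`, `children` and `leaves` are assumptions for illustration; the distributed construction in the paper is Algorithm 1 DLGcomp, not this code.

```python
from collections import deque

def build_dlg(adj, root):
    """Return (level, parents, children) of the unweighted DLG rooted in root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    parents = {v: [] for v in level}
    children = {v: [] for v in level}
    for u in level:
        for w in adj[u]:
            # keep exactly the edges that go from one level to the next
            if w in level and level[w] == level[u] + 1:
                children[u].append(w)
                parents[w].append(u)
    return level, parents, children

# 4-cycle 1-2-4-3-1: node 4 is reached from root 1 by two shortest paths
cycle = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
level, parents, children = build_dlg(cycle, 1)
leaves = [v for v in children if not children[v]]
print(level, parents[4], leaves)
```

Node 4 has two parents here, which is exactly the information a BFS tree would discard and a DLG keeps.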

Definition 2.6 (approximation).

Given an optimization problem , denote by the value of the optimal solution for and by the value of the solution of an algorithm for . Let . We say A is a -approximation (additive approximation) with additive error for if for any input. Let . Like in [12] we say is an inverse additive approximation with inverse additive error for if for any input. We say is -approximative (a one-sided multiplicative approximation) for if and or if and for any input.

Like in [12], the inverse additive error is used in the closeness centrality approximation, see Section 4.2.

Fact 2.7.

For any node we know that .

2.1 Problem Statements

We start by formally stating the problems we consider in this paper. First we state local multi-aggregation, which is not only of interest by itself but turns out to be at the heart of the other problems. Centrality-measures are graph-properties that are based on the number of shortest paths. Besides classical applications in social network analysis [8, 28], they have a number of applications in design and analysis of distributed networks such as [1, 6, 9]. Minimum Routing Cost Spanning Trees can be used to minimize the average cost of communication in a network while keeping a sparse routing-structure [18, 29].

Definition 2.8.

(local multi-aggregation). In the problem of local multi-aggregation, we are given a subset of nodes in a graph . Each of the nodes in contains (several) values and each node in wants to evaluate a (possibly different) aggregation-function based on (some of) these values stored in nodes of its -neighborhood. We consider two variations:

  1. only one (shortest) --path is needed for each to perform the aggregation.

  2. all shortest --paths are needed for each to perform the aggregation.

Betweenness centrality is a measure for the centrality of a node , based on the number of shortest paths in a graph that node is part of. In a graph , let denote the number of shortest paths from to and let denote the number of shortest paths from to that go through the node for . Closeness centrality is another measure to identify important nodes in a graph. The closeness centrality of a node is the inverse of the average distance to all other nodes. At the heart of our solutions for these problems is the $S$-SP problem, and we define:

Definition 2.9 (Betweenness Centrality, as stated in [14]).

.

Definition 2.10 (Closeness Centrality, as stated in [5]).

.
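For reference, the two centrality measures described in the prose above are commonly written as follows. This is a sketch in the usual notation of the literature; the symbol names $C_B$, $C_C$, $\sigma_{st}$, $\sigma_{st}(v)$, $d_\omega$ and $n$ are assumptions chosen to match the verbal description (number of shortest paths, inverse of the average distance) and may differ from the notation used elsewhere in this paper.

```latex
% Betweenness centrality: fraction of shortest s-t-paths passing through v,
% summed over all pairs s, t with s, t and v pairwise distinct.
C_B(v) \;=\; \sum_{\substack{s,t \in V \\ s \neq v \neq t,\; s \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}}

% Closeness centrality: inverse of the average (weighted) distance from v
% to the other n - 1 nodes.
C_C(v) \;=\; \frac{n-1}{\sum_{u \in V} d_\omega(u,v)}
```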

Definition 2.11.

(Minimum Routing Cost Tree (MRCT) as stated in [30] ). The MRCT problem [30] is defined as follows. The routing cost of a subgraph of is the sum of the routing costs of all pairs of vertices in the tree, i.e., . Our goal is to find a spanning tree with minimum routing cost. Given a subset of the vertices in , in the -MRCT problem [15], our goal is to find a subtree of that spans and has minimum routing cost with respect to . That is a tree such that is minimized. Here, denotes the routing cost of with respect to .

Definition 2.12 ($S$-SP).

Let be a graph. In the $S$-Shortest Paths ($S$-SP) problem, we are given a set and need to compute the shortest path lengths between any pair of nodes in such that in the end each node in knows its distance to all nodes in .

3 Local Multi-Aggregation

Many distributed algorithms to compute network properties can be stated in a way that they distribute and/or aggregate information through a tree or a directed leveled graph (DLG). Examples are computing the max-value of the $k$-neighborhood (using a tree, see the text below), computing the sum of all values in the $k$-neighborhood (using a tree, see the text below), and computing centrality measures (using a DLG, see Sections 4.1 and 4.2).

By extending the algorithms of [17] and [15], we present an algorithm which can aggregate information in the $k$-neighborhood of a set of root nodes $S$. Note that if information from the whole graph is needed, $k$ can be set to .

All known exact computations of these properties need to compute a tree or DLG for every node , to evaluate ’s dependency on other nodes. Often the exact computation can be approximated by evaluating only the dependencies of a subset of nodes and therefore computing only trees or DLGs with depth with respect to the root nodes . We provide in Section 3.1 an algorithm to compute such DLGs rooted in the nodes of in time . Furthermore, we provide an algorithm to aggregate information on these previously computed DLGs, again with a time complexity of . Both algorithms are able to execute additional algorithms, and respectively, within each node, to distribute and aggregate information about the desired network property, where the network property description replaces the index . These functions and can be chosen depending on the aggregation problem at hand. A possible application with choices of and to compute the max-value in the $k$-neighborhood of each root node is shown in Example 3.1.

Example 3.1.

(Computing the max-value of each k-Neighborhood of a set of nodes , , with our proposed algorithm.) Each node starts Algorithm 1 DLGcomp. While spanning the DLGs rooted in each node in Algorithm 1 DLGcomp, no algorithm needs to be specified. In Algorithm 2 DLGagr the maximum node value needs to be aggregated. Therefore we define algorithm as follows: initializes on each node for each root node a variable to store the highest value received from any child in . First, the node’s own value is stored in for every . On execution of with message and ID from a child, stores the value of in if . Then is stored in and sent to the parents of . After the execution of Algorithms 1 DLGcomp and 2 DLGagr, each root node stores in the max-value of its own k-Neighborhood.
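The aggregation rule of Example 3.1 can be sketched as a centralized, sequential simulation: for each root, nodes are processed from the deepest level upwards, and every node reports its current maximum to all of its DLG parents. The sketch handles one root at a time and therefore does not model the message scheduling of Algorithms 1 and 2. The names `dlg_levels`, `max_in_k_neighborhood`, `adj` and `value` are hypothetical, and the graph is assumed unweighted.

```python
from collections import deque

def dlg_levels(adj, root, k):
    """Hop levels of all nodes within k hops of root (depth-bounded BFS)."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if level[u] == k:
            continue
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level

def max_in_k_neighborhood(adj, value, roots, k):
    """For every root, the maximum value stored within its k-neighborhood."""
    result = {}
    for s in roots:
        level = dlg_levels(adj, s, k)
        best = {v: value[v] for v in level}  # initialization: own value
        # bottom-up: deepest level first, each node reports to all DLG parents
        for v in sorted(level, key=level.get, reverse=True):
            for p in adj[v]:
                if p in level and level[p] == level[v] - 1:
                    best[p] = max(best[p], best[v])
        result[s] = best[s]
    return result

# Path graph 1 - 2 - 3 - 4 - 5 with hypothetical node values
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
value = {1: 7, 2: 1, 3: 4, 4: 9, 5: 2}
print(max_in_k_neighborhood(path, value, [1, 5], 2))
```

Because the maximum is insensitive to duplicates, reporting to every parent is harmless here; this is what distinguishes it from the sum in Example 3.2.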

A choice of and to compute the sum of all values in the $k$-neighborhood of each root node is provided in Example 3.2.

Example 3.2.

(Computing the sum of all values in each k-Neighborhood of a set of nodes , , with our proposed algorithm.) Each node starts Algorithm 1 DLGcomp. While spanning the DLGs rooted in each node in Algorithm  1 DLGcomp no algorithm needs to be specified. In Algorithm 2 DLGagr the sum of all node values in the k-Neighborhood needs to be aggregated. Therefore we define algorithm as follows: initializes on each node for each root node a variable to store the sum of all values received from any child in plus its own value . First the node’s own value is stored in for every . On execution of with message and ID from a child, adds the value of to . Then is stored in and sent to (only) one parent of . If node has more than one parent, the message is sent to the parent node with the lowest ID. This results in the Tree Variation in Section 3.2. After the execution of Algorithm 1 DLGcomp and 2 DLGagr each root node stores in the sum of all values in its own k-Neighborhood.
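The difference to Example 3.1 is that a sum must count every node exactly once, which is why each node forwards its partial sum to a single parent (the one with the lowest ID). The following centralized sketch illustrates this tree variation for unweighted graphs; it is a sequential simulation under assumed names (`dlg_levels`, `sum_in_k_neighborhood`, `adj`, `value`) and does not model the distributed schedule.

```python
from collections import deque

def dlg_levels(adj, root, k):
    """Hop levels of all nodes within k hops of root (depth-bounded BFS)."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if level[u] == k:
            continue
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level

def sum_in_k_neighborhood(adj, value, roots, k):
    """For every root, the sum of all values stored within its k-neighborhood."""
    result = {}
    for s in roots:
        level = dlg_levels(adj, s, k)
        total = {v: value[v] for v in level}  # initialization: own value
        for v in sorted(level, key=level.get, reverse=True):
            parents = [p for p in adj[v] if p in level and level[p] == level[v] - 1]
            if parents:
                # send the partial sum to the lowest-ID parent only
                total[min(parents)] += total[v]
        result[s] = total[s]
    return result

# 4-cycle 1-2-4-3-1: node 4 has two parents (2 and 3) in the DLG rooted in 1
cycle = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
value = {1: 1, 2: 10, 3: 100, 4: 1000}
print(sum_in_k_neighborhood(cycle, value, [1], 2))
```

Had node 4 reported to both parents, its value would have reached the root twice; restricting to one parent avoids this double counting.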

Remark 3.3 (Time optimality of Algorithms 1 DLGcomp and 2 DLGagr).

In a setting where edge weights correspond to the transmission time through the corresponding edge, any algorithm that aggregates information from all nodes to one root node needs at least time. Now consider the graph of Figure 3. There we have root nodes connected to a chain of nodes, and we assume this chain determines the diameter . Assume each of the chain nodes stores different values that are encoded using bits each. Assume each root node computes an aggregation function based on the ’th value of each chain node. Due to congestion, the time until all values of the chain node at distance to the root nodes arrive at the corresponding root nodes is . This can be extended to in the unweighted case with .

Figure 3: Graph based on the construction used in Remark 3.3.

3.1 Algorithm for Local Multi-Aggregation

First we describe Algorithm 1 DLGcomp, which computes $|S|$ DLGs with depth $k$, one for each node in $S$, and can distribute/broadcast information (depending on the aggregation function at hand) along these computed DLGs. In Algorithm 1 DLGcomp we extend the $S$-SP algorithm of [17], which computes shortest paths. (Note that alternatively one could also modify the -source detection algorithm by Lenzen and Peleg [22].) In Algorithm 2 DLGagr, we aggregate information efficiently through the computed DLGs. Algorithm 1 DLGcomp is designed as a subroutine and has two input parameters $|S|$ and $k$ as well as a function . Parameter $|S|$ is the number of DLGs rooted in $S$ which Algorithm 1 DLGcomp has to compute. Parameter $k$ denotes the depth of the $k$-neighborhood. Function specifies an algorithm executed along with the computation of the DLGs.

Further, a node knows that it is in $S$ (e.g. as it is sampled by some procedure or interested in performing an aggregation by itself). Let us first consider only the special case where we compute a single DLG rooted in node . We denote by -message any message sent during an execution of Algorithm 1 DLGcomp that belongs to the construction of the DLG rooted in node (later we keep this term for different roots ). Node starts by sending an -message to all its neighbors, and in the next time slot those neighbors send an -message to their neighbors, except to . A DLG continues to grow as follows: in time slot all nodes at distance receive a -message from their neighbors with distance . If the first -message is received from neighbor of node , is considered a parent of in . However, if in the same time slot receives further -messages from different neighbors , those are considered parents too. One time slot after receiving the first -message(s), node computes an -message and sends it to all neighbors which are not considered a parent of .

Now we say a -message received by is considered valid, if it is sent from a parent of in . Each -message contains the ID of the root node and the weighted distance from to the receiving node .

To simulate a weighted graph, the messages get delayed in Line 14 according to the corresponding edge weight , if a message is to be sent from node to . For unweighted graphs, the variable (Line 14) contains the lowest ID in which is not in .

While computing , the ID and some additional information is stored and sent along with the -messages. Each node stores time and parent from which it received a -message (Lines 25 and 26). This is needed to efficiently execute Algorithm 2 DLGagr later on. A node stores in schedule and in arrival times and parent nodes if (Lines 23 to 27 are only executed if , i.e. if , where was received in a -message). Nonetheless, -messages are still forwarded, even if the distance in a -message is larger than . This is needed to ensure that -messages are properly delayed.

Furthermore, each node executes algorithm (Line 27), which can be used to distribute information in . Algorithm is executed with parameters each time receives a valid -message from neighbor . Message is the part of the -message computed by algorithm on a parent node of in . After the execution for every parent of in with the corresponding , algorithm has computed and stored in the message which is sent to the children of in (Line 15). Furthermore, algorithm is executed once on every node for the definition and initialization of additional global variables on the node (Line 11).

With Algorithm 1 DLGcomp, multiple leveled graphs start growing in the same time slot, one for each . This could lead to congestion. To prevent that, every time two Algorithm 1 DLGcomp messages cross, the Algorithm 1 DLGcomp message with the higher ID is delayed by one time slot. (Footnote 3: That means both messages are received in the same time slot by the same node, or both are sent in the same time slot through the same edge , one from to and one in the opposite direction.) This is done in Line 32 by putting the higher ID into queue , or by retransmitting the message in the next time slot through the same edge again, respectively (due to the if-statement in Line 21).

Similarly to the proof for the S-SP algorithm [17], we show that the total delay of any Algorithm 1 DLGcomp is at most when many DLGs are constructed in parallel. We also show that despite the delays we still construct valid leveled graphs.
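The effect of letting several shortest-path computations share the same edges can be explored with a small synchronous simulation. The sketch below is not Algorithm 1 DLGcomp: it is a simplified pipelining scheme, loosely modeled on the source-detection idea mentioned above [22], in which every node announces at most one (distance, root ID) pair per round, smallest pair first, to all neighbors. Unweighted edges, the adjacency-dictionary input and all names are assumptions for illustration; no claim about the exact number of rounds is made.

```python
def pipelined_distances(adj, sources):
    """Simulate rounds in which each node announces one (distance, source) pair.

    Returns (known, rounds), where known[v][s] is the hop distance from
    source s to node v once no announcements are pending anymore.
    """
    known = {v: {} for v in adj}       # best distance per source seen so far
    pending = {v: set() for v in adj}  # pairs a node still has to announce
    for s in sources:
        known[s][s] = 0
        pending[s].add((0, s))
    rounds = 0
    while any(pending.values()):
        rounds += 1
        # each node picks its smallest pending pair, ordered by (distance, ID)
        outgoing = {}
        for v in adj:
            if pending[v]:
                pair = min(pending[v])
                pending[v].remove(pair)
                outgoing[v] = pair
        # deliver all announcements of this round simultaneously
        for v, (d, s) in outgoing.items():
            for w in adj[v]:
                old = known[w].get(s)
                if old is None or d + 1 < old:
                    if old is not None:
                        pending[w].discard((old, s))  # drop the outdated pair
                    known[w][s] = d + 1
                    pending[w].add((d + 1, s))
    return known, rounds

# Path graph 1 - 2 - 3 with two roots at the ends (hypothetical example input)
path = {1: [2], 2: [1, 3], 3: [2]}
known, rounds = pipelined_distances(path, [1, 3])
print(known, rounds)
```

Node 2 receives both announcements in the same round and has to forward them one after the other, which is the kind of one-slot delay per competing root that the analysis bounds.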

1:  ,
2:  ;
3:  ; // time when message of DLG reaches , for each parent enumerated in
4:  ; // sp[v] := number of parents in shortest paths from to
5:  ; // ,
6:  if  then
7:     ;
8:     ;
9:     ;
10:  ;
11:  initialize algorithm ;
12:  for  do
13:     for  do
14:         ; // smallest ID that is not delayed and ready to be scheduled
15:     within one time slot: send to neighbor , receive from ; send to neighbor , receive from ; send to neighbor , receive from ;
16:     ;
17:     ;
18:     if  then
19:         ;
20:     for  do
21:         if  then // ’s message will be delayed due to .
22:            if  then
23:               if  then
24:                   ;
25:                   ;
26:                    neighbor ;
27:                   execute algorithm ;
28:               if  then
29:                   ;
30:                   ;
31:                   if  then
32:                      ;
33:               else
34:                   ;
35:         else
36:            ; // ’s message was successfully sent to neighbor .
Algorithm 1 DLGcomp computes DLGs in the k-Neighborhood of the root nodes
(executed by node )
Input: , ,
parameters passed to : message of a parent of in DLG

In Algorithm 2 DLGagr, information gets aggregated according to Algorithm . To do this, the computed leveled graphs of each root node get processed in a bottom-up fashion. Algorithm 2 DLGagr has three inputs: the number of DLGs , the depth of the -neighborhood, which are needed to bound the runtime, and an algorithm , which can consist of an initialization part and a computation part. The initialization part is executed once on every node before starting the computation (loop of Algorithm 2 DLGagr, Line 2). In the loop in time slot a node sends the information computed by to its parent in . Here schedule is the time when received a valid -message from neighbor . The schedule was computed before in Algorithm 1 DLGcomp (Line 25).

The computation part of is executed on a node each time received a message from a child node in . When has received all messages from all children in and has executed once for each of those messages, has computed and stored in the information that is subsequently sent to all parents of in .

1:  initialize algorithm ; // and use same variables as Algorithm 1 DLGcomp
2:  for  do
3:     within one time slot:      foreach , such that do         send to ;      receive from neighbor ;      receive from neighbor ;            receive from neighbor ;
4:     for  do
5:        if  empty then
6:           execute algorithm ;
Algorithm 2 DLGagr computes aggregation function (executed by node )
Input: , ,
parameters passed to : message of a child of in DLG

Lemma 3.4.

Algorithm 1 DLGcomp computes valid directed leveled graphs, in time for .

Proof.

The proof is very similar to the proofs for the S-SP algorithm in [17]. For the correctness we state a slightly adapted version of Lemma 12 in [17].

Correctness: First, assume that and consider the computation of . At time t, each node at weighted distance from receives a -message from all of ’s neighbors that are at distance to . All edges incident to neighbors that sent such a message are added to DLG , directed from to .

Now consider the case where the set contains at least two nodes. We analyze how the computations of other nodes affect the computation of . We say that a -message has ‘reached’ a node through edge from a neighbor in time slot t, if the message was successfully received and has been removed from the delay queue in time slot , or it was successfully received in time slot and not put into the queue at all. It turns out that the first -messages of ’s computation which reach are transmitted through the edges at distance as in the case before. Consider two neighbors of ; we can ignore the case that has only one neighbor since this case trivially satisfies our claim. A -message containing weighted distance has reached through the edge earlier than through edge if and only if . We show this by proving that the set of messages with lower IDs which delay the -messages is the same for both paths and : Assume that the computation of is delaying the -message sent through at some point. Then, in case the -message is coming from ’s direction, the -message reaches node earlier than the -message. Even if an -message and a -message are transmitted in the same time slot to (through and , respectively), the -message delays the -message in the node by putting it into the queue , and the -message reaches earlier. Thus the -message also delays the -message running through path , if it has not already delayed it earlier.

Runtime: In case contains only one element, the computation of DLG , which is not delayed by other computations, takes at most time steps. This is because in time step all nodes at distance receive a -message; as a consequence, in time step every node with distance receives or has received a -message. Since there are no nodes with , the computation of stops after time steps.

We showed that a -message can be delayed at most one time slot by another -message with , and therefore by at most time slots by all other DLG-constructions. Thus Algorithm 1 DLGcomp runs in . ∎
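The single-root case of the correctness argument can be illustrated with a small centralized sketch. It assumes an unweighted, undirected graph given as an adjacency dictionary and ignores message delays entirely; the function name `build_dlg` and the parent-list representation are illustrative choices, not part of Algorithm 1 DLGcomp. Each node records every neighbor that is one hop closer to the root, which yields the directed leveled graph for that root.

```python
from collections import deque

def build_dlg(adj, root):
    """Return (dist, parents) for the shortest-path DAG rooted at `root`.

    adj: dict mapping node -> list of neighbours (undirected, unweighted;
    this is an assumption of the sketch, the paper also handles weights).
    parents[v] lists every neighbour of v that is one hop closer to root.
    """
    dist = {root: 0}
    parents = {root: []}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # w is reached for the first time
                dist[w] = dist[u] + 1
                parents[w] = []
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u lies on a shortest root-w path
                parents[w].append(u)
    return dist, parents

# A 4-cycle 0-1-3-2-0 with a pendant node 4 attached to node 3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
dist, parents = build_dlg(adj, 0)
```

In this example node 3 has two parents (nodes 1 and 2), since both lie on a shortest path to the root.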

Lemma 3.5.

Algorithm 2 DLGagr aggregates information through directed leveled graphs, in time for .

Proof.

Runtime: In Algorithm 2 DLGagr, a node sends messages only to parent nodes in all DLGs . The schedule for sending the messages is the one computed by Algorithm 1 DLGcomp and stored in , but reversed. Therefore the runtime of Algorithm 2 DLGagr is bounded by the runtime of Algorithm 1 DLGcomp, which is . Executing algorithms and inside a node causes no additional communication. Since we are only interested in communication complexity, the total runtime is .

Correctness: Since we use the schedule of Algorithm 1 DLGcomp in reverse order to send messages, it is guaranteed that a node receives all messages from all children in a DLG before sends the first message to any parent in . ∎
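The reversed-schedule idea can be sketched for a single root as follows. This is a simplified illustration under stated assumptions: the graph is unweighted, there is only one root (so no delays occur), and the aggregation function is `max`, which is idempotent and therefore unaffected by a value travelling along several shortest paths. The names `shortest_path_parents` and `aggregate_max` are hypothetical. Construction reaches the nodes at distance d in round d, so aggregation lets them send in the opposite order, deepest level first.

```python
from collections import deque

def shortest_path_parents(adj, root):
    """BFS distances and, per node, all neighbours one hop closer to root."""
    dist, parents = {root: 0}, {root: []}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                parents[w] = []
                queue.append(w)
            if dist[w] == dist[u] + 1:
                parents[w].append(u)
    return dist, parents

def aggregate_max(adj, root, value):
    """Aggregate the maximum value to `root` along its shortest-path DAG."""
    dist, parents = shortest_path_parents(adj, root)
    depth = max(dist.values())
    best = {v: value[v] for v in dist}   # every node starts with its own value
    log = []                             # (round, sender, receiver) triples
    # Reverse the construction order: the deepest level sends in round 1.
    for rnd, d in enumerate(range(depth, 0, -1), start=1):
        for v in [u for u in dist if dist[u] == d]:
            for p in parents[v]:
                best[p] = max(best[p], best[v])
                log.append((rnd, v, p))
    return best[root], log

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
value = {0: 5, 1: 1, 2: 7, 3: 2, 4: 9}
root_max, log = aggregate_max(adj, 0, value)
```

Because a level sends only after all deeper levels have sent, every node has heard from all of its children before it forwards anything to a parent.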

Lemma 3.6.

Algorithms 1 DLGcomp and 2 DLGagr aggregate information in time for .

Proof.

Runtime: In the k-Neighborhood restricted Algorithm 1 DLGcomp a -message which is not delayed by other IDs needs time slots to reach all nodes . We showed in Lemma 3.4 that a -message can be delayed at most time slots due to other IDs. Hence, to reach all nodes in its -neighborhood an -message needs at most time slots. For aggregating information Algorithm 2 DLGagr uses schedule , which was computed in the k-Neighborhood restricted Algorithm 1 DLGcomp and therefore is bounded by the same number of time slots .

Correctness: As long as a distance in a -message is less than or equal to , the message is processed the same way as in the restricted Algorithm 1 DLGcomp (). When is larger than , then -messages with ID no longer extend the DLG , but just delay all -messages with higher ID () the same way as in the restricted Algorithm 1 DLGcomp. As a consequence, after the execution of the k-Neighborhood restricted Algorithm 1 DLGcomp, all nodes have the same information of DLG stored as if the restricted Algorithm 1 DLGcomp had been executed. All nodes that still send -messages do not contribute to ’s aggregation but are necessary to ensure that the other trees’ computations are delayed and trees are constructed correctly.

By executing Algorithm 2 DLGagr a node sends aggregation messages back to a root node if and only if , as mentioned above. Therefore Algorithm 2 DLGagr just aggregates information of the neighborhood to a root node . ∎
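The effect of the neighborhood restriction can be sketched as follows, assuming an unweighted graph and idempotent aggregation functions such as `min` and `max`. The sketch handles the roots one after another and therefore does not model the scheduling and delays that the lemma is about; it only illustrates which nodes contribute to which root. The function name `k_neighborhood_aggregate` and the `queries` dictionary are hypothetical.

```python
from collections import deque

def k_neighborhood_aggregate(adj, value, queries, k):
    """queries maps each root to a binary idempotent function (e.g. min, max).

    Returns, for every root, the aggregate over all nodes within k hops.
    """
    results = {}
    for root, combine in queries.items():
        # Truncated BFS: nodes farther than k hops never join this DLG.
        dist, parents = {root: 0}, {root: []}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue                 # do not extend beyond distance k
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    parents[w] = []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    parents[w].append(u)
        # Aggregate from the deepest level towards the root.
        acc = {v: value[v] for v in dist}
        for v in sorted(dist, key=dist.get, reverse=True):
            for p in parents[v]:
                acc[p] = combine(acc[p], acc[v])
        results[root] = acc[root]
    return results

# Path 0-1-2-3-4; each root runs a different aggregation on its 1-neighborhood.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
value = {0: 3, 1: 8, 2: 1, 3: 9, 4: 4}
results = k_neighborhood_aggregate(adj, value, {0: max, 4: min, 2: max}, k=1)
```

Here the neighborhoods of roots 0 and 2 overlap in node 1, and each root still obtains its own aggregate.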

Proof of Theorem 1.1.

Follows by combining Lemmas 3.4, 3.5, and 3.6. ∎

3.2 Tree Variation

In Algorithm 2 DLGagr we can aggregate information along all shortest --paths between a root node and a node . For some applications it is desirable to aggregate information only along one shortest --path, as for example in max-value/average-value aggregation or in a -Shortest Paths computation used to approximate problems such as closeness centrality. That is, a node sends a message to only one parent in a DLG while executing Algorithm 2 DLGagr. The result is the same as when aggregating information through a tree rooted in (instead of a DLG ). For completeness we show an adaptation of Algorithms 1 DLGcomp and 2 DLGagr which computes and aggregates along trees instead of DLGs. We provide a detailed description of the Tree Variation in Appendix .4.
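A minimal sketch of why a single parent matters for non-idempotent aggregation such as an average: if a node forwarded its partial sum to every parent, values below it would be counted once per shortest path. The sketch below assumes an unweighted graph and keeps only the first parent found by a breadth-first search; the function name `tree_average` is hypothetical and the code is not the adaptation given in the appendix.

```python
from collections import deque

def tree_average(adj, root, value):
    """Average of all reachable node values, aggregated along a BFS tree."""
    dist, parent = {root: 0}, {root: None}
    order = []                           # nodes in BFS (top-down) order
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                parent[w] = u            # keep exactly one parent per node
                queue.append(w)
    total = {v: value[v] for v in dist}
    count = {v: 1 for v in dist}
    for v in reversed(order):            # children are handled before parents
        p = parent[v]
        if p is not None:
            total[p] += total[v]
            count[p] += count[v]
    return total[root] / count[root]

# Node 3 has two shortest paths to root 0 but reports to one parent only.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
value = {0: 5, 1: 1, 2: 7, 3: 2, 4: 10}
avg = tree_average(adj, 0, value)
```

The pair (sum, count) is forwarded instead of the average itself, so that partial results can be combined exactly.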

4 Applications

4.1 Betweenness Centrality

Brandes [7] gave a recursive algorithm for computing betweenness centrality. Let be the ratio of the number of shortest paths between and that contain node to the number of all shortest paths between and . We denote by the dependency of a node on another node , defined by . As Brandes [7] showed, the dependency can be calculated recursively:

Theorem 4.1.

(Recursive Betweenness Centrality Dependency, Brandes [7] Theorem 6)
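For orientation, the recursion is commonly written as below. The symbols are assumed from the usual presentation of Brandes' result and may differ from the notation of this paper: \sigma_{st} is the number of shortest paths from s to t, \sigma_{st}(v) the number of those that contain v, and P_s(w) the set of predecessors of w on shortest paths from s.

```latex
\delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}, \qquad
\delta_{s\bullet}(v) = \sum_{t \in V} \delta_{st}(v), \qquad
\delta_{s\bullet}(v) = \sum_{w \,:\, v \in P_s(w)}
    \frac{\sigma_{sv}}{\sigma_{sw}} \bigl( 1 + \delta_{s\bullet}(w) \bigr).
```

The last identity is what allows the dependencies to be accumulated from the deepest level of a shortest-path DAG towards its root.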

Bader et al. [3] presented an adaptive sampling algorithm that, when combined with Brandes’ recursive algorithm, approximates the betweenness centrality of a node with high centrality using a significantly reduced number of single-source shortest path computations.

Theorem 4.2.

(Bader et al. [3] Theorem 3). For , if the centrality of a vertex is for some constant , then with probability its betweenness centrality can be estimated to within a factor of with samples of source vertices.
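A centralized sketch of the two ingredients may help: the single-source dependency accumulation of Brandes, and an adaptive sampling loop in the spirit of Bader et al. The sketch assumes an unweighted, connected graph. The stopping rule (sample until the running sum reaches c times n) and the estimate n·S/k follow the usual description of the adaptive sampling approach; the parameter `max_samples` and the function names are hypothetical additions for this illustration.

```python
import random
from collections import deque

def dependencies(adj, s):
    """Dependencies of source s on every node (unweighted graph)."""
    dist = {s: 0}
    sigma = {v: 0 for v in adj}          # number of shortest s-v paths
    sigma[s] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):            # deepest nodes first
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    delta[s] = 0.0                       # the source is an endpoint, not interior
    return delta

def estimate_bc(adj, v, c=2.0, max_samples=None, rng=random):
    """Adaptive-sampling estimate of the betweenness centrality of v."""
    n = len(adj)
    nodes = list(adj)
    max_samples = max_samples or n       # hypothetical cut-off for the sketch
    total, k = 0.0, 0
    while total < c * n and k < max_samples:
        s = rng.choice(nodes)            # sample a source uniformly at random
        total += dependencies(adj, s)[v]
        k += 1
    return n * total / k

# Star with centre 0 and five leaves: every leaf-to-leaf path passes through 0.
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
exact = sum(dependencies(adj, s)[0] for s in adj)
estimate = estimate_bc(adj, 0, rng=random.Random(0))
```

Summing the dependencies over all sources gives the exact value (counting ordered pairs), while the sampling loop stops early for a node whose dependencies accumulate quickly.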

This algorithm can be adapted efficiently to a distributed setting by using Algorithms 1 DLGcomp and 2 DLGagr. We define Algorithm 3 BCsetup to perform multiple rounds as suggested by Bader et al. [3]. In each round, Algorithms 1 DLGcomp and 2 DLGagr are executed with algorithms and as defined in Algorithms 4 and 5. With and we state the Algorithms and of our framework specified for betweenness centrality approximation () being the function . In contrast to Bader et al., our algorithm samples the betweenness centrality dependency of not just one node in each round, but of multiple nodes. Furthermore, the algorithm of Bader et al. concentrates on one node and stops sampling once can be approximated within an error with probability . Our Algorithm 3 BCsetup considers all nodes in the graph and stops after at least nodes have been approximated within a similar multiplicative error with probability (where ).

4.1.1 Algorithm for Betweenness Centrality

In Algorithm 3 BCsetup, the idea is to select multiple sample nodes over multiple rounds and to calculate the betweenness centrality dependency on all other nodes for each . The algorithm stops once more than nodes in with high betweenness centrality are found, where is an input parameter. Set is the set of nodes sampled in round ; all sets are disjoint and together form the set .

During the algorithm, each node stores the sum of all the dependencies of the nodes sampled so far in variable . Furthermore, the number of sample nodes that were sampled so far (in any of rounds ) needs to be stored to indicate whether the node itself is sampled ().

All centralized communication to and from node 1 uses tree . In round (Lines 13 to 24) node 1 calculates the probability (according to the proof of Theorem 1.3)

with which each node gets sampled. The value of is broadcast in the network and each node decides whether it is itself a sample node and, if so, reports back to node 1. In Line 23, if a node in receives multiple messages in one time slot, it sends the sum of the message values to its parent in . The number of sample nodes gathered by node 1 is then broadcast again; this is needed to determine the runtime in Algorithms 1 DLGcomp and 2 DLGagr and to maintain the number of sample nodes .

1:  global ; // sum of dependency scores ,
2:  global ; // number of samples so far
3:  global ;
4:  ; // approximated betweenness centrality of , initialized as undefined
5:  if  then
6:     estimate and by generating a spanning tree and using Fact 2.7;
7:     broadcast and on ;
8:     ; // number of high BC nodes found so far
9:  else
10:     wait until value of and received;
11:  for  do
12:     ;
13:     if  then
14:        broadcast on