Efficient Information Flow Maximization in Probabilistic Graphs


Christian Frey, Andreas Züfle, Tobias Emrich, and Matthias Renz

C. Frey and T. Emrich are with the Department of Database Systems and Data Mining, Ludwig-Maximilians-Universität, Munich, Germany. E-mail: {frey, emrich}@dbs.ifi.lmu.de
A. Züfle is with the Department of Geography and Geoinformation Science, George Mason University, VA, USA. E-mail: azufle@gmu.edu
M. Renz is with the Department of Computational and Data Sciences, George Mason University, VA, USA. E-mail: mrenz@gmu.edu
Abstract

Reliable propagation of information through large networks, such as communication, social, or sensor networks, is important in many applications, including marketing and wireless sensor networks. However, social ties of friendship may be obsolete, and communication links may fail, inducing the notion of uncertainty in such networks. In this paper, we address the problem of optimizing information propagation in uncertain networks given a constrained budget of edges. We show that this problem requires solving two NP-hard subproblems: the computation of the expected information flow, and the optimal choice of edges. To compute the expected information flow to a source vertex, we propose the F-tree as a specialized data structure that identifies independent components of the graph for which the information flow can either be computed analytically and efficiently, or for which traditional Monte-Carlo sampling can be applied independently of the remaining network. For the problem of finding the optimal edges, we propose a series of heuristics that exploit properties of this data structure. Our evaluation shows that these heuristics lead to high quality solutions, thus yielding high information flow, while maintaining low running time.

uncertain graphs, network analysis, social network, optimization, information flow

To refer to or cite this work, please use the citation of the published version:

http://ieeexplore.ieee.org/document/8166795/

C. Frey, A. Züfle, T. Emrich, M. Renz. Efficient Information Flow Maximization in Probabilistic Graphs. In IEEE Transactions on Knowledge and Data Engineering, 2017. doi:10.1109/TKDE.2017.2780123.

1 Introduction

Nowadays, social and communication networks have become ubiquitous in our daily life as means to receive and share information. Whenever we navigate the World Wide Web, update our social network profiles, or send a text message on our cell phone, we participate in an information network as a node. In such settings, network nodes exchange some sort of information: in social networks, users share their opinions and ideas, aiming to convince others. In wireless sensor networks, nodes collect data and aim to ensure that this data is propagated through the network, either to a destination, such as a server node, or simply to as many other nodes as possible. Abstractly speaking, in all of these networks, nodes aim at propagating their information, or their belief, throughout the network.

The event of a successful propagation of information between nodes is subject to inherent uncertainty. In a wireless sensor, telecommunication or electrical network, a link can be unreliable and may fail with a certain probability [10, 31]. In a social network, trust and influence issues may impact the likelihood of social interactions or the likelihood of convincing another person of an individual's idea [11, 18, 1]. For example, consider professional social networks like LinkedIn. Such networks allow users to endorse each others' skills and abilities. Here, the probability of an edge may reflect the likelihood that one user is willing to endorse another user. The probabilistic graph model is commonly used to address such scenarios in a unified way (e.g. [24, 27, 28, 40, 39, 20, 41]). In this model, each edge is associated with an existential probability quantifying the likelihood that this edge exists in the graph.

Traditionally, to maximize the likelihood of a successful communication between two nodes, information is propagated by flooding it through the network: every node that receives a piece of information will proceed to share this information with all its neighbors. Clearly, such a flooding approach is not applicable for large communication and social networks, as the communication between two network nodes incurs a cost. Sensor network nodes have limited computing capability, memory resources and power supply, but require battery power to send and receive messages, and are also limited by their bandwidth; individuals of a social network require time and sometimes even additional monetary resources to convince others of their ideas. For instance, a professional networking service may provide, for a fee, a service to directly ask a limited number of users to endorse another user. The challenge is to maximize the expected number of endorsements that this user will receive, while limiting the number of users asked by the service provider. The first candidates to ask are the user's direct connections. In addition, if someone has already endorsed the user, then that person's connections can be asked whether they trust this judgment and want to make the same endorsement.

(a) Original graph
(b) Dijkstra MST
(c) Optimal five-edge flow
(d) Possible world
Fig. 1: Running example.

In this work, we address the following problem: given a probabilistic network graph whose edges can either be activated, i.e., enabled to transfer information, or stay inactive, the problem is to send information from a single node to as many other nodes as possible (or, symmetrically, to receive information from as many nodes as possible), assuming a limited budget of edges that can be activated. To solve this problem, the main focus is on the selection of edges to be activated.

Example 1.

To illustrate our problem setting, consider the network depicted in Figure 1(a). The task is to maximize the information flow from all other nodes to the query node, given a limited budget of edges. This example assumes equal weights for all nodes. Each edge of the network is labeled with the probability of a successful communication. A naive solution is to activate all edges. Assuming each node to have one unit of information, the expected information flow of this solution can be computed following the semantics of Section 3. While maximizing the information flow, this solution incurs the maximum possible communication cost. A traditional trade-off between these single-objective solutions is a probability-maximizing spanning tree obtained by Dijkstra's algorithm, as depicted in Figure 1(b), which requires six edges to be activated. Yet, it can be shown that the solution depicted in Figure 1(c) dominates this solution: only five edges are used, thus further reducing the communication cost, while achieving a higher expected information flow.

The aim of this work is to efficiently find a near-optimal sub-network which maximizes the expected flow of information at a constrained budget of edges. In Example 1, we computed the information flow for an example graph. In fact, this computation has been shown to be exponentially hard in the number of edges of the graph, and thus impractical to solve analytically. Furthermore, the optimal selection of edges to maximize the information flow is shown to be NP-hard. These two subproblems define the main computational challenges addressed in this work.

To tackle these challenges, the remainder of this work is organized as follows. After a survey of related work in Section 2, we recapitulate common definitions for stochastic networks and formally define our problem setting in Section 3. After a more detailed technical overview in Section 4, the theoretical heart of this work is presented in Section 5. We show how to identify independent subgraphs, for which the information flow can be computed independently. This allows us to divide the main problem into much smaller subproblems. To conquer these subproblems, we identify cases for which the expected information flow can be computed analytically, and we propose to employ Monte-Carlo sampling to approximate the information flow of the remaining cases. Section 5.3 is the algorithmic core of our work, showing how the aforementioned independent components can be organized hierarchically in a F-tree, which is inspired by the block-cut tree [35, 14, 37]. This structure allows us to aggregate results of individual components efficiently, and we show how previous Monte-Carlo sampling results can be re-used as more edges are selected and activated. Our experimental evaluation in Section 7 shows that our algorithms significantly outperform traditional solutions, in terms of combined communication cost and information flow, on synthetic and real stochastic networks. In summary, the main contributions of this work are:

  • Theoretical complexity study of the flow maximization problem in probabilistic graphs.

  • Efficient estimation of the expected information flow based on network graph decomposition and Monte-Carlo sampling.

  • Our F-tree structure enabling efficient organization of independent graph components and (local) intermediate results for efficient expected flow computation.

  • An algorithm for iterative selection of edges to be activated to maximize the expected information flow.

  • Thorough experimental evaluation of proposed algorithms.

2 Related Work

Reliability and influence computation in probabilistic graphs (a.k.a. uncertain graphs) has recently attracted much attention in the data mining and database research communities. We summarize state-of-the-art publications and relate our work to them.

Subgraph Reliability. A related and fundamental problem in uncertain graph mining is the so-called subgraph reliability problem, which asks to estimate the probability that two given (sets of) nodes are reachable. This problem, well studied in the context of communication networks, has seen a recent revival in the database community due to the need for scalable solutions for big networks. Specific problem formulations in this class ask to measure the probability that two specific nodes are connected (two-terminal reliability [2]), all nodes in the network are pairwise connected (all-terminal reliability [33]), or all nodes in a given subset are pairwise connected (k-terminal reliability [13, 12]). Extending these reliability queries, where source and sink node(s) are specified, the corresponding graph mining problem is to find, for a given probabilistic graph, the set of most reliable k-terminal subgraphs [16]. All these problem definitions have in common that the set of nodes to be reached is predefined, and that there is no degree of freedom in the number of activated edges: all nodes are assumed to attempt to communicate with all their neighbors, which we argue can be overly expensive in many applications.

Reliability Bounds. Several lower bounds on (two-terminal) reliability have been defined in the context of communication networks [3, 4, 9, 29]. Such bounds could be used in the place of our sampling approach, to estimate the information gain obtained by adding a network edge to the current active set. However, the computational complexity to obtain these bounds is at least quadratic in the number of network nodes, making them infeasible for large networks. Very simple but efficient bounds have been presented in [19], such as using the most probable path between two nodes as a lower bound of their two-terminal reliability. However, the number of possible (non-circular) paths is exponentially large in the number of edges of a graph, such that in practice, even the most probable path will have a negligible probability, thus yielding a useless lower bound. Since none of these probability bounds are sufficiently effective and efficient for practical use, we decided to directly use a sampling approach for parts of the graph where no exact inference is possible.

Influential Nodes. Existing work motivated by applications in marketing provides methods to detect influential members within a social network, which can help to promote a new product. The task is to detect nodes, i.e., persons, for which the chance that the product is recommended to a broad range of connected people is maximized. In [6], [30] a framework is provided which considers the interactions between persons in a probabilistic model. As the problem of finding the most influential vertices is NP-hard, approximation algorithms are used in [18], outperforming basic heuristics based on degree centrality and distance centrality which are traditionally applied in social networks. This branch of research has in common that the task is to activate a constrained number of nodes to maximize the information flow, whereas our problem definition constrains the number of activated edges for a single specified query/sink node.

Reliable Paths. In mobile ad-hoc networks, the uncertainty of an edge can be interpreted as the connectivity between two nodes. Thus, an important problem in this field is to maximize the probability that two nodes are connected for a constrained budget of edges [10]. The main difference to our work is that [10] maximizes the information flow to a single destination, rather than the information flow in general. The heuristics of [10] cannot be applied directly to our problem since, clearly, maximizing the flow to one node may detriment the flow to another node.

Bi-connected components. The F-tree that we propose in this work is inspired by the block-cut tree [35, 14, 37]. The main difference is that our approach aims at finding cyclic subgraphs, where nodes are bi-connected. For subgraphs having a size of at least three vertices, this problem is equivalent to finding bi-connected subgraphs, which is solved in [35, 14, 37]. Thus, our proposed data structure treats bi-connected subgraphs of size less than three separately, grouping them together as mono-connected components. More importantly, this existing work does not show how to compute, estimate and propagate probabilistic information through the structure, which is the main contribution of this work.

3 Problem Definition

A probabilistic undirected graph is given by $G = (V, E, w, p)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $w: V \to \mathbb{R}^+$ is a function that maps each vertex to a positive value representing the information weight of the corresponding vertex, and $p: E \to (0, 1]$ is a function that maps each edge to its corresponding probability of existing in $G$. In the following, it is assumed that the existences of different edges are independent from one another. Let us note that our approach also applies to other models, such as the conditional probability model [28], as long as a computational method for an unbiased drawing of samples of the probabilistic graph is available.

In a probabilistic graph $G$, the existence of each edge is a random variable. Thus, the topology of $G$ is a random variable, too. The sample space of this random variable is the set of all possible graphs. A possible graph $g = (V, E_g)$ of a probabilistic graph $G$ is a deterministic graph which is a possible outcome of the random variables representing the edges of $G$. The graph $g$ contains a subset of edges of $G$, i.e. $E_g \subseteq E$. The total number of possible graphs is exponential in the number of edges $e$ having $0 < p(e) < 1$, because for each such edge, we have two cases as to whether or not that edge is present in the graph. We let $\mathcal{W}$ denote the set of all possible graphs. The probability of sampling the graph $g$ from the random variables representing the probabilistic graph $G$ is given by the following sampling or realization probability $Pr(g)$:

$$Pr(g) = \prod_{e \in E_g} p(e) \cdot \prod_{e \in E \setminus E_g} (1 - p(e)) \qquad (1)$$

Figure 1(a) shows an example of a probabilistic graph and one of its possible realizations in Figure 1(d). This probabilistic graph has $2^{|E|}$ possible worlds. Using Equation 1, the probability of the world in Figure 1(d) is the product of the probabilities of all its present edges, multiplied by the complement probabilities of all absent edges.
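For illustration, possible-world sampling under Equation 1 translates into a few lines of Python; this is a sketch under the assumption that edge probabilities are stored in an edge attribute 'p' (our naming), not the authors' code:

```python
# Draw one possible world g from G by keeping each edge independently with
# its probability p(e), following Equation 1.
import random

import networkx as nx

def sample_possible_world(G):
    g = nx.Graph()
    g.add_nodes_from(G.nodes(data=True))       # keep all vertices and weights
    for u, v, data in G.edges(data=True):
        if random.random() < data['p']:        # edge survives with prob. p(e)
            g.add_edge(u, v)
    return g
```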

Definition 1 (Path).

Let $G = (V, E, w, p)$ be a probabilistic graph and let $a, b \in V$ be two nodes such that $a \neq b$. An (acyclic) path $path(a, b) = (v_1, \ldots, v_n)$ is a sequence of vertices, such that $v_1 = a$, $v_n = b$, $(v_i, v_{i+1}) \in E$ for $1 \le i < n$, and $v_i \neq v_j$ for $i \neq j$.

Definition 2 (Reachability).

The network reachability problem as defined in [15, 5] computes the likelihood $Pr(a \leadsto b)$ of the binomial random variable of two nodes $a, b \in V$ being connected in $G$, formally:

$$Pr(a \leadsto b) = \sum_{g \in \mathcal{W}} reach_g(a, b) \cdot Pr(g)$$

where $reach_g(a, b)$ is an indicator function that returns one if there exists a path between nodes $a$ and $b$ in the (deterministic) possible graph $g$, and zero otherwise. For a given query node $q \in V$, our aim is to optimize the information gain, which is defined as the total weight of nodes reachable from $q$.

Definition 3 (Expected Information Flow).

Let $q \in V$ be a node and let $G = (V, E, w, p)$ be a probabilistic graph; then $flow(q, G)$ denotes the random variable of the sum of vertex weights of all nodes in $V$ reachable from $q$, formally:

$$flow(q, G) = \sum_{v \in V} reach(q, v) \cdot w(v),$$

where $reach(q, v)$ is the binomial random variable of Definition 2.

Due to linearity of expectations, and exploiting that $w(v)$ is deterministic, we can compute the expectation of this random variable as

$$E[flow(q, G)] = \sum_{v \in V} w(v) \cdot Pr(q \leadsto v) \qquad (2)$$

Given our definition of Expected Information Flow in Equation 2, we can now state our formal problem definition of optimizing the expected information flow of a probabilistic graph for a constrained budget of edges.

Definition 4 (Maximum Expected Information Flow).

Let $G = (V, E, w, p)$ be a probabilistic graph, let $q \in V$ be a query node and let $k$ be a non-negative integer. The Maximum Expected Information Flow

$$maxFlow(q, G, k) = \underset{G' = (V, E' \subseteq E, w, p),\ |E'| \le k}{\operatorname{argmax}}\ E[flow(q, G')] \qquad (3)$$

is the subgraph of $G$ maximizing the expected information flow towards $q$, constrained to having at most $k$ edges.

Computing $maxFlow(q, G, k)$ efficiently requires overcoming two NP-hard subproblems. First, the computation of the expected information flow to vertex $q$ for a given probabilistic graph is NP-hard, as shown in [5]. In addition, the problem of selecting the optimal set of edges to maximize the information flow is an NP-hard problem in itself, as shown in the following.

Theorem 1.

Even if the expected information flow $E[flow(q, G)]$ to a vertex $q$ can be computed in $O(1)$ for any probabilistic graph $G$, the problem of finding $maxFlow(q, G, k)$ is still NP-hard.

Proof.

In this proof, we will show that a special case of computing $maxFlow(q, G, k)$ is NP-complete, thus implying that our general problem is NP-hard. We reduce the 0-1 knapsack problem to the problem of computing $maxFlow(q, G, k)$. Thus, assume a 0-1 knapsack problem: given a capacity integer $C$ and a set of $n$ items, each having an integer weight $c_i$ and an integer value $v_i$, the 0-1 knapsack problem is to find the optimal vector $(x_1, \ldots, x_n) \in \{0, 1\}^n$ maximizing $\sum_{i=1}^{n} x_i \cdot v_i$, subject to $\sum_{i=1}^{n} x_i \cdot c_i \le C$. This problem is known to be NP-complete [17]. We reduce this problem to the problem of computing $maxFlow(q, G, k)$ as follows. Let $G$ be a probabilistic graph such that $q$ is connected to $n$ nodes (one node for each item of the knapsack problem). Each such node $i$ is connected to a chain of nodes, such that the branch of item $i$ contains $c_i$ edges in total. All edges have a probability of one. The information weight of a node is set to $v_i$ if it is the (only) leaf node of the branch of $G$ corresponding to item $i$, and zero otherwise. Finally, set $k = C$. Then, the solution of the 0-1 knapsack problem can be derived from $maxFlow(q, G, k)$ by selecting all items $i$ such that the corresponding leaf node is connected to $q$. Thus, if we can solve the $maxFlow(q, G, k)$ problem in polynomial time, then we can solve the 0-1 knapsack problem in polynomial time: a contradiction, assuming $P \neq NP$. ∎

Fig. 2: Example of the Knapsack Reduction of Theorem 1

4 Roadmap

To compute $maxFlow(q, G, k)$, we first need an efficient solution to approximate the reachability probability of a single node $v$ to/from $q$. Since this problem can be shown to be #P-hard, Section 5.3 presents an approximation technique which exploits stochastic independencies between branches of a spanning tree of the subgraph rooted at $q$. This technique allows us to aggregate independent subgraphs of $G$ efficiently, while exploiting a sampling solution for components of the graph that contain cycles.

Once we can efficiently approximate the flow between $q$ and each node $v \in V$, we next tackle the problem of efficiently finding a subgraph that yields a near-optimal expected information flow given a budget of $k$ edges in Section 6. Due to the theoretical result of Theorem 1, we propose heuristics to choose edges from $E$. Finally, our experiments in Section 7 support our theoretical intuition that our solutions for the two aforementioned subproblems synergize: an optimal subgraph will choose a budget of edges in a tree-like fashion, to reach large parts of the probabilistic graph. At the same time, our solutions exploit tree-like subgraphs for efficient probability computation.

5 Expected Flow Estimation

In this section we estimate the expected information flow of a given subgraph of $G$. Following Equation 2, the reachability probabilities between $q$ and each node $v \in V$ can be used to compute the total expected information flow $E[flow(q, G)]$. The problem of computing the reachability probability between two nodes has been shown to be #P-hard [10, 5], and sampling solutions have been proposed to approximate it [22, 7]. In this section, we present our solution to identify subgraphs of $G$ for which we can compute the information flow analytically and efficiently, such that expensive numeric sampling only has to be applied to small subgraphs. We first introduce the concept of Monte-Carlo sampling of a subgraph.

5.1 Traditional Monte-Carlo Sampling

Lemma 1.

Let $G$ be an uncertain graph and let $S$ be a set of sample worlds drawn randomly and unbiased from the set $\mathcal{W}$ of possible graphs of $G$. Then the average information flow over the samples in $S$,

$$\widehat{flow}(q, G, S) = \frac{1}{|S|} \sum_{g \in S} \sum_{v \in V} reach_g(q, v) \cdot w(v) \qquad (4)$$

is an unbiased estimator of the expected information flow $E[flow(q, G)]$, where $reach_g(q, v)$ is an indicator function that returns one if there exists a path between nodes $q$ and $v$ in the (deterministic) sample graph $g$, and zero otherwise.

Proof.

For $\widehat{flow}(q, G, S)$ to be an unbiased estimator of $flow(q, G)$, we have to show that $E[\widehat{flow}(q, G, S)] = E[flow(q, G)]$. Substituting Equation 4 yields $E[\frac{1}{|S|} \sum_{g \in S} \sum_{v \in V} reach_g(q, v) \cdot w(v)]$. Due to linearity of expectations, this is equal to $\frac{1}{|S|} \sum_{g \in S} E[\sum_{v \in V} reach_g(q, v) \cdot w(v)]$. The sum over $|S|$ identical values can be replaced by a factor of $|S|$. Reducing this factor yields $E[\sum_{v \in V} reach_g(q, v) \cdot w(v)]$ for a single unbiased sample $g$. Following the assumption of unbiased sampling from the set of possible worlds, the expected information flow of a sample possible world is equal to the expected information flow $E[flow(q, G)]$. ∎

Naive sampling of the whole graph has two disadvantages: first, this approach requires computing reachability queries on a set of possibly large sampled graphs. Second, a rather large approximation error is incurred. We approach these drawbacks by first describing how non-cyclic subgraphs, i.e. trees, can be processed in order to compute the information flow exactly and efficiently without sampling. For cyclic subgraphs, we show how sampled information flows can be used to compute the information flow in the full graph.
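As a baseline, Equation 4 translates directly into code; this sketch reuses sample_possible_world from the Section 3 sketch above and assumes node weights in an attribute 'w' and the NetworkX 1.x node API named in Section 7:

```python
# Naive Monte-Carlo estimator of Equation 4: average, over sampled worlds,
# of the total weight of all vertices reachable from q.
import networkx as nx

def mc_expected_flow(G, q, n_samples=1000):
    total = 0.0
    for _ in range(n_samples):
        g = sample_possible_world(G)                    # see Section 3 sketch
        reachable = nx.node_connected_component(g, q)   # vertices connected to q
        total += sum(G.node[v]['w'] for v in reachable)
    return total / n_samples
```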

5.2 Mono-Connected vs. Bi-Connected graphs

The main observation that will be exploited in our work is the following: if there exists only one possible path between two vertices, then we can compute their reachability probability efficiently.

Definition 5 (Mono-Connected Nodes).

Let $G$ be a probabilistic graph and let $a, b \in V$. If $path(a, b)$ is the only path between $a$ and $b$, i.e., there exists no other path between $a$ and $b$ that satisfies Definition 1, then we denote $a$ and $b$ as mono-connected.

In the following, when the query vertex $q$ is clear from the context, we call a vertex mono-connected if it is mono-connected to the query vertex $q$.

Lemma 2.

If two vertices $a$ and $b$ are mono-connected in a probabilistic graph $G$ via path $path(a, b)$, then the reachability probability between $a$ and $b$ is equal to the product of the edge probabilities included in $path(a, b)$, i.e.,

$$Pr(a \leadsto b) = \prod_{(v_i, v_{i+1}) \in path(a, b)} p(v_i, v_{i+1})$$

Proof.

Following possible world semantics as defined in Definition 2, the reachability probability $Pr(a \leadsto b)$ is the sum of probabilities of all possible worlds where $a$ is connected to $b$. We show that $a$ and $b$ are connected in a possible graph $g$ iff all edges of $path(a, b)$ exist in $g$.
$\Rightarrow$: By contradiction: let $a$ and $b$ be connected in $g$, and let any edge on $path(a, b)$ be missing. Then there must exist a path $path'(a, b) \neq path(a, b)$, which contradicts the assumption that $a$ and $b$ are mono-connected.
$\Leftarrow$: If all edges on $path(a, b)$ exist, then $a$ is connected to $b$, following the assumption that $path(a, b)$ is a path from $a$ to $b$.

Due to our assumption of independent edges, the probability that all edges in $path(a, b)$ exist is given by $\prod_{(v_i, v_{i+1}) \in path(a, b)} p(v_i, v_{i+1})$. ∎

Definition 6 (Mono-Connected Graph).

A probabilistic graph $G$ is called mono-connected iff all pairs of vertices in $V$ are mono-connected.

Next, we generalize Lemma 2 to whole subgraphs, such that a specified vertex in that subgraph has a unique path to all other vertices in the subgraph. Using Lemma 2, we obtain the following theorem, which will be exploited in the remainder of this work.

Theorem 2.

Let $G$ be a probabilistic graph and let $q \in V$ be a node. If $G$ is mono-connected, then $E[flow(q, G)]$ can be computed efficiently.

Proof.

$E[flow(q, G)]$ is the weighted sum of reachability probabilities of all nodes, according to Equation 2. If $G$ is connected and non-cyclic, we can guarantee that each node $v \in V$ has exactly one path to $q$ and is thus mono-connected to $q$. Thus, Lemma 2 is applicable to compute the reachability probability between $q$ and each node $v \in V$. Due to linearity of expectations, i.e., $E[X + Y] = E[X] + E[Y]$ for random variables $X$ and $Y$, we can aggregate individual reachability expectations, yielding $E[flow(q, G)] = \sum_{v \in V} w(v) \cdot \prod_{(v_i, v_{i+1}) \in path(q, v)} p(v_i, v_{i+1})$. ∎
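For a mono-connected (tree-shaped) selection, Theorem 2 yields a single-traversal computation; a minimal sketch, again assuming attributes 'w' and 'p':

```python
# Analytic expected flow for a mono-connected (tree-shaped) graph T rooted
# at q: reachability probabilities multiply along the unique path (Lemma 2)
# and weighted contributions add up (Theorem 2).
def tree_expected_flow(T, q):
    expected = T.node[q]['w']             # q reaches itself with probability 1
    stack, visited = [(q, 1.0)], {q}
    while stack:
        u, prob_u = stack.pop()
        for v in T.neighbors(u):
            if v not in visited:
                visited.add(v)
                prob_v = prob_u * T[u][v]['p']     # extend the unique path
                expected += T.node[v]['w'] * prob_v
                stack.append((v, prob_v))
    return expected
```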

Analogously to Definition 5, we define bi-connected nodes.

Definition 7 (Bi-Connected Nodes).

Let $G$ be a probabilistic graph and let $a, b \in V$. If there exist (at least) two paths $path(a, b)$ and $path'(a, b)$ such that $path(a, b) \neq path'(a, b)$, then we denote $a$ and $b$ as bi-connected.

Definition 8 (Bi-Connected Graph).

A bi-connected graph [35, 14] is a connected probabilistic graph such that the removal of any one vertex still yields a connected probabilistic graph.

Lemma 3.

In a bi-connected graph of size $|V| \ge 3$, all pairs of vertices are bi-connected following Definition 7.

Proof.

By contradiction, let $a, b$ be two nodes in $G$ that are mono-connected, and let $path(a, b)$ be the only path between them.
Case 1: $path(a, b)$ contains no other vertices: Since $G$ is bi-connected and has at least three vertices, there is a third vertex $c$. Removal of vertex $b$ yields a graph where $a$ and $c$ are connected by some path, and removal of vertex $a$ yields a graph where $b$ and $c$ are connected by some path. Thus, the concatenation of these two paths yields an alternative path between $a$ and $b$, contradicting the assumption that $a, b$ are mono-connected by $path(a, b)$.
Case 2: $path(a, b)$ contains other vertices. Let $c$ be such a vertex. Since $G$ is bi-connected, removal of vertex $c$ yields a graph where $a$ and $b$ are still connected, contradicting the assumption that $a$ and $b$ are mono-connected by $path(a, b)$ only. ∎

The information flow within a bi-connected graph cannot be computed efficiently using Theorem 2, as the flow between any two of its nodes is shared by more than one path. In the next section, we propose techniques to substitute bi-connected subgraphs by super-nodes, for which we can estimate the information flow using Monte-Carlo sampling exploiting Lemma 1. By substituting the bi-connected subgraphs by super-nodes, for which we apply sampling and memoize the sampling information, we obtain a mono-connected graph over the substituted super-nodes. This approach maximizes the partitions of the graph for which expensive Monte-Carlo estimation can be replaced by Theorem 2.

The next section will show how to achieve this goal, by employing a F-tree of the graph. This data structure borrowed from graph theory partitions the graph into bi-connected components (a.k.a. “blocks”) generated by bi-connected subgraphs, and identifies vertices of the graph as articulation vertices to connect two bi-connected components. We exploit these articulation vertices, by having them represent all the information flow that is estimated to flow to them from their corresponding bi-connected component.

5.3 Flow tree

(a) Example Graph
(b) F-tree representation
Fig. 3: Running example graph with corresponding F-tree

In this section, we propose to adapt the block-cut tree [35, 14, 37] to partition a graph into independent bi-connected components. Instead of sampling the whole uncertain graph, the purpose of this index structure is to exploit Theorem 2 for mono-connected components, and to apply local Monte-Carlo sampling within bi-connected components only. Our Flow tree (F-tree) memoizes the information flow at each node. Before we show how to utilize the F-tree for efficient information flow computation, we first give a formal definition.

Definition 9 (Flow tree).

Let $G$ be a probabilistic graph and let $q \in V$ be the vertex for which the expected information flow is computed. A Flow tree (F-tree) is a tree structure defined as follows.

1) each component of the F-tree is a connected subgraph of $G$. A component can be mono-connected or bi-connected.
2) a mono-connected component $C = (art_C, V_C)$ consists of a set of vertices $V_C$ that form a mono-connected subgraph (cf. Definition 6) in $G$. The vertex $art_C$ is called articulation vertex. Intuitively, a mono-connected component represents a tree-like structure rooted in $art_C$. Using Theorem 2, we can efficiently compute the information flow from all vertices in $V_C$ to $art_C$.
3) a bi-connected component $C = (art_C, V_C, \hat{P}_C)$ consists of a set of vertices $V_C$ of size greater than two that form a bi-connected subgraph in $G$ according to Definition 8. Intuitively, a bi-connected component represents a subgraph containing a cycle. In this case, we can estimate the likelihood of each vertex $v \in V_C$ being connected to the articulation vertex $art_C$ using Monte-Carlo sampling following Lemma 1. The function $\hat{P}_C$ maps each vertex $v \in V_C$ to the estimated reachability probability of being connected to $art_C$ in $C$.
4) for each pair of (mono- or bi-connected) components $C \neq C'$, it holds that the intersection $V_C \cap V_{C'}$ of their vertex sets is empty. Thus, each vertex in $V$ is mapped to at most one component's vertex set.
5) two different components may share the same articulation vertex, and the articulation vertex of one component may be in the vertex set of another component.
6) the articulation vertex of the root of a F-tree is $q$.

Intuitively speaking, a component is a set of vertices together with an articulation vertex that all information must flow through in order to reach $q$. By our iterative construction algorithm presented in Section 5.4, each component is guaranteed to have such an articulation vertex, guiding the direction to vertex $q$. The idea of the F-tree is to use components as virtual nodes, such that all actual vertices of a component send their information to their articulation vertex. Then the articulation vertex forwards all information to the next component, until the root of the tree is reached, where all information is sent to the articulation vertex $q$.
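The static counterpart of this decomposition can be obtained directly with NetworkX, the library named in Section 7. The following sketch only computes the blocks and articulation points of a given edge selection; it does not cover the incremental maintenance of Section 5.4:

```python
# Static block decomposition underlying the F-tree: single-edge blocks form
# the mono-connected (tree-like) parts, larger blocks are the bi-connected
# components that require Monte-Carlo sampling.
import networkx as nx

def decompose(G):
    blocks = [list(b) for b in nx.biconnected_component_edges(G)]
    cut_vertices = set(nx.articulation_points(G))
    mono = [b for b in blocks if len(b) == 1]      # acyclic parts
    bi = [b for b in blocks if len(b) > 1]         # cyclic parts: sample these
    return mono, bi, cut_vertices
```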

Example 2.

As an example for a F-tree, consider Figure 3(a), showing a probabilistic graph. For brevity, assume that each edge has the same existential probability and that all vertices have an information weight corresponding to their id, e.g. vertex 6 has a weight of six. A corresponding F-tree is shown in Figure 3(b). Consider one of the mono-connected components. For this component, we can exploit Theorem 2 to analytically compute the flow of information from any of its vertices to its articulation vertex: following Lemma 2, each vertex is connected to the articulation vertex with a probability equal to the product of the edge probabilities on its unique path, and contributes its weight multiplied by this probability. Following Theorem 2, we can aggregate these contributions to obtain the expected information flow from the component's vertices to its articulation vertex.

A bi-connected component represents a sub-graph containing a cycle. Having a cycle, we cannot exploit Theorem 2 to compute the flow of its vertices to its articulation vertex. But we can sample the subgraph spanned by its vertices to estimate the probabilities that these vertices are connected to the articulation vertex using Lemma 1. Again using Theorem 2, we aggregate the resulting expected information flow at the articulation vertex. Given this expected flow, we can use the parent mono-connected component to analytically compute the expected information that is further propagated from the articulation vertex of the bi-connected component towards $q$: since the articulation vertex of a component is contained in the vertex set of its parent component, each component propagates its information to its parent in Figure 3(b). Exploiting Theorem 2, we can thus aggregate the total flow from vertices {1,2,3,4,5,6} to the corresponding articulation vertex.

For another bi-connected component, we can estimate the information flow from its vertices to its articulation vertex numerically using Monte-Carlo sampling. Since its articulation vertex is contained in the vertex set of another component, it becomes a child of that component. We find one more bi-connected component and two more mono-connected components in Figure 3(b).

In this example, the structure of the F-tree allows us to compute or approximate the expected information flow to $q$ from each vertex. For this purpose, only three small components need to be sampled. This is a vast reduction of the sampling space compared to a naive Monte-Carlo approach that samples the full graph: rather than sampling a single random variable over all possible worlds of the whole graph, we only need to sample three random variables corresponding to the three bi-connected components, each having far fewer possible worlds. Clearly, this approach reduces the number of edges (marked in red in Figure 3(a)) that need to be sampled in each sampling iteration. More importantly, our experiments show that this approach of sampling components independently vastly decreases the variance of the total information flow, thus yielding a more precise estimation at the same number of samples.

Having defined the syntax and semantics of the F-tree, the next section shows how to maintain the structure of a F-tree when additional edges are selected. It is important to note that we do not intend to insert all edges of a probabilistic graph into the F-tree. Rather, we only add the edges that are selected to compute the maximum flow given a constrained budget of edges. Thus, even in a case where all vertices are bi-connected, such as in the initial example in Figure 1(a), we note, supported by our experimental evaluation, that an optimal selection of edges prefers a spanning-tree-like topology, which synergizes well with our F-tree. The next section shows how to build the structure of the F-tree iteratively by adding edges to an initially empty graph.

The next subsection proposes an algorithm to update a F-tree when a new edge is selected, starting from a trivial F-tree that contains only the component $(q, \{q\})$. Using this edge-insertion algorithm, we will show how to choose promising edges to be inserted to maximize the expected information flow. The selection of the edges of the F-tree will be shown in Section 6.

5.4 Insertion of Edges into a F-tree

Following Definition 9 of a F-tree, each vertex is assigned to either a single mono-connected component (noted by the flag mono in the algorithm below), a single bi-connected component (noted by bi), or to no component and thus disconnected from $q$ (noted by none). To insert a new edge $(u, v)$, the edge-insertion algorithm derived in this section distinguishes between the following cases (a schematic sketch of this case dispatch is given after the case descriptions):
Case I) $u =$ none and $v =$ none: We omit this case, as our edge selection algorithms presented in Section 6 always ensure a single connected component, and initially the F-tree contains only vertex $q$.
Case II) $u =$ none exclusive-or $v =$ none: Since we consider undirected edges, we assume without loss of generality that $v =$ none. Thus, $u$ is already connected to the F-tree.

Case IIa) $u =$ mono: In this case, a new dead end $v$ is added to a mono-connected structure, which is guaranteed to remain mono-connected. We add $v$ to the mono-connected component of $u$.

Case IIb) $u =$ bi: In this case, a new dead end $v$ is added to a bi-connected structure. This dead end becomes a new mono-connected component. Intuitively speaking, we know that vertex $v$ has no other choice but to propagate its information to $u$. Thus, $u$ becomes the articulation vertex of the new component. The bi-connected component of $u$ adds the new mono-connected component to its list of children.
Case III) $u$ and $v$ belong to the same component $C$:

Case IIIa) $C$ is a bi-connected component: Adding a new edge between $u$ and $v$ within $C$ may change the probability of each vertex in $C$ to reach the articulation vertex of $C$. Therefore, $C$ needs to be re-sampled to numerically estimate the reachability probability function for each of its vertices.

Case IIIb) $C$ is a mono-connected component: In this case, a new cycle is created within a mono-connected component, thus some vertices within $C$ become bi-connected. We need to (i) identify the set of vertices affected by this cycle, (ii) split these vertices into a new bi-connected component, and (iii) handle the set of vertices that have been disconnected from $C$ by the new cycle. These three steps are performed by the splitTree function as follows: (i) We identify the new cycle by comparing the (unique) paths of $u$ and $v$ to the articulation vertex of $C$, and finding the first vertex $c$ that appears in both paths. The new cycle is described by the two sub-paths from $u$ and $v$ to $c$, together with the new edge between $u$ and $v$. (ii) All of these vertices are added to a new bi-connected component $B$, using $c$ as their articulation vertex. All vertices in $C$ having a vertex of $B$ (except $c$ itself) on their path to the articulation vertex are removed from $C$. The probability mass function of $B$ is estimated by sampling the subgraph of vertices in $B$. (iii) Finally, orphans of $C$ that have been split off from $C$ due to the creation of $B$ need to be collected into new mono-connected components. Such orphans, having a vertex of the cycle on their path to the articulation vertex of $C$, are grouped by these vertices: for each cycle vertex, the set of orphans it separates (separated means it is the first cycle vertex on their path) is turned into a new mono-connected component with that cycle vertex as articulation vertex. All these new mono-connected components become children of $B$. If $C$ is now empty, thus all vertices of $C$ have been reassigned to other components, then $C$ is deleted and $B$ is appended to the list of children of the component that contains the articulation vertex of $C$. In case $C$ is not empty, the new bi-connected component $B$ becomes a child of $C$.
Case IV) $u$ and $v$ belong to different components. Since the F-tree is a tree structure itself, we can identify the lowest common ancestor of the components of $u$ and $v$. The insertion of edge $(u, v)$ has incurred a new cycle going from this ancestor to $u$, then to $v$ via the new edge, and then back to the ancestor. This cycle may cross mono-connected and bi-connected components, which all have to be adjusted to account for the new cycle. We need to identify all vertices involved, to create a new cyclic, thus bi-connected, component $B$, and we need to identify which parts remain mono-connected. In the following cases, we adjust all components involved in the cycle iteratively. First, we initialize $B$ with an articulation vertex that is the vertex within the lowest common ancestor component where the cycle meets if that component is mono-connected, and its articulation vertex otherwise. Let $C$ denote the component that is currently adjusted:

Case IVa) $C$ is the lowest common ancestor: In this case, the new cycle may enter $C$ from two different articulation vertices. We apply Case III, treating these two vertices as $u$ and $v$, as they have become connected transitively via the big cycle.

Case IVb) $C$ is a bi-connected component: In this case, $C$ becomes absorbed by the new cyclic component $B$, and $B$ inherits all children from $C$. The rationale is that all vertices within $C$ are able to access the new cycle.

Case IVc) $C$ is a mono-connected component: In this case, one path in $C$ from one vertex to the articulation vertex of $C$ is now involved in a cycle. All vertices involved in the cycle are added to $B$ and removed from $C$. The operation splitTree is called to create new mono-connected components that have been split off from $C$ and become connected to $B$ via their individual articulation vertices.

(a) Case IIb: Insertion of edge .
(b) Case IIIa: Insertion of edge .
(c) Case IIIb: Insertion of edge 
(d) Case IVa-c: Insertion of edge
Fig. 4: Examples of edge insertions and F-tree update cases using the running example of Figure 3(a).

5.5 Insertion Examples

In the following, we use the graph of Figure 3(a) and its corresponding F-tree (FT) representation of Figure 3(b) to insert additional edges and to illustrate the interesting cases of the insertion algorithm of Section 5.4.

We start with an example for Case II in Figure 4(a). Here, we insert a new edge, thus connecting a new vertex to the FT. Since the already-connected endpoint belongs to a bi-connected component, we apply Case IIb: a new mono-connected component is created and added to the children of the bi-connected component.

In Figure 4(b), we insert a different edge whose two endpoints are already part of the FT, so Case II does not apply. We find that both vertices belong to the same component, thus Case III is used; more specifically, since this component is a bi-connected component, Case IIIa is applied. In this case, no components need to be changed, but the probability function $\hat{P}$ has to be re-approximated, as several nodes will have an increased probability of being connected to the articulation vertex, due to the new paths arising from the inserted edge.

Next, in Figure 4(c), an edge is inserted between two vertices of the same mono-connected component, thus Case IIIb is applied here. After the insertion, the previously mono-connected component contains a cycle. (i) We identify this cycle by considering the previous paths from the two endpoints to their articulation vertex; the first common vertex on these paths closes the cycle. (ii) We create a new bi-connected component containing all vertices of this cycle, using the first common vertex as articulation vertex. We further remove these vertices, except the articulation vertex, from the mono-connected component; the probability function of the new component is initialized by sampling the reachability probabilities within it; and the new component is added to the list of children of the mono-connected component. (iii) Finally, orphans need to be collected. These are vertices whose (previously unique) path to their former articulation vertex crosses the new cycle. Here, one vertex is affected and is moved into a new mono-connected component, terminating this case. Summarizing, this vertex now reports its information flow to its articulation vertex on the cycle, for which the flow to the cycle's articulation vertex is approximated using Monte-Carlo sampling. This information is then propagated analytically through the mono-connected component; subsequently, the remaining flow that has been propagated all this way is approximately propagated through the next bi-connected component, which finally allows us to analytically compute the flow to the articulation vertex $q$.

For the last case, Case IV, consider Figure 4(d), where a new edge connects two vertices belonging to two different components. We start by identifying the cycle that has been created within the FT, involving both components and meeting at their lowest common ancestor component. For each component on this cycle, one of the sub-cases of Case IV is used. For the ancestor component itself, Case IVa is triggered. In this example, both involved child components use the same vertex as their articulation vertex, so the only cycle incurred in the ancestor component is the (trivial) cycle from this vertex to itself, which requires no action. We initialize the new bi-connected component, which initially holds no vertices and has no probability mass function computed yet, and which uses this vertex as articulation vertex. For the bi-connected component on the cycle, we apply Case IVb: it becomes absorbed by the new bi-connected component, which inherits its vertices and children. For the mono-connected component on the cycle, Case IVc is used: we identify the path within it that is now involved in a cycle, using the path between the involved vertex and its articulation vertex. All nodes on this path are added to the new bi-connected component. Using the splitTree operation, similar to Case III, we collect orphans into new mono-connected components, which become children of the new bi-connected component. Finally, Monte-Carlo sampling is used to approximate the probability mass function $\hat{P}$ of the new component.

6 Optimal Edge Selection

The previous section presented the F-tree, a data structure to compute the expected information flow in a probabilistic graph. Based on this structure, heuristics to find a near-optimal set of edges maximizing the information flow to a vertex $q$ (see Definition 4) are presented in this section. To this end, we first present a Greedy heuristic that iteratively adds the locally most promising edge to the current result. Based on this Greedy approach, we present improvements aiming at minimizing the processing cost while maximizing the expected information flow.

6.1 Greedy Algorithm

Aiming to select edges incrementally, the Greedy algorithm initially uses the probabilistic graph $G_0 = (V, \emptyset, w, p)$, which contains no edges. In each iteration $i$, a set of candidate edges $candList$ is maintained, which contains all edges that are incident to the subgraph connected to $q$ in the current graph $G_{i-1}$, but which have not yet been selected. Each iteration then selects the edge $e_i$ whose addition maximizes the information flow to $q$, such that $G_i = G_{i-1} \cup \{e_i\}$, where

$$e_i = \underset{e \in candList}{\operatorname{argmax}}\ E[flow(q, G_{i-1} \cup \{e\})] \qquad (5)$$

For this purpose, each edge $e \in candList$ is probed by inserting it into the current F-tree using the insertion method presented in Section 5.4. Then, the gain in information flow incurred by this insertion is estimated following Equation 2. After $k$ iterations, the graph $G_k$ is returned.
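A compact sketch of this loop follows; the flow_estimate callback is an assumption of ours that abstracts the F-tree probing, and the candidate bookkeeping is simplified:

```python
# Greedy selection (Section 6.1): flow_estimate(G, q, edges) is assumed to
# return the estimated expected flow of the subgraph induced by the edges.
def candidate_edges(G, q, selected):
    """Edges incident to the subgraph reached from q that are not selected."""
    reached = {q} | {x for e in selected for x in e}
    return [e for e in G.edges() if e not in selected
            and (e[0] in reached or e[1] in reached)]

def greedy_select(G, q, k, flow_estimate):
    selected, current_flow = set(), 0.0
    for _ in range(k):
        best_edge, best_flow = None, current_flow
        for e in candidate_edges(G, q, selected):
            f = flow_estimate(G, q, selected | {e})   # probe insertion of e
            if f > best_flow:
                best_edge, best_flow = e, f
        if best_edge is None:            # no remaining edge improves the flow
            break
        selected.add(best_edge)
        current_flow = best_flow
    return selected
```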

6.2 Component Memoization

We introduce an optimization that reduces the number of computations for bi-connected components whose reachability probabilities have to be estimated using Monte-Carlo sampling, by exploiting stochastic independence between different components in the F-tree. During each Greedy iteration, a whole set of edges is probed for insertion. Some of these insertions may yield new cycles in the F-tree, resulting from Cases III and IV. Using component memoization, the algorithm memoizes, for each edge $e$ in $candList$, the probability mass function of any bi-connected component that had to be sampled during the last probing of $e$. Should $e$ be probed again in a later iteration, the algorithm checks if the component has changed, in terms of vertices within that component or in terms of other edges that have been inserted into that component. If the component has remained unchanged, the sampling step is skipped, using the memoized estimated probability mass function instead.
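A minimal sketch of this bookkeeping, where version_key is a hypothetical helper hashing a component's vertices and inserted edges (naming is ours):

```python
# Cache the sampled pmf per probed edge together with a version key of the
# affected component; reuse the pmf while the component is unchanged.
class ComponentMemo(object):
    def __init__(self):
        self.cache = {}                        # edge -> (version_key, pmf)

    def lookup(self, edge, component):
        hit = self.cache.get(edge)
        if hit is not None and hit[0] == component.version_key():
            return hit[1]                      # component unchanged: reuse pmf
        return None                            # changed or unseen: re-sample

    def store(self, edge, component, pmf):
        self.cache[edge] = (component.version_key(), pmf)
```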

6.3 Sampling Confidence Intervals

Monte-Carlo sampling is controlled by a parameter $samplesize$, which corresponds to the number of samples taken to approximate the information flow of a bi-connected component to its articulation vertex. In each iteration, we can reduce the number of samples by introducing confidence intervals for the information flow of each edge that is probed. The idea is to prune the sampling of any probed edge $e$ for which we can conclude that, at a sufficiently large level of significance $\alpha$, there must exist another edge $e'$ in $candList$ such that $e'$ is guaranteed to have a higher information flow than $e$, based on the current number of samples only. To generate these confidence intervals, we recall that, following Equation 4, the expected information flow to $q$ is the sample average of the sum of information flows of each individual vertex. For each vertex $v$, the random event of being connected to $q$ in a random possible world follows a binomial distribution with an unknown success probability. To estimate this probability, given a number of samples and the number of 'successful' samples in which $v$ is reachable from $q$, we borrow techniques from statistics to obtain a two-sided confidence interval of the true probability. A simple way of obtaining such a confidence interval is applying the Central Limit Theorem to approximate a binomial distribution by a normal distribution.

Definition 10 (-Significant Confidence Interval).

Let $S$ be a set of possible graphs drawn from the probabilistic graph $G$, and let $\hat{p}$ be the fraction of possible graphs in $S$ in which $v$ is reachable from $q$. With a likelihood of $1 - \alpha$, the true probability $Pr(q \leadsto v)$ that $v$ is reachable from $q$ in the probabilistic graph $G$ is in the interval

$$Pr(q \leadsto v) \in \left[\hat{p} - z_{1-\frac{\alpha}{2}} \cdot \sqrt{\frac{\hat{p} \cdot (1 - \hat{p})}{|S|}},\ \hat{p} + z_{1-\frac{\alpha}{2}} \cdot \sqrt{\frac{\hat{p} \cdot (1 - \hat{p})}{|S|}}\right] \qquad (6)$$

where $z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})$ percentile of the standard normal distribution. We denote the lower bound of this interval as $P^{lb}(q \leadsto v)$ and the upper bound as $P^{ub}(q \leadsto v)$. The significance level $\alpha$ is a parameter of our approach.

To obtain a lower bound $LB(e)$ of the expected information flow to $q$ in the graph resulting from probing an edge $e$, we use the sum of lower-bound flows of each vertex following Equation 4 to obtain

$$LB(e) = \sum_{v \in V} w(v) \cdot P^{lb}(q \leadsto v),$$

as well as the upper bound

$$UB(e) = \sum_{v \in V} w(v) \cdot P^{ub}(q \leadsto v).$$

Now, at any iteration of the Greedy algorithm, for any candidate edge $e$ having an information flow lower bounded by $LB(e)$, we prune any other candidate edge $e'$ having an upper bound $UB(e')$ iff $UB(e') < LB(e)$. The rationale of this pruning is that, with a confidence of $1 - \alpha$, we can guarantee that inserting $e'$ yields less information gain than inserting $e$. To ensure that the Central Limit Theorem is applicable, we only apply this pruning step if at least 30 sample worlds have been drawn for both probabilistic graphs.
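The bounds of Definition 10 are straightforward to compute; in this sketch the use of scipy.stats for the normal percentile and the default alpha are our assumptions (the paper's dependency list names only NetworkX and numpy):

```python
# Normal-approximation confidence bounds of Definition 10.
import math

from scipy.stats import norm

def reach_confidence_interval(successes, n_samples, alpha=0.05):
    """Two-sided (1 - alpha) CI for a reachability probability; the text
    applies this only once n_samples >= 30."""
    p_hat = float(successes) / n_samples
    z = norm.ppf(1.0 - alpha / 2.0)            # (1 - alpha/2) percentile
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n_samples)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)
```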

6.4 Delayed Sampling

For the last heuristic, we reduce the number of Monte-Carlo samplings that need to be performed in each iteration of the Greedy algorithm of Section 6.1. In a nutshell, the idea is that an edge which yields a much lower information gain than the chosen edge is unlikely to become the edge having the highest information gain in the next iteration. For this purpose, we introduce a delayed sampling heuristic. In any iteration of the Greedy algorithm, let $e_{best}$ denote the best selected edge, as defined in Equation 5. For any other edge $e$, we define its potential $pot(e)$ as the fraction of information gained by adding edge $e$ compared to the best edge $e_{best}$ selected in that iteration. Furthermore, we define the cost $cost(e)$ as the number of edges that need to be sampled to estimate the information gain incurred by adding edge $e$. If the insertion of $e$ does not incur any new cycles, then $cost(e)$ is zero. Now, after an iteration where edge $e$ has been probed but not selected, we define a sampling delay $delay(e)$, which implies that $e$ will not be considered as a candidate in the next $delay(e)$ iterations of the Greedy algorithm of Section 6.1. This definition of delay makes the (false) assumption that the information gain of an edge can only increase by a factor of $f$ in each iteration, where the parameter $f$ is used to control the penalty of having high sampling cost and having low information gain. As an example, assume an edge $e$ having an information gain of only a small fraction of that of the selected best edge $e_{best}$, and requiring the sampling of a new bi-connected component upon probing. Under the stated growth assumption, edge $e$ would, in the setting of this example, not be considered in the next nine iterations of the edge selection algorithm. It must be noted that this delayed sampling strategy is a heuristic only, and that no correct upper bound for the change in information gain can be given. Consequently, the delayed sampling heuristic may cause the edge having the highest information gain not to be selected, as it might still be suspended. Our experiments show that even for low values of $f$ (i.e., close to one), where edges are suspended for a large number of iterations, the loss in information gain is fairly low.
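Since the exact delay formula is not fully specified above, the following sketch derives one consistent reading from the stated assumption that the gain may grow by at most a factor f per iteration; treat it as our reconstruction, not the authors' exact definition:

```python
# Delayed-sampling bookkeeping (our reconstruction): if the gain of an edge
# may grow by at most a factor f per iteration, an edge whose gain is
# pot(e) * best needs at least log_f(1/pot(e)) iterations to catch up.
import math

def sampling_delay(potential, cost, f):
    if cost == 0 or potential >= 1.0:
        return 0               # cheap to probe or already competitive: no delay
    return int(math.ceil(math.log(1.0 / potential) / math.log(f)))
```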

7 Evaluation

In this section, we empirically evaluate the efficiency and effectiveness of our proposed solutions to compute a near-optimal subgraph of an uncertain graph that maximizes the information flow to a query node $q$, given a constrained number of edges, according to Definition 4. As motivated in the introductory Section 1, two main application fields of information propagation on uncertain graphs are: i) information/data propagation in spatial networks, such as wireless networks or road networks, and ii) information/belief propagation in social networks. These two types of uncertain graphs have extremely different characteristics, which require separate evaluation. A spatial network follows a locality assumption, constraining the set of pairwise reachable nodes by spatial distance. Thus, the average shortest path between a pair of two randomly selected nodes can be very large, depending on the spatial distance. In contrast, a social network has no locality assumption, thus allowing to traverse the network with very few hops. As a result, without any locality assumption, the set of nodes reachable in a given number of hops from a query node may grow exponentially in the number of hops. In networks following a locality assumption, this number grows polynomially, usually quadratically (in sensor and road networks on the plane), as the area covered by a circle is quadratic in its radius. Our experiments have shown that the locality assumption, which clearly exists in some applications, has tremendous impact on the performance of our algorithms, including the baselines. Consequently, we evaluate both cases separately.

All experiments were evaluated on a system with Linux 3.16.7, x86_64, and an Intel(R) Xeon(R) CPU E5-2609 at 2.4 GHz. All algorithms were implemented in Python. Dependencies: NetworkX 1.11, numpy 1.13.1.

7.1 Dataset Descriptions

This section describes our employed uncertain graph datasets. For both cases, i.e. with locality assumption and no-locality assumption, we use synthetic and real datasets.

Synthetic Datasets: No locality assumption. Our first model, Erdös, is based on the idea of the Erdös-Rényi model [8], distributing edges independently and uniformly between nodes. Probabilities of edges are chosen uniformly at random, and weights of nodes are integers selected uniformly at random from a fixed range. It is known that this model is not able to capture real human social networks [21], due to the lack of modeling the long-tail distributions produced by "social animals". Thus, we use this data generation only in our first set of experiments and employ real social network data later.

Synthetic Datasets: Locality assumption. We use two synthetic data generating schemes to generate spatial networks. For the first data generating scheme, denoted by partitioned, each vertex has the same degree. The dataset is partitioned into equally sized partitions. Each vertex in a partition is connected to all and only the vertices in the previous and the next partition. This data generation allows us to control the diameter of the resulting network, which is determined by the number of partitions.

For a more realistic synthetic data set, denoted as WSN, we simulate a wireless sensor network. Here, vertices have two spatial coordinates selected uniformly at random from the unit square. Using a global distance parameter, any vertex is connected to all vertices located within that Euclidean distance. For both settings, the probabilities of edges are chosen uniformly in [0, 1].
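A sketch of this generator follows; the parameter names and the integer weight range are our assumptions:

```python
# WSN generator: n nodes with uniform coordinates in the unit square,
# connected within Euclidean distance eps, edge probabilities uniform in [0, 1].
import random

import networkx as nx

def generate_wsn(n, eps, seed=None):
    rng = random.Random(seed)
    G = nx.Graph()
    pos = {v: (rng.random(), rng.random()) for v in range(n)}
    for v, (x, y) in pos.items():
        G.add_node(v, w=rng.randint(1, 10), x=x, y=y)   # assumed weight range
    for u in range(n):
        for v in range(u + 1, n):
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            if (dx * dx + dy * dy) ** 0.5 <= eps:
                G.add_edge(u, v, p=rng.random())
    return G
```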

Real Datasets: No locality assumption. We use the social circles of Facebook dataset published in [23]. This dataset is a snapshot of the social network of Facebook, containing a subgroup of users which form a social circle, i.e. a highly connected subgraph. These users have an excessive number of 'friends'. Yet, it has been discussed in [36] that the number of real friends that influence, affect and interact with an individual is limited. Following this result, and due to the lack of better knowledge of which people of this social network are real friends, we assign higher edge probabilities, selected uniformly from a high range, to a small number of randomly chosen adjacent nodes of each user. Due to symmetry, an average user thus has a corresponding number of high-probability 'close friends'. All other edges are assigned edge probabilities selected uniformly from a low range.

For our experiments on collaboration network data, we used the computer science bibliography DBLP. The structure of this dataset is such that if an author $a_i$ co-authored a paper with author $a_j$, the graph contains an undirected edge between $a_i$ and $a_j$. If a paper is co-authored by $k$ authors, this generates a completely connected (sub)graph (clique) on $k$ nodes. This dataset has been published in [38]. Probabilities on edges are uniformly distributed.

Finally, we also evaluated our methods on the YouTube social network, first published in [25]. In this network, edges represent friendship relationships between users. Again, the probabilities on edges are uniformly distributed.

Real Datasets: Locality assumption. For our experiments on spatial networks, we used the road network of San Joaquin County (https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm). The vertices of the graph are road intersections and edges correspond to connections between them. In order to simulate real sensor nodes located at road intersections, we connected vertices such that vertices that are spatially distant from each other have a lower chance to successfully communicate: the communication probability of two connected vertices is a decreasing function of their distance in meters.

Fig. 5: Experiments with changing graph size: (a) with locality assumption; (b) without locality assumption.

7.2 Evaluated Algorithms

The algorithms that we evaluate in this section are denoted and described as follows:

Naive As proposed in [22, 7], the first competitor Naive does not utilize the component decomposition strategy of Section 5 and relies on a pure sampling approach to estimate reachability probabilities. To select edges, the Naive approach greedily chooses the locally best edge, as shown in Section 6, but does not use the F-tree representation presented in Section 5.3. We use a constant Monte-Carlo sample size.
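For reference, a minimal sketch of such pure Monte-Carlo reachability estimation (assuming each edge stores its existence probability in an attribute `p`; the function name and graph representation are our own) could look as follows:

```python
import random
import networkx as nx

def mc_reachability(g: nx.Graph, source, num_samples: int = 1000):
    """Estimate, for every vertex, the probability of reaching `source`
    by sampling possible worlds of the probabilistic graph (sketch)."""
    hits = {v: 0 for v in g.nodes}
    for _ in range(num_samples):
        # Materialize one possible world: keep each edge with probability p.
        world = nx.Graph()
        world.add_nodes_from(g.nodes)
        world.add_edges_from(
            (u, v) for u, v, p in g.edges(data="p") if random.random() < p
        )
        for v in nx.node_connected_component(world, source):
            hits[v] += 1
    return {v: h / num_samples for v, h in hits.items()}
```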

Dijkstra Shortest-path spanning trees [34] are used to interconnect a wireless sensor network (WSN) to a sink node. To obtain a maximum probability spanning tree, we proceed as follows: the cost of each edge $e$ is set to $-\log(p(e))$, turning the maximization of a path's existence probability into a minimization of additive costs. Running the traditional Dijkstra algorithm on the transformed graph starting at the source node yields, in each iteration, a spanning tree which maximizes the connectivity probability between the source and any node connected to it [32]. Since, in each iteration, the resulting graph has a tree structure, this approach can fully exploit the concept of Section 5, requiring no sampling step at all.
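A sketch of this transformation, using the standard observation that maximizing a product of probabilities corresponds to minimizing a sum of negative log-probabilities (the helper name is hypothetical):

```python
import math
import networkx as nx

def max_probability_tree(g: nx.Graph, s):
    """Sketch: maximum-probability spanning tree rooted at s via Dijkstra
    on -log(p) edge costs (additive costs <=> multiplicative probs)."""
    h = nx.Graph()
    h.add_weighted_edges_from(
        (u, v, -math.log(p)) for u, v, p in g.edges(data="p") if p > 0
    )
    # Predecessor links of the shortest-path tree form the spanning tree.
    preds, _ = nx.dijkstra_predecessor_and_distance(h, s)
    tree = nx.Graph()
    tree.add_edges_from((v, ps[0]) for v, ps in preds.items() if ps)
    return tree
```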

FT employs the F-tree proposed in Section 5.3 to estimate reachability probabilities. To sample bi-connected components, we draw the same number of samples as Naive for a fair comparison. All following FT algorithms build on top of FT.

FT+M additionally memoizes, for each candidate edge, the estimated flow of the corresponding bi-connected component from the last iteration (cf. Section 6.2).
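Conceptually, this memoization amounts to a cache keyed by the bi-connected component, so that a component that is unchanged since the last iteration is never re-sampled. A minimal sketch (the helper `estimate_flow` and the frozenset-of-edges key are our own assumptions, not the paper's data structures):

```python
flow_cache = {}

def component_flow(component_edges, estimate_flow):
    """Return the (possibly cached) sampled flow of a bi-connected
    component; unchanged components are not re-sampled (sketch)."""
    key = frozenset(frozenset(e) for e in component_edges)  # canonical key
    if key not in flow_cache:
        flow_cache[key] = estimate_flow(component_edges)
    return flow_cache[key]
```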

FT+M+CI further ensures that probing of an edge is stopped whenever another edge has a higher information flow with a certain degree of confidence, as explained in Section 6.3.
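As an illustration of such confidence-based pruning, consider a sketch using a normal-approximation confidence interval (the exact bounds of Section 6.3 may differ; `z = 1.96` corresponds to roughly 95% confidence):

```python
import math

def can_prune(mean_a, var_a, n_a, mean_b, var_b, n_b, z=1.96):
    """Sketch: stop probing candidate a once the upper confidence bound
    of its estimated flow falls below the lower bound of candidate b."""
    upper_a = mean_a + z * math.sqrt(var_a / n_a)
    lower_b = mean_b - z * math.sqrt(var_b / n_b)
    return upper_a < lower_b
```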

FT+M+DS instead reduces the number of candidate edges per iteration by leaving out edges that had a small information-gain/cost ratio in the last iteration (cf. Section 6.4). Per default, we use a fixed value for the penalization parameter.

FT+M+CI+DS is a combination of all the above concepts.

Fig. 6: Experiments with changing graph density: (a) with locality assumption; (b) without locality assumption.

7.3 Experiments on Synthetic Data

In this section, we employ randomly generated uncertain graphs. We generate graphs without locality assumption using Erdös graphs, and graphs with locality assumption using the partitioned generation. Both generation approaches are described in Section 7.1. This data generation allows us to scale the topology of the uncertain graph in terms of size and density. Unless specified otherwise, we use the same default graph size, vertex degree, and edge budget $k$ in all experiments on synthetic data.

Graph Size. We first scale the size of the synthetic graphs. Figure 5(a) shows the information flow (left-hand side) and running time (right-hand side) for our synthetic data set following the locality assumption. First, we note that the Dijkstra-based shortest-path spanning tree yields an extremely low information flow, far inferior to all other approaches. The reason is that a spanning tree allows no room for failure of edges: whenever any edge fails, the whole subtree becomes disconnected from the source. We further note that all other algorithms, including the Naive one, are oblivious to the size of the network, in terms of both information flow and running time. The reason is that, due to the locality assumption, only a local neighborhood of vertices and edges is relevant, regardless of the global size of the graph. Additionally, we see that the delayed sampling heuristic (DS) yields a significant running time gain, while keeping the information flow consistently high. The combination of all heuristics (FT+M+CI+DS) yields a significant loss of information flow due to the pruning strategy of the confidence interval heuristic (CI). Figure 5(b) shows the performance in terms of information gain and running time for the Erdös graphs having no locality assumption. We first observe that Dijkstra and Naive yield a significantly lower information flow than our proposed approaches. For Dijkstra, this result is again attributed to the constraint of constructing a spanning tree, which does not allow any edges to connect the flow between branches. For the Naive approach, the loss in information flow requires a closer look. This approach samples the whole graph only a fixed number of times to estimate the information flow. In contrast, our F-tree approach draws the same number of samples for each individual bi-connected component. Why is the latter approach more accurate? A first, informal, explanation is that, for a constant sampling size, the information flow of a small component can be estimated more accurately than for a large component. Intuitively, sampling two independent components $n$ times each yields a total of $n^2$ combinations of samples of their joint distribution. More formally, this effect is attributed to the fact that the variance of the sum of two random variables increases as their correlation increases, since $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$ [26]. Furthermore, the Naive approach also incurs an approximation error for mono-connected components, for which all F-tree (FT) approaches compute the exact flow analytically. We further see that the Naive approach, which has to sample the whole graph, is by far the most inefficient. At the other end of the spectrum, the Dijkstra approach, which avoids sampling entirely by guaranteeing a single mono-connected component, is the fastest in all experiments, but at the cost of information flow. We also see in Figure 5(b) that all algorithms stay in the same order of magnitude in running time and information flow as the graph grows. This is due to the fact that in this experiment the average vertex degree remains constant for all graph sizes. We also observe that the CI heuristic incurs the overhead of computing the intervals while losing information gain due to its rigorous pruning strategy.
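To illustrate the accuracy argument, the following minimal simulation (with arbitrary parameters of our choosing) considers two components in series, where the end-to-end reachability probability is the product of the per-component probabilities, a simplification of the paper's setting. It compares the empirical variance of sampling both components jointly versus sampling each component separately and combining the estimates analytically; in this toy setting, the separate estimator shows the lower variance:

```python
import random
import statistics

def joint_estimate(p1, p2, n):
    """Sample both components together: one joint world per sample."""
    return sum(random.random() < p1 and random.random() < p2
               for _ in range(n)) / n

def per_component_estimate(p1, p2, n):
    """Sample each component independently and combine analytically."""
    e1 = sum(random.random() < p1 for _ in range(n)) / n
    e2 = sum(random.random() < p2 for _ in range(n)) / n
    return e1 * e2

random.seed(0)
p1, p2, n, trials = 0.6, 0.7, 100, 2000
for est in (joint_estimate, per_component_estimate):
    vals = [est(p1, p2, n) for _ in range(trials)]
    print(est.__name__, statistics.variance(vals))
```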

Fig. 7: Experiments with changing budget k: (a) with locality assumption; (b) without locality assumption.
Fig. 8: Experiments in synthetic Wireless Sensor Networks (WSN), panels (a) and (b).
Fig. 9: Experiments on real-world datasets: (a) San Joaquin County road network; (b) Circle of Friends (Facebook); (c) DBLP; (d) YouTube.

Graph Density. In this experiment, we scale the average degree of vertices. In the case of graphs following the locality assumption, the gain in information flow of all proposed solutions compared to Dijkstra is quite significant, as shown in Figure 6(a), particularly when the degree of vertices is low. This is the case in road networks, but also in most sensor and ad hoc networks. The reason is that, in this case, spanning trees quickly gain height as edges are added, thus incurring low-probability paths that require circular components to connect branches and support the information flow. For larger vertex degrees, the optimal solutions become more star-like, and thus more tree-like. For a small vertex degree, we also observe that the same bi-connected components occur in consecutive iterations, resulting in a running time gain for the memoization approach. As the complexity of the graph grows, the gap between FT and FT+M shrinks: more candidates lead to an increased number of positions where bi-connected components can occur, making cached results for bi-connected components obsolete. The results shown in Figure 6(b) indicate that the Dijkstra approach is able to find higher quality results in graphs without locality assumption for small average vertex degrees. This is attributed to the fact that, in this setting, the optimal result will be almost tree-like, having only a few inter-branch edges. The algorithms FT+M+CI and FT+M+CI+DS yield a trade-off between running time and accuracy, i.e., we observe a slight loss in information gain along with a better running time for settings with a larger average vertex degree.

Scaling of the budget $k$. In the next experiments, shown in Figure 7, we show how the budget of $k$ edges affects the performance of the proposed algorithms. In the case of a network following the locality assumption, we observe in Figure 7(a) that the overall information gain per additional edge slowly decreases. This is expected since, on average, the hop distance to the source increases as more edges are added, increasing the possible links of failure and thus decreasing the likelihood of propagating a node's information to the source. We observe that the effectiveness of Dijkstra in the locality setting quickly deteriorates, since the constraint of returning a tree structure leads to paths between the source and other connected nodes that become increasingly unlikely, making the Dijkstra approach unable to handle such settings. Here, the memoization heuristic performs extremely well. A severe loss of information gain is observed when running FT+M+CI and FT+M+CI+DS due to their pruning policy; the latter is the best in terms of running time.

In contrast, using a network following no locality assumption in Figure 7(b), we see that both Dijkstra and Naive yield a low information gain for a large budget $k$. For Dijkstra, the reason is that for large values of $k$, the depth of the spanning tree, which grows with $k$, incurs longer paths without any backup route in the case of a single failure. For the Naive approach, the low information gain is attributed to the high variance of sampling the information flow of the whole graph for each edge selection. Further, we see that the Naive approach also suffers from an extreme running time, as it has to re-sample the whole graph in each iteration. The F-tree in combination with memoization gives a consistently high information gain at a low running time. The heuristics suffering from a loss in information gain yield a slightly better running time.

Synthetic Wireless Sensor Networks (WSN). In these experiments, we simulate real-world wireless sensor networks (WSN). We embed a number of vertices uniformly at random in a spatial domain. Each vertex is adjacent to the vertices in its proximity, which is regulated by the distance threshold parameter of Section 7.1. Figure 8 shows the results. We observe nearly the same behavior as in Figure 6(a). As this parameter regulates the graph's interconnectivity, we again observe a fair trade-off between information gain and running time for the proposed heuristics. By increasing the parameter, hence simulating denser graphs, the gap between Dijkstra and the F-tree approaches is reduced. For these datasets, we can also observe the benefit of FT+M+CI+DS, which still identifies a high information gain while reducing the running time, as the number of candidates is reduced and candidates can be pruned in earlier stages of each iteration.

Penalization parameter. We also evaluated the penalization parameter of the delayed sampling heuristic and summarize the results. In all evaluated settings, the running time consistently decreases as the parameter is decreased, yielding a considerable speed-up factor, depending on the dataset, and a multi-orders-of-magnitude speed-up in the most aggressive setting. Yet, below a certain threshold we start to observe a significant loss of information flow. In the extreme case, the information flow became even worse than Dijkstra's, as edges remain suspended unreasonably long and are effectively chosen arbitrarily. For the default setting used in all previous evaluations, the delayed sampling heuristic showed an insignificant loss of information flow while yielding a better running time.

7.4 Experiments on Real World Data

Our first real-world data experiment uses the San Joaquin County road network. As road networks are of a very sparse nature and follow a strong locality assumption, our approaches outperform Dijkstra significantly as the budget $k$ is scaled up (Figure 9(a)). Thus, Dijkstra is highly undesirable, as budget is wasted without proper return in information flow. In this setting, following the locality assumption, we see that all heuristics yield a significant running time gain, while the information flow remains similar for all heuristics.

In the next experiment, we employ the social circles of friends dataset, an extremely dense network with no locality assumption, where most pairs of nodes are connected. As described in Section 7.1, each vertex in this graph has only ten high-probability links, whereas all its other incident edges have a lower probability. Figure 9(b) shows that Dijkstra yields the most significant loss of information, as it is forced to quickly build a large-height tree to maintain high-probability edges. Further, we see that the memoization heuristic yields a significant running time improvement of about one order of magnitude. We note that in such a dense setting, the heuristics CI and DS show almost no effect in both running time and information flow.

Figure 9(c) shows similar results on the DBLP collaboration network dataset, a sparse network which follows no locality assumption. Again, we observe a loss of potential information flow for Dijkstra as $k$ increases.

Finally, we observe similar behavior of all approaches on a bigger graph, the YouTube social network, which constitutes a sparse setting with no locality assumption. Figure 9(d) shows the results. As in the other settings, we observe an extremely low information flow for Dijkstra and an extreme running time for the Naive approach. Interestingly, in this setting the memoization approach FT+M, like the other heuristics, yields only a minimal gain in running time. Fortunately, none of these heuristics shows a significant loss of information flow.

7.5 Experimental Evaluation: Summary

To summarize our experimental results, we reiterate the shortcomings of the naive solutions, and briefly discuss which of the heuristics are best used in which setting.

Naive: Our Naive competitor, which applies a Greedy edge selection (c.f. Sec. 6) but does not use the F-tree, is multiple orders of magnitude slower than our other approaches in all real-data experiments (c.f. Fig. 9). Further, large sampling errors also yield a significantly lower information flow in most settings.

Dijkstra: A Dijkstra-based spanning tree algorithm runs extremely fast, but at the cost of an extreme loss of information, yielding low information flow. The information loss is particularly high for social networks (e.g. Fig. 9(b)), where cycles are required to increase the odds of connecting a distant node to the source.

FT: Employing the F-tree proposed in Section 5.3 maximizes the information flow. Compared to the Naive approach, only smaller partitions need to be sampled, yielding smaller sampling variance while being multiple orders of magnitude faster.

FT+M: The memoization heuristic described in Section 6.2 was shown to be simple and effective. It yields a vast reduction in running time of up to one order of magnitude on real data (see Fig. 9), while showing no notable detriment to the information flow.

FT+M+CI: Employing confidence intervals as described in Section 6.3 has shown a significant improvement in running time on spatial networks following the locality assumption (c.f. Figures 5(a), 6(a), and 9(a)). However, this heuristic yields no improvement (and often has a detrimental effect) in settings without locality assumptions such as in social networks (c.f. Fig. 9(b)-9(d)). This heuristic should not be employed in such settings.

FT+M+DS: The delayed sampling heuristic presented in Section 6.4 yields an improvement in running time in networks following the locality assumption. This improvement is especially large in cases having a high vertex degree (c.f. Figure 6(a)). However, in social networks which do not follow the locality assumption, the gain of this heuristic is often marginal (c.f. Fig. 9(b)-9(d)). Yet, this heuristic comes at minimal loss of information flow, such that it is not detrimental to enable it by default.

FT+M+CI+DS: The combination of all heuristics inherits the problems of FT+M+CI and FT+M+DS for the cases without locality assumption. For the cases with locality assumption, however, our experiments on real world data show that in most cases the combination of all heuristics achieves significantly lower running time than applying each heuristic separately, demonstrating the importance of each proposed heuristic.

8 Conclusions

In this paper, we discussed solutions for the problem of maximizing the information flow in an uncertain graph given a fixed budget of $k$ communication edges. We identified two NP-hard subproblems that require heuristic solutions: (i) computing the expected information flow of a given subgraph; and (ii) selecting the optimal $k$-set of edges. For problem (i), we proposed the F-tree representation of a graph, which keeps track of bi-connected components, for which sampling is required to estimate the information flow, and of mono-connected components, for which the information flow can be computed analytically; the resulting sampling strategy performs an expensive (and approximate) sampling step only for those parts of the graph for which no efficient (and exact) analytic solution exists. For problem (ii), we introduced, on the basis of the F-tree representation, further approaches and heuristics to handle the trade-off between effectiveness and efficiency. Our evaluation shows that these enhanced algorithms are able to find high quality solutions (i.e., $k$-sets of edges having a high information flow) in efficient time, especially in graphs following a locality assumption, such as road networks and wireless sensor networks.

References

  • [1] E. Adar and C. Ré. Managing uncertainty in social networks. IEEE Data Eng. Bull., 30(2):15–22, 2007.
  • [2] K. Aggarwal, K. Misra, and J. Gupta. Reliability evaluation a comparative study of different techniques. Microelectronics Reliability, 14(1):49–56, 1975.
  • [3] T. B. Brecht and C. J. Colbourn. Lower bounds on two-terminal network reliability. Discrete Applied Mathematics, 21(3):185–198, 1988.
  • [4] D. Bulka and J. B. Dugan. Network st reliability bounds using a 2-dimensional reliability polynomial. Reliability, IEEE Transactions on, 43(1):39–45, 1994.
  • [5] C. J. Colbourn. The combinatorics of network reliability, volume 200. Oxford University Press, New York, 1987.
  • [6] P. Domingos and M. Richardson. Mining the network value of customers. In SIGKDD, pages 57–66, 2001.
  • [7] T. Emrich, H.-P. Kriegel, J. Niedermayer, M. Renz, A. Suhartha, and A. Züfle. Exploration of monte-carlo based probabilistic query processing in uncertain graphs. In CIKM, pages 2728–2730, 2012.
  • [8] P. Erdös and A. Rényi. On the evolution of random graphs. In Publication of the mathematical institute of the hungarian academy of sciences, pages 17–61, 1960.
  • [9] J. Galtier, A. Laugier, and P. Pons. Algorithms to evaluate the reliability of a network. In DRCN, 2005.
  • [10] J. Ghosh, H. Q. Ngo, S. Yoon, and C. Qiao. On a routing problem within probabilistic graphs and its application to intermittently connected networks. In INFOCOM, pages 1721–1729, 2007.
  • [11] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In WWW, pages 403–412, 2004.
  • [12] G. Hardy, C. Lucet, and N. Limnios. K-terminal network reliability measures with binary decision diagrams. Reliability, IEEE Transactions on, 56(3):506–515, 2007.
  • [13] P. Hintsanen. The most reliable subgraph problem. In PKDD, 2007.
  • [14] J. Hopcroft and R. Tarjan. Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM, 16(6):372–378, 1973.
  • [15] R. Jin, L. Liu, B. Ding, and H. Wang. Distance-constraint reachability computation in uncertain graphs. Proceedings of the VLDB Endowment, 4(9):551–562, 2011.
  • [16] M. Kasari, H. Toivonen, and P. Hintsanen. Fast discovery of reliable k-terminal subgraphs. In M. J. Zaki, J. X. Yu, B. Ravindran, and V. Pudi, editors, PAKDD, volume 6119, pages 168–177, 2010.
  • [17] H. Kellerer, U. Pferschy, and D. Pisinger. Introduction to NP-Completeness of knapsack problems. Springer, 2004.
  • [18] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In SIGKDD, pages 137–146, 2003.
  • [19] A. Khan, F. Bonchi, A. Gionis, and F. Gullo. Fast reliability search in uncertain graphs. In EDBT, pages 535–546, 2014.
  • [20] G. Kollios, M. Potamias, and E. Terzi. Clustering large probabilistic graphs. Knowledge and Data Engineering, IEEE Transactions on, 25(2):325–336, 2013.
  • [21] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In ECML-PKDD, pages 133–145. Springer, 2005.
  • [22] J. Leskovec and C. Faloutsos. Sampling from large graphs. In SIGKDD’06, pages 631–636, 2006.
  • [23] J. Leskovec and J. J. Mcauley. Learning to discover social circles in ego networks. In NIPS, pages 539–547, 2012.
  • [24] J. Li, Z. Zou, and H. Gao. Mining frequent subgraphs over uncertain graph databases under probabilistic semantics. The VLDB Journal, 21(6):753–777, 2012.
  • [25] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego, CA, October 2007.
  • [26] W. C. Navidi. Statistics for engineers and scientists. McGraw-Hill New York, 2006.
  • [27] O. Papapetrou, E. Ioannou, and D. Skoutas. Efficient discovery of frequent subgraph patterns in uncertain graph databases. In EDBT, pages 355–366, 2011.
  • [28] M. Potamias, F. Bonchi, A. Gionis, and G. Kollios. k-nearest neighbors in uncertain graphs. PVLDB, 3(1):997–1008, 2010.
  • [29] J. S. Provan and M. O. Ball. Computing network reliability in time polynomial in the number of cuts. Operations Research, 32(3):516–526, 1984.
  • [30] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In SIGKDD, pages 61–70, 2002.
  • [31] G. Rubino. Network reliability evaluation. State-of-the art in performance modeling and simulation, pages 275–302, 1998.
  • [32] P. Sevon, L. Eronen, P. Hintsanen, K. Kulovesi, and H. Toivonen. Link discovery in graphs derived from biological databases. In DILS, pages 35–49, 2006.
  • [33] A. R. Sharafat and O. R. Ma’rouzi. All-terminal network reliability using recursive truncation algorithm. Reliability, IEEE Transactions on, 58(2):338–347, 2009.
  • [34] K. Sohrabi, J. Gao, V. Ailawadhi, and G. J. Pottie. Protocols for self-organization of a wireless sensor network. IEEE personal communications, 7(5):16–27, 2000.
  • [35] R. Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):146–160, 1972.
  • [36] B. Wellman, A. Q. Haase, J. Witte, and K. Hampton. Does the Internet increase, decrease, or supplement social capital? Social networks, participation, and community commitment, 2001.
  • [37] J. Westbrook and R. E. Tarjan. Maintaining bridge-connected and biconnected components on-line. Algorithmica, 7(1):433–464, 1992.
  • [38] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In ICDM. IEEE Computer Society, 2012.
  • [39] Y. Yuan, G. Wang, L. Chen, and H. Wang. Efficient keyword search on uncertain graph data. TKDE, 25(12):2767–2779, 2013.
  • [40] Z. Zou, J. Li, H. Gao, and S. Zhang. Finding top-k maximal cliques in an uncertain graph. In ICDE, pages 649–652, 2010.
  • [41] Z. Zou, J. Li, H. Gao, and S. Zhang. Mining frequent subgraph patterns from uncertain graph data. Knowledge and Data Engineering, IEEE Transactions on, 22(9):1203–1218, 2010.

Christian Frey is a research fellow at the Institute for Informatics at the Ludwig-Maximilians-Universität München, Germany. His research interests include query processing in (uncertain) graph databases, network analysis on large heterogeneous information networks and machine learning approaches on (heterogeneous) information networks/Knowledge Graphs.

Andreas Züfle is an assistant professor with the Department of Geography and Geoinformation Science at George Mason University. Dr. Züfle’s research expertise includes big spatial data, spatial data mining, social network mining, and uncertain database management. Since 2016, Dr. Züfle’s research has received more than $2,000,000 in research grants from the National Science Foundation (NSF) and the Defense Advanced Research Projects Agency (DARPA). Since 2011, Dr. Züfle has published more than 60 papers in refereed conferences and journals and has an h-index of 16.

Tobias Emrich received his PhD in Computer Science from LMU, Munich in 2013. He then did his Post-Doc at the Integrated Media Systems Center at the University of Southern California in 2014. In 2015 he went back to LMU to become the Director of the Data Science Lab. Since then he has started and led many industry collaborations on Data Science topics with companies such as Siemens, Volkswagen, Roche, and IAV. His research interests include similarity search and data mining in spatial, temporal, uncertain and dynamic graph databases. To date he has more than 40 publications in refereed conferences.

Matthias Renz is an associate professor at the Computational and Data Science Department at George Mason University. He received his PhD in computer science at the Ludwig-Maximilians-Universität (LMU) Munich in 2006, and his habilitation in 2011. His main research topics are data science, scientific and spatial databases, data mining and uncertain databases. To date, he has more than 120 peer-reviewed publications that in total have received over 2200 citations.
