# Absorbing random-walk centrality: Theory and algorithms

Charalampos Mavroforakis
Dept. of Computer Science
Boston University
Boston, U.S.A.
cmav@cs.bu.edu
Michael Mathioudakis and Aristides Gionis
Helsinki Institute for Information Technology HIIT
Dept. of Computer Science, Aalto University
Helsinki, Finland
firstname.lastname@aalto.fi
###### Abstract

We study a new notion of graph centrality based on absorbing random walks. Given a graph $G=(V,E)$ and a set of query nodes $Q \subseteq V$, we aim to identify the $k$ most central nodes in $V$ with respect to $Q$. Specifically, we consider central nodes to be absorbing for random walks that start at the query nodes $Q$. The goal is to find the set of $k$ central nodes that minimizes the expected length of a random walk until absorption. The proposed measure, which we call absorbing random-walk centrality, favors diverse sets, as it is beneficial to place the $k$ absorbing nodes in different parts of the graph so as to “intercept” random walks that start from different query nodes.

Although similar problem definitions have been considered in the literature, e.g., in information-retrieval settings where the goal is to diversify web-search results, in this paper we study the problem formally and prove some of its properties. We show that the problem is NP-hard, while the objective function is monotone and supermodular, implying that a greedy algorithm provides solutions with an approximation guarantee. On the other hand, the greedy algorithm involves expensive matrix operations that make it prohibitive to employ on large datasets. To confront this challenge, we develop more efficient algorithms based on spectral clustering and on personalized PageRank.

graph mining; node centrality; random walks

## I Introduction

A fundamental problem in graph mining is to identify the most central nodes in a graph. Numerous centrality measures have been proposed, including degree centrality, closeness centrality [14], betweenness centrality [5], random-walk centrality [13], Katz centrality [9], and PageRank [4].

In the interest of robustness many centrality measures use random walks: while the shortest-path distance between two nodes can change dramatically by inserting or deleting a single edge, distances based on random walks account for multiple paths and offer a more global view of the connectivity between two nodes. In this spirit, the random-walk centrality of one node with respect to all nodes of the graph is defined as the expected time needed to come across this node in a random walk that starts in any other node of the graph [13].

In this paper, we consider a measure that generalizes random-walk centrality for a set of nodes $C$ with respect to a set of query nodes $Q$. Our centrality measure is defined as the expected length of a random walk that starts from any node in $Q$ until it reaches any node in $C$, at which point the random walk is “absorbed” by $C$. Moreover, to allow for adjustable importance of query nodes in the centrality measure, we consider random walks with restarts, which occur with a fixed probability $\alpha$ at each step of the random walk. The resulting computational problem is to find a set of $k$ nodes $C$ that optimizes this measure with respect to nodes $Q$, which are provided as input. We call this measure absorbing random-walk centrality and the corresponding optimization problem $k$-arw-Centrality.

To motivate the $k$-arw-Centrality problem, let us consider the scenario of searching the Web graph and summarizing the search results. In this scenario, nodes of the graph correspond to webpages, edges between nodes correspond to links between pages, and the set of query nodes $Q$ consists of all nodes that match a user query, i.e., all webpages that satisfy a keyword search. Assuming that the size of $Q$ is large, the goal is to find the $k$ most central nodes with respect to $Q$, and present those to the user.

It is clear that ordering the nodes of the graph by their individual random-walk centrality scores and taking the top-$k$ set does not solve the $k$-arw-Centrality problem, as these nodes may all be located in the same “neighborhood” of the graph, and thus, may not provide a good absorbing set for the query. On the other hand, as the goal is to minimize the expected absorption time for walks starting at $Q$, the optimal solution to the $k$-arw-Centrality problem will be a set of $k$ nodes that are both centrally placed and diverse.

This observation has motivated researchers in the information-retrieval field to consider random walks with absorbing states in order to diversify web-search results [18]. However, despite the fact that similar problem definitions and algorithms have been considered earlier, the $k$-arw-Centrality problem has not been formally studied and there has not been a theoretical analysis of its properties.

Our key results in this paper are the following: we show that the $k$-arw-Centrality problem is NP-hard, and we show that the absorbing random-walk centrality measure is monotone and supermodular. The latter property allows us to quantify the approximation guarantee obtained by a natural greedy algorithm, which has also been considered by previous work [18]. Furthermore, a naïve implementation of the greedy algorithm requires many expensive matrix inversions, which make the algorithm particularly slow. Part of our contribution is to show how to make use of the Sherman-Morrison inversion formula to implement the greedy algorithm with only one matrix inversion and more efficient matrix-vector multiplications.

Moreover, we explore the performance of faster, heuristic algorithms, aiming to identify methods that are faster than the greedy approach without significant loss in the quality of results. The heuristic algorithms we consider include the personalized PageRank algorithm [4, 10] as well as algorithms based on spectral clustering [17]. We find that, in practice, the personalized PageRank algorithm offers a very good trade-off between speed and quality.

The rest of the paper is organized as follows. In Section II, we overview previous work and discuss how it compares to this paper. We define our problem in Section III and provide basic background results on absorbing random walks in Section IV. Our main technical contributions are given in Sections V and VI, where we characterize the complexity of the problem, and provide the details of the greedy algorithm and the heuristics we explore. We evaluate the performance of algorithms in Section VII, over a range of real-world graphs, and Section VIII is a short conclusion. Proofs for some of the theorems shown in the paper are provided in the Appendix.

## II Related work

Many works in the literature explore ways to quantify the notion of node centrality on graphs [3]. Some of the most commonly-used measures include the following: (i) degree centrality, where the centrality of a node is simply quantified by its degree; (ii) closeness centrality [11, 14], defined as the average distance of a node from all other nodes on the graph; (iii) betweenness centrality [5], defined as the number of shortest paths between pairs of nodes in the graph that pass through a given node; (iv) eigenvector centrality, defined as the stationary probability that a Markov chain on the graph visits a given node, with Katz centrality [9] and PageRank [4] being two well-studied variants; and (v) random-walk centrality [13], defined as the expected first passage time of a random walk from a given node, when it starts from a random node of the graph. The measure we study in this paper generalizes the notion of random-walk centrality to a set of absorbing nodes.

Absorbing random walks have been used in previous work to select a diverse set of nodes from a graph. For example, an algorithm proposed by Zhu et al. [18] selects nodes in the following manner: (i) the first node is selected based on its PageRank value and is set as absorbing; (ii) the next node to be selected is the node that maximizes the expected first-passage time from the already selected absorbing nodes. Our problem definition differs considerably from the one considered in that work, as in our work the expected first-passage times are always computed from the set of query nodes that are provided in the input, and not from the nodes that participate in the solution so far. In this respect, the greedy method proposed by Zhu et al. is not associated with a crisp problem definition.

Another conceptually related line of work aims to select a diverse subset of query results, mainly within the context of document retrieval [1, 2, 16]. The goal, there, is to select query results to optimize a function that quantifies the trade-off between relevance and diversity.

Our work is also remotely related to the problem studied by Leskovec et al. on cost-effective outbreak detection [12]. One of the problems discussed there is to select nodes in the network so that the detection time for a set of cascades is minimized. However, their work differs from ours in that they take as input a set of cascades, each one of finite size, while in our case the input consists of a set of query nodes and we consider a probabilistic model that generates random-walk paths of possibly infinite length.

## III Problem definition

We are given a graph $G=(V,E)$ over a set of nodes $V$ and set of undirected edges $E$. The number of nodes $|V|$ is denoted by $n$ and the number of edges $|E|$ by $m$. The input also includes a subset of nodes $Q \subseteq V$, to which we refer as the query nodes. As a special case, the set of query nodes may be equal to the whole set of nodes, i.e., $Q = V$.

Our goal is to find a set $C$ of $k$ nodes that are central with respect to the query nodes $Q$. For some applications it makes sense to restrict the central nodes to be only among the query nodes, while in other cases, the central nodes may include any node in $V$. To model those different scenarios, we consider a set of candidate nodes $D \subseteq V$, and require that the $k$ central nodes should belong in this candidate set, i.e., $C \subseteq D$. Some of the cases include $D = Q$, $D = V$, or $D = V \setminus Q$, but it could also be that $D$ is defined in some other way that does not involve $Q$. In general, we assume that $D$ is given as input.

The centrality of a set of nodes $C \subseteq D$ with respect to query nodes $Q$ is based on the notion of absorbing random walks and their expected length. More specifically, let us consider a random walk on the nodes $V$ of the graph that proceeds in discrete steps: the walk starts from a node $q \in Q$ and, at each step, moves to a different node, following edges in $G$, until it arrives at some node in $C$. The starting node $q$ of the walk is chosen according to a probability distribution $s$. When the walk arrives at a node $c \in C$ for the first time, it terminates, and we say that the random walk is absorbed by that node $c$. In the interest of generality, and to allow for adjustable importance of query nodes in the centrality measure, we also allow the random walk to restart. Restarts occur with a probability $\alpha$ at each step of the random walk, where $\alpha$ is a parameter that is specified as input to the problem. When restarting, the walk proceeds to a query node selected randomly according to $s$. Intuitively, larger values of $\alpha$ favor nodes that are closer to nodes $Q$.

We are interested in the expected length (i.e., number of steps) of the walk that starts from a query node $q \in Q$ until it gets absorbed by some node in $C$, and we denote this expected length by $\mathrm{ac}_Q^q(C)$. We then define the absorbing random-walk centrality of a set of nodes $C$ with respect to query nodes $Q$, by

$$\mathrm{ac}_Q(C) = \sum_{q \in Q} s(q)\, \mathrm{ac}_Q^q(C).$$

The problem we consider in this paper is the following.

###### Problem

($k$-arw-Centrality) We are given a graph $G=(V,E)$, a set of query nodes $Q \subseteq V$, a set of candidate nodes $D \subseteq V$, a starting probability distribution $s: V \rightarrow [0,1]$ over $V$ such that $s(v) = 0$ if $v \in V \setminus Q$, a restart probability $\alpha \in [0,1]$, and an integer $k$. We ask to find a set of $k$ nodes $C \subseteq D$ that minimizes $\mathrm{ac}_Q(C)$, i.e., the expected length of a random walk that starts from $Q$ and proceeds until it gets absorbed in some node in $C$.

In cases where we have no reason to distinguish among the query nodes, we consider the uniform starting probability distribution $s(q) = 1/|Q|$. In fact, for simplicity of exposition, hereinafter we focus on the case of uniform distribution. However, we note that all our definitions and techniques generalize naturally, not only to general starting probability distributions $s(q)$, but also to directed and weighted graphs.

## IV Absorbing random walks

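In this section we review some relevant background on absorbing random walks. Specifically, we discuss how to calculate the objective function $\mathrm{ac}_Q(C)$ for the problem defined in Section III.

Let $\mathbf{P}$ be the transition matrix for a random walk, with $\mathbf{P}(i,j)$ expressing the probability that the random walk will move to node $j$ given that it is currently at node $i$. Since random walks can only move to absorbing nodes $C$, but not away from them, we set $\mathbf{P}(c,c) = 1$ and $\mathbf{P}(c,j) = 0$, if $j \neq c$, for all absorbing nodes $c \in C$. The set $T = V \setminus C$ of non-absorbing nodes is called transient. If $N(i)$ are the neighbors of a node $i \in T$ and $d_i = |N(i)|$ its degree, the transition probabilities from node $i$ to other nodes are

$$\mathbf{P}(i,j) = \begin{cases} \alpha\, s(j) & \text{if } j \in Q \setminus N(i), \\ (1-\alpha)/d_i + \alpha\, s(j) & \text{if } j \in N(i). \end{cases} \tag{1}$$

Here, $s$ represents the starting probability vector. For example, for the uniform distribution over query nodes we have $s(i) = 1/|Q|$ if $i \in Q$ and $0$ otherwise. The transition matrix of the random walk can be written as follows

$$\mathbf{P} = \begin{pmatrix} \mathbf{P}_{TT} & \mathbf{P}_{TC} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}. \tag{2}$$

In the equation above, $\mathbf{I}$ is the $|C| \times |C|$ identity matrix and $\mathbf{0}$ a matrix with all its entries equal to $0$; $\mathbf{P}_{TT}$ is the $|T| \times |T|$ sub-matrix of $\mathbf{P}$ that contains the transition probabilities between transient nodes; and $\mathbf{P}_{TC}$ is the $|T| \times |C|$ sub-matrix of $\mathbf{P}$ that contains the transition probabilities from transient to absorbing nodes.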
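As an illustration of Equations (1) and (2), the following is a minimal sketch of how the transition matrix could be assembled with NumPy. The function name `transition_matrix`, the dense adjacency-matrix input, and the choice to keep nodes in their original order (rather than permuting transient nodes first) are assumptions made for this example; it also assumes an undirected graph with no isolated nodes.

```python
import numpy as np

def transition_matrix(adj, s, alpha, absorbing):
    """Row-stochastic matrix following Eq. (1)-(2); nodes keep their original order.

    adj: dense 0/1 adjacency matrix, s: start distribution (zero outside Q),
    alpha: restart probability, absorbing: iterable of absorbing node indices.
    """
    adj = np.asarray(adj, dtype=float)
    s = np.asarray(s, dtype=float)
    deg = adj.sum(axis=1)  # assumed strictly positive (no isolated nodes)
    # Transient rows: follow a uniformly chosen edge with probability 1 - alpha,
    # or restart at a query node drawn from s with probability alpha.
    P = (1.0 - alpha) * adj / deg[:, None] + alpha * s[None, :]
    # Absorbing rows: the walk never leaves an absorbing node.
    for c in absorbing:
        P[c, :] = 0.0
        P[c, c] = 1.0
    return P

# Hypothetical example: path 0-1-2-3, query nodes {0, 3}, node 1 absorbing.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
P_example = transition_matrix(path, [0.5, 0.0, 0.0, 0.5], alpha=0.1, absorbing=[1])
```

Selecting the rows and columns of the transient nodes from the returned matrix gives $\mathbf{P}_{TT}$.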

The probability of the walk being on node $j$ at exactly $\ell$ steps having started at node $i$, is given by the $(i,j)$-entry of the matrix $\mathbf{P}_{TT}^{\ell}$. Therefore, the expected total number of times that the random walk visits node $j$ having started from node $i$ is given by the $(i,j)$-entry of the matrix

$$\mathbf{F} = \sum_{\ell=0}^{\infty} \mathbf{P}_{TT}^{\ell} = (\mathbf{I} - \mathbf{P}_{TT})^{-1}, \tag{3}$$

which is known as the fundamental matrix of the absorbing random walk. Allowing the possibility to start the random walk at an absorbing node (and being absorbed immediately), we see that the expected length of a random walk that starts from node $i$ and gets absorbed by the set $C$ is given by the $i$-th element of the following vector

$$\mathbf{L} = \mathbf{L}_C = \begin{pmatrix} \mathbf{F} \\ \mathbf{0} \end{pmatrix} \mathbf{1}, \tag{4}$$

where $\mathbf{1}$ is a $|T|$-dimensional vector of all 1s. We write $\mathbf{L} = \mathbf{L}_C$ to emphasize the dependence on the set of absorbing nodes $C$.

The expected number of steps when starting from a node in $Q$ and until being absorbed by some node in $C$ is then obtained by summing over all query nodes, weighted by their starting probabilities, i.e.,

$$\mathrm{ac}_Q(C) = \mathbf{s}^T \mathbf{L}_C. \tag{5}$$
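The computation in Equations (3) to (5) can be sketched as follows. This is only an illustrative dense implementation: the function name `absorbing_centrality` is hypothetical, the graph is assumed connected with at least one absorbing node (so that $\mathbf{I} - \mathbf{P}_{TT}$ is invertible), and a linear solve is used in place of an explicit inverse since only $\mathbf{F}\mathbf{1}$ is needed.

```python
import numpy as np

def absorbing_centrality(adj, s, alpha, absorbing):
    """Exact ac_Q(C) = s^T L_C via the fundamental matrix, Eq. (3)-(5)."""
    adj = np.asarray(adj, dtype=float)
    s = np.asarray(s, dtype=float)
    n = adj.shape[0]
    absorbing = set(absorbing)
    # Transition probabilities of Eq. (1) for every node treated as transient.
    P = (1.0 - alpha) * adj / adj.sum(axis=1, keepdims=True) + alpha * s[None, :]
    T = [i for i in range(n) if i not in absorbing]
    P_TT = P[np.ix_(T, T)]
    # F 1 is the solution of (I - P_TT) x = 1; no explicit inverse is formed.
    L = np.zeros(n)  # absorbing nodes have expected length 0
    L[T] = np.linalg.solve(np.eye(len(T)) - P_TT, np.ones(len(T)))
    return float(s @ L)

# Hypothetical example: path 0-1-2 with Q = V, uniform starts, no restarts.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
ac_middle = absorbing_centrality(path3, uniform, 0.0, [1])
ac_end = absorbing_centrality(path3, uniform, 0.0, [0])
```

On this small path, absorbing the walk at the middle node gives a shorter expected length than absorbing it at an endpoint, which matches the intuition that central nodes intercept walks sooner.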

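### IV-A Efficient computation of absorbing centrality

Equation (5) pinpoints the difficulty of the problem we consider: even computing the objective function $\mathrm{ac}_Q(C)$ for a candidate solution $C$ requires an expensive matrix inversion, $\mathbf{F} = (\mathbf{I} - \mathbf{P}_{TT})^{-1}$. Furthermore, searching for the optimal set $C$ involves an exponential number of candidate sets, while evaluating each one of them requires a matrix inversion.

In practice, we find that we can compute $\mathrm{ac}_Q(C)$ much faster approximately, as shown in Algorithm 1. The algorithm follows from the infinite-sum expansion of Equation (5):

$$\begin{aligned} \mathrm{ac}_Q(C) &= \mathbf{s}^T \mathbf{L}_C = \mathbf{s}^T \begin{pmatrix} \mathbf{F} \\ \mathbf{0} \end{pmatrix} \mathbf{1} = \mathbf{s}^T \begin{pmatrix} \sum_{\ell=0}^{\infty} \mathbf{P}_{TT}^{\ell} \\ \mathbf{0} \end{pmatrix} \mathbf{1} \\ &= \mathbf{s}^T \sum_{\ell=0}^{\infty} \begin{pmatrix} \mathbf{P}_{TT}^{\ell} \\ \mathbf{0} \end{pmatrix} \mathbf{1} = \left( \sum_{\ell=0}^{\infty} \mathbf{s}^T \begin{pmatrix} \mathbf{P}_{TT}^{\ell} \\ \mathbf{0} \end{pmatrix} \right) \mathbf{1} \\ &= \left( \sum_{\ell=0}^{\infty} \mathbf{x}_{\ell} \right) \mathbf{1} = \sum_{\ell=0}^{\infty} \mathbf{x}_{\ell} \mathbf{1}, \end{aligned}$$

with

$$\mathbf{x}_0 = \mathbf{s}^T \quad \text{and} \quad \mathbf{x}_{\ell+1} = \mathbf{x}_{\ell} \begin{pmatrix} \mathbf{P}_{TT} \\ \mathbf{0} \end{pmatrix}. \tag{6}$$

Note that computing each vector $\mathbf{x}_{\ell}$ requires a single vector-matrix multiplication, i.e., time $O(n^2)$. Algorithm 1 terminates when the increase of the sum due to the latest term falls below a pre-defined threshold $\epsilon$.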
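The listing of Algorithm 1 is not reproduced in this text, so the following is a sketch of the iteration implied by Equation (6), under the assumption that the vectors $\mathbf{x}_{\ell}$ are restricted to the transient nodes. The function name, the default threshold, and the `max_iter` safeguard are hypothetical choices for this example.

```python
import numpy as np

def approx_absorbing_centrality(adj, s, alpha, absorbing, eps=1e-10, max_iter=100000):
    """Approximate ac_Q(C) by truncating the series sum_l x_l 1 of Eq. (6)."""
    adj = np.asarray(adj, dtype=float)
    s = np.asarray(s, dtype=float)
    n = adj.shape[0]
    P = (1.0 - alpha) * adj / adj.sum(axis=1, keepdims=True) + alpha * s[None, :]
    mask = np.ones(n, dtype=bool)
    mask[list(absorbing)] = False
    T = np.flatnonzero(mask)          # indices of transient nodes
    P_TT = P[np.ix_(T, T)]
    x = s[T].copy()                   # x_0: start distribution on transient nodes
    total = 0.0
    for _ in range(max_iter):
        term = x.sum()                # x_l 1 = Pr[walk not yet absorbed after l steps]
        total += term
        if term < eps:                # latest term no longer changes the sum noticeably
            break
        x = x @ P_TT                  # x_{l+1} = x_l P_TT
    return total

# Hypothetical example: path 0-1-2, Q = V, uniform starts, node 0 absorbing.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
approx_value = approx_absorbing_centrality(path3, [1 / 3, 1 / 3, 1 / 3], 0.0, [0])
```

Each iteration costs one vector-matrix product, so with a sparse representation of $\mathbf{P}_{TT}$ the per-step cost would scale with the number of edges rather than with $n^2$.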

## V Problem characterization

We now study the $k$-arw-Centrality problem in more detail. In particular, we show that the function $\mathrm{ac}_Q$ is monotone and supermodular, a property that is used later to provide an approximation guarantee for the greedy algorithm. We also show that $k$-arw-Centrality is NP-hard.

Recall that a function $f: 2^U \rightarrow \mathbb{R}$ over subsets of a ground set $U$ is submodular if it has the diminishing returns property

$$f(Y \cup \{u\}) - f(Y) \leq f(X \cup \{u\}) - f(X), \tag{7}$$

for all $X \subseteq Y \subseteq U$ and $u \in U \setminus Y$. The function $f$ is supermodular if $-f$ is submodular. Submodularity (and supermodularity) is a very useful property for designing algorithms. For instance, minimizing a submodular function is a polynomial-time solvable problem, while the maximization problem is typically amenable to approximation algorithms, the exact guarantee of which depends on other properties of the function and requirements of the problem, e.g., monotonicity, matroid constraints, etc.

Even though the objective function $\mathrm{ac}_Q(C)$ is given in closed form by Equation (5), to prove its properties we find it more convenient to work with its descriptive definition, namely, $\mathrm{ac}_Q(C)$ being the expected length for a random walk starting at nodes of $Q$ before being absorbed at nodes of $C$.

For the rest of this section we consider that the set of query nodes $Q$ is fixed, and for simplicity we write $\mathrm{ac} = \mathrm{ac}_Q$.

###### Proposition (Monotonicity)

For all $X \subseteq Y \subseteq V$ it is $\mathrm{ac}(Y) \leq \mathrm{ac}(X)$.

The proposition states that absorption time decreases with more absorbing nodes. The proof is given in the Appendix.

Next we show that the absorbing random-walk centrality measure is supermodular.

###### Proposition (Supermodularity)

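For all sets $X \subseteq Y \subseteq V$ and $u \in V \setminus Y$ it is

$$\mathrm{ac}(X) - \mathrm{ac}(X \cup \{u\}) \geq \mathrm{ac}(Y) - \mathrm{ac}(Y \cup \{u\}). \tag{8}$$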

###### Proof:

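Given an instantiation of a random walk, we define the following propositions for any pair of nodes $i, j \in V$, non-negative integer $\ell$, and set of nodes $Z \subseteq V$:

$A^{\ell}_{i,j}(Z)$:

The random walk started at node $i$ and visited node $j$ after exactly $\ell$ steps, without visiting any node in set $Z$.

$B^{\ell}_{i,j}(Z,u)$:

The random walk started at node $i$ and visited node $j$ after exactly $\ell$ steps, having previously visited node $u$ but without visiting any node in the set $Z$.

It is easy to see that the set of random walks for which $A^{\ell}_{i,j}(Z)$ is true can be partitioned into those that visited $u$ within the first $\ell$ steps and those that did not. Therefore, the probability that proposition $A^{\ell}_{i,j}(Z)$ is true for any instantiation of a random walk generated by our model is equal to

$$\Pr[A^{\ell}_{i,j}(Z)] = \Pr[A^{\ell}_{i,j}(Z \cup \{u\})] + \Pr[B^{\ell}_{i,j}(Z,u)]. \tag{9}$$

Now, let $\Lambda(Z)$ be the number of steps for a random walk to reach the nodes in $Z$. $\Lambda(Z)$ is a random variable and its expected value over all random walks generated by our model is equal to $\mathrm{ac}(Z)$. Note that the proposition $\Lambda(Z) \geq \ell + 1$ is true for a given instantiation of a random walk only if there is a pair of nodes $q \in Q$ and $j \in V \setminus Z$, for which the proposition $A^{\ell}_{q,j}(Z)$ is true. Therefore,

$$\Pr[\Lambda(Z) \geq \ell + 1] = \sum_{q \in Q} \sum_{j \in V \setminus Z} \Pr[A^{\ell}_{q,j}(Z)]. \tag{10}$$

From the above, it is easy to calculate $\mathrm{ac}(Z)$ as

$$\begin{aligned} \mathrm{ac}(Z) &= \mathbb{E}[\Lambda(Z)] = \sum_{\ell=0}^{\infty} \ell \Pr[\Lambda(Z) = \ell] \\ &= \sum_{\ell=1}^{\infty} \Pr[\Lambda(Z) \geq \ell] = \sum_{\ell=0}^{\infty} \Pr[\Lambda(Z) \geq \ell + 1] \\ &= \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus Z} \Pr[A^{\ell}_{q,j}(Z)]. \end{aligned} \tag{11}$$

The final property we will need is the observation that, for $X \subseteq Y$, $B^{\ell}_{i,j}(Y,u)$ implies $B^{\ell}_{i,j}(X,u)$ and thus

$$\Pr[B^{\ell}_{i,j}(X,u)] \geq \Pr[B^{\ell}_{i,j}(Y,u)]. \tag{12}$$

By using Equation (11), the Inequality (8) can be rewritten as

$$\begin{aligned} &\sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus (X \cup \{u\})} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] \\ &\quad \geq \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{\ell=0}^{\infty} \sum_{q \in Q} \sum_{j \in V \setminus (Y \cup \{u\})} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})]. \end{aligned} \tag{13}$$

We only need to show that the inequality holds for an arbitrary value of $\ell$ and $q \in Q$, that is

$$\begin{aligned} &\sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{j \in V \setminus (X \cup \{u\})} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] \\ &\quad \geq \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{j \in V \setminus (Y \cup \{u\})} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})]. \end{aligned} \tag{14}$$

Notice that $\Pr[A^{\ell}_{q,j}(Z)] = 0$ whenever $j \in Z$, since the walk cannot be at a node of $Z$ without having visited $Z$. In particular, $\Pr[A^{\ell}_{q,u}(X \cup \{u\})] = \Pr[A^{\ell}_{q,u}(Y \cup \{u\})] = 0$, so we can rewrite the above inequality as

$$\begin{aligned} &\sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] \\ &\quad \geq \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})]. \end{aligned} \tag{15}$$

To show the latter inequality we start from the left-hand side, apply Equation (9), and then use Inequality (12) together with the fact that $V \setminus Y \subseteq V \setminus X$. We have

$$\begin{aligned} \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X)] - \sum_{j \in V \setminus X} \Pr[A^{\ell}_{q,j}(X \cup \{u\})] &= \sum_{j \in V \setminus X} \Pr[B^{\ell}_{q,j}(X,u)] \\ &\geq \sum_{j \in V \setminus Y} \Pr[B^{\ell}_{q,j}(Y,u)] \\ &= \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y)] - \sum_{j \in V \setminus Y} \Pr[A^{\ell}_{q,j}(Y \cup \{u\})], \end{aligned}$$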

which completes the proof.

Finally, we establish the hardness of absorbing centrality, for the problem defined in Section III.

###### Theorem

The $k$-arw-Centrality problem is NP-hard.

###### Proof:

We obtain a reduction from the VertexCover problem [6]. An instance of the VertexCover problem is specified by a graph $G=(V,E)$ and an integer $k$, and asks whether there exists a set of nodes $C \subseteq V$ such that $|C| = k$ and $C$ is a vertex cover (i.e., for every $(u,v) \in E$ it is $\{u,v\} \cap C \neq \emptyset$). Let $|V| = n$.

Given an instance of the VertexCover problem, we construct an instance of the decision version of $k$-arw-Centrality by taking the same graph $G=(V,E)$ with query nodes $Q = V$ and asking whether there is a set of absorbing nodes $C$ such that $|C| = k$ and $\mathrm{ac}_Q(C) \leq 1 - \frac{k}{n}$.

We will show that $C$ is a solution for VertexCover if and only if $\mathrm{ac}_Q(C) \leq 1 - \frac{k}{n}$.

Assume first that $C$ is a vertex cover. Consider a random walk starting uniformly at random from a node $v \in Q = V$. If $v \in C$ then the length of the walk will be 0, as the walk will be absorbed immediately. This happens with probability $|C|/|V| = k/n$. Otherwise, if $v \notin C$ the length of the walk will be 1, as the walk will be absorbed in the next step (since $C$ is a vertex cover, all the neighbors of $v$ need to belong in $C$). This happens with the rest of the probability, $1 - k/n$. Thus, the expected length of the random walk is

$$\mathrm{ac}_Q(C) = 0 \cdot \frac{k}{n} + 1 \cdot \left(1 - \frac{k}{n}\right) = 1 - \frac{k}{n}. \tag{16}$$

Conversely, assume that $C$ is not a vertex cover for $G$. Then, there should be an uncovered edge $(u,v)$. A random walk that starts in $u$ and then goes to $v$ (or starts in $v$ and then goes to $u$) will have length at least 2, and this happens with probability at least $\frac{2}{n^2}$. Then, following a similar reasoning as in the previous case, we have

$$\begin{aligned} \mathrm{ac}_Q(C) &= \sum_{\ell=0}^{\infty} \ell \Pr(\text{absorbed in exactly } \ell \text{ steps}) \\ &= \sum_{\ell=1}^{\infty} \Pr(\text{absorbed after at least } \ell \text{ steps}) \\ &\geq \left(1 - \frac{k}{n}\right) + \frac{2}{n^2} > 1 - \frac{k}{n}. \end{aligned} \tag{17}$$
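The two cases of the reduction can be checked numerically on a tiny instance. The sketch below assumes no restarts ($\alpha = 0$), which the argument above implicitly relies on, together with $Q = V$ and uniform starting probabilities; the helper names `ac_uniform` and `is_vertex_cover` are hypothetical.

```python
import numpy as np

def ac_uniform(adj, absorbing):
    """ac_Q(C) for Q = V, uniform starts, and no restarts (alpha = 0)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    P = adj / adj.sum(axis=1, keepdims=True)
    T = [i for i in range(n) if i not in absorbing]
    L = np.zeros(n)
    if T:
        L[T] = np.linalg.solve(np.eye(len(T)) - P[np.ix_(T, T)], np.ones(len(T)))
    return float(L.mean())

def is_vertex_cover(adj, nodes):
    """True if every edge has at least one endpoint in `nodes`."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    return all(u in nodes or v in nodes
               for u in range(n) for v in range(u + 1, n) if adj[u, v])

# 4-cycle 0-1-2-3-0 with k = 2, so the threshold 1 - k/n equals 0.5.
cycle = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
cover, non_cover = {0, 2}, {0, 1}
ac_cover = ac_uniform(cycle, cover)
ac_non_cover = ac_uniform(cycle, non_cover)
```

Here `{0, 2}` covers every edge and attains the threshold exactly, while `{0, 1}` leaves the edge between nodes 2 and 3 uncovered and exceeds it.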

## VI Algorithms

This section presents algorithms to solve the $k$-arw-Centrality problem. In all cases, the set of query nodes $Q \subseteq V$ is given as input, along with a set of candidate nodes $D \subseteq V$ and the restart probability $\alpha$.

### VI-A Greedy approach

The first algorithm is a standard greedy algorithm, denoted Greedy, which exploits the supermodularity of the absorbing random-walk centrality measure. It starts with the result set $C$ equal to the empty set, and iteratively adds a node from the set of candidate nodes $D$, until $k$ nodes are added. In each iteration the node added to the set $C$ is the one that brings the largest improvement to $\mathrm{ac}_Q$.

As shown before, the objective function to be minimized, i.e., $\mathrm{ac}_Q$, is supermodular and monotonically decreasing. The Greedy algorithm is not an approximation algorithm for this minimization problem. However, it can be shown to provide an approximation guarantee for maximizing the absorbing centrality gain measure, defined below.

###### Definition (Absorbing centrality gain)

Given a graph $G$, a set of query nodes $Q$, and a set of candidate nodes $D$, the absorbing centrality gain of a set of nodes $C \subseteq D$ is defined as

$$\mathrm{acg}_Q(C) = m_Q - \mathrm{ac}_Q(C),$$

where $m_Q = \min_{v \in D} \mathrm{ac}_Q(\{v\})$.

Justification of the gain function. The reason to define the absorbing centrality gain is to turn our problem into a submodular-maximization problem so that we can apply standard approximation-theory results and show that the greedy algorithm provides a constant-factor approximation guarantee. The shift $m_Q$ quantifies the absorbing centrality of the best single node in the candidate set. Thus, the value of $\mathrm{acg}_Q(C)$ expresses how much we gain in expected random-walk length when we use the set $C$ as absorbing nodes compared to when we use the best single node. Our goal is to maximize this gain.

Observe that the gain function $\mathrm{acg}_Q$ is not non-negative everywhere. Take for example any node $u$ such that $\mathrm{ac}_Q(\{u\}) > m_Q$. Then, $\mathrm{acg}_Q(\{u\}) < 0$. Note also that we could have obtained a non-negative gain function by defining gain with respect to the worst single node, instead of the best. In other words, the gain function $\mathrm{acg}'_Q(C) = M_Q - \mathrm{ac}_Q(C)$, with $M_Q = \max_{v \in D} \mathrm{ac}_Q(\{v\})$, is non-negative everywhere.

Nevertheless, the reason we use the gain function $\mathrm{acg}_Q$ instead of $\mathrm{acg}'_Q$ is that $\mathrm{acg}'_Q$ takes much larger values than $\mathrm{acg}_Q$, and thus, a multiplicative approximation guarantee on $\mathrm{acg}'_Q$ is a weaker result than a multiplicative approximation guarantee on $\mathrm{acg}_Q$. On the other hand, our definition of $\mathrm{acg}_Q$ creates a technical difficulty with the approximation guarantee, which is defined for non-negative functions. Luckily, this difficulty can be overcome easily by noting that, due to the monotonicity of $\mathrm{acg}_Q$, for any $k > 1$, the optimal solution of the function $\mathrm{acg}_Q$, as well as the solution returned by Greedy, are both non-negative.

Approximation guarantee. The fact that the Greedy algorithm gives an approximation guarantee to the problem of maximizing absorbing centrality gain is a standard result from the theory of submodular functions.

###### Proposition

The function $\mathrm{acg}_Q$ is monotonically increasing, and submodular.

###### Proposition

Let $k > 1$. For the problem of finding a set $C \subseteq D$ with $|C| \leq k$, such that $\mathrm{acg}_Q(C)$ is maximized, the Greedy algorithm gives a $\left(1 - \frac{1}{e}\right)$-approximation guarantee.

We now discuss the complexity of the Greedy algorithm. A naïve implementation requires computing the absorbing centrality $\mathrm{ac}_Q(C)$ using Equation (5) for each set $C$ that needs to be evaluated during the execution of the algorithm. However, applying Equation (5) involves a matrix inversion, which is a very expensive operation. Furthermore, the number of times that we need to evaluate $\mathrm{ac}_Q(C)$ is $O(k|D|)$, as for each of the $k$ iterations of the greedy we need to evaluate the improvement over the current set of each of the $O(|D|)$ candidates. The number of candidates can be very large, e.g., $|D| = n$, yielding an $O(kn^4)$ algorithm when each inversion takes $O(n^3)$ time, which is prohibitively expensive.
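For reference, the naïve variant discussed above can be sketched as follows. This is an illustrative version that re-solves a linear system for every candidate in every iteration, which is exactly the cost the rest of this section seeks to avoid; the function names are hypothetical, and the sketch assumes a connected graph and $k \leq |D|$.

```python
import numpy as np

def absorbing_centrality(P, s, absorbing):
    """Exact ac_Q(C) from the restart transition matrix P of Eq. (1)."""
    n = P.shape[0]
    T = [i for i in range(n) if i not in absorbing]
    L = np.zeros(n)
    if T:
        L[T] = np.linalg.solve(np.eye(len(T)) - P[np.ix_(T, T)], np.ones(len(T)))
    return float(s @ L)

def naive_greedy(adj, s, alpha, candidates, k):
    """Add, k times, the candidate that yields the smallest ac_Q."""
    adj = np.asarray(adj, dtype=float)
    s = np.asarray(s, dtype=float)
    P = (1.0 - alpha) * adj / adj.sum(axis=1, keepdims=True) + alpha * s[None, :]
    chosen, value = [], float("inf")
    for _ in range(k):
        best, best_val = None, float("inf")
        for v in candidates:
            if v in chosen:
                continue
            val = absorbing_centrality(P, s, set(chosen) | {v})  # one solve per candidate
            if val < best_val:
                best, best_val = v, val
        chosen.append(best)
        value = best_val
    return chosen, value

# Hypothetical example: path 0-1-2-3-4, Q = D = V, uniform starts, no restarts.
path5 = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
first_pick, first_val = naive_greedy(path5, np.full(5, 0.2), 0.0, range(5), 1)
two_picks, two_val = naive_greedy(path5, np.full(5, 0.2), 0.0, range(5), 2)
```

On this path the first pick is the middle node. The second pick is decided by tie-breaking, since several candidates give the same centrality value.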

We can show, however, that we can execute Greedy significantly more efficiently. Specifically, we can prove the following two propositions.

###### Proposition

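Let $C_{i-1}$ be a set of $i-1$ absorbing nodes, $\mathbf{P}_{i-1}$ the corresponding transition matrix, and let $\mathbf{F}_{i-1}$ be the fundamental matrix obtained from it as in Equation (3). Let $C_i = C_{i-1} \cup \{u\}$. Given $\mathbf{F}_{i-1}$, the value $\mathrm{ac}_Q(C_i)$ can be computed in $O(n^2)$.

###### Proposition

Let $C$ be a set of absorbing nodes, $\mathbf{P}$ the corresponding transition matrix, and $\mathbf{F}$ the fundamental matrix obtained from it as in Equation (3). Let $C' = (C \setminus \{v\}) \cup \{u\}$, with $v \in C$ and $u \in V \setminus C$. Given $\mathbf{F}$, the value $\mathrm{ac}_Q(C')$ can be computed in time $O(n^2)$.

The proofs of these two propositions can be found in the Appendix. The first proposition implies that in order to compute $\mathrm{ac}_Q(C_i)$ for absorbing nodes $C_i$ in $O(n^2)$, it is enough to maintain the matrix $\mathbf{F}_{i-1}$, computed in the previous step of the greedy algorithm for absorbing nodes $C_{i-1}$. The second proposition, on the other hand, implies that we can compute the absorbing centrality of each set of absorbing nodes of a fixed size $i$ in $O(n^2)$, given the matrix $\mathbf{F}$, which is computed for one arbitrary set of absorbing nodes $C$ of size $i$. Combined, the two propositions above yield a greedy algorithm that runs in $O(kn^3)$ and offers the approximation guarantee discussed above. We outline it as Algorithm 2.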
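One way such a rank-one update might be realized with the Sherman-Morrison formula is sketched below. The construction is an assumption made for illustration and may differ from the one used in the Appendix: it works with an $n \times n$ matrix in which the rows of absorbing nodes are zeroed (the walk stops there), so that making node $u$ absorbing changes a single row. All function names are hypothetical.

```python
import numpy as np

def killed_matrix(P, absorbing):
    """Copy of P whose absorbing rows are zeroed, so that I - P_killed is invertible."""
    Pk = P.copy()
    Pk[list(absorbing), :] = 0.0
    return Pk

def add_absorbing_node(F, Pk, u):
    """Sherman-Morrison update of F = (I - Pk)^{-1} after zeroing row u of Pk, in O(n^2)."""
    col = F[:, u]                    # F e_u
    row = Pk[u, :] @ F               # p_u^T F, where p_u is the old row u of Pk
    F_new = F - np.outer(col, row) / (1.0 + row[u])
    Pk_new = Pk.copy()
    Pk_new[u, :] = 0.0
    return F_new, Pk_new

def centrality_from_F(F, s, absorbing):
    """ac_Q(C): expected number of visits to transient nodes, weighted by s."""
    transient = [j for j in range(F.shape[0]) if j not in absorbing]
    return float(s @ F[:, transient].sum(axis=1))

# Hypothetical example: 5-cycle with chord 0-2, query nodes {0, 1}, alpha = 0.15.
adj = np.zeros((5, 5))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    adj[a, b] = adj[b, a] = 1.0
s = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
P = 0.85 * adj / adj.sum(axis=1, keepdims=True) + 0.15 * s[None, :]

Pk = killed_matrix(P, {3})
F = np.linalg.inv(np.eye(5) - Pk)            # the single explicit inversion
F_upd, Pk_upd = add_absorbing_node(F, Pk, 0)  # absorbing set becomes {3, 0}
F_direct = np.linalg.inv(np.eye(5) - killed_matrix(P, {3, 0}))
```

In this sketch the updated matrix agrees with the one obtained by inverting from scratch, so only the first absorbing set requires an explicit inversion.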

Practical speed-up. We found that the following heuristic lets us speed up Greedy even further, with no significant loss in the quality of results. To select the first node for the solution set $C$ (see Algorithm 2), we calculate the PageRank values of all nodes in $D$ and evaluate $\mathrm{ac}_Q$ only for the nodes with the highest PageRank scores, where the number of evaluated nodes is a fixed parameter. In what follows, we will be using this heuristic version of Greedy, unless explicitly stated otherwise.

### VI-B Efficient heuristics

Even though Greedy runs in polynomial time, it can be quite inefficient when employed on moderately sized datasets (more than some tens of thousands of nodes). We thus describe algorithms that we study as efficient heuristics for the problem. These algorithms do not offer any guarantee on their performance.

Spectral methods have been used extensively for the problem of graph partitioning. Motivated by the wide applicability of this family of algorithms, here we explore three spectral algorithms: SpectralQ, SpectralC, and SpectralD. We start by a brief overview of the spectral method; a comprehensive presentation can be found in the tutorial by von Luxburg [17].

The main idea of spectral approaches is to project the original graph into a low-dimensional Euclidean space so that distances between nodes in the graph correspond to Euclidean distances between the corresponding projected points. A standard spectral embedding method, proposed by Shi and Malik [15], uses the “random-walk” Laplacian matrix $\mathbf{L}_{rw} = \mathbf{I} - \mathbf{\Delta}^{-1}\mathbf{A}$ of a graph $G$, where $\mathbf{A}$ is the adjacency matrix of the graph and $\mathbf{\Delta}$ the diagonal matrix of node degrees, and forms the matrix $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_d]$ whose columns are the eigenvectors of $\mathbf{L}_{rw}$ that correspond to the smallest $d$ eigenvalues $\lambda_1 \leq \ldots \leq \lambda_d$, with $d$ being the target dimension of the projection. The spectral embedding is then defined by mapping the $i$-th node of the graph to a point in $\mathbb{R}^d$, which is the $i$-th row of the matrix $\mathbf{U}$.
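The embedding step can be sketched as follows. As an implementation choice made for this example only, the eigenvectors of the random-walk Laplacian $\mathbf{I} - \mathbf{\Delta}^{-1}\mathbf{A}$ (with $\mathbf{\Delta}$ the diagonal degree matrix) are obtained from the symmetric normalized Laplacian, which has the same eigenvalues and can be handled by `numpy.linalg.eigh`. The function name is hypothetical and the graph is assumed to have no isolated nodes.

```python
import numpy as np

def spectral_embedding(adj, d):
    """Return the d smallest eigenvalues of L_rw and the n x d embedding matrix U."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)
    # L_sym = I - Delta^{-1/2} A Delta^{-1/2} is symmetric and similar to L_rw.
    L_sym = np.eye(len(deg)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    # If L_sym w = lam w, then u = Delta^{-1/2} w satisfies L_rw u = lam u.
    U = inv_sqrt[:, None] * vecs[:, :d]
    return vals[:d], U

# Hypothetical example: two triangles joined by the edge 2-3.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
eigvals, U = spectral_embedding(adj, 2)
```

Row $i$ of `U` is the embedding of node $i$; clustering the rows with $k$-means is the step that the spectral heuristics in this section build on.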

The algorithms we explore are adaptations of the spectral method. They all start by computing the spectral embedding $\mathbf{U}$, as described above, and then proceed as follows:

SpectralQ performs $k$-means clustering on the embeddings of the query nodes, where $k$ is the desired size of the result set. Subsequently, it selects candidate nodes that are close to the computed centroids. Specifically, if $s_i$ is the size of the $i$-th cluster, then $k_i$ candidate nodes are selected whose embedding is the nearest to the $i$-th centroid. The number $k_i$ is selected so that $k_i \propto s_i$ and $\sum_i k_i = k$.

SpectralC is similar to SpectralQ, but it performs the $k$-means clustering on the embeddings of the candidate nodes, instead of the query nodes.

SpectralD performs $k$-means clustering on the embeddings of the query nodes, where $k$ is the desired result-set size. Then, it selects the $k$ candidate nodes whose embeddings minimize the sum of squared $\ell_2$-distances from the centroids, with no consideration of the relative sizes of the clusters.

Personalized PageRank (PPR). This is the standard PageRank [4] algorithm with a damping factor equal to the restart probability $\alpha$ of the random walk and personalization probabilities equal to the start probabilities $s(q)$. Algorithm PPR returns the $k$ nodes with highest PageRank values.
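A minimal power-iteration sketch of the PPR heuristic is given below. In this sketch `alpha` is the probability of restarting at a query node (so the walk follows an edge with probability `1 - alpha`); the function names, the tolerance, and the dense matrix representation are assumptions made for the example rather than details of the implementation evaluated in the paper.

```python
import numpy as np

def personalized_pagerank(adj, s, alpha, tol=1e-12, max_iter=10000):
    """Stationary distribution of the walk that restarts from s with probability alpha."""
    adj = np.asarray(adj, dtype=float)
    s = np.asarray(s, dtype=float)
    W = adj / adj.sum(axis=1, keepdims=True)   # plain random-walk transition matrix
    pi = s.copy()
    for _ in range(max_iter):
        nxt = alpha * s + (1.0 - alpha) * (pi @ W)
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

def ppr_top_k(adj, s, alpha, candidates, k):
    """Return the k candidates with the highest personalized PageRank score."""
    pi = personalized_pagerank(adj, s, alpha)
    return sorted(candidates, key=lambda v: -pi[v])[:k]

# Hypothetical example: star with centre 0 and leaves 1..4; query nodes {1, 2}.
star = np.zeros((5, 5))
star[0, 1:] = star[1:, 0] = 1.0
s = np.array([0.0, 0.5, 0.5, 0.0, 0.0])
scores = personalized_pagerank(star, s, 0.15)
top = ppr_top_k(star, s, 0.15, range(5), 1)
```

Restricting `candidates` to the query nodes or passing all nodes corresponds to the two candidate-set settings considered in the experiments.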

Degree and distance centrality. Finally, we consider the standard degree and distance centrality measures.

Degree returns the $k$ highest-degree nodes. Note that this baseline is oblivious to the query nodes.

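Distance returns the $k$ nodes with highest distance centrality with respect to $Q$. The distance centrality of a node $u \in V$ is defined as $\mathrm{dc}(u) = \left(\sum_{v \in Q} d(u,v)\right)^{-1}$, where $d(u,v)$ is the shortest-path distance between $u$ and $v$.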
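The two baselines can be sketched in plain Python as follows. The sketch assumes that distance centrality is the inverse of the summed shortest-path distances to the query nodes, which is an assumption of this example; the adjacency-dictionary representation and the function names are likewise hypothetical.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Shortest-path (hop) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_top_k(neighbors, candidates, k):
    """The k candidates of highest degree; the query nodes play no role."""
    return sorted(candidates, key=lambda v: -len(neighbors[v]))[:k]

def distance_top_k(neighbors, query, candidates, k):
    """The k candidates with the highest (assumed) distance centrality w.r.t. the query."""
    def centrality(u):
        dist = bfs_distances(neighbors, u)
        if any(q not in dist for q in query):
            return 0.0                       # some query node is unreachable
        total = sum(dist[q] for q in query)
        return float("inf") if total == 0 else 1.0 / total
    return sorted(candidates, key=lambda v: -centrality(v))[:k]

# Hypothetical example: star with centre 0 and leaves 1, 2, 3 as query nodes.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
by_degree = degree_top_k(star, list(star), 1)
by_distance = distance_top_k(star, {1, 2, 3}, list(star), 1)
```

Both baselines run one breadth-first search per candidate at most, which is consistent with them being inexpensive compared with evaluating absorbing centrality.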

## VII Experimental evaluation

### VII-A Datasets

We evaluate the algorithms described in Section VI on two sets of real graphs: one set of small graphs that allows us to compare the performance of the fast heuristics against the greedy approach; and one set of larger graphs, to compare the performance of the heuristics against each other on datasets of larger scale. Note that the bottleneck of the computation lies in the evaluation of centrality. Even though the technique we describe in Section IV-A allows it to scale to datasets of tens of thousands of nodes on a single processor, it is still prohibitively expensive for massive graphs. Still, our experimentation allows us to discover the traits of the different algorithms and understand what performance to anticipate when they are employed on graphs of massive size.

The datasets are listed in Table I. Small graphs are obtained from Mark Newman’s repository, larger graphs from SNAP. For kddCoauthors, livejournal, and roadnet we use samples of the original datasets. In the interest of repeatability, our code and datasets are made publicly available.

### VII-B Evaluation Methodology

Each experiment in our evaluation framework is defined by a graph $G$, a set of query nodes $Q$, a set of candidate nodes $D$, and an algorithm to solve the problem. We evaluate all algorithms presented in Section VI. For the set of candidate nodes $D$, we consider two cases: it is equal to either the set of query nodes, i.e., $D = Q$, or the set of all nodes, i.e., $D = V$.

Query nodes $Q$ are selected randomly, using the following process: First, we select a set of seed nodes, uniformly at random among all nodes. Then, we select a ball of predetermined radius $r$ around each seed (a separate radius is used for the planar roadnet dataset). Finally, from all balls, we select a set of query nodes $Q$ of predetermined size, which is set separately for the small and the larger datasets. Selection is done uniformly at random.
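The selection process above can be sketched as follows. The parameter names `num_seeds`, `radius`, and `num_queries`, as well as the values used in the example, are hypothetical placeholders and not the settings used in the experiments.

```python
import random
from collections import deque

def ball(neighbors, center, radius):
    """All nodes within `radius` hops of `center` (breadth-first search)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue                      # do not expand beyond the radius
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def sample_query_nodes(neighbors, num_seeds, radius, num_queries, rng):
    """Pick seeds uniformly, take a ball around each, then sample query nodes from the union."""
    seeds = rng.sample(sorted(neighbors), num_seeds)
    pool = set()
    for seed in seeds:
        pool |= ball(neighbors, seed, radius)
    return set(rng.sample(sorted(pool), min(num_queries, len(pool))))

# Hypothetical example: path graph on 10 nodes.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
queries = sample_query_nodes(path, num_seeds=2, radius=1, num_queries=3,
                             rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.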

Finally, the restart probability $\alpha$ is set to a fixed value and the starting probabilities $s$ are uniform over $Q$.

### VII-C Implementation

All algorithms are implemented in Python using the NetworkX package [8], and were run on an Intel Xeon 2.83GHz with 32GB RAM.

### VII-D Results

Figure 1 shows the centrality scores achieved by different algorithms on the small graphs for varying $k$ (note: lower is better). We present two settings: on the left, the candidates are all nodes ($D = V$), and on the right, the candidates are only the query nodes ($D = Q$). We observe that PPR tracks well the quality of solutions returned by Greedy, while Degree and Distance often come close to that. Spectral algorithms do not perform that well.

Figure 2 is similar to Figure 1, but results on the larger datasets are shown, not including Greedy. When all nodes are candidates, PPR typically has the best performance, followed by Distance, while Degree is unreliable. The spectral algorithms typically perform worse than PPR.

When only query nodes are candidates, all algorithms demonstrate similar performance, which is most typically worse than the performance of PPR (the best performing algorithm) in the previous setting. Both observations can be explained by the fact that the selection is very restricted by the requirement , and there is not much flexibility for the best performing algorithms to produce a better solution.

In terms of running time on the larger graphs, Distance returns within a few minutes (observed times ranged from 15 seconds to 5 minutes), while Degree returns within seconds (all observed times were less than 1 minute). Finally, even though Greedy returns within 1-2 seconds for the small datasets, it does not scale well to the larger datasets (its running time is orders of magnitude worse than that of the heuristics and is not included in the experiments).

Based on the above, we conclude that PPR offers the best trade-off of quality versus running time for datasets of at least moderate size (more than nodes).

## VIII Conclusions

In this paper, we have addressed the problem of finding central nodes in a graph with respect to a set of query nodes $Q$. Our measure is based on absorbing random walks: we seek nodes that minimize the expected number of steps that a random walk starting from the query nodes needs to reach, and be “absorbed” by, one of them. We have shown that the problem is NP-hard and described a greedy algorithm that solves it approximately. Moreover, we experimented with heuristic algorithms to solve the problem on large graphs. Our results show that, in practice, personalized PageRank offers a good combination of quality and speed.

## References

• [1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 5–14. ACM, 2009.
• [2] A. Angel and N. Koudas. Efficient diversity-aware search. ACM, June 2011.
• [3] P. Boldi and S. Vigna. Axioms for centrality. Internet Mathematics, 2014.
• [4] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30, 1998.
• [5] L. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40, 1977.
• [6] M. Garey and D. Johnson. Computers and intractability; A guide to the theory of NP-completeness. W. H. Freeman & Co., 1990.
• [7] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.
• [8] A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring network structure, dynamics, and function using NetworkX. In SciPy, 2008.
• [9] L. Katz. A New Status Index Derived from Sociometric Index. Psychometrika, 1953.
• [10] A. N. Langville and C. D. Meyer. A survey of eigenvector methods for web information retrieval. SIAM review, 47(1):135–161, 2005.
• [11] H. J. Leavitt. Some effects of certain communication patterns on group performance. The Journal of Abnormal and Social Psychology, 1951.
• [12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In SIGKDD. ACM, 2007.
• [13] J. D. Noh and H. Rieger. Random walks on complex networks. Phys. Rev. Lett., 92, 2004.
• [14] G. Sabidussi. The centrality index of a graph. Psychometrika, 31, 1966.
• [15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
• [16] M. R. Vieira, H. L. Razente, M. C. N. Barioni, M. Hadjieleftheriou, D. Srivastava, A. J. M. Traina, and V. J. Tsotras. On query result diversification. 2011 IEEE International Conference on Data Engineering, pages 1163–1174, 2011.
• [17] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
• [18] X. Zhu, A. Goldberg, J. Van Gael, and D. Andrzejewski. Improving diversity in web search results re-ranking using absorbing random walks. In NAACL-HLT, 2007.

### -A Proposition V

For all sets of nodes $X \subseteq Y$, it holds that $\mathrm{ac}(Y) \le \mathrm{ac}(X)$.

###### Proof:

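Write $G_X$ for the input graph where the nodes in $X$ are absorbing. Define $G_Y$ similarly. Let $Z = Y \setminus X$. Consider a path $p$ in $G_X$ drawn from the distribution induced by the random walks on $G_X$. Let $\Pr[p]$ be the probability of the path and $\ell(p)$ its length. Let $P(X)$ and $P(Y)$ be the sets of paths on $G_X$ and $G_Y$. Finally, let $P(Z,X)$ be the set of paths on $G_X$ that pass through $Z$, and $P(\bar{Z},X)$ the set of paths on $G_X$ that do not pass through $Z$. We have

$$
\mathrm{ac}(X) = \sum_{p \in P(X)} \Pr[p]\,\ell(p) = \sum_{p \in P(\bar{Z},X)} \Pr[p]\,\ell(p) + \sum_{p \in P(Z,X)} \Pr[p]\,\ell(p) \ge \sum_{p \in P(Y)} \Pr[p]\,\ell(p) = \mathrm{ac}(Y),
$$

where the inequality comes from the fact that a path in $G_X$ passing through $Z$ and being absorbed by $X$ corresponds to a shorter path in $G_Y$ being absorbed by $Y$, while paths that avoid $Z$ are identical in both graphs.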
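As a small sanity check of the monotonicity statement, the sketch below computes the expected absorption time exactly on a toy path graph. It is a simplified illustration under stated assumptions, not the paper's implementation: the walk is a simple random walk without restart, the starting distribution is uniform over a hypothetical query set, and the name `absorbing_centrality` is introduced only for this example.

```python
from fractions import Fraction

def absorbing_centrality(adj, absorbing, start):
    """Expected number of steps until absorption for a simple random walk
    (no restart), averaged over the start nodes. Solves (I - P_TT) t = 1."""
    transient = [u for u in sorted(adj) if u not in absorbing]
    idx = {u: k for k, u in enumerate(transient)}
    m = len(transient)
    # Build the augmented system [I - P_TT | 1] with exact rationals.
    rows = []
    for u in transient:
        row = [Fraction(0)] * (m + 1)
        row[idx[u]] += 1
        for v in adj[u]:
            if v in idx:
                row[idx[v]] -= Fraction(1, len(adj[u]))
        row[m] = Fraction(1)
        rows.append(row)
    # Gauss-Jordan elimination.
    for c in range(m):
        piv = next(r for r in range(c, m) if rows[r][c] != 0)
        rows[c], rows[piv] = rows[piv], rows[c]
        pivot = rows[c][c]
        rows[c] = [x / pivot for x in rows[c]]
        for r in range(m):
            if r != c and rows[r][c] != 0:
                factor = rows[r][c]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[c])]
    steps = {u: rows[idx[u]][m] for u in transient}
    # A start node that is itself absorbing contributes zero steps.
    return sum(steps.get(q, Fraction(0)) for q in start) / len(start)

# Toy example (hypothetical): path 0 - 1 - 2 - 3 - 4, queries {1, 2, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
queries = [1, 2, 3]
ac_X = absorbing_centrality(adj, {0}, queries)      # X = {0}
ac_Y = absorbing_centrality(adj, {0, 4}, queries)   # Y = {0, 4}, a superset of X
print(ac_X, ac_Y)  # 34/3 10/3
```

Adding node 4 to the absorbing set shortens every walk that would otherwise have passed through it, so the score drops from 34/3 to 10/3, as the proposition predicts.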

### -B Proposition VI-A

###### Proposition

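Let $C_{i-1}$ be a set of $i-1$ absorbing nodes, $P_{i-1}$ the corresponding transition matrix, and $F_{i-1} = (I - P_{i-1})^{-1}$. Let $C_i$ be the set $C_{i-1}$ with one additional absorbing node. Given $F_{i-1}$, the centrality score $\mathrm{ac}_Q(C_i)$ can be computed in time $O(n^2)$.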

The proof makes use of the following lemma.

###### Lemma (Sherman-Morrison Formula [7])

Let $M$ be a square $n \times n$ invertible matrix and $M^{-1}$ its inverse. Moreover, let $a$ and $b$ be any two column vectors of size $n$. Then, the following equation holds

$$
(M + a b^T)^{-1} = M^{-1} - \frac{M^{-1} a\, b^T M^{-1}}{1 + b^T M^{-1} a}.
$$

###### Proof:

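Without loss of generality, let the set of absorbing nodes be $C_{i-1} = \{1, 2, \ldots, i-1\}$. As in Section VI, the expected number of steps before absorption is given by the formulas

$$
\mathrm{ac}_Q(C_{i-1}) = s_Q^T F_{i-1} \mathbf{1}, \quad \text{with } F_{i-1} = A_{i-1}^{-1} \text{ and } A_{i-1} = I - P_{i-1}.
$$

We proceed to show how to increase the set of absorbing nodes by one and calculate the new absorption time by updating $F_{i-1}$ in $O(n^2)$. Without loss of generality, suppose we add node $i$ to the absorbing nodes $C_{i-1}$, so that

$$
C_i = C_{i-1} \cup \{i\} = \{1, 2, \ldots, i-1, i\}.
$$

Let $P_i$ be the transition matrix with absorbing nodes $C_i$. As before, the expected absorption time by nodes $C_i$ is given by the formulas

$$
\mathrm{ac}_Q(C_i) = s_Q^T F_i \mathbf{1}, \quad \text{with } F_i = A_i^{-1} \text{ and } A_i = I - P_i.
$$

Notice that

$$
A_i - A_{i-1} = (I - P_i) - (I - P_{i-1}) = P_{i-1} - P_i =
\begin{bmatrix}
0_{(i-1) \times n} \\
p_{i,1} \;\cdots\; p_{i,n} \\
0_{(n-i) \times n}
\end{bmatrix}
= a b^T,
$$

where $p_{i,j}$ denotes the transition probability from node $i$ to node $j$ in transition matrix $P_{i-1}$, and the column vectors $a$ and $b$ are defined as

$$
a = [\underbrace{0 \cdots 0}_{i-1} \; 1 \; \underbrace{0 \cdots 0}_{n-i}]^T, \quad \text{and} \quad b = [p_{i,1} \cdots p_{i,n}]^T.
$$

By a direct application of the Sherman-Morrison formula (Lemma -B), we can compute $F_i$ from $F_{i-1}$ with the following formula, at a cost of $O(n^2)$ operations.

$$
F_i = F_{i-1} - \frac{(F_{i-1} a)(b^T F_{i-1})}{1 + b^T (F_{i-1} a)}
$$

We have thus shown that, given $F_{i-1}$, we can compute $F_i$, and therefore $\mathrm{ac}_Q(C_i)$ as well, in $O(n^2)$ time.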
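The rank-one update in this proof can be checked on a tiny example. The sketch below is illustrative only: it uses exact rational arithmetic in plain Python, a hypothetical 3-node complete graph with 0-based node labels, and a helper named `add_absorbing_node` that is not taken from the paper. It assumes, as the difference of the two matrices in the proof suggests, that making a node absorbing sets its row of the transition matrix to zero.

```python
from fractions import Fraction as Fr

def add_absorbing_node(F_prev, P_prev, i):
    """Return (I - P_i)^{-1} from F_prev = (I - P_prev)^{-1}, where P_i is
    P_prev with row i set to zero. Uses O(n^2) arithmetic operations."""
    n = len(F_prev)
    b = P_prev[i]                                   # b = row i of P_prev
    Fa = [F_prev[r][i] for r in range(n)]           # F_prev a, with a = e_i
    bF = [sum(b[k] * F_prev[k][c] for k in range(n)) for c in range(n)]  # b^T F_prev
    denom = 1 + bF[i]                               # 1 + b^T F_prev a
    return [[F_prev[r][c] - Fa[r] * bF[c] / denom for c in range(n)]
            for r in range(n)]

def matmul(X, Y):
    """Plain matrix product, used only to verify the result."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Toy setup (hypothetical): complete graph on nodes 0, 1, 2; node 0 absorbing.
half = Fr(1, 2)
P_prev = [[0, 0, 0], [half, 0, half], [half, half, 0]]
F_prev = [[Fr(1), Fr(0), Fr(0)],
          [Fr(1), Fr(4, 3), Fr(2, 3)],
          [Fr(1), Fr(2, 3), Fr(4, 3)]]   # inverse of I - P_prev, computed by hand

# Make node 1 absorbing as well.
F_new = add_absorbing_node(F_prev, P_prev, 1)

# Direct check: (I - P_new) F_new must equal the identity.
P_new = [row if r != 1 else [0, 0, 0] for r, row in enumerate(P_prev)]
A_prev = [[(1 if r == c else 0) - P_prev[r][c] for c in range(3)] for r in range(3)]
A_new = [[(1 if r == c else 0) - P_new[r][c] for c in range(3)] for r in range(3)]
identity = [[1 if r == c else 0 for c in range(3)] for r in range(3)]
```

The update touches each entry of the matrix once after two matrix-vector products, which is where the quadratic cost comes from; a fresh inversion would be cubic.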

### -C Proposition VI-A

###### Proposition

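Let $C$ be a set of absorbing nodes, $P$ the corresponding transition matrix, and $F = (I - P)^{-1}$. Let $C'$ be obtained from $C$ by replacing one absorbing node with a node outside $C$. Given $F$, the centrality score $\mathrm{ac}_Q(C')$ can be computed in time $O(n^2)$.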

###### Proof:

The proof is similar to the proof of Proposition VI-A. Without loss of generality, let the two sets of absorbing nodes be

$$
C = \{1, 2, \ldots, i-1, i\}, \quad \text{and} \quad C' = \{1, 2, \ldots, i-1, i+1\}.
$$

Let $P'$ be the transition matrix with absorbing nodes $C'$. The absorbing centrality for the two sets of absorbing nodes $C$ and $C'$ is expressed as a function of the following two matrices

$$
F = A^{-1}, \text{ with } A = I - P, \quad \text{and} \quad F' = A'^{-1}, \text{ with } A' = I - P'.
$$

Notice that

$$
A' - A = (I - P') - (I - P) = P - P' =
\begin{bmatrix}
0_{(i-1) \times n} \\
-p_{i,1} \;\cdots\; -p_{i,n} \\
p_{i+1,1} \;\cdots\; p_{i+1,n} \\
0_{(n-i-1) \times n}
\end{bmatrix}
= a_2 b_2^T - a_1 b_1^T,
$$

where $p_{i,j}$ denotes the transition probability from node $i$ to node $j$ in a transition matrix where neither node $i$ nor node $i+1$ is absorbing, and the column vectors $a_1$, $b_1$, $a_2$, $b_2$ are defined as

$$
\begin{aligned}
a_1 &= [\underbrace{0 \cdots 0}_{i-1} \; 1 \; 0 \; \underbrace{0 \cdots 0}_{n-i-1}]^T, & b_1 &= [p_{i,1} \cdots p_{i,n}]^T, \\
a_2 &= [\underbrace{0 \cdots 0}_{i-1} \; 0 \; 1 \; \underbrace{0 \cdots 0}_{n-i-1}]^T, & b_2 &= [p_{i+1,1} \cdots p_{i+1,n}]^T.
\end{aligned}
$$

By an argument similar to the one in the proof of Proposition VI-A, we can compute $F'$ from $F$ in the following two steps, each costing $O(n^2)$ operations for the provided parenthesization

$$
Z = F - \frac{(F a_2)(b_2^T F)}{1 + b_2^T (F a_2)}, \qquad F' = Z + \frac{(Z a_1)(b_1^T Z)}{1 - b_1^T (Z a_1)}.
$$

We have thus shown that, given $F$, we can compute $F'$, and therefore $\mathrm{ac}_Q(C')$ as well, in time $O(n^2)$.
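For readers who want the intermediate reasoning, the two steps can be spelled out as two successive applications of the Sherman-Morrison formula to $A' = (A + a_2 b_2^T) - a_1 b_1^T$. This is a worked derivation under the definitions in the proof, with $Z$ denoting the inverse of the intermediate matrix in which both nodes $i$ and $i+1$ are absorbing; the second update uses the vector $-a_1$, which is why the sign of the correction and of the denominator term differ from the first step.

```latex
\begin{aligned}
Z  &= (A + a_2 b_2^T)^{-1}
    = F - \frac{(F a_2)(b_2^T F)}{1 + b_2^T (F a_2)}, \\
F' &= \bigl(Z^{-1} + (-a_1)\, b_1^T\bigr)^{-1}
    = Z - \frac{\bigl(Z (-a_1)\bigr)(b_1^T Z)}{1 + b_1^T \bigl(Z (-a_1)\bigr)}
    = Z + \frac{(Z a_1)(b_1^T Z)}{1 - b_1^T (Z a_1)}.
\end{aligned}
```

Each line needs only matrix-vector products and one outer product, so the swap costs $O(n^2)$ in total.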
