Personalized PageRank to a Target Node

Peter Lofgren
Stanford University
Ashish Goel
Stanford University
plofgren@stanford.edu ashishg@stanford.edu
July 4, 2019
Abstract

Personalized PageRank uses random walks to determine the importance or authority of nodes in a graph from the point of view of a given source node. Much past work has considered how to compute personalized PageRank from a given source node to other nodes. In this work we consider the problem of computing personalized PageRanks to a given target node from all source nodes. This problem can be interpreted as finding who supports the target or who is interested in the target.

We present an efficient algorithm for computing personalized PageRank to a given target up to any given accuracy. We give a simple analysis of our algorithm’s running time in both the average case and the parameterized worst-case. We show that for any graph with $n$ nodes and $m$ edges, if the target node is randomly chosen and the teleport probability $\alpha$ is given, the algorithm will compute a result with $\epsilon$ error in time $O\left(\frac{1}{\alpha\epsilon}\left(\frac{m}{n} + \log n\right)\right)$. This is much faster than the previously proposed method of computing personalized PageRank separately from every source node, and it is comparable to the cost of computing personalized PageRank from a single source. We present results from experiments on the Twitter graph which show that the constant factors in our running time analysis are small and our algorithm is efficient in practice.

1 Note on Related Work

After we posted this work, we became aware of the related work [2, 3]. It includes an algorithm similar to the one we (independently) discovered. However, our work makes the following novel contributions. We analyze the algorithm under a more detailed parameterization which includes the in-degree of nodes. We use a priority queue to obtain a dependence on $\epsilon$ of $\log(1/\epsilon)$, showing that the running time tends toward the running time of power iteration as $\epsilon$ tends to 0. Finally, we present detailed experiments to determine the running time of this algorithm on the Twitter graph.

2 Introduction

Personalized PageRank is a random-walk based method of modeling how nodes are related in a graph like a social network, the web graph, or a citation graph. It has been used in a variety of applications including personalized search [10], link prediction [11, 5], link-spam detection [6], and graph partitioning [4]. Previous work has considered how to compute personalized PageRank from a single source node. In this work, we consider the problem of computing personalized PageRank to a single target node from all source nodes. More precisely, given a node $v$ in a directed (or undirected) graph $G$, we would like to approximate the personalized PageRanks $\pi_u(v)$ from all nodes $u$ to the target node $v$. We define the personalized PageRank $\pi_u(v)$ from a node $u$ to a node $v$ to be the fraction of time we spend at $v$ on a random walk from $u$, where after each step we stop with a given probability $\alpha$. Note that this is different from reversing the edges and computing personalized PageRank from a single source $v$. If edges represent interest, this problem can be interpreted as finding the nodes which are interested in $v$, or if edges represent support this problem can be interpreted as finding nodes which support $v$.

This problem has several applications. In a social network, for example, whenever $v$ produces content, we might want to find the nodes $u$ with $\pi_u(v)$ above some threshold and add the content to each such $u$’s feed. Or an advertiser on a social network might want to give special offers to the nodes which are most interested in it. The use of personalized PageRank in recommendation and trust systems is discussed in [1]. On the web graph, this problem has been considered before. In [6], the first phase of the authors’ algorithm to detect when a web page $v$ is benefiting from link-spam is to compute the set of nodes $u$ with a high value of $\pi_u(v)$.

The simplest solution to this problem is to compute personalized PageRanks from every source node using known methods like Monte Carlo [8, 5] or power iteration [12]. This is the solution proposed in [6], the only previous work on this problem (prior to [2, 3]). In [6], the cost of computing personalized PageRank from every source using Monte Carlo is amortized because there are a large number of target nodes . However, this simple solution requires Monte Carlo computations even for a single target . As shown in [8] using the Chernoff bounds, computing an approximation to with high probability from a single source to all takes time. Thus even for a single target this approach would take time. The challenge we address is finding an algorithm which can find all nodes that have high values of without doing work linear in .

We present an algorithm which, given $v$, approximates $\pi_u(v)$ for all $u$ to within a given additive error $\epsilon$ without needing to visit all the nodes. Our method is to start at the target node $v$ and propagate updated estimates of $\pi_u(v)$ backwards along edges. In power iteration, every node propagates its current value in every step. The key idea of our algorithm is to maintain a priority queue and only propagate the value of the node whose value has changed the most since its value was last propagated. It is very simple to implement: the entire algorithm is shown in Algorithm 1. We prove that it is efficient both in an average-case and a parameterized sense. We also present experiments on part of Twitter’s graph to show that it is efficient in practice. In this work we assume that a single processor is used for each target and that the graph is stored in local or distributed RAM. If there are multiple targets, parallelism can be achieved by assigning different targets to different processors.

The contributions of this work are the following:

  • In section 6 we present a simple algorithm for computing personalized PageRanks to a target node up to any given additive error. In section 6.1 we analyze the approximation error of the algorithm and prove it is correct.

  • In section 6.2, we show that for an arbitrary graph and a target node chosen uniformly at random, our algorithm runs in time

    $$O\left(\frac{1}{\alpha\epsilon}\left(\bar{d} + \log n\right)\right)$$

    where $\bar{d} = m/n$ is the average degree of a node in a graph with $n$ nodes and $m$ edges, $\epsilon$ is the desired additive error, and $\alpha$ (typically between 0.1 and 0.2 in practice) is the probability of stopping after each step of the walk. This is comparable to the cost of running Monte Carlo from a single source node, for high success probability, and it is much less than the cost of running Monte Carlo from every source node.

  • In section 6.3 we show that for an arbitrary graph and arbitrary target node , our algorithm runs in time

    where is a parameter which captures how difficult the problem is for $v$. This shows the asymptotic dependence on $\epsilon$ is $\log(1/\epsilon)$, and not polynomial in $1/\epsilon$ as it is for Monte Carlo from a single source. Even if $\epsilon$ is small enough that we must consider the entire graph, for graphs with . For such graphs as $\epsilon$ goes to zero the algorithm degrades gracefully to the asymptotic performance of power iteration, . Thus for larger $\epsilon$ we get the benefit of only exploring a small set of nodes with high personalized PageRank to the target, while for small $\epsilon$ the running time is still comparable to the cost of running power iteration.

  • In section 7 we present results from an experiment on part of the Twitter graph with 5.3 million nodes and 380 million edges. We find that our error analysis is tight and that is an accurate parameterization of the running time. As one example, we find that for a approximation of , the priority queue algorithm takes 1.2 seconds while power iteration takes 410 seconds to achieve additive error on the same machine. This shows that the local nature of the algorithm can give significant savings.

3 Related Work

(See note on related work in section 1.)

Personalized PageRank was first suggested in the original PageRank paper [12], and much follow-up work has considered how to compute it efficiently. Our approach of propagating estimate updates is similar to the approach taken by Jeh and Widom [10] and Berkhin [7] to compute personalized PageRank from a single source. Our equation (1) appears as equation (10) in [10]. Both of these works suggest the heuristic of propagating from the node with the largest unpropagated estimate. Our work is different because we are interested in estimating the values $\pi_u(v)$ for a single target $v$, while earlier work was concerned with the values $\pi_u(v)$ for a single source $u$. Because of this, our analysis is completely different, and we are able to prove running time bounds.

To the best of our knowledge, the only previous work to consider the problem of computing personalized PageRank to a target node was by Benczur et al. [6], where it is used as one phase of an algorithm to identify link-spam. They observe that a node $v$’s global PageRank is the average over all nodes $u$ of $\pi_u(v)$. Thus to determine how a node $v$ achieves its global PageRank score, they propose we first find the nodes $u$ with a high value of $\pi_u(v)$. Once that set has been found, it can be analyzed to determine if it looks like an organic set of nodes or an artificial link-farm. To compute the values of $\pi_u(v)$ for each $u$, they propose taking random walks from every source node and do not consider other methods.

4 Preliminaries

We are given a directed or undirected graph $G = (V, E)$. For now we assume $G$ is unweighted, but in section 6.4 we show how our algorithm and theorems generalize easily to weighted graphs. We define $n = |V|$ and $m = |E|$. We are given a parameter $\alpha$ which determines the expected length of a random walk, $1/\alpha$. For $u, v \in V$, we define personalized PageRank $\pi_u(v)$ to be the fraction of time we spend at $v$ on the following random walk: we start at $u$ and at each step with probability $\alpha$ we halt, while with probability $1 - \alpha$ we transition to a random out-neighbor of the current node. We refer to $\alpha$ as the teleport probability because another description of the Markov chain is the following: the process never halts, and at each step with probability $\alpha$ we teleport back to $u$ and continue from there, while with probability $1 - \alpha$ we transition to a random out-neighbor of the current node.

There may be dead end nodes with no out-neighbors in the graph, so for convenience we introduce an artificial sink node with a self-loop and introduce an artificial edge to the sink from each dead end node. Alternatively, we could have artificially added a self-edge to each dead end node, or said that the walk should halt when it reaches a dead end node. These alternatives result in a slightly different boundary case or normalization, but the exact choice doesn’t matter significantly.
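The walk described above can be simulated directly. The following is a minimal sketch, not the method studied in this paper: it estimates a single value $\pi_u(v)$ as the fraction of halting walks from $u$ that stop at $v$. The function name `monte_carlo_ppr`, the adjacency-list dictionary format, and the treatment of dead ends (the walk is absorbed, which matches the artificial sink convention) are assumptions made for illustration.

```python
import random

def monte_carlo_ppr(out_neighbors, source, target, alpha, num_walks, seed=0):
    """Estimate pi_source(target) as the fraction of walks that halt at target.

    out_neighbors: dict mapping each node to a list of its out-neighbors
    (hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_walks):
        node = source
        # At each step halt with probability alpha, otherwise move to a
        # uniformly random out-neighbor of the current node.
        while rng.random() >= alpha:
            nbrs = out_neighbors[node]
            if not nbrs:
                # Dead end: behaves like the artificial sink, so the walk
                # can never halt at a real node again.
                node = None
                break
            node = rng.choice(nbrs)
        if node == target:
            hits += 1
    return hits / num_walks
```

For example, on the two-node cycle `{0: [1], 1: [0]}` with $\alpha = 0.2$, the recurrence of section 5 gives $\pi_0(1) = 0.8/1.8 \approx 0.444$, and the estimate approaches this value as `num_walks` grows.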

In the original PageRank paper [12], the authors propose that when the random walk for computing PageRank teleports, the resulting node could be chosen from an arbitrary distribution. We focus on the case when the distribution has a single point of support, because that is the case relevant to our applications. The PageRank function is linear in the personalization distribution, as shown in [10], so computing PageRank on single-point distributions is sufficient for computing it on arbitrary personalization distributions.

In the worst case, the target node might have an edge from every other node in the graph, so we must do work even for a rough approximation. To parameterize the difficulty of the problem for a given , we define

This parameter captures the idea that for each node $u$ which has a large personalized PageRank to $v$, we must consider all of $u$’s in-neighbors to see if any of them also have a large personalized PageRank to $v$. The $\log n$ term captures the cost of popping from a priority queue.

In evaluating our approximation we consider additive pointwise error (the $\ell_\infty$ norm). Given error threshold $\epsilon$, we seek an estimate $s(u)$ of $\pi_u(v)$ for each $u \in V$ such that

$$|s(u) - \pi_u(v)| < \epsilon.$$

We choose this error measure because in applications we are often only interested in the nodes with a large value of $\pi_u(v)$, and we don’t care if there are a large number of nodes with very small values of $\pi_u(v)$ which have been estimated to be 0. For efficiency we want the resulting estimate vector to be sparse unless $\epsilon$ is very small, and this norm allows for a sparse estimate vector.

5 A Recurrence for Personalized PageRank

Our algorithm is based on a recurrence equation that relates the value of $\pi_u(v)$ to the values of $\pi_w(v)$ for the out-neighbors $w$ of $u$. To derive this recurrence, it is convenient to think about the number of times we visit $v$ on a random walk rather than the fraction of time we spend at $v$. The number of times we visit $v$ is proportional to the fraction of time because over a large number of walks, the number of times we visit $v$ will be the fraction of time we spend at $v$ multiplied by the average length of a walk, $1/\alpha$. A random walk from $u$ begins by either teleporting immediately or by transitioning to a random neighbor, so the expected number of times we reach $v$ from $u$ is the probability of not teleporting immediately times the average expected number of times we reach $v$ from an out-neighbor of $u$. Thus personalized PageRank satisfies the recurrence

$$\pi_u(v) = \alpha \, \mathbb{1}[u = v] + \frac{1 - \alpha}{d_{out}(u)} \sum_{w : (u, w) \in E} \pi_w(v) \qquad (1)$$

where $d_{out}(u)$ is the out-degree of $u$ and $\mathbb{1}[u = v]$ is 1 if $u = v$ and 0 otherwise. We add $\alpha$ when $u = v$ because a walk from $v$ clearly visits $v$ on its first step regardless of what happens next, and this visit corresponds to an $\alpha$-fraction of an average walk. This equation appears as equation (10) in [10], where the authors give an alternate proof using linear algebra.
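Applying the recurrence repeatedly to every node gives the power iteration baseline used later for comparison. The sketch below is an illustration under assumptions, not the authors' code: the name `ppr_to_target_power_iteration` and the dictionary input format are hypothetical, every node is assumed to appear as a key, and a dead end is treated as contributing nothing (the halting convention).

```python
def ppr_to_target_power_iteration(out_neighbors, target, alpha, num_iters):
    """Iterate pi_u(v) = alpha*[u == v] + (1 - alpha) * mean of pi_w(v) over out-neighbors w.

    out_neighbors: dict mapping every node to a list of its out-neighbors
    (hypothetical input format chosen for this sketch).
    """
    est = {u: 0.0 for u in out_neighbors}
    for _ in range(num_iters):
        new_est = {}
        for u, nbrs in out_neighbors.items():
            # Average of the current estimates over the out-neighbors of u.
            avg = sum(est[w] for w in nbrs) / len(nbrs) if nbrs else 0.0
            new_est[u] = (alpha if u == target else 0.0) + (1 - alpha) * avg
        est = new_est
    return est
```

Each iteration touches every edge once, which is the cost the priority queue algorithm tries to avoid when only a small neighborhood of the target matters.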

6 The Priority Queue Algorithm

Given a target node $v$, our algorithm is based on the idea of propagating updates outwards from $v$. We maintain for each node $u$ a score $s(u)$ which estimates $\pi_u(v)$ from below and improves as the algorithm progresses. Using the recurrence of equation (1), we see that when we update our estimate for some node $w$, we need to update our estimate of $\pi_u(v)$ for each in-neighbor $u$ of $w$. Hence the basic update step of the algorithm is to choose a node $w$ and increase the score of each in-neighbor $u$ by $(1 - \alpha) \, p(w) / d_{out}(u)$, where $d_{out}(u)$ is the out-degree of $u$ and $p(w)$ is defined next. Since we might propagate a node $w$’s score more than once, it is important that we only propagate the part of $w$’s score which changed since the last time $w$’s score was propagated. We let $p(w)$ denote the difference between $w$’s current score and $w$’s score when its score was last propagated. We use a priority queue ordered by priority $p$ so we can easily find the node with the largest value of $p$. The complete algorithm is shown in Algorithm 1: as long as some node has priority above a minimum threshold, we pop off the node with the greatest priority and propagate its score to its in-neighbors.

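Input: digraph $G = (V, E)$, teleport probability $\alpha$, target vertex $v$, error tolerance $\epsilon$
Output: Approximation $s$ to personalized PageRank such that for all $u \in V$, $|s(u) - \pi_u(v)| < \epsilon$
$s(v) = \alpha$, $p(v) = \alpha$
$Q$ = Max Priority Queue on $V$ ordered by key $p$, initially containing only $v$
while $Q$.maxPriority() $\geq \alpha\epsilon$ do
     $w$ = $Q$.popMaxElement()
     $\delta = p(w)$, $p(w) = 0$
     for $u$ in $w$.inNeighbors() do
         $\Delta s = (1 - \alpha) \, \delta / d_{out}(u)$
         if $u$ not in $Q$ then
              $Q$.insert($u$) with priority $p(u)$ (0 if $u$ has not been seen before, in which case also $s(u) = 0$)
         end if
         $s(u) = s(u) + \Delta s$
         $Q$.increasePriority($u$, $p(u) + \Delta s$)
     end for
end while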
Algorithm 1 Computing personalized PageRank to a target.
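Algorithm 1 can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than the code used in our experiments: the name `ppr_to_target` and the adjacency-list dictionary input are hypothetical, node labels are assumed to be mutually comparable (for heap tie-breaking), and the increase-priority operation is emulated on a binary heap (`heapq`) by pushing a fresh entry and skipping stale ones when popped.

```python
import heapq
from collections import defaultdict

def ppr_to_target(out_neighbors, target, alpha, eps):
    """Return scores s(u) with pi_u(target) - eps < s(u) <= pi_u(target).

    out_neighbors: dict mapping each node to a list of its out-neighbors
    (hypothetical input format chosen for this sketch).
    """
    # Build in-neighbor lists and out-degrees once.
    in_neighbors = defaultdict(list)
    out_degree = {}
    for u, nbrs in out_neighbors.items():
        out_degree[u] = len(nbrs)
        for w in nbrs:
            in_neighbors[w].append(u)

    score = defaultdict(float)      # s(u): current estimate of pi_u(target)
    priority = defaultdict(float)   # p(u): score not yet propagated
    score[target] = alpha
    priority[target] = alpha
    heap = [(-alpha, target)]       # max-heap via negated priorities
    threshold = alpha * eps

    while heap:
        neg_p, w = heapq.heappop(heap)
        if -neg_p != priority[w]:
            continue                # stale entry: priority changed since push
        if priority[w] < threshold:
            break                   # largest priority is below alpha * eps
        mass = priority[w]
        priority[w] = 0.0
        for u in in_neighbors[w]:
            delta = (1 - alpha) * mass / out_degree[u]
            score[u] += delta
            priority[u] += delta
            heapq.heappush(heap, (-priority[u], u))
    return dict(score)
```

On the two-node cycle `{0: [1], 1: [0]}` with target 1 and $\alpha = 0.2$, the exact values from equation (1) are $\pi_1(1) = 1/1.8$ and $\pi_0(1) = 0.8/1.8$, and the returned scores approach them from below as `eps` shrinks. Only nodes that received some score appear in the result, which keeps the estimate vector sparse.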

An example run of 6 iterations of the algorithm is shown in Figure 1.

Figure 1: The first six iterations of the priority queue algorithm run on a simple graph with four nodes and target node $v$. The node which will propagate its priority next is shown with a dark background. The score for a node $u$ is our current estimate of $\pi_u(v)$, and the priority is the amount of unpropagated score.

6.1 Error Analysis

One key question was the priority threshold at which we should stop popping nodes. Initially we considered threshold $\epsilon$, but this threshold is not small enough to ensure that all errors are less than $\epsilon$. We now show that $\alpha\epsilon$ is a sufficient threshold, and our experiments show that it is tight.

Theorem 1 (Correctness)

When the priority queue algorithm is run until all priorities are less than $\alpha\epsilon$, the resulting score vector $s$ satisfies $|s(u) - \pi_u(v)| < \epsilon$ for all $u \in V$.

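Proof.

After the algorithm has run to completion, let $u$ be the node with the greatest additive error and let $e = \pi_u(v) - s(u)$ be its error, so for all nodes $w$, $\pi_w(v) - s(w) \leq e$. Recall equation (1):

$$\pi_u(v) = \alpha \, \mathbb{1}[u = v] + \frac{1 - \alpha}{d_{out}(u)} \sum_{w : (u, w) \in E} \pi_w(v)$$

When the algorithm has completed, a node $u$’s score is equal to the sum of the amount of score it has received from each out-neighbor. An out-neighbor $w$ has final score $s(w)$ and final un-propagated score $p(w)$, so the amount propagated is $s(w) - p(w)$. This gives us the following:

$$s(u) = \alpha \, \mathbb{1}[u = v] + \frac{1 - \alpha}{d_{out}(u)} \sum_{w : (u, w) \in E} \left(s(w) - p(w)\right)$$

where $p(w)$ is the part of $w$’s score which has not been propagated back to $u$. Subtracting these two equations, we see that

$$e = \pi_u(v) - s(u) = \frac{1 - \alpha}{d_{out}(u)} \sum_{w : (u, w) \in E} \left(\pi_w(v) - s(w) + p(w)\right) \leq (1 - \alpha)(e + \alpha\epsilon)$$

where we’ve used the fact that for all nodes $w$, $\pi_w(v) - s(w) \leq e$ and $p(w) < \alpha\epsilon$. Isolating the error we conclude that

$$e \leq (1 - \alpha)\,\epsilon < \epsilon.$$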

6.2 Average Running Time

Next we analyze the running time of this algorithm. In the worst case, our target node could have a high personalized PageRank from every other node, forcing us to consider the entire graph and do work. Thus to give a useful bound on the running time, we give both an average case analysis and a worst-case parameterized analysis. First we analyze the priority queue algorithm in the average case where the target node is chosen uniformly at random.

Theorem 2

Let an arbitrary graph $G = (V, E)$, additive error tolerance $\epsilon$, and teleport probability $\alpha$ be given. Let $n$ be the number of nodes in $G$ and $m$ be the number of edges. If $v$ is chosen uniformly at random from $V$, then the priority queue algorithm runs in expected time $O\left(\frac{1}{\alpha\epsilon}\left(\frac{m}{n} + \log n\right)\right)$.

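Proof.

Suppose we ran the algorithm once for every $v \in V$. When the target node is $v$, the number of times a node $w$ can be popped from the queue is at most $\pi_w(v) / (\alpha\epsilon)$, since its priority decreases by at least $\alpha\epsilon$ each time it is popped and the total accumulated priority is at most $\pi_w(v)$. The time to propagate the score from a node $w$ is $d_{in}(w)$ steps, where $d_{in}(w)$ is the in-degree of $w$, since each of its in-neighbors must receive some of its score. We also must do $O(\log n)$ work to pop the maximum node from the priority queue. Since $\sum_{v \in V} \pi_w(v) = 1$ for every $w$, the running time for all $n$ targets is at most

$$O\left(\sum_{v \in V} \sum_{w \in V} \frac{\pi_w(v)}{\alpha\epsilon} \left(d_{in}(w) + \log n\right)\right) = O\left(\frac{1}{\alpha\epsilon} \sum_{w \in V} \left(d_{in}(w) + \log n\right)\right) = O\left(\frac{1}{\alpha\epsilon} \left(m + n \log n\right)\right)$$

and the average running time per node is $O\left(\frac{1}{\alpha\epsilon}\left(\frac{m}{n} + \log n\right)\right)$ as claimed.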

In the appendix we prove a bound which is tighter for in the case when the personalized PageRanks from each source follow a power law.

Because we perform a large number of increase-priority operations on the priority queue, the best asymptotic time is achieved by using a Fibonacci heap for the priority queue. With a Fibonacci heap [9], we can increase a node’s priority in constant amortized time, so in the above analysis the cost of increase-priority operations is . In our experiments we use a standard binary-heap priority queue for simplicity.

Also note that our average running time analysis did not use the fact that we are using a priority queue. The same time bound would hold if we simply maintained the set of nodes $w$ whose priority (unpropagated score) $p(w)$ is at least $\alpha\epsilon$ and repeatedly popped an arbitrary element of this set. This is an alternative implementation of the algorithm which avoids the cost of the queue. By removing the cost of the priority queue from the above analysis we see that this alternative runs in average time

$$O\left(\frac{m}{n \alpha \epsilon}\right).$$

Our parameterized running time analysis does use the priority queue property to improve the dependence on $\epsilon$ from $1/\epsilon$ to $\log(1/\epsilon)$.
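The queue-free alternative mentioned above can be sketched as follows. This is a hedged illustration: the name `ppr_to_target_unordered`, the dictionary input format, and the choice of a Python list used as a stack for the set of active nodes are assumptions, and `eps` is assumed to be at most 1. The same stopping rule (all unpropagated scores below $\alpha\epsilon$) is used, so the error guarantee of Theorem 1 still applies.

```python
from collections import defaultdict

def ppr_to_target_unordered(out_neighbors, target, alpha, eps):
    """Variant that pops an arbitrary node with residual >= alpha * eps."""
    in_neighbors = defaultdict(list)
    out_degree = {}
    for u, nbrs in out_neighbors.items():
        out_degree[u] = len(nbrs)
        for w in nbrs:
            in_neighbors[w].append(u)

    threshold = alpha * eps
    score = defaultdict(float)
    residual = defaultdict(float)   # unpropagated score, p(u) in the text
    score[target] = alpha
    residual[target] = alpha
    active = [target]               # each node with residual >= threshold, once

    while active:
        w = active.pop()            # arbitrary element; no ordering needed
        mass = residual[w]
        residual[w] = 0.0
        for u in in_neighbors[w]:
            delta = (1 - alpha) * mass / out_degree[u]
            was_below = residual[u] < threshold
            score[u] += delta
            residual[u] += delta
            # Add u only when it crosses the threshold, so it is listed once.
            if was_below and residual[u] >= threshold:
                active.append(u)
    return dict(score)
```

Each pop costs constant time instead of $O(\log n)$, at the price of losing the largest-first ordering that the parameterized analysis relies on.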

Comparison with Monte Carlo In [6], the authors suggest computing values of to a given target by taking Monte Carlo walks from every other node. As shown in [8] using the Chernoff bounds, computing an approximation of for a single source and with failure probability takes

steps. If the graph is sparse enough or $\epsilon$ is small enough, our average time bound of

$$O\left(\frac{1}{\alpha\epsilon}\left(\frac{m}{n} + \log n\right)\right)$$

is better than this. Thus to compute an approximation of personalized PageRank for all pairs of nodes, running the priority queue algorithm to every node is a viable alternative to running Monte Carlo from every node.

6.3 Parameterized Running Time

Next we give a parameterized bound that applies to arbitrary graphs and an arbitrary target node $v$. As in the preliminaries section, we define

to capture the difficulty of computing personalized PageRank to the target .

Theorem 3

If the priority queue algorithm is run with teleportation probability $\alpha$, target node $v$, and additive error $\epsilon$, it takes time

Proof.

We divide the execution of the algorithm into stages, where in stage it pops nodes with priority greater until all nodes have priority less than . After stage has completed, by Theorem 1, the difference between the score of a node and the true value of is at most . This implies that each node can be popped at most times in each stage, since each pop in stage decreases the difference between and by at least . Each time a node is popped we do work increasing priorities and work popping the node from the priority queue. This gives us a running time of

6.4 Extension to Weighted Graphs

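We assume that the graph is unweighted for simplicity, but our algorithm extends immediately to the case of a weighted graph, in which the weight $A_{u,w}$ of edge $(u, w)$ is proportional to the probability of transitioning from node $u$ to node $w$ on a random walk. In this case, the change in the score of an in-neighbor $u$ of the popped node $w$ in Algorithm 1 should become

$$\Delta s(u) = (1 - \alpha) \, p(w) \, \frac{A_{u,w}}{d_{out}(u)}$$

where $p(w)$ is the priority of $w$ and $d_{out}(u)$ is defined as $\sum_{x} A_{u,x}$. Similarly, the power iteration equation (1) should become

$$\pi_u(v) = \alpha \, \mathbb{1}[u = v] + \frac{1 - \alpha}{d_{out}(u)} \sum_{w} A_{u,w} \, \pi_w(v)$$

where $d_{out}(u)$ is the weighted out-degree of $u$. All the proofs can be modified similarly. The theorem statements remain the same.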
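The weighted update can be illustrated with a small variation of the priority queue sketch. Everything here is an assumption made for illustration: the name `weighted_ppr_to_target`, the nested-dictionary input (node to a dictionary of out-neighbor to positive weight), comparable node labels, and the lazy-deletion binary heap standing in for an increase-priority operation.

```python
import heapq
from collections import defaultdict

def weighted_ppr_to_target(out_weights, target, alpha, eps):
    """Priority queue propagation where edge (u, w) has a positive weight."""
    in_edges = defaultdict(list)    # w -> list of (u, weight of edge u -> w)
    out_total = {}                  # weighted out-degree of each node
    for u, nbrs in out_weights.items():
        out_total[u] = sum(nbrs.values())
        for w, weight in nbrs.items():
            in_edges[w].append((u, weight))

    score = defaultdict(float)
    priority = defaultdict(float)
    score[target] = alpha
    priority[target] = alpha
    heap = [(-alpha, target)]
    threshold = alpha * eps

    while heap:
        neg_p, w = heapq.heappop(heap)
        if -neg_p != priority[w]:
            continue                # stale heap entry
        if priority[w] < threshold:
            break
        mass = priority[w]
        priority[w] = 0.0
        for u, weight in in_edges[w]:
            # Transition probability from u to w is weight / out_total[u].
            delta = (1 - alpha) * mass * weight / out_total[u]
            score[u] += delta
            priority[u] += delta
            heapq.heappush(heap, (-priority[u], u))
    return dict(score)
```

Setting every weight to 1 recovers the unweighted update, since the weighted out-degree then equals the number of out-neighbors.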

7 Experiments

For our experiments, we used a part of the Twitter follower graph with 5.3 million nodes and 389 million edges. We ran an experiment for each setting of parameters in the Cartesian product of teleport probability and additive error . For each experiment we chose 100 target nodes uniformly at random and ran the priority queue algorithm. Since nodes with high global PageRank might be targets more often than other nodes, we repeated the above setup sampling 100 target nodes with probability equal to their global PageRank. We measured the number of steps (defined as the number of times we updated some node’s priority in the inner loop), the change in wall-clock time, and the maximum error. We measured the maximum error by running power iteration, equation (1), until convergence and comparing the result pointwise with the result of the priority queue algorithm.

We first note that our error analysis is tight. For efficiency, we want to do as few operations as possible to reach our desired error tolerance $\epsilon$. If our empirical error were much lower than our target error $\epsilon$, it would indicate that we were wasting effort achieving an accuracy which is finer than required.

However, on the Twitter graph there are nodes with empirical error , which is quite close to our proven bound of . A histogram of the empirical errors for and targets sampled from the global PageRank distribution is shown in Figure 2. Notice that for these parameters the empirical error is often more than 50% of the proven bound .

Figure 2: The empirical error of our algorithm after convergence. To obtain this data, we set teleport probability and error threshold , choose 100 target nodes with probability equal to their global PageRank, ran the priority queue algorithm to obtain scores , and then computed the empirical error . The x-axis is empirical error divided by . Notice that most nodes have an error which is a large fraction of , showing that our error bound is tight in practice.

We compare our actual average running time to the bound from Theorem 2, steps, and find that the algorithm actually runs faster than the bound requires. For and all three values of , the algorithm uses less than 3% of the number of steps the bound allows. For , and all values of the algorithm uses less than 20% of the number of steps the bound allows. A histogram of the running times is shown in figure 3. Notice that the step-axis is log-scale, and most nodes use far fewer steps than the bound represented by the vertical line allows.

Figure 3: The number of steps required to reach convergence, defined as the number of times we updated some node’s priority in the inner loop. To obtain this data, we set teleport probability and error threshold , choose 100 target nodes uniformly at random, and ran the priority queue algorithm. The vertical line indicates the average running time bound of Theorem 2. Notice that the x-axis is log-scale, so most nodes require many fewer steps than the bound allows.

Our parameterized analysis shows that the number of steps needed is at most . To measure how tight this is, we compared the number of steps taken to . To use more adversarial , we sampled from the global PageRank distribution instead of uniformly at random, so with high global PageRank will be chosen more often. We found that in practice is an excellent predictor for the number of steps, and that the constant of proportionality in practice is much less than . For example, with and , the proven ratio between step count and is , but in our experiment the average ratio is less than 4. The distribution of ratios is shown in Figure 4. Note that for most nodes, the number of steps taken is within a factor of 2 of even though the absolute number of steps varies on an exponential scale, as shown in Figure 3.

Figure 4: The number of steps required to reach convergence compared to the parameter . To obtain this data, we set teleport probability and error threshold , chose 100 target nodes sampled from the PageRank distribution, and ran the priority queue algorithm. Notice that for most nodes, the number of steps taken is within a factor of 2 of even though the absolute number of steps varies on an exponential scale, as shown in Figure 3.

Now we compare our algorithm’s performance to power iteration and observe the benefit of only visiting a set of nodes around . It can be shown that equation (1) is a contraction mapping with contraction ratio . Thus one alternative to our priority queue algorithm is to apply equation (1) repeatedly. Using the contraction map property, to guarantee additive error we must do iterations. In our experiment, we also computed personalized PageRank to each target using power iteration in order to measure the empirical error. We found that each iteration of applying equation (1) took 3.9 seconds, and this was stable over a large number of iterations (the graph is too large for the processor cache to help). Because our algorithm explores only a neighborhood around , it is often much more efficient than power iteration. For example, when and , our algorithm took 0.2 seconds on average, which is 1700 times faster than the 87 iterations needed to guarantee at most error. For smaller , our algorithm is forced to consider more of the graph, so its relative advantage diminishes. For the smallest value of we tried, , our algorithm took 30 seconds on average, while power iteration takes seconds. A table of running times for the two algorithms is shown in Figure 5.

Priority Queue Algorithm (s)    Power Iteration (s)
0.20                            330
1.2                             410
29                              500
Figure 5: The average wall-clock running time of our algorithm compared to power iteration. To obtain the first column, we set teleport probability , choose 100 target nodes uniformly at random, and ran the priority queue algorithm until completion. To obtain the second column, we measured the time for an average iteration of applying equation (1) and multiplied by , the number of iterations needed for accuracy . Notice that by propagating the largest changes first, our algorithm is much faster than power iteration.

As $\epsilon$ tends to zero, our algorithm’s performance degrades gracefully to within a constant factor of the performance of power iteration, as proven in Theorem 3.

8 Acknowledgements

This work was supported in part by the DARPA xdata program, by grant #FA9550-12-1-0411 from the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA), and by NSF Award 0915040. One of the authors was supported by the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. We would like to thank Rishi Gupta for helpful conversations.

References

  • [1] R. Andersen, C. Borgs, J. Chayes, U. Feige, A. Flaxman, A. Kalai, V. Mirrokni, and M. Tennenholtz. Trust-based recommendation systems: an axiomatic approach. In Proceedings of the 17th international conference on World Wide Web, pages 199–208. ACM, 2008.
  • [2] R. Andersen, C. Borgs, J. Chayes, J. Hopcroft, V. S. Mirrokni, and S.-H. Teng. Local computation of pagerank contributions. In Algorithms and Models for the Web-Graph, pages 150–165. Springer, 2007.
  • [3] R. Andersen, C. Borgs, J. Chayes, J. Hopcroft, K. Jain, V. Mirrokni, and S. Teng. Robust pagerank and locally computable spam detection features. In Proceedings of the 4th international workshop on Adversarial information retrieval on the web, AIRWeb ’08, pages 69–76, New York, NY, USA, 2008. ACM.
  • [4] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 475–486. IEEE, 2006.
  • [5] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and personalized pagerank. Proceedings of the VLDB Endowment, 4(3):173–184, 2010.
  • [6] A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank–fully automatic link spam detection work in progress. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, 2005.
  • [7] P. Berkhin. Bookmark-coloring algorithm for personalized pagerank computing. Internet Mathematics, 3(1):41–62, 2006.
  • [8] D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3):333–358, 2005.
  • [9] M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987.
  • [10] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271–279. ACM, 2003.
  • [11] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
  • [12] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: bringing order to the web. Technical report, Stanford University Database Group, 1999.

Appendix A Average Running Time for Power Law Graphs

We can get a better bound on the expected running time if we assume a power law on the personalized PageRank values: suppose that for each there is some such that if we order the nodes in decreasing order of then

for some constant . Since , the value for is determined by :

Such a power law was observed empirically on the twitter graph in [5] with . For simplicity we assume that all nodes have the same exponent .

Theorem 4

For a graph in which the personalized PageRanks from each node follow a power law with exponent , if is chosen uniformly at random from , then the priority queue algorithm runs in time

where is the number of edges in the graph.

Proof.

Suppose we ran the algorithm once for every . As in the average-case analysis proof, the running time is at most . With the power law assumption, the majority of nodes will not be popped even once because . The largest such that is

so we only need to visit this many distinct nodes for each . The running time for all nodes is thus at most

Substituting in the value of and we see that the total runtime for all target nodes is

where so the average running time per node is as claimed.
