Fast Distributed PageRank Computation

(Appeared in Theoretical Computer Science (TCS), volume 561, pages 113-121, 2015 [10].)

Atish Das Sarma eBay Research Labs, eBay Inc., CA, USA. E-mail: atish.dassarma@gmail.com    Anisur Rahaman Molla Division of Mathematical Sciences, Nanyang Technological University, Singapore 637371. E-mail: anisurpm@gmail.com.    Gopal Pandurangan Division of Mathematical Sciences, Nanyang Technological University, Singapore 637371 and Department of Computer Science, Brown University, Providence, RI 02912, USA. E-mail: gopalpandurangan@gmail.com. Supported in part by the following research grants: Nanyang Technological University grant M58110000, Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant MOE2010-T2-2-082, and a grant from the US-Israel Binational Science Foundation (BSF).    Eli Upfal Department of Computer Science, Brown University, Providence, RI 02912, USA. E-mail: eli_upfal@brown.edu. Partially supported by NSF BIGDATA Award IIS 1247581.
Abstract

Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google’s search engine). In distributed computing alone, PageRank vector, or more generally random walk based quantities have been used for several different applications ranging from determining important nodes, load balancing, search, and identifying connectivity structures. Surprisingly, however, there has been little work towards designing provably efficient fully-distributed algorithms for computing PageRank. The difficulty is that traditional matrix-vector multiplication style iterative methods may not always adapt well to the distributed setting owing to communication bandwidth restrictions and convergence rates.

In this paper, we present fast random walk-based distributed algorithms for computing PageRanks in general graphs and prove strong bounds on the round complexity. We first present a distributed algorithm that takes $O(\log n/\epsilon)$ rounds with high probability on any graph (directed or undirected), where $n$ is the network size and $\epsilon$ is the reset probability used in the PageRank computation (typically $\epsilon$ is a fixed constant). We then present a faster algorithm that takes $O(\sqrt{\log n}/\epsilon)$ rounds in undirected graphs. Both algorithms are scalable, as each node sends only a small ($\mathrm{polylog}\, n$) number of bits over each edge per round. To the best of our knowledge, these are the first fully distributed algorithms for computing the PageRank vector with provably efficient running time.

Keywords: PageRank, Distributed Algorithm, Random Walk, Monte Carlo Method

1 Introduction

In the last decade, PageRank has emerged as a very powerful measure of relative importance of nodes in a network. The term PageRank was first introduced in [7, 16] where it was used to rank the importance of webpages on the Web. Since then, PageRank has found a wide range of applications in a variety of domains within computer science such as distributed networks, data mining, Web algorithms, and distributed computing [5, 6, 8, 14]. Since PageRank vector or PageRanks is essentially the steady state distribution or the top eigenvector of the Laplacian corresponding to a slightly modified random walk process, it is an easily defined quantity. However, the power and applicability of PageRank arises from its basic intuition of being a way to naturally identify “important” nodes, or in certain cases, similarity between nodes.

While there has been recent work on performing random walks efficiently in distributed networks [4, 9], surprisingly few provable results are known on efficient distributed computation of PageRanks. This is perhaps because the traditional method of computing PageRanks is to apply iterative methods, i.e., to perform matrix-vector multiplications until (near) convergence. Such techniques may not adapt well when dealing with a global network with only local views (as is common in distributed networks such as Peer-to-Peer (P2P) networks), particularly for very large networks, so it becomes crucial to design far more efficient techniques. PageRank computation using Monte Carlo methods is therefore more appropriate in a distributed model where only messages of limited size may be sent over each edge in each round.

To elaborate, a naive way to compute PageRank of nodes in a distributed network is simply scaling iterative PageRank algorithms to distributed environment. But this is firstly not trivial, and secondly expensive even if doable. As each iteration step needs computation results of previous steps, there needs to be continuous synchronization and several messages may need to be exchanged. Further, the convergence time may be large. It is important to design efficient and localized distributed algorithms as communication overhead is more important than CPU and memory usage in distributed page ranking. We take all these concerns into consideration and design highly efficient fully decentralized algorithms for computing the PageRank vector in distributed networks.

Our Contributions. In this paper, to the best of our knowledge, we present the first provably efficient fully decentralized algorithms for estimating PageRanks under a variety of settings. Our algorithms are scalable, since each node sends only $\mathrm{polylog}\, n$ bits per round. Specifically, our contributions are as follows:

  • We present an algorithm, Basic-PageRank-Algorithm (cf. Algorithm 1), that computes PageRanks accurately in $O(\log n/\epsilon)$ rounds with high probability, where $n$ is the number of nodes in the network and $\epsilon$ is the random reset probability in the PageRank random walk [2, 4, 9]. (Throughout, “with high probability (w.h.p.)” means with probability at least $1 - 1/n^{\gamma}$, where $n$ is the number of nodes in the network and $\gamma$ is a suitably chosen constant.) Our algorithm works for any arbitrary network (directed as well as undirected).

  • We present an improved algorithm, called Improved-PageRank-Algorithm (cf. Algorithm 2), that computes PageRanks accurately in undirected graphs and terminates with high probability in $O(\sqrt{\log n}/\epsilon)$ rounds. We note that although PageRank is usually applied to directed graphs (e.g., the World Wide Web), it is sometimes also applied to undirected graphs [1, 12, 13, 17, 20], where it is still non-trivial to compute (cf. Section 2.2). In particular, it can be applied to distributed networks when modeled as undirected graphs (as is typically the case, e.g., in P2P network models).

We note that the Improved-PageRank-Algorithm requires only $\mathrm{polylog}(n)$ bits to be sent per round per edge, and the Basic-PageRank-Algorithm requires only $O(\log n)$ bits per round per edge.

2 Background and Related Work

2.1 Distributed Computing Model

We model the communication network as an unweighted, connected $n$-node graph $G = (V, E)$. Each node has limited initial knowledge. Specifically, we assume that each node is associated with a distinct identity number (e.g., its IP address). At the beginning of the computation, each node $v$ accepts as input its own identity number, which is of length $O(\log n)$ bits, and the identity numbers of its neighbors in $G$. The node may also accept some additional inputs as specified by the problem at hand, e.g., the number of nodes in the network. A node $v$ can communicate with any node $u$ if $v$ knows the id of $u$. (This is a typical assumption in the context of P2P and overlay networks, where a node can establish communication with another node if it knows the other node’s IP address.) We sometimes call this direct communication, especially when the two nodes are not neighbors in $G$. Note that our algorithm of Section 3 uses no direct communication between non-neighbors in $G$. Initially, each node knows only the ids of its neighbors in $G$. We assume that the communication occurs in synchronous rounds, i.e., nodes run at the same processing speed and any message that is sent by some node $v$ to its neighbors in some round $r$ will be received by the end of round $r$. In each round, each node $v$ is allowed to send a message of size $O(\mathrm{polylog}\, n)$ bits through each communication link (this applies to both communication via an edge in the network as well as direct communication).

There are several measures of efficiency of distributed algorithms; here we will focus on the running time, i.e., the number of rounds of distributed communication. Note that the computation that is performed by the nodes locally is “free”, i.e., it does not affect the number of rounds.

2.2 PageRank

We formally define the PageRank of a graph $G = (V, E)$. Let $\epsilon$ be a small constant which is fixed ($\epsilon$ is called the reset probability, i.e., with probability $\epsilon$, the random walk starts from a node chosen uniformly at random among all nodes in the network). The PageRank vector of a graph (e.g., see [2, 4, 5, 9]) is the stationary distribution vector $\pi$ of the following special type of random walk: at each step of the random walk, with probability $\epsilon$ the walk starts from a randomly chosen node and with the remaining probability $1 - \epsilon$, the walk follows a randomly chosen outgoing (neighbor) edge from the current node and moves to that neighbor. (We sometimes use the terminology “PageRank random walk” for this special type of random walk process.) Therefore the PageRank transition matrix on the state space (or vertex set) $V$ can be written as

$$P = \frac{\epsilon}{n} J + (1 - \epsilon) Q \qquad (1)$$

where $J$ is the matrix with all entries $1$ and $Q$ is the transition matrix of a simple random walk on $G$ defined as $Q_{ij} = 1/k$ if $j$ is one of the $k > 0$ outgoing links of $i$, and $Q_{ij} = 0$ otherwise. Computing the PageRanks and their variants efficiently in various computation models has been of tremendous research interest in both academia and industry. For a detailed survey of PageRank see, e.g., [5, 14]. We note that PageRank is well-defined in both directed and undirected graphs. Note that it is difficult to compute the PageRank distribution (exactly) analytically (no analytical formulas are known for general directed graphs) and hence various computational methods have been used to estimate the PageRank distribution. In fact, this is true even for undirected graphs [12].
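As a point of comparison for the Monte Carlo approach discussed next, the stationary distribution of the walk defined above can be approximated centrally by repeated multiplication with the PageRank transition matrix. The following is a minimal sketch, not part of the original paper: the function name `pagerank_power_iteration`, the adjacency-list input, the value `eps=0.15`, and the three-node example graph are illustrative assumptions, and every node is assumed to have at least one outgoing link.

```python
def pagerank_power_iteration(out_links, eps=0.15, iters=200):
    """Approximate the stationary distribution of the PageRank random walk.

    out_links[i] lists the out-neighbours of node i. Assumption of this
    sketch: every node has at least one out-link.
    """
    n = len(out_links)
    pi = [1.0 / n] * n                       # start from the uniform distribution
    for _ in range(iters):
        nxt = [eps / n] * n                  # reset term: mass eps spread uniformly
        for i, links in enumerate(out_links):
            share = (1.0 - eps) * pi[i] / len(links)   # (1 - eps) * pi_i * Q_ij
            for j in links:
                nxt[j] += share
        pi = nxt
    return pi


# Hypothetical directed graph on nodes 0, 1, 2.
graph = [[1, 2], [2], [0]]
pi = pagerank_power_iteration(graph)
```

Each iteration needs the complete result of the previous one, which is the synchronization cost that motivates the Monte Carlo alternative in a distributed setting.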

There are two broad approaches to computing PageRanks (e.g., see [3]). One is to use linear algebraic techniques (e.g., the Power Iteration [16]) and the other is Monte Carlo [2]. In the Monte Carlo method, the basic idea is to approximate PageRanks by directly simulating the corresponding random walk and then estimating the stationary distribution with the performed walk’s distribution. In [2], Avrachenkov et al. proposed the following Monte Carlo method for PageRank approximation: Perform $K$ random walks (according to the PageRank transition probability) starting from each node $v$ of the graph $G$. For each walk, terminate the walk with its first reset instead of moving to a random node. It is shown that the frequencies of visits of all these random walks to different nodes will approximate the PageRanks. Our distributed algorithms are based on the above method.
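The Monte Carlo method just described can be sketched as a small centralized simulation. This is an illustration under stated assumptions rather than the implementation of [2]: the function name, the parameter `walks_per_node` (the number of walks started at each node), `eps=0.15`, the seed, and the example graph are all hypothetical. The estimate for a node is its visit count scaled by the reset probability and divided by the total number of walks, so the estimates sum to approximately (not exactly) one.

```python
import random


def monte_carlo_pagerank(out_links, eps=0.15, walks_per_node=2000, seed=0):
    """Estimate PageRanks from visit counts of walks that stop at their first reset."""
    rng = random.Random(seed)
    n = len(out_links)
    visits = [0] * n
    for start in range(n):
        for _ in range(walks_per_node):
            node = start
            visits[node] += 1                  # the start node counts as a visit
            while rng.random() >= eps:         # continue with probability 1 - eps
                node = rng.choice(out_links[node])
                visits[node] += 1
            # the walk terminates here, at its first reset
    # expected total number of visits is n * walks_per_node / eps
    return [eps * z / (n * walks_per_node) for z in visits]


# Hypothetical directed graph on nodes 0, 1, 2 (every node has an out-link).
graph = [[1, 2], [2], [0]]
estimate = monte_carlo_pagerank(graph)
```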

Monte Carlo methods are efficient, lightweight and highly scalable [2]. Monte Carlo methods have been useful in designing algorithms for PageRank and its variants in important computational models like data streaming [9] and MapReduce [3]. The works in [18, 19] study distributed implementation of PageRank in peer-to-peer networks but use iterative methods.

3 A Distributed Algorithm for PageRank

We present a Monte Carlo based distributed algorithm for computing the PageRank distribution of a network [2]. The main idea of our algorithm (formal pseudocode is given in Algorithm 1) is as follows. Perform $K$ ($K$ will be fixed appropriately later) random walks starting from each node of the network in parallel. In each round, each random walk independently goes to a random (outgoing) neighbor with probability $1 - \epsilon$ and with the remaining probability (i.e., $\epsilon$) terminates in the current node. Henceforth, we call such a random walk a ‘PageRank random walk’. In [2], this random walk process is shown to be equivalent to one based on the PageRank transition matrix $P$, defined in Section 2.2. It is easy to see that picking each node as starting point the same number of times (i.e., restarting walks according to the uniform distribution) accounts for the $(\epsilon/n) J$ term in Equation 1; and between any two restarts, we just have a simple random walk that terminates with probability $\epsilon$ in each step, which accounts for the $(1 - \epsilon) Q$ term. Since $\epsilon$ is the probability of termination of a walk in each round, the expected length of every walk is $1/\epsilon$ and the length will be at most $O(\log n/\epsilon)$ with high probability. Let every node $v$ count the number of visits (say, $\zeta_v$) of all the walks that go through it. Then, after termination of all walks in the network, each node $v$ computes (estimates) its PageRank as $\tilde{\pi}_v = \frac{\zeta_v \epsilon}{nK}$. Notice that $nK/\epsilon$ is the (expected) total number of visits over all nodes of all the walks. The above idea of counting the number of visits is a standard technique to approximate PageRanks (see e.g., [2, 4]). We note that our algorithm in this section does not require any direct communication between non-neighbors.

We show in the next section that the above algorithm computes the PageRank vector accurately (with high probability) for an appropriate value of $K$. The main technical challenge in implementing the above method is that performing many walks from each node in parallel can create a lot of congestion. Our algorithm uses a crucial idea to overcome the congestion. We show (cf. Lemma 3.2) that there will be no congestion in the network even if we start a polynomial number of random walks from every node in parallel. The main idea is based on the Markovian (memoryless) properties of the random walks and the process that terminates the random walks. To calculate how many walks move from node $i$ to node $j$, node $i$ only needs to know the number of walks that reached it. It does not need to know the sources of these walks or the transitions that they took before reaching node $i$. Thus it is enough to send the count of the number of walks that pass through a node. The algorithm runs until all the walks are terminated, which takes at most $O(\log n/\epsilon)$ rounds with high probability. Then every node $v$ outputs its PageRank as the ratio between the number of visits to it (denoted by $\zeta_v$) and the (expected) total number of visits over all nodes of all the walks, $nK/\epsilon$. We show that our algorithm computes approximate PageRanks in $O(\log n/\epsilon)$ rounds with high probability (cf. Theorem 3.3).

Input (for every node): Number of nodes $n$ and reset probability $\epsilon$.
Output: Approximate PageRank of each node.
[Each node $v$ starts $K = c\log n$ walks, where $c = \frac{2}{\delta'\epsilon}$ and $\delta'$ is defined in Section 3.2. All walks keep moving in parallel until they terminate. The termination probability of each walk is $\epsilon$, so the expected length of each walk is $1/\epsilon$.]

1: Each node $v$ maintains a count variable “$couponCount_v$” corresponding to the number of random walk coupons held by $v$. Initially, $couponCount_v = c\log n$ for starting $c\log n$ random walks.
2: Each node $v$ also maintains a counter $\zeta_v$ for counting the number of visits of random walks to it. Set $\zeta_v = c\log n$.
3: for round $i = 1, 2, \ldots, B\log n/\epsilon$ do    // [for sufficiently large constant $B$]
4:    Each node $v$ holding at least one alive coupon (i.e., $couponCount_v \neq 0$) does the following in parallel:
5:    For every neighbor $u$ of $v$, set $T^v_u = 0$            // [$T^v_u$ is the number of random walks moving from $v$ to $u$ in round $i$]
6:    for $j = 1, 2, \ldots, couponCount_v$ do
7:       With probability $1 - \epsilon$, pick a uniformly random outgoing neighbor $u$
8:       $T^v_u := T^v_u + 1$
9:    end for
10:   Send the coupon counter number $T^v_u$ to the respective outgoing neighbors $u$.
11:   Each node $u$ computes: $\zeta_u = \zeta_u + \sum_{v \in N(u)} T^v_u$.             // [the quantity $\sum_{v \in N(u)} T^v_u$ is the total number of visits of random walks to $u$ in the $i$-th round (from its neighbors)]
12:   Each node $u$ updates the count variable $couponCount_u = \sum_{v \in N(u)} T^v_u$
13: end for
14: Each node $v$ outputs its PageRank as $\frac{\zeta_v \epsilon}{c n \log n}$.
Algorithm 1 Basic-PageRank-Algorithm
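The round structure of the algorithm above, in which nodes exchange only counts of walks rather than the walks themselves, can be illustrated with a centralized simulation. This is a sketch under stated assumptions, not the authors' implementation: the function name `basic_pagerank_rounds`, the parameter names `K` and `max_rounds`, the seed, and the example graph are hypothetical, and `max_message` is an added diagnostic recording the largest single count sent over any edge in any round.

```python
import random


def basic_pagerank_rounds(out_links, eps, K, max_rounds, seed=0):
    """Simulate synchronous rounds in which each edge carries one integer count."""
    rng = random.Random(seed)
    n = len(out_links)
    coupons = [K] * n            # walks currently alive at each node
    visits = [K] * n             # every walk visits its own start node
    max_message = 0
    for _ in range(max_rounds):
        if not any(coupons):     # all walks have terminated
            break
        incoming = [0] * n
        for v in range(n):
            sent = {}            # sent[u] = number of walks moving v -> u this round
            for _ in range(coupons[v]):
                if rng.random() < 1.0 - eps:       # walk survives and moves on
                    u = rng.choice(out_links[v])
                    sent[u] = sent.get(u, 0) + 1
            for u, count in sent.items():          # one integer per edge
                incoming[u] += count
                max_message = max(max_message, count)
        for u in range(n):
            visits[u] += incoming[u]
            coupons[u] = incoming[u]
    estimate = [eps * z / (n * K) for z in visits]
    return estimate, max_message


# Hypothetical directed graph on nodes 0, 1, 2.
graph = [[1, 2], [2], [0]]
estimate, max_message = basic_pagerank_rounds(graph, eps=0.15, K=500, max_rounds=400)
```

Because a count never exceeds the total number of walks in the network, it can be encoded in a number of bits logarithmic in that total, which is the observation behind the message-size bound.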

3.1 Analysis

Our algorithm computes the PageRank of each node $v$ as $\tilde{\pi}_v = \frac{\zeta_v \epsilon}{nK}$ and we say that $\tilde{\pi}_v$ approximates the original PageRank $\pi_v$. We first focus on the correctness of our approach and then analyze the running time.

3.2 Correctness of PageRank Approximation

The correctness of the above approximation follows directly from the main result of [2] and also from [4] (Theorem 1). In particular, it is mentioned in [2, 4] that the approximate PageRank value is quite good even for $K = 1$. It is easy to see that the expected value of $\tilde{\pi}_v$ is $\pi_v$ (a formal proof is given in [2]). It then follows from Theorem 1 in [4] that $\tilde{\pi}_v$ is sharply concentrated around its expectation $\pi_v$. We include the proof of the theorem below for the sake of completeness.

Theorem 3.1 (Theorem 1 in [4]).

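$\Pr[|\tilde{\pi}_v - \pi_v| \ge \delta\pi_v] \le e^{-n \pi_v K \delta'}$, where $\delta'$ is a constant depending on $\epsilon$ (the reset probability) and $\delta$.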

Proof.

For simplicity we first show the result assuming . For general value of , it will follow in the similar way. Fix an arbitrary node . Define to be times the number of visits to in the walk started at , to be the length of this walk, , and . Then, ’s are independent, and hence , , and . Then it follows easily that,

Thus,

where is a random variable with having geometric distribution with parameter , and is a constant depending on and , and can be found by optimization over .

The proof for the other direction is similar. ∎

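From the above bound (cf. Theorem 3.1), we see that for $K = \frac{2\log n}{\delta' n \pi_{\min}}$, $\Pr[|\tilde{\pi}_v - \pi_v| \ge \delta\pi_v] \le 1/n^2$ for any $v$, where $\pi_{\min}$ is the minimal PageRank. Using the union bound, the probability that there exists a node $v$ with $|\tilde{\pi}_v - \pi_v| \ge \delta\pi_v$ is at most $1/n$. Hence, for all nodes $v$, $|\tilde{\pi}_v - \pi_v| \le \delta\pi_v$ with probability at least $1 - 1/n$, i.e., with high probability. This implies that we get a $\delta$-approximation of the PageRank vector with high probability for $K = \frac{2\log n}{\delta' n \pi_{\min}}$. Note that $\delta$ can be arbitrary. Since the PageRank of any node is at least $\epsilon/n$ (i.e., the minimal PageRank value satisfies $\pi_{\min} \ge \epsilon/n$), it suffices to take $K = \frac{2\log n}{\delta'\epsilon}$. For simplicity we define $c = \frac{2}{\delta'\epsilon}$, which is constant assuming $\delta'$ (and hence $\delta$) and $\epsilon$ are constant. Therefore, it is enough if we perform $c\log n$ PageRank random walks from each node. We note that while this value of $K$ is sufficient to guarantee a constant approximation of the PageRanks, our algorithm permits a larger value of $K$, allowing for a tighter approximation with the same running time (this follows from Lemma 3.2 below). Now we focus on the running time of our algorithm.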

3.3 Time Complexity

From the above section we see that our algorithm is able to compute the PageRank vector in $O(\log n/\epsilon)$ rounds with high probability if we can perform $c\log n$ walks from each node in parallel without any congestion. The lemma below guarantees that there will be no congestion even if we do a polynomial number of walks in parallel.

Lemma 3.2.

The algorithm can be implemented such that the message size is at most $O(\log n)$ bits per edge in every round.

Proof.

It follows from our algorithm that each node only needs to count the number of visits of random walks to itself, so only counts of walks are sent over the edges. Since the total number of random walk coupons in the network is polynomially bounded, $O(\log n)$ bits suffice. ∎

Theorem 3.3.

The algorithm Basic-PageRank-Algorithm (cf. Algorithm 1) computes a $\delta$-approximation of the PageRanks in $O(\log n/\epsilon)$ rounds with high probability for any constant $\delta$.

Proof.

The algorithm outputs the PageRanks when all the walks terminate. Since the termination probability is $\epsilon$, a walk terminates after $1/\epsilon$ steps in expectation, and with high probability (via a Chernoff bound) the walk terminates in $O(\log n/\epsilon)$ rounds. By the union bound [15], all walks (there are only polynomially many) terminate in $O(\log n/\epsilon)$ rounds with high probability. Since all the walks are moving in parallel and there is no congestion (this follows from Lemma 3.2), all the walks in the network terminate in $O(\log n/\epsilon)$ rounds with high probability. Hence the algorithm is able to output the PageRanks in $O(\log n/\epsilon)$ rounds with high probability. The correctness of the PageRank approximation follows from [2, 4] as discussed earlier in Section 3.2. The $\delta$-approximation guarantee follows from Theorem 3.1. ∎
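The high-probability bound on the walk length used in this proof can also be seen directly from the geometric tail. The following worked derivation is a sketch with notation introduced only for illustration: $L$ is the number of steps a single walk survives, $a > 0$ is an arbitrary constant, and $N$ is the (polynomially bounded) total number of walks in the network.

```latex
\begin{align*}
\Pr[L > t] &= (1-\epsilon)^{t} \;\le\; e^{-\epsilon t}, \\
\Pr\!\left[L > \frac{a \ln n}{\epsilon}\right] &\le e^{-a \ln n} \;=\; n^{-a}, \\
\Pr\!\left[\text{some walk survives more than } \frac{a \ln n}{\epsilon} \text{ steps}\right] &\le N \cdot n^{-a}.
\end{align*}
```

Choosing $a$ sufficiently large relative to the exponent of the polynomial bounding $N$ makes the last probability polynomially small, so all walks terminate within $O(\log n/\epsilon)$ rounds with high probability.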

4 A Faster Distributed PageRank Algorithm (for Undirected Graphs)

We present a faster algorithm for PageRank computation in undirected graphs. Our algorithm’s time complexity holds in the bandwidth-restricted communication model, which requires only $\mathrm{polylog}\, n$ bits to be sent over each link in each round.

We use a similar Monte Carlo method as described in Section 3 to estimate PageRanks. This says that the PageRank of a node is the ratio between the number of visits of PageRank random walks to itself and the sum of all the visits over all nodes in the network. In the previous section (cf. Section 3) we showed that in $O(\log n/\epsilon)$ rounds, one can approximate PageRank accurately by walking in a naive way in general graphs. We now outline how to speed up our previous algorithm (cf. Algorithm 1) using an idea similar to the one used in [11]. In [11], it is shown how one can perform a simple random walk in an undirected graph of length $\ell$ in $\tilde{O}(\sqrt{\ell D})$ rounds w.h.p. ($D$ is the diameter of the network). (In each step of a simple random walk, an edge is taken from the current node $x$ with probability proportional to $1/d(x)$, where $d(x)$ is the degree of $x$.) The high level idea of their algorithm is to perform ‘many’ short walks in parallel and later ‘stitch’ them to get the desired longer length walk. To apply this idea in our case, we modify our approach accordingly, as speeding up (many) PageRank random walks is different from speeding up one simple random walk. We show that our improved algorithm (cf. Algorithm 2) approximates PageRanks in $O(\sqrt{\log n}/\epsilon)$ rounds.

4.1 Description of Our Algorithm

In Section 3, we showed that by performing $\Theta(\log n)$ walks (in particular we are performing $c\log n$ walks, where $c = \frac{2}{\delta'\epsilon}$ and $\delta'$ is defined in Section 3.2) of length $O(\log n/\epsilon)$ from each node, one can estimate the PageRank vector accurately with high probability. In this section we focus on the problem of efficiently performing $\Theta(n\log n)$ walks ($\Theta(\log n)$ from each node), each of length $O(\log n/\epsilon)$, and counting the number of visits of these walks to different nodes. Throughout, by “random walk” we mean the “PageRank random walk” (cf. Section 3).

The main idea of our algorithm is to first perform ‘many’ short random walks in parallel and then ‘stitch’ those short walks to get the longer walk of length $O(\log n/\epsilon)$ and subsequently ‘count’ the number of visits of these random walks to different nodes. In particular, our algorithm runs in three phases. In the first phase, each node $v$ performs $\eta_v = \eta\, d(v)$ ($d(v)$ is the degree of $v$) independent ‘short’ random walks of length $\lambda$ in parallel. The values of the parameters $\eta$ and $\lambda$ will be fixed later in the analysis. This is done naively by forwarding $\eta_v$ ‘coupons’ having the ID of $v$ from $v$ (for each node $v$) for $\lambda$ steps via random walks. Besides the node’s ID, we also assign a coupon number to each coupon to keep track of the path followed by the random walk coupon. The intuition behind performing $\eta\, d(v)$ short walks is that the PageRank of an undirected graph is proportional to the degree distribution [12]. Therefore we can easily bound the number of visits of random walks to any node (cf. Lemma 4.1). At the end of this phase, if node $u$ has $k$ random walk coupons with the ID of a node $v$, then $u$ is a destination of $k$ walks starting at $v$. Note that just after this phase, $v$ has no knowledge of the destinations of its own walks, but these can be learned by direct communication from the destination nodes. The destination nodes (at most $\eta_v$) have the ID of the source node $v$, so they can contact the source node via direct communication. We show that this takes at most a constant number of rounds, as only a polylogarithmic number of bits are sent (since $\eta$ will be at most polylogarithmic in $n$). It is shown that the first phase takes $O(\lambda)$ rounds (cf. Lemma 4.2).

In the second phase, starting at source node $s$, we ‘stitch’ some of the $\lambda$-length walks prepared in the first phase. Note that we do this for every node in parallel, as we want to perform $c\log n$ walks from each node. The algorithm starts from $s$ and samples one coupon distributed from $s$ in Phase 1. At the end of Phase 1, each node knows the destination node’s ID of its short walks (or coupons). When a coupon needs to be sampled, node $s$ chooses a coupon number sequentially (in order of the coupon IDs) from the unused set of coupons and informs the destination node (which will be the next stitching point) holding the coupon by direct communication, since $s$ knows the ID of the destination node at the end of the first phase. Let $C$ be the sampled coupon and $v$ be the destination node of $C$. The source $s$ then sends a ‘token’ to $v$ and deletes the coupon $C$ so that $C$ will not be sampled again at $s$. This is because our goal is to produce independent random walks of a given length, so naturally we do not reuse the same short walks; in other words, this preserves randomness when we concatenate short walks. The process then repeats. That is, the node $v$ currently holding the token samples one of the coupons it distributed in Phase 1 and forwards the token to the destination of the sampled coupon, say $v'$. Nodes $v, v'$ are called ‘connectors’; they are the endpoints of the short walks that are stitched. A crucial observation is that the walks of length $\lambda$ used to distribute the corresponding coupons from $s$ to $v$ and from $v$ to $v'$ are independent random walks. Therefore, we can stitch them to get a random walk of length $2\lambda$. We can therefore generate a random walk of length $3\lambda, 4\lambda, \ldots$ by repeating this process. We do this until the remaining length of the walk is less than $\lambda$. Then, we complete the rest of the walk by doing the naive random walk algorithm. Note that in the beginning of Phase 2, we first check the length of survival of each walk and then stitch them accordingly. We show that Phase 2 finishes in rounds (cf. Lemma 4.4).

Input (for every node): Number of nodes $n$, reset probability $\epsilon$ and short walk length $\lambda$.
Output: Approximate PageRank of each node.

Phase 1: (Each node $v$ performs $\eta_v = \eta\, d(v)$ random walks of length $\lambda$. At the end of this phase, there are $\eta_v$ (not necessarily distinct) nodes holding a ‘coupon’ containing the ID of $v$.)

1: Each node $v$ constructs $\eta_v$ messages, each containing the ID of $v$ and a coupon number.    // [We will refer to these messages created by node $v$ as ‘coupons created by $v$’.]
2: for $i = 1$ to $\lambda$ do
3:    This is the $i$-th iteration. Each node $v$ holding at least one coupon does the following in parallel:
4:    for each coupon $C$ held by $v$ do         // [i.e., the coupons which were received by $v$ in the $(i-1)$-th iteration.]
5:       Generate a random number $r \in [0, 1]$.
6:       if $r < \epsilon$ then
7:          Terminate the coupon $C$ and keep the coupon, as then $v$ itself is the destination.
8:       else
9:          Pick a neighbor $u$ uniformly at random for the coupon $C$ and forward $C$ to $u$.
10:      end if
11:   end for    // [Note that an iteration could require more than 1 round, because of congestion.]
12: end for
13: Each destination node sends its ID to the source node, as it has the source node’s ID now.      // [Destination nodes hold the short walk coupon(s) and contact the source nodes through direct communication.]

Phase 2: (Stitch short walks by token forwarding. Stitch approximately $\Theta(\log n/(\epsilon\lambda))$ walks, each of length $\lambda$. Recall that each node wants to perform $c\log n$ long random walks, where $c = \frac{2}{\delta'\epsilon}$ and $\delta'$ is defined in Section 3.2.)

1: Each node $v$ generates $c\log n$ “tokens”, each carrying the ID of $v$ and a length $\ell$, where $\ell$ is a random integer drawn from the geometric distribution with parameter $\epsilon$, i.e., from the distribution of the lengths of the random walks.
2: for $i = 1, 2, \ldots, B\log n/(\epsilon\lambda)$ do    // [for sufficiently large constant $B$]
3:    Each node $v$ holding at least one token with $\ell \ge \lambda$ does the following in parallel:
4:    For each token with $\ell \ge \lambda$, send the token, with its remaining length reduced by $\lambda$, to $u$, where $u$ is the destination of the next unused coupon (in order of sequence number) from the set of the coupons distributed by $v$ in Phase 1, and delete the token at $v$.             // [$v$ sends to $u$ via direct communication.]
5:    For each such received message, node $u$ memorizes the source and coupon number of the short walk and creates a token with the reduced remaining length.         // [Each node memorizes it for backtracking in Phase 3.]
6: end for
7: For the remaining tokens (whose remaining length satisfies $0 \le \ell < \lambda$): for each of them walk naively in parallel for another $\ell$ steps.

Phase 3: (Counting the number of visits of short walks to a node)

1: Each node $v$ maintains and initializes a counter $\zeta_v$ to keep track of the number of visits of walks at $v$.
2: Each node which memorized coupon IDs in Phase 2 does the following in parallel:
3: For each such coupon, starting from the connector node, trace the corresponding short random walk in reverse.
4: Count the number of visits to any node $v$ during this reverse tracing and add it to $\zeta_v$. Also count the visits during ‘naively walking’ walks (Step 7 in Phase 2) and add them to $\zeta_v$.
5: Each node $v$ outputs its PageRank as $\frac{\zeta_v \epsilon}{c n \log n}$.
Algorithm 2 Improved-PageRank-Algorithm

In the third phase we count the number of visits of all the random walks to a node. As we have discussed, we have to create many short walks of length $\lambda$ from each node. Some short walks may not be used to make the long walks of length $O(\log n/\epsilon)$. We show a technique to count all the used short walks’ visits to different nodes. We note that after completion of Phase 2, all the long walks ($c\log n$ from each node) have been stitched. During stitching (i.e., in Phase 2), each connector node (which is also the end point of the short walk) should remember the source node and the coupon number of the short walk. Then we start from each of the connector nodes and do a walk in the reverse direction (i.e., retrace the short walk backwards) to the respective source nodes in parallel. During the reverse walk, we simply count the visits to nodes. It is easy to see that this will take at most $O(\lambda)$ rounds, in accordance with Phase 1 (cf. Lemma 4.5). Now we analyze the running time of our algorithm Improved-PageRank-Algorithm. The compact pseudocode is given in Algorithm 2.
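The stitching idea described in this section can be illustrated with a simplified centralized sketch. It deliberately departs from the algorithm in several ways, all of which are assumptions of this illustration: short walks are plain simple random walks of fixed length (no termination inside a short walk), the total walk length is passed in directly instead of being drawn from a geometric distribution, stored paths replace coupons and direct communication, and the names `short_walk`, `stitched_walk`, `lam`, and the example graph are hypothetical. What it does show is the core mechanism: pre-computed short walks are consumed at most once at each connector, and the remainder is walked naively.

```python
import random


def short_walk(adj, start, length, rng):
    """Return the path of a simple random walk of the given length from start."""
    path = [start]
    for _ in range(length):
        path.append(rng.choice(adj[path[-1]]))
    return path


def stitched_walk(adj, source, total_len, lam, coupons, rng):
    """Build a walk of total_len steps by stitching unused short walks of length lam."""
    path = [source]
    remaining = total_len
    while remaining >= lam and coupons[path[-1]]:
        segment = coupons[path[-1]].pop()      # each short walk is used at most once
        path.extend(segment[1:])               # its endpoint becomes the next connector
        remaining -= lam
    # finish the rest naively (also covers the case of exhausted coupons)
    path.extend(short_walk(adj, path[-1], remaining, rng)[1:])
    return path


rng = random.Random(1)
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]         # hypothetical undirected graph
lam = 3
# Phase-1 analogue: each node prepares a number of short walks proportional to its degree.
coupons = {v: [short_walk(adj, v, lam, rng) for _ in range(4 * len(adj[v]))]
           for v in range(len(adj))}
walk = stitched_walk(adj, 0, 11, lam, coupons, rng)
```

Deleting a short walk once it has been used is what keeps the stitched segments independent of one another.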

4.2 Analysis

First we are interested in the value of i.e., the number of coupons (short walks) needed from each node to successfully answer all the stitching requests. Notice that it is possible that coupons are not enough if is not chosen suitably large: We might forward the token to some node many times in Phase 2 and all coupons distributed by in the first phase may be deleted. In other words, is chosen as a connector node many times, and all its coupons have been exhausted. If this happens then the stitching process cannot progress. To fix this problem, we use an easy upper bound of the number of visits to any node of a random walk of length in an undirected graph: times. Therefore each node will be visited as a connector node at most times. This implies that each node does not have to prepare too many short walks.

The following lemma bounds the number of visits to every node when we do walks from each node, each of length (note that this is the maximum length of a long walk, w.h.p.).

Lemma 4.1.

If each node performs random walks of length , then no node is visited more than times with high probability.

Proof.

We show the above bound on the number of visits still holds if each node performs random walks of length . Suppose each node starts simple random walks in parallel. We claim that after any given number of steps , the expected number of random walks at node is still . Consider the random walk’s transition probability matrix . Then, holds for the stationary distribution having value , where is the number of edges in the graph. Now the number of random walks started at any node is proportional to its stationary distribution, therefore, in expectation, the number of random walks at any node after steps remains the same. We show this is true with high probability using Chernoff bound technique, since the random walks are independent. For each random walk coupon , any , and any vertex , we define to be the random variable having value 1 if the random walk is at after step. Let , i.e., is the total number of random walks are at after step. By Chernoff bound, for any vertex and any ,

It follows that the probability that there exists an vertex and an integer such that is at most since and . Therefore, for all and for all , with high probability.

Now, if each node starts independent random walks that terminate with probability in each step, the number of random walks to any node is dominated from above by . This is because there will be at most random walk coupons in the network in each step. Therefore, the total number of visits by all random walks to any node is bounded by w.h.p., since there are total of steps. ∎
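The stationarity fact used in this proof, that on an undirected graph the degree-proportional distribution is preserved by one step of a simple random walk, can be checked exactly on a small example. This is an illustrative check, not part of the original proof; the function name `step` and the four-node graph are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def step(dist, adj):
    """One step of a simple random walk: each node splits its mass equally among its neighbours."""
    nxt = [Fraction(0)] * len(adj)
    for v, nbrs in enumerate(adj):
        for u in nbrs:
            nxt[u] += dist[v] / len(nbrs)
    return nxt


adj = [[1, 2], [0, 2], [0, 1, 3], [2]]          # hypothetical undirected graph with 4 edges
two_m = sum(len(nbrs) for nbrs in adj)          # sum of degrees = 2 * (number of edges)
pi = [Fraction(len(nbrs), two_m) for nbrs in adj]
after = step(pi, adj)
```

Starting a number of walks proportional to the degree at every node therefore keeps the expected number of walks at each node unchanged from step to step, which is the property the lemma exploits.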

It is now clear from the above lemma (cf. Lemma 4.1) that i.e., each node has to prepare short walks of length in Phase 1. Now we show the running time of our algorithm (cf. Algorithm 2) using the following lemmas.

Lemma 4.2.

Phase 1 finishes in rounds.

Proof.

It is known from Lemma 4.1 that in Phase 1, each node performs walks of length . Assume that initially each node starts with coupons (or messages) and each coupon takes a random walk according to the PageRank transition probability. Now, as in the proof of Lemma 4.1, after any given number of steps , the expected number of coupons at any node is . Therefore, in expectation, the number of messages, say , that want to go through an edge in any round is at most (from the two endpoints of the edge). This is because the number of messages that the edge receives from one of its end nodes, say , is in expectation exactly the number of messages at divided by . Using a Chernoff bound, we get . By a union bound, the probability that there exists an edge and an integer such that is at most , since and . Hence the number of messages that go through any edge in any round is at most with high probability. So the message size will be at most bits w.h.p. over any edge in each round (a message contains source IDs and coupon IDs, each of which can be encoded using bits). Since the model we consider allows messages of polylogarithmic (i.e., ) size per edge per round, we can extend all the random walks from length to length in rounds. Therefore, for walks of length it takes rounds, as claimed. ∎
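The per-edge load argument can be illustrated in the same spirit. This is a sketch under stated assumptions: `GRAPH` and the multiplier `k` are hypothetical, and termination of walks is ignored (it can only lower the load). If a node holds a number of coupons proportional to its degree and every coupon picks an incident edge uniformly at random, each endpoint contributes the same expected number of messages to an edge.

```python
from fractions import Fraction

# Hypothetical small undirected graph (adjacency lists), for illustration only.
GRAPH = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}


def expected_edge_load(graph, counts):
    """Expected number of messages crossing each undirected edge in one round.

    counts[u] is the (integer) number of coupons currently held by node u;
    each coupon chooses one of the d(u) incident edges uniformly at random.
    """
    load = {}
    for u, neighbors in graph.items():
        per_edge = Fraction(counts[u], len(neighbors))  # coupons at u divided by d(u)
        for v in neighbors:
            edge = (min(u, v), max(u, v))  # canonical name for the undirected edge
            load[edge] = load.get(edge, Fraction(0)) + per_edge
    return load


if __name__ == "__main__":
    k = 3  # hypothetical number of coupons per unit of degree
    counts = {v: k * len(nbrs) for v, nbrs in GRAPH.items()}
    print(expected_edge_load(GRAPH, counts))  # every edge carries 2 * k in expectation
```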

Lemma 4.3.

With the message size in Phase 2, one stitching step from each node in parallel can be done in one round.

Proof.

Each node knows all of its short walks’ (or coupons’) destination address and the . Each time a source or connector node wants to stitch, it chooses its unused coupons (created in Phase 1) sequentially in order of the coupon IDs. Then it contacts the destination node (holding the coupon) through direct communication and informs the destination node that it is the next connector node or stitching point. Therefore, in each round, it is sufficient for any node to send to connector node the maximal with destination that it has used so far. This implies that a message size of bits per edge suffices for this process. Since we assume the network allows congestion, this one-time stitching from each node in parallel will finish in one round. ∎
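The stitching rule described in this proof (consume unused coupons in order of coupon ID, then hand over to the coupon's destination) can be sketched as follows. This is an illustrative, centralized simulation under stated assumptions, not the distributed protocol: `GRAPH`, `prepare_short_walks`, and `stitch` are hypothetical names, walks do not terminate early, and only a single source is stitched rather than all nodes in parallel.

```python
import random

# Hypothetical small undirected graph (adjacency lists), for illustration only.
GRAPH = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}


def prepare_short_walks(graph, num_coupons, short_len, rng):
    """Stand-in for Phase 1: for every node, record the destinations of
    num_coupons short walks; the list index plays the role of the coupon ID."""
    coupons = {}
    for v in graph:
        destinations = []
        for _ in range(num_coupons):
            cur = v
            for _ in range(short_len):
                cur = rng.choice(graph[cur])
            destinations.append(cur)
        coupons[v] = destinations
    return coupons


def stitch(source, coupons, num_stitches):
    """Stitch short walks together; every connector uses its coupons in ID order."""
    next_unused = {v: 0 for v in coupons}  # smallest unused coupon ID per node
    connectors = [source]
    cur = source
    for _ in range(num_stitches):
        cid = next_unused[cur]
        if cid >= len(coupons[cur]):
            raise RuntimeError(f"node {cur} has no unused coupons left")
        next_unused[cur] = cid + 1
        cur = coupons[cur][cid]  # the destination becomes the next connector
        connectors.append(cur)
    return connectors


if __name__ == "__main__":
    rng = random.Random(0)
    coupons = prepare_short_walks(GRAPH, num_coupons=5, short_len=3, rng=rng)
    print(stitch(0, coupons, num_stitches=5))  # sequence of connector nodes
```

Because coupons are consumed in ID order, a node only has to remember a single counter per source, which is why a short message suffices for each stitching step.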

Lemma 4.4.

Phase 2 finishes in rounds.

Proof.

Phase 2 stitches short walks of length to get a long walk of length , where the constant is chosen sufficiently large so that all the random walks terminate within this length with high probability. Therefore, it is sufficient to stitch approximately times from each node in parallel. Since each stitching step can be done in one round (cf. Lemma 4.3), the stitching process takes rounds. It remains to bound the running time of completing the random walks at the end of Phase 2 (Step 7 in Algorithm 2). For this step, the remaining random walks, each of length less than , are executed in parallel. In this case, we do not need to send any IDs or counters with the coupon; we simply send the count of the tokens traversing an edge in a given round to the appropriate neighbors (i.e., in a way similar to Algorithm 1). Each token corresponds to a random walk for the remaining length left to complete the length . This will take at most rounds. Hence, Phase 2 finishes in rounds. ∎
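The count-only forwarding used to complete the walks can be sketched as one synchronous round. This is a minimal illustration under stated assumptions: `GRAPH`, `forward_token_counts`, and the parameter name `eps` (the per-step termination probability) are hypothetical, and the simulation is centralized. The point is that a node sends a single integer per incident edge, regardless of how many tokens cross it.

```python
import random

# Hypothetical small undirected graph (adjacency lists), for illustration only.
GRAPH = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}


def forward_token_counts(graph, tokens, eps, rng):
    """Simulate one round of anonymous token forwarding.

    tokens[u] is the number of tokens held by u. Each token terminates with
    probability eps; otherwise it moves to a uniformly random neighbor.
    Returns the new token counts and the number of tokens that terminated.
    """
    arrivals = {v: 0 for v in graph}
    terminated = 0
    for u, count in tokens.items():
        outgoing = {}  # neighbor -> number of tokens sent along that edge
        for _ in range(count):
            if rng.random() < eps:
                terminated += 1
            else:
                v = rng.choice(graph[u])
                outgoing[v] = outgoing.get(v, 0) + 1
        for v, c in outgoing.items():
            arrivals[v] += c  # the "message" on edge (u, v) is just the integer c
    return arrivals, terminated


if __name__ == "__main__":
    rng = random.Random(1)
    tokens = {v: 10 for v in GRAPH}
    tokens, done = forward_token_counts(GRAPH, tokens, eps=0.2, rng=rng)
    print(tokens, done)
```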

Lemma 4.5.

Phase 3 finishes in rounds.

Proof.

Recall that each short walk is of length . Phase 3 simply traces back the short walks from each node in parallel. So it is easy to see that we can perform all the reverse walks in parallel in rounds (in the same way as all the short walks are performed in parallel in Phase 1). Therefore, in accordance with Lemma 4.2, Phase 3 finishes in rounds. ∎
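Tracing walks back requires telling apart walks that share both endpoints. The sketch below is a hypothetical bookkeeping scheme, not the paper's data structure: each node stores the predecessor of a walk keyed by `(source, coupon_id, step)`, where the step index is an added assumption that keeps repeated visits to the same node distinct. Two walks from the same source to the same destination are then traced back along their own paths.

```python
from collections import Counter


def record_walk(trace, source, coupon_id, path):
    """Store at every node on the path the predecessor the walk arrived from."""
    for step in range(1, len(path)):
        node, prev = path[step], path[step - 1]
        trace.setdefault(node, {})[(source, coupon_id, step)] = prev


def backtrace(trace, source, coupon_id, destination, length):
    """Follow the stored predecessors from the destination back to the source,
    counting one visit for every position on the walk (source included)."""
    visits = Counter()
    cur = destination
    for step in range(length, 0, -1):
        visits[cur] += 1
        cur = trace[cur][(source, coupon_id, step)]
    visits[cur] += 1  # cur is now the source of the walk
    return visits


if __name__ == "__main__":
    trace = {}
    # Two hypothetical short walks from node 0 to node 3 along different paths.
    record_walk(trace, source=0, coupon_id=0, path=[0, 1, 2, 3])
    record_walk(trace, source=0, coupon_id=1, path=[0, 2, 0, 3])
    print(backtrace(trace, 0, 0, destination=3, length=3))
    print(backtrace(trace, 0, 1, destination=3, length=3))
```

Keyed by source alone, node 3 could not tell whether to continue the trace towards node 2 or node 0, since both walks arrive there from the same source.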

Notice that the coupon IDs are useful in this context, since the random walks starting at and ending at may have followed different paths; just knowing the number of random walks coming from is insufficient to trace the walks back. Moreover, the nodes on the paths will need to know the as well for the same reason. Now we are ready to show the main result of this section.

Theorem 4.6.

The Improved-PageRank-Algorithm (cf. Algorithm 2) computes a - approximation of the PageRanks with high probability for any constant and finishes in rounds.

Proof.

The Improved-PageRank-Algorithm consists of three phases. We have bounded the running time of each phase separately above. We now compute the overall running time of the algorithm by combining these three phases and choosing appropriate parameter values. Summing up the running times of the three phases, we get from Lemmas 4.2, 4.4, and 4.5 that the total time taken to finish the Improved-PageRank-Algorithm is rounds. Choosing gives the required bound of . The correctness and approximation guarantee follow from the previous section. ∎

5 Conclusion

We presented fast distributed algorithms for computing PageRank, a measure of fundamental interest in networks. Our algorithms are Monte Carlo and based on the idea of speeding up random walks in a distributed network. Our faster algorithm takes time only sub-logarithmic in $n$, which can be useful in large-scale, resource-constrained, distributed networks, where running time is especially crucial. Since the algorithms are based on random walks, which are lightweight, robust, and local, they can be amenable to self-organizing and dynamic networks.

Acknowledgments

We thank the anonymous reviewers for their detailed comments which helped in improving the presentation of the paper.

References

  • [1] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In Proc. of IEEE Symposium on Foundations of Computer Science (FOCS), pages 475–486, 2006.
  • [2] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova. Monte carlo methods in pagerank computation: When one iteration is sufficient. SIAM J. Numer. Anal., 45(2):890–904, 2007.
  • [3] B. Bahmani, K. Chakrabarti, and D. Xin. Fast personalized pagerank on mapreduce. In Proc. of ACM SIGMOD Conference, pages 973–984, 2011.
  • [4] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and personalized pagerank. PVLDB, 4:173–184, 2010.
  • [5] P. Berkhin. A survey on pagerank computing. Internet Mathematics, 2(1):73–120, 2005.
  • [6] M. Bianchini, M. Gori, and F. Scarselli. Inside pagerank. ACM Trans. Internet Technol., 5(1):92–128, Feb. 2005.
  • [7] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of Seventh International World-Wide Web Conference (WWW), pages 107–117, 1998.
  • [8] M. Cook. Calculation of pagerank over a peer-to-peer network. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.123.9069, 2004.
  • [9] A. Das Sarma, S. Gollapudi, and R. Panigrahy. Estimating pagerank on graph streams. J. ACM, 58(3):13, 2011.
  • [10] A. Das Sarma, A. R. Molla, G. Pandurangan, and E. Upfal. Fast distributed pagerank computation. Theoretical Computer Science, 561:113–121, 2015.
  • [11] A. Das Sarma, D. Nanongkai, G. Pandurangan, and P. Tetali. Distributed random walks. Journal of the ACM, 60(1):2, 2013.
  • [12] V. Grolmusz. A note on the pagerank of undirected graphs. CoRR, abs/1205.1960, 2012.
  • [13] G. Iván and V. Grolmusz. When the web meets the cell: using personalized pagerank for analyzing protein interaction networks. Bioinformatics, 27(3):405–407, 2011.
  • [14] A. N. Langville and C. D. Meyer. Survey: Deeper inside pagerank. Internet Mathematics, 1(3):335–380, 2003.
  • [15] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005.
  • [16] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • [17] N. Perra and S. Fortunato. Spectral centrality measures in complex networks. Phys. Rev. E, 78:036107, September 2008.
  • [18] K. Sankaralingam, S. Sethumadhavan, and J. C. Browne. Distributed pagerank for p2p systems. In Proc. of 12th International Symposium on High Performance Distributed Computing, pages 58–68, June 2003.
  • [19] S. Shi, J. Yu, G. Yang, and D. Wang. Distributed page ranking in structured p2p networks. In Proc. of International Conference on Parallel Processing (ICPP), pages 179–186, 2003.
  • [20] J. Wang, J. Liu, and C. Wang. Keyword extraction based on pagerank. In Proc. of The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 857–864, 2007.