Optimized on-line computation of PageRank algorithm

Optimized on-line computation of PageRank algorithm

Dohy Hong
Alcatel-Lucent Bell Labs
Route de Villejust
91620 Nozay, France
dohy.hong@alcatel-lucent.com
July 30, 2019
Abstract

In this paper we present new ideas to accelerate the computation of the eigenvector of the transition matrix associated to the PageRank algorithm. New ideas are based on the decomposition of the matrix-vector product that can be seen as a fluid diffusion model, associated to new algebraic equations. We show through experimentations on synthetic data and on real data-sets how much this approach can improve the computation efficiency.

Computation, Ranking, Web graph, Markov chain, Eigenvector.
\numberofauthors

1

\category

G.2.2Discrete MathematicsGraph Theory[Graph algorithms] \categoryF.2.2Analysis of algorithms and problem complexityNonnumerical Algorithms and Problems[Sorting and searching] \categoryH.3.3Information storage and retrievalInformation Search and Retrieval[relevance feedback, search process] \termsAlgorithms, Experimentation {psfrags}

1 Introduction

In this paper, we are interested by the computation issue of the solution of the PageRank equation. PageRank is a link analysis algorithm that been initially introduced by [24] and used by the Google Internet search engine, that assigns a numerical value to each element of a hyper-linked set of nodes, such as the World Wide Web. The algorithm may be applied to any collection of entities (nodes) that are linked through directional relationships. The numerical value that it assigns to each node is called the PageRank of the node and as we will see below the rank value of the node is associated to a eigenvector problem.

The complexity of the problem for computing the eigenvector of a matrix increases rapidly with the dimension of the vector space. Efficient, accurate methods to compute eigenvalues and eigenvectors of arbitrary matrices are in general a difficult problem (cf. power iteration [26], QR algorithm [11, 18]).

In the particular case of PageRank equation, several specific solutions were proposed and analyzed [20, 5] including power method [24] with adaptation [15] or extrapolation [12, 16, 7], iterative aggregation/disaggregation method [21, 13, 23], adaptive on-line method [1], etc.

The approach proposed here is an improvement idea partially inspired from the algorithm proposed in [1]. We also proposed an algebraic proof of Lemma 2.3 in [1]. This approach can be also compared to the Gauss-Seidel iteration (cf. [25]): the Gauss-Seidel iteration is known to be faster than the Jacobi iteration (cf. [2]). However the approach can not be distributed because of the constraint in the order of the iteration ([17]). In our approach, the computation of each iteration uses the elements of the previous integrating the last update per component as with Gauss-Seidel method, as opposed to the power iteration (or to Jacobi method) where the computation of the whole vector is based on the previous vector, without taking into account the update (or partial update) at vector entry level.

As we will show below, our approach has the advantage of being iteration order independent and can be very naturally deployed using an asynchronous distributed computation.

Figure 1: Intuition: collection vs diffusion.

Finally, if one can associate the Gauss-Seidel method to an operation of collection (one entry of vector is updated based on the previous vector based on the incoming links), our approach consists in an operation of diffusion (the fluid diffusion from one entry of the vector consists in updating all children nodes following the outgoing links) (cf. Figure 1).

The model is introduced in Section 2. Some theoretical results are presented in Section 3. Finally we compared the computation cost of our approach to existing methods through several data-sets in Section 4.

2 Model

2.1 Notations

We consider a non-negative matrix of size such that each column sums up to one. In particular, we can associate such a matrix to a Markov chain on a state of size , where would be the transition matrix with the transition probability from state to state . In the following, we will also call as a node (from web graph and PageRank context).

The fact that each column of sums up to one means that in the Markov chain, each state has a positive probability to jump to at least one state (if such a condition is not true, we complete the matrix by replacing the zero columns with (by the personalization vector in a general case) to get . This corresponds to the dangling nodes completion [20], we will see below that such an adaptation is not required in our approach).

In this paper, we consider the iteration of equation of the form:

(1)

where is a matrix of size which can be explicitly decomposed as:

(2)

where is a normalized vector of size and is the column vector with all components equal to one. The equation (2) is an explicit Doeblin’s decomposition [9, 10]. So we have:

We assume that the stationary probability of is defined by the vector (its existence and uniqueness is straightforward e.g. from the contraction property of ). We have:

In the context of PageRank, the vector is a personalized initial condition cf. [20].

2.2 Main equations

We first define the vector by:

So that:

The equation has been used in [6] to define formulae for PageRank derivatives of any order.

We define a matrix with all entries equal to zero except for the k-th diagonal term: .

In the following, we assume given a deterministic or random sequence with . We only require that the number of occurrence of each value in to be infinity.

Then, we define the fluid vector associated to by:

(3)
(4)
(5)

where is the identity matrix.

We define the history vector by:

(6)

By definition, is an increasing function (all components are positive).

The above fluid and history vector is associated to the following algorithm (ALGO-REF):

Initialization:
  H[i] := 0;
  F[i] := (1-d)*v_i;
  R := 1-d;

k := 1;
While ( R/(1-d) > Target_Error )
  Choose i_k;
  sent := F[i_k];
  F[i_k] := 0;
  H[i_k] += sent;
  For all child node j of i_k:
    F[j] += sent*p(j,i_k)*d;
  R -= sent*(1-d);
  k++;
Remark 1

If , the above algorithm is equal to the one defined in [1]. The role of is here important: in [1], the stopping condition is not defined, because in their algorithm, what they called cash (our fluid ) is not decreasing but constant. Also, because in our algorithm, the total fluid tends to zero, we can predict and control precisely the convergence to the limit, as we will see below.

3 Theoretical results

3.1 Main equations

Theorem 1

We have the equality:

(7)
\proof

The proof is straightforward by induction: assuming the equation (7) true for :

Note that the equation (7) is true for with replaced by any initial vector.

Theorem 2

We have the equality:

(8)

and as a direct consequence, we have:

(9)

where is the norm.

\proof

Using the equation (7), we have:

By iteration, we get (8). The equation (9) is obvious remarking that

Now, we assume that the sequence is chosen such that at iteration : . Then we have the following result:

Lemma 1

If , we have:

\proof

The proof is straightforward, noticing that we suppressed and added . Then using .

Thanks to Lemma 1, we have an exponential convergence to zero of . This lemma still holds if for an entry of which is larger or equal to is chosen.

As a consequence of this lemma, we have:

Therefore, and is an increasing (component by component) function to .

Remark 2

From equations (7) and (5), we can obtain the following iteration equation on :

This formulation may be directly exploited for an efficient distributed computation of . This issue will be addressed in a future paper.

If , we still have:

and converges to after appropriate normalization (case of [1]), at least when all entries of are non-negative (when the spectral radius of is one and if there are negative entries in , it may converge or not converge): but this convergence is based on a Cesaro averaging which is known to be very slow (cf. results in Section 4).

In all cases, can be simply interpreted as a specific way to compute or to estimate the power series (when there is convergence).

3.2 Updating equation

The fact that converges to for any arbitrary choice of the sequence can be also exploited to compute more efficiently the new eigenvector in case of the graph modification (so of ).

Theorem 3

Assume the initial graph associated to is modified to the updated graph represented by , then we have:

\proof

We have

In the above updating equation, may mix positive and negative terms. We can apply on them the operator separately or jointly.

Now, assume the equation has been computed up to the iteration and that at that time, we are interested to compute the limit associated to (for instance, because the web graph has been modified/updated).

Then very naturally, we can apply our diffusion method with , but with modified initial condition for which we have:

(10)

We have the following very intuitive results:

Theorem 4

( is the limit of the equation (10)) is the solution of the equation:

\proof

The limit of (10) satisfies:

Combining this with , we have obviously what we want.

The above result implies that one can continuously update the iterations when is regularly updated by just injecting in the system a fluid quantity equal to and then applying the new matrix .

If a distributed computation approach was to be used, we just need to synchronize the time from which is applied.

3.3 About dangling nodes

In the algorithm ALGO-REF, we don’t need to complete the null columns of the matrix . We can simply add the following condition:

While ( R/(1-d) > Target_Error )
  Choose i_k;
  sent := F[i_k];
  H[i_k] += sent;
  F[i_k] := 0;
  If ( i_k has no child )
    R -= sent;
  else
    For all child node j of i_k:
      F[j] += sent*p(j,i_k)*d;
    R -= sent*(1-d);
  k++;

The quantity measures exactly (thanks to Theorem 2) the distance to the stationary probability. However when includes dangling nodes, it is easy to see that defines only an upper bound and that need to be renormalized (dividing by the norm of ) to find the probability eigenvector satisfying the PageRank equation with the completed matrix .

3.4 Asynchronous distributed computation

The proposed algorithm is very well suited for the asynchronous distributed computation (e.g. cf. [14]). Indeed, at any moment of the iteration, the fluid can be split in any number of elements and be distributed per element, the norm of each element controlling exactly the error that can be reduced. The most natural way is to divide per component of the vector, and an obvious particular example is to split the initial condition vector in elements of size .

4 Comparison results

In the following, we will call:

  • ALGO-MAX: if is defined by ;

  • ALGO-RAND: if is randomly chosen from ;

  • ALGO-PER: if (periodic cycle);

  • MAT-ITER: if the matrix product iteration (1) is used;

  • OPIC: the one defined in [1] (Random version).

The stopping condition for ALGO-REF variants are based on Target_Error. For MAT-ITER, we use the well known condition: Target_Error. For OPIC, we measured the convergence by comparing the distance to the precomputed limit (from MAT-ITER), since it does not define any stopping condition.

Below we set a simple simulation scenario to get a first evaluation of our proposed solution and comparison to existing solutions in the context of the original PageRank on the web graph. We don’t pretend to generate any realistic model, for more details on the web graph the readers may refer to [8, 19, 3, 4, 22].

4.1 Simulation scenario

We set the total number of nodes (URLs) to be simulated. Then we create random links (directional) to connect a node to as follow:

  • the choice of the source node is done following a power-law: ;

  • the choice of the destination node is done following a power-law: .

For simplicity, we assumed no correlation between the number of incoming and outgoing links: for that purpose, to defined the distribution of the destination nodes by associating to node a probability proportional to followed by a large number (by default ) of permutations of randomly chosen pair of nodes . Then, to define the distribution of the destination nodes, we did the same operation. In this way, the order of the nodes does not introduce any bias for ALGO-PER.

The tables below summarize the characteristics of the 6 synthetic data we considered varying the number of links per node and the parameter .

Scenario L (nb links) D (Nb dangling nodes)
S1 2172 9552
S2 8081 8646
S3 28507 6252
Table 1: Parameters: ,
Scenario L (nb links) Nb dangling nodes
S1b 12624 7696
S2b 61189 3197
S3b 265245 33
Table 2: Parameters: ,

4.2 Simulation results analysis

We define the elementary step as an operation requiring the use of one non-zero entry of : for instance, if has entries that are not zero, the product would require elementary steps. We already mentioned that ALGO-REF variants does not require the matrix completion . For MAT-ITER, we can also avoid such a completion, but the cost to pay is the vector renormalization. Therefore, we assumed here that the cost of is where is the number of not zero entries of the initial matrix .

For ALGO-RAND and ALGO-PER, when we have no fluid on the node (), we go to the next step without incrementing the counter of the elementary steps.

For ALGO-MAX, there is of course a computation cost to find the of . In order to show its full potential, its cost is not taken into account in the number of elementary steps. In the comparison tests, we found that taking the argmax is not necessary and same improvements were obtained replacing the argmax by iteration process where in one iteration all such that is above a threshold are all chosen, then scaling down the threshold progressively.

Figure 2, 3 and 4 shows the results for S1, S2 and S3. In these scenarios, because of , there are the proportion of the dangling nodes are important. We can observe in all cases a substantial gain with our approach.

Figure 2: Scenario S1. , .
Figure 3: Scenario S2. , .
Figure 4: Scenario S3. , .

Figure 5, 6 and 7 shows the results for S1b, S2b and S3b. In these scenarios, the proportion of the dangling nodes are less important. We can still observe in all cases a substantial gain with our approach.

Figure 5: Scenario S1b. , .
Figure 6: Scenario S2b. , .
Figure 7: Scenario S3b. , .

4.3 Comparison on web graph

In this section, we realized the same comparison tests than above on a web graph imported from the dataset gr0.California (available on http://www.cs.cornell.edu/Courses/cs685/ 2002fa/) and from the dataset uk-2007-05@1000000 (available on http://law.dsi.unimi.it/datasets.php).

The results for the dataset gr0.California is shown in Figure 8. In this figure, we added ALGO-OP a variant of ALGO-MAX where the argmax is replaced by

with (resp. ) equal to the number of incoming (resp. outgoing) links to (resp. from) the node . The intuition of this renormalization is:

  • #out: the cost of the diffusion of the fluid is proportional to #out;

  • #in: when there are many incoming links, it is worth to wait and aggregate the fluid before the fluid diffusion.

We see that we still have a significant gain and that ALGO-OP outperforms ALGO-MAX. The main reason for which the proposed approach outperforms greatly the original OPIC algorithm is the fact that we don’t have the fluctuation due to the Cesaro averaging as in OPIC (which converges as ).

Figure 8: Dataset gr0.California: 16150 links on 9664 nodes (4637 dangling nodes).

Figure 9 shows the results for the dataset uk-2007-05@1000000. Here ALGO-MAX outperforms ALGO-OP. It shows that the optimization of the sequence choice is still an open problem. We added here ALGO-OP2 based on:

Figure 9: Dataset uk-2007-05@1000000: 41247159 links on 1000000 nodes (45766 dangling nodes).

5 Conclusion

In this paper, we proposed a new algorithm to optimize the computation cost of the eigenvector of PageRank equation and showed the theoretical potential gain.

We believe that we have here a promising new approach and we are still investigating on a practical optimal ALGO-OP variants solution.

In a future work, we wish to validate the performance of our approach in terms of the effective run time cost for large data, in particular, using the asynchronous distributed computation approach.

Acknowledgments

The author is very grateful to François Baccelli for his very valuable comments and suggestions.

References

  • [1] S. Abiteboul, M. Preda, and G. Cobena. Adaptive on-line page importance computation. WWW2003, pages 280–290, 2003.
  • [2] A. Arasu, J. Novak, J. Tomlin, and J. Tomlin. Pagerank computation and the structure of the web: Experiments and algorithms, 2002.
  • [3] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
  • [4] A.-L. Barabasi, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica, A(281):69–77, 2000.
  • [5] M. Bianchini, M. Gori, and F. Scarselli. Inside pagerank. ACM Trans. Internet Techn., 2005.
  • [6] P. Boldi, M. Santini, and S. Vigna. Pagerank: Functional dependencies. ACM Trans. Inf. Syst., 27:19:1–19:23, November 2009.
  • [7] C. Brezinski, M. Redivo-Zaglia, and S. Serra-Capizzano. Extrapolation methods for pagerank computations. Comptes Rendus Acad. Sci., pages 393–397, 2005.
  • [8] A. Broder, R. Kumar, F. Maghoul1, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33(1-6):309–320, June 2000.
  • [9] W. Doeblin. Eléments d’une théorie générale des chaînes simples constantes de markov. Annales Scientifiques de l’Ecole Normale Supérieure, 57(III):61–111, 1940.
  • [10] J. L. Doob. John Wiley, New York, 1953.
  • [11] J. G. F. Francis. The QR transformation, I. The Computer Journal, 4(3):265–271, 1961.
  • [12] T. H. Haveliwala, S. D. Kamvar, D. Klein, C. D. Manning, and G. H. Golub. Computing pagerank using power extrapolation. Technical report, Stanford University, 2003.
  • [13] I. C. F. Ipsen and S. Kirkland. Convergence analysis of a pagerank updating algorithm by langville and meyer. SIAM J. Matrix Anal. Appl., 27(4):952–967, 2006.
  • [14] M. Jelasity, G. Canright, and K. Engø-Monsen. Asynchronous distributed power iteration with gossip-based normalization. In Euro-Par 2007 Parallel Processing, volume 4641 of Lecture Notes in Computer Science, pages 514–525. 2007.
  • [15] S. D. Kamvar, T. H. Haveliwala, and G. H. Golub. Adaptive methods for the computation of pagerank. Linear Algebra Appl., pages 51–65, 2004.
  • [16] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods for accelerating pagerank computations. Proc. of International World Wide Web Conference (WWW2003), pages 261–270, 2003.
  • [17] C. Kohlschütter, P.-A. Chirita, R. Chirita, and W. Nejdl. Efficient parallel computation of pagerank. In In Proc. of the 28th European Conference on Information Retrieval, pages 241–252, 2006.
  • [18] V. N. Kublanovskaya. On some algorithms for the solution of the complete eigenvalue problem. USSR Computational Mathematics and Mathematical Physics, 3:637–657, 1961.
  • [19] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. Proc. of the 41st Annual Symposium on Foundations of Computer Science, 2000.
  • [20] A. N. Langville and C. D. Meyer. Deeper inside pagerank. Internet Mathematics, 1(3), 2004.
  • [21] A. N. Langville and C. D. Meyer. Updating the stationary vector of an irreducible markov chain with an eye on google’s pagerank. Technical report, North Carolina State University, 2004.
  • [22] M. Levene, T. Fenner, G. Loizou, and R. Wheeldon. A stochastic model for the evolution of the web. Computer Networks, 39(3):277–287, June 2002.
  • [23] I. Marek and P. Mayer. Convergence theory of some classes of iterative aggregation/disaggregation methods for computing stationary probability vectors of stochastic matrices. Linear Algebra Appl., pages 177–200, 2003.
  • [24] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report Stanford University, 1998.
  • [25] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.
  • [26] R. von Mises and H. Pollaczek-Geiringer. Praktische verfahren der Gleichungsauflösung. ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik, 9:152–164, 1929.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
322167
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description