Reinforcement Learning for Matrix
Computations: PageRank as an Example
Vivek S. Borkar and Adwaitvedant S. Mathkar
Department of Electrical Engineering,
Indian Institute of Technology,
Powai, Mumbai 400076, India.
{borkar.vs, mathkar.adwaitvedant}@gmail.com
Abstract. Reinforcement learning has gained wide popularity as a technique for simulationdriven approximate dynamic programming. A less known aspect is that the very reasons that make it effective in dynamic programming can also be leveraged for using it for distributed schemes for certain matrix computations involving nonnegative matrices. In this spirit, we propose a reinforcement learning algorithm for PageRank computation that is fashioned after analogous schemes for approximate dynamic programming. The algorithm has the advantage of ease of distributed implementation and more importantly, of being modelfree, i.e., not dependent on any specific assumptions about the transition probabilities in the random websurfer model. We analyze its convergence and finite time behavior and present some supporting numerical experiments.
Key words: Reinforcement Learning, PageRank, Stochastic Approximation, Sample Complexity
1 Introduction
Reinforcement learning has its roots in models of animal behavior [1]
and mathematical psychology [2], [3]. The recent resurgence of interest in the field, however, is propelled by applications to artificial intelligence and control engineering. By now there are several textbook accounts of this development [4] (Chapter 16), [5], [6], [7], [8], [9]. To put things in context, recall that methodologically, reinforcement learning sits somewhere in between supervised learning, which works with a reasonably accurate information regarding the performance gradient or something analogous (e.g., parameter tuning of neural networks), and unsupervised learning, which works without such explicit information (e.g., clustering). To be specific, supervised learning is usually based upon an optimization formulation such as minimizing an error measure, which calls for a higher quantum of information per iterate. Reinforcement learning on the other hand has to manage with signals somehow correlated with performance, but which fall short of the kind of information required for a typical supervised learning scheme. It then makes simple incremental corrections based on these ‘suggestive though inexact’ signals, usually with low per iterate computation. The latter aspect
has also made it a popular framework for models of bounded rationality in economics [10].
Our interest is in its recent avatar as a scheme for simulationbased methodology for approximate dynamic programming for Markov decision processes which has found applications, among other things, in robotics [11]. These can be viewed as stochastic approximation counterparts of the classical iterative methods for solving dynamic programming equations, such as value and policy iteration. Stochastic approximation, introduced by Robbins and Monro [12] as an iterative scheme for finding the roots of a nonlinear function given its noisy measurements, is the basis of most adaptive schemes in control and signal processing. What it does in the present context is to replace a conditional average appearing on the right hand side of the classical iterative schemes (or their variants) by an actual evaluation at a simulated transition according to the conditional distribution in question. It then makes an incremental move towards the resulting random quantity. That is, it takes a convex combination of the current value and the random right hand side, with a slowly decreasing weight on the latter. The averaging properties of stochastic approximation then ensure that asymptotically you see the same limiting behavior as the original scheme.
But there are other situations wherein one encounters iterations involving conditional averages. In fact, by pulling out row sums of a nonnegative matrix into a diagonal matrix premultiplier, we can write it as a product of a diagonal matrix and a stochastic matrix. This allows us to cast iterations involving nonnegative matrices as iterations involving averaging with respect to stochastic matrices, making them amenable to the above methodology. This opens up the possibility of using reinforcement learning schemes for distributed matrix computations of certain kind. Important instances are plain vanilla averaging and estimation of the PerronFrobenius eigenvectors [13]. Reinforcement learning literature is replete with means of curtailing the curse of dimensionality, a hazard only too common in dynamic programming applications. This machinery then becomes available for such matrix computations. An important special case is the case of linear function approximation, wherein one approximates the desired vector by a weighted combination of a moderate number of basis vectors, and then updates these weights instead of the entire vector [14].
In the present article, we illustrate this methodology in the context of Google’s PageRank, an eigenvectorbased ranking scheme.
It is primarily based on the stationary distribution of the ‘random websurfer’ Markov chain, equivalently, the normalized left PerronFrobenius eigenvector of its transition probability matrix. This chain is defined on a directed graph wherein each node is a web page. Let the set of nodes to which points. Let and the total number of nodes. The chain moves from to with a probability and to any other node in the graph with probability where is the ‘Google constant’. The latter renders it irreducible, ensuring a unique stationary distribution. An excellent account of the numerical techniques for computing , essentially based on the ‘power method’ and its variants, appears in [15], along with a brief historical account. See also [16]. While a bulk of the work in this direction has been on efficient computations for the power method, there have also been alternative approaches, such as Markov Chain Monte Carlo [17], [18], optimization based methods [19], and schemes based on stochastic approximation and/or gossip [20], [21], [22].
Such ‘spectral ranking’ techniques, made popular by the success of PageRank, are in fact quite old. See [23] for a historical survey. Evaluative exercises of this kind occur in other applications as well, such as reputation systems or popularity measures on social networks. In such applications (for that matter, in search), it is unclear whether the assumption that each is equally important to is reasonable. Motivated by this, we propose a modelfree scheme based on ideas from reinforcement learning. This idea has also been discussed in [13]. The present scheme, however, differs in an essential way from [13] in that whereas [13] views PageRank as a special instance of the general problem of eigenvector estimation, we exploit the special structure of the random websurfer model to simplify the problem to a simple linear scheme. This is very much in tune with some of the works cited above (notably [20], [22]), but with a nonstandard sample and update rule. The outcome is an algorithm that can run on accumulated traces of nodetonode interactions without requiring us to explicitly estimate the probabilities associated with these.
The next section describes our algorithm and its convergence analysis. Section 3 describes finite time analysis and a variant of the basic scheme. Section 4 presents some numerical experiments. Section 5 concludes with some general observations.
2 The Algorithm
Let be an stochastic matrix. Define . Let denote the unique stationary probability distribution of . That is, for ,
Here is a row vector and every other vector is a column vector. Since we are only interested in ranking we can neglect the factor . Thus by abuse of terminology,
To estimate , we run the following dimensional stochastic iteration. Sample as follows: Sample uniformly and independently from . Sample with , independent of all other random variables realized before . Update as follows:
(1) 
where the stepsizes satisfy and . Hence is updated only if , or both are i. We can write (1) as follows:
where is a martingale difference sequence w.r.t. . By Theorem 2, p. 81, [24], the ODE corresponding to the iteration is
In vector form,
Since the constant doesn’t affect the asymptotic behavior, we consider
(2) 
Define .
It is easy to see that uniformly on .
Theorem 1: Under the above assumptions, a.s., where is the unique solution to .
Proof: Define , . As in the proof of Theorem 2, p. 126, [24], for ,
Integrating,
Letting ,
with equality iff . We similarly get that is a Lyapunov function for the scaled o.d.e which has the origin as its globally asymptotically stable asymptotic equilibrium. By Theorem 9, p. 75, a.s. In turn (2) has as its globally stable asymptotic equilibrium with as its Lyapunov function. The claim follows from Theorem 7 and Corollary 8, p. 74, [24].
3 Remarks

We first look at sample complexity of the stochastic iteration. We mainly use section 4.2 of [24] to derive sample complexity estimates.
Let . Let denote the stationary distribution. Without loss of generality (by relabeling if necessary), let , i.e., the components of are numbered in accordance with their ranking. We shall consider as our objective the event that the top ranks of fall within the top ranks of the output of our algorithm when stopped, for a prescribed pair . This is a natural criterion for ranking problems, which are an instance of ‘ordinal optimization’ [25]. To avoid pathologies, we assume that . We shall derive an estimate for the number of iterates needed to achieve our aim with ‘high’ probability. Let , where is the Ndimensional probability simplex. Thus consists of all distributions such that the top indices of are in the top indices of the given distribution.
Let be the time flowmap associated with the differential equation, where . Thus,
with . Define
and for ,
Then
where and for a matrix is its induced matrix norm. We argue that . To see this, view as the rate matrix of a continuous time Markov chain killed at rate . Then is its transition probability matrix after time , whose row sums will be uniformly bounded away from . The claim follows. Let and pick such that
Since ,
Let . Also, let denote the open neighborhood w.r.t. norm of a generic set . Set . Then . Arguing as in Corollary 14, p. 43 of [24], we have
where is a suitable constant. (In ibid., replace by and by .)

Note that at each time , we can generate more than one, say pairs , which are independent, each distributed as above, and change the iteration to:
That is, we update several components at once. This will speed up convergence at the expense of increased per iterate computation.
4 Numerical Experiments
In this section we simulate the algorithm for different number of nodes. The results for the cases when the number of nodes are 50, 200 and 500 are plotted in Figure 1, Figure 2 and Figure 3 resectively. The dotted line indicates the distance between and w.r.t. . The solid line indicates the percentage of top 5 indices of that do not feature in the top 10 indices of . Figure 4, Figure 5 and Figure 6 further show (for 200 nodes) that the number of iterations required to achieve this objective varies inversely with variance of .
Figure 1: varaince of =47.1641
Figure 2: variance of =277.3392
Figure 3: variance of = 743.4651
Figure 4: varaince of =259.6187
Figure 5: variance of =335.6385
Figure 6: variance of = 365.0774
5 Conclusions
In conclusion, we highlight some of the important features of the above scheme, which are facilitated by the reinforcement learning framework.

As already mentioned, the scheme does not depend on an a priori model for transition probabilities, but is completely datadriven in this aspect.

We use ‘split sampling’ introduced in [14] for reinforcement learning, sampling pairs with the desired conditional law for given , but with uniform sampling for . This is a departure from classical reinforcement learning, where one runs a single Markov chain according to and .

Since we are iterating over probability vectors as they evolve under a transition matrix, the scheme requires leftmultiplication by row vectors thereof. This is different from usual reinforcement learning schemes, which involve averaging with respect to the transition probabilities, i.e., rightmultiplication by a column vector. We have worked around this difficulty by modifying the update rule. In classical reinforcement learning algorithms based on a simulated Markov chain , one updates the th component at time , i.e., the th component gets updated only when . In the above scheme, the th component gets updated both when and when , albeit in different ways. This is another novel feature of the present scheme.
References
 Thorndike, E. L., “ Animal intelligence: an experimental study of the associative processes in animals”, Psychological Review, Monograph Supplement 2, No. 8, 1898.
 Bush, R. R. and Mosteller, F., “A mathematical model of simple learning”, Psychological Review 58, 313323.
 Estes, K. W., “Towards a statistical theory of learning”, Psychological Review 57, 94107.
 Bertsekas, D. P.; Dynamic Programming and Optimal Control, Vol. 2 (4th ed.), Athena Scientific, Belmont, Mass., 2007.
 Bertsekas, D. P. and Tsitsiklis, J. N., Neurodynamic Programming, Athena Scientific, Belmont, Mass., 1996.
 Gosavi, A., Simulationbased Optimization, Parametric Optimization Techniques and Reinforcement Learning, Springer Verlag, New York, 2003.
 Powell, W. B., Approximate Dynamic Programming: Solving the Curses of Dimensionality (2nd Edition), Wiley, New York, 2011.
 Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass., 1998.
 Szepesvari, C., Algorithms for Reinforcement Learning, Morgan and Claypool Publishers, 2010.
 Sargent, T. J., Bounded Rationality in Macroeconomics, Oxford Uni. Press, Oxford, UK, 1994.
 Thrun, S., Burgard, W. and Fox, D., Probabilistic Robotics, MIT Press, Cambridge, Mass., 2005.
 Robbins, H. and Monro, J., “A stochastic approximation method”, Annals of Math. Stat. 22, 1951, 400407.
 Borkar, V. S., Makhijani, R. and Sundaresan, R., “How to gossip if you must”, preprint, 2013, available at http://arxiv.org/abs/1309.7841
 Borkar, V., S., “Reinforcement Learning  A Bridge between Numerical Methods and Markov Chain Monte Carlo”, in Perspectives in Mathematical Sciences, ( Sastry, N. S. N., Rajeev, B., Delampady, M. and Rao, T. S. S. R. K., eds.), World Scientific, 2008.
 Langville, A. N. and Meyer, C. D.; Google’s PageRank and Beyond: The Science of Search Engine Rankings, Princeton Uni. Press, Princeton, NJ, 2006.
 Langville, A. N. and Meyer, C. D.; “Deeper inside PageRank”, Internet Mathematics 1(3), 2004, 335380.
 Avrachenkov, K.; Litvak, N.; Nemirovsky, D; and Osipova, N.; “Monte Carlo methods in PageRank computation: when one iteration is sufficient”, SIAM J. Numer. Anal. 45(2), 2007, 890904.
 Avrachenkov, K.; Litvak, N.; Nemirovsky, D; Smirnova, E.; and Sokol, M.; “Quick detection of topk personalized PageRank lists”, Algorithms and Models for the Web Graph, Proc. WAW 2011 (A. Frieze, P. Horn and P. Pralat, eds.), Lecture notes in Comp. Sci. No. 6732, Springer Verlag, BerlinHeidelberg, 2011, 5061.
 Polyak, B. T. and Timonina, A. V., “PageRank: new regularizations and simulation models”, Proc. of 11th IFAC World Congress, Milano, Aug. 28  Sep., 2011, 1120211207.
 Ishii, H. and Tempo, R.; “Distributed randomized algorithms for PageRank computation”, IEEE Trans. Auto. Control 55(9), 2010, 19872002.
 Nazin, A. V. and Polyak, B. T.; “The randomized algorithm for finding an eigenvector of the stochastic matrix with application to PageRank”, Doklady Mathematics, Vol. 79(3), 2009, 424427.
 Zhao, W.; Chen, HF. and Fang, HT., “Convergence of distributed randomized PageRank algorithms”, arXiv:1305.3178 [cs.SY], 2013.
 Vigna, S., “Spectral ranking”, http://arxiv.org/abs/0912.0238.
 Borkar, V. S.; Stochastic Approximation: A Dynamical Systems Viewpoint, Hindustan Publ. Agency, New Delhi, and Cambridge Uni. Press, Cambridge, UK, 2008.
 Ho, Y.C.; “An explanation of ordinal optimization: Soft computing for hard problems”, Information Sciences 113(3ï¿½4), 1999, 169192.