Infinite-Horizon Policy-Gradient Estimation
Abstract
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by \citeA{kimura95}. The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter $\beta\in[0,1)$ (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter $\beta$ is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper [6] we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
15 (2001) 319. Submitted 9/00; published 10/01. \firstpageno{319} \ShortHeadings{Policy-Gradient Estimation}{Baxter and Bartlett}
1 Introduction
Dynamic Programming is the method of choice for solving problems of decision making under uncertainty [9]. However, the application of Dynamic Programming becomes problematic in large or infinite state spaces, in situations where the system dynamics are unknown, or when the state is only partially observed. In such cases one looks for approximate techniques that rely on simulation, rather than an explicit model, and parametric representations of either the value function or the policy, rather than exact representations.
Simulation-based methods that rely on a parametric form of the value function tend to go by the name “Reinforcement Learning,” and have been extensively studied in the Machine Learning literature [8, 25]. This approach has yielded some remarkable empirical successes in a number of different domains, including learning to play checkers [20], backgammon [27, 28], and chess [7], job-shop scheduling [30] and dynamic channel allocation [22].
Despite this success, most algorithms for training approximate value functions suffer from the same theoretical flaw: the performance of the greedy policy derived from the approximate value function is not guaranteed to improve on each iteration, and in fact can be worse than the old policy by an amount equal to the maximum approximation error over all states. This can happen even when the parametric class contains a value function whose corresponding greedy policy is optimal. We illustrate this with a concrete and very simple example in Appendix A.
An alternative approach that circumvents this problem—the approach we pursue here—is to consider a class of stochastic policies parameterized by $\theta\in\mathbb{R}^K$, compute the gradient with respect to $\theta$ of the average reward, and then improve the policy by adjusting the parameters in the gradient direction. Note that the policy could be directly parameterized, or it could be generated indirectly from a value function. In the latter case the value-function parameters are the parameters of the policy, but instead of being adjusted to minimize error between the approximate and true value function, the parameters are adjusted to directly improve the performance of the policy generated by the value function.
These “policy-gradient” algorithms have a long history in Operations Research, Statistics, Control Theory, Discrete Event Systems and Machine Learning. Before describing the contribution of the present paper, it seems appropriate to introduce some background material explaining this approach. Readers already familiar with this material may want to skip directly to section 1.2, where the contributions of the present paper are described.
1.1 A Brief History of Policy-Gradient Algorithms
For large-scale problems or problems where the system dynamics are unknown, the performance gradient will not be computable in closed form.\footnote{See equation (17) for a closed-form expression for the performance gradient.} Thus the challenging aspect of the policy-gradient approach is to find an algorithm for estimating the gradient via simulation. Naively, the gradient can be calculated numerically by adjusting each parameter in turn and estimating the effect on performance via simulation (the so-called crude Monte-Carlo technique), but that will be prohibitively inefficient for most problems. Somewhat surprisingly, under mild regularity conditions, it turns out that the full gradient can be estimated from a single simulation of the system. The technique is called the score function or likelihood ratio method and appears to have been first proposed in the sixties [2, 17] for computing performance gradients in i.i.d. (independently and identically distributed) processes.
Specifically, suppose $r(X)$ is a performance function that depends on some random variable $X$, and $q(\theta, x)$ is the probability that $X = x$, parameterized by $\theta\in\mathbb{R}^K$. Under mild regularity conditions, the gradient with respect to $\theta$ of the expected performance,
$$\eta(\theta) = E_\theta\left[r(X)\right], \tag{1}$$
may be written
$$\nabla\eta(\theta) = E_\theta\left[r(X)\,\frac{\nabla q(\theta, X)}{q(\theta, X)}\right]. \tag{2}$$
To see this, rewrite (1) as a sum
$$\eta(\theta) = \sum_x r(x)\, q(\theta, x),$$
differentiate (one source of the requirement of “mild regularity conditions”) to obtain
$$\nabla\eta(\theta) = \sum_x r(x)\, \nabla q(\theta, x),$$
rewrite as
$$\nabla\eta(\theta) = \sum_x r(x)\, \frac{\nabla q(\theta, x)}{q(\theta, x)}\, q(\theta, x),$$
and observe that this formula is equivalent to (2).
If a simulator is available to generate samples $X$ distributed according to $q(\theta, x)$, then any sequence $X_1, X_2, \dots, X_N$ generated i.i.d. according to $q(\theta, x)$ gives an unbiased estimate,
$$\hat\nabla_N\eta(\theta) = \frac{1}{N}\sum_{i=1}^{N} r(X_i)\,\frac{\nabla q(\theta, X_i)}{q(\theta, X_i)}, \tag{3}$$
of $\nabla\eta(\theta)$. By the law of large numbers, $\hat\nabla_N\eta(\theta) \to \nabla\eta(\theta)$ with probability one. The quantity $\nabla q(\theta, X)/q(\theta, X)$ is known as the likelihood ratio or score function in classical statistics. If the performance function $r$ also depends on $\theta$, then $r(X)\,\nabla q(\theta, X)/q(\theta, X)$ is replaced by $\nabla r(\theta, X) + r(\theta, X)\,\nabla q(\theta, X)/q(\theta, X)$ in (2).
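As a concrete illustration, the following sketch estimates (3) by Monte-Carlo and compares it with the closed-form gradient. The softmax distribution over three outcomes and all names are invented purely for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.5, 0.1])   # parameters of the sampling distribution
r = np.array([1.0, 3.0, -2.0])       # performance r(x) for each outcome x

def q(theta):
    """Softmax probabilities q(theta, x) over the three outcomes."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def score(theta, x):
    """Score function: the gradient of log q(theta, x) for the softmax case."""
    g = -q(theta)
    g[x] += 1.0
    return g

# Likelihood-ratio estimate (3): average r(X_i) * score(theta, X_i) over samples.
N = 50_000
p = q(theta)
xs = rng.choice(3, size=N, p=p)
estimate = np.mean([r[x] * score(theta, x) for x in xs], axis=0)

# Closed-form gradient of eta(theta) = sum_x r(x) q(theta, x), for comparison:
# d eta / d theta_k = q_k (r_k - eta).
exact = p * (r - r @ p)
```

With 50,000 samples the Monte-Carlo average agrees with the exact gradient to within a few hundredths, illustrating both the unbiasedness of (3) and the slow $O(1/\sqrt{N})$ shrinkage of its variance.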
1.1.1 Unbiased Estimates of the Performance Gradient for Regenerative Processes
Extensions of the likelihood-ratio method to regenerative processes (including Markov Decision Processes or MDPs) were given by \citeA{glynn86,glynn90,glynn95} and \citeA{reiman86,reiman89}, and independently for episodic Partially Observable Markov Decision Processes (POMDPs) by \citeA{williams92}, who introduced the REINFORCE algorithm.\footnote{A thresholded version of these algorithms for neuron-like elements was described earlier in \citeA{BarSutAnd83}.} Here the i.i.d. samples $X$ of the previous section are sequences of states $i_{t_j}, \dots, i_{t_{j+1}-1}$ (of random length) encountered between visits to some designated recurrent state $i^*$, or sequences of states from some start state to a goal state. In this case $\nabla q(\theta, X)/q(\theta, X)$ can be written as a sum
$$\frac{\nabla q(\theta, X)}{q(\theta, X)} = \sum_{t=t_j}^{t_{j+1}-1} \frac{\nabla p_{i_t i_{t+1}}(\theta)}{p_{i_t i_{t+1}}(\theta)}, \tag{4}$$
where $p_{ij}(\theta)$ is the transition probability from state $i$ to state $j$ given parameters $\theta$. Equation (4) admits a recursive computation over the course of a regenerative cycle of the form $z_0 = 0$, and after each state transition $i_t \to i_{t+1}$,
$$z_{t+1} = z_t + \frac{\nabla p_{i_t i_{t+1}}(\theta)}{p_{i_t i_{t+1}}(\theta)}, \tag{5}$$
so that each term $r(X_j)\,\nabla q(\theta, X_j)/q(\theta, X_j)$ in the estimate (3) is of the form $r(X_j)\, z_{t_{j+1}}$.\footnote{The vector $z_t$ is known in reinforcement learning as an eligibility trace. This terminology is used in \citeA{BarSutAnd83}.} If, in addition, $r(X_j)$ can be recursively computed by
$$r(i_{t_j}, \dots, i_{t+1}) = \phi\left(r(i_{t_j}, \dots, i_t),\, i_{t+1}\right)$$
for some function $\phi$, then the estimate for each cycle can be computed using storage of only $K+1$ parameters ($K$ for $z_t$ and $1$ parameter to update the performance function $r$). Hence, the entire estimate (3) can be computed with storage of only $2K+1$ real parameters, as follows.
Algorithm 1.1: Policy-Gradient Algorithm for Regenerative Processes.

1. Set $j = 0$, $r_0 = 0$, $z_0 = 0$, and $\Delta_0 = 0$ ($z_0, \Delta_0 \in \mathbb{R}^K$).

2. For each state transition $i_t \to i_{t+1}$:

If the episode is finished (that is, $i_{t+1} = i^*$), set
$\Delta_{j+1} = \Delta_j + r_t z_t$,
$j = j + 1$,
$z_{t+1} = 0$,
$r_{t+1} = 0$.
Otherwise, set
$z_{t+1} = z_t + \nabla p_{i_t i_{t+1}}(\theta)/p_{i_t i_{t+1}}(\theta)$ and $r_{t+1} = \phi(r_t, i_{t+1})$.

3. If $j = N$ return $\Delta_N/N$, otherwise goto 2.
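To make this cycle-based estimator concrete, here is a sketch on a deliberately tiny chain (the chain, the logistic parameterization, and all names are invented for the example). State $0$ plays the role of the recurrent state $i^*$; a single parameter $\theta$ sets the probability $\sigma(\theta)$ of visiting state $1$, which carries reward $1$, so the expected cycle reward is $\sigma(\theta)$ and the cycle estimate should recover its derivative $\sigma(\theta)(1-\sigma(\theta))$:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 0.3
sigma = 1.0 / (1.0 + np.exp(-theta))     # p(0 -> 1); p(0 -> 0) = 1 - sigma

# Scores d/dtheta log p for the two theta-dependent transitions out of state 0.
score_01 = 1.0 - sigma                   # d/dtheta log sigma
score_00 = -sigma                        # d/dtheta log (1 - sigma)

N = 100_000                              # number of regenerative cycles
Delta = 0.0
for _ in range(N):
    z, rwd = 0.0, 0.0                    # eligibility trace and cycle reward
    if rng.random() < sigma:             # transition 0 -> 1: reward 1 in state 1,
        z += score_01                    # then 1 -> 0 deterministically (no score)
        rwd += 1.0
    else:                                # transition 0 -> 0 ends the cycle at once
        z += score_00
    Delta += rwd * z                     # end of cycle: accumulate r(X_j) z

estimate = Delta / N
exact = sigma * (1.0 - sigma)            # d/dtheta of the expected cycle reward
```

Averaged over many cycles, $r(X_j) z$ is an unbiased estimate of the derivative here, since $\sigma\cdot 1\cdot(1-\sigma) + (1-\sigma)\cdot 0\cdot(-\sigma) = \sigma(1-\sigma)$.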
Examples of recursive performance functions include: the sum of a scalar reward over a cycle, $r(X_j) = \sum_{t=t_j}^{t_{j+1}-1} r(i_t)$, where $r(i)$ is a scalar reward associated with state $i$ (this corresponds to $\eta(\theta)$ being the average reward multiplied by the expected recurrence time $E_\theta T$); the negative length of the cycle (which can be implemented by assigning a reward of $-1$ to each state, and is used when the task is to minimize time taken to get to a goal state, since in this case $\eta(\theta)$ is just $-E_\theta T$); the discounted reward from the start state, $r(X_j) = \sum_{t=t_j}^{t_{j+1}-1} \beta^{t - t_j} r(i_t)$, where $\beta\in[0,1)$ is the discount factor; and so on.
As \citeA{williams92} pointed out, a further simplification is possible in the case that $r(X_j)$ is a sum of scalar rewards $r(i_t, t)$ depending on the state $i_t$ and possibly the time $t$ since the starting state (such as $r(i_t)$, or $\beta^{t-t_j} r(i_t)$ as above). In that case, the update from a single regenerative cycle may be written as
$$\Delta_{j+1} = \Delta_j + \sum_{t=t_j}^{t_{j+1}-1} \frac{\nabla p_{i_t i_{t+1}}(\theta)}{p_{i_t i_{t+1}}(\theta)} \left[\sum_{s=t_j}^{t} r(i_s, s) + \sum_{s=t+1}^{t_{j+1}-1} r(i_s, s)\right].$$
Because changes in $p_{i_t i_{t+1}}$ have no influence on the rewards $r(i_s, s)$ associated with earlier states ($s \le t$), we should be able to drop the first term in the parentheses on the right-hand side and write
$$\Delta_{j+1} = \Delta_j + \sum_{t=t_j}^{t_{j+1}-1} \frac{\nabla p_{i_t i_{t+1}}(\theta)}{p_{i_t i_{t+1}}(\theta)} \sum_{s=t+1}^{t_{j+1}-1} r(i_s, s). \tag{6}$$
Although the proof is not entirely trivial, this intuition can indeed be shown to be correct.
Equation (6) allows an even simpler recursive formula for estimating the performance gradient. Set $z_0 = \Delta_0 = 0 \in \mathbb{R}^K$. As before, set $z_{t+1} = z_t + \nabla p_{i_t i_{t+1}}(\theta)/p_{i_t i_{t+1}}(\theta)$ if $i_{t+1} \neq i^*$, or $z_{t+1} = 0$ otherwise. But now, on each iteration, set $\Delta_{t+1} = \Delta_t + r(i_{t+1}, t+1)\, z_{t+1}$. Then $\Delta_t/t$ is our estimate of $\nabla\eta$. Since $\Delta_t$ is updated on every iteration, this suggests that we can do away with $\Delta_t$ altogether and simply update the parameters $\theta$ directly: $\theta_{t+1} = \theta_t + \gamma_t\, r(i_{t+1}, t+1)\, z_{t+1}$, where the $\gamma_t$ are suitable step-sizes.\footnote{The usual requirements on $\gamma_t$ for convergence of a stochastic gradient algorithm are $\gamma_t > 0$, $\sum_t \gamma_t = \infty$, and $\sum_t \gamma_t^2 < \infty$.} Proving convergence of such an algorithm is not as straightforward as for normal stochastic gradient algorithms because the individual updates are not in the gradient direction (in expectation), although the sum of these updates over a regenerative cycle is. \citeA{marbach98} provide the only convergence proof that we know of, albeit for a slightly different update of the form $\theta_{t+1} = \theta_t + \gamma_t\left(r(i_{t+1}) - \hat\eta_t\right) z_{t+1}$, where $\hat\eta_t$ is a moving estimate of the expected performance, and is also updated on-line (this update was first suggested in the context of POMDPs by \shortciteA{jaakola95}).
\citeA{marbach98} also considered the case of $\theta$-dependent rewards (recall the discussion after (3)), as did \citeA{baird98} with their “VAPS” algorithm (Value And Policy Search). This last paper contains an interesting insight: through suitable choices of the performance function $r$, one can combine policy-gradient search with approximate value function methods. The resulting algorithms can be viewed as actor-critic techniques in the spirit of \citeA{BarSutAnd83}; the policy is the actor and the value function is the critic. The primary motivation is to reduce variance in the policy-gradient estimates. Experimental evidence for this phenomenon has been presented by a number of authors, including \citeA{BarSutAnd83}, \citeA{kimura98a}, and \citeA{baird98}. More recent work on this subject includes that of \shortciteA{sutton99} and \shortciteA{konda99}. We discuss the use of VAPS-style updates further in Section 6.2.
So far we have not addressed the question of how the parameterized state-transition probabilities $p_{ij}(\theta)$ arise. Of course, they could simply be generated by parameterizing the matrix of transition probabilities directly. Alternatively, in the case of MDPs or POMDPs, state transitions are typically generated by feeding an observation $y_t$ that depends stochastically on the state $i_t$ into a parameterized stochastic policy, which selects a control $u_t$ at random from a set of available controls (approximate value-function based approaches that generate controls stochastically via some form of lookahead also fall into this category). The distribution over successor states is then a fixed function of the control. If we denote the probability of control $u_t$ given parameters $\theta$ and observation $y_t$ by $\mu_{u_t}(\theta, y_t)$, then all of the above discussion carries through with $\nabla p_{i_t i_{t+1}}(\theta)/p_{i_t i_{t+1}}(\theta)$ replaced by $\nabla\mu_{u_t}(\theta, y_t)/\mu_{u_t}(\theta, y_t)$. In that case, Algorithm 1.1 is precisely Williams’ REINFORCE algorithm.
Algorithm 1.1 and the variants above have been extended to cover multiple agents \shortcite{peshkin00}, policies with internal state \shortcite{meuleau99}, and importance sampling methods \shortcite{meuleau00}. We also refer the reader to the work of \citeA{rubinstein93} and \citeA{rubinstein98} for in-depth analysis of the application of the likelihood-ratio method to Discrete-Event Systems (DESs), in particular networks of queues. Also worth mentioning is the large literature on Infinitesimal Perturbation Analysis (IPA), which seeks a similar goal of estimating performance gradients, but operates under more restrictive assumptions than the likelihood-ratio approach; see, for example, \citeA{ho91}.
1.1.2 Biased Estimates of the Performance Gradient
All the algorithms described in the previous section rely on an identifiable recurrent state $i^*$, either to update the gradient estimate, or in the case of the on-line algorithm, to zero the eligibility trace $z_t$. This reliance on a recurrent state can be problematic for two main reasons:

The variance of the algorithms is related to the recurrence time between visits to $i^*$, which will typically grow as the state space grows. Furthermore, the time between visits depends on the parameters $\theta$ of the policy, and states that are frequently visited for the initial value of the parameters may become very rare as performance improves.

In situations of partial observability it may be difficult to estimate the underlying states, and therefore to determine when the gradient estimate should be updated, or the eligibility trace zeroed.
If the system is available only through simulation, it seems difficult (if not impossible) to obtain unbiased estimates of the gradient direction without access to a recurrent state. Thus, to solve problems 1 and 2, we must look to biased estimates. Two principal techniques for introducing bias have been proposed, both of which may be viewed as artificial truncations of the eligibility trace $z_t$. The first method takes as a starting point the formula\footnote{For ease of exposition, we have kept the expression for $z_t$ in terms of the likelihood ratios $\nabla p_{i_s i_{s+1}}(\theta)/p_{i_s i_{s+1}}(\theta)$, which rely on the availability of the underlying state $i_s$. If $i_s$ is not available, $\nabla p_{i_s i_{s+1}}(\theta)/p_{i_s i_{s+1}}(\theta)$ should be replaced with $\nabla\mu_{u_s}(\theta, y_s)/\mu_{u_s}(\theta, y_s)$.} for the eligibility trace at time $t$:
$$z_t = \sum_{s=0}^{t-1} \frac{\nabla p_{i_s i_{s+1}}(\theta)}{p_{i_s i_{s+1}}(\theta)},$$
and simply truncates it at some (fixed, not random) number $n$ of terms looking backwards [13, 18, 19, 11]:
$$z_t(n) = \sum_{s=t-n}^{t-1} \frac{\nabla p_{i_s i_{s+1}}(\theta)}{p_{i_s i_{s+1}}(\theta)}. \tag{7}$$
The eligibility trace is then updated after each transition $i_t \to i_{t+1}$ by
$$z_{t+1}(n) = z_t(n) + \frac{\nabla p_{i_t i_{t+1}}(\theta)}{p_{i_t i_{t+1}}(\theta)} - \frac{\nabla p_{i_{t-n} i_{t-n+1}}(\theta)}{p_{i_{t-n} i_{t-n+1}}(\theta)}, \tag{8}$$
and in the case of state-based rewards $r(i_t)$, the estimated gradient direction after $T$ steps is
$$\hat\nabla_n\eta := \frac{1}{T}\sum_{t=n+1}^{T} r(i_t)\, z_t(n). \tag{9}$$
Unless $n$ exceeds the maximum recurrence time (which is infinite for an ergodic Markov chain), $\hat\nabla_n\eta$ is a biased estimate of the gradient direction, although as $n \to \infty$, the bias approaches zero. However, the variance of $\hat\nabla_n\eta$ diverges in the limit of large $n$. This illustrates a natural trade-off in the selection of the parameter $n$: it should be large enough to ensure the bias is acceptable (the expectation of $\hat\nabla_n\eta$ should at least be within $90^\circ$ of the true gradient direction), but not so large that the variance is prohibitive. Experimental results by \citeA{cao98} nicely illustrate this bias/variance trade-off.
One potential difficulty with this method is that the likelihood ratios $\nabla p_{i_t i_{t+1}}(\theta)/p_{i_t i_{t+1}}(\theta)$ must be remembered for the previous $n$ time steps, requiring storage of $Kn$ parameters. Thus, to obtain small bias, the memory may have to grow without bound. An alternative approach that requires a fixed amount of memory is to discount the eligibility trace, rather than truncating it:
$$z_{t+1} = \beta z_t + \frac{\nabla p_{i_t i_{t+1}}(\theta)}{p_{i_t i_{t+1}}(\theta)}, \tag{10}$$
where $z_0 = 0$ and $\beta\in[0,1)$ is a discount factor. In this case the estimated gradient direction after $T$ steps is simply
$$\hat\nabla_\beta\eta := \frac{1}{T}\sum_{t=1}^{T} r(i_t)\, z_t. \tag{11}$$
This is precisely the estimate we analyze in the present paper. A similar estimate with $r(i_t)$ replaced by $r(i_t) - b$, where $b$ is a reward baseline, was proposed by \shortciteA{kimura95,kimura97} and for continuous control by \citeA{kimura98}. In fact the use of $r(i_t) - b$ in place of $r(i_t)$ does not affect the expectation of the estimates of the algorithm (although judicious choice of the reward baseline $b$ can reduce the variance of the estimates). While the algorithm presented by \citeA{kimura95} provides estimates of the expectation under the stationary distribution of the gradient of the discounted reward, we will show that these are in fact biased estimates of the gradient of the expected discounted reward. This arises because the stationary distribution itself depends on the parameters. A similar estimate to (11) was also proposed by \citeA{marbach98}, but this time with $r(i_t)$ replaced by $r(i_t) - \hat\eta_t$, where $\hat\eta_t$ is an estimate of the average reward, and with $z_t$ zeroed on visits to an identifiable recurrent state.
As a final note, observe that the eligibility traces $z_t$ and $z_t(n)$ defined by (10) and (8) are simply filtered versions of the sequence of likelihood ratios $\nabla p_{i_t i_{t+1}}(\theta)/p_{i_t i_{t+1}}(\theta)$: a first-order, infinite impulse response filter in the case of $z_t$, and an $n$-th order, finite impulse response filter in the case of $z_t(n)$. This raises the question, not addressed in this paper, of whether there is an interesting theory of optimal filtering for policy-gradient estimators.
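The filtering view can be checked directly. Below, a made-up sequence of likelihood-ratio vectors is pushed through both recursions, and each trace is compared with its closed form: a geometrically weighted sum for the discounted trace, and a sum over a sliding window of the $n$ most recent ratios for the truncated trace. All values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

K, T, n, beta = 2, 50, 5, 0.8
g = rng.normal(size=(T, K))      # stand-in likelihood ratios grad p / p per step

z_disc = np.zeros(K)             # discounted trace: first-order IIR filter
z_trunc = np.zeros(K)            # truncated trace: n-th order FIR filter
for t in range(T):
    z_disc = beta * z_disc + g[t]
    z_trunc = g[max(0, t - n + 1) : t + 1].sum(axis=0)

# Closed forms after T steps.
iir = sum(beta ** (T - 1 - s) * g[s] for s in range(T))   # geometric weighting
fir = g[T - n : T].sum(axis=0)                            # window of n ratios
```

The deterministic check confirms that the two traces differ only in the shape of the filter applied to the same underlying score sequence.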
1.2 Our Contribution
We describe GPOMDP, a general algorithm based upon (11) for generating a biased estimate of the gradient $\nabla\eta(\theta)$ of the average reward in general POMDPs controlled by parameterized stochastic policies. Here $\eta(\theta)$ denotes the average reward of the policy with parameters $\theta\in\mathbb{R}^K$. GPOMDP does not rely on access to an underlying recurrent state. Writing $\nabla_\beta\eta(\theta)$ for the expectation of the estimate produced by GPOMDP, we show that $\lim_{\beta\to1}\nabla_\beta\eta(\theta) = \nabla\eta(\theta)$, and more quantitatively that $\nabla_\beta\eta(\theta)$ is close to the true gradient provided $1/(1-\beta)$ exceeds the mixing time of the Markov chain induced by the POMDP.\footnote{The mixing-time result in this paper applies only to Markov chains with distinct eigenvalues. Better estimates of the bias and variance of GPOMDP may be found in \citeA{jcss_01}, for more general Markov chains than those treated here, and for more refined notions of the mixing time. Roughly speaking, the variance of GPOMDP grows with $1/(1-\beta)$, while the bias decreases as a function of $1/(1-\beta)$.} As with the truncated estimate above, the trade-off preventing the setting of $\beta$ arbitrarily close to $1$ is that the variance of the algorithm’s estimates increases as $\beta$ approaches $1$. We prove convergence with probability 1 of GPOMDP for both discrete and continuous observation and control spaces. We present algorithms for both general parameterized Markov chains and POMDPs controlled by parameterized stochastic policies.
There are several extensions to GPOMDP that we have investigated since the first version of this paper was written. We outline these developments briefly in Section 7.
In a companion paper we show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent on the average reward [6]. We describe both traditional stochastic gradient algorithms, and a conjugate-gradient algorithm that utilizes gradient estimates in a novel way to perform line searches. Experimental results are presented illustrating both the theoretical results of the present paper on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
2 The Reinforcement Learning Problem
We model reinforcement learning as a Markov decision process (MDP) with a finite state space $S = \{1, \dots, n\}$, and a stochastic matrix\footnote{A stochastic matrix $P = [p_{ij}]$ has $p_{ij} \ge 0$ for all $i, j$ and $\sum_{j=1}^{n} p_{ij} = 1$ for all $i$.} $P = [p_{ij}]$ giving the probability of transition from state $i$ to state $j$. Each state $i$ has an associated reward\footnote{All the results in the present paper apply to bounded stochastic rewards, in which case $r(i)$ is the expectation of the reward in state $i$.} $r(i)$. The matrix $P$ belongs to a parameterized class of stochastic matrices, $\mathcal{P} := \{P(\theta) : \theta\in\mathbb{R}^K\}$. Denote the Markov chain corresponding to $P(\theta)$ by $M(\theta)$. We assume that these Markov chains and rewards satisfy the following assumptions:
Assumption 1.
Each $P(\theta)\in\mathcal{P}$ has a unique stationary distribution $\pi(\theta)$ satisfying the balance equations
$$\pi'(\theta)\, P(\theta) = \pi'(\theta) \tag{12}$$
(throughout, $\pi'$ denotes the transpose of $\pi$).
Assumption 2.
The magnitudes of the rewards, $|r(i)|$, are uniformly bounded by $R < \infty$ for all states $i$.
Assumption 1 ensures that the Markov chain $M(\theta)$ forms a single recurrent class for all parameters $\theta$. Since any finite-state Markov chain always ends up in a recurrent class, and it is the properties of this class that determine the long-term average reward, this assumption is mainly for convenience, so that we do not have to include the recurrence class as a quantifier in our theorems. However, when we consider gradient-ascent algorithms \citeA{jair_01b}, this assumption becomes more restrictive, since it guarantees that the recurrence class cannot change as the parameters are adjusted.
Ordinarily, a discussion of MDPs would not be complete without some mention of the actions available in each state and the space of policies available to the learner. In particular, the parameters $\theta$ would usually determine a policy (either directly or indirectly via a value function), which would then determine the transition probabilities $p_{ij}(\theta)$. However, for our purposes we do not care how the dependence of $P(\theta)$ on $\theta$ arises, just that it satisfies Assumption 1 (and some differentiability assumptions that we shall meet in the next section). Note also that it is easy to extend this setup to the case where the rewards also depend on the parameters $\theta$ or on the transitions $i \to j$. It is equally straightforward to extend our algorithms and results to these cases. See Section 6.1 for an illustration.
3 Computing the Gradient of the Average Reward
For general MDPs little will be known about the average reward $\eta(\theta)$, hence finding its optimum will be problematic. However, in this section we will see that under general assumptions the gradient $\nabla\eta(\theta)$ exists, and so local optimization of $\eta(\theta)$ is possible.
To ensure the existence of suitable gradients (and the boundedness of certain random variables), we require that the parameterized class of stochastic matrices satisfies the following additional assumption.
Assumption 3.
The derivatives,
$$\nabla P(\theta) := \left[\frac{\partial p_{ij}(\theta)}{\partial \theta_k}\right],$$
exist for all $\theta\in\mathbb{R}^K$. The ratios
$$\left|\frac{\partial p_{ij}(\theta)/\partial\theta_k}{p_{ij}(\theta)}\right|$$
are uniformly bounded by $B < \infty$ for all $\theta\in\mathbb{R}^K$.
The second part of this assumption allows zero-probability transitions $p_{ij}(\theta) = 0$ only if $\nabla p_{ij}(\theta)$ is also zero, in which case we set $0/0 = 0$. One example is if $i \to j$ is a forbidden transition, so that $p_{ij}(\theta) = 0$ for all $\theta$. Another example satisfying the assumption is the “softmax” parameterization
$$p_{ij}(\theta) = \frac{e^{\theta_{ij}}}{\sum_{k=1}^{n} e^{\theta_{ik}}},$$
where $\theta = [\theta_{ij}]$ are the parameters of $P(\theta)$, for then
$$\frac{\partial p_{ij}(\theta)/\partial\theta_{kl}}{p_{ij}(\theta)} = \delta_{ik}\left(\delta_{jl} - p_{il}(\theta)\right),$$
which is bounded in magnitude by $1$ (here $\delta_{ik} = 1$ if $i = k$ and $0$ otherwise).
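The boundedness of these ratios is easy to verify numerically. The sketch below (arbitrary parameter values; all names invented for the example) builds a softmax-parameterized $P(\theta)$, computes the derivatives $\partial p_{ij}/\partial\theta_{kl}$ by central finite differences, and confirms that every ratio has magnitude at most $1$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
theta = rng.normal(size=(n, n))            # parameters theta_ij

def P(theta):
    """Row-wise softmax: p_ij = exp(theta_ij) / sum_k exp(theta_ik)."""
    e = np.exp(theta)
    return e / e.sum(axis=1, keepdims=True)

eps = 1e-6
p = P(theta)
max_ratio = 0.0
for k in range(n):
    for l in range(n):
        d = np.zeros((n, n))
        d[k, l] = eps
        # Central-difference approximation of dp_ij / dtheta_kl for all i, j.
        dP = (P(theta + d) - P(theta - d)) / (2 * eps)
        max_ratio = max(max_ratio, np.max(np.abs(dP / p)))
```

For this parameterization the analytic ratio is $\delta_{ik}(\delta_{jl} - p_{il})$, so `max_ratio` should sit strictly below $1$ regardless of the random draw of $\theta$.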
Assuming for the moment that $\nabla\pi(\theta)$ exists (this will be justified shortly), then, suppressing $\theta$ dependencies,
$$\nabla\eta = \nabla\pi'\, r, \tag{14}$$
since the reward $r$ does not depend on $\theta$. Note that our convention for $\nabla$ in this paper is that it takes precedence over all other operations, so $\nabla\pi' r = [\nabla\pi]' r$. Equations like (14) should be regarded as shorthand notation for $K$ equations of the form
$$\frac{\partial\eta}{\partial\theta_k} = \sum_{i=1}^{n} \frac{\partial\pi(i)}{\partial\theta_k}\, r(i),$$
where $k = 1, \dots, K$.
where . To compute , first differentiate the balance equations (12) to obtain
and hence
(15) 
The system of equations defined by (15) is under-constrained because $I - P$ is not invertible (the balance equations show that $I - P$ has the left eigenvector $\pi'$ with zero eigenvalue). However, let $e$ denote the $n$-dimensional column vector consisting of all $1$s, so that $e\pi'$ is the matrix with the stationary distribution $\pi'$ in each row. Since $\nabla\pi'\, e = \nabla[\pi' e] = \nabla 1 = 0$, we can rewrite (15) as
$$\nabla\pi'\left[I - (P - e\pi')\right] = \pi'\,\nabla P.$$
To see that the inverse $\left[I - (P - e\pi')\right]^{-1}$ exists, let $A$ be any matrix satisfying $\lim_{t\to\infty} A^t = 0$. Then we can write
$$(I - A)\sum_{t=0}^{T} A^t = I - A^{T+1}.$$
Thus,
$$(I - A)^{-1} = \sum_{t=0}^{\infty} A^t.$$
It is easy to prove by induction that $(P - e\pi')^t = P^t - e\pi'$, which converges to $0$ as $t\to\infty$ by Assumption 1. So $\left[I - (P - e\pi')\right]^{-1}$ exists and is equal to $\sum_{t=0}^{\infty} (P - e\pi')^t$. Hence, we can write
$$\nabla\pi' = \pi'\,\nabla P\left[I - P + e\pi'\right]^{-1}, \tag{16}$$
and so\footnote{The argument leading to (16), coupled with the fact that $\pi$ is the unique solution to (12), can be used to justify the existence of $\nabla\pi$. Specifically, we can run through the same steps computing the value of $\pi$ for nearby parameters $\theta + \delta\theta$ with small $\delta\theta$ and show that the expression (16) for $\nabla\pi'$ is the unique matrix satisfying the resulting first-order expansion.}
$$\nabla\eta = \pi'\,\nabla P\left[I - P + e\pi'\right]^{-1} r. \tag{17}$$
For MDPs with a sufficiently small number of states, (17) could be solved exactly to yield the precise gradient direction. However, in general, if the state space is small enough that an exact solution of (17) is possible, then it will be small enough to derive the optimal policy using policy iteration and table-lookup, and there would be no point in pursuing a gradient-based approach in the first place.\footnote{Equation (17) may still be useful for POMDPs, since in that case there is no tractable dynamic programming algorithm.}
Thus, for problems of practical interest, (17) will be intractable and we will need to find some other way of computing the gradient. One approximate technique for doing this is presented in the next section.
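For intuition, (17) is nevertheless easy to evaluate numerically on a small chain. The sketch below (a two-state chain with a logistic parameterization, invented purely for the example) computes $\nabla\eta$ via (17), with $\nabla P$ itself obtained by central differences, and checks the result against a direct finite-difference approximation of $\eta(\theta) = \pi'(\theta)\, r$:

```python
import numpy as np

r = np.array([1.0, 0.0])                       # state rewards
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def P(theta):
    """Two-state chain: p_01 = sigmoid(theta_0), p_10 = sigmoid(theta_1)."""
    a, b = sigmoid(theta[0]), sigmoid(theta[1])
    return np.array([[1 - a, a], [b, 1 - b]])

def stationary(P_):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    w, V = np.linalg.eig(P_.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def grad_eta(theta):
    """Equation (17): grad eta = pi' (grad P) [I - P + e pi']^{-1} r."""
    P_ = P(theta)
    pi = stationary(P_)
    inv = np.linalg.inv(np.eye(2) - P_ + np.outer(np.ones(2), pi))
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        d = np.zeros_like(theta)
        d[k] = 1e-6
        dP = (P(theta + d) - P(theta - d)) / 2e-6   # k-th component of grad P
        g[k] = pi @ dP @ inv @ r
    return g

theta = np.array([0.5, -0.2])
exact = grad_eta(theta)

# Independent check: finite differences of eta(theta) = pi(theta)' r.
eta = lambda th: stationary(P(th)) @ r
fd = np.array([(eta(theta + e) - eta(theta - e)) / 2e-5
               for e in 1e-5 * np.eye(2)])
```

For this chain, raising $p_{01}$ moves mass away from the rewarding state $0$, so the first gradient component is negative and the second positive, and both routes to the gradient agree to numerical precision.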
4 Approximating the Gradient in Parameterized Markov Chains
In this section, we show that the gradient $\nabla\eta$ can be split into two components, one of which becomes negligible as a discount factor $\beta$ approaches $1$.
For all $\beta\in[0,1)$, let $J_\beta(\theta) := [J_\beta(\theta)(1), \dots, J_\beta(\theta)(n)]'$ denote the vector of expected discounted rewards from each state $i$:
$$J_\beta(\theta)(i) := E_\theta\left[\sum_{t=0}^{\infty} \beta^t r(i_t) \,\Big|\, i_0 = i\right]. \tag{18}$$
Where the $\theta$ dependence is obvious, we just write $J_\beta$.
Proposition 1.
For all $\theta\in\mathbb{R}^K$ and $\beta\in[0,1)$,
$$\nabla\eta = (1-\beta)\,\nabla\pi'\, J_\beta + \beta\,\pi'\,\nabla P\, J_\beta. \tag{19}$$
We shall see in the next section that the second term in (19) can be estimated from a single sample path of the Markov chain. In fact, Theorem 1 in [14] shows that the gradient estimates of the algorithm presented in that paper converge to $\pi'\nabla J_\beta$. By the Bellman equations (20), this is equal to $\frac{\beta}{1-\beta}\,\pi'\nabla P J_\beta$, which is proportional to the second term in (19). Thus the algorithm of \citeA{kimura97} also estimates the second term in the expression for $\nabla\eta$ given by (19). It is important to note that $\pi'\nabla J_\beta \neq \nabla\left[\pi' J_\beta\right]$; the two quantities disagree by $\nabla\pi' J_\beta$, which (up to the factor $1-\beta$) is the first term in (19). This arises because the stationary distribution itself depends on the parameters. Hence, the algorithm of \citeA{kimura97} does not estimate the gradient of the expected discounted reward. In fact, the expected discounted reward is simply $1/(1-\beta)$ times the average reward \shortcite[Fact 7]{singh94a}, so the gradient of the expected discounted reward is proportional to the gradient of the average reward.
The following theorem shows that the first term in (19) becomes negligible as $\beta$ approaches $1$. Notice that this is not immediate from Proposition 1, since $J_\beta$ can become arbitrarily large in the limit $\beta\to 1$.
Theorem 2.
For all $\theta\in\mathbb{R}^K$,
$$\nabla\eta = \lim_{\beta\to 1} \nabla_\beta\eta, \tag{21}$$
where
$$\nabla_\beta\eta := \pi'\,\nabla P\, J_\beta. \tag{22}$$
Proof.
Recalling equation (17) and the discussion preceding it, we have\footnote{Since $\sum_{t=0}^{\infty}\left[P^t - e\pi'\right] r = \sum_{t=0}^{\infty}\left[P^t r - e\eta\right]$, (23) motivates a different kind of algorithm for estimating $\nabla\eta$ based on differential rewards [16].}
$$\nabla\eta = \pi'\,\nabla P \sum_{t=0}^{\infty}\left[P^t - e\pi'\right] r. \tag{23}$$
But $\nabla P\, e = \nabla\left[P e\right] = \nabla e = 0$ since $P$ is a stochastic matrix, so the $e\pi'$ terms are annihilated by $\pi'\nabla P$ and (23) can be rewritten as
$$\nabla\eta = \pi'\,\nabla P \sum_{t=0}^{\infty} P^t r. \tag{24}$$
Now let $\beta\in[0,1)$ be a discount factor and consider the expression
$$f(\beta) := \pi'\,\nabla P \sum_{t=0}^{\infty} \beta^t\left[P^t - e\pi'\right] r. \tag{25}$$
Clearly $\lim_{\beta\to 1} f(\beta) = \nabla\eta$. To complete the proof we just need to show that $f(\beta) = \nabla_\beta\eta$.
Since $\nabla P\, e = 0$, the $e\pi'$ terms in (25) are again annihilated by $\pi'\nabla P$, and we can invoke the observation before (16) to write
$$f(\beta) = \pi'\,\nabla P \sum_{t=0}^{\infty} \beta^t P^t r, \tag{26}$$
where the sum on the right-hand-side now converges because $\beta < 1$.\footnote{We cannot make the same simplification in (24) because $\sum_{t=0}^{\infty} P^t r$ diverges ($P^t r \to e\eta$). The reason the right-hand-side of (24) nevertheless converges is that $P^t r$ becomes orthogonal to $\pi'\nabla P$ in the limit of large $t$. Thus, we can view $\sum_t P^t r$ as a sum of two components: a divergent one in the direction $e$, which is annihilated by $\pi'\nabla P$, and a convergent one. It is the convergent component that we need to estimate. Approximating $\sum_t P^t r$ by $\sum_t \beta^t P^t r$ is a way of rendering the divergent component finite while hopefully not altering the convergent component too much. There should be other substitutions that lead to better approximations (in this context, see the final paragraph in Section 1.1).}
But $\sum_{t=0}^{\infty} \beta^t P^t r = J_\beta$. Thus $f(\beta) = \pi'\,\nabla P\, J_\beta = \nabla_\beta\eta$. ∎
Theorem 2 shows that $\nabla_\beta\eta$ is a good approximation to the gradient as $\beta$ approaches $1$, but it turns out that values of $\beta$ very close to $1$ lead to large variance in the estimates of $\nabla_\beta\eta$ that we describe in the next section. However, the following theorem shows that $1 - \beta$ need not be too small, provided the transition probability matrix $P(\theta)$ has distinct eigenvalues, and the Markov chain has a short mixing time. From any initial state, the distribution over states of a Markov chain converges to the stationary distribution, provided the assumption (Assumption 1) about the existence and uniqueness of the stationary distribution is satisfied [see, for example, 15, Theorem 15.8.1, p. 552]. The spectral resolution theorem [15, Theorem 9.5.1, p. 314] implies that the distribution converges to stationarity at an exponential rate, and the time constant in this convergence rate (the mixing time) depends on the eigenvalues of the transition probability matrix. The existence of a unique stationary distribution implies that the largest-magnitude eigenvalue is $\lambda_1 = 1$ and has multiplicity $1$, and the corresponding left eigenvector is the stationary distribution. We sort the eigenvalues $\lambda_i$ in decreasing order of magnitude, so that $1 = \lambda_1 > |\lambda_2| \ge \cdots \ge |\lambda_n|$. It turns out that $|\lambda_2|$ determines the mixing time of the chain.
The following theorem shows that if $1 - \beta$ is small compared to $1 - |\lambda_2|$, the gradient approximation described above is accurate. Since we will be using the estimate $\nabla_\beta\eta$ as a direction in which to update the parameters, the theorem compares the directions of the gradient and its estimate. In this theorem, $\kappa_2(S)$ denotes the spectral condition number of a nonsingular matrix $S$, which is defined as the product of the spectral norms of the matrices $S$ and $S^{-1}$,
$$\kappa_2(S) := \|S\|_2 \left\|S^{-1}\right\|_2,$$
where
$$\|S\|_2 := \max_{\|x\| = 1} \|S x\|,$$
and $\|x\|$ denotes the Euclidean norm of the vector $x$.
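Both spectral quantities are cheap to inspect for a small chain. The sketch below (an arbitrary ergodic 3-state matrix chosen purely for illustration) extracts $|\lambda_2|$, whose gap from $1$ governs the mixing time, and the spectral condition number $\kappa_2(S)$ of the right-eigenvector matrix:

```python
import numpy as np

P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.70, 0.20],
              [0.25, 0.25, 0.50]])       # an ergodic stochastic matrix

w, S = np.linalg.eig(P)                  # eigenvalues, right eigenvectors
order = np.argsort(-np.abs(w))           # sort by decreasing magnitude
w, S = w[order], S[:, order]

lambda2 = abs(w[1])                      # second-largest eigenvalue magnitude
# Spectral condition number kappa_2(S) = ||S||_2 ||S^{-1}||_2.
kappa2 = np.linalg.cond(S, 2)
```

For a stochastic matrix satisfying Assumption 1, the leading eigenvalue is exactly $1$, $|\lambda_2| < 1$ sets the exponential rate of convergence to stationarity, and $\kappa_2(S) \ge 1$ measures how far the eigenvectors are from orthogonal.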
Theorem 3.
Suppose that the transition probability matrix $P(\theta)$ satisfies Assumption 1 with stationary distribution $\pi'$, and has $n$ distinct eigenvalues. Let $S = \left(x_1\; x_2 \cdots x_n\right)$ be the matrix of right eigenvectors of $P(\theta)$ corresponding, in order, to the eigenvalues $1 = \lambda_1 > |\lambda_2| \ge \cdots \ge |\lambda_n|$. Then the normalized inner product between $\nabla\eta$ and $\nabla_\beta\eta$ satisfies
(27) 
where .
Notice that is the expectation under the stationary distribution of .
As well as the mixing time (via ), the bound in the theorem depends on another parameter of the Markov chain: the spectral condition number of . If the Markov chain is reversible (which implies that the eigenvectors are orthogonal), this is equal to the ratio of the maximum to the minimum probability of states under the stationary distribution. However, the eigenvectors do not need to be nearly orthogonal. In fact, the condition that the transition probability matrix have distinct eigenvalues is not necessary; without it, the condition number is replaced by a more complicated expression involving spectral norms of matrices of the form .
Proof.
The existence of $n$ distinct eigenvalues implies that $P$ can be expressed as $S\Lambda S^{-1}$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ [15, Theorem 4.10.2, p. 153]. It follows that for any polynomial $f$, we can write $f(P) = S f(\Lambda) S^{-1}$.
It is easy to verify that $\pi'$ is the left eigenvector corresponding to $\lambda_1 = 1$, and that we can choose $x_1 = e$, so that the first row of $S^{-1}$ is $\pi'$. Thus we can write
where
It follows from this and Proposition 1 that
by the Cauchy-Schwartz inequality. Applying the Cauchy-Schwartz inequality again, we obtain
(28) 
We use spectral norms to bound the second factor in the numerator. It is clear from the definition that the spectral norm of a product of nonsingular matrices satisfies $\|AB\|_2 \le \|A\|_2\, \|B\|_2$, and that the spectral norm of a diagonal matrix is given by $\left\|\operatorname{diag}(d_1, \dots, d_n)\right\|_2 = \max_i |d_i|$. It follows that
5 Estimating the Gradient in Parameterized Markov Chains
Algorithm 1 introduces MCG (Markov Chain Gradient), an algorithm for estimating the approximate gradient $\nabla_\beta\eta$ from a single on-line sample path $i_0, i_1, \dots$ from the Markov chain $M(\theta)$. MCG requires only $2K$ reals to be stored, where $K$ is the dimension of the parameter space: $K$ parameters for the eligibility trace $z_t$, and $K$ parameters for the gradient estimate $\Delta_t$. Note that after $T$ time steps $\Delta_T$ is the average so far of $r(i_t) z_t$,
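A minimal simulation of this scheme is sketched below, reusing a two-state chain invented for illustration: $z$ is the discounted trace of (10) and $\Delta_T$ the running average (11), and for this small chain the quantity $\pi'\nabla P J_\beta$ identified in Theorem 2 can be computed in closed form for comparison:

```python
import numpy as np

rng = np.random.default_rng(4)
beta, T = 0.8, 200_000
r = np.array([1.0, 0.0])
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
theta = np.array([0.5, -0.2])
a, b = sigmoid(theta[0]), sigmoid(theta[1])
P = np.array([[1 - a, a], [b, 1 - b]])     # p_01 = sigmoid(t0), p_10 = sigmoid(t1)

def score(i, j):
    """grad_theta log p_ij for the logistic parameterization above."""
    g = np.zeros(2)
    if i == 0:
        g[0] = (1 - a) if j == 1 else -a   # d log p_0j / d theta_0
    else:
        g[1] = (1 - b) if j == 0 else -b   # d log p_1j / d theta_1
    return g

# MCG-style loop: discounted trace (10) plus running average (11).
z, Delta, state = np.zeros(2), np.zeros(2), 0
for t in range(T):
    nxt = 1 - state if rng.random() < P[state, 1 - state] else state
    z = beta * z + score(state, nxt)
    Delta += (r[nxt] * z - Delta) / (t + 1)    # running mean of r(i_{t+1}) z_{t+1}
    state = nxt

# Closed form of the approximation: pi' (grad P) J_beta.
pi = np.array([b, a]) / (a + b)                # stationary distribution
J = np.linalg.inv(np.eye(2) - beta * P) @ r    # discounted values (18)
dP0 = np.array([[-a * (1 - a), a * (1 - a)], [0, 0]])
dP1 = np.array([[0, 0], [b * (1 - b), -b * (1 - b)]])
approx = np.array([pi @ dP0 @ J, pi @ dP1 @ J])
```

After a long run the on-line average $\Delta_T$ matches $\pi'\nabla P J_\beta$ to within Monte-Carlo noise, using only the $2K$ stored reals $(z, \Delta)$.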