Experiments with Infinite-Horizon, Policy-Gradient Estimation
Abstract
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper [5], which computes biased estimates of the performance gradient in POMDPs. The algorithm’s chief advantages are that it uses only one free parameter $\beta \in [0,1)$, which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of [5] on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
Journal of Artificial Intelligence Research 15 (2001), starting at page 351. Submitted 9/00; published 10/01. Short heading: Policy-Gradient Estimation (Baxter et al.)
1 Introduction
Function approximation is necessary to avoid the curse of dimensionality associated with large-scale dynamic programming and reinforcement learning problems. The dominant paradigm is to use the function to approximate the state (or state and action) values. Most algorithms then seek to minimize some form of error between the approximate value function and the true value function, usually by simulation [21, 6]. While there have been a multitude of empirical successes for this approach (for example, [samuel59, tesauro92, tesauro94, ml_00a, zhang95, singh97]), there are only weak theoretical guarantees on the performance of the policy generated by the approximate value function. In particular, there is no guarantee that the policy will improve as the approximate value function is trained; in fact performance can degrade even when the function class contains an approximate value function whose corresponding greedy policy is optimal (see [5, Appendix A] for a simple two-state example).
An alternative technique that has received increased attention recently is the “policy-gradient” approach in which the parameters of a stochastic policy are adjusted in the direction of the gradient of some performance criterion (typically either expected discounted reward or average reward). The key problem is how to compute the performance gradient under conditions of partial observability when an explicit model of the system is not available.
This question has been addressed in a large body of previous work [4, 25, 10, 7, 8, 9, 18, 19, 15, 16, 1, 17, 13, 12]. See the introduction of [5] for a discussion of the history of policy-gradient approaches. Most existing algorithms rely on the existence of an identifiable recurrent state in order to make their updates to the gradient estimate, and the variance of the algorithms is governed by the recurrence time to that state. In cases where the recurrence time is too large (for instance because the state space is large), or in situations of partial observability where such a state cannot be reliably identified, we need to seek alternatives that do not require access to such a state.
Motivated by these considerations, the companion papers [5, icml_00] introduced and analysed GPOMDP, an algorithm for generating a biased estimate of the gradient of the average reward in general Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. The chief advantages of GPOMDP are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter $\beta \in [0,1)$, which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state.
More specifically, suppose $\theta \in \mathbb{R}^K$ are the parameters controlling the POMDP. For example, $\theta$ could be the parameters of an approximate neural-network value-function that generates a stochastic policy by some form of randomized lookahead, or $\theta$ could be the parameters of an approximate $Q$-function used to stochastically select controls (footnote 1: stochastic policies are not strictly necessary in our framework, but the policy must be “differentiable” in the sense that $\nabla\eta(\theta)$ exists). Let $\eta(\theta)$ denote the average reward of the POMDP with parameter setting $\theta$. GPOMDP computes an approximation $\nabla_\beta\eta(\theta)$ to $\nabla\eta(\theta)$ based on a single continuous sample path of the underlying Markov chain. The accuracy of the approximation is controlled by the parameter $\beta \in [0,1)$, and one can show that
$$\nabla\eta(\theta) = \lim_{\beta\to 1} \nabla_\beta\eta(\theta).$$
The trade-off preventing choosing $\beta$ arbitrarily close to 1 is that the variance of GPOMDP’s estimates of $\nabla_\beta\eta(\theta)$ scales as $1/(1-\beta)^2$. However, on the bright side, it can also be shown that the bias of $\nabla_\beta\eta(\theta)$ (measured by $\|\nabla\eta(\theta) - \nabla_\beta\eta(\theta)\|$) is proportional to $\tau(1-\beta)$, where $\tau$ is a suitable mixing time of the Markov chain underlying the POMDP [2]. Thus for “rapidly mixing” POMDPs (for which $\tau$ is small), estimates of the performance gradient with acceptable bias and variance can be obtained.
Provided $\nabla_\beta\eta(\theta)$ is a sufficiently accurate approximation to $\nabla\eta(\theta)$ (in fact, $\nabla_\beta\eta(\theta)$ need only be within $90^\circ$ of $\nabla\eta(\theta)$), small adjustments to the parameters $\theta$ in the direction $\nabla_\beta\eta(\theta)$ will guarantee improvement in the average reward $\eta(\theta)$. In this case, gradient-based optimization algorithms using $\nabla_\beta\eta(\theta)$ as their gradient estimate will be guaranteed to improve the average reward $\eta(\theta)$ on each step. Except in the case of table-lookup, most value-function based approaches to reinforcement learning cannot make this guarantee.
In this paper we present a conjugate-gradient ascent algorithm that uses the estimates of $\nabla_\beta\eta(\theta)$ provided by GPOMDP. Critical to the successful operation of the algorithm is a novel line search subroutine that brackets maxima by relying solely upon gradient estimates. This largely avoids problems associated with finding the maximum using noisy value estimates. Since the parameters are only updated after accumulating sufficiently accurate estimates of the gradient direction, we refer to this approach as the “offline” algorithm. This approach essentially allows us to take a stochastic gradient optimization problem and treat it as a non-stochastic optimization problem, thus enabling the use of a large body of accumulated heuristics and algorithmic improvements associated with such methods. We also present a more traditional, “online” stochastic gradient ascent algorithm based on GPOMDP that updates the parameters at every time step. This algorithm is essentially the algorithm proposed in [12].
The offline and online algorithms are applied to a variety of problems, beginning with a simple 3-state Markov decision process (MDP) controlled by a linear function for which the true gradient can be exactly computed. We show rapid convergence of the gradient estimates to the true gradient, in this case over a large range of values of $\beta$. With this simple system we are able to illustrate vividly the bias/variance trade-off associated with the selection of $\beta$. We then compare the performance of the offline and online approaches applied to finding a good policy for the MDP. The offline algorithm reliably finds a near-optimal policy in fewer than 100 iterations of the Markov chain, an order of magnitude faster than the online approach. This can be attributed to the more aggressive exploitation of the gradient information by the offline method.
Next we demonstrate the effectiveness of the offline algorithm in training a neural network controller to control a “puck” in a two-dimensional world. The task in this case is to reliably navigate the puck from any starting configuration to an arbitrary target location in the minimum time, while only applying discrete forces in the $x$ and $y$ directions. Although the online algorithm was tried for this problem, convergence was considerably slower and we were not able to reliably find a good local optimum.
In the third experiment, we use the offline algorithm to train a controller for the call admission queueing problem treated in [16]. In this case near-optimal solutions are found within about 2000 iterations of the underlying queue, 1 to 2 orders of magnitude faster than the experiments reported in [16] with online (stochastic-gradient) algorithms.
In the fourth and final experiment, the offline algorithm was used to reliably train a switched neural-network controller for a two-dimensional variation on the classical “mountain-car” task [21, Example 8.2].
The rest of this paper is organized as follows. In Section 2 we introduce POMDPs controlled by stochastic policies, and the assumptions needed for our algorithms to apply. GPOMDP is described in Section 3. In Section 4 we describe the offline and online gradient-ascent algorithms, including the gradient-based line-search subroutine. Experimental results are presented in Section 5.
2 POMDPs Controlled by Stochastic Policies
A partially observable Markov decision process (POMDP) consists of a state space $S$, an observation space $Y$ and a control space $U$. For each state $i \in S$ there is a deterministic reward $r(i)$. Although the results in [5] only guarantee convergence of GPOMDP in the case of finite $S$ (but continuous $U$ and $Y$), the algorithm can be applied regardless of the nature of $S$, so we do not restrict the cardinality of $S$, $U$ or $Y$.
Consider first the case of discrete $S$, $U$ and $Y$. Each control $u \in U$ determines a stochastic matrix $P(u) = [p_{ij}(u)]$ giving the transition probability from state $i$ to state $j$ ($i, j \in S$). For each state $i \in S$, an observation $y \in Y$ is generated independently according to a probability distribution $\nu(i)$ over observations in $Y$. We denote the probability of observation $y$ in state $i$ by $\nu_y(i)$. A randomized policy is simply a function $\mu$ mapping observations into probability distributions over the controls $U$. That is, for each observation $y$, $\mu(y)$ is a distribution over the controls in $U$. Denote the probability under $\mu$ of control $u$ given observation $y$ by $\mu_u(y)$.
For continuous and , becomes a kernel giving the probability density of transitions from to , becomes a probability density function on with the density at , and becomes a probability density function on with the density at .
To each randomized policy $\mu$ there corresponds a Markov chain in which state transitions are generated by first selecting an observation $y$ in state $i$ according to the distribution $\nu(i)$, then selecting a control $u$ according to the distribution $\mu(y)$, and finally generating a transition to state $j$ according to the probability $p_{ij}(u)$.
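At present we are only dealing with a fixed POMDP. To parameterize the POMDP we parameterize the policies, so that $\mu$ now becomes a function $\mu(\theta, y)$ of a set of parameters $\theta \in \mathbb{R}^K$, as well as of the observation $y$. The Markov chain corresponding to $\theta$ has state transition matrix $P(\theta) = [p_{ij}(\theta)]$ given by
$$p_{ij}(\theta) = \mathbb{E}_{y \sim \nu(i)}\, \mathbb{E}_{u \sim \mu(\theta, y)}\, p_{ij}(u). \qquad (1)$$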
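For discrete state, observation and control spaces, the transition matrix of the induced chain is an average of the per-control transition matrices, weighted by the observation probabilities and the policy probabilities. The following is a minimal NumPy sketch of that computation; the function name, the array layouts and the numbers in the example are illustrative assumptions, not taken from the experiments.

```python
import numpy as np

def chain_transition_matrix(P_u, nu, mu):
    """Transition matrix of the chain induced by a fixed randomized policy.

    P_u : (U, S, S) array, P_u[u, i, j] = probability of i -> j under control u
    nu  : (S, Y) array,    nu[i, y]     = probability of observation y in state i
    mu  : (Y, U) array,    mu[y, u]     = probability of control u given observation y
    Returns the (S, S) matrix with entries sum_y sum_u nu[i, y] * mu[y, u] * P_u[u, i, j].
    """
    return np.einsum("iy,yu,uij->ij", nu, mu, P_u)

# Illustrative two-state, two-observation, two-control example (made-up numbers).
P_u = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.6, 0.4]]])
nu = np.array([[1.0, 0.0], [0.3, 0.7]])
mu = np.array([[0.25, 0.75], [0.5, 0.5]])
P = chain_transition_matrix(P_u, nu, mu)
```

Each row of the result is itself a probability distribution, which is a quick sanity check on the inputs.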
Note that the policies are purely reactive or memoryless in that their choice of action is based only upon the current observation. All the experiments described in the present paper use purely reactive policies. The authors of [tr_belief_01] have extended GPOMDP and the techniques of the present paper to controllers with internal state.
The following technical assumptions are required for the operation of GPOMDP.
Assumption 1.
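The derivatives
$$\frac{\partial \mu_u(\theta, y)}{\partial \theta_k}$$
exist, and the ratios
$$\frac{\left| \partial \mu_u(\theta, y) / \partial \theta_k \right|}{\mu_u(\theta, y)}$$
are uniformly bounded by $B < \infty$, for all $u \in U$, $y \in Y$, $\theta \in \mathbb{R}^K$ and $k = 1, \dots, K$.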
The second part of this assumption is needed because the ratio appears in the GPOMDP algorithm. It allows zero-probability actions $\mu_u(\theta, y) = 0$ only if $\nabla\mu_u(\theta, y)$ is also zero, in which case we set $0/0 = 0$. See Section 5 for examples of policies satisfying this requirement.
Assumption 2.
The magnitudes of the rewards, $|r(i)|$, are uniformly bounded by $R < \infty$ for all states $i$.
For deterministic rewards, this condition only represents a restriction in infinite state spaces. However, all the results in the present paper apply to bounded stochastic rewards, in which case $r(i)$ is the expectation of the reward in state $i$.
Assumption 3.
Each $P(\theta)$, $\theta \in \mathbb{R}^K$, has a unique stationary distribution $\pi(\theta)$, satisfying the balance equations:
$$\pi'(\theta)\, P(\theta) = \pi'(\theta).$$
Assumption 3 ensures that, for all parameters $\theta$, the Markov chain forms a single recurrent class. Since any finite-state Markov chain always ends up in a recurrent class, and it is the properties of this class that determine the long-term average reward, this assumption is mainly for convenience so that we do not have to include the recurrent class as a quantifier in our theorems. Observe that episodic problems, such as the minimization of time to a goal state, may be modeled in a way that satisfies Assumption 3 by simply resetting the agent upon reaching the goal state back to some initial starting distribution over states. Examples are described in Section 5.
The average reward $\eta(\theta)$ is simply the expected reward under the stationary distribution $\pi(\theta)$:
$$\eta(\theta) = \sum_{i} \pi_i(\theta)\, r(i). \qquad (2)$$
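Because of Assumption 3, $\eta(\theta)$ is also equal to the expected long-term average of the reward received when starting from any state $i$:
$$\eta(\theta) = \lim_{T \to \infty} \mathbb{E}_\theta\left[ \frac{1}{T} \sum_{t=1}^{T} r(i_t) \,\middle|\, i_0 = i \right].$$
Here the expectation is over sequences of states $i_0, \dots, i_T$ with state transitions generated by $P(\theta)$ (note that the expectation is independent of the starting state $i$).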
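When the state space is small and the transition matrix is known, the stationary distribution and hence the average reward can be computed directly from the balance equations. The sketch below is one way to do this with NumPy; the helper names and the two-state example are hypothetical and serve only to illustrate the definition.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi' P = pi' together with sum(pi) = 1 for a chain with a unique stationary distribution."""
    n = P.shape[0]
    # Stack the balance equations (I - P)' pi = 0 with the normalisation row.
    A = np.vstack([(np.eye(n) - P).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def average_reward(P, r):
    """Expected reward under the stationary distribution of P."""
    return float(stationary_distribution(P) @ r)

# Made-up two-state chain: the stationary distribution is (3/7, 4/7).
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
r = np.array([0.0, 1.0])
eta = average_reward(P, r)
```

For large or unknown chains this direct computation is unavailable, which is the situation the simulation-based estimates are intended for.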
3 The GPOMDP Algorithm
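GPOMDP (Algorithm 1) is an algorithm for computing a biased estimate $\Delta_T$ of the gradient of the average reward, $\nabla\eta(\theta)$. The estimate $\Delta_T$ satisfies
$$\lim_{T \to \infty} \Delta_T = \nabla_\beta\eta(\theta),$$
where $\nabla_\beta\eta(\theta)$ ($\beta \in [0,1)$) is an approximation to $\nabla\eta(\theta)$ satisfying
$$\nabla\eta(\theta) = \lim_{\beta \to 1} \nabla_\beta\eta(\theta)$$
[5, Theorems 2, 5]. Note that GPOMDP relies only upon a single sample path from the POMDP. Also, it does not require knowledge of the transition probability matrix $P$, nor of the observation process $\nu$; it only requires knowledge of the randomized policy $\mu$, in particular the ability to compute the gradient of the probability of the chosen control divided by the probability of the chosen control.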
We cannot set $\beta$ arbitrarily close to $1$ in GPOMDP, since the variance of the estimate is proportional to $1/(1-\beta)^2$. However, on the bright side, it can also be shown that the bias of $\nabla_\beta\eta(\theta)$ (measured by $\|\nabla\eta(\theta) - \nabla_\beta\eta(\theta)\|$) is proportional to $\tau(1-\beta)$, where $\tau$ is a suitable mixing time of the Markov chain underlying the POMDP [2]. Under Assumption 3, regardless of the initial starting state, the distribution over states converges to the stationary distribution $\pi(\theta)$ when the agent is following policy $\mu(\theta, \cdot)$. Standard Markov chain theory shows that the rate of convergence to $\pi(\theta)$ is exponential, and loosely speaking, the mixing time $\tau$ is the time constant in the exponential decay.
Thus $\beta$ has a natural interpretation in terms of a bias/variance trade-off: small values of $\beta$ give lower variance in the estimates $\Delta_T$, but higher bias in that the expectation of $\Delta_T$ may be far from $\nabla\eta(\theta)$, whereas values of $\beta$ close to $1$ yield small bias but correspondingly larger variance. Fortunately, for problems which mix rapidly (small $\tau$), $\beta$ can be small and still yield reasonable bias. This bias/variance trade-off is vividly illustrated in the experiments of Section 5; see [2] for a more detailed theoretical discussion of the bias/variance question.
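To make the description above concrete, the following is a minimal sketch of an eligibility-trace gradient estimator of the kind described in this section: a trace accumulates the ratio "gradient of the chosen control's probability over that probability", discounted by $\beta$, and the estimate is a running average of reward times trace. The update rule is our reading of the description rather than a transcription of Algorithm 1, and the callback signatures and the two-state toy problem (`toy_policy`, `toy_step`) are hypothetical.

```python
import numpy as np

def gpomdp(theta, beta, T, env_step, policy, rng, state=0):
    """Single-sample-path estimate of the average-reward gradient (sketch).

    policy(theta, obs)      -> (probs over controls, grads with shape (U, K))
    env_step(state, u, rng) -> (next state, reward in next state, next observation)
    """
    z = np.zeros_like(theta, dtype=float)      # eligibility trace
    delta = np.zeros_like(theta, dtype=float)  # running gradient estimate
    obs = state  # assumption: the toy problem starts fully observed
    for t in range(T):
        probs, grads = policy(theta, obs)
        u = rng.choice(len(probs), p=probs)
        state, r, obs = env_step(state, u, rng)
        z = beta * z + grads[u] / probs[u]
        delta += (r * z - delta) / (t + 1)
    return delta

def toy_policy(theta, obs):
    """Logistic policy over two controls that ignores the observation."""
    p1 = 1.0 / (1.0 + np.exp(-theta[0]))
    probs = np.array([1.0 - p1, p1])
    grads = np.array([[-p1 * (1.0 - p1)], [p1 * (1.0 - p1)]])
    return probs, grads

def toy_step(state, u, rng):
    """The next state equals the chosen control; only state 1 pays reward 1."""
    return int(u), float(u), int(u)

estimate = gpomdp(np.array([0.0]), beta=0.0, T=20000,
                  env_step=toy_step, policy=toy_policy,
                  rng=np.random.default_rng(0))
```

In this toy problem the average reward equals the probability of choosing control 1, so the true gradient at $\theta = 0$ is $0.25$ and the estimate should be close to that value; the example is chosen so that $\beta = 0$ introduces no bias, which will not be true in general.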
4 Stochastic Gradient Ascent Algorithms
This section introduces two approaches to exploiting the gradient estimates produced by GPOMDP:

• an offline approach based on traditional conjugate-gradient optimization techniques but employing a novel line-search mechanism to cope with the noise in GPOMDP’s estimates, and
• an online stochastic optimization approach that uses the core update in GPOMDP to update the parameters on every iteration of the POMDP.
4.1 Offline optimization of the average reward
GPOMDP generates biased and noisy estimates $\Delta_T$ of the gradient $\nabla\eta(\theta)$ of the average reward for POMDPs controlled by parameterized stochastic policies. A straightforward algorithm for finding local maxima of $\eta(\theta)$ would be to compute $\Delta_T(\theta)$ at the current parameter settings $\theta$, and then modify $\theta$ by $\theta \leftarrow \theta + \gamma \Delta_T(\theta)$. Provided $\Delta_T(\theta)$ is close enough to the true gradient direction $\nabla\eta(\theta)$, and provided the step-sizes $\gamma$ are suitably decreasing, standard stochastic optimization theory tells us that this technique will converge to a local maximum of $\eta(\theta)$. However, given that each computation of $\Delta_T(\theta)$ requires many iterations of the POMDP to guarantee suitably accurate gradient estimates (that is, in general $T$ needs to be large), we would like to exploit the information contained in $\Delta_T(\theta)$ more aggressively than by simply adjusting the parameters by a small amount in the direction $\Delta_T(\theta)$.
There are two techniques for making better use of gradient information that are widely used in non-stochastic optimization: better choice of the search direction and better choice of step size. Better search directions can be found by employing conjugate-gradient directions rather than the pure gradient direction. Better step sizes are usually obtained by performing some kind of line-search to find a local maximum in the search direction, or through the use of second order methods. Since line-search techniques tend to be more robust to departures from quadraticity in the optimization surface, we will only consider those here (however, see [5, Section 7.3] for a discussion of how second-order derivatives may be computed with a GPOMDP-like algorithm).
CONJPOMDP, described in Algorithm 2, is a version of the Polak-Ribiere conjugate-gradient algorithm (see, e.g., [fine99, Section 5.5.2]) that is designed to operate using only noisy (and possibly biased) estimates of the gradient of the objective function (for example, the estimates provided by GPOMDP). The argument GRAD to CONJPOMDP computes the gradient estimate. The novel feature of CONJPOMDP is GSEARCH, a line-search subroutine that uses only gradient information to find the local maximum in the search direction. The use of gradient information ensures GSEARCH is robust to noise in the performance estimates. Both CONJPOMDP and GSEARCH can be applied to any stochastic optimization problem for which noisy (and possibly biased) gradient estimates are available.
The argument $s_0$ to CONJPOMDP provides an initial step-size for GSEARCH. The argument $\epsilon$ provides a stopping condition; when the squared norm of the gradient estimate falls below $\epsilon$, CONJPOMDP terminates.
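For readers unfamiliar with Polak-Ribiere conjugate gradients, the direction update at the heart of such a method can be sketched as follows. This is the textbook update for maximization, with a restart when noise makes the combined direction point downhill; it is an illustrative assumption about the general technique, not a transcription of Algorithm 2, and the function name is hypothetical.

```python
import numpy as np

def polak_ribiere_direction(g_new, g_old, h_old):
    """Next search direction from the new and previous gradient estimates."""
    gamma = float((g_new - g_old) @ g_new) / float(g_old @ g_old)
    h_new = g_new + gamma * h_old
    # With noisy gradients the combined direction can point downhill;
    # restart from the plain gradient estimate in that case.
    if h_new @ g_new < 0.0:
        h_new = g_new.copy()
    return h_new

# Made-up gradient estimates from two successive iterations.
g_old = np.array([1.0, 0.0])
g_new = np.array([0.0, 1.0])
h = polak_ribiere_direction(g_new, g_old, h_old=g_old)
```

Each new direction would then be handed to a line search, and the loop would stop once the squared norm of the gradient estimate is small.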
4.2 The GSEARCH algorithm
The key to the successful operation of CONJPOMDP is the line-search algorithm GSEARCH (Algorithm 3). GSEARCH uses only gradient information to bracket the maximum in the direction $\theta^*$, and then quadratic interpolation to jump to the maximum.
We found the use of gradients to bracket the maximum far more robust than the use of function values. To illustrate why this is so, in Figure 1 we have plotted a stylized view of the average reward $\eta(\theta)$ along some search direction $\theta^*$, together with its gradient in that direction. There are two ways we could search in the direction $\theta^*$ to bracket the maximum of $\eta(\theta)$ in that direction, one using function values and the other using gradient estimates:

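1. Find three points $\theta_1, \theta_2, \theta_3$, all lying in the direction $\theta^*$ from $\theta_0$, such that $\eta(\theta_1) < \eta(\theta_2)$ and $\eta(\theta_3) < \eta(\theta_2)$. Assuming no overshooting, we then know the maximum must lie between $\theta_1$ and $\theta_3$ and we can use the three points and quadratic interpolation to estimate the location of the maximum.
2. Find two points $\theta_1$ and $\theta_2$ such that $\nabla\eta(\theta_1) \cdot \theta^* > 0$ and $\nabla\eta(\theta_2) \cdot \theta^* < 0$, and again use quadratic interpolation (which corresponds to linear interpolation of the gradients) to estimate the location of the maximum.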
Both of these approaches will be equally satisfactory provided there is no noise in either the function estimates $\eta(\theta)$, or the gradient estimates $\nabla\eta(\theta)$. However, when estimates of $\eta(\theta)$ or $\nabla\eta(\theta)$ are available only through simulation, they will necessarily be noisy and the situation will look more like Figure 2. In this case the use of gradients to bracket the maximum becomes more desirable, because the line-search technique based on value estimates could choose any of the peaks in the plot of $\eta(\theta)$ as the location of the maximum, which occur nearly uniformly along the axis, whereas the second technique based on gradients would choose any of the zero-crossings of the noisy gradient plot, which are far closer to the true maximum. This is illustrated in Figure 3. (Footnote 2: There is an implicit assumption in our argument that the noise processes in the gradient and value estimates are of approximately the same magnitude. If the variance of the value estimates is considerably smaller than the variance of the gradient estimates then we would expect bracketing with values to be superior. In all our experiments we found gradient bracketing to be superior.)
Another view of this phenomenon is that regardless of the variance of our estimates of $\eta(\theta)$, the variance of $\mathrm{sign}[\eta(\theta_1) - \eta(\theta_2)]$ approaches $1$ (the maximum possible) as $\theta_1$ approaches $\theta_2$. Thus, to reliably bracket the maximum using noisy estimates of $\eta(\theta)$ we need to be able to reduce the variance of the estimates when $\theta_1$ and $\theta_2$ are close. In our case this means running the simulation from which the estimates are derived for longer and longer periods of time. In contrast, the variance of $\mathrm{sign}[\nabla\eta(\theta_1) \cdot \theta^*]$ (and $\mathrm{sign}[\nabla\eta(\theta_2) \cdot \theta^*]$) is independent of the distance between $\theta_1$ and $\theta_2$, and in particular does not grow as the two points approach one another.
One disadvantage to using gradient estimates to bracket is that it is not possible to detect extreme overshooting of the maximum. However, this can be avoided by using value estimates as a “sanity check” to determine if the value has dropped dramatically, and suitably adjusting the search if this occurs.
In Algorithm 3, lines 5–25 bracket the maximum by finding a parameter setting $\theta_-$ such that $\mathrm{GRAD}(\theta_-) \cdot \theta^* > -\epsilon$, and a second parameter setting $\theta_+$ such that $\mathrm{GRAD}(\theta_+) \cdot \theta^* < \epsilon$. The reason for $\epsilon$ rather than $0$ in these expressions is to provide some robustness against errors in the estimates $\mathrm{GRAD}(\theta)$. It also prevents the algorithm “stepping to $\infty$” if there is no local maximum in the direction $\theta^*$. Note that we use the same $\epsilon$ as used in CONJPOMDP to determine when to terminate due to small gradient (line 4 in CONJPOMDP).
Provided that the signs of the gradients at the bracketing points $\theta_-$ and $\theta_+$ show that the maximum of the quadratic defined by these points lies between them, line 27 will jump to the maximum. Otherwise the algorithm simply jumps to the midpoint between $\theta_-$ and $\theta_+$.
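The bracketing and interpolation idea can be illustrated with the following simplified sketch. It assumes the search direction is uphill at the starting point, doubles the step until the directional derivative is no longer positive, and then interpolates the two bracketing derivatives linearly. It deliberately omits the step-in case, the value-based sanity check and other safeguards of Algorithm 3; the function name and arguments are hypothetical.

```python
import numpy as np

def gradient_line_search(grad, theta0, direction, s0=1.0, eps=1e-8, max_doublings=50):
    """Bracket a maximum along `direction` using only directional derivatives (sketch)."""
    s_minus, p_minus = 0.0, float(grad(theta0) @ direction)
    s_plus, p_plus = s0, float(grad(theta0 + s0 * direction) @ direction)
    for _ in range(max_doublings):
        if p_plus <= eps:
            break  # the derivative is no longer positive: maximum is bracketed
        s_minus, p_minus = s_plus, p_plus
        s_plus *= 2.0
        p_plus = float(grad(theta0 + s_plus * direction) @ direction)
    if p_minus > 0.0 > p_plus:
        # Linear interpolation of the derivatives, i.e. quadratic interpolation of the objective.
        s = s_minus + (s_plus - s_minus) * p_minus / (p_minus - p_plus)
    else:
        s = 0.5 * (s_minus + s_plus)
    return theta0 + s * direction

# Noise-free check on f(theta) = -(theta - 3)^2, whose gradient is -2 (theta - 3).
grad_f = lambda theta: -2.0 * (theta - 3.0)
theta_star = gradient_line_search(grad_f, np.array([0.0]), np.array([1.0]))
```

On this exactly quadratic example the interpolation lands on the maximizer in one jump; with noisy gradient estimates the returned point is only as good as the two bracketing estimates.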
4.3 Online optimization of the average reward: OLPOMDP
CONJPOMDP combined with GSEARCH operates by iteratively choosing “uphill” directions and then searching for a local maximum in the chosen direction. If the GRAD argument to CONJPOMDP is GPOMDP, the optimization will involve many iterations of the underlying POMDP between parameter updates.
In traditional stochastic optimization one typically uses algorithms that update the parameters at every iteration, rather than accumulating gradient estimates over many iterations. Algorithm 4, OLPOMDP, presents an adaptation of GPOMDP to this form. See [cdc_00] for a proof that OLPOMDP converges to the vicinity of a local maximum of $\eta(\theta)$. Note that OLPOMDP is very similar to the algorithms proposed in [kimura95, kimura97].
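A minimal sketch of such an online variant is given below: the same eligibility trace is maintained, but the parameters are nudged by step-size times reward times trace at every time step. The update is our reading of the description rather than a transcription of Algorithm 4, and the toy two-state problem and all function names are hypothetical.

```python
import numpy as np

def toy_policy(theta, obs):
    """Logistic policy over two controls that ignores the observation."""
    p1 = 1.0 / (1.0 + np.exp(-theta[0]))
    probs = np.array([1.0 - p1, p1])
    grads = np.array([[-p1 * (1.0 - p1)], [p1 * (1.0 - p1)]])
    return probs, grads

def toy_step(state, u, rng):
    """The next state equals the chosen control; only state 1 pays reward 1."""
    return int(u), float(u), int(u)

def online_policy_gradient(theta, beta, gamma, T, env_step, policy, rng, state=0):
    """Update the parameters at every step using reward times eligibility trace (sketch)."""
    theta = np.array(theta, dtype=float)
    z = np.zeros_like(theta)
    obs = state  # assumption: the toy problem starts fully observed
    for _ in range(T):
        probs, grads = policy(theta, obs)
        u = rng.choice(len(probs), p=probs)
        state, r, obs = env_step(state, u, rng)
        z = beta * z + grads[u] / probs[u]
        theta = theta + gamma * r * z
    return theta

theta_final = online_policy_gradient([0.0], beta=0.0, gamma=0.1, T=2000,
                                     env_step=toy_step, policy=toy_policy,
                                     rng=np.random.default_rng(0))
```

In the toy problem the rewarded control is 1, so the parameter should drift upwards, making that control more likely; a fixed step-size is used here purely for simplicity.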
5 Experiments
In this section we present several sets of experimental results. Throughout this section, where we refer to CONJPOMDP we mean CONJPOMDP with GPOMDP as its GRAD argument.
In the first set of experiments, we consider a system in which a controller is used to select actions for a 3-state Markov Decision Process (MDP). For this system we are able to compute the true gradient exactly using the matrix equation
$$\nabla\eta = \pi' \nabla P \left[ I - P + e\pi' \right]^{-1} r, \qquad (3)$$
where $P$ is the transition matrix of the underlying Markov chain with the controller’s parameters set to $\theta$, $\pi'$ is the stationary distribution corresponding to $P$ (written as a row vector), $e\pi'$ is the square matrix in which each row is the stationary distribution, and $r$ is the (column) vector of rewards (see [5, Section 3] for a derivation of (3)). Hence we can compare the estimates $\Delta_T$ generated by GPOMDP with the true gradient $\nabla\eta$, both as a function of the number of iterations $T$ and as a function of the discount parameter $\beta$. We also optimize the performance of the controller using the online algorithm, OLPOMDP, and the offline algorithm CONJPOMDP. CONJPOMDP reliably converges to a near-optimal policy with around 100 iterations of the MDP, while the online method requires approximately 1000 iterations. This should be contrasted with training a linear value-function for this system using [22], which can be shown to converge to a value function whose one-step lookahead policy is suboptimal [24].
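The exact gradient formula, stationary distribution times the derivative of the transition matrix times the inverse of (identity minus transition matrix plus the matrix whose rows are the stationary distribution) times the reward vector, is straightforward to evaluate numerically for small chains. The following sketch does so for a made-up one-parameter, two-state chain (not the three-state MDP of this section); all names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi' P = pi' with sum(pi) = 1 by least squares."""
    n = P.shape[0]
    A = np.vstack([(np.eye(n) - P).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def exact_gradient(P, dP, r):
    """Evaluate pi' dP_k [I - P + e pi']^{-1} r for each parameter k.

    P  : (n, n) transition matrix at the current parameters
    dP : (K, n, n) partial derivatives of P with respect to each parameter
    r  : (n,) reward vector
    """
    n = P.shape[0]
    pi = stationary_distribution(P)
    e_pi = np.tile(pi, (n, 1))                    # every row is pi
    x = np.linalg.solve(np.eye(n) - P + e_pi, r)  # [I - P + e pi']^{-1} r
    return np.array([pi @ dPk @ x for dPk in dP])

# Chain with P = [[1 - a, a], [b, 1 - b]], a = sigmoid(theta) at theta = 0, b = 0.25.
a, b = 0.5, 0.25
P = np.array([[1.0 - a, a], [b, 1.0 - b]])
dP = np.array([[[-a * (1.0 - a), a * (1.0 - a)], [0.0, 0.0]]])  # derivative w.r.t. theta
r = np.array([0.0, 1.0])
grad = exact_gradient(P, dP, r)
```

For this chain the average reward is $a/(a+b)$, so the gradient with respect to $\theta$ can be checked by hand to be $1/9$.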
In the second set of experiments, we consider a simple “puck-world” problem in which a small puck must be navigated around a two-dimensional world by applying thrust in the $x$ and $y$ directions. We train a 1-hidden-layer neural-network controller for the puck using CONJPOMDP. Again the controller reliably converges to near optimality.
In the third set of experiments we use CONJPOMDP to optimize the admission thresholds for the call-admission problem considered in [16].
In the final set of experiments we use CONJPOMDP to train a switched neural-network controller for a two-dimensional variant of the “mountain-car” task [21, Example 8.2].
In all the experiments we found that convergence of the line-searches was greatly improved if all calls to the GPOMDP algorithm were seeded with the same random number sequence.
5.1 A three-state MDP
In this section we consider a three-state MDP, in each state of which there is a choice of two actions $a_1$ and $a_2$. Table 1 shows the transition probabilities as a function of the states and actions. Each state $x$ has an associated two-dimensional feature vector $\phi(x) = (\phi_1(x), \phi_2(x))$ and reward $r(x)$ which are detailed in Table 2. Clearly, the optimal policy is to always select the action that leads to state $C$ with the highest probability, which from Table 1 means always selecting action $a_2$.
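Table 1: Transition probabilities of the three-state MDP.

Origin State | Action | Destination State Probabilities
             |        | A   | B   | C
A            | $a_1$  | 0.0 | 0.8 | 0.2
A            | $a_2$  | 0.0 | 0.2 | 0.8
B            | $a_1$  | 0.8 | 0.0 | 0.2
B            | $a_2$  | 0.2 | 0.0 | 0.8
C            | $a_1$  | 0.0 | 0.8 | 0.2
C            | $a_2$  | 0.0 | 0.2 | 0.8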
This rather odd choice of feature vectors for the states ensures that a value function linear in those features and trained using the method of [22] (while observing the optimal policy) will implement a suboptimal greedy one-step lookahead policy (see [24] for a proof). Thus, in contrast to the gradient-based approach, for this system, training a linear value function is guaranteed to produce a worse policy if it starts out observing the optimal policy.
5.1.1 Training a controller
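Our goal is to learn a stochastic controller for this system that implements an optimal (or near-optimal) policy. Given a parameter vector $\theta = [\theta_1, \theta_2, \theta_3, \theta_4]$, we generate a policy as follows. For any state $x$, let
$$s_1(x) := \theta_1 \phi_1(x) + \theta_2 \phi_2(x), \qquad s_2(x) := \theta_3 \phi_1(x) + \theta_4 \phi_2(x).$$
Then the probability of choosing action $a_1$ in state $x$ is given by
$$\mu_{a_1}(x) = \frac{e^{s_1(x)}}{e^{s_1(x)} + e^{s_2(x)}},$$
while the probability of choosing action $a_2$ is given by
$$\mu_{a_2}(x) = \frac{e^{s_2(x)}}{e^{s_1(x)} + e^{s_2(x)}}.$$
The ratios $\nabla\mu_{a_i}/\mu_{a_i}$ needed by Algorithms 1 and 4 are given by
$$\frac{\nabla\mu_{a_1}(x)}{\mu_{a_1}(x)} = \frac{e^{s_2(x)}}{e^{s_1(x)} + e^{s_2(x)}} \left[ \phi_1(x), \phi_2(x), -\phi_1(x), -\phi_2(x) \right], \qquad (4)$$
$$\frac{\nabla\mu_{a_2}(x)}{\mu_{a_2}(x)} = \frac{e^{s_1(x)}}{e^{s_1(x)} + e^{s_2(x)}} \left[ -\phi_1(x), -\phi_2(x), \phi_1(x), \phi_2(x) \right]. \qquad (5)$$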
Since the second two components in $\nabla\mu/\mu$ are always the negative of the first two, this shows that two of the parameters are redundant in this case: we could just as well have set $\theta_3 = -\theta_1$ and $\theta_4 = -\theta_2$.
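The policy and the gradient-to-probability ratios for a two-action controller of this kind can be sketched as below, assuming the action probabilities are a softmax over two linear scores of a two-dimensional feature vector with four parameters in total. The function name and the feature values in the example are made up for illustration.

```python
import numpy as np

def policy_and_ratios(theta, phi):
    """Two-action softmax policy and the ratios grad(mu_a) / mu_a.

    theta : four parameters; the first two score action 1, the last two score action 2
    phi   : two-dimensional feature vector of the current state
    """
    s1 = theta[0] * phi[0] + theta[1] * phi[1]
    s2 = theta[2] * phi[0] + theta[3] * phi[1]
    m = max(s1, s2)                        # subtract the max for numerical stability
    e1, e2 = np.exp(s1 - m), np.exp(s2 - m)
    mu = np.array([e1, e2]) / (e1 + e2)
    direction = np.array([phi[0], phi[1], -phi[0], -phi[1]])
    # Ratio for action 1 is mu_2 * direction; for action 2 it is -mu_1 * direction.
    ratios = np.array([mu[1] * direction, -mu[0] * direction])
    return mu, ratios

mu, ratios = policy_and_ratios(np.zeros(4), np.array([2.0, 4.0]))
```

A useful check is that the probability-weighted sum of the two ratio vectors is zero, since the action probabilities always sum to one.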
5.1.2 Gradient estimates
With a fixed parameter vector $\theta$, GPOMDP was used to generate estimates $\Delta_T$ of $\nabla_\beta\eta$, for various values of $T$ and $\beta \in [0,1)$. (Footnote 3: Other initial values of the parameter vector were chosen with similar results. Note that the chosen $\theta$ generates a suboptimal policy.) To measure the progress of $\Delta_T$ towards the true gradient $\nabla\eta$, $\nabla\eta$ was calculated from (3) and then for each value of $T$ the angle between $\Delta_T$ and $\nabla\eta$ and the relative error $\|\Delta_T - \nabla\eta\| / \|\nabla\eta\|$ were recorded. The angles and relative errors are plotted in Figures 4, 5 and 6.
The graphs illustrate a typical trade-off for the GPOMDP algorithm: small values of $\beta$ give higher bias in the estimates, while larger values of $\beta$ give higher variance (the final bias is only shown in Figure 6 for the norm deviation because it was too small to measure for the angular deviation). The bias introduced by having $\beta < 1$ is very small for this system. In the worst case (the smallest value of $\beta$ tested), the final gradient direction is indistinguishable from the true direction, while the relative deviation remains small.
5.1.3 Training via conjugate-gradient ascent
with as the “” argument was used to train the parameters of the controller described in the previous section. Following the low bias observed in the experiments of the previous section, the argument of was set to . After a small amount of experimentation, the arguments and of were set to and respectively. None of these values were critical, although the extremely large initial stepsize () did considerably reduce the time required for the controller to converge to nearoptimality.
We tested the performance of for a range of values of the argument to from to . Since only uses to determine the sign of the inner product of the gradient with the search direction, it does not need to run for as many iterations as does. Thus, determined its own parameter to as follows. Initially, (somewhat arbitrarily) the value of within was set to the value used in (or 1 if the value in was less than 10). then called to obtain an estimate of the gradient direction. If ( being the desired search direction) then was doubled and was called again to generate a new estimate . This procedure was repeated until , or had been doubled four times. If was still negative at the end of this process, searched for a local maximum in the direction , and the number of iterations used by was doubled on the next iteration (the conclusion being that the direction was generated by overly noisy estimates from ).
Figure 7 shows the average reward of the final controller produced by , as a function of the total number of simulation steps of the underlying Markov chain. The plots represent an average over independent runs of . Note that is the average reward of the optimal policy. The parameters of the controller were (uniformly) randomly initialized in the range before each call to . After each call to , the average reward of the resulting controller was computed exactly by calculating the stationary distribution for the controller. From Figure 7, optimality is reliably achieved using approximately 100 iterations of the Markov chain.
5.1.4 Training online with OLPOMDP
The controller was also trained online using Algorithm 4 () with fixed stepsizes with . Reducing stepsizes of the form were tried, but caused intolerably slow convergence. Figure 8 shows the performance of the controller (measured exactly as in the previous section) as a function of the total number of iterations of the Markov chain, for different values of the stepsize . The graphs are averages over 100 runs, with the controller’s weights randomly initialized in the range at the start of each run. From the figure, convergence to optimal is about an order of magnitude slower than that achieved by , for the best stepsize of . Stepsizes much greater that failed to reliably converge to an optimal policy.
5.2 Puck World
In this section, experiments are described in which CONJPOMDP and GPOMDP were used to train 1-hidden-layer neural-network controllers to navigate a small puck around a two-dimensional world.
5.2.1 The World
The puck was a unitradius, unitmass disk constrained to move in the plane in a region 100 units square. The puck had no internal dynamics (i.e rotation). Collisions with the region’s boundaries were inelastic with a (tunable) coefficient of restitution (set to for the experiments reported here). The puck was controlled by applying a 5 unit force in either the positive or negative direction, and a 5 unit force in either the positive or negative direction, giving four different controls in total. The control could be changed every of a second, and the simulator operated at a granularity of of a second. The puck also had a retarding force due to air resistance of . There was no friction between the puck and the ground.
The puck was given a reward at each decision point ( of a second) equal to where was the distance between the puck and some designated target point. To encourage the controller to learn to navigate the puck to the target independently of the starting state, the puck state was reset every 30 (simulated) seconds to a random location and random and velocities in the range , and at the same time the target position was set to a random location.
Note that the size of the statespace in this example is essentially infinite, being of the order of where PRECISION is the floating point precision of the machine ( bits). Thus, the time between visits to a recurrent state is likely to be large. Also, the puck cannot just maximize its immediate reward because this leads to significant overshooting of the target locations.
5.2.2 The controller
A onehiddenlayer neuralnetwork with six input nodes, eight hidden nodes and four output nodes was used to generate a probabilistic policy in a similar manner to the controller in the threestate Markov chain example of the previous section. Four of the inputs were set to the raw and locations and velocities of the puck at the current timestep, the other two were the differences between the puck’s and location and the target’s and location respectively. The location inputs were scaled to lie between and , while the velocity inputs were scaled so that a speed of units per second mapped to a value of . The hidden nodes computed a squashing function, while the output nodes were linear. Each hidden and output node had the usual additional offset parameter. The four output nodes were exponentiated and then normalized as in the Markovchain example to produce a probability distribution over the four controls ( units thrust in the direction, units thrust in the direction). Controls were selected at random from this distribution.
5.2.3 Conjugate gradient ascent
We trained the neural-network controller using with the gradient estimates generated by . After some experimentation we chose and as the parameters supplied to . used the same value of and the scheme discussed in Section 5.1.3 to determine the number of iterations with which to call .
Due to the saturating nature of the neural-network hidden nodes (and the exponentiated output nodes), there was a tendency for the network weights to converge to local minima at “infinity”. That is, the weights would grow very rapidly early on in the simulation, but towards a suboptimal solution. Large weights tend to imply very small gradients and thus the network becomes “stuck” at these suboptimal solutions. We have observed a similar behaviour when training neural networks for pattern classification problems. To fix the problem, we subtracted a small quadratic penalty term from the performance estimates and hence also a small correction from the gradient calculation for . (When used as a technique for capacity control in pattern classification, this technique goes by the name “weight decay”. Here we used it to condition the optimization problem.)
We used a decreasing schedule for the quadratic penalty weight (arrived at through some experimentation). was initialized to and then on every tenth iteration of , if the performance had improved by less than 10% from the value ten iterations ago, was reduced by a factor of 10. This schedule solved nearly all the local minima problems, but at the expense of slower convergence of the controller.
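The penalty and its schedule can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the penalty is assumed to be the weight times the sum of squared parameters, and "improved by less than 10%" is interpreted relative to the magnitude of the performance ten iterations earlier.

```python
def penalized(performance, gradient, params, weight):
    """Subtract a quadratic penalty from the performance and its gradient.

    ASSUMED penalty form: weight * sum(p ** 2 for p in params).
    """
    value = performance - weight * sum(p * p for p in params)
    grad = [g - 2.0 * weight * p for g, p in zip(gradient, params)]
    return value, grad


def update_penalty(weight, iteration, history,
                   min_gain=0.10, factor=10.0, period=10):
    """Return the penalty weight to use after `iteration`.

    history[k] is the performance recorded after iteration k. On every
    tenth iteration the weight is divided by `factor` if the performance
    gained less than `min_gain` (relative, ASSUMED) over the last period.
    """
    if iteration == 0 or iteration % period != 0:
        return weight
    old, new = history[iteration - period], history[iteration]
    if new - old < min_gain * abs(old):
        return weight / factor
    return weight
```

Because the penalty shrinks only when progress stalls, early iterations keep the weights small, and later iterations approach the unpenalized objective.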
A plot of the average reward of the neural-network controller is shown in Figure 9, as a function of the number of iterations of the . The graph is an average over 100 independent runs, with the parameters initialized randomly in the range at the start of each run. The four bad runs shown in Figure 10 were omitted from the average because they gave misleadingly large error bars.
Note that the optimal performance (within the neural-network controller class) seems to be around for this problem, due to the fact that the puck and target locations are reset every simulated seconds and hence there is a fixed fraction of the time that the puck must be away from the target. From Figure 9 we see that the final performance of the puck controller is close to optimal. In only 4 of the 100 runs did get stuck in a suboptimal local minimum. Three of those cases were caused by overshooting in (see Figure 10), which could be prevented by adding extra checks to .
Figure 11 illustrates the behaviour of a typical trained controller. For the purpose of the illustration, only the target location and puck velocity were randomized every 30 seconds, not the puck location.
5.3 Call Admission Control
In this section we report the results of experiments in which was applied to the task of training a controller for the call admission problem treated by Marbach [16, Chapter 7].
5.3.1 The Problem
The call admission control problem treated by Marbach [16, Chapter 7] models the situation in which a telecommunications provider wishes to sell bandwidth on a communications link to customers in such a way as to maximize long-term average reward.
Specifically, the problem is a queuing problem. There are three different types of call, each with its own call arrival rate , , , bandwidth demand , , and average holding time , , . The arrivals are Poisson distributed while the holding times are exponentially distributed. The link has a maximum bandwidth of 10 units. When a call arrives and there is sufficient available bandwidth, the service provider can choose to accept or reject the call (if there is not enough available bandwidth the call is always rejected). Upon accepting a call of type , the service provider receives a reward of units. The goal of the service provider is to maximize the long-term average reward.
The parameters associated with each call type are listed in Table 3. With these settings, the optimal policy (found by dynamic programming by Marbach [16]) is to always accept calls of type 2 and 3 (assuming sufficient available bandwidth) and to accept calls of type 1 if the available bandwidth is at least 3. This policy has an average reward of , while the “always accept” policy has an average reward of . (There is some discrepancy between our average rewards and those quoted by Marbach [16]. This is probably due to a discrepancy in the way the state transitions are counted, which was not clear from the discussion in [16].)
Call Type  1  2  3  

Bandwidth Demand  1  1  1  
Arrival Rate  
Average Holding Time  
Reward  1  2  4 
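For concreteness, the optimal threshold policy described before Table 3 can be written as a small decision function. This is a sketch only: the name `optimal_accept` is hypothetical, and the default bandwidth demand of 1 simply follows the table as printed.

```python
LINK_BANDWIDTH = 10  # maximum bandwidth of the link (from the text)


def optimal_accept(call_type, used_bandwidth, demand=1):
    """Threshold policy: accept types 2 and 3 whenever the call fits,
    and accept type 1 only if at least 3 units are available."""
    available = LINK_BANDWIDTH - used_bandwidth
    if available < demand:
        return False  # not enough bandwidth: the call is always rejected
    if call_type == 1:
        return available >= 3
    return True
```

The asymmetry between call types reflects the rewards in the table: rejecting a low-reward call when the link is nearly full reserves capacity for higher-reward calls.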
5.3.2 The Controller
The controller had three parameters , one for each type of call. Upon arrival of a call of type , the controller chooses to accept the call with probability
where is the currently used bandwidth. This is the class of controllers studied by Marbach [16].
5.3.3 Conjugate gradient ascent
was used to train the above controller, with generating the gradient estimates from a range of values of and . The influence of on the performance of the trained controllers was marginal, so we set which gave the lowest-variance estimates. We used the same value of for calls to within and within , and this was varied between and . The controller was always started from the same parameter setting (as was done by Marbach [16]). The value of this initial policy is . The graph of the average reward of the final controller produced by as a function of the total number of iterations of the queue is shown in Figure 12. A performance of was reliably achieved with fewer than iterations of the queue.
Note that the optimal policy is not achievable with this controller class since it is incapable of implementing any threshold policy other than the “always accept” and “always reject” policies. Although not provably optimal, a parameter setting of and any suitably large values of and generates something close to the optimal policy within the controller class, with an average reward of . Figure 13 shows the probability of accepting a call of each type under this policy (with ), as a function of the available bandwidth.
The controllers produced by with and sufficiently large are essentially “always accept” controllers with an average reward of , within 2% of the optimum achievable in the class. To produce policies even nearer to the optimal policy in performance, must keep close to its starting value of , and hence the gradient estimate produced by must have a relatively small first component. Figure 14 shows a plot of normalized as a function of , for (sufficiently large to ensure low variance in ) and the starting parameter setting . From the figure, starts at a high value which explains why produces “always accept” controllers for , and does not become negative until , a value for which the variance in even for moderately large is relatively high.
A plot of the performance of for and is shown in Figure 15. Approximately half of the remaining 2% in performance can be obtained by setting , while for a sufficiently large choice for gives most of the remaining performance. For this problem, there is a huge difference between gaining 98% of optimal performance, which is achieved for and less than 2000 iterations of the queue, and gaining 99% of the optimal which requires and of the order of 500,000 queue iterations. A similar convergence rate and final approximation error to the latter case were reported for the online algorithms by Marbach [16, Chapter 7].
5.4 Mountainous Puck World
The “mountain-car” task is a well-studied problem in the reinforcement learning literature [21, Example 8.2]. As shown in Figure 16, the task is to drive a car to the top of a one-dimensional hill. The car is not powerful enough to accelerate directly up the hill against gravity, so any successful controller must learn to “oscillate” back and forth until it builds up enough speed to crest the hill.
In this section we describe a variant of the mountain-car problem based on the puck-world example of Section 5.2. With reference to Figure 17, in our problem the task is to navigate a puck out of a valley and onto a plateau at the northern end of the valley. As in the mountain-car task, the puck does not have sufficient power to accelerate directly up the hill, and so has to learn to oscillate in order to climb out of the valley. Once again we were able to reliably train near-optimal neural-network controllers for this problem, using and , and with generating the gradient estimates.
5.4.1 The World
The world dimensions, physics, puck dynamics and controls were identical to the flat puck world described in Section 5.2, except that the puck was subject to a constant gravitational force of units, the maximum allowed thrust was units (instead of ), and the height of the world varied as follows:
With only units of thrust, a unit-mass puck cannot accelerate directly out of the valley.
Every 120 (simulated) seconds, the puck was initialized with zero velocity at the bottom of the valley, with a random location. The puck was given no reward while in the valley or on the southern plateau, and a reward of while on the northern plateau, where was the speed of the puck. We found the speed penalty helped to improve the rate of convergence of the neural network controller.
5.4.2 The controller
After some experimentation we found that a neural-network controller could be reliably trained to navigate to the northern plateau, or to stay on the northern plateau once there, but it was difficult to combine both in the same controller (this is not so surprising since the two tasks are quite distinct). To overcome this problem, we trained a “switched” neural-network controller: the puck used one controller when in the valley and on the southern plateau, and then switched to a second neural-network controller while on the northern plateau. Both controllers were one-hidden-layer neural networks with nine input nodes, five hidden nodes and four output nodes. The nine inputs were the normalized (valued) x, y and z puck locations, the normalized x, y and z locations relative to the center of the northern wall, and the x, y and z puck velocities. The four outputs were used to generate a policy in the same fashion as the controller of Section 5.2.2.
An approach requiring less prior knowledge would be to have a third controller that stochastically selects the base neural network controller as a function of the puck’s location. This “master” controller could itself be parameterized and have its parameters trained along with the base controllers.
5.4.3 Conjugate gradient ascent
The switched neural-network controller was trained using the same scheme discussed in Section 5.2.3, except this time the discount factor was set to .
A plot of the average reward of the neural-network controller is shown in Figure 18, as a function of the number of iterations of the . The graph is an average over 100 independent runs, with the neural-network controller parameters initialized randomly in the range at the start of each run. In this case no run failed to converge to near-optimal performance. From the figure we can see that the puck’s performance is nearly optimal after about 40 million total iterations of the puck world. Although this figure may seem rather high, to put it in some perspective note that a random neural-network controller takes about 10,000 iterations to reach the northern plateau from a standing start at the base of the valley. Thus, 40 million iterations is equivalent to only about 4,000 trips to the top for a random controller.
Note that the puck converges to a final average performance around 75, which indicates it is spending at least 75% of its time on the northern plateau. Observation of the puck’s final behaviour shows it behaves nearly optimally in terms of oscillating back and forth to get out of the valley.
5.5 Choosing and the Running Time of
One aspect of these experiments that required some measure of tuning is the choice of the parameter and running time used by . Although these were selected by trial and error, we have had some success recently with a scheme for automatically choosing these parameters as follows. Before any training begins, is run for a large number of iterations whilst simultaneously generating gradient estimates for a number of different choices of . This can be done from a single simulation simply by maintaining a separate eligibility trace for each value of . Since the bias reduces with increasing , the largest that gives a reasonably low-variance gradient estimate at the end of the long run is selected as a “reference” (the variance is estimated by comparing gradient estimates at reasonably well-separated intervals towards the end of the run). Furthermore, since the variance of the gradient estimate decreases as decreases, all gradient estimates for values of smaller than the reference will typically have smaller variance than that of the reference . Hence, we can reliably compare the directions for smaller ’s with the direction given by the reference , and choose the smallest whose corresponding direction is sufficiently close to the reference direction. We take “sufficiently close” to mean within –.
Note that this scheme only works if the original run is sufficiently long to get a low-variance direction estimate at the right value of . If the right value of is too large then any fixed bound on the run length can be made to fail, but this will be a problem for all algorithms that automatically choose .
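The selection scheme in the preceding paragraph can be sketched in Python as follows. Everything here is illustrative: `beta` is a hypothetical name for the discount parameter being tuned, the trace update assumes the usual discounted eligibility-trace form (an assumption about the estimator of [5], not taken from this section), and `max_angle` is left as a caller-supplied threshold in degrees.

```python
import math


def update_estimates(traces, estimates, score, reward, t):
    """One simulation step shared by every candidate beta.

    traces and estimates map beta -> list of floats. `score` is the
    gradient of the log-probability of the chosen control, and t is the
    0-based step index. ASSUMED update: z <- beta * z + score, followed
    by a running average of reward * z.
    """
    for beta in traces:
        z = [beta * zi + si for zi, si in zip(traces[beta], score)]
        traces[beta] = z
        estimates[beta] = [d + (reward * zi - d) / (t + 1)
                           for d, zi in zip(estimates[beta], z)]


def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def choose_beta(estimates, reference_beta, max_angle):
    """Smallest beta whose direction lies within max_angle of the reference."""
    ref = estimates[reference_beta]
    close = [b for b in estimates
             if b <= reference_beta
             and angle_degrees(estimates[b], ref) <= max_angle]
    return min(close)
```

Sharing one simulation across all candidates means the comparison costs little beyond a single long run, at the price of storing one trace vector per candidate.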
Once a suitable has been found, we can go back and find the point in the original long run where the direction estimate corresponding to that value of “settled down” (again, we measure the variance of the estimates by sampling at suitably large intervals, and choose a point where the variance falls below some chosen value). This time is then used as the running time for when estimating the gradient direction. Finally, the running time used in when bracketing the maximum in can also be automatically tuned by starting with an initial fixed running time that is a fraction of , and then continuing until the sign of the inner product of the estimates produced by with the search direction “settles down”. With this technique, the sign estimation time is usually considerably smaller than the gradient direction estimation time.
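The two "settling down" tests described above can be sketched as simple checks on stored snapshots. This is an illustration under assumptions: the function names are hypothetical, and cosine similarity between snapshots is used here as a stand-in for the variance measure mentioned in the text.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def settling_index(snapshots, min_cosine, window):
    """Index of the first snapshot whose next `window` successors all point
    in nearly the same direction (cosine >= min_cosine); None if none does.

    `snapshots` are gradient estimates sampled at well-separated intervals.
    """
    for i in range(len(snapshots) - window):
        if all(cosine(snapshots[i], snapshots[j]) >= min_cosine
               for j in range(i + 1, i + 1 + window)):
            return i
    return None


def sign_settled(inner_products, patience):
    """True once the last `patience` inner products of the estimate with the
    search direction all share the same non-zero sign."""
    recent = inner_products[-patience:]
    if len(recent) < patience:
        return False
    return all(p > 0 for p in recent) or all(p < 0 for p in recent)
```

Multiplying the returned index by the sampling interval gives a candidate running time for the gradient estimator; the sign test typically terminates much earlier, since only one bit of information is needed.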
Another useful heuristic is to reestimate and ’s running time whenever the parameters change by a large amount, since a large change in can lead to significant changes in the mixing time of the .
6 Conclusion
This paper showed how to use the performance gradient estimates generated by the algorithm [5] to optimize the average reward of parameterized s. We described both a traditional “online” stochastic gradient algorithm and an “offline” approach that relied on the use of , a robust line-search algorithm that uses gradient estimates, rather than value estimates, to bracket the maximum. The offline approach in particular was found to perform well on four quite distinct problems: optimizing a controller for a three-state , optimizing a neural-network controller for navigating a puck around a two-dimensional world, optimizing a controller for a call admission problem, and optimizing a switched neural-network controller in a variation of the classical mountain-car task. One reason for the superiority of the offline approach is that by searching for a local maximum at each step it makes much more aggressive use of the gradient information than does the online algorithm.
For the three-state and the call-admission problems we were able to provide graphic illustrations of how the bias and variance of the gradient estimates can be traded against one another by varying between (low variance, high bias) and (high variance, low bias).
Relatively little tuning was required to generate these results. In addition, the controllers operated on direct and simple representations of the state, in contrast to the more complex representations usually required of value-function-based approaches.
It is often the case that value-function methods converge much more rapidly than their policy-gradient counterparts. This is due to the fact that they enforce constraints on the value function. With this in mind an interesting avenue for further research is Actor-Critic algorithms [4, 1, 11, 14, 20] in which one attempts to combine the fast convergence of value functions with the theoretical guarantees of policy-gradient approaches.
Despite the success of the offline approach in the experiments described here, the online algorithm has advantages in other settings. In particular, when it is applied to multi-agent reinforcement learning, both gradient computations and parameter updates can be performed for distinct agents without any communication beyond the global distribution of the reward signal. This idea has led to a parameter optimization procedure for spiking neural networks, and some successful preliminary results with network routing [3, 23].
Acknowledgements
This work was supported by the Australian Research Council, and benefited from the comments of several anonymous referees. Most of this research was performed while the first and second authors were with the Research School of Information Sciences and Engineering, Australian National University.
References
 [1] (1999) Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11.
 [2] (2000) Estimation and approximation bounds for gradient-based reinforcement learning. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pp. 133–141.
 [3] (1999, November) Hebbian synaptic modifications in spiking neurons that learn. Technical report, Research School of Information Sciences and Engineering, Australian National University. Note: http://csl.anu.edu.au/bartlett/papers/BartlettBaxterNov99.ps.gz
 [4] (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13, pp. 834–846.
 [5] (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research. Note: To appear.
 [6] (1996) Neuro-dynamic programming. Athena Scientific.
 [7] (1997) Perturbation Realization, Potentials, and Sensitivity Analysis of Markov Processes. IEEE Transactions on Automatic Control 42, pp. 1382–1393.
 [8] (1998) Algorithms for Sensitivity Analysis of Markov Chains Through Potentials and Perturbation Realization. IEEE Transactions on Control Systems Technology 6, pp. 482–492.
 [9] (1994) Smooth Perturbation Derivative Estimation for Markov Chains. Operations Research Letters 15, pp. 241–251.
 [10] (1986) Stochastic approximation for Monte-Carlo optimization. In Proceedings of the 1986 Winter Simulation Conference, pp. 356–365.
 [11] (1998) An analysis of actor/critic algorithms using eligibility traces: reinforcement learning with imperfect value functions. In Fifteenth International Conference on Machine Learning, pp. 278–286.
 [12] (1997) Reinforcement learning in POMDPs with function approximation. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), D. H. Fisher (Ed.), pp. 152–160.
 [13] (1995) Reinforcement learning by stochastic hill climbing on discounted reward. In Proceedings of the Twelfth International Conference on Machine Learning (ICML’95), pp. 295–303.
 [14] (2000) Actor-Critic Algorithms. In Neural Information Processing Systems 1999.
 [15] (1998) Simulation-Based Optimization of Markov Reward Processes. Technical report, MIT.
 [16] (1998) Simulation-Based Methods for Markov Decision Processes. Ph.D. Thesis, Laboratory for Information and Decision Systems, MIT.
 [17] (1998) Modern simulation and modeling. Wiley, New York.
 [18] (1994) Learning Without State-Estimation in Partially Observable Markovian Decision Processes. In Proceedings of the Eleventh International Conference on Machine Learning.
 [19] (1995) Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, G. Tesauro, D.S. Touretzky, and T.K. Leen (Eds.), Vol. 7.
 [20] (2000) Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Neural Information Processing Systems 1999.
 [21] (1998) Reinforcement Learning: An Introduction. MIT Press, Cambridge MA. Note: ISBN 0262193981.
 [22] (1988) Learning to Predict by the Method of Temporal Differences. Machine Learning 3, pp. 9–44.
 [23] (2001, January) A multi-agent, policy-gradient approach to network routing. Technical report, Australian National University.
 [24] (1999, September) Reinforcement learning from state and temporal differences. Technical report, Australian National University.
 [25] (1992) Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 8, pp. 229–256.