# Policy Gradient Methods for Off-policy Control

###### Abstract

Off-policy learning refers to the problem of learning the value function of a way of behaving, or policy, while following a different policy. Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ [13], converge even when using function approximation and incremental updates. However, they have been developed for the case of a fixed behavior policy. In control problems, one would like to adapt the behavior policy over time to become more greedy with respect to the existing value function. In this paper, we present the first gradient-based learning algorithms for this problem, which rely on the framework of policy gradient in order to modify the behavior policy. We present derivations of the algorithms, a convergence theorem, and empirical evidence showing that they compare favorably to existing approaches.

## 1 Introduction

One fundamental concept in Reinforcement Learning (RL) is Temporal Difference (TD) learning, introduced by Sutton [9]. In TD learning, methods such as TD(0) are used for policy evaluation, where one tries to learn the value of a given state under a fixed policy. The extension to the control case is called Q-learning, where the value function is defined on state-action pairs. The control policy is then computed from these action values. One of the first Q-learning algorithms was proposed by Watkins and Dayan [14]; it simultaneously searches for and evaluates a policy by varying its action-value estimates. Watkins and Dayan’s Q-learning algorithm is an off-policy algorithm because the policy that is searched and evaluated is strictly greedy with respect to the current action values, whereas for control the agent uses an $\epsilon$-greedy policy. This facilitates exploration, as the agent is allowed to make a random move with probability $\epsilon$ to obtain representative samples and to aid the search for a policy that generates high rewards.
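To make the off-policy distinction concrete, the following is a minimal tabular sketch in Python, not taken from the paper: the behavior policy is $\epsilon$-greedy, while the update bootstraps from the greedy action. The function names, the nested-list table `q`, and the step size `alpha` are illustrative assumptions.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy update: the target uses the greedy value of s_next,
    regardless of which action the behavior policy takes there."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q

# Hypothetical two-state, two-action table used only for illustration.
q = [[0.0, 0.0], [1.0, 2.0]]
q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
action = epsilon_greedy(q[0], epsilon=0.1, rng=random.Random(0))
```

The action actually taken comes from `epsilon_greedy`, but the learned values track the greedy policy, which is what makes the method off-policy.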

Recently, gradient-based off-policy learning algorithms such as GTD [11] and TDC [13] were introduced, which are proven to be convergent under off-policy learning with linear value function approximation. The extension to Q-learning, GQ($\lambda$) [5], is also convergent under off-policy learning, but only if the control policy is fixed. For the control case this is not sufficient, as the agent has to explore its environment to be able to search for and find a good policy. The reason why convergence cannot be guaranteed is that a non-stationary policy causes drift in the distribution from which transition samples are generated. While this drift is necessary for the agent to find a good policy, it can also cause oscillations in the value function estimates and prevent the algorithm from converging. SARSA also suffers from this problem and is only guaranteed to converge to a sub-space of policies [4, 3]. Within this sub-space the value function estimates may oscillate indefinitely.

In this paper we present a new gradient-based TD-learning algorithm that is similar to GQ but also incorporates policy gradients to correct for the drift in the distribution from which transitions are sampled. Similar to the policy gradient framework [12], we directly analyze the interaction between the policy gradient and the distribution from which transitions are sampled. As a result, our algorithm iterates over the sequence of Markov chains induced by the variation in the value function estimates and therefore in the policies. This makes our algorithm similar to policy iteration methods such as [8]. However, rather than evaluating and then improving the policy in consecutive steps, our method simultaneously improves and evaluates the current policy.

## 2 Q-learning with Policy Gradients

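We consider an MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a finite state space and $\mathcal{A}$ is a finite action space. The transition function $P$ is stochastic, $R$ is the reward function, and $\gamma$ is the discount factor. As in [13, 5] we consider the linear function approximation case with a basis function $\phi$ and parameter vector $\theta$, and define the state-action value function as

$$Q_\theta(s, a) = \theta^\top \phi(s, a). \tag{1}$$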
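The linear parameterisation of the action values can be sketched in a few lines of Python. This is only an illustration: the one-hot basis below is a hypothetical choice (it recovers the tabular case), and the names `theta`, `one_hot_features`, and `q_value` are assumptions rather than the paper's notation.

```python
import numpy as np

def one_hot_features(s, a, n_states, n_actions):
    """Hypothetical basis function: one indicator per state-action pair."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def q_value(theta, phi_sa):
    """Linear action value: inner product of the parameters and the features."""
    return float(theta @ phi_sa)

# With one-hot features, theta simply stores one value per state-action pair.
theta = np.arange(6.0)
value = q_value(theta, one_hot_features(s=1, a=0, n_states=3, n_actions=2))
```

Any other basis with fewer features than state-action pairs gives a genuinely approximate value function, which is the setting considered here.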

Let $Q_\theta$ be the vector of all state-action values and similarly $R$ be the vector of all rewards. We assume that the MDP is ergodic and that a limit distribution exists. Letting $D$ be a diagonal matrix with the limit distribution on its diagonal, we define the norm $\|x\|_D^2 = x^\top D x$. The Mean Squared Projected Bellman Error (MSPBE) introduced by [13] is

$$\mathrm{MSPBE}(\theta) = \left\| Q_\theta - \Pi T Q_\theta \right\|_D^2, \tag{2}$$

where $\Pi$ is the projection matrix and the Bellman operator $T$ applied to the action value function is defined as

(3) |
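For a small MDP the MSPBE can be evaluated exactly. The sketch below is an illustration under stated assumptions, not the paper's code: it uses the standard matrix form from [13], in which the objective equals $b^\top C^{-1} b$ with $b = \Phi^\top D (TQ - Q)$ and $C = \Phi^\top D \Phi$, and it takes the policy-induced state-action transition matrix `P_pi` as given. All argument names are hypothetical.

```python
import numpy as np

def mspbe(theta, Phi, d, R, P_pi, gamma):
    """Exact MSPBE of Q = Phi @ theta.

    Phi:  (m, n) feature matrix, one row per state-action pair
    d:    (m,) weighting distribution over state-action pairs (all positive)
    R:    (m,) expected rewards
    P_pi: (m, m) state-action transition matrix under the target policy
    """
    Q = Phi @ theta
    bellman_error = R + gamma * (P_pi @ Q) - Q      # TQ - Q
    b = Phi.T @ (d * bellman_error)                 # Phi^T D (TQ - Q)
    C = Phi.T @ (d[:, None] * Phi)                  # Phi^T D Phi
    return float(b @ np.linalg.solve(C, b))         # b^T C^{-1} b

# Hypothetical two-pair chain with tabular features.
Phi = np.eye(2)
d = np.array([0.5, 0.5])
R = np.array([1.0, 0.0])
P_pi = np.array([[0.0, 1.0], [1.0, 0.0]])
error_at_zero = mspbe(np.zeros(2), Phi, d, R, P_pi, gamma=0.5)
```

The sketch assumes linearly independent features so that `C` is invertible. With tabular features the projection is the identity, so the MSPBE reduces to the weighted squared Bellman error.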

Our approach differs from TDC and GQ in that we view the Bellman operator and the stationary distribution over state-action pairs as parametric in the value function parameter $\theta$. For the stationary distribution we assume that

(4) |

This changes the way we derive the gradient of the MSPBE, as we assume additional dependencies on the parameter vector through the action selection probabilities.

### 2.1 Gradient Derivation

To obtain the gradient of the MSPBE objective, [13] have shown

To simplify the gradient calculation we assume and compute the partial derivative with respect to , which we denote with :

For the derivative of the inverse feature covariance we have

Plugging this back into the gradient above we obtain

For the partial derivative on the Bellman error we have

where is the th column of . Plugging this back into the MSPBE gradient we have

(5) |

### 2.2 Sampling the Gradient

To derive a stochastic gradient descent algorithm we rewrite (5) as expectations. Let

where with and the TD-error being

with and . This simplifies the partial derivative to

(6) |

For the first matrix term we have

where the expectation is over , , and for the TD-error. For the second matrix term we denote the th component of as . Expanding this term we have

where . For the third term we obtain

Assembling the MSPBE gradient then gives

(7) |

To derive an iterative algorithm we follow [11] and derive two timescale update rules to learn the parameter vector and approximate the auxiliary weight vector with

(8) |

Sampling the gradient above then gives the update rule

(9) |

Note that this update rule contains the standard TDC/GQ term plus correction terms that are in the direction of the policy gradient.
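For reference, the standard TDC term mentioned above can be sketched as follows. This is the well-known two-timescale TDC update from [13] written for action-value features; it is not the PGQ update (9), and the policy-gradient correction terms are deliberately left out. The function name, the argument names, and the use of a single `phi_next` vector (which may be an expected next feature vector under the target policy) are assumptions made for illustration.

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, gamma, alpha, beta):
    """One TDC step with linear function approximation.

    theta: value parameters (slow timescale, step size alpha)
    w:     auxiliary weights (fast timescale, step size beta)
    """
    # TD error under the current parameters.
    delta = r + gamma * (theta @ phi_next) - theta @ phi
    # Main update: TD term plus the gradient-correction term.
    theta_new = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
    # Auxiliary update: w tracks the projection of the TD error onto the features.
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new, delta

# Hypothetical single transition between two one-hot feature vectors.
theta, w, delta = tdc_update(
    np.zeros(2), np.zeros(2),
    phi=np.array([1.0, 0.0]), r=1.0, phi_next=np.array([0.0, 1.0]),
    gamma=0.9, alpha=0.1, beta=0.5,
)
```

Both updates cost a constant number of inner products per step; the auxiliary step size `beta` is typically chosen larger than `alpha` so that `w` adapts on the faster timescale.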

Algorithm 1 shows the resulting algorithm, which we call PGQ for Policy-Gradient Q-learning. This algorithm uses linear function approximation and updates are done in , where is the number of basis functions used. After making a transition, we do not want to sample the next action using the old parameter estimate but rather using the updated estimate. To do this we have to calculate the expected values and analytically over the possible next actions.

## 3 Baird Counter Example

We have tested our method on the “star” Baird counter example [1] and compared it with Q-learning and GQ [5]. For this 7-state version, divergence of Q-learning is monotonic and GQ is known to converge [7]. We initialize the parameter vector corresponding to the action that transitions to the 7th centre state with and the remaining parameter entries with 1. The discount factor is set to . In our experiments we do not assume a hard-coded policy that ensures uniform exploration over state-action pairs, but instead look at the control case in which actions are selected using a Boltzmann policy, where the probability of selecting a specific action is

(10) |
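A Boltzmann policy over linear action values, together with the gradient of its action probabilities with respect to the parameter vector, can be sketched as below. This is a hedged illustration: the temperature `tau` and the function names are assumptions, and the exact form used in the experiments may differ. The gradient follows from differentiating the softmax, giving $\partial \pi(a|s) / \partial \theta = \pi(a|s)\,(\phi(s,a) - \sum_b \pi(b|s)\,\phi(s,b)) / \tau$.

```python
import numpy as np

def boltzmann_probs(theta, Phi_s, tau=1.0):
    """Action probabilities in one state.

    Phi_s: (k, n) matrix whose rows are the features of the k actions.
    tau:   assumed temperature; larger values give more uniform policies.
    """
    prefs = (Phi_s @ theta) / tau
    prefs = prefs - prefs.max()          # shift for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

def boltzmann_prob_grad(theta, Phi_s, tau=1.0):
    """Jacobian of shape (k, n): row a is d pi(a|s) / d theta."""
    pi = boltzmann_probs(theta, Phi_s, tau)
    mean_phi = pi @ Phi_s                # expected feature vector under pi
    return pi[:, None] * (Phi_s - mean_phi) / tau

# Hypothetical state with three actions and two features.
theta = np.array([0.5, -1.0])
Phi_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
probs = boltzmann_probs(theta, Phi_s, tau=0.7)
jacobian = boltzmann_prob_grad(theta, Phi_s, tau=0.7)
```

Because the probabilities always sum to one, the rows of the Jacobian sum to the zero vector, which is a convenient sanity check.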

Updating was done either by sampling transitions according to hard-coded distributions or by simulating trajectories through the MDP. For the sampled version we sampled the state according to a uniform distribution over all 7 states, the action was sampled with probability and the next state was sampled according to the transition model. Figure 1 shows the MSPBE for the sampled update experiment. Q-learning diverges monotonically, while both GQ and PGQ converge to zero MSPBE.

For the trajectory based experiments we have sampled one of the seven start states uniformly and then executed transitions through the MDP. While transitioning or updating the parameter vector we have measured the MSPBE using a uniform stationary distribution over states. Figure 2 shows the MSPBE and the Mean Squared TD-error (MSTDE) defined in [2] of the parameter vector at each step of the simulation.

## 4 Conclusion

We have presented a new gradient based TD-learning algorithm that incorporates policy gradients. The resulting algorithm is similar to GQ/TDC but also has a correction term in the direction of the gradient of the target policy. Our analysis assumes a dependency of the Markov chain on the parameter vector through the target policy. This allows our algorithm to correctly step over a sequence of different Markov chains and account for the drift in the distribution from which transition data is sampled due to changes in the parameter vector.

One direction for future research is to extend this method to the non-linear function approximation case. Maei et al. [6] present the first gradient-based TD algorithm that converges in this case. One may be able to draw on their results for our work. For the derivation of our algorithm we only assumed the Bellman operator to be parametric in the parameter estimate, which led to the additional policy gradient terms. No further assumptions were made on the Bellman operator and the value function terms in the MSPBE objective, so in the non-linear function approximation case one would obtain gradients of the value function here. However, one would have to analyze the projection operator in the MSPBE objective differently.

## References

- Baird [1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
- Dann et al. [2014] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809–883, 2014. URL http://jmlr.org/papers/v15/dann14a.html.
- Gordon [1996] Geoffrey J. Gordon. Chattering in SARSA(lambda): a CMU Learning Lab internal report. Technical report, 1996.
- Gordon [2001] Geoffrey J. Gordon. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems, pages 1040–1046. The MIT Press, 2001.
- Maei [2010a] H. R. Maei and R. S. Sutton. GQ(lambda): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, 2010a.
- Maei [2009] H. R. Maei, Cs. Szepesvari, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems 22, pages 1204–1212, 2009.
- Maei [2010b] H. R. Maei, Cs. Szepesvari, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, 2010b.
- Perkins and Precup [2003] Theodore J. Perkins and Doina Precup. A convergent form of approximate policy iteration. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1627–1634. MIT Press, 2003. URL http://papers.nips.cc/paper/2143-a-convergent-form-of-approximate-policy-iteration.pdf.
- Sutton [1988] Richard S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, August 1988. ISSN 0885-6125. doi: 10.1023/A:1022633531479.
- Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- [11] Richard S. Sutton, Csaba Szepesvári, and Hamid Reza Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21 (to appear). MIT Press.
- Sutton et al. [2000] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
- Sutton et al. [2009] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, 2009.
- Watkins and Dayan [1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. ISSN 0885-6125. doi: 10.1007/BF00992698. URL http://dx.doi.org/10.1007/BF00992698.