A Convergent Off-Policy Temporal Difference Algorithm
Abstract
Learning the value function of a given policy (the target policy) from data samples obtained from a different policy (the behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation can diverge. In this work, we propose a convergent online off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in a way that ensures convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm.
I Introduction
The two important problems in Reinforcement Learning (RL) are Prediction and Control [1]. The prediction problem deals with computing the value function of a given policy. In a discounted reward setting, the value function refers to the total expected discounted reward obtained by following the given policy. The control problem refers to computing the optimal policy, i.e., the policy that maximizes the total expected discounted reward. When the model information (probability transition matrix and single-stage reward function) is fully known, techniques like value iteration and policy iteration are used to solve the control problem. Policy iteration is a two-step iterative algorithm where the task of prediction is performed in the first step for a given policy, followed by the policy improvement task in the second step. However, in most practical scenarios, the model information is not known and only (state, action, reward, next-state) samples are available. Under such a model-free setting, popular RL algorithms are Temporal Difference (TD) learning for prediction, and Q-Learning and Actor-Critic algorithms for control [2]. Actor-Critic algorithms can be seen as model-free analogs of the policy iteration algorithm and involve a model-free prediction step. Therefore, it is clear that model-free prediction is an important problem for which optimal and convergent solutions are desired.
TD algorithms under the tabular approach (where there is no approximation of the value function) are a very popular class of algorithms for computing the exact value function of a given policy (henceforth referred to as the target policy) from samples. In many real-life problems, though, we encounter situations where the number of states is large or even infinite. In such cases, it is not possible to use tabular approaches, and one has to resort to approximation-based methods. TD algorithms are shown to be stable and convergent under linear function approximation, albeit in the on-policy setting. On-policy refers to the setting where state and action samples are obtained using the target policy itself. In practical scenarios, such samples are not always available to the practitioner. For example, in games, a practitioner may wish to evaluate a (target) strategy, while the data available to her comes from a player following a different strategy. The question that arises in this scenario is whether she can make use of this data and still evaluate the target strategy. Such problems are studied under the setting of off-policy prediction, where the goal is to evaluate the value function of the target policy from data generated by a different policy (commonly referred to as the behavior policy). The recent empirical success of the Deep Q-Learning algorithm (a model-free control algorithm) motivates us to understand its convergence behavior, which is a very difficult problem. It has been noted (Section 11.3 of [2]) that convergence and stability issues arise when we combine three components: function approximation, bootstrapping (TD algorithms) and off-policy learning, which they refer to as the “deadly triad”.
In our work, we propose an online, stable off-policy TD algorithm for the prediction problem under linear function approximation. The idea is to penalize the parameters of the TD update to mitigate the divergence problem. We note here that the recent work [3] provides a comprehensive and excellent survey of algorithms for off-policy prediction problems and performs a comparative study. However, for the sake of completeness, we now discuss some of the important and relevant works on the off-policy prediction problem. In [4], Least-Squares TD (LSTD) algorithms with linear function approximation have been proposed that are shown to be convergent under both on-policy and off-policy settings. However, the per-step complexity of LSTD algorithms is quadratic in the number of parameters. In [5], off-policy TD algorithms are proposed that make use of an importance sampling idea to convert the expected value of the total discounted reward under the behavior policy to the expected value under the target policy. However, the variance of such algorithms is very high and in some cases tends to be infinite. In [6], the Gradient TD (GTD) algorithm has been proposed, which is stable under off-policy learning and linear approximation and has linear (in the number of parameters) complexity. Since then, there have been many improvements on the GTD algorithm under various settings like prediction, control, and nonlinear function approximation [7, 8, 9, 10]. The idea of adding the penalty in the form of a regularization term has been considered in [11], where the Regularized Off-policy TD (RO-TD) algorithm has been proposed based on GTD algorithms and convex-concave saddle-point formulations. Emphatic TD (ETD) algorithms [12, 13, 14, 15] are another popular class of off-policy TD algorithms that achieve stability by emphasizing or de-emphasizing the updates of the algorithm. These updates also have linear-time complexity.
Moreover, these algorithms learn only one set of parameters, unlike GTD algorithms, which are two-timescale stochastic approximation algorithms that learn two sets of parameters. Recently, in [16, 17], a covariate-shift off-policy TD (COP-TD) algorithm has been proposed that includes a covariate-shift correction term in the TD update. This shift term is learned along with the parameter of the algorithm.
Our algorithm, like the Emphatic TD algorithm, trains only one set of parameters and, like the ETD and GTD algorithms, has per-update complexity that is linear in the number of parameters. The contributions of our paper are as follows:

We derive an online off-policy TD learning algorithm with linear function approximation. Our algorithm has per-iteration computational complexity that is linear in the number of parameters.

We show the empirical performance of our algorithm on standard benchmark RL environments.
The rest of the paper is organized as follows. In Section II, we introduce the background and preliminaries. We propose our algorithm in Section III. Sections IV and V describe the analysis of our algorithm. Section VI presents the results of our numerical experiments. Finally, Section VII presents concluding remarks and future research directions.
II Background and Preliminaries
We consider a Markov Decision Process (MDP) of the form (S, A, P, R, γ), where S denotes the state space, A is the set of actions, P is a probability transition matrix where P(s' | s, a) denotes the probability of the system transitioning to state s' when action a is chosen in state s, R is the single-stage reward function where R(s, a) denotes the reward obtained by taking action a in state s, and γ ∈ (0, 1) denotes the discount factor. Let π : S → Δ(A) be the target policy, where Δ(A) denotes the set of probability distributions over actions. The objective of the MDP prediction problem is to estimate the value function V^π of the target policy π, where the value function of a state s, denoted by V^π(s), is given by:
V^π(s) = E_π [ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s ],   (1)
where the state-action trajectory (s_0, a_0, s_1, a_1, …) is obtained by following the policy π and E_π denotes the expectation over such trajectories.
As the number of states of the MDP can be very large, we resort to approximation techniques to compute the value function. In our work, we consider the linear function approximation architecture where
V̂_θ(s) = θ^T φ(s),   (2)
where V̂_θ(s) denotes the approximate value function of state s (which we desire to be very close to the exact value function), φ(s) is a feature vector associated with state s, and θ is a weight vector. Note that the exact value function may not be representable in the form (2). Therefore, the objective is to estimate the weight vector θ so that the approximate value function given by (2) is as close as possible to the exact value function.
On-policy TD(0) [2] is a popular online algorithm for computing the weight vector θ. The update equation is given by:
θ_{t+1} = θ_t + α_t ( r_t + γ θ_t^T φ(s_{t+1}) - θ_t^T φ(s_t) ) φ(s_t),   (3)
where (s_t, r_t, s_{t+1}) are the state, reward and next-state samples obtained at time t, (α_t) is the step-size sequence, and θ_0 denotes the initial parameter vector.
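Update (3) is simple to implement. The following sketch runs it on a small assumed example (a deterministic two-state chain with reward 1 on every transition, identity features and γ = 0.9, so both states have true value 1/(1 - γ) = 10); the chain and all constants are illustrative choices, not taken from the paper.

```python
import numpy as np

def td0_step(theta, phi_s, r, phi_next, alpha, gamma):
    """One on-policy TD(0) update with linear function approximation, as in (3)."""
    td_error = r + gamma * theta @ phi_next - theta @ phi_s
    return theta + alpha * td_error * phi_s

# Assumed toy chain: states 0 and 1 alternate deterministically, reward 1 per step.
# True values: V(0) = V(1) = 1 / (1 - gamma) = 10.
features = np.eye(2)          # identity features make this the tabular case
theta = np.zeros(2)
gamma, alpha = 0.9, 0.1
s = 0
for _ in range(5000):
    s_next = 1 - s
    theta = td0_step(theta, features[s], 1.0, features[s_next], alpha, gamma)
    s = s_next
# theta approaches [10, 10]
```

Because the toy chain is deterministic, the iteration converges to the TD fixed point even with a constant step-size.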
The stability of the on-policy TD(0) algorithm is well established in the literature [12]. We outline the proof of convergence of this algorithm. Following the notation of [12], note that the update rule (3) can be rewritten as:
θ_{t+1} = θ_t + α_t ( b_t - A_t θ_t ),   (4)
where A_t = φ(s_t) ( φ(s_t) - γ φ(s_{t+1}) )^T and b_t = r_t φ(s_t).
It is shown in [2] that the algorithm with update rule (4) is stable if the matrix A given by:
A = Φ^T D_π ( I - γ P_π ) Φ   (5)
is positive definite. In (5), Φ is the matrix with the feature vector φ(s)^T in row s, D_π is the diagonal matrix whose diagonal entries constitute the stationary distribution over states obtained under policy π, and P_π is the matrix with entries P_π(s, s') = Σ_a π(a | s) P(s' | s, a). For the on-policy TD(0) algorithm, A is shown to be positive definite [12], proving the stability of the algorithm.
In the off-policy prediction problem, the data samples are obtained from a behavior policy μ instead of the target policy π. In this case, the off-policy TD(0) update [12] is given by:
θ_{t+1} = θ_t + α_t ρ_t ( r_t + γ θ_t^T φ(s_{t+1}) - θ_t^T φ(s_t) ) φ(s_t),   (6)
where r_t is the reward obtained by taking action a_t in state s_t, and ρ_t is the importance sampling ratio given by ρ_t = π(a_t | s_t) / μ(a_t | s_t). The corresponding matrix for this algorithm is given by:
A_μ = Φ^T D_μ ( I - γ P_π ) Φ,   (7)
where D_μ is a diagonal matrix whose diagonal entries constitute the stationary distribution over states obtained under the behavior policy μ.
The matrix A_μ defined in (7) need not be positive definite [12]. Therefore, stability and convergence of off-policy TD(0) are not guaranteed.
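This positive-definiteness criterion is easy to test numerically. The sketch below builds the matrix of (7), Phi^T D_mu (I - gamma * P_pi) Phi from [12], for the classic two-state construction in which a single feature takes values 1 and 2, the target policy always moves to the second state, and the behavior policy's stationary distribution is uniform; this instance is an illustration, not one of the paper's experiments.

```python
import numpy as np

gamma = 0.9
Phi = np.array([[1.0], [2.0]])           # one feature: approximate values theta, 2*theta
P_target = np.array([[0.0, 1.0],         # target policy always moves to state 2
                     [0.0, 1.0]])
D_mu = np.diag([0.5, 0.5])               # uniform stationary distribution of behavior

A = Phi.T @ D_mu @ (np.eye(2) - gamma * P_target) @ Phi
# A is 1x1 here: A = 0.5*(1 - 2*gamma) + 1.0*(2 - 2*gamma) = 2.5 - 3*gamma = -0.2
sym_eigs = np.linalg.eigvalsh((A + A.T) / 2)
positive_definite = bool(np.all(sym_eigs > 0))
# positive_definite is False for gamma = 0.9: off-policy TD(0) is unstable here
```

The same check with the on-policy stationary distribution in place of D_mu recovers the positive-definite case of (5).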
The off-policy TD(0) algorithm, if it converges, may perform comparably to some of the convergent off-policy algorithms in the literature. For example, in Figure 5 of [3], it has been shown that the performance of off-policy TD(0) is comparable to that of the GTD(0) algorithm. However, as the algorithm is not stable, off-policy TD(0) can diverge. In this paper, we propose a simple and stable off-policy TD algorithm. In the next section, we propose our algorithm, and in Section IV, we provide its convergence analysis.
III The Proposed Algorithm
The inputs to our algorithm are the target policy, whose value function we want to estimate, and the behavior policy, using which the samples are generated. Also provided as an input to our algorithm is the regularization parameter λ. The algorithm works as follows. At each time step t, we obtain a sample, using which the importance sampling ratio is computed as shown in Step 3. We then compute our modified temporal difference term as shown in Step 4. Finally, the parameters of the algorithm are updated as shown in Step 5.
Remark 1.
It is clear from Algorithm 1 that the per-step complexity is O(d), where d is the number of parameters.
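The extraction above does not reproduce Steps 3-5 verbatim, so the following is only a plausible sketch of a penalized off-policy TD(0) step: it applies the importance-weighted update (6) and subtracts an assumed penalty term lam * theta. Both the penalty form and the parameter name lam are assumptions for illustration, not necessarily the authors' exact rule; the point is that every operation is linear in the number of parameters.

```python
import numpy as np

def penalized_offpolicy_td0_step(theta, phi_s, rho, r, phi_next, alpha, gamma, lam):
    """One illustrative penalized off-policy TD(0) step.

    rho -- importance sampling ratio pi(a|s) / mu(a|s), computed beforehand.
    The penalty -lam * theta is an ASSUMED form of the paper's perturbation.
    Every operation below is O(d) in the number of parameters d.
    """
    td_error = r + gamma * theta @ phi_next - theta @ phi_s
    return theta + alpha * (rho * td_error * phi_s - lam * theta)

theta = penalized_offpolicy_td0_step(
    theta=np.array([1.0]), phi_s=np.array([1.0]), rho=2.0, r=0.0,
    phi_next=np.array([2.0]), alpha=0.1, gamma=0.9, lam=0.5)
# td_error = 0 + 0.9*2 - 1 = 0.8; step = 0.1*(2*0.8*1 - 0.5*1) = 0.11 -> theta = [1.11]
```

Whatever the exact penalty form, a shrinkage term of this kind leaves the update a sum of a few d-dimensional vector operations, which is where the O(d) complexity comes from.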
Remark 2.
In the next section, we provide the convergence analysis of our proposed algorithm.
IV Convergence Analysis
The update rule of Algorithm 1 can be rewritten as follows.
where A_t and b_t are given by
(8)  
(9) 
We state and invoke Theorem 2 of [18] (see also Theorem 17, p. 239 of [19]) to show the convergence of our algorithm.
Theorem 1.
Consider an iterative algorithm of the form θ_{t+1} = θ_t + α_t ( b_t - A_t θ_t ), with b_t = b(X_t) and A_t = A(X_t) driven by a process X_t,
where

the step-size sequence (α_t) satisfies Σ_t α_t = ∞ and Σ_t α_t² < ∞,

X_t is a Markov process with a unique invariant distribution.

A = E[A_t] and b = E[b_t]. Here E is the expectation with respect to the stationary distribution of the Markov chain induced by the behavior policy.

The matrix A is positive definite.

There exist positive constants and a positive real-valued function of the states of the Markov chain satisfying the growth and boundedness conditions of Theorem 2 of [18], and

For any q, there exists a constant such that the corresponding conditional-moment bounds of Theorem 2 of [18] hold for all states and all t,
Under these assumptions, i.e., 1-6 above, it is known that θ_t converges to the solution θ* of A θ* = b.
To begin, we define the process X_t as follows. Let X_t = (s_t, a_t, s_{t+1}). Observe that X_t is a Markov chain. In particular, (A_t, b_t) is a deterministic function of X_t, and the distribution of X_{t+1} depends only on X_t. Also note that, in our algorithm, A_t and b_t are given by equations (8) and (9), respectively.
Assumptions 1 and 2 are fairly general. Assumptions 5 and 6 can be shown to hold for finite state-action MDPs, with the function appearing there taken to be constant; the arguments are similar to those in Theorem 1 of [18]. Therefore, the most important assumption to verify is that the matrix A is positive definite. In this section, we prove that A is positive definite, thereby proving the convergence of our proposed algorithm.
We begin by proving some important lemmas that are used in our main theorem.
Lemma 1.
Let Φ be the matrix whose row s is given by φ(s)^T, the feature vector of state s, and let r be the vector whose component s is the corresponding expected single-stage reward. Let E be the expectation with respect to the stationary distribution of the Markov chain realized by the behavior policy. Then A and b are given by
where D_μ is a diagonal matrix whose diagonal element for state s is its stationary probability under the behavior policy.
Proof.
Similarly
∎
Definition 1.
A matrix M is positive definite if, for every vector x ≠ 0, x^T M x > 0.
Lemma 2.
Given a square matrix M, M is positive definite if and only if the symmetric matrix (M + M^T)/2 is positive definite.
Proof.
For x ≠ 0, observe that x^T M x = x^T M^T x, since (x^T M x)^T = x^T M^T x and both are scalars. Therefore x^T M x = (1/2) x^T ( M + M^T ) x. Hence M is positive definite if and only if (M + M^T)/2 is positive definite. ∎
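Lemma 2 can be sanity-checked numerically: the quadratic forms of M and of its symmetric part coincide, so definiteness can be decided from the eigenvalues of (M + M^T)/2. A small sketch with an assumed example matrix:

```python
import numpy as np

def is_positive_definite(M):
    """x^T M x > 0 for all x != 0  iff  all eigenvalues of (M + M^T)/2 are positive."""
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) > 0))

# A nonsymmetric matrix whose symmetric part [[2, -1], [-1, 2]] has eigenvalues 1 and 3:
M = np.array([[2.0, -3.0],
              [1.0,  2.0]])
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.standard_normal(2)
    # the quadratic forms of M and of its symmetric part agree
    assert np.isclose(x @ M @ x, x @ ((M + M.T) / 2) @ x)
# is_positive_definite(M) -> True; is_positive_definite(-M) -> False
```

This is exactly the reduction used in the proof of Theorem 2: it suffices to study the symmetric part, whose eigenvalues are real.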
Theorem 2.
Suppose where is a diagonal matrix with positive diagonal entries, is a Markov matrix and and are positive constants. Then is positive definite.
Proof.
Consider the symmetric matrix (A + A^T)/2. From Lemma 2, it is enough to show that this matrix is positive definite. Since it is symmetric, it is diagonalizable; therefore, it is enough to show that its eigenvalues are positive. By the Gershgorin circle theorem (see [20]), every eigenvalue lies in some disc centered at a diagonal entry, with radius equal to the sum of the absolute values of the off-diagonal entries in that row.
The diagonal entries of this matrix are positive, and the hypothesis on λ guarantees that each diagonal entry strictly exceeds the corresponding off-diagonal row sum, so every Gershgorin disc lies strictly to the right of zero. Hence every eigenvalue of the symmetric matrix is positive, i.e., it is positive definite, and therefore A is positive definite. In particular, given the behavior policy μ and the target policy π, there exists a value of λ for which A is positive definite. ∎
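The Gershgorin argument used above can be illustrated numerically: for a symmetric matrix, every eigenvalue is at least min over rows of (diagonal entry minus off-diagonal row sum), so strict diagonal dominance with a positive diagonal forces positive definiteness. The matrix below is an assumed example, not one arising from the paper:

```python
import numpy as np

def gershgorin_lower_bound(M):
    """min_i ( M[i, i] - sum_{j != i} |M[i, j]| ): by the Gershgorin circle
    theorem, a lower bound on every eigenvalue of a symmetric matrix M."""
    radii = np.sum(np.abs(M), axis=1) - np.abs(np.diag(M))
    return float(np.min(np.diag(M) - radii))

# Assumed symmetric, strictly diagonally dominant matrix with positive diagonal:
M = np.array([[ 4.0, -1.0,  0.5],
              [-1.0,  5.0, -2.0],
              [ 0.5, -2.0,  6.0]])
lb = gershgorin_lower_bound(M)       # min(4 - 1.5, 5 - 3, 6 - 2.5) = 2.0
eigs = np.linalg.eigvalsh(M)
assert lb > 0 and np.all(eigs >= lb - 1e-9)   # all eigenvalues positive: M is PD
```

In the proof, the role of the penalty parameter λ is precisely to inflate the diagonal entries until this lower bound becomes positive.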
To describe the point of convergence of our algorithm, we consider, for a given policy and a given value of the parameter λ, the corresponding perturbed Bellman operator. We state and prove the following properties of this operator.
Lemma 3.
The perturbed Bellman operator is a contraction, and its iterates converge pointwise to its unique fixed point.
Proof.
From the definition of the operator, for any two value functions, the distance between their images under the operator is bounded by a factor strictly smaller than one times the distance between the functions. Hence the operator is a contraction. ∎
V About the Point of Convergence
The algorithm converges to the point θ* such that A θ* = b. Now, this condition can be rewritten as a projected fixed-point equation:
where Π is the projection operator that projects onto the subspace { Φ θ : θ } with respect to the weighted norm induced by the stationary distribution of the behavior policy. Hence we observe that, similar to online on-policy TD, our online off-policy TD is a projected stochastic fixed-point iteration with respect to the perturbed Bellman operator.
Remark 3.
Note that the bound derived for λ in Theorem 2 is a sufficient but not a necessary condition. If the value of λ is large, the algorithm converges, but to a poorly approximated solution. Therefore, in experiments, we select a value of λ that is large enough to ensure convergence and small enough to keep the approximation reasonable.
VI Experiments and Results
In this section, we describe the performance of our proposed algorithm on three tasks. We first perform experiments on two benchmark counterexamples for off-policy divergence. Finally, we perform an experiment on a 3-state MDP example and analyze various properties of our proposed algorithm. (The implementation code for our experiments is available at: https://github.com/raghudiddigi/OffPolicyConvergentAlgorithm .) The evaluation metric considered is the Root Mean Square Error (RMSE), defined as:
RMSE(θ) = [ Σ_s d_μ(s) ( V(s) - V̂_θ(s) )² ]^{1/2},   (10)
where θ is the parameter used to approximate the value function, d_μ is the stationary distribution associated with the behavior policy μ, V is the exact value function of the target policy, and V̂_θ is the approximate value function that is estimated. We perform independent runs and present the average RMSE for all three experiments. For comparison purposes, we also implement the Emphatic TD (ETD(0)) algorithm [12] and a gradient-family algorithm, linear TD with gradient correction (TDC) [7].
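The stationary-distribution-weighted RMSE of (10) is straightforward to compute; in the sketch below, the inputs are illustrative placeholders rather than the paper's experimental values.

```python
import numpy as np

def rmse(theta, Phi, d_mu, V_exact):
    """Root Mean Square Error of (10): the distance between the exact values and
    the linear approximation Phi @ theta, weighted by the behavior policy's
    stationary distribution d_mu."""
    err = V_exact - Phi @ theta
    return float(np.sqrt(np.sum(d_mu * err ** 2)))

# Illustrative placeholders (not the paper's settings):
Phi = np.array([[1.0], [2.0]])       # one feature per state
d_mu = np.array([0.5, 0.5])          # stationary distribution of behavior policy
V_exact = np.array([0.0, 0.0])       # exact value function of the target policy
print(rmse(np.array([1.0]), Phi, d_mu, V_exact))   # sqrt(0.5*1 + 0.5*4) ~ 1.5811
```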
First, we consider the “θ → 2θ” example ([22], Section 3 of [12]). In this example, there are two states, s_1 and s_2, and two actions, ’left’ and ’right’. The left action in state s_1 results in state s_1, while the right action results in state s_2. Similarly, the right action in state s_2 results in state s_2 and the left action results in state s_1. The target policy is to take ’right’ in both states, whereas the behavior policy is to take the ’left’ and ’right’ actions with equal probability in both states. The value function is linearly approximated with one feature; the feature of state s_1 is 1 and that of state s_2 is 2. The discount factor is taken to be . The update parameter is initialized to and the λ for our algorithm is taken to be . The step-size for the algorithms is held constant at . In Figure 2, we show the performance of the algorithms over iterations. We can see that standard off-policy TD(0) diverges, whereas the other three algorithms, including our proposed perturbed off-policy TD(0), converge to a point where the RMSE is zero.
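The divergence seen in Figure 2 can be reproduced in expectation with a few lines. On this construction, all rewards are zero and the expected off-policy TD(0) update reduces to theta <- theta + alpha * (3*gamma - 2.5) * theta, so for any gamma > 5/6 the iterates grow geometrically; the step-size and gamma below are illustrative choices, not necessarily the paper's settings.

```python
import numpy as np

gamma, alpha = 0.9, 0.01
Phi = np.array([[1.0], [2.0]])            # feature 1 for s_1, feature 2 for s_2
P_target = np.array([[0.0, 1.0],          # target policy: always take 'right'
                     [0.0, 1.0]])
D_mu = np.diag([0.5, 0.5])                # uniform behavior-policy occupancy

# Expected update direction of off-policy TD(0): b - A @ theta with b = 0 and
# A = Phi^T D_mu (I - gamma * P_target) Phi = [[2.5 - 3*gamma]] = [[-0.2]].
A = Phi.T @ D_mu @ (np.eye(2) - gamma * P_target) @ Phi
theta = np.array([1.0])
for _ in range(1000):
    theta = theta + alpha * (-A @ theta)  # each step multiplies theta by 1.002
# |theta| has grown by a factor of about 1.002**1000 ~ 7.37
```

Any modification that makes the effective A matrix positive definite (as the penalty in Theorem 2 does) flips the sign of this drift and stabilizes the iterates.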
Next, we consider the “7-star” example, first proposed in [23]. It is completely described in Figure 1 [21]. There are 7 states, represented as circles. The expression inside each circle represents the linear approximation of the value of that state. In Figure 1, π represents the target policy and μ represents the behavior policy. We run all the algorithms, i.e., standard off-policy TD(0), Emphatic off-policy TD(0), TDC and our algorithm, perturbed off-policy TD(0), for iterations. The step-size for the algorithms is set to . (We have run experiments with three other step-sizes and included them in our supplementary material, available at: https://github.com/raghudiddigi/OffPolicyConvergentAlgorithm/blob/master/Supplementary.pdf .) From Figure 3, we can see that our perturbed off-policy TD converges to the exact solution, while Emphatic TD(0) appears to oscillate. On the other hand, the TDC algorithm appears to converge slowly. Moreover, it is known that standard off-policy TD(0) diverges on this example, which can also be observed from Figure 3.
Finally, we construct an MDP as follows. There are 3 states, and 2 actions, ’left’ and ’right’, are possible in each state. The ’left’ action in states and leads to state , and the ’right’ action in states and leads to state . Finally, the ’left’ action in state leads to state . The single-stage reward on all transitions is taken to be and the discount factor is . The target policy is and the behavior policy is (where the first component represents the probability of taking ’left’ and the second component the probability of taking ’right’). The feature vectors of the three states are , respectively. The step-size for the algorithms is set to . We run all the algorithms for iterations. From Figure 4, we can see that perturbed off-policy TD(0) converges. For this experiment, the best possible RMSE is and our proposed algorithm achieves .
In the experimental setting above, the value of λ is set to . In Figure 5, we run our algorithm with two other values of λ, one smaller and one larger. We observe that, for the smaller value, convergence is not guaranteed, as the correction is not enough. On the other hand, for the larger value, convergence is ensured; however, the converged solution is not close to the best approximation, due to over-correction. Hence, it is to be noted that an optimal value of λ is desired to ensure convergence and a near-optimal solution at the same time (recall that a higher value of λ suffices to ensure convergence alone).
Remark 4.
It has to be noted that the objective of the experiments is to show that our proposed algorithm mitigates the divergence problem of the off-policy TD algorithm. Moreover, choosing a good value of λ ensures that the algorithm converges to a solution close to the optimal one. At this point, we do not make any claims about the quality of the converged solution compared to the Emphatic TD(0) and TDC algorithms. We have seen that our proposed algorithm performed better than Emphatic TD and TDC in the last two examples; however, further empirical analysis is needed to compare the quality of the converged solution with Emphatic TD(0), TDC, as well as other off-policy algorithms in the literature.
VII Conclusions and Future Work
In this work, we have proposed an off-policy TD algorithm that mitigates the divergence problem of the standard off-policy TD algorithm. Our proposed algorithm makes use of a penalty parameter to ensure the stability of the iterates. We have then proved that the addition of the penalty makes the relevant matrix positive definite, which in turn ensures convergence. Finally, we have empirically shown convergence on benchmark counterexamples for off-policy divergence.
As seen from the experiments, the choice of λ is critical for our algorithm. The lower bound that we have provided in our analysis is not tight, and coming up with a tight bound is an interesting future direction. In the future, we would also like to extend our algorithm to include eligibility traces and to study its applications to real-world problems.
References
 [1] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996, vol. 5.
 [2] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [3] S. Ghiassian, A. Patterson, M. White, R. S. Sutton, and A. White, “Online off-policy prediction,” arXiv preprint arXiv:1811.02597, 2018.
 [4] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal difference learning,” Machine Learning, vol. 22, no. 1–3, pp. 33–57, 1996.
 [5] D. Precup, R. S. Sutton, and S. Dasgupta, “Off-policy temporal-difference learning with function approximation,” in ICML, 2001, pp. 417–424.
 [6] R. S. Sutton, C. Szepesvári, and H. R. Maei, “A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation,” Advances in Neural Information Processing Systems, vol. 21, pp. 1609–1616, 2008.
 [7] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, “Fast gradient-descent methods for temporal-difference learning with linear function approximation,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 993–1000.
 [8] H. R. Maei, C. Szepesvári, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton, “Convergent temporal-difference learning with arbitrary smooth function approximation,” in Advances in Neural Information Processing Systems, 2009, pp. 1204–1212.
 [9] H. R. Maei and R. S. Sutton, “GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces,” in Third Conference on Artificial General Intelligence (AGI-2010). Atlantis Press, 2010.
 [10] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, “Toward off-policy learning control with function approximation,” in ICML, 2010, pp. 719–726.
 [11] B. Liu, S. Mahadevan, and J. Liu, “Regularized off-policy TD-learning,” in Advances in Neural Information Processing Systems, 2012, pp. 836–844.
 [12] R. S. Sutton, A. R. Mahmood, and M. White, “An emphatic approach to the problem of off-policy temporal-difference learning,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2603–2631, 2016.
 [13] H. Yu, “On convergence of emphatic temporal-difference learning,” in Conference on Learning Theory, 2015, pp. 1724–1751.
 [14] A. Hallak, A. Tamar, R. Munos, and S. Mannor, “Generalized emphatic temporal difference learning: Biasvariance analysis,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [15] S. Ghiassian, B. Rafiee, and R. S. Sutton, “A first empirical study of emphatic temporal difference learning,” arXiv preprint arXiv:1705.04185, 2017.
 [16] A. Hallak and S. Mannor, “Consistent on-line off-policy evaluation,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1372–1383.
 [17] C. Gelada and M. G. Bellemare, “Off-policy deep reinforcement learning by bootstrapping the covariate shift,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3647–3655.
 [18] J. N. Tsitsiklis and B. Van Roy, “An analysis of temporal-difference learning with function approximation,” IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
 [19] A. Benveniste, M. Métivier, and P. Priouret, Adaptive algorithms and stochastic approximations. Springer Science & Business Media, 2012, vol. 22.
 [20] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins University Press, 1996.
 [21] J. Zhang, “Bairdexample,” Nov. 2019. [Online]. Available: https://github.com/MJeremy2017/ReinforcementLearningImplementation/tree/master/BairdExample
 [22] J. N. Tsitsiklis and B. Van Roy, “Feature-based methods for large scale dynamic programming,” Machine Learning, vol. 22, no. 1–3, pp. 59–94, 1996.
 [23] L. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in Machine Learning Proceedings 1995. Elsevier, 1995, pp. 30–37.