Abstract


We present the first class of policy-gradient algorithms that work with both state-value and policy function approximation and are guaranteed to converge under off-policy training. Our solution targets reinforcement learning problems in which the action representation adds to the curse of dimensionality, that is, problems with continuous or large action sets, where estimating state-action value functions (Q functions) becomes infeasible. Using state-value functions helps to lift the curse and naturally turns our policy-gradient solution into a classical Actor-Critic architecture whose Actor uses the state-value function for its update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived from the exact gradient of the averaged state-value function objective and are thus guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters. To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.

1 Introduction and Related Works

One of the most desirable features of a Reinforcement Learning (RL) algorithm is the ability to learn off-policy. Off-policy learning refers to learning about one or more desirable policies while the agent acts according to its own behavior policy, which may involve exploration. Off-policy learning is important because it allows the agent to learn about an optimal policy while it is exploring. It is also important for off-line learning: in recommendation systems, for example, we would like to learn a better recommendation strategy than the one used previously, or conduct off-line A/B testing to avoid costs. Whether the learning happens online or offline, freeing these two policies from each other makes the algorithms modular and easier to implement. For example, Q-learning is an off-policy learning method because the agent can learn about a greedy policy while following an exploratory policy. However, Q-learning has some limitations, including the requirement of a limited number of actions. Policy-gradient methods, on the other hand, are suitable for continuous actions (Williams, 1987; Sutton et al., 1999). Reinforce (Williams, 1987) is one of the most popular policy-gradient methods; however, its learning can only be done on-policy. In addition, the learning agent must wait until it has collected all the rewards before updating. There have been attempts to make an off-policy version of Reinforce, but the algorithm suffers from huge variance, particularly when the time horizon is large or infinite (Tang & Abbeel, 2010), as each reward signal must be multiplied by the product of importance ratios (Precup et al., 2001). Temporal-Difference (TD) learning methods solve this problem, but they have been used for value-function-based methods (Sutton et al., 2009; Maei et al., 2010; Maei, 2011; Sutton et al., 2016).
The classical Actor-Critic architectures (Barto et al., 1983; Sutton, 1984; Sutton et al., 1999) provide an intuitive framework that combines state-value function and policy-gradient ideas as part of the Critic and Actor, respectively.

The Q-Prop algorithm (Gu et al., 2017) uses an Actor-Critic architecture for off-policy learning, but it uses state-action value functions (known as Q-functions), which require representations for both states and actions. This implies a significant number of learning parameters (especially for continuous actions), making it prone to the curse of dimensionality and overfitting. The off-policy Actor-Critic algorithm proposed in Degris et al. (2012) uses state-value functions to update the Actor. Its Critic uses the GTD($\lambda$) algorithm (Maei, 2011) to compute an off-policy estimate of the state-value function, which is then used in the Actor; it is one of the first attempts to solve the classical problem of Actor-Critic with off-policy learning. The algorithm has all the desirable features we seek in this paper, except that the actor-update is not based on the true gradient direction of the proposed objective function with linear value-function approximation (see Footnote 1).

In this paper, we solve this problem and propose the first convergent off-policy actor-critic algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, with the following desirable features: online, incremental updating, linear complexity both in terms of memory and per-time-step computation, without adding any new hyper-parameter. Our approach provides the first systematic solution that extends the classical actor-critic algorithms, originally developed for on-policy learning, to off-policy learning.

2 RL Setting and Notations

We consider the standard RL setting, where the learning agent interacts with a complex, large-scale environment that has Markov Decision Process (MDP) properties. The MDP model is represented by the quadruple $(\mathcal{S}, \mathcal{A}, R, P)$, where $\mathcal{S}$ denotes a finite state set, $\mathcal{A}$ denotes a finite or infinite action set, and $R(s,a,s')$ and $P(s'|s,a)$ denote the real-valued reward function and the transition probabilities, respectively, for taking action $a$ from state $s$ and arriving at state $s'$.

The RL agent learns from data generated as a result of interacting with the MDP environment. At time $t$ the agent takes action $a_t \sim \pi(\cdot|s_t)$, where $\pi$ denotes the policy function; the environment then puts the agent in state $s_{t+1}$ with reward $r_{t+1}$. As a result, the data is generated in the form of a trajectory, and each sample point (fragment of experience) at time $t$ can be represented by the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$.

The objective of the agent is to find a policy function that has the highest return in the long run, that is, the highest sum of discounted future rewards. Formally, the objective is to find the optimal policy $\pi^*$, $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s \in \mathcal{S}$, where $V^{\pi}(s) = E_{\pi}[\sum_{t \geq 0} \gamma^{t} r_{t+1} \mid s_0 = s]$ is called the state-value function under policy $\pi$, with discount factor $\gamma \in [0,1)$, and $E_{\pi}[\cdot]$ represents expectation over the random variable (data) generated according to the execution of policy $\pi$. From now on, by value-function we always mean the state-value function, and we drop the subscripts from expectation terms.
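To make the notion of return concrete, the following minimal sketch computes the sum of discounted future rewards for one finite reward sequence. The function name `discounted_return` and the symbol `gamma` for the discount factor are illustrative assumptions, not notation taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    Hypothetical helper for illustration; `gamma` plays the role of the
    discount factor described in the text.
    """
    g = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

The state-value function is the expectation of this quantity over trajectories that start in a given state and follow the policy.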

Let us represent all value-functions in a single vector, $V \in \mathbb{R}^{|\mathcal{S}|}$, whose $s$-th element is $V(s)$. $T^{\pi}$ denotes the Bellman operator, defined as $T^{\pi} V = R^{\pi} + \gamma P^{\pi} V$, where $P^{\pi}$ denotes the state-state transition probability matrix, with $P^{\pi}(s'|s) = \sum_{a} \pi(a|s) P(s'|s,a)$. Just for the purpose of clarity, and with a slight abuse of notation, we denote $\sum_{a}$ as the sum (or integral) over actions for both discrete and continuous actions (instead of using the notation $\int_{a}$). Under MDP assumptions, the solution of $V = T^{\pi} V$ is unique and equal to $V^{\pi}$.

In real-world large-scale problems, the number of states is too high. For example, for the case of Computer Go we have roughly around states. This implies, we would need to estimate the value-function for each state and without generalization; that is, function approximation, we are subject to the curse-of-dimensionality.

To be practical, we would need to do function approximation. To do this, we can represent the state by a feature vector . For example, for the case of Computer Go we are shrinking a binary feature-vector of size (tabular features) to the size of , and then we can do linear function approximation (linear in terms of learning parameters and not states). Now the value-function can be approximated by and our first goal is to learn the parameter such that . For our notation, each sample (from the experience trajectory), at time , is perceived in the form of tuple , where for simplicity we have adopted the notation .
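As a small illustration of linear function approximation, the sketch below evaluates a value estimate as an inner product between a weight vector and a feature vector, and shows that tabular (one-hot) features recover a lookup table as a special case. The names `linear_value`, `one_hot`, `w` and `phi` are assumptions made for this example.

```python
import numpy as np


def linear_value(w, phi):
    """Approximate state value as the inner product of weights and features."""
    return float(np.dot(w, phi))


def one_hot(index, size):
    """Tabular feature vector: all zeros except a 1 at the state's index."""
    phi = np.zeros(size)
    phi[index] = 1.0
    return phi


if __name__ == "__main__":
    table = np.array([1.0, 2.0, 3.0])       # one weight per state (tabular)
    print(linear_value(table, one_hot(1, 3)))

    w = np.array([0.5, -1.0])               # two weights shared by all states
    print(linear_value(w, np.array([2.0, 1.0])))
```

With compact features the number of learned parameters is the feature dimension rather than the number of states, which is what makes generalization across states possible.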

Now we would need to do policy improvement given value-function estimate for a given policy . Again to tackle the curse-of-dimensionality, for policy functions, we can parameterize the policy as , where , where . Finally, through an iterative policy-improvement approach, we would like to converge to , such that .
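The exact parameterization used in the paper is not recoverable from the text above, so the following sketch assumes one common choice: a softmax (Gibbs) policy over a finite action set with linear action preferences. The names `softmax_policy`, `grad_log_policy`, `theta` and `action_features` are hypothetical.

```python
import numpy as np


def softmax_policy(theta, action_features):
    """Action probabilities for an assumed softmax policy.

    action_features has shape (num_actions, n); theta has shape (n,).
    """
    prefs = action_features @ theta
    prefs = prefs - prefs.max()          # shift for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()


def grad_log_policy(theta, action_features, action):
    """Gradient of log pi(action) w.r.t. theta for the softmax policy."""
    probs = softmax_policy(theta, action_features)
    return action_features[action] - probs @ action_features


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    theta = np.array([0.2, -0.1])
    print(softmax_policy(theta, feats))
    print(grad_log_policy(theta, feats, action=0))
```

A useful sanity check on any such parameterization is that the gradient of the log-policy has zero mean under the policy itself.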

The Actor-Critic (AC) approach is the only known method that allows us to use state-value functions in the policy-improvement step, while incrementally updating value functions through Temporal-Difference (TD) learning methods, such as TD($\lambda$) (Sutton, 1998; 1988). This incremental, online update makes the AC methods increasingly desirable.

Now let us consider the off-policy scenario, where the agent interacts with a fixed MDP environment and with a fixed behavior policy . The agent would like to estimate the value of a given parametrized target policy and eventually find which policy is the best. This type of evaluation of , from data generated according to a different policy , is called off-policy evaluation. By defining the importance-weighting ratio and under standard MDP assumptions, statistically we can write the value-function , in statistical form of

Again, for large-scale problems, we do TD-learning in conjunction with linear function approximation to estimate the parameters. Just like TD($\lambda$) with linear function approximation, which is used for on-policy learning, the GTD($\lambda$) algorithm (Maei, 2011) and Emphatic-TD($\lambda$) (Sutton et al., 2016) can be used for the problem of off-policy learning, with convergence guarantees.
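The importance-weighting ratio introduced above can be illustrated on a one-step problem with two actions. The sketch computes exact expectations (no sampling) and shows that weighting rewards drawn under the behavior policy by the ratio of target to behavior probabilities recovers the target policy's expected reward. The function names and the numbers are assumptions chosen for illustration.

```python
import numpy as np


def importance_ratios(target_probs, behavior_probs):
    """Per-action ratio of target-policy to behavior-policy probability."""
    return target_probs / behavior_probs


def expected_reward(action_probs, rewards, weights=None):
    """Exact expectation of (optionally weighted) reward under action_probs."""
    if weights is None:
        weights = np.ones_like(rewards)
    return float(np.sum(action_probs * weights * rewards))


if __name__ == "__main__":
    target = np.array([0.9, 0.1])      # hypothetical target policy
    behavior = np.array([0.5, 0.5])    # hypothetical behavior policy
    rewards = np.array([1.0, 3.0])

    rho = importance_ratios(target, behavior)
    print(expected_reward(target, rewards))           # on-policy value
    print(expected_reward(behavior, rewards, rho))    # off-policy, reweighted
```

The correction requires the behavior policy to give non-zero probability to every action the target policy can take.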

The question we ask is: what should the Actor-update, that is, the policy-improvement step, be? In particular, we would like to do gradient ascent on the policy-objective function such that the policy-weight updates, in expectation, exactly follow the direction of its gradient.

3 The Problem Formulation

Degris et al. (2012) introduced the following objective function with linear value-function approximation,


where represents the stationary distribution for visiting state according to the behavior policy . Please note, is an implicit function of , since is an approximate estimator for . The goal is to maximize by updating the policy parameters, iteratively, along the gradient direction of , where is gradient (operation vector) w.r.t policy parameters

The Off-Policy Actor-Critic algorithm (Off-PAC) of Degris et al. (2012) uses GTD($\lambda$) as Critic; however, its Actor-update, in expectation, does not follow the true gradient direction of the objective function in Eq. 1, which calls into question the convergence properties of Off-PAC. (See Footnote 1.)

In this paper, we solve this problem and derive convergent Actor-Critic algorithms based on TD-learning for the problem of off-policy learning, whose Actor-update, in expectation, follows the true gradient direction of the objective function, thus maximizing it.

In the next section, we discuss the off-policy evaluation step. We discuss two solutions: GTD($\lambda$) (Maei, 2011) and Emphatic-TD($\lambda$) (Sutton et al., 2016).

4 Value-Function Approximation: GTD($\lambda$) and Emphatic-TD($\lambda$) Solutions

We consider linear function approximation, where and is a feature matrix whose row is the feature vector . To construct the space in the column vectors of need to be linearly independent and we make this assumption throughout the paper. We also assume that the feature vectors , all have a unit feature-value of in their element. This is not needed if we use tabular features, but for the case of function approximation we will see later why this unit feature-value is needed, which is typically used for linear (or logistic) regression problems as the intercept term.

To estimate , it is natural to find through minimizing the following mean-square-error objective function:

where , , denotes the underlying state-distribution of data, which is generated according to . The square-error is weighted by , which makes our estimation biased towards the distribution of data generated by following the policy . The ideal weight would be the underlying stationary distribution under the target policy . However, weighting by the distribution of the available data is natural and is what all supervised learning methods do. There are ad-hoc methods to re-weight distributions if needed, but they are outside the scope of this paper.

By minimizing the MSE objective function w.r.t , we get


where is projection operator, and . Also in this paper we assume that , , meaning all states should be visited. We call this solution, MSE-Solution.
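The MSE-Solution and its projection operator can be checked numerically on a toy problem. The sketch below assumes the standard weighted least-squares form: with feature matrix `Phi` and a diagonal weighting built from a state distribution `d`, the projection is `Phi (Phi^T D Phi)^{-1} Phi^T D`. The names `Phi`, `d`, `v`, `weighted_projection` and `mse_solution` are illustrative, and the numbers are made up.

```python
import numpy as np


def weighted_projection(Phi, d):
    """Projection onto the column space of Phi under the d-weighted norm."""
    D = np.diag(d)
    # solve() avoids forming the explicit inverse of Phi^T D Phi.
    return Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)


def mse_solution(Phi, d, v):
    """Weights minimizing the d-weighted squared error between Phi w and v."""
    D = np.diag(d)
    return np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v)


if __name__ == "__main__":
    Phi = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])  # independent columns
    d = np.array([0.5, 0.3, 0.2])                         # all states visited
    v = np.array([1.0, 2.0, 4.0])                         # target values

    w = mse_solution(Phi, d, v)
    Pi = weighted_projection(Phi, d)
    print(w)
    print(np.allclose(Phi @ w, Pi @ v))
```

Both assumptions stated in the text matter here: linearly independent feature columns and strictly positive state weights make the matrix being solved non-singular.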

Historically, there are two alternative and generic solutions, called GTD($\lambda$) (Maei, 2011) and Emphatic-TD($\lambda$) (Sutton et al., 2016). The MSE-Solution is the special case of both solutions (when $\lambda = 1$). Later we discuss the merits of the two solutions and the reasons behind them.

GTD($\lambda$)-Solution: The Projected Bellman Equation with Bootstrapping Parameter $\lambda$:

To find the approximate solution of for evaluating the target policy , historically, the classical projected Bellman equation (Bertsekas; Sutton et al.; Maei, 2011) has been used for the problem of off-policy evaluation: , where is the projection operator defined in Eq. 2, and . We can convert the above matrix-vector products into the following statistical form (Maei, 2011):


where . (Note: following GTD($\lambda$) in Maei (2011), here we have made a small change of variable for without changing the solution.)

GTD($\lambda$) is used to find the solution of Eq. 3 with convergence guarantees. As such, we call the fixed point the GTD($\lambda$)-Solution. The main GTD($\lambda$) update is as follows:


where is step-size at time , represents a secondary set of weights, updated according to , where is a step-size at time .
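For orientation, here is a minimal sketch of one GTD($\lambda$) step in the form commonly given for Maei (2011). It does not apply the change of variable mentioned above, and all names (`value_w` for the main weights, `aux_w` for the secondary weights, `trace`, `rho`, `lam`) are assumptions for this example rather than the paper's notation.

```python
import numpy as np


def gtd_lambda_step(value_w, aux_w, trace, phi, phi_next, reward,
                    rho, gamma, lam, alpha, beta):
    """One GTD(lambda) update with linear features (assumed standard form)."""
    # Importance-weighted eligibility trace.
    trace = rho * (phi + gamma * lam * trace)
    # TD error under the current value weights.
    delta = reward + gamma * value_w @ phi_next - value_w @ phi
    # Main update: TD term plus a gradient-correction term using aux_w.
    correction = gamma * (1.0 - lam) * (trace @ aux_w) * phi_next
    new_value_w = value_w + alpha * (delta * trace - correction)
    # Secondary weights track the expected TD update.
    new_aux_w = aux_w + beta * (delta * trace - (aux_w @ phi) * phi)
    return new_value_w, new_aux_w, trace


if __name__ == "__main__":
    n = 3
    phi, phi_next = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
    out = gtd_lambda_step(np.zeros(n), np.zeros(n), np.zeros(n), phi, phi_next,
                          reward=1.0, rho=1.2, gamma=0.9, lam=0.5,
                          alpha=0.1, beta=0.05)
    print(out[0])
```

In this form the correction term carries a factor of `(1 - lam)`, so with `lam = 1` the main update no longer depends on the secondary weights.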

There are a few points to make regarding the solution of GTD($\lambda$):

  • For the case of tabular features, or features that span the state-space, the solution is independent of the value of $\lambda$ and is equal to the true solution .

  • For the case of linear function approximation, the solution can depend on $\lambda$.

  • For $\lambda = 1$ the solution is equivalent to the MSE-Solution. In addition, the GTD(1)-update does not need a second set of weights and step-size, making it very simple, as we can see from its main update.

  • For the case of on-policy learning, GTD(1) and TD(1) are identical.


An alternative solution to the problem of off-policy prediction with function approximation is an emphatic version of the projected Bellman equation, developed by Sutton et al. (2016): in the projection operator, , now we have , where is an emphatic (positive definite) diagonal matrix. Later we will discuss the properties of matrix . In statistical form, the solution satisfies,


where and always remains strictly positive. (Please note, we have combined and used in Sutton et al. (2016). Also note that, since the form of the update for both GTD($\lambda$) and Emphatic-TD($\lambda$) looks the same, for simplicity we have used the same notation for the eligibility trace vector, .)

The Emphatic-TD($\lambda$) update is as follows:


where follows Eq. 5 update.
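As a point of reference, the sketch below implements one Emphatic-TD($\lambda$) step in the form given by Sutton et al. (2016), with a separate follow-on trace and emphasis rather than the combined variable used in this paper. Constant discount and bootstrapping parameters and unit interest are assumed, and all names (`followon`, `emphasis`, `rho_prev`) are illustrative.

```python
import numpy as np


def emphatic_td_step(weights, trace, followon, rho_prev, phi, phi_next,
                     reward, rho, gamma, lam, alpha, interest=1.0):
    """One Emphatic-TD(lambda) update with linear features (assumed form)."""
    # Follow-on trace: discounted, importance-weighted accumulation of interest.
    followon = rho_prev * gamma * followon + interest
    # Emphasis mixes immediate interest with the follow-on trace.
    emphasis = lam * interest + (1.0 - lam) * followon
    # Emphasis-weighted, importance-corrected eligibility trace.
    trace = rho * (gamma * lam * trace + emphasis * phi)
    delta = reward + gamma * weights @ phi_next - weights @ phi
    weights = weights + alpha * delta * trace
    return weights, trace, followon


if __name__ == "__main__":
    phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    w, e, f = emphatic_td_step(np.zeros(2), np.zeros(2), 0.0, 1.0, phi, phi_next,
                               reward=1.0, rho=1.0, gamma=0.5, lam=0.0, alpha=0.1)
    print(w, e, f)
```

Starting the follow-on trace at zero makes its first value equal to the interest, and no second set of weights or second step-size is needed in this update.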

Here, we make a few points regarding the solution of Emphatic-TD($\lambda$):

  • For the case of tabular features, or features that can span the state-space, Emphatic-TD($\lambda$) and GTD($\lambda$) both converge to the true solution .

  • For the case of linear function approximation, both the Emphatic-TD($\lambda$) and GTD($\lambda$) solutions depend on $\lambda$, but they may differ:

    • For the case of on-policy learning both solutions are the same, because the emphatic matrix becomes a constant diagonal matrix.

    • For the case of off-policy learning, the solutions differ, and it is still not clear which solution is better.

  • Both Emphatic-TD(1) and GTD(1) have the same update (identical, as we have for all ) and converge to the MSE-Solution.

It is worth discussing the MSE-Solution here, since Emphatic-TD(1) and GTD(1) become identical and converge to it. The question is why not use the MSE-Solution, with $\lambda = 1$, and why have such a variety of solutions based on the bootstrapping parameter $\lambda$? It is widely accepted that the main reason is the bias-variance trade-off. In fact, the MSE-Solution is the natural solution, but when $\lambda = 1$ the traces become large and cause a huge variance around the fixed point, which may result in an inferior solution. (This is also known as the Monte-Carlo solution in Sutton & Barto, 1998.) However, there has been little investigation of how reducing the variance of the traces can reduce the overall variance and thus lead to a quality solution. We ran some simple experiments and, by normalizing the eligibility trace vector, e.g. , we were able to get superior results, as shown in Fig. 1. (For the details of the experiment, see the 19-state random walk in Sutton & Barto, 1998.)

Figure 1: Parameter studies: RMS values (vertical axis) for various step-sizes, alpha (x-axis), and $\lambda$ values. The right panel normalizes the eligibility traces, which leads to the best value for , which is also slightly better than the best value (with ) in the left panel.

5 The Gradient Direction of the Objective Function

In this section, we explicitly derive the exact gradient direction of the function in Eq. 1 w.r.t. policy parameter .

As we can see, the GTD($\lambda$) solution in Eq. 3 and the Emphatic-TD($\lambda$) solution in Eq. 5 look similar. (Their solutions differ due to a different update for .) Thus, we provide the same form of gradient of for both, as follows: First, we compute the gradient of from Eq. 1, . Now, to obtain the matrix, , we transpose Eq. 3 (Eq. 5) and then take its gradient as follows ( is column vector operator):

where we can show,

Putting all together and solving for the matrix we get,



matrix is invertible because we have assumed that the column vectors of are linearly independent. This is a realistic assumption to construct a space in where the approximate solution is located.

Now we substitute the equalities in the Eq. 5 to obtain , which will be used in , to obtain the exact gradient. To do this, first let us consider the following definitions:


and , and also we define the diagonal matrix whose diagonal elements are vector . Using definitions(10,12), and given the fact that conditioning on we get and independent, then we have the following equations: First, Eq. 10 can be written as:


and also we have,

The above gradient is the exact gradient direction in statistical form.

The most important question is to identify the values of and , which would be our next task. We will show that these terms are state dependent and can not be ignored for the case of off-policy learning, unless the problem is on-policy.

5.1 On-Policy scenario:

Lemma 1.

(On-Policy GTD($\lambda$) and Emphatic-TD($\lambda$) Solutions) For the problem of on-policy, where , , the Emphatic-TD($\lambda$), GTD($\lambda$), and TD($\lambda$) solutions are identical.


From Eq. 3, it is clear that on-policy GTD($\lambda$) and TD($\lambda$) have the same solution. For the case of Emphatic-TD($\lambda$), when , , we get . Thus we can use the convergent value, in the update for ; that is, . Let us divide Eq. 5 by , and define , then we get , where . Thus the expectation term becomes identical to Eq. 3, finishing the proof. ∎

Theorem 1.

( for the Problem of On-Policy (aka TD($\lambda$)-Solution))


The term satisfying in Eq. 13, can be replaced by , which by definition is equivalent to and since the matrix is invertible (due to the assumption of linearly independent column feature vectors of ), then solution is unique, which implies uniqueness of for a given . Now, we show that , which is constant . Using in Eq. 13, we get

where we have used . Thus is the solution for all , and using Lemma (1) , we finish the proof. ∎

5.2 Off-Policy scenario:

Unlike on-policy, for the problem of off-policy is state dependent and is not constant (see the proof of Theorem (1)). Estimating the value of requires estimating the parameter , which comes with computational complexities. This is mainly due to the fact that we do not have the on-policy criteria . However, when $\lambda = 1$, that is, for the case of the GTD(1) solution, which is equivalent to the MSE solution, we are able to find the exact value of

which would enable us to do sampling from the true gradient of . This is one of our main contributions in this paper.

6 The Gradient Direction of the Objective Function with the GTD(1)-Solution

Lemma 2.

(Value of for the Problem of Off-Policy with GTD(1)-Solution) For the problem of off-policy the value of , defined as , satisfying in Eq. 13 is


where where , , , , and all the elements of are zero except the element which is .


Just like the proof of Theorem (1), let us assume that is equal to . Due to having a unique solution, if it satisfies in Eq. 13 then it must be the unique solution. Please note that here, we have done a slight abuse of notation and have used which by definition is and also for the iteration we have used with subscript . These are two different variables and should not be mistaken, but for simplicity we have adopted this notation in the paper.

To do the proof, first, we can show that, the product of the diagonal matrix (whose diagonal element is ) times is (See supplementary materials). Let us define the diagonal matrix with diagonal elements of , where , we have

Again, given the fact that has a unique solution, and the element of all the feature vectors , , have a unit value of , we can see that the element of would be equivalent to and thus with all zero elements except the element with the value of satisfies in Eq. 13. Thus, finishing the proof. ∎

Theorem 2.

( for the Problem of Off-Policy with GTD(1)-Solution (aka MSE-Solution))


where ,, is zero vector, .


The first term of in Eq. 5; that is can be written as since, due to MDP properties, conditioning on makes independent of as it depends on and past. Now from , we use the value of from Lemma (2), then we get

Now we turn into the second term which is with . From Lemma(2), we see is constant and zero except the last element which is one. Thus , can be written as since . Now by putting all together and using and by defining , we get the recursive form of , thus finishing the proof. ∎

Please note that, for $\lambda < 1$, finding a value for that can be used in and that enables sampling was not possible due to parameter. However, we will see later that the Emphatic-TD($\lambda$) solution solves this problem.

7 Gradient-AC: A Gradient Actor-Critic with GTD(1) as Critic

By sampling from in Theorem (2), we present the first convergent off-policy gradient Actor-Critic algorithm, shown in Algorithm 1. The Critic uses the GTD(1) update while the Actor uses a policy-gradient method to maximize the objective function. It is worth mentioning that the complexity cost of the new algorithm is linear in the number of parameters, the same as the classical AC method, with no additional hyper/tuning parameters.

1:  Initialize ,, and , , to zero values
2:  Choose proper step-size values for and
3:  repeat
4:     for  each sample generated by  do
12:     end for
13:  until  converges
Algorithm 1 The Gradient-AC Algorithm

7.1 Convergence Analysis of Gradient-AC

In this section we provide a convergence analysis for the Gradient-AC algorithm. Since the Actor-update is based on the true gradient direction, we use existing results in the literature to avoid repetition. Before providing the main theorem, let us consider the following assumptions. The first set of assumptions relates to the data sequence and the parametrized policy:

  1. , such that (s.t.) is stationary and , ;

  2. The Markov processes and are in steady state, irreducible and aperiodic, with stationary distribution , ;

  3. , and s.t. holds almost surely (a.s.).

The assumptions on the parametrized policy are as follows:

  1. For every the mapping is twice differentiable;

  2. and has a bounded derivative . Note denotes Euclidean norm.

  1. Features are bounded according to Konda & Tsitsiklis (2003), and we follow the same noise properties conditions.

For the convergence analysis we follow the two-time-scale approach (Konda & Tsitsiklis, 2003; Borkar, 1997; Borkar, 2008; Bhatnagar et al., 2009). We will use the following step-size conditions for the convergence proof:

  1. , , and .
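The specific step-size conditions are not recoverable from the line above. As an illustration of what a two-timescale schedule typically looks like in this literature, the sketch below assumes the usual requirements: both step-size sequences sum to infinity, their squares are summable, and the Actor's step-size vanishes relative to the Critic's. The function names and the exponents are assumptions for this example.

```python
def actor_step_size(t):
    """Slower timescale (assumed): decays as 1/t."""
    return 1.0 / t


def critic_step_size(t):
    """Faster timescale (assumed): decays as 1/t**0.6."""
    return 1.0 / t ** 0.6


def timescale_ratio(t):
    """Actor-to-Critic step-size ratio; should shrink towards zero."""
    return actor_step_size(t) / critic_step_size(t)


if __name__ == "__main__":
    for t in (1, 10, 1000, 1000000):
        print(t, actor_step_size(t), critic_step_size(t), timescale_ratio(t))
```

Any decay exponent in (0.5, 1] gives a divergent sum with a summable sum of squares; choosing a smaller exponent for the Critic makes it the faster of the two recursions.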

The actor update can be written in the following form:



Note, projects its argument to a compact set with smooth boundary; that is, if the iterate leaves , it is projected to the closest or some convenient point in , that is, . Here, we choose to be the largest possible compact set. We use the ordinary differential equation (ODE) approach for our convergence proof and show that our algorithm converges to the set of all asymptotically stable solutions of the following ODE:


where . Note, if we have , otherwise, projects to the tangent space of at .
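The projection step used in the actor update can be sketched for one simple choice of compact set with smooth boundary: a closed Euclidean ball. The paper does not specify the set beyond calling it the largest possible compact set, so the ball, its `radius`, and the function name `project_to_ball` are assumptions made purely for illustration.

```python
import numpy as np


def project_to_ball(theta, radius):
    """Return the closest point to theta inside a ball of the given radius."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return theta                      # already inside: left unchanged
    return theta * (radius / norm)        # outside: rescale onto the boundary


if __name__ == "__main__":
    print(project_to_ball(np.array([0.3, 0.4]), radius=1.0))
    print(project_to_ball(np.array([3.0, 4.0]), radius=1.0))
```

With a very large radius the projection is rarely active, so the iterates behave like unconstrained gradient ascent while remaining bounded.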

Theorem 3.

(Convergence of Gradient-AC) Under the conditions listed in this section, as , converges to the set of all asymptotically stable solutions of (19) with probability 1.


The proof follows a two-timescale (in step-sizes) convergence analysis. We use the results in Borkar (1997; 2008, see Lemma 1 and Theorem 2, page 66); also see Bhatnagar et al. (2009) and Konda & Tsitsiklis (2003). The expected update of the actor, according to Theorem (2), is exactly and the critic, GTD(1), is also a true gradient method. As such the proof follows, and for brevity we omit its repetition. ∎

8 The Gradient Direction for the Emphatic-TD($\lambda$) Solution

With the Emphatic-TD($\lambda$) solution, we now aim to optimize its corresponding objective. Again one can show that with all zero elements except the element with value of 1, will satisfy in equation (see Eq. 12 and Eq. 13) and as a result, Eq. 5 becomes

Lemma 3.

(The value of for the Emphatic-TD($\lambda$) Solution) Given the Emphatic-TD($\lambda$) solution, from equation satisfying in Eq. 13, we have