Abstract
We present the first class of policy-gradient algorithms that work with both state-value and policy function approximation, and that are guaranteed to converge under off-policy training. Our solution targets problems in reinforcement learning where the action representation adds to the curse of dimensionality; that is, problems with continuous or large action sets, for which estimating state-action value functions (Q functions) is infeasible. Using state-value functions helps to lift this curse and, as a result, naturally turns our policy-gradient solution into a classical Actor-Critic architecture whose Actor uses the state-value function for its update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived from the exact gradient of the average state-value objective and thus are guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyperparameters. To our knowledge, this is the first time that convergent off-policy learning has been extended to classical Actor-Critic methods with function approximation.
1 Introduction and Related Works
One of the most desirable features of a Reinforcement Learning (RL) algorithm is the ability to learn off-policy. Off-policy learning refers to learning about one (or multiple) desirable policy (policies) while the agent acts according to its own behavior policy, which may involve exploration. Off-policy learning is important because it allows the agent to learn about an optimal policy while it is exploring. It is also important for offline learning: for example, in recommendation systems, we would like to learn about a better recommendation strategy than the one used previously, or conduct offline A/B testing to avoid costs. Whether learning happens online or offline, decoupling these two policies makes the algorithms modular and easier to implement. For example, Q-learning is off-policy because the agent can learn about the greedy policy while following an exploratory policy. However, Q-learning has some limitations, including the requirement of a small, finite action set. Policy-gradient methods, on the other hand, are suitable for continuous actions (Williams, 1987; Sutton et al., 1999). REINFORCE (Williams, 1987) is one of the most popular policy-gradient methods; however, it learns only on-policy. In addition, the learning agent must wait until it collects all the rewards before updating. There have been attempts to make an off-policy version of REINFORCE, but the resulting algorithm suffers from huge variance, particularly when the time horizon is large or infinite (Tang & Abbeel, 2010), as each reward signal must be multiplied by a product of importance ratios (Precup et al., 2001). Temporal-Difference (TD) learning methods solve this problem, but they have been used for value-function-based methods (Sutton et al., 2009; Maei et al., 2010; Maei, 2011; Sutton et al., 2016).
The classical Actor-Critic architectures (Barto et al., 1983; Sutton, 1984; Sutton et al., 1999) provide an intuitive framework that combines state-value functions and policy-gradient ideas, as the Critic and the Actor, respectively.
The Q-Prop algorithm (Gu et al., 2017) uses an Actor-Critic architecture for off-policy learning, but it uses state-action value functions (known as Q-functions), which require representations for both states and actions. This implies a significant number of learning parameters (especially for continuous actions), making it prone to the curse of dimensionality and overfitting. The off-policy Actor-Critic algorithm proposed in Degris et al. (2012) uses state-value functions to update the Actor. Its Critic uses the GTD($\lambda$) algorithm (Maei, 2011) to produce an off-policy evaluation of the state-value function, which is then used by the Actor; it is one of the first attempts to solve the classical problem of Actor-Critic with off-policy learning. The algorithm has all the desirable features we seek in this paper, except that the Actor update is not based on the true gradient direction of the proposed objective function with linear value-function approximation.
In this paper, we solve this problem and propose the first convergent off-policy Actor-Critic algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, with the following desirable features: online, incremental updating; linear complexity, both in memory and in per-time-step computation; and no new hyperparameters. Our approach provides the first systematic solution that extends the classical Actor-Critic algorithms, originally developed for on-policy learning, to off-policy learning.
2 RL Setting and Notations
We consider the standard RL setting, where the learning agent interacts with a complex, large-scale environment that has Markov Decision Process (MDP) properties. The MDP model is represented by the quadruple $(\mathcal{S}, \mathcal{A}, r, P)$, where $\mathcal{S}$ denotes a finite state set, $\mathcal{A}$ denotes a finite or infinite action set, and $r(s,a,s')$ and $P(s'|s,a)$ denote the real-valued reward function and transition probabilities, respectively, for taking action $a$ from state $s$ and arriving at state $s'$.
The RL agent learns from data generated as a result of interacting with the MDP environment. At time $t$ the agent takes $a_t \sim \pi(\cdot|s_t)$, where $\pi$ denotes the policy function; then the environment puts the agent in state $s_{t+1}$ with reward $r_{t+1}$. As a result, the data is generated in the form of a trajectory $s_0, a_0, r_1, s_1, a_1, r_2, \ldots$, and each sample point (fragment of experience) at time $t$ can be represented by the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$.
The objective of the agent is to find a policy function that achieves the highest return in the long run; that is, the sum of discounted future rewards. Formally, the objective is to find the optimal policy $\pi^* = \arg\max_\pi V^{\pi}$, where $V^{\pi}(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1} \mid s_0 = s\right]$ is called the state-value function under policy $\pi$, with discount factor $0 \le \gamma < 1$, and $\mathbb{E}_\pi$ represents expectation over the random data generated by executing policy $\pi$. From now on, by value function we always mean state-value function, and we drop the subscripts from expectation terms.
Let us represent all value functions in a single vector $V^{\pi}$, whose $s$-th element is $V^{\pi}(s)$. $T^{\pi}$ denotes the Bellman operator, defined as $T^{\pi}V = R^{\pi} + \gamma P^{\pi}V$, where $P^{\pi}$ denotes the state-to-state transition probability matrix, with $[P^{\pi}]_{ss'} = \sum_a \pi(a|s)P(s'|s,a)$, and $R^{\pi}$ is the vector of expected immediate rewards under $\pi$. For clarity, and with a slight abuse of notation, we use $\sum_a$ to denote the sum (or integral) over actions for both discrete and continuous action sets. Under MDP assumptions, the solution of $V = T^{\pi}V$ is unique and equal to $V^{\pi}$.
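As a quick numerical check of the fixed-point claim above, the following sketch builds the Bellman operator $T^{\pi}V = R^{\pi} + \gamma P^{\pi}V$ for a hypothetical 3-state chain (the transition matrix and rewards are made-up numbers, not from the paper) and verifies that $V^{\pi} = (I - \gamma P^{\pi})^{-1}R^{\pi}$ is its fixed point:

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi: P is the
# state-to-state transition matrix and R the expected reward vector.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

def bellman(V):
    # Bellman operator: (T V)(s) = R(s) + gamma * sum_s' P(s,s') V(s')
    return R + gamma * P @ V

# Its unique fixed point is V = (I - gamma P)^{-1} R.
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)
```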
In real-world, large-scale problems the number of states is very high. For example, Computer Go has roughly $10^{170}$ states. This implies we would need to estimate the value function for each state, and without generalization (that is, function approximation) we are subject to the curse of dimensionality.
To be practical, we need function approximation. To do this, we represent the state by a feature vector $\phi(s) \in \mathbb{R}^n$, with $n \ll |\mathcal{S}|$. For example, for the case of Computer Go this shrinks an intractably large binary (tabular) feature vector to one of manageable size $n$, and then we can do linear function approximation (linear in the learning parameters, not in the states). Now the value function can be approximated by $\hat{V}(s) = w^\top\phi(s)$, and our first goal is to learn the parameter $w \in \mathbb{R}^n$ such that $\hat{V} \approx V^{\pi}$. In our notation, each sample (from the experience trajectory) at time $t$ is perceived in the form of the tuple $(\phi_t, a_t, r_{t+1}, \phi_{t+1})$, where for simplicity we have adopted the notation $\phi_t \equiv \phi(s_t)$.
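A minimal sketch of linear value-function approximation as described above; the feature map and weights below are illustrative assumptions, with a constant intercept feature (used later in the paper) placed in the last position:

```python
import numpy as np

def phi(s):
    # Hypothetical feature map for a scalar state s; the last entry
    # is a constant intercept feature fixed to 1.
    return np.array([s, s**2, np.sin(s), 1.0])

rng = np.random.default_rng(0)
w = rng.normal(size=4)  # learnable parameters, one per feature

def v_hat(s):
    # Linear approximation: v_hat(s) = w^T phi(s)
    return w @ phi(s)
```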
Now we need to do policy improvement, given the value-function estimate for a given policy. Again, to tackle the curse of dimensionality for policy functions, we parameterize the policy as $\pi_\theta(a|s)$, where $\theta \in \mathbb{R}^m$. Finally, through an iterative policy-improvement approach, we would like to converge to $\theta^*$ such that $\pi_{\theta^*} \approx \pi^*$.
The Actor-Critic (AC) approach is the only known method that allows us to use state-value functions in the policy-improvement step, while incrementally updating value functions through Temporal-Difference (TD) learning methods, such as TD($\lambda$) (Sutton, 1988; 1998). This incremental, online update makes AC methods increasingly desirable.
Now let us consider the off-policy scenario, where the agent interacts with a fixed MDP environment under a fixed behavior policy $\pi_b$. The agent would like to estimate the value of a given parametrized target policy $\pi_\theta$ and eventually find which policy is the best. This type of evaluation of $\pi_\theta$, from data generated according to a different policy $\pi_b$, is called off-policy evaluation. By defining the importance-weighting ratio $\rho(s,a) = \pi_\theta(a|s)/\pi_b(a|s)$, and under standard MDP assumptions, we can write the value function in the statistical form of $V^{\pi_\theta}(s) = \mathbb{E}\!\left[\rho_t\big(r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1})\big) \mid s_t = s\right]$, where the expectation is over transitions generated by the behavior policy $\pi_b$ and $\rho_t \equiv \rho(s_t, a_t)$.
Again, for large-scale problems, we do TD learning in conjunction with linear function approximation to estimate the parameters. Just like TD($\lambda$) with linear function approximation, which is used for on-policy learning, the GTD($\lambda$) algorithm (Maei, 2011) and EmphaticTD($\lambda$) (Sutton et al., 2016) can be used for the problem of off-policy learning, with convergence guarantees.
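The mechanics of the importance-weighting ratio $\rho(s,a) = \pi_\theta(a|s)/\pi_b(a|s)$ can be illustrated with a toy two-action example (the policy probabilities below are assumptions): reweighting behavior-policy samples by $\rho$ recovers a target-policy expectation.

```python
# Hypothetical state-independent policies over two actions.
pi_target = {0: 0.9, 1: 0.1}    # target policy pi(a|s)
pi_behavior = {0: 0.5, 1: 0.5}  # behavior policy pi_b(a|s)

def rho(a):
    # Importance-weighting ratio for action a.
    return pi_target[a] / pi_behavior[a]

# E_{pi_b}[rho(A) f(A)] equals E_{pi}[f(A)] for any f.
f = {0: 1.0, 1: -1.0}
est = sum(pi_behavior[a] * rho(a) * f[a] for a in (0, 1))
true = sum(pi_target[a] * f[a] for a in (0, 1))
```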
The question we ask is: what should the Actor update, that is, the policy-improvement step, be? In particular, we would like to do gradient ascent on a policy objective function such that the weight updates, in expectation, exactly follow the direction of the gradient of that function.
3 The Problem Formulation
Degris et al. (2012) introduced the following objective function with linear value-function approximation,

$$J(\theta) = \sum_{s} d_b(s)\,\hat{V}_\theta(s) = \sum_{s} d_b(s)\, w(\theta)^\top \phi(s), \qquad (1)$$

where $d_b(s)$ represents the stationary distribution of visiting state $s$ according to the behavior policy $\pi_b$. Please note, $\hat{V}_\theta$ is an implicit function of $\theta$, since $w(\theta)$ is the approximate estimator for $V^{\pi_\theta}$. The goal is to maximize $J(\theta)$ by updating the policy parameters, iteratively, along the gradient direction $\nabla_\theta J(\theta)$, where $\nabla_\theta$ is the gradient operator (vector) w.r.t. the policy parameters $\theta$.
Degris et al.'s (2012) Off-Policy Actor-Critic algorithm, OffPAC, uses GTD($\lambda$) as the Critic; however, the Actor update, in expectation, does not follow the true gradient direction of $J(\theta)$, thus putting the convergence properties of OffPAC in question. (See Footnote 1.)
In this paper, we solve this problem and derive convergent Actor-Critic algorithms based on TD learning for the problem of off-policy learning, whose Actor update, in expectation, follows the true gradient direction of $J(\theta)$, thus maximizing it.
In the next section, we discuss the off-policy evaluation step. We discuss two solutions: GTD($\lambda$) (Maei, 2011) and EmphaticTD($\lambda$) (Sutton et al., 2016).
4 Value-Function Approximation: GTD($\lambda$) and EmphaticTD($\lambda$) Solutions
We consider linear function approximation, where $\hat{V} = \Phi w$ and $\Phi$ is a feature matrix whose $s$-th row is the feature vector $\phi(s)^\top$. For the columns of $\Phi$ to construct a proper subspace, they need to be linearly independent, and we make this assumption throughout the paper. We also assume that the feature vectors $\phi(s)$, $\forall s$, all have a unit feature value of $1$ in their last element. This is not needed if we use tabular features, but for the case of function approximation we will see later why this unit feature value is needed; it plays the role of the intercept term typically used in linear (or logistic) regression problems.
To estimate $V^{\pi_\theta}$, it is natural to find $w$ by minimizing the following mean-square-error (MSE) objective function:

$$\mathrm{MSE}(w) = \sum_s d_b(s)\big(V^{\pi_\theta}(s) - w^\top\phi(s)\big)^2,$$

where $d_b(s)$, $\forall s$, denotes the underlying state distribution of the data, which is generated according to the behavior policy $\pi_b$. The square error is weighted by $d_b$, which biases our estimate towards the distribution of the data generated by following $\pi_b$. The ideal weighting would be the underlying stationary distribution under the target policy $\pi_\theta$; however, weighting by the data distribution is natural and is used in all supervised learning methods. There are ad-hoc methods to reweight distributions if needed, but they are outside the scope of this paper.
By minimizing the MSE objective function w.r.t. $w$, we get

$$\Phi w = \Pi V^{\pi_\theta}, \qquad (2)$$

where $\Pi$ is the projection operator, $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$, and $D$ is the diagonal matrix with diagonal entries $d_b(s)$. Also, in this paper we assume that $d_b(s) > 0$, $\forall s$, meaning all states should be visited. We call this solution the MSE-Solution.
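The weighted projection operator can be checked numerically; the sketch below uses a made-up feature matrix $\Phi$ and state distribution $d_b$ and verifies that $\Pi = \Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D$ is idempotent and fixes anything in the span of $\Phi$:

```python
import numpy as np

# Illustrative setup: 3 states, 2 features (second column is the
# unit intercept feature), and a strictly positive distribution d.
Phi = np.array([[1.0, 1.0],
                [2.0, 1.0],
                [3.0, 1.0]])
d = np.array([0.2, 0.3, 0.5])
D = np.diag(d)

# D-weighted projection onto the column span of Phi.
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D
```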
Historically, there are two alternative and generic solutions, called the GTD($\lambda$) solution (Maei, 2011) and the EmphaticTD($\lambda$) solution (Sutton et al., 2016). The MSE-Solution is the special case of the two solutions (when $\lambda = 1$). Later we discuss the merits of the two solutions and the reasoning behind them.
GTD($\lambda$)-Solution: The Projected Bellman Equation with Bootstrapping Parameter $\lambda$:
To find the approximate solution of $V^{\pi_\theta}$ for evaluating the target policy $\pi_\theta$, historically, the classical projected Bellman equation (Bertsekas; Sutton et al.; Maei, 2011), used here for the problem of off-policy evaluation, is $\Phi w = \Pi T_\lambda^{\pi_\theta}\Phi w$, where $\Pi$ is the projection operator defined in Eq. 2 and $T_\lambda^{\pi_\theta}$ is the $\lambda$-weighted Bellman operator. We can convert the above matrix-vector products into the following statistical form (Maei, 2011):
$$\mathbb{E}\left[\delta_t\, e_t\right] = 0, \qquad (3)$$

where $\delta_t = r_{t+1} + \gamma w^\top\phi_{t+1} - w^\top\phi_t$ is the TD error and $e_t = \rho_t(\phi_t + \gamma\lambda e_{t-1})$ is the eligibility-trace vector. (Note, following GTD($\lambda$) in Maei (2011), here we have done a small change of variable for $e_t$ without changing the solution.)
GTD($\lambda$) is used to find the solution of Eq. 3 with convergence guarantees. As such, we call this fixed point the GTD($\lambda$)-Solution. The main GTD($\lambda$) update is as follows:

$$w_{t+1} = w_t + \alpha_t\big[\delta_t e_t - \gamma(1-\lambda)(e_t^\top u_t)\,\phi_{t+1}\big], \qquad (4)$$

where $\alpha_t$ is the step size at time $t$ and $u_t$ represents a secondary set of weights, updated according to $u_{t+1} = u_t + \beta_t\big[\delta_t e_t - (u_t^\top\phi_t)\,\phi_t\big]$, where $\beta_t$ is a step size at time $t$.
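A minimal sketch of one GTD($\lambda$) step as just described; the sample quantities passed in (features, reward, importance ratio) are illustrative assumptions:

```python
import numpy as np

def gtd_lambda_step(w, u, e, phi, phi_next, r, rho, gamma, lam, alpha, beta):
    e = rho * (phi + gamma * lam * e)           # eligibility trace
    delta = r + gamma * w @ phi_next - w @ phi  # TD error
    # Main weights: the gradient-correction term vanishes when lam = 1.
    w = w + alpha * (delta * e - gamma * (1.0 - lam) * (e @ u) * phi_next)
    # Secondary weights.
    u = u + beta * (delta * e - (u @ phi) * phi)
    return w, u, e
```

Note that for $\lambda = 1$ the correction term is zero, so GTD(1) needs no secondary weights, consistent with the remarks later in this section.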
There are a few points to make regarding the solution of GTD($\lambda$):

For the case of tabular features, or features that span the state space, the solution is independent of the value of $\lambda$ and is equal to the true solution $V^{\pi_\theta}$.

For the case of linear function approximation, the solution can depend on $\lambda$.

For $\lambda = 1$, the solution is equivalent to the MSE-Solution; that is, $\Phi w = \Pi V^{\pi_\theta}$. In addition, the GTD(1) update does not need a second set of weights and step size, since the correction term vanishes, making it very simple, as we can see from its main update.

For the case of on-policy learning, GTD(1) and TD(1) are identical.
EmphaticTD($\lambda$)-Solution:
An alternative solution to the problem of off-policy prediction with function approximation is an emphatic version of the projected Bellman equation, developed by Sutton et al. (2016): in the projection operator $\Pi$, instead of $D$ we now have $DM$, where $M$ is an emphatic (positive definite) diagonal matrix. Later we will discuss the properties of matrix $M$. In statistical form, the solution satisfies

$$\mathbb{E}\left[\delta_t\, e_t\right] = 0, \qquad (5)$$

where $e_t = \rho_t(\gamma\lambda e_{t-1} + m_t\phi_t)$, with emphasis $m_t = \lambda + (1-\lambda)F_t$ and follow-on trace $F_t = \gamma\rho_{t-1}F_{t-1} + 1$, which always remains strictly positive. (Please note, we have combined the emphasis and follow-on scalars used in Sutton et al. (2016). Also note, since the form of the update for both GTD($\lambda$) and EmphaticTD($\lambda$) looks the same, for simplicity we have used the same notation for the eligibility-trace vector, $e_t$.)
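A sketch of the emphatic trace bookkeeping described above, with interest fixed to $1$ and the follow-on and emphasis scalars combined; the exact notation in Sutton et al. (2016) differs slightly:

```python
import numpy as np

def emphatic_trace_step(e, F, phi, rho, rho_prev, gamma, lam):
    F = gamma * rho_prev * F + 1.0         # follow-on trace, stays >= 1
    M = lam + (1.0 - lam) * F              # emphasis
    e = rho * (gamma * lam * e + M * phi)  # emphatic eligibility trace
    return e, F
```

For $\lambda = 1$ the emphasis reduces to $M = 1$ and the trace coincides with the GTD(1) trace, consistent with the points below.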
Here, we make a few points regarding the solution of EmphaticTD($\lambda$):

For the case of tabular features, or features that can span the state space, EmphaticTD($\lambda$) and GTD($\lambda$) both converge to the true solution $V^{\pi_\theta}$.

For the case of linear function approximation, both the EmphaticTD($\lambda$) and GTD($\lambda$) solutions depend on $\lambda$, but they may differ:

For the case of on-policy learning, both solutions are the same, because the emphasis matrix becomes a constant diagonal matrix.

For the case of off-policy learning, the solutions will differ, and it is still not clear which one has a solution-quality advantage.


Both EmphaticTD(1) and GTD(1) have the same update (identical, as the emphasis equals one for all $t$) and converge to the MSE-Solution.
It is worth discussing the MSE-Solution here, as both EmphaticTD(1) and GTD(1) become identical and converge to it. The question is: why not always use the MSE-Solution, with $\lambda = 1$, and why have such a variety of solutions based on the bootstrapping parameter $\lambda$? It is widely accepted that the main reason is the bias-variance trade-off. The MSE-Solution is the natural solution, but when $\lambda = 1$ the traces become large and cause huge variance around the fixed point, and thus may result in an inferior solution. (This is also known as the Monte Carlo solution in Sutton & Barto, 1998.) However, there has been little investigation into how reducing the variance of the traces can reduce the overall variance and thus converge to a quality solution. We ran some simple experiments, and by normalizing the eligibility-trace vector (e.g. dividing $e_t$ by its norm) we were able to get superior results, as shown in Fig. 1. (For the details of the experiment, see the 19-state random walk in Sutton & Barto, 1998.)
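The trace-normalization heuristic mentioned above can be sketched as follows; this particular clipped form is our assumption of one reasonable choice, not necessarily the exact rule used in the experiment:

```python
import numpy as np

def normalized_trace(e):
    # Rescale the eligibility trace only when its norm exceeds 1,
    # bounding the per-step update magnitude while leaving small
    # traces untouched.
    return e / max(1.0, np.linalg.norm(e))
```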
5 The Gradient Direction of $J(\theta)$
In this section, we explicitly derive the exact gradient direction of the objective function $J(\theta)$ in Eq. 1 w.r.t. the policy parameters $\theta$.
As we can see, the GTD($\lambda$) solution in Eq. 3 and the EmphaticTD($\lambda$) solution in Eq. 5 look similar. (Their solutions differ due to the different updates for the eligibility trace.) Thus, we provide the same form of the gradient of $J(\theta)$ for both, as follows. First, we compute the gradient of $J(\theta)$ from Eq. 1, which requires the matrix of partial derivatives $\nabla_\theta w(\theta)$. Now, to obtain this matrix, we transpose Eq. 3 (Eq. 5) and then take its gradient as follows ($\nabla_\theta$ is the column-vector gradient operator):
where we can show,
Putting it all together and solving for the matrix $\nabla_\theta w(\theta)$, we get,
where
(9) 
The matrix is invertible because we have assumed that the column vectors of $\Phi$ are linearly independent. This is a realistic assumption for constructing the space in which the approximate solution is located.
Now we substitute these equalities into Eq. 5 to obtain $\nabla_\theta w(\theta)$, which will be used in $\nabla_\theta J(\theta)$ to obtain the exact gradient. To do this, first let us consider the following definitions:
(10) 
(11) 
(12) 
and, in addition, we define the diagonal matrix whose diagonal elements form the vector defined above. Using definitions (10)-(12), and given the fact that, conditioned on the state $s_t$, the eligibility trace and the transition $(a_t, s_{t+1})$ are independent, we have the following equations. First, Eq. 10 can be written as:
(13) 
and also we have,
The above gradient is the exact gradient direction in statistical form.
The most important question is to identify the values of the remaining unknown terms, which is our next task. We will show that these terms are state-dependent and cannot be ignored for the case of off-policy learning, unless the problem is on-policy.
5.1 On-Policy Scenario:
Lemma 1.
(On-Policy GTD($\lambda$) and EmphaticTD($\lambda$) Solutions) For the on-policy problem, where $\pi_b = \pi_\theta$ and hence $\rho_t = 1$, $\forall t$, the EmphaticTD($\lambda$), GTD($\lambda$), and TD($\lambda$) solutions are identical.
Proof.
From Eq. 3, it is clear that on-policy GTD($\lambda$) and TD($\lambda$) have the same solution. For the case of EmphaticTD($\lambda$), when $\rho_t = 1$, $\forall t$, the follow-on trace satisfies $F_t = \gamma F_{t-1} + 1$. Thus we can use its convergent value, $F = 1/(1-\gamma)$, in the update for the emphasis; that is, $m = \lambda + (1-\lambda)/(1-\gamma)$. Let us divide Eq. 5 by $m$ and define $e'_t = e_t/m$; then we get $e'_t = \gamma\lambda e'_{t-1} + \phi_t$. Thus the expectation term becomes identical to Eq. 3, finishing the proof. ∎
Theorem 1.
(The Value for the Problem of On-Policy Learning (aka the TD($\lambda$)-Solution))
(15) 
Proof.
The term satisfying Eq. 13 can be replaced by the proposed value, which by definition is equivalent, and since the matrix is invertible (due to the assumption of linearly independent column feature vectors of $\Phi$), the solution is unique, which implies uniqueness for a given $\theta$. Now, we show that the solution is a constant. Using it in Eq. 13, we get
where we have used the on-policy condition $\rho_t = 1$. Thus the constant is the solution for all states and, using Lemma (1), we finish the proof. ∎
5.2 Off-Policy Scenario:
Unlike the on-policy case, for the problem of off-policy learning this term is state-dependent and is not constant (see the proof of Theorem (1)). Estimating its value would require estimating additional parameters, which comes with extra computational complexity. This is mainly due to the fact that we do not have the on-policy criterion $\rho_t = 1$. However, when $\lambda = 1$, that is, for the case of the GTD(1) solution, which is equivalent to the MSE-Solution, we are able to find its exact value,
which enables us to sample from the true gradient of $J(\theta)$. This is one of our main contributions in this paper.
6 The Gradient Direction of $J(\theta)$ with the GTD(1)-Solution
Lemma 2.
(The Value for the Problem of Off-Policy Learning with the GTD(1)-Solution) For the problem of off-policy learning, the value satisfying Eq. 13 is
(16) 
where all the elements of the vector are zero except the last element, which is $1$.
Proof.
Just like the proof of Theorem (1), let us assume the proposed value is the solution; due to uniqueness, if it satisfies Eq. 13 then it must be the unique solution. Please note that here we have made a slight abuse of notation: the same symbol is used both for the defined quantity and for its iterate with subscript $t$. These are two different variables and should not be confused, but for simplicity we have adopted this notation in the paper.
To do the proof, first we can show that the product of the diagonal matrix and the feature matrix takes the claimed form (see supplementary materials). Defining the diagonal matrix with the corresponding diagonal elements, we have
Again, given the fact that the solution is unique, and that the last element of all the feature vectors $\phi(s)$, $\forall s$, has a unit value of $1$, we can see that the last element of the resulting vector is the claimed value; thus the vector with all zero elements except the last element satisfies Eq. 13, finishing the proof. ∎
Theorem 2.
(The Value for the Problem of Off-Policy Learning with the GTD(1)-Solution (aka the MSE-Solution))
(17) 
where the recursive quantity is initialized at the zero vector.
Proof.
The first term in Eq. 5 can be rewritten because, due to MDP properties, conditioning on $s_t$ makes the trace independent of the transition, as the trace depends only on $s_t$ and the past. Now, using the value from Lemma (2), we get
Now we turn to the second term. From Lemma (2), we see the vector is constant and zero except the last element, which is one. Thus the corresponding product simplifies, since the last feature element is $1$. Now, by putting it all together and defining the recursive quantity accordingly, we get the recursive form, thus finishing the proof. ∎
Please note, for $\lambda < 1$, finding a value that can be used in the gradient and that enables sampling was not possible, due to the unknown state-dependent term. However, later we will see that the EmphaticTD($\lambda$) solution solves this problem.
7 Gradient-AC: A Gradient Actor-Critic with GTD(1) as Critic
By sampling from the gradient in Theorem (2), we present the first convergent off-policy gradient Actor-Critic algorithm in Table (1). The Critic uses the GTD(1) update, while the Actor uses a policy-gradient method to maximize the objective function. It is worth mentioning that the complexity of the new algorithm is linear, the same as the classical AC method, with no additional hyper/tuning parameters.
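Since Table (1) is not reproduced in this excerpt, the following Python skeleton is only our schematic paraphrase of such a loop (the feature map, the softmax policy, and the exact form of the Actor trace are assumptions, not the paper's specification): the Critic performs the simple GTD(1) update, and the Actor ascends along a sampled gradient using the same TD error.

```python
import numpy as np

def features(s):
    # Tabular features for 2 states, plus the unit intercept feature.
    v = np.zeros(3)
    v[s] = 1.0
    v[-1] = 1.0
    return v

def softmax_pi(s, theta):
    # theta[a, s]: preference of action a in state s.
    prefs = theta[:, s]
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def grad_log_pi(s, a, theta):
    # Gradient of log pi(a|s) w.r.t. theta for the softmax policy.
    g = np.zeros_like(theta)
    p = softmax_pi(s, theta)
    for b in range(theta.shape[0]):
        g[b, s] = (1.0 if b == a else 0.0) - p[b]
    return g

def gac_step(w, theta, e_w, e_th, s, a, r, s2, rho, gamma, a_w, a_th):
    phi, phi2 = features(s), features(s2)
    delta = r + gamma * w @ phi2 - w @ phi      # TD error
    e_w = rho * (phi + gamma * e_w)             # Critic trace (GTD(1))
    w = w + a_w * delta * e_w                   # Critic update
    e_th = rho * (grad_log_pi(s, a, theta) + gamma * e_th)  # Actor trace
    theta = theta + a_th * delta * e_th         # Actor update
    return w, theta, e_w, e_th
```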
7.1 Convergence Analysis of GAC
In this section we provide a convergence analysis for the Gradient-AC algorithm. Since the Actor update follows the true gradient direction, we use existing results in the literature to avoid repetition. Before providing the main theorem, let us consider the following assumptions. The first set of assumptions concerns the data sequence and the parametrized policy:

, such that (s.t.) is stationary and , ;

The Markov processes underlying the state and state-action sequences are in steady state, irreducible and aperiodic, with stationary distributions;

, and s.t. holds almost surely (a.s.).
Assumptions on the parametrized policy are as follows:

For every state-action pair $(s,a)$, the mapping $\theta \mapsto \pi_\theta(a|s)$ is twice differentiable;

and $\pi_\theta(a|s)$ has a bounded derivative with respect to $\theta$. Note that $\|\cdot\|$ denotes the Euclidean norm.

Features are bounded, as in Konda & Tsitsiklis (2003), and we follow the same noise-property conditions.
For the convergence analysis we follow the two-timescale approach (Konda & Tsitsiklis, 2003; Borkar, 1997; Borkar, 2008; Bhatnagar et al., 2009). We will use the following step-size conditions for the convergence proof:

$\sum_t \alpha_t = \sum_t \beta_t = \infty$, $\sum_t\big(\alpha_t^2 + \beta_t^2\big) < \infty$, and $\alpha_t/\beta_t \to 0$.
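Illustrative step-size schedules satisfying such two-timescale conditions (these particular exponents are our assumptions, not from the paper): both are unsummable with summable squares, and the Actor step size goes to zero faster than the Critic's, so the Actor moves on the slower timescale.

```python
def alpha(t):
    # Actor (slow timescale) step size: sum diverges, sum of squares converges.
    return 1.0 / (1.0 + t)

def beta(t):
    # Critic (fast timescale) step size; alpha(t)/beta(t) -> 0 as t grows.
    return 1.0 / (1.0 + t) ** (2.0 / 3.0)
```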
The actor update can be written in the following form:
(18) 
where
Note, $\Gamma$ projects its argument onto a compact set $C$ with smooth boundary; that is, if an iterate leaves $C$, it is projected back to the closest (or some convenient) point in $C$. Here, we choose $C$ to be the largest possible compact set. We consider the ordinary differential equation (ODE) approach for our convergence proof and show that our algorithm converges to the set of all asymptotically stable solutions of the following ODE:
(19) 
where the right-hand side is the projected gradient field: at interior points of $C$ the projection is the identity; otherwise, it projects onto the tangent space of the boundary of $C$ at that point.
Theorem 3.
(Convergence of Gradient-AC) Under the conditions listed in this section, as $t \to \infty$, the Actor parameters converge to the set of all asymptotically stable solutions of (19) with probability 1.
Proof.
The proof follows exactly a two-timescale (in step sizes) convergence analysis. We use the results in Borkar (1997; 2008, see Lemma 1 and Theorem 2, page 66; also see Bhatnagar et al., 2009; Konda & Tsitsiklis, 2003). The expected update of the Actor, according to Theorem (2), is exactly the gradient direction, and the Critic, GTD(1), is a true gradient method. As such, the proof follows, and for brevity we have omitted the repetition. ∎
8 The Gradient Direction with the EmphaticTD($\lambda$) Solution
With the EmphaticTD($\lambda$) solution, we now aim to optimize its corresponding objective. Again, one can show that the vector with all zero elements except the last element, with value $1$, satisfies the corresponding equation (see Eq. 12 and Eq. 13), and as a result Eq. 5 becomes
(20) 
Lemma 3.
(The Value for the EmphaticTD($\lambda$) Solution) Given the EmphaticTD($\lambda$) solution, from the equation satisfying Eq. 13, we have