# Convergent Off-Policy Actor-Critic

## Abstract

We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear, and the actor can be nonlinear.


## 1 Introduction

The policy gradient theorem and the corresponding actor-critic algorithm (Sutton et al., 2000; Konda, 2002) have recently enjoyed great success in various domains, e.g., defeating the top human player in Go (Silver et al., 2016) and achieving human-level control in Atari games (Mnih et al., 2016). However, the canonical actor-critic algorithm is on-policy and hence suffers from significant data inefficiency (e.g., see Mnih et al. (2016)). To address this issue, Degris et al. (2012) propose the Off-Policy Actor-Critic (Off-PAC) algorithm. Off-PAC has been extended in various ways, e.g., off-policy Deterministic Policy Gradient (DPG, Silver et al. 2014), Deep Deterministic Policy Gradient (DDPG, Lillicrap et al. 2015), Actor Critic with Experience Replay (ACER, Wang et al. 2016), off-policy Expected Policy Gradient (EPG, Ciosek & Whiteson 2017), TD3 (Fujimoto et al., 2018), and IMPALA (Espeholt et al., 2018). Off-PAC and its extensions have enjoyed great empirical success, as has the canonical on-policy actor-critic algorithm. There is, however, a theoretical gap between the canonical on-policy actor-critic and Off-PAC. Namely, on-policy actor-critic has a two-timescale convergence analysis under function approximation (Konda, 2002), but Off-PAC is convergent only in the tabular setting (Degris et al., 2012). While there have been several attempts to close this gap (Imani et al., 2018; Maei, 2018; Zhang et al., 2019; Liu et al., 2019), none of them is convergent under function approximation without imposing strong assumptions (e.g., assuming the critic converges).

In this paper, we close this long-standing theoretical gap via the Convergent Off-Policy Actor-Critic (COF-PAC) algorithm,
the first provably convergent two-timescale off-policy actor-critic algorithm.
COF-PAC builds on Actor-Critic with Emphatic weightings (ACE, Imani et al. 2018), which reweights Off-PAC updates with emphasis through the followon trace (Sutton et al., 2016).
The emphasis accounts for off-policy learning by adjusting the state distribution, and the followon trace approximates the emphasis (see Sutton et al. 2016).^{1}

Instead of using the followon trace, we propose a novel stochastic approximation algorithm, Gradient Emphasis Learning (GEM), to approximate the emphasis in COF-PAC, inspired by Gradient TD methods (GTD, Sutton et al. 2009b, a), Emphatic TD methods (ETD, Sutton et al. 2016), and reversed TD methods (Wang et al., 2007, 2008; Hallak & Mannor, 2017; Gelada & Bellemare, 2019). We prove the almost sure convergence of GEM, as well as other GTD-style algorithms, with linear function approximation under a slowly changing target policy. With the help of GEM, we prove the convergence of COF-PAC, where the policy parameterization can be nonlinear, and the convergence level is the same as that of the on-policy actor-critic (Konda, 2002).

## 2 Background

We use $\|x\|_\Xi \doteq \sqrt{x^\top \Xi x}$ to denote the norm induced by a positive definite matrix $\Xi$, which induces the matrix norm $\|A\|_\Xi \doteq \sup_{\|x\|_\Xi = 1} \|Ax\|_\Xi$. To simplify notation, we write $\|\cdot\|$ for $\|\cdot\|_I$, where $I$ is the identity matrix. All vectors are column vectors. We use “0” to denote an all-zero vector and an all-zero matrix when the dimension can be easily deduced from the context, and similarly for “1”. When it causes no confusion, we use vectors and functions interchangeably. Proofs are in the appendix.

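We consider an infinite-horizon Markov Decision Process (MDP) with a finite state space $\mathcal{S}$ with $|\mathcal{S}|$ states, a finite action space $\mathcal{A}$ with $|\mathcal{A}|$ actions, a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a discount factor $\gamma \in [0, 1)$. At time step $t$, an agent at a state $S_t$ takes an action $A_t$ according to $\mu(\cdot | S_t)$, where $\mu$ is a fixed behavior policy. The agent then proceeds to a new state $S_{t+1}$ according to $p(\cdot | S_t, A_t)$ and gets a reward $R_{t+1} \doteq r(S_t, A_t)$. In the off-policy setting, the agent is interested in a target policy $\pi$. We use $G_t \doteq \sum_{k=1}^{\infty} \gamma^{k-1} R_{t+k}$ to denote the return at time step $t$ when following $\pi$ instead of $\mu$. Consequently, we define the state value function $v_\pi$ and the state-action value function $q_\pi$ as $v_\pi(s) \doteq \mathbb{E}_\pi[G_t | S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$. We use $\rho(s, a) \doteq \pi(a | s) / \mu(a | s)$ to denote the importance sampling ratio and define $\rho_t \doteq \rho(S_t, A_t)$ (Assumption 1 below ensures $\rho$ is well-defined). We sometimes write $\rho$ as $\rho_\pi$ to emphasize its dependence on $\pi$.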

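Policy Evaluation: We consider linear function approximation for policy evaluation. Let $x: \mathcal{S} \to \mathbb{R}^{K_1}$ be the state feature function, and $\bar{x}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{K_2}$ be the state-action feature function. We use $X \in \mathbb{R}^{|\mathcal{S}| \times K_1}$ and $\bar{X} \in \mathbb{R}^{N_{sa} \times K_2}$ to denote feature matrices, where $N_{sa} \doteq |\mathcal{S}| |\mathcal{A}|$, each row of $X$ is $x(s)^\top$, and each row of $\bar{X}$ is $\bar{x}(s, a)^\top$. We use the shorthand $x_t \doteq x(S_t)$ and $\bar{x}_t \doteq \bar{x}(S_t, A_t)$. Let $d_\mu \in \mathbb{R}^{|\mathcal{S}|}$ be the stationary distribution of $\mu$; we define $\bar{d}_\mu \in \mathbb{R}^{N_{sa}}$ where $\bar{d}_\mu(s, a) \doteq d_\mu(s) \mu(a | s)$. We define $D \doteq \mathrm{diag}(d_\mu)$ and $\bar{D} \doteq \mathrm{diag}(\bar{d}_\mu)$. Assumption 1 below ensures $d_\mu$ exists and $D$ is invertible, as well as $\bar{D}$. Let $P_\pi$ be the state transition matrix and $\bar{P}_\pi$ be the state-action transition matrix, i.e., $P_\pi(s, s') \doteq \sum_a \pi(a | s) p(s' | s, a)$ and $\bar{P}_\pi((s, a), (s', a')) \doteq p(s' | s, a) \pi(a' | s')$. We use $v \doteq X\nu$, $q \doteq \bar{X}u$, and $m \doteq Xw$ to denote estimates for $v_\pi$, $q_\pi$, and $m_\pi$ respectively, where $\nu$, $u$, and $w$ are learnable parameters.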

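We first consider GTD methods. For a vector $v \in \mathbb{R}^{|\mathcal{S}|}$, we define a projection $\Pi v \doteq X y^*$, where $y^* \doteq \arg\min_y \|Xy - v\|_D^2$. We have $\Pi = X (X^\top D X)^{-1} X^\top D$ (Assumption 2 below ensures the existence of $(X^\top D X)^{-1}$). Similarly, for a vector $q \in \mathbb{R}^{N_{sa}}$, we define a projection $\bar{\Pi} \doteq \bar{X} (\bar{X}^\top \bar{D} \bar{X})^{-1} \bar{X}^\top \bar{D}$. The value function $v_\pi$ is the unique fixed point of the Bellman operator $\mathcal{T} v \doteq r_\pi + \gamma P_\pi v$, where $r_\pi(s) \doteq \sum_a \pi(a | s) r(s, a)$. Similarly, $q_\pi$ is the unique fixed point for the operator $\bar{\mathcal{T}} q \doteq \bar{r} + \gamma \bar{P}_\pi q$, where $\bar{r} \in \mathbb{R}^{N_{sa}}$ and $\bar{r}(s, a) \doteq r(s, a)$. GTD2 (Sutton et al., 2009a) learns the estimate $v$ for $v_\pi$ by minimizing $\|\Pi \mathcal{T} v - v\|_D^2$. GQ(0) (Maei, 2011) learns the estimate $q$ for $q_\pi$ by minimizing $\|\bar{\Pi} \bar{\mathcal{T}} q - q\|_{\bar{D}}^2$. Besides GTD methods, ETD methods are also used for off-policy policy evaluation. ETD(0) updates $\nu$ as

$$M_t \doteq i(S_t) + \gamma \rho_{t-1} M_{t-1}, \tag{1}$$

$$\nu_{t+1} \leftarrow \nu_t + \alpha M_t \rho_t \left( R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t \right) x_t, \tag{2}$$

where $\alpha$ is a step size, $M_t$ is the followon trace, and $i: \mathcal{S} \to [0, \infty)$ is the interest function reflecting the user’s preference for different states (Sutton et al., 2016).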
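The ETD(0) recursion can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard form of ETD(0) from Sutton et al. (2016): the followon trace is $M_t = i(S_t) + \gamma \rho_{t-1} M_{t-1}$ and the weights move along the emphasis-weighted, importance-corrected TD error. The function name `etd0_step` and all argument names are hypothetical.

```python
import numpy as np


def etd0_step(nu, M_prev, rho_prev, x_t, x_tp1, r_tp1, rho_t, i_t, alpha, gamma):
    """One ETD(0) update (assumed standard form); returns (new weights, new trace)."""
    # Followon trace: M_t = i(S_t) + gamma * rho_{t-1} * M_{t-1}
    M_t = i_t + gamma * rho_prev * M_prev
    # TD error of the linear value estimate x^T nu
    td_error = r_tp1 + gamma * (x_tp1 @ nu) - x_t @ nu
    # Emphasis-weighted, importance-corrected semi-gradient step
    nu_new = nu + alpha * M_t * rho_t * td_error * x_t
    return nu_new, M_t
```

Because the trace is a single scalar that multiplies every update, any spike in the product of importance sampling ratios feeds directly into the step, which is the variance issue discussed for the followon trace.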

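Control: Off-policy actor-critic methods (Degris et al., 2012; Imani et al., 2018) aim to maximize the excursion objective $J(\pi) \doteq \sum_s d_\mu(s) i(s) v_\pi(s)$ by adapting the target policy $\pi$. We assume $\pi$ is parameterized by $\theta \in \mathbb{R}^K$, and use $\theta$, $\pi$, and $\pi_\theta$ interchangeably in the rest of this paper when it causes no confusion. All gradients are taken w.r.t. $\theta$ unless otherwise specified. According to the off-policy policy gradient theorem (Imani et al., 2018), the policy gradient is $\nabla J(\theta) = \sum_s \bar{m}(s) \sum_a q_\pi(s, a) \nabla \pi(a | s)$, where $\bar{m} \doteq (I - \gamma P_\pi^\top)^{-1} D i$. We rewrite $\bar{m}$ as $D D^{-1} (I - \gamma P_\pi^\top)^{-1} D i$ and define

$$m_\pi \doteq D^{-1} (I - \gamma P_\pi^\top)^{-1} D i. \tag{3}$$

We therefore have $\bar{m} = D m_\pi$, yielding

$$\nabla J(\theta) = \mathbb{E}_{s \sim d_\mu, a \sim \mu(\cdot | s)} \left[ m_\pi(s) \psi_\theta(s, a) q_\pi(s, a) \right], \tag{4}$$

where
$\psi_\theta(s, a) \doteq \rho_\theta(s, a) \nabla \log \pi(a | s)$.
We refer to $m_\pi$ as the emphasis in the rest of this paper.
To compute $\nabla J(\theta)$, we need $m_\pi$ and $q_\pi$, to which we typically do not have access.
Degris et al. (2012) ignore the emphasis and update $\theta$ as $\theta_{t+1} \leftarrow \theta_t + \alpha \rho_t q_\pi(S_t, A_t) \nabla \log \pi(A_t | S_t)$ in Off-PAC,
which is theoretically justified only in the tabular setting.^{2}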
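For a small tabular MDP the emphasis can be computed exactly, which is useful as a reference when checking learned estimates. The sketch below assumes the emphasis takes the form $m_\pi = D^{-1}(I - \gamma P_\pi^\top)^{-1} D i$ with $D = \mathrm{diag}(d_\mu)$, as in the emphatic weighting of Sutton et al. (2016) and Imani et al. (2018). The helper names `stationary_distribution` and `emphasis` are hypothetical.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    # The eigenvalue 1 has the largest real part for a stochastic matrix.
    d = np.real(evecs[:, np.argmax(np.real(evals))])
    return d / d.sum()


def emphasis(P_pi, d_mu, interest, gamma):
    """m_pi = D^{-1} (I - gamma P_pi^T)^{-1} D i (assumed form), D = diag(d_mu)."""
    n = len(d_mu)
    # Solve (I - gamma P_pi^T) m_bar = D i instead of forming the inverse.
    m_bar = np.linalg.solve(np.eye(n) - gamma * P_pi.T, d_mu * interest)
    return m_bar / d_mu
```

A quick sanity check: in the on-policy case with unit interest, $P_\pi^\top d_\mu = d_\mu$, so the emphasis is $1 / (1 - \gamma)$ in every state.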

### 2.1 Assumptions

###### Assumption 1.

The Markov chain induced by the behavior policy is ergodic, and .

###### Assumption 2.

The matrices are nonsingular.

###### Assumption 3.

There exists a constant such that

(5) | |||

(6) | |||

(7) |

###### Remark 1.

###### Lemma 2.

Under Assumption 1,

## 3 Gradient Emphasis Learning

Motivation: The followon trace has two main disadvantages. First, when we use to approximate (e.g., in ACE), the approximation error tends to be large. is a random variable and although its conditional expectation converges to under a fixed target policy , itself can have unbounded variance (Sutton et al., 2016), indicating the approximation error can be unbounded. Moreover, in our actor-critic setting, where keeps changing, it is not clear whether this convergence holds or not. Theoretically, this large approximation error may preclude a convergent analysis for ACE. Empirically, this large variance makes ETD hard to use in practice. For example, as pointed out in Sutton & Barto (2018), “it is nigh impossible to get consistent results in computational experiments” (for ETD) in Baird’s counterexample (Baird, 1995), a common off-policy learning benchmark.

Second, it is hard to query the emphasis for a given state using the followon trace. As is only a scalar, it is almost memoryless. To obtain an emphasis estimation for a given state using the followon trace, we have to simulate a trajectory long enough to go into the mixing stage and visit that particular state, which is typically difficult in offline training. This lack of memory is also a cause of the large approximation error.

In this paper, we propose a novel stochastic approximation algorithm, Gradient Emphasis Learning (GEM), to learn using function approximation. GEM can track the true emphasis under a changing target policy .

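Algorithm Design: We consider linear function approximation, and our estimate for the emphasis $m_\pi$ is $m \doteq Xw$, where $w \in \mathbb{R}^{K_1}$ are the learnable parameters. For a vector $y \in \mathbb{R}^{|\mathcal{S}|}$, we define an operator $\hat{\mathcal{T}}$ as $\hat{\mathcal{T}} y \doteq i + \gamma D^{-1} P_\pi^\top D y$.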

###### Proposition 1.

$\hat{\mathcal{T}}$ is a contraction mapping w.r.t. some weighted maximum norm and $m_\pi$ is its unique fixed point.

The proof involves arguments from Bertsekas & Tsitsiklis (1989), where the choice of the weighted maximum norm depends on . Our operator is a generalization of the discounted COP-TD operator (Gelada & Bellemare, 2019), where and is a scalar similar to . They show that is contractive only when is small enough. Here our Proposition 1 proves contraction for any . Although and are similar, they are designed for different purposes. Namely, is designed to learn a density ratio, while is designed to learn the emphasis. Emphasis generalizes density ratio in that users are free to choose the interest in .
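The fixed-point property in Proposition 1 can be checked numerically on a small example. The sketch below assumes the operator has the form $y \mapsto i + \gamma D^{-1} P_\pi^\top D y$ with $D = \mathrm{diag}(d_\mu)$ and simply iterates it from zero. The function names and the tolerance are hypothetical choices for illustration.

```python
import numpy as np


def emphasis_operator(y, P_pi, d_mu, interest, gamma):
    """Apply y -> i + gamma * D^{-1} P_pi^T D y (assumed form), D = diag(d_mu)."""
    return interest + gamma * (P_pi.T @ (d_mu * y)) / d_mu


def iterate_to_fixed_point(P_pi, d_mu, interest, gamma, tol=1e-12, max_iters=10_000):
    """Repeatedly apply the operator until successive iterates stop changing."""
    y = np.zeros_like(interest, dtype=float)
    for _ in range(max_iters):
        y_next = emphasis_operator(y, P_pi, d_mu, interest, gamma)
        if np.max(np.abs(y_next - y)) < tol:
            return y_next
        y = y_next
    return y
```

The iteration converges for any $\gamma \in [0, 1)$ because $\gamma D^{-1} P_\pi^\top D$ is similar to $\gamma P_\pi^\top$, whose spectral radius is $\gamma$. The limit coincides with the closed form obtained by solving $(I - \gamma P_\pi^\top) D y = D i$.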

Given Proposition 1, it is tempting to compose a semi-gradient update rule for updating analogously to discounted COP-TD, where the incremental update for is . This semi-gradient update, however, can diverge for the same reason as the divergence of off-policy linear TD: the key matrix is not guaranteed to be negative semi-definite (see Sutton et al. (2016)). Motivated by GTD methods, we seek an approximate solution that satisfies via minimizing a projected objective , where . For reasons that will soon be clear, we also include ridge regularization, yielding the objective

(10) |

where is the weight of the ridge term. We can now compute following a similar routine as Sutton et al. (2009a). When sampling , we use another set of parameters to address the double sampling issue as proposed by Sutton et al. (2009a). See Sutton et al. (2009a) for details of the derivation. This derivation, however, provides only an intuition behind GEM and has little to do with the actual convergence proof for two reasons. First, in an actor-critic setting, keeps changing, as does . Second, we consider sequential Markovian data . The proof in Sutton et al. (2009a) assumes i.i.d. data, i.e., each state is sampled from independently. Compared with the i.i.d. assumption, the Markovian assumption is more practical in RL problems. We now present the GEM algorithm, which updates and recursively as

GEM: | (11) | |||

(12) |

(14) | ||||

(15) |

where is a constant, is a deterministic sequence satisfying the Robbins-Monro condition (Robbins & Monro, 1951), i.e., is non-increasing positive and . Similar to Sutton et al. (2009a), we define and rewrite the GEM update as

(16) |

where . With , we define

(17) | ||||

(18) | ||||

(19) |

Let , the limiting behavior of GEM is then governed by

(20) | ||||

(21) | ||||

(22) |

Readers familiar with GTD2 (Sutton et al., 2009a) may find that the in Eq. (21) is different from its counterpart in GTD2 in that
the bottom right block of is while that block in GTD2 is .
This results from the ridge regularization in the objective ,
and this block has to be strictly positive definite.^{3}
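As an illustration of the GEM recursion, the sketch below implements one step of a GTD2-style update for the ridge-regularized projected objective, driven by a "reversed" TD error that looks backward from $S_{t+1}$ to $S_t$. The exact update is an assumption derived from that objective rather than a transcription of Eqs. (11)-(15); the names `gem_step`, `kappa`, and `delta_bar` are hypothetical.

```python
import numpy as np


def gem_step(w, kappa, x_t, x_tp1, rho_t, i_tp1, alpha, gamma, eta):
    """One assumed GEM step; returns (new emphasis weights, new auxiliary weights)."""
    # Reversed TD error: the target at S_{t+1} bootstraps from the estimate at S_t.
    delta_bar = i_tp1 + gamma * rho_t * (x_t @ w) - x_tp1 @ w
    # Auxiliary weights track the projected error, as in GTD2.
    kappa_new = kappa + alpha * (delta_bar - x_tp1 @ kappa) * x_tp1
    # Gradient-corrected step on w; eta is the ridge weight.
    w_new = w + alpha * ((x_tp1 - gamma * rho_t * x_t) * (x_tp1 @ kappa) - eta * w)
    return w_new, kappa_new
```

Both parameter vectors share one step size here, matching the single-timescale description in the text. The `- eta * w` term is what the ridge regularization contributes to the update.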

As we consider an actor-critic setting where the policy is changing every step, we pose the following condition on the changing rate of :

###### Condition 1.

(Assumption 3.1(3) in Konda (2002)) The random sequence satisfies , where is some nonnegative process with bounded moments and is a nonincreasing deterministic sequence satisfying the Robbins-Monro condition such that for some .

When we consider a policy evaluation setting where is fixed, this condition is satisfied automatically. We show later that this condition is also satisfied in COF-PAC. We now characterize the asymptotic behavior of GEM.

###### Theorem 1.

By simple block matrix inversion, Theorem 1 implies

(23) | ||||

(24) |
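The effect of the ridge term on the limiting point can be explored numerically. The sketch below assumes the limit is the stationary point of the expected GTD2-style update for the regularized objective, namely $w^* = (A^\top C^{-1} A + \eta I)^{-1} A^\top C^{-1} X^\top D i$ with $C = X^\top D X$ and $A = X^\top (D - \gamma P_\pi^\top D) X$. These matrix definitions and the function name `gem_fixed_point` are assumptions made for illustration.

```python
import numpy as np


def gem_fixed_point(X, P_pi, d_mu, interest, gamma, eta):
    """Stationary point of the assumed expected GEM update with ridge weight eta."""
    D = np.diag(d_mu)
    C = X.T @ D @ X                             # feature covariance under d_mu
    A = X.T @ (D - gamma * P_pi.T @ D) @ X      # "reversed" key matrix
    b = X.T @ D @ interest
    C_inv_A = np.linalg.solve(C, A)
    C_inv_b = np.linalg.solve(C, b)
    k = X.shape[1]
    return np.linalg.solve(A.T @ C_inv_A + eta * np.eye(k), A.T @ C_inv_b)
```

With tabular features and $\eta = 0$ this recovers the exact emphasis, while $\eta > 0$ shrinks the solution toward zero. That shrinkage is the bias paid for strict positive definiteness.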

Konda (2002) provides a general theorem for stochastic approximation algorithms to track a slowly changing linear system. To prove Theorem 1, we verify that GEM indeed satisfies all the assumptions (listed in the appendix) in Konda’s theorem. Particularly, that theorem requires to be strictly positive definite, which is impossible if . This motivates the introduction of the ridge regularization in defined in Eq. (10). Namely, the ridge regularization is essential in the convergence of GEM under a slowly changing target policy. Introducing regularization in the GTD objective is not new. Mahadevan et al. (2014) introduce the proximal GTD learning framework to integrate GTD algorithms with first-order optimization-based regularization via saddle-point formulations and proximal operators. Yu (2017) introduces a general regularization term for improving robustness. Du et al. (2017) introduce ridge regularization to improve the convexity of the objective. However, their analysis is conducted with the saddle-point formulation of the GTD objective (Liu et al., 2015; Macua et al., 2015) and requires a fixed target policy, which is impractical in our control setting. We are the first to establish the tracking ability of GTD-style algorithms under a slowly changing target policy by introducing ridge regularization, which ensures the driving term is strictly positive definite. Without this ridge regularization, we are not aware of any existing work establishing this tracking ability. Note our arguments do not apply when and is changing, which is an open problem. However, if and is fixed, we can use arguments from Yu (2017) to prove convergence. In this scenario, assuming is nonsingular, converges to and we have

###### Proposition 2.

.

Similarly, we introduce ridge regularization in the -value analogue of GTD2, which we call GQ2. GQ2 updates recursively as

GQ2: | (25) | |||

(26) | ||||

(27) | ||||

(28) |

Similarly, we define ,

(29) | ||||

(30) |

###### Theorem 2.

Similarly, we have

(31) | ||||

(32) |

Comparing the update rules of GEM and GQ2, it now becomes clear that GEM is a “reversed” GQ2. In particular, the in GEM is the “transpose” of the in GQ2. Such reversed TD methods have been explored by Hallak & Mannor (2017) and Gelada & Bellemare (2019), both of which rely on the operator introduced by Hallak & Mannor (2017). Previous methods implement this operator under the semi-gradient paradigm (Sutton, 1988). By contrast, GEM is a full-gradient method. The techniques in GEM can be applied immediately to the discounted COP-TD (Gelada & Bellemare, 2019) to improve its convergence from a small enough to any . Applying a GEM-style update to COP-TD (Hallak & Mannor, 2017) is still an open problem as COP-TD involves a nonlinear projection, whose gradient is hard to compute.

## 4 Convergent Off-Policy Actor-Critic

To estimate , we use GEM and GQ2 to estimate and respectively, yielding COF-PAC (Eq (41)). In COF-PAC, we require both and to be deterministic and nonincreasing and satisfy the Robbins-Monro condition. Furthermore, there exists some such that . These are common stepsize conditions in two-timescale algorithms (see Borkar (2009)). Like Konda (2002), we also use adaptive stepsizes and to ensure changes slowly enough. We now pose the same condition on as Konda (2002). There exist constants satisfying such that for any vector , the following properties hold: , . Konda (2002) provides an example for . Let be some constant, then we define as , where is the indicator function. It is easy to verify that the above conditions on stepsizes (), together with Assumptions (1,3), ensure that is bounded. Condition 1 on the policy changing rate, therefore, indeed holds. Consequently, Theorems 1 and 2 hold when the target policy is updated according to COF-PAC.
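To make the role of the adaptive stepsize concrete, the sketch below implements one scaling function with the indicator structure described above: it returns 1 when the norm of its argument is below a threshold and decays like the inverse norm otherwise. The specific form $\Gamma(d) = \mathbb{I}_{\|d\| < C_0} + \frac{1 + C_0}{1 + \|d\|} \mathbb{I}_{\|d\| \geq C_0}$ and the default `c0` are assumptions for illustration; the name `gamma_scale` is hypothetical.

```python
import numpy as np


def gamma_scale(d, c0=1.0):
    """Assumed stepsize scaling: 1 for small ||d||, roughly 1 / ||d|| for large ||d||."""
    norm = float(np.linalg.norm(d))
    if norm < c0:
        return 1.0
    # Continuous at ||d|| = c0 and keeps ||d|| * Gamma(d) below 1 + c0.
    return (1.0 + c0) / (1.0 + norm)
```

Multiplying the actor step by such a factor keeps the product of the critic's norm and the scaling bounded, so a large critic estimate cannot make the policy parameters jump.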

COF-PAC: | (33) | |||

(34) | ||||

(35) | ||||

(36) | ||||

(37) | ||||

(38) | ||||

(39) | ||||

(40) | ||||

(41) |

We now characterize the asymptotic behavior of COF-PAC. The limiting policy update in COF-PAC is

(42) | ||||

(43) |

The bias introduced by the estimates and is , which determines the asymptotic behavior of COF-PAC:

###### Theorem 3.

The proof is inspired by Konda (2002). According to Theorem 3, COF-PAC reaches the same convergence level as the canonical on-policy actor-critic (Konda, 2002). Together with the fact that is Lipschitz continuous and is diminishing, it is easy to see will eventually remain in the neighborhood in Theorem 3 for an arbitrarily long time. When is close to in the sense of the following Assumption 4(a), we can provide an explicit bound for the bias . However, failing to satisfy Assumption 4 does not necessarily imply the bias is large. The bound here is indeed loose and is mainly to provide an intuition for the source of the bias.

###### Assumption 4.

(a) The following two matrices are positive semidefinite:

(44) | ||||

(45) |

(b) .

(c) The Markov chain induced by is ergodic.

###### Remark 2.

###### Proposition 3.

The bias comes from the bias of both the estimate and the estimate. The bound of the estimate follows directly from Kolter (2011). The proof from Kolter (2011), however, cannot be applied to analyze the estimate until Lemma 2 is established.

Compatible Features: One possible approach to eliminate the bias is to consider compatible features as in the canonical on-policy actor-critic (Sutton et al., 2000; Konda, 2002). Let be a subspace and be an inner product, which induces a norm . We define a projection as . For any vector and a vector , we have by Pythagoras. Based on this equality, Konda (2002) designs compatible features for an on-policy actor-critic. Inspired by Konda (2002), we now design compatible features for COF-PAC.
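The projection argument behind compatible features can be illustrated numerically. The sketch below uses a weighted inner product $\langle a, b \rangle = \sum_k c_k a_k b_k$ with positive weights $c_k$, projects a vector onto the span of a few basis vectors, and exposes two standard consequences of orthogonal projection: inner products with elements of the subspace are preserved, and the Pythagorean identity holds. The names `weighted_inner` and `project` and the choice of weights are hypothetical.

```python
import numpy as np


def weighted_inner(a, b, weights):
    """Inner product <a, b> = sum_k weights[k] * a[k] * b[k]."""
    return float(np.sum(weights * a * b))


def project(y, basis, weights):
    """Project y onto span(columns of basis) w.r.t. the weighted inner product."""
    W = np.diag(weights)
    # Weighted normal equations: (B^T W B) coef = B^T W y
    coef = np.linalg.solve(basis.T @ W @ basis, basis.T @ W @ y)
    return basis @ coef
```

For any `y_bar` in the subspace, `weighted_inner(y, y_bar, weights)` equals `weighted_inner(project(y, basis, weights), y_bar, weights)`. This is why an estimate only needs to be accurate after projection onto the relevant subspace.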

Let be estimates for . With a slight abuse of notation, we define

(48) |

which is the limiting policy update. The bias can then be decomposed as , where

(49) | |||

(50) | |||

(51) | |||

(52) |

For an , we consider , where is the -th element of . Let denote the subspace in spanned by . We define an inner product . Then we can write , the -th element of , as

(53) |

If our estimate satisfies , we have . This motivates learning the estimate via minimizing . One possibility is to consider linear function approximation for and use as features. Similarly, we consider the subspace in spanned by and define the inner product according to . We then aim to learn via minimizing . Again, we can consider linear function approximation with features . In general, any features whose feature space contains or are compatible features. Due to the change of