# Generalized Off-Policy Actor-Critic

###### Abstract

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used *excursion objective*, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an *emphatic* approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

## 1 Introduction

Reinforcement learning (RL) algorithms based on the policy gradient theorem (Sutton et al., 2000; Marbach and
Tsitsiklis, 2001) have recently enjoyed great success in various domains, e.g., achieving human-level performance on Atari games (Mnih et al., 2016). The original policy gradient theorem is on-policy and used to optimize the *on-policy objective*. However, in many cases, we would prefer to learn off-policy to improve data efficiency (Lin, 1992) and exploration (Osband et al., 2018). To this end, the Off-Policy Policy Gradient (OPPG) Theorem (Degris et al., 2012; Maei, 2018; Imani et al., 2018) was developed and has been widely used (Silver et al., 2014; Lillicrap et al., 2015; Wang et al., 2016; Gu et al., 2017; Ciosek and Whiteson, 2017; Espeholt et al., 2018).

Ideally, an off-policy algorithm should optimize the off-policy analogue of the on-policy objective. In the continuing RL setting, this analogue would be the performance of the target policy in expectation w.r.t. the stationary distribution of the target policy, which is referred to as the alternative life objective (Ghiassian et al., 2018). This objective corresponds to the performance of the target policy when deployed. Previously, OPPG optimizes a different objective, the performance of the target policy in expectation w.r.t. the stationary distribution of the behavior policy. This objective is referred to as the excursion objective (Ghiassian et al., 2018), as it corresponds to the excursion setting (Sutton et al., 2016). Unfortunately, the excursion objective can be misleading about the performance of the target policy when deployed, as we illustrate in Section 3.

It is infeasible to optimize the alternative life objective directly in the off-policy continuing setting. Instead, we propose to optimize the counterfactual objective, which approximates the alternative life objective. In the excursion setting, an agent in the stationary distribution of the behavior policy considers a hypothetical excursion that follows the target policy. The return from this hypothetical excursion is an indicator of the performance of the target policy. The excursion objective measures this return w.r.t. the stationary distribution of the behavior policy, using samples generated by executing the behavior policy. By contrast, evaluating the alternative life objective requires samples from the stationary distribution of the target policy, to which the agent does not have access. In the counterfactual objective, we use a new parameter to control how counterfactual the objective is, akin to Gelada and Bellemare (2019). With , the counterfactual objective uses the stationary distribution of the behavior policy to measure the performance of the target policy, recovering the excursion objective. With , the counterfactual objective is fully decoupled from the behavior policy and uses the stationary distribution of the target policy to measure the performance of the target policy, recovering the alternative life objective. As in the excursion objective, the excursion is never actually executed and the agent always follows the behavior policy.

We make two contributions in this paper. First, we prove the Generalized Off-Policy Policy Gradient (GOPPG) Theorem, which gives the policy gradient of the counterfactual objective. Second, using an emphatic approach (Sutton et al., 2016) to compute an unbiased sample for this policy gradient, we develop the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We evaluate Geoff-PAC empirically in challenging robot simulation tasks with neural network function approximators. Geoff-PAC outperforms the actor-critic algorithms proposed by Degris et al. (2012); Imani et al. (2018), and to our best knowledge, Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

## 2 Background

We use a time-indexed capital letter (e.g., ) to denote a random variable. We use a bold capital letter (e.g., X) to denote a matrix and a bold lowercase letter (e.g., x) to denote a column vector. If is a scalar function defined on a finite set , we use its corresponding bold lowercase letter to denote its vector form, i.e., . We use I to denote the identity matrix and 1 to denote an all-one column vector.

We consider an infinite horizon MDP (Puterman, 2014) consisting of a finite state space , a finite action space , a bounded reward function and a transition kernel . We consider a transition-based discount function (White, 2017) for unifying continuing tasks and episodic tasks. At time step , an agent at state takes an action according to a policy . The agent then proceeds to a new state according to and gets a reward satisfying . The return of at time step is , where and . We use to denote the value function of , which is defined as . Like White (2017), we assume exists for all . We use to denote the state-action value function of . We use to denote the transition matrix induced by , i.e., . We assume the chain induced by is ergodic and use to denote its unique stationary distribution. We define .

In the off-policy setting, an agent aims to learn a target policy but follows a behavior policy . We use the same assumption of coverage as Sutton and Barto (2018), i.e., . We assume the chain induced by is ergodic and use to denote its stationary distribution. Similarly, . We define , and .

Typically, there are two kinds of tasks in RL, prediction and control.

Prediction: In prediction, we are interested in finding the value function of a given policy . Temporal Difference (TD) learning (Sutton, 1988) is perhaps the most powerful algorithm for prediction. TD enjoys convergence guarantee in both on- and off-policy tabular settings. TD can also be combined with linear function approximation. The update rule for on-policy linear TD is , where is a step size and is an incremental update. Here we use to denote an estimate of parameterized by . Tsitsiklis and Van Roy (1997) prove the convergence of on-policy linear TD. In off-policy linear TD, the update is weighted by . The divergence of off-policy linear TD is well documented (Tsitsiklis and Van Roy, 1997). To approach this issue, Gradient TD methods (Sutton et al. 2009) were proposed. Instead of bootstrapping from the prediction of a successor state like TD, Gradient TD methods compute the gradient of the projected Bellman error directly. Gradient TD methods are true stochastic gradient methods and enjoy convergence guarantees. However, they are usually two-time-scale, involving two sets of parameters and two learning rates, which makes it hard to use in practice (Sutton et al., 2016). To approach this issue, Emphatic TD (ETD, Sutton et al. 2016) was proposed.

ETD introduces an interest function to specify user’s preferences for different states. With function approximation, we typically cannot get accurate predictions for all states and must thus trade off between them. States are usually weighted by in the off-policy setting (e.g., Gradient TD methods) but with the interest function, we can explicitly weight them by in our objective. Consequently, we weight the update at time via , which is the emphasis that accumulates previous interests in a certain way. In the simplest form of ETD, we have . The update is weighted by . In practice, we usually set .

Inspired by ETD, Hallak and Mannor (2017) propose to weight via in the Consistent Off-Policy TD (COP-TD) algorithm, where is the density ratio, which is also known as the covariate shift (Gelada and Bellemare, 2019). To learn via stochastic approximation, Hallak and Mannor (2017) propose the COP operator. However, the COP operator does not have a unique fixed point. Extra normalization and projection is used to ensure convergence (Hallak and Mannor, 2017) in the tabular setting. To address this limitation, Gelada and Bellemare (2019) further propose the -discounted COP operator.

Gelada and Bellemare (2019) define a new transition matrix where is a constant. Following this matrix, an agent either proceeds to the next state according to w.p. or gets reset to w.p. . Gelada and Bellemare (2019) show the chain under is ergodic. With denoting its stationary distribution, they prove

(1) |

and . With , Gelada and Bellemare (2019) prove that

(2) |

yielding the following learning rule for estimating in the tabular setting:

(3) |

where is an estimate of and is a step size. A semi-gradient is used when is a parameterized function (Gelada and Bellemare, 2019). For a small (depending on the difference between and ), Gelada and Bellemare (2019) prove a multi-step contraction under linear function approximation. For a large or nonlinear function approximation, they provide an extra normalization loss for the sake of the constraint . Gelada and Bellemare (2019) use to weight the update in Discounted COP-TD. They demonstrate empirical success in Atari games (Bellemare et al., 2013) with pixel inputs.

Control: In this paper, we focus on policy-based control. In the on-policy continuing setting, we seek to optimize the average value objective (Silver, 2015)

(4) |

Optimizing the average value objective is equivalent to optimizing the average reward objective (Puterman, 2014) if both and are constant (see White 2017). In general, the average value objective can be interpreted as a generalization of the average reward objective to adopt transition-based discounting and nonconstant interest function.

In the off-policy continuing setting, Imani et al. (2018) propose to optimize the excursion objective

(5) |

instead of the alternative life objective . The key difference between and is how we trade off different states. With function approximation, it is usually not possible to maximize for all states, which is the first trade-off we need to make. Moreover, visiting one state more implies visiting another state less, which is the second trade-off we need to make. and achieve both kinds of trade-off according to and respectively. However, it is , not , that correctly reflects the deploy-time performance of , as the behavior policy will no loner matter when we deploy the off-policy learned in a continuing task.

In both objectives, is usually set to 1. We assume is parameterized by . In the rest of this paper, all gradients are taken w.r.t. unless otherwise specified, and we consider the gradient for only one component of for the sake of clarity. It is not clear how to compute the policy gradient of in the off-policy continuing setting directly. For , we can compute the policy gradient as

(6) |

Degris et al. (2012) prove in the Off-Policy Policy Gradient (OPPG) theorem that we can ignore the term without introducing bias for a tabular policy^{1}^{1}1See Errata in Degris et al. (2012), also in Imani et al. (2018); Maei (2018). when , yielding the Off-Policy Actor Critic (Off-PAC), which updates as

(7) |

where is a step size, is sampled from , and is sampled from . For a policy using a general function approximator, Imani et al. (2018) propose a new OPPG theorem. They define

where is a constant used to optimize the bias-variance trade-off and . Imani et al. (2018) prove that is an unbiased sample of when and for a general interest function . Based on this, Imani et al. (2018) propose the Actor-Critic with Emphatic weightings (ACE) algorithm, which updates as . ACE is an emphatic approach where is the emphasis to reweight the update.

## 3 The Counterfactual Objective

We now introduce the counterfactual objective

(8) |

where is a user-defined interest function. Similarly, we can set to 1 for the continuing setting but we proceed with a general . When , recovers the alternative life objective . When , recovers the excursion objective . To motivate the counterfactual objective , we first present the two-circle MDP (Figure (a)a) to highlight the difference between and .

In the two-circle MDP, an agent only needs to make a decision in state A. The behavior policy proceeds to B or C randomly with equal probability. For this continuing task, the discount factor is always and the interest is always 1. Under this task specification (White, 2017), the optimal policy under the average value objective , which is equivalent to the average reward objective as and are constant, is to stay in the outer circle. However, to maximize , the agent prefers the inner circle. To see this, first note and hardly change w.r.t. , and we have and . To maximize , the target policy would prefer transitioning to state C to maximize . The agent, therefore, remains in the inner circle. The two-circle MDP is tabular, so the policy maximizing for all can be represented accurately and there is no need to make the first kind of trade-off. If we consider an episodic task, e.g., removing the arrow back to A, such policy will be optimal under the episodic return criterion. However, when we consider a continuing task and aim to optimize , we have to consider state visitation. The excursion objective maximizes regardless of the state visitation under , yielding a policy that never visits the state D, a state of the highest value. Such policy is sub-optimal in this continuing task. To maximize , the agent has to sacrifice and visits state D more. This two-circle MDP is not an artifact due to the small . The same effect can also occur with a larger if the path is longer. With function approximation, the discrepancy between and can be magnified as we need to make both kinds of trade-off.

One solution to this problem is to set the interest function in in a clever way. However, it is not clear how to achieve this without domain knowledge. Imani et al. (2018) simply set to 1. Another solution might be to optimize directly in off-policy learning, if one could use importance sampling ratios to fully correct to as Precup et al. (2001) propose for value-based methods in the episodic setting. However, this solution is infeasible for the continuing setting (Sutton et al., 2016).

In this paper, we propose to optimize instead.
It is a well-known fact that the policy gradient of the stationary distribution exists under mild conditions (e.g., see Yu (2005)) ^{2}^{2}2
For completeness, we include that proof in supplementary materials.
.
It follows immediately from the proof of this existence that .
Moreover, it is trivial to see that , indicating the counterfactual objective can recover both the excursion objective and the alternative life objective smoothly.
Furthermore, we show empirically that a small (e.g., 0.6 in the two-circle MDP and 0.2 in Mujoco tasks) is enough to generate a different solution from maximizing .

## 4 Generalized Off-Policy Policy Gradient

In this section, we derive an estimator for and show in Proposition 1 that it is unbiased in the limiting sense. Our (standard) assumptions are given in supplementary materials. The OPPG theorem (Imani et al., 2018) leaves us the freedom to choose the interest function in . In this paper, we set , which, to our best knowledge, is the first time that a non-trivial interest is used. Hence, depends on and we cannot invoke OPPG directly as . However, we can still invoke the remaining parts of OPPG:

(9) |

where . We now compute the gradient .

###### Theorem 1 (Generalized Off-Policy Policy Gradient Theorem)

where

Proof. We first use the product rule of calculus and plug in :

follows directly from (9). To show , we take gradients on both sides of (2). We have . Solving this linear system of leads to

With , follows easily.

Now we use an emphatic approach to provide an unbiased sample of . We define

Here functions as an intrinsic interest (in contrast to the user-defined extrinsic interest ) and is a sample for b. accumulates previous interests and translates b into g. is for bias-variance trade-off similar to Sutton et al. (2016); Imani et al. (2018). We now define

and proceed to show that is an unbiased sample of when .

###### Lemma 1

Assuming the chain induced by is ergodic and is fixed, the limit exists, and for .

Proof. Previous works (Sutton et al., 2016; Imani et al., 2018) assume exists.
Here we prove the existence of , inspired by the process of computing the value of (assuming its existence) in Sutton et al. (2016).
The existence of for a constant can also be established with the same routine ^{3}^{3}3The existence of does not follow directly from the convergence analysis of ETD in Yu (2015)..
The proof also involves similar techniques as Hallak and Mannor (2017).
Details in supplementary materials.

###### Proposition 1

Assuming the chain induced by is ergodic, is fixed, , then

Proof. The proof involves Proposition 1 in Imani et al. (2018) and Lemma 1. Details are provided in supplementary materials.

When , the Generalized Off-Policy Policy Gradient (GOPPG) Theorem recovers the OPPG theorem in Imani et al. (2018). The main contribution of GOPPG lies in the computation of , i.e., the policy gradient of a distribution, which has not been done by previous policy gradient methods. The main contribution of Proposition 1 is the trace , which is an effective way to approximate . Inspired by Propostion 1, we propose to update as , where is a step size.

So far, we discussed the policy gradient for a single dimension of the policy parameter , so are all scalars. When we compute policy gradients for the whole in parallel, remain scalars while become vectors of the same size as . This is because our intrinsic interest “function” is a multi-dimensional random variable, instead of a deterministic scalar function like . We, therefore, generalize the concept of interest.

So far, we also assumed access to the true density ratio and the true value function . We can plug in their estimates and , yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm ^{4}^{4}4At this moment, a convergence analysis of Geoff-PAC is an open problem.. The density ratio estimate can be learned via the learning rule in (3). The value estimate can be learned by any off-policy prediction algorithm, e.g., one-step off-policy TD (Sutton and Barto, 2018), Gradient TD methods, (Discounted) COP-TD or V-trace (Espeholt et al., 2018). Pseudocode of Geoff-PAC is provided in supplementary materials.

We now discuss two potential practical issues with Geoff-PAC. First, Proposition 1 requires . In practice, this means has been executed for a long time and can be satisfied by a warm-up before training. Second, Proposition 1 provides an unbiased sample for a fixed policy . Once is updated, will be invalidated as well as . As their update rule does not have a learning rate, we cannot simply use a larger learning rate for as we would do for . This issue also appeared in Imani et al. (2018). In principle, we could store previous transitions in a replay buffer (Lin, 1992) and replay them for a certain number of steps after is updated. In this way, we can satisfy the requirement and get the up-to-date . In practice, we found this unnecessary. When we use a small learning rate for , we assume changes slowly and ignore this invalidation effect.

## 5 Experimental Results

Our experiments aim to answer the following questions. 1) Can Geoff-PAC find the same solution as on-policy policy gradient algorithms in the two-circle MDP as promised? 2) How does the degree of counterfactualness () influence the solution? 3) Can Geoff-PAC scale up to challenging tasks like robot simulation in Mujoco with neural network function approximators? 4) Can the counterfactual objective in Geoff-PAC translate into performance improvement over Off-PAC and ACE? 5) How does Geoff-PAC compare with other downstream applications of OPPG, e.g., DDPG (Lillicrap et al., 2015) and TD3 (Fujimoto et al., 2018)?

### 5.1 Two-circle MDP

We implemented a tabular version of ACE and Geoff-PAC for the two-circle MDP. The behavior policy was random, and we monitored the probability from A to B under the target policy . In Figure (b)b, we plot during training. The curves are averaged over 10 runs and the shaded regions indicate standard errors. We set so that both ACE and Geoff-PAC are unbiased. For Geoff-PAC, was set to . ACE converges to the correct policy that maximizes as expected, while Geoff-PAC converges to the policy that maximizes , the policy we want in on-policy training. Figure (c)c shows how manipulating and influences the final solution. In this two-circle MDP, has little influence on the final solution, while manipulating significantly changes the final solution.

### 5.2 Robot Simulation

Evaluation: We benchmarked Off-PAC, ACE, DDPG, TD3, and Geoff-PAC on five Mujoco robot simulation tasks from OpenAI gym (Brockman et al., 2016). As all the original tasks are episodic, we adopted similar techniques as White (2017) to compose continuing tasks. We set the discount function to 0.99 for all non-termination transitions and to 0 for all termination transitions. The agent was teleported back to the initial states upon termination. The interest function was always 1. This setting complies with the common training scheme for Mujoco tasks (Lillicrap et al., 2015; Asadi and Williams, 2016). However, we interpret the tasks as continuing tasks. As a consequence, , instead of episodic return, is the proper metric to measure the performance of a policy . The behavior policy is a fixed *uniformly random policy*, same as Gelada and Bellemare (2019). The data generated by is significantly different from any meaningful policy in those tasks. Thus, this setting exhibits a high degree of off-policyness. We monitored periodically during training. To evaluate , states were sampled according to , and was approximated via Monte Carlo return. Evaluation based on the commonly used total undiscounted episodic return criterion is provided in supplementary materials. The relative performance under the two criterion is almost identical.

Implementation: Although emphatic algorithms have enjoyed great theoretical success (Yu, 2015; Hallak et al., 2016; Sutton et al., 2016; Imani et al., 2018), their empirical success is still limited to simple domains (e.g., simple hand-crafted Markov chains, cart-pole balancing) with linear function approximation. To our best knowledge, this is the first time that emphatic algorithms are evaluated in challenging robot simulation tasks with neural network function approximators. To stabilize training, we adopted the A2C (Clemente et al., 2017) paradigm with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer (Lin, 1992). All three algorithms share the same architecture and the same parameterization. We first tuned hyperparameters for Off-PAC. ACE and Geoff-PAC inherited common hyperparameters from Off-PAC. For DDPG and TD3, we used the same architecture and hyperparameters as Lillicrap et al. (2015) and Fujimoto et al. (2018) respectively.
More details are provided in supplementary materials and all the implementations are publicly available ^{5}^{5}5https://github.com/ShangtongZhang/DeepRL.

Results: We first studied the influence of on ACE and the influence of on Geoff-PAC in HalfCheetah. The results are reported in supplementary materials. We found ACE was not sensitive to and set for all experiments. For Geoff-PAC, we found produced good empirical results and used this combination for all remaining tasks. All curves are averaged over 10 independent runs and shaded regions indicate standard errors. Figure 2 compares Geoff-PAC, ACE, and Off-PAC. Geoff-PAC significantly outperforms ACE and Off-PAC in 3 out of 5 tasks. The performance on Walker and Reacher is similar. This performance improvement supports our claim that optimizing can better approximate than optimizing . We also report the performance of a random agent for reference. Moreover, this is the first time that ACE is evaluated on such challenging domains instead of simple Markov chains. Figure 3 compares Geoff-PAC, DDPG, and TD3. Geoff-PAC outperforms DDPG in Hopper and Swimmer. DDPG with a uniformly random policy exhibits high instability in HalfCheetah, Walker, and Hopper. This is expected because DDPG ignores the discrepancy between and . As training progresses, this discrepancy gets larger and finally yields a performance drop. TD3 uses several techniques to stabilize DDPG, which translate into the performance and stability improvement over DDPG in Figure 3. However, Geoff-PAC still outperforms TD3 in Hopper and Swimmer. This is not a fair comparison in that many design choices for DDPG, TD3 and Geoff-PAC are different (e.g., one worker vs. multiple workers, deterministic vs. stochastic policy, network architectures), and we do not expect Geoff-PAC to outperform all applications of OPPG. However, this comparison does suggest GOPPG sheds light on how to improve applications of OPPG.

## 6 Related Work

The density ratio c is a key component in Geoff-PAC, which is proposed by Gelada and Bellemare (2019). However, how we use this density ratio is different. Q-Learning (Watkins and Dayan, 1992; Mnih et al., 2015) is a semi-gradient method. Gelada and Bellemare (2019) use the density ratio to reweigh the Q-Learning semi-gradient update directly. The resulting algorithm still belongs to semi-gradient methods. If we would use the density ratio to reweigh the Off-PAC update (7) directly, it would just be an actor-critic analogue of the Q-Learning approach in Gelada and Bellemare (2019). This reweighed Off-PAC, however, will no longer follow the policy gradient of the objective , yielding instead “policy semi-gradient”. In this paper, we use the density ratio to define a new objective, the counterfactual objective, and compute the policy gradient of this new objective directly (Theorem 1). The resulting algorithm, Geoff-PAC, still belongs to policy gradient methods (in the limiting sense). Computing the policy gradient of the counterfactual objective requires computing the policy gradient of the density ratio, which has not been explored in Gelada and Bellemare (2019).

There have been many applications of OPPG, e.g., DPG (Silver et al., 2014), DDPG, ACER (Wang et al., 2016), EPG (Ciosek and Whiteson, 2017), and IMPALA (Espeholt et al., 2018). Particularly, Gu et al. (2017) propose IPG to unify on- and off-policy policy gradients. IPG is a mix of gradients (i.e., a mix of and ). To compute , IPG does need on-policy samples. In this paper, the counterfactual objective is a mix of objectives, and we do not need on-policy samples to compute the policy gradient of the counterfactual objective. Mixing and directly in IPG-style is a possibility for future work.

There have been other policy-based off-policy algorithms. Maei (2018) provide an unbiased estimator (in the limiting sense) for , assuming the value function is linear. Theoretical results are provided without empirical study. Imani et al. (2018) eliminate this linear assumption and provide a thorough empirical study. We, therefore, conduct our comparison with Imani et al. (2018) instead of Maei (2018). In another line of work, the policy entropy is used for reward shaping. The target policy can then be derived from the value function directly (O’Donoghue et al., 2016; Nachum et al., 2017a; Schulman et al., 2017a). This line of work includes the deep energy-based RL (Haarnoja et al., 2017, 2018), where a value function is learned off-policy and the policy is derived from the value function directly, and path consistency learning (Nachum et al., 2017a, b), where gradients are computed to satisfy certain path consistencies. This line of work is orthogonal to this paper, where we compute the policy gradients of the counterfactual objective directly in an off-policy manner and do not involve reward shaping.

Liu et al. (2018) prove that is the unique solution for a minimax problem, which involves maximization over a function set . They show that theoretically should be sufficiently rich (e.g., neural networks). To make it tractable, they restrict to a ball of a reproducing kernel Hilbert space, yielding a closed form solution for the maximization step. SGD is then used to learn an estimate for in the minimization step, which is then used for policy evaluation. In a concurrent work (Liu et al., 2019), this approximate for is used in off-policy policy gradient for , and empirical success is observed in simple domains. By contrast, our unifies and , where naturally allows bias-variance trade-off, yielding an empirical success in challenging robot simulation tasks.

## 7 Conclusions

In this paper, we introduced the counterfactual objective unifying the excursion objective and the alternative life objective in the continuing RL setting. We further provided the Generalized Off-Policy Policy Gradient Theorem and corresponding Geoff-PAC algorithm. GOPPG is the first example that a non-trivial interest function is used, and Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. There have been numerous applications of OPPG including DDPG, ACER, IPG, EPG and IMPALA. We expect GOPPG to shed light on improving those applications. Theoretically, a convergent analysis of Geoff-PAC involving compatible function assumption (Sutton et al., 2000) or multi-timescale stochastic approximation (Borkar, 2009) is also worth further investigation.

#### Acknowledgments

SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. The authors thank Richard S. Sutton, Matthew Fellows, Huizhen Yu for the valuable discussion.

## References

- Asadi and Williams (2016) Asadi, K. and Williams, J. D. (2016). Sample-efficient deep reinforcement learning for dialog control. arXiv preprint arXiv:1612.06000.
- Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.
- Borkar (2009) Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint. Springer.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
- Ciosek and Whiteson (2017) Ciosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.
- Clemente et al. (2017) Clemente, A. V., Castejón, H. N., and Chandra, A. (2017). Efficient parallel methods for deep reinforcement learning. arXiv preprint arXiv:1705.04862.
- Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
- Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). Openai baselines. https://github.com/openai/baselines.
- Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
- Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
- Gelada and Bellemare (2019) Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
- Ghiassian et al. (2018) Ghiassian, S., Patterson, A., White, M., Sutton, R. S., and White, A. (2018). Online off-policy prediction.
- Gu et al. (2017) Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. (2017). Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems.
- Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org.
- Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Hallak and Mannor (2017) Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning.
- Hallak et al. (2016) Hallak, A., Tamar, A., Munos, R., and Mannor, S. (2016). Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedins of 30th AAAI Conference on Artificial Intelligence.
- Imani et al. (2018) Imani, E., Graves, E., and White, M. (2018). An off-policy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Lin (1992) Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
- Liu et al. (2018) Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems.
- Liu et al. (2019) Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2019). Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.
- Maei (2018) Maei, H. R. (2018). Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.
- Marbach and Tsitsiklis (2001) Marbach, P. and Tsitsiklis, J. N. (2001). Simulation-based optimization of markov reward processes. IEEE Transactions on Automatic Control.
- Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature.
- Nachum et al. (2017a) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems.
- Nachum et al. (2017b) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trust-pcl: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891.
- Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
- O’Donoghue et al. (2016) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and q-learning. arXiv preprint arXiv:1611.01626.
- Osband et al. (2018) Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems.
- Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning.
- Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Schulman et al. (2017a) Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning.
- Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Seneta (2006) Seneta, E. (2006). Non-negative matrices and Markov chains. Springer Science & Business Media.
- Silver (2015) Silver, D. (2015). Policy gradient methods. url: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf.
- Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
- Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction (2nd Edition). MIT press.
- Sutton et al. (2009) Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009). A convergent temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
- Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.
- Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
- Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-diffference learning with function approximation. In Advances in neural information processing systems.
- Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
- Watkins and Dayan (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning.
- White (2017) White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.
- Yu (2005) Yu, H. (2005). A function approximation approach to estimation of policy gradient for pomdp with structured policies. The 21st Conference on Uncertainty in Artificial Intelligence.
- Yu (2015) Yu, H. (2015). On convergence of emphatic temporal-difference learning. In Conference on Learning Theory.

## Appendix A Assumptions and Proofs

### a.1 Assumptions

### a.2 Proof of the Existence of

Proof. (From Yu (2005)) The stationary distribution is an eigenvector of associated with the eigenvalue 1, which is the largest eigenvalue. For any primitive matrix X, Seneta (2006) states in the proof of his Theorem 1.1 (f) that each row of is an eigenvector associated with the largest eigenvalue of the matrix X, where is the adjugate matrix of X. As is ergodic, it is primitive. Consequently, the rows of are the eigenvectors associated with the eigenvalue 1, which is the stationary distribution . According to fact that each element of is a polynomial of elements in , exists whenever exists. Particularly, is polynomials of , it follows easily that .

### a.3 Proof of Lemma 1

Proof. To get , we start from and follow for steps. The expectation is taken w.r.t. to this process of following for steps. We use to denote the probability of transitioning to a state from a state in steps under the target policy . We define shorthands:

(10) | ||||

(11) | ||||

(12) |

When , the agent goes to the stationary distribution , so

(13) | ||||

(14) | ||||

(15) |

Consequently,

(16) |

We first compute the limit of the intrinsic interest :

(17) | ||||

(18) | ||||

(Law of total expectation and Markov property) | ||||

(19) | ||||

(20) | ||||

(21) | ||||

(22) |

We then expend recursively: