# Expected Policy Gradients

###### Abstract

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to $e^{H}$, where $H$ is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.


Kamil Ciosek and Shimon Whiteson Department of Computer Science, University of Oxford Wolfson Building, Parks Road, Oxford OX1 3QD {kamil.ciosek,shimon.whiteson}@cs.ox.ac.uk

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Introduction

Policy gradient methods (Sutton et al., 2000; Peters and Schaal, 2006; Peters and Schaal, 2008b; Silver et al., 2014), which optimise policies by gradient ascent, have enjoyed great success in reinforcement learning problems with large or continuous action spaces. The archetypal algorithm optimises an actor, i.e., a policy, by following a policy gradient that is estimated using a critic, i.e., a value function.

The policy can be stochastic or deterministic, yielding stochastic policy gradients (SPG) (Sutton et al., 2000) or deterministic policy gradients (DPG) (Silver et al., 2014). The theory underpinning these methods is quite fragmented, as each approach has a separate policy gradient theorem guaranteeing the policy gradient is unbiased under certain conditions.

Furthermore, both approaches have significant shortcomings. For SPG, variance in the gradient estimates means that many trajectories are usually needed for learning. Since gathering trajectories is typically expensive, there is a great need for more sample efficient methods.

DPG’s use of deterministic policies mitigates the problem of variance in the gradient but raises other difficulties. The theoretical support for DPG is limited since it assumes a critic that approximates $\nabla_a Q$ when in practice it approximates $Q$ instead. In addition, DPG learns off-policy (although we show in this paper that, in certain settings, off-policy DPG is equivalent to EPG, our on-policy method), which is undesirable when we want learning to take the cost of exploration into account. More importantly, learning off-policy necessitates designing a suitable exploration policy, which is difficult in practice. In fact, efficient exploration in DPG is an open problem and most applications simply use independent Gaussian noise or the Ornstein-Uhlenbeck heuristic (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015).

In this paper, we propose a new approach called expected policy gradients (EPG) that unifies policy gradients in a way that yields both theoretical and practical insights. Inspired by expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), the main idea is to integrate across the action selected by the stochastic policy when estimating the gradient, instead of relying only on the action selected during the sampled trajectory.

EPG enables two theoretical contributions. First, we establish a number of equivalences between EPG and DPG, among which is a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Second, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead over SPG.

EPG also enables a practical contribution: a principled exploration strategy for continuous problems. We show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to $e^{H}$, where $H$ is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains.

## Background

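A Markov decision process is a tuple $(S, A, R, p, p_0, \gamma)$ where $S$ is a set of states, $A$ is a set of actions (in practice either $A = \mathbb{R}^d$ or $A$ is finite), $R(s, a)$ is a reward function, $p(s' \mid a, s)$ is a transition kernel, $p_0$ is an initial state distribution, and $\gamma \in [0, 1)$ is a discount factor. A policy $\pi(a \mid s)$ is a distribution over actions given a state. We denote trajectories as $\tau^\pi = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, where $s_0 \sim p_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $r_t$ is a sample reward. A policy $\pi$ induces a Markov process with transition kernel $p_\pi(s' \mid s) = \int_a d\pi(a \mid s)\, p(s' \mid a, s)$, where we use the symbol $d\pi(a \mid s)$ to denote Lebesgue integration against the measure $\pi(a \mid s)$ where $s$ is fixed. We assume the induced Markov process is ergodic with a single invariant measure defined for the whole state space. The value function is $V^\pi(s) = \mathbb{E}_\tau\big[\sum_{i} \gamma^i r_i \mid s_0 = s\big]$ where actions are sampled from $\pi$. The $Q$-function is $Q^\pi(a, s) = \mathbb{E}_R[r \mid s, a] + \gamma\, \mathbb{E}_{p(s' \mid a, s)}[V^\pi(s')]$ and the advantage function is $A^\pi(a, s) = Q^\pi(a, s) - V^\pi(s)$. An optimal policy maximises the total return $J = \int_s dp_0(s)\, V^\pi(s)$. Since we consider only on-policy learning with just one current policy, we drop the $\pi$ super/subscript where it is redundant.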

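If $\pi$ is parameterised by $\theta$, then stochastic policy gradients (SPG) (Sutton et al., 2000; Peters and Schaal, 2006; Peters and Schaal, 2008b) perform gradient ascent on $\nabla J$, the gradient of $J$ with respect to $\theta$ (gradients without a subscript are always with respect to $\theta$). For stochastic policies, we have:

$$\nabla J = \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, \big(Q(a, s) + b(s)\big), \tag{1}$$

where $\rho$ is the discounted-ergodic occupancy measure, defined in the supplement, and $b(s)$ is a baseline, which can be any function that depends on the state but not the action, since $\int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, b(s) = 0$. Typically, (1) is approximated from samples from a trajectory $\tau$ of length $T$:

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\, \big(\hat{Q}(a_t, s_t) + b(s_t)\big). \tag{2}$$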

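If the policy is deterministic (we denote it $\pi(s)$), we can use deterministic policy gradients (Silver et al., 2014) instead:

$$\nabla J = \int_s d\rho(s)\, \nabla \pi(s)\, \big[\nabla_a Q(a, s)\big]_{a = \pi(s)}. \tag{3}$$

This update is then approximated using samples:

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \nabla \pi(s_t)\, \big[\nabla_a \hat{Q}(a, s_t)\big]_{a = \pi(s_t)}. \tag{4}$$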

Since the policy is deterministic, the problem of exploration is addressed using an external source of noise, typically modeled using a zero-mean Ornstein-Uhlenbeck (OU) process (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015) parametrized by $\theta_{\text{OU}}$ and $\sigma_{\text{OU}}$:

(5) |
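A minimal sketch of this kind of exploration noise, assuming a common discrete-time form $n_i = (1 - \theta_{\text{OU}})\, n_{i-1} + \sigma_{\text{OU}}\, \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, 1)$. The function names, the one-dimensional action, and the values `theta=0.15` and `sigma=0.2` are illustrative assumptions rather than settings taken from this paper.

```python
import random

def ou_noise(n_steps, theta=0.15, sigma=0.2, seed=0):
    """Generate a 1-D zero-mean OU-style noise sequence.

    Assumed discretisation (for illustration only):
        n_i = (1 - theta) * n_{i-1} + sigma * eps_i,  eps_i ~ N(0, 1)
    """
    rng = random.Random(seed)
    noise, n = [], 0.0
    for _ in range(n_steps):
        n = (1.0 - theta) * n + sigma * rng.gauss(0.0, 1.0)
        noise.append(n)
    return noise

def explore(policy_mean, noise):
    """Add temporally correlated noise to a deterministic policy's action."""
    return [policy_mean + n for n in noise]

noise = ou_noise(5000)
actions = explore(policy_mean=0.5, noise=noise)
```

Because each noise value carries over a fraction of the previous one, consecutive actions are correlated, which is what distinguishes this heuristic from independent Gaussian noise.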

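In (2) and (4), $\hat{Q}$ is a critic that approximates $Q$ and can be learned by sarsa (Rummery and Niranjan, 1994; Sutton, 1996):

$$\hat{Q}(a_t, s_t) \leftarrow \hat{Q}(a_t, s_t) + \alpha \big[r_t + \gamma\, \hat{Q}(a_{t+1}, s_{t+1}) - \hat{Q}(a_t, s_t)\big]. \tag{6}$$

Alternatively, we can use expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), which marginalises out $a_{t+1}$, the distribution over which is specified by the known policy, to reduce the variance in the update:

$$\hat{Q}(a_t, s_t) \leftarrow \hat{Q}(a_t, s_t) + \alpha \Big[r_t + \gamma \int_a d\pi(a \mid s_{t+1})\, \hat{Q}(a, s_{t+1}) - \hat{Q}(a_t, s_t)\Big]. \tag{7}$$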
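The difference between the two critic updates can be sketched for a tabular critic with discrete actions, where the integral over next actions becomes a sum. The table layout `q[state][action]`, the helper names, and the toy numbers are assumptions made for this illustration.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """Sarsa bootstraps from the single sampled next action."""
    return r + gamma * q[s_next][a_next]

def expected_sarsa_target(q, pi, r, s_next, gamma):
    """Expected sarsa averages over next actions under the known policy."""
    return r + gamma * sum(p * q_sa for p, q_sa in zip(pi[s_next], q[s_next]))

def td_update(q, s, a, target, alpha):
    """Move q[s][a] a step of size alpha towards the target (in place)."""
    q[s][a] += alpha * (target - q[s][a])

# Toy problem with 2 states and 2 actions (hypothetical numbers).
q = [[0.0, 0.0], [1.0, 3.0]]        # q[state][action]
pi = [[0.5, 0.5], [0.25, 0.75]]     # pi[state][action]
gamma, alpha, r = 0.9, 0.5, 1.0

t_sarsa = sarsa_target(q, r, s_next=1, a_next=0, gamma=gamma)
t_expected = expected_sarsa_target(q, pi, r, s_next=1, gamma=gamma)
td_update(q, s=0, a=0, target=t_expected, alpha=alpha)
```

The sarsa target changes with whichever next action happened to be sampled, whereas the expected sarsa target is the same for every sampled next action, which is the source of its lower variance.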

We could also use advantage learning (Baird and others, 1995) or LSTDQ (Lagoudakis and Parr, 2003). If the critic’s function approximator is compatible, then the actor, i.e., $\pi$, converges (Sutton et al., 2000).

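Instead of learning $\hat{Q}$, we can set $b(s) = -V(s)$ so that $Q(a, s) + b(s) = A(a, s)$ and then use the TD error $\delta(r, s, s') = r + \gamma V(s') - V(s)$ as an estimate of $A(a, s)$ (Bhatnagar et al., 2008):

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\, \big(r_t + \gamma\, \hat{V}(s_{t+1}) - \hat{V}(s_t)\big), \tag{8}$$

where $\hat{V}(s)$ is an approximate value function learned using any policy evaluation algorithm. (8) works because $\mathbb{E}[\delta(r, s, s') \mid a, s] = A(a, s)$, i.e., the TD error is an unbiased estimate of the advantage function. The benefit of this approach is that it is sometimes easier to approximate $V$ than $Q$ and that the return in the TD error is unprojected, i.e., it is not distorted by function approximation. However, the TD error is noisy, introducing variance in the gradient.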

To cope with this variance, we can reduce the learning rate when the variance of the gradient would otherwise explode, using, e.g., Adam (Kingma and Ba, 2014), natural policy gradients (Kakade, 2002; Amari, 1998; Peters and Schaal, 2008a), the adaptive step size method (Pirotta, Restelli, and Bascetta, 2013) or Newton’s method (Furmston and Barber, 2012; Parisi, Pirotta, and Restelli, 2016). However, this results in slow learning when the variance is high. One can also use PGPE, which replaces the stochastic policy with a distribution over deterministic policies (Sehnke et al., 2010). However, PGPE precludes updating the current policy during the episode and makes it difficult to explore efficiently.

We can also eliminate all variance caused by the policy at the cost of making the policy deterministic and using the DPG update, which usually necessitates performing off-policy exploration. EPG, presented below, reduces to DPG in many useful cases, while providing a principled way to explore and also allowing for stochastic policies.

Yet another way to eliminate variance in the actor is not to have an actor at all, instead selecting actions soft-greedily with respect to $\hat{Q}$ learned using sarsa. This is trivial for discrete actions and can also be done with a one-step Newton’s method for $Q$-functions that are quadric in the actions (Gu et al., 2016b).

## Expected Policy Gradients

In this section, we propose expected policy gradients (EPG).

### Main Algorithm

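First, we introduce $I^Q_\pi(s)$ to denote the inner integral in (1):

$$\nabla J = \int_s d\rho(s) \underbrace{\int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, \big(Q(a, s) + b(s)\big)}_{I^Q_\pi(s)} = \int_s d\rho(s)\, I^Q_\pi(s). \tag{9}$$

This suggests a new way to write the approximate gradient:

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \hat{I}^{\hat{Q}}_\pi(s_t), \tag{10}$$

where $\hat{I}^{\hat{Q}}_\pi(s)$ is some approximation to $I^Q_\pi(s)$. This approach makes explicit that one step in estimating the gradient is to evaluate an integral to estimate $I^Q_\pi(s)$. The main insight behind EPG is that, given a state, $I^{\hat{Q}}_\pi(s)$ (the integral in (9) with the critic $\hat{Q}$ in place of $Q$) is expressed fully in terms of known quantities. Hence we can manipulate it analytically to obtain a formula or we can just compute the integral using any numerical quadrature if an analytical solution is impossible.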

SPG as given in (2) performs this quadrature using a simple one-sample Monte Carlo method. However, relying on such a method is unnecessary. In fact, the actions used to interact with the environment need not be used at all in the evaluation of $\hat{I}^{\hat{Q}}_\pi(s)$ since $a$ is a bound variable in the definition of $I^Q_\pi(s)$. The motivation is thus similar to that of expected sarsa but applied to the actor’s gradient estimate instead of the critic’s update rule. EPG, shown in Algorithm 1, uses (10) to form a policy gradient algorithm that repeatedly estimates $\hat{I}^{\hat{Q}}_\pi(s)$ with an integration subroutine.
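For discrete actions the integration subroutine reduces to an exact sum over actions. The sketch below assumes a softmax policy over three actions in a single state, parameterised directly by its logits, with no baseline; `epg_inner_integral`, `expected_q`, and the numbers are hypothetical names and values chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def epg_inner_integral(logits, q_values):
    """Exact sum_a pi(a|s) * grad log pi(a|s) * Q(a, s) w.r.t. the logits.

    For a softmax policy, d log pi(a) / d logit_k = 1[a == k] - pi(k).
    """
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, q_a) in enumerate(zip(pi, q_values)):
        for k in range(len(logits)):
            dlogpi = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += p_a * dlogpi * q_a
    return grad

def expected_q(logits, q_values):
    """Value of the state under the policy: sum_a pi(a|s) * Q(a, s)."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))

logits = [0.1, -0.3, 0.7]
q_values = [1.0, 2.0, -0.5]      # critic values for each action in this state
grad = epg_inner_integral(logits, q_values)
```

No action needs to be sampled to compute `grad`: every action contributes in proportion to its probability, so the result is deterministic given the state and the critic.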

EPG has benefits even when an analytical solution is not possible: if the action space is low dimensional, numerical quadrature is cheap; if it is high dimensional, it is still often worthwhile to balance the expense of simulating the system with the cost of quadrature. Actually, even in the extreme case of expensive quadrature but cheap simulation, the limited resources available for quadrature could still be better spent on EPG with smart quadrature than SPG with simple Monte Carlo. One of the motivations of DPG was precisely that the simple one-sample Monte-Carlo quadrature implicitly used by SPG often yields high variance gradient estimates, even with a good baseline. To see why, consider Figure 1 (left). A simple Monte Carlo method evaluates the integral by sampling one or more times from $\pi(a \mid s)$ (blue) and evaluating $\nabla_\mu \log \pi(a \mid s)\, Q(a, s)$ (red) as a function of $a$. A baseline can decrease the variance by adding a multiple of $\nabla_\mu \log \pi(a \mid s)$ to the red curve, but the problem remains that the red curve has high values where the blue curve is almost zero. Consequently, substantial variance persists, whatever the baseline, even with a simple linear $Q$-function, as shown in Figure 1 (right). DPG addressed this problem for deterministic policies, and EPG extends the solution to stochastic ones.
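The effect can be reproduced in a small experiment. The sketch below assumes a one-dimensional Gaussian policy $\mathcal{N}(\mu, \sigma^2)$, a linear critic $Q(a) = w a$, and the baseline $b = -w\mu$; all names and numbers are illustrative assumptions. For this setup the integral with respect to the mean equals $w$ exactly, while the one-sample estimator keeps a variance of about $2w^2$ despite the baseline.

```python
import random
import statistics

def spg_one_sample(mu, sigma, w, baseline, rng):
    """One-sample Monte Carlo estimate of the mean-gradient integral."""
    a = rng.gauss(mu, sigma)
    score = (a - mu) / sigma ** 2        # d/dmu log N(a; mu, sigma^2)
    return score * (w * a + baseline)    # linear critic Q(a) = w * a

def epg_analytic(w):
    """Closed form of the same integral for a linear critic: exactly w."""
    return w

rng = random.Random(0)
mu, sigma, w = 1.0, 0.5, 2.0
baseline = -w * mu                       # subtracts the critic value at the mean
samples = [spg_one_sample(mu, sigma, w, baseline, rng) for _ in range(20000)]
spg_mean = statistics.fmean(samples)     # close to w, so the estimator is unbiased
spg_var = statistics.variance(samples)   # remains far from zero
epg_value = epg_analytic(w)              # no sampling, hence no variance
```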

### Relationship to Other Methods

EPG has some similarities with VINE sampling (Schulman et al., 2015), which uses an (intrinsically noisy) Monte Carlo quadrature with many samples (VINE sampling also differs from EPG by performing independent rollouts of $Q$, requiring a simulator with reset). However, the example in Figure 1 shows that even with a computationally expensive many-sample Monte Carlo method, the problem of variance remains, regardless of the baseline.

EPG is also related to variance minimisation techniques that interpolate between two estimators, e.g., (Gu et al., 2016a, Eq. 7) is similar to Corollary 4. However, EPG uses a quadric (not linear) approximation to the critic, which is crucial for exploration. Furthermore, it completely eliminates variance in the inner integral, as opposed to just reducing it.

The idea behind EPG was also independently and concurrently developed as Mean Actor Critic (Asadi et al., 2017), though only for discrete actions and without a supporting theoretical analysis.

### Gaussian Policies

EPG is particularly useful when we make the common assumption of a Gaussian policy: we can then perform the integration analytically under reasonable conditions. We show below (see Lemma 3) that the update to the policy mean computed by EPG is equivalent to the DPG update. Moreover, a simple formula for the covariance can be derived (see Lemma 2). Algorithms 2 and 3 show the resulting special case of EPG, which we call Gaussian policy gradients (GPG).

Surprisingly, GPG is on-policy but nonetheless fully equivalent to DPG, an off-policy method, with a particular form of exploration. Hence, GPG, by specifying the policy’s covariance, can be seen as a derivation of an exploration strategy for DPG. In this way, GPG addresses an important open question. As we show later, this leads to improved performance in practice.

The computational cost of GPG is small: while it must store a Hessian matrix $H(a, s) = \nabla^2_a \hat{Q}(a, s)$, its size is only $d \times d$, where $d$ is the dimensionality of the action space, which is typically small, e.g., $6 \times 6$ for HalfCheetah-v1. This Hessian is the same size as the policy’s covariance matrix, which any policy gradient must store anyway, and should not be confused with the Hessian with respect to the parameters of the neural network, as used with Newton’s or natural gradient methods (Peters and Schaal, 2008a; Furmston, Lever, and Barber, 2016), which can easily have thousands of entries. Hence, GPG obtains EPG’s variance reduction essentially for free.

## Analysis

In this section, we analyse EPG, showing that it unifies SPG and DPG, that $I^Q_\pi(s)$ can often be computed analytically, and that EPG has lower variance than SPG.

### General Policy Gradient Theorem

We begin by stating our most general result, showing that EPG can be seen as a generalisation of both SPG and DPG. To do this, we first state a new general policy gradient theorem. We use the shorthand $\nabla$ without a subscript to denote the gradient with respect to policy parameters $\theta$.

###### Theorem 1 (General Policy Gradient Theorem).

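If $\pi(\cdot \mid s)$ is a normalised Lebesgue measure for all $s$, then

$$\nabla J = \int_s d\rho(s) \underbrace{\Big[\nabla V(s) - \int_a d\pi(a \mid s)\, \nabla Q(a, s)\Big]}_{I_G(s)}.$$

###### Proof.

We begin by expanding the following expression.

$$\begin{aligned}
\int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla Q(a, s)
&= \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \Big(\mathbb{E}_R[r \mid s, a] + \gamma \int_{s'} dp(s' \mid a, s)\, V(s')\Big) \\
&= \gamma \int_s d\rho(s) \int_a d\pi(a \mid s) \int_{s'} dp(s' \mid a, s)\, \nabla V(s') \\
&= \gamma \int_s d\rho(s) \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s') \\
&= \int_s d\rho(s)\, \nabla V(s) - \int_s dp_0(s)\, \nabla V(s) \\
&= \int_s d\rho(s)\, \nabla V(s) - \nabla J.
\end{aligned}$$

The first equality follows by expanding the definition of $Q$ and the penultimate one follows from Lemma B (in the supplement). Then the theorem follows by rearranging terms. ∎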

The crucial benefit of Theorem 1 is that it works for all policies, both stochastic and deterministic, unifying previously separate derivations for the two settings. To show this, in the following two corollaries, we use Theorem 1 to recover the stochastic policy gradient theorem (Sutton et al., 2000) and the deterministic policy gradient theorem (Silver et al., 2014), in each case by introducing additional assumptions to obtain a formula for $I_G(s)$ expressible in terms of known quantities.

###### Corollary 1 (Stochastic Policy Gradient Theorem).

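If $\pi(\cdot \mid s)$ is differentiable, then

$$\nabla J = \int_s d\rho(s)\, I_G(s) = \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, Q(a, s).$$

###### Proof.

We obtain the following by expanding $\nabla V$.

$$\nabla V(s) = \nabla \int_a d\pi(a \mid s)\, Q(a, s) = \int_a da\, \big(\nabla \pi(a \mid s)\big)\, Q(a, s) + \int_a d\pi(a \mid s)\, \nabla Q(a, s).$$

We obtain $I_G(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, Q(a, s) = I^Q_\pi(s)$ by plugging this into the definition of $I_G(s)$. We obtain $\nabla J = \int_s d\rho(s)\, I^Q_\pi(s)$ by invoking Theorem 1 and plugging in the above expression for $I_G(s)$. ∎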

We now recover the DPG update introduced in (3).

###### Corollary 2 (Deterministic Policy Gradient Theorem).

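If $\pi(\cdot \mid s)$ is a Dirac-delta measure (i.e., a deterministic policy) and $Q(\cdot, s)$ is differentiable, then

$$\nabla J = \int_s d\rho(s)\, I_G(s) = \int_s d\rho(s)\, \nabla \pi(s)\, \big[\nabla_a Q(a, s)\big]_{a = \pi(s)}.$$

###### Proof.

We begin by obtaining an expression for $I_G(s)$.

$$I_G(s) = \nabla V(s) - \int_a d\pi(a \mid s)\, \nabla Q(a, s) = \nabla V(s) - \big[\nabla Q(a, s)\big]_{a = \pi(s)} = \nabla \pi(s)\, \big[\nabla_a Q(a, s)\big]_{a = \pi(s)}.$$

Here, the second equality follows by expanding the definition of the Dirac-delta measure $\pi(\cdot \mid s)$ and the third follows from an established deterministic policy gradient result (Silver et al., 2014, Supplement, Eq. 1). We can then obtain the stated expression for $\nabla J$ by invoking Theorem 1 and plugging in the above expression for $I_G(s)$. ∎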

These corollaries show that the choice between deterministic and stochastic policy gradients is fundamentally a choice of quadrature method. Hence, the empirical success of DPG relative to SPG (Silver et al., 2014; Lillicrap et al., 2015) can be understood in a new light. In particular, it can be attributed, not to a fundamental limitation of stochastic policies (indeed, stochastic policies are sometimes preferred), but instead to superior quadrature. DPG integrates over Dirac-delta measures, which is known to be easy, while SPG typically relies on simple Monte Carlo integration. Thanks to EPG, a deterministic approach is no longer required to obtain a method with low variance.

### Analytical Quadrature - Gaussian Policy

We now derive a lemma supporting GPG.

###### Lemma 1 (Gaussian Policy Gradients).

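If the policy is Gaussian, i.e., $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$, and the critic is of the form $Q(a, s) = a^\top A(s)\, a + a^\top B(s) + \text{const}$, then $I^Q_\pi(s) = \big[ I^Q_{\pi(s), \mu_s} \;\; I^Q_{\pi(s), \Sigma_s^{1/2}} \big]^\top$, where the mean and covariance components are given by $I^Q_{\pi(s), \mu_s} = (\nabla \mu_s)\big(2 A(s)\, \mu_s + B(s)\big)$ and $I^Q_{\pi(s), \Sigma_s^{1/2}} = (\nabla \Sigma_s^{1/2})\big(2 A(s)\, \Sigma_s^{1/2}\big)$.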
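A numerical sanity check of the one-dimensional case, assuming the policy parameters are the mean $\mu$ and the standard deviation $\sigma$ themselves (so that $\nabla \mu = \nabla \sigma = 1$) and a scalar quadric critic $Q(a) = A a^2 + B a + \text{const}$. The closed forms then read $2A\mu + B$ and $2A\sigma$. The trapezoid rule, the helper names, and the coefficients are assumptions of this sketch.

```python
import math

def gaussian_pdf(a, mu, sigma):
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def inner_integral(mu, sigma, q, n=20001, width=10.0):
    """Trapezoid-rule estimate of int pi(a) * grad log pi(a) * q(a) da
    for the parameters (mu, sigma) of a 1-D Gaussian policy."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / (n - 1)
    g_mu = g_sigma = 0.0
    for i in range(n):
        a = lo + i * h
        weight = 0.5 * h if i in (0, n - 1) else h
        p = gaussian_pdf(a, mu, sigma)
        # Score functions of N(mu, sigma^2) w.r.t. mu and sigma.
        g_mu += weight * p * (a - mu) / sigma ** 2 * q(a)
        g_sigma += weight * p * ((a - mu) ** 2 - sigma ** 2) / sigma ** 3 * q(a)
    return g_mu, g_sigma

A, B, const = -1.5, 0.8, 0.3      # hypothetical quadric critic coefficients
mu, sigma = 0.4, 0.7

def q(a):
    return A * a * a + B * a + const

numeric = inner_integral(mu, sigma, q)
closed_form = (2.0 * A * mu + B, 2.0 * A * sigma)
```

The two tuples agree to numerical precision, so for a quadric critic the quadrature can be replaced by the closed form at no extra cost.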

### Arbitrary Critics

If $\hat{Q}$ does not meet the conditions of Lemma 1, we can approximate $\hat{Q}$ with a quadric function in the neighbourhood of the policy mean. This approximation is motivated by two arguments. First, in MDPs that model physical systems with reasonable reward functions, $Q$ is fairly smooth. Second, policy gradients are a local, incremental method anyway: since the policy mean changes slowly, the values of $\hat{Q}$ for actions far from the policy mean are usually not relevant for the current update.

###### Corollary 3 (Approximate Gaussian Policy Gradients with an Arbitrary Critic).

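If the policy is Gaussian, i.e., $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$ as in Lemma 1 and any critic $\hat{Q}(a, s)$ doubly differentiable with respect to actions for each state, then $I^Q_{\pi(s), \mu_s} \approx (\nabla \mu_s)\, \big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s}$ and $I^Q_{\pi(s), \Sigma_s^{1/2}} \approx (\nabla \Sigma_s^{1/2})\, H(\mu_s, s)\, \Sigma_s^{1/2}$, where $H(\mu_s, s)$ is the Hessian of $\hat{Q}$ with respect to $a$, evaluated at $\mu_s$ for a fixed $s$.

###### Proof.

We begin by approximating the critic (for a given $s$) using the first two terms of the Taylor expansion of $\hat{Q}$ around $\mu_s$.

$$\begin{aligned}
\hat{Q}(a, s) &\approx \hat{Q}(\mu_s, s) + (a - \mu_s)^\top \big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s} + \tfrac{1}{2} (a - \mu_s)^\top H(\mu_s, s)\, (a - \mu_s) \\
&= \tfrac{1}{2}\, a^\top H(\mu_s, s)\, a + a^\top \Big(\big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s} - H(\mu_s, s)\, \mu_s\Big) + \text{const}.
\end{aligned}$$

Because of the series truncation, the function on the righthand side is quadric and we can then use Lemma 1 with $A(s) = \tfrac{1}{2} H(\mu_s, s)$ and $B(s) = \big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s} - H(\mu_s, s)\, \mu_s$:

$$\begin{aligned}
I^Q_{\pi(s), \mu_s} &\approx (\nabla \mu_s) \Big(H(\mu_s, s)\, \mu_s + \big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s} - H(\mu_s, s)\, \mu_s\Big) = (\nabla \mu_s)\, \big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s}, \\
I^Q_{\pi(s), \Sigma_s^{1/2}} &\approx (\nabla \Sigma_s^{1/2})\, H(\mu_s, s)\, \Sigma_s^{1/2}.
\end{aligned}$$

∎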

To actually obtain the Hessian, we could use automatic differentiation to compute it analytically. Alternatively, we can observe that, if the critic really is quadric, we can just read off the coefficients of the quadric term directly. Therefore, we can approximate the Hessian by generating a number of random actions around $\mu_s$, computing the $\hat{Q}$ values, and (locally) fitting a quadric. This process is typically more computationally expensive than automatic differentiation but has the advantage of working with ReLU networks (where the true Hessian is zero but we still have a kind of global curvature after smoothing) and leveraging more information from the critic (since the evaluation is at more than one point).
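As a simple stand-in for either approach, the Hessian with respect to the actions can also be estimated with central finite differences. This is a different technique from the automatic differentiation and quadric fitting described above, shown only because it needs nothing beyond critic evaluations. The function `hessian_fd`, the step size `eps`, and the toy critic are assumptions of this sketch.

```python
def hessian_fd(q, a, eps=1e-3):
    """Central finite-difference Hessian of q at the action vector a."""
    d = len(a)

    def shifted(i, di, j, dj):
        b = list(a)
        b[i] += di
        b[j] += dj
        return q(b)

    hess = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            hess[i][j] = (shifted(i, eps, j, eps) - shifted(i, eps, j, -eps)
                          - shifted(i, -eps, j, eps) + shifted(i, -eps, j, -eps)
                          ) / (4.0 * eps * eps)
    return hess

def toy_critic(a):
    """Hypothetical critic for a fixed state, quadric in a 2-D action."""
    return -a[0] ** 2 + 0.5 * a[0] * a[1] - 2.0 * a[1] ** 2 + a[0]

policy_mean = [0.3, -0.1]
hess = hessian_fd(toy_critic, policy_mean)
```

A larger `eps` evaluates the critic over a wider neighbourhood of the mean, which is one crude way to obtain curvature information from a piecewise-linear critic; the appropriate scale is problem dependent.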

### Linear GPG

We now state a consequence of Lemma 1 for the case when the critic is linear in the actions, i.e., the quadric term is always zero.

###### Corollary 4 (Linear Gaussian Policy Gradients).

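If the policy is Gaussian, i.e., $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ parametrised by $\theta$ and the critic is of the form $Q(a, s) = a^\top B(s) + \text{const}$, then $I^Q_\pi(s) = \big[ I^Q_{\pi(s), \mu_s} \;\; 0 \big]^\top$, where $I^Q_{\pi(s), \mu_s} = (\nabla \mu_s)\, B(s)$. Moreover, it is unnecessary to parameterise $\Sigma_s^{1/2}$ since the policy gradient w.r.t. $\Sigma_s^{1/2}$ is zero (i.e., a linear $Q$-function does not give any information about the exploration covariance).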

We make Corollary 4 explicit for two reasons. First, it is useful for showing an equivalence between DPG and EPG (see below). Second, it may actually be useful for a non-trivial class of physical systems: if the time-sampling frequency is high enough (which implies acting in small steps), the critic is effectively only used to say if a small step one way is preferable to small step the other way – a linear property.

### Equivalences between EPG and DPG

The update for the policy mean obtained in Corollary 3 is the same as the DPG update, linking the two methods:

$$I^Q_{\pi(s), \mu_s} \approx (\nabla \mu_s)\, \big[\nabla_a \hat{Q}(a, s)\big]_{a = \mu_s}.$$

We now formalise the equivalences between EPG and DPG. First, on-policy GPG with a linear critic (or an arbitrary critic approximated by the first term in the Taylor expansion) is equivalent to DPG with a Gaussian exploration policy where the covariance stays the same. This follows from Corollary 4. Second, on-policy GPG with a quadric critic (or an arbitrary critic approximated by the first two terms in the Taylor expansion) is equivalent to DPG with a Gaussian exploration policy where the covariance is computed using the update (where $\alpha_i$ is a sequence of step-sizes):

$$\Sigma_s^{1/2} \leftarrow \Sigma_s^{1/2} + \alpha_i\, H(\mu_s, s)\, \Sigma_s^{1/2}. \tag{11}$$

This follows from Corollary 3. Third, and most generally, for any critic at all (not necessarily quadric), DPG is a kind of EPG for a particular choice of quadrature (using a Dirac measure). This follows from Theorem 1.

Surprisingly, this means that DPG, normally considered to be off-policy, can also be seen as on-policy when exploring with Gaussian noise. Furthermore, the compatible critic for DPG (Silver et al., 2014) is indeed linear in the actions. Hence, this relationship holds whenever DPG uses a compatible critic (the notion of compatibility of a critic is different for stochastic and deterministic policy gradients). Furthermore, Lemma 1 lends new legitimacy to the common practice of replacing the critic required by the DPG theory, which approximates $\nabla_a Q$, with one that approximates $Q$ itself, as done in SPG and EPG.

### Exploration using the Hessian

The second equivalence given above suggests that we can include the covariance in the actor network and learn it along with the mean. However, another option is to compute it from scratch at each iteration by analytically computing the result of applying (11) infinitely many times.

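###### Lemma 2 (Robbins-Monro Exploration Limit).

The iterative procedure defined by the equation $\Sigma_s^{1/2} \leftarrow \Sigma_s^{1/2} + \alpha\, H(\mu_s, s)\, \Sigma_s^{1/2}$, applied $n$ times using the diminishing Robbins-Monro learning rate $\alpha = 1/n$, converges to $\Sigma_s^{1/2} \propto e^{H(\mu_s, s)}$ as $n \to \infty$.

###### Proof.

Consider the sequence $(\Sigma_s^{1/2})_0 = \sigma_0 I$, $(\Sigma_s^{1/2})_n = \big(I + \tfrac{1}{n} H(\mu_s, s)\big)^n \sigma_0$. We diagonalise the Hessian as $H(\mu_s, s) = U \Lambda U^\top$ for some orthonormal matrix $U$ and obtain the following expression for the $n$-th element of the sequence.

$$(\Sigma_s^{1/2})_n = \big(I + \tfrac{1}{n} H(\mu_s, s)\big)^n \sigma_0 = U \big(I + \tfrac{1}{n} \Lambda\big)^n U^\top \sigma_0.$$

Since we have $\lim_{n \to \infty} \big(1 + \tfrac{1}{n} \lambda\big)^n = e^{\lambda}$ for each eigenvalue $\lambda$ of the Hessian, we obtain the identity:

$$\lim_{n \to \infty} (\Sigma_s^{1/2})_n = \sigma_0\, U e^{\Lambda} U^\top = \sigma_0\, e^{H(\mu_s, s)}.$$

∎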
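The limit can be checked numerically. The sketch below assumes that the covariance root starts at $\sigma_0 I$ and that the update is applied $n$ times with step size $1/n$; it compares the result with a matrix exponential computed from a truncated Taylor series. The helper names and the $2 \times 2$ Hessian are hypothetical.

```python
def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def iterate_covariance_root(hessian, n_steps, sigma0=1.0):
    """Apply root <- root + (1 / n_steps) * H @ root, n_steps times, from sigma0 * I."""
    d = len(hessian)
    eye = identity(d)
    step = [[eye[i][j] + hessian[i][j] / n_steps for j in range(d)]
            for i in range(d)]
    root = [[sigma0 * v for v in row] for row in eye]
    for _ in range(n_steps):
        root = matmul(step, root)
    return root

def expm_taylor(m, terms=30):
    """Matrix exponential via a truncated Taylor series (adequate for small norms)."""
    d = len(m)
    result, term = identity(d), identity(d)
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(d)] for i in range(d)]
    return result

H = [[0.5, 0.2], [0.2, -1.0]]    # hypothetical critic Hessian at the policy mean
approx = iterate_covariance_root(H, n_steps=20000)
limit = expm_taylor(H)
```

In this example the direction with positive curvature ends up with a larger entry in `limit` than the direction with negative curvature, so more exploration noise is placed where the critic is not yet at a maximum.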

The practical implication of Lemma 2 is that, in a policy gradient method, it is justified to use Gaussian exploration with covariance proportional to $e^{cH}$ for some reward scaling constant $c$. Thus by exploring with (scaled) covariance $e^{cH}$, we obtain a principled alternative to the Ornstein-Uhlenbeck heuristic defined in (5). Our results below show that it also performs much better in practice. The derivation relies crucially on the use of decreasing Robbins-Monro step sizes, rather than finite step sizes, which we analyse in detail in Section 2 of the supplement.

Lemma 2 has an intuitive interpretation. If $H(\mu_s, s)$ has a large positive eigenvalue $\lambda$, then $\hat{Q}(\cdot, s)$ has a sharp minimum along the corresponding eigenvector, and the corresponding eigenvalue of $\Sigma_s^{1/2}$ is $e^{\lambda}$, i.e., also large. The result is a large exploration bonus along that direction, enabling the algorithm to leave local minima. Conversely, if $\lambda$ is negative, then $\hat{Q}(\cdot, s)$ has a maximum and so $e^{\lambda}$ is small, since exploration is not needed.

### Variance Analysis

We now prove that for any policy, the EPG estimator of (10) has lower variance than the SPG estimator of (2).

###### Lemma 3.

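If for all $s$, the random variable $\nabla \log \pi(a \mid s)\, \hat{Q}(a, s)$ where $a \sim \pi(\cdot \mid s)$ has nonzero variance, then

$$\mathbb{V}_\tau\Big[\sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\, \big(\hat{Q}(a_t, s_t) + b(s_t)\big)\Big] > \mathbb{V}_\tau\Big[\sum_{t=0}^{T} \gamma^t\, I^{\hat{Q}}_\pi(s_t)\Big].$$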

The proof is deferred to the supplement (see Lemma 3 there). Lemma 3’s assumption is reasonable since the only way a random variable could have zero variance is if it were the same for all actions in the policy’s support (except for sets of measure zero), in which case optimising the policy would be unnecessary. Since we know that both the estimators of (2) and (10) are unbiased, the estimator with lower variance has lower MSE.

### Extension to Entropy Regularisation

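On-policy SPG sometimes includes an entropy term in the gradient in order to aid exploration by making the policy more stochastic. The gradient of the differential entropy $\mathcal{H}(s)$ of the policy at state $s$ is defined as follows (for discrete action spaces, the same derivation with integrals replaced by sums holds for the entropy).

$$\begin{aligned}
-\nabla \mathcal{H}(s) &= \nabla \int_a d\pi(a \mid s)\, \log \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s)\, \log \pi(a \mid s) + \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s) \\
&= \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, \log \pi(a \mid s),
\end{aligned}$$

where the second term on the second line vanishes because $\int_a da\, \nabla \pi(a \mid s) = \nabla 1 = 0$.

Typically, we add the entropy update, weighted by a coefficient $\alpha$, to the policy gradient update:

$$I^Q_\pi(s) + \alpha\, \nabla \mathcal{H}(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, \big(Q(a, s) + b(s) - \alpha \log \pi(a \mid s)\big).$$

This equation makes clear that performing entropy regularisation is equivalent to using a different critic with $Q$-values shifted by $-\alpha \log \pi(a \mid s)$; this holds for both SPG and EPG.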
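The discrete-action version of this equivalence can be sketched as follows, assuming a softmax policy parameterised by its logits in a single state and an entropy weight `alpha`. The function names and numbers are hypothetical. The gradient computed with the shifted critic should match the gradient of the expected critic value plus `alpha` times the entropy.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def regularised_gradient(logits, q_values, alpha):
    """sum_a pi(a) * grad log pi(a) * (Q(a) - alpha * log pi(a)) w.r.t. the logits."""
    pi = softmax(logits)
    shifted_q = [q - alpha * math.log(p) for q, p in zip(q_values, pi)]
    mean_shifted = sum(p * q for p, q in zip(pi, shifted_q))
    # For softmax logits the sum over actions collapses to pi_k * (q_k - mean).
    return [p * (q - mean_shifted) for p, q in zip(pi, shifted_q)]

def objective(logits, q_values, alpha):
    """Expected critic value plus alpha times the policy entropy."""
    pi = softmax(logits)
    entropy = -sum(p * math.log(p) for p in pi)
    return sum(p * q for p, q in zip(pi, q_values)) + alpha * entropy

logits = [0.2, -0.4, 0.9]
q_values = [1.0, 0.5, -1.0]
alpha = 0.1
grad = regularised_gradient(logits, q_values, alpha)
```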

Table 1: Empirical standard deviation across runs at the end of learning, with confidence intervals in brackets.

| Domain | $\hat{\sigma}_{\text{DPG}}$ | $\hat{\sigma}_{\text{EPG}}$ |
|---|---|---|
| HalfCheetah-v1 | 1336.39 [1107.85, 1614.51] | 1056.15 [875.54, 1275.94] |
| InvertedPendulum-v1 | 291.26 [241.45, 351.88] | 0.00 (n/a) |
| Reacher2d-v1 | 1.22 [0.63, 2.31] | 0.13 [0.07, 0.26] |
| Walker2d-v1 | 543.54 [450.58, 656.65] | 762.35 [631.98, 921.00] |

## Experiments

While EPG has many potential uses, we focus on empirically evaluating one particular application: exploration driven by the Hessian exponential (as introduced in Algorithm 2 and Lemma 2), replacing the standard Ornstein-Uhlenbeck (OU) exploration in continuous action domains. To this end, we applied EPG to four domains modelled with the MuJoCo physics simulator (Todorov, Erez, and Tassa, 2012): HalfCheetah-v1, InvertedPendulum-v1, Reacher2d-v1 and Walker2d-v1 and compared its performance to DPG and SPG.

In practice, EPG differed from deep DPG (Lillicrap et al., 2015; Silver et al., 2014) only in the exploration strategy, though their theoretical underpinnings are different. The hyperparameters for DPG and those of EPG that are not related to exploration were taken from an existing benchmark (Islam et al., 2017; Brockman et al., 2016). The exploration hyperparameters for EPG were $\sigma_0$ and $c$, where the exploration covariance is $\sigma_0 e^{cH}$. Their values were obtained using a grid search over the HalfCheetah-v1 domain. Since $c$ is just a constant scaling the rewards, it is reasonable to fix it whenever reward scaling is already used. Hence, our exploration strategy has just one hyperparameter, $\sigma_0$, as opposed to specifying a pair of parameters (standard deviation and mean reversion constant) for OU. We used the same learning parameters for the other domains. For SPG, we used OU exploration and a constant diagonal covariance in the actor update (this approximately corresponds to the average variance of the OU process over time). We also tried learning the covariance for SPG, but the covariance estimate was unstable; no regularisation hyperparameters we tested matched SPG’s performance with OU even on the simplest domain. The other parameters for SPG are the same as for the other algorithms. We obtained 90% confidence intervals around the learning curves. The learning curves show results of independent evaluation runs which used actions generated by the policy mean without any exploration noise.

The results (Figure 2) show that EPG’s exploration strategy yields much better performance than DPG with OU. Furthermore, SPG does poorly, solving only the easiest domain (InvertedPendulum-v1) reasonably quickly, achieving slow progress on HalfCheetah-v1, and failing entirely on the other domains. This is not surprising, since DPG was introduced precisely to solve the problem of high variance SPG estimates on this type of problem. In InvertedPendulum-v1, SPG initially learns quickly, outperforming the other methods. This is because noisy gradient updates provide a crude, indirect form of exploration that happens to suit this problem. Clearly, this is inadequate for more complex domains: even for this simple domain it leads to subpar performance late in learning.

In addition, EPG typically learns more consistently than DPG with OU. In two tasks, the empirical standard deviation across runs of EPG ($\hat{\sigma}_{\text{EPG}}$) was substantially lower than that of DPG ($\hat{\sigma}_{\text{DPG}}$) at the end of learning, as shown in Table 1. For the other two domains, the confidence intervals around the empirical standard deviations for DPG and EPG were too wide to draw conclusions.

Surprisingly, for InvertedPendulum-v1, DPG’s learning curve declines late in learning. The reason can be seen in the individual runs shown in Figure 3: both DPG and SPG suffer from severe unlearning. This unlearning cannot be explained by exploration noise since the evaluation runs just use the mean action, without exploring. Instead, OU exploration in DPG may be too coarse, causing the optimiser to exit good optima, while SPG unlearns due to noise in the gradients. The noise also helps speed initial learning, as described above, but this does not transfer to other domains. EPG avoids this problem by automatically reducing the noise when it finds a good optimum, i.e., a Hessian with large negative eigenvalues.

## Conclusions

This paper proposed a new policy gradient method called expected policy gradients (EPG) that integrates across the action selected by the stochastic policy. We used EPG to prove a new general policy gradient theorem subsuming the stochastic and deterministic policy gradient theorems. We also showed that, under certain realistic conditions, the quadrature required by EPG can be performed analytically, allowing DPG with principled exploration. We presented empirical results confirming that this application of EPG outperforms DPG and SPG on four domains.

## Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).

## References

- Amari (1998) Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural computation 10(2):251–276.
- Asadi et al. (2017) Asadi, K.; Allen, C.; Roderick, M.; Mohamed, A.-r.; Konidaris, G.; and Littman, M. 2017. Mean Actor Critic. ArXiv e-prints.
- Baird and others (1995) Baird, L., et al. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning, 30–37.
- Bhatnagar et al. (2008) Bhatnagar, S.; Ghavamzadeh, M.; Lee, M.; and Sutton, R. S. 2008. Incremental natural actor-critic algorithms. In Advances in neural information processing systems, 105–112.
- Brockman et al. (2016) Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.
- Furmston and Barber (2012) Furmston, T., and Barber, D. 2012. A unifying perspective of parametric policy search methods for Markov decision processes. In Advances in neural information processing systems, 2717–2725.
- Furmston, Lever, and Barber (2016) Furmston, T.; Lever, G.; and Barber, D. 2016. Approximate Newton methods for policy search in Markov decision processes. Journal of Machine Learning Research 17(227):1–51.
- Gu et al. (2016a) Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016a. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.
- Gu et al. (2016b) Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016b. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, 2829–2838.
- Heess et al. (2015) Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2944–2952.
- Islam et al. (2017) Islam, R.; Henderson, P.; Gomrokchi, M.; and Precup, D. 2017. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.
- Kakade (2002) Kakade, S. M. 2002. A natural policy gradient. In Advances in neural information processing systems, 1531–1538.
- Kingma and Ba (2014) Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lagoudakis and Parr (2003) Lagoudakis, M. G., and Parr, R. 2003. Least-squares policy iteration. Journal of machine learning research 4(Dec):1107–1149.
- Lillicrap et al. (2015) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Parisi, Pirotta, and Restelli (2016) Parisi, S.; Pirotta, M.; and Restelli, M. 2016. Multi-objective reinforcement learning through continuous Pareto manifold approximation. Journal of Artificial Intelligence Research 57:187–227.
- Peters and Schaal (2006) Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, 2219–2225. IEEE.
- Peters and Schaal (2008a) Peters, J., and Schaal, S. 2008a. Natural actor-critic. Neurocomputing 71(7):1180–1190.
- Peters and Schaal (2008b) Peters, J., and Schaal, S. 2008b. Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4):682–697.
- Pirotta, Restelli, and Bascetta (2013) Pirotta, M.; Restelli, M.; and Bascetta, L. 2013. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, 1394–1402.
- Rummery and Niranjan (1994) Rummery, G. A., and Niranjan, M. 1994. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering.
- Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1889–1897.
- Sehnke et al. (2010) Sehnke, F.; Osendorfer, C.; Rückstieß, T.; Graves, A.; Peters, J.; and Schmidhuber, J. 2010. Parameter-exploring policy gradients. Neural Networks 23(4):551–559.
- Silver et al. (2014) Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
- Sutton and Barto (1998) Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge.
- Sutton et al. (2000) Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
- Sutton (1996) Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems 1038–1044.
- Todorov, Erez, and Tassa (2012) Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, 5026–5033. IEEE.
- Uhlenbeck and Ornstein (1930) Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical Review 36(5):823.
- van Seijen et al. (2009) van Seijen, H.; van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected sarsa. In ADPRL 2009: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177–184.

## Supplement

In the section Proofs, we provide formal proofs of certain statements used in the paper. In the section Remarks on the Robbins-Monro limit, we briefly discuss the use of the Robbins-Monro learning rate in the computation of the covariance.

## Proofs

First, we prove two lemmas concerning the discounted-ergodic measure which have long been used implicitly but, as far as we could find, never proved explicitly.

###### Definition 1 (Time-dependent occupancy).

###### Definition 2 (Truncated trajectory).

Define the trajectory truncated after $N$ steps as $\tau_N = (s_0, s_1, \dots, s_N)$.

###### Observation 1 (Expectation with respect to truncated trajectory).

Since the truncated trajectory $\tau_N$ is associated with the density $p_0(s_0) \prod_{i=0}^{N-1} p(s_{i+1} \mid s_i)$, we have that

for any function $f$.

###### Definition 3 (Expectation with respect to infinite trajectory).

For any bounded function $f$, we have

Here, the sum on the left-hand side is part of the symbol being defined.

###### Observation 2 (Property of expectation with respect to infinite trajectory).

for any bounded function $f$.

###### Definition 4 (Discounted-ergodic occupancy measure $\rho$).

The measure $\rho$ is not normalised in general. Intuitively, it can be thought of as ‘marginalising out’ the time in the system dynamics.
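For orientation, the chain of definitions and observations above can be written out as follows. This is a sketch under assumed notation rather than a verbatim restatement: $p(s' \mid s)$ denotes the transition kernel, $p_0$ the initial-state distribution, $p(s \mid t = i)$ the state density at time $i$, $\tau_N$ the trajectory truncated after $N$ steps, $f$ a bounded function and $\rho$ the occupancy measure.

```latex
\begin{align*}
p(s \mid t = 0) &= p_0(s), \qquad
p(s' \mid t = i+1) = \int_s p(s' \mid s)\, p(s \mid t = i)\, ds, \\
\mathbb{E}_{\tau_N}\Big[\sum_{i=0}^{N} \gamma^i f(s_i)\Big]
  &= \sum_{i=0}^{N} \gamma^i \int_s p(s \mid t = i)\, f(s)\, ds, \\
\mathbb{E}_{\tau}\Big[\sum_{i=0}^{\infty} \gamma^i f(s_i)\Big]
  &\triangleq \lim_{N \to \infty} \mathbb{E}_{\tau_N}\Big[\sum_{i=0}^{N} \gamma^i f(s_i)\Big]
   = \sum_{i=0}^{\infty} \gamma^i \int_s p(s \mid t = i)\, f(s)\, ds, \\
\rho(s) &= \sum_{i=0}^{\infty} \gamma^i\, p(s \mid t = i).
\end{align*}
```

Under this reading, integrating $\rho$ over the state space gives $\sum_{i} \gamma^i = 1/(1-\gamma)$ for $\gamma < 1$, which is one way to see why the measure is not normalised.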

###### Lemma 4 (Discounted-ergodic property).

For any bounded function $f$:

###### Proof.

Here, the first equality follows from Observation 2. ∎

This property is useful since the expression on the left can be easily manipulated while the expression on the right can be estimated from samples using Monte Carlo.
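The two sides of this property can be compared numerically on a small finite-state chain. The sketch below is illustrative only: the transition matrix `P`, initial distribution `p0`, function `f` and the helper names are all assumptions made for the example, and the occupancy is taken to be $\rho = \sum_i \gamma^i\, p(s \mid t = i)$, which in the finite case reduces to a linear solve.

```python
import numpy as np

# Hypothetical 3-state chain; all numbers are illustrative assumptions.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])   # P[s, s'] = p(s' | s)
p0 = np.array([1.0, 0.0, 0.0])    # initial-state distribution
gamma = 0.9
f = np.array([1.0, -2.0, 0.5])    # a bounded function of the state


def discounted_occupancy(P, p0, gamma):
    """Return rho with rho[s] = sum_i gamma^i * Pr(s_i = s)."""
    n = P.shape[0]
    # rho^T = p0^T (I - gamma P)^{-1}, i.e. (I - gamma P)^T rho = p0.
    return np.linalg.solve((np.eye(n) - gamma * P).T, p0)


def mc_discounted_sum(P, p0, gamma, f, n_traj=20000, horizon=150, seed=0):
    """Monte Carlo estimate of E[sum_i gamma^i f(s_i)], truncated at `horizon`."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cum = np.cumsum(P, axis=1)
    s = rng.choice(n, size=n_traj, p=p0)
    total = np.zeros(n_traj)
    for i in range(horizon):
        total += gamma**i * f[s]
        # Inverse-CDF sampling of the next state for all trajectories at once.
        u = rng.random(n_traj)
        s = np.minimum((u[:, None] > cum[s]).sum(axis=1), n - 1)
    return total.mean()


rho = discounted_occupancy(P, p0, gamma)
lhs = rho @ f                             # finite-state analogue of the integral of rho * f
rhs = mc_discounted_sum(P, p0, gamma, f)  # sample-based estimate from trajectories
print(lhs, rhs)
```

The first quantity is obtained by algebra on $\rho$, the second purely from sampled trajectories; the two should agree up to Monte Carlo error and the (small) truncation of the horizon.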

###### Lemma 5 (Generalised eigenfunction property).

For any bounded function $f$:

###### Proof.
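One identity of this type can be derived in a few lines. It is offered as a sketch under assumed notation and may differ in form from the statement intended above: $\rho(s) = \sum_{i \ge 0} \gamma^i\, p(s \mid t = i)$, where $p(s \mid t = i)$ is the state density at time $i$, $p(s \mid t = 0) = p_0(s)$, and $p(s' \mid s)$ is the transition kernel.

```latex
\begin{align*}
\gamma \int_s \rho(s) \int_{s'} p(s' \mid s)\, f(s')\, ds'\, ds
 &= \sum_{i=0}^{\infty} \gamma^{i+1} \int_{s'} \Big( \int_s p(s' \mid s)\, p(s \mid t = i)\, ds \Big) f(s')\, ds' \\
 &= \sum_{i=0}^{\infty} \gamma^{i+1} \int_{s'} p(s' \mid t = i+1)\, f(s')\, ds' \\
 &= \sum_{i=1}^{\infty} \gamma^{i} \int_{s} p(s \mid t = i)\, f(s)\, ds \\
 &= \int_s \rho(s)\, f(s)\, ds - \int_s p_0(s)\, f(s)\, ds .
\end{align*}
```

The exchange of sum and integral in the first step relies on $f$ being bounded and $\gamma < 1$; the last step simply adds and subtracts the $i = 0$ term.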

###### Definition 5 (Markov Reward Process).

A Markov Reward Process is a tuple $(p, p_0, R, \gamma)$, where $p(s' \mid s)$ is a transition kernel, $p_0$ is the distribution over initial states, $R(\cdot \mid s)$ is a reward distribution conditioned on the state and $\gamma$ is the discount constant.

An MRP can be thought of as an MDP with a fixed policy and dynamics given by marginalising out the actions, $p(s' \mid s) = \int_a \pi(a \mid s)\, p(s' \mid s, a)\, da$. Since this paper considers the case of one policy, we abuse notation slightly by using the same symbol $\tau$ to denote trajectories including actions, i.e. $(s_0, a_0, s_1, a_1, \dots)$, and without them, i.e. $(s_0, s_1, \dots)$.

###### Lemma 6 (Second Moment Bellman Equation).

Consider a Markov Reward Process $(p, p_0, X, \gamma)$, where $p(s' \mid s)$ is a Markov process and $X(\cdot \mid s)$ is some probability density function. (Note that while $X$ occupies a place in the definition of the MRP usually called ‘reward distribution’, we are using the symbol $X$, not $R$, since we shall apply the lemma to $X$es which are constructions distinct from the reward of the MDP we are solving.) Denote the value function of the MRP as $V$. Denote the second moment function $S$ as

Then $S$ is the value function of the MRP $(p, p_0, u, \gamma^2)$, where $u(s)$ is a deterministic random variable given by

###### Proof.

This is exactly the Bellman equation of the MRP $(p, p_0, u, \gamma^2)$. The lemma follows since the Bellman equation uniquely determines the value function. ∎
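A finite-state numerical check can make the lemma concrete. The sketch below rests on several assumptions that are not taken from the text: per-state draws $x_t$ are Gaussian with mean `m[s]` and variance `v[s]`, the second moment is $S(s) = \mathbb{E}[(\sum_t \gamma^t x_t)^2 \mid s_0 = s]$, and the deterministic term is $u(s) = \mathrm{Var}[x \mid s] + \mathbb{E}[x \mid s]^2 + 2\gamma\, \mathbb{E}[x \mid s]\, \mathbb{E}_{s'}[V(s')]$, which is what one obtains by squaring $x_0 + \gamma \cdot (\text{rest of the return})$ when $x_0$ is independent of $s'$ given $s$. All names and numbers are hypothetical.

```python
import numpy as np

# Hypothetical 3-state MRP; all numbers are illustrative assumptions.
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])   # P[s, s'] = p(s' | s)
m = np.array([1.0, -0.5, 2.0])    # E[x | s]
v = np.array([0.5, 1.0, 0.25])    # Var[x | s]
gamma = 0.8


def value_function(P, m, gamma):
    """Solve the Bellman equation V = m + gamma * P V."""
    return np.linalg.solve(np.eye(len(m)) - gamma * P, m)


def second_moment(P, m, v, gamma):
    """Solve S = u + gamma^2 * P S with the assumed form of u described above."""
    V = value_function(P, m, gamma)
    u = v + m**2 + 2.0 * gamma * m * (P @ V)
    return np.linalg.solve(np.eye(len(m)) - gamma**2 * P, u)


def mc_second_moment(P, m, v, gamma, start, n_traj=40000, horizon=100, seed=0):
    """Monte Carlo estimate of E[(sum_t gamma^t x_t)^2 | s_0 = start]."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cum = np.cumsum(P, axis=1)
    s = np.full(n_traj, start)
    G = np.zeros(n_traj)
    for t in range(horizon):
        x = rng.normal(m[s], np.sqrt(v[s]))   # draw x_t given the current state
        G += gamma**t * x
        # Inverse-CDF sampling of the next state for all trajectories at once.
        u_rand = rng.random(n_traj)
        s = np.minimum((u_rand[:, None] > cum[s]).sum(axis=1), n - 1)
    return np.mean(G**2)


S = second_moment(P, m, v, gamma)
S_mc = np.array([mc_second_moment(P, m, v, gamma, start=s0) for s0 in range(3)])
print(S, S_mc)
```

The closed-form solve uses discount $\gamma^2$, matching the MRP in the lemma, while the Monte Carlo estimate never uses $u$ at all, so agreement between the two is an independent check of the recursion.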

###### Observation 3 (Dominated Value Functions).

Consider two Markov Reward Processes $(p, p_0, u_1, \gamma)$ and $(p, p_0, u_2, \gamma)$, where $p(s' \mid s)$ is a Markov process (common to both MRPs) and $u_1(s)$, $u_2(s)$ are some deterministic random variables meeting the condition $u_1(s) \le u_2(s)$ for every $s$. Then the value functions $V_1$ and $V_2$ of the respective MRPs satisfy $V_1(s) \le V_2(s)$ for every $s$. Moreover, if we have that $u_1(s) < u_2(s)$ for all states, then the inequality between value functions is strict.

###### Proof.

Follows trivially by expanding the value function as a series and comparing series elementwise. ∎
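Spelled out, the series comparison looks as follows. The notation is assumed for this sketch: $u_1$, $u_2$ are the two deterministic reward functions with $u_1(s) \le u_2(s)$ for every $s$ (the direction is an assumption; the reverse case is symmetric), $V_1$, $V_2$ are the corresponding value functions, and $p(s' \mid t = i, s_0 = s)$ is the state density $i$ steps after starting in $s$.

```latex
\begin{align*}
V_1(s)
 = \sum_{i=0}^{\infty} \gamma^i \int_{s'} p(s' \mid t = i, s_0 = s)\, u_1(s')\, ds'
 \;\le\; \sum_{i=0}^{\infty} \gamma^i \int_{s'} p(s' \mid t = i, s_0 = s)\, u_2(s')\, ds'
 = V_2(s).
\end{align*}
```

Each term on the left is bounded by the corresponding term on the right because the densities are non-negative. If $u_1(s) < u_2(s)$ for all states, the $i = 0$ terms, $u_1(s)$ and $u_2(s)$, already differ strictly, so the inequality between the sums is strict.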

We now turn to the proof of the Gaussian Policy Gradients lemma.

###### Lemma 1 (Gaussian Policy Gradients).

If the policy is Gaussian, i.e. with and parametrised by , and the critic is of the form , then , where the mean and covariance components are given by and .

###### Proof.

We will first prove the lemma for the case where the action space is one-dimensional, where and