Deterministic Value-Policy Gradients
Abstract
Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper, we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in the infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which the number of rollout steps of the analytical gradients through the learned model trades off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. The results show that DVPG substantially outperforms the baselines.
Introduction
Silver et al. propose the deterministic policy gradient (DPG) algorithm [23], which aims to find an optimal deterministic policy that maximizes the expected long-term reward and lowers the variance of the policy gradient estimate compared to stochastic policies [25]. Lillicrap et al. further combine deep neural networks with DPG to improve the modeling capacity, and propose the deep deterministic policy gradient (DDPG) algorithm [16]. DDPG has been successful in robotic control tasks such as locomotion and manipulation. Despite its effectiveness in these tasks, it suffers from high sample complexity [22].
Deterministic value gradient methods [30, 20, 12, 5] compute the policy gradient by backpropagating the reward along a trajectory predicted by the learned model, which enables better sample efficiency. However, to the best of our knowledge, existing work on deterministic value gradient methods focuses only on the finite-horizon setting, which is too myopic and can lead to large bias. Stochastic value gradient (SVG) methods [11] use the reparameterization technique to optimize stochastic policies. Among the class of SVG algorithms, although SVG(1) studies infinite-horizon problems, it only uses a one-step rollout, which limits its efficiency. It also suffers from high variance due to the importance sampling ratio and the randomness of the policy.
In this paper, we study the infinite-horizon setting, where both state transitions and policies are deterministic. [11] gives recursive Bellman gradient equations for deterministic value gradients, but the gradient lacks a theoretical guarantee because the DPG theorem does not hold in the deterministic transition case. We prove that the gradient indeed exists for a certain set of discount factors. We then derive a closed form of the value gradients.
However, estimating the deterministic value gradients is much more challenging. The difficulty mainly comes from the dependency on the gradient of the value function over the state. Computing it may involve an infinite product of gradients of the transition function, which is hard to make converge. Thus, applying the Bellman gradient equation recursively may incur high instability.
To overcome these challenges, we use model-based approaches to predict the reward and transition functions. Based on the theoretical guarantee of the closed form of the value gradients in this setting, we propose a class of deterministic value gradients DVG($k$) with infinite horizon, where $k$ denotes the number of rollout steps. For each choice of $k$, we use the rewards predicted by the model and the action-value at the end of the rollout to estimate the value gradients over the state, in order to reduce the instability of the gradient of the value function over the state. The number of rollout steps trades off the accumulated model bias against the variance of the gradient over the state. The deterministic policy gradient estimator can be viewed as a special case of this class, i.e., it never uses the model to estimate the value gradients, and we refer to it as DVG(0).
Model-based approaches are more sample efficient than model-free algorithms [15, 14], but model-based deterministic value gradients may incur model bias [27]. We therefore consider an essential question: how can model-based gradients and model-free gradients be combined efficiently?
We propose a temporal difference method to ensemble gradients with different rollout steps. The intuition is to ensemble different gradient estimators with geometrically decaying weights. Based on this estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. The algorithm updates the policy by stochastic gradient ascent with the ensembled value gradients of the policy, and the weight maintains a trade-off between sample efficiency and performance.
To sum up, the main contributions of the paper are as follows:

First, we provide a theoretical guarantee for the existence of the deterministic value gradients in settings with infinite horizon.

Second, we propose a novel algorithm that ensembles the deterministic value gradients and the deterministic policy gradients, called deterministic value-policy gradient (DVPG), which effectively combines model-free and model-based methods. DVPG reduces sample complexity, converges faster, and improves performance.

Finally, we conduct extensive experiments on standard benchmarks, comparing with DDPG, DDPG with model-based rollouts, the stochastic value gradient algorithm SVG(1), and state-of-the-art stochastic policy gradient methods. The results confirm that DVPG significantly outperforms the other algorithms in terms of both sample efficiency and performance.
Related Work
Model-based algorithms have been widely studied [18, 19, 9, 10, 2, 31] in recent years. Model-based methods allow for more efficient computation and faster convergence than model-free methods [28, 15, 14, 29].
There are two classes of model-based methods. The first uses the learned model to perform imagination rollouts that accelerate learning. [8, 13] generate synthetic samples with the learned model. PILCO [3] learns the transition model with Gaussian processes and applies policy improvement on analytic policy gradients. The second uses the learned model to obtain better estimates of action-value functions. The value prediction network (VPN) uses the learned transition model to get a better target estimate [21]. [7, 1] combine different model-based value expansion functions through a temporal-difference trick or stochastic distributions to improve the estimator of the action-value function.
Different from previous model-based methods, we present a temporal difference method that ensembles model-based deterministic value gradients and model-free policy gradients. Our technique can be combined with both the imagination rollout technique and the model-based value expansion technique.
Preliminaries
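A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, p_0)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the sets of states and actions, respectively. $p(s_{t+1} \mid s_t, a_t)$ represents the conditional density from state $s_t$ to state $s_{t+1}$ under action $a_t$. The density of the initial state distribution is denoted by $p_0(s)$. At each time step $t$, the agent interacts with the environment using a deterministic policy $\mu_\theta$. We use $r(s_t, a_t)$ to represent the immediate reward, contributing to the discounted overall reward from state $s_0$ following $\mu_\theta$, denoted by $J(\mu_\theta) = \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \mid \mu_\theta, s_0]$. Here, $\gamma \in [0, 1)$ is the discount factor. The Q-function of state $s_t$ and action $a_t$ under policy $\mu_\theta$ is denoted by $Q^{\mu_\theta}(s_t, a_t) = \mathbb{E}[\sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k) \mid \mu_\theta, s_t, a_t]$. The corresponding value function of state $s_t$ under policy $\mu_\theta$ is denoted by $V^{\mu_\theta}(s_t) = Q^{\mu_\theta}(s_t, \mu_\theta(s_t))$. We denote the density at state $s'$ after $t$ time steps from state $s$ following the policy $\mu_\theta$ by $p(s, s', t, \mu_\theta)$. We denote the discounted state distribution by $\rho^{\mu_\theta}(s') = \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_0(s)\, p(s, s', t, \mu_\theta)\, ds$. The agent aims to find an optimal policy that maximizes $J(\mu_\theta)$.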
Deterministic Value Gradients
In this section, we study the infinite-horizon setting with deterministic state transitions, which poses challenges for the existence of deterministic value gradients. We first prove that, under proper conditions, the deterministic value gradient does exist. Based on this theoretical guarantee, we then propose a class of practical algorithms that roll out different numbers of steps. Finally, we discuss the differences and connections between our proposed algorithms and existing work.
The deterministic policy gradient (DPG) theorem [23] proves the existence of the deterministic policy gradient for MDPs that satisfy the regularity condition, which requires the probability density of the next state to be differentiable in the action $a$. In the proof of the DPG theorem, the existence of the gradient of the value function is proven first, i.e.,
(1) 
then the gradient of the long-term reward exists. Without this condition, the arguments in the proof of the DPG theorem do not work (readers can refer to http://proceedings.mlr.press/v32/silver14supp.pdf), which poses challenges for cases where differentiability is not satisfied. Note that this condition does not hold in any case with deterministic transitions. Therefore, a new theoretical guarantee is needed to establish the existence of the gradient of the long-term reward with respect to the policy parameters in deterministic state transition cases.
Deterministic value gradient theorem
We now analyze the gradient of a deterministic policy. Denote by $T(s, a)$ the next state given the current state $s$ and action $a$. Without loss of generality, we assume that the transition function $T$ is continuous, differentiable in $s$ and $a$, and bounded. Note that the regularity condition is not equivalent to this assumption. For example, a deterministic transition makes the next-state density a point mass, so the gradient of the density over the action is infinite or does not exist, whereas the gradient of the transition function over the action exists. By definition,
Therefore, the key to the existence (and estimation) of the gradient of $V^{\mu_\theta}(s)$ over $\theta$ is the existence (and estimation) of $\nabla_s V^{\mu_\theta}(s)$. In Theorem 1, we give a sufficient condition for the existence of $\nabla_s V^{\mu_\theta}(s)$.
Theorem 1
For any policy $\mu_\theta$, the gradient of the value function over the state, $\nabla_s V^{\mu_\theta}(s)$, exists under two assumptions:

A.1: The set of states that the policy visits starting from any initial state is finite.

A.2: For any initial state $s$, Assumption A.1 implies that the visited states eventually form a periodic loop. The power sum of the discounted product of the transition gradients along this loop converges.
Proof 1
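By definition,
$$V^{\mu_\theta}(s) = r(s, \mu_\theta(s)) + \gamma V^{\mu_\theta}(s')\big|_{s' = T(s, \mu_\theta(s))}. \qquad (2)$$
Taking the gradient of Eq. (2), we obtain
$$\nabla_s V^{\mu_\theta}(s) = \nabla_s r(s, \mu_\theta(s)) + \gamma \nabla_s T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s')\big|_{s' = T(s, \mu_\theta(s))}. \qquad (3)$$
Unrolling Eq. (3) with infinite steps, we get
$$\nabla_s V^{\mu_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t g(s, t, \mu_\theta) \nabla_{s_t} r(s_t, \mu_\theta(s_t)), \qquad (4)$$
where $T$ is the transition function, $g(s, t, \mu_\theta) = \prod_{i=0}^{t-1} \nabla_{s_i} T(s_i, \mu_\theta(s_i))$, $s_0 = s$, and $s_t$ is the state after $t$ steps following policy $\mu_\theta$.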
Under Assumption A.1, we rewrite Eq. (4) with an indicator function that indicates whether a given state is reached after a given number of steps from the initial state following the policy:
(5) 
Where is the set of states the policy visits from .
We now prove that, for any visited state, the infinite sum of gradients converges.
For each state , there are three cases during the process from the initial state with infinite steps:

Never visited:

Visited once: Let denote the number of steps that it takes to reach the state , then

Visited infinite times: Let denote the number of steps it takes to reach for the first time. The state will be revisited every steps after the previous visit. By definition,
(6) 
By Assumption A.2, Eq. (6) converges.
By exchanging the order of the limit and the summation,
(7) 
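To make the unrolled sum in the proof concrete, the following minimal sketch evaluates it on a scalar toy problem and compares it with a finite-difference derivative of the closed-form value function. The closed-loop gain `C`, the quadratic reward, and the discount factor are hypothetical choices made for illustration only; they are not taken from the paper's experiments.

```python
GAMMA = 0.9
C = 0.55      # hypothetical closed-loop gain: s' = C * s under the fixed policy
Q_COST = 1.0  # hypothetical reward under the fixed policy: r(s) = -Q_COST * s**2


def value(s: float) -> float:
    """Closed-form V(s) = -Q_COST * s^2 / (1 - GAMMA * C^2), valid when GAMMA * C^2 < 1."""
    return -Q_COST * s * s / (1.0 - GAMMA * C * C)


def unrolled_state_gradient(s: float, steps: int) -> float:
    """Truncated sum over t of GAMMA^t * (product of dT/ds) * dr/ds evaluated at s_t."""
    grad, discount, jac_product = 0.0, 1.0, 1.0
    for _ in range(steps):
        grad += discount * jac_product * (-2.0 * Q_COST * s)  # dr/ds at the current state
        jac_product *= C   # chain rule through one more transition
        discount *= GAMMA
        s = C * s          # deterministic transition to the next state
    return grad


def finite_difference(s: float, eps: float = 1e-6) -> float:
    """Central-difference derivative of the closed-form value function."""
    return (value(s + eps) - value(s - eps)) / (2.0 * eps)


approx = unrolled_state_gradient(1.0, steps=200)
reference = finite_difference(1.0)
```

In this toy case each term of the sum shrinks by the factor `GAMMA * C**2`, so the series converges exactly when that factor is below one, which mirrors the role of the discount factor in the convergence argument.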
Assumption A.1 guarantees the existence of the stationary distribution of states theoretically. In practice, it holds on most continuous tasks, e.g., InvertedPendulum-v2 in MuJoCo. We directly test a deterministic policy with a 2-layer fully connected network on this environment for 10,000 episodes (we tested different weights, and a finite set of visited states is very common across them), and we count the number of times each state is visited. After projecting the data into 2D space by t-SNE [17], we obtain the state visitation density contour shown in Figure 1. We have two findings: (1) the set of states visited by the policy is finite; (2) many states are visited multiple times, which justifies Assumption A.1.
By analyzing Assumption A.2, we get that for any policy and state, there exists a set of discount factors such that the gradient of the value function over the state exists, as stated in Corollary 1. Please refer to Appendix A for the proof.
Corollary 1
For any policy and any initial state , let denote the loop of states following the policy and the state, , the gradient of the value function over the state, exists if
In Theorem 2, we show that the deterministic value gradients exist and obtain the closed form based on the analysis in Theorem 1. Please refer to Appendix B for the proof.
Theorem 2
(Deterministic Value Gradient Theorem) For any policy and MDP with deterministic state transitions, if assumptions A.1 and A.2 hold, the value gradients exist, and
where is the action the policy takes at state , is the discounted state distribution starting from the state and the policy, and is defined as
Deterministic value gradient algorithm
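The value gradient methods estimate the gradient of the value function recursively [4]:
$$\nabla_\theta V^{\mu_\theta}(s) = \nabla_\theta r(s, \mu_\theta(s)) + \gamma \nabla_\theta T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s') + \gamma \nabla_\theta V^{\mu_\theta}(s'), \qquad (8)$$
$$\nabla_s V^{\mu_\theta}(s) = \nabla_s r(s, \mu_\theta(s)) + \gamma \nabla_s T(s, \mu_\theta(s)) \nabla_{s'} V^{\mu_\theta}(s'), \qquad (9)$$
where $T$ is the transition function and $s' = T(s, \mu_\theta(s))$ is the next state. There are two kinds of approaches for estimating the gradient of the value function over the state, i.e., infinite and finite. On the one hand, directly estimating the gradient of the value function over the state by applying Eq. (9) recursively an infinite number of times is slow to converge. On the other hand, estimating the gradient with a finite horizon, as in traditional value gradient methods [30, 20, 11], may cause a large bias in the gradient.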
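We set out to estimate the action-value function with a parameterized critic, and replace the state gradient of the value function at the next state by the corresponding gradient of this critic in Eq. (8). In this way, we directly obtain a 1-step estimator of the value gradients,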
(10) 
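where $s'$ is the next state of $s$. This estimator can be generalized to $k$ rollout steps. Let $s_i$ denote the state visited by the policy at the $i$-th step starting from the initial state $s_0$. We roll out the model to obtain rewards, then replace the state gradient of the value function at the end of the rollout by that of the learned action-value function in Eq. (9), and we get
Substituting this estimate into Eq. (8), we get a $k$-step estimator of the value gradients: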
(11) 
It is easy to see that and are the same if we have the true reward and transition functions, which is generally not the case as we need to learn the model in practical environments. Let denote the value gradient at the sampled state with rollout steps, on learned transition function and reward function , which is defined as:
(12) 
Based on Eq. (12), we propose the deterministic value gradient algorithm with infinite horizon, shown in Algorithm 1: given samples, for each choice of $k$, we use the sample mean of the $k$-step value gradient to update the current policy. We use sample-based methods to estimate the deterministic value gradients. For each state in the trajectory, we take the analytic gradients through the learned model. As the model is not given, we learn to predict the reward function and the transition function. We use experience replay so that the comparison with the DDPG algorithm is fair. The number of rollout steps trades off variance against bias: more steps mean lower variance of the value gradients but larger bias due to the accumulated model error.
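The rollout-then-bootstrap structure described above can be sketched on a scalar toy problem. The sketch below works with value estimates rather than gradients, since the same trade-off is easier to see there: it sums `k` model-predicted rewards and then bootstraps with a critic. The linear dynamics, quadratic reward, linear policy, and the deliberately biased model are all hypothetical, and the critic is taken to be exact so that only the model error matters.

```python
GAMMA = 0.9
THETA = -0.5            # hypothetical deterministic linear policy: a = THETA * s
ALPHA, BETA = 0.8, 0.5  # hypothetical true transition: s' = ALPHA * s + BETA * a
RHO = 0.1               # hypothetical action-cost weight in the reward


def policy(s: float) -> float:
    return THETA * s


def reward(s: float, a: float) -> float:
    return -(s * s + RHO * a * a)


def true_transition(s: float, a: float) -> float:
    return ALPHA * s + BETA * a


def biased_transition(s: float, a: float) -> float:
    """A slightly wrong learned model (the 0.05 offset is an arbitrary illustration)."""
    return (ALPHA + 0.05) * s + BETA * a


def true_value(s: float) -> float:
    """Closed-form V(s) for the linear policy: a geometric series in s^2."""
    c = ALPHA + BETA * THETA  # closed-loop gain, s' = c * s
    return -(1.0 + RHO * THETA ** 2) * s * s / (1.0 - GAMMA * c * c)


def true_q(s: float, a: float) -> float:
    return reward(s, a) + GAMMA * true_value(true_transition(s, a))


def k_step_value(s: float, k: int, transition) -> float:
    """Sum k model-predicted rewards, then bootstrap with the critic at the last state."""
    total, discount = 0.0, 1.0
    for _ in range(k):
        a = policy(s)
        total += discount * reward(s, a)
        s = transition(s, a)
        discount *= GAMMA
    return total + discount * true_q(s, policy(s))


exact = [k_step_value(1.0, k, true_transition) for k in range(6)]
biased = [k_step_value(1.0, k, biased_transition) for k in range(6)]
errors = [abs(b - true_value(1.0)) for b in biased]
```

With the true model every rollout length gives the same estimate, while with the biased model the error grows with the number of rollout steps, which is the accumulated model error the text refers to. The variance side of the trade-off does not appear here because the toy critic is exact.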
The difference between infinite and finite horizon
In this section, we discuss the advantage of our proposed DVG algorithm over its finite-horizon counterpart and validate the effect on a continuous control task. The estimator of deterministic value gradients with finite horizon is defined as [4]:
Note that this estimator does not take rewards after the $k$-th step into consideration. Given samples, the finite-horizon DVG uses the sample mean of the corresponding gradient to update the policy, defined as:
We then test the two approaches on the environment HumanoidStandup-v2, where the number of steps is selected by testing values from 1 to 5 and choosing the one with the best performance for a fair comparison. As shown in Figure 2, the infinite-horizon DVG significantly outperforms the finite-horizon DVG, which validates our claim that considering only a finite horizon fails to achieve the same performance as the infinite horizon.
Connection and comparison of DVG and DDPG
The DDPG algorithm uses the gradient of the estimator of the Q function over the action, to estimate , i.e.,
The DDPG algorithm is a model-free algorithm that does not predict the reward or the transition, and can be viewed as the DVG(0) algorithm. We compare the DVG algorithm with different rollout steps and DDPG on a continuous control task in MuJoCo, Hopper-v2. Figure 3 shows that DVG with any choice of the number of rollout steps is more sample efficient than DDPG, which validates the power of model-based techniques. The best DVG variant outperforms DDPG and DVG with other numbers of rollout steps in terms of performance, as it trades off well between the bias and the variance of the value gradients. Note that with a larger number of steps, DVG(5) is not stable due to the propagated model error.
The DVPG Algorithm
As discussed before, the model-based DVG algorithm is more sample efficient than the model-free DDPG algorithm. However, it suffers from model bias, which results in performance loss. In this section, we ensemble these different gradient estimators for better performance.
Motivated by the TD($\lambda$) algorithm [24], which ensembles multi-step TD errors with a geometrically decaying weight $\lambda$, we propose a temporal-difference method to ensemble DVG with varying rollout steps and the model-free deterministic policy gradients. We define the temporal difference deterministic value gradients as a geometrically weighted ensemble of these estimators, up to a maximal number of rollout steps by the learned model. For the gradient update rule, we also apply sample-based methods: given samples, we use
(14) 
to update the policy. Based on these ensembled deterministic value-policy gradients, we propose the deterministic value-policy gradient algorithm, shown in Algorithm 2 (the only difference between the DVG(k) algorithm and the DVPG algorithm is the update rule of the policy).
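The geometric ensembling idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact update rule: the weights `(1 - lam) * lam**k`, the renormalization over a truncated range, and the convention that index 0 is the model-free estimate are all choices made here for the example, and the function names are hypothetical.

```python
from typing import List, Sequence


def geometric_weights(lam: float, max_steps: int, normalize: bool = True) -> List[float]:
    """Weights (1 - lam) * lam**k for k = 0..max_steps, optionally renormalized to sum to 1."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    weights = [(1.0 - lam) * lam ** k for k in range(max_steps + 1)]
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


def ensemble_gradients(grads: Sequence[Sequence[float]], lam: float) -> List[float]:
    """Combine gradient estimates; grads[k] is assumed to come from k rollout steps."""
    weights = geometric_weights(lam, len(grads) - 1)
    dim = len(grads[0])
    return [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]


# grads[0] plays the role of the model-free estimate, grads[1:] the model-based ones.
grads = [[1.0, 0.0], [0.8, 0.2], [0.5, 0.5]]
combined = ensemble_gradients(grads, lam=0.5)
model_free_only = ensemble_gradients(grads, lam=0.0)
```

Under this convention, `lam = 0` recovers the model-free estimate alone, and larger values shift weight toward estimates that rely on longer model rollouts.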
Experimental Results
We design a series of experiments to evaluate DVG and DVPG. We investigate the following aspects: (1) What is the effect of the discount factor on DVG? (2) How sensitive is DVPG to the hyperparameters? (3) How does DVPG compare with state-of-the-art methods?
We evaluate DVPG on a number of continuous control benchmark tasks in OpenAI Gym based on the MuJoCo [26] simulator. Implementation details are given in Appendix C. In the main text, we compare DVPG with DDPG, DVG, DDPG with imagination rollouts (DDPG(model)), and SVG with 1-step rollout and experience replay (SVG(1)). We also compare DVPG with methods using stochastic policies, e.g., ACKTR and TRPO, in Appendix D. We plot the average episode rewards over 5 different random seeds against the number of real samples, and the shaded region represents the 75% confidence interval. We use the same hyperparameters for the actor and critic networks in all algorithms. The prediction models of DVPG, DVG and DDPG(model) are the same.
The effect of discount factors on DVG
From Eq. (9), the gradient of the value function over the state is equivalent to an infinite sum of gradient vectors. To study the effect of the discount factor on DVG, we train the algorithm with 2 rollout steps and different values of the discount factor on the environment InvertedPendulum-v2. As shown in Figure 5, one intermediate value performs best in terms of rewards and stability, two values perform comparably, and the remaining two are inferior. This is because the computation of the gradient of the value function over the state may converge slowly if the discount factor is close to 1, while a smaller discount factor may enable better convergence. However, the sum of rewards discounted by too small a factor is too myopic and fails to perform well. The best value trades off well between stability and performance, as expected: there exists an optimal intermediate value for the discount factor.
Ablation study of DVPG
We evaluate the effect of the bootstrapping weight $\lambda$ on DVPG over a range of values, with the number of rollout steps set to 4. Figure 6 shows that the performance of DVPG decreases as $\lambda$ increases, so the smallest tested value performs best in terms of both sample efficiency and performance. We therefore use this value of the weight in all experiments.
We evaluate the effect of the number of rollout steps over a range of values. Results in Figure 7 show that DVPG with different numbers of rollout steps all succeed in learning a good policy, with 1 rollout step performing the best. Indeed, the number of rollout steps trades off between the model error and the variance. There is an optimal number of rollout steps for each environment, which is the only parameter we tune. To summarize, 1 rollout step works best on Humanoid-v2, Swimmer-v2 and HalfCheetah-v2, while 2 rollout steps perform best on HumanoidStandup-v2, Hopper-v2 and Ant-v2. For fair comparisons, we choose the same number of rollout steps for both the DVG and the DVPG algorithm.
Performance comparisons
In this section, we compare DVPG with the model-free baseline DDPG and with model-based baselines including DVG, DDPG(model) and SVG(1) on several continuous control tasks in MuJoCo. As shown in Figure 8, there are two classes of comparisons.
Firstly, we compare DVPG with DDPG and DVG to validate the effect of the temporal difference technique to ensemble model-based and model-free deterministic value gradients. DVG is more sample efficient than the other algorithms in the environments HumanoidStandup-v2 and Hopper-v2. In terms of sample efficiency, DVPG outperforms DDPG as it trades off between the model-based deterministic value gradients and the model-free deterministic policy gradients. At the end of training, DVPG significantly outperforms the other two algorithms, which demonstrates the power of the temporal difference technique. In the other four environments, DVPG outperforms the other algorithms in terms of both sample efficiency and performance. The performance of DVG and DDPG on Swimmer-v2 and Ant-v2 is comparable, while DVG performs poorly on HalfCheetah-v2 and Humanoid-v2 due to the model error.
Secondly, we compare DVPG with SVG(1) and DDPG with imagination rollouts. Results show that the DVPG algorithm significantly outperforms these two model-based algorithms in terms of sample efficiency and performance, especially in environments where the other model-based algorithms do not outperform the model-free DDPG algorithm. The SVG(1) algorithm fails to learn good policies in Ant-v2, which is also reported in [13].
Conclusion
Due to the high sample complexity of the model-free DDPG algorithm and the high bias of deterministic value gradients with finite horizon, we study deterministic value gradients with infinite horizon. We prove the existence of the deterministic value gradients for a certain set of discount factors in this infinite-horizon setting. Based on this theoretical guarantee, we propose the DVG algorithm with different numbers of rollout steps through the model. We then propose the DVPG algorithm, a temporal difference method that ensembles deterministic value gradients and deterministic policy gradients to trade off the bias due to the model error against the variance of the model-free policy gradients. We evaluate DVPG on several continuous control benchmarks. Results show that DVPG substantially outperforms other baselines in terms of convergence and performance. For future work, it is promising to apply the temporal difference technique presented in this paper to other model-free algorithms and to combine it with other model-based techniques.
Acknowledgments
The work by Ling Pan was supported in part by the National Natural Science Foundation of China Grant 61672316. Pingzhong Tang was supported in part by the National Natural Science Foundation of China Grant 61561146398 and a China Youth 1000-talent program.
Appendix A. Proof of Corollary 1
Corollary 1 For any policy and any initial state , let denote the loop of states following the policy and the state, let , the gradient of the value function over the state, exists if
(15) 
Proof 2
By the definition of , we get
(16) 
Then by [6], the absolute value of any eigenvalue of is strictly less than . By representing with Jordan normal form, i.e., ,
(17) 
As the absolute value of any eigenvalue of is strictly less than , converges, then converges. By Lemma 1, converges.
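The convergence argument can be checked numerically on a small example. In the sketch below, the 2x2 matrix standing in for the product of transition gradients around the loop, the loop length, and the discount factor are hypothetical, and the exponent applied to the discount factor is an assumption about how the loop length enters the condition. The code checks that a matrix norm below one (which bounds every eigenvalue's modulus) goes together with a power sum that converges to the inverse of the identity minus the scaled matrix.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def inf_norm(a: Matrix) -> float:
    """Maximum absolute row sum; it upper-bounds the modulus of every eigenvalue."""
    return max(sum(abs(x) for x in row) for row in a)


def power_sum(a: Matrix, terms: int) -> Matrix:
    """Partial sum I + a + a^2 + ... + a^(terms - 1)."""
    n = len(a)
    total, power = identity(n), identity(n)
    for _ in range(terms - 1):
        power = matmul(power, a)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


def inverse_2x2(a: Matrix) -> Matrix:
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


gamma, loop_len = 0.9, 3
jac_product = [[1.1, 0.2], [0.0, 0.9]]  # hypothetical product of Jacobians around the loop
scaled = [[gamma ** loop_len * x for x in row] for row in jac_product]

partial = power_sum(scaled, 200)
i_minus = [[(1.0 if i == j else 0.0) - scaled[i][j] for j in range(2)] for i in range(2)]
limit = inverse_2x2(i_minus)
```

Note that the unscaled matrix has an eigenvalue above one, so the power sum would diverge without discounting; the discount factor is what brings the spectrum inside the unit circle in this example.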
Appendix B. Proof of Theorem 2
Theorem 2 For any policy and MDP with deterministic state transitions, if assumptions A.1 and A.2 hold, the value gradients exist, and
(18) 
where is the discounted state distribution starting from the state and the policy, and is defined as
Appendix C. Implementation Details
In this section we describe the details of the implementation of DVPG, DVG, DDPG and DDPG (model). The configuration of the actor network and the critic network is the same as the implementation of OpenAI Baselines. For the reward network, we use the same network structure. Each network has two fully connected layers, where each layer has 64 units. The activation function is ReLU, the batch size is , the learning rate of the actor is , and the learning rate of the critic is . The learning rates of the transition network and the reward network are all . We also add norm regularizer to the loss.
For the reward network, the loss is . For the transition network, the loss is
We also compare with DDPG with model-based rollouts, i.e., besides training the policy on real samples, the actor is also updated with model-generated samples. The details of DDPG(model) are given in Algorithm 3.
(21) 
Running the DVPG algorithm for 1M steps takes about 4 hours.
Appendix D. Comparisons with state-of-the-art stochastic policy optimization methods
We compare the DVPG and DDPG algorithms with state-of-the-art stochastic policy optimization algorithms, TRPO and ACKTR. As shown in Figure 8, DVPG performs much better than DDPG and the other algorithms in the environments where DDPG is more sample efficient than the stochastic policy optimization algorithms. DVPG also significantly outperforms the other baselines in Swimmer-v2, where DDPG is outperformed by TRPO.
References
 [1] (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234–8244.
 [2] (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765.