Deterministic Value-Policy Gradients

Qingpeng Cai, Ling Pan, Pingzhong Tang
1 Alibaba Group
2 IIIS, Tsinghua University
qingpeng.cqp@alibaba-inc.com, pl17@mails.tsinghua.edu.cn, kenshin@tsinghua.edu.cn
The first two authors contributed equally to this work.
Abstract

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient algorithms, DVG(k), with infinite horizon, where different numbers k of rollout steps of the analytic gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.

Introduction

Silver et al. propose the deterministic policy gradient (DPG) algorithm [23] that aims to find an optimal deterministic policy that maximizes the expected long-term reward, which lowers the variance when estimating the policy gradient, compared to stochastic policies [25]. Lillicrap et al. further combine deep neural networks with DPG to improve the modeling capacity, and propose the deep deterministic policy gradient (DDPG) algorithm [16]. It is recognized that DDPG has been successful in robotic control tasks such as locomotion and manipulation. Despite the effectiveness of DDPG in these tasks, it suffers from the high sample complexity problem [22].

Deterministic value gradient methods [30, 20, 12, 5] compute the policy gradient through backpropagation of the reward along a trajectory predicted by the learned model, which enables better sample efficiency. However, to the best of our knowledge, existing works on deterministic value gradient methods merely focus on the finite horizon, which is too myopic and can lead to large bias. Stochastic value gradient (SVG) methods [11] use the re-parameterization technique to optimize stochastic policies. Among the class of SVG algorithms, although SVG(1) studies infinite-horizon problems, it only uses a one-step rollout, which limits its efficiency. It also suffers from high variance due to the importance sampling ratio and the randomness of the policy.

In this paper, we study the setting with infinite horizon, where both state transitions and policies are deterministic. [11] gives recursive Bellman gradient equations of deterministic value gradients, but the gradient lacks a theoretical guarantee, as the DPG theorem does not hold in the deterministic transition case. We prove that the gradient indeed exists for a certain set of discount factors. We then derive a closed form of the value gradients.

However, the estimation of the deterministic value gradients is much more challenging. The difficulty of computing the gradient mainly comes from the dependency of the gradient of the value function on the state. Such computation may involve an infinite product of gradients of the transition function and is hard to converge. Thus, applying the Bellman gradient equation recursively may incur high instability.

To overcome these challenges, we use model-based approaches to predict the reward and transition functions. Based on the theoretical guarantee of the closed form of the value gradients in this setting, we propose a class of deterministic value gradient algorithms, DVG(k), with infinite horizon, where k denotes the number of rollout steps. For each choice of k, we use the rewards predicted by the model and the action-value at step k to estimate the value gradients over the state, in order to reduce the instability of the gradient of the value function over the state. Different numbers of rollout steps maintain a trade-off between the accumulated model bias and the variance of the gradient over the state. The deterministic policy gradient estimator can be viewed as a special case of this class, i.e., it never uses the model to estimate the value gradients, and we refer to it as DVG(0).

As model-based approaches are more sample efficient than model-free algorithms [15, 14], and the model-based deterministic value gradients may incur model bias [27], we consider an essential question: how to combine the model-based gradients and the model-free gradients efficiently?

We propose a temporal difference method to ensemble gradients with different rollout steps. The intuition is to ensemble different gradient estimators with geometric decaying weights. Based on this estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. The algorithm updates the policy by stochastic gradient ascent with the ensembled value gradients of the policy, and the weight maintains a trade-off between sample efficiency and performance.

To sum up, the main contributions of this paper are as follows:

  • First of all, we provide a theoretical guarantee for the existence of the deterministic value gradients in settings with infinite horizon.

  • Secondly, we propose a novel algorithm that ensembles the deterministic value gradients and the deterministic policy gradients, called deterministic value-policy gradient (DVPG), which effectively combines model-free and model-based methods. DVPG reduces sample complexity and enables faster convergence and better performance.

  • Finally, we conduct extensive experiments on standard benchmarks comparing with DDPG, DDPG with model-based rollouts, the stochastic value gradient algorithm SVG(1), and state-of-the-art stochastic policy gradient methods. Results confirm that DVPG significantly outperforms other algorithms in terms of both sample efficiency and performance.

Related Work

Model-based algorithms have been widely studied [18, 19, 9, 10, 2, 31] in recent years. Model-based methods allow for more efficient computation and faster convergence than model-free methods [28, 15, 14, 29].

There are two classes of model-based methods. One uses the learned model to perform imagination rollouts to accelerate learning: [8, 13] generate synthetic samples by the learned model, and PILCO [3] learns the transition model by Gaussian processes and applies policy improvement on analytic policy gradients. The other uses the learned model to obtain better estimates of action-value functions: the value prediction network (VPN) uses the learned transition model to get a better target estimate [21], and [7, 1] combine different model-based value expansion functions by the TD(λ) trick or stochastic distributions to improve the estimator of the action-value function.

Different from previous model-based methods, we present a temporal difference method that ensembles model-based deterministic value gradients and model-free policy gradients. Our technique can be combined with both the imagination rollout technique and the model-based value expansion technique.

Preliminaries

A Markov decision process (MDP) is a tuple (S, A, p, r, γ, p_0), where S and A denote the set of states and actions respectively, p(s'|s, a) represents the conditional density from state s to state s' under action a, and the density of the initial state distribution is denoted by p_0(s). At each time step t, the agent interacts with the environment with a deterministic policy π_θ. We use r(s, a) to represent the immediate reward, contributing to the discounted overall rewards from state s_0 following π_θ, denoted by J(π_θ) = E[Σ_{t=0}^∞ γ^t r(s_t, a_t)]. Here, γ ∈ [0, 1] is the discount factor. The Q-function of state s and action a under policy π_θ is denoted by Q^π(s, a), and the corresponding value function of state s is denoted by V^π(s). We denote the density at state s' after t time steps from state s following the policy by p(s → s', t, π). We denote the discounted state distribution by ρ^π(s') := ∫_S Σ_{t=1}^∞ γ^{t−1} p_0(s) p(s → s', t, π) ds. The agent aims to find an optimal policy that maximizes J(π_θ).

Deterministic Value Gradients

In this section, we study a setting of infinite horizon with deterministic state transitions, which poses challenges for the existence of deterministic value gradients. We first prove that, under proper conditions, the deterministic value gradient does exist. Based on this theoretical guarantee, we then propose a class of practical algorithms by rolling out different numbers of steps. Finally, we discuss the difference and connection between our proposed algorithms and existing works.

The Deterministic Policy Gradient (DPG) Theorem [23] proves the existence of the deterministic policy gradient for MDPs that satisfy the regular condition, which requires the probability density of the next state, p(s'|s, a), to be differentiable in the action a. In the proof of the DPG theorem, the existence of the gradient of the value function is firstly proven, i.e.,

∇_θ V^π(s) = ∫_S Σ_{t=0}^∞ γ^t p(s → s', t, π) ∇_θ π(s') ∇_a Q^π(s', a)|_{a=π(s')} ds', (1)

and then the gradient of the long-term rewards exists. Without this condition, the arguments in the proof of the DPG theorem do not work (readers can refer to http://proceedings.mlr.press/v32/silver14-supp.pdf), which poses challenges for cases where the differentiability is not satisfied. Note that this condition does not hold in any case with deterministic transitions, where the density is a Dirac delta function. Therefore, one needs a new theoretical guarantee for the existence of the gradient of V^π(s) over θ in deterministic state transition cases.

Deterministic value gradient theorem

We now analyze the gradient of a deterministic policy. Denote by s' = f(s, a) the next state given current state s and action a. Without loss of generality, we assume that the transition function f is continuous, differentiable in s and a, and bounded. Note that the regular condition is not equivalent to this assumption. Consider a simple example with a deterministic transition s' = f(s, a): the conditional density p(s'|s, a) is a Dirac delta function, so its gradient over a is infinite or does not exist, whereas the gradient of f over a exists. By definition,

V^π(s) = r(s, π(s)) + γ V^π(f(s, π(s))).

Therefore, the key to the existence (estimation) of the gradient of V^π(s) over θ is the existence (estimation) of ∇_s V^π(s). In Theorem 1, we give a sufficient condition for the existence of ∇_s V^π(s).
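The existence of this infinite-horizon state gradient can be checked numerically on a toy deterministic system. The sketch below uses an assumed scalar model (f(s, a) = 0.9s + a, policy π(s) = −0.2s, reward r(s) = −s²; all of these are illustrative assumptions, not from the paper) and compares the truncated unrolled gradient sum against its closed form:

```python
# Toy check that the infinite-horizon value gradient over the state exists
# for deterministic transitions. Assumed toy model (not from the paper):
# f(s, a) = 0.9 s + a, policy pi(s) = -0.2 s, reward r(s) = -s**2.
gamma = 0.95
c = 0.9 + (-0.2)          # closed-loop multiplier: s_{t+1} = c * s_t
s0 = 1.3

# Truncated version of the unrolled gradient sum: each step contributes
# gamma^t * (product of transition gradients, here c^t) * dr/ds at s_t.
grad = sum((gamma * c * c) ** t * (-2.0 * s0) for t in range(200))

# Closed form of the geometric series: -2 s0 / (1 - gamma c^2).
closed = -2.0 * s0 / (1.0 - gamma * c * c)
print(grad, closed)       # the two agree whenever gamma * c^2 < 1
```

The series converges here because the discounted squared closed-loop gain γc² is below 1, which is exactly the kind of condition the theorem below formalizes.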

Theorem 1

For any policy π, the gradient of the value function over the state, ∇_s V^π(s), exists under the following two assumptions:

  • A.1: The set of states that the policy visits starting from any initial state is finite.

  • A.2: For any initial state s, by Assumption A.1, there is a periodic loop of visited states. Let (s_{t_0}, …, s_{t_0+T−1}) denote the loop of length T, and let A denote the product over the loop of the total transition gradients D(s_i) = ∇_s f(s_i, a)|_{a=π(s_i)} + ∇_s π(s_i) ∇_a f(s_i, a)|_{a=π(s_i)}. The power sum of γ^T A, Σ_{k=0}^∞ (γ^T A)^k, converges.

Proof 1

By definition,

V^π(s) = r(s, π(s)) + γ V^π(f(s, π(s))). (2)

Taking the gradient of Eq. (2), we obtain

∇_s V^π(s) = g(s) + γ D(s) ∇_{s'} V^π(s')|_{s'=f(s, π(s))}, (3)

where g(s) = ∇_s r(s, a)|_{a=π(s)} + ∇_s π(s) ∇_a r(s, a)|_{a=π(s)} and D(s) = ∇_s f(s, a)|_{a=π(s)} + ∇_s π(s) ∇_a f(s, a)|_{a=π(s)}.

Unrolling Eq. (3) with infinite steps, we get

∇_s V^π(s) = Σ_{t=0}^∞ γ^t (Π_{i=0}^{t−1} D(s_i)) g(s_t), (4)

where s_0 = s and s_t is the state after t steps following policy π.

With assumption A.1, we rewrite (4) by the indicator function I(s → x, t, π) that indicates whether state x is obtained after t steps from the initial state s following the policy π:

∇_s V^π(s) = Σ_{x ∈ S(s)} Σ_{t=0}^∞ γ^t I(s → x, t, π) (Π_{i=0}^{t−1} D(s_i)) g(x), (5)

where S(s) is the set of states the policy visits from s.

We now prove that for any x ∈ S(s), the infinite sum of gradients, Σ_{t=0}^∞ γ^t I(s → x, t, π) (Π_{i=0}^{t−1} D(s_i)) g(x), converges.

For each state x, there are three cases during the process from the initial state s with infinite steps:

  1. Never visited: the sum equals 0.

  2. Visited once: Let t_x denote the number of steps that it takes to reach the state x; then the sum equals γ^{t_x} (Π_{i=0}^{t_x−1} D(s_i)) g(x).

  3. Visited infinite times: Let t_x denote the number of steps it takes to reach x for the first time. The state will be revisited every T steps after the previous visit. By definition, the sum equals

    Σ_{k=0}^∞ γ^{t_x+kT} A^k (Π_{i=0}^{t_x−1} D(s_i)) g(x), (6)

    where A denotes the product of D(·) over the loop. By the assumption A.2 we get that (6) converges.

By exchanging the order of the limit and the summation,

∇_s V^π(s) = Σ_{x ∈ S(s)} (Σ_{t=0}^∞ γ^t I(s → x, t, π) Π_{i=0}^{t−1} D(s_i)) g(x). (7)

Assumption A.1 guarantees the existence of the stationary distribution of states theoretically. Actually, it holds on most continuous tasks, e.g., InvertedPendulum-v2 in MuJoCo. We directly test a deterministic policy with a 2-layer fully connected network on this environment for 10,000 episodes (we test different weights; the observation of a finite visited-state set is very common among different weights), and we count the number of times each state is visited. After projecting the data into 2D space by t-SNE [17], we obtain the state visitation density contour shown in Figure 1. We have two interesting findings: (1) the set of states visited by the policy is finite; (2) many states are visited multiple times, which justifies Assumption A.1.

By the analysis of Assumption A.2, we get that for any policy and state, there exists a set of discount factors such that the gradient of the value function over the state exists, as illustrated in Corollary 1. Please refer to Appendix A for the proof.

Corollary 1

For any policy π and any initial state s, let (s_{t_0}, …, s_{t_0+T−1}) denote the loop of states that the policy visits from s, and let A denote the product over the loop of the total transition gradients D(s_i) = ∇_s f(s_i, a)|_{a=π(s_i)} + ∇_s π(s_i) ∇_a f(s_i, a)|_{a=π(s_i)}. The gradient of the value function over the state, ∇_s V^π(s), exists if γ^T ρ(A) < 1, where ρ(A) denotes the largest absolute value of the eigenvalues of A.
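The convergence condition of the corollary can be checked numerically. The sketch below uses a hypothetical loop product A and loop length T (placeholder values, not from the paper) and verifies that when the eigenvalue condition holds, the power sum of γ^T A converges to (I − γ^T A)⁻¹:

```python
import numpy as np

# Numerical check of the Corollary 1-style condition: the gradient series
# converges when every eigenvalue of gamma^T * A lies strictly inside the
# unit circle, i.e. gamma^T * rho(A) < 1.
A = np.array([[0.8, 0.3], [0.1, 0.9]])   # hypothetical loop product
T, gamma = 3, 0.9
rho = max(abs(np.linalg.eigvals(A)))     # spectral radius of A

converges = (gamma ** T) * rho < 1.0
# When it converges, sum_k (gamma^T A)^k equals (I - gamma^T A)^{-1}.
M = gamma ** T * A
partial = sum(np.linalg.matrix_power(M, k) for k in range(500))
closed = np.linalg.inv(np.eye(2) - M)
print(converges, np.allclose(partial, closed))
```

Note that ρ(A) here exceeds 1, so the series would diverge without discounting; γ^T shrinks the eigenvalues enough for convergence, matching the corollary's message that existence depends on the discount factor.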

Figure 1: State visitation density contour on InvertedPendulum-v2.

In Theorem 2, we show that the deterministic value gradients exist and obtain the closed form based on the analysis in Theorem 1. Please refer to Appendix B for the proof.

Theorem 2

(Deterministic Value Gradient Theorem) For any policy π and MDP with deterministic state transitions, if assumptions A.1 and A.2 hold, the value gradients exist, and

∇_θ V^π(s) = Σ_{s' ∈ S(s)} ρ^π(s → s') ∇_θ π(s') G(s'),

where π(s') is the action the policy takes at state s', ρ^π(s → s') = Σ_{t=0}^∞ γ^t I(s → s', t, π) is the discounted state distribution starting from the state s and the policy, and G(s') is defined as

G(s') = ∇_a r(s', a)|_{a=π(s')} + γ ∇_a f(s', a)|_{a=π(s')} ∇_{s''} V^π(s'')|_{s''=f(s', π(s'))}.

Deterministic value gradient algorithm

The value gradient methods estimate the gradient of the value function recursively [4]:

∇_θ V^π(s) = ∇_θ π(s) (∇_a r(s, a) + γ ∇_a f(s, a) ∇_{s'} V^π(s'))|_{a=π(s), s'=f(s, a)} + γ ∇_θ V^π(s'), (8)
∇_s V^π(s) = g(s) + γ D(s) ∇_{s'} V^π(s'), (9)

where g(s) and D(s) denote the total gradients along the policy of the reward and the transition function, as in the proof of Theorem 1. In fact, there are two kinds of approaches for estimating the gradient of the value function over the state, i.e., infinite and finite. On the one hand, directly estimating the gradient of the value function over the state recursively by Eq. (9) for infinite times is slow to converge. On the other hand, estimating the gradient with a finite horizon like traditional value gradient methods [30, 20, 11] may cause large bias of the gradient.

We set out to estimate the action-value function, denoted by Q_w with parameter w, and replace ∇_{s'} V^π(s') by ∇_{s'} Q_w(s', π(s')) in Eq. (8). In this way, we can directly obtain a 1-step estimator of the value gradients,

∇_θ V_1(s) = ∇_θ π(s) (∇_a r(s, a) + γ ∇_a f(s, a) ∇_{s'} Q_w(s', π(s')))|_{a=π(s)}, (10)

where s' = f(s, π(s)) is the next state of s. This can be generalized to k rollout steps. Let s_t denote the state visited by the policy at the t-th step starting from the initial state s_0 = s. We choose to rollout k steps to get rewards, then replace ∇_{s_k} V^π(s_k) by ∇_{s_k} Q_w(s_k, π(s_k)) in Eq. (9), and we get

∇_{s_1} V^π(s_1) ≈ Σ_{t=1}^{k−1} γ^{t−1} (Π_{i=1}^{t−1} D(s_i)) g(s_t) + γ^{k−1} (Π_{i=1}^{k−1} D(s_i)) ∇_{s_k} Q_w(s_k, π(s_k)).

Replacing ∇_{s_1} V^π(s_1) with this estimate in Eq. (8), we get a k-step estimator of the value gradients:

∇_θ V_k(s) = ∇_θ π(s) (∇_a r(s, a) + γ ∇_a f(s, a) ∇_{s_1} V^π(s_1))|_{a=π(s)}. (11)

It is easy to see that these estimators with different k are the same if we have the true reward and transition functions, which is generally not the case, as we need to learn the model in practical environments. Let ∇_θ V̂_k(s) denote the value gradient at the sampled state s with k rollout steps, computed on the learned transition function f̂ and reward function r̂, which is defined as:

∇_θ V̂_k(s): the k-step estimator of Eq. (11) with r and f replaced by the learned r̂ and f̂. (12)

Based on Eq. (12), we propose the deterministic value gradients with infinite horizon; the algorithm is shown in Algorithm 1. Given samples {s_i}, for each choice of k, we use the sample mean of ∇_θ V̂_k(s_i) to update the current policy. We use sample-based methods to estimate the deterministic value gradients: for each state in the trajectory, we take the analytic gradients through the learned model. As the model is not given, we choose to predict the reward function and the transition function. We choose to use experience replay to compare with the DDPG algorithm fairly. Different choices of the number of rollout steps trade off between the variance and the bias: a larger k means lower variance of the value gradients, and larger bias due to the accumulated model error.

1:  Initialize the reward network r̂, transition network f̂, critic network Q_w, actor network π_θ, target networks Q_{w'} and π_{θ'}, and experience replay buffer B
2:  for episode = 1, …, M do
3:     for t = 1, …, T do
4:         Select action a_t according to the current policy and exploration noise
5:         Execute action a_t, observe reward r_t and new state s_{t+1}, and store transition (s_t, a_t, r_t, s_{t+1}) in B
6:         Sample a random minibatch of transitions from B
7:         Update the critic by minimizing the TD error
8:         Update the reward network and the transition network on the batch by minimizing the square loss
9:         Estimate the value gradients by ∇_θ V̂_k and perform a gradient update on the policy
10:         Update the target networks by soft updates
11:     end for
12:  end for
Algorithm 1 The DVG(k) algorithm
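To illustrate the k-step bootstrapped recursion behind DVG(k), the following sketch computes the state gradient of the value function on an assumed scalar model (f(s, a) = s + a, π(s) = θs, r(s, a) = −(s² + 0.1a²); all illustrative assumptions, not from the paper), using the exact dV/ds in place of the learned critic gradient:

```python
# Sketch of the k-step value-gradient recursion (Eq. (9)-style): unroll the
# model for k steps, then bootstrap the tail with the critic's state gradient.
# Assumed toy model: f(s,a) = s + a, pi(s) = theta * s, r = -(s^2 + 0.1 a^2);
# the exact dV/ds stands in for the learned grad_s Q_w(s, pi(s)).
gamma, theta = 0.9, -0.5
c = 1.0 + theta                                   # closed-loop multiplier
v = -(1.0 + 0.1 * theta**2) / (1.0 - gamma * c**2)  # so that V(s) = v * s^2

def dVds_k(s, k):
    """k-step estimate of dV/ds: model rewards for k steps, critic tail."""
    if k == 0:
        return 2.0 * v * s                        # bootstrap with the critic
    g = -(2.0 + 0.2 * theta**2) * s               # total reward gradient g(s)
    D = c                                         # total transition gradient
    return g + gamma * D * dVds_k(c * s, k - 1)

s0 = 0.7
for k in (1, 3, 5):
    print(k, dVds_k(s0, k), 2.0 * v * s0)
```

Because the bootstrap here is exact, every k recovers the same gradient; with a learned critic and model, larger k shifts weight from critic error to accumulated model bias, which is exactly the trade-off discussed above.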

The difference between infinite and finite horizon

In this section, we discuss the advantage of our proposed DVG algorithm over the finite-horizon variant and validate the effect on a continuous control task. The estimator of deterministic value gradients with finite horizon (finite-horizon DVG) is defined as in [4]: it is the k-step estimator of Eq. (11) with the critic bootstrap term dropped.

Note that the finite-horizon estimator does not take rewards after the k-th step into consideration. Therefore, given samples {s_i}, finite-horizon DVG uses the sample mean of the truncated gradients to update the policy.

We then test the two approaches on the environment HumanoidStandup-v2, where for the parameter k we choose the best-performing value (we test DVG with steps ranging from 1 to 5, and choose the parameter with the best performance for a fair comparison). As shown in Figure 2, DVG significantly outperforms finite-horizon DVG, which validates our claim that only considering a finite horizon fails to achieve the same performance as that of the infinite horizon.

Figure 2: Comparisons of infinite-horizon DVG and finite-horizon DVG.
Figure 3: Comparisons of DVG with DDPG.

Connection and comparison of DVG and DDPG

By the proof of the DPG theorem in [23], Eq. (8) can be re-written as

∇_θ V^π(s) = ∇_θ π(s) ∇_a Q^π(s, a)|_{a=π(s)} + γ ∇_θ V^π(s'). (13)

The DDPG algorithm uses the gradient of the estimator of the Q-function over the action, ∇_a Q_w(s, a), to estimate ∇_a Q^π(s, a), i.e.,

∇_θ V^π(s) ≈ ∇_θ π(s) ∇_a Q_w(s, a)|_{a=π(s)}.

The DDPG algorithm is a model-free algorithm which does not predict the reward and the transition, and can be viewed as the DVG(0) algorithm. We compare the DVG algorithm with different rollout steps and DDPG on a continuous control task in MuJoCo, Hopper-v2. From Figure 3, we get that DVG with any choice of the number of rollout steps is more sample efficient than DDPG, which validates the power of model-based techniques. DVG(2) outperforms DDPG and DVG with other numbers of rollout steps in terms of performance, as it trades off well between the bias and the variance of the value gradients. Note that with a larger number of steps, DVG(5) is not stable due to the propagated model error.

The DVPG Algorithm

As discussed before, the model-based DVG algorithm is more sample efficient than the model-free DDPG algorithm. However, it suffers from model bias, which results in performance loss. In this section, we consider ensembling these different gradient estimators for better performance.

Motivated by the idea of the TD(λ) algorithm [24], which ensembles the TD(k) errors with a geometrically decaying weight λ, we propose a temporal-difference method to ensemble DVG with varying rollout steps and the model-free deterministic policy gradients. We define the temporal difference deterministic value gradients as

∇_θ V^{TD}(s) = (1 − λ) Σ_{k=0}^{K−1} λ^k ∇_θ V̂_k(s) + λ^K ∇_θ V̂_K(s),

where ∇_θ V̂_0(s) denotes the model-free deterministic policy gradient and K denotes the maximal number of rollout steps by the learned model. For the gradient update rule, we also apply sample-based methods: given samples {s_i}_{i=1}^N, we use

(1/N) Σ_{i=1}^N ∇_θ V^{TD}(s_i) (14)

to update the policy. Based on this ensembled deterministic value-policy gradient, we propose the deterministic value-policy gradient algorithm, shown in Algorithm 2. (The only difference between the DVG(k) algorithm and the DVPG algorithm is the update rule of the policy.)

1:  Initialize the weight λ and the maximal number of rollout steps K
2:  Initialize the reward network r̂, transition network f̂, critic network Q_w, actor network π_θ, target networks Q_{w'} and π_{θ'}, and experience replay buffer B
3:  for episode = 1, …, M do
4:     for t = 1, …, T do
5:         Select action a_t according to the current policy and exploration noise
6:         Execute action a_t, observe reward r_t and new state s_{t+1}, and store transition (s_t, a_t, r_t, s_{t+1}) in B
7:         Sample a random minibatch of transitions from B
8:         Update the critic by minimizing the TD error
9:         Update the reward network and the transition network on the batch by minimizing the square loss
10:         Estimate the value gradients by Eq. (14), and perform a gradient update on the policy
11:         Update the target networks by soft updates
12:     end for
13:  end for
Algorithm 2 The DVPG algorithm
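The gradient ensembling step of DVPG can be sketched as follows. The weighting scheme below, (1 − λ)λᵏ on the k-step gradient and λᴷ on the last so that the weights sum to 1, is one natural TD(λ)-style choice and is an assumption for illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def ensemble_gradient(grads, lam):
    """TD(lambda)-style ensemble of policy-gradient estimates.

    grads[0] is the model-free deterministic policy gradient (DVG(0));
    grads[1..K] are the k-step model-based value gradients. Weights are
    (1 - lam) * lam**k for k < K and lam**K for the last, summing to 1.
    """
    K = len(grads) - 1
    weights = [(1.0 - lam) * lam ** k for k in range(K)] + [lam ** K]
    mix = sum(w * g for w, g in zip(weights, grads))
    return mix, weights

# Toy gradients for a 2-dimensional parameter (illustrative values only).
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
mix, w = ensemble_gradient(grads, lam=0.5)
print(w, mix)   # weights sum to 1; lam = 0 recovers the pure model-free DPG
```

Setting λ = 0 puts all weight on the model-free gradient, while λ → 1 leans entirely on the longest model rollout, mirroring the sample-efficiency/bias trade-off described above.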

Experimental Results

We design a series of experiments to evaluate DVG and DVPG. We investigate the following aspects: (1) What is the effect of the discount factor on DVG? (2) How sensitive is DVPG to the hyper-parameters? (3) How does DVPG compare with state-of-the-art methods?

We evaluate DVPG on a number of continuous control benchmark tasks in OpenAI Gym based on the MuJoCo [26] simulator. Implementation details are given in Appendix C. We compare DVPG with DDPG, DVG, DDPG with imagination rollouts (DDPG(model)), and SVG with 1-step rollout and experience replay (SVG(1)) in the text. We also compare DVPG with methods using stochastic policies, e.g., ACKTR and TRPO, in Appendix D. We plot the averaged rewards of episodes over 5 different random seeds against the number of real samples, and the shaded region represents the 75% confidence interval. We choose the same hyperparameters of the actor and critic networks for all algorithms. The prediction models of DVPG, DVG and DDPG(model) are the same.

Figure 4: Performance comparisons on environments from the MuJoCo simulator.

The effect of discount factors on DVG

From Eq. (9), we get that ∇_s V^π(s) is equivalent to an infinite sum of gradient vectors. To study the effect of the discount factor on DVG, we train the algorithm with 2 rollout steps and different values of the discount factor γ on the environment InvertedPendulum-v2. As shown in Figure 5, an intermediate value of γ performs the best in terms of rewards and stability, while values that are too large or too small are inferior. This is because the convergence of the computation of the gradient of the value function over the state may be slow if the discount factor is close to 1, while a smaller value of γ may enable better convergence of ∇_s V^π(s). However, the sum of rewards discounted by a too small γ will be too myopic and fails to perform well. The best-performing γ trades off between stability and performance, which is as expected: there exists an optimal intermediate value of the discount factor.

Figure 5: The effect of discount factors.

Ablation study of DVPG

We evaluate the effect of the weight of bootstrapping λ on DVPG with different values, where the number of rollout steps is set to 4. From Figure 6, we get that the performance of DVPG decreases with the increase of λ, with the smallest tested value performing the best in terms of sample efficiency and performance. Thus, we use this value of the weight in all experiments.

Figure 6: The weight of bootstrapping.

We evaluate the effect of the number of rollout steps. Results in Figure 7 show that DVPG with different numbers of rollout steps all succeeds in learning a good policy, with 1 rollout step performing the best. Indeed, the number of rollout steps trades off between the model error and the variance. There is an optimal value of the number of rollout steps for each environment, which is the only parameter we tune. To summarize, 1 rollout step works the best on Humanoid-v2, Swimmer-v2 and HalfCheetah-v2, while 2 rollout steps perform the best on HumanoidStandup-v2, Hopper-v2 and Ant-v2. For fair comparisons, we choose the same number of rollout steps for both the DVG and the DVPG algorithms.

Figure 7: The number of rollout steps.

Performance comparisons

In this section we compare DVPG with the model-free baseline DDPG, and model-based baselines including DVG, DDPG(model) and SVG(1), on several continuous control tasks in MuJoCo. As shown in Figure 4, there are two classes of comparisons.

Firstly, we compare DVPG with DDPG and DVG to validate the effect of the temporal difference technique that ensembles model-based and model-free deterministic value gradients. The DVG algorithm is the most sample efficient in the environments HumanoidStandup-v2 and Hopper-v2. For sample efficiency, DVPG outperforms DDPG as it trades off between the model-based deterministic value gradients and the model-free deterministic policy gradients. At the end of training, DVPG outperforms the other two algorithms significantly, which demonstrates the power of the temporal difference technique. In the other four environments, DVPG outperforms the other algorithms in terms of both sample efficiency and performance. The performance of DVG and DDPG on Swimmer-v2 and Ant-v2 is comparable, while DVG performs poorly on HalfCheetah-v2 and Humanoid-v2 due to the model error.

Secondly, we compare DVPG with SVG(1) and DDPG with imagination rollouts. Results show that the DVPG algorithm significantly outperforms these two model-based algorithms in terms of sample efficiency and performance, especially in environments where the other model-based algorithms do not achieve better performance than the model-free DDPG algorithm. As for the SVG(1) algorithm, it fails to learn good policies on Ant-v2, which is also reported in [13].

Conclusion

Due to the high sample complexity of the model-free DDPG algorithm and the high bias of deterministic value gradients with finite horizon, we study deterministic value gradients with infinite horizon. We prove the existence of the deterministic value gradients for a certain set of discount factors in this infinite-horizon setting. Based on this theoretical guarantee, we propose the DVG algorithm with different numbers of model rollout steps. We then propose a temporal difference method, the DVPG algorithm, which ensembles deterministic value gradients and deterministic policy gradients to trade off between the bias due to the model error and the variance of the model-free policy gradients. We evaluate DVPG on several continuous control benchmarks. Results show that DVPG substantially outperforms other baselines in terms of convergence and performance. For future work, it is promising to apply the temporal difference technique presented in this paper to other model-free algorithms and to combine it with other model-based techniques.

Acknowledgments

The work by Ling Pan was supported in part by the National Natural Science Foundation of China Grants 61672316. Pingzhong Tang was supported in part by the National Natural Science Foundation of China Grant 61561146398, and a China Youth 1000-talent program.

Appendix A. Proof of Corollary 1

Corollary 1 For any policy π and any initial state s, let (s_{t_0}, …, s_{t_0+T−1}) denote the loop of states following the policy from s, and let A denote the product over the loop of the total transition gradients; the gradient of the value function over the state, ∇_s V^π(s), exists if

γ^T ρ(A) < 1, (15)

where ρ(A) is the largest absolute value of the eigenvalues of A.
Proof 2

By the definition of ρ(A), we get

ρ(γ^T A) = γ^T ρ(A) < 1. (16)

Then by [6], the absolute value of any eigenvalue of γ^T A is strictly less than 1. By representing A with its Jordan normal form, i.e., A = P J P^{−1},

(γ^T A)^k = P (γ^T J)^k P^{−1}. (17)

As the absolute value of any eigenvalue of γ^T J is strictly less than 1, Σ_{k=0}^∞ (γ^T J)^k converges, so Σ_{k=0}^∞ (γ^T A)^k converges. Then by Theorem 1, ∇_s V^π(s) exists.

Appendix B. Proof of Theorem 2

Theorem 2 For any policy π and MDP with deterministic state transitions, if assumptions A.1 and A.2 hold, the value gradients exist, and

∇_θ V^π(s) = Σ_{s' ∈ S(s)} ρ^π(s → s') ∇_θ π(s') G(s'), (18)

where ρ^π(s → s') = Σ_{t=0}^∞ γ^t I(s → s', t, π) is the discounted state distribution starting from the state s and the policy, and G(s') is defined as

G(s') = ∇_a r(s', a)|_{a=π(s')} + γ ∇_a f(s', a)|_{a=π(s')} ∇_{s''} V^π(s'')|_{s''=f(s', π(s'))}.

Proof 3

By definition,

∇_θ V^π(s) = ∇_θ π(s) G(s) + γ ∇_θ V^π(s')|_{s'=f(s, π(s))}. (19)

With the indicator function I(s → x, t, π), we rewrite the equation (19):

∇_θ V^π(s) = Σ_{x ∈ S(s)} I(s → x, 0, π) ∇_θ π(x) G(x) + γ Σ_{x ∈ S(s)} I(s → x, 1, π) ∇_θ V^π(x). (20)

By unrolling (20) with infinite steps, we get (18).

Appendix C. Implementation Details

In this section we describe the details of the implementation of DVPG, DVG, DDPG and DDPG(model). The configuration of the actor network and the critic network is the same as in the OpenAI Baselines implementation. For the reward network, we use the same network structure. Each network has two fully connected layers, where each layer has 64 units. The activation function is ReLU, and the batch size and the learning rates of the actor, the critic, the transition network and the reward network are shared across all algorithms. We also add a norm regularizer to the loss.

For the reward network, the loss is the squared error (r̂(s, a) − r)². For the transition network, the loss is the squared error ‖f̂(s, a) − s′‖².
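The two model losses can be sketched as follows, with linear models standing in for the reward and transition networks (an assumption for illustration; the paper uses two-layer fully connected networks):

```python
import numpy as np

def model_losses(Wr, Wf, batch):
    """Squared losses for the reward and transition models on a replay batch.

    Wr, Wf are the parameters of linear stand-ins for the reward and
    transition networks; batch holds states, actions, rewards, next states.
    """
    s, a, r, s2 = batch
    x = np.concatenate([s, a], axis=1)                # model input (s, a)
    reward_loss = np.mean((x @ Wr - r) ** 2)          # (r_hat(s,a) - r)^2
    transition_loss = np.mean(np.sum((x @ Wf - s2) ** 2, axis=1))  # ||f_hat(s,a) - s'||^2
    return reward_loss, transition_loss

# Sanity check: on data generated by the models themselves, both losses are 0.
rng = np.random.default_rng(0)
s, a = rng.normal(size=(4, 2)), rng.normal(size=(4, 1))
Wr, Wf = rng.normal(size=(3, 1)), rng.normal(size=(3, 2))
x = np.concatenate([s, a], axis=1)
batch = (s, a, x @ Wr, x @ Wf)
print(model_losses(Wr, Wf, batch))
```

In the actual algorithms these losses are minimized by gradient steps on each sampled minibatch, alongside the critic's TD update.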

We also compare with DDPG with model-based rollouts, i.e., besides training the policy on real samples, the actor is also updated with model-generated samples. The details of DDPG(model) are given in Algorithm 3.

1:  Initialize the reward network r̂, transition network f̂, critic network Q_w, actor network π_θ, and target networks Q_{w'}, π_{θ'}
2:  Initialize experience replay buffer B and model-based experience replay buffer B_m
3:  for episode = 1, …, M do
4:     for t = 1, …, T do
5:        Select action a_t according to the current policy and exploration noise
6:        Execute action a_t, observe reward r_t and new state s_{t+1}, and store transition (s_t, a_t, r_t, s_{t+1}) in B
7:        Sample a random minibatch of N transitions from B
8:        Update the critic by minimizing the TD error
9:        Update the reward network and the transition network on the batch by minimizing the square loss
10:        Estimate the policy gradients by
(1/N) Σ_{i=1}^N ∇_θ π(s_i) ∇_a Q_w(s_i, a)|_{a=π(s_i)} (21)
11:        Perform the model-free gradient update on the policy
12:        Update the target networks
13:     end for
14:     Generate samples by the policy and the learned transition model, and store the fictitious samples in B_m
15:     for each model-based update do
16:        Sample a random minibatch of transitions from B_m
17:        Estimate the policy gradients on the fictitious samples
18:        Perform the model-based gradient update on the policy
19:     end for
20:     Reset the model-based buffer B_m to be empty
21:  end for
Algorithm 3 The DDPG with model-based rollouts algorithm

For the running time of the DVPG algorithm, it takes about 4 hours for running 1M steps.

Appendix D. Comparisons with state-of-the-art stochastic policy optimization methods

(a) LunarLander-v2.
(b) Swimmer-v2.
(c) HalfCheetah-v2.
(d) HumanoidStandup-v2.
(e) Humanoid-v2.
Figure 8: Return/steps of training on environments from the MuJoCo simulator.

We compare the DVPG algorithm and the DDPG algorithm with state-of-the-art stochastic policy optimization algorithms, TRPO and ACKTR. As shown in Figure 8, DVPG performs much better than DDPG and the other algorithms in the environments where DDPG is more sample efficient than policy optimization algorithms. DVPG also significantly outperforms other baselines on Swimmer-v2, where DDPG is outperformed by TRPO.

References

  • [1] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234–8244. Cited by: Related Work.
  • [2] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: Related Work.