Supervised Policy Update

Quan Ho Vuong        Yiming Zhang        Keith W. Ross
New York University Abu Dhabi
New York University
New York University Shanghai
quan.vuong@nyu.edu, yiming.zhang@cs.nyu.edu, keithwross@nyu.edu
Abstract

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.

 

Preprint. Work in progress.

1 Introduction

The policy gradient problem in deep reinforcement learning can be informally defined as seeking a parameterized policy π_θ that produces a high expected reward J(π_θ). The parameterized policy is realized with a neural network, and stochastic gradient descent with back propagation is used to optimize the parameters. An issue that plagues traditional policy gradient methods is poor sample efficiency [1, 2, 3, 4]. In algorithms such as REINFORCE [5], new samples are needed for every small gradient step. In environments for which generating trajectories is expensive (such as robotic environments), sample efficiency is of central concern. The sample efficiency problem can be informally stated as follows: beginning with the current policy π_k, and using only trajectories generated under π_k, obtain a new policy π_{k+1} that improves on π_k as much as possible.

Several papers have addressed the sample efficiency problem by considering candidate new policies that are close to the original policy π_k [1, 3, 6, 4]. Intuitively, if the candidate policy π is far from the original policy π_k, then the information in the samples (states visited, actions taken, and the estimated advantage values) loses its relevance. This guideline seems reasonable in principle, but it requires a notion of closeness of two policies. One natural approach is to define a distance or divergence D(π, π_k) between the current policy π_k and the candidate new policy π, and then attempt to solve the constrained optimization problem:

maximize_{π}   J(π) − J(π_k)                    (1)
subject to     D(π, π_k) ≤ δ                    (2)

Here the objective in (1) attempts to maximize the improvement in performance of the updated policy compared with the current policy, and the constraint (2) ensures that the resulting policy is near the policy that was used to generate the data. The parameter δ is a hyper-parameter that can possibly be annealed over time.

We propose a new methodology, called Supervised Policy Update (SPU), for the sample efficiency problem. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a general methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be studied using this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.

2 Preliminaries

We consider a Markov Decision Process (MDP) (S, A, r, P, ρ_0), where S is the state space, A is the action space, r(s, a) is the reward function, P(s′ | s, a) is the probability of transitioning to s′ from s after taking action a, and ρ_0 is the initial state distribution over S. Let π denote a policy, let Π be the set of all policies, and let the expected discounted reward be:

J(π) = E_{τ∼π} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ]                    (3)

where γ ∈ (0, 1) is a discount factor, τ = (s_0, a_0, s_1, a_1, …) is a sample trajectory, and E_{τ∼π} is the expectation with respect to the probability of τ under policy π. Let A^π(s, a) be the advantage function for policy π [7]. Deep reinforcement learning considers a set of parameterized policies Π_θ = {π_θ : θ ∈ Θ} ⊂ Π, where each parameterized policy is defined by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in Π_θ as well as over the non-parameterized policies in Π.

One popular approach to maximizing J(π_θ) over Π_θ is to apply stochastic gradient ascent. The gradient of J(π_θ) evaluated at a specific θ can be shown to be [5, 8, 7]:

∇_θ J(π_θ) = E_{τ∼π_θ} [ Σ_{t=0}^{∞} ∇_θ log π_θ(a_t | s_t) A^{π_θ}(s_t, a_t) ]                    (4)

To obtain an estimate of the gradient, we can sample finite-length trajectories τ^{(1)}, …, τ^{(N)} from π_θ, and approximate (4) as:

∇_θ J(π_θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T_i} ∇_θ log π_θ(a_t^{(i)} | s_t^{(i)}) Â(s_t^{(i)}, a_t^{(i)})                    (5)

where T_i is the length of the i-th trajectory, and Â(s, a) is an approximation of A^{π_θ}(s, a) obtained from a critic network. Using the approximate advantage in the gradient estimator introduces a bias but has the effect of lowering the variance [9, 10].
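To make the estimator (5) concrete, the sketch below computes a scalar loss whose gradient under automatic differentiation is the negative of the right-hand side of (5). This is only a minimal sketch, assuming PyTorch, a discrete-action policy network `policy` that maps a batch of states to action logits, and advantage estimates already produced by a critic; these names and the discrete-action setting are illustrative assumptions, not part of the paper (the paper's MuJoCo experiments use a Gaussian policy).

```python
import torch

def policy_gradient_loss(policy, states, actions, advantages):
    # states: (n, obs_dim) float tensor; actions: (n,) long tensor of sampled
    # action indices; advantages: (n,) tensor of critic-based estimates A-hat.
    # Minimizing this loss performs stochastic gradient ascent on estimate (5).
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    return -(log_probs * advantages).mean()
```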

Additionally, define

d^π(s) = (1 − γ) Σ_{t=0}^{∞} γ^t P(s_t = s | π)

for the future state probability distribution for policy π, and denote π(·|s) for the probability distribution over the action space when in state s and using policy π. Further denote D_KL(p ‖ q) for the KL divergence from distribution p to distribution q, and denote

D̄_KL(π, π_k) = E_{s∼d^{π_k}} [ D_KL(π(·|s) ‖ π_k(·|s)) ]                    (6)

for the “aggregated KL divergence”.

2.1 Approximations for the Sample Efficiency Problem

For the sample efficiency problem, the objective J(π) − J(π_k) is typically approximated using samples generated from π_k [1, 6, 4]. (Although importance sampling can be used to form an unbiased estimator, the estimator has many product terms which can lead to numerical instabilities [11].) One of two different approaches is typically used to approximate J(π) − J(π_k) using samples from π_k. The first approach is to make a first-order approximation of J(π_θ) in the vicinity of θ_k [12, 13, 1]:

J(π_θ) − J(π_{θ_k}) ≈ ĝ_k^T (θ − θ_k)                    (7)

where ĝ_k is the sample estimate (5) evaluated at θ_k. The second approach, which applies to all policies π ∈ Π and not just to parameterized policies π_θ, is to approximate the state distribution d^π with d^{π_k}, giving the approximation [6, 4]:

J(π) − J(π_k) ≈ L_{π_k}(π) := (1/(1 − γ)) E_{s∼d^{π_k}} E_{a∼π(·|s)} [ A^{π_k}(s, a) ]                    (8)

To estimate the expectation in (8), as in (5), we generate trajectories of finite length from π_k, create estimates of the advantage values using a critic network, and then form a sample average. There is a well-known bound on the error of the approximation (8) [14, 6]. Furthermore, the approximation (8) matches J(π_θ) to first order with respect to the parameter θ [6].
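As a rough illustration of how (8) is estimated from data generated by π_k, the sketch below rewrites the inner expectation with the importance ratio π(a|s)/π_k(a|s) and averages it over the sampled state-action pairs. It is a hedged sketch under the same assumptions as the previous snippet (PyTorch, discrete actions, a hypothetical `policy` network), with `old_log_probs` denoting log π_k(a_i|s_i) stored when the data was collected.

```python
import torch

def surrogate_objective(policy, old_log_probs, states, actions, advantages):
    # Sample estimate of the approximation (8): importance ratios
    # pi(a_i|s_i) / pi_k(a_i|s_i) weighted by advantage estimates from pi_k,
    # averaged over the (s_i, a_i) pairs drawn under pi_k.
    dist = torch.distributions.Categorical(logits=policy(states))
    ratios = torch.exp(dist.log_prob(actions) - old_log_probs)
    return (ratios * advantages).mean()
```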

3 Related Work

The natural gradient was proposed by Amari [15] and first applied to policy gradients by Kakade [16]. Instead of following the direction of the gradient in Euclidean space, the Natural Policy Gradient (NPG) method attempts to follow the direction of steepest descent in the policy space, which is typically a high-dimensional manifold. This is done by pre-multiplying the policy gradient with the inverse of the Fisher information matrix.

The goal of TRPO [1, 12, 13, 17] is to solve the sample efficiency problem (1)-(2) with D = D̄_KL, i.e., to use the weighted KL divergence (6) for the policy proximity constraint (2). TRPO addresses this problem in the parameter space Θ. First, it uses the first-order approximation (7) to approximate J(π_θ) − J(π_{θ_k}) and makes a similar second-order approximation of D̄_KL(π_θ, π_{θ_k}). Second, it uses samples from π_k to form estimates of these two approximations. Third, using these estimates (which are functions of θ), it solves for the optimal θ. The optimal θ is a function of the estimated gradient ĝ_k and of the sample-average Hessian of the KL divergence evaluated at θ_k. TRPO takes an additional step of limiting the magnitude of the update to ensure D̄_KL(π_θ, π_{θ_k}) ≤ δ (i.e., checking whether the sample-average estimate of the proximity constraint is met without the second-order approximation).

Actor-Critic using Kronecker-Factored Trust Region (ACKTR) [3] proposed using Kronecker-factored approximate curvature (K-FAC) to update both the policy gradient and the critic terms, giving a more computationally efficient method for calculating the natural gradients. ACER linearizes the KL divergence constraint and maintains an average policy network to enforce the KL divergence constraint rather than using the current policy π_k, leading to significant performance improvements for actor-critic methods [18].

PPO [4] takes a very different approach from TRPO. In order to obtain the new policy π_{k+1}, PPO seeks to maximize the objective:

L_clip(θ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],  where r_t(θ) = π_θ(a_t|s_t) / π_k(a_t|s_t)                    (9)

In the process of going from π_k to π_{k+1}, PPO makes many gradient steps while only using the data from π_k. It has been shown to have excellent sample-efficiency performance. To gain some insight into the PPO objective, note that without the clipping, it is simply the approximation (8) (while also removing the discounting and using a finite horizon). The additional clipping is analogous to the constraint (2) in that its goal is to keep π_θ close to π_k. Indeed, the clipping can be seen as an attempt at keeping π_θ(a|s)/π_k(a|s) from becoming either much larger than 1 + ε or much smaller than 1 − ε. Thus, although the PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit.
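The clipped objective (9) translates directly into a loss. The sketch below is again only illustrative, under the same hypothetical PyTorch setup as the earlier snippets; the default ε = 0.2 is an assumed placeholder rather than a value taken from the paper.

```python
import torch

def ppo_clipped_objective(policy, old_log_probs, states, actions, advantages,
                          eps=0.2):
    # Clipped surrogate of (9): ratios outside [1 - eps, 1 + eps] receive no
    # further incentive, which (like constraint (2)) keeps pi_theta near pi_k.
    dist = torch.distributions.Categorical(logits=policy(states))
    ratios = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps)
    return torch.min(ratios * advantages, clipped * advantages).mean()
```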

4 Optimizing in the Policy Space

As mentioned in Section 3, TRPO uses first- and second-order approximations to reformulate the sample efficiency problem (1)-(2) as a constrained optimization problem in the parameter space Θ, and then finds the parameter θ that optimizes the approximated problem.

The approach proposed in this paper is to first determine (or partially determine) the optimal policy π* in the larger non-parameterized policy space Π. We refer to such an optimal policy as the optimal target policy, and to the corresponding probabilities π*(a|s) as the optimal targets. After determining the targets in the non-parameterized space, we then try to find a parameterized policy in Π_θ that is close to the targets.

In this section, we consider finding the optimal target policy. Specifically, we consider the MDP problem:

maximize_{π∈Π}   E_{s∼d^{π_k}} E_{a∼π(·|s)} [ A^{π_k}(s, a) ]                    (10)
subject to       D(π, π_k) ≤ δ                    (11)

Note that π is not restricted to the set of parameterized policies Π_θ. Also note that, as is common practice, we are using an approximation for the objective function; specifically, we are using the approximation (8). However, unlike TRPO, we are not approximating the constraint (2).

4.1 Solving TRPO MDP Problems in the Policy Space

4.1.1 Aggregated TRPO Problem

TRPO uses D = D̄_KL. The optimization problem (10)-(11) therefore becomes

maximize_{π∈Π}   E_{s∼d^{π_k}} E_{a∼π_k(·|s)} [ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ]                    (12)
subject to       E_{s∼d^{π_k}} [ D_KL(π(·|s) ‖ π_k(·|s)) ] ≤ δ                    (13)

in which we used the identity

E_{a∼π(·|s)} [ A^{π_k}(s, a) ] = E_{a∼π_k(·|s)} [ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ]                    (14)

The following result provides the structure of the TRPO policy in the policy space:

Theorem 1

There is an optimal policy π* for the TRPO problem (12)-(13) that takes the following form:

π*(a|s) = (π_k(a|s) / Z(s)) exp( A^{π_k}(s, a) / λ(s) )                    (15)

where λ(s) > 0 and the normalization constant Z(s) are functions of s but independent of a.

As a consequence, for any two actions a_1 and a_2 we have π*(a_1|s)/π*(a_2|s) = (π_k(a_1|s)/π_k(a_2|s)) exp( (A^{π_k}(s, a_1) − A^{π_k}(s, a_2)) / λ(s) ). This result indicates that, for a fixed s, the optimal solution in the policy space has the targets grow exponentially with respect to the advantage values. Therefore, if A^{π_k}(s, a_1) is larger than A^{π_k}(s, a_2), then the target for (s, a_1) will be exponentially larger than the target for (s, a_2).

Proof of Theorem 1: Let π* be an optimal policy for (12)-(13), and let ε(s) = D_KL(π*(·|s) ‖ π_k(·|s)). For each state s, consider the decomposed optimization problem:

maximize_{π(·|s)}   E_{a∼π(·|s)} [ A^{π_k}(s, a) ]                    (16)
subject to          D_KL(π(·|s) ‖ π_k(·|s)) ≤ ε(s)                    (17)

Let π̃(·|s) be an optimal solution to (16)-(17), and let π̃ = (π̃(·|s), s ∈ S) be the corresponding policy. It is easily seen that π̃ is also an optimal policy for the TRPO problem (12)-(13).

We now consider finding an optimal policy for the sub-problem (16)-(17) for a fixed s. First convert the constrained optimization problem to an unconstrained one with a Lagrange multiplier λ(s):

maximize_{π(·|s)}   E_{a∼π(·|s)} [ A^{π_k}(s, a) ] − λ(s) D_KL(π(·|s) ‖ π_k(·|s))                    (18)

This is the standard maximum entropy problem in reinforcement learning [19, 20, 21]. Its solution is given by (15) with λ(s) chosen so that the constraint (17) is met with equality.

Instead of constraining the forward KL divergence, one could instead constrain the backward KL divergence, i.e., D_KL(π_k(·|s) ‖ π(·|s)). In this case, the optimization problem again decomposes, and the optimal targets are then obtained by solving a simple optimization problem.
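For intuition, the sketch below computes exponential targets of the form (15) for one state, treating λ(s) as given. It is a minimal sketch for a finite action space; the form used here, π*(a|s) ∝ π_k(a|s) exp(Â(s, a)/λ), is the standard maximum-entropy solution and is an assumption about notation rather than a formula copied from the paper.

```python
import numpy as np

def exponential_targets(pi_k_probs, advantages, lam):
    # Targets of the form (15) for a single state s: proportional to
    # pi_k(a|s) * exp(A(s, a) / lam), with Z(s) recovered by normalization.
    # pi_k_probs, advantages: 1-D arrays indexed by the actions of state s.
    unnormalized = pi_k_probs * np.exp(advantages / lam)
    return unnormalized / unnormalized.sum()
```

Actions with larger advantages receive exponentially larger targets, matching the discussion after Theorem 1.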

4.1.2 Solving Disaggregated TRPO MDP Problem in the Policy Space

An alternative way of formulating the TRPO problem is to require D_KL(π(·|s) ‖ π_k(·|s)) ≤ δ for all states s, rather than using the aggregate constraint (13). In fact, [1] states that this alternative “disaggregated-constraint” version of the problem is preferable to the aggregated version, but that [1] uses the aggregated version for mathematical convenience. It turns out that when optimizing in the policy space, it is easier to solve the disaggregated version than the aggregated version. Indeed, as in the proof of Theorem 1, the optimization problem of maximizing (12) subject to D_KL(π(·|s) ‖ π_k(·|s)) ≤ δ for all s decomposes into fully separate optimization problems, one for each state s:

maximize_{π(·|s)}   E_{a∼π(·|s)} [ A^{π_k}(s, a) ]                    (19)
subject to          D_KL(π(·|s) ‖ π_k(·|s)) ≤ δ                    (20)

Note that in this case the constraint (20) uses the fixed δ, whereas the corresponding “aggregated” problem uses the more complicated ε(s). Owing to this simplification, we can explicitly calculate the optimal Lagrange multiplier λ(s) (as a function of δ).

Theorem 2

There is an optimal policy for the disaggregated-constraints TRPO problem which takes the same form as the optimal policy given in Theorem 1. However, in this case, for each given s, we can explicitly obtain λ(s) and Z(s) by solving two non-linear equations for the two unknowns λ(s) and Z(s). The first equation is obtained from the constraint D_KL(π*(·|s) ‖ π_k(·|s)) = δ and the second from the normalization Σ_a π*(a|s) = 1. (See Appendix A.)

Note that even in this disaggregated version of the problem, the optimal policy again has the exponential structure π*(a|s) ∝ π_k(a|s) exp( A^{π_k}(s, a)/λ(s) ) for each fixed state s.

4.2 Solving the PPO-inspired Problem in the Policy Space

Recall from Section 3 that the clipping in PPO can be seen as an attempt at keeping π(a|s)/π_k(a|s) from becoming either much larger than 1 + ε or much smaller than 1 − ε. In this subsection, we consider the general problem (10)-(11) with a constraint function (21) that requires the ratio π(a|s)/π_k(a|s) to lie between 1 − ε and 1 + ε for every state-action pair, with ε ∈ (0, 1). We refer to the optimization problem (10)-(11) using the distance (21) as the “PPO-inspired problem”. This problem once again decomposes into sub-problems. For each s, we have to solve

maximize_{π(·|s)}   E_{a∼π(·|s)} [ A^{π_k}(s, a) ]                    (22)
subject to          π(a|s) ≤ (1 + ε) π_k(a|s)  for all a                    (23)
                    π(a|s) ≥ (1 − ε) π_k(a|s)  for all a                    (24)

This problem can be solved explicitly:

Theorem 3

For each fixed s, re-order the actions so that Â(s, a_i) := A^{π_k}(s, a_i) is non-decreasing in i. There is an optimal policy for the PPO-inspired problem which takes the form

π*(a_i|s) = (1 − ε) π_k(a_i|s) for i < j*,   π*(a_{j*}|s) = c,   π*(a_i|s) = (1 + ε) π_k(a_i|s) for i > j*                    (25)

where the threshold action index j* and the constant c are set so that (1 − ε) π_k(a_{j*}|s) ≤ c ≤ (1 + ε) π_k(a_{j*}|s) and Σ_a π*(a|s) = 1.

Note how strikingly different the optimal policy for the TRPO problem (aggregated or disaggregated) is from the optimal solution for the PPO-inspired problem. In the former (Theorems 1 and 2), the targets grow exponentially as a function of the advantage, whereas in the latter (Theorem 3), the targets are bounded between (1 − ε) π_k(a|s) and (1 + ε) π_k(a|s).

5 Supervised Policy Update

We now introduce SPU, a new sample-efficient methodology for deep reinforcement learning. SPU focuses on the non-parameterized policy space Π, first determining targets that the non-parameterized policy should have. Once the targets are determined, it uses supervised regression to find a parameterized policy that nearly meets the targets. Since there is significant flexibility in how the targets can be defined, SPU is versatile and can provide good sample efficiency performance.

In SPU, to advance from π_k to π_{k+1} we perform the following steps:

  1. As usual, we first sample trajectories using policy π_k, giving sample data (s_i, a_i, Â_i), i = 1, …, m. Here Â_i is again an estimate of the advantage value A^{π_k}(s_i, a_i), obtained from an auxiliary critic network. (For notational simplicity, we henceforth index the samples with a single index i rather than with a pair of indices denoting the time-step within each trajectory.)

  2. For each i, using the advantage estimate Â_i we define a specific target π̃_i for π(a_i|s_i). For example, as discussed below, we can define π̃_i = π*(a_i|s_i), where π* is the optimal policy of one of the constrained MDP problems in Section 4. Alternatively, as discussed below, we can hand-engineer target functions.

  3. We then fit the policy network π_θ to the labeled data (s_i, a_i, π̃_i), i = 1, …, m. Specifically, we solve a supervised regression problem, minimizing:

    L(θ) = Σ_{i=1}^{m} ℓ( π_θ(a_i|s_i), π̃_i )                    (26)

    where ℓ is a loss function such as the L2 loss.

  4. After a fixed number of passes through the data to minimize L(θ), the resulting π_θ becomes our π_{k+1}.

Thus SPU proceeds by solving a series of supervised learning problems, one for each policy update from π_k to π_{k+1}. Note that SPU does not use traditional policy gradient steps in the parameter space. Instead, SPU focuses on moving from policy π_k to policy π_{k+1} in the non-parameterized policy space, where each new target policy is approximately realized by the policy network by solving a supervised regression problem.
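The regression step (26) might look as follows. This is a hedged sketch under the same hypothetical PyTorch, discrete-action setup as before (the paper's experiments use a Gaussian policy), with the L2 loss mentioned in step 3 and a generic optimizer; the argument names are illustrative.

```python
import torch

def spu_regression_step(policy, optimizer, states, actions, targets, epochs=10):
    # Fit pi_theta to the labeled pairs ((s_i, a_i), target_i) by minimizing
    # the L2 loss between pi_theta(a_i|s_i) and the target, as in (26),
    # for a fixed number of passes (epochs) through the data.
    for _ in range(epochs):
        dist = torch.distributions.Categorical(logits=policy(states))
        probs = torch.exp(dist.log_prob(actions))
        loss = ((probs - targets) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice the passes would be taken over shuffled mini-batches (Table 2 lists a mini-batch size of 64).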

In minimizing L(θ), we have considered two approaches. The first is to initialize the policy network with small random weights; the second is to initialize the policy network with θ_k. For both approaches, we have tried using regularization (by putting aside a portion of the labeled data (s_i, a_i, π̃_i), i = 1, …, m, for estimating the validation error). We have found that initializing with θ_k provides the best performance, and when initializing with θ_k, regularization does not seem to help.

5.1 TRPO-inspired targets with disaggregated constraints

Consider the TRPO problem with disaggregated constraints, that is, problem (19)-(20). We can rewrite (19)-(20) as:

maximize_{π(·|s)}   E_{a∼π_k(·|s)} [ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ]                    (27)
subject to          D_KL(π(·|s) ‖ π_k(·|s)) ≤ δ                    (28)

where for simplicity we use the same constraint level δ at every state. To derive targets for this case, we estimate the expectation in the objective (27) using the samples and estimated advantage values generated from π_k. Setting s = s_i and replacing the expectation with its single-sample estimate gives:

maximize_{π(·|s_i)}   (π(a_i|s_i)/π_k(a_i|s_i)) Â_i                    (29)
subject to            D_KL(π(·|s_i) ‖ π_k(·|s_i)) ≤ δ                    (30)

Let x = π(a_i|s_i)/π_k(a_i|s_i) and p = π_k(a_i|s_i). Also denote c(a) = π(a|s_i)/π_k(a|s_i) for the ratios at the remaining actions a ≠ a_i. Then (29)-(30) becomes

maximize    Â_i x                    (31)
subject to  p x log x + Σ_{a≠a_i} π_k(a|s_i) c(a) log c(a) ≤ δ                    (32)
            p x + Σ_{a≠a_i} π_k(a|s_i) c(a) = 1                    (33)

First consider the case Â_i > 0. In this case we are going to want to make x as large as possible. It is easily seen that if p ≥ e^{−δ}, then x = 1/p with c(a) = 0 for all a ≠ a_i is a feasible solution to (31)-(33), and it is optimal since x can never exceed 1/p. Now suppose p < e^{−δ}. At the optimal solution, we will have π(a|s_i) = c π_k(a|s_i) for all a ≠ a_i for some constant c; thus c(a) = c for all a ≠ a_i. Substituting this into the two constraints in (31)-(33) and doing some algebra gives:

p x + (1 − p) c = 1                    (34)
p x log x + (1 − p) c log c = δ                    (35)

The equations (34)-(35) can be readily solved numerically for x and c; we then set the target π̃_i = x π_k(a_i|s_i). Now consider the case Â_i < 0. In this case we are going to want to make x as small as possible. If p ≤ 1 − e^{−δ}, then we can set x = 0, i.e., π̃_i = 0. Otherwise, we can again determine x and c by solving (34)-(35), now with the restriction c > 1.

In summary, for TRPO with disaggregated constraints, for each sample i, the target π̃_i can be obtained by solving two equations in two unknowns. The procedure is repeated for each of the m samples to obtain the m target values. In the Appendix we show how the targets can be obtained for the aggregated KL constraint.
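A numerical sketch of this per-sample computation is given below. It eliminates c via (34) and finds the root of the resulting one-variable version of (35) with a standard bracketing solver. This is only a sketch under the reconstruction of (34)-(35) used above (finite action space, per-state constraint D_KL(π(·|s_i) ‖ π_k(·|s_i)) ≤ δ, and 0 < p < 1); it is not code from the paper, and the continuous-action MuJoCo setting would require the Gaussian analogue.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import xlogy   # xlogy(a, b) = a*log(b), with 0*log(0) = 0

def disaggregated_target_ratio(p, advantage, delta):
    # Target ratio x = pi~(a_i|s_i) / pi_k(a_i|s_i) for one sample, where
    # p = pi_k(a_i|s_i). Using the normalization (34), c = (1 - x*p)/(1 - p),
    # the KL equality (35) becomes a one-variable equation in x.
    def kl_gap(x):
        c = max(1.0 - x * p, 0.0) / (1.0 - p)
        return xlogy(x * p, x) + xlogy(c * (1.0 - p), c) - delta

    if advantage == 0.0:
        return 1.0                        # no incentive to move away from pi_k
    if advantage > 0.0:
        x_max = 1.0 / p                   # all probability mass on a_i
        if kl_gap(x_max) <= 0.0:          # p >= exp(-delta): constraint slack
            return x_max
        return brentq(kl_gap, 1.0, x_max)
    else:
        if kl_gap(0.0) <= 0.0:            # p <= 1 - exp(-delta): x = 0 feasible
            return 0.0
        return brentq(kl_gap, 0.0, 1.0)
```

The returned ratio x is then converted to the target via π̃_i = x π_k(a_i|s_i).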

5.2 PPO-inspired targets

Analogous to what was done for the TRPO-inspired targets, we can form sample estimates of the expectation for the PPO-inspired problem (22)-(24), obtaining:

maximize    (π(a_i|s_i)/π_k(a_i|s_i)) Â_i                    (36)
subject to  1 − ε ≤ π(a_i|s_i)/π_k(a_i|s_i) ≤ 1 + ε                    (37)

The optimal solution to this problem gives the targets π̃_i = (1 + ε) π_k(a_i|s_i) for all i with Â_i > 0 and π̃_i = (1 − ε) π_k(a_i|s_i) for all i with Â_i < 0. We refer to these targets as the “default targets”.
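A sketch of the default targets for a batch of samples is below, assuming finite actions, NumPy arrays of π_k(a_i|s_i) values and advantage estimates, and a placeholder ε; these are illustrative assumptions, not the paper's code.

```python
import numpy as np

def default_targets(pi_k_probs, advantages, eps=0.2):
    # Default PPO-inspired targets: push the sampled action's probability to
    # the boundary of the interval [(1-eps)*pi_k, (1+eps)*pi_k] in the
    # direction indicated by the sign of its advantage estimate.
    ratios = np.where(advantages > 0.0, 1.0 + eps, 1.0 - eps)
    return ratios * pi_k_probs
```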

5.3 Engineered target functions

With the default targets, for all values of Â_i, including small values, the corresponding target ratio π̃_i/π_k(a_i|s_i) equals either 1 + ε or 1 − ε. This is counter-intuitive, as we would expect the targets to be close to π_k(a_i|s_i) for small values of Â_i. Motivated by the methodology in Section 4 and by the default PPO-inspired targets, we engineer three classes of alternative target functions with the properties: (i) the target equals π_k(a_i|s_i) when Â_i = 0; (ii) the target is a non-decreasing function of Â_i. All three classes of target functions outperform PPO on the MuJoCo domain, demonstrating the robustness of SPU with respect to how the target is computed. Below are the exact forms of these functions and their respective plots (a schematic sketch follows Figure 1).

Target Function 1:

where is clipped such that .

Target Function 2:

Target Function 3:

where is a tunable hyper-parameter.

Figure 1: Classes of target functions: (a) Target Function 1, (b) Target Function 2, (c) Target Function 3.
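To make properties (i) and (ii) concrete, here is a purely hypothetical target function of this kind; it is not one of the paper's three target functions, only an illustration of the stated properties, and the parameters eps and eta are placeholders.

```python
import numpy as np

def illustrative_target(pi_k_probs, advantages, eps=0.2, eta=1.0):
    # Hypothetical target function satisfying (i) and (ii): the ratio is 1 at
    # zero advantage, non-decreasing in the advantage, and kept within
    # [1 - eps, 1 + eps] so the targets stay bounded, as in Theorem 3.
    ratios = np.clip(1.0 + eta * advantages, 1.0 - eps, 1.0 + eps)
    return ratios * pi_k_probs
```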

6 Experimental Results

We tested each algorithm on seven MuJoCo [22] simulated robotics tasks implemented in OpenAI Gym [23]. We only compare the performance of our algorithms against PPO, since PPO has become a popular baseline. Thus, our results here can be extrapolated to get a rough estimate of how our algorithms perform against other recent algorithms such as TRPO [1] and A3C [24]. The performance of PPO is obtained by running the publicly available OpenAI implementation [25]. Except for the hyper-parameters in the target functions, we used the same hyper-parameter values as the PPO paper [4] and kept them fixed across different algorithmic settings.

Target Function | Improvement over PPO
default targets | 8%
default targets with gradient clipping | 12%
Target Function 1 | 16%
Target Function 2 | 17%
Target Function 3 | %
Table 1: Performance improvement of SPU over PPO. The left column lists SPU with different target functions.

As in [4], each run of an algorithm in one environment is trained for one million time-steps and is scored by averaging the episodic reward over the last 100 episodes. We set the score of the random policy to 0 and use the performance of the random policy as the reference point when measuring the improvement of an algorithm over PPO. The relative performance in the 7 environments is averaged to produce a single scalar that represents the overall score for one algorithm. For each environment, the relative performance is also averaged over 5 different starting seeds. The source code will be released after the blind review process.

As shown in Table 1, SPU with Target Function 3 provides a significant improvement over PPO. Not only is the final average reward of SPU better than PPO's, SPU also has higher sample efficiency, as measured by the number of time-steps needed to reach a given performance level. Figure 2 illustrates that Target Function 1 consistently achieves higher reward than PPO in the latter half of training in 6 out of the 7 MuJoCo environments. PPO makes 10 passes (epochs) through the samples to update the policy. To ensure a fair computational comparison, for all 5 target functions, SPU also makes 10 passes through the samples.

Figure 2: Performance of SPU with Target Function 1 compared to PPO. The x-axis represents time-steps. The y-axis represents the average episodic reward for the last 100 episodes.

As shown in Figure 3, if we increase the number of passes to 20 (but not the number of samples taken), the improvement of SPU with Target Function 1 climbs further, while PPO with 20 passes performs only marginally better than PPO with 10 passes. Note that performance does not improve for all environments when increasing the number of passes. For PPO, performance actually declines significantly in three of the seven environments, whereas for SPU it declines in only one environment. By limiting the number of epochs in SPU (early stopping), we prevent overfitting to the targets. For completeness and reproducibility, we discuss the implementation details and list hyper-parameter values in the Appendix.

Figure 3: Performance of PPO with 20 passes versus SPU (with Target Function 1) with 20 passes. The x-axis represents time-steps. The y-axis represents the average episodic reward for the last 100 episodes. For PPO with 20 passes, the hyper-parameter was retuned over a grid of values (the selected value is listed in Table 3). We have removed the confidence intervals here to make the graphs more legible.

7 Conclusion

We developed a novel policy-space methodology, which can be used to compare and contrast various sample-efficient reinforcement learning algorithms, including PPO and different versions of TRPO. The methodology can also be used to study many other forms of constraints, such as constraining the aggregated and disaggregated reverse KL-divergence. We also proposed a new sample-efficient class of algorithms called SPU, for which there is significant flexibility in how we set the targets.

As compared to PPO, our experimental results show that SPU with simple target functions can lead to improved sample-efficiency performance without increasing wall-clock time. In the future, it may be possible to achieve further gains with yet-to-be-explored classes of target functions, annealing the targets, and changing the number of passes through the data.

8 Acknowledgements

We would like to acknowledge the extremely helpful support by the NYU Shanghai High Performance Computing Administrator Zhiguo Qi.

Appendix A Solving for the optimal policy in the disaggregated-constraints TRPO problem (Theorem 2)

Plugging (15) into the normalization condition Σ_a π*(a|s) = 1 gives:

Z(s) = Σ_a π_k(a|s) exp( A^{π_k}(s, a) / λ(s) )                    (38)

Plugging (15) into the constraint D_KL(π*(·|s) ‖ π_k(·|s)) = δ gives

(1/λ(s)) Σ_a π*(a|s) A^{π_k}(s, a) − log Z(s) = δ                    (39)

Equation (38) can then be plugged into (39) to give a single equation in the one unknown λ(s). λ(s) can then be found by numerically solving this single-variable equation.

Appendix B TRPO-Inspired Targets with Aggregated Constraint

In this section, we outline how specific targets can be obtained for the case of TRPO with the aggregated constraint. The TRPO problem in the policy space (12)-(13) can be rewritten as:

maximize_{π∈Π}   E_{s∼d^{π_k}} E_{a∼π_k(·|s)} [ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ]                    (40)
subject to       E_{s∼d^{π_k}} [ D_KL(π(·|s) ‖ π_k(·|s)) ] ≤ δ                    (41)

Because we do not have an estimate of A^{π_k}(s, a) for all s and a (and because the state space is usually huge for deep reinforcement learning problems), we cannot easily calculate the expectations in (40)-(41). We instead use the samples (s_i, a_i, Â_i), i = 1, …, m, from π_k to approximate the expectations with their sample averages, resulting in the following optimization problem:

maximize    (1/m) Σ_{i=1}^{m} (π(a_i|s_i)/π_k(a_i|s_i)) Â_i
subject to  (1/m) Σ_{i=1}^{m} D_KL(π(·|s_i) ‖ π_k(·|s_i)) ≤ δ

where m is the total number of samples across the sampled trajectories (indexing the samples by i rather than by trajectory and trajectory time-step). Letting x_i = π(a_i|s_i)/π_k(a_i|s_i), i = 1, …, m, the above optimization problem becomes

maximize    (1/m) Σ_{i=1}^{m} x_i Â_i
subject to  (1/m) Σ_{i=1}^{m} D_KL(π(·|s_i) ‖ π_k(·|s_i)) ≤ δ

After we solve the above optimization problem for an optimal solution x_1*, …, x_m*, we set the targets π̃_i = x_i* π_k(a_i|s_i), i = 1, …, m.

Solving the above optimization problem is a topic for further research. We conjecture that it can be solved as quickly as the conjugate gradient computation in TRPO. One possible approach is to first fix an allocation δ_1, …, δ_m with Σ_i δ_i = m δ and solve the following disaggregated problem for each i:

maximize    x_i Â_i
subject to  D_KL(π(·|s_i) ‖ π_k(·|s_i)) ≤ δ_i

Each of these problems is the disaggregated TRPO problem, which can be rapidly solved, as discussed in Section 5.1. Let V_i(δ_i) denote the optimal value for the i-th disaggregated problem. Then the resulting optimization problem becomes: maximize Σ_i V_i(δ_i) over the allocations subject to Σ_i δ_i = m δ. This problem can then be solved in a hierarchical manner.

Appendix C Performance Graph For Target Function 2 and Target Function 3

Figure 4: Performance of SPU with Target Function 2 compared to PPO. The x-axis represents time-steps. The y-axis represents the average episodic reward for the last 100 episodes.

Figure 5: Performance of SPU with Target Function 3 compared to PPO. The x-axis represents time-steps. The y-axis represents the average episodic reward for the last 100 episodes.

Appendix D Implementation Details and Hyperparameters

As in [4], the policy is parameterized by a fully-connected feed-forward neural network with two hidden layers, each with 64 units and tanh nonlinearities. The policy outputs the mean of a Gaussian distribution with state-independent, trainable standard deviations, following [1, 26]. The action dimensions are assumed to be independent. The probability of an action is given by the multivariate Gaussian probability density function. The baseline used in the advantage value calculation is parameterized by a similarly sized neural network, trained to minimize the MSE between the sampled states' TD returns and their predicted values. To calculate the advantage values, we use Generalized Advantage Estimation [27]. States are normalized by subtracting the running mean and dividing by the running standard deviation before being fed to any neural network. The advantage values are normalized by subtracting the batch mean and dividing by the batch standard deviation before being used for the policy update.
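The two normalization steps might be implemented as below. This is a generic sketch of the standard running mean/standard-deviation trick and batch advantage normalization, not the authors' code; the incremental update formula is one common choice.

```python
import numpy as np

class RunningNormalizer:
    # Running mean / standard deviation of observations, updated incrementally
    # as each new batch of states arrives, then used to normalize inputs.
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, batch):
        batch_mean, batch_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        # Batched update of the combined mean and variance.
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-8)

def normalize_advantages(adv):
    # Batch normalization of advantage estimates before the policy update.
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```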

Parameter | Value
Number of timesteps | 1e6
Seed | 0-4
Optimizer | Adam
Optimizer learning rate | 3e-4
Learning rate anneal schedule | Linearly to 0
Adam epsilon | 1e-5
Timesteps per batch | 2048
Number of full passes | 10
Mini-batch size | 64
GAE γ | 0.99
GAE λ | 0.95
Table 2: Hyperparameters that are kept fixed across different algorithmic settings.
Target Function | Value
default targets | 0.32
default targets with gradient clipping | 0.48
Target Function 1 | 0.84
Target Function 1, 20 passes | 0.48
Target Function 2 | 2.19
Target Function 3 | 0.8,
PPO with 20 passes | 0.12
Table 3: Hyperparameters used for the target functions.

References

  • [1] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [2] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
  • [3] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5285–5294, 2017.
  • [4] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [5] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
  • [6] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
  • [7] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • [8] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [9] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
  • [10] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [11] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 179–186, 2012.
  • [12] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
  • [13] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
  • [14] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • [15] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • [16] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
  • [17] Joshua Achiam. Advanced policy gradient methods. http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf. Accessed: 2018-05-24.
  • [18] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. CoRR, abs/1611.01224, 2016.
  • [19] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
  • [20] John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017.
  • [21] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • [22] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
  • [23] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.
  • [24] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
  • [25] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
  • [26] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016.
  • [27] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.