Supervised Policy Update
Abstract
We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
Quan Ho Vuong, Yiming Zhang, Keith W. Ross
New York University Abu Dhabi, New York University, New York University Shanghai
quan.vuong@nyu.edu, yiming.zhang@cs.nyu.edu, keithwross@nyu.edu
Preprint. Work in progress.
1 Introduction
The policy gradient problem in deep reinforcement learning can be informally defined as seeking a parameterized policy $\pi_\theta$ that produces a high expected reward. The parameterized policy is realized with a neural network, and stochastic gradient descent with backpropagation is used to optimize the parameters. An issue that plagues traditional policy gradient methods is poor sample efficiency [1, 2, 3, 4]. In algorithms such as REINFORCE [5], new samples are needed for every small gradient step. In environments for which generating trajectories is expensive (such as robotic environments), sample efficiency is of central concern. The sample efficiency problem can be informally stated as follows: beginning with the current policy $\pi_{\theta_k}$, and using only trajectories generated from $\pi_{\theta_k}$, try to obtain a new policy $\pi_{\theta_{k+1}}$ which improves on $\pi_{\theta_k}$ as much as possible.
Several papers have addressed the sample efficiency problem by considering candidate new policies that are close to the original policy [1, 3, 6, 4]. Intuitively, if the candidate policy $\pi$ is far from the original policy $\pi_{\theta_k}$, then the information in the samples (states visited, actions taken, and the estimated advantage values) loses its relevance. This guideline seems reasonable in principle, but it requires a notion of closeness between two policies. One natural approach is to define a distance or divergence $D(\pi_{\theta_k}, \pi)$ between the current policy $\pi_{\theta_k}$ and the candidate new policy $\pi$, and then attempt to solve the constrained optimization problem:
$$\underset{\pi}{\text{maximize}} \quad J(\pi) - J(\pi_{\theta_k}) \qquad (1)$$
$$\text{subject to} \quad D(\pi_{\theta_k}, \pi) \le \delta \qquad (2)$$
Here the objective (1) attempts to maximize the improvement in performance of the updated policy compared to the current policy, and the constraint (2) ensures that the resulting policy is near the policy that was used to generate the data. The bound $\delta$ is a hyperparameter that can possibly be annealed over time.
We propose a new methodology, called Supervised Policy Update (SPU), for the sample efficiency problem. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a general methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be studied using this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
2 Preliminaries
We consider a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r(s, a)$ is the reward function, $P(s' \mid s, a)$ is the probability of transitioning to $s'$ from $s$ after taking action $a$, and $\rho_0$ is the initial state distribution over $\mathcal{S}$. Let $\pi$ denote a policy, let $\Pi$ be the set of all policies, and let the expected discounted reward be:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big] \qquad (3)$$
where $\gamma \in (0, 1)$ is a discount factor, $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a sample trajectory, and $\mathbb{E}_{\tau \sim \pi}$ is the expectation with respect to the probability of $\tau$ under policy $\pi$. Let $A^\pi(s, a)$ be the advantage function for policy $\pi$ [7]. Deep reinforcement learning considers a set of parameterized policies $\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\} \subset \Pi$, where each parameterized policy is defined by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in $\Pi_\Theta$ as well as over the non-parameterized policies in $\Pi$.
One popular approach to maximizing $J(\pi_\theta)$ over $\Pi_\Theta$ is to apply stochastic gradient ascent. The gradient of $J(\pi_\theta)$ evaluated at a specific $\theta = \theta_k$ can be shown to be [5, 8, 7]:
$$\nabla_\theta J(\pi_\theta)\big|_{\theta_k} = \mathbb{E}_{\tau \sim \pi_{\theta_k}}\Big[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta_k} \, A^{\pi_{\theta_k}}(s_t, a_t)\Big] \qquad (4)$$
To obtain an estimate of the gradient, we can sample $m$ finite-length trajectories $\tau_1, \ldots, \tau_m$ from $\pi_{\theta_k}$, and approximate (4) as:
$$g \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\big|_{\theta_k} \, \hat{A}(s_{i,t}, a_{i,t}) \qquad (5)$$
where $T_i$ is the length of the $i$-th trajectory, and $\hat{A}(s, a)$ is an approximation of $A^{\pi_{\theta_k}}(s, a)$ obtained from a critic network. Using the approximate advantage in the gradient estimator introduces a bias but has the effect of lowering the variance [9, 10].
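As a concrete illustration of the estimator (5), the following numpy sketch (our own, not the authors' implementation) averages per-sample score vectors weighted by the estimated advantages; the per-sample score vectors $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ are assumed to have been computed elsewhere:

```python
import numpy as np

def pg_estimate(grad_logp, advantages, num_trajectories):
    # grad_logp: (N, d) array, row t is the score vector grad_theta log pi(a_t|s_t)
    # advantages: (N,) critic-based estimates A_hat(s_t, a_t)
    # Averages over trajectories as in eq. (5): (1/m) * sum_i sum_t grad_logp * A_hat
    grad_logp = np.asarray(grad_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    weighted = grad_logp * advantages[:, None]
    return weighted.sum(axis=0) / num_trajectories
```

With a single trajectory of three samples, scores `[[1,0],[0,1],[1,1]]` and advantages `[1, 2, -1]` give the gradient estimate `[0, 1]`.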
Additionally, define
$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$$
for the future state probability distribution for policy $\pi$, and denote $\pi(\cdot \mid s)$ for the probability distribution over the action space when in state $s$ and using policy $\pi$. Further denote $D_{KL}(\pi \,\|\, \pi')[s]$ for the KL divergence from the distribution $\pi(\cdot \mid s)$ to the distribution $\pi'(\cdot \mid s)$, and denote
$$\bar{D}_{KL}(\pi \,\|\, \pi') = \mathbb{E}_{s \sim d^{\pi}}\big[D_{KL}(\pi \,\|\, \pi')[s]\big] \qquad (6)$$
for the "aggregated KL divergence".
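For concreteness, the per-state KL divergence and the aggregated KL divergence (6) can be computed for discrete action distributions as follows (a minimal numpy sketch; the function names are ours):

```python
import numpy as np

def kl(p, q):
    # Per-state KL divergence D_KL(p || q) between two discrete
    # action distributions p = pi(.|s) and q = pi'(.|s).
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def aggregated_kl(state_weights, P, Q):
    # Eq. (6): expectation of the per-state KL under the state
    # distribution d^pi, given as explicit weights over a finite state set.
    return float(sum(w * kl(p, q) for w, p, q in zip(state_weights, P, Q)))
```

For identical policies the per-state and aggregated divergences are both zero, as expected.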
2.1 Approximations for the Sample Efficiency Problem
For the sample efficiency problem, the objective $J(\pi) - J(\pi_{\theta_k})$ is typically approximated using samples generated from $\pi_{\theta_k}$ [1, 6, 4]. (Although importance sampling can be used to form an unbiased estimator, the estimator has many product terms, which can lead to numerical instabilities [11].) One of two approaches is typically used to approximate $J(\pi_\theta) - J(\pi_{\theta_k})$ using samples from $\pi_{\theta_k}$. The first approach is to make a first-order approximation of $J(\pi_\theta)$ in the vicinity of $\theta_k$ [12, 13, 1]:
$$J(\pi_\theta) - J(\pi_{\theta_k}) \approx \nabla_\theta J(\pi_\theta)\big|_{\theta_k}^{T} (\theta - \theta_k) \approx g^T (\theta - \theta_k) \qquad (7)$$
where $g$ is the sample estimate (5). The second approach, which applies to all policies $\pi \in \Pi$ and not just to parameterized policies $\pi_\theta$, is to approximate the state distribution $d^\pi$ with $d^{\pi_{\theta_k}}$, giving the approximation [6, 4]:
$$J(\pi) - J(\pi_{\theta_k}) \approx \frac{1}{1 - \gamma} \, \mathbb{E}_{s \sim d^{\pi_{\theta_k}}} \, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_{\theta_k}}(s, a)\big] \qquad (8)$$
To estimate the expectation in (8), as in (5), we generate trajectories of finite length from $\pi_{\theta_k}$, create estimates of the advantage values using a critic network, and then form a sample average. There is a well-known bound for the approximation (8) [14, 6]. Furthermore, the approximation matches $J(\pi_\theta) - J(\pi_{\theta_k})$ to first order with respect to the parameter $\theta$ [6].
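A sample estimate of (8) from data generated by $\pi_{\theta_k}$ can be written with importance weights on the logged actions, since the inner expectation in (8) is over $a \sim \pi$ while the samples are drawn from $\pi_{\theta_k}$ (this is the same identity used later in (14)). The sketch below is ours and assumes the $1/(1-\gamma)$ scaling of (8):

```python
import numpy as np

def surrogate_estimate(new_probs, old_probs, advantages, gamma=0.99):
    # Sample estimate of approximation (8): importance-weight the
    # estimated advantages of actions drawn from the old policy.
    ratios = np.asarray(new_probs, float) / np.asarray(old_probs, float)
    mean_term = np.mean(ratios * np.asarray(advantages, float))
    return float(mean_term / (1.0 - gamma))
```

For example, with new-policy probabilities `[0.5, 0.5]`, old-policy probabilities `[0.25, 0.5]`, advantages `[2, -1]`, and `gamma=0.5`, the estimate is 3.0.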
3 Related Work
Natural gradient was proposed by Amari [15] and first applied to policy gradients by Kakade [16]. Instead of following the direction of the gradient in Euclidean space, the Natural Policy Gradient (NPG) method attempts to follow the direction of steepest descent in the policy space, which is typically a high-dimensional manifold. This is done by pre-multiplying the policy gradient term with the inverse of the Fisher information matrix.
The goal of TRPO [1, 12, 13, 17] is to solve the sample efficiency problem (1)-(2) with $D = \bar{D}_{KL}$, i.e., to use the weighted KL divergence for the policy proximity constraint (2). TRPO addresses this problem in the parameter space $\Theta$. First, it uses the first-order approximation (7) for the objective and makes a similar second-order approximation for the constraint. Second, it uses samples from $\pi_{\theta_k}$ to form estimates of these two approximations. Third, using these estimates (which are functions of $\theta$), it solves for the optimal $\theta$. The optimal $\theta$ is a function of $g$ and of the sample average of the Hessian of the constraint evaluated at $\theta_k$. TRPO takes an additional step by limiting the magnitude of the update to ensure $\bar{D}_{KL}(\pi_{\theta_k} \,\|\, \pi_\theta) \le \delta$ (i.e., checking whether the sample-average estimate of the proximity constraint is met without the second-order approximation).
Actor-Critic using Kronecker-Factored Trust Region (ACKTR) [3] proposed using Kronecker-factored approximate curvature (KFAC) to update both the policy gradient and critic terms, giving a more computationally efficient method of calculating the natural gradients. ACER [18] linearizes the KL divergence constraint and maintains an average policy network to enforce the KL divergence constraint, rather than using the current policy $\pi_{\theta_k}$, leading to significant performance improvements for actor-critic methods.
PPO [4] takes a very different approach from TRPO. In order to obtain the new policy $\pi_{\theta_{k+1}}$, PPO seeks to maximize the objective:
$$L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} \qquad (9)$$
In the process of going from $\pi_{\theta_k}$ to $\pi_{\theta_{k+1}}$, PPO makes many gradient steps while only using the data from $\pi_{\theta_k}$. It has been shown to have excellent sample-efficiency performance. To gain some insight into the PPO objective, note that without the clipping, it is simply the approximation (8) (while also removing the discounting and using a finite horizon). The additional clipping is analogous to the constraint (2) in that it has the goal of keeping $\pi_{\theta_{k+1}}$ close to $\pi_{\theta_k}$. Indeed, the clipping can be seen as an attempt to keep $\pi_\theta(a \mid s)/\pi_{\theta_k}(a \mid s)$ from becoming either much larger than $1 + \epsilon$ or much smaller than $1 - \epsilon$. Thus, although the PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit.
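The clipped surrogate (9) is straightforward to compute from logged probabilities and advantage estimates; the following numpy sketch (ours) evaluates it for a batch of samples:

```python
import numpy as np

def ppo_clip_objective(new_probs, old_probs, advantages, eps=0.2):
    # Clipped surrogate of eq. (9): mean of min(r*A, clip(r, 1-eps, 1+eps)*A)
    r = np.asarray(new_probs, float) / np.asarray(old_probs, float)
    a = np.asarray(advantages, float)
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(r * a, clipped * a)))
```

Note how the `min` makes the clipping one-sided: a ratio far above $1+\epsilon$ earns no extra credit for a positive advantage, and a ratio far below $1-\epsilon$ is not rewarded for a negative advantage.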
4 Optimizing in the Policy Space
As mentioned in Section 3, TRPO uses first- and second-order approximations to reformulate the sample efficiency problem (1)-(2) as a constrained optimization problem in the parameter space $\Theta$, and then finds the parameter $\theta$ that optimizes the approximated problem.
The approach proposed in this paper is to first determine (or partially determine) the optimal policy $\pi^*$ in the larger non-parameterized policy space $\Pi$. We refer to such an optimal policy as the optimal target policy, and to the values $\pi^*(a \mid s)$ as the optimal targets. After determining the targets in the non-parameterized space, we then try to find a parameterized policy in $\Pi_\Theta$ that is close to the targets.
In this section, we consider finding the optimal target policy. Specifically, we consider the MDP problem:
$$\underset{\pi \in \Pi}{\text{maximize}} \quad \mathbb{E}_{s \sim d^{\pi_{\theta_k}}} \, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_{\theta_k}}(s, a)\big] \qquad (10)$$
$$\text{subject to} \quad D(\pi_{\theta_k}, \pi) \le \delta \qquad (11)$$
Note that $\pi$ is not restricted to the set of parameterized policies $\Pi_\Theta$. Also note that, as is common practice, we are using an approximation for the objective function, namely the approximation (8). However, unlike TRPO, we are not approximating the constraint (2).
4.1 Solving TRPO MDP Problems in the Policy Space
4.1.1 Aggregated TRPO Problem
TRPO uses $D(\pi_{\theta_k}, \pi) = \bar{D}_{KL}(\pi_{\theta_k} \,\|\, \pi)$. The optimization problem (10)-(11) therefore becomes
$$\underset{\pi \in \Pi}{\text{maximize}} \quad \mathbb{E}_{s \sim d^{\pi_{\theta_k}}} \, \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\Big[\frac{\pi(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a)\Big] \qquad (12)$$
$$\text{subject to} \quad \bar{D}_{KL}(\pi_{\theta_k} \,\|\, \pi) \le \delta \qquad (13)$$
in which we used the identity
$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_{\theta_k}}(s, a)\big] = \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\Big[\frac{\pi(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a)\Big] \qquad (14)$$
The following result provides the structure of the optimal policy for the TRPO problem in the policy space:
Theorem 1. The optimal solution $\pi^*$ to the aggregated TRPO problem (12)-(13) takes the form
$$\pi^*(a \mid s) = \frac{\pi_{\theta_k}(a \mid s)}{Z_\lambda(s)} \exp\Big(\frac{A^{\pi_{\theta_k}}(s, a)}{\lambda}\Big) \qquad (15)$$
where $Z_\lambda(s)$ is a normalizing constant and $\lambda > 0$ is chosen so that the constraint (13) is met with equality.
As a consequence, for any two actions $a_1$ and $a_2$ we have
$$\frac{\pi^*(a_1 \mid s)}{\pi^*(a_2 \mid s)} = \frac{\pi_{\theta_k}(a_1 \mid s)}{\pi_{\theta_k}(a_2 \mid s)} \exp\Big(\frac{A^{\pi_{\theta_k}}(s, a_1) - A^{\pi_{\theta_k}}(s, a_2)}{\lambda}\Big).$$
This result indicates that, for a fixed $\lambda$, the optimal solution in the policy space has targets that grow exponentially with respect to the advantage values. Therefore, if $A^{\pi_{\theta_k}}(s, a_1)$ is larger than $A^{\pi_{\theta_k}}(s, a_2)$, then the target for $a_1$ will be exponentially larger than the target for $a_2$.
Proof of Theorem 1: Let $\pi^*$ be an optimal policy for (12)-(13), and let $\delta^*(s) = D_{KL}(\pi_{\theta_k} \,\|\, \pi^*)[s]$. For each state $s$, consider the decomposed optimization problem:
$$\underset{\pi(\cdot \mid s)}{\text{maximize}} \quad \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_{\theta_k}}(s, a)\big] \qquad (16)$$
$$\text{subject to} \quad D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \le \delta^*(s) \qquad (17)$$
Let $\hat{\pi}(\cdot \mid s)$ be an optimal solution to (16)-(17) for each state $s$, and let $\hat{\pi}$ be the resulting policy. It is easily seen that $\hat{\pi}$ is also an optimal policy for the TRPO problem (12)-(13).
We now consider finding an optimal policy for the subproblem (16)-(17) for a fixed $s$. First, convert the constrained optimization problem to an unconstrained one with a Lagrange multiplier $\lambda$:
$$\underset{\pi(\cdot \mid s)}{\text{maximize}} \quad \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_{\theta_k}}(s, a)\big] - \lambda \, D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \qquad (18)$$
This is the standard maximum entropy problem in reinforcement learning [19, 20, 21]. Its solution is given by (15), with $\lambda$ chosen so that the constraint (17) is met with equality.
Instead of constraining the forward KL divergence, one could instead constrain the backward KL divergence, i.e., $\bar{D}_{KL}(\pi \,\|\, \pi_{\theta_k})$. In this case, the optimization problem again decomposes, and the optimal targets are then obtained by solving a simple optimization problem.
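For a fixed $\lambda$, the exponential targets of Theorem 1 are easy to compute for a discrete action space; the numpy sketch below (ours; it subtracts the maximum advantage before exponentiating purely for numerical stability, which the normalization cancels out) illustrates the structure:

```python
import numpy as np

def exponential_targets(pi_old, advantages, lam):
    # Theorem-1-style targets: pi*(a|s) proportional to
    # pi_old(a|s) * exp(A(s,a) / lambda), normalized over actions.
    pi_old = np.asarray(pi_old, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    # Shift advantages by their max before exponentiating (overflow-safe;
    # the shift cancels in the normalization).
    w = pi_old * np.exp((adv - adv.max()) / lam)
    return w / w.sum()
```

With a uniform old policy over two actions, advantages `[1, 0]`, and `lam = 1`, the ratio of the two targets is exactly $e$, illustrating the exponential growth in the advantage gap.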
4.1.2 Solving the Disaggregated TRPO MDP Problem in the Policy Space
An alternative way of formulating the TRPO problem is to require $D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \le \delta$ for all states $s$, rather than using the aggregate constraint $\bar{D}_{KL}(\pi_{\theta_k} \,\|\, \pi) \le \delta$. In fact, [1] states that this alternative "disaggregated-constraint" version of the problem is preferable to the aggregated version, but that [1] uses the aggregated version for mathematical convenience. It turns out that when optimizing in the policy space, it is easier to solve the disaggregated version than the aggregated version. Indeed, as in the proof of Theorem 1, the optimization problem of maximizing (10) subject to $D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \le \delta$ for all $s$ decomposes into fully separate optimization problems, one for each state $s$:
$$\underset{\pi(\cdot \mid s)}{\text{maximize}} \quad \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi_{\theta_k}}(s, a)\big] \qquad (19)$$
$$\text{subject to} \quad D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] \le \delta \qquad (20)$$
Note that in this case the constraint (20) uses the fixed budget $\delta$, whereas the corresponding "aggregated" subproblem uses the more complicated $\delta^*(s)$. Owing to this simplification, we can explicitly calculate the optimal Lagrange multiplier $\lambda(s)$ (as a function of $\delta$).
Theorem 2
There is an optimal policy to the disaggregated-constraints TRPO problem which takes the same form as the optimal policy given in Theorem 1. However, in this case, for each given $s$, we can explicitly obtain $\lambda(s)$ and the normalization constant $Z(s)$ by solving two nonlinear equations for these two unknowns. The first equation is obtained from the constraint $D_{KL}(\pi_{\theta_k} \,\|\, \pi)[s] = \delta$ and the second from the normalization $\sum_a \pi(a \mid s) = 1$. (See Appendix A.)
Note that even in this disaggregated version of the problem, the optimal policy again has the exponential structure in the advantage values for each fixed state $s$.
4.2 Solving the PPO-inspired Problem in the Policy Space
Recall from Section 3 that the clipping in PPO can be seen as an attempt to keep $\pi_\theta(a \mid s)/\pi_{\theta_k}(a \mid s)$ from becoming either much larger than $1 + \epsilon$ or much smaller than $1 - \epsilon$. In this subsection, we consider the general problem (10)-(11) with the constraint function
$$D(\pi_{\theta_k}, \pi) = \max_{s} \max_{a} \Big| \frac{\pi(a \mid s)}{\pi_{\theta_k}(a \mid s)} - 1 \Big| \qquad (21)$$
with $0 < \epsilon < 1$ playing the role of $\delta$. We refer to the optimization problem (10)-(11) using the distance (21) as the "PPO-inspired problem". This problem once again decomposes into subproblems. For each $s$, we have to solve
$$\underset{\pi(\cdot \mid s)}{\text{maximize}} \quad \sum_a \pi(a \mid s) \, A^{\pi_{\theta_k}}(s, a) \qquad (22)$$
$$\text{subject to} \quad (1 - \epsilon)\,\pi_{\theta_k}(a \mid s) \le \pi(a \mid s) \le (1 + \epsilon)\,\pi_{\theta_k}(a \mid s) \quad \text{for all } a \qquad (23)$$
$$\phantom{\text{subject to}} \quad \sum_a \pi(a \mid s) = 1 \qquad (24)$$
This problem can be solved explicitly:
Theorem 3
For each fixed $s$, reorder the actions so that $A^{\pi_{\theta_k}}(s, a_j)$ is non-decreasing in $j$. There is an optimal policy to the PPO-inspired problem which takes the form
$$\pi^*(a_j \mid s) = \begin{cases} (1 - \epsilon)\,\pi_{\theta_k}(a_j \mid s) & j < j^* \\ (1 + \epsilon)\,\pi_{\theta_k}(a_j \mid s) & j > j^* \end{cases} \qquad (25)$$
where the boundary index $j^*$ and the value $\pi^*(a_{j^*} \mid s)$ are set so that $\sum_j \pi^*(a_j \mid s) = 1$.
Note how strikingly different the optimal policy for the TRPO problem (aggregated or disaggregated) is from the optimal solution for the PPO-inspired problem. In the former (Theorems 1 and 2), the targets grow exponentially as a function of the advantage values, whereas in the latter (Theorem 3), the targets are bounded.
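The structure of Theorem 3 can be implemented greedily: start every action at its lower bound $(1-\epsilon)\,\pi_{\theta_k}$, then hand the leftover probability mass to the highest-advantage actions, each up to its upper bound $(1+\epsilon)\,\pi_{\theta_k}$, until the mass is exhausted. The sketch below is ours:

```python
import numpy as np

def ppo_inspired_targets(pi_old, adv, eps):
    # Theorem-3-style solution: lower bound (1-eps)*pi_old everywhere,
    # then raise the best actions toward (1+eps)*pi_old until the
    # distribution sums to 1.
    pi_old = np.asarray(pi_old, dtype=float)
    adv = np.asarray(adv, dtype=float)
    order = np.argsort(adv)               # nondecreasing advantage
    pi = (1.0 - eps) * pi_old.copy()
    budget = 1.0 - pi.sum()               # probability mass left to assign
    for j in reversed(order):             # best actions first
        extra = min(budget, 2.0 * eps * pi_old[j])  # cap at (1+eps)*pi_old
        pi[j] += extra
        budget -= extra
        if budget <= 0:
            break
    return pi
```

For `pi_old = [0.25, 0.25, 0.5]`, advantages `[1, 0, -1]`, and `eps = 0.2`, the result is `[0.3, 0.3, 0.4]`: the two best actions sit at their upper bounds and the worst at its lower bound, matching the bounded, piecewise structure of (25).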
5 Supervised Policy Update
We now introduce SPU, a new sample-efficient methodology for deep reinforcement learning. SPU focuses on the non-parameterized policy space $\Pi$, first determining targets that the non-parameterized policy should have. Once the targets are determined, it uses supervised regression to find a parameterized policy that nearly meets the targets. Since there is significant flexibility in how the targets can be defined, SPU is versatile and can also provide good sample-efficiency performance.
In SPU, to advance from $\pi_{\theta_k}$ to $\pi_{\theta_{k+1}}$ we perform the following steps:

1. As usual, we first sample trajectories using policy $\pi_{\theta_k}$, giving sample data $(s_i, a_i, \hat{A}_i)$, $i = 1, \ldots, m$. Here $\hat{A}_i$ is again an estimate of the advantage value $A^{\pi_{\theta_k}}(s_i, a_i)$, which is obtained from an auxiliary critic network. (For notational simplicity, we henceforth index the samples with $i$ rather than with the pair of trajectory and timestep indices.)

2. For each $i$, using the advantage $\hat{A}_i$ we define a specific target $\pi_{target}(a_i \mid s_i)$ for $\pi(a_i \mid s_i)$. For example, as discussed below, we can define $\pi_{target}(a_i \mid s_i) = \pi^*(a_i \mid s_i)$, where $\pi^*$ is the optimal policy of one of the constrained MDP problems in Section 4. Alternatively, as discussed below, we can hand-engineer target functions.

3. We then fit the policy network $\pi_\theta$ to the labeled data $(s_i, a_i, \pi_{target}(a_i \mid s_i))$, $i = 1, \ldots, m$. Specifically, we solve a supervised regression problem, minimizing:
$$L(\theta) = \sum_{i=1}^{m} \ell\big(\pi_\theta(a_i \mid s_i), \, \pi_{target}(a_i \mid s_i)\big) \qquad (26)$$
where $\ell$ is a loss function such as the $L_2$ loss.

4. After a fixed number of passes through the data to minimize $L(\theta)$, the resulting $\theta$ becomes our $\theta_{k+1}$.
Thus SPU proceeds by solving a series of supervised learning problems, one for each policy update $k$. Note that SPU does not use traditional policy gradient steps in the parameter space. Instead, SPU focuses on moving from policy to policy in the non-parameterized policy space, where each new target policy is approximately realized by the policy network by solving a supervised regression problem.
In minimizing $L(\theta)$, we have considered two approaches. The first is to initialize the policy network with small random weights; the second is to initialize the policy network with $\theta_k$. For both approaches, we have tried using regularization (by putting aside a portion of the labeled data for estimating the validation error). We have found that initializing with $\theta_k$ provides the best performance, and that when initializing with $\theta_k$, regularization does not seem to help.
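The supervised regression step can be sketched in miniature with a tabular softmax "policy network" (our toy stand-in for the neural network; a real implementation would backpropagate through the network instead). The loop below minimizes the $L_2$ loss (26) between $\pi_\theta(a_i \mid s_i)$ and the targets by gradient descent:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def spu_fit(theta, states, actions, targets, lr=0.5, passes=300):
    # SPU regression step (sketch): theta[s, a] is a tabular logit table.
    # Minimizes sum_i (pi_theta(a_i|s_i) - target_i)^2, i.e. loss (26)
    # with the L2 loss, via per-sample gradient descent.
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(passes):
        for s, a, y in zip(states, actions, targets):
            p = softmax(theta[s])
            dp = 2.0 * (p[a] - y)           # d loss / d p[a]
            # chain rule through softmax: d p[a] / d theta[s, j]
            grad = dp * p[a] * ((np.arange(len(p)) == a) - p)
            theta[s] -= lr * grad
    return theta
```

Starting from uniform logits, fitting the two consistent targets 0.9 and 0.1 for the two actions of a single state drives $\pi_\theta(a_0 \mid s_0)$ toward 0.9, mirroring step 3 of the procedure above.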
5.1 TRPO-inspired targets with disaggregated constraints
Consider the TRPO problem with disaggregated constraints, that is, problem (19)-(20). We can rewrite (19)-(20) as:
$$\underset{\pi(\cdot \mid s)}{\text{maximize}} \quad \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\Big[\frac{\pi(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a)\Big] \qquad (27)$$
$$\text{subject to} \quad \mathbb{E}_{a \sim \pi_{\theta_k}(\cdot \mid s)}\Big[\log \frac{\pi_{\theta_k}(a \mid s)}{\pi(a \mid s)}\Big] \le \delta \qquad (28)$$
To derive targets for this case, we estimate the expectations in (27)-(28) using the samples and estimated advantage values generated from $\pi_{\theta_k}$. Replacing each expectation with its sampled value gives:
(29)  
subject to  (30) 
Let and . Also denote . Then (29)(30) becomes
(31)  
subject to  (32)  
(33) 
First consider the case . In this case we are going to want to make as large as possible. It is easily seen that if then , , is a feasible solution to (31)(33). Now suppose . At the optimal solution, we will have for all for some . Thus, for all . Substituting this into the two constraints in (31)(33) and doing some algebra gives:
(34) 
(35) 
The equations (34)(35) can be readily solved for and ; we then set the target . Now consider the case . In this case we are going to want to make as small as possible. If , then we can set . Otherwise, we can again determine and by solving (34)(35) with the restriction .
In summary, for TRPO with disaggregated constraints, for each sample $i$, the target can be obtained by solving two equations in two unknowns. The procedure is repeated for each of the $m$ samples to obtain the $m$ target values. In the Appendix, we show how the targets can be obtained for the aggregated KL constraint.
5.2 PPO-inspired targets
Analogously to what was done for the TRPO-inspired targets, we can form sample estimates of the expectation for the PPO-inspired problem (22)-(24), obtaining:
(36)  
subject to  (37) 
The optimal solution to this problem gives the targets $\pi_{target}(a_i \mid s_i) = (1 + \epsilon)\,\pi_{\theta_k}(a_i \mid s_i)$ for all $i$ with $\hat{A}_i > 0$ and $\pi_{target}(a_i \mid s_i) = (1 - \epsilon)\,\pi_{\theta_k}(a_i \mid s_i)$ for all $i$ with $\hat{A}_i < 0$. We refer to these targets as the "default targets".
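Assuming, per the PPO-inspired solution, that the default target is $(1+\epsilon)\,\pi_{\theta_k}$ when the estimated advantage is positive and $(1-\epsilon)\,\pi_{\theta_k}$ when it is negative, the default targets are a one-liner (numpy sketch, ours):

```python
import numpy as np

def default_targets(pi_old, adv, eps=0.2):
    # Default PPO-inspired targets: scale the old-policy probability
    # up by (1+eps) for positive advantages, down by (1-eps) otherwise.
    pi_old = np.asarray(pi_old, dtype=float)
    adv = np.asarray(adv, dtype=float)
    return np.where(adv > 0, (1.0 + eps) * pi_old, (1.0 - eps) * pi_old)
```

For example, `default_targets([0.5, 0.5], [1, -1], 0.2)` yields `[0.6, 0.4]`.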
5.3 Engineered target functions
With the default targets, for all values of $\hat{A}_i$, including small values, the corresponding target either equals $(1 + \epsilon)\,\pi_{\theta_k}(a_i \mid s_i)$ or $(1 - \epsilon)\,\pi_{\theta_k}(a_i \mid s_i)$. This is counterintuitive, as we would expect the ratio $\pi_{target}(a_i \mid s_i)/\pi_{\theta_k}(a_i \mid s_i)$ to be close to 1 for small values of $|\hat{A}_i|$. Motivated by the methodology in Section 4 and by the default PPO-inspired targets, we engineer three classes of alternative target functions with the properties: (i) the target equals $\pi_{\theta_k}(a_i \mid s_i)$ when $\hat{A}_i = 0$; (ii) the target is a non-decreasing function of $\hat{A}_i$. All three classes of target functions outperform PPO on the MuJoCo domain, demonstrating the robustness of SPU with respect to how the target is computed. Below are the exact forms of these functions and their respective plots.
Target Function 1:
where is clipped such that .
Target Function 2:
Target Function 3:
where is a tunable hyperparameter.
6 Experimental Results
We tested each algorithm on seven MuJoCo [22] simulated robotics tasks implemented in OpenAI Gym [23]. We only compare the performance of our algorithms against PPO, since PPO has become a popular baseline. Thus, our results here can be extrapolated to get a rough estimate of how our algorithms perform against other recent algorithms such as TRPO and A3C [1, 24]. The performance of PPO is obtained by running the publicly available OpenAI implementation [25]. Except for the hyperparameters in the target functions, we used the same hyperparameter values as the PPO paper [4] and kept them fixed across different algorithmic settings.
Table 1: Improvement over PPO for each class of target functions.

Target Function                          Improvement over PPO
default targets                          8%
default targets with gradient clipping   12%
Target Function 1                        16%
Target Function 2                        17%
Target Function 3                        %
As in [4], each run of an algorithm in one environment is trained for one million timesteps and is scored by averaging the episodic reward over the last 100 episodes. We set the score of the random policy to 0 and use the performance of the random policy to measure the performance improvement of an algorithm over PPO. The relative performance in the 7 environments is averaged to produce a single scalar that represents the overall score for one algorithm. For each environment, the relative performance is also averaged over 5 different starting seeds. The source code will be released after the blind review process.
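One plausible reading of this scoring rule (ours; the paper does not spell out the exact formula) is to normalize each algorithm's score by the random-policy baseline before comparing to PPO, then average across environments:

```python
def relative_improvement(alg_scores, ppo_scores, random_scores):
    # Per-environment improvement over PPO, with the random policy
    # pinned to 0, averaged into a single scalar (fractional improvement).
    rel = [(a - r) / (p - r) - 1.0
           for a, p, r in zip(alg_scores, ppo_scores, random_scores)]
    return sum(rel) / len(rel)
```

Under this reading, an algorithm scoring 20 where PPO scores 10 and the random policy 0 shows a 100% improvement in that environment.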
As shown in Table 1, SPU with Target Function 3 provides a substantial improvement over PPO. Not only is the final average reward of SPU better than that of PPO, SPU also has higher sample efficiency, as measured by the number of timesteps taken to reach a particular performance level. Figure 2 illustrates that Target Function 1 consistently achieves higher reward than PPO in the latter half of training in 6 out of the 7 MuJoCo environments. PPO makes 10 passes (epochs) through the samples to update the policy. To ensure a fair computational comparison, for all 5 target functions, SPU also makes 10 passes through the samples.
As shown in Figure 3, if we increase the number of passes to 20 (but not the number of samples taken), the improvement of SPU with Target Function 1 climbs further, while PPO with 20 passes performs only slightly better than PPO with 10 passes. Note that performance does not improve for all environments when increasing the number of passes. For PPO, performance actually declines significantly for three of the seven environments, whereas for SPU it declines for only one environment. By limiting the number of epochs in SPU (early stopping), we prevent overfitting to the targets. For completeness and reproducibility, we discuss the implementation details and list hyperparameter values in the Appendix.
7 Conclusion
We developed a novel policy-space methodology, which can be used to compare and contrast various sample-efficient reinforcement learning algorithms, including PPO and different versions of TRPO. The methodology can also be used to study many other forms of constraints, such as constraining the aggregated and disaggregated reverse KL divergence. We also proposed a new sample-efficient class of algorithms called SPU, for which there is significant flexibility in how we set the targets.
Compared to PPO, our experimental results show that SPU with simple target functions can lead to improved sample-efficiency performance without increasing wall-clock time. In the future, it may be possible to achieve further gains with yet-to-be-explored classes of target functions, by annealing the targets, and by changing the number of passes through the data.
8 Acknowledgements
We would like to acknowledge the extremely helpful support by the NYU Shanghai High Performance Computing Administrator Zhiguo Qi.
Appendix A Solving for the optimal policy in the disaggregatedconstraints TRPO problem (Theorem 2)
Appendix B TRPOInspired Targets with Aggregated Constraint
In this section, we outline how specific targets can be obtained for the case of TRPO with the aggregated constraint. The TRPO problem in the policy space (12)-(13) can be rewritten as:
(40)  
subject to  (41) 
Because we do not have an estimate of $A^{\pi_{\theta_k}}(s, a)$ for all $s$ and $a$ (and because the state space is usually huge for deep reinforcement learning problems), we cannot easily calculate the expectations in (40)-(41). We instead use the samples $(s_i, a_i, \hat{A}_i)$, $i = 1, \ldots, m$, from $\pi_{\theta_k}$ to approximate the expectations with their sample averages, resulting in the following optimization problem:
subject to 
where , is the number of sampled trajectories, and is the trajectory timestep number of the th sample. Letting , , then the above optimization problem becomes
subject to  
After we solve the above optimization problem for an optimal solution , we set the targets
Solving the above optimization problem is a topic for further research. We conjecture that it can be solved as quickly as the conjugate gradient method in TRPO. One possible approach is to first fix an allocation of the KL budget across the sampled states and solve the following disaggregated problem for each state:
subject to  
Each of these problems is the disaggregated TRPO problem, which can be rapidly solved, as discussed in Section 5.1. Letting $V_i$ denote the optimal value of the $i$-th disaggregated problem, the resulting optimization problem becomes maximizing the sum of the $V_i$ over the budget allocation. This problem can then be solved in a hierarchical manner.
Appendix C Performance Graph For Target Function 2 and Target Function 3
Appendix D Implementation Details and Hyperparameters
As in [4], the policy is parameterized by a fully-connected feedforward neural network with two hidden layers, each with 64 units and tanh nonlinearities. The policy outputs the mean of a Gaussian distribution with state-independent variable standard deviations, following [1, 26]. The action dimensions are assumed to be independent. The probability of an action is given by the multivariate Gaussian probability density function. The baseline used in the advantage value calculation is parameterized by a similarly sized neural network, trained to minimize the MSE between the sampled states' TD returns and their predicted values. To calculate the advantage values, we use Generalized Advantage Estimation [27]. States are normalized by subtracting the running mean and dividing by the running standard deviation before being fed to any neural network. The advantage values are normalized by subtracting the batch mean and dividing by the batch standard deviation before being used for the policy update.
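The running state normalization described above can be implemented with Welford's online algorithm for the mean and variance (our sketch; the class name is ours):

```python
import numpy as np

class RunningNorm:
    # Normalizes inputs by subtracting a running mean and dividing by a
    # running standard deviation, as done for states before the networks.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's online algorithm: numerically stable running moments.
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) if self.n > 1 else 1.0
        return (x - self.mean) / (std + 1e-8)   # epsilon avoids divide-by-zero
```

Batch normalization of the advantages is the same computation applied once per batch rather than online.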
Parameters                               Value
Number of timesteps                      1e6
Seed                                     0-4
Optimizer                                Adam
Optimizer learning rate                  3e-4
Optimizer learning rate anneal schedule  Linearly to 0
Optimizer Adam epsilon                   1e-5
Timesteps per batch                      2048
Number of full passes                    10
Minibatch size                           64
Discount (gamma)                         0.99
GAE parameter (lambda)                   0.95
Target Functions                         Hyperparameter value
default targets                          0.32
default targets with gradient clipping   0.48
Target Function 1                        0.84
Target Function 1, 20 passes             0.48
Target Function 2                        2.19
Target Function 3                        0.8,
PPO with 20 passes                       0.12
References
 [1] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [2] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 [3] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5285–5294, 2017.
 [4] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [5] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
 [6] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
 [7] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 [8] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [9] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
 [10] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 [11] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the 29th International Conference on Machine Learning, pages 179–186, 2012.
 [12] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
 [13] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
 [14] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
 [15] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 [16] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
 [17] Joshua Achiam. Advanced policy gradient methods. http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf. Accessed: 20180524.
 [18] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. CoRR, abs/1611.01224, 2016.
 [19] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
 [20] John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017.
 [21] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 [22] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
 [23] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.
 [24] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
 [25] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
 [26] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016.
 [27] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.