Multi-Pass Q-Networks for Deep Reinforcement Learning with Parameterised Action Spaces
Parameterised actions in reinforcement learning are composed of discrete actions with continuous action-parameters. This provides a framework for solving complex domains that require combining high-level actions with flexible control. The recent P-DQN algorithm extends deep Q-networks to learn over such action spaces. However, it treats all action-parameters as a single joint input to the Q-network, invalidating its theoretical foundations. We analyse the issues with this approach and propose a novel method—multi-pass deep Q-networks, or MP-DQN—to address them. We empirically demonstrate that MP-DQN significantly outperforms P-DQN and other previous algorithms in terms of data efficiency and converged policy performance on the Platform, Robot Soccer Goal, and Half Field Offense domains.
Reinforcement learning (RL) and deep RL in particular have demonstrated remarkable success in solving tasks that require either discrete actions, such as Atari [Mnih et al., 2015], or continuous actions, such as robot control [Schulman et al., 2015; Lillicrap et al., 2016]. Reinforcement learning with parameterised actions [Masson et al., 2016] that combine discrete actions with continuous action-parameters has recently emerged as an additional setting of interest, allowing agents to learn flexible behavior in tasks such as 2D robot soccer [Hausknecht and Stone, 2016a; Hussein et al., 2018], simulated human-robot interaction [Khamassi et al., 2017], and terrain-adaptive bipedal and quadrupedal locomotion [Peng et al., 2016].
There are two main approaches to learning with parameterised actions: alternate between optimising the discrete actions and continuous action-parameters separately [Masson et al., 2016; Khamassi et al., 2017], or collapse the parameterised action space into a continuous one [Hausknecht and Stone, 2016a]. Both of these approaches fail to fully exploit the structure present in parameterised action problems. The former does not share information between the action and action-parameter policies, while the latter does not take into account which action-parameter is associated with which action, or even which discrete action is executed by the agent. More recently, Xiong et al. [2018] introduced P-DQN, a method for learning behaviours directly in the parameterised action space. This leverages the distinct nature of the action space and is the current state-of-the-art algorithm on 2D robot soccer and King of Glory, a multiplayer online battle arena game. However, the formulation of the approach is flawed due to the dependence of the discrete action values on all action-parameters, not only those associated with each action. In this paper, we show how the above issue leads to suboptimal decision-making. We then introduce a novel multi-pass method to separate action-parameters, and demonstrate that the resulting algorithm, MP-DQN, outperforms existing methods on the Platform, Robot Soccer Goal, and Half Field Offense domains.
Parameterised action spaces [Masson et al., 2016] consist of a set of discrete actions, $\mathcal{A}_d = [K] = \{k_1, k_2, \ldots, k_K\}$, where each $k$ has a corresponding continuous action-parameter $x_k \in \mathcal{X}_k \subseteq \mathbb{R}^{m_k}$ with dimensionality $m_k$. This can be written as
$$\mathcal{A} = \bigcup_{k \in [K]} \{ a_k = (k, x_k) \mid x_k \in \mathcal{X}_k \}. \tag{1}$$
We consider environments modelled as a Parameterised Action Markov Decision Process (PAMDP) [Masson et al., 2016]. For a PAMDP $M = (S, \mathcal{A}, P, R, \gamma)$: $S$ is the set of all states, $\mathcal{A}$ is the parameterised action space, $P(s' \mid s, k, x_k)$ is the Markov state transition probability function, $R(s, k, x_k, s')$ is the reward function, and $\gamma$ is the future reward discount factor. An action policy $\pi : S \to \mathcal{A}$ maps states to actions, typically with the aim of maximising Q-values $Q(s, a)$, which give the expected discounted return of executing action $a$ in state $s$ and following the current policy thereafter.
The Q-PAMDP algorithm [Masson et al., 2016] alternates between learning a discrete action policy with fixed action-parameters using Sarsa($\lambda$) [Sutton and Barto, 1998] with the Fourier basis [Konidaris et al., 2011] and optimising the continuous action-parameters using episodic Natural Actor Critic (eNAC) [Peters and Schaal, 2008] while the discrete action policy is kept fixed. Hausknecht and Stone [2016a] apply artificial neural networks and the Deep Deterministic Policy Gradients (DDPG) algorithm [Lillicrap et al., 2016] to parameterised action spaces by treating both the discrete actions and their action-parameters as a joint continuous action vector. This can be seen as relaxing the parameterised action space (Equation 1) into a continuous one:
$$\mathcal{A} = \{ (f_{1:K}, x_{1:K}) \mid f_k \in \mathbb{R},\ x_k \in \mathcal{X}_k \ \forall k \in [K] \}, \tag{2}$$
where $f_1, f_2, \ldots, f_K$ are continuous values, one per discrete action. An $\epsilon$-greedy or softmax policy is then used to select discrete actions. However, not only does this fail to exploit the disjoint nature of different parameterised actions, but optimising over the joint action and action-parameter space can result in premature convergence to suboptimal policies, as occurred in experiments by Masson et al. [2016]. We henceforth refer to the algorithm used by Hausknecht and Stone as PA-DDPG.
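As a rough illustration of this relaxation, the sketch below decodes a joint continuous vector into a parameterised action with an $\epsilon$-greedy choice over the discrete action values. The function name `decode_joint_action`, the flat vector layout, and the example sizes are assumptions made for illustration; this is not the PA-DDPG implementation.

```python
import random

def decode_joint_action(joint, param_sizes, epsilon=0.0, rng=random):
    """Pick a discrete action and its parameters from a joint continuous vector.

    joint       -- flat list [f_1..f_K, x_1..., ..., x_K...] (assumed layout)
    param_sizes -- dimensionality m_k of each action-parameter
    epsilon     -- probability of choosing a uniformly random discrete action
    """
    K = len(param_sizes)
    f, flat_params = joint[:K], joint[K:]
    if rng.random() < epsilon:
        k = rng.randrange(K)                    # explore
    else:
        k = max(range(K), key=lambda i: f[i])   # greedy over action values
    start = sum(param_sizes[:k])
    x_k = flat_params[start:start + param_sizes[k]]
    return k, x_k

# Three hypothetical actions with 1, 2 and 1 parameter dimensions.
action, params = decode_joint_action(
    [0.1, 0.9, -0.3, 0.5, 0.2, 0.7, -0.4], [1, 2, 1])
```

Note that the unselected action-parameters are simply discarded at execution time, yet the learner still optimises over all of them jointly, which is the structural weakness described above.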
2.1 Parameterised Deep Q-Networks
Unlike previous approaches, Xiong et al. [2018] introduce a method that operates in the parameterised action space directly by combining DQN and DDPG. Their P-DQN algorithm achieves state-of-the-art performance using a Q-network to approximate Q-values used for discrete action selection, in addition to providing critic gradients for an actor network that determines the continuous action-parameter values for all actions. By framing the problem as a PAMDP directly, rather than alternating between discrete and continuous action MDPs as with Q-PAMDP, or using a joint continuous action MDP as with PA-DDPG, P-DQN necessitates a change to the Bellman equation to incorporate continuous action-parameters:
$$Q(s, k, x_k) = \mathbb{E}_{r, s'}\Big[ r + \gamma \max_{k'} \sup_{x_{k'} \in \mathcal{X}_{k'}} Q(s', k', x_{k'}) \,\Big|\, s, k, x_k \Big]. \tag{3}$$
To avoid the computationally intractable calculation of the supremum over $\mathcal{X}_k$, Xiong et al. [2018] state that when the function $Q$ is fixed, one can view $\arg\sup_{x_k \in \mathcal{X}_k} Q(s, k, x_k)$ as a function $x_k^Q : S \to \mathcal{X}_k$ for any state $s \in S$ and $k \in [K]$. This allows the Bellman equation to be rewritten as:
$$Q(s, k, x_k) = \mathbb{E}_{r, s'}\Big[ r + \gamma \max_{k'} Q\big(s', k', x_{k'}^Q(s')\big) \,\Big|\, s, k, x_k \Big]. \tag{4}$$
P-DQN uses a deep neural network with parameters $\theta_Q$ to represent $Q(s, k, x_k; \theta_Q)$, and a second deterministic actor network with parameters $\theta_x$ to represent the action-parameter policy $x_k(s; \theta_x)$, an approximation of $x_k^Q(s)$. With this formulation it is easy to apply the standard DQN approach of minimising the mean-squared Bellman error to update the Q-network using minibatches sampled from replay memory $D$ [Mnih et al., 2015], replacing $x_k^Q(s)$ with $x_k(s; \theta_x)$:
$$L_Q(\theta_Q) = \mathbb{E}_{(s, k, x_k, r, s') \sim D}\Big[ \tfrac{1}{2} \big( y - Q(s, k, x_k; \theta_Q) \big)^2 \Big], \tag{5}$$
where $y = r + \gamma \max_{k' \in [K]} Q\big(s', k', x_{k'}(s'; \theta_x); \theta_Q\big)$ is the update target derived from Equation (4). Then, the loss for the actor network in P-DQN is given by the negative sum of Q-values:
$$L_x(\theta_x) = \mathbb{E}_{s \sim D}\Big[ -\sum_{k=1}^{K} Q\big(s, k, x_k(s; \theta_x); \theta_Q\big) \Big]. \tag{6}$$
Although this choice of loss function was not motivated by Xiong et al. [2018], it resembles the deterministic policy gradient loss used by PA-DDPG where a scalar critic value is used over all action-parameters [Hausknecht and Stone, 2016a]. During updates, the estimated Q-values are backpropagated through the critic to the actor, producing gradients indicating how the action-parameters should be updated to increase the Q-values.
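To make the two losses concrete, the following minimal sketch computes the update target, the squared Bellman error, and the actor loss for a single sample from already-evaluated Q-values. The function names and the `terminal` flag (no bootstrapping at the end of an episode, as is usual for DQN) are assumptions for illustration rather than details taken from P-DQN.

```python
def q_target(reward, next_q_values, gamma, terminal=False):
    """Update target y = r + gamma * max_k' Q(s', k', x_k'(s')).

    next_q_values holds one Q-value per discrete action for the next state.
    Skipping the bootstrap term on terminal transitions is an assumption.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

def q_loss(q_taken, target):
    """Half squared Bellman error for the action that was actually executed."""
    return 0.5 * (target - q_taken) ** 2

def actor_loss(q_values):
    """Negative sum of Q-values over all discrete actions for one state."""
    return -sum(q_values)

y = q_target(reward=1.0, next_q_values=[0.5, 2.0, 1.0], gamma=0.9)
critic_error = q_loss(q_taken=2.0, target=y)
policy_error = actor_loss([1.0, 2.0, 3.0])
```

In a full implementation both losses would be averaged over a minibatch and differentiated automatically; only the scalar bookkeeping is shown here.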
3 Problems with Joint Action-Parameters
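The P-DQN architecture inputs the joint action-parameter vector over all actions to the Q-network, as illustrated in Figure 1. This was pointed out by Xiong et al. [2018] but they did not discuss it further. While this may seem like an inconsequential implementation detail, it changes the formulation of the Bellman equation used for parameterised actions (Equation 4) since each Q-value is a function of the joint action-parameter vector $\mathbf{x} = (x_1, \ldots, x_K)$, rather than only the action-parameter $x_k$ corresponding to the associated action:
$$Q(s, k, \mathbf{x}) = \mathbb{E}_{r, s'}\Big[ r + \gamma \max_{k'} Q\big(s', k', \mathbf{x}^Q(s')\big) \,\Big|\, s, k, \mathbf{x} \Big]. \tag{7}$$
This in turn affects both the updates to the Q-values and the action-parameters. Firstly, we consider the effect on the action-parameter loss, specifically that each Q-value produces gradients for all action-parameters. Consider for demonstration purposes the action-parameter loss (Equation 6) over a single sample with state $s$:
$$L_x(\theta_x) = -\sum_{k=1}^{K} Q\big(s, k, \mathbf{x}(s; \theta_x); \theta_Q\big). \tag{8}$$
The policy gradient is then given by:
$$\nabla_{\theta_x} L_x(\theta_x) = -\sum_{k=1}^{K} \nabla_{\mathbf{x}} Q\big(s, k, \mathbf{x}(s; \theta_x); \theta_Q\big) \nabla_{\theta_x} \mathbf{x}(s; \theta_x). \tag{9}$$
Expanding the gradients with respect to the action-parameters gives
$$\nabla_{\mathbf{x}} Q_k = \Big( \frac{\partial Q_k}{\partial x_1}, \frac{\partial Q_k}{\partial x_2}, \ldots, \frac{\partial Q_k}{\partial x_K} \Big), \tag{10}$$
where $Q_k = Q\big(s, k, \mathbf{x}(s; \theta_x); \theta_Q\big)$. Theoretically, if each Q-value were a function of just $x_k$ as the P-DQN formulation intended, then $\partial Q_k / \partial x_j = 0$ for all $j \neq k$ and $\nabla_{\mathbf{x}} Q_k$ simplifies to:
$$\nabla_{\mathbf{x}} Q_k = \Big( 0, \ldots, 0, \frac{\partial Q_k}{\partial x_k}, 0, \ldots, 0 \Big). \tag{11}$$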
However this is not the case in P-DQN, so the gradients with respect to other action-parameters are not zero in general. This is a problem because each Q-value is updated only when its corresponding action is sampled, as per Equation 5, and thus has no information on what effect other action-parameters have on transitions or how they should be updated to maximise the expected return. They therefore produce what we term false gradients. This effect may be mitigated by the summation over all Q-values in the action-parameter loss, since the gradients from each Q-value are summed and averaged over a minibatch.
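The effect can be checked numerically on a toy example. The sketch below builds a tiny Q-network with hand-picked weights (entirely hypothetical, chosen only for illustration) that takes the state and the joint action-parameter vector as input, and estimates the cross-derivative of the first Q-value with respect to the second action's parameter by central finite differences.

```python
import math

# Hypothetical fixed weights for a tiny Q-network with K = 2 actions,
# a scalar state, and one scalar action-parameter per action.
W_HIDDEN = [[0.5, -0.3, 0.8], [0.2, 0.7, -0.6]]  # columns: s, x_1, x_2
W_OUT = [[1.0, -0.5], [0.4, 0.9]]                # rows: Q_1, Q_2

def q_values(s, x):
    """Forward pass with the joint action-parameter vector x as input."""
    inputs = [s] + list(x)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs)))
              for row in W_HIDDEN]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]

def dq_dx(s, x, k, j, eps=1e-6):
    """Central finite-difference estimate of dQ_k / dx_j."""
    hi = list(x)
    lo = list(x)
    hi[j] += eps
    lo[j] -= eps
    return (q_values(s, hi)[k] - q_values(s, lo)[k]) / (2 * eps)

# Gradient of Q_1 with respect to the parameter of action 2.
false_gradient = dq_dx(0.3, [0.1, -0.2], k=0, j=1)
```

Because every hidden unit mixes all inputs, `false_gradient` is non-zero here even though the second action-parameter has no bearing on the first action.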
The dependence of Q-values on all action-parameters also negatively affects the discrete action policy. Specifically, updating the continuous action-parameter policy of any action perturbs the Q-values of all actions, not just the one associated with that action-parameter. This can lead to the relative ordering of Q-values changing, which in turn can result in suboptimal greedy action selection. We demonstrate a situation where this occurs on the Platform domain in Figure 2.
4 Multi-Pass Q-Networks
The naïve solution to the problem of joint action-parameter inputs in P-DQN would be to split the Q-network into separate networks for each discrete action. Then, one can input only the state and relevant action-parameter $x_k$ to the network corresponding to $Q_k$. However, this drastically increases the computational and space complexity of the algorithm due to the duplication of network parameters for each action. Furthermore, the loss of the shared feature representation between Q-values may be detrimental.
We therefore consider an alternative approach that does not involve architectural changes to the network structure of P-DQN. While separating the action-parameters in a single forward pass of a single Q-network with fully connected layers is impossible, we can do so with multiple passes. We perform a forward pass once per action $k$ with the state $s$ and action-parameter vector $\mathbf{x}e_k = (0, \ldots, 0, x_k, 0, \ldots, 0)$ as input, where $e_k$ is the standard basis vector for dimension $k$. Thus $\mathbf{x}e_k$ is the joint action-parameter vector where each $x_j$, $j \neq k$, is set to zero. This causes all false gradients to be zero, $\partial Q_k / \partial x_j = 0$, and completely negates the impact of the network weights for unassociated action-parameters $x_j$ from the input layer, making $Q_k$ only depend on $x_k$. That is,
$$Q(s, k, \mathbf{x}e_k) \simeq Q(s, k, x_k). \tag{12}$$
Both problems are therefore addressed without introducing any additional neural network parameters. We refer to this as the multi-pass Q-network method, or MP-DQN.
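A total of $K$ forward passes are required to predict all Q-values instead of one. However, we can make use of the parallel minibatch processing capabilities of artificial neural networks, provided by libraries such as PyTorch and TensorFlow, to perform this in a single parallel pass, or multi-pass. A multi-pass with $K$ actions is processed in the same manner as a minibatch of size $K$:
$$\begin{pmatrix} Q(s, \cdot, \mathbf{x}e_1; \theta_Q) \\ \vdots \\ Q(s, \cdot, \mathbf{x}e_K; \theta_Q) \end{pmatrix} = \begin{pmatrix} Q_{11} & \cdots & Q_{1K} \\ \vdots & \ddots & \vdots \\ Q_{K1} & \cdots & Q_{KK} \end{pmatrix}, \tag{13}$$
where $Q_{ij}$ is the Q-value for action $j$ generated on the $i$th pass where $x_i$ is non-zero. Only the diagonal elements $Q_{ii}$ are valid and used in the final output $Q_i \leftarrow Q_{ii}$. This process is illustrated in Figure 3.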
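The multi-pass can be sketched in a few lines of plain Python. The helper names `multi_pass_inputs` and `multi_pass_q`, the flat layout of the joint action-parameter vector, and the stand-in `toy_q_network` are assumptions for illustration; a tensor-based implementation would typically build the same rows with masking operations rather than loops.

```python
def multi_pass_inputs(state, params, sizes):
    """Build one input row per action: the state followed by the joint
    action-parameter vector with every block except the k-th set to zero."""
    offsets = [sum(sizes[:k]) for k in range(len(sizes))]
    rows = []
    for start, m in zip(offsets, sizes):
        masked = [0.0] * len(params)
        masked[start:start + m] = params[start:start + m]
        rows.append(list(state) + masked)
    return rows

def multi_pass_q(q_network, state, params, sizes):
    """Evaluate the K rows as a single minibatch and keep the diagonal Q_kk."""
    q_matrix = q_network(multi_pass_inputs(state, params, sizes))  # K x K
    return [q_matrix[k][k] for k in range(len(sizes))]

def toy_q_network(batch):
    # Hypothetical stand-in for a trained network with K = 2 outputs:
    # Q_k is (k + 1) times the sum of the input row.
    return [[(k + 1) * sum(row) for k in range(2)] for row in batch]

rows = multi_pass_inputs([1.0], [0.5, 2.0, 3.0], sizes=[1, 2])
q = multi_pass_q(toy_q_network, [1.0], [0.5, 2.0, 3.0], sizes=[1, 2])
```

Each returned Q-value is computed from an input in which only its own action-parameter block is non-zero, so changing another action's parameters cannot alter it.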
Compared to separate Q-networks, our multi-pass technique introduces a relatively minor amount of overhead during forward passes. Although minibatches for updates are similarly duplicated $K$ times, backward passes to accumulate gradients are not duplicated since only the diagonal elements are used in the loss function. The computational complexity of this overhead scales linearly with the number of actions and minibatch size during updates. Unlike with separate Q-networks, the overhead does not depend on network size: if the number of actions does not change, a larger Q-network with more hidden layers and neurons incurs the same multi-pass overhead as a smaller one, provided the minibatch is of a reasonable size and can be processed in parallel.
We compare the original P-DQN algorithm with a single Q-network against our proposed multi-pass Q-network (MP-DQN), as well as against separate Q-networks (SP-DQN). We also compare against Q-PAMDP and PA-DDPG, the former state-of-the-art approaches on their respective domains. We are unable to use King of Glory as a benchmark domain as it is closed-source and proprietary.
Similar to Mnih et al. [2015] and Hausknecht and Stone [2016a], we add target networks to P-DQN to compute the update targets for stability. Soft updates (Polyak averaging) are used for the target networks. Adam [Kingma and Ba, 2014] with is used to optimise the neural network parameters for P-DQN and PA-DDPG. Layer weights are initialised following the strategy of He et al. [2015] with rectified linear unit (ReLU) activation functions. We employ the inverting gradients approach to bound action-parameters for both algorithms, as Hausknecht and Stone [2016a] claim PA-DDPG is unable to learn without it on Half Field Offense. Action-parameters are scaled to , as we found this increased performance for all algorithms.
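Two of the mechanisms mentioned above are simple enough to sketch directly. The snippet shows a soft (Polyak) target update and the inverting gradients rule of Hausknecht and Stone, which scales a gradient by the remaining distance to the bound it pushes towards. The function names, the flat-list representation of network weights, and the example bounds are assumptions for illustration only.

```python
def soft_update(target, source, tau):
    """Polyak averaging: move each target weight a fraction tau towards
    the corresponding online weight."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]

def invert_gradient(grad, value, low, high):
    """Inverting gradients for one bounded action-parameter.

    grad is assumed to be the ascent direction dQ/dp: a positive value
    asks for the parameter to increase. The gradient shrinks as the
    parameter nears the bound it is moving towards and flips sign if the
    parameter has already crossed that bound.
    """
    width = high - low
    if grad > 0:
        return grad * (high - value) / width
    return grad * (value - low) / width

new_target = soft_update([0.0, 1.0], [1.0, 0.0], tau=0.1)
g_up = invert_gradient(2.0, 0.5, low=-1.0, high=1.0)
g_down = invert_gradient(-2.0, 0.5, low=-1.0, high=1.0)
g_at_bound = invert_gradient(1.0, 1.0, low=-1.0, high=1.0)
```

A gradient pushing a parameter that already sits on its upper bound is scaled to zero, which keeps the actor's outputs inside the permitted range without a squashing activation.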
We perform a hyperparameter grid search for Platform and Robot Soccer Goal over: the network learning rates ; Polyak averaging factors ; minibatch size ; and number of hidden layers and neurons in . The hidden layers are kept symmetric between the actor and critic networks as in previous works. Each combination is tested over random runs for P-DQN and PA-DDPG separately on each domain. The same hyperparameters are used for P-DQN, SP-DQN and MP-DQN.
To keep the comparison with PA-DDPG fair, we do not use dueling networks [Wang et al., 2016] nor asynchronous parallel workers as Xiong et al. [2018] used for P-DQN. For each algorithm and domain, we train agents with unique random seeds and evaluate them without exploration for another episodes. Our experiments are implemented in Python using PyTorch [Paszke et al., 2017] and OpenAI Gym [Brockman et al., 2016], and run on the following hardware: Intel Core i7-7700, 16GB DRAM, NVidia GTX 1060 GPU. Complete source code is available online at https://github.com/cycraig/MP-DQN.
5.1 Platform
The Platform domain [Masson et al., 2016] has three actions (run, hop, and leap), each with a continuous action-parameter to control horizontal displacement. The agent has to hop over enemies and leap across gaps between platforms to reach the goal state. The agent dies if it touches an enemy or falls into a gap. A -dimensional state space gives the position and velocity of the agent and local enemy along with features of the current platform such as length.
We train agents on this domain for episodes, using the same hyperparameters for Q-PAMDP as Masson et al. [2016], except we reduce the learning rate for eNAC () to , and exploration noise variance () to , to account for the scaled action-parameters. For P-DQN, shallow networks with one hidden layer were found to perform best with , , , , and . PA-DDPG uses two hidden layers with , , , , and . A replay memory size of samples is used for both algorithms, update gradients are clipped at , and .
We introduce a passthrough layer to the actor networks of P-DQN and PA-DDPG to initialise their action-parameter policies to the same linear combination of state variables that Masson et al. [2016] use to initialise the Q-PAMDP policy. The weights of the passthrough layer are kept fixed to avoid instability; this does not reduce the range of action-parameters available as the output of the actor network compensates before inverting gradients are applied. We use an $\epsilon$-greedy discrete action policy with additive Ornstein-Uhlenbeck noise for action-parameter exploration, similar to Lillicrap et al. [2016], which we found gives slightly better performance than Gaussian noise.
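For readers unfamiliar with it, Ornstein-Uhlenbeck noise is temporally correlated: each sample drifts back towards a mean while being perturbed by Gaussian noise. The sketch below is a minimal discrete-time version; the class name, the `theta` and `sigma` defaults, and the clipping in `explore` are illustrative assumptions, not the settings used in these experiments.

```python
import random

class OrnsteinUhlenbeckNoise:
    """Discrete-time Ornstein-Uhlenbeck process with unit time step."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.size, self.mu, self.theta, self.sigma = size, mu, theta, sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Typically called at the start of each episode.
        self.state = [self.mu] * self.size

    def sample(self):
        # Mean-reverting drift plus a Gaussian perturbation.
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

def explore(action_params, noise, low, high):
    """Add noise to the actor's output and clip to the assumed bounds."""
    noisy = [p + n for p, n in zip(action_params, noise.sample())]
    return [min(high, max(low, v)) for v in noisy]

noise = OrnsteinUhlenbeckNoise(size=2, seed=0)
perturbed = explore([0.3, -0.9], noise, low=-1.0, high=1.0)
```

Because successive samples are correlated, the perturbation tends to push an action-parameter in the same direction for several steps, which can help exploration in domains with momentum.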
5.2 Robot Soccer Goal
The Robot Soccer Goal domain [Masson et al., 2016] is a simplification of RoboCup 2D [Kitano et al., 1997] in which an agent has to score a goal past a keeper that tries to intercept the ball. The three parameterised actions (kick-to, shoot-goal-left, and shoot-goal-right) are all related to kicking the ball, which the agent automatically approaches between actions until close enough to kick again. The state space consists of continuous features describing the position, velocity, and orientation of the agent and keeper, and the ball’s position and distance to the keeper and goal.
Training consisted of episodes, using the same hyperparameters for Q-PAMDP as Masson et al. [2016] except we set and . P-DQN uses a single hidden layer , with , , , , and . Two hidden layers are used for PA-DDPG, with , , , , and . Both algorithms use a replay memory size of , , gradient clipping at , and the same action-parameter policy initialisation as Q-PAMDP with additive Ornstein-Uhlenbeck noise.
| | Platform | Robot Soccer Goal | Half Field Offense | |
| | Return | P(Goal) | P(Goal) | Avg. Steps to Goal |
| PA-DDPG (average over runs [Hausknecht and Stone, 2016a]) | - | - |
| Async. P-DQN (average over runs with workers [Xiong et al., 2018]) | - | - |
5.3 Half Field Offense
The third and final domain, Half Field Offense (HFO) [Hausknecht and Stone, 2016a], is also the most complex. It has state features and three parameterised actions available: dash, turn, and kick. Unlike Robot Soccer Goal, the agent must first learn to approach the ball and then kick it into the goal, although there is no keeper in this task.
We use episodes for training on HFO. This is more than the episodes (or roughly million transitions) used by Hausknecht and Stone [2016a] and Xiong et al. [2018] so that ample opportunity is given for the algorithms to converge in order to fairly evaluate the final policy performance. We use the same network structure as previous works with hidden layers of neurons for P-DQN and neurons for PA-DDPG. The leaky ReLU activation function with negative slope is used on HFO because of these deeper networks. Xiong et al. [2018] use asynchronous parallel workers for $n$-step returns on HFO. For fair comparison and due to the lack of sufficient hardware, we instead use mixed $n$-step return targets [Hausknecht and Stone, 2016b] with a mixing ratio of for both P-DQN and PA-DDPG, as this technique does not require multiple workers. The value was selected after a search over . We otherwise use the same hyperparameters as Hausknecht and Stone [2016b] apart from the network learning rates: , for P-DQN and , for PA-DDPG. In the absence of an initial action-parameter policy, we use the same $\epsilon$-greedy with uniform random action-parameter exploration strategy as the original authors. In general we kept as many factors consistent between the two algorithms as possible for a fair comparison.
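The idea behind mixed return targets can be sketched as a convex combination of an on-policy return and the usual bootstrapped one-step target. In the snippet below, using the discounted return to the end of the episode as the multi-step component, as well as the names `discounted_returns`, `mixed_target`, and `beta`, are assumptions for illustration; the numeric values are arbitrary.

```python
def discounted_returns(rewards, gamma):
    """On-policy discounted return G_t for every step of a finished episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def mixed_target(one_step_target, on_policy_return, beta):
    """Blend the on-policy return with the bootstrapped one-step target.

    beta = 0 recovers the standard one-step target; beta = 1 uses only
    the observed return.
    """
    return beta * on_policy_return + (1.0 - beta) * one_step_target

episode_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
y = mixed_target(one_step_target=1.0, on_policy_return=3.0, beta=0.25)
```

Since the returns are computed from a single stored episode, this kind of target needs no parallel workers, at the cost of waiting until an episode ends before its transitions can be given targets.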
We select of the most relevant state features for Q-PAMDP to avoid intractable Fourier basis calculations. These features include: player orientation, stamina, proximity to ball, ball angle, ball-kickable, goal centre position, and goal centre proximity. Even with this reduced selection, we found at most a Fourier basis of order could be used. We use an adaptive step-size [Dabney and Barto, 2012] for Sarsa($\lambda$) with an eNAC learning rate of . The Q-PAMDP agent initially learns with Sarsa($\lambda$) for a period of episodes before alternating between eNAC updates of rollouts each, and episodes of discrete action re-exploration.
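The reason feature selection is needed becomes clear from the size of the full Fourier basis, which has $(n + 1)^d$ terms for order $n$ and $d$ state features. The sketch below counts the terms and evaluates the basis for a small state; the function names and example dimensions are chosen for illustration and are not the values used in these experiments.

```python
import itertools
import math

def fourier_basis_size(order, dims):
    """Number of terms in the full (coupled) Fourier basis."""
    return (order + 1) ** dims

def fourier_features(state, order):
    """Evaluate cos(pi * c . s) for every integer coefficient vector c
    with entries in 0..order. State features are assumed scaled to [0, 1]."""
    coeffs = itertools.product(range(order + 1), repeat=len(state))
    return [math.cos(math.pi * sum(c * s for c, s in zip(coeff, state)))
            for coeff in coeffs]

small = fourier_features([0.2, 0.7], order=3)
sizes = {d: fourier_basis_size(2, d) for d in (5, 10, 20)}
```

Even at order 2 the count grows from 243 terms for 5 features to 59049 for 10 and roughly 3.5 billion for 20, so both the number of features and the basis order have to stay small.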
The resulting learning curves of MP-DQN, SP-DQN, P-DQN, PA-DDPG, and Q-PAMDP on the three parameterised action benchmark domains are shown in Figure 4, with mean evaluation scores detailed in Table 1.
Our results show that MP-DQN learns significantly faster than baseline P-DQN with joint action-parameter inputs and achieves the highest mean evaluation scores across all three domains. SP-DQN similarly shows better performance than P-DQN on Platform and Robot Soccer Goal but to a slightly lesser extent than MP-DQN. Notably, SP-DQN exhibits fast initial learning on HFO but plateaus at a lower performance level than P-DQN. This is likely due to the aforementioned lack of a shared feature representation between the separate Q-networks and the duplicate network parameters which require more updates to optimise.
In general, we observe that P-DQN and its variants outperform Q-PAMDP on Platform and Robot Soccer Goal, while PA-DDPG consistently converges prematurely to suboptimal policies. Wei et al. [2018] observe similar behaviour for PA-DDPG on Platform. This highlights the problem with updating the action and action-parameter policies simultaneously and was also observed when using eNAC for direct policy search on Platform [Masson et al., 2016]. On HFO, Q-PAMDP fails to learn to score any goals, likely due to its reduced feature space and use of linear function approximation rather than neural networks. Unexpectedly, baseline P-DQN appears to learn slower than PA-DDPG on HFO. This suggests that the dueling networks and asynchronous parallel workers used by Xiong et al. [2018] were major factors improving P-DQN in their comparisons.
7 Related Work
Many recent deep RL approaches follow the strategy of collapsing the parameterised action space into a continuous one. Hussein et al. [2018] present a deep imitation learning approach for scoring goals on HFO using long short-term memory networks with a joint action and action-parameter policy. Agarwal [2018] introduces skills for multi-goal parameterised action space environments to achieve multiple related goals, and demonstrates success on robotic manipulation tasks by combining PA-DDPG with hindsight experience replay and a skill library.
One can alternatively view parameterised actions as a 2-level hierarchy: Klimek et al. [2017] use this approach to learn a reach-and-grip task using a single network to represent a distribution over macro (discrete) actions and their lower-level action-parameters. The work most relevant to this paper is by Wei et al. [2018], who introduce a parameterised action version of TRPO (PATRPO). They also take a hierarchical approach but instead condition the action-parameter policy on the discrete action chosen to avoid predicting all action-parameters at once. While their preliminary results show the method achieves good performance on Platform, we omit comparison with PATRPO as it fails to learn to score goals on HFO.
We identified a significant problem with the P-DQN algorithm for parameterised action spaces: the dependence of its Q-values on all action-parameters causes false gradients and can lead to suboptimal action selection. We introduced a new algorithm, MP-DQN, with separate action-parameter inputs which demonstrated superior performance over P-DQN and former state-of-the-art techniques Q-PAMDP and PA-DDPG. We also found that PA-DDPG was unstable and converged to suboptimal policies on some domains. Our results suggest that future approaches should leverage the disjoint nature of parameterised action spaces and avoid simultaneous optimisation of the policies for discrete actions and continuous action-parameters.
This work is based on the research supported in part by the National Research Foundation of South Africa (Grant Number: 113737).
- [Agarwal, 2018] Arpit Agarwal. Deep reinforcement learning with skill library: Exploring with temporal abstractions and coarse approximate dynamics models. Master’s thesis, Carnegie Mellon University, Pittsburgh, PA, July 2018.
- [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [Dabney and Barto, 2012] William Dabney and Andrew G. Barto. Adaptive step-size for online temporal difference learning. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
- [Hausknecht and Stone, 2016a] Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. In Proceedings of the International Conference on Learning Representations, 2016.
- [Hausknecht and Stone, 2016b] Matthew Hausknecht and Peter Stone. On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI Workshop, July 2016.
- [He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- [Hussein et al., 2018] Ahmed Hussein, Eyad Elyan, and Chrisina Jayne. Deep imitation learning with memory for RoboCup soccer simulation. In Proceedings of the International Conference on Engineering Applications of Neural Networks, pages 31–43. Springer, 2018.
- [Khamassi et al., 2017] Mehdi Khamassi, George Velentzas, Theodore Tsitsimis, and Costas Tzafestas. Active exploration and parameterized reinforcement learning applied to a simulated human-robot interaction task. In Proceedings of the First IEEE International Conference on Robotic Computing, pages 28–35. IEEE, 2017.
- [Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [Kitano et al., 1997] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, Eiichi Osawa, and Hitoshi Matsubara. RoboCup: A challenge problem for AI. AI Magazine, 18:73–85, 1997.
- [Klimek et al., 2017] Maciej Klimek, Henryk Michalewski, and Piotr Miłoś. Hierarchical reinforcement learning with parameters. In Conference on Robot Learning, pages 301–313, 2017.
- [Konidaris et al., 2011] George D. Konidaris, Sarah Osentoski, and Philip S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, pages 380–385, August 2011.
- [Lillicrap et al., 2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2016.
- [Masson et al., 2016] Warwick Masson, Pravesh Ranchod, and George Konidaris. Reinforcement learning with parameterized actions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1934–1940, 2016.
- [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- [Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.
- [Peng et al., 2016] Xue Bin Peng, Glen Berseth, and Michiel van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics, 35(4):81:1–81:12, 2016.
- [Peters and Schaal, 2008] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
- [Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, volume 37, pages 1889–1897, 2015.
- [Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
- [Wang et al., 2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, pages 1995–2003, 2016.
- [Wei et al., 2018] Ermo Wei, Drew Wicke, and Sean Luke. Hierarchical approaches for reinforcement learning in parameterized action space. In 2018 AAAI Spring Symposium Series, 2018.
- [Xiong et al., 2018] Jiechao Xiong, Qing Wang, Zhuoran Yang, Peng Sun, Lei Han, Yang Zheng, Haobo Fu, Tong Zhang, Ji Liu, and Han Liu. Parametrized deep Q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394, 2018.