Regularized Hierarchical Policies
for Compositional Transfer in Robotics
Abstract
The successful application of flexible, general learning algorithms — such as deep reinforcement learning — to realworld robotics applications is often limited by their poor dataefficiency. Domains with more than a single dominant task of interest encourage algorithms that share partial solutions across tasks to limit the required experiment time. We develop and investigate simple hierarchical inductive biases — in the form of structured policies — as a mechanism for knowledge transfer across tasks in reinforcement learning (RL). To leverage the power of these structured policies we design an RL algorithm that enables stable and fast learning. We demonstrate the success of our method both in simulated robot environments (using locomotion and manipulation domains) as well as real robot experiments, demonstrating substantially better dataefficiency than competitive baselines.
Regularized Hierarchical Policies
for Compositional Transfer in Robotics
noticebox[b]Preprint. Under review.\end@float
1 Introduction
While recent successes in deep (reinforcement) learning for computer games (Atari (Mnih et al., 2013), StarCraft (Vinyals et al., 2019), Go (Silver et al., 2017) and other highthroughput domains, e.g. (OpenAI et al., 2018), have demonstrated the potential of these methods in the big data regime, high costs of data acquisition limit progress in many tasks of realworld relevance. Data efficiency in machine learning often relies on inductive biases to guide and accelerate the learning process; e.g. by including expert domain knowledge of varying granularity. While incorporating this knowledge often helps to accelerate learning, when inaccurate it can inappropriately bias the space of solutions and lead to suboptimal results.
Instead of designing inductive biases manually we aim to infer them from related tasks. In particular, we turn towards research on multitask transfer learning, where, from the view of one task, training signals of related tasks can serve as an additional inductive bias (Caruana, 1997). Focusing on domains with consistent algorithm embodiment and more than a single task of interest – such as robotics – sharing and applying knowledge across tasks is imperative for efficiency and scalability. Successes for transfer learning have, for example, built on optimizing initial parameters (e.g. Finn et al., 2017), sharing models and parameters across tasks (e.g. Rusu et al., 2016; Teh et al., 2017; Galashov et al., 2018), datasharing across tasks (e.g. Riedmiller et al., 2018; Andrychowicz et al., 2017) or related, auxiliary objectives (Jaderberg et al., 2016; Wulfmeier et al., 2017). Transfer between tasks can, however, lead to either constructive or destructive transfer for humans (Singley and Anderson, 1989) as well as machines (Pan and Yang, 2010; Torrey and Shavlik, 2010). That is, jointly learning to solve different tasks can provide benefits as well as disadvantages for individual tasks, depending on the similarity of required solutions.
In reinforcement learning (RL), (partial) solutions can be shared e.g. in the form of policies or value functions. While training independent, pertask, policies prevents interference, it also prevents transfer. On the other hand, training a monolithic policy shared across tasks can instead cause interference of opposing modes of behaviour. With this work we continue to explore the benefits of hierarchy for multitask learning and the modelling of multimodal distributions (Rosenstein et al., 2005; Bishop, 1994).
We focus on developing an actorcritic RL approach for training hierarchical, modular models (policies) for continuous control. We represent lowlevel subpolicies as Gaussian policies combined with a categorical distribution for the highlevel controller. We parametrize the policies as simple Mixture Density Networks (Bishop, 1994) with the important modification of providing task information only to the highlevel controller, so that the lowlevel policies acquire more robust, taskindependent behaviours. To train hierarchical policies in a coordinated and robust manner, we extend the optimization scheme of the Maximum aposteriori Policy Optimization algorithm (MPO) (Abdolmaleki et al., 2018a). We additionally demonstrate the general benefits of hierarchical policies for transfer learning via improved results for another, gradientbased, entropy regularized RL algorithm (Heess et al., 2016) (see Appendix A.8.4).
The resulting algorithm, Regularized Hierarchical Policy Optimization (RHPO), is appealing due to its conceptual simplicity and appears extremely effective in practice. We demonstrate this first in simulation for a variety of continuous control problems. In multitask transfer problems, we find that RHPO can improve learning speed compared to the stateoftheart. We then compare the approach against competitive continuous control baselines on a real robot for robotic manipulation tasks where RHPO leads to a significant speed up in training, enabling us to learn with increased efficiency in a challenging stacking domain on a single robot. To the best of our knowledge, RHPO demonstrates the first success for learning such a challenging task, defined via a sparse reward, from raw robot state and a randomly initialized policy in the real world – which we can learn within about a week of real robot time^{1}^{1}1 Additional videos are provided under https://sites.google.com/view/rhpo/.
2 Preliminaries
We consider a multitask reinforcement learning setting with an agent operating in a Markov Decision Process (MDP) consisting of the state space , the action space , the transition probability of reaching state from state when executing action at the previous time step . The actions are drawn from a probability distribution over actions referred to as the agent’s policy. Jointly, the transition dynamics and policy induce the marginal state visitation distribution . Finally, the discount factor together with the reward gives rise to the expected reward, or value, of starting in state (and following thereafter) . Furthermore, we define multitask learning over a set of tasks with common agent embodiment as follows. We assume shared state, action spaces and shared transition dynamics across tasks; tasks only differ in their reward function . Furthermore, we consider task conditional policies . The overall objective is defined as
(1) 
where all actions are drawn according to the policy , that is, and we used the common definition of the stateaction value function – here conditioned on the task – .
3 Method
This section introduces Regularized Hierarchical Policy Optimization (RHPO) which focuses on efficient training on modular policies by sharing data across tasks building on an instance of Scheduled Auxiliary Control (SACX) with randomized scheduling (SACU) (Riedmiller et al., 2018). We start by introducing the considered class of policies, followed by the required extension of the MPO algorithm (Abdolmaleki et al., 2018b, a) for training structured hierarchical policies in an offpolicy setting.
3.1 Modular Hierarchical Policies
We start by defining the modular policy class which supports sharing subpolicies across tasks. Formally, we decompose the pertask policy as
(2) 
with and respectively representing a “highlevel” switching controller (a categorical distribution) and a “lowlevel” subpolicy and is the index of the subpolicy. Here, denotes the parameters of both and , which we will seek to optimize. While the number of components has to be decided externally, the method is robust with respect to this parameter (Appendix A.8.1).
Note that, in the above formulation only the highlevel controller is conditioned on the task information ; i.e. we employ a form of information asymmetry (Galashov et al., 2018; Tirumala et al., 2019; Heess et al., 2016) to enable the lowlevel policies to acquire general, taskindependent behaviours. This choice strengthens decomposition of tasks across domains and inhibits degenerate cases of bypassing the highlevel controller. These subpolicies can be understood as building reflexlike lowlevel control loops, which perform domaindependent useful behaviours and can be modulated by higher cognitive functions with knowledge of the task at hand.
3.2 Decoupled Hierarchical Multitask Optimization
In the following sections, we present the equations underlying RHPO, for the complete pseudocode algorithm the reader is referred to the Appendix A.2.1. To optimize the policy class described above we build on the MPO algorithm (Abdolmaleki et al., 2018b, a) which decouples the policy improvement step (optimizing independently of the policy structure) from the fitting of the hierarchical policy. Concretely, we first introduce an intermediate nonparametric policy and consider optimizing while staying close, in expectation, to a reference policy
(3) 
where denotes the Kullback Leibler divergence, defines a bound on the KL, denotes the data contained in a replay buffer (note that we do not differentiate between states from different tasks here, see Section 3.3 for further explanation), and assuming that we have an approximation of the groundtruth stateaction value function available (see Section 3.3 for details on how can be learned from offpolicy data). Starting from an initial policy we can then iterate the following steps (interleaved with policy evaluation to estimate ) to improve the policy :
Step 1: Obtaining Nonparametric Policies
We first find the intermediate policy by maximizing Equation (3). Analogous to MPO (Abdolmaleki et al., 2018b, a), we obtain a closedform solution with a nonparametric policy for each task, as
(4) 
where is a temperature parameter (corresponding to a given bound ) which is optimized alongside the policy optimization. See Abdolmaleki et al. (2018b, a) for the general form as well as the Appendix A.1.1 for a detailed derivation for the multitask case. As mentioned above, this policy representation is independent of the form of the parametric policy ; i.e. only depends on through its density. This, crucially, makes it easy to employ complicated structured policies (such as the one introduced in Section 3.1). The only requirement for this and the following steps is that we must be able to sample from and calculate the gradient (w.r.t. ) of its log density.
Step 2: Fitting Parametric Policies
In the second step we fit a policy to the nonparametric distribution obtained from the previous calculation by minimizing the divergence Assuming that we can sample from this step corresponds to maximum likelihood estimation (MLE). Furthermore, we can effectively mitigate optimization instabilities during training by placing a prior on the change of the policy (a trustregion constraint). The idea is commonly used in on as well as offpolicy RL (Schulman et al., 2015; Abdolmaleki et al., 2018b, a). The application to hierarchical policy classes highlights the importance of this constraint as investigated in Section 4.1. Formally, we aim to obtain the solution
(5)  
s.t. 
where defines a bound on the change of the new policy. Here we drop constant terms and the negative sign in the second line (turning into ), and explicitly insert the definition , to highlight that we are marginalizing over the highlevel choices in this fitting step (since is not tied to the policy structure). Hence, the update is independent of the specific policy component from which the action was sampled, enabling joint updates of all components. This reduces the variance of the update and also enables efficient offpolicy learning. Different approaches can be used control convergence for both the “highlevel” categorical choices and the action choices to change slowly throughout learning. The average KL constraint in Equation (5) is similar in nature to a upper bound on the computationally intractable KL divergence between the two mixture distributions and has been determined experimentally to perform better in practice than simple bounds. In practice, inspired by Abdolmaleki et al. (2018a), in order to control the change of the high level and low level policies independently we decouple the constraints to be able to set different for the means (), covariances () and the categorical distribution () in case of a mixture of Gaussian policy. Please see the Appendix A.1.2 for more details on decoupled constraints. To solve Equation (5), we first employ Lagrangian relaxation to make it amenable to gradient based optimization and then perform a fixed number of gradient descent steps (using Adam (Kingma and Ba, 2014)); details on this step, as well as an algorithm listing, can be found in the Appendix A.1.2.
3.3 Dataefficient Multitask Learning
For dataefficient offpolicy learning of we consider the same setting as scheduled auxiliary control setting (SACX) (Riedmiller et al., 2018) which introduces two main ideas to obtain dataefficiency: i) experience sharing across tasks; ii) switching between tasks within one episode for improved exploration. We consider uniform random switches between tasks every N steps within an episode (the SACU setting from (Riedmiller et al., 2018)).
Formally, we assume access to a replay buffer containing data gathered from all tasks, which is filled asynchronously to the optimization (similar to e.g. Espeholt et al. (2018); Abdolmaleki et al. (2018a)) where for each trajectory snippet we record the rewards for all tasks as a vector in the buffer. Using this data we define the retrace objective for learning , parameterized via , following Riedmiller et al. (2018) as
(6) 
where is the Lstep retrace target (Munos et al., 2016), see the Appendix A.2.2 for details.
4 Experiments
In the following sections, we investigate how RHPO can provide compelling benefits for multitask transfer learning in real and simulated robotic manipulation tasks. This includes experiments on physical hardware with robotic manipulation tasks for the Sawyer arm, emphasizing the importance of dataefficiency. In addition, we provide results for singletask domains from the DeepMind Control Suite (Tassa et al., 2018) in Appendix A.5.4, where we demonstrate how hierarchical policies can be employed to flexibly integrate human domain knowledge.
All policies are trained via the same reward structures, underlying constraints and task hyperparameters. More details on these parameters and the complete results for additional ablations and all tasks from the multitask domains are provided in the Appendix A.4. Across all tasks, we build on a distributed actorcritic framework (similar to (Espeholt et al., 2018)) with flexible hardware assignment (Buchlovsky et al., 2019) to train all agents, performing critic and policy updates from a replay buffer, which is asynchronously filled by a set of actors.
4.1 Simulated Robot Experiments
We use three simulated multitask scenarios with the Kinova Jaco and Rethink Robotics Sawyer robot arms to test in a variety of conditions. Pile1: Here, the seven tasks of interest range from simple reaching for a block over tasks like grasping it, to the final task of stacking the block on top of another block. In addition to the experiments in simulation, which are executed with 5 actors in a distributed setting, the same Pile1 multitask domain (same rewards and setup) is investigated with a single, physical robot in Section 4.2. In all figures with error bars, we visualize mean and variance derived from 3 runs. We further extend the evaluation towards two more complex multitask domains in simulation. The first extension includes stacking with both blocks on top of the respective other block, resulting in a setting with 10 tasks (Pile2). And a last domain including harder tasks such as opening a box and placing blocks into this box, consisting of a total of 13 tasks (Cleanup2).
Across all experiments, we utilize the datasharing scheme of Scheduled Auxiliary Control (SACX) (Riedmiller et al., 2018) (see Section 3.3), and execute a series of random tasks per episode, i.e. our setup is comparable to SACU. We thus compare RHPO for training hierarchical policies against a flat, monolithic policy shared across all tasks which is provided with the additional task id as input (displayed as Monolithic in the plot) as well as policies with task dependent heads (displayed as Independent in the plots) following (Riedmiller et al., 2018) – both using MPO as the optimizer and a reimplementation of SACU using SVG (Heess et al., 2015). The baselines aim to provide the two opposite, naive perspectives on transfer: by using the same monolithic policy across tasks we enable positive as well as negative interference and independent policies prevent policybased transfer. After experimentally confirming the robustness of RHPO with respect to the number of lowlevel subpolicies (see Appendix A.8.1), we set equal to the number of tasks in each domain.
Figure 1 demonstrates that the hierarchical policy (RHPO) outperforms the monolithic as well as the independent baselines. For simple tasks such as the stacking of a one block on another (Pile1), the difference is comparably small, but the more tasks are trained and the more complex the domain becomes (cf. Pile2 and Cleanup2), the greater is the advantage of sharing learned behaviours across tasks. Compared to a reimplementation of SACU (using SVG as the optimizer) – dashed line in the plots – we observe that the baselines based on MPO already result in an improvement, which becomes even bigger with the hierarchical policies.
4.2 Physical Robot Experiments
For realworld experiments, dataefficiency is crucial. We perform all experiments in this section relying on a single robot (a single actor) – demonstrating the benefits of RHPO in the low data regime. The performed task is the real world version of the Pile1 task described in Section 4.1. The main task objective is to stack one cube onto a second one and move the gripper away from it. We introduce an additional third cube which serves purely as a distractor.
The setup for the experiments consists of a Sawyer robot arm mounted on a table, equipped with a Robotiq 2F85 parallel gripper. A basket of size in front of the robot contains the three cubes. Three cameras on the basket track the cubes using fiducials (augmented reality tags). As in simulation, the agent is provided with proprioception information (joint positions, velocities and torques), a wrist sensor’s force and torque readings, as well as the cubes’ poses – estimated via the fiducials. The agent action is five dimensional and consists of the three Cartesian translational velocities, the angular velocity of the wrist around the vertical axis and the speed of the gripper’s fingers.
Figure 3 plots the learning progress on the real robot for two (out of 7) of the tasks, the simple reach tasks and the stack task – which is the main task of interest. Plots for the learning progress of all tasks are given in the appendix A.6. As can be observed, all methods manage to learn the reach task quickly (within about a few thousand episodes) but only RHPO with a hierarchical policy is able to learn the stacking task (taking about 15 thousand episodes to obtain good stacking success), which takes about 8 days of training on the real robot. With respect to the current state of the baselines and previous experiments it can be estimated that the baselines will take considerably longer time (and we will continue running the ’Independent’ baseline experiment after submission until all reach the same training duration).
Reach  Stack 
4.2.1 Additional Ablations
In this section we build on the earlier introduced Pile1 domain to perform a set of ablations, providing additional insights into benefits of RHPO and important factors for robust training.
Impact of Data Rate
Evaluating in a distributed offpolicy setting enables us to investigate the effect of different rates for data generation by controlling the number of actors. Figure 4 demonstrates how the different agents converge slower lower data rates (changing from 5 to 1 actor). These experiments are highly relevant for the application domain as the number of available physical robots for realworld experiments is typically highly limited. To limit computational cost, we focus on the simplest domain from Section 4.1, Pile1, in this comparison. We display the learning curve of the most complicated task within the 7 tasks (the final stack). The results for all tasks as a function of the number of actor processes generating data is available in the Appendix A.8.3.
Importance of Regularization
Coordinating convergence progress in hierarchical models can be challenging but can be effectively moderated by the KL constraint. To investigate its importance, we perform an ablation study varying the strength of KL constraints on the highlevel controller between prior and the current policy during training – demonstrating a range of possible degenerate behaviors.
As depicted in Figure 5, with a weak KL constraint, the highlevel controller can converge too quickly leading to only a single subpolicy getting a gradient signal per step. In addition, the categorical distribution tends to change at a high rate, preventing successful convergence for the lowlevel policies. On the other hand, the lowlevel policies are missing task information to encourage decomposition as described in Section 3.2. This fact, in combination with strong KL constraints, can prevent specialization of the lowlevel policies as the categorical remains near static, finally leading to no or very slow convergence. As long as a reasonable constraint is picked, convergence is fast and the final policies obtain high quality for all tasks. We note that no tuning of the constraints is required across domains and the range of admissible constraints is quite broad.
5 Related Work
Transfer learning, in particular in the multitask context, has long been part of machine learning (ML) for datalimited domains (Caruana, 1997; Torrey and Shavlik, 2010; Pan and Yang, 2010). Commonly, it is not straightforward to train a single model jointly across different tasks as the solutions to tasks might not only interfere positively but also negatively (Wang et al., 2018).
Preventing this type of forgetting or negative transfer presents a challenge for biological (Singley and Anderson, 1989) as well as artificial systems (French, 1999). In the context of ML, a common scheme is the reduction of representational overlap (French, 1999; Rusu et al., 2016; Wang et al., 2018). Bishop (1994) utilize neural networks to parametrize mixture models for representing multimodal distributions thus mitigating shortcomings of nonhierarchical approaches. Rosenstein et al. (2005) demonstrate that hierarchical classification models can be applied to limit the impact of negative transfer. Our constrained optimization scheme for mixture policies builds on (Abdolmaleki et al., 2018b). Some aspects of it are similar to Daniel et al. (2016) but we phrase the learning problem in a fashion suitable for largescale offpolicy learning with deep neural networks.
Hierarchical approaches have a long history in the reinforcement learning literature (e.g. Sutton et al., 1999; Dayan and Hinton, 1993). Similar to our work, the well known options framework (Sutton et al., 1999; Precup, 2000) supports a two level behavior hierarchy, where the higher level chooses from a discrete set of subpolicies or “options” which commonly are run until a termination criterion is satisfied. The framework focuses on the notion of temporal abstraction. A number of works have proposed practical and scalable algorithms for learning option policies with reinforcement learning (e.g. Bacon et al., 2017; Zhang and Whiteson, 2019; Smith et al., 2018) or criteria for option induction (e.g. Harb et al., 2018; Harutyunyan et al., 2019). Rather than temporal abstraction, RHPO emphasizes the sharing of subpolicies across tasks to provide structure for efficient multitask transfer.
Conceptually different from the options framework and from our work are approaches such as (Vezhnevets et al., 2017; Nachum et al., 2018a, b; Xie et al., 2018) which employ different rewards for different levels of the hierarchy rather than optimizing a single objective for the entire model as we do. Other works have shown the additional benefits for the stability of training and dataefficiency when sequences of highlevel actions are given as guidance during optimization in a hierarchical setting (Shiarlis et al., 2018; Andreas et al., 2017; Tirumala et al., 2019).
Probabilistic trajectory models have been used for the discovery of behavior abstractions as part of an endtoend reinforcement learning paradigm (e.g. Teh et al., 2017; Igl et al., 2019; Tirumala et al., 2019; Galashov et al., 2018) where the models act as learned inductive biases that induce the sharing of behavior across tasks. Several works have previously emphasized the importance of information asymmetry or information hiding (Galashov et al., 2018; Heess et al., 2016). In a vein similar to the present work (e.g Heess et al., 2016; Tirumala et al., 2019) share a lowlevel controller across tasks but modulate the lowlevel behavior via a continuous embedding rather than picking from a small number of mixture components. In related work Hausman et al. (2018); Haarnoja et al. (2018) learn hierarchical policies with continuous latent variables optimizing the entropy regularized objective.
6 Discussion
We introduce a novel framework, RHPO, to enable robust training of hierarchical policies for complex realworld tasks and provide insights into methods for stabilizing the learning process. In simulation as well as on real robots, the proposed approach outperforms baseline methods which either handle tasks independently or utilize implicit sharing. Especially with increasingly complex tasks or limited data rate, as given in realworld applications, we demonstrate hierarchical inductive biases to provide a compelling foundation for transfer learning, reducing the number of environment interactions significantly and often leading to more robust learning as well as improved final performance. While this approach partially mitigates negative interference between tasks in a parallel multitask learning scenario, addressing catastrophic inference in sequential settings remains challenging and provides a valuable direction.
Since with mixture distributions, we are able to marginalize over components when optimizing the weighted likelihood over action samples in Equation 5, the extension towards multiple levels of hierarchy is trivial but can provide a valuable direction for practical future work.
We believe that especially in domains with consistent agent embodiment and high costs for data generation learning tasks jointly and sharing will be imperative. RHPO combines several ideas that we believe will be important: multitask learning with hierarchical and modular policy representations, robust optimization, and efficient offpolicy learning. Although we have found this particular combination of components to be very effective we believe it is just one instance of – and step towards – a spectrum of efficient learning architectures that will unlock further applications of RL both in simulation and, importantly, on real hardware.
Acknowledgments
The authors would like to thank Michael Bloesch, Jonas Degrave, Joseph Modayil and Doina Precup for helpful discussion and relevant feedback for shaping our submission. As robotics (and AI research) is a team sport we’d additionally like to acknowledge the support of Francesco Romano, Murilo Martins, Stefano Salicetti, Tom Rothörl and Francesco Nori on the hardware and infrastructure side as well as many others of the DeepMind team.
References
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Yuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the RealTime Strategy Game StarCraft II. https://deepmind.com/blog/alphastarmasteringrealtimestrategygamestarcraftii/, 2019.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 OpenAI et al. (2018) OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous inhand manipulation. CoRR, abs/1808.00177, 2018. URL http://arxiv.org/abs/1808.00177.
 Caruana (1997) Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1126–1135. JMLR. org, 2017.
 Rusu et al. (2016) Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 Teh et al. (2017) Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. CoRR, abs/1707.04175, 2017. URL http://arxiv.org/abs/1707.04175.
 Galashov et al. (2018) Alexandre Galashov, Siddhant M Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M Czarnecki, Yee Whye Teh, Razvan Pascanu, and Nicolas Heess. Information asymmetry in klregularized rl. 2018.
 Riedmiller et al. (2018) Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playingsolving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
 Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Wulfmeier et al. (2017) Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.
 Singley and Anderson (1989) Mark K Singley and John Robert Anderson. The transfer of cognitive skill. Number 9. Harvard University Press, 1989.
 Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 Torrey and Shavlik (2010) Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI Global, 2010.
 Rosenstein et al. (2005) Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pages 1–4, 2005.
 Bishop (1994) Christopher M Bishop. Mixture density networks. Technical report, Citeseer, 1994.
 Abdolmaleki et al. (2018a) Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin Riedmiller. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.
 Heess et al. (2016) Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Abdolmaleki et al. (2018b) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Rémi Munos, Nicolas Heess, and Martin A. Riedmiller. Maximum a posteriori policy optimisation. CoRR, abs/1806.06920, 2018b.
 Tirumala et al. (2019) Dhruva Tirumala, Hyeonwoo Noh, Alexandre Galashov, Leonard Hasenclever, Arun Ahuja, Greg Wayne, Razvan Pascanu, Yee Whye Teh, and Nicolas Heess. Exploiting hierarchy for learning and transfer in klregularized rl. arXiv preprint arXiv:1903.07438, 2019.
 Schulman et al. (2015) John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on International Conference on Machine LearningVolume 37, pages 1889–1897. JMLR. org, 2015.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deeprl with importance weighted actorlearner architectures. In International Conference on Machine Learning, pages 1406–1415, 2018.
 Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Buchlovsky et al. (2019) Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, Fabio Viola, and Dan Belov. Tfreplicator: Distributed machine learning for researchers. arXiv preprint arXiv:1902.00465, 2019.
 Heess et al. (2015) Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
 Wang et al. (2018) Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. arXiv preprint arXiv:1811.09751, 2018.
 French (1999) Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 Daniel et al. (2016) Christian Daniel, Gerhard Neumann, Oliver Kroemer, and Jan Peters. Hierarchical relative entropy policy search. The Journal of Machine Learning Research, 17(1):3190–3239, 2016.
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semimdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(12):181–211, 1999.
 Dayan and Hinton (1993) Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
 Precup (2000) Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.
 Bacon et al. (2017) PierreLuc Bacon, Jean Harb, and Doina Precup. The optioncritic architecture. In ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 Zhang and Whiteson (2019) Shangtong Zhang and Shimon Whiteson. Dac: The double actorcritic architecture for learning options. arXiv preprint arXiv:1904.12691, 2019.
 Smith et al. (2018) Matthew Smith, Herke Hoof, and Joelle Pineau. An inferencebased policy gradient method for learning options. In International Conference on Machine Learning, pages 4710–4719, 2018.
 Harb et al. (2018) Jean Harb, PierreLuc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Harutyunyan et al. (2019) Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Rémi Munos, and Doina Precup. The termination critic. CoRR, abs/1902.09996, 2019. URL http://arxiv.org/abs/1902.09996.
 Vezhnevets et al. (2017) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3540–3549. JMLR. org, 2017.
 Nachum et al. (2018a) Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Dataefficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018a.
 Nachum et al. (2018b) Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Nearoptimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018b.
 Xie et al. (2018) Saining Xie, Alexandre Galashov, Siqi Liu, Shaobo Hou, Razvan Pascanu, Nicolas Heess, and Yee Whye Teh. Transferring task goals via hierarchical reinforcement learning, 2018. URL https://openreview.net/forum?id=S1Y6TtJvG.
 Shiarlis et al. (2018) Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, and Ingmar Posner. Taco: Learning task decomposition via temporal alignment for control. In International Conference on Machine Learning, pages 4661–4670, 2018.
 Andreas et al. (2017) Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 166–175. JMLR. org, 2017.
 Igl et al. (2019) Maximilian Igl, Andrew Gambardella, Nantas Nardelli, N Siddharth, Wendelin Böhmer, and Shimon Whiteson. Multitask soft option learning. arXiv preprint arXiv:1904.01033, 2019.
 Hausman et al. (2018) Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
 Haarnoja et al. (2018) Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 1846–1855, 2018.
 Reynolds et al. (2017) Malcolm Reynolds, Gabriel BarthMaron, Frederic Besse, Diego de Las Casas, Andreas Fidjeland, Tim Green, Adrià Puigdomènech, Sébastien Racanière, Jack Rae, and Fabio Viola. Open sourcing Sonnet  a new library for constructing neural networks. https://deepmind.com/blog/opensourcingsonnet/, 2017.
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 Kingma and Welling (2014) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In ICLR, 2014.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.
Appendix A Appendix
a.1 Additional Derivations
In this section we explain the detailed derivations for training hierarchical policies parameterized as a mixture of Gaussians.
a.1.1 Obtaining Nonparametric Policies
In each policy improvement step, to obtain nonparametric policies for a given state and task distribution, we solve the following program:
To make the following derivations easier to follow we open up the expectations, writing them as integrals explicitly. For this purpose let us define the joint distribution over states together with randomly sampled tasks as , where denotes the uniform distribution over possible tasks. This allows us to rewrite the expectations that include the corresponding distributions, i.e. , but again, note that here is not necessarily the task under which was observed. We can then write the Lagrangian equation corresponding to the above described program as
Next we maximize the Lagrangian w.r.t the primal variable . The derivative w.r.t reads,
Setting it to zero and rearranging terms we obtain
However, the last exponential term is a normalization constant for . Therefore we can write,
(7) 
Now, to obtain the dual function , we plug in the solution to the KL constraint term (second term) of the Lagrangian which yields
After expanding and rearranging terms we get
Most of the terms cancel out and after rearranging the terms we obtain
a.1.2 Extended Update Rules For Fitting a Mixture of Gaussians
After obtaining the non parametric policies, we fit a parametric policy to samples from said nonparametric policies – effectively employing using maximum likelihood estimation with additional regularization based on a distance function , i.e,
(9)  
where is an arbitrary distance function to evaluate the change of the new policy with respect to a reference/old policy, and denotes the allowed change for the policy. To make the above objective amenable to gradient based optimization we employ Lagrangian Relaxation, yielding the following primal:
(10) 
We solve for by iterating the inner and outer optimization programs independently: We fix the parameters to their current value and optimize for the Lagrangian multipliers (inner minimization) and then we fix the Lagrangian multipliers to their current value and optimize for (outer maximization). In practice we found that it is effective to simply perform one gradient step each in inner and outer optimization for each sampled batch of data.
The optimization given above is general, i.e. it works for any general type of policy. As described in the main paper, we consider hierarchical policies of the form
(11) 
In particular, in all experiments we made use of a mixture of Gaussians parametrization, where the high level policy is a categorical distribution over low level Gaussian policies, i.e,
where denote the index of components and is the high level policy assigning probabilities to each mixture component for a state s given the task and the low level policies are all Gaussian. Here s are the probabilities for a categorical distribution over the components.
We also define the following distance function between old and new mixture of Gaussian policies
where evaluates the KL between categorical distributions and corresponds to the average KL across Gaussian components, as also described in the main paper (c.f. Equation 5 in the main paper).
In order to bound the change of categorical distributions, means and covariances of the components independently – which makes it easy to control the convergence of the policy and which can prevent premature convergence as argued in Abdolmaleki et al. [2018a] – we separate out the following three intermediate policies
Which yields the following final optimization program
(12)  
This decoupling allows us to set different values for the change of means, covariances and categorical probabilities, i.e., . Different lead to different learning rates. We always set a much smaller epsilon for the covariance and categorical than for the mean. The intuition is that while we would like the distribution to converge quickly in the action space, we also want to keep the exploration both locally (via the covariance matrix) and globally (via the high level categorical distribution) to avoid premature convergence.
a.2 Algorithmic Details
a.2.1 Pseudocode for the full procedure
a.2.2 Details on the policy improvement step
As described in the main paper we consider the same setting as scheduled auxiliary control setting (SACX) [Riedmiller et al., 2018] to perform policy improvement (with uniform random switches between tasks every N steps within an episode, the SACU setting).
Given a replay buffer containing data gathered from all tasks, where for each trajectory snippet we record the rewards for all tasks as a vector in the buffer, we define the retrace objective for learning , parameterized via , following Riedmiller et al. [2018] as
(13)  
where the importance weights are defined as , with denoting an arbitrary behavior policy; in particular this will be the policy for the executed tasks during an episode as in [Riedmiller et al., 2018]. Note that, in practice, we truncate the infinite sum after steps, bootstrapping with . We further perform optimization of Equation (6) via gradient descent and make use of a target network [Mnih et al., 2015], denoted with parameters , which we copy from after a couple of gradient steps. We reiterate that, as the stateaction value function remains independent of the policy’s structure, we are able to utilize any other offtheshelf Qlearning algorithm such as TD(0) [Sutton, 1988].
Given that we utilize the same policy evaluation mechanism as SACU it is worth pausing here to identify the differences between SACU and our approach. The main difference is in the policy parameterization: SACU used a monolithic policy for each task (although a neural network with shared components, potentially leading to some implicit task transfer, was used). Furthermore, we perform policy optimization based on MPO instead of using stochastic value gradients (SVG [Heess et al., 2016]). We can thus recover a variant of plain SACU using MPO if we drop the hierarchical policy parameterization, which we employ in the single task experiments in the main paper.
a.2.3 Network Architectures
To represent the Qfunction in the multitask case we use the network architecture from SACX (see right subfigure in Figure 6). The proprioception of the robot, the features of the objects and the actions are fed together in a torso network. At the input we use a fully connected first layer of 200 units, followed by a layer normalization operator, an optional tanh activation and another fully connected layer of 200 units with an ELU activation function. The output of this torso network is shared by independent head networks for each of the tasks (or intentions, as they are called in the SACX paper). Each head has two fully connected layers and outputs a Qvalue for this task, given the input of the network. Using the task identifier we then can compute the Q value for a given sample by discrete selection of the according head output.
While we use the network architecture for the Q function for all multitask experiments, we investigate different architectures for the policy in this paper. The original SACX policy architecture is shown in Figure 6 (left subfigure). The main structure follows the same basic principle that we use in the Q function architecture. The only difference is that the heads compute the required parameters for the policy distribution we want to use (see subsection A.2.4). This architecture is referenced as the independent heads (or task dependent heads).
The alternatives we investigate in this paper are the monolithic policy architecture (see Figure 7, left subfigure) and the hierarchical policy architecture (see Figure 7, right subfigure). For the monolithic policy architecture we reduce the original policy architecture basically to one head and append the taskid as a onehot encoded vector to the input. For the hierarchical architecture, we build on the same torso and create a set of networks parameterizing the Gaussians which are shared across tasks and a taskspecific network to parameterize the categorical distribution for each task. The final mixture distribution is taskdependent for the highlevel controller but taskindependent for the lowlevel policies.
a.2.4 Algorithm Hyperparameters
In this section we outline the details on the hyperparameters used for RHPO and baselines in both single task and multitask experiments. All experiments use feedforward neural networks. We consider a flat policy represented by a Gaussian distribution and a hierarchical policy represented by a mixture of Gaussians distribution. The flat policy is given by a Gaussian distribution with a diagonal covariance matrix, i.e, . The neural network outputs the mean and diagonal Cholesky factors , such that . The diagonal factor has positive diagonal elements enforced by the softplus transform to ensure positive definiteness of the diagonal covariance matrix. Mixture of Gaussian policy has a number of Gaussian components as well as a categorical distribution for selecting the components. The neural network outputs the Gaussian components based on the same setup described above for a single Gaussian and outputs the logits for representing the categorical distribution. Tables 1 show the hyperparameters we used for the single tasks experiments. We found layer normalization and a hyperbolic tangent () on the layer following the layer normalization are important for stability of the algorithms. For RHPO the most important hyperparameters are the constraints in Step 1 and Step 2 of the algorithm.
Hyperparameters  Hierarchical  Single Gaussian 
Policy net  200200200  200200200 
Number of actions sampled per state  10  10 
Q function net  500500500  500500500 
Number of components  3  NA 
0.1  0.1  
0.0005  0.0005  
0.00001  0.00001  
0.0001  NA  
Discount factor ()  0.99  0.99 
Adam learning rate  0.0002  0.0002 
Replay buffer size  2000000  2000000 
Target network update period  250  250 
Batch size  256  256 
Activation function  elu  elu 
Layer norm on first layer  Yes  Yes 
Tanh on output of layer norm  Yes  Yes 
Tanh on input actions to Qfunction  Yes  Yes 
Retrace sequence length  10  10 
Hyperparameters  Hierarchical  Independent  Monolith 

Policy torso  
(shared across tasks)  400200  
Policy taskdependent heads  100 (highlevel controller)  100  NA 
Policy shared heads  100 (per mixture component)  NA  200 
Number of action samples  20  
Q function torso  
(shared across tasks)  400400  
Q function head  
(per task)  300  
Number of components  number of tasks  NA  NA 
Discount factor ()  0.99  
Replay buffer size  1e6 * number of tasks  
Target network update period  500  
Batch size  256 (512 for Pile1) 
a.3 Additional Details on the SACU with SVG baseline
For the SACU baseline we used a reimplementation of the method from [Riedmiller et al., 2018] using SVG [Heess et al., 2015] for optimizing the policy. Concretely we use the same basic network structure as for the "Monolithic" baseline with MPO and parameterize the policy as
where denotes the identity matrix and is computed from the network output via a softplus activation function.
Together with entropy regularization, as described in [Riedmiller et al., 2018] the policy can be optimized via gradient ascent, following the reparameterized gradient for a given states sampled from the replay:
(14) 
which can be computed, using the reparameterization trick, as
(15) 
where is now a deterministic function of a sample from the standard multivariate normal distribution. See e.g. Heess et al. [2015] (for SVG) as well as Kingma and Welling [2014] (for the reparameterization trick) for a detailed explanation.
a.4 Details on the Experimental Setup
a.4.1 Simulation (Single and Multitask)
For the simulation of the robot arm experiments the numerical simulator MuJoCo ^{2}^{2}2MuJoCo: see www.mujoco.org was used – using a model we identified from the real robot setup.
We run experiments of length 2  7 days for the simulation experiments (depending on the task) with access to 25 recent CPUs with 32 cores each (depending on the number of actors) and 2 recent NVIDIA GPUs for the learner. Computation for data buffering is negligible.
a.4.2 Real Robot Multitask
Compared to simulation where the ground truth position of all objects is known, in the real robot setting, three cameras on the basket track the cube using fiducials (augmented reality tags).
For safety reasons, external forces are measured at the wrist and the episode is terminated if a threshold of 20N on any of the three principle axes is exceeded (this is handled as a terminal state with reward 0 for the agent), adding further to the difficulty of the task.
The real robot setup differs from the simulation in the reset behaviour between episodes, since objects need to be physically moved around when randomizing, which takes a considerable amount of time. To keep overhead small, object positions are randomized only every 25 episodes, using a handcoded controller. Objects are also placed back in the basket if they were thrown out during the previous episode. Other than that, objects start in the same place as they were left in the previous episode. The robot’s starting pose is randomized each episode, as in simulation.
a.5 Task Descriptions
a.5.1 Pile1
For this task we have a real setup and a MuJoCo simulation that are well aligned. It consists of a Sawyer robot arm mounted on a table and equipped with a Robotiq 2F85 parallel gripper. In front of the robot there is a basket of size 20x20 cm which contains three cubes with an edge length of 5 cm (see Figure 8).
The agent is provided with proprioception information for the arm (joint positions, velocities and torques), and the tool center point position computed via forward kinematics. For the gripper, it receives the motor position and velocity, as well as a binary grasp flag. It also receives a wrist sensor’s force and torque readings. Finally, it is provided with the cubes’ poses as estimated via the fiducials, and the relative distances between the arm’s tool center point and each object. At each time step, a history of two previous observations is provided to the agent, along with the last two joint control commands, in order to account for potential communication delays on the real robot. The observation space is detailed in Table 4.
The robot arm is controlled in Cartesian mode at 20Hz. The action space for the agent is 5dimensional, as detailed in Table 3. The gripper movement is also restricted to a cubic volume above the basket using virtual walls.
Entry  Dimensions  Unit  Range 

Translational Velocity in x, y, z  3  m/s  [0.07, 0.07] 
Wrist Rotation Velocity  1  rad/s  [1, 1] 
Finger speed  1  tics/s  [255, 255] 
Entry  Dimensions  Unit 

Joint Position (Arm)  7  rad 
Joint Velocity (Arm)  7  rad/s 
Joint Torque (Arm)  7  Nm 
Joint Position (Hand)  1  rad 
Joint Velocity (Hand)  1  tics/s 
ForceTorque (Wrist)  6  N, Nm 
Binary Grasp Sensor  1  au 
TCP Pose  7  m, au 
Last Control Command (Joint Velocity)  8  rad/s 
Green Cube Pose  7  m, au 
Green Cube Relative Pose  7  m, au 
Yellow Cube Pose  7  m, au 
Yellow Cube Relative Pose  7  m, au 
Blue Cube Pose  7  m, au 
Blue Cube Relative Pose  7  m, au 
For the Pile1 experiment we use 7 different task to learn, following the SACX principles. The first 6 tasks are seen as auxiliary tasks that help to learn the final task (STACK_AND_LEAVE(G, Y)) of stacking the green cube on top of the yellow cube. Overview of the used tasks:

REACH(G): :
Minimize the distance of the TCP to the green cube. 
GRASP:
Activate grasp sensor of gripper ("inward grasp signal" of Robotiq gripper) 
LIFT(G):
Increase z coordinate of an object more than 3cm relative to the table. 
PLACE_WIDE(G, Y):
Bring green cube to a position 5cm above the yellow cube. 
PLACE_NARROW(G, Y): :
Like PLACE_WIDE(G, Y) but more precise. 
STACK(G, Y):
Sparse binary reward for bringing the green cube on top of the yellow one (with 3cm tolerance horizontally and 1cm vertically) and disengaging the grasp sensor. 
STACK_AND_LEAVE(G, Y):
Like STACK(G, Y), but needs to move the arm 10cm above the green cube.
Let be the distance between the reference of two objects (the reference of the cubes are the center of mass, TCP is the reference of the gripper), and let be the distance only in the dimensions denoted by the set of axes . We can define the reward function details by:
(16) 
(17) 
(18) 
a.5.2 Pile2
For the Pile2 task, taken from Riedmiller et al. [2018], we use a different robot arm, control mode and task setup to emphasize that RHPO’s improvements are not restricted to cartesian control or a specific robot and that the approach also works for multiple external tasks.
Here, the agent controls a simulated Kinova Jaco robot arm, equipped with a Kinova KG3 gripper. The robot faces a 40 x 40 cm basket that contains a red cube and a blue cube. Both cubes have an edge length of 5 cm (see Figure 9). The agent is provided with proprioceptive information for the arm and the fingers (joint positions and velocities) as well as the tool center point position (TCP) computed via forward kinematics. Further, the simulated gripper is equipped with a touch sensor for each of the three fingers, whose value is provided to the agent as well. Finally, the agent receives the cubes’ poses, their translational and rotational velocities and the relative distances between the arm’s tool center point and each object. Neither observation nor action history is used in the Pile2 experiments. The cubes are spawned at random on the table surface and the robot hand is initialized randomly above the tabletop with a height offset of up to 20 cm above the table (minimum 10 cm). The observation space is detailed in Table 6.
Entry  Dimensions  Unit  Range 

Joint Velocity (Arm)  6  rad/sec  [0.8, 0.8] 
Joint Velocity (Hand)  3  rad/sec  [0.8, 0.8] 
The robot arm is controlled in raw joint velocity mode at 20 Hz. The action space is 9dimensional as detailed in Table 5. There are no virtual walls and the robot’s movement is solely restricted by the velocity limits and the objects in the scene.
Analogous to Pile1 and the SACX setup, we use 10 different task for Pile2. The first 8 tasks are seen as auxiliary tasks, that the agent uses to learn the main two tasks PILE_RED and PILE_BLUE, which represent stacking the red cube on the blue cube and stacking the blue cube on the red cube respectively. The tasks used in the experiment are:

REACH(R) = :
Minimize the distance of the TCP to the red cube. 
REACH(B) = :
Minimize the distance of the TCP to the blue cube. 
MOVE(R) = :
Move the red cube. 
MOVE(B) = :
Move the blue cube. 
LIFT(R) =
Increase the zcoordinate of the red cube to more than 5cm relative to the table. 
LIFT(B) =
Increase the zcoordinate of the blue cube to more than 5cm relative to the table. 
ABOVE_CLOSE(R, B) =
Bring the red cube to a position above of and close to the blue cube. 
ABOVE_CLOSE(B, R) =
Bring the blue cube to a position above of and close to the red cube. 
PILE(R):
Place the red cube on another object (touches the top). Only given when the cube doesn’t touch the robot or the table. 
PILE(B):
Place the blue cube on another object (touches the top). Only given when the cube doesn’t touch the robot or the table.
The sparse reward above(A, B) is given by comparing the bounding boxes of the two objects A and B. If the bounding box of object A is completely above the highest point of object B’s bounding box, above(A, B) is , otherwise above(A, B) is .
Entry  Dimensions  Unit 

Joint Position (Arm)  6  rad 
Joint Velocity (Arm)  6  rad/s 
Joint Position (Hand)  3  rad 
Joint Velocity (Hand)  3  rad/s 
TCP Position  3  m 
Touch Force (Fingers)  3  N 
Red Cube Pose  7  m, au 
Red Cube Velocity  6  m/s, dq/dt 
Red Cube Relative Position  3  m 
Blue Cube Pose  7  m, au 
Blue Cube Velocity  6  m/s, dq/dt 
Blue Cube Relative Position  3  m 
Lid Position  1  rad 
Lid Velocity  1  rad/s 
a.5.3 CleanUp
The CleanUp task is also taken from Riedmiller et al. [2018] and builds on the setup described for the Pile2 task. Besides the two cubes, the workspace contains an additional box with a moveable lid, that is always closed initially (see Figure 10). The agent’s goal is to clean up the scene by placing the cubes inside the box. In addition to the observations used in the Pile2 task, the agent observes the lid’s angle and it’s angle velocity.
Analogous to Pile2 and the SACX setup, we use 13 different task for CleanUp. The first 12 tasks are seen as auxiliary tasks, that the agent uses to learn the main task ALL_INSIDE_BOX. The tasks used in this experiments are:

REACH(R) = :
Minimize the distance of the TCP to the red cube. 
REACH(B) = :
Minimize the distance of the TCP to the blue cube. 
MOVE(R) = :
Move the red cube. 
MOVE(B) = :
Move the blue cube. 
NO_TOUCH =
Sparse binary reward, given when neither of the touch sensors is active. 
LIFT(R) =
Increase the zcoordinate of the red cube to more than 5cm relative to the table. 
LIFT(B) =
Increase the zcoordinate of the blue cube to more than 5cm relative to the table. 
OPEN_BOX =
Open the lid up to 85 degrees. 
ABOVE_CLOSE(R, BOX) =
Bring the red cube to a position above of and close to the box. 
ABOVE_CLOSE(B, BOX) =
Bring the blue cube to a position above of and close to the box. 
INSIDE(R, BOX) =
Place the red cube inside the box. 
INSIDE(B, BOX) =
Place the blue cube inside the box. 
INSIDE(ALL, BOX) =
Place the all cubes inside the box.
The sparse reward inside(A, BOX) is given by comparing the bounds of the object A and the box. If the bounding box of object A is completely within the box’s bounds inside(A, BOX) is , otherwise inside(A, BOX) is .
a.5.4 Single Task Results
We investigate the benefits of utilizing hierarchical policies in simulated, singletask environments to flexibly incorporate human domain knowledge. Formally, this scenario corresponds to a single task in the task set: . In this setting RHPO recovers a variant of MPO [Abdolmaleki et al., 2018b, a] with hierarchical policies; and we hence compare to MPO with a flat (nonhierarchical) policy as the baseline.
We consider two highdimensional tasks for continuous control: humanoidrun and humanoidstand from Tassa et al. [2018]. We compare a single Gaussian policy (i.e. MPO) to training a hierarchical policy, a mixture of Gaussians with three components. To incorporate domain knowledge, we evaluate hierarchical models using two different initialization schemes for the components, i.e, 1 We initialize all the components with zero mean and standard deviation of one (in line with the initialization of the single Gaussian). 2 We distribute the initial means of the three components (with a vector of 0, 1 and 1) such that the highlevel controller can select from low action and high action policies immediately. The categorical distribution is initialized as a uniform distribution over components.
In the presented setup learner and actors work asynchronously. To align the update rates of all approaches for a fair comparison, we additionally consider a setting with matching learner speed for single Gaussian and mixture of Gaussians^{3}^{3}3We found that optimizing a mixture of Gaussians computationally is more expensive than a single Gaussian resulting in less gradient step w.r.t. data generation rate in an asynchronous setup.. Figure 11 visualizes the results in terms of both the number of episodes and the gradient steps. As can be observed, the hierarchical policy with distributed means initialization leads to faster learning both in terms of gradient steps and generated data (with approximately aligned update rates). The improved performance in this scenario is clearly connected to the incorporation of domain knowledge which affects the initial policy structure.
a.6 Multitask Results
a.6.1 Pile1 – All Tasks
a.6.2 Pile2 – All Tasks
a.6.3 Cleanup2 – All Tasks
a.7 Physical Robot Pile1 – All Tasks
a.8 Additional Ablations Multitask
a.8.1 Number of Component Policies
a.8.2 Kullback Leibler Constraint Ablation
a.8.3 Data Rate Ablation
Reach 
Grasp 
Lift 
Place Wide 
Place Narrow 
Stack 
Stack and Leave 
a.8.4 Hierarchical Policies in Gradientbased RL
To test whether the benefits of a hierarchical policy transfer to a setting where a different algorithm is used to optimize the policy we performed additional experiments using SVG [Heess et al., 2015] in place of MPO. For this purpose we use the same hierarchical policy structure as for the MPO experiments but change the categorical to an implementation that enables reparameterization with the GumbelSoftmax trick [Maddison et al., 2016, Jang et al., 2016]. We then change the entropy regularization from Equation (15) to a KL towards a target policy (as entropy regularization did not give stable learning in this setting) and use a regularizer equivalent to the distance function (per component KL’s from Equation (12)) – using a multiplier of 0.05 for the regularization multiplier was found to be the best setting via a coarse grid search. This is similar to previous work on hierarchical RL with SVG [Tirumala et al., 2019]. The results of this experiment are depicted in Figure 19, as can be seen, for this simple domain results in mild improvements over standard SACU.