Efficient Bimanual Manipulation Using Learned Task Schemas
Abstract
We address the problem of effectively composing skills to solve sparse-reward tasks in the real world. Given a set of parameterized skills (such as exerting a force or doing a top grasp at a location), our goal is to learn policies that invoke these skills to efficiently solve such tasks. Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (a sequence of skills to execute) and a policy to choose the parameterizations of the skills in a state-dependent manner. For such tasks, we show that explicitly modeling the schema’s state-independence can yield significant improvements in sample efficiency for model-free reinforcement learning algorithms. Furthermore, these schemas can be transferred to solve related tasks, by simply re-learning the parameterizations with which the skills are invoked. We find that doing so enables learning to solve sparse-reward tasks on real-world robotic systems very efficiently. We validate our approach experimentally over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware. See videos at http://tinyurl.com/chitnisschema.
I Introduction
Let us consider the task of opening a bottle. How should a two-armed robot accomplish this? Even without knowing the bottle geometry, its position, or its orientation, one can answer that the task will involve holding the bottle’s base with one hand, grasping the bottle’s cap with the other hand, and twisting the cap off. This “schema,” the high-level plan of what steps need to be executed, only depends on the task and not on the object’s geometric and spatial state, which only influence how to parameterize each of these steps (e.g., deciding where to grasp, or how much to twist).
However, typical end-to-end reinforcement learning approaches do not leverage this kind of structure, and instead aim to solve tasks by learning a policy, which would involve inferring both the schema and the parameterizations, as a function of the raw sensory input. These approaches have led to impressive successes across domains such as game-playing [1, 2, 3, 4] and robotic control tasks [4, 5, 6, 7, 8, 9], but are known to have very high sample complexity. For instance, they require millions of frames of interaction to learn to play Atari games, or several weeks’ worth of experience to learn simulated control policies, which makes them impractical to train on real hardware.
In this work, we address the problem of learning to perform tasks in environments with a sparse reward signal, given a discrete set of generic skills parameterized by continuous arguments. Examples of skills include exerting a force at a location or moving an end effector to a target pose. The action space is hybrid discrete-continuous: at each timestep, the agent must decide both 1) which skill to use and 2) what continuous arguments to use for it (e.g., the location to apply force, the amount of force, or the target pose to move to). The sample inefficiency of current reinforcement learning methods is exacerbated in domains with these large search spaces; even basic tasks such as opening a bottle with two arms are challenging to learn from sparse rewards. While one could hand-engineer dense rewards, this is undesirable as it does not scale to more complicated tasks. We ask a fundamental question: can we use the given skills to efficiently learn policies for tasks with a large policy search space, like bimanual manipulation, given only sparse rewards?
Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (sequence of skills) and a state-dependent policy that chooses appropriate parameterizations for the different skills. Such a decomposition of the policy into state-dependent and state-independent parts simplifies the credit assignment problem and leads to more effective sharing of experience, as data from different instantiations of the task can be used to improve the same shared skills. This leads to faster learning.
This modularization can further allow us to transfer learned schemas among related tasks, even if they have different state spaces. For example, suppose we have learned a good schema for picking up a long bar in simulation, where we have access to object poses, geometry information, etc. We can then reuse that schema for a related task such as picking up a tray in the real world from only raw camera observations, even though both the state space and the optimal parameterizations (e.g., grasp poses) differ significantly. As the schema is fixed, policy learning for this tray pickup task will be very efficient, since it only requires learning the (observation-dependent) arguments for each skill. Transferring the schema in this way enables learning to solve sparse-reward tasks very efficiently, making it feasible to train real robots to perform complex skills. See Figure 2 for an overview of our approach.
We validate our approach over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware. We give the robots a very generic library of skills such as twisting, lifting, and reaching. Even given these skills, bimanual manipulation is challenging due to the large search space for policy optimization. We consider four task families: lateral lifting, picking, opening, and rotating, all with varying objects, geometries, and initial poses. All tasks have a sparse binary reward signal: 1 if the task is completed, and 0 otherwise. We empirically show that a) explicitly modeling schema state-independence yields large improvements in learning efficiency over the typical strategy of conditioning the policy on the full state, and b) transferring learned schemas to real-world tasks allows complex manipulation skills to be discovered within only a few hours (<10) of training on a single setup. Figure 1 shows some examples of real-world tasks solved by our system.
II Related Work
Search in hybrid discrete-continuous spaces. An agent equipped with a set of skills parameterized by continuous arguments must learn a policy that decides both which skills to use and what continuous arguments to use for them. Therefore, policy optimization requires searching in a hybrid discrete-continuous space. Learning in such a hybrid space has been addressed within the field of task and motion planning [10, 11, 12, 13], but these methods typically rely on hand-designed abstract representations of the state space in order to make use of classical planners. In contrast, we enable end-to-end reinforcement learning from raw images by building independence assumptions into our model. A separate line of work learns control policies for steps in a policy sketch [14], which can be recombined in novel ways to solve new task instances; however, this work does not consider the discrete search aspect of the problem, as we do.
Transfer learning for robotics. The idea of transferring a learned policy from simulation to the real world for more efficient robotic learning was first developed in the early 1990s [15, 16]. More recent techniques include learning from model ensembles [17] and utilizing domain randomization [18, 19, 20], in which physical properties of a simulated environment are randomized to allow learned policies to be robust. However, as these methods directly transfer the policy learned in simulation, they rely on the simulation being visually and physically similar to the real world. In contrast, we only transfer one part of our learned policy (the skill sequence to be executed) from simulation to the real world, and allow the associated continuous parameters to be learned in the real-world domain.
Temporal abstraction for reinforcement learning. The idea of using temporally extended actions to reduce the sample complexity of reinforcement learning algorithms has been studied for decades [21, 22, 23, 24]. For instance, work on macro-actions for MDPs [24] attempts to build a hierarchical model in which the primitive actions occupy the lowest level, and subsequently higher levels build local policies, each equipped with their own termination conditions, that make use of actions at the level below. In our work, the skills are parameterized, and therefore the agent must reason about not only which skills to apply, but also what arguments to use for the chosen skills.
Bimanual manipulation. Dual-arm manipulation tasks have been studied in classical control settings [25], and often rely on hybrid force-position control strategies to guide both manipulators [26, 27]. These tasks have also been addressed via learning from demonstration [28, 29, 30]. In our work, we do not rely on demonstrations, and we are able to learn control policies directly from raw sensory inputs (camera images) without relying on models of the environment, which are difficult to specify by hand.
III Approach
Given a set of parameterized skills, we aim to solve sparse-reward tasks by learning a policy that decides both which skill to execute and what arguments to use when invoking it. Our insight is that, for many tasks, the same sequence of skills (possibly with different arguments) can be used to optimally solve different instantiations of the task. We operationalize this by disentangling the policy into a state-independent task schema (sequence of skills) and a state-dependent prediction of how to parameterize these skills. We first formally define our problem setup, and then present our model for leveraging the state-independence of schemas to learn efficiently. Finally, we describe how our approach also allows transferring schemas across tasks, letting us learn real-world policies from raw images by reusing schemas learned for related tasks in simulation.
Table I: Objects considered for each task family, with the schema discovered from learning in simulation.
Task Family | Object (Sim) | Objects (Real) | Schema Discovered from Learning in Simulation
lateral lifting | bar | aluminum tray, rolling pin, heavy bar, plastic box | 1) L: top grasp, R: top grasp 2) L: lift, R: lift
picking | ball | soccer ball | 1) L: top grasp, R: go-to pose 2) L: no-op, R: go-to pose 3) L: lift, R: lift
opening | bottle | glass jar, water bottle | 1) L: top grasp, R: side grasp 2) L: twist, R: no-op
rotating | corkscrew | T-wrench, corkscrew | 1) L: go-to pose, R: side grasp 2) L: go-to pose, R: no-op 3) L: rotate, R: no-op
Problem Setup. Each task we consider is defined as a finite-horizon Markov decision process (MDP) [31], with a hybrid discrete-continuous action space $\mathcal{A}$ and time horizon $T$. The reward associated with each task is a binary function indicating whether the current state is an element of the set of desired goal configurations, such as a state with the bottle opened. The learning objective, therefore, is to obtain a policy $\pi$ that maximizes the expected proportion of times that following it achieves the goal. Note that this is a particularly challenging setup for reinforcement learning algorithms due to the sparse nature of the reward function.
The agent is given a discrete library of generic skills $\mathcal{X}$, where each skill $x \in \mathcal{X}$ is parameterized by a corresponding vector $v_x$ of continuous values. Examples of skills can include exerting a force at a location, moving an end effector to a target pose, or rotating an end effector about an axis. An action $a \in \mathcal{A}$ is therefore a tuple $\langle x, v_x \rangle$, indicating what skill to apply as well as the corresponding parameterization. A schema $\bar{x}$ is a sequence of $T$ skills in $\mathcal{X}$, where $\bar{x} = (x_1, \ldots, x_T)$ captures the sequence of skills but not their corresponding continuous parameterizations.
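To make these definitions concrete, the following is a minimal Python sketch of skills, actions, and schemas for a single arm. The class and function names (`Skill`, `Action`, `schema_of`) are hypothetical and not taken from the authors' code, and the numeric parameter values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Skill:
    """A generic skill that takes a fixed number of continuous parameters."""
    name: str
    num_params: int


@dataclass(frozen=True)
class Action:
    """One skill together with a concrete vector of continuous parameters."""
    skill: Skill
    params: Tuple[float, ...]

    def __post_init__(self) -> None:
        if len(self.params) != self.skill.num_params:
            raise ValueError(
                f"{self.skill.name} expects {self.skill.num_params} parameters, "
                f"got {len(self.params)}"
            )


def schema_of(actions: Sequence[Action]) -> Tuple[Skill, ...]:
    """Keep the sequence of skills and drop their parameterizations."""
    return tuple(action.skill for action in actions)


# Parameter counts follow the skill table: a top grasp takes an (x, y)
# position and a z-orientation; a lift takes a distance.
TOP_GRASP = Skill("top grasp", 3)
LIFT = Skill("lift", 1)

# Two executions with different (made-up) parameter values ...
episode_a = [Action(TOP_GRASP, (0.10, -0.20, 0.00)), Action(LIFT, (0.25,))]
episode_b = [Action(TOP_GRASP, (0.30, 0.05, 1.57)), Action(LIFT, (0.25,))]

# ... share the same schema.
assert schema_of(episode_a) == schema_of(episode_b) == (TOP_GRASP, LIFT)
```

The point of the sketch is that two episodes can differ in every continuous value and still map to the same schema, which is the quantity the approach treats as state-independent.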
Assumption. We assume that the optimal schema is state-independent: it depends only on the task, not on the state and its dynamics. This implies that the same schema is optimal for all instantiations of a task, e.g. different geometries and poses of objects. We note that this is a valid assumption across many tasks of interest, since the skills themselves can be appropriately chosen to be complicated and expressive, such as stochastic, closed-loop control policies for guiding an end effector.
Modular Policies. The agent must learn a policy that, at each timestep, infers both which skill to use (a discrete choice) and what continuous arguments to use.
What is a good form for such a policy? A simple strategy, which we use as a baseline and depict in Figure 3 (top), would be to represent the policy $\pi$ via a neural network, with weights $\theta$, that takes the state as input and has a two-headed output. One head predicts logits that represent a categorical distribution over the skills $\mathcal{X}$, while the other head predicts a mean and variance of a Gaussian distribution over continuous argument values for all skills. To sample an action, we can sample a skill $x$ from the logits predicted by the first head, then sample arguments using the subset of means and variances predicted by the second head that correspond to $x$.
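The sampling procedure for this baseline can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the two heads' outputs are passed in as plain Python lists (standing in for the network's output for one state), and the function name `sample_baseline_action` and all numeric values are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_baseline_action(
    skill_logits: Sequence[float],
    arg_means: Sequence[Sequence[float]],
    arg_vars: Sequence[Sequence[float]],
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """Sample (skill index, arguments) from the two heads' outputs.

    skill_logits: first head, one logit per skill.
    arg_means / arg_vars: second head, one list of Gaussian means and
    variances per skill, covering the arguments of every skill.
    """
    probs = softmax(skill_logits)
    skill = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Use only the slice of the second head that belongs to the sampled skill.
    args = [
        rng.gauss(mean, math.sqrt(var))
        for mean, var in zip(arg_means[skill], arg_vars[skill])
    ]
    return skill, args


# Made-up head outputs for three skills with 3, 1, and 0 arguments.
rng = random.Random(0)
skill, args = sample_baseline_action(
    skill_logits=[2.0, 0.5, -1.0],
    arg_means=[[0.1, 0.2, 0.0], [0.25], []],
    arg_vars=[[0.01, 0.01, 0.04], [0.0025], []],
    rng=rng,
)
print(skill, args)
```

Note that in this baseline both heads are functions of the state, so the skill distribution can change from one task instance to the next.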
However, this does not model the fact that the optimal schema is state-independent. To capture this, we need to remove the dependence of the discrete skill selection on the input state. Thus, we propose to maintain a separate $T \times |\mathcal{X}|$ array, where row $t$ is the logits of a categorical distribution over which skill to use at time $t$. Note that $T$ is the horizon of the MDP. In this architecture, the neural network is only tasked with predicting the skill arguments. The array of logits and the neural network, taken together, represent the policy $\pi$, as depicted in Figure 3 (bottom).
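A minimal sketch of this modular policy is given below, assuming a single arm for brevity. The class `SchemaPolicy`, the callable `arg_net`, and the toy predictor are hypothetical stand-ins: in the paper the argument predictor is a neural network, whereas here it is an ordinary Python function so that the example stays self-contained.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for the network: state -> per-skill (means, variances).
ArgNet = Callable[[Sequence[float]], Tuple[List[List[float]], List[List[float]]]]


class SchemaPolicy:
    """State-independent skill logits plus a state-dependent argument predictor."""

    def __init__(self, horizon: int, num_skills: int, arg_net: ArgNet) -> None:
        # One row of logits per timestep; zeros give a uniform initial choice.
        self.logits = [[0.0] * num_skills for _ in range(horizon)]
        self.arg_net = arg_net

    def skill_probs(self, t: int) -> List[float]:
        """Distribution over skills at time t; note there is no state argument."""
        row = self.logits[t]
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(
        self, t: int, state: Sequence[float], rng: random.Random
    ) -> Tuple[int, List[float]]:
        probs = self.skill_probs(t)
        skill = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        means, variances = self.arg_net(state)  # only the arguments see the state
        args = [
            rng.gauss(m, math.sqrt(v))
            for m, v in zip(means[skill], variances[skill])
        ]
        return skill, args


def toy_arg_net(state: Sequence[float]):
    """Toy predictor: centers skill 0's two arguments on the first two state entries."""
    return [[state[0], state[1]], []], [[0.01, 0.01], []]


policy = SchemaPolicy(horizon=2, num_skills=2, arg_net=toy_arg_net)
policy.logits[0] = [3.0, 0.0]  # prefer skill 0 at the first timestep
rng = random.Random(0)
print(policy.skill_probs(0))
print(policy.sample(0, state=[0.4, -0.1], rng=rng))
```

The design choice to highlight is that `skill_probs` takes only the timestep, so every task instance shares the same skill distribution, while the state enters only through the argument predictor.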
Learning Schemas and Skill Arguments. The weights $\theta$ of the neural network can be updated via standard policy gradient methods. Let $\tau$ denote a trajectory induced by following $\pi_\theta$ in an episode. The objective we wish to maximize is $J(\theta) \equiv \mathbb{E}_{\tau}[r(\tau)]$. Policy gradient methods such as REINFORCE [32] leverage the likelihood ratio trick, which says that $\nabla_\theta J(\theta) = \mathbb{E}_{\tau}[r(\tau) \nabla_\theta \log \pi_\theta(\tau)]$, to tune $\theta$ via gradient ascent. When estimating this gradient, we treat the current setting of the array of logits as a constant.
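For reference, the likelihood ratio identity follows in two lines. The notation is a standard choice rather than something fixed by the text: $\pi_\theta(\tau)$ denotes the probability (density) of trajectory $\tau$ under the policy with weights $\theta$, $r(\tau)$ is the binary return of the trajectory, the integral is shorthand for summing over discrete skill choices and integrating over continuous arguments, and $\tau_1, \ldots, \tau_N$ are $N$ sampled trajectories used for a Monte Carlo estimate.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau
   = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ r(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
   \approx \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)\, \nabla_\theta \log \pi_\theta(\tau_i)
\end{aligned}
```

The second equality uses $\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau)$. Since $r(\tau) \in \{0, 1\}$, only successful trajectories contribute to this plain estimator, which is one way to see why sparse rewards make learning slow.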
Updating the logits within the array can also be achieved via policy gradients; however, since there is no input, and because we have sparse rewards, the policy optimization procedure is quite simple. Let $\ell_{t,x}$ be the logit for time $t$ and skill $x$. Given trajectory $\tau$:
• If $\tau$ achieves the goal, i.e. $r(\tau) = 1$, increase $\ell_{t,x}$ for each timestep $t$ and skill $x$ taken at that timestep.
• If $\tau$ does not achieve the goal, i.e. $r(\tau) = 0$, decrease $\ell_{t,x}$ for each timestep $t$ and skill $x$ taken at that timestep.
The amount by which to increase or decrease is absorbed by the step size and thus gets tuned as a hyperparameter. See Algorithm 1 for full pseudocode.
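The update rule above can be sketched in a few lines of Python. This is one reading of the rule, not a reproduction of Algorithm 1: the function names and the step size of 0.1 are assumptions, and skills are represented by integer column indices for a single arm.

```python
from typing import List, Sequence


def update_schema_logits(
    logits: List[List[float]],
    skills_taken: Sequence[int],
    reached_goal: bool,
    step_size: float = 0.1,  # assumed value; tuned as a hyperparameter
) -> None:
    """Nudge the logits of the skills used in one episode, in place.

    logits[t][x] is the logit for choosing skill x at timestep t.
    skills_taken[t] is the index of the skill executed at timestep t.
    """
    delta = step_size if reached_goal else -step_size
    for t, skill in enumerate(skills_taken):
        logits[t][skill] += delta


def most_likely_schema(logits: List[List[float]]) -> List[int]:
    """Return the highest-logit skill index at every timestep."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in logits]


# Horizon 2, three skills, all logits start at zero.
logits = [[0.0] * 3 for _ in range(2)]
update_schema_logits(logits, [0, 1], reached_goal=True)
update_schema_logits(logits, [0, 1], reached_goal=True)
update_schema_logits(logits, [2, 2], reached_goal=False)
print(most_likely_schema(logits))  # → [0, 1]
```

Because the array has no state input, every episode of every task instance updates the same handful of numbers, which is where the sharing of experience comes from.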
Schema Transfer Across Tasks. Since we have disentangled the learning of the schema from the learning of the skill arguments within our policy architecture, we can now transfer the array of logits across related tasks, as long as the skill spaces and horizons are equal. Therefore, learning for a new task can be made efficient by reusing a previously learned schema, since we would only need to train the neural network weights to infer skill arguments for that new task.
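The compatibility condition for transfer can be made explicit in code. The sketch below is hypothetical: `transfer_schema` is not a function from the authors' code, the logit values are made up, and a single arm is used for brevity.

```python
import copy
from typing import List, Sequence


def transfer_schema(
    source_logits: List[List[float]],
    source_skills: Sequence[str],
    target_skills: Sequence[str],
    target_horizon: int,
) -> List[List[float]]:
    """Copy learned schema logits to a new task after checking compatibility."""
    if list(source_skills) != list(target_skills):
        raise ValueError("skill spaces differ; the logit columns would not line up")
    if len(source_logits) != target_horizon:
        raise ValueError("horizons differ; the logit rows would not line up")
    # Deep copy so that later changes to one task's array do not affect the other.
    return copy.deepcopy(source_logits)


skills = ["top grasp", "side grasp", "twist", "no-op"]
sim_logits = [[2.0, 0.1, -1.0, 0.0], [-0.5, 0.0, 1.5, 0.2]]  # made-up values
real_logits = transfer_schema(sim_logits, skills, skills, target_horizon=2)
print(real_logits == sim_logits, real_logits is sim_logits)
```

Nothing in the check refers to the state space, which is why the source and target tasks are free to use entirely different observations.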
Importantly, transferring the schema is reasonable even when the tasks have different state spaces. For instance, one task can be a set of simulated bimanual bottle-opening problems in a low-dimensional state space, while the other involves learning to open bottles in the real world from high-dimensional camera observations. As the state spaces can be different, it follows immediately that the tasks can also have different optimal arguments for the skills.
IV Experiments
We test our proposed approach on four robotic bimanual manipulation task families: lateral lifting, picking, opening, and rotating. Table I lists the different objects that we considered for each one. These task families were chosen because they represent a challenging hybrid discrete-continuous search space for policy optimization, while meeting our requirement that the optimal schema is independent of the state. We show results on these tasks both in simulation and on real Sawyer arms: schemas are learned in simulation by training with low-dimensional state inputs, then transferred as-is to visual inputs (in simulation as well as in the real world), for which we only need to learn skill arguments. Our experiments show that our proposed approach is significantly more sample-efficient than one that uses the baseline policy architecture, and allows us to learn bimanual policies on real robots in less than 10 hours of training. We first describe the experimental setup, then discuss our results.
IV-A MuJoCo Experimental Setup
Environment. For all four task families, two Sawyer robot arms with parallel-jaw grippers are placed at opposing ends of a table, facing each other. A single object is placed on the table, and the goal is to manipulate the object’s pose in a task-specific way.
Lateral lifting (bar): The goal is to lift a heavy and long bar by 25 cm while maintaining its orientation. We vary the bar’s location and density.
Picking (ball): The goal is to lift a slippery (low coefficient of friction) ball vertically by 25 cm. The ball slips out of the gripper when grasped by a single arm. We vary the ball’s location and coefficient of friction.
Opening (bottle): The goal is to open a bottle implemented as two links (a base and a cap) connected by a hinge joint. If the cap is twisted without the base being held in place, the entire bottle twists. The cap must undergo a quarter-rotation while the base maintains its pose. We vary the bottle’s location and size.
Rotating (corkscrew): The goal is to rotate a corkscrew implemented as two links (a base and a handle) connected by a hinge joint, like the bottle. The handle must undergo a half-rotation while the base maintains its pose. We vary the corkscrew’s location and size.
Skills. The skills we use are detailed in Table II, and the search spaces for the skill parameters are detailed in Table III. Note that because we have two arms, we actually need to search over a cross product of this space with itself.
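To give a sense of scale for the discrete part of this search, the sketch below counts joint skill choices for two arms using the per-family skill allowances from Table II. The dictionary and function names are hypothetical, and the horizon of 3 in the last lines is an assumed value for illustration only. Continuous parameters are ignored here and enlarge the search space further.

```python
from itertools import product
from typing import List, Tuple

# Skills allowed in each task family, read off the skill table.
SKILLS_BY_FAMILY = {
    "lateral lifting": ["top grasp", "lift", "no-op"],
    "picking": ["top grasp", "go-to pose", "lift", "no-op"],
    "opening": ["top grasp", "side grasp", "twist", "no-op"],
    "rotating": ["side grasp", "go-to pose", "rotate", "no-op"],
}


def joint_skill_choices(family: str) -> List[Tuple[str, str]]:
    """All (left arm, right arm) skill pairs available at one timestep."""
    skills = SKILLS_BY_FAMILY[family]
    return list(product(skills, repeat=2))


def num_schemas(family: str, horizon: int) -> int:
    """Number of distinct two-arm skill sequences of the given length."""
    return len(joint_skill_choices(family)) ** horizon


for family in SKILLS_BY_FAMILY:
    print(family, len(joint_skill_choices(family)))

# With an assumed horizon of 3 (hypothetical value):
print(num_schemas("lateral lifting", 3), num_schemas("opening", 3))
```

Running this gives 9 joint choices per timestep for lateral lifting and 16 for each of the other three families, so even short horizons produce hundreds to thousands of candidate schemas before any continuous argument is chosen.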
State and Policy Representation. Experiments conducted in the MuJoCo simulator [34] use a low-dimensional state: proprioceptive features (joint positions, joint velocities, end effector pose) for each arm, the current timestep, geometry information for the object, and the object pose in the world frame and each end effector’s frame. The policy is represented as a 4-layer MLP with 64 neurons in each layer, ReLU activations, and a multi-headed output for the actor and the critic. Since object geometry and pose can only be computed within the simulator, our real-world experiments will instead use raw RGB camera images.
Table II: Skills, the task families in which each is allowed, and their continuous parameters.
Skill | Allowed Task Families | Continuous Parameters
top grasp | lateral lifting, picking, opening | (x, y) position, z-orientation
side grasp | opening, rotating | (x, y) position, approach angle
go-to pose | picking, rotating | (x, y) position, orientation
lift | lateral lifting, picking | distance to lift
twist | opening | none
rotate | rotating | rotation axis, rotation radius
no-op | all | none
Table III: Search spaces for the skill parameters.
Parameter | Relevant Skills | Search Space (Sim) | Search Space (Real)
(x, y) position | grasps, go-to pose | | location on table surface
z-orientation | top grasp | |
approach angle | side grasp | |
orientation | go-to pose | |
distance to lift | lift | |
rotation axis | rotate | | location on table surface
rotation radius | rotate | |
Training Details. We use the Stable Baselines [35] implementation of proximal policy optimization (PPO) [36], though our method is agnostic to the choice of policy gradient algorithm. We use the following hyperparameters: Adam [33] with learning rate , clipping parameter , entropy loss coefficient , value function loss coefficient , gradient clip threshold , number of steps , number of minibatches per update , and number of optimization epochs . Our implementation builds on the Surreal Robotics Suite [37]. Training is parallelized across 50 workers. The time horizon in all tasks.
IV-B Real-World Sawyer Experimental Setup
Environment. Our real-world setup also contains two Sawyer robot arms with parallel-jaw grippers placed at opposing ends of a table, facing each other. We task the robots with manipulating nine common household objects that require two parallel-jaw grippers to interact with. We consider the same four task families (lateral lifting, picking, opening, and rotating), but work with more diverse objects (such as a rolling pin, soccer ball, glass jar, and T-wrench), as detailed in Table I. For each task family, we use the schema discovered for that family in simulation, and only learn the continuous parameterizations of the skills in the real world. See Figure 1 for pictures of some of our tasks.
Skills. The skills and parameters are the same as in simulation (Table II), but the search spaces are less constrained (Table III) since we do not have access to object poses.
State and Policy Representation. The state for these real-world tasks is the RGB image obtained from an overhead camera that faces directly down at the table. To predict the continuous arguments, we use a fully convolutional spatial neural network architecture [38], as shown in Figure 4 along with example response maps.
Training Details. We use PPO and mostly the same hyperparameters, with the following differences: learning rate , number of steps , number of minibatches per update , number of optimization epochs , and no parallelization. We control the Sawyers using PyRobot [40].
IV-C Results in Simulation
Figure 5 shows that our policy architecture greatly improves the sample efficiency of model-free reinforcement learning. In all simulated environments, our method learns the optimal schema, as shown in the last column of Table I. Much of the difficulty in these tasks stems from sequencing the skills correctly, and so our method, which more effectively shares experience across task instantiations in its attempt to learn the task schema, performs very well.
Before transferring the learned schemas to the real-world tasks, we consider learning from rendered images in simulation, using the architecture from Figure 4 to process them. Figure 6 shows the impact of transferring the schema versus re-learning it in this more realistic simulation setting. We see that when learning visual policies, transferring the schemas learned in the tasks with low-dimensional state spaces is critical to efficient training. These results increase our confidence that transferring the schema will enable efficient real-world training with raw RGB images, as we show next.
IV-D Results in Real World
Figure 7 shows our results on the nine real-world tasks, with schemas transferred from the simulated tasks. We can see that, despite the challenging nature of the problem (learning from raw camera images, given sparse rewards), our system is able to learn to manipulate most objects in around 4-10 hours of training. We believe that our approach can be useful for sample-efficient learning in problems other than manipulation as well; all one needs is to define skills appropriate for the environment such that the optimal sequence depends only on the task, not the (dynamic) state. The skills may themselves be parameterized closed-loop policies.
Please see the supplementary video for examples of learned behavior on the real-world tasks.
V Future Work
In this work, we have studied how to leverage state-independent sequences of skills to greatly improve the sample efficiency of model-free reinforcement learning. Furthermore, we have shown experimentally that transferring sequences of skills learned in simulation to real-world tasks enables us to solve sparse-reward problems from images very efficiently, making it feasible to train real robots to perform complex skills such as bimanual manipulation.
An important avenue for future work is to relax the assumption that the optimal schema is open-loop. For instance, one could imagine predicting the schema via a recurrent mechanism, so that the decision on what skill to use at time $t$ is conditioned on the skill used at time $t-1$. Another interesting future direction is to study alternative approaches to training the state-independent schema predictor.
Acknowledgments
We would like to thank Dhiraj Gandhi for help with experimental setup. Rohan is supported by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
References
 [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [2] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI conference on artificial intelligence, 2016.
 [3] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.
 [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
 [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
 [6] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [7] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396.
 [8] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
 [9] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
 [10] Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez, “Active model learning and diverse action sampling for task and motion planning,” arXiv preprint arXiv:1803.00967, 2018.
 [11] R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel, “Guided search for task and motion plans using learned heuristics,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 447–454.
 [12] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez, “Learning to guide task and motion planning using score-space representation,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2810–2817.
 [13] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combining neural networks and tree search for task and motion planning in challenging environments,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6059–6066.
 [14] J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 166–175.
 [15] Y. Davidor, Genetic Algorithms and Robotics: A heuristic strategy for optimization. World Scientific, 1991, vol. 1.
 [16] E. Gat, “On the role of simulation in the study of autonomous mobile robots,” in AAAI-95 Spring Symposium on Lessons Learned from Implemented Software Architectures for Physical Agents, 1995.
 [17] I. Mordatch, K. Lowrey, and E. Todorov, “Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 5307–5314.
 [18] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
 [19] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4243–4250.
 [20] F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.
 [21] M. Asada, S. Noda, S. Tawaratsumida, and K. Hosoda, “Purposive behavior acquisition for a real robot by vision-based reinforcement learning,” Machine learning, vol. 23, no. 2-3, pp. 279–303, 1996.
 [22] L. Chrisman, “Reasoning about probabilistic actions at multiple levels of granularity,” in AAAI Spring Symposium: Decision-Theoretic Planning, 1994.
 [23] P. Dayan and G. E. Hinton, “Feudal reinforcement learning,” in Advances in neural information processing systems, 1993, pp. 271–278.
 [24] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier, “Hierarchical solution of Markov decision processes using macro-actions,” in Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1998, pp. 220–229.
 [25] C. Smith, Y. Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V. Dimarogonas, and D. Kragic, “Dual arm manipulation – a survey,” Robotics and Autonomous systems, vol. 60, no. 10, pp. 1340–1353, 2012.
 [26] P. Hsu, “Coordinated control of multiple manipulator systems,” IEEE Transactions on Robotics and Automation, vol. 9, no. 4, pp. 400–410, 1993.
 [27] N. Xi, T.-J. Tarn, and A. K. Bejczy, “Intelligent planning and control for multi-robot coordination: An event-based approach,” IEEE transactions on robotics and automation, vol. 12, no. 3, pp. 439–452, 1996.
 [28] R. Zollner, T. Asfour, and R. Dillmann, “Programming by demonstration: Dual-arm manipulation tasks for humanoid robots,” in 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), vol. 1. IEEE, 2004, pp. 479–484.
 [29] E. Gribovskaya and A. Billard, “Combining dynamical systems control and programming by demonstration for teaching discrete bimanual coordination tasks to a humanoid robot,” in 2008 3rd ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2008, pp. 33–40.
 [30] O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters, “Towards learning hierarchical skills for multiphase manipulation tasks,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1503–1510.
 [31] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
 [32] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
 [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [34] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
 [35] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines,” https://github.com/hill-a/stable-baselines, 2018.
 [36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [37] L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, “SURREAL: Open-source reinforcement learning framework and robot manipulation benchmark,” in Conference on Robot Learning, 2018.
 [38] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
 [39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [40] A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta, “PyRobot: An open-source robotics framework for research and benchmarking,” arXiv preprint arXiv:1906.08236, 2019.