Learning One-Shot Imitation from Humans without Humans
Humans can naturally learn to execute a new task by seeing it performed by other individuals once, and then reproduce it in a variety of configurations. Endowing robots with this ability to imitate humans from a third-person view is an immediate and natural way of teaching new tasks. Only recently, through meta-learning, have there been successful attempts at one-shot imitation learning from humans; however, these approaches require a lot of human resources to collect the data in the real world to train the robot. But is there a way to remove the need for real-world human demonstrations during training? We show that with Task-Embedded Control Networks, we can infer control policies by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Importantly, we do not use a real human arm to supply demonstrations during training, but instead leverage domain randomisation in an application that has not been seen before: sim-to-real transfer on humans. Upon evaluating our approach on pushing and placing tasks in both simulation and in the real world, we show that, in comparison to a system that was trained on real-world data, we are able to achieve similar results by utilising only simulation data. Videos can be found at https://sites.google.com/view/tecnets-humans.
Humans are able to learn how to perform a task by simply observing their peers performing it once; this is a highly desirable behaviour for robots, as it would allow the next generation of robotic systems, even in households, to be easily taught tasks, without additional technology or long interaction times. Endowing a robot with the ability to learn from a single human demonstration, rather than through teleoperation, would allow for more seamless human-robot interaction.
Previous work has investigated hand-engineered systems which track movements and specify a mapping between the human and robot domains [25, 43]. Rather than explicitly hand-engineered systems, an emerging trend in robotics is to instead learn control directly from raw sensor data in an end-to-end manner. These systems operate well when close and complicated coordination is required between vision and control [26]. Domain-Adaptive Meta-Learning (DAML) is a recent approach that uses an end-to-end method for one-shot imitation of humans [44], which leveraged a large amount of prior meta-training data collected for many different tasks. This approach required thousands of examples across many tasks during meta-training: these examples are videos of a person physically performing the tasks and teleoperated robot demonstrations, meaning that there has to be an active and long human presence when collecting the dataset. Is there a way to reduce, or even eliminate, the amount of human presence that is needed when collecting datasets that require footage of humans? We propose that the recent successes in visual simulation-to-reality transfer [16, 27, 5, 19] suggest there is a way.
To that end, we present an approach to the one-shot human imitation learning problem which does not require active manual intervention during training, thus saving tens or hundreds of researcher hours. We show that the recent work on Task-Embedded Control Networks (TecNets) [15] can be used to infer control policies by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Rather than using real humans to supply demonstrations during training, we instead leverage domain randomisation in an application that has not been seen before: sim-to-real transfer on humans. After training, we are able to deploy a system in the real world which can perform a previously unseen task in a new configuration after a single real-world human demonstration. Our approach, which is summarised in Figure 1, is evaluated on pushing and placing tasks in both simulation and in the real world. We show we are able to achieve similar results to a system trained on real-world data. Moreover, we show that our approach remains robust to visual domain shifts, such as a substantially different background, between the human demonstrator and the robot agent performing the task.
II Related Work
Imitation learning aims to learn tasks by observing a demonstrator, and can broadly be classified into two key areas: (1) behaviour cloning, where an agent learns a mapping from observations to actions given demonstrations [30, 34], and (2) inverse reinforcement learning [28], where an agent attempts to estimate a reward function that describes the given demonstrations [1, 11]. In this work we concentrate on the former. The majority of work in behaviour cloning operates on a set of configuration-space trajectories that can be collected via tele-operation [7, 45], kinesthetic teaching [2, 29], sensors on a human demonstrator [9, 10, 6, 22], through motion planners [16], or even by observing humans directly. Expanding further on the latter, learning by observing humans has previously been achieved through hand-designed mappings between human actions and robot actions [25, 43, 35], visual activity recognition and explicit hand tracking [24, 31], and more recently by a system that infers actions from a single video of a human via an end-to-end trained system [44].
One-shot and few-shot learning is the paradigm of learning from a small number of examples at test time, and has been widely studied in the image recognition community [42, 21, 37, 32, 40, 38]. Our approach is based on James et al. [15], which comes under the domain of metric learning [23, 4]. There is an abundance of work in metric learning, including Matching Networks [42], which use an attention mechanism over a learned embedding space to produce a weighted nearest neighbour classifier given labelled examples called a support set and unlabelled examples called a query set. Prototypical Networks [38] are similar, but differ in that they represent each class by the mean of its examples, the prototype, and use a squared Euclidean distance rather than the cosine distance. In the case of one-shot learning, matching networks and prototypical networks become equivalent.
Sim-to-real methods attempt to address the domain gap in both visuals and dynamics between simulation and the real world, which reduces the need for expensive real-data collection. It has been shown that naively transferring skills between the two domains is not possible [18], resulting in numerous attempts at transfer methods in both computer vision and robotics. Domain randomisation, which applies random textures, lighting, and camera positions to the simulated scenes, has seen great success in numerous vision-based robotics applications [36, 39, 16, 27, 5]. This method allows the algorithm operating on these randomised scenes to become invariant to domain differences that appear in the real world. Rather than directly operating on randomised images, RCAN [19] is a recent approach that instead translates randomised rendered images into their equivalent non-randomised, canonical versions, producing superior results on a complex sim-to-real grasping task. Rather than operating on RGB images, other works have instead used depth images to cross the domain gap [41, 14]; however, in our tasks, the colour of an object is an important feature when inferring what object the robot needs to interact with, particularly when the geometry of the objects is very similar. In our work, we show that domain randomisation can be leveraged to transfer the ability to infer actions from human demonstrations.
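To make the prototype idea concrete, the following minimal Python sketch classifies a query embedding by its nearest class prototype under squared Euclidean distance. The function names and the toy two-dimensional embeddings are illustrative assumptions, not taken from the cited works. Note that a class with a single support example has that example as its prototype, which is the one-shot case mentioned above.

```python
def prototype(examples):
    """Mean of a list of equal-length embedding vectors."""
    n = len(examples)
    return [sum(dim) / n for dim in zip(*examples)]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, support):
    """support maps a class label to a list of embedded examples."""
    prototypes = {label: prototype(ex) for label, ex in support.items()}
    return min(prototypes, key=lambda label: squared_euclidean(query, prototypes[label]))


# Hypothetical embeddings: two examples of "cup", one example of "bowl".
support = {"cup": [[0.0, 1.0], [0.2, 0.8]], "bowl": [[1.0, 0.0]]}
predicted = classify([0.9, 0.1], support)
```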
Our approach builds on Task-Embedded Control Networks (TecNets), which we summarise in the following. The method is composed of a task-embedding network and a control network that are jointly trained to output actions (e.g. motor velocities) for a new variation of a task, given one or more demonstrations of it. Using these demonstrations, the task-embedding network has the responsibility of learning a compact representation of a task, which we call a sentence. The control network then takes this static sentence along with current observations of the world to output actions on a variation of the same task. TecNets do not have a strict restriction on the number of tasks that can be learned, and do not easily forget previously learned tasks during training, or after. The setup only expects the observations (e.g. visual) from the demonstrator during test time, which makes it very applicable for learning from human demonstrations.
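Formally, a policy $\pi$ for task $\mathcal{T}$ maps observations $o$ to actions $a$, and we assume access to expert policies $\pi^*$ for multiple different tasks. Corresponding example trajectories consist of a series of observations and actions, $\tau = [(o_1, a_1), \ldots, (o_T, a_T)]$, and we define each task to be a set of such examples, $\mathcal{T} = \{\tau_1, \ldots, \tau_K\}$. TecNets aim to learn a universal policy $\pi(o, s)$ that can be modulated by a sentence $s$, where $s$ is a learned description of a task $\mathcal{T}$. The resulting universal policy $\pi(o, s)$ should emulate the expert policy $\pi^*$ for task $\mathcal{T}$.
For training, we sample two disjoint sets of examples for every task $\mathcal{T}^j$: a support set $\mathcal{T}^j_U$ and a query set $\mathcal{T}^j_q$. The support set is used to compute a combined sentence $s^j_U$ for the task, by taking the normalised mean of the sentence for each example:
$$ s^j_U = \Big[ \frac{1}{|\mathcal{T}^j_U|} \sum_{\tau_k \in \mathcal{T}^j_U} f_\theta(\tau_k) \Big]^{\wedge} \tag{1} $$
where $v^{\wedge} = v / \lVert v \rVert$ and where $f_\theta$ is the embedding network. Using a combination of the cosine distance between points and the hinge rank loss (inspired by [13]), the loss for a query set $\mathcal{T}^j_q$ is defined as:
$$ \mathcal{L}_{emb} = \sum_{\tau_k \in \mathcal{T}^j_q} \sum_{\mathcal{T}^i \neq \mathcal{T}^j} \max\big[0,\ \text{margin} - s^j_k \cdot s^j_U + s^j_k \cdot s^i_U\big] \tag{2} $$
where $s^j_k = f_\theta(\tau_k)^{\wedge}$ is the sentence of a single query example. This loss helps to learn an embedding space in which tasks that are visually and semantically similar are also close together. Additionally, given a sentence $s^j_U$, computed from the support set $\mathcal{T}^j_U$, as well as examples from the query set $\mathcal{T}^j_q$, the following behaviour-cloning loss for the policy $\pi$ can be computed:
$$ \mathcal{L}_{ctr}(\mathcal{T}^j_q) = \sum_{\tau_k \in \mathcal{T}^j_q} \sum_{(o, a) \in \tau_k} \lVert \pi(o, s^j_U) - a \rVert_2^2 \tag{3} $$
It was found that having the control network also predict the action for the examples in the support set leads to increased performance. Thus, the final loss is:
$$ \mathcal{L}_{Tec} = \lambda_{emb} \mathcal{L}_{emb} + \lambda^{U}_{ctr} \mathcal{L}_{ctr}(\mathcal{T}^j_U) + \lambda^{q}_{ctr} \mathcal{L}_{ctr}(\mathcal{T}^j_q) \tag{4} $$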
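As an illustration of the embedding objective described in this section, the sketch below computes a task sentence as the normalised mean of per-example embeddings and evaluates a hinge rank loss on dot-product similarity. It is a minimal sketch under stated assumptions: the two-dimensional vectors stand in for embedding-network outputs, the margin of 0.1 is a placeholder value, and all function names are hypothetical rather than the authors' implementation.

```python
import math


def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def task_sentence(support_embeddings):
    """Normalised mean of the per-example embeddings of one task."""
    n = len(support_embeddings)
    mean = [sum(dim) / n for dim in zip(*support_embeddings)]
    return normalise(mean)


def embedding_loss(query_embeddings, own_sentence, other_sentences, margin):
    """Hinge rank loss: each query should be more similar (by dot product)
    to its own task sentence than to any other task's sentence, by at
    least the margin."""
    loss = 0.0
    for q in query_embeddings:
        s_k = normalise(q)
        for s_other in other_sentences:
            loss += max(0.0, margin - dot(s_k, own_sentence) + dot(s_k, s_other))
    return loss


# Toy embeddings for two hypothetical tasks (assumed values).
s_push = task_sentence([[1.0, 0.1], [0.9, -0.1]])
s_place = task_sentence([[0.0, 1.0], [0.1, 0.9]])

# A query that matches its task incurs no loss; a mismatched one is penalised.
loss_good = embedding_loss([[1.0, 0.0]], s_push, [s_place], margin=0.1)
loss_bad = embedding_loss([[0.0, 1.0]], s_push, [s_place], margin=0.1)
```

Because the sentences are unit length, the dot product here equals the cosine similarity, which is why the two descriptions of the loss are interchangeable.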
IV Learning From Humans Using TecNets
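We expand on the TecNets method introduced in the previous section by incorporating the notion of a human demonstrator, as summarised in Figure 2. We slightly modify the definition of a task to instead include two collections of examples: a human demonstrator collection $\mathcal{T}_H$ and a robot agent collection $\mathcal{T}_A$, such that $\mathcal{T} = \mathcal{T}_H \cup \mathcal{T}_A$.
From this, we now pick three disjoint sets of examples (rather than the original two) for every task $\mathcal{T}^j$: a support set of human examples $\mathcal{T}^j_{U,H}$, a query set of human examples $\mathcal{T}^j_{q,H}$, and a set of robot examples $\mathcal{T}^j_{A}$.
In analogy to Eq. (1), a combined sentence $s^j_U$ for a task is computed by taking the normalised mean of the sentence for each example in the support set of human examples $\mathcal{T}^j_{U,H}$:
$$ s^j_U = \Big[ \frac{1}{|\mathcal{T}^j_{U,H}|} \sum_{\tau_k \in \mathcal{T}^j_{U,H}} f_\theta(\tau_k) \Big]^{\wedge} \tag{5} $$
where $v^{\wedge} = v / \lVert v \rVert$ and $f_\theta$ is the embedding network.
We then train the embedding model to produce a higher dot-product similarity between the embedded human example $s^j_k = f_\theta(\tau_k)^{\wedge}$ of a task and its sentence $s^j_U$ than between $s^j_k$ and the sentences $s^i_U$ computed from human demonstrations of other tasks $\mathcal{T}^i$:
$$ \mathcal{L}_{emb} = \sum_{\tau_k \in \mathcal{T}^j_{q,H}} \sum_{\mathcal{T}^i \neq \mathcal{T}^j} \max\big[0,\ \text{margin} - s^j_k \cdot s^j_U + s^j_k \cdot s^i_U\big] \tag{6} $$
Additionally, given a sentence $s^j_U$, computed from the support set $\mathcal{T}^j_{U,H}$, as well as examples from the robot set $\mathcal{T}^j_{A}$ for the same task, we can compute the following behaviour-cloning loss for the policy $\pi(o, s)$, which maps an observation $o$ and a sentence $s$ to an action $a$:
$$ \mathcal{L}_{ctr}(\mathcal{T}^j_{A}) = \sum_{\tau_k \in \mathcal{T}^j_{A}} \sum_{(o, a) \in \tau_k} \lVert \pi(o, s^j_U) - a \rVert_2^2 \tag{7} $$
The final loss is the combination of the embedding loss $\mathcal{L}_{emb}$, the control loss on the support set for the human examples, and the control loss on the robot examples:
$$ \mathcal{L} = \lambda_{emb} \mathcal{L}_{emb} + \lambda^{U}_{ctr} \mathcal{L}_{ctr}(\mathcal{T}^j_{U,H}) + \lambda^{A}_{ctr} \mathcal{L}_{ctr}(\mathcal{T}^j_{A}) \tag{8} $$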
Note that only the human examples of the same task are explicitly enforced to be close together in the embedding space, rather than human and robot examples. Although we could have also enforced an additional embedding loss on human and robot examples being close together, in practice we found that this was not necessary. This is due to the joint training of both task-embedding and control networks, which forces the network to implicitly learn to map the embedded human examples to a set of corresponding robot actions. Pseudocode for the training procedure is provided in Algorithm 1.
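The per-task sampling and loss combination described in this section can be sketched as follows. This is a hedged illustration rather than the authors' Algorithm 1: the set sizes, the lambda weights, the dictionary keys, and the function names are all assumptions chosen for the example, and the individual loss values are placeholders for quantities that the networks would produce.

```python
import random


def sample_task_sets(human_examples, robot_examples, support_size, query_size, rng):
    """Pick disjoint human support and query sets, plus a robot set, for one task."""
    humans = rng.sample(human_examples, support_size + query_size)
    support_h = humans[:support_size]
    query_h = humans[support_size:]
    robot = rng.sample(robot_examples, query_size)
    return support_h, query_h, robot


def total_loss(l_emb, l_ctr_human_support, l_ctr_robot, lambdas):
    """Weighted sum of the embedding loss and the two control losses."""
    return (lambdas["emb"] * l_emb
            + lambdas["ctr_human"] * l_ctr_human_support
            + lambdas["ctr_robot"] * l_ctr_robot)


rng = random.Random(0)
human = [f"human_demo_{i}" for i in range(15)]  # 15 simulated human demos per task
robot = [f"robot_demo_{i}" for i in range(15)]  # 15 simulated robot demos per task
support_h, query_h, robot_set = sample_task_sets(human, robot, 5, 5, rng)

# Placeholder loss values and weights (assumed, not the paper's settings).
loss = total_loss(2.0, 1.0, 3.0, {"emb": 1.0, "ctr_human": 0.5, "ctr_robot": 0.5})
```

Drawing the human support and query sets from a single sample without replacement is one simple way to guarantee that they never share an example.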
Input to the task-embedding network consists of (width, height, 3 × |τ|), where 3 represents the RGB channels and |τ| is the number of frames in an example trajectory. As in the TecNets paper, we found that we only need to take the first and last frame of an example trajectory for computing the task embedding and so we discard intermediate frames, resulting in an input of (width, height, 6). The sentence from the task-embedding network is then tiled and concatenated channel-wise to the input of the control network (as shown in Figure 2), resulting in an input image of (width, height, 3 + N), where N represents the length of the embedding.
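The two input constructions above can be sketched with plain nested lists. This is a minimal illustration, assuming images stored as row-major lists of RGB pixels, a tiny 2 × 3 frame size, and a sentence of length 4; the helper names are hypothetical.

```python
def embedding_input(frames):
    """Stack the first and last RGB frames channel-wise: 3 + 3 = 6 channels."""
    first, last = frames[0], frames[-1]
    return [[list(p0) + list(p1) for p0, p1 in zip(row0, row1)]
            for row0, row1 in zip(first, last)]


def control_input(image, sentence):
    """Tile the sentence over every pixel and append it to the RGB channels,
    giving 3 + N channels for a sentence of length N."""
    return [[list(pixel) + list(sentence) for pixel in row] for row in image]


def shape(img):
    return (len(img), len(img[0]), len(img[0][0]))


def blank_frame(height, width, value):
    return [[[value, value, value] for _ in range(width)] for _ in range(height)]


# A toy trajectory of three tiny frames; the middle frame is discarded
# when building the embedding-network input.
trajectory = [blank_frame(2, 3, v) for v in (0.0, 0.5, 1.0)]
emb_in = embedding_input(trajectory)
ctr_in = control_input(trajectory[1], sentence=[0.6, 0.8, 0.0, 0.0])
```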
IV-A Data Collection in Simulation
Many approaches to human imitation rely on training in the real world. This has many disadvantages, but most evident is the amount of time and effort needed to collect data for the training dataset. In the case of DAML, thousands of demonstrations had to be recorded, which rely on an active human presence to obtain both human and robot demonstrations, as the robot still has to be controlled in some way. For instance, in the DAML placing experiment a total of demonstrations were collected to form the training dataset, meaning tens of research hours dedicated to collecting data, with no guarantees that the dataset allows the network to generalise well enough. Training in simulation provides much more flexibility and availability of data: data generation can be easily parallelised and does not require constant human intervention. Additionally, there have been many successful examples of systems trained in simulation and then run in the real world; one common approach to do this is domain randomisation [36, 39, 16, 27, 5, 19].
Our approach generates the training dataset using PyRep [17], a recently released robot learning research toolkit, built on top of V-REP [33]. We modelled a 3D mesh of a human arm from nonecg.com, which we then broke down into rigid shapes. Our simulated arm has 26 degrees of freedom: 3 in the shoulder, 2 in the elbow, 2 for the wrist and the remaining 19 in the hand. 26 revolute joints link together the different rigid shapes: to emulate the soft-body behaviour of a real arm during motion, adjacent shapes slide over each other, making previously hidden parts of each shape visible. The resulting effect is very similar to real human arm motion.
During dataset generation, we collect the image, the joint angles and the joint velocities at each timestep for both human arm and robot. To achieve sim-to-real transfer we perform domain randomisation. Specifically, we sample from a set of textures and procedurally generated images (via Perlin noise), and apply them to all objects in the scene and to the human arm (an example can be seen in Figure 3). Additionally, we sample the position, the orientation and the size of the objects from a uniform distribution. The starting configuration of both the demonstrator and the agent, camera pose, light directions and lighting parameters are sampled from a normal distribution. A snapshot of the simulation and real-world setup can be seen in Figure 4.
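A randomised scene of the kind described above could be sampled along the following lines. This is a sketch only: the texture pool, the numeric ranges, the standard deviations, and the dictionary layout are all assumed for illustration and are not the values used to generate the dataset. It only mirrors the stated split between uniformly sampled object properties and normally sampled camera and lighting parameters.

```python
import math
import random

TEXTURES = ["wood", "marble", "perlin_noise"]  # hypothetical texture pool


def sample_scene(rng, num_objects=3):
    """Draw one randomised scene configuration."""
    objects = []
    for _ in range(num_objects):
        objects.append({
            "texture": rng.choice(TEXTURES),
            # Object position, orientation and size: uniform distributions.
            "position_xy": (rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
            "yaw": rng.uniform(-math.pi, math.pi),
            "scale": rng.uniform(0.8, 1.2),
        })
    return {
        "objects": objects,
        "arm_texture": rng.choice(TEXTURES),
        # Camera pose and lighting: normal distributions around nominal values.
        "camera_offset": [rng.gauss(0.0, 0.02) for _ in range(3)],
        "light_direction": [rng.gauss(0.0, 0.5) for _ in range(3)],
        "light_intensity": rng.gauss(1.0, 0.1),
    }


scene = sample_scene(random.Random(42))
```

Passing an explicitly seeded generator keeps each sampled scene reproducible, which is convenient when data generation is parallelised across workers.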
Our task-embedding network and control network each use a convolutional neural network (CNN), which consists of convolution layers, each with filters of size , followed by fully-connected layers consisting of neurons. Each layer is followed by layer normalisation [3] and an activation function [8], except for the final layer, where the output is linear for both the task-embedding and control network.
Input consists of an RGB image and the robot proprioceptive data, including the joint angles. The proprioceptive data is concatenated to the features extracted from the CNN layers of the control network, before being sent through the fully-connected layers. The output of the embedding network (embedding size) is a vector of length . The output of the control network corresponds to velocities applied to the joints of a Kinova Mico 6-DoF arm. During training, we set the margin to be for the embedding loss, and set both the support and query size to be .
Optimisation was performed with Adam [20] with a learning rate of and a batch size of . Lambdas were set as follows: , , and .
In our experiments we try to answer the following questions: (1) Can TecNets learn the domain shift between a demonstrator and an agent? In other words, can our approach learn an embedding of a task given demonstrator examples, and also a mapping from the demonstrator domain to the agent domain for control? (2) Is it possible to learn a task from a real-world human demonstration when all the training is done in simulation? (3) How does our approach compare to another state-of-the-art one-shot human imitation learning method? We consider two experiments, placing and pushing, which were undertaken for DAML [44], in order to compare our approach with their results. We run a set of experiments in both simulation and in the real world.
We begin by presenting our results for the placing experiment, both in simulation and in the real world: the goal is to place a hand-held object in a container, with two other containers in the scene acting as distractors. A trial is successful if the object lands inside the container. Our dataset features a total of 2280 tasks, where each contains 15 simulated human demonstrations and 15 simulated robot demonstrations. For each task we sampled three objects from the MIL dataset [12] of 105 training meshes, and used them as target containers and distractors; we randomised the scene as described in Section IV-A and we trained the network in simulation with the parameters in Section IV-B.
We evaluated one-shot placing in simulation on 74 tasks, with 6 trials each, using the MIL test meshes: in every trial we randomise the position of the objects and of the camera, and we procedurally generate the hand-held object. We also performed evaluation for the same system in the real world (Figure 5 Left) on 18 tasks with 4 trials each, using the containers and the held objects shown in Figure 5(a), maintaining the same camera pose and background between demonstration and trial.
The results for the placing experiment are shown in Table I. We find that the robot is able to learn from just one human demonstration of a previously unseen task, and can leverage the training with domain randomisation to bridge the reality gap, with comparable success rates to DAML. Additionally, we report the results of simulated placing evaluation for a network trained without the control loss on the support set of human examples. As was previously outlined in James et al. [15], the inclusion of the support loss assists the network in learning the task and the mapping between domains.
We also report the results of a real world experiment with a dataset where we simply randomised the scene and made the held object float to the target bowl, without using our simulated human arm. The results show that without the simulated arm, the resulting real-world policy chooses a target at random; therefore the arm model is necessary to successfully learn to imitate from a single human demonstration.
Table I: Placing results.
| Method | Success rate |
| DAML: Real World | 93.8% |
| Ours: Real World | 88.9% |
| Ours: Real World (No Sim Arm) | 39% |
In the pushing experiment the goal is to push an object against a target amid one distractor: a trial is successful if the object hits or falls within 5 cm of the target. Our dataset features a total of 1620 tasks, with 15 domain-randomised demonstrations for both robot and human, using objects from the MIL dataset.
We evaluated one-shot pushing in both simulation and the real world (Figure 5 Right), with the same number of trials as for the previous placing experiments. The objects used for the real-world experiment are shown in Figure 5(b). We report the results in Table II together with the DAML results to show that our network trained in simulation once again has comparable performance to a model trained with real data.
Table II: Pushing results.
| Method | Success rate |
| DAML: Real World | 88.9% |
| Ours: Real World | 84.7% |
V-C Large Domain Shift
As a final experiment, we tested how resilient our model is against large domain shifts in the real world, expecting it to leverage the adaptability acquired from domain randomisation. We evaluated placing in the real world by recording the human demonstrations with a cloth on the table, and then making the robot perform the same task with the table cloth removed, and therefore with a substantial change of background (Figure 5 Centre): the model placed the held object correctly in 87.5% of the 72 trials.
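As a quick sanity check on the reported real-world percentages, the short calculation below converts them into implied success counts. It assumes that every real-world evaluation used 18 tasks with 4 trials each (72 trials), which the text states for placing and for the domain-shift experiment and which we take to hold for pushing as well; the counts are inferred from the percentages, not reported figures.

```python
def successes(rate_percent, trials):
    """Number of successful trials implied by a reported percentage."""
    return round(rate_percent / 100 * trials)


trials = 18 * 4  # 18 tasks with 4 trials each

implied = {
    "placing": successes(88.9, trials),
    "placing, no sim arm": successes(39, trials),
    "pushing": successes(84.7, trials),
    "placing, large domain shift": successes(87.5, trials),
}

# With one target among three containers, a random choice succeeds 1 in 3 times.
chance_placing = 1 / 3
```

Under these assumptions the ablation without the simulated arm sits only a few trials above the one-in-three chance level, which is consistent with the observation that its policy picks a target at random.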
We have therefore shown that due to domain randomisation our performance does not degrade on large domain shifts, whereas for example in DAML the success rate under large change of scenes drops by up to 15%. This showcases the benefits of leveraging large scale simulations for robotic learning.
We have presented an approach to the one-shot human imitation problem that requires zero human interaction during training time. We achieve this by the combination of two methods. Firstly, we extend Task-Embedded Control Networks (TecNets) [15] to infer control policies by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Secondly, and most importantly, we show that we are able to perform sim-to-real transfer on humans, which allows us to train our system with no real-world data. With this approach, we are able to achieve similar performance to a state-of-the-art alternative method that relies on thousands of training demonstrations collected in the real world, whilst also remaining robust to visual domain shifts, such as a substantially different background. For future work, we hope to further investigate the variety of human actions that can be transferred from simulation to reality. For example, in this work, we have shown that a human arm can be transferred, but would the same method work for demonstrations including the entire torso of a human? We hope this work provides the first step in answering this question.
We thank Michael Bloesch, Ankur Handa, Sajad Saeedi, and Dan Lenton for insightful feedback on an early draft of this paper.
[1] (2004) Apprenticeship learning via inverse reinforcement learning. International Conference on Machine Learning.
[2] (2012) Trajectories and keyframes for kinesthetic teaching: a human-robot interaction perspective. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, pp. 391–398.
[3] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[4] (2013) A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709.
[5] (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250.
[6] (2006) Teaching a humanoid robot to recognize and reproduce social cues. In ROMAN 2006 - The 15th IEEE International Symposium on Robot and Human Interactive Communication, pp. 346–351.
[7] (2009) Learning collaborative manipulation tasks by demonstration using a haptic interface. In 2009 International Conference on Advanced Robotics, pp. 1–6.
[8] (2016) Fast and accurate deep network learning by exponential linear units (ELUs). International Conference on Learning Representations.
[9] (2004) Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems 47 (2-3), pp. 109–116.
[10] (2004) Interactive grasp learning based on human demonstration. In IEEE International Conference on Robotics and Automation (ICRA), Vol. 4, pp. 3519–3524.
[11] (2016) Guided cost learning: deep inverse optimal control via policy optimization. International Conference on Machine Learning.
[12] (2017) One-shot visual imitation learning via meta-learning. Conference on Robot Learning.
[13] (2013) DeViSE: a deep visual-semantic embedding model. Advances in Neural Information Processing Systems.
[14] (2016) High precision grasp pose detection in dense clutter. In IROS, pp. 598–605.
[15] (2018) Task-embedded control networks for few-shot imitation learning. Conference on Robot Learning.
[16] (2017) Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. Conference on Robot Learning.
[17] (2019) PyRep: bringing V-REP to deep robot learning. arXiv preprint arXiv:1906.11176.
[18] (2016) 3D simulation for robot arm control with deep Q-learning. NIPS Workshop (Deep Learning for Action and Interaction).
[19] (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12627–12637.
[20] (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations.
[21] (2015) Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop.
[22] (2010) Learning actions from observations. IEEE Robotics & Automation Magazine 17 (2), pp. 30–43.
[23] (2012) Metric learning: a survey. Foundations and Trends in Machine Learning.
[24] (2017) Learning robot activities from first-person human videos using convolutional future regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–2.
[25] (2013) A syntactic approach to robot imitation learning using probabilistic activity grammars. Robotics and Autonomous Systems 61 (12), pp. 1323–1334.
[26] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research.
[27] (2018) Sim-to-real reinforcement learning for deformable object manipulation. Conference on Robot Learning.
[28] (2000) Algorithms for inverse reinforcement learning. International Conference on Machine Learning.
[29] (2011) Online movement adaptation based on previous sensor experiences. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 365–371.
[30] (1989) ALVINN: an autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems.
[31] (2017) Transferring skills to humanoid robots by extracting semantic representations from observations of human activities. Artificial Intelligence 247, pp. 95–118.
[32] (2017) Optimization as a model for few-shot learning. International Conference on Learning Representations.
[33] (2013) V-REP: a versatile and scalable robot simulation framework. In Proc. of the International Conference on Intelligent Robots and Systems (IROS).
[34] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. International Conference on Artificial Intelligence and Statistics.
[35] (2018) Deep episodic memory: encoding, recalling, and predicting episodic experiences for robot action execution. IEEE Robotics and Automation Letters 3 (4), pp. 4007–4014.
[36] (2016) CAD2RL: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
[37] (2016) Meta-learning with memory-augmented neural networks. International Conference on Machine Learning.
[38] (2017) Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems.
[39] (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[40] (2017) Few-shot learning through an information retrieval lens. Advances in Neural Information Processing Systems.
[41] Learning a visuomotor controller for real world robotic grasping using easily simulated depth images. In Conference on Robot Learning.
[42] (2016) Matching networks for one shot learning. Advances in Neural Information Processing Systems.
[43] (2015) Robot learning manipulation action plans by "watching" unconstrained videos from the world wide web. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[44] (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. Robotics: Science and Systems.
[45] (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. International Conference on Robotics and Automation.