Visual Geometric Skill Inference by Watching Human Demonstration


We study the problem of learning manipulation skills from human demonstration video by inferring the association relationships between geometric features. Motivation for this work stems from the observation that humans perform eye-hand coordination tasks by using geometric primitives to define a task while a geometric control error drives the task through execution. We propose a graph based kernel regression method to directly infer the underlying association constraints from human demonstration video using Incremental Maximum Entropy Inverse Reinforcement Learning (InMaxEnt IRL). The learned skill inference provides human readable task definition and outputs control errors that can be directly plugged into traditional controllers. Our method removes the need for tedious feature selection and robust feature trackers required in traditional approaches (e.g. feature-based visual servoing). Experiments show our method infers correct geometric associations even with only one human demonstration video and can generalize well under variance.

I Introduction

Understanding and applying the mechanism of learning by watching has been researched in robotics for over two decades¹, where the core problem is how to extract high-level reusable symbolic task definitions by observing a human demonstration [15, 22]. Most of the research focuses on learning task goal configurations rather than task execution [2, 4]. This approach reduces the learning complexity and, most importantly, extracts an abstract task representation which allows for generalization. Symbolic task plans can be represented as a tree [35] or graph structure [27, 33] based on the assumption that a task can be decomposed into low-level conditioned elementary skills [5], such as grasping, striking [19], alignment [17], and peg-in-hole [25]. In order to define the symbols [2], (action, object, task) recognition techniques and a predefined skill sub-module are hand-engineered [2]. These predefined manipulation skills are highly task-dependent and do not generalize well in practice.

Fig. 1: Representation of manipulation skills using association constraints between geometric primitives. For example, the alignment skill (or insertion skill) is a combination of several point-to-point or co-linearity constraints. This parameterization partitions the problem into two parts: the geometric task representation and its control error output, computed as the Euclidean distance between geometric primitives.

The main question is whether a general solution exists to parameterize a task. There is no absolute answer, but even if a parameterization exists, it is difficult to find because manipulation tasks are too complex in general. One way of addressing this problem is to use low-level elementary skills as the cornerstones of a task; these are potentially easier to learn and generalize. Among the various types of skills, we are interested in those that can be generally parameterized using geometric association constraints (Fig. 1), since a variety of skills can be created from their combinations. We name these geometric skills. They are inherently represented in the image space by geometric primitives (points, lines, conics, planes, etc.), similar to how human eye-hand coordination works [13]. This parameterization method was introduced by Dodds et al. [8] to solve the box-packing task and then implemented by Gridseth et al. [11] on various skills, including grasping, placing, insertion, and cutting.

This approach, however, has several drawbacks. The task specification is tedious, as it requires manual assignment of associations among geometric features, and is highly dependent on robust feature trackers [3]. This paper aims to address these issues by learning from watching. We propose a method to directly regress the geometric association constraints on each frame.

The main contributions of this paper are:

  • Provide an interpretable and invariant robotic task representation using geometric features and their association constraints that are easy to monitor and validate.

  • Remove the dependency on robust feature trackers in previous methods [7, 11] by directly estimating the association constraint between geometric features.

  • Provide a robust feature association learning method by utilizing the fact that multiple feature associations can define the same task.

To elaborate on the third contribution: when some features are occluded, other candidates compensate and continue to define the task. In contrast, traditional tracking-based methods fix such associations in the initial feature selection stage and may be unable to recover from occlusions. This rigidity in the constraints is removed in our maximum entropy-based geometric constraint regression method.

The remainder of this work will focus on two issues: (1) how to generally encode different types of geometric association constraints and build more complex geometric skills from them; and (2) how to optimize such constraints given one human demonstration video. Experimental results are reported in Sec. IV.

Fig. 2: A: Graph structured skill kernels. B: Function of a skill kernel and kernel ensembles. A skill kernel takes all geometric feature association instances (kernel graph instances) as input and outputs their rank of relevance (select-out) w.r.t. the skill definition (defined by the human demonstration video). A skill is a combination (kernel ensemble) of several skill kernels. For example, an ‘insertion’ skill consists of a point-to-point and a line-to-line skill kernel. Given one image observation, we can enumerate all possible geometric feature associations. By feeding their corresponding descriptors {}, each association will create one kernel graph instance and each kernel instance will output a select-out to decide which association should be selected. A control error is computed from their corresponding {}.

II Related Works

This paper is inspired by research works in robot learning, visual servoing and graph-based relational inference.

End to end learning by watching: This approach, commonly known as imitation learning [1], has been recently gaining interest. Sermanet et al. presented TCN [26] to learn from contrastive positive and negative frame changes along time. Yu et al. proposed a meta-learning based method [36] to encode prior knowledge from a few thousand human/robot demonstrations, then learned a new task from one demonstration. End to end learning approaches lack interpretability. Furthermore, to the authors’ knowledge, learning by watching only from one human demonstration is still difficult.

Learning task plans by watching: This approach provides the most intuitive motivation and contributes to many of the early works in learning by watching. Such approaches try to generate human readable symbolic representations at a semantic level [6, 35, 33] to provide high level task planning, which is important for generalizability. Ikeuchi et al. presented a general framework [15] that relies on object/task/grasp recognition to generate assembly plans from observation. Modern approaches use a grammar parser [35], causal inference [33], and neural task programming [34]. Konidaris et al. proposed constructing skill trees [20] at the trajectory level to acquire skills from human demonstration using hierarchical reinforcement learning (RL) with options. This work presents a general framework to learn a tree-structured task. However, such works require hard-coded recognition submodules or lack generality across tasks.

Learning correspondence relationships by watching: Learning correspondence relationships to represent a task concept from human demonstration videos provides a generalizable task representation. Current approaches formulate the correspondence relationship through learning at either the object level [9] or the key point level [23, 24]. Beyond a simple correspondence relationship representation, Sieb et al. propose a graph-structured object relationship inference method [28] in visual imitation learning. However, apart from learning relationships at the object or key point level, using a more general framework to construct complex tasks from constraints among fine-grained geometric primitives (points, lines, conics) has rarely been studied.

Geometric approaches in skill learning: Constructing skills using geometric features provides good interpretability. Apart from works mentioned in Sec. I, Ahmadzadeh et al. proposed a system called VSL [2] that is capable of learning skills from one demonstration. VSL first detects objects in an image and represents them using image feature extractors like SIFT. It computes object spatial motion changes via feature matching and then forms a new task goal configuration used to generate motion primitives by a trajectory-based learning from demonstration (LfD) method [14]. Landmark-based pre/post action condition detection is also used to construct a task plan. Triantafyllou et al. proposed a geometric approach to solve the garment unfolding task [30]. Tremblay et al. proposed a human-readable plan generation method [29] which provides interpretability by modeling the 3D wire frame of blocks; however, it requires simulator training for prior 3D modelling.

III Method

III-A Geometric Skill Kernels

Consider an observation space and the geometric features observed in it². Each feature has two parts: a descriptor that encodes locally invariant properties and a coordinate parameter set that encodes global geometric properties³. A geometric skill kernel is a composite functional structure that describes association constraints between geometric features. To ground our formalism, we describe some basic examples:

  • point-to-point: the coincidence of two points.

  • point-to-line: a point lies on a line.

  • line-to-line: a line is collinear with another line.

  • coplanarity: four points or two lines are coplanar.

Each kernel has two parts: a geometric association constraint representation part and a control error generation part used to guide robot actions.

Geometric association constraint representation

Inspired by graph motifs [37], each skill kernel is a unit graph with its own structure. An undirected graph is used to represent the association constraint, where nodes are variables that take feature descriptors as input, and edges define a fixed graph structure (as shown in Fig. 2A). For example, the graph for the point-to-point kernel has two connected nodes, and each node corresponds to one point. By feeding in two points, we get a graph instance and use a select-out function to measure how relevant it is to defining the skill. Then, we have:


For example, in the ‘insertion’ skill (Fig. 2B), the graph instance of P3 and P4 has higher output than that of P1 and P2, and will be selected out.

It is worth noting the ambiguity property of skill kernels: several association instances may define the same skill. Because of this property, it is crucial to learn a feature association selection model that stays robust over the long run, so that when the current association is not available (occluded or outside the field of view (FOV)), a candidate association is selected instead. For example, in Fig. 2B, both the association of P3 to P4 and that of P2 to P5 can partially define the skill. In successive steps, P5 will be occluded, so its instance will not be observable; however, P3 to P4 can take over its role. Another issue is task decidability, which determines which 2D image-coordinate constraints are needed to guarantee a particular 3D configuration. We do not cover decidability here, but refer the reader to [8].

Geometric control error generation

Let denote the mapping of all nodes geometric parameters to a control error vector where is the degree of freedom that this constraint contributes. For example, given a point-to-point skill, for image points. The control error of a point-to-point kernel is the point distance, while of a point-to-line kernel is the dot product of their homogeneous coordinates. More examples can be found in [7, 11]. will be used in the following optimization using human demonstrations and in generating control signals used to guide robot action.
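As a hedged illustration of the two control errors described above, the following Python sketch computes a point-to-point error vector and a point-to-line error from image coordinates. The function names, and the choice of building the line from two image points, are assumptions made for this example; they do not come from the paper or from [7, 11].

```python
import math


def point_to_point_error(p, q):
    """Error vector between two image points (x, y); it is zero when they coincide."""
    return (p[0] - q[0], p[1] - q[1])


def line_through(a, b):
    """Homogeneous line (l1, l2, l3) through two image points.

    Computed as the cross product of the homogeneous points (ax, ay, 1) and (bx, by, 1).
    """
    ax, ay = a
    bx, by = b
    return (ay - by, bx - ax, ax * by - ay * bx)


def point_to_line_error(p, line):
    """Dot product of the homogeneous point (x, y, 1) with the line; zero when p is on the line."""
    return p[0] * line[0] + p[1] * line[1] + line[2]


if __name__ == "__main__":
    e = point_to_point_error((3.0, 4.0), (0.0, 0.0))
    print("point-to-point error:", e, "distance:", math.hypot(e[0], e[1]))
    diagonal = line_through((0.0, 0.0), (2.0, 2.0))
    print("on-line error:", point_to_line_error((1.0, 1.0), diagonal))
    print("off-line error:", point_to_line_error((1.0, 0.0), diagonal))
```

The point-to-point error has two components for image points, and its norm is the point distance. The point-to-line error is a single scalar whose magnitude depends on how the line is scaled, so a controller would typically normalise the line first; that step is omitted here.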

III-B Parameterization

Parameterization of

is parameterized by a -layer message passing graph neural network [10]. Each node relates a -dimensional hidden state . At layer (or time step ), each nodes hidden state is updated via three steps. (I) Pair-wise message generation :


where relates to any node connected to . (II) Message aggregation which collects all incoming messages:


We simply use summation as the aggregation in our implementation. Lastly, (III) message update :


where a gated recurrent unit (GRU) is used. After all layer updates, the final states of all nodes are fed into an MLP layer with an activation function that outputs a scalar value.
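The three update steps can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: the class name `TinyMessagePassing`, the linear message function, the random weights, the small hidden size, and the sum-then-sigmoid readout (standing in for the MLP layer) are all assumptions.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyMessagePassing:
    """Toy message passing network on a small undirected graph (illustrative only)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.W_msg = rand_matrix(rng, dim, 2 * dim)  # message from [h_receiver; h_sender]
        # GRU-style weights: update gate (z), reset gate (r), candidate state (n).
        self.Wz, self.Uz = rand_matrix(rng, dim, dim), rand_matrix(rng, dim, dim)
        self.Wr, self.Ur = rand_matrix(rng, dim, dim), rand_matrix(rng, dim, dim)
        self.Wn, self.Un = rand_matrix(rng, dim, dim), rand_matrix(rng, dim, dim)
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    def gru(self, h, m):
        """(III) Update a hidden state h from the aggregated message m."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, m), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, m), matvec(self.Ur, h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, m), matvec(self.Un, rh))]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def forward(self, states, edges, layers):
        """Run `layers` rounds of message passing and return a scalar in (0, 1)."""
        for _ in range(layers):
            incoming = [[0.0] * self.dim for _ in states]
            for v, w in edges:  # undirected edge: send a message in both directions
                for src, dst in ((v, w), (w, v)):
                    msg = matvec(self.W_msg, states[dst] + states[src])  # (I) message
                    incoming[dst] = [a + b for a, b in zip(incoming[dst], msg)]  # (II) sum
            states = [self.gru(h, m) for h, m in zip(states, incoming)]
        pooled = [sum(col) for col in zip(*states)]
        return sigmoid(sum(w * x for w, x in zip(self.w_out, pooled)))


if __name__ == "__main__":
    net = TinyMessagePassing(dim=4, seed=1)
    descriptors = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
    print(net.forward(descriptors, edges=[(0, 1)], layers=3))
```

Because aggregation and readout are both sums, the output of this sketch does not depend on the order in which the two nodes of a point-to-point instance are listed.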

The select-out function

Given one image, we construct graph instances by enumerating all possible geometric primitive combinations (e.g., point-to-point by listing association between any two points). Each instance {} represents one association and will output its relevance . A select-out function outputs a relevance factor :


Let denotes the control error for each graph instance, we now define the overall control error for a whole image as:
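A minimal sketch of the enumerate-then-select idea for point-to-point instances is given below. The softmax normalisation, the top-p cut, and the relevance-weighted sum of per-instance errors are assumptions chosen for illustration and may differ from the paper's definitions; all function names are hypothetical.

```python
import math
from itertools import combinations


def enumerate_point_pairs(points):
    """List every unordered pair of point indices (one point-to-point instance each)."""
    return list(combinations(range(len(points)), 2))


def select_out(scores, p):
    """Normalise raw relevance scores with a softmax and return the p best indices."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:p]
    return top, weights


def overall_error(points, pairs, top, weights):
    """Relevance-weighted sum of the point-to-point errors of the selected instances."""
    ex = ey = 0.0
    for i in top:
        a, b = pairs[i]
        ex += weights[i] * (points[a][0] - points[b][0])
        ey += weights[i] * (points[a][1] - points[b][1])
    return (ex, ey)


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    pairs = enumerate_point_pairs(points)
    # Hypothetical relevance scores, e.g. produced by a trained skill kernel.
    top, weights = select_out([0.0, 10.0, 0.0], p=1)
    print(pairs, top, overall_error(points, pairs, top, weights))
```

The number of point-to-point instances grows quadratically with the number of detected points, which is consistent with the paper's remark that only a small number of features could be tested on the available GPU.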


III-C InMaxEnt IRL for optimization

Given human demonstration video frames, we apply InMaxEnt IRL [17] for optimization. To this end, we define the reward function, which connects skill kernels to entropy models. Through the optimization of this reward function, the skill kernel is also optimized. In practice, each skill kernel is optimized individually. We use as an example in the following discussion.

Fig. 3: RSW measures how much relevance is contributed by the non-selected association instances. The lower the RSW, the more deterministic the select-out. A shows the training curve with the RSW regularizer. The relevance from the remaining instances stays below 0.1%. B is without the RSW regularizer. Although the cost function is optimized, the non-selected instances still occupy 75% of the relevance. Note that we are maximizing the loss.

Reward function

Each state is an image related to a control error , where the subscript denotes the time step in RL. An optimized should consistently select ‘correct instances’ among all states. During human demonstration we should expect to decrease globally (but not necessarily in each step). Intuitively, we should get a positive reward if we observe a decrease in , otherwise the reward should be negative. Let , we define:


, where normalizes the scale of different skill kernels’ output .
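The intuition behind the reward (positive when the control error shrinks between consecutive frames, negative otherwise) can be sketched as follows. The exact functional form is an assumption: this example uses the decrease in error norm divided by a normalising scale and clips the result to [-1, 1]. The names `error_norm`, `step_reward` and `scale` are hypothetical.

```python
import math


def error_norm(err):
    """Euclidean norm of a control error vector."""
    return math.sqrt(sum(c * c for c in err))


def step_reward(err_t, err_next, scale):
    """Reward for one observed state change.

    Positive if the control error norm decreased, negative if it increased.
    `scale` normalises kernels whose errors live on different scales.
    Clipping to [-1, 1] is an assumption of this sketch.
    """
    decrease = error_norm(err_t) - error_norm(err_next)
    return max(-1.0, min(1.0, decrease / scale))


if __name__ == "__main__":
    print(step_reward((3.0, 4.0), (0.0, 0.0), scale=10.0))   # error shrinks
    print(step_reward((0.0, 0.0), (3.0, 4.0), scale=10.0))   # error grows
```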

The variational expert assumption

InMaxEnt IRL considers imperfect expert demonstrations with a confidence level . A higher confidence level results in smaller variance in demonstration. We assume that in a human demonstration, at state the probability of selecting an action that transitions to the observed state follows a Boltzmann distribution with conditions:


where is the reward of this observed state change, and


is the partition function, is a truncated normal distribution with domain in [-1,1]. This means that the expert prefers the action with the highest reward among all possible actions {}. To emphasize high impact actions in , suppose gets a reward , the chance of included in the pool is: , this is called a human factor [17] since it varies with the human demonstrator’s confidence.

Loss function

To maximize the probability of the observed human demonstration video sequence, we apply the Markov property of the MDP and obtain:


With equation (8) and removing the last constant, the cost function can be further written as:


Note that if has domain , the loss function is a constant. Proofs can be found on our website [18].

To make the selections more deterministic while still respecting the ambiguity property, a penalty regularizer, weighted by a hyperparameter, is added to the reward. The penalty is the residual sum of weights (RSW). This concentrates most of the weight on the selected alternatives while minimizing the residual sum of weights. Fig. 3 shows a comparison of training with and without the RSW penalty on the Sorting task.
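As a small illustration of the regularizer, the sketch below computes the RSW as the share of relevance that falls outside the p selected instances, following the description in Fig. 3. The function names, the assumption that the weights are already normalised to sum to one, and the subtraction of the penalty from the reward are choices made for this example only.

```python
def residual_sum_of_weights(weights, p):
    """Relevance mass left on the instances outside the top p (weights assumed to sum to 1)."""
    selected = sorted(weights, reverse=True)[:p]
    return sum(weights) - sum(selected)


def penalised_reward(reward, weights, p, lam):
    """Reward with an RSW penalty; `lam` is a hypothetical regularizer coefficient."""
    return reward - lam * residual_sum_of_weights(weights, p)


if __name__ == "__main__":
    weights = [0.7, 0.2, 0.05, 0.05]
    print(residual_sum_of_weights(weights, p=2))
    print(penalised_reward(0.5, weights, p=2, lam=1.0))
```

A low RSW means that the selection is close to deterministic, while several alternatives can still share the selected mass, which keeps the ambiguity property intact.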

Input: Expert demonstration video frames {}, confidence level
Result: Optimal weights of
Construct kernel graph instances on each frame
for t = 1:n do
        Feature point extraction on to get {}
        Enumerate all instances by association
        Feed all instances to to get
end for
Prepare State Change Samples
Compute using ; Shuffle ; Initialize
for each iteration do
        for each observed sample change in  do
               Forward pass
               Gradient ascent update
        end for
end for
Algorithm 1 Optimizing


The last item in eq. (11) is a constant and is a function of , which is further represented using skill kernels with parameters . Then, we have:


can be solved by back propagation from eq. (7) to the graph neural network in the skill kernel. can be estimated by a Monte Carlo estimator sampling samples from the truncated normal distribution :


is the derivative of an expectation. By applying the log derivative trick, we have:


Since is tractable, it’s trivial to get:


, , . where is defined in [31]. By combining the above equations, is solved.
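The Monte Carlo step can be illustrated with a short, hedged sketch: draw rewards from a normal distribution truncated to [-1, 1] by rejection sampling, then average exp(r) over the samples. Treating this average as the partition-function estimate is an assumption, as are the names `sample_truncated_normal`, `estimate_partition` and `log_likelihood`; the paper's exact estimator may differ.

```python
import math
import random


def sample_truncated_normal(rng, mu, sigma, low=-1.0, high=1.0):
    """Draw one sample from N(mu, sigma^2) restricted to [low, high] by rejection."""
    while True:
        x = rng.gauss(mu, sigma)
        if low <= x <= high:
            return x


def estimate_partition(rng, mu, sigma, n_samples):
    """Monte Carlo estimate of E[exp(r)] with r drawn from the truncated normal."""
    total = 0.0
    for _ in range(n_samples):
        total += math.exp(sample_truncated_normal(rng, mu, sigma))
    return total / n_samples


def log_likelihood(reward, partition):
    """Log of a Boltzmann-style probability exp(reward) / partition."""
    return reward - math.log(partition)


if __name__ == "__main__":
    rng = random.Random(0)
    z = estimate_partition(rng, mu=0.0, sigma=0.5, n_samples=2000)
    print("partition estimate:", z)
    print("log-likelihood of reward 0.8:", log_likelihood(0.8, z))
```

Rejection sampling is simple but becomes slow when the mean lies far outside the truncation interval; an inverse-CDF sampler would be the usual alternative in that case.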

The optimization on is summarized in Algorithm 1.

III-D From Skill Kernel to Skills

In this paper, we consider a skill to be simply a combination of several skill kernels, namely a kernel ensemble. More advanced ways of constructing a skill from different kernels likely exist, but they are not discussed here.⁴

III-E From Skill to Control

Given an image, each skill kernel will select out several alternative association instances and generate control error vectors. For example, the point-to-point kernel will output vectors with structure [], which can be plugged into controllers such as feature-based visual servoing or uncalibrated visual servoing (UVS) [16]. More examples using other geometric features (lines, conics), together with the skill kernels constructed from them for UVS control, are included in [11].

Fig. 4: Four types of skills with human demonstration. A: Sorting skill. B: Insertion skill. C: Folding cloth skill. D: Driving a screw into the hole skill.

IV Experiments

IV-A Quantitative Evaluation

We first evaluate what types of skills the learned inference behavior can handle. Four types are tested (Fig. 4): the Sorting skill represents a regular setting; Insertion stands for skills that need a line-to-line constraint; Folding stands for manipulation of deformable objects; and the Screw skill represents cases with low image texture. Each skill is evaluated on videos that show a human performing the same task but with random behaviors. The objective is to infer the correct geometric feature associations that can be used to define the demonstrated skill.

We next test if the learned behavior from human demonstration video directly generalizes to a robot hand. For our tests, the background table is also changed and the target pose is randomly arranged (Fig. 7A).

Lastly, we test on the robot with four other scenarios (Fig. 8, B-E): moving camera; occlusion; object moving out of the camera’s field of view; and illumination changes.

Fig. 5: The hand-designed baseline requires a human to specifically select 10 pairs of feature points to define the demonstrated skill.

Baseline: To the best of our knowledge, there are no existing methods that learn geometric feature associations by watching human demonstration. However, for comparison, we hand-designed a baseline on the Sorting skill. The baseline requires a human to manually select 10 pairs of feature points and initialize 20 trackers. Each pair has one point on the object and another on the target. All 10 pairs simultaneously define the same skill (Fig. 5), resulting in a robust baseline. In evaluation, as long as one pair still defines the skill, the baseline is marked as a successful trial.

Fig. 6: Results of the four skills.
Fig. 7: A: Experimental setup on the manipulator. B: We change the camera pose in evaluation by rotation and a random displacement.
Fig. 8: Given one human demonstration video, we evaluate the learned behavior on 5 scenarios. A: using a robot hand with a different background and random target pose; B: projective variance due to camera pose change; C: occlusion; D: object out of the camera’s field of view; and E: illumination change. For each scenario, we detect all feature points and use a colored line to mark the select-out associations. The top one is marked red and the bar next to it indicates the estimation confidence. Only associations with confidence greater than 10% are displayed. Results are reported in Table I.

Metric: We evaluate on each video frame and calculate the accuracy of inferences. For the baseline, when it fails on one frame, it cannot be resumed unless a human hand-selects the features again; therefore we report only success or failure on the final result. For our method, failures can be automatically corrected in successive frames. While our method can output multiple inferred associations, we pick the top one for evaluation.
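The two evaluation rules can be written down in a few lines. This is only a sketch of the arithmetic described above; the function names and the boolean per-frame encoding are assumptions.

```python
def frame_accuracy(correct_flags):
    """Percentage of frames on which the top-ranked association was correct."""
    if not correct_flags:
        raise ValueError("at least one frame is required")
    return 100.0 * sum(1 for flag in correct_flags if flag) / len(correct_flags)


def baseline_succeeds(pairs_still_valid):
    """The baseline counts as a success if at least one tracked pair still defines the skill."""
    return any(pairs_still_valid)


if __name__ == "__main__":
    print(frame_accuracy([True, True, False, True]))
    print(baseline_succeeds([False] * 9 + [True]))
```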


For each skill, we evaluate the point-to-point kernel using SIFT and ORB features respectively. For the Insertion skill, we add the line-to-line kernel using LBD [38] line descriptor. All kernels have the same graph layer size=5 with hidden state dimension=512 and p=10 alternatives (III-C1). In training, we set the regularizer coefficient , and human factor . Each kernel with different descriptors are trained individually.

Method   | Random Target | Move Camera | Object Occlusion | Outside FOV | Change Illumination
Baseline | 100.0%        | 0.0%        | 0.0%             | 0.0%        | 0.0%
Ours     | 99.1%         | 96.7%       | 92.7%            | 81.2%       | 0.0%
TABLE I: Evaluation results of running the robot under various environmental settings, as shown in Fig. 8. For each variation setting, we count the correct geometric association inferences on each frame and calculate the percentage of successful inferences among all frames during the execution of the Sorting task.


Different skills: Results (Fig. 6) on the four skills show that our method handles the Sorting and Insertion skills well and performs moderately on the Folding and Screw skills. In experiments, we observed that results improve when both the object and the target have rich textures. This may stem from the use of SIFT or ORB, which are local descriptors that depend on texture. We can expect further improvement by using other local feature descriptors [21, 32]. We also find that the more features that can be fed into the skill kernel, the better its accuracy. Due to our GPU hardware limitation, we could only test using a small number of features (60 on average).

Varying environment: Fig. 8 shows results under various environmental conditions. In general, i) our method is robust to occlusion: when some feature associations are occluded, the selection of others compensates; and ii) our method exhibits robust behavior in that a failure in some frames does not affect successive frames, since it directly selects the feature associations on each frame. In contrast, the baseline method depends on the initialization of video trackers and continuous tracking. We observe that the learned inference behavior tends to select fixed association instances while showing the flexibility to select alternatives when the fixed ones are not observable. We also observe that the accuracy is highly related to the capability of the SIFT descriptor. It reaches high accuracy under projective variance (B) but fails under illumination changes (E).

Although results on the robot manipulator show our method can output the correct selections of geometric feature associations which can be directly used in controllers (e.g. uncalibrated visual servoing [11]), due to resource limitations we did not test with a plug-in controller. We leave this to our future work.

V Conclusion

We propose a graph based kernel regression method to infer the association relationship between geometric features by watching human demonstrations. The learned skill inference provides human readable task definition and outputs control errors that fit in traditional controllers. Our method removes the dependency on robust feature trackers and the tedious hand-selection process in traditional robotic task specification. The learned selection model provides a robust feature association behavior under various environmental settings.

Although results are promising, there are issues that need to be further investigated. 1) Consistent control error output: while the results show that our method tends to select a fixed set of associations, it cannot guarantee selection consistency. One possible solution is to add constraints between frames. 2) Other local feature descriptors [21, 32] are worth trying for better generalization. 3) The generalization to point cloud geometric primitives needs to be further studied.


  1. The earliest work can be traced back to Ikeuchi et al. [15] and Kuniyoshi et al. [22] in 1994.
  2. Points, lines, conics, planes, spheres, etc., from an image or point cloud.
  3. More details on the parameterization of geometric primitives can be found in [12].
  4. For example, for a ‘peg-in-hole’ skill, the point-to-point kernel should be used to first coarsely move to the target, while line-to-line kernel best fits in the final alignment actions. Their relationship is not a simple combination.


  1. P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In 21st Int. Conf. on Machine Learning (ICML ’04), pp. 1.
  2. S. R. Ahmadzadeh, A. Paikan, F. Mastrogiovanni, L. Natale, P. Kormushev and D. G. Caldwell (2015) Learning symbolic representations of actions from human demonstrations. In 2015 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 3801–3808.
  3. Q. Bateux, E. Marchand, J. Leitner, F. Chaumette and P. Corke (2017) Visual servoing from deep neural networks. arXiv preprint arXiv:1705.08940.
  4. M. Dehghan, Z. Zhang, M. Siam, J. Jin, L. Petrich and M. Jagersand (2019) Online object and task learning via human robot interaction. In 2019 Int. Conf. on Robotics and Automation (ICRA), pp. 2132–2138.
  5. R. Dillmann, M. Kaiser and A. Ude (1995) Acquisition of elementary robot skills from human demonstration. In International Symposium on Intelligent Robotics Systems, pp. 185–192.
  6. R. Dillmann (2004) Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems 47 (2-3), pp. 109–116.
  7. Z. Dodds, M. Jägersand and G. Hager (1999) A hierarchical architecture for vision-based robotic manipulation tasks. First Int. Conf. on Computer Vision Systems 542, pp. 312–330.
  8. Z. Dodds, G. D. Hager, A. S. Morse and J. P. Hespanha (1999) Task specification and monitoring for uncalibrated hand/eye coordination. In Proceedings 1999 IEEE International Conference on Robotics and Automation, Vol. 2, pp. 1607–1613.
  9. P. R. Florence, L. Manuelli and R. Tedrake (2018) Dense object nets: learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756.
  10. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
  11. M. Gridseth, O. Ramirez, C. P. Quintero and M. Jagersand (2016) ViTa: visual task specification interface for manipulation with uncalibrated visual servoing. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 3434–3440.
  12. R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge University Press.
  13. S. Hutchinson, G. D. Hager and P. I. Corke (1996) A tutorial on visual servo control. IEEE Transactions on Robotics and Automation 12 (5), pp. 651–670.
  14. A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor and S. Schaal (2013) Dynamical movement primitives: learning attractor models for motor behaviors. Neural Computation 25 (2), pp. 328–373.
  15. K. Ikeuchi and T. Suehiro (1994) Toward an assembly plan from observation. I. Task recognition with polyhedral objects. IEEE Trans. on Robotics and Automation 10 (3), pp. 368–385.
  16. M. Jagersand and R. Nelson (1995) Visual space task specification, planning and control. In Proc. of Int. Symposium on Computer Vision (ISCV), pp. 521–526.
  17. J. Jin, L. Petrich, M. Dehghan, Z. Zhang and M. Jagersand (2018) Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach. arXiv preprint arXiv:1810.00159.
  18. J. Jin, M. Dehghan and M. Jagersand (2019) Visual geometric skill inference by watching human demonstration: supplementary materials. Note: [Online; accessed 5-Sept-2019].
  19. J. Kober, K. Mülling, O. Krömer, C. H. Lampert, B. Schölkopf and J. Peters (2010) Movement templates for learning of hitting and batting. In 2010 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 853–858.
  20. G. Konidaris, S. Kuindersma, R. Grupen and A. Barto (2012) Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research 31 (3), pp. 360–375.
  21. B. Kumar, G. Carneiro and I. Reid (2016) Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5385–5394.
  22. Y. Kuniyoshi, M. Inaba and H. Inoue (1994) Learning by watching: extracting reusable task knowledge from visual observation of human performance. IEEE Trans. on Robotics and Automation 10 (6), pp. 799–822.
  23. L. Manuelli, W. Gao, P. Florence and R. Tedrake (2019) kPAM: keypoint affordances for category-level robotic manipulation. arXiv preprint arXiv:1903.06684.
  24. Z. Qin, K. Fang, Y. Zhu, L. Fei-Fei and S. Savarese (2019) KETO: learning keypoint representations for tool manipulation. arXiv preprint arXiv:1910.11977.
  25. G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow and S. Levine (2019) Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. arXiv preprint arXiv:1906.05841.
  26. P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal and S. Levine (2018) Time-contrastive networks: self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141.
  27. N. Shukla, C. Xiong and S. Zhu (2015) A unified framework for human-robot knowledge transfer. In 2015 AAAI Fall Symposium Series.
  28. M. Sieb, Z. Xian, A. Huang, O. Kroemer and K. Fragkiadaki (2019) Graph-structured visual imitation. arXiv preprint arXiv:1907.05518.
  29. J. Tremblay, T. To, A. Molchanov, S. Tyree, J. Kautz and S. Birchfield (2018) Synthetically trained neural networks for learning human-readable plans from real-world demonstrations. arXiv preprint arXiv:1805.07054.
  30. D. Triantafyllou, I. Mariolis, A. Kargakos, S. Malassiotis and N. Aspragathos (2016) A geometric approach to robotic unfolding of garments. Robotics and Autonomous Systems 75, pp. 233–243.
  31. Wikipedia contributors (2019) Truncated normal distribution. Note: [Online; accessed 5-Sept-2019].
  32. S. A. Winder and M. Brown (2007) Learning local image descriptors. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  33. C. Xiong, N. Shukla, W. Xiong and S. Zhu (2016) Robot learning with a spatial, temporal, and causal and-or graph. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2144–2151.
  34. D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei and S. Savarese (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
  35. Y. Yang, Y. Li, C. Fermuller and Y. Aloimonos (2015) Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In 29th AAAI Conference on Artificial Intelligence.
  36. T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.
  37. R. Zellers, M. Yatskar, S. Thomson and Y. Choi (2018) Neural motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840.
  38. L. Zhang and R. Koch (2013) An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency. Journal of Visual Communication and Image Representation 24 (7), pp. 794–805.