Visual Geometric Skill Inference by Watching Human Demonstration
Abstract
We study the problem of learning manipulation skills from human demonstration video by inferring the association relationships between geometric features. The motivation for this work stems from the observation that humans perform eye-hand coordination tasks using geometric primitives to define a task, while a geometric control error drives the task through execution. We propose a graph-based kernel regression method to directly infer the underlying association constraints from human demonstration video using Incremental Maximum Entropy Inverse Reinforcement Learning (InMaxEnt IRL). The learned skill inference provides a human-readable task definition and outputs control errors that can be directly plugged into traditional controllers. Our method removes the need for the tedious feature selection and robust feature trackers required in traditional approaches (e.g., feature-based visual servoing). Experiments show that our method infers correct geometric associations even with only one human demonstration video and generalizes well under environmental variation.
I Introduction
Understanding and applying the mechanism of learning by watching has been researched in robotics for over two decades.
The main question is whether a general solution exists to parameterize a task. There is no absolute answer, but even if a parameterization exists, it is difficult to find, because manipulation tasks are in general very complex. One way of addressing this problem is to use low-level elementary skills as the cornerstones of a task; these are potentially easier to learn and generalize. Among the various types of skills, we are interested in those that can be generally parameterized using geometric association constraints (Fig. 1), since a variety of skills can be created from their combinations. We name these geometric skills; they are inherently represented in the image space by geometric primitives (points, lines, conics, planes, etc.), similar to how human eye-hand coordination works [13]. This parameterization method was introduced by Dodds et al. [8] to solve the box-packing task and then implemented by Gridseth et al. [11] on various skills, including grasping, placing, insertion, and cutting.
This approach, however, has several drawbacks. The task specification is tedious, as it requires manual assignment of associations among geometric features, and is highly dependent on robust feature trackers [3]. This paper aims to address these issues by learning from watching. We propose a method to directly regress the geometric association constraints on each frame.
The main contributions of this paper are:

- Provide an interpretable and invariant robotic task representation using geometric features and their association constraints that is easy to monitor and validate.

- Provide a robust feature association learning method by utilizing the fact that multiple feature associations can define the same task.
To elaborate on the last contribution: when some features are occluded, for example, other candidate associations compensate and continue to define the task. In contrast, traditional tracking-based methods fix such associations in the initial feature selection stage and may be unable to recover from occlusions. This rigidity of constraints is removed in our maximum entropy-based geometric constraint regression method.
The remainder of this work will focus on two issues: (1) how to generally encode different types of geometric association constraints and build more complex geometric skills from them; and (2) how to optimize such constraints given one human demonstration video. Experimental results are reported in Sec. IV.
II Related Work
This paper is inspired by research in robot learning, visual servoing, and graph-based relational inference.
End-to-end learning by watching: This approach, commonly known as imitation learning [1], has recently been gaining interest. Sermanet et al. presented TCN [26] to learn from contrastive positive and negative frame changes over time. Yu et al. proposed a meta-learning-based method [36] to encode prior knowledge from a few thousand human/robot demonstrations and then learn a new task from one demonstration. End-to-end learning approaches lack interpretability. Furthermore, to the authors' knowledge, learning by watching only one human demonstration remains difficult.
Learning task plans by watching: This approach provides the most intuitive motivation and underlies many of the early works on learning by watching. Such approaches try to generate human-readable symbolic representations at a semantic level [6, 35, 33] to provide high-level task planning, which is important for generalizability. Ikeuchi et al. presented a general framework [15] that relies on object/task/grasp recognition to generate assembly plans from observation. Modern approaches use a grammar parser [35], causal inference [33], and neural task programming [34]. Konidaris et al. proposed constructing skill trees [20] at the trajectory level to acquire skills from human demonstration using hierarchical reinforcement learning (RL) with options, providing a general framework to learn a tree-structured task. However, such works either require hard-coded recognition submodules or lack generality across tasks.
Learning correspondence relationships by watching: Learning correspondence relationships to represent a task concept from human demonstration videos provides a generalizable task representation. Current approaches formulate the correspondence relationship through learning at either the object level [9] or the keypoint level [23, 24]. Beyond a simple correspondence representation, Sieb et al. proposed a graph-structured object relationship inference method [28] for visual imitation learning. However, apart from learning relationships at the object or keypoint level, a more general framework that constructs complex tasks from constraints among fine-grained geometric primitives (points, lines, conics) has rarely been studied.
Geometric approaches in skill learning: Constructing skills using geometric features provides good interpretability. Apart from the works mentioned in Sec. I, Ahmadzadeh et al. proposed a system called VSL [2] that is capable of learning skills from one demonstration. VSL first detects objects in an image and represents them using image feature extractors such as SIFT. It computes object spatial motion changes via feature matching and then forms a new task goal configuration used to generate motion primitives through a trajectory-based learning from demonstration (LfD) method [14]. Landmark-based pre/post action condition detection is also used to construct a task plan. Triantafyllou et al. proposed a geometric approach to the garment unfolding task [30]. Tremblay et al. proposed a human-readable plan generation method [29], which provides interpretability by modeling the 3D wireframe of blocks; however, it requires simulator training for prior 3D modeling.
III Method
III-A Geometric Skill Kernels
Let O denote the observation space and F denote the observed geometric features. We define four types of skill kernels:

- point-to-point: the coincidence of two points.

- point-to-line: a point lies on a line.

- line-to-line: a line is collinear with another line.

- coplanarity: four points, or two lines, are coplanar.
Each kernel has two parts: a geometric association constraint representation and a control error generator used to guide robot actions.
Geometric association constraint representation
Inspired by graph motifs [37], each skill kernel is a unit graph with a different structure. An undirected graph is used to represent the association constraint, where nodes are variables that take feature descriptors as input, and edges define a fixed graph structure (as shown in Fig. 2A). For example, the graph for the point-to-point kernel has two connected nodes, and each node corresponds to a point feature. By feeding in two points, we get a graph instance g and use a select-out function to measure how relevant this instance is to defining the skill. Then, we have:
s = f(g)    (1)
For example, in the ‘insertion’ skill (Fig. 2B), the graph instance of P3 and P4 has higher output than that of P1 and P2, and will be selected out.
It is worth noting the ambiguity property of skill kernels: several association instances may define the same skill. Because of this property, it is crucial to learn a robust feature association selection model in the long run, since when the current association is not available (occluded or outside the field of view (FOV)), a candidate association will be selected. For example, in Fig. 2B, both the association of P3 to P4 and that of P2 to P5 can partially define the skill. In successive steps, P5 becomes occluded, so its instance is no longer observable; however, P3 to P4 can take over the role. Another issue is task decidability, which determines which 2D image-coordinate constraints are needed to guarantee a particular 3D configuration. We do not cover decidability here, but refer to [8].
Geometric control error generation
Let E denote the mapping from the geometric parameters of all nodes to a control error vector of dimension d, where d is the number of degrees of freedom this constraint contributes. For example, for a point-to-point skill on image points, d = 2. The control error of a point-to-point kernel is the point-to-point distance, while that of a point-to-line kernel is the dot product of their homogeneous coordinates. More examples can be found in [7, 11]. E will be used in the following optimization from human demonstrations and in generating the control signals that guide robot actions.
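As a concrete illustration, the two simplest error maps can be sketched as follows (a minimal sketch with our own function names and conventions; we return the 2-DOF displacement vector for point-to-point rather than a scalar distance, since the displacement is what a controller consumes):

```python
import numpy as np

def point_to_point_error(p, q):
    """Point-to-point kernel error: the 2-DOF image-plane
    displacement between two points (zero when they coincide)."""
    return np.asarray(q, dtype=float) - np.asarray(p, dtype=float)

def point_to_line_error(p, line):
    """Point-to-line kernel error: dot product of the point's
    homogeneous coordinates with the line coefficients (a, b, c)
    of ax + by + c = 0; zero exactly when the point is on the line."""
    p_h = np.append(np.asarray(p, dtype=float), 1.0)  # homogeneous point
    return float(np.dot(p_h, np.asarray(line, dtype=float)))
```

Both maps vanish exactly when the geometric constraint is satisfied, which is what makes them usable as control errors.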
III-B Parameterization
Parameterization of the skill kernel
The skill kernel is parameterized by a T-layer message-passing graph neural network [10]. Each node i maintains a D-dimensional hidden state h_i^t. At layer t (or time step t), each node's hidden state is updated via three steps. (I) Pairwise message generation:
m_ij^{t+1} = M(h_i^t, h_j^t)    (2)
where j denotes any node connected to node i. (II) Message aggregation, which collects all incoming messages:
m_i^{t+1} = A({m_ij^{t+1} : j in N(i)})    (3)
We simply use summation for A in our implementation. Lastly, (III) message update:
h_i^{t+1} = U(h_i^t, m_i^{t+1})    (4)
where a gated recurrent unit (GRU) is used as U. After T layer updates, all of the nodes' final states are fed into an MLP layer with an activation function that outputs a scalar relevance value s.
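The dataflow of the three steps can be sketched numerically on a toy two-node point-to-point graph (random weights, a plain tanh cell standing in for the learned message/update functions and the GRU, and small dimensions of our choosing; this illustrates only the mechanics, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                       # hidden-state dimension (the paper uses 512)
T = 5                       # number of message-passing layers

# Toy two-node graph instance: node 0 and node 1, connected both ways.
edges = [(0, 1), (1, 0)]
H = rng.standard_normal((2, D)) * 0.1           # node hidden states

W_msg = rng.standard_normal((D, 2 * D)) * 0.1   # message function weights
W_upd = rng.standard_normal((D, 2 * D)) * 0.1   # update function weights

for _ in range(T):
    # (I) pairwise message generation from each neighbour j to node i
    msgs = {(i, j): np.tanh(W_msg @ np.concatenate([H[i], H[j]]))
            for i, j in edges}
    # (II) aggregation: sum all incoming messages per node
    M = np.zeros_like(H)
    for (i, j), m in msgs.items():
        M[i] += m
    # (III) hidden-state update (plain tanh cell here; the paper uses a GRU)
    H = np.tanh(np.concatenate([H, M], axis=1) @ W_upd.T)

# Readout: map the concatenated final states to a scalar relevance score.
w_out = rng.standard_normal(2 * D) * 0.1
score = float(np.tanh(w_out @ H.reshape(-1)))
```

In the actual model the weights are trained end to end through the IRL objective; here the scalar score is meaningful only in shape and range.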
The select-out function
Given one image, we construct graph instances by enumerating all possible geometric primitive combinations (e.g., for point-to-point, by listing the association between every two points). Each instance g_k represents one association and outputs its relevance s_k. A select-out function outputs a relevance factor w_k:
w_k = exp(s_k) / sum_j exp(s_j)    (5)
Letting e_k denote the control error of each graph instance, we now define the overall control error for a whole image as:
e = sum_k w_k e_k    (6)
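The enumeration and weighting pipeline can be sketched as follows (a hedged sketch: the softmax form of the select-out function is our assumption for eq. (5), and the names are ours):

```python
from itertools import product
import numpy as np

def select_out(scores):
    """Softmax-style select-out: map per-instance relevance scores
    to normalized relevance factors (a sketch of eq. (5))."""
    s = np.asarray(scores, dtype=float)
    w = np.exp(s - s.max())          # stabilized exponentiation
    return w / w.sum()

def overall_error(scores, errors):
    """Image-level control error: the relevance-weighted sum of the
    per-instance control errors (cf. eq. (6))."""
    w = select_out(scores)
    return w @ np.asarray(errors, dtype=float)

# Enumerate all point-to-point graph instances between two feature sets:
obj_pts = [(10, 10), (40, 15)]
tgt_pts = [(12, 11), (90, 80)]
instances = list(product(obj_pts, tgt_pts))   # every object-target pair
```

With a strongly dominant score, the weighted error collapses to that instance's error, which is the behavior the RSW regularizer (Sec. III-C) encourages.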
III-C InMaxEnt IRL for Optimization
Given human demonstration video frames, we apply InMaxEnt IRL [17] for optimization. To this end, we define a reward function that connects skill kernels to entropy models. Through the optimization of this reward function, the skill kernel is also optimized. In practice, each skill kernel is optimized individually; we use a single kernel as the example in the following discussion.
Reward function
Each state s_t is an image associated with a control error e_t, where the subscript t denotes the time step in RL. An optimized kernel should consistently select 'correct' instances across all states. During a human demonstration we expect e_t to decrease globally (but not necessarily at every step). Intuitively, the reward should be positive when we observe a decrease in e_t and negative otherwise. Letting De_t denote the decrease in control error from step t-1 to step t, we define:
r_t = De_t / b    (7)
where b normalizes the scale of different skill kernels' outputs.
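One way to realize this reward shaping is sketched below (hedged: the tanh squashing and the norm-difference form are our assumptions, not necessarily the paper's exact eq. (7)):

```python
import numpy as np

def step_reward(e_prev, e_curr, beta=1.0):
    """Reward for one state transition: positive when the control
    error magnitude decreases, negative when it increases; beta
    rescales different skill kernels' outputs to a common range.
    tanh additionally bounds the reward to [-1, 1]."""
    delta = np.linalg.norm(e_prev) - np.linalg.norm(e_curr)
    return float(np.tanh(delta / beta))
```

Note the sign convention: a demonstration that globally drives the error toward zero accumulates positive reward even if a few individual steps are negative.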
The variational expert assumption
InMaxEnt IRL considers imperfect expert demonstrations with a confidence level c. A higher confidence level results in smaller variance in the demonstration. We assume that in a human demonstration, at state s_t, the probability of selecting the action a_t that transitions to the observed next state follows a Boltzmann distribution:
p(a_t | s_t) = exp(r_t) / Z_t    (8)
where r_t is the reward of this observed state change, and
Z_t = E_{r ~ phi_c}[exp(r)]    (9)
is the partition function, where phi_c is a truncated normal distribution with domain [-1, 1]. This means that the expert prefers the action with the highest reward among all possible actions. To emphasize high-impact actions in Z_t, suppose an action gets a reward r; then the chance of it being included in the pool is governed by phi_c(r). This is called a human factor [17], since it varies with the human demonstrator's confidence.
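The Boltzmann preference over candidate actions can be sketched as follows (a sketch under our own naming; we fold the confidence level into a temperature-like parameter, which is one common way to realize "higher confidence, smaller variance"):

```python
import numpy as np

def expert_action_probs(rewards, confidence=5.0):
    """Boltzmann model of an imperfect expert: a softmax over the
    candidate actions' rewards, sharpened by a confidence level.
    Higher confidence -> more deterministic (lower-variance) choices."""
    r = confidence * np.asarray(rewards, dtype=float)
    p = np.exp(r - r.max())          # stabilized softmax
    return p / p.sum()
```

As the confidence grows, the probability mass concentrates on the highest-reward action, matching the intuition that a confident demonstrator rarely takes suboptimal actions.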
Loss function
To maximize the probability of the observed human demonstration video sequence, applying the MDP property, we have:
max_theta sum_t log p(a_t | s_t)    (10)
With equation (8), and removing the last constant, the loss function can be further written as:
L(theta) = sum_t (log Z_t - r_t) + C    (11)
Note that under certain domain choices for phi_c, the loss function degenerates to a constant. Proofs can be found on our website [18].
To force the select-out function toward more deterministic selections, while still respecting the ambiguity property, a penalty regularizer lambda * RSW is added to the reward, where lambda is a hyperparameter and RSW is the residual sum of weights, i.e., the total weight assigned outside the selected alternatives. This makes the select-out function place most of its weight on the selected alternatives while minimizing the residual sum of weights. Fig. 3 compares training with and without the RSW penalty on the Sorting task.
Optimization
The last term in eq. (11) is a constant, and r_t is a function of the control error, which is in turn represented using skill kernels with parameters theta. Then, we have:
grad_theta L(theta) = sum_t (grad_theta Z_t / Z_t - grad_theta r_t)    (12)
grad_theta r_t can be solved by backpropagation from eq. (7) through the graph neural network in the skill kernel. Z_t can be estimated by a Monte Carlo estimator drawing N samples from the truncated normal distribution phi_c:
Z_t ~= (1/N) sum_{n=1}^{N} exp(r_n),   r_n ~ phi_c    (13)
grad_theta Z_t is the derivative of an expectation. By applying the log-derivative trick, we have:
grad_theta Z_t = E_{r ~ phi_c}[exp(r) grad_theta log phi_c(r)] ~= (1/N) sum_{n=1}^{N} exp(r_n) grad_theta log phi_c(r_n)    (14)
Since phi_c is tractable, it is trivial to get:
(15) 
where the terms of the truncated normal density are defined in [31]. By combining the above equations, the gradient of the loss is obtained.
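The log-derivative trick itself can be verified numerically on a plain normal distribution (the paper samples from a truncated normal on [-1, 1]; we use an untruncated normal to keep the sketch short, and the function names are ours):

```python
import numpy as np

def grad_expectation(f, mu, sigma=0.3, n=200_000, seed=0):
    """Monte Carlo estimate of d/dmu E_{x ~ N(mu, sigma)}[f(x)] via
    the log-derivative (score function) trick:
    E[f(x) * d log p(x)/dmu], with d log p/dmu = (x - mu) / sigma^2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n)
    return float(np.mean(f(x) * (x - mu) / sigma**2))
```

For f(x) = x the true gradient is 1 for any mu, and for f(x) = x^2 it is d/dmu (mu^2 + sigma^2) = 2*mu; the estimator recovers both up to sampling noise, which is exactly the mechanism used to push gradients through the sampled partition function.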
The optimization procedure is summarized in Algorithm 1.
III-D From Skill Kernel to Skills
In this paper, we consider a skill to be simply the combination of several skill kernels, namely a kernel ensemble. More advanced ways to construct a skill from different kernels likely exist, but they are not discussed here.
III-E From Skill to Control
Given an image, each skill kernel selects out several alternative association instances and generates control error vectors. For example, the point-to-point kernel outputs error vectors that can be plugged into controllers such as feature-based visual servoing or uncalibrated visual servoing (UVS) [16]. More examples using geometric features (lines, conics), and the skill kernels constructed from them for UVS control, can be found in [11].
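To make the plug-in concrete, here is a generic Broyden-style uncalibrated visual servoing step (a sketch under our own conventions; the paper defers the actual controllers to [11, 16]):

```python
import numpy as np

def uvs_step(J, e, q, alpha=0.5):
    """One step of uncalibrated visual servoing: drive the control
    error e toward zero through the current estimate J of the image
    Jacobian. Returns the new joint configuration and the commanded
    joint motion."""
    dq = -alpha * np.linalg.pinv(J) @ e
    return q + dq, dq

def broyden_update(J, de, dq):
    """Broyden rank-one update of the Jacobian estimate from the
    observed error change de after joint motion dq; no camera
    calibration or analytic Jacobian is needed."""
    denom = float(dq @ dq)
    if denom < 1e-12:          # no motion: keep the old estimate
        return J
    return J + np.outer(de - J @ dq, dq) / denom
```

In a closed loop, one would read e from the kernel's select-out output each frame, call uvs_step, and refresh J with broyden_update from the observed error change.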
IV Experiments
IV-A Quantitative Evaluation
We first evaluate which types of skills the learned inference is capable of. Four types are tested (Fig. 4): the Sorting skill represents a regular setting; Insertion covers skills that need the line-to-line constraint; Folding covers manipulation of deformable objects; and the Screw skill represents tasks with low image texture. Each skill is evaluated on videos that show a human performing the same task but with random behaviors. The objective is to infer the correct geometric feature associations that define the demonstrated skill.
We next test if the learned behavior from human demonstration video directly generalizes to a robot hand. For our tests, the background table is also changed and the target pose is randomly arranged (Fig. 7A).
Lastly, we test on the robot with four other scenarios (Fig. 8, B-E): moving camera; occlusion; object moving out of the camera's field of view; and illumination changes.
Baseline: To the best of our knowledge, no existing method learns geometric feature associations by watching human demonstration. For comparison, we therefore hand-designed a baseline for the Sorting skill. The baseline requires a human to manually select 10 pairs of feature points and initialize 20 trackers. Each pair has one point on the object and another on the target, and all 10 pairs simultaneously define the same skill (Fig. 5), yielding a strong baseline. In evaluation, as long as one pair still defines the skill, the trial is marked as a success.
Metric: We evaluate on each video frame and report the accuracy of the inferences. When the baseline fails on one frame, it cannot resume unless a human hand-selects the features again, so for the baseline we report only success or failure of the final result. For our method, failures are automatically corrected in successive frames. While our method outputs multiple inferred associations, we pick the top-ranked one for evaluation.
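The per-frame metric amounts to a top-1 accuracy over frames (a small sketch; the names are ours):

```python
def inference_accuracy(top1_predictions, annotations):
    """Fraction of frames whose top-ranked inferred association
    matches the annotated correct association. Each frame is scored
    independently, so a failure does not propagate to later frames."""
    assert len(top1_predictions) == len(annotations)
    hits = sum(p == a for p, a in zip(top1_predictions, annotations))
    return hits / len(annotations)
```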
Training
For each skill, we evaluate the point-to-point kernel using SIFT and ORB features, respectively. For the Insertion skill, we add the line-to-line kernel using the LBD [38] line descriptor. All kernels use the same graph layer size of 5, with hidden-state dimension 512 and p = 10 alternatives (Sec. III-C). In training, we set the regularizer coefficient and the human factor. Each kernel is trained individually for each descriptor.
Scenario   Random Target   Move Camera   Object Occlusion   Outside FOV   Change Illumination
Baseline   100.0%          0.0%          0.0%               0.0%          0.0%
Ours       99.1%           96.7%         92.7%              81.2%         0.0%
Results
Different skills: Results (Fig. 6) on the four skills show that our method handles the Sorting and Insertion skills well and performs moderately on the Folding and Screw skills. In the experiments, we observed that results improve when both object and target have rich textures. This is likely because SIFT and ORB are local descriptors that depend on texture; we expect further improvement from other local feature descriptors [32, 21]. We also find that the more features fed into the skill kernel, the better the accuracy. Due to GPU hardware limitations, we could only test with a small number of features (60 on average).
Varying environment: Fig. 8 lists results under various environmental conditions. In general, (i) our method is robust to occlusion: when some feature associations are occluded, others are selected to compensate; and (ii) our method is robust across frames, since it selects the feature associations directly on each frame, so a failure in some frames does not affect successive ones. In contrast, the baseline depends on the initialization of video trackers and on continuous tracking. We observe that the learned inference tends to select fixed association instances while retaining the flexibility to select alternatives when the fixed ones are not observable. We also observe that accuracy is closely tied to the capability of the SIFT descriptor: it reaches high accuracy under projective variation (B) but fails under illumination changes (E).
Although the results on the robot manipulator show that our method outputs correct selections of geometric feature associations, which can be used directly in controllers (e.g., uncalibrated visual servoing [11]), due to resource limitations we did not test with a plugged-in controller. We leave this to future work.
V Conclusion
We propose a graph-based kernel regression method to infer the association relationships between geometric features by watching human demonstrations. The learned skill inference provides a human-readable task definition and outputs control errors that fit into traditional controllers. Our method removes the dependency on robust feature trackers and the tedious hand-selection process of traditional robotic task specification. The learned selection model provides robust feature association behavior under various environmental settings.
Although the results are promising, several issues need further investigation. 1) Consistent control error output: while the results show that our method tends to select a fixed set of associations, it cannot guarantee selection consistency. One possible solution is to add constraints between frames. 2) Other local feature descriptors [32, 21] are worth trying for better generalization. 3) Generalization to point cloud geometric primitives needs further study.
Footnotes
 The earliest work can be traced back to Ikeuchi et al. [15] and Kuniyoshi et al. [22] in 1994.
 Points, lines, conics, planes, spheres, etc., from an image or point cloud.
 More details on the parameterization of geometric primitives can be found in [12].
 For example, for a 'peg-in-hole' skill, the point-to-point kernel should be used first to coarsely move to the target, while the line-to-line kernel best fits the final alignment actions. Their relationship is not a simple combination.
References
 [1] (2004) Apprenticeship learning via inverse reinforcement learning. In 21st Int. Conf. on Machine Learning (ICML '04), pp. 1.
 [2] (2015) Learning symbolic representations of actions from human demonstrations. In 2015 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 3801-3808.
 [3] (2017) Visual servoing from deep neural networks. arXiv preprint arXiv:1705.08940.
 [4] (2019) Online object and task learning via human robot interaction. In 2019 Int. Conf. on Robotics and Automation (ICRA), pp. 2132-2138.
 [5] (1995) Acquisition of elementary robot skills from human demonstration. In Int. Symposium on Intelligent Robotics Systems, pp. 185-192.
 [6] (2004) Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems 47 (2-3), pp. 109-116.
 [7] (1999) A hierarchical architecture for vision-based robotic manipulation tasks. In First Int. Conf. on Computer Vision Systems, vol. 542, pp. 312-330.
 [8] (1999) Task specification and monitoring for uncalibrated hand/eye coordination. In Proc. 1999 IEEE Int. Conf. on Robotics and Automation, vol. 2, pp. 1607-1613.
 [9] (2018) Dense object nets: learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756.
 [10] (2017) Neural message passing for quantum chemistry. In Proc. 34th Int. Conf. on Machine Learning, vol. 70, pp. 1263-1272.
 [11] (2016) ViTa: visual task specification interface for manipulation with uncalibrated visual servoing. In Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 3434-3440.
 [12] (2003) Multiple view geometry in computer vision. Cambridge University Press.
 [13] (1996) A tutorial on visual servo control. IEEE Transactions on Robotics and Automation 12 (5), pp. 651-670.
 [14] (2013) Dynamical movement primitives: learning attractor models for motor behaviors. Neural Computation 25 (2), pp. 328-373.
 [15] (1994) Toward an assembly plan from observation. I. Task recognition with polyhedral objects. IEEE Transactions on Robotics and Automation 10 (3), pp. 368-385.
 [16] (1995) Visual space task specification, planning and control. In Proc. Int. Symposium on Computer Vision (ISCV), pp. 521-526.
 [17] (2018) Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach. arXiv preprint arXiv:1810.00159.
 [18] (2019) Visual geometric skill inference by watching human demonstration: supplementary materials. Online; accessed 5 Sept. 2019.
 [19] (2010) Movement templates for learning of hitting and batting. In 2010 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 853-858.
 [20] (2012) Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research 31 (3), pp. 360-375.
 [21] (2016) Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 5385-5394.
 [22] (1994) Learning by watching: extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation 10 (6), pp. 799-822.
 [23] (2019) kPAM: keypoint affordances for category-level robotic manipulation. arXiv preprint arXiv:1903.06684.
 [24] (2019) KETO: learning keypoint representations for tool manipulation. arXiv preprint arXiv:1910.11977.
 [25] (2019) Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. arXiv preprint arXiv:1906.05841.
 [26] (2018) Time-contrastive networks: self-supervised learning from video. In 2018 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 1134-1141.
 [27] (2015) A unified framework for human-robot knowledge transfer. In 2015 AAAI Fall Symposium Series.
 [28] (2019) Graph-structured visual imitation. arXiv preprint arXiv:1907.05518.
 [29] (2018) Synthetically trained neural networks for learning human-readable plans from real-world demonstrations. arXiv preprint arXiv:1805.07054.
 [30] (2016) A geometric approach to robotic unfolding of garments. Robotics and Autonomous Systems 75, pp. 233-243.
 [31] (2019) Truncated normal distribution. Online; accessed 5 Sept. 2019.
 [32] (2007) Learning local image descriptors. In 2007 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1-8.
 [33] (2016) Robot learning with a spatial, temporal, and causal and-or graph. In 2016 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 2144-2151.
 [34] (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 1-8.
 [35] (2015) Robot learning manipulation action plans by "watching" unconstrained videos from the world wide web. In 29th AAAI Conf. on Artificial Intelligence.
 [36] (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.
 [37] (2018) Neural motifs: scene graph parsing with global context. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 5831-5840.
 [38] (2013) An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency. Journal of Visual Communication and Image Representation 24 (7), pp. 794-805.