A Geometric Perspective on Visual Imitation Learning
A Geometric Perspective on Visual Imitation Learning
We consider the problem of visual imitation learning without human supervision (e.g. kinesthetic teaching or teleoperation) and without access to an interactive reinforcement learning (RL) training environment. We present a geometric perspective to derive solutions to this problem. Specifically, we propose VGS-IL (Visual Geometric Skill Imitation Learning), an end-to-end geometry-parameterized task concept inference method, to infer globally consistent geometric feature association rules from human demonstration video frames. We show that, instead of learning actions from image pixels, learning a geometry-parameterized task concept provides an explainable representation that is invariant from demonstrator to imitator under various environmental settings. Moreover, such a task concept representation provides a direct link with geometric vision based controllers (e.g. visual servoing), allowing for efficient mapping of high-level task concepts to low-level robot actions.
Compared to traditional robotic task teaching methods, visual imitation learning promises a more intuitive way for general-purpose task programming. Like most other learning methods, it suffers from the generalization problem. Commonly, three strategies are used to tackle generalization. The first is to increase the number of human demonstrations via kinesthetic teaching or teleoperation. This has proven effective for supervised learning methods such as behavior cloning. However, it requires laborious human supervision, which can be tedious. The second strategy assumes access to robot-environment interactions where samples can be expanded via reinforcement learning (RL) methods (e.g. IRL, GCL, GAIL). Unfortunately, new issues regarding transfer learning and low sample efficiency arise during both simulation and real-world training. The last strategy assumes that shared knowledge can be learned from demonstration samples across multiple but similar tasks; from this shared knowledge, the robot is able to learn a new task given one more demonstration. This strategy is used in meta-learning based approaches (e.g. one-shot imitation).
Generally, the aforementioned methods use human demonstrations as state-action samples to learn a policy mapping from image to action (i.e. they approximate a target state-action distribution). Consequently, improving generalization requires collecting more state-action experience, from either human teaching (supervision) or robot self-exploration (RL training). However, neither approach proves satisfactory. This motivates us to ask: is it possible to learn by watching one human demonstration, without the extra effort of interactive training?
Recently, several methods have been proposed to tackle this question. One key insight is to rethink the classical 'correspondence' problem, which studies the difference between demonstrator and imitator. This insight shifts our view of human demonstration towards encoding task concepts rather than control. Empirically, this aligns with the human cognitive process in peer learning, which involves first understanding the task before attempting any motor actions.
In this paper, we provide a geometric perspective to derive solutions. We show that, instead of learning a mapping from image pixels to actions, learning a geometry-parameterized task concept provides an explainable representation that is invariant from demonstrator to imitator.
We propose VGS-IL (Visual Geometric Skill Imitation Learning), an end-to-end geometry-parameterized task concept inference method used to infer globally consistent geometric association rules from demonstration video frames. Instead of learning from geometric primitives [30, 3] (e.g. points, lines, and conics) with handcrafted feature descriptors [34, 37, 44], VGS-IL can directly optimize a combinatorial representation from image pixels. Experiments show that the learned task concept generalizes well from human to robot despite the visual difference in arm and hand appearance.
We show that such a geometry-parameterized task concept can be directly linked to geometric vision based controllers, thus forming an efficient way to map high-level task concepts to low-level robot actions. Unlike prevalent methods that require hierarchical training of an additional control policy [35, 18, 39, 20, 19], experimental results show that our learned representation fits directly into a visual servoing controller, removing the need for feature trackers.
By using geometric primitive associations and 3D computer vision geometry based controllers, we present a method for general purpose robotic task programming.
II Related Works
Visual Imitation Learning: The problem in visual imitation learning is: given one or several human demonstration videos, how can a new task be learned? Research on this topic dates back to 1994 [27, 31]. With the rise of deep learning and reinforcement learning, more influential works have since been published. While some, reviewed in Section I, aim to learn a task from visual inputs, it is worth noting another research stream that aims to learn a semantic knowledge representation. These methods commonly rely on independent pipelines such as object detection and action recognition. Despite their complexity, experiments show they can learn semantic task plans that follow a procedural manner [42, 3, 43].
Hierarchical Visual Imitation Learning: Instead of simultaneously learning task definition and control, hierarchical approaches decouple the two by focusing on learning a shared high-level task representation across human demonstrator and robot imitator. The two core problems are: i) how to represent the high-level task concept; and ii) how to train the low-level control policy. The first is more important, since the representation of the task concept determines the controller training. For example, many pioneering works parameterize the task concept at the pixel level using sub-goal outputs from a neural network [35, 39]. The low-level policy is then sub-goal conditioned and trained in a hierarchical reinforcement learning manner.
More recent works represent a task at the object level, where object correspondences [20, 19] or graph-structured relationships are utilized to parameterize a task. The low-level controller is then trained based on distance errors in the embedded parameterization space. This approach shows success in pushing and placing tasks; however, it lacks the definition resolution required for more complex tasks like insertion.
Geometry-Based Visual Imitation Learning: Alternatively, going deeper inside the objects, approaches based on geometric features have emerged. Early pioneering work from Ahmadzadeh et al. (2015) proposed VSL to learn a feature point correspondence based task representation given one human demonstration video. A similar approach from Qin et al. (2019), KETO, utilizes key point relationships to represent a tool manipulation task. In general, their low-level controllers are tediously trained separately, with little emphasis on how a proper task representation facilitates low-level policy training.
Beyond a simple key point correspondence based task concept representation, other basic geometric constraints (point-to-line, line-to-line, etc.) can enrich our toolbox for parameterizing task concepts. Furthermore, by concurrently combining and sequentially linking them, we can find a general and scalable way to program more complex manipulation tasks. To the authors' best knowledge, applying such systematic geometry-based task programming in visual imitation learning is rarely studied.
This research builds upon our previous work on visual geometric skills learning , employing a more data driven approach to learn globally consistent geometric feature association rules without hand crafted feature descriptors [34, 37, 44].
III-A Geometry-parameterized task representation
The basic idea of geometric feature association rule based task parameterization, as first proposed in our previous work, has two parts: i) the basic geometric constraints, or visual geometric skill (VGS) kernels; and ii) the combination or conditioned linking of basic kernels to create more complex tasks, which we refer to as visual geometric skills. For example, some commonly used geometric skill kernels are:
point-to-point: the coincidence of two points.
point-to-line: a point fits on (touches) a line.
line-to-line: a line is co-linear with another line.
point-to-conic: a point fits on a conic.
This parameterization method provides a programmable framework that can be used to create more complex tasks by combining several constraints in parallel and then sequentially linking the basic kernels. For example, inserting a pen tip inside its cap involves a point-to-point constraint linked by a line-to-line one (Fig. 2A).
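The error functions behind these kernels are simple geometric residuals; a footnote later in the paper notes that a point-to-point kernel outputs x-y errors in image pixels and a point-to-line kernel the dot product of homogeneous coordinates. The sketch below illustrates those two plus a plausible line-to-line form (the line-to-line residual and all coordinates are our own illustrative assumptions, not the paper's exact definitions):

```python
import numpy as np

def point_to_point(p, q):
    """Error: image-space offset between two points (zero at coincidence)."""
    return p - q

def point_to_line(p, l):
    """Error: dot product of the homogeneous point with the line
    coefficients (zero when the point lies on the line)."""
    return np.dot(np.append(p, 1.0), l)

def line_to_line(l1, l2):
    """Assumed residual: the cross product of two normalized homogeneous
    lines vanishes iff they are the same line (co-linear constraint)."""
    a = l1 / np.linalg.norm(l1)
    b = l2 / np.linalg.norm(l2)
    return np.cross(a, b)

# Pen-in-cap example (Fig. 2A): a point-to-point constraint on the tip
# plus a line-to-line constraint on the axes. Coordinates are made up.
tip, cap = np.array([3.0, 2.0]), np.array([3.0, 2.0])
pen_axis = np.array([1.0, -1.0, -1.0])   # homogeneous line x - y - 1 = 0
cap_axis = np.array([2.0, -2.0, -2.0])   # same line, different scale
```

When both residuals are driven to zero, the pen tip coincides with the cap opening and the two axes align, which is exactly the combined constraint the task encodes.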
First, how do we define a good VGS kernel representation? Given a set of geometric features {f_1, ..., f_n}, a VGS kernel representation G is an operator that maps the feature set to a latent vector. We propose three essential properties of a good G:
Commutative: A good operator G should be invariant to the order of its input features. For example, when we enumerate all possible associations of three features, we require G to give the same output for all 6 possible permutations, since they define the same task.
Non-inner-associative: A good G should be able to represent which features group together. For example, consider a point p and a line defined by three points (q_1, q_2, q_3). A point-to-line kernel operation is G(p, (q_1, q_2, q_3)), which is distinct from any other inner association such as G(q_1, (p, q_2, q_3)).
Scalability: The parameterization should scale to n-ary operations. For example, a point-to-point kernel is a binary operation, while point-to-line is a quaternary operation if three points represent a line.
Examples of basic VGS kernel parameterizations are included in Fig. 2B. As shown in Jun et al., a parameterization satisfying the above properties can be found using a message passing graph neural network with a gated recurrent unit (GRU): i) a graph structure is scalable to represent n-ary operations; ii) graph edges define different inner associations; and iii) the message passing mechanism combined with the GRU makes the output invariant to input order. Specifically, this design (Fig. 2C) has four steps. First, pair-wise message generation: m_ij = f_m(h_i, h_j), where h_i and h_j are the connected nodes' hidden states. Second, message aggregation, which collects all incoming messages: m_i = sum over neighbors j of m_ij. Third, a state update using the GRU: h_i' = GRU(h_i, m_i). Finally, after T rounds of updates, all nodes' final states are fed into a readout function parameterized by MLP layers: y = R({h_i^T}).
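The four message-passing steps can be sketched in plain NumPy. This is a minimal illustration with random, untrained weights and made-up sizes (the paper does not specify them); its point is the structure: shared message/update weights, sum aggregation, a GRU-style node update, and a sum readout that together make the output invariant to node ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden dimension (hypothetical)

# Shared weights across all nodes/edges (sizes are illustrative).
Wm = rng.standard_normal((D, 2 * D)) * 0.1   # message function
Wz = rng.standard_normal((D, 2 * D)) * 0.1   # GRU update gate
Wr = rng.standard_normal((D, 2 * D)) * 0.1   # GRU reset gate
Wn = rng.standard_normal((D, 2 * D)) * 0.1   # GRU candidate state
Wo = rng.standard_normal((1, D)) * 0.1       # readout layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate(H, edges, T=3):
    """Run T rounds of message passing with a GRU-style node update."""
    for _ in range(T):
        M = np.zeros_like(H)
        for i, j in edges:  # directed edge j -> i
            # Step 1: pair-wise message generation from connected states.
            M[i] += np.tanh(Wm @ np.concatenate([H[i], H[j]]))
        # Step 2: aggregation already done by the sum above; step 3: GRU update.
        Hm = np.concatenate([H, M], axis=1)
        z = sigmoid(Hm @ Wz.T)
        r = sigmoid(Hm @ Wr.T)
        n = np.tanh(np.concatenate([r * H, M], axis=1) @ Wn.T)
        H = (1.0 - z) * H + z * n
    return H

def readout(H):
    """Step 4: order-invariant readout (sum of node states, then linear)."""
    return Wo @ H.sum(axis=0)

# Three features, fully connected (a ternary association).
H0 = rng.standard_normal((3, D))
edges = [(i, j) for i in range(3) for j in range(3) if i != j]
y = readout(propagate(H0, edges))
```

Because the weights are shared and the readout sums over nodes, feeding the same three features in any of the 6 possible orders produces the same output, which is the commutativity property required above.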
Next, we show how to encode the graph entities. Previous works [30, 13] used hand-crafted feature descriptors in training. Is it possible to utilize the representation capability of deep learning by learning directly from raw images? Moreover, while simple geometric features like points and lines have off-the-shelf descriptors, what about more complex geometric primitives like conics and planes? In this paper, we propose a composable graph structure that encodes geometric primitives with point-based image patches, as shown in Fig. 2B.
Lastly, a visual geometric skill (VGS) is composed by combining or linking multiple VGS kernels. This paper will only cover kernel combinations and will leave kernel linking for future research.
III-B VGS learning by watching human demonstrations
Assume a VGS task consists of multiple geometric skill kernels; learning the VGS then becomes the optimization of each kernel operator G given a human demonstration image sequence.
An optimal G selects the right geometric feature association out of a set of combinatorial instances. For example, for the point-to-point kernel, we can get N feature points from one image by applying any feature extractor; enumerating them yields a combinatorial number of candidate instances. Each instance has an output obtained by applying the operator G; we compute a relevance factor for each instance and select the one with the maximum relevance.
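The enumerate-score-select mechanics can be sketched as follows. In VGS-IL the relevance scores are learned end-to-end from the reward described below; here we stand in random logits and a toy point-to-point kernel purely to show the combinatorial candidate set and the argmax selection (all names and values are illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N = 6                                   # feature points from one image
points = rng.standard_normal((N, 2))    # toy 2-D feature locations

def kernel(p, q):
    """Toy point-to-point kernel output: the image-space error vector."""
    return p - q

# Enumerate all unordered candidate point-to-point instances.
candidates = list(combinations(range(N), 2))

# Hypothetical relevance factors (learned in VGS-IL; random here),
# normalized with a softmax so they sum to one.
logits = rng.standard_normal(len(candidates))
relevance = np.exp(logits) / np.exp(logits).sum()

# Select the instance with maximum relevance and read out its error.
best = candidates[int(np.argmax(relevance))]
error = kernel(points[best[0]], points[best[1]])
```

Note that the candidate set grows combinatorially with N (15 pairs already for 6 points), which is the computational burden the conclusion section returns to.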
Applying the selected G on each image frame outputs a control error signal.
We measure the quality of control signals by a reward function using two metrics: i) errors are overall decreasing along the time steps of the human demonstration; and ii) error changes are smooth. The first metric is encoded into the reward function as defined in our previous work. To achieve smoothness, we modify the loss function by adding a geometry consistency regularizer (GCR), while keeping the same residual sum of weights (RSW) regularizer for deterministic selection. GCR forces the learning of a more consistent selection across frames.
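The two quality metrics can be illustrated with a toy scoring function over a time series of error norms. This is not the paper's reward; it is a hedged sketch (the `alpha` trade-off weight and both penalty forms are our own assumptions) showing why a smoothly decreasing error trace scores higher than an oscillating one:

```python
import numpy as np

def control_signal_reward(err, alpha=1.0):
    """Score a time series of error norms: reward overall decrease
    (metric i) and penalize jumpy changes via second differences
    (metric ii). alpha is a hypothetical trade-off weight."""
    diffs = np.diff(err)
    decrease = -diffs.sum()                       # total drop over the demo
    smoothness = -np.abs(np.diff(diffs)).sum()    # second-difference penalty
    return decrease + alpha * smoothness

good = np.linspace(1.0, 0.0, 20)              # smoothly decreasing errors
bad = np.abs(np.sin(np.linspace(0, 9, 20)))   # oscillating errors
```

A correct feature association tends to produce the first kind of trace during a demonstration, while an inconsistent (frame-to-frame switching) selection produces the second, which is what the GCR term penalizes.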
III-C Links to geometric vision-based controllers
The control error signal output from G is observed in image pixel space. Mapping image observations to robot actions is a long-running research topic, also known as robot eye-hand coordination, visuomotor policy learning, or vision-guided robot control. Approaches can be divided into two categories: end-to-end learned visuomotor policies and geometric vision based controllers such as visual servoing (VS).
As shown in Fig. 4, the basic idea of using geometric vision in VS control is: i) mapping an error vector to camera motion via an interaction matrix derived from the camera's relative spatial velocity equation; and ii) mapping camera motion to robot motion via a calibration model, as in VS, or via trial-and-error based online estimation, as in Uncalibrated Visual Servoing (UVS). Here we discuss feature-based visual servoing, which our VGS learning directly links to.
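For a concrete instance of step i), the classic image-based visual servoing law for point features uses the well-known interaction matrix of a normalized image point and drives the camera with v = -lambda * pinv(L) @ e. The sketch below follows the standard formulation from the visual servo control literature (gain and feature values are illustrative):

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """Classic interaction matrix of one image point, with (x, y) in
    normalized image coordinates and Z the point's depth; rows map the
    6-DOF camera velocity to the point's image velocity."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_velocity(features, goals, depths, lam=0.5):
    """Camera velocity command v = -lam * pinv(L) @ e for stacked points."""
    L = np.vstack([point_interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    e = (np.asarray(features) - np.asarray(goals)).ravel()
    return -lam * np.linalg.pinv(L) @ e
```

VGS-IL supplies the error vector e directly from the learned kernel output, so this controller can consume it without a separate feature tracker.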
VGS-IL removes the need for robust feature trackers while keeping the geometric error output that can be linked with a visual servoing controller. Compared to traditional approaches that hand-select features to encode a task concept, VGS-IL learns the feature selection directly in a data-driven way. Instead of tracking each geometric feature and then associating them, VGS-IL directly extracts their associations in an adaptive manner, which has been shown to be more robust.
It is worth noting that visual servoing control is sensitive to modeling errors. Combining the 3D geometric vision aspects of VS with reinforcement learning has the potential to yield controllers that are both efficient and robust.
Table I column headers: Task — Sorting; Insertion: point-to-point; Insertion: line-to-line; Folding; Screw.
Table II column headers: Settings — Random Target; Change Camera; Object Occlusion; Object Outside FOV; Change Illumination.
Through experimental evaluation we aim to determine: (i) whether VGS-IL can learn a correct and consistent geometry-parameterized task concept given one human demonstration; and (ii) whether VGS-IL can output high-quality error signals for accurate robot control. For analysis, we decompose the two goals into three evaluation steps: (1) given one human demonstration video, will VGS-IL output a correct and consistent task concept; (2) how does VGS-IL generalize from human demonstrator to robot imitator under changed task and environmental settings; and (3) how does the control error converge, and how is it affected by the network training time of VGS-IL.
Baselines: We hand-designed two baselines for comparison. Baseline1 is conventional visual servoing with video tracking of a redundant feature set. A human carefully hand-selects 10 pairs of geometric features to represent a task and initializes multiple feature trackers for each camera. Baseline1 succeeds as long as one pair out of the ten can be tracked throughout the entire task. Baseline2 is a method from our previous work that relies on hand-crafted geometric feature descriptors (SIFT and LBD) in training; however, it does not take representation consistency into consideration.
Metrics: We designed two evaluation metrics: (1) Acc to measure accuracy; and (2) conAcc to measure consistency. Specifically, given N video frames, Acc = M/N, where M is the number of frames with a correct geometric task concept inference. Defining conAcc is more challenging, since directly measuring inference consistency involves complex statistical methods. For simplicity, we measure the time-series control error output (i.e. the inference outcome) and define conAcc as the autocorrelation of the time-series error norms with shift k. We fix k=2 in all experiments. Since baseline1 is a collection of redundant pairs of trackers, measuring its conAcc is difficult; in that case we assign it the maximum, conAcc=1, whenever baseline1 succeeds.
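Both metrics are straightforward to compute; the sketch below shows one plausible reading (the exact normalization of the autocorrelation is not specified in the text, so the centered, variance-normalized form here is our assumption):

```python
import numpy as np

def acc(correct_frames, total_frames):
    """Acc = M / N: fraction of frames with correct concept inference."""
    return correct_frames / total_frames

def con_acc(err, k=2):
    """Lag-k autocorrelation of the time-series error norms; values near 1
    indicate a temporally consistent inference (normalization assumed)."""
    e = np.asarray(err, dtype=float)
    e = e - e.mean()
    return (e[:-k] * e[k:]).sum() / (e * e).sum()

smooth = np.linspace(1.0, 0.0, 50)   # consistent, smoothly decreasing errors
```

A consistent selection produces a slowly varying error norm and hence a lag-2 autocorrelation close to 1, while frame-to-frame switching between feature associations drives it down.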
Tasks: To facilitate comparisons, we follow the same four tasks defined in our previous work: Sorting, Insertion, Folding, and Screw (see Fig. 5 for details). Sorting represents a rich-texture task that requires a point-to-point kernel; Insertion needs a combination of point-to-point and line-to-line kernels; Folding represents deformable object manipulation; and Screw has low image texture.
Evaluation on human demonstration videos
Our first step is to evaluate whether VGS-IL learns both correct and consistent geometric feature associations, given one human demonstration video. For a fair test, we changed both the background and the target pose in evaluation. Qualitative results are displayed in Fig. 5; quantitative metric scores are shown in Table I. The results show that VGS-IL succeeds in generalizing the learned geometry-parameterized task concept in all four tasks. Regarding selection consistency, VGS-IL performs best compared to the two baselines.
Generalization under different environmental settings
We then test whether the learned task concept generalizes from human demonstrator to robot imitator, using the Sorting task. A WAM robot equipped with a Barrett Hand performs the task (Fig. 7). Furthermore, we continue testing on the robot under additional settings (Fig. 6): (a) random target; (b) move camera: we test for real-world projective invariance by randomly translating and rotating the camera; (c) object occlusion; (d) object outside FOV: the object moves outside the camera's field of view and each method is required to automatically recover when the object is back in the image; and (e) change illumination: the lighting condition is changed by adding a spotlight source. Results are shown in Table II, which indicate that VGS-IL performs best in all settings.
Evaluation of the ‘good’ control error signal output
We test how 'good' or 'bad' the control error signals output by VGS-IL are. To do this, we had the robot perform the Sorting task via teleoperation, then ran VGS-IL on the resulting task video and measured the corresponding time-series error signals. If VGS-IL is capable of outputting 'good' control signals, the signals measured on this video should also be good. We additionally examine how the control error signals improve over the course of VGS-IL's optimization; Fig. 8 shows the results at three different training stages.
We present a geometric perspective on visual imitation learning. Specifically, we propose VGS-IL (Visual Geometric Skill Imitation Learning) to learn a geometry-parameterized task concept. VGS-IL infers globally consistent geometric feature association rules from human demonstration video. The learned task concept outputs control error signals that can be directly linked to geometric vision based controllers, thus providing an efficient way to map learned high-level task concepts to low-level robot actions. Experimental evaluations show that our method generalizes well from human demonstrator to robot imitator under various environmental settings.
In practice, VGS-IL requires substantial GPU computation due to its optimization over the full set of combinatorial feature association candidates. A potential solution is to utilize high-dimensional Bayesian optimization methods to directly estimate geometry representation and association parameters from the observation space. Moreover, although we demonstrated applying VGS-IL to tasks by combining different VGS kernels, it is worth further exploring how to sequentially link VGS kernels to program more complex tasks.
- This is studied as observational learning in psychology.
- For further reading, task parameterization using geometric constraints (e.g. point-to-point, point-to-line, point-to-conic, etc.) is intensively studied in [16, 15, 14, 24].
- For example, a point-to-point kernel outputs x-y errors in image pixels, while a point-to-line kernel outputs an error signal from the dot product of their homogeneous coordinates. More examples can be found in the literature.
- For further reading, a comparison is discussed in the ICRA 2018 Tutorial on Vision-based Robot Control.
- (2004) Apprenticeship learning via inverse reinforcement learning. Twenty-first international conference on Machine learning - ICML ’04, pp. 1. External Links: Cited by: §I.
- (1977) Servoing with visual feedback. SRI International. Cited by: §III-C.
- (2015) Learning symbolic representations of actions from human demonstrations. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3801–3808. Cited by: 1st item, §II, §II.
- (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §I.
- (2009) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483. External Links: Cited by: Fig. 1, §I.
- (2018) Training deep neural networks for visual servoing. In ICRA 2018-IEEE International Conference on Robotics and Automation, pp. 1–8. Cited by: §III-C.
- (2018) Going further with direct visual servoing. Ph.D. Thesis, Rennes 1. Cited by: §III-C.
- (2010) Neural mechanisms of observational learning. Proceedings of the National Academy of Sciences 107 (32), pp. 14431–14436. Cited by: footnote 1.
- (2006) Visual servo control. I. Basic approaches. IEEE Robotics and Automation Magazine 13 (4), pp. 82–90. External Links: Cited by: 2nd item, §III-C, §III-C.
- (2018) Geometric and photometric vision-based robot control: modeling approach. Cited by: §III-C, footnote 4.
- (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §III-A.
- (1994) High-performance visual closed-loop robot control. Ph.D. Thesis. Cited by: 2nd item, §III-C.
- (2004) Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems 47 (2-3), pp. 109–116. External Links: Cited by: §III-A.
- (1999) A Hierarchical Architecture for Vision-Based Robotic Manipulation Tasks. First Int. Conf. on Computer Vision Systems 542, pp. 312–330. Note: geometric constraints can be used to describe a task. Cited by: §II, footnote 2.
- (1999) Task specification and monitoring for uncalibrated hand/eye coordination. (May). Cited by: footnote 2.
- (1999) Task specification and monitoring for uncalibrated hand/eye coordination. In Proceedings 1999 IEEE International Conference on Robotics and Automation, Vol. 2, pp. 1607–1613. Cited by: §III-A, footnote 2.
- (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International conference on machine learning, pp. 49–58. Cited by: §I.
- (2017) One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905. Cited by: 2nd item, §I.
- (2019) Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters. Cited by: 2nd item, §II.
- (2018) Dense object nets: learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756. Cited by: 2nd item, §II.
- (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §III-A.
- (2016) ViTa: Visual task specification interface for manipulation with uncalibrated visual servoing. Proceedings - IEEE International Conference on Robotics and Automation 2016-June, pp. 3434–3440. External Links: Cited by: §II.
- (2016) Vita: visual task specification interface for manipulation with uncalibrated visual servoing. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3434–3440. Cited by: Fig. 2.
- (2000) On specifying and performing visual tasks with qualitative object models. Proceedings-IEEE International Conference on Robotics and Automation 1 (April), pp. 636–643. External Links: Cited by: footnote 2.
- (2003) Multiple view geometry in computer vision. Cambridge university press. Cited by: footnote 3.
- (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573. Cited by: §I.
- (1994) Toward an assembly plan from observation. i. task recognition with polyhedral objects. IEEE transactions on robotics and automation 10 (3), pp. 368–385. Cited by: §II.
- (1997) Experimental evaluation of uncalibrated visual servoing for precision manipulation. Proceedings of International Conference on Robotics and Automation 4 (April), pp. 2874–2880. External Links: Cited by: §III-C.
- (2018) Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach. arXiv preprint arXiv:1810.00159. Cited by: §I, §III-B2.
- (2019) Visual geometric skill inference by watching human demonstration. arXiv preprint arXiv:1911.04418. Cited by: 1st item, §I, §III-A, §III-A, §III-A, §III-B2, §III-B2, §III-C, §III, §IV, §IV, 1.
- (1994) Learning by watching: extracting reusable task knowledge from visual observation of human performance. IEEE transactions on robotics and automation 10 (6), pp. 799–822. Cited by: §II.
- (2015) Deep learning. Nature 521 (7553), pp. 436–444. External Links: Cited by: §III-C.
- (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §III-C.
- (1999) Object recognition from local scale-invariant features. In ICCV, Vol. 99, pp. 1150–1157. Cited by: 1st item, §III, §IV.
- (2018) Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053. Cited by: 2nd item, §I, §II.
- (2019) KETO: learning keypoint representations for tool manipulation. arXiv preprint arXiv:1910.11977. Cited by: §II.
- (2011) ORB: an efficient alternative to SIFT or SURF. In ICCV, Vol. 11, pp. 2. Cited by: 1st item, §III.
- (2016) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §V.
- (2019) Third-person visual imitation learning via decoupled hierarchical controller. In Advances in Neural Information Processing Systems, pp. 2593–2603. Cited by: 2nd item, §I, §II.
- (2019) Graph-structured visual imitation. arXiv preprint arXiv:1907.05518. Cited by: §II.
- (1996) Self-consistency: a fundamental concept in statistics. Statistical Science (3), pp. 229–243. Cited by: §IV.
- (2016) Robot learning with a spatial, temporal, and causal and-or graph. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2144–2151. Cited by: §II.
- (2015) Robot learning manipulation action plans by” watching” unconstrained videos from the world wide web. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §II.
- (2013) An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency. Journal of Visual Communication and Image Representation 24 (7), pp. 794–805. Cited by: 1st item, §III, §IV.