A Geometric Perspective on Visual Imitation Learning


We consider the problem of visual imitation learning with neither human supervision (e.g. kinesthetic teaching or teleoperation) nor access to an interactive reinforcement learning (RL) training environment. We present a geometric perspective from which to derive solutions to this problem. Specifically, we propose VGS-IL (Visual Geometric Skill Imitation Learning), an end-to-end geometry-parameterized task concept inference method that infers globally consistent geometric feature association rules from human demonstration video frames. We show that, instead of learning actions from image pixels, learning a geometry-parameterized task concept provides an explainable representation that is invariant from demonstrator to imitator under various environmental settings. Moreover, such a task concept representation links directly to geometric vision-based controllers (e.g. visual servoing), allowing high-level task concepts to be mapped efficiently to low-level robot actions.

I Introduction

Compared to traditional robotic task teaching methods, visual imitation learning promises a more intuitive way to program general-purpose tasks. Like most other learning methods, it suffers from the generalization problem. Commonly, three strategies are used to tackle generalization. The first is to increase the number of human demonstrations via kinesthetic teaching or teleoperation. This has proven effective for supervised learning methods such as behavior cloning [4]. However, it requires laborious human supervision. The second strategy is to assume access to robot-environment interactions, where samples can be expanded via reinforcement learning (RL) methods (e.g. IRL [1], GCL [17], GAIL [26]). Unfortunately, new issues regarding transfer learning and low sample efficiency arise during both simulation and real-world training. The last strategy assumes that shared knowledge can be learned from demonstration samples across multiple similar tasks; from this shared knowledge, the robot is able to learn a new task when given one more demonstration. This strategy is used in meta-learning based approaches (e.g. one-shot [18]).

Fig. 1: Rethinking the classical ‘correspondence’ problem [5] in imitation learning reveals two essential questions: i) ‘what’ information should be transferred from a human demonstrator to a robot imitator; and ii) ‘how’ can this information be used to bring about actions (i.e. motor action). This paper presents a geometric perspective to this problem. We find a proper geometric representation of ‘what’ that facilitates the training of ‘how’.

Generally, the aforementioned methods use human demonstrations as state-action samples to learn a policy mapping from image to action (i.e. they approximate a target state-action distribution). Consequently, in order to improve generalization, it is necessary to collect more state-action experience from either human teaching (supervision) or robot self-exploration (RL training). However, neither approach proves satisfactory. This motivates us to ask the question: is it possible to learn by watching one human demonstration without the extra effort of interactive training?

Recently, several methods have been proposed to tackle this question. One key insight is to rethink the classical ‘correspondence’ problem [5], which studies the difference between demonstrator and imitator. This insight shifts our view of human demonstration towards encoding task concepts rather than control. Empirically, this aligns with our cognitive process in peer learning, which involves first understanding the task before attempting any motor actions¹. This hierarchical view decouples learning the ‘what’ from learning the ‘how’ (zero-shot [35], see Fig. 1). Two benefits are immediate: i) the promise of good generalization, since the method learns a high-level cognitive concept [29, 39] of the task instead of directly matching state-action distributions; and ii) the promise of reusable low-level policies as basic skills across different tasks [30]. However, two new problems arise: i) how to represent the high-level task; and ii) how to train the low-level controllers without intensive additional cost.

In this paper, we provide a geometric perspective to derive solutions. We show that, instead of learning from image pixels to actions, learning a geometry-parameterized task concept² provides an explainable and invariant representation across demonstrator to imitator under various environmental settings. Moreover, it provides controllability that can be directly linked to geometric vision based controllers (e.g. visual servoing). Our contributions are:

  • We propose VGS-IL (Visual Geometric Skill Imitation Learning), an end-to-end geometry-parameterized task concept inference method used to infer globally consistent geometric association rules from demonstration video frames. Instead of learning from geometric primitives [30, 3] (e.g. points, lines, and conics) with handcrafted feature descriptors [34, 37, 44], VGS-IL can directly optimize a combinatorial representation from image pixels. Experiments show that the learned task concept generalizes well from human to robot despite the visual difference in arm and hand appearance.

  • We show that such a geometry-parameterized task concept can be directly linked to geometric vision based controllers [12], thus forming an efficient way to map high-level task concepts to low-level robot actions. Unlike prevalent methods that require hierarchical training of an additional control policy [35, 18, 39, 20, 19], experimental results show that our learned representation fits directly into a visual servoing [9] controller, removing the need for feature trackers.

By using geometric primitive associations and 3D computer vision geometry based controllers, we present a method for general purpose robotic task programming.

II Related Works

Visual Imitation Learning: The problem defined in visual imitation learning is: given one or several human demonstration videos, how can a new task be learned? Research on this topic dates back to 1994 [27, 31]. With the rise of deep learning and reinforcement learning, more influential works have since been published. Some of these, which aim to learn a task directly from visual inputs, are reviewed in Section I. Another research stream aims to learn a semantic knowledge representation. These methods commonly rely on independent pipelines such as object detection and action recognition. Despite their complexity, experiments show they can learn semantic task plans that follow a procedural manner [42, 3, 43].

Hierarchical Visual Imitation Learning: Instead of simultaneously learning task definition and control, hierarchical approaches decouple the two by focusing on learning a shared high-level task representation across human demonstrator and robot imitator. The two core problems are: i) how to represent the high-level task concept; and ii) how to train the low-level control policy. The first is more important, since the representation of the task concept determines how the controller is trained. For example, many pioneering works parameterize the task concept at the pixel level by using sub-goal outputs from a neural network [35, 39]. The low-level policy is then sub-goal conditioned and trained in a hierarchical reinforcement learning manner.

More recent works represent a task at the object level, where object correspondence [20, 19] or graph structure [40] relationships are used to parameterize a task. The low-level controller is then trained based on distance errors in the embedded parameterization space. This approach shows success in pushing and placing tasks. However, it lacks the definition resolution required for more complex tasks like insertion.

Geometry-Based Visual Imitation Learning: Alternatively, going deeper inside the objects, approaches based on geometric features have emerged. An early pioneering work by Ahmadzadeh et al. (2015) [3] proposed VSL, which learns a task representation based on feature point correspondence from one human demonstration video. A similar approach by Qin et al. (2019) [36] presents KETO, which uses key point relationships to represent a tool manipulation task. In general, their low-level controllers are trained separately at considerable cost, with little study of how a proper task representation can facilitate low-level policy training.

Beyond a simple task concept representation based on key point correspondence, other basic geometric constraints (point-to-line, line-to-line, etc.) can enrich our toolbox for parameterizing task concepts [22]. Furthermore, by concurrently combining and sequentially linking them [14], we can find a general and scalable way to program more complex manipulation tasks. To the best of the authors' knowledge, applying such systematic geometry-based task programming in visual imitation learning is rarely studied.

Fig. 2: A: An example of a visual geometric skill (VGS) kernel-linking task [23]: inserting a pen inside its cap involves a point-to-point constraint (pen tip touches cap top) followed by a line-to-line constraint (align pen direction to the cap). B: Basic geometric skill kernel representations using a graph structure. More complex tasks are programmed by combining the basic kernels. C: An example of the parameterization design of the point-to-point kernel. Using a message passing graph neural network with a GRU update fulfills the three required properties for a good representation. Other visual geometric skill kernels can be parameterized in a similar manner.

III Method

This research builds upon our previous work on visual geometric skills learning [30], employing a more data-driven approach to learn globally consistent geometric feature association rules without handcrafted feature descriptors [34, 37, 44].

III-A Geometry parameterized task representation

The basic idea of task parameterization based on geometric feature association rules, as first proposed in [16], has two parts: i) the basic geometry constraints, or visual geometric skill (VGS) kernels [30]; and ii) the combination or conditioned linking of basic kernels to create more complex tasks, which we refer to as visual geometric skills. For example, some commonly used geometric skill kernels are:

  • point-to-point : the coincidence of two points.

  • point-to-line : a point fits on (touches) a line.

  • line-to-line : a line is co-linear with another line.

  • point-to-conic : a point fits a conic.

This parameterization method provides a programmable framework that can be used to create more complex tasks by combining several constraints in parallel and then sequentially linking the basic kernels. For example, inserting a pen tip inside its cap involves a point-to-point constraint linked by a line-to-line one (Fig. 2A).
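As a concrete reading of two of the kernels above, the following minimal Python sketch computes the image-space error that footnote 3 describes for each: x-y pixel differences for point-to-point, and the dot product of homogeneous coordinates for point-to-line. The function names are hypothetical and chosen only for illustration, and the point-to-line value is an unnormalized algebraic error rather than a pixel distance.

```python
from typing import Tuple

Point = Tuple[float, float]
Homog = Tuple[float, float, float]


def to_homogeneous(p: Point) -> Homog:
    """Lift an image point (x, y) to homogeneous coordinates (x, y, 1)."""
    return (p[0], p[1], 1.0)


def cross(a: Homog, b: Homog) -> Homog:
    """Cross product; for two homogeneous points it gives the line through them."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def point_to_point_error(p: Point, q: Point) -> Tuple[float, float]:
    """x-y error in pixels; (0, 0) when the two points coincide."""
    return (p[0] - q[0], p[1] - q[1])


def point_to_line_error(p: Point, a: Point, b: Point) -> float:
    """Dot product of the point with the line through a and b; 0 when p is on the line."""
    line = cross(to_homogeneous(a), to_homogeneous(b))
    return sum(l * x for l, x in zip(line, to_homogeneous(p)))
```

Both errors vanish exactly when the corresponding constraint is satisfied, which is what allows them to be read as control error signals later in the paper.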

First, how do we define a good VGS kernel representation? Given a set of geometric features, a VGS kernel representation is an operator that maps the set to a latent vector. We propose three essential properties of a good operator:

  • Commutative: A good operator should give a consistent output for every ordering of its input features. For example, when we enumerate all possible associations of three features, we require the same output for all 6 possible permutations, since they define the same task.

  • Non-inner-associative: A good operator should be able to represent how its inputs are grouped internally. For example, consider a point and a line defined by three points. A point-to-line kernel operation associates the single point with the group of three line points, which is distinct from any other inner association of the same four points.

  • Scalability: The parameterization should scale to n-ary operations. For example, a point-to-point kernel is a binary operation, while point-to-line is a quaternary operation if three points represent a line.

Examples of basic VGS kernel parameterizations are included in Fig. 2B. As shown in Jin et al. [30], a parameterization that satisfies the above properties can be built from a message passing graph neural network [21] with a gated recurrent unit (GRU) [11]: i) a graph structure is scalable to represent n-ary operations; ii) graph edges define different inner associations; and iii) the message passing mechanism combined with the GRU makes the operator invariant to input order. Specifically, this design (Fig. 2C) has four steps:

  • Pair-wise message generation: a message is computed from the hidden states of each pair of connected nodes.

  • Message aggregation: each node collects all of its incoming messages.

  • Message update: each node updates its hidden state from the aggregated message using a gated recurrent unit (GRU).

  • Readout: after T layer updates, all nodes’ final states are fed into a readout function parameterized using MLP layers.
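For reference, the four steps can be written in the standard notation of neural message passing [21]. The symbols below (hidden state $h_v^{t}$ of node $v$ at layer $t$, message function $M$, neighbor set $\mathcal{N}(v)$, readout $R$, graph $G$) and the choice of a sum for aggregation are assumptions made for illustration; the exact form used in [30] may differ.

```latex
\begin{align}
m_{vw}^{t} &= M\!\left(h_v^{t},\, h_w^{t}\right)
  && \text{(pair-wise message generation)} \\
m_v^{t+1} &= \sum_{w \in \mathcal{N}(v)} m_{vw}^{t}
  && \text{(message aggregation)} \\
h_v^{t+1} &= \mathrm{GRU}\!\left(h_v^{t},\, m_v^{t+1}\right)
  && \text{(message update)} \\
\hat{y} &= R\!\left(\{\, h_v^{T} : v \in G \,\}\right)
  && \text{(readout after $T$ layers)}
\end{align}
```

Under this sketch, order invariance follows when both the aggregation and the readout are symmetric functions of their arguments, as a sum is.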

Next, we show how to encode the graph entities. Previous works [30, 13] used handcrafted feature descriptors in training. Is it possible to exploit the representation capability of deep learning by learning directly from raw images? Moreover, for simple geometric features like points and lines, there are off-the-shelf descriptors that can be used. What about more complex geometric primitives like conics and planes? In this paper, we propose a composable graph structure that encodes geometric primitives with point-based image patches, as shown in Fig. 2B.

Lastly, a visual geometric skill (VGS) is composed by combining or linking multiple VGS kernels. This paper will only cover kernel combinations and will leave kernel linking for future research.

III-B VGS learning by watching human demonstrations

Assume a VGS task consists of multiple geometric skill kernels. Learning a VGS then becomes the optimization of each kernel's operator given a human demonstration image sequence.

Select-out function.

An optimal operator selects the right geometric feature association out of a set of combinatorial instances. For example, in the point-to-point kernel, we can get N feature points from one image by applying any feature extractor. Enumerating their associations gives the set of candidate instances. Applying the operator to each instance produces an output, from which we compute a relevance factor; the instance with the maximum relevance factor is selected.
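The select-out step for the point-to-point kernel can be sketched as below. This is a minimal illustration under stated assumptions, not the learned model: `toy_relevance` is a hypothetical stand-in for the relevance factor produced by the trained operator, and candidates are taken to be unordered pairs, which gives N(N-1)/2 instances for N feature points.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Pair = Tuple[Point, Point]


def select_out(points: Sequence[Point],
               relevance: Callable[[Pair], float]) -> Tuple[Pair, float]:
    """Return the candidate pair with the maximum relevance factor."""
    # Enumerate every unordered pair of feature points as a candidate instance.
    candidates: List[Pair] = list(combinations(points, 2))
    if not candidates:
        raise ValueError("need at least two feature points")
    scores = [relevance(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_relevance(pair: Pair) -> float:
    """Hypothetical scoring rule for the demo: closer points score higher."""
    (x1, y1), (x2, y2) = pair
    return -((x1 - x2) ** 2 + (y1 - y2) ** 2)
```

For example, `select_out([(0.0, 0.0), (10.0, 0.0), (1.0, 1.0), (5.0, 5.0)], toy_relevance)` scores six candidate pairs and returns the pair `((0.0, 0.0), (1.0, 1.0))`. The exhaustive enumeration grows quadratically with N for this kernel and faster for higher-arity kernels.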


Applying the operator to each image frame outputs a control error signal³. Assuming the operator is optimal, applying it to a human demonstration image sequence outputs a high-quality control error signal sequence. Hence, optimizing the operator is essentially selecting the right observation space in which to observe a high-quality control error signal output from the human demonstrator. We call this the observational expert assumption. By maximizing the quality of the control error signals, we are able to adjust our estimate of the operator (Fig. 3).

We measure the quality of control signals by a reward function using two metrics: i) errors are overall decreasing along the time steps of the human demonstration; and ii) error changes are smooth. The first metric is encoded into the reward function as defined in [30]. To achieve smoothness, we modify the loss function defined in [30] by adding a geometry consistent regularizer (GCR), while keeping the same residual sum of weights (RSW) regularizer for deterministic selection. GCR forces the model to learn a more consistent selection across frames.

Fig. 3: Given human demonstration video frames, optimizing the operator is essentially selecting the right observation space in which to measure a high-quality control error signal output from the human expert demonstrator. By maximizing the quality of control error signals, we adjust our estimate. Top: The control error output of an operator trained without the geometry consistent regularizer (GCR) on the Sorting task described in Sec. IV. Bottom: The control error output of a well-trained operator obtained by adding GCR to the loss function.

By optimizing the reward function using InMaxEntIRL [29, 30], the control signal quality from the human demonstrator is optimized, resulting in an optimized operator. To summarize, we propose VGS-IL (Visual Geometric Skill Imitation Learning) as detailed in Algorithm 1.

Algorithm 1 VGS-IL
Input: Expert demonstration video frames, demonstrator confidence level, VGS (a set of m kernels)
Result: Optimal weights of each kernel operator
Construct kernel graph instances on each frame:
for i = 1:m do
        for t = 1:n do
                Extract features on frame t according to the structure of kernel i defined in Section III-A
                Construct all instances by feature association
        end for
        Prepare state change samples
        Update the weights of kernel i using InMaxEntIRL [30]
end for

III-C Links to geometric vision-based controllers

The control signal output from the learned operator is observed in image pixel space. Mapping image observations to robot actions is a long-running research topic [2], also known as robot eye-hand coordination, visuomotor policy learning, or vision-guided robot control [12]. Approaches can be divided into two categories⁴: i) end-to-end learning methods [33]; and ii) visual servoing (VS) [9]. End-to-end learning approaches can work without explicit features and are useful in complex visual environments due to their powerful representation capability [32], but they require time-consuming training and transfer poorly to new environments (i.e. poor generalization). Visual servoing approaches run in real time using a geometric vision-based control law, but can lack sufficient visual representation capability. Combining the geometric vision part of visual servoing with learning-based methods is rarely studied [7, 6].

Fig. 4: A geometric vision-based controller uses camera 3D geometry to build relationships between 3D object motion, observed feature motion in the image plane, and camera spatial velocity. Finally, it links to robot actions via a calibration model or via trial-and-error online learning.

As shown in Fig. 4, the basic idea of using geometric vision in VS control is: i) mapping an error vector to camera motion via an interaction matrix derived from the camera relative spatial velocity equation [9]; and ii) mapping camera motion to robot motion via a calibration model as in VS, or via a trial-and-error online estimation as in Uncalibrated Visual Servoing (UVS [28]). Here we discuss feature-based visual servoing, to which our VGS learning directly links.
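Step i) can be illustrated with the classical image-based visual servoing law for a single point feature described in [9]: the camera velocity is the negative gain times the pseudo-inverse of the interaction matrix applied to the feature error. The sketch below is a minimal pure-Python illustration, not the controller used in the experiments. It assumes normalized image coordinates, a known or estimated depth `Z`, and a hypothetical gain value; the function names are chosen for illustration only.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def interaction_matrix(x: float, y: float, Z: float) -> Matrix:
    """2x6 interaction matrix of a normalized image point (x, y) at depth Z.

    Columns follow the camera velocity ordering (vx, vy, vz, wx, wy, wz).
    """
    return [
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ]


def camera_velocity(L: Matrix, error: Sequence[float],
                    gain: float = 0.5) -> List[float]:
    """Compute v = -gain * L^+ e, with L^+ = L^T (L L^T)^-1 for a 2x6 L."""
    # Entries of the symmetric 2x2 matrix L L^T.
    a = sum(v * v for v in L[0])
    b = sum(u * v for u, v in zip(L[0], L[1]))
    d = sum(v * v for v in L[1])
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("interaction matrix is rank deficient")
    # w = (L L^T)^-1 e using the closed-form inverse of a 2x2 matrix.
    w0 = (d * error[0] - b * error[1]) / det
    w1 = (-b * error[0] + a * error[1]) / det
    # v = -gain * L^T w
    return [-gain * (L[0][j] * w0 + L[1][j] * w1) for j in range(6)]
```

With this law the predicted feature motion `L v` equals `-gain * e`, so the image error decays exponentially when the depth estimate and camera model are accurate. A learned kernel that outputs a point-to-point error could supply `error` here in place of a hand-initialized feature tracker.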

VGS-IL removes the need for robust feature trackers while keeping the geometric error output that can be linked with a visual servoing controller. Compared to traditional approaches that hand select features to encode a task concept, VGS-IL directly learns the feature selection using a data driven approach. Instead of tracking each geometric feature and then associating them, VGS-IL directly extracts their associations in an adaptive manner which has been shown to be more robust [30].

It is worth noting that visual servoing control is sensitive to modeling errors [10]. Combining the 3D geometric vision aspect from VS to learn more robust controllers via Reinforcement Learning has the potential to derive both efficient and robust controllers.

Fig. 5: Left: Four tasks designed for evaluation: Sorting, Insertion, Folding and Screw. Right: Qualitative evaluation results of VGS-IL in the four tasks. We select two frames for each task. The Insertion task includes two columns representing the point-to-point and line-to-line kernels respectively. For a fair test, we changed the background and target pose in each task. The red line indicates the selected feature association with the highest confidence. Experiments show VGS-IL succeeds in learning a consistent geometry-parameterized task concept from the human demonstrator in all four tasks. Quantitative results are displayed in Tables I and II below.

Task       | Sorting        | Insertion:     | Insertion:     | Folding        | Screw
           |                | point-to-point | line-to-line   |                |
Metrics    | Acc     conAcc | Acc     conAcc | Acc     conAcc | Acc     conAcc | Acc     conAcc
-----------+----------------+----------------+----------------+----------------+---------------
Baseline1  | 100.0%   1.00  | 100.0%   1.00  | 100.0%   1.00  |  10.0%   n/a   |   8.2%   n/a
Baseline2  | 100.0%   0.03  | 100.0%   0.02  |  81.2%  -0.06  |  80.0%   0.10  |  33.0%   0.08
VGS-IL     | 100.0%   0.98  | 100.0%   0.85  |  93.0%   0.91  |  84.0%   0.98  |  49.0%   0.92

TABLE I: Quantitative evaluation of VGS-IL in the four tasks. All tests are based on changed background and randomly placed target. Results show VGS-IL performs better in learning a consistent geometry-parameterized task concept.

Settings   | Random Target  | Change Camera  | Object         | Object         | Change
           |                |                | Occlusion      | Outside FOV    | Illumination
Metrics    | Acc     conAcc | Acc     conAcc | Acc     conAcc | Acc     conAcc | Acc     conAcc
-----------+----------------+----------------+----------------+----------------+---------------
Baseline1  | 100.0%   1.00  |   0.0%   n/a   |   0.0%   n/a   |   0.0%   n/a   |   0.0%   n/a
Baseline2  |  99.1%  -0.03  |  96.7%  -0.10  |  92.7%  -0.05  |  81.2%  -0.03  |   0.0%   n/a
VGS-IL     | 100.0%   0.55  |  95.0%   0.61  |  97.3%   0.10  |  79.8%   0.19  |  19.2%   0.42

TABLE II: Evaluation results of VGS-IL on the robot imitator under different environmental settings (shown in Fig. 6). We keep testing on the real robot in the Sorting task, while exploring more variance settings. Results show VGS-IL attains the best or near-best Acc in every setting and a positive conAcc throughout.

IV Experiments

Through experimental evaluation we aim to determine: (i) whether VGS-IL can learn a correct and consistent geometry-parameterized task concept given one human demonstration; and (ii) whether VGS-IL can output high-quality error signals for accurate robot control. For analysis, we decompose the two goals into three evaluation steps: (1) given one human demonstration video, does VGS-IL output a correct and consistent task concept; (2) how well does VGS-IL generalize from human demonstrator to robot imitator under changed task and environmental settings; and (3) how does the control error converge, and how is it affected by the amount of network training.

Baselines: We hand-designed two baselines for comparison. Baseline1 is conventional visual servoing with video tracking of a redundant feature set. This involves human interaction to carefully hand-select 10 pairs of geometric features that represent a task and to initialize multiple feature trackers for each camera. As long as one pair out of ten is tracked throughout the entire task process, Baseline1 succeeds. Baseline2 is a method from our previous work [30] that relies on handcrafted geometric feature descriptors (SIFT [34] and LBD [44]) in training; however, it does not take representation consistency into consideration.

Metrics: We designed two evaluation metrics: (1) Acc to measure accuracy; and (2) conAcc to measure consistency. Specifically, given N video frames, Acc = M/N, where M is the number of frames with correct geometric task concept inference. Defining conAcc is more challenging, since directly measuring inference consistency involves complex statistical methods [41]. For simplicity, we measure the time-series control error output (i.e. the inference outcome) and define conAcc as the autocorrelation of the time series of error norms at shift k. We fix k=2 in all experiments. Since Baseline1 is a collection of redundant pairs of trackers, measuring its conAcc is difficult. In this case we assume that conAcc=1, the maximum, if Baseline1 succeeds.
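The two metrics can be sketched as follows. This is a minimal illustration that assumes the standard sample autocorrelation estimator (lag-k covariance normalized by the series variance); the exact normalization behind the reported conAcc values is not stated here, so numbers from this sketch are not guaranteed to match the tables. The function names are hypothetical.

```python
from typing import Sequence


def acc(correct_flags: Sequence[bool]) -> float:
    """Acc = M / N: fraction of frames with a correct task concept inference."""
    if not correct_flags:
        raise ValueError("need at least one frame")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def con_acc(error_norms: Sequence[float], k: int = 2) -> float:
    """Sample autocorrelation of the error-norm time series at shift k."""
    n = len(error_norms)
    if n <= k:
        raise ValueError("series must be longer than the shift k")
    mean = sum(error_norms) / n
    var = sum((e - mean) ** 2 for e in error_norms)
    if var == 0.0:
        raise ValueError("autocorrelation is undefined for a constant series")
    cov = sum((error_norms[t] - mean) * (error_norms[t + k] - mean)
              for t in range(n - k))
    return cov / var
```

A smoothly decreasing error sequence yields a positive value, while a sequence that jumps between unrelated feature associations from frame to frame tends toward zero or negative values, which matches the intuition behind using autocorrelation as a consistency score.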

Tasks: To facilitate comparisons, we follow the same four tasks defined in [30]: Sorting, Insertion, Folding, and Screw (see Fig. 5 for details). Sorting represents a task with rich texture cues that requires a point-to-point kernel; the Insertion task needs a combination of point-to-point and line-to-line kernels; the Folding task represents deformable object manipulation; and the Screw task has low image texture.

Evaluation on human demonstration videos

Our first step is to evaluate whether VGS-IL learns geometric feature associations that are both correct and consistent, given one human demonstration video. For a fair test, we changed both background and target pose in evaluation. Qualitative results are displayed in Fig. 5. Quantitative metric scores are shown in Table I. Results show VGS-IL succeeds in generalizing the learned geometry-parameterized task concept in all four tasks. Regarding selection consistency, VGS-IL scores far higher than Baseline2 in every task; Baseline1's conAcc of 1.00 is assigned by convention whenever it succeeds.

Generalization under different environmental settings

Then we test whether the learned task concept generalizes from human demonstrator to robot imitator. A WAM robot equipped with a Barrett Hand is used to test the Sorting task (Fig. 7). Furthermore, we keep testing on the robot while exploring more variance settings (Fig. 6): (a) random target; (b) change camera: we test for real-world projective invariance by randomly translating and rotating the camera; (c) object occlusion; (d) object outside FOV: the object moves outside the camera’s field of view and each method is required to recover automatically when the object is back in the image; and (e) change illumination: the lighting condition is changed by adding a spotlight. Results are shown in Table II. VGS-IL attains the highest accuracy under random target (tied with Baseline1), object occlusion, and changed illumination, stays within two percentage points of Baseline2 under camera change and object outside FOV, and is the only learned method with a positive conAcc in every setting.

Fig. 6: A1: Human demonstration settings. A2: Robot imitation settings. B0: Human demonstration video used to train VGS-IL. B1-5: Evaluation on the robot under five different environmental settings. B1) random target; B2) change camera; B3) object occlusion; B4) object outside camera’s FOV; B5) change illumination.

Fig. 7: Example of VGS-IL results in the Sorting task. A: results in a human demo. B: results in a robot demo. The top five geometric feature associations are selected. Only the top one, marked in red, is used in evaluation. Results show the same feature point association is selected regardless of human hand or robot hand, under different backgrounds and target poses.

Fig. 8: Evaluation of the control error signals output from VGS-IL in the Sorting task. A: Training curve of VGS-IL with three different stages picked for evaluation. B, C, D: Control error signals output from VGS-IL trained in stage S1, S2, S3. Results clearly show that VGS-IL outputs a ‘good’ control error signal. Moreover, the optimization process is indeed optimizing the quality of control error signals.

Evaluation of the ‘good’ control error signal output

We test how ‘good’ or ‘bad’ the control error signals output from VGS-IL are. To do this, we had the robot perform the Sorting task via teleoperation, then ran VGS-IL on the resulting video and measured the corresponding time-series error signals. If VGS-IL is capable of outputting ‘good’ control signals, the error signals on this video should be good as well. We also examine how the control error signals improve over the course of VGS-IL optimization: Fig. 8 shows the results at three different training stages.

V Conclusion

We present a geometric perspective on visual imitation learning. Specifically, we propose VGS-IL, visual geometric skill imitation learning, to learn a geometry-parameterized task concept. VGS-IL infers globally consistent geometric feature association rules from human demonstration video. The learned task concept outputs control error signals that can be directly linked to geometric vision based controllers, thus providing an efficient way to map learned high-level task concepts to low level robot actions. Experimental evaluations show that our method generalizes well from human demonstrator to robot imitator under various environmental settings.

In practice, VGS-IL requires substantial GPU computation resources because it optimizes over the entire set of combinatorial feature association candidates. A potential solution is to use high-dimensional Bayesian optimization methods [38] to directly estimate geometry representation and association parameters from the observation space. Moreover, although we demonstrated applying VGS-IL to tasks by combining different VGS kernels, it is worth further exploring how to sequentially link VGS kernels to program more complex tasks.


  1. This is studied in observational learning [8] in psychology.
  2. For further reading, task parameterization using geometric constraints (e.g. point-to-point, point-to-line, point-to-conic, etc.) is intensively studied in [16, 15, 14, 24].
  3. For example, a point-to-point kernel outputs x-y errors in image pixels. A point-to-line kernel outputs an error signal from the dot product of their homogeneous coordinates. More examples can be found in [25].
  4. For further reading, a comparison is discussed in the ICRA 2018 Tutorial on Vision-based Robot Control [10].


  1. P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. Twenty-first international conference on Machine learning - ICML ’04, pp. 1. External Links: Document, 1206.5264, ISBN 1581138285, ISSN 0028-0836 Cited by: §I.
  2. G. J. Agin (1977) Servoing with visual feedback. SRI International. Cited by: §III-C.
  3. S. R. Ahmadzadeh, A. Paikan, F. Mastrogiovanni, L. Natale, P. Kormushev and D. G. Caldwell (2015) Learning symbolic representations of actions from human demonstrations. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3801–3808. Cited by: 1st item, §II, §II.
  4. B. D. Argall, S. Chernova, M. Veloso and B. Browning (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §I.
  5. B. D. Argall, S. Chernova, M. Veloso and B. Browning (2009) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483. External Links: Document, arXiv:1105.1186v1, ISBN 0921-8890, ISSN 09218890 Cited by: Fig. 1, §I.
  6. Q. Bateux, E. Marchand, J. Leitner, F. Chaumette and P. Corke (2018) Training deep neural networks for visual servoing. In ICRA 2018-IEEE International Conference on Robotics and Automation, pp. 1–8. Cited by: §III-C.
  7. Q. Bateux (2018) Going further with direct visual servoing. Ph.D. Thesis, Rennes 1. Cited by: §III-C.
  8. C. J. Burke, P. N. Tobler, M. Baddeley and W. Schultz (2010) Neural mechanisms of observational learning. Proceedings of the National Academy of Sciences 107 (32), pp. 14431–14436. Cited by: footnote 1.
  9. F. Chaumette and S. Hutchinson (2006) Visual servo control. I. Basic approaches. IEEE Robotics and Automation Magazine 13 (4), pp. 82–90. External Links: Document, ISBN 1070-9932, ISSN 10709932 Cited by: 2nd item, §III-C, §III-C.
  10. F. Chaumette (2018) Geometric and photometric vision-based robot control: modeling approach. Cited by: §III-C, footnote 4.
  11. J. Chung, C. Gulcehre, K. Cho and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §III-A.
  12. P. I. Corke (1994) High-performance visual closed-loop robot control. Ph.D. Thesis. Cited by: 2nd item, §III-C.
  13. R. Dillmann (2004) Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems 47 (2-3), pp. 109–116. External Links: Document, ISBN 09218890, ISSN 09218890 Cited by: §III-A.
  14. Z. Dodds, M. Jägersand and G. Hager (1999) A Hierarchical Architecture for Vision-Based Robotic Manipulation Tasks. First Int. Conf. on Computer Vision Systems 542, pp. 312–330. Note: geometric constraints can be used to describe a task. Cited by: §II, footnote 2.
  15. Z. Dodds, A. S. Morse and N. Haven (1999) Task Specification and Monitoring for Uncali brat ed Hand / E ye Coordination *. (May). Cited by: footnote 2.
  16. Z. Dodds, G. D. Hager, A. S. Morse and J. P. Hespanha (1999) Task specification and monitoring for uncalibrated hand/eye coordination. In Proceedings 1999 IEEE International Conference on Robotics and Automation, Vol. 2, pp. 1607–1613. Cited by: §III-A, footnote 2.
  17. C. Finn, S. Levine and P. Abbeel (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International conference on machine learning, pp. 49–58. Cited by: §I.
  18. C. Finn, T. Yu, T. Zhang, P. Abbeel and S. Levine (2017) One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905. Cited by: 2nd item, §I.
  19. P. Florence, L. Manuelli and R. Tedrake (2019) Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters. Cited by: 2nd item, §II.
  20. P. R. Florence, L. Manuelli and R. Tedrake (2018) Dense object nets: learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756. Cited by: 2nd item, §II.
  21. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §III-A.
  22. M. Gridseth, O. Ramirez, C. P. Quintero and M. Jagersand (2016) ViTa: Visual task specification interface for manipulation with uncalibrated visual servoing. Proceedings - IEEE International Conference on Robotics and Automation 2016-June, pp. 3434–3440. External Links: Document, ISBN 9781467380263, ISSN 10504729 Cited by: §II.
  23. M. Gridseth, O. Ramirez, C. P. Quintero and M. Jagersand (2016) Vita: visual task specification interface for manipulation with uncalibrated visual servoing. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3434–3440. Cited by: Fig. 2.
  24. G. D. Hager and Z. Dodds (2000) On specifying and performing visual tasks with qualitative object models. Proceedings-IEEE International Conference on Robotics and Automation 1 (April), pp. 636–643. External Links: Document, ISBN 0780358864, ISSN 10504729 Cited by: footnote 2.
  25. R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge university press. Cited by: footnote 3.
  26. J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573. Cited by: §I.
  27. K. Ikeuchi and T. Suehiro (1994) Toward an assembly plan from observation. i. task recognition with polyhedral objects. IEEE transactions on robotics and automation 10 (3), pp. 368–385. Cited by: §II.
  28. M. Jagersand, O. Fuentes and R. Nelson (1997) Experimental evaluation of uncalibrated visual servoing for precision manipulation. Proceedings of International Conference on Robotics and Automation 4 (April), pp. 2874–2880. External Links: Document, ISBN 0-7803-3612-7, ISSN 10504729 Cited by: §III-C.
  29. J. Jin, L. Petrich, M. Dehghan, Z. Zhang and M. Jagersand (2018) Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach. arXiv preprint arXiv:1810.00159. Cited by: §I, §III-B2.
  30. J. Jin, L. Petrich, Z. Zhang, M. Dehghan and M. Jagersand (2019) Visual geometric skill inference by watching human demonstration. arXiv preprint arXiv:1911.04418. Cited by: 1st item, §I, §III-A, §III-A, §III-A, §III-B2, §III-B2, §III-C, §III, §IV, §IV, 1.
  31. Y. Kuniyoshi, M. Inaba and H. Inoue (1994) Learning by watching: extracting reusable task knowledge from visual observation of human performance. IEEE transactions on robotics and automation 10 (6), pp. 799–822. Cited by: §II.
  32. Y. LeCun, Y. Bengio and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. External Links: Document, arXiv:1312.6184v5, ISBN 9780521835688, ISSN 14764687 Cited by: §III-C.
  33. S. Levine, C. Finn, T. Darrell and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §III-C.
  34. D. G. Lowe (1999) Object recognition from local scale-invariant features. In ICCV, Vol. 99, pp. 1150–1157. Cited by: 1st item, §III, §IV.
  35. D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros and T. Darrell (2018) Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053. Cited by: 2nd item, §I, §II.
  36. Z. Qin, K. Fang, Y. Zhu, L. Fei-Fei and S. Savarese (2019) KETO: learning keypoint representations for tool manipulation. arXiv preprint arXiv:1910.11977. Cited by: §II.
  37. E. Rublee, V. Rabaud, K. Konolige and G. R. Bradski (2011) ORB: an efficient alternative to SIFT or SURF. In ICCV, Vol. 11, pp. 2. Cited by: 1st item, §III.
  38. B. Shahriari, K. Swersky, Z. Wang, R. P. Adams and N. de Freitas (2016) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §V.
  39. P. Sharma, D. Pathak and A. Gupta (2019) Third-person visual imitation learning via decoupled hierarchical controller. In Advances in Neural Information Processing Systems, pp. 2593–2603. Cited by: 2nd item, §I, §II.
  40. M. Sieb, Z. Xian, A. Huang, O. Kroemer and K. Fragkiadaki (2019) Graph-structured visual imitation. arXiv preprint arXiv:1907.05518. Cited by: §II.
  41. T. Tarpey and B. Flury (1996) Self-consistency: a fundamental concept in statistics. Statistical Science (3), pp. 229–243. Cited by: §IV.
  42. C. Xiong, N. Shukla, W. Xiong and S. Zhu (2016) Robot learning with a spatial, temporal, and causal and-or graph. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2144–2151. Cited by: §II.
  43. Y. Yang, Y. Li, C. Fermuller and Y. Aloimonos (2015) Robot learning manipulation action plans by "watching" unconstrained videos from the world wide web. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §II.
  44. L. Zhang and R. Koch (2013) An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency. Journal of Visual Communication and Image Representation 24 (7), pp. 794–805. Cited by: 1st item, §III, §IV.