Visual Task Progress Estimation with Appearance Invariant Embeddings for Robot Control and Planning


To fulfill the vision of full autonomy, robots must be capable of reasoning about the state of the world. In vision-based tasks, this means that a robot must understand the dissimilarities between its current perception of the environment and that of another state. To be of practical use, this dissimilarity must be quantifiable and computed over scenes with different viewpoints, nature (simulated vs. real), and appearances (shape, color, luminosity, etc.). Motivated by this problem, we propose an approach that uses the consistency of the progress among different examples and viewpoints of a task to train a deep neural network to map images into measurable features. Our method builds upon Time-Contrastive Networks (TCNs), originally proposed as a representation for continuous visuomotor skill learning, to train the network using only discrete snapshots taken at different stages of a task such that the network becomes sensitive to differences in task phases. We associate these embeddings with a sequence of images representing gradual task accomplishment, allowing a robot to iteratively query its motion planner with the current visual state to solve long-horizon tasks. We quantify the granularity achieved by the network in recognizing the number of objects in a scene and in measuring the volume of liquid in a cup. Our experiments leverage this granularity to make a mobile robot move a desired number of objects into a storage area and to control the amount poured into a cup.

I Introduction

One of the most basic capabilities of an intelligent autonomous system is to be able to reason about its current world, understand how it is different from a more desirable world, and act towards improvement by modifying its environment. For example, in Figure 1(a), a service robot could use its camera to capture the image of a disorganized desk. Using an appropriate metric, it would then compare this image with that of a clean desk (say, retrieved from memory). This difference is then transformed into a tidy-up action, which may be realized by a series of manipulation commands. This process is then repeated over a long horizon until the robot finds no discrepancy between the phases of the images.

Fig. 1: (a) A service robot must be able to reason about the state of a disorganized desk relative to an organized desk. The result of this reasoning can be used in conjunction with a planner to sequence long-horizon cleaning motor actions. To this end, we investigate a metric of “distance-to-go” that can be extracted directly from images with different objects and viewpoints. (b) The image (i) shows the viewpoint of a cleaning robot with seven objects sitting in front of it. Under our desired metric, the robot quantifies that image (iii) with six objects better represents the current task progress than the more visually similar image (ii) with thirteen objects.

Here, we focus on the problem of how to learn the described metric to estimate the distance towards the goal of the task, which we will refer to as the “distance-to-go”. We show that a Time-Contrastive Network [23] can be trained to learn a metric of distance-to-go by leveraging the different phases of the task in a self-supervised fashion. Once the metric is available, the subsequent problem of mapping the result of the comparison to a robot motion turns out to be a straightforward search on a sequence of images whose associated actions are grounded by a controller (as explained in Sec. III-B).

The main challenge is that of learning to output small distances between images that are dissimilar in appearance but close with regard to the phase of the task. Conversely, images that look similar but represent different stages of the task should be identified as being far from each other. This problem is exemplified in Figure 1(b). When the robot observes its current scene, which contains seven objects on a gray carpet (i), it should understand that despite (i)-(ii) being the closest pair in appearance, (i)-(iii) are the closest pair in terms of the task progress (i.e. number of objects). Identifying the distance-to-go as progress rather than pixel-wise appearance is essential to provide the correct control action to a planner.

Solving the invariance to appearance has many implications for the practicality of a service robot, of which we cite a few. It allows the system to be trained on images from a multitude of environments regardless of the particular robot in use and the actual working environment (which is only precisely known after deployment). It frees a mobile robot from the need to move to pre-defined positions from which to make observations. It opens the opportunity to seamlessly combine real and simulated images as training data. It also supports lifelong learning, as scenes and objects evolve or change while the underlying task goal remains.

The principal contribution of this paper is to introduce a method to allow a robot to estimate the level of accomplishment of a task from vision (i.e. the distance-to-go metric) by proposing a modified time-contrastive training based on TCNs [23] that focuses solely on the task progress. We also suggest a policy based on the network output to associate local actions to a sequence of images representing the gradual task accomplishment. This policy allows a motion planner to solve long-horizon tasks (e.g. cleaning up a floor by moving blocks one-by-one), and can also be used as a feedback error signal. Finally, with the support from the experiments, we discuss and provide an intuition of why the triplet-loss metric finds a suitable place in learning from tasks that have visually discernible phases and how this insight may be explored to increase efficiency in training.

II Related Work

We discuss the background on the components and related research that form the basis for the design of our method.

II-A Uses of Contrastive Learning in Robotics

Triplet loss, which is the core metric used to train the network in this paper, is perhaps best known from face recognition [22]. This loss was motivated as a way to recognize faces by directly measuring distances in a continuous space of features without relying on explicit layers for classification. Deep convolutional networks trained under this loss have been shown to achieve state-of-the-art performance [9].

Contrastive losses have also found uses in the context of robotics, mainly for learning visual descriptors in both 2D (RGB) [21] and 3D (RGB-D) images [27]. Such descriptors have many applications in robotics, from mapping and localization to navigation, and manipulation. Dense descriptors have been proposed to track a specific keypoint across variations of the image [21] or poses of an object [8].

Although the use of contrastive learning in robotics has mainly focused on the perception problem per se, dense descriptors have a close link to robot control. They can be used to provide a controller with a target for grasping [8] or a set of compact descriptors [7] to learn a control policy. Since these methods are designed for learning correspondences of points between images while we focus on measuring the distance to finish a task, the resulting methods are quite different but show the versatility of contrastive loss in learning features for robotics applications.

Time-Contrastive Networks (TCNs) [23] introduced a self-supervised, time-based approach to triplet loss training. TCN embeddings were used to learn motor skills from observations of humans [23] in both imitation and reinforcement learning. A multi-frame extension capable of capturing velocity concepts from images was used to learn policies for swinging up a cart-pole and for running a half-cheetah [5] in simulation. TCNs were also used as part of a sim-to-real adaptation technique where a robot first learned motor skills in simulation using deep reinforcement learning, and the time-based contrastive loss was added for transferring the task to the real robot [11]. We discuss the relation of our work to TCNs in detail in Section III.

II-B Goal as Images

Extracting and using goals from images has been a long-standing challenge in robotics [14, 3], with past work reliant on the ingenuity of feature design for the processing of relevant information. Recently, automatic feature and representation learning of goals from images with deep learning has become predominant. Such goal representations have been used not only in imitation and reinforcement learning [6, 24, 18] but also to transfer images of simulated goals to the real world with domain randomization [20].

Procedurally close to our paper, but under a setting of grasping in clutter, Jang et al. [10] proposed a method where the desired image of a grasped object is given as a goal, and the robot learns to compare dissimilarities of this image with the result of its actions. However, as in most of the visuomotor learning research with real robots, such methods have been almost exclusively developed and validated on short-horizon tasks without a clear mechanism for long-term planning. Moreover, the usual use of fixed-arm robots (as opposed to mobile bases) means that invariance to the viewpoint is usually not the main concern in such works.

II-C Grounding Actions via Motion Planning

Our goal is to leverage motion planners, and task-and-motion planners (TAMP) for complex cases, to ground actions obtained from the network embeddings. Since TCNs were originally proposed as a representation for visuomotor skill learning, one may wonder why not use the TCN to learn end-to-end. Our first issue is that of practicality, particularly due to the need for creating training data where the robot is part of the images. This leads to networks that are not transferable to a different robot (unless training data with the new robot is fed into the system), makes the robot motion dependent on how the task was demonstrated, and requires failed roll-outs to be demonstrated to provide a useful signal during reinforcement learning.

For that matter, the work of Liu et al. [17], where the robot learns the context of a task from multiple views, partially overcomes the issue of having the robot in the view by using tools with long handles (sticks, brushes, spatulas). As such, when observed from the limited field of view of the camera, it is not possible to discern who (robot or human) is manipulating the object. Thus, while such an approach circumvents the issue of collecting training data in the presence of a robot, it also constrains the range of tasks and the means by which a robot is allowed to interact with the environment.

Certainly, while visuomotor learning with real robots is progressing at an unprecedented pace in both reinforcement learning (e.g. [16, 6]) and imitation learning (e.g. [17, 26]), at the current stage, long-horizon problems are still extremely hard to tackle with deep policies. Besides, no single visuomotor policy addresses the full stack required by mobile manipulators, and even when learning a single short-horizon grasping action, it is impractical to imagine a commercial service robot training its physical motor skills in a hospital or in a childcare facility.

On the other hand, motion planners are widely applied in real-world and commercial applications [15, 13] and can leverage the ubiquity of RGB-D/LIDAR cameras and algorithms for instance segmentation and matching to obtain positional information. Planners produce predictable, often verified solutions, and are supported by a wider, more mature body of work. For the sake of practicality and real-world deployment in the short term, until many of the issues with visuomotor control with deep policies are solved (in particular long-horizon TAMP problems [4]), we believe that resorting to existing planning methods is a reasonable trade-off, particularly when higher levels of technological readiness must be factored in.

III Learning an Appearance Invariant Metric of Task Accomplishment

We re-visit TCNs and propose to train the network to specialize in the degree of accomplishment of a goal rather than in robot skills. In other words, we modify the TCN training to answer the question of what the state of the task is, rather than how to achieve the task. The capability of TCNs to estimate task accomplishment as a metric of distance-to-go has not been investigated so far. The closest evaluation in the spirit of our paper’s goal was to classify the keyframes of a pouring video into its different stages such as “within pouring distance”, “liquid is flowing”, “recipient has liquid”, etc. [23]. Here, we push in this direction and investigate whether TCNs can achieve an understanding of the task progress at a fine-grained level, for example, to allow a robot to control the pouring of a liquid to any arbitrary level in a cup.

III-A Time-Based Self-Supervision with Triplet Loss

Fig. 2: (a) A network trained on triplet loss separates positive and negative examples by learning to locate the margin over all possible triplets in the training data. (b) The separation of the triplet loss when using sequential data where each frame is taken at a different phase of the task. In this case, the adjacent frames are hard cases and one should expect the dissimilarity to increase as the phase from the anchor increases.

Figure 2(a) illustrates the basic concept of a triplet loss as proposed by Schroff et al. [22]. The network is trained to encode high-dimensional images into a low-dimensional Euclidean space where the similarity between embeddings can be directly compared. Thus, the network implements a function $f(x) \in \mathbb{R}^d$, where the output is a $d$-dimensional vector. Given a data set with $N$ possible triplets, for a given instance $i$ an image $x_i^a$ is selected as the anchor, an image $x_i^p$ belonging to the same class as the anchor is selected as a positive, and an image $x_i^n$ of a different class is selected as a negative. The network is trained on all possible image triplets to minimize the cost

$$L = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+ \qquad (1)$$

where $\alpha$ is a margin of separation.
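As a minimal sketch of the triplet cost described above, assuming the standard hinge form of Schroff et al. [22] with squared L2 distances, the per-triplet computation can be written in plain Python. The function names and the margin value of 0.2 are illustrative assumptions, not values taken from this paper.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (measured in squared L2 distance).
    The default margin is an assumed value for illustration only.
    """
    gap = squared_l2(anchor, positive) - squared_l2(anchor, negative)
    return max(0.0, gap + margin)


def batch_triplet_loss(triplets, margin=0.2):
    """Sum of the per-triplet losses over an iterable of embedding triplets."""
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets)


# Toy 2-D embeddings (made-up values, only for illustration).
anchor, positive = [1.0, 0.0], [1.0, 0.0]
far_negative, near_negative = [0.0, 1.0], [1.0, 0.1]
print(triplet_loss(anchor, positive, far_negative))   # margin satisfied, no penalty
print(triplet_loss(anchor, positive, near_negative))  # margin violated, positive loss
```

In this toy case the far negative contributes nothing to the cost, while the near negative is penalized because it lies within the margin of the anchor.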

Figure 2(b) illustrates the use of triplet loss to quantify distances between phases of task sequences following the time-contrastive training formulation proposed in [23]. Assume two sequences of images where the images are phase-aligned. The sequences could be taken in different environments, using different objects, or they may be taken on the same task but using cameras with different viewpoints. We select one of the images as the anchor, while the image in the other sequence with the same phase index is selected as the positive. Any other image with a different phase index is a negative candidate. Although the network is trained with the triplet loss (1), the main feature of using sequential data is that it provides structure to enable self-supervision. The self-supervision relates to the fact that positives and negatives are given by the indexes of the frames, and thus can be extracted automatically without the need for labels.

Note that in our particular goal of task progress estimation, if the network learns to distinguish the anchor from the adjacent frames, we should expect it to be even more accurate for frames farther apart as the dissimilarity increases as time progresses. Although we did not formally investigate this property in this paper, it suggests that training a distance-to-go metric with TCNs could be made more efficient while preserving accuracy by focusing only on the adjacent frames as negatives.
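The index-based selection of anchors, positives, and negatives described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `adjacent_only` flag, and the uniform random sampling are hypothetical choices, and the adjacent-only variant corresponds to the untested suggestion in the preceding paragraph rather than to the training actually used.

```python
import random


def sample_time_contrastive_triplet(num_sequences, num_phases, rng, adjacent_only=False):
    """Return (anchor, positive, negative) as (sequence index, phase index) pairs.

    The positive shares the anchor's phase index but comes from another
    sequence. The negative has a different phase index and may come from any
    sequence. With `adjacent_only=True`, negatives are restricted to the
    phases directly before or after the anchor (a hypothetical variant).
    """
    anchor_seq, positive_seq = rng.sample(range(num_sequences), 2)
    phase = rng.randrange(num_phases)
    if adjacent_only:
        candidates = [k for k in (phase - 1, phase + 1) if 0 <= k < num_phases]
    else:
        candidates = [k for k in range(num_phases) if k != phase]
    negative = (rng.randrange(num_sequences), rng.choice(candidates))
    return (anchor_seq, phase), (positive_seq, phase), negative


# Example: four phase-aligned views with 16 phases each, as in the experiments.
rng = random.Random(0)
anchor, positive, negative = sample_time_contrastive_triplet(4, 16, rng)
print(anchor, positive, negative)
```

No labels are needed at any point: the frame indexes alone determine which images act as positives and negatives.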

Compared to the original TCN architecture, since our data has a simpler nature (snapshots of task progress as opposed to continuous human and/or robot motions), we used a simpler architecture with RGB images of size 300x300 pixels as input. The network’s visual embedder consists of six padded 2D-convolution layers with (32, 32, 64, 64, 128, 128) channels, with a 10x10 kernel on the first layer and 3x3 kernels on the rest. Except for the first and last layers, each convolution layer is followed by a 2x2 spatial max-pooling. Each layer runs through a ReLU activation function. The output of the visual feature extractor is fed to a spatial softmax function [16] that converts pixel-wise features to feature points. Finally, the feature points are fed to a 32-wide MLP and the output embedding vector is L2-normalized. Unlike the original TCN [23], our network does not use batch normalization, as we did not notice significant improvements, perhaps due to the smaller size of the network.
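To make the last two stages of the embedder concrete, the following sketch shows a spatial softmax over a single feature map and the L2 normalization of a vector, in plain Python. It is an illustration only: normalizing the pixel coordinates to [-1, 1] is an assumed convention (implementations differ), and the function names are hypothetical.

```python
import math


def spatial_softmax(feature_map):
    """Return the expected (x, y) location of one 2-D feature map.

    `feature_map` is a list of rows of activations with at least two rows
    and two columns. A softmax turns the activations into a distribution
    over pixels, and the expected pixel coordinates (normalized to [-1, 1],
    an assumed convention) form the feature point of this channel.
    """
    height, width = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtracted for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            x += (w / total) * (2.0 * c / (width - 1) - 1.0)
            y += (w / total) * (2.0 * r / (height - 1) - 1.0)
    return x, y


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# A map with one dominant activation in the top-right corner.
corner_peak = [[0.0, 0.0, 50.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(spatial_softmax(corner_peak))
print(l2_normalize([3.0, 4.0]))
```

Applied to each of the 128 output channels, such an operation would yield one (x, y) feature point per channel before the final MLP.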

III-B Computing a Feedback Policy from a Sequence of Images

Fig. 3: The policy consists of associating the robot’s actions with a sequence of images. Given a desired goal image, the direction in which to move along the sequence (left, right, none) is determined by finding the nearest-neighbor image w.r.t. the robot’s view of the scene. The output action can be used to trigger a motion planner or as a feedback error signal. The photos in the sequence are illustrative and not used for real experiments.

The goal of our policy is to iteratively steer the robot’s actions such that the desired goal image, in the form of its embedding, is achieved over the long horizon. The proposed policy is a sequence of images showing the gradual progress of the task, where each image in the sequence is associated with an action and its embedding. At execution time, the robot queries its vision sensor for an image, computes its embedding, and finds the closest embedding in the sequence via nearest-neighbor search to retrieve the associated action. Since the policy is constrained to a sequence whose frames cannot be skipped, as shown in Figure 3, for each image in the sequence there are only three possible actions: go to the frame on the right, go to the frame on the left, or do nothing.

Consider a sequence of pouring as shown in Figure 3. Given one of the images in the sequence as the goal state (the starred image), we attribute the images at the right with the action “pour in”, and the images at the left with “pour out” (or “spread one object” in a cleaning task). The action from the current goal image is “do nothing”. While during training, an arbitrary number of simultaneous sequences are used (two in the example in Figure 2), at execution time only a single image from the robot’s perspective is fed to the network.
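The policy lookup described in this section can be sketched as a nearest-neighbor search over the sequence embeddings followed by an action table indexed by frame. This is a minimal sketch under stated assumptions: the function names, the action labels, and the 1-D toy embeddings are hypothetical, and which physical action corresponds to moving toward the end or the start of the sequence depends on the task.

```python
def assign_actions(num_frames, goal_index, toward_end, toward_start):
    """Label every frame with the action that moves one step toward the goal."""
    actions = []
    for k in range(num_frames):
        if k < goal_index:
            actions.append(toward_end)      # goal lies later in the sequence
        elif k > goal_index:
            actions.append(toward_start)    # goal lies earlier in the sequence
        else:
            actions.append("do_nothing")    # current frame is the goal
    return actions


def nearest_frame(embedding, sequence_embeddings):
    """Index of the sequence embedding closest (squared L2) to `embedding`."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(sequence_embeddings)),
               key=lambda k: sq_dist(embedding, sequence_embeddings[k]))


def query_policy(embedding, sequence_embeddings, actions):
    """Retrieve the action attached to the nearest frame of the sequence."""
    return actions[nearest_frame(embedding, sequence_embeddings)]


# Toy cleaning sequence: frame k has fewer objects than frame k - 1.
sequence_embeddings = [[0.0], [1.0], [2.0], [3.0]]   # made-up 1-D embeddings
actions = assign_actions(4, goal_index=2,
                         toward_end="remove_object", toward_start="add_object")
print(query_policy([0.2], sequence_embeddings, actions))
print(query_policy([2.1], sequence_embeddings, actions))
print(query_policy([2.9], sequence_embeddings, actions))
```

Because only the current observation is embedded at run time, changing the goal amounts to recomputing the action table for a different goal index.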

The simplicity of the policy is due to two assumptions. First, the difficult part, which is to disentangle the visual complexity of a scene into a phase index, is done by the TCN. Second, we assume the robot has the means to manipulate the environment such that the next spatial configuration can be achieved (i.e. the distance-to-go can be grounded by appropriate control actions) and that its resulting image can be captured into an embedding for further comparison.

The first assumption regards the existence of a properly trained TCN. The second assumption relies on the existence of a motion planner or a task-and-motion planner (TAMP) [12, 4] for more hierarchical cases. As motivated earlier in the introduction, our goal in training the network is to understand the task state, not to understand the control actions to achieve such states.

IV Experiments with a Mobile Base Manipulator

Two experiments of different control nature were conducted. In the first, in a simulated environment, the robot had to either clean or spread objects inside a certain area where the distance-to-go was used as a high-level task command for a motion planner. In the second experiment, the robot had to either fill or empty a simulated cup to a desired level, in which the distance-to-go was used as a feedback error signal for a bang-bang controller (i.e. pour in, pour out).

While Figure 2(b) illustrated a training with only two sequences, in practice the number of sequences depends on the number of available cameras. In real-world cases, one sequence most likely is generated by the first-person view of the robot, while another could be obtained from a second camera mounted on the robot’s wrist, for example. In this paper’s simulation, we set four cameras for training the TCN where their locations are slightly perturbed after each picture is taken except the fourth camera, which is from the robot’s simulated sensor. Thus, one run of data collection generates four sequences.

A total of 200 randomized runs with variations in color, viewpoint, shape, and size of objects, resulting in 800 sequences were collected for training. Each sequence was discretized in 16 steps. Examples of randomized training images are shown in Figure 4 at particular steps of the cleaning (a,b) and pouring (c,d) tasks. Figure 5 shows the learning curves on the validation set as a function of the training epochs. The training time was around three hours in each case on a modest desktop with a single GPU.

Fig. 4: Examples of training data. Each of the four views is captured at the same time-step. The fourth view is the robot’s perspective. (a) A cleaning scene with ten objects and (b) a cleaning scene with four objects. (c) Four viewpoints of a full cup (345 particles) and (d) a partially filled cup. The pairs (a,b) and (c,d) differ due to the randomization between runs.

IV-A Cleaning or Spreading Objects on the Floor

In this experiment we set up a long-horizon task where the robot must either clean a specified floor area by moving objects from within this area and placing them in a storage area or, conversely, spread objects inside the area by picking them from the storage area. We used the Toyota Human Support Robot (HSR) as the service robot platform. The robot has an arm with five degrees of freedom and a gripper as an end-effector. The holonomic base allows the robot to move in XY directions and change its orientation. Details on the robot can be found in [25].

Figure 6(a) shows the scene setup at the start of a cleaning task. Figure 6(b) shows the sequence of 16 images used to compute the policy, where 15 objects of random shapes and sizes are initially provided and removed one by one at each frame advance. If the task is to clean the floor, we specify the goal of the task by showing the robot the image of the clean floor. If the task is to spread blocks inside the delimited area, we specify the goal of the task as the image showing the desired number of blocks inside the area and initialize the scene with 15 blocks in the storage area. Any other intermediate stage can also be used as a goal. After a goal is specified, the actions in the policy are computed as described in Sec. III-B.

Fig. 5: The decrease in validation error as a function of the number of epochs. The total training time was roughly 3 hours in both cases.

Fig. 6: (a) Initial setup of a scene for a long-horizon cleaning task (rendering of the floor is omitted). (b) Sequence of images showing one object removed per frame advance. This sequence is used to generate the policy from which the robot iteratively retrieves actions until all objects are moved into the storage area.

Fig. 7: Sequence of snapshots taken at varying steps where the robot executes an entire clean-up task. A motion planner is used to ground the high-level commands “remove object”, “add object”, “do nothing” obtained from the policy. The borders of the area are only shown in the third snapshot and are not visible to the robot’s vision sensor. The rendering of the floor was omitted for clarity but it is visible to the robot. Experiments can be watched in the accompanying video.

Figure 7 shows snapshots of an entire cleaning task where the motion planner is used to ground the high-level actions. In the first frame, the robot retrieves images from the scene using its simulated RGB camera from the perimeter of the area. The coordinates to which the robot moves on the border are randomly picked such that the robot has no pre-defined position from which to observe the scene. From the image, the embedding is computed, and the appropriate action is retrieved by finding the nearest neighbor in the policy sequence.

If the retrieved action is “remove object”, the robot picks the object closest to itself (second snapshot of Figure 7) and drops it inside the storage area (third snapshot). If the action is “spread one object”, the robot picks an object from the storage area and places it randomly within the boundaries of the delimited area. Once either of these actions is taken, the robot goes back to the perimeter of the area (fifth snapshot) to obtain a new image and retrieve the next action. The robot repeats the process until the nearest neighbor in the embedding space reaches the specified goal image. From that point on, any extra step will generate a “do nothing” action (last snapshot). Watch the video for more details.

The simulated setting allows quantification against a ground truth. The result of interest is whether the robot is capable of achieving the same number of objects as in the goal image. The goal state is perfectly achieved if the final number of objects in the scene is the same as the number of objects in the goal image. An error in task execution means that the robot reaches the “do nothing” action despite the number of objects in the final scene and goal image being different. Figure 8 shows the final error where each error bar summarizes 50 tasks under randomized initial conditions. The X-axis represents the number of objects in the goal image. For example, when this number is zero, the goal is to completely clean the floor so that no objects should remain in the scene. Since the maximum number of objects inside the area was 15, there were 14 possible initializations regarding the number of objects at the start of the task.

Fig. 8: Error in the number of objects in the scene after the task was considered finished by the robot. The precision is high when totally cleaning the floor but tends to worsen as clutter increases.

Note from Figure 8 that the accuracy is reasonably constant and nearly zero throughout the range of goal objects, indicating that, on average, the correct number of objects was achieved. On the other hand, the magnitude of the error bars changes according to the number of objects in the goal image, indicating that the precision is affected by the clutter of the goal images. The robot achieves the lowest variance when the goal is to completely clear the floor. This result is somewhat intuitive as the scene without objects is the most visually discernible among its adjacent frames. On the other hand, the precision is worse when the clutter is large, which also reflects our own difficulty in visually differentiating a scene with 10 objects from a scene with 11 objects. The precision improves slightly when nearing 15 as the storage area is depleted; consequently, the robot’s only source of mistake is to spread fewer blocks in the area than required, but not more.
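The distinction drawn above between accuracy (mean error near zero) and precision (size of the error bars) can be illustrated with a short computation. This is a sketch only: the function name is hypothetical, the object counts are made-up numbers rather than experimental results, and the use of the population standard deviation is an assumption about how the spread might be summarized.

```python
import statistics


def summarize_errors(final_counts, goal_count):
    """Return (mean, standard deviation) of the signed object-count error.

    Each error is the final number of objects minus the number in the goal
    image. A mean near zero indicates accuracy; a small standard deviation
    indicates precision.
    """
    errors = [final - goal_count for final in final_counts]
    return statistics.mean(errors), statistics.pstdev(errors)


# Made-up final counts for a goal of 3 objects (not data from the paper).
mean_error, spread = summarize_errors([3, 4, 2, 3], goal_count=3)
print(mean_error, spread)
```

In this toy case the mean error is zero even though individual trials over- and under-shoot, which is the situation the error bars of Figure 8 are meant to expose.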

For humans, it is obvious that the core operation in the proposed task relies on counting the number of objects. It may seem surprising that a network can achieve similar reasoning and count from images alone, although it was not explicitly trained to count and has no concept of numbers. However, one must recall that the dimension of phase or task progress, although not visible to the human eye, is the fundamental mechanism underlying the network training. The network is sensitive to phases, which is the variable that allows a robot to identify how many steps, or objects, have been manipulated.

Figure 9 shows 50 distance-to-go trajectories. All values on the Y-axis are relative to the goal image of a floor without blocks (top plot) and an empty cup (bottom plot). In the cleaning case, the robot starts with all 15 blocks in front of it, and the goal is to completely clean the floor following the policy. Note that, in general, the L2 distance decreases as the goal is approached. At first glance, the fact that the distance decreases in an almost monotonic fashion is somewhat odd, as the negatives in the triplet loss do not contain explicit information regarding distance towards the anchor. This phase-driven dissimilarity indicates that training with a triplet loss under time-varying phenomena can be, in fact, easier than training on static cases where data mining to search for hard cases is an actual issue [22].

Fig. 9: History of distance-to-go as mean and two standard deviations on the cleaning task (top) and pouring task (bottom). Both tasks were discretized in 15 steps: in the cleaning case, 1 block is removed per frame advance, and in the pouring task, the number of particles decreases by 23 per frame. The distance values are relative to the last frame of the sequence (clean floor and empty cup).

IV-B Pouring in a Simulated Cup by Directly Looking at a Screen

Fig. 10: Cup setup experiment where the robot interacts with a simulator via an RGB camera at the top of its head and a VIVE controller. The robot controls the pouring amount by rotating the controller while observing the state of the cup by looking directly at the monitor.

Here, we use the visual policy as a feedback error signal and directly feed it to a bang-bang control law (pour in/pour out). Figure 10 shows the hybrid real-virtual experimental setup where particles are used as a proxy for liquid for ground-truth purposes. A VIVE controller is used to send commands to a simulator, which detects the inclination of the controller. When the controller is turned more than 35°, 15 particles fall inside the cup (4% of the total volume of a full cup). When turned less than 35°, 15 particles are removed from the cup. The network is trained on images directly collected from the simulator (as shown in Figure 4) but at run-time, the image is obtained from the robot’s RGB camera pointed at the monitor. As such, the images are cropped to eliminate the physical borders of the monitor.
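A minimal simulation of the bang-bang law described above is sketched below, using the step of 15 particles and the full-cup count of 345 particles reported in this paper. The phase estimator is an idealized stand-in for the network (it simply rounds the particle count to the nearest step), and the function names and the `max_steps` guard are assumptions made for illustration.

```python
def bang_bang_pour(start_particles, goal_particles, step=15, full=345, max_steps=100):
    """Simulate pour-in / pour-out commands until the estimated phases match.

    Returns the history of particle counts. The inner `phase` function is an
    idealized replacement for the learned embedding plus nearest-neighbor
    lookup: it maps a particle count to the nearest multiple of `step`.
    """
    def phase(particles):
        return round(particles / step)

    particles = start_particles
    history = [particles]
    for _ in range(max_steps):
        error = phase(goal_particles) - phase(particles)
        if error == 0:
            break                                   # "do nothing"
        delta = step if error > 0 else -step        # pour in or pour out
        particles = min(full, max(0, particles + delta))
        history.append(particles)
    return history


# Fill an empty cup toward a goal image showing 275 particles.
history = bang_bang_pour(start_particles=0, goal_particles=275)
print(len(history) - 1, "commands, final count:", history[-1])
```

Even with a perfect phase estimator, the final volume can only match the goal up to the resolution of one step, so part of the residual pouring error is inherent to the discretization.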

Images in the simulator are randomized with regard to the color and transparency of the particles, the cup, the background, and variations in camera viewpoint. The geometry of the cup is modified by scaling one of the principal axes of the rim (from circular to ellipsoidal) and the height. Both scales vary from 1 to 1.45. Note that, unlike in the previous experiment, here the viewpoint of the robot is fixed as the robot constantly looks at the cup in its hand. Thus, to increase geometry variation in the robot’s view, before each trial, the cup is randomly rotated about its vertical axis such that the principal axis of the ellipsoidal cup is unlikely to repeat directions.

Fig. 11: Cup pouring accuracy according to different levels of desired particles. The variance is reasonably constant but there is a trend to under-pour as the desired volume increases. We conjecture this may be due to the unfavorable angle from which the robot observes the cup.

Figure 11 quantifies the pouring error as a function of the amount of desired “liquid” inside the cup. On the X-axis, the desired fullness of 0% is represented by an image of an empty cup. The desired fullness of 80% is given by an image of a cup with 275 particles. The Y-axis shows the mean and two standard deviations from 20 trials per case. Apart from the case where the goal is to empty the cup, the variance is roughly similar. A trend is observed, however, where the robot tends to fill less of the cup as the amount of desired liquid increases. It is not clear why this trend occurs, but it could be related to the fact that the fixed angle from which the robot observed the cup does not favor such an assessment (that is, the best angle to assess the liquid level would be from the side of the cup, and the worst from the top). In the case of the cleaning task, this effect was not noticeable as the robot’s view of the scene included all objects in a single image.

Fig. 12: One instance of the pouring task. The image at the left is the goal image. The sequence of snapshots shows the robot adding particles into the cup. The monitor is physically rotated during the task. The robot achieves a similar volume to the goal despite the visual differences with the goal image and the tilt of the monitor (the tilt was never seen during training).

Figure 12 shows snapshots of one instance of the experiment where the desired volume to be attained is given by the image at the left. The robot rotates the controller, which increases the volume in its cup by 15 particles, and then returns the controller to its original rotation. The robot then observes the monitor and searches for the next action in the policy. During the experiment (see accompanying video) the monitor is physically tilted, and the robot could still reasonably accomplish the task (20% pouring error) although the embeddings were trained neither on tilted images nor on images observed from a real camera.
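One way to picture the observe-then-act loop described here is the sketch below. It is an illustrative reading under stated assumptions, not the authors' implementation: the function names, the nearest-neighbor lookup with Euclidean distance, and the two action labels "pour" and "stop" are hypothetical. Each row of `reference_embeddings` stands for one frame of a reference sequence ordered by increasing fullness.

```python
import numpy as np

def nearest_phase(embedding, reference_embeddings):
    """Index of the reference frame whose embedding is closest (Euclidean)."""
    distances = np.linalg.norm(reference_embeddings - embedding, axis=1)
    return int(np.argmin(distances))

def next_action(current_embedding, goal_embedding, reference_embeddings):
    """Return 'pour' while the current phase is behind the goal phase, else 'stop'."""
    current = nearest_phase(current_embedding, reference_embeddings)
    goal = nearest_phase(goal_embedding, reference_embeddings)
    return "pour" if current < goal else "stop"

if __name__ == "__main__":
    # Toy 1-D embeddings for four phases of fullness (made-up values).
    reference = np.array([[0.0], [1.0], [2.0], [3.0]])
    print(next_action(np.array([0.1]), np.array([2.2]), reference))
```

In this reading, every "pour" decision would correspond to one controller rotation (15 particles in the experiment above), after which the robot observes the scene again.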

V Directions for Future Work

We trained the network naively by randomly sampling negatives in the triplet loss. As discussed when presenting Figure 2(b), if the network learns to separate adjacent frames, it should learn to separate the remaining frames as well. This insight does not hold in face recognition, for example, as no sequence is available. The experimental result in Figure 9 provides evidence that the link between dissimilarity and phase does manifest empirically. We believe there are more efficient ways to train temporal triplet loss networks by focusing the learning only on adjacent frames as negatives.
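The difference between the two negative-sampling strategies can be sketched as follows. This is a minimal illustration under assumptions: the hinge form on squared Euclidean distances follows the usual FaceNet-style triplet loss, the margin of 0.2 is a placeholder, and `sample_negative_phase` with its `adjacent_only` flag is a hypothetical helper, not the training code used in the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances (margin is assumed)."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)

def sample_negative_phase(anchor_phase, num_phases, rng, adjacent_only=False):
    """Pick the phase index from which a negative frame is drawn.

    adjacent_only=False: any phase other than the anchor's (random sampling).
    adjacent_only=True: only the phases next to the anchor's (proposed variant).
    """
    if adjacent_only:
        candidates = [p for p in (anchor_phase - 1, anchor_phase + 1)
                      if 0 <= p < num_phases]
    else:
        candidates = [p for p in range(num_phases) if p != anchor_phase]
    return int(rng.choice(candidates))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(sample_negative_phase(2, 5, rng))                      # random negative
    print(sample_negative_phase(2, 5, rng, adjacent_only=True))  # 1 or 3
```

With `adjacent_only=True`, every triplet targets the hardest boundary between neighboring phases, which is the focus suggested above.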

The literature on temporal learning and representation learning with multiple views is extremely rich. Multi-modal learning [1], learning from mutual information across views [2], and using prediction for learning features in self-supervision [19] are just a few of the different approaches that have recently been proposed outside the robotics scope. For this paper, TCNs were a natural fit as a method designed specifically for robotics, but we leave for future work a benchmark of pure vision methods for temporal learning or multiple-view learning, which could potentially be used as an alternative to TCNs in the method proposed here.

VI Conclusion

In this paper, we revisited Time-Contrastive Networks and presented them as a method that specializes in the degree of accomplishment of a goal (distance-to-go) rather than in robot motor skills. With our formulation, the robot acquires a notion of distance to task accomplishment regardless of the appearance of the images. To ground the distance-to-goal signal in physical robot actions, we introduced a simple policy based on a sequence of images similar to the one used during training, where each frame is associated with an action. This action was evaluated both as a task specification for long-horizon task planning and as a feedback error signal for low-level control.

We are currently working on a real-world implementation integrating SLAM, planning and collision-avoidance, object instance segmentation and pose estimation for grasping. Although more complex than the simulated system, we expect to run the same time-contrastive network training procedure presented here by augmenting the training data with real images.

VII Acknowledgment

The authors would like to acknowledge Koichi Ikeda, Kunihiro Iwamoto and Takashi Yamamoto from Toyota Motor Corporation for their support of the HSR platforms.


  1. For example, SLAM and planning of the base navigation combined with task plans, object pose estimation, and grasping.
  2. We emphasize the use of phase rather than time, as time-alignment does not imply that the different signs of progress among tasks are aligned.
  3. Textures on the floor and illumination are suppressed in the figure to facilitate the visualization of the scene, but are present in the data fed to the network.
  4. This arrangement brings us close to a full real experiment; we are currently working to replace the VIVE controller with a real cup, and to train the TCN with a mixture of simulated and real pouring data.


  1. Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang and N. de Freitas (2018) Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pp. 2930–2941. Cited by: §V.
  2. P. Bachman, R. D. Hjelm and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in neural information processing systems, pp. 15509–15519. Cited by: §V.
  3. D. C. Bentivegna, C. G. Atkeson and G. Cheng (2004) Learning tasks from observation and practice. Robotics and Autonomous Systems 47 (2), pp. 163–169. Cited by: §II-B.
  4. N. T. Dantam, Z. K. Kingston, S. Chaudhuri and L. E. Kavraki (2016) Incremental Task and Motion Planning: A Constraint-Based Approach. Robotics: Science and Systems. Cited by: §II-C, §III-B.
  5. D. Dwibedi, J. Tompson, C. Lynch and P. Sermanet (2018) Learning actionable representations from visual observations. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1577–1584. Cited by: §II-A.
  6. C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine and P. Abbeel (2016) Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519. Cited by: §II-B, §II-C.
  7. P. Florence, L. Manuelli and R. Tedrake (2019) Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters. Cited by: §II-A.
  8. P. R. Florence, L. Manuelli and R. Tedrake (2018) Dense object nets: Learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756. Cited by: §II-A, §II-A.
  9. A. Hermans, L. Beyer and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §II-A.
  10. E. Jang, C. Devin, V. Vanhoucke and S. Levine (2018-10) Grasp2Vec: Learning object representations from self-supervised grasping. In Proceedings of the 2nd conference on robot learning, Proceedings of machine learning research, Vol. 87, pp. 99–112. Cited by: §II-B.
  11. R. Jeong, Y. Aytar, D. Khosid, Y. Zhou, J. Kay, T. Lampe, K. Bousmalis and F. Nori (2019-10) Self-Supervised Sim-to-Real Adaptation for Visual Robotic Manipulation. arXiv:1910.09470 [cs]. Cited by: §II-A.
  12. L. P. Kaelbling and T. Lozano-Pérez (2011) Hierarchical task and motion planning in the now. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 1470–1477. Cited by: §III-B.
  13. T. Kröger (2011) Opening the door to new sensor-based robot applications—The Reflexxes Motion Libraries. In 2011 IEEE International Conference on Robotics and Automation, pp. 1–4. Cited by: §II-C.
  14. Y. Kuniyoshi, M. Inaba and H. Inoue (1994) Learning by watching: Extracting reusable task knowledge from visual observation of human performance. Robotics and Automation, IEEE Transactions on 10 (6), pp. 799–822. Cited by: §II-B.
  15. J.-P. Laumond (2006-06) Kineo CAM: a success story of motion planning algorithms. IEEE Robotics & Automation Magazine 13 (2), pp. 90–93 (en). External Links: ISSN 1070-9932 Cited by: §II-C.
  16. S. Levine, C. Finn, T. Darrell and P. Abbeel (2016) End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17 (39), pp. 1–40. Cited by: §II-C, §III-A.
  17. Y. Liu, A. Gupta, P. Abbeel and S. Levine (2018) Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. Cited by: §II-C, §II-C.
  18. A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin and S. Levine (2018) Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §II-B.
  19. A. v. d. Oord, Y. Li and O. Vinyals (2019-01) Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 [cs, stat]. Cited by: §V.
  20. L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba and P. Abbeel (2018-06) Asymmetric actor critic for image-based robot learning. In Proceedings of robotics: Science and systems, Pittsburgh, Pennsylvania. External Links: Document Cited by: §II-B.
  21. T. Schmidt, R. Newcombe and D. Fox (2016) Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters 2 (2), pp. 420–427. Cited by: §II-A.
  22. F. Schroff, D. Kalenichenko and J. Philbin (2015) Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §II-A, §III-A, §IV-A.
  23. P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal and S. Levine (2018) Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. Cited by: §I, §I, §II-A, §III-A, §III-A, §III.
  24. P. Sermanet, K. Xu and S. Levine (2017-07) Unsupervised perceptual rewards for imitation learning. In Proceedings of robotics: Science and systems, Cambridge, Massachusetts. External Links: Document Cited by: §II-B.
  25. T. Yamamoto, K. Terada, A. Ochiai, F. Saito, Y. Asahara and K. Murase (2019) Development of Human Support Robot as the research platform of a domestic mobile manipulator. ROBOMECH Journal 6 (1), pp. 4. Cited by: §IV-A.
  26. T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557. Cited by: §II-C.
  27. A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao and T. Funkhouser (2017) 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1802–1811. Cited by: §II-A.