# Self-Supervised Learning of State Estimation for Manipulating Deformable Linear Objects

###### Abstract

We demonstrate model-based, visual robot manipulation of linear deformable objects. Our approach is based on a state-space representation of the physical system that the robot aims to control. This choice has multiple advantages, including the ease of incorporating physical priors in the dynamics model and perception model, and the ease of planning manipulation actions. In addition, physical states can naturally represent object instances of different appearances. Therefore, dynamics in the state space can be learned in one setting and directly used in other visually different settings. This is in contrast to dynamics learned in pixel space or latent space, where generalization to visual differences is not guaranteed. Challenges in taking the state-space approach are the estimation of the high-dimensional state of a deformable object from raw images, where annotations are very expensive on real data, and finding a dynamics model that is accurate, generalizable, and efficient to compute. We are the first to demonstrate self-supervised training of rope state estimation on real images, without requiring expensive annotations. This is achieved by our novel differentiable renderer and image loss, which are generalizable across a wide range of visual appearances. With estimated rope states, we train a fast and differentiable neural network dynamics model that encodes the physics of mass-spring systems. Our method has a higher accuracy in predicting future states compared to models that do not involve explicit state estimation and do not use any physics prior. We also show that our approach achieves more efficient manipulation, both in simulation and on a real robot, when used within a model predictive controller.

## I Introduction

Manipulating deformable objects is an important but challenging task in robotics. It has a wide range of applications in manufacturing, domestic services and health care, such as robotic surgery, assistive dressing, textile manufacturing or folding clothes [Suture, Dressing, Folding]. Unlike rigid objects, deformable objects have a high-dimensional state space and their dynamics are complex and nonlinear. This makes state estimation challenging and forward prediction expensive.

We propose a vision-based system that allows a robot to autonomously manipulate a linear deformable object to match a visually provided goal state. Previous learning-based approaches to this problem [ZSVI, DVF_journal] learn models in image space or a latent space, and do not incorporate any physics prior into the learning process. While conceptually these methods could be applied to other object classes, they suffer from low data efficiency and difficulty in generalization [kloss_icra2018]. We take a different approach to this problem, based on an explicit state-space representation of the physical system. This choice has several advantages. First, it allows us to incorporate physics priors about the behaviour of a deformable object when it is manipulated, e.g. by reflecting a mass-spring system in the network structure. As we show in our experiments, such dynamics models produce more realistic predictions of the object’s behaviour over a longer horizon than dynamics models learned directly from pixels. Second, explicit states are invariant to the appearance of the object and its environment. Therefore, dynamics learned in one setting can be directly used in other visually different settings. It also allows us to specify goal shapes with one rope that are then achieved with a rope of different length, thickness, and/or appearance. It is not obvious how to achieve this invariance with a method operating in a learned latent space or in pixel space. Finally, an explicit state-space representation more readily lends itself to manipulation planning and control, especially when optimizing a sequence of actions. It is straightforward to construct intuitive and informative losses for the optimization, as well as to construct heuristics of promising action sequences to initialize the optimization.

The main challenge then becomes to estimate this explicit state from raw images. This visual perception task has previously been tackled by hand-engineered image processing algorithms, e.g. in [DeformPercept1]. Recently, Pumarola et al. [DeformPercept2] demonstrated explicit state estimation for deformable surfaces in simulation, where ground truth annotations are easily accessible. Such annotations are expensive to obtain for real images. We overcome this problem by proposing a differentiable renderer and image loss that enable continuous, self-supervised training of deformable object state estimation on real images once the model is initialized with a small set of synthetic images.

We demonstrate the effectiveness of our method on the task of rope manipulation. We embed the learned perception model for explicit state estimation into a full system that includes a dynamics model in state space with a physics prior, and Model Predictive Control (MPC). We quantitatively show that the proposed method is significantly more efficient in manipulating ropes to match specified goal configurations compared to previous methods that learn in pixel space.

Summarizing, the contributions of this work are:

(i) We explore the direction of learning vision-based robotic manipulation of deformable objects with an explicit state-space representation at the core.

(ii) We propose a differentiable renderer and a novel image space loss that enable self-supervised training of state estimation on real data, without requiring ground truth annotations.

(iii) We demonstrate that bi-directional Long Short-Term Memory (LSTM) networks can efficiently enforce physics priors for linear deformable objects. Our LSTM dynamics model has comparable performance to a physics simulation engine, while being significantly faster.

(iv) We demonstrate rope manipulation in both simulated and real environments. Quantitative comparisons in simulation show improved prediction accuracy and significantly more efficient manipulation of our method compared to a baseline.

## II Related Work

#### Self-supervised State Estimation

While many previous works learn dynamics models in image space or latent space using self-supervision, only a few have looked at self-supervised learning of explicit state estimation, such as object pose estimation. Wu et al. [Wu_NIPS2017] used a differentiable renderer to convert predicted rigid object poses back to images, and compared them with ground truth observations. Ehrhardt et al. [Ehrhardt2018] proposed two regularizing losses to enforce object trajectory continuity and spatial equivariance for object tracking from images. Byravan et al. [SE3PoseNets] train networks that learn a robot’s latent link segmentation and pose space dynamics from point cloud time series, using a reconstruction loss for self-supervision. While achieving good reconstruction, the network does not necessarily converge to the true link structure. Our image-space loss is inspired by this line of work, but addresses much higher-dimensional linear deformable objects like ropes. We are able to train the perception network on real data with only self-supervision, provided that it has been warm-started with a small set of rendered images.

#### Deformable Object Tracking

Several previous works have studied tracking of deformable objects given segmented point clouds [Tracking_ICRA13, Tracking_IROS15, Tracking_IROS17]. At a high level, these tracking methods iteratively deform a pre-defined mesh model to fit the segmented point cloud, and use physical simulation to regularize the mesh deformation to have low energy. There are two major limitations in the previous tracking methods: (1) the mesh configuration needs to be initialized manually or using algorithms engineered for specific cases; (2) rope segmentation is required as input, which is usually achieved by color/depth filtering and requires manual tuning for each case. Our state estimation method eliminates the above limitations, while also being generalizable to different appearances of the object or the background.

#### Rope Manipulation

Specific to the task of robot manipulation of ropes, Yamakawa et al. [DynamicKnot] have demonstrated high-speed knotting, which depends on an accurate dynamic model of the robot fingers and the rope. Lee et al. [TPS] and Tang et al. [TPS-T] use spatial warping to transfer demonstrated manipulation skills to new but similar initial conditions. More related to our work, Li et al. [PropNet] and Battaglia et al. [InteractionNet] model ropes as mass-spring systems, and use graph networks to learn rope dynamics. However, they assume that the rope’s physical state is fully observable. Ebert et al. [DVF_journal] learn a video prediction model, without any physical concept of objects or dynamics. However, a series of efforts [Chelsea17, Chelsea18] has been made to find informative losses on images, which are required for long-horizon planning in the model predictive control framework. Wang et al. [Abbeel_RSS19] embed images into a latent space associated with an action-agnostic transition model, plan state trajectories and servo the trajectory with an additional learned inverse model. While they achieve satisfactory results manipulating one particular rope, visually different ropes are not guaranteed to share the embedding space and transition model, making generalization difficult. Different from previous works, we directly estimate the ropes’ explicit states from images. This space lends itself more readily to efficient learning, flexible goal specification and manipulation planning than pixel space or latent spaces.

## III Method

A flow chart of the full system at test time is shown in Fig. 1. We use a robot arm with a gripper to manipulate a rope on the table. The task is to move the rope to match a desired goal state, specified by an image. At each time step, a Convolutional Neural Network (CNN) estimates the explicit rope state, which we formulate as the positions of an ordered sequence of points on the rope. The structure of this network is described in Sec. III-A. The network is first trained with rendered images, then finetuned on real images with our proposed self-supervised image loss described in Sec. III-B. We use Model Predictive Path Integral Control (MPPI) [MPPI] in combination with a dynamics model to optimize a sequence of actions. We train a neural network dynamics model from physical simulation data for speed and parallel processing on GPU. The dynamics model is described in Sec. III-C. Details on the modified MPPI algorithm are described in Sec. III-D.

### III-A State estimation network

We formulate the problem of rope state estimation from an RGB image as estimating the positions of an ordered sequence of points on the rope. The rope state estimation problem has a divide-and-conquer structure, i.e., estimating the state of a segment of the rope is the same problem as estimating the state of the entire rope. To exploit this structure, we use Spatial Transformer Networks (STNs) [STN] to estimate the rope state in a coarse-to-fine manner as visualized in Fig. 2 (a). A VGG network [vgg] first estimates 8 straight segments that roughly approximate the shape of the rope. Using STNs, 8 square regions are extracted, one per segment. Within each extracted region, the network updates the position of the two end-points and estimates the position of the middle point on the rope segment. The outputs are converted back to the original image coordinates, and endpoints from neighboring regions are averaged, so that the entire rope is now represented by 16 straight segments (see Fig. 2 (b)). New regions are extracted for each of the new segments, with higher spatial resolution. The process repeats until the rope is represented at sufficient resolution, in our case with 64 segments. The detailed network structure and parameters are described in Appendix (I). Code will be made available upon publication.
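The merging step of one coarse-to-fine round can be illustrated with a small NumPy sketch. This is not the authors' code: the function name `merge_refined_segments` and the array layout are assumptions made for illustration, and the per-region predictions are taken as given rather than produced by a network.

```python
import numpy as np

def merge_refined_segments(starts, mids, ends):
    """Combine per-segment refinements into one ordered polyline.

    starts, mids, ends: arrays of shape (K, 2) holding, for each of the K
    segments, the refined start point, the new middle point, and the refined
    end point, all expressed in original image coordinates (hypothetical layout).
    Returns an array of shape (2K + 1, 2), i.e. the rope as 2K segments.
    """
    starts, mids, ends = (np.asarray(a, dtype=float) for a in (starts, mids, ends))
    k = len(mids)
    points = np.empty((2 * k + 1, 2))
    points[1::2] = mids                      # new middle points sit at odd indices
    points[0] = starts[0]                    # free rope ends have no neighbor to average with
    points[-1] = ends[-1]
    # An interior endpoint is predicted twice, once by each neighboring region;
    # the two predictions are averaged.
    points[2:-1:2] = 0.5 * (ends[:-1] + starts[1:])
    return points
```

Applying such a step three times takes the 8 initial segments to 16, 32, and finally 64 segments, which matches the resolution stated above.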

### III-B Image-space loss for self-supervised finetuning

We train the neural network model with rendered spline curves as ropes, where ground truth rope states are easily available. However, real images look different from rendered images in many aspects, e.g. different lighting, distractor objects, occlusions by robot arms, or a different rope shape distribution. To close the reality gap without requiring ground truth annotations of rope states in real images, we propose a novel differentiable renderer and an image space loss to achieve self-supervision on real images (see Fig. 3).

Our method makes the assumption that the object has a strong color contrast with the background, which is often the case for ropes or other linear deformable objects. Consider the simplifying case where the rope and the background each have a solid color. If we think of each pixel as a point in RGB space, all the pixels should form two clusters, one around the rope color and one around the background color. If we model the distribution in RGB space as a mixture of two Gaussians, and assign each pixel to the more probable Gaussian, the assignment variables will give us the segmentation mask of the rope versus background. When we estimate the rope configuration using the perception network, this estimate should agree with the color-based segmentation.

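Clustering in RGB space can be achieved with the Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMMs). Given an initial estimate of the GMM parameters $\theta$, i.e., the component weights $\pi_k$, means $\mu_k$, and covariance matrices $\Sigma_k$, $k = 1, \dots, K$, the EM algorithm iterates between the E step, which updates the membership weight $w_{ik}$ of data point $x_i$ to cluster $k$, and the M step, which updates $\theta$. In the E step, membership weights are updated as:

$$
w_{ik} = \frac{\pi_k \, p_k(x_i \mid \mu_k, \Sigma_k)}{\sum_{m=1}^{K} \pi_m \, p_m(x_i \mid \mu_m, \Sigma_m)} \tag{1}
$$

where $p_k$ and $p_m$ are multivariate Gaussian densities. In the M step, GMM parameters are updated as:

$$
\begin{aligned}
N_k &= \sum_{i=1}^{N} w_{ik}, \qquad \pi_k = \frac{N_k}{N}, \\
\mu_k &= \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, x_i, \\
\Sigma_k &= \frac{1}{N_k} \sum_{i=1}^{N} w_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^\top,
\end{aligned} \tag{2}
$$

where $N$ is the number of data points.

For rope state estimation, we model the distribution in RGB space with two Gaussians, thus $K = 2$. For each pixel with coordinate $(i, j)$, let $w_{ij}$ be its membership weight to the rope RGB cluster parameterized by $\mu_r$ and $\Sigma_r$, i.e., $w_{ij} = w_{(ij),r}$. Then $1 - w_{ij}$ is the membership weight of pixel $(i, j)$ to the background RGB cluster parameterized by $\mu_b$ and $\Sigma_b$, i.e., $1 - w_{ij} = w_{(ij),b}$. $x_{ij}$ refers to the RGB value of pixel $(i, j)$. While the M step is straightforward to apply given $w_{ij}$, the per-pixel membership weights should be expressed in terms of the estimated rope state, instead of i.i.d. per pixel. Thus, the E step does not apply as is.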
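For reference, the unmodified EM iteration for a GMM, which the method then adapts, can be sketched in NumPy as follows. This is a generic textbook implementation and not the authors' code; the function names and the small diagonal term `eps` added to each covariance for numerical stability are assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x, shape (N, D)."""
    d = x.shape[1]
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def e_step(x, weights, means, covs):
    """Membership weight of every data point to every cluster, shape (N, K)."""
    lik = np.stack(
        [w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs)],
        axis=1,
    )
    return lik / lik.sum(axis=1, keepdims=True)

def m_step(x, resp, eps=1e-6):
    """Re-estimate component weights, means, and covariances from memberships."""
    n_k = resp.sum(axis=0)
    weights = n_k / len(x)
    means = (resp.T @ x) / n_k[:, None]
    covs = []
    for k in range(resp.shape[1]):
        diff = x - means[k]
        cov = (resp[:, k, None] * diff).T @ diff / n_k[k]
        covs.append(cov + eps * np.eye(x.shape[1]))  # eps: assumed regularizer
    return weights, means, np.stack(covs)
```

With `x` holding one RGB value per pixel and two clusters, alternating `e_step` and `m_step` yields a purely color-based soft segmentation. The adaptation described in the text replaces the per-pixel memberships with weights rendered from the rope state.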

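We propose a differentiable renderer that links the per-pixel rope membership weights $w_{ij}$ to the rope state. The rope state is a sequence of $S$ segments. We individually render each segment $s$ with end-points $\mathbf{p}_s$, $\mathbf{p}_{s+1}$ to get $w_{ij}^{(s)}$, and take the pixel-wise maximum, $w_{ij} = \max_s w_{ij}^{(s)}$. Rendering of one segment is defined as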

(3) |

where $d_{ij}^{(s)}$ is the distance of pixel $(i, j)$ to its closest point on segment $s$, and $\sigma$ is a learnable parameter that controls the width of the rendered segments.
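A minimal NumPy sketch of such a renderer is given below. The falloff `exp(-d**2 / sigma**2)` is an assumption chosen for illustration: it decreases with distance and has a width controlled by `sigma`, but the exact functional form of Eq. 3 may differ. The function names are hypothetical as well.

```python
import numpy as np

def point_to_segment_distance(pixels, a, b):
    """Distance from each pixel (rows of `pixels`, as x, y) to the closest point on segment ab."""
    ab = b - a
    # Projection parameter along the segment, clamped to stay between the end-points.
    t = np.clip(((pixels - a) @ ab) / max(ab @ ab, 1e-12), 0.0, 1.0)
    closest = a + t[:, None] * ab
    return np.linalg.norm(pixels - closest, axis=1)

def render_rope(points, height, width, sigma):
    """Soft rope mask with values in [0, 1], indexed as mask[row, col].

    points: (S + 1, 2) array of ordered rope points as (x, y) pixel coordinates.
    The Gaussian-like falloff below is an ASSUMED form of the per-segment rendering.
    """
    points = np.asarray(points, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    per_segment = [
        np.exp(-point_to_segment_distance(pixels, a, b) ** 2 / sigma ** 2)
        for a, b in zip(points[:-1], points[1:])
    ]
    # Pixel-wise maximum over all segments.
    return np.max(per_segment, axis=0).reshape(height, width)
```

For training, the same computation would be written in an automatic differentiation framework so that gradients can flow to both the rope points and `sigma`.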

Given the initial state estimate from the neural network, we can compute the rendered weights $w_{ij}$ based on Eq. 3. With these weights, the M step can be applied directly to compute the parameters of the Gaussian clusters in RGB space. We also follow the E step in Eq. 1 to calculate color-based membership weights $\tilde{w}_{ij}$. Then, instead of directly using $\tilde{w}_{ij}$ for the next M step, we refine the estimated rope state by minimizing the distance between $w_{ij}$ and $\tilde{w}_{ij}$, defined as

(4) |

Thus, we adapt the EM algorithm for GMMs to minimize the above loss. Note that the GMM parameters are only transient values estimated for each individual image. They are not memorized as parameters. Thus, our method does not make any assumption about the distribution of rope colors or background colors. Instead, we make the much weaker assumption that the rope has good color contrast with the nearby background.
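One evaluation of this adapted procedure can be sketched as follows: the rendered mask drives the M step, the E step produces color-based memberships, and the two are compared. The mean squared difference used as the distance here is an assumption, as are the function names and the small covariance regularizer; the sketch also omits the occlusion handling.

```python
import numpy as np

def weighted_gaussian(x, w):
    """Mean and covariance of RGB values x (N, 3) under soft weights w (N,)."""
    mean = (w[:, None] * x).sum(axis=0) / w.sum()
    diff = x - mean
    cov = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(3)  # assumed regularizer
    return mean, cov

def log_density(x, mean, cov):
    """Log of the multivariate normal density at each row of x (N, 3)."""
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (maha + np.log(np.linalg.det(cov)) + 3.0 * np.log(2.0 * np.pi))

def image_loss(image, rendered):
    """image: (H, W, 3) RGB values; rendered: (H, W) rope weights from the renderer."""
    x = image.reshape(-1, 3).astype(float)
    w = rendered.ravel().astype(float)
    # M step: the cluster parameters are transient, recomputed for every image.
    pi_rope = np.clip(w.mean(), 1e-6, 1.0 - 1e-6)
    rope = weighted_gaussian(x, w)
    background = weighted_gaussian(x, 1.0 - w)
    # E step: posterior probability that each pixel's color belongs to the rope cluster.
    log_r = np.log(pi_rope) + log_density(x, *rope)
    log_b = np.log(1.0 - pi_rope) + log_density(x, *background)
    w_color = 1.0 / (1.0 + np.exp(np.clip(log_b - log_r, -50.0, 50.0)))
    # ASSUMED distance between rendered and color-based memberships.
    loss = np.mean((w - w_color) ** 2)
    return loss, w_color.reshape(rendered.shape)
```

A rendering that lies on top of the rope yields color clusters that reproduce the rendered mask, so the loss is close to zero. A misplaced rendering mixes rope and background colors within the clusters, and the memberships disagree.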

To model occlusions, e.g. from the robot arm, we clip the gradient of this loss on each pixel to be non-negative. In this way, we do not penalize pixels that belong to a rope segment according to the estimated rope state, but whose color belongs to the background color cluster, because that rope segment could be occluded.

The proposed loss can be used for either finetuning the perception network or for refining the rope state estimate at test time, without updating the network weights.

#### Network finetuning with an automatic curriculum and temporal consistency

While the image space loss is generalizable across different visual appearances, it is not free of local minima. When the estimate from the perception network is not good enough, gradients of the proposed loss could lead the rope state into undesired local minima. When finetuning the perception network on real images, we want to prevent such undesired gradients from negatively affecting the network weights. We use an automatic curriculum based on the current loss of each training example. At each iteration, we first run the forward pass to obtain the loss for each example, then select examples whose loss is below a pre-determined threshold, and only use gradients of these selected examples to update the neural network weights. Since the selected examples are already very close to the true rope state, and thus have low loss, it is very unlikely that their gradients will lead to wrong local minima. As the network learns, examples that originally have higher losses will improve, and their probability of falling into undesired local minima decreases. These examples will be included in the effective training set at a later point when their loss drops below the threshold.

In addition to using a curriculum, we also exploit temporal consistency in the recorded sequences to help the learning converge faster and better. If one frame has an image loss below the threshold while its neighboring frame has an image loss above the threshold, we take the predicted rope state from the better frame to guide the prediction on the worse frame. Specifically, the network is trained with the image loss on the better frame, but on the worse frame, the network is trained with the L2 loss between predicted rope states from each frame. In this way we help the network escape bad local minima on the worse frames. Exploiting temporal consistency in self-supervised training greatly improves the result, as shown in Fig. 5.
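The selection logic of the curriculum and the temporal-consistency rule can be summarized in a short sketch. The function name, the string labels, and the tie-breaking rule (using the lower-loss neighbor when both neighbors qualify) are assumptions; the text does not specify these details.

```python
def select_training_signals(image_losses, threshold):
    """Decide, per frame of one recorded sequence, which loss to train with.

    Returns a list of (signal, teacher_index) tuples:
      ("image_loss", None)  frame is below the threshold, use the image loss
      ("match_state", n)    frame is above the threshold, but neighbor n is
                            below it: use an L2 loss to the state predicted on n
      ("skip", None)        no reliable signal yet, contribute no gradient
    """
    good = [loss < threshold for loss in image_losses]
    plan = []
    for t, ok in enumerate(good):
        if ok:
            plan.append(("image_loss", None))
            continue
        neighbors = [n for n in (t - 1, t + 1)
                     if 0 <= n < len(image_losses) and good[n]]
        if neighbors:
            # Assumed tie-break: prefer the neighbor with the lower image loss.
            teacher = min(neighbors, key=lambda n: image_losses[n])
            plan.append(("match_state", teacher))
        else:
            plan.append(("skip", None))
    return plan
```

As training progresses and losses fall, frames move from `"skip"` to `"match_state"` to `"image_loss"`, which is how the effective training set grows over time.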

#### Generalization to more complex visual appearances

Although we derived the loss for the simple case of rope and background each having a solid color, we note that this loss is also applicable if the rope or the background is textured with several different colors, or when there are distractor objects or occlusion. We demonstrate a few examples in Fig. 4. To demonstrate the effectiveness of our loss independent of other components, we manually initialized the rope state estimation by clicking a few points on the images. During manipulation experiments, such initial estimates are provided by the perception network. The refined rope state estimate after convergence is shown in Fig. 4. We compare to the method generalized from [Wu_NIPS2017], where the rendered grey scale image is colored with the mean color of each cluster, and the L2 loss with the input image is used (also shown in Fig. 4). Our proposed loss is robust to lighting variations, bi-colored ropes, bi-colored/textured backgrounds, distractor objects, multiple ropes, or occlusions. The baseline method does not always converge to the desired solution. The superior robustness of our method compared to using the L2 loss can be attributed to the different assumptions used by GMM clustering and K-means clustering. With $w_{ij}$ the rendered rope weight of pixel $(i, j)$, $x_{ij}$ its RGB value, and $\mu_r$, $\mu_b$ the mean rope and background colors, the L2 loss can be written as:

$$
\sum_{ij} \lVert w_{ij}\,\mu_r + (1 - w_{ij})\,\mu_b - x_{ij} \rVert^2
= \sum_{ij} w_{ij} \lVert x_{ij} - \mu_r \rVert^2
+ \sum_{ij} (1 - w_{ij}) \lVert x_{ij} - \mu_b \rVert^2
- \lVert \mu_r - \mu_b \rVert^2 \sum_{ij} w_{ij} (1 - w_{ij})
$$

Since $w_{ij}$ are computed from Eq. 3, $\sum_{ij} w_{ij}(1 - w_{ij})$ only depends on the hyperparameter $\sigma$ and the total length of the rope when approximated to the first order. Thus we do not consider the effects of the last term. The first two terms correspond to K-means clustering in the RGB space. K-means clustering has the following assumptions: (1) The variance of each cluster is roughly equal. (2) The number of data points in each cluster is roughly equal. (3) The distribution of each cluster is roughly isotropic (spherical). All of these assumptions can be broken in real images, e.g. when the rope or background has multiple colors, or when the variance of brightness of pixels is much larger than the variance of hue, due to lighting and shadows. On the other hand, our proposed loss uses a GMM to cluster the RGB space. A GMM does not make any of the above assumptions, and thus generalizes better to real images.

### III-C A dynamics model with a physics prior

After rope states are estimated by the perception network and further refined with the image loss, a dynamics model is needed to predict future rope states given hypothetical actions, so that we can plan action sequences towards the goal.

Since neural networks have advantages in speed and in parallel processing on GPUs, we train a neural network with data generated from the physics-based simulator PhysBAM [physbam]. The neural network uses a bi-directional LSTM to model the structure of a mass-spring system. While LSTMs are usually used to propagate information in time, here we use them to propagate information along the rope’s mass-spring chain in both directions. Recurrently applying the same LSTM cell to each node in the rope ensures that the same physical law is applied, whether the node is closer to the endpoint or in the middle of the rope. Details of the network are described in Appendix (II). We also experimented with the recently proposed graph network [InteractionNet] but found it less effective in propagating information along a long chain of nodes. When generating training data, physical parameters used in the simulation are identified automatically using the cross-entropy method (CEM) on a small set of real data. Simulation sequences with random actions are generated and the model is trained on one-step prediction.
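The idea of sharing one recurrent cell along the chain, in both directions, can be illustrated with a simplified NumPy sketch. For brevity it uses a plain `tanh` recurrent cell as a stand-in for an LSTM cell, and the action encoding (a per-node displacement that is zero except at the grasped node), the parameter names, and the residual output are all assumptions. The actual architecture is the one in the appendix, not this sketch.

```python
import numpy as np

def chain_pass(features, w_in, w_rec, reverse=False):
    """Apply one shared recurrent cell node by node along the rope.

    features: (N, F) per-node inputs. The same weights are used at every node,
    so every node is governed by the same learned update rule.
    """
    n = len(features)
    order = range(n - 1, -1, -1) if reverse else range(n)
    hidden = np.zeros(w_rec.shape[0])
    out = np.zeros((n, w_rec.shape[0]))
    for i in order:
        hidden = np.tanh(w_in @ features[i] + w_rec @ hidden)  # stand-in for an LSTM cell
        out[i] = hidden
    return out

def predict_next_state(points, action, params):
    """points: (N, 2) node positions; action: (N, 2) displacement applied per node."""
    features = np.concatenate([points, action], axis=1)
    forward = chain_pass(features, params["w_in_f"], params["w_rec_f"])
    backward = chain_pass(features, params["w_in_b"], params["w_rec_b"], reverse=True)
    # Each node sees information from both sides of the chain before the update.
    delta = np.concatenate([forward, backward], axis=1) @ params["w_out"]
    return points + delta
```

The forward pass carries the effect of an action from the grasped node toward one end of the rope and the backward pass toward the other, which mirrors how forces propagate through neighboring springs.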

### III-D Rope manipulation with MPC

We use model predictive control to plan for a sequence of actions that takes the rope from the starting configuration to the goal configuration. Both are estimated from input images. We formulate actions as first selecting a point on the rope to grasp, and then selecting a 2D planar vector to move the gripper and the rope being grasped. This is different from the action space used in previous works [DVF_journal, ZSVI], where a grasping point is selected in image space, and a large portion of the action space will not have any contact with the rope. Note that if our estimated rope configuration deviates from the real rope significantly, the robot may still fail to grasp the real rope. However, such cases rarely happen in our experiments, since minor errors from the perception network can be corrected in the refinement process with our proposed image loss.

We adapt a sampling-based approach, MPPI [MPPI], for planning actions to manipulate the rope. Since the grasping point is a discrete variable and movement trajectories are continuous, we perform a nested optimization to obtain an optimal action sequence. In the inner loop, we sample trajectories of displacements of a given grasping point on the rope. These trajectories are rolled out with our dynamics model over the planning horizon. The cost of each rope state along the trajectory is its distance to the goal state. The optimal trajectory per grasping point is computed by forming the cost-weighted average of the sampled trajectories as derived in [MPPI]. For the outer loop, we sample grasping points on the rope and run the inner optimization loop for each in parallel. The grasping point with the lowest predicted cost of its optimal trajectory is selected.
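The nested optimization can be sketched as below. This is a simplified illustration rather than the modified MPPI used in the paper: the horizon, sample count, noise scale, and temperature are placeholder values, the `dynamics(state, grasp_idx, move)` callable is a hypothetical interface, and the grasp candidates are evaluated sequentially instead of in parallel.

```python
import numpy as np

def rollout_cost(state, goal, grasp_idx, plan, dynamics):
    """Sum over the horizon of the average point-wise distance to the goal."""
    cost = 0.0
    for move in plan:
        state = dynamics(state, grasp_idx, move)
        cost += np.mean(np.linalg.norm(state - goal, axis=1))
    return cost

def mppi_for_grasp(state, goal, grasp_idx, dynamics, horizon=5, samples=64,
                   noise_std=0.05, temperature=0.01, rng=None):
    """Inner loop: cost-weighted average of sampled displacement trajectories."""
    rng = np.random.default_rng() if rng is None else rng
    plans = rng.normal(0.0, noise_std, size=(samples, horizon, 2))
    costs = np.array([rollout_cost(state, goal, grasp_idx, p, dynamics) for p in plans])
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    best_plan = np.tensordot(weights, plans, axes=1)  # shape (horizon, 2)
    return best_plan, rollout_cost(state, goal, grasp_idx, best_plan, dynamics)

def plan_action(state, goal, dynamics, grasp_candidates, **kwargs):
    """Outer loop: keep the grasping point whose optimized trajectory is cheapest."""
    results = [(g,) + mppi_for_grasp(state, goal, g, dynamics, **kwargs)
               for g in grasp_candidates]
    return min(results, key=lambda r: r[2])
```

In a receding-horizon setting, only the first displacement of the returned plan would be executed before the state is re-estimated and the optimization is repeated.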

Because the explicit rope states are available, defining an informative loss as well as sampling promising action candidates for MPPI is straightforward, compared to methods that operate in image space [DVF_journal]. See Appendix (III) for more details.

## IV Experiments and Results

We evaluate each of the components described in the above section, and demonstrate that both our perception and dynamics model can be trained more effectively compared to baseline models that do not incorporate any prior structure. Our image loss is able to transfer the perception model from simple rendered images to real images with unseen occlusions. In addition, the components work with each other to achieve efficient manipulation of ropes to match visually specified goals, both in simulation and on real robots.

### IV-A Perception networks comparison

| Model | Train | Test |
|---|---|---|
| Baseline: Direct Estimate | 0.0104 | 0.0354 |
| Ours: Coarse-to-Fine | 0.0177 | 0.0231 |

Table I: Training and test loss of the two perception networks.

We evaluate estimation accuracy and generalization ability for two CNNs. The baseline model directly outputs 65 point coordinates from fully connected layers. We compare this to our proposed network that uses STNs for a coarse-to-fine estimation. Both models are trained on 10000 rendered images of b-spline curves. We report the training and evaluation loss for each network in Table I. Although the training loss for our network is larger than that of the baseline, our network performs significantly better on a held out test set, demonstrating better generalization due to our coarse-to-fine formulation. Because the state space of a rope is very high dimensional, densely sampling in this space is difficult and would lead to a data set whose size is exponential in the number of rope points. Therefore, generalization as provided by the hierarchical STNs is very important for our problem.

### IV-B Finetuning with image loss

Since the robot arm is not modeled in the renderer, a network only trained on rendered images never sees the robot arm or the resulting occlusion of the rope, and thus it does not generalize well to real images (see Fig. 5). We use our proposed image loss to finetune the parameters of our perception network on real images, without requiring annotations of rope states. We visualize the result after finetuning in Fig. 5. Ablation studies confirm that both automatic curriculum learning and enforcing temporal consistency bring significant improvements. Due to the limited size of the real dataset and the very large state space of ropes, the network does not generalize well to rope states outside of the training distribution. Experimentally, we observe that most of the overfitting comes from the coarsest prediction layer, and the refinement layers generalize well if given a reasonable coarse prediction. This motivates us to further exploit temporal information during manipulation. Based on the latest estimated state $s_t$, the MPC plans the optimal action $a_t$. We use the learned dynamics model to predict the next state $\hat{s}_{t+1}$. $\hat{s}_{t+1}$ is subsampled into 8 segments to feed into the CNN’s first STN, together with the next image $I_{t+1}$, and the CNN refines $\hat{s}_{t+1}$ into the next estimated state $s_{t+1}$. Thus, instead of estimating the 8 segments at the coarsest level from image $I_{t+1}$, they are assumed to be equal to the predicted segments from the previous frame. The CNN only refines the coarsened $\hat{s}_{t+1}$ into the next estimated state $s_{t+1}$.
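The warm-started estimation loop described above can be sketched as follows. All callables (`plan`, `dynamics`, `refine`) are hypothetical placeholders for the MPC, the learned dynamics model, and the refinement stages of the CNN; uniform subsampling of the predicted points is likewise an assumption about how the coarse segments are obtained.

```python
import numpy as np

def coarsen_state(points, n_segments=8):
    """Subsample an ordered rope state to n_segments + 1 evenly spaced points."""
    points = np.asarray(points, dtype=float)
    idx = np.round(np.linspace(0, len(points) - 1, n_segments + 1)).astype(int)
    return points[idx]

def track_step(state, next_image, plan, dynamics, refine):
    """One manipulation step with a dynamics-based coarse initialization.

    plan(state) -> action, dynamics(state, action) -> predicted state, and
    refine(image, coarse_points) -> refined state are assumed interfaces.
    """
    action = plan(state)
    predicted = dynamics(state, action)
    # The coarsest level is taken from the prediction rather than from the image.
    coarse = coarsen_state(predicted)
    return action, refine(next_image, coarse)
```

With a 65-point state, `coarsen_state` keeps every eighth point, giving the 9 points that bound the 8 coarse segments.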

### IV-C Learning dynamics models

We evaluate the long-horizon prediction accuracy of the learned dynamics model on both simulated and real data. We use two distance metrics for rope states: the average and the maximum deviation. Given a pair of rope states $(\mathbf{p}, \mathbf{q})$ with $N$ points each, we first compute the Euclidean distance for each pair of corresponding points, $d_i = \lVert \mathbf{p}_i - \mathbf{q}_i \rVert_2$. The average deviation is defined as $\frac{1}{N} \sum_{i} d_i$ and the maximum deviation is defined as $\max_i d_i$. These metrics will be used for all the following experiments.
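Both metrics are a few lines of NumPy. The function name `rope_deviation` is hypothetical, and the two states are assumed to have the same number of points in corresponding order.

```python
import numpy as np

def rope_deviation(state_a, state_b):
    """Average and maximum Euclidean distance between corresponding rope points.

    state_a, state_b: arrays of shape (N, 2) with matching point order.
    """
    a = np.asarray(state_a, dtype=float)
    b = np.asarray(state_b, dtype=float)
    distances = np.linalg.norm(a - b, axis=1)
    return distances.mean(), distances.max()
```

The average deviation summarizes overall shape agreement, while the maximum deviation exposes a single badly placed point that the average would hide.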

We show the prediction accuracy of the neural network dynamics model on real data, and compare to the simulator it is trained from. As shown in Fig. 6 (top), the prediction accuracy of our neural network model is comparable to the simulation engine for the first 40 steps, with the initial state estimated from image. We expect it to be further improved if also trained with multi-step prediction and finetuned on real data. The main advantage of the neural dynamics model is that it is significantly faster to predict, taking second per action on average, compared to second per action for the simulator. The neural network model is also readily parallelized on GPU with batch size up to . Both aspects are beneficial to MPC, since a lot of mental rollouts are required in parallel.

#### Benefit of incorporating a physics prior

We further compare the long-horizon prediction accuracy of our neural network dynamics model with the visual dynamics model (DVF) from [DVF_journal], on a batch of sequences from the simulated dataset. Both models are trained with the same dataset, except that DVF takes images, whereas our model takes explicit rope states. For each sequence, both models receive the same starting image and a sequence of actions. The input state for our model is estimated from the image. For DVF, the model tracks points on the rope by predicting heat maps in image space, and point positions are the expectations from predicted point distributions. As shown in Fig. 6 (bottom), our neural dynamics model is significantly more accurate than DVF. Note that both models are trained on a large dataset of 0.5M simulated actions, equivalent to at least 600 robot hours. By trading off some level of generality of the method, we are able to incorporate the physics prior for deformable linear objects and achieve significant gains in data efficiency and generalization, which manifest themselves in the prediction accuracy and subsequent manipulation performance (Sec. IV-D).

### IV-D Manipulation results

We evaluate the performance of our system on the task of manipulating a rope on the table to match desired goal states. We compare our method with the baseline method [DVF_journal], which uses MPC with the pixel distance cost. For the baseline, a task is specified by the start and goal positions of equidistant points on the rope. To select promising grasping points from images, we compute the pixel-wise difference between the current observed image and the goal image, and only sample grasping positions where the two images are significantly different.

For quantitative evaluation, we run manipulation tasks in simulation, and select starting states and goal states from randomly generated b-spline curves. Both our method and the baseline only see the rendered images. We report the distance to goal as a function of time , where is the average deviation from the current rope state to the specified goal state. We show the mean and standard deviation over independent experiments in Fig. 7. Our method achieves the goal state within steps in most cases, and the remaining distance is very small, while the baseline method that operates in image space often cannot achieve the goal state within steps, showing large residual distances at . For more visualizations of the simulated manipulations refer to Appendix (V).

We also demonstrate rope manipulation on the real robot. To highlight one benefit of using explicit state representations, we use a different rope on a different background to demonstrate goals. The two ropes have different appearance as well as different lengths and thicknesses. The L2 loss between goal and observed images would not be informative for video prediction methods, and it would be hard to embed the goal image into the latent space of the manipulated rope, if using [Abbeel_RSS19]. We arrange the goal state of the rope to be an “S” shape, a “W” shape, or an “” shape. We visualize the start, goal, and achieved state after actions in Fig. 8. Also see supplementary material for robot manipulation videos.

## V Conclusion

We demonstrated model-based, visual robot manipulation of deformable linear objects. Our forward model makes explicit estimation of rope states from images, and learns a dynamics model in state space. We proposed a differentiable renderer and a novel image loss, which enable self-supervised continuous training of rope state estimation on real data, without requiring expensive annotations. Our renderer and loss are generalizable across a wide range of visual appearances. With access to the rope’s explicit state, we are able to incorporate physics priors, e.g., the structure of mass-spring systems, into the design of the network structure for dynamics models, in addition to using physical simulation for data generation. We demonstrated that our method has higher accuracy in long-horizon future prediction compared to models that do not involve explicit state estimation and do not use any physics prior, and that our method achieves more efficient manipulation in matching visually specified goals.

For future work, it would be interesting to explore using our image loss to continuously train the perception and dynamics networks while the robot is performing manipulation tasks, similar to DAGGER [DAGGER]. Our neural network dynamics model can be extended to have the object’s physical properties as latent variables, such that the dynamics model can adapt quickly to ropes/wires with different physical properties, by updating the latent variables instead of network weights. Finally, we would also like to extend this method to deformable objects with even higher dimensional state spaces, such as clothing.

## Acknowledgments

Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.