Model-Driven Feed-Forward Prediction for Manipulation of Deformable Objects


Yinxiao Li, Yan Wang, Yonghao Yue, Danfei Xu, Michael Case, Shih-Fu Chang, Eitan Grinspun, and Peter Allen

Robotic manipulation of deformable objects is a difficult problem especially because of the complexity of the many different ways an object can deform. Searching such a high dimensional state space makes it difficult to recognize, track, and manipulate deformable objects. In this paper, we introduce a predictive, model-driven approach to address this challenge, using a pre-computed, simulated database of deformable object models. Mesh models of common deformable garments are simulated with the garments picked up in multiple different poses under gravity, and stored in a database for fast and efficient retrieval. To validate this approach, we developed a comprehensive pipeline for manipulating clothing as in a typical laundry task. First, the database is used for category and pose estimation for a garment in an arbitrary position. A fully featured 3D model of the garment is constructed in real-time and volumetric features are then used to obtain the most similar model in the database to predict the object category and pose. Second, the database can significantly benefit the manipulation of deformable objects via non-rigid registration, providing accurate correspondences between the reconstructed object model and the database models. Third, the accurate model simulation can also be used to optimize the trajectories for manipulation of deformable objects, such as the folding of garments. Extensive experimental results are shown for the tasks above using a variety of different clothing.


© The Author(s) 2015



Keywords: Deformable Objects, Recognition, Robotic Manipulation, Simulation

Department of Computer Science, Columbia University
Department of Electrical Engineering, Columbia University
Corresponding author: Yinxiao Li, Columbia University, Room 450 Mudd, 500 W. 120th Street, M.C. 0401, New York, New York 10027

1 Introduction

Robotic manipulation of deformable objects is a challenging problem, especially because of the complexity of the many different ways an object can deform. Searching within such a high-dimensional state space makes it difficult to recognize, track, and manipulate deformable objects. In this paper we present a feed-forward, model-driven approach to address this challenge, using a pre-computed, simulated database of deformable thin-shell object models, in which the bending of the mesh models is predominant (Grinspun et al. [2003]). The models are detailed, robust, and easy to construct, and using a physics engine one can accurately predict the behavior of the objects in simulation, which can then be applied to a real physical setting. This work bridges the gap between the simulation world and the real world. The predictive, feed-forward, model-driven approach takes advantage of the simulation to generate a large number of instances for learning approaches. This not only alleviates the burden of data collection, which can be done efficiently in simulation, but also makes adapting the methods to other application areas easier and faster. Mesh models of common deformable garments are simulated with the garments picked up in multiple different poses under gravity, and stored in a database for fast and efficient retrieval. To validate this approach, we developed a comprehensive pipeline for manipulating clothing as in a typical laundry task. First, the database is used to estimate categories and poses of garments in arbitrary positions. A fully featured 3D volumetric model of the garment is constructed in real time, and volumetric features are then used to obtain the most similar model in the database to predict the object category and pose. Second, the database can significantly benefit the manipulation of deformable objects via non-rigid registration, providing accurate correspondences between the reconstructed object model and the database models. Third, the accurate model simulation can also be used to optimize the trajectories for manipulation of deformable objects, such as the folding of garments. In addition, the simulation can be easily adapted to new garment models. Extensive experimental results are shown for the tasks above using a variety of different clothing.

Figure 1: The overall pipeline of robotic manipulation of deformable objects.

Figure 1 shows a typical pipeline for manipulating clothing as in a laundry task. This paper brings together work addressing all the tasks in the pipeline (except ironing) which have been previously published in conference papers (Li et al. [2014a], Li et al. [2014b], Li et al. [2015a], Li et al. [2015b]). These tasks, with the exception of the ironing task, are all implemented using a feed-forward, model-driven methodology, and this paper serves to consolidate all these results into a single integrated whole. The work has also been extended to include novel garments not found in the database, extended results on regrasping using a much larger dataset of objects and examples, quantitative registration results for our hybrid rigid/deformable registration methods, new dense mesh modeling techniques, and a novel dissimilarity metric used to assess folding success. The ironing task is omitted from this paper due to size constraints. Full details on ironing can be found in Li et al. [2016].

In addition, a set of videos of our experimental results are available at:

2 Related Work

2.1 Recognition

There has been previous work on the recognition and manipulation of deformable objects. Willimon et al. [2013] and Willimon et al. [2011] used interactive perception to classify the clothing type. Their work was based on an image-only database of 6 categories, each containing 5 different items from real garments. Later, they increased the size of the database but still used real garments. Their work focused on small clothing such as socks and short pants, usually consisting of a single color. Miller et al. [2012], Wang et al. [2011], Schulman et al. [2013], and Cusumano-Towner et al. [2011] have done some impressive work in clothing recognition and manipulation. They have successfully enabled the PR2 robot to fold clothing and towels. Their methods mainly focus on aligning the current edge/shape from observation to an existing shape. A series of works on clothes pose recognition was done by Kita et al. [2011a], Kita et al. [2011b], and Kita and Kita [2002]. They used a simulation database of a single garment with around different grasping points, which were mostly selected on the border when the garment was laid flat. Their work demonstrated the ability to identify the pose of the clothes by registration to pre-recorded template images. Doumanoglou et al. [2014] used a pair of industrial arms to recognize and manipulate deformable garments. They used a database of depth images captured from real garments, such as a sweater or a pair of pants.

With powerful computing resources, reconstructing a 3D model of the garment and using it to search a pre-computed database of simulated garment models in different poses can be more accurate and efficient. With the increasing popularity of the Kinect sensor, various reconstruction methods are emerging in computer graphics, such as KinectFusion and its variants (Newcombe et al. [2011], Chen et al. [2013], Li et al. [2013]). Although these methods have shown success in reconstructing static scenes, they do not fit our scenario directly, where a robotic arm is rotating the target garment about a grasping point. Therefore we first do a 3D segmentation to get the masks of the garment on the depth images, and then invoke KinectFusion to do the reconstruction.

Shape matching is another related and long-standing topic in robotics and computer vision. On the 2D side, various local features have been developed for image matching and recognition (Huttenlocher et al. [1993], Latecki et al. [2000], Lowe [1999]), which have shown good performance on textured images. Another direction is shape-context based recognition (Belongie et al. [2002], Toshev et al. [2010], Tu and Yuille [2004]), which is better for handwriting and character matching. On the 3D side, Wu et al. [2008] and Wang et al. [2006] have proposed methods to match patches based on 3D local features. They extract Viewpoint-Invariant Patches or the distribution of geometry primitives as features, based on which matching is performed. Osada and Funkhouser [2001], Thayananthan et al. [2003], and Frome et al. [2004] apply 3D shape context as a metric to compute similarities of 3D layout for recognition. However, most of these methods are designed for noise-free human-designed models, without the capability to match between the relatively noisy and incomplete mesh model produced by Kinect and the human-designed models. Our method is inspired by 3D shape context (Frome et al. [2004]), but provides the capability of cross-domain matching with a learned distance metric, and also utilizes a volumetric data representation to efficiently extract the features.

2.2 Manipulation

Osawa et al. [2007] proposed a method using a dual-arm setup to unfold a garment from pick up. They used a segmented mask to match the pre-stored template mask to track the states of the garment. The PR2 robot is probably the first robot that has successfully manipulated deformable objects such as a towel or a T-shirt (Maitin-Shepard et al. [2010]). The visual recognition in this work targets corner-based features, which do not require a template to match. Subsequent work improved the prediction of the state of a garment using an HMM framework by regrasping at the lowest corner point (Cusumano-Towner et al. [2011]). Doumanoglou et al. [2014] applied pre-recorded depth images to guide the manipulation procedures. Sun et al. [2015] used a pair of stereo cameras to analyze the surface of a piece of cloth and performed flattening and unfolding.

One of the applications of our database is to localize the regrasping point during the manipulation by mapping the pre-determined points from the simulation mesh to the reconstructed mesh. Therefore, a fast and accurate registration algorithm plays a key role in our method. Rigid or non-rigid surface registration is a fundamental tool to find shape correspondence. A thorough review can be found in Tam et al. [2013]. Our registration algorithm builds on previous techniques for rigid and non-rigid registration. First, we use an iterative closest point method (Besl and McKay [1992]) to rigidly align the garment. Here, we use a distance field to accelerate the computation. Next, we perform a non-rigid registration to improve the matching by locally deforming the garment. Similar to Li et al. [2008], we find the correspondence by minimizing an energy function that describes the deformation and the fitting.
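The rigid alignment step can be illustrated with a minimal iterative closest point sketch in NumPy. This is not the authors' implementation: the function names are hypothetical, correspondences are found by brute-force nearest-neighbour search (standing in for the distance-field acceleration mentioned above), and the rigid transform for each iteration is obtained with the standard SVD-based least-squares solution.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that R @ src[i] + t ~ dst[i]."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the SVD solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iters=20):
    """Rigidly align point set src to dst and return the moved copy of src."""
    cur = src.copy()
    for _ in range(iters):
        # Brute-force closest points (assumption: replaces the distance field).
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matched = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur
```

The brute-force search costs O(nm) per iteration for n source and m target points, which is why a precomputed distance field (or a spatial index) is attractive for dense garment meshes.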

2.3 Folding Deformable Objects

With the garment fully spread on the table, attention is turned to parsing its shape. S. Miller et al. have designed a parametrized shape model for unknown garments Miller et al. [2011] Miller et al. [2012]. Each set of parameters defines a certain type of garment such as a sweater or a towel. The goal is to minimize the distance between the observed garment contour points and points from the parametrized shape model. The fitting score between the observed contour and the shape models can also be used for recognition of garment category. However, the average time for the fitting procedure is seconds and sometimes does not converge. The contour-based garment shape model was further improved by J. Stria et al. using polygonal models Stria et al. [2014a]. The detected garment contour is matched to a polygonal model by removing non-convex points using a dynamic programming approach. The landmarks on the polygonal model are then mapped to the real garment contour, and followed by generating a folding plan.

Folding is another application of garment manipulation. F. Osawa et al. used a robot to fold a garment with a special purpose table that contains a plate that can bend and fold the clothes assisted by a dual-arm robot. The robot mainly worked on repositioning the clothes for the plate for each folding action. Within several “flip-fold” operations, the garment can be folded. Another folding method using a PR2 robot was implemented by van den Berg et al. [2010]. The core of their approach was the geometry reasoning with respect to the cloth model without any physical simulation. Contour fitting at each step took relatively longer than execution of the folding actions, which reduced its efficiency. This was further sped up by Stria et al. [2014b] using two industrial arms and a polygonal contour model. They showed impressive folding results by utilizing a specifically-designed gripper Le et al. [2013] that is suitable for cloth grasping and manipulation.

None of the previous works focuses on trajectory optimization for garment folding, which leaves the resulting layout uncertain even for the same folding plan. One possible case is that the garment shifts on the table during one folding action so that the targeted folding position is also moved. Another case is that an improper folding trajectory causes additional deformation of the garment itself, which can accumulate. Our previous work (Li et al. [2015b]) showed that with effective simulation, bad trajectories can be avoided and the results of manipulating deformable objects are predictable.

3 A Database For Deformable Object Recognition

3.1 Motivation

Figure 1 shows an overview of our pipeline for dexterous manipulation of deformable objects. The first step is the visual recognition of deformable objects. We need to have a large set of exemplars of how garments will look visually when arbitrarily grasped. In addition, as we mentioned previously, a 3D model can be used for regrasping and further manipulation after an accurate registration. Therefore, a database with a number of 3D models is desirable. In order to have a set of pre-calculated trajectories for efficient manipulation of deformable objects, off-line simulation is an effective way to approach this. With low-cost and fast simulation, optimized trajectories can be calculated, and exported and adapted to the real robotic manipulation.

Physically having people, or a robot arm, successively pick up an object and image its appearance is too slow and cannot span the large space we are hoping to learn. Because such a training set is physical, it can be very time-consuming to create, and it may have problems encompassing a wide range of garments and different fabrics, which we can more easily accommodate in the simulation environment. Using advanced simulators such as Maya (Maya [2015]) to physically simulate deformable objects, we can produce thousands of exemplars efficiently, which we can then use as a corpus for learning the visual appearances of the deformed garments.

Take manipulation of the deformable garment as an example. One solution is to use prior knowledge to guide the robot to follow steps of a task. In our previous work, we have successfully used online registration between the database model and the reconstructed model to achieve a stable regrasping and unfold a garment. The work that is closest to ours is by  Doumanoglou et al. [2014]. This work has impressive results for unfolding of a number of different garments. They use a dual-arm industrial robot to unfold a garment guided by a set of depth images which provide a regrasping point. This method achieves promising accuracy. Their training set is a number of physical garments that have been grasped at different grasping points to create feature vectors for learning. A major difference is our use of simulated predictive thin shell models for garments from a large database of garments and their poses. We also use an online registration of a reconstructed 3D surface mesh with the simulated model to find regrasping points. By this method, we can choose arbitrary regrasping points without having to train the physical model for the occurrence of the grasping points. This allows us to choose any point on the garment at any time as the regrasping point.

3.2 Simulating Deformable Objects

We have developed an off-line simulation pipeline whose results can be used to predict poses of deformable objects. The off-line simulation is time efficient, noise free, and more accurate compared with acquiring data via sensors from real objects. Simulation models do not suffer from occlusion or noise and are more complete than physically scanned models. In the off-line simulation, we use a few well-defined garment mesh models such as sweaters, jeans, and short pants. Similar garment mesh models can also be obtained from Poserworld Inc. [a] and Turbo Squid Inc. [b]. We can also generate models by using our own “Sensitive Couture” software (Umetani et al. [2011]). Figure 2 shows a few of our current garment models rendered in Maya.

Figure 2: (a), (b): The original garment mesh models of a sweater and a pair of jeans rendered in Maya. (c), (d): Simulation results of hanging (a) and (b) under gravity, respectively.

For each grasping point, we compute the garment layout by hanging under gravity in the simulator. In Maya, a mesh model can be converted into an nCloth model which can be then simulated with some cloth properties such as hanging and falling down. Maya also allows for control of cloth thickness and deformation resistance, etc. In addition, any vertex on the mesh can be selected as a constraint point to simulate a draping effect. The hanging under gravity effect of the garment models is shown in Figure 2. We manually label each garment in the database with the key grasping points such as sleeve end, elbow, shoulder, chest, and waist, etc.

The simulation model can be exported as an obj file for recognition using the volumetric approach of Li et al. [2014b]. Figure 3 shows a small sample of different picking points of a single garment hanging under gravity, simulated in Maya.

Figure 3: Six different mesh models of the same sweater, picked up at different points, simulated in Maya.

4 Pose Estimation

Pose estimation of deformable objects is an important problem in robotics, laying the foundation for further procedures. For example, in the task of garment folding, once the robot has detected the pose of the garment, it can then proceed to manipulate the target garment to a preset “standard pose.” Unlike rigid object recognition which has finite state spaces, deformable object recognition is much harder because of the very large state space of how it deforms.

Figure 4: Our application scenario: a Baxter robot grasps a sweater, and a Kinect captures depth images to recognize the pose of the sweater. The recognition result is shown on the right.

In this section, we describe a real-time pose recognition algorithm with accurate prediction of grasping point locations. Figure 4 shows the experimental settings for our algorithm: a Baxter robot grasping a garment and predicting the grasping location (e.g. cm left of the collar). With this information, the robot is then able to proceed to subsequent tasks such as regrasping and folding. The main idea of our method is to first accurately reconstruct a 3D mesh model from a low-cost depth sensor, and then compute the similarity between the reconstructed model and the models simulated offline to predict the pose of the object. The database introduced in the previous section provides a perfect source for such offline-simulated models.

Figure 5: Overview of our proposed pipeline for pose estimation of deformable objects. In the offline training stage (the red rectangle), we extract a tailored binary feature from the simulated database, and learn a weighted Hamming distance from additional calibrated data collected from the Kinect. In the online testing stage (the green rectangle), we reconstruct a 3D model from the depth input, find the nearest neighbor from the simulated database with the learned distance metric, and then adopt the pose of the matched model as the output.

4.1 Method

Our method consists of two stages: the offline model simulation stage and the online recognition stage. In the offline model simulation stage, we use a physics engine to simulate the stationary state of the mesh models of different types of garments in different poses. In the online recognition stage, we use a Kinect sensor to capture many depth images of different view points of the garment by rotating it as it hangs from a robotic arm. We then reconstruct a smooth 3D model from the depth input, extract compact 3D features from it, and finally match against the offline model database to recognize its pose. Figure 5 shows the framework of our method, which will be introduced in the subsequent subsections.

4.1.1 3D Reconstruction

Given the model database described above, we now need to generate depth images and match against the database. Direct recognition from depth images suffers from the problems of self-occlusion and sensor noise. This naturally leads to our new method of first building a smooth 3D model from the noisy input, and then performing recognition in 3D. However, how to do such reconstruction is still an open problem. There are existing approaches for obtaining high-quality models from noisy depth inputs, such as KinectFusion (Newcombe et al. [2011]), but they require the scene to be static. In our data collection settings, the target garment is being rotated by a robotic arm, which invalidates KinectFusion’s assumptions. We solve this problem by first segmenting out the garment from its background, and then invoking KinectFusion to obtain a smooth 3D model, assuming that the rotation is slow and steady enough that the garment will not deform in the process.

Segmentation. Before diving into the reconstruction algorithm, let us first define some notation. Given the intrinsic matrix $F_d$ of the depth camera and the $i$th depth image $I_i$, we are able to compute the 3D coordinates of all the pixels in the camera coordinate system with $[x_{ci}, y_{ci}, z_{ci}]^\top = F_d^{-1} d_i [u_i, v_i, 1]^\top$, in which $(u_i, v_i)$ is the coordinate of a pixel in $I_i$, with $d_i$ as the corresponding depth, and $(x_{ci}, y_{ci}, z_{ci})$ is the corresponding 3D coordinate in the camera coordinate system.

Our segmentation is then performed in 3D space. We ask the user to specify a 2D bounding box on the depth image together with a rough estimate of the depth of the garment. Given that the data collection environment is reasonably constrained, we find that even one predefined bounding box works well. We then adopt all the pixels whose 3D coordinates fall within the bounding box as the foreground, resulting in a series of masked depth images and their corresponding 3D points, which are fed into the reconstruction module.
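The back-projection and bounding-box masking described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, `K` stands for the depth camera's intrinsic matrix, the box is assumed to be axis-aligned in the camera frame and given by two corner points, and rejected pixels are assumed to be marked with a depth of zero.

```python
import numpy as np

def backproject(depth, K):
    """Map every pixel (u, v) with depth d to the 3D point K^{-1} * d * [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)   # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T
    return rays * depth[..., None]            # (h, w, 3) camera-frame points

def mask_garment(depth, K, box_min, box_max):
    """Keep only pixels whose 3D point lies inside the axis-aligned box."""
    pts = backproject(depth, K)
    inside = np.all((pts >= box_min) & (pts <= box_max), axis=-1)
    # Assumption: a depth of 0 marks pixels removed from the scene.
    return np.where(inside, depth, 0.0), inside
```

Because the test is a pair of element-wise comparisons per pixel, the masking adds little cost per frame.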

The 3D reconstruction is done by feeding the masked depth images into KinectFusion. Because the unrelated surroundings have been eliminated, the remaining scene to reconstruct is static. This process can be done in real time. In addition to a smooth mesh, the KinectFusion library also generates a Signed Distance Function (SDF) mapping, which will be used for 3D feature extraction. The SDF is defined on any 3D point $p$. It has the property that it is negative when the point is within the surface of the scanned object, positive when the point is outside the surface, and zero when it is on the surface. We will use this function to efficiently compute our 3D features in the next subsection.

4.1.2 Feature Extraction

Figure 6: Feature extraction from a reconstructed mesh model. (a) indicates that a bounding cylinder of a garment is cut into several layers. (b) shows a set of layers (sections). For each layer, we divide it into cells via rings and sectors. (c) shows a binary feature vector collected from each cell. Details are described in section 4.1.2.

Inspired by 3D Shape Context Belongie et al. [2002], we design a binary feature to describe the 3D models. In our method, the features are defined on a cylindrical coordinate system fit to the hanging garment as opposed to traditional 3D Shape Context which uses a spherical coordinate system Frome et al. [2004].

For each layer, as shown in Figure 6 top-right, we uniformly divide the world space into ($R$ rings) $\times$ ($\Phi$ sectors) in a polar coordinate system, with the largest ring covering the largest radius among all the layers. The center of the polar coordinate system is determined as the mean of all the points in the highest layer, which usually contains the robot gripper. Note that we do a uniform division instead of the logarithmic division of the radius $r$ that Shape Context uses. Shape Context uses a logarithmic division of $r$ because the cells farther from the center are less important, which is not the case in our settings. For each layer, instead of doing a point count as in the original Shape Context method, we check the Signed Distance Function (SDF) of the voxel which the center of the polar cell belongs to, and fill one ($1$) in the cell if the SDF is zero or negative (i.e. the cell is inside the surface), otherwise zero ($0$). Finally, all the binary numbers in each cell are collected in an order (e.g. with $\phi$ increasing and then $r$ increasing), and are concatenated as the final feature vector.

The insight behind this design is that, to improve the robustness against local surface disturbance due to friction, we include the 3D voxels inside the surface in the features. Note that we do not need to do a time-consuming classification (e.g. ray tracing) to determine whether each cell is inside the surface; we only need to look up its SDF value, which dramatically speeds up the feature extraction.
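As a rough illustration of this feature, the following Python sketch builds the binary cylindrical descriptor from mesh vertices and an SDF. It is a sketch under stated assumptions, not the authors' implementation: `extract_feature` and its parameter names are hypothetical, the vertical axis is assumed to be z, layers are assumed to have equal height, and `sdf` is assumed to be a callable returning the signed distance at a continuous 3D point rather than a voxel-grid lookup.

```python
import numpy as np

def extract_feature(vertices, sdf, n_layers, n_rings, n_sectors):
    """Binary cylindrical feature of shape (n_layers, n_rings, n_sectors)."""
    z = vertices[:, 2]                                   # assumption: z is vertical
    edges = np.linspace(z.max(), z.min(), n_layers + 1)  # layer boundaries, top-down
    top = vertices[z >= edges[1]]                        # vertices in the highest layer
    origin = top[:, :2].mean(axis=0)                     # polar origin in the x-y plane
    r_max = np.linalg.norm(vertices[:, :2] - origin, axis=1).max()
    feature = np.zeros((n_layers, n_rings, n_sectors), dtype=np.uint8)
    for k in range(n_layers):
        zc = 0.5 * (edges[k] + edges[k + 1])             # mid-height of the layer
        for i in range(n_rings):
            r = (i + 0.5) * r_max / n_rings              # uniform, not logarithmic, rings
            for j in range(n_sectors):
                phi = (j + 0.5) * 2.0 * np.pi / n_sectors
                c = np.array([origin[0] + r * np.cos(phi),
                              origin[1] + r * np.sin(phi), zc])
                # Cell is 1 when its center lies on or inside the surface.
                feature[k, i, j] = 1 if sdf(c) <= 0 else 0
    return feature
```

Flattening the returned array (for example with `feature.ravel()`) gives the concatenated binary vector; keeping the sector index as the last axis makes the later rotation search a simple circular shift.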

Input: Vertices of the mesh model $V$, precomputed SDF. Parameters: $N$ = #layers, $R$ = #rings, $\Phi$ = #sectors
Output: Corresponding feature vector $x$
Divide mesh $V$ into $N$ layers in a top-down manner;
Origin = Mean of the points in the top layer;
[rings, sectors] = Polar(Origin, $R$, $\Phi$);
for each layer do
    for each cell (ring, sector) do
        $c$ = center of the cell;
        if SDF($c$) $\le 0$ then $x_{cell} = 1$;
        else $x_{cell} = 0$;
return $x$.
Algorithm 1 Feature extraction for pose estimation of deformable objects

Matching Scheme. Similar to Shape Context, when matching two shapes against each other, we conceptually rotate one of them and adopt the minimum distance as the matching cost, to provide rotation invariance. That is,

$$\mathrm{Distance}(x_1, x_2) = \min_i \lVert R_i x_1 \oplus x_2 \rVert_1 \tag{1}$$

in which $x_1, x_2 \in \mathbb{B}^{N R \Phi}$ are the features to be matched ($\mathbb{B}$ is the binary set $\{0, 1\}$, and $N$, $R$, $\Phi$ are the numbers of layers, rings, and sectors), $\oplus$ is the binary XOR operation, and $R_i$ is the transform matrix to rotate the feature of each layer by $2\pi i/\Phi$. Recall that both features to be matched are compact binary codes. Thus such conceptual rotation as well as the Hamming distance computation can be efficiently implemented by integer shifting and XOR operations, resulting in matching that is even faster than the Euclidean distance for a reasonable number of sectors $\Phi$. A complete illustration of the feature extraction algorithm can be found in Algorithm 1.
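The matching cost of Equation 1 can be sketched as follows. This is a hedged illustration, not the authors' code: the function name is hypothetical, features are assumed to be stored as binary arrays of shape (layers, rings, sectors), and the rotation is carried out with `np.roll` along the sector axis for readability instead of the packed-integer shifts described in the text.

```python
import numpy as np

def rotation_invariant_distance(f1, f2):
    """Minimum Hamming distance between f2 and all sector-wise rotations of f1.

    f1, f2: binary integer arrays of shape (layers, rings, sectors).
    Returns (distance, best_shift), where best_shift is the number of sectors
    by which f1 was rotated to reach the minimum.
    """
    n_sectors = f1.shape[-1]
    costs = [int(np.count_nonzero(np.roll(f1, s, axis=-1) ^ f2))
             for s in range(n_sectors)]
    best = int(np.argmin(costs))
    return costs[best], best
```

A production version would likely pack each ring into a machine word so that a rotation becomes a bit rotate and the Hamming distance a population count, which is the source of the speed advantage mentioned above.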

4.1.3 Domain Adaptation

Now we have a feature vector representation for each model in the simulated database and for the query. A natural idea is to find the Nearest Neighbor (NN) of the query in the database and transfer the metadata such as category and pose from the NN to the query. But a naive NN algorithm with Euclidean distance does not work here because, even for the same garment and the same grasping point by the robot, the way it deforms may still be slightly different due to friction. This requires a solution in the matching stage, especially given that it is impractical to simulate every object with all the possible materials. Therefore, essentially we are doing cross-domain retrieval, which generally requires a “calibration” step to adapt the knowledge from one domain (simulated models) to another (reconstructed models).

Weighted Hamming Distance. Similar to the distance calibration in Wang et al. [2013], we use a learned distance metric to improve the NN accuracy, i.e.

$$\mathrm{Pose} = \arg\min_i \; w^\top (\hat{x}_i \oplus x) \tag{2}$$

in which $x$ is the feature vector of the query, $i$ is the index of models in the simulated database, $w$ is the weight vector, and $\oplus$ is the binary XOR operation. $\hat{x}_i = R_i x_i$ indicates the feature vector of the $i$th model, with $R_i$ as the optimal $R$ in Equation 1.

The insight here is that we wish to grant our distance metric more robustness against material properties by assigning larger weights to the regions invariant to the material differences (this amplifies the features that are more intrinsic for the recognition task).
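A minimal sketch of the weighted nearest-neighbour lookup of Equation 2 is given below. The function name and array layout are assumptions made for illustration: database features are assumed to be already rotated to their best alignment with the query and stacked row-wise, and the weights are assumed to be given.

```python
import numpy as np

def predict_pose(query, db_features, w):
    """Index i that minimises w^T (x_hat_i XOR x) over the database.

    query:       (D,) binary feature of the reconstructed model
    db_features: (M, D) binary features of the simulated models
    w:           (D,) real-valued weights, one per feature bit
    """
    diffs = np.bitwise_xor(db_features, query)     # (M, D) disagreement bits
    return int(np.argmin(diffs @ w))
```

With all weights equal to one this reduces to the plain Hamming distance. Down-weighting bits that vary with the material can change the retrieved neighbour, as in the usage example below, where the last two bits are treated as material-dependent.

```python
db = np.array([[1, 0, 0, 0], [0, 0, 1, 1]])
query = np.array([1, 0, 1, 1])
predict_pose(query, db, np.ones(4))                        # plain Hamming picks model 1
predict_pose(query, db, np.array([3.0, 1.0, 0.5, 0.5]))    # weighted distance picks model 0
```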

Distance Metric Learning. To robustly learn the weighted Hamming distance, we use an extra set of mesh models collected from a Kinect as calibration data. The collection settings are the same as described in “3D Reconstruction” and only a small amount of calibration data is needed for each category (e.g. models in poses for long-sleeve shirt model). To determine the weight vector $w$, we then formulate the learning process as an optimization problem of minimizing the empirical error with a large-margin regularizer:

$$\begin{aligned} \min_{w} \quad & \lVert w \rVert^2 + C \sum_j \xi_j \\ \text{s.t.} \quad & w^\top (\hat{x}_i \oplus x_j) < w^\top (\hat{x}_k \oplus x_j) + \xi_j, \quad \forall j, \; \forall\, y_i = l_j, \; y_k \neq l_j, \\ & \xi_j \geq 0 \end{aligned} \tag{3}$$

in which $\hat{x}_i$ is the orientation-calibrated feature of the $i$th model (from the database), with $y_i$ as the corresponding ground truth label (i.e. the index of the pose). $x_j$ is the extracted feature of the $j$th training model (from Kinect), with $l_j$ as the ground truth label. We wish to minimize $\sum_j \xi_j$, which indicates how many wrong results the learned metric gives, with a quadratic regularizer. $C$ controls how much penalty is given to wrong predictions.

This is a non-convex and even non-differentiable problem. Therefore we employ the RankSVM Joachims [2002] to obtain an approximate solution using the cutting-plane method.
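To give a feel for this kind of ranking-based metric learning, the sketch below minimises a hinge-loss relaxation of the objective with plain subgradient descent. It deliberately swaps in a simpler technique than the cutting-plane RankSVM solver named above, and it makes further assumptions that are not from the paper: one slack per (calibration model, correct pose, wrong pose) triple, a unit margin, a fixed step size, and hypothetical function and parameter names.

```python
import numpy as np

def learn_weights(db, db_labels, calib, calib_labels, C=10.0, lr=0.01, steps=500):
    """Subgradient descent on ||w||^2 + C * (sum of hinge losses over ranking triples)."""
    db_labels = np.asarray(db_labels)
    # One difference vector per triple: d = (x_hat_k XOR x_j) - (x_hat_i XOR x_j),
    # so w @ d > 0 means the correct pose i is ranked ahead of the wrong pose k.
    pairs = []
    for x, label in zip(calib, calib_labels):
        xor = np.bitwise_xor(db, x).astype(float)
        for i in np.flatnonzero(db_labels == label):
            for k in np.flatnonzero(db_labels != label):
                pairs.append(xor[k] - xor[i])
    pairs = np.array(pairs)
    w = np.ones(db.shape[1])                     # start from the plain Hamming distance
    for _ in range(steps):
        violated = pairs[pairs @ w < 1.0]        # triples inside the unit margin
        w -= lr * (2.0 * w - C * violated.sum(axis=0))
    return w
```

In this relaxation, bits that never help separate the correct pose from the wrong ones receive no hinge gradient and are shrunk toward zero by the regularizer, while discriminative bits keep a positive weight.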

Knowledge Transfer. Given the learned $w$, in the testing stage, we then use Equation 2 to obtain the nearest neighbor of the query model. We directly adopt the grasping point of the nearest neighbor, which is known from the simulation process, as the final prediction.

4.2 Experimental Results

We used a series of experiments to demonstrate the effectiveness of the proposed method and justify the components. We tested our method on a dataset of various kinds of garments collected from practical settings, by treating it as a classification problem and calculating the classification accuracy. Experimental results demonstrate that our method is able to achieve both reasonable accuracy and fast speed.

4.2.1 Data Acquisition

Since the simulated database introduced in the previous section does not have the practically captured data, we collect an extra test dataset for general evaluation of pose recognition of deformable objects based on depth image as inputs.

Figure 7: Visual examples of the pose recognition results of our method. The garment is picked up by a gripper of the Baxter robot. From left to right, each example shows the color image, input depth image, reconstructed model, matched simulated model, ground truth simulated model, and the predicted grasping points (red) marked on the model with the ground truth (yellow). The example in the bottom right is a failure case, which may be caused by an uninformative deformation shape. Note that our method does not use any color information. (Best viewed in color)

The dataset consists of 2 parts, a test set and a calibration set. To collect the testing set, we use a Baxter robot, which is equipped with two arms with 7 degrees of freedom. A Kinect sensor is mounted on a horizontal platform at height of meters to capture the depth images, as shown in Figure 4. We bought kinds of garments – long-sleeve shirts, pants and shorts, as representative examples in the manufacturing industry, and then collect their depth images with the same grasping points of the training database. We then use our 3D reconstruction algorithm to obtain their mesh models. For each grasping point of each garment, the robot rotates the garment 360 degrees around seconds while the Kinect captures at fps, which gives us around depth images for each garment/pose. This results in a test set of mesh models, with their raw depth images.

Because we also need to learn (calibrate) a distance metric from additional Kinect data (using Equation 3), we collect a small extra set with the same settings to serve as calibration data, with only five poses for each garment. A weight vector is then learned from this calibration data for each type of garment.

4.2.2 Qualitative Evaluation

We show some of the recognition results in Figure 7 in the order of color image, depth image, reconstructed model, predicted model, ground truth model, and predicted grasping point (red) vs. ground truth grasping point (yellow) on the garment. From the figure, we can first see that our 3D reconstruction provides good-quality models for a fixed camera capturing a dynamic scene. Our shape retrieval scheme with learned distance metrics also provides reasonable matches for the grasping points. Note that our method outputs a mesh model of the target garment, which is critical for subsequent operations such as path planning and object manipulation.

4.2.3 Quantitative Evaluation

Implementation Details. In the 3D reconstruction, we set , voxels and the resolution of the voxels as voxels per meter to obtain a trade-off between resolution and robustness against sensor noise. In the feature extraction, our implementation adopts as an empirically good configuration. That is, each mesh model gives a dimensional binary feature. We set the penalty in Equation 3.

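Classification Accuracy. For each input garment, we compute the classification accuracy of pose recognition, i.e.,

\[ \text{Accuracy} = \frac{\#\,\text{correctly classified test inputs}}{\#\,\text{all test inputs}} \]
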
The classification accuracy for each garment type is reported in Table 1 (left). Since we have two models for each garment in the database (except shorts), we report the accuracy achieved using only Model 1 for retrieval, using only Model 2 for retrieval, and using all the available data. The total grasping points for long-sleeve shirts, pants, and shorts are , , and respectively. Our method benefits from the 3D reconstruction step, which reduces the sensor noise and integrates the information of each frame into a comprehensive model, and thus leads to better decisions. Among the three types of garments, recognition of shorts is not as accurate as the other two. One possible reason is that many of the shapes from different grasping points look very similar; even for human observers, it is hard to distinguish them.

Garment Model 1 Model 2 Both models
Long-Sleeve Shirts
Pants
Shorts N/A

Garment Running Time
Long-Sleeve Shirts
Pants
Shorts
Table 1: Left: Average classification accuracy for different garment types. Right: Average running time in seconds of the proposed method to process one garment on the proposed database, for inputs of different garment types.

Running Time. We also report the processing time of our method. The time is measured on a PC with an Intel i7 3.0 GHz CPU, and shown in Table 1 (right). Our method is orders of magnitude faster than the state-of-the-art depth-image based method, which takes minutes to process one input. This speed-up comes from the efficient 3D reconstruction, feature extraction, and matching.

4.2.4 Generality to Novel Garments

Though we used a relatively small garment database for our experiments, we noticed that our simulated models can also generalize to similar but unseen garments. For example, long-sleeve shirts and jackets can be considered similar to our long-sleeve shirt model, and knit pants and suit pants are similar to our jeans model. Although they are made of different materials, the way they deform is similar to our training models in some poses. Figure 8 shows some extra examples of recognizing poses of unseen garments using the same weights learned on our original dataset. We also noticed that some of those garments carry decorations such as pockets or shoulder boards; our method is robust enough to ignore these subtler features.

Figure 8: Sample results of applying our method on novel garments. Each group of results shows the color image, reconstructed model, predicted grasping points (red) vs. ground truth (yellow) marked on the model from left to right. (Best viewed in color)

5 Online Model Registration for Regrasping and Unfolding

As a part of the pipeline shown in Fig. 1, to unfold a garment, we can use the simulated models in the database to guide real object manipulation by registration. In this pipeline, the registration results can be used for detection of regrasping points. One such scenario is unfolding a garment by iterative registration between the reconstructed mesh model and the database mesh model, followed by regrasping. After several steps of regrasping, the robot holds the garment at two desired positions. Using a long-sleeve shirt as an example, we defined the optimal grasping positions on the two sleeves, respectively. The regrasping is built on the recognition pipeline described in the previous section. Once we have a recognized 3D object model from the database, we can perform a registration search that looks for an optimal registration between the model and the physical garment over the entire mesh model. Once registered, we can then predict the best regrasping point in 3-dimensional space and guide the other hand to approach and regrasp at this point. We do this using a fast, two-stage deformable object registration algorithm that integrates off-line simulated results with online localization and uses a novel non-rigid registration method to minimize energy differences between source and target models. Then, we use a constrained weighted metric for evaluating grasping points during regrasping, which can also be used as a convergence criterion.

Figure 9: Our application scenario: a Baxter robot picks up a garment to recognize its pose via reconstruction. By deformable registration between the simulated mesh (top right) and reconstructed mesh (top middle), we obtain the regrasping point via a pre-determined point on the simulated mesh. A long-sleeve shirt mesh with rendered weighted Gaussian distribution is shown on the bottom right. Red color indicates a higher score for evaluation of the grasping points, which are designated as the elbows of the sleeves. The final unfolding result by the Baxter robot is shown on the bottom left.
Figure 10: If the recognition is not successful or the pose is judged improper by the function, the robot will regrasp the object and repeat the step of pose estimation (the red rectangle). Note that by registration between the reconstructed mesh from the Kinect and the simulated mesh from pose estimation, the robot knows where to regrasp subsequently, as indicated by a red dot. This will be evaluated by the function. If , the robot moves to the unfolding phase (the green rectangle). If this is not the case, the robot regrasps and goes back to pose estimation.

5.1 Problem Formulation

Our objective is to put the garment into a certain configuration, which is defined as the relative grasping points on the garment Li et al. [2014b], such that the garment can be easily placed flat on a table for the folding process. This problem can be formulated as a mathematical optimization problem:


Here are the positions of the left and right grasping points on the garment (the configuration) and the function is an evaluation function for such a configuration. (Each garment mesh is defined in a 2-dimensional parameter space. When we choose a grasping point, we choose a particular set of parameters, which is then mapped by registration with the sensed garment to a grasping point in .) We seek a principled way to build a feedback loop for garment regrasping, which allows us to grasp at pre-determined points on the garment and place it flat.

Suppose the candidate garment is a long-sleeve shirt which we want to unfold and place flat. A desired solution is grasping points lying on the elbows of the sleeves. Our goal is to find a pair of grasping points , through a series of regrasping procedures that will converge to a value close to . We need a quantitative function defined on the pose of the garment (i.e., where the robot arm grasps the garment) in order to evaluate how good a grasping point is. While this can be computed on the continuous surface of the garment, we can also discretize the garment into a set of anchor points , which typically contains about points for a garment in our database. After such quantization, the garment pose recognition can be treated as a discrete classification problem, which the current robotics system is able to handle reliably. This also simplifies the definition of the objective function, which then becomes a D score table or a matrix, given our robot has two arms.

Details of the optimization procedure and inference can be found in Li et al. [2015a]. The objective function which needs to be maximized finally can be written as:


The related parameters in the objective such as and are set depending on the desired configuration. For example, for long-sleeve shirts, we set and on the elbow of the two sleeves. The Gaussian formulation ensures a smooth decrease from the expected grasping points, as visualized in Figure 11 as an example.
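To illustrate how a Gaussian objective over discretized anchor points turns into a score table for two grippers, here is a minimal sketch. It is not the exact objective of Li et al. [2015a]: the product of two isotropic Gaussians, the Euclidean distance, the anchor coordinates, and the names `gaussian_score`, `build_score_table`, `mu_left`, `mu_right`, and `sigma` are all assumptions made for illustration.

```python
import math

def gaussian_score(point, center, sigma):
    """Unnormalized isotropic Gaussian of the distance between two 3D points."""
    d2 = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def build_score_table(anchors, mu_left, mu_right, sigma):
    """Score every (left, right) pair of anchor points.

    table[i][j] scores holding anchor i with the left gripper and anchor j
    with the right gripper. The product form is an assumption.
    """
    return [[gaussian_score(a, mu_left, sigma) * gaussian_score(b, mu_right, sigma)
             for b in anchors]
            for a in anchors]

# Hypothetical anchor points (in meters) on a garment.
anchors = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (-0.3, 0.0, 0.0), (0.0, 0.4, 0.0)]
mu_left, mu_right = anchors[2], anchors[1]  # desired grasping points, e.g. the two elbows
table = build_score_table(anchors, mu_left, mu_right, sigma=0.1)

# The best configuration is the entry of the table with the highest score.
pairs = [(i, j) for i in range(len(anchors)) for j in range(len(anchors))]
best = max(pairs, key=lambda ij: table[ij[0]][ij[1]])
print(best, table[best[0]][best[1]])
```

With a table like this, evaluating a grasp configuration is a single lookup, and the score decays smoothly as either grasping point moves away from its desired anchor.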

(a) (b)
Figure 11: Visualization of the defined objective used in this section. (a) A long-sleeve shirt model rendered with weighted Gaussian distribution. (b) A pants model rendered with weighted Gaussian distribution. When a point is given over the garment surface, we can then evaluate the score by the objective function .

5.2 Deformable Registration

After obtaining the location of the current grasp point, we seek to register the reconstructed 3D model to the ground truth garment mesh to establish point correspondences. The input to the registration is a canonical reference (“source”) triangle mesh that has been computed in advance and stored in the garment database, and a target triangle mesh representing the geometry of the garment grasped by the robot, as acquired by 3D scans of the grasped garment.

The registration proceeds in three steps. First, we scale the source mesh to match its size to the target mesh . Next, we apply an iterative closest point (ICP) technique to rigidly transform the source mesh (i.e., via only translation and rotation). Finally, we apply a non-rigid registration technique to locally deform the source mesh toward the target .


First, we compute a representative size for each of the source and target meshes. For a given mesh, let and be the area and barycenter of the th triangle. Then the area-weighted center of the mesh is


where is the number of vertices of the source mesh . Given the area-weighted center, the representative size of the mesh is given by


Let the representative sizes of the source and target meshes be and , respectively. Then, we scale the source mesh by a factor of .
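The scaling step can be sketched as follows. The area-weighted center follows the description above; the representative size is assumed here to be the area-weighted root-mean-square distance of the triangle barycenters from that center, which may differ from the definition used in the paper. Meshes are represented as plain lists of triangles, and all function names are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edges."""
    u, v = _sub(b, a), _sub(c, a)
    cx = u[1] * v[2] - u[2] * v[1]
    cy = u[2] * v[0] - u[0] * v[2]
    cz = u[0] * v[1] - u[1] * v[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def barycenter(a, b, c):
    return tuple((x + y + z) / 3.0 for x, y, z in zip(a, b, c))

def area_weighted_center(triangles):
    areas = [triangle_area(*t) for t in triangles]
    centers = [barycenter(*t) for t in triangles]
    total = sum(areas)
    return tuple(sum(A * b[k] for A, b in zip(areas, centers)) / total
                 for k in range(3))

def representative_size(triangles):
    """Assumed definition: area-weighted RMS distance of barycenters from the center."""
    c = area_weighted_center(triangles)
    areas = [triangle_area(*t) for t in triangles]
    centers = [barycenter(*t) for t in triangles]
    num = sum(A * sum((b[k] - c[k]) ** 2 for k in range(3))
              for A, b in zip(areas, centers))
    return math.sqrt(num / sum(areas))

# Source: a unit square made of two triangles. Target: same shape, 2.5x larger, shifted.
source = [((0, 0, 0), (1, 0, 0), (1, 1, 0)), ((0, 0, 0), (1, 1, 0), (0, 1, 0))]
target = [tuple(tuple(2.5 * x + 1.0 for x in v) for v in tri) for tri in source]

scale = representative_size(target) / representative_size(source)
print(scale)
```

Because the size is measured relative to the mesh's own center, the resulting scale factor does not depend on where either mesh sits in space.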

Computing the rigid transformation

We use a variant of ICP Besl and McKay [1992] to compute the rigid transformation. ICP iteratively updates a rigid transformation by (a) finding the closest point on the target mesh for each of the vertices of the source mesh , (b) computing the optimal rigid motion (rotation and translation) that minimizes the distance between and , and then (c) updating the vertices via this rigid motion.

Figure 12: Visualization of distance function given a mesh. The color bar on the right shows the normalization distance.

To accelerate the closest point query, we prepare a grid data structure during preprocessing. For each grid point, we compute the closest point on the target mesh using fast sweeping Tsai [2002], and store for runtime use both the found point and its distance to the grid point, as shown in Figure 12. At runtime, we approximate the closest point query for vertex by searching only among the eight precomputed closest points corresponding to the eight grid points surrounding , thereby reducing the complexity of the closest point query to per vertex.
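The grid lookup can be sketched as follows. Two simplifications are assumed: the target mesh is approximated by a set of sample points, and the preprocessing uses brute force instead of fast sweeping, so only the runtime query reflects the scheme described above. The names `build_grid` and `approx_closest_point` are hypothetical.

```python
import math
from itertools import product

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_grid(target_points, origin, spacing, n):
    """For each node of an n x n x n grid, cache its closest target point and distance.

    Brute force is used for clarity; the text uses fast sweeping for this step.
    """
    grid = {}
    for idx in product(range(n), repeat=3):
        node = tuple(origin[k] + spacing * idx[k] for k in range(3))
        closest = min(target_points, key=lambda p: dist(node, p))
        grid[idx] = (closest, dist(node, closest))
    return grid

def approx_closest_point(v, grid, origin, spacing, n):
    """Search only the closest points cached at the eight nodes surrounding v."""
    base = tuple(
        min(max(int(math.floor((v[k] - origin[k]) / spacing)), 0), n - 2)
        for k in range(3)
    )
    candidates = {
        grid[tuple(base[k] + off[k] for k in range(3))][0]
        for off in product((0, 1), repeat=3)
    }
    return min(candidates, key=lambda p: dist(v, p))

# Hypothetical target samples and a 4 x 4 x 4 grid with unit spacing.
target_points = [(0.2, 0.2, 0.2), (2.7, 2.6, 2.8)]
origin, spacing, n = (0.0, 0.0, 0.0), 1.0, 4
grid = build_grid(target_points, origin, spacing, n)

print(approx_closest_point((0.4, 0.3, 0.1), grid, origin, spacing, n))
print(approx_closest_point((2.5, 2.5, 2.5), grid, origin, spacing, n))
```

Each query inspects at most eight candidates regardless of the size of the target, which is what makes the per-vertex cost constant. The result is an approximation whose quality depends on the grid spacing.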

After establishing point correspondences, we compute the optimal rotation and translation for registering with  Besl and McKay [1992]. We iteratively compute point correspondences and rigid motions until successive iterations converge to a fixed rigid motion, yielding a rigidly registered source mesh .
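The correspondence and rigid-motion loop can be illustrated with a small point-to-point ICP. To stay dependency-free, this sketch works in 2D, where the optimal rotation has a closed form, and it uses brute-force closest points. The paper's setting is 3D meshes with the grid-accelerated query, so treat this as an illustration of the iteration only. All names are hypothetical.

```python
import math

def closest(p, points):
    """Brute-force nearest neighbour of p among points."""
    return min(points, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def best_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst (paired points)."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    tx = sum(q[0] for q in dst) / n
    ty = sum(q[1] for q in dst) / n
    num = den = 0.0
    for (x, y), (u, v) in zip(src, dst):
        x, y, u, v = x - sx, y - sy, u - tx, v - ty
        num += x * v - y * u
        den += x * u + y * v
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that moves the rotated source centroid onto the target centroid.
    return theta, (tx - (c * sx - s * sy), ty - (s * sx + c * sy))

def apply_rigid(points, theta, t):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]

def icp_2d(source, target, iterations=20, tol=1e-12):
    current = list(source)
    for _ in range(iterations):
        matches = [closest(p, target) for p in current]   # (a) correspondences
        theta, t = best_rigid_2d(current, matches)        # (b) optimal rigid motion
        current = apply_rigid(current, theta, t)          # (c) update the vertices
        if abs(theta) < tol and math.hypot(*t) < tol:
            break
    return current

target = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 5.0)]
source = apply_rigid(target, -0.05, (0.1, -0.1))  # slightly misaligned copy
aligned = icp_2d(source, target)
```

ICP only converges to the correct alignment when the initial misalignment is small relative to the feature spacing, which is one reason the scaling step precedes it.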

Non-rigid registration

Given a candidate source mesh obtained via rigid registration, our non-rigid registration seeks the vertex positions of the source mesh that minimize


where penalizes discrepancies between the source and target meshes, and seeks to limit and regularize the deformation of the source mesh away from its rigidly registered counterpart . The term


penalizes deviation of the source and target meshes. Here is the barycenter of the triangle , and is the distance from to the closest point on the target mesh. As in the rigid case, we use the precomputed distance field to query for the distance.

It might appear that the fitting energy could be trivially minimized by moving each vertex of mesh to lie on mesh . In practice, however, this does not work because all of the geometry of the precomputed reference mesh is discarded; instead, the geometry of this mesh, which was precomputed using fabric simulation, should serve as a prior. Thus, we introduce a second term to retain as much as possible the geometry of the reference mesh :

The deformation term , derived from a physically based energy (e.g., see Grinspun et al. [2003]), is a sum of three terms


where , and are user-specified coefficients. The term


penalizes changes to the area of each mesh triangle. Here is the area of the triangle , and refers to a corresponding quantity from the undeformed mesh . The term


penalizes shearing of each mesh triangle, where is the th angle of the triangle . The term  Grinspun et al. [2003]


penalizes bending, measured by the angle formed by adjacent triangles. Here is the hinge angle of edge , i.e., the angle formed by the normals of the two triangles incident to ; is the length of the edge , and is a third of the sum of the heights of the two triangles incident to the edge .

We used the secant version of the Levenberg-Marquardt (L-M) method Madsen et al. [2004] to seek the source mesh that minimizes the energy in Eq. (9). Sample registration results are shown in Figure 13.

5.3 Grasping Point Localization

We use a pre-determined anchor point (e.g., elbow on the sleeve of a long-sleeve shirt) to indicate a possible regrasping point. The detection of the regrasping point can be summarized in two steps: global localization and local refinement. Global localization is achieved by deformable registration. The registered simulation mesh will provide a 3D regrasping point from the recognized state which will be then mapped onto the reconstructed mesh. Details of local refinement can be found in Li et al. [2015a].

In order to improve the regrasping success rate, we propose a step of local refinement. The point on the actual garment may be hard to grasp for several reasons. One is that during the garment manipulation steps, such as rotation, the curvature over the garment may change. Another reason is that when considering the width of robot hand gripper, a ridge curve with proper orientation and width should be selected for regrasping. We consider the proper orientation as a direction perpendicular to the opening of the gripper. Therefore, we propose an efficient 1D blob curvature detection algorithm that can find a refined position in the local area over the garment surface via an IR range sensor.

In our experiment, the Baxter robot is equipped with an IR range sensor close to the gripper, as shown in Figure 15 top. Once the gripper moves to the same height as the predicted 3D regrasping point from registration, it performs a horizontal scan, moving from one side to the other, so that the IR sensor scans over the full local curvature and the local grasping point can be refined.

We then apply a curvature detection algorithm that convolves the IR depth signal with a fixed width kernel, where the width is determined by the opening of the gripper. Here we use a Laplacian-Gaussian Kernel :


where is the depth signal, and is the width parameter.
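A minimal sketch of the 1D curvature filter follows. It assumes the kernel is the second derivative of a Gaussian sampled at integer offsets (shifted to zero mean so that a constant offset in the signal gives no response), that the fold appears as a local bump in the scanned signal, and that borders are handled by replicating edge samples. If the sensor reports the fold as a dip instead, the sign of the response flips. The names and the synthetic signal are hypothetical.

```python
import math

def log_kernel(sigma, radius):
    """Second derivative of a 1D Gaussian sampled on [-radius, radius], made zero-mean."""
    k = [((x * x) / sigma ** 4 - 1.0 / sigma ** 2) * math.exp(-x * x / (2 * sigma ** 2))
         for x in range(-radius, radius + 1)]
    mean = sum(k) / len(k)
    return [v - mean for v in k]

def convolve_same(signal, kernel):
    """Same-size convolution; samples beyond the borders repeat the edge value."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + r - j, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

# Synthetic scan: flat background with a fold-like bump centered at sample 30.
depth = [0.5 + 0.05 * math.exp(-((i - 30) ** 2) / (2 * 4.0 ** 2)) for i in range(61)]
response = convolve_same(depth, log_kernel(sigma=3.0, radius=9))

# The strongest negative response marks the bump whose width matches the kernel.
grasp_index = min(range(len(response)), key=response.__getitem__)
print(grasp_index)
```

Choosing the kernel width from the gripper opening makes the filter respond most strongly to folds that the gripper can actually enclose.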

5.4 Convergence

After the regrasping is finished, we evaluate the current grasping configuration by the objective function . If is greater than a given value , the grasping points are at the desired positions, and the robot stops regrasping and enters the placing-flat mode. The two arms open to slightly stretch the garment and place it on a table. The overall algorithm is summarized in Algorithm 2.

Input:  Simulation meshes = ;
Trained classifier ;
Objective function ;
Output:  Two desired grasping points and ;
, ;
while do
  Pick up at a grasping point ;
  3D Reconstruction;
  Reg(, ) //Registration;
  return and ;
Place the garment flat on a table;
Algorithm 2 Iterative Procedure for Regrasping
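The control flow of the iterative procedure can be sketched as a small loop. This is an illustration under assumptions, not a transcription of Algorithm 2: every callable (`pick_up`, `reconstruct`, `recognize`, `register`, `score`) is a hypothetical stand-in for the corresponding robot or vision module, and the iteration cap is added here only to guarantee termination.

```python
def regrasp_until_converged(pick_up, reconstruct, recognize, register,
                            score, threshold, first_grasp, max_iters=10):
    """Repeat recognition, registration, and regrasping until the score exceeds threshold.

    Returns the final (held, regrasped) pair of grasping points, or None if
    max_iters is exhausted.
    """
    held = first_grasp
    for _ in range(max_iters):
        pick_up(held)
        mesh = reconstruct()            # 3D reconstruction of the hanging garment
        model = recognize(mesh)         # closest simulated model in the database
        target = register(model, mesh)  # regrasping point mapped onto the sensed mesh
        if score(held, target) > threshold:
            return held, target
        held = target                   # keep the new grasp, release the old one
    return None

# Toy stand-ins: the garment is reduced to named anchor points.
log = []
desired = {"left_elbow", "right_elbow"}
state = {"held": None}

def pick_up(point):
    state["held"] = point
    log.append(point)

result = regrasp_until_converged(
    pick_up=pick_up,
    reconstruct=lambda: "mesh",
    recognize=lambda mesh: "shirt_model",
    register=lambda model, mesh: "left_elbow" if state["held"] != "left_elbow" else "right_elbow",
    score=lambda a, b: 1.0 if {a, b} == desired else 0.0,
    threshold=0.5,
    first_grasp="hem",
)
print(result, log)
```

In this toy run the first regrasp reaches one desired anchor and the second reaches the other, after which the score passes the threshold and the loop exits.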









5.5 Experimental Results

To evaluate our results, we tested our method on several different garments such as long-sleeve shirts and pants for multiple trials.

Figure 13: Registration examples. First Row: A long-sleeve shirt grasped at elbow. Second Row: A long-sleeve shirt grasped at sleeve end. Third Row: A pair of pants grasped near knee. Fourth Row: A pair of pants grasped near ankle. Each row depicts from left to right: a reconstructed mesh, the predicted mesh from the database, rigid registration only, and rigid plus non-rigid registration.
Figure 14: Examples of each step in our unfolding procedure. Each row shows, from left to right: a snapshot of initial pick up, a 3D reconstructed mesh, a predicted mesh from the database, the predicted mesh with weighted Gaussian distribution distance, the predicted regrasping point on the 3D reconstructed mesh, a snapshot of regrasping, and finally a snapshot of unfolding. Top Row: The Baxter robot unfolds a long-sleeve shirt following pick up. Bottom Row: The Baxter robot unfolds a pair of pants following pick up.

Below, we briefly recap the pose recognition method; details can be found in the previous section. We first pick up the garment at a random point. In the online recognition stage, we use a Kinect sensor to capture depth images of different views of the garment while it is being rotated by a robotic arm. The garment is rotated clockwise and then counter-clockwise to obtain about depth images for an accurate reconstruction. We reconstruct a 3D mesh model via depth image segmentation and volumetric fusion. Then, with an efficient 3D feature extraction algorithm, we build a binary feature vector and match it against the offline database for pose recognition. One of the outputs is a high-quality reconstructed mesh, which is used for 3D registration and accurate regrasping point prediction, as described below.

5.6 Registration

We apply both rigid and non-rigid registration. The rigid registration step mainly focuses on mesh rescaling and alignment, whereas the non-rigid registration step refines the results and improves the mapping accuracy. In Figure 13, we compare rigid registration only with rigid plus non-rigid registration side by side. We can clearly see that with non-rigid registration, the two meshes are registered more accurately. In addition, the locations of the designated grasping points on the sleeves are also closer to the ground truth points. Note that for the fourth row, the state after alignment by the rigid registration algorithm is already at a local minimum, so the subsequent non-rigid registration brings no improvement. As the visualization shows, such a case is still good enough for finding point correspondences.

Source Mesh S to T (R) T to S (R) S to T (R+N) T to S (R+N)
Long-Sleeve T-Shirt 1
Long-Sleeve T-Shirt 2
Long Pants 1
Long Pants 2
Table 2: Registration results. We compare the source mesh (S) registered to the target mesh (T), and vice versa, for both rigid-only registration (R) and rigid plus non-rigid registration (R+N). When the source mesh is registered to the target mesh, the average error distance is smaller than when the target mesh is registered to the source mesh. This is because the source mesh has a lower resolution, so its deformation can be easily computed to reach a minimum. The additional non-rigid registration also reduces the average error distance.

We also evaluate the registration algorithm on the entire database. The algorithm contains two stages, rigid registration using the ICP algorithm and non-rigid registration. To show the performance of our registration algorithm, the registration pairs are established with the knowledge that the recognition of the pose is % correct. This ensures that registration happens between meshes with the closest grasping locations. We design the registration experiments in two directions, the source mesh to the target mesh, and vice versa. We also compare the results of rigid registration and of rigid plus non-rigid registration for all the pairs. Detailed results are shown in Table 2. For example, for S to T (R), we first subdivide the source mesh into a set of disjoint triangulated patches and generate a single sample point in each patch. Each sample point is assigned the area of the patch it belongs to. Then, from each sample point, we find the closest point on the target mesh, multiply the distance of each point pair by the corresponding patch area, and sum the results. Finally, the summed value is divided by the total area of the source mesh.
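The error measure described above is an area-weighted mean of closest-point distances. A minimal sketch follows, assuming the target mesh is approximated by a set of sample points (the paper queries the mesh surface itself) and that the patch sample points and areas are already available. The function name and the toy data are hypothetical.

```python
import math

def mean_registration_error(samples, target_points):
    """Area-weighted average distance from source samples to the target.

    samples: list of (point, patch_area) pairs taken from the source mesh.
    target_points: points standing in for the target mesh surface (an assumption).
    """
    total_area = sum(area for _, area in samples)
    weighted = 0.0
    for point, area in samples:
        d = min(math.dist(point, q) for q in target_points)
        weighted += area * d
    return weighted / total_area

# Two source patches with areas 1 and 3, and two target samples.
samples = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 3.0)]
target_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.5)]

error = mean_registration_error(samples, target_points)
print(error)
```

Weighting by patch area keeps the measure independent of how finely the source mesh is tessellated. Swapping the roles of the two meshes gives the T to S direction.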

5.7 Search for Best Grasping Point by Local Curvature

Once we choose a potential grasping point, we can perform a search to find the best local grasping point for the gripper. We are trying to find a fold in the vicinity of the potential grasping point with a high local curvature tuned to the gripper width that allows for a stable grasp. The opening size of the gripper is approximately and empirically we set in Equation 15. Figure 15 top shows a picture of the IR range sensor on the gripper. Plots of its signal and of the convolved signal are shown in Figure 15 bottom left and right, respectively. We can clearly see that the response from the filter is at a minimum where the grasping should take place. The tactile sensors then confirm that the gripper has properly closed on the fabric.

Figure 15: IR range sensor scan example. Top: An image of the Baxter hand. The IR range sensor is shown in the yellow rectangle and the two tactile sensors in blue rectangles. Bottom Left: Plot of a single reading from the IR range sensor. Bottom Right: Result of convolving the sensor reading with Laplacian-Gaussian kernels of different sizes. The lowest point (in red) is the place the gripper should grasp.

5.8 Iterative Regrasping

Figure 14 shows two examples (long-sleeve shirt and pants) of iterative regrasping using the Baxter robot. The robot first picks up a garment at a random grasping point. Once the arm reaches a pre-defined position, the last joint of the arm starts to rotate and the Kinect will capture the depth images as it rotates, and reconstruct the 3D mesh in real-time. After the rotation, a predicted pose is recognized Li et al. [2014b] as shown in the third image of each row. For each pose, we have a constrained weighted evaluation metric over the surface to identify the regrasping point as indicated in the fourth image. By registration of the reconstructed mesh and predicted mesh from the database, we can map the desired regrasping point onto the reconstructed mesh. The robot then regrasps by moving the other gripper towards it. With our 1D blob curvature detection method, the gripper can move to the best curvature on the garment and regrasp, which increases the success rate. The iterative regrasping stops when the two grasped points are the designated anchor points on the garment (e.g., elbows on the sleeves of a long-sleeve shirt).

Garment # of Trial Successful Recognition Successful Regrasping Successful Unfolding Avg. # of Regrasps Success Only
Sweatshirt
Sweater
Knitwear
Jeans
Pants
Leggings
Shorts
Average
Figure 16: Left: A picture of our test garments. Right: Results for each unfolding test on the garments. We evaluate the results by recognition, regrasping, unfolding, and regrasping attempts for each test. The last row shows the average of each evaluation component.

Figure 16 left shows sample garments in our test, and the table on the right shows the results. For each garment, we perform unfolding tests. We have on average an successful recognition rate for the pose of the objects over all the garments. We have on average an successful regrasping rate for each garment, where regrasping is defined as a successful grasp of the other arm on the garment. of the time we are able to successfully unfold the garment, placing the grippers at the designated grasping points. Unsuccessful unfolding occurred when either the gripper lost contact with the garment, or the gripper was unable to find a regrasping point. Although we did not perform this experiment, it is possible to restart the method after one of the grippers loses contact as an error recovery procedure.

For the successful unfolding cases, we also report the average number of regrasping attempts. The minimum number of regrasping attempts . This happens when the initial grasping is at one of the desired positions, and the regrasping succeeds at the other desired position (i.e., two elbows on the sleeves for a long-sleeve shirt). In most cases, we are able to successfully unfold the garments using regraspings.

Among all these garments, jeans, pants, and leggings achieve a high success rate because of their unique layout when grasped at the leg position. The shorts are difficult for both the recognition and unfolding steps, possibly because of their ambiguous appearance at different grasping points. One observation is that in a few cases where the recognition was not accurate, our registration algorithm was still able to find a desired regrasping point for unfolding. This is an artifact of the geometry of pants-like garments, where the designated regrasping points are at the extreme locations on the garments.

6 Trajectory Optimization for Folding

Robotic folding of a garment is a difficult task because it requires sequential manipulations of a highly unconstrained, deformable object. Given the garment shape, the robot can fold it by following a folding plan Miller et al. [2011], Miller et al. [2012]. However, the outcome of the same folding action can vary with material properties such as cloth hardness and with the environment, such as friction between the garment and the table. Given the starting and ending folding positions, different folding trajectories will lead to different results. In this section, we propose a novel method that learns optimal folding trajectory parameters from predicted thin shell simulations of similar garments, which can then be applied to a real garment folding task (see Figure 17). We first present an online optimization algorithm that learns optimal trajectories for manipulation from mathematical model evolution combined with predictive thin shell simulation. We also introduce a novel approach that adjusts the simulation environment to the robot working environment so that both produce similar manipulation results. Then, with the learned simulation results, we introduce a fast and robust algorithm that automatically detects garment key points such as sleeve ends, collar, and waist corner. These key points can be used for folding plan generation. The trajectories themselves are general in that they can be scaled to accommodate similar garments of different sizes.

Figure 17: Top: Comparison of our simulation of robotic manipulation. Bottom: Real robot implementation. The green curves show the virtual and the real trajectories for folding.
Figure 18: Failure examples with improper folding trajectories. First Row: The folding trajectory is low and flat, which causes the towel and long-sleeve T-shirt to drift. Second and Third Rows: The folding trajectory is too high when the gripper approaches the target folding position, which piles up the towel. Fourth Row: Dual-arm folding. If the two arms are too close to each other, the folding may fail.

Figure 19 shows the key steps of the garment folding. The garment folding is the final step of the entire pipeline of garment manipulation which contains visual recognition, unfolding, ironing, and folding in Figure 1. This section specifically addresses the robotic folding task (purple rectangle in Figure 19) with the goal of finding optimal trajectories to successfully fold garments.

Figure 19: Details of the folding procedure. We apply offline simulation with iterative trajectory optimization to find the best trajectory for a specific folding action by comparing the result (light blue contour) with template (black contour). Similar steps are repeated until the garment is folded in the simulator. Then all the folding trajectories are exported, adapted, and implemented on a real robot. Green arcs illustrate the actual trajectories of robotic arms.

Figure 18 shows a few failure examples with improper trajectories. We use green tape on the table to show the original position of the garments. The first two rows show that if the moving trajectory is too low and close to the garment, the folded part will fall down, pull the rest, and cause drift of the whole garment. These cases usually happen when the folding step is lengthy without trajectory optimization. The third row shows a case where the folding trajectory is too high, which will cause extra wrinkles or even piling up. The last row shows two cases using two arms to fold. If the arms are close to each other, the part in between loses tension, and will fall down and pull the rest away. The focus of this work is to create trajectories for folding that will overcome these problems.

6.1 Simulation Environment

6.1.1 Folding Pipeline in Simulation

In the model simulation, we use a physics engine Maya [] to simulate the movement and deformation of the garment mesh models. We assume there is only one garment for each folding task, which has been placed flat on a table. A virtual table is added to the scene which the garment lies on, as shown in Figure 17, top.

During each folding step, the robot arm picks up a small part of the mesh, moves it to the target position following a computed trajectory, and places it on the table to simulate an entire folding scenario. If the part of the garment to be folded is relatively wide, then both left and right arms may be involved. The trajectory is generated using a Bézier curve, which will be discussed in section 6.2 below.
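A Bézier trajectory between a pick-up point and a target point can be evaluated with de Casteljau's algorithm. The sketch below assumes a cubic curve with two intermediate control points lifted above the table. The curve order, the control point coordinates, and the function names are illustrative assumptions, not the parameterization from Section 6.2.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Repeated linear interpolation between consecutive control points.
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

def sample_trajectory(control_points, n):
    """Return n evenly spaced (in t) waypoints along the curve, endpoints included."""
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]

# Hypothetical fold: lift at x = 0, place at x = 0.4 m, peak height shaped by
# the two intermediate control points (x, y, z in meters).
control_points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.2),
    (0.3, 0.0, 0.2),
    (0.4, 0.0, 0.0),
]
waypoints = sample_trajectory(control_points, 11)
print(waypoints[0], waypoints[5], waypoints[-1])
```

Moving the intermediate control points up or down changes how high the gripper carries the fabric, which is the kind of parameter a trajectory optimization can tune while the start and end positions stay fixed.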

We can use the mesh model from the database to simulate the folding process. However, for faster computation, these mesh models have relatively low resolution, which makes them not very accurate when used to simulate folding via bending the mesh. For more accurate simulation, we propose a method to build a mesh model from our real garments. Specifically, a garment mesh is created by first extracting the contour of the real garment Li et al. [2015b]. Then, by inserting points on the inside of the garment contour, we triangulate a mesh by connecting these points. Lastly, we mirror the mesh to construct a two-sided garment mesh (see Figure 20).

Figure 20: Garment models. Top left: the input contour. Top right: we insert vertices into the internal region. Bottom left: we build a flat triangle mesh using the contour and the inserted vertices. Bottom right: we shift the contour vertices and mirror the mesh to create the garment mesh.

6.1.2 Parameter Adaptation

There are two key parameters needed to accurately simulate the real world folding environment. The first is the material properties of the fabric, and the second is the frictional forces between the garment and the table.

Material properties

Through many experiments, we found that the most important property for the garments in the simulation environment is shear resistance. It specifies how strongly the simulated mesh model resists shear under strain: when the garment is picked up and hung under gravity, it elongates until gravity and shear resistance balance. An appropriate shear resistance value allows the simulated mesh to reproduce the same elongation as the real garment. This measurement bridges the gap between the simulation and the real world for the garment mesh model.

For each new garment, we follow the steps described below to measure the shear resistance. Figure 21 shows an example.

  • Manually pick one extremum part of the garment, such as the sleeve end of a T-shirt, the waist of a pair of pants, or a corner of a towel.

  • Hang the garment under gravity and measure the length between the picking point and the lowest point as $L_1$.

  • Slowly put the garment down on a table, keeping the picking point and the lowest point from the previous step at maximum spread. Measure the distance between these two points again as $L_2$. The shear resistance fraction is defined as

    $$f_{\mathrm{shear}} = \frac{L_1 - L_2}{L_2}$$

  • We then put the virtual garment into the same configuration in Maya and adjust the Maya shear parameter until the shear fraction calculated in the simulator matches the real-world value.
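The measurement and matching steps above can be sketched as follows. This is an illustration under stated assumptions: the shear fraction is taken to be the relative elongation (hung length minus flat length, divided by flat length), the search is a bisection that assumes elongation decreases as the shear parameter grows, and `toy_simulator` is a made-up stand-in, not Maya's behaviour or API.

```python
def shear_fraction(hung_length, flat_length):
    """Relative elongation of the garment when hung, versus lying flat."""
    return (hung_length - flat_length) / flat_length


def calibrate_shear(simulate_hung_length, flat_length, target_fraction,
                    lo=0.0, hi=1000.0, tol=1e-4, max_iter=60):
    """Bisect on the simulator's shear parameter until the simulated shear
    fraction matches the measured one. Assumes a larger parameter gives
    less elongation (the fraction decreases monotonically)."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        frac = shear_fraction(simulate_hung_length(mid), flat_length)
        if abs(frac - target_fraction) < tol:
            return mid
        if frac > target_fraction:
            lo = mid   # too stretchy: increase resistance
        else:
            hi = mid   # too stiff: decrease resistance
    return 0.5 * (lo + hi)


# Hypothetical simulator: elongation inversely proportional to
# (1 + shear parameter). Purely illustrative.
def toy_simulator(shear_param, flat_length=0.80):
    return flat_length * (1.0 + 1.0 / (1.0 + shear_param))


measured = shear_fraction(hung_length=0.84, flat_length=0.80)  # about 5%
param = calibrate_shear(toy_simulator, 0.80, measured)
```

In practice each call to the simulator callback would correspond to one hanging simulation, so a bracketing search of this kind keeps the number of simulation runs small.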

Figure 21: Method for measuring the shear resistance. Left: Diagonal length measurement. Middle: Zoomed in regions. Right: The garment is hanging under gravity.

Frictional forces

The surface of the table can be rough if covered by a cloth sheet or slippery if not covered, which leads to variance in friction between the table and the garment. A shift of the garment during folding can impair the whole process and require additional repositioning. Matching the friction level in the simulation environment to the real world is therefore necessary for trajectory optimization.

To measure the friction between the table and the garment, we follow these steps.

  • Place a real garment on the real table of length $L_t$.

  • Slowly lift one side of the real table until the garment begins to slide. The lifted height is $H$. The friction angle is computed as

    $$\theta = \arcsin\left(\frac{H}{L_t}\right)$$

  • In the virtual environment, the garment is placed flat on a table under gravity. Assign a relatively high friction value to the virtual table. Lift one side of the virtual table to the angle $\theta$.

  • Gradually decrease the frictional force in the virtual environment until the garment begins to slide. Use this frictional force in the virtual environment, as it mirrors the real world.
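The friction calibration above can be sketched in a few lines. The friction angle is computed from the lifted height and the table length by the arcsine relation implied by the geometry; the `slides_at` callback, the step size, and the Coulomb-friction stand-in `toy_slides` are assumptions for illustration and do not describe Maya's friction parameter.

```python
import math


def friction_angle(lift_height, table_length):
    """Tilt angle (radians) at which the garment starts to slide."""
    return math.asin(lift_height / table_length)


def calibrate_friction(slides_at, angle, start=1.0, step=0.01):
    """Lower the simulated friction value from `start` until the virtual
    garment slides at the measured tilt angle; return that value."""
    friction = start
    while friction > 0.0 and not slides_at(friction, angle):
        friction -= step
    return max(friction, 0.0)


# Hypothetical simulator: ideal Coulomb friction, where an object slides
# once tan(angle) exceeds the friction coefficient. A cloth simulator's
# friction parameter need not be a Coulomb coefficient.
def toy_slides(friction, angle):
    return math.tan(angle) > friction


theta = friction_angle(lift_height=0.5, table_length=1.0)  # 30 degrees
mu = calibrate_friction(toy_slides, theta)
```

Starting from a high friction value and decreasing it mirrors the order of the steps in the text, so the returned value is the largest tested friction at which the virtual garment slides.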

With these two parameters set up, we obtain similar manipulation results for both the simulation and the real garment.

6.2 Trajectory Optimization

The goal of the folding task is specified by the initial and folded shapes of the garment, and by the starting and target positions of the grasp point (as in Figure 22). Given the simulation parameters, we seek the trajectory that effects the desired set of folds. We first describe how to optimize the trajectory for a single end effector and then discuss the case of two end effectors.

6.2.1 Trajectory parametrization

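We use a Bézier curve Farin [1988] to describe the trajectory. An $n$-th order Bézier curve has $n+1$ control points $P_0, \dots, P_n$ and is defined by

$$T(t) = \sum_{i=0}^{n} B_i^n(t)\, P_i, \qquad t \in [0, 1],$$

where $B_i^n(t) = \binom{n}{i}\, t^i (1-t)^{n-i}$ are the Bernstein basis functions.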
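As a concrete illustration, a Bézier curve can be evaluated directly from its Bernstein form. The sketch below assumes control points given as coordinate tuples; the function names and the example cubic are hypothetical and not taken from the paper.

```python
from math import comb


def bernstein(i, n, t):
    """Bernstein basis polynomial of degree n, index i, at parameter t."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)


def bezier_point(control_points, t):
    """Point on the Bezier curve at parameter t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    return tuple(
        sum(bernstein(i, n, t) * p[d] for i, p in enumerate(control_points))
        for d in range(dim)
    )


# Hypothetical cubic trajectory (metres): lift, carry over, set down.
P = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.2, 0.0, 0.2), (0.3, 0.0, 0.0)]
midpoint = bezier_point(P, 0.5)
```

The curve interpolates the first and last control points, which is why fixing them to the start and target grasp positions constrains only the endpoints of the trajectory.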

Figure 22: An example of the folding task: we want to fold a sleeve into the blue target position, by using a robotic gripper to move the tip of the sleeve (grasp point) from the starting position ($P_0$) to the target position ($P_3$), following a trajectory shown as the red curve. $P_1$ and $P_2$ are knot points that form the Bézier trapezoid.

We use $n = 3$ for simplicity, but our method can be easily extended to deal with higher order curves. $P_0$ and $P_3$ are fixed to the specified starting and target positions of the grasp point (as in Figure 22). The intermediate control points $P_1$ and $P_2$ can then be adjusted to define a new trajectory using the objective function defined below.

$$\{P_1, P_2\} = \arg\min_{P_1, P_2} E(T), \qquad E(T) = l(T) + \lambda\, D\big(\bar{S}, S(T)\big) \tag{19}$$

Here $E$ is a cost function with two terms. The first term penalizes the trajectory length $l(T)$, thus preferring a folding path that is efficient in time and energy. The second term seeks the desired fold by penalizing the dissimilarity $D$ between the desired folded shape $\bar{S}$ and the shape $S(T)$ obtained by the candidate folding trajectory $T$, as predicted by a cloth simulation; we used the physical simulation engine in Maya [30] for the cloth simulation. The weight $\lambda$ balances the two terms; we used a fixed value of $\lambda$ in our experiments.

Intuitively, the dissimilarity measures the difference between the desired folded shape and the folded garment in simulation. We define the dissimilarity term as

$$D(\bar{S}, S) = \frac{1}{A} \int_{\bar{S}} \lVert \bar{x} - x \rVert \, dA \tag{20}$$

where $A$ is the total surface area of the garment mesh including both sides of the garment, $\bar{x}$ is a point on the target folded shape $\bar{S}$, $x$ is the corresponding point on the simulated folded shape $S$, and $dA$ is the area measure; see Figure 23, left. Our implementation assumes $\bar{S}$ and $S$ are given as triangle meshes, and discretizes (20) as

$$D(\bar{S}, S) \approx \frac{1}{A} \sum_{i} \lVert \bar{b}_i - b_i \rVert \, A_i$$

where $\bar{b}_i$ is the barycenter of the $i$-th triangle on the target shape, $b_i$ is the (corresponding) barycenter of the $i$-th triangle on the simulated shape, and $A_i$ is the area of the $i$-th triangle on the target shape.
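The discretized dissimilarity can be sketched as an area-weighted mean of the distances between corresponding triangle barycenters. This is an illustration under assumptions: triangles are given as triples of 3D vertices, triangle $i$ of the target corresponds to triangle $i$ of the simulated mesh, and the normalizing area is taken as the sum of the target triangle areas passed in (so a two-sided mesh should list both sides). The function names are hypothetical.

```python
import math


def barycenter(tri):
    """Centroid of a triangle given as three (x, y, z) vertices."""
    return tuple(sum(v[d] for v in tri) / 3.0 for d in range(3))


def triangle_area(tri):
    """Half the norm of the cross product of two edge vectors."""
    a, b, c = tri
    u = [b[d] - a[d] for d in range(3)]
    v = [c[d] - a[d] for d in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def dissimilarity(target_tris, simulated_tris):
    """Area-weighted mean distance between corresponding barycenters."""
    total_area = 0.0
    weighted = 0.0
    for tgt, sim in zip(target_tris, simulated_tris):
        area = triangle_area(tgt)
        weighted += area * math.dist(barycenter(tgt), barycenter(sim))
        total_area += area
    return weighted / total_area


# Hypothetical example: a unit square split into two triangles, and the
# same square after the simulated fold drifted 0.1 along x.
target = [((0, 0, 0), (1, 0, 0), (1, 1, 0)), ((0, 0, 0), (1, 1, 0), (0, 1, 0))]
drifted = [tuple((x + 0.1, y, z) for x, y, z in tri) for tri in target]
score = dissimilarity(target, drifted)
```

A rigid drift of the whole garment therefore contributes its full displacement to the score, which matches the failure cases in which the garment slides on the table.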

Figure 23: Left: The dissimilarity captures the misalignment between the target shape $\bar{S}$ and the simulated shape $S$ by integrating the distance between the corresponding points $\bar{x}$ and $x$ over the garment. Right: The barycentric dual area associated with a vertex $v$ is defined as the area of the polygon created by connecting the barycenters of the triangles adjacent to $v$.

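To compute the trajectory length $l(T)$, we use De Casteljau's algorithm Farin [1988] to recursively subdivide the Bézier curve $T$ into a set of Bézier curves $T^{(k)}$ with control points $P_i^{(k)}$, until the deviation between the chord length ($\lVert P_3^{(k)} - P_0^{(k)} \rVert$) and the total length between the control points ($\sum_{i=0}^{2} \lVert P_{i+1}^{(k)} - P_i^{(k)} \rVert$) for each subdivided curve is sufficiently small. Then $l(T)$ is approximated by summing up the chord lengths of all the subdivided curves: $l(T) \approx \sum_k \lVert P_3^{(k)} - P_0^{(k)} \rVert$.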
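A minimal sketch of this length computation for a cubic curve follows. Splitting at the parameter midpoint and the tolerance value are assumptions; the text only requires that the chord and control-polygon lengths of every piece agree closely.

```python
import math


def split_cubic(p0, p1, p2, p3):
    """De Casteljau split of a cubic Bezier curve at t = 0.5."""
    def mid(a, b):
        return tuple(0.5 * (x + y) for x, y in zip(a, b))

    p01, p12, p23 = mid(p0, p1), mid(p1, p2), mid(p2, p3)
    p012, p123 = mid(p01, p12), mid(p12, p23)
    centre = mid(p012, p123)
    return (p0, p01, p012, centre), (centre, p123, p23, p3)


def bezier_length(p0, p1, p2, p3, tol=1e-6):
    """Sum of chord lengths after subdividing until the chord and the
    control polygon of every piece differ in length by less than tol."""
    chord = math.dist(p0, p3)
    polygon = math.dist(p0, p1) + math.dist(p1, p2) + math.dist(p2, p3)
    if polygon - chord < tol:
        return chord
    left, right = split_cubic(p0, p1, p2, p3)
    return bezier_length(*left, tol) + bezier_length(*right, tol)


# Hypothetical cubic trajectory (metres), as used for illustration only.
length = bezier_length((0.0, 0.0, 0.0), (0.1, 0.0, 0.2),
                       (0.2, 0.0, 0.2), (0.3, 0.0, 0.0))
```

Because the true arc length of each piece lies between its chord length and its control-polygon length, the stopping test bounds the error contributed by every piece.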

We initialize $P_1$ and $P_2$ by raising them above the line from $P_0$ to $P_3$ along $\mathbf{e}_z$, the unit vector in the upward vertical direction, with a constant chosen so that the initial trajectory has equal horizontal extent between knot points.

6.3 Optimization

To optimize equation (19), we apply a secant version of the Levenberg-Marquardt algorithm Madsen et al. [2004]; Nocedal and Wright [2006]. For the current trajectory $T$ generated by $\{P_1, P_2\}$, we estimate the derivative of the cost function numerically by sampling slightly modified trajectories, each obtained by displacing one intermediate control point by a small step $\epsilon$ along one of the orthonormal basis vectors $\mathbf{e}_x$, $\mathbf{e}_y$, $\mathbf{e}_z$; $\epsilon$ is a small fixed constant in our implementation.

The secant version of the Levenberg-Marquardt algorithm iteratively builds a local quadratic approximation of $E$ based on the numerical derivative, and then takes a step toward an improved state. The direction of the step is a combination of the steepest descent direction and the Gauss-Newton direction. We use the specific approach described by Madsen et al. [2004] (see §3.5 therein). The iterative procedure terminates when the improvement in $E$ becomes sufficiently small.
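The structure of such an iteration can be sketched as follows. This is a plain Levenberg-Marquardt loop that recomputes a forward-difference Jacobian at every iteration; it is not the secant variant of the cited report, which avoids repeated finite differencing by updating the Jacobian estimate. The `residuals` callback, the damping schedule, and all default values are assumptions, and how the folding cost is split into residuals is not specified in the text.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def levenberg_marquardt(residuals, x, eps=1e-6, mu=1e-2, tol=1e-12,
                        max_iter=100):
    """Minimise sum(r_i(x)^2) using a forward-difference Jacobian."""
    def cost(v):
        return sum(r * r for r in residuals(v))

    for _ in range(max_iter):
        r0 = residuals(x)
        n, m = len(x), len(r0)
        # Column j of J: change in residuals when x_j is nudged by eps.
        J = [[0.0] * n for _ in range(m)]
        for j in range(n):
            xj = x[:]
            xj[j] += eps
            rj = residuals(xj)
            for i in range(m):
                J[i][j] = (rj[i] - r0[i]) / eps
        # Damped normal equations: (J^T J + mu I) h = -J^T r.
        A = [[sum(J[i][a] * J[i][c] for i in range(m)) + (mu if a == c else 0.0)
              for c in range(n)] for a in range(n)]
        g = [-sum(J[i][a] * r0[i] for i in range(m)) for a in range(n)]
        x_new = [xi + hi for xi, hi in zip(x, solve(A, g))]
        old, new = cost(x), cost(x_new)
        if new < old:
            x, mu = x_new, mu * 0.5   # accept: closer to Gauss-Newton
            if old - new < tol:
                break
        else:
            mu *= 4.0                 # reject: closer to steepest descent
    return x


# Hypothetical two-parameter problem with minimum at (1, -3).
best = levenberg_marquardt(lambda v: [v[0] - 1.0, 2.0 * (v[1] + 3.0)],
                           [0.0, 0.0])
```

For the folding problem, `x` would hold the coordinates of the intermediate control points and each call to `residuals` would require one cloth simulation, which is the motivation for variants that reduce the number of Jacobian evaluations.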

In the case of using multiple arms, we associate an individual trajectory $T^{(a)}$ with each arm $a$. We then extend the state variable to the intermediate control points of all arms, $\{P_1^{(a)}, P_2^{(a)}\}_a$. The rest of the optimization procedure is the same as in the single-arm case. Note that both single and dual-arm trajectories are in 3D space. The optimization for dual-arm trajectories is able to find a solution that overcomes failures such as those shown in Figure 18, bottom.

6.4 Experimental Results

To evaluate our results, we tested our method on several different garments, such as long-sleeve T-shirts, pants, and towels, over multiple trials, as shown in Figure 24, left. These garments require both single and dual-arm folds.

6.4.1 Measurement of parameters

To make the offline simulation better approximate the real scenario, we manually measure the stretch resistance of each garment and the friction of the table.

Figure 24, left, shows a picture of all the test garments we used, which differ in color, size, and material properties. The table in Figure 24, right, lists the measured parameters of each test garment, including stretch percentage and friction angle, and the corresponding Maya parameters. For common garments, these parameters do not vary significantly. Therefore, we suggest that researchers using simulators such as Maya take the average value of each column as a reasonably good initialization.

Garment Type Stretch (%) Friction Angle (°) Maya Shear Resistance Maya Friction
Long-Sleeve T-Shirt (large)
Long-Sleeve T-Shirt (small)
Jeans
Pants
Large Towel
Medium Towel
Small Towel
Average
Figure 24: Left: A picture of our test garments. Right: Results for each unfolding test on the garments. We show the results of stretch percentage, Friction angle of the table, and the corresponding parameters in Maya by each test. The last row shows the average of each measurement component.

6.4.2 Garment manipulation and folding

Figure 27 shows three successful folding examples from the simulation and the real world, including a long-sleeve shirt, a pair of pants, and a medium size towel. We show six key frames for each folding task. The folding poses from the simulation are in the first row of each group with an optimized trajectory. We also show corresponding results from the real world. The green tape contour on the table indicates the original position of the garment.

Figure 25: Garment folding plan for a long-sleeve T-shirt.

Each garment is first segmented from the background, and key points are detected from the binary mask. Given the key points, a corresponding multi-step folding plan is created (the folding plan is predefined, and one of our folding plans for a long-sleeve T-shirt is shown in Figure 25). For each garment, we have optimized trajectories for each folding step. Here, we map these optimized trajectories to our scenario according to the generated folding plan. Then the Baxter robot follows the folding plan with optimized trajectories to fold the garment. The deformations of the real garment and the simulated garment are very similar, so the final folding outcome is comparable to the simulation.

Table 3 shows statistical results of the garment folding test. Each time one or two robotic arms fold the garment counts as one fold. We ran multiple trials for each test garment. The folding performance for the long-sleeve T-shirts and towels is very stable with our optimized trajectories. Jeans and pants are less stable because the shear resistance of their surface is relatively high, which sometimes makes them difficult to bend and leads to unsuccessful folding. In the successful folding cases for jeans and pants, we sometimes ended up with small wrinkles, but the folding plan was still able to complete successfully. We also show the average time to fold a garment in the last column. The robot is able to fold most garments in a matter of minutes.

6.4.3 Solution Space

The solution space is a subspace of the trajectory space where the folded garment ends in a shape with a dissimilarity score less than a threshold. Intuitively, a number of trajectories within the solution space will fold the garment, leaving its shape close to the desired shape. We have found that trajectories within the solution space can vary to a degree while still allowing the robot to accomplish the folding task. This result also agrees with the fact that people do not have to follow a unique trajectory to fold the garment. However, trajectories outside the solution space cause issues for the folding task (see Figure 18). Our trajectory optimization automatically avoids such cases.

To further explore the relationship between the trajectories and folded shapes, we experimented with folding along a few different trajectories in simulation. A notable finding is that symmetric trajectories consistently produce a better folded shape, as shown in Figure 26. The thirteen colored curves in each plot represent thirteen different trajectories. The dissimilarity bar on the right shows the difference between the folded shape and the desired folded shape for each folding simulation. We also tested asymmetric trajectories, as shown in the second and third plots in Figure 26. The second plot has larger dissimilarities than the first and third plots, which is mainly caused by friction. The robot should raise the grasp point to a high enough position at the beginning to prevent the grasped portion of the garment from pushing the portion that remains on the table. This is consistent with our simulation results, in which the optimizer drives the height of the trajectories to a reasonable distance from the garment.

Figure 26: The dissimilarity values from different trajectories for folding the towel model in the second folding step. The trajectory is projected to a 2D plane for illustration purposes. S and T stand for the start and target position, respectively. (Best viewed in color)

There is a trade-off between performing contour fitting at each step and the total time spent folding a garment. In this work, we start with one template and then assume that at each subsequent step the folded garment is close to its simulated counterpart. Our experimental results in Table 3 indicate that this method works well and saves time, since we only perform contour fitting once. With our simulated trajectories, the Baxter robot is able to fold a garment correctly under predefined steps. An alternative method could use contour fitting at each step, but this would require more time and computation.

We note that some failures are due to motor control error in the Baxter robot. When the robot executes an optimized trajectory, its arm sometimes suffers from a sudden drop or jitter. Such motions exert pulling forces on the garment, leading to drift and inaccurate folding. This could be addressed by using an industrial-grade robotic arm with more accurate control. We also note that failures can be recognized with the correct sensing suite, and we are currently investigating ways to effect online error recovery for such failures. One difference we found between the simulation and the real world is that moving a point on the mesh in the simulation is different from using a gripper to grasp a small area of a real garment and move it. In the future, we hope to be able to simulate a similar grasp effect for the trajectory optimization.

Figure 27: Successful folding examples with optimized folding trajectories from offline simulation. The first row of each group is from the simulation and the second row is from the real world (Green tape shows the original garment contour position). Top Group: Long-sleeve shirt folding with steps. Middle Group: Long pants folding with steps. Bottom Group: Medium size towel folding with steps.
Garment Type # of folds Success Rate Avg. Time (sec)
L-S T-Shirt (large)
L-S T-Shirt (small)
Large Towel
Medium Towel
Small Towel
Table 3: Results of the folding test for each garment. We show the number of folding steps, success rate, and total time for each garment. Each garment has been tested multiple times. L-S stands for Long-Sleeve. The time is the average over all successful trials for each garment.

7 Conclusion

In this paper, we introduced a simulation database of common deformable garments to facilitate recognition and manipulation. The database contains five different garments within three categories: sweater, pants, and shorts. Each garment is fully simulated with a number of depth images and 3D mesh models for all the semantically labeled grasping points. We demonstrated three applications of the database to improve the recognition and manipulation of deformable objects. The first is training on the simulated mesh models to recognize an unknown object by 3D shape-based features. The second is applying the simulated mesh model to guide the iterative regrasping of the garment using both rigid and non-rigid registration. The third is importing the mesh model into the simulator and computing optimized trajectories for manipulation of the deformable objects. We extensively tested the three applications with designed experiments such as garment recognition via pick-up, unfolding the garment to a known desired state and laying it flat, and using pre-computed folding plans to fold it with a novel trajectory optimization method that prevents common folding errors. We have addressed all the phases of the pipeline in Figure 1 individually. However, there are still some system and hardware issues that prevent the system from being a completely seamless pipeline. These are due to 1) kinematic constraints on the Baxter robot, which limit its ability to work with larger garments on a normal-size table, and 2) our need to manually mount the iron on the robot hand for the ironing task.

While the focus of our work has been on clothing, we want to underline the point that model-driven, feed-forward prediction can work well in complex environments with many unknown states. Although we have not yet attempted this, we believe that the ideas in this paper can be ported to similar domains such as food handling (“soft deformable objects”) and articulated rigid objects that have multiple kinematic states.

We would like to thank J. Weisz, J. Varley, and R. Ying for many discussions, and P. M. Lopez for help with the folding plan. We would also like to thank NVIDIA Corporation and Intel Corporation for the hardware support. This material is based upon work supported by the National Science Foundation under Grant No. 1217904.


  • Belongie et al. [2002] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24(4), apr 2002.
  • Besl and McKay [1992] P. J. Besl and N. D. McKay. A method for registration of 3-d shapes. IEEE Trans. Pattern Anal. Mach. Intell., 14(2):239–256, feb 1992. ISSN 0162-8828. doi: 10.1109/34.121791.
  • Chen et al. [2013] J. Chen, D. Bautembach, and S. Izadi. Scalable real-time volumetric surface reconstruction. SIGGRAPH, 32(4):113:1–113:16, jul 2013.
  • Cusumano-Towner et al. [2011] M. Cusumano-Towner, A. Singh, S. Miller, J. F. O’Brien, and P. Abbeel. Bringing clothing into desired configurations with limited perception. In Proc. ICRA, 2011.
  • Doumanoglou et al. [2014] A. Doumanoglou, A. Kargakos, T-K Kim, and S. Malassiotis. Autonomous active recognition and unfolding of clothes using random decision forests and probabilistic planning. In Proc. ICRA, May 2014.
  • Farin [1988] G. Farin. Curves and Surfaces for Computer Aided Geometric Design. Academic Press, 1988.
  • Frome et al. [2004] A. Frome, D. Huber, R. Kolluri, T. Bülow, and J. Malik. Recognizing objects in range data using regional point descriptors. In Proc. ECCV, pages 224–237, 2004.
  • Goldfeder et al. [2009] C. Goldfeder, M. Ciocarlie, H. Dang, and P. K. Allen. The Columbia grasp database. Proc. ICRA, 2009.
  • Grinspun et al. [2003] E. Grinspun, A. N. Hirani, M. Desbrun, and P. Schröder. Discrete shells. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’03, pages 62–67, Aire-la-Ville, Switzerland, 2003. Eurographics Association. ISBN 1-58113-659-5.
  • Tsai [2002] Y.-H. R. Tsai. Rapid and accurate computation of the distance function using grids. Journal of Computational Physics, 178(1):175–195, 2002. ISSN 0021-9991. doi:
  • Huttenlocher et al. [1993] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge. Comparing images using the hausdorff distance. PAMI, 1993.
  • Inc. [a] Poser World Inc. Poser world clothes models, a.
  • Inc. [b] Turbo Squid Inc. Turbo squid 3d models, b.
  • Joachims [2002] T. Joachims. Optimizing search engines using clickthrough data. In Proc. KDD, pages 133–142, 2002.
  • Kita and Kita [2002] Y. Kita and N. Kita. A model-driven method of estimating the state of clothes for manipulating it. In Proc. WACV, 2002.
  • Kita et al. [2011a] Y. Kita, F. Kanehiro, T. Ueshiba, and N. Kita. Clothes handling based on recognition by strategic observation. In Humanoid Robots, 2011a.
  • Kita et al. [2011b] Y. Kita, T. Ueshiba, E-S Neo, and N. Kita. Clothes state recognition using 3d observed data. In Proc. ICRA, 2011b.
  • Latecki et al. [2000] L. Latecki, R. Lakamper, and T. Eckhardt. Shape descriptors for non-rigid shapes with a single closed contour. In Proc. CVPR, 2000.
  • Le et al. [2013] T-H-L Le, M. Jilich, A. Landini, M. Zoppi, D. Zlatanov, and R. Molfino. On the development of a specialized flexible gripper for garment handling. Journal of automation and control engineering, 1(3), 2013.
  • Li et al. [2008] H. Li, R. W. Sumner, and M. Pauly. Global correspondence optimization for non-rigid registration of depth scans. In Proceedings of the Symposium on Geometry Processing, SGP ’08, pages 1421–1430, Aire-la-Ville, Switzerland, 2008. Eurographics Association.
  • Li et al. [2013] H. Li, E. Vouga, A. Gudym, L. Luo, J. T. Barron, and G. Gusev. 3d self-portraits. ToG (SIGGRAPH Asia), 32(6), November 2013.
  • Li et al. [2014a] Y. Li, C-F Chen, and P. K. Allen. Recognition of deformable object category and pose. In Proc. ICRA, June 2014a.
  • Li et al. [2014b] Y. Li, Y. Wang, M. Case, S-F Chang, and P. K. Allen. Real-time pose estimation of deformable objects using a volumetric approach. In Proc. IROS, September 2014b.
  • Li et al. [2015a] Y. Li, D. Xu, Y. Yue, Y. Wang, S-F Chang, E. Grinspun, and P. K. Allen. Regrasping and unfolding of deformable garments using predictive thin shell modeling. In Proc. ICRA, May 2015a.
  • Li et al. [2015b] Y. Li, Y. Yue, D. Xu, E. Grinspun, and P. K. Allen. Folding deformable objects using predictive simulation and trajectory optimization. In Proc. IROS, 2015b.
  • Li et al. [2016] Y. Li, X. Hu, D. Xu, Y. Yue, E. Grinspun, and P. K. Allen. Multi-sensor surface analysis for robotic ironing. In Proc. ICRA, 2016.
  • Lowe [1999] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, 1999.
  • Madsen et al. [2004] K. Madsen, H. B. Nielsen, and O. Tingleff. Methods for non-linear least squares problems (2nd ed.). Technical report, Technical University of Denmark, 2004.
  • Maitin-Shepard et al. [2010] J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P. Abbeel. Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In Proc. ICRA, 2010.
  • [30] Maya. Maya,
  • Miller et al. [2011] S. Miller, M. Fritz, T. Darrell, and P. Abbeel. Parametrized shape models for clothing. In Proc. ICRA, Sept. 2011.
  • Miller et al. [2012] S. Miller, J. Berg, M. Fritz, T. Darrell, K. Goldberg, and P. Abbeel. A geometric approach to robotic laundry folding. IJRR, 2012.
  • Newcombe et al. [2011] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011.
  • Nocedal and Wright [2006] J. Nocedal and S. Wright. Numerical Optimization Second Edition. Springer, 2006.
  • Osada and Funkhouser [2001] R. Osada and T. Funkhouser. Matching 3d models with shape distributions. In Proc. SMI Int. Conf., 2001.
  • Osawa et al. [2007] F. Osawa, H. Seki, and Y. Kamiya. Unfolding of massive laundry and classification types by dual manipulator. Journal of Advanced and Intelligent Informatics, 11(5), 2007.
  • Schulman et al. [2013] J. Schulman, A. Lee, J. Ho, and P. Abbeel. Tracking deformable objects with point clouds. In Proc. ICRA, 2013.
  • Stria et al. [2014a] J. Stria, D. Prusa, and V. Hlavac. Polygonal models for clothing. In Proc. Towards Autonomous Robotic Systems, 2014a.
  • Stria et al. [2014b] J. Stria, D. Prusa, V. Hlavac, L. Wagner, V. Petrik, P. Krsek, and V. Smutny. Garment perception and its folding using a dual-arm robot. In Proc. IROS, Sept. 2014b.
  • Tam et al. [2013] G. K. L. Tam, Z-Q Cheng, Y-K Lai, F. C. Langbein, Y Liu, D. Marshall, R.R. Martin, X-F Sun, and P.L. Rosin. Registration of 3d point clouds and meshes: A survey from rigid to nonrigid. Visualization and Computer Graphics, IEEE Transactions on, 19(7):1199–1217, July 2013. ISSN 1077-2626. doi: 10.1109/TVCG.2012.310.
  • Thayananthan et al. [2003] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla. Shape context and chamfer matching in cluttered scenes. In CVPR, 2003.
  • Toshev et al. [2010] A. Toshev, B. Taskar, and K. Daniilidis. Object detection via boundary structure segmentation. In Proc. CVPR, 2010.
  • Tu and Yuille [2004] Z. Tu and A. Yuille. Shape matching and recognition: Using generative models and informative features. In Proc. ECCV, 2004.
  • Umetani et al. [2011] N. Umetani, D. M. Kaufman, T. Igarashi, and E. Grinspun. Sensitive Couture for Interactive Garment Editing and Modeling. ACM Transactions on Graphics, 30(4), Aug 2011.
  • van den Berg et al. [2010] J. van den Berg, S. Miller, K. Goldberg, and P. Abbeel. Gravity-based robotic cloth folding. In Proc. Intl. Workshop on the Algorithmic Foundations of Robotics (WAFR), 2010.
  • Wang et al. [2006] J. Wang, L. Yin, X. Wei, and Y. Sun. 3d facial expression recognition based on primitive surface feature distribution. In Proc. CVPR, 2006.
  • Wang et al. [2011] P-C Wang, S. Miller, M. Fritz, T. Darrell, and P. Abbeel. Perception for the manipulation of socks. Proc. IROS, 2011.
  • Wang et al. [2013] Y. Wang, R. Ji, and S.-F. Chang. Label propagation from imagenet to 3d point clouds. In Proc. CVPR, June 2013.
  • Willimon et al. [2011] B. Willimon, S. Birchfield, and I. Walker. Classification of clothing using interactive perception. In Proc. ICRA, 2011.
  • Willimon et al. [2013] B. Willimon, I. Walker, and S. Birchfield. A new approach to clothing classification using mid-level layers. In Proc. ICRA, 2013.
  • Wu et al. [2008] C. Wu, B. Clipp, X. Li, J-M Frahm, and M. Pollefeys. 3d model matching with viewpoint-invariant patches. In Proc. CVPR, 2008.