Encoding cloth manipulations using a graph of states and transitions

Cloth manipulation is very relevant for domestic robotic tasks, but it presents many challenges due to the complexity of representing, recognizing and predicting the behaviour of cloth under manipulation. In this work, we propose a generic, compact and simplified representation of the states of cloth manipulation that allows tasks to be represented as sequences of states and transitions. We also define a graph of manipulation primitives that encodes all the strategies to accomplish a task. Our novel representation is used to encode the task of folding a napkin, learned from an experiment with human subjects with video and motion data. We show how our simplified representation makes it possible to obtain a map of meaningful motion primitives and to segment the motion data to obtain sets of trajectories, velocity and acceleration profiles corresponding to each manipulation primitive in the graph.

I Introduction

Cloth manipulation is an important area of robotics research that has applications both in industrial scenarios and in domestic environments. Despite its importance, research efforts have focused mostly on the manipulation of rigid objects, showing great progress in service robotics [1], while core capabilities such as grasping, placing, or handing to a person remain hard, unsolved problems when dealing with challenging objects such as textiles. Recently, a stronger interest in deformable object manipulation has emerged, and the survey in [2] presents the latest advances.

Cloth manipulation presents many additional challenges with respect to rigid object manipulation, related to the complexity of modeling the object or predicting its behavior [3]. For this reason, cloth manipulation research has traditionally put more effort into cloth state estimation and grasp point detection, whereas manipulation skills are underdeveloped [2]. For instance, one of the crucial skills for manipulation is grasping, and there are very few works analyzing grasping for highly flexible objects, even though having a variety of grasp types is of paramount importance to determine possible actions and define the state of a scene in terms useful for manipulation. In this direction, we presented a taxonomy of textile grasps [4] based on the geometry of the prehension patches conforming the shape of the grasped part of cloth.

Fig. 1: Experimental setup. (Top) The subject wears a motion tracking suit and a GoPro camera mounted on the head. We also take a Kinect screenshot of the final result, although the latter is not used in this paper. (Bottom) Wearable point-point gripper used in the experiments.

The complexity of defining and recognizing scene states makes getting reliable data very difficult, hindering the training of AI systems and high-level planners for task strategies. However, it is important to learn cloth manipulation tasks through human demonstrations in order to obtain a diversity of strategies to accomplish a task, with different parameters related to safety, fast accomplishment of the objective or the number of steps needed to accomplish a task, inducing a measure of task complexity.

The first contribution of this work is to propose a generic and compact representation of scene states and the possible transitions between them that can account for textile manipulations. The second contribution is the Graph of Manipulation Primitives (GoMP), a graph that can be built using the previous representation to encode all the possible states and transitions of a given manipulation task. We show the feasibility of our approach through the task of folding a napkin in 3 folds. We performed an experiment with 8 subjects wearing a gripper and the Xsens suit (Fig. 1). From the motion data and the labeling of video data we extract a map of strategies, and we generate segmented motion data that contains specific parameters about the upper body trajectories, velocities and accelerations that have been used for each identified manipulation primitive. The labeled video data and segmented motion data will be made public, enabling the training and comparison of state recognition algorithms.

II Related work

Simplified representations of scene states based on the contact interactions between the hands, the object and the environment were used in the past in the context of manipulation of rigid objects [5], and then used for recognition, segmentation [6] and learning manipulation actions to be executed by a robot [7]. In those works, states are representations of simple interactions (touching/not touching) between all the different rigid objects present in a particular scene. The transitions can be complex (like cut or pour) and are very well defined from semantics to real execution. However, they are not related in a sequence. In contrast, we are interested in the interaction of symbolic actions and in building a representation that allows subsequent actions to be related.

Graphs are a common representation for manipulation transitions. For instance, the Dexterous Manipulation Graph (DMG) [8] was developed for the case of in-hand manipulations. It is a graph where nodes are the positions and orientations of the contact points in a grasp, and an edge exists if the reconfiguration between the two grasps is possible. As they consider rigid objects, DMGs do not need to specify the state of the object or the changes that the manipulation can cause to its configuration.

In [9], contact interactions with the environment were used to define sequences of support phases in multi-contact locomotion tasks. Symbolic representations of whole-body support poses were used to learn a model of possible step transitions leaning on handles, walls or tables to provide a more robust manipulation. In that case, model-based environmental objects were directly captured by the motion tracking system, enabling the data capture and labeling of pose sequences, which is not possible for cloth items.

The Transition Graph of Object Pose was proposed in [10] in the context of object manipulation with a humanoid robot. This graph encodes the complexity of whole-body contacts between a humanoid robot and simple objects. As in our work, contacts are classified into face (which we call plane), line, point, and no contact. The motions considered to change contacts are rather complex, like sliding, rotating, pivoting, and lifting. The graph is generated by taking the geometric model of a particular object and applying the different motions, taking into account the stability using physical models. Instead, we consider more symbolic states and propose to learn the transitions by observing several executions performed by humans.

Fig. 2: Point contacts are noted as P, linear contacts as L and planar contacts as Π. (a) Double pinch grasp. Each pinch is formed with two point geometries. (b) A double pinch with the additional extrinsic contact of the table, denoted with an “e” subscript. (c) A double grasp obtained with a linear contact and a planar surface. (d) A more complex combination of grasps achieved with a grasping agent (the hand) against the table.

III A generic state-and-transition definition

To recognize and understand a manipulation action, it is necessary to interpret the states of a scene at each time-step. This is a difficult problem and our approach is to define a simplified representation of a scene in a way that can be recognized by a robot and that allows it to execute the next action.

We propose to define a state as a tuple of three elements:

  • the grasp type,

  • the location of the grasp with respect to the cloth, and

  • the cloth configuration.

Then, we define a manipulation primitive as a triplet of:

  • the origin and destination states, and

  • a semantic label of the action primitive the subject is performing.
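As an illustration of the two definitions above, a minimal Python sketch could encode a state and a manipulation primitive as immutable records. The class names, field names and example label strings are assumptions made for this example; they are not the notation or the data format used in our experiments.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SceneState:
    """Hypothetical encoding of a scene state (field names are illustrative)."""
    grasp_type: str                  # e.g. "2PP" for a double pinch
    grasp_location: Tuple[str, ...]  # e.g. ("LC", "RC")
    cloth_config: str                # e.g. "crumpled" or "flat"


@dataclass(frozen=True)
class ManipulationPrimitive:
    """Hypothetical encoding of a primitive: origin, destination and label."""
    origin: SceneState
    destination: SceneState
    label: str                       # semantic label of the action


# Unfolding in the air keeps the grasp and changes only the cloth configuration.
crumpled = SceneState("2PP", ("LC", "RC"), "crumpled")
flat = SceneState("2PP", ("LC", "RC"), "flat")
unfold = ManipulationPrimitive(crumpled, flat, "unfold in the air")
```

Frozen dataclasses are hashable and compare by value, so two observations of the same state map to the same graph node when states are later collected into a graph.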

Regarding the definition of the grasp type, it is based on the cloth grasp framework and taxonomy introduced in our previous work [4]. In this framework, each grasp is defined by the geometries of the two contact patches that apply the pair of opposing forces, that is, the geometries of each virtual finger [11, 12]. The framework considers as virtual fingers geometries that can be either intrinsic (part of the gripper) or extrinsic (like the table). It also considers bi-manual grasps in a natural way. A partial glimpse of the grasp framework is provided in Fig. 2. Our grasp framework considers elements in the environment as extrinsic contact geometries and, therefore, it already encompasses environmental contact interactions. Note that all cloth states realize a grasp: when there is no contact with the subject, the cloth either lies on a table, i.e., a planar grasp, or is held by other elements in the environment, such as a hook, i.e., a P grasp, or a rail, i.e., an L grasp, all of which rely on gravity as the opposing force.

Fig. 3: The labeling when the cloth is on the table or hanging. It depends on the location of the subject that is performing the manipulation. Any interior point is labeled I.

Regarding the grasp location, we have defined a set of labels to describe the approximate locations of the grasping points on a given rectangular cloth, shown in Fig. 3. Similar nomenclature could be used for other cloth shapes. Locations are encoded with respect to the subject, i.e., left corner (LC) refers to the corner closest to the subject at that side, and right corner (RC) likewise, up to rotations of 45°. The two farthest corners are labeled far left (FL) and far right (FR). When the cloth is hanging, the right and left corners are the top ones (closer to the subject's hands). This means that for certain state transitions we may get a swap of labels for the same points. For instance, when placing a cloth flat on a table, and then folding it without releasing it, the labeling swaps from (LC+RC) to (FL+FR) after the table contact has been added. See the next section for more details and examples.

Regarding the configuration of the cloth, it is well known that the configuration of a textile is infinite-dimensional; therefore, this parameter could be of very high complexity. That, together with the high number of self-occlusions that occur when manipulating clothes, makes cloth state estimation a difficult problem. The high complexity of its full solution has been bypassed in the past by just looking for task-oriented features, such as adequate and accessible grasping points, e.g., shirt collars for hanging [13] or towel corners for folding [14]. Increasingly, it becomes clearer that we need simplified representations, especially regarding deformable objects, as stated in [15]. We have defined only 5 categories of simplified cloth states.

For the crumpled category, there are subcategories dependent on the number of visible corners. A further simplification in the current work has been to assume there is always a visible corner that can be grasped. This is enough to extract coherent sequences of states for a task. In future developments of the project we will determine if this is enough for recognition and execution on the robot.

Fig. 4: Above, snapshots of the video data labeled according to our four-field characterization of manipulation primitives. Below, the corresponding scene state-and-transition representation automatically generated from the labels.

Finally, regarding the motion semantic label, we define a set of labels related to the action the subject is performing from that initial state until the following one, for instance, place flat on the table, fold on the table or trace an edge of the cloth. Semantic labels are useful for high-level planning and scene understanding, and can be linked to low-level parameters like motion primitives or other trajectory representations.

The proposed state and transition definition induces a segmentation of a manipulation task at each change of scene state. The state changes at each re-grasp, which, in our grasp framework, includes changes in contacts with the environment. However, it is not only re-grasps that change the state, but also changes in the locations or in the cloth states. For instance, unfolding in the air starts from a 2PP grasp at the corners, with the cloth crumpled, and the two hands move to make the cloth flat (unfolded) by the end of the primitive. If the subject then places the grasped points on the table, this constitutes a different manipulation primitive although the grasp type and grasping point locations haven’t changed. For the edge tracing manipulation, the subject performs a pinch-and-slide motion that results in different grasping point locations at the beginning and end of the segment.
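The segmentation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-sample input format, the function name `segment_by_state_change` and the label strings are hypothetical, and each state is simplified to a plain tuple of three strings.

```python
from typing import List, Tuple

State = Tuple[str, str, str]  # (grasp type, grasp location, cloth configuration)


def segment_by_state_change(
    samples: List[Tuple[float, State]],
) -> List[Tuple[float, float, State]]:
    """Split time-ordered (timestamp, state) samples into constant-state segments.

    Returns (start_time, end_time, state) triples. A new segment starts
    whenever any of the three state fields changes.
    """
    segments: List[Tuple[float, float, State]] = []
    for t, state in samples:
        if segments and segments[-1][2] == state:
            start, _, _ = segments[-1]
            segments[-1] = (start, t, state)  # extend the current segment
        else:
            segments.append((t, t, state))    # state changed: open a new segment
    return segments


samples = [
    (0.0, ("2PP", "LC+RC", "crumpled")),
    (0.5, ("2PP", "LC+RC", "crumpled")),
    (1.0, ("2PP", "LC+RC", "flat")),  # only the cloth configuration changes
    (1.5, ("2PP", "LC+RC", "flat")),
]
segments = segment_by_state_change(samples)
```

In this toy input the grasp type and location never change, yet two segments are produced, which mirrors the point that a change in cloth configuration alone is enough to start a new primitive.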

IV Experimental setup and data collection

We tested a total of 8 subjects wearing a motion capture suit (Xsens) and a GoPro camera fixed on their forehead (Fig. 1). The experiment included several cloth manipulation tasks, but for the scope of this paper we focus on the task of folding a napkin on the table, as already mentioned. We asked the subjects to wear a simple gripper, shown at the bottom of Fig. 1, to reduce their manipulation dexterity to one closer to that of the robot. Subjects were only instructed to fold the cloth in 3 folds and they practiced the task three to four times before starting the recordings.

When it comes to cloth manipulation, subject experiments provide us with a lot of useful information regarding the variety of strategies to accomplish a task, which is not observed in robot cloth manipulation demonstrations, as analyzed in [4]. Therefore, learning state sequences from humans will provide us with a much richer graph regarding alternative strategies, and we will be able to learn new manipulation approaches for robots. However, there is a trade-off between obtaining a great diversity of strategies and the sparsity of the obtained data. This is especially true for cloth manipulation, where almost every subject has their own tricks to fold clothes. For this reason, we instructed the subjects to perform a very specific task (fold on the table, not in the air, and in 3 folds). Despite these indications, we obtained a lot of variability, sometimes even between the trials of the same subject. However, some strategies have been used consistently by most of the subjects.

From the data collected, we have manually labeled the videos at each change of state, associating a motion semantic label to each state-to-state transition depending on the action that was done, following the representation introduced in the previous section. We purposely ignored any manipulation that corrected a mistake, or that relocated the cloth on the table, just to simplify the data. The labels include a timestamp at each change of state, providing the segmentation of the data and the sequence of states. The motion data is synchronized through an initial clap by the subject, which is labeled in the subtitle and detected as a peak of acceleration in the motion data.
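The clap-based synchronization can be illustrated with a short Python sketch. It assumes that the clap is the single largest acceleration peak in the recording and that a constant time offset relates the video and motion clocks. The function names and the sample values are hypothetical, not taken from our processing pipeline.

```python
from typing import List, Sequence


def clap_time(timestamps: Sequence[float], accel_magnitude: Sequence[float]) -> float:
    """Return the timestamp of the largest acceleration magnitude (assumed clap)."""
    peak_index = max(range(len(accel_magnitude)), key=accel_magnitude.__getitem__)
    return timestamps[peak_index]


def video_to_motion_time(
    video_labels: List[float], clap_video: float, clap_motion: float
) -> List[float]:
    """Shift video label timestamps into the motion-data clock."""
    offset = clap_motion - clap_video
    return [t + offset for t in video_labels]


# Synthetic motion stream with an acceleration peak at t = 0.25 s.
motion_t = [0.0, 0.25, 0.5, 0.75, 1.0]
motion_a = [0.2, 9.5, 0.4, 0.3, 0.1]
clap_motion = clap_time(motion_t, motion_a)

# The clap was labeled at 1.25 s in the video, so labels shift by -1.0 s.
aligned = video_to_motion_time([2.0, 3.5], clap_video=1.25, clap_motion=clap_motion)
```

Once the label timestamps are expressed in the motion-data clock, each pair of consecutive labels delimits the motion samples of one manipulation primitive.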

By using the grasp type framework described before, the grasping point locations in Fig. 3, and the 5 cloth configuration labels for the folding task, we generated the scene states by combining these values. An example of the labels and their corresponding graphic state representations can be seen in Fig. 4.

V Graph of manipulation primitives

Thanks to the proposed representation, and by extracting the sequences of states and transitions from the labeled video data, we can generate a graph where each node is a scene state and the edges represent the manipulation primitives. Thus, the graph represents a map of possible manipulations from the first to the last state.

To generate the graph, for each trial we defined an edge for each state change, and we represented it symbolically using the formulation introduced in Section III, where the origin and destination states are the start and end nodes of the graph edge, and the motion semantic value is the edge label. We then identify common nodes and common edges, defining the graph with all the distinct vertices and edges that have appeared, counting their multiplicity.
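A minimal Python sketch of this merging step is given below. It assumes each trial is already available as a list of (origin, destination, label) triples; the function name `build_graph` and the placeholder state names "A", "B" and "C" are illustrative only.

```python
from collections import Counter
from typing import Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable, str]  # (origin state, destination state, label)


def build_graph(trials: List[List[Edge]]) -> Tuple[Set[Hashable], Counter]:
    """Merge per-trial primitive sequences into distinct nodes and counted edges."""
    nodes: Set[Hashable] = set()
    edge_counts: Counter = Counter()
    for trial in trials:
        for origin, destination, label in trial:
            nodes.update((origin, destination))
            edge_counts[(origin, destination, label)] += 1  # edge multiplicity
    return nodes, edge_counts


# Two toy trials that share their first primitive and differ in the second one.
trials = [
    [("A", "B", "grasp corner"), ("B", "C", "unfold in the air")],
    [("A", "B", "grasp corner"), ("B", "C", "trace edge")],
]
nodes, edge_counts = build_graph(trials)
```

Keying the counter on the full triple keeps two primitives with the same end states but different semantic labels as separate edges.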

As an example, we will show the graph obtained from the task of folding a napkin on the table. To simplify data classification, we have removed some left and right distinctions. For instance, a single corner grasped is the same irrespective of whether it is the left or right corner, grasped with the left or right hand. We also assume two grasped points on the same cloth edge are the same regardless of whether they are on the right or left side. All the simplifications are properly defined in the attached documentation (see footnote 1).

As stated before, one of the issues we face with this type of data is that there is a lot of variability in the way individuals perform cloth manipulation tasks, resulting in sparse data. Using all the data collected, we obtain a graph with 31 nodes and 65 edges, but many of them appear only once in our data. If we require each edge to appear at least 3 times in the data, the graph is reduced to 17 nodes with 23 edges. The reduced graph is shown in Fig. 5. The complete graph can’t be included in the paper for space reasons, but it can be accessed via a link (see footnote 2).
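The reduction by minimum multiplicity can be sketched as follows. The function name `prune_graph` and the toy edge counts are assumptions for illustration; only the threshold of 3 comes from the text.

```python
from collections import Counter
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable, str]  # (origin state, destination state, label)


def prune_graph(
    edge_counts: Dict[Edge, int], min_count: int = 3
) -> Tuple[Set[Hashable], Dict[Edge, int]]:
    """Keep edges seen at least min_count times and the nodes they connect."""
    kept_edges = {edge: n for edge, n in edge_counts.items() if n >= min_count}
    kept_nodes = {
        state
        for origin, destination, _label in kept_edges
        for state in (origin, destination)
    }
    return kept_nodes, kept_edges


# Toy multiplicities: the edge towards "D" was observed only once.
edge_counts = Counter({
    ("A", "B", "grasp corner"): 5,
    ("B", "C", "unfold in the air"): 3,
    ("B", "D", "trace edge"): 1,
})
kept_nodes, kept_edges = prune_graph(edge_counts, min_count=3)
```

A node disappears from the reduced graph when every edge that touches it falls below the threshold, as happens to "D" in this example.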

The label of each edge consists of the semantic name of the primitive, the number of times it appears in the data (in parentheses) and the mean velocities and accelerations averaged over both hands. Only in the cases where the left and right values were very different do we show separate values for left and right hand velocity and acceleration.

We performed a total of 24 trials, meaning the maximum number of times one primitive can appear in the data is 24. Despite the diversity of strategies displayed by the subjects, there are some transitions that consistently appear. We plotted in red the transitions that appear in at least half of the trials (12 times) and, in orange, the ones that appear 6 times or more. We can see that the weakest flow in the graph is in the transition from crumpled on the table to flat hanging (Fig. 5-b). There is a great variety of manipulations to find the two corners. Once the corners are grasped, the primitives to unfold in the air become less sparse (Fig. 5-c). The configurations of grasped by the corners and crumpled in the air are reached by edges with a multiplicity of less than 3, and therefore, they don’t appear.
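The color coding of the edges amounts to two thresholds on the edge multiplicity, as the following sketch shows. The function name `edge_color` and the default color "black" for less frequent edges are assumptions; the thresholds of 12 and 6 are the ones stated above.

```python
def edge_color(count: int, red_min: int = 12, orange_min: int = 6) -> str:
    """Map an edge multiplicity to the color used when drawing the graph."""
    if count >= red_min:
        return "red"      # appears in at least half of the 24 trials
    if count >= orange_min:
        return "orange"   # appears 6 times or more
    return "black"        # assumed default for less frequent edges


colors = [edge_color(n) for n in (24, 12, 7, 6, 3)]
```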

VI Discussion

The most common primitive is the first grasp (Fig. 5-a), because the initial configuration of the cloth clearly shows a free corner to grasp, that is grasped either with the right or the left hand depending on the subject. Alternatives include grasping directly both corners or grasping an elevated edge point and tracing the edge until the corner.

Depending on the cloth states, we can identify the first, second and third fold, the first and third being the most consistent and similar in grasping the two corners at the far or close edge. The second fold is consistently performed by grasping the two corners on the right side edge of the cloth, but the way these two corners are grasped has more variability, either both of them grasped directly (7 instances) or first one of the corners and then performing either a lift to facilitate the second corner grasp (5 instances) or performing an edge tracing on the table to reach the second corner (6 instances).

Fig. 5: Reduced graph of manipulation sequences obtained by requiring each edge to appear at least 3 times in the data. We can clearly see the different phases of the task, from the crumpled on the table phase on the left, to the central hanging part of the manipulation, and then the semi-folded states on the table during the first, second and third folds, located to the right.

We have the trajectories, velocity and acceleration profiles of all the trials in each of the edges. This allows us to observe different ways of executing the same manipulation, with very different acceleration and velocity profiles. For instance, the edges of folding on the table from the LC+RC grasping points, which appear in the first and third folds, have several trials with very different velocity profiles, shown in Fig. 6. The low velocity one, plotted in blue, is more common for the third fold, but also appears for the first fold.

Other primitives, like place flat on table, are more consistent in their trajectories and velocity profiles, as seen in Fig. 7. Identifying and classifying profiles with different dynamic parameters is very useful, since the graph of primitives can then be used to plan tasks optimizing different cost functions. For instance, planning based on maximum velocities or accelerations is relevant in collaborative contexts where safety is a priority, while planning based on the shortest graph path means fewer primitives are executed [16].
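To illustrate how one graph can serve different planning criteria, the sketch below runs a standard Dijkstra search with an interchangeable edge cost. It is not the planner of [16]: the graph, state names, velocity values and the choice of summing mean velocities as a safety proxy are all assumptions made for this example, and costs are assumed to be non-negative.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Adjacency list: state -> [(next state, primitive label, mean velocity in m/s)]
Graph = Dict[str, List[Tuple[str, str, float]]]


def plan(
    graph: Graph, start: str, goal: str, cost: Callable[[str, float], float]
) -> List[str]:
    """Dijkstra search returning the primitive labels of a cheapest path."""
    # Heap entries: (accumulated cost, tie-breaker, state, labels so far)
    frontier: List[Tuple[float, int, str, List[str]]] = [(0.0, 0, start, [])]
    settled = set()
    tie = 1
    while frontier:
        total, _, state, labels = heapq.heappop(frontier)
        if state == goal:
            return labels
        if state in settled:
            continue
        settled.add(state)
        for nxt, label, velocity in graph.get(state, []):
            if nxt not in settled:
                step = cost(label, velocity)
                heapq.heappush(frontier, (total + step, tie, nxt, labels + [label]))
                tie += 1
    raise ValueError("goal not reachable from start")


# Synthetic graph: one fast direct primitive versus two slower ones.
graph: Graph = {
    "crumpled": [("flat", "direct unfold", 0.9), ("one corner", "grasp corner", 0.3)],
    "one corner": [("flat", "trace edge", 0.2)],
}
fewest_steps = plan(graph, "crumpled", "flat", cost=lambda label, v: 1.0)
gentlest = plan(graph, "crumpled", "flat", cost=lambda label, v: v)
```

With a unit cost the planner prefers the single direct primitive, whereas with the velocity-based cost it prefers the two slower primitives, which shows how the same graph yields different strategies under different objectives.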

Fig. 6: Three different velocity profiles that occur in the manipulation of fold on the table from the LC+RC location. Each color corresponds to mean velocity profiles for the right and the left hands. As these are bi-manual symmetric tasks, the velocities for the right and left hands are the same.

Fig. 7: The velocity profiles for the primitive of place flat on table are much more consistent compared to the fold on table ones. Each color corresponds to left and right hand velocity profiles. As these are bi-manual symmetric tasks, the velocities for the right and left hands are the same.

One of the motivations behind our approach is to enable explainable reasoning at the manipulation level as well as learning a dynamic movement primitive (DMP) for each re-grasp strategy [17], which is also associated with its preconditions and effects. The resulting DMPs can be used for task planning [18], and for explainability purposes as well, since the learning process makes explicit the conditions that enable the execution of the primitive and the expected outcomes. In addition, the state-and-transition representation simplifies the perceptual information that needs to be acquired. Thus, in subsequent research, we plan to use previous work in our group on cloth part recognition and pose estimation [13] and grasping point detection [19] to perceive the aforementioned manipulation-oriented scene states, including cloth state, grasping point location and confidence values that can provide explanations about the belief in the current state.

VII Conclusions

We have introduced a compact and generic representation of states of a manipulation task in the context of cloth manipulation. The representations are vast simplifications of the complexity of a cloth manipulation state, but we have shown that they are enough to segment a manipulation task into relevant and coherent manipulation primitives. In addition, from the sequences of states and transitions, we have defined a Graph of Manipulation Primitives (GoMP) that encodes the diversity of strategies to accomplish the task. We have shown an example for a common cloth manipulation task for which the GoMP is learned from an experiment with human subjects and allows motion data to be gathered for each of the manipulation primitives in the graph. Learning from human demonstrations allows us to identify manipulation primitives not used so far by robots that could be especially handy for the versatile manipulation of clothing items.

The GoMP encoding cloth states and their transitions under manipulation actions that we have proposed in this paper complies with the desideratum that “low-complexity representations for the deformable objects should be the objective” [15]. This manipulation-oriented representation of cloth states and transitions would permit probabilistic planning of actions to ensure reaching a desired cloth configuration without requiring high accuracy in perception or searching in high-dimensional configuration spaces.

In addition, our encoding of manipulation tasks facilitates the definition of metrics and measures of complexity of a given strategy, which is very useful to define benchmark tasks of increasing complexity. Also, from our definition of the states that represent a task, concrete requirements can be extracted to guide gripper design.

Finally, this work will lead to a database of labeled video data synchronized with motion data of different cloth manipulation tasks, which could be of great utility for the whole manipulation community working on highly deformable objects.


  1. http://www.iri.upc.edu/people/jborras/files/IROS2020/DataSimplifications.pdf
  2. http://www.iri.upc.edu/people/jborras/files/IROS2020/FullGraph.pdf


  1. C. Torras, “Service robots for citizens of the future,” European Review, vol. 24, no. 1, pp. 17–30, 2016.
  2. J. Sanchez, J.-A. Corrales, B.-C. Bouzgarrou, and Y. Mezouar, “Robotic manipulation and sensing of deformable objects in domestic and industrial applications: a survey,” The International Journal of Robotics Research, vol. 37, no. 7, pp. 688–716, 2018.
  3. P. Jiménez, “Visual grasp point localization, classification and state recognition in robotic manipulation of cloth: An overview,” Robotics and Autonomous Systems, vol. 92, pp. 107–125, 2017.
  4. J. Borràs, G. Alenyà, and C. Torras, “A grasping-centered analysis for cloth manipulation,” IEEE Transactions on Robotics, vol. 36, no. 3, pp. 924–936, 2020.
  5. F. Wörgötter, E. E. Aksoy, N. Krüger, J. Piater, A. Ude, and M. Tamosiunaite, “A simple ontology of manipulation actions based on hand-object relations,” IEEE Transactions on Autonomous Mental Development, vol. 5, no. 2, pp. 117–134, 2013.
  6. E. E. Aksoy, A. Abramov, J. Dörr, K. Ning, B. Dellen, and F. Wörgötter, “Learning the semantics of object–action relations by observation,” The International Journal of Robotics Research, vol. 30, no. 10, pp. 1229–1249, 2011.
  7. M. J. Aein, E. E. Aksoy, and F. Wörgötter, “Library of actions: Implementing a generic robot execution framework by using manipulation action semantics,” The International Journal of Robotics Research, p. 0278364919850295, 2018.
  8. S. Cruciani, C. Smith, D. Kragic, and K. Hang, “Dexterous manipulation graphs,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2040–2047, IEEE, 2018.
  9. J. Borràs, C. Mandery, and T. Asfour, “A whole-body support pose taxonomy for multi-contact humanoid robot motions,” Science Robotics, vol. 2, no. 13, p. eaaq0560, 2017.
  10. M. Murooka, Y. Inagaki, R. Ueda, S. Nozawa, Y. Kakiuchi, K. Okada, and M. Inaba, “Whole-body holding manipulation by humanoid robot based on transition graph of object motion and contact,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3950–3955, IEEE, 2015.
  11. T. Iberall, “Opposition space as a structuring concept for the analysis of skilled hand movements,” Generation and modulation of action patterns, vol. 15, pp. 158–173, 1986.
  12. T. Iberall, C. Torras, and C. MacKenzie, “Parameterizing prehension: A mathematical model of opposition space,” in Proceedings of the third COGNITIVA symposium on At the crossroads of artificial intelligence, cognitive science, and neuroscience, pp. 635–642, North-Holland Publishing Co., 1991.
  13. A. Ramisa, G. Alenyà, F. Moreno-Noguer, and C. Torras, “A 3d descriptor to detect task-oriented grasping points in clothing,” Pattern Recognition, vol. 60, pp. 936–948, 2016.
  14. J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P. Abbeel, “Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding,” in IEEE Int. Conf. on Robotics and Automation, pp. 2308–2315, 2010.
  15. C. Smith, Y. Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V. Dimarogonas, and D. Kragic, “Dual arm manipulation—a survey,” Robotics and Autonomous systems, vol. 60, no. 10, pp. 1340–1353, 2012.
  16. D. Martínez, G. Alenyà, and C. Torras, “Safe robot execution in model-based reinforcement learning,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6422–6427, IEEE, 2015.
  17. A. Colomé and C. Torras, “Dimensionality reduction for dynamic movement primitives and application to bimanual manipulation of clothes,” IEEE Transactions on Robotics, vol. 34, no. 3, pp. 602–615, 2018.
  18. G. Canal, E. Pignat, G. Alenyà, S. Calinon, and C. Torras, “Joining high-level symbolic planning with low-level motion primitives in adaptive hri: application to dressing assistance,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9, IEEE, 2018.
  19. E. Corona, G. Alenyà, A. Gabas, and C. Torras, “Active garment recognition and target grasping point detection using deep learning,” Pattern Recognition, vol. 74, pp. 629–641, 2018.