Learning and Inference of Dexterous Grasps for Novel Objects with Underactuated Hands
Recent advances have been made in learning of grasps for fully actuated hands. A typical approach learns the target locations of finger links on the object. When a new object must be grasped, new finger locations are generated, and a collision free reach-to-grasp trajectory is planned. This assumes a collision free trajectory to the final grasp. This is not possible with underactuated hands, which cannot be guaranteed to avoid contact, and in fact exploit contacts with the object during grasping, so as to reach an equilibrium state in which the object is held securely. Unfortunately these contact interactions are i) not directly controllable, and ii) hard to monitor during a real grasp. We overcome these problems so as to permit learning of transferrable grasps for underactuated hands. We make two main technical innovations. First, we model contact interactions during the grasp implicitly. We do this by modelling motor commands that lead reliably to the equilibrium state, rather than modelling contact changes themselves. This alters our reach-to-grasp model. Second, we extend our contact model learning algorithm to work with multiple training examples for each grasp type. This requires the ability to learn which parts of the hand reliably interact with the object during a particular grasp. Our approach learns from a rigid body simulation. This enables us to learn how to approach the object and close the underactuated hand from a variety of poses. From nine training grasps on three objects the method transferred grasps to previously unseen, novel objects, that differ significantly from the training objects, with an 80% success rate.
Transferring dexterous grasps to novel objects is a challenging problem. One approach is to machine learn solutions with techniques able to perform powerful generalisation. Another is to use an underactuated hand to cope with shape variation. In this paper we combine the benefits of both approaches by learning grasps for underactuated hands. Underactuated hands exploit the contacts that occur during grasping to achieve a wide variety of final grasp configurations. The final grasp configuration depends not only on the final hand pose, but also on the object shape, and on the reach to grasp trajectory. An interesting challenge is to use machine learning to exploit these interactions. The key technical challenge in applying machine learning to grasping with underactuated hands is to learn the right motor actions so as to learn contact interactions that will lead to a reliable final grasp.
One approach would be to ecplicitly learn the typical contact interactions that occur during a grasp, and to generate new grasps that reproduce these. The contact interactions are, however, complex and variable, even given small variations in object shape and friction. This makes transfer learning for underactuated hands challenging. There are two main novel technical contributions. First, we extend our learning method, so as to learn a grasp type from multiple trajectories. Second we implicitly encode the sequence of contact interactions by remembering the motor commands for finger closing as well as the wrist trajectory. We also demonstrate, for the first time, learning from examples generated in a rigid body physics simulation. Finally, at the grasp selection stage we now optimise across a space defined by several examples so as to maximise the chance of reaching a stable grasp. The method copes with partial and noisy shape information for the test objects.
I-a Related Work
Previous work in learning generalisable grasps falls broadly into two classes. One class of approaches utilises the shape of common object parts or their appearance to generalise grasps across object categories [1, 2, 3, 4]. This works well for low DoF hands. Another class of approaches captures the global properties of the hand shape either at the point of grasping, or during the approach . This global hand shape can additionally be associated with global object shape, allowing generalisation by warping grasps to match warps of global object shape . This second class works well for high DoF hands, but generalisation is more limited. We have previously achieved the advantages of both classes, generalising grasps across object categories with high DoF hands. In this paper we go beyond this, learning and generalising grasps for under-actuated hands.
Several hands with such behavior have been proposed in the literature with different implementations [7, 8], with a common goal: simplicity plus robustness. Their initial tests under human operation are promising, but autonomous grasping with underactuated hands faces challenges due to the almost non-observability of the finger deformation when the hand is constrained by the environment and/or a target object. Most of the existing planning algorithms for this type of hands boil down to generating good wrist poses and let the adaptive mechanism handle all variation and uncertainty while closing, such as , where a sequence of wrist and object poses and the corresponding interaction wrenches are generated, which are expected to exploit environmental constraints. Another approach is that by , where static wrist poses are sampled using different strategies around the object from where the fingers are closed using a rigid-body simulator, to finally select the areas of major success rate to generate new wrist poses.
While these approaches exploit, to some extent, the adaptive properties of the underactuated mechanism, they can be improved on. In this paper we show how we can, for the first time, learn grasps for underactuated hands that are then transferred to novel objects. This requires learning representations of the final grasp state that are amenable to transfer to new objects, grouping example grasps by the end grasp state, and learning and optimisation of reach-to-grasp trajectories.
Ii Overview of Approach
In our approach the main steps are as follows. A model of a training object is presented in a rigid body physics simulator. Then a number of example grasps are executed by a human, with the precise motions of hand and object during contact being determined by the simulation. Each example grasp continues until a final stable grasp state is reached. We call this the equilibrium state, consisting of the final hand shape, and the final set of contact relations between hand and object. For training and inference purposes each example grasp has two parts: an equilibrium state, and the reach to grasp trajectory leading to it.
We generate the example grasps in sets. Each set corresponds to a type of grasp, e.g. power or pinch. This means that the equilibrium states are similar within a set, but differ substantially between sets.
Models are then learned for each grasp and for each set. Models are learned of the reach to grasp, the hand configuration in the equilibrium state, and the contact relations between hand and object in the equilibrium state. Given these models, when a new object is presented, a (partial) model of that object is obtained by sensing. This model is combined with the models learned from the training grasps.
Then many candidate equilibrium states, and associated candidate reach to grasp trajectories are generated by sampling. Finally they are optimised so as to maximise the likelihood of the grasp according to a product of experts.
Iii Basic Representations
We now sketch the representations underpinning our approach. We define several models: an object model (partial and acquired from sensing); a model of the contact between a finger link and the object; a model of the whole hand configuration; and a model of the reach to grasp trajectory. First we describe the kernel density representation for all these models. Then we describe the surface features we use to encode some of these models. Then we follow with a description of each model type. Throughout we assume that the robot’s hand comprises rigid links: a palm, and several phalanges or links. We denote the set of links .
Iii-a Kernel Density Estimation
denotes the group of rotations in three dimensions. A feature belongs to the space , where is the group of 3D poses, and surface descriptors are composed of two real numbers. We extensively use probability density functions (PDFs) defined on . We represent these PDFs non-parametrically with a set of features (or particles)
The probability density in a region is determined by the local density of the particles in that region. The underlying PDF is created through kernel density estimation , by assigning a kernel function to each particle supporting the density, as
where is the kernel bandwidth and is a weight associated to such that . We use a kernel that factorises into three functions defined by the separation of into for position, a quaternion for orientation, and for the surface descriptor. Furthermore, let us denote by another feature, and its separation into position, orientation and a surface descriptor. Finally, we denote by a triplet of real numbers:
We define our kernel as
where is the kernel mean point, is the kernel bandwidth, is an -variate isotropic Gaussian kernel, and corresponds to a pair of antipodal von Mises-Fisher distributions which form a Gaussian-like distribution on [12, 13]. The value of is given by
where is a normalising constant, and denotes the quaternion dot product.
Using this representation, conditional and marginal probabilities can easily be computed from Eq. (2). The marginal density is computed as
where . The conditional density is given by
Iii-B Surface Features
All objects considered in the paper are represented by point clouds for the purpose of learning and testing. Test object models were constructed from a single view with a depth camera, and were thus incomplete5. We directly augment these points with a frame of reference and a surface feature that is a local curvature descriptor. For compactness, we also denote the pose of a feature as . As a result,
The surface normal at is computed from the nearest neighbours of using a PCA-based method (e.g. ). The surface descriptors are the local principal curvatures . Their directions are denoted , and the curvatures along and form a -dimensional feature vector . The surface normal and the principal directions define the orientation of a frame that is associated with the point .
Iii-C Object Model
Thus, given a point cloud of object , a set of features can be computed. This set of features defines, in turn, a joint probability distribution, which we call the object model:
where is short for , , is defined in Eq. (4) with bandwidth , and where all weights are equal, . In summary this object model represents the object as a pdf over surface normals and curvatures.
Iv Learned Models
We now describe the representations for each of the three models that are learned from a set of grasp examples. We start with the contact receptive field, proceed with the contact model, the equilibrium state hand configuration model, and finish with the reach to grasp model.
Iv-a Contact Receptive Field
The contact receptive field is a region of space relative to the associated robot link (see Fig. 2) which specifies the neighbourhood of a particular robot link. The contact receptive field is realised as a function of surface feature pose :
the value of which determines the relevance of a particular surface feature to a given link in terms of the likelihood of the physical contact. We use contact receptive fields which are family of parameterised functions for which the value falls off quickly with the distance to the link:
where and is the point on the surface of that is closest to . This means that the contact contact receptive field will only take account of the local shape, while falling off smoothly. A variety of monotonic, fast declining functions could be used instead.
Iv-B Contact Model
Let us assume that we have as many grasp examples of the same grasp type as the number of objects . We denote by the pose of relative to the pose of the -th object feature. In other words, is defined as
where denotes the pose of , denotes the pose composition operator, and is the inverse of , with (see Fig. 2).
Contact model encodes the joint probability distribution of surface features of object and of the 3D pose of the -th hand link in the equilibrium state. Let us consider the hand as having grasped given training object . The contact model for link is denoted by
where is short for , is the random variable modelling surface features of object , and models the pose of relative to the frame of reference defined by the directions of principal curvature and the surface normal. In other words, denoting realisations of and by and , is proportional to the probability of finding at pose relative to the frame of a nearby object surface patch that exhibits features (principal curvatures) equal to .
The contact model of object is estimated as
where is a normalising constant and is kernel function (4) defined at poses from Eq. (12), and where we have performed integration over all kernels of the object model (9) which uniquely determine poses .
We also introduce the idead of a contact model norm, which estimates the likelihood of a physical contact of link with surface features of object :
where is a normalising constant. We use this norm to help estimate which links are reliably involved in an grasp equilibrium state when there are several example grasp trajectories.
Iv-C Contact Model Selection
The overall number of non-empty contact models for some example grasp is usually smaller than the number of links of the robot hand. The contact selection procedure determines which contact models should be instantiated for a given grasp type. The issue is complex because we require multiple grasp trajectories to learn from. Because of the variability in how the hand and object interact, it is then non-trivial to determine which finger links reliably participate in the equilibrium state of the grasp. The contact selection procedure determines this. The procedure starts with comparing contact model norms (15) to the average model norm for all grasp examples and all robot links. The results of this comparison are stored in binary variables :
where is a threshold constant, is the number of hand links and is the number of training objects.
The contact model is instantiated if the total number of non-empty contact models, determined by , is higher than some minimum value given the total number of objects or training grasps . This is encoded in binary variable :
where is a threshold constant. The contact model is then constructed as a mixture of as follows:
The above procedure is performed independently for each grasp type . Each grasp type may also involve a different set of objects. We denote the set of contact models learned for a particular grasp type as . The parameters , , and , , were chosen empirically. The time complexity for learning each contact model from an example grasp is where is the number of triangles in the tri-mesh describing hand link , and is the number of points of object .
Iv-D Equilibrium State Hand Configuration Model
The equilibrium state hand configuration model, denoted , for the grasp examples within the set of training examples of a particular grasp type . The purpose of this model is to restrict the grasp search space (during grasp transfer) to hand configurations that resemble those observed during training. We combine the configurations for the examples to create a single mixture model density:
This expresses a density over hand configurations in the equilibrium state for a grasp type .
Iv-E Reach to Grasp Model
For a particular grasp type, in addition to modelling the equilibrium states of the hand, we must also model the trajectories taken to reach those equilibrium states. A single reach to grasp trajectory for an underactuated hand has three elements: the tool centre point (wrist) trajectory, the hand configuration trajectory, and the motor signal trajectory. We assume that a trajectory starts at time and ends in the equilibrium state at time . We denote the wrist trajectory , the hand configuration trajectory , and the motor signal trajectory respectively. The motor signal can be a wide variety of signals in practice. Here we choose it to be the position of the single actuator. When the hand is not in contact with an object the motor signal and the wrist pose together determine the hand configuration. When in contact, the actual hand configuration will differ. The reach to grasp model is simply the concatenation of each component . The set of reach to grasp trajectory models define an attractor basin, leading towards the final hand configuration. Although this notion of using the motor command sequence is simple, it is essential to encode the envelope of hand behaviours that will lead to a stable equilibrium grasp.
In the next section we explain how we gather the grasp examples that are used to learn these models. Then in Section VI the inference method—by which the models are used to generate grasps for new objects—is described.
V Data Generation
There are several ways to implement underactuation in a dexterous hand. In this paper we employ an approach based on adaptive synergy transmission, due to its simplicity and robust design, and its ability for complex interaction with the environment. The Pisa/IIT SoftHand  implements such a transmission mechanism. This hand has 19 degrees of freedom (DoF) distributed over four fingers and an opposable thumb, but only 1 degree of actuation (DoA). The synergy motion of the hand in free space has been derived from databases of human hand postures. The overall behaviour parameters are the matrices that correspond to the transmission ratio, , to the joint stiffness, . The actuation is done through a single tendon routed throuh all joints, making the fingers flex and abduct.
Moving such a hand to grasp an object results in a hard-to-predict contact and hand shapes due to the adaptivity. We thus generate a variety of grasp examples to cover a portion of the interaction space. However, recording many trajectories of all the finger elements that affect the grasp in the real world is non-trivial. For this reason, we generate the example interactions for training using a rigid-body physics simulator, where these problems are avoided. The main two simulation elements we have developed are the contact stability model and the hand behavior model. In the case of the Pisa/IIT softhand, the latter depends heavily on the former. We used the standard distribution of Gazebo and Open Dynamic Engine, both in widespread use. The adaptive synergy equations have been implemented as a plugin to these, and accompany the proper kinematic description of the Pisa/IIT SoftHand111The Pisa/IIT SoftHand ROS/Gazebo packages are available at https://github.com/CentroEPiaggio/pisa-iit-soft-hand.
At the current state, there are no generally accepted measures concerning whether a grasp by an underactuated hand is good or not, hence the lack of robust grasp planners for them is not a surprise. Thus, generating a large dataset at this point is useless, and there are plans in the future to cover this area. As a result, we generated the examples by guiding manually the hand to a “nice” grasp as shown in Fig. 3. In this simpler scenario, we assume without loss of generality that the grasps are labelled by type. In our example dataset, we have three grasp types namely pinch, rim and by-the-handle. The main difference between pinch and rim is the fingers configuration w.r.t. the top border of objects. In the pinch grasp, the thumb goes inside whereas in the rim grasp, the fingers go inside. In the latter, the container can be filled with liquid while holding, for instance. For each grasp example, the corresponding dataset comprises the set of trajectories as described in the previous section.
After acquiring the models from a set of training grasps, we present the robot with a test (query) object. The goal is to find a generalisation of the training grasp that is likely according to all of the model types simultaneously. First of all, we obtain a point cloud for the test object, and thus an object model. We then combine every contact model with that object model, so as to obtain a set of query densities, one for each link with a contact model defined for the example grasp. The -th query density is a density modelling where the -th link can be placed, in the equilibrium state, with respect to the surface of a new object.
From the query densities, a candidate equilibrium grasp state is generated. This is then augmented with a reach to grasp trajectory that finishes close to the candidate equilibrium grasp state. Finally we refine the equilibrium grasp and reach to grasp by performing a simulated annealing search in the space of equilibrium state wrist poses and hand configurations, so as to maximise the grasp likelihood. We repeat the entire process many times. This procedure generates many possible grasps, ranked by likelihood. We give details below.
Vi-a Query Density
A query density is, for a hand link and an object model, a density over the pose of that hand link relative to the object. Intuitively the query density encourages a finger link to make contact with the object at locations that have similar local surface curvature to that in the training example. Specifically, we use kernels centred on the set of weighted finger link poses:
with -th kernel centre , and weights are normalised . When a test object is presented, a set of query densities is calculated for the equilibrium state for each training grasp for the grasp type . The set has members, one for each contact model in . We estimate the query density using a Monte Carlo procedure detailed in Alg.VI-A.
Vi-B Equilibrium Grasp Generation
Once query densities have been created for the new object for each training example, an initial set of equilibrium state grasps is generated for each grasp type . For each candidate equilibrium grasp of a particular grasp type we proceed as follows. First an example grasp is selected at random. Then a finger link is selected at random. This ‘seed’ link indexes its query density . A link pose is then sampled from that query density. Then an equilibrium state hand configuration is sampled from . Together the seed link and the hand configuration define a complete equilibrium state hand pose in the workspace via forward kinematics. This is an initial ‘seed’ grasp, which will subsequently be refined. A large set of such initial solutions is generated, where means the initial solution for grasp type .
Vi-C Reach to Grasp Generation
Given an equilibrium grasp, a reach to grasp trajectory is selected and adapted to maximise the chance of reaching that equilibrium grasp state. Specifically, we sample a reach to grasp model according to a multinomial probability distribution created by the normalised values of a Gaussian centred on the candidate equilibrium grasp . To align the selected reach to grasp to the candidate equilibrium grasp, the wrist trajectory is trivially redefined to be relative to the frame . Then the configuration trajectory only is warped (see Fig. 4) so that it smoothly shifts from to from the beginning to the end of the trajectory. Having generated an initial solution set stages of optimisation and selection are then interleaved.
Vi-D Grasp Optimisation
The objective of the grasp optimisation steps is, given a candidate equilibrium grasp and a reach to grasp model, to find a grasp that maximises the product of the likelihoods of the query densities and the hand configuration density
where is the overall likelihood, where is the hand configuration model (19), are query densities (20). Thus whereas each initial grasp is generated using only a single query density, grasp optimisation requires evaluation of the grasp against all query densities. It is only in this improvement phase that all query densities must be used. Improvement is by simulated annealing (SA) . The SA temperature is declined linearly from to over the steps. In each time step, one step of simulated annealing is applied to every grasp in .
Vi-E Grasp Selection
At predetermined selection steps (here steps 1 and 50 of annealing), grasps are ranked and only the most likely retained for further optimisation. During these selection steps the criterion in (21) is augmented with an additional expert penalising collisions for the entire reach to grasp trajectory in a soft manner. This soft collision expert has a cost that rises exponentially with the greatest degree of penetration through the object point cloud by any of the hand links. We thus refine Eq. 21:
where is now factorised into three parts, which evaluate the collision, hand configuration and query density experts, all at a given hand pose . A final refinement of the selection criterion is due to the fact the number of links involved in a grasp varies across grasp types. Thus the number of query densities , for different grasp models also varies, and so the values of and cannot be compared directly. Given the grasp with the maximum number of involved links , we therefore normalise the likelihood value (22) with
It is this normalised likelihood that is used to rank all the generated grasps across all the grasp types during selection steps. After simulated annealing has yielded a ranked list of optimised grasp poses, they are checked for reachability given other objects in the workspace, and unreachable poses are pruned.
Vi-F Grasp Execution
The remaining best scoring hand pose is then used to generate a reach to grasp trajectory. Since the hand is underactuated this consists of the wrist pose trajectory, and the motor signal trajectory. This is the command sequence that is executed on the robot.
The experiments were conducted as follows. Training consisted of nine example grasps, executed in simulation, with a human in control. These nine grasps were grouped into three grasp types (rim, pinch, and handle). The rim and pinch grasp types were trained on the colander object, and the handle grasp type was demonstrated on the saucepan. During testing an object was placed on the table. Every grasp type was compared automatically, and one selected for execution according to the methods described above. The models of the test objects consisted of a point cloud taken from just one view. Thus reconstructions were partial, typically less that 25% of the object’s surface area. No test objects had been seen previously by the robot, and it can be seen from Fig. 6. Fifteen test objects were presented, and 12 of the 15 test grasps succeeded, giving a generalisation success rate of 80%. While the difference is not statistically significant, this is slightly higher than the 77.7% success rate we recorded on a larger test set for a fully actuated hand also working from a single view of an object .
Finally, it is worth noting the similiarity between the location of the actual and targetted final grasp states. In a majority of cases the grasp involved interactions with the object, moving it to the stable grasp pose. This is a natural property of the hand, but it might have been supposed that the learning method would not be robust to such interactions in terms of the accuracy of the grasp.
-  A. Saxena, L. Wong, and A. Ng, “Learning grasp strategies with partial shape information,” in Proceedings of AAAI. AAAI, 2008, pp. 1491–1494.
-  R. Detry, C. H. Ek, M. Madry, and D. Kragic, “Learning a dictionary of prototypical grasp-predicting parts from grasping experience,” in International Conference on Robotics and Automation. IEEE, 2013, pp. 601–608.
-  A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, J. Bohg, T. Asfour, and S. Schaal, “Learning of grasp selection based on shape-templates,” Autonomous Robots, vol. 36, no. 1-2, pp. 51–65, 2014.
-  O. Kroemer, E. Ugur, E. Oztop, and J. Peters, “A kernel-based approach to direct action perception,” in IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 2605–2610.
-  H. Ben Amor, O. Kroemer, U. Hillenbrand, G. Neumann, and J. Peters, “Generalization of human grasping for multi-fingered robot hands,” in International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 2043–2050.
-  U. Hillenbrand and M. Roa, “Transferring functional grasps through contact warping and local replanning,” in IEEE/RSJ International Conference on Robotics and Systems. IEEE, 2012, pp. 2963–2970.
-  M. G. Catalano, G. Grioli, E. Farnioli, A. Serio, C. Piazza, and A. Bicchi, “Adaptive synergies for the design and control of the Pisa/IIT SoftHand,” The International Journal of Robotics Research, vol. 33, no. 5, pp. 768–782, 2014.
-  A. M. Dollar and R. D. Howe, “The highly adaptive SDM hand: Design and performance evaluation,” The International Journal of Robotics Research, vol. 29, no. 5, pp. 585–597, 2010.
-  C. Eppner and O. Brock, “Planning grasp strategies that exploit environmental constraints,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2015.
-  M. Bonilla, D. Resaco, M. Gabiccini, and A. Bicchi, “Grasp planning with soft hands using bounding box object decomposition,” in In Proceedings of the IEEE International Conference of Intelligent Robots and Systems, 2015.
-  B. W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
-  R. A. Fisher, “Dispersion on a sphere,” in Proc. Roy. Soc. London Ser. A., vol. 217, no. 1130. Royal Society, 1953, pp. 295–305.
-  E. B. Sudderth, “Graphical models for visual object recognition and tracking,” Ph.D. dissertation, MIT, Cambridge, MA, 2006.
-  K. Kanatani, Statistical optimization for geometric computation: theory and practice. Courier Dover Publications, 2005.
-  M. Spivak, A comprehensive introduction to differential geometry. Publish or Perish Berkeley, 1999, vol. 1.
-  S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
-  M. Kopicki, R. Detry, M. Adjigble, A. Leonardis, and J. L. Wyatt, “One shot learning and generation of dexterous grasps for novel objects,” International Journal of Robotics Research, 2015.