Multi-Modal Geometric Learning for Grasping and Manipulation
This work presents an architecture that incorporates depth and tactile information to create rich and accurate 3D models useful for robotic manipulation tasks. This is accomplished through the use of a 3D convolutional neural network (CNN). Offline, the network is trained on both depth and tactile information to predict an object’s geometry, filling in regions of occlusion. At runtime, the network is provided a partial view of an object, and tactile information is acquired to augment the captured depth information. The network then reasons about the object’s geometry using both the collected tactile and depth information. We demonstrate that even small amounts of additional tactile information substantially improve reasoning about object geometry, particularly when depth information alone fails to produce an accurate geometric prediction. Our method is benchmarked against, and outperforms, other visual-tactile approaches to general geometric reasoning. We also provide experimental results comparing grasping success with our method.
Robotic grasp planning based on raw sensory data is difficult due to occlusion and incomplete information about scene geometry. Often a single sensory modality does not provide enough context to enable reliable planning. For example, a single depth sensor image cannot provide information about occluded regions of an object, and tactile information is spatially sparse. This work utilizes a 3D convolutional neural network to enable stable robotic grasp planning by incorporating both tactile and depth information to infer occluded geometries. This multi-modal system uses both modalities to form a more complete model of the space the robot can interact with, and to provide a complete object model for grasp planning.
At runtime, a point cloud of the visible portion of the object is captured, and multiple guarded moves are executed in which the hand is moved towards the object, stopping when contact with the object occurs. The newly acquired tactile information is combined with the original partial view, voxelized, and sent through the CNN to create a hypothesis of the object’s geometry.
Depth information from a single point of view often does not provide enough information to accurately predict object geometry. There is often unresolved uncertainty about the geometry of the occluded regions of the object. To alleviate this uncertainty, we utilize tactile information to generate a new, more accurate hypothesis of the object’s 3D geometry, incorporating both visual and tactile information. Fig. 1 demonstrates an example where the understanding of the object’s 3D geometry is significantly improved by the additional sparse tactile data collected via our framework. An overview of our sensory fusion architecture is shown in Fig. 2.
The contributions of this work include: 1) a framework for integrating multi-modal sensory data to reason about object geometry and enable robotic grasping, 2) an open source dataset for training a shape completion system using both tactile and depth sensory information, 3) open source code for alternative visual-tactile general completion methods, 4) experimental results comparing the completed object models using depth only, the combined depth-tactile information, and various other visual-tactile completion methods, and 5) real and simulated grasping experiments using the completed models. This dataset, code, and extended video are freely available at http://crlab.cs.columbia.edu/visualtactilegrasping/.
II Related Work
The idea of incorporating sensory information from vision, tactile, and force sensors is not new. Despite the intuitive appeal of using multi-modal data, there is still no consensus on which framework best integrates multi-modal sensory information in a way that is useful for robotic manipulation tasks. In this work, we are interested in reasoning about object geometry and, in particular, in creating models from multi-modal sensory data that can be used for grasping and manipulation.
Several recent uses of tactile information to improve estimates of object geometry have focused on Gaussian Process Implicit Surfaces (GPIS), with a number of works following this line of research. This approach can quickly incorporate additional tactile information and improve the estimate of the object’s geometry local to the tactile contacts or observed sensor readings. There have additionally been several works that incorporate tactile information to better fit planes of symmetry and superquadrics to observed point clouds. These approaches work well when interacting with objects that have clear, detectable planes of symmetry or are easily modeled as superquadrics.
There has been successful research in utilizing continuous streams of visual information, similar to KinectFusion or SLAM, in order to improve models of 3D objects for manipulation. In one line of work, the authors build 3D models of unknown objects using a depth camera that observes the robot’s hand while it moves an object. The approach integrates both shape and appearance information into an articulated ICP framework to track the robot’s manipulator and the object while improving the 3D model of the object. Similarly, another work attaches a depth sensor to a robotic hand and plans grasps directly in the sensed voxel grid. These approaches improve their object models using only a single sensory modality, but from many points in time.
In previous work, we created a shape completion method using single depth images. That work provides an architecture to enable robotic grasp planning via shape completion, accomplished through the use of a 3D CNN. The network is trained on an open source dataset of over 440,000 3D exemplars captured from varying viewpoints. At runtime, a 2.5D point cloud captured from a single point of view is fed into the CNN, which fills in the occluded regions of the scene, allowing grasps to be planned and executed on the completed object. Runtime shape completion is rapid because most of the computational cost of shape completion is borne during offline training. This prior work explored how the quality of completions varies based on several factors: whether or not the object being completed existed in the training data, how many object models were used to train the network, and the ability of the network to generalize to novel objects, allowing the system to complete previously unseen objects at runtime. The completions are still limited by the training datasets and by occluded views that give no clue to the unseen portions of the object. From a human perspective, this problem is often alleviated by the sense of touch. In this spirit, this paper addresses the issue by incorporating sparse tactile data to better complete object models for grasping tasks.
III Visual-Tactile Geometric Reasoning
Our framework utilizes a trained CNN to produce a mesh of the target object, incorporating both depth and tactile information. The architecture of the CNN is shown in Fig. 3. The model was implemented using the Keras deep learning library with a Theano backend. Each layer used rectified linear units as nonlinearities, except the final fully connected (output) layer, which used a sigmoid activation to restrict the output to the range $[0, 1]$. We used the cross-entropy error as the cost function, with target $t$ and output $o$:

$$E = -\frac{1}{N}\sum_{n=1}^{N}\left[\, t_n \log(o_n) + (1 - t_n)\log(1 - o_n) \,\right]$$
This cost function encourages each output to be close to either 0 for unoccupied target voxels or 1 for occupied. The Adam optimization algorithm, which computes adaptive learning rates for each network parameter, was used with default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), except for the learning rate, which was set to 0.0001. Weights were initialized following standard recommendations for rectified linear units and for the logistic activation layer. The model was trained with a batch size of 32.
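Written out, this cost is the standard voxel-wise binary cross-entropy averaged over the grid. A minimal NumPy sketch (the clipping epsilon is an implementation detail added here for numerical safety, not a parameter from the paper):

```python
import numpy as np

def cross_entropy_cost(target, output, eps=1e-7):
    """Voxel-wise binary cross-entropy between a target occupancy grid
    (0 = unoccupied, 1 = occupied) and the network's sigmoid output."""
    t = np.asarray(target, dtype=np.float64).ravel()
    # Clip to avoid log(0) at saturated outputs.
    o = np.clip(np.asarray(output, dtype=np.float64).ravel(), eps, 1 - eps)
    return -np.mean(t * np.log(o) + (1 - t) * np.log(1 - o))
```

The cost is minimized when each output matches its target voxel exactly, which is what drives the sigmoid outputs toward 0 or 1.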
We used the Jaccard similarity to evaluate the similarity between a generated voxel occupancy grid and the ground truth. The Jaccard similarity between sets $A$ and $B$ is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard similarity has a minimum value of 0, where A and B have no intersection, and a maximum value of 1, where A and B are identical. The CNNs were trained with an NVIDIA Titan X GPU.
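Over voxel grids, the sets A and B are simply the occupied voxels of each grid, so the metric reduces to element-wise boolean operations. A sketch:

```python
import numpy as np

def jaccard_similarity(grid_a, grid_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two boolean
    voxel occupancy grids of the same shape."""
    a = np.asarray(grid_a, dtype=bool)
    b = np.asarray(grid_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty grids are identical by convention
    return np.logical_and(a, b).sum() / union
```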
IV Completion of Simulated Geometric Shapes
IV-A Geometric Shape Dataset
In order to evaluate our system’s ability to utilize additional tactile sensory information to reason about 3D geometry, initial experimentation was done on a toy geometric shape dataset consisting of conjoined half-shapes. Both the front and back halves of each object were randomly chosen to be either a sphere, cube, or diamond of varying sizes; the front and back halves match in size. An example shape, a half-cube half-sphere, is shown in Fig. 4(b). Synthetic sensory data was then generated for these example shapes and embedded in a voxel grid. Depth information was captured from a fixed camera location, and tactile data was generated by intersecting 3 rays with the object. The rays originated at (13, 20, 40), (20, 20, 40), and (26, 20, 40), and traveled along a fixed direction until either contact occurred with the object or the ray left the voxelized volume. Sensory data for a shape is shown in Fig. 4(a).
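A minimal sketch of how such a conjoined half-shape can be voxelized, here for a half-sphere/half-cube split along the depth axis (the grid size and half-extent below are illustrative choices, not the dataset’s actual parameters):

```python
import numpy as np

def half_sphere_half_cube(dim=40, r=10):
    """Occupancy grid for a toy conjoined shape: a sphere half on one
    side of the split plane and a cube half on the other. The grid
    dimension and half-extent r are illustrative assumptions."""
    c = dim // 2
    x, y, z = np.indices((dim, dim, dim))
    sphere = (x - c) ** 2 + (y - c) ** 2 + (z - c) ** 2 <= r ** 2
    cube = (np.abs(x - c) <= r) & (np.abs(y - c) <= r) & (np.abs(z - c) <= r)
    # Front half (z < c) taken from the sphere, back half (z >= c) from the cube.
    return np.where(z < c, sphere, cube)
```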
[Table I. Completion results by method, with columns Depth Only and Tactile & Depth.]
IV-B Depth and Tactile Completion
Three networks with the same architecture, as shown in Fig. 3, were trained on this dataset using different sensory inputs. We initially trained a network using only tactile information; as expected, it performed poorly due to the sparsity of that information. A second network was given only depth information during training; it performed better than the tactile-only network, but it still encountered many situations in which it did not have enough information to accurately complete the occluded half of the object. A third network was given both depth and tactile information, with the sparse tactile data generated using the tactile exploration method described in Fig. 4. This network successfully utilized the tactile information to differentiate between plausible geometries of occluded regions; the results are shown in Table I. This task demonstrated that a CNN can be trained to leverage sparse tactile information in order to decide between multiple object geometry hypotheses. For example, when the object geometry had sharp edges in its occluded region, the system would use tactile information to generate a completion containing similar sharp edges there. Such a completion is more accurate not only in the observed region of the object but also in the unobserved portion.
V Completion of YCB and Grasp Dataset Objects
After demonstrating efficacy on the toy geometric shape dataset, we created a new dataset consisting of half a million triplets of oriented voxel grids: depth, tactile, and ground truth. Depth voxels are marked as occupied if visible to the camera. Tactile voxels are marked if tactile contact occurs within the voxel. Ground truth voxels are marked as occupied if the object intersects a given voxel, independent of perspective. The point clouds for the depth information were synthetically rendered in the Gazebo simulator. This dataset consists of 608 meshes from both the Grasp and YCB datasets. 486 of these meshes were randomly selected for a training set, and the remaining 122 meshes were kept as a holdout set.
The synthetic tactile information was generated according to Algorithm 1. In order to generate tactile data, the voxelization of the ground truth high resolution mesh (vox_gt) (Alg.1:L1) was aligned with the captured depth image (Alg.1:L4). 40 random points were sampled in order to generate synthetic tactile data (Alg.1:L5-6). For each of these points (Alg.1:L7), a ray was traced along a fixed direction, and the first occupied voxel was stored as a tactile observation (Alg.1:L11). Finally, this set of tactile observations was converted back to a point cloud (Alg.1:L13).
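The sampling procedure above can be sketched as follows. The ray direction (here marching down the z axis) and the random sampling over the grid’s top face are assumptions for illustration, since Algorithm 1’s exact conventions are not reproduced in this text:

```python
import numpy as np

def sample_tactile_voxels(vox_gt, num_samples=40, rng=None):
    """Sketch of Algorithm 1's tactile sampling: for each randomly
    sampled (x, y) location, trace a ray through the ground-truth
    occupancy grid and record the first occupied voxel hit."""
    rng = np.random.default_rng(rng)
    dim_x, dim_y, dim_z = vox_gt.shape
    contacts = []
    for _ in range(num_samples):
        x = int(rng.integers(0, dim_x))
        y = int(rng.integers(0, dim_y))
        # March the ray from the far z boundary toward z = 0.
        for z in range(dim_z - 1, -1, -1):
            if vox_gt[x, y, z]:
                contacts.append((x, y, z))
                break  # stop at first contact, like a guarded move
    return contacts
```

Rays that exit the grid without hitting an occupied voxel simply record no contact, mirroring a guarded move that reaches its travel limit.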
Again, two identical CNNs were trained: one CNN was provided only the depth information (Depth Only), and a second was provided both the tactile and depth information (Tactile and Depth). During training, performance was evaluated on simulated views of meshes within the training data (Training Views), novel simulated views of meshes in the training data (Holdout Views), novel simulated views of meshes not in the training data (Holdout Meshes), and real views of 8 meshes from the YCB dataset (Holdout Live). Fig. 5 shows examples from the three simulated splits.
The Holdout Live examples consist of depth information captured from a real Kinect and tactile information captured from a real Barrett Hand attached to a Staubli Arm. The object was fixed in place during the tactile data collection process. While collecting the tactile data, the arm was manually moved to place the end effector behind the object, and 6 exploratory guarded motions were made where the fingers closed towards the object. Each finger stopped independently when contact was made with the object, as shown in Fig. 6.
Fig. 6(a) and Fig. 6(b) show that as the CNNs are exposed to harder completion problems, the predicted completions have lower Jaccard similarity scores. Fig. 6(b) also demonstrates that the difference between the Depth Only CNN completion and the Tactile and Depth CNN completion becomes larger on more difficult completion problems. The performance of the Depth Only CNN nearly matches the performance of the Tactile and Depth CNN on the training views. This is because these views are used during training, so at runtime in both cases, the network is provided enough information to generate a reasonable completion. Moving from Holdout Views to Holdout Meshes to Holdout Live, the completion problems move further away from the examples experienced during training. As the problems become harder, the Tactile and Depth network outperforms the Depth Only network by a greater margin, as it is able to utilize the sparse tactile information to differentiate between various possible completions. This trend shows that the network is able to make more use of the tactile information when the depth information alone is insufficient to generate a quality completion.
V-A Mesh Generation
Alg. 2 shows how we merged the dense partial view with our voxel grid; more detail on this method is available in our prior work.
In order to merge with the partial view, the output of the CNN is converted to a point cloud, and its density is compared to the density of the partial view point cloud (Alg.2:L4). The CNN output is up-sampled to match the density of the observed point cloud (Alg.2:L5). The upsampled output from the CNN is then merged with the observed point cloud of the partial view, and the combined cloud is voxelized at the new, higher resolution (Alg.2:L6). Any gaps in the voxel grid between the upsampled CNN output and the observed partial view cloud are filled (Alg.2:L7). The voxel grid is smoothed using a CUDA implementation of a convex quadratic optimization formulation (Alg.2:L8).
The weighted voxel grid is then run through marching cubes (Alg.2:L9).
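The upsample-and-merge steps can be sketched as follows. The nearest-neighbor upsampling and the factor of 4 are illustrative assumptions; in the actual pipeline, the factor is chosen to match the observed cloud’s density, and the merged grid then passes through gap filling, smoothing, and marching cubes:

```python
import numpy as np

def upsample_and_merge(cnn_grid, observed_grid_hi, factor=4):
    """Sketch of Alg. 2's merge: the coarse CNN occupancy grid is
    nearest-neighbor upsampled to the resolution of the voxelized
    partial view, then the two grids are combined by union."""
    hi = np.asarray(cnn_grid, dtype=bool)
    for axis in range(3):
        hi = np.repeat(hi, factor, axis=axis)  # nearest-neighbor upsample
    if hi.shape != observed_grid_hi.shape:
        raise ValueError("upsampled grid must match the observed resolution")
    # Union keeps both the predicted occupancy and the directly observed points.
    return np.logical_or(hi, observed_grid_hi)
```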
VI Comparison to Alternative Visual-Tactile Completion Methods
VI-A Alternative Visual-Tactile Completion Methods
In this work, we benchmarked our framework against the following general visual-tactile completion methods.
Partial Completion: The set of points captured from the Kinect is concatenated with the tactile data points. The combined cloud is run through marching cubes, and the resulting mesh is then smoothed using Meshlab’s implementation of Laplacian smoothing. These completions are highly accurate where the object is directly observed but make no predictions in unobserved areas of the scene.
Convex Hull Completion: The set of points captured from the Kinect is concatenated with the tactile data points. The combined cloud is run through QHull to create a convex hull. The hull is then run through Meshlab’s implementation of Laplacian smoothing as well. These completions are reasonably accurate near observed regions. However, a convex hull will fill regions of unobserved space.
Gaussian Process Implicit Surface (GPIS) Completion: Approximate depth cloud normals were calculated using PCL’s KDTree normal estimation, and approximate tactile cloud normals were computed to point towards the camera origin. The depth point cloud was downsampled and appended to the tactile point cloud. We used a distance offset to add positive and negative observation points along the direction of each surface normal. We then sampled the Gaussian process over a voxel grid, with a noise parameter, to create meshes from the point cloud. The voxel grid resolution and noise parameter were determined empirically by sampling the Jaccard similarity of GPIS completions over several candidate values, choosing a setting that offered a good tradeoff between speed and completion quality.
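The core GPIS idea — regressing a signed implicit function whose zero level set is the surface, from surface points (value 0) and offset points along the normals (value ±d) — can be sketched with a plain NumPy Gaussian process. The RBF kernel, length scale, and noise below are illustrative assumptions, not the parameters used in the benchmark:

```python
import numpy as np

def gpis_predict(points, values, queries, length_scale=0.5, noise=1e-8):
    """Minimal GPIS sketch: an RBF-kernel Gaussian process regresses a
    signed implicit function. Surface samples carry value 0; points offset
    outward/inward along the normals carry +d/-d. Negative predictions at
    query points are treated as inside the object."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(values, dtype=float)
    Q = np.asarray(queries, dtype=float)

    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * length_scale ** 2))

    K = rbf(X, X) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)   # fit: K alpha = y
    return rbf(Q, X) @ alpha        # posterior mean at the queries
```

Thresholding the posterior mean at zero over a voxel grid yields an occupancy grid from which a mesh can be extracted with marching cubes.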
In prior work, the Depth Only CNN completion method was compared to both a RANSAC-based approach and a mirroring approach. These approaches make assumptions about the visibility of observed points and do not work with tactile contacts that occur in unobserved regions of the workspace.
VI-B Geometric Comparison Metrics
The Jaccard similarity was used to compare CNN outputs with the ground truth, as shown in Fig. 6(a). We also used this metric to compare the final meshes resulting from the several completion strategies. The completed meshes were voxelized and compared with the ground truth mesh. The results are shown in Table \subref{tab:Jaccard}. Our proposed method results in higher similarity to the ground truth meshes than all other described approaches.
The Hausdorff distance metric computes the average distance from the surface of one mesh to the surface of another. A symmetric Hausdorff distance was computed with Meshlab’s Hausdorff distance filter, run in both directions. Table \subref{tab:Hausdorff} shows the mean values of the symmetric Hausdorff distance for each completion method. Under this metric, our Tactile and Depth CNN mesh completions are significantly closer to the ground truth than the other approaches’ completions.
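A point-based sketch of this symmetric mean-distance computation (real meshes would first be densely sampled into point sets; Meshlab’s filter operates on the mesh surfaces directly):

```python
import numpy as np

def symmetric_hausdorff(points_a, points_b):
    """Mean nearest-neighbor distance computed in both directions and
    averaged, over two point sets sampled from mesh surfaces."""
    A = np.asarray(points_a, dtype=float)
    B = np.asarray(points_b, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    a_to_b = d.min(axis=1).mean()   # mean distance from A's points to B
    b_to_a = d.min(axis=0).mean()   # mean distance from B's points to A
    return 0.5 * (a_to_b + b_to_a)
```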
Both the partial and Gaussian process completion methods are accurate close to the observed points but fail to approximate geometry in occluded regions. We found that the Gaussian process completion method would often create a large, unruly object if the observed points covered only a small portion of the object or if no tactile points were observed in simulation. Using a neural network has the added benefit of learning abstractions of object geometries, whereas the alternative completion methods fail to approximate the geometry of objects whose extent is not bounded by observed points.
VI-C Grasp Comparison in Simulation
In order to evaluate our framework’s ability to enable grasp planning, the system was tested in simulation using the same set of completions. The use of simulation allowed for the quick planning and evaluation of 7900 grasps. GraspIt! was used to plan grasps on all of the completions of the objects by uniformly sampling different approach directions. These grasps were then executed not on the completed object but on the ground truth meshes in GraspIt!. In order to simulate a real-world grasp execution, the completion was removed from GraspIt! and the ground truth object was inserted in its place. Then the hand was placed 20 cm away from the ground truth object along the approach direction of the grasp. The spread angle of the fingers was set, and the hand was moved along the approach direction of the planned grasp either until contact was made or a maximum approach distance was traveled. At this point, the fingers closed to the planned joint values. Then each finger continued to close until either contact was made with the object or the joint limits were reached.
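The guarded closing of each finger in this execution procedure can be sketched generically; the contact check and step size below are hypothetical stand-ins for the simulator’s collision query and joint controller:

```python
def guarded_close(contact_fn, start_angle, joint_limit, step=0.01):
    """Toy sketch of a guarded closing motion: a finger joint advances in
    small increments until a (hypothetical) contact check fires or the
    joint limit is reached, then reports the final joint angle."""
    angle = start_angle
    while angle < joint_limit and not contact_fn(angle):
        angle = min(angle + step, joint_limit)
    return angle
```

Running this per joint, after the approach move and spread-angle setup, mirrors how each finger stops independently on contact.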
Table \subref{tab:sim_grasp_results} shows the average difference between the planned and realized Cartesian fingertip and palm poses, while Table \subref{tab:l2_joint_error} shows the joint-space error between the planned and realized grasps, averaged over the 7 joints of the hand. Using our method, the end effector ended up closer to its intended location in terms of both joint space and the palm’s Cartesian position.
VI-D Live Grasping Results
To further test our network’s efficacy, grasps were planned and executed on the Holdout Live views using a Staubli arm with a Barrett Hand mounted at its wrist. The grasps were planned via the different completion methods described above; for each of the 8 objects, we ran the arm once using each completion method. The results are shown in Table \subref{tab:live_grasp_results}. Our method improved over the general visual-tactile shape completion methods in grasp success rate and resulted in executed grasps closer to the planned grasps, as shown by the lower average joint error. While the lift success rate of the Tactile and Depth CNN completions was equal to that of the Gaussian process completions, our method constructed the object geometry significantly faster, as shown in Table \subref{tab:live_grasp_timings}, and had a much lower average joint error, as shown in Table \subref{tab:live_grasp_results}. These lifts on 8 objects show that our completion method performs better in timing, completion quality, and grasp quality for objects the network has not previously observed. The completions are shown in Fig. 8.
We have developed a framework for geometric learning and reasoning utilizing a CNN trained from a large simulated dataset of rendered depth and tactile data. The CNN uses both the dense depth information and the sparse tactile information to fill in occluded regions of an object. Experimental results show that using both tactile and visual data provides more accurate completion than using either vision or tactile data alone. In addition, our approach outperforms other general visual-tactile completion methods in the fidelity of the object models created and in using these models for grasping and manipulation tasks.
-  B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, “The ycb object and model set: Towards common benchmarks for manipulation research,” in Advanced Robotics (ICAR), 2015 International Conference on. IEEE, 2015, pp. 510–517.
-  P. K. Allen, A. Miller, B. Leibowitz, and P. Oh, “Integration of vision, force and tactile sensing for grasping,” Int. Journal of Intelligent Mechatronics, vol. 4, no. 1, pp. 129–149, 1999.
-  O. Williams and A. Fitzgibbon, “Gaussian process implicit surfaces,” Gaussian Proc. in Practice, pp. 1–4, 2007.
-  S. Caccamo, Y. Bekiroglu, C. H. Ek, and D. Kragic, “Active exploration using gaussian random fields and gaussian process implicit surfaces,” in IROS. IEEE, 2016, pp. 582–589.
-  Z. Yi, R. Calandra, F. Veiga, H. van Hoof, T. Hermans, Y. Zhang, and J. Peters, “Active tactile object exploration with gaussian processes,” in IROS. IEEE, 2016, pp. 4925–4930.
-  M. Bjorkman, Y. Bekiroglu, V. Hogman, and D. Kragic, “Enhancing visual perception of shape through tactile glances,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 3180–3186.
-  S. Dragiev, M. Toussaint, and M. Gienger, “Gaussian process implicit surfaces for shape estimation and grasping,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 2845–2850.
-  N. Jamali, C. Ciliberto, L. Rosasco, and L. Natale, “Active perception: Building objects’ models using tactile exploration,” in Humanoid Robots (Humanoids), 2016 IEEE-RAS 16th International Conference on. IEEE, 2016, pp. 179–185.
-  N. Sommer, M. Li, and A. Billard, “Bimanual compliant tactile exploration for grasping unknown objects,” in ICRA. IEEE, 2014, pp. 6400–6407.
-  J. Mahler, S. Patil, B. Kehoe, J. van den Berg, M. Ciocarlie, P. Abbeel, and K. Goldberg, “GP-GPIS-OPT: Grasp planning with shape uncertainty using gaussian process implicit surfaces and sequential convex programming,” in IEEE International Conference on Robotics and Automation (ICRA), 2015.
-  J. Ilonen, J. Bohg, and V. Kyrki, “Three-dimensional object reconstruction of symmetric objects by fusing visual and tactile sensing,” The International Journal of Robotics Research, vol. 33, no. 2, pp. 321–341, 2014.
-  ——, “Fusing visual and tactile sensing for 3-d object reconstruction while grasping,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3547–3554.
-  A. Bierbaum, I. Gubarev, and R. Dillmann, “Robust shape recovery for sparse contact location and normal data from haptic exploration,” in Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on. IEEE, 2008, pp. 3200–3205.
-  R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on. IEEE, 2011, pp. 127–136.
-  S. Thrun and J. J. Leonard, “Simultaneous localization and mapping,” in Springer handbook of robotics. Springer, 2008, pp. 871–889.
-  M. Krainin, P. Henry, X. Ren, and D. Fox, “Manipulator and object tracking for in-hand 3d object modeling,” The International Journal of Robotics Research, vol. 30, no. 11, pp. 1311–1327, 2011.
-  M. Krainin, B. Curless, and D. Fox, “Autonomous generation of complete 3d object models using next best view manipulation planning,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5031–5037.
-  A. Hermann, F. Mauch, S. Klemm, A. Roennau, and R. Dillmann, “Eye in hand: Towards gpu accelerated online grasp planning based on pointclouds from in-hand sensor,” in Humanoids. IEEE, 2016, pp. 1003–1009.
-  J. Varley, C. DeChant, A. Richardson, A. Nair, J. Ruales, and P. Allen, “Shape completion enabled robotic grasping,” in Intelligent Robots and Systems (IROS), IEEE/RSJ 2017 International Conference on, extended version preprint at arXiv:1609.08546.
-  F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
-  J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a cpu and gpu math expression compiler,” in Proceedings of the Python for scientific computing conference (SciPy), vol. 4. Austin, TX, 2010.
-  F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” arXiv:1211.5590, 2012.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” arXiv preprint arXiv:1502.01852, 2015.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the 13th AISTATS, 2010, pp. 249–256.
-  N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in IROS, vol. 3. IEEE, 2004, pp. 2149–2154.
-  D. Kappler, J. Bohg, and S. Schaal, “Leveraging big data for grasp planning,” in ICRA. IEEE, 2015, pp. 4304–4311.
-  V. Lempitsky, “Surface extraction from binary volumes with higher-order smoothness,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1197–1204.
-  P. Cignoni, M. Corsini, and G. Ranzuglia, “Meshlab: an open-source 3d mesh processing system,” Ercim news, vol. 73, pp. 45–46, 2008.
-  M. P. Gerardo-Castro, T. Peynot, F. Ramos, and R. Fitch, “Robust multiple-sensing-modality data fusion using gaussian process implicit surfaces,” in Information Fusion (FUSION), 2014 17th International Conference on. IEEE, 2014, pp. 1–8. Code: https://github.com/marcospaul/GPIS.
-  C. Papazov and D. Burschka, “An efficient ransac for 3d object recognition in noisy and occluded scenes,” in Asian Conference on Computer Vision. Springer, 2010, pp. 135–148.
-  J. Bohg, M. Johnson-Roberson, B. León, J. Felip, X. Gratal, N. Bergström, D. Kragic, and A. Morales, “Mind the gap-robotic grasping under incomplete observation,” in ICRA. IEEE, 2011, pp. 686–693.