Deep Gated Multi-modal Learning: In-hand Object Pose Changes Estimation using Tactile and Image Data


For in-hand manipulation, estimating the object pose inside the hand is one of the functions needed to move objects to a target pose. Since in-hand manipulation tends to cause occlusion by the hand or the object itself, image information alone is not sufficient for in-hand object pose estimation. Multiple modalities can be used in this case; their advantage is that other modalities can compensate for occlusion, noise, and sensor malfunctions. Although it is important to decide the utilization rate of each modality (referred to as its reliability value) according to the situation, manually designing such a model is difficult, especially across varied situations. In this paper, we propose deep gated multi-modal learning, which self-determines the reliability value of each modality through end-to-end deep learning. For the experiments, an RGB camera and a GelSight tactile sensor were attached to the parallel gripper of a Sawyer robot, and the object pose changes were estimated during grasping. A total of 15 objects were used in the experiments. In the proposed model, the reliability values of the modalities were determined according to the noise level and failure of each modality, and it was confirmed that the pose change could be estimated even for unknown objects.

I Introduction

Robots are expected to work not only in factories but also in homes, where they must grasp objects in a variety of environments. Reaching the target grasp pose in a direct motion is difficult when there are obstacles in the environment. In such situations, the robot needs to grasp the object once and then place it on a surface such as a desk to grasp it again in order to achieve the target pose. Another way is to rearrange the object inside the hand, also known as in-hand manipulation. Though commonly seen as the easier solution, placing and re-grasping an object takes more time than in-hand manipulation. Furthermore, a surface to place the object on does not always exist in the environment, so this method cannot always be used. For robot manipulation tasks, in particular in-hand manipulation, object pose estimation within a robotic hand is one of the important functions for manipulating objects to the target pose accurately [39, 26].

Although many researchers have developed object pose estimation based on image information only, images alone are not sufficient for in-hand manipulation, where the object is likely to be occluded from the camera by the hand or the object itself (e.g., the gripper hides the object). Furthermore, large objects can extend outside the field of view of the camera or hide parts of themselves, as depicted in Fig. 1. One approach to address these challenges is to combine multiple sensor modalities. An advantage of using multiple modalities is that even if there is an occlusion, noise, or a malfunction in some of the modalities, the other modalities can compensate and provide information from a different perspective. In such situations, it is necessary to predict how much each modality should be considered. For the remainder of this article, we refer to the utilization ratio of each modality as its reliability value. However, it is difficult to manually design a model that can decide a given modality’s reliability value, especially if the model has to deal with a wide variety of environments and situations.

Fig. 1: An illustration of in-hand object pose estimation with image and tactile data through our proposed network, deep gated multi-modal learning, which can decide the reliability value for each modality.

In this paper, we propose a method that we call deep gated multi-modal learning (DGML), which uses end-to-end deep learning to determine the reliability value of each modality by itself (see Fig. 1 and Fig. 2). By virtue of end-to-end deep learning, this method is capable of generalizing to unknown objects without assuming a known 3D model when doing in-hand pose estimation.

The rest of this paper is organized as follows. The contribution is explained in Section II and works related to this paper are described in Section III, while our proposed method is detailed in Section IV. Section V outlines our experiment setup and evaluation settings, with the results presented in Section VI. Finally, future work and conclusions are explained in Section VII.

II Contributions

The target of our method is to estimate the object pose changes inside the hand using image and tactile sensors as robustly as possible, despite occlusions, noise, sensor malfunctions, and other possible obstructions. The method dynamically determines the reliability value and scales each modality’s contribution appropriately instead of focusing on only one modality or the other. Note that our method estimates in-hand object pose changes during grasping instead of the absolute object pose with respect to the robot base, meaning that we estimate how much the object moves within the hand after the robot first makes contact with the object (when it has grasped it).

The main contributions of this article are as follows:

  • Proposed a new approach in which the network itself determines the reliability of each modality dynamically and uses that reliability value to scale each modality’s contribution.

  • Proposed a new approach in which end-to-end learning combines image and tactile data, without assuming a 3D model, to estimate the pose changes of unseen objects under occlusion, noise, and sensor malfunction conditions.

  • Investigated details of noise, malfunction, and occlusion behavior in sensor information unique to the robot field.

Fig. 2: The proposed network architecture for deep gated multi-modal learning. Inputs are sequences of images and tactile data, and the output is a time series of object pose changes. The gate values ($\alpha$ and $\beta$) represent the reliability of each modality, and the values are acquired by the network itself. After training, if the gate value of one modality is smaller than that of the other, that modality is less reliable than the other; conversely, if its gate value is greater, it is more reliable.

III Related Works

III-A Object Pose Estimation with Depth and Image Data

Object pose estimation is a well-studied problem in computer vision and is important for robotic tasks. Many researchers have particularly been developing methods using depth data (point cloud) or RGB-D data [9, 8, 1]. Classical approaches with depth data are mainly based on point cloud matching methods such as iterative closest point (ICP) [30]. These methods can achieve high accuracy, but since these methods require 3D models of the objects in advance, they cannot be used for unknown objects.

Recently, deep learning has become an active research area, and the computer vision field in particular has achieved success with it. Image-based object pose estimation methods that combine deep learning with model-based approaches have been studied [22, 38, 37]. The convergence of ICP to the final result relies heavily on the choice of the initial position, but the convergence error can be suppressed by giving a proper initial position through deep-learning methods. However, these methods still require a 3D model, so adapting to unknown objects remains challenging. Some state-of-the-art pose estimation methods that do not require 3D models have been realized through deep learning [31, 15, 18, 2].

III-B Tactile-based Object Pose Estimation

As described in Section III-A, most of the existing pose estimation methods use images and depth information only. During in-hand object manipulation, the object is occluded from the camera and depth sensor by the hand or the object itself (See Fig. 1 as examples of occlusions).

Tactile sensors are gaining attention since they can observe the contact state without causing an occlusion [36, 35, 40, 10]. The majority of these sensors fall into one of the following two categories:

  1. Single-axis multi-touch enabled sensors, which can only sense normal force [28, 19, 24, 11].

  2. Three-axis single-touch sensors [29]

Two of the few exceptions are the uSkin [35] and the GelSight [20, 10], which are multi-touch sensors that can measure shear force as well as normal force. The commercialized uSkin sensor [35] embeds magnets inside silicone rubber and measures the deformation of the silicone during contact by monitoring changes in the magnetic fields. Using this method, it is able to measure both normal and shear force per sensor unit. The prototype supports 16 different points of contact [36]. GelSight [40, 7], by contrast, is an optical-based tactile sensor that uses a camera to capture and measure the deformation of the attached elastomer during contact with a surface.

There are some works that use tactile sensors for pose estimation, such as [6], but a 3D model of the object is required a priori. It is challenging to estimate the pose of an object using only the tactile information of the grasped part, since a single grasp provides little information. Therefore, some work has combined tactile and image information [12, 7, 5], and the effectiveness of this fusion was demonstrated in object grasping tasks. Some works performed object pose estimation through model-based approaches using known 3D models [16, 14, 4]. In these studies, a 3D model of the object is often required to estimate the absolute object pose because the pose of the object is assumed to be unknown when the object is grasped. However, we expect that even the methods that work on unknown objects described in Section III-A can be improved by using additional tactile information to deal with occlusions. Specifically, the object pose can be estimated before the robot grasps an object through methods such as [31, 15, 18], mentioned in Section III-A. If we then keep track of the object pose changes using a tactile sensor after the robot first makes contact (when it has grasped the object and thus occludes it), we will be able to track the absolute object pose without using a given 3D model. In addition, our hypothesis is that tactile sensors improve the object pose estimation accuracy compared to using images only (our experimental results testing this are given in Table II in Section VI-B).

III-C Attention Based Learning

A variety of attention methods have been studied. Regarding the division of attention among modalities, the majority of these methods, unlike our proposed method, fall into one of the following three categories:

1) Equal attention to all modalities: Each modality is fed into its respective network to extract features, and the obtained features are simply combined to estimate information such as grasping point and motion. Examples include the combination of image and tactile data described in Section III-B, motion and image [32], force and image [23], language and image [13], and image and sound [27].

2) Attention within modalities: Only the important parts of the given modalities are used. Within each modality, the part to focus on is extracted by the network itself. For example, in the case of images, the pixel of interest is used instead of the whole image [21, 17]. This case does not necessarily use multiple modalities as it can also be done when using a single modality.

3) Cancellation of irrelevant modalities: Only the important modalities are used while other modalities are ignored. The network decides which modality should be used from all modalities. For example, choosing to utilize language or images according to the situation [3].

The new approach we developed does not fit in these categories, since it determines the contribution ratio of each modality dynamically and uses that reliability value to scale each modality’s contribution.

IV Deep Gated Multi-modal Learning

We propose a method that we call deep gated multi-modal learning (DGML) for in-hand object pose change estimation with image and tactile information, based on end-to-end deep learning. DGML estimates in-hand object pose changes during grasping instead of the absolute object pose. Figure 2 shows the concept of the proposed network model. The main concept of DGML is that the network itself dynamically decides how much it should rely on each modality; in other words, it decides the reliability value per modality. We aimed to design a network with a structure that is as simple as possible but still sufficient to show the effectiveness of the deep gated multi-modal unit. Increasing the complexity of the network architecture will most likely improve the accuracy, but the concept of the reliability value can be used in the same way. The details of the network structure are described in Section V-D.

DGML is composed of three components for in-hand pose estimation:

  • Feature extraction unit to extract features from image and tactile data

  • Deep gated multi-modal unit for calculation and application of reliability value to each modality

  • Object pose change estimation unit to estimate object pose changes using time-series input information

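For training and inference of the network, a sequence of image and tactile data is the input, and the output is a sequence of object pose changes. A training dataset consists of sequences of $T$ steps each.

As a feature extraction unit, convolutional neural networks (CNNs) are used to calculate image features $h^{\mathrm{img}}_t$ and tactile features $h^{\mathrm{tac}}_t$ at step $t$ from image input $x^{\mathrm{img}}_t$ and tactile input $x^{\mathrm{tac}}_t$, respectively.

Then, the reliability value for image $\alpha_t$ in step $t$ is given from $h^{\mathrm{img}}_t$ and $h^{\mathrm{tac}}_t$ as:

$$\alpha_t = \sigma\left(\mathrm{FC}_{\mathrm{gate}}\left(\left[\mathrm{FC}_{\mathrm{img}}(h^{\mathrm{img}}_t),\ \mathrm{FC}_{\mathrm{tac}}(h^{\mathrm{tac}}_t)\right]\right)\right) \qquad (1)$$

where $\mathrm{FC}_{\mathrm{img}}$ and $\mathrm{FC}_{\mathrm{tac}}$ are fully connected (FC) layers to reduce the number of dimensions, $\mathrm{FC}_{\mathrm{gate}}$ is the FC layer of the gate, and the sigmoid function $\sigma$ is used as activation function for the gate. The reliability values $\alpha_t$ for image and $\beta_t$ for tactile data are conditioned to sum up to 1:

$$\alpha_t + \beta_t = 1 \qquad (2)$$

$\alpha_t$ and $\beta_t$ are one-dimensional scalars. The gate determines each modality’s reliability values from all modalities’ information through equations (1) and (2).

The extracted feature vectors for each modality, $h^{\mathrm{img}}_t$ and $h^{\mathrm{tac}}_t$, are then scaled by multiplying them with their respective reliability values, giving us the scaled feature vectors $\hat{h}^{\mathrm{img}}_t$ and $\hat{h}^{\mathrm{tac}}_t$ in time step $t$:

$$\hat{h}^{\mathrm{img}}_t = \alpha_t\, h^{\mathrm{img}}_t, \qquad \hat{h}^{\mathrm{tac}}_t = \beta_t\, h^{\mathrm{tac}}_t \qquad (3)$$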
Since a sigmoid function is used and the total sum of reliability values is 1, the reliability values are continuous values from 0 to 1, instead of being only 0 or 1. The lower the reliability, the smaller the contribution to the output. On the other hand, the greater the reliability, the greater the output contribution. In addition, if the reliability value becomes 0, the modality is completely ignored. Therefore, it is not necessary to manually determine which sensor to enable/disable in advance, because the network will automatically ignore them if they are not helpful. Note that the reliability values are not absolute, but relative. We chose to use relative values to learn the correlation between multiple modalities. This however means that the method simply assigns a higher reliability value to a given modality, if said modality performs better than any other modality. Thus, all modalities could be noisy or under-performing, but our method will still assign reliability values by comparing the modalities’ performances with each other. The downside of this is that the robot will continue to operate even if none of the sensor modalities are performing well. Rather, it tries to make the best out of the situation and attempts to focus on only the good modalities of the data it has. Detecting when the robot should give up trying or improving the quality of the data however is out of scope of our work.
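The gating described above can be illustrated with a minimal sketch in plain Python. It assumes, following the layer sizes listed in Table I, that each modality's feature vector is first reduced to one scalar by an FC layer and that a second FC layer with a sigmoid produces the image reliability value, with the tactile reliability value as its complement. The function names, parameter names, toy feature sizes, and random weights are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fc_scalar(weights, bias, inputs):
    """Fully connected layer with a single output neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def gate(h_img, h_tac, params):
    """Return (alpha, beta) for one time step; alpha + beta == 1."""
    # Reduce each modality's feature vector to one scalar (hypothetical FC layers).
    z_img = fc_scalar(params["w_img"], params["b_img"], h_img)
    z_tac = fc_scalar(params["w_tac"], params["b_tac"], h_tac)
    # Gate layer: 2 inputs -> 1 output, squashed to (0, 1) by a sigmoid.
    alpha = sigmoid(fc_scalar(params["w_gate"], params["b_gate"], [z_img, z_tac]))
    return alpha, 1.0 - alpha


def scale_features(h_img, h_tac, alpha, beta):
    """Scale each modality by its reliability and concatenate for the LSTM."""
    return [alpha * v for v in h_img] + [beta * v for v in h_tac]


rng = random.Random(0)
h_img = [rng.gauss(0.0, 1.0) for _ in range(4)]  # toy size, not the sizes in Table I
h_tac = [rng.gauss(0.0, 1.0) for _ in range(8)]
params = {
    "w_img": [rng.uniform(-1, 1) for _ in h_img], "b_img": 0.0,
    "w_tac": [rng.uniform(-1, 1) for _ in h_tac], "b_tac": 0.0,
    "w_gate": [rng.uniform(-1, 1) for _ in range(2)], "b_gate": 0.0,
}
alpha, beta = gate(h_img, h_tac, params)
lstm_input = scale_features(h_img, h_tac, alpha, beta)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, LSTM input size={len(lstm_input)}")
```

Because the two values are tied together, raising one necessarily lowers the other, which is the relative (not absolute) behavior discussed in the paragraph above.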

As the object pose change estimation unit, a long short-term memory (LSTM) network with its output connected to an FC layer is used. The input of the LSTM is a sequence of the scaled features $\hat{h}^{\mathrm{img}}_t$ and $\hat{h}^{\mathrm{tac}}_t$, whereas the output of the FC layer is $y_t$, a sequence of object pose changes. $y_t$ is calculated from the homogeneous transformation matrix that gives the pose of the object at time $t$ with respect to the initial pose of the object, that is, the pose of the object when the robot first grasps the object and makes contact.

Since our proposed method works without 3D models given in advance, we estimate the relative object pose (object pose changes) during grasping instead of the absolute pose. Thus, the object pose change $y_t$ is the current object pose at step $t$ with respect to the initial object pose at the step when the robot first grasps the object and makes contact. At that initial step, the pose change is zero.

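The network is trained by minimizing the loss function as follows:

$$\theta^{*} = \operatorname*{arg\,min}_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{T} \mathcal{L}\left(y_{n,t}, \bar{y}_{n,t}\right) \qquad (4)$$

where $\theta$ are the parameters to be trained, including those of the gate that produces the reliability values $\alpha_t$ and $\beta_t$, $N$ is the number of sequences for mini-batch training, $T$ is the number of steps per sequence, $y_{n,t}$ is the network output, and $\bar{y}_{n,t}$ is the teaching signal. We used the Huber loss as loss function $\mathcal{L}$.

There is no teaching signal for the reliability values $\alpha_t$ and $\beta_t$ because these are calculated by the network itself to minimize the output error through equation (4). The reliability values are then multiplied with each modality’s features in equation (3).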
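As a small illustration of the training objective, the sketch below evaluates a Huber loss between predicted and target pose changes over one sequence. The threshold `delta=1.0` and the averaging over all elements are assumptions made for this example; the text states only that the Huber loss is used, and the helper names are hypothetical.

```python
def huber(residual, delta=1.0):
    """Quadratic near zero, linear for large residuals (delta is assumed)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def sequence_loss(predictions, targets, delta=1.0):
    """Mean Huber loss over all steps and all three pose components."""
    total = 0.0
    count = 0
    for pred_step, target_step in zip(predictions, targets):
        for p, t in zip(pred_step, target_step):
            total += huber(p - t, delta)
            count += 1
    return total / count


# Toy sequence with two steps; each step holds three pose-change components.
predicted = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.5]]
target = [[0.5, 3.0, 0.0], [1.0, 2.0, 0.5]]
print(sequence_loss(predicted, target))
```

Note that no term of this loss refers to the reliability values; in this reading they are shaped only indirectly, through their effect on the pose-change error.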

V Experimental Setup

The purpose of the experiments is to verify DGML in situations with occlusion, noise, and sensor malfunctions.

We note that our values of the hyper-parameters provided in this section are tuned by random search.

V-A Hardware Setup

Tactile sensor

The GelSight tactile sensor, which we reproduced from [40, 7], is an optical-based tactile sensor that captures an image with a camera; the image can then be used to calculate a 3D model and the applied normal and shear forces along the x, y, and z axes in newtons. However, instead of using the 3D model and the forces in newtons, we directly use the raw image captured by the GelSight (see Fig. 1). Because the silicone layer from the sensor’s container tears when an excess amount of force is applied, we applied baby powder to this layer to reduce friction between the surface and the grasped object. We chose GelSight because it is a multi-touch sensor that can measure both normal and shear forces, as opposed to the other types of tactile sensors described in Section III-B. It is also relatively cheap and simple to reproduce.

Fig. 3: Setup used in our experiments: a custom printed end-effector with both a tactile skin sensor and a web camera. The Sawyer robot moves the gripper in the minus x-axis and y-axis directions and rotates it in yaw.
Fig. 4: Trained objects (red) and unknown objects (blue)


We developed the gripper shown in Fig. 3 to grasp objects. It is a parallel gripper with two fingers driven by a servo motor (Dynamixel XM430-W350-R). A GelSight tactile sensor is attached to one fingertip, and the other fingertip has a sponge. A camera (BUFFALO BSW200MBK) is mounted at the center of the gripper.


To perform our experiments, we use a Sawyer 7-DOF robotic arm with the gripper as end-effector (See Fig. 3). The Sawyer, GelSight sensor, gripper, and camera are connected to a PC running Ubuntu 16.04 with ROS Kinetic.

Layer                        In      Out     Kernel   Activation

Feature Extraction Unit

Image feature extraction
1st conv.                    1       32      (3,3)    ReLU
2nd conv.                    32      32      (3,3)    ReLU
Ave. pooling                 32      32      (4,4)    -
3rd conv.                    32      32      (3,3)    ReLU
4th conv.                    32      32      (3,3)    ReLU
Ave. pooling                 32      32      (2,2)    -
5th conv.                    32      32      (3,3)    ReLU
6th conv.                    32      32      (3,3)    ReLU
Ave. pooling                 32      32      (2,2)    -

Tactile feature extraction
1st conv.                    1       32      (3,3)    ReLU
2nd conv.                    32      32      (3,3)    ReLU
Ave. pooling                 32      32      (4,4)    -
3rd conv.                    32      32      (3,3)    ReLU
4th conv.                    32      32      (3,3)    ReLU
Ave. pooling                 32      32      (2,2)    -
5th conv.                    32      32      (3,3)    ReLU
6th conv.                    32      32      (3,3)    ReLU
Ave. pooling                 32      32      (2,2)    -

DGM Unit
1st FC                       1120    1       -        -
1st FC                       2240    1       -        -
2nd Gate                     2       1       -        sigmoid

Object Pose Change Estimation Unit
LSTM                         3360    170     -        sigmoid & tanh
1st Output                   170     3       -        -

  • In and Out are the number of channels for image and tactile data, and the number of neurons for the gate, LSTM, and FC. Batch normalization is applied after the n-th convolution. Stride and padding for the n-th convolution in image and tactile are (1, 1). The input is composed of an RGB image and a tactile image, and the output is a sequence of object pose changes.

TABLE I: Network Design

V-B Objects

For the target objects, we prepared 15 objects of various sizes and shapes (see Fig. 4). Of these, 11 objects were used for training, while the remaining 4 were used as unknown objects to evaluate our trained network.

V-C Data Collection

Figure 3 shows one of the initial positions of the robot from which it starts to manipulate the object for data collection (The attached video shows more examples of the initial positions). A table is placed in front of the robot and the object is fixed on the table with double-sided tape. The robot grasps the object, after which the gripper posture, the image, and tactile data are recorded while the robot slides the object in its hand. Since we use a parallel gripper popular in robotics, the object pose is limited to translating in the xy plane and rotating around the z axis.

We estimate the three-DoF object pose changes in the coordinate system of the hand, which is depicted in Fig. 3. The pose of the object is the inverse transformation of the posture of the gripper. The object pose change $y_t$ is calculated from the homogeneous transformation matrix $\Delta T_t$, where $T^{g}_t$ is the homogeneous transformation matrix of the gripper pose at time $t$ described in the base link coordinate system. $T^{g}_t$ can be calculated easily by using forward kinematics. We define object poses with respect to the gripper pose at initial grasp contact, $T^{g}_{\mathrm{init}}$. In other words, we estimate the relative object pose change between the pose at the current step and the pose at the initial step, when the robot first makes contact with the object (when it has grasped it). Thus, the object pose change is not the absolute pose of the object. However, the absolute pose can be calculated if the robot can estimate the absolute object pose before grasping, as described in Section III-B. Thus, the teaching signal and ground truth are defined as follows:

$$\Delta T_t = \left(T^{g}_t\right)^{-1} T^{g}_{\mathrm{init}} \qquad (5)$$
We decided to collect the teaching signals/ground truth by means of forward kinematics and by fixing the object to the table because this results in higher accuracy than non-fixed methods such as AR marker tracking. We also prepared movement patterns that include both translational and rotational motions, and collected data for each object (the attached video shows examples of the motions). So that the network does not depend on features of the background moving as the robot moves, we covered the background and desk with green cloth. The maximum movement was about 30 mm in translation and about 40 degrees in rotation. For small objects such as the tapes, cups, scale, and wrench, only rotational movement was performed (see Fig. 4). For each object, the numbers of motions for translation, rotation, and the combination of both are 10, 10, and 12, respectively. The grasp posture is different in each motion. Therefore, the object is occluded from the camera by the object itself if the object is large. For the trained objects in Fig. 4, 6 out of 10 translations, 6 out of 10 rotations, and 8 out of 12 combined motions are used for training, and the remaining sets are used for evaluation.
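The ground-truth computation described in this subsection can be sketched as follows. The example is one reading of the text, not the authors' code: it assumes planar motion (translation in x and y plus yaw, matching the three DoF considered here), represents gripper poses as 3x3 homogeneous matrices, and obtains the object pose change by inverting the current gripper pose and composing it with the gripper pose at first contact. All function names and the toy numbers (meters and radians) are hypothetical.

```python
import math


def se2(x, y, yaw):
    """3x3 homogeneous transform for a planar pose."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inverse(t):
    """Inverse of a planar homogeneous transform: [R^T, -R^T p]."""
    c, s = t[0][0], t[1][0]
    x, y = t[0][2], t[1][2]
    return [[c, s, -(c * x + s * y)], [-s, c, s * x - c * y], [0.0, 0.0, 1.0]]


def pose_change(t_gripper_now, t_gripper_init):
    """Object pose change (dx, dy, dyaw) seen from the current gripper frame."""
    delta = matmul(inverse(t_gripper_now), t_gripper_init)
    return delta[0][2], delta[1][2], math.atan2(delta[1][0], delta[0][0])


# Gripper poses in the base frame, e.g. from forward kinematics (toy values).
t_init = se2(0.5, 0.2, 0.3)
t_now = matmul(t_init, se2(0.01, 0.0, 0.1))  # gripper moved 10 mm and yawed 0.1 rad

print(pose_change(t_init, t_init))  # no motion yet: pose change is zero
print(pose_change(t_now, t_init))   # object appears to move opposite to the gripper
```

Because the object is fixed to the table, the object appears to move in the opposite direction to the gripper, which is why the inverse of the gripper motion is used.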

Images, tactile, and object poses were acquired at , and the dataset used for training was re-sampled to . The captured images were converted to gray-scale because the object pose is independent of the color of the objects. One motion of the training dataset has a length of about 150 steps, and each step composes of pixels for image, pixels for tactile, and 3 for object pose changes, .

V-D Network Design

The architecture of our network model is composed of two CNNs, a gate, and an LSTM with an FC layer to perform DGML, as shown in Fig. 2 and described in Section IV. We used Chainer [34, 33, 25] as the deep learning library for the implementation. More details on the network parameters are shown in Table I. For training, we used the Huber loss as the loss function. Due to limited computing resources, we reset the history of the LSTM every 20 steps. All our network experiments were conducted on a machine equipped with 256 GB RAM, an Intel Xeon E5-2667v4 CPU, and eight Tesla P100-PCIE GPUs with 12 GB, resulting in about 24 to 48 hours of training time.
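One way to picture resetting the LSTM history every 20 steps is to split each recorded motion into consecutive 20-step chunks and start each chunk from a fresh recurrent state. The sketch below shows only the chunking; whether the authors chunk their data this way is an assumption, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(sequence, chunk_len=20):
    """Split a sequence into consecutive chunks of at most chunk_len steps."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


# A motion of about 150 steps, as reported for the training dataset.
steps = list(range(150))
chunks = split_into_chunks(steps)

for chunk in chunks:
    # In a real training loop, the LSTM state would be reset here
    # before the steps of this chunk are fed in order.
    pass

print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

With 150 steps this gives seven full chunks and one shorter final chunk, so the memory needed for backpropagation is bounded by the chunk length rather than the full motion.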

Fig. 5: Learning curves and gate values of DGML. Note that the values of $\alpha$ and $\beta$ are averages over all training data per epoch during training. The values of $\alpha$ and $\beta$ are not fixed; they are calculated by the gate from the image and tactile features and are inferred at run-time.

VI Results

VI-A Learning Curve of DGML

For comparison with DGML, we prepared several networks: one that uses only images, one that uses only tactile data, and one that uses both images and tactile data with a simple connection and no gate. The number of layers and the parameters of the CNNs and LSTM are the same in all comparison models. However, if one modality is disconnected, the number of input neurons of the LSTM changes, since it is the sum of the number of neurons connected per modality. In addition, a simple connection without a gate is equivalent to DGML with both reliability values fixed to the same constant.

Figure 5 shows their learning curves and the average reliability values over all sequences in DGML. Note that the reliability values $\alpha$ and $\beta$ are not constants but change dynamically depending on the input. As for the reliability values $\alpha$ for image and $\beta$ for tactile, $\alpha$ is greater than $\beta$ at first, but the reliability value for tactile information gradually increases. This means that the network gradually relies more on tactile data than on image data, which is most likely due to the image data becoming less reliable. As one of its characteristics, DGML needs the fewest epochs to converge of all the models, since training progresses from the easier-to-train modality.

Note that the input of one modality is twice the size of the other’s; this is merely a result of the parameter search we performed, and it gave the best performance for the network used in this study. For the same reason, the feature vector of that modality is twice as large as the other’s (2240 versus 1120 inputs to the gate in Table I). The reliability values $\alpha$ and $\beta$ can change depending on these parameters. However, the focus of this work is to improve the performance by introducing DGML with reliability values, so comparing the accuracy when the same input size per modality is used is out of the scope of this work.

Cond. Model Known obj. Unknown obj.
Trans. Rot. Trans. Rot.
Normal Image 2.10 1.45 3.25 5.47
Tactile 1.43 1.09 2.04
w/o gate 1.01 1.70
DGML 1.11 1.63
w/o image w/o gate 3.00 8.13 3.01 9.00
DGML 1.04 2.35 1.39 3.33
w/o tactile w/o gate 5.17 5.02
DGML 3.49 3.87
TABLE II: Inference error of in-hand object pose estimation
Cond.                    Known obj.                       Unknown obj.
                         $\alpha$: image  $\beta$: tactile  $\alpha$: image  $\beta$: tactile
with image & tactile     0.261            0.739             0.272            0.728
w/o image                0.069            0.931             0.072            0.928
w/o tactile              0.994            0.006             0.994            0.006
TABLE III: The average relative value of $\alpha$ for image and $\beta$ for tactile

VI-B Inference Result of Object Pose

Table II shows the object pose changes inference accuracy during in-hand manipulation. The accuracy for translation and rotation is the average difference between the ground truth and inferred object pose at each time step and is calculated as follows:


The ground truth is the object pose change as calculated during the data collection (see section V-C and equation (5)). We evaluated the models under normal conditions, but also by substituting either image or tactile input with random noise. Under these conditions, we compared four different models: a model using only image input, a model using only tactile input, a model using both image and tactile (but no gate), and the proposed model.

Under normal conditions, the results are always better when tactile information is included, as we expected in Section III-B. In addition, the models that include tactile information can predict correctly for both known and unknown objects.

From the results in Table II with one of the modalities muted, we can see that the performance of the proposed DGML is the best. When the image input is absent, the value of $\alpha$ is much smaller than the value of $\beta$ (see Table III). On the other hand, when the tactile input is absent, the value of $\beta$ becomes almost 0. This is because, if the input of one modality is absent, the gate decides that the reliability value of that modality should be reduced, so that the modality is almost ignored (see Table III). Even though the training dataset does not include data where a modality is completely absent, the gate of DGML can still deal with these situations.

VI-C Gate Values under Noise

In this section, we discuss the change of reliability values when noise is applied to the input of a modality. The noise is applied to the tactile input, since in-hand pose change estimation relies more on tactile information than on image information according to Section VI-B. Figure 6 shows histograms of the reliability value $\alpha$ for images when the network infers from the dataset with noise added to the original tactile input. The noise is generated from a normal distribution with different variances: 400, 450, and 500 (see Fig. 6).

From Fig. 6, it can be seen that the reliability value for images increases as the noise to the tactile input increases. The gate can correctly recognize the noise of the input signal in the tactile data. The strength of using DGML is that DGML helps to understand the network behavior through the reliability values of the gate because the gate represents how much each modality should be used. Furthermore, if a sensor is broken, the reliability value for that sensor is always close to 0, thus DGML can recognize sensor failures.
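The noise experiment can be mimicked with a short sketch that perturbs a tactile input with Gaussian noise at the three variances mentioned above. The zero mean, the toy pixel values, and the absence of clipping are assumptions for illustration; `add_gaussian_noise` is a hypothetical helper, and the gate itself is not modeled here.

```python
import math
import random


def add_gaussian_noise(pixels, variance, rng):
    """Add zero-mean Gaussian noise with the given variance to each pixel."""
    std = math.sqrt(variance)
    return [p + rng.gauss(0.0, std) for p in pixels]


rng = random.Random(0)
tactile = [128.0] * 16  # toy flattened tactile image; the value range is assumed

for variance in (400, 450, 500):
    noisy = add_gaussian_noise(tactile, variance, rng)
    mean_abs_shift = sum(abs(n - p) for n, p in zip(noisy, tactile)) / len(tactile)
    print(f"variance={variance}: mean absolute pixel shift={mean_abs_shift:.1f}")
```

Feeding such noisy tactile inputs to a trained network and collecting the image reliability value at every step would yield histograms of the kind shown in Fig. 6.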

Fig. 6: Histogram of the image reliability value $\alpha$ with different magnitudes of noise added to the tactile input
Fig. 7: Histogram of the image reliability value $\alpha$. The objects surrounded by a red frame are known objects, whereas the objects surrounded by a blue frame are unknown objects. The objects are drawn in proportion to their actual sizes.

VI-D Reliability Values of Objects

In this section, we discuss how the reliability values vary across objects due to object shape, size, and occlusions. We visualize the reliability value of the image, $\alpha$, by creating a histogram of all the values of $\alpha$ inferred on untrained motions with known and unknown objects (see Fig. 7). Most values of $\alpha$ are smaller than 0.5, which indicates that the network relies more on the tactile features. From the comparison between using only images and using only tactile data in Section VI-B, we conjecture that the reliability value for images becomes lower because the accuracy of the method using only tactile data is higher than that of the one using only images (see Table II).

Each object is depicted at its most frequently occurring value of $\alpha$ on the histogram in Fig. 7. The objects are drawn in proportion to their actual sizes. Since objects with values between 0 and 0.2, such as the tapes, wrench, and cups, tend to provide rich tactile information due to their unique shape and/or surface, the network relies more on tactile information for these objects. Conversely, image information is reliable for objects with no surface irregularities, such as boxes and cans, from which tactile features are difficult to observe; these are displayed with values between 0.4 and 0.6. Even when an object has no tactile features on its surface, the reliability of the image is low when the object is large and thus occludes itself or does not fit within the camera view, as opposed to small objects (see top row of Fig. 7). For example, the wood, the coffee plastic bottle, and the orange triangle object are small, so their movement can be observed from the image. In contrast, large objects such as the mustard and green plastic bottles create occlusions, which reduce the reliability of image information, so the network uses more tactile information. Therefore, we can say that the proposed method is effective in determining the reliability value of each modality according to the object shape and size, and occlusions.

VII Conclusion

In this paper, we proposed deep gated multi-modal learning (DGML), a method that estimates in-hand object pose changes from tactile and image data while also predicting the reliability value of each modality. Because it does not rely on known 3D models, the proposed method can estimate pose changes during grasping for unknown objects as well as known ones. Moreover, the network changes the modalities’ reliability values dynamically and automatically depending on the situation, such as sensor failure and different magnitudes of noise. Visualizing the reliability values helps to understand the network's behavior, such as which modality it utilizes and to what extent. Furthermore, the proposed method improved the computational efficiency of training.

For future work, we will develop an in-hand object manipulation system that integrates the object pose estimator.


The authors would like to thank Wilson Ko for helping with the discussion and writing, and Koichi Nishiwaki and Tianyi Ko for proofreading.


  1. A. Aldoma, Z. Marton, F. Tombari, W. Wohlkinger, C. Potthast, B. Zeisl, R. B. Rusu, S. Gedikli and M. Vincze (2012-09) Tutorial: point cloud library: three-dimensional object recognition and 6 dof pose estimation. IEEE Robotics Automation Magazine 19 (3), pp. 80–91. External Links: Document, ISSN 1070-9932 Cited by: §III-A.
  2. O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell and A. Ray (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §III-A.
  3. J. Arevalo, T. Solorio, M. Montes-y-Gómez and F. A. González (2017) Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992. Cited by: §III-C.
  4. J. Bimbo, S. Luo, K. Althoefer and H. Liu (2016-01) In-hand object pose estimation using covariance-based tactile to geometry matching. IEEE Robotics and Automation Letters 1 (1), pp. 570–577. External Links: ISSN 2377-3766 Cited by: §III-B.
  5. J. Bimbo, S. Rodríguez-Jimenez, H. Liu, X. Song, N. Burrus, L. D. Seneviratne, M. Abderrahim and K. Althoefer (2012-09) Object pose estimation and tracking by fusing visual and tactile information. In 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pp. 65–70. External Links: Document Cited by: §III-B.
  6. J. Bimbo, P. Kormushev, K. Althoefer and H. Liu (2015) Global estimation of an object’s pose using tactile sensing. Advanced Robotics 29 (5), pp. 363–374. Cited by: §III-B.
  7. R. Calandra, J. Lin, A. Owens, J. Malik, U. C. Berkeley, D. Jayaraman and E. H. Adelson (2017) More Than a Feeling : Learning to Grasp and Regrasp using Vision and Touch. (Nips), pp. 1–10. Cited by: §III-B, §III-B, §V-A1.
  8. C. Choi and H. I. Christensen (2012-10) 3D pose estimation of daily objects using an rgb-d camera. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3342–3349. External Links: Document, ISSN 2153-0866 Cited by: §III-A.
  9. C. Choi, Y. Taguchi, O. Tuzel, M. Liu and S. Ramalingam (2012-05) Voting-based pose estimation for robotic assembly using a 3d sensor. In 2012 IEEE International Conference on Robotics and Automation, pp. 1724–1731. External Links: ISSN 1050-4729 Cited by: §III-A.
  10. S. Dong, W. Yuan and E. H. Adelson (2017) Improved gelsight tactile sensor for measuring geometry and slip. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 137–144. Cited by: §III-B.
  11. J. A. Fishel and G. E. Loeb (2012) Sensing Tactile Microvibrations with the BioTac—Comparison with Human Sensitivity. In IEEE RAS & EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob), pp. 1122–1127. Cited by: item 1.
  12. Y. Gao, L. A. Hendricks, K. J. Kuchenbecker and T. Darrell (2016) Deep learning for tactile understanding from visual and haptic data. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 536–543. Cited by: §III-B.
  13. J. Hatori, Y. Kikuchi, S. Kobayashi, K. Takahashi, Y. Tsuboi, Y. Unno, W. Ko and J. Tan (2018) Interactively picking real-world objects with unconstrained spoken language instructions. 2018 IEEE International Conference on Robotics and Automation (ICRA). Cited by: §III-C.
  14. P. Hebert, N. Hudson, J. Ma and J. Burdick (2011) Fusion of stereo vision, force-torque, and joint sensors for estimation of in-hand object location. In 2011 IEEE International Conference on Robotics and Automation, pp. 5935–5941. Cited by: §III-B.
  15. T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. Glent Buch, D. Kraft, B. Drost, J. Vidal, S. Ihrke and X. Zabulis (2018) BOP: benchmark for 6d object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §III-A, §III-B.
  16. K. Honda, T. Hasegawa, T. Kiriki and T. Matsuoka (1998) Real-time pose estimation of an object manipulated by multi-fingered hand using 3d stereo vision and tactile sensing. In Proceedings. 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems. Innovations in Theory, Practice and Applications (Cat. No. 98CH36190), Vol. 3, pp. 1814–1819. Cited by: §III-B.
  17. J. Hu, L. Shen and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §III-C.
  18. Y. Hu, J. Hugonot, P. Fua and M. Salzmann (2019) Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3385–3394. Cited by: §III-A, §III-B.
  19. H. Iwata and S. Sugano (2009) Design of Human Symbiotic Robot TWENDY-ONE. In IEEE International Conference on Robotics and Automation (ICRA), pp. 580–586. Cited by: item 1.
  20. M. K. Johnson and E. H. Adelson (2009) Retrographic Sensing for the Measurement of Surface Texture and Shape. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1070–1077. Cited by: §III-B.
  21. J. Kim, J. Koh, Y. Kim, J. Choi, Y. Hwang and J. W. Choi (2018) Robust deep multi-modal learning based on gated information fusion network. arXiv preprint arXiv:1807.06233. Cited by: §III-C.
  22. A. Krull, E. Brachmann, F. Michel, M. Y. Yang, S. Gumhold and C. Rother (2015-12) Learning analysis-by-synthesis for 6d pose estimation in rgb-d images. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 954–962 (English). External Links: Document Cited by: §III-A.
  23. M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg and J. Bohg (2019) Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8943–8950. Cited by: §III-C.
  24. P. Mittendorfer and G. Cheng (2011) Humanoid Multimodal Tactile-sensing Modules. IEEE Transactions on robotics 27 (3), pp. 401–410. Cited by: item 1.
  25. Y. Niitani, T. Ogawa, S. Saito and M. Saito (2017) ChainerCV: a library for deep learning in computer vision. In ACM Multimedia, Cited by: §V-D.
  26. N. Chavan-Dafle, R. Holladay and A. Rodriguez (2018) In-hand manipulation via motion cones. Robotics: Science and Systems (RSS). Cited by: §I.
  27. K. Noda, H. Arie, Y. Suga and T. Ogata (2014) Multimodal integration learning of robot behavior using deep neural networks. Robotics and Autonomous Systems 62 (6), pp. 721–736. Cited by: §III-C.
  28. Y. Ohmura, Y. Kuniyoshi and A. Nagakubo (2006) Conformable and Scalable Tactile Sensor Skin for Curved Surfaces. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1348–1353. Cited by: item 1.
  29. T. Paulino, P. Ribeiro, M. Neto, S. Cardoso, A. Schmitz, J. Santos-Victor, A. Bernardino and L. Jamone (2017) Low-cost 3-axis Soft Tactile Sensors for the Human-Friendly Robot Vizzy. In IEEE International Conference on Robotics and Automation (ICRA), pp. 966–971. Cited by: item 2.
  30. S. Rusinkiewicz and M. Levoy (2001-05) Efficient variants of the icp algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152. External Links: Document Cited by: §III-A.
  31. M. Schwarz, H. Schulz and S. Behnke (2015-05) RGB-d object recognition and pose estimation based on pre-trained convolutional neural network features. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1329–1335. External Links: ISSN 1050-4729 Cited by: §III-A, §III-B.
  32. K. Takahashi, T. Ogata, J. Nakanishi, G. Cheng and S. Sugano (2017) Dynamic motion learning for multi-dof flexible-joint robots using active–passive motor babbling through deep learning. Advanced Robotics 31 (18), pp. 1002–1015. Cited by: §III-C.
  33. T. Akiba, K. Fukuda and S. Suzuki (2017) ChainerMN: Scalable Distributed Deep Learning Framework. In Proceedings of Workshop on ML Systems in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), External Links: Link Cited by: §V-D.
  34. S. Tokui, K. Oono, S. Hido and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), External Links: Link Cited by: §V-D.
  35. T. P. Tomo, A. Schmitz, W. K. Wong, H. Kristanto, S. Somlor, J. Hwang, L. Jamone and S. Sugano (2017) Covering a robot fingertip with uskin: a soft electronic skin with distributed 3-axis force sensitive elements for robot hands. IEEE Robotics and Automation Letters 3 (1), pp. 124–131. Cited by: §III-B.
  36. T. P. Tomo, W. K. Wong, A. Schmitz, H. Kristanto, A. Sarazin, L. Jamone, S. Somlor and S. Sugano (2016-11) A modular, distributed, soft, 3-axis sensor system for robot hands. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pp. 454–460. External Links: Document, ISSN 2164-0580 Cited by: §III-B.
  37. C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei and S. Savarese (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3343–3352. Cited by: §III-A.
  38. Y. Xiang, T. Schmidt, V. Narayanan and D. Fox (2018) PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. Robotics: Science and Systems (RSS). Cited by: §III-A.
  39. H. Yousef, M. Boukallel and K. Althoefer (2011) Tactile sensing for dexterous in-hand manipulation in robotics—a review. Sensors and Actuators A: Physical 167 (2), pp. 171 – 187. Note: Solid-State Sensors, Actuators and Microsystems Workshop External Links: ISSN 0924-4247, Document Cited by: §I.
  40. W. Yuan, S. Dong and E. H. Adelson (2017) GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force. Sensors 17 (12), pp. 2762. Cited by: §III-B, §V-A1.