Correlated and Individual Multi-Modal Deep Learning for RGB-D Object Recognition


Abstract

In this paper, we propose a correlated and individual multi-modal deep learning (CIMDL) method for RGB-D object recognition. Unlike most conventional RGB-D object recognition methods, which extract features from the RGB and depth channels individually, our CIMDL jointly learns feature representations from raw RGB-D data with a pair of deep neural networks, so that the sharable and modal-specific information can be simultaneously and explicitly exploited. Specifically, we construct a pair of deep residual networks for the RGB and depth data, and concatenate them at the top layer of the network with a loss function which learns a new feature space where both the correlated part and the individual parts of the RGB-D information are well modelled. The parameters of the whole network are updated by using the back-propagation algorithm. Experimental results on two widely used RGB-D object image benchmark datasets clearly show that our method outperforms most of the state-of-the-art methods.

1 Introduction

Figure 1: The pipeline of our proposed approach. We construct a two-way ResNet for RGB images and surface-normal depth images for feature extraction. We design a multi-modal learning layer to learn the correlated parts V_iX_i and the individual parts Q_iX_i of the RGB and depth features, respectively. We project the RGB-D features X_1, X_2 into a new feature space. The loss function enforces the affinity of the correlated parts from different modalities and the discriminative information of the modal-specific parts, where the combination weights are also automatically learned within the same framework. (Best viewed in color.)

Object recognition is one of the most challenging problems in computer vision, and has been catalysed by the swift development of deep learning [16] in recent years. Various works have achieved exciting results on several RGB object recognition challenges [11]. However, there are several limitations to object recognition using only RGB information in many real-world applications, since it projects the 3-dimensional world into a 2-dimensional space, which leads to inevitable data loss. To remedy these shortcomings of RGB images, using depth images as a complement is a plausible way. The RGB image contains information about color, shape and texture, while the depth image contains information about shape and edges. These basic features can serve as either a strength or a weakness in object recognition. For example, we are able to tell the difference between an apple and a table simply by the shape information from depth. However, it is ambiguous to figure out whether an object is an apple or an orange just by depth, and when an orange plastic ball and an orange are placed together, it is equally difficult to tell the difference just from the RGB image. This means that a simple combination of features from two modalities sometimes jeopardizes the discriminability of the features, so the shared and specific features should be chosen more wisely. Thus, we believe a more elaborate combination of modality-specific and modality-correlated features will generate a more discriminative representation.

With the development of high-quality consumer depth cameras such as the Microsoft Kinect, numerous efforts have been made on RGB-D object recognition in recent years. Compared to RGB object recognition, the introduction of depth images greatly improves the recognition performance because depth information provides geometrical cues which are invariant to lighting and color variations and are usually difficult to describe in RGB images. As depth-camera-embedded devices such as Google Tango and Microsoft HoloLens become increasingly common, the requirements and application potential for RGB-D object recognition technology are growing rapidly.

There are two main procedures in RGB-D object recognition: feature representation [1] and object matching [9]. Compared with object matching, feature representation affects the performance of the object recognition system more significantly, because real-world objects usually suffer from large intra-class discrepancy and inter-class affinity. A variety of methods have been proposed for RGB-D object representation recently [3], and they can be mainly classified into two categories: hand-crafted methods and learning-based methods. Methods in the first category design elaborate hand-crafted descriptors for both the RGB and depth channels [4] for feature extraction. Representative features include textons [29], color histograms [1], SIFT [30] and SURF [2], which describe objects from different aspects such as color, shape, and texture. However, these hand-crafted methods usually require a large amount of domain-specific knowledge, which makes it inconvenient to generalize them to different datasets. Methods in the second category employ machine learning techniques to learn feature representations in a data-driven manner, so that more data-adaptive discriminative information can be exploited [3]. However, most existing learning-based methods consider the RGB and depth information from the two channels individually, which ignores the sharable property and the interaction between these two modalities. To address this, multi-modal learning approaches have recently been presented for RGB-D object recognition. Wang et al. [38] proposed a multi-modal learning approach which extracts RGB and depth features within a deep learning framework and can fully exploit the information of both the RGB and depth modalities. While the correlated information of the RGB and depth data can be exploited, the modal-specific information is not explicitly modelled in their method. Another weakness of that work is that, since the total dimension of the shareable and specific features is fixed, no analysis is provided on how to choose the ratio between the shareable and specific dimensions. Moreover, they treat the individual part and the correlated part equally, which neglects the different discriminative power between them.

In this paper, we propose a correlated and individual multi-modal deep learning (CIMDL) method for RGB-D object recognition. Specifically, we develop a multi-modal learning framework to learn discriminative features from both the correlated and individual parts, and automatically learn the weights of the different feature components in a data-driven manner. The basic pipeline of our proposed CIMDL is illustrated in Figure 1. We first utilize two deep CNN streams to learn features from the RGB and depth modalities individually. Then, we feed the features learned from these two channels into a multi-modal learning layer to concatenate them. This layer is designed for three purposes: 1) generating the correlated part between these two modalities, 2) extracting the discriminative modal-specific parts of the features from each of the two modalities, and 3) automatically learning the weights of the correlated and individual parts for feature combination. Finally, a feature vector containing both the correlated part and the individual parts of the RGB and depth modalities is obtained at the top layer of the network, which is used as the final feature representation for RGB-D object recognition. The parameters of the whole network are updated by using the back-propagation algorithm. Experimental results on the RGB-D object [25] and 2D3D [7] datasets are presented to show the effectiveness of the proposed approach.
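To make the two-stream pipeline concrete, the following is a minimal PyTorch-style sketch of a two-way ResNet with a multi-modal learning layer of this kind. It is an illustration only: the paper's implementation is built on Caffe, and the layer sizes, the class names (MultiModalLayer, CIMDLNet) and the way the combination weights c are applied to the concatenated parts are assumptions rather than the authors' exact design.

    # Conceptual sketch of the CIMDL pipeline (assumed dimensions and names).
    import torch
    import torch.nn as nn
    from torchvision import models

    class MultiModalLayer(nn.Module):
        """Projects RGB and surface-normal features into correlated parts (V_i x)
        and individual parts (Q_i x), then classifies a weighted concatenation."""

        def __init__(self, feat_dim=2048, corr_dim=512, indiv_dim=512, num_classes=51):
            super().__init__()
            self.V1 = nn.Linear(feat_dim, corr_dim, bias=False)   # RGB -> correlated space
            self.V2 = nn.Linear(feat_dim, corr_dim, bias=False)   # SN  -> correlated space
            self.Q1 = nn.Linear(feat_dim, indiv_dim, bias=False)  # RGB -> individual space
            self.Q2 = nn.Linear(feat_dim, indiv_dim, bias=False)  # SN  -> individual space
            self.c = nn.Parameter(torch.ones(3) / 3.0)            # learned combination weights
            self.classifier = nn.Linear(corr_dim + 2 * indiv_dim, num_classes)

        def forward(self, x_rgb, x_sn):
            corr1, corr2 = self.V1(x_rgb), self.V2(x_sn)
            spec1, spec2 = self.Q1(x_rgb), self.Q2(x_sn)
            # Correlation term: the two correlated parts should agree.
            corr_loss = ((corr1 - corr2) ** 2).sum(dim=1).mean()
            feat = torch.cat([self.c[0] * 0.5 * (corr1 + corr2),
                              self.c[1] * spec1,
                              self.c[2] * spec2], dim=1)
            return self.classifier(feat), corr_loss

    class CIMDLNet(nn.Module):
        """Two-way ResNet-50 feature extractors followed by the multi-modal layer."""

        def __init__(self, num_classes=51):
            super().__init__()
            rgb, sn = models.resnet50(), models.resnet50()
            # Keep everything up to (and including) the last pooling layer.
            self.rgb_backbone = nn.Sequential(*list(rgb.children())[:-1])
            self.sn_backbone = nn.Sequential(*list(sn.children())[:-1])
            self.mm_layer = MultiModalLayer(num_classes=num_classes)

        def forward(self, rgb, sn):
            x1 = self.rgb_backbone(rgb).flatten(1)  # RGB pool5 features
            x2 = self.sn_backbone(sn).flatten(1)    # surface-normal pool5 features
            return self.mm_layer(x1, x2)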

2 Related Work

2.1 RGB-D Object Recognition

A variety of methods have been proposed for RGB-D object representation recently, and they can be mainly classified into two categories: hand-crafted methods and learning-based methods. There are many visual tasks based on RGB-D input [8]. Most conventional RGB-D object recognition methods use hand-crafted feature descriptors for object matching. For example, Lai et al. [25] exploited a set of hand-crafted features, including color histograms [1], textons [29] and SIFT [30] for the RGB channel, and spin images [22], SIFT [30] and multiple shape features for the depth channel. Finally, they concatenated these features together as the final representation for recognition. Bo et al. proposed a kernel descriptor method [4] which combines several RGB-D features such as 3D shape, depth edge and texture for RGB-D object recognition.

In recent years, learning-based features have attracted increasing attention in RGB-D object recognition. For example, Blum et al. [3] used an unsupervised learning method to obtain a codebook for feature encoding. Bo et al. [5] presented a Hierarchical Matching Pursuit (HMP) method to learn higher-level representations from local patches. Socher et al. [36] presented a cascaded network which integrates a CNN and an RNN to learn a concatenated feature for the RGB and depth images. Browatzki et al. [7] fused several SVM-trained features by utilizing a multi-layer perceptron. Lai et al. [26] devised a distance metric learning approach [31] to fuse heterogeneous feature representations for RGB-D object recognition. However, most of these methods learn features from the color and depth channels individually and construct the final representation by simple concatenation, which ignores the physical meanings of the different feature modalities and their potential relationship.

Deep learning has also been employed for RGB-D visual analysis in recent years [10]. For example, Gupta et al. [17] proposed an approach to encode the depth data into three channels: horizontal disparity, height above ground, and the angle between the point normal and the inferred gravity direction. Then, they trained CNNs on these three-channel encodings instead of the original depth images for RGB-D object recognition and segmentation. Couprie et al. [10] presented a multi-scale CNN for RGB-D scene labeling based on a hierarchical feature method. Wang et al. [41] designed a deep neural network for surface normal prediction. However, these methods ignore the relationship between the data from different modalities because the RGB and depth information are simply concatenated.

More recently, several multi-modal deep learning methods have been proposed to make better use of the RGB and depth information from different modalities for various visual analysis tasks. For example, Srivastava et al. [37] proposed a multi-modal Deep Boltzmann Machine approach, where a concatenation layer was added to connect DBMs from different modalities to learn multi-modal feature representations jointly. Eitel et al. [12] proposed a two-stream CNN model combined with a fusion layer for RGB-D object recognition. Wang et al. [38] proposed a multi-modal deep feature learning approach which exploits the shareable properties of RGB and depth images for RGB-D object recognition. Inspired by Wang's work, Zhu et al. [42] proposed a discriminative multi-modal feature learning method for RGB-D scene recognition. Lenz et al. [28] presented a multi-modal deep learning approach for robotic grasp detection, where stacked auto-encoders were used for multi-modal feature learning. Jou et al. [23] proposed cross-residual learning for multi-task visual recognition.

3 Proposed Approach

3.1 Baseline Architecture

Several CNN-based methods have been proposed for RGB-D object recognition. For example, Couprie et al. used a four-channel CNN (three channels from the RGB data and one from the depth data) for scene labeling [10]. Gupta et al. [17] extracted features from the RGB and depth images independently and concatenated them as the final features for object detection, where both the RGB CNN and the depth CNN are fine-tuned from a model pre-trained on the ImageNet dataset [11]. Another option is to fuse the second fully connected layers of the RGB network and the depth network so that the supervised information can be back-propagated to both modalities. The structure of the CIMDL layer which we use in this work is shown in Figure 2. As the residual network has proved its strength over conventional CNN architectures [18], we adopt ResNet as the baseline architecture for both the RGB and depth channels and train the networks independently, following the setting in [17], to extract features from the last pooling layer of the ResNet. For the depth network, we adopt surface normals instead of raw depth images for network training, so that it can be fine-tuned from the RGB pre-trained model.

Figure 2: The architecture of the last layer of our network, where V_i and Q_i denote the mapping matrices that map the original features into the correlated and individual feature spaces, respectively. The output of the last layer is concatenated into a new feature vector, which is assigned different weights c_i for the different components in the learning framework.

3.2 Multi-modal Deep Learning Model

We develop a multi-modal deep architecture for RGB-D feature learning, which consists of two residual networks. Specifically, we adopt two models pre-trained on the ImageNet dataset [11] and fine-tune them on our RGB training data and depth training data to generate the parameters of the RGB-ResNet and SN-ResNet layers, respectively. Then, we feed the outputs of the last pooling layers of both RGB-ResNet and SN-ResNet into our correlated and individual multi-modal learning structure. In our new structure, we replace the original softmax layer with our new CIMDL layer, which will be detailed later. The pipeline of our proposed method is shown in Figure 1. Instead of directly feeding the depth image to the ResNet, we extract surface normals from the depth information by following the setting in [17], which encodes it as a three-channel representation. We empirically find that surface normals result in better recognition performance than raw depth data in feature learning.
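For illustration, a simple way to obtain such a three-channel surface-normal image from a depth map is to differentiate the depth and normalize the resulting normal vectors, as in the sketch below. This is only an approximation of the general idea; the actual normal estimation follows the setting of [17] and may differ, and the scaling to an 8-bit image is an assumption made so that the result can be fed to an RGB-pretrained network.

    # Sketch: depth map -> three-channel surface-normal image (illustrative only).
    import numpy as np

    def depth_to_surface_normals(depth):
        """Estimate per-pixel surface normals from a 2-D depth array and rescale
        them to [0, 255] so they can be treated like a color image."""
        dz_dv, dz_du = np.gradient(depth)                    # derivatives along rows / columns
        normals = np.dstack((-dz_du, -dz_dv, np.ones_like(depth)))
        norm = np.linalg.norm(normals, axis=2, keepdims=True)
        normals = normals / np.maximum(norm, 1e-8)           # unit normals in [-1, 1]
        return ((normals + 1.0) * 0.5 * 255).astype(np.uint8)

    # Example with a synthetic tilted plane as the depth map.
    depth = 1.0 + 0.001 * np.arange(640)[None, :] + 0.002 * np.arange(480)[:, None]
    sn_image = depth_to_surface_normals(depth)
    print(sn_image.shape)  # (480, 640, 3)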

Figure 2 shows the architecture of the last layer of our network, where X_1 and X_2 denote the activations of the last pooling layers of RGB-ResNet and SN-ResNet for the n images in one data batch, Q_i and V_i are the mapping matrices which transfer the original features into the modal-specific domain and the correlated domain, respectively, and Y denotes the label matrix over the C classes.

The physical meaning of our multi-modal learning model is to leverage the correlated properties of the RGB and depth information, enforce the modal-specific property of both modalities, and adjust the weights of the different feature parts for recognition. Therefore, the final purpose of our proposed method is to learn two sets of mapping matrices V_i and Q_i that map the original features into the correlated feature space and the individual feature space, respectively. Hence, there are three key characteristics in our model: 1) a multi-modal deep learning strategy which automatically decomposes features into a correlated part and an individual part, 2) ensuring the discriminative power and orthogonality of the correlated part and the individual part, and 3) learning the weights of the different parts in a data-driven manner to improve the recognition performance.

3.3 Objective Function

We first learn the mapping matrices for both modalities to map the original features into the correlated feature space, where we expect to minimize the difference between the correlated parts of the two modalities. Let X_i be the d-dimensional activations of the last pooling layer in one batch with n images, where i = 1 and i = 2 correspond to the RGB channel and the depth channel, respectively. Our goal is to learn discriminative feature representations which achieve two objectives: 1) some information is shared by the different modalities, and 2) some modal-specific information is exploited for each modality individually. To achieve this, we formulate the following objective function:
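The displayed equation at this point did not survive extraction. Based on the notation in Figures 1 and 2, where V_iX_i denotes the correlated parts, a plausible reconstruction of this correlation objective is the following (the exact formulation in the original may differ):

    \min_{V_1, V_2} \; \| V_1 X_1 - V_2 X_2 \|_F^2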

where ||.||_F denotes the Frobenius norm. By minimizing the above objective function, the mapping matrices V_1 and V_2 increase the similarity of the correlated parts from the two modalities. Besides the correlated part, the modal-specific feature is also an essential component of the feature X_i. Hence, we present the following criterion to achieve this goal:

where V_iX_i is the correlated part and Q_iX_i denotes the modal-specific component of the i-th modality's feature. The correlated part is the shared component of the original feature X_i. Both the correlated feature and the modal-specific feature reconstruct the original feature as follows:
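The corresponding equation is also missing from the extracted text. One reconstruction that is consistent with the surrounding description (the original feature recovered from its correlated part V_iX_i and its modal-specific part Q_iX_i) is shown below; the transposed mappings are an assumption about how the parts are mapped back to the original feature space:

    X_i = V_i^{\top} (V_i X_i) + Q_i^{\top} (Q_i X_i), \quad i = 1, 2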

where V_i is the mapping matrix that projects the original feature into the correlated domain and Q_iX_i denotes the individual component of the i-th modality's feature. Since the two matrices V_i and Q_i map the original feature into the correlated domain and the modal-specific domain, respectively, they should be unrelated and not contaminated by each other. In other words, the mapping matrices contain bases that come from discrepant subspaces and should be orthogonal to each other. Therefore, we impose the following constraints:
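The displayed constraints are likewise missing; given the description above, a natural reconstruction is mutual orthogonality of the two mappings for each modality:

    V_i Q_i^{\top} = 0, \quad i = 1, 2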

Consequently, we formulate the objective function of our multi-modal feature learning with two parts. The first part enforces the features of the two domains to share a common correlated part after the mappings V_i. The second part is the softmax loss of our network. The first constraint ensures the reconstruction of the original features, and the second constraint enforces the discrepancy between the correlated part and the modal-specific part, as described below:
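The full objective is also missing from the extracted text. Combining the pieces described above (the correlation term, the softmax loss on the weighted concatenation of the feature parts, and the reconstruction and orthogonality constraints), one plausible overall form is the following, where the trade-off weight \lambda, the loss symbol L_softmax and the notation H(c) for the weighted concatenated CIMDL feature are assumptions introduced here:

    \min_{V_i, Q_i, W, c} \; \| V_1 X_1 - V_2 X_2 \|_F^2 + \lambda \, L_{\mathrm{softmax}}\big( W, H(c), Y \big)
    \quad \text{s.t.} \quad X_i = V_i^{\top} V_i X_i + Q_i^{\top} Q_i X_i, \;\; V_i Q_i^{\top} = 0, \;\; i = 1, 2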

where W denotes the weights of the last softmax layer for the different modalities and feature parts, and the activation of the CIMDL layer is the weighted concatenation of the correlated and individual parts. In Figure 2, the matrices in the middle show the detail of this activation. The vector c contains the weights of the different feature parts for the softmax regression, which will be introduced later. Here we incorporate supervised learning by minimizing the softmax regression loss from the CIMDL activation to the label matrix Y.

3.4 Optimization

In this work, we adopt an alternating optimization over all of the variables V_i, Q_i, W and c. Following the Lagrange multiplier and gradient descent criteria, we obtain a local optimal solution for (6). Based on (6), we construct a Lagrange function as follows:
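The Lagrange function itself is missing from the extracted text. Based on the term-by-term description that follows (correlation difference, reconstruction penalty, orthogonality penalty, a supervised regression term, and regularization with the smooth penalty of [27]), a plausible form is the following; the symbols \alpha, \beta, \gamma for the multipliers and controlling parameter, g for the smooth penalty, and H(c) for the weighted CIMDL feature are assumptions introduced here:

    L(V_i, Q_i, W, c) = \| V_1 X_1 - V_2 X_2 \|_F^2
        + \alpha \sum_{i=1}^{2} \| X_i - V_i^{\top} V_i X_i - Q_i^{\top} Q_i X_i \|_F^2
        + \beta \sum_{i=1}^{2} \| V_i Q_i^{\top} \|_F^2
        + \gamma \, L_{\mathrm{softmax}}\big( W, H(c), Y \big)
        + \sum_{i=1}^{2} \big( g(V_i) + g(Q_i) \big)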

where ||.||_F denotes the Frobenius norm, and the two positive Lagrange multipliers are associated with the linear constraints on reconstruction and orthogonality, respectively. The first term of the objective function minimizes the difference between the correlated parts generated from the color modality and the depth modality separately. The second term regularizes the feature's ability to be reconstructed from its correlated part and modal-specific part. The third part ensures the mutual orthogonality by regulating the inner product of the two transfer matrices V_i and Q_i. The last part of the objective function regularizes V_i and Q_i with the smooth penalty function of [27].

By applying the gradient descent algorithm, we update the variables V_i, Q_i and W with the same learning rate. First, we update V_1 and Q_1. The derivative of the Lagrange function with respect to V_1 can be expressed as

where the derivative of the smooth penalty function [27] appears in the regularization term, and c_1 is the weight of the correlated part in the regression.

According to the gradient descent rule, V_1 is updated as

The derivative of the Lagrange function with respect to Q_1 can be expressed as

According to the gradient descent rule, Q_1 is updated as

As V_2 and Q_2 share similar update forms with V_1 and Q_1, we omit the details of how those variables are updated.

Having updated the mapping matrices V_i and Q_i, we keep them fixed and update the regression matrix W. Following the work of [32], the derivative of the Lagrange function with respect to W is

where D is a diagonal matrix whose entries are determined by the row-wise l2-norms of W, as defined in [32]. According to the gradient descent criterion, we update W by

The correlated part of the two modalities' features plays a significant role in extracting the shareable information, but it is not necessarily as discriminative as the modal-specific parts. Therefore, we design adaptive weights for the different parts of the fused feature, where c_1, c_2 and c_3 correspond to the correlated part, the modal-specific part for RGB and the modal-specific part for surface normals, respectively. With the feature representations V_iX_i and Q_iX_i updated and fixed, we update the adaptive weight vector c according to the following rule:

We then normalize the vector c so that its entries sum to one. In this way, we iteratively and automatically select the weights of the different parts of our learned feature. Our proposed CIMDL method is summarized in Algorithm 1.
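Since Algorithm 1 and the individual update equations are not reproduced in this text, the following runnable toy sketch summarizes the alternating optimization described above, using automatic differentiation on a Lagrangian-style loss instead of the closed-form gradients. The dimensions, the multipliers alpha, beta, gamma, the plain gradient steps and the renormalization of c are assumptions for illustration and are not the paper's exact update rules.

    # Toy sketch of the alternating optimization (assumed sizes and multipliers).
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, d, dc, dq, C = 64, 128, 32, 32, 10            # batch, feature, corr., indiv., classes
    X1, X2 = torch.randn(d, n), torch.randn(d, n)    # RGB / surface-normal pool5 features
    Y = torch.randint(0, C, (n,))                    # class labels

    V = [(0.01 * torch.randn(dc, d)).requires_grad_() for _ in range(2)]
    Q = [(0.01 * torch.randn(dq, d)).requires_grad_() for _ in range(2)]
    W = (0.01 * torch.randn(C, dc + 2 * dq)).requires_grad_()
    c = torch.tensor([1 / 3, 1 / 3, 1 / 3]).requires_grad_()
    alpha, beta, gamma, lr = 1e-3, 1e-3, 1.0, 1e-3

    def lagrangian():
        X = [X1, X2]
        corr = [V[i] @ X[i] for i in range(2)]       # correlated parts V_i X_i
        spec = [Q[i] @ X[i] for i in range(2)]       # individual parts Q_i X_i
        loss = ((corr[0] - corr[1]) ** 2).sum()      # correlated parts should agree
        for i in range(2):
            loss = loss + alpha * ((X[i] - V[i].t() @ corr[i] - Q[i].t() @ spec[i]) ** 2).sum()
            loss = loss + beta * ((V[i] @ Q[i].t()) ** 2).sum()
        feat = torch.cat([c[0] * 0.5 * (corr[0] + corr[1]),
                          c[1] * spec[0], c[2] * spec[1]], dim=0)
        return loss + gamma * F.cross_entropy((W @ feat).t(), Y)

    for it in range(100):
        # Alternate over the variable groups, keeping the other groups fixed:
        # 1) mapping matrices V_i, Q_i; 2) regression matrix W; 3) adaptive weights c.
        for group in (V + Q, [W], [c]):
            grads = torch.autograd.grad(lagrangian(), group)
            with torch.no_grad():
                for p, g in zip(group, grads):
                    p -= lr * g
        with torch.no_grad():                        # keep the weights positive, summing to one
            c.clamp_(min=1e-6)
            c /= c.sum()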

3.5 Discussion

As our method is inspired by previous work on multi-modal learning and deep learning for RGB-D vision tasks, in this section we highlight the similarities and differences between our approach and several existing related methods.

Differences from other deep learning methods [10]:

In [10], the depth information is used as an extra channel which is concatenated with the RGB image at the input. Both [35] and [17] employed multi-stream architectures for their RGB and depth channels, but they ignore the relationship between the RGB and depth information. These works treat RGB and depth images indiscriminately: they either concatenate RGB and depth at the input layer [10] or concatenate the features obtained from the two neural networks [17] to make use of the RGB-D data. The main difference between our proposed method and those deep learning based methods is that we explore the relationship between the RGB and depth data and utilize both the correlated and individual information explicitly.

Differences from other multi-modal learning methods [12]:

CNN-RNN [36] is a cascaded architecture containing both a CNN and an RNN. Multi-modal DBM [37] trained two modality-specific deep Boltzmann machines for the text and image modalities. Compared with these methods, our method uses ResNet for the feature learning of the different modalities. Fus-CNN [12] employed a two-stream CNN structure and adopted a three-step training method which includes both joint training and individual training; however, compared with our method, it fails to leverage the information that is shared by both modalities. Most of these methods integrate the different modalities into the final feature representation while neglecting the relationship between the modalities. This strategy works well when the data from the different modalities vary greatly; however, RGB and depth share similar properties in many aspects while retaining their modal-specific information. To better leverage this similarity between RGB and depth, MMSS [38] investigated the effectiveness of shareable and specific features for RGB-D recognition with a multi-modal learning framework which does not treat the RGB-D image as information from two undifferentiated channels but explores the relationship between them. Our work is enlightened by theirs; compared with their CNN architecture, we use ResNet as the baseline. However, they explicitly model only the shareable part of the RGB and depth channels and simply call the remainder of the features the specific part, which may diminish the discriminative power of the modal-specific features. Moreover, they manually fix the dimensions of the shareable and specific feature vectors and treat the shareable part and the specific part equally, which neglects the different discriminative power of the two parts. Thus, in our work, we enforce the learning of the modal-specific part by exerting constraints on both parts and automatically learn how to utilize both the correlated and individual features within our proposed architecture.

4 Experiments

We conducted experiments on two datasets, the RGB-D Object Dataset [25] and the 2D3D Dataset [7], to evaluate the effectiveness of our proposed method for RGB-D object recognition. The following describes the detailed experimental settings and results.

The work in [12] proposed a CNN-based RGB-D object recognition approach which implements two-stream CNNs for the RGB and depth inputs with a fused softmax layer at the top of the network. Although the design of their architecture is quite straightforward, they reported fairly satisfying recognition accuracy and achieved competitive results on both the RGB-D Object Dataset and the 2D3D Dataset. As a comparison, we implemented their two-stream network both as a baseline structure and as a pre-trained model for our CIMDL. Furthermore, we changed the input to surface normals instead of the default HHA input stated in [12] for a fair comparison, as some of the RGB-D data do not provide HHA or the parameters that are essential for generating HHA from raw depth images. Our architecture achieves higher accuracy, which presents an edge over the method proposed in [12].

4.1 Datasets

RGB-D Object Dataset:

The RGB-D Object Dataset is a large dataset containing 51 classes of 300 distinct objects shot from multiple views. The objects in this dataset are cups, fruits, household items, tools, vegetables and other things that occur frequently in daily life. Each object is recorded in video mode by cameras located at three different positions, yielding different elevation angles. There are 207,920 RGB-D image frames, with roughly 600 images per object. In our experiments, we down-sampled the videos by taking one frame from every 5 consecutive frames. We ran the 10 random splits provided by [25], where each split covers all 51 classes with different objects, leaving approximately 51 objects for testing. We conducted experiments on these different splits, with about 34,000 images for training and 6,900 images for testing on average.

2D3D Dataset:

The 2D3D dataset includes 154 objects in 14 different classes. The objects include books, bottles, monitors and other items that occur frequently in office environments. Each object is recorded by a CamCube 2.0 time-of-flight camera, generating 5544 RGB-D images in total, which were shot from different elevation angles. Following the same settings as in [7], we divided the dataset into a training part and a testing part, with 6 objects of each class. Finally, 1476 RGB-D images from 82 objects are used for training and 1332 RGB-D images from 74 objects are employed for testing.

4.2 Implementation Details

Architecture of ResNets:

Our experiments were performed with the Caffe framework [21]. We adopted the 50-layer ResNet structure described in [18]. The two modalities share the same network architecture before the last pooling layer. For both the RGB modality and the surface-normal modality, the input images were resized to the network's fixed input size.

Fine-tuning:

The ResNet model [18] pre-trained on the ImageNet dataset [11] was used for both the RGB and SN channels. Our fine-tuning setting is the same as the one used in [17]: the learning rate was initialized to a fixed value and decreased by a factor of 0.1 at regular intervals, and the model was fine-tuned for a fixed number of iterations with a fixed batch size. The whole fine-tuning procedure was carried out on a Tesla K40c GPU.

Parameter Settings:

In the Lagrange function of our multi-modal learning model, the reconstruction and orthogonality controlling parameters were set as functions of the number of training samples and the dimension of the feature vectors, the softmax regression controlling parameter was fixed, and the remaining regularization parameters were set empirically. The learning rates for the mapping matrices and for the regression matrix were also set empirically. All parameters were kept the same for the experiments on both the RGB-D Object Dataset and the 2D3D Dataset.

4.3 Results on RGB-D Object Dataset

Figure 3: The confusion matrix of our method on the RGB-D Object Dataset, where the vertical axis indicates the ground truth label and the horizontal axis indicates the predicted label, respectively.

Comparison with Deep Learning Baselines:

We first constructed several deep learning baselines for RGB-D object recognition using ResNet and compared them with our proposed approach. Motivated by the works of Gupta et al. [17] and Eitel et al. [12], we encoded each depth image into a surface-normal image to efficiently exploit the information provided by the depth data.

The structures of our different baseline methods are as follows: 1) ResNet using only RGB images as input, with and without a pre-trained model; 2) ResNet using only depth images as input; 3) ResNet using only surface-normal images as input, with and without a pre-trained model; and 4) two separate ResNet streams trained for RGB and surface normals with their last pooling layers concatenated, with and without a pre-trained model.

Table 1 shows the performance of the different baselines. We see that the ResNet trained on raw depth images achieves worse performance than the one trained on surface normals. This is because surface normals can better represent geometric information than depth images. The accuracy rises swiftly when we combine the RGB-ResNet and SN-ResNet into a two-way architecture. The 6-channel RGB-SN input performs comparably with a 4-channel RGB-D input CNN. This indicates that adding more modalities improves recognition performance over using only one modality such as RGB or depth. However, the architecture which fuses the RGB and depth images at the input greatly degrades the modal-specific part of each input modality. The experimental results clearly show that the two-way ResNet structure is more effective and accurate because it preserves more information from the modal-specific parts.

In order to boost performance on the RGB-D Object Dataset, we also used the Caffe model of [18] pre-trained on ImageNet [11] for both the RGB-input ResNet and the SN-input ResNet. Compared with the same structure without a pre-trained model, this achieved higher recognition accuracy.

Table 1: Comparison of our method and deep learning baselines.
Method Accuracy (%)
RGB ResNet
Depth ResNet
SN ResNet
RGB ResNet (pretrain)
SN ResNet (pretrain)
RGB-SN ResNet (two-way)
RGB-SN ResNet (two-way, pretrain)
Ours
Table 2: Comparison of our method and state-of-the-art methods on the RGB-D Object Dataset.
Method Accuracy (%)
Lai et al. [25]
Blum et al. [3]
Socher et al. [36]
Bo et al. [6]
Wang et al. [38]
Eitel et al. [12]
Ours
Figure 4: Some misclassified instances of our method on the RGB-D Object Dataset. These misclassifications are caused by large variations of shape, or by color or texture affinity with objects from other classes.

In our method, we used surface normals as a substitute for the depth images, together with the pre-trained model. Our proposed multi-modal learning method outperforms the best baseline in terms of accuracy. Compared with the best baseline, which simply concatenates the last pooling layers, our method can 1) generate the correlated part and the modal-specific parts of the different modalities and ensure that they are not contaminated by each other, and 2) better exploit the information from the correlated and individual parts of the different modalities.

Comparison with State-of-the-art Methods:

We also compare our method with six state-of-the-art RGB-D object recognition methods: 1) the approach of [25], which extracts depth features using SIFT and spin images and RGB features using SIFT and color and texton histograms; 2) CKM [3]; 3) CNN-RNN [36]; 4) HMP [6]; 5) MMSS [38]; and 6) Fus-CNN [12]. In particular, the work in [12] proposed a CNN-based RGB-D object recognition approach which implements two-stream CNNs for the RGB and depth inputs with a fused softmax layer at the top of the network. However, they used RGB images and HHA as the input, while ours uses RGB images and surface normals. For a fair comparison, we implemented their work and adapted it to the same input and equivalent experimental settings, and our architecture achieves higher accuracy, which presents an edge over the method proposed in [12]. Table 2 shows the performance of the different methods on the RGB-D Object Dataset; we clearly see that our method outperforms the existing state-of-the-art methods.

Figure 3 shows the confusion matrix of the recognition results on the RGB-D Object Dataset, where the diagonal elements represent the accuracy for each object class. We display several misclassified objects in Figure 4. These errors are due to the shape similarity between objects from different classes, such as camera and cellphone, or to color affinity, such as ball and garlic. Texture similarity is also to blame for some misclassifications.

4.4 Results on 2D3D Dataset

We utilized the same architecture as in Section 4.3 and used the same pre-trained Caffe model. Table 3 shows the performance of the different baselines on the 2D3D dataset, and Table 4 shows the comparison between our proposed method and several state-of-the-art methods. The confusion matrix of our recognition results is shown in Figure 5. Our method achieves better results than both the baseline methods and previous methods.

Figure 5: The confusion matrix of our method on the 2D3D dataset, where the vertical axis indicates the ground truth and the horizontal axis indicates the predicted labels, respectively.
Table 3: Comparison with existing deep learning baselines on the 2D3D dataset.
Method Accuracy(%)
RGB ResNet
SN ResNet
RGB ResNet(pretrain)
SN ResNet(pretrain)
RGB+SN ResNet (two-way)
RGB+SN ResNet(two-way, pretrain)
Ours
Table 4: Comparison with state-of-the-art methods on the 2D3D dataset.
Method Accuracy(%)
Browatzki et al. [7]
Bo et al. [6]
Wang et al. [38]
Ours

4.5 Parameter Analysis

The accuracy of our object recognition method is affected by the weights, which decide whether the correlated component or the modal-specific part from RGB or depth dominates. In our proposed method, the weights are self-adapted. In this section, we instead kept the weights fixed to reveal the relationship between the weights and the recognition accuracy. Note that p denotes the weight of the correlated part and the other two weights correspond to the RGB and depth modalities, respectively, where the three weights sum to one. We then vary the value of p.

Figure 6 shows the recognition rate of our method versus different values of the weighting parameter p. When p is small, the correlated part plays a smaller role in recognition and the accuracy is relatively low. When p becomes larger, the correlated part gradually plays a more important role and the accuracy rises. However, when p is too large, which means that the correlated part dominates, the accuracy decreases as the effect of the modal-specific parts begins to vanish. The value of p at which the accuracy reaches its peak is close to the value of the parameter learned by our multi-modal learning method.

Figure 6: The relationship between the weight p and the recognition accuracy on the RGB-D Object Dataset.

5 Conclusions

In this paper, we have proposed a correlated and individual multi-modal deep learning method for RGB-D object recognition. In our proposed method, we enforce both the correlated and the modal-specific parts of the features learned for RGB-D object images to satisfy several characteristics within a joint learning framework, so that the sharable and modal-specific information can be well exploited. Experimental results on two widely used RGB-D object image benchmark datasets clearly show that our method outperforms state-of-the-art methods. Extending the proposed method to RGB-D video classification is an interesting direction for future work.

References

  1. CSIFT: A SIFT descriptor with color invariant characteristics.
    A. E. Abdel-Hakim and A. A. Farag. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1978–1983, 2006.
  2. Speeded-up robust features (SURF).
    H. Bay, A. Ess, T. Tuytelaars, and L. J. V. Gool. Computer Vision and Image Understanding, 110(3):346–359, 2008.
  3. A learned feature descriptor for object recognition in RGB-D data.
    M. Blum, J. T. Springenberg, J. Wülfing, and M. A. Riedmiller. In IEEE International Conference on Robotics and Automation, pages 1298–1303, 2012.
  4. Depth kernel descriptors for object recognition.
    L. Bo, X. Ren, and D. Fox. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 821–826, 2011.
  5. Hierarchical matching pursuit for image classification: Architecture and fast algorithms.
    L. Bo, X. Ren, and D. Fox. In Advances in Neural Information Processing Systems, pages 2115–2123, 2011.
  6. Unsupervised feature learning for RGB-D based object recognition.
    L. Bo, X. Ren, and D. Fox. In International Symposium on Experimental Robotics, pages 387–402, 2012.
  7. Going into depth: Evaluating 2d and 3d cues for object classification on a new, large-scale object dataset.
    B. Browatzki, J. Fischer, B. Graf, H. H. Bülthoff, and C. Wallraven. In IEEE International Conference on Computer Vision Workshops, pages 1189–1195, 2011.
  8. Query adaptive similarity measure for rgb-d object recognition.
    Y. Cheng, R. Cai, C. Zhang, Z. Li, X. Zhao, K. Huang, and Y. Rui. In IEEE International Conference on Computer Vision, pages 145–153, December 2015.
  9. Support-vector networks.
    C. Cortes and V. Vapnik. Machine Learning, 20(3):273–297, 1995.
  10. Indoor semantic segmentation using depth information.
    C. Couprie, C. Farabet, L. Najman, and Y. LeCun. CoRR, abs/1301.3572, 2013.
  11. Imagenet: A large-scale hierarchical image database.
    J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  12. Multimodal deep learning for robust RGB-D object recognition.
    A. Eitel, J. T. Springenberg, L. Spinello, M. A. Riedmiller, and W. Burgard. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 681–687, 2015.
  13. The softmax nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element.
    I. M. Elfadel and J. L. W. Jr. In Advances in Neural Information Processing Systems, pages 882–887, 1993.
  14. The pascal visual object classes (voc) challenge.
    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. International Journal of Computer Vision, 88(2):303–338, June 2010.
  15. Interactive segmentation on rgbd images via cue selection.
    J. Feng, B. Price, S. Cohen, and S.-F. Chang. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4321–4329, June 2016.
  16. Rich feature hierarchies for accurate object detection and semantic segmentation.
    R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  17. Learning rich features from RGB-D images for object detection and segmentation.
    S. Gupta, R. B. Girshick, P. A. Arbeláez, and J. Malik. In European Conference on Computer Vision, pages 345–360, 2014.
  18. Deep residual learning for image recognition.
    K. He, X. Zhang, S. Ren, and J. Sun. CoRR, abs/1512.03385, 2015.
  19. Random decision forests.
    T. K. Ho. In International Conference on Document Analysis and Recognition, pages 278–282, 1995.
  20. Real-time RGB-D activity prediction by soft regression.
    J. Hu, W. Zheng, L. Ma, G. Wang, and J. Lai. In European Conference on Computer Vision, pages 280–296, 2016.
  21. Caffe: Convolutional architecture for fast feature embedding.
    Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. In ACM Multimedia, pages 675–678, 2014.
  22. Using spin images for efficient object recognition in cluttered 3d scenes.
    A. E. Johnson and M. Hebert. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, 1999.
  23. Deep cross residual learning for multitask visual recognition.
    B. Jou and S. Chang. CoRR, abs/1604.01335, 2016.
  24. Imagenet classification with deep convolutional neural networks.
    A. Krizhevsky, I. Sutskever, and G. E. Hinton. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  25. A large-scale hierarchical multi-view RGB-D object dataset.
    K. Lai, L. Bo, X. Ren, and D. Fox. In IEEE International Conference on Robotics and Automation, pages 1817–1824, 2011.
  26. Sparse distance learning for object recognition combining RGB and depth information.
    K. Lai, L. Bo, X. Ren, and D. Fox. In IEEE International Conference on Robotics and Automation, pages 4007–4013, 2011.
  27. ICA with reconstruction cost for efficient overcomplete feature learning.
    Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.
  28. Deep learning for detecting robotic grasps.
    I. Lenz, H. Lee, and A. Saxena. International Journal of Robotics Research, 34(4-5):705–724, 2015.
  29. Representing and recognizing the visual appearance of materials using three-dimensional textons.
    T. K. Leung and J. Malik. International Journal of Computer Vision, 43(1):29–44, 2001.
  30. Distinctive image features from scale-invariant keypoints.
    D. G. Lowe. International Journal of Computer Vision, 60(2):91–110, 2004.
  31. Recognition by association via learning per-exemplar distances.
    T. Malisiewicz and A. A. Efros. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
  32. Efficient and robust feature selection via joint l2, 1-norms minimization.
    F. Nie, H. Huang, X. Cai, and C. H. Q. Ding. In Advances in Neural Information Processing Systems, pages 1813–1821, 2010.
  33. Understanding everyday hands in action from rgb-d images.
    G. Rogez, J. S. Supancic, III, and D. Ramanan. In IEEE International Conference on Computer Vision, pages 3889–3897, December 2015.
  34. Learning a distance metric from relative comparisons.
    M. Schultz and T. Joachims. In Advances in Neural Information Processing Systems, pages 41–48, 2003.
  35. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features.
    M. Schwarz, H. Schulz, and S. Behnke. In ICRA, pages 1329–1335. IEEE, 2015.
  36. Convolutional-recursive deep learning for 3d object classification.
    R. Socher, B. Huval, B. P. Bath, C. D. Manning, and A. Y. Ng. In Advances in Neural Information Processing Systems, pages 665–673, 2012.
  37. Multimodal learning with deep boltzmann machines.
    N. Srivastava and R. Salakhutdinov. In Advances in Neural Information Processing Systems, pages 2231–2239, 2012.
  38. MMSS: multi-modal sharable and specific feature learning for RGB-D object recognition.
    A. Wang, J. Cai, J. Lu, and T. Cham. In IEEE International Conference on Computer Vision, pages 1125–1133, 2015.
  39. Learning actionlet ensemble for 3d human action recognition.
    J. Wang, Z. Liu, Y. Wu, and J. Yuan. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):914–927, 2014.
  40. Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks.
    J. Wang, Z. Wang, D. Tao, S. See, and G. Wang. CoRR, abs/1608.01082, 2016.
  41. Designing deep networks for surface normal estimation.
    X. Wang, D. F. Fouhey, and A. Gupta. In IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
  42. Discriminative multi-modal feature fusion for rgbd indoor scene recognition.
    H. Zhu, J.-B. Weibel, and S. Lu. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2969–2976, June 2016.