RotationNet: Joint Object Categorization and Pose Estimation
Using Multiviews from Unsupervised Viewpoints
We propose a Convolutional Neural Network (CNN)-based model “RotationNet,” which takes multi-view images of an object as input and jointly estimates its pose and object category. Unlike previous approaches that use known viewpoint labels for training, our method treats the viewpoint labels as latent variables, which are learned in an unsupervised manner during training using an unaligned object dataset. RotationNet is designed to use only a partial set of multi-view images for inference, and this property makes it useful in practical scenarios where only partial views are available. Moreover, our pose alignment strategy enables one to obtain view-specific feature representations shared across classes, which is important to maintain high accuracy in both object categorization and pose estimation. The effectiveness of RotationNet is demonstrated by its superior performance to the state-of-the-art methods of 3D object classification on the 10- and 40-class ModelNet datasets. We also show that RotationNet, even trained without known poses, achieves state-of-the-art performance on an object pose estimation dataset. The code is available at https://github.com/kanezaki/rotationnet
1 Introduction
Object classification accuracy can be enhanced by the use of multiple different views of a target object [4, 23]. Recent remarkable advances in image recognition and the collection of 3D object models have enabled the learning of multi-view representations of objects in various categories. However, in real-world scenarios, objects can often only be observed from limited viewpoints due to occlusions, which makes it difficult to rely on multi-view representations that are learned with the whole circumference. The desired property for real-world object classification is that, when a viewer observes a partial set (one or more images) of the full multi-view images of an object, it should recognize from which directions it observed the target object in order to correctly infer the category of the object. It has been understood that if the viewpoint is known, the object classification accuracy can be improved. Likewise, if the object category is known, that helps infer the viewpoint. As such, object classification and viewpoint estimation are tightly coupled problems, which can best benefit from their joint estimation.
We propose a new Convolutional Neural Network (CNN) model that we call RotationNet, which takes multi-view images of an object as input and predicts its pose and object category (Fig. 1). RotationNet outputs viewpoint-specific category likelihoods corresponding to all pre-defined discrete viewpoints for each image input, and then selects the object pose that maximizes the integrated object category likelihood. Whereas, at the training phase, RotationNet uses a complete set of multi-view images of an object captured from all the pre-defined viewpoints, for inference it is able to work with only a partial set of all the multi-view images – a single image at minimum – as input. This property is particularly important for object classification in real-world scenarios, where it is often difficult to place a camera at all the pre-defined viewpoints due to occlusions. In addition, RotationNet does not require the multi-view images to be provided at once but allows their sequential input and updates of the target object’s category likelihood. This property is suitable for applications that require on-the-fly classification with a moving camera.
The most representative feature of RotationNet is that it treats viewpoints where training images are observed as latent variables during the training (Fig. 2). This enables unsupervised learning of object poses using an unaligned object dataset; thus, it eliminates the need of preprocessing for pose normalization that is often sensitive to noise and individual differences in shape. Our method automatically determines the basis axes of objects based on their appearance during the training and achieves not only intra-class but also inter-class object pose alignment. Inter-class pose alignment is important to deal with joint learning of object pose and category, because the importance of object classification lies in emphasizing differences in different categories when their appearances are similar. Without inter-class pose alignment, it may become an ill-posed problem to obtain a model to distinguish, e.g., a car and a bus if the side view of a car is compared with the frontal view of a bus.
Our main contributions are described as follows. We first show that RotationNet outperforms the current state-of-the-art classification performance on 3D object benchmark datasets consisting of 10 and 40 categories by a large margin (Table 5). Next, even though it is trained without the ground-truth poses, RotationNet achieves superior performance to previous works on an object pose estimation dataset. We also show that our model generalizes well to a real-world image dataset that was newly created for the general task of multi-view object classification. (Footnote 1: We will make the dataset publicly available after the peer review.) Finally, we train RotationNet with the new dataset named MIRO and demonstrate the performance of real-world applications using a moving USB camera or a head-mounted camera (Microsoft HoloLens).
2 Related work
There are two main approaches to CNN-based 3D object classification: voxel-based and 2D image-based approaches. The earliest work on the former approach is 3D ShapeNets, which learns a Convolutional Deep Belief Network that outputs probability distributions of binary occupancy voxel values. More recent works on similar approaches have shown improved performance [21, 20, 38]. Even when working with 3D objects, 2D image-based approaches have been shown to be effective for general object recognition tasks. Su et al. proposed the multi-view CNN (MVCNN), which takes multi-view images of an object captured from surrounding virtual cameras as input and outputs the object’s category label. Multi-view representations are also used for 3D shape retrieval. Qi et al. [25] give a comprehensive study of voxel-based CNNs and multi-view CNNs for 3D object classification. Beyond these, point-based approaches [11, 24, 15] have recently drawn much attention; however, their performance on 3D object classification is still inferior to that of multi-view approaches. The current state-of-the-art result on the ModelNet40 benchmark dataset is reported by Wang et al., whose method is also based on the multi-view approach.
Because MVCNN integrates multiple views in a view-pooling layer that lies in the middle of the CNN, it requires a complete set of multi-view images recorded from all the pre-defined viewpoints for object inference. Unlike MVCNN, our method is able to classify an object using a partial set of multi-view images that may be sequentially observed by a moving camera. Elhoseiny et al. [9] explored CNN architectures for joint object classification and pose estimation learned with multi-view images. Whereas their method takes a single image as input for its prediction, we mainly focus on how to aggregate predictions from multiple images captured from different viewpoints.
Viewpoint estimation plays a significant role in improving object classification. Better performance was achieved on face identification, human action classification, and image retrieval by generating unseen views after observing a single view. These methods “imagine” the appearance of objects’ unobserved profiles, which is innately more uncertain than using real observations. Sedaghat et al. [29] proposed a voxel-based CNN that outputs orientation labels as well as classification labels and demonstrated that it improved 3D object classification performance.
All the methods mentioned above assume known poses in training samples; however, object poses are not always aligned in existing object databases. Novotny et al. [22] proposed a viewpoint factorization network that utilizes relative pose changes within each sequence to align objects in videos in an unsupervised manner. Our method also aligns object poses via unsupervised viewpoint estimation, where viewpoints of images are treated as latent variables during training. Here, viewpoint estimation is learned in an unsupervised manner to best promote the object categorization task. From this perspective, our method is related to Zhou et al., where view synthesis is trained as the “meta”-task to train multi-view pose networks by utilizing the synthesized views as the supervisory signal.
Although joint learning of object classification and pose estimation has been widely studied [28, 19, 42, 2, 35], inter-class pose alignment has drawn little attention. However, it is beneficial to share view-specific appearance information across classes to simultaneously solve for object classification and pose estimation. Kuznetsova et al. [17] pointed out this issue and presented a metric learning approach that shares visual components across categories for simultaneous pose estimation and class prediction. Our method also uses a model with view-specific appearances that are shared across classes; thus, it is able to maintain high accuracy for both object classification and pose estimation.
3 Proposed method
The training process of RotationNet is illustrated in Fig. 2. We assume that multi-view images of each training object instance are observed from all the pre-defined viewpoints. Let $M$ be the number of the pre-defined viewpoints and $N$ denote the number of target object categories. A training sample consists of $M$ images $\{x_i\}_{i=1}^{M}$ of an object and its category label $y \in \{1, \dots, N\}$. We attach a viewpoint variable $v_i \in \{1, \dots, M\}$ to each image $x_i$ and set it to $j$ when the image is observed from the $j$-th viewpoint, i.e., $v_i \leftarrow j$. In our method, only the category label $y$ is given during training whereas the viewpoint variables $\{v_i\}$ are unknown; namely, $\{v_i\}$ are treated as latent variables that are optimized in the training process.
RotationNet is defined as a differentiable multi-layer neural network $R(\cdot)$. The final layer of RotationNet is the concatenation of $M$ softmax layers, each of which outputs the category likelihood $P(\hat{y}_i \mid x_i, v_i = j)$, where $j \in \{1, \dots, M\}$, for each image $x_i$. Here, $\hat{y}_i$ denotes an estimate of the object category label for $x_i$. For the training of RotationNet, we input the set of images $\{x_i\}_{i=1}^{M}$ simultaneously and solve the following optimization problem:

$$\max_{R,\,\{v_i\}_{i=1}^{M}} \; \prod_{i=1}^{M} P(\hat{y}_i = y \mid x_i, v_i). \tag{1}$$

The parameters of $R$ and the latent variables $\{v_i\}_{i=1}^{M}$ are optimized to output the highest probability of $y$ for the input of multi-view images $\{x_i\}_{i=1}^{M}$.
Now, we describe how we design the $P(\hat{y}_i \mid x_i, v_i)$ outputs. First of all, the category likelihood $P(\hat{y}_i = y \mid x_i, v_i)$ should become close to one when the estimated $v_i$ is correct; in other words, when the image $x_i$ is truly captured from the $v_i$-th viewpoint. Otherwise, in the case that the estimated $v_i$ is incorrect, $P(\hat{y}_i = y \mid x_i, v_i)$ may not necessarily be high because the image $x_i$ is captured from a different viewpoint. As described above, we decide the viewpoint variables $\{v_i\}$ according to the $P(\hat{y}_i = y \mid x_i, v_i)$ outputs as in (1). In order to obtain a stable solution of $\{v_i\}$ in (1), we introduce an “incorrect view” class and append it to the target category classes. Here, the “incorrect view” class plays a similar role to the “background” class for object detection tasks, which represents negative samples that belong to a “non-target” class. Then, RotationNet calculates $P(\hat{y}_i \mid x_i, v_i)$ by applying softmax functions to the $(N+1)$-dimensional outputs, where $\sum_{k=1}^{N+1} P(\hat{y}_i = k \mid x_i, v_i) = 1$. Note that $P(\hat{y}_i = N+1 \mid x_i, v_i)$, which corresponds to the probability that the image $x_i$ belongs to the “incorrect view” class for the $v_i$-th viewpoint, indicates how likely it is that the estimated viewpoint variable $v_i$ is incorrect.
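Based on the above discussion, we substantiate (1) as follows. For the purpose of loss calculation, we generate the target value of $P(\hat{y}_i \mid x_i, v_i)$ based on the current estimation of $v_i$. Letting $P_i = [p^{(i)}_{j,k}] \in \mathbb{R}_{+}^{M \times (N+1)}$ denote a matrix composed of $P(\hat{y}_i = k \mid x_i, v_i = j)$ for all the $M$ viewpoints and $N+1$ classes, the target value of $P_i$ in the case that $v_i$ is correctly estimated is defined as follows:

$$p^{(i)}_{j,k} = \begin{cases} 1 & (j = v_i \text{ and } k = y) \text{ or } (j \neq v_i \text{ and } k = N+1), \\ 0 & \text{otherwise}. \end{cases} \tag{2}$$

In this way, (1) can be rewritten as the following cross-entropy optimization problem:

$$\max_{R,\,\{v_i\}_{i=1}^{M}} \; \sum_{i=1}^{M} \Big( \log p^{(i)}_{v_i,y} + \sum_{j \neq v_i} \log p^{(i)}_{j,N+1} \Big). \tag{3}$$

If we fix $\{v_i\}_{i=1}^{M}$ here, the above can be written as a subproblem of optimizing $R$ as follows:

$$\max_{R} \; \sum_{i=1}^{M} \Big( \log p^{(i)}_{v_i,y} + \sum_{j \neq v_i} \log p^{(i)}_{j,N+1} \Big), \tag{4}$$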
where the parameters of $R$ can be iteratively updated via standard back-propagation of $M$ softmax losses. Since $\{v_i\}$ are not constant but latent variables that need to be optimized during the training of $R$, we employ alternating optimization of $R$ and $\{v_i\}$. More specifically, in every iteration, our method determines $\{v_i\}$ according to $P_i$ obtained via forwarding of the (fixed) $R$, and then updates $R$ according to the estimated $\{v_i\}$ by fixing them.
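The loss minimized in the network-update step can be illustrated with a minimal NumPy sketch. It assumes (these names and shapes are illustrative, not taken from the released code) that the network output for one object is stored as an array `log_p` of shape `(M, M, N + 1)` holding log-probabilities, where `M` is the number of viewpoints, `N` the number of categories, and the last class index is the “incorrect view” class. Labels and viewpoints are 0-based here.

```python
import numpy as np

def rotationnet_loss(log_p, v, y):
    """Negative log-likelihood for one object with fixed viewpoint variables.

    log_p: array (M, M, N + 1); log_p[i, j, k] is the log-probability that
           image i, scored by the softmax for viewpoint j, belongs to class k.
           Class index N is the "incorrect view" class (assumed layout).
    v:     int array (M,), current viewpoint assignment of each image.
    y:     int, ground-truth category in [0, N).
    """
    M = log_p.shape[0]
    incorrect = log_p.shape[2] - 1
    # Target class for every (image, viewpoint) cell: "incorrect view" ...
    target = np.full((M, M), incorrect)
    # ... except at the assigned viewpoint, where the target is the category.
    target[np.arange(M), v] = y
    # Pick the log-probability of the target class in every cell and sum.
    picked = np.take_along_axis(log_p, target[..., None], axis=2)[..., 0]
    return -picked.sum()
```

Minimizing this quantity with respect to the network parameters corresponds to maximizing the sum of log-likelihoods over the `M` softmax outputs of each image; in an actual training loop the same target construction would feed a framework's cross-entropy loss.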
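The latent viewpoint variables $\{v_i\}$ are determined by solving the following problem:

$$\begin{aligned} &\max_{\{v_i\}_{i=1}^{M}} \; \sum_{i=1}^{M} \Big( \log p^{(i)}_{v_i,y} + \sum_{j \neq v_i} \log p^{(i)}_{j,N+1} \Big) \\ \Longleftrightarrow\; &\max_{\{v_i\}_{i=1}^{M}} \; \sum_{i=1}^{M} \Big( \log p^{(i)}_{v_i,y} - \log p^{(i)}_{v_i,N+1} + \sum_{j=1}^{M} \log p^{(i)}_{j,N+1} \Big) \\ \Longleftrightarrow\; &\max_{\{v_i\}_{i=1}^{M}} \; \prod_{i=1}^{M} \frac{p^{(i)}_{v_i,y}}{p^{(i)}_{v_i,N+1}}, \end{aligned} \tag{5}$$

in which the conversion used the fact that $\sum_{j=1}^{M} \log p^{(i)}_{j,N+1}$ is constant w.r.t. $\{v_i\}$. Because the number of candidates for $\{v_i\}$ is limited, we calculate the evaluation value of (5) for all the candidates and take the best choice. Deciding $\{v_i\}$ in this way emphasizes view-specific features for object categorization, which contributes to the self-alignment of objects in the dataset.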
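The exhaustive search over candidates can be sketched for the simplest setting, in which the images are ordered along one rotation direction so that the only candidates are the cyclic shifts of the view order. The array layout (`log_p` of shape `(M, M, N + 1)` with the “incorrect view” class last, 0-based indices) and the function name are assumptions made for illustration.

```python
import numpy as np

def best_cyclic_assignment(log_p, y):
    """Pick the viewpoint assignment maximizing the log of the ratio in (5).

    log_p: array (M, M, N + 1) of log-probabilities for one object; the last
           class index is the "incorrect view" class (assumed layout).
    y:     int, ground-truth category in [0, N).
    Returns the assignment v (int array of length M) and its score.
    """
    M = log_p.shape[0]
    incorrect = log_p.shape[2] - 1
    # score[i, j] = log p_{j,y} - log p_{j,incorrect} for image i.
    score = log_p[:, :, y] - log_p[:, :, incorrect]
    idx = np.arange(M)
    # Candidate s assigns image i to viewpoint (i + s) mod M.
    totals = np.array([score[idx, (idx + s) % M].sum() for s in range(M)])
    s_best = int(totals.argmax())
    return (idx + s_best) % M, float(totals[s_best])
```

During training this selection would be run on the forward-pass outputs of each object before the network parameters are updated with the chosen assignment held fixed.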
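In the inference phase, RotationNet takes as input $M'$ ($1 \leq M' \leq M$) images of a test object instance, either simultaneously or sequentially, and outputs $M'$ probabilities. Finally, it integrates the $M'$ outputs to estimate the category of the object and the viewpoint variables as follows:

$$\big\{ \hat{y}, \{\hat{v}_i\}_{i=1}^{M'} \big\} = \operatorname*{arg\,max}_{y,\,\{v_i\}_{i=1}^{M'}} \; \prod_{i=1}^{M'} \frac{p^{(i)}_{v_i,y}}{p^{(i)}_{v_i,N+1}}. \tag{6}$$

Similarly to the training phase, we decide $\{\hat{v}_i\}$ according to the outputs $\{P_i\}_{i=1}^{M'}$. Thus RotationNet is able to estimate the pose of the object as well as its category label.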
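A minimal sketch of this joint estimate from a partial set of views is given below for the upright, single-rotation-axis setting. It assumes that the relative view offsets between the input images are known (for example from the capture order or from an external camera tracker); this assumption, the array layout, and the function name are illustrative and not taken from the released code.

```python
import numpy as np

def infer_category_and_views(log_p, offsets):
    """Jointly estimate category and viewpoints from a partial set of views.

    log_p:   array (M_in, M, N + 1) of log-probabilities, one row block per
             input image; the last class is the "incorrect view" class.
    offsets: int array (M_in,), known view offset of each image relative to
             the first one (assumed to be available).
    Returns (category, viewpoint assignment), both 0-based.
    """
    m_in, M, num_out = log_p.shape
    N = num_out - 1
    offsets = np.asarray(offsets)
    # ratio[i, j, k] = log p_{j,k} - log p_{j,incorrect} for image i.
    ratio = log_p[:, :, :N] - log_p[:, :, N:]
    scores = np.empty((M, N))
    for s in range(M):
        v = (s + offsets) % M
        # Sum the per-image log-ratios for every category under shift s.
        scores[s] = ratio[np.arange(m_in), v].sum(axis=0)
    s_hat, y_hat = np.unravel_index(scores.argmax(), scores.shape)
    return int(y_hat), (int(s_hat) + offsets) % M
```

Because the score is a sum over images, views arriving sequentially can be accumulated into `scores` one at a time, which matches the on-the-fly usage described in the text.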
3.1 Viewpoint setups for training
While choices of the viewpoint variables can be arbitrary, we consider two setups in this paper, with and without an upright orientation assumption, similarly to MVCNN. The former case is often useful with images of real objects captured with one-dimensional turning tables, whereas the latter case is better suited to unaligned 3D models. We also consider a third case that is likewise based on the upright orientation assumption (as the first case) but with multiple elevation levels. We illustrate the three viewpoint setups in Fig. 3.
Case (i): with upright orientation. In the case where we assume upright orientation, we fix a specific axis as the rotation axis (e.g., the $z$-axis), which defines the upright orientation, and then place viewpoints at intervals of the angle $\theta$ around the axis, elevated by $\phi$ (set to $30^\circ$ in this paper) from the ground plane. We set $\theta = 30^\circ$ by default, which yields 12 views for an object ($M = 12$). We define that “view $i+1$” is obtained by rotating the view position “view $i$” by the angle $\theta$ about the $z$-axis. Note that the view obtained by rotating “view $M$” by the angle $\theta$ about the $z$-axis corresponds to “view 1.” We assume the sequence of input images is consistent with respect to a certain direction of rotation in the training phase. For instance, if $v_i$ is $j$ ($j < M$), then $v_{i+1}$ is $j+1$. Thus the number of candidates for all the viewpoint variables $\{v_i\}_{i=1}^{M}$ is $M$.
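The upright camera ring can be sketched in a few lines of NumPy. The defaults below (12 views, 30° elevation, unit radius, $z$ as the upright axis) follow the MVCNN-style setup and are assumptions of this sketch; the function name is hypothetical.

```python
import numpy as np

def upright_viewpoints(num_views=12, elevation_deg=30.0, radius=1.0):
    """Camera positions on a ring around the z-axis (upright setup).

    View i + 1 is view i rotated by 360 / num_views degrees about z, and
    rotating the last view once more returns to the first view.
    """
    azimuth = np.deg2rad(np.arange(num_views) * 360.0 / num_views)
    elev = np.deg2rad(elevation_deg)
    x = radius * np.cos(elev) * np.cos(azimuth)
    y = radius * np.cos(elev) * np.sin(azimuth)
    z = np.full(num_views, radius * np.sin(elev))
    return np.stack([x, y, z], axis=1)
```

Since the views form a single cycle, fixing the viewpoint of one image fixes all the others, which is why only as many pose candidates as there are views need to be evaluated in this setup.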
Case (ii): w/o upright orientation. In the case where we do not assume upright orientation, we place virtual cameras on the $M = 20$ vertices of a dodecahedron encompassing the object. This is because a dodecahedron has the largest number of vertices among regular polyhedra, on which viewpoints can be completely equally distributed in 3D space. Unlike case (i), where there is a unique rotation direction, there are three different patterns of rotation from a certain view, because three edges are connected to each vertex of a dodecahedron. Therefore, the number of candidates for all the viewpoint variables is 60 ($= 3M$). (Footnote 2: A dodecahedron has 60 orientation-preserving symmetries.)
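The candidate count for the dodecahedron setup can be checked numerically. The sketch below builds the 20 vertices from the standard golden-ratio coordinates, counts the edges meeting at each vertex, and multiplies the choices; the function names are hypothetical and the code only verifies the counting argument, not the actual camera rendering.

```python
import itertools
import numpy as np

def dodecahedron_vertices():
    """Return the 20 vertices of a regular dodecahedron centered at the origin."""
    phi = (1 + 5 ** 0.5) / 2
    verts = [np.array(p, dtype=float) for p in itertools.product((-1, 1), repeat=3)]
    for a, b in itertools.product((-1 / phi, 1 / phi), (-phi, phi)):
        # Cyclic permutations of (0, +-1/phi, +-phi).
        verts += [np.array([0.0, a, b]), np.array([a, b, 0.0]), np.array([b, 0.0, a])]
    return np.stack(verts)

def count_pose_candidates(verts):
    """Number of (vertex, in-plane rotation) pairs = sum of vertex degrees."""
    dist = np.linalg.norm(verts[:, None] - verts[None], axis=-1)
    edge = dist[dist > 1e-9].min()  # shortest non-zero distance is the edge length
    degree = (np.abs(dist - edge) < 1e-9).sum(axis=1)
    return int(degree.sum())
```

With 20 vertices and three edges at each vertex, `count_pose_candidates(dodecahedron_vertices())` gives the 60 candidates, in agreement with the number of orientation-preserving symmetries of the dodecahedron.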
Case (iii): with upright orientation and multiple elevation levels. This case is an extension of case (i). Unlike case (i), where the elevation angle is fixed, we place virtual cameras at intervals of $\phi$ in $[-90^\circ, 90^\circ]$. There are $M = M_a \times M_e$ viewpoints, where $M_a = 360^\circ / \theta$ and $M_e = 180^\circ / \phi + 1$. As with case (i), the number of candidates for all the viewpoint variables is $M_a$ due to the upright orientation assumption.
4 Experiments
In this section, we show the results of the experiments with 3D model benchmark datasets (Sec. 4.1), a real image benchmark dataset captured with a one-dimensional turning table (Sec. 4.2), and our new dataset consisting of multi-view real images of objects viewed with two rotational degrees of freedom (Sec. 4.3). The baseline architecture of our CNN is based on AlexNet , which is smaller than the VGG-M network architecture that MVCNN  used. To train RotationNet, we fine-tune the weights pre-trained using the ILSVRC 2012 dataset . We used classical momentum SGD with a learning rate of and a momentum of for optimization.
As a baseline method, we also fine-tuned the pre-trained weights of a standard AlexNet CNN that only predicts object categories. To aggregate the predictions of multi-view images, we summed up all the scores obtained through the CNN. This method can be recognized as a modified version of MVCNN , where the view-pooling layer is placed after the final softmax layer. We chose average pooling for the view-pooling layer in this setting of the baseline, because we observed that the performance was better than that with max pooling. We also implemented MVCNN  based on the AlexNet architecture with the original view-pooling layer for a fair comparison.
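The late view-pooling used by this baseline amounts to summing (or, equivalently for the arg max, averaging) the per-view class scores. A minimal sketch, assuming the per-view softmax scores are already available as a NumPy array (the function name is hypothetical):

```python
import numpy as np

def late_pool_predict(view_scores):
    """Aggregate per-view class scores after the softmax and pick a class.

    view_scores: array (num_views, num_classes) of per-view softmax outputs.
    Summing and averaging over views yield the same arg max.
    """
    pooled = view_scores.sum(axis=0)
    return int(pooled.argmax())
```

Because the pooling happens after the final layer, this baseline accepts any number of views at test time, unlike a view-pooling layer placed in the middle of the network.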
4.1 Experiment on 3D model datasets
We first describe the experimental results on two 3D model benchmark datasets, ModelNet10 and ModelNet40 . ModelNet10 consists of 4,899 object instances in 10 categories, whereas ModelNet40 consists of 12,311 object instances in 40 categories. First, we show the change of object classification accuracy versus the number of views used for prediction in cases (i) and (ii) with ModelNet40 and ModelNet10, respectively, in Fig. 4 (a)-(b) and Fig. 4 (d)-(e). For fair comparison, we used the same training and test split of ModelNet40 as in  and . We prepared multi-view images (i) with the upright orientation assumption and (ii) without the upright orientation assumption using the rendering software published in . Here, we show the average scores of trials with randomly selected multi-view sets. In Figs. 4 (a) and 4 (d), which show the results with ModelNet40, we also draw the scores with the original MVCNN using Support Vector Machine (SVM) reported in . Interestingly, as we focus on the object classification task whereas Su et al.  focused more on object retrieval task, we found that the baseline method with late view-pooling is slightly better in this case than the original MVCNN with the view-pooling layer in the middle. The baseline method does especially well with ModelNet10 in case (i) (Fig. 4 (b)), where it achieves the best performance among the methods. With ModelNet40 in case (i) (Fig. 4 (a)), RotationNet achieved a comparable result with MVCNN when we used all the views as input. In case (ii) (Figs. 4 (d) and (e)), where we consider full 3D rotation, RotationNet demonstrated superior performance to other methods. Only with three views, it showed comparable performance to that of MVCNN with a full set ( views) of multi-view images.
Next, we investigate the performance of RotationNet with three different architectures: AlexNet [16], VGG-M [6], and ResNet-50 [12]. Table 1 shows the classification accuracy on ModelNet40 and ModelNet10. Because we deal with discrete viewpoints, we tried 11 different camera system orientations (similarly to previous work) and calculated the mean and maximum accuracy of those trials. Surprisingly, the performance difference among the architectures is marginal compared to the difference caused by different camera system orientations. This indicates that the placement of viewpoints is the most important factor in multiview-based 3D object classification. See Sec. B for more details.
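Table 1: Classification accuracy (%) of RotationNet with different architectures (mean ± standard deviation and maximum over the camera system orientations).

|Architecture|ModelNet40 mean ± std|ModelNet40 max|ModelNet10 mean ± std|ModelNet10 max|
|---|---|---|---|---|
|AlexNet|93.70 ± 1.07|96.39|94.52 ± 1.01|97.58|
|VGG-M|94.68 ± 1.16|97.37|94.82 ± 1.17|98.46|
|ResNet-50|94.77 ± 1.10|96.92|94.80 ± 0.96|97.80|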
Finally, we summarize the comparison of classification accuracy on ModelNet40 and ModelNet10 to existing 3D object classification methods in Table 5. (Footnote 3: We do not include the scores of “VRN Ensemble” [5], which uses an ensembling technique, because it is written in [5] that “we suspect that this result is not general, and do not claim it with our main results.” The reported scores are 95.54% with ModelNet40 and 97.14% with ModelNet10, which are both outperformed by RotationNet with any architecture (see Table 1).) RotationNet (with the VGG-M architecture) significantly outperformed existing methods on both the ModelNet40 and ModelNet10 datasets. We report the maximum accuracy among the aforementioned 11 rotation trials. Note that the average accuracy of those trials on ModelNet40 was 94.68%, which is still superior to the current state-of-the-art score of 93.8% reported by Wang et al. Besides, Wang et al. used additional feature modalities, namely surface normals and normalized depth values, to improve the performance by %.
|Method|ModelNet40 (%)|ModelNet10 (%)|
|---|---|---|
|Dominant Set Clustering|93.8|-|
|Multiple Depth Maps|87.8|91.5|
|Geometry Image|83.9|88.4|
|Beam Search|81.26|88|
4.2 Experiment on a real image benchmark dataset
Next, we describe the experimental results on a benchmark RGBD dataset published in , which consists of real images of objects on a one-dimensional rotation table. This dataset contains object instances in categories. Although it contains depth images and 3D point clouds, we used only RGB images in our experiment. We applied the upright orientation assumption (case (i)) in this experiment, because the bottom faces of objects on the turning table were not recorded. We picked out images of each object instance with the closest rotation angles to . In the training phase, objects are self-aligned (in an unsupervised manner) and the viewpoint variables for images are determined. To predict the pose of a test object instance, we predict the discrete viewpoint that each test image is observed, and then refer the most frequent pose value among those attached to the training samples predicted to be observed from the same viewpoint.
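The pose lookup described at the end of this paragraph can be sketched as a majority vote per discrete viewpoint. The sketch assumes that each training image carries a quantized pose value (for example an angle in degrees) and the discrete viewpoint assigned to it after training; the function names and data layout are hypothetical.

```python
from collections import Counter, defaultdict

def build_pose_lookup(train_viewpoints, train_poses):
    """Map each discrete viewpoint to the most frequent training pose value.

    train_viewpoints: iterable of viewpoint indices predicted for the
                      training images (after self-alignment).
    train_poses:      iterable of pose values attached to the same images.
    """
    votes = defaultdict(Counter)
    for view, pose in zip(train_viewpoints, train_poses):
        votes[view][pose] += 1
    return {view: counts.most_common(1)[0][0] for view, counts in votes.items()}

def predict_pose(lookup, predicted_viewpoint):
    """Return the pose value associated with a predicted viewpoint."""
    return lookup[predicted_viewpoint]
```

The vote is needed because the viewpoint indices learned without supervision have no fixed relation to the annotated pose angles; the lookup table recovers that relation from the training set.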
Table 4.1 summarizes the classification and viewpoint estimation accuracies. The baseline method and MVCNN are not able to estimate viewpoints because they are essentially viewpoint invariant. As another baseline approach to compare, we learned a CNN with AlexNet architecture that outputs scores to distinguish both viewpoints and categories, which we call “Fine-grained.” Here, denotes the number of iterations that the CNN parameters are updated in the training phase. As shown in Table 4.1, the classification accuracy with “Fine-grained” decreases while its viewpoint estimation accuracy improves as the iteration grows. We consider this is because the “Fine-grained” classifiers become more and more sensitive to intra-class appearance variation through training, which affects the categorization accuracy. In contrast, RotationNet demonstrated the best performance in both object classification and viewpoint estimation, although the ground-truth poses are not given to RotationNet during the training.
Table 6 shows the pose estimation error comparison to existing methods, where “NN” is a simple nearest neighbor regressor. “Med” indicates the median error and “Ave” indicates the average error. “Med(C)” and “Ave(C)” are computed only on test images that were assigned the correct category by the system, whereas “Med(I)” and “Ave(I)” are computed only on those that were assigned the correct instance by the system. Even though it is only able to predict discrete poses, RotationNet achieved the best performance.
|Method|Med|Med(C)|Med(I)|Ave|Ave(C)|Ave(I)|
|---|---|---|---|---|---|---|
|Indep Tree|73.3|62.1|44.6|89.3|81.4|63.0|
4.3 Experiment on a 3D rotated real image dataset
We describe the experimental results on our new dataset “Multi-view Images of Rotated Objects (MIRO)” in this section. Exemplar images in the MIRO dataset are shown in Sec. E. We used Ortery’s 3D MFP studio (Footnote 4: https://www.ortery.com/photography-equipment/3d-modeling/) to capture multi-view images of objects with 3D rotations. The RGBD benchmark dataset [18] has two issues for training multi-view based CNNs: an insufficient number of object instances per category (which is a minimum of two for training) and cases that are inconsistent with the upright orientation assumption. There are several cases where the upright orientation assumption is actually invalid; the attitudes of object instances against the rotation axis are inconsistent in some object categories. Also, this dataset does not include the bottom faces of objects on the turning table. Our MIRO dataset includes object instances per object category. It consists of object instances in categories in total. We captured each object instance with levels of elevation angles and levels of azimuth angles to obtain images. For our experiments, we used images () with elevation of an object instance in case (i). We carefully captured all the object instances in each category to have the same upright direction in order to evaluate performance in case (i). For case (ii), we used images observed from the vertices of a dodecahedron encompassing an object.
Figures 4 (c) and 4 (f) show the object classification accuracy versus the number of views used for the prediction in case (i) and case (ii), respectively. In both cases, RotationNet clearly outperforms both MVCNN and the baseline method when the number of views is larger than . We also tested the “Fine-grained” method that outputs scores in case (i) and scores in case (ii) to distinguish both viewpoints and categories, and the overall results are summarized in Tables 4.1 and 4.1. Similar to the results with an RGBD dataset described above, there is a trade-off between object classification and viewpoint estimation accuracies in the “Fine-grained” approach. RotationNet achieved the best performance in both object classification and viewpoint estimation, which demonstrates the strength of the proposed approach.
Finally, we demonstrate the performance of RotationNet for real-world applications. For training, we used our MIRO dataset with the viewpoint setup case (iii), where all the outputs for images with levels of elevation angles are concatenated, which enables RotationNet to distinguish viewpoints. We added rendered images of a single 3D CAD model (whose upright orientation is manually assigned) to each object class, which were trained together with MIRO dataset. Then we obtained successful alignments between a CAD model and real images for all the 12 object classes (Fig. 5). Figure 6 shows exemplar objects recognized using a USB camera. We estimated relative camera poses by LSD-SLAM  to integrate predictions from multiple views in sequence. The results obtained using multiple views (shown in the third and sixth rows) are consistently more accurate than those using a single view (shown in the second and fifth rows). It is worth noting that not only object classification but also pose estimation performance is improved by using multiple views.
5 Discussion
We proposed RotationNet, which jointly estimates object category and viewpoint from each single-view image and aggregates the object class predictions obtained from a partial set of multi-view images. In our method, object instances are automatically aligned in an unsupervised manner with both inter-class and intra-class structures based on their appearance during the training. In the experiment using 3D object benchmark datasets ModelNet40 and ModelNet10, RotationNet significantly outperformed the state-of-the-art methods based on voxels, point clouds, and multi-view images. RotationNet is also able to achieve comparable performance to MVCNN  with different multi-view images using only a couple of view images, which is important for real-world applications. Another contribution is that we developed a publicly available new dataset named MIRO. Using this dataset and RGBD object benchmark dataset , we showed that RotationNet even outperformed supervised learning based approaches in a pose estimation task. We consider that our pose estimation performance benefits from view-specific appearance information shared across classes due to the inter-class self-alignment.
Similar to MVCNN  and any other 3D object classification method that considers discrete variance of rotation, RotationNet has the limitation that each image should be observed from one of the pre-defined viewpoints. The discrete pose estimation by RotationNet, however, demonstrated superior performance to existing methods on the RGBD object benchmark dataset. It can be further improved by introducing a fine pose alignment post-process using e.g. iterative closest point (ICP) algorithm. Another potential avenue to look into is the automatic selection of the best camera system orientations, since it has an effect on object classification accuracy.
-  S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. J. Latecki. Gift: A real-time and scalable 3d shape search engine. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  A. Bakry and A. Elgammal. Untangling object-view manifold for multiview recognition and pose estimation. In Proceedings of European Conference on Computer Vision (ECCV), 2014.
-  L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for rgb-d based object recognition. In Proceedings of International Symposium on Experimental Robotics (ISER), 2013.
-  H. Borotschnig, L. Paletta, M. Prantl, and A. Pinz. Appearance-based active object recognition. Image and Vision Computing, 18(9), 2000.
-  A. Brock, T. Lim, J. Ritchie, and N. Weston. Generative and discriminative voxel modeling with convolutional neural networks. In Proceedings of NIPS Workshop on 3D Deep Learning, 2017.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of British Machine Vision Conference (BMVC), 2014.
-  C.-Y. Chen and K. Grauman. Inferring unseen views of people. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3D model retrieval. Computer Graphics Forum, 22(3), 2003.
-  M. Elhoseiny, T. El-Gaaly, A. Bakry, and A. Elgammal. A comparative analysis and study of multiview cnn models for joint object categorization and pose estimation. In Proceedings of International Conference on Machine Learning (ICML), 2016.
-  J. Engel, T. Schöps, and D. Cremers. Lsd-slam: Large-scale direct monocular slam. In Proceedings of European Conference on Computer Vision (ECCV), 2014.
-  A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez. Pointnet: A 3d convolutional neural network for real-time object class recognition. In Proceedings of IEEE International Joint Conference on Neural Networks (IJCNN), 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  V. Hegde and R. Zadeh. Fusionnet: 3d object classification using multiple data representations. arXiv preprint arXiv:1607.05695, 2016.
-  E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of International Conference on Computer Vision (ICCV), 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012.
-  A. Kuznetsova, S. J. Hwang, B. Rosenhahn, and L. Sigal. Exploiting view-specific appearance similarities across classes for zero-shot pose prediction: A metric learning approach. In Proceedings of AAAI Conference on Artificial Intelligence, 2016.
-  K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2011.
-  K. Lai, L. Bo, X. Ren, and D. Fox. A scalable tree-based approach for joint object and pose recognition. In Proceedings of AAAI Conference on Artificial Intelligence, 2011.
-  Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas. FPNN: Field probing neural networks for 3D data. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016.
-  D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.
-  D. Novotny, D. Larlus, and A. Vedaldi. Learning 3d object categories by looking around them. In Proceedings of International Conference on Computer Vision (ICCV), 2017.
-  L. Paletta and A. Pinz. Active object recognition by view integration and reinforcement learning. Robotics and Autonomous Systems, 31(1), 2000.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  C. R. Qi, H. Su, M. Niessner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 2015.
-  S. Savarese and L. Fei-Fei. 3D generic object categorization, localization and pose estimation. In Proceedings of International Conference on Computer Vision (ICCV), 2007.
-  N. Sedaghat, M. Zolfaghari, and T. Brox. Orientation-boosted voxel nets for 3D object recognition. In Proceedings of British Machine Vision Conference (BMVC), 2017.
-  K. Sfikas, T. Theoharis, and I. Pratikakis. Exploiting the panorama representation for convolutional neural network classification and retrieval. In Proceedings of Eurographics Workshop on 3D Object Retrieval (3DOR), 2017.
-  B. Shi, S. Bai, Z. Zhou, and X. Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12), 2015.
-  M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  A. Sinha, J. Bai, and K. Ramani. Deep learning 3D shape surfaces using geometry images. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
-  H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  H. Su, C. R. Qi, Y. Li, and L. J. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  H. Su, F. Wang, E. Yi, and L. J. Guibas. 3D-assisted feature synthesis for novel views of an object. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  C. Wang, M. Pelillo, and K. Siddiqi. Dominant set clustering and pooling for multi-view 3D object recognition. In Proceedings of British Machine Vision Conference (BMVC), 2017.
-  J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016.
-  Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  X. Xu and S. Todorovic. Beam search for learning a deep convolutional neural network of 3D shapes. In Proceedings of International Conference on Pattern Recognition (ICPR), 2016.
-  P. Zanuttigh and L. Minto. Deep learning for 3D shape classification from multiple depth maps. In Proceedings of IEEE International Conference on Image Processing (ICIP), 2017.
-  H. Zhang, T. El-Gaaly, A. M. Elgammal, and Z. Jiang. Joint object and pose recognition using homeomorphic manifold analysis. In Proceedings of AAAI Conference on Artificial Intelligence, volume 2, 2013.
-  S. Zhi, Y. Liu, X. Li, and Y. Guo. LightNet: A lightweight 3D convolutional neural network for real-time 3D object recognition. In Proceedings of Eurographics Workshop on 3D Object Retrieval (3DOR), 2017.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2014.
Appendix A Qualitative evaluation of self-alignment during the training of RotationNet
Figures 7 and 8 show how the inter- and intra-class object pose alignment emerges automatically during the training of RotationNet on ModelNet40. They depict the variation of the average images generated by concatenating multi-view images in order of their predicted viewpoint variables. The figures correspond to case (i), with upright orientation, and case (ii), without upright orientation, respectively. We can see that the variance of the average images decreases together with the variance of object poses. The red dotted lines show the mean variance of the average images over all the classes, whereas the blue lines show the variance of the average images of the “chair” class.
The images of test object instances in the “chair” class with the same predicted viewpoint variable are shown on the right of the figures. In both cases (i) and (ii), the latter being the more interesting because it is harder, the chairs with random initial poses gradually become aligned in the same direction after several hundred training iterations. Moreover, the average images of all the 40 classes with the same predicted viewpoint variable, shown in red boxes, indicate that not only intra-class alignment but also inter-class alignment is achieved. The alignment is less obvious in the red boxes of Fig. 8; however, we confirmed that this does not harm the object classification accuracy.
Appendix B Influence of camera system orientation
As shown in Section 4.1 in the main manuscript, we tested the performance of RotationNet with different camera system orientations. Figure 9 shows exemplar multi-view images of a chair in the ModelNet40 dataset captured in case (ii) with the camera system orientations. Although an “aligned” ModelNet40 dataset has recently been released, we used the original “unaligned” ModelNet40 dataset in our work. Camera system orientations are first rotated by about the -axis, and then rotated by about the -axis. In this way, different camera system orientations can capture different object profiles. Table 7 compares the classification accuracy (%) on ModelNet40 and ModelNet10 with the different camera system orientations. We varied the base architecture of RotationNet among AlexNet, VGG-M, and ResNet-50. The best score for each architecture among the orientations is shown in bold. The best camera system orientation is consistent across different architectures: the second one for ModelNet10 and the fourth one for ModelNet40. This indicates that multi-view object classification is greatly improved by observing appropriate aspects of objects. In addition, Table 7 shows that the best performance on the validation set (which we extracted from the training split of ModelNet40) was achieved with the same camera system orientation as on the test set. Therefore, it is possible to obtain the best RotationNet model by selecting, among the different camera system orientations, the one that best classifies a validation set.
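Table 7: Classification accuracy (%) on ModelNet40 and ModelNet10 with different camera system orientations. Columns 1 to 11 are camera system orientation IDs; the last column is the mean ± standard deviation over the orientations.

| Dataset | Architecture | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Mean ± std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ModelNet40 - val | AlexNet | 93.03 | 94.33 | 95.14 | **95.54** | 92.63 | 92.46 | 92.38 | 92.95 | 92.79 | 92.79 | 92.87 | 93.35 ± 1.06 |
| ModelNet40 - test | VGG-M | 93.64 | 95.91 | 96.07 | **97.37** | 94.12 | 93.80 | 94.08 | 94.25 | 93.68 | 94.41 | 94.17 | 94.68 ± 1.16 |
| ModelNet10 - test | VGG-M | 94.38 | **98.46** | 94.05 | 94.93 | 94.38 | 94.38 | 94.38 | 94.49 | 94.82 | 94.49 | 94.27 | 94.82 ± 1.17 |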
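The selection rule described above (pick the camera system orientation that scores best on a validation set) can be illustrated with the accuracy values listed in Table 7. The following is only a minimal sketch: the function name `best_orientation` and the 1-based orientation numbering are assumptions made for illustration, and the summary statistic is computed as a population standard deviation, which appears to match the last column of the table.

```python
from statistics import mean, pstdev

# Accuracy (%) per camera system orientation, copied from Table 7.
val_modelnet40_alexnet = [93.03, 94.33, 95.14, 95.54, 92.63, 92.46,
                          92.38, 92.95, 92.79, 92.79, 92.87]
test_modelnet40_vggm = [93.64, 95.91, 96.07, 97.37, 94.12, 93.80,
                        94.08, 94.25, 93.68, 94.41, 94.17]


def best_orientation(accuracies):
    """Return the 1-based ID (assumed numbering) of the best orientation."""
    best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    return best_index + 1


# Orientation chosen on the validation set, and the best one on the test set.
chosen_id = best_orientation(val_modelnet40_alexnet)
test_best_id = best_orientation(test_modelnet40_vggm)

# Summary statistics over the orientations (population standard deviation).
val_mean = mean(val_modelnet40_alexnet)
val_std = pstdev(val_modelnet40_alexnet)

print(chosen_id, test_best_id, val_mean, val_std)
```

With these values, the orientation selected on the validation row coincides with the best orientation on the ModelNet40 test row, which is the property the text relies on.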
Appendix C Effectiveness of fine-tuning
Even when the input images are grayscale rendered images of 3D models, fine-tuning of the ImageNet pre-trained weights is effective. Figures 10 and 11 respectively show the training loss and the classification accuracy (%) on ModelNet40 using RotationNet trained with and without fine-tuning. Here, we used AlexNet as the baseline architecture of RotationNet. As shown in these figures, the fine-tuned RotationNet converges earlier, and to a better model than the one obtained without fine-tuning. This indicates that the ImageNet pre-trained weights capture general features of objects, which leads RotationNet to achieve reliable performance in the object classification task.
Appendix D Sensitivity to pre-defined views assumption
Similar to existing 3D object classification methods (such as MVCNN) that consider discrete variations of rotation, RotationNet has the limitation that each image should be observed from one of the pre-defined viewpoints. To examine the sensitivity to the pre-defined views assumption, we conducted an additional experiment with the ShapeNetCore55 dataset (http://shapenet.cs.stanford.edu/shrec16/), which consists of 3D models in 55 object categories. We trained our model with the aligned training dataset and tested the classification of unaligned (i.e., randomly rotated) models. The accuracy was whereas it was for aligned test models, which means our model is rather sensitive to pre-defined views. However, as shown in Fig. 12, the accuracy increases if we randomly rotate the test model times and use the maximum object scores. Moreover, when trained with an unaligned dataset (as is the case with the ModelNet dataset), we achieve a model that is much less sensitive to the viewpoint sampling; the accuracy in this case was for unaligned test models with .
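The test-time strategy mentioned above, rotating the test model several times and keeping the maximum object scores, can be sketched as follows. This is an illustrative sketch only: the helper name `max_over_rotations` and the toy score values are hypothetical, and the per-class scores for each randomly rotated copy are assumed to have been computed already by the classifier.

```python
def max_over_rotations(scores_per_rotation):
    """Combine class scores from several rotated copies of one test model.

    scores_per_rotation: list of per-class score lists, one list per
    randomly rotated copy (hypothetical input format).
    Returns the element-wise maximum scores and the predicted class index.
    """
    best_scores = [max(column) for column in zip(*scores_per_rotation)]
    predicted = max(range(len(best_scores)), key=best_scores.__getitem__)
    return best_scores, predicted


# Toy example: three rotated copies, three object classes.
scores = [
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.7, 0.2, 0.1],
]
best_scores, predicted = max_over_rotations(scores)
print(best_scores, predicted)
```

Taking the maximum rather than the mean means that a single rotation that happens to match the pre-defined viewpoints can dominate the decision, which is consistent with the motivation given in the text.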
Appendix E Object instances in MIRO dataset
Figure 13 shows thumbnail images of all the object instances in our new dataset MIRO. Our MIRO dataset includes object instances per object category. It consists of object instances in categories in total. Each object instance has images captured from different viewpoints approximately equally distributed in spherical coordinates. An example of the multi-view images is shown in Fig. 14.
Appendix F Candidates for viewpoint variables in case (ii): w/o upright orientation
In the case where we do not assume upright orientation, we place virtual cameras on the vertices of a dodecahedron encompassing the object. There are three different patterns of rotation from a certain view, because three edges are connected to each vertex of a dodecahedron. Therefore, since a dodecahedron has 20 vertices, the number of candidates for all the viewpoint variables is 60 (= 3 × 20). Figures 15-23 show all the candidates for a set of viewpoint variables in this case, in which vertex and image IDs are shown on the top and bottom rows respectively. Here, indicates the ID of the vertex where the -th image of the object instance is observed. For instance, in Candidate #2 is (Fig. 15 (b)). The red star indicates the camera position where the first image is captured. The red dot indicates the camera position where the ninth image is captured.
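The counting argument above (20 vertices, three rotations at each) can be checked by enumerating the rotations of a dodecahedron explicitly. The sketch below is not the authors' implementation: the vertex ordering, the function names, and the representation of a candidate as a tuple mapping image index to vertex ID are all assumptions made for illustration. It uses the fact that a rotation of the dodecahedron is determined by where it sends one vertex and one of that vertex's edge-connected neighbours.

```python
import itertools
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def dodecahedron_vertices():
    """20 vertices of a regular dodecahedron (this ID order is arbitrary)."""
    verts = [tuple(map(float, s)) for s in itertools.product((-1, 1), repeat=3)]
    for a, b in itertools.product((-1, 1), repeat=2):
        verts.append((0.0, a / PHI, b * PHI))
        verts.append((a / PHI, b * PHI, 0.0))
        verts.append((a * PHI, 0.0, b / PHI))
    return verts


def dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def unit(p):
    n = math.sqrt(dot(p, p))
    return tuple(x / n for x in p)


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def frame(v, w):
    """Right-handed orthonormal frame built from a vertex v and a neighbour w."""
    e1 = unit(v)
    along = dot(w, e1)
    e2 = unit(tuple(wi - along * ei for wi, ei in zip(w, e1)))
    return (e1, e2, cross(e1, e2))


def rotate(p, src, dst):
    """Apply the rotation that carries frame src onto frame dst."""
    coords = [dot(e, p) for e in src]
    return tuple(sum(c * e[i] for c, e in zip(coords, dst)) for i in range(3))


def viewpoint_candidates():
    """One vertex permutation per rotation of the dodecahedron."""
    verts = dodecahedron_vertices()
    ids = range(len(verts))

    def neighbours(i):
        # After sorting by distance, index 0 is i itself; the next three
        # are the vertices joined to i by an edge.
        return sorted(ids, key=lambda j: sq_dist(verts[i], verts[j]))[1:4]

    def nearest(q):
        return min(ids, key=lambda j: sq_dist(q, verts[j]))

    src = frame(verts[0], verts[neighbours(0)[0]])
    candidates = []
    for v in ids:
        for w in neighbours(v):
            dst = frame(verts[v], verts[w])
            candidates.append(tuple(nearest(rotate(p, src, dst)) for p in verts))
    return candidates


candidates = viewpoint_candidates()
print(len(candidates), len(set(candidates)))
```

Each tuple in `candidates` assigns a vertex ID to every one of the 20 image indices; the vertex IDs will not match the numbering used in Figures 15-23, since the ordering here is arbitrary.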