3D Object Classification via Spherical Projections
Abstract
In this paper, we introduce a new method for classifying 3D objects. Our main idea is to project a 3D object onto a spherical domain centered around its barycenter and develop neural networks to classify the spherical projection. We introduce two complementary projections. The first captures depth variations of a 3D object, and the second captures contour information viewed from different angles. Spherical projections combine key advantages of the two mainstream 3D classification approaches: image-based and 3D-based. Specifically, spherical projections are locally planar, allowing us to use massive image datasets (e.g., ImageNet) for pretraining. Spherical projections are also similar to voxel-based methods in that they encode the complete geometry of a 3D object in a single neural network, capturing dependencies across different views. Our novel network design fully utilizes these advantages. Experimental results on ModelNet40 and ShapeNetCore show that our method is superior to prior methods.
1 Introduction
We perceive our physical world via different modalities (e.g., audio/text/images/videos/3D models). Compared to other modalities, 3D models provide the most accurate encoding of physical objects. Developing algorithms to understand and process 3D geometry is therefore vital to automatic understanding of our physical environment. Algorithms for 3D data analysis and processing were predominantly based on hand-crafted features or shallow networks, due to the limited training data available. However, this situation has started to change with the significant growth of 3D data during the past few years (e.g., Warehouse 3D, https://3dwarehouse.sketchup.com/?hl=en, and Yobi3D, https://www.yobi3d.com/), which offers rich opportunities for developing deep learning algorithms that significantly boost the performance of 3D data understanding.
Deep neural networks usually take vectorized data as input. This means that how to encode input objects in vectorized form is crucial to the resulting performance. While this problem is trivial for other modalities (audio signals, images, and videos are already in vectorized form), it becomes far more complicated for 3D objects, which are intrinsically 2D surfaces embedded in 3D ambient space. Existing deep learning algorithms fall into two categories: 3D-based and image-based. 3D-based methods typically encode a given 3D object using an occupancy grid. Yet due to memory and computational restrictions, the resolution of these occupancy grids remains low (e.g., 40x40x40 at best among most existing methods). Such a low resolution prohibits the use of local geometric details for geometric understanding. In contrast, image-based methods analyze and process 3D models via their 2D projections. Image-based techniques have the advantage that one can use significantly higher resolution when analyzing projected images. Moreover, it is possible to utilize large-scale training data (e.g., ImageNet) for pretraining. It turns out that with similar network complexity, image-based techniques appear to be superior to 3D-based techniques. Yet existing image-based techniques have significant restrictions. For example, one must determine the viewpoints for projection. Moreover, the projected images possess discontinuities across image boundaries. Finally, existing techniques do not capture dependencies across different views (e.g., the correlation between the front and the back of an object) for geometric understanding.
In this paper, we introduce a novel projection method that possesses the advantages of existing image-based techniques yet addresses the issues described above. The basic idea is to project 3D objects onto a viewing sphere. On one hand, spherical domains are locally 2D, so we can apply convolutional kernels at high resolution and utilize large-scale image data for pretraining. On the other hand, spherical domains are continuous and global, allowing us to capture patterns of complete 3D objects that are usually not present in standard image-based projections. These characteristics make spherical projection advantageous compared with standard image-based projections. To fully utilize large-scale image training data, we present two spherical projection methods: one captures the depth variation of shapes from different viewpoints, and the other captures the contour information of shapes from different viewpoints. These two projections utilize the texture and object-boundary information captured by neural network models pretrained on ImageNet.
We introduce two principled ways to utilize these spherical projections for the task of 3D object classification. The guiding principle is to perform convolutions on cylindrical patches, which admit standard 2D neural network operators and thus allow us to use pretrained networks. We show how to sample a minimal set of cylindrical patches that is nonetheless sufficient to capture rich cross-view dependencies.
2 Related Work
3D object classification has been studied extensively in the literature. While early works focus on leveraging hand-crafted features for classification [14, 1, 12, 8, 15, 9], recent methods seek to leverage the power of deep neural networks [27, 25]. For brevity, we only summarize methods that use deep neural networks, as they are most relevant to this paper. Existing deep learning methods for 3D object classification fall into two categories: 3D-based and image-based.
3D-based methods classify 3D shapes based on 3D geometric representations such as voxel occupancy grids or point clouds. In [27], Wu et al. propose to represent a 3D shape as a probability distribution of binary variables on a 3D voxel occupancy grid and apply a Deep Belief Network for classification. Recent methods use the same data representation but apply 3D convolutional neural networks for classification [13, 5, 2, 17, 3]. They differ from each other in the specifications of the training data (e.g., with or without front orientation) as well as in the details of network training. ORION [2] adds an orientation-estimation module to the original VoxNet [13] as a second task and trains both tasks simultaneously, which boosts the performance of VoxNet. Volumetric CNN [17] proposes two approaches to improve the performance of volumetric convolutional neural networks. The first adds a subvolume supervision task, which simultaneously trains networks that understand object parts alongside a network that understands the whole object. The second exploits an anisotropic probing kernel, which serves as a projection operator from 3D objects to 2D images; the resulting 2D projections can then be classified using 2D CNNs. The difference between our method and Volumetric CNN lies in the representation used for integrating 2D and 3D training data. Voxception-ResNet [3] designs a volumetric residual network architecture. To maximize performance, it augments the training data with multiple rotations of the input and aggregates predictions from multiple residual networks. Beyond network architectures, Beam Search [28] has been proposed to optimize the model structure and hyperparameters of 3D convolutional networks. The basic idea is to define primitive actions that alter the model structure and hyperparameters locally, so as to find the best model structure and parameter setting for 3D objects represented by 3D voxel grids.
Despite the significant advances in 3D convolution, existing techniques share a major limitation: the resolution of a 3D convolutional neural network is usually very coarse. Although this issue has recently been alleviated by Octree-based representations [19, 20, 26], the cost of 3D volumetric neural networks is still significantly higher than that of 2D neural networks of the same resolution.
Besides voxel-based representations, people have looked at other geometric representations such as point-based representations. In [16], the authors propose a novel neural network architecture that operates directly on point clouds. This method leads to significant improvements in running time. The major challenge in designing neural networks for point cloud data is ensuring permutation invariance of the input points. In an independent work, SetLayer [18] concentrates on the permutation-equivariance property of point clouds and introduces a set-equivariant layer for this purpose.
Image-based techniques. The key advantage of 3D-based techniques is that the underlying 3D object is exactly characterized by the corresponding 3D representation. On the other hand, 3D training data remains limited compared to the amount of 2D training data we have access to. In addition, such 3D representations are unable to utilize the large amount of labeled images, which can be considered as projections of the underlying 3D objects. This is the motivation for image-based 3D object classification techniques, which apply 2D convolutional neural networks to classify rendered images. In [25], Su et al. propose to render 12 views of each 3D shape and classify the rendered images. The image classification network is initialized using VGG [23], pretrained on ImageNet data, and then fine-tuned on the ModelNet40 dataset. [10] provides a different way to fine-tune the network with rendered images, each of which is given a weight measuring its importance to the final prediction. [24] proposes a way to convert 3D objects to geometry images and implicitly learn the geometry and topology of 3D objects. Despite the fact that image-based techniques can utilize pretrained image classification networks, image projections incur significant information loss, and it is not easy to capture complete relative dependencies, e.g., those that cannot be projected into the same view. Perhaps most relevant to our method is the classification of panoramas [22, 21], which projects a 3D object onto a cylindrical domain. Although a cylindrical domain is certainly more flexible than image domains, it still does not cover the entire object. Our experimental results reveal that a single cylindrical projection is insufficient for obtaining state-of-the-art object classification performance.
Hybrid methods. Several works seek to combine 3D-based and image-based techniques. In particular, FusionNet [6] utilizes both 3D voxel data and 2D projection data by training two 3D CNNs: a generic 3D CNN and a kernel-based 3D CNN with varying kernel sizes. FusionNet also trains the image-based multi-view CNN mentioned above, and then fuses features from these three networks so as to exploit the advantages of different features.
We observe that methods based on 2D projections tend to perform better than those based on volumetric representations, since they can exploit pretrained models to address the issue of insufficient training data. Yet image-based techniques require a large number of views, and there are significant overlaps across views, exhibiting information redundancy. We therefore propose a novel spherical projection approach, which uses a single sphere to aggregate information from different viewing directions. We also exploit data dependencies across different views, which are beneficial for object classification.
3 Spherical Projection
In this section, we describe the two proposed spherical projections, i.e., depth-based projection and image-based projection. The input to both projections is a 3D model with a prescribed upright orientation; we do not assume the front orientation is given. This setup is applicable to almost all internet 3D model repositories (e.g., Warehouse3D and Yobi3D). Both spherical projections utilize a sphere centered at the barycenter of each object. The radius of this sphere is chosen as three times the diagonal of the object's bounding box. Note that the radius of the sphere does not affect the depth-based projection and has only minor effects on the image-based projection.
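The sphere setup described above can be sketched in a few lines. The helper name below is ours (not from the paper's code); it computes the barycenter of the mesh vertices and a radius equal to three times the bounding-box diagonal.

```python
import numpy as np

def viewing_sphere(vertices):
    """vertices: (N, 3) array of mesh vertex positions.

    Returns the sphere center (barycenter) and radius
    (3x the axis-aligned bounding-box diagonal)."""
    center = vertices.mean(axis=0)                        # barycenter
    extent = vertices.max(axis=0) - vertices.min(axis=0)  # bbox extents
    radius = 3.0 * np.linalg.norm(extent)                 # 3x diagonal
    return center, radius

# Example: a unit cube, whose bbox diagonal is sqrt(3).
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], float)
center, radius = viewing_sphere(verts)
```

Note that a vertex barycenter is used here for simplicity; a surface-area-weighted centroid would be less sensitive to uneven tessellation.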
Depth-based projection. The depth-based projection is generated by shooting a ray from each point on the sphere toward the center. The value at each point is recorded as the distance to the first hit point; if the ray does not hit the object, the value is set to zero. We compute depth values for the vertices of a semi-regular quad mesh whose axes align with the longitude and latitude, i.e.,
(1) 
Then the depth values of other points on the sphere are generated by linear interpolation. Specifically, denote as the depth value that corresponds to . Then given a point with spherical coordinates , where , its depth value is given by
(2) 
This allows us to generate the depth value for every point on the sphere. In our implementation, we further use an Octree to accelerate the ray-mesh intersection. For all experiments, we use , i.e., one pixel per 2 degrees along both the latitude and the longitude. We proceed to generate cylindrical strips from the depth projection described above. We first use the strip covering the following area:
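The ray-sampling pattern described above can be sketched as follows: a regular latitude/longitude grid on the sphere (here one sample per 2 degrees, matching the paper's resolution), with a ray shot from each grid point toward the sphere center. The actual ray-mesh intersection (Octree-accelerated in the paper) is left abstract; the function name is ours.

```python
import numpy as np

def depth_rays(center, radius, step_deg=2.0):
    """Ray origins on the sphere and unit directions toward its center."""
    lats = np.deg2rad(np.arange(-90 + step_deg, 90, step_deg))
    lons = np.deg2rad(np.arange(0, 360, step_deg))
    lat, lon = np.meshgrid(lats, lons, indexing="ij")
    # Unit direction from the center to each grid point on the sphere.
    dirs = np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)
    origins = center + radius * dirs   # grid points on the sphere
    return origins, -dirs              # rays point back at the center

origins, directions = depth_rays(np.zeros(3), radius=3.0)
```

Each (origin, direction) pair would be handed to a ray-mesh intersection routine; a miss yields depth zero as stated above.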
(3) 
Since regions of high latitude suffer from severe distortion, we eliminate them by restricting to the range from to . In the following, we call this strip the latitude strip, which is fed into the convolution layers (see Figure LABEL:Fig:Spherical:Depth_Network).
To utilize information from high-latitude regions for classification, we also use a circle of vertical strips, each parallel to a longitude (see Figure LABEL:Fig:Spherical:Depth_Network). There are 12 strips in total, and the angle between adjacent strips is . The pixel coordinates on the strip indexed by are given by
(4) 
where is the number of strips. For all experiments shown in this paper, we use and . The total number of pixels is comparable to that used in MVCNN [25].
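Extracting the 12 longitudinal strips from an equirectangular depth map can be sketched as below. The strip width `w` is a free parameter in this sketch (the paper's exact value is not reproduced); the key point is that columns wrap periodically in longitude, so strips near the 0/360-degree seam stay contiguous.

```python
import numpy as np

def longitude_strips(depth_map, K=12, w=16):
    """Cut K vertical strips of width w from an (H, W) depth map,
    with strip centers 360/K degrees apart and periodic wrap in
    longitude."""
    H, W = depth_map.shape
    strips = []
    for k in range(K):
        c = k * W // K                                   # center column
        cols = np.arange(c - w // 2, c + w // 2) % W     # periodic wrap
        strips.append(depth_map[:, cols])
    return np.stack(strips)                              # (K, H, w)

# Toy depth map: 90 x 180 (2 degrees per pixel).
depth = np.arange(90 * 180, dtype=float).reshape(90, 180)
strips = longitude_strips(depth)
```

Strip 0 is centered on column 0, so its left half comes from the right edge of the map, illustrating the wrap-around.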
As illustrated in Figure LABEL:Fig:Spherical:Depth_Projection, depth-based projection effectively captures geometric variations. Moreover, for a wide range of objects (i.e., those for which the ray defined by every shape point and the sphere center reaches the sphere without occlusion), the original shape can be directly reconstructed from the depth-based projection. Such objects include convex objects and many other box-like objects. In other words, the depth-based projection is quite informative. On the downside, at the global scale, the patterns revealed in the depth-based projection deviate from natural images. Moreover, the contours of objects, which provide important cues for 3D object classification, are not present in the depth-based projection. This motivates us to consider image-based projection.
Image-based projection. As shown in Figure LABEL:Fig:Spherical:Contour_Network, image-based projection shoots a 3x12 grid of images of the input object from 36 viewpoints in total. The locations of the cameras are given by setting in (LABEL:Eq:1), i.e., . At each camera location, the up direction of the camera always points to the north pole. The viewing angle of each image is chosen so that the projected images barely overlap. The resolution of each image is 224x224. In our experiments, we varied the value of and found that provides a good trade-off between minimizing the number of views and ensuring that the resulting projections are approximately invariant to rotations of the input object.
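The 3x12 camera layout can be sketched as follows. The three latitude values below are illustrative placeholders (the paper's exact elevations are not reproduced here); what matters is the structure: 12 azimuths per latitude, cameras on the sphere looking at the center, with the up direction toward the north pole.

```python
import numpy as np

def camera_positions(center, radius, lats_deg=(-30.0, 0.0, 30.0)):
    """36 camera positions: 3 latitudes x 12 azimuths on the sphere."""
    cams = []
    for lat in np.deg2rad(np.array(lats_deg)):
        for lon in np.deg2rad(np.arange(0, 360, 30)):   # 12 azimuths
            d = np.array([np.cos(lat) * np.cos(lon),
                          np.cos(lat) * np.sin(lon),
                          np.sin(lat)])                 # unit direction
            cams.append(center + radius * d)
    return np.array(cams)                               # (36, 3)

cams = camera_positions(np.zeros(3), radius=3.0)
```

Each camera would then render a 224x224 image looking at the sphere center, and the 36 images are concatenated in their spatial order on the sphere.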
Note that our imagebased projection does not generate a perpixel value for each point on the sphere. Instead, we use the sphere to guide how these images are projected, enabling us to capture dependencies across different views.
4 Classification Network and Training
The proposed network design has two major motivations: 1) leveraging pretrained convolution kernels from ImageNet, and 2) capturing dependencies that cannot be projected into the same view (e.g., the front and back of an object). To this end, we propose two steps for designing the classification network.
Network design. Figure LABEL:Fig:Spherical:Depth_Network illustrates the network design for depth-based projection. Like AlexNet, this network has a convolution module and a fully connected module. The convolution module uses the same set of convolution kernels as AlexNet, which allows us to use the pretrained AlexNet kernels. As shown in Figure LABEL:Fig:Spherical:Depth_Network, if the strip is parallel to the latitude, the convolutions are applied in a periodic manner, so as to exploit the continuity of the data. Specifically, is a periodic convolution network that captures dependencies across different views in both the convolutional and fully connected layers. In contrast, common convolutions are applied independently to strips that are parallel to the longitude (see ). The fully connected layers, shown in , capture the data dependencies across different views. To preserve the spatial relations while maintaining rotation invariance, we introduce 12 fully connected layer connections between and . Let be the feature vectors at layer . We define the feature vectors at layer as
Note that the initial network weights are set to the AlexNet weights. The other weights are initialized to zero, i.e., . In other words, the initial weights apply fully connected operations to the feature vector attached to each image in isolation, while the cross links force the network to learn dependencies across different views.
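The "periodic" convolution on the latitude strip amounts to circularly padding the strip along the longitude axis before a valid convolution, so features stay continuous across the 0/360-degree seam. A real implementation would use a 2D convolution layer with circular padding; the 1D sketch below only illustrates the wrap-around.

```python
import numpy as np

def periodic_conv1d(row, kernel):
    """Convolve a 1D signal with circular (wrap-around) padding,
    so the output has the same length as the input."""
    k = len(kernel)
    pad = k // 2
    # Circular padding: prepend the tail, append the head.
    wrapped = np.concatenate([row[-pad:], row, row[:pad]])
    return np.convolve(wrapped, kernel, mode="valid")

row = np.array([1.0, 2.0, 3.0, 4.0])
out = periodic_conv1d(row, np.array([1.0, 1.0, 1.0]))
# Each output entry sums a 3-wide window that wraps around the ends.
```

With an ordinary zero-padded convolution, the first and last entries would see artificial zeros; the circular version instead sees the opposite end of the strip, matching the continuity of the spherical domain.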
Figure LABEL:Fig:Spherical:Contour_Network shows the network design for contour-based projections. 36 cameras are uniformly distributed along three latitudes (, , ) of the sphere. We concatenate all the rendered images in their spatial order on the sphere and then feed the entire image to , i.e., the convolutional neural network for the contour-based projection.
As described in the previous section, the resolutions of the depth-based projections are 240x360 for the latitude strip and 360x60x12 for the 12 longitude strips. In addition, the resolution of the contour-based projection is 224x224x36. Note that although we utilize more pixels, the number of parameters in the network remains relatively small, as we share network parameters across different strips.
Training. We train the depth network in three stages. We first train the convolutional layers and the direct connection layers, i.e., . If pretraining is allowed, training at this stage starts from the pretrained weights of AlexNet. We then train the convolutional network for the latitude strip. After this step, we have two pretrained models, one for the longitude strips and one for the latitude strip. Finally, we train the entire network together, with weights copied from these two pretrained models except for the final classification layer. For the contour-based network, we train from scratch if AlexNet parameters are not provided; otherwise, we train all parameters except the last classification layer.
We use the Caffe framework for all of our experiments. For layers trained from scratch, we set the learning rate to 10 times that of the other layers. We use mini-batch stochastic gradient descent (SGD) with 0.9 momentum and the learning-rate annealing strategy implemented in Caffe. The learning rate is cross-validated by grid search, starting from and ending at , with a multiplicative step size of . We fix the mini-batch size as and set the weight decay as .
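The SGD-with-momentum update used above can be written in a few lines. This is a generic sketch following Caffe's formulation (v &lt;- momentum * v - lr * grad; w &lt;- w + v), not the paper's actual solver configuration.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, momentum=0.9):
    """One mini-batch SGD step with classical momentum (Caffe-style)."""
    v = momentum * v - lr * grad   # accumulate a velocity term
    w = w + v                      # move weights along the velocity
    return w, v

# One step from zero velocity: v = -lr * grad.
w = np.array([1.0, -2.0])
v = np.zeros(2)
w, v = sgd_momentum_step(w, v, grad=np.array([0.5, -0.5]), lr=0.1)
```

Weight decay, as used in the paper, would simply add a `decay * w` term to `grad` before this update.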
5 Experimental Evaluation
5.1 Experimental Setup
Datasets and evaluation protocol. We evaluate the proposed approaches on two benchmark datasets, ModelNet40 and ShapeNetCore. Both collect models from Warehouse3D but cover different model classes.
ModelNet40 [27] contains 12311 shapes across 40 categories ranging from airplane to xbox. We use the default trainingtesting split (c.f. [13, 2, 17]) that leads to 9843 models in the training set and 2468 models in the testing set.
ShapeNetCore [4] contains 51300 shapes across 55 categories. We use the default training set (36148 models) for training and the default validation set (5615 models) for testing. Note that ShapeNetCore is larger than ModelNet40, and the distributions of categories also differ, e.g., ShapeNetCore contains fewer furniture categories.
Baseline methods. Since our method does not utilize the front orientation, for the baseline comparison we only consider algorithms that likewise do not utilize such information. In addition, we also report the performance of state-of-the-art methods on ModelNet40.


Method | ModelNet40 | ShapeNetCore | ModelNet40-SubI | ShapeNetCore-SubI
--- | --- | --- | --- | ---
3D ShapeNets [27] | 85.9 | n/a | 83.33 | n/a
VoxNet [13] | 87.8 | n/a | 85.99 | n/a
FusionNet [6] | 90.80 | n/a | 89.54 | n/a
Volumetric CNN [17] | 89.9 | n/a | 88.65 | n/a
MVCNN [25] | 92.31 | 88.93 | 91.22 | 88.64
MVCNN-MultiRes [17] | 93.8 | 90.01 | 92.60 | 90.00
OctNet [20] | 87.83 | 88.03 | 86.45 | 87.85
depth-based projection (ours) | 91.36 | 89.45 | 90.25 | 89.13
contour-based projection (ours) | 93.31 | 90.49 | 92.20 | 90.80
overall (ours) | 94.24 | 91.00 | 93.09 | 91.22

The baseline algorithms we choose include MVCNN [25], MVCNN-MultiRes [17], 3D ShapeNets [27], VoxNet [13], FusionNet [6], Volumetric CNN [17], and OctNet [20]. In the following, we briefly summarize these methods. MVCNN classifies a given model by fusing the feature layers of rendered images with a max-pooling layer. MVCNN-MultiRes improves MVCNN by exploiting rendered images at multiple resolutions. 3D ShapeNets is the first deep learning method on 3D shape data and is built on a Deep Belief Network. VoxNet leverages a 3D convolutional neural network for shape classification. FusionNet fuses features extracted from 3D voxel data and 2D projection data by different networks. Volumetric CNN modifies VoxNet by adding a subvolume supervision task and anisotropic probing kernel convolutions. Finally, OctNet transforms 3D objects into Octree-based representations and designs a special network to classify these representations. All of these methods use the upright orientation but not the front orientation.
5.2 Classification Results
Table LABEL:table:overall_results collects the overall classification accuracy of different methods on ModelNet40 and ShapeNetCore. As we can see, the proposed depth-based projection method is superior to most existing 3D-based methods, which demonstrates the power of incorporating massive image training datasets. Compared to most other image-based techniques, our contour-based projection method exhibits top performance, showing the advantage of generating projections on the spherical domain. Although MVCNN-MultiRes outperforms the contour-based projection, it needs to render images at multiple resolutions, which is much slower than our contour projection. When combining the depth-based and contour-based projections, our overall method outperforms MVCNN-MultiRes and achieves the highest accuracy, which also demonstrates that our two projections are complementary.
When comparing the performance of the various algorithms on ModelNet40 and ShapeNetCore, we find that the performance on ShapeNetCore is lower, which is expected since the models in ShapeNetCore exhibit larger variance. Moreover, as ShapeNetCore is larger than ModelNet40, the gap between the depth-based projection and the view-based projection is bigger, since the effect of pretraining may diminish as the size of the 3D dataset increases. In the following, we provide a more detailed analysis of the results.
5.3 Analysis of Results
The effects of pretraining. As illustrated in Table LABEL:table:pre_after_modelnet and Table LABEL:table:pre_after_shapenet, all 2D-based techniques benefit from ImageNet pretraining. This reflects the fact that both ShapeNetCore and ModelNet40 are relatively small for training high-quality classification networks from scratch, and that images from ImageNet contain rich information that can be used to differentiate rendered images. Comparing ShapeNetCore with ModelNet40, the effect of pretraining on ShapeNetCore is more salient. One explanation is that ShapeNetCore exhibits higher diversity, so ImageNet features help more. Another factor is that the distribution of ShapeNetCore categories is closer to the corresponding categories in ImageNet than that of ModelNet40.
Quite surprisingly, the improvement from pretraining on depth-based projection is as strong as that on contour-based projection. This suggests that pretrained ImageNet models also contain rich interior edge information (e.g., changes in texture in the presence of depth discontinuities), which is beneficial for classifying depth-based projections.
Depth-based versus contour-based. The overall performance of depth-based projection is slightly below that of contour-based projection. This is expected, because object contours provide strong cues for classification.
To further compare the effectiveness of the different methods on a particular type of shape, we selected the furniture classes from both datasets as two curated subsets: bathtub, bed, bookshelf, chair, curtain, desk, door, dresser, lamp, mantel, night_stand, range_hood, sink, sofa, stool, table, toilet, tv_stand, and wardrobe in ModelNet40; and bathtub, bed, bookshelf, cabinet, chair, clock, dishwasher, lamp, loudspeaker, sofa, table, and washer in ShapeNetCore. We tested our model on these subsets, and the results are included in Table LABEL:table:class_wise.
It is clear that the winning categories of each method are drastically different. As indicated in Table LABEL:table:class_wise, depth-based projection is advantageous on categories such as bowl, table, and bottle, which possess strong interior depth patterns. In contrast, contour-based projection is superior to depth-based projection on categories such as plant, guitar, and sofa, which have distinctive contour features.
Comparison to MVCNN. The proposed contour-based projection method is superior to MVCNN on both ModelNet40 and ShapeNetCore. The main reason is that our network captures cross-view dependencies early in the convolution layers, while MVCNN only max-pools the features extracted from the convolution layers. This performance gap indicates that dependencies across different views and at different scales are important cues for shape classification.
Comparison to voxel-based techniques. Our approach is also superior to most voxel-based classification methods, indicating the importance of leveraging image training data. The only exception is the recent work of VoxelResNet [28]. However, that work assumes that the front orientation of each shape is given. In addition, its performance relies heavily on training an ensemble of networks; the accuracy of each individual network in [28] is upper bounded by .
Comparison to panorama-based techniques. A building block of our technique is classifying cylindrical strips of spherical projections. This is related to recent works on classifying panoramas of 3D objects [22, 21]. However, the major difference is that our approach uses multiple strips to capture correlations across the spherical projection. In addition, our network design utilizes a model pretrained on ImageNet. As indicated by our experiments, on ModelNet40 our approach leads to and improvements in accuracy over using a single strip and over [21], respectively.
Varying the resolutions of the projections. We also tested the performance of our network under varying projection resolutions. For depth-based projection, when we increased the resolution of the grid from to and , the classification accuracy dropped by and , respectively. We thus use for efficiency. For contour-based projection, we changed the resolution of the grid pattern to and ; the classification accuracy improved by less than when moving from grids to grids, while the improvement from grids to grids is about on average, which is expected since using a grid is insufficient for handling rotation invariance.
Varying the elevation of the horizontal strip. We also tried varying the elevation range of the horizontal strip, which is in Figure LABEL:Fig:Spherical:Depth_Network. Experimental results in Figure LABEL:Fig:elevation show that the performance is not very sensitive to the elevation range. When the elevation increases beyond , the performance no longer improves appreciably, since high-latitude areas suffer severe distortion in the horizontal strip, where pretrained models become ineffective. We therefore set the elevation to .
Timing. The rendering, inference, and training times are listed in Table LABEL:Table:2; all measurements were performed on a machine with 16 Intel Xeon E5-2637 v4 @ 3.50GHz CPUs and one Titan X (Pascal) GPU. As indicated in Table LABEL:Table:2, classifying a single 3D object takes around 2 seconds, and the dominant computational cost is generating the projections. Note that both depth-based and contour-based projections can be accelerated on the GPU; we believe the computational cost can be significantly reduced by exploring such options.
6 Conclusions, Discussion and Future Work
In this paper, we have introduced a spherical representation and developed deep neural networks to classify 3D objects. Our approach explores two ways to project 3D shapes onto a spherical domain: the first leverages depth variation, while the second leverages contour information. Such a representation has the advantages of 1) allowing high-resolution grid representations that capture geometric details, 2) incorporating large-scale labeled images for training, and 3) capturing data dependencies across the entire object. We also described principled ways to define convolution operations on spherical domains such that the output of the neural networks is insensitive to the front orientation of each object. Experimental results show that the proposed methods are competitive against state-of-the-art methods on both ModelNet40 and ShapeNetCore.
There are ample opportunities for future research. The methods presented in this paper still use rectangular convolutional kernels, mainly because we want to reuse pretrained kernels. Technically, however, it would be interesting to define convolutional kernels directly on spherical domains; one potential solution is to use spherical harmonics [11]. In another direction, it remains interesting to consider other types of spherical projections, e.g., spherical parameterizations of geometric objects [7], which are free of occlusions. We did not use such parameterizations mainly because models in ModelNet40 and ShapeNetCore consist of many disconnected components. Finally, we only considered the task of classification; it would be interesting to consider other tasks such as shape segmentation and shape synthesis. For both tasks, standard image-based techniques require stitching predictions from different views of the object, whereas the spherical projection is complete and may not suffer from this issue.
Acknowledgement. We would like to acknowledge support of the NSF Award IIP #1632154. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
References
 [1] The princeton shape benchmark. In Proceedings of the Shape Modeling International 2004, SMI ’04, pages 167–178, Washington, DC, USA, 2004. IEEE Computer Society.
 [2] N. S. Alvar, M. Zolfaghari, and T. Brox. Orientationboosted voxel nets for 3d object recognition. CoRR, abs/1604.03351, 2016.
 [3] A. Brock, T. Lim, J. Ritchie, and N. Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
 [4] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
 [5] A. Garcia-Garcia, F. Gomez-Donoso, J. G. Rodríguez, S. Orts-Escolano, M. Cazorla, and J. A. López. Pointnet: A 3d convolutional neural network for real-time object class recognition. In 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016, pages 1578–1584, 2016.
 [6] V. Hegde and R. Zadeh. Fusionnet: 3d object classification using multiple data representations. CoRR, abs/1607.05695, 2016.
 [7] K. Hormann, K. Polthier, and A. Sheffer. Mesh parameterization: Theory and practice. In ACM SIGGRAPH ASIA 2008 Courses, SIGGRAPH Asia ’08, pages 12:1–12:87, New York, NY, USA, 2008. ACM.
 [8] Q.-X. Huang, H. Su, and L. Guibas. Fine-grained semi-supervised labeling of large shape collections. ACM Trans. Graph., 32(6):190:1–190:10, Nov. 2013.
 [9] N. Iyer, S. Jayanti, K. Lou, Y. Kalyanaraman, and K. Ramani. Three-dimensional shape searching: State-of-the-art review and future trends. Comput. Aided Des., 37(5):509–530, Apr. 2005.
 [10] E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3813–3822, 2016.
 [11] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Symposium on Geometry Processing, SGP ’03, pages 156–164, 2003.
 [12] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3d surf for robust three dimensional classification. In Proceedings of the 11th European Conference on Computer Vision: Part VI, ECCV’10, pages 589–602, 2010.
 [13] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, pages 922–928, 2015.
 [14] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Trans. Graph., 21(4):807–832, Oct. 2002.
 [15] J. Pu and K. Ramani. On visual similarity based 2d drawing retrieval. Comput. Aided Des., 38(3):249–259, Mar. 2006.
 [16] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CoRR, abs/1612.00593, 2016.
 [17] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multiview cnns for object classification on 3d data. In Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
 [18] S. Ravanbakhsh, J. G. Schneider, and B. Póczos. Deep learning with sets and point clouds. CoRR, abs/1611.04500, 2016.
 [19] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger. Octnetfusion: Learning depth fusion from data. CoRR, abs/1704.01047, 2017.
 [20] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. CoRR, abs/1611.05009, 2016.
 [21] K. Sfikas, T. Theoharis, and I. Pratikakis. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval. In Eurographics Workshop on 3D Object Retrieval, 2017.
 [22] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep panoramic representation for 3d shape recognition. IEEE Signal Process. Lett., 22(12):2339–2343, 2015.
 [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [24] A. Sinha, J. Bai, and K. Ramani. Deep learning 3d shape surfaces using geometry images. In European Conference on Computer Vision, pages 223–240, 2016.
 [25] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, 2015.
 [26] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. CoRR, abs/1703.09438, 2017.
 [27] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [28] X. Xu and S. Todorovic. Beam search for learning a deep convolutional neural network of 3d shapes. CoRR, abs/1612.04774, 2016.