Large-Scale Shape Retrieval with Sparse 3D Convolutional Neural Networks
Abstract
In this paper we present a performance evaluation of S3DCNN, a Sparse 3D Convolutional Neural Network, on the large-scale 3D shape benchmark ModelNet40, and measure how its performance is affected by the voxel resolution of the input shape. We demonstrate classification and retrieval performance comparable to state-of-the-art models, but at much lower computational cost in the training and inference phases. We also observe that the benefits of higher input resolution can be limited by the ability of a neural network to generalize high-level features.
1 Introduction
For computer vision systems, precise and robust operation in real environments is only possible by harnessing information from 3D data. To achieve this, we need to overcome several challenges specific to this kind of problem.
Data received from devices such as scanners often comes in the form of noisy meshes or point clouds, which are not a natural fit for models such as Convolutional Neural Networks (CNNs) [14].
In current state-of-the-art systems, Convolutional Neural Networks are widely used. Their effectiveness at processing 2D images also suggests that they can process 3D objects presented as several rendered views of the object. For example, on the ModelNet40 [22] benchmark, three recent papers based on this idea showed incremental improvements in recognition performance [18, 10, 8]. However, it can be argued that this high performance depends on the use of CNNs pre-trained on ImageNet [6].
Voxel representations of 3D shapes (i.e. a shape is represented as a three-dimensional grid whose cells hold binary occupancy values) are compatible with ConvNet input layers but create a number of difficulties. Adding a third spatial dimension to the input grid correspondingly increases computational costs: the number of cells scales as the third power of the voxel grid resolution. Low-resolution grids make it difficult to differentiate between similar shapes, and lose some of the fine details available in 2D renderings of equivalent resolution.
Some 3D Dense Convolutional Networks have been evaluated on the ModelNet40 benchmark [16, 17, 22, 3], but they still do not perform as well as their multi-rendering 2D counterparts.
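To make the cubic scaling concrete, the following minimal sketch compares the number of cells in a voxel grid with the number of pixels in an image of the same linear resolution. The function names and the resolutions in the loop are illustrative assumptions, not values taken from the experiments.

```python
def pixel_count(resolution):
    """Number of pixels in a square image with the given side length."""
    return resolution ** 2


def voxel_cell_count(resolution):
    """Number of cells in a cubic voxel grid with the given side length."""
    return resolution ** 3


# Illustrative resolutions only (an assumption, not the paper's settings).
for resolution in (30, 60, 120):
    print(resolution, pixel_count(resolution), voxel_cell_count(resolution))
```

Doubling the resolution multiplies the pixel count by 4 but the voxel count by 8, which is why dense 3D grids become expensive so quickly.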
At the same time using Modified Spatially Sparse Neural Networks algorithms [7] to process data we are able to have reasonable training and inference time even with input resolution up to voxels.
In this work, we present Sparse 3D Deep Convolutional Neural Networks and explore their ability to perform large-scale shape retrieval on the popular ModelNet40 benchmark [22], depending on the input resolution and the network architecture.
Sparse 3D CNNs are able to generate relevant features for retrieval analogously to 2D extractors. A system that uses many 2D rendered projections for inference is computationally very costly, especially for the task of large-scale 3D shape retrieval. In this paper we present preliminary results of our attempt to find out whether the resolution of an object's voxelization affects descriptive feature extraction, as measured by the retrieval performance on a sufficiently large dataset. We also demonstrate the ability of Sparse 3D CNNs to perform metric learning in the triplet loss setup. Lastly, we train our model to perform classification on the ModelNet40 benchmark.
In Section 2 we formulate the problem in more detail and discuss the latest relevant methods. In Section 3 we describe our approach to neural networks, which helps us solve the problem posed in Section 2. In Section 4 we document the conditions of the computational experiments we performed. In Section 5 we discuss results and draw conclusions about our approach to the problem.
2 3D Large-Scale Retrieval
2.1 Large-Scale 3D Shape datasets
The great improvements of recent years in 2D large-scale image recognition are not just the result of the widespread adoption of Deep Learning techniques; they are also due to the availability of large datasets that capture a sufficient variety of features at different scales to be representative of a domain. In 3D recognition and retrieval, however, such datasets have only recently started to be published.
The recent ModelNet competition evaluated several models utilizing Neural Networks for 3D retrieval. ModelNet40 is a subset of this dataset, and it is our main benchmark for the retrieval task. The approach of creating descriptors from multiple projections of a 3D shape, with transfer learning from ImageNet, showed the best performance [18]. No full 3D algorithms that process voxels directly have been described up to now.
2.2 Shape descriptors
To make inferences about 3D objects for purposes of computer vision or computer graphics, researchers have developed a large number of shape descriptors [11, 12, 4, 13].
Shape descriptors usually fit into two categories: one where shape descriptors are computed using 3D representations of objects, e.g. voxel discretizations, meshes, point clouds, or implicit surfaces, and the second one that describes a shape of a 3D object by a collection of 2D projections, often from multiple viewpoints.
Before large-scale 3D shape datasets such as ModelNet [22] and the 3DShapeNet model, which learns shape descriptors from a voxel representation of a mesh object through 3D convolutional nets, 3D shape descriptors were mostly special functions capturing specific geometric properties of the shape surface or volume, for example: spherical functions computed on volumetric grids [11], generalizations of SIFT and SURF feature descriptors to voxel grids [12], or, for non-rigid bodies and deformable shapes, heat kernel signatures on meshes [4, 13]. Developing classifiers and other supervised machine learning algorithms on top of such 3D shape descriptors poses a number of challenges. The success of CNN image descriptors allows us to hope that descriptors based on 3D convolutional nets can also be beneficial compared to classic descriptors.
2.3 Triplet learning
Recent work in [9] shows that learning representations with triplets of examples gives much better results than learning with pairs using the same network. Inspired by this, we focus on learning feature descriptors based on triplets of patches.
Learning with triplets involves training from samples of the form $\{x, x^{+}, x^{-}\}$, where

$x$ is an anchor object,

$x^{+}$ denotes a positive object, which is a sample we want to be closer to $x$, usually a different sample of the same class as $x$, and

$x^{-}$ is a negative sample belonging to a different class than $x$ and $x^{+}$.
Optimizing the parameters of the network brings $x$ and $x^{+}$ close in the feature space, and pushes apart $x$ and $x^{-}$.
Finally, let us introduce the triplet loss, also known as the ranking loss. It was first proposed for learning embeddings using CNNs in [21] and can be defined as follows.

Let us define $\delta_{+} = d(f(x), f(x^{+}))$ and $\delta_{-} = d(f(x), f(x^{-}))$, where $d$ is the cosine distance between the feature representations $f(\cdot)$ of different objects.

Then for a particular triplet we calculate the triplet loss using the formula
$$\lambda(\delta_{+}, \delta_{-}) = \max(0, \mu + \delta_{+} - \delta_{-}),$$
where $\mu$ is a margin parameter. The correct order should be $\delta_{-} > \delta_{+} + \mu$.

If the order of objects induced by their descriptors is incorrect w.r.t. the triplet loss, the network adjusts its weights through the backpropagated signal to reduce the error.
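The triplet loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the hinge form (margin plus anchor-positive distance minus anchor-negative distance, clipped at zero) is the standard ranking loss, the function names are hypothetical, and the margin value 0.2 is a placeholder rather than the value used in our experiments.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss; margin=0.2 is an assumed placeholder."""
    d_pos = cosine_distance(anchor, positive)  # anchor to positive
    d_neg = cosine_distance(anchor, negative)  # anchor to negative
    return max(0.0, margin + d_pos - d_neg)


# Correctly ordered triplet: the positive is closer than the negative.
print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
# Incorrectly ordered triplet: the loss is positive and drives an update.
print(triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

The loss is zero whenever the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute a gradient.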
3 Sparse Neural Networks
Using sparsity to make neural network computations more efficient was pioneered by Benjamin Graham [7], who developed a low-level C++/CUDA library, SparseConvNet (https://github.com/btgraham/SparseConvNet).
Transformations of data between layers (e.g. convolutions, pooling, nonlinear activation functions) are performed on those sparse vectors. Data in areas with inactive voxels, which are most of them, does not depend on a voxel's relative position; therefore it can be replaced by vectors of a smaller size without explicit spatial dimensions.
It is well known that operating on sparse data structures is more efficient than working with dense data. Another useful property is that we need to store much less data for each object. We computed the sparsity (the fraction of active voxels) for all classes of the ModelNet40 training set at a voxel resolution of 40, and it is only 5.5%.
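As a rough illustration of how such a figure can be measured, the sketch below stores a shape as a set of active voxel coordinates and computes the fraction of the grid they occupy. The hollow cube is a toy shape chosen for this example (an assumption), and the helper names are hypothetical; this is not the PySparseConvNet voxelization code.

```python
def active_fraction(active_voxels, resolution):
    """Fraction of cells in a resolution^3 grid that are active."""
    return len(active_voxels) / float(resolution ** 3)


def hollow_cube_voxels(resolution):
    """Toy shape (illustrative assumption): the surface shell of a cube."""
    last = resolution - 1
    return {
        (x, y, z)
        for x in range(resolution)
        for y in range(resolution)
        for z in range(resolution)
        if 0 in (x, y, z) or last in (x, y, z)
    }


shell = hollow_cube_voxels(40)
# Only the coordinates of active voxels are stored, not the full dense grid.
print(len(shell), active_fraction(shell, 40))
```

Even this shell, which touches every face of the grid, occupies well under a fifth of the cells; real meshes that do not fill the bounding box are sparser still.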
Paper [22] describes using 3D convolutions for their deep model. A voxel is labeled as active when it intersects the mesh object, and inactive otherwise. This binary representation of a 3D shape is given as input to a 3D CNN, which has a structure similar to a 2D one. The main problem of this approach is the inefficiency with which data is represented and processed. The mentioned model uses a number of cells that is approximately the number of pixels in 2D applications of CNNs. In terms of linear dimensions, however, this is clearly not a lot, as can be seen from Figure 2. That resolution was primarily chosen because of computational resource limitations. Besides that, convolution is a very computationally expensive operation, whose complexity rises very quickly with input scale. The computational complexity of a 3D convolution for an input with dimensions $N \times N \times N$ and filters of size $k \times k \times k$ is $O(N^3 k^3)$. If we use the Fast Fourier Transform (FFT), the complexity can be reduced to $O(N^3 \log N)$ in exchange for a higher memory cost [15]. But even in that case, the complexity of convolutions makes it impossible to work with objects at high voxel resolutions.
3.1 PySparseConvNet
The SparseConvNet library is written in C++ and makes extensive use of CUDA capabilities for speed and efficiency. But it is very limited when it comes to

extending functionality: the class structure and CUDA kernels are very complex, and require recompilation on every modification,

changing loss functions: the only learning configuration was SoftMax with a log-likelihood loss function,

fine-grained access to layer activations: there was no way to extract activations, and therefore features, from hidden layers,

interactivity for model exploration: every experiment had to be a compiled binary, with no way to perform operations step by step to explore the properties of models.
Because of all these problems we developed PySparseConvNet (https://github.com/gangiman/PySparseConvNet).
The interface of PySparseConvNet is much simpler and consists of four classes:

SparseNetwork: the network object class; it has all the methods to change its structure and to manipulate weights and activations,

SparseDataset: a container class for sparse samples and their labels,

SparseBatch: gives access to the data in a dataset when processing separate mini-batches,

Off3DPicture: a wrapper class for 3D models in OFF (Object File Format), used to voxelize samples to be processed by SparseNetwork.
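Table 1: Network architecture.

| layer # | layer type | size | stride | channels | spatial size | sparsity (%) |
|---|---|---|---|---|---|---|
| 0 | Data input | - | - | 1 | 126 | 0.18 |
| 1 | Sparse Convolution | 2 | 1 | 8 | 125 | - |
| 2 | Leaky ReLU (α = 0.33) | - | - | 32 | 125 | 0.35 |
| 3 | Sparse MaxPool | 3 | 2 | 32 | 62 | 0.69 |
| 4 | Sparse Convolution | 2 | 1 | 256 | 61 | - |
| 5 | Leaky ReLU (α = 0.33) | - | - | 64 | 61 | 1.07 |
| 6 | Sparse MaxPool | 3 | 2 | 64 | 30 | 1.93 |
| 7 | Sparse Convolution | 2 | 1 | 512 | 29 | - |
| 8 | Leaky ReLU (α = 0.33) | - | - | 96 | 29 | 3.26 |
| 9 | Sparse MaxPool | 3 | 2 | 96 | 14 | 7.32 |
| 10 | Sparse Convolution | 2 | 1 | 768 | 13 | - |
| 11 | Leaky ReLU (α = 0.33) | - | - | 128 | 13 | 15.14 |
| 12 | Sparse MaxPool | 3 | 2 | 128 | 6 | 46.30 |
| 13 | Sparse Convolution | 2 | 1 | 1024 | 5 | - |
| 14 | Leaky ReLU (α = 0.33) | - | - | 160 | 5 | 97.54 |
| 15 | Sparse MaxPool | 3 | 2 | 160 | 2 | 100.00 |
| 16 | Sparse Convolution | 2 | 1 | 1280 | 1 | - |
| 17 | Leaky ReLU (α = 0.33) | - | - | 192 | 1 | 100.00 |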
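The spatial sizes listed for the architecture can be checked with the usual output-size arithmetic. The sketch below assumes unpadded ("valid") convolutions and pooling, which is our reading of the listed sizes rather than a statement about the library internals; the layer list mirrors the convolution (size 2, stride 1) and max-pooling (size 3, stride 2) pattern of the table, and the function names are hypothetical.

```python
def output_size(input_size, kernel, stride):
    """Output side length of an unpadded sliding-window layer."""
    return (input_size - kernel) // stride + 1


# Five (convolution, max-pool) blocks followed by a final convolution.
LAYERS = [("conv", 2, 1), ("pool", 3, 2)] * 5 + [("conv", 2, 1)]


def trace_spatial_sizes(input_size, layers=LAYERS):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [input_size]
    for _name, kernel, stride in layers:
        sizes.append(output_size(sizes[-1], kernel, stride))
    return sizes


print(trace_spatial_sizes(126))
```

Starting from an input of side 126, the trace ends at a 1 x 1 x 1 grid, so the final activation can be read directly as a feature vector.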
4 Experiments
4.1 ModelNet40 dataset
In our experiments we used the well-known data set of 3D objects ModelNet40. It is a subset of 40 classes of the larger data set called ModelNet [22], which contains 3D CAD models in OFF format.
The total size of ModelNet40 data set . The data set is split into training and test subsets, their sizes are and correspondingly. The data set is not balanced. Number of samples per class vary: from 64 to 889, see Figure 1.
4.2 Implementation details
To demonstrate the impact that triplet-based training has on the performance of CNN descriptors, we use the deep network architecture shown in Table 1. This network was implemented in PySparseConvNet, which is our modification of the SparseConvNet library [7]. Besides providing new loss functions, PySparseConvNet can be accessed from Python for more interactive use.
When forming a triplet for training, we choose a positive pair of objects uniformly at random from one class and select a negative sample uniformly at random from one of the other classes.
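The sampling procedure can be sketched as follows. This is one possible reading, stated as an assumption: the positive class is drawn uniformly among classes with at least two samples, and the negative class uniformly among the remaining non-empty classes. The function name and the toy data are hypothetical.

```python
import random


def sample_triplet(samples_by_class, rng=random):
    """Draw (anchor, positive, negative) from a dict: class -> list of samples."""
    classes = sorted(samples_by_class)
    # A positive pair needs a class with at least two samples.
    eligible = [c for c in classes if len(samples_by_class[c]) >= 2]
    pos_class = rng.choice(eligible)
    anchor, positive = rng.sample(samples_by_class[pos_class], 2)
    # The negative comes from any other non-empty class.
    others = [c for c in classes if c != pos_class and samples_by_class[c]]
    neg_class = rng.choice(others)
    negative = rng.choice(samples_by_class[neg_class])
    return anchor, positive, negative


# Toy data (illustrative only).
toy = {"chair": ["c1", "c2", "c3"], "table": ["t1", "t2"], "bowl": ["b1"]}
print(sample_triplet(toy, random.Random(0)))
```

Because classes are unbalanced, drawing the class first and the samples second gives rare classes more weight than drawing pairs uniformly over the whole data set would.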
For the optimization we use the SGD [2], and the training is done

in batches of size from to depending on a GPU video memory,

with a learning rate of ,

and a momentum equal to .
Training can take up to a week on a server with a high-end GPU, such as an NVIDIA Titan X or GTX 980 Ti.
We train the Sparse 3D Convolutional Neural Network (S3DCNN) on the 3D shape classification dataset by splitting it into training and validation subsets and adding data augmentation to achieve rotational and translational invariance. After training a model on a dataset of pairs, we use it to embed voxel representations of 3D meshes into a 192-dimensional space. Retrieval consists of ranking search objects by the cosine distance of their vectors from a query vector.
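The ranking step can be sketched in plain Python as below. The embeddings here are tiny hand-written vectors used only for illustration (an assumption), and the function names are hypothetical; in practice the vectors would be the network's output features.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_cosine_distance(query, gallery):
    """Return gallery ids sorted from nearest to farthest from the query.

    gallery maps an object id to its embedding vector.
    """
    return sorted(gallery, key=lambda key: cosine_distance(query, gallery[key]))


# Toy embeddings (illustrative only).
gallery = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
print(rank_by_cosine_distance([1.0, 0.0], gallery))
```

Since cosine distance ignores vector length, only the direction of an embedding matters for the ranking.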
The most popular metrics for evaluating retrieval performance are

Precision-Recall curve, which shows the trade-off between these two measures and how quickly precision drops as recall increases,

Mean average precision (mAP). Given a query, its average precision is the average of all precision values computed on all relevant objects in the retrieved list. Given several queries, the mean average precision (mAP) is the mean of average precisions for these queries.
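The definition of mAP above translates directly into code. In this minimal sketch each query is represented by a list of booleans marking which retrieved objects are relevant, in rank order; it assumes every relevant object appears somewhere in the ranked list, which holds when the whole gallery is ranked. The function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average of the precision values at the rank of each relevant object."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / float(rank)
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precisions over several queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries (illustrative only).
queries = [[True, False, True], [False, True]]
print(mean_average_precision(queries))
```

For the first toy query the precisions at the two relevant ranks are 1/1 and 2/3, giving an average precision of 5/6; the second query gives 1/2.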
We evaluated mAP for different voxel rendering sizes of 3D shapes both at train and test times, see also Figure 2.
To check if our model is comparable with other architectures, we consider a classification task. So, we trained our model for the classification task using the ModelNet40 train subset with

SoftMax last layer for epochs,

with exponentially discounting learning rate,

and performed retrieval evaluation on the test subset,

taking images from every class, and ranking them w.r.t their norm by activations taken from the th layer.
Results of these experiments are provided in Table 2. We can see that in the classification setup our model is comparable in terms of classification accuracy, but its mAP values are worse. With metric learning, however, the performance of S3DCNN on the mAP metric is much better. The superior retrieval performance of MVCNN is not surprising, since MVCNN uses neural nets pre-trained on ImageNet. Our model, on the other hand, requires only a 3D shape dataset to learn.
In Figure 4 we provide the dependence of mAP on the input spatial resolution. We can see that the retrieval performance improves with increase in the input spatial resolution up to around , after that it drops slightly and goes to plateau. It can be attributed to the insufficient amount of layers for the same scale of features, that can be separated in higher layers. Light blue color shows range of mAP on validation for top trained architectures.
We would like to note that the mAP values in Figure 4 are given for different validation epochs, and the variability of the best model can be explained by differences in total training time.
5 Results
We found that the retrieval performance improves with increase in the input spatial resolution. However, such an effect is difficult to check experimentally and to use in practice, as e.g. for usual 3D dense CNNs the computational time is prohibitively large. In our case, data sparsity helps us to process data in reasonable time even with input resolution up to voxels, therefore we can benefit from the increase of the input spatial resolution when performing retrieval. In Figure 4 we can see that our method is comparable to [22] in low recall, and better at higher recall values, that indicates better scalability of our method. In Table 2 for the retrieval we used features from the one before last layer of the network of size 192, which in comparison to 4000 in 3DShapeNet model [22] is 20 times smaller but achieves almost the same retrieval metrics.
We evaluated our network architecture described in Table 1 on popular state-of-the-art Deep Learning frameworks, namely Tensorflow [1] on GPU and Theano [19] on CPU. Using Keras [5] 2.0.2 with the Tensorflow [1] 1.2.1 backend on an Nvidia Titan X GPU with 12 GB of GPU memory, we exhausted all of the memory with a batch size of 12, and forward passes took on average 0.0301 seconds/sample, which is comparable to the processing speed of our implementation with a render size of about 60-70. The other setup, an implementation of our network architecture in Keras with the Theano backend on an Intel i7-5820K 6-core CPU, took 1.53 seconds/sample, which is significantly slower.
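Table 2: Classification and retrieval results on ModelNet40.

| method | Classification | Retrieval AUC | Retrieval mAP |
|---|---|---|---|
| 3DShapeNet [22] | 77.32% | 49.94% | 49.23% |
| MVCNN [18] | 90.10% | - | 80.20% |
| VoxNet [16] | 83.00% | - | - |
| VRN [3] | 91.33% | - | - |
| S3DCNN (proposed) | 90.30% | 36.05% | 33.67% |
| S3DCNN + triplet (proposed) | - | 48.81% | 46.71% |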
Acknowledgments
We are very grateful to Dmitry Yarotsky for his contribution to this research project. Many thanks to Benjamin Graham for useful comments and ideas. Thanks to Rasim Akhunzyanov for his help in debugging the PySparseConvNet code.
The research was partially supported by the Russian Science Foundation grant (project 14-50-00150).
Footnotes
 email: e.burnaev@skoltech.ru
 https://github.com/btgraham/SparseConvNet
 https://github.com/gangiman/PySparseConvNet
 Last column “sparsity” is computed for render size = and averaged for all samples
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, and I. G. et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] L. Bottou. Stochastic gradient tricks. Neural Networks, Tricks of the Trade, Reloaded, pages 430–445, 2012.
[3] A. Brock, T. Lim, J. Ritchie, and N. Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
[4] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, and M. Ovsjanikov. Shape google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1, 2011.
[5] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[7] B. Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.
[8] V. Hegde and R. Zadeh. Fusionnet: 3d object classification using multiple data representations. arXiv preprint arXiv:1607.05695, 2016.
[9] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[10] E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. arXiv preprint arXiv:1605.08359, 2016.
[11] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Symposium on geometry processing, volume 6, pages 156–164, 2003.
[12] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3d surf for robust three dimensional classification. In European Conference on Computer Vision, pages 589–602. Springer, 2010.
[13] I. Kokkinos, M. M. Bronstein, R. Litman, and A. M. Bronstein. Intrinsic shape context descriptors for deformable shapes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 159–166. IEEE, 2012.
[14] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[15] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013.
[16] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
[17] N. Sedaghat, M. Zolfaghari, and T. Brox. Orientation-boosted voxel nets for 3d object recognition. arXiv preprint arXiv:1604.03351, 2016.
[18] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, 2015.
[19] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[20] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
[21] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
[22] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.