3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
Abstract
Convolutional networks are the defacto standard for analyzing spatiotemporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGBD camera. Standard “dense” implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatiallysparse data more efficiently, and use them to develop spatiallysparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior stateoftheart on the test set of a recent semantic segmentation competition.
1 Introduction
Convolutional networks (ConvNets) constitute the stateofthe art method for a wide range of tasks that involve the analysis of data with spatial and/or temporal structure, such as photos, videos, or 3D surface models. While such data frequently comprises a densely populated (2D or 3D) grid, other datasets are naturally sparse. For instance, handwriting is made up of onedimensional lines in twodimensional space, pictures made by RGBD cameras are threedimensional point clouds, and polygonal mesh models form twodimensional surfaces in 3D space.
The curse of dimensionality applies, in particular, to data that lives on grids that have three or more dimensions: the number of points on the grid grows exponentially with its dimensionality. In such scenarios, it becomes increasingly important to exploit data sparsity whenever possible in order to reduce the computational resources needed for data processing. Indeed, exploiting sparsity is paramount when analyzing, e.g., RGBD videos which are sparsely populated 4D structures.
Traditional convolutional network implementations are optimized for data that lives on densely populated grids, and cannot process sparse data efficiently. More recently, a number of convolutional network implementations have been presented that are tailored to work efficiently on sparse data [4, 18, 3]. Mathematically, some of these implementations are identical to regular convolutional networks, but they require fewer computational resources in terms of FLOPs and/or memory [4, 3]. Prior work uses a sparse version of the im2col operation that restricts computation and storage to “active” sites [4], or uses the voting algorithm from [22] to prune unnecessary multiplications by zeros [3]. OctNets [18] modify the convolution operator to produce “averaged” hidden states in parts of the grid that are outside the region of interest.
One of the downsides of prior sparse implementations of convolutional networks is that they “dilate” the sparse data in every layer by applying “full” convolutions. In this work, we show that it is possible to create convolutional networks that
keep the same level of sparsity throughout the network.
To this end, we develop a new implementation for performing sparse convolutions (SCs) and introduce a novel convolution operator termed submanifold sparse convolution (SSC).
In Table 1, we present the performance of SSCNs on the test set of a recent partbased segmentation competition [23] and compare it to some of the topperforming entries in the competition: SSCNs outperform all of these entries. Source code for our library is publicly available online
Method  Average IoU 
NN matching with Chamfer distance  77.57% 
Synchronized Spectral CNN [11]  84.74% 
PdNetwork (extension of KdNetwork [10])  85.49% 
Densely Connected PointNet (extension of [17])  84.32% 
PointCNN  82.29% 
Submanifold SparseConvNet (Section 6.5)  85.98% 
2 Related Work
Our work primarily builds upon previous literature on sparse convolutional networks [3, 4], and image segmentation using dense convolutional networks [14, 19, 24]. Examples of applications of dense 3D convolutions on volumetric data include classification [15] and segmentation [2]; these methods suffer from high memory usage and slow inference, limiting the size of models that can be used.
Methods for processing 3D point clouds without voxelization have also been developed [10, 17]. This may seem surprising given the dominance of ConvNets for processing 2D inputs; it is likely due to the computational obstacles involved in using dense 3D convolutional networks.
Prior work on sparse convolutions implements a convolutional operator that increases the number of active sites with each layer [3, 4]. In [4], all sites that have at least one “active” input site are considered as active. In [3], a greater degree of sparsity is attained after the convolution has been calculated by using ReLUs and a special loss function. In contrast, we introduce submanifold sparse convolutions that fix the location of active sites so that the sparsity remains unchanged for many layers. We show that this makes it practical to train deep and efficient networks similar to VGG networks [20] or ResNets [7], and that it is well suited for the task of pointwise semantic segmentation.
OctNets [18] are an alternative form of sparse convolution. Sparse voxels are stored in octtrees: a data structure in which the grid cube is progressively subdivided into smaller subcubes until the subcubes are either empty or contain a single active site. OctNet operates on the surfaces of empty regions, so a size OctNet convolution on an empty cube of size requires 23% of the calculation of a dense 3D convolution. Conversely, submanifold convolutions require no calculations in empty regions.
Another approach to segmenting point clouds is to avoid voxelizing the input, which may lead to a loss of information due to the finite resolution. This can be done by either using carefully selected data structures such as Kdtrees [10], or by directly operating on the unordered set of points [17]. KdNetworks [10] build a Kdtree by recursively partitioning the space along the axis of largest variation until each leaf of the tree represents one input point. This takes O time for input points. PointNet [17] uses a pooling operation to produce a global feature vector.
Fully convolutional networks (FCNs) were proposed in [14] as a method of 2D image segmentation; FCNs make use of information at multiple scales to preserve lowlevel information to accurately delineate object boundaries. UNets [19] extend FCNs by using convolutions to more accurately merge together the information from the different scales before the final classification stage; see Figure 4.
3 Spatial sparsity for ConvNets
We define a dimensional convolutional network as a network that
takes as input a dimensional tensor: the input tensor contains spatiotemporal
dimensions (such as length, width, height, time, etc.) and one additional featurespace
dimension (e.g., RGB color channels or surface normal vectors).
The input corresponds to a dimensional grid of sites, each of which is
associated with a feature vector. We define a site in the input to be active if any element in the feature vector is not in its ground state, e.g., if it is nonzero
Similarly, the hidden layers of a dimensional convolutional network are represented by dimensional grids of featurespace vectors. When propagating the input data through the network, a site in a hidden layer is active if any of the sites in the layer that it takes as input is active. (Note that when using size3 convolutions, each site is connected to sites in the hidden layer below.) Activity in a hidden layer thus follows an inductive definition in which each layer determines the set of active states in the next layer. In each hidden layer, inactive sites all have the same feature vector: the one corresponding to the ground state. The value of the ground state only needs to be calculated once per forward pass at training time, and only once for all forward passes at test time. This allows for substantial savings in computational and memory requirements; the exact savings depend on data sparsity and network depth.
However, we argue that the framework described above is unduly restrictive, in particular, because the convolution operation has not been modified to accommodate the sparsity of the input data. If the input data contains a single active site, then after applying a convolution, there will be active sites. Applying a second convolution of the same size will yield active sites, and so on. This rapid growth of the number of active sites is a poor prospect when implementing modern convolutional network architectures that comprise tens or even hundreds of convolutional layers, such as VGG networks, ResNets, or DenseNets [8, 9, 20].
Of course, convolutional networks are not often applied to inputs that only have a single active site, but the aforementioned dilation problems are equally problematic when the input data comprises onedimensional curves in spaces with two or more dimensions, or twodimensional surfaces in three or more dimensions. We refer to this problem as the “submanifold dilation problem”, which is illustrated in Figure 2. The figure illustrates that even when we apply small convolutions on this grid, the sparsity of the grid rapidly disappears.
4 Submanifold Convolutional Networks
We explore a simple solution to the submanifold dilation problem that restricts the output of the convolution only to the set of
active input points. A potential problem of this approach is that hidden layers in the network may not receive all the information they require to classify the input data: in particular, two neighboring connected components are treated
completely independently. We resolve this problem by using convolutional networks that incorporate
some form of pooling, or use strided convolutions. Such operations are essential in the sparse convolutional networks
4.1 Sparse Convolutional Operations
We define a sparse convolution SC( with input feature planes, output feature planes, a filter size of , and stride . An SC convolution computes the set of active sites in the same way as a regular convolution: it looks for the presence of any active sites in its receptive field of size . If the input has size then the output will have size . Unlike a regular convolution (and the sparse convolution from [4]), an SC convolution discards the ground state for nonactive sites by assuming that the input from those sites is zero. This seemingly small change to the convolution brings computational benefits in practice.
Submanifold sparse convolution.
Next, we define a second type of sparse convolution, which forms the main contribution of this paper. Let denote an odd number. We define a submanifold sparse convolution SSC as a modified SC convolution. First, we pad the input with zeros on each side, so that the output will have the same size as the input. Next, we restrict an output site to be active if and only if the site at the corresponding site in the input is active (i.e., if the central site in the receptive field is active). Whenever an output site is determined to be active, its output feature vector is computed by the SSC convolution; see Figure 3 for an illustration. Table 2 shows the computational and memory requirements of a regular convolution (C) and our SC and SSC convolutions.
Submanifold sparse convolutions are similar to OctNets [18] in that they preserve the sparsity structure. However, unlike OctNets, empty space imposes no computational or memory overhead in the implementation of submanifold sparse convolutions.
Other operators.
To construct convolutional networks using SC and SSC, we also need activation functions, batch normalization, and pooling. Activation functions are defined as usual, but are restricted to the set of active sites. Similarly, we define batch normalization in terms of regular batch normalization applied over the set of active sites. Maxpooling MP and averagepooling AP operations are defined as a variant of SC. MP takes the maximum of the zero vector and the input feature vectors in the receptive field. AP calculates times the sum of the active input vectors. We also define a deconvolution [25] operation DC as an inverse of the SC convolution. The set of active output sites from a DC convolution is exactly the same as the set of input active sites to the corresponding SC convolution: the connections between input and output sites are simply inverted.
Active  Type  C  SC  SSC 

Yes  FLOPs  
Memory  
No,  FLOPs  0  
Memory  0  
No,  FLOPs  0  0  
Memory  0  0 
4.2 Implementation
To implement (S)SC convolutions efficiently, we store the
state of a input/hidden layer in two parts: a hash table

Iterate once through the input hashtable. We build the output hash table and rule book onthefly by iterating over points in the input layers, and all the points in the output layer that can see them. When an output site is visited for the first time, a new entry is created in the output hash table. For each active input point located at point in the receptive field of an output point , add a row (inputhash, outputhash) to element of the rule book.

Initialize the output matrix to all zeros. For each , there is a parameter matrix with size . For each row in , multiply the th row of the input feature matrix by and add it to the th row of the the output feature matrix. This can be implemented very efficiently on GPUs because it is a matrixmatrix multiplyadd operation.
To implement an SSC convolution, we reuse the input hash table for the output, and construct an appropriate rule book. Note that because the sparsity pattern does not change, the same rule book can be reused in VGG/ResNet/DenseNet networks until a pooling or subsampling layer is encountered in the architecture.
If there are active points in the input layer, the cost of building the input hashtable is . For FCN and UNet networks, assuming the number of active sites reduces by a multiplicative factor with each downsampling operation, the cost of building all the hashtables and rulebooks is also regardless of the depth of the network.
The above implementation differs from [4] in that the cost of calculating an output site is proportional to the number of active inputs, rather than the size of the receptive field. For SC convolutions this is similar to the voting algorithm [22, 3]  the filter weights are never multiplied with inactive input locations  but for SSC convolutions, this implementation is less computationally intensive than voting as there is no interaction between active input locations and inactive neighboring output locations.
5 Submanifold FCNs and UNets for Semantic Segmentation
Threedimensional semantic segmentation involves the segmentation of 3D objects or scenes represented as point clouds into their constituent parts; each point in the input cloud must be assigned a part label. As substantial progress has been made in the segmentation of 2D images using convolutional neural networks [14, 19, 24], interest in the problem of 3D semantic segmentation has grown recently. This interest was fueled, in particular, by the release of a new dataset for the partbased segmentation of 3D objects, and an associated competition [23].
We use a sparse voxelized input representation similar to [4, 3], and a combination of SSC convolutions and strided SC convolutions to construct versions of the popular FCN [14] and UNet [2] architectures. The resulting convolutional network architectures are illustrated in Figure 4; see the associated caption for details. We refer to these networks as submanifold sparse convolutional networks (SSCNs), because they process lowdimensional data living in a space of higher dimensionality.
The basic building block for our models are “preactivated” SSC convolutions. Each convolution is preceded by batch normalization and a ReLU nonlinearity. In addition to FCN and UNets with standard convolutional layers, we also experiment with variants of these networks that use preactivated residual blocks [8] that contain two SSC convolutions. Herein, the residual connections are identity functions: the number of input and output features are equal. Whenever the networks reduce the spatial scale by a factor of two, we use SC convolutions. Our implementation of FCNs upsamples feature maps to their original resolution rather than performing deconvolutions using residual blocks. This substantially reduces the number of parameters and the number of multiplicationaddition operations of the FCN.
6 Experiments
In this section, we perform experiments with SSCNs on the ShapeNet competition dataset from [23]. We compare SSCNs against three strong baseline models in terms of performance and computational cost. Specifically, we consider shape contexts [1], dense 3D convolutional networks, and 2D multiview convolutional networks [21] as baselines. Throughout our experimental evaluation, we focus on the tradeoff between segmentation accuracy and computational efficiency measured in FLOPs
6.1 Dataset
The ShapeNet segmentation dataset [23] comprises 16 different object categories (plane, chair, hat, etc.), each of which is composed of up to 6 different parts. For instance, a “plane” is segmented into “wings”, “engine”, “body”, and “tail”. Across all object categories, the dataset contains a total of 50 different object part classes. Each object is represented as a 3D point cloud that was obtained by sampling points uniformly from the surface of the underlying CAD model. Each point cloud contains between and points. To increase the size of the validation set, we resplit the training and validation sets using the first bit of the MD5 hash of the point cloud files to obtain a training set with 6,955 examples and a validation set with 7,052 examples. The test set contains 2,874 examples.
In the original dataset, the objects are axisaligned: for instance, rockets always point along the axis. To make the problem more challenging, we perform a random 3D translation and rotation on each point cloud before classifying it. The results in Table 3 show that removing the alignment, indeed, makes the segmentation task more challenging.
View type  IoU accuracy 

Aligned  63.5% 
Random pose  47.8% 
To evaluate the accuracy of our models, we adopt the intersectionoverunion (IoU) metric of [23]. The IoU is computed for each part per object category and averaged over parts and examples for the category to produce a “percategory IoU”. This way of averaging the IoU scores rewards models that make accurate predictions even for objectparts that are very small: small parts have the same weight in the final accuracy measure as larger parts. The final accuracy measure is obtained by taking a weighted average of the percategory IoUs, using the fraction of training examples per category as weights.
6.2 Details of Experimental Setup
In all experiments, the same data preprocessing procedure is used. Specifically, each point cloud is centered and rescaled to fit into a sphere with diameter ; scale determines the size of the voxelized representation. We use in our experiments. At scale , the voxels are approximately 99% sparse. In experiments with dense convolutional networks, we place the sphere randomly in a grid of size . For submanifold sparse convolutional networks, we place the sphere randomly in a grid of size . To voxelize the point cloud, we measure the number of points per voxel and normalize them so that nonempty voxels have a mean density of one.
Networks are trained using the same optimization hyperparameters, unless otherwise noted. We use stochastic gradient descent (SGD) with a momentum of 0.9, Nesterov updates, and weight decay of . The initial learning rate is set to , and the learning rate is decayed by a factor of after every epoch. We train all networks for epochs using a batch size of . We train a single network on all object categories jointly using a multiclass negative loglikelihood loss function over all part labels.
For our SSCNs, we experiment with two types of network architectures. The first architecture (C3) operates on a single spatial resolution by stacking SSC convolutions; we use with , , , or filters per layer, and , , or layers. The second architecture type comprises FCNs and UNets with three layers of downsampling. These networks have , , , or filters in the first layer, and double the number of filters each time the data is downsampled. For the convolutional blocks in these networks, we use stacks of , , or SSC convolutions, or stacks of , , or residual blocks.
Details on testing.
At test time, we only compute softmax probabilities for part labels that actually appear in the object that is being segmented, i.e., we assume the models know the category of the object they are segmenting. Softmax probabilities for irrelevant part classes are set to zero (and the distribution over part labels is renormalized).
For each of the three network types (C3, FCN, and UNet), we train a range of models with varying sizes, as described above, and monitor their accuracy on the validation set. For each network type, we select the networks that correspond to local maxima on the accuracy vs. FLOPs curve, and report test set accuracies for those networks. Akin to multicrop testing that is commonly used with 2D convolutional networks, we measure test accuracies using “multiview” testing. In particular, we generate different views of the object with by randomly rotating them, and average the model predictions for each point over the different views of the object.
6.3 Baselines
In addition to SSCNs, we consider three baseline models in our experiments: (1) shape contexts [1]; (2) dense 3D convolutional networks; and (3) 2D multiview convolutional networks [21]. Details of the four baseline models are described separately below.
Shape contexts.
Inspired by [1], we define a voxelized shape context vector. Specifically, we define a ShapeContext layer as a special case of the SSC submanifold convolution operator: we set the weight matrix of the operator to be a identity matrix so that it accumulates the voxel intensities in its neighborhood.
We scale the data using average pooling with sizes 2, 4, 8 and 16 to create four additional views. Combined this gives each voxel a 135dimensional feature vector. The feature vector is then fed into a nonconvolutional multilayer perceptron (MLP) with two hidden layers, followed by a 50class softmax classifier. The MLPs have 32, 64, 128, 256, or 512 units per layer. At test time, we use multiview testing with .
Dense 3D convolutional networks.
For dense 3D convolutional networks, we simply considered dense versions of the SSCN networks. Due to computational constraints, we restricted the FCN and UNet convolutional blocks to a single C3layer. We trained some of the models with a reduced learning rate due to numerical instabilities we observed during training. Again, we use multiview testing.
Convolutional networks on multiview 2D projections.
This baseline model discards the inherent 3D structure of the data by projecting the point cloud into a twodimensional view by assuming infinite focal length, applying a 2D convolutional network on this projection, and averaging the predictions over multiple views. An immediate advantage of this approach is that wellstudied models from 2D vision can be used outofthebox without further adaptations. Moreover, the computational cost scales with the surface area, rather than the volume of the point cloud.
In our implementation of this approach, we first convert the point clouds into a 3D grid of size as we did for the previous baseline. We then project to a plane of size , i.e. a face of the cube, with two feature channels. One feature channel is the first visible, nonzero voxel along the corresponding column. The second channel is the distance to the visible voxel, normalized to the range ; this is analogous to the depth channel of an RGBD image. Our network architectures are twodimensional versions of the dense 3D convolutional networks described above.
During training, a random projection of the point cloud is passed into the model. Points in the point cloud that fall into the same voxel are given the same prediction. Some voxels are occluded by others—the network receives no information on the occluded voxels. We modify the multiview testing procedure to take into account the occlusion of voxels. Similar to before, predictions are performed using a weighted sum over random projections. We found that 2D networks require more views to obtain high accuracy and therefore use . Voxels that are observed in the 2D projection are given a weight of . The weight of occluded voxels decays exponentially with the distance to the voxel that occludes them.
6.4 Results
In Figure 5, we report the average IoU of a range of differently sized variants of (1) the three baseline models and (2) submanifold C3, FCN and UNets on the ShapeNet test set. The average IoU is shown as a function of the number of multiplicationaddition operations (FLOPs) required by the models for computing the predictions. Note that these results are not directly comparable with those in [23] because we are testing the models in the more challenging “randompose” setting.
SSCNs vs. baselines.
Figure 5(a) compares SSCNs with the three baselines.
Ablation.
In Figure 5(b), we compare the three SSCN architectures presented in Section 6.2. We observe that SSCNs involving downsampling and upsampling operations (FCNs and UNets) outperform SSCNs operating on a single spatial resolution and we conjecture that this is due to the increased receptive field obtained by downsampling.
Figure 5(c) shows the performance of SSCNs at three different scales (using all three architectures; C3, FCN, and UNet). We observe that the performance of SSCNs is similar for different values of , particularly for a low number of FLOPs. At a higher number of FLOPs, the models operating at a larger scale perform slightly better.
6.5 Results on Competition Data
To compare SSCNs with the entries to the competition in [23], we also trained a FCN SSCN on the aligned point clouds. In this experiment, we performed data augmentation using random affine transforms. We set and use filters in the input layer, three levels of downsampling, and two residual blocks per spatial resolution. The results of 10view testing are compared with the competition entries in Table 1. With a test error of 85.98%, our network outperforms other methods by IoU.
6.6 Semantic Segmentation of Scenes
We also performed experiments on the NYU Depth dataset (v2) [16] for semantic segmentation of scenes rather than objects. The dataset contains RGBD images, which are semantically segmented into 894 different classes. Figure 6 shows two pairs of an RGB image and the associated depth map from the dataset. Following [6, 14], we crop the images and reduce the number of classes to . To assess the performance of our models, we measure their pixelwise classification accuracy. We compare our models to a 2D FCN [14].
We perform experiments with two differently sized SSCNFCN networks. Network A has 16 filters in the input layer, and one SSC convolution per level. Network B has 24 filters in the input layer, and two SSC convolutions per level. Both networks use eight levels of downsampling; we increase the number of filters in the networks when downsampling, adding 16 (A) or 24 (B) features with each reduction of scale.
We use the depth information to convert the RGBD images into a 3D point cloud. Each point in the cloud has the three (RGB) features that were normalized to the range , and a fourth indicator features that is set to for each point in the point cloud. During training, we perform data augmentation by applying random affine transformations to the point cloud. Before voxelizing the point cloud, we downscale by a factor of two, and place the points into the model’s receptive field. We form voxels by averaging the feature vectors of the points corresponding to the voxel. At test time, we perform multiview testing with .
The results of our experiments on the NYU Depth dataset (v2) are presented in Table 4. The results in the table show that SSCNs outperform 2D FCN in terms of pixel accuracy by up to , whilst also substantially reducing the computational costs of the model.
To verify that SSCNFCN A actually uses depth information, we repeat the experiment whilst setting all the depth values to zero; this prevents the SSCN from exploiting depth information. We observe: (1) a reduction of FLOPs by 60%, as there are fewer active voxels; and (2) a drop in accuracy from 64.1% to 50.8%, which demonstrates that SSCNs do use 3D structure when performing segmentation.
Network  Accuracy  FLOPs  Memory  

2D FCN [14]  1  61.5%  28.50G  135.7M 
SSCNFCN A  1  64.1%  1.09G  5.2M 
4  66.9%  4.36G  20.7M  
SSCNFCN B  1  66.4%  4.50G  11.6M 
4  68.5%  17.90G  46.4M 
7 Conclusions
In this paper, we presented submanifold sparse convolutional networks (SSCNs) for the efficient processing of highdimensional, sparse input data. We demonstrated the efficacy of SSCNs in a series of experiments on semantic segmentation of threedimensional point clouds. Specifically, our SSCN networks outperform a range of stateoftheart approaches for this problem, both when identifying parts within an object and when recognizing objects in a larger scene. Moreover, SSCNs are computationally efficient compared to alternative approaches.
Footnotes
 footnotemark:
 These operators appeared earlier in an unpublished technical report [5].
 https://github.com/facebookresearch/SparseConvNet
 Note that the ground state does not necessarily have to be zero, in particular, when convolutions with a bias term are used.
 By “sparse convolutional networks”, we mean networks designed to operate on spatiallysparse input data. We do not mean networks that have sparse parameter matrices [12, 13].
 https://github.com/sparsehash/sparsehash
 We note that this is a slight abuse of the term “submanifold”. We emphasize that the data on which these networks are applied may contain multiple connected components, and even a mixture of 1D and 2D objects embedded in 3D space.
 We ignore the FLOPs from the final classification layer.
 The number of FLOPs reported for shape contexts may be slightly misleading: the computational costs of calculating shape context features is not reflected in the number of FLOPs, as it involves integer arithmetic.
References
 S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
 Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3D UNet: Learning Dense Volumetric Segmentation from Sparse Annotation. International Conference on Medical Image Computing and ComputerAssisted Intervention, 2016.
 M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds using Efficient Convolutional Neural Networks. IEEE International Conference on Robotics and Automation, 2017.
 B. Graham. Sparse 3D Convolutional Neural Networks. British Machine Vision Conference, 2015.
 B. Graham and L. van der Maaten. Submanifold Sparse Convolutional Networks. 2017. https://arxiv.org/abs/1706.01307.
 S. Gupta, P. Arbelaez, and J. Malik. Perceptual Organization and Recognition of Indoor Scenes from RGBD Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. European Conference on Computer Vision, 2016.
 G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 R. Klokov and V. Lempitsky. Escape from Cells: Deep KdNetworks for The Recognition of 3D Point Cloud Models. arXiv preprint arXiv:1704.01222, 2017.
 X. G. L. Yi, H. Su and L. Guibas. SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation. arXiv preprint arXiv:1612.00606, 2016.
 Y. LeCun, J. Denker, and S. Solla. Optimal Brain Damage. In Advances in Neural Information Processing Systems, 1990.
 B. Liu, M. Wang, H. Foroosh, M. Tappen, , and M. Penksy. Sparse Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 D. Maturana and S. Scherer. VoxNet: A 3D Convolutional Neural Network for RealTime Object Recognition. IEEE International Conference on Intelligent Robots and Systems, 2015.
 P. K. Nathan Silberman, Derek Hoiem and R. Fergus. Indoor Segmentation and Support Inference from RGBD Images. European Conference on Computer Vision, 2012.
 C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv preprint arXiv:1612.00593, 2016.
 G. Riegler, A. O. Ulusoys, and A. Geiger. Octnet: Learning Deep 3D Representations at High Resolutions. arXiv preprint arXiv:1611.05009, 2016.
 O. Ronneberger, P. Fischer, and T. Brox. UNet: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and ComputerAssisted Intervention, 2015.
 K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for LargeScale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
 H. Su, S. Maji, E. Kalogerakis, and E. G. LearnedMiller. MultiView Convolutional Neural Networks for 3D Shape Recognition. International Conference on Computer Vision, 2015.
 D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. Robotics: Science and Systems, 2015.
 L. Yi, H. Su, L. Shao, M. Savva, et al. LargeScale 3D Shape Reconstruction and Segmentation from ShapeNet Core55. arXiv preprint arXiv:1710.06104, 2017.
 F. Yu and V. Koltun. MultiScale Context Aggregation by Dilated Convolutions. arXiv preprint arXiv:1511.07122, 2015.
 M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.