SurfConv: Bridging 3D and 2D Convolution for RGBD Images
Abstract
We tackle the problem of using 3D information in convolutional neural networks for downstream recognition tasks. Using depth as an additional channel alongside the RGB input retains the scale variance problem present in image convolution based approaches. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, since the data consists of only the surface visible to the sensor. Instead, we propose SurfConv, which “slides” compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance with less than 30% of the parameters used by the 3D convolution-based approaches. Code and data are available at https://github.com/chuhang/SurfConv
1 Introduction
While 3D sensors have long been popular in the robotics community, they have gained prominence in the computer vision community in recent years. This is a result of extensive interest in applications such as autonomous driving [11], augmented reality [32] and urban planning [47]. These 3D sensors come in various forms such as active LIDAR sensors, structured light sensors, stereo cameras, time-of-flight cameras, etc. These range sensors produce a 2D depth image, where the value at every pixel location corresponds to the distance traveled by a ray from the sensor through the pixel location, before it hits a visible surface in the 3D scene.
Figure 1: Image Convolution (top), 3D Convolution (middle), Surface Convolution (bottom).
The recent success of convolutional neural networks on RGB input images [24] has raised interest in using them for depth data. One common approach is to use hand-crafted representations of the depth data and treat them as additional channels alongside the RGB input [13, 9]. While this line of work has shown that additional depth input can improve performance on several tasks, it is not able to solve the scale variance problem of 2D convolutions. In the top of Fig. 1, we can see that for two cars at different distances, the receptive fields of a point have the same size. This means that models are required to learn to recognize the same object in different inputs.
To overcome this issue, an alternative is to represent the data as a 3D grid and use 3D convolution on it [49]. Such a dense representation requires huge computation and memory resources, which limits the resolution in all three dimensions. Furthermore, since a 3D sensor captures how far the objects are from the sensor in a single frame, the visible surface of the scene occludes the rest of the 3D volume. Thus, the information in the input occupies an extremely small fraction (roughly 0.35%, calculated with the standard 0.1m resolution for [11] and 0.02m resolution for [40]) of the entire volume. As a result, 3D convolution based approaches spend a large fraction of time and memory on the unoccupied empty space shown in the middle of Fig. 1.
We propose to reformulate the default 3D convolution as Surface Convolution (SurfConv) for a single frame RGBD input. Instead of “sliding” 3D filters in the voxel space, we slide compact 2D filters along the observed 3D surface. This exploits the surface nature of our input and helps the network learn scale-invariant representations (bottom of Fig. 1). A straightforward implementation of surface convolution is challenging since it requires depth-dependent rescaling at every location, which is a computational bottleneck. To address this problem, we propose a Data-Driven Depth Discretization (D4) scheme, which makes surface convolution practically feasible. We use our approach to show state-of-the-art results on the single-view 3D semantic segmentation task on the KITTI [11] and NYUv2 [40] benchmarks. To summarize, our main contributions are:

We propose Surface Convolution, which processes single frame 3D data in accordance with its surface nature.

We propose to realize Surface Convolution through a Data-Driven Depth Discretization scheme, which offers a simple yet effective solution that achieves state-of-the-art single-view 3D semantic segmentation.
2 Related Work
Deep 2D RGBD Representations.
In the last few years, 2D CNNs have been used to create powerful feature descriptors for images [24], and can learn complex patterns in the data [57]. One approach to extending the success of these 2D convolutions to range data is to project the 3D data into multiple viewpoints, each of which is treated as a 2D input [4, 34, 39, 44]. However, the computation time scales linearly with the number of views. Since a single frame RGBD image sees only the unoccluded portion of the 3D world, the visible surfaces from drastically different viewpoints might not align with that of the input camera viewpoint. Furthermore, reasoning about multiple viewpoints does not lead to a natural, interpretable 3D description of the scene into parts and their spatial relations [17].
3D Convolution.
To handle the scale variance, 3D convolution learns the correlations directly in the 3D space. To represent the point cloud information, input representations such as occupancy grids [20, 31, 38] and TSDF [5, 10, 42, 53, 55] have been explored.
One of the key challenges in 3D convolution is that increasing the input dimensions can lead to a significant increase in the memory requirements. Thus, common practices are to either limit the input to a low resolution grid, or reduce the network parameters [43]. Since range data is sparse in nature, approaches such as [8, 37, 46] have also aimed at reducing the memory consumption of the activation maps. However, these works are difficult to implement and are non-trivial to scale to a wide variety of tasks in challenging benchmarks.
Another disadvantage of voxel grids is that they build on the assumption that the scene has a Euclidean structure, and they are not invariant to transformations such as isotropy and non-rigid deformations. This limitation can be overcome by considering the points as members of an orderless set, which are used along with a global representation [33, 35, 52] of the 3D volume.
Approaches such as [23, 45] have used a CRF for post-processing the semantic segmentation prediction from a 3D ConvNet. In [36], the authors used a 3D graph neural network to iteratively improve the unary semantic segmentation predictions. Our approach can be used to provide a better unary term for these methods.
3D Surface based Descriptors.
A different approach is to reason along the surface of the 3D volumes. [21] introduced the idea of spin images, which build 3D surface based descriptors for object recognition. [19] learns a generative model to produce object structure through surface geometry. [30, 1] extended the idea of convolutions to non-Euclidean structures by learning anisotropic heat kernels that relate to surface deformity. However, such methods require point associations to learn the filters, which are difficult to obtain for range data depicting natural scenes. [22] combines the segmentation results of multiple views into a surface representation of the 3D object. This is followed by a post-processing step with a CRF which smooths the labels along the surface geometry. Such a smoothing CRF can be used to further improve the results of our approach as well.
Multi-Scale 2D Convolutions.
To get better performance, a host of approaches have used multi-scale input for tasks such as semantic segmentation [2, 27], optical flow [48], and detection [15, 18]. Other approaches include adaptively learning pixel-wise scale [6, 54], and upscaling feature activations to combine multiple scales for the final prediction [28, 56]. The key difference between such approaches and ours is that their scales are arbitrary and do not utilize the 3D geometry of the scene.
3 Method
An image from a single frame RGBD camera captures only the visible surface of the 3D space. Instead of wasting memory on the entire 3D volume, we introduce SurfConv, which concentrates the computation only along the visible surface. In Sec. 3.1 we derive SurfConv, which approximates the 3D convolution operation with a depth-aware multi-scale 2D convolution. We justify the approximation assumption and its implications in Sec. 3.2. In Sec. 3.3, we describe the D4 scheme that determines the scales in a systematic fashion.
3.1 Surface Convolution
Notation. We denote a point detected by the sensor as $\mathbf{p}$. Three scalars $(p_x, p_y, p_z)$ represent its position in 3D. Following the classic camera model, we set the sensor position at the origin, and the principal axis as the positive $z$ direction. The distance from the image plane to the camera center is the camera’s focal length. For simplicity, we rescale the coordinates such that the focal length is equal to 1. We can then compute the image plane coordinates $(p_u, p_v)$ of point $\mathbf{p}$ using standard perspective projection as:
(1)  $p_u = p_x / p_z, \quad p_v = p_y / p_z$
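The projection above can be sketched in a few lines of Python. This is a minimal illustration that assumes coordinates already rescaled so that the focal length equals 1; the function name `project` is hypothetical and is not taken from the released code.

```python
def project(point):
    """Perspective projection with focal length 1.

    `point` is an (x, y, z) tuple in camera coordinates. The helper name
    is an illustrative assumption, not part of the paper's code.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (x / z, y / z)


# The same lateral offset shrinks on the image plane as depth grows.
near = project((1.0, 0.5, 2.0))
far = project((1.0, 0.5, 8.0))
```

Moving the point four times farther away shrinks both image coordinates by a factor of four, which is the scale change that a fixed-size image filter has to absorb.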
We denote the information (e.g. color intensity values) of point $\mathbf{p}$ as $I_{\mathbf{p}}$, and its semantic class label as $L_{\mathbf{p}}$. At a high level, semantic segmentation can be formulated as
(2)  $L_{\mathbf{p}} = \mathcal{F}(\mathcal{R}_{\mathbf{p}})$
where $\mathcal{F}$ is a function of choice (e.g. a convolutional neural network), and $\mathcal{R}_{\mathbf{p}}$ defines a local neighborhood of point $\mathbf{p}$. We refer to $\mathcal{R}_{\mathbf{p}}$ as the receptive field at point $\mathbf{p}$, as commonly used in the literature. Different types of convolution take different forms of $\mathcal{F}$ and $\mathcal{R}$. Next, we mainly focus on the receptive field $\mathcal{R}$, as it defines the local neighborhood that affects the final segmentation prediction.
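Image Convolution. In image convolution, the receptive field of point $\mathbf{p}$ is defined as $\mathcal{R}^{img}_{\mathbf{p}} = \{\mathbf{q}\}$, such that
(3)  $|q_u - p_u| \le \Delta, \quad |q_v - p_v| \le \Delta$
where $(q_u, q_v)$ and $(p_u, p_v)$ are image plane coordinates as in Eq. 1, and $\Delta$ defines the receptive field radius. We can see that $\mathcal{R}^{img}_{\mathbf{p}}$ defines a rectangle on the projected image plane. The receptive field has the same number of pixels regardless of the center point’s distance. Therefore, image convolution suffers from the scale variance problem.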
3D Convolution. To utilize the 3D information, especially in the depth dimension, 3D convolution has been introduced. In 3D convolution, the receptive field can be defined by trivially extending $\mathcal{R}^{img}$ into all three spatial dimensions, i.e. $\mathcal{R}^{3d}_{\mathbf{p}} = \{\mathbf{q}\}$ such that
(4)  $|q_x - p_x| \le \Delta, \quad |q_y - p_y| \le \Delta, \quad |q_z - p_z| \le \Delta$
This defines a 3D cuboid centered at point $\mathbf{p}$, with radius $\Delta$. In 3D convolution, the receptive field becomes independent of depth and no longer suffers from scale variance. However, for a single-frame 3D sensor, the actual 3D data is essentially a surface back-projected from the image plane. This means that in any given 3D cubic receptive field, the majority of the space is empty, which makes training difficult. To address the sparsity problem, approaches have used the Truncated Signed Distance Function (TSDF) [55] and flipped-TSDF [43], which fill the empty space, or have decreased the voxel resolution [8].
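To make the sparsity argument concrete, the following sketch counts how many voxels of a dense grid are touched by a single visible surface. The scene is synthetic (a flat wall sampled at voxel centres), and the helper name `occupied_fraction` and the grid size are illustrative assumptions, so the resulting number is not a measurement on either dataset.

```python
import math


def occupied_fraction(points, voxel_size, grid_shape):
    """Fraction of voxels in a dense grid that contain at least one point."""
    nx, ny, nz = grid_shape
    occupied = set()
    for x, y, z in points:
        idx = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        # Ignore points that fall outside the grid.
        if 0 <= idx[0] < nx and 0 <= idx[1] < ny and 0 <= idx[2] < nz:
            occupied.add(idx)
    return len(occupied) / (nx * ny * nz)


# Synthetic scene (not real sensor data): a flat wall sampled at voxel
# centres inside a 50 x 50 x 50 grid of 2 cm voxels.
size = 0.02
wall = [((i + 0.5) * size, (j + 0.5) * size, 25.5 * size)
        for i in range(50) for j in range(50)]
fraction = occupied_fraction(wall, size, (50, 50, 50))
```

A single surface fills one slice out of fifty here, and the occupied share shrinks further as the grid grows along the depth axis.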
Local Planarity Assumption. We seek a solution that directly operates along the visible surface, where the meaningful information resides. To achieve this, we first introduce the local planarity assumption. Then we show that under this assumption, we can reformulate 3D convolution as a depth-aware multi-scale 2D convolution. We name this reformulated approximation Surface Convolution (SurfConv).
The local planarity assumption is defined as: All neighbor points are approximated to have the same depth as the receptive field center. Fig. 2 illustrates the approximation assumption. Under this assumption, we have
(5)  $q_z = p_z, \quad \forall \mathbf{q} \in \mathcal{R}_{\mathbf{p}}$
Surface Convolution. Under the local planarity assumption, we can transform the 3D convolution receptive field $\mathcal{R}^{3d}_{\mathbf{p}}$ into the SurfConv receptive field $\mathcal{R}^{sc}_{\mathbf{p}}$. Combining Eq. 1 and Eq. 4, with $(p_u, p_v)$ and $(q_u, q_v)$ denoting image plane coordinates, we get
(6)  $|q_u q_z - p_u p_z| \le \Delta, \quad |q_v q_z - p_v p_z| \le \Delta, \quad |q_z - p_z| \le \Delta$
Then we apply the local planarity assumption as in Eq. 5, under which the depth constraint holds trivially, and get
(7)  $p_z |q_u - p_u| \le \Delta, \quad p_z |q_v - p_v| \le \Delta$
Dividing by the depth $p_z$, we obtain the final receptive field definition of SurfConv in image plane coordinates: $\mathcal{R}^{sc}_{\mathbf{p}} = \{\mathbf{q}\}$ such that
(8)  $|q_u - p_u| \le \Delta / p_z, \quad |q_v - p_v| \le \Delta / p_z$
where $\Delta$ defines the receptive field radius in the 3D space. In this way, the SurfConv receptive field defines a square image region, whose size is controlled by the center point’s depth. This means SurfConv is essentially a depth-aware multi-scale 2D convolution. This bridges the 3D and 2D perspectives, and avoids the disadvantages of either method. Compared to 2D convolution, SurfConv utilizes 3D data and does not suffer from scale variance. Compared to 3D convolution, SurfConv not only saves the preprocessing step of filling empty voxels, but also enables learning compact, parameter-efficient 2D convolution filters that directly target the real-world scale of the input data.
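As a small numeric illustration of a receptive field whose image size is controlled by depth, the sketch below computes the image-plane half-width that corresponds to a fixed 3D radius. The function name `surfconv_radius`, the `focal` parameter (for image coordinates that are not normalized to focal length 1) and the example numbers are all assumptions made for illustration.

```python
def surfconv_radius(radius_3d, depth, focal=1.0):
    """Image-plane half-width of a receptive field with a fixed 3D radius.

    With focal length 1 this is radius_3d / depth; `focal` (in pixels) is an
    illustrative extra parameter for un-normalized image coordinates.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal * radius_3d / depth


# Hypothetical numbers: a 0.5 m radius seen by a camera with a 500 px
# focal length, centred on a near point and on a far point.
near_px = surfconv_radius(0.5, 5.0, focal=500.0)
far_px = surfconv_radius(0.5, 20.0, focal=500.0)
```

The window covers the same metric extent in both cases; only its size in pixels changes, which is why a continuous depth would require a different rescaling at every location.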
Figure panels: original camera; SurfConv at two different depth levels.
In SurfConv, $p_z$ is a continuous variable. This means that for each point, we need to dynamically resize the receptive field based on its size determined by Eq. 8, before passing it to the recognition module that takes fixed size input. This is computationally inefficient in practice. To address this problem, we further replace the continuous depth with a set of discretized values, i.e. $p_z \in \{z_1, z_2, \dots, z_N\}$. We refer to $N$ as the level of SurfConv. With the discretized depth, we can cache $N$ levels of the image pyramid. Note that since we are interested in surface convolution, each pixel in the original RGBD image belongs to exactly one level of the pyramid. Fig. 3 shows a toy example of our discretization process.
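The assignment of each pixel to exactly one level can be sketched with a sorted list of interior depth boundaries. The helper name `assign_levels`, the half-open interval convention and the example boundaries are assumptions made for this illustration.

```python
from bisect import bisect_left


def assign_levels(depths, boundaries):
    """Return one level index per depth value.

    `boundaries` holds the sorted interior boundaries, so N levels need
    N - 1 entries. Level k covers (boundaries[k-1], boundaries[k]], which
    is a convention chosen for this sketch.
    """
    return [bisect_left(boundaries, z) for z in depths]


# Two hypothetical boundaries (in metres) define three levels.
levels = assign_levels([0.8, 2.0, 3.1, 9.4], [2.0, 5.0])
```

Because the intervals partition the depth axis, every pixel lands in one and only one level.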
3.2 Bridging 3D and 2D Convolution
In SurfConv, we discretize the $z$ dimension into $N$ levels and maintain the full resolution in the $x$ and $y$ dimensions. Thus, our surface convolution can be seen as a deformed version of general 3D convolution, where SurfConv has a coarser $z$ resolution consisting of $N$ levels, and divides the 3D space into a stretched voxel grid. The memory constraints of current GPUs limit the resolution of the input. In 3D convolution, the 3D space is discretized similarly in all three axes. This results in large grids and a lower maximum feasible resolution. In contrast, SurfConv maintains the full resolution along the axes parallel to the image projection plane ($x$ and $y$), and has a much coarser resolution for the axis perpendicular to the image plane (i.e. $z$). In an RGBD image, the information only resides along the visible surface. This motivates the lower $z$ resolution, because information is scarce along this direction. Practically, SurfConv can be simply implemented with a depth-aware multi-scale 2D convolution. Each depth level consists of a proportionally scaled version of the input, masked to contain points within its depth range. Standard 2D CNN training is applied to all levels simultaneously. Therefore, SurfConv can easily benefit from networks pretrained on a variety of large-scale 2D image datasets.
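A rough sketch of how the per-level inputs could be built is given below, using plain nested lists and nearest-neighbour resizing. It is not the released implementation: the helper names, the representative depth per level (`level_depths`), the reference depth at which the image is left unscaled (`ref_depth`) and the choice of scale factor proportional to depth are all assumptions.

```python
def resize_nearest(image, scale):
    """Nearest-neighbour resize of a 2D list by a positive scale factor."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [[image[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
             for c in range(new_w)] for r in range(new_h)]


def build_levels(image, depth, boundaries, level_depths, ref_depth):
    """Mask the image per depth level, then rescale each level.

    Level k keeps pixels with depth in (boundaries[k-1], boundaries[k]] and
    is resized by level_depths[k] / ref_depth, so that a fixed-size filter
    covers a similar metric extent on every level (assumed normalization).
    """
    levels = []
    for k, z_k in enumerate(level_depths):
        lo = boundaries[k - 1] if k > 0 else float("-inf")
        hi = boundaries[k] if k < len(boundaries) else float("inf")
        masked = [[v if lo < d <= hi else 0 for v, d in zip(row_v, row_d)]
                  for row_v, row_d in zip(image, depth)]
        levels.append(resize_nearest(masked, z_k / ref_depth))
    return levels


# Toy 4 x 4 input: the top half is near (1 m), the bottom half is far (4 m).
image = [[1] * 4 for _ in range(4)]
depth = [[1.0] * 4, [1.0] * 4, [4.0] * 4, [4.0] * 4]
levels = build_levels(image, depth, boundaries=[2.0],
                      level_depths=[1.0, 4.0], ref_depth=2.0)
```

In this toy case the near level is downsampled to 2 x 2 and the far level is upsampled to 8 x 8, and each level contains only the pixels in its own depth range. The same 2D network can then be applied to every level.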
3.3 D4: Data-Driven Depth Discretization
To obtain a set of discretized depth levels, uniform bins are sub-optimal. This is because in single-viewpoint input data, near points significantly outnumber far points due to occlusion and decreasing resolution over depth. Fig. 4 shows actual depth distributions from real indoor and outdoor data. Therefore, uniform bins result in unbalanced data allocation between levels, i.e. the first few levels have almost all points, while the last few levels are almost empty.
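The imbalance caused by uniform bins is easy to reproduce on a toy sample. The depth values below are synthetic and only mimic the "many near points, few far points" shape described above; the helper name `uniform_bin_counts` is hypothetical.

```python
def uniform_bin_counts(depths, num_levels):
    """Count how many depth values fall into each of `num_levels` equal-width bins."""
    lo, hi = min(depths), max(depths)
    width = (hi - lo) / num_levels
    counts = [0] * num_levels
    for z in depths:
        # The largest value is clamped into the last bin.
        k = num_levels - 1 if width == 0 else min(num_levels - 1, int((z - lo) / width))
        counts[k] += 1
    return counts


# Synthetic depths in metres: mostly near points and a single far point.
depths = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 9.0]
counts = uniform_bin_counts(depths, 4)
```

With four equal-width bins, the first bin receives most of the points while one bin stays empty, which is the allocation problem that motivates a data-driven choice of boundaries.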
To address this problem, we introduce the D4 scheme. Instead of dividing levels evenly, we compute level boundaries such that all levels contribute the same amount of influence to the segmentation model $\mathcal{F}$. First, we define the importance function of a point $\mathbf{p}$ as
(9)  $w(\mathbf{p}) = p_z^{\gamma}$
where we refer to $\gamma$ as the importance index. We use the importance function to assign a weight to each input point, then we find discretization levels such that all levels possess the same amount of summed importance.
Intuitively, with $\gamma = 0$, all points are equally important regardless of their depth. As a result, all levels are allocated the same number of image pixels. With $\gamma = 2$, a point’s influence is proportional to the back-projected 3D surface area it covers. As a result, all levels have an equal amount of total 3D surface area after the discretization and allocation.
Ideally, $\gamma = 2$ seems the optimal setting because it divides the visible surface area evenly among the levels. However, we argue that $\gamma$ should instead be a hyperparameter. Data quality decreases over distance for sensors, i.e. the farther an object is, the less detailed a measurement the sensor receives. A farther object occupies a smaller field of view from the sensor’s viewpoint. This means lower resolution, hence lower capture quality. Additionally, in sensors such as stereo cameras and the Microsoft Kinect, precision decreases as depth increases, making farther points inherently noisier. Therefore, in order to learn the best recognition model, there exists a tradeoff between trusting near, clear data and adapting to far, noisy data. In other words, the best index is determined by $\gamma = 2 - \lambda$, where $\lambda$ quantifies this near-far tradeoff. It is difficult to compute $\lambda$ analytically, because it depends on the actual sensor configuration and scene properties. Therefore, we tune $\lambda$, hence $\gamma$, through validation on the actual data.
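The boundary computation described in this section can be sketched as a weighted quantile search. The sketch assumes that the importance of a point is its depth raised to the power of the importance index, which matches the two special cases discussed above (index 0 balances pixel counts, index 2 balances surface area). The function name `d4_boundaries` and the toy depths are hypothetical, and this is not claimed to be the released implementation.

```python
def d4_boundaries(depths, num_levels, gamma):
    """Interior depth boundaries that balance summed importance across levels.

    Each point is weighted by depth ** gamma (assumed importance function).
    Boundary k is the depth at which the cumulative weight first reaches
    k / num_levels of the total; level k then holds depths up to and
    including that boundary.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    ordered = sorted(depths)
    weights = [z ** gamma for z in ordered]
    total = sum(weights)
    boundaries = []
    running = 0.0
    k = 1  # index of the next boundary to place
    for z, w in zip(ordered, weights):
        running += w
        while k < num_levels and running >= total * k / num_levels:
            boundaries.append(z)
            k += 1
    return boundaries


# Toy depths in metres, chosen so that both splits are exactly balanced.
toy = [2.0, 3.0, 6.0, 7.0]
by_pixels = d4_boundaries(toy, 2, gamma=0)   # two points per level
by_area = d4_boundaries(toy, 2, gamma=2)     # 4 + 9 + 36 = 49 on each side
```

Raising the index moves the boundary farther away: far points weigh more, so fewer of them are needed to fill a level.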
4 Experimental Results
We demonstrate the effectiveness of our approach by showing results on two real-world datasets (KITTI [11] and NYUv2 [40]) for the 3D semantic segmentation task.
4.1 Experimental Setup
CNN model.
We use the skip-connected fully convolutional architecture [29] with two different backbones:

ResNet-18 [16]: We modify the size of all convolutional kernels and experiment with different numbers of feature channels in each layer. We try light-weight and heavy-weight versions, in which the number of feature channels ranges from a fraction of the original number of channels, through the same number, to twice that number. The input to our network is a 6-channel RGB+HHA [13] image. This network is trained from scratch, similar to the baseline 3D convolution based approaches.
Using the light-weight models, we show that our performance is competitive with (NYUv2) or better than (KITTI) the state-of-the-art 3D convolution based approaches, even with about a quarter of their parameters. Since the memory requirement of our network is low compared to 3D convolution based approaches, we can take advantage of heavier models to further improve our performance. Moreover, our approach can take advantage of weights pretrained on existing large-scale 2D datasets. For training our networks, we follow FCN-8s and use the logarithmic loss function.
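Table 1: Results on NYUv2.
Method | 3D input | RGB | # of para. | infer. (ms) | IOU_img | Acc_img | IOU_surf | Acc_surf
Conv3D [45, 43] | ftsdf | no | 233k | 8 | 12.43 | 50.05 | 12.69 | 53.34
Conv3D [45, 43] | no | yes | 238k | 10 | 12.36 | 48.44 | 12.66 | 51.29
Conv3D [45, 43] | ftsdf | yes | 241k | 11 | 13.19 | 49.85 | 13.65 | 52.89
PointNet [33] | xyz | no | 1675k | 118 | 6.25 | 46.44 | 5.82 | 47.46
PointNet [33] | xyzG | no | 1675k | 118 | 6.54 | 46.88 | 6.16 | 47.85
PointNet [33] | xyzG | yes | 1675k | 117 | 6.87 | 47.35 | 6.47 | 48.21
DeformCNN [6] | HHA | yes | 101k | 6 | 12.82 | 55.05 | 11.67 | 54.12
SurfConv1 | HHA | yes | 65k | 5 | 12.31 | 53.74 | 11.27 | 54.24
SurfConv4 ($\gamma$=1.0) | HHA | yes | 65k | 26 | 12.01 | 52.19 | 11.98 | 55.44
SurfConv4 ($\gamma$=2.0) | HHA | yes | 65k | 24 | 13.10 | 53.48 | 12.79 | 55.99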
Baselines.
We compare our approach with Conv3D [45, 43], PointNet [33], and DeformCNN [6]. For Conv3D, we use the SSCNet architecture [43], and train it with three variations of gravity-aligned voxel input: RGB, flipped-TSDF, and both. We follow [43] and use the maximum possible voxel resolution that can fit a single-sample batch into 12GB memory, which results in a 240×144×240 voxel grid (with 2cm resolution) on NYUv2, and a 400×60×320 voxel grid (with 10cm resolution) on KITTI. The points that fall into the same voxel are given the same predicted label in inference.
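For a sense of scale, the two grid sizes above can be checked with simple arithmetic. The sketch assumes the grid dimensions read as 240×144×240 voxels at 2cm and 400×60×320 voxels at 10cm; the helper name `grid_stats` is hypothetical, and resolutions are kept in whole centimetres to avoid floating point rounding.

```python
def grid_stats(shape, resolution_cm):
    """Voxel count and metric extent (in cm) of a dense voxel grid."""
    nx, ny, nz = shape
    voxels = nx * ny * nz
    extent_cm = (nx * resolution_cm, ny * resolution_cm, nz * resolution_cm)
    return voxels, extent_cm


nyu_voxels, nyu_extent = grid_stats((240, 144, 240), 2)
kitti_voxels, kitti_extent = grid_stats((400, 60, 320), 10)
```

Both grids hold several million voxels per sample, while the KITTI grid spans tens of metres per side and the NYUv2 grid only a few metres.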
For PointNet, we directly use the published source code, and train it on three types of input: original point cloud, gravity-aligned point cloud, and RGB plus gravity alignment. We randomly sample points from the point cloud as suggested in the paper. Specifically, we set the sample number to 25K, which fills 12GB memory with batch size 8.
For DeformCNN, we replace res5 layers with deformable convolution as recommended in [6]. We try jointly training all layers of DeformCNN, as well as training with deformation offset frozen before the joint training. We report measurements of the latter for its better performance. For fair comparison, we further augment DeformCNN to use depth information by adding extra HHA channels.
SurfConv with a single level is equivalent to the FCN-8s [29] baseline. All models are trained using the original data as is, without any augmentation tricks.
Metrics.
For all experiments, we use the pixel-wise accuracy (Acc) and the intersection over union (IOU) metrics. We report these metrics at both the pixel level (IOU_img and Acc_img) and the surface level (IOU_surf and Acc_surf). For the surface-level metrics, we weigh each point by its surface area in 3D to compute the metrics. To reduce model sensitivity to initialization and random shuffling order in training, we repeat all experiments five times on an Nvidia Titan X GPU, and report the average model performance.
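The difference between the pixel-level and surface-level metrics comes down to the per-point weight. The sketch below computes weighted accuracy and mean IOU for any weight vector. Using the squared depth as the surface-area weight assumes fronto-parallel surfaces and a unit focal length, which is a simplification made here for illustration; the helper name and toy labels are hypothetical.

```python
def weighted_metrics(pred, target, weights, num_classes):
    """Weighted accuracy and mean IOU over classes that occur in pred or target."""
    triples = list(zip(pred, target, weights))
    acc = sum(w for p, t, w in triples if p == t) / sum(weights)
    ious = []
    for c in range(num_classes):
        inter = sum(w for p, t, w in triples if p == c and t == c)
        union = sum(w for p, t, w in triples if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    return acc, mean_iou


# Toy example: four points, two classes, one mistake on a near point.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
depths = [1.0, 1.0, 2.0, 2.0]

acc_img, iou_img = weighted_metrics(pred, target, [1.0] * 4, 2)
acc_surf, iou_surf = weighted_metrics(pred, target, [z ** 2 for z in depths], 2)
```

The single mistake sits on a near point, so it costs less under surface weighting than under pixel weighting: far points cover more 3D area per pixel and dominate the surface-level score.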
4.2 NYUv2
NYUv2 [40] is a semantically labeled indoor RGBD dataset captured by a Kinect camera. In this dataset, we use the standard split of 795 training images and 654 testing images. We randomly sample 20% of the rooms from the training set as the validation set. The hyperparameters are chosen based on the best mean IOU on the validation set, which we then use to evaluate all metrics on the test set. For the label space, we use the 37-class setting [13, 36]. To obtain 3D data, we use the hole-filled dense depth map provided by the dataset. Training our model over all repetitions and hyperparameters takes a total of 950 GPU hours.
The results are shown in Table 1. Compared to Conv3D, SurfConv achieves close performance on IOU and better performance on accuracy, while using less than 30% of its parameters. Compared to PointNet, SurfConv achieves a 6% improvement across all metrics, while using less than 5% of its parameters. Compared to the latest scale-adaptive architecture DeformCNN, SurfConv is more suitable for RGBD images because it uses depth information more effectively, achieving better or close performance while using fewer parameters. Using more weights (VGG16 architecture) and pretraining on ImageNet gives a large boost in performance (Fig. 5).
Figure panels: NYUv2 37-class, KITTI 11-class.
Comparing SurfConv with different levels trained from scratch in Table 1, it can be seen that the 4-level model is slightly better than or close to the 1-level model in image-wise metrics, and significantly better in surface-wise metrics. Using a pretrained network (Fig. 5), our 4-level SurfConv achieves better performance than the vanilla single-level model (FCN-8s [29] baseline), especially in the surface-wise metrics. We also explore a SurfConv variant where the training loss for each point is reweighted by its area of image-plane projection. This makes the training objective closer to the image-wise metrics. The reweighted version achieves slightly better image-wise performance, at the cost of slightly worse surface-wise performance.
4.3 KITTI
KITTI [11] provides parallel camera and LIDAR data for outdoor driving scenes. We use the semantic segmentation annotation provided in [51], which contains 70 training and 37 testing images from different scenes, with high quality pixel annotations in 11 categories. Due to the smaller dataset size and the lack of a standard validation split, we directly validate all compared methods on the held-out testing set. To obtain dense points from sparse LIDAR input, we use a simple real-time surface completion method that exhaustively joins adjacent points into mesh triangles. The densified points are used as input for all methods evaluated. The smaller size of KITTI allows us to thoroughly explore different settings of SurfConv levels, the influence index $\gamma$, as well as CNN model capacity. Our KITTI experiments take a total of 750 GPU hours.
Baseline comparisons.
Table 2 lists the comparison with baseline methods. SurfConv outperforms all compared methods in all metrics. In KITTI, the median maximum scene depth is 75.87m. This scenario is particularly difficult for Conv3D, because voxelizing the scene with sufficient resolution would result in large tensors and make training Conv3D difficult. In contrast, SurfConv can be easily trained because its compact 2D filters do not suffer from an insufficient memory budget. DeformCNN performs better than image convolution (i.e. SurfConv1) because its deformation layers adapt to object scale variance. However, multi-level SurfConv achieves a more significant improvement, demonstrating its capability of using RGBD data more effectively.
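Table 2: Results on KITTI.
Method | IOU_img | Acc_img | IOU_surf | Acc_surf
Conv3D [45, 43] | 17.53 | 64.54 | 17.38 | 62.58
PointNet [33] | 9.41 | 55.06 | 9.07 | 64.38
DeformCNN [6] | 34.24 | 79.17 | 27.51 | 73.36
SurfConv1 | 33.67 | 79.13 | 26.56 | 72.04
SurfConv-best | 35.09 | 79.37 | 30.65 | 75.97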
Model capacity.
We study the effect of CNN model capacity across different SurfConv levels. To change the model capacity, we widen the model by adding more feature channels, while keeping the same number of layers. This results in four capacities, expressed as multiples of the 65k-parameter model. We empirically fix $\gamma$ to a single value for all models in this experiment. Fig. 8 shows the result. It can be seen that higher-level SurfConv models have better or similar image-wise performance, while being significantly better in surface-wise metrics. In general, the performance increases as the SurfConv level increases. This is because a higher SurfConv level enables a closer approximation to the scene geometry.
Fine-tuning.
Similar to our NYUv2 experiment, we compare multi-level SurfConv with the single-level baseline. The relatively smaller dataset size allows us to also thoroughly explore different $\gamma$ values (Fig. 9). It can be seen that with a good choice of $\gamma$, multi-level SurfConv is able to achieve a significant improvement over the single-level baseline in all image-wise and surface-wise metrics, while using exactly the same CNN model ($\mathcal{F}$ in Eq. 2). Comparing NYUv2 and KITTI, it can be seen that our improvement on KITTI is more significant. We credit this to the larger depth range of KITTI data, where scale-invariance plays an important role in segmentation success.
4.4 Influence of $\gamma$
The influence index $\gamma$ is an important parameter for SurfConv. We therefore further explore its effects. The optimal value of $\gamma$ can differ depending on whether the model has been trained from scratch or pretrained, as shown in Table 1 and Fig. 5. On NYUv2, a lower $\gamma$ is better for fine-tuning and a higher $\gamma$ is better for training from scratch. The pretrained models are adapted to the ImageNet dataset, where most objects are clearly visible and close to the camera. A lower $\gamma$ weighs the farther points less, which results in a larger number of points in the discretized bin with the largest depth value. In this way, the model is forced to spend more effort on low-quality far points. The observation of a lower optimal $\gamma$ on pretrained networks is further verified by our KITTI results, where the best $\gamma$ for pretrained networks is again lower than that for from-scratch networks. In KITTI, good $\gamma$ values are in general lower than in NYUv2. We attribute this to the fact that in KITTI, besides having a larger range of depth values, the peak of the depth distribution (Fig. 4) occurs much earlier.
5 Conclusion
We proposed SurfConv to bridge 3D and 2D convolution on RGBD images while avoiding the issues of both. SurfConv was formulated as a simple depth-aware multi-scale 2D convolution, and realized with a Data-Driven Depth Discretization scheme. We demonstrated the effectiveness of SurfConv on indoor and outdoor 3D semantic segmentation datasets. SurfConv achieved state-of-the-art performance while using less than 30% of the parameters used by 3D convolution based approaches.
References
 [1] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, 2016.
 [2] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
 [3] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals using stereo imagery for accurate object class detection. TPAMI, 2017.
 [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
 [5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
 [6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
 [7] Z. Deng and L. J. Latecki. Amodal detection of 3d objects: Inferring 3d bounding boxes from 2d ones in rgb-depth images. In CVPR, 2017.
 [8] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In ICRA, 2017.
 [9] Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong. 3d deep shape descriptor. In CVPR, 2015.
 [10] L. Ge, H. Liang, J. Yuan, and D. Thalmann. 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In CVPR, 2017.
 [11] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
 [12] K. Guo, D. Zou, and X. Chen. 3d mesh labeling via deep convolutional neural networks. TOG, 35(1):3, 2015.
 [13] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In ECCV, 2014.
 [14] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
 [15] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu. Scale-aware face detection. In CVPR, 2017.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [17] D. Hoiem and S. Savarese. Representations and techniques for 3d object recognition and scene interpretation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(5):1–169, 2011.
 [18] P. Hu and D. Ramanan. Finding tiny faces. In CVPR, 2017.
 [19] H. Huang, E. Kalogerakis, and B. Marlin. Analysis and synthesis of 3d shape families via deep-learned generative models of surfaces. In Computer Graphics Forum, volume 34, pages 25–38, 2015.
 [20] J. Huang and S. You. Point cloud labeling using 3d convolutional neural network. In ICPR, 2016.
 [21] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. TPAMI, 21(5):433–449, 1999.
 [22] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3d shape segmentation with projective convolutional networks. In CVPR, 2017.
 [23] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical Image Analysis, 36:61–78, 2017.
 [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [25] B. Li, T. Zhang, and T. Xia. Vehicle detection from 3d lidar using fully convolutional network. arXiv:1608.07916, 2016.
 [26] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. Lstm-cf: Unifying context modeling and fusion with lstms for rgb-d scene labeling. In ECCV, 2016.
 [27] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
 [28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
 [29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [30] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In ICCV Workshops, 2015.
 [31] D. Maturana and S. Scherer. 3d convolutional neural networks for landing zone detection from lidar. In ICRA, 2015.
 [32] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou, et al. Holoportation: Virtual 3d teleportation in real-time. In UIST, 2016.
 [33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
 [34] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016.
 [35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
 [36] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3d graph neural networks for rgbd semantic segmentation. In CVPR, 2017.
 [37] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. arXiv:1611.05009, 2016.
 [38] N. Sedaghat, M. Zolfaghari, and T. Brox. Orientation-boosted voxel nets for 3d object recognition. arXiv:1604.03351, 2016.
 [39] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep panoramic representation for 3d shape recognition. Signal Processing Letters, 22(12):2339–2343, 2015.
 [40] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
 [41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
 [42] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
 [43] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
 [44] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.
 [45] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese. Segcloud: Semantic segmentation of 3d point clouds. arXiv:1710.07563, 2017.
 [46] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant cnns. In 3DV, 2017.
 [47] S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. Torontocity: Seeing the world with a million eyes. In ICCV, 2017.
 [48] S. Wang, L. Luo, N. Zhang, and J. Li. Autoscaler: Scale-attention networks for visual correspondence. In BMVC, 2017.
 [49] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
 [50] J. Xie, Y. Fang, F. Zhu, and E. Wong. Deepshape: Deep learned shape descriptor for 3d shape matching and retrieval. In CVPR, 2015.
 [51] P. Xu, F. Davoine, J.-B. Bordes, H. Zhao, and T. Denœux. Multimodal information fusion for urban scene understanding. Machine Vision and Applications, 27(3):331–349, 2016.
 [52] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola. Deep sets. In NIPS, 2017.
 [53] A. Zeng, S. Song, M. Nießner, M. Fisher, and J. Xiao. 3dmatch: Learning the matching of local 3d geometry in range scans. In CVPR, 2017.
 [54] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan. Scale-adaptive convolutions for scene parsing. In ICCV, 2017.
 [55] Y. Zhang, M. Bai, P. Kohli, S. Izadi, and J. Xiao. Deepcontext: context-encoding neural pathways for 3d holistic scene understanding. In ICCV, 2017.
 [56] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
 [57] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. arXiv:1412.6856, 2014.