FPConv: Learning Local Flattening for Point Convolution
Abstract
We introduce FPConv, a novel surface-style convolution operator designed for 3D point cloud analysis. Unlike previous methods, FPConv does not require transforming to intermediate representations such as 3D grids or graphs, and works directly on the surface geometry of the point cloud. To be more specific, for each point, FPConv performs a local flattening by automatically learning a weight map to softly project surrounding points onto a 2D grid. Regular 2D convolutions can thus be applied for efficient feature learning. FPConv can be easily integrated into various network architectures for tasks like 3D object classification and 3D scene segmentation, achieving comparable performance with existing volumetric-type convolutions. More importantly, our experiments also show that FPConv can be a complement to volumetric convolutions, and jointly training them further boosts overall performance to state-of-the-art results. Code is available at https://github.com/lyqun/FPConv.
1 Introduction
With the rapid development of 3D scanning devices, it is increasingly easy to generate and access 3D data in the form of point clouds. This brings the challenge of robust and efficient 3D point cloud analysis, which serves as an important component in many real-world applications such as robot navigation, autonomous driving, and augmented reality [35, 52, 3, 38].
Despite decades of development in 3D analysis techniques, it is still quite challenging to perform point cloud based semantic analysis, largely due to the sparse and unordered structure of point clouds. Early methods [7, 8, 11, 28] utilized handcrafted features with complex rules to tackle this problem; such empirical, human-designed features suffer from limited performance in general scenes. Recently, with the explosive growth of machine learning and deep learning techniques, deep neural network based methods have been introduced for this task [36, 37], revealing promising improvements. However, neither PointNet [36] nor PointNet++ [37] supports a convolution operation, which is a key contributing factor in Convolutional Neural Networks (CNNs) for efficient local processing and handling large-scale data.
A straightforward extension of 2D CNNs treats 3D space as a volumetric grid and uses 3D convolutions for analysis [49, 39]. Although these approaches have achieved success in tasks like object classification and indoor semantic segmentation [30, 9], they still have limitations, such as a cubic growth rate of memory requirements and high computational cost, leading to insufficient analysis and low prediction accuracy on large-scale scenes. Recently, [48, 44] were proposed to approximate such volumetric convolutions with point-based convolution operations, which greatly improves efficiency while preserving accuracy. However, these methods still find it difficult to capture fine details on surfaces with relatively flat and thin structures.
In reality, data captured by 3D sensors and LiDAR are usually sparse: points fall near scene surfaces, with almost no points in the interior. Hence, surfaces are a more natural and compact representation for 3D data. Towards this end, works like [10, 51] establish connections among points and apply graph convolutions in the corresponding spectral domain, or focus on the surface represented by the graph [40]; such graphs, however, are usually impractical to create and sensitive to local topological structures.
More recently, [43, 33, 18] were proposed to learn convolutions on a specified 2D plane. Inspired by these pioneering works, we develop FPConv, a new convolution operation for point clouds. It works directly on the local surface geometry without any intermediate grid or graph representation. Similar to [43], it works in a projection-interpolation manner, but is more general and implicit. Our key observation is that projection and interpolation can be simplified into a single weight-map learning process. Instead of explicitly projecting onto the tangent plane [43] for convolution, FPConv learns how to diffuse the convolution weights of each point along the local surface, which is more robust to various input data and greatly improves the performance of surface-style convolution.
As a local feature learning module, FPConv can be further integrated with other operations in classical neural network architectures and works on various analysis tasks. We demonstrate FPConv on 3D object classification as well as 3D scene semantic segmentation. Networks with FPConv outperform previous surface-style approaches [43, 18, 33] and achieve comparable results with current state-of-the-art methods. Moreover, our experiments also show that FPConv performs better in regions that are relatively flat, and thus can be complementary to volumetric-type works; joint training helps boost the overall performance to state-of-the-art results.
To summarize, the main contributions of this work are as follows:

FPConv, a novel surface-style convolution for efficient 3D point cloud analysis.

Significant improvements over previous surface-style convolution based methods, and comparable performance with state-of-the-art volumetric-style methods on classification and segmentation tasks.

An in-depth analysis and comparison between surface-style and volumetric-style convolutions, demonstrating that they are complementary to each other and that joint training boosts performance to state-of-the-art.
2 Related Work
Deep learning based 3D data analysis has been a hot research topic in recent years. In this section, we mainly focus on point cloud analysis and briefly review previous works according to their underlying methodologies.
Volumetric-style point convolution. Since a point cloud is distributed irregularly in 3D space without any regular structure, pioneering works sample points into grids so that conventional 3D convolutions can be applied, but are limited by high computational load and low representation efficiency [30, 49, 39, 41]. PointNet [36] proposes a shared MLP applied to every point individually, followed by a global max-pooling to extract a global feature of the input point cloud. [37] extends it with nested partitionings of the point set to hierarchically learn more local features, and many works follow this line to approximate point convolutions by MLPs [24, 25, 16, 46]. However, such a representation cannot capture local features very well. Recent works define explicit convolution kernels for points, whose weights are directly learned, like image convolutions [17, 50, 12, 2, 44]. Among them, KPConv [44] proposes a spatially deformable point convolution with any number of kernel points, which alleviates both varying densities and computational cost, outperforming related methods on point analysis tasks. However, these volumetric-style approaches may not capture flat, uniform areas very well.
Graph-style point convolution. When relationships among points have been established, a graph-style convolution can be applied to explore point clouds more efficiently than the volumetric style. Convolution on a graph can be defined as convolution in its spectral domain [6, 15, 10]. ChebNet [10] adopts a Chebyshev polynomial basis for representing the spectral filters, alleviating the cost of explicitly computing the graph Fourier transform. Furthermore, [21] uses a localized first-order approximation of spectral graph convolutions for semi-supervised learning on graph-structured data, which greatly accelerates calculation and improves classification results. However, these methods all depend on a specific graph structure. [51] then introduces a spectral parameterization of dilated convolution kernels and a spectral transformer network, sharing information across related but different shape structures. In the meantime, [29, 5, 40, 32] focus on graph learning on manifold surface representations to avoid operating in the spectral domain, while [45, 47] learn filters on edge relationships instead of points' relative positions. Although a graph convolution combines features on local surface patches and can be invariant to deformations in Euclidean space, reasonable relationships among distinct points are not easy to establish.
Surface-style point convolution. Since data captured by 3D sensors typically represent surfaces, another mainstream approach attempts to operate directly on the surface geometry. Most works project a shape surface consisting of points onto an intermediate grid structure, e.g., multi-view RGB-D images, followed by conventional convolutions [13, 26, 31, 4, 23]. Such methods often suffer from the redundant representation of multiple views and the ambiguity caused by different viewpoints. [43] proposes projecting local neighborhoods of each point onto its local tangent plane and processing them with 2D convolutions, which is efficient for analyzing dense point clouds of large-scale and outdoor environments. However, this method relies heavily on tangent estimation, and its linear projection is not always optimal for complex areas. [33] optimizes the computation with parallel tangential frames, while [18] utilizes a 4-rotational symmetric field to define a domain for convolution on a surface, which not only increases robustness but also makes the most of detailed information. However, existing surface-style learning algorithms do not perform very well on challenging datasets such as S3DIS [1] and ScanNet [9], since they lose one dimension of information and cannot estimate the surface accurately.
Our method is inspired by surface-style point convolutions. The network learns a nonlinear projection for each local patch, i.e., it flattens the local neighborhood points onto a 2D grid plane, after which 2D convolutions can be applied for feature extraction. Although learning on the surface loses one dimension of information, FPConv still achieves comparable performance with existing volumetric-style convolutions. In addition, FPConv can be fused with volumetric-style convolutions to achieve state-of-the-art results.
3 FPConv
In this section, we formally introduce FPConv. We first revisit the definition of convolution along a point cloud surface, and then show that it can be simplified into a weight learning problem in the discrete setting. All derivations are given directly in terms of point clouds.
3.1 Learn Local Flattening
Notation:
Let $p$ be a point from a point cloud $P$ and $f: P \to \mathbb{R}$ be a scalar function defined over points. Here $f$ can encode signals like color, geometry, or features from intermediate network layers. We denote $\mathcal{N}(p)$ as a local point cloud patch centered at $p$, where $\mathcal{N}(p) = \{\,p_i \in P \mid \|p_i - p\| \le r\,\}$ with $r$ being the chosen radius.
Convolution on local surface:
In order to convolve around the surface, we first extend $f$ to a continuous function over a continuous surface. We introduce a virtual 2D plane $U \subset \mathbb{R}^2$ with a continuous signal $F: U \to \mathbb{R}$, together with a map $\pi: \mathcal{N}(p) \to U$ which maps $\mathcal{N}(p)$ onto $U$ and satisfies

$f(p_i) = F(\pi(p_i)), \quad \forall\, p_i \in \mathcal{N}(p)$.   (1)
The convolution at $p$ is defined as:

$(f * g)(p) = \int_{U} F(u)\, g(u)\, \mathrm{d}u$,   (2)
where $g$ is a convolution kernel. We now describe how to formulate the above convolution into a weight learning problem.
Local flattening by learning projection weights:
With $\pi$, the points of $\mathcal{N}(p)$ will be mapped to scattered locations in $U$; thus we need an interpolation method to estimate the full signal function $F$, as shown in Eq. 3:

$F(u) = \sum_{i=1}^{N} w_i(u)\, f(p_i)$,   (3)

where $w_i(u)$ is the interpolation weight of point $p_i$ at location $u$ and $N = |\mathcal{N}(p)|$.
Now we discretize $U$ into a grid plane with size $M = m \times m$. For each grid cell $u_j$, where $j \in \{1, \dots, M\}$, we have from Eq. 1 and Eq. 3:

$F(u_j) = \sum_{i=1}^{N} w_{ij}\, f(p_i)$,   (4)
where $w_{ij} = w_i(u_j)$. Furthermore, we can rewrite Eq. 2 in an approximate discretized form as:

$(f * g)(p) \approx \sum_{j=1}^{M} g_j\, F(u_j) = \sum_{j=1}^{M} \sum_{i=1}^{N} g_j\, w_{ij}\, f(p_i)$,   (5)

where $g_j$ denotes the discretized convolution kernel weight at grid cell $u_j$, $j \in \{1, \dots, M\}$. Let $W = [w_{ij}] \in \mathbb{R}^{N \times M}$, $\mathbf{f} = [f(p_1), \dots, f(p_N)]^{\top}$, $\mathbf{g} = [g_1, \dots, g_M]^{\top}$, and $\mathbf{F} = W^{\top}\mathbf{f}$, so that $(f * g)(p) \approx \mathbf{g}^{\top}\mathbf{F} = \mathbf{g}^{\top} W^{\top} \mathbf{f}$. Now we can see that projection and interpolation can be combined into a single weight matrix $W$, which only depends on the point locations w.r.t. the center point.
3.2 Implementation
According to Eq. 5, we can design a module to learn the projection weights directly, instead of learning projection and interpolation separately, as shown in Fig. 2. We want this module to have two properties: first, it should be invariant to input permutation, since the local point cloud is unordered; second, it should be adaptive to the input geometry, hence the projection should combine local coordinates with global information of the local patch. Therefore, we first use PointNet [36] to extract a global feature of the local region, namely the distribution feature, which is invariant to permutation. Then we concatenate the distribution feature to each of the input points, as shown in Fig. 3. After that, a shared MLP is employed to predict the final projection weights.
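A minimal numpy sketch of the weight-prediction module described above; the layer sizes, parameter values, and function names are illustrative stand-ins, not the released implementation:

```python
import numpy as np

def shared_mlp(x, w, b):
    """Apply the same linear layer + ReLU to every point (every row of x)."""
    return np.maximum(x @ w + b, 0.0)

def predict_projection_weights(local_xyz, params):
    """local_xyz: (N, 3) neighbor coordinates relative to the patch center."""
    w1, b1, w2, b2 = params
    # PointNet-style "distribution" feature: per-point MLP + max-pool,
    # which is invariant to the ordering of the input points.
    per_point = shared_mlp(local_xyz, w1, b1)                 # (N, C)
    dist_feat = per_point.max(axis=0, keepdims=True)          # (1, C)
    # Concatenate the global feature to every point, then predict each
    # point's M projection weights with another shared MLP.
    x = np.concatenate(
        [local_xyz, np.repeat(dist_feat, len(local_xyz), axis=0)], axis=1)
    return shared_mlp(x, w2, b2)                              # (N, M)

rng = np.random.default_rng(0)
N, C, M = 8, 16, 36
params = (rng.normal(size=(3, C)), np.zeros(C),
          rng.normal(size=(3 + C, M)), np.zeros(M))
pts = rng.normal(size=(N, 3))
W = predict_projection_weights(pts, params)
# Permuting the input points permutes the rows of W identically:
perm = rng.permutation(N)
assert np.allclose(predict_projection_weights(pts[perm], params), W[perm])
```

The final assertion checks the first desired property: reordering the input points only reorders the per-point weight rows, leaving the grid plane unchanged.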
After projection, a 2D convolution is applied on the obtained grid feature plane. To extract a local feature vector, a global convolution or pooling can be applied on the last layer of the 2D convolution network.
However, the feature intensities of pixels in the grid plane may be unbalanced when the total intensity received from points in the local region varies, which can break the stability of a neural network and make training hard to converge. In order to balance the feature intensity of the grid plane, we further introduce two normalization methods on the learned projection weights.
Dense Grid Plane: Let the projection weight matrix be $W \in \mathbb{R}^{N \times M}$. One possible way to obtain a dense grid plane is to normalize $W$ along its first dimension by dividing by the column sums, so that the total intensity received at each pixel equals 1; this is similar to bilinear interpolation. In our implementation, we use a softmax to avoid division by zero, as shown in Eq. 6:

$\hat{w}_{ij} = \dfrac{\exp(w_{ij})}{\sum_{k=1}^{N} \exp(w_{kj})}$.   (6)
Sparse Grid Plane: Due to the natural sparsity of point clouds, normalizing the projection weights to obtain a dense grid plane may not be optimal. We therefore design a 2-step normalization which preserves the sparsity of the projection weight matrix, and hence of the grid plane. We also conduct ablation studies on the two proposed normalization techniques.
The first step is to normalize along the second dimension, balancing the intensity given out by each local neighbor point. Here we add a small positive $\epsilon$ to avoid division by zero. As shown in Eq. 7, $W_{i,:}$ denotes the $i$-th row of $W$:

$\hat{W}_{i,:} = \dfrac{W_{i,:}}{\epsilon + \|W_{i,:}\|_{1}}$.   (7)
The second step is to normalize along the first dimension, balancing the intensity received at each pixel position. It could be implemented like the first step, by dividing each column by its sum; however, we choose the method shown in Eq. 8 to maintain a continuous sparsity, where $\hat{W}_{:,j}$ denotes the $j$-th column of $\hat{W}$. Examples of continuous sparsity and binary sparsity are shown in Fig. 4:

$\tilde{W}_{:,j} = \dfrac{\hat{W}_{:,j}}{\max\big(1,\ \|\hat{W}_{:,j}\|_{1}\big)}$.   (8)
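The two normalization schemes can be sketched as follows. Reading the second step as a division by the column sum clipped from below at 1 is our interpretation of how zeros (and sub-unit columns) are preserved; all names and sizes are illustrative:

```python
import numpy as np

def dense_norm(W):
    # Softmax over the point dimension: every grid cell receives a total
    # intensity of exactly 1, so the resulting plane is dense.
    e = np.exp(W - W.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def sparse_norm(W, eps=1e-5):
    # Step 1: normalize each row so the intensity a point gives out over
    # all cells is bounded; eps guards against division by zero.
    W = W / (eps + np.abs(W).sum(axis=1, keepdims=True))
    # Step 2: rescale a column only when its total received intensity
    # exceeds 1, which keeps zero entries zero (continuous sparsity).
    col = np.abs(W).sum(axis=0, keepdims=True)
    return W / np.maximum(col, 1.0)

rng = np.random.default_rng(1)
W = rng.random((8, 9)) * (rng.random((8, 9)) > 0.5)  # sparse weights
Wd, Ws = dense_norm(W), sparse_norm(W)
assert np.allclose(Wd.sum(axis=0), 1.0)   # dense: every column sums to 1
assert np.all((W == 0) == (Ws == 0))      # sparse: zero pattern preserved
```

Note how the softmax of the dense scheme destroys the zero pattern (every cell receives some intensity), whereas the 2-step scheme keeps empty cells empty.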
4 Architecture
4.1 Residual FPConv Block
To build a deep network for segmentation and classification, we develop a bottleneck-design residual FPConv block inspired by [14], as shown in Fig. 6. This block takes a point cloud as input and applies a stack of shared MLP, FPConv, and shared MLP, where the shared MLPs are responsible for reducing and then increasing (restoring) the feature dimensions, similar to the 1x1 convolutions in a residual block [14].
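The block can be sketched functionally as follows; the numpy linear layers, the feature sizes, and the identity stand-in for the FPConv operator are all illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_fpconv_block(x, w_down, w_up, fpconv_op):
    """x: (N, C) point features. Reduce -> FPConv -> restore -> add identity."""
    h = relu(x @ w_down)   # shared MLP: reduce C -> C // 4 (bottleneck)
    h = fpconv_op(h)       # local surface convolution (stand-in here)
    h = h @ w_up           # shared MLP: restore C // 4 -> C
    return relu(x + h)     # residual connection, as in [14]

rng = np.random.default_rng(5)
N, C = 64, 32
x = rng.normal(size=(N, C))
y = residual_fpconv_block(
    x,
    rng.normal(size=(C, C // 4)) * 0.1,
    rng.normal(size=(C // 4, C)) * 0.1,
    lambda h: h)           # identity in place of a real FPConv
assert y.shape == (N, C)
```

The bottleneck keeps the (comparatively expensive) FPConv operating on a reduced feature width, mirroring the rationale of [14].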
4.2 MultiScale Analysis
Farthest Point Sampling: we use iterative farthest point sampling (FPS) to downsample the point cloud. As mentioned in PointNet++ [37], FPS has better coverage of the entire point set than random sampling, given the same number of centroids.
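A minimal O(NK) sketch of iterative FPS, assuming an arbitrary first centroid (names are illustrative):

```python
import numpy as np

def farthest_point_sampling(xyz, k, seed=0):
    """Pick k well-spread centroids from xyz of shape (N, 3)."""
    n = len(xyz)
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(n))]       # arbitrary first centroid
    dist = np.full(n, np.inf)          # distance to nearest chosen centroid
    for _ in range(k - 1):
        # fold in distances to the most recently chosen centroid
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[-1]], axis=1))
        idx.append(int(dist.argmax())) # next centroid: the farthest point
    return np.array(idx)

pts = np.random.default_rng(2).random((100, 3))
sel = farthest_point_sampling(pts, 16)
assert len(set(sel.tolist())) == 16    # 16 distinct centroids
```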
Pooling: we use max-pooling to group local features. Given an input point cloud $P$ and a downsampled point cloud $P'$ with their corresponding features $F$ and $F'$, we group neighbors for each point in $P'$ within radius $r$ and apply a pooling operator on the features of the grouped points, as shown in Eq. 9, where $\mathcal{N}_r(q) = \{\,p_i \in P \mid \|p_i - q\| \le r\,\}$ for any $q \in P'$:

$f'(q) = \max_{p_i \in \mathcal{N}_r(q)} f(p_i)$.   (9)
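The pooling step above can be sketched directly in numpy (a brute-force neighbor search for clarity; names and sizes are illustrative):

```python
import numpy as np

def radius_max_pool(P, F, Q, r):
    """P: (N, 3) points with features F: (N, C); Q: (K, 3) downsampled points.
    For every q in Q, take a channel-wise max over neighbors within radius r."""
    out = np.zeros((len(Q), F.shape[1]))
    for k, q in enumerate(Q):
        mask = np.linalg.norm(P - q, axis=1) <= r
        if mask.any():
            out[k] = F[mask].max(axis=0)
    return out

rng = np.random.default_rng(3)
P = rng.random((50, 3))
F = rng.random((50, 4))
Q = P[:5]                              # centers drawn from the cloud itself
pooled = radius_max_pool(P, F, Q, r=0.3)
# each center is its own neighbor, so the pooled feature dominates its own:
assert np.all(pooled >= F[:5] - 1e-12)
```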
FPConv with FPS: similar to the pooling operation, this block applies FPConv at each point of the downsampled point cloud, searching for neighbors over the full point cloud, as shown in Eq. 10:

$f'(q) = \mathrm{FPConv}\big(q,\ \mathcal{N}_r(q)\big), \quad \forall\, q \in P'$.   (10)
Upsampling: we use nearest-neighbor interpolation based on Euclidean distance to upsample the point cloud. Given a point cloud $P'$ with features $F'$ and a target point cloud $P$, we compute a feature for each point in $P$ by interpolating from its neighbor points searched over $P'$.
In the upsampling phase, a skip connection and a shared MLP are used for fusing features from the encoder and decoder. Nearest-neighbor upsampling with shared MLPs could be replaced by deconvolution, but this does not lead to a significant improvement, as mentioned in [44], so we do not employ it in our experiments.
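The interpolation step can be sketched as follows, assuming PointNet++-style inverse-distance weighting over the k = 3 nearest neighbors (the exact k and weighting are assumptions, and all names are illustrative):

```python
import numpy as np

def knn_interpolate(P_coarse, F_coarse, P_fine, k=3, eps=1e-8):
    """Inverse-distance weighted average of each target point's k nearest
    coarse points. P_coarse: (N, 3), F_coarse: (N, C), P_fine: (K, 3)."""
    out = np.zeros((len(P_fine), F_coarse.shape[1]))
    for i, q in enumerate(P_fine):
        d = np.linalg.norm(P_coarse - q, axis=1)
        nn = np.argsort(d)[:k]                 # k nearest coarse points
        w = 1.0 / (d[nn] + eps)                # closer points weigh more
        out[i] = (w[:, None] * F_coarse[nn]).sum(axis=0) / w.sum()
    return out

rng = np.random.default_rng(4)
Pc = rng.random((10, 3))
Fc = rng.random((10, 4))
up = knn_interpolate(Pc, Fc, Pc)   # "upsampling" onto the same points
# a coincident neighbor (distance 0) dominates the weighted average:
assert np.allclose(up, Fc, atol=1e-3)
```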
The architecture shown in Fig. 5 is designed for large scene segmentation and includes four layers of downsampling and upsampling for multi-scale analysis. For the classification task, we apply a global pooling on the last layer of downsampling to obtain a global feature representing the full point cloud, and then use a fully connected network for classification.
4.3 Fusing Two Convolutions
As one of our main contributions, we also try to answer the question "Can we combine two convolutions to further boost performance?" The answer is yes, but only when the two convolutions are of different or complementary types (see Section 6), e.g., surface-style and volumetric-style.
In this section, we propose two convenient fusion strategies that combine two convolution operators in a single framework. The first is fusing different convolutional features, similar to the inception net [42]. As shown in Fig. 7, we design a parallel residual block: given an input feature, multiple convolutions are applied in parallel and their outputs are concatenated as the fused feature. This strategy suits compatible methods, such as the SA module of PointNet++ [37] and PointConv [48], both of which take a point cloud as input and apply the same downsampling strategy used in our architecture.
For other, incompatible methods, such as TextureNet [18], which uses a mesh as an additional input, and KPConv [44], which applies grid downsampling, we propose a second fusion strategy: concatenating their output features at the second-to-last layer of the networks, and then applying a tiny network for fusion.
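The first (conv-level) strategy amounts to running branches in parallel on the same input and concatenating their outputs; the branch functions below are trivial stand-ins for FPConv and a volumetric convolution:

```python
import numpy as np

def parallel_fusion(x, branch_a, branch_b):
    """Apply two branch operators to the same point features and
    concatenate their outputs along the channel dimension."""
    return np.concatenate([branch_a(x), branch_b(x)], axis=-1)

x = np.ones((32, 8))                   # 32 points, 8-dim input features
fused = parallel_fusion(x,
                        lambda t: t * 2.0,   # stand-in for surface branch
                        lambda t: t - 1.0)   # stand-in for volumetric branch
assert fused.shape == (32, 16)         # channels from both branches
```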
Method  Conv.  Samp.  ScanNet  S3DIS (Area 5)  S3DIS (6-fold)
PointNet [36]  V  FPS  -  41.1  47.6
PointNet++ [37]  V  FPS  33.9  -  -
PointCNN [25]  V  FPS  45.8  57.3  -
PointConv [48]  V  FPS  55.6  58.3  -
HPEIN [20]  V  FPS  61.8  61.9  67.8
KPConv [44]  V  Grid  68.4  65.4  69.6
TangentConv [43]  S  Grid  43.8  52.6  -
SurfaceConv [33]  S  -  44.2  -  -
TextureNet [18]  S  QF [18]  56.6  -  -
FPConv (ours)  S  FPS  63.9  62.8  68.7
FP+PointConv (feat)  S + V  -  -  64.4  -
FP+PointConv (conv)  S + V  FPS  -  64.8  -
FP+KPConv (feat)  S + V  -  -  66.7  -
Method  Conv.  Samp.  Accuracy 

PointNet [36]  V  FPS  89.2 
PointNet++ [37]  V  FPS  90.7 
PointCNN [25]  V  FPS  92.2 
PointConv [48]  V  FPS  92.5 
KPConv [44]  V  Grid  92.9 
FPConv (ours)  S  FPS  92.5 
Method  oA  mAcc  mIoU  ceil.  floor  wall  beam  col.  wind.  door  table  chair  sofa  book.  board  clut. 
PointConv [48]  85.4  64.7  58.3  92.8  96.3  77.0  0.0  18.2  47.7  54.3  87.9  72.8  61.6  65.9  33.9  49.3 
KPConv [44]    70.9  65.4  92.6  97.3  81.4  0.0  16.5  54.4  69.5  80.2  90.1  66.4  74.6  63.7  58.1 
FPConv (ours)  88.3  68.9  62.8  94.6  98.5  80.9  0.0  19.1  60.1  48.9  80.6  88.0  53.2  68.4  68.2  54.9 
FP+PointConv (conv)  88.2  70.2  64.8  92.8  98.4  81.6  0.0  24.2  59.1  63.0  79.5  88.6  68.1  67.9  67.2  52.4
FP+PointConv (feat)  88.6  71.5  64.4  94.2  98.5  82.4  0.0  25.5  62.9  63.1  79.8  87.9  53.5  68.3  67.1  54.5
KP+PointConv (feat)  89.4  71.5  65.5  94.6  98.4  81.4  0.0  17.8  56.0  71.7  78.9  90.1  66.8  72.6  65.0  58.7
FP+KPConv (feat)  89.9  72.8  66.7  94.5  98.6  83.9  0.0  24.5  61.1  70.9  81.6  89.4  60.3  73.5  70.8  57.8
5 Experiments
To demonstrate the efficacy of our proposed convolution, we conduct experiments on point cloud semantic segmentation and classification tasks. ModelNet40 [49] is used for shape classification, and two large-scale datasets, Stanford Large-Scale 3D Indoor Spaces (S3DIS) [1] and ScanNet [9], are used for 3D point cloud segmentation. We implement FPConv in PyTorch [34]. A momentum SGD optimizer is used to minimize a point-wise cross-entropy loss, with a momentum of 0.98 and an initial learning rate of 0.01 scheduled by a cosine LR scheduler [27]. Leaky ReLU and batch normalization are applied after each layer except the last fully connected layer. We train our models for 100 epochs on S3DIS and ModelNet40, and 300 epochs on ScanNet.
5.1 3D Shape Classification
ModelNet40 [49] contains 12311 3D meshed models from 40 categories, with 9843 for training and 2468 for testing. Normals are used as an additional input feature in our model. Moreover, random rotation about the up-axis and point jittering are used for data augmentation. As shown in Table 2, our model achieves state-of-the-art performance among surface-style methods.
5.2 Large Scene Semantic Segmentation
Data. S3DIS [1] contains 3D point clouds of 6 areas, 272 rooms in total. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor, wall, etc., plus clutter). To prepare the training data, 14k points are randomly sampled from a randomly picked 2m-by-2m block; both samplings are performed on the fly during training. For testing, all points are covered. Each point is represented by a 9-dim vector of XYZ, RGB, and normalized location w.r.t. the room (from 0 to 1). In particular, the sampling rate for each point is 0.5 in every training epoch.
ScanNet [9] contains 1513 3D indoor scene scans, split into 1201 for training and 312 for testing. There are 21 classes in total; 20 classes are used for evaluation, and 1 class is for free space. Similar to S3DIS, we randomly sample the raw data in blocks and then sample points on the fly during training. Each block is of size 2m by 2m and contains 11k points, each represented by a 6-dim vector of XYZ and RGB.
Pipeline for fusion. As mentioned in Section 4.3, we propose two strategies for fusing conv-kernels of different types. In our experiments, we select PointConv [48] and KPConv rigid [44] for comparison on S3DIS. We apply both fusion strategies to PointConv with FPConv, and the second strategy (fusion at the final feature level) to FPConv with KPConv and to PointConv with KPConv. KPConv rigid is used for fusion; its deformable version is omitted because no pretrained model or hyperparameter settings were released. Thus, in the remainder of this section, KPConv refers to KPConv rigid.
Results. Following [36], we report results on two settings for S3DIS: evaluation on Area 5, and 6-fold cross validation (calculating the metrics with results from the different folds merged). We report the mean of class-wise intersection over union (mIoU), overall point-wise accuracy (oA), and the mean of class-wise accuracy (mAcc). For ScanNet [9], we report the mIoU score tested on the ScanNet benchmark.
Results (mIoU) are shown in Table 1, and detailed per-class results on S3DIS are shown in Table 3. As we can see, FPConv outperforms all existing surface-style learning methods by large margins. Specifically, the mIoU of FPConv on the ScanNet [9] benchmark reaches 63.9%, outperforming the previous best surface-style method by 7.3%. In addition, FPConv fused with KPConv achieves state-of-the-art performance on S3DIS.
Even though the mIoU of FPConv on S3DIS is lower than KPConv's, FPConv still outperforms KPConv on several classes, such as ceiling, floor, and board. Notably, all of these classes are flat objects, which should have small curvatures. Based on this observation, we further conduct several ablation studies to explore the relationship between FPConv's segmentation performance and object curvature, as shown in the next section. Visualizations of results are shown in Fig. 8 for S3DIS and Fig. 9 for ScanNet.
6 Ablation Study
Two ablation studies are conducted: the first explores the fusion of surface-style and volumetric-style convolutions; the second studies the effect of detailed configurations of FPConv, namely the normalization method and the grid plane size.
6.1 On Fusion of S.Conv and V.Conv
We first study the performance of different combination methods for the two convolutions. Before that, we present an experimental finding that they are complementary, each being good at analyzing different kinds of regions.
Performance vs. Curvature
As mentioned in Section 5.2, we claim that FPConv performs better on areas with small curvature. To make this more convincing, we analyze the relationship between overall accuracy and curvature, shown on the left of Fig. 10. FPConv outperforms PointConv [48] and KPConv [44] when curvatures are small, but does not perform very well on structures with large curvatures. Moreover, the histogram of the distribution of point curvatures, shown on the right of Fig. 10, implies that almost all points have either large or small curvatures. This explains the sharp performance degradation as curvature increases. Furthermore, as shown in Fig. 11, we highlight (in red) points with incorrect predictions and points with large curvature. It is obvious that incorrect predictions are concentrated in areas with large curvature, while FPConv performs well in flat areas.
Ablation analysis on fusion method
As mentioned above, FPConv, a surface-style convolution, performs better in flat areas and worse in rough areas, while KPConv, a volumetric-style convolution, behaves oppositely. We believe they can complement each other, and conduct 4 fusion experiments: FPConv+PointConv (feat), FPConv+PointConv (conv), KPConv+PointConv (feat), and FPConv+KPConv (feat), where (feat) denotes fusion at the final feature level and (conv) denotes fusion at the conv level. We do not fuse FPConv and KPConv at the conv level because their downsampling strategies are incompatible. As shown in Table 3, fusing FPConv with PointConv or KPConv brings a great improvement, while fusing PointConv with KPConv brings little improvement. Therefore, we can claim that FPConv is complementary to volumetric-style convolutions, which may guide convolution design for point clouds in the future.
Visual results are shown in Fig. 8. FPConv captures flat structures better than KPConv, e.g., the column class, which KPConv misses entirely, while KPConv captures complex structures, such as doors, better. Moreover, the fusion of KPConv and FPConv achieves better results than either alone.
Method  mIoU  mAcc  oAcc 

w/ sparse norm + 6x6  62.8  69.0  88.3
w/ dense norm + 6x6  61.6  68.5  87.6
w/o norm + 6x6  59.8  67.1  86.2
w/ sparse norm + 5x5  61.8  68.1  88.4
6.2 On FPConv Architecture Design
We conduct 4 experiments, as shown in Table 4, to study the influence of the normalization method and the grid plane size on the performance of FPConv. The sparse norm, i.e., the 2-step normalization described in Section 3.2, performs better than the dense norm. In addition, a higher-resolution grid plane may achieve better performance, at the cost of higher memory.
7 Conclusion
In this work, we propose FPConv, a novel surface-style convolution operator for 3D point clouds. FPConv takes a local region of a point cloud as input and flattens it onto a 2D grid plane by predicting projection weights, followed by regular 2D convolutions. Our experiments demonstrate that FPConv significantly improves the performance of surface-style convolution methods. Furthermore, we discover that surface-style convolution can be complementary to volumetric-style convolution, and that joint training boosts the performance to state-of-the-art. We believe surface-style convolutions can play an important role in feature learning for 3D data and are a promising direction to explore.
Acknowledgments
This work was supported in part by grants No. 2018B030338001, NSFC-61902334, NSFC-61629101, No. 2018YFB1800800, No. ZDSYS201707251409055, and No. 2017ZT07X152.
Supplementary
The supplementary material contains:
A. More results of the proposed fusion strategy
a. Fusing FPConv and PointConv on ScanNet
b. Fusing FPConv and KPConv-deform on S3DIS
B. More Results on Segmentation Tasks
We provide more details of our experimental results. As shown in Table 7, we compare FPConv with other popular methods on S3DIS [1] 6-fold cross validation. FPConv achieves higher scores on flat-shaped objects, such as ceiling, floor, table, and board, while KPConv [44], a volumetric-style method, performs better on complex structures. More visual results are shown in Fig. 12 and Fig. 13.
Method  mIoU  ceil.  floor  wall  beam  col.  wind.  door  table  chair  sofa  book.  board  clut. 

KPConv-rigid [44]  65.4  92.6  97.3  81.4  0.0  16.5  54.4  69.5  80.2  90.1  66.4  74.6  63.7  58.1
KPConv-deform [44]  67.1  92.8  97.3  82.4  0.0  23.9  58.0  69.0  81.5  91.0  75.4  75.3  66.7  58.9
FPConv (ours)  62.8  94.6  98.5  80.9  0.0  19.1  60.1  48.9  80.6  88.0  53.2  68.4  68.2  54.9
FP+KP-rigid (feat)  66.7  94.5  98.6  83.9  0.0  24.5  61.1  70.9  81.6  89.4  60.3  73.5  70.8  57.8
FP+KP-deform (feat)  68.2  94.2  98.5  83.7  0.0  24.7  63.0  66.6  82.5  91.9  76.7  75.6  70.5  59.1
Method  mIoU  ceil.  floor  wall  beam  col.  wind.  door  table  chair  sofa  book.  board  clut. 

PointNet [36]  47.6  88.0  88.7  69.3  42.4  23.1  47.5  51.6  54.1  42.0  9.6  38.2  29.4  35.2 
RSNet [19]  56.5  92.5  92.8  78.6  32.8  34.4  51.6  68.1  59.7  60.1  16.4  50.2  44.9  52.0 
SPGraph [22]  62.1  89.9  95.1  76.4  62.8  47.1  55.3  68.4  69.2  73.5  45.9  63.2  8.7  52.9 
PointCNN [25]  65.4  94.8  97.3  75.8  63.3  51.7  58.4  57.2  69.1  71.6  61.2  39.1  52.2  58.6 
HPEIN [20]  67.8  -  -  -  -  -  -  -  -  -  -  -  -  -
KPConv-rigid [44]  69.6  93.7  92.0  82.5  62.5  49.5  65.7  77.3  64.0  57.8  71.7  68.8  60.1  59.6
KPConv-deform [44]  70.6  93.6  92.4  83.1  63.9  54.3  66.1  76.6  64.0  57.8  74.9  69.3  61.3  60.3
FPConv (ours)  68.7  94.8  97.5  82.6  42.8  41.8  58.6  73.4  71.0  81.0  59.8  61.9  64.2  64.2 
References
 (2016) 3D semantic parsing of largescale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Cited by: §2, §5.2, §5, 2nd item, b. Fusing FPConv and KPConvdeform on S3DIS, B. More Results on Segmentation Tasks.
 (2018) Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091. Cited by: §2.
 (2017) Bounding boxes, segmentations and object coordinates: how important is recognition for 3d scene flow estimation in autonomous driving scenarios?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2574–2583. Cited by: §1.
 (2017) Unstructured point cloud semantic labeling using deep segmentation networks.. 3DOR 2, pp. 7. Cited by: §2.
 (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §2.
 (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §2.
 (2004) Supervised parametric classification of aerial lidar data. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 30–30. Cited by: §1.
 (2009) Contribution of airborne fullwaveform lidar and image data for urban scene classification. In 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 1669–1672. Cited by: §1.
 (2017) Scannet: richlyannotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: §1, §2, §5.2, §5.2, §5.2, §5, 1st item, a. Fusing FPConv and PointConv on ScanNet.
 (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §1, §2.
 (2009) Shapebased recognition of 3d point clouds in urban environments. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2154–2161. Cited by: §1.
 (2018) Flexconvolution. In Asian Conference on Computer Vision, pp. 105–122. Cited by: §2.
 (2015) Indoor scene understanding with rgbd images: bottomup segmentation, object detection and semantic segmentation. International Journal of Computer Vision 112 (2), pp. 133–149. Cited by: §2.
 (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: Figure 6, §4.1.
 (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §2.
 (2018) Monte carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers, pp. 235. Cited by: §2.
 (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993. Cited by: §2.
 (2019) TextureNet: consistent local parametrizations for learning from high-resolution signals on meshes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4440–4449. Cited by: §1, §1, §2, §4.3, Table 1.
 (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: Table 7.
 (2019) Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10433–10441. Cited by: Table 1, Table 7.
 (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
 (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567. Cited by: Table 7.
 (2017) Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107. Cited by: §2.
 (2018) SO-Net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §2.
 (2018) PointCNN. arXiv preprint arXiv:1801.07791. Cited by: §2, Table 1, Table 2, Table 7.
 (2016) LSTM-CF: unifying context modeling and fusion with lstms for rgb-d scene labeling. In European conference on computer vision, pp. 541–557. Cited by: §2.
 (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §5.
 (2015) 3d all the way: semantic segmentation of urban scenes from start to end in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4456–4465. Cited by: §1.
 (2015) Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pp. 37–45. Cited by: §2.
 (2015) VoxNet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §1, §2.
 (2017) SemanticFusion: dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and automation (ICRA), pp. 4628–4635. Cited by: §2.
 (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §2.
 (2018) Convolutional neural networks on 3d surfaces using parallel frames. arXiv preprint arXiv:1808.04952. Cited by: §1, §1, §2, Table 1.
 (2017) Automatic differentiation in pytorch. In NIPS-W. Cited by: §5.
 (2018) Ground extraction from 3d lidar point clouds with the classification learner app. In 2018 26th Mediterranean Conference on Control and Automation (MED), pp. 1–9. Cited by: §1.
 (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §2, §3.2, Table 1, Table 2, §5.2, Table 7.
 (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §1, §2, Figure 6, §4.2, §4.3, Table 1, Table 2.
 (2017) [POSTER] augmented things: enhancing ar applications leveraging the internet of things and universal 3d object tracking. In 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pp. 103–108. Cited by: §1.
 (2017) OctNet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586. Cited by: §1, §2.
 (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702. Cited by: §1, §2.
 (2017) Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754. Cited by: §2.
 (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §4.3.
 (2018) Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3887–3896. Cited by: §1, §1, §2, Table 1.
 (2019) KPConv: flexible and deformable convolution for point clouds. arXiv preprint arXiv:1904.08889. Cited by: §1, §2, §4.2, §4.3, Table 1, Table 2, Table 3, Figure 8, Figure 8, §5.2, §6.1, 2nd item, b. Fusing FPConv and KPConv-deform on S3DIS, Table 6, Table 7, B. More Results on Segmentation Tasks.
 (2018) FeaStNet: feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2598–2606. Cited by: §2.
 (2018) Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2589–2597. Cited by: §2.
 (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146. Cited by: §2.
 (2019) PointConv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: §1, §4.3, Table 1, Table 2, Table 3, §5.2, §6.1, 1st item, a. Fusing FPConv and PointConv on ScanNet, Table 5.
 (2015) 3d ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §1, §2, §5.1, §5.
 (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102. Cited by: §2.
 (2017) SyncSpecCNN: synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290. Cited by: §1, §2.
 (2018) A lidar point cloud generator: from a virtual world to autonomous driving. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 458–464. Cited by: §1.