FPConv: Learning Local Flattening for Point Convolution



We introduce FPConv, a novel surface-style convolution operator designed for 3D point cloud analysis. Unlike previous methods, FPConv does not require transforming the input into an intermediate representation such as a 3D grid or a graph; it works directly on the surface geometry of the point cloud. More specifically, for each point, FPConv performs a local flattening by automatically learning a weight map that softly projects surrounding points onto a 2D grid. Regular 2D convolution can then be applied for efficient feature learning. FPConv can be easily integrated into various network architectures for tasks such as 3D object classification and 3D scene segmentation, and achieves performance comparable to existing volumetric-type convolutions. More importantly, our experiments also show that FPConv is complementary to volumetric convolutions, and that jointly training them further boosts overall performance to state-of-the-art results. Code is available at https://github.com/lyqun/FPConv


1 Introduction

With the rapid development of 3D scanning devices, it is increasingly easy to generate and access 3D data in the form of point clouds. This also brings the challenge of robust and efficient 3D point cloud analysis, which serves as an important component in many real-world applications such as robot navigation, autonomous driving, and augmented reality [35, 52, 3, 38].

Despite decades of development in 3D analysis technologies, point cloud based semantic analysis remains quite challenging, largely due to the sparse and unordered structure of point clouds. Early methods [7, 8, 11, 28] utilized hand-crafted features with complex rules to tackle this problem. Such empirical human-designed features suffer from limited performance in general scenes. Recently, with the explosive growth of machine learning and deep learning techniques, Deep Neural Network (DNN) based methods have been introduced into this task [36, 37] and show promising improvements. However, neither PointNet [36] nor PointNet++ [37] supports the convolution operation, which is a key contributing factor in Convolutional Neural Networks (CNNs) for efficient local processing and handling large-scale data.

A straightforward extension of 2D CNNs is to treat 3D space as a volumetric grid and use 3D convolution for analysis [49, 39]. Although these approaches have achieved success in tasks like object classification and indoor semantic segmentation [30, 9], they still have limitations such as cubic growth of memory requirements and high computational cost, leading to insufficient analysis and low prediction accuracy on large-scale scenes. Recently, [48, 44] proposed approximating such volumetric convolutions with point-based convolution operations, which greatly improves efficiency while preserving output accuracy. However, these methods still have difficulty capturing fine details on surfaces with relatively flat and thin structures.

Figure 1: Flattening Projection Convolution: flatten local patch onto a grid plane and then apply 2D convolutions.

In reality, data captured by 3D sensors and LiDAR are usually sparse: points fall near scene surfaces, with almost no points in the interior. Hence, surfaces are a more natural and compact representation for 3D data. To this end, works such as [10, 51] establish connections among points and apply graph convolutions in the corresponding spectral domain, or focus on the surface represented by the graph [40]; such graphs are usually impractical to create and sensitive to local topological structures.

More recently, [43, 33, 18] proposed learning convolution on a specified 2D plane. Inspired by these pioneering works, we develop FPConv, a new convolution operation for point clouds. It works directly on the local surface geometry without any intermediate grid or graph representation. Similar to [43], it works in a projection-interpolation manner, but is more general and implicit. Our key observation is that projection and interpolation can be simplified into a single weight map learning process. Instead of explicitly projecting onto the tangent plane [43] for convolution, FPConv learns how to diffuse the convolution weights of each point along the local surface, which is more robust to various input data and greatly improves the performance of surface-style convolution.

As a local feature learning module, FPConv can be integrated with other operations in classical neural network architectures and works on various analysis tasks. We demonstrate FPConv on 3D object classification as well as 3D scene semantic segmentation. Networks with FPConv outperform previous surface-style approaches [43, 18, 33] and achieve results comparable to current state-of-the-art methods. Moreover, our experiments also show that FPConv performs better in regions that are relatively flat, and thus is complementary to volumetric-type works; joint training helps to boost the overall performance to state-of-the-art results.

To summarize, the main contributions of this work are as follows:

  • FPConv, a novel surface-style convolution for efficient 3D point cloud analysis.

  • Significant improvements over previous surface-style convolution based methods and comparable performance with state-of-the-art volumetric-style methods in classification and segmentation tasks.

  • An in-depth analysis and comparison between surface-style and volumetric-style convolution, demonstrating that they are complementary to each other and joint training boosts the performance into state-of-the-art.

2 Related Work

Deep learning based 3D data analysis has been an active research topic in recent years. In this section, we mainly focus on point cloud analysis and briefly review previous works according to their underlying methodologies.

Volumetric-style point convolution Since a point cloud is distributed irregularly in 3D space without any regular structure, pioneering works resample points into grids so that conventional 3D convolutions can be applied, but they are limited by high computational load and low representation efficiency [30, 49, 39, 41]. PointNet [36] proposes a shared MLP applied to every point individually, followed by global max-pooling, to extract a global feature of the input point cloud. [37] extends it with nested partitionings of the point set to hierarchically learn more local features, and many works follow it to approximate point convolutions by MLPs [24, 25, 16, 46]. However, such a representation cannot capture local features very well. Recent works define explicit convolution kernels for points, whose weights are directly learned as in image convolutions [17, 50, 12, 2, 44]. Among them, KPConv [44] proposes a spatially deformable point convolution with any number of kernel points, which alleviates both varying densities and computational cost and outperforms all associated methods on point analysis tasks. However, these volumetric-style approaches may not capture uniform areas very well.

Graph-style point convolution Once relationships among points have been established, a graph-style convolution can be applied to explore and study the point cloud more efficiently than volumetric-style approaches. Convolution on a graph can be defined as convolution in its spectral domain [6, 15, 10]. ChebNet [10] adopts a Chebyshev polynomial basis to represent the spectral filters, alleviating the cost of explicitly computing the graph Fourier transform. Furthermore, [21] uses a localized first-order approximation of spectral graph convolutions for semi-supervised learning on graph-structured data, which greatly improves computational efficiency and classification results. However, these methods all depend on a specific graph structure. [51] then introduces a spectral parameterization of dilated convolution kernels and a spectral transformer network, sharing information across related but different shape structures. Meanwhile, [29, 5, 40, 32] focus on graph learning on manifold surface representations to avoid spectral-domain operations, while [45, 47] learn filters on edge relationships instead of relative point positions. Although a graph convolution combines features on local surface patches and can be invariant to deformations in Euclidean space, reasonable relationships among distinct points are not easy to establish.

Surface-style point convolution Since data captured by 3D sensors typically represent surfaces, another mainstream approach attempts to operate directly on surface geometry. Most works project a shape surface consisting of points onto an intermediate grid structure, e.g. multi-view RGB-D images, followed by conventional convolutions [13, 26, 31, 4, 23]. Such methods often suffer from the redundant multi-view representation and the ambiguity caused by different viewpoints. [43] proposes projecting the local neighborhood of each point onto its local tangent plane and processing it with 2D convolutions, which is efficient for analyzing dense point clouds of large-scale and outdoor environments. However, this method relies heavily on tangent estimation at each point, and the linear projection is not always optimal for complex areas. [33] optimizes the computation with parallel tangential frames, while [18] utilizes a 4-rotational symmetric field to define a domain for convolution on a surface, which not only increases robustness but also makes the most of detailed information. However, existing surface-style learning algorithms do not perform very well on challenging datasets such as S3DIS [1] and ScanNet [9], since they lose one dimension of information and cannot estimate the surface accurately.

Our method is inspired by surface-style point convolutions. The network learns a non-linear projection for each local patch, that is, it flattens the local neighborhood points onto a 2D grid plane. 2D convolutions can then be applied for feature extraction. Although learning on the surface loses one dimension of information, FPConv still achieves performance comparable to existing volumetric-style convolutions. In addition, FPConv can be combined with volumetric-style convolutions to achieve state-of-the-art results.

3 FPConv

Figure 2: Flattening module compared with traditional method: we design a module to learn local flattening directly instead of learning projection and interpolation separately.
Figure 3: Process of conducting FPConv on a local region centered around a point. The input coordinates and features come from neighbor points randomly picked within a given radius of that center point. The output feature is located at the center point.

In this section, we formally introduce FPConv. We first revisit the definition of convolution along point cloud surface and then show it can be simplified into a weight learning problem under discrete setting. All derivations are provided in the form of point clouds.

3.1 Learn Local Flattening


Let $p$ be a point from a point cloud $P$ and $F(p)$ be a scalar function defined over points. Here $F(p)$ can encode signals like color, geometry, or features from intermediate network layers. We denote by $N(p)$ a local point cloud patch centered at $p$, where $N(p) = \{ q_i \in P \mid \| q_i - p \|_2 < \rho \}$ with $\rho$ being the chosen radius.

Convolution on local surface:

In order to convolve $F$ around the surface, we first extend it to a continuous function over a continuous surface. We introduce a virtual 2D plane $S$ with a continuous signal $S(u)$, together with a map $\pi(\cdot)$ which maps $N(p)$ onto $S$ and satisfies

$$S(\pi(q_i)) = F(q_i). \tag{1}$$

The convolution at $p$ is defined as:

$$X(p) = \int_{S} c(u)\, S(u)\, du, \tag{2}$$

where $c(u)$ is a convolution kernel. We now describe how to formulate the above convolution into a weight learning problem.

Local flattening by learning projection weights:

With $\pi(\cdot)$, the points $q_i$ are mapped to scattered points in $S$, thus we need an interpolation method to estimate the full signal function $S(u)$, as shown in Eq.3.

$$S(u) = \sum_{i} w(u, \pi(q_i))\, S(\pi(q_i)) \tag{3}$$

Now we discretize $S$ into a grid plane of size $M_w \times M_h$. For each grid cell $v_j$, where $j \in \{1, 2, \dots, M_w \times M_h\}$, we have from Eq.1 and Eq.3:

$$S(v_j) = \sum_{i} w_{ji}\, F(q_i), \tag{4}$$

where $w_{ji} = w(v_j, \pi(q_i))$. Furthermore, we can rewrite Eq.2 in an approximate discretized form as:

$$X(p) = \sum_{j} c_j \sum_{i} w_{ji}\, F(q_i), \tag{5}$$

where $c_j$ are the discretized convolution kernel weights and $j \in \{1, 2, \dots, M_w \times M_h\}$. Let $L = M_w \times M_h$, $C = (c_1, \dots, c_L)^T$, $F = (F(q_1), \dots, F(q_N))^T$ with $N$ the number of points in $N(p)$, and $W \in \mathbb{R}^{N \times L}$ with entries $W_{ij} = w_{ji}$. Now we can see that projection and interpolation can be combined into a single weight matrix $W$, which depends only on the point locations w.r.t. the center point.

Figure 4: Left (Binary Sparsity): intensity at each position should be 0 or 1. Right (Continuous Sparsity): intensity can be in the range of 0 to 1.
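The combination of projection, interpolation, and convolution into one weighted sum can be checked numerically. The following is a minimal sketch, assuming a patch of three neighbor points and a 2x2 grid stored as four pixels; the helper names `flatten` and `fpconv_response` and all values are illustrative and do not come from the released code.

```python
def flatten(weights, feats):
    """Soft projection: weights[i][j] is the weight from point i to pixel j.

    Returns the grid signal, one value per pixel (the discretized plane).
    """
    n_pixels = len(weights[0])
    return [sum(weights[i][j] * feats[i] for i in range(len(feats)))
            for j in range(n_pixels)]


def fpconv_response(weights, feats, kernel):
    """Apply the discretized kernel to the flattened grid."""
    grid = flatten(weights, feats)
    return sum(c * s for c, s in zip(kernel, grid))


# Illustrative values: 3 neighbor points, a 2x2 grid stored as 4 pixels.
feats = [1.0, 2.0, 4.0]
weights = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.75],
]
kernel = [1.0, 1.0, 2.0, 0.5]

grid = flatten(weights, feats)
response = fpconv_response(weights, feats, kernel)

# Same result when each point is first given its own effective kernel weight,
# i.e. the kernel is "diffused" onto the points through the weight matrix.
per_point = [sum(k * w for k, w in zip(kernel, row)) for row in weights]
response_alt = sum(e * f for e, f in zip(per_point, feats))
```

Both orders of summation give the same response, which is why only the weight matrix needs to be learned.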

Figure 5: Network Architecture for Large Scene Segmentation: our segmentation architecture is composed of 4 downsampling layers for multi-scale analysis and applies skip connections to combine features from the encoder and decoder.

3.2 Implementation

According to Eq.5, we can design a module that learns the projection weights directly instead of learning projection and interpolation separately, as shown in Fig.2. We also want this module to have two properties: first, it should be invariant to input permutation, since the local point cloud is unordered; second, it should be adaptive to the input geometry, hence the projection should combine local coordinates with global information about the local patch. Therefore, we first use PointNet [36] to extract a global feature of the local region, namely the distribution feature, which is invariant to permutation. Then we concatenate the distribution feature to each of the input points, as shown in Fig.3. After that, a shared MLP is employed to predict the final projection weights.
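The data flow of this weight-prediction module can be sketched in a few lines. This is only an illustration under stated assumptions: single-layer "MLPs" with random parameters stand in for learned ones, the hidden width of 8 and the 6x6 grid are arbitrary choices, and the function names are hypothetical rather than taken from the released PyTorch code.

```python
import random


def linear(x, w, b):
    """Affine map of one feature vector: returns w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Random layer parameters; stands in for learned weights."""
    w = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return w, b


def predict_projection_weights(points, params):
    """points: list of local xyz coordinates. Returns one weight row per point."""
    (w1, b1), (w2, b2) = params
    # Shared per-point layer followed by max-pooling gives a
    # permutation-invariant "distribution feature" of the patch.
    lifted = [relu(linear(p, w1, b1)) for p in points]
    dist_feat = [max(col) for col in zip(*lifted)]
    # Concatenate the distribution feature to every point, then apply a
    # second shared layer that outputs one weight per grid pixel.
    return [linear(list(p) + dist_feat, w2, b2) for p in points]


rng = random.Random(0)
hidden, grid_pixels = 8, 6 * 6
params = (init_layer(rng, 3, hidden), init_layer(rng, 3 + hidden, grid_pixels))
patch = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
weights = predict_projection_weights(patch, params)
```

Because the distribution feature is a max over points and the remaining layers are shared, reordering the input points only reorders the rows of the predicted weight matrix.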

After projection, 2D convolution is applied on the obtained grid feature plane. To extract a local feature vector, global convolution or pooling can be applied on the last layer of the 2D convolution network.

However, the feature intensity of pixels in the grid plane may be unbalanced when the sum of feature intensities received from points in the local region varies, which can break the stability of the network and make training hard to converge. To balance the feature intensity of the grid plane, we further introduce two normalization methods on the learned projection weights.

Dense Grid Plane: Let the projection weight matrix be $W$, with one row per neighbor point and one column per grid pixel. One possible way to obtain a dense grid plane is to normalize $W$ along the first dimension by dividing by the column sums, so that the sum of intensities received at each pixel equals 1. This is similar to bilinear interpolation. In our implementation, we use softmax to avoid division by zero, as shown in Eq.6.

$$W_{ij} \leftarrow \frac{e^{W_{ij}}}{\sum_{k} e^{W_{kj}}} \tag{6}$$

Sparse Grid Plane: Due to the natural sparsity of point clouds, normalizing the projection weights to get a dense grid plane may not be optimal. In this case, we design a 2-step normalization which preserves the sparsity of the projection weight matrix, and hence of the grid plane. We conduct ablation studies on the two proposed normalization techniques.

The first step is to normalize $W$ along the second dimension to balance the intensity given out by the local neighbor points. Here, we add a positive $\epsilon$ to avoid division by zero. As shown in Eq.7, $W_{i\cdot}$ indicates the $i$-th row of $W$.

$$W_{i\cdot} \leftarrow \frac{W_{i\cdot}}{\sum_{j} W_{ij} + \epsilon} \tag{7}$$

The second step is to normalize $W$ along the first dimension to balance the intensity received at each pixel position. It can be implemented similarly to the first step by dividing by the sum of each column. However, we choose another method, shown in Eq.8, to maintain a continuous sparsity, where $W_{\cdot j}$ indicates the $j$-th column of $W$. Examples of continuous sparsity and binary sparsity are shown in Fig.4.
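The two normalization schemes can be compared on a small weight matrix. The sketch below is hedged: `dense_norm` follows the softmax description, and the first step of `sparse_norm` follows the row normalization with a small positive constant, but the second step is an assumption (division by the column maximum), since the exact form of Eq.8 is not reproduced in this text. Non-negative weights are also assumed.

```python
import math


def dense_norm(w):
    """Softmax over the point dimension: each pixel's column sums to 1."""
    n, m = len(w), len(w[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [w[i][j] for i in range(n)]
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        for i in range(n):
            out[i][j] = exps[i] / total
    return out


def sparse_norm(w, eps=1e-8):
    """Two-step normalization for non-negative weights.

    Step 1 follows the text: divide each row by its sum plus eps.
    Step 2 is an ASSUMPTION: divide each column by its maximum plus eps,
    which keeps zeros at zero and bounds intensities to [0, 1].
    """
    rows = [[v / (sum(row) + eps) for v in row] for row in w]
    col_max = [max(col) for col in zip(*rows)]
    return [[v / (m + eps) for v, m in zip(row, col_max)] for row in rows]


# Illustrative weights: 3 points (rows) and 3 pixels (columns);
# the last pixel receives nothing from any point.
w = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 3.0, 0.0]]
dense = dense_norm(w)
sparse = sparse_norm(w)
```

In this example the softmax fills the empty third pixel with uniform weights, while the two-step variant leaves it at zero, which is the sparsity-preserving behavior the text describes.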


4 Architecture

4.1 Residual FPConv Block

To build a deep network for segmentation and classification, we develop a bottleneck-design residual FPConv block inspired by [14], as shown in Fig.6. This block takes a point cloud as input and applies a stack of shared MLP, FPConv, and shared MLP, where the shared MLPs are responsible for reducing and then increasing (or restoring) dimensions, similar to the 1x1 convolutions in the residual bottleneck block [14].

Figure 6: Residual FPConv Block: operations at the shortcut connection are optional; the shared MLP is only required when the input feature dimension is not equal to the output feature dimension, which is similar to the projection shortcut [14]. FPS (Farthest Point Sampling [37]) and pooling are needed for downsampling.

4.2 Multi-Scale Analysis

As shown in Fig.6 and Fig.5, we design other operations for multi-scale analysis:

Farthest Point Sampling: we use iterative farthest point sampling to downsample the point cloud. As mentioned in PointNet++ [37], FPS has better coverage of the entire point set given the same number of centroids compared with random sampling.
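As an illustration, iterative farthest point sampling can be sketched as follows. The fixed starting index and the tiny point set are assumptions made for reproducibility (implementations often start from a random point), and `farthest_point_sampling` is a hypothetical helper name.

```python
import math


def farthest_point_sampling(points, n_samples, start=0):
    """Return indices of n_samples points chosen by iterative FPS."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_samples:
        # Pick the point that is farthest from the current sample set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
          (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
centroids = farthest_point_sampling(points, 3)
```

The near-duplicate point at index 1 is never selected here, which reflects the coverage property: each new centroid is as far as possible from those already chosen.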

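Pooling: we use max-pooling to group local features. Given an input point cloud $P$ and a downsampled point cloud $P_d$ with their corresponding features $F$ and $F_d$, we group the neighbors of each point in $P_d$ within radius $r$ and apply the pooling operator to the features of the grouped points, as shown in Eq.9, where $N_r(q) = \{ p \in P \mid \| p - q \|_2 < r \}$ for any $q \in P_d$.

$$F_d(q) = \max_{p \in N_r(q)} F(p) \tag{9}$$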
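A minimal sketch of this radius-grouped max-pooling is given below, assuming the downsampled centers are a subset of the input points so that every group contains at least the center itself. The helper name `radius_max_pool` and the sample values are illustrative.

```python
import math


def radius_max_pool(points, feats, centers, radius):
    """For each center, max-pool the features of points within `radius`."""
    pooled = []
    for c in centers:
        group = [f for p, f in zip(points, feats) if math.dist(p, c) < radius]
        # Element-wise max over the grouped feature vectors. The group is
        # non-empty as long as the center is one of the input points.
        pooled.append([max(channel) for channel in zip(*group)])
    return pooled


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0), (2.2, 0.0, 0.0)]
feats = [[1, 5], [3, 2], [0, 7], [4, 1]]
centers = [points[0], points[2]]  # e.g. indices returned by FPS
pooled = radius_max_pool(points, feats, centers, radius=1.0)
```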


FPConv with FPS: similar to the pooling operation, this block applies FPConv to each point of the downsampled point cloud and searches for neighbors over the full point cloud, as shown in Eq.10.


Upsampling: we use nearest-neighbor interpolation based on Euclidean distance to upsample the point cloud. Given a source point cloud with features and a target point cloud, we compute the feature of each target point by interpolating the features of its nearest neighbors found in the source point cloud.
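One common way to realize such an interpolation is inverse-distance weighting over the k nearest source points, as used for feature propagation in PointNet++ [37]. The text does not specify the weighting, so the scheme, the value k=3, and the helper name `knn_interpolate` in this sketch are assumptions.

```python
import math


def knn_interpolate(src_points, src_feats, dst_points, k=3, eps=1e-8):
    """Upsample features with inverse-distance-weighted k-NN interpolation."""
    dim = len(src_feats[0])
    out = []
    for q in dst_points:
        # Indices of the k source points closest to the target point q.
        nearest = sorted(range(len(src_points)),
                         key=lambda i: math.dist(q, src_points[i]))[:k]
        w = [1.0 / (math.dist(q, src_points[i]) + eps) for i in nearest]
        total = sum(w)
        out.append([sum(wi * src_feats[i][d] for wi, i in zip(w, nearest)) / total
                    for d in range(dim)])
    return out


src_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
src_feats = [[0.0], [1.0], [2.0]]
dst_points = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0)]
upsampled = knn_interpolate(src_points, src_feats, dst_points)
```

A target point that coincides with a source point essentially copies that point's feature, while a target equidistant from all three sources receives their mean.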

In the upsampling phase, a skip connection and a shared MLP are used to fuse features from the encoder and decoder. Nearest-neighbor upsampling and the shared MLP can be replaced by de-convolution, but this does not lead to a significant improvement, as mentioned in [44], so we do not employ it in our experiments.

The architecture shown in Fig.5 is designed for large scene segmentation and includes four layers of downsampling and upsampling for multi-scale analysis. For the classification task, we apply global pooling on the last downsampling layer to obtain a global feature representing the full point cloud, and then use a fully connected network for classification.

4.3 Fusing Two Convolutions

Figure 7: Parallel Residual Block: combine different types (Surface-Conv or Volume-Conv) of convolution kernel for fusion.

As one of our main contributions, we also try to answer the question “Can we combine two convolutions to further boost performance?” The answer is yes, but it only works when the two convolutions are of different types or complementary (please see Section 6), say surface-style and volumetric-style.

In this section, we propose two convenient and quick fusion strategies that combine two convolution operators in a single framework. The first fuses different convolutional features, similar to the inception network [42]. As shown in Fig.7, we design a parallel residual block: given an input feature, it applies multiple convolutions in parallel and then concatenates their outputs as the fused feature. This strategy is suitable for compatible methods, such as the SA Module of PointNet++ [37] and PointConv [48], both of which take a point cloud as input and apply the same downsampling strategy as our architecture.

For other, incompatible methods, such as TextureNet [18], which uses a mesh as an additional input, and KPConv [44], which applies grid downsampling, we use a second fusion strategy: we concatenate their output features at the second-to-last layer of the networks and then apply a tiny network for fusion.
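The concatenation at the heart of both strategies can be sketched as below. The two branches are trivial placeholders standing in for a surface-style and a volumetric-style convolution, the residual shortcut and the shared MLPs of the parallel block are omitted, and `parallel_block` is a hypothetical name.

```python
def parallel_block(features, branches):
    """Apply every branch to the same input and concatenate per-point outputs."""
    outputs = [branch(features) for branch in branches]
    return [sum((out[i] for out in outputs), []) for i in range(len(features))]


def surface_branch(feats):
    """Placeholder for a surface-style convolution such as FPConv."""
    return [[2.0 * v for v in f] for f in feats]


def volume_branch(feats):
    """Placeholder for a volumetric-style convolution such as PointConv."""
    return [[sum(f)] for f in feats]


fused = parallel_block([[1.0, 2.0], [3.0, 4.0]], [surface_branch, volume_branch])
```

Concatenation requires both branches to produce features for the same set of points, which is why the first strategy only applies when the two operators share a downsampling scheme.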

Method             Conv.   Samp.     ScanNet   S3DIS   S3DIS-6
PointNet [36]      V       FPS       -         41.1    47.6
PointNet++ [37]    V       FPS       33.9      -       -
PointCNN [25]      V       FPS       45.8      57.3    -
PointConv [48]     V       FPS       55.6      58.3    -
HPEIN [20]         V       FPS       61.8      61.9    67.8
KPConv [44]        V       Grid      68.4      65.4    69.6
TangentConv [43]   S       Grid      43.8      52.6    -
SurfaceConv [33]   S       -         44.2      -       -
TextureNet [18]    S       QF [18]   56.6      -       -
FPConv (ours)      S       FPS       63.9      62.8    68.7
FP ⊕ PointConv     S + V   -         -         64.4    -
FP ⊗ PointConv     S + V   FPS       -         64.8    -
FP ⊕ KPConv        S + V   -         -         66.7    -
Table 1: Mean IoU of large scene segmentation results. The second column is the convolution type (surface-style S or volumetric-style V) and the third column indicates the sampling strategy. S3DIS-6 represents 6-fold cross validation. ⊕ is fusion at the final feature level while ⊗ is fusion at the convolutional feature level by applying the parallel block. Note that the PointConv result of 58.3 on S3DIS is from our implementation.

Method            Conv.  Samp.  Accuracy
PointNet [36]     V      FPS    89.2
PointNet++ [37]   V      FPS    90.7
PointCNN [25]     V      FPS    92.2
PointConv [48]    V      FPS    92.5
KPConv [44]       V      Grid   92.9
FPConv (ours)     S      FPS    92.5
Table 2: Classification Accuracy on ModelNet40

Method            oA     mAcc   mIoU   ceil.  floor  wall   beam   col.   wind.  door   table  chair  sofa   book.  board  clut.
PointConv [48]    85.4   64.7   58.3   92.8   96.3   77.0   0.0    18.2   47.7   54.3   87.9   72.8   61.6   65.9   33.9   49.3
KPConv [44]       -      70.9   65.4   92.6   97.3   81.4   0.0    16.5   54.4   69.5   80.2   90.1   66.4   74.6   63.7   58.1
FPConv (ours)     88.3   68.9   62.8   94.6   98.5   80.9   0.0    19.1   60.1   48.9   80.6   88.0   53.2   68.4   68.2   54.9
FP ⊗ PointConv    88.2   70.2   64.8   92.8   98.4   81.6   0.0    24.2   59.1   63.0   79.5   88.6   68.1   67.9   67.2   52.4
FP ⊕ PointConv    88.6   71.5   64.4   94.2   98.5   82.4   0.0    25.5   62.9   63.1   79.8   87.9   53.5   68.3   67.1   54.5
KP ⊕ PointConv    89.4   71.5   65.5   94.6   98.4   81.4   0.0    17.8   56.0   71.7   78.9   90.1   66.8   72.6   65.0   58.7
FP ⊕ KPConv       89.9   72.8   66.7   94.5   98.6   83.9   0.0    24.5   61.1   70.9   81.6   89.4   60.3   73.5   70.8   57.8
Table 3: Detailed semantic segmentation scores on S3DIS Area-5. ⊕ represents fusion at the final feature level while ⊗ represents fusion at the convolutional feature level. Note that the PointConv row is from our implementation on S3DIS.

5 Experiments

To demonstrate the efficacy of our proposed convolution, we conduct experiments on point cloud semantic segmentation and classification tasks. ModelNet40 [49] is used for shape classification. Two large-scale datasets, Stanford Large-Scale 3D Indoor Space (S3DIS) [1] and ScanNet [9], are used for 3D point cloud segmentation. We implement FPConv in PyTorch [34]. A gradient descent optimizer with momentum is used to optimize a point-wise cross-entropy loss, with a momentum of 0.98 and an initial learning rate of 0.01 scheduled by a cosine LR scheduler [27]. Leaky ReLU and batch normalization are applied after each layer except the last fully connected layer. We trained our models for 100 epochs on S3DIS and ModelNet40, and for 300 epochs on ScanNet.
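For reference, a cosine learning-rate schedule starting at 0.01 can be sketched as follows. The sketch assumes a single annealing cycle over the whole run, per-epoch updates, and a final rate of zero; none of these details are stated in the text, and `cosine_lr` is a hypothetical helper rather than the scheduler used in the released code.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.01, min_lr=0.0):
    """Cosine-annealed learning rate for a given epoch index."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)


# Learning rate at every epoch of a 100-epoch run (e.g. S3DIS or ModelNet40).
schedule = [cosine_lr(e, 100) for e in range(101)]
```

The rate starts at the initial value, reaches half of it at the midpoint of training, and decays smoothly toward the minimum at the end.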

5.1 3D Shape Classification

ModelNet40 [49] contains 12311 3D mesh models from 40 categories, with 9843 for training and 2468 for testing. Normals are used as an additional input feature in our model. Moreover, random rotation about a single axis and jittering are used for data augmentation. As shown in Table.2, our model achieves state-of-the-art performance among surface-style methods.

5.2 Large Scene Semantic Segmentation

Data. S3DIS [1] contains 3D point clouds of 6 areas, comprising 272 rooms in total. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor, wall etc. plus clutter). To prepare the training data, 14k points are randomly sampled from a randomly picked block of 2m by 2m. Both sampling steps are performed on-the-fly during training. For testing, all points are covered. Each point is represented by a 9-dim vector of XYZ, RGB, and normalized location w.r.t. the room (from 0 to 1). In particular, the sampling rate for each point is 0.5 in every training epoch.

ScanNet [9] contains 1513 3D indoor scene scans, split into 1201 for training and 312 for testing. There are 21 classes in total and 20 classes are used for evaluation while 1 class for free space. Similar to S3DIS, we randomly sample the raw data in blocks then sample points on-the-fly during training. Each block is of size 2m by 2m, containing 11k points represented by a 6-dim vector, XYZ and RGB.

Pipeline for fusion. As mentioned in Section 4.3, we propose two fusion strategies for fusing convolution kernels of different types. In our experiments, we select PointConv [48] and the rigid version of KPConv [44] for comparison on S3DIS. We apply both fusion strategies to PointConv with FPConv, and only the second strategy, fusion at the final feature level, to FPConv with KPConv and to PointConv with KPConv. The rigid KPConv is used for fusion; its deformable version is omitted because no pre-trained model or hyper-parameter settings were released. Thus, in the rest of the paper, KPConv refers to rigid KPConv.

Figure 8: Visualization of semantic segmentation results on S3DIS Area 5. Panels: Input, Ground Truth, FPConv fused with KPConv, KPConv [44], FPConv. Images in the second row are zoomed-in versions of the first-row images. The two red bounding boxes mark two structures that neither KPConv [44] nor FPConv alone handles well, while their fusion handles both much better.
Figure 9: Visualization of segmentation results on ScanNet. Panels: Input, Ground Truth, Prediction.

Results. Following [36], we report results in two settings for S3DIS: evaluation on Area 5, and 6-fold cross validation (calculating the metrics with results from different folds merged). We report the mean of class-wise intersection over union (mIoU), overall point-wise accuracy (oA), and the mean of class-wise accuracy (mAcc). For ScanNet [9], we report the mIoU score tested on the ScanNet benchmark.
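The three metrics can all be derived from a class confusion matrix, as in the following sketch. Skipping classes that never appear is a convention assumed here, not one stated in the text, and `segmentation_metrics` is a hypothetical helper.

```python
def segmentation_metrics(conf):
    """conf[t][p] counts points with true class t predicted as class p."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(n))
    ious, accs = [], []
    for c in range(n):
        tp = conf[c][c]
        gt = sum(conf[c])                         # points whose label is c
        pred = sum(conf[t][c] for t in range(n))  # points predicted as c
        union = gt + pred - tp
        if union > 0:
            ious.append(tp / union)
        if gt > 0:
            accs.append(tp / gt)
    return {"oA": correct / total,
            "mAcc": sum(accs) / len(accs),
            "mIoU": sum(ious) / len(ious)}


# Two-class toy example: 20 points, 17 of them classified correctly.
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

Here oA and mAcc are both 0.85, while mIoU is lower because each false prediction enlarges the union of one class and shrinks the intersection of another.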

Results (mIoU) are shown in Table.1. Detailed results on S3DIS, including the mIoU of each class, are shown in Table.3. As we can see, FPConv outperforms all existing surface-style learning methods by large margins. Specifically, the mIoU of FPConv on the ScanNet [9] benchmark reaches 63.9%, which outperforms the previous best surface-style method by 7.3%. In addition, FPConv fused with KPConv achieves state-of-the-art performance on S3DIS.

Even though the S3DIS mIoU of FPConv is lower than that of KPConv, the IoUs of some classes, such as ceiling, floor, and board, are higher than those of KPConv. In particular, we find that all of these classes are flat objects, which should have small curvature. Based on this observation, we further conduct several ablation studies to explore the relationship between the segmentation performance of FPConv and object curvature, as shown in the next section. Visualizations of the results are shown in Fig.8 for S3DIS and Fig.9 for ScanNet.

6 Ablation Study

Two ablation studies are conducted. The first explores the fusion of surface-style and volumetric-style convolutions. The second examines the effect of detailed configurations, namely the normalization method and the plane size, on FPConv.

6.1 On Fusion of S.Conv and V.Conv

We first show an experimental finding that the two convolutions are complementary and good at analyzing different kinds of scenes, and then study the performance of different methods of combining them.

Performance vs. Curvature

Following the experiments in Section 5.2, we claim that FPConv performs better on areas with small curvature. To make this more convincing, we analyze the relationship between overall accuracy and curvature, shown on the left of Fig.10. We can see that FPConv outperforms PointConv [48] and KPConv [44] when curvature is small, and that FPConv does not perform very well on structures with large curvature. Moreover, the histogram of point curvatures shown on the right of Fig.10 implies that almost all points have either large or small curvature. This explains why there is a sharp performance degradation when curvature increases. Furthermore, in Fig.11 we highlight in red the points with incorrect predictions (middle) and the points with large curvature (right). It is obvious that incorrect predictions are concentrated in areas with large curvature and that FPConv performs well in flat areas.

Figure 10: Left: curvature versus cumulative accuracy. Right: histogram of curvatures. Both of them are based on S3DIS area 5
Figure 11: Relationship between accuracy and curvature. Left: raw point cloud. Middle: prediction of FPConv with incorrect points highlighted in red, and green for correct. Right: points with large curvature are highlighted in red. We can see that distribution of incorrect points is consistent with large curvature points

Ablation analysis on fusion method

As mentioned above, FPConv, a surface-style convolution, performs better in flat areas and worse in rough areas, while KPConv, a volumetric-style convolution, behaves in the opposite way. We believe that they can be complementary to each other and conduct 4 fusion experiments: FPConv ⊗ PointConv, FPConv ⊕ PointConv, KPConv ⊕ PointConv, and FPConv ⊕ KPConv, where ⊕ represents fusion at the final feature level and ⊗ represents fusion at the conv level. We do not fuse FPConv and KPConv at the conv level because of their incompatible downsampling strategies. As shown in Table.3, fusing FPConv with PointConv or KPConv brings a clear improvement (e.g. 66.7 mIoU for FPConv with KPConv versus 65.4 for KPConv alone), while fusing PointConv with KPConv brings little improvement (65.5 versus 65.4). Therefore, we can claim that FPConv is complementary to volumetric-style convolutions, which may guide convolution design for point clouds in the future.

Visual results are shown in Fig.8. FPConv captures flat structures better than KPConv, such as the column class, which does not appear in the KPConv prediction, while KPConv captures complex structures better, such as the door. Moreover, the fusion of KPConv and FPConv achieves better results than either KPConv or FPConv alone.

Method                  mIoU   mAcc   oAcc
w/ sparse norm + 6x6    62.8   69.0   88.3
w/ dense norm + 6x6     61.6   68.5   87.6
w/o norm + 6x6          59.8   67.1   86.2
w/ sparse norm + 5x5    61.8   68.1   88.4
Table 4: Different normalization results on S3DIS area 5. 6x6 and 5x5 represent different plane sizes.

6.2 On FPConv Architecture Design

We conduct 4 experiments, shown in Table.4, to study the influence of the normalization method and the grid plane size on the performance of FPConv. The results show that sparse-norm, the 2-step normalization method described in Section 3.2, performs better than dense-norm (62.8 vs. 61.6 mIoU). In addition, a higher-resolution grid plane may achieve better performance (62.8 mIoU at 6x6 vs. 61.8 at 5x5), at the price of higher memory cost.

7 Conclusion

In this work, we propose FPConv, a novel surface-style convolution operator for 3D point clouds. FPConv takes a local region of the point cloud as input and flattens it onto a 2D grid plane by predicting projection weights, followed by regular 2D convolutions. Our experiments demonstrate that FPConv significantly improves the performance of surface-style convolution methods. Furthermore, we find that surface-style convolution is complementary to volumetric-style convolution and that joint training can boost performance to the state of the art. We believe that surface-style convolutions can play an important role in feature learning for 3D data and are a promising direction to explore.


This work was supported in part by grants No.2018B030338001, NSFC-61902334, NSFC-61629101, No.2018YFB1800800, No.ZDSYS201707251409055 and No.2017ZT07X152.


The supplementary material contains:

  • The results of the proposed fusion model between FPConv and PointConv [48] on ScanNet [9].

  • The results of the proposed fusion model between FPConv and KPConv-deform [44] on S3DIS [1].

  • More qualitative and quantitative results on large-scale scene segmentation tasks.

A. More results of the proposed fusion strategy

a. Fusing FPConv and PointConv on ScanNet

We conduct experiments on the fusion of FPConv with PointConv [48] on ScanNet [9]. The results are reported in Table.5, where all methods are run under the same settings (architecture, hyper-parameters, etc.). Note that we reduce the number of sampled points to 8k in a block of 1.5m by 1.5m for all experiments.

Method                        mIoU   mA     oA
PointConv [48]                55.6   -      -
PointConv (our re-impl.)      60.3   72.3   83.6
FPConv (ours)                 63.0   75.6   85.3
FPConv fused with PointConv   64.2   76.1   86.0
Table 5: Quantitative results of the segmentation task on the evaluation dataset of ScanNet. PointConv (our re-impl.) indicates our re-implementation of PointConv [48].

b. Fusing FPConv and KPConv-deform on S3DIS

We further report the results of the proposed fusion model between FPConv and KPConv-deform [44] on S3DIS [1] in Table.6, where the results of each class are also shown. As seen, the proposed fusion model achieves the highest mIoU in the table (68.2), reaching the state of the art.

B. More Results on Segmentation Tasks

We provide more details of our experimental results. As shown in Table 7, we compare FPConv with other popular methods on S3DIS [1] under 6-fold cross validation. FPConv achieves higher scores on flat-shaped objects such as ceiling, floor, table and board, whereas KPConv [44], a volumetric-style method, performs better on complex structures. More visual results are shown in Fig. 12 and Fig. 13.

Method              mIoU   ceil.  floor  wall   beam   col.   wind.  door   table  chair  sofa   book.  board  clut.
KPConv-rigid [44]   65.4   92.6   97.3   81.4   0.0    16.5   54.4   69.5   80.2   90.1   66.4   74.6   63.7   58.1
KPConv-deform [44]  67.1   92.8   97.3   82.4   0.0    23.9   58.0   69.0   81.5   91.0   75.4   75.3   66.7   58.9
FPConv (ours)       62.8   94.6   98.5   80.9   0.0    19.1   60.1   48.9   80.6   88.0   53.2   68.4   68.2   54.9
FP + KP-rigid       66.7   94.5   98.6   83.9   0.0    24.5   61.1   70.9   81.6   89.4   60.3   73.5   70.8   57.8
FP + KP-deform      68.2   94.2   98.5   83.7   0.0    24.7   63.0   66.6   82.5   91.9   76.7   75.6   70.5   59.1
Table 6: Fusion results on S3DIS Area 5. "+" indicates fusion at the final feature level.

Method              mIoU   ceil.  floor  wall   beam   col.   wind.  door   table  chair  sofa   book.  board  clut.
PointNet [36]       47.6   88.0   88.7   69.3   42.4   23.1   47.5   51.6   54.1   42.0   9.6    38.2   29.4   35.2
RSNet [19]          56.5   92.5   92.8   78.6   32.8   34.4   51.6   68.1   59.7   60.1   16.4   50.2   44.9   52.0
SPGraph [22]        62.1   89.9   95.1   76.4   62.8   47.1   55.3   68.4   69.2   73.5   45.9   63.2   8.7    52.9
PointCNN [25]       65.4   94.8   97.3   75.8   63.3   51.7   58.4   57.2   69.1   71.6   61.2   39.1   52.2   58.6
HPEIN [20]          67.8   -      -      -      -      -      -      -      -      -      -      -      -      -
KPConv-rigid [44]   69.6   93.7   92.0   82.5   62.5   49.5   65.7   77.3   64.0   57.8   71.7   68.8   60.1   59.6
KPConv-deform [44]  70.6   93.6   92.4   83.1   63.9   54.3   66.1   76.6   64.0   57.8   74.9   69.3   61.3   60.3
FPConv (ours)       68.7   94.8   97.5   82.6   42.8   41.8   58.6   73.4   71.0   81.0   59.8   61.9   64.2   64.2
Table 7: Detailed semantic segmentation scores on S3DIS under 6-fold cross validation.

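The 6-fold protocol behind Table 7 holds out one S3DIS area per fold and trains on the remaining five. The sketch below only enumerates the splits; the generator name `six_fold_splits` is hypothetical, and how scores are aggregated across folds is not specified here, so it is left out.

```python
S3DIS_AREAS = [1, 2, 3, 4, 5, 6]  # S3DIS is divided into six areas


def six_fold_splits(areas=S3DIS_AREAS):
    """Yield (train_areas, test_area) pairs, holding out one area per fold."""
    for held_out in areas:
        train = [a for a in areas if a != held_out]
        yield train, held_out


if __name__ == "__main__":
    for train_areas, test_area in six_fold_splits():
        print("train on areas", train_areas, "test on area", test_area)
```

The fold that holds out area 5 corresponds to the single-split "Area 5" setting used in Table 6.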
Figure 12: Visualization of semantic segmentation results of FPConv on ScanNet.
Figure 13: Qualitative comparisons of semantic segmentation on S3DIS Area 5. Panels, in order: Input; Ground Truth; FPConv + KPConv; KPConv; FPConv. "+" indicates fusion at the final feature level.


References

  1. I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer and S. Savarese (2016) 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition.
  2. M. Atzmon, H. Maron and Y. Lipman (2018) Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091.
  3. A. Behl, O. Hosseini Jafari, S. Karthik Mustikovela, H. Abu Alhaija, C. Rother and A. Geiger (2017) Bounding boxes, segmentations and object coordinates: how important is recognition for 3d scene flow estimation in autonomous driving scenarios? In Proceedings of the IEEE International Conference on Computer Vision, pp. 2574–2583.
  4. A. Boulch, B. Le Saux and N. Audebert (2017) Unstructured point cloud semantic labeling using deep segmentation networks. 3DOR 2, pp. 7.
  5. M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
  6. J. Bruna, W. Zaremba, A. Szlam and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
  7. A. P. Charaniya, R. Manduchi and S. K. Lodha (2004) Supervised parametric classification of aerial lidar data. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 30–30.
  8. N. Chehata, L. Guo and C. Mallet (2009) Contribution of airborne full-waveform lidar and image data for urban scene classification. In 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 1669–1672.
  9. A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
  10. M. Defferrard, X. Bresson and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852.
  11. A. Golovinskiy, V. G. Kim and T. Funkhouser (2009) Shape-based recognition of 3d point clouds in urban environments. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2154–2161.
  12. F. Groh, P. Wieschollek and H. P. Lensch (2018) Flex-convolution. In Asian Conference on Computer Vision, pp. 105–122.
  13. S. Gupta, P. Arbeláez, R. Girshick and J. Malik (2015) Indoor scene understanding with rgb-d images: bottom-up segmentation, object detection and semantic segmentation. International Journal of Computer Vision 112 (2), pp. 133–149.
  14. K. He, X. Zhang, S. Ren and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  15. M. Henaff, J. Bruna and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
  16. P. Hermosilla, T. Ritschel, P. Vázquez, À. Vinacua and T. Ropinski (2018) Monte carlo convolution for learning on non-uniformly sampled point clouds. In SIGGRAPH Asia 2018 Technical Papers, pp. 235.
  17. B. Hua, M. Tran and S. Yeung (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993.
  18. J. Huang, H. Zhang, L. Yi, T. Funkhouser, M. Nießner and L. J. Guibas (2019) Texturenet: consistent local parametrizations for learning from high-resolution signals on meshes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4440–4449.
  19. Q. Huang, W. Wang and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635.
  20. L. Jiang, H. Zhao, S. Liu, X. Shen, C. Fu and J. Jia (2019) Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10433–10441.
  21. T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  22. L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567.
  23. F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan and M. Felsberg (2017) Deep projective 3d semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107.
  24. J. Li, B. M. Chen and G. Hee Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406.
  25. Y. Li, R. Bu, M. Sun and B. Chen (2018) PointCNN. arXiv preprint arXiv:1801.07791.
  26. Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng and L. Lin (2016) Lstm-cf: unifying context modeling and fusion with lstms for rgb-d scene labeling. In European conference on computer vision, pp. 541–557.
  27. I. Loshchilov and F. Hutter (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  28. A. Martinovic, J. Knopp, H. Riemenschneider and L. Van Gool (2015) 3d all the way: semantic segmentation of urban scenes from start to end in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4456–4465.
  29. J. Masci, D. Boscaini, M. Bronstein and P. Vandergheynst (2015) Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pp. 37–45.
  30. D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928.
  31. J. McCormac, A. Handa, A. Davison and S. Leutenegger (2017) Semanticfusion: dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and automation (ICRA), pp. 4628–4635.
  32. F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124.
  33. H. Pan, S. Liu, Y. Liu and X. Tong (2018) Convolutional neural networks on 3d surfaces using parallel frames. arXiv preprint arXiv:1808.04952.
  34. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W.
  35. A. Pomares, J. L. Martínez, A. Mandow, M. A. Martínez, M. Morán and J. Morales (2018) Ground extraction from 3d lidar point clouds with the classification learner app. In 2018 26th Mediterranean Conference on Control and Automation (MED), pp. 1–9.
  36. C. R. Qi, H. Su, K. Mo and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
  37. C. R. Qi, L. Yi, H. Su and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108.
  38. J. Rambach, A. Pagani and D. Stricker (2017) [POSTER] augmented things: enhancing ar applications leveraging the internet of things and universal 3d object tracking. In 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pp. 103–108.
  39. G. Riegler, A. Osman Ulusoy and A. Geiger (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586.
  40. M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702.
  41. S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva and T. Funkhouser (2017) Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754.
  42. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
  43. M. Tatarchenko, J. Park, V. Koltun and Q. Zhou (2018) Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3887–3896.
  44. H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. arXiv preprint arXiv:1904.08889.
  45. N. Verma, E. Boyer and J. Verbeek (2018) Feastnet: feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2598–2606.
  46. S. Wang, S. Suo, W. Ma, A. Pokrovsky and R. Urtasun (2018) Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2589–2597.
  47. Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146.
  48. W. Wu, Z. Qi and L. Fuxin (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630.
  49. Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920.
  50. Y. Xu, T. Fan, M. Xu, L. Zeng and Y. Qiao (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102.
  51. L. Yi, H. Su, X. Guo and L. J. Guibas (2017) Syncspeccnn: synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290.
  52. X. Yue, B. Wu, S. A. Seshia, K. Keutzer and A. L. Sangiovanni-Vincentelli (2018) A lidar point cloud generator: from a virtual world to autonomous driving. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 458–464.