Neighbors Do Help: Deeply Exploiting Local Structures of Point Clouds

Yiru Shen (1, 2), Chen Feng (1, 3), Yaoqing Yang (4), Dong Tian (3)
(2) Clemson University  (3) Mitsubishi Electric Research Laboratories (MERL)  (4) Carnegie Mellon University
(1) The authors contributed equally. This work is supported by MERL.

Unlike on images, semantic learning on 3D point clouds using a deep network is challenging due to the naturally unordered data structure. Among existing works, PointNet has achieved promising results by directly learning on point sets. However, it does not take full advantage of a point’s local neighborhood that contains fine-grained structural information which turns out to be helpful towards better semantic learning. In this regard, we present two new operations to improve PointNet with more efficient exploitation of local structures. The first one focuses on local 3D geometric structures. In analogy with a convolution kernel for images, we define a point-set kernel as a set of learnable points that jointly respond to a set of neighboring data points according to their geometric affinity measured by kernel correlation, adapted from a similar technique for point cloud registration. The second one exploits local feature structures by recursive feature aggregation on a nearest-neighbor-graph computed from 3D positions. Experiments show that our network is able to robustly capture local information and efficiently achieve better performance on major datasets.

1 Introduction

Figure 1: Visualization of learned kernel correlations. To represent the complex local geometric structure around a point, we propose to compute kernel correlation (see Section 3.1) as an affinity measure between two point sets: the kernel points and that point’s neighboring points. This figure displays kernel point positions and kernel widths as sphere centers and radii (top row) and the filter responses (second to fourth rows) of some of the 16 kernels learned in our network. Different colors indicate different levels of structural affinity measured in normalized kernel correlation (red: strongest, blue: weakest). Note the various geometric structures (plane, edge, corner, concave and convex surfaces) captured by different kernels.

As 3D data become ubiquitous with the rapid development of various 3D sensors, semantic understanding and analysis of such data using deep networks is gaining attention [19, 31, 12, 22, 1], due to its wide applications in robotics, autonomous driving, reverse engineering, and civil infrastructure monitoring. In particular, as one of the most primitive 3D data formats and often the raw 3D sensor output, 3D point clouds cannot be trivially consumed by deep networks in the same way as 2D images are by convolutional networks. This is mainly caused by the irregular organization of points, a fundamental challenge inherent in this raw data format: compared with a row-column indexed image, a point cloud is a set of point coordinates (possibly with attributes like intensities and surface normals) without an obvious ordering between points, except for point clouds computed from depth images.

Figure 2: Proposed architecture. Local geometric structures are exploited by the front-end kernel correlation layer, which computes an affinity between each data point’s nearest neighbor points and kernels of learnable point sets. The resulting responses are concatenated with the original 3D coordinates. Local feature structures are later exploited by multiple graph pooling layers, which share the same graph per 3D object instance, constructed offline from each point’s 3D Euclidean neighborhood. Finally, the classification net aggregates features by global max pooling as in [19], followed by an MLP to model class probabilities. Two versions of the segmentation architecture are proposed. One segmentation net directly applies a per-point MLP to model per-point class probabilities without the need for global feature replication. The other segmentation net associates local features with global features to boost performance. ReLU is used in each layer without Batchnorm. Dropout layers are used for the last MLP in the classification net. Solid arrows indicate forward operations with backward propagation, while dashed arrows mean no backward propagation.

Nevertheless, influenced by the success of convolutional networks for images, many works have focused on 3D voxels, i.e., regular 3D grids converted from point clouds prior to the learning process. Only then do the 3D convolutional networks learn to extract features from voxels [15, 32, 20, 14]. However, to avoid intractable computation time and memory consumption, such methods usually work only at a small spatial resolution, which results in quantization artifacts and difficulty in learning fine details of geometric structures, except for a few recent improvements using octrees [22, 31].

Different from convolutional nets, PointNet [19] provides an effective and simple architecture to directly learn on point sets by first computing individual point features with a per-point Multi-Layer Perceptron (MLP) and then aggregating all those features into a global representation of a point cloud. While achieving state-of-the-art results in different 3D semantic learning tasks, the “jump” from per-point features directly to the global feature suggests that PointNet does not take full advantage of a point’s local structure to capture fine-grained patterns: a per-point MLP output encodes roughly only the information on the existence of a 3D point in a certain nonlinear partition of the 3D space. A more efficient representation is expected if the MLP can encode not only “whether” a point exists but also “what type of form” (e.g., corner vs. planar, convex vs. concave, etc.) a point exists in within the nonlinear 3D space partition. Such “type” information has to be learned from the point’s local neighborhood on the 3D object surface, which is the main motivation of this paper.

The follow-up PointNet++ [21] attempts to address the above issue by segmenting a point set into smaller clusters, sending each through a small PointNet, and repeating this process iteratively on higher-dimensional feature point sets, which leads to a complicated architecture with reduced speed. We explore another direction: are there efficient learnable local operations with clear geometric interpretations that can directly augment and improve the original PointNet while maintaining its simple architecture?

To address the above question, in this paper, we focus on supervised learning of 3D point cloud representations by improving PointNet with local geometric and feature structures using two new operations as depicted in Figure 2. We summarize these new operations in the following main contributions of the paper:

  • We propose a kernel correlation layer exploiting local geometric structures, with a geometric interpretation.

  • We propose a graph-based pooling layer exploiting local feature structures to enhance network robustness.

  • We efficiently improve point cloud semantic learning tasks using the two new operations.

The remainder of this paper is organized as follows: we first review related works on various local geometric properties and on deep learning on graph-structured data in Section 2. We then explain the two new operations in detail in Section 3. In Section 4, we evaluate the performance of the proposed model on benchmark datasets (MNIST, ModelNet10, ModelNet40, and ShapeNet part). Finally, we conclude the work in Section 5.

2 Related Works

2.1 Local Geometric Properties

We first discuss some local geometric properties frequently used in 3D data and how they lead us to modify kernel correlation as a tool to enable potentially complex data-driven characterization of local geometric structures.

Surface Normal. As a basic surface property, surface normals are heavily used in many areas including 3D shape reconstruction, plane extraction, and point set registration [29, 18, 30, 6, 2]. They usually come directly from CAD models or can be estimated by Principal Component Analysis on the data covariance matrix of neighboring points as the minimal variance direction [7]. Using per-point surface normals in PointNet corresponds to modeling a point’s local neighborhood as a plane, which is shown in [19, 21] to improve performance compared with using only 3D coordinates. This meets our previous expectation that a point’s “type” along with its position should enable better representation. Yet, this also leads us to a question: since normals can be estimated from 3D coordinates (unlike colors or intensities), meaning the “amount of information” in the two kinds of input data is almost the same, why can PointNet with only 3D coordinate input not learn to achieve the same performance? We believe it is due to the following: a) the per-point MLP cannot capture neighboring information from just 3D coordinates, and b) global pooling either cannot achieve that or is not efficient enough to do so.

Covariance Matrix. A second-order description of a local neighborhood is given by the data covariance matrix, which has also been widely used, along with normals, in cases such as plane extraction and curvature estimation [6, 4]. Following the same line of thought as for normals, the information provided by the local data covariance matrix is in fact richer than normals, as it models the local neighborhood as an ellipsoid, which includes lines and planes in rank-deficient cases. We also observe empirically that it is better than normals for semantic learning.
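To make the two hand-crafted descriptions above concrete, the following is a minimal NumPy sketch of a local covariance matrix and of a PCA-based normal taken as its minimal-variance direction. The function names `local_covariance` and `estimate_normal` are illustrative assumptions and not part of the original work; the sign of the normal is left ambiguous, as PCA alone cannot determine it.

```python
import numpy as np

def local_covariance(neighbors):
    """Covariance matrix (D x D) of a neighborhood given as a (k, D) array."""
    centered = neighbors - neighbors.mean(axis=0, keepdims=True)
    return centered.T @ centered / len(neighbors)

def estimate_normal(neighbors):
    """Unit normal: eigenvector of the covariance with the smallest eigenvalue."""
    cov = local_covariance(neighbors)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigvecs[:, 0]

# Example: noisy samples of the plane z = 0 should give a normal close to +/- z.
rng = np.random.default_rng(0)
patch = np.column_stack([rng.uniform(-1, 1, 50),
                         rng.uniform(-1, 1, 50),
                         rng.normal(0.0, 1e-3, 50)])
normal = estimate_normal(patch)
```

A rank-deficient covariance (one or two near-zero eigenvalues) corresponds to the plane and line cases mentioned in the text.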

Kernel Correlation. For 3D semantic object classification of fine-grained categories, or 3D semantic segmentation, more detailed analysis of each point’s local neighborhood is naturally expected to improve performances. For these tasks, covariance matrices may not be descriptive enough, due to the fact that point sets of completely different shapes can share a similar data covariance matrix. After all, surface normals and covariance matrices are only hand-crafted fixed descriptions. While treating local neighboring points as a small point cloud and describing it using a small PointNet as in [21] is one natural way to learn such descriptions, given that PointNet can universally approximate a point cloud, this way might not be the most efficient one. Instead, we would like to find a learnable description that is efficient, simple, and has a clear geometric interpretation just as the above two hand-crafted ones, so that it can be directly plugged into the original elegant PointNet architecture.

To ensure that the description has a clear geometric meaning, we would like the learnable parameters in this description to still be 3D points, just as image convolution kernels are still images. For images, convolution (often implemented as cross-correlation) is used to quantify the similarity between the input image and the convolution kernel [13]. However, in the face of the aforementioned challenge of directly using convolution on point clouds, how can we measure the correlation between two point sets? This question leads us to kernel correlation [28, 9] as a tool that naturally fulfills our design goals. It has been shown that kernel correlation as a function of pairwise point distance is an efficient way to measure geometric affinity between 2D/3D point sets, and it has been used in point cloud registration and feature correspondence problems [23, 28, 9]. For registration in particular, a source point cloud is transformed to best match a reference one by iteratively refining a rigid/non-rigid transformation between the two to maximize their kernel correlation response.

Thus, in our network’s front-end, we take inspiration from such algorithms and treat a point’s local neighborhood as the source in kernel correlation, and a set of learnable points, i.e., a kernel, as the reference that characterizes certain types of local geometric structures/shapes. We modify the original kernel correlation computation by allowing the reference to freely adjust its shape (kernel point positions) through backward propagation. Note the change of perspective here compared with point set registration: we want to learn template/reference shapes through a free per-point transformation, instead of using a fixed template/reference to find an optimal transformation between source and reference point sets. In this way, a set of learnable kernel points is analogous to a convolutional kernel, which activates to points only in its joint neighboring regions and captures local geometric structures within this receptive field characterized by the kernel function and its kernel width. Under this setting, the learning process can be viewed as finding a set of reference/template points encoding the most effective and useful local geometric structures that lead to the best learning performance jointly with other parameters in the network.

2.2 Deep Learning on Graph

In our kernel correlation computation, to efficiently store the local neighborhood of points, we build a 3D neighborhood graph by considering each point as a vertex, with edges connecting only nearby vertices. This graph is also useful for later computations in the proposed deep network. In addition to exploiting local geometric structure, which is only performed in the network front-end, and inspired by the ability of convolutional networks to locally aggregate features and gradually increase receptive fields through multiple layers, we exploit local feature structures in the top layers of our network by recursive feature propagation and aggregation along edges in the same 3D neighborhood graph used for kernel correlation computation. Our key insight is that neighboring points tend to have similar geometric structures, and hence propagating features through the neighborhood graph helps to learn more robust local patterns. Note that we specifically avoid changing this neighborhood graph structure in top layers, which is also analogous to convolution on images: even though input image feature channels expand in top layers of convolutional nets, each pixel’s spatial ordering and neighborhood remain unchanged except for pooling operations. In fact, our neighborhood graph is constructed offline for each input point cloud.

Nonetheless, we see such an operation as well aligned with a rising trend of using graphs in deep learning. Naturally, a graph representation is flexible enough for irregular or even non-Euclidean data such as point clouds, user data on social networks, text documents, and gene data [11, 26, 10, 17, 16, 1]. A k-nearest-neighbor (KNN) graph is usually used to establish local connectivity information in point cloud applications such as surface detection, 3D object recognition, 3D object segmentation, and compression [5, 25, 27]. Note how our graph construction differs from recent graph-based learning where each node corresponds to a data instance and a single graph corresponds to a whole dataset [11].
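As an illustration of the per-object neighborhood graph described in this section, the following is a minimal brute-force NumPy sketch of a KNN graph. The function name `knn_graph`, the O(N²) distance computation, and the choice to exclude a point from its own neighbor list are assumptions made for this example; the paper only states that the graph is built offline from each point’s 3D Euclidean neighborhood.

```python
import numpy as np

def knn_graph(points, k):
    """Return (N, k) neighbor indices and distances, excluding the point itself."""
    diff = points[:, None, :] - points[None, :, :]   # (N, N, D) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)             # (N, N) Euclidean distances
    np.fill_diagonal(dist, np.inf)                   # assumption: no self-neighbors
    idx = np.argsort(dist, axis=1)[:, :k]            # k closest points per row
    return idx, np.take_along_axis(dist, idx, axis=1)

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(128, 3))
neighbors, neighbor_dist = knn_graph(cloud, k=16)
```

Because the graph depends only on the input coordinates, it can be computed once per point cloud and reused by every layer that needs neighborhoods. For large clouds a spatial index would likely replace the brute-force distance matrix.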

3 Method

We now explain the details of learning local geometric and feature structures over point neighborhoods by: (i) kernel correlation that measures geometric affinity of point sets, and (ii) a k nearest neighbor graph that propagates local features across vertices between neighboring points. Figure 2 illustrates our full network architecture.

3.1 Learning on Local Geometric Structure

We adapt ideas of the Leave-one-out Kernel Correlation (LOO-KC) and the multiply-linked registration cost function in [28] to capture local geometric structures of a point cloud. Let us define our kernel correlation (KC) between a point-set kernel $\kappa$ with $M$ learnable points and the current anchor point $x_i$ in a point cloud of $N$ points as:

$$\mathrm{KC}(\kappa, x_i) = \frac{1}{|\mathcal{N}(i)|} \sum_{m=1}^{M} \sum_{n \in \mathcal{N}(i)} K_\sigma(\kappa_m, x_n - x_i), \qquad (1)$$

where $\kappa_m$ is the m-th learnable point in the kernel, $\mathcal{N}(i)$ the neighborhood index set of the anchor point $x_i$ from the KNN graph, and $x_n$ one of $x_i$’s neighbor points. $K_\sigma(\cdot,\cdot): \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ is any valid kernel function ($D = 2$ or $3$ for 2D or 3D point clouds). Following [28], without loss of generality, we choose the Gaussian kernel in this paper:

$$K_\sigma(k, \delta) = \exp\left(-\frac{\|k - \delta\|^2}{2\sigma^2}\right), \qquad (2)$$

where $\|\cdot\|$ is the Euclidean distance between two points and $\sigma$ is the kernel width that controls the influence of the distance between points. One nice property of the Gaussian kernel is that it decays exponentially as a function of the distance between the two points, providing a soft assignment from each kernel point to the neighboring points of the anchor point, which relaxes the non-differentiable hard assignment in ordinary ICP. Our KC encodes pairwise distances between kernel points and neighboring data points and increases as the two point sets become similar in shape; hence it can be clearly interpreted as a geometric similarity measure. Note the importance of choosing the kernel width here, since a $\sigma$ that is either too large or too small will lead to undesired performance, similar to the same issue in kernel density estimation. Fortunately, for 2D or 3D space in our case, this parameter can still be chosen empirically as the average neighbor distance in the neighborhood graphs over all training point clouds.
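The kernel correlation response described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the response sums a Gaussian of the distance between every kernel point and every neighbor offset (neighbor minus anchor) and divides by the neighborhood size, and every anchor has the same number of neighbors `k`. The function name `kernel_correlation` and the toy data are hypothetical.

```python
import numpy as np

def kernel_correlation(points, neighbors, kernel, sigma):
    """KC response of one point-set kernel at every anchor point.

    points:    (N, D) point coordinates
    neighbors: (N, k) indices of each anchor's neighbors
    kernel:    (M, D) learnable kernel points
    sigma:     Gaussian kernel width
    """
    # Neighbor offsets relative to each anchor: (N, k, D)
    offsets = points[neighbors] - points[:, None, :]
    # Differences between every kernel point and every offset: (N, M, k, D)
    diff = kernel[None, :, None, :] - offsets[:, None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    gauss = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Sum over kernel points and neighbors, normalize by neighborhood size
    return gauss.sum(axis=(1, 2)) / neighbors.shape[1]

# Toy example: an anchor at the origin with two neighbors on the x and y axes.
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
neighbors = np.array([[1, 2], [0, 2], [0, 1]])
matching_kernel = np.array([[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])  # same shape as anchor 0's neighborhood
far_kernel = matching_kernel + 1.0                              # shifted away from it

resp_match = kernel_correlation(points, neighbors, matching_kernel, sigma=0.05)
resp_far = kernel_correlation(points, neighbors, far_kernel, sigma=0.05)
```

In this toy case the kernel that coincides with anchor 0’s neighbor offsets responds strongly there, while the shifted kernel responds with a value near zero, which matches the interpretation of KC as a shape similarity measure.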

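To complete the description of the proposed new learnable layer, given $\mathcal{L}$ as the network loss function and $d_i = \partial \mathcal{L} / \partial \mathrm{KC}(\kappa, x_i)$ as its derivative w.r.t. each point $x_i$’s KC response propagated back from top layers, we provide the back-propagation equation for each kernel point $\kappa_m$ as:

$$\frac{\partial \mathcal{L}}{\partial \kappa_m} = \sum_{i=1}^{N} \alpha_i d_i \left[ \sum_{n \in \mathcal{N}(i)} v_{m,i,n} \exp\left(-\frac{\|v_{m,i,n}\|^2}{2\sigma^2}\right) \right], \qquad (3)$$

where point $x_i$’s normalizing constant is $\alpha_i = -\frac{1}{|\mathcal{N}(i)|\,\sigma^2}$, and the local difference vector is $v_{m,i,n} = \kappa_m + x_i - x_n$ (with $\mathcal{N}(i)$ the neighborhood index set of $x_i$ and $\sigma$ the kernel width).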
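The back-propagation rule for kernel points can be checked numerically. The sketch below assumes a Gaussian kernel response normalized by neighborhood size, writes the analytic gradient with a normalizing constant of -1/(k·sigma²) and difference vectors `kernel_m + x_i - x_n`, and compares it against central finite differences. All function names are hypothetical, and the random neighbor lists are only test data, not a real KNN graph.

```python
import numpy as np

def _diff_and_weight(points, neighbors, kernel, sigma):
    """v[i, m, n] = kernel_m + x_i - x_n, and the Gaussian weight of each v."""
    v = (kernel[None, :, None, :]
         + points[:, None, None, :]
         - points[neighbors][:, None, :, :])                    # (N, M, k, D)
    w = np.exp(-np.sum(v ** 2, axis=-1) / (2.0 * sigma ** 2))   # (N, M, k)
    return v, w

def kc_forward(points, neighbors, kernel, sigma):
    """KC response per anchor point, shape (N,)."""
    _, w = _diff_and_weight(points, neighbors, kernel, sigma)
    return w.sum(axis=(1, 2)) / neighbors.shape[1]

def kc_kernel_grad(points, neighbors, kernel, sigma, d):
    """dL/d(kernel), shape (M, D), given d[i] = dL/dKC_i from upper layers."""
    v, w = _diff_and_weight(points, neighbors, kernel, sigma)
    alpha = -1.0 / (neighbors.shape[1] * sigma ** 2)   # identical for all i when k is fixed
    inner = np.sum(v * w[..., None], axis=2)           # sum over neighbors: (N, M, D)
    return alpha * np.einsum("i,imd->md", d, inner)    # weighted sum over anchors

def numeric_grad(points, neighbors, kernel, sigma, d, eps=1e-6):
    """Central finite differences of L = sum_i d_i * KC_i w.r.t. the kernel."""
    grad = np.zeros_like(kernel)
    for idx in np.ndindex(*kernel.shape):
        step = np.zeros_like(kernel)
        step[idx] = eps
        plus = d @ kc_forward(points, neighbors, kernel + step, sigma)
        minus = d @ kc_forward(points, neighbors, kernel - step, sigma)
        grad[idx] = (plus - minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(20, 3))
neighbors = np.array([rng.choice(np.delete(np.arange(20), i), 4, replace=False)
                      for i in range(20)])
kernel = rng.uniform(-0.5, 0.5, size=(5, 3))
d = rng.normal(size=20)

analytic = kc_kernel_grad(points, neighbors, kernel, 0.5, d)
numeric = numeric_grad(points, neighbors, kernel, 0.5, d)
```

Agreement between `analytic` and `numeric` is a useful sanity check when implementing such a layer by hand in a framework without automatic differentiation.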

Although it originates from LOO-KC in [28], our KC operation is different: a) unlike LOO-KC, which is a compactness measure between a point set and one of its element points, our KC computes the similarity between a data point’s neighborhood and a kernel of learnable points; and b) unlike the multiply-linked cost function, which involves a parameter of a transformation for a fixed template, our KC allows all points in the kernel to freely move and adjust, thus replacing the template and the transformation parameter with a point-set kernel.

3.2 Learning on Local Feature Structure

We take further advantage of the neighborhood information stored in the KNN graph to exploit local feature structures. Let $X$ represent a 3D point cloud of $N$ points, in which points are treated as vertices in an undirected graph $G$ with adjacency matrix $A \in \mathbb{R}^{N \times N}$, in which the k nearest neighbors of each point are connected. It is intuitive that neighboring points forming a local surface often share similar feature patterns. Therefore, we aggregate features of each point within its neighborhood by a graph max pooling operation:

$$Y = \operatorname{maxpool}_G\big(f(X)\big), \qquad (4)$$

where $f$ is a per-point MLP that maps an input point feature in a $d_{in}$-dimensional space into a $d_{out}$-dimensional output feature space, and $\operatorname{maxpool}_G$ denotes a graph max pooling function taking the maximum feature over the neighborhood of each vertex, independently operated over each of the $d_{out}$ dimensions. Thus the output $Y$ is in $\mathbb{R}^{N \times d_{out}}$ and the $(i,k)$-th entry of $Y$ is:

$$Y(i,k) = \max_{n \in \mathcal{N}(i)} f(X)(n,k), \qquad (5)$$

where $n$ is a neighbor of vertex $i$ on the graph. A local signature is obtained by graph max pooling. This signature can represent the aggregated feature information of the local surface. By recursively max pooling over neighborhoods, the network propagates feature information into a larger receptive field. Note that the connection of this operation with PointNet++ [21] is that each point’s local neighborhood is similar to the clusters/segments in PointNet++. This graph operation enables local feature aggregation on the original PointNet architecture.
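Graph max pooling over a fixed neighbor list reduces to an indexed gather followed by a channel-wise maximum. The sketch below is a minimal NumPy illustration; the function name `graph_max_pool` is hypothetical, the per-point MLP is assumed to have been applied already, and only the listed neighbors are pooled (whether a vertex is part of its own neighborhood is not specified in the text, so add the vertex's own index to its row if that is wanted).

```python
import numpy as np

def graph_max_pool(features, neighbors):
    """Channel-wise max over each vertex's neighbors.

    features:  (N, C) per-point features (output of a per-point MLP)
    neighbors: (N, k) neighbor indices from the KNN graph
    returns:   (N, C) pooled features
    """
    return features[neighbors].max(axis=1)   # gather (N, k, C), then reduce over k

features = np.array([[1.0, 0.0],
                     [0.0, 5.0],
                     [3.0, 1.0],
                     [2.0, 2.0]])
neighbors = np.array([[1, 2], [0, 2], [1, 3], [2, 0]])

pooled_once = graph_max_pool(features, neighbors)
pooled_twice = graph_max_pool(pooled_once, neighbors)   # recursion widens the receptive field
```

Applying the operation twice lets each vertex see features from neighbors of neighbors, which is the receptive-field growth described in the text.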

3.3 Learning on Object Classification

The proposed network is a feed-forward network that utilizes kernel correlation for learning local geometries and graph max pooling for learning local feature structures. Computation results from kernel correlation are considered as local geometric features and are concatenated with original inputs as per-point features to be processed by MLP for feature learning on each data point. The graph max pooling is applied to aggregate local features within neighborhood of each point. For classification, the global signature is then extracted by max pooling over all points.

3.4 Learning on Part Segmentation

In the segmentation task, recent networks learn both the local and global features of an object and associate them through concatenation, up-sampling, or an encoder-decoder [19, 21, 12]. However, our new operations enable straightforward segmentation in the proposed architecture without explicitly learning global features. Compared to the classification task, no global max pooling is required in our model. Instead, after recursively max pooling over the neighborhood of each point, local features are propagated directly through MLPs for per-point labeling. Compared with PointNet, our segmentation net is more similar to that for images. While our module seems simple, it achieves comparable performance in part segmentation with far fewer parameters than other methods and faster processing speed than Kd-Net/PointNet++ (Section 4).

4 Experiments

We now apply the proposed architecture to 3D object classification (Section 4.1) and part segmentation on a challenging benchmark (Section 4.2). We compare our results to state-of-the-art methods, analyze the proposed model, and visualize local structures learned by our network (Section 4.3).

4.1 Object Classification

Datasets. We evaluate our network on both 2D and 3D point clouds. For 2D object classification, we convert the MNIST dataset [13] to 2D point clouds. MNIST contains images of handwritten digits with 60,000 training and 10,000 testing images. We transform non-zero pixels in each image to 2D points, keeping coordinates as input features and normalizing them within [-0.5, 0.5]. For 3D object classification, we evaluate our model on the 10-category and 40-category benchmarks ModelNet10 and ModelNet40 [32], consisting of 4899 and 12311 CAD models respectively. ModelNet10 is split into 3991 for training and 908 for testing. ModelNet40 is split into 9843 for training and 2468 for testing. To obtain 3D point clouds, we uniformly sample points from meshes by Poisson disk sampling using MeshLab [3] and normalize them into a unit ball.
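The two preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration with assumed details: pixel indices are scaled by the image size to land in [-0.5, 0.5], x follows the column index and y the row index, and a 3D cloud is centered at its centroid before being scaled into the unit ball. The function names are hypothetical, and the exact normalization used by the authors is not given in the text.

```python
import numpy as np

def image_to_points(image):
    """Turn non-zero pixels of an (H, W) image into 2D points in [-0.5, 0.5]."""
    rows, cols = np.nonzero(image)
    h, w = image.shape
    x = cols / (w - 1) - 0.5      # assumption: x grows with the column index
    y = rows / (h - 1) - 0.5      # assumption: y grows with the row index
    return np.column_stack([x, y])

def normalize_to_unit_ball(points):
    """Center a point cloud and scale it so the farthest point has norm 1."""
    centered = points - points.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1).max()

# Toy 28 x 28 "digit" with two lit corner pixels.
image = np.zeros((28, 28))
image[0, 0] = 1.0
image[27, 27] = 1.0
digit_points = image_to_points(image)

rng = np.random.default_rng(0)
cloud = normalize_to_unit_ball(rng.uniform(0.0, 10.0, size=(1024, 3)))
```

Each MNIST digit yields a different number of points under this conversion, so a fixed-size network input would additionally need sampling or padding, which is not covered here.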

Network Configuration. Our model has 10 parametric layers in total, with one kernel correlation layer and two graph max pooling layers. First, the coordinates of each point are fed into the kernel correlation layer, and its output is concatenated with the coordinates. Then the features are passed into three MLPs for per-point feature propagation. Next, two graph max pooling layers take the per-point features and propagate them within the neighborhood of each point. The output is passed through a global max pooling layer to extract the global signature by taking the maximum features over all the points in the point cloud. Finally, the features are passed into three MLPs that output object scores. The configuration can be described as: KC(16)-I(64)-I(64)-I(64)-M(128)-M(1024)-P-FC(512)-FC(256)-FC(K), where KC(c) denotes a kernel correlation layer with c output channels from c sets of learnable kernel points, I(c) denotes per-point feature learning with c output channels, M(c) denotes a graph max pooling layer with c channels of per-point features maximized over the neighborhood of each vertex on the K-NN graph (we use 16-NN as default), P denotes global max pooling that aggregates point features into the global signature of the entire shape, and FC(c) denotes a fully connected layer with c output channels. ReLU is used in each layer without Batchnorm. Dropout layers are used for fully connected layers. We initialize 16 sets of learnable kernel points uniformly within [-0.2, 0.2] and set the kernel width to 0.005.
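To make the configuration string easier to read, the following small Python sketch traces the feature dimension through each layer. It assumes 3D input coordinates, that the KC responses are concatenated with those coordinates (as stated above, giving 16 + 3 = 19 channels), and that global max pooling keeps the channel count. The helper name `trace_dims` is hypothetical.

```python
import re

CONFIG = "KC(16)-I(64)-I(64)-I(64)-M(128)-M(1024)-P-FC(512)-FC(256)-FC(K)"

def trace_dims(config, num_classes, coord_dim=3):
    """Return a list of (layer, input_dim, output_dim) for a configuration string."""
    trace, dim = [], coord_dim
    for token in config.split("-"):
        match = re.fullmatch(r"([A-Z]+)(?:\((\w+)\))?", token)
        name, arg = match.group(1), match.group(2)
        if name == "P":
            out = dim                                   # global max pooling keeps channels
        else:
            out = num_classes if arg == "K" else int(arg)
        trace.append((token, dim, out))
        # KC responses are concatenated with the original coordinates
        dim = out + coord_dim if name == "KC" else out
    return trace

for layer, d_in, d_out in trace_dims(CONFIG, num_classes=40):
    print(f"{layer:10s} {d_in:5d} -> {d_out:5d}")
```

For ModelNet40, `num_classes` is 40, so the last fully connected layer maps 256 channels to 40 class scores.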

Results. Table 1 and Table 2 compare our results with several recent works. In MNIST digit classification, our model reaches results comparable to those obtained with ConvNets. In ModelNet10 shape classification, our model achieves state-of-the-art performance among methods directly taking 3D point clouds as input. In ModelNet40 shape classification, our method achieves 1.6% higher accuracy than PointNet [19]. Compared with PointNet++ [21], our model is slightly better than their version using 1,024 input points with coordinates only, and 1.1% worse than their version using 5,000 input points with coordinates and normal vectors. The reason could be that 5,000 points contain more fine-grained local information than 1,024 points. The proposed model is able to learn local geometric and feature structures efficiently with simple yet powerful kernel correlation and graph max pooling. Table 3 summarizes the space (number of parameters in the network) and forward time of our model. Although it does not fully reach state-of-the-art performance on ModelNet40, our model is designed to be efficient and less computationally expensive.

Method                    Accuracy (%)
LeNet5 [13]               99.2
PointNet (vanilla) [19]   98.7
PointNet [19]             99.2
PointNet++ [21]           99.5
Ours                      99.2
Table 1: MNIST digit classification.

Method                                  MN10   MN40
ECC [24]                                90.0   83.2
PointNet (vanilla) [19]                 -      87.2
PointNet [19]                           -      89.2
PointNet++ (without normal) [21]        -      90.7
PointNet++ (5K pts with normal) [21]    -      91.9
Kd-Net (depth 10) [12]                  93.3   90.6
Kd-Net (depth 15) [12]                  94.0   91.8
Ours                                    94.6   90.8
Table 2: ModelNet shape classification accuracy (%). Comparison of the proposed model with the state of the art. Our model achieves the best accuracy on ModelNet10 and competitive accuracy on ModelNet40. Two versions of PointNet++ are included: one with 1,024 input points and coordinates only, the other with 5,000 input points and both coordinates and normal vectors.

Model                 #params (M)   Forward time (ms)
PointNet (vanilla)    0.8           11.6
PointNet              3.5           25.3
PointNet++ (MSG)      1.0           163.2
Ours                  0.8           42.2
Table 3: Comparison of space and time complexity. "M" stands for million. Models are tested on a single GTX 1080 with batch size 8 using Caffe [8].

4.2 Part Segmentation

Part segmentation is an important problem for tasks that require accurate segmentation of fine-grained or complex shapes. We use the model discussed in Section 3.4 to classify the part label of each point in a 3D point cloud object (e.g., for a car each point can represent a wheel, roof, or hood).

Datasets. We evaluate our work for part segmentation on the ShapeNet part dataset [33]. There are 16,881 shapes of 3D point cloud objects from 16 categories, with each point in an object corresponding to a part label (50 parts in total). On average each object consists of fewer than 6 parts, and the highly imbalanced data makes the task quite challenging. To convert CAD models to point clouds, we use the same strategy as in Section 4.1 to uniformly sample 2048 points for each object.

Network Configuration. Similar to classification, input point clouds are passed into the architecture illustrated in Figure 2. The detailed configuration for the segmentation net using local features only can be described as: KC(12)-I(64)-I(64)-I(64)-M(128)-M(1024)-FC(512)-FC(256)-FC(K), where K is 50 for the ShapeNet part dataset, KC(c) denotes a kernel correlation layer, I(c) denotes per-point feature learning, M(c) denotes a graph max pooling layer, and FC(c) denotes a fully connected layer, each with c output channels. ReLU is used in each layer without Batchnorm. Dropout layers are used for fully connected layers with drop ratio 0.3. We construct a 64-NN graph for the ShapeNet part dataset, initialize 12 sets of learnable kernel points within [-0.3, 0.3], and set the kernel width to 0.01. To further utilize the global signature of the object shape, we extend the segmentation architecture to associate local features with global features for better performance. Specifically, we add a global max pooling layer that yields the global signature of the object and concatenate it with the local features from the graph max pooling layer, as done in PointNet. Thus we obtain both global features describing the type of object and local features describing fine geometric structures. The configuration can be described as: KC(12)-I(64)-I(64)-I(64)-M(128)-M(1024)-P-FC(512)-FC(256)-FC(50), where P denotes global max pooling.

Results. We compare our method with PointNet [19], PointNet++ [21], and Kd-Net [12]. We use the intersection over union (IoU) of each category as the evaluation metric following [12]: the IoU of each shape is the average of the IoUs of each part that occurs in this shape. The mean IoU of each category is obtained by averaging the IoUs of all the shapes in the category. The overall mean IoU is then calculated by averaging the IoUs of all categories. In Table 4, we report both the IoU of each category and the overall mean IoU (mIoU). Although it does not improve the state-of-the-art overall mIoU compared to more complex architectures such as PointNet++, our model demonstrates comparable or even better performance on certain categories, with fewer parameters and faster processing speed. Compared with the proposed architecture using only local features, the combination of local and global features increases the performance by 1.9%. However, our model does not perform well on categories such as rocket, cap, earphone, and motor. We speculate on several reasons: a) unlike PointNet, we do not have T-nets, and thus suffer from misalignment for some shape instances; b) we do not use shape category information as in PointNet; c) insufficient and imbalanced training samples for these categories may bias our network to learn local structures more useful for other classes.
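The three-level averaging in this metric (parts within a shape, shapes within a category, then categories) can be sketched as below. This is a minimal NumPy illustration with assumptions: a part "occurs in a shape" when it appears in that shape's ground-truth labels, and the function names `shape_iou` and `mean_iou` as well as the toy labels are hypothetical.

```python
import numpy as np

def shape_iou(pred, gt):
    """Mean IoU over the part labels that occur in the ground truth of one shape."""
    ious = []
    for part in np.unique(gt):
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))   # > 0 because part occurs in gt
        ious.append(inter / union)
    return float(np.mean(ious))

def mean_iou(shapes):
    """shapes: iterable of (category, predicted_labels, ground_truth_labels)."""
    per_category = {}
    for category, pred, gt in shapes:
        per_category.setdefault(category, []).append(shape_iou(pred, gt))
    category_miou = {c: float(np.mean(v)) for c, v in per_category.items()}
    overall = float(np.mean(list(category_miou.values())))
    return category_miou, overall

shapes = [
    ("mug", np.array([0, 0, 1, 1]), np.array([0, 0, 0, 1])),
    ("mug", np.array([0, 1]),       np.array([0, 1])),
    ("cap", np.array([2, 2]),       np.array([2, 3])),
]
category_miou, overall_miou = mean_iou(shapes)
```

Because the final average is over categories rather than over shapes, a category with few shapes weighs as much as one with thousands, which is worth keeping in mind when reading the per-category results.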

Method      mean  aero  bag   cap   car   chair earphone  guitar  knife  lamp  laptop  motor  mug   pistol  rocket  skateboard  table
# shapes    -     2690  76    55    898   3758  69        787     392    1547  451     202    184   283     66      152         5271
PointNet    83.7  83.4  78.7  82.5  74.9  89.6  73.0      91.5    85.9   80.8  95.3    65.2   93.0  81.2    57.9    72.8        80.6
PointNet++  85.1  82.4  79.0  87.7  77.3  90.8  71.8      91.0    85.9   83.7  95.3    71.6   94.1  81.3    58.7    76.4        82.6
Kd-Net      82.3  80.1  74.6  74.3  70.3  88.6  73.5      90.2    87.2   81.0  94.9    57.4   86.7  78.1    51.8    69.9        80.3
Ours (L)    82.4  85.9  70.4  51.4  76.8  86.5  54.9      90.4    80.6   70.3  95.8    57.7   90.2  82.0    32.1    68.6        81.8
Ours (G+L)  84.3  86.1  73.0  54.9  77.4  88.8  55.0      90.6    86.5   75.2  96.1    57.3   91.7  83.1    53.9    72.5        83.8
Table 4: Segmentation results on the ShapeNet part dataset. Intersection over Union (IoU, %) is reported as the evaluation metric. Two versions of the proposed architecture are reported. Ours (L) indicates the proposed architecture using only local features, and Ours (G+L) indicates the combination of both local and global features.

4.3 Model Analysis

In this section we validate our model through ablation experiments. We first investigate the properties of kernel correlation compared to normal vectors. Then different symmetry functions on the graph are analyzed. Finally, a robustness test is performed to compare the proposed method with PointNet under random noise.

Effectiveness of Kernel Correlation. In Table 5 we demonstrate that our kernel correlation is descriptive enough to capture local geometric structures. In this experiment, we use normal vectors as the local geometric features, concatenate them with the coordinates, and pass them into the proposed architecture in Figure 2. The normal vector of each point is computed by applying PCA to the covariance matrix to obtain the direction of minimal variance. Results show that kernel correlation achieves better performance than normal vectors. Besides, kernel correlation is embedded in the end-to-end architecture with learnable points, while normal vectors need to be manually computed as part of the inputs. The learnable kernel points are powerful enough to capture different local geometric structures, as displayed in Figure 4.

Local features        Accuracy (%)
normal                88.4
kernel correlation    89.4
Table 5: Comparison of different local geometric structures on ModelNet40. Kernel correlation achieves better performance than normal vectors. Note that the graph max pooling layer is not used here.

Comparison with Different Symmetry Functions. Symmetry function is able to make a model invariant to input permutation  [19]. In this section, we investigate several symmetry functions on graph, including graph max pooling and graph average pooling. In particular, graph max pooling is to take the maximized features over neighborhood in Equation 5. Graph average pooling is to take features of a point averaged over neighborhood:


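$$Y = P\,h(X),$$

where $h(\cdot)$ is the same per-point function as in Equation 4, and $P$ is the normalized adjacency matrix:

$$P = D^{-1} W,$$

where $W$ is the adjacency matrix with neighboring vertices connected by an edge and $D$ is the degree matrix defined as:

$$D_{ij} = \begin{cases} \deg(i), & i = j, \\ 0, & \text{otherwise,} \end{cases}$$

where $\deg(i)$ counts the number of vertices connected with vertex $i$.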

As shown in Table 6, graph max pooling and graph average pooling differ only slightly in accuracy (90.8% vs. 90.6%). We use graph max pooling to extract local features in this paper.

Symmetry function Accuracy (%)
max pooling 90.8
average pooling 90.6
Table 6: Comparison between two symmetry functions on graph. Evaluation data is ModelNet40 test set.
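The two symmetry functions compared above can be sketched on a small K-NN graph as follows. This is a pure-Python illustration under stated assumptions: the helper names are hypothetical, each vertex is included in its own neighborhood, and the degree of a vertex is taken as the number of neighbors in its (directed) K-NN list, so averaging over that list plays the role of multiplying by a degree-normalized adjacency matrix.

```python
def knn_graph(points, k):
    """For each point, return the indices of its k nearest points (self included)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [sorted(range(len(points)), key=lambda j: dist2(p, points[j]))[:k]
            for p in points]

def graph_max_pool(features, neighbors):
    """Per-channel maximum over each vertex's neighborhood."""
    channels = range(len(features[0]))
    return [[max(features[j][c] for j in nbrs) for c in channels]
            for nbrs in neighbors]

def graph_avg_pool(features, neighbors):
    """Per-channel mean over each vertex's neighborhood.

    Equivalent to one row of (degree-normalized adjacency) x (feature matrix).
    """
    channels = range(len(features[0]))
    return [[sum(features[j][c] for j in nbrs) / len(nbrs) for c in channels]
            for nbrs in neighbors]

# Usage: four points on a line, two feature channels per point.
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [7.0, 0.0, 0.0]]
feats = [[1.0, 0.0], [2.0, 5.0], [4.0, 1.0], [8.0, 3.0]]
neighbors = knn_graph(points, k=2)
pooled_max = graph_max_pool(feats, neighbors)
pooled_avg = graph_avg_pool(feats, neighbors)
```

Both functions are invariant to the order in which a vertex's neighbors are listed, which is the property that makes them usable as symmetry functions.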

Effectiveness of Local Geometric and Feature Representation. Table 7 shows the effect of the local geometric and feature structures learned by kernel correlation and graph max pooling, respectively. Notably, kernel correlation or graph max pooling alone already achieves performance comparable to PointNet. Although neighborhood information is also studied in PointNet++ [21], we take a different approach that focuses on local feature aggregation through kernel correlation and graph pooling. Specifically, we explicitly learn a set of points to capture different local geometric structures and aggregate local features over neighboring vertices on a K-NN graph. PointNet++, in contrast, learns local features by sampling and hierarchically grouping neighboring points and passing each group into PointNet to encode local region patterns.

Methods of learning local structures Accuracy (%)
graph max pooling (feature) 89.3
kernel correlation (geometric) 89.4
both 90.8
Table 7: Effects of local geometric and feature structures on ModelNet40.

Robustness Test. We compare our model with PointNet on robustness to random noise in the input point cloud. Both networks are trained and evaluated on the same training and test data with 1024 points per object. For PointNet, we augment the training data by randomly rotating the object along the up-axis and jittering the position of each point with Gaussian noise of zero mean and 0.02 standard deviation, as explained in [19]. During testing, a given number of randomly selected input points are replaced with noise uniformly distributed in [-1.0, 1.0]. As shown in Figure 3, our model is more robust to random noise. When 10 points in each object are replaced with uniformly distributed noise, the accuracy of PointNet drops sharply from 79.1% to 30.6%, while that of our model drops from 88.3% to 70.3%. This shows an advantage of local structures over the per-point features in PointNet: our network learns to exploit local geometric and feature structures within neighboring regions and is thus more robust to random noise.
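The jitter augmentation and the test-time corruption described above can be sketched as follows. This is a minimal pure-Python illustration; the function names are hypothetical, the random rotation about the up-axis is omitted, and the all-zero point cloud is only a stand-in for a real 1024-point object.

```python
import random

def jitter(points, sigma=0.02, rng=random):
    """Training augmentation: add zero-mean Gaussian noise to every coordinate."""
    return [[c + rng.gauss(0.0, sigma) for c in p] for p in points]

def replace_with_noise(points, num_noise, low=-1.0, high=1.0, rng=random):
    """Test-time corruption: overwrite num_noise randomly chosen points
    with coordinates drawn uniformly from [low, high]."""
    corrupted = [list(p) for p in points]  # copy so the input is not modified
    for i in rng.sample(range(len(points)), num_noise):
        corrupted[i] = [rng.uniform(low, high) for _ in range(3)]
    return corrupted

# Usage: corrupt 10 of 1024 points, as in the setting discussed above.
rng = random.Random(0)
cloud = [[0.0, 0.0, 0.0] for _ in range(1024)]
jittered = jitter(cloud, sigma=0.02, rng=rng)
noisy = replace_with_noise(cloud, num_noise=10, rng=rng)
```

Sweeping `num_noise` over a range of values and recording classification accuracy at each setting would produce a robustness curve of the kind plotted in Figure 3.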

Figure 3: Our model vs. PointNet under random noise. A varying number of points in each object are replaced with noise uniformly distributed in [-1, 1]. The metric is overall classification accuracy on the ModelNet40 test set. We compare our model with different versions of PointNet [19] (vanilla and with two T-nets) and show that our model is more robust to random noise.
Figure 4: Examples of local geometric structures captured by kernel points in ModelNet40. Color indicates normalized correlation (red: strongest, blue: weakest). Examples are the filter responses of 16 kernels of bed, bench and cup from top to bottom.
Figure 5: Results of part segmentation on validation data of ShapeNet part dataset. Examples are car, chair, earphone, knife, motor, pistol and skateboard from top to bottom.

5 Conclusion

In this work, we propose a novel deep neural model that exploits neighbor information to represent local geometric and feature structures in 3D point clouds. Our network measures the geometric affinity between a set of learnable points and the neighboring points by kernel correlation. Features are then aggregated by max pooling within the neighborhood to encode local feature structures. We have shown that the proposed network captures local patterns efficiently and achieves competitive performance on 3D point cloud classification and part segmentation benchmarks.


The authors gratefully acknowledge the helpful comments and suggestions of Teng-Yok Lee, Ziming Zhang, Zhiding Yu, Yuichi Taguchi, and Alan Sullivan.


  • [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [2] Y. Chen and G. Medioni. Object modeling by registration of multiple range images. In Robotics and Automation, 1991. Proceedings., 1991 IEEE International Conference on, pages 2724–2729. IEEE, 1991.
  • [3] P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, and G. Ranzuglia. MeshLab: an Open-Source Mesh Processing Tool. In V. Scarano, R. D. Chiara, and U. Erra, editors, Eurographics Italian Chapter Conference. The Eurographics Association, 2008.
  • [4] C. Feng, Y. Taguchi, and V. R. Kamat. Fast plane extraction in organized point clouds using agglomerative hierarchical clustering. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 6218–6225. IEEE, 2014.
  • [5] A. Golovinskiy, V. G. Kim, and T. Funkhouser. Shape-based recognition of 3d point clouds in urban environments. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2154–2161. IEEE, 2009.
  • [6] D. Holz and S. Behnke. Fast range image segmentation and smoothing using approximate surface reconstruction and region growing. Intelligent autonomous systems 12, pages 61–73, 2013.
  • [7] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points, volume 26. ACM, 1992.
  • [8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
  • [9] B. Jian and B. C. Vemuri. Robust point set registration using gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1633–1645, 2011.
  • [10] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003.
  • [11] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [12] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. arXiv preprint arXiv:1704.01222, 2017.
  • [13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [14] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas. Fpnn: Field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pages 307–315, 2016.
  • [15] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
  • [16] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. arXiv preprint arXiv:1611.08402, 2016.
  • [17] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
  • [18] D. OuYang and H.-Y. Feng. On the normal vector estimation for point cloud data from smooth surfaces. Computer-Aided Design, 37(10):1071–1079, 2005.
  • [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [20] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
  • [21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
  • [22] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [23] G. L. Scott and H. C. Longuet-Higgins. An algorithm for associating the features of two images. Proceedings of the Royal Society of London B: Biological Sciences, 244(1309):21–26, 1991.
  • [24] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [25] J. Strom, A. Richardson, and E. Olson. Graph-based segmentation for colored 3d laser point clouds. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2131–2136. IEEE, 2010.
  • [26] D. Teney, L. Liu, and A. v. d. Hengel. Graph-structured representations for visual question answering. arXiv preprint arXiv:1609.05600, 2016.
  • [27] D. Thanou, P. A. Chou, and P. Frossard. Graph-based compression of dynamic 3d point cloud sequences. IEEE Transactions on Image Processing, 25(4):1765–1778, 2016.
  • [28] Y. Tsin and T. Kanade. A correlation-based approach to robust point set registration. In European conference on computer vision, pages 558–569. Springer, 2004.
  • [29] G. Vosselman, S. Dijkman, et al. 3d building model reconstruction from point clouds and ground plans. International archives of photogrammetry remote sensing and spatial information sciences, 34(3/W4):37–44, 2001.
  • [30] G. Vosselman, B. G. Gorte, G. Sithole, and T. Rabbani. Recognising structure in laser scanner point clouds. International archives of photogrammetry, remote sensing and spatial information sciences, 46(8):33–38, 2004.
  • [31] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (SIGGRAPH), 36(4), 2017.
  • [32] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
  • [33] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.