Neighbors Do Help: Deeply Exploiting Local Structures of Point Clouds
Abstract
Unlike on images, semantic learning on 3D point clouds using a deep network is challenging due to the naturally unordered data structure. Among existing works, PointNet has achieved promising results by directly learning on point sets. However, it does not take full advantage of a point's local neighborhood, which contains fine-grained structural information that turns out to be helpful for better semantic learning. In this regard, we present two new operations to improve PointNet with more efficient exploitation of local structures. The first one focuses on local 3D geometric structures. In analogy with a convolution kernel for images, we define a point-set kernel as a set of learnable points that jointly respond to a set of neighboring data points according to their geometric affinity measured by kernel correlation, adapted from a similar technique for point cloud registration. The second one exploits local feature structures by recursive feature aggregation on a nearest-neighbor graph computed from 3D positions. Experiments show that our network is able to robustly capture local information and efficiently achieve better performance on major datasets.
1 Introduction
As 3D data become ubiquitous with the rapid development of various 3D sensors, semantic understanding and analysis of such data using deep networks is gaining attention [19, 31, 12, 22, 1], due to its wide applications in robotics, autonomous driving, reverse engineering, and civil infrastructure monitoring. In particular, as one of the most primitive 3D data formats and often the raw 3D sensor output, 3D point clouds cannot be trivially consumed by deep networks in the same way that 2D images are by convolutional networks. This is mainly caused by the irregular organization of points, a fundamental challenge inherent in this raw data format: compared with a row-column indexed image, a point cloud is a set of point coordinates (possibly with attributes like intensities and surface normals) without obvious orderings between points, except for point clouds computed from depth images.
Nevertheless, influenced by the success of convolutional networks for images, many works have focused on 3D voxels, i.e., regular 3D grids converted from point clouds prior to the learning process. Only then do 3D convolutional networks learn to extract features from voxels [15, 32, 20, 14]. However, to avoid intractable computation time and memory consumption, such methods usually work only at a small spatial resolution, which results in quantization artifacts and difficulty in learning fine details of geometric structures, except for a few recent improvements using octrees [22, 31].
Different from convolutional networks, PointNet [19] provides an effective and simple architecture to directly learn on point sets: it first computes individual point features with a per-point Multi-Layer Perceptron (MLP) and then aggregates all these features as a global representation of the point cloud. While achieving state-of-the-art results in different 3D semantic learning tasks, the "jump" from per-point features directly to the global feature suggests that PointNet does not take full advantage of a point's local structure to capture fine-grained patterns: a per-point MLP output roughly encodes only the information on the existence of a 3D point in a certain nonlinear partition of the 3D space. A more efficient representation is expected if the MLP can encode not only "whether" a point exists but also "what type of form" (e.g., corner vs. planar, convex vs. concave, etc.) the point exists in within the nonlinear 3D space partition. Such "type" information has to be learned from the point's local neighborhood on the 3D object surface, which is the main motivation of this paper.
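For concreteness, the per-point-MLP-then-global-max-pooling pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up layer sizes and random weights, not the actual PointNet implementation; it also demonstrates why the global signature is invariant to point ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_point_mlp(points, weights, biases):
    """Apply a shared MLP independently to every point (ReLU activations)."""
    h = points
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)  # same weights for every point
    return h

# Toy point cloud: 128 points with 3D coordinates.
X = rng.normal(size=(128, 3))
Ws = [rng.normal(scale=0.1, size=(3, 64)), rng.normal(scale=0.1, size=(64, 1024))]
bs = [np.zeros(64), np.zeros(1024)]

point_feats = per_point_mlp(X, Ws, bs)   # (128, 1024) per-point features
global_feat = point_feats.max(axis=0)    # (1024,) order-invariant global signature

# Permuting the points leaves the global signature unchanged.
perm = rng.permutation(128)
global_feat_perm = per_point_mlp(X[perm], Ws, bs).max(axis=0)
assert np.allclose(global_feat, global_feat_perm)
```

Note how the only interaction between points happens at the final max pooling, which is exactly the "jump" from per-point features to the global feature discussed above.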
The follow-up PointNet++ [21] attempts to address the above issue by segmenting a point set into smaller clusters, sending each through a small PointNet, and repeating such a process iteratively on higher-dimensional feature point sets, which leads to a complicated architecture with reduced speed. We instead explore another direction: are there efficient learnable local operations, with clear geometric interpretations, that can directly augment and improve the original PointNet while maintaining its simple architecture?
To address the above question, in this paper, we focus on supervised learning of 3D point cloud representations by improving PointNet with local geometric and feature structures using two new operations as depicted in Figure 2. We summarize these new operations in the following main contributions of the paper:

- We propose a kernel correlation layer that exploits local geometric structures, with a clear geometric interpretation.
- We propose a graph-based pooling layer that exploits local feature structures to enhance network robustness.
- We efficiently improve point cloud semantic learning tasks using the two new operations.
The remainder of this paper is organized as follows: we first review related works on various local geometric properties and on deep learning for graph-structured data in section 2. We then explain the two new operations in detail in section 3. In section 4, we evaluate the performance of the proposed model on benchmark datasets (MNIST, ModelNet10, ModelNet40, and ShapeNet part). Finally, we conclude the work in section 5.
2 Related Works
2.1 Local Geometric Properties
We first discuss some local geometric properties frequently used on 3D data and how they lead us to adapting kernel correlation as a tool that enables potentially complex data-driven characterization of local geometric structures.
Surface Normal. As a basic surface property, surface normals are heavily used in many areas including 3D shape reconstruction, plane extraction, and point set registration [29, 18, 30, 6, 2]. They usually come directly from CAD models or can be estimated by Principal Component Analysis (PCA) on the data covariance matrix of neighboring points, as the direction of minimal variance [7]. Using per-point surface normals in PointNet corresponds to modeling a point's local neighborhood as a plane, which is shown in [19, 21] to improve performance compared with using only 3D coordinates. This meets our previous expectation that a point's "type", along with its position, should enable a better representation. Yet this also leads to a question: since normals can be estimated from 3D coordinates (unlike colors or intensities), meaning the "amount of information" in the two kinds of input data is almost the same, why can PointNet with only 3D coordinate input not learn to achieve the same performance? We believe it is due to the following: a) the per-point MLP cannot capture neighboring information from just 3D coordinates, and b) global pooling cannot, or is not efficient enough to, achieve that either.
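As a concrete illustration of the PCA-based estimation cited above [7], the following NumPy sketch recovers a normal as the minimal-variance direction of a neighborhood's covariance matrix (the function name and test data are our own, for illustration only):

```python
import numpy as np

def estimate_normal(neighbors):
    """Estimate a surface normal as the minimal-variance direction of the
    neighborhood covariance matrix (PCA)."""
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, 0]                    # eigenvector of the smallest eigenvalue

# Points sampled from the plane z = 0 should yield a normal along +/- z.
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(-1, 1, 50), rng.uniform(-1, 1, 50), np.zeros(50)])
n = estimate_normal(pts)
assert abs(abs(n[2]) - 1.0) < 1e-6
```

The sign ambiguity of the recovered normal (±z here) is inherent to PCA and is usually resolved with a viewpoint heuristic in practice.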
Covariance Matrix. A second-order description of a local neighborhood is the data covariance matrix, which has also been widely used, along with normals, in cases such as plane extraction and curvature estimation [6, 4]. Following the same line of thought as for normals, the information provided by the local data covariance matrix is in fact richer than that of normals, as it models the local neighborhood as an ellipsoid, which includes lines and planes in rank-deficient cases. We also observe empirically that it is better than normals for semantic learning.
Kernel Correlation. For 3D semantic object classification of fine-grained categories, or for 3D semantic segmentation, a more detailed analysis of each point's local neighborhood is naturally expected to improve performance. For these tasks, covariance matrices may not be descriptive enough, because point sets of completely different shapes can share a similar data covariance matrix. After all, surface normals and covariance matrices are only hand-crafted, fixed descriptions. While treating local neighboring points as a small point cloud and describing it using a small PointNet as in [21] is one natural way to learn such descriptions, given that PointNet can universally approximate a point cloud, this way might not be the most efficient one. Instead, we would like to find a learnable description that is efficient, simple, and has a clear geometric interpretation, just like the above two hand-crafted ones, so that it can be directly plugged into the original elegant PointNet architecture.
To ensure the description has a clear geometric meaning, we would like the learnable parameters in this description to still be 3D points, just as image convolution kernels are still images. For images, convolution (often implemented as cross-correlation) is used to quantify the similarity between the input image and the convolution kernel [13]. However, in the face of the aforementioned challenge to directly using convolution on point clouds, how can we measure the correlation between two point sets? This question leads us to kernel correlation [28, 9] as a tool that naturally fulfills our design goals. Kernel correlation, as a function of pairwise point distances, has been shown to be an efficient way to measure geometric affinity between 2D/3D point sets and has been used in point cloud registration and feature correspondence problems [23, 28, 9]. For registration in particular, a source point cloud is transformed to best match a reference one by iteratively refining a rigid/non-rigid transformation between the two to maximize their kernel correlation response.
Thus, in our network's front-end, we take inspiration from such algorithms and treat a point's local neighborhood as the source in kernel correlation, and a set of learnable points, i.e., a kernel, as the reference that characterizes certain types of local geometric structures/shapes. We modify the original kernel correlation computation by allowing the reference to freely adjust its shape (the kernel point positions) through back-propagation. Note the change of perspective here compared with point set registration: we want to learn template/reference shapes through a free per-point transformation, instead of using a fixed template/reference to find an optimal transformation between source and reference point sets. In this way, a set of learnable kernel points is analogous to a convolution kernel: it activates only to points in its joint neighboring region and captures local geometric structures within this receptive field, characterized by the kernel function and its kernel width. Under this setting, the learning process can be viewed as finding a set of reference/template points encoding the most effective and useful local geometric structures that, jointly with other parameters in the network, lead to the best learning performance.
2.2 Deep Learning on Graph
In our kernel correlation computation, to efficiently store the local neighborhoods of points, we build a 3D neighborhood graph by considering each point as a vertex, with edges connecting only nearby vertices. This graph is also useful for later computations in the proposed deep network. In addition to exploiting local geometric structure, which is performed only in the network front-end, and inspired by the ability of convolutional networks to locally aggregate features and gradually increase receptive fields through multiple layers, we exploit local feature structures in the top layers of our network by recursive feature propagation and aggregation along edges of the same 3D neighborhood graph used for kernel correlation. Our key insight is that neighboring points tend to have similar geometric structures, and hence propagating features through the neighborhood graph helps to learn more robust local patterns. Note that we deliberately avoid changing this neighborhood graph structure in top layers, which is again analogous to convolution on images: even as input image feature channels expand in the top layers of convolutional networks, each pixel's spatial ordering and neighborhood remain unchanged except for pooling operations. In fact, our neighborhood graph is constructed offline for each input point cloud.
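The offline graph construction can be sketched as follows. This is a brute-force O(N²) NumPy version for clarity (KD-tree-based neighbor search would be used in practice); the function and variable names are our own:

```python
import numpy as np

def knn_graph(points, k):
    """Build a k-nearest-neighbor graph offline: boolean adjacency A and
    the (N, k) array of neighbor indices per vertex."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # a point is not its own neighbor
    nbr_idx = np.argsort(d2, axis=1)[:, :k]  # N x k nearest-neighbor indices
    A = np.zeros((len(points), len(points)), dtype=bool)
    rows = np.repeat(np.arange(len(points)), k)
    A[rows, nbr_idx.ravel()] = True
    A |= A.T                                 # symmetrize for an undirected graph
    return A, nbr_idx

# Tiny 2D example: three collinear points plus one far-away point.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [10.0, 0.0]])
A, nbrs = knn_graph(pts, k=1)
assert nbrs[0, 0] == 1 and nbrs[2, 0] == 1
assert np.array_equal(A, A.T)  # undirected after symmetrization
```

Since the graph depends only on the input coordinates, it can be precomputed once per point cloud, exactly as described above.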
Nonetheless, we see such an operation as well aligned with a rising trend of using graphs in deep learning. Graph representations are naturally flexible for irregular or even non-Euclidean data such as point clouds, user data on social networks, text documents, and gene data [11, 26, 10, 17, 16, 1]. A k-nearest-neighbor (KNN) graph is usually used to establish local connectivity, in point cloud applications such as surface detection, 3D object recognition, 3D object segmentation, and compression [5, 25, 27]. Note how our graph construction differs from recent graph-based learning where each node corresponds to a data instance and a single graph corresponds to a whole dataset [11].
3 Method
We now explain the details of learning local geometric and feature structures over point neighborhoods via: (i) kernel correlation, which measures the geometric affinity of point sets, and (ii) a k-nearest-neighbor graph, which propagates local features between neighboring points. Figure 2 illustrates our full network architecture.
3.1 Learning on Local Geometric Structure
We adapt ideas from the Leave-one-out Kernel Correlation (LOO-KC) and the multiply-linked registration cost function in [28] to capture local geometric structures of a point cloud. We define our kernel correlation (KC) between a point-set kernel $\kappa$ with $M$ learnable points and the current anchor point $\mathbf{x}_i$ in a point cloud of $N$ points as:

$$\mathrm{KC}(\kappa, \mathbf{x}_i) = \frac{1}{|\mathcal{N}(i)|} \sum_{m=1}^{M} \sum_{n \in \mathcal{N}(i)} \mathrm{K}_\sigma(\boldsymbol{\kappa}_m, \mathbf{x}_n - \mathbf{x}_i) \quad (1)$$

where $\boldsymbol{\kappa}_m$ is the $m$-th learnable point in the kernel, $\mathcal{N}(i)$ the neighborhood index set of the anchor point $\mathbf{x}_i$ from the KNN graph, and $\mathbf{x}_n$ one of $\mathbf{x}_i$'s neighbor points. $\mathrm{K}_\sigma(\cdot, \cdot): \mathbb{R}^D \times \mathbb{R}^D \rightarrow \mathbb{R}$ is any valid kernel function ($D = 2$ or $3$ for 2D or 3D point clouds). Following [28], without loss of generality, we choose the Gaussian kernel in this paper:
$$\mathrm{K}_\sigma(\mathbf{k}, \boldsymbol{\delta}) = \exp\left(-\frac{\|\mathbf{k} - \boldsymbol{\delta}\|^2}{\sigma^2}\right) \quad (2)$$

where $\|\mathbf{k} - \boldsymbol{\delta}\|$ is the Euclidean distance between the two points and $\sigma$ is the kernel width that controls the influence of the distance between points. One nice property of the Gaussian kernel is that it decays exponentially as a function of the distance between the two points, providing a soft assignment from each kernel point to the neighboring points of the anchor point, relaxing the non-differentiable hard assignment in ordinary ICP. Our KC encodes pairwise distances between kernel points and neighboring data points and increases as the two point sets become more similar in shape, hence it can be clearly interpreted as a geometric similarity measure. Note the importance of choosing the kernel width $\sigma$ here, since either a too large or a too small $\sigma$ will lead to degraded performance, similar to the same issue in kernel density estimation. Fortunately, for the 2D or 3D spaces in our case, this parameter can still be chosen empirically as the average neighbor distance in the neighborhood graphs over all training point clouds.
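A direct NumPy transcription of Equations (1)-(2) is given below, together with a toy check that a planar kernel responds more strongly to a planar neighborhood than to a line-like one. The kernel, neighborhoods, and width are illustrative values of our own choosing:

```python
import numpy as np

def kernel_correlation(kernel_pts, anchor, neighbors, sigma):
    """KC between a learnable kernel and an anchor's local neighborhood, Eqs. (1)-(2)."""
    # v[m, n] = kappa_m - (x_n - x_i): kernel points vs. neighbors in local coordinates
    v = kernel_pts[:, None, :] - (neighbors[None, :, :] - anchor[None, None, :])
    return np.exp(-(v ** 2).sum(-1) / sigma ** 2).sum() / len(neighbors)

# A 3x3 planar grid of kernel points in the z = 0 plane.
plane_kernel = np.array([[x, y, 0.0] for x in (-0.1, 0.0, 0.1) for y in (-0.1, 0.0, 0.1)])
anchor = np.zeros(3)
plane_nbrs = np.array([[x, y, 0.0] for x in (-0.1, 0.1) for y in (-0.1, 0.1)])  # planar
line_nbrs = np.array([[0.0, 0.0, z] for z in (-0.2, -0.1, 0.1, 0.2)])          # line-like
sigma = 0.1

# The planar kernel matches the planar neighborhood better than the line.
assert kernel_correlation(plane_kernel, anchor, plane_nbrs, sigma) > \
       kernel_correlation(plane_kernel, anchor, line_nbrs, sigma)
```

In the network, this scalar response is computed per anchor point and per kernel, producing one local-geometry feature channel per kernel.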
To complete the description of the proposed learnable layer, given $L$ as the network loss function and $\partial L / \partial \mathrm{KC}_i$ its derivative w.r.t. each point $\mathbf{x}_i$'s KC response, propagated back from top layers, we provide the back-propagation equation for each kernel point $\boldsymbol{\kappa}_m$ as:

$$\frac{\partial L}{\partial \boldsymbol{\kappa}_m} = -\frac{2}{\sigma^2} \sum_{i=1}^{N} \frac{1}{\alpha_i} \frac{\partial L}{\partial \mathrm{KC}_i} \sum_{n \in \mathcal{N}(i)} \mathbf{v}_{m,n,i} \exp\left(-\frac{\|\mathbf{v}_{m,n,i}\|^2}{\sigma^2}\right) \quad (3)$$

where point $\mathbf{x}_i$'s normalizing constant $\alpha_i = |\mathcal{N}(i)|$, and the local difference vector $\mathbf{v}_{m,n,i} = \boldsymbol{\kappa}_m - (\mathbf{x}_n - \mathbf{x}_i)$.
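The analytic gradient can be verified numerically. The sketch below implements the per-anchor inner term of Equation (3) (i.e., the gradient of a single anchor's KC response, so $\partial L/\partial \mathrm{KC}_i = 1$) and checks it against central finite differences; all function names and the toy data are ours:

```python
import numpy as np

def kc(kernel_pts, anchor, neighbors, sigma):
    """Forward KC response of one anchor, as in Eqs. (1)-(2)."""
    v = kernel_pts[:, None, :] - (neighbors[None, :, :] - anchor[None, None, :])
    return np.exp(-(v ** 2).sum(-1) / sigma ** 2).sum() / len(neighbors)

def kc_grad(kernel_pts, anchor, neighbors, sigma):
    """d KC_i / d kappa_m for one anchor, the inner term of Eq. (3)."""
    v = kernel_pts[:, None, :] - (neighbors[None, :, :] - anchor[None, None, :])
    w = np.exp(-(v ** 2).sum(-1) / sigma ** 2)
    return (-2.0 / (sigma ** 2 * len(neighbors))) * (w[..., None] * v).sum(axis=1)

# Compare the analytic gradient with central finite differences.
rng = np.random.default_rng(2)
kappa = rng.normal(scale=0.1, size=(4, 3))
anchor = rng.normal(size=3)
nbrs = anchor + rng.normal(scale=0.1, size=(6, 3))
sigma, eps = 0.1, 1e-6
g = kc_grad(kappa, anchor, nbrs, sigma)
for m in range(4):
    for d in range(3):
        kp, km = kappa.copy(), kappa.copy()
        kp[m, d] += eps
        km[m, d] -= eps
        num = (kc(kp, anchor, nbrs, sigma) - kc(km, anchor, nbrs, sigma)) / (2 * eps)
        assert abs(num - g[m, d]) < 1e-5
```

Such a gradient check is a standard sanity test when implementing a custom layer like this one.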
Although our KC originates from LOO-KC in [28], the operation is different: a) unlike LOO-KC, which is a compactness measure between a point set and one of its element points, our KC computes the similarity between a data point's neighborhood and a kernel of learnable points; and b) unlike the multiply-linked cost function, which involves the parameters of a transformation for a fixed template, our KC allows all points in the kernel to move and adjust freely, thus replacing the template and the transformation parameters with a point-set kernel.
3.2 Learning on Local Feature Structure
We take further advantage of the neighborhood information stored in the KNN graph to exploit local feature structures. Let $\mathbf{X} \in \mathbb{R}^{N \times 3}$ represent a 3D point cloud, in which points are treated as vertices of an undirected graph with adjacency matrix $\mathbf{A}$, in which the k nearest neighbors of each point are connected. It is intuitive that neighboring points forming a local surface often share similar feature patterns. Therefore, we aggregate the features of each point within its neighborhood by a graph max pooling operation:
$$\mathbf{Y} = \mathbf{P}(f(\mathbf{X})) \quad (4)$$

where $f: \mathbb{R}^{N \times d_{in}} \rightarrow \mathbb{R}^{N \times d_{out}}$ is a per-point MLP that maps an input point feature in a $d_{in}$-dimensional space into a $d_{out}$-dimensional output feature space, and $\mathbf{P}$ denotes a graph max pooling function taking the maximum feature over the neighborhood of each vertex, operated independently over each of the $d_{out}$ dimensions. Thus the output $\mathbf{Y}$ is in $\mathbb{R}^{N \times d_{out}}$ and the $(i, j)$-th entry of $\mathbf{Y}$ is:

$$\mathbf{Y}_{ij} = \max_{n \in \mathcal{N}(i)} [f(\mathbf{X})]_{nj} \quad (5)$$
where $n$ is a neighbor of vertex $i$ on the graph. A local signature is obtained by graph max pooling. This signature represents the aggregated feature information of the local surface. By recursively max pooling over neighborhoods, the network propagates feature information into larger receptive fields. Note that the connection of this operation with PointNet++ [21] is that each point's local neighborhood is similar to the clusters/segments in PointNet++. This graph operation enables local feature aggregation on top of the original PointNet architecture.
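Equation 5 amounts to a channel-wise maximum over each vertex's neighbor rows. A minimal NumPy sketch with a toy 3-vertex graph and 2-dimensional features (both made up for illustration):

```python
import numpy as np

def graph_max_pool(feats, nbr_idx):
    """Eq. (5): each vertex takes the channel-wise max over its graph neighbors.

    feats:   (N, d) per-point features, i.e., f(X)
    nbr_idx: (N, k) neighbor indices from the KNN graph
    """
    return feats[nbr_idx].max(axis=1)

feats = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 1.0]])
nbr_idx = np.array([[1, 2],   # neighbors of vertex 0
                    [0, 2],   # neighbors of vertex 1
                    [0, 1]])  # neighbors of vertex 2
pooled = graph_max_pool(feats, nbr_idx)
assert np.array_equal(pooled, np.array([[3.0, 2.0], [3.0, 1.0], [1.0, 2.0]]))
```

Applying this operation twice, as in the proposed architecture, lets information travel two hops along the graph, enlarging the receptive field exactly as described above.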
3.3 Learning on Object Classification
The proposed network is a feed-forward network that utilizes kernel correlation for learning local geometric structures and graph max pooling for learning local feature structures. The kernel correlation responses are treated as local geometric features and are concatenated with the original inputs as per-point features, which are then processed by MLPs for feature learning on each data point. Graph max pooling is applied to aggregate local features within the neighborhood of each point. For classification, the global signature is then extracted by max pooling over all points.
3.4 Learning on Part Segmentation
In the segmentation task, recent networks learn both the local and global features of an object and associate them through concatenation, upsampling, or an encoder-decoder [19, 21, 12]. Our new operations, however, enable straightforward segmentation in the proposed architecture without explicitly learning global features. Compared with the classification task, no global max pooling is required in our model. Instead, after recursively max pooling over the neighborhood of each point, local features are propagated directly through MLPs for per-point labeling. Compared with PointNet, our segmentation network is more similar to those used for images. While our module seems simple, it achieves comparable performance in part segmentation with far fewer parameters than other methods and faster processing speed compared with Kd-Net/PointNet++ (Section 4).
4 Experiments
We now apply the proposed architecture to 3D object classification (Section 4.1) and part segmentation on a challenging benchmark (Section 4.2). We compare our results with state-of-the-art methods, analyze the proposed model, and visualize the local structures learned by our network (Section 4.3).
4.1 Object Classification
Datasets. We evaluate our network on both 2D and 3D point clouds. For 2D object classification, we convert the MNIST dataset [13] to 2D point clouds. MNIST contains images of handwritten digits with 60,000 training and 10,000 testing images. We transform the non-zero pixels of each image into 2D points, keeping the coordinates as input features and normalizing them within [-0.5, 0.5]. For 3D object classification, we evaluate our model on the 10-category and 40-category benchmarks ModelNet10 and ModelNet40 [32], consisting of 4,899 and 12,311 CAD models respectively. ModelNet10 is split into 3,991 models for training and 909 for testing. ModelNet40 is split into 9,843 for training and 2,468 for testing. To obtain 3D point clouds, we uniformly sample points from the meshes by Poisson disk sampling using MeshLab [3] and normalize them into a unit ball.
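The MNIST-to-point-cloud conversion can be sketched as below. This is our own helper, and we assume the coordinates are scaled by the longer image side so that they land in [-0.5, 0.5]; the exact normalization scheme may differ in the original experiments:

```python
import numpy as np

def image_to_points(img):
    """Convert a grayscale image to a 2D point cloud: non-zero pixels become
    points with coordinates normalized into [-0.5, 0.5]."""
    ys, xs = np.nonzero(img)
    pts = np.column_stack([xs, ys]).astype(float)
    pts -= pts.min(axis=0)
    extent = pts.max()  # scale by the larger side to preserve the aspect ratio
    return pts / extent - 0.5 if extent > 0 else pts

img = np.zeros((28, 28))
img[5:20, 10:15] = 1.0  # a synthetic vertical "stroke"
pts = image_to_points(img)
assert pts.min() >= -0.5 and pts.max() <= 0.5
assert len(pts) == 15 * 5  # one point per non-zero pixel
```

Because the number of non-zero pixels varies per digit, the resulting point clouds have varying sizes, which set-based networks like ours handle naturally.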
Network Configuration. Our model has 10 parametric layers in total, with one kernel correlation layer and two graph max pooling layers. First, the coordinates of each point are fed into the kernel correlation layer, whose output is concatenated with the coordinates. The features are then passed through three MLPs for per-point feature propagation. Next, two graph max pooling layers take the per-point features and propagate them within the neighborhood of each point. The output is passed through a global max pooling layer to extract the global signature by taking the maximum feature over all points in the point cloud. Finally, the features are passed through three MLPs that output the object scores. The configuration can be described as: KC(16)-I(64)-I(64)-I(64)-M(128)-M(1024)-P-FC(512)-FC(256)-FC(K), where KC(c) denotes a kernel correlation layer with c output channels from c sets of learnable kernel points, I(c) denotes per-point feature learning with c output channels, M(c) denotes a graph max pooling layer with c channels of per-point features maximized over the neighborhood of each vertex on the KNN graph (we use a 16-NN graph by default), P denotes global max pooling that aggregates point features into the global signature of the entire shape, and FC(c) denotes a fully connected layer with c output channels. ReLU is used in each layer, without batch normalization. Dropout is used in the fully connected layers. We initialize the 16 sets of learnable kernel points uniformly within [-0.2, 0.2] and set the kernel width to 0.005.
Results. Tables 1 and 2 compare our results with several recent works. In MNIST digit classification, our model reaches results comparable to those obtained with ConvNets. In ModelNet10 shape classification, our model achieves state-of-the-art performance among methods that directly take 3D point clouds as input. In ModelNet40 shape classification, our method achieves 1.6% higher accuracy than PointNet [19]. Compared with PointNet++ [21], our model is slightly better than their version with 1,024 input points and coordinate-only features, and 1.1% worse than their version with 5,000 input points and both coordinates and normal vectors. The reason could be that 5,000 points contain more fine-grained local information than 1,024 points. The proposed model is able to learn local geometric and feature structures efficiently with the simple yet powerful kernel correlation and graph max pooling. Table 3 summarizes the space (number of network parameters) and forward time of our model. Although not fully reaching state-of-the-art performance on ModelNet40, our model is efficient and less computationally expensive.
Table 1: MNIST digit classification accuracy.

| Method | Accuracy (%) |
|---|---|
| LeNet5 [13] | 99.2 |
| PointNet (vanilla) [19] | 98.7 |
| PointNet [19] | 99.2 |
| PointNet++ [21] | 99.5 |
| Ours | 99.2 |
Table 2: Shape classification accuracy (%) on ModelNet10 (MN10) and ModelNet40 (MN40).

| Method | MN10 | MN40 |
|---|---|---|
| ECC [24] | 90.0 | 83.2 |
| PointNet (vanilla) [19] | - | 87.2 |
| PointNet [19] | - | 89.2 |
| PointNet++ (without normal) [21] | - | 90.7 |
| PointNet++ (5K pts with normal) [21] | - | 91.9 |
| Kd-Net (depth 10) [12] | 93.3 | 90.6 |
| Kd-Net (depth 15) [12] | 94.0 | 91.8 |
| Ours | 94.6 | 90.8 |
Table 3: Model size and forward time.

| Method | #params (M) | Forward time (ms) |
|---|---|---|
| PointNet (vanilla) | 0.8 | 11.6 |
| PointNet | 3.5 | 25.3 |
| PointNet++ (MSG) | 1.0 | 163.2 |
| Ours | 0.8 | 42.2 |
4.2 Part Segmentation
Part segmentation is an important problem for tasks that require accurate segmentation of fine-grained or complex shapes. We use the model discussed in Section 3.4 to predict the part label of each point in a 3D point cloud object (e.g., for a car, a point can belong to the wheel, roof, or hood).
Datasets. We evaluate our work on part segmentation using the ShapeNet part dataset [33]. It contains 16,881 3D point cloud objects from 16 categories, with each point in an object corresponding to a part label (50 parts in total). On average, each object consists of fewer than 6 parts, and the highly imbalanced data makes the task quite challenging. To convert CAD models to point clouds, we use the same strategy as in Section 4.1 and uniformly sample 2,048 points for each object.
Network Configuration. As for classification, input point clouds are passed into the architecture illustrated in Figure 2. The detailed configuration of the segmentation network using local features only can be described as: KC(12)-I(64)-I(64)-I(64)-M(128)-M(1024)-FC(512)-FC(256)-FC(K), where K is 50 for the ShapeNet part dataset and KC, I, M, and FC denote the kernel correlation layer, per-point feature learning, graph max pooling layer, and fully connected layer, respectively, as before. ReLU is used in each layer, without batch normalization. Dropout with ratio 0.3 is used in the fully connected layers. We construct a 64-NN graph for the ShapeNet part dataset and initialize the 12 sets of learnable kernel points within [-0.3, 0.3] with kernel width 0.01. To further utilize the global signature of the object shape, we extend the segmentation architecture to associate local features with global features for better performance. Specifically, we add a global max pooling layer providing the global signature of the object and concatenate it with the local features from the graph max pooling layer, as done in PointNet. We thus obtain both global features describing the type of the object and local features describing fine geometric structures. This configuration can be described as: KC(12)-I(64)-I(64)-I(64)-M(128)-M(1024)-P-FC(512)-FC(256)-FC(50), where P denotes global max pooling.
Results. We compare our method with PointNet [19], PointNet++ [21], and Kd-Net [12]. We use the intersection over union (IoU) of each category as the evaluation metric, following [12]: the IoU of a shape is averaged over the IoUs of each part that occurs in that shape. The mean IoU of each category is obtained by averaging the IoUs of all shapes in the category, and the overall mean IoU is calculated by averaging over all categories. In Table 4, we report both the per-category IoUs and the overall mean IoU (mIoU). Although not improving the state-of-the-art overall mIoU achieved by more complex architectures such as PointNet++, our model demonstrates comparable or even better performance on certain categories, with fewer parameters and faster processing speed. Compared with the proposed architecture using local features only, the combination of local and global features increases the performance by 1.9%. However, our model does not perform well on categories such as rocket, cap, earphone, and motor. We speculate several reasons: a) we do not have T-nets as in PointNet and thus suffer from misalignment for some shape instances; b) we do not use shape category information as in PointNet; c) insufficient and imbalanced training samples for these categories may bias our network toward learning local structures more useful for other classes.
Table 4: Part segmentation IoU (%) on the ShapeNet part dataset.

| | mean | aero | bag | cap | car | chair | earphone | guitar | knife | lamp | laptop | motor | mug | pistol | rocket | skateboard | table |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # shapes | | 2690 | 76 | 55 | 898 | 3758 | 69 | 787 | 392 | 1547 | 451 | 202 | 184 | 283 | 66 | 152 | 5271 |
| PointNet | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |
| PointNet++ | 85.1 | 82.4 | 79.0 | 87.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6 |
| Kd-Net | 82.3 | 80.1 | 74.6 | 74.3 | 70.3 | 88.6 | 73.5 | 90.2 | 87.2 | 81.0 | 94.9 | 57.4 | 86.7 | 78.1 | 51.8 | 69.9 | 80.3 |
| Ours (L) | 82.4 | 85.9 | 70.4 | 51.4 | 76.8 | 86.5 | 54.9 | 90.4 | 80.6 | 70.3 | 95.8 | 57.7 | 90.2 | 82.0 | 32.1 | 68.6 | 81.8 |
| Ours (G+L) | 84.3 | 86.1 | 73.0 | 54.9 | 77.4 | 88.8 | 55.0 | 90.6 | 86.5 | 75.2 | 96.1 | 57.3 | 91.7 | 83.1 | 53.9 | 72.5 | 83.8 |
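The shape-IoU computation described in the results above can be sketched as follows. This is a simplified version of our own: conventions vary across implementations for parts absent from both the prediction and the ground truth, and here such parts are simply skipped:

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Shape IoU: average IoU over every part present in the shape's
    prediction or ground truth (parts absent from both are skipped)."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        if union == 0:
            continue  # part absent from both prediction and ground truth
        ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([0, 0, 1, 1, 1, 2])
pred = np.array([0, 1, 1, 1, 1, 2])
# part 0: 1/2, part 1: 3/4, part 2: 1/1  ->  mean 0.75
assert abs(shape_iou(pred, gt, part_ids=[0, 1, 2]) - 0.75) < 1e-9
```

Category mIoU then averages this quantity over all shapes in a category, and the overall mIoU averages over categories, as stated in the text.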
4.3 Model Analysis
In this section, we validate our model through controlled experiments. We first investigate the properties of kernel correlation compared with normal vectors. Then, different symmetry functions on the graph are analyzed. Finally, a robustness test is performed to compare the proposed method with PointNet under random noise.
Effectiveness of Kernel Correlation. In Table 5 we demonstrate that our kernel correlation is descriptive enough to capture local geometric structures. In this experiment, we use normal vectors as the local geometric features, concatenate them with the coordinates, and pass them into the proposed architecture in Figure 2. The normal vector of each point is computed by applying PCA to the local covariance matrix to obtain the direction of minimal variance. Results show that kernel correlation achieves better performance than normal vectors. Besides, kernel correlation is embedded in the end-to-end architecture with learnable kernel points, while normal vectors need to be computed separately as part of the input. The learned kernel points capture a variety of local geometric structures, as displayed in Figure 4.
Table 5: Kernel correlation vs. normal vectors as local geometric features.

| Local features | Accuracy (%) |
|---|---|
| normal | 88.4 |
| kernel correlation | 89.4 |
Comparison of Different Symmetry Functions. A symmetry function makes a model invariant to input permutation [19]. In this section, we investigate several symmetry functions on the graph, namely graph max pooling and graph average pooling. In particular, graph max pooling takes the maximum feature over the neighborhood as in Equation 5, while graph average pooling averages the features of a point over its neighborhood:
$$\mathbf{Y} = \hat{\mathbf{A}} f(\mathbf{X}) \quad (6)$$

where $f$ is the same per-point function as in Equation 4, and $\hat{\mathbf{A}}$ is the normalized adjacency matrix:

$$\hat{\mathbf{A}} = \mathbf{D}^{-1} \mathbf{A} \quad (7)$$

where $\mathbf{A}$ is the adjacency matrix with neighboring vertices connected by an edge and $\mathbf{D}$ is the degree matrix defined as:

$$\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_N), \quad d_i = \sum_{j=1}^{N} \mathbf{A}_{ij} \quad (8)$$

where $d_i$ counts the number of vertices connected with vertex $i$.
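In matrix form, Equations (6)-(8) amount to averaging the features of each vertex's neighbors; a minimal NumPy sketch on a toy triangle graph (data and names are illustrative):

```python
import numpy as np

def graph_avg_pool(feats, A):
    """Eqs. (6)-(8): Y = D^{-1} A f(X), i.e., the mean of neighbor features."""
    D_inv = np.diag(1.0 / A.sum(axis=1))  # inverse degree matrix
    return D_inv @ A @ feats

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)  # triangle graph: every pair connected
feats = np.array([[2.0], [4.0], [6.0]])
pooled = graph_avg_pool(feats, A)
# each vertex receives the mean of its two neighbors' features
assert np.allclose(pooled, [[5.0], [4.0], [3.0]])
```

Average pooling weights every neighbor equally, whereas max pooling (Equation 5) keeps only the strongest activation per channel; the experiment below compares the two.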
As shown in Table 6, there is only a minor difference between graph max pooling and graph average pooling; we use graph max pooling to extract local features in this paper.
Table 6: Comparison of symmetry functions on the graph.

| Symmetry function | Accuracy (%) |
|---|---|
| max pooling | 90.8 |
| average pooling | 90.6 |
Effectiveness of Local Geometric and Feature Representations. In Table 7 we demonstrate the effect of the local geometric and feature structures learned by kernel correlation and graph max pooling, respectively. It is noteworthy that our kernel correlation or graph max pooling layer alone already achieves performance comparable to PointNet. While neighborhood information is also studied in PointNet++ [21], we take a different approach focusing on local feature aggregation through kernel correlation and graph pooling. Specifically, we explicitly learn a set of points to capture different local geometric structures and aggregate local features through neighboring vertices on a KNN graph, whereas PointNet++ learns local features by sampling and hierarchically grouping neighboring points and passing them into PointNet to encode local region patterns.
Table 7: Effectiveness of learning local geometric and feature structures.

| Method of learning local structures | Accuracy (%) |
|---|---|
| kernel correlation (geometric) | 89.4 |
| graph max pooling (feature) | 89.3 |
| both | 90.8 |
Robustness Test. We perform an experiment comparing our model with PointNet on robustness to random noise in the input point cloud. Both networks are trained on the same training and test data with 1,024 points per object. For PointNet, we augment the training data by randomly rotating each object along the up-axis and jittering the position of each point by Gaussian noise with zero mean and 0.02 standard deviation, as explained in [20]. During testing, a certain number of randomly selected input points are replaced with uniformly distributed noise in [-1.0, 1.0]. As shown in Figure 3, our model is more robust to random noise. The accuracy of PointNet drops greatly when 10 points in each object are replaced with uniformly distributed noise, from 79.1% to 30.6%, while our model drops from 88.3% to 70.3%. This shows an advantage of local structures over the per-point features in PointNet: our network learns to exploit local geometric and feature structures within neighboring regions and is thus robust to random noise.
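The test-time corruption protocol described above can be sketched as follows (a helper of our own that mirrors the described setup; the point-cloud statistics are illustrative):

```python
import numpy as np

def corrupt(points, num_noise, rng):
    """Replace `num_noise` randomly chosen points with uniform noise in [-1, 1],
    mirroring the robustness test protocol."""
    out = points.copy()
    idx = rng.choice(len(points), size=num_noise, replace=False)
    out[idx] = rng.uniform(-1.0, 1.0, size=(num_noise, points.shape[1]))
    return out

rng = np.random.default_rng(3)
pc = rng.normal(scale=0.3, size=(1024, 3))   # a stand-in 1,024-point object
noisy = corrupt(pc, 10, rng)
assert noisy.shape == pc.shape
assert np.sum(np.any(noisy != pc, axis=1)) == 10  # exactly 10 points replaced
```

Running a trained classifier on `noisy` versus `pc` and comparing accuracies reproduces the kind of degradation curve reported in Figure 3.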
5 Conclusion
In this work, we propose a novel deep neural network that exploits neighborhood information to represent local geometric and feature structures in 3D point clouds. Our network measures the geometric affinity between a set of learnable points and neighboring points via kernel correlation. Features are then aggregated by max pooling within neighborhoods to encode local feature structures. We have shown that the proposed network captures local patterns efficiently and achieves competitive performance on 3D point cloud classification and part segmentation benchmarks.
Acknowledgment
The authors gratefully acknowledge the helpful comments and suggestions of Teng-Yok Lee, Ziming Zhang, Zhiding Yu, Yuichi Taguchi, and Alan Sullivan.
References
 [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [2] Y. Chen and G. Medioni. Object modeling by registration of multiple range images. In Robotics and Automation, 1991. Proceedings., 1991 IEEE International Conference on, pages 2724–2729. IEEE, 1991.
 [3] P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, and G. Ranzuglia. MeshLab: an Open-Source Mesh Processing Tool. In V. Scarano, R. D. Chiara, and U. Erra, editors, Eurographics Italian Chapter Conference. The Eurographics Association, 2008.
 [4] C. Feng, Y. Taguchi, and V. R. Kamat. Fast plane extraction in organized point clouds using agglomerative hierarchical clustering. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 6218–6225. IEEE, 2014.
 [5] A. Golovinskiy, V. G. Kim, and T. Funkhouser. Shape-based recognition of 3d point clouds in urban environments. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2154–2161. IEEE, 2009.
 [6] D. Holz and S. Behnke. Fast range image segmentation and smoothing using approximate surface reconstruction and region growing. Intelligent autonomous systems 12, pages 61–73, 2013.
 [7] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points, volume 26. ACM, 1992.
 [8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 [9] B. Jian and B. C. Vemuri. Robust point set registration using gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1633–1645, 2011.
 [10] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003.
 [11] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [12] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. arXiv preprint arXiv:1704.01222, 2017.
 [13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [14] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas. FPNN: Field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pages 307–315, 2016.
 [15] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
 [16] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. arXiv preprint arXiv:1611.08402, 2016.
 [17] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
 [18] D. OuYang and H.-Y. Feng. On the normal vector estimation for point cloud data from smooth surfaces. Computer-Aided Design, 37(10):1071–1079, 2005.
 [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
 [20] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multiview cnns for object classification on 3d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
 [21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
 [22] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [23] G. L. Scott and H. C. Longuet-Higgins. An algorithm for associating the features of two images. Proceedings of the Royal Society of London B: Biological Sciences, 244(1309):21–26, 1991.
 [24] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
 [25] J. Strom, A. Richardson, and E. Olson. Graph-based segmentation for colored 3d laser point clouds. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2131–2136. IEEE, 2010.
 [26] D. Teney, L. Liu, and A. v. d. Hengel. Graph-structured representations for visual question answering. arXiv preprint arXiv:1609.05600, 2016.
 [27] D. Thanou, P. A. Chou, and P. Frossard. Graph-based compression of dynamic 3d point cloud sequences. IEEE Transactions on Image Processing, 25(4):1765–1778, 2016.
 [28] Y. Tsin and T. Kanade. A correlation-based approach to robust point set registration. In European conference on computer vision, pages 558–569. Springer, 2004.
 [29] G. Vosselman, S. Dijkman, et al. 3d building model reconstruction from point clouds and ground plans. International archives of photogrammetry remote sensing and spatial information sciences, 34(3/W4):37–44, 2001.
 [30] G. Vosselman, B. G. Gorte, G. Sithole, and T. Rabbani. Recognising structure in laser scanner point clouds. International archives of photogrammetry, remote sensing and spatial information sciences, 46(8):33–38, 2004.
 [31] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (SIGGRAPH), 36(4), 2017.
 [32] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [33] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.