Point Clouds Learning with Attention-based Graph Convolution Networks

Zhuyang Xie zyxie@my.swjtu.edu.cn Junzhou Chen chenjunzhou@mail.sysu.edu.cn Bo Peng bpeng@swjtu.edu.cn School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China Research Center of Intelligent Transportation System, School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
Abstract

Point cloud data, as one representation of 3D objects, are the most primitive output of 3D sensors. Unlike 2D images, point clouds are unordered and unstructured, so classification techniques such as convolutional neural networks cannot be applied to point cloud analysis directly. To solve this problem, we propose a novel network structure, named Attention-based Graph Convolution Networks (AGCN), to extract point cloud features. Viewing the learning process as message propagation between adjacent points, we introduce an attention mechanism into AGCN to analyze the relationships between the local features of points. In addition, we introduce an auxiliary global graph structure network to compensate for the relative information of individual points in the graph structure network. The proposed network is also extended to an encoder-decoder structure for segmentation tasks. Experimental results show that the proposed network achieves state-of-the-art performance on both classification and segmentation tasks.

keywords:
point clouds, attention, graph network, autoencoder

1 Introduction

With the rapid development of data acquisition techniques, 3D sensors have been widely used in robotics, autonomous driving, reverse engineering and virtual reality. There is an increasing demand for 3D data analysis algorithms Charles et al. (2017); Qi et al. (2017); Wang et al. (2018a); Xu et al. (2018); Xie et al. (2018); Wang and Posner (2015), and a large number of 3D point cloud datasets have recently become available Wu et al. (2014); Armeni et al. (2017); Chang et al. (2015); Dai et al. (2017); Hackel et al. (2017). Due to their irregular distribution in 3D space and lack of canonical order, 3D point cloud data are difficult to process with traditional methods.
In recent years, convolutional neural networks (CNNs) have achieved great success in processing standard grid data for tasks such as image recognition He et al. (2016); Ioffe and Szegedy (2015), semantic segmentation Long et al. (2015); Noh et al. (2015) and machine translation Cheng et al. (2016); Vaswani et al. (2017). Building on the success of CNNs, many methods focus on 3D voxels Maturana and Scherer (2015); Riegler et al. (2017); Qi et al. (2016), converting point clouds into a 3D volumetric grid in a pre-processing step so that features can be extracted by a 3D convolution network. However, the computation may be redundant due to the sparsity of most 3D data, and its complexity increases exponentially with resolution Maturana and Scherer (2015); Riegler et al. (2017); Qi et al. (2016); Li et al. (2016).
Some recent work has focused on learning directly on point clouds. As a pioneering work on processing point clouds directly, PointNet Charles et al. (2017) provides an effective strategy to learn the features of individual points through a shared multi-layer perceptron (MLP), and eventually encodes global information by a symmetric function that guarantees permutation invariance to the points' order. However, the MLP of PointNet Charles et al. (2017) learns features of individual points and does not consider local geometry. To solve this problem, PointNet++ Qi et al. (2017) divides the point set into several subsets, sends these subsets to a shared PointNet, and builds a hierarchical network by repeating this process iteratively. Although PointNet++ Qi et al. (2017) builds local point sets, the relationships between these local point sets are not well modeled. In the latest work, ShapeContextNet Xie et al. (2018) uses a self-attention mechanism to learn the relationships between individual points. However, it regards attention as a query operation and calculates an attention score for each individual point over the whole point cloud, which significantly increases computation cost and memory usage.
To solve the above problems, we mainly consider two aspects: how to model the relationships between local structural features, and how to aggregate local information effectively. In this paper, we propose AGCN to learn point cloud features. Specifically, in the local structure learning stage, a set of nodes is first obtained by sparse sampling, and a local point set is constructed for each node. The local structural feature of each node is then learned directly from its local point set. Subsequently, we construct a KNN graph over these nodes and design a point attention layer on the KNN graph to learn the relationships between different local features and to gather neighbor information for each node. By stacking multiple point attention layers, the network can learn features from local to global. AGCN can also propagate high-level features back to fine-grained features, and we extend the network to an encoder-decoder structure for segmentation tasks. In addition, to compensate for the relative information of individual nodes in the graph structure network, we propose a global point graph to assist the learning of the point attention layer. Our key contributions are as follows:

  • We propose a point attention layer on the KNN graph that computes attention scores over each node's nearest neighbors, which effectively aggregates local information and guarantees permutation invariance to the points' order.

  • We propose a global point graph to compensate for the relative location information of individual nodes in the graph structure network.

  • We extend the point attention layer and propose an attention-based encoder-decoder network for point cloud segmentation.

  • We achieve better performance than state-of-the-art approaches on several standard datasets.

2 Related Works

In this section, we briefly review current approaches for 3D data. They can be classified into view-based methods, voxel-based methods, graph-based methods, and point-based methods.

2.1 View-based Methods

The view-based approaches Su et al. (2015); Novotny et al. (2017) project a 3D object into a collection of 2D views, apply a conventional 2D convolutional neural network to each view, and then aggregate the resulting features by multi-view pooling for classification and retrieval tasks. However, view-based methods require a complete view set for each target, which adds preprocessing work and computation cost. It is also nontrivial to extend them to scene understanding or other 3D tasks (e.g., per-point classification), because view-based approaches lose 3D spatial information.

2.2 Volumetric Methods

The voxelization methods convert unstructured geometric data into a regular 3D grid Maturana and Scherer (2015); Riegler et al. (2017); Qi et al. (2016), to which 3D convolution can be applied. However, the volumetric representation is often redundant due to the sparsity of most 3D data. A voxel-based method is therefore usually limited by the resolution of the volumetric grid and the computation cost of 3D convolution, which forces the use of lower-resolution inputs and makes it hard to learn local geometric details. In addition, due to the resolution limitation, it is challenging for voxel-based methods to process large-scale point cloud data.

2.3 Graph-based Methods

Graphs are a flexible representation for irregular or non-Euclidean data (such as point clouds and social networks). There are two classes of graph-based methods. The first defines the convolution operation directly on the graph. ECC Simonovsky and Komodakis (2017) is the first work to apply graph convolution to point clouds; it defines filter weights conditioned on the specific edge labels in the neighborhood of a vertex. KC-Net Shen et al. (2018) uses a KNN graph to extract local structural features of the point cloud and aggregates neighbor information through graph max pooling. DGCNN Wang et al. (2018b) proposes EdgeConv on a KNN graph, fusing local information by learning features of the edges between neighboring points.
The other class consists of spectral methods, which define convolution operations in the Fourier domain Yi et al. (2017); Kipf and Welling (2016); Defferrard et al. (2016). Some recent work addresses this area. SpiderCNN Xu et al. (2018) defines a family of convolution kernels, as products of a simple step function and a Taylor polynomial, to approximate the weight function. LocalSpecGCN Wang et al. (2018a) uses spectral convolution combined with a recursive clustering and pooling strategy to extract features of neighboring points. However, spectral methods require a large number of convolution filter parameters.

2.4 PointNets

Fig. 1: The architecture of AGCN. AGCN first samples M nodes from the input point cloud, extracts a local point set of size K for each node, and learns a local feature for each node (I). A KNN graph is constructed from the node coordinates, and neighbor features are aggregated by introducing an attention mechanism on the KNN graph (II); we stack 3 point attention layers for classification. To compensate for the relative information of individual nodes in the graph structure network, we additionally construct a global graph structure network to assist the learning of the point attention layers (III).

Some recent work has focused on learning directly from point clouds. PointNet Charles et al. (2017) uses an MLP with shared weights to learn the features of individual points, and finally encodes the point cloud by max pooling for classification. However, PointNet Charles et al. (2017) uses only individual points' features and does not exploit local information. To solve this problem, PointNet++ Qi et al. (2017) builds a hierarchical network and extracts multi-scale information from local point sets. Different hierarchical information is learned by iterative radius search, but this increases the computational complexity and does not model the relationships between different local regions. Kd-Net Klokov and Lempitsky (2017) constructs a hierarchical structure with a kd-tree, divides space along the three axes, and learns weights along specific axes. However, as the number of points and the resolution of the kd-tree increase, the corresponding computation cost also increases.
There is also work that learns local features of point clouds by constructing convolution kernels. KC-Net Shen et al. (2018) defines a series of learnable point sets on a KNN graph via kernel correlation, which are used to extract local structural features of the point cloud. ShapeContextNet Xie et al. (2018) constructs shape context kernels based on the concept of shape context. In addition, to deal with point clouds of varying size and density, the same work proposes the self-attention-based A-SCN network Xie et al. (2018).
In our work, we mainly focus on learning the relationships between different local features. Inspired by KC-Net Shen et al. (2018), we build a KNN graph to learn each node's local structural feature and aggregate local features through an attention mechanism, which differs from the graph max pooling used in KC-Net Shen et al. (2018). Unlike A-SCN Xie et al. (2018), our point attention layer computes attention scores only over neighbors, which greatly reduces the computation cost, and the proposed point attention layer can be stacked in multiple layers for better classification and segmentation.

3 Method

Our method mainly consists of three parts: (1) learning local structural features (Section 3.1); (2) the point attention layer (Section 3.2); (3) the global point graph (Section 3.3). Figure 1 illustrates our full network architecture for classification.

3.1 Learning Local Structural feature

The input point cloud is represented by three-dimensional coordinates P = {p_1, \dots, p_N}, p_i \in R^3; features such as color, surface normals or other information can also be appended. We extract local point sets in the same way as PointNet++ Qi et al. (2017). In Figure 1(I), M nodes are sampled from P by farthest point sampling, forming a set S = {s_1, \dots, s_M}, S \subset P. For each node s_i, a local point set P_i is constructed by taking the node s_i as the center and extracting its K nearest neighbors in P. We extract the local structural feature of each point set through a local mapping f, which converts the local point set P_i into a d-dimensional vector. The function f is defined as:

x_i = f(P_i) = \max_{k=1,\dots,K} \mathrm{MLP}(p_{i,k} - s_i)    (1)

where p_{i,k} represents the k-th point of the local point set P_i, (p_{i,k} - s_i) represents its normalized coordinate, and the features of the individual points are extracted by three MLP layers with shared weights. Finally, the features of the individual points are fused by local max pooling, denoted \max above.
By constructing the local point sets P_i, we learn a local feature representation x_i for each node s_i, and use the node features as the input of the subsequent network, which further reduces the computation cost of the later layers.
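The sampling-grouping-pooling pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the paper's implementation: a single shared linear-plus-ReLU layer stands in for the three shared MLP layers, the weights are random stand-ins, and the function names are ours.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from all previously chosen nodes."""
    n = points.shape[0]
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return np.array(chosen)

def local_feature(points, node_idx, k, w):
    """For each node: gather its k nearest points, center them on the node
    (the normalized coordinates), apply a shared per-point layer, max-pool."""
    feats = []
    for i in node_idx:
        d = np.linalg.norm(points - points[i], axis=1)
        knn = np.argsort(d)[:k]               # indices of the k nearest points
        local = points[knn] - points[i]       # node-centered coordinates
        h = np.maximum(local @ w, 0.0)        # one shared layer stands in for the 3-layer MLP
        feats.append(h.max(axis=0))           # local max pooling: permutation invariant
    return np.stack(feats)

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))        # toy point cloud
nodes = farthest_point_sampling(cloud, 256)   # M = 256 nodes
x = local_feature(cloud, nodes, k=16, w=rng.standard_normal((3, 64)) * 0.1)
print(x.shape)                                # (256, 64): one d-dimensional feature per node
```

Because the max is taken after a point-wise map, reordering the points inside a local set leaves the node feature unchanged, which is the permutation-invariance property relied on throughout the paper.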

3.2 Point Attention layer

Attention mechanisms are widely used in different types of deep learning tasks, such as natural language processing Vaswani et al. (2017); Cheng et al. (2016), to model the relationships between relevant parts. In this section, we introduce the attention-based point attention layer, which learns the relationships between adjacent nodes. From the local structure learning of Section 3.1, we obtain the local feature representations of the nodes, X = {x_1, \dots, x_M}. These node features are the input of the point attention layer, and the updated node features X' = {x'_1, \dots, x'_M} are its output.

Fig. 2: Point attention layer. We illustrate a point attention layer with a 3-NN graph. In the left part of the figure, for each node we aggregate the information of its neighbors according to the attention scores. Arrows indicate the direction of information propagation, and different colors indicate independent attention computations. In the right part of the figure, each node's color represents its aggregated feature.

As illustrated in Figure 2, we construct a KNN graph over the nodes. Different from the graph max pooling that KC-Net Shen et al. (2018) uses for neighbor aggregation, we focus on the nearest neighbors of each node and aggregate their information according to attention scores. The feature aggregation formula of node i is as follows:

x'_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, x_j    (2)

where x_i denotes the feature of node i, \mathcal{N}(i) denotes the index set of the neighbors of node i, x_j denotes the feature of the j-th nearest neighbor of node i, and x'_i denotes the updated feature of node i. \alpha_{ij} denotes the attention score between node i and its j-th neighbor. The operation can be regarded as a weighted summation over the neighbors of node i, which guarantees permutation invariance to the nodes' order. The attention score is calculated as follows:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}    (3)

where e_{ij} is a learned compatibility score computed from the features of node i and its j-th neighbor.

With Equation (2), each node's feature can be updated in parallel. In addition, to incorporate additional nonlinearity and increase the capacity of the model, we add a feature transformation function: a 2-layer MLP with a nonlinear activation that transforms each updated feature x'_i.
As shown in Figure 1(II), our network adopts multiple stacked point attention layers, which is a very effective structure. Stacking point attention layers achieves a CNN-like effect: the number of neighbor nodes can be regarded as the kernel size in a CNN, and as network depth increases, the receptive field grows correspondingly, so that features from local to global can be learned.
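Equations (2)-(3) restrict attention to each node's k nearest neighbors. The sketch below assumes a GAT-style compatibility score (a shared linear map, concatenation, a learned scoring vector, and a LeakyReLU); the text above does not spell out the exact score, so that choice and all names here are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def point_attention_layer(x, neigh, w, a):
    """x: (M, d) node features; neigh: (M, k) neighbor indices of the KNN graph;
    w: (d, d2) shared linear map; a: (2*d2,) scoring vector (assumed form)."""
    k = neigh.shape[1]
    h = x @ w                                  # transformed node features
    hi = h[:, None, :].repeat(k, 1)            # (M, k, d2): each node tiled k times
    hj = h[neigh]                              # (M, k, d2): its neighbors' features
    e = np.concatenate([hi, hj], axis=-1) @ a  # raw compatibility scores e_ij
    alpha = softmax(np.maximum(e, 0.2 * e))    # LeakyReLU, then softmax over the k neighbors
    return (alpha[..., None] * hj).sum(axis=1) # Eq. (2): attention-weighted sum

rng = np.random.default_rng(1)
coords = rng.standard_normal((256, 3))
d = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
neigh = np.argsort(d, axis=1)[:, 1:4]          # 3-NN graph over node coordinates, self excluded
feat = rng.standard_normal((256, 64))
W = rng.standard_normal((64, 64)) * 0.1
a = rng.standard_normal(128) * 0.1
out = point_attention_layer(feat, neigh, W, a)
```

Because Equation (2) is a weighted sum, shuffling the neighbor order leaves the output unchanged, and only M·k scores are computed instead of the M² scores of whole-cloud self-attention.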

3.3 Global Point Graph

Fig. 3: Attention-based encoder-decoder architecture. The decoder is an inverse operation of the encoder: the encoder aggregates neighbor information through attention, and the decoder propagates high-level semantic information down to lower-level, finer information. The node features of the global point graph are concatenated with the corresponding node features in each point attention layer.
Fig. 4: Global point graph. We construct a KNN graph over the nodes, grouping each node together with its nearest neighbors. We obtain the global graph feature through max pooling and concatenate it with each individual node's feature to learn each node's information relative to the whole.

In our network, the node features describe only local structure and provide no information relative to the global shape. We therefore design a simple network that builds a global structure graph to learn global information for each node. As shown in Figure 4, this network can be seen as a simplified PointNet Charles et al. (2017) that takes the sampled nodes as input. As in Section 3.2, we construct a KNN graph, and the local feature of each node is learned by a two-layer MLP. Finally, we obtain a global feature through max pooling and concatenate it with the features of the individual nodes to learn the representation of each node relative to the whole.
In Figure 1(III), we concatenate each node's feature learned by the global point graph with the corresponding node's feature in the point attention layer, giving these nodes global information. The experimental results show that the global point graph further combines local information with global information, assisting the learning of the point attention layers and improving the performance of the network.
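A minimal sketch of the global point graph: a simplified PointNet over the sampled nodes whose max-pooled output is concatenated back onto every node. The weights are random stand-ins, the KNN grouping step is omitted for brevity, and the names are ours.

```python
import numpy as np

def global_point_graph(nodes_xyz, w1, w2):
    """Shared two-layer MLP over the sampled nodes, global max pooling,
    then concatenation of the global vector onto each node's feature."""
    h = np.maximum(nodes_xyz @ w1, 0.0)        # shared layer 1
    h = np.maximum(h @ w2, 0.0)                # shared layer 2
    g = h.max(axis=0)                          # global graph feature (max pool over nodes)
    return np.concatenate([h, np.broadcast_to(g, h.shape)], axis=-1)

rng = np.random.default_rng(2)
nodes_xyz = rng.standard_normal((256, 3))      # sampled node coordinates
out = global_point_graph(nodes_xyz,
                         rng.standard_normal((3, 64)) * 0.1,
                         rng.standard_normal((64, 128)) * 0.1)
print(out.shape)                               # (256, 256): per-node feature || global feature
```

The second half of every row is the same global vector, which is exactly what lets each node's subsequent attention computation see its position relative to the whole shape.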

3.4 Attention-based Encoder-Decoder for Segmentation

In the classification network, the whole pipeline, from local structural feature learning, through the multi-layer point attention layers, to the global max pooling, can be regarded as an encoder. Point-wise classification tasks such as segmentation require the integration of local and global information, and this integration can be regarded as the inverse operation of the encoder. We therefore design the decoder network illustrated in Figure 3. In the decoder, we use an attention structure opposite to the encoder's and concatenate the local node features with the global feature as the decoder's input.
Intuitively, the encoder aggregates neighbor information through attention to achieve feature learning from local to global. Conversely, in the decoder each node sends the global information to its neighbor nodes, which propagates high-level semantic information down to finer detail.
Finally, to obtain more fine-grained local information for segmentation, we use inverse distance weighted average interpolation in 3D Euclidean space:

f(p) = \frac{\sum_{j=1}^{m} w_j(p) \, f_j}{\sum_{j=1}^{m} w_j(p)}, \qquad w_j(p) = \frac{1}{d(p, p_j)^2}    (4)

where m represents the number of nearest neighbor points in 3D Euclidean space (m = 3 in the experimental setup), f_j is the feature of the j-th nearest point p_j, and w_j(p) is the inverse squared Euclidean distance between point p and its neighbor p_j.

4 Experimental results

To verify the performance of our AGCN network, we compare it with point-based methods on classification and segmentation tasks. The datasets used include ModelNet40 Wu et al. (2014), the ShapeNet part dataset Chang et al. (2015), and the Large-Scale 3D Indoor Spaces Dataset (S3DIS) Armeni et al. (2016).

4.1 3D Point Set Classification

We evaluate our network on ModelNet40 Wu et al. (2014) for 3D point set classification. ModelNet40 contains 12311 CAD models from 40 categories and is split into 9843 models for training and 2468 for testing. For a fair comparison, we use the same data provided by PointNet Charles et al. (2017). In our experiments, we employ the same augmentation strategy as PointNet Charles et al. (2017): randomly rotating point clouds along the z-axis and jittering the position of each point with Gaussian noise of zero mean and 0.02 standard deviation.
As described in Section 3.1, our network takes the point coordinates and surface normals as input. In the local structural feature learning stage, we sample M = 256 nodes to form the set S and extract the K = 16 nearest points of each node to build its local point set. In the encoder, we stack 3 point attention layers and build a 3-NN graph for each of them to learn features from local to global. In addition, we build a 3-NN graph over the node set S as the input to the global point graph. Finally, we obtain a global feature through max pooling and feed it into three fully connected layers for object classification. Dropout with ratio 0.5 is used in the fully connected layers; ReLU and BatchNorm are used in each MLP layer. All parameters are uniformly initialized within [-0.001, 0.001]. We train the network for 200 epochs on an NVIDIA GTX 1080 GPU using TensorFlow with the Adam optimizer, a batch size of 32, an initial learning rate of 1e-3 and momentum 0.9; the learning rate is decayed by a factor of 0.7 every 20 epochs.
In Table 1, compared with current methods, the proposed method achieves state-of-the-art performance among point-based methods.

Method input points avg. class acc. overall acc.
ECC Simonovsky and Komodakis (2017) 1024 83.2 87.4
PointNet Charles et al. (2017) 1024 86.2 89.2
A-SCN Xie et al. (2018) 1024 87.4 89.8
KC-Net Shen et al. (2018) 1024 - 91.0
PointNet++ Qi et al. (2017) 5000 - 91.9
Kd-Net Klokov and Lempitsky (2017) - - 91.8
SpiderCNN Xu et al. (2018) 1024 - 92.4
LocalSpecGCN Wang et al. (2018a) 2048 - 92.1
AGCN 1024 90.7 92.6
Table 1: Classification accuracy (%) on ModelNet40.

4.2 2D Point Set Classification

We also evaluate the performance of the network on the MNIST dataset, using the same protocol as PointNet++ Qi et al. (2017), where 512 points are sampled for each digit image. Our network takes the 2D point coordinates as input; we sample M = 128 nodes to form the set S and extract the K = 32 nearest points of each node to build its local point set. The other experimental settings are the same as in Section 4.1. Table 2 shows the classification results of our network, which achieves competitive performance compared with the most recent methods on this 2D dataset.

Method input points Error rate (%)
PointNet Charles et al. (2017) 256 0.78
A-SCN Xie et al. (2018) 256 0.60
KC-Net Shen et al. (2018) 256 0.70
PointNet++ Qi et al. (2017) 512 0.51
Kd-Net Klokov and Lempitsky (2017) 1024 0.90
SpiderCNN Xu et al. (2018) - -
LocalSpecGCN Wang et al. (2018a) 1024 0.42
AGCN 512 0.48
Table 2: Error rate (%) on MNIST dataset.

4.3 Part Segmentation

Method Cat. mIoU Ins. mIoU airplane bag cap car chair earphone guitar knife lamp laptop motorbike mug pistol rocket skateboard table
#shapes 2690 76 55 898 3758 69 787 392 1547 451 202 184 283 66 152 5271
PointNet Charles et al. (2017) 80.4 83.7 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6
PointNet++ Qi et al. (2017) 81.9 85.1 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6
Kd-Net Klokov and Lempitsky (2017) 77.4 82.3 80.1 74.6 74.3 70.3 88.6 73.5 90.2 87.2 81.0 94.9 57.4 86.7 78.1 51.8 69.9 80.3
SpiderCNN Xu et al. (2018) 82.4 85.3 83.5 81.0 87.2 77.5 90.7 76.8 91.1 87.3 83.3 95.8 70.2 93.5 82.7 59.7 75.8 82.8
SynSpecCNN Yi et al. (2017) 82.0 84.7 81.6 81.7 81.9 75.2 90.2 74.9 93.0 86.1 84.7 95.6 66.7 92.7 81.6 60.6 82.9 82.1
A-SCN Xie et al. (2018) 81.8 84.6 83.8 80.8 83.5 79.3 90.5 69.8 91.7 86.5 82.9 96.0 69.2 93.8 82.5 62.9 74.4 80.8
KC-Net Shen et al. (2018) 82.2 84.7 82.8 81.5 86.4 77.6 90.3 76.8 91.0 87.2 84.5 95.5 69.2 94.4 81.6 60.1 75.2 81.3
AGCN 82.6 85.4 83.3 79.3 87.5 78.5 90.7 76.5 91.7 87.8 84.7 95.7 72.4 93.2 84.0 63.7 76.4 82.5
Table 3: Accuracy (%) of part segmentation results on ShapeNet part dataset.
Fig. 5: Part segmentation results. The first row represents the ground truth (GT), and the second row represents our predicted result.

We evaluate our model on part segmentation using the ShapeNet part dataset Chang et al. (2015), which contains 16,881 shapes from 16 classes with 50 parts in total. Each point of an object is assigned a part label. We use the data provided by PointNet++ Qi et al. (2017) and employ the same experimental setup for training and testing. The task of part segmentation is to predict the part category of each point, which can be regarded as a point-wise classification problem.
We use the point coordinates and surface normals as input to the network. In the local structural feature learning stage, we sample M = 384 nodes to form the set S and extract the K = 16 nearest points of each node to build its local point set. As described in Section 3.4, in the encoder we stack 3 point attention layers and build an 8-NN graph for each of them. In addition, we build an 8-NN graph over the node set S as the input to the global point graph. Other hyper-parameters are the same as in Section 4.1.
We use intersection-over-union (IoU) to evaluate our network, the same as PointNet++ Qi et al. (2017). The overall average instance mIoU (Ins. mIoU) is calculated by averaging the IoUs of all shape instances, and the overall average category mIoU (Cat. mIoU) by averaging over the 16 categories. Results are shown in Table 3: with the attention-based encoder-decoder structure, we achieve good segmentation results for most categories. Some of the segmentation results are shown in Figure 5.

4.4 Semantic Segmentation

We evaluate our network on semantic scene segmentation using the S3DIS dataset Armeni et al. (2016), which contains 3D scans from Matterport scanners in 6 areas comprising 271 rooms. Each point in the scene point clouds is annotated with one of 13 semantic labels. We use the same strategy as PointNet Charles et al. (2017) and A-SCN Xie et al. (2018): the data are first split by room, and each room is then sampled into 1m-by-1m blocks, each containing 4096 points.

Method mean IoU overall accuracy (%)
PointNet Charles et al. (2017) 47.71 78.62
A-SCN Xie et al. (2018) 52.72 81.59
SEGCloud Tchapmi et al. (2017) 48.92 -
G+RCU Engelmann et al. (2017) 49.7 81.1
RSNet Huang et al. (2018) 56.47 -
Engelmann et al. Engelmann et al. (2018) 58.27 83.95
AGCN 56.63 84.13
Table 4: 6-fold cross validation results on S3DIS dataset.
Method ceiling floor wall beam column window door chair table bookcase sofa board clutter
SEGCloud Tchapmi et al. (2017) 90.06 96.05 69.86 0.00 18.37 38.35 23.12 78.59 70.40 58.42 40.88 12.96 41.06
RSNet Huang et al. (2018) 92.48 92.83 78.56 32.75 34.37 51.62 68.11 59.72 60.13 16.42 50.22 44.85 52.03
G+RCU Engelmann et al. (2017) 90.3 92.1 67.9 44.7 24.2 52.3 51.2 58.1 47.4 6.9 39.0 30.0 41.9
Engelmann et al. Engelmann et al. (2018) 92.1 95.0 72.0 33.5 15.0 46.5 60.9 65.1 69.5 56.8 38.2 6.9 51.3
AGCN 91.37 94.62 76.12 54.93 35.23 56.71 57.69 62.61 55.94 19.37 46.57 37.37 47.64
Table 5: IoU(%) per class on the S3DIS dataset.
Fig. 6: Semantic segmentation results. From left to right: original input scenes, ground truth segmentation, our segmentation results, PointNet Charles et al. (2017) segmentation results.

The input for each point is a 9-dimensional vector (xyz, RGB, and the normalized room location). In the local structural feature learning stage, we sample M = 512 nodes to form the set S and extract the K = 16 nearest points of each node to build its local point set. As described in Section 3.4, in the encoder we stack 3 point attention layers and build an 8-NN graph for each of them. In addition, we build an 8-NN graph over the node set S as the input to the global point graph.
The 6-fold cross validation results of our method are shown in Table 4, and the per-class IoU scores in Table 5. With the encoder-decoder structure, the mean IoU of our model is 56.63% and the overall accuracy is 84.13%. Some of the experimental results are shown in Figure 6. We can see that, compared with PointNet Charles et al. (2017), our segmentation results are smoother, and the segmentation of flat areas is more uniform.

5 Discussion

5.1 Influence of different inputs on Network Stability

We evaluate the effect of different numbers of input points on AGCN. Following the settings in Section 4.1, different numbers of points and the corresponding normals were used to train our network and PointNet++ Qi et al. (2017). The experimental results are shown in Figure 7. The accuracy of our network is still 87.82% when the input is reduced to 32 points.

Fig. 7: Accuracy with different number of input points on ModelNet40.

5.2 Effectiveness of global point graph

Configuration accuracy (%)
With global point graph 92.61
Without global point graph 90.54
Table 6: Accuracy on ModelNet40 with or without global point graph.

To evaluate the effectiveness of the global point graph, we trained two networks (with and without the global point graph) on the ModelNet40 classification task, with the same settings as in Section 4.1. Table 6 shows the results: the global point graph improves accuracy by nearly 2% over the network without it.

5.3 Visualize point attention layer

We visualize features from different layers learned on ModelNet40, as illustrated in Figure 8. The features obtained by local structure learning are sparse, but as network depth increases, the feature distribution approaches a cluster, indicating that our point attention layer effectively aggregates local information and achieves feature learning from local to global.

Fig. 8: Visualization of the point attention layers. The first column shows local structural features; from the second to the fourth column, we stack 3 point attention layers. We randomly sample one point; the brighter a point's color, the higher the correlation between its feature and the sampled point's feature.

5.4 Model size and Timing

Method parameters train/inference time (per batch)
PointNet Charles et al. (2017) 3.48M 0.083/0.024 s
PointNet++ Qi et al. (2017) 1.48M 0.526/0.203 s
SpiderCNN Xu et al. (2018) 3.23M 0.332/0.145 s
AGCN 2.03M 0.076/0.033 s
Table 7: Model size and train/inference time. Networks were tested on an NVIDIA GTX 1080 GPU and an Intel i7-3770K @ 3.5 GHz 4-core CPU. "M" stands for million; train/inference time is per batch.

We recorded the number of parameters of different networks and the training/inference time on the ModelNet40 classification task. For a fair comparison, we set the batch size to 16 and use 1024 points as input; other settings are the same as in Section 4.1. We report the number of network parameters and the average training/inference time per batch. The experimental results in Table 7 show that our network exceeds or approaches the latest methods. Although our model has more parameters than PointNet++ Qi et al. (2017), our point attention layer attends only to neighboring points, which reduces the network's computation.

5.5 Visualize the local patterns

We visualize the local patterns learned by the kernels of the first layer of our network on ModelNet40. For each point set, we use all points as input to activate specific neurons. In Figure 9, we can see that the kernels of our network learn local patterns such as lines and planes well.

Fig. 9: Visualization of the patterns learned by the first layer. Each row represents the same sample, and for each sample we select 4 learned kernels; red indicates the highest response to the kernel activation and blue indicates the lowest.
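The red-to-blue coloring in Figure 9 amounts to rescaling one kernel's per-point activations into a colormap range. A minimal sketch, assuming first-layer activations are available as an (N, K) array (the function name is hypothetical):

```python
import numpy as np

def kernel_response_colors(activations, kernel_idx):
    """Map one kernel's per-point activations to [0, 1] for coloring:
    1 (red in Figure 9) = highest response, 0 (blue) = lowest.

    activations: (N, K) first-layer outputs for N points and K kernels.
    """
    a = activations[:, kernel_idx]
    return (a - a.min()) / (a.max() - a.min() + 1e-8)
```

Feeding the result to a diverging colormap (e.g. blue-to-red) reproduces the rendering style of the figure.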

6 Conclusion

In this paper, we presented AGCN. We introduced an attention mechanism into the graph structure network to build the point attention layer, which learns the relationships between local features. We also extended the point attention layer to build an encoder-decoder attention network for segmentation tasks. To compensate for the relative information of individual points in the graph network, we introduced an additional global graph structure network. Extensive experiments show that the point attention layer can effectively model the relationships between local features and achieve local-to-global feature learning. In future work, we will further improve the point attention layer for 3D semantic analysis.

Acknowledgement

This work is partly supported by the National Natural Science Foundation of China (No. 61876158) and the Sichuan Science and Technology Program (No. 19ZDYF2070).

References

  • [1] I. Armeni, S. Sax, A. R. Zamir and S. Savarese (2017) Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105. Cited by: §1.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer and S. Savarese (2016) 3D semantic parsing of large-scale indoor spaces. In Computer Vision & Pattern Recognition, Cited by: §4.4, §4.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song and H. Su (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §1, §4.3, §4.
  • [4] R. Q. Charles, S. Hao, K. Mo and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision & Pattern Recognition, Cited by: §1, §2.4, §3.3, Fig. 6, §4.1, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4, Table 7.
  • [5] J. Cheng, L. Dong and M. Lapata (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733. Cited by: §1, §3.2.
  • [6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. Cited by: §1.
  • [7] M. Defferrard, X. Bresson and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Cited by: §2.3.
  • [8] F. Engelmann, T. Kontogianni, A. Hermans and B. Leibe (2017) Exploring spatial context for 3d semantic segmentation of point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 716–724. Cited by: Table 4, Table 5.
  • [9] F. Engelmann, T. Kontogianni, J. Schult and B. Leibe (2018) Know what your neighbors do: 3d semantic segmentation of point clouds. In European Conference on Computer Vision, pp. 395–409. Cited by: Table 4, Table 5.
  • [10] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler and M. Pollefeys (2017) Semantic3D.net: a new large-scale point cloud classification benchmark. Cited by: §1.
  • [11] K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [12] Q. Huang, W. Wang and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: Table 4, Table 5.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1.
  • [14] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. Cited by: §2.3.
  • [15] R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In 2017 IEEE International Conference on Computer Vision (ICCV), Cited by: §2.4, Table 1, Table 2, Table 3.
  • [16] Y. Li, S. Pirk, H. Su, C. R. Qi and L. J. Guibas (2016) Fpnn: field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pp. 307–315. Cited by: §1.
  • [17] J. Long, E. Shelhamer and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
  • [18] D. Maturana and S. Scherer (2015) VoxNet: a 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots & Systems, Cited by: §1, §2.2.
  • [19] H. Noh, S. Hong and B. Han (2015) Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, Cited by: §1.
  • [20] D. Novotny, D. Larlus and A. Vedaldi (2017) Learning 3d object categories by looking around them. Cited by: §2.1.
  • [21] C. R. Qi, S. Hao, M. Niessner, A. Dai and L. J. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. Cited by: §1, §2.2.
  • [22] C. R. Qi, Y. Li, S. Hao and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. Cited by: §1, §2.4, §3.1, §4.2, §4.3, Table 1, Table 2, Table 3, §5.1, §5.4, Table 7.
  • [23] G. Riegler, A. O. Ulusoy and A. Geiger (2017) OctNet: learning deep 3d representations at high resolutions. Cited by: §1, §2.2.
  • [24] Y. Shen, C. Feng, Y. Yang and D. Tian (2018) Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4548–4557. Cited by: §2.3, §2.4, §3.2, Table 1, Table 2, Table 3.
  • [25] M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. Cited by: §2.3, Table 1.
  • [26] H. Su, S. Maji, E. Kalogerakis and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §2.1.
  • [27] L. Tchapmi, C. Choy, I. Armeni, J. Gwak and S. Savarese (2017) Segcloud: semantic segmentation of 3d point clouds. In 2017 International Conference on 3D Vision (3DV), pp. 537–547. Cited by: Table 4, Table 5.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §3.2.
  • [29] C. Wang, B. Samari and K. Siddiqi (2018) Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–66. Cited by: §1, §2.3, Table 1, Table 2.
  • [30] D. Z. Wang and I. Posner (2015) Voting for voting in online point cloud object detection.. In Robotics: Science and Systems, Vol. 1, pp. 10–15607. Cited by: §1.
  • [31] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein and J. M. Solomon (2018) Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829. Cited by: §2.3.
  • [32] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao (2014) 3D shapenets: a deep representation for volumetric shapes. Cited by: §1, §4.1, §4.
  • [33] S. Xie, S. Liu, Z. Chen and Z. Tu (2018) Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4606–4615. Cited by: §1, §2.4, §4.4, Table 1, Table 2, Table 3, Table 4.
  • [34] Y. Xu, T. Fan, M. Xu, L. Zeng and Y. Qiao (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102. Cited by: §1, §2.3, Table 1, Table 2, Table 3, Table 7.
  • [35] L. Yi, H. Su, X. Guo and L. J. Guibas (2017) Syncspeccnn: synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290. Cited by: §2.3, Table 3.