DensePoint: Learning Densely Contextual Representation for Efficient Point Cloud Processing

Yongcheng Liu  Bin Fan    Gaofeng Meng  Jiwen Lu  Shiming Xiang  Chunhong Pan
 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
 School of Artificial Intelligence, University of Chinese Academy of Sciences
 Department of Automation, Tsinghua University
{yongcheng.liu, bfan, gfmeng, smxiang, chpan}@nlpr.ia.ac.cn   lujiwen@tsinghua.edu.cn
Corresponding author: Bin Fan
Abstract

Point cloud processing is very challenging, as the diverse shapes formed by irregular points are often indistinguishable. A thorough grasp of the elusive shape requires sufficiently contextual semantic information, yet few works are devoted to this. Here we propose DensePoint, a general architecture to learn densely contextual representation for point cloud processing. Technically, it extends regular grid CNN to irregular point configuration by generalizing a convolution operator, which holds the permutation invariance of points and achieves efficient inductive learning of local patterns. Architecturally, it finds inspiration from the dense connection mode, repeatedly aggregating multi-level and multi-scale semantics in a deep hierarchy. As a result, densely contextual information along with rich semantics can be acquired by DensePoint in an organic manner, making it highly effective. Extensive experiments on challenging benchmarks across four tasks, as well as thorough model analysis, verify that DensePoint achieves state-of-the-art performance.

1 Introduction

Recently, the processing of point clouds, which comprise irregular sets of 3D points, has drawn a lot of attention due to its wide range of applications such as robot manipulation [20] and autonomous driving [31]. However, modern applications usually demand a high-level understanding of the point cloud, i.e., identifying the implicit 3D shape pattern. This is quite challenging, since the diverse shapes abstractly formed by these irregular points are often hardly distinguishable. For this issue, it is essential to capture sufficiently contextual semantic information for a thorough grasp of the elusive shape (see Fig. 1 for details).

Over the past few years, the convolutional neural network (CNN) has demonstrated a powerful ability to abstract semantic information in the image recognition field [61]. Accordingly, much effort has been focused on replicating its remarkable success on regular grid data, i.e., image analysis [22, 41], for irregular point cloud processing [33, 19, 35, 49, 59, 27]. A straightforward strategy is to transform the point cloud into regular voxels [53, 28, 4] or multi-view images [44, 3, 6] for an easy application of CNN. These transformations, however, usually lead to a considerable loss of the rich 3D geometric information, as well as high complexity.

Figure 1: Motivation: sufficiently contextual semantic information is essential for a thorough grasp of the elusive shape formed by point cloud. The “bottle” is misidentified as the “vase” by PointNet [32], while with sufficient context aggregated, it can be accurately recognized. Here, we only illustrate the multi-level context around the blue point for visual clearness.

Another difficult yet attractive solution is to learn directly from irregular point cloud. PointNet [32], a pioneer in this direction, achieves the permutation invariance of points by learning over each point independently, then applying a symmetric function to accumulate features. Though impressive, it ignores local patterns that have been proven to be important for abstracting high-level visual semantics in image CNN [61]. To remedy this defect, KCNet [39] mines local patterns by creating a k-NN graph over each point in PointNet. Nevertheless, it inherits another defect of PointNet, i.e., no pooling layer to explicitly raise the level of semantics. PointNet++ [33] hierarchically groups point cloud into local subsets and learns on them by PointNet. This design indeed works like CNN, but the basic operator, PointNet, demands high complexity for enough effectiveness.

Besides high-level semantics, contextual information, which reflects the potential semantic dependencies between a target pattern and its surroundings [30], is also critical for shape pattern recognition. A typical approach in this view is multi-scale learning. Accordingly, PointNet++ [33] directly applies multi-scale grouping in each layer, i.e., capturing context at the same semantic level. This way, however, is suboptimal as it ignores the inherent difference in semantic levels at different scales, and often causes huge computational cost, especially for lots of scales. Multi-resolution grouping [33] can partly alleviate the latter issue, yet actually, it also abandons crucial context acquisition. ShapeContextNet [55] finds another strategy inspired by shape context [2]. It applies self-attention [48] in each layer of PointNet [32] to dynamically learn the relation weight among all points, and regards this weight as global shape context. Though fully automatic, it lacks an explicit semantic abstraction like CNN from local to global, and the weight matrix in self-attention can cause huge complexity when the number of points increases.

In short, there are mainly two key requirements to exploit CNN for effective learning on point cloud: 1) A convolution operator on point cloud, which can be permutation invariant to unordered points, and can achieve efficient inductive learning of local patterns, is required; 2) A deep hierarchy, which can acquire sufficiently contextual semantics for accurate shape recognition, is also required.

Accordingly, we propose DensePoint, a general architecture to learn densely contextual representation for point cloud processing, as illustrated in Fig. 2. Technically, DensePoint extends regular grid CNN to irregular point configuration by generalizing a convolution operator, which holds the permutation invariance of points, and respects the convolutional properties, i.e., local connectivity and weight sharing. Owing to its efficient inductive learning of local patterns, a deep hierarchy can be easily built in DensePoint for semantic abstraction. Architecturally, DensePoint finds inspiration from dense connection mode [13], to repeatedly aggregate multi-level and multi-scale semantics in the deep hierarchy. As a result, densely contextual information along with rich semantics, can be acquired by DensePoint in an organic manner, making it highly effective.

The key contributions are highlighted as follows:

  • A generalized convolution operator is formulated. It is permutation invariant to points, and respects the convolutional properties of local connectivity and weight sharing, thus extending regular grid CNN to irregular configuration for efficient point cloud processing.

  • A general architecture equipped with the generalized convolution operator to learn densely contextual representation of point cloud, i.e., DensePoint, is proposed. It can acquire sufficiently contextual semantic information for accurate recognition of the implicit shape.

  • Comprehensive experiments on challenging benchmarks across four tasks, i.e., shape classification, shape retrieval, part segmentation and normal estimation, as well as thorough model analysis, demonstrate that DensePoint achieves the state of the arts.

2 Related Work

In this section, we briefly review existing deep learning methods for 3D shape learning.

View-based and volumetric methods.  View-based methods [44, 3, 6, 54, 7, 34, 14] represent a 3D shape as a collection of 2D views, over which the classic CNN used in the image analysis field can be easily applied. However, 2D projections could cause much loss of 3D shape information due to self-occlusions. Volumetric methods convert a 3D shape into a regular 3D grid [53, 28, 4], over which 3D CNN [47] can be employed. The main limitation is the quantization loss of 3D shape information due to the low resolution enforced by the 3D grid. Although this issue can be partly alleviated by recent space partition methods like K-d trees [21] or octrees [50, 45, 36, 51], they still rely on a subdivision of a bounding volume. By contrast, our work is devoted to learning directly from irregular 3D point clouds.

Deep learning on point cloud.  Much effort has been focused on learning directly on point cloud. PointNet [32] pioneers this route by learning on each point independently and accumulating the final features. Yet it ignores local patterns, which limits its semantic learning ability. Accordingly, some works [33, 5, 39] partition point cloud into local subsets and learn on them based on PointNet. Some other works introduce graph convolutional network to learn over a local graph [49, 46, 25] or geometric elements [23]. However, these methods either lack an explicit semantic abstraction like CNN from local to global, or cause considerable complexity. By contrast, our work extends regular grid CNN to irregular point configuration, achieving efficient learning for point cloud processing.

In addition, there are some works mapping point cloud into a regular space to facilitate the application of classic CNN, e.g., a sparse lattice structure [43] with bilateral convolution [18] or a continuous volumetric function [1] with 3D CNN. Nevertheless, in our case, we learn directly from irregular point cloud, which is much more challenging.

Figure 2: The illustration of DensePoint. It extends regular grid CNN to irregular point configuration by an efficient generalized convolution operator (PConv in Eq. (1)). Instead of a classic CNN architecture with layer-by-layer connections, it finds inspiration from the dense connection mode [13] to repeatedly aggregate multi-level along with multi-scale semantics in an organic manner. To avoid high complexity in deep layers, it forces the output of each layer to be equally narrow with a small constant width (e.g., 24). As a result, densely contextual representation can be learned efficiently for point cloud processing. Here, $N$ is the number of points and the remaining symbols denote feature dimensions.

Contextual learning on point cloud.  Contextual information is important for identifying the implicit shape pattern. PointNet++ [33] follows the traditional multi-scale learning by directly capturing context on the same layer, which often causes huge complexity. Hence an alternative called multi-resolution grouping [33] is devised for efficiency. It forces each layer to learn from its previous layer and the raw input (on the same local region) simultaneously. However, this can be less effective as it actually abandons crucial context acquisition. ShapeContextNet [55] finds another strategy inspired by shape context [2]. Instead of the traditional handcrafted design, it applies self-attention [48] to dynamically learn a weight for all point pairs. Though fully automatic, it lacks a local-to-global semantic learning like CNN. By contrast, we develop a deep hierarchy with an efficient generalized convolution operator, and organically aggregate multi-level contextual semantics in this hierarchy.

3 Method

In this section, we first describe the generalized convolution operator and the pooling operator on point cloud. Then, we present DensePoint, and elaborate how it learns densely contextual representation for point cloud processing.

3.1 Convolution and Pooling on Point Cloud

PConv: convolution on point cloud.  Classic convolution on images operates on a local grid region (i.e., local connectivity), and the convolution filter weights of this grid region are shared along the spatial dimension (i.e., weight sharing). However, this operation is difficult to apply to point clouds due to the irregularity of points. To deal with this problem, we decompose the classic convolution into two core steps, i.e., feature transformation and feature aggregation. Accordingly, a generalized convolution on point cloud can be formulated as

$$ \mathbf{y}_i \;=\; \mathcal{A}\Big(\big\{\phi(\mathbf{f}_{x_j}),\; \forall x_j \in \mathcal{N}(x_i)\big\}\Big) \qquad (1) $$

where both $x_i$ and $x_j$ denote a 3D point in $\mathbb{R}^3$, and $\mathbf{f}_{x_j}$ is the feature vector of $x_j$. $\mathcal{N}(x_i)$, the neighborhood formed by a local point cloud to convolve, is sampled from the whole point cloud by taking a sampled point $x_i$ as the centroid and the nearby points $x_j$ as its neighbors. $\mathbf{y}_i$, the convolutional result serving as the inductive representation of $\mathcal{N}(x_i)$, is obtained by: (i) performing a feature transformation with function $\phi$ on each point in $\mathcal{N}(x_i)$; (ii) applying an aggregation function $\mathcal{A}$ to aggregate these transformed features. Finally, as shown in the upper part of Fig. 2 (PConv), similar to classic grid convolution, $\mathbf{y}_i$ is assigned to be the feature vector of the centroid point $x_i$ in the next layer. Noticeably, some previous works such as [33] also use this general formulation.

In Eq. (1), $\mathbf{y}_i$ can be permutation invariant only when the inner function $\phi$ is shared over each point in $\mathcal{N}(x_i)$ and the outer function $\mathcal{A}$ is symmetric (e.g., sum). Accordingly, for high efficiency, we employ a shared single-layer perceptron (SLP, for short) followed by a nonlinear activator as $\phi$ to implement the feature transformation. Meanwhile, as done in classic convolution, $\phi$ is also shared over each local neighborhood, achieving the weight sharing mechanism. As a result, with a symmetric $\mathcal{A}$, the generalized PConv can achieve efficient inductive learning of local patterns whilst being independent of the irregularity of points. Further, using PConv as the basic operator, a classic CNN architecture (no downsampling), as shown in the upper part of Fig. 2, can be easily built with layer-by-layer connections.
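To make the operator concrete, the following is a minimal PConv sketch in PyTorch (an illustrative reading of Eq. (1), not the released implementation): the shared SLP $\phi$ is realized as a 1x1 convolution applied to every neighbor of every centroid, and the symmetric aggregation $\mathcal{A}$ is max pooling over the neighbor dimension; the neighborhood grouping that produces the input tensor is assumed to be done separately.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Shared SLP phi: the same weights are applied to each neighbor of each
        # centroid (weight sharing), followed by BN and a nonlinear activator.
        self.slp = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, grouped_features):
        # grouped_features: (B, C_in, N, K) -- K neighbor features for each of N centroids.
        transformed = self.slp(grouped_features)          # (B, C_out, N, K)
        # Symmetric aggregation over neighbors makes the result permutation invariant.
        aggregated, _ = torch.max(transformed, dim=-1)    # (B, C_out, N)
        return aggregated
```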

PPool: pooling on point cloud.  In classic CNN, pooling is usually performed to explicitly raise the semantic level of the representation and improve computational efficiency. Here, using PConv, this operation can be achieved on point cloud in a learnable way. Specifically, $N'$ points are first uniformly sampled from the $N$ input points, where $N' < N$ (e.g., $N' = N/2$). Then, PConv can be applied to convolve all the local neighborhoods centered on those $N'$ points, to generate a new downsampled layer.
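A sketch of the uniform downsampling step of PPool under simple assumptions: centroids are picked by farthest point sampling (the strategy named in the implementation details); their neighborhoods would then be gathered and convolved by the PConv sketched above to produce the downsampled layer.

```python
import torch

def farthest_point_sampling(xyz, n_samples):
    # xyz: (N, 3) point coordinates; returns indices of n_samples well-spread points.
    n = xyz.shape[0]
    chosen = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    farthest = torch.randint(n, (1,)).item()
    for i in range(n_samples):
        chosen[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=-1)   # distance to the newest centroid
        dist = torch.minimum(dist, d)                  # distance to the chosen set so far
        farthest = int(torch.argmax(dist))             # next centroid: the farthest point
    return chosen

# Usage: for a PPool with downsampling rate 1/2 on N points, n_samples = N // 2.
```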

3.2 Learning Densely Contextual Representation

Classic CNN architecture.   In a classic CNN architecture with layer-by-layer connections (the upper part of Fig. 2), hierarchical representations can be learned, with low-level ones in early layers and high-level ones in deep layers [61]. However, a significant drawback is that each layer can only learn from a single-level representation. As a consequence, all layers can capture only single-scale shape information from the input point cloud. Formally, assume a point cloud $P_0$ is passed through this type of network. The network comprises $L$ layers, in which the $\ell$-th layer performs a non-linear transformation $H_{\ell}$. Then, the output $\mathbf{F}_{\ell}$ of the $\ell$-th layer can be learned from its previous layer as

$$ \mathbf{F}_{\ell} \;=\; H_{\ell}\big(\mathbf{F}_{\ell-1}\big), \qquad \ell = 1, \ldots, L \qquad (2) $$

where each point in $\mathbf{F}_{\ell}$ has a single-scale receptive field on the input point cloud $P_0$, with the result that the learned $\mathbf{F}_{\ell}$ captures only single-scale shape information. Finally, this leads to a weakly contextual representation, which is not effective enough for identifying the diverse implicit shapes.

DensePoint architecture.  To overcome the above issue, we present a general architecture, i.e., DensePoint shown in the lower part of Fig. 2, inspired by dense connection mode [13]. Specifically, for each layer in DensePoint (no downsampling), the outputs of all preceding layers are used as its input, and its own output is used as the input to all subsequent layers. That is, Eq. (2) becomes

$$ \mathbf{F}_{\ell} \;=\; H_{\ell}\big([\mathbf{F}_{0}, \mathbf{F}_{1}, \ldots, \mathbf{F}_{\ell-1}]\big) \qquad (3) $$

where $[\mathbf{F}_{0}, \mathbf{F}_{1}, \ldots, \mathbf{F}_{\ell-1}]$ denotes the concatenation of the outputs of all preceding layers. Here, $H_{\ell}$ is forced to learn from multi-level representations, which facilitates aggregating multi-level shape semantics along with multi-scale shape information. In this way, each layer in DensePoint can capture a certain level (scale) of context, and this level can be gradually increased as the network deepens. Moreover, the acquired dense context in deep layers can in turn improve the abstraction of high-level semantics, making the whole learning process organic. Eventually, very rich local-to-global shape information in the input can be progressively aggregated, resulting in a densely contextual representation for point cloud processing.
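The sketch below illustrates one DensePoint stage under simple assumptions: every layer consumes the concatenation of the stage input and all preceding outputs (Eq. (3)) and contributes a narrow $k$-channel output. The `group_fn` argument is a hypothetical helper standing in for the neighborhood-grouping step, which is omitted for brevity; this is not the official implementation.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # One layer of the stage: a shared SLP over each neighbor, then max aggregation,
    # producing a narrow k-channel output (the PConv of Sec 3.1, without grouping).
    def __init__(self, in_channels, narrowness):
        super().__init__()
        self.slp = nn.Sequential(
            nn.Conv2d(in_channels, narrowness, 1),
            nn.BatchNorm2d(narrowness),
            nn.ReLU(inplace=True),
        )

    def forward(self, grouped):                  # grouped: (B, C_in, N, K)
        return self.slp(grouped).max(dim=-1)[0]  # (B, k, N)

class DensePointStage(nn.Module):
    def __init__(self, in_channels, narrowness=24, num_layers=3):
        super().__init__()
        # Input width grows by `narrowness` with every preceding layer (dense connections).
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + l * narrowness, narrowness)
            for l in range(num_layers))

    def forward(self, features, group_fn):
        # features: (B, C, N); group_fn: hypothetical helper mapping (B, C, N) point
        # features to (B, C, N, K) neighbor features for each centroid.
        outputs = [features]
        for layer in self.layers:
            dense_input = torch.cat(outputs, dim=1)        # concat of all preceding outputs
            outputs.append(layer(group_fn(dense_input)))   # each new output adds k channels
        return torch.cat(outputs, dim=1)                   # densely contextual representation
```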

Note that DensePoint is quite different from the traditional multi-scale strategy [33]. The former progressively aggregates multi-level (multi-scale) semantics that is organically learned by each layer, while the latter artlessly gathers multi-scale information at the same level. It is also dissimilar to a simple concatenation of all layers as the final output, which results in each layer being less contextual.

Narrow architecture.  As the network deepens, DensePoint will suffer from high complexity, since the convolutional overhead of deep layers becomes huge when all preceding layers serve as the input. Thus, we narrow the output channels of each layer in DensePoint to a small constant $k$ (e.g., $k=24$), instead of the large widths (e.g., 512) used in classic CNN.

ePConv: enhanced PConv.  Though lightweight, such a narrow DensePoint will lack expressive power, since with a much narrower output of width $k$, the shared SLP in PConv, i.e., $\phi$ in Eq. (1), could be insufficient in terms of learning ability. To overcome this issue, we introduce filter grouping [22] to enhance PConv, which divides all the filters in a layer into several groups, with each group performing an individual operation. Formally, the enhanced PConv (ePConv, for short) converts Eq. (1) to

$$ \mathbf{y}_i \;=\; \psi\Big(\mathcal{A}\Big(\big\{\phi_g(\mathbf{f}_{x_j}),\; \forall x_j \in \mathcal{N}(x_i)\big\}\Big)\Big) \qquad (4) $$

where $\phi_g$, the grouped version of SLP $\phi$, can widen its output to enhance its learning ability while maintaining the original efficiency, and $\psi$, a normal SLP (shared over each centroid point $x_i$), is added to integrate the detached information in all groups. Both $\phi_g$ and $\psi$ include a nonlinear activator.
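A hedged sketch of ePConv follows: the grouped SLP $\phi_g$ widens the per-neighbor output at roughly the cost of an ungrouped narrow SLP (via the `groups` argument of a 1x1 convolution), and a normal SLP $\psi$, applied per centroid after aggregation, mixes the groups back together and restores the narrow width $k$. The channel sizes are illustrative defaults borrowed from the configuration tables, not a specification.

```python
import torch
import torch.nn as nn

class EPConv(nn.Module):
    def __init__(self, in_channels, widened_channels=96, narrowness=24, groups=2):
        super().__init__()
        # Grouped, widened SLP phi_g; in_channels and widened_channels must be divisible by groups.
        self.phi_g = nn.Sequential(
            nn.Conv2d(in_channels, widened_channels, 1, groups=groups),
            nn.BatchNorm2d(widened_channels),
            nn.ReLU(inplace=True),
        )
        # Per-centroid SLP psi integrating the detached groups back to width k.
        self.psi = nn.Sequential(
            nn.Conv1d(widened_channels, narrowness, 1),
            nn.BatchNorm1d(narrowness),
            nn.ReLU(inplace=True),
        )

    def forward(self, grouped_features):
        # grouped_features: (B, C_in, N, K)
        transformed = self.phi_g(grouped_features)        # (B, widened, N, K)
        aggregated, _ = torch.max(transformed, dim=-1)    # symmetric aggregation over neighbors
        return self.psi(aggregated)                       # (B, k, N)
```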

Input: point cloud $P$; input features $\{\mathbf{f}_i^{0}, x_i \in P\}$; depth $L$; weights $\mathbf{W}_{g,\ell}$ and $\mathbf{W}_{\ell}$, and biases $\mathbf{b}_{g,\ell}$ and $\mathbf{b}_{\ell}$ for SLP$_g$ and SLP in Eq. (4), $\ell \in \{1, \ldots, L\}$; non-linearity $\sigma$; aggregation function $\mathcal{A}$; neighborhood method $\mathcal{N}$
Output: densely contextual representations $\{\mathbf{f}_i, x_i \in P\}$
1 $\mathbf{h}_i^{0} \leftarrow \mathbf{f}_i^{0}$ for each $x_i \in P$;
2 for $\ell = 1$ to $L$ do
3        for $x_i \in P$ do
4               $\mathbf{t}_i \leftarrow \mathcal{A}\big(\{\sigma(\mathbf{W}_{g,\ell} \ast_g \mathbf{h}_j^{\ell-1} + \mathbf{b}_{g,\ell}),\; \forall x_j \in \mathcal{N}(x_i)\}\big)$;   ⊳ grouped SLP$_g$ and aggregation
5               $\mathbf{f}_i^{\ell} \leftarrow \sigma(\mathbf{W}_{\ell}\,\mathbf{t}_i + \mathbf{b}_{\ell})$;   ⊳ SLP integrating the groups
6        end for
7        $\mathbf{h}_i^{\ell} \leftarrow [\mathbf{h}_i^{\ell-1}, \mathbf{f}_i^{\ell}]$ for each $x_i \in P$;   ⊳ dense concatenation
8 end for
9 return $\{\mathbf{f}_i = \mathbf{h}_i^{L}, x_i \in P\}$
Algorithm 1 DensePoint forward pass algorithm

To elaborate ePConv with filter grouping, let SLP$_g$ (resp. SLP) denote the SLP of $\phi_g$ (resp. $\psi$), and let $C_{in}$ (resp. $C_{out}$) denote the input (resp. output) channels of SLP$_g$, with $g$ being the number of groups. Then, the parameter number of SLP$_g$ before and after filter grouping is $C_{in} \times C_{out}$ vs. $\frac{C_{in}}{g} \times \frac{C_{out}}{g} \times g = \frac{C_{in} \times C_{out}}{g}$. Here $C_{in}$ and $C_{out}$ are divisible by $g$, and the few parameters in the bias term are ignored for clearness. In other words, using filter grouping, $C_{out}$ can be increased by $g$ times with almost the same complexity. Besides, inspired by the bottleneck layer [9], we fix the output channels of SLP$_g$ and SLP as $4k$ and $k$ respectively (i.e., the layer output keeps width $k$), to hold the original narrowness for DensePoint. Hence, with a small $k$, SLP introduces only a little extra complexity of $4k \times k$, which can be easily remedied by a suitable $g$. The detailed forward pass procedure of DensePoint equipped with ePConv is given in Algorithm 1, where $\ast_g$ indicates performing the grouping operation.
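As a worked instance of this accounting, consider the illustrative channel sizes that appear in one row of the supplementary configuration tables ($C_{in}=152$, $C_{out}=96$, $g=2$, $k=24$); the numbers below are an example of the bookkeeping, not a complexity claim beyond what the tables already list:

$$ \#\text{params}(\mathrm{SLP}_g,\ g{=}1) = C_{in} \times C_{out} = 152 \times 96 = 14592 $$
$$ \#\text{params}(\mathrm{SLP}_g,\ g{=}2) = \tfrac{C_{in}}{g} \times \tfrac{C_{out}}{g} \times g = 76 \times 48 \times 2 = 7296 $$
$$ \#\text{params}(\mathrm{SLP}) = C_{out} \times k = 96 \times 24 = 2304 $$

That is, grouping roughly halves the cost of SLP$_g$, while the integrating SLP adds only a small $C_{out} \times k$ term.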

DensePoint for point cloud processing.  DensePoint applied to point cloud classification and per-point analysis (e.g., segmentation) is illustrated in Fig. 3. In both tasks, DensePoint with ePConv is applied in each stage of the network to learn densely contextual representation, while PPool with the original PConv is used to explicitly raise the semantic level and improve efficiency. For classification, the final global representation is learned by three PPools and two DensePoints (11 layers in total), followed by three fully connected (fc) layers as the classifier. For per-point analysis, four levels of representations learned by four PPools and three DensePoints (17 layers in total) are sequentially upsampled by feature propagation [33] to generate per-point predictions. All the networks can be trained in an end-to-end manner. The configuration details are included in the supplementary material.

Figure 3: DensePoint applied to point cloud classification (a) and per-point analysis (b). PPool: pooling on point cloud (Sec 3.1). $N$ is the number of points. A stage means several successive layers operating on the same number of points.

Implementation details.  PPool: the farthest points are picked from the point cloud for uniform downsampling. Neighborhood: the spherical neighborhood is adopted; a fixed number of neighbors are randomly sampled in each neighborhood for batch processing (the centroid is reused if there are not enough), and they are normalized by subtracting the centroid. Group number $g$ in $\phi_g$ (Eq. (4)): $g=2$. Nonlinear activator: ReLU [29]. Dropout [42]: for model regularization, we apply dropout with a 20% ratio on $\phi_g$ in Eq. (4) and dropout with a 50% ratio on the first two fc layers in the classification network (Fig. 3(a)). Narrowness $k$: $k=24$. Aggregation function $\mathcal{A}$: the symmetric function max pooling is employed. Batch normalization (BN) [17]: as done in image CNN, BN is used before each nonlinear activator for all layers. Note that only the 3D coordinates $(x, y, z)$ are used as the initial input features. Code is available at https://github.com/Yochengliu/DensePoint.
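The following is a sketch of the spherical-neighborhood sampling described above (an assumed reading of the described behaviour, not the released implementation): points inside the radius are randomly sampled to a fixed size, the centroid index is reused when too few neighbors exist, and the gathered neighbors are normalized by subtracting the centroid.

```python
import torch

def ball_query(xyz, centroid_idx, radius, k):
    # xyz: (N, 3); centroid_idx: int; returns (k, 3) centroid-normalized neighbors.
    centroid = xyz[centroid_idx]
    dist = ((xyz - centroid) ** 2).sum(dim=-1).sqrt()
    inside = torch.nonzero(dist < radius, as_tuple=False).squeeze(-1)
    if inside.numel() >= k:
        pick = inside[torch.randperm(inside.numel())[:k]]            # random subset in the sphere
    else:
        pad = torch.full((k - inside.numel(),), centroid_idx, dtype=torch.long)
        pick = torch.cat([inside, pad])                              # reuse the centroid to pad
    return xyz[pick] - centroid                                      # normalize by the centroid
```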

4 Experiment

We conduct comprehensive experiments to validate the effectiveness of DensePoint. We first evaluate DensePoint for point cloud processing on challenging benchmarks across four tasks (Sec 4.1). We then provide detailed experiments to study DensePoint thoroughly (Sec 4.2).

4.1 DensePoint for Point Cloud Processing

Shape classification.  We evaluate DensePoint on the ModelNet40 and ModelNet10 classification benchmarks [53]. The former comprises 9843 training models and 2468 test models in 40 classes, while the latter consists of 3991 training models and 908 test models in 10 classes. The point cloud data is sampled from these models as in [32]. For training, we uniformly sample 1024 points as the input. As in [21], we augment the input with random anisotropic scaling in the range [0.66, 1.5] and translation in the range [-0.2, 0.2]. For testing, similar to [32, 33], we apply voting with 10 tests using random scaling and then average the predictions.
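A minimal sketch of this training-time augmentation (random anisotropic scaling and translation with the ranges stated above); the exact sampling scheme used by the authors may differ.

```python
import numpy as np

def augment(points, scale_low=0.66, scale_high=1.5, shift_range=0.2):
    # points: (N, 3) array of xyz coordinates.
    scale = np.random.uniform(scale_low, scale_high, size=(1, 3))   # per-axis (anisotropic) scale
    shift = np.random.uniform(-shift_range, shift_range, size=(1, 3))
    return points * scale + shift
```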

The quantitative comparisons with the state-of-the-art point-based methods are summarized in Table 1. Our DensePoint outperforms all the point-input methods. Specifically, it reduces the error rate of PointNet++ by 26.9% on ModelNet40, and also surpasses its advanced version that uses additional normal data with very dense points (5k). Furthermore, even using only point coordinates as the input, DensePoint surpasses the best additional-input method, SO-Net [24], by 0.9% on ModelNet10. These results convincingly verify the effectiveness of DensePoint.

 

method input #points M40 M10

 

Pointwise-CNN [12] pnt 1k 86.1 -
Deep Sets [60] pnt 1k 87.1 -
ECC [40] pnt 1k 87.4 90.8
PointNet [32] pnt 1k 89.2 -
SCN [55] pnt 1k 90.0 -
Kd-Net(depth=10) [21] pnt 1k 90.6 93.3
PointNet++ [33] pnt 1k 90.7 -
MC-Conv [11] pnt 1k 90.9 -
KCNet [39] pnt 1k 91.0 94.4
MRTNet [4] pnt 1k 91.2 -
Spec-GCN [49] pnt 1k 91.5 -
DGCNN [52] pnt 1k 92.2 -
PointCNN [26] pnt 1k 92.2 -
PCNN [1] pnt 1k 92.3 94.9
Ours pnt 1k 93.2 96.6
SO-Net [24] pnt 2k 90.9 94.1
Kd-Net(depth=15) [21] pnt 32k 91.8 94.0

 

O-CNN [50] pnt, nor - 90.6 -
Spec-GCN [49] pnt, nor 1k 91.8 -
PointNet++ [33] pnt, nor 5k 91.9 -
SpiderCNN [56] pnt, nor 5k 92.4 -
SO-Net [24] pnt, nor 5k 93.4 95.7

 

Table 1: Shape classification results (overall accuracy, %) on ModelNet40 (M40) and ModelNet10 (M10) benchmarks (pnt: point coordinates, nor: normal, “-”: unknown).

 

input method #points/views M40 M10

 

Points PointNet [10] 1k 70.5 -
DGCNN [52] 1k 85.3 -
PointCNN [26] 1k 83.8 -
Ours 1k 88.5 93.2

 

Images GVCNN [3] 12 85.7 -
Triplet-center [10] 12 88.0 -
PANORAMA-ENN [38] - 86.3 93.3
SeqViews [7] 12 89.1 89.5

 

Table 2: Shape retrieval results (mAP, %) on ModelNet40 (M40) and ModelNet10 (M10) benchmarks (“-”: unknown).

 

method input class mIoU instance mIoU air plane bag cap car chair ear phone guitar knife lamp laptop motor bike mug pistol rocket skate board table

 

Kd-Net [21] 4k 77.4 82.3 80.1 74.6 74.3 70.3 88.6 73.5 90.2 87.2 81.0 94.9 57.4 86.7 78.1 51.8 69.9 80.3
PointNet [32] 2k 80.4 83.7 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6
SCN [55] 1k 81.8 84.6 83.8 80.8 83.5 79.3 90.5 69.8 91.7 86.5 82.9 96.0 69.2 93.8 82.5 62.9 74.4 80.8
SPLATNet [43] - 82.0 84.6 81.9 83.9 88.6 79.5 90.1 73.5 91.3 84.7 84.5 96.3 69.7 95.0 81.7 59.2 70.4 81.3
KCNet [39] 2k 82.2 84.7 82.8 81.5 86.4 77.6 90.3 76.8 91.0 87.2 84.5 95.5 69.2 94.4 81.6 60.1 75.2 81.3
RS-Net [15] - 81.4 84.9 82.7 86.4 84.1 78.2 90.4 69.3 91.4 87.0 83.5 95.4 66.0 92.6 81.8 56.1 75.8 82.2
DGCNN [52] 2k 82.3 85.1 84.2 83.7 84.4 77.1 90.9 78.5 91.5 87.3 82.9 96.0 67.8 93.3 82.6 59.7 75.5 82.0
PCNN [1] 2k 81.8 85.1 82.4 80.1 85.5 79.5 90.8 73.2 91.3 86.0 85.0 95.7 73.2 94.8 83.3 51.0 75.0 81.8
Ours 2k 84.2 86.4 84.0 85.4 90.0 79.2 91.1 81.6 91.5 87.5 84.7 95.9 74.3 94.6 82.9 64.6 76.8 83.7

 

SO-Net [24] -,nor 80.8 84.6 81.9 83.5 84.8 78.1 90.8 72.2 90.1 83.6 82.3 95.2 69.3 94.2 80.0 51.6 72.1 82.6
SyncCNN [58] mesh 82.0 84.7 81.6 81.7 81.9 75.2 90.2 74.9 93.0 86.1 84.7 95.6 66.7 92.7 81.6 60.6 82.9 82.1
PointNet++ [33] 2k,nor 81.9 85.1 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6
SpiderCNN [56] 2k,nor 82.4 85.3 83.5 81.0 87.2 77.5 90.7 76.8 91.1 87.3 83.3 95.8 70.2 93.5 82.7 59.7 75.8 82.8

 

Table 3: Shape part segmentation results (%) on ShapeNet part benchmark (nor: normal, “-”: unknown).

Shape retrieval.  To further explore the recognition ability of DensePoint for the implicit shapes, we apply the global features, i.e., the outputs of the penultimate fc layer in the classification network (Fig. 3(a)), for shape retrieval. We sort the most relevant shapes for each query from the test set by cosine distance, and report mean Average Precision (mAP). In addition to point-based methods, we also compare with some advanced 2D image-based ones. The results are summarized in Table 2. As can be seen, DensePoint significantly outperforms PointNet by 18 points in mAP. It is also comparable with the image-based methods (even the ensemble one [38]), which greatly benefit from image CNN and pre-training on large-scale datasets (e.g., ImageNet [37]). Fig. 4 shows some retrieval examples.
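A minimal sketch of this retrieval protocol under the stated setup: the global features from the penultimate fc layer are L2-normalized and the gallery is ranked by cosine similarity to each query; the feature matrices are assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query, gallery):
    # query: (Q, D), gallery: (G, D) global shape features.
    q = F.normalize(query, dim=1)
    g = F.normalize(gallery, dim=1)
    similarity = q @ g.t()                               # (Q, G) cosine similarities
    return similarity.argsort(dim=1, descending=True)    # retrieval order per query
```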

Figure 4: Retrieval examples on ModelNet40. Top-10 matches are shown for each query, with the 1st line for PointNet [32] and the 2nd line for our DensePoint. The mistakes are highlighted in red.

Shape part segmentation.  Part segmentation is a challenging task for fine-grained shape recognition. Here we evaluate DensePoint on the ShapeNet part benchmark [57]. It contains 16881 shapes in 16 categories, labeled with 50 parts in total, where each shape has 2 to 5 parts. We follow the data split in [32], and similarly, we randomly pick 2048 points as the input and concatenate the one-hot encoding of the object label to the last feature layer of the segmentation network in Fig. 3(b). In testing, we also apply voting with ten tests using random scaling. In addition to the standard IoU (Intersection-over-Union) score for each category, two types of mean IoU (mIoU), averaged across all classes and all instances respectively, are also reported.

Table 3 summarizes the quantitative comparisons with the state-of-the-art methods, where DensePoint achieves the best performance. Furthermore, it significantly surpasses the second best point-input method, i.e., DGCNN [52], by 1.9 in class mIoU and 1.3 in instance mIoU, respectively. Noticeably, it also sets new state-of-the-art results among the point-based methods in eight categories. These improvements demonstrate the robustness of DensePoint to diverse shapes. Some segmentation examples are shown in Fig. 5.

Figure 5: Segmentation examples on ShapeNet part benchmark.

 

dataset method #points error

 

ModelNet40 PointNet [1] 1k 0.47
PointNet++ [1] 1k 0.29
PCNN [1] 1k 0.19
MC-Conv [11] 1k 0.16
Ours 1k 0.149

 

Table 4: Normal estimation error on ModelNet40 benchmark.

Normal estimation.  Normal estimation in point clouds is a crucial step for numerous applications, from surface reconstruction and scene understanding to rendering. Here, we regard normal estimation as a supervised regression task, and implement it by deploying DensePoint with the segmentation network in Fig. 3(b). The cosine loss between the normalized output and the ground-truth normals is employed for training. We evaluate DensePoint on the ModelNet40 benchmark for this task, where 1024 points are uniformly sampled as the input.
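One possible form of the cosine loss described here, sketched under the assumption that both the network output and the ground-truth normals are unit vectors after normalization; the exact formulation used for training may differ.

```python
import torch
import torch.nn.functional as F

def cosine_loss(pred, gt):
    # pred, gt: (B, N, 3) predicted and ground-truth normals.
    cos = F.cosine_similarity(pred, gt, dim=-1)   # in [-1, 1]
    return (1.0 - cos).mean()                     # 0 when predictions align with the ground truth
```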

The quantitative comparisons of the estimation error are summarized in Table 4, where DensePoint outperforms the other advanced methods. Moreover, it significantly reduces the error of PointNet++ by 48.6%. Fig. 6 shows some normal prediction examples. As can be seen, DensePoint with densely contextual semantics can obtain more decent normal predictions, while PointNet and PointNet++ produce many deviations above 90° from the ground truth. However, in this task, DensePoint cannot process some intricate shapes well, e.g., curtains and plants.

Figure 6: Normal estimation examples on the ModelNet40 benchmark. For visual clearness, we only show the predictions whose angle to the ground truth is less than 30° (in blue) or greater than 90° (in red).

4.2 DensePoint Analysis

In this section, we first perform a detailed ablation study for DensePoint. Then, we discuss the group number $g$ in ePConv (Eq. (4)), the network narrowness $k$, the aggregation function $\mathcal{A}$ and the network stage to which DensePoint is applied, respectively. Finally, we analyze the robustness of DensePoint to sampling density and random noise, and investigate the model complexity. All the experiments are conducted on the ModelNet40 [53] dataset.

Ablation study.  The results of the ablation study are summarized in Table 5. We set two baselines: model A and its concatenation variant A*. Model A is a classic hierarchical version (the upper part of Fig. 2, i.e., layer-by-layer connections without contextual learning by DensePoint) of the classification network with the same number of layers, and each layer is configured with the same width. Model A* directly concatenates all layers in each stage of model A as the final output of that stage. Both of them are equipped with PConv in Eq. (1).

The baseline model A obtains a low classification accuracy of 88.6%, which increases by only 0.5 percentage points with direct concatenation (model A*). However, with the densely contextual semantics of DensePoint, the accuracy rises significantly by 2.5 percentage points (91.1%, model B). This convincingly verifies its effectiveness. Then, when using ePConv to enhance the expressive power of each layer in DensePoint, the accuracy is further improved to 92.5% (model C). Noticeably, the dropout on $\phi_g$ in Eq. (4) brings a boost of 0.3 percentage points (model D). The data augmentation technique results in an accuracy variation of 0.7 percentage points (model E). Finally, with the voting strategy, our final model F achieves an impressive accuracy of 93.2%. In addition, we also investigate the number of input points by increasing it to 2k, yet obtain no gain (model G); the model may need to be modified to adapt to more input points.

 

model #points DA DP ePConv DO vote acc.

 

A 1k 88.6
A* 1k 89.1

 

B 1k 91.1
C 1k 92.5
D 1k 92.8
E 1k 92.1
F 1k 93.2
G 2k 93.2

 

Table 5: Ablation study of DensePoint (%) (DA: data augmentation, DP: DensePoint, DO: dropout on $\phi_g$ in Eq. (4), A*: direct concatenation variant of model A).

 

group number #params #FLOPs/sample acc. (%)

 

1 0.73M 1030M 92.7
2 0.67M 651M 93.2
4 0.62M 457M 92.2
6 0.61M 394M 92.3
12 0.60M 331M 92.1

 

Table 6: The impact of the group number $g$ on network parameters, FLOPs and performance ($k=24$).

Group number $g$ in ePConv (Eq. (4)).  Filter grouping can greatly reduce the model complexity, whilst acting as a model regularizer by rarefying the filter relationships [16]. Table 6 summarizes the impact of $g$ on model parameters, model FLOPs (floating point operations per sample) and classification accuracy. As can be seen, the model parameters are very few (0.73M) even when filter grouping is not performed. This is due to the narrow design ($k=24$) of each layer in DensePoint and the few parameters in the generalized convolution operator, ePConv. Eventually, with $g=2$, DensePoint achieves the best result of 93.2% with acceptable model complexity.

Network narrowness $k$.  Table 7 summarizes the comparisons of different narrowness $k$. One can see that a very small DensePoint, i.e., $k=12$, can still obtain an impressive accuracy of 92.1%. This further verifies the power of the densely contextual semantics acquired by DensePoint for shape identification. Note that a large $k$ is usually unnecessary for DensePoint, as it greatly raises the model complexity without bringing any gain.

Aggregation function $\mathcal{A}$.  We experiment with three symmetric functions, i.e., sum, average pooling and max pooling, whose results are 91.0%, 91.3% and 93.2%, respectively. Max pooling performs best, probably because it selects the biggest feature response, keeping the most expressive representation and removing redundant information.

Figure 7: (a) Point cloud with different sampling densities. (b) Results of testing with sparser points. (c) Point cloud with some points being replaced with random noise (highlighted in red). (d) Results of testing with noisy points.

 

narrowness #params #FLOPs/sample acc. (%)

 

12 0.56M 294M 92.1
24 0.67M 651M 93.2
36 0.76M 957M 92.9
48 0.88M 1310M 92.7

 

Table 7: The comparisons of different narrowness $k$ ($g=2$).

 

model 1st stage 2nd stage acc. (%)

 

baseline - - 90.5
B ✓ - 91.8
C - ✓ 92.3
D ✓ ✓ 93.2

 

Table 8: The comparisons of DensePoint applied in different stages of the classification network (Fig. 3(a)).

Network stage to apply DensePoint.  To investigate the impact of contextual semantics at different levels on shape recognition, we apply DensePoint with ePConv in different stages of the classification network (Fig. 3(a)). The results are summarized in Table 8. The baseline is set the same as model A in Table 5 but equipped with ePConv for a fair comparison. One can see that DensePoint applied in the 1st stage (model B) or the 2nd stage (model C) brings a considerable boost, with the latter performing better. This indicates that the higher-level contextual semantics in the 2nd stage result in a more powerful representation for shape recognition. Finally, with DensePoint in each stage for sufficiently contextual semantic information, the best result of 93.2% is reached.

Robustness analysis.  The robustness of DensePoint to sampling density and random noise is shown in Fig. 7. For the former, we use sparser inputs of 1024, 512, 256, 128 and 64 points with a model trained on 1024 points. Random input dropout is applied during training, for fair comparisons with PointNet [32], PointNet++ [33], SO-Net [24], PCNN [1] and DGCNN [52]. Fig. 7(b) shows that our model and PointNet++ perform better in this test. Nevertheless, our model obtains higher accuracy than PointNet++ at all densities. This indicates that the densely contextual semantics of DensePoint is much more effective than the traditional multi-scale information of PointNet++.

For the latter, as in KCNet [39], we replace a certain number of randomly picked points with uniform noise in the range [-1.0, 1.0] during testing. The comparisons with PointNet, PointNet++ and KCNet are shown in Fig. 7(d). Note that for this test, our model is trained without any data augmentation to avoid confusion. As can be seen, our model is quite robust to random noise, while the others are vulnerable. This demonstrates the power of the densely contextual semantics in DensePoint.

Model complexity.  The comparisons of model complexity with the state of the arts are summarized in Table 9. As can be seen, our model is quite competitive, and with network depth $L=6$ it is the most efficient one (accuracy 92.1%). This shows its great potential for real-time applications, e.g., scene parsing in autonomous driving.

Discussion of limitations.  (1) The density of local point clouds is not considered, which could reduce effectiveness under greatly non-uniform distributions; (2) the importance of each level of context is not evaluated, which could make it difficult to distinguish very similar shapes.

 

method #params #FLOPs/sample

 

PointNet [32] 3.50M 440M
PointNet++ [26] 1.48M 1684M
DGCNN [26] 1.84M 2767M
SpecGCN [26] 2.05M 1112M
KCNet [39] 0.90M -
PCNN [26] 8.20M 294M
PointCNN [26] 0.60M 1581M

 

Ours ($L=11$) 0.67M 651M
Ours ($L=6$) 0.53M 148M

 

Table 9: The comparisons of model complexity (“-”: unknown).

5 Conclusion

In this work, DensePoint, a general architecture to learn densely contextual representation for efficient point cloud processing, has been proposed. DensePoint extends regular grid CNN to irregular point configuration by an efficient generalized convolution operator. Based on this operator, DensePoint develops a deep hierarchy and progressively aggregates multi-level and multi-scale semantics from it. As a consequence, DensePoint can acquire sufficiently contextual information along with rich semantics in an organic manner, making it highly effective for implicit shape identification. Extensive experiments on challenging benchmarks across four tasks, as well as thorough model analysis, have demonstrated that DensePoint achieves state-of-the-art performance. In addition, DensePoint shows quite good robustness against noisy points, which could provide a promising direction for robust point cloud representation learning.

References

  • [1] M. Atzmon, H. Maron, and Y. Lipman (2018) Point convolutional neural networks by extension operators. In SIGGRAPH, pp. 1–14. Cited by: §2, §4.2, Table 1, Table 3, Table 4.
  • [2] S. J. Belongie, J. Malik, and J. Puzicha (2002) Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24 (4), pp. 509–522. Cited by: §1, §2.
  • [3] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao (2018) GVCNN: group-view convolutional neural networks for 3D shape recognition. In CVPR, pp. 264–272. Cited by: §1, §2, Table 2.
  • [4] M. Gadelha, R. Wang, and S. Maji (2018) Multiresolution tree networks for 3D point cloud processing. In ECCV, pp. 105–122. Cited by: §1, §2, Table 1.
  • [5] P. Guerrero, Y. Kleiman, M. Ovsjanikov, and N. J. Mitra (2018) PCPNet: learning local shape properties from raw point clouds. Comput. Graph. Forum 37 (2), pp. 75–85. Cited by: §2.
  • [6] H. Guo, J. Wang, Y. Gao, J. Li, and H. Lu (2016) Multi-view 3D object retrieval with deep embedding network. IEEE Trans. Image Processing 25 (12), pp. 5526–5537. Cited by: §1, §2.
  • [7] Z. Han, M. Shang, Z. Liu, C. Vong, Y. Liu, M. Zwicker, J. Han, and C. L. P. Chen (2019) SeqViews2SeqLabels: learning 3D global features via aggregating sequential views by RNN with attention. IEEE Trans. Image Processing 28 (2), pp. 658–672. Cited by: §2, Table 2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pp. 1026–1034. Cited by: §E.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2.
  • [10] X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai (2018) Triplet-Center loss for multi-view 3D object retrieval. In CVPR, pp. 1945–1954. Cited by: Table 2.
  • [11] P. Hermosilla, T. Ritschel, P. Vázquez, A. Vinacua, and T. Ropinski (2018) Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Trans. Graph. 37 (6), pp. 235:1–235:12. Cited by: Table 1, Table 4.
  • [12] B. Hua, M. Tran, and S. Yeung (2018) Pointwise convolutional neural networks. In CVPR, pp. 974–993. Cited by: Table 1.
  • [13] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 2261–2269. Cited by: §1, Figure 2, §3.2.
  • [14] H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. G. Kim, and E. Yumer (2018) Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM Trans. Graph. 37 (1), pp. 6:1–6:14. Cited by: §2.
  • [15] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3D segmentation of point clouds. In CVPR, pp. 2626–2635. Cited by: Table 3.
  • [16] Y. Ioannou, D. P. Robertson, R. Cipolla, and A. Criminisi (2017) Deep roots: improving CNN efficiency with hierarchical filter groups. In CVPR, pp. 5977–5986. Cited by: §4.2.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §3.2.
  • [18] V. Jampani, M. Kiefel, and P. V. Gehler (2016) Learning sparse high dimensional filters: image filtering, dense crfs and bilateral neural networks. In CVPR, pp. 4452–4461. Cited by: §2.
  • [19] M. Jiang, Y. Wu, and C. Lu (2018) PointSIFT: A SIFT-like network module for 3D point cloud semantic segmentation. arXiv preprint arXiv:1807.00652. Cited by: §1.
  • [20] D. I. Kim and G. S. Sukhatme (2014) Semantic labeling of 3D point clouds with object affordance for robot manipulation. In ICRA, pp. 5578–5584. Cited by: §1.
  • [21] R. Klokov and V. S. Lempitsky (2017) Escape from cells: deep Kd-Networks for the recognition of 3D point cloud models. In ICCV, pp. 863–872. Cited by: §2, §4.1, Table 1, Table 3.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1106–1114. Cited by: §1, §3.2.
  • [23] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, pp. 4558–4567. Cited by: §2.
  • [24] J. Li, B. M. Chen, and G. H. Lee (2018) SO-Net: self-organizing network for point cloud analysis. In CVPR, pp. 9397–9406. Cited by: §4.1, §4.2, Table 1, Table 3.
  • [25] R. Li, S. Wang, F. Zhu, and J. Huang (2018) Adaptive graph convolutional neural networks. In AAAI, pp. 3546–3553. Cited by: §2.
  • [26] Y. Li, R. Bu, M. Sun, and B. Chen (2018) PointCNN: convolution on X-transformed points. In NeurIPS, pp. 828–838. Cited by: Table 1, Table 2, Table 9.
  • [27] Y. Liu, B. Fan, S. Xiang, and C. Pan (2019) Relation-shape convolutional neural network for point cloud analysis. In CVPR, pp. 8895–8904. Cited by: §1.
  • [28] D. Maturana and S. Scherer (2015) VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, pp. 922–928. Cited by: §1, §2.
  • [29] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Cited by: §3.2.
  • [30] A. Oliva and A. Torralba (2007) The role of context in object recognition. Trends in Cognitive Sciences 11 (12), pp. 520–527. Cited by: §1.
  • [31] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum PointNets for 3D object detection from RGB-D data. In CVPR, pp. 918–927. Cited by: §1.
  • [32] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR, pp. 77–85. Cited by: Figure 1, §1, §1, §2, Figure 4, §4.1, §4.1, §4.2, Table 1, Table 3, Table 9.
  • [33] C. R. Qi, L. Yi, H. su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pp. 5099–5108. Cited by: §1, §1, §1, §2, §2, §3.1, §3.2, §3.2, §4.1, §4.2, Table 1, Table 3.
  • [34] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, pp. 5648–5656. Cited by: §2.
  • [35] S. Ravanbakhsh, J. Schneider, and B. Poczos (2017) Deep learning with sets and point clouds. In ICLR, pp. 1–12. Cited by: §1.
  • [36] G. Riegler, A. O. Ulusoy, and A. Geiger (2017) OctNet: learning deep 3D representations at high resolutions. In CVPR, pp. 6620–6629. Cited by: §2.
  • [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.1.
  • [38] K. Sfikas, I. Pratikakis, and T. Theoharis (2018) Ensemble of PANORAMA-based convolutional neural networks for 3D model classification and retrieval. Computers & Graphics 71, pp. 208–218. Cited by: §4.1, Table 2.
  • [39] Y. Shen, C. Feng, Y. Yang, and D. Tian (2018) Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, pp. 4548–4557. Cited by: §1, §2, §4.2, Table 1, Table 3, Table 9.
  • [40] M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, pp. 29–38. Cited by: Table 1.
  • [41] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, pp. 1–14. Cited by: §1.
  • [42] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 15 (1), pp. 1929–1958. Cited by: §3.2.
  • [43] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) SPLATNet: sparse lattice networks for point cloud processing. In CVPR, pp. 2530–2539. Cited by: §2, Table 3.
  • [44] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller (2015) Multi-view convolutional neural networks for 3D shape recognition. In ICCV, pp. 945–953. Cited by: §1, §2.
  • [45] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In ICCV, pp. 2107–2115. Cited by: §2.
  • [46] G. Te, W. Hu, A. Zheng, and Z. Guo (2018) RGCNN: regularized graph CNN for point cloud segmentation. In MM, pp. 746–754. Cited by: §2.
  • [47] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3D convolutional networks. In ICCV, pp. 4489–4497. Cited by: §2.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 6000–6010. Cited by: §1, §2.
  • [49] C. Wang, B. Samari, and K. Siddiqi (2018) Local spectral graph convolution for point set feature learning. In ECCV, pp. 1–16. Cited by: §1, §2, Table 1.
  • [50] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. 36 (4), pp. 72:1–72:11. Cited by: §2, Table 1.
  • [51] P. Wang, C. Sun, Y. Liu, and X. Tong (2018) Adaptive O-CNN: a patch-based deep representation of 3D shapes. ACM Trans. Graph. 37 (6), pp. 217:1–217:11. Cited by: §2.
  • [52] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (), pp. 1–13. Cited by: §4.1, §4.2, Table 1, Table 2, Table 3.
  • [53] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, pp. 1912–1920. Cited by: §1, §2, §4.1, §4.2.
  • [54] J. Xie, G. Dai, F. Zhu, E. K. Wong, and Y. Fang (2017) DeepShape: deep-learned shape descriptor for 3D shape retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 39 (7), pp. 1335–1345. Cited by: §2.
  • [55] S. Xie, S. Liu, Z. Chen, and Z. Tu (2018) Attentional ShapeContextNet for point cloud recognition. In CVPR, pp. 4606–4615. Cited by: §1, §2, Table 1, Table 3.
  • [56] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In ECCV, pp. 90–105. Cited by: Table 1, Table 3.
  • [57] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. J. Guibas (2016) A scalable active framework for region annotation in 3D shape collections. ACM Trans. Graph. 35 (6), pp. 210:1–210:12. Cited by: §4.1.
  • [58] L. Yi, H. Su, X. Guo, and L. J. Guibas (2017) SyncSpecCNN: synchronized spectral CNN for 3D shape segmentation. In CVPR, pp. 6584–6592. Cited by: Table 3.
  • [59] K. Yin, H. Huang, D. Cohen-Or, and H. (. Zhang (2018) P2P-NET: bidirectional point displacement net for shape transform. ACM Trans. Graph. 37 (4), pp. 152:1–152:13. Cited by: §1.
  • [60] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In NeurIPS, pp. 3394–3404. Cited by: Table 1.
  • [61] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833. Cited by: §1, §1, §3.2.

Supplementary Material

A Outline

This supplementary material provides: (1) further investigations of the proposed DensePoint (Sec B); (2) more shape retrieval examples of DensePoint and some analysis (Sec C); (3) network configuration details (Sec D); (4) training details (Sec E).

B Further Investigations

In this section, we provide further investigations of DensePoint on four aspects. Specifically, the choice of neighborhood method is discussed in Sec B.1. The effect of dropout on $\phi_g$ in Eq. (4) is analyzed in Sec B.2. The impact of network depth on classification performance is investigated in Sec B.3. The memory and runtime are summarized in Sec B.4. All the investigations are conducted on the ModelNet40 dataset.

B.1 Neighborhood Method

In the main paper, the local convolutional neighborhood in Eq. (1) is set to be a spherical neighborhood, from which a fixed number of neighbors are randomly sampled for batch processing. We compare this strategy (Random-In-Sphere) with another typical one, i.e., k-nearest neighbor (k-NN). For a fair comparison, the models with these two strategies are configured with the same settings. Table I summarizes the results.

As can be seen, the model with Random-In-Sphere performs better. We speculate that the model with k-NN will suffer from the distribution inhomogeneity of points. In this case, the contextual learning in DensePoint will be less effective, as the receptive fields will be confined to a local region with large density, which leads to ignoring those sparse points that are essential for recognizing the implicit shape. By contrast, Random-In-Sphere can have a better coverage of points even in the case of inhomogeneous distribution.

B.2 Dropout on $\phi_g$ in Eq. (4)

The dropout technique forces the whole network to behave as an ensemble of many sub-networks and reduces the risk of overfitting. To analyze its effect on DensePoint, we apply it with different ratios on $\phi_g$ in Eq. (4). The results are summarized in Table II. As can be seen, the best result of 93.2% is achieved with a dropout ratio of 20%.

B.3 Network Depth

We further explore the impact of the network depth (fully connected layers not included) on classification performance. The results are summarized in Table III. Surprisingly, a 6-layer network equipped with DensePoint can achieve an accuracy of 92.1% with only 0.53M params and 148M FLOPs/sample. This even outperforms PointNet++ [33] (accuracy 90.7%, 1.48M params [26], 1684M FLOPs/sample [26]) by 15% in error rate, whilst being one order of magnitude cheaper in terms of FLOPs/sample. We also observe that it is unnecessary to build a very deep network (e.g., 23 layers) with DensePoint, as this increases complexity without bringing any gain. Eventually, the best result of 93.2% is reached with acceptable complexity by an 11-layer network.

 

neighborhood method acc.

 

k-NN 91.3
Random-In-Sphere 93.2

 

Table I: The results (%) of two neighborhood strategies. The number of neighbors is equally set in each layer of the two models.

 

ratio (%) 0 10 20 30 40 50

 

acc. 92.9 92.8 93.2 93.0 92.8 92.5

 

Table II: The results (%) of dropout with different ratios applied on $\phi_g$ in Eq. (4).

 

#layers #params #FLOPs/sample acc.

 

6 0.53M 148M 92.1
9 0.56M 510M 92.9
11 0.67M 651M 93.2
15 0.78M 779M 93.0
19 0.88M 1222M 92.7
23 1.03M 1416M 92.6

 

Table III: The results (%) of different network depths (fully connected layers are not included).

B.4 Memory and Runtime

The memory and runtime of the proposed DensePoint are summarized in Table IV. As can be seen, the model ($L=11$) is competitive, while the other model ($L=6$) is the best one in terms of efficiency. Actually, the memory and training time issues of the dense connection mode are greatly alleviated by the shallow design of DensePoint and our highly efficient implementation. Moreover, although an extremely deep network may be unnecessary for 3D at present, in case of a very deep DensePoint in the future, the technique of shared memory allocations can be applied to achieve linear memory complexity.

 

method #points Time (ms) Memory (GB)
training test training test

 

PointNet [32] 1024 55 22 1.318 0.469
PointNet++ [33] 1024 195 47 8.311 2.305
DGCNN [52] 1024 300 68 4.323 1.235
PointCNN [26] 1024 55 38 2.501 1.493
Ours ($k{=}24$, $L{=}11$) 1024 21 10 3.745 1.228
Ours ($k{=}24$, $L{=}6$) 1024 10 5 1.468 0.886

 

Table IV: Time and memory of the classification network, where $k$ is the network narrowness and $L$ is the network depth. The statistics of all the models are collected with batch size 16 on an NVIDIA TITAN Xp, and time is the mean over repeated tests. The compared models are tested using their available official codes.

C Shape Retrieval

In this section, we show more shape retrieval examples in Fig. 8. As can be seen, compared with PointNet [32], our DensePoint obtains superior shape identification results. Specifically, PointNet is confused between the query “bottle” and the sample “vase” due to their similar shapes, whereas DensePoint, with densely contextual semantics acquired, can identify them accurately. We notice that DensePoint can also be confused by some very similar shapes, e.g., the query “bench” and the sample “tv_stand”. This could be improved by learning to weight multi-level contextual information instead of aggregating all levels of information identically. We leave it as future work.

D Network Configuration Details

In this section, we present the configuration details of three networks on shape classification, shape part segmentation and normal estimation, respectively. For clearness, we describe the layer and corresponding setting format as follows:

PPool: [downsampling rate, neighborhood radius, #number of neighbors, SLP(#input channels, #output channels)]. The global pooling is achieved by directly applying PConv to convolve all points.

ePConv: [neighborhood radius, #number of neighbors, SLP$_g$(#input channels, #output channels, #group number), SLP(#input channels, #output channels), dropout ratio].

FP (feature propagation layer): MLP(#channels of each perceptron layer). The feature propagation layer [33] is used for transforming the features that are concatenated from the current interpolated layer and the long-range connected layer. We employ a multi-layer perceptron (MLP) to implement this transformation (a sketch is given after this list).

FC (fully connected layer): [(#input channels, #output channels), dropout ratio]. Note that the dropout technique is applied for all FC layers except for the last FC layer (used for prediction).

In addition, except for the last prediction layer, all layers (including the inside perceptrons) are followed with batch normalization and ReLU activator. The output shape is in the format of (#feature dimension, #number of points).
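The sketch below illustrates a feature propagation layer in the spirit of [33], under the usual assumptions of that design: features of the sparser level are interpolated onto the denser level by inverse-distance-weighted averaging over the 3 nearest neighbors, concatenated with the long-range (skip) features, and transformed by a shared MLP. Tensor layouts and names here are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class FeaturePropagation(nn.Module):
    def __init__(self, in_channels, mlp_channels):
        super().__init__()
        layers, last = [], in_channels
        for c in mlp_channels:
            layers += [nn.Conv1d(last, c, 1), nn.BatchNorm1d(c), nn.ReLU(inplace=True)]
            last = c
        self.mlp = nn.Sequential(*layers)

    def forward(self, xyz_dense, xyz_sparse, feat_dense, feat_sparse):
        # xyz_dense: (B, N, 3), xyz_sparse: (B, M, 3),
        # feat_sparse: (B, C, M) features to upsample, feat_dense: (B, C', N) skip features.
        dist = torch.cdist(xyz_dense, xyz_sparse)                     # (B, N, M)
        d, idx = dist.topk(3, dim=-1, largest=False)                  # 3 nearest sparse points
        w = 1.0 / (d + 1e-8)
        w = w / w.sum(dim=-1, keepdim=True)                           # inverse-distance weights
        gathered = torch.gather(                                      # (B, C, N, 3)
            feat_sparse.unsqueeze(2).expand(-1, -1, xyz_dense.size(1), -1),
            3, idx.unsqueeze(1).expand(-1, feat_sparse.size(1), -1, -1))
        interpolated = (gathered * w.unsqueeze(1)).sum(dim=-1)        # (B, C, N)
        return self.mlp(torch.cat([interpolated, feat_dense], dim=1)) # shared MLP transform
```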

D.1 Shape Classification Network

The configuration details of shape classification network are presented in Table VI. The network has 14 layers in total, which comprises 3 PPools (the last one is global pooling layer) and 2 DensePoints (the 1st one has 3 layers while the 2nd one has 5 layers), followed by 3 FC layers.

D.2 Shape Part Segmentation Network

Table V summarizes the configuration details of the shape part segmentation network. As it shows, the network has 23 layers in total, which comprises 4 PPools, 3 DensePoints (4 layers, 6 layers and 3 layers in the 2nd, 3rd and 4th stage respectively) and 4 FP layers, followed by 2 FC layers. As in [32, 33], we concatenate the one-hot encoding (16-d) of the object label to the last feature layer.

D.3 Normal Estimation Network

The normal estimation network is presented in Table VII. It is almost the same as the segmentation network, except for three aspects: (1) the input becomes 1024 points and the one-hot encoding becomes 40-d for the ModelNet40 dataset; (2) the settings of some layers are slightly changed to be consistent with the 1024-point input; (3) the final output becomes 3-d for normal prediction. As done in the segmentation network, we also concatenate the one-hot encoding (40-d) of the object label to the last feature layer.

E Training Details

Our DensePoint is implemented in PyTorch. The Adam optimization algorithm is employed for training with a fixed mini-batch size. Both the momentum for batch normalization and the learning rate follow a step decay schedule over training epochs. The weights are initialized using the technique introduced by He et al. [8].

 

stage layer type setting detail output shape long-range

 

- Input - (3, 2048) FP

 

1 PPool [1/2, 0.1, 32, (3, 64)] (64, 1024) FP

 

2 PPool [1/4, 0.2, 64, (64, 128)] (128, 256)
ePConv [0.3, 32, (128, 96, 2), (96, 24), 20%] (24, 256)
ePConv [0.3, 32, (152, 96, 2), (96, 24), 20%] (24, 256)
ePConv [0.3, 32, (176, 96, 2), (96, 24), 20%] (24, 256)
ePConv [0.3, 32, (200, 96, 2), (96, 24), 20%] (24, 256)

 

The output of DensePoint in 2nd stage (224, 256) FP

 

3 PPool [1/4, 0.3, 32, (224, 192)] (192, 64)
ePConv [0.5, 16, (192, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (216, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (240, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (264, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (288, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (312, 96, 2), (96, 24), 20%] (24, 64)

 

The output of DensePoint in 3rd stage (336, 64) FP

 

4 PPool [1/4, 0.8, 32, (336, 360)] (360, 16)
ePConv [0.8, 8, (360, 96, 2), (96, 24), 20%] (24, 16)
ePConv [0.8, 8, (384, 96, 2), (96, 24), 20%] (24, 16)
ePConv [0.8, 8, (408, 96, 2), (96, 24), 20%] (24, 16)

 

The output of DensePoint in 4th stage (432, 16)

 

FP (768, 512, 512) (512, 64)
FP (736, 384, 384) (384, 256)
FP (448, 256, 256) (256, 1024)
FP (259, 128, 128) (128, 2048)

 

FC [(128+16, 128), 50%] (128, 2048)
FC [(128, $m$), -] softmax ($m$, 2048)

 

  • This is the one-hot encoding of the object label on ShapeNet part dataset.

Table V: The configuration details of the shape part segmentation network. “long-range” indicates the long-range connections (see Fig. 3(b) in the main paper). $m$ is the number of classes.

Figure 8: Retrieval examples on the ModelNet40 dataset. Top-10 matches are shown for each query, with the 1st line for PointNet [32] and the 2nd line for our DensePoint. The mistakes are highlighted in red.

 

stage layer type setting detail output shape

 

- Input - (3, 1024)

 

1 PPool [1/2, 0.25, 64, (3, 96)] (96, 512)
ePConv [0.2, 32, (96, 96, 2), (96, 24), 20%] (24, 512)
ePConv [0.2, 32, (120, 96, 2), (96, 24), 20%] (24, 512)
ePConv [0.2, 32, (144, 96, 2), (96, 24), 20%] (24, 512)

 

The output of DensePoint in 1st stage (168, 512)

 

2 PPool [1/4, 0.3, 64, (168, 144)] (144, 128)
ePConv [0.4, 16, (144, 96, 2), (96, 24), 20%] (24, 128)
ePConv [0.4, 16, (168, 96, 2), (96, 24), 20%] (24, 128)
ePConv [0.4, 16, (192, 96, 2), (96, 24), 20%] (24, 128)
ePConv [0.4, 16, (216, 96, 2), (96, 24), 20%] (24, 128)
ePConv [0.4, 16, (240, 96, 2), (96, 24), 20%] (24, 128)

 

The output of DensePoint in 2nd stage (264, 128)

 

3 PPool [-, -, 128, (264, 512)] (512, 1)
FC [(512, 512), 50%] (512, 1)
FC [(512, 256), 50%] (256, 1)
FC [(256, $m$), -] softmax ($m$, 1)

 

Table VI: The configuration details of the shape classification network. $m$ is the number of classes.

 

stage layer type setting detail output shape long-range

 

- Input - (3, 1024) FP

 

1 PPool [1, 0.2, 32, (3, 64)] (64, 1024) FP

 

2 PPool [1/4, 0.2, 32, (64, 128)] (128, 256)
ePConv [0.3, 32, (128, 96, 2), (96, 24), 20%] (24, 256)
ePConv [0.3, 32, (152, 96, 2), (96, 24), 20%] (24, 256)
ePConv [0.3, 32, (176, 96, 2), (96, 24), 20%] (24, 256)
ePConv [0.3, 32, (200, 96, 2), (96, 24), 20%] (24, 256)

 

The output of DensePoint in 2nd stage (224, 256) FP

 

3 PPool [1/4, 0.3, 32, (224, 192)] (192, 64)
ePConv [0.5, 16, (192, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (216, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (240, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (264, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (288, 96, 2), (96, 24), 20%] (24, 64)
ePConv [0.5, 16, (312, 96, 2), (96, 24), 20%] (24, 64)

 

The output of DensePoint in 3rd stage (336, 64) FP

 

4 PPool [1/4, 0.8, 32, (336, 360)] (360, 16)
ePConv [0.8, 8, (360, 96, 2), (96, 24), 20%] (24, 16)
ePConv [0.8, 8, (384, 96, 2), (96, 24), 20%] (24, 16)
ePConv [0.8, 8, (408, 96, 2), (96, 24), 20%] (24, 16)

 

The output of DensePoint in 4th stage (432, 16)

 

FP (768, 512, 512) (512, 64)
FP (736, 384, 384) (384, 256)
FP (448, 256, 256) (256, 1024)
FP (259, 128, 128) (128, 1024)

 

FC [(128+40, 128), 50%] (128, 1024)
FC [(128, 3), -] (3, 1024)

 

  • This is the one-hot encoding of the object label on ModelNet40 dataset.

Table VII: The configuration details of normal estimation network. “long-range” indicates the long-range connections (see Fig. 3(b) in the main paper).