ConvPoint: Continuous Convolutions for Point Cloud Processing
Point clouds are unstructured and unordered data, as opposed to images. Thus, most machine learning approaches developed for images cannot be directly transferred to point clouds. In this paper, we propose a generalization of discrete convolutional neural networks (CNNs) in order to deal with point clouds by replacing discrete kernels with continuous ones. This formulation is simple, allows arbitrary point cloud sizes and can easily be used for designing neural networks similarly to 2D CNNs. We present experimental results with various architectures, highlighting the flexibility of the proposed approach. We obtain competitive results compared to the state-of-the-art on shape classification, part segmentation and semantic segmentation for large-scale point clouds.
Preprint. Accepted at Computers & Graphics; see the final version in the journal issue.
Figure: (a) NPM3D; (b) Semantic8.
Point clouds are widely used in varied domains such as historical heritage preservation and autonomous driving. They can be either directly generated using active sensors such as LIDARs or RGB-Depth sensors, or an intermediary product of photogrammetry.
These point sets are sparse samplings of the underlying surface of scenes or objects. With the exception of structured acquisitions (e.g., with LIDARs), point clouds are generally unordered and without spatial structure; they cannot be sampled on a regular grid and processed as image pixels. Moreover, the points may or may not hold colorimetric features. Thus, point-cloud dedicated processing must be able to deal only with the relative positions of the points.
Due to these considerations, methods developed for image processing cannot be used directly on point clouds. In particular, convolutional neural networks (CNNs) have reached state-of-the-art performance in many image processing tasks. They use discrete convolutions, which rely heavily on the grid structure of the data, a structure that does not exist in point clouds.
Two common approaches to circumvent this problem are: first, to project the points into a space suitable for discrete convolutions, e.g., voxels; second, to reformulate the CNNs to take into account point clouds’ unstructured nature.
In this paper, we adopt the second approach and propose a generalization of CNNs for point clouds. The main contributions of this paper are two-fold.
First, we introduce a continuous convolution formulation designed for unstructured data. The continuous convolution is a simple and straightforward extension of the discrete one.
Second, we show that this continuous convolution can be used to build neural networks similarly to its image processing counterpart. We design neural networks using this convolution and a hierarchical data representation structure based on a search tree.
We show that our framework, ConvPoint, which extends our work presented in [boulch2019conv], can be applied to various classification and segmentation tasks, including large scale indoor and outdoor semantic segmentation. For each task, we show that ConvPoint is competitive with the state-of-the-art.
The paper is organized as follows: Section 2 presents the related work, Section 3 describes the continuous convolutional formulation, Section 4 is dedicated to the spatial representation of the data, and Section 5 presents the convolution as a layer. Finally, Section 6 shows experiments on different datasets for classification and semantic segmentation of point clouds.
2 Related work
Point cloud processing is a widely discussed topic. We focus here on machine learning techniques for point cloud classification or local attribute estimation.
Most of the early methods use handcrafted features defined using a point and its neighborhood, or a local estimate of the underlying surface around the point. These features describe specific properties of the shape and are designed to be invariant to rigid or non rigid transformations of the shape [johnson1999using, aubry2011wave, bronstein2010scale, ling2007shape]. Then, classical machine learning algorithms are trained using these descriptors.
In recent years, the release of large annotated point cloud databases has allowed the development of deep neural network methods that can learn both descriptors and decision functions.
The direct adaptation of CNNs developed for image processing is to use 3D convolutions. Methods like [wu20153d, maturana2015voxnet] apply 3D convolutions on a voxel grid. Even though recent hardware advances enable the use of these networks on larger scenes, they are still time consuming and require a relatively low voxel resolution, which may result in a loss of information and undesirable bias due to grid axis alignment. In order to avoid these drawbacks, [SubmanifoldSparseConvNet, graham20183d] use sparse convolutional kernels and [li2016fpnn, wang2015voting] scan the 3D space to focus computation where objects are located.
A second class of algorithms avoids 3D convolutions by creating 2D representations of the shape, applying 2D CNNs and projecting the results back to 3D. This approach has been used for object classification [su2015multi], jointly with voxels [qi2016volumetric], and for semantic segmentation [boulch2017snapnet]. One of the main issues of using multi-view frameworks is to design an efficient and robust strategy for choosing viewpoints.
The previous methods are based on 2D or 3D CNNs, creating the need to structure the data (3D grid or 2D image) in order to process it. Recently, work has focused on developing deep learning architectures for non-Euclidean data. This work, referred to as geometric deep learning [bronstein2017geometric], operates on manifolds, graphs, or directly on point clouds.
Neural networks on graphs were pioneered in [scarselli2008graph]. Since then, several methods have been developed using the spectral domain from the graph Laplacian [shuman2013emerging, bruna2013spectral, yi2017syncspeccnn] or polynomials [defferrard2016convolutional] as well as gated recurrent units for message passing [li2015gated, landrieu2018large].
The first extension of CNNs to manifolds is Geodesic CNN [masci2015geodesic], which redefines usual layers such as convolution or max pooling. It is applied to local mesh patches in a polar representation to learn invariant shape features for shape correspondences. In [boscaini2016learning], the representation is improved with anisotropic diffusion kernels and the resulting method is no longer bound to triangular meshes. More recently, MoNet [monti2017geometric] offers a unified formulation of CNNs on manifolds which includes [masci2015geodesic, atwood2016diffusion, boscaini2016learning] and reaches state-of-the-art performance on shape correspondence benchmarks.
These methods have proved to be very efficient but they usually operate on graphs or meshes for surface analysis. However, building a mesh from raw point clouds is a very difficult task and requires in practice priors regarding the surface to be reconstructed [berger2017survey].
A fourth class of machine learning approaches directly processes raw point clouds; the method proposed in this paper belongs to this category.
One of the recent breakthroughs in dealing with unstructured data is PointNet [qi2017pointnet].
The key idea is to construct a transfer function that is invariant to permutations of the input features, obtained by using the order-invariant max pooling function.
The coordinates of the points are given as input and geometric operations (affine transformations) are obtained with small auxiliary networks.
However, it fails to capture local structures. Its improvement, PointNet++ [qi2017pointnet++], uses a cascade of PointNet networks from local scale to global scale.
Convolutional neural layers are widely used in machine learning for image processing and more generally for data sampled on a regular grid (such as 3D voxels). However, the use of CNNs when data is missing or not sampled on a regular grid is not straightforward. Several works have studied the generalization of such powerful schemes to such cases.
To avoid problems linked to discrete kernels and sparse data, [dai2017deformable] introduces deformable convolutional kernels able to adapt to the recognition task. In [su2018splatnet], the authors adapt [dai2017deformable] to deal with point clouds. The input signal is interpolated on the convolutional kernel, the convolution is applied, and the output is interpolated back to the input shape. In our approach, the kernel elements’ locations are optimized as in [dai2017deformable] and the input points are weighted according to their distance to kernel elements, as in [su2018splatnet]. However, unlike [su2018splatnet], our approach does not depend on a convolution kernel designed on a grid.
In [NIPS2018_7362], a $\mathcal{X}$-transform is applied on the point coordinates to create geometrical features to be combined with the input point features. This is a major difference relative to our approach: as in [qi2017pointnet++], [NIPS2018_7362] makes use of the input geometry as features. We want the geometry to behave as in discrete convolutions, i.e., weighting the relation between kernel and input. In other words, the geometric aspects are defined in the network structure (convolution strides, pooling layers) and point coordinates (e.g., pixel indices for images) are not an input of the network.
In [wang2018deep], the authors propose a convolution layer defined as a weighted sum of the input features, the weights being computed by a multi-layer perceptron (MLP). Our approach has points in common with [wang2018deep] both in concept and implementation, but differs in two ways. First, our approach computes a dense weighting function that takes into account the whole kernel. Second, as in [thomas2019KPConv], the derived kernel is an explicit set of points associated with weights. However, whereas [thomas2019KPConv] uses an explicit RBF Gaussian function to correlate input and kernel, we propose to learn this kernel to input relation function with a multi-layer perceptron (MLP).
3 Convolution for point processing
We build our continuous convolutional layer by adapting the discrete convolution formulation used for grid-sampled data such as images.
Notations. In the following sections, $d$ is the dimension of the spatial domain (e.g., $d = 3$ for 3D point clouds) and $n$ is the dimension of the feature domain (i.e., the dimension of the input features of the convolutional layer). $[\![a, b]\!]$ is the integer sequence from $a$ to $b$ with step $1$. The cardinality of a set $S$ is denoted by $|S|$.
3.1 Discrete convolutions
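Let $K = \{w_i \in \mathbb{R}^n,\ i \in [\![1, |K|]\!]\}$ be the kernel and $X = \{x_i \in \mathbb{R}^n,\ i \in [\![1, |X|]\!]\}$ be the input. In discrete convolutions, $K$ and $X$ have the same cardinality.
We first consider the case where the elements of $K$ and $X$ range over the same locations (e.g., grid coordinates on an image).
Given a bias $\beta$, the output $y$ is:
$$y = \beta + \sum_{i=1}^{|K|} w_i^\top x_i = \beta + \sum_{i=1}^{|K|} \sum_{j=1}^{|X|} w_i^\top x_j \, \mathbb{1}(i, j) \qquad (1)$$
where $\mathbb{1}(\cdot, \cdot)$ is the indicator function such that $\mathbb{1}(a, b) = 1$ if $a = b$ and $0$ otherwise. This expresses a one-to-one relation between kernel elements and input elements.
We now reformulate the previous equation by making explicit the underlying order of sets $K$ and $X$. Let us consider $K = \{(c_i, w_i)\}$ (resp. $X = \{(p_j, x_j)\}$) where $c_i \in \mathbb{R}^d$ is the spatial location of the $i$-th kernel element (resp. $p_j \in \mathbb{R}^d$ is the spatial location of the $j$-th point in the input). For convolutions on images, $c_i$ and $p_j$ denote pixel coordinates in the considered patch. The output is then:
$$y = \beta + \sum_{i=1}^{|K|} \sum_{j=1}^{|X|} w_i^\top x_j \, \mathbb{1}(c_i, p_j) \qquad (2)$$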
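The equivalence between pairing kernel and input elements by index and pairing them by spatial location can be checked numerically. The following is a minimal sketch with scalar features on a 3x3 patch; the function names and the numeric values are illustrative assumptions, not taken from the paper.

```python
def indicator(a, b):
    """Return 1 if a == b, else 0."""
    return 1 if a == b else 0

def conv_by_index(weights, inputs, bias):
    """Discrete convolution pairing kernel and input elements by index."""
    return bias + sum(w * x for w, x in zip(weights, inputs))

def conv_by_location(kernel, points, bias):
    """Same convolution, pairing elements through their spatial locations.

    kernel is a list of (location, weight) pairs,
    points is a list of (location, feature) pairs.
    """
    return bias + sum(
        w * x * indicator(c, p)
        for c, w in kernel
        for p, x in points
    )

# 3x3 patch with scalar features (hypothetical values).
locations = [(i, j) for i in range(3) for j in range(3)]
weights = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.5, 1.0, 3.0]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

kernel = list(zip(locations, weights))
# Store the input in reverse order: matching by location ignores ordering.
points = list(zip(locations, inputs))[::-1]

y_index = conv_by_index(weights, inputs, bias=0.1)
y_location = conv_by_location(kernel, points, bias=0.1)
```

Both values agree up to floating-point rounding, even though the input list was reordered, because the indicator only fires when a kernel location and a point location coincide.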
3.2 Convolutions on point sets
In this study, we consider that elements of $X$ may have any spatial locations, i.e., we do not assume a grid or other structure on $X$. In practice, using equation (2) on such an $X$ would make $\mathbb{1}(c_i, p_j)$ zero (almost) all the time, as the probability for a random $p_j$ to match $c_i$ exactly is zero. Thus, using the indicator function is too restrictive to be useful when processing point clouds.
We need a more general function to establish a relation between the points $p_j$ and the kernel elements $c_i$. We define $\phi$ as a geometrical weighting function that distributes the input onto the kernel.
$\phi$ must be invariant to permutations of the input points (this is necessary since point clouds are unordered). This is achieved if $\phi$ is applied independently on each point $p_j$. It can also be a function of the set of kernel points, i.e., $\{c_k\}_{k \in [\![1, |K|]\!]}$.
Moreover, to be invariant to a global translation applied on both the kernel and the input cloud, we use relative coordinates with respect to kernel elements, i.e., we apply the function $\phi$ on $\{p_j - c_k\}_k$, the set of positions of $p_j$ relative to the kernel elements.
The function $\phi$ is then a function such that:
$$\phi : (\mathbb{R}^d)^{|K|} \to \mathbb{R}^{|K|}, \qquad \{p_j - c_k\}_{k \in [\![1, |K|]\!]} \mapsto \big(\phi_i(\{p_j - c_k\}_k)\big)_{i \in [\![1, |K|]\!]} \qquad (3)$$
where $\phi_i$ is the weight relating the point $p_j$ to the $i$-th kernel element.
Finally, the convolution operation for point sets is defined by:
$$y = \beta + \frac{1}{|X|} \sum_{j=1}^{|X|} \sum_{i=1}^{|K|} w_i^\top x_j \, \phi_i(\{p_j - c_k\}_k) \qquad (4)$$
where we added a normalization according to the input set size $|X|$ for robustness to variation in input size.
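As an illustration of the point-set convolution described above, here is a minimal sketch with scalar features. The weighting function is passed as an argument; the Gaussian used as a default is only a stand-in assumption (the paper learns this function), and all names are hypothetical.

```python
import math

def gaussian_phi(rel_positions, sigma=0.5):
    """Stand-in weighting function returning one weight per kernel element.

    rel_positions holds p - c for every kernel element location c.
    This Gaussian is an assumption for illustration only.
    """
    return [
        math.exp(-sum(r * r for r in rel) / (2.0 * sigma ** 2))
        for rel in rel_positions
    ]

def point_conv(kernel, points, bias, phi=gaussian_phi):
    """Continuous convolution on a point set.

    kernel: list of (location, weight) pairs.
    points: list of (location, feature) pairs.
    The sum is divided by the number of input points.
    """
    total = 0.0
    for p, x in points:
        # Positions of the point relative to every kernel element.
        rel = [tuple(pk - ck for pk, ck in zip(p, c)) for c, _ in kernel]
        geometric_weights = phi(rel)
        total += sum(w * x * g for (_, w), g in zip(kernel, geometric_weights))
    return bias + total / len(points)

kernel = [((0.0, 0.0), 2.0), ((1.0, 0.0), -1.0)]
points = [((0.0, 0.0), 3.0), ((0.5, 0.5), 1.0), ((1.0, 0.0), 2.0)]
y = point_conv(kernel, points, bias=0.0)
```

With a single point located exactly on a single kernel element, the Gaussian weight is 1 and the result reduces to `bias + w * x`, which mirrors the discrete case.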
In this formulation, the spatial domain and the feature domain are mixed similarly to the discrete formulation. Unlike PointNet [qi2017pointnet] and PointNet++ [qi2017pointnet++], spatial coordinates are not input features. This formulation is closer to [wang2018deep] where the authors estimate directly the weights of the input features, i.e., the product $w \, \phi(\cdot)$, mixing estimation in the feature space and the geometrical space.
3.3 Weighting function
In practice, designing such a function $\phi$ by hand is not easy. Intuitively, it would decrease with the norm of $p - c$. As in [thomas2019KPConv], Gaussian functions could be used for that, but they would require handcrafted parameters which are difficult to tune. Instead, we choose to learn this function with a simple MLP. Such an approach does not make any specific assumption about the behavior of the function $\phi$.
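To make the idea of an MLP-based weighting function concrete, the sketch below builds a tiny two-layer perceptron that maps the flattened relative positions to one weight per kernel element. The layer sizes, the ReLU activation and the random initialization are assumptions chosen for illustration; the paper only states that a simple MLP is used.

```python
import random

def make_mlp_phi(dim, kernel_size, hidden=8, seed=0):
    """Build a small MLP weighting function (hypothetical architecture).

    Input: kernel_size relative positions of dimension dim, flattened.
    Output: one weight per kernel element.
    """
    rng = random.Random(seed)
    n_in = dim * kernel_size
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(kernel_size)]
    b2 = [0.0] * kernel_size

    def phi(rel_positions):
        flat = [v for rel in rel_positions for v in rel]
        # Hidden layer with ReLU activation.
        h = [
            max(0.0, sum(w * v for w, v in zip(row, flat)) + b)
            for row, b in zip(w1, b1)
        ]
        # Linear output layer: one value per kernel element.
        return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(w2, b2)]

    return phi

phi = make_mlp_phi(dim=3, kernel_size=4)
out = phi([(0.1, 0.0, -0.2)] * 4)
```

In a real training setup the weights `w1`, `b1`, `w2`, `b2` would be optimized together with the rest of the network rather than left at their random initial values.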
The convolution operation is illustrated in Fig. 2. For a kernel $K = \{(c, w)\}$, the spatial part $\{c\}$ and the feature part $\{w\}$ are processed separately. The black arrows and boxes are for operations in the feature space; green ones are for operations in the point cloud space. Please note that removing the green operations corresponds to the discrete convolution.
Moreover, as in the case of discrete convolutions, our convolutional layer is made of several kernels (corresponding to the number of output channels $C_{out}$). The output of the convolution is thus a vector $y \in \mathbb{R}^{C_{out}}$.
3.5 Parameters and optimization on kernel element location
To set the locations of the kernel elements, we randomly sample the locations $\{c\}$ in the unit sphere and consider them as parameters. As $\phi$ is differentiable with respect to its input, $\{c\}$ can be optimized.
At training time, both parameters $\{w\}$ and $\{c\}$ of $K$ as well as the MLP parameters are optimized using gradient descent.
Permutation invariance. As stated in [qi2017pointnet], operators on point clouds must be invariant under the permutation of points. In the general case, the points are not ordered. In our formulation, $y$ is a sum over the input points. Thus, permutations of the points have no effect on the results.
Translation invariance. As the geometric relations between the points and the kernel elements are relative, i.e., ($p - c$), applying a global translation to the point cloud does not change the results.
Insensitivity to the scale of the input point cloud. Many point clouds, such as photogrammetric point clouds, have no metric scale information. Moreover, the scale may vary from one point cloud to another inside a dataset. In order to make the convolution robust to the input scale, the input geometric points are normalized to fit in the unit ball.
Reduced sensitivity to the size of the input point cloud. Dividing by $|X|$ makes the output less sensitive to the input size. For instance, duplicating every point of $X$ does not change the result.
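The properties listed above can be verified numerically on a toy example. The sketch below uses a Gaussian stand-in for the weighting function and a centroid-based normalization to the unit ball; both are assumptions made for illustration, as are all names and values.

```python
import math
import random

def phi(rel_positions):
    # Stand-in weighting function (assumption): Gaussian of |p - c|.
    return [math.exp(-sum(r * r for r in rel)) for rel in rel_positions]

def point_conv(kernel, points, bias=0.0):
    """kernel: [(location, weight)], points: [(location, feature)]."""
    total = 0.0
    for p, x in points:
        rel = [tuple(a - b for a, b in zip(p, c)) for c, _ in kernel]
        total += sum(w * x * g for (_, w), g in zip(kernel, phi(rel)))
    return bias + total / len(points)

def translate(items, t):
    return [(tuple(a + b for a, b in zip(pos, t)), val) for pos, val in items]

def normalize_to_unit_ball(positions):
    """Center on the centroid and scale so the farthest point has norm 1."""
    dim = len(positions[0])
    centroid = [sum(p[i] for p in positions) / len(positions) for i in range(dim)]
    centered = [tuple(p[i] - centroid[i] for i in range(dim)) for p in positions]
    radius = max(math.sqrt(sum(v * v for v in p)) for p in centered)
    if radius == 0.0:
        return centered
    return [tuple(v / radius for v in p) for p in centered]

rng = random.Random(0)
kernel = [((rng.uniform(-1, 1), rng.uniform(-1, 1)), rng.uniform(-1, 1))
          for _ in range(4)]
points = [((rng.uniform(-1, 1), rng.uniform(-1, 1)), rng.uniform(0, 1))
          for _ in range(10)]

y = point_conv(kernel, points)
y_permuted = point_conv(kernel, points[::-1])
shift = (5.0, -3.0)
y_translated = point_conv(translate(kernel, shift), translate(points, shift))
y_duplicated = point_conv(kernel, points + points)

unit = normalize_to_unit_ball([p for p, _ in points])
max_norm = max(math.sqrt(sum(v * v for v in p)) for p in unit)
```

The four outputs coincide up to rounding: reordering the points, translating kernel and points together, and duplicating every point all leave the result unchanged.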
4 Hierarchical representation and neighborhood computation
Let $P$ be an input point cloud. The convolution as described in the previous section is a local operation on a subset $X$ of $P$. It operates a projection of $P$ on an output point cloud $Q$ and computes the features associated with each point of $Q$.
Depending on the cardinality of $Q$, we have three possible behaviors for the convolution layer:
(a) $Q = P$. In this case the cardinality of the output is the same as the input. In the discrete framework it corresponds to the convolution without stride.
(b) $|Q| < |P|$. This includes the particular case of $Q \subset P$. The convolution operates a spatial size reduction. The parallel with the discrete framework is a convolution with stride greater than one.
(c) $|Q| > |P|$. This includes the particular case of $P \subset Q$. The convolutional layer produces an upsampled version of the input. It is similar to the up-convolutional layer in discrete pipelines.
Computation of $Q$. $Q$ can either be given as an input, or computed from $P$. For the second case, a possible strategy is to randomly pick points in $P$ to generate $Q$. However, the same point may be picked several times and some points of $P$ may not be in the neighborhoods of the points of $Q$. An alternative is proposed in [qi2017pointnet++] using a furthest-point sampling strategy. This is very efficient and ensures a spatially uniform sampling; however, it requires computing all mutual distances in the point cloud.
We propose an intermediate solution. For each point, we keep a score recording how many times it has been selected. We pick the next point in the set of points with the lowest score. Each time a point is selected, its score is increased by a large value, and the scores of the points in its neighborhood are increased by a smaller value. The points of $Q$ are iteratively picked until the specified number of points is reached. Using a higher score for the points in $Q$ ensures that they will not be chosen again, except if all points have been selected once.
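The selection procedure can be sketched as follows. The two score increments (`selected_bonus=100`, `neighbor_bonus=1`), the random tie-breaking and the brute-force neighbor search are assumptions made for this illustration; only the overall scheme (lowest score first, larger increment for selected points than for their neighbors) comes from the text.

```python
import random

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def select_points(points, n_out, k, selected_bonus=100, neighbor_bonus=1, seed=0):
    """Iteratively pick n_out point indices, favoring unvisited regions."""
    rng = random.Random(seed)
    scores = [0] * len(points)
    chosen = []
    while len(chosen) < n_out:
        # Candidates are the points with the lowest current score.
        lowest = min(scores)
        candidates = [i for i, s in enumerate(scores) if s == lowest]
        idx = rng.choice(candidates)
        chosen.append(idx)
        # Brute-force k nearest neighbors (a k-d tree would be used in practice).
        order = sorted(range(len(points)), key=lambda j: dist2(points[idx], points[j]))
        for j in order[:k]:
            scores[j] += neighbor_bonus
        scores[idx] += selected_bonus
    return chosen

rng = random.Random(1)
cloud = [(rng.random(), rng.random()) for _ in range(20)]
chosen = select_points(cloud, n_out=10, k=4)
```

Because a selected point receives a much larger increment than any neighbor can accumulate, no index is repeated as long as unselected points remain.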
Neighborhood computation. All $k$-nearest neighbor searches are computed using a k-d tree built with $P$.
5 Convolutional layer
The convolutional layer is presented in Fig. 3. It is composed of the two previously described operations (point selection and the convolution itself). The inputs are $P$ and optionally $Q$. If $Q$ is not provided, it is selected as a subset of $P$ following the procedure described in the previous section. First, for each point of $Q$, local neighborhoods in $P$ are computed using a k-d tree. Then, for each of these subsets, we apply the convolution operation, creating the output features. Finally, the output is the union of the pairs $(q, y)$.
Parameters. The parameters of the convolutional layers are very similar to discrete convolution parameters in most deep learning frameworks.
Number of output channels ($C_{out}$): it is the number of convolutional kernels used in the layer. It defines the output feature dimension, i.e., the dimension of $y$.
Size of the output point cloud ($|Q|$): it is the number of points that are passed to the next layer.
Kernel size ($|K|$): it is the number of kernel elements used for the convolution computation.
Neighborhood size ($k$): it is the number of points in $P$ to consider for each point in $Q$.
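Putting the pieces together, a layer takes an input cloud, a set of output locations and the four parameters listed above. The sketch below is a simplified illustration with scalar input features: the neighbor search is brute force instead of a k-d tree, the weighting function is a Gaussian stand-in for the learned MLP, neighborhoods are expressed relative to the output point, and kernel element locations are shared across output channels. All of these choices and names are assumptions.

```python
import math

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def gaussian_phi(rel_positions):
    # Illustrative stand-in for the learned weighting function.
    return [math.exp(-sum(r * r for r in rel)) for rel in rel_positions]

def point_conv(centers, weights, local_points):
    """One output channel: centers [c], weights [w], local_points [(p, x)]."""
    total = 0.0
    for p, x in local_points:
        rel = [tuple(a - b for a, b in zip(p, c)) for c in centers]
        total += sum(w * x * g for w, g in zip(weights, gaussian_phi(rel)))
    return total / len(local_points)

def conv_layer(P, Q, centers, channel_weights, k):
    """P: [(p, x)], Q: [q]; returns [(q, y)] with len(y) == len(channel_weights)."""
    output = []
    for q in Q:
        # k nearest neighbors of q in P (brute force for the sketch).
        neighbors = sorted(P, key=lambda item: dist2(item[0], q))[:k]
        # Express the neighborhood relative to q.
        local = [(tuple(a - b for a, b in zip(p, q)), x) for p, x in neighbors]
        y = [point_conv(centers, w, local) for w in channel_weights]
        output.append((q, y))
    return output

P = [((float(i), float(j)), 1.0) for i in range(4) for j in range(4)]
Q = [(0.0, 0.0), (3.0, 3.0)]                    # output cloud size: 2
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # kernel size: 3
channel_weights = [[1.0, 0.5, -0.5], [0.2, 0.2, 0.2]]  # 2 output channels
out = conv_layer(P, Q, centers, channel_weights, k=4)   # neighborhood size: 4
```

Choosing `Q` smaller than `P` plays the role of a strided convolution, while passing a larger `Q` gives an upsampling layer.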
The following section is dedicated to experiments and comparison with state-of-the-art methods. As the spatial structure generation (selection of the output point cloud for each layer) is a stochastic process, two runs through the network may lead to different outputs. In the following, we aggregate the results of multiple runs by averaging the outputs. The number of runs is then referred to as the number of spatial samplings. In the following tables, it corresponds to the number between parentheses (for classification and part segmentation). The influence of this number is discussed in section 6.3.1.
The first experiments are dedicated to point cloud classification. We experimented on both 2D and 3D point cloud datasets.
Network The network is described in Fig. 4. It is composed of five convolutions that progressively reduce the point cloud size to one single point, while increasing the number of channels. The features associated to this point are the inputs of a linear layer. This architecture is very similar to the ones that can be used for image processing (e.g., LeNet [lecun1998gradient]).
2D classification: MNIST The 2D experiment is done on the MNIST dataset. This is a dataset for the classification of grayscale handwritten digits. The point cloud is built from the images, using pixel coordinates as point coordinates and, thus, is sampled on a grid. We build two variants of the dataset: first, point clouds are built with the whole image and the feature associated with each point is the gray level; second, only the black points are considered.
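The two point cloud variants can be built from an image as sketched below. The `is_ink` predicate deciding which pixels count as digit points, its threshold, and the constant feature value of 1 for the second variant are assumptions made for the illustration.

```python
def image_to_point_clouds(image, is_ink=lambda value: value > 0.5):
    """Build both point cloud variants from a 2D list of gray levels.

    Returns (gray_levels, ink_only), each a list of (location, feature) pairs.
    """
    gray_levels, ink_only = [], []
    for row, line in enumerate(image):
        for col, value in enumerate(line):
            # Variant 1: every pixel is a point, the feature is its gray level.
            gray_levels.append(((col, row), value))
            # Variant 2: only digit pixels are kept, with a constant feature.
            if is_ink(value):
                ink_only.append(((col, row), 1.0))
    return gray_levels, ink_only

image = [[0.0, 1.0],
         [0.2, 0.9]]
gray, ink = image_to_point_clouds(image)
```

In the first variant the geometry is a full grid and carries no shape information; in the second, the shape is entirely encoded by which points are present.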
Results are presented in table 1(a). We compare with both image CNNs (LeNet [lecun1998gradient] and Network in Network [lin2013network]) and point-based methods (PointNet++ [qi2017pointnet++] and PointCNN [NIPS2018_7362]). Scores, averaged over 16 spatial samplings, are competitive with other methods. More interestingly, we do not observe a great difference between the two variants (grayscale points or black points only). In the Gray levels experiment (whole image), the framework is able to learn from the color value only, as the points do not hold shape-related information. In contrast, in the Black points only experiment, it learns from geometry only, which is a common case for point clouds.
3D classification: ModelNet40 We also experimented on 3D classification on the ModelNet40 dataset. This dataset is a set of meshes from 40 various classes (planes, cars, chairs, tables…). We generated point clouds by randomly sampling points on the triangular faces of the meshes. In our experiments, we use an input size of either 1024 or 2048 points for training. Table 1(b) presents the results. As for 2D classification, we are competitive with the state of the art concerning point-based classification approaches.
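Sampling points on the triangular faces of a mesh can be sketched as follows. Choosing faces with probability proportional to their area and drawing uniform barycentric coordinates are assumptions made here to obtain a uniform surface sampling; the text only states that points are sampled randomly on the faces.

```python
import math
import random

def triangle_area(a, b, c):
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))

def sample_mesh(vertices, faces, n_points, seed=0):
    """Draw n_points on the surface of a triangular mesh."""
    rng = random.Random(seed)
    triangles = [tuple(vertices[i] for i in face) for face in faces]
    areas = [triangle_area(*t) for t in triangles]
    samples = []
    # Faces are drawn with probability proportional to their area.
    for a, b, c in rng.choices(triangles, weights=areas, k=n_points):
        u, v = rng.random(), rng.random()
        if u + v > 1.0:
            # Fold the point back into the triangle.
            u, v = 1.0 - u, 1.0 - v
        samples.append(tuple(
            a[i] + u * (b[i] - a[i]) + v * (c[i] - a[i]) for i in range(3)
        ))
    return samples

# Unit square in the z = 0 plane, made of two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
cloud = sample_mesh(vertices, faces, n_points=1024)
```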
Table 1: (a) MNIST; (b) ModelNet40.
Note: the striped convolutions (layers 0, 1 and 12) are only present in the semantic segmentation network.
The segmentation network is presented in Fig. 5. It has an encoder-decoder structure similar to U-Net: a stack of convolutions that reduces the cardinality of the point cloud as an encoder, and a symmetrical stack of convolutions as a decoder, with skip connections. In the decoder, the points used for upsampling are the same as the points in the encoder at the corresponding layer. Following the U-Net architecture, the features from the encoder and from the decoder are concatenated at the input of the convolutional layers of the decoder. Finally, the last layer is a point-wise linear layer used to generate an output dimension corresponding to the number of classes.
The network comes in two variants, i.e., with two different numbers of layers. The part segmentation network (plain colors, in Fig. 5) is used with ShapeNet for part segmentation. The second network, used for large-scale segmentation, is the same network with three added convolutions (hatched layers in Fig. 5). It is a larger network; its only purpose is to deal with larger input point clouds.
For both versions of the network, we add a dropout layer between the last convolution and the linear layer. At training time, the probability of an element to be set to zero is 0.5.
Given a point cloud, the part segmentation task is to recognize the different constitutive parts of the shape. In practice, it is a semantic segmentation at shape level. We use the ShapeNet [yi2016scalable] dataset. It is composed of 16680 models belonging to 16 shape categories and split into train/test sets. Each category is annotated with 2 to 6 part labels, totaling 50 part classes. As in [NIPS2018_7362], we consider the part annotation problem as a 50-class semantic segmentation problem. We use a cross-entropy loss for each category. The scores are computed at shape level.
We use the semantic segmentation network from Fig. 5. The provided point clouds have various sizes, so we randomly select 2500 points (possibly with duplication if the point cloud size is lower than 2500) and predict the labels for each input point. The points do not have color features, so we set the input features to one (this is the same as for the MNIST experiment with black points only). As some points may not have been selected for labeling, the class of an unlabeled point is given by the label of its nearest labeled neighbor.
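The point selection and the nearest-neighbor label propagation described above can be sketched as follows. The function names are hypothetical, the neighbor search is brute force, and sampling with replacement only when the cloud is too small is one possible reading of "possibly with duplication".

```python
import random

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def sample_indices(n_available, n_wanted, rng):
    """Pick n_wanted indices, with duplication only if the cloud is too small."""
    if n_available >= n_wanted:
        return rng.sample(range(n_available), n_wanted)
    return [rng.randrange(n_available) for _ in range(n_wanted)]

def propagate_labels(points, labeled):
    """Give every point a label.

    labeled maps the index of each point seen by the network to its label;
    other points take the label of their nearest labeled point.
    """
    result = []
    for i, p in enumerate(points):
        if i in labeled:
            result.append(labeled[i])
        else:
            nearest = min(labeled, key=lambda j: dist2(p, points[j]))
            result.append(labeled[nearest])
    return result

rng = random.Random(0)
small = sample_indices(5, 8, rng)    # fewer points than requested
large = sample_indices(10, 4, rng)   # more points than requested

points = [(0.0,), (1.0,), (2.0,), (10.0,)]
labels = propagate_labels(points, {0: "a", 3: "b"})
```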
The results are presented in table 2. The scores are the mean class intersection over union (mcIoU) and instance average intersection over union (mIoU). Similarly to classification, we aggregate the scores of multiple runs through the network. Our framework is also competitive with the state-of-the-art methods, as we rank among the top five methods for both mcIoU and mIoU.
We now consider semantic segmentation for large-scale point clouds. As opposed to part segmentation, point clouds are now sampled in multi-object scenes outdoors and indoors.
Datasets We use three datasets, one with indoor scenes and two with outdoor acquisitions.
The Stanford 2D-3D-Semantics dataset [2017arXiv170201105A] (S3DIS) is an indoor point cloud dataset for semantic segmentation. It is composed of six scenes, each corresponding to an office building floor. The points are labeled according to 13 classes: 6 building elements classes (floor, ceiling…), 6 office equipment classes (tables, chairs…) and a “stuff” class regrouping all the small equipment (computers, screens…) and rare items. For each scene considered as a test set, we train on the other five.
The Semantic8 [sem3d] outdoor dataset is composed of 30 ground lidar scenes: 15 for training and 15 for evaluation. Each scene is generated from one lidar acquisition and the point cloud size ranges from 16 million to 430 million points. 8 classes are manually annotated.
Finally, the Paris-Lille 3D dataset (NPM3D) [roynard2018paris] has been acquired using a Mobile Laser System. The training set contains four scenes taken from the two cities, totaling 38 million points. The three test scenes, with 10 million points each, were acquired in two other cities. The annotations correspond to 10 coarse classes, from buildings to pedestrians.
For both Semantic8 and NPM3D, test labels are not released and predictions are evaluated on an online server.
Learning and prediction strategies As the scenes may be large, up to hundreds of millions of points, the whole point clouds cannot be directly given as input to the network.
At training time, we randomly select points in the considered point cloud and extract all the points in an infinite vertical column centered on each selected point. The column section is 2 meters wide for indoor scenes and 8 meters wide for outdoor scenes (Fig. 6).
During testing, we compute a 2D occupancy pixel map with “pixel” size 0.1 meters for indoor scenes and 0.5 meters for outdoor scenes by projecting vertically on the horizontal plane. Then, we consider each occupied cell as a center for a column (same size as for training).
For each column, we randomly select 8192 points which we feed as input to the network. Finally, the output scores are aggregated (summed) at point level and points not seen by the network receive the label of their nearest neighbor.
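The test-time procedure (occupancy map, column extraction, score aggregation) can be sketched as below. The square column section, the use of cell centers as column centers and all function names are assumptions made for this illustration.

```python
import math

def column_indices(points, center, width):
    """Indices of the points inside an infinite vertical column (square section)."""
    half = width / 2.0
    return [
        i for i, (x, y, z) in enumerate(points)
        if abs(x - center[0]) <= half and abs(y - center[1]) <= half
    ]

def occupied_cells(points, pixel_size):
    """Occupied cells of the 2D map obtained by vertical projection."""
    return sorted({
        (math.floor(x / pixel_size), math.floor(y / pixel_size))
        for x, y, _ in points
    })

def cell_center(cell, pixel_size):
    return ((cell[0] + 0.5) * pixel_size, (cell[1] + 0.5) * pixel_size)

def aggregate_scores(n_points, n_classes, predictions):
    """Sum per-point class scores over all columns.

    predictions is an iterable of (indices, scores) pairs, one per column,
    where scores[i] is the class score list of point indices[i].
    """
    totals = [[0.0] * n_classes for _ in range(n_points)]
    for indices, scores in predictions:
        for i, s in zip(indices, scores):
            for c in range(n_classes):
                totals[i][c] += s[c]
    return totals

points = [(0.1, 0.1, 0.0), (0.2, 0.1, 5.0), (3.0, 3.0, 1.0)]
cells = occupied_cells(points, pixel_size=0.5)
column = column_indices(points, center=(0.0, 0.0), width=2.0)
totals = aggregate_scores(
    n_points=3,
    n_classes=2,
    predictions=[([0, 1], [[1.0, 0.0], [0.0, 1.0]]), ([1], [[0.0, 2.0]])],
)
```

A point whose summed scores remain zero was never fed to the network; such points would then take the label of their nearest labeled neighbor.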
Improving efficiency with fusion of two networks learned on different modalities First, the objective is to evaluate the influence of the color information as an input feature.
We trained two networks, one with color information (RGB) and one without (NoColor). Let $P = \{(p, x)\}$ be the input point cloud. In the first case, RGB, the input features are the color information, i.e., $x = (r, g, b)$, with $r$, $g$ and $b$ being the red, green and blue values associated with each point. For the NoColor model, input features do not hold color information: $x = 1$.
Intuitively, the RGB network should outperform the other model on all categories: the RGB point cloud is the NoColor point cloud with additional color information.
However, according to the scores, the intuition is only partly verified. As an example in table 4(a), the NoColor model is much more efficient than the RGB model on the column class, mainly due to color confusion as walls and columns have most of the time the same color. To our understanding, when color is provided, the stochastic optimization process follows the easiest way to reach a training optimum, thus, giving too much importance to color information. In contrast, the NoColor generates different features, based only on the geometry, and is more discriminating on some classes.
Now looking at table 4(b), we observe similar performances for both models. As these models use different inputs, they likely focus on different features of the scene. It should thus be possible to exploit this complementarity to improve the scores.
Model fusion To show the interest of fusing models, we chose to use a residual fusion module [audebert_semantic_2016, boulch2017snapnet]. This approach has proven to produce good results for networks with different input modalities. Moreover, one advantage is that the two segmentation networks are first trained separately, the fusion module being trained afterward. This training process makes it possible, first, to reuse the geometry-only model if the RGB information is not available and, second, to train with limited resources (see implementation details in section 7).
The fusion module is presented in Fig. 7. The features before the fully-connected layer are concatenated, becoming the input of two convolutions and a point-wise linear layer. The outputs of both segmentation networks and the residual module are then added to form the final output. At training time, we add a dropout layer between the last convolution and the linear layer.
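The fusion described above can be summarized in one expression. The notation is introduced here for illustration and is not the authors': $s_{\mathrm{RGB}}$ and $s_{\mathrm{NoColor}}$ denote the per-point class scores of the two segmentation networks, $f_{\mathrm{RGB}}$ and $f_{\mathrm{NoColor}}$ their features before the fully-connected layer, $[\cdot \, ; \cdot]$ a concatenation, and $r$ the residual module (two convolutions followed by a point-wise linear layer):

$$\hat{s} = s_{\mathrm{RGB}} + s_{\mathrm{NoColor}} + r\big([f_{\mathrm{RGB}} \, ; f_{\mathrm{NoColor}}]\big)$$

With this form, if $r$ outputs zero the fusion reduces to a plain sum of the two networks' scores, so the residual module only has to learn a correction to that sum.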
In table 4(a) and (b), the results of segmentation with fusion are reported in the line ConvPoint - Fusion. As expected, the fusion increases the segmentation scores with respect to RGB and NoColor alone.
On the S3DIS dataset (table 4(a)), the fusion module obtains a better score on 10 out of 13 categories and the average intersection over union is increased by 3.5%. It is also interesting to note that, on categories for which the fusion is not better than one single modality or the other, the score is close to the best mono-mode model. In other words, the fusion often improves the performance and in any case does not significantly degrade it.
On the Semantic8 dataset (table 4(b)), we observe the same behavior. The gain is particularly high on the artefact class, which is one of the most difficult classes: both mono-mode models reach 43-44% while the fusion model reaches 61%. It supports the idea that the RGB and NoColor models learn different features and that the fusion module is not only a sum of activations, but can select the best of both modalities.
For comparison with official benchmarks, we use the fusion model.
Large-scale datasets: comparison with the state of the art Table 4 also presents, for comparison, the scores obtained by other methods in the literature. Our approach is competitive with the state of the art on the three datasets.
On S3DIS, we place second behind KPConv [thomas2019KPConv] in terms of average intersection over union (mIoU), while being first on several categories. It can be noted that approaches sharing concepts with ours, such as PCCN [wang2018deep] or PointCNN [NIPS2018_7362], do not perform as well as ours.
On Semantic8, we report the state of the benchmark leaderboard at the time of article writing (for entries that are not anonymous). PointNet++ has two entries in the benchmark; we only report the best one. Our convolutional network for segmentation places first, ahead of Superpoint Graph (SPG) [landrieu2018large] and SnapNet [boulch2017snapnet]. It differs greatly from SPG, which relies on a pre-segmentation of the point cloud, and SnapNet, which uses a 2D segmentation network on virtual pictures of the scene to produce segments. We surpass PointNet++ by 13% on the average IoU. We perform particularly well on car and artefact detection, where other methods, except for SPG, get relatively low results.
Finally, on the NPM3D Paris-Lille dataset (table 4(c)), we also report the official benchmark at the time the paper was written. Based on the average IoU, we place second, surpassed only by KPConv [thomas2019KPConv]. Our approach is the best or second best for 6 out of 9 categories. The second place is explained mostly by the relatively low scores on the pedestrian and trash can classes. These are particularly difficult classes, due to their variability and the low number of instances in the training set.
Figure 8: (a) Weighting functions; (b) Filters.
Best score, second best and third best.
Table 4(a): S3DIS.
|ConvPoint - RGB|87.9|-|64.7|95.1|97.7|80.0|44.7|17.7|62.9|67.8|74.5|70.5|61.0|47.6|57.3|63.5|
|ConvPoint - NoColor|85.2|-|62.6|92.8|94.2|76.7|43.0|43.8|51.2|63.1|71.0|68.9|61.3|56.7|36.8|54.7|
|ConvPoint - Fusion|88.8|-|68.2|95.0|97.3|81.7|47.1|34.6|63.2|73.2|75.3|71.8|64.9|59.2|57.6|65.0|
Table 4(b): Semantic8.
|Method|mIoU|OA|Man made|Natural|High veg.|Low veg.|Buildings|Hard scape|Artefacts|Cars|
|ConvPoint - RGB|0.750|0.938|0.934|0.847|0.758|0.706|0.950|0.474|0.432|0.902|
|ConvPoint - NoColor|0.726|0.927|0.918|0.788|0.748|0.646|0.962|0.451|0.442|0.856|
|ConvPoint - Fusion|0.765|0.934|0.921|0.806|0.760|0.719|0.956|0.473|0.611|0.877|
Table 4(c): NPM3D.
|ConvPoint - Fusion|75.9|99.5|95.1|71.6|88.7|46.7|52.9|53.5|89.4|85.4|
6.3 Empirical properties of the convolutional layer
Filter visualization. Fig. 8 presents a visualization of some characteristics of the first convolutional layer of the 2D classification model trained on MNIST.
Fig. 8(a) shows the weighting function associated with four of the sixty-four kernel elements of the first convolutional layer.
These weights are the output of the MLP function. As expected, their nature varies a lot depending on the kernel element, highlighting different regions of interest for each kernel element.
Fig. 8(b) shows the resulting convolutional filters. These are computed using the previously presented weighting functions, multiplied by the weights of each kernel element and summed over the kernel elements. As with discrete CNNs, we observe that the network has learned various filters, with different orientations and shapes.
Influence of random selection of output points
The strategy for selecting the output points is stochastic at each layer, i.e., for a fixed input, two runs through the same layer may lead to two different sets of output points. Therefore, the output features may also differ, and so may the predicted labels. To increase robustness, we aggregate several outputs computed with the same network and the same input point cloud. This is referred to as the number of samplings (from 1 to 16) in table 3. We observe that performance improves with the number of spatial samplings. In practice, we use at most 16 samplings because a larger number does not significantly improve the scores.
This procedure shares similarities with the approaches used for image processing for test set augmentation such as image crops [simonyan2014very, szegedy2016rethinking]. The main difference resides in the fact that the output variation is not a result of input variation but is inherent to the network, i.e., the network is not deterministic.
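The aggregation over several stochastic passes can be sketched as follows. This is only an illustration under stated assumptions: `toy_stochastic_scores` is a hypothetical stand-in for one forward pass of the network (it mimics the random choice of output points with a random subset of the input), and averaging class scores is one plausible aggregation rule, not necessarily the one used in the experiments.

```python
import random


def toy_stochastic_scores(points, num_classes, rng, num_selected=4):
    """Hypothetical stand-in for one stochastic forward pass.

    A random subset of the input points is selected (mimicking the random
    choice of output points) and toy class scores are derived from it.
    """
    selected = rng.sample(points, min(num_selected, len(points)))
    scores = [0.0] * num_classes
    for x, _y, _z in selected:
        # Toy rule: each selected point votes for a class based on its x coordinate.
        scores[int(abs(x) * 10) % num_classes] += 1.0
    total = sum(scores)
    return [s / total for s in scores]


def aggregate_predictions(points, num_classes, num_samplings, seed=0):
    """Average class scores over several stochastic passes; return (label, scores)."""
    rng = random.Random(seed)
    mean_scores = [0.0] * num_classes
    for _ in range(num_samplings):
        scores = toy_stochastic_scores(points, num_classes, rng)
        mean_scores = [m + s / num_samplings for m, s in zip(mean_scores, scores)]
    label = max(range(num_classes), key=lambda c: mean_scores[c])
    return label, mean_scores


cloud = [(0.1 * i, 0.0, 0.0) for i in range(20)]
for n in (1, 4, 16):
    print(n, aggregate_predictions(cloud, num_classes=3, num_samplings=n))
```

With a single sampling the predicted label depends on one random draw; averaging over more samplings reduces that variance at a cost that grows linearly with the number of passes.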
Robustness to point cloud size and neighborhood size
To evaluate the robustness to test conditions that differ from the training ones, we propose two experiments. As stated in section 3, the definition of the convolutional layer does not require a fixed input size. However, to improve performance and training time, we trained the networks with minibatches, which fixes the input size at training. We first evaluate the influence of the input size at inference on the performance of a model trained with a fixed input size. Results are presented in Fig. 9 for the ModelNet40 classification dataset. Each curve (from blue to red) is an instance of the classification network, trained with 16, 32, …, 2048 input points. The black dots are the scores of each model at its native input size. The dashed curve describes a theoretical model performing as well as the best model for each input size. Please note that the horizontal scale is logarithmic and that each step doubles the number of input points. A first observation is that almost every model performs best at its native input size and that very few points are needed on ModelNet40 to reach decent performance: with 32 points, the performance already reaches 85%. Moreover, the larger the training size, the more robust the model is to size variation. While the model trained on 32 points sees its performance drop by 25% with 50% of the points, the model trained with 2048 points still reaches 82% (a drop of 10%) with only 512 points (4 times fewer).
Second, in our formulation, the neighborhood size of each convolutional layer remains a variable parameter after training, i.e., it is possible to change the neighborhood size at test time. This is why, in equation (5), the normalizing weight (average with respect to input size) was added: it makes the layer robust to neighborhood size variation. In addition to the robustness provided by averaging, varying the neighborhood size somewhat simulates a variation of point density for the layer. We evaluate the robustness to such variations in table 5, on the classification dataset ModelNet40, with a single model trained with the default configuration (see Fig. 4). We report the impact of the neighborhood size on the first layer (table 5(a)) and on the fourth layer (table 5(b)). As expected, even though the best score is reached for the default neighborhood size, the layer is robust to a large variation of this value: the overall accuracy loss is lower than 2% when the neighborhood is 2 times larger or smaller. A particularly interesting feature is that the first layer, which extracts local features, is more robust to a decreasing neighborhood size, which makes the features more local, than to an increasing one. It is the opposite for the fourth layer: global features are more robust to an increase of the neighborhood size than to a decrease.
Table 5: Influence of the neighborhood size at test time. (a) First layer. (b) Fourth layer.
Note: scores are computed with a 16-element spatial structure. The model was trained with the default configuration.
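The effect of averaging over the neighborhood can be illustrated with a toy example. The sketch below is not the layer of the paper: `toy_weighting` is a hypothetical Gaussian bump standing in for the learned weighting function, and the neighborhood is a ring of points sampled at different densities. It only shows why dividing by the number of neighbors keeps the response stable when the neighborhood size changes, whereas a plain sum grows with it.

```python
import math


def toy_weighting(position, kernel_center, sigma=0.5):
    """Hypothetical stand-in for the learned weighting function (Gaussian bump)."""
    d2 = sum((a - b) ** 2 for a, b in zip(position, kernel_center))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def aggregate(neighbors, features, kernel_center, normalize=True):
    """Weighted sum of neighbor features, optionally divided by the neighborhood size."""
    total = sum(f * toy_weighting(p, kernel_center) for p, f in zip(neighbors, features))
    return total / len(neighbors) if normalize else total


def ring(num_points):
    """Sample num_points evenly on the unit circle (same shape, varying density)."""
    step = 2.0 * math.pi / num_points
    return [(math.cos(i * step), math.sin(i * step), 0.0) for i in range(num_points)]


kernel_center = (0.5, 0.0, 0.0)
for k in (8, 16, 32):
    pts = ring(k)
    feats = [1.0] * k
    averaged = aggregate(pts, feats, kernel_center)
    summed = aggregate(pts, feats, kernel_center, normalize=False)
    print(k, averaged, summed)
```

In this toy setting the averaged response is nearly identical for 8, 16 and 32 neighbors, while the unnormalized sum roughly doubles at each step.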
7 Computation times and implementation details
In our experiments, the convolutional pipeline was implemented using PyTorch [paszke2017automatic]. The neighborhood computations are done with NanoFLANN [blanco2014nanoflann] and parallelized over CPU cores using OpenMP [Dagum:1998:OIA:615255.615542].
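To make the neighborhood computation concrete, here is a minimal brute-force k-nearest-neighbor search. It is a deliberately simple stand-in, not the KD-tree search performed by NanoFLANN, and it runs sequentially rather than in parallel. The function name `knn_indices` is hypothetical.

```python
def knn_indices(points, queries, k):
    """For each query point, return the indices of its k nearest input points.

    Brute force: every query is compared with every point, so the cost is
    O(len(queries) * len(points) * log(len(points))). A KD-tree avoids most
    of these comparisons.
    """
    result = []
    for q in queries:
        order = sorted(
            range(len(points)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)),
        )
        result.append(order[:k])
    return result


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
print(knn_indices(cloud, [(0.1, 0.0, 0.0)], k=2))
```

Since every query is independent of the others, the outer loop is the natural place to parallelize, which is consistent with distributing the search over CPU cores.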
Table 6 presents the computation times. We consider two hardware configurations: a desktop workstation (Config. 1) and a low-end configuration, i.e., a gaming laptop (Config. 2). For instance, Config. 2 could correspond to a hardware specification for embedded devices.
Table 6(a) is a comparison with PointCNN [NIPS2018_7362], which was reported in [NIPS2018_7362] as the fastest among other methods. We used the code version available on the official PointCNN repository at the time of submission, with the recommended network architecture. For both frameworks, we ran experiments on the Config. 1 computer. On the ModelNet40 classification dataset, our model is about 30% faster than PointCNN for training, but inference times are similar. The difference is more significant on the ShapeNet segmentation dataset: for a batch size of 4, our segmentation framework is more than 5 times faster for training, and 3 times faster at test time.
Moreover, ConvPoint is more memory efficient than the PointCNN implementation. The “-” symbol in table 6 indicates configurations (batch size and number of points) that exceed the GPU memory. For example, we can train the classification model with 2048 points and a batch size of 128, which is not possible with PointCNN (on an Nvidia GTX 1070 GPU). The same holds on ShapeNet, where we can train with a batch size of 64 while PointCNN already exceeds the GPU memory with a batch size of 8.
In table 6(b), we report timings for the large-scale segmentation network and the fusion architecture, with a point cloud size of 8192. Note that for training this network we used an Nvidia GTX 1080Ti GPU. These timings represent the inference time only (neighborhood computation and convolutions), not data loading. We first observe that even the fusion architecture (two segmentation networks and a fusion module) can be run on the small configuration. Second, as for CNNs on images, we benefit from using batches, which reduce the per-point-cloud computation time. Finally, our implementation is efficient: for the segmentation network, we are able to process from 100,000 points (Config. 2) to 200,000 points (Config. 1) per second.
8 Discussion and limitations
Convolutional layer. First, our convolution design is agnostic to object scale, due to the normalization of neighborhoods to the unit ball. This is of interest for non-metric data such as CAD models or photogrammetric point clouds, where scales are not always available. Conversely, in metric scans, object sizes are valuable information (e.g., humans almost always have similar sizes), but removing the normalization would cause the kernel and the input points to have different volumes.
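The scale invariance mentioned above can be checked on a small example. The sketch below assumes that a neighborhood is centered on a reference point and divided by the distance to its farthest neighbor; the exact centering and scaling used in the implementation may differ, and `normalize_to_unit_ball` is a hypothetical helper name.

```python
import math


def normalize_to_unit_ball(neighbors, center):
    """Express neighbors relative to center and rescale them into the unit ball."""
    relative = [tuple(a - c for a, c in zip(p, center)) for p in neighbors]
    radius = max(math.sqrt(sum(v * v for v in r)) for r in relative)
    if radius == 0.0:
        # Degenerate neighborhood: all neighbors coincide with the center.
        return relative
    return [tuple(v / radius for v in r) for r in relative]


small = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, -1.0)]
large = [(10.0, 0.0, 0.0), (0.0, 20.0, 0.0), (0.0, 0.0, -10.0)]
origin = (0.0, 0.0, 0.0)
print(normalize_to_unit_ball(small, origin))
print(normalize_to_unit_ball(large, origin))
```

Both neighborhoods map to the same normalized coordinates, so a layer fed with them cannot distinguish an object from a copy ten times larger. This is convenient for data without metric scale and is precisely the information that is lost for metric scans.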
Second, an alternative is to use a fixed-radius neighborhood instead of a fixed number of neighbors. As pointed out in [thomas2019KPConv], the resulting features would be more related to geometry and less to sampling. However, the current code optimizations that speed up computation, such as batch training, would be inapplicable due to the variable number of neighbors in a batch.
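The batching issue with fixed-radius neighborhoods can be seen on a toy point cloud. The sketch below uses a hypothetical brute-force `radius_neighbors` helper and a cloud with one dense cluster and one isolated point; it only illustrates that the neighbor count depends on the local density.

```python
def radius_neighbors(points, query, radius):
    """Return the indices of all points within the given radius of the query."""
    r2 = radius * radius
    return [
        i
        for i, p in enumerate(points)
        if sum((a - b) ** 2 for a, b in zip(p, query)) <= r2
    ]


# A dense cluster near the origin and one isolated point.
cloud = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0),
    (0.1, 0.1, 0.0),
    (5.0, 0.0, 0.0),
]
neighborhoods = [radius_neighbors(cloud, q, radius=0.5) for q in (cloud[0], cloud[4])]
counts = [len(n) for n in neighborhoods]
print(counts)
# Neighborhoods of different lengths cannot be stacked into one rectangular array
# without padding or masking.
print(len(set(counts)) == 1)
```

With a fixed number of neighbors, every neighborhood has the same length and the whole batch fits in a single dense tensor; with a fixed radius, some form of padding, masking or ragged storage would be needed.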
Input features. Another perspective is to explore the use of precomputed features as inputs. In this study, we only use raw data for network inputs: RGB colors when available, or all features set to 1 otherwise. In the future, we will work on feeding the networks with extra features such as normals or curvatures.
Network architecture. Finally, we proposed two network architectures that are largely inspired by computer vision models. It would be interesting to explore further variations of network architectures. As our formulation generalizes the discrete convolution, it is possible to transpose more CNN architectures, such as residual networks.
In this paper, we presented a new CNN framework for point cloud processing. The proposed formulation is a generalization of the discrete convolution for sparse and unstructured data. It is flexible and computationally efficient, which makes it possible to build various network architectures for classification, part segmentation and large-scale semantic segmentation. Through several experiments on various benchmarks, real and simulated, we have shown that our method is efficient and competitive with the state of the art.