IPC-Net: 3D point-cloud segmentation using deep inter-point convolutional layers
Abstract
Accepted and presented at the International Conference on Tools with Artificial Intelligence (ICTAI 2018).
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Over the last decade, the demand for better segmentation and classification algorithms in 3D spaces has grown significantly due to the popularity of new 3D sensor technologies and advancements in the field of robotics. Point-clouds are one of the most popular representations to store a digital description of 3D shapes. However, point-clouds are stored in irregular and unordered structures, which limits the direct use of segmentation algorithms such as Convolutional Neural Networks. The objective of our work is twofold. First, we provide a full analysis of the PointNet architecture to illustrate which features are being extracted from the point-clouds. Second, we propose a new network architecture called IPC-Net to improve on state-of-the-art point-cloud architectures. We show that IPC-Net extracts a larger set of unique features, allowing the model to produce more accurate segmentations than the PointNet architecture. In general, our approach outperforms PointNet on every family of 3D geometries on which the models were tested. A high generalisation improvement was observed on every 3D shape, especially on the rockets dataset. Our experiments demonstrate that our main contribution, inter-point activation on the network’s layers, is essential to accurately segment 3D point-clouds.
The ability to directly learn from unordered data (e.g., 3D point clouds or 3D geometrical shapes) remains an open question. An ample amount of research has been done on extracting representations from ordered structures, so they can be used to achieve classification or segmentation of 3D spaces. Most methodologies condense the 3D representation into geometrical features that summarise the global and local attributes of the shape. Transforming the 3D space often has a negative impact on the accuracy of the segmentation or classification task [Su2015]. The majority of segmentation and object recognition problems rely on state-of-the-art algorithms such as Convolutional Neural Networks (CNN) to exploit the spatial information that exists within the input space of the problem. CNNs are powerful algorithms for object recognition and are known for outperforming human accuracy in several cases [mdy166, russakovsky2015imagenet]. However, CNNs cannot be directly used on 3D shapes, as convolutions are ill-suited for extracting spatially-local correlations in irregular and unordered data [lecun2015deep].
We introduce a CNN architecture that exploits the local correlations that exist within neighbouring points in a 3D point cloud to improve the accuracy of predicting segmentations in the 3D space. Due to its popularity across different fields of research (e.g., in robotics and 3D sensors) [sansoni2009state], a point-cloud representation is convenient, as a large proportion of 3D spaces can be represented as a point-cloud. Our research builds upon two prior studies [Dieleman2016, Qi2017] that demonstrated that point clouds can directly be used in a Neural Network that learns to approximate functions that induce invariance towards rigid transformations such as rotations in the point-cloud. In this manuscript we analyse these types of network and show that the architecture proposed in [Qi2017] ignores the spatial information that exists within clusters of neighbouring points. Furthermore, we propose a solution to this problem in the form of an improved architecture and call it Inter-Point Convolutions Network (IPC-Net). In summary, Section 3 provides an introduction to the PointNet architecture. Section 4 illustrates how the PointNet kernel activations partition the 3D space but omit inter-point neighbouring information. Finally, in Section 5 we propose our main contribution, which exploits inter-point activations to achieve high segmentation accuracy.
2 Related Work
Different methods have been proposed to solve classification and segmentation problems for 3D geometries [theologou2015comprehensive]. State-of-the-art techniques translate the geometry into a representation that learning algorithms can understand. This is often achieved by summarising the 3D shapes into geometrical features (i.e., characteristics). In the literature, most feature-based methods are divided into two categories: local feature methods (LFM) and global feature methods (GFM). Both categories represent a geometry in different ways. LFMs target the local characteristics of neighbouring information, for example by computing the curvature in a subset of the input space. In contrast, GFMs characterise the global shape of the geometry by considering the entire input space at once.
Osada et al. developed a method using shape distributions [Osada2002], where the concept of shape functions is designed to measure geometrical characteristics (e.g., functions that calculate distances or angles between arbitrary points). They uniquely identify a 3D geometry by a probability shape distribution generated from one or more shape functions. These unique signatures can be used for classification and shape retrieval problems. Shape distributions belong to the category of GFMs, as they summarise the overall geometry in a single distribution. Other methods, however, attempt to make a trade-off between local and global features. Su et al. rendered a collection of images by taking snapshots from different view-angles of 3D geometries [Su2015]. These images are subsequently fed into an ensemble of Convolutional Neural Networks (CNN) to generate view-based descriptors (i.e., descriptors generated from images). Su et al. found that view-based descriptors encompass a good balance between local and global features in comparison to more complex structures such as voxel-based representations [Brock2016]. This is due to the fact that rendered images are characterised by highly dense representations (pixels) and hence facilitate the extraction of representative 3D features.
Despite the high accuracy of view-based descriptors, such models still require the transformation of the original 3D format into lower level representations, thereby losing important information. For instance, view-based descriptors [Su2015] dismiss internal segments of 3D shapes, as the rendering only considers the exterior of the geometry. Rendering the internal segments of the geometry would be unfeasible, as it requires a combination of affine transformations (i.e., rotation, translation, shear and scaling) to capture the geometry. To counter these concerns, Qi et al. proposed the PointNet algorithm that learns directly from 3D meshes and point clouds [Qi2017]. Their learning framework requires no additional transformations, which provides an advantage over the algorithms mentioned. Point-clouds or meshes are more challenging to learn from than other representations such as view-based descriptors. For example, point clouds do not contain an underlying spatial or temporal order, in contrast to pixels in images or samples of a signal. As a result, Neural Networks or other machine learning algorithms cannot be used directly on point clouds. Qi et al. proposed an innovative solution to extract global and local geometrical features by approximating symmetric functions. A symmetric function is a function that remains unchanged by any permutation of its arguments [WeissteinEW]. In contrast to view-based descriptors, PointNet is sensitive to the density of the point cloud. Point cloud datasets commonly contain regions of non-uniform density. This results in a combination of different density sets that may strongly decrease the performance of the learning algorithm. PointNet++ [qi2017pointnet++] circumvents this limitation by adapting learning layers to combine features from multiple scales.
As our contribution builds upon their Neural Network architecture, we will further describe the PointNet method in Section 3.
Considering that the IPC-Net builds upon the PointNet architecture, in the following sections we will further describe the PointNet architecture and provide the necessary background to understand our analysis in Section 4.
3.1 Neural Networks
Neural Networks (NN) are composed of computational units (i.e., neurons) arranged hierarchically in a set of interconnected layers. Information flows back and forth between the lower and the higher layers of the network, allowing it to learn higher order representations of the input data. Intuitively, each computational unit tries to learn a specific characteristic of its input. Connections between input and output units are represented as weights and biases expressing the importance of the respective inputs to the outputs. The objective of NNs is to find the weights and biases which minimise the cost function of the network. The cost function is a way to quantify the objective of the Neural Network. The most commonly used objective functions are the Mean Square Error [allen1971mean] and Cross-Entropy [shore1980axiomatic] functions. Gradient-based methods such as (stochastic) gradient descent, conjugate gradient and Adam [bottou2010large, moller1993scaled, kingma2014adam] are the most popular methods to minimise cost functions in a NN. Finally, a technique called back propagation is used to propagate the cost of the output layer back to every unit in the network. The minimisation phase is an iterative process that continues until a desired cost is reached or a stopping condition is fulfilled.
Among the most popular network architectures for image segmentation and classification is the Convolutional Neural Network (CNN) [krizhevsky2012imagenet]. CNNs can directly learn from multidimensional arrays (e.g., 2-dimensional images) by introducing three architectural concepts: local receptive fields, shared weights and pooling. A local receptive field allows each layer of the network to introduce a focal view of the input space. This view is called a receptive field, which is defined as the patch of the input space that a particular CNN feature is looking at. In a convolutional hidden layer, units are organised as feature maps, which are the result of convolving a matrix of weights, called a kernel, with previous feature maps. The convolution in a layer slides the kernel over the input feature map, performs an element-wise multiplication of the two matrices at each position and sums the results. This process is shown in Figure 1.
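The sliding-window computation described above can be sketched in a few lines of NumPy. This is a minimal illustration and not code from the paper: the function name `conv2d_valid`, the 4x4 input and the 2x2 kernel are assumptions chosen for the example, and the operation is the unflipped (cross-correlation) form with no padding and stride 1.

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """Slide `kernel` over `feature_map`; at every position multiply
    element-wise and sum (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i:i + kh, j:j + kw]  # local receptive field
            out[i, j] = np.sum(patch * kernel)       # same weights at every position
    return out

# Hypothetical input: a 4x4 "image" whose values increase by 1 from left to right.
image = np.arange(16, dtype=float).reshape(4, 4)
# Hypothetical kernel that responds to horizontal intensity changes.
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])
response = conv2d_valid(image, edge_kernel)
```

Because the input increases uniformly along each row, every entry of the 3x3 `response` equals -2: the same kernel detects the same pattern wherever it occurs.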
Every unit in a specific feature map shares the same kernel, allowing it to extract patterns that are present across the input space. This dramatically reduces the number of weights that need to be trained. Pooling is performed right after the convolution, where the main idea is to further summarise the features that were captured by the feature maps. This is done by compressing the information generated by different feature maps, either by extracting the maximum or the mean activation of neighbouring units. Doing this removes redundant information encoded within the feature maps and increases the spatial invariance of the input. As a result, it makes the model more robust to rigid transformations in the image such as rotation and translation.
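As a small illustration of the pooling step, the sketch below summarises non-overlapping 2x2 neighbourhoods by their maximum or mean activation. The helper `pool2d` and the example feature map are assumptions made for this example only.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Summarise non-overlapping size x size neighbourhoods of a feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # drop any ragged border
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

# Hypothetical 4x4 feature map.
fmap = np.array([[1.0, 2.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 9.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [0.0, 1.0, 7.0, 8.0]])
pooled_max = pool2d(fmap, size=2, mode="max")
pooled_mean = pool2d(fmap, size=2, mode="mean")
```

Here `pooled_max` is `[[4, 9], [1, 8]]` and `pooled_mean` is `[[2.5, 2.25], [0.25, 6.5]]`. The strong activation 9 survives max pooling regardless of where it sits inside its 2x2 block, which is the source of the small-shift robustness mentioned above.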
3.2 PointNet architecture
Convolutional Neural Networks are a perfect choice when dealing with regularly ordered input domains such as images, as CNNs exploit the spatially-local correlations that exist within the pixel representation. Nevertheless, a 3D point cloud is an irregular and unordered representation for which convolutions that leverage spatial correlations are ill-suited. Ideally, the point-cloud could be ordered to exploit the points’ spatial information and to extract local and global signatures of the 3D shape. However, some attempts such as [jaderberg2015spatial, vinyals2015order] did not manage to achieve an acceptable accuracy when ordering the inputs. Zaheer et al. [zaheer2017deep] and Qi et al. [Qi2017] proposed to approximate a symmetric function to introduce invariance in the point set. The PointNet architecture compresses the point-cloud into a smaller set of features that roughly corresponds to the skeleton of objects. The algorithm starts by transforming the input space into its canonical representation using a symmetric function. Then it extracts the important features from this representation, which results in a new representation of the feature space. This representation can be further aligned by computing an additional affine transformation. Since these transformations have a higher number of dimensions (i.e., 64 x 64) than the input transformations (i.e., 3 x 3), the feature transformation matrix is constrained to be close to an orthogonal matrix, allowing the preservation of its symmetric inner product [Qi2017]. To achieve this, they regularised the cost function by the following equation:

$$L_{reg} = \lVert I - AA^{T} \rVert_{F}^{2} \qquad (1)$$

where $A$ is the feature transformation matrix approximated by the network, $I$ is the identity matrix and $\lVert \cdot \rVert_{F}$ is the Frobenius norm of a matrix. Qi et al. found that adding the regularisation term to the cost function stabilises the optimisation. The global features of the input shape are extracted by a maxpooling layer that summarises the activation of each point into a single feature activation. This results in a feature vector that uniquely represents the overall 3D shape. After extracting the local and global features of the 3D shape, these features are aggregated and used to classify or segment the 3D shape.
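To make the regularisation term concrete, the sketch below evaluates the squared Frobenius norm of $I - AA^{T}$, the form given in [Qi2017], for a 64 x 64 matrix. The function name `orthogonality_penalty` and the random test matrices are assumptions for illustration; in training, the penalty would be added to the segmentation loss with a weighting factor.

```python
import numpy as np

def orthogonality_penalty(A):
    """Squared Frobenius norm of (I - A A^T); zero when A is orthogonal."""
    identity = np.eye(A.shape[0])
    return np.linalg.norm(identity - A @ A.T, ord="fro") ** 2

rng = np.random.default_rng(0)
# Hypothetical 64 x 64 orthogonal matrix obtained from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))

penalty_orthogonal = orthogonality_penalty(Q)        # approximately 0
penalty_scaled = orthogonality_penalty(2.0 * Q)      # (1 - 4)^2 * 64 = 576
```

An orthogonal matrix incurs (numerically) no penalty, whereas scaling it by 2 gives $AA^{T} = 4I$ and a penalty of 576, so the optimiser is pushed back towards transformations that preserve inner products.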
A summary of the architecture can be observed in Figure 2. It only illustrates the segmentation path of their architecture as we want to stress the combination of global and local features. The segmentation layer in the PointNet architecture predicts the probability that each point belongs to a particular segmentation by using the aggregated global and local features of the previous layers.
Until now, we described how PointNet uses symmetric functions and maxpooling layers to extract global and local features of the 3D shape. Aggregating these signatures allows the algorithm to achieve high accuracy when segmenting point clouds. However, we postulate that the PointNet architecture still dismisses information that is useful for the segmentation of 3D geometries. A part of this paper (Section 4), is devoted to showing how this information is being disregarded by analysing the PointNet architecture’s kernel activations of the hidden units.
4 A new perspective on the PointNet
In Section 2, we wrote that PointNet extracts the local and global features of 3D point clouds by means of Neural Networks and symmetric function approximations. According to the findings of [Qi2017], a general function $f$ that defines a point set can be approximated by applying symmetric functions on every element in the set, as shown in Equation 2:

$$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n)) \qquad (2)$$

where $h$ is a symmetric function modelled by a Neural Network and $g$ is a combination of $h$ and maxpooling functions. Based on several combinations of $h$, different representations of $f$ can be learned. PointNet aggregates these groups of functions into a single $K$-dimensional feature vector, which we call kernel features. They encompass different properties of the set that are considered to be robust under transformations and generic to a variety of 3D shape families. In this Section we provide a different perspective on the $h$ and $g$ functions to further improve the understanding of the properties that are being extracted from the point set. This is achieved by analysing the activations of a subset of kernel features in the PointNet architecture. Our objective is to employ this analysis to improve the original network architecture.
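The permutation invariance that this approximation provides can be checked numerically. The sketch below is a deliberately reduced version of the idea in [Qi2017]: a single shared per-point layer followed by column-wise max pooling. The layer size (16), the random weights and the names `per_point_layer` and `set_function` are assumptions for illustration, not the PointNet configuration.

```python
import numpy as np

def per_point_layer(points, W, b):
    """Shared layer applied to every point independently (ReLU activation)."""
    return np.maximum(points @ W + b, 0.0)

def set_function(points, W, b):
    """Max-pool the per-point features into one signature for the whole set."""
    return per_point_layer(points, W, b).max(axis=0)

rng = np.random.default_rng(1)
points = rng.normal(size=(128, 3))   # hypothetical point cloud with 128 points
W = rng.normal(size=(3, 16))         # hypothetical shared weights
b = rng.normal(size=16)

signature = set_function(points, W, b)
shuffled_signature = set_function(points[rng.permutation(128)], W, b)
```

Reordering the points leaves the 16-dimensional signature unchanged, because the maximum over a set does not depend on the order in which its elements are visited.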
Initially, we analysed the kernels that are not part of the symmetric function approximation and inspected the activations of the remaining kernels in the architecture. The kernels of PointNet were activated by introducing a targeted set of 3D shapes and were visualised similarly as kernels for image classification/segmentation [zeiler2014visualizing].
In the input and feature kernels group, we found that each kernel learns a complex combination of 2D planes that partitions the 3D space. This complex combination of planes allows the activation of only one specific part of the space. Our interpretation is aligned with the initial findings of [Qi2017], which state that point features highlight the important local sections of the geometry. However, our interpretation is more general, as it encompasses both a point feature that emphasises the local signatures and the combination of partitions generated by these kernels. An example of kernel partitions can be seen in Figure 3. A complex combination of 2D planes selects a subset of points or features that are relevant to the overall goal of the learning algorithm. This cluster can be either a fundamental composition or compositions of different parts of the 3D shape.
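The plane-partition reading of a kernel can be illustrated with a single ReLU unit: its activation is non-zero only on one side of the plane $w \cdot x + b = 0$. The sketch below uses two hand-picked planes on random points; the weights, the point set and the name `kernel_activation` are assumptions for illustration and are not kernels learned by PointNet.

```python
import numpy as np

def kernel_activation(points, w, b):
    """ReLU unit: non-zero only for points on the positive side of the plane w.x + b = 0."""
    return np.maximum(points @ w + b, 0.0)

rng = np.random.default_rng(2)
points = rng.uniform(-1.0, 1.0, size=(1000, 3))   # hypothetical point cloud in a cube

# Hypothetical kernels: one plane at x = 0.5 and one at z = 0.
act_x = kernel_activation(points, np.array([1.0, 0.0, 0.0]), -0.5)
act_z = kernel_activation(points, np.array([0.0, 0.0, 1.0]), 0.0)

selected_by_x = points[act_x > 0]                   # points with x > 0.5
selected_by_both = points[(act_x > 0) & (act_z > 0)]  # intersection of two half-spaces
```

Each unit selects a half-space, and units in deeper layers combine several such half-spaces, which is one way to read the "complex combination of 2D planes" described above.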
Consequently, we show that the architecture in PointNet is learning to find the optimal partitions of the 3D space that lead to the discovery of the principal components of the 3D geometry. Furthermore, aggregating these partitions will yield a global shape signature that provides a unique characteristic for every geometry that belongs to the same family of shapes. From this perspective, we can slightly shift the objective of the symmetric function towards learning affine transformation matrices that optimise the partitioning of the 3D space. This does not discard the meaning of the symmetric function mentioned in [Qi2017]. Instead, it adds an extra layer of interpretability. As an example, our analysis showed that the symmetric function will approximate a shear transformation matrix that separates two 3D segmentation surfaces that are close in Euclidean space. Consequently, for the learning algorithm, it becomes easier to find a set of partitions that fragments these two surfaces.
After further analysis of the kernels, we concluded that the PointNet architecture discounts the inner-information that exists within the different partitions of the feature space. For instance, in Figure 3c we can observe that most of these points are near each other. This means that we could potentially extract extra features from this ensemble of points. Nonetheless, a convolution cannot be directly applied to these points as they are not spatially ordered in the feature space. Therefore, to exploit the available information on the kernels, we need to spatially group this set of points in order to extract inner-features by means of convolution. In Section 5 we will describe a new model that uses partition kernels and inner-kernel information to achieve high accuracy for segmentation problems.
5 Inter Point Convolution Network (IPC-Net)
In this section we explain an extension of the PointNet architecture that uses Convolutional Neural Networks to exploit the inter-local kernel information. We show that these inner-features build a new set of attributes that exist within points in the field of view of a kernel.
Similar to the symmetric function, we build an external Convolutional Neural Network to extract the disregarded local features that exist within the kernel activations. In this CNN we initially make use of the maxpooling operation to remove most of the zero values of layers that have been activated by a kernel. This ensures that the features of points that are neighbours in Euclidean space are grouped together in the activation matrix, as shown in Figure 5. The resulting set of features is convolved and downsampled by a new set of kernels which encode the neighbouring characteristics of the set. Similar to PointNet, the global signatures are extracted to ensure that the overall geometry is taken into account. We finalise the architecture by aggregating the local, neighbouring and global features into a single feature tensor. This results in a method that embodies a richer set of features compared to those of PointNet. A summary of the inter-point layers shown in Figure 4 is provided in Table 1. In Section 6 we will validate our hypothesis by showing that this newly improved set of features enhances the performance and accuracy of the model for 3D point segmentation.
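The data flow of the inter-point layers (zero removal, convolution with down-sampling, aggregation) can be sketched as follows. This is a toy illustration under strong assumptions and not the IPC-Net implementation: the window size, the single shared 1D filter, the feature widths and all function names are hypothetical, and the point order is arbitrary here, whereas the paper relies on the pooled activations grouping neighbouring points.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_points(act, window):
    """Max-pool along the point axis. A window is zero only if all of its
    points are inactive, so most zero entries are squeezed out."""
    n, k = act.shape
    n -= n % window
    return act[:n].reshape(n // window, window, k).max(axis=1)

def conv1d_downsample(act, filt, stride=2):
    """Apply one shared 1D filter along the point axis with a stride."""
    width = filt.shape[0]
    starts = range(0, act.shape[0] - width + 1, stride)
    return relu(np.stack([filt @ act[s:s + width] for s in starts]))

rng = np.random.default_rng(3)
points = rng.normal(size=(256, 3))
local = relu(points @ rng.normal(size=(3, 8)) - 1.0)   # sparse per-point features (256, 8)

dense = max_pool_points(local, window=4)               # zero removal        (64, 8)
neigh = conv1d_downsample(dense, rng.normal(size=3))   # convolve + stride   (31, 8)
neigh_sig = neigh.max(axis=0)                          # neighbouring signature (8,)
global_sig = local.max(axis=0)                         # global signature       (8,)

# Aggregate local, neighbouring and global features for every point.
features = np.concatenate(
    [local, np.tile(neigh_sig, (256, 1)), np.tile(global_sig, (256, 1))], axis=1)
```

Each point ends up with a 24-dimensional feature vector: its own 8 local activations, plus the neighbouring and global signatures shared across the cloud.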
| Layers | Feature extraction | zero removal | down-sample1 | down-sample2 | down-sample3 | transform-concat |
6 Experiments & Results
Our convolutional model was trained on the segment-annotated family of the ShapeNet dataset [chang2015]. This dataset comes in two flavours: the original dataset, which contains roughly 51,300 unique 3D models categorised into different family groups, and the annotated-ShapeNet [yi2016scalable], which contains a labelled subset of shapes from the original ShapeNet. The annotated-ShapeNet dataset contains 16,881 shapes from 16 different categories. In contrast to the original dataset, the annotated-ShapeNet dataset holds a point-cloud representation of the original shapes in which the (ground truth) annotations are labelled for each point in the point-cloud. We selected the Aircraft category to perform the kernel analysis that was described in Section 4. In addition, we selected the Car, Motorbike and Rocket categories to compare the segmentation accuracies of PointNet and IPC-Net. Each chosen dataset was randomly partitioned into training and test sets and run over several trials to evaluate the robustness and generalisation accuracy of IPC-Net. Table 2 gives an overview of the properties of each dataset.
| Family | Train | Test | Labels |
Although the test set was not used to optimise the hyperparameters of the network, and therefore reflects the generalisation capacity of the models, the model that yielded the best accuracy on the test set was chosen as the final model. We are therefore indirectly introducing information from the test set into the learning phase. Consequently, the generalisation of the models was further compared visually by creating a validation set (see Section 6.2). This was done to ensure that the models are not tested on geometries that were used in the training phase and to verify that the models are robust to a different point-sampling technique.
6.2 Sampling strategy
The 3D shapes for the validation set were extracted from the original ShapeNet dataset, as the annotated-ShapeNet dataset only holds a subset of this dataset. Because the original shapes are not in a point-cloud representation, we sampled the triangular surface of the meshes to generate a point-cloud. Prior to sampling, we first normalise the 3D shapes to the unit sphere (Equations 3 and 4), as neither model is invariant to scale transformations.
Equation 3 centres the geometries at the origin in Euclidean space for each coordinate. Equation 4 normalises the centred geometries to a unit sphere. After normalising each shape, points were uniformly sampled from the surface of the triangulation of the 3D shape. A triangulation is a common mesh representation of 3D geometries that approximates the shape of the 3D geometry by a set of connected triangles. To ensure that the overall surface of the 3D shape is captured, each triangle was sampled with a probability proportional to its area times its equilaterality ratio. To select the triangle to be sampled, we used a binary search algorithm on the cumulative area and on the weighted ratio distribution of the edges [Osada2002]. The coordinates of the points on the surface of a triangle were calculated with the following equation:

$$P = (1 - \sqrt{r_1})\,A + \sqrt{r_1}\,(1 - r_2)\,B + \sqrt{r_1}\,r_2\,C \qquad (5)$$

In this equation $A$, $B$ and $C$ are the vertices of the triangle, $\sqrt{r_1}$ is the percentage distance from vertex $A$ to the opposite edge and $r_2$ is the percentage along the opposite edge [yu2011three].
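The normalisation and sampling steps can be sketched as below. Several details are assumptions rather than the paper's exact procedure: the centre is taken to be the mean of the vertices, triangles are weighted by area only (the equilaterality ratio is omitted), and the point inside a triangle is placed with the barycentric formula of [Osada2002] using two uniform random numbers. The function names and the two-triangle test mesh are hypothetical.

```python
import numpy as np

def normalise_to_unit_sphere(vertices):
    """Centre every coordinate at the origin, then scale so that the
    farthest vertex lies on the unit sphere."""
    centred = vertices - vertices.mean(axis=0)        # assumption: centroid = vertex mean
    radius = np.linalg.norm(centred, axis=1).max()
    return centred / radius

def sample_surface(vertices, faces, n_points, rng):
    """Sample points on a triangle mesh, choosing triangles proportionally to area."""
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    cumulative = np.cumsum(areas)
    # Binary search on the cumulative area selects a triangle for each sample.
    draws = rng.uniform(0.0, cumulative[-1], size=n_points)
    picks = np.minimum(np.searchsorted(cumulative, draws), len(faces) - 1)
    r1 = np.sqrt(rng.uniform(size=(n_points, 1)))
    r2 = rng.uniform(size=(n_points, 1))
    return (1 - r1) * a[picks] + r1 * (1 - r2) * b[picks] + r1 * r2 * c[picks]

# Hypothetical mesh: a unit square in the z = 0 plane, split into two triangles.
square = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])

unit_vertices = normalise_to_unit_sphere(square)
cloud = sample_surface(unit_vertices, faces, n_points=500, rng=np.random.default_rng(5))
```

Taking the square root of the first random number keeps the samples uniform over each triangle; without it, points would concentrate near vertex $A$.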
6.3 Kernel Analysis
We selected a group of aircraft to analyse the kernels of the original model. Figure 5 shows the kernel visualisation of two hidden layers in the PointNet network. Each visualisation was obtained by multiplying a feature kernel vector with the point-cloud. This results in a sparse matrix where the non-zero values are the activations (important features) for that particular kernel. In these figures we can observe that several patterns of activations have been learned. These activations are a complex collection of approximated 2D planes that partition the 3D space. We visualised the kernels of the lower and higher layers, as most kernels throughout the network yield highly similar features. In Figure 4(a) we can observe that most partitions lean towards a more linear fragmentation (i.e., less complex features) compared to those in Figure 4(b). This was expected, as features that are closer to the higher layers will naturally contain more complex representations.
It is important to note that both figures are a 2D projection of the kernels. Therefore, there may be features that are not relevant for this 2D projection. With regard to feature activation, we note that in most cases the partitions in the feature space yield clusters of points that are close to each other in Euclidean space (e.g., points that belong only to a wing or to the fuselage of the plane). This inner-information that exists within local activations is exploited by our architecture, as described in Section 5.
For every dataset in Table 2, we plotted the mean accuracy and the variance of the IPC-Net and PointNet. In Figure 5(b) we see a noticeable improvement for the IPC-Net. The biggest improvement can be seen in the rocket dataset, where the accuracy gap between the two models is considerably bigger. Additionally, from this Figure we observe that our model managed to learn the correct segmentations considerably faster and that it is substantially more consistent over the different runs. This performance improvement is due to the exploitation of neighbouring features that exist within the kernels of the PointNet. These new features allow the network to produce partitions that are more informative for the segmentation task. For example, based on the heatmap in Figure 7, the PointNet architecture extracts insufficient information from the data to correctly guide the partitioning of the feature space. This statement is derived from the kernel redundancy (light colours) found in the last layers of the PointNet, as shown in Figure 6(a). In contrast, due to the exploitation of neighbouring kernel information, the IPC-Net managed to extract a larger set of unique kernels, reducing the number of redundant features. This is reflected in Figure 6(b), where red denotes the uniqueness of the features in the kernels. Furthermore, Figure 7 provides a complementary heat map to the rocket Figure in 5(d), which explains the large difference in accuracy between the IPC-Net and the PointNet. This Figure shows that the majority of kernels in PointNet are very similar (i.e., more redundant), which leads the model to suggest the wrong segmentations. In contrast, Figure 6(d) shows a large spectrum of unique features that helps the model to produce more desirable segmentations.
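One way to quantify the kernel redundancy discussed above is to compare the activation patterns of every pair of kernels. The source does not state which measure underlies its heat maps, so the cosine similarity, the 0.95 threshold and the names `kernel_similarity` and `redundancy` in the sketch below are assumptions for illustration only.

```python
import numpy as np

def kernel_similarity(activations):
    """Cosine similarity between kernel activation patterns.
    activations: (n_points, n_kernels) -> (n_kernels, n_kernels)."""
    norms = np.linalg.norm(activations, axis=0, keepdims=True)
    unit = activations / np.maximum(norms, 1e-12)
    return unit.T @ unit

def redundancy(similarity, threshold=0.95):
    """Fraction of distinct kernel pairs that are nearly identical."""
    k = similarity.shape[0]
    upper = similarity[np.triu_indices(k, k=1)]
    return float((upper > threshold).mean())

rng = np.random.default_rng(4)
base = rng.uniform(size=(200, 1))
# Hypothetical redundant kernels: the same activation pattern at four scales.
redundant_kernels = np.hstack([base * s for s in (1.0, 2.0, 3.0, 4.0)])
# Hypothetical unique kernels: each one fires on a different point.
unique_kernels = np.eye(200)[:, :4]

score_redundant = redundancy(kernel_similarity(redundant_kernels))
score_unique = redundancy(kernel_similarity(unique_kernels))
```

The scaled copies give a redundancy of 1.0 and the disjoint kernels give 0.0; a similarity matrix of this kind can be rendered as a heat map to inspect how many kernels in a layer carry distinct information.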
6.5 Visualisation results
We visually evaluated the segmentation of 4 geometry families, as shown in Figure 8. This visualisation was done to evaluate the generalisation of the models by predicting on geometries that were sampled with a different sampling technique, as described in Section 6.2. In this figure we can observe that our model achieved a better generalisation accuracy than the PointNet model. For instance, the second aircraft prediction of Figure 8 shows that the IPC-Net kernels responsible for extracting circular information from engines extrapolate this knowledge to a circular object found in the fuselage of the aircraft. This is a clear example of how local neighbouring activations are essential to label other parts of geometries that share similar characteristics. Additionally, it shows that the generalisation remains consistent on the rocket dataset.
7 Conclusion & Discussion
We propose an enhanced network architecture called IPC-Net that exploits the inner-information that exists within the kernel activations of the PointNet. A full kernel analysis was additionally provided, which confirmed that the PointNet architecture disregards important information. We showed that the IPC-Net model is able to extract a larger set of unique features, which leads it to surpass the segmentation accuracy of the original architecture. This was clearly noticeable in every dataset, where on average a large accuracy gap between the two models was observed. We additionally showed, by means of heat map kernels, why the IPC-Net is more accurate and also learns considerably faster and more robustly across different families of geometries.
While our work brings a notable improvement, there are some aspects that could be enhanced in future research. For example, in the predictions of the car dataset, we notice that the improvement over the original model is not as prominent as on the other datasets, especially when predicting the hood label on cars with a symmetrical structure (i.e., both the front and the rear of the car are highly similar). This behaviour can be explained by the similarity of neighbouring activations found for both front and rear parts of the cars. We believe that the global features of the network do not sufficiently influence the neighbouring features of the network, which leads the model to a wrong segmentation prediction. This could be solved by increasing the number of samples in the dataset such that the global features become more prominent in the network. As a result, they would provide a better point of reference for where the neighbouring features are located in Euclidean space. Another solution is to increase the number of labels such that the global reference of the segmentations becomes clearer. This symmetric limitation needs to be investigated further.
Additionally, we found that in certain segmentations an isolated cluster of misclassified points can be observed. Similar to the symmetric limitation, we believe that these random misclassified clusters arise due to similar neighbouring characteristics that are found across the 3D shape. As a result, if the global reference is not prominent, it will influence the network to perform a misclassification. This problem could potentially be mitigated by using Conditional Random Fields [lafferty2001conditional], which are popular in the field of image segmentation. This method penalises the model when points that are near each other are assigned different labels. This technique could be used to smooth the random clusters that arise due to similar kernels.
Felipe Gomez Marulanda’s work was supported by a Doctiris-Innoviris Brussels grant. Pieter Libin and Timothy Verstraeten were supported by a PhD grant of the FWO (Fonds Wetenschappelijk Onderzoek-Vlaanderen).