Improving Semantic Analysis on Point Clouds via Auxiliary Supervision of Local Geometric Priors
Abstract
Existing deep learning algorithms for point cloud analysis mainly discover semantic patterns from the global configuration of local geometries in a supervised learning manner. However, very few explore geometric properties that reveal local surface manifolds embedded in 3D Euclidean space as additional supervision signals for discriminating semantic classes or object parts. This paper is the first attempt to propose a unified multi-task geometric learning network to improve semantic analysis via auxiliary geometric learning with local shape properties, which can either be generated by physical computation from the point clouds themselves as self-supervision signals or be provided as privileged information. Owing to explicitly encoding local shape manifolds in favor of semantic analysis, the proposed geometric self-supervised and privileged learning algorithms achieve superior performance to their backbone baselines and other state-of-the-art methods, as verified in experiments on popular benchmarks.
I Introduction
Point clouds, collecting a set of orderless points to represent the 3D geometry of objects, have been verified as a powerful shape representation in a number of recent works [27, 14, 33, 34, 26, 41, 42, 38]. Semantic analysis on a point set aims to categorize the points globally into semantic classes (e.g. planes, chairs, mugs) [33, 34, 41, 43, 24, 47] or locally into object parts [33, 41, 24] according to their topological configuration. Such a problem plays a vital role in many applications, especially those demanding visual perception and interaction between machines and their surrounding environment such as augmented reality, robotics and autonomous driving. Semantic patterns of point clouds can be discovered from the global configuration of local geometric patterns, but it is challenging to discover and exploit such local geometries due to the inherently missing point-wise connectivity in their neighborhood.
A number of recent works have addressed feature learning on point sets, either by designing locally-connected convolutional/pooling layers on irregular non-Euclidean points, such as PointNet [33], PointCNN [26], Dynamic Graph CNN (DGCNN) [41], and GeoNet [14], or by hierarchically aggregating features revealing geometric patterns across scales, e.g. PointNet++ [34] and SO-Net [24]. These existing methods, in a supervised learning manner, utilize predefined annotations to implicitly learn a global topology and local geometries sensitive to semantic classes. Very few works pay attention to explicitly constraining 3D neural classifiers with auxiliary regression onto local geometric properties.
Local geometric properties such as point-wise normal vectors, curvatures, and tangent spaces are the primitive properties of local point groupings that reveal local geometric manifolds. For example, to compute the normal of a point, the typical solution is to first fit a plane via a set of its nearest neighboring points and take the normal of that plane, which yields point-wise geometric properties describing local connectivity across nearby points. Some works [13, 3] design a deep network to directly estimate these geometric properties from point clouds. However, local geometric properties can be freely obtained by physical computation at no cost of additional manual annotation effort, especially for the massive amounts of auxiliary data usually produced by computer-aided design (CAD).
Point-wise geometric properties, in most existing works [34, 26], are combined with their corresponding point coordinates as a type of rich point-based feature representation, which is fed as input into deep networks directly for semantic analysis. Alternatively, geometric properties can serve as auxiliary self-supervision signals, inspired by the recent success of self-supervised learning in visual recognition [7, 19, 12, 29, 30, 11, 40], which generates supervision signals from the data itself to avoid expensive manual annotations and then learns a proxy loss for network optimization. Moreover, high-quality local properties preserving finer geometric details can be more accurate given denser sampling of points, and can be provided as privileged supervision signals only available during training.
Existing geometric learning methods focus on discovering semantic patterns from the global shape, which consists of local geometric patterns. It remains an open problem whether capturing local geometric patterns has any positive effect on semantic analysis of their global configuration. This paper is the first attempt to design a novel geometric learning method that explicitly fits local geometric properties, in either a self-supervised or a privileged learning manner, as an additional optimization goal to support semantic analysis on point sets. Fig. 1 shows the main difference between the proposed geometric learning and a conventional supervised classifier. Specifically, our deep model shares the low-level feature encoding layers and has two branches, for semantic analysis (e.g. 3D object classification, part/scene segmentation) and geometric property estimation respectively, in a multi-task learning style.
The main contributions of this paper are as follows.

This work for the first time explores geometric properties of point-based surfaces, which perceive the underlying local connectivity, as auxiliary supervision signals to improve 3D semantic analysis.

A novel geometric self-supervised learning method is proposed to jointly encode features that are discriminative for semantic analysis on point sets and also fit local geometric properties well, in a multi-task learning manner.

Beyond geometric properties obtained via physical computation, high-quality geometric properties provided as privileged information can further boost performance on semantic analysis.
Experimental evaluation on three public benchmarks demonstrates our motivation to exploit local geometric patterns to improve learning semantic patterns of point clouds, consistently achieving superior performance to the backbone competitor DGCNN [41] and other state-of-the-art methods in 3D object classification and part/scene segmentation.
Source code and pretrained models will be released.
II Related Works
Semantic analysis of point clouds – As a pioneer, PointNet [33] started the trend of designing deep networks operating on irregular point-based surfaces, with the permutation invariance of points encoded by point-wise manipulation in multi-layer perceptrons (MLPs) and a symmetric function for accumulating features. Its follow-up work – PointNet++ [34] – hierarchically aggregates multi-scale features to inherently capture different modes of semantic patterns. However, both PointNet and PointNet++ only implicitly model semantic-concept-aware geometric patterns in local regions via deep feature encoding, and fail to consider neighborhood information of points that could benefit semantic analysis. Recently, SO-Net [24] explicitly regularizes spatial correlation across points via nearest-neighbour search on a 2D projection of 3D points during feature encoding, while GeoNet [14] implicitly incorporates local connectivity via an autoencoder and geodesic matching into extra point-wise features for further fusion. An alternative solution for analyzing point clouds is the recently-proposed family of geometric deep learning methods, such as spectral networks [4], which apply convolution operations on graphs representing the irregularly distributed structure of points. Follow-up works concern either reducing computational cost by replacing Laplacian eigen decomposition with polynomial [5, 18] or rational [23] spectral alternatives, or improving generalization capabilities [8, 50, 31]. Recently, the dynamic graph CNN (DGCNN) was proposed by Wang et al. [41] to discover the local geometric manifold of each point by an edge convolution operation on a dynamic k-NN graph, which is iteratively updated by the nearest neighbours. The DGCNN achieves state-of-the-art performance on semantic analysis and is thus adopted in our methods as the backbone CNN model.
The key difference between our methods and the DGCNN baseline lies in incorporating an extra branch (as shown in Figure 2) to learn local geometric patterns with self-supervision or privileged supervision signals. The resulting superior performance of our methods is illustrated in Tables I and VI of Sec. IV.
Geometric analysis of point clouds – Geometric analysis of point clouds aims to obtain point-wise geometric properties such as the normal and curvature. A typical solution for obtaining local geometric properties of a point is direct computation based on principal component analysis (PCA) [15] within a local region, e.g. a plane best fitting the point and its nearest neighbours. Such a method is simple but sensitive to noise and to the strategy used to generate local regions. A number of advanced geometric computation techniques [28, 16] have been developed to improve robustness against these challenges, but they remain impractical due to poor generalization. On the other hand, geometric shape analysis can be learning-based, i.e. learning a regression mapping from point sets to point-wise geometric properties. The recent deep learning based PCPNet [13] performs robustly against noise and shape variation under a wide variety of settings, given sufficiently large-scale training data. Our goal in this paper is to directly mine local geometric patterns to additionally support semantic analysis of point-based shapes via an auxiliary supervised mapping onto geometric properties. In our proposed multi-task network, more robust estimation of geometric properties is achieved than with the fixed backbone baseline (see Fig. 6 and 7), while also improving classification accuracy for semantic analysis (see Table I).
Deep self-supervised learning – Deep learning has gained significant success in visual recognition [20, 49, 44, 25] and semantic shape analysis [33, 34, 41, 43, 24, 47], but heavily hinges on large-scale labelled training samples. Data augmentation is a simple yet effective preprocessing step to alleviate the demand for sufficient data to fit network parameters, especially when network capacity exceeds the size of the training set. To avoid label acquisition for supervision-starved tasks and to exploit vast amounts of unlabelled data, self-supervised learning is considered a powerful alternative that relaxes the impractical requirement of large-scale labelled data by generating supervision labels from the data itself. In other words, the self-supervised learning paradigm is typically formulated as a pretext learning task, such as motion segmentation in videos [32], and relative positions [6] or exemplars [9] in the image domain. In light of this, the target task can be solved by transferring knowledge from self-supervised learning on a proxy loss. Inspired by this concept, this paper for the first time develops a novel geometric self-supervised learning (GeoSSL) method to exploit local geometric patterns discovered by self-supervised learning to improve semantic analysis of point clouds. With local geometric regularization on deep feature encoding for semantic analysis, the proposed GeoSSL beats its direct competitor – the DGCNN (the backbone net) – as well as other comparative methods in our experiments (see Table I).
Learning with privileged information – Information only available during training is referred to as privileged information, which has been exploited in classification [37, 22], regression [46] and ranking [35]. For image based semantic analysis, text [35], attributes [35], bounding boxes [35], head pose [46], and gender [46] have been exploited as privileged information to boost performance, but this paper is, to the best of our knowledge, the first work on geometric learning with high-quality properties from more densely sampled points as privileged information. Similar to the aforementioned GeoSSL method, our geometric privileged learning (GeoPL) employs the identical multi-task network structure; the only difference between GeoSSL and GeoPL lies in the quality of the geometric properties used to discover local patterns of 3D geometry to support semantic classification and segmentation. Experimental verification in this paper demonstrates that our model with privileged geometric properties performs better than the state-of-the-art methods in Table I as well as its self-supervised variant.
III Methodology
III-A Supervised Semantic Learning
Existing deep algorithms on point clouds focus on analysing semantic patterns of 3D geometry, in view of only semantic labels being available in 3D object classification [41] or part segmentation [26]. Given a pair of a 3D observation in the representation of a point cloud X = {x_i}_{i=1}^{N} and its semantic label y, the typical network architecture of supervised semantic learning frameworks such as PointNet [33], PointNet++ [34], PointCNN [26] and DGCNN [41] consists of several feature encoding layers (e.g. convolutional layers, MLP layers or a hybrid of both). Take the DGCNN [41] (the backbone network of the proposed GeoSSL and GeoPL) as an example, which is shown in the gray box of Figure 2. The DGCNN introduces an edge convolution operation on a directed graph representation of local connectivity of points. In detail, a directed k-Nearest-Neighbour (k-NN) graph G = (V, E) models correlation across closest vertices, where V and E denote its vertices and edges. A parametric mapping function h_Θ on edges is adopted for capturing global and local shape patterns, where Θ denotes the parameters to be optimized in each edge convolution layer. In this sense, the output of edge convolution on the k-NN graph at each vertex is calculated by aggregating edge features, and is thus invariant to the total number of points in the set.
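The edge convolution described above can be sketched in a few lines of numpy. This is an illustrative toy version under stated assumptions, not the DGCNN implementation: the parametric edge function h_Θ is stood in for by a single random linear layer with ReLU, and the input features are simply the raw coordinates.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of each point's k nearest neighbours (excluding itself)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is never its own neighbour
    return np.argsort(d, axis=1)[:, :k]

def edge_conv(features, points, k, weight):
    """One toy edge-convolution layer: for every directed edge (i, j) in the
    k-NN graph, build the edge feature [x_i, x_j - x_i], map it through a
    linear layer + ReLU (a stand-in for h_Theta), and max-aggregate the
    resulting edge features over each vertex's neighbourhood."""
    idx = knn_indices(points, k)                       # (N, k)
    x_i = features[:, None, :].repeat(k, axis=1)       # (N, k, C)
    x_j = features[idx]                                # (N, k, C)
    edge_feat = np.concatenate([x_i, x_j - x_i], -1)   # (N, k, 2C)
    out = np.maximum(edge_feat @ weight, 0.0)          # (N, k, C_out)
    return out.max(axis=1)                             # aggregate -> (N, C_out)

# toy usage: 32 points, k = 5 neighbours, 16 output channels
rng = np.random.default_rng(0)
pts = rng.normal(size=(32, 3))
w = rng.normal(size=(6, 16))             # maps 2*C_in = 6 -> C_out = 16
out = edge_conv(pts, pts, k=5, weight=w)
```

Because the aggregation is a per-vertex max over its own edges, the output for each point depends only on its local neighbourhood, matching the invariance to total point count noted above.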
The shared part of the DGCNN is made up of three MLP-based edge convolution blocks and a fully-connected layer encoding each point into a 1024-dimensional feature, followed by task-specific layers for object classification and part segmentation respectively. On one hand, for semantic object classification, another multi-layer perceptron is appended to the shared part, with the output dimensions of its hidden layers fixed to {512, 256, C}, where C denotes the number of object classes. On the other hand, for part segmentation, the shared part of the DGCNN is followed by a multi-layer perceptron with dimensions {256, 128, P}, where P denotes the number of object part classes. However, such a model cannot provide supervision signals to incorporate local geometric structural information, which encourages us to design a novel network that improves semantic analysis by learning primitive geometric properties of points in their local neighbourhood.
III-B Generation of Local Geometric Properties
Given a point set X = {x_i}_{i=1}^{N}, point-wise geometric properties can either be measured or calculated directly. A typical solution for generating the i-th point's normal is first to find its k nearest neighbors and then calculate the covariance matrix C_i as
C_i = \frac{1}{k} \sum_{j=1}^{k} (x_{i_j} - \bar{x}_i)(x_{i_j} - \bar{x}_i)^\top, \qquad (1)
where x_{i_j} (j = 1, …, k) denote the nearest neighboring points of x_i in the cloud and \bar{x}_i = \frac{1}{k}\sum_{j=1}^{k} x_{i_j} is their centroid. Eigenvectors and eigenvalues of C_i can be obtained via spectral decomposition [2]. The eigenvector corresponding to the minimal eigenvalue defines the estimated surface normal n_i of point x_i, as defined in [36]. Similarly, the second-order geometric property – curvature – can also be calculated based on eigen decomposition of the covariance matrix [2]. Particularly, the ratio of the minimal eigenvalue to the sum of all the eigenvalues can be used to estimate the change of geometric curvature. In mathematics, for the i-th point x_i, the change of curvature \sigma_i can be approximated as the following
\sigma_i = \frac{\lambda_{\min}}{\lambda_1 + \lambda_2 + \lambda_3}, \qquad (2)
where \lambda_{\min} denotes the minimal eigenvalue of C_i and \lambda_1, \lambda_2, \lambda_3 denote its three eigenvalues. Additionally, for the i-th point x_i, the curvature c_i can also be computed from the normal vectors of that point and its neighbors as
c_i = \frac{1}{k} \sum_{j=1}^{k} \big(1 - n_i^\top n_{i_j}\big), \qquad (3)
Although geometric properties can be directly computed from point clouds, they can also be estimated via supervised regression learning algorithms [13, 3].
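The physical computation of Eqs. (1)-(2) can be sketched as a minimal numpy routine for a single point. This is an illustrative sketch, assuming the query point itself is counted among its k nearest neighbours; the function name is ours, not from any released code.

```python
import numpy as np

def local_normal_and_curvature(points, i, k):
    """Estimate point i's normal and change of curvature from its k nearest
    neighbours via PCA: the eigenvector of the neighbourhood covariance with
    the smallest eigenvalue approximates the surface normal (Eq. (1)), and
    lambda_min / sum(lambda) approximates the change of curvature (Eq. (2))."""
    d = np.linalg.norm(points - points[i], axis=1)
    nbrs = points[np.argsort(d)[:k]]          # k nearest (includes the point)
    centred = nbrs - nbrs.mean(axis=0)
    cov = centred.T @ centred / k             # 3x3 covariance matrix C_i
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    normal = eigvecs[:, 0]                    # eigvec of smallest eigenvalue
    curvature = eigvals[0] / eigvals.sum()
    return normal, curvature

# sanity check on a flat patch: the normal should align with the z-axis
# (up to sign) and the change of curvature should be ~0
rng = np.random.default_rng(0)
plane = np.c_[rng.uniform(-1, 1, (100, 2)), np.zeros(100)]
n, c = local_normal_and_curvature(plane, 0, k=10)
```

Note the sign of the PCA normal is arbitrary, which is why unoriented normal losses (discussed in Sec. III-C) ignore sign flips.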
Normals and curvatures approximating local geometric patterns of a shape are vital in semantic analysis, which encourages a number of works [33, 34] to combine such point-wise geometric properties with the corresponding coordinates before feeding them into a supervised learning model as input features. However, very few works consider the normal and curvature of points as auxiliary supervision signals to improve the analysis of semantic patterns, despite the benefits of encoding local manifold structure into features and the superior robustness against noisy point sets, especially when the model is trained on clean data. Beyond point-wise normals and curvatures self-generated computationally from point clouds, more accurate, high-quality geometric properties can be provided as privileged information available only during training, e.g. via physical computation from denser points.
III-C Multi-Task Geometric Learning
In view of the lack of local connectivity across orderless points, our motivation is to design an auxiliary task (regression learning with geometric properties) to explicitly incorporate local neighborhood information underlying surface manifolds. To this end, we propose a multi-task geometric learning network to simultaneously learn semantic and geometric patterns for 3D object classification and part segmentation, whose pipelines are visualized in Fig. 2. Given the input and output pairs of an ordinary supervised learning network, i.e. a point cloud X and its semantic class label y, geometric properties G can be generated by physical computation as in Sec. III-B as extra self-supervision signals, or provided as privileged information extracted from high-quality point clouds, i.e. Geometric Self-Supervised Learning (GeoSSL) and Geometric Privileged Learning (GeoPL) respectively. It is noted that, regardless of the quality of auxiliary labels, the proposed networks have an identical network structure for classification or segmentation. Training pairs for our multi-task geometric learning network are thus {(X_m, y_m, G_m)}_{m=1}^{M}, where G_m denotes point-wise geometric properties and M is the size of the training set.
Based on the backbone DGCNN depicted in Sec. III-A, the proposed geometric learning network consists of the shared layers and the application-specific blocks (blue or green boxes in Fig. 2): it shares the first three edge convolution blocks followed by one MLP layer, and is then divided into two task-specific branches. The top branch is an auxiliary task to regress point-wise local geometric properties, while the bottom one carries out the original tasks of semantic analysis (i.e. classification, part/scene segmentation). To jointly optimize both branches, we introduce a combinational loss function as follows, which utilizes the mean square loss to control the quality of normal/curvature estimation in the geometric learning branch and the cross-entropy loss for task-specific semantic analysis on point sets:
\min_{\theta_0, \theta_s, \theta_g} \; \mathcal{L}_{ce}\big(f_s(X; \theta_0, \theta_s), y\big) + \lambda\, \mathcal{L}_{reg}\big(f_g(X; \theta_0, \theta_g), G\big), \qquad (4)
where f_s(X; \theta_0, \theta_s) and f_g(X; \theta_0, \theta_g) denote the outputs of the two branches in the proposed model, and \{\theta_0, \theta_s, \theta_g\} are the weighting parameters of the proposed geometric learning model. \theta_0 denotes the shared weights in the lower shared layers, and \theta_s and \theta_g are the weights for the classification/segmentation and the geometry regression branch, respectively. \lambda is a trade-off parameter between the two loss terms.
The key merit of the aforementioned cost function lies in that it brings an additional objective function to discover geometric patterns missed by existing supervised point cloud classifiers trained with semantic labels only. During training, we adopt the mean square loss for \mathcal{L}_{reg} and the cross-entropy loss for \mathcal{L}_{ce}. It is noted that the regression loss is not limited to the mean square loss; we select it owing to its solid performance on estimation of geometric properties. Specifically, we have explored the Euclidean distance and Cosine similarity for the oriented normal vector, and the unoriented normal Euclidean distance and RMS angle difference between the estimated normal and the ground truth normal in our experiments. Without loss of generality, we also employ the mean square loss for supervising geometric curvature. As a result, with both normal and curvature, the loss function can be written as
\mathcal{L}_{reg} = \frac{1}{N} \sum_{i=1}^{N} \Big( \| n_i - \hat{n}_i \|_2^2 + ( c_i - \hat{c}_i )^2 \Big), \qquad (5)
where n_i and \hat{n}_i denote the ground truth normal (self-generated in GeoSSL or provided as privileged information in GeoPL) and the predicted normal, and c_i and \hat{c}_i denote the ground truth curvature and the predicted curvature.
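The combined objective of Eqs. (4)-(5) can be sketched as follows. This is a numpy sketch, not the training code: `multitask_loss` is a hypothetical name, and the two network forward passes are replaced by precomputed branch outputs.

```python
import numpy as np

def multitask_loss(logits, labels, pred_normal, gt_normal,
                   pred_curv, gt_curv, lam):
    """Combined objective of Eqs. (4)-(5): cross-entropy on the semantic
    branch plus a lambda-weighted mean-square regression loss on the
    geometric branch (normals and curvatures)."""
    # cross-entropy via a numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_prob[np.arange(len(labels)), labels].mean()
    # mean-square regression losses on point-wise geometric properties
    mse_normal = np.mean((pred_normal - gt_normal) ** 2)
    mse_curv = np.mean((pred_curv - gt_curv) ** 2)
    return ce + lam * (mse_normal + mse_curv)

# toy usage: confident, correct classification with perfect geometry
logits = np.array([[10.0, 0.0], [0.0, 10.0]])
labels = np.array([0, 1])
gt_n = np.array([[0.0, 0.0, 1.0]])
gt_c = np.array([0.1])
loss = multitask_loss(logits, labels, gt_n, gt_n, gt_c, gt_c, lam=0.1)
```

With perfect geometry predictions the regression term vanishes and the loss reduces to the cross-entropy alone, which is how the two terms trade off via λ.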
IV Experiments
We evaluate the proposed geometric learning algorithms (i.e. GeoSSL and GeoPL) introduced in Sec. III on three popular semantic analysis tasks, i.e. 3D object classification, part segmentation and scene segmentation.
Datasets and Settings – Evaluation on 3D object classification was conducted on the commonly used ModelNet40 benchmark [45], which contains 12,311 CAD models belonging to 40 predefined categories. In our experiments, we split the dataset into two parts, i.e. 9,843 models for training and 2,468 for testing. We followed the same experimental settings as in [33, 41]. Specifically, 1024 points are sampled from mesh faces by farthest point sampling and normalized into a unit sphere. We evaluated our model architectures for part segmentation on the ShapeNet part dataset [48], containing 16,880 3D shapes from 16 object categories, annotated with 50 parts in total. We followed the data split of [26], i.e. 14,006 shapes for training and 2,874 for testing. Part category labels are assigned to each point in the point cloud, which consists of 2048 points uniformly sampled from the mesh surfaces of the training samples. It is worth mentioning here that we assume each object contains fewer than six parts. The S3DIS [1] dataset is adopted for evaluating our method on scene segmentation. Unlike the samples in ModelNet40 and ShapeNet, which are made by 3D modeling tools, the S3DIS samples are collected from real scans of indoor environments. In detail, this dataset contains 3D scans from Matterport scanners in 6 areas covering 271 rooms. Each point in the scan is annotated with one semantic label from 13 categories.
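The preprocessing described above (farthest point sampling followed by unit-sphere normalization) can be sketched as follows; a minimal numpy illustration with hypothetical function names, not the released preprocessing code.

```python
import numpy as np

def farthest_point_sample(points, m, seed=0):
    """Greedy farthest-point sampling: start from a random point and
    repeatedly pick the point farthest from the already-selected set."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())                 # farthest remaining point
        chosen.append(nxt)
        # keep, per point, its distance to the nearest chosen point
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

def normalize_to_unit_sphere(points):
    """Centre the cloud and scale it so the farthest point lies on the
    unit sphere."""
    centred = points - points.mean(axis=0)
    return centred / np.linalg.norm(centred, axis=1).max()

# toy usage: subsample 64 of 500 random points, then normalize
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
samp = normalize_to_unit_sphere(farthest_point_sample(pts, 64))
```

Farthest-point sampling keeps the subset spatially well spread, which is why it is preferred over uniform random sampling for fixed-size point budgets.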
Performance Metrics – For the classification task, we use mean accuracy as our evaluation metric, widely adopted in recent work [33, 34, 41]. In the part segmentation task, Intersection-over-Union (IoU) is used to evaluate our model and other comparative methods, following the same evaluation protocol as the DGCNN [41]: the IoU of a shape is obtained by averaging the IoUs of the different parts involved in that shape, while the mean IoU (mIoU) is calculated by averaging the IoUs of all the testing samples. In the scene segmentation task, mean Intersection-over-Union (mIoU) and overall accuracy (OA) are utilized to evaluate our method.
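The part-IoU protocol above can be made concrete with a short sketch. One assumption worth flagging: following the common convention in PointNet/DGCNN evaluation code, a part that appears in neither prediction nor ground truth contributes an IoU of 1; the function names are illustrative.

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """IoU of one shape: average the per-part IoUs over the parts belonging
    to that shape's category. A part absent from both the prediction and
    the ground truth counts as IoU 1 (common protocol convention)."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def mean_iou(preds, gts, part_ids_per_shape):
    """mIoU: average the shape IoUs over all testing samples."""
    return float(np.mean([shape_iou(p, g, ids)
                          for p, g, ids in zip(preds, gts, part_ids_per_shape)]))

# toy usage: a 4-point shape with two parts
pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
s = shape_iou(pred, gt, [0, 1])   # part 0: 1/2, part 1: 2/3
```

Averaging per part before per shape means small parts weigh as much as large ones inside a shape, which is what distinguishes this protocol from a plain point-accuracy score.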
Implementation Details – To efficiently use the geometric cues, we independently pretrain the shared layers and the geometric learning branch (the top branches in Fig. 2) with generated geometric properties on the ShapeNetCore dataset; this pretraining network is similar to the DGCNN architecture for part segmentation, with the only change lying in the last layer, which outputs 4 continuous values. Model parameters learned by such a network are then used to initialize the shared layers in both GeoSSL and GeoPL. The learning rates of GeoSSL and GeoPL are set to 0.01 and 0.001 respectively, and are decayed exponentially every 20 epochs. The overall number of training epochs in our experiments is 200.
IV-A Comparison with State-of-the-Art
3D object classification – A comparative evaluation of 3D object classification on the ModelNet40 is shown in Table I. We can see that our GeoSSL achieves superior performance to its direct competitor DGCNN [41] as well as other state-of-the-art methods. Given the identical input and output as well as the same backbone CNN model, the performance gain can only be explained by the auxiliary incorporation of local geometric properties into the DGCNN. We also evaluate our geometric privileged learning (GeoPL) for classification on the ModelNet40 with privileged geometric properties only available during training, whose normals and curvatures are generated from a denser point-based surface and are thus more accurate than those directly computed from sparse points. For example, in our experiments we generate privileged normals and curvatures from a dense point cloud consisting of 10,000 points, compared to the ordinary one with 1024 points. Experiment results in Table I show significantly better performance than other comparative algorithms given accurate geometric properties, which further verifies the effectiveness of our concept of improving semantic analysis by exploiting local geometric priors.
Table I: 3D object classification results on the ModelNet40.

Methods          | Mean Class Accuracy | Overall Accuracy
VoxNet [?]       | 83.0                | 85.9
PointNet [33]    | 86.0                | 89.2
PointNet++ [34]  | –                   | 90.7
SO-Net [24]      | 87.3                | 90.9
PointCNN [26]    | –                   | 92.2
DGCNN [41]       | 88.2                | 91.2
DGCNN+ [41]      | 90.2                | 92.2
GeoSSL (ours)    | 90.3                | 92.9
GeoPL (ours)     | 90.8                | 93.5
3D Part segmentation – The part segmentation network is evaluated on the ShapeNet Part benchmark, whose Intersection-over-Union (IoU) results are illustrated in Table II.
Evidently, regardless of the network structure, e.g. PointNet++ [34], PointCNN [26] or DGCNN [41], the proposed GeoSSL consistently performs better than its backbone competitors. Specifically, the original PointNet++ [34] achieves better performance than our GeoSSL variant, but it demands high-quality point-wise geometric properties as input, which can be impractical since accurate point-wise normals are rarely available in the real world. It is noted that we re-implement PointNet++, PointCNN and DGCNN following the settings of the original works, whose results are reported in Table II.
Table II: Part segmentation results on the ShapeNet Part dataset.

Methods                | Mean IoU
PointNet [33]          | 83.7
SO-Net [24]            | 84.9
PointNet++ [34]        | 85.1
PointNet++ (backbone)  | 84.3
GeoSSL (ours)          | 84.8
PointCNN [26]          | 86.1
PointCNN (backbone)    | 85.3
GeoSSL (ours)          | 85.6
DGCNN+ [41]            | 85.1
DGCNN (backbone)       | 84.5
GeoSSL (ours)          | 85.7
Indoor Scene Segmentation – We also apply our GeoSSL to the semantic scene segmentation task, which replaces the object part labels of part segmentation with the semantic object classes in the scene. We conduct experiments on the S3DIS [1], which is collected from real scans of indoor environments. For a fair comparison, we follow the same setting as the DGCNN, where each room is sliced into 1 × 1 square-metre blocks, and 4096 points are sampled for each block. Based on the sampled points, we then calculate point-wise geometric properties (i.e. normal, curvature) using the method in Sec. III-B. Finally, we use 6-fold cross validation over the 6 areas and report the mean of the evaluation results. We compare the proposed method with the state-of-the-art methods on the S3DIS, whose results are shown in Table III. We can conclude that our method consistently achieves superior segmentation performance to its direct competitor DGCNN [41], and outperforms most state-of-the-art methods except for PointCNN [26] and SPGraph [21]. Note that the concept of our method is generic and can be applied to other backbone CNN models that achieve state-of-the-art scene segmentation performance, such as PointCNN [26] and SPGraph [21].
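The block preparation step for S3DIS can be sketched as follows. A simplified illustration with hypothetical function names: it partitions only in the ground plane and omits details such as block overlap and per-room alignment.

```python
import numpy as np

def slice_into_blocks(points, block=1.0):
    """Partition a room scan into block x block square-metre columns in the
    ground plane (x, y); z is kept whole. Returns a dict mapping each grid
    cell to the indices of the points that fall inside it."""
    cells = np.floor(points[:, :2] / block).astype(int)
    blocks = {}
    for idx, cell in enumerate(map(tuple, cells)):
        blocks.setdefault(cell, []).append(idx)
    return {cell: np.array(ids) for cell, ids in blocks.items()}

def sample_block(points, ids, n=4096, seed=0):
    """Sample n points from one block, with replacement if it is small."""
    rng = np.random.default_rng(seed)
    pick = rng.choice(ids, size=n, replace=len(ids) < n)
    return points[pick]

# toy usage: a 2m x 2m "room" splits into four 1x1 blocks
rng = np.random.default_rng(1)
room = rng.uniform(0, 2, (1000, 3))
blocks = slice_into_blocks(room)
```

Each sampled block then goes through the geometric-property computation of Sec. III-B independently, so supervision stays local to the block.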
Table III: Scene segmentation results on the S3DIS.

Methods                  | Mean IoU | Overall Accuracy
PointNet (baseline) [33] | 20.1     | 53.2
PointNet [33]            | 47.6     | 78.5
PointCNN [26]            | 65.4     | 88.1
G+RCU [10]               | 49.7     | 81.1
SGPN [39]                | 50.4     | –
RSNet [17]               | 56.5     | –
SPGraph [21]             | 62.1     | 85.5
DGCNN+ [41]              | 56.1     | 84.1
DGCNN                    | 54.5     | 83.6
GeoSSL                   | 59.1     | 86.3
IV-B More Results and Discussions
Table IV: Ablation on geometric properties for classification on the ModelNet40 (overall accuracy). For DGCNN [41] the properties are concatenated as additional input features; for GeoSSL they serve as self-supervision signals.

Methods               | DGCNN [41] | GeoSSL
xyz                   | 91.2       | 92.2
+ normal              | 91.7       | 92.5
+ curvature           | 91.4       | 92.3
+ normal + curvature  | 91.9       | 92.9
Ablation studies on geometric properties – An evaluation of combinations of different geometric properties is shown in Table IV. In DGCNN [41], geometric properties are concatenated as additional feature input, while our GeoSSL exploits them as self-supervision signals of an auxiliary task. We observe that all methods with geometric properties, either as input features or as self-supervision signals, boost classification performance, which supports our motivation: local geometric properties reveal rich local geometries of 3D semantic classes. Moreover, geometric properties as self-supervision signals (the right column) consistently perform better than as features (the middle column). The main reason is that our GeoSSL takes the form of multi-task learning, where self-supervision serves as an auxiliary task to regularize the learning of the main, supervised task. This is different from some alternatives, e.g. pre-training based self-supervision methods, where features are learned via self-supervision alone and subsequently used for supervised tasks. Given the large capacity of deep networks, GeoSSL regularizes feature learning (via self-supervised prediction of local geometric properties), reduces the potential of overfitting, and thus improves the generalization of the learned features for the supervised tasks. Moreover, combining normal and curvature is preferred as self-supervision in view of exploiting both first- and second-order geometric smoothness in point sets.
Table V: Normal estimation results (Cosine Similarity; larger is better).

Methods      | Cosine Similarity
DGCNN [41]   | 0.99
Fixed DGCNN  | 0.97
GeoSSL       | 0.99
Effects of learning geometric patterns in typical supervised semantic learning – We are interested in whether the features learned in supervised semantic learning on point clouds can be used to estimate geometric properties. To this end, we conduct an experiment on normal estimation to compare the following models: the first setting trains the DGCNN for normal estimation from scratch, denoted as DGCNN in Table V; the second setting is another DGCNN whose network parameters of the lower layers are taken from the DGCNN pretrained on the ModelNet40 for classification and fixed during training, while the other parameters in the higher layers are tuned (we denote it as Fixed DGCNN). The results are illustrated in Table V for a comparative purpose on the Cosine Similarity metric, which reveals the angle difference between the predicted normal and the ground truth normal, i.e. the larger its value, the better. We also illustrate the qualitative difference between our method and the DGCNN in Fig. 6, which shows that the proposed method predicts more accurate normals than its competitor. Quantitative comparisons of normal estimation errors can be found in Fig. 7. Table V, Fig. 6 and Fig. 7 all show that the Fixed DGCNN gains the worst performance in comparison with the DGCNN and GeoSSL. This implies that existing point cloud analysis methods with only semantic supervision labels pay little attention to whether the networks learn local geometric patterns. Our method with geometric self-supervised learning lets the two tasks benefit each other, capturing local geometric patterns to further augment semantic recognition.
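The Cosine Similarity metric reported in Table V can be sketched as follows; a minimal sketch assuming non-zero normal vectors, with the unoriented variant ignoring sign flips (the function name is ours).

```python
import numpy as np

def normal_cosine_similarity(pred, gt, oriented=False):
    """Mean cosine similarity between predicted and ground-truth normals
    (larger is better, max 1.0). For unoriented normals the sign of the
    prediction is ignored, since PCA normals have an arbitrary sign."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.sum(pred * gt, axis=1)
    if not oriented:
        cos = np.abs(cos)
    return float(cos.mean())

# toy usage: perfect predictions score 1.0 even when flipped (unoriented)
n = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
score = normal_cosine_similarity(-n, n)
```

The unoriented variant matches the discussion of oriented vs. unoriented normal losses in Sec. III-C.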
Table VI: Classification results across backbone models on the ModelNet40.

Methods                      | Overall Accuracy
PointNet++ [34]              | 90.7
PointCNN [26]                | 92.2
DGCNN+ [41]                  | 92.2
GeoSSL (PointNet++ backbone) | 91.7
GeoSSL (PointCNN backbone)   | 92.8
GeoSSL (DGCNN backbone)      | 92.9
Evaluation across CNN backbone models – Evaluation results on different CNN baselines (i.e. PointNet++ [34], DGCNN [41], and PointCNN [26]) are illustrated in Table VI. Evidently, our proposed methods consistently outperform their baseline models, which further confirms the effectiveness of auxiliary geometric learning in improving semantic point cloud recognition.
Table VII: Evaluation of multi-task learning architectures on the ShapeNet Part dataset.

Methods                 | Classification Accuracy (%) | Segmentation IoU (%)
DGCNN+ [41]             | 98.8                        | 85.1
MTNet                   | 98.9                        | 84.4
GeoSSL (classification) | 99.4                        | –
GeoSSL (segmentation)   | –                           | 85.7
Evaluation on multi-task learning architecture – To this end, we additionally conducted experiments on the ShapeNet Part dataset. The network architecture used here is the same as in Fig. 2; the only difference lies in the task setting. Comparison results are shown in Table VII, where we evaluate different options for combining two tasks in a multi-task learning framework. As can be seen from Table VII, when simply combining the classification and segmentation tasks in a multi-task manner (denoted as MTNet), the classification performance (98.9%) of the MTNet is only slightly better than its baseline DGCNN (98.8%), and its segmentation performance (84.4%) is even worse than the baseline. In contrast, our models with auxiliary fitting of geometric properties achieve superior results to the DGCNN and MTNet on both the classification and segmentation tasks, which further demonstrates that the performance gain of our method can be credited to the additional regression learning branch.
Evaluation on estimation of geometric properties – Figs. 3 and 4 visualize the normals and curvatures predicted by the proposed GeoSSL, respectively; the estimates are very close to the ground truth. Furthermore, when neural networks are trained on clean point sets, they can predict more accurate normals than those obtained by geometric computation, especially on noisy testing sets. This can be attributed to their capability to learn statistical regularities from the training data. For verification, we train a DGCNN based normal estimation network on clean training point sets from the ModelNet40; for testing, we add small Gaussian perturbations to each point of the testing instances (clean point sets are normalized in a unit sphere). Measured in the Cosine distance against GT normals, geometric computation produces a larger averaged error than our trained neural model, which verifies our claim that a learning based method trained on clean data can predict more accurate geometric properties.
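For context, the "geometric computation" baseline referred to above is typically PCA-based normal estimation: fit a local plane to each point's k-nearest neighbors and take the smallest-variance direction of the neighborhood covariance as the (unoriented) normal. The paper does not specify its exact implementation; below is a brute-force numpy sketch with O(n²) neighbor search, chosen for clarity over speed.

```python
import numpy as np

def estimate_normals_pca(points, k=16):
    """Classical PCA normal estimation on an (N, 3) point array:
    for each point, gather its k nearest neighbors and return the
    eigenvector of the neighborhood covariance with the smallest
    eigenvalue as the (unoriented, unit-length) surface normal."""
    n = points.shape[0]
    # pairwise squared distances (brute force, O(n^2) memory)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]  # neighbor indices (incl. the point itself)
    normals = np.empty_like(points)
    for i in range(n):
        nbrs = points[knn[i]]            # (k, 3) local neighborhood
        cov = np.cov(nbrs.T)             # 3x3 covariance of the neighborhood
        w, v = np.linalg.eigh(cov)       # eigenvalues in ascending order
        normals[i] = v[:, 0]             # smallest-variance direction
    return normals
```

On a perfectly planar patch this recovers the plane normal exactly; under the Gaussian perturbations described above, the estimate degrades, which is the sensitivity the comparison exploits.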
Setting (λ)     1       1e-1    1e-2    1e-3    1e-4    1e-5
Accuracy (%)    87.1    89.3    92.9    92.3    91.9    91.7
Evaluation on ratio between losses – In our classification settings, λ is an important parameter that determines the proportion between the two loss functions (i.e. the regression loss for fitting local geometries and the classification/segmentation loss). We hold out 20% of the training data as a validation set. We observe that the best trade-off parameter varies across network architectures and tasks, but when λ is set between 1e-3 and 1e-2, our model steadily performs well. As a result, we select either 0.01 or 0.001 for λ in our experiments. Specifically, Table VIII illustrates the trend of classification accuracy with varying λ on the ModelNet40 with GeoSSL; when λ = 1e-2, it reaches the best classification performance.
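The two-branch objective controlled by λ can be sketched as follows. The softmax cross-entropy for the semantic branch and the mean-squared error on normals for the geometric branch are illustrative assumptions; the text only specifies that λ weighs a geometric regression loss against the classification/segmentation loss.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Softmax cross-entropy over (N, C) class logits and (N,) int labels."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def joint_loss(logits, labels, pred_normals, gt_normals, lam=1e-2):
    """Semantic loss plus lambda-weighted geometric regression loss,
    mirroring the two-branch objective described in the text."""
    sem = cross_entropy(logits, labels)
    geo = ((pred_normals - gt_normals) ** 2).mean()  # assumed MSE on normals
    return sem + lam * geo
```

With λ around 1e-2, the geometric term acts as a regularizer on the shared features rather than dominating the semantic objective, consistent with the trend in Table VIII.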
Experiment Setting                 Accuracy (%)
Random initialization              92.5
Pretrained on the ShapeNetCore     92.9
Effects of pretraining with auxiliary data – To evaluate the effect of auxiliary data on pretraining, we pretrain the proposed model on the ShapeNetCore dataset. Results in Table IX show a moderate improvement of the pretrained model over an identical network with random initialization, which encourages us to adopt pretraining for boosting performance.
V Conclusion
This paper, for the first time, systematically introduces self-supervised learning into 3D point cloud semantic analysis. The method is generic: its backbone can readily be replaced with any other deep geometric learning network. Rather than employing geometric properties as additional feature input, our network utilizes them as auxiliary supervision signals, which consistently improves performance on semantic analysis. Given accurate privileged local shape information, our method can further be boosted to 93.5% mean classification accuracy on the ModelNet40.
Footnotes
 https://github.com/Necole123/GeoSSL
 Our implementation is slightly worse than the reported results in the original works.
References
[1] (2016) 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543.
[2] (2008) A method for automated registration of unorganised point clouds. ISPRS J. Photogramm. Remote Sens. 63 (1), pp. 36–54.
[3] (2018) Nesti-Net: normal estimation for unstructured 3D point clouds using convolutional neural networks. arXiv preprint arXiv:1812.00709.
[4] (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
[5] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
[6] (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
[7] (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2051–2060.
[8] (2018) General-purpose deep point cloud feature extractor. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1972–1981.
[9] (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 38 (9), pp. 1734–1747.
[10] (2017) Exploring spatial context for 3D semantic segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–724.
[11] (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3636–3645.
[12] (2018) Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5589–5597.
[13] (2018) PCPNet: learning local shape properties from raw point clouds. In Comput. Graph. Forum, Vol. 37, pp. 75–85.
[14] (2019) GeoNet: deep geodesic networks for point cloud analysis. arXiv preprint arXiv:1901.00680.
[15] (1992) Surface reconstruction from unorganized points. Vol. 26, ACM.
[16] (2013) Edge-aware point set resampling. ACM Trans. Graph. 32 (1), pp. 9.
[17] (2018) Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635.
[18] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[19] (2019) Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005.
[20] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[21] (2017) Large-scale point cloud semantic segmentation with superpoint graphs. arXiv preprint arXiv:1711.09869.
[22] (2014) Learning using privileged information: SVM+ and weighted SVM. Neural Netw. 53, pp. 95–108.
[23] (2017) CayleyNets: graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664.
[24] (2018) SO-Net: self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9397–9406.
[25] (2019) Orthogonal deep neural networks. IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1.
[26] (2018) PointCNN: convolution on X-transformed points. In Advances in Neural Information Processing Systems, pp. 828–838.
[27] (2015) Robotic online path planning on point cloud. IEEE T. Cybern. 46 (5), pp. 1217–1228.
[28] (2010) Voronoi-based curvature and feature estimation from point clouds. IEEE Trans. Vis. Comput. Graph. 17 (6), pp. 743–756.
[29] (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367.
[30] (2018) Self-supervised learning of geometrically stable features through probabilistic introspection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3637–3645.
[31] (2019) Learning graph embedding with adversarial training methods. IEEE T. Cybern., pp. 1–13.
[32] (2017) Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710.
[33] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
[34] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
[35] (2013) Learning to rank using privileged information. In Proceedings of the IEEE International Conference on Computer Vision, pp. 825–832.
[36] (2018) Tangent convolutions for dense prediction in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3887–3896.
[37] (2009) A new learning paradigm: learning using privileged information. Neural Netw. 22 (5), pp. 544–557.
[38] (2019) Deep cascade generation on point sets. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 3726–3732.
[39] (2018) SGPN: similarity group proposal network for 3D point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578.
[40] (2017) Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1329–1338.
[41] (2019) Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 38 (5), pp. 1–12.
[42] (2019) Geometry-aware generation of adversarial and cooperative point clouds. arXiv preprint arXiv:1912.11171.
[43] (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pp. 82–90.
[44] (2018) Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE T. Cybern. 49 (5), pp. 1791–1802.
[45] (2015) 3D ShapeNets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
[46] (2013) Privileged information-based conditional regression forest for facial feature detection. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6.
[47] (2018) FoldingNet: point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206–215.
[48] (2016) A scalable active framework for region annotation in 3D shape collections. ACM Trans. Graph. 35 (6), pp. 1–12.
[49] (2019) Part-aware fine-grained object categorization using weakly supervised part detection network. IEEE Trans. Multimedia.
[50] (2018) Graph convolutional network hashing. IEEE T. Cybern.