# PointWise: An Unsupervised Point-wise Feature Learning Network

###### Abstract.

The availability and plethora of unlabeled point-clouds, as well as their possible applications, make finding ways of characterizing this type of data appealing. Previous research focused on describing entire point-clouds representing an object in a meaningful manner. We present a deep learning framework to learn point-wise descriptions from a set of shapes without supervision. Our approach leverages self-supervision to define a relevant loss function to learn rich per-point features. We use local structures of point-clouds to incorporate geometric information into each point’s latent representation. In addition to using local geometric information, we encourage adjacent points to have similar representations and distant points to have dissimilar ones, creating a smoother, more descriptive representation. We demonstrate the ability of our method to capture meaningful point-wise features by clustering the learned embedding space to perform unsupervised part-segmentation on point clouds.

Journal: TOG


## 1. Introduction

The use of three-dimensional (3D) representations is ever-growing in applications such as computer vision, augmented reality, autonomous driving and many more. In this realm, point-clouds are an efficient representation from a learning perspective. They are compact, easy to manipulate and scale, and constitute the output of many modern 3D scanning devices. The use of point clouds, however, is technically challenging since point clouds are irregular and unordered. Significant progress has been made in the field of direct processing of point-clouds using neural networks, with architectures such as PointNet (Qi et al., 2017a) and PointNet++ (Qi et al., 2017b) that use global pooling operations to bypass the lack of order, or with an architecture that transforms the points into a canonical regular configuration that allows performing convolutions (Li et al., 2018a). These networks are trained with supervision, yielding global task-specific features per point cloud.

As large as labeled datasets may be, the vast majority of available data remains unlabeled and so cannot be utilized by the networks described above. To overcome this obstacle, one can utilize unsupervised learning. Moreover, learning highly descriptive features without supervision leads to more generic features that can be used in various applications, rather than being task-specific.

In this paper, we present the first network that learns meaningful point-wise features with no supervision. We show that the point-wise features carry semantic meaning and can therefore be used to achieve a meaningful segmentation of the dataset.

We train a network to learn point-wise features without supervision. Our approach leverages architectures which compute internal per-point features. The training is self-supervised, where the losses are derived from the geometry of the training set itself.

To learn a meaningful latent space we base our losses on two main assumptions: The first is that to achieve a semantically meaningful embedding, the local geometry around each point should be taken into account. Therefore we incorporate reconstruction of local patches as part of the loss term. The second assumption is that the latent space learned should be smooth, i.e., that generally nearby points in the original 3D space should also have a similar embedding. Distant points should have a different embedding in the latent space, unless they have similar local geometry.

We show that the point-wise features learned by our network are highly descriptive and carry more meaningful information than the bare coordinates themselves. Figure 1 shows an example of a chair colored according to the Cartesian coordinates and according to the 3D PCA of our learned representation. The learned representation is more descriptive and learns meaningful attributes of the chair, such as embedding the two armrests in a similar way, while still keeping a smooth transition between the representations of close points.

We demonstrate the competence of the point-wise features through segmentation. Our network can be used to segment both objects seen by the network during training and unseen objects.

## 2. Related Work

### 2.1. Neural Point-cloud processing

In recent years, neural networks for point-cloud processing have been rapidly emerging. PointNet (Qi et al., 2017a) is a pioneering work, which presents an architecture that elegantly tackles the unordered nature of point-clouds. PointNet calculates point-wise descriptors using a shared multi-layer perceptron (MLP), followed by a symmetric, channel-wise pooling function (specifically, max pooling) on all these descriptors simultaneously. PointNet is robust to the irregularity of the data by default, as it does not directly take into account any relations between points.

More evolved solutions were later suggested. In PointNet++ (Qi et al., 2017b), the authors apply PointNet recursively on neighboring sub-sets at various scales in order to produce valuable local information for varying model densities. Wang et al. (2018) introduce a new layer named EdgeConv to generalize the regular-grid convolution operator and capture local geometric features of point-clouds. Li et al. (2018a) propose PointCNN, an architecture that aspires to bring neighboring point subsets to a canonical form before applying a standard weight-based convolution. Atzmon et al. (2018) find a way to apply convolutional neural nets on point-clouds as well, by mapping point cloud functions to volumetric functions and vice versa.

All the above suggest different methods of aggregating local-neighborhood based features into some reference point. These architectures are invariant to point-order and robust to data irregularity. They are all trained to perform classification and segmentation in a supervised manner, using a loss based on comparison between the estimated scores and a given ground-truth. Consequently, they don’t take advantage of most of the available data, which is unlabeled.

### 2.2. Unsupervised feature learning

Unsupervised feature learning is an attractive approach since it leverages the abundance of available, unlabeled data. To produce a semantically meaningful latent-space without relying on annotations, networks are trained to perform tasks based on some context derived from the data itself. The assumption is that such tasks encourage the learning of rich, descriptive features. A notable use of such an approach was presented by Doersch et al. (2015). They train a network to predict the relative positions of image patches in order to encourage the learning of semantically meaningful features. They show that with these features it is possible to cluster large, unlabeled datasets into classes.

Danon et al. (2018) use a triplet margin loss on image patches to establish a latent space in which semantically similar patches are adjacent, and semantically different patches are not.

In the realm of point-clouds, Achlioptas et al. (2018), Deng et al. ([n. d.]), Yang et al. (2018) and Li et al. (2018b) train various architectures to learn an embedding for whole sets of points. That is, the entire shape represented by a point-cloud is represented by a single vector in latent space. Similarly, Sun et al. (2018) train a network to generate new point clouds by training on existing point-clouds without annotation. In contrast to previous methods, we learn a point-wise representation.

Guerrero et al. (2018) train a network to estimate local 3D shape geometric properties such as normals. In comparison, we use local shape properties to achieve meaningful point-wise features.

## 3. Unsupervised Point-Wise Feature Learning

The goal of this work is to learn a meaningful unsupervised representation of each point in a point cloud. To that end, we base our network on existing architectures used for point cloud processing and extraction of point-wise features. However, contrary to previous methods which use labeled data for training, we use unlabeled data. To make this possible, we add layers to the network and change the training loss function. A schematic illustration of our architecture can be seen in Figure 2.

In order to produce a descriptive latent-space in an unsupervised manner, our architecture includes a sense of "self-supervision" as a means to define a relevant loss. The chosen loss terms encourage the network to learn a semantically meaningful point-wise description. We formulate a loss which yields a latent space that is geometrically descriptive on one hand, and context-aware on the other. In the following section we describe the network architecture and the different loss terms used.

Let us define some key notations. Given a point-cloud $S$ comprised of points $\{p_i\}_{i=1}^{N}$, we denote the latent-space descriptor of the point $p_i$ with $f_i$. The set of all latent-space descriptors for the points in $S$ is denoted as $F$.

### 3.1. Feature Extractor Network

To achieve a feature representation which has a contextual meaning, the features representing each point should not depend solely on the point’s coordinates, but on its context in a broader area. Therefore, the feature extractor network receives an entire point cloud (i.e., the full point set $S$) as input and outputs a set of point-wise feature vectors (i.e., the descriptor set $F$) which also depend on the entire shape.

In this paper, we use a modified, untrained version of the PointNet segmentation network (Qi et al., 2017a) as a feature extractor network. However, other more complex networks such as PointNet++ (Qi et al., 2017b) or PointCNN (Li et al., 2018a) can be used as well.

The network is comprised of a point-wise feature extractor which is then used to create a single, global point-cloud feature. For a given input point-cloud, every point passes through an MLP that lifts its dimension to extract local features. Next, a global, channel-wise pooling layer is applied on all point-wise feature vectors. The pooling layer’s output passes through two fully connected layers to produce a single global feature of the entire point-cloud. The global feature is then concatenated to all the point-wise features to produce new feature vectors which contain information about the global structure. The resulting feature vectors pass through a second MLP to create the extracted point-wise features.
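As a rough illustration of the data flow described above, the following NumPy sketch runs a forward pass with random, untrained weights. The layer widths, the ReLU activations, the use of plain max pooling, and all function names are assumptions made for the example rather than the configuration used in this work; only the 100-dimensional output mirrors the descriptor size reported in Section 4.1.

```python
import numpy as np

def init_layers(rng, sizes):
    """Random (untrained) parameters for a stack of linear layers; widths are hypothetical."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def shared_mlp(x, layers):
    """Apply the same linear + ReLU layers to every row (point) of x."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

def extract_pointwise_features(points, params):
    """points: (N, 3) array. Returns an (N, D) array of point-wise descriptors."""
    local = shared_mlp(points, params["local"])                  # (N, C) per-point features
    pooled = local.max(axis=0)                                   # (C,) channel-wise pooling
    global_feat = shared_mlp(pooled[None, :], params["fc"])[0]   # (G,) global feature
    tiled = np.broadcast_to(global_feat, (len(points), global_feat.size))
    fused = np.concatenate([local, tiled], axis=1)               # (N, C + G)
    return shared_mlp(fused, params["head"])                     # (N, D)

rng = np.random.default_rng(0)
params = {
    "local": init_layers(rng, [3, 64, 128]),          # first shared MLP
    "fc": init_layers(rng, [128, 128, 64]),           # two fully connected layers
    "head": init_layers(rng, [128 + 64, 128, 100]),   # second shared MLP
}
cloud = rng.standard_normal((256, 3))
features = extract_pointwise_features(cloud, params)
print(features.shape)
```

Because every step is either applied per point or is a symmetric pooling over all points, reordering the input points only reorders the rows of the output.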

### 3.2. Loss Terms

Our loss function consists of two components:

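$$\mathcal{L} = \alpha\,\mathcal{L}_{rec} + \beta\,\mathcal{L}_{smooth} \qquad (1)$$

where $\mathcal{L}_{rec}$ is the reconstruction loss, $\mathcal{L}_{smooth}$ is the smoothness loss, and $\alpha$, $\beta$ are their respective weights.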

The first loss term encodes geometrically meaningful information into the latent space by incorporating information about the local environment of each point into the point’s representation. The second loss term uses a triplet margin loss to create a similar embedding for close-by points in the original shape and a different representation for distant points, thus creating a continuous representation.

Examples of the final embedding achieved by our method can be seen in Figures 4 and 5. The figures show point-clouds colored according to the three-dimensional PCA values for each point. Each RGB component corresponds to the projection of the point descriptors on one PCA component and is normalized separately to values between 0 and 1. As can be seen in the figures, points with similar geometric meaning receive similar point-wise features and are therefore shown in similar colors in the image. In Figure 5 we show coloring of point-clouds from the same class according to the projections of the point-wise features on their joint PCA components. In this example, semantically similar parts share similar colors in all the different point-clouds, e.g., all the chair-arms are colored in magenta. This illustrates that similar parts of objects are embedded closer together in the latent space.
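The coloring procedure can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that PCA is computed via an SVD of the mean-centered descriptors; the function name `pca_colors` is hypothetical and the random input merely stands in for learned features.

```python
import numpy as np

def pca_colors(features):
    """Map (N, D) descriptors to (N, 3) RGB values in [0, 1] via a 3D PCA projection."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T                  # (N, 3)
    lo = projected.min(axis=0)
    span = projected.max(axis=0) - lo
    span[span == 0] = 1.0                            # guard against a constant channel
    return (projected - lo) / span                   # each channel normalized separately

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((500, 100))        # placeholder for learned descriptors
colors = pca_colors(descriptors)
```

For a joint coloring of several shapes, as in Figure 5, one option is to stack the descriptors of all shapes into a single array before calling the function, so that all shapes share the same principal directions.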

Reconstruction loss: The key idea of the reconstruction loss term is that some of the semantically meaningful information regarding each point is derived from its local environment. Points that share similar geometries of their surrounding areas are more likely to have a similar semantic meaning. Therefore, if the feature representation of each point includes information about its local environment, the point-wise representation is likely to be more semantically meaningful. We include such information by using a reconstruction loss on the surrounding area of each point.

Specifically, from each feature representation in the latent space we reconstruct a set of points surrounding the original point as can be seen in Figure 6. Thus, the feature representation of each point must contain information that is relevant not only to that point’s coordinates but to its surrounding area as well. The reconstruction loss is calculated using chamfer distance between the reconstructed environment and the original environment as derived from the input itself:

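$$\mathcal{L}_{rec} = \sum_{p_i \in S} d_{CH}\big(\hat{E}(f_i),\, E(p_i)\big) \qquad (2)$$

where $E(p_i)$ is the original environment of the point $p_i$ in the point-cloud $S$, $\hat{E}(f_i)$ is the environment reconstructed from its descriptor $f_i$, and $d_{CH}$ is the chamfer distance.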

The environment can be defined in many different ways. In this paper we chose to use kNN selected according to Euclidean distance.
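To make the reconstruction term concrete, here is a small NumPy sketch of kNN patch extraction and a chamfer-based loss. Several details are assumptions for illustration only: the chamfer distance is taken in its common symmetric form with squared distances, each patch includes the query point itself, the per-point distances are summed, and the function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between the rows of a (M, 3) and b (K, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)

def knn_patches(points, k):
    """For every point, gather its k nearest points (itself included): (N, k, 3)."""
    d = pairwise_sq_dists(points, points)
    idx = np.argsort(d, axis=1)[:, :k]
    return points[idx]

def chamfer_distance(a, b):
    """Symmetric chamfer distance: mean nearest-neighbor squared distance in both directions."""
    d = pairwise_sq_dists(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def reconstruction_loss(reconstructed, target):
    """Sum of chamfer distances over all per-point patches; both inputs are (N, k, 3)."""
    return float(sum(chamfer_distance(r, t) for r, t in zip(reconstructed, target)))

rng = np.random.default_rng(0)
points = rng.random((200, 3))
target = knn_patches(points, k=20)                   # ground-truth environments
noisy = target + 0.05 * rng.standard_normal(target.shape)
print(reconstruction_loss(target, target), reconstruction_loss(noisy, target))
```

Since the chamfer distance compares sets, the order in which a patch's points are predicted does not affect the loss.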

Smoothness loss: This term regularizes the reconstruction loss and contributes to the smoothness of the latent space. Points which are close in the original 3D space should have a similar embedding in the latent space, while points which are far away in the original 3D space should usually have a different embedding in the latent space. This rule of thumb should be broken when the local geometry surrounding nearby points changes dramatically. In that case the points should have a different embedding in the latent space due to the reconstruction loss. To achieve an embedding which sustains this quality we use a triplet margin loss to bring representations of adjacent points in the original 3D point-cloud closer together and drive representations of distant points farther apart. To that end we use a variation of kNN, common-kNN: two points are considered a positive pair if the number of overlapping members in both points’ k nearest neighbors passes some pre-defined threshold, $t$. More formally, if $N_k(p)$ is the set of k nearest points to point $p$, then $p_i, p_j$ are common-kNN if:

$$\left| N_k(p_i) \cap N_k(p_j) \right| \geq t$$

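The common-kNN requirement is stricter than the regular kNN and therefore prevents some of the outlier connections between points. For each point in the point-cloud we randomly choose a set of triplets to be used in the triplet loss. We form each triplet by randomly choosing a positive point out of the common-kNN and a negative point out of the farthest points. The positive pairs and the negative pairs for a point $p_i$ are defined as $(f_i, f_i^{+})$ and $(f_i, f_i^{-})$, where $f_i$ is the descriptor of $p_i$ and $f_i^{+}$, $f_i^{-}$ are the descriptors of its positive and negative points. The triplet loss is defined as follows:

$$\mathcal{L}_{smooth} = \frac{1}{N}\sum_{i=1}^{N} \max\big(d(f_i, f_i^{+}) - d(f_i, f_i^{-}) + m,\; 0\big) \qquad (3)$$

where: $d(x, y) = \lVert x - y \rVert_2$, $m$ is the margin, and $N$ is the number of points in the point-cloud.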
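The pair selection and the triplet loss described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: distances in the latent space are Euclidean, the loss takes the usual hinge form and is averaged over all sampled triplets, positives are drawn from every point whose neighbor overlap reaches the threshold, and points without any positive candidate are skipped. All function and parameter names are hypothetical.

```python
import numpy as np

def neighbor_order(points):
    """Row i lists all point indices sorted by distance from point i (i itself first)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, -1.0)                 # force each point to sort first in its own row
    return np.argsort(d, axis=1)

def shared_neighbor_counts(order, k):
    """counts[i, j] = number of points that are among the k nearest neighbors of both i and j."""
    n = len(order)
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], order[:, 1:k + 1]] = 1
    return member @ member.T

def sample_triplets(points, k, threshold, n_far, per_point, rng):
    """Return an (M, 3) array of (anchor, positive, negative) indices."""
    order = neighbor_order(points)
    shared = shared_neighbor_counts(order, k)
    np.fill_diagonal(shared, 0)               # a point is never its own positive
    triplets = []
    for i in range(len(points)):
        positives = np.flatnonzero(shared[i] >= threshold)   # common-kNN of point i
        if positives.size == 0:
            continue
        negatives = order[i, -n_far:]                        # the n_far most distant points
        for _ in range(per_point):
            triplets.append((i, rng.choice(positives), rng.choice(negatives)))
    return np.array(triplets, dtype=int).reshape(-1, 3)

def triplet_margin_loss(features, triplets, margin):
    """Mean hinge loss over triplets, using Euclidean distances between descriptors."""
    a, p, n = triplets.T
    d_pos = np.linalg.norm(features[a] - features[p], axis=1)
    d_neg = np.linalg.norm(features[a] - features[n], axis=1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

rng = np.random.default_rng(0)
points = rng.random((200, 3))
triplets = sample_triplets(points, k=20, threshold=10, n_far=50, per_point=3, rng=rng)
features = rng.standard_normal((200, 8))      # placeholder for learned descriptors
print(triplet_margin_loss(features, triplets, margin=2.0))
```

A degenerate embedding that maps every point to the same descriptor yields a loss equal to the margin, which illustrates how this term pushes the network away from collapsing all points together.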

As well as creating a smoother latent space representation along the shape, the smoothness loss also contributes to the convergence of the reconstruction loss. By requiring close points to have a similar embedding, we also encourage close points to yield similar reconstruction. Furthermore, by demanding distant points to have a different representation we encourage the data to have more divergence in the latent space and therefore also more divergence in the reconstructed environment of each point. These two attributes prevent the reconstruction loss from converging into some local minimum of the reconstruction loss function and reconstructing the same generic patch for all points.

For a complete overview of our method, see Figure 3. The figure shows the different outputs of the network - the extracted features, examples of reconstructed patches and the part-segmentation results.

## 4. Results and Evaluation

### 4.1. Implementation Details

Our architecture is based heavily on the segmentation network proposed in (Qi et al., 2017a) and follows the main principles and properties demonstrated in it in order to extract point-wise features for a given input model. The main difference lies in the global pooling operation, which, in our network, is comprised of maximum, mean and variance operations on the point-wise feature tensor, as opposed to the max-pooling applied in PointNet. We found this modification leads to a smoother convergence of the training objectives. We also employ an additional MLP for the purpose of reconstructing a local patch from every point-descriptor.
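A minimal sketch of such a pooling layer is shown below. How the three statistics are combined is not specified here, so concatenating them into a single vector is an assumption of this example, as is the function name `global_pool`.

```python
import numpy as np

def global_pool(point_features):
    """Channel-wise max, mean and variance of an (N, C) array, concatenated into (3C,)."""
    return np.concatenate([
        point_features.max(axis=0),
        point_features.mean(axis=0),
        point_features.var(axis=0),
    ])

feats = np.random.default_rng(0).standard_normal((64, 16))
pooled = global_pool(feats)
print(pooled.shape)
```

All three statistics are symmetric functions of the point set, so the pooled vector stays invariant to the order of the input points, just like max pooling alone.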

We use a reconstruction loss weight of 100, and a smoothness loss weight of 0.1. The number of nearest neighbors used to create the ground truth patches for the reconstruction is set to 100. For each point we construct three triplets to be processed by the smoothness loss. For each triplet we randomly choose a positive example out of the corresponding point’s 20 nearest neighbors and a negative example out of the 200 most distant points using the $\ell_2$ norm. The margin of the smoothness loss is set to 2. We trained the network for 3 epochs, with a learning rate of 0.001 and a decay of 0.7. We produce 100-dimensional descriptors.
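For reference, the settings above can be gathered into a single configuration object. The sketch below only restates the reported values; the field names and the helper `total_loss` are hypothetical, and the schedule on which the decay is applied is left unspecified because it is not stated here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    rec_weight: float = 100.0       # weight of the reconstruction loss
    smooth_weight: float = 0.1      # weight of the smoothness loss
    patch_size: int = 100           # nearest neighbors per ground-truth patch
    triplets_per_point: int = 3
    positive_pool: int = 20         # nearest neighbors considered for positives
    negative_pool: int = 200        # most distant points considered for negatives
    margin: float = 2.0
    epochs: int = 3
    learning_rate: float = 1e-3
    decay: float = 0.7              # decay as reported; its schedule is not specified
    descriptor_dim: int = 100

def total_loss(rec_loss, smooth_loss, cfg=TrainingConfig()):
    """Weighted sum of the two loss terms."""
    return cfg.rec_weight * rec_loss + cfg.smooth_weight * smooth_loss

print(total_loss(0.5, 2.0))
```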

### 4.2. Reconstruction Results

In Figure 6 we show some of the patches reconstructed by our network with the original point-cloud in the background for comparison. As can be seen in the figure, the reconstructed patches reproduce various planes and local structures of the original point-clouds.

### 4.3. Segmentation Results

To demonstrate the quality of the learned point-wise features, we use them to present unsupervised part-segmentation. The goal is to separate each shape into meaningful parts. Supervised methods formulate the problem of part-segmentation as a point-wise classification problem where the classes correspond to the different segments. Since our method is completely unsupervised, instead of formulating the part-segmentation problem as a classification problem, we formulate it as a clustering problem where the number of segments in each object is the number of clusters.

To that end, we first train our network on the entire ShapeNet-part dataset (Yi et al., 2016). We use the entire dataset for training and not only one class at a time to make the network generalize better to the different parts of the objects.

There are two possible ways to perform the segmentation. The first is co-segmentation, where the segmentation is performed on an entire class of objects simultaneously; the second is segmenting each object individually. We chose to use the second type. For a given number of segments, we use a grid search over the parameters of spectral clustering and choose the final segmentation according to the clustering that achieves the highest silhouette score:

$$s = \frac{b - a}{\max(a, b)} \qquad (4)$$

where, for a given point, $a$ is the average distance from all other point features in the same segment and $b$ is the smallest average distance from all point features in each of the other segments. We compute this score twice, using distances in the latent space and in the Cartesian coordinate space. We average the two scores to produce our chosen segmentation.
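The selection step can be illustrated with a plain NumPy implementation of the silhouette score. This sketch assumes the standard per-point silhouette coefficient averaged over all points, at least two segments per labeling, and a list of candidate labelings that would in practice come from the spectral clustering grid search (not shown). The names `silhouette_score` and `pick_segmentation` are hypothetical.

```python
import numpy as np

def silhouette_score(x, labels):
    """Mean silhouette coefficient of a labeling of the rows of x (needs >= 2 segments)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                                   # singleton segment: score 0 by convention
        a = d[i, same].sum() / (same.sum() - 1)        # excludes the zero distance to itself
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

def pick_segmentation(candidates, features, coords):
    """Choose the labeling with the highest silhouette score averaged over both spaces."""
    def combined(labels):
        return 0.5 * (silhouette_score(features, labels) + silhouette_score(coords, labels))
    return max(candidates, key=combined)

x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])      # groups the two nearby pairs
bad = np.array([0, 1, 0, 1])       # mixes points from both pairs
chosen = pick_segmentation([bad, good], x, x)
```

In this toy example the labeling that groups nearby points scores close to 1, while the mixed labeling scores below 0, so the former is selected.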

As we are not aware of any other methods which extract point-wise features without any supervision, we compare our results to a naive approach: performing the segmentation using each point’s coordinates as features.

#### 4.3.1. Qualitative Evaluation

In Figure 7 we present qualitative results of our method and of the segmentation on the points’ coordinates. The objects shown in the figure have not been seen by the network during training. The segmentation using our method is visibly better than the segmentation applied on the points’ coordinates alone. The network learns the symmetry of each object. For example, the wings of an airplane or the armrests of a chair are segmented together, unlike in the segmentation performed on the coordinates alone. Some further results can be seen in Figure 8.

#### 4.3.2. Quantitative Evaluation

The average accuracy for part segmentation using the features extracted by our method is 70.1%, compared to an average accuracy of 61.2% using the Cartesian coordinates. Most of the resulting improvement in our method is reflected by changes in the segmentation of small parts such as armrests or earplugs. Therefore the improvement is seen more clearly when viewing the objects than through the accuracy results.
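Since clustering produces arbitrary segment indices, computing an accuracy requires matching clusters to ground-truth parts. How this matching is done is not stated here; the sketch below shows one common choice, assumed purely for illustration: a brute-force search for the one-to-one assignment that maximizes the number of correctly labeled points. It assumes there are no more clusters than ground-truth parts and that the number of parts is small; the function name is hypothetical.

```python
from itertools import permutations
import numpy as np

def best_match_accuracy(pred, gt):
    """Accuracy under the best one-to-one assignment of clusters to ground-truth parts."""
    pred_ids, gt_ids = np.unique(pred), np.unique(gt)
    # overlap[i, j] = number of points in predicted cluster i and ground-truth part j
    overlap = np.array([[np.sum((pred == p) & (gt == g)) for g in gt_ids]
                        for p in pred_ids])
    rows = np.arange(len(pred_ids))
    best = max(overlap[rows, list(perm)].sum()
               for perm in permutations(range(len(gt_ids)), len(pred_ids)))
    return best / len(gt)

pred = np.array([1, 1, 0, 0, 0])   # cluster indices from an unsupervised method
gt = np.array([0, 0, 1, 1, 0])     # ground-truth part labels
accuracy = best_match_accuracy(pred, gt)
```

In this example, the best assignment maps cluster 1 to part 0 and cluster 0 to part 1, labeling four of the five points correctly.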

Some of the classes in ShapeNet can be semantically segmented in a few different ways. For example, all four legs of a chair can be segmented into one part, or they could be segmented into two parts: front and back legs. Both divisions make sense. Some of the degradation in accuracy results stems from such cases, where the match between the ground truth segmentation and the point-wise features segmentation is poor. However, this does not necessarily mean the resulting segmentation is incorrect. A few examples where this happens are shown in Figure 9.

For the part-segmentation results we used the number of segments each object contains as an input to the clustering algorithm. In two classes out of ShapeNet-part, tables and guitars, we used a different number of segments from the ground truth: two segments instead of three. The reason is that the third segment in each of them can be considered as part of another segment, especially when the training is performed without supervision. Examples of objects from these two classes with three segments and our segmentation into two segments can be seen in Figure 9.

## 5. Conclusion

We presented a method to extract meaningful point-wise representations for point-clouds. Unlike previous works, our method requires no access to labeling of the points or the entire point-clouds. The key idea is defining objectives that allow deep neural networks to learn descriptive features using only self-supervision. We use two types of loss terms which, when combined, enable the network to learn a semantically meaningful representation for each point. The first is a local patch reconstruction loss which is guided by the notion that the structure of the local environment surrounding each point plays an important role in characterizing it. The second loss term relies on the idea that close points in the 3D point-cloud should have a similar representation while points which are distant will usually have a different representation. By incorporating this triplet loss into the loss term, the transition between the embeddings of close points becomes smoother, while the distinction between distant points’ features leads to a more diverse latent space. Our framework is demonstrated here using a very basic feature extractor, but can be implemented using any architecture aimed at extracting point-wise features from point-clouds. We predict that incorporating a more advanced neural network with local-environment convolution ability into our framework will produce an even richer, more descriptive embedding.

We have shown an implementation of the embedding learned for part-segmentation of point-clouds. In the future we would like to explore other applications such as finding point correspondence between shapes and shape completion.

In general, we believe that finding better, more expressive ways to describe points in a set of shapes or scans is an important direction of research. In addition to improving our method, we would like to incorporate semi-supervision to achieve better results.

## References

- Achlioptas et al. (2018) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning Representations and Generative Models for 3D Point Clouds. (2018).
- Atzmon et al. (2018) Matan Atzmon, Haggai Maron, and Yaron Lipman. 2018. Point Convolutional Neural Networks by Extension Operators. arXiv preprint arXiv:1803.10091 (2018).
- Danon et al. (2018) Dov Danon, Hadar Averbuch-Elor, Ohad Fried, and Daniel Cohen-Or. 2018. Unsupervised Natural Image Patch Learning. arXiv preprint arXiv:1807.03130 (2018).
- Deng et al. ([n. d.]) Haowen Deng, Tolga Birdal, and Slobodan Ilic. [n. d.]. Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors. ([n. d.]).
- Doersch et al. (2015) Carl Doersch, Abhinav Gupta, and Alexei A Efros. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision. 1422–1430.
- Guerrero et al. (2018) Paul Guerrero, Yanir Kleiman, Maks Ovsjanikov, and Niloy J Mitra. 2018. PCPNet: Learning Local Shape Properties from Raw Point Clouds. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 75–85.
- Li et al. (2018b) Jiaxin Li, Ben M Chen, and Gim Hee Lee. 2018b. SO-Net: Self-Organizing Network for Point Cloud Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9397–9406.
- Li et al. (2018a) Yangyan Li, Rui Bu, Mingchao Sun, and Baoquan Chen. 2018a. PointCNN. arXiv preprint arXiv:1801.07791 (2018).
- Qi et al. (2017a) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1, 2 (2017), 4.
- Qi et al. (2017b) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems. 5099–5108.
- Sun et al. (2018) Yongbin Sun, Yue Wang, Ziwei Liu, Joshua E Siegel, and Sanjay E Sarma. 2018. PointGrow: Autoregressively Learned Point Cloud Generation with Self-Attention. arXiv preprint arXiv:1810.05591 (2018).
- Wang et al. (2018) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2018. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829 (2018).
- Yang et al. (2018) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. 2018. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Vol. 3.
- Yi et al. (2016) Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, Leonidas Guibas, et al. 2016. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG) 35, 6 (2016), 210.