PointPoseNet: Accurate Object Detection and 6 DoF Pose Estimation in Point Clouds


We present a learning-based method for 6 DoF pose estimation of rigid objects in point cloud data. Many recent learning-based approaches use primarily RGB information for detecting objects, in some cases with an added refinement step using depth data. Our method consumes unordered point sets, with or without RGB information, from initial detection to the final transformation estimation stage. This allows us to achieve accurate pose estimates, in some cases surpassing state-of-the-art methods trained on the same data.


Frederik Hagelskjær, Anders Glent Buch
SDU Robotics, University of Southern Denmark


Index Terms: Deep learning, point clouds, pose estimation

1 Introduction

Deep learning has reigned supreme for almost a decade now, with applications in a number of domains, perhaps most notably computer vision. A key challenge for visual processing systems is detection and accurate localization of rigid objects, allowing for a number of applications, including robotic manipulation. The most mature branch of deep learning is arguably CNN-based models, which are now applied in a variety of contexts. For object detection, CNNs are currently providing the best building blocks for learning-based detection systems in 2D, where the relevant objects are localized by a bounding box. More recently, full 6 DoF pose estimation systems have been demonstrated using similar techniques.

Some methods rely on RGB inputs only to reconstruct the full 6 DoF object pose. In [3] an uncertainty-aware regression method provides dense predictions of 3D object coordinates in the input image. Only keypoints are detected in [13] and [16], in the latter case using a YOLO-like [14] backbone. In [19] a prediction is made for the object center, followed by a regression towards the rotation. The method in [10] takes a middle-ground approach where a semi-dense set of keypoints is predicted over the image.

Methods relying on RGB-D data often augment an initial RGB-based detection with a refinement step using the depth information. [2] again casts dense pixel-wise votes for 3D object coordinates, but using RGB-D inputs. In [6], an SSD-like detector [8] provides initial detections, which are refined using depth information. Initial segmentations are used in [17] and finally the pose is refined using PointNet [11].

Some of these works use point cloud data, but not in the global detection phase, rather as a component during refinement. Our work, on the other hand, uses raw, unordered 3D point cloud data from end to end, i.e. both during initial detection of candidate regions of the scene and during actual transformation estimation and refinement. Our method relies on a PointNet backbone [11] to achieve this, and we show that in many of the tested circumstances, this significantly increases detection rates and localization accuracy during full 6 DoF pose estimation of objects.

We describe here a method for training a joint classification and segmentation network to directly propose detections and perform point-to-point correspondence assignments between an observed scene point cloud and a known 3D object model. These correspondences are passed to a voting algorithm that performs density estimates directly in the 6-dimensional pose parameter space. We also show how the addition of color coordinates to the input point cloud dramatically increases the accuracy of our method.

2 Method

This section gives a full description of our method. We start with a description of all the processing stages of our pipeline during inference, followed by a detailed description of the training process.

Figure 1: Visualization of the pose estimation process. The original point cloud is sub-sampled. For each of these anchor points a sphere of neighbors is collected. Using PointNet, we first predict the 16 anchors most likely to include the object. For all neighbor points we predict their corresponding model point by a segmentation branch. All model point correspondences are used to vote for the pose, which is then finally refined with ICP on the points found to belong to the object.

2.1 Inference pipeline

Object representation: The goal of our pipeline is to provide the full 6 DoF pose of an object, in our case provided as either a CAD model or as a point cloud. To aid the training process, we represent this model by a very low number of keypoints, uniformly sampled on the object surface with an empirically chosen distance of 2.5 cm. A voxel grid achieves this downsampling very efficiently [15], and we end up with a number of keypoints between approx. 50 and a few hundred, depending on the size of the object. In the top part of Fig. 1 we show an example of a model next to a coloring of the model points according to the nearest keypoint.
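The keypoint extraction above can be sketched with a simple voxel-grid downsampler. This is an illustrative NumPy version (the paper uses PCL's implementation); the centroid-per-voxel choice and the toy input are our own assumptions:

```python
import numpy as np

def voxel_downsample(points, voxel=0.025):
    """Keep one representative point (the voxel centroid) per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)   # accumulate per-voxel coordinate sums
    np.add.at(counts, inverse, 1)      # and per-voxel point counts
    return sums / counts[:, None]

# Toy "object surface": a 10 cm line sampled with 1000 points.
surface = np.linspace([0.0, 0.0, 0.0], [0.1, 0.0, 0.0], 1000)
keypoints = voxel_downsample(surface, voxel=0.025)  # roughly one keypoint per 2.5 cm
```

With a real mesh one would sample the surface densely first and then downsample, so the keypoint count scales with object size as described above.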

Input preprocessing: The input to our system during inference is a novel scene, provided as an unordered point cloud containing XYZ and, when applicable, RGB coordinates. The first stage is a binary classifier, aimed at determining candidate points in the scene where the searched object is located. To limit the data, the scene is uniformly downsampled—again with an empirically chosen 2.5 cm point spacing—to approx. 3000-5000 anchor points, depending on the size of the scene. We show a visualization of this in the middle of Fig. 1.

Proposal via classification: The thousands of uniform anchor points are each seen as candidate positions for the center of the searched object. In order to detect promising positions, we use a binary PointNet classifier [11] to estimate the probability of the presence of the object at each anchor point. More specifically, we sample 2048 scene points in the spherical neighborhood around an anchor point and pass these to a PointNet with a single logistic output neuron. The choice of the number of point neighbors is a trade-off between speed and accuracy, and we have observed a saturation in accuracy around the chosen 2048 points. In the bottom part of Fig. 1 we show an example of such a neighborhood.
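A minimal sketch of the neighborhood extraction around an anchor, assuming a k-d tree radius query and sampling with replacement when a sphere holds fewer than 2048 points (the paper does not specify how undersized neighborhoods are handled; the centering on the anchor follows the training description in Sec. 2.2):

```python
import numpy as np
from scipy.spatial import cKDTree

def sample_neighborhood(scene, anchor, radius, n_points=2048, seed=None):
    """Crop a sphere of scene points around an anchor and return a fixed-size,
    anchor-centered point set suitable as PointNet input."""
    rng = np.random.default_rng(seed)
    idx = np.asarray(cKDTree(scene).query_ball_point(anchor, r=radius))
    # Sample with replacement only if the sphere holds fewer than n_points.
    choice = rng.choice(idx, size=n_points, replace=len(idx) < n_points)
    return scene[choice] - anchor

scene = np.random.default_rng(0).uniform(-1.0, 1.0, size=(10000, 3))
patch = sample_neighborhood(scene, anchor=np.zeros(3), radius=0.3, seed=0)
```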

Correspondences via segmentation: The classifier allows us to greatly prune the search space by only considering the top predictions of the object-specific classifier. At this stage we process only the most promising anchor points and their neighborhoods. The objective now is to associate each of the 2048 points around the anchor with either the background or the corresponding point on the given object surface. With the object reduced to K keypoints, we invoke a (K+1)-way point-wise semantic segmentation network (again inspired by PointNet) to label each point as either background or one of the K object keypoints. An example of a labeling is shown in the bottom of Fig. 1, where black corresponds to a background prediction. We perform segmentation only on the 16 top scoring anchor points.

6 DoF pose estimation from correspondences: The outcome of the labeling process is a set of many-to-few correspondences: since all points in a segment vote for the keypoint at its center, we can expect the translation errors of many scene points to cancel each other out on average. The problem of pose estimation has now been reduced to a classical 3D-3D correspondence problem, where we are given a large set of up to 2048 point-to-point correspondences for each of the 16 top scoring anchors. Many algorithms are available for estimating the relative pose between two point sets in the presence of mismatches. One of the most effective is the rotational subgroup voting algorithm [4], which has shown superior performance for this type of problem. We thus pass the many non-background labeled points directly to this algorithm and compute the full 6 DoF rigid transformation between the object and the scene. The 16 poses, one per processed anchor, are refined using a coarse-to-fine ICP [1].
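For illustration, a least-squares (Kabsch) solver shows how a rigid transform is recovered from clean 3D-3D correspondences. Note this is a stand-in for exposition only, not the rotational subgroup voting algorithm of [4], which additionally tolerates mismatched correspondences:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ src @ R.T + t."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Recover a known rotation about z plus a translation.
rng = np.random.default_rng(1)
model = rng.normal(size=(100, 3))
a = 0.5
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([0.1, -0.2, 0.3])
scene_pts = model @ R_true.T + t_true
R_est, t_est = kabsch(model, scene_pts)
```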

Multi-modal localization loss: A final pose verification is performed on the 16 pose estimates using a multi-modal loss that determines how well the estimated pose of the object fits the observed scene data. The voting algorithm in the previous step already produces a density estimate in the pose parameter space, which is proportional to the number and quality of correspondences that vote for the final pose. This, however, has proven insufficient for our method, since that score does not intrinsically include the sensor-specific information that we have available, i.e. a viewing direction and in some cases color information. Our pose verification is thus performed by first transforming the object model points into the scene using the estimated pose. Occluded points, i.e. points lying behind the scene data relative to the camera viewing axis, are removed, and the remaining points are paired with the closest points in the scene using a k-d tree. We now compute a geometric and, when applicable, a color loss as RMS errors in the Euclidean and the perceptual RGB space:

$$L_{geo} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert p_i - p_{NN} \rVert^2}, \qquad L_{rgb} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert c_i - c_{NN} \rVert^2}$$
where N designates the number of remaining points after occlusion removal, p_i the transformed object points, and the subscript NN the nearest neighbor scene point of each p_i. In colored point clouds, acquired from e.g. RGB-D sensors, each point also has an associated RGB tuple, which in the second equation is denoted c. The geometric and perceptual losses are combined with the KDE-based score of the voting algorithm [4] to produce the final localization loss.


This allows us to separate good from bad pose estimates with high specificity. The final output of our algorithm is the detection that minimizes this loss.
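The geometric and color terms of the verification can be sketched as follows, assuming occlusion removal and the pose transform have already been applied; function and variable names are our own:

```python
import numpy as np
from scipy.spatial import cKDTree

def verification_losses(model_xyz, model_rgb, scene_xyz, scene_rgb):
    """RMS geometric and color residuals between visible, posed model points
    and their nearest scene neighbors, found with a k-d tree."""
    dists, idx = cKDTree(scene_xyz).query(model_xyz)
    l_geo = np.sqrt(np.mean(dists ** 2))
    l_rgb = np.sqrt(np.mean(np.sum((model_rgb - scene_rgb[idx]) ** 2, axis=1)))
    return l_geo, l_rgb

# Tiny example: one visible model point 0.1 away from its scene match.
scene_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
scene_rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
model_xyz = np.array([[0.1, 0.0, 0.0]])
model_rgb = np.array([[1.0, 0.0, 0.0]])
l_geo, l_rgb = verification_losses(model_xyz, model_rgb, scene_xyz, scene_rgb)
```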

2.2 Training

Our algorithm is trained on a number of real examples, to which we apply a limited set of augmentations to prevent overfitting. Similar to existing approaches, e.g. [10, 13, 16], the training examples are gathered from real scenes, each annotated with one or more ground truth object poses. Object models are given as mesh models, with or without color information. The difference from many other works is our use of raw, unordered point cloud data. To extract as much geometric information as possible from the surfaces, we include XYZ coordinates, normal vector coordinates, and the local curvature. The normal vectors and curvature are computed using PCL [15] with a support radius of 1 cm. When including color, we add three extra RGB components to each surface point.

Data preparation: For a single annotated example, we start by transforming the provided 3D model into the scene using the ground truth pose. All scene points within a distance threshold of 1 cm of the transformed model are treated as foreground points. Points beyond a distance threshold of 2 cm are considered background.1 For each of the foreground points, we associate a segmentation label corresponding to the index of the nearest keypoint on the model.
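A sketch of this foreground/background labeling, including the discarded 1-2 cm band described in the footnote; the label encoding (-1 for background, -2 for discarded) is our own convention:

```python
import numpy as np
from scipy.spatial import cKDTree

def label_scene(scene, model_keypoints, R, t, fg_thresh=0.01, bg_thresh=0.02):
    """Label scene points by distance to the posed model keypoints:
    nearest-keypoint index within 1 cm, -1 beyond 2 cm (background),
    -2 in the ambiguous 1-2 cm band, which is discarded."""
    posed = model_keypoints @ R.T + t
    dist, nearest = cKDTree(posed).query(scene)
    labels = np.full(len(scene), -2)
    labels[dist > bg_thresh] = -1
    fg = dist <= fg_thresh
    labels[fg] = nearest[fg]
    return labels

model_kp = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
scene = np.array([[0.005, 0.0,   0.0],    # foreground, keypoint 0
                  [0.5,   0.0,   0.0],    # background
                  [0.1,   0.015, 0.0]])   # ambiguous band, discarded
labels = label_scene(scene, model_kp, np.eye(3), np.zeros(3))
```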

Translation-invariant positives: Next, we sample 20 random foreground points for creating positive training examples for PointNet. Random sampling on the visible object surface, as opposed to only considering e.g. the centroid, makes our algorithm robust to translations, which will inevitably occur when using uniform anchor points during inference. For each sampled visible point we uniformly extract 2048 scene points within a spherical neighborhood of 0.6 times the 3D bounding box diagonal of the object model. This provides us with 20 positive training examples with both a class label for object presence and 2048 point-wise segmentation labels. All training examples are centered by subtracting the centroid of the points in the sphere.

Easy and hard negatives: The naive way of extracting counter-examples for training would be a fully uniform sampling in the rest of the scene. However, significantly higher specificity can be obtained by including a number of hard negatives during training. Thus, the easy negatives are sampled far enough away from the object instance to not include any foreground points. The hard negatives are sampled in the vicinity of the object, and, although some of the points in these neighborhoods are labeled with a non-background segmentation label, the classification label is still set to non-object. We use 20 easy and 10 hard negatives, which, together with the positives, sums to 50 training examples per annotated object instance, each with 2048 unordered 3D points.

Augmentation: To increase the network’s ability to generalize to different views, we perform simple augmentation on top of the 50 per-instance training examples as follows. For each of the positives we remove the background points and insert a randomly sampled background from one of the easy negatives.2 The positive cloud is randomly translated a small distance, and random segments around object keypoints are also removed from the point cloud. The background cloud is then translated randomly with a uniform distribution scaled by half the object diagonal. Finally, the point cloud is cut so that all points fit within the sphere of 0.6 times the object diagonal. Another 20 positive examples are created, but now where no background points are inserted. Finally, 20 segments of mixed background segments, i.e. easy negatives, are also created to train on random backgrounds. All in all, in addition to the original 50 per-instance training examples, we augment by 60 examples with equal number of positive and negative classification labels. Finally, all training examples, regular and augmented, are jittered by additive, zero-centered Gaussian noise on all geometry/color coordinates with a standard deviation of 0.01, which for the XYZ coordinates (given in mm) translates to a very small displacement.
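The final jitter step can be sketched as follows, assuming each point is a row of concatenated XYZ, normal, curvature and RGB channels:

```python
import numpy as np

def jitter(example, sigma=0.01, seed=None):
    """Additive zero-centered Gaussian noise on all per-point channels."""
    rng = np.random.default_rng(seed)
    return example + rng.normal(0.0, sigma, size=example.shape)

cloud = np.zeros((2048, 10))  # XYZ + normal + curvature + RGB per point
noisy = jitter(cloud, seed=0)
```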

Symmetry handling: Some objects have two or more rotational symmetries, in which case the exact rotation around the axis of rotation cannot be determined. To handle this, we reduce symmetric models down to the fewest required distinct keypoints during training set generation. An example is an ∞-fold symmetric cylinder, which can be described using only a single line of keypoints along the main axis.

3 Experiments

In this section we present experimental results on two of the most well-tested datasets for pose estimation in cluttered scenes, LINEMOD [5] and Occlusion [2]. Both datasets show one or more objects in several hundred scenes. We use the same split (approx. 15/85 % for train/test) of the LINEMOD dataset as earlier works, such as [3, 13, 16, 10]. For the Occlusion dataset, eight of the LINEMOD sequences make up the training examples. In all our experiments, we rely entirely on unordered point cloud data, reconstructed from the provided RGB and depth images. As per convention, the Eggbox and Glue objects are treated as 2- and ∞-fold symmetric objects, respectively. We evaluate all our results using the ADD metric mandated by the dataset creators [5, 2].

Training parameters: We jointly train per-object classification and segmentation heads on top of a standard PointNet architecture. The LINEMOD dataset is trained for 80 epochs; the Occlusion training set, being much larger, is trained for 20 epochs. The remaining training parameters are the same, as follows. The Adam optimizer [7] is used with a batch size of 16 and a base learning rate of 0.001. We also use a momentum of 0.9 to further speed up convergence. The two cross entropies have different scales, and we empirically set the relative importance to 0.15 for the classification loss and 0.85 for the segmentation loss during backpropagation.
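The weighted joint loss can be illustrated in NumPy (actual training would use an autodiff framework; the logistic classification output and the per-point softmax segmentation follow the architecture described in Sec. 2.1):

```python
import numpy as np

def joint_loss(cls_logit, cls_label, seg_logits, seg_labels,
               w_cls=0.15, w_seg=0.85):
    """Weighted sum of the binary classification cross-entropy (logistic
    output) and the mean per-point segmentation cross-entropy."""
    p_cls = 1.0 / (1.0 + np.exp(-cls_logit))
    l_cls = -(cls_label * np.log(p_cls) + (1 - cls_label) * np.log(1 - p_cls))
    e = np.exp(seg_logits - seg_logits.max(axis=-1, keepdims=True))
    p_seg = e / e.sum(axis=-1, keepdims=True)          # row-wise softmax
    l_seg = -np.log(p_seg[np.arange(len(seg_labels)), seg_labels]).mean()
    return w_cls * l_cls + w_seg * l_seg

# Uniform logits: classification loss ln 2, segmentation loss ln 3.
loss = joint_loss(0.0, 1, np.zeros((4, 3)), np.array([0, 1, 2, 0]))
```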

3.1 LINEMOD

In Tab. 1 we compare results on LINEMOD. The first three methods use either RGB-D or pure image data. DenseFusion [17] and our method use colored point cloud data (DenseFusion via image crops, ours directly via RGB components attached to each 3D point). Of these methods, ours produces the most accurate pose estimates on average.

[6] [13] [10] [17] Ours
Ape 65.0 40.4 43.6 92.0 80.7
Bench v. 80.0 91.8 99.9 93.0 100
Camera 78.0 55.7 86.9 94.0 100
Can 86.0 64.1 95.5 93.0 99.7
Cat 70.0 62.6 79.3 97.0 99.8
Driller 73.0 74.4 96.4 87.0 99.9
Duck 66.0 44.3 52.6 92.0 97.9
Eggbox 100 57.8 99.2 100 99.9
Glue 100 41.2 95.7 100 84.4
Hole p. 49.0 67.2 81.9 92.0 92.8
Iron 78.0 84.7 98.9 97.0 100
Lamp 73.0 76.5 99.3 95.0 100
Phone 79.0 54.0 92.4 93.0 96.2
Average 79.0 62.7 86.3 94.3 96.3
Table 1: LINEMOD results. The competing methods are SSD-6D [6], BB8 [13], PVNet [10], and DenseFusion [17].

3.2 Occlusion

We show Occlusion dataset results in Tabs. 2 and 3. In the first case, we used only XYZ coordinates as inputs to our method. Both PoseCNN [19] and PVNet [10], which use image data, produce much less accurate poses for this dataset. However, when adding a projective ICP refinement step to PoseCNN, that method achieves slightly better results than ours. This is likely due to a) the more sophisticated ICP, compared to the standard 3D point-to-point ICP used by us, and b) the use of 80000 extra synthetic images during training.

PoseCNN [19] PVNet [10] PointPoseNet
Ape 9.60 15.0 66.2
Can 45.2 63.0 90.3
Cat 0.93 16.0 34.7
Driller 41.4 25.0 59.6
Duck 19.6 65.0 63.3
Eggbox 22.0 50.0 42.9
Glue 38.5 49.0 21.2
Holepuncher 22.1 39.0 42.1
Average 24.9 40.8 52.6
Table 2: Occlusion results with a single modality.
PoseCNN [19] PointPoseNet
Ape 76.2 70.0
Can 87.4 95.5
Cat 52.2 60.8
Driller 90.3 87.9
Duck 77.7 70.7
Eggbox 72.2 58.7
Glue 76.7 66.9
Holepuncher 91.4 90.6
Average 78.0 75.1
Table 3: Occlusion results with two modalities.

4 Conclusion and future work

In this work, we presented PointPoseNet, a method for 6 DoF object pose estimation using deep learning on point clouds. The developed algorithm is tested on two datasets. On the LINEMOD dataset, it outperforms other methods and achieves state-of-the-art performance. On the Occlusion dataset, the algorithm achieves results comparable to current methods.

The contribution is a novel framework for pose estimation using deep learning in 3D. PointNet was one of the first networks for training directly on unordered 3D data. Since then, a number of 3D point cloud networks with better performance have been developed [12, 18, 9]. By directly replacing PointNet with one of these networks, our method can potentially be improved further.


  1. Due to non-negligible inaccuracies of the provided pose annotations, we need a fairly large distance threshold for foreground points. Additional variation in these ground truth pose inaccuracies results in a band of approx. 1-2 cm where both foreground and background points occur. To avoid an excessive amount of mislabeled points, we discard all points within this band.
  2. For the smaller training set in our experiments, LINEMOD, this particular augmentation is performed three times per positive example.


  1. P. Besl and N. D. McKay (1992) A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2), pp. 239–256.
  2. E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton and C. Rother (2014) Learning 6D object pose estimation using 3D object coordinates. In European Conference on Computer Vision, pp. 536–551.
  3. E. Brachmann, F. Michel, A. Krull, M. Ying Yang and S. Gumhold (2016) Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3364–3372.
  4. A. G. Buch, L. Kiforenko and D. Kraft (2017) Rotational subgroup voting and pose clustering for robust 3D object recognition. In IEEE International Conference on Computer Vision, pp. 4137–4145.
  5. S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pp. 548–562.
  6. W. Kehl, F. Manhardt, F. Tombari, S. Ilic and N. Navab (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In IEEE International Conference on Computer Vision, pp. 1521–1529.
  7. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  8. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  9. Y. Liu, B. Fan, S. Xiang and C. Pan (2019) Relation-shape convolutional neural network for point cloud analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8895–8904.
  10. S. Peng, Y. Liu, Q. Huang, X. Zhou and H. Bao (2019) PVNet: pixel-wise voting network for 6DoF pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4561–4570.
  11. C. R. Qi, H. Su, K. Mo and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
  12. C. R. Qi, L. Yi, H. Su and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
  13. M. Rad and V. Lepetit (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In IEEE International Conference on Computer Vision, pp. 3828–3836.
  14. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  15. R. B. Rusu and S. Cousins (2011) 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation, pp. 1–4.
  16. B. Tekin, S. N. Sinha and P. Fua (2018) Real-time seamless single shot 6D object pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301.
  17. C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei and S. Savarese (2019) DenseFusion: 6D object pose estimation by iterative dense fusion. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3343–3352.
  18. Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein and J. M. Solomon (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics.
  19. Y. Xiang, T. Schmidt, V. Narayanan and D. Fox (2018) PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. Robotics: Science and Systems.