KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations
Detecting keypoints of 3D objects is of great interest to both the graphics and computer vision communities. Several 2D and 3D keypoint datasets aim to address this problem in a data-driven way. These datasets, however, either lack scalability or bring ambiguity to the definition of keypoints. Therefore, we present KeypointNet: the first large-scale and diverse 3D keypoint dataset, containing 83,060 keypoints and 8,329 3D models from 16 object categories, built by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically through the minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset.
Detection of 3D keypoints is essential in many applications such as object matching, object tracking, shape retrieval and registration [17, 4, 32]. Matching 3D objects with keypoints has the advantage of providing semantically significant features, and such keypoints are usually made invariant to rotation, scale and other transformations.
In the era of deep learning, 2D semantic keypoint detection has been boosted by a large quantity of high-quality datasets [2, 18]. However, few 3D datasets focus on the keypoint representation of an object. Dutagaci et al. collect 43 models and label them according to annotations from various persons; annotations from different persons are then aggregated by geodesic clustering. The ShapeNetCore keypoint dataset, and a similar dataset, instead resort to a single expert's annotation of keypoints, making them subjective and biased.
To alleviate the bias of experts' definitions of keypoints, we ask a large group of people to annotate keypoints according to their own understanding. Challenges arise because different people may annotate different keypoints, and we need to identify the consensus and patterns in these annotations. Finding such patterns is not trivial when a large set of keypoints spreads across the entire model. A simple clustering would require a predefined distance threshold and would fail to identify closely spaced keypoints. As shown in Figure 1, there are four closely spaced keypoints on each airplane empennage, and it is extremely hard for simple clustering methods to distinguish them. Besides, clustering algorithms do not give semantic labels of keypoints, since it is ambiguous to link clustered groups with each other. In addition, people's annotations are not always exact, and errors in annotated keypoint locations are inevitable. To solve these problems, we propose a novel method that aggregates a large number of keypoint annotations from distinct people by optimizing a fidelity loss. After this automatic aggregation process, we verify the generated keypoints based on simple priors such as symmetry.
In this paper, we build the first large-scale and diverse dataset of this kind, named KeypointNet, which contains 8,957 models with 62,295 keypoints. These keypoints are of high fidelity and rich in structural and semantic meaning. Some examples are given in Figure 1. We hope this dataset can boost the semantic understanding of common objects.
In addition, we propose two large-scale keypoint prediction tasks: keypoint saliency estimation and keypoint correspondence estimation. We benchmark ten state-of-the-art algorithms with mIoU, mAP and PCK metrics. Results show that the detection and identification of keypoints remain challenging.
In summary, we make the following contributions:
To the best of our knowledge, we provide the first large-scale dataset on 3D keypoints, both in the number of categories and the number of keypoints.
We come up with a novel approach for aggregating people's keypoint annotations, even though their annotations are independent of each other.
We benchmark ten state-of-the-art methods on our dataset, including point cloud, graph, voxel and local-geometry-based keypoint detection methods.
2 Related Work
2.1 Detection of Keypoints
Detection of 3D keypoints has long been an important task for 3D object understanding, with applications in object pose estimation, reconstruction, matching and segmentation. Researchers have proposed various methods to produce interest points on objects to aid further processing. Traditional methods like 3D Harris, HKS, Salient Points, Mesh Saliency, Scale Dependent Corners, CGF and SHOT exploit local reference frames (LRF) to extract geometric features as local descriptors. However, because similar local frames repeat across a surface, most of the interest points detected by traditional methods are redundant and poorly suited to describing the object. Besides, these methods consider only local geometric information without semantic knowledge, diverging from human understanding.
Recent deep learning methods like SyncSpecCNN and deep functional dictionaries have been proposed to detect keypoints. Unlike traditional ones, these methods do not handle rotations well. Though some recent methods like S2CNN and PRIN try to fix this, deep learning methods still rely on ground-truth keypoint labels annotated by humans with expert verification.
2.2 Keypoint Datasets
Keypoint datasets have their origin in 2D images, where plenty of datasets on human skeletons and object interest points have been proposed. For human skeletons, the MPII human pose dataset, the MSCOCO keypoint challenge and PoseTrack annotate millions of keypoints on humans. For more general objects, SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale, with a number of corresponding keypoints on each image pair. CUB provides 15 part locations on 11,788 images from 200 bird categories, and PASCAL provides keypoint annotations for 20 object categories.
Keypoint datasets on 3D objects include Dutagaci et al., SyncSpecCNN and Kim et al. Dutagaci et al. aggregate multiple annotations from different people with an ad-hoc method, but the dataset is extremely small. Though SyncSpecCNN, Pavlakos et al. and Kim et al. give relatively large keypoint datasets, they rely on a manually designed template of keypoints, which is inevitably biased and flawed.
3 KeypointNet: A Large-scale 3D Keypoint Dataset
3.1 Data Collection
KeypointNet is built on ShapeNetCore . ShapeNetCore covers 55 common object categories with about 51,300 unique 3D models.
We filter out those models that deviate from the majority and keep at most 1,000 instances per category in order to provide a balanced dataset. In addition, a consistent canonical orientation (e.g., upright and front) is established for every category, since the alignment in ShapeNetCore is incomplete.
We let annotators determine which points are important, and the same keypoint indices should indicate the same meanings for each annotator. Though annotators are free to give their own keypoints, three general principles should be obeyed: (1) each keypoint should describe an object's semantic information shared across instances of the same object category, (2) keypoints of an object category should spread over the whole object and (3) different keypoints should have distinct semantic meanings. Afterwards, we utilize a heuristic method to aggregate these points, which is discussed in Section 4.
Keypoints are annotated on meshes and these annotated meshes are then downsampled to 2,048 points. Our final dataset is a collection of point clouds, with keypoint indices.
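The paper does not state how each mesh is downsampled to 2,048 points; a common choice for producing fixed-size point clouds, shown here purely as an illustrative sketch (the function name and seeding are ours), is farthest point sampling:

```python
import numpy as np

def farthest_point_sample(points, n_samples, seed=0):
    """Greedy farthest-point sampling: repeatedly pick the point
    farthest from the set already chosen, yielding a well-spread
    fixed-size subset of the input cloud."""
    rng = np.random.default_rng(seed)
    chosen = np.empty(n_samples, dtype=np.int64)
    chosen[0] = rng.integers(points.shape[0])
    # Distance from every point to its nearest already-chosen point.
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(dist))
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[i]], axis=1))
    return points[chosen]

# Example: reduce a random 10,000-point cloud to 2,048 points.
cloud = np.random.default_rng(1).normal(size=(10000, 3))
sampled = farthest_point_sample(cloud, 2048)
print(sampled.shape)  # (2048, 3)
```

Compared with uniform random sampling, farthest point sampling avoids leaving large regions of the surface unrepresented.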
3.2 Annotation Tools
We develop an easy-to-use web annotation tool based on NodeJS. Each user is allowed to click up to 20 interest points according to his/her own understanding. The interface is shown in Figure 2: annotated models are shown in the left panel, while the next unprocessed model is shown in the right panel.
3.3 Dataset Statistics
At the time of this work, our dataset covers 16 common categories from ShapeNetCore, with 8,957 models. Each model contains 3 to 20 keypoints. Our dataset is divided into train, validation and test splits with a 7:1:2 ratio. Table 1 gives detailed statistics of our dataset, and some visualizations are given in Figure 3.
4 Keypoint Aggregation
Given all human-labeled raw keypoints, we leverage a novel method to aggregate them into a set of ground-truth keypoints.
A dedicated aggregation method is needed for two reasons: 1) distinct people may annotate different sets of keypoints, and human-labeled keypoints are sometimes erroneous, so we need a principled way to aggregate them; 2) a simple clustering algorithm would fail to distinguish closely spaced keypoints and cannot give consistent semantic labels.
4.1 Problem Statement
Given a 2-dimensional sub-manifold $\mathcal{M}_m \subset \mathbb{R}^3$, where $m$ is the index of the model, a valid annotation from the $p$-th person is a keypoint set $\mathcal{A}_p^{(m)} = \{a_{p,k}^{(m)} \in \mathcal{M}_m\}_{k=1}^{N_p^{(m)}}$, where $k$ is the keypoint index and $N_p^{(m)}$ is the number of keypoints annotated by person $p$. Note that different people may have different sets of keypoint indices and these indices are independent.
Our goal is to aggregate a set of potential ground-truth keypoints $\mathcal{K}^{(m)} = \{q_k^{(m)}\}_{k=1}^{K_m}$, where $K_m$ is the number of proposed keypoints for each model $\mathcal{M}_m$, so that $q_k^{(m)}$ and $q_k^{(m')}$ share the same semantics for any two models $m$ and $m'$.
4.2 Keypoint Saliency
Each annotation is allowed to be erroneous within a small region, so that a keypoint distribution is defined as follows:

$$p^{(m)}(x) = \frac{1}{Z} \sum_{p} \sum_{k=1}^{N_p^{(m)}} g\big(x;\, a_{p,k}^{(m)}\big), \qquad g(x; a) = \exp\!\Big(-\frac{\|x - a\|^2}{2\sigma^2}\Big),$$

where $g$ is a Gaussian kernel function and $Z$ is a normalization constant. This contradicts many previous methods of annotating keypoints, where a $\delta$-function is implicitly assumed. We argue that humans commonly make small mistakes when annotating keypoints, and by the central limit theorem the keypoint distribution forms a Gaussian.
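A minimal numerical sketch of this annotation distribution (the bandwidth `sigma` and the function name are illustrative choices, not taken from the paper): each human click contributes a Gaussian bump, so the resulting density peaks at annotated locations rather than between them.

```python
import numpy as np

def keypoint_density(x, annotations, sigma=0.05):
    """Evaluate the annotation-derived keypoint distribution at the
    query points x: a normalized sum of isotropic Gaussian kernels
    centred on the human-annotated keypoints.
    x: (Q, 3) query points; annotations: (A, 3) annotated keypoints."""
    # Squared distances between every query point and every annotation.
    d2 = np.sum((x[:, None, :] - annotations[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))   # (Q, A) Gaussian kernel values
    density = kernel.sum(axis=1)
    return density / density.sum()              # normalize over the query set

ann = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])                    # two annotated keypoints
queries = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
p = keypoint_density(queries, ann)
print(p)  # mass concentrates at the annotated locations, not the midpoint
```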
4.3 Ground-truth Keypoint Generation
We propose to jointly output a dense mapping function $f_\theta$, with parameters $\theta$, and the aggregated ground-truth keypoint set $\mathcal{K} = \{\mathcal{K}^{(m)}\}$. $f_\theta$ transforms each point into a high-dimensional embedding vector in $\mathbb{R}^d$. Specifically, we solve the following optimization problem:

$$\min_{\theta,\, \mathcal{K}} \; L_f(\theta, \mathcal{K}) + \lambda L_r(\theta) \quad \text{s.t.} \;\; f_\theta\big(q_k^{(m)}\big) = f_\theta\big(q_k^{(m')}\big), \;\; \forall\, m, m', k, \qquad (1)$$

where $L_f$ is the data fidelity loss and $L_r$ is a regularization term to avoid trivial solutions like $f_\theta \equiv \mathbf{0}$. The constraint states that the embeddings of ground-truth keypoints with the same index should be the same.
We define $L_f$ as:

$$L_f(\theta, \mathcal{K}) = \sum_m \sum_{k=1}^{K_m} \mathbb{E}_{x \sim p^{(m)}}\Big[ D_\theta\big(q_k^{(m)}, x\big) \Big], \qquad (2)$$

where $D_\theta(u, v) = \big\| f_\theta(u) - f_\theta(v) \big\|_2$ is the L2 distance between two vectors in embedding space.
Unlike previous methods such as Dutagaci et al., where a simple geodesic average of the human-labeled points is taken as the ground truth, we seek points whose expected embedding distance to all human-labeled points is smallest. Geodesic distance is sensitive to misannotated keypoints and cannot distinguish closely spaced keypoints, while embedding distance is more robust to noisy points, since the embedding space encodes the semantic information of an object.
Equation 1 involves both $\theta$ and $\mathcal{K}$, and it is impractical to solve the problem in closed form. In practice, we use alternating minimization with a deep neural network approximating the embedding function $f_\theta$, so that we solve the following relaxed problem instead (by slightly loosening the constraints):

$$\theta^{t+1} = \arg\min_{\theta}\; L_f\big(\theta, \mathcal{K}^{t}\big) + \lambda L_r(\theta), \qquad \mathcal{K}^{t+1} = \arg\min_{\mathcal{K}}\; L_f\big(\theta^{t+1}, \mathcal{K}\big), \qquad (3)$$

and alternate between the two updates until convergence.
By solving this problem, we find both an optimal embedding function $f_\theta$ and intra-class consistent ground-truth keypoints $\mathcal{K}$, while keeping their embedding distance to the human-labeled keypoints as small as possible. The ground-truth keypoints can be viewed as the projection of the human-labeled data onto the embedding space.
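To make the keypoint-update step concrete, the toy sketch below freezes an arbitrary embedding (a random two-layer map standing in for the learned network; this stand-in is purely our assumption for illustration) and picks, among a model's surface points, the one with the smallest mean embedding distance to the noisy annotations:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy frozen embedding R^3 -> R^8: a random two-layer map standing in
# for the learned deep network (assumption made for illustration only).
W1, W2 = rng.normal(size=(16, 3)), rng.normal(size=(8, 16))
def embed(x):
    return np.tanh(x @ W1.T) @ W2.T

def fidelity_error(candidates, annotations):
    """Mean embedding distance from each candidate surface point to
    all human-annotated keypoints: the data fidelity term evaluated
    pointwise over the model."""
    e_c, e_a = embed(candidates), embed(annotations)
    d = np.linalg.norm(e_c[:, None, :] - e_a[None, :, :], axis=-1)
    return d.mean(axis=1)

# Keypoint-update step: pick the candidate with minimal fidelity error.
candidates = rng.uniform(-1, 1, size=(2048, 3))                       # model's point cloud
annotations = candidates[5] + rng.normal(scale=0.01, size=(7, 3))     # noisy clicks near one point
err = fidelity_error(candidates, annotations)
best = candidates[err.argmin()]
print(np.linalg.norm(best - candidates[5]))  # recovered point lies near the annotated spot
```

Because the error is averaged over all annotations, a single outlier click shifts the minimizer far less than it would shift a geodesic average.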
Non Minimum Suppression
Equation 3 may be hard to solve since $K_m$ is also unknown beforehand. For each model $\mathcal{M}_m$, the fidelity error associated with each potential keypoint $x \in \mathcal{M}_m$ is:

$$E^{(m)}(x) = \mathbb{E}_{y \sim p^{(m)}}\big[ D_\theta(x, y) \big].$$

Then $\mathcal{K}^{(m)}$ is found by conducting Non Minimum Suppression (NMS), such that:

$$\mathcal{K}^{(m)} = \Big\{ q \in \mathcal{M}_m \;\Big|\; E^{(m)}(q) \le E^{(m)}(x), \;\; \forall x \in \mathcal{M}_m \text{ with } \mathrm{dist}(x, q) < r \Big\},$$

where $r$ is some neighborhood threshold.
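The NMS step can be sketched as follows (a Euclidean ball stands in for the surface neighborhood; the synthetic error map below has two basins, so two proposals survive):

```python
import numpy as np

def non_minimum_suppression(points, errors, radius):
    """Keep a point as a keypoint proposal iff its fidelity error is
    minimal within a ball of the given radius. Euclidean distance is
    used here as a simplification of the on-surface neighborhood."""
    keep = []
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        if errors[i] <= errors[d < radius].min():
            keep.append(i)
    return keep

# Two error basins on a 1-D strip embedded in 3-D.
xs = np.linspace(0, 1, 101)
points = np.stack([xs, np.zeros_like(xs), np.zeros_like(xs)], axis=1)
errors = np.minimum(np.abs(xs - 0.25), np.abs(xs - 0.75))  # minima at x=0.25 and x=0.75
print(non_minimum_suppression(points, errors, radius=0.2))  # → [25, 75]
```

Only the two local minima survive; every other point has a lower-error neighbor inside its ball.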
After NMS, we get several ground-truth points for each manifold $\mathcal{M}_m$. However, the arbitrarily assigned indices within each model do not provide a consistent semantic correspondence across different models. We therefore cluster these points according to their embeddings, first projecting them onto a 2D subspace with t-SNE.
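The paper does not name the clustering algorithm used after the projection; as an illustrative stand-in, a tiny k-means over already-projected 2D embeddings assigns each aggregated keypoint a consistent semantic index (the t-SNE projection itself is omitted here):

```python
import numpy as np

def kmeans_labels(x, k, iters=50, seed=0):
    """Tiny k-means: assigns a cluster index to each embedded keypoint,
    standing in for whatever clustering is run on the t-SNE projection."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(x[:, None] - centers[None], axis=-1), axis=1)
        # Keep the old center if a cluster ever becomes empty.
        centers = np.stack([x[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Projected keypoints from three models; same semantic part -> nearby embedding.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.1],    # semantic part A
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])    # semantic part B
labels = kmeans_labels(emb, k=2)
print(labels)
```

Keypoints that land in the same cluster receive the same semantic index across all models of the category.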
Though the above method automatically aggregates a set of potential keypoints with high precision, it may omit some keypoints. As a last step, experts manually verify the keypoints based on simple priors such as the rotational symmetry and centrosymmetry of an object.
4.4 Implementation Details
At the start of the alternating minimization, we initialize $\mathcal{K}$ by sampling from the raw annotations and then run one iteration, which is enough for convergence. We choose PointConv with hidden dimension 128 as the embedding function $f_\theta$. During the optimization of Equation 3, we classify each point into $K_m$ classes with a SoftMax layer and extract the feature of the penultimate layer as the embedding. The learning rate for PointConv is 1e-3 and the optimizer is Adam.
The whole pipeline is shown in Figure 4. We first infer dense embeddings from the human-labeled raw annotations. Fidelity error maps are then calculated by summing embedding distances to the human-labeled keypoints. Non Minimum Suppression is conducted to form a potential set of keypoints. These keypoints are then projected onto a 2D subspace with t-SNE and verified by humans.
5 Tasks and Benchmarks
In this section, we propose two keypoint prediction tasks: keypoint saliency estimation and keypoint correspondence estimation. Keypoint saliency estimation requires the evaluated methods to give a set of keypoints without distinguishing their semantic labels, while keypoint correspondence estimation asks them to localize a fixed number of semantically labeled keypoints.
5.1 Keypoint Saliency Estimation
For keypoint saliency estimation, we only consider whether a point is a keypoint or not, without giving its semantic label. Our dataset is split into train, validation and test sets with ratios of 70%, 10% and 20%.
Two metrics are adopted to evaluate keypoint saliency estimation. First, we evaluate the mean Intersection over Union (mIoU), computed per category as

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where $TP$, $FP$ and $FN$ count true positive, false positive and false negative keypoint predictions under a given geodesic error tolerance. mIoU is calculated under error tolerances from 0 to 0.1. Second, for those methods that output keypoint probabilities, we evaluate the mean Average Precision (mAP) over all categories.
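Under the convention that a prediction counts as a true positive when it lies within the error tolerance of some ground-truth keypoint (our reading of the metric; Euclidean distance replaces geodesic distance for simplicity), the IoU computation can be sketched as:

```python
import numpy as np

def keypoint_iou(pred, gt, tolerance):
    """IoU for keypoint saliency under a distance tolerance: a
    prediction matching some ground-truth keypoint within `tolerance`
    is a true positive; unmatched predictions are false positives and
    unmatched ground truths are false negatives."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    tp = int(np.sum(d.min(axis=1) <= tolerance))   # matched predictions
    fp = len(pred) - tp                            # predictions matching nothing
    fn = int(np.sum(d.min(axis=0) > tolerance))    # ground truths nobody matched
    return tp / (tp + fp + fn)

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.005, 0.0, 0.0], [0.5, 0.0, 0.0]])  # one hit, one miss
print(keypoint_iou(pred, gt, tolerance=0.01))  # 1 TP, 1 FP, 1 FN -> 0.3333...
```

Sweeping `tolerance` from 0 to 0.1 reproduces the kind of mIoU curve reported in Figure 6.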
We benchmark eight state-of-the-art deep learning algorithms for point cloud semantic analysis: PointNet, PointNet++, RSNet, SpiderCNN, PointConv, RSCNN, DGCNN and GraphCNN. Three traditional local-geometry keypoint detectors are also considered: Harris3D, SIFT3D and ISS3D.
For deep learning methods, we use the default network architectures and hyperparameters to predict the keypoint probability of each point, and both mIoU and mAP are adopted to evaluate performance. For local-geometry-based methods, only mIoU is used. Each method is tested with various geodesic error thresholds. In Table 2, we report mIoU and mAP results under a restrictive threshold of 0.01. Figure 6 shows the mIoU curves under distance thresholds from 0 to 0.1, and Figure 7 shows the mAP results. Under the restrictive distance threshold of 0.01, both geometric and deep learning methods fail to predict qualified keypoints.
Figure 5 shows visualizations of the results from RSNet, RSCNN, DGCNN, GraphCNN, ISS3D and Harris3D. Deep learning methods can predict some of the ground-truth keypoints, though predicted keypoints are sometimes missing. Local-geometry-based methods like ISS3D and Harris3D give many more interest points spread over the entire model, but these points are agnostic to semantic information. Learning discriminative features that localize accurate and distinct keypoints across various objects remains a challenging task.
5.2 Keypoint Correspondence Estimation
Keypoint correspondence estimation is a more challenging task than keypoint saliency estimation. One needs to predict not only the keypoints, but also their semantic labels. The semantic labels should be consistent across different objects in the same category.
For keypoint correspondence estimation, each keypoint is labeled with a semantic index. For keypoints that do not exist on some objects, index -1 is given. Similar to SyncSpecCNN, the maximum number of keypoints for each category is bounded. The data split is the same as for keypoint saliency estimation.
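A minimal sketch of the PCK metric as we read it (predictions and ground truths aligned by semantic index; Euclidean rather than geodesic error, and the variable names are ours):

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints: the fraction of semantically
    indexed ground-truth keypoints whose prediction with the same
    index lands within `threshold`. pred and gt are (K, 3) arrays
    aligned by semantic index."""
    d = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(d <= threshold))

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pred = gt + np.array([[0.005, 0.0, 0.0], [0.02, 0.0, 0.0],
                      [0.0, 0.003, 0.0], [0.0, 0.0, 0.2]])
print(pck(pred, gt, threshold=0.01))  # 2 of 4 keypoints within tolerance -> 0.5
```

Unlike the saliency IoU, a prediction only counts here if it is close to the ground-truth keypoint with the *same* semantic index, which is why correspondence estimation is the harder task.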
Similarly, we use the default network architectures. Table 3 shows the PCK results with an error distance threshold of 0.01. Figure 9 illustrates the percentage of correct points with distance thresholds varied from 0 to 0.1. RSNet performs relatively better than the other methods for distance thresholds under 0.02, while RSCNN gives better results by a large margin for thresholds above 0.02. However, all evaluated methods face great difficulty in predicting exact, consistent semantic keypoints.
Figure 8 shows some visualizations of the results for different methods. Same colors denote same semantic labels. We can see that most methods can accurately predict some of the keypoints. However, there are still missing keypoints and inaccurate localizations.
Keypoint saliency estimation and keypoint correspondence estimation are both important for object understanding. Keypoint saliency estimation gives a sparse representation of an object by extracting meaningful points, while keypoint correspondence estimation establishes relations between points on different objects. From the results above, we can see that these two tasks remain challenging. The reason is that object keypoints from the human perspective are not simply geometrically salient points, but points that abstract the semantic meaning of the object.
In this paper, we propose the large-scale, high-quality KeypointNet dataset. To generate ground-truth keypoints from raw human annotations, where identification of their modes is non-trivial, we transform the problem into an optimization problem and solve it in an alternating fashion. By optimizing a fidelity loss, ground-truth keypoints, together with their correspondences, are generated. In addition, we evaluate and compare several state-of-the-art methods on our proposed dataset, and we hope this dataset can boost the semantic understanding of 3D objects.
- (2018) PoseTrack: a benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5167–5176.
- (2014) 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2009) Poselets: body part detectors trained using 3D human pose annotations. In 2009 IEEE 12th International Conference on Computer Vision, pp. 1365–1372.
- (2016) Detection of geometric keypoints and its application to point cloud coarse registration. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 41.
- (2008) Sparse points matching by combining 3D mesh saliency with statistical descriptors. In Computer Graphics Forum, Vol. 27, pp. 643–652.
- (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
- (2018) Spherical CNNs. arXiv preprint arXiv:1801.10130.
- (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
- (2012) Evaluation of 3D interest point detection techniques via human-generated ground truth. The Visual Computer 28 (9), pp. 901–917.
- (2018) Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635.
- (2017) Learning compact geometric features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 153–161.
- (2013) Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics (TOG) 32 (4), pp. 70.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2005) Mesh saliency. ACM Transactions on Graphics (TOG) 24 (3), pp. 659–666.
- (2019) Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8895–8904.
- (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- (2006) Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (10), pp. 1584–1601.
- (2019) SPair-71k: a large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543.
- (2016) (Website).
- (2007) Scale-dependent 3D geometric features. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
- (2017) 6-DoF object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018.
- (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108.
- (2017) Volumetric image registration from invariant keypoints. IEEE Transactions on Image Processing 26 (10), pp. 4900–4910.
- (2015) Learning a descriptor-specific 3D keypoint detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2318–2326.
- (2011) Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes. The Visual Computer 27 (11), pp. 963.
- (2009) A concise and provably informative multi-scale signature based on heat diffusion. In Computer Graphics Forum, Vol. 28, pp. 1383–1392.
- (2018) Deep functional dictionaries: learning consistent semantic structures on 3D models from functions. In Advances in Neural Information Processing Systems, pp. 485–495.
- (2014) 3D interest point detection via discriminative learning. In European Conference on Computer Vision, pp. 159–173.
- (2010) Unique signatures of histograms for local surface description. In European Conference on Computer Vision, pp. 356–369.
- (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- (2018) Learning 3D keypoint descriptors for non-rigid shape matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146.
- (2019) PointConv: deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630.
- (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 87–102.
- (2017) SyncSpecCNN: synchronized spectral CNN for 3D shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2282–2290.
- (2018) PRIN: pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361.