StickyPillars: Robust feature matching on point clouds using Graph Neural Networks
StickyPillars introduces a sparse feature matching method on point clouds. It is the first approach applying Graph Neural Networks on point clouds to stick points of interest. The feature estimation and assignment rely on the optimal transport problem, where the cost is based on the neural network itself. We utilize a Graph Neural Network for context aggregation with the aid of multi-head self- and cross-attention. In contrast to image-based feature matching methods, the architecture learns feature extraction in an end-to-end manner. Hence, the approach does not rely on handcrafted features. Our method outperforms state-of-the-art matching algorithms while providing real-time capability.
Point cloud registration is an essential computer vision problem and necessary for a wide range of tasks in the domain of real-time scene understanding or applied robotics. Furthermore, new generations of 3D sensors, like depth cameras or LiDARs (light detection and ranging), enable dense perception including distance information recorded within a large field of view. This also increases the requirements for point cloud based feature matching applicable to various tasks, namely perception, mapping, re-localization or SLAM (simultaneous localization and mapping). Even more fundamental operations like multi-sensor calibration rely on exact matching of feature correspondences.
Point cloud registration using 3D sensors is commonly solved using locally describable features combined with a global optimization step Shan and Englot (2018); Zhang and Singh (2014); Lin and Zhang (2019). These real-time approaches achieve state-of-the-art performance on odometry challenges like KITTI Geiger et al. (2012) without relying on modern machine learning algorithms, because the unbiased depth values of the sensors enable reliable distance estimates that classical algorithms can process. However, recent research on neural network based point cloud processing, e.g. classification and segmentation Qi et al. (2017a, b); Lang et al. (2019); Zhou and Tuzel (2018), has opened up new possibilities for perception, registration, mapping and odometry and has shown impressive results Engel et al. (2019); Li et al. (2019). The downside of all those methods is that they tackle odometry estimation itself based on a global rigid body operation. Hence, the target is the calculation of a robust coordinate transformation, assuming many static objects within the environment and proper viewpoints. This leads to instabilities in challenging situations such as a large number of dynamic objects, widely varying viewpoints or small overlapping areas. Furthermore, the mapping quality itself suffers and is often not evaluated qualitatively; one example is the blurring of dynamic objects in the map.
To overcome these disadvantages, we introduce a novel registration strategy for point clouds utilizing graph neural networks. Inspired by DeTone et al. (2018); Vaswani et al. (2017), we solve the feature correspondence problem instead of performing a direct odometry estimation. StickyPillars is a robust real-time registration method for point clouds (see Fig. 1) and remains confident under challenging conditions such as dynamic environments, challenging viewpoints or small overlapping areas. This enables more powerful odometry estimation, mapping or SLAM (example in Fig. 2).
We verify our technique on the KITTI odometry benchmark using recalculation Geiger et al. (2012) and significantly outperform state-of-the-art approaches.
2 Related Work
Point cloud registration was fundamentally investigated and influenced by Besl and McKay (1992); Zhang (1994); Rusinkiewicz and Levoy (2001). The iterative closest point (ICP) search is a powerful algorithm for calculating the displacement between two point sets. Its convergence and speed depend on the matching accuracy itself. Given correct data associations (e.g. similar viewpoints, large overlap, other constraints), the transformation can be computed efficiently. This method is still a common technique and is used in a wide range of odometry, mapping and SLAM algorithms Shan and Englot (2018); Zhang and Singh (2014); Lin and Zhang (2019). However, its convergence and accuracy depend largely on the similarity of the given point sets. Rusinkiewicz and Levoy (2001) report its error susceptibility in challenging tasks such as small overlap of point sets or different viewpoints.
Local feature correspondence was more widely used in the domain of image processing. Similar to ICP, the classical idea is composed of several steps, i.e. point detection, feature calculation and matching. On top, the geometric transformation is calculated. The standard pipelines were proposed by FLANN Muja and Lowe (2009) or SIFT Lowe (2004). Models based on neighborhood consensus were evaluated by Bian et al. (2017); Sattler et al. (2009); Tuytelaars and Van Gool (2000); Cech et al. (2010) or in a more robust way combined with a solver called RANSAC Fischler and Bolles (1981); Raguram et al. (2008).
Recently, Deep Learning based approaches, i.e. Convolutional Neural Networks (CNNs), were utilized to learn local descriptors and sparse correspondences Dusmanu et al. (2019); Ono et al. (2018); Revaud et al. (2019); Yi et al. (2016). Most of those approaches operate on sets of matches and ignore the assignment structure. In contrast, Sarlin et al. (2019) focuses on bundling aggregation, matching and filtering based on Graph Neural Networks.
Deep Learning on point clouds is a novel field. Research has been proposed by Chen et al. (2017); Simon et al. (2018, 2019) utilizing CNNs. Point sets are usually unordered, are influenced by the interaction among points, and are viewpoint invariant. Hence, a more specific architecture was needed and first investigated for segmentation and classification by Qi et al. (2017a), capable of handling large point sets Qi et al. (2017b). For registration, some work has been proposed utilizing Deep Learning on 3D point clouds to approximate ICP Lu et al. (2019); Li and Lee (2019) or image generation Milz et al. (2019), where the former focuses on rigid transformations and the latter on the key-point descriptor itself. However, these approaches lack accuracy and robustness when real-time capability is required.
Optimal transport is related to the graph matching problem and therefore utilized in this work. In general, it describes a transportation plan between two probability distributions. Numerically, this could be solved with the Sinkhorn algorithm Sinkhorn and Knopp (1967); Cuturi (2013); Peyré and Cuturi (2019) and its derivatives. We approximate graph matching using optimal transport based on multi-head Vaswani et al. (2017) attention (self and cross-wise) to learn a robust registration, not related to handcrafted features or specific costs, but approximated by the network itself.
3 The StickyPillars Architecture
The idea behind StickyPillars is the development of a robust point cloud registration algorithm to replace ICP as the most common matcher for environmental perception on point clouds. ICP has fundamental drawbacks in terms of stability and predictable runtime, and it depends on a good initialization. We identify the need for an approach that works in real time under challenging conditions: small overlapping areas and diverging viewpoints. Hence, a good registration algorithm should match corresponding key-points (even dynamic ones, without a rigid global pose) or discard occluded and non-corresponding points, respectively. We propose such a model, which does not rely on an encoder-decoder system but instead uses a graph based on multi-head Vaswani et al. (2017) self- and cross-attention to learn the context aggregation in an end-to-end manner (see Fig. 3). Our three-dimensional point cloud features (pillars) are flexible and fully composed of learnable parameters.
Problem description Let and be two point clouds to be registered. The key-points of those point clouds will be denoted as and with and , while other points will be denoted as and . Each key-point with index is associated to a point pillar, which can be pictured as a sphere having a centroid position and a center of gravity . All points () within a pillar are associated with a pillar feature stack , with as pillar encoder input depth. The same applies for . and compose the input for the graph. The overall goal is to find partial assignments for the optimal re-projection with .
3.1 Pillar Layer
Key-Point Selection is the initial basis of the pillar layer. Most common 3D sensors deliver dense point clouds having more than 120k points that need to be sparsified. Inspired by Zhang and Singh (2014), we place the centroid pillar coordinates on sharp edges and planar surface patches. A smoothness term identifies smooth or sharp areas. For a point cloud the corresponding smoothness term is defined by:
and denote point indices within the point cloud having coordinates . is a set of neighboring points of and is the cardinality of . With the aid of the sorted smoothness factors in , we define two thresholds and to pick a fixed number of key-points in sharp and planar regions . The same equation applies for on selecting points with index .
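As a rough illustration of the key-point selection described above, the following sketch computes a LOAM-style smoothness value in the spirit of Zhang and Singh (2014) and picks the sharpest and the flattest points. The exact normalization, the neighborhood definition (`half_window`) and the selection by count rather than by two thresholds are assumptions made for this example; the function names are hypothetical.

```python
import math

def smoothness(points, i, half_window=2):
    """LOAM-style smoothness of points[i] w.r.t. its neighbours on the scan line.

    Assumed form: || sum_j (p_i - p_j) || / (|S| * ||p_i||), j in S, j != i.
    """
    lo = max(0, i - half_window)
    hi = min(len(points), i + half_window + 1)
    neighbours = [points[j] for j in range(lo, hi) if j != i]
    p = points[i]
    diff = [sum(p[a] - q[a] for q in neighbours) for a in range(3)]
    norm_p = math.sqrt(sum(c * c for c in p))
    return math.sqrt(sum(d * d for d in diff)) / (len(neighbours) * norm_p)

def select_keypoints(points, n_sharp, n_planar, half_window=2):
    """Return (sharp, planar) index lists taken from the sorted smoothness values."""
    # Skip border points so that every candidate has a full neighbourhood.
    candidates = range(half_window, len(points) - half_window)
    ranked = sorted(candidates, key=lambda i: smoothness(points, i, half_window))
    return ranked[len(ranked) - n_sharp:], ranked[:n_planar]

# A scan line that runs straight along x and then turns by 90 degrees at index 5.
line = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0),
        (4.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0),
        (6.0, 1.0, 0.0), (6.0, 2.0, 0.0), (6.0, 3.0, 0.0)]
sharp, planar = select_keypoints(line, n_sharp=1, n_planar=1)
```

In this toy scan line the corner point receives the largest smoothness value and a point on the straight segment the smallest, which mirrors the sharp/planar split described in the text.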
Pillar Encoder is inspired by Qi et al. (2017a); Lang et al. (2019). Any selected key-point, and , is associated with a point pillar and describing a set of specific points and . We sample points into a pillar using a Euclidean distance function:
Similar equations apply for . Due to a fixed input size of the pillar encoder, we draw a maximum of points per pillar, where is used in our experiments. is the distance threshold defining the pillar size (e.g. 50 cm). For computational reasons, the point clouds are organized within a k-d tree Bentley (1975). Based on the closest samples are drawn into the pillar , whereas points beyond the distance threshold are rejected and replaced by zero tuples instead.
To compose a sufficient feature input stack for the pillar encoder , we stack for each sampled point with in the style of Lang et al. (2019):
denotes the sample point's coordinate . is a scalar and represents the intensity value, is the difference to the pillar's center of gravity and is the difference to the pillar's key-point. is the L2 norm of the point itself. This leads to an overall input depth . The pillar encoder is a single linear projection layer with shared weights across all pillars and frames followed by a batchnorm and ReLU layer:
The output has a depth of , e.g. 32 values in our experiments.
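The pillar sampling and the per-point feature stack described above can be sketched as follows. This is a minimal illustration under stated assumptions: a brute-force distance search stands in for the k-d tree, the center of gravity is computed from the sampled points only, and `build_pillar`, `max_points` and `radius` are hypothetical names and values. Under this reading of the text, each point contributes 3 + 1 + 3 + 3 + 1 = 11 input values.

```python
import math

def _dist(a, b):
    """Euclidean distance between two 3D coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_pillar(keypoint, cloud, max_points=4, radius=0.5):
    """Sample the closest points around a key-point and stack their features.

    cloud: list of (x, y, z, intensity) tuples.
    Returns max_points rows of 11 values; missing points are zero rows.
    """
    inside = [p for p in cloud if _dist(keypoint, p[:3]) <= radius]
    inside.sort(key=lambda p: _dist(keypoint, p[:3]))
    inside = inside[:max_points]
    n = len(inside)
    # Assumption: center of gravity taken over the sampled points.
    centre = [sum(p[a] for p in inside) / n for a in range(3)] if n else [0.0] * 3
    rows = []
    for x, y, z, intensity in inside:
        rows.append([
            x, y, z, intensity,
            x - centre[0], y - centre[1], z - centre[2],
            x - keypoint[0], y - keypoint[1], z - keypoint[2],
            math.sqrt(x * x + y * y + z * z),
        ])
    # Pad with zero tuples so that the encoder input size stays fixed.
    rows += [[0.0] * 11 for _ in range(max_points - n)]
    return rows

cloud = [(0.1, 0.0, 0.0, 0.5), (0.3, 0.0, 0.0, 0.2), (2.0, 0.0, 0.0, 0.9)]
pillar = build_pillar((0.0, 0.0, 0.0), cloud)
```

The resulting rows would then be passed through the shared linear projection, batchnorm and ReLU described in the text.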
Positional Encoder is introduced to learn contextual aggregation without applying pooling operations. The positional encoder is inspired by Qi et al. (2017a) and utilizes a single Multi-Layer-Perceptron (MLP) shared across and and all pillars including batchnorm and ReLU. From the key-points centroid coordinates we derive again an output with a depth of :
3.2 Graph Neural Network Layer
Graph Architecture assumes two complete graphs and , whose nodes are related to the selected pillars and are equal in number. We derive the initial node conditions in the following way:
The overall composed graph is a multiplex graph inspired by Mucha et al. (2010); Nicosia et al. (2013). It is composed of intra-frame edges, i.e. self edges connecting each key-point within and each key-point within respectively. Additionally, to perform global matching using context aggregation, inter-frame edges are introduced, i.e. cross edges that connect all nodes of with and vice versa.
Multi-Head Self- and Cross-Attention is all we need to integrate contextual cues intuitively and increase the distinctiveness of each pillar considering its spatial and 3D relationship with other co-visible pillars, such as those that are salient, self-similar or statistically co-occurring Sarlin et al. (2019). An attention function Vaswani et al. (2017) maps a query and a set of key-value pairs to an output, where the query , the keys , and the values are simply vectors. It is defined by:
describes the feature depth analogous to the depth of every node. We apply the Multi-Head Attention function to each node , at state to calculate its next condition . The node conditions are represented as network layers to propagate information throughout the graph:
We alternate the indices for and to perform self and cross attention alternately with increasing depth of throughout the network, where the following applies :
The Multi-Head Attention function is defined by:
with being the concatenation operator. A single head is composed by the attention function as follows:
The same applies for :
All weights are shared throughout all pillars and both graphs within a single layer .
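To make the attention mechanism more concrete, the sketch below implements single-head scaled dot-product attention as defined by Vaswani et al. (2017), i.e. softmax(QKᵀ / √d) V. It deliberately omits the learned projection weights and the concatenation of several heads, and the function names are hypothetical. For self-attention, queries, keys and values come from the nodes of the same graph; for cross-attention, keys and values come from the other graph.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equally sized vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out

nodes_a = [[10.0, 0.0], [0.0, 0.0]]   # node states of the first graph
nodes_b = [[10.0, 0.0], [0.0, 10.0]]  # node states of the second graph
payload_b = [[1.0, 0.0], [0.0, 1.0]]  # values attached to the second graph

# Cross-attention: nodes of the first graph query the second graph.
cross = attention(nodes_a, nodes_b, payload_b)
```

The first query is strongly aligned with the first key and therefore retrieves almost exactly the first value, while the all-zero query attends uniformly to both values.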
Final predictions are computed by the last layer within the Graph Neural Network and designed as single linear projection with shared weights across both graphs and pillars:
3.3 Optimal Transport Layer
Intensive research regarding the optimal transport problem was done for decades Sinkhorn and Knopp (1967); Vallender (1974); Cuturi (2013). In general, there are some requirements to propagate data throughout the network with subsequent back-propagation of the error. To compute a soft-assignment matrix , defining correspondences between pillars in and , the optimal transport layer has to be fully differentiable, parallelizable (run-time), responsible for 2D normalization (row and column-wise) and has to handle invisible key-points (occlusion or non-overlap) sufficiently.
To design an optimal transport plan, the achieved matching matrices, which include all learned contextual information (), were combined in the following way:
This design enables a cross-wise scalar multiplication of all features in with all in , whereby similar features yield higher score entries than dissimilar ones. In order to reconcile non-visible pillars in both frames with an adequate loss function, we concatenate a single learnable weight parameter at the end of the matching matrix shared across all columns and rows ():
Thereby, each key-point that is occluded or not visible in the other point cloud should be assigned to this value and vice versa. Starting from to we perform the Sinkhorn algorithm in the following simple way, which is highly parallelizable and fully differentiable:
with the row and column-wise normalization functions:
We approximate our soft-assignment matrix for iterations with . The overall tensor graph is shown in Fig. 4, including architectural details from the pillar layer to the optimal transport layer.
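The optimal transport layer can be sketched as follows. This is a simplified reading of the description above and not necessarily the exact formulation used in the paper: scores are plain dot products, a single scalar `alpha` (hypothetical name, fixed here instead of learned) fills the extra row and column, and the Sinkhorn iterations alternate plain row and column normalization of the exponentiated scores. The sketch assumes the same number of key-points in both point clouds, so that the augmented matrix is square.

```python
import math

def score_matrix(feats_a, feats_b):
    """Dot product of every feature in the first set with every one in the second."""
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in feats_b] for fa in feats_a]

def add_dustbin(scores, alpha):
    """Append one extra row and column filled with the shared parameter alpha."""
    rows = [row + [alpha] for row in scores]
    rows.append([alpha] * (len(scores[0]) + 1))
    return rows

def sinkhorn(scores, iterations=50):
    """Alternate row and column normalization of exp(scores)."""
    m = max(max(row) for row in scores)  # subtract the maximum for stability
    p = [[math.exp(s - m) for s in row] for row in scores]
    for _ in range(iterations):
        p = [[v / sum(row) for v in row] for row in p]
        col = [sum(row[j] for row in p) for j in range(len(p[0]))]
        p = [[v / col[j] for j, v in enumerate(row)] for row in p]
    return p

feats_a = [[2.0, 0.0], [0.0, 2.0]]
feats_b = [[0.0, 2.0], [2.0, 0.0]]
assignment = sinkhorn(add_dustbin(score_matrix(feats_a, feats_b), alpha=1.0))
best = [max(range(len(row)), key=row.__getitem__) for row in assignment]
```

Every operation is an elementwise exponential, a sum or a division, so the same steps written with tensor operations remain differentiable and parallelizable, as required in the text.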
The overall architecture with its three layer types, Pillar Layer, Graph Neural Network Layer and Optimal Transport Layer, is fully differentiable. Hence, the network is trained in a supervised manner. The ground truth is the set including all index tuples with pillar correspondences in our datasets as well as the ground truth unmatched labels and , with being nonsense. We apply three different losses to compare their results in the ablation study, e.g. the negative log-likelihood loss:
During training we occasionally observed a poor ability to generalize to invisible key-points, especially if non-visible matches were underrepresented within . Hence, following Sinkhorn's methodology, we introduce an extra penalty term exclusively where unmatched points occur within the ground truth with . We observed that it is sufficient to apply it only in row direction:
The overall problem can be seen as a binary classification problem. Hence, we also apply the dual cross entropy loss for an integral penalization of false matches and full reward of correct matches:
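As an illustration of the first of these losses, a negative log-likelihood over an augmented soft-assignment matrix could look like the sketch below. The layout (last row and last column reserved for unmatched key-points) and the absence of any normalization are assumptions made for this example; the penalty term and the dual cross entropy loss are not covered here.

```python
import math

def nll_loss(assignment, matches, unmatched_a, unmatched_b):
    """Negative log-likelihood over an augmented (M+1) x (N+1) assignment matrix.

    matches:     (i, j) index pairs of corresponding key-points
    unmatched_a: indices in the first cloud without a partner
    unmatched_b: indices in the second cloud without a partner
    """
    dust_col = len(assignment[0]) - 1
    dust_row = len(assignment) - 1
    total = 0.0
    for i, j in matches:
        total -= math.log(assignment[i][j])
    for i in unmatched_a:
        total -= math.log(assignment[i][dust_col])
    for j in unmatched_b:
        total -= math.log(assignment[dust_row][j])
    return total

confident = [[0.90, 0.05, 0.05],
             [0.05, 0.05, 0.90],
             [0.05, 0.90, 0.05]]
uncertain = [[0.4, 0.3, 0.3],
             [0.3, 0.3, 0.4],
             [0.3, 0.4, 0.3]]
labels = dict(matches=[(0, 0)], unmatched_a=[1], unmatched_b=[1])
loss_confident = nll_loss(confident, **labels)
loss_uncertain = nll_loss(uncertain, **labels)
```

A confident and correct assignment yields a small loss, whereas nearly uniform probabilities are penalized more strongly.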
4.1 Implementation Details
Model: For key-point extraction, we used variable and to achieve key-points as inputs for the pillar layer. Each point pillar is sampled with up to points using a Euclidean distance threshold of . Our implemented feature depth is . The key-point encoder has five layers with the dimensions set to channels respectively. The graph is composed of self and cross attention layers with heads each. Overall, this results in linear layers. Our model is implemented in PyTorch Paszke et al. (2017) (v1.4), using Python 3.7. A forward pass of the model described above takes an average of 27/26 ms (37/38 fps) on an NVIDIA RTX 2080 Ti / RTX Titan GPU for one pair of point clouds (see 5).
Training details: We process all sequences to of the KITTI Geiger et al. (2012) odometry dataset, using the smoothness function (1) and identify key-points as described in section 3.1. Ground truth correspondences and unmatched sets are generated using the existing odometry ground truth. Both point clouds are transformed into a shared coordinate system. Ground truth correspondences are either key-point pairs with the nearest neighbor distance smaller than or invisible matches, i.e. all pairs with distances greater are unmatched. We ignore all associations with a distance in range to , to ensure variances in resulting features. The entire pre-processing is repeated three times with temporal distances , i.e. consecutive frame index distances. For our training, we use the Adam optimizer Kingma and Ba (2014) with a constant learning rate of and a batch size of 16. We split the dataset for ablation studies into a subset of three training and three evaluation datasets, resulting in for training and and for validation with varying temporal distance pairs, trained for epochs.
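The label generation described above can be sketched as follows, assuming both key-point sets have already been transformed into the shared coordinate system. The threshold values in the usage example are placeholders chosen for illustration only, and `label_keypoints` is a hypothetical helper that labels the key-points of one cloud by a brute-force nearest-neighbor search.

```python
import math

def label_keypoints(kps_a, kps_b, match_dist, unmatch_dist):
    """Split the key-points of cloud A into matched, unmatched and ignored.

    A key-point is matched if its nearest neighbour in cloud B is closer than
    match_dist, unmatched if it is farther than unmatch_dist, and ignored in
    between.
    """
    matches, unmatched, ignored = [], [], []
    for i, p in enumerate(kps_a):
        dists = [math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q))) for q in kps_b]
        j = min(range(len(kps_b)), key=dists.__getitem__)
        if dists[j] < match_dist:
            matches.append((i, j))
        elif dists[j] > unmatch_dist:
            unmatched.append(i)
        else:
            ignored.append(i)
    return matches, unmatched, ignored

kps_a = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
kps_b = [(0.05, 0.0, 0.0), (5.3, 0.0, 0.0)]
# Placeholder thresholds in metres, not the values used in the paper.
matches, unmatched, ignored = label_keypoints(kps_a, kps_b, 0.1, 0.5)
```

Ignoring the ambiguous middle band keeps borderline pairs from being labeled either way, which is the stated purpose of the distance range in the text.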
4.2 Transformation Estimation
Matching Score: To validate the results of our method, we compare our predictions using two different metrics against various state-of-the-art methods. The matching score is the mean percentage of correctly predicted matches relative to the total number of correct matches in the test sequence ( is frame, by totals frames). The matching score is used to compare the number of correct predictions against two benchmark methods (Tab. 1): a simple nearest neighbour search on the 3D coordinates of our key-points based on a k-d tree Bentley (1975), and Point Feature Histograms (PFH Rusu et al. (2009)) descriptors, which provide a high-dimensional representation of our key-points that can be used to find corresponding key-points in the associated frame via a high-dimensional k-d tree search. Based on the correspondences predicted by the different methods, it is possible to deduce a transform estimation using singular value decomposition (SVD).
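One way to read the matching score is sketched below: for every frame pair, the share of ground truth matches that were also predicted is computed, and these percentages are averaged over the sequence. The data layout (one collection of index pairs per frame) and the function name are assumptions made for this example.

```python
def matching_score(predicted, ground_truth):
    """Mean percentage of ground truth matches recovered per frame pair.

    predicted, ground_truth: one collection of (i, j) index pairs per frame.
    Frames without any ground truth match are skipped.
    """
    per_frame = []
    for pred, gt in zip(predicted, ground_truth):
        gt = set(gt)
        if gt:
            per_frame.append(100.0 * len(gt & set(pred)) / len(gt))
    return sum(per_frame) / len(per_frame)

predicted = [[(0, 0), (1, 2)], [(0, 1)]]
ground_truth = [[(0, 0), (1, 1)], [(0, 1)]]
score = matching_score(predicted, ground_truth)
```

In this toy sequence the first frame pair recovers one of two matches and the second recovers its single match, so the score averages 50 % and 100 %.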
Transformation Error is calculated by comparing the ground truth odometry poses of the KITTI odometry dataset with the transform estimation based on the predicted correspondences by StickyPillars: . describes the transformation difference between ground truth and estimation for two related frames. Based on this, we derive the translational and rotational error values used in Geiger et al. (2012):
We estimate the transformation based on predicted correspondences and subsequent SVD from nearest neighbour search (NN Muja and Lowe (2009)), the Point Feature Histograms (PFH Rusu et al. (2009)), StickyPillars (OURS) and also from all the possible valid matches in the ground truth labels (VM). We use VM to set a baseline for the transformation error if all correspondences were found and correct. Furthermore, our results are compared to the Iterative Closest Point (ICP Zhang (1994)) algorithm, which is applied to our source and target key-points and exploited to iteratively refine the intermediate rigid transformation. The results for our validation metrics for various frame distances 1, 5 and 10 (, and ) respectively can be found in Tab. 1.
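The transformation error can be illustrated with a short sketch that follows the usual KITTI convention: the difference transform is formed from the ground truth and the estimated pose, the translational error is the norm of its translation part, and the rotational error is the rotation angle recovered from the trace of its rotation part. The order of the product and the helper names are assumptions made for this example; poses are 4x4 homogeneous matrices given as nested lists.

```python
import math

def invert_rigid(T):
    """Invert a 4x4 rigid transform using R^T and -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    inv_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [inv_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul4(A, B):
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose_errors(T_gt, T_est):
    """Return (translational error, rotational error in radians)."""
    delta = matmul4(invert_rigid(T_gt), T_est)
    trans = math.sqrt(sum(delta[i][3] ** 2 for i in range(3)))
    cos_angle = (delta[0][0] + delta[1][1] + delta[2][2] - 1.0) / 2.0
    rot = math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against round-off
    return trans, rot

identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
c, s = math.cos(0.1), math.sin(0.1)
estimate = [[c, -s, 0.0, 3.0], [s, c, 0.0, 4.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
trans_err, rot_err = pose_errors(identity, estimate)
```

Here the estimate deviates from the ground truth by a translation of length 5 and a rotation of 0.1 rad about the z-axis, and both values are recovered by the two error terms.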
[Table 1: rows include Translational Error and Rotational Error; cell values not recoverable]
[Table 2: rows include nll + penalty and dual cross entropy; cell values not recoverable]
Based on our robust feature matching, the results show that our method can also be used to estimate the ego motion from features extracted from LiDAR point cloud scans. We are able to find corresponding features with a high matching score even in scans that are far apart. We therefore reach the highest matching score in all experiments and hence also the lowest translational and rotational error. For frame differences of 1 and 5, we are even close to being on par with VM, which uses all valid correspondences from the ground truth to estimate the desired transformation. Using nearest neighbour correspondences with SVD outperforms ICP in because we solely use valid matches to perform transformation estimation. For higher distances it fails.
4.3 Ablation Study
Table 2 shows a confusion matrix with precision and accuracy results of our model trained on the subsets and validated on . When using , we saw many nearly equally distributed probabilities for unmatched key-points, which expresses uncertainties. and work slightly better than , because they contain additional penalties for unmatched key-points compared to . Nevertheless, our model has an exceptional matching performance, independent of the underlying loss. We see constant biases towards the underlying training data, i.e. the model trained on performs best on . Still, all differences are minor, indicating good generalization.
We present a novel model for point cloud registration using Deep Learning. We introduce a three-stage model composed of a point cloud encoder, an attention-based graph and an optimal transport algorithm. Our neural network performs local and global feature matching at once using contextual aggregation. We achieve significantly better results compared to the state of the art in a very robust manner. We have shown our results on the KITTI odometry dataset.
We would like to thank Valeo, especially Driving Assistance Research Kronach, Germany, for making this work possible. Further, we want to thank Prof. Patrick Mäder from the Ilmenau University of Technology and the Institute for Software Engineering for Safety-Critical Systems (SECSY) for supporting this work.
- Bentley (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517.
- Besl and McKay (1992). Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, Vol. 1611, pp. 586–606.
- Bian et al. (2017). Gms: grid-based motion statistics for fast, ultra-robust feature correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4181–4190.
- Cech et al. (2010). Efficient sequential correspondence selection by cosegmentation. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1568–1581.
- Chen et al. (2017). Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915.
- Cuturi (2013). Sinkhorn distances: lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300.
- DeTone et al. (2018). Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236.
- Dusmanu et al. (2019). D2-net: a trainable cnn for joint detection and description of local features. arXiv preprint arXiv:1905.03561.
- Engel et al. (2019). DeepLocalization: landmark-based self-localization with deep neural networks. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 926–933.
- Fischler and Bolles (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
- Geiger et al. (2012). Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lang et al. (2019). Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
- Li and Lee (2019). USIP: unsupervised stable interest point detection from 3d point clouds. In The IEEE International Conference on Computer Vision (ICCV).
- Li et al. (2019). LO-net: deep real-time lidar odometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8473–8482.
- Lin and Zhang (2019). Loam_livox: a fast, robust, high-precision lidar odometry and mapping package for lidars of small fov. arXiv preprint arXiv:1909.06700.
- Lowe (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110.
- Lu et al. (2019). DeepVCP: an end-to-end deep neural network for point cloud registration. In The IEEE International Conference on Computer Vision (ICCV).
- Milz et al. (2019). Points2Pix: 3d point-cloud to image translation using conditional gans. In Pattern Recognition: 41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, September 10–13, 2019, Proceedings, Vol. 11824, pp. 387.
- Mucha et al. (2010). Community structure in time-dependent, multiscale, and multiplex networks. science 328 (5980), pp. 876–878.
- Muja and Lowe (2009). Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1) 2 (331-340), pp. 2.
- Nicosia et al. (2013). Growing multiplex networks. Physical review letters 111 (5), pp. 058701.
- Ono et al. (2018). LF-net: learning local features from images. In Advances in neural information processing systems, pp. 6234–6244.
- Paszke et al. (2017). Automatic differentiation in pytorch. NIPS Workshops.
- Peyré and Cuturi (2019). Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607.
- Qi et al. (2017a). Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660.
- Qi et al. (2017b). Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108.
- Raguram et al. (2008). A comparative analysis of ransac techniques leading to adaptive real-time random sample consensus. In European Conference on Computer Vision, pp. 500–513.
- Revaud et al. (2019). R2d2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195.
- Rusinkiewicz and Levoy (2001). Efficient variants of the icp algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152.
- Rusu et al. (2009). Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE international conference on robotics and automation, pp. 3212–3217.
- Sarlin et al. (2019). SuperGlue: learning feature matching with graph neural networks. arXiv preprint arXiv:1911.11763.
- Sattler et al. (2009). SCRAMSAC: improving ransac’s efficiency with a spatial consistency filter. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2090–2097.
- Shan and Englot (2018). LeGO-loam: lightweight and ground-optimized lidar odometry and mapping on variable terrain. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4758–4765.
- Simon et al. (2019). Complexer-yolo: real-time 3d object detection and tracking on semantic point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.
- Simon et al. (2018). Complex-yolo: an euler-region-proposal for real-time 3d object detection on point clouds. In European Conference on Computer Vision, pp. 197–209.
- Sinkhorn and Knopp (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics 21 (2), pp. 343–348.
- Tuytelaars and Van Gool (2000). Wide baseline stereo matching based on local, affinely invariant regions. In BMVC, Vol. 412.
- Vallender (1974). Calculation of the wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications 18 (4), pp. 784–786.
- Vaswani et al. (2017). Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
- Yi et al. (2016). Lift: learned invariant feature transform. In European Conference on Computer Vision, pp. 467–483.
- Zhang and Singh (2014). LOAM: lidar odometry and mapping in real-time. In Proceedings of Robotics: Science and Systems Conference.
- Zhang (1994). Iterative point matching for registration of free-form curves and surfaces. International journal of computer vision 13 (2), pp. 119–152.
- Zhou and Tuzel (2018). Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.