3D Local Features for Direct Pairwise Registration
Abstract
We present a novel, data-driven approach for solving the problem of registration of two point cloud scans. Our approach is direct in the sense that a single pair of corresponding local patches already provides the necessary transformation cue for the global registration. To achieve that, we first endow the state-of-the-art PPF-FoldNet [19] autoencoder (AE) with a pose-variant sibling, where the discrepancy between the two leads to pose-specific descriptors. Based upon this, we introduce RelativeNet, a relative pose estimation network to assign correspondence-specific orientations to the keypoints, eliminating any local reference frame computations. Finally, we devise a simple yet effective hypothesize-and-verify algorithm to quickly use the predictions and align two point sets. Our extensive quantitative and qualitative experiments suggest that our approach outperforms the state of the art in challenging real datasets of pairwise registration and that augmenting the keypoints with local pose information leads to better generalization and a dramatic speed-up.
1 Introduction
Learning and matching local features have fueled computer vision for many years. Scholars first handcrafted their descriptors [37] and, with the advances in deep learning, devised data-driven methods that are more reliable, robust and practical [35, 55]. These developments in the image domain quickly carried over to 3D, where 3D descriptors [45, 56, 20] have been developed.
Having 3D local features at hand is usually seen as an intermediate step towards solving more challenging 3D vision problems. One of the most prominent of such problems is 3D pose estimation, where the six degree-of-freedom (6DoF) rigid transformations relating 3D data pairs are sought. This problem is also known as pairwise 3D registration. While the quality of the intermediary descriptors is undoubtedly an important aspect towards good registration performance [26], directly solving the final problem at hand is certainly more critical. Unfortunately, contrary to 2D descriptors, the current deeply learned 3D descriptors [56, 20, 19] are still not tailored for the task we consider: they lack any kind of local orientation assignment, and hence any subsequent pose estimator is forced to settle for nearest neighbor queries and exhaustive RANSAC iterations to robustly compute the aligning transformation. This is neither reliable nor computationally efficient.
In this paper, we argue that descriptors that are good for pairwise registration should also provide cues for direct computation of local rotations, and we propose a novel, robust and end-to-end algorithm for local feature based 3D registration of two point clouds (see Fig. 1). We begin by augmenting the state-of-the-art unsupervised, 6DoF-invariant local descriptor PPF-FoldNet [19] with a deeply learned orientation. Via our pose-variant orientation learning, we can decouple the 3D structure from 6DoF motion. This can result in features solely explaining the pose variability up to a reasonable approximation. Our network architecture is shown in Fig. 2. We then make the observation that locally good registration leads to good global alignment and vice versa. Based on that, we propose a simple yet effective hypothesize-and-verify scheme to find the optimal alignment conditioned on an initial correspondence pool that is simply retrieved from the (mutually) closest nearest neighbors in the latent space.
For the aforementioned idea to work well, the local orientations assigned to our keypoints (sampled with spatial uniformity) should be extremely reliable. Unfortunately, finding such repeatable orientations of local patches immediately calls for local reference frames (LRF), which are by themselves a large source of ambiguity and error [42]. Therefore, we choose to learn to estimate relative transformations instead of aiming to find a canonical frame. We find the relative motion to be far more robust and easier to train for than an LRF. To this end, we introduce RelativeNet, a specialized architecture for relative pose estimation.
We train all of our networks end-to-end by combining three loss functions: 1) a Chamfer reconstruction loss for the unsupervised PPF-FoldNet [19], 2) weakly-supervised relative pose cues for the transformation-variant local features, and 3) a feature-consistency loss which enforces nearby points to give rise to nearby features in the embedding space. We evaluate our method extensively on two widely accepted benchmark datasets, 3DMatch [56] and Redwood [13], on the important tasks of feature matching and geometric registration. In our assessments, we improve the state of the art in pairwise registration while reducing the runtime 20-fold. This dramatic improvement in both aspects stems from the weak supervision, which makes the local features capable of yielding rotation estimates and thereby eases the job of the final transformation estimator. The interaction of the three multi-task losses in return enhances all predictions. Overall, our contributions are:

An invariant + pose-variant network for local feature learning, designed to generate pose-related descriptors that are insensitive to geometrical variations.

A multi-task training scheme which can assign orientations to matching pairs and simultaneously strengthen the learned descriptors for finding better correspondences.

An improvement of geometric registration performance on a given correspondence set using direct network predictions, both in terms of speed and accuracy.
2 Related Work
Local descriptors
There has been a long history of handcrafted features, designed by studying the geometric properties of local structures. FPFH [44], SHOT [45], USC [48] and Spin Images [30] all use different ideas to capture these properties. Unfortunately, the challenges of real data, such as the presence of noise, missing structures, occlusions or clutter, significantly harm such descriptors [26]. Recent trends in data-driven approaches have encouraged researchers to harness deep learning to surmount these nuisances. Representative works include 3DMatch [56], PPFNet [20], CGF [33], 3DFeatNet [29], PPF-FoldNet [19] and 3D point-capsule networks [57], all outperforming the handcrafted alternatives by a large margin. While descriptors in 2D are typically complemented by the useful information of local orientation, derived from the local image appearance [37], the nature of 3D data renders the task of finding a unique and consistent local coordinate frame far more challenging [24, 42]. Hence, none of the aforementioned works were able to attach local orientation information to 3D patches. This motivates us to jointly consider descriptor extraction and direct 3D alignment.
Pairwise registration
The approaches to pairwise registration fork into two main research directions.
The first school tries to find an alignment of two point sets globally. Iterative closest point (ICP) [2] and its descendants [47, 53, 2, 36] alternate between hypothesizing a correspondence set and minimizing the 3D registration error by optimizing for the rigid pose. Despite its success, making ICP outlier-robust is considered, even today, to be an open problem [31, 22, 50, 11]. Practical applications of ICP also incorporate geometric, photometric or temporal consistency cues [40] or odometry constraints [58], whenever available. ICP is sensitive to initialization and is known to tolerate only a limited initial misalignment [5, 3].
Another family branches off from Random Sample Consensus (RANSAC) [23]. These works hypothesize a set of putative keypoint matches and attempt to discard the erroneous ones via a subsequent rejection. The discovered inliers can then be used in a Kabsch-like [32] algorithm to estimate the optimal transformation. A notable drawback of RANSAC is the huge number of trials required, especially when the inlier ratio is low and the expected confidence of finding a correct subset of inliers is high [12]. This has encouraged researchers to propose accelerations to the original framework, and the literature is by now filled with an abundance of RANSAC-derived methods [16, 17, 34, 15], unified under the USAC framework [43].
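To make the cost argument concrete, the sketch below evaluates the standard RANSAC trial-count estimate, N = log(1 - p) / log(1 - w^s), for inlier ratio w, sample size s and confidence p. The function name and the example inlier ratios are illustrative assumptions and are not taken from the paper; the comparison between s = 3 and s = 1 only illustrates why hypotheses that need fewer correct matches are cheaper to find by sampling.

```python
import math

def ransac_trials(inlier_ratio, sample_size, confidence=0.99):
    """Trials needed so that, with probability `confidence`, at least one
    random sample of `sample_size` matches contains only inliers.
    `inlier_ratio` is assumed to lie in [0, 1]."""
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample == 0.0:
        return math.inf          # no all-inlier sample can ever be drawn
    if p_good_sample == 1.0:
        return 1                 # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

# Hypothetical inlier ratios; 3-point samples versus single-match samples.
for w in (0.5, 0.1, 0.05):
    print(w, ransac_trials(w, 3), ransac_trials(w, 1))
```

The required number of trials grows roughly as w^(-s), so at low inlier ratios a minimal sample of three correspondences is far more expensive to hit than a single correct match.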
Even though RANSAC is now a well-developed tool, the heuristics associated with it have motivated researchers to look for more direct detection and pose estimation approaches, hopefully alleviating the flaws of feature extraction and randomized inlier maximization. Recently, the geometric hashing of point pair features (PPF) [4, 21, 6, 27, 49] was found to be the most reliable solution [28]. Another alternative is the 4-point congruent sets (4PCS) approach [1, 9], made more efficient by Super4PCS [38] and generalized by [39]. As we will elaborate in the upcoming sections, our approach lies at the intersection of local feature learning and direct pairwise registration, inheriting the good traits of both.
3 Method
Purely geometric local patches typically carry two pieces of information: (1) 3D structure, summarized by the sample points themselves $\mathbf{P} = \{\mathbf{p}_i\}$ where $\mathbf{p} = [x, y, z]^\top$, and (2) motion, which in our context corresponds to the 3D transformation or the pose $\mathbf{T}_i \in SE(3)$ holistically orienting and spatially positioning the point set $\mathbf{P}$:
$$SE(3) = \left\{ \mathbf{T} \in \mathbb{R}^{4 \times 4} : \mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} \right\} \quad (1)$$
where $\mathbf{R} \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^3$. A point set $\mathbf{P}_i$, representing a local patch, is generally viewed as a transformed replica of its canonical version $\mathbf{P}_i^c$: $\mathbf{P}_i = \mathbf{T}_i \otimes \mathbf{P}_i^c$. Oftentimes, finding such a canonical absolute pose from a single local patch involves computing local reference frames [45], which are known to be unreliable [42]. We instead base our idea on the premise that a good local (patch-wise) pose estimation leads to a good global rigid alignment of two fragments. First, by decoupling the pose component from the structure information, we devise a data-driven predictor network capable of regressing the pose for arbitrary patches and showing good generalization properties. Fig. 2 depicts our architectural design. In a following part, we tackle the problem of relative pose labeling without the need for a canonical frame computation.
Generalized pose prediction
A naive way to achieve tolerance to 3D structure is to train the network for pose prediction conditioned on a database of input patches and leave the invariance up to the network [56, 20]. Unfortunately, networks trained in this manner either demand a very large collection of unique local patches or simply lack generalization. To alleviate this drawback, we opt to eliminate the structural components by training an invariant-equivariant network pair and using the intermediary latent space arithmetic. We characterize an equivariant function $\Psi$ as [51]:
$$\Psi(\mathbf{P}) = \Psi(\mathbf{T} \otimes \mathbf{P}^c) = g(\mathbf{T}) \circ \Psi(\mathbf{P}^c) \quad (2)$$
where $g(\cdot)$ is a function dependent only upon the pose. When $g(\mathbf{T}) = \mathbf{I}$, $\Psi$ is said to be invariant and, for the scope of our application, any input $\mathbf{P}$ leads to the outcome of the canonical one, $\Psi(\mathbf{P}) = \Psi(\mathbf{P}^c)$. Note that eq. 2 is more general than Cohen's definition [18] as the group element is not restricted to act linearly. Within the body of this paper the term equivariant will loosely refer to such quasi-equivariance or covariance. When $g(\mathbf{T}) \neq \mathbf{I}$, we further assume that the action of $\mathbf{T}$ can be approximated by some additive linear operation:
$$g(\mathbf{T}) \circ \Psi(\mathbf{P}^c) \approx h(\mathbf{T}) + \Psi(\mathbf{P}^c) \quad (3)$$
$h(\mathbf{T})$ being a probably highly nonlinear function of $\mathbf{T}$. By plugging eq. 3 into eq. 2, we arrive at:
$$\Psi(\mathbf{P}) - \Psi(\mathbf{P}^c) \approx h(\mathbf{T}) \quad (4)$$
that is, the difference in the latent space can approximate the pose up to a nonlinearity, $h$. We approximate the inverse of $h$ by a four-layer MLP network $\rho(\cdot) \triangleq h^{-1}(\cdot)$ and propose to regress the motion (rotational) terms:
$$\rho(\mathbf{f}) = [\mathbf{R} \mid \mathbf{t}] \quad (5)$$
where $\mathbf{f} = \Psi(\mathbf{P}) - \Psi(\mathbf{P}^c)$. Note that $\mathbf{f}$ solely explains the motion and hence can generalize to any local patch structure, leading to a powerful pose predictor under our mild assumptions.
The manifolds formed by deep networks are found to be sufficiently close to Euclidean flatness. This rather flat nature has already motivated prominent works such as GANs [25] to use simple latent space arithmetic to modify faces, objects, etc. Our assumption in eq. 3 follows a similar premise. Semantically speaking, by subtracting out the structure-specific information from point cloud features, we end up with descriptors that are pose/motion-focused.
Relative pose estimation
Note that $\rho(\cdot)$ can be directly used to regress the absolute pose to a canonical frame. Yet, due to the aforementioned difficulties of defining a unique local reference frame, this is not advised [42]. Since our scenario considers a pair of scenes, we can safely estimate a relative pose rather than the absolute one, removing the prerequisite of a well-estimated LRF. This also helps us to easily generate the labels needed for training. Thus, we model $\rho(\cdot)$ by a relative pose predictor, RelativeNet, as shown in Fig. 2.
We further make the observation that correspondent local structures of two scenes that are well registered under a rigid transformation $\mathbf{T}_{ij}$ also align well with $\mathbf{T}_{ij}$. As a result, the relative pose between local patches can easily be obtained by calculating the relative pose between the fragments, and vice versa. We will use these ideas in § 3.1 to design our networks, and in § 3.2 explain how to train them.
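The labeling idea above can be sketched as follows: if each fragment comes with a ground-truth pose into a common world frame, the relative transformation between two fragments also serves as the training label for every corresponding patch pair. The convention that `T_i` and `T_j` map fragment coordinates to world coordinates, and all function names, are assumptions made for this illustration.

```python
def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def rigid_inverse(T):
    """Inverse of a rigid transform [R t; 0 1], namely [R^T  -R^T t; 0 1]."""
    Rt = [[T[c][r] for c in range(3)] for r in range(3)]   # transpose of R
    t = [T[r][3] for r in range(3)]
    t_inv = [-sum(Rt[r][k] * t[k] for k in range(3)) for r in range(3)]
    return [Rt[r] + [t_inv[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_pose(T_i, T_j):
    """Transform taking coordinates of fragment i into the frame of fragment j,
    assuming T_i and T_j map each fragment into a common world frame."""
    return mat_mul(rigid_inverse(T_j), T_i)

def rotation_label(T_ij):
    """3x3 rotation block, usable as the relative rotation label of a patch pair."""
    return [row[:3] for row in T_ij[:3]]

# Toy poses: fragment i sits at the world origin, fragment j is rotated by
# 90 degrees about z and shifted along x.
T_i = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_j = [[0.0, -1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
T_ij = relative_pose(T_i, T_j)
print(rotation_label(T_ij))
```

Because patches are centered before being encoded, only the rotation block of `T_ij` would be needed as a label under these assumptions.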
3.1 Network Design
To realize our generalized relative pose prediction, we need to implement three key components: the invariant network $\Psi(\mathbf{P}^c)$ where $g(\mathbf{T}) = \mathbf{I}$, the network $\Psi(\mathbf{P})$ that varies as a function of the input, and the MLP $\rho(\cdot)$. The recent PPF-FoldNet [19] autoencoder is well suited to model the invariant network, as it is unsupervised, works on point patches and achieves true invariance thanks to the point pair features (PPF) fully marginalizing the motion terms. Interestingly, keeping the network architecture identical to PPF-FoldNet, if we were to substitute the PPF part with the 3D points themselves ($\mathbf{P}$), the intermediate feature would depend upon both structure and pose information. We coin this version PC-FoldNet and use it as our equivariant network $\Psi(\mathbf{P})$. We rely on PPF-FoldNet and PC-FoldNet to learn rotation-invariant and rotation-variant features, respectively. They share the same architecture but take in different encodings of local patches, as shown in Fig. 3. Taking the difference of the encoder outputs of the two networks, i.e. the latent features of PPF-FoldNet and PC-FoldNet respectively, results in new features which specialize almost exclusively on the pose (motion) information. Those features are subsequently fed into the generalized pose predictor RelativeNet to recover the rigid relative transformation. The overall architecture of our complete relative pose prediction is illustrated in Fig. 2.
3.2 Multi-Task Training Scheme
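We train our networks with multiple cues, supervised and unsupervised. In particular, our loss function is composed of three parts:
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{pose} + \lambda_2 \mathcal{L}_{feat} \quad (6)$$
$\mathcal{L}_{rec}$, $\mathcal{L}_{pose}$ and $\mathcal{L}_{feat}$ are the reconstruction, pose prediction and feature consistency losses, respectively. For the sake of clarity, we omit the function arguments.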
Reconstruction loss
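$\mathcal{L}_{rec}$ reflects the reconstruction fidelity of PC/PPF-FoldNet. To enable the encoders of PPF/PC-FoldNet to generate good features for pose regression, as well as for finding robust local correspondences, we follow the steps in PPF-FoldNet [19] and use the Chamfer distance as the metric to train both of the autoencoders in an unsupervised manner:
$$\mathcal{L}_{rec} = \frac{1}{2} \left( d_{cham}(\mathbf{P}, \hat{\mathbf{P}}) + d_{cham}(\mathbf{F}_{ppf}, \hat{\mathbf{F}}_{ppf}) \right) \quad (7)$$
$$d_{cham}(\mathbf{X}, \hat{\mathbf{X}}) = \max \left\{ \frac{1}{|\mathbf{X}|} \sum_{\mathbf{x} \in \mathbf{X}} \min_{\hat{\mathbf{x}} \in \hat{\mathbf{X}}} \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2, \; \frac{1}{|\hat{\mathbf{X}}|} \sum_{\hat{\mathbf{x}} \in \hat{\mathbf{X}}} \min_{\mathbf{x} \in \mathbf{X}} \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2 \right\} \quad (8)$$
The $\hat{\cdot}$ operator denotes the reconstructed (estimated) set and $\mathbf{F}_{ppf}$ the PPFs of the points, computed identically as in [19].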
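A minimal sketch of a Chamfer distance between a point set and its reconstruction is given below. It assumes the symmetric "maximum of the two mean nearest-neighbor distances" variant common to FoldingNet-style autoencoders; whether this matches the exact form of eq. 8 should be checked against [19]. The function name and the brute-force search are illustrative choices, not the authors' implementation.

```python
import math

def chamfer_distance(X, X_hat):
    """Symmetric Chamfer distance between two non-empty point sets,
    taken here as the larger of the two mean nearest-neighbor distances."""
    def mean_nearest(A, B):
        # Average distance from each point of A to its closest point in B.
        return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)
    return max(mean_nearest(X, X_hat), mean_nearest(X_hat, X))

patch = [(0.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(patch, reconstruction))  # 0.5: the extra point is penalized
```

The same function applies unchanged to 4-dimensional PPF vectors, since `math.dist` accepts points of any matching dimension.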
Pose prediction loss
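The two local patches of a correspondence are centered and normalized before being sent into PC/PPF-FoldNets. This cancels the translational part $\mathbf{t} \in \mathbb{R}^3$. The main task of our pose prediction loss is then to enable our RelativeNet to predict the relative rotation $\mathbf{R} \in SO(3)$ between the given patches $(i, j)$. Hence, a natural choice for $\mathcal{L}_{pose}$ describes the discrepancy between the predicted rotation $\mathbf{q}^* = \rho(\mathbf{f})$ and the ground truth rotation $\mathbf{q}$:
$$\mathcal{L}_{pose} = \lVert \mathbf{q} - \mathbf{q}^* \rVert_2 \quad (9)$$
Note that we choose to parameterize the spatial rotations by quaternions $\{\mathbf{q} \in \mathbb{R}^4 : \mathbf{q}^\top \mathbf{q} = 1\}$, the Hamiltonian 4-tuples [10, 8], due to: 1) the decreased number of parameters to regress, and 2) the lightweight projection operator, vector normalization.
The translation $\mathbf{t}^*$, conditioned on the hypothesized pair $(\mathbf{p}_1, \mathbf{p}_2)$ and the predicted rotation $\mathbf{q}^*$, can be computed by:
$$\mathbf{t}^* = \mathbf{p}_1 - \mathbf{R}^* \mathbf{p}_2 \quad (10)$$
where $\mathbf{R}^*$ corresponds to the matrix representation of $\mathbf{q}^*$. The L2 error of eq. 9 is easier to train with, at a negligible loss compared to the geodesic metric.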
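The two operations described above, projecting a regressed 4-vector onto a unit quaternion and recovering the translation from a single matched keypoint pair, can be sketched as follows. The (w, x, y, z) component order, the function names and the convention p1 = R p2 + t are assumptions made for this illustration.

```python
import math

def normalize(q):
    """Project a raw, non-zero 4-vector onto the unit sphere
    (the lightweight quaternion projection)."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion given as (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def translation_from_match(p1, p2, R):
    """Translation t such that p1 = R p2 + t for one matched keypoint pair."""
    Rp2 = [sum(R[r][c] * p2[c] for c in range(3)) for r in range(3)]
    return tuple(p1[r] - Rp2[r] for r in range(3))

# An unnormalized network output standing in for a 90 degree rotation about z.
raw = (2 * math.cos(math.pi / 4), 0.0, 0.0, 2 * math.sin(math.pi / 4))
R = quat_to_matrix(normalize(raw))
t = translation_from_match((2.0, 3.0, 4.0), (1.0, 0.0, 0.0), R)
print(t)  # approximately (2, 2, 4)
```

A single correspondence therefore fixes all six degrees of freedom once its rotation is predicted, which is what allows one hypothesis per match.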
Feature consistency loss
Unlike [19], our RelativeNet requires pairs of local patches for training. Thus, we can additionally make use of pair information as an extra weak supervision signal to further facilitate the training of our PPF-FoldNet. We hypothesize that such guidance would improve the quality of intermediate latent features that were previously trained in a fully unsupervised fashion. Specifically, correspondent features subject to noise, missing data or clutter would generate a high reconstruction loss, causing the local features to be different even for the same local patches. This new information encourages the features extracted from identical patches to live as close as possible in the embedded space, which is extremely beneficial since we establish local correspondences by searching for nearest neighbors in the feature space. The feature consistency loss reads:
$$\mathcal{L}_{feat} = \sum_{(\mathbf{p}_i, \mathbf{p}'_i) \in \Gamma} \lVert \mathbf{f}_{\mathbf{p}_i} - \mathbf{f}_{\mathbf{p}'_i} \rVert_2 \quad (11)$$
$\Gamma$ represents the set of correspondent local patches and $\mathbf{f}_{\mathbf{p}}$ is the feature extracted at $\mathbf{p}$ by the PPF-FoldNet.
3.3 Hypotheses Selection
The final stage of our algorithm involves selecting the best hypothesis among the many produced, one per sample point. The full 6DoF pose is parameterized by the predicted 3DoF orientation (eq. 9) and the translation (eq. 10) conditioned on matching points (3DoF). For our approach, having a set of correspondences is equivalent to having a pre-generated set of transformation hypotheses, since each keypoint is associated with an LRF. Note that this is contrary to the standard RANSAC approaches, where correspondences parameterize the pose and establishing correspondences only leads to hypotheses that still have to be verified. Our small number of hypotheses, already linear in the number of correspondences, makes it possible to exhaustively evaluate the putative matching pairs for verification. We further refine the estimate by recomputing the transformation using all the surviving inliers. The hypothesis with the highest score is kept as the final decision.
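The exhaustive verification described above can be sketched as follows, under stated assumptions: each putative match (p, q) comes with a predicted rotation R mapping the first fragment onto the second, the score of a hypothesis is its inlier count, and the inlier threshold is a free parameter. The names are hypothetical, and the final refinement on the surviving inliers (e.g. a Kabsch-style fit) is omitted.

```python
import math

def apply_pose(R, t, p):
    """Rigidly transform a 3D point: R p + t."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))

def select_hypothesis(matches, rotations, inlier_threshold):
    """Score one (R, t) hypothesis per match and return the best
    as (R, t, inlier_indices). Cost is quadratic in the number of matches."""
    best = None
    for (p, q), R in zip(matches, rotations):
        Rp = apply_pose(R, (0.0, 0.0, 0.0), p)
        t = tuple(q[i] - Rp[i] for i in range(3))      # translation from this match
        inliers = [m for m, (a, b) in enumerate(matches)
                   if inlier_threshold > math.dist(apply_pose(R, t, a), b)]
        if best is None or len(inliers) > len(best[2]):
            best = (R, t, inliers)
    return best

# Toy data: a 90 degree rotation about z plus a shift along x, one wrong
# rotation prediction (index 1) and one outlier match (index 4).
Rz = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
Id = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
matches = [((1, 0, 0), (1, 1, 0)), ((0, 1, 0), (0, 0, 0)),
           ((0, 0, 1), (1, 0, 1)), ((1, 1, 1), (0, 1, 1)),
           ((2, 0, 0), (5, 5, 5))]
rotations = [Rz, Id, Rz, Rz, Id]
R_best, t_best, inliers = select_hypothesis(matches, rotations, 0.1)
print(t_best, inliers)
```

In this toy example the hypothesis generated from a single correct match already explains every inlier, which mirrors the claim that one good correspondence suffices.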
Fig. 4 shows that both translational and rotational components of our hypothesis set are tighter and have smaller deviation from the true pose as opposed to the standard RANSAC hypotheses.
Table 1. Fragment matching recall on the 3DMatch benchmark [56].
Method  |  Kitchen  |  Home 1  |  Home 2  |  Hotel 1  |  Hotel 2  |  Hotel 3  |  Study  |  MIT Lab  |  Average
3DMatch [56]  |  0.5751  |  0.7372  |  0.7067  |  0.5708  |  0.4423  |  0.6296  |  0.5616  |  0.5455  |  0.5961
CGF [33]  |  0.4605  |  0.6154  |  0.5625  |  0.4469  |  0.3846  |  0.5926  |  0.4075  |  0.3506  |  0.4776
PPFNet [20]  |  0.8972  |  0.5577  |  0.5913  |  0.5796  |  0.5769  |  0.6111  |  0.5342  |  0.6364  |  0.6231
FoldingNet [54]  |  0.5949  |  0.7179  |  0.6058  |  0.6549  |  0.4231  |  0.6111  |  0.7123  |  0.5844  |  0.613
PPF-FoldNet [19]  |  0.7352  |  0.7564  |  0.625  |  0.6593  |  0.6058  |  0.8889  |  0.5753  |  0.5974  |  0.6804
Ours  |  0.7964  |  0.8077  |  0.6971  |  0.7257  |  0.6731  |  0.9444  |  0.6986  |  0.6234  |  0.7458
4 Experiments
We train our method using the training split of the de facto 3DMatch benchmark dataset [56], which contains many real local patch pairs with different structure and pose, captured by Kinect cameras. We then conduct evaluations on its own test set and on the challenging synthetic Redwood Benchmark [13]. We assess our performance against the state-of-the-art data-driven algorithms as well as the established handcrafted methods of the RANSAC family on the tasks of feature matching and geometric registration.
Implementation details
We represent a local patch by randomly collecting K points around a reference one within a fixed-radius vicinity (specified in cm). To provide relative pose supervision, we associate each patch with a pose fetched from the ground truth relative transformations. Local correspondences are established by finding the mutually closest neighbors in the feature space. Our implementation is based on PyTorch [41], a widely used deep learning framework.
4.1 Evaluations on 3DMatch Benchmark [56]
How good are our local descriptors?
We begin by testing our local features on the fragment matching task, which reflects how many good correspondence sets can be found by the specific features. A fragment pair is said to match if its ratio of true correspondences reaches a preset threshold; see [19, 20] for details. In Tab. 1 we report the recall of various data-driven descriptors, 3DMatch [56], CGF [33], PPFNet [20], FoldingNet [54] and PPF-FoldNet [19], as well as ours. It is remarkable to see that our network outperforms the supervised PPFNet [20] by about 12 percentage points (0.7458 vs. 0.6231 average recall) and the unsupervised PPF-FoldNet [19] by about 6.5 points (0.7458 vs. 0.6804). Note that we are architecturally identical to PPF-FoldNet, and hence the improvement is enabled primarily by the multi-task training signals, interacting towards a better minimum and a decoupling of shape and pose within the architecture. Thanks to the double-siamese structure of our network, we can provide both rotation-invariant features like [19] and upright ones, similar to [20].
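The fragment matching criterion can be sketched as below: a pair counts as matched when the fraction of putative correspondences that agree with the ground truth transformation reaches a ratio threshold, and recall is the fraction of overlapping pairs that are matched. Both thresholds are left as parameters, since their values are not stated here; the function names are hypothetical.

```python
import math

def transform(R, t, p):
    """Rigidly transform a 3D point: R p + t."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))

def true_correspondence_ratio(matches, R_gt, t_gt, dist_threshold):
    """Fraction of matches (p, q) with q within dist_threshold of R_gt p + t_gt."""
    if not matches:
        return 0.0
    good = sum(1 for p, q in matches
               if dist_threshold > math.dist(transform(R_gt, t_gt, p), q))
    return good / len(matches)

def fragment_matching_recall(fragment_pairs, dist_threshold, ratio_threshold):
    """Share of ground-truth overlapping pairs whose ratio reaches ratio_threshold.
    Each entry of fragment_pairs is (matches, R_gt, t_gt)."""
    matched = sum(
        1 for matches, R_gt, t_gt in fragment_pairs
        if true_correspondence_ratio(matches, R_gt, t_gt, dist_threshold)
        >= ratio_threshold)
    return matched / len(fragment_pairs)

Id = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
zero = (0, 0, 0)
pair_a = ([((0, 0, 0), (0, 0, 0)), ((1, 0, 0), (1, 0, 0.05)),
           ((0, 1, 0), (3, 3, 3)), ((0, 0, 1), (9, 9, 9))], Id, zero)
pair_b = ([((0, 0, 0), (4, 4, 4)), ((1, 0, 0), (7, 7, 7))], Id, zero)
print(fragment_matching_recall([pair_a, pair_b], 0.1, 0.25))
```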
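Table 2. Geometric registration results on the 3DMatch benchmark (Rec.: recall, Prec.: precision).
Group  |  Method  |  Metric  |  Kitchen  |  Home 1  |  Home 2  |  Hotel 1  |  Hotel 2  |  Hotel 3  |  Study  |  MIT Lab  |  Average
Different Features + RANSAC  |  3DMatch [56]  |  Rec.  |  0.8530  |  0.7830  |  0.6101  |  0.7857  |  0.5897  |  0.5769  |  0.6325  |  0.5111  |  0.6678
Different Features + RANSAC  |  3DMatch [56]  |  Prec.  |  0.7213  |  0.3517  |  0.2861  |  0.7186  |  0.4144  |  0.2459  |  0.2691  |  0.2000  |  0.4009
Different Features + RANSAC  |  CGF [33]  |  Rec.  |  0.7171  |  0.6887  |  0.4591  |  0.5495  |  0.4872  |  0.6538  |  0.4786  |  0.4222  |  0.5570
Different Features + RANSAC  |  CGF [33]  |  Prec.  |  0.5430  |  0.1830  |  0.1241  |  0.3759  |  0.1538  |  0.1574  |  0.1605  |  0.1033  |  0.2251
Different Features + RANSAC  |  PPFNet [20]  |  Rec.  |  0.9020  |  0.5849  |  0.5723  |  0.7473  |  0.6795  |  0.8846  |  0.6752  |  0.6222  |  0.7085
Different Features + RANSAC  |  PPFNet [20]  |  Prec.  |  0.6553  |  0.1546  |  0.1572  |  0.4159  |  0.2181  |  0.2018  |  0.1627  |  0.1267  |  0.2615
Our Features + RANSAC variants  |  USAC [43]  |  Rec.  |  0.8820  |  0.7642  |  0.6101  |  0.7527  |  0.6538  |  0.8077  |  0.6709  |  0.5778  |  0.7149
Our Features + RANSAC variants  |  USAC [43]  |  Prec.  |  0.5083  |  0.1397  |  0.1362  |  0.2972  |  0.1536  |  0.1329  |  0.1530  |  0.1053  |  0.2033
Our Features + RANSAC variants  |  SPRT [16]  |  Rec.  |  0.8797  |  0.7453  |  0.6101  |  0.7253  |  0.6538  |  0.8462  |  0.6624  |  0.4444  |  0.6959
Our Features + RANSAC variants  |  SPRT [16]  |  Prec.  |  0.5170  |  0.1341  |  0.1374  |  0.3158  |  0.1599  |  0.1384  |  0.1593  |  0.0881  |  0.2062
Our Features + RANSAC variants  |  LR [34]  |  Rec.  |  0.8753  |  0.7925  |  0.6038  |  0.7198  |  0.7051  |  0.7692  |  0.6667  |  0.5556  |  0.7110
Our Features + RANSAC variants  |  LR [34]  |  Prec.  |  0.5019  |  0.1348  |  0.1294  |  0.2854  |  0.1549  |  0.1190  |  0.1465  |  0.1012  |  0.1967
Our Features + RANSAC variants  |  RANSAC  |  Rec.  |  0.8530  |  0.7642  |  0.6038  |  0.7033  |  0.6667  |  0.7692  |  0.6496  |  0.5111  |  0.6901
Our Features + RANSAC variants  |  RANSAC  |  Prec.  |  0.5527  |  0.1614  |  0.1479  |  0.3647  |  0.1825  |  0.1587  |  0.1658  |  0.1139  |  0.2309
Our Features + Pose Prediction  |  -  |  Rec.  |  0.8998  |  0.8302  |  0.6352  |  0.8242  |  0.6923  |  0.9231  |  0.7650  |  0.6444  |  0.7768
Our Features + Pose Prediction  |  -  |  Prec.  |  0.5437  |  0.1778  |  0.1807  |  0.4011  |  0.2061  |  0.2087  |  0.1843  |  0.1465  |  0.2561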
How useful are our features in geometric registration?
To further demonstrate the superiority of our learned local features, we evaluate them on the task of local geometric registration (LGM). In a typical LGM pipeline, local features are first extracted and then a set of local correspondences is established by some form of search in the latent space. Out of these putative matches, a subsequent RANSAC iteratively selects a minimal subset of 3 correspondences in order to estimate a rigid pose. The best relative rigid transformation between the fragment pair is then the one with the highest inlier score. For the sake of fairness among all the methods, and to have a controlled setting where the result depends only on the differences in descriptors, we use the simple RANSAC framework [43] across all methods to find the best matches.
The first part of Tab. 2 shows how well different local features can aid RANSAC in registering fragments on the 3DMatch Benchmark. Recall and precision are computed in the same way as in 3DMatch [56]. For this evaluation, recall is the more important measure, because precision can be improved by employing better hypothesis pruning schemes that filter out the bad matches without harming recall [34, 33]. The registration result shows that our method is on par with or better than the best performer PPFNet [20] on average recall, while using a much more lightweight training pipeline. Interestingly, our recall on this task drops when compared to that of fragment matching. This means that for certain fragment pairs, even though the inlier ratio is above the matching threshold, RANSAC fails to do the work. Thus, one is motivated to seek better ways to recover the rigid transformation from 3D correspondences.
How accurate is our direct 6D prediction?
We now evaluate the contribution of RelativeNet in fixing the aforementioned failure cases of RANSAC. Thanks to our architecture, we are able to endow each correspondence with pose information. Ideally, each of these correspondences would be good. In practice, however, this is not the case. Hence, we devise a linear search to find the best of them, as explained in § 3.3. In Tab. 2 (bottom), we report our LGM results as an outcome of this verification, on the same 3DMatch Benchmark. As we can see, with the same set of correspondences, our method yields a much higher average recall of 0.7768, around 8.7 percentage points higher than what is achievable by RANSAC (0.6901). This is also about 6.8 points higher than PPFNet (0.7085). Moreover, this number is around 3 points higher than the recall in fragment matching (0.7458), which means that not only pairs with good correspondences are registered, but also some challenging pairs whose inlier ratio falls below the matching threshold, pushing the potential of the matched correspondences to the limit.
It is worth pointing out that the iterative scheme of RANSAC requires finding at least 3 correct correspondences to estimate the transformation, whereas it is sufficient for us to rely on a single correct match. Moreover, due to downsampling [7], poses computed directly from 3 points are crude, whereas the patch-wise pose predictions of our network are less sensitive to the accuracy of the exact keypoint location.
Comparisons against the RANSAC family
To further demonstrate the power of RelativeNet, we compare it with some of the state-of-the-art variants of RANSAC, namely USAC [43], SPRT [16] and Latent RANSAC (LR) [34]. Those methods have been shown to be both faster and more powerful than the vanilla version [43, 34].
All the methods are given the same set of putative matching points found by our rotation-invariant features. The results in Tab. 2 show that even a simple hypothesis pruning combined with our data-driven RelativeNet can surpass an entire set of handcrafted methods, achieving approximately 6 percentage points higher recall than the best obtained by USAC (0.7768 vs. 0.7149) and about 2.5 points better precision than the highest obtained by standard RANSAC (0.2561 vs. 0.2309). In this regard, our method has a clear advantage in 3D pairwise geometric registration.
Running times
Speed is another important factor for any pairwise registration algorithm, and it is of interest to see how our work compares to the state of the art in this respect. We implement our hypothesis verification part based on USAC to make the comparison fair with other USAC-based implementations.
The average time needed for registering a fragment pair is recorded in Tab. 3, feature extraction time excluded. All timings are measured on an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz with a single thread. Note that our method is much faster than the fastest RANSAC variant, Latent RANSAC [34]. The average time for generating all hypotheses for a fragment pair by RelativeNet is about 0.013s, and the subsequent verification costs 0.016s, making up around 0.03s in total. An important reason why we can terminate so quickly is that the number of hypotheses generated and verified is much smaller compared to the RANSAC methods. While LR is capable of reducing this amount significantly, the number of surviving hypotheses to be verified is still much larger than ours.
Effect of correspondence estimation on the registration
We study 5 different ways of constructing putative matching pair sets in an ablation. Strategies include: (1) keeping different numbers $k$ of mutually closest neighboring patches, each dubbed Mutual-$k$, and (2) keeping a nearest neighbor for all the local patches from both fragments as a match pair, dubbed Closest. These strategies are applied on the same set of local features to estimate initial correspondences for further registration. The results of each method on different scenes and their average are plotted in Fig. 5. As $k$ increases and the criterion for accepting a neighbor as a pair relaxes, we observe an overall trend of increasing registration recall on different sequences. Not surprisingly, this trend is most obvious in the Average column. This is of course not sufficient to conclude that relaxation helps correspondences. The second important observation is that the number of established correspondences also increases as this condition relaxes. The average number of putative matches found by Closest is around 3664, much larger than Mutual-1's 334 (approximately 11 times more), meaning that a subsequent verification would need more time to process them. Hence, we arrive at the conclusion that if recall/accuracy is the main concern, more putative matches should be kept. If, conversely, speed is an issue, Mutual-1 could achieve a rather satisfying result more quickly.
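The two families of strategies can be sketched as follows, using brute-force nearest-neighbor search in feature space. Treating "mutual" as "each feature lies among the other's k nearest neighbors" and "Closest" as the union of the one-directional nearest-neighbor pairs from both fragments is an interpretation of the description above; the function names and the toy features are hypothetical.

```python
import math

def knn_indices(queries, database, k):
    """For every query feature, indices of its k nearest database features."""
    result = []
    for f in queries:
        order = sorted(range(len(database)),
                       key=lambda j: math.dist(f, database[j]))
        result.append(order[:k])
    return result

def mutual_k_matches(feats_a, feats_b, k=1):
    """Pairs (i, j) where j is among the k nearest of i and vice versa."""
    a_to_b = knn_indices(feats_a, feats_b, k)
    b_to_a = knn_indices(feats_b, feats_a, k)
    return sorted((i, j) for i, nbrs in enumerate(a_to_b)
                  for j in nbrs if i in b_to_a[j])

def closest_matches(feats_a, feats_b):
    """Union of nearest-neighbor pairs found from either fragment."""
    a_to_b = knn_indices(feats_a, feats_b, 1)
    b_to_a = knn_indices(feats_b, feats_a, 1)
    pairs = {(i, nbrs[0]) for i, nbrs in enumerate(a_to_b)}
    pairs |= {(nbrs[0], j) for j, nbrs in enumerate(b_to_a)}
    return sorted(pairs)

# Toy 2D "features" for two fragments.
feats_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
feats_b = [(0.1, 0.0), (4.0, 4.0)]
print(mutual_k_matches(feats_a, feats_b, k=1))
print(mutual_k_matches(feats_a, feats_b, k=2))
print(closest_matches(feats_a, feats_b))
```

Even on this toy input the match set grows as the acceptance criterion relaxes, which is the trade-off between recall and verification time discussed above. For real descriptor sets a spatial index would replace the quadratic search.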
Generalization to unseen domains
To show that our algorithm generalizes well to other datasets, we evaluate its performance on the well-known and challenging global registration benchmark provided by Choi et al., the Redwood Benchmark [13]. This dataset contains four different synthetic scenes, each with a sequence of fragments. Our network is not fine-tuned with any synthetic data; instead, the weights trained with real data from the 3DMatch dataset are used directly. We follow the evaluation settings of Choi et al. for an easy and fair comparison, and report the registration results in Fig. 6. This precision and recall plot also depicts results achieved by some recent methods including FGR [58], CZK [13], 3DMatch [56], CGF+FGR [33], CGF+CZK [33], and Latent RANSAC [34]. Among them, 3DMatch and CGF are data-driven. 3DMatch was trained with real data from the same data source as ours, while CGF was trained with synthetic data. Note that our method shows higher recall than 3DMatch. Although we are not using any synthetic data for fine-tuning, we still achieve better recall than CGF and its combination with CZK. In general, our method outperforms all the other state-of-the-art methods on the Redwood Benchmark [13], which validates the generalizability and good performance of our method simultaneously. Note that while the maximal precision is in general low across all the methods, it is not hard to improve it when the recall is high. To show that recall is the primary measure, we ran a global optimization [13] on our initial results, which raised precision without a large loss of recall.
5 Conclusion
We proposed a unified end-to-end framework for both local feature extraction and pose prediction. Comprehensive experiments on the 3DMatch benchmark demonstrate that a multi-task training scheme can inject more power into the learned features and hence improve the quality of the correspondence set for further registration. Geometric registration using the pose predictions by our RelativeNet, given the putative matched pairs, is also shown to be both more robust and much faster than various state-of-the-art RANSAC methods. We also studied how different methods of establishing local correspondences affect the registration performance. The strong performance on the challenging synthetic Redwood benchmark indicates that our method is not only robust, but also generalizes well to unseen datasets. In the future, we also plan to introduce a data-driven hypothesis verification approach.
References
 [1] D. Aiger, N. J. Mitra, and D. Cohen-Or. 4-points congruent sets for robust pairwise surface registration. In ACM Transactions on Graphics (TOG), volume 27, page 85. ACM, 2008.
 [2] P. J. Besl and N. D. McKay. Method for registration of 3d shapes. In Robotics-DL tentative, pages 586–606. International Society for Optics and Photonics, 1992.
 [3] T. Birdal, E. Bala, T. Eren, and S. Ilic. Online inspection of 3d parts via a locally overlapping camera network. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10. IEEE, 2016.
 [4] T. Birdal and S. Ilic. Point pair features based object detection and pose estimation revisited. In 3D Vision, pages 527–535. IEEE, 2015.
 [5] T. Birdal and S. Ilic. Cad priors for accurate and flexible instance reconstruction. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 133–142, Oct 2017.
 [6] T. Birdal and S. Ilic. A point sampling algorithm for 3d matching of irregular geometries. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6871–6878. IEEE, 2017.
 [7] T. Birdal and S. Ilic. A point sampling algorithm for 3d matching of irregular geometries. In International Conference on Intelligent Robots and Systems (IROS 2017). IEEE, 2017.
 [8] T. Birdal, U. Simsekli, M. O. Eken, and S. Ilic. Bayesian pose graph optimization via bingham distributions and tempered geodesic mcmc. In Advances in Neural Information Processing Systems, pages 306–317, 2018.
 [9] M. Bueno, F. Bosché, H. González-Jorge, J. Martínez-Sánchez, and P. Arias. 4-plane congruent sets for automatic registration of as-is 3d point clouds with 3d bim models. Automation in Construction, 89:120–134, 2018.
 [10] B. Busam, T. Birdal, and N. Navab. Camera pose filtering with local regression geodesics on the riemannian manifold of dual quaternions. In Proceedings of the IEEE International Conference on Computer Vision, pages 2436–2445, 2017.
 [11] Á. P. Bustos and T.-J. Chin. Guaranteed outlier removal for point cloud registration with correspondences. IEEE transactions on pattern analysis and machine intelligence, 40(12):2868–2882, 2018.
 [12] S. Choi, T. Kim, and W. Yu. Performance evaluation of ransac family. Journal of Computer Vision, 24(3):271–300, 1997.
 [13] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [14] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [15] O. Chum and J. Matas. Matching with prosac - progressive sample consensus. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 220–226. IEEE, 2005.
 [16] O. Chum and J. Matas. Optimal randomized ransac. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1472–1482, 2008.
 [17] O. Chum, J. Matas, and J. Kittler. Locally optimized ransac. In Joint Pattern Recognition Symposium, pages 236–243. Springer, 2003.
 [18] T. Cohen and M. Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016.
 [19] H. Deng, T. Birdal, and S. Ilic. Ppffoldnet: Unsupervised learning of rotation invariant 3d local descriptors. In The European Conference on Computer Vision (ECCV), September 2018.
 [20] H. Deng, T. Birdal, and S. Ilic. Ppfnet: Global context aware local features for robust 3d point matching. Computer Vision and Pattern Recognition (CVPR). IEEE, 1, 2018.
 [21] B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 998–1005. Ieee, 2010.
 [22] B. Eckart, K. Kim, and J. Kautz. Hgmr: Hierarchical gaussian mixtures for adaptive 3d registration. In The European Conference on Computer Vision (ECCV), September 2018.
 [23] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
 [24] Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser. The perfect match: 3d point cloud matching with smoothed densities. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [26] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, J. Wan, and J. Zhang. Performance evaluation of 3d local feature descriptors. In Asian Conference on Computer Vision, pages 178–194. Springer, 2014.
 [27] S. Hinterstoisser, V. Lepetit, N. Rajkumar, and K. Konolige. Going further with point pair features. In European Conference on Computer Vision, pages 834–848. Springer, 2016.
 [28] T. Hodaň, F. Michel, E. Brachmann, W. Kehl, A. G. Buch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, C. Sahin, F. Manhardt, F. Tombari, T.-K. Kim, J. Matas, and C. Rother. Bop: Benchmark for 6d object pose estimation. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 19–35, 2018.
 [29] Z. Jian Yew and G. Hee Lee. 3dfeatnet: Weakly supervised local 3d features for point cloud registration. In Proceedings of the European Conference on Computer Vision (ECCV), pages 607–623, 2018.
 [30] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on pattern analysis and machine intelligence, 21(5):433–449, 1999.
 [31] F. Järemo Lawin, M. Danelljan, F. Shahbaz Khan, P.-E. Forssén, and M. Felsberg. Density adaptive point set registration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [32] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32(5):922–923, 1976.
 [33] M. Khoury, Q.-Y. Zhou, and V. Koltun. Learning compact geometric features. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [34] S. Korman and R. Litman. Latent ransac. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6693–6702, 2018.
 [35] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [36] H. Li and R. Hartley. The 3d-3d registration problem revisited. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
 [37] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
 [38] N. Mellado, D. Aiger, and N. J. Mitra. Super 4pcs fast global pointcloud registration via smart indexing. In Computer Graphics Forum, volume 33, pages 205–215. Wiley Online Library, 2014.
 [39] M. Mohamad, M. T. Ahmed, D. Rappaport, and M. Greenspan. Super generalized 4pcs for 3d registration. In 3D Vision (3DV), 2015 International Conference on, pages 598–606. IEEE, 2015.
 [40] J. Park, Q.-Y. Zhou, and V. Koltun. Colored point cloud registration revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 143–152, 2017.
 [41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
 [42] A. Petrelli and L. Di Stefano. On the repeatability of the local reference frame for partial shape matching. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2244–2251. IEEE, 2011.
 [43] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. Usac: a universal framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):2022–2038, 2013.
 [44] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. IEEE, 2009.
 [45] S. Salti, F. Tombari, and L. Di Stefano. Shot: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding, 125:251–264, 2014.
 [46] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
 [47] R. Toldo, A. Beinat, and F. Crosilla. Global registration of multiple point clouds embedding the generalized procrustes analysis into an icp framework. In 3DPVT 2010 Conference, 2010.
 [48] F. Tombari, S. Salti, and L. Di Stefano. Unique shape context for 3d data description. In Proceedings of the ACM workshop on 3D object retrieval, pages 57–62. ACM, 2010.
 [49] J. Vidal, C.-Y. Lin, and R. Martí. 6d pose estimation using an improved method based on point pair features. In 2018 4th International Conference on Control, Automation and Robotics (ICCAR), pages 405–409. IEEE, 2018.
 [50] J. Vongkulbhisal, B. I. Ugalde, F. De la Torre, and J. P. Costeira. Inverse composition discriminative optimization for point cloud registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2993–3001, 2018.
 [51] D. Worrall and G. Brostow. Cubenet: Equivariance to 3d rotation and translation. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, 2018.
 [52] J. Xiao, A. Owens, and A. Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1625–1632, 2013.
 [53] J. Yang, H. Li, and Y. Jia. Go-icp: Solving 3d registration efficiently and globally optimally. In Proceedings of the IEEE International Conference on Computer Vision, pages 1457–1464, 2013.
 [54] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud autoencoder via deep grid deformation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [55] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
 [56] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In CVPR, 2017.
 [57] Y. Zhao, T. Birdal, H. Deng, and F. Tombari. 3d pointcapsule networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [58] Q.-Y. Zhou, J. Park, and V. Koltun. Fast global registration. In European Conference on Computer Vision, pages 766–782. Springer, 2016.
Appendix A Appendix
A.1 Ablation Study
Does the multi-task training scheme help to boost the feature quality?
In order to find out how multi-task training affects the quality of the learned intermediate features, we trained several networks with combinations of different supervision signals. For the sake of controlled experimentation, all networks share an identical architecture and are trained with the same data for 10 epochs. Hence, the only remaining variable is the objective function used for each group.
In total, there are four networks to be compared. The first one is trained with all the available supervision signals, i.e. reconstruction loss, feature consistency loss and pose prediction loss. Regarding the other three groups, each of the networks is trained with one of the three signals excluded. For simplicity, those groups are tagged as All, No Reconstruction, No Consistency and No Pose respectively. The fragment matching results using features from different networks are shown in Fig. 10.
As shown in Fig. 10, with all the training signals on, the learned features are the most robust and outperform all the others, each of which lacks at least one piece of information and thus suffers a performance drop. When no reconstruction loss is applied, the learned features almost always fail at matching. It is therefore the most critical loss to minimize. The absence of the pose prediction loss has the least negative influence. Yet, this loss is necessary for RelativeNet to learn to predict the relative pose for given patch pairs. Without it, the later stages of the pipeline such as hypothesis generation and verification cannot proceed. These results validate that our multi-task training scheme takes full advantage of all the available information to drive the performance of the learned local features to a higher level.
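The four ablation groups differ only in which loss terms enter the training objective. A minimal sketch of such a switchable objective is given below; it assumes an unweighted sum of the terms, and the names `ABLATION_GROUPS` and `total_loss` as well as the key names are illustrative and not taken from the paper.

```python
from typing import Dict

# Hypothetical encoding of the four groups: which loss terms are enabled.
ABLATION_GROUPS: Dict[str, Dict[str, bool]] = {
    "All":               {"reconstruction": True,  "consistency": True,  "pose": True},
    "No Reconstruction": {"reconstruction": False, "consistency": True,  "pose": True},
    "No Consistency":    {"reconstruction": True,  "consistency": False, "pose": True},
    "No Pose":           {"reconstruction": True,  "consistency": True,  "pose": False},
}


def total_loss(losses: Dict[str, float], group: str) -> float:
    """Sum only the loss terms that are enabled for the given ablation group.

    The unweighted sum is an assumption made for illustration.
    """
    enabled = ABLATION_GROUPS[group]
    return sum(value for name, value in losses.items() if enabled[name])


# Example with made-up per-batch loss values.
example = {"reconstruction": 1.0, "consistency": 2.0, "pose": 4.0}
objective_all = total_loss(example, "All")
objective_no_pose = total_loss(example, "No Pose")
```

Keeping the architecture and data fixed and toggling only these flags is what makes the objective function the single variable of the comparison.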
Matching of invariant vs pose-variant features
Our method extracts two kinds of local features using two different network components. The ones extracted by PPF-FoldNet are fully rotation-invariant, while the local features of PC-FoldNet change as the pose of the local patches varies. The experiments in the main paper use only the local features from PPF-FoldNet to establish correspondences, thanks to their invariance. Here, we use invariant and equivariant features to match fragment pairs separately, and compare their matching performance. This is important in validating our choice that invariant features are more suitable for nearest neighbor queries.
Fig. 9 shows the distribution of the correspondence inlier ratio for the matched fragment pairs when using the different local features. With equivariant features, a large number of fragment pairs have correspondences with only a small fraction of inliers (less than 5%). Invariant features, in contrast, provide many fragment pairs with a set of correspondences containing over 10% true matches. This indicates that invariant features are better at finding a good correspondence set for the subsequent registration stage. All in all, the rotation-invariant features extracted by PPF-FoldNet are more suitable for finding putative local matches. Note that this was also verified by [19].
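The inlier ratio of a correspondence set can be computed as sketched below. The sketch assumes that a match counts as an inlier when the source keypoint, moved by the ground-truth pose, lies within a distance `tau` of its matched target keypoint; the function name `inlier_ratio` and the default threshold are assumptions made for illustration.

```python
import numpy as np


def inlier_ratio(src, dst, R_gt, t_gt, tau=0.1):
    """Fraction of correspondences (src[i], dst[i]) that agree with the ground-truth pose.

    src, dst   : (N, 3) arrays of matched keypoints from the two fragments.
    R_gt, t_gt : ground-truth rotation (3, 3) and translation (3,) mapping src onto dst.
    tau        : inlier distance threshold (assumed value).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if len(src) == 0:
        return 0.0
    # Apply the ground-truth rigid motion to every source keypoint (row vectors).
    moved = src @ np.asarray(R_gt, dtype=float).T + np.asarray(t_gt, dtype=float)
    residuals = np.linalg.norm(moved - dst, axis=1)
    return float(np.mean(residuals < tau))


# Toy example: four matches, one of which is a gross mismatch.
src_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
shift = np.array([1.0, 0.0, 0.0])
dst_pts = src_pts + shift
dst_pts[3] += 5.0
ratio = inlier_ratio(src_pts, dst_pts, np.eye(3), shift)
```

A histogram of this ratio over all matched fragment pairs gives the kind of distribution discussed above.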
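Table 4: Average number of putative matches found by different correspondence estimation methods. Of the column labels, only "Closest" survives.
# Matches  335  1099  1834  2609  3664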
More details for correspondence estimation methods
In the main paper, we found that a more relaxed condition for keeping neighbors leads to a better subsequent registration. However, this performance gain comes at a cost and hence introduces a trade-off. Tab. 4 lists the average number of putative matches found by the different methods. As we can see, the size of the correspondence set increases rapidly as we relax the criterion and keep more neighbors. In return, this means more computation time in the subsequent registration stage.
A.2 Quantitative Results
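The growth of the correspondence set can be reproduced with a brute-force nearest neighbor search in feature space. The following sketch keeps the `k` closest target features for every source feature; the function name `k_closest_matches` is hypothetical, and this rule is only one plausible way of relaxing the matching criterion, not necessarily the exact set of methods compared in Tab. 4.

```python
import numpy as np


def k_closest_matches(feat_a, feat_b, k=1):
    """Match every feature in A to its k nearest features in B (Euclidean distance).

    Returns a list of index pairs (i, j), so its length is len(A) * min(k, len(B)).
    """
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    # Pairwise distances, shape (len(A), len(B)).
    dists = np.linalg.norm(feat_a[:, None, :] - feat_b[None, :, :], axis=2)
    k = min(k, feat_b.shape[0])
    # Column indices of the k smallest distances in every row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return [(i, int(j)) for i in range(feat_a.shape[0]) for j in nearest[i]]


# Toy descriptors: two features in A, three in B.
a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[0.1, 0.0], [1.0, 0.9], [5.0, 5.0]])
matches_k1 = k_closest_matches(a, b, k=1)
matches_k2 = k_closest_matches(a, b, k=2)
```

With this rule the number of putative matches grows linearly in `k`, and every additional match is one more hypothesis that the registration stage has to process.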
Distribution of hypotheses
Fig. 13 shows the distribution of poses predicted by RelativeNet and of poses determined by running RANSAC on randomly selected subsets of corresponding points. Each hypothesis is composed of a rotational and a translational part. The former is represented as a Rodrigues vector to keep it in $\mathbb{R}^3$. The hypotheses predicted by RelativeNet are visibly concentrated more tightly around the ground truth pose, both in rotation and translation. This also explains why the hypotheses of our network facilitate an easier and faster registration procedure.
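For reference, a rotation matrix can be mapped to a three-dimensional vector as sketched below. The sketch assumes the axis-angle convention, in which the unit rotation axis is scaled by the rotation angle in radians; the text does not state which variant of the Rodrigues vector was used, and the function name `rotation_to_rodrigues` is hypothetical.

```python
import numpy as np


def rotation_to_rodrigues(R):
    """Convert a 3x3 rotation matrix to an axis-angle vector (unit axis times angle)."""
    R = np.asarray(R, dtype=float)
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = float(np.arccos(cos_theta))
    if theta < 1e-8:
        # No rotation: the axis is undefined and the vector is zero.
        return np.zeros(3)
    if np.pi - theta < 1e-6:
        # Near 180 degrees the antisymmetric part vanishes; use R = 2 n n^T - I instead.
        M = (R + np.eye(3)) / 2.0
        k = int(np.argmax(np.diag(M)))
        axis = M[:, k] / np.sqrt(M[k, k])
        return theta * axis
    # General case: the axis comes from the antisymmetric part of R.
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))


# Example: a 90 degree rotation about the z axis.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
vec = rotation_to_rodrigues(Rz)
```

Together with the translation, this turns every pose hypothesis into a pair of points in three-dimensional space, which is convenient for plotting distributions of hypotheses.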
Qualitative comparison against RANSAC
Fig. 14 shows some challenging cases where only a small number of correct correspondences are established. In these examples, RANSAC fails to recover the pose from the small set of inliers hidden in a large set of mismatches. In contrast, a registration procedure aided by RelativeNet succeeds with a correct result. The qualitative comparison demonstrates that our method is robust at registering fragment pairs even in extreme cases where only very few inliers are present.
Multiscan registration
Finally, we apply our method to register multiple scans to a common reference frame. To do that, we first align pairs of scans and obtain the most likely relative pose per pair. These poses are then fed into a global registration pipeline [14]. Note that while this method can use a global iterative closest point alignment [2] in the final stage, we deliberately omit this step to emphasize the quality of our pairwise estimates. Hence, the outcome is a rough but nevertheless acceptable alignment, on which we can optionally apply the global ICP to refine the points and scans. The results are shown in Fig. 15 on the Red Kitchen sequence of 7-Scenes [46] as well as in Fig. 16 on the Sun3D Hotel sequence [52], a part of the 3DMatch benchmark [56].
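How pairwise poses place several scans in one frame can be illustrated by simply chaining consecutive estimates. This is a deliberately simpler stand-in for the global registration pipeline of [14], which reasons over all pairs jointly; the function name `chain_to_reference` and the convention that `pairwise[i]` maps scan i+1 into the frame of scan i are assumptions of this sketch.

```python
import numpy as np


def chain_to_reference(pairwise):
    """Compose consecutive pairwise poses into poses relative to scan 0.

    pairwise[i] is a 4x4 homogeneous transform taking points of scan i+1
    into the frame of scan i (assumed convention).
    Returns a list whose entry i takes points of scan i into the frame of scan 0.
    """
    absolute = [np.eye(4)]
    for T in pairwise:
        # (scan i -> scan 0) composed with (scan i+1 -> scan i).
        absolute.append(absolute[-1] @ np.asarray(T, dtype=float))
    return absolute


def translation(x, y, z):
    """Helper building a pure translation as a 4x4 matrix."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


# Toy example with three scans related by pure translations.
poses = chain_to_reference([translation(1, 0, 0), translation(0, 2, 0)])
```

Chaining accumulates the error of every pairwise estimate along the sequence, which is one reason why a global optimization over the pose graph is preferred in practice.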