Scene Coordinate and Correspondence Learning for Image-Based Localization
Scene coordinate regression has become an essential part of current camera relocalization methods. Different versions, in the form of regression forests and deep learning methods, have been successfully applied to estimate the camera pose from a single input image. In this work, we propose to regress scene coordinates pixel-wise for a given RGB image using deep learning. In contrast to recent methods, which usually employ RANSAC to obtain a robust pose estimate from the established point correspondences, we propose to regress confidences for these correspondences, which allows us to discard erroneous predictions immediately and thereby improve the initial pose estimates. Finally, the resulting confidences can be used to score initial pose hypotheses and aid in pose refinement, offering a generalized solution to this task.
Camera re-localization from a single input image is an important topic for many computer vision applications such as SLAM, augmented reality and navigation. Due to rapid camera motion or occlusions, tracking can be lost, making re-localization methods an essential component in such applications. Early methods focus on keyframe or descriptor matching, e.g., using SIFT or ORB features, to obtain point correspondences from which the camera pose can be inferred. However, those methods usually do not perform well under occlusion or in poorly textured environments. On the other hand, machine learning methods have recently shown great capabilities in estimating camera poses from a single image. In particular, regression forests have been employed for robust pixel to scene coordinate prediction. Correspondence samples are then used to obtain multiple pose hypotheses, and a robust pose estimate is found using RANSAC [5, 6, 7, 8].
In comparison to regression forests, deep learning methods, mainly focusing on RGB images as input, have recently emerged. They regress the camera pose directly and thus offer fast camera pose estimates [9, 10]. Most of them, however, cannot achieve accuracy similar to that of the scene coordinate regression approaches. This leads to the assumption that the intermediate step of regressing scene coordinates plays a crucial role in estimating camera poses with deep learning and in the generalization of those methods. Moreover, RANSAC usually plays a vital role in achieving good accuracy in any of the methods that approach camera relocalization via scene coordinate regression.
In this paper, we introduce a new method which, as a first step, densely regresses pixel-wise scene coordinates given an input RGB image using deep learning. In addition, we propose a new form of regularization that smooths the regressed coordinates and can be applied to further improve them. Accordingly, a detailed analysis of scene coordinate regression and of the influence of different loss functions on the quality of the regressed scene coordinates is conducted. As our main contribution, in a second step, confidences of the obtained image to scene coordinate correspondences are regressed, based on which erroneous predictions can immediately be discarded, resulting in more robust initial pose hypotheses. Additionally, the resulting confidence predictions can be used to optimize the estimated camera poses in a refinement step similar to previous works. In contrast to these methods, however, our approach offers a more general solution, which does not restrict itself in terms of the optimization function or the thresholding typically used to define inliers and outliers in RANSAC optimizations.
II Related Works
A vast amount of research exists on the topic of camera pose estimation. The work most related to ours can be divided into two categories. The first focuses on descriptor matching to obtain point correspondences from which the camera pose can be inferred, either using analytical solutions or learning methods. The second performs direct pose regression using deep learning methods to obtain a pose estimate from a single input image.
Correspondence Learning. Well-known methods for camera re-localization use random forests for correspondence prediction. Here, the forest is trained to predict pixel to 3D coordinate correspondences, from which the camera pose can then be inferred and iteratively refined in a pre-emptive RANSAC optimization. Several extensions and improvements of the method have been proposed, increasing its performance and robustness [8, 6, 7].
Following the recent success of this method, various related methods based on deep learning have been proposed. Inspired by the approach of , Brachmann et al. use two CNNs to predict the pose for an RGB image, where the first CNN predicts point correspondences and is linked to the second by a differentiable version of RANSAC, which they call DSAC. Notably, reinforcement learning is used to obtain a probabilistic selection that enables end-to-end learning of the framework.
Recently, Schmidt et al. rely on learned feature representations for correspondence estimation by using a 3D model of the scene to automatically find correspondences in RGB-D images. A fully-convolutional neural network is trained on a contrastive loss to produce pixel-wise descriptors. Here, pixels corresponding to the same model coordinate are mapped together, whereas the remaining pixels have dissimilar feature descriptors. Although there is no complete guarantee that the descriptors learned from one video can be mapped to features of a different video capturing the same scene, the method showed robustness and generalization.
Instead of relying on feature representations for point correspondences, Zamir et al. learn a generic feature representation based on the viewpoint itself, which can be used to retrieve a pose estimate. A Siamese network is trained on a multi-task loss including the pose and a matching function. Image patches are extracted and matched according to their pose, training the network to match patches with similar viewpoints. Additionally, it is shown that the resulting models and features generalize well to novel tasks, for example, scene layout or surface normal estimation.
Direct Regression. Direct regression approaches, which use deep learning to regress camera poses from a single image, are also emerging. Mostly CNNs are used in this context to estimate camera poses [9, 15, 16, 10]. Kendall et al. use a CNN, called PoseNet, to directly predict the six degrees of freedom of the camera pose from RGB images. They parameterize rotation by quaternions, which leads to a total of seven parameters for rotation and translation to regress. Although they achieve reasonable performance on several indoor scenes, the method still shows a significant gap in accuracy compared to random forest approaches, which leads to the assumption that predicting an intermediate representation such as point correspondences is important for inferring the final pose. However, this method relies only on RGB images, without the need for depth information, which makes it easily applicable in indoor as well as outdoor settings. Walch et al. extend this approach and connect the last feature vector of the neural network to four LSTM units before concatenating the results and feeding this feature vector to the final regression layer. As in , a pre-trained CNN used for classification is adapted and fine-tuned to enable the regression of the camera pose. By connecting LSTMs and thus correlating the spatial information, the receptive field of each pixel is enlarged substantially, improving the final pose estimation. In , the authors of  extend their method by introducing novel loss functions, further reducing the gap in accuracy compared to the state-of-the-art methods. Further, they show that the re-projection error can be used to additionally fine-tune the model and optimize the regression prediction given that depth information is available. As a first direct regression approach that achieves comparable accuracy with regard to the scene coordinate regression methods,  proposes a multi-task learning framework. By combining global and relative pose regression between image pairs, the authors present a framework for localization and odometry, which shows great improvements in accuracy.
In comparison, our approach is most related to scene coordinate regression methods using deep learning. Scene coordinates are densely regressed, as opposed to the patch-based regression proposed in the state-of-the-art method. Moreover, correspondence confidences are predicted to remove outliers and boost the accuracy of the initial pose hypotheses. Additionally, the resulting confidences can directly be used for hypothesis scoring and pose refinement.
Given an input RGB image of a scene, with and being the image height and width, our goal is to estimate the corresponding camera pose, given by its orientation and position . The camera pose describes the mapping between the camera and the scene coordinates x and as
The relation between the 3D camera coordinates and the image pixels depends on the camera’s focal length , and optical center , , and is defined by
with being a point in the camera coordinate frame, given by its coordinates and its depth value . In case the camera pose is unknown, it can be retrieved from a number of correspondences: if depth information is available, using the Kabsch algorithm on the 3D-3D correspondences between x and X, or otherwise using a PnP algorithm on the 2D-3D correspondences between image points and X.
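As an illustration of the geometry described above, the following is a minimal sketch under standard pinhole-camera assumptions. The symbol names (`fx`, `fy`, `cx`, `cy`, `u`, `v`, `d`) and the convention X ≈ R x + t are assumptions chosen for this example; the function names are hypothetical, and the Kabsch step is the textbook SVD-based solution rather than a specific implementation from this work.

```python
import numpy as np

def backproject(u, v, d, fx, fy, cx, cy):
    """Pixel (u, v) with depth d -> 3D point in the camera frame (pinhole model)."""
    return np.array([(u - cx) * d / fx, (v - cy) * d / fy, d])

def kabsch(x_cam, X_scene):
    """Least-squares R, t such that X_scene ~= R @ x_cam + t (rows are points)."""
    mu_x, mu_X = x_cam.mean(axis=0), X_scene.mean(axis=0)
    H = (x_cam - mu_x).T @ (X_scene - mu_X)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    det = np.linalg.det(Vt.T @ U.T)
    D = np.diag([1.0, 1.0, 1.0 if det > 0 else -1.0])
    R = Vt.T @ D @ U.T
    t = mu_X - R @ mu_x
    return R, t
```

Given at least three non-collinear 3D-3D correspondences, `kabsch` returns the rigid transform that maps camera coordinates to scene coordinates in the least-squares sense. When no depth is available, a PnP solver would take its place.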
To this end, our proposed framework consists of three steps: (1) scene coordinate regression, where we densely predict scene coordinates and, in this context, add a novel regularization; (2) confidence prediction, where we aim to find accurate correspondences among our coordinate predictions; and (3) pose estimation, where we employ the aforementioned algorithms to compute the camera pose estimate from the most confident predictions. An overview of our framework is given in Figure 1.
III-A Scene Coordinate Regression
Coordinate Regression. As the first step, our aim is to model the function , obtaining the predicted scene coordinates , as , where are the model parameters. To this end, we compute the ground truth scene coordinates X to train a model and regress the scene coordinates of every pixel, obtaining an output map. For this purpose, we use Tukey's Biweight loss function to regress the 3D coordinates, given as
where is the residual, is the number of coordinates to regress and is defined as Tukey's Biweight function. In Tukey's Biweight function, the choice of the tuning constant plays a crucial role; it has been proposed to choose it according to the median absolute deviation of the residuals, assuming a Gaussian distribution. Instead, we propose to choose the parameter depending on the spatial extent of the current scene, where we found half of the scene's diameter, given in meters, to provide better results. In case of missing depth values, and thus missing ground truth scene coordinates, we omit these pixels during training so that they do not negatively influence the network.
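The loss described above can be sketched as follows. This is a hedged illustration that uses the standard textbook form of Tukey's biweight function; the function names, the per-coordinate averaging and the mask argument `valid` are assumptions made for this example, not details confirmed by the text.

```python
import numpy as np

def tukey_biweight(r, c):
    """Tukey's biweight rho(r): about r^2 / 2 near zero, constant c^2 / 6 for |r| >= c."""
    u = np.clip(np.abs(np.asarray(r, dtype=float)) / c, 0.0, 1.0)
    return (c ** 2 / 6.0) * (1.0 - (1.0 - u ** 2) ** 3)

def coordinate_loss(pred, target, valid, scene_diameter_m):
    """Mean biweight loss over pixels that have ground truth.

    pred, target: (H, W, 3) scene coordinates in meters.
    valid: (H, W) boolean mask, False where depth (and thus ground truth) is missing.
    """
    c = 0.5 * scene_diameter_m               # tuning constant: half the scene diameter
    rho = tukey_biweight(pred - target, c)   # per-coordinate residuals
    n_valid = max(int(valid.sum()), 1)       # avoid division by zero
    return float((rho * valid[..., None]).sum() / (3 * n_valid))
```

Because the function saturates at `c**2 / 6`, residuals larger than the tuning constant contribute no gradient, which is what makes the choice of `c` relative to the scene extent matter.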
Coordinate Smoothing. Inspired by graph Laplacian regularization, which has successfully been applied to image denoising on image patches, we consider the scene coordinates in a given neighborhood. Minimizing the graph Laplacian regularizer results in smoothing of image patches with respect to the given graph. Similarly, we consider scene coordinates as vertices and compute weights according to the depth value at the corresponding pixel
where, given a pixel position , we compute weights for each index in a given neighborhood . In this case, represents the depth value at index . Finally, we obtain an additional smoothing term in our loss function
where corresponds to the predicted scene coordinates at a given pixel index. This term loosely pushes the surrounding points closer together if their depth values are similar; otherwise, a larger difference between points is accepted.
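To make the idea of a depth-weighted smoothing term concrete, the sketch below penalizes squared differences between neighboring predicted coordinates, weighted by depth similarity. The exponential weight `exp(-|d_i - d_j| / sigma)`, the parameter `sigma`, the 4-connected neighborhood and the function name are all assumptions made for illustration; the exact weight definition used in this work is not reproduced here.

```python
import numpy as np

def smoothness_term(coords, depth, sigma=1.0):
    """Depth-weighted smoothness over vertically and horizontally adjacent pixels.

    coords: (H, W, 3) predicted scene coordinates; depth: (H, W) depth map.
    The weight exp(-|d_i - d_j| / sigma) is an assumed form (hypothetical).
    """
    total = 0.0
    for axis in (0, 1):  # vertical pairs, then horizontal pairs
        dc = np.diff(coords, axis=axis)            # coordinate difference of neighbors
        dd = np.abs(np.diff(depth, axis=axis))     # depth difference of neighbors
        w = np.exp(-dd / sigma)                    # similar depth -> weight near 1
        total += float((w * (dc ** 2).sum(axis=-1)).sum())
    return total
```

With this weighting, neighbors at similar depth are pulled together strongly, while neighbors across a depth discontinuity are only weakly coupled.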
Overall, we train the model using our loss function, described as .
III-B Confidence Prediction
The regressed coordinates can be used to obtain a pose estimate. However, they usually include a large number of erroneous correspondences, a problem that is most often handled using RANSAC.
Inspired by , which classifies point correspondences between image pairs, we train a neural network to estimate probabilities of 2D-3D correspondences. Instead of solving this as a classification problem, though, we treat the task as a regression problem. To this end, we use the model described in the previous section to create probabilities for each scene coordinate prediction and construct a training set , where . Here, is used as a scale, so that accurate coordinates are given high probability. In this step, the objective is, therefore, to compute the function , described by the model parameters , so that . We feed points containing the image pixel and the predicted scene coordinates to our model. As an output, we obtain a probability for each point, indicating whether it is likely to be a good correspondence. As a loss function, we use the loss to train this model,
The pose predictions can then easily be obtained by sampling the most confident point correspondences, while removing initial erroneous predictions right away.
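The following sketch illustrates how regression targets for the confidence network could be built from coordinate errors, and how the most confident correspondences could then be selected. The mapping `exp(-alpha * error)` is an assumed form chosen because it gives accurate coordinates a probability close to one; the function names and the parameter `alpha` are hypothetical.

```python
import numpy as np

def confidence_targets(pred, gt, alpha):
    """Map the coordinate error (meters) to a target in (0, 1].

    exp(-alpha * err) is an assumed form; alpha acts as the scale.
    """
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return np.exp(-alpha * err)

def most_confident(conf, k):
    """Indices of the k highest-confidence correspondences, best first."""
    order = np.argsort(-np.asarray(conf))
    return order[:k]
```

At test time only `most_confident` would be needed: the network output replaces the targets, and the selected indices are passed on to PnP or Kabsch.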
III-C Pose Estimation and Refinement
The initial pose hypotheses are refined as a post-processing step. Following previous works , pose hypotheses are sampled using number of point correspondences for each, out of which only the points with highest confidence are kept. Only one hypothesis out of the is then chosen, by scoring each hypothesis with the mean confidence of the correspondences used to compute the pose estimate. The best hypothesis is then refined by repeatedly sampling randomly selected points and re-running the PnP or Kabsch algorithm, including the additional most confident correspondences out of this point set.
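The hypothesis selection step can be sketched as below. This is a simplified illustration: it only handles sampling, filtering by confidence and scoring by mean confidence, and leaves out the pose solver (PnP or Kabsch) that would be run on each kept point set. All parameter names (`n_hyp`, `n_sample`, `n_keep`, `seed`) are hypothetical, and no specific values from this work are implied.

```python
import random

def select_hypothesis(confidences, n_hyp, n_sample, n_keep, seed=0):
    """Sample n_hyp point sets, keep the n_keep most confident points in each,
    and return the kept set with the highest mean confidence."""
    rng = random.Random(seed)
    indices = range(len(confidences))
    best_score, best_set = -1.0, None
    for _ in range(n_hyp):
        sample = rng.sample(indices, n_sample)
        # Keep only the most confident correspondences of this sample.
        kept = sorted(sample, key=lambda i: confidences[i], reverse=True)[:n_keep]
        score = sum(confidences[i] for i in kept) / len(kept)
        if score > best_score:
            best_score, best_set = score, kept
    return best_set, best_score
```

Scoring by mean confidence needs no inlier threshold and no re-projection error, so the same selection logic applies regardless of which pose solver is used afterwards.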
III-D Implementation Details
For scene coordinate regression, U-Net is used as the network architecture, as it can easily be used to regress correspondence maps and has been shown to train well on few input images. Four convolutional layers with pooling layers and dropout applied after each, and four up-sampling layers are used, giving a final feature map of size . By applying a last convolutional layer, the final correspondence map of size is obtained.
For confidence prediction, we adapt PointNet to our specific input and output requirements. For the final regression layer, following , we first apply a hyperbolic tangent followed by a ReLU activation, so the network can easily predict low confidence for highly inaccurate points. Mostly for evaluation purposes, is computed so that inlier points have a lower bound probability of 0.75, resulting in .
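The output activation and the choice of the scale can be illustrated with a short sketch. The activation is simply ReLU applied to tanh, as described above. The second function assumes, purely for illustration, that confidence targets take the form `exp(-alpha * error)`; under that assumption a lower bound `p_min` at an inlier threshold `tau` fixes `alpha`. The function and parameter names are hypothetical.

```python
import math

def confidence_activation(z):
    """ReLU(tanh(z)): negative inputs give exactly 0, positive ones saturate below 1."""
    return max(0.0, math.tanh(z))

def alpha_for_inlier_bound(tau, p_min=0.75):
    """Scale alpha such that exp(-alpha * tau) == p_min (exponential form assumed)."""
    return -math.log(p_min) / tau
```

The composition keeps outputs in the range [0, 1) and lets the network output exactly zero for points it considers highly inaccurate.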
We set out of which only the most confident points are used to estimate a camera pose, resulting in only points. For pose refinement, we follow the parameter settings of  and sample initial hypotheses and refine the best one for 8 iterations on the additional most confident out of randomly sampled points.
Both networks are trained separately for 800 epochs with a batch size of 20 using the RMSprop optimizer with an initial learning rate of . All experiments were conducted on a Linux-based system with 64GB RAM and an 8GB NVIDIA GeForce GTX 1080 graphics card, and implemented using TensorFlow.
To evaluate our method, we define baseline models, which are described in section IV-B, and use the following metrics. Inliers and outliers are defined as , with being a common threshold chosen to define inliers [5, 8, 12]. Every inlier point is therefore counted as a true positive in our evaluations. For pose estimation, we compute the median rotation and translation errors and the pose accuracy, where a pose is considered correct if the rotation and translation errors are below and , respectively.
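The pose metrics named above can be computed as in the following sketch. It uses the common definitions (rotation error as the angle of the relative rotation, translation error as the Euclidean distance); the thresholds `max_rot` and `max_trans` are left as parameters because their values are not given here, and all function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between estimate and ground truth."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground truth positions."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def pose_accuracy(rot_errs, trans_errs, max_rot, max_trans):
    """Fraction of poses whose rotation and translation errors are both below threshold."""
    ok = (np.asarray(rot_errs) < max_rot) & (np.asarray(trans_errs) < max_trans)
    return float(ok.mean())
```

Median errors then follow from `np.median` over the per-frame values returned by the first two functions.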
Our method is evaluated on the publicly available 7-Scenes dataset from Microsoft, which consists of seven indoor scenes with varying volumes ranging from 1 to 18. RGB and depth images of each scene and their corresponding ground truth camera poses are available. Between 1K and 7K images are provided for each scene, including very difficult frames due to motion blur, reflecting surfaces and repeating structures, for example in the Stairs scene. Images were captured using a Kinect RGB-D sensor, and ground truth poses were obtained using KinectFusion.
IV-B Baseline Models
First, we evaluate each individual component of our model and create the baseline models for comparison. To start with, the first step of our pipeline, the scene coordinate regression, is evaluated. To this end, a model is trained on scene coordinate regression using different loss functions. Mainly, we compare between and the loss as described in section III-A. In this case, a pose estimate is computed by sampling number of points randomly. We abbreviate these models as and . Next, to evaluate the scene coordinate regression quality, regularization is added in the form of the introduced smoothness term , corresponding to model . More importantly, the second step of our pipeline, the confidence prediction, is analyzed. The predicted scene coordinates and associated image points are used to train a model that regresses the confidence of each correspondence. For pose estimation, in this case, only the most confident correspondences out of the initial randomly sampled points are kept. We abbreviate this model as .
IV-C Evaluation of Baseline Models
Each of our models is evaluated by comparing the error between regressed and ground truth scene coordinates, as well as the pose error. We compare our models trained on scene coordinate regression with and without the additional smoothness regularization. Here, we found a slight improvement in terms of both the regressed coordinates and the pose estimate when the regularization is added.
Using our proposed confidence prediction, as can be seen in Figure 2, the errors of the points used to compute a pose estimate decrease significantly: most of the erroneous predictions are eliminated, which greatly boosts the pose estimation accuracy. As a result, the estimated poses improve significantly as well, as reported in Table I. Specifically, the translation error decreases greatly. It should be noted that only the most confident point correspondences are used to compute a pose estimate for this model. As a result, more accurate poses are obtained using a much smaller number of points. Additionally, the percentage of inliers among the sampled points used to compute the initial pose hypotheses increases significantly. Furthermore, we analyze the quality of our model's coordinate regression and confidence prediction; example error and probability maps are shown in Figure 4. In our case, since we densely regress the scene coordinates, high error values usually correspond to missing depth values and therefore missing ground truth coordinates in the image. Although it seems difficult for the network to accurately predict low confidences in regions with unusually large error, in regions of inlier predictions the model is able to predict correspondingly high confidence.
IV-D Evaluation of Confidence Prediction
To assess the quality of regressing correspondence probabilities, a model is first trained on simple classification, where a point correspondence is labeled as either an inlier or an outlier depending on the threshold . The model is trained using a cross-entropy loss. As a second step, we train our proposed model, regressing probabilities in the range of instead, and plot the resulting ROC curves, shown in Figure 3. We found that regression in this case performs roughly on par with classification. However, with regression we do not restrict the model to a specific threshold chosen for the inlier definition, which would need to be adapted for each scene depending on the quality of the scene coordinate regression. In comparison, a classification model trained on the challenging Stairs scene results in a drop of relative rotation and translation error of and , due to very few inliers being available during training.
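For readers who want to reproduce this kind of comparison, the sketch below computes an ROC curve and its area from predicted confidences and binary inlier labels. It is a plain-Python illustration with hypothetical function names; tied scores are processed one at a time rather than grouped, which is a simplification.

```python
def roc_curve(scores, is_inlier):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pairs = sorted(zip(scores, is_inlier), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg if n_neg else 0.0)
        tpr.append(tp / n_pos if n_pos else 0.0)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))
```

The same routine applies to both the classification and the regression model, since each produces a score per correspondence that can be thresholded.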
IV-E Comparison to the State of the Art
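Table II: Median rotation and translation errors on the 7-Scenes dataset for methods using RGB information and methods using RGB-D information; the compared methods include VLocNet, DSAC and SCoRe.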
Finally, we report the results of our framework using a combination of scene coordinate regression and confidence prediction, described as . We compare our results to the current state-of-the-art methods, namely PoseNet, which directly regresses the camera poses from the RGB input images and refines the trained models by optimizing the re-projection error. The median rotation and translation errors evaluated on the 7-Scenes dataset can be found in Table II, where we report results obtained using PnP () as well as, given that depth information is available, using the Kabsch algorithm (). Since our model does not depend in any way on the algorithm used to compute pose predictions, we can easily interchange these algorithms without the need to train additional models.
In most cases, we found a significant improvement in pose accuracy compared to  and adaptations of this method. In addition, we compare to recent works on scene coordinate regression. Although there is a very recent version of this work, we compare to the earlier version, since its framework is more similar to our approach, keeping the hypothesis scoring CNN in mind. Except for the challenging Stairs scene, the state-of-the-art method shows slightly better accuracy in terms of RGB pose estimation when considering each scene individually. On average, our method shows good performance compared to the state of the art. Although our confidence prediction significantly improves the results, the initial scene coordinate regression still seems erroneous; this will be explored further in future work, including improvements in the handling of missing depth and thus missing ground truth scene coordinates. Given that depth information is available, improvements in accuracy using RGB-D information can easily be obtained, since neither the models nor the pose refinement rely on these algorithms. It should be noted as well that, since we only depend on the most confident points, our results were obtained using a smaller number of points. Further, RANSAC-based optimization, as applied in most state-of-the-art methods, could easily be applied to obtain more accurate pose estimates. For evaluation and comparison to current RGB methods, we keep the parameter settings for pose refinement as proposed in .
In terms of computational time, our method is evaluated on an Intel Core i7 4.2 GHz CPU. The only part running on the GPU is the evaluation of the networks. To solve the PnP problem, we use OpenCV's implementation of . Our framework runs in , most of which is due to hypothesis sampling () and refinement (). In comparison, DSAC, implemented in C++, reports a run-time of around , whereas our method, implemented in Python, already performs well.
In this work, we present a framework for dense scene coordinate regression in the context of camera re-localization using a single RGB image as input. Given that depth information is available to obtain camera coordinates, the corresponding scene coordinates can be regressed and used to obtain a camera pose estimate. We incorporate this information into the network and analyze how the scene coordinate regression can be further optimized using a smoothing term in the loss function. In addition, and more importantly, we predict confidences for the resulting image point to scene coordinate correspondences, from which the camera pose can be inferred, thus eliminating most of the outliers in advance and greatly improving the accuracy of the estimated camera poses. As a final step, the resulting confidences can be used to refine the initial pose estimates, further improving the method's accuracy.
- [1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
- [2] Y. H. Lee and G. Medioni, “RGB-D camera based wearable navigation system for the visually impaired,” Computer Vision and Image Understanding, vol. 149, pp. 3–20, 2016.
- [3] B. Glocker, J. Shotton, A. Criminisi, and S. Izadi, “Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 5, pp. 571–583, 2015.
- [4] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2564–2571.
- [5] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in RGB-D images,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 2930–2937.
- [6] A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi, “Multi-output learning for camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1114–1121.
- [7] T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, L. Di Stefano, and P. H. Torr, “On-the-fly adaptation of regression forests for online camera relocalisation,” in CVPR, vol. 2, 2017, p. 3.
- [8] J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. Torr, “Exploiting uncertainty in regression forests for accurate camera relocalization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4400–4408.
- [9] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 2938–2946.
- [10] A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in Proc. CVPR, vol. 3, 2017, p. 8.
- [11] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” in Readings in Computer Vision. Elsevier, 1987, pp. 726–740.
- [12] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “DSAC: differentiable RANSAC for camera localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
- [13] T. Schmidt, R. Newcombe, and D. Fox, “Self-supervised visual descriptor learning for dense correspondence,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 420–427, 2017.
- [14] A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese, “Generic 3D representation via pose estimation and matching,” in European Conference on Computer Vision. Springer, 2016, pp. 535–553.
- [15] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization using LSTMs for structured feature correlation,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), vol. 1, no. 2, 2017, p. 3.
- [16] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 4762–4769.
- [17] A. Valada, N. Radwan, and W. Burgard, “Deep auxiliary learning for visual localization and odometry,” in International Conference on Robotics and Automation (ICRA 2018). IEEE, 2018.
- [18] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, p. 155, 2009.
- [19] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust optimization for deep regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2830–2838.
- [20] J. Pang and G. Cheung, “Graph Laplacian regularization for image denoising: analysis in the continuous domain,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1770–1785, 2017.
- [21] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua, “Learning to find good correspondences,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
- [22] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
- [23] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, vol. 1, no. 2, p. 4, 2017.
- [24] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
- [25] E. Brachmann and C. Rother, “Learning less is more: 6D camera localization via 3D surface regression,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.