LM-Reloc: Levenberg-Marquardt Based Direct Visual Relocalization
Abstract
We present LM-Reloc – a novel approach for visual relocalization based on direct image alignment. In contrast to prior works that tackle the problem with a feature-based formulation, the proposed method does not rely on feature matching and RANSAC. Hence, the method can utilize not only corners but any region of the image with gradients. In particular, we propose a loss formulation inspired by the classical Levenberg-Marquardt algorithm to train LM-Net. The learned features significantly improve the robustness of direct image alignment, especially for relocalization across different conditions. To further improve the robustness of LM-Net against large image baselines, we propose a pose estimation network, CorrPoseNet, which regresses the relative pose to bootstrap the direct image alignment. Evaluations on the CARLA and Oxford RobotCar relocalization tracking benchmark show that our approach delivers more accurate results than previous state-of-the-art methods while being comparable in terms of robustness.
1 Introduction
Map-based relocalization, that is, localizing a camera within a pre-built reference map, is becoming more and more important for robotics [6], autonomous driving [24, 4], and AR/VR [29]. Compared to single-image localization methods, sequence-based approaches, which leverage the temporal structure of the scene, provide more stable pose estimates and also deliver positions in global coordinates. The map is usually generated using either LiDAR or visual Simultaneous Localization and Mapping (vSLAM) solutions. In this paper, we consider vSLAM maps due to the lower-cost visual sensors and the richer semantic information from the images. Feature-based methods [14, 7, 22, 23] and direct methods [13, 12, 11, 1] are the two main lines of research for vSLAM.
Once a map is available, relocalizing within it at any later point in time requires dealing with long-term changes in the environment. This makes centimeter-accurate global localization challenging, especially in the presence of drastic lighting and appearance changes in the scene. For this task, feature-based methods are the most commonly used approaches to estimate the camera position and orientation, mainly because features are more robust against changes in lighting and illumination in the scene.
However, feature-based methods can only utilize keypoints, which have to be matched across the images before the pose estimation begins. Thus, they ignore large parts of the available information. Direct methods, in contrast, can take advantage of all image regions with sufficient gradients and, as a result, are known to be more accurate on visual odometry benchmarks [41, 11, 39].
In this paper, we propose LM-Reloc, which applies direct techniques to the task of relocalization. LM-Reloc consists of LM-Net, CorrPoseNet, and a non-linear optimizer, which work seamlessly together to deliver reliable pose estimates without RANSAC and feature matching. In particular, we derive a loss formulation which is specifically designed to work well with the Levenberg-Marquardt (LM) algorithm [16, 20]. We use a deep neural network, LM-Net, to train descriptors that are fed to the direct image alignment algorithm. Using these features results in better robustness against bad initializations, large baselines, and illumination changes.
While the robustness improvements gained with our loss formulation are sufficient in many cases, for very large baselines or strong rotations some initialization can still be necessary. To this end, we propose a pose estimation network, which directly regresses the 6DoF pose between two images and which we utilize as initialization for the direct image alignment. The CorrPoseNet contains a correlation layer as proposed in [27], which ensures that the network can handle large displacements. The proposed CorrPoseNet displays a lot of synergy with LM-Net: despite being quite robust, its predictions are not very accurate. Thus, it is best used in conjunction with our LM-Net, resulting in very robust and accurate pose estimates.
We evaluate our approach on the relocalization tracking benchmark from [36], which contains scenes simulated using CARLA [9], as well as sequences from the Oxford RobotCar dataset [19]. Our LM-Net shows superior accuracy, especially in terms of rotation, while being competitive in terms of robustness.
We summarize our main contributions:

LM-Reloc, a novel pipeline for visual relocalization based on direct image alignment, which consists of LM-Net, CorrPoseNet, and a non-linear optimizer.

A novel loss formulation, together with a point sampling strategy, that is used to train LM-Net such that the resulting feature descriptors are optimally suited to work with the LM algorithm.

Extensive evaluations on the CARLA and Oxford RobotCar relocalization tracking benchmark, which show that the proposed approach achieves state-of-the-art relocalization accuracy without relying on feature matching or RANSAC.
2 Related Work
In this section, we review the main topics that are closely related to our work, including direct methods for visual localization and feature-based visual localization methods.
Direct methods for visual localization. In recent years, direct methods [13, 12, 11] for SLAM and visual odometry have seen great progress. Unlike feature-based methods [14, 7, 22, 23], which first extract keypoints and the corresponding descriptors and then minimize geometric errors, direct methods minimize an energy function based on the photometric constancy assumption, without performing feature matching or RANSAC. By utilizing more points from the images, direct methods show higher accuracy than feature-based methods [39]. However, classical direct methods are less robust than feature-based methods when the photometric constancy assumption is violated due to, e.g., the lighting and weather changes which are typical for long-term localization [33]. In [2] and [25], the authors propose to use hand-crafted features to improve the robustness of direct methods against low light or global appearance changes. Some recent works [5, 18, 36] address the issue by using learned features from deep neural networks [15]. In [5], deep features are trained using a hinge loss based on the Lucas-Kanade method; however, in contrast to us, the authors estimate optical flow instead of applying the features to the task of relocalization. The work most related to ours is GN-Net [36], which proposes a Gauss-Newton loss to learn deep features. By performing direct image alignment on the learned features, GN-Net can deliver reliable pose estimates between images taken in different weather or season conditions. The proposed LM-Net further derives the loss formulation based on Levenberg-Marquardt to improve the robustness against bad initializations compared to the Gauss-Newton method. Inspired by D3VO [38], LM-Reloc also proposes a relative pose estimation network with a correlation layer [27] to regress a pose estimate which is used as the initialization for the optimization.
Feature-based visual localization. Most approaches for relocalization utilize feature detectors and descriptors, which can either be hand-crafted, such as SIFT [17] or ORB [28], or, especially in the context of drastic lighting and appearance changes, be learned. Recently, many descriptor learning methods have been proposed which follow a detect-and-describe paradigm, e.g., SuperPoint [8], D2-Net [10], or R2D2 [26]. Moreover, SuperGlue [32], a learning-based alternative to the matching step of feature-based methods, has been proposed and yields significant performance improvements. For a complete relocalization pipeline, the local pose refinement part has to be preceded by finding the closest image in a database given a query [3]. While some approaches [31, 30, 35] address the joint problem, in this work we decouple these two tasks and focus only on the pose refinement part.
3 Method
In this work, we address the problem of computing the 6DoF pose between two given images $I_r$ and $I_t$. Furthermore, we assume that depths for a sparse set of points are available, e.g., by running a direct visual SLAM system such as DSO [11].
The overall pipeline of our approach is shown in the overview figure. It is composed of LM-Net, CorrPoseNet, and a non-linear optimizer using the LM algorithm. LM-Net is trained with a novel loss formulation designed to learn feature descriptors optimally suited for the LM algorithm. The encoder-decoder architecture takes as input a reference image $I_r$ as well as a target image $I_t$. The network is trained end-to-end and produces multi-scale feature maps $F_r^l$ and $F_t^l$, where $l$ denotes the level of the feature pyramid. In order to obtain an initial pose estimate for the non-linear optimization, we propose CorrPoseNet, which takes $I_r$ and $I_t$ as inputs and regresses their relative pose. Finally, the multi-scale feature maps, together with the depths obtained from DSO [11], form a non-linear energy function which is minimized using the LM algorithm in a coarse-to-fine manner to obtain the final relative pose estimate. In the following, we describe the individual components of our approach in more detail.
3.1 Direct Image Alignment with Levenberg-Marquardt
In order to optimize the pose $\boldsymbol{\xi}$ (consisting of rotation matrix $R$ and translation $t$), we minimize the feature-metric error:

$E(\boldsymbol{\xi}) = \sum_{i} \left\| F_t(p_i') - F_r(p_i) \right\|_\gamma$   (1)

where $\|\cdot\|_\gamma$ is the Huber norm and $p_i'$ is the point $p_i$ projected onto the target image using the depth $d_i$ and the pose:

$p_i' = \pi\!\left( R \, \pi^{-1}(p_i, d_i) + t \right)$   (2)

where $\pi$ and $\pi^{-1}$ denote projection and back-projection, respectively.
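The warp of Equation (2) can be sketched as follows for a pinhole camera. The intrinsics (fx, fy, cx, cy) and their values are illustrative assumptions, not the calibration of any dataset used in the paper.

```python
import numpy as np

# Illustrative pinhole intrinsics (assumed, not from the benchmark).
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

def pi_inv(p, d):
    """Back-project pixel p = (u, v) with depth d into a 3D camera-frame point."""
    u, v = p
    return np.array([(u - cx) * d / fx, (v - cy) * d / fy, d])

def pi(X):
    """Project a 3D camera-frame point onto the image plane."""
    x, y, z = X
    return np.array([fx * x / z + cx, fy * y / z + cy])

def warp(p, d, R, t):
    """Eq. (2): project reference pixel p with depth d into the target image."""
    return pi(R @ pi_inv(p, d) + t)

# Sanity check: with the identity pose every point maps onto itself.
p = np.array([100.0, 80.0])
assert np.allclose(warp(p, 2.0, np.eye(3), np.zeros(3)), p)
```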
This energy function is first minimized on the coarsest pyramid level $l = L$, whose feature maps have the lowest resolution, yielding a rough pose estimate. The estimate is refined by further minimizing the energy function on the subsequent pyramid levels $l = L-1, \dots, 1$, where level $l = 1$ has the size of the original image. In the following, we provide details of the minimization performed on every level, and for simplicity we will denote $F^l$ as $F$ from now on.
Minimization is performed using the Levenberg-Marquardt algorithm. In each iteration we compute the update $\delta$ in the Lie algebra as follows: Using the residual vector $r$, the Huber weight matrix $W$, and the Jacobian $J$ of the residual vector with respect to the pose, we compute the Gauss-Newton system:

$H = J^\top W J, \qquad b = -J^\top W r$   (3)

The damped system can be obtained with either Levenberg's formula [16]:

$(H + \lambda I)\,\delta = b$   (4)

or Marquardt's formula [20]:

$\left(H + \lambda \operatorname{diag}(H)\right)\delta = b$   (5)

depending on the specific application.
The parameter $\lambda$ can be seen as an interpolation factor between gradient descent and the Gauss-Newton algorithm. When $\lambda$ is high, the method behaves like gradient descent with a small step size, and when it is low, it is equivalent to the Gauss-Newton algorithm. In practice, we start with a relatively large $\lambda$ and multiply it by a factor smaller than 1 after a successful iteration, and by a factor larger than 1 after a failed iteration [11].
Figure 1 shows the typical behaviour of the algorithm. In the beginning, the initial pose is inaccurate, resulting in projected point positions which are a couple of pixels away from the correct location. $\lambda$ will be high, meaning that the algorithm will behave similarly to gradient descent. After a couple of iterations, the pose has become more accurate and the projected points are in closer vicinity to the correct location. By now, $\lambda$ has probably decreased, so the algorithm will behave more like the Gauss-Newton algorithm, and we expect it to converge quickly.
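The damping schedule described above can be sketched on a toy 1-D curve-fitting problem. The model y = exp(a·x), the initial damping, and the update factors (0.5 on success, 4.0 on failure) are illustrative assumptions; the paper follows the schedule of DSO [11].

```python
import numpy as np

# Toy data generated from y = exp(a*x) with ground-truth a = 0.5.
xs = np.linspace(0.0, 2.0, 20)
ys = np.exp(0.5 * xs)

def residuals(a):
    return np.exp(a * xs) - ys

def jacobian(a):
    # d r / d a, a single-column Jacobian.
    return (xs * np.exp(a * xs))[:, None]

a, lam = 2.0, 1e-2                      # poor initialization, initial damping (assumed)
for _ in range(100):
    r, J = residuals(a), jacobian(a)
    H, b = J.T @ J, -J.T @ r            # Gauss-Newton system (Eq. 3)
    delta = np.linalg.solve(H + lam * np.eye(1), b)   # Levenberg's formula (Eq. 4)
    if np.sum(residuals(a + delta[0]) ** 2) < np.sum(r ** 2):
        a += delta[0]                   # success: accept the step ...
        lam *= 0.5                      # ... and move towards Gauss-Newton
    else:
        lam *= 4.0                      # failure: fall back towards gradient descent
```

After the loop, `a` has converged to the ground-truth value 0.5: far from the optimum the damped (gradient-descent-like) steps keep the iteration stable, while near the optimum the small `lam` yields fast Gauss-Newton convergence.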
3.2 Loss Formulation for Levenberg-Marquardt
The key contribution of this work is LM-Net, which provides feature maps that improve the convergence behaviour of the LM algorithm and, at the same time, are invariant to different conditions. We train our network in a Siamese fashion based on ground-truth pixel correspondences.
In this section, $p$ denotes a reference point (located on image $I_r$) and $p'$ the ground-truth correspondence (located on image $I_t$). For the loss functions explained below, we further sample negative correspondences $p^-$, $p^-_{gd}$, and $p^-_{gn}$, which are obtained with different negative correspondence sampling strategies. Our loss formulation is inspired by the typical behaviour of the Levenberg-Marquardt algorithm explained in the previous section (see Figure 1). For a point, we distinguish four cases which can happen during the optimization:

The point is at the correct location.

The point is an outlier.

The point is relatively far from the correct solution.

The point is very close to the correct solution.
In the following we will derive a loss function for each of the 4 cases:
1. The point is already at the correct location. In this case, we would like the residual to be as small as possible, in the best case 0:

$L_{pos} = \left\| F_t(p') - F_r(p) \right\|_2$   (7)
2. The point is an outlier or the pose estimate is completely wrong. In this case, the projected point position can be at a completely different location than the correct correspondence. We would like the residual of this pixel to be very large to reflect this, and potentially reject a wrong update. To enforce this property, we sample a negative correspondence $p^-$ uniformly across the whole image and compute

$L_{neg} = \max\left(0,\, M - \left\| F_t(p^-) - F_r(p) \right\|_2 \right)$   (8)

where $M$ is the margin defining how large we would like the energy of a wrong correspondence to be. In practice, we set it to a fixed constant.
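The hinge in Equation (8) can be sketched in a few lines. The margin value M = 1.0 here is an assumption for illustration, not the constant used in the paper.

```python
import numpy as np

def l_neg(f_ref, f_neg, M=1.0):
    """Eq. (8): penalize negatives whose feature distance is below the margin M."""
    return max(0.0, M - np.linalg.norm(f_neg - f_ref))

f = np.array([0.2, 0.9])
assert l_neg(f, f) == 1.0                            # identical features: full penalty M
assert l_neg(f, f + np.array([2.0, 0.0])) == 0.0     # well-separated features: no penalty
```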
3. The predicted pose is relatively far away from the optimum, meaning that the projected point position will be a couple of pixels away from the correct location. As this typically happens during the beginning of the optimization, we assume that $\lambda$ will be relatively large and the algorithm behaves similarly to gradient descent. In this case, we want the gradient of this point to be oriented in the direction of the correct solution, so that the point has a positive influence on the update step.

To compute a loss function enforcing this property, we sample a random negative correspondence $p^-_{gd}$ in a relatively large vicinity around the correct solution (in our experiments we use 5 pixels distance). Starting from this negative correspondence, we first compute the Gauss-Newton system for this individual point, similarly to how it is done for optical flow estimation using Lucas-Kanade:

$r_{gd} = F_t(p^-_{gd}) - F_r(p)$   (9)

$H = J^\top J, \qquad b = -J^\top r_{gd}$   (10)

where $J$ denotes the Jacobian of $F_t$ with respect to the pixel location, evaluated at $p^-_{gd}$. We compute the damped system using a relatively large fixed $\lambda$, as well as the optical flow step

$\delta_{gd} = (H + \lambda I)^{-1} b$   (11)

In order for this point to have a useful contribution to the direct image alignment, this update step should move in the correct direction by at least $\epsilon$. We enforce this using a gradient-descent loss function which is small only if the distance to the correct correspondence after the update is smaller than before the update:

$L_{gd} = \max\left(0,\, \left\| p^-_{gd} + \delta_{gd} - p' \right\|_2 - \left\| p^-_{gd} - p' \right\|_2 + \epsilon \right)$   (12)

In practice, we choose fixed values for $\epsilon$ and $\lambda$.
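The per-point computation of Equations (9)–(12) can be sketched on a toy feature map. The smooth 2-channel features, the damping value 1.0, and epsilon = 0.5 are illustrative assumptions; only the 5-pixel sampling distance follows the text.

```python
import numpy as np

# Toy 2-channel feature map with a nonzero, smooth gradient everywhere.
H_img, W_img = 32, 32
yy, xx = np.mgrid[0:H_img, 0:W_img].astype(np.float64)
F_t = np.stack([xx / W_img, yy / H_img])

def feat(p):
    """Nearest-neighbor feature lookup (C-dimensional vector at pixel p)."""
    return F_t[:, int(round(p[1])), int(round(p[0]))]

def grad(p):
    """C x 2 Jacobian of the feature map w.r.t. the pixel location (central differences)."""
    ex, ey = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    return np.stack([(feat(p + ex) - feat(p - ex)) / 2,
                     (feat(p + ey) - feat(p - ey)) / 2], axis=1)

p_true = np.array([16.0, 16.0])               # ground-truth correspondence p'
p_neg = p_true + np.array([5.0, 0.0])         # negative sampled 5 px away

r = feat(p_neg) - feat(p_true)                # Eq. (9), with F_r(p) = feat(p_true) here
J = grad(p_neg)
Hs, b = J.T @ J, -J.T @ r                     # Eq. (10), per-point GN system
delta = np.linalg.solve(Hs + 1.0 * np.eye(2), b)   # Eq. (11), large fixed lambda (assumed)

d_before = np.linalg.norm(p_neg - p_true)
d_after = np.linalg.norm(p_neg + delta - p_true)
L_gd = max(0.0, d_after - d_before + 0.5)     # Eq. (12), epsilon = 0.5 (assumed)
```

On this toy map the damped step moves the negative towards the ground truth, but by less than epsilon, so the loss stays positive; training would push the network to produce features whose gradients make the step larger.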
4. The predicted pose is very close to the optimum, yielding a projected point position in very close proximity of the correct correspondence, and typically $\lambda$ will be very small, so the update will mostly be a Gauss-Newton step. In this case, we would like the algorithm to converge as quickly as possible, with sub-pixel accuracy. We enforce this using the Gauss-Newton loss [36]. To compute it, we first sample a random negative correspondence $p^-_{gn}$ in a 1-pixel vicinity around the correct location. Then we use Equations (9) and (10), replacing $p^-_{gd}$ with $p^-_{gn}$, to obtain the Gauss-Newton system formed by $H$ and $b$. We compute the updated pixel location:

$p'' = p^-_{gn} + (H + \lambda I)^{-1} b$   (13)

Note that, in contrast to the computation of the gradient-descent loss (Equation (12)), in this case $\lambda$ is just added to ensure invertibility and is therefore much smaller than the $\lambda$ used above. The Gauss-Newton loss is computed as:

$L_{gn} = \frac{1}{2} \left( p'' - p' \right)^\top H \left( p'' - p' \right) + \frac{1}{2} \log\!\left( \frac{(2\pi)^2}{\det(H)} \right)$   (14)
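A minimal sketch of Equation (14), assuming the probabilistic form of the Gauss-Newton loss from GN-Net [36]: the negative log-likelihood of the ground-truth location under a Gaussian with mean p'' and covariance H^{-1}.

```python
import numpy as np

def gn_loss(p_true, p_updated, H):
    """Negative log-likelihood of p_true under N(p_updated, H^{-1}) (Eq. 14)."""
    diff = p_updated - p_true
    return 0.5 * diff @ H @ diff + 0.5 * np.log((2 * np.pi) ** 2 / np.linalg.det(H))

p_true = np.array([10.0, 10.0])
H = np.eye(2)                                  # identity system for illustration
# The loss decreases as the updated point approaches the ground truth ...
assert gn_loss(p_true, p_true + 0.1, H) < gn_loss(p_true, p_true + 1.0, H)
# ... and a more certain system (larger det(H)) is rewarded at the optimum.
assert gn_loss(p_true, p_true, 4 * np.eye(2)) < gn_loss(p_true, p_true, np.eye(2))
```

The second term prevents the trivial solution of shrinking all feature gradients to zero: a confident (high-determinant) system is only rewarded if its mean is actually correct.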
Note how all four loss components use a different way to sample the involved points, as also depicted in Figure 1. With the derivation above, we argue that each loss component is important to achieve optimal performance, and we demonstrate this in the results section. Note that the Gauss-Newton systems computed for the GD loss and the GN loss are directly relevant for the application of direct image alignment: in fact, the full Gauss-Newton system containing all points (Equation (3)) can be computed from these individual per-point systems (Equation (10)) by summing them up after multiplying them with the derivative of the pixel location with respect to the pose [36].
3.3 CorrPoseNet
In order to deal with large baselines between the images, we propose CorrPoseNet to regress the relative pose between two images $I_r$ and $I_t$, which serves as the initialization of the LM optimization. As our network shall work even in cases of large baselines and strong rotations, we utilize the correlation layer proposed in [27], which is known to boost the performance of affine image transformation and optical flow [21] estimation for large displacements, but has not been applied to pose estimation before.
Our network first computes deep features $f_r$ and $f_t$ from both images individually, using multiple strided convolutions with ReLU activations in between. Then the correlation layer correlates each pixel of the normalized source features with each pixel of the normalized target features, yielding the correlation map $C$:

$C(p, q) = \hat{f}_r(p)^\top \hat{f}_t(q)$   (15)

where $\hat{f}$ denotes the L2-normalized features and $p$, $q$ are pixel locations in the two feature maps.
The correlation map is then normalized along the channel dimension and fed into two convolutional layers, each followed by batch norm and ReLU. Finally, we regress the Euler angles and the translation using fully connected layers. More details on the architecture are given in the supplementary material.
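The correlation layer of Equation (15) can be sketched as a dense dot product between all pixel pairs. The shapes (C channels, HxW resolution) are illustrative; real implementations operate on batched tensors.

```python
import numpy as np

def correlation_map(f_src, f_tgt):
    """Correlate every L2-normalized pixel of f_src with every pixel of f_tgt (Eq. 15)."""
    C, H, W = f_src.shape
    a = f_src.reshape(C, H * W)
    b = f_tgt.reshape(C, H * W)
    a = a / np.linalg.norm(a, axis=0, keepdims=True)   # per-pixel L2 normalization
    b = b / np.linalg.norm(b, axis=0, keepdims=True)
    # corr[i, j] = <f_src(:, i), f_tgt(:, j)>, reshaped to one "channel" per source pixel
    return (a.T @ b).reshape(H * W, H, W)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 4, 4))
corr = correlation_map(f, f)
# Each pixel correlates perfectly with itself when both inputs are identical.
assert np.allclose(np.diag(corr.reshape(16, 16)), 1.0)
```

The output has one channel per source pixel, which is why the map supports arbitrarily large displacements: the displacement is encoded in which channel fires, not in a limited search window.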
We train CorrPoseNet from scratch with image pairs and ground-truth poses. We utilize an L2 loss working directly on the Euler angles $\phi$ and the translation $t$:

$L_{pose} = \left\| t - t_{gt} \right\|_2 + \beta \left\| \phi - \phi_{gt} \right\|_2$   (16)

where $\beta$ is a weighting factor, which we set to a fixed value in practice.
As the distribution of ground-truth poses in the Oxford training data is limited, we apply the following data augmentation. We first generate dense depths for all training images using a state-of-the-art dense stereo matching algorithm [40]. The resulting depths are then used to warp the images to a different pose sampled from a uniform distribution. In detail, we first warp the depth image to the random target pose, then inpaint it using the OpenCV implementation of Navier-Stokes based inpainting, and finally warp our image to the target pose using this depth map. Note that the dense depths are only necessary for training, not for evaluation. We show an ablation study on the usage of the correlation layer and the proposed data augmentation in the supplementary material.
4 Experiments
We evaluate our method on the relocalization tracking benchmark proposed in [36], which contains images created with the CARLA simulator [9] and scenes from the Oxford RobotCar dataset [19]. We train our method on the respective datasets from scratch. LM-Net is trained using the Adam optimizer; a separate fixed learning rate is used for CorrPoseNet. For both networks, we choose hyperparameters and the training epoch based on the results on the validation data. Our networks use the same hyperparameters for all experiments except where stated otherwise; the direct image alignment code is slightly adapted for Oxford RobotCar, mainly to improve performance when the ego-vehicle is standing still.
As the original relocalization tracking benchmark [36] does not include validation data on Oxford RobotCar, we have manually aligned two new sequences, namely 2015-04-17-09-06-25 and 2015-05-19-14-06-38, and extend the benchmark with these sequences as validation data.
Evaluation metrics: We evaluate the predicted translation $t$ and rotation $R$ against the ground truth $t_{gt}$ and $R_{gt}$ according to Equations (17) and (18):

$e_t = \left\| t - t_{gt} \right\|_2$   (17)

$e_R = \arccos\!\left( \frac{\operatorname{tr}\!\left( R_{gt}^{-1} R \right) - 1}{2} \right)$   (18)

In this section, we plot the cumulative translation and rotation error up to a fixed threshold in meters and degrees, respectively. For quantitative results, we compute the area under curve (AUC) of these cumulative curves in percent, which we denote as AUC_t for translation and AUC_r for rotation from now on.
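One plausible way to compute this AUC metric is sketched below: the area under the cumulative error curve up to the threshold, normalized to percent. The exact evaluation script of the benchmark may differ in details; pairs with errors above the threshold (including failed relocalizations) reduce the score.

```python
import numpy as np

def auc(errors, threshold):
    """Area (in percent) under the cumulative error curve up to `threshold`."""
    errors = np.sort(np.asarray(errors, dtype=float))
    n = len(errors)
    kept = errors[errors <= threshold]            # pairs localized within the threshold
    recall = np.arange(1, len(kept) + 1) / n      # cumulative fraction of ALL pairs
    xs = np.concatenate([[0.0], kept, [threshold]])
    ys = np.concatenate([[0.0], recall, recall[-1:] if len(kept) else [0.0]])
    # trapezoidal integration of recall(error) over [0, threshold]
    area = np.sum((xs[1:] - xs[:-1]) * (ys[1:] + ys[:-1]) / 2)
    return 100.0 * area / threshold

assert abs(auc(np.zeros(100), 1.0) - 100.0) < 1e-9   # all errors zero: perfect score
assert auc(np.full(100, 2.0), 1.0) == 0.0            # all errors above threshold: zero
```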
We evaluate the following direct methods:
Ours: The full LM-Reloc approach consisting of CorrPoseNet, LM-Net features, and direct image alignment based on Levenberg-Marquardt. The depths used for the image alignment are estimated with the stereo version [37] of DSO [11].
Ours (w/o CorrPoseNet): For a fairer comparison to GN-Net, we use the identity as initialization for the direct image alignment instead of CorrPoseNet. This enables a direct comparison between the two loss formulations.
GN-Net [36]: In this work, we have also improved the parameters of the direct image alignment pipeline based on DSO [11]. Thus, we have re-evaluated GN-Net with this improved pipeline to make the comparison as fair as possible. These re-evaluated results are better than the results reported in the original GN-Net paper.
Baseline methods: Additionally, we evaluate against current state-of-the-art indirect methods, namely SuperGlue [32], R2D2 [26], SuperPoint [8], and D2-Net [10]. For these methods, we estimate the relative pose using the models provided by the authors and the OpenCV implementation of solvePnPRansac. We have tuned the parameters of RANSAC on the validation data and use a fixed number of iterations and a fixed reprojection error threshold for all methods. For estimating depth values at keypoint locations, we use OpenCV stereo matching. It would be possible to achieve higher accuracy by using SfM and MVS solutions such as COLMAP [34]. However, one important disadvantage of these approaches is that building a map is rather time-consuming and computationally expensive, whereas all other approaches evaluated on the benchmark [36] are able to create the map close to real time, enabling applications like long-term loop closure and map merging.
4.1 CARLA Relocalization Benchmark
Method  AUC_t  AUC_r

Ours  80.65  77.83
SuperGlue [32]  78.99  59.31
R2D2 [26]  73.47  54.42
SuperPoint [8]  72.76  53.38
D2-Net [10]  47.62  16.47
Ours (w/o CorrPoseNet)  63.88  61.90
GN-Net [36]  43.72  44.08
Sequence  Ours  SuperGlue [32]  R2D2 [26]  SuperPoint [8]  D2-Net [10]
(each method: AUC_t  AUC_r)

Sunny-Overcast  79.83  55.48  81.01  52.83  80.86  53.57  78.95  50.03  71.93  39.00
Sunny-Rainy  71.54  43.70  75.58  40.59  74.84  41.23  69.76  37.12  65.63  27.50
Sunny-Snowy  59.69  44.06  63.57  43.64  62.92  41.78  60.85  40.02  55.65  30.86
Overcast-Rainy  80.54  63.70  79.99  61.64  81.29  61.23  80.36  61.56  75.66  51.06
Overcast-Snowy  55.38  47.88  57.67  47.16  57.68  48.41  55.39  44.96  51.17  34.54
Rainy-Snowy  68.57  41.67  69.91  39.87  71.79  39.86  67.70  38.05  61.91  27.74
Sequence  Ours (w/o CorrPoseNet)  GN-Net [36]
(each method: AUC_t  AUC_r)

Sunny-Overcast  79.61  55.45  73.53  49.31
Sunny-Rainy  70.46  42.86  64.58  37.27
Sunny-Snowy  59.70  44.17  55.27  41.36
Overcast-Rainy  79.67  63.08  75.72  60.13
Overcast-Snowy  54.94  47.19  51.34  42.91
Rainy-Snowy  66.23  39.93  62.63  36.20
Figure 2 depicts the results on the test data of the CARLA benchmark. For all methods, we show the cumulative error plots for translation in meters and rotation in degrees. It can be seen that our method is more accurate than the state of the art while performing similarly in terms of robustness. We also show the AUC for the two plots in Table 1. The comparison to GN-Net shows that our new loss formulation significantly improves the results, even when used without CorrPoseNet as initialization. The figure also conveys that the direct methods (Ours, GN-Net) are more accurate than the evaluated indirect methods.
4.2 Oxford RobotCar Relocalization Benchmark
We compare to the state-of-the-art indirect methods on the 6 test sequence pairs formed from the sequences 2015-02-24-12-32-19 (sunny), 2015-03-17-11-08-44 (overcast), 2014-12-05-11-09-10 (rainy), and 2015-02-03-08-45-10 (snowy). In Table 2, we show the area under curve of the translation and rotation errors for all methods. It can be seen that our method clearly outperforms the state of the art in terms of rotation accuracy, while being competitive in terms of translation error. It should be noted that the ground truth for these sequences was generated using ICP alignment of the 2D LiDAR data accumulated over 60 meters. We have computed that the average root mean square error of the ICP alignment is 16 centimeters; therefore, especially the ground-truth translations have limited accuracy. As can be seen from Figure 2, the accuracy improvements our method provides are especially visible in the low-error range, which is hard to measure on this dataset. The rotation error of the LiDAR alignment is lower than the translational one, which is why we clearly observe the improvements of our method on the rotations.
In Table 3, we compare LM-Net without CorrPoseNet to GN-Net. Due to our novel loss formulation, LM-Net significantly outperforms the competitor on all sequences.
4.3 Ablation Studies
We evaluate LM-Net on the CARLA validation data with and without the various losses (Figure 3). Compared to a standard contrastive loss, the given loss formulation is a large improvement. As expected, the gradient-descent loss (green line) mainly improves the robustness, whereas the Gauss-Newton loss (blue line) improves the accuracy. Only when used together (our method) do we achieve both high robustness and high accuracy, confirming our theoretical derivation in Section 3.
4.4 Qualitative Results
To demonstrate the accuracy of our approach in practice, we show qualitative results on the Oxford RobotCar dataset. We track the snowy test sequence 2015-02-03-08-45-10 using Stereo DSO [37] and at the same time perform relocalization against the sunny reference map 2015-02-24-12-32-19. Relocalization between the current keyframe and the closest map image is performed using LM-Net. Initially, we give the algorithm the first corresponding map image (which would in practice be provided by an image retrieval approach such as NetVLAD [3]). Afterwards, we find the closest map image for each keyframe using the previous solution for the transformation between the map and the current SLAM world. We visualize the current point cloud (blue) and the point cloud from the map (grey), overlaid using the smoothed transformation (Figure 4). The point clouds align only if the relocalization is accurate. As can be seen in Figure 4, the lane markings, poles, and buildings between the reference and query map align well, qualitatively showing the high relocalization accuracy of our method. We recommend watching the video at https://vision.in.tum.de/lmreloc. In Figure 5, we show example images from the benchmark.
5 Conclusion
We have presented LM-Reloc, a novel approach for direct visual localization. In order to estimate the relative 6DoF pose between two images from different conditions, our approach performs direct image alignment on the features trained with LM-Net, without relying on feature matching or RANSAC. In particular, with the loss function designed specifically for the Levenberg-Marquardt algorithm, LM-Net provides deep feature maps that capture the characteristics of direct image alignment and are also invariant to changes in lighting and appearance of the scene. The experiments on the CARLA and Oxford RobotCar relocalization tracking benchmark exhibit the state-of-the-art performance of our approach. In addition, the ablation studies show the effectiveness of the different components of LM-Reloc.
References
 H. Alismail, B. Browning, and S. Lucey. Photometric bundle adjustment for vision-based SLAM. In ACCV, 2017.
 H. Alismail, M. Kaess, B. Browning, and S. Lucey. Direct visual odometry in low light using binary descriptors. RA-L, 2, 2017.
 R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
 M. A. Brubaker, A. Geiger, and R. Urtasun. Map-based probabilistic visual self-localization. PAMI, 38(4):652–665, 2015.
 C.-H. Chang, C.-N. Chou, and E. Y. Chang. CLKN: Cascaded Lucas-Kanade networks for image alignment. In CVPR, pages 2213–2221, 2017.
 M. Cummins and P. Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. IJRR, 27(6), 2008.
 A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. PAMI, 29(6):1052–1067, 2007.
 D. DeTone, T. Malisiewicz, and A. Rabinovich. SuperPoint: Self-supervised interest point detection and description. In CVPRW, 2018.
 A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
 M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler. D2-Net: A trainable CNN for joint description and detection of local features. In CVPR, 2019.
 J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. PAMI, 40(3), 2018.
 J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
 C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In IROS, pages 2100–2106. IEEE, 2013.
 G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In ISMAR, pages 225–234. IEEE, 2007.
 Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.
 D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
 Z. Lv, F. Dellaert, J. M. Rehg, and A. Geiger. Taking a deeper look at the inverse compositional algorithm. In CVPR, pages 4581–4590, 2019.
 W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The Oxford RobotCar dataset. IJRR, 36(1), 2017.
 D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
 I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, and J. Kannala. DGC-Net: Dense geometric correspondence network. In WACV, pages 1034–1042. IEEE, 2019.
 R. Mur-Artal, J. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE TRO, 31(5), 2015.
 R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE TRO, 33(5), 2017.
 T. Ort, L. Paull, and D. Rus. Autonomous vehicle navigation in rural environments without detailed prior maps. In ICRA, pages 2040–2047. IEEE, 2018.
 G. Pascoe, W. Maddern, M. Tanner, P. Piniés, and P. Newman. NID-SLAM: Robust monocular SLAM using normalised information distance. In CVPR, pages 1435–1444, 2017.
 J. Revaud, C. De Souza, M. Humenberger, and P. Weinzaepfel. R2D2: Reliable and repeatable detector and descriptor. In NeurIPS, pages 12405–12415, 2019.
 I. Rocco, R. Arandjelovic, and J. Sivic. Convolutional neural network architecture for geometric matching. In CVPR, pages 6148–6157, 2017.
 E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, pages 2564–2571. IEEE, 2011.
 S. Saeedi, B. Bodin, H. Wagstaff, A. Nisbet, L. Nardi, J. Mawer, N. Melot, O. Palomar, E. Vespa, T. Spink, et al. Navigating the landscape for real-time localization and mapping for robotics and virtual and augmented reality. Proceedings of the IEEE, 106(11):2020–2039, 2018.
 P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, pages 12716–12725, 2019.
 P.-E. Sarlin, F. Debraine, M. Dymczyk, R. Siegwart, and C. Cadena. Leveraging deep visual descriptors for hierarchical efficient localization. In CoRL, 2018.
 P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, pages 4938–4947, 2020.
 T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla. Benchmarking 6DOF outdoor visual localization in changing conditions. In CVPR, 2018.
 J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016.
 H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, pages 7199–7209, 2018.
 L. von Stumberg, P. Wenzel, Q. Khan, and D. Cremers. GN-Net: The Gauss-Newton loss for multi-weather relocalization. RA-L, 5(2):890–897, 2020.
 R. Wang, M. Schwörer, and D. Cremers. Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In ICCV, 2017.
 N. Yang, L. von Stumberg, R. Wang, and D. Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In CVPR, pages 1281–1292, 2020.
 N. Yang, R. Wang, X. Gao, and D. Cremers. Challenges in monocular visual odometry: Photometric calibration, motion bias, and rolling shutter effect. RA-L, 3(4):2878–2885, 2018.
 F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr. GA-Net: Guided aggregation net for end-to-end stereo matching. In CVPR, pages 185–194, 2019.
 X. Zheng, Z. Moratto, M. Li, and A. I. Mourikis. Photometric patch-based visual-inertial odometry. In ICRA, pages 3264–3271, 2017.