Scene-Aware Error Modeling of LiDAR/Visual Odometry for Fusion-based Vehicle Localization
Abstract
Localization is an essential technique in mobile robotics. In a complex environment, it is necessary to fuse different localization modules to obtain more robust results, in which the error model plays a paramount role. However, exteroceptive sensor-based odometries (ESOs), such as LiDAR/visual odometry, often deliver results with scene-related error, which is difficult to model accurately. To address this problem, this research designs a scene-aware error model for ESO, based on which a multimodal localization fusion framework is developed. In addition, an end-to-end learning method is proposed to train this error model using sparse global poses such as GPS/IMU results. The proposed method is realized for error modeling of LiDAR/visual odometry, and the results are fused with dead reckoning to examine the performance of vehicle localization. Experiments are conducted using both simulation and real-world data of experienced and unexperienced environments, and the experimental results demonstrate that with the learned scene-aware error models, vehicle localization accuracy can be largely improved and shows adaptiveness in unexperienced scenes.
I Introduction
Recent years have witnessed considerable progress in developing autonomous systems[1][2][3][4], where highly accurate vehicle localization is the key to achieving safe and efficient autonomy in a complex real world.
GNSSs (global navigation satellite systems) have been widely used in vehicle localization in outdoor environments and are usually combined with proprioceptive sensors such as IMUs (inertial measurement units) and wheel encoders for interpolating positions during satellite signal outages[5][6]. However, such systems are restricted by GNSS conditions, and the IMU maintains accuracy only for short periods due to accelerometer biases and gyro drifts. Therefore, exteroceptive sensor-based approaches such as LiDAR odometry[7][8][9] or visual odometry[10][11][12] have been studied to assist in highly accurate localization. Hereinafter, we refer to exteroceptive sensor-based odometry as ESO and proprioceptive sensor-based odometry as PSO.
However, the performance of exteroceptive sensor-based localization is strongly related to the scenes. When fusing them with other localization approaches, e.g., [13][14][15], precise error modeling is essential to the fusion efficiency. Covariance has been a widely used measurement for error estimation[16][17][18]. Many of these methods correlate pose uncertainty with the covariance of data matching[19][20], which could be an ill-posed problem in many situations. In addition, most existing works only model the error of their measurements or features[21][22][23], whereas fewer studies, such as [24][18][25], focus on the error of the final localization results.
This research proposes a method of ESO scene-aware error modeling for fusion-based localization, which is formulated as a mapping from given scene data to a prediction of the ESO error as an information matrix, a dual form of a covariance matrix. A CNN (convolutional neural network) is used to model the mapping procedure, and a vehicle localization framework is devised to incorporate the scene-aware error modeling results in fusing ESO for pose estimation. An end-to-end method is developed to train the error model by using reliable global localization, such as GPS, as supervision, which could be sporadic. Therefore, at each iteration, the vehicle is localized by forward propagation on the current parameters for a number of frames, and when a reliable GPS measurement is found, the global localization error is backpropagated along the pipeline to correct the CNN parameters.
The proposed method is realized for error modeling of LiDAR odometry and visual odometry, and the results are fused with dead reckoning to examine the performance of vehicle localization. Experiments are conducted using both simulation and real-world data. The former validates the adaptability of the proposed method in simple but typical scenes, while the latter examines the performance in a complex real world that contains experienced and unexperienced environments. The experiments are deployed on some popular LiDAR odometry[8][26] and visual odometry[10][12][27] methods and compared with traditional fusion approaches using covariance-based error modeling. The experimental results demonstrate that with the learned scene-aware error models, vehicle localization accuracy can be largely improved, and it shows adaptiveness in unexperienced scenes.
The paper is organized as follows. A literature review about ESO and the corresponding error model is presented in section II. An overview of the proposed error model learning method is given in section III. Experiments on LiDAR/visual odometry and the analysis of their results are illustrated in sections IV and V, respectively. Finally, section VI describes the conclusions and the direction of our future works.
TABLE I

|    | Research | Category | Optimizing Method | Feature | Objective | Error Model |
|----|----------|----------|-------------------|---------|-----------|-------------|
| LO | Censi, 2008, [26] | Fb | Lagrange's Multiplier | RP & Line | Feature Distance | |
| | Bosse, 2009, [28] | Fb | WLS | Shape Info. | Match & Smoothness Func. | |
| | Armesto, 2010, [29] | Fb | LS | RP, Facet | Metric Distance | |
| | Zhang, 2014, [9] | Fb | LM | Line, Plane | Feature Distance | |
| | Velas, 2016, [30] | Fb | SVD | Line | Feature Distance | |
| | Wang, 2018, [31] | Fb | EM, IRLS | RP | Likelihood Func. | |
| | Magnusson, 2007, [32] | Direct | Newton | RP | Joint Prob. | |
| | Olson, 2009, [8] | Direct | Search | RP | Posterior Observation Prob. | Cov. |
| | Olson, 2015, [33] | Direct | Search | RP | Correlative Cost Func. | |
| | Jaimez, 2016, [34] | Direct | IRLS | RP (Range Flow) | Geometric Residual | |
| | Ramos, 2007, [35] | Heuristic/Fb | WLS | Local Feature | CRF Inference Error | |
| | Diosi, 2007, [36] | Heuristic/Fb | WLS, Parabola Fitting | RP | Polar Range Distance | Cov. |
| | Censi, 2009, [37] | Heuristic/Direct | LS | RP (Hough) | Spectrum Func. | |
| VO | Howard, 2008, [38] | Fb | LM | Harris/FAST | RE | |
| | Kitt, 2010, [10] | Fb | ISPKF | Harris et al. | RE | Cov. |
| | Mouats, 2014, [39] | Fb | GN | Log-Gabor Wavelets | RE | |
| | Gomez-Ojeda, 2016, [27] | Fb | GN | ORB & LSD | RE | Cov. |
| | Zhang, 2012, [40] | Fb | PHD | SIFT | RE | Cov. |
| | Engel, 2013, [41] | Direct | RGN | | PE | Cov. |
| | Kerl, 2013, [42] | Direct | IRLS | | PE | |
| | Wang, 2017, [43] | Direct | GN | | PE | |
| | Li, 2018, [44] | Direct | GN | | PE | |
| | Engel, 2017, [12] | Direct | GN | | PE | |
| | Forster, 2014, [11] | Semi-Direct | GN | Sparse Feature Patches | PE & RE | |
| | Wang, 2017, [45] | Deep Learning | BP | | Pose MSE | |
| Others | Tanskanen, 2015, [46] | Visual-Inertial | EKF | | PE | Cov. |
| | Usenko, 2016, [47] | Visual-Inertial | LM | | Photometric-Inertial Energy | |
| | Qin, 2018, [48] | Visual-Inertial | LS | Harris | Feature & IMU Residual | |
| | Zhang, 2015, [49] | Visual-LiDAR | LM | Harris & RP | Feature Distance | |
| | Hemann, 2016, [24] | LiDAR-Inertial | KF | RP & DEM | Cross-correlation Func. | Cov. |
| | Barjenbruch, 2015, [50] | Radar | Gradient-based | Spatial & Doppler Info. | Metric Func. | |

*The abbreviations in this table are listed in alphabetical order.

BP: Back Propagation; Cov.: Covariance; CRF: Conditional Random Field; DEM: Digital Elevation Model; EKF: Extended Kalman Filter; EM: Expectation-Maximization; Fb: Feature-based; Func.: Function; GN: Gauss-Newton; IMU: Inertial Measurement Unit; Info.: Information; IRLS: Iteratively Reweighted Least Squares; ISPKF: Iterated Sigma Point Kalman Filter; KF: Kalman Filter; LM: Levenberg-Marquardt; LO: LiDAR Odometry; LS: Least Squares; MSE: Mean-square Error; PE: Photometric Error; PHD: Probability Hypothesis Density; Prob.: Probability; RE: Reprojection Error; RGN: Reweighted Gauss-Newton; RP: Raw Point; SVD: Singular Value Decomposition; VO: Visual Odometry; WLS: Weighted Least Squares
II Related Works
II-A LiDAR Odometry
LiDAR odometry performs relative positioning by comparing laser measurements from consecutive LiDAR scans, a process better known as scan matching. Following the conventional taxonomy of visual odometry, this paper divides LiDAR odometries into feature-based methods and direct methods according to whether explicit feature correspondence is needed.
Feature-based methods. A typical method for scan matching is to build the feature correspondence between consecutive LiDAR scans, after which the motion from the reference frame to the target frame can be calculated from the matching results. In feature selection, various definitions of features, such as points, lines, planes and other self-defined local features, can be used alone or in combination[26][9]. In the optimization strategies of feature matching, many works, such as [28] and [29], are variants of the ICP (iterative closest point)[51] algorithm, which iteratively minimizes the feature matching error using an optimizer such as least squares. Apparently, feature association in such an indirect matching method creates considerable computing cost and often leads to overconfident mismatching.
Direct methods. To overcome the efficiency problem of feature association, some researchers have attempted to avoid building such explicit correspondence. [7] transformed the scan-to-scan matching problem into a correlation evaluation under a probabilistic framework, and [32] extended it to 3D applications. [8] proposed correlative scan matching by employing a Monte Carlo sampling strategy, and [33] improved the efficiency of such methods using multiresolution matching. [34] designed a range flow-based approach in the fashion of dense 3D visual odometry, which performs scan alignment using scan gradients.
In addition, many heuristic methods have been proposed to compensate for the flaws of previous work, such as poor convergence or dependence on initialization. [36] matched LiDAR points with the same bearing in polar coordinates to run faster than ICP. [35] presented a CRF (conditional random field)-based scan matching, which takes into account high-level shape information. [37] attempted to use the Hough transform to decompose the 6-DoF search into a series of fast one-dimensional cross-correlations.
II-B Visual Odometry
Similar to LiDAR odometry, visual odometry retrieves the camera motion using information from images taken at different poses. Visual odometries can be simply divided into two classes: feature-based methods and direct methods.
Feature-based methods. These methods require feature extraction and association, mostly aiming at minimizing the reprojection error of the matched features. In feature extraction, typical image point features such as corners are well utilized, as in [38][10]. Line features and other novel features can also be used for different image scenes or camera sensors. [27] combined ORB and LSD features to obtain more stable tracking in low-textured scenes. With multispectral cameras, [39] used log-Gabor wavelets to obtain interest points at different orientations and scales. In the optimization process, most works employ a nonlinear optimizer for feature matching between consecutive images, as previously mentioned[38][39]. In addition, some works exploit filtering methods to track the features over an image sequence. [10] used the iterated sigma point Kalman filter to track the ego-motion trajectory and feature observations. [40] considered image features as group targets and used the probability hypothesis density filter to track the group states. Most feature-based visual odometries share the same problems of computing efficiency and accuracy in data association, similar to feature-based LiDAR odometries. Moreover, feature-based visual odometries concentrate only on the extracted features without considering the information remaining in the images, which actually places a strong requirement on feature abundance.
Direct methods. To eliminate the above shortcomings of feature-based visual odometries, direct visual odometries have appeared in recent studies. They directly use camera sensor measurements without precomputation, considering the photometric error for pose estimation. For instance, [42] presented a direct method for RGB-D cameras. [12] proposed direct sparse odometry, which combines photometric error minimization with the joint optimization of camera model parameters. [44] introduced a direct line guidance odometry, which uses lines to guide the key point selection. There are also hybrid methods, such as the semi-direct visual odometry in [11] and deep learning-based methods[45].
II-C Other Odometries
To further improve the accuracy of the aforementioned odometries, many studies have attempted to incorporate inertial sensors: [46][47][48] each propose a visual-inertial odometry, and [24] used an IMU to improve the performance of LiDAR odometry in long-range navigation. Using LiDAR and a camera together is another direction: [49] implemented visual-LiDAR odometry, which is more robust when visual features are lacking or motion is aggressive. There are also some works using radar sensors[50].
II-D Error Model
As Table I shows, only a small part of the ESO-related literature, such as [8][10], presents error models for uncertainty estimation. In these studies, the Hessian method and the sampling method are two representative routes for error modeling. Consider a simple odometry model as an example:
(1) $x^{*} = \operatorname*{arg\,min}_{x} F(x, Z)$

where $x$ is the pose to be estimated as a column vector, $Z = \{z_{i}\}$ is the corresponding observation set consisting of sensor measurements with i.i.d. noise of variance $\sigma^{2}$, and $F$ is the objective function that measures the matching error between $x$ and $Z$. The error modeling methods can be formulated as follows.
Hessian method. When $F$ is designed to be analytical and differentiable as

(2) $F(x, Z) = \sum_{i} \| z_{i} - f(x) \|^{2}$

where $f$ represents a measurement model mapping $x$ to $z_{i}$, Eq. 1 can be solved using least squares. Therefore, the solution $x^{*}$ can be approached recursively as

(3) $x_{k+1} = x_{k} - H^{-1} \nabla F(x_{k})$

in the Newton method, so that the conditional covariance of $x^{*}$ can be derived as

(4) $\Sigma_{x} \approx \sigma^{2} H^{-1}$

where $H = \partial^{2} F / \partial x^{2}$ is defined as the Hessian matrix of $F$ in mathematics. For the uncertainty of the measurements, $\sigma^{2}$ is propagated to $x$ by the inverse Hessian matrix of $F$, and this method is named the Hessian method.
Hessian methods are widely used in various odometries, such as [52][53][27][54]. However, several problems restrict their usage. First, for feature matching-based odometry, Hessian methods depend on a strong assumption that the feature correspondence is established correctly[55]. Second, in some cases, the inverse Hessian matrix is difficult to calculate and must be approximated from a Jacobian matrix, as in [27][41], which actually decreases the covariance accuracy. Third, it is difficult to ensure that the update step in Eq. 3 is infinitesimal, which is required by the covariance calculation for nonlinear least squares. Due to these problems, many studies, such as [7][56][20], attempt to extend the Hessian method case by case, but it is still far from accurate.
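As a minimal illustration of the Hessian-method covariance of Eq. 4, the following sketch estimates a 2D translation by least squares and propagates the measurement noise through the inverse Hessian; the point set, noise level, and measurement model here are assumptions of this sketch, not the paper's setup.

```python
import numpy as np

# Toy registration: estimate a 2D translation x that maps model points p_i
# onto noisy observations z_i = p_i + x + noise (illustrative only).
rng = np.random.default_rng(0)
p = rng.uniform(-5, 5, size=(100, 2))            # model points
x_true = np.array([1.0, -2.0])                   # true translation
sigma = 0.1                                      # i.i.d. measurement noise std
z = p + x_true + rng.normal(0, sigma, p.shape)   # observations

# Least-squares solution of Eq. 1 with f(x) = p_i + x: residual r_i = z_i - p_i - x
x_hat = np.mean(z - p, axis=0)

# The Jacobian of each residual w.r.t. x is -I, so J^T J = n * I, and the
# Hessian-method covariance (Eq. 4) becomes sigma^2 * (J^T J)^{-1}.
n = len(p)
JTJ = n * np.eye(2)
cov = sigma**2 * np.linalg.inv(JTJ)

print("estimate:", x_hat)
print("predicted std per axis:", np.sqrt(np.diag(cov)))
```

For this linear model the predicted standard deviation shrinks as $\sigma/\sqrt{n}$, which is exactly the behavior the Hessian method generalizes to nonlinear objectives.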
Sampling method. For a nonanalytical $F$, the covariance can be calculated by sampling poses according to a distribution $p(x)$, such as the prediction from the motion model. Assuming the sample set $\{x_{j}\}_{j=1}^{K}$, the weighted mean of these samples can be regarded as an estimation of $x^{*}$:

(5) $\bar{x} = \frac{\sum_{j} p(Z \mid x_{j})\, x_{j}}{\sum_{j} p(Z \mid x_{j})}$

where $p(Z \mid x)$ denotes the probabilistic measurement model, so that the covariance can be calculated as

(6) $\Sigma_{x} = \frac{\sum_{j} p(Z \mid x_{j})\, (x_{j} - \bar{x})(x_{j} - \bar{x})^{T}}{\sum_{j} p(Z \mid x_{j})}$

where the superscript $T$ represents the transpose operator here and later.
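The sampling procedure of Eqs. 5 and 6 can be sketched as follows; the Gaussian likelihood is a hypothetical stand-in for a real scan-matching score, and all numbers are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def likelihood(x):
    # hypothetical measurement model p(Z|x), peaked near the "true" pose (1.0, 0.5)
    mu = np.array([1.0, 0.5])
    return np.exp(-0.5 * np.sum((x - mu) ** 2) / 0.2**2)

# sample candidate poses from the motion-model prediction (the proposal distribution)
samples = rng.normal(loc=[0.9, 0.4], scale=0.5, size=(5000, 2))
w = np.array([likelihood(x) for x in samples])
w /= w.sum()

x_bar = (w[:, None] * samples).sum(axis=0)                             # Eq. 5
d = samples - x_bar
cov = (w[:, None, None] * np.einsum('ni,nj->nij', d, d)).sum(axis=0)   # Eq. 6

print("weighted mean:", x_bar)
print("covariance:\n", cov)
```

The weighted mean lands between the proposal center and the likelihood peak, and the covariance reflects both, which is the behavior a sampled scan-matching covariance exhibits.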
For instance, [57][8] used this method to calculate the covariance for scan matching, and the ROS[58] package amcl follows a similar sampling-based strategy.
Overall, existing error modeling methods strongly depend on a series of definitions and assumptions, which may have a negative influence on the uncertainty estimation. For example, the measurement model and objective function in the Hessian method and the sampling distribution in the sampling method need to be manually designed or approximated, which may not objectively reflect the true relationship between the target parameter $x$ and the sensor observations $Z$. In addition, for fusion-based localization, two further important characteristics of ESO error are often overlooked. First, the error model should be compatible with other localization modules for comparison. More importantly, the performance of ESO is sensitive to the scene. Therefore, a scene-aware error model is needed to capture the relationship between odometry performance and the environment.
III Methodology
III-A Fusion-based Localization with Scene-Aware Error Modeling
Assume that a PSO such as dead reckoning has a system error whose model parameters are calibrated and not correlated with the scene. Referring to such a PSO, a scene-aware error model of an ESO such as LiDAR or visual odometry can be learned, and multimodal fusion-based localization can be achieved as described in Fig. 1.
At time $t$, let $\hat{x}^{P}_{t}$ and $\hat{x}^{E}_{t}$ be the relative poses estimated by PSO and ESO, respectively:

(7) $\hat{x}^{P}_{t} = x_{t} + \varepsilon^{P}, \qquad \hat{x}^{E}_{t} = x_{t} + \varepsilon^{E}_{t}$

where $x_{t}$ is the true relative pose, and $\varepsilon^{P}$ and $\varepsilon^{E}_{t}$ are Gaussian noise of $\hat{x}^{P}_{t}$ and $\hat{x}^{E}_{t}$ with their respective covariances $\Sigma^{P}$ and $\Sigma^{E}_{t}$. $\Sigma^{P}$ is a system error, while $\Sigma^{E}_{t}$ is predicted by a scene-aware error model $h_{\theta}(s_{t})$ on data $s_{t}$ that describes the scene at the moment.
An information filter is used to find an MAP (maximum a posteriori) estimation of the relative pose, which is represented by a mean pose $\mu_{t}$ and a covariance matrix $\Sigma_{t}$. In Fig. 1, we denote the fusion module by $\mathcal{F}$, i.e., $(\mu_{t}, \Sigma_{t}) = \mathcal{F}(\hat{x}^{P}_{t}, \Sigma^{P}, \hat{x}^{E}_{t}, \Sigma^{E}_{t})$, which is operated recurrently and has the function of history memory. The process is detailed in the next section.
Since it is difficult to find accurate relative poses as the ground truth, we use the global pose given by RTK-GPS instead. The supervision is conducted sporadically when the following two conditions are met: 1) a reliable GPS measurement is obtained, and 2) the relative pose error has accumulated over enough frames to exceed the error level of GPS.
Representing the relative pose in a uniform (homogeneous) matrix

(8) $T_{t} = \begin{bmatrix} R_{t} & p_{t} \\ \mathbf{0} & 1 \end{bmatrix}$

where $R_{t}$ and $p_{t}$ are the rotation matrix and translation vector, the vehicle's mean pose $X_{t}$ in a global coordinate system can be estimated by accumulating the relative motions sequentially from an initial global pose $X_{0}$:

(9) $X_{t} = X_{t-1} T_{t}$
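The accumulation of Eqs. 8 and 9 can be sketched for the planar case as follows; the relative motions are made-up numbers for illustration.

```python
import numpy as np

# Each fused relative pose (dx, dy, dtheta) is written as a homogeneous
# transform T_t and accumulated onto the global pose X_t (Eq. 9).
def to_T(dx, dy, dth):
    c, s = np.cos(dth), np.sin(dth)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0,  0,  1]])

X = np.eye(3)                          # initial global pose X_0
rel_motions = [(1.0, 0.0, 0.0),        # 1 m forward
               (1.0, 0.0, np.pi / 2),  # 1 m forward, then turn left 90 deg
               (1.0, 0.0, 0.0)]        # 1 m forward in the new heading
for d in rel_motions:
    X = X @ to_T(*d)                   # Eq. 9: X_t = X_{t-1} T_t

print("final position:", X[:2, 2])
print("final heading (rad):", np.arctan2(X[1, 0], X[0, 0]))
```

Because the third motion is expressed in the vehicle frame after the turn, it moves the vehicle along the new heading, which is exactly why relative errors accumulate in a pose-dependent way.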
In Fig. 1, we denote the module of the pose accumulator by $\mathcal{A}$, i.e., $X_{t} = \mathcal{A}(X_{t-1}, \mu_{t})$, which is operated recurrently and has the function of history memory.
When the supervision conditions are met, with a reliable GPS measurement $X^{g}_{t}$, the localization error is backpropagated along the pipeline to optimize the parameters $\theta$ of the error model $h_{\theta}$. Hence, the major pipeline of fusion-based localization with scene-aware error modeling can be summarized by the following formulas, where for conciseness, the subscript $t$ is omitted:

(10) $\Sigma^{E} = h_{\theta}(s), \qquad (\mu, \Sigma) = \mathcal{F}(\hat{x}^{P}, \Sigma^{P}, \hat{x}^{E}, \Sigma^{E}), \qquad X = \mathcal{A}(X^{-}, \mu)$

where $X^{-}$ denotes the global pose of the previous frame.
III-B PSO and ESO Fusion
Relative Pose Fusion Using an Information Filter
The MAP (maximum a posteriori) estimation of the vehicle's relative motion state can be formulated as below, consisting of two subsequent steps in each iteration, i.e., prediction using the vehicle control $\hat{x}^{P}_{t}$ and updating using the measurement $\hat{x}^{E}_{t}$:

(11) $p(x_{t} \mid \hat{x}^{P}_{1:t}, \hat{x}^{E}_{1:t}) \propto p(\hat{x}^{E}_{t} \mid x_{t}) \int p(x_{t} \mid x_{t-1}, \hat{x}^{P}_{t})\, p(x_{t-1} \mid \hat{x}^{P}_{1:t-1}, \hat{x}^{E}_{1:t-1})\, \mathrm{d}x_{t-1}$
When each relative motion occurs in a very short time, this recursive estimation can also be regarded as a tracking process over a series of velocity measurements.
An information filter [16] can be used to estimate the vehicle pose, with the multivariate Gaussian distribution represented in canonical form using an information vector $\xi$ and an information matrix $\Omega$:

(12) $\Omega = \Sigma^{-1}, \qquad \xi = \Sigma^{-1} \mu$

where $\Sigma$ denotes the covariance matrix of this distribution. Obviously, $\Omega$ and $\Sigma$ are dual measures of the uncertainty.
By extending the original information filter, a relative pose fuser (RPF) is developed in this research, as listed in Algorithm 1. At each frame, given $\hat{x}^{P}_{t}$ and $\hat{x}^{E}_{t}$ from the PSO and ESO modules, as well as their covariance matrices $\Sigma^{P}$ and $\Sigma^{E}_{t}$ as uncertainty estimation, the RPF function estimates the mean relative motion $\mu_{t}$ at time $t$, together with $\Omega_{t}$ and $\xi_{t}$. Here, we use the information matrix $\Omega^{E}_{t} = (\Sigma^{E}_{t})^{-1}$ as the input of $\mathcal{F}$. The reasons are twofold: 1) taking $\Omega^{E}_{t}$ as an input of $\mathcal{F}$ avoids the computing cost of inverting the matrix inside the filter; and 2) the numerical estimation of $\Omega^{E}_{t}$ is more stable when $\Sigma^{E}_{t}$ is very large, so the formula of $\mathcal{F}$ in Eq. 10 should be rewritten as

(13) $(\mu, \Sigma) = \mathcal{F}(\hat{x}^{P}, \Sigma^{P}, \hat{x}^{E}, \Omega^{E})$
The original information filter and the derivation of Algorithm 1 are given in Appendix A.
Step 1 of Algorithm 1 is coordinate transformation. Since this research estimates the vehicle's relative pose with the information filter, at each frame $t$, the vehicle's previous pose is taken as the zero position, and consequently the prior mean and information vector satisfy $\mu = \mathbf{0}$ and $\xi = \mathbf{0}$, where $\mathbf{0}$ is used to represent the zero vector or matrix here and later. On the other hand, the information matrix $\Omega_{t-1}$ obtained in the previous iteration needs to be transformed to compensate for the rotation factor in $\mu_{t-1}$. Assume $\mu_{t-1}$ can be decoupled as $(R_{t-1}, p_{t-1})$, where $R_{t-1}$ denotes the rotation factor and $p_{t-1}$ is the translation factor. Letting $M_{t-1} = \mathrm{diag}(R_{t-1}, 1)$, we have

(14) $\Omega'_{t-1} = M_{t-1}^{T}\, \Omega_{t-1}\, M_{t-1}$

Here, $\mu_{t-1}$, $\xi_{t-1}$ and $\Omega_{t-1}$ denote the results obtained in the estimation of time $t-1$, which are relative to the vehicle's coordinate system of the previous frame, whereas $\mu'_{t-1}$, $\xi'_{t-1}$ and $\Omega'_{t-1}$ are the converted results used in the estimation of time $t$, which are relative to the vehicle's coordinate system of the current frame.
Steps 2–3 predict the information matrix and vector by incorporating the outputs of the referred PSO module, steps 4–5 are the measurement update using results from the target ESO, and step 6 converts the canonical representation back to a mean relative pose $\mu_{t}$.
Extension to Multimodal Fusion
This model can be easily extended to a system with other mutually independent odometry modules, where the output of each target ESO module is treated as an independent observation variable, similar to the 2-module system. Assuming that there are $m$ target ESO modules and that the $i$th observation and its covariance are $\hat{x}^{E_{i}}_{t}$ and $\Sigma^{E_{i}}_{t}$, the probabilistic formulation can be extended as

(15) $p(x_{t} \mid \hat{x}^{P}_{1:t}, \hat{x}^{E_{1}}_{1:t}, \ldots, \hat{x}^{E_{m}}_{1:t}) \propto \left[ \prod_{i=1}^{m} p(\hat{x}^{E_{i}}_{t} \mid x_{t}) \right] \int p(x_{t} \mid x_{t-1}, \hat{x}^{P}_{t})\, p(x_{t-1} \mid \cdot)\, \mathrm{d}x_{t-1}$
so that in the information filter framework, steps 4 and 5 in Algorithm 1 can be rewritten as

(16) $\Omega_{t} = \bar{\Omega}_{t} + \sum_{i=1}^{m} \Omega^{E_{i}}_{t}, \qquad \xi_{t} = \bar{\xi}_{t} + \sum_{i=1}^{m} \Omega^{E_{i}}_{t}\, \hat{x}^{E_{i}}_{t}$

where $\bar{\Omega}_{t}$ and $\bar{\xi}_{t}$ are the predicted information matrix and vector from steps 2–3.
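A minimal sketch of this information-form fusion, with the ESO uncertainty supplied directly as an information matrix as in Eq. 16, is given below; the state is (dx, dy, dyaw) and all numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rpf(x_pso, sigma_pso, eso_list):
    """Fuse a PSO prediction with ESO observations given as (pose, info matrix)."""
    omega = np.linalg.inv(sigma_pso)      # prediction in information form
    xi = omega @ x_pso
    for x_eso, omega_eso in eso_list:     # Eq. 16: add each ESO observation
        omega = omega + omega_eso
        xi = xi + omega_eso @ x_eso
    mu = np.linalg.solve(omega, xi)       # back to a mean relative pose
    return mu, omega

x_pso = np.array([1.00, 0.00, 0.00])
sigma_pso = np.diag([0.04, 0.04, 0.01])
# an ESO that is confident laterally and in yaw but weak longitudinally
# (a corridor-like scene), expressed directly as an information matrix
x_eso = np.array([0.90, 0.02, 0.01])
omega_eso = np.diag([1.0, 400.0, 400.0])

mu, omega = rpf(x_pso, sigma_pso, [(x_eso, omega_eso)])
print("fused pose:", mu)
```

Note how the fused longitudinal estimate stays close to the PSO value while the lateral and yaw estimates follow the ESO, which is exactly the scene-dependent weighting the error model is meant to supply.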
III-C Scene-Aware Error Model Learning
Error Modeling
The pipeline is shown in Fig. 2. For an ESO such as LiDAR/visual odometry, the scene data $s$ can be obtained from its input, such as a camera image or a LiDAR point cloud, and the next step is to map it to the pose error, namely, the information matrix $\Omega^{E}$ in the RPF. An information matrix is symmetric and positive definite when the covariance is nonsingular; hence, it can be factorized by Cholesky decomposition

(17) $\Omega^{E} = L L^{T}$

where

(18) $L = \begin{bmatrix} l_{11} & & \\ \vdots & \ddots & \\ l_{n1} & \cdots & l_{nn} \end{bmatrix}$

is the unique lower-triangular factor of $\Omega^{E}$. Define an information descriptor $d = (l_{11}, l_{21}, l_{22}, \ldots, l_{nn})$ consisting of all the independent elements of $L$. The neural network needs to be customized for different scene information and inputs, and it outputs $d$, with which $\Omega^{E}$ can be uniquely estimated as the predicted error.
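The Cholesky parameterization of Eqs. 17 and 18 can be sketched for a 3-DoF pose as follows. Exponentiating the diagonal entries is an assumption of this sketch (one common way to keep the factor nonsingular), not something stated in the paper.

```python
import numpy as np

# A network predicting the 6-element information descriptor
# d = (l11, l21, l22, l31, l32, l33) for a 3-DoF pose; exp() on the diagonal
# guarantees the reconstructed information matrix is positive definite.
def descriptor_to_information(d):
    l11, l21, l22, l31, l32, l33 = d
    L = np.array([[np.exp(l11), 0.0,         0.0],
                  [l21,         np.exp(l22), 0.0],
                  [l31,         l32,         np.exp(l33)]])
    return L @ L.T                # Eq. 17: Omega = L L^T

d = np.array([0.5, -0.2, 1.0, 0.1, 0.3, 0.8])  # e.g. raw network outputs
omega = descriptor_to_information(d)
print("eigenvalues:", np.linalg.eigvalsh(omega))
```

Whatever real values the network emits, the reconstructed matrix is symmetric with strictly positive eigenvalues, so the filter always receives a valid uncertainty estimate.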
Loss Function
To learn a parameter set $\theta$ of the neural network in Fig. 2, directly supervised learning is not applicable, as the ground truth of neither the information descriptor $d$ nor the information matrix $\Omega^{E}$ is available. However, the vehicle's ground truth position can be obtained under certain conditions. For example, a vehicle pose $X^{g}_{t}$ can be measured using, e.g., a GPS/IMU suite or a loop closure detector. $X^{g}_{t}$ is considered as a ground truth location if and only if

(19) $\| \hat{X}_{t} - X^{g}_{t} \| \gg \epsilon_{g}$

where $\hat{X}_{t}$ is the estimation at that time obtained by fusing the scan matching and dead reckoning outputs, and $\epsilon_{g}$ is the error level of the measurement $X^{g}_{t}$.
Given a parameter set $\theta$, the localization module initiates from a ground truth pose and estimates the vehicle pose for $N$ steps. The localization error caused by $\theta$ accumulates during these steps and is evaluated at the terminal time as below, where $\beta$ is a hyperparameter weighting the errors in location and heading angle:

(20) $E(\theta) = \| \hat{p}_{t+N} - p^{g}_{t+N} \|^{2} + \beta\, (\hat{\phi}_{t+N} - \phi^{g}_{t+N})^{2}$

where $\hat{p}$, $\hat{\phi}$ and $p^{g}$, $\phi^{g}$ denote the estimated and ground truth location and heading, respectively.
Given a pair of ground truth positions $(X^{g}_{t}, X^{g}_{t+N})$, the objective is to learn a parameter set $\theta^{*}$ that minimizes the error $E$, subject to the localization pipeline of Eq. 10:

(21) $\theta^{*} = \operatorname*{arg\,min}_{\theta} E(\theta)$
Parameter Learning
$\theta$ is refined iteratively whenever a pair of the vehicle's ground truth positions is obtained, where learning is conducted in two subsequent steps, forward prediction and backpropagation, which are described in Fig. 3 and Algorithm 2.

Forward prediction estimates a sequence of vehicle poses on the current parameter set $\theta$ for $N$ steps, where the process of each step is described in lines 5–13 of Algorithm 2. Initiated from the ground truth $X^{g}_{t}$, forward prediction results in an estimation $\hat{X}_{t+N}$ of the vehicle pose at time $t+N$.

Backpropagation refines $\theta$ to minimize the error between the estimated vehicle pose at time $t+N$ and its ground truth $X^{g}_{t+N}$. The functions $\mathcal{F}$ and $\mathcal{A}$ and the neural network are differentiable, so the error can be backpropagated from time $t+N$ to $t$ by stochastic gradient descent. The gradient estimation and the backpropagation process are described in lines 14–23 of Algorithm 2.
IV Experiment on LiDAR Odometry
An overview of the processing flow for fusing LiDAR odometry is given in Fig. 4. A simple dead reckoning (DR) is used as the referred PSO. For the target LiDAR odometry, two classical 2D scan matching algorithms, CSM[8] and PL-ICP[26], are selected for error model learning, and their traditional error models [8] and [53], corresponding to the aforementioned sampling method and Hessian method, respectively, are used for comparison with our method.
Three experimental results are presented. First, simulation data from specifically designed simulation environments are used to verify the proposed method and demonstrate that the predicted error models can capture scene properties. Second, real-world data from an instrumented vehicle are used, where training and testing are conducted in the same campus environment to compare the performance. Third, experiments in an unexperienced environment are conducted, where training and testing are performed at different sites to demonstrate the generality of the proposed method in unexperienced scenes.
IV-A State Definition and Network Design
For intelligent vehicles, 2-dimensional localization is generally sufficient in structured urban environments, so the state dimension is set to 3 in Eq. 7. More specifically, for relative positioning, any pose state is defined as a column vector of 3 independent elements $(x, y, \phi)$, where $x$ and $y$ are the displacements relative to the zero position in the local coordinate system, and $\phi$ is the heading change as an Euler angle.
For local relative localization, there is no considerable scene change when the car moves such a short distance (several meters). Therefore, only one of the two frames in scan matching is enough to represent the local scene and contains sufficient scene information as the network input. Hence, given a LiDAR scan, a neural network is designed to map it to an information matrix that models the error of the scan matching result on that scan. A CNN (convolutional neural network)[60] is used due to its superior performance, which has been demonstrated in the literature such as [61][62][63]. A LiDAR scan is first converted to a binary image by regularly tessellating an ego-centered horizontal space; each pixel value is 0 or 1, where 0 means that no LiDAR observation falls into the grid and 1 indicates that at least one LiDAR beam hit the grid. In this research, considering learning efficiency and the sparsity of LiDAR points, a coarse-resolution binary image is generated for each scan. The detailed network structure is given in Fig. 5.
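The rasterization of a scan into the network's binary input can be sketched as follows; the image size and window size are placeholders of this sketch, since the paper's exact dimensions are not reproduced here, and the two-wall "scan" mimics the corridor scenes of the simulation experiment.

```python
import numpy as np

def scan_to_image(points, img_size=64, window=40.0):
    """Rasterize LiDAR points (N, 2), in meters in the ego frame, into a
    binary occupancy image over an ego-centered square window."""
    img = np.zeros((img_size, img_size), dtype=np.uint8)
    pix = window / img_size                       # pixel size in meters
    idx = np.floor((points + window / 2) / pix).astype(int)
    keep = np.all((idx >= 0) & (idx < img_size), axis=1)
    img[idx[keep, 1], idx[keep, 0]] = 1           # 1 = at least one beam hit
    return img

# fake scan: two parallel featureless walls, 6 m apart
xs = np.linspace(-15, 15, 200)
scan = np.vstack([np.stack([xs, np.full_like(xs, 3.0)], axis=1),
                  np.stack([xs, np.full_like(xs, -3.0)], axis=1)])
img = scan_to_image(scan)
print("occupied pixels:", int(img.sum()))
```

For such a corridor image, the network should learn to predict low information along the corridor direction and high information across it, matching the scan-matching error distribution described in Section IV-B.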
IV-B Simulation Data Experiment
Dataset
Gazebo[64] and ROS[58] are used as the simulator to build an artificial environment and collect simulated sensor data. As corridor scenes with two parallel featureless walls are very challenging for scan matching and fusion-based approaches built on it, such an environment is built, as shown in the first column of Fig. 6. The sensor set of the simulated car model includes a 360-degree horizontal LiDAR for scan matching, a wheel encoder, and a yaw rate sensor for dead reckoning. To make the simulated data more realistic, Gaussian noise is added to these sensor readings.
In data collection, the simulated car traveled along the corridor with a series of steering operations so that the direction of the LiDAR frame changed continuously. Two sets of data are collected for training and testing by driving the car along a rectangular and a circular trajectory, respectively.
Learning Result of SceneAware Error Model
Because the corridor walls are straight and parallel, the point features are monotonous. The error distribution of scan matching in such a scene usually has a main direction along the direction of the passage. Moreover, the covariance can be estimated by the conventional solutions [8][53] for comparison. Several typical cases in the testing process are shown in Fig. 6, where the results of our method and those of the conventional method are shown side by side. The covariances are represented by 2-standard-deviation ovals and sampled scatter points, which are drawn in the displacement plane. Apparently, due to the dependence on sampling, although the conventional solution can give the correct main direction of the covariance, its scale is not accurate enough, which may lead to a worse localization fusion. In contrast, our model yields not only the correct main direction of the covariance but also a more accurate covariance scale for fusion, which matches the error scale of the odometry.
Localization Accuracy
The position error statistics of every 40-meter-long trajectory segment in the testing process are shown in Fig. 9, in which Fig. 9(a) gives the Euclidean distance error distribution at the end of every trajectory segment, and Fig. 9(b) shows the corresponding yaw error distribution. Our method shows obvious advantages in localization accuracy with both LiDAR odometry algorithms.
IV-C Real Data Experiment
Dataset
An instrumented vehicle, as shown in Fig. 13, is used to collect data at a realworld scene to evaluate the performance of the proposed method. The following sensors are used: 1) LiDARs are horizontally mounted in the front and rear of the car profile for scan matching; 2) a wheel encoder and a yaw rate sensor are used for dead reckoning; and 3) a highly accurate GPS/IMU suite is used to obtain ground truth locations of the vehicle for model training and localization result evaluation.
For experiments on experienced scenes, two sets of data are collected in the same region of the Peking University campus for training and testing, which are conducted on different days. For experiments on unexperienced scenes, we collect a large-scale dataset in several different regions with a total mileage of approximately 10 km. Testing data accounts for 40% of the dataset, most of which is not seen by the LiDAR in the training data. To avoid a great accuracy disparity between the referred PSO and the target ESO, which would leave no complementary information for fusion, we manually adjusted the accuracy settings of the sensors for the different control groups.
Experimental Settings
During the experiment, a new scan matching is triggered once the car moves ahead by 1.0 meter or the heading angle changes by 30 degrees since the last operation. In the training process, we use a constant step size $N$, meaning that the program conducts forward prediction based on the current parameter set for every $N$ steps. Then, a ground truth is obtained from the GPS/IMU suite and is used to adjust $\theta$ through backpropagation along the sequence to minimize the error between the ground truth and the predicted location. The hyperparameter $\beta$ of the loss function is set to 100.0 in this research to weight the errors in distance and angle.
Learning Result of SceneAware Error Model
The CSM[33] is used to examine the performance of our error model in different training stages. During training, a new parameter set is learned every $N$ steps when a ground truth location is obtained. Such a procedure iterates until a limit condition is reached. Below, we use "epoch" to denote a single pass through the full training set and let $\theta_{k}$ represent the learned parameter set at Epoch $k$. At each specific scene, the predicted covariance error of LiDAR scan matching changes with $\theta_{k}$. This result is analyzed in Fig. (a), where three scenes are selected, and the predicted covariance is represented by 2-standard-deviation ovals and sampled scatter points. With the initial parameter set $\theta_{0}$, the predicted covariances of all three scenes show quite similar shapes. As the number of epochs increases, the shapes vary differently, but they show a tendency of converging to their own stable states. We use $\theta_{0}$ and the parameter sets of three later epochs to estimate the sequences of vehicle poses, which are drawn in Fig. (b) as trajectories A, B, C and D, respectively. It is obvious that the localization error decreases progressively from trajectory A to D, demonstrating the efficiency of the learning procedure, where the accuracy of the predicted covariance error model is greatly improved.
Localization Accuracy in Experienced Environments
The localization accuracy of the proposed method is compared with dead reckoning, the LiDAR odometries CSM [8] and PLICP [26], and the conventional fusion-based methods using the covariance estimation in [8] and [53]. Sample trajectories estimated by these methods on the testing data are shown in Fig. (a). The trajectories of our fusion method (solid lines) are closer to the ground truth than those of the traditional fusion methods (dash-dotted lines). With the GPS/IMU output as the ground truth, the position and heading error statistics of every 100-meter-long trajectory segment are plotted in Fig. (b) and Fig. (c). For the fusion of DR and CSM, our method achieves a 12.7% and 49.5% reduction in the average Euclidean distance error and the average yaw error, respectively; for the fusion of DR and PLICP, the reductions are 48.1% and 75.1%, respectively.
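The per-segment statistics above can be computed as in the following hedged sketch, which splits a trajectory into 100-meter segments by distance traveled along the ground truth; the function name and pose format are illustrative assumptions, not the evaluation code used in the paper:

```python
import math

def segment_errors(est, gt, seg_len=100.0):
    """est/gt are aligned lists of (x, y, yaw) poses.  Returns, for each
    seg_len-meter segment of the ground-truth trajectory, the mean
    Euclidean position error and mean absolute yaw error."""
    segs, pos_errs, yaw_errs, traveled = [], [], [], 0.0
    for i, (e, g) in enumerate(zip(est, gt)):
        if i > 0:
            traveled += math.hypot(g[0] - gt[i - 1][0], g[1] - gt[i - 1][1])
        pos_errs.append(math.hypot(e[0] - g[0], e[1] - g[1]))
        dyaw = math.atan2(math.sin(e[2] - g[2]), math.cos(e[2] - g[2]))
        yaw_errs.append(abs(dyaw))
        if traveled >= seg_len:
            # close the current segment and start a new one
            segs.append((sum(pos_errs) / len(pos_errs),
                         sum(yaw_errs) / len(yaw_errs)))
            pos_errs, yaw_errs, traveled = [], [], 0.0
    return segs
```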
Localization Accuracy in Unexperienced Environments
Similarly, the localization trajectories on testing data from unexperienced regions are compared with the other methods, as shown in Fig. (a). The position and heading error statistics of every 100-meter-long trajectory segment are plotted in Fig. (b) and (c). From Fig. (a), the trajectories of our fusion methods (solid lines) are closer to the ground truth than the traditional ones (dash-dotted lines). From a statistical perspective, our methods reduce the average Euclidean error by 18.5% (DR+CSM) and 27.4% (DR+PLICP) and the average yaw error by 33.9% (DR+CSM) and 45.8% (DR+PLICP).
For the comparison with the conventional methods of error modeling, it is noteworthy that the measurement noise parameter in Eq. 4 of the Hessian method and the likelihood in Eq. 5 of the sampling method have a strong effect on the covariance scale, which may lead to different fusion accuracies in the RPF. Therefore, in the real-data experiments, we perform a grid search to rescale the covariance from the conventional methods so that their best performance on the error scale is used for comparison with our method.
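A minimal sketch of such a grid search over a global covariance scale might look as follows; `fuse_and_score` and the candidate scale grid are hypothetical stand-ins for the actual fusion pipeline and search range:

```python
def best_covariance_scale(fuse_and_score, scales=None):
    """Grid-search a global scale factor applied to the conventional
    covariance estimates.  fuse_and_score(s) should run the fusion with
    all covariances multiplied by s and return the resulting
    localization error (lower is better)."""
    if scales is None:
        scales = [10.0 ** k for k in range(-3, 4)]  # 1e-3 ... 1e3
    return min(scales, key=fuse_and_score)
```

In the real experiments, the selected scale would then be applied to the conventional error model before comparing it with the learned one.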
V Experiment on Visual Odometry
This supplementary experiment demonstrates that our method is also effective for odometries of modalities other than LiDAR. An overview of the processing flow for fusing visual odometry is given in Fig. 22. As in the experiment on LiDAR odometry, dead reckoning (DR) is used as the referred PSO module. For the target visual odometry, three representative algorithms are selected for error model learning: LIBVISO [10] (a feature-based method), DSO [12] (a direct method), and PLSVO [27] (a variant of SVO [11], a semi-direct method). However, because no error model is available for LIBVISO and DSO, the comparison with a conventional error model can only be performed on PLSVO. Limited by the sensor equipment, all of these visual odometries work in monocular mode in our experiments.
V-A State Transition Model and Network Design
Here, we use the same state definition as in the experiments on LiDAR odometry, and the 6-DoF results of the visual odometry are projected to 3-DoF in the same coordinate frame as the GPS/IMU. However, as it is challenging for monocular visual odometry to output reliable scale information [65], we need to customize our state transition model Eq. 7 as follows:
(22) 
in which
(23) 
and the remaining term is the function for calculating the translation scale of the relative movement. An extended information filter is therefore used to track this nonlinear state update, and Steps 4 and 5 in Algorithm 1 should be modified as
(24)  
where , .
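To illustrate the scale issue that motivates this customization, the following sketch composes an SE(2) pose from a monocular VO increment whose translation direction is known but whose metric length must be supplied externally (e.g., from the referred PSO). The function and its signature are illustrative, not the paper's Eqs. 22-24:

```python
import math

def compose_with_scale(pose, direction, dyaw, scale):
    """Advance an SE(2) pose (x, y, yaw) by a monocular VO increment.
    `direction` is the unit translation direction in the previous body
    frame, `dyaw` is the relative heading change, and `scale` is the
    metric translation length that monocular VO cannot observe."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty = scale * direction[0], scale * direction[1]
    # rotate the scaled body-frame translation into the world frame
    return (x + c * tx - s * ty, y + s * tx + c * ty, yaw + dyaw)
```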
For visual odometry, the real-time images can be used as scene information, so a network architecture similar to that in Fig. 5 is used in this experiment. For the sake of training efficiency, we resize the grayscale image to a fixed size (in pixels), and the network structural parameters related to the input image size are also modified accordingly.
V-B Real Data Experiment
Dataset
This dataset was also collected using the platform shown in Fig. 13. The following sensors are used: 1) a monocular camera mounted above the windshield for visual odometry; 2) a wheel encoder and an IMU with lower precision for dead reckoning; and 3) a highly accurate GPS/IMU suite to obtain ground-truth locations of the vehicle for model training and localization result evaluation.
Given the good performance in the LiDAR odometry experiment, we challenge the visual odometry experiments only in an unexperienced environment to verify the extensibility of our method. Similarly, a large-scale dataset is collected in several different regions with a total mileage of approximately 10 km. Training data accounts for 60% of the dataset, with almost no scene overlap with the remaining data used for testing.
Experimental Settings
To compare trajectories from different methods synchronously, we keep the trigger behavior of DSO and align the trigger times of the other visual odometries, LIBVISO and PLSVO, with DSO. In the training process, we use a constant step size, and the hyperparameter of the loss function is set to 100.0 to balance the errors in distance and angle.
Localization Accuracy
The localization accuracy of the proposed method is compared with dead reckoning, DSO [12], LIBVISO [10], PLSVO [27], and the conventional fusion-based method using the covariance estimation of PLSVO in the author's open-source code.
In contrast, the conventional error model of PLSVO performs poorly. In addition to the reasons analyzed in subsection II-D, another important reason lies in its derivation. PLSVO optimizes the SE(3) pose using the left multiplicative perturbation model on its Lie group, so its covariance, defined on the corresponding Lie algebra, needs to be mapped to SE(3) and then to SE(2) to be usable in our state transition model of Eq. 22. In this process, several nonlinear mappings must be linearized, which makes this error model less accurate.
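The operation underlying each linearization step here is the standard first-order covariance propagation Sigma' = J Sigma J^T through a linearized mapping. The following generic sketch (plain nested lists, hypothetical naming) is only an illustration of that operation, not PLSVO's derivation:

```python
def propagate_covariance(J, Sigma):
    """First-order propagation of a covariance through a linearized
    mapping: Sigma' = J * Sigma * J^T, where J is the Jacobian of the
    nonlinear mapping (e.g., from a Lie-algebra parameterization to an
    SE(2) state).  J is n x m, Sigma is m x m; the result is n x n."""
    n, m = len(J), len(J[0])
    # J * Sigma
    JS = [[sum(J[i][k] * Sigma[k][j] for k in range(m)) for j in range(m)]
          for i in range(n)]
    # (J * Sigma) * J^T
    return [[sum(JS[i][k] * J[j][k] for k in range(m)) for j in range(n)]
            for i in range(n)]
```

Each such linearization introduces approximation error, which accumulates when several mappings are chained, as noted above.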
VI Conclusion
In this research, a scene-aware error model is designed for LiDAR/visual odometry, and a localization fusion framework is developed to fuse localization results using such an error model. Moreover, an end-to-end learning method is devised to train the error model within the proposed localization fusion framework.
We thoroughly evaluate the proposed method on simulation data to verify its adaptability to various simple but typical scenes, and on real data to examine its efficiency in real-world situations. The experimental results demonstrate that the proposed method efficiently learns the CNN-based error model and that localization based on these models is more accurate than fusion with the other, traditional methods.
Future work will focus on the following limitations of the proposed method.
1) Gradient vanishing problem. This is a general problem in training RNNs (recurrent neural networks) [66][67], and the error model learning process of our method is structurally similar to a typical RNN. When training on a long trajectory with a large number of iterations, we expect the error information to be amplified by continuous rotation, whereas it is sometimes also likely to be overwhelmed.
2) Optimization of the training trigger strategy. In our experiment, every backpropagation is performed after forward prediction over a fixed number of time steps, which is convenient for offline batch training. However, such a strategy cannot fit Eq. 19 properly and makes the method inefficient to extend to online learning.
3) Hyperparameter setting. The hyperparameter in the loss function (Eq. 20) is another important factor for training performance. In the experiments above, it is selected manually after many attempts, a troublesome but necessary procedure that must be repeated for different datasets.
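Limitation 1) above can be illustrated numerically: a gradient backpropagated through many steps behaves like a product of per-step Jacobian norms, which can shrink toward zero (vanishing) or blow up (exploding). A toy sketch, with hypothetical naming:

```python
def backprop_gain(jacobian_norms):
    """Magnitude factor picked up by a gradient backpropagated through a
    chain of steps whose Jacobians have the given norms: the product of
    many values below 1 vanishes, above 1 it explodes."""
    gain = 1.0
    for g in jacobian_norms:
        gain *= g
    return gain
```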
Appendix A The Original Information Filter
Assume that the state transition and measurement probabilities are governed by the following linear Gaussian equations:
(A1)  
where and are the control and measurement at time , and denote their Gaussian noise with covariances and , respectively.
Probabilistic estimation of the state using a Gaussian filter finds a mean pose and a covariance matrix. When using an information filter [16], in contrast, the Gaussian distribution is given in canonical representation, and the problem is to estimate an information vector and an information matrix, as described in Algorithm A1.
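As a concrete illustration of the canonical representation, the following scalar (1-D) sketch converts moments to information form and performs the additive measurement update; the function names are illustrative, and the scalar case stands in for the matrix algebra of Algorithm A1:

```python
def to_canonical(mu, sigma2):
    """Moments -> canonical form for a 1-D Gaussian: the information
    matrix is Omega = 1/sigma^2 and the information vector is Omega*mu."""
    omega = 1.0 / sigma2
    return omega * mu, omega

def information_update(xi, omega, z, q):
    """Measurement update of the information filter (scalar case, with
    measurement model z = x + noise of variance q): in canonical form
    the update is a simple addition."""
    return xi + z / q, omega + 1.0 / q

def to_moments(xi, omega):
    """Canonical form -> mean and variance."""
    return xi / omega, 1.0 / omega
```

Fusing a prior N(0, 4) with a measurement z = 2 of variance 4 gives mean 1 and variance 2, matching the usual Gaussian product.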
Appendix B Derivation of the RPF Algorithm
Footnotes
 “amcl” is a probabilistic localization system for a robot moving in 2D, http://wiki.ros.org/amcl
 https://github.com/rubengooj/plsvo.git
References
 M. H. Hebert, C. E. Thorpe, and A. Stentz, Intelligent unmanned ground vehicles: autonomous navigation research at Carnegie Mellon. Springer Science & Business Media, 2012, vol. 388.
 J. L. Jones, “Robots at the tipping point: the road to iRobot Roomba,” IEEE Robotics & Automation Magazine, vol. 13, no. 1, pp. 76–78, 2006.
 S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., “Stanley: The robot that won the DARPA grand challenge,” Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.
 G. Nelson, A. Saunders, and R. Playter, “The PETMAN and Atlas robots at Boston Dynamics,” Humanoid Robotics: A Reference, pp. 169–186, 2019.
 J. Georgy, T. Karamat, U. Iqbal, and A. Noureldin, “Enhanced MEMS-IMU/odometer/GPS integration using mixture particle filter,” GPS Solutions, vol. 15, no. 3, pp. 239–252, 2011.
 A. N. Ndjeng, D. Gruyer, S. Glaser, and A. Lambert, “Low cost IMU–odometer–GPS ego localization for unusual maneuvers,” Information Fusion, vol. 12, no. 4, pp. 264–274, 2011.
 P. Biber and W. Straßer, “The normal distributions transform: A new approach to laser scan matching,” in Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 3. IEEE, 2003, pp. 2743–2748.
 E. B. Olson, “Realtime correlative scan matching,” in 2009 IEEE International Conference on Robotics and Automation. IEEE, 2009, pp. 4387–4393.
 J. Zhang and S. Singh, “LOAM: Lidar odometry and mapping in real-time,” in Robotics: Science and Systems, vol. 2, 2014, p. 9.
 B. Kitt, A. Geiger, and H. Lategahn, “Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme,” in Intelligent Vehicles Symposium (IV), 2010.
 C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 15–22.
 J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
 H.-P. Chiu, X. S. Zhou, L. Carlone, F. Dellaert, S. Samarasekera, and R. Kumar, “Constrained optimal selection for multisensor robot navigation using plug-and-play factor graphs,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 663–670.
 S. R. Sukumar, H. Bozdogan, D. L. Page, A. F. Koschan, and M. A. Abidi, “Sensor selection using information complexity for multisensor mobile robot localization,” in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 4158–4163.
 G. Bresson, M.C. Rahal, D. Gruyer, M. Revilloud, and Z. Alsayed, “A cooperative fusion architecture for robust localization: Application to autonomous driving,” in 2016 IEEE 19th international conference on intelligent transportation systems (ITSC). IEEE, 2016, pp. 859–866.
 S. Thrun, W. Burgard, and D. Fox, Probabilistic robotics. MIT press, 2005.
 S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, “Multi-sensor fusion for robust autonomous flight in indoor and outdoor environments with a rotorcraft MAV,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4974–4981.
 D. P. Koch, T. W. McLain, and K. M. Brink, “Multi-sensor robust relative estimation framework for GPS-denied multirotor aircraft,” in 2016 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2016, pp. 589–597.
 O. Bengtsson and A.-J. Baerveldt, “Robot localization based on scan-matching: estimating the covariance matrix for the IDC algorithm,” Robotics and Autonomous Systems, vol. 44, no. 1, pp. 29–40, 2003.
 S. Bonnabel, M. Barczyk, and F. Goulette, “On the covariance of ICP-based scan-matching techniques,” in 2016 American Control Conference (ACC). IEEE, 2016, pp. 5498–5503.
 M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual inertial odometry using a direct EKF-based approach,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 298–304.
 T. N. N. Hossein, S. Mita, and H. Long, “Multisensor data fusion for autonomous vehicle navigation through adaptive particle filter,” in 2010 IEEE Intelligent Vehicles Symposium. IEEE, 2010, pp. 752–759.
 D. Gulati, F. Zhang, D. Clarke, and A. Knoll, “Vehicle infrastructure cooperative localization using factor graphs,” in 2016 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2016, pp. 1085–1090.
 G. Hemann, S. Singh, and M. Kaess, “Long-range GPS-denied aerial inertial navigation with lidar localization,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 1659–1666.
 S. Lynen, M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart, “A robust and modular multisensor fusion approach applied to mav navigation,” in 2013 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2013, pp. 3923–3929.
 A. Censi, “An ICP variant using a point-to-line metric,” in 2008 IEEE International Conference on Robotics and Automation, May 2008, pp. 19–25.
 R. GomezOjeda and J. GonzalezJimenez, “Robust stereo visual odometry through a probabilistic combination of points and line segments,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 2521–2526.
 M. Bosse and R. Zlot, “Continuous 3D scan-matching with a spinning 2D laser,” in 2009 IEEE International Conference on Robotics and Automation. IEEE, 2009, pp. 4312–4319.
 L. Armesto, J. Minguez, and L. Montesano, “A generalization of the metric-based iterative closest point technique for 3D scan matching,” in 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 1367–1372.
 M. Velas, M. Spanel, and A. Herout, “Collar line segments for fast odometry estimation from Velodyne point clouds,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4486–4495.
 D. Wang, J. Xue, Z. Tao, Y. Zhong, D. Cui, S. Du, and N. Zheng, “Accurate mix-norm-based scan matching,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1665–1671.
 M. Magnusson, A. Lilienthal, and T. Duckett, “Scan registration for autonomous mining vehicles using 3D-NDT,” Journal of Field Robotics, vol. 24, no. 10, pp. 803–827, 2007.
 E. Olson, “M3RSM: Many-to-many multi-resolution scan matching,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 5815–5821.
 M. Jaimez, J. G. Monroy, and J. Gonzalez-Jimenez, “Planar odometry from a radial laser scanner. A range flow-based approach,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4479–4485.
 F. T. Ramos, D. Fox, and H. F. Durrant-Whyte, “CRF-matching: Conditional random fields for feature-based scan matching,” in Robotics: Science and Systems, 2007.
 A. Diosi and L. Kleeman, “Fast laser scan matching using polar coordinates,” The International Journal of Robotics Research, vol. 26, no. 10, pp. 1125–1153, 2007.
 A. Censi and S. Carpin, “HSM3D: featureless global 6DoF scan-matching in the Hough/Radon domain,” in 2009 IEEE International Conference on Robotics and Automation. IEEE, 2009, pp. 3899–3906.
 A. Howard, “Real-time stereo visual odometry for autonomous ground vehicles,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008, pp. 3946–3952.
 T. Mouats, N. Aouf, A. D. Sappa, C. Aguilera, and R. Toledo, “Multispectral stereo odometry,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 3, pp. 1210–1224, 2014.
 F. Zhang, H. Stähle, A. Gaschler, C. Buckl, and A. Knoll, “Single camera visual odometry based on random finite set statistics,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 559–566.
 J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,” in The IEEE International Conference on Computer Vision (ICCV), December 2013.
 C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for RGB-D cameras,” in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 3748–3754.
 R. Wang, M. Schworer, and D. Cremers, “Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3903–3911.
 S.J. Li, B. Ren, Y. Liu, M.M. Cheng, D. Frost, and V. A. Prisacariu, “Direct line guidance odometry,” in 2018 IEEE international conference on Robotics and automation (ICRA). IEEE, 2018, pp. 1–7.
 S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2043–2050.
 P. Tanskanen, T. Naegeli, M. Pollefeys, and O. Hilliges, “Semi-direct EKF-based monocular visual-inertial odometry,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 6073–6078.
 V. Usenko, J. Engel, J. Stückler, and D. Cremers, “Direct visual-inertial odometry with stereo cameras,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 1885–1892.
 T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
 J. Zhang and S. Singh, “Visual-lidar odometry and mapping: Low-drift, robust, and fast,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 2174–2181.
 M. Barjenbruch, D. Kellner, J. Klappstein, J. Dickmann, and K. Dietmayer, “Joint spatial- and Doppler-based ego-motion estimation for automotive radars,” in 2015 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2015, pp. 839–844.
 P. J. Besl and N. D. McKay, “Method for registration of 3-D shapes,” in Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–607.
 F. Lu and E. Milios, “Robot pose estimation in unknown environments by matching 2d range scans,” Journal of Intelligent and Robotic systems, vol. 18, no. 3, pp. 249–275, 1997.
 A. Censi, “An accurate closed-form estimate of ICP’s covariance,” in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3167–3172.
 Y. Aksoy and A. A. Alatan, “Uncertainty modeling for efficient visual odometry via inertial sensors on mobile devices,” in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 3397–3401.
 M. Brenna, “Scan matching covariance estimation and SLAM: models and solutions for the scanSLAM algorithm,” Ph.D. dissertation, Politecnico di Milano, 2009.
 O. Bengtsson and A.-J. Baerveldt, “Localization in changing environments – estimation of a covariance matrix for the IDC algorithm,” in Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No. 01CH37180), vol. 4. IEEE, 2001, pp. 1931–1937.
 G. Kantor and S. Singh, “Preliminary results in range-only localization and mapping,” in Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), vol. 2. IEEE, 2002, pp. 1818–1823.
 M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: An open-source robot operating system,” in ICRA Workshop on Open Source Software, vol. 3, no. 3.2. Kobe, Japan, 2009, p. 5.
 A. G. Buch, D. Kraft et al., “Prediction of ICP pose uncertainties using Monte Carlo simulation with synthetic depth images,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 4640–4647.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
 A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
 S. Gidaris and N. Komodakis, “Object detection via a multi-region and semantic segmentation-aware CNN model,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1134–1142.
 N. P. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in IROS, vol. 4. Citeseer, 2004, pp. 2149–2154.
 H. Strasdat, J. Montiel, and A. J. Davison, “Scale drift-aware large scale monocular SLAM,” Robotics: Science and Systems VI, vol. 2, no. 3, p. 7, 2010.
 T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh annual conference of the international speech communication association, 2010.
 R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning, 2013, pp. 1310–1318.