Vision-Aided Absolute Trajectory Estimation Using an Unsupervised Deep Network with Online Error Correction
We present an unsupervised deep neural network approach to the fusion of RGB-D imagery with inertial measurements for absolute trajectory estimation. Our network, dubbed the Visual-Inertial-Odometry Learner (VIOLearner), learns to perform visual-inertial odometry (VIO) without inertial measurement unit (IMU) intrinsic parameters (corresponding to gyroscope and accelerometer bias or white noise) or the extrinsic calibration between the IMU and camera. The network learns to integrate IMU measurements and generate hypothesis trajectories, which are then corrected online according to the Jacobians of scaled image projection errors with respect to a spatial grid of pixel coordinates. We evaluate our network against state-of-the-art (SOA) visual-inertial odometry, visual odometry, and visual simultaneous localization and mapping (VSLAM) approaches on the KITTI Odometry dataset and demonstrate competitive odometry performance.
- Army Research Laboratory
- inertial measurement unit
- visual simultaneous localization and mapping
- visual odometry
- visual-inertial odometry
- Real Time Kinematic
- Simultaneous localization and mapping
- Extended Kalman Filter
- Multi-state Constraint Kalman Filter
- mixture density network
- convolutional neural network
- root mean squared error
Originally coined to characterize honey bee flight, the term “visual odometry” describes the integration of apparent image motions for dead-reckoning-based navigation (literally, as an odometer for vision). While work in what came to be known as visual odometry (VO) began in the 1980s for the Mars Rover project, the term was not popularized in the engineering context until around 2004.
Approaches in visual-inertial odometry (VIO) combine visual estimates of motion with those measured by multi-axis accelerometers and gyroscopes in inertial measurement units (IMUs). As IMUs measure only linear accelerations and angular velocities, inertial approaches to localization are prone to rapid, unbounded drift over time due to the double integration of accelerations into pose estimates. Combining inertial estimates of pose change with visual estimates allows for the ‘squashing’ of inertial drift.
While visual odometry (VO), VIO, and visual simultaneous localization and mapping (VSLAM) can be performed by a monocular camera system, the scale of trajectory estimates cannot be directly estimated as depth is unobservable from monocular cameras (although for VIO, integrating raw IMU measurements can provide noisy, short-term estimates of scale). Generally, depth can be estimated from RGB-D cameras or from stereo camera systems. We have chosen to include depth in our input domain as a means of achieving absolute scale recovery: while absolute depth can be generated using an onboard sensor, the same cannot be said for the change in pose between image pairs.
GPS, previously proposed as a training pose-difference signal, has limited accuracy on the order of meters, which inherently limits the accuracy of a GPS-derived inter-image training signal for bootstrapping purposes. While Real Time Kinematic (RTK) GPS can reduce this error to the order of centimeters, RTK GPS solutions require specialized localized base-stations. Scene depth, on the other hand, can be accurately measured by an RGB-D sensor (e.g. Microsoft Kinect, Asus Xtion, or new indoor/outdoor Intel RealSense cameras), stereo cameras, or LIDAR (e.g. a Velodyne VLP-16 or PUCK Lite) directly onboard a vehicle.
The main contribution of this paper is the unsupervised learning of trajectory estimates with absolute scale recovery from RGB-D + inertial measurements with online error correction.
II Related Work
II-A Traditional Methods
In VO and VSLAM, only data from camera sensors is used and tracked across frames to determine the change in camera pose. Simultaneous localization and mapping (SLAM) approaches typically consist of a front end, in which features are detected in the image, and a back end, in which features are tracked across frames and matched to keyframes to estimate camera pose, with some approaches performing loop closure as well. ORB-SLAM2 is a VSLAM system for monocular, stereo, and RGB-D cameras. It used bundle adjustment and a sparse map for accurate, real-time performance on CPUs, and performed loop closure to correct for the accumulated error in its pose estimation. ORB-SLAM2 has shown state-of-the-art (SOA) performance on a variety of VO benchmarks.
In VIO and visual-inertial SLAM, the fusion of imagery and IMU measurements is typically accomplished by filter-based or non-linear optimization approaches. ROVIO is a VIO algorithm for monocular cameras that uses a robust and efficient robocentric approach in which 3D landmark positions are estimated relative to the camera pose. It used an Extended Kalman Filter (EKF) to fuse the sensor data, utilizing intensity errors in the update step. However, because ROVIO is a monocular approach, accurate scale is not recovered. OKVIS is a keyframe-based visual-inertial SLAM approach for monocular and stereo cameras. OKVIS relied on keyframes (each consisting of an image and an estimated camera pose), batch non-linear optimization over saved keyframes, and a local map of landmarks to estimate camera egomotion. However, it did not include loop closure, unlike the SLAM algorithms with state-of-the-art (SOA) performance.
There are also several approaches which enhance VIO with depth sensing or laser scanning. Two methods of depth-enhanced VIO are built upon the Multi-state Constraint Kalman Filter (MSCKF) [8, 9] algorithm for vision-aided inertial navigation, which is another Extended Kalman Filter (EKF)-based VIO algorithm. One method is the MSCKF-3D  algorithm, which used a monocular camera with a depth sensor, or RGB-D camera system. The algorithm performed online time offset correction between camera and IMU, which is critical for its estimation process, and used a Gaussian mixture model for depth uncertainty. Pang et al.  also demonstrated a depth-enhanced VIO approach based on MSCKF, with 3D landmark positions augmented with sparse depth information kept in a registered depth map. Both approaches showed improved accuracy over VIO-only approaches. Finally, Zhang and Singh  proposed a method for leveraging data from a 3D laser. The approach utilized a multi-layer pipeline to solve for coarse to fine motion by using VIO as a subsystem and matching scans to a local map. It demonstrated high position accuracy and was robust to individual sensor failures.
II-B Learning-Based Methods
Recently, there have been several successful unsupervised approaches to depth estimation that are trained using reconstructive loss [13, 14] from image warping, similar to our own network. Garg et al. and Godard et al. used stereo image pairs with known baselines and reconstructive loss for training. Thus, while technically unsupervised, the known baseline effectively provides a known transform between the two images. Our network approaches the same problem from the opposite direction: we assume known depth and estimate an unknown pose difference.
Pillai and Leonard  demonstrated visual egomotion learning by mapping optical flow vectors to egomotion estimates via a mixture density network (MDN). Their approach not only required optical flow to already be externally generated (which can be very computationally expensive), but also was trained in a supervised manner and thus required the ground truth pose differences for each exemplar in the training set.
SFMLearner demonstrated the unsupervised learning of unscaled egomotion and depth from RGB imagery. They input a consecutive sequence of images and output the change in pose between the middle image of the sequence and every other image in the sequence, as well as the estimated depth of the middle image. However, their approach was unable to recover the scale for the depth estimates or, most crucially, the scale of the changes in pose. Thus, their network’s trajectory estimates needed to be scaled by parameters estimated from the ground truth trajectories; in the real world, this information will of course not be available. SFMLearner also required a sequence of images to compute a trajectory: their best results were on an input sequence of five images, whereas our network only requires a source-target image pairing.
UnDeepVO is another unsupervised approach to depth and egomotion estimation. It differs from SFMLearner in that it was able to generate properly scaled trajectory estimates. However, unlike SFMLearner and similar to [13, 14], it used stereo image pairs for training, where the baseline between images is known; thus, UnDeepVO can only be trained on datasets where stereo image pairs are available. Additionally, the network architecture of UnDeepVO cannot be extended to include motion estimates derived from inertial measurements because the spatial transformation between paired stereo images is unobservable by an IMU (stereo images are recorded simultaneously).
VINet  was the first end-to-end trainable visual-inertial deep network. While VINet showed robustness to temporal and spatial misalignments of an IMU and camera, it still required extrinsic calibration parameters between camera and IMU. This is in contrast to our VIOLearner which requires no IMU intrinsics or IMU-camera extrinsics. In addition, VINet was trained in a supervised manner and thus required the ground truth pose differences for each exemplar in the training set which are not always readily available.
VIOLearner is an unsupervised VIO deep network that estimates the scaled egomotion of a moving camera between the time at which a source image is captured and the time at which a target image is captured. VIOLearner receives an input RGB-D source image, a target RGB image, the IMU data recorded between the two images, and a camera calibration matrix with the camera’s intrinsic parameters. With access to depth and camera intrinsics, VIOLearner can generate camera pose changes in the camera frame using a view-synthesis approach: the basis for its training is the Euclidean loss between a target image and a reconstructed target image, generated using pixels in the source image sampled at locations determined by a learned 3D affine transformation (via a spatial transformer).
III-A Multi-Scale Projections and Online Error Correction
Like prior hierarchical approaches, VIOLearner performs multi-scale projections, generating one projection per level at successively finer resolutions. However, in our network, multi-scale projections not only help to overcome gradient locality during training (see [15, 19] for a broader discussion of this well-known issue), but also aid in online error correction at runtime.
Generally, at each level, the network computes the Jacobian of the reprojection error at that level with respect to the grid of coordinates. Convolutions are then performed on this Jacobian, and a correction is computed and added to the previously generated affine matrix. This is repeated for each level in the hierarchy. VIOLearner uses a total of five levels (Levels 0 through 4), and additional detail on the computations performed at each level is provided in the following sections.
III-B Level 0
VIOLearner first takes raw IMU measurements and learns to compute an estimated 3D affine transformation that will transform a source image into a target image (the top left of Fig. 1(a)). The network downsamples the source image (and the associated depth and camera matrix, adjusted for the current downsampling factor) by a factor of 8 (see Fig. 2) and applies the 3D affine transformation to a normalized grid of source coordinates to generate a transformed grid of target coordinates.
VIOLearner then performs bilinear sampling to produce a reconstruction image by sampling from the source image at the transformed coordinates, where the sampling kernel is zero outside the unit range, non-negative within it, and equal to one at zero offset.
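The bilinear sampling step can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the kernel max(0, 1 − |d|) follows the description above, and `bilinear_sample` is a hypothetical helper name.

```python
import numpy as np

def bilinear_sample(src, xs, ys):
    """Sample image `src` (H x W) at continuous coordinates (xs, ys).

    Each output value is a weighted sum of the four nearest source
    pixels; the weight kernel max(0, 1 - |d|) is zero outside the unit
    range, non-negative within it, and equal to one at d = 0.
    """
    h, w = src.shape
    x0, y0 = np.floor(xs), np.floor(ys)
    out = np.zeros_like(xs, dtype=float)
    for dx in (0, 1):
        for dy in (0, 1):
            # Clamp the corner indices so border samples stay in bounds.
            xi = np.clip(x0 + dx, 0, w - 1).astype(int)
            yi = np.clip(y0 + dy, 0, h - 1).astype(int)
            wx = np.maximum(0.0, 1.0 - np.abs(xs - (x0 + dx)))
            wy = np.maximum(0.0, 1.0 - np.abs(ys - (y0 + dy)))
            out += src[yi, xi] * wx * wy
    return out
```

Because the kernel is piecewise linear in the coordinates, the sampled intensities are (sub-)differentiable with respect to the sampling locations, which is what the following error-gradient computation relies on.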
As in the spatial transformer approach, by evaluating only at the sampled pixels, we can compute sub-gradients that allow error back-propagation to the affine parameters, computing error with respect to the coordinate locations instead of the more conventional error with respect to pixel intensity.
Starting with Level 0, the Euclidean loss is taken between the downsampled reconstructed target image and the equivalently downsampled actual target image.
The final computation performed by Level 0 is the Jacobian of the Euclidean loss of Equation 4 with respect to the source coordinates from Equation 2 and Equation 3. The resulting Jacobian matrix has the same dimensionality as the grid of source coordinates and is depicted in Fig. 1(a). In traditional approaches, the gradient and error equations above are only used during training. However, VIOLearner is novel in that it also computes and employs these gradients during each inference step of the network. During both training and inference, the Jacobian is computed and passed to the next level for processing.
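The quantity passed between levels can be illustrated numerically. The sketch below uses central finite differences of a reprojection loss over a toy continuous "image" function; the network computes analytic sub-gradients through the sampler, so this is only an illustration of the Jacobian being propagated, and `src_fn`, `coords`, and the helper names are hypothetical.

```python
import numpy as np

def reprojection_loss(coords, src_fn, target):
    """Euclidean loss between target intensities and source intensities
    sampled at `coords` (an N x 2 array of x, y locations)."""
    rec = src_fn(coords[:, 0], coords[:, 1])
    return 0.5 * np.sum((rec - target) ** 2)

def coord_jacobian(coords, src_fn, target, eps=1e-6):
    """Central-difference Jacobian of the loss with respect to each
    sampling coordinate -- a stand-in for the grid-shaped Jacobian
    that VIOLearner convolves and passes to the next level."""
    jac = np.zeros_like(coords)
    for i in range(coords.shape[0]):
        for d in range(coords.shape[1]):
            hi = coords.copy(); hi[i, d] += eps
            lo = coords.copy(); lo[i, d] -= eps
            jac[i, d] = (reprojection_loss(hi, src_fn, target)
                         - reprojection_loss(lo, src_fn, target)) / (2 * eps)
    return jac
```

For a linear intensity field the numerical Jacobian matches the analytic gradient (loss residual times the local image gradient), which is exactly the signal the error-correction convolutions consume.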
III-C Levels i to n-1
For the second level in the network through the (n-1)-th level, the previous level’s Jacobian is input and processed through layers of convolutions to generate a correction. This correction is summed with the previous level’s 3D affine transform, and the updated transform is then applied to generate a reconstruction downsampled by a smaller factor. Error is again computed as above in Equation 4, and the Jacobian is similarly found as it was in Level 0 and input to the next level.
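The additive level-wise refinement can be summarized in a few lines. The `corrections` callables stand in for the learned convolutional stacks, and the scalar `reconstruct`/`target` values are toy stand-ins for images; only the update rule (each level adds a correction to the running transform estimate) reflects the text.

```python
def refine_levels(theta0, corrections, reconstruct, target):
    """Level-wise refinement sketch: each level maps the previous
    level's reprojection-error signal to a correction d_theta that is
    summed with the running transform (theta_i = theta_{i-1} + d_theta_i)."""
    theta = theta0
    for correct in corrections:               # stand-ins for the conv stacks
        error = reconstruct(theta) - target   # stand-in for the error Jacobian
        theta = theta + correct(error)        # additive update, as in the text
    return theta
```

With a correction that cancels half of the error at each level, successive levels shrink the residual geometrically, mirroring the per-level error reduction reported in the results.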
III-D Level n and Multi-Hypothesis Pathways
The final level of VIOLearner employs multi-hypothesis pathways similar to [20, 21], where several possible hypotheses for the reconstruction of a target image (and the associated transformations which generated those reconstructions) are computed in parallel. The lowest-error hypothesis reconstruction is chosen during each network run, and the corresponding affine matrix which generated the winning reconstruction is output as the final network estimate of camera pose change between the source and target images.
This multi-hypothesis approach allows the network to generate several different pathways and effectively sample from an unknown noise distribution. For example, as IMUs only measure linear accelerations, they fail to accurately convey motion during periods of constant velocity. Thus, a window of integrated IMU measurements is contaminated with noise related to the velocity at the beginning of the window. With a multi-hypothesis approach, the network has a mechanism to model uncertainty in the initial velocity (see Section VI-E for a discussion).
The loss is then computed only for the lowest-error hypothesis pathway, and error is backpropagated only to parameters in that one pathway. Thus, only parameters that contributed to the winning hypothesis are updated, and the remaining parameters are left untouched.
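The winner-take-all selection can be sketched as follows; `select_hypothesis` is a hypothetical helper that mirrors the described behavior of scoring each hypothesis reconstruction by Euclidean error and choosing the minimum.

```python
def select_hypothesis(reconstructions, target):
    """Winner-take-all over hypothesis reconstructions: return the index
    of the lowest-Euclidean-error hypothesis, so that the loss (and its
    gradients) flow only through the winning pathway."""
    def sq_err(rec):
        # Sum of squared per-pixel differences against the target.
        return sum((r - t) ** 2 for r, t in zip(rec, target))
    return min(range(len(reconstructions)),
               key=lambda i: sq_err(reconstructions[i]))
```

In a framework with automatic differentiation, indexing the loss by this winning index is what restricts gradient updates to the chosen pathway's parameters.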
The final loss by which the network is trained is then simply the sum of the Euclidean loss terms for each level, plus a weighted L1 penalty over the bias terms, which we empirically found to better facilitate training and gradient back-propagation.
IV-A Network Architecture
IV-A1 IMU Processing
The initial level of VIOLearner uses two parallel pathways of convolutional layers for the IMU angular velocities and linear accelerations, respectively. Each pathway begins with single-stride convolutional layers applied to the raw angular velocity or linear acceleration measurements, followed by strided convolutional layers with the same kernel size. Additional convolutional layers with varying strides and kernel sizes are then applied. The final convolutional layer in each pathway is flattened, and the angular velocity and linear acceleration pathways are concatenated into a single tensor.
IV-A2 3D Affine Transformations
The first three components of pose_imu correspond to rotations about the x, y, and z axes, respectively, from which per-axis rotation matrices are computed and composed into a single 3D rotation matrix. The last three elements directly correspond to a translation vector. Together with the 3D rotation matrix, this translation forms a 4x4 homogeneous transformation matrix.
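The construction of the homogeneous transform can be sketched as follows. The Rz @ Ry @ Rx composition order is an assumption, as this excerpt does not state which convention the paper uses.

```python
import numpy as np

def euler_to_homogeneous(rx, ry, rz, t):
    """Build a 4x4 homogeneous transform from per-axis rotation angles
    (radians) and a translation vector. The Rz @ Ry @ Rx composition
    order is assumed, not taken from the paper."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    # Elementary rotations about the x, y, and z axes.
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # composed 3D rotation
    T[:3, 3] = t               # translation column
    return T
```

Applying the transform to homogeneous points (x, y, z, 1) rotates and then translates them, which is how the network warps the grid of source coordinates.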
IV-A3 Online Error Correction and Pose Refinement
For the Jacobian matrix at each scale, three convolutional layers are first applied with varying kernel sizes and strides. Then, additional convolutional layers are applied depending on the downsampling factor of the current level, with the number of additional layers increasing as the downsampling factor decreases (the final level using the most additional convolutional layers). For each level, the final layer generates a pose estimate using a single-strided convolution.
The outputs are split, as with the IMU pathway output, into rotations and a translation. The new rotations and translations for each level are then computed by summing these corrections with the previous level’s estimates. This repeats for each level until the final level, where the resulting transform is output as the final estimate of the change in camera pose.
IV-A4 Multi-Hypothesis Generation
The final level of VIOLearner uses four hypothesis pathways as described above in Section III-D.
IV-B Training Procedures
VIOLearner was trained using a batch size of 32. As the network was trained, we calculated error on the validation set at intervals of 500 iterations. The results presented in this paper are from the network model that provided the highest performance on the validation set. We used the Adam solver with an exponential learning rate policy. The network was trained on a desktop computer with a 3.00 GHz Intel i7-6950X processor and Nvidia Titan X GPUs.
IV-C KITTI Odometry Dataset
We evaluate VIOLearner on the KITTI Odometry dataset and used the remaining sequences for training, excluding one sequence whose corresponding raw file was not online at the time of publication. Sequences 09 and 10 were withheld for the test set, as was done in prior work, and one of the KITTI sequences was withheld as a validation set.
Depth is not available for the full resolution of KITTI images, so we cropped each image in KITTI (first resizing images whose resolution differed, as is the case for certain sequences) and then scaled the cropped images to a fixed size. In all experiments, we randomly selected an image for the source and used the successive image for the target. Corresponding IMU data was collected from the KITTI raw datasets: for each source image, a window of IMU measurements preceding the source image was combined with the measurements between the source and target images into a single input vector. We chose to include IMU data in this way so that the network could learn to implicitly estimate a temporal offset between camera and IMU data, as well as glean an estimate of the initial velocity at the time of source image capture by looking at previous data.
In the literature, there is no readily comparable approach with the exact input and output domains of our network (namely, RGB-D + inertial inputs and scaled output odometry; see Section II for approaches with our input domain that do not publicly provide their code or evaluation results on KITTI). Nonetheless, we compare our approach to the following recent VO, VIO, and SLAM approaches described earlier in Section II:
For OKVIS, ROVIO, and SFMLearner, we temporally align each output and ground truth trajectory by cross correlating the norm of the rotational accelerations and perform 6-DOF alignment for the first tenth of the trajectory to ground truth. For ROVIO and SFMLearner, we also estimate the scale from ground truth.
VI Results and Discussion
VI-A Visual Odometry
VIOLearner compared favorably to the VO approaches listed above, as seen in Tab. I. It should be noted that the results in Tab. I for VIOLearner, UnDeepVO, and SFMLearner are for networks that were tested on data on which they were also trained, in accordance with how results were presented for UnDeepVO. We were thus unable to properly evaluate UnDeepVO against VIOLearner on test data that was not also used for training, as such results were not provided for UnDeepVO, nor was their model available online at the time of writing. With the exception of KITTI sequence 00, VIOLearner performed favorably against UnDeepVO.
VI-B Visual-Inertial Odometry
The authors of VINet provide boxplots of their method’s error compared to several state-of-the-art approaches on KITTI Odometry. We have extracted the median, first quartile, and third quartile from their result plots to the best of our ability and included them in Tab. II. For longer trajectories, VIOLearner outperformed VINet on KITTI sequence 10. It should again be noted that while VINet requires camera-IMU extrinsic calibrations, our network is able to implicitly learn this transform from the data itself.
The authors of VINet reported OKVIS failing to run on the KITTI Odometry dataset and instead used a custom EKF with VISO2 as a comparison to traditional SOA VIO approaches. However, we were able to successfully run OKVIS on KITTI sequences 09 and 10 and have included the results in Tab. III. Additionally, we provide results from ROVIO. VIOLearner outperforms OKVIS and ROVIO on KITTI sequences 09 and 10. However, both OKVIS and ROVIO require tight synchronization between IMU measurements and images, which KITTI does not provide. This is most likely the reason for the poor performance of both approaches on KITTI. It also highlights a strength of VIOLearner: it is able to compensate for loosely temporally synchronized sensors without explicitly estimating their temporal offsets.
VI-C Visual Simultaneous Localization and Mapping
Additionally, we have included benchmark results from ORB-SLAM2. ORB-SLAM2 performs SLAM, unlike our pure odometry-based solution, and is included as an example of SOA localization to provide additional perspective on our results. VIOLearner was significantly outperformed by ORB-SLAM2. This was not a surprise, as ORB-SLAM2 uses bundle adjustment and loop closure, maintains a SLAM map, and generally uses far more data for each pose estimate compared to our VIOLearner.
VI-D Online Error Correction
Results in Fig. 4 show the pose root mean squared error (RMSE) of the estimates generated at each level and suggest that our online error correction mechanism is able to reduce pose error. It should, however, be noted that the estimates for Levels 0 to 2 are computed using down-sampled images, and thus their Jacobian inputs both have access to less information and pass through one less convolutional layer each. The extent to which this affects the plots in Fig. 4 is not yet fully clear. However, Level 3 operates on full-size inputs, and there is still a reduction in error between Level 3 and Level 4. While this reduction in the final layer can be partially attributed to the multi-hypothesis mechanism in Level 4, the mean RMSE from each individual Level 4 hypothesis pathway (with the exception of the second hypothesis pathway) was lower than the mean RMSE for Level 3 on both KITTI 09 and KITTI 10, and lower still when the lowest-error hypothesis was chosen. This suggests that the observed error reduction is indeed an effect of our online error correction mechanism rather than simply an artefact of image resolution.
VI-E Multi-Hypothesis Error Reduction
For the transforms generated by each of the four hypothesis pathways in the final level for KITTI sequence 09, the z-dimension showed orders of magnitude more variance in pose error (the Euclidean error between the computed translation and the true translation) than the x- and y-dimensions and was the main contributor to hypothesis variance. For the camera frame in KITTI, the z-direction corresponds to the forward direction, which is the predominant direction of motion in KITTI and also where we would expect to see the largest influence from uncertainty in initial velocities. These results are consistent with the network learning to model this uncertainty in initial velocity as intended.
In this work, we have presented our VIOLearner architecture and demonstrated competitive performance against SOA odometry approaches. VIOLearner’s novel multi-step trajectory estimation via convolutional processing of Jacobians at multiple spatial scales yields a trajectory estimator that learns how to correct errors online.
The main contributions of VIOLearner are its unsupervised learning of scaled trajectory, online error correction based on the use of intermediate gradients, and ability to combine uncalibrated, loosely temporally synchronized, multi-modal data from different reference frames into improved estimates of odometry.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
-  M. Srinivasan, M. Lehrer, W. Kirchner, and S. Zhang, “Range perception through apparent image speed in freely flying honeybees,” Visual neuroscience, vol. 6, no. 5, pp. 519–535, 1991.
-  D. Nistér, O. Naroditsky, and J. Bergen, “Visual odometry,” in CVPR, vol. 1. IEEE, 2004, pp. I–I.
-  S. Pillai and J. J. Leonard, “Towards visual ego-motion learning in robots,” arXiv preprint arXiv:1705.10279, 2017.
-  R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
-  M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
-  S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
-  A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in ICRA. IEEE, 2007, pp. 3565–3572.
-  M. Li and A. I. Mourikis, “High-precision, consistent EKF-based visual-inertial odometry,” The International Journal of Robotics Research, vol. 32, no. 6, pp. 690–711, 2013.
-  M. N. Galfond, “Visual-inertial odometry with depth sensing using a multi-state constraint Kalman filter,” Master’s thesis, Massachusetts Institute of Technology, 2014.
-  F. Pang, Z. Chen, L. Pu, and T. Wang, “Depth enhanced visual-inertial odometry based on multi-state constraint Kalman filter,” in IROS, Sept 2017, pp. 1761–1767.
-  J. Zhang and S. Singh, “Enabling aggressive motion estimation at low-drift and accurate mapping in real-time,” in ICRA. IEEE, 2017, pp. 5051–5058.
-  R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in ECCV. Springer, 2016, pp. 740–756.
-  C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in CVPR, vol. 2, no. 6, 2017, p. 7.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, vol. 2, no. 6, 2017, p. 7.
-  R. Li, S. Wang, Z. Long, and D. Gu, “UnDeepVO: Monocular visual odometry through unsupervised deep learning,” arXiv preprint arXiv:1709.06841, 2017.
-  R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, “VINet: Visual-inertial odometry as a sequence-to-sequence learning problem.” in AAAI, 2017, pp. 3995–4001.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in NIPS, 2015, pp. 1–14.
-  P. Anandan, J. Bergen, K. Hanna, and R. Hingorani, “Hierarchical model-based motion estimation,” in Motion Analysis and Image Sequence Processing. Springer, 1993, pp. 1–22.
-  E. J. Shamwell, W. D. Nothwang, and D. Perlis, “A deep neural network approach to fusing vision and heteroscedastic motion estimates for low-SWaP robotic applications,” in Multisensor Fusion and Integration for Intelligent Systems (MFI). IEEE, 2017.
-  E. J. Shamwell, W. Nothwang, and D. Perlis, “Multi-hypothesis visual inertial flow,” IEEE Robotics and Automation Letters, Submitted.