Vision-Aided Absolute Trajectory Estimation Using an Unsupervised Deep Network with Online Error Correction
Abstract
We present an unsupervised deep neural network approach to the fusion of RGB-D imagery with inertial measurements for absolute trajectory estimation. Our network, dubbed the Visual-Inertial Odometry Learner (VIOLearner), learns to perform visual-inertial odometry (VIO) without inertial measurement unit (IMU) intrinsic parameters (corresponding to gyroscope and accelerometer bias or white noise) or the extrinsic calibration between an IMU and camera. The network learns to integrate IMU measurements and generate hypothesis trajectories which are then corrected online according to the Jacobians of scaled image projection errors with respect to a spatial grid of pixel coordinates. We evaluate our network against state-of-the-art (SOA) visual-inertial odometry, visual odometry, and visual simultaneous localization and mapping (VSLAM) approaches on the KITTI Odometry dataset [1] and demonstrate competitive odometry performance.
I Introduction
Originally coined to characterize honey bee flight [2], the term “visual odometry” describes the integration of apparent image motions for dead-reckoning-based navigation (literally, as an odometer for vision). While work in what came to be known as visual odometry (VO) began in the 1980s for the Mars Rover project, the term was not popularized in the engineering context until around 2004 [3].
Approaches in visual-inertial odometry (VIO) combine visual estimates of motion with those measured by multi-axis accelerometers and gyroscopes in inertial measurement units (IMUs). As IMUs measure only linear accelerations and angular velocities, inertial approaches to localization drift rapidly over time due to the double integration of accelerations into pose estimates: even a small constant accelerometer bias produces position error that grows quadratically with time. Combining inertial estimates of pose change with visual estimates allows for the ‘squashing’ of inertial estimates of drift.
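To make this concrete, the following sketch double-integrates a biased accelerometer reading for a stationary platform. The 0.05 m/s² bias, 100 Hz rate, and simple Euler integration are illustrative assumptions, not values from this work; the point is that the position error follows roughly 0.5 · bias · t²:

```python
import numpy as np

def integrate_position(accel, dt):
    """Double-integrate acceleration samples into position (simple Euler)."""
    vel = np.cumsum(accel) * dt   # first integration: velocity
    pos = np.cumsum(vel) * dt     # second integration: position
    return pos

dt = 0.01                           # 100 Hz IMU (illustrative)
t = np.arange(0.0, 10.0, dt)
bias = 0.05                         # hypothetical constant accelerometer bias (m/s^2)
measured = np.zeros_like(t) + bias  # stationary platform, biased measurements
pos_err = integrate_position(measured, dt)
# Error follows ~0.5 * bias * t^2: about 2.5 m of drift after only 10 s
```

Doubling the elapsed time roughly quadruples the position error, which is why uncorrected inertial dead reckoning degrades so quickly without visual updates.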
While visual odometry (VO), VIO, and visual simultaneous localization and mapping (VSLAM) can be performed by a monocular camera system, the scale of trajectory estimates cannot be directly estimated, as depth is unobservable from monocular cameras (although for VIO, integrating raw IMU measurements can provide noisy, short-term estimates of scale). Generally, depth can be estimated from RGB-D cameras or from stereo camera systems. We have chosen to include depth in our input domain as a means of achieving absolute scale recovery: while absolute depth can be generated using an onboard sensor, the same cannot be said for the change in pose between image pairs.
GPS, proposed as a training pose-difference signal in [4], has limited accuracy on the order of meters, which inherently limits the accuracy of a GPS-derived inter-image training signal for bootstrapping purposes. While Real Time Kinematic (RTK) GPS can reduce this error to the order of centimeters, RTK GPS solutions require specialized localized base stations. Scene depth, on the other hand, can be accurately measured directly onboard a vehicle by an RGB-D sensor (e.g. Microsoft Kinect, Asus Xtion, or new indoor/outdoor Intel RealSense cameras), stereo cameras, or LIDAR (e.g. a Velodyne VLP-16 or PUCK Lite).
The main contribution of this paper is the unsupervised learning of trajectory estimates with absolute scale recovery from RGB-D + inertial measurements, with online error correction.
II Related Work
II-A Traditional Methods
In VO and visual simultaneous localization and mapping (VSLAM), only data from camera sensors is used and tracked across frames to determine the change in the camera pose. Simultaneous localization and mapping (SLAM) approaches typically consist of a front end, in which features are detected in the image, and a back end, in which features are tracked across frames and matched to keyframes to estimate camera pose, with some approaches performing loop closure as well. ORB-SLAM2 [5] is a visual SLAM system for monocular, stereo, and RGB-D cameras. It used bundle adjustment and a sparse map for accurate, real-time performance on CPUs. ORB-SLAM2 performed loop closure to correct for the accumulated error in its pose estimation and has shown state-of-the-art (SOA) performance on a variety of visual odometry (VO) benchmarks.
In VIO and visual-inertial SLAM, the fusion of imagery and IMU measurements is typically accomplished by filter-based approaches or nonlinear optimization approaches. ROVIO [6] is a VIO algorithm for monocular cameras, using a robust and efficient robocentric approach in which 3D landmark positions are estimated relative to camera pose. It used an Extended Kalman Filter (EKF) to fuse the sensor data, utilizing the intensity errors in the update step. However, because ROVIO is a monocular approach, accurate scale is not recovered. OKVIS [7] is a keyframe-based visual-inertial SLAM approach for monocular and stereo cameras. OKVIS relied on keyframes, which consisted of an image and estimated camera pose, a batch nonlinear optimization on saved keyframes, and a local map of landmarks to estimate camera egomotion. However, it did not include loop closure, unlike the SLAM algorithms with state-of-the-art (SOA) performance.
There are also several approaches which enhance VIO with depth sensing or laser scanning. Two methods of depth-enhanced VIO are built upon the Multi-state Constraint Kalman Filter (MSCKF) [8, 9] algorithm for vision-aided inertial navigation, which is another Extended Kalman Filter (EKF)-based VIO algorithm. One method is the MSCKF-3D [10] algorithm, which used a monocular camera with a depth sensor, or RGB-D camera system. The algorithm performed online time offset correction between camera and IMU, which is critical for its estimation process, and used a Gaussian mixture model for depth uncertainty. Pang et al. [11] also demonstrated a depth-enhanced VIO approach based on MSCKF, with 3D landmark positions augmented with sparse depth information kept in a registered depth map. Both approaches showed improved accuracy over VIO-only approaches. Finally, Zhang and Singh [12] proposed a method for leveraging data from a 3D laser. The approach utilized a multi-layer pipeline to solve for coarse to fine motion by using VIO as a subsystem and matching scans to a local map. It demonstrated high position accuracy and was robust to individual sensor failures.
II-B Learning-Based Methods
Recently, there have been several successful unsupervised approaches to depth estimation that are trained using reconstructive loss [13, 14] from image warping, similar to our own network. Garg et al. [13] and Godard et al. [14] used stereo image pairs with known baselines and reconstructive loss for training. Thus, while technically unsupervised, the known baseline effectively provides a known transform between the two images. Our network approaches the same problem from the opposite direction: we assume known depth and estimate an unknown pose difference.
Pillai and Leonard [4] demonstrated visual egomotion learning by mapping optical flow vectors to egomotion estimates via a mixture density network (MDN). Their approach not only required optical flow to already be externally generated (which can be very computationally expensive), but also was trained in a supervised manner and thus required the ground truth pose differences for each exemplar in the training set.
SFMLearner [15] demonstrated the unsupervised learning of unscaled egomotion and depth from RGB imagery. They input a consecutive sequence of images and output the change in pose between the middle image of the sequence and every other image in the sequence, along with the estimated depth of the middle image. However, their approach was unable to recover the scale of the depth estimates or, most crucially, the scale of the changes in pose. Thus, their network’s trajectory estimates needed to be scaled by parameters estimated from the ground truth trajectories, and in the real world this information will of course not be available. SFMLearner also required a sequence of images to compute a trajectory. Their best results were on an input sequence of five images, whereas our network only requires a source-target image pairing.
UnDeepVO [16] is another unsupervised approach to depth and egomotion estimation. It differs from [15] in that it was able to generate properly scaled trajectory estimates. However, unlike [15] and similar to [13, 14], it used stereo image pairs for training where the baseline between images is known; thus, UnDeepVO can only be trained on datasets where stereo image pairs are available. Additionally, the network architecture of UnDeepVO cannot be extended to include motion estimates derived from inertial measurements because the spatial transformation between paired images from stereo cameras is unobservable by an IMU (stereo images are recorded simultaneously).
VINet [17] was the first end-to-end trainable visual-inertial deep network. While VINet showed robustness to temporal and spatial misalignments of an IMU and camera, it still required extrinsic calibration parameters between camera and IMU. This is in contrast to our VIOLearner, which requires no IMU intrinsics or IMU-camera extrinsics. In addition, VINet was trained in a supervised manner and thus required the ground truth pose differences for each exemplar in the training set, which are not always readily available.
III Approach
VIOLearner is an unsupervised VIO deep network that estimates the scaled egomotion of a moving camera between the time at which a source image is captured and the time at which a target image is captured. VIOLearner receives an input RGB-D source image, a target RGB image, IMU data spanning the interval between source and target capture, and a camera calibration matrix $K$ with the camera’s intrinsic parameters. With access to $K$, VIOLearner can generate camera pose changes in the camera frame using a view-synthesis approach: the basis for its training is the Euclidean loss between a target image and a reconstructed target image generated using pixels in the source image sampled at locations determined by a learned 3D affine transformation (via the spatial transformer of [18]).
III-A Multi-Scale Projections and Online Error Correction
Similar to [15], VIOLearner performs multi-scale projections, downsampling projections by factors of 8, 4, and 2. However, in our network, multi-scale projections not only help to overcome gradient locality during training (see [15, 19] for a broader discussion of this well-known issue), but also aid in online error correction at runtime.
Generally, at each level, the network computes the Jacobian of the reprojection error at that level with respect to the grid of source coordinates. Convolutions are then performed on this Jacobian (sized H×W×2) and a correction to the 3D affine transformation is computed and added to the previously generated affine matrix. This is repeated for each level in the hierarchy. VIOLearner uses a total of five levels; additional detail on the computations performed at each level is provided in the following sections.
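The level-wise refinement can be sketched as follows. This is a toy analogue rather than the network itself: the transform is a 3-vector, the image reprojection loss is replaced by a quadratic surrogate, and each level’s learned convolutional correction is replaced by a fixed gradient step on the Jacobian.

```python
import numpy as np

def loss_jacobian(theta, target):
    """Toy stand-in for the Jacobian of reprojection error w.r.t. the estimate
    (loss = 0.5 * ||theta - target||^2, so the Jacobian is the residual)."""
    return theta - target

def refine(theta0, target, num_levels=5, step=0.5):
    """Cascade of additive corrections: each 'level' maps the error Jacobian
    to a correction that is summed into the current transform estimate."""
    theta = theta0.copy()
    for _ in range(num_levels):
        J = loss_jacobian(theta, target)  # Jacobian computed at this level
        theta = theta - step * J          # additive correction of the transform
    return theta

theta0 = np.array([1.0, 0.0, 0.0])    # initial IMU-derived estimate (hypothetical)
target = np.array([1.2, -0.1, 0.05])  # transform that best reconstructs the target
refined = refine(theta0, target)
# Each level shrinks the residual; five levels leave only a few percent of it
```

The design intuition is the same as in the network: each level does not re-estimate the transform from scratch, it only learns how to map an error signal into a small correction of the running estimate.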
III-B Level 0
VIOLearner first takes raw IMU measurements and learns to compute an estimated 3D affine transformation that will transform a source image into a target image (the top left of Fig. 1(a)). The network downsamples the source image (and the associated depth and camera matrix adjusted for the current downsampling factor) by a factor of 8 (see Fig. 2) and applies the 3D affine transformation to a normalized grid of source coordinates to generate a transformed grid of target coordinates.
VIOLearner then performs bilinear sampling to produce a reconstructed image $\hat{I}_{tgt}$ by sampling from the source image $I_{src}$ at the transformed coordinates $(x^{src}, y^{src})$:

$$\hat{I}_{tgt}(x^{tgt}, y^{tgt}) = \sum_{i}^{H} \sum_{j}^{W} I_{src}(i, j)\,\max(0, 1 - |y^{src} - i|)\,\max(0, 1 - |x^{src} - j|) \qquad (1)$$

where the sampling kernel $\max(0, 1 - |\cdot|)$ is zero except in the range $(-1, 1)$, where it is non-negative and equal to one at $0$.
As in [18], by only evaluating at the sampling pixels, we can therefore compute sub-gradients that allow error backpropagation to the affine parameters by computing error with respect to the coordinate locations instead of the more conventional error with respect to the pixel intensities:

$$\frac{\partial \hat{I}_{tgt}}{\partial x^{src}} = \sum_{i}^{H} \sum_{j}^{W} I_{src}(i, j)\,\max(0, 1 - |y^{src} - i|) \begin{cases} 0, & |x^{src} - j| \geq 1 \\ 1, & j \geq x^{src} \\ -1, & j < x^{src} \end{cases} \qquad (2)$$

$$\frac{\partial \hat{I}_{tgt}}{\partial y^{src}} = \sum_{i}^{H} \sum_{j}^{W} I_{src}(i, j)\,\max(0, 1 - |x^{src} - j|) \begin{cases} 0, & |y^{src} - i| \geq 1 \\ 1, & i \geq y^{src} \\ -1, & i < y^{src} \end{cases} \qquad (3)$$
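A direct, unvectorized implementation of the bilinear sampling kernel of Equation 1 can be sketched as follows; the 4×4 image is a toy example:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample `img` (H x W) at continuous coordinates (x, y) with the
    max(0, 1 - |.|) kernel used by the spatial transformer of [18]."""
    H, W = img.shape
    out = 0.0
    for i in range(H):
        for j in range(W):
            wy = max(0.0, 1.0 - abs(y - i))  # kernel weight in y
            wx = max(0.0, 1.0 - abs(x - j))  # kernel weight in x
            if wx > 0.0 and wy > 0.0:
                out += img[i, j] * wx * wy
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
# At integer coordinates the kernel reduces to a pixel lookup:
#   bilinear_sample(img, 2, 1) -> 6.0 (== img[1, 2])
# Halfway between two pixels it averages them:
#   bilinear_sample(img, 1.5, 0) -> 1.5 (mean of img[0, 1] and img[0, 2])
```

Because the kernel is piecewise linear in the coordinates, the sub-gradients of Equations 2 and 3 are well defined almost everywhere, which is what lets the loss be differentiated with respect to coordinate locations rather than pixel intensities.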
Starting with Level 0, the Euclidean loss is taken between the downsampled reconstructed target image and the actual target image. For Level 0, this error is computed as:

$$\mathcal{L}_{0} = \left\lVert \hat{I}_{tgt,0} - I_{tgt,0} \right\rVert_{2} \qquad (4)$$
The final computation performed by Level 0 is of the Jacobian of the Euclidean loss of Equation 4 with respect to the source coordinates from Equation 2 and Equation 3. The resulting Jacobian matrix has the same dimensionality as the grid of source coordinates (H×W×2) and is depicted in Fig. 1(a). In traditional approaches, the gradient and error equations above are only used during training. However, VIOLearner is novel in that it also computes and employs these gradients during each inference step of the network. During both training and inference, the Jacobian is computed and passed to the next level for processing.
III-C Levels 1 to n−1
For the second level in the network through the (n−1)th level, the previous level’s Jacobian is input and processed through layers of convolutions to generate a correction $\Delta\theta_{i}$. This represents a computed correction to be applied to the previous level’s 3D affine transform $\theta_{i-1}$. $\Delta\theta_{i}$ is summed with $\theta_{i-1}$ to generate $\theta_{i}$, which is then applied to generate a reconstruction downsampled by the factor associated with level $i$. Error is again computed as above in Equation 4, and the Jacobian is similarly found as it was in Level 0 and input to the next level $i+1$.
III-D Level n and Multi-Hypothesis Pathways
The final level of VIOLearner employs multi-hypothesis pathways similar to [20, 21], where several possible hypotheses for the reconstruction of a target image (and the associated transformations which generated those reconstructions) are computed in parallel. The lowest-error hypothesis reconstruction is chosen during each network run, and the corresponding affine matrix which generated the winning reconstruction is output as the final network estimate of camera pose change between the source and target images.
This multi-hypothesis approach allows the network to generate several different pathways and effectively sample from an unknown noise distribution. For example, as IMUs only measure linear accelerations, they fail to accurately convey motion during periods of constant velocity. Thus, a window of integrated IMU measurements is contaminated with noise related to the velocity at the beginning of the window. With a multi-hypothesis approach, the network has a mechanism to model uncertainty in the initial velocity (see Section VI-E for a discussion).
Error for this last multi-hypothesis level is computed according to a winner-take-all (WTA) Euclidean loss rule (see [20] for more detail and justifications):

$$i^{*} = \operatorname*{arg\,min}_{i}\,\left\lVert \hat{I}_{tgt}^{\,i} - I_{tgt} \right\rVert_{2} \qquad (5)$$

$$\mathcal{L}_{n} = \left\lVert \hat{I}_{tgt}^{\,i^{*}} - I_{tgt} \right\rVert_{2} \qquad (6)$$

where $\hat{I}_{tgt}^{\,i^{*}}$ is the lowest-error hypothesis reconstruction. Loss is then only computed for this one hypothesis pathway, and error is backpropagated only to parameters in that one pathway. Thus, only parameters that contributed to the winning hypothesis are updated and the remaining parameters are left untouched.
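The WTA rule can be sketched as follows; the 3-vector “reconstructions” are toy stand-ins for the hypothesis images:

```python
import numpy as np

def wta_loss(hypotheses, target):
    """Winner-take-all Euclidean loss: only the lowest-error hypothesis
    contributes, so gradients would flow through that pathway alone."""
    errors = [np.linalg.norm(h - target) for h in hypotheses]
    winner = int(np.argmin(errors))  # index of the best hypothesis (Eq. 5)
    return winner, errors[winner]    # loss of the winner only (Eq. 6)

target = np.array([1.0, 2.0, 3.0])
hypotheses = [target + 0.5, target - 0.1, target + 1.0]  # toy hypothesis outputs
winner, loss = wta_loss(hypotheses, target)
# winner == 1: the pathway whose reconstruction lies closest to the target
```

In training, only the parameters along the winning pathway would receive gradient updates, which is what allows the pathways to specialize to different regions of the noise distribution.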
The final loss by which the network is trained is then simply the sum of the Euclidean loss terms for each level plus a weighted L1 penalty over the bias terms which we empirically found to better facilitate training and gradient backpropagation:
$$\mathcal{L} = \sum_{\ell=0}^{n} \mathcal{L}_{\ell} + \lambda\,\lVert b \rVert_{1} \qquad (7)$$
IV Methods
IV-A Network Architecture
IV-A1 IMU Processing
The initial level of VIOLearner uses two parallel pathways of convolutional layers for the IMU angular velocities and linear accelerations, respectively. Each pathway begins with single-stride convolutional layers applied to the raw angular-velocity or linear-acceleration measurements, followed by strided convolutional layers that progressively reduce the temporal dimension. The final convolutional layer in each pathway is flattened, and the angular-velocity and linear-acceleration pathways are concatenated together into a single tensor.
IV-A2 3D Affine Transformations
The first three components of pose_imu correspond to Euler angles $\alpha$, $\beta$, and $\gamma$ representing rotations about the x, y, and z axes, respectively. Rotation matrices are computed as

$$R_{x} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix} \qquad (8)$$

$$R_{y} = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix} \qquad (9)$$

$$R_{z} = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (10)$$

and a 3D rotation matrix is generated as

$$R = R_{z} R_{y} R_{x} \qquad (11)$$

The last three elements of pose_imu directly correspond to a translation vector $t$. Together with the 3D rotation matrix $R$, we finally form a 4×4 homogeneous transformation matrix as

$$T = \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \qquad (12)$$
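This construction can be sketched directly; the Z·Y·X composition order is one common Euler convention and is assumed here:

```python
import numpy as np

def euler_to_transform(alpha, beta, gamma, t):
    """Build a 4x4 homogeneous transform from Euler angles about the
    x, y, z axes, composed as R = Rz @ Ry @ Rx, with translation t."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # composed rotation
    T[:3, 3] = t               # translation block
    return T

T = euler_to_transform(0.0, 0.0, np.pi / 2, [1.0, 0.0, 0.0])
# A 90-degree yaw maps the x-axis onto the y-axis before translating:
#   T @ [1, 0, 0, 1] -> [1, 1, 0, 1]
```

Applying such a transform to homogeneous 3D points (and reprojecting through the camera matrix) is what generates the warped coordinate grid used for the reconstructions above.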
IV-A3 Online Error Correction and Pose Refinement
For each H×W×2 Jacobian matrix at each scale, three strided convolutional layers are applied. Additional convolutional layers are then applied depending on the downsampling factor of the current level, with the number of additional layers increasing as the downsampling factor decreases. For each level, the final layer generates a pose-correction estimate using a single-strided convolution.
The outputs are split, similarly to pose_imu, into rotations $\Delta\mathit{rot}_{i}$ and a translation $\Delta\mathit{trans}_{i}$. The new rotations and translations for level $i$ are then computed as

$$\mathit{rot}_{i} = \mathit{rot}_{i-1} + \Delta\mathit{rot}_{i} \qquad (13)$$

$$\mathit{trans}_{i} = \mathit{trans}_{i-1} + \Delta\mathit{trans}_{i} \qquad (14)$$
This repeats for each level until the final level, where the refined transform is output as the final estimate of the change in camera pose.
IV-A4 Multi-Hypothesis Generation
The final level of VIOLearner uses four hypothesis pathways as described above in Section III-D.
IV-B Training Procedures
VIOLearner was trained using a batch size of 32. As the network was trained, we calculated error on the validation set at intervals of 500 iterations; the results presented in this paper are from the model that provided the highest performance on the validation set. We used the Adam solver with an exponential learning rate policy. The network was trained on a desktop computer with a 3.00 GHz Intel i7-6950X processor and Nvidia Titan X GPUs.
IV-C KITTI Odometry Dataset
We evaluate VIOLearner on the KITTI Odometry dataset [1] and used the remaining sequences for training, excluding one sequence whose corresponding raw file was not online at the time of publication. Sequences 09 and 10 were withheld for the test set as was done in [15]. Additionally, one of the KITTI sequences was withheld as a validation set.
Depth is not available for the full resolution of KITTI images, so we cropped each KITTI image (first resizing when a sequence used a different resolution) and then scaled the cropped images to the network input size. In all experiments, we randomly selected an image for the source and used the successive image for the target. Corresponding IMU data was collected from the KITTI raw datasets: for each source image, a window of IMU measurements preceding the source image and spanning the interval between source and target image was combined into a fixed-length input vector. We chose to include IMU data in this way so that the network could learn to implicitly estimate a temporal offset between camera and IMU data, as well as glean an estimate of the initial velocity at the time of source image capture by looking to previous data.
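The IMU windowing described above can be sketched as follows; the 100 Hz rate, the 0.1 s lead before the source frame, and the six-channel layout are illustrative assumptions rather than the paper’s exact values:

```python
import numpy as np

def imu_window(imu_times, imu_data, t_source, t_target, lead=0.1):
    """Gather IMU samples from `lead` seconds before the source frame through
    the target frame, giving the network context from which to infer the
    camera-IMU time offset and the initial velocity at the source frame."""
    mask = (imu_times >= t_source - lead) & (imu_times <= t_target)
    return imu_data[mask]

imu_times = np.arange(100) * 0.01    # 100 Hz IMU timestamps (toy)
imu_data = np.random.randn(100, 6)   # [gyro xyz, accel xyz] per sample
window = imu_window(imu_times, imu_data, t_source=0.5, t_target=0.6)
# 21 samples: t in [0.4, 0.6] inclusive at 100 Hz
```

Including samples from before the source frame is the key design choice: the pre-window motion is what carries information about the otherwise unobservable initial velocity.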
V Evaluation
In the literature, there is no readily comparable approach with the exact input and output domains of our network (namely RGB-D + inertial inputs and scaled output odometry; see Section II for approaches with our input domain that do not publicly provide their code or evaluation results on KITTI). Nonetheless, we compare our approach to the recent VO, VIO, and SLAM approaches described earlier in Section II.
For OKVIS, ROVIO, and SFMLearner, we temporally align each output and ground truth trajectory by cross-correlating the norm of the rotational accelerations, and perform 6-DOF alignment of the first tenth of the trajectory to ground truth. For ROVIO and SFMLearner, we also estimate the scale from ground truth.
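The temporal-alignment step can be sketched with a simple cross-correlation over scalar signatures; here a synthetic stand-in for the rotational-acceleration norm is delayed by a known number of samples:

```python
import numpy as np

def temporal_offset(sig_a, sig_b):
    """Estimate the sample offset between two scalar signatures by
    locating the peak of their full (mean-removed) cross-correlation."""
    a = sig_a - sig_a.mean()
    b = sig_b - sig_b.mean()
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)

t = np.linspace(0.0, 10.0, 500)
sig = np.abs(np.sin(3.0 * t)) * np.exp(-0.1 * t)  # toy rotational-acceleration norm
delayed = np.roll(sig, 7)                          # same signature, 7 samples late
# temporal_offset(delayed, sig) recovers the injected 7-sample delay
```

Cross-correlating rotational-acceleration norms works because rotation magnitude is observable in both the estimated and ground truth trajectories regardless of scale or frame alignment.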
TABLE I: Comparison against VO approaches on KITTI sequences (translational RMSE (%) / rotational RMSE (°/100 m); X denotes no available result).

| Seq | VIOLearner | UnDeepVO | SFMLearner | VISO2-M | ORB-SLAM-M | ORB-SLAM2 |
|-----|------------|----------|------------|---------|------------|-----------|
| 00 | 14.27 / 5.29 | 4.14 / 1.92 | 65.27 / 6.23 | 18.24 / 2.69 | 25.29 / 7.37 | 0.70 / 0.25 |
| 02 | 4.07 / 1.48 | 5.58 / 2.44 | 57.59 / 4.09 | 4.37 / 1.18 | X / X | 0.76 / 0.23 |
| 05 | 3.00 / 1.40 | 3.40 / 1.50 | 16.76 / 4.06 | 19.22 / 3.54 | 26.01 / 10.62 | 0.40 / 0.16 |
| 07 | 3.60 / 2.06 | 3.15 / 2.48 | 17.52 / 5.38 | 23.61 / 4.11 | 24.53 / 10.83 | 0.50 / 0.28 |
| 08 | 2.93 / 1.32 | 4.08 / 1.79 | 24.02 / 3.05 | 24.18 / 2.47 | 32.40 / 12.13 | 1.05 / 0.32 |
TABLE II: Trajectory error versus path length (median, first quartile, third quartile).

| Length (m) | VIOLearner Med. | 1st Q. | 3rd Q. | VINet Med. | 1st Q. | 3rd Q. | EKF+VISO2 Med. | 1st Q. | 3rd Q. |
|------------|-----------------|--------|--------|------------|--------|--------|----------------|--------|--------|
| 100 | 1.87 | 1.25 | 2.3 | 0 | 0 | 2.18 | 2.7 | 0.54 | 9.2 |
| 200 | 3.57 | 2.9 | 4.16 | 2.5 | 1.01 | 5.43 | 11.9 | 4.89 | 32.6 |
| 300 | 5.78 | 5.19 | 6.39 | 6.0 | 3.26 | 17.9 | 26.6 | 9.23 | 58.1 |
| 400 | 8.32 | 6.87 | 9.57 | 10.3 | 5.43 | 39.6 | 40.7 | 13.0 | 83.6 |
| 500 | 12.33 | 9.49 | 13.69 | 16.8 | 8.6 | 70.1 | 57.0 | 19.5 | 98.9 |
TABLE III: Comparison against VIO and SLAM approaches on KITTI test sequences 09 and 10 (translational RMSE (%) / rotational RMSE (°/100 m)).

| Seq | VIOLearner | SFMLearner | OKVIS | ROVIO | ORB-SLAM2 |
|-----|------------|------------|-------|-------|-----------|
| 09 | 1.51 / 0.90 | 21.63 / 3.57 | 9.77 / 2.97 | 20.18 / 2.09 | 0.87 / 0.27 |
| 10 | 2.04 / 1.37 | 20.54 / 10.93 | 17.30 / 2.82 | 20.04 / 2.24 | 0.60 / 0.27 |
VI Results and Discussion
VI-A Visual Odometry
VIOLearner compared favorably to the VO approaches listed above, as seen in Tab. I. It should be noted that the results in Tab. I for VIOLearner, UnDeepVO, and SFMLearner are for networks that were tested on data on which they were also trained, in accordance with the results presented in [16]. We were thus unable to properly evaluate UnDeepVO against VIOLearner on test data that was not also used for training, as such results were not provided for UnDeepVO, nor is their model available online at the time of writing. With the exception of KITTI sequence 00, VIOLearner performed favorably against UnDeepVO.
VI-B Visual-Inertial Odometry
The authors of VINet [17] provide boxplots of their method’s error compared to several state-of-the-art approaches on KITTI Odometry. We have extracted the median, first quartile, and third quartile from their results plots to the best of our ability and included them in Tab. II. For longer trajectories (300 m, 400 m, and 500 m), VIOLearner outperformed VINet on KITTI sequence 10. It should again be noted that while VINet requires camera-IMU extrinsic calibrations, our network is able to implicitly learn this transform from the data itself.
The authors of [17] reported OKVIS failing to run on the KITTI Odometry dataset and instead used a custom EKF with VISO2 as a comparison to traditional SOA VIO approaches. However, we were able to successfully run OKVIS on KITTI Odometry sequences 09 and 10 and have included the results in Tab. III. Additionally, we provide results from ROVIO. VIOLearner outperforms OKVIS and ROVIO on KITTI sequences 09 and 10. However, both OKVIS and ROVIO require tight synchronization between IMU measurements and images, which KITTI does not provide. This is most likely the reason for the poor performance of both approaches on KITTI. It also highlights a strength of VIOLearner: it is able to compensate for loosely temporally synchronized sensors without explicitly estimating their temporal offsets.
VI-C Visual Simultaneous Localization and Mapping
Additionally, we have included benchmark results from ORB-SLAM2. ORB-SLAM2 performs SLAM, unlike our pure odometry-based solution, and is included as an example of SOA localization to provide additional perspective on our results. VIOLearner was significantly outperformed by ORB-SLAM2. This was not a surprise, as ORB-SLAM2 uses bundle adjustment and loop closure, maintains a SLAM map, and generally uses far more data for each pose estimate than our VIOLearner.
VI-D Online Error Correction
Results in Fig. 4 show the pose root mean squared error (RMSE) between the transforms generated at each level and suggest that our online error correction mechanism is able to reduce pose error. It should however be noted that Levels 0 to 2 are computed using downsampled images, and thus their Jacobian inputs both have access to less information and use one less convolutional layer each. The extent to which this affects the plots in Fig. 4 is not yet fully clear. However, Level 3 operates on full-size inputs and there is still a reduction in error between Level 3 and Level 4. While this reduction in the final level can be partially attributed to the multi-hypothesis mechanism in Level 4, the mean RMSE of each individual Level 4 hypothesis pathway (with the exception of the second hypothesis pathway) was lower than the mean RMSE for Level 3 on both KITTI 09 and KITTI 10, and lower still when the lowest-error hypothesis was chosen. This suggests that the observed error reduction is indeed an effect of our online error correction mechanism rather than simply an artefact of image resolution.
VI-E Multi-Hypothesis Error Reduction
For the transforms generated by each of the four hypothesis pathways in the final level for KITTI sequence 09, we computed the average variance of pose error between the four hypotheses in the x-, y-, and z-dimensions, along with the Euclidean error between the computed translation and the true translation. The z-dimension shows orders of magnitude more variance than the x- and y-dimensions and is the main contributor to hypothesis variance. For the camera frame in KITTI, the z-direction corresponds to the forward direction, which is the predominant direction of motion in KITTI and also where we would expect to see the largest influence from uncertainty in initial velocities. These results are consistent with the network learning to model this uncertainty in initial velocity as intended.
VII Conclusion
In this work, we have presented our VIOLearner architecture and demonstrated competitive performance against SOA odometry approaches. VIOLearner’s novel multi-step trajectory estimation via convolutional processing of Jacobians at multiple spatial scales yields a trajectory estimator that learns how to correct errors online.
The main contributions of VIOLearner are its unsupervised learning of scaled trajectory, online error correction based on the use of intermediate gradients, and ability to combine uncalibrated, loosely temporally synchronized, multimodal data from different reference frames into improved estimates of odometry.
References
 [1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
 [2] M. Srinivasan, M. Lehrer, W. Kirchner, and S. Zhang, “Range perception through apparent image speed in freely flying honeybees,” Visual neuroscience, vol. 6, no. 5, pp. 519–535, 1991.
 [3] D. Nistér, O. Naroditsky, and J. Bergen, “Visual odometry,” in CVPR, vol. 1. IEEE, 2004, pp. I–I.
 [4] S. Pillai and J. J. Leonard, “Towards visual ego-motion learning in robots,” arXiv preprint arXiv:1705.10279, 2017.
 [5] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 [6] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
 [7] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframebased visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
 [8] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in ICRA. IEEE, 2007, pp. 3565–3572.
 [9] M. Li and A. I. Mourikis, “High-precision, consistent EKF-based visual-inertial odometry,” The International Journal of Robotics Research, vol. 32, no. 6, pp. 690–711, 2013.
 [10] M. N. Galfond, “Visual-inertial odometry with depth sensing using a multi-state constraint Kalman filter,” Master’s thesis, Massachusetts Institute of Technology, 2014.
 [11] F. Pang, Z. Chen, L. Pu, and T. Wang, “Depth enhanced visual-inertial odometry based on multi-state constraint Kalman filter,” in IROS, Sept 2017, pp. 1761–1767.
 [12] J. Zhang and S. Singh, “Enabling aggressive motion estimation at low-drift and accurate mapping in real-time,” in ICRA. IEEE, 2017, pp. 5051–5058.
 [13] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in ECCV. Springer, 2016, pp. 740–756.
 [14] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in CVPR, vol. 2, no. 6, 2017, p. 7.
 [15] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, vol. 2, no. 6, 2017, p. 7.
 [16] R. Li, S. Wang, Z. Long, and D. Gu, “UnDeepVO: Monocular visual odometry through unsupervised deep learning,” arXiv preprint arXiv:1709.06841, 2017.
 [17] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, “VINet: Visual-inertial odometry as a sequence-to-sequence learning problem,” in AAAI, 2017, pp. 3995–4001.
 [18] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in NIPS, 2015, pp. 1–14.
 [19] P. Anandan, J. Bergen, K. Hanna, and R. Hingorani, “Hierarchical modelbased motion estimation,” in Motion Analysis and Image Sequence Processing. Springer, 1993, pp. 1–22.
 [20] E. J. Shamwell, W. D. Nothwang, and D. Perlis, “A deep neural network approach to fusing vision and heteroscedastic motion estimates for low-SWaP robotic applications,” in Multisensor Fusion and Integration for Intelligent Systems (MFI). IEEE, 2017.
 [21] E. J. Shamwell, W. Nothwang, and D. Perlis, “Multi-hypothesis visual inertial flow,” IEEE Robotics and Automation Letters, submitted.