Direct Visual-Inertial Odometry with Semi-Dense Mapping
Abstract
This paper presents a direct visual-inertial odometry system. In particular, a tightly-coupled, nonlinear-optimization-based method is proposed by integrating recent advances in direct dense tracking and Inertial Measurement Unit (IMU) preintegration, and a factor graph optimization is adopted to estimate the pose of the camera and rebuild a semi-dense map. Two sliding windows are maintained in the proposed approach. The first one, based on Direct Sparse Odometry (DSO), estimates the depths of candidate points for mapping and dense visual tracking. In the second one, measurements from the IMU preintegration and dense visual tracking are fused probabilistically using a tightly-coupled, optimization-based sensor fusion framework. As a result, the IMU preintegration provides additional constraints to suppress the scale drift induced by the visual odometry. Evaluations on real-world benchmark datasets show that the proposed method achieves competitive results in indoor scenes.
1 Introduction
In the research area of robotics, camera motion estimation and 3D reconstruction are fundamental to both navigation and perception, with applications such as unmanned aerial vehicle (UAV) navigation [11] and indoor reconstruction [5, 4]. These applications ultimately require up-to-scale camera motion tracking and the 3D structure of the environment. Most existing methods formulate this problem as simultaneous localization and mapping (SLAM), which is characterized by the sensors it uses. Recent efforts include visual SLAM and visual-inertial navigation systems (VINS).
Visual odometry [17] estimates the depth of features, based on which the pose of the camera is tracked. In contrast, direct visual odometry eliminates the feature processing pipeline, and with it some of the issues of feature-based methods, by tracking the pose directly on pixels. However, even these methods are subject to failure during aggressive motions, as images can be severely blurred, and they remain sensitive to noise, changing illumination, and fast motion. Consequently, aggressive UAV motion [14, 15], with large angular velocities and linear accelerations, together with lighting variation, makes the state estimation subject to scale drift in the long term.
IMUs generate noisy but outlier-free measurements, which makes them well suited to tracking short-term fast motions; however, low-cost MEMS IMUs suffer significant drift in the long term. IMU measurements can remedy this issue in visual odometry by providing short-term motion constraints, making the two sensors an ideal combination for estimating the pose of the camera and generating information about the surroundings. We believe that integrating the complementary natures of dense visual tracking and IMU measurements opens up the possibility of reliably tracking aggressive motions.
In this paper, we propose a tightly-coupled cost function that integrates visual and inertial terms in a fully probabilistic manner. We adopt the concept of keyframes due to its successful application in classical vision-only approaches: it is implemented using partial linearization and marginalization. The keyframe paradigm also accounts for drift-free estimation when slow or no motion is present. Instead of using an optimization window of time-successive poses, the selected keyframes may be spaced arbitrarily far in time, keeping visual constraints while still incorporating an IMU term. We provide a strictly probabilistic derivation of the IMU error terms and the respective information matrix, relating successive image frames without explicitly introducing states at the IMU rate.
The key contribution of this work is a robust and fully integrated system for direct visual-inertial odometry. The novelties of the proposed system include: 1) The combination of direct photometric information and edge features, which are points with sufficiently high image gradient magnitude. We work with pixel intensities, which provide good invariance to changes in viewpoint and illumination. This makes the system more robust and reliable than methods relying on detected features. 2) IMU preintegration. The IMU preintegration provides scale information by integrating IMU measurements. Thanks to the use of a factor graph, tracking and mapping are focused on a local covisible area, independent of the global map size. 3) Tightly-coupled optimization. The measurements from the IMU preintegration and dense visual tracking are fused probabilistically using a tightly-coupled, optimization-based sensor fusion framework. The dense visual tracking results provide visual constraints between the current frame and the reference frame, while the IMU preintegration provides constraints between two consecutive frames.
In the remainder of the paper, we begin with a review of the related work. Sect. 4 gives an overview of the system and presents the details of the IMU and visual measurements. Dense mapping is introduced in Sect. 5. Sect. 6 describes implementation details. The experimental results are discussed in Sect. 7. Finally, Sect. 8 draws conclusions.
2 Related work
Simultaneous localization and mapping has a long history in the monocular scenario; prominent examples include DTAM [17], SVO [9] and LSD-SLAM [3], which estimate the camera motion in a prediction-and-correction fashion. Direct methods, operating directly on image intensities, have become popular since they are robust and computationally efficient without feature detection. It has been demonstrated that direct methods are more suitable for dense mapping than feature-based methods, and when enhanced by edge alignment [10] [24], they can deal with changing illumination and fast motion. More recently, direct sparse odometry (DSO) [4] presented an impressive semi-dense 3D reconstruction of the environment through a sliding-window optimization method. This direct method minimizes the photometric error of points with sufficiently high image gradient magnitude (edges) using a nonlinear optimization framework.
More recently, deep neural networks have demonstrated advanced performance [liu2019approximate]. Much attention has been paid to 3D SLAM, such as semantic SLAM [29] and deep-learning-enhanced SLAM [21]. Semantic SLAM [29] focuses on simultaneous 3D reconstruction and material recognition and segmentation, resulting in a real-time end-to-end system. Deep learning is promising in computer vision [30, 28, 26, 27]. [21] takes advantage of depth prediction from Convolutional Neural Networks (CNNs) to enhance the performance of monocular SLAM. DSO [4] uses one monocular camera for 3D reconstruction within an inverse depth estimation framework, based on photometric error optimization with windowed sparse bundle adjustment.
All of these methods explore the scene within the reconstructed 3D map, but one problem they suffer from is scale drift. There are two main approaches to solving the scale drift problem with extra measurements: batch nonlinear optimization methods and recursive filtering methods. The former jointly minimizes the errors originating from the integrated IMU measurements and the (reprojection) errors from the visual terms [11], while recursive algorithms usually utilize the IMU measurements for state propagation and the visual observations for updates [20, 13]. At the backend, fusion methods can largely be divided into two classes: loosely-coupled fusion [23, 18] and tightly-coupled fusion [2, 22, 15, 11]. In loosely-coupled fusion, visual measurements are first processed independently to obtain high-level pose information and then fused with the inertial measurements, usually in a filtering framework [13, 12]. The two subproblems are solved separately, resulting in a lower computational cost, but the results are suboptimal. In tightly-coupled fusion, the visual and inertial measurements are fused and optimized in a single framework. This considers the coupling between the two types of measurement and allows the adoption of a graph-optimization-based framework with iterative relinearization to achieve better performance. Tightly-coupled methods usually come with a higher computational cost.
3 Problem Formulation
3.1 Notation
We use the following notation in the paper:

- $\hat{\mathbf{z}}_{i,i+1}$: the IMU measurements between the $i$-th and $(i+1)$-th images;
- $\hat{\mathbf{z}}_{r,c}$: the dense tracking result for the current image $c$ with respect to the corresponding reference image $r$ (a keyframe);
- $\mathbf{r}_{\mathcal{I}}(\hat{\mathbf{z}}_{i,i+1}, \mathcal{X})$: the residual between the IMU integration $\hat{\mathbf{z}}_{i,i+1}$ and state $\mathcal{X}$;
- $\mathbf{r}_{\mathcal{C}}(\hat{\mathbf{z}}_{r,c}, \mathcal{X})$: the dense tracking residual between $\hat{\mathbf{z}}_{r,c}$ and state $\mathcal{X}$;
- $\boldsymbol{\Sigma}_{\mathcal{I}}$: the associated covariance of the IMU measurements;
- $\boldsymbol{\Sigma}_{\mathcal{C}}$: the associated covariance of the image alignment;
- $\mathbf{x}_k$: the $k$-th state in the global coordinate system;
- $\mathbf{p}_k$: the $k$-th position state in the global coordinate system;
- $\mathbf{v}_k$: the $k$-th velocity state in the global coordinate system;
- $\mathbf{q}_k$: the $k$-th orientation state in the global coordinate system;
- $\mathbf{b}_{a_k}$: the $k$-th accelerometer bias state;
- $\mathbf{b}_{g_k}$: the $k$-th gyroscope bias state.

Our system maintains several states during processing. A state includes the position, velocity, orientation, accelerometer bias, and gyroscope bias. The full state space is defined as:

$$\mathcal{X} = \left[ \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_n \right], \qquad \mathbf{x}_k = \left[ \mathbf{p}_k, \mathbf{v}_k, \mathbf{q}_k, \mathbf{b}_{a_k}, \mathbf{b}_{g_k} \right]$$
4 VisualInertial Odometry
In this section, we present our approach to incorporating inertial measurements into batch visual SLAM. In visual odometry and SLAM, a nonlinear optimization is formulated to find the camera poses and landmark positions by minimizing the reprojection error of landmarks observed in camera frames. Unlike the work in [15], which uses a stereo camera for depth estimation, we adopt DSO to build a semi-dense map and use its depth estimates for camera motion tracking. Figure 2 shows the factor graph, in which the visual-inertial fusion takes the points in the local map as landmarks, which provide geometric constraints to the related observation states. The local map points are a collection of points with depth information in the latest keyframes.
We seek to formulate the visual-inertial localization and mapping problem as one joint optimization of a cost function containing both the (weighted) reprojection errors and the (weighted) IMU errors:
$$\min_{\mathcal{X}} \left\{ \sum_{(r,c)} \left\| \mathbf{r}_{\mathcal{C}}(\hat{\mathbf{z}}_{r,c}, \mathcal{X}) \right\|^2_{\boldsymbol{\Sigma}_{\mathcal{C}}} + \sum_{i} \left\| \mathbf{r}_{\mathcal{I}}(\hat{\mathbf{z}}_{i,i+1}, \mathcal{X}) \right\|^2_{\boldsymbol{\Sigma}_{\mathcal{I}}} \right\} \qquad (1)$$
From the probabilistic view [7], a factor graph encodes the posterior probability of the variables, given the available measurements:
$$p(\mathcal{X} \mid \mathcal{Z}) \propto p(\mathcal{X}_0) \prod_{k} p(\hat{\mathbf{z}}_k \mid \mathcal{X}_k) \qquad (2)$$
We assume all the distributions to be Gaussian. The MAP estimate corresponds to the minimum of the negative logposterior. The negative logposterior can be written as a sum of squared residual errors:
$$\mathcal{X}^{*} = \arg\min_{\mathcal{X}} \left( -\log p(\mathcal{X} \mid \mathcal{Z}) \right) = \arg\min_{\mathcal{X}} \sum_{k} \left\| \mathbf{r}_k \right\|^2_{\boldsymbol{\Sigma}_k} \qquad (3)$$
Combining the results from the IMU integration (Sect. 4.1) and the direct image alignment (Sect. 4.2), and ignoring the prior term, we optimize Eq. 1 iteratively using the Gauss-Newton method. We solve this type of minimization problem with a cost function:
$$\min_{\mathbf{x}} F(\mathbf{x}) = \min_{\mathbf{x}} \frac{1}{2} \left\| \mathbf{r}(\mathbf{x}) \right\|^2 \qquad (4)$$
We introduce the Jacobian and add the weights, which are the inverses of the covariances, to the equation before we vectorize the cost function as:
$$F(\mathbf{x} + \delta\mathbf{x}) \approx \frac{1}{2} \left( \mathbf{r}(\mathbf{x}) + \mathbf{J}\,\delta\mathbf{x} \right)^{\top} \mathbf{W} \left( \mathbf{r}(\mathbf{x}) + \mathbf{J}\,\delta\mathbf{x} \right) \qquad (5)$$
Then the solution of the minimization is obtained as:
$$\delta\mathbf{x} = -\left( \mathbf{J}^{\top} \mathbf{W} \mathbf{J} \right)^{-1} \mathbf{J}^{\top} \mathbf{W}\, \mathbf{r}(\mathbf{x}) \qquad (6)$$
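The Gauss-Newton iteration above can be sketched on a toy one-parameter least-squares problem; the model $y = e^{\theta t}$, the unit weights, and all names below are illustrative assumptions, not the paper's implementation:

```python
import math

def gauss_newton_1d(ts, ys, ws, theta0, iters=20):
    """Minimize sum_i w_i * r_i(theta)^2 for the toy model y = exp(theta * t)."""
    theta = theta0
    for _ in range(iters):
        h = 0.0  # J^T W J (a scalar for this one-parameter problem)
        g = 0.0  # J^T W r
        for t, y, w in zip(ts, ys, ws):
            pred = math.exp(theta * t)
            r = pred - y        # residual
            jac = t * pred      # d r / d theta
            h += w * jac * jac
            g += w * jac * r
        theta -= g / h          # normal-equation update, the scalar form of Eq. (6)
    return theta

# Data generated from theta = 0.5; the iteration recovers it from theta0 = 0.
ts = [0.1 * i for i in range(1, 10)]
ys = [math.exp(0.5 * t) for t in ts]
theta_hat = gauss_newton_1d(ts, ys, [1.0] * len(ts), theta0=0.0)
```

In the actual system the state is a stacked vector over the sliding window and the weights come from the IMU and image-alignment covariances, but the update structure is the same.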
4.1 IMU Measurement
Usually, the frequency of the IMU is higher than that of the camera, so there are tens of IMU measurements within the interval between two consecutive images. The IMU preintegration between two images provides a prior for image alignment, and also works as a different kind of constraint within the factor graph. The preintegration is given by:
(7) 
where $\mathbf{g}$ is the gravity vector and $\lfloor \cdot \rfloor_{\times}$ is the skew-symmetric matrix operator. $\hat{\mathbf{a}}$ and $\hat{\boldsymbol{\omega}}$ are the IMU measurements, and $\mathbf{n}_{b_a}$ and $\mathbf{n}_{b_g}$ are the white noise processes affecting the biases. The biases are initially set to zero, and the optimized values computed at each subsequent step are used for the next preintegration. Through the IMU propagation [15], we can get the covariance:
$$\mathbf{P}_{k+1} = \mathbf{F}_k \mathbf{P}_k \mathbf{F}_k^{\top} + \mathbf{G}_k \mathbf{Q} \mathbf{G}_k^{\top} \qquad (8)$$
where $\mathbf{F}_k$ is the discrete-time error-state transition matrix, $\mathbf{G}_k$ is the noise transition matrix, and $\mathbf{Q}$ contains all the noise covariances.
(9) 
Then, the residual function between the IMU preintegration and the states is obtained as:
(10) 
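The preintegration loop can be sketched for a much simplified planar case (2-D position plus yaw); the names `alpha`, `beta`, `gamma` for the position, velocity, and rotation increments follow common preintegration conventions and are assumptions here, as are the planar reduction and the omission of gravity handling:

```python
import math

def preintegrate(samples, dt, ba=(0.0, 0.0), bg=0.0):
    """samples: list of (ax, ay, omega) body-frame IMU readings, spaced dt apart."""
    alpha = [0.0, 0.0]   # position increment, expressed in the frame of image i
    beta = [0.0, 0.0]    # velocity increment
    gamma = 0.0          # yaw increment
    for ax, ay, omega in samples:
        c, s = math.cos(gamma), math.sin(gamma)
        # rotate the bias-corrected acceleration into the frame of image i
        awx = c * (ax - ba[0]) - s * (ay - ba[1])
        awy = s * (ax - ba[0]) + c * (ay - ba[1])
        alpha[0] += beta[0] * dt + 0.5 * awx * dt * dt
        alpha[1] += beta[1] * dt + 0.5 * awy * dt * dt
        beta[0] += awx * dt
        beta[1] += awy * dt
        gamma += (omega - bg) * dt
    return alpha, beta, gamma

# One second of constant 1 m/s^2 forward acceleration at 100 Hz:
alpha, beta, gamma = preintegrate([(1.0, 0.0, 0.0)] * 100, 0.01)
```

The key property motivating preintegration survives even in this sketch: the increments depend only on the IMU readings and bias estimates, not on the global state, so they need not be recomputed when the window poses are re-linearized.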
4.2 Visual Measurement
Once the IMU integration is complete, we put the images into two categories. A key frame maintains a map (or point cloud) of its environment and works as a reference to track the subsequent, regular frames. A new image frame is categorized as a key frame when its overlap with the current key frame is less than a threshold or the estimated distance between the two frames is over a predefined value. With one monocular camera, inverse depth estimation is adopted for point tracking. The depth map for a new key frame is initialized from the estimation of DSO.
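The keyframe decision described above reduces to a simple predicate; the function name and threshold values below are placeholders for illustration, not the paper's tuned parameters:

```python
def is_new_keyframe(overlap_ratio, distance, min_overlap=0.6, max_distance=0.3):
    """A frame becomes a keyframe on low overlap with the current keyframe
    or on large estimated motion since it (an OR of the two criteria)."""
    return overlap_ratio < min_overlap or distance > max_distance
```

Either condition alone suffices: a slowly rotating camera loses overlap without traveling far, while a fast translating camera exceeds the distance threshold while overlap is still high.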
When the system finishes processing a new key frame, subsequent regular frames are tracked using the latest key frame as a reference. We iteratively minimize the sum of the intensity differences over all the pixels in the frame to get the relative transformation from the key frame to the current frame as:
(12) 
(13) 
where $I(\mathbf{u})$ denotes the intensity of the pixel at position $\mathbf{u}$ and $d$ is the depth of the pixel. $\pi(\cdot)$ is the function that projects a 3-dimensional point onto the image plane, while $\pi^{-1}(\cdot)$ is its inverse projection function.
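A much simplified sketch of this photometric cost, using a hypothetical 1-D pinhole camera, a translation-only pose, and nearest-neighbor lookup instead of bilinear interpolation (all parameter values are illustrative):

```python
def photometric_error(I_ref, depth, I_cur, tx, tz, fx=50.0, cx=4.0):
    """Sum of squared intensity differences for a toy 1-D pinhole camera."""
    err = 0.0
    for u, i_ref in enumerate(I_ref):
        d = depth[u]
        x = (u - cx) / fx * d          # inverse projection pi^{-1}(u, d)
        z = d
        xw, zw = x + tx, z + tz        # candidate relative pose (translation only)
        u_new = fx * xw / zw + cx      # projection pi back onto the current image
        u_int = int(round(u_new))      # nearest-neighbor instead of interpolation
        if 0 <= u_int < len(I_cur):
            err += (I_cur[u_int] - i_ref) ** 2
    return err

I = [float(v) for v in range(9)]
depth = [1.0] * 9
```

The tracker in the paper minimizes this kind of cost over the full SE(3) pose by the Gauss-Newton scheme of Sect. 4; here the pose is only a translation so the warping step stays readable.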
After obtaining the optimal visual measurement, we can compute the residual function based on the current states and the dense tracking result in the reference frame's coordinate system:
(14) 
Then the Jacobian matrix of the dense tracking residual with respect to the 15 states is a sparse matrix:
(16) 
5 Dense mapping
Rapid camera motions require high-frequency map updates for the camera to be tracked successfully. Following [4], our system maintains a semi-dense map of high-gradient pixels that is quickly updated and serves for camera tracking.
Point candidates are tracked in subsequent frames using a discrete search along the epipolar line, minimizing the photometric error. From the best match, we compute the depth and associated variance, which is used to constrain the search interval for the subsequent frame. This tracking strategy is inspired by LSD-SLAM. Note that the computed depth only serves as initialization once the point is activated.
(17) 
(18) 
where $I(\mathbf{u})$ denotes the intensity of the pixel at position $\mathbf{u}$ and $d$ is the depth of the pixel. $\pi(\cdot)$ is the function that projects a 3-dimensional point onto the image plane, while $\pi^{-1}(\cdot)$ is its inverse projection function.
When a new keyframe is established, all active points are projected into it and slightly dilated, creating a semi-dense depth map. The selected points, with initialized inverse depth, are then added to the optimization: we choose a number of candidate points (from across all keyframes in the fixed window of [4]) and add them to the optimization.
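The discrete epipolar search can be sketched as follows. The `project` callback, the step count, and the toy 1-D image are assumptions for illustration, and the proper variance update of an LSD-SLAM-style depth filter is omitted (the search-interval bounds `d_lo`, `d_hi` stand in for it):

```python
def epipolar_search(i_ref, I_cur, u_ref, d_lo, d_hi, project, steps=50):
    """Try depth hypotheses in [d_lo, d_hi]; keep the one with least
    photometric error. project(u_ref, d) parameterizes the epipolar line."""
    best_d, best_err = None, float("inf")
    for k in range(steps + 1):
        d = d_lo + (d_hi - d_lo) * k / steps
        u = project(u_ref, d)            # point on the epipolar line for depth d
        u_int = int(round(u))
        if 0 <= u_int < len(I_cur):
            err = (I_cur[u_int] - i_ref) ** 2
            if err < best_err:
                best_d, best_err = d, err
    return best_d, best_err
```

Each depth hypothesis maps to one pixel along the epipolar line, so the 1-D search over depth and the 1-D search along the line are the same thing; tightening the interval from the previous variance is what keeps the search cheap frame to frame.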
The pipeline of the proposed system is illustrated in Figure 3. Between two consecutive images, IMU data from the sensor is first preintegrated. In the frontend, the dense tracking thread obtains the incremental camera motion using a direct keyframe-to-frame dense tracking algorithm, assisted by the IMU preintegration. Based on the instantaneous tracking performance, it also determines whether to add the frame as a keyframe or a regular frame, or to report tracking failure. If a keyframe is added, a depth map is generated from DSO. The backend periodically checks the frame list buffer. The new frame and IMU measurements are added to the optimization thread. If a keyframe is added, graph optimization is then applied to find the maximum a posteriori estimate of all the states within the sliding window using the connections from the IMU preintegration, the dense tracking, and the prior. A two-way marginalization scheme that selectively removes states is performed in order to both bound the computational complexity and maximize the information stored within the sliding window.
6 Robust method
6.1 Robust Norm
Image processing commonly suffers from outliers because of image noise and ambiguity. The $L_2$ norm is sensitive to outliers: a very small number of outliers can drive the solution away. Other researchers choose the Huber function [6], which gives small residuals quadratic and large residuals only linear influence. In contrast to the Tukey function, it is convex and therefore does not introduce new local minima into the optimization. The Huber function is defined as:
$$H_{\delta}(e) = \begin{cases} \frac{1}{2} e^2, & |e| \le \delta \\ \delta \left( |e| - \frac{1}{2}\delta \right), & |e| > \delta \end{cases} \qquad (19)$$
where $e$ is the error and $\delta$ is a given threshold.
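A direct transcription of the Huber function, with `delta` as the threshold:

```python
def huber(e, delta):
    """Huber cost: quadratic for |e| <= delta, linear growth beyond."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)
```

The linear branch is shifted by `0.5 * delta` so that the two pieces meet continuously at `|e| == delta`, which is what keeps the cost convex and the optimization free of artificial local minima.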
6.2 Keyframe and Point Management
The frontend of visual odometry needs to determine the active frame/point set, provide initialization for new parameters, and decide when a frame/point should be marginalized [4], so as to make the system computationally efficient and accurate. Our keyframe and point management is largely based on DSO [4], except that we add two more consecutive regular frames into the sliding window [19]. These two regular frames are connected by the IMU preintegration.
6.3 Twoway Marginalization
During graph optimization, the state set grows with every new frame added, requiring more memory and computational resources. In order to reduce the computational complexity, we marginalize states based on a two-way marginalization scheme [19, 15, 11] to maintain a sliding window of states and convert the measurements corresponding to the marginalized states into a prior. With two-way marginalization, all information of the removed states is kept and the computational complexity is bounded, which is fundamentally different from traditional keyframe-based approaches that simply drop non-keyframes. Front marginalization removes the second-newest frame, ensuring that the time interval of each IMU preintegration is bounded in order to bound the accumulated error, while back marginalization removes the oldest keyframes within the sliding window to improve the effectiveness of the multi-constrained factor graph optimization, since the dense alignment and the IMU preintegration depend on whether an older state is kept within the sliding window. In this situation, the oldest keyframes provide very little information.
Intuitively, two-way marginalization removes the second-newest state if the dense tracking is reliable and the second-newest state is near the current keyframe; otherwise, the oldest state is marginalized. The distance is thresholded by a weighted combination of the translation and rotation between the latest keyframe and the second-newest frame.
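The two-way choice reduces to a small decision rule; the weights and threshold below are placeholders, not the paper's tuned values:

```python
def choose_marginalization(tracking_good, translation, rotation,
                           w_t=1.0, w_r=0.5, dist_thresh=0.4):
    """Decide which state to marginalize out of the sliding window."""
    dist = w_t * translation + w_r * rotation  # weighted translation/rotation
    if tracking_good and dist < dist_thresh:
        return "second_newest"   # front marginalization: bound IMU integration time
    return "oldest"              # back marginalization: drop stale keyframes
```

Good tracking near the keyframe means the second-newest frame adds little beyond the keyframe it duplicates, so dropping it keeps preintegration intervals short; poor tracking or large motion means the recent frames carry the information, so the oldest keyframe goes instead.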
The prior matrix, which contains the information from the marginalized states, is:
(20) 
where $\mathcal{Z}^{-}_{\mathcal{I}}$ and $\mathcal{Z}^{-}_{\mathcal{C}}$ are the sets of removed IMU and visual measurements, respectively. The prior is then marginalized via the Schur complement [20].
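Marginalization via the Schur complement can be illustrated with scalar blocks: eliminating state $b$ from the linear system $\mathbf{H}[a; b] = \mathbf{g}$ folds its information into a prior on $a$. The scalar reduction and the function name are illustrative; in the real system the blocks are matrices over the window states:

```python
def marginalize_scalar(Haa, Hab, Hbb, ga, gb):
    """Eliminate b from [[Haa, Hab], [Hab, Hbb]] [a, b]^T = [ga, gb]^T;
    return the resulting prior (H', g') acting on a alone."""
    w = Hab / Hbb                       # Hab * Hbb^{-1}
    return Haa - w * Hab, ga - w * gb   # Schur complement and reduced rhs

# Example: the reduced 1x1 system gives the same solution for a
# as solving the full 2x2 system [[4, 1], [1, 3]] [a, b]^T = [1, 2]^T.
Hp, gp = marginalize_scalar(4.0, 1.0, 3.0, 1.0, 2.0)
```

This is the sense in which "all information of the removed states is kept": the solution for the remaining states is unchanged, only the eliminated variable is no longer explicitly represented.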
7 Experiment
In the literature, many motion tracking algorithms have been proposed: visual-only odometry, loosely-coupled fusion, and tightly-coupled fusion. How they perform in relation to each other, however, is often unclear, since results are typically shown on individual datasets with different motion and lighting characteristics. In order to make a strong argument for the presented work, we thus compare it to state-of-the-art odometry methods.
7.1 Experiment Setup
We evaluated our system on two different sequences of the publicly available and popular European Robotics Challenge (EuRoC) dataset [1], which contains 11 visual-inertial sequences recorded in 3 different indoor environments. The datasets are increasingly difficult to process in terms of flight dynamics and lighting conditions. The two sequences that we use are V101 and MH02, which can be successfully handled and compared with other methods, and which are representative sequences of the dataset. The entire algorithm is developed in C++ using ROS.
7.2 Evaluation on EuRoC dataset
The EuRoC dataset provides accurate ground truth data, against which we compared the performance of our system. Since the sequences are recorded in different coordinate systems, we transform the results into the coordinate system of the ground truth. Figure 8 and Figure 5 show the comparison for position.
In addition, the 3-dimensional reconstructions of the surroundings obtained from the monocular configuration for this dataset are shown in Figure 1. It is evident that the reconstruction from the stereo setup is smoother and the contour of the structure in the map is clearer.
To evaluate the performance of the proposed system, two different methods dealing with the odometry problem are used for comparison. OKVIS [11] implements visual-inertial odometry in a tightly-coupled fusion method that achieves state-of-the-art performance, and ORB SLAM [16] uses ORB features for tracking and mapping.
Figure 5 shows the comparison of trajectory estimates on the sequence, which is characterized by fast motion and image blur. ORB SLAM runs as visual odometry without IMU measurements to provide scale information, which results in scale drift in all three directions, while OKVIS performs best and stays closest to the ground truth, since it fuses the IMU measurements with the tracking results and takes the depth of features into the optimization formulation. Figure 6 shows the estimated biases of the IMU. In our method, we perform a semi-dense reconstruction and do not take the estimated feature depth into the optimization formulation, and due to the depth estimation error from DSO, the visual-inertial fusion for motion tracking can show reduced performance. Nevertheless, ORB SLAM fails to track the two sequences with up-to-scale accuracy, while the other approaches track these sequences successfully. Similar trajectory estimation results on the sequence are shown in Figure 8.
For the semi-dense mapping, Figure 7 shows the reconstruction of a machine house, a dark indoor scene. Figure 4 demonstrates the reconstruction of the environment in a small room.
As demonstrated by the challenging experiments, the proposed system is robust and able to handle fast motion and different lighting conditions, but it is still subject to scale drift. In all cases, the inertial measurements of the IMU provide constraints that largely eliminate the drift.
8 Conclusion
We propose a direct visual-inertial odometry and mapping system that tracks the camera motion and reconstructs the environment into a 3-dimensional model. Our method is based on a novel integration of dense image tracking and optimization of the tracking error with the IMU measurements. Experimental results on a popular dataset demonstrate the performance of our method and its potential for application in unmanned vehicles.
References
 [1] (2016) The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163. Cited by: §7.1.
 [2] (2016) Visual-inertial direct SLAM. In Proceedings of IEEE International Conference on Robotics and Automation, pp. 1331–1338. Cited by: §0.A.1, §2, §4.1.
 [3] (2015) LSD-SLAM: large-scale direct monocular SLAM. In Proceedings of European Conference on Computer Vision, pp. 834–849. Cited by: §2.
 [4] (2016) Direct sparse odometry. arXiv preprint arXiv:1607.02565. Cited by: §1, §2, §2, §5, §5, §6.2.
 [5] (2014) LSD-SLAM: large-scale direct monocular SLAM. In European Conference on Computer Vision, pp. 834–849. Cited by: §1.
 [6] (2013) Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1449–1456. Cited by: §6.1.
 [7] (2015) IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. In Robotics: Science and Systems. Cited by: §4.
 [8] (2016) On-manifold preintegration for real-time visual-inertial odometry. IEEE Transactions on Robotics PP (99), pp. 1–21. Cited by: §0.A.1.
 [9] (2014) SVO: fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22. Cited by: §2.
 [10] (2016) Robust camera motion estimation using direct edge alignment and sub-gradient method. In IEEE International Conference on Robotics and Automation (ICRA 2016), Stockholm, Sweden. Cited by: §2.
 [11] (2015) Keyframe-based visual-inertial odometry using nonlinear optimization. Vol. 34, pp. 314–334. Cited by: §1, §2, §6.3, §7.2.
 [12] (2012) Improving the accuracy of EKF-based visual-inertial odometry. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 828–835. Cited by: §2.
 [13] (2013) High-precision, consistent EKF-based visual-inertial odometry. International Journal of Robotics Research 32 (6), pp. 690–711. Cited by: §2.
 [14] (2015) Dense visual-inertial odometry for tracking of aggressive motions. In Proceedings of IEEE International Conference on Robotics and Biomimetics. Cited by: §1.
 [15] (2016) Aggressive quadrotor flight using dense visual-inertial fusion. In Proceedings of IEEE International Conference on Robotics and Automation. Cited by: §1, §2, §4.1, §4, §6.3.
 [16] (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §7.2.
 [17] (2011) DTAM: dense tracking and mapping in real-time. In Proceedings of IEEE International Conference on Computer Vision. Cited by: §1, §2.
 [18] (2014) Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in GPS-denied environments. IEEE Robotics & Automation Magazine 21 (3), pp. 26–40. Cited by: §2.
 [19] (2015) Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 5303–5310. Cited by: §6.2, §6.3.
 [20] (2010) Sliding window filter with application to planetary landing. Journal of Field Robotics 27 (5), pp. 587–608. Cited by: §2, §6.3.
 [21] (2017) CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. arXiv preprint arXiv:1704.03489. Cited by: §2.
 [22] (2016) Direct visual-inertial odometry with stereo cameras. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 1885–1892. Cited by: §2.
 [23] (2012) Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 957–964. Cited by: §2.
 [24] (2016) Edge enhanced direct visual odometry. In The British Machine Vision Association and Society for Pattern Recognition. Cited by: §2.
 [25] (2016) Direct visual-inertial odometry and mapping for unmanned vehicle. In International Symposium on Visual Computing, pp. 595–604. Cited by: §4.1.
 [26] (2019) Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognition 93, pp. 570–580. Cited by: §2.
 [27] (2019) Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia. Cited by: §2.
 [28] (2019) Stacked Wasserstein autoencoder. Neurocomputing 363, pp. 195–204. Cited by: §2.
 [29] (2017) A fully end-to-end deep learning approach for real-time simultaneous 3D reconstruction and material recognition. arXiv preprint arXiv:1703.04699. Cited by: §2.
 [30] (2014) A smart high accuracy silicon piezoresistive pressure sensor temperature compensation system. Sensors 14 (7), pp. 12174–12190. Cited by: §2.
Appendix
Appendix 0.A Jacobians of the IMU residual with respect to the IMU parameters
0.A.1 Background
We take the preintegrated measurements as:
0.A.2 Jacobians
To calculate the Jacobian of the rotation error with respect to the angular velocity bias, we apply the three equations in 0.A.1 to move all the increment terms to the right side.
Similarly, for the other Jacobians, we apply the equations in 0.A.1 to move all the increment terms to the right side and obtain: