Direct Visual-Inertial Odometry with Semi-Dense Mapping
The paper presents a direct visual-inertial odometry system. In particular, a tightly coupled nonlinear optimization based method is proposed by integrating the recent advances in direct dense tracking and Inertial Measurement Unit (IMU) pre-integration, and a factor graph optimization is adapted to estimate the pose of the camera and rebuild a semi-dense map. Two sliding windows are maintained in the proposed approach. The first one, based on Direct Sparse Odometry (DSO), is to estimate the depths of candidate points for mapping and dense visual tracking. In the second one, measurements from the IMU pre-integration and dense visual tracking are fused probabilistically using a tightly-coupled, optimization-based sensor fusion framework. As a result, the IMU pre-integration provides additional constraints to suppress the scale drift induced by the visual odometry. Evaluations on real-world benchmark datasets show that the proposed method achieves competitive results in indoor scenes.
In the research area of robotics, camera motion estimation and 3D reconstruction take fundamental places both for navigation and perception, such as unmanned aerial vehicle (UAV) navigation  and indoor reconstruction [5, 4]. Among these applications, an up-to-scale camera motion tracking and 3D structure of the environment are required at the end. Most existing methods formulate this problem as simultaneously localization and mapping (SLAM), which characterized on the sensors it used. Recent efforts include visual SLAM and visual inertial navigation system (VINS).
Visual odometry  estimates the depth of features, based on which, track the pose of the camera. In contrast, direct visual odometry without the feature processing pipeline eliminates some of the issues in feature based methods through pose tracking directly based on pixels. However, even these methods are subject to failure during aggressive motions as images can be severely blurred and extremely sensitive to noise, changing illumination, and fast motion. Consequently, aggressive motion of the UAV [14, 15] with significant large angular velocities and linear accelerations, and lighting condition variation make the state estimation subject to scale drift in long term.
While IMU generate noisy but outlier-free measurements, making them great for tracking of short-term fast motions. However, low-cost MEMS IMUs suffer significant drift in the long term. IMU measurements can improve visual odometry to remedy this issue by providing a short term measurement, making them an ideal combination to better estimate the pose of the camera and generate information on the surroundings. We believe that integrating the complementary natures of the dense visual tracking and IMU measurements opens up the possibility of reliable tracking of aggressive motions.
In this paper, we propose a tightly coupled function that integrates visual and inertial terms in a fully probabilistic manner. We adopt the concept of keyframes due to its successful application in classical vision-only approaches: it is implemented using partial linearization and marginalization. The keyframe paradigm also accounts for drift-free estimation when slow or no motion is present. Instead of using an optimization window of time-successive poses, the selected keyframes may be spaced arbitrarily far in time, keeping visual constraints, while still incorporating an IMU term. We provide a strictly probabilistic derivation of the IMU error terms and the respective information matrix, relating successive image frames without explicitly introducing the states at the IMU rate.
The key contribution of this work is a robust and fully integrated system for direct visual inertial odometry. The novelties of the proposed system include: 1) The combination of the direct photometric information and the edge features, which are points with sufficiently high image gradient magnitude. We work with pixels’ intensities, which provides good invariance to changes in viewpoint and illumination. This makes the system more robust and reliable than other methods dealing with detected features. 2) IMU pre-integration. The IMU pre-integration provides scale information by integrating IMU measurements. Thanks to the use of a factor graph, tracking and mapping are focused in a local covisible area, independent of global map size. 3) tightly coupled optimization. The measurements from the IMU pre-integration and dense visual tracking are fused probabilistically using a tightly-coupled, optimization-based sensor fusion framework. In this paper, the dense visual tracking results provide the visual constraints between current frame and reference frame. While the IMU pre-integration provides constraints between two consecutive frames.
In the remainder of the paper, we begin with a review of the related work. In Sect. 4 An overview of the system is described and the details of IMU and visual measurement are are then presented. Dense mapping is introduced in Sect. 5. Sect. 6 shows implement details. The results of experiment are discussed in Sect. 7. Finally, Sect. 8 draws conclusions.
2 Related work
Simultaneously localization and mapping has a long history in a monocular scenario, prominent examples include DTAM , SVO  and LSD-SLAM , which works on sparse features and estimate the camera motion through a prediction and correction fashion. Direct methods operating directly on image intensities have become popular since it is robust and computational efficient without feature detection. It is demonstrated that direct methods are suitable for dense mapping than feature based method, and when enhanced by edge alignment  , it can deal with changing illumination and fast motion. More recently, direct sparse odometry (DSO)  presented an impressive semi-dense 3D reconstruction of the environment through a sliding window optimization method. This direct method minimizes the photometric error of points with sufficiently high image gradient magnitude (edges) using a non-linear optimization framework.
More recently, Deep neural network demonstrate advanced performance [liu2019approximate]. People pay great attention to the 3D SLAM, like the semantic SLAM  and the deep learning enhanced SLAM . Semantic SLam  focuses on simultaneous 3D reconstruction and material recognition and segmentation, which ends up with a real-time end-to-end system. Deep learning is promising in computer vision [30, 28, 26, 27].  takes the advantage of depth prediction from Convolutional Neural Networks (CNNs) to enhance the performance of monocular slam. DSO  takes one monocular camera for 3D reconstruction within inverse depth estimation framework. It is based on photometric error optimization of windowed sparse bundle adjustment.
All of these methods try to explore the scene within the reconstructed 3D map, while one problem they suffer is the scale drift. There are two main approaches towards solving the scale drift problem with extra measurements: the batch nonlinear optimization methods and the recursive filtering methods. The former one jointly minimizes the error originated from the integrated IMU measurements and the (reprojection) errors from the visual terms ; while the recursive algorithms usually utilize the IMU measurements for state propagation while the updates come from the visual observations [20, 13]. At the back-end, fusion methods can largely be divided into two classes: loosely-coupled fusion [23, 18] and tightly-coupled fusion [2, 22, 15, 11]. In loosely-coupled fusion, visual measurements are first processed independently to obtain high-level pose information and then fused with the inertial measurements, usually using a filtering framework [13, 12]. The two sub-problems are solved separately in a loosely-coupled fusion, resulting in a lower computational cost, however, the results are suboptimal. In tightly-coupled fusion, both the visual and the inertial measurements are fused and optimized in a single framework. It considers the coupling between the two types of measurement and allows the adoption of a graph optimization-based framework with iterative re-linearization to achieve better performance. Tightly-coupled methods usually come with a higher computational cost.
3 Problem Fromulation
We use the following notation in the paper:
: the IMU measurements between the -th and -th images;
: the dense tracking result for the current image with respect to the corresponding reference image (a key frame);
: the residual between the IMU integration and state ;
: the dense tracking residual between and state ;
: the associated covariances of the IMU measurement;
: the associated covariances of the image alignment;
: the -th state in the global coordinate system;
: the -th position states in the global coordinate system;
: the -th velocity states in the global coordinate system;
: the -th angular states in the global coordinate system;
: the -th acceleration bias states in the global coordinate system;
: the -th angular acceleration bias states in the global coordinate system;
Our system maintains several states during the processing. A state includes the position, velocity, orientation, accelerometer bias, and gyroscope bias. The full state space is defined as:
4 Visual-Inertial Odometry
In this section, we present our approach of incorporating inertial measurements into batch visual SLAM. In visual odometry and SLAM, a nonlinear optimization is formulated to find the camera poses and landmark positions by minimizing the reprojection error of landmarks observed in camera frames. Unlike in work , which use stereo camera for depth estimation, we adopt the DSO to build a semi-dense map and take the depth estimation for camera motion tracking. Figure 2 shows the factor graph, in which the visual inertial fusion takes the points in local map as landmarks which provides one kind of geometric constraints to related observation states. The local map points are a collection of points with depth information in latest keyframes.
We seek to formulate the visual-inertial localization and mapping problem as one joint optimization of a cost function containing both the (weighted) reprojection errors and the (weighted) IMU error .
From the probilistic view , a factor graph encodes the posterior probability of the variables, given the available measurements :
We assume all the distributions to be Gaussian. The MAP estimate corresponds to the minimum of the negative log-posterior. The negative log-posterior can be written as a sum of squared residual errors:
Combining the results from the IMU integration (Sect. 4.1) and the direct image alignment (Sect. 4.2), and ignoring the prior term, we optimize Eq. 1 iteratively using the Gaussian-Newton method. To solve this type of minimization problem with a cost function:
We introduce the Jacobian and add the weights, which are the inverse of , to the equation before we vectorize the cost function as:
Then the solution of the minimization is obtained as:
4.1 IMU Measurement
Usually, the frequency of the IMU is higher than that of the camera. There are tens of IMU measurement between the two consecutively image interval. The IMU pre-integration between two images provides a prior for image alignment, and also works as a different kind of constraint within the factor graph. The pre-integration is given by:
where is the gravity, and is the operator for skew-symmetric matrix. and are the IMU measurements, and are white noise affecting the biases, and . The biases are initially set to zeros and the optimized values computed at each subsequent step are used for the next pre-integration. Through the IMU propagation , we can get the covariance:
where is the discrete-time error state transition matrix, is the noise transition matrix, and contains all the noise covariance.
Then, the residual function between the IMU pre-integration and the states is obtained as:
4.2 Visual Measurement
Once the IMU integration is complete, we put the images into two categories. A key frame maintains a map (or point clouds) of its environment and works as a reference to track the subsequent, regular frames. A new image frame is categorized as a key frame when it overlaps with the current key frame less than a threshold or the estimated distance between the two frames is over a predefined value. With one monocular camera, the inverse depth estimation is adapted for point tracking. the depth map for the new key frame is initialized from the estimation of DSO.
When the system finishes processing a new key frame, subsequent regular frames are tracked based on the lastest key frame as a reference. We iteratively minimize the sum of the intensity differences for all the pixels in the frame to get the relative transformation from the key frame to the current frame as:
where denotes the intensity of the pixel in position and is the depth of the pixel. is the function that project the 3-dimensional point onto the image plane, while is its inverse projection function.
After obtaining the optimal visual measurement , we can compute the residual function based on the current states and the dense tracking result in the reference frame’s coordinate:
Then the Jacobian matrix of the dense tracking residual with respect to the 15 states is a sparse matrix:
5 Dense mapping
Rapid camera motions require high-frequency map updates for the camera to be tracked successfully. According to , our system maintains a semi-dense map of high gradient pixels that is quickly updated and serves for camera tracking.
Point candidates are tracked in subsequent frames using a discrete search along the epipolar line, minimizing the photometric error. From the best match, we compute the depth and associated variance, which is used to constrain the search interval for the subsequent frame. This tracking strategy is inspired by LSD-SLAM. Note that the computed depth only serves as initialization once the point is activated.
where denotes the intensity of the pixel in position and is the depth of the pixel. is the function that project the 3-dimensional point onto the image plane, while is its inverse projection function.
When a new keyframe is established, all active points are projected into it and slightly dilated, creating a semi-dense depth map. The selected points with initialied reverse depth are added to the optimization, we choose a number of candidate points (from across all keyframes in the fixed window of the ) and add them into the optimization.
The pipeline of the proposed system is illustrated in Figure 3. Between the time interval of two consecutive images, IMU data from the sensor is first pre-integrated. In the front-end, the dense tracking thread obtains incremental camera motion using a direct keyframe-to-frame dense tracking algorithm, assisted by IMU pre-integration. Based on the instant tracking performance, it also determines whether to add the frame as a keyframe or regular frame, or report tracking failure. If a keyframe is added, a depth map will be generated from the DSO. The back-end periodically checks the frame list buffer. The new frame and IMU measurements are added to the optimization thread. If a keyframe is added, Graph optimization is then applied to find the maximum a posteriori estimate of all the states within the sliding window using connections from IMU pre-integration, dense tracking and the prior. A two-way marginalization scheme that selectively removes states is performed in order to both bound the computational complexity and maximize the information stored within the sliding window.
6 Robust method
6.1 Robust Norm
Image processing suffers outliers commonly because of image noise and ambiguity. norm is sensitive to outliers. Very small number of outliers will drive the solution away. Other researchers choose the Huber function . It gives small residuals quadratic and large residuals only linear influence. In contrast to the Tukey function, it is convex and therefore does not introduce new local minima to the optimization. The Huber function is defined as:
where is the error and is a given threshold.
6.2 Keyframe and Point Management
The front-end of visual odomerty needs to determine the active frame / point set, provide initialization for new parameters and decide when a frame / point should be marginalized  so as to make the system computationally efficient and accurate. Our keyframe and point management is largely based on the DSO  except for that we add two more continuous regular frames into the sliding window . This two regular frames are connected by the IMU pre-integration.
6.3 Two-way Marginalization
During the graph optimization, the states set increases with any new frame added so as to require more memory and computational resources. In order to reduce the computational complexity, we need to marginalize states based on a two-way marginalization scheme [19, 15, 11] to maintain a sliding window of states and convert measurements corresponding to the marginalized states into a prior. By two-way marginalization, all information of the removed states is kept and the computation complexity is bounded, which is fundamentally different from the traditional keyframe-based approaches that simply drop non-keyframes. Front marginalization removes the second newest frame, ensuring that the time interval for each IMU preintegration is bounded in order to bound the accumulated error. While back marginalization removes the oldest keyframes within the sliding window to improve the effectiveness of multi-constrained factor graph optimization since the dense alignment and the IMU pre-integration depend on whether an older state is kept within the sliding window. In this situation, the oldest keyframes provide very few information.
Intuitively, the two-way marginalization will remove the second newest state if the frame tracking is reliable. The second newest state will be marginalized in the next round if the dense tracking is good and the second newest state is near to the current keyframe. Otherwise, the oldest state will be marginalized. The distance is thresholded by a weighted combination of translation and rotation between the latest keyframe and the second newest frame.
For the prior matrix , which contains the information from the marginalized states:
where and are the set of removed IMU and visual measurement, respectively. The prior is then marginalized via the Schur complement .
In the literature, a bunch of motion tracking algorithms has been suggested; It could be visual only odometry, loosely coupled fusion, or tightly coupled fusion. How they perform in relation to each other, however, is often unclear, since results are typically shown on individual datasets with different motion and lighting characteristics. In order to make a strong argument for our presented work, we will thus compare it to state-of-the art odometry methods.
7.1 Experiment Setup
We evaluated our system on two different sequences of the publically available popular dataset. The dataset is the the European Robotics Challenge (EuRoC) dataset , which contains 11 visual-inertial sequences recorded in 3 different indoor environment. The datasets are increasingly difficult to process in terms of flight dynamics and lighting conditions. The two sequences that we use are the V101 and the MH02, which could be successfully handled and compared with other methods, and they are representative sequences of the dataset. The entire algorithm is developed in C++ using ROS.
7.2 Evaluation on EuRoC dataset
The European provides the ground truth data at accuracy. We compared the performance of our system to the ground truth. Since the sequences are recorded in different coordinate systems, we transform the results to the coordinate system of the ground truth. Figure 8 adn Figure 5 shows the comparison for position.
In addition, the 3-dimensional reconstruction of the surroundings obtained from monocular configurations for this dataset are shown in Figure 1. It is evident that the reconstruction from the stereo setup is smoother and the contour of the structure in the map is clearer.
To evaluate the efficiency of the proposed system, two different methods dealing with the odometry problem are used for comparison. The OKVIS implements the visual-inertial odometry in a tightly coupled fusion method that achieves the state-of-the-art performance, and ORB SLAM  uses the ORB features for tracking and mapping.
Figure 5 shows the comparison of trajectory estimations tested on the sequence. The dataset is characterized by fast motion and image blur. The ORB SLAM runs for visual odometry without IMU measurements to provide up-to-scale information. This results into scale drift for three directions. While OKVIS preforms the best to be the closest to the ground truth since it fuses the IMU measurements with the dense tracking results and takes the depth of features into the optimization formulation. Figure 6 is the estimated biases for the IMU. In our method, We perform a semi-dense reconstruction and do not take the estimated feature depth into the optimization formulation. And due to the depth estimation error from the DSO, the visual inertial fusion for motion tracking could lead to a reduced performance. Nevertheless, ORB SLAM fails to track in the two sequences with up-to-scale accuracy. The other approaches are able to track these sequences successfully. The similar trajectory estimation results on sequence are showed in Figure 8.
As demonstrated by the challenging experiments, the proposed system is robust and able to handle fast motion and different lighting condition, but it is still subject to scale drift. In all case, the inertial measurement of the IMU provides constraints to largely eliminate the drifts.
We propose a direct visual-inertial odometry and mapping that tracks the camera motion and reconstructs the environment into a 3-dimensional model. Our method is based on a novel integration of dense image tracking and optimization of tracking error with the IMU measurements. Experimental results on the popular dataset demonstrated the performance of our method and the potential for application in unmanned vehicle.
-  (2016) The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163. Cited by: §7.1.
-  (2016) Visual-inertial direct SLAM. In Proceedings of IEEE International Conference on Robotics and Automation, pp. 1331–1338. Cited by: §0.A.1, §2, §4.1.
-  (2015) LSD-SLAM: large-scale direct monocular SLAM.. In Proceedings of European Conference on Computer Vision, pp. 834–849. Cited by: §2.
-  (2016) Direct sparse odometry. arXiv preprint arXiv:1607.02565. Cited by: §1, §2, §2, §5, §5, §6.2.
-  (2014) LSD-slam: large-scale direct monocular slam. In European Conference on Computer Vision, pp. 834–849. Cited by: §1.
-  (2013) Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE international conference on computer vision, pp. 1449–1456. Cited by: §6.1.
-  (2015) IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. In Robotics: Science and Systems, Cited by: §4.
-  (2016) On-manifold preintegration for real-time visual–inertial odometry. IEEE Transactions on Robotics PP (99), pp. 1–21. Cited by: §0.A.1.
-  (2014) SVO: fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22. Cited by: §2.
-  (2016) Robust camera motion estimation using direct edge alignment and sub-gradient method. In IEEE International Conference on Robotics and Automation (ICRA-2016), Stockholm, Sweden, Cited by: §2.
-  (2015) Keyframe-based visual–inertial odometry using nonlinear optimization.. Vol. 34, pp. 314–334. Cited by: §1, §2, §6.3, §7.2.
-  (2012) Improving the accuracy of ekf-based visual-inertial odometry. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 828–835. Cited by: §2.
-  (2013) High-precision, consistent EKF-based visual–inertial odometry.. International Journal of Robotics Research 32 (6), pp. 690–711. Cited by: §2.
-  (2015) Dense visual-inertial odometry for tracking of aggressive motions.. In Proceedings of IEEE International Conference on Robotics and Biomimetics, Cited by: §1.
-  (2016) Aggressive quadrotor flight using dense visual-inertial fusion.. In Proceedings of IEEE International Conference on Robotics and Automation, Cited by: §1, §2, §4.1, §4, §6.3.
-  (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §7.2.
-  (2011) Dtam: dense tracking and mapping in real-time.. In Proceedings of IEEE International Conference on Computer Vision, Cited by: §1, §2.
-  (2014) Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in gps-denied environments. IEEE Robotics & Automation Magazine 21 (3), pp. 26–40. Cited by: §2.
-  (2015) Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft mavs. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 5303–5310. Cited by: §6.2, §6.3.
-  (2010) Sliding window filter with application to planetary landing.. Journal of Field Robotics 27 (5), pp. 587–608. Cited by: §2, §6.3.
-  (2017) CNN-slam: real-time dense monocular slam with learned depth prediction. arXiv preprint arXiv:1704.03489. Cited by: §2.
-  (2016) Direct visual-inertial odometry with stereo cameras. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 1885–1892. Cited by: §2.
-  (2012) Real-time onboard visual-inertial state estimation and self-calibration of mavs in unknown environments. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 957–964. Cited by: §2.
-  (2016) Edge enhanced direct visual odometry.. In The British Machine Vision Association and Society for Pattern Recognition, Cited by: §2.
-  (2016) Direct visual-inertial odometry and mapping for unmanned vehicle. In International Symposium on Visual Computing, pp. 595–604. Cited by: §4.1.
-  (2019) Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognition 93, pp. 570 – 580. Cited by: §2.
-  (2019) Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia. Cited by: §2.
-  (2019) Stacked wasserstein autoencoder. Neurocomputing 363, pp. 195 – 204. Cited by: §2.
-  (2017) A fully end-to-end deep learning approach for real-time simultaneous 3d reconstruction and material recognition. arXiv preprint arXiv:1703.04699. Cited by: §2.
-  (2014) A smart high accuracy silicon piezoresistive pressure sensor temperature compensation system. Sensors 14 (7), pp. 12174–12190. Cited by: §2.
Appendix 0.A Jacobians of the IMU residual with respect to the IMU parameters
We take the preintegrated measurements as:
To calculate the Jacobian of the rotation error with respect to the angular velocity bias , we apply the three equations in 0.A.1 to move all the increment terms to the right side
Similarly to the other Jacobians, we apply the equations in 0.A.1 to move all the increment terms to the right side and obtain: