RaDVIO: Rangefinderaided Downward VisualInertial Odometry
Abstract
Stateoftheart forward facing monocular visualinertial odometry algorithms are often brittle in practice, especially whilst dealing with initialisation and motion in directions that render the state unobservable. In such cases having a reliable complementary odometry algorithm enables robust and resilient flight. Using the common local planarity assumption, we present a fast, dense, and direct frametoframe visualinertial odometry algorithm for downward facing cameras that minimises a joint cost function involving a homography based photometric cost and an IMU regularisation term. Via extensive evaluation in a variety of scenarios we demonstrate superior performance than existing stateoftheart downward facing odometry algorithms for Micro Aerial Vehicles (MAVs).
I Supplementary Material
Representative video: https://youtu.be/gmFKWQCd1us
Ii Introduction and Related Work
Recent advances in optimisation based monocular visualinertial SLAM algorithms for MAVs have made great strides in being accurate and efficient [1]. However, in practice, these algorithms suffer from three main failure modalities  sensitivity to initialisation, undergoing motion that renders the state unobservable, and, to a lesser extent, inability to handle outliers within the optimisation. The first arises from the need for translation to accurately triangulate feature landmarks and being able to excite all axes of the accelerometer to determine scale. The second is a fundamental limit of the sensor characteristics, robot motion, and the environment, most often caused by motion in the camera direction and an absence of texture information. The third is often an artifact of sliding windows necessitated by the constraints imposed by limited compute on aerial platforms.
We believe that in order to have robust and resilient closed loop flight it is imperative to have complementary sources of odometry. Towards this, we present an algorithm that performs metric downward facing odometry which doesn’t depend on triangulation or initialisation, offers observability in an orthogonal direction to a conventional forward facing camera, and is purely a frametoframe method. This enables it to be a fast and reliable whilst still being accurate.
In this paper, we pursue the problem of estimating the linear and angular velocity and orientation of a micro aerial vehicle (MAV) equipped with a downward facing camera, an IMU, and a single beam laser rangefinder which measures the height of the vehicle relative to the ground.
Broadly, there have been two approaches to tackling visual odometry for downward facing cameras. The first involves exploiting epipolar geometry and using sophisticated structure from motion influenced loosely [2, 3] or tightly coupled visual inertial methods [4, 5]. The other approach makes a planar ground assumption. This enables simple optical flow based velocity estimation where the camera ego motion is compensated for using the angular rate data obtained from a gyroscope, and the velocity is then scaled to metric using some altitude sensor [6]. An issue with all such methods is that their performance is predicated on there being detectable motion between camera frames  the epipolar constraint  which is exacerbated for instance when the camera is nearly static in hovering conditions or moves vertically.
However, if making a planar ground assumption, we can exploit the homography constraint that does not have the aformentioned drawbacks. Implicit means of utilising homography constraints have been presented earlier in appearance based localisation, for instance in [7] where cameras are localised against previously acquired images. Most relevant to our approach, the authors in [8, 9, 10] first estimate the optical flow between features in consecutive frames and then using the aforementioned constraint exploit known angular velocity and ground plane orientation to obtain unscaled velocity. An extended Kalman filter (EKF) is then used to fuse the data and output metric velocity.
In our approach, instead of using sparse visual features, that are dependent on textured environments, we utilise a dense, direct method that makes use of all the visual information present in the camera image, and couple it with angular constraints provided by an IMU within a least squares optimisation.
Contributions of this work include:

A homography based frametoframe velocity estimation algorithm, that is accurate and robust in a wide variety of scenes.

An EKF structure to incorporate this with a single beam laser rangefinder signal and estimate IMU bias

Extensive evaluation on a wide variety of environments with comparisons with state of the art algorithms.
Iii Estimation Theory
In this section we present the homography constraint, our optimisation strategy, and the framework to incorporate the corresponding cost functions.
Iiia Homography Constraint and Parameterisation
When looking at points lying on the same plane, their projected pixel coordinates in two images ( and respectively) taken by a downward camera can be related by
(1) 
where
(2) 
where and are the pixel locations in previous and current image respectively, is the warp matrix, , are the rotation matrix and translation vector from the second camera frame to the previous frame, is the unscaled translation, , are the unit normal vector and distance to the ground plane in second camera frame, and is the camera intrinsic matrix (assume known).
During optimisation we parameterise as a Rodrigues vector and as [11]
(3)  
(4) 
Since the IMU provides reliable orientation information, i.e. an estimate of and , three parameterisations are possible: , and . The first two fix , and respectively. These parameterisations directly encode the motion of the camera and can easily be initialised using IMU.
According to implementation, the second is chosen since it provides the most accurate homography optimisation and tracking performance. The underlying assumption for fix is that the ground is horizontal and therefore, the normal vector depends only on the MAV’s orientation. The validity of this assumption will be evaluated in Sec. V.
IiiB Homography Estimation Cost Function
The parameters of the warp matrix are estimated by minimising the SSD error between image pixel intensities. However, a purely photometric cost minimisation may provide incorrect camera pose estimates due to a lack of observability or in the event of non planar objects in the camera field of view. Since the IMU provides reliable orientation information, we add a penalty term which biases the homography solution and avoids these local minima.
Suppose stands for the homography mapping parameterised by the vector , we have
(5) 
where is the initial guess obtained from IMU, is a diagonal penalty weight matrix, and are the previous and current image respectively, and is a pixel position in the evaluation region of current image.
IiiC GaussNewton Optimisation
We solve for the optimal parameters using iterative GaussNewton optimisation. After concatenating all the intensity values of pixel in a vector, the Taylor expansion is
(6) 
where and . The iterative update to the parameter vector ends up being
(7) 
where is the jacobian of the photometric residual term. Note that as an optimisation we only choose pixels with a high gradient magnitude similar to [12]. This significantly speeds up computation of the update with negligible loss in accuracy. The detailed timing performance is discussed in Sec. V.
Iv Visual Inertial Fusion
The optimisation in the previous section outputs an unscaled translation. Inspired by [9], we use an EKF to scale it to metric and additionally filter the frametoframe noise.
Iva Definition
In the following section the superscripts and subscripts and imply a quantity in frame of the camera and IMU, respectively. The state vector contains camera velocity in the camera frame , distance to the plane from the camera , and the linear acceleration in the IMU frame .
IvB Prediction
The derivative of can be modeled [9] as
(8) 
where is the rotation matrix from IMU frame to camera frame, and are the acceleration and gravity in the IMU frame, and are the raw linear acceleration and angular velocity measured by the IMU (subscript denotes raw measurement), and is the angular velocity in the camera frame. The subscript denotes the skew symmetric matrix of the vector inside the bracket.
Therefore, the prediction process in discrete EKF can be written as
(9)  
(10)  
(11) 
where is the time step and is the normal vector in camera frame. The predicted states are denoted using . Using the Jacobian matrix the predicted covariance matrix of system uncertainty is updated as
(12) 
where
(13)  
(14)  
(15) 
IvC Update
When both unscaled velocity and range sensor signal are available for update, the measurement vector is
(16)  
and the predicted measurement based on is  
(17) 
Calculating the Kalman gain
(18)  
(19)  
Estimates and are updated accordingly as  
(20)  
(21) 
V Evaluation
We evaluate the performance of our approach on a wide variety of scenarios and compare and contrast performance with state of the art algorithms, both in simulation, and with a real aerial platform. We first present experimental setup and results in simulation followed by those with realworld data.
Va Benchmarks and Metrics
Our method (RaDVIO) is compared to the tracker proposed in [10] (Baseline), for which we implement the optical flow method described in [8] and, for fair comparison, use the same EKF fusion methodology as our approach. Additionally, we also compare with a stateoftheart monocular visualinertial tracker VINSMono [5] without loop closure (VINSD, VINSdownward). For additional comparison we also record the performance of the VINSMono tracker on a forward facing camera (VINSF), but note that due to very low overlap between the forward and downward images there is no correlation between the performance of VINSD and VINSF.
The metrics used are Relative Pose Error (RPE) (the interval is set to 1s) and Absolute Trajectory Error (ATE) [13]. For ATE, we only report results in the xy plane since the altitude is directly observed by the rangefinder. We also report the number of times frametoframe tracking fails in Fig. 4. Since RaDVIO and Baseline output velocity, the position is calculated using deadreckoning. Since a lot of our trajectories are not closed loops, instead of reporting ATE we divide it by the cumulative length of the trajectory (computed at 1s intervals) in the horizontal plane to get the relative ATE.
VB Simulation Experiments
We utilise AirSim [14], a photorealistic simulator for aerial vehicles for the purpose of evaluation. The camera generates images at a frame rate of 80 Hz. The focal length is set to 300 pixels. The IMU and the single beam laser rangefinder output data at 200 Hz and 80 Hz respectively, and no noises are added.
VB1 Simulation Test Cases
For the tracking system to work, the following assumptions or conditions should be met or partly met:

The ground should be planar (homography constraint)

It should be horizontal (parameterisation choice)

The scene should be static (constancy of brightness)

Small interframe displacement (applicability of warp).
Therefore, the following ten test scenarios are designed to evaluate the algorithms:

p1: All assumptions met  ideal conditions

p2: Viewing planar ground with low texture

p3: Viewing planar ground with almost no texture

p4: Viewing planar ground with moving features

p5: Vehicle undergoing extreme motion

p6: Camera operating at low frame rate

s1: Viewing a sloped or a curved surface

m1: Viewing a plane with small clutter

m2: Vieweing moving features with small clutter

c1: Viewing a plane with large amounts of clutter
Each of these cases are tested in diverse environments including indoors, road, and woods. We use 42 data sets in total for evaluation. Fig. 2 shows some typical images from these datasets.


VB2 Simulation Results and Discussion
The RPE and relative ATE of all test cases are shown in Fig. 3. For relative ATE our method outperforms both Baseline and VINSD in almost all the test cases. The velocity error in Fig. 3 (c) contains both velocity bias and random error, and compared to VINSD our method generates similar velocity errors when the planar assumption is satisfied, but larger when it is not. This error is much smaller than Baseline. Due to the illposed nature of the motion and the data in these environments for the forward facing camera, VINSF performs very poorly.

VINSD takes a lot of time to initialise the system, and doesn’t do well right after initialisation. During the test, VINSD occasionally outputs extremely wrong tracking results, and it has difficulty in initialising/reinitialising through quite a few test cases (for instance s1, where although there are enough features on the ground the lighting is challenging). In contrast our tracker never generates any extreme results due to implicitly being constrained by the frametoframe displacement.
The simulation shows that RaDVIO is overall more accurate than compared to Baseline, and when all the assumptions are met, it is slightly more accurate than VINSD. The test cases demonstrate that the proposed tracker is robust to a lot of nonideal conditions.
VC Realworld Experiments
VC1 Setup and Test Cases
To verify the realworld performance of our proposed approach, indoor and outdoor data were collected. The indoor data was obtained from a custom hexrotor platform in a motion capture arena. We use a MatrixVision BlueFox camera configured to output resolution images at a frame rate of 60 Hz, and a wide angle lens with a focal length of 158 pixels. The frame rates of the TeraRanger One rangefinder and the VN100 IMU are 300 Hz and 200 Hz respectively. The standard deviation of IMU angular velocity and linear acceleration are and respectively. The indoor ground truth is provided by motion capture system. For outdoor data, only GPS data is provided as a reference, and we use an Intel Aero Drone^{1}^{1}1https://www.intel.com/content/www/us/en/products/drones/aeroreadytofly.html for data collection. The same algorithm parameters were used as in the simulation experiments for the vision and optimisation frontends, and the parameters for the EKF were tuned based on the sensor noise for the respective configuration.
Around 10 bag files are tested, and according to the same classification criteria, there are 4 combinations. The severity of moving features, and the platform motion is comparatively more significant in the corresponding real data sets. Note that the abbreviation of the test cases have the same meaning as in simulation section we mentioned before.


VC2 Experiment Results and Discussion
The tracking errors are shown in Fig. 7. Similar to the simulation results for relative ATE our tracker is on average better than both Baseline and VINSD. As for the translation part of RPE, our method is always better or no worse than the other two methods except for the last case in p4 (moving features) and p5 (extreme motion). In these cases the robust sparse feature selection in VINSD avoids being overly influenced. RaDVIO is not as robust to extreme motion as it is in simulation, and there are two reasons: the IMU input is more noisy and the wider field of view ends up capturing objects outside the ground plane that adversely affect the alignment. For rotation part of RPE, both Baseline and RaDVIO are better than VINSD; this is because both the latter two approaches use the IMU rotation directly.
The figure 8 trajectory and velocity of a outdoor test is shown in Fig. 10.


The experiment results show that the proposed method is able to work well in real world even in the presence of sensor synchronisation issues and large noises in the IMU signal.
VD Timing Benchmarks
We evaluated the tracking framerate of RaDVIO on a desktop PC with an Intel i76700K CPU. The frame rate of the tracker is on average 150Hz over all of our datasets. As can be seen Table I, bags with less features and large inter frame displacements result in lower frame rate due to longer time required for convergence.
Category  Mean  Min  Max  

Stable case  159  11.9  123  181 
Less or no feature (p2, p3)  143  12.4  125  160 
Large transform (p5, p6)  139  17.8  107  159 
All  153  15.7  107  181 
Vi Conclusions And Future Work
Conditions  Baseline  RaDVIO  VINSD 

Ideal (p1)  ++  +++*  +++ 
Low texture (p2)  +  +++  +++ 
Negligible texture (p3)  not work  +++  + 
Moving features (p4, m2)      ++ 
Extreme motion (p5)    +++  +++ 
Low image Hz (p6)    +++  ++ 
Slope (s1)  +++  +++  +++ 
Medium clutter (m1)  +  ++  +++ 
High clutter (c1)  +  +  +++ 
+++: Low or no tracking failures, ++: Occasional failures, +: Frequent failures, : Works poorly, *: slightly better than VINSD
We present a framework to obtain 6 degree of freedom state estimation on MAVs using a downward camera, IMU, downward single beam laser rangefinder. The proposed approach first extracts the rotation and unscaled translation between two consecutive image frames based on a cost funtion that combines dense photometric homography alignment and a rotation prior from the IMU then uses an EKF to perform sensor fusion and output a filtered metric linear velocity.
Extensive experiments in a wide variety of scenarios in simulations and in realworld experiments demonstrate the accuracy and robustness of the tracker under extenuating circumstances. The performance exceeds the frametoframe tracking framework proposed in [8, 10], and is slightly better than a current state of the art monocular visualinertial odometry algorithm. Further advantages of the proposed tracker are that it is stable and never generates extremely diverged state estimates (as triangulation based optimisation methods are susceptible to), and can operate at a high frame rate. A qualitative comparison of the performance on simulation data is shown in Table II.
For future work, in line with our introductory statements, we intend to couple the performance of such a downward facing tracker with a conventional sliding window optimisation based tracker to develop a resilient robotic system that exploits the individual strengths of both odometry approaches.
References
 [1] J. Delmerico and D. Scaramuzza, “A benchmark comparison of monocular visualinertial odometry algorithms for flying robots,” in IEEE International Conference on Robotics and Automation (ICRA), no. CONF, 2018.
 [2] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, “Realtime onboard visualinertial state estimation and selfcalibration of mavs in unknown environments,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 957–964.
 [3] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
 [4] A. I. Mourikis and S. I. Roumeliotis, “A multistate constraint kalman filter for visionaided inertial navigation,” in Robotics and automation, 2007 IEEE international conference on. IEEE, 2007, pp. 3565–3572.
 [5] T. Qin, P. Li, and S. Shen, “Vinsmono: A robust and versatile monocular visualinertial state estimator,” IEEE Transactions on Robotics, no. 99, pp. 1–17, 2018.
 [6] D. Honegger, L. Meier, P. Tanskanen, and M. Pollefeys, “An open source and open hardware embedded metric optical flow cmos camera for indoor and outdoor applications,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1736–1741.
 [7] B. Steder, G. Grisetti, C. Stachniss, and W. Burgard, “Visual slam for flying vehicles,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1088–1093, 2008.
 [8] V. Grabe, H. H. Bülthoff, and P. R. Giordano, “Onboard velocity estimation and closedloop control of a quadrotor uav based on optical flow,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 491–497.
 [9] V. Grabe, H. H. Bulthoff, and P. R. Giordano, “A comparison of scale estimation schemes for a quadrotor uav based on optical flow and imu measurements,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 5193–5200.
 [10] V. Grabe, H. H. Bülthoff, D. Scaramuzza, and P. R. Giordano, “Nonlinear egomotion estimation from optical flow for online control of a quadrotor uav,” The International Journal of Robotics Research, vol. 34, no. 8, pp. 1114–1135, 2015.
 [11] D. Crispell, J. Mundy, and G. Taubin, “G taubin, parallaxfree registration of aerial video,” in in Proceedings of the British Machine Vision Conference (BMVC). Citeseer, 2008.
 [12] K. S. Shankar and N. Michael, “Robust direct visual odometry using mutual information,” in 2016 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2016.
 [13] R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, and A. Kleiner, “On measuring the accuracy of slam algorithms,” Autonomous Robots, vol. 27, no. 4, p. 387, 2009.
 [14] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: Highfidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available: https://arxiv.org/abs/1705.05065