RaD-VIO: Rangefinder-aided Downward Visual-Inertial Odometry

RaD-VIO: Rangefinder-aided Downward Visual-Inertial Odometry

Bo Fu, Kumar Shaurya Shankar, Nathan Michael Authors are with The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 15213, USA {bofu, kshaurya, nmichael}@cmu.edu

State-of-the-art forward facing monocular visual-inertial odometry algorithms are often brittle in practice, especially whilst dealing with initialisation and motion in directions that render the state unobservable. In such cases having a reliable complementary odometry algorithm enables robust and resilient flight. Using the common local planarity assumption, we present a fast, dense, and direct frame-to-frame visual-inertial odometry algorithm for downward facing cameras that minimises a joint cost function involving a homography based photometric cost and an IMU regularisation term. Via extensive evaluation in a variety of scenarios we demonstrate superior performance than existing state-of-the-art downward facing odometry algorithms for Micro Aerial Vehicles (MAVs).

I Supplementary Material

Representative video: https://youtu.be/gmFKWQCd1us

Ii Introduction and Related Work

Recent advances in optimisation based monocular visual-inertial SLAM algorithms for MAVs have made great strides in being accurate and efficient [1]. However, in practice, these algorithms suffer from three main failure modalities - sensitivity to initialisation, undergoing motion that renders the state unobservable, and, to a lesser extent, inability to handle outliers within the optimisation. The first arises from the need for translation to accurately triangulate feature landmarks and being able to excite all axes of the accelerometer to determine scale. The second is a fundamental limit of the sensor characteristics, robot motion, and the environment, most often caused by motion in the camera direction and an absence of texture information. The third is often an artifact of sliding windows necessitated by the constraints imposed by limited compute on aerial platforms.

We believe that in order to have robust and resilient closed loop flight it is imperative to have complementary sources of odometry. Towards this, we present an algorithm that performs metric downward facing odometry which doesn’t depend on triangulation or initialisation, offers observability in an orthogonal direction to a conventional forward facing camera, and is purely a frame-to-frame method. This enables it to be a fast and reliable whilst still being accurate.

In this paper, we pursue the problem of estimating the linear and angular velocity and orientation of a micro aerial vehicle (MAV) equipped with a downward facing camera, an IMU, and a single beam laser rangefinder which measures the height of the vehicle relative to the ground.

Fig. 1: System Diagram. The homography based optimisation takes in images and differential rotation estimates from the Attitude and Heading Reference System (AHRS) on the IMU and outputs an orientation and unscaled velocity, which is then used in an Extended Kalman filter to provide scaled metric velocity.

Broadly, there have been two approaches to tackling visual odometry for downward facing cameras. The first involves exploiting epipolar geometry and using sophisticated structure from motion influenced loosely [2, 3] or tightly coupled visual inertial methods [4, 5]. The other approach makes a planar ground assumption. This enables simple optical flow based velocity estimation where the camera ego motion is compensated for using the angular rate data obtained from a gyroscope, and the velocity is then scaled to metric using some altitude sensor [6]. An issue with all such methods is that their performance is predicated on there being detectable motion between camera frames - the epipolar constraint - which is exacerbated for instance when the camera is nearly static in hovering conditions or moves vertically.

However, if making a planar ground assumption, we can exploit the homography constraint that does not have the aformentioned drawbacks. Implicit means of utilising homography constraints have been presented earlier in appearance based localisation, for instance in [7] where cameras are localised against previously acquired images. Most relevant to our approach, the authors in [8, 9, 10] first estimate the optical flow between features in consecutive frames and then using the aforementioned constraint exploit known angular velocity and ground plane orientation to obtain unscaled velocity. An extended Kalman filter (EKF) is then used to fuse the data and output metric velocity.

In our approach, instead of using sparse visual features, that are dependent on textured environments, we utilise a dense, direct method that makes use of all the visual information present in the camera image, and couple it with angular constraints provided by an IMU within a least squares optimisation.

Contributions of this work include:

  • A homography based frame-to-frame velocity estimation algorithm, that is accurate and robust in a wide variety of scenes.

  • An EKF structure to incorporate this with a single beam laser rangefinder signal and estimate IMU bias

  • Extensive evaluation on a wide variety of environments with comparisons with state of the art algorithms.

Iii Estimation Theory

In this section we present the homography constraint, our optimisation strategy, and the framework to incorporate the corresponding cost functions.

Iii-a Homography Constraint and Parameterisation

When looking at points lying on the same plane, their projected pixel coordinates in two images ( and respectively) taken by a downward camera can be related by




where and are the pixel locations in previous and current image respectively, is the warp matrix, , are the rotation matrix and translation vector from the second camera frame to the previous frame, is the unscaled translation, , are the unit normal vector and distance to the ground plane in second camera frame, and is the camera intrinsic matrix (assume known).

During optimisation we parameterise as a Rodrigues vector and as [11]


Since the IMU provides reliable orientation information, i.e. an estimate of and , three parameterisations are possible: , and . The first two fix , and respectively. These parameterisations directly encode the motion of the camera and can easily be initialised using IMU.

According to implementation, the second is chosen since it provides the most accurate homography optimisation and tracking performance. The underlying assumption for fix is that the ground is horizontal and therefore, the normal vector depends only on the MAV’s orientation. The validity of this assumption will be evaluated in Sec. V.

Iii-B Homography Estimation Cost Function

The parameters of the warp matrix are estimated by minimising the SSD error between image pixel intensities. However, a purely photometric cost minimisation may provide incorrect camera pose estimates due to a lack of observability or in the event of non planar objects in the camera field of view. Since the IMU provides reliable orientation information, we add a penalty term which biases the homography solution and avoids these local minima.

Suppose stands for the homography mapping parameterised by the vector , we have


where is the initial guess obtained from IMU, is a diagonal penalty weight matrix, and are the previous and current image respectively, and is a pixel position in the evaluation region of current image.

Iii-C Gauss-Newton Optimisation

We solve for the optimal parameters using iterative Gauss-Newton optimisation. After concatenating all the intensity values of pixel in a vector, the Taylor expansion is


where and . The iterative update to the parameter vector ends up being


where is the jacobian of the photometric residual term. Note that as an optimisation we only choose pixels with a high gradient magnitude similar to [12]. This significantly speeds up computation of the update with negligible loss in accuracy. The detailed timing performance is discussed in Sec. V.

Iv Visual Inertial Fusion

The optimisation in the previous section outputs an unscaled translation. Inspired by [9], we use an EKF to scale it to metric and additionally filter the frame-to-frame noise.

Iv-a Definition

In the following section the superscripts and subscripts and imply a quantity in frame of the camera and IMU, respectively. The state vector contains camera velocity in the camera frame , distance to the plane from the camera , and the linear acceleration in the IMU frame .

Iv-B Prediction

The derivative of can be modeled [9] as


where is the rotation matrix from IMU frame to camera frame, and are the acceleration and gravity in the IMU frame, and are the raw linear acceleration and angular velocity measured by the IMU (subscript denotes raw measurement), and is the angular velocity in the camera frame. The subscript denotes the skew symmetric matrix of the vector inside the bracket.

Therefore, the prediction process in discrete EKF can be written as


where is the time step and is the normal vector in camera frame. The predicted states are denoted using . Using the Jacobian matrix the predicted covariance matrix of system uncertainty is updated as




Iv-C Update

When both unscaled velocity and range sensor signal are available for update, the measurement vector is

and the predicted measurement based on is

Calculating the Kalman gain

Estimates and are updated accordingly as

V Evaluation

We evaluate the performance of our approach on a wide variety of scenarios and compare and contrast performance with state of the art algorithms, both in simulation, and with a real aerial platform. We first present experimental setup and results in simulation followed by those with real-world data.

V-a Benchmarks and Metrics

Our method (RaD-VIO) is compared to the tracker proposed in [10] (Baseline), for which we implement the optical flow method described in [8] and, for fair comparison, use the same EKF fusion methodology as our approach. Additionally, we also compare with a state-of-the-art monocular visual-inertial tracker VINS-Mono [5] without loop closure (VINS-D, VINS-downward). For additional comparison we also record the performance of the VINS-Mono tracker on a forward facing camera (VINS-F), but note that due to very low overlap between the forward and downward images there is no correlation between the performance of VINS-D and VINS-F.

The metrics used are Relative Pose Error (RPE) (the interval is set to 1s) and Absolute Trajectory Error (ATE) [13]. For ATE, we only report results in the xy plane since the altitude is directly observed by the rangefinder. We also report the number of times frame-to-frame tracking fails in Fig. 4. Since RaD-VIO and Baseline output velocity, the position is calculated using dead-reckoning. Since a lot of our trajectories are not closed loops, instead of reporting ATE we divide it by the cumulative length of the trajectory (computed at 1s intervals) in the horizontal plane to get the relative ATE.

V-B Simulation Experiments

We utilise AirSim [14], a photorealistic simulator for aerial vehicles for the purpose of evaluation. The camera generates images at a frame rate of 80 Hz. The focal length is set to 300 pixels. The IMU and the single beam laser rangefinder output data at 200 Hz and 80 Hz respectively, and no noises are added.

V-B1 Simulation Test Cases

For the tracking system to work, the following assumptions or conditions should be met or partly met:

  • The ground should be planar (homography constraint)

  • It should be horizontal (parameterisation choice)

  • The scene should be static (constancy of brightness)

  • Small inter-frame displacement (applicability of warp).

Therefore, the following ten test scenarios are designed to evaluate the algorithms:

  • p1: All assumptions met - ideal conditions

  • p2: Viewing planar ground with low texture

  • p3: Viewing planar ground with almost no texture

  • p4: Viewing planar ground with moving features

  • p5: Vehicle undergoing extreme motion

  • p6: Camera operating at low frame rate

  • s1: Viewing a sloped or a curved surface

  • m1: Viewing a plane with small clutter

  • m2: Vieweing moving features with small clutter

  • c1: Viewing a plane with large amounts of clutter

Each of these cases are tested in diverse environments including indoors, road, and woods. We use 42 data sets in total for evaluation. Fig. 2 shows some typical images from these datasets.

Fig. 2: Sample views of environments used in the simulation experiments defined in Sec. V-B1

V-B2 Simulation Results and Discussion

The RPE and relative ATE of all test cases are shown in Fig. 3. For relative ATE our method outperforms both Baseline and VINS-D in almost all the test cases. The velocity error in Fig. 3 (c) contains both velocity bias and random error, and compared to VINS-D our method generates similar velocity errors when the planar assumption is satisfied, but larger when it is not. This error is much smaller than Baseline. Due to the ill-posed nature of the motion and the data in these environments for the forward facing camera, VINS-F performs very poorly.

Fig. 3: Comparison of the algorithms on the simulation datasets described in Sec.V-B1. Solid lines separate the cases where the planar assumption is met, not met or it is a slope, and dashed lines separates other assumptions.
Fig. 4: Percentage of tracking failures: The failure instances of Baseline and RaD-VIO are calculated by a per-frame counter, while that of VINS-D is estimated according to gap between messages, not including time taken for initialisation.
(a) Dead-reckoned odometry tracks.
(b) Corresponding velocities in MAV body frame.
Fig. 5: Comparison of algorithms in simulation for a figure 8 trajectory.

VINS-D takes a lot of time to initialise the system, and doesn’t do well right after initialisation. During the test, VINS-D occasionally outputs extremely wrong tracking results, and it has difficulty in initialising/reinitialising through quite a few test cases (for instance s1, where although there are enough features on the ground the lighting is challenging). In contrast our tracker never generates any extreme results due to implicitly being constrained by the frame-to-frame displacement.

The simulation shows that RaD-VIO is overall more accurate than compared to Baseline, and when all the assumptions are met, it is slightly more accurate than VINS-D. The test cases demonstrate that the proposed tracker is robust to a lot of non-ideal conditions.

V-C Real-world Experiments

V-C1 Setup and Test Cases

To verify the real-world performance of our proposed approach, indoor and outdoor data were collected. The indoor data was obtained from a custom hexrotor platform in a motion capture arena. We use a MatrixVision BlueFox camera configured to output resolution images at a frame rate of 60 Hz, and a wide angle lens with a focal length of 158 pixels. The frame rates of the TeraRanger One rangefinder and the VN-100 IMU are 300 Hz and 200 Hz respectively. The standard deviation of IMU angular velocity and linear acceleration are and respectively. The indoor ground truth is provided by motion capture system. For outdoor data, only GPS data is provided as a reference, and we use an Intel Aero Drone111https://www.intel.com/content/www/us/en/products/drones/aero-ready-to-fly.html for data collection. The same algorithm parameters were used as in the simulation experiments for the vision and optimisation frontends, and the parameters for the EKF were tuned based on the sensor noise for the respective configuration.

Around 10 bag files are tested, and according to the same classification criteria, there are 4 combinations. The severity of moving features, and the platform motion is comparatively more significant in the corresponding real data sets. Note that the abbreviation of the test cases have the same meaning as in simulation section we mentioned before.

Fig. 6: Sample views of environments used in the real-world experiments discussed in Sec. V-C1

V-C2 Experiment Results and Discussion

The tracking errors are shown in Fig. 7. Similar to the simulation results for relative ATE our tracker is on average better than both Baseline and VINS-D. As for the translation part of RPE, our method is always better or no worse than the other two methods except for the last case in p4 (moving features) and p5 (extreme motion). In these cases the robust sparse feature selection in VINS-D avoids being overly influenced. RaD-VIO is not as robust to extreme motion as it is in simulation, and there are two reasons: the IMU input is more noisy and the wider field of view ends up capturing objects outside the ground plane that adversely affect the alignment. For rotation part of RPE, both Baseline and RaD-VIO are better than VINS-D; this is because both the latter two approaches use the IMU rotation directly.

Fig. 7: Comparison of the algorithms on the real-world datasets described in Sec.V-C1. Solid lines separate the cases where the planar assumption is met, not met or it is a slope, and dashed lines separates other assumptions.
Fig. 8: RPE and ATE of tested simulation bag files: The circle marker means the data point is not reliable due to a large number of tracking failures. The errors drawn at the upper boundary are clipped.

The figure 8 trajectory and velocity of a outdoor test is shown in Fig. 10.

(a) Dead-reckoned odometry tracks.
(b) Corresponding velocities in MAV body frame.
Fig. 9: Comparison of algorithms in a real-world indoors dataset with motion capture ground truth.
(a) Dead-reckoned odometry tracks.
(b) Corresponding velocities in MAV body frame.
Fig. 10: Comparison of algorithms for an outdoors trajectory.

The experiment results show that the proposed method is able to work well in real world even in the presence of sensor synchronisation issues and large noises in the IMU signal.

V-D Timing Benchmarks

We evaluated the tracking framerate of RaD-VIO on a desktop PC with an Intel i7-6700K CPU. The frame rate of the tracker is on average 150Hz over all of our datasets. As can be seen Table I, bags with less features and large inter frame displacements result in lower frame rate due to longer time required for convergence.

Category Mean Min Max
Stable case 159 11.9 123 181
Less or no feature (p2, p3) 143 12.4 125 160
Large transform (p5, p6) 139 17.8 107 159
All 153 15.7 107 181

Vi Conclusions And Future Work

Conditions Baseline RaD-VIO VINS-D
Ideal (p1) ++ +++* +++
Low texture (p2) + +++ +++
Negligible texture (p3) not work +++ +
Moving features (p4, m2) - - ++
Extreme motion (p5) - +++ +++
Low image Hz (p6) - +++ ++
Slope (s1) +++ +++ +++
Medium clutter (m1) + ++ +++
High clutter (c1) + + +++

+++: Low or no tracking failures, ++: Occasional failures, +: Frequent failures, -: Works poorly, *: slightly better than VINS-D


We present a framework to obtain 6 degree of freedom state estimation on MAVs using a downward camera, IMU, downward single beam laser rangefinder. The proposed approach first extracts the rotation and unscaled translation between two consecutive image frames based on a cost funtion that combines dense photometric homography alignment and a rotation prior from the IMU then uses an EKF to perform sensor fusion and output a filtered metric linear velocity.

Extensive experiments in a wide variety of scenarios in simulations and in real-world experiments demonstrate the accuracy and robustness of the tracker under extenuating circumstances. The performance exceeds the frame-to-frame tracking framework proposed in [8, 10], and is slightly better than a current state of the art monocular visual-inertial odometry algorithm. Further advantages of the proposed tracker are that it is stable and never generates extremely diverged state estimates (as triangulation based optimisation methods are susceptible to), and can operate at a high frame rate. A qualitative comparison of the performance on simulation data is shown in Table II.

For future work, in line with our introductory statements, we intend to couple the performance of such a downward facing tracker with a conventional sliding window optimisation based tracker to develop a resilient robotic system that exploits the individual strengths of both odometry approaches.


  • [1] J. Delmerico and D. Scaramuzza, “A benchmark comparison of monocular visual-inertial odometry algorithms for flying robots,” in IEEE International Conference on Robotics and Automation (ICRA), no. CONF, 2018.
  • [2] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, “Real-time onboard visual-inertial state estimation and self-calibration of mavs in unknown environments,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on.   IEEE, 2012, pp. 957–964.
  • [3] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
  • [4] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint kalman filter for vision-aided inertial navigation,” in Robotics and automation, 2007 IEEE international conference on.   IEEE, 2007, pp. 3565–3572.
  • [5] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, no. 99, pp. 1–17, 2018.
  • [6] D. Honegger, L. Meier, P. Tanskanen, and M. Pollefeys, “An open source and open hardware embedded metric optical flow cmos camera for indoor and outdoor applications,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on.   IEEE, 2013, pp. 1736–1741.
  • [7] B. Steder, G. Grisetti, C. Stachniss, and W. Burgard, “Visual slam for flying vehicles,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1088–1093, 2008.
  • [8] V. Grabe, H. H. Bülthoff, and P. R. Giordano, “On-board velocity estimation and closed-loop control of a quadrotor uav based on optical flow,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on.   IEEE, 2012, pp. 491–497.
  • [9] V. Grabe, H. H. Bulthoff, and P. R. Giordano, “A comparison of scale estimation schemes for a quadrotor uav based on optical flow and imu measurements,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on.   IEEE, 2013, pp. 5193–5200.
  • [10] V. Grabe, H. H. Bülthoff, D. Scaramuzza, and P. R. Giordano, “Nonlinear ego-motion estimation from optical flow for online control of a quadrotor uav,” The International Journal of Robotics Research, vol. 34, no. 8, pp. 1114–1135, 2015.
  • [11] D. Crispell, J. Mundy, and G. Taubin, “G taubin, parallax-free registration of aerial video,” in in Proceedings of the British Machine Vision Conference (BMVC).   Citeseer, 2008.
  • [12] K. S. Shankar and N. Michael, “Robust direct visual odometry using mutual information,” in 2016 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR).   IEEE, 2016.
  • [13] R. Kümmerle, B. Steder, C. Dornhege, M. Ruhnke, G. Grisetti, C. Stachniss, and A. Kleiner, “On measuring the accuracy of slam algorithms,” Autonomous Robots, vol. 27, no. 4, p. 387, 2009.
  • [14] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available: https://arxiv.org/abs/1705.05065
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description