RT-SLAM: a generic and real-time visual SLAM implementation
This article presents a new open-source C++ implementation to solve the SLAM problem, which is focused on genericity, versatility and high execution speed. It is based on an original object oriented architecture, that allows the combination of numerous sensors and landmark types, and the integration of various approaches proposed in the literature. The system capacities are illustrated by the presentation of an inertial/vision SLAM approach, for which several improvements over existing methods have been introduced, and that copes with very high dynamic motions. Results with a hand-held camera are presented.
Progresses in image processing, the growth of available computing power and the proposition of approaches to deal with bearings-only observations have made visual SLAM very popular, particularly since the demonstration of a real-time implementation by Davison in 2003 [Dav_03]. The Extended Kalman Filter (EKF) is widely used to solve the SLAM estimation problem, but it has recently been challenged by global optimization methods that have shown superior precision for large scale SLAM. Yet EKF still has the advantage of simplicity and faster processing for problems of limited size [Why_Filter_10], and can be combined with global optimization methods [estrada05] to take the best of both worlds.
Monocular EKF-SLAM reached maturity in 2006 with solutions for initializing landmarks [Kwok_04] [Sola_05] [Montiel_06]. Various methods for landmark parametrization have been analyzed [sola_ahp], and the literature now abounds with contributions to the problem.
This article presents RT-SLAM
Section 2 details the architecture of RT-SLAM and section 3 presents some of the techniques currently integrated within, to define an efficient inertial/visual SLAM approach. Section 4 analyzes results obtained with a hand-held system assessed thanks to ground truth, and section 5 concludes the article by a presentation of prospects.
2 Overall Architecture
Fig. 1 presents the main objects defined in RT-SLAM. They encompass the basic concepts of a SLAM solution: the world or environment contains maps; maps contain robots and landmarks; robots have sensors; sensors make observations of landmarks. Each of these objects is abstract and can have different implementations. They can also contain other objects that may themselves be generic.
The maps contain an optimization or estimation engine: for now RT-SLAM uses a standard formulation of EKF-SLAM. Since this solution is very well documented in the literature [Dav_07], it is not detailed in depth here. Indirect indexing within Boost’s ublas C++ structures is intensively used to exploit the sparsity of the problem and the symmetry of the covariances matrices.
Robots can be of different type according to the way their state is represented and their prediction model. The latter can be either a simple kinematic model (constant velocity, constant acceleration, …) or a proprioceptive sensor (odometric, inertial, …), as illustrated section 3.3. The proprioceptive sensor is an example of generic object contained in robot objects, as different hardware can provide the same function.
Similarly to robots, sensors can also have different models (perspective camera, panoramic catadioptric camera, laser, …), and contain a generic exteroceptive sensor hardware object (firewire camera, USB camera, …). In addition, as sensors belong to the map, their state can be estimated: this opens the possibility for estimating other parameters such as extrinsic calibration, time delays, biases and gain errors, and the like.
Landmarks can be of different type (points, lines, planes, …), and each type can have different state parametrization (Euclidean point, inverse depth point, …). Moreover the parametrization of a landmark can change over time, as explained section 3.2. A landmark also contains a descriptor used for data association, which is a dual description to the state representation.
As shown Fig. a, it is worth noticing that landmark objects are common to the different sensors, all of them being able to observe the same landmark (provided they have compatible descriptors for this landmark of course). This allows to greatly improve the observability of landmarks compared to a system where the sensors are strictly independent. In the particular case of two cameras for instance, landmarks can be used even if they are only visible from one camera or if they are too far away for a stereovision process to observe their depth (this process was introduced in [sola_bicam] as BiCam SLAM).
In RT-SLAM, the notion of observation plays a predominant role. An observation is a real object containing both methods and data. One observation is instantiated for every sensor-landmark pair, regardless of the sensor having actually observed the landmark or not, and has the same lifetime as the associated landmark. The methods it contains are the conventional projection and back-projection models (that depend on the associated sensor and landmark models), while the stored data consist of results and intermediary variables such as Jacobian matrices, predicted and observed appearances, innovations, event counters and others, that allow to greatly simplify and optimize computations.
In order to achieve full genericity wrt landmark types, in particular to allow the concurrent use of different landmark types for one sensor, two different manager objects are added: data manager and map manager. Their implementations define a given management strategy, while their instantiations are dedicated to a certain landmark type. The data manager processes the sensors raw data, providing observations of the external landmarks. For this purpose, it exploits some raw data processors (for feature detection and tracking), and decides which observations are to be corrected and in which order, according to the quantity of information they bring and their quality. For example it can apply an active search strategy and try to eliminate outliers as described in section 3.1. The map manager keeps the map clean, with relevant information, and at a manageable size, by removing landmarks according to their quality and the given policy (e.g. visual odometry where landmarks are discarded once they are not observed, or multimap slam where maps are “closed” according to given criteria). These managers communicate together: for example, the data manager may ask the map manager if there is enough space in the map to start a new initialization process, and to allocate the appropriate space for the new landmark.
3 Inertial/visual SLAM within RT-SLAM
3.1 Active search and one-point RANSAC
The strategy currently implemented in RT-SLAM’s data manager to deal with observations is an astute combination of active search [Dav_07] and outliers rejection using one-point RANSAC [civera_onepointransac].
The goal of active search is to minimize the quantity of raw data processing by constraining the search in the area where the landmarks are likely to be found. Observations outside of this 3 observation uncertainty ellipse would be anyway considered incompatible with the filter and ignored by the gating process. In addition active search gives the possibility to decide anytime to stop matching and updating landmarks with the current available data, thus enabling hard real-time constraints. We extended the active search strategy to landmark initialization: each sensor strives to maintain features in its whole field of view using a randomly moving grid of fixed size, and feature detection is limited to empty cells of the grid. Furthermore the good repartition of features in the field of view ensures a better observability of the motions.
Outliers can come from matching errors in raw data or mobile objects in the scene. Gating is not always discriminative enough to eliminate them, particularly right after the prediction step when the observation uncertainty ellipses can be quite large – unfortunately at this time the filter is very sensitive to faulty corrections because it can mistakenly make all the following observations incompatible. To prevent faulty observations, outliers are rejected using a one-point RANSAC process. It is a modification of RANSAC, that uses the Kalman filter to obtain a whole model with less points than otherwise needed, and provides a set of strongly compatible observations that are then readily corrected. Contrary to [civera_onepointransac] where data association is assumed given when applying the algorithm, we do the data association along with the one-point RANSAC process: this allows to look for features in the very small strongly compatible area rather than the whole observation uncertainty ellipse, and to save additional time for raw data processing.
3.2 Landmark parametrizations and reparametrization
In order to solve the problem of adding to the EKF a point with unknown distance and whose uncertainty cannot be represented by a Gaussian distribution, point landmarks parametrizations and initialization strategies for monocular EKF-SLAM have been well studied [Dav_03] [lemaire_delayed] [Montiel_06]. The solutions now widely accepted are undelayed initialization techniques with inverse depth parametrization. Anchored Homogeneous Point [sola_ahp] parametrization is currently used in RT-SLAM.
The drawback of inverse depth parametrizations is that they describe a landmark by at least 6 variables in the stochastic map, compared to only 3 for an Euclidean point . Memory and temporal complexity being quadratic with the map size for EKF, there is a factor of 4 to save in time and memory by reparametrizing landmarks that have converged enough [Civera07]. The map manager uses the linearity criterion proposed in [Montiel_06] to control this process.
3.3 Motion prediction
The easiest solution for EKF-SLAM is to use a robot kinematic model to do the filter prediction, such as a constant velocity model:
where and are respectively the position and quaternion orientation of the robot, and and are its linear and angular velocities.
The advantage of such a model is that it does not require complicated hardware setup (only a simple camera), but its strong limitation is that the scale factor is not observable. A second camera with a known baseline can provide a proper scale factor, but one can also use a proprioceptive sensor for the prediction step. Furthermore it usually provides a far better prediction with smaller uncertainties than a simple kinematic model, which brings several benefits:
search areas for matching are smaller, so processing is faster,
linearization points are more accurate, so SLAM precision is better, or one can reduce the framerate to decrease CPU load for equivalent quality,
it allows to withstand high motion dynamics.
In the case of an Inertial Measurement Unit (IMU), the robot state is then:
where and are the accelerometers and gyrometers biases, and the 3D gravity vector. Indeed it is better for linearity reasons to estimate the direction of rather than constraining the robot orientation to enforce to be vertical.
A special care has to be taken for the conversion of the noise from continous time (provided by the manufacturer in the sensor’s datasheet) to discrete time. As the perturbation is continuous white noise, the variance of the discrete noise grows linearly with the integration period.
3.4 Image processing
Point extraction is based on Harris detector with several optimizations. Some of them are approximations: a minimal derivative mask is used, as well as a square and constant convolution mask, in order to minimize operations. This allows the use of integral images [Viola_integral] to efficiently compute the convolutions. Additional optimizations are related to active search (section 3.1): only one new feature is searched in a small region of interest, which eliminates the costly steps of thresholding and sub-maxima suppression.
Point matching is based on Zero-mean Normalized Cross Correlation (ZNCC), also with several optimizations. Integral images are used to compute means and variances, and a hierarchical search is made (two searches at half and full resolution are sufficient). We also implemented bounded partial correlation [DiStefano05] in order to interrupt the correlation score computation when there is no more hope to obtain a better score than the threshold or the best one up to now. To be robust to viewpoint changes and to track landmarks longer, tracking is made by comparing the initial appearance of the landmark with its current predicted appearance [Dav_07].
Fig. 2 shows the hardware setup that has been used for the experiments. It is composed of a firewire camera rigidly tied to an IMU, on which four motion capture markers used to measure the ground truth are fixed.
Two different sequences are used, referred to as low dynamic and high dynamic sequences. Both were acquired indoor with artificial lights only, with an image framerate of 50 Hz synchronized to the inertial data rate of 100 Hz. The motion capture markers are localized with a precision of approximately 1 mm, so the ground truth has a precision of mm in position and in angle (baseline of 20 cm).
Movies illustrating a run of every experiment are provided
4.1 Constant velocity model
The inertial data is here not used, and the prediction is made according to a constant velocity model. Fig. 3 presents the estimated trajectory and the ground truth for the low dynamic sequence, and Fig. 4 shows the absolute errors for the same run.
Besides the global scale factor which is manually corrected, the camera trajectory is properly estimated. The movie shows that loops are correctly closed, even after a full revolution.
The position error raises up to 10 cm after quick rotations of the camera: this is due to a slight drift of the scale factor caused by the integration of newer landmarks (when closing loops, the scale factor is restimated closer to its initial value). Also, uncertainty ellipses remain relatively large throughout the sequence: these issues are solved by the use of an IMU to predict the motions.
4.2 Inertial/visual SLAM
Fig. 5 shows the trajectory estimated with the high dynamic sequence, and Fig. 6 and 7 show the behavior of inertial SLAM. Here, all of the 6 degrees of freedom are successively excited, then and are excited with extreme dynamics: the yaw angular velocity goes up to 400 deg/s, the rate of change of angular velocity exceeds 3,000 deg/s, and accelerations reach 3 . It is interesting to note that the time when SLAM diverges (around s) corresponds to motions for which the angular velocity exceeds the limit of the IMU (300 deg/s) and where its output is saturated.
The IMU now allows to observe the scale factor, and at the same time reduces the observation uncertainty ellipses and thus eases the active search procedure. Conversely, the divergence of the SLAM process at the end of the sequence illustrates what happens when vision stops stabilizing the IMU: it quickly diverges because the biases and the gravity cannot be estimated anymore.
We have presented a complete SLAM architecture whose genericity and performance make it both a useful experimentation tool and efficient solution for robot localization. The following extensions are currently being developed: using a second camera to improve landmarks observability, a multimap approach to cope with large scale trajectories and multirobot collaborative localization, and the use of line landmarks complementarily to points to provide more meaningful maps and to ease map matching for loop closure.
The architecture of RT-SLAM allows to easily integrate such developments, and also to consider additional motion sensors: to increase the robustness of the system, it is indeed essential to consider the various sensors usually found on board a robot (odometry, gyrometers, GPS). Eventually, it would be interesting to make RT-SLAM completely generic wrt the estimation engine as well, in order to be able to use global optimization techniques in place of filtering.
- RT-SLAM stands for “Real-Time SLAM”