SPLODE: Semi-Probabilistic Point and Line Odometry with Depth Estimation from RGB-D Camera Motion

Pedro F. Proença and Yang Gao The authors are with the Surrey Space Centre, Faculty of Engineering and Physical Sciences, University of Surrey, GU2 7XH Guildford, U.K. {p.proenca, yang.gao}@surrey.ac.uk
Abstract

Active depth cameras suffer from several limitations, which cause incomplete and noisy depth maps, and may consequently affect the performance of RGB-D Odometry. To address this issue, this paper presents a visual odometry method based on point and line features that leverages both measurements from a depth sensor and depth estimates from camera motion. Depth estimates are generated continuously by a probabilistic depth estimation framework for both types of features to compensate for the lack of depth measurements and inaccurate feature depth associations. The framework models explicitly the uncertainty of triangulating depth from both point and line observations to validate and obtain precise estimates.

Furthermore, depth measurements are exploited by propagating them through a depth map registration module and by using a frame-to-frame motion estimation method that considers 3D-to-2D and 2D-to-3D reprojection errors independently. Results on RGB-D sequences captured in large indoor and outdoor scenes, where depth sensor limitations are critical, show that combining depth measurements and estimates through our approach overcomes the absence and inaccuracy of depth measurements.

I Introduction

Modern active depth cameras, i.e., structured-light and time-of-flight (ToF) cameras, are capable of capturing dense 640×480 depth maps of poorly textured scenes at 30 fps. Therefore, the combination of these depth cameras with colour cameras, in a stereo configuration, has become a popular setup to capture both the geometry and appearance of indoor scenes. The calibration of such a stereo setup allows the direct registration of depth with RGB images. To leverage this type of fused data, known as RGB-D data, many visual odometry and SLAM methods [1, 2, 3, 4, 5, 6, 7] have emerged over the last few years. However, these depth sensors suffer from several limitations that cause missing and erroneous depth readings, such as: limited field of view and range (typically between 0.4 and 5 m), near-infrared (NIR) interference (e.g. sunlight and multiple devices) and non-Lambertian reflections. These limitations may severely affect the performance of pure RGB-D based methods [4, 5, 6, 7], as these only make use of RGB pixels that have associated depth measurements. On the other hand, depth estimation from a single moving camera is less limited by the above issues, but purely monocular SLAM [8, 9] suffers from scale drift and degenerate motion.

Fig. 1: Critical indoor and outdoor scenes for active depth cameras. The depth maps captured by a structured-light sensor are mapped and coloured in red, on the left images. Our method exploits points and lines with associated depth measurements from a depth camera and depth estimates from temporal stereo, shown respectively on the left image in green and blue. The respective inverse depth is shown (colour-coded) on the right for both measurements and estimates. Red features do not contain valid depth.
Fig. 2: Overview of SPLODE system.

To address these limitations, we propose a robust visual odometry method based on image point and line features, called SPLODE, which exploits not only the depth measurements from a depth camera but also depth estimates from camera motion. SPLODE overcomes temporarily unavailable depth measurements by propagating depth through depth map registration and by estimating the frame-to-frame pose using, loosely, both the 2D-to-3D and 3D-to-2D reprojection errors of feature matches; additionally, SPLODE continuously estimates depth for tracked features by recursively triangulating and fusing depth from temporal stereo. The key contributions of this paper are the following:

  • The introduction of a depth estimation framework that models explicitly the triangulation uncertainty of points and lines by taking into account their particular geometries and the pose uncertainty.

  • Showing how to minimize the impact of depth measurement errors on the pose estimation by comparing two different depth estimation schemes.

Our experiments show that the combination of depth camera measurements with estimates through the proposed framework allows SPLODE to operate on challenging environments (i.e. beyond room-sized environments), where the limitations of depth cameras become too problematic to rely just on RGB-D odometry.

II Related Work

Recently, several works have shown the benefit of combining different depth sensing modalities for ego-motion estimation. Stereo cameras have been integrated in monocular SLAM methods based on either direct image alignment [10] or point features [11] to help resolve scale ambiguity and degenerate motions. RGB-D cameras have been used with monocular techniques in [1, 2, 3] to deal with a lack of depth sensor measurements. Hu et al. [1] proposed a heuristic switching algorithm to choose between RGB-D SLAM and monocular SLAM based on epipolar geometry; however, with such a hard switch, all depth measurements during the monocular SLAM mode are ignored, thus a map optimization is required to estimate the scale. Zhang et al. [2] proposed estimating motion by combining 3D-to-2D and 2D-to-2D feature point matches in a joint optimization. Additionally, a depth map registration module is proposed to incorporate depth information from past frames and extend the method to LIDAR; nevertheless, during the frame-to-frame pose estimation, depth from the second frame (i.e. 2D-to-3D) is neglected. Alternatively, Ataer-Cansizoglu et al. [3] proposed using a RANSAC framework to combine hybrid features. The framework estimates pose by trying different triplets comprised of either plane matches and 3D-to-3D point matches or just 2D-to-3D point matches, whereas the rest of the matches, including point matches without depth, are used to check the hypothesis consensus; yet the case of 3D-to-2D matches is not considered. Both [2] and [3] triangulate points without depth after they have been tracked for longer than a certain temporal stereo baseline. However, assessing the stereo baseline by itself is not enough, especially for lines, e.g., camera translation parallel to a line or in the direction of a point renders triangulation impossible.

More convincing depth estimation is found among monocular odometry methods [8, 9]. Engel et al. [8] estimated and refined a semi-dense inverse depth map for direct image alignment by modelling the inverse depth uncertainty of pixels with three criteria: an approximation of the geometric disparity error, a photometric disparity error and a pixel-to-inverse-depth conversion. Forster et al. [9] introduced a fast monocular odometry, where the inverse depth of point features is modelled as a Gaussian-uniform mixture model by employing a Bayesian filter to fit good depth observations and account for outliers. Such a model suggests robustness to repetitive textures; however, unlike our approach, the depth filter needs to undergo multiple observations before it converges, since it does not take into account the uncertainty of each feature observation. This method was further extended in [12] to line features to address textureless scenes.

RGB-D data also has the potential to be used in poorly textured and illuminated scenes due to the use of active light. In [4], the combination of point features with ICP based on the depth data proved to be advantageous in dark areas. Similarly, in [5], the direct minimization of both photometric and depth error showed better performance in low-textured scenes than relying on the photometric error alone. Robustness in RGB-D methods has also been achieved [13, 6] by exploiting other geometric primitives besides points. Planes have been widely used [13, 3]. More closely related to our work is that of Lu and Song [6], who combined point and line segment features under a framework that models the uncertainties of the 3D features based on the low resolution and noise of depth measurements obtained by structured-light sensors. Despite their use of a 3D line fitting method that tolerates missing depth, the frame-to-frame motion estimation itself requires feature matches to have depth measurements in both frames, as it minimizes the Mahalanobis distance between the 3D feature locations. Consequently, the framework does not allow the inclusion of either features with missing depth or triangulated features.

III Method

Our proposed visual odometry system, outlined in Fig. 2, leverages active depth sensing and temporal stereo to retrieve the 3D location of point and line features, which are projected onto the RGB camera. Once a new RGB-D frame is captured, 2D features are detected in the current RGB image and depth measurements are sampled from the current depth map to attempt to obtain the respective 3D coordinates. These features are then matched against the ones extracted from the last frame, and pose is estimated by minimizing their bidirectional reprojection errors, employing an M-estimator to reduce the impact of spurious matches and wrong depth associations on the estimated pose. Given the pose estimate, past depth measurements can be combined with the current depth map via point cloud registration to achieve a denser depth map, from which depth is resampled and used to recover the 3D coordinates of features that do not yet have valid depth, so they can be used in the next frame-to-frame pose estimation. For the purpose of depth triangulation, features are tracked by frame-to-frame feature matching since their first observation to allow a wide-baseline stereo. As outlined by the feature tracking algorithm in Fig. 3, we have experimented with two different schemes to integrate the depth estimation from temporal stereo:

  • Mode A, similarly to [2, 3, 11], only estimates depth for features that do not have associated depth measurements. As a result, the set of features with depth measurements and the set with depth estimates are disjoint.

  • Mode B estimates depth for features regardless of whether they have depth measurements, such that, even if a feature already has a 3D position recovered from the depth map, a depth estimate may be applied to initialize a new 3D position hypothesis, which is used along with the other hypothesis in the next frame’s pose estimation as two 3D-to-2D feature match residuals. Since the pose estimation implements an M-estimator, spurious depth hypotheses can be down-weighted.

While mode A gives priority to measurements coming from the depth sensor, estimating depth only when necessary, mode B relies less on depth measurements by treating measurements and estimates equally. The principle behind mode B is that depth measurements may be inaccurate, and depth estimates from temporal stereo can be used as an alternative hypothesis. The depth estimation module uses the uncertainty of the stereo triangulation, given the pose uncertainty, to fuse estimates from multiple views and assess the precision of the estimates. Whenever the uncertainty of a fused depth estimate drops below a certain threshold, that depth estimate is applied to initialize the 3D position of the feature through back-projection, so it can be used in the next camera pose estimation. To combat wrong feature matches, the consistency of both triangulated and fused depth estimates is continuously checked based on the reprojection errors. In the remainder of this section, we detail each individual module of our framework.

Fig. 3: General feature workflow after matching and pose estimation for two proposed schemes to combine depth estimates and measurements: Mode A only estimates depth for features without valid depth measurements, whereas mode B estimates depth for all feature matches.

III-A Preliminaries

The projection of a 3D point $P = [X, Y, Z]^\top$ on the image plane is denoted as $\pi(P)$, where the coordinates of this projected point correspond to its normalized homogeneous vector representation $\bar{p} = [X/Z, Y/Z, 1]^\top$. The projected point can also be expressed in pixels by $u = K\bar{p}$, where $K$ is known as the camera calibration matrix.

Both in 2D and 3D, a line segment is represented by its endpoints. For the purpose of measuring point-to-line distances, a 2D line is also parameterized in the Hessian normal form $l = [\cos\theta, \sin\theta, -\rho]^\top$, such that $l^\top \bar{p} = 0$ for points on the line.

Furthermore, a point can be expressed with respect to the reference frame of a different camera view through a 3D rigid body transformation $\{R, t\}$, such that $P' = RP + t$. When necessary, rotation is expressed as a unit quaternion to achieve minimal pose parameters $[q_x, q_y, q_z]$, such that $q_w = \sqrt{1 - q_x^2 - q_y^2 - q_z^2}$ for rotation angles between $-180°$ and $180°$.
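
To make this notation concrete, the following minimal NumPy sketch implements the projection, the Hessian normal form of a 2D line and the rigid body transformation described above; the helper names are illustrative and not part of the SPLODE implementation.

```python
import numpy as np

def project(P, K):
    """Pixel coordinates u = K * p_bar of a 3D point P = [X, Y, Z],
    with p_bar = [X/Z, Y/Z, 1] its normalized homogeneous image point."""
    p_bar = np.array([P[0] / P[2], P[1] / P[2], 1.0])
    return (K @ p_bar)[:2]

def line_hessian_normal(a, b):
    """Hessian normal form l = [cos(theta), sin(theta), -rho] of the 2D line
    through endpoints a and b, so that l . [x, y, 1] gives the signed
    point-to-line distance."""
    d = b - a
    n = np.array([-d[1], d[0]])
    n = n / np.linalg.norm(n)
    rho = n @ a                      # distance of the line from the origin
    return np.array([n[0], n[1], -rho])

def transform(P, R, t):
    """Rigid body transformation P' = R P + t between camera frames."""
    return R @ P + t
```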

III-B Pose Estimation

As in several works on RGB-D odometry [2, 4], we avoid working in Euclidean space, due to the depth sensor error at long distances, and instead estimate the frame-to-frame pose by jointly minimizing the reprojection errors of both line and point correspondences. To exploit the depth measurements available in both frames, the bidirectional error is used loosely, such that 3D-to-2D correspondences are used if depth is available in the first frame and 2D-to-3D correspondences are used if depth is available in the second frame. Given a 3D-to-2D point correspondence $\{P, p'\}$, we express the reprojection error in vector form as follows:

$$r_{P,p'} = \pi(RP + t) - p' \qquad (1)$$

where $\{R, t\}$ corresponds to the relative pose estimate. Similarly, the reprojection error of a 2D-to-3D point match $\{p, P'\}$ can be expressed by using the inverse pose. For a 3D-to-2D correspondence of line segments, the residual is defined as the point-to-line distance between a 2D line $l'$ and the projections of the two corresponding 3D line endpoints $\{P_1, P_2\}$ as follows:

$$r_{P_1,P_2,l'} = \begin{bmatrix} l'^\top\, \overline{\pi(RP_1 + t)} \\ l'^\top\, \overline{\pi(RP_2 + t)} \end{bmatrix} \qquad (2)$$

Additionally, the 2D-to-2D point correspondences without depth can also be taken into account by using the residual introduced in [2], which corresponds to the epipolar constraint:

$$r_{p,p'} = \bar{p}'^\top\, [t]_\times R\, \bar{p} \qquad (3)$$

Here, we use a factor $\lambda$ to down-weight these residuals, whereas in [2] the reprojection errors of the 3D-to-2D matches are in fact scaled by the depth, so that their impact on the overall optimization depends on the scene depth and the metric unit. The value of $\lambda$ is set heuristically to 0.01.

These heterogeneous residuals are then stacked together and minimized by using a non-linear least squares algorithm (i.e. Levenberg-Marquardt). Formally, let $\mathcal{A}$ be a set of 3D-to-2D point matches; $\mathcal{B}$, a set of 2D-to-3D point matches; $\mathcal{C}$, a set of 3D-to-2D line matches; $\mathcal{D}$, a set of 2D-to-3D line matches; and $\mathcal{E}$, a set of 2D-to-2D point matches. The optimal pose is then given by the minimization of the following joint cost function:

$$\{R^*, t^*\} = \arg\min_{R,\,t}\ \sum_{k \in \{\mathcal{A},\mathcal{B},\mathcal{C},\mathcal{D}\}} r_k^\top W_k\, r_k \;+\; \lambda\, r_\mathcal{E}^\top W_\mathcal{E}\, r_\mathcal{E} \qquad (4)$$

where the $W_k$ are diagonal matrices containing the Tukey weights, which are computed separately per residual type in an iteratively re-weighted least-squares fashion.
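
A minimal sketch of this robust re-weighting is given below, assuming the residuals and Jacobians are provided by caller-supplied functions (hypothetical names, not SPLODE's actual implementation): Tukey weights are computed separately per residual type and a single Gauss-Newton step is solved on the re-weighted system.

```python
import numpy as np

def tukey_weights(r, c=4.685):
    """Tukey biweight weights for the residuals of one type, scaled by a
    robust standard deviation (median absolute deviation)."""
    sigma = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
    u = r / (c * sigma)
    w = (1.0 - u**2)**2
    w[np.abs(u) >= 1.0] = 0.0
    return w

def irls_step(theta, residual_fn, jacobian_fn, residual_types):
    """One iteratively re-weighted Gauss-Newton step: residual_fn and
    jacobian_fn return the stacked residual vector and Jacobian, and
    residual_types labels each row so that the Tukey weights are computed
    separately per residual type."""
    r = residual_fn(theta)
    J = jacobian_fn(theta)
    w = np.empty_like(r)
    for kind in np.unique(residual_types):
        mask = residual_types == kind
        w[mask] = tukey_weights(r[mask])
    sw = np.sqrt(w)                           # apply weights to both sides
    delta, *_ = np.linalg.lstsq(J * sw[:, None], -r * sw, rcond=None)
    return theta + delta
```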

Degenerate feature configurations are addressed by imposing a minimum of 3 total matches with depth and by checking, after the optimization, the uncertainty of the pose parameters $\Sigma_\theta$, which is approximated as the inverse of the Gauss-Newton approximation to the Hessian of the squared Mahalanobis distance (also known as uncertainty back-propagation [14]):

$$\Sigma_\theta = \left(J^\top \Sigma_r^{-1} J\right)^{-1} \qquad (5)$$

where $J$ is the combined Jacobian matrix of the residual functions with respect to the estimated pose parameters, evaluated at the solution, and $\Sigma_r$ is a diagonal matrix containing the uncertainties of the residuals. Although $\Sigma_r$ can be derived by propagating the uncertainty of the 3D feature coordinates, in this work, for simplicity, we assume homoscedasticity by letting the uncertainty of a residual be the variance of the residuals of the same type.

If the largest eigenvalue of the 3×3 block of $\Sigma_\theta$ corresponding to the translation exceeds a fixed threshold, then the configuration is considered degenerate and a decaying velocity model is applied instead to estimate the pose.
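
The sketch below, under the same homoscedasticity assumption, computes the back-propagated pose covariance of (5) and the degeneracy test on its translation block; the parameter ordering (translation first) and the function names are assumptions for illustration only.

```python
import numpy as np

def pose_covariance(J, r, residual_types):
    """Back-propagated pose uncertainty (5): Sigma_theta = (J^T Sigma_r^-1 J)^-1,
    with Sigma_r diagonal and the variance of each residual taken as the
    variance of the residuals of the same type (homoscedasticity)."""
    var = np.empty_like(r)
    for kind in np.unique(residual_types):
        mask = residual_types == kind
        var[mask] = np.var(r[mask]) + 1e-12
    H = J.T @ (J / var[:, None])          # Gauss-Newton approximation to the Hessian
    return np.linalg.inv(H)

def is_degenerate(Sigma_theta, max_translation_var):
    """Degenerate configuration test: largest eigenvalue of the 3x3 translation
    block of the pose covariance against a fixed threshold (assumes the
    parameter vector is ordered as [t, q])."""
    return np.linalg.eigvalsh(Sigma_theta[:3, :3]).max() > max_translation_var
```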

III-C Point and Line Image Features

At every frame, point and line features are extracted from the RGB image. Specifically, for points we make use of SURF features, whereas for lines we use the LSD [15] method to detect line segments and then extract binarized LBD [16] descriptors (implemented in OpenCV) from the detected line segments. Feature correspondences are then established by matching the descriptors between consecutive frames. Small frame-to-frame motion is assumed in order to prune away matches that are geometrically far apart: after performing a $k$-NN descriptor search for a given point, we select the closest match whose point coordinates lie in a circular region defining the neighbourhood of the query point, whereas, for a given line, we accept the closest line match that has a similar slope angle and distance from the origin, according to their Hessian normal equations.
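
As an illustration of this geometric gating (not the actual SPLODE matcher), the sketch below selects, for each query point, the best descriptor match among its k nearest neighbours that also falls within a fixed pixel radius; an analogous test on slope angle and distance from the origin would gate the line matches.

```python
import numpy as np

def gate_point_matches(query_pts, train_pts, knn_idx, knn_dist, radius):
    """For each query point, keep the best descriptor match among its k nearest
    neighbours whose image coordinates fall within `radius` pixels of the query
    point (small frame-to-frame motion assumption). Returns (query, train)
    index pairs."""
    matches = []
    for i, (idx, dist) in enumerate(zip(knn_idx, knn_dist)):
        for j in idx[np.argsort(dist)]:        # best descriptor distance first
            if np.linalg.norm(train_pts[j] - query_pts[i]) < radius:
                matches.append((i, j))
                break
    return matches
```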

III-D Point Depth Sampling and 3D Line Fitting

For a calibrated RGB-D sensor, the 3D location of feature points is directly obtained from the depth map by back-projecting the corresponding 2D coordinates.

However, a 2D line segment contains multiple depth samples across its length, which suggests that one should exploit them instead of simply back-projecting the two depth pixels corresponding to the line endpoints, as these are noisy, may be missing, or may correspond instead to the background, since lines tend to be detected on object contours where depth is discontinuous. Hence, to obtain the 3D lines, we propose to use the robust method of [6] with two simple modifications that address the computational cost of the original framework:

  1. Using the Euclidean point-to-line distance within the RANSAC outlier filtering instead of the originally proposed Mahalanobis distance, as the two metrics have not shown significant differences in terms of pose estimation accuracy in our experiments (Section IV), whereas the computation of the Mahalanobis distance requires either inverting the uncertainty matrices of the line depth samples or applying whitening transformations to them [17]. In our implementation, using the Eigen library, computing the Mahalanobis distances makes the process roughly 3 times slower, thus we simply use the Euclidean distance with an inlier threshold of 3 cm.

  2. The final line fitting step in [6] is cast as a non-linear least squares problem in order to incorporate the line uncertainty. Instead, we perform PCA on the consensus set of line samples $S$ given by RANSAC (see the sketch after this list). Let the covariance of $S$ be factorized as $\Sigma_S = V \Lambda V^\top$; then we parameterize the line, according to the spread of the samples, with the two 3D points $\{\mu - s v,\ \mu + s v\}$, where $\mu$ is the mean of the samples, $s$ is the square root of the largest eigenvalue in $\Lambda$ and $v$ is the corresponding eigenvector.
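
The sketch below illustrates this modified line fitting under the stated choices (Euclidean point-to-line distance, 3 cm inlier threshold, PCA endpoint parameterization); the iteration budget and guard conditions are illustrative, not taken from the paper.

```python
import numpy as np

def fit_3d_line_ransac(samples, iters=100, inlier_thresh=0.03):
    """Fit a 3D line to the back-projected depth samples of a 2D line segment:
    RANSAC with the Euclidean point-to-line distance (3 cm inlier threshold),
    then the consensus set is parameterized by PCA as the endpoints mu +/- s*v."""
    samples = np.asarray(samples)
    n = len(samples)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(iters):
        i, j = np.random.choice(n, 2, replace=False)
        d = samples[j] - samples[i]
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue
        d = d / norm
        diff = samples - samples[i]
        dist = np.linalg.norm(diff - np.outer(diff @ d, d), axis=1)
        inliers = dist < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 2:                     # degenerate fallback
        best_inliers[:] = True
    S = samples[best_inliers]
    mu = S.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(S.T))   # ascending eigenvalues
    s = np.sqrt(eigval[-1])                        # spread along the dominant direction
    v = eigvec[:, -1]
    return mu - s * v, mu + s * v
```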

Fig. 4: (a) Ideal point triangulation for point matches that satisfy the epipolar constraint. (b) Line triangulation as a line-to-plane intersection.

III-E Line Depth Estimation

As depicted in Fig. 4(b), given a pair of line segment matches, the depth $Z$ of one line segment endpoint $\bar{p}$ can be derived from the point-to-plane constraint, i.e. the intersection of its back-projection ray with the projective plane of the matched line:

$$Z = -\frac{n^\top t}{n^\top R\, \bar{p}} \qquad (6)$$

where $n$ is the normal of the projective plane, obtained by the cross-product of the two (homogeneous) endpoints on the second image. This expression is ill-posed when the rotated ray $R\bar{p}$ lies on the projective plane of the other frame (i.e. $n^\top R\bar{p} \approx 0$); therefore it may result in a negative depth even for a correct line match. Thus, to detect outliers, endpoints with invalid depth estimates are assigned an arbitrary positive depth and then reprojected onto the second frame to check the point-to-line distance.
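
A minimal sketch of this line-to-plane triangulation, using the notation of Section III-A; the function name is illustrative, and a negative or very large return value signals the ill-posed case discussed above.

```python
import numpy as np

def line_endpoint_depth(p_bar, q1_bar, q2_bar, R, t):
    """Depth of a line endpoint with homogeneous coordinates p_bar in the first
    frame, obtained by intersecting its back-projection ray with the projective
    plane spanned by the matched line endpoints q1_bar, q2_bar in the second
    frame. The 3D endpoint Z*p_bar must satisfy n . (R*Z*p_bar + t) = 0."""
    n = np.cross(q1_bar, q2_bar)          # normal of the projective plane
    return -(n @ t) / (n @ (R @ p_bar))
```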

III-F Point Depth Estimation

In practice, the feature point matches resulting from the descriptor matching do not meet the epipolar constraint exactly, meaning that their projection rays do not intersect. This issue is addressed by employing the commonly used linear triangulation method (described in [14], p. 312) to estimate the 3D point coordinates, which in turn are used to obtain corrected projected point coordinates. The projection errors between the original and the corrected coordinates are checked to detect outliers. Additionally, for invalid depth estimates we use an outlier detection procedure equivalent to the one described in the previous section. Since the problem of point triangulation is overdetermined, the triangulation method [14] relies on a least-squares solution, which makes the derivation of depth uncertainty less straightforward; however, given the corrected coordinates $\{\bar{p}, \bar{p}'\}$, we can arrive, through the trigonometry represented in Fig. 4(a), at the following formula for the depth in the first frame:

$$Z = \frac{\|t\|\, \sin\beta}{\|\bar{p}\|\, \sin(\alpha + \beta)} \qquad (7)$$

where $\alpha$ and $\beta$ are the angles that the corrected back-projection rays make with the baseline:

$$\cos\alpha = \frac{-\left(R\bar{p}\right)^\top t}{\|R\bar{p}\|\, \|t\|}, \qquad \cos\beta = \frac{\bar{p}'^\top t}{\|\bar{p}'\|\, \|t\|} \qquad (8)$$
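
For illustration, the sketch below performs the linear (DLT) triangulation of [14] and then recovers the first-frame depth from the corrected rays via the law of sines; the function name and return values are illustrative, not SPLODE's interface.

```python
import numpy as np

def triangulate_depth(p_bar, q_bar, R, t, K):
    """Linear (DLT) triangulation of a point match, followed by the sine-rule
    depth of the corrected point in the first frame. p_bar and q_bar are the
    homogeneous image points in the first and second frames; (R, t) maps the
    first frame into the second (P2 = R P1 + t)."""
    # Projection matrices with the first camera as the world frame.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t.reshape(3, 1)])
    u1, u2 = K @ p_bar, K @ q_bar
    A = np.vstack([u1[0] * P1[2] - P1[0],
                   u1[1] * P1[2] - P1[1],
                   u2[0] * P2[2] - P2[0],
                   u2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    X = X[:3] / X[3]                                  # triangulated point, first frame
    # Corrected (epipolar-consistent) rays in each frame.
    r1 = np.array([X[0] / X[2], X[1] / X[2], 1.0])
    X2 = R @ X + t
    r2 = np.array([X2[0] / X2[2], X2[1] / X2[2], 1.0])
    # Sine-rule depth along the first camera's optical axis, cf. Fig. 4(a).
    cos_a = -(R @ r1) @ t / (np.linalg.norm(R @ r1) * np.linalg.norm(t))
    cos_b = r2 @ t / (np.linalg.norm(r2) * np.linalg.norm(t))
    a = np.arccos(np.clip(cos_a, -1.0, 1.0))
    b = np.arccos(np.clip(cos_b, -1.0, 1.0))
    Z = np.linalg.norm(t) * np.sin(b) / (np.linalg.norm(r1) * np.sin(a + b))
    return Z, r1, r2
```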

III-G Uncertainty of Depth and Pose Estimates

The uncertainty of the depth estimation is modelled in terms of inverse depth variance to allow the initialization of 3D features that are far away from the camera. Based on first-order error propagation, the uncertainty of the general inverse depth, denoted here as $\rho = 1/Z$, is given by:

$$\sigma_\rho^2 = J_\rho\, \Sigma_x\, J_\rho^\top \qquad (9)$$

where $J_\rho$ is the Jacobian of the inverse depth function with respect to the input variables and $\Sigma_x$ is the variables’ uncertainty matrix. Specifically, this inverse depth function corresponds to the inverse of (6) and (7) for, respectively, a line endpoint and a point.

Regarding the variables’ uncertainties, depth estimation for a point correspondence depends on two image points $\{\bar{p}, \bar{p}'\}$ and on the stereo pose parameters, whereas the depth estimation of a line endpoint depends on three image points (the endpoint on the first frame and the two endpoints of the matched line segment on the second frame) and on the same pose parameters. While the uncertainty of each point coordinate is set to 1 pixel, the block of $\Sigma_x$ corresponding to the pose is derived from the propagation of the pose uncertainty as follows.

Let the parameters $\theta_{k,j}$ express the camera pose at frame $k$ with respect to frame $j$ and $\Sigma_{k,j}$ be the uncertainty obtained by (5); then the pose $\theta_{k,i}$ with respect to an earlier frame $i$ is given by the state transition function:

$$\theta_{k,i} = f(\theta_{k,j}, \theta_{j,i}) = \begin{bmatrix} R_{k,j}\, t_{j,i} + t_{k,j} \\ q_{k,j} \otimes q_{j,i} \end{bmatrix} \qquad (10)$$

with $\otimes$ denoting quaternion multiplication, and the respective pose uncertainty is given according to the EKF propagation equation as:

$$\Sigma_{k,i} = J_a\, \Sigma_{k,j}\, J_a^\top + J_b\, \Sigma_{j,i}\, J_b^\top \qquad (11)$$

where $J_a$ and $J_b$ are the Jacobians of the first six components of (10) with respect to $\theta_{k,j}$ and $\theta_{j,i}$, respectively.

To avoid the redundancy of calculating the stereo pose uncertainty for each feature, a sliding window of transformations to the current frame and the respective uncertainties is maintained and updated using (10) and (11), as depicted in Fig. 5.
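
A sketch of the pose composition and covariance propagation used to maintain the sliding window is given below; full [w, x, y, z] quaternions are used here for clarity, and the Jacobians J_a and J_b are assumed to be supplied by the caller (e.g. computed numerically).

```python
import numpy as np

def quat_mult(q1, q2):
    """Hamilton product of quaternions given as [w, x, y, z]."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def compose_pose(pose_kj, pose_ji):
    """State transition (10): compose the newest frame-to-frame pose (k w.r.t. j)
    with an accumulated window pose (j w.r.t. i). Poses are (R, t, q) tuples,
    with R the rotation matrix of q kept redundantly for convenience."""
    R_kj, t_kj, q_kj = pose_kj
    R_ji, t_ji, q_ji = pose_ji
    return R_kj @ R_ji, R_kj @ t_ji + t_kj, quat_mult(q_kj, q_ji)

def propagate_covariance(J_a, J_b, Sigma_kj, Sigma_ji):
    """EKF-style covariance propagation (11), given the Jacobians of the
    composition with respect to each of the two poses."""
    return J_a @ Sigma_kj @ J_a.T + J_b @ Sigma_ji @ J_b.T
```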

III-H Fusion of Depth Estimates

Sequential filtering is performed in order to fuse the estimates from multiple views. Each tracked point has an associated 1D Gaussian PDF representing its inverse depth state estimate $\mathcal{N}(\rho_p, \sigma_p^2)$. This is initialized with $\rho_p = 0$ (i.e. infinite depth) and a large variance, and it is updated whenever there is a new triangulation. Given the PDF of a new triangulated inverse depth estimate, represented as $\mathcal{N}(\rho_o, \sigma_o^2)$, the prior depth estimate state $\mathcal{N}(\rho_p, \sigma_p^2)$ is updated by the product of the two Gaussians:

$$\rho \sim \mathcal{N}\!\left(\frac{\sigma_o^2\, \rho_p + \sigma_p^2\, \rho_o}{\sigma_p^2 + \sigma_o^2},\ \frac{\sigma_p^2\, \sigma_o^2}{\sigma_p^2 + \sigma_o^2}\right) \qquad (12)$$

For point features, when $\sigma_p$ drops below a fixed threshold, we use $\rho_p$ to back-project a 3D point, so that it can be used in the next frame. For line features, the 3D endpoints are initialized only when the uncertainties of both endpoints are sufficiently low.
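
This Gaussian fusion reduces to a few lines; the example below is a minimal sketch in which a prior at (practically) infinite depth with a large variance is pulled towards a precise new observation (the numeric values are purely illustrative).

```python
def fuse_inverse_depth(rho_p, var_p, rho_o, var_o):
    """Product of two Gaussians: fuse the prior inverse-depth state
    N(rho_p, var_p) with a new triangulated observation N(rho_o, var_o)."""
    var = (var_p * var_o) / (var_p + var_o)
    rho = (var_o * rho_p + var_p * rho_o) / (var_p + var_o)
    return rho, var

# Prior at infinite depth (rho = 0) with large variance, precise new observation:
rho, var = fuse_inverse_depth(0.0, 1e2, 0.5, 1e-3)   # rho ~= 0.5, var ~= 1e-3
```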

Fig. 5: Illustration of the depth map registration at the current frame, along with a sliding window of pose uncertainties, which is used both to limit the propagation of depth measurements and to derive the uncertainty of the triangulated depth. In this example, the uncertainty of the transformation between the current frame and the oldest frame in the window is considered to be too high, therefore all depth measurements from that frame are forgotten.

III-I Depth Map Registration

Depth measurements from past frames are propagated as a registered point cloud to recover the missing depth measurements of the current depth map. A denser depth map can then be obtained by transforming and projecting the registered point cloud to the current camera view. At each frame, the point cloud is filtered and updated with new measurements, as shown in Fig. 5, according to the following criteria:

  • Transformation uncertainty: To avoid inaccurate point registration due to the growing pose error, points measured at frames whose transformation to the current frame has an uncertainty indicator higher than a certain limit are removed from the point cloud. Specifically, a sliding window of transformation uncertainties is maintained and their largest eigenvalues are used as an uncertainty indicator, as described in Section III-B.

  • Field-of-view: Only points within the FOV of the color camera are maintained.

  • One point per pixel: Each point must have a unique image projection pixel. In case of conflict, the most recent point is kept.

While one could devise more sophisticated methods for combining the depth measurements, e.g., averaging depth observations, an important advantage of this simple method is its efficiency, since the size of the point cloud is bounded by the image resolution due to the last two criteria.
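
A sketch of this filtering step is shown below, assuming the registered cloud is kept together with the id of the frame each point was measured in; the argument names and the dictionary-based depth buffer are illustrative choices, not the paper's implementation.

```python
import numpy as np

def update_registered_cloud(cloud, frame_ids, window_uncertainty, max_uncertainty,
                            R, t, K, width, height):
    """Filter the registered point cloud: drop points from frames whose
    transformation to the current frame is too uncertain, keep only points
    that project inside the colour image, and enforce one point per pixel,
    preferring the most recent measurement."""
    # 1) Transformation-uncertainty criterion.
    keep = np.array([window_uncertainty[f] < max_uncertainty for f in frame_ids],
                    dtype=bool)
    P = (R @ cloud[keep].T).T + t          # points expressed in the current frame
    ids = frame_ids[keep]

    # 2) Field-of-view criterion.
    in_front = P[:, 2] > 0
    P, ids = P[in_front], ids[in_front]
    u = (K @ P.T).T
    px = np.rint(u[:, :2] / u[:, 2:3]).astype(int)
    in_fov = (px[:, 0] >= 0) & (px[:, 0] < width) & \
             (px[:, 1] >= 0) & (px[:, 1] < height)
    P, px, ids = P[in_fov], px[in_fov], ids[in_fov]

    # 3) One point per pixel: newer measurements overwrite older ones.
    buffer = {}
    for idx in np.argsort(ids):
        buffer[(px[idx, 0], px[idx, 1])] = idx
    sel = np.array(sorted(buffer.values()), dtype=int)
    return P[sel], ids[sel]
```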

IV Experiments

The performance of our visual odometry method is evaluated on the TUM RGB-D dataset [18] and on an author-collected RGB-D dataset. All sequences have been captured with a structured-light camera and a color camera operating at 30 fps with a resolution of 640×480.

| Sequence | Direct sampling | Euclidean dist. | Mahalanobis dist. |
| --- | --- | --- | --- |
| fr1/desk | 42 mm / 2.2° | 41 mm / 2.2° | 40 mm / 2.2° |
| fr2/large_no_loop | 85 mm / 1.2° | 74 mm / 1.2° | 75 mm / 1.2° |

TABLE I: RMSE of relative pose per second for line odometry with three different 3D line fitting schemes and no depth estimates. Direct sampling simply back-projects the 2D line endpoints, whereas the others use the RANSAC method described in Section III-D with different types of inlier distances.
| Sequence | Depth estimates | SPLODE-A (Points) | SPLODE-A (Lines) | SPLODE-A (All) | Best (others) |
| --- | --- | --- | --- | --- | --- |
| fr1/desk † | No | 32 mm / 2.4° | 41 mm / 2.2° | 29 mm / 2.2° | 26 mm [19] |
| fr1/360 † | No | 89 mm / 3.5° | 80 mm / 2.6° | 66 mm / 3.0° | 84 mm [6] |
| fr2/360_hemisphere | No | 78 mm / 1.6° | 90 mm / 1.1° | 74 mm / 1.0° | - |
|  | Yes | 70 mm / 1.0° | 77 mm / 1.0° | 66 mm / 1.0° |  |
| fr2/large_no_loop | No | 91 mm / 1.2° | 74 mm / 1.2° | 74 mm / 1.1° | 96 mm [7] |
|  | Yes | 89 mm / 1.2° | 65 mm / 1.1° | 72 mm / 1.1° |  |
| fr2/pioneer_360 | No | 96 mm / 2.1° | 84 mm / 2.5° | 57 mm / 2.4° | - |
|  | Yes | 67 mm / 1.9° | 72 mm / 2.5° | 53 mm / 1.9° |  |

TABLE II: RMSE of relative pose per second on TUM sequences. We also report, to our knowledge, the best translational errors obtained by other visual odometry methods. The symbol † marks sequences captured in small environments where depth measurements are abundant, thus including the depth estimates did not change the performance significantly.

IV-A TUM RGB-D Dataset

We have conducted experiments on two distinct types of environments captured in the TUM RGB-D dataset: a small office and a large industrial hall, denoted respectively as fr1 and fr2. While the sequences recorded in the former are commonly used for benchmarking RGB-D odometry and SLAM methods, few works have addressed the fr2 sequences. We believe this is mostly due to the large scene depth (see Fig. 8), which leads to insufficient depth measurements, which in turn causes tracking failures. Consequently, only fractions of these sequences were evaluated in [3]. In contrast, our results are reported for the full sequences, except for fr2/large_no_loop, which only has ground truth for about 20% of the sequence duration. As described in Section III-B, tracking or pose estimation failures are handled by using a velocity model.

Performance was evaluated in terms of pose drift as the relative pose error (RPE) per second. First, Table I shows how the choice of 3D line fitting affects the pose error. We have observed that when the depth information is rich, simple back-projection of the endpoints seems enough; however, when the depth measurements are severely compromised, the robust line fitting becomes necessary. No significant differences have been observed, however, between using the Mahalanobis and the Euclidean distance.

As can be seen in Table II, including depth estimates reduces the overall odometry drift on the fr2 scene. Moreover, the number of frames where tracking fails is reduced: e.g., in fr2/360_hemisphere, while point odometry without depth estimates misses 29 frames, with the depth estimates tracking is not interrupted. Since no significant differences were observed between the two depth estimation schemes on this dataset, we only report results for mode A, which is preferred in terms of computational cost.

Fig. 6: Top view of trajectories estimated by our point and line odometry without depth estimates: PLO, and with depth estimates: SPLODE-A and SPLODE-B for the respective schemes, on the foyer sequence (left image) and the park sequence (right image). The ground-truth of the last position is pinpointed as: end gt.

IV-B Author-Collected Dataset

To further evaluate the robustness of the proposed method, we have captured two RGB-D sequences that are challenging for depth sensors: a large foyer and a park under sun exposure, shown in Figs. 1 and 8. The respective trajectories estimated by SPLODE are shown in Fig. 6, the final error of the trajectories is reported in Table III and a video is available at: https://youtu.be/6lMwPCiCXZc.

As shown in Fig. 6, pure RGB-D odometry with points and lines (i.e. PLO) fails, even with depth map registration, by frequently making gross errors or failing to estimate pose, whereas the integration of depth estimates allows the method to perform well in both environments. In terms of features (see Table III), point-based odometry performs worse than line-based odometry in the foyer sequence due to the predominance of lines. On the other hand, point-based odometry performs well in the park sequence, whereas using just lines, as expected, is not sufficient, since straight lines are less common in nature.

In this dataset, mode B yields generally lower errors than mode A. This can be explained by the existence of spurious depth measurements due to NIR interference (shown in Fig. 8(d)) and non-Lambertian effects (shown in Fig. 7). These depth errors are further propagated and accumulated by the depth map registration, which can severely affect the pose estimation, as shown in Fig. 7, especially for mode A, which relies more on the depth measurements.

| Sequence (Length) | Depth estimates | Mode | Points | Lines | All |
| --- | --- | --- | --- | --- | --- |
| Foyer (65 m) | No | - | Fail | Fail | 4.4 m / 3 losses |
|  | Yes | A | 2.3 m / 11 losses | 2.8 m / 0 losses | 1.2 m / 0 losses |
|  | Yes | B | 2.6 m / 19 losses | 1.7 m / 0 losses | 0.9 m / 0 losses |
| Park (76 m) | No | - | 4.9 m / 38 losses | Fail | 7.0 m / 5 losses |
|  | Yes | A | 3.5 m / 0 losses | Fail | 2.8 m / 0 losses |
|  | Yes | B | 2.0 m / 0 losses | Fail | 1.8 m / 0 losses |
TABLE III: Performance of SPLODE on author-collected sequences reported as the final error of the trajectory and the number of frames where pose estimation is declared to fail.
Fig. 7: Trajectories obtained for the foyer sequence by disabling the uncertainty condition of the depth map registration. Features with two depth hypotheses (used by SPLODE-B) are shown in magenta, and features with either depth measurements or estimates are shown respectively in green and blue. Spurious depth measurements, associated with several features observed through the window, are propagated by the depth map registration and eventually cause mode A to fail, while mode B is able to survive them.
(a) fr2/360_hemisphere (b) fr2/360_hemisphere (c) foyer (d) park
Fig. 8: Examples of the sequences used in our experiments. Top: Inverse depth of features obtained by either the depth map or the depth estimation process. Middle: Inverse depth maps obtained by the depth map registration process. Bottom: Original inverse depth maps captured by the depth camera.

V Conclusions

This paper presents an RGB-D odometry method that achieves robustness to poorly textured scenes and depth sensor failures by combining point and line features, and by complementing depth sensor measurements with temporal stereo.

Our results show that the depth estimation framework introduced by this method is beneficial in large indoor environments (e.g. warehouses) and outdoor environments, where the depth information captured by depth cameras is too sparse. Moreover, our experiments indicate that gross depth errors occur typically outdoors, due to NIR interference from sunlight, and that their impact on the pose estimation can be minimized by adopting a more active depth estimation approach (referred to as mode B) than just estimating depth when measurements are missing (mode A). However, because mode B is more computationally intensive than mode A, adaptive switching between the two is a worthwhile research direction. Another possible improvement of the system is to make the pose estimation also probabilistic by weighting the residuals according to the uncertainties of both depth measurements and estimates: in this work, SPLODE relies on a finely-tuned maximum uncertainty threshold to accept sufficiently precise depth estimates, but by using the depth estimate uncertainty in the pose estimation, such a threshold could be relaxed to exploit more features with higher uncertainties.

References

  • [1] G. Hu, S. Huang, L. Zhao, A. Alempijevic, and G. Dissanayake, “A robust rgb-d slam algorithm,” in International Conference on Intelligent Robots and Systems (IROS), 2012.
  • [2] J. Zhang, M. Kaess, and S. Singh, “Real-time depth enhanced monocular odometry,” in International Conference on Intelligent Robots and Systems (IROS), 2014.
  • [3] E. Ataer-Cansizoglu, Y. Taguchi, and S. Ramalingam, “Pinpoint slam: A hybrid of 2d and 3d simultaneous localization and mapping for rgb-d sensors,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • [4] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “Rgb-d mapping: Using kinect-style depth cameras for dense 3d modeling of indoor environments,” The International Journal of Robotics Research, vol. 31, no. 5, pp. 647–663, 2012.
  • [5] C. Kerl, J. Sturm, and D. Cremers, “Dense visual slam for rgb-d cameras,” in International Conference on Intelligent Robots and Systems (IROS), pp. 2100–2106, IEEE, 2013.
  • [6] Y. Lu and D. Song, “Robust rgb-d odometry using point and line features,” in IEEE International Conference on Computer Vision (ICCV), 2015.
  • [7] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald, “Robust real-time visual odometry for dense rgb-d mapping,” in IEEE International Conference on Robotics and Automation (ICRA), 2013.
  • [8] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,” in Proceedings of the IEEE international conference on computer vision, pp. 1449–1456, 2013.
  • [9] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct monocular visual odometry,” in IEEE International Conference on Robotics and Automation (ICRA), 2014.
  • [10] J. Engel, J. Stueckler, and D. Cremers, “Large-scale direct slam with stereo cameras,” in International Conference on Intelligent Robots and Systems (IROS), 2015.
  • [11] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo and rgb-d cameras,” arXiv preprint arXiv:1610.06475, 2016.
  • [12] R. Gomez-Ojeda, J. Briales, and J. González-Jiménez, “Pl-svo: Semi-direct monocular visual odometry by combining points and line segments,” in International Conference on Intelligent Robots and Systems (IROS), 2016.
  • [13] Y. Taguchi, Y.-D. Jian, S. Ramalingam, and C. Feng, “Point-plane slam for hand-held 3d sensors,” in IEEE International Conference on Robotics and Automation (ICRA), pp. 5182–5189, 2013.
  • [14] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [15] R. Grompone von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “LSD: a fast line segment detector with a false detection control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 4, pp. 722–732, 2010.
  • [16] L. Zhang and R. Koch, “An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency,” Journal of Visual Communication and Image Representation, vol. 24, no. 7, pp. 794–805, 2013.
  • [17] Z. Lu, S. Baek, and S. Lee, “Robust 3d line extraction from stereo point clouds,” in IEEE Conference on Robotics, Automation and Mechatronics, 2008.
  • [18] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in International Conference on Intelligent Robots and Systems (IROS), 2012.
  • [19] D. Gutierrez-Gomez, W. Mayol-Cuevas, and J. J. Guerrero, “Dense rgb-d visual odometry using inverse depth,” Robotics and Autonomous Systems, vol. 75, pp. 571–583, 2016.