# Are We Ready for Service Robots? The OpenLORIS-Scene Datasets for Lifelong SLAM

Xuesong Shi, Dongjiang Li, Pengpeng Zhao, Qinbin Tian, Yuxin Tian, Qiwei Long,
Chunhao Zhu, Jingwei Song, Fei Qiao, Le Song, Yangquan Guo, Zhigang Wang,
Yimin Zhang, Baoxing Qin, Wei Yang, Fangshi Wang, Rosa H. M. Chan and Qi She
*Equal contribution. Intel Labs China, Beijing, 100190 China. Department of Electronic Engineering and BNRist, Tsinghua University, Beijing, 100084 China. Gaussian Robotics, Shanghai, 201203 China. Beijing Jiaotong University, Beijing, 100044 China. Beihang University, Beijing, 100191 China. City University of Hong Kong, Hong Kong, China. Corresponding authors: xuesong.shi@intel.com, qiaofei@tsinghua.edu.cn.
###### Abstract

Service robots should be able to operate autonomously in dynamic and daily changing environments over an extended period of time. While Simultaneous Localization And Mapping (SLAM) is one of the most fundamental problems for robotic autonomy, most existing SLAM works are evaluated with data sequences that are recorded in a short period of time. In real-world deployment, there can be out-of-sight scene changes caused by both natural factors and human activities. For example, in home scenarios, most objects may be movable, replaceable or deformable, and the visual features of the same place may differ significantly between successive days. Such out-of-sight dynamics pose great challenges to the robustness of pose estimation, and hence to a robot’s long-term deployment and operation. To differentiate the aforementioned problem from conventional works, which are usually evaluated in a static setting in a single run, the term lifelong SLAM is used here to address SLAM problems in an ever-changing environment over a long period of time. To accelerate lifelong SLAM research, we release the OpenLORIS-Scene datasets. The data are collected in real-world indoor scenes, multiple times in each place, to include scene changes in real life. We also design benchmarking metrics for lifelong SLAM, with which the robustness and accuracy of pose estimation are evaluated separately. The datasets and benchmark are available online at lifelong-robotic-vision.github.io/dataset/scene.

## I Introduction

The capability of continuous self-localization is fundamental to autonomous service robots. Visual Simultaneous Localization and Mapping (SLAM) has been proposed and studied for decades in robotics and computer vision. There have been a number of open source SLAM systems with careful designs and heavily optimized implementations. Do they suffice for deployment in real-world robots? We claim there is still a gap, coming from the fact that most SLAM systems are designed and evaluated for a single operation. That is, a robot moves through a region, large or small, with a fresh start. Real-world service robots, on the contrary, usually need to operate in a region day after day, with the requirement of reusing a persistent map in each operation to retain spatial knowledge and coordinate consistency. This requirement is more than saving the map and loading it for the next operation. The scene changes in real life and other uncontrolled factors in a long-term deployment bring considerable challenges to SLAM algorithms.

In this work, we use the term lifelong SLAM to describe the SLAM problem in long-term robot deployments. For a robot that needs to operate around a particular region over an extended period of time, the capability of lifelong SLAM aims to build and maintain a persistent map of this region and to continuously locate the robot itself in the map during its operations. To this end, the map must be reused in different operations, even if there are changes in the environment.

We summarize the major sources of algorithmic challenges for lifelong SLAM as follows:

• Changed viewpoints - the robot may see the same objects or scene from different directions.

• Changed things - objects and other things may have been changed when the robot re-enters a previously observed area.

• Changed illumination - the illumination may change dramatically.

• Dynamic objects - there may be moving or deforming objects in the scene.

• Degraded sensors - there may be unpredictable sensor noises and out-of-calibrations due to mechanical stress, temperature change, dirty or wet lens, etc.

While each of these challenges has been more or less addressed in existing works, there is a lack of public datasets and benchmarks to unify the efforts towards building practical lifelong SLAM systems. Therefore, we introduce the OpenLORIS-Scene datasets, which are particularly built for the research of lifelong SLAM for service robots. The data are collected with commodity sensors carried by a wheeled robot in typical indoor environments as shown in Fig. 1. Ground-truth robot poses are provided based on either a Motion Capture System (MCS) or a high-accuracy LiDAR. The major distinctions of our datasets are:

• The data are from real-world scenes with people in them.

• There are multiple data sequences for each scene, which include not only changes in illumination and viewpoints, but also scene changes caused by human activities in their real life.

• There is a rich combination of sensors including RGB-D, stereo fisheyes, inertial measurement units (IMUs), wheel odometry and LiDAR, which can enable comparison of algorithms with different types of inputs.

This work also introduces new metrics to evaluate lifelong SLAM algorithms. As we believe the robustness of localization should be the most important concern, we use correct rates to explicitly evaluate it, as opposed to existing benchmarks where robustness is partially implied by the accuracy metrics.

## II Related Works

The adjective lifelong has been used in SLAM-related works to emphasize either or both of two capabilities: robustness against scene changes, and scalability in the long run. A survey of both directions can be found in [3].

Most SLAM works evaluate their algorithms on one or more public datasets to justify their effectiveness in certain aspects. The most widely used datasets include TUM RGB-D [18], EuRoC MAV [2] and KITTI [7]. A recent contribution is the TUM VI benchmark [17], where aligned visual and IMU data are provided. One of the major distinctions between those datasets is their sensor types. While recent SLAM research favors RGB-D data for dense scene reconstruction, datasets with both RGB-D and IMU data are lacking. Our dataset provides aligned RGB-D-IMU data, along with odometry data, which are widely used in the industry but often missing from public datasets.

Synthesized datasets are also used for SLAM evaluation. Recent progress in random scene generation and photo-realistic rendering [12][11] makes it theoretically possible to synthesize scene changes for lifelong SLAM, but it would be difficult to model changes as realistic as those that occur in real life.

For real-world scene changes, the COLD database [14] provides visual data of several scenes with variations caused by weather, illumination, and human activities. It is the work most closely related to ours in the principle of data collection, though with a different sensor setup. Object-level variations can also be found in the change detection datasets [5], but they are not designed for SLAM and do not provide ground-truth camera poses.

Recently there have been efforts towards unified SLAM benchmarking and automatic parameter tuning [1][22]. Our work contributes to this direction by introducing new data and performance metrics.

## III OpenLORIS-Scene Datasets

The OpenLORIS-Scene datasets are designed to be a testbed for the real-world practicality of lifelong SLAM algorithms for service robots. Therefore, the major principle is to make the data as close to service robot scenarios as possible. Commercial wheeled robot models equipped with commodity sensors are used to collect data in typical indoor scenes with people in them, as shown in Fig. 1. Rich types of data are provided to enable comparison of methods with different kinds of inputs, as listed in Table I. All the data are calibrated and synchronized.

### III-A Sensors

To enable monocular, stereo, RGB-D and visual-inertial SLAM algorithms, two camera devices are used for data collection: a RealSense D435i providing RGB-D images and IMU measurements, and a RealSense T265 tracking module providing stereo fisheye images and IMU measurements. The IMU data are hardware synchronized with images from the same device. Both cameras are mounted on the top board of a customized Segway Deliverybot S1 robot, front-facing, at a height of about one meter. The resolution of the RGB-D images is chosen to maximize the field of view (FOV) and to obtain the best depth quality [8]. We provide not only aligned depth data as in other RGB-D datasets, but also raw depth images, since they have a larger FOV and could benefit depth-based SLAM algorithms.

Wheel encoder-based odometry data are also provided, as they are widely available in wheeled robots. The odometry data in the datasets are fused from wheel encoders and a chassis IMU by the proprietary filtering algorithms that come with the robot.

The robot is also equipped with markers of an OptiTrack MCS and a Hokuyo UTM-30LX LiDAR, all near the cameras, to provide ground-truth poses of the robot.

### III-B Calibration

The intrinsics and intra-device extrinsics of cameras and IMUs are from factory calibration. Other extrinsics are calibrated with various tools listed in Table II. Redundant calibrations are made for quality evaluation. Each non-camera sensor (MCS, LiDAR and odometer) is calibrated against both cameras, so that the extrinsics between the two cameras can be deduced and then compared with their extrinsics directly calibrated with Kalibr [6]. The resulting errors are all below 1 cm in translation and 2° in rotation, except for the odometry calibration, whose translation error is 7 cm.
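
The redundancy check described above can be illustrated with a short sketch. It is not the calibration tooling used for the datasets; the function names, the 4x4 nested-list representation and the frame convention (`t_x_y` is the pose of frame y expressed in frame x) are assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R p; 0 1] as [R^T, -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def yaw_transform(yaw_deg, x, y, z):
    """Example helper (hypothetical): rotation about z plus a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0, x], [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def extrinsic_discrepancy(t_a_s, t_b_s, t_a_b_direct):
    """Compare camera-to-camera extrinsics deduced via sensor s with a direct calibration.

    Returns (translation error in input units, rotation error in degrees).
    """
    deduced = matmul(t_a_s, invert_rigid(t_b_s))          # T_a_b = T_a_s * T_s_b
    delta = matmul(invert_rigid(t_a_b_direct), deduced)   # identity if both agree
    trans_err = math.sqrt(sum(delta[i][3] ** 2 for i in range(3)))
    cos_angle = (delta[0][0] + delta[1][1] + delta[2][2] - 1.0) / 2.0
    rot_err = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return trans_err, rot_err
```

Under these assumptions, the two returned numbers correspond to the translation and rotation discrepancies that the text reports as below 1 cm and 2° for most sensor pairs.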

### III-C Synchronization

Images and IMU measurements from the same RealSense device are hardware synchronized. Software synchronization is performed for each data sequence between data from different devices, including RealSense D435i, RealSense T265, LiDAR, MCS and odometer. For each of those devices, its trajectory can be obtained either via a SLAM algorithm or directly from the measurements. Those per-device trajectories are then synchronized by finding the optimal time offsets that minimize the RMSE of absolute trajectory errors (ATEs). The ATEs of each per-device trajectory are calculated against the trajectory of the MCS for the office scene, and of the T265 for the others, as these two provide poses at the highest rates.

To mitigate the effect of SLAM and measurement noise, we generated a controlled piece of data at the beginning of each data sequence by pushing the robot back and forth several times in a static and feature-rich area, and used only this piece of data for synchronization.

The synchronization quality is evaluated by the consistency of the resulting optimal time offsets. From our experiments, the standard deviation of the offsets ranges from 1.7 ms (MCS to T265) to 7.4 ms (odometry to T265), with a positive correlation with the measurement cycle of each sensor. We consider the results acceptable for our scenarios, yet better synchronization methods can be discussed. One inherent drawback of the ATE minimization method is that systematic errors can be introduced if the scale of each estimated trajectory differs, which is frequently observed in the data. We mitigate this effect by using back-and-forth trajectories instead of move-and-stop ones, and also by carefully selecting a period of data in which all trajectories can be best matched.
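
As a rough illustration of offset search by error minimization, the sketch below performs a grid search over candidate time offsets. It is a simplification under stated assumptions: the back-and-forth motion is reduced to a one-dimensional displacement signal, the search is a plain grid search, and all function names and parameters (`search`, `step`) are hypothetical rather than taken from the released tools.

```python
import bisect
import math

def interpolate(times, values, t):
    """Linearly interpolate values at time t; None if t is outside the sampled span."""
    if t < times[0] or t > times[-1]:
        return None
    i = bisect.bisect_right(times, t) - 1
    if i >= len(times) - 1:
        return values[-1]
    w = (t - times[i]) / (times[i + 1] - times[i])
    return values[i] + w * (values[i + 1] - values[i])

def rmse_at_offset(ref_t, ref_x, dev_t, dev_x, offset):
    """RMSE between device samples shifted by `offset` and the reference signal."""
    sq = []
    for t, x in zip(dev_t, dev_x):
        r = interpolate(ref_t, ref_x, t + offset)
        if r is not None:
            sq.append((x - r) ** 2)
    return math.sqrt(sum(sq) / len(sq)) if sq else math.inf

def best_time_offset(ref_t, ref_x, dev_t, dev_x, search=0.2, step=0.001):
    """Grid-search the offset (seconds) to add to device timestamps that minimizes the RMSE."""
    n = int(round(search / step))
    candidates = [k * step for k in range(-n, n + 1)]
    return min(candidates,
               key=lambda o: rmse_at_offset(ref_t, ref_x, dev_t, dev_x, o))
```

A back-and-forth motion gives the error curve a clear minimum, which is one reason such a controlled segment is convenient for this kind of search.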

### III-D Scenes and Sequences

There are five scenes in the current datasets. For each scene, there are 2-7 data sequences recorded at different times. The sequences are manually selected and clipped from a much larger set of recordings to form a concise benchmark that includes most major challenges in lifelong SLAM.

• office: 7 sequences in a university office with benches and cubicles. The changes in this scene are controlled: in office-1 the robot walked along a U-shape route; in office-2 the scene is unchanged but the route is reversed, so that the cameras observe from opposite view angles; office-3 is a turn-around that could be used to connect the maps constructed from office-1 and office-2 if they had not been aligned; in office-4 and office-5 the illumination is different from the first three sequences; in office-6 there are object changes; and office-7 further introduces dynamic objects (persons).

• corridor: 5 sequences in a long corridor with a lobby in the middle and the above office at one end. Apart from the well-known challenges in feature-poor long corridors, additional difficulties come from the high contrast between the corridor and the window at daytime, and extremely low light at night. Between sequences, there are not only illumination changes, but also moved furniture, which could make re-localization and loop closure a tough task. The large size of the scene would also magnify the inconsistency of maps from different sequences if SLAM algorithms fail to align them.

• home: 5 sequences in a two-bedroom apartment. There are many scene differences between sequences, such as changed sheets and curtains, moved sofa and chairs, and people moving around.

• cafe: 2 sequences in an open café. There are different people and different things in each sequence.

• market: 3 sequences in an open supermarket. This scene is recorded with a different robot and different calibration methods from those described in the previous subsections, but with the same data types and formats.
### III-E Ground-truth
For each scene, ground-truth robot poses in a persistent map are provided for all sequences. For the office scene they are obtained from an MCS that fully covers all the sequences, with a persistent coordinate system. The MCS-based ground-truth is provided at a rate of 240 Hz, with outliers removed. For other scenes, a 2D laser SLAM method is employed to generate ground-truth poses. A full map is built for each scene, and the robot is localized in the map with each frame of laser scan in the sequences. For the corridor and cafe scenes, a variant of hector_mapping [10] is used for map construction and localization. For home and market, another laser-based SLAM system combined with multi-sensor fusion is used to avoid mismatching. The initial pose estimate of each sequence is manually assigned, and the output is manually verified to be correct. A comparison between laser-based ground-truth and MCS-based ground-truth is made with the in-office part of the corridor data, which gives an ATE of 3 cm.

## IV Benchmark Metrics

Like most existing SLAM benchmarks, we mainly evaluate the quality of camera trajectory estimated by the SLAM algorithms. We adopt the same definition of Absolute trajectory error (ATE) and Relative pose error (RPE) as in the TUM RGB-D benchmark [18] to evaluate the accuracy of pose estimation for each frame. However, estimation failures or wrong (mismatched) poses are more severe than inaccuracies, and they may occur more commonly in lifelong SLAM due to scene changes. Therefore, we design separate metrics to evaluate the correctness and accuracy respectively.

### IV-A Robustness Metrics

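Correctness. For each pose estimate $p_k$ at time $t_k$, given the ground-truth pose at that time, we assess the correctness of the estimate by its ATE and absolute orientation error (AOE):

$$c_{\epsilon,\phi}(p_k)=\begin{cases}1, & \text{if } \mathrm{ATE}(p_k)\le\epsilon \text{ and } \mathrm{AOE}(p_k)\le\phi\\ 0, & \text{otherwise}\end{cases} \tag{1}$$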

Correct Rate (CR) and Correct Rate of Tracking (CR-T). While correctness evaluates a single pose estimate, the overall robustness metric over one or more data sequences can be defined as the correct rate over the whole time span of data. For a sequence from $t_{\min}$ to $t_{\max}$, given an estimated trajectory $\{t_k, p_k\}_{k=0,\dots,N}$, define

$$\mathrm{CR}_{\epsilon,\phi}=\frac{\sum_{k=0}^{N}\left(\min(t_{k+1}-t_k,\delta)\cdot c_{\epsilon,\phi}(p_k)\right)}{t_{\max}-t_{\min}}, \tag{2}$$

$$\mathrm{CR}_{\epsilon,\phi}\text{-T}=\frac{\sum_{k=0}^{N}\left(\min(t_{k+1}-t_k,\delta)\cdot c_{\epsilon,\phi}(p_k)\right)}{t_{\max}-t_0}, \tag{3}$$

where $t_{N+1}=t_{\max}$, and $\delta$ is a parameter that determines how long a correct pose estimate is valid for. Note that in CR-T the time for re-localization and algorithm initialization ($t_0-t_{\min}$) is excluded, since tracking is not functioning during that time. In practice, the ATE threshold $\epsilon$ and AOE threshold $\phi$ should be set according to the area of the scene and the expected drift of the SLAM algorithm. $\delta$ should be set larger than the normal cycle of pose estimation, and much smaller than the time span of the data sequence. For common room-size or building-size data, we suggest setting $\epsilon$ to meter-size and $\delta$ to around one second.
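
The definitions in Eqs. (1)-(3) translate almost directly into code. The following is a minimal sketch, assuming the per-pose ATE and AOE values have already been computed and that the time of the sample after the last estimate is taken to be the end of the sequence; the function and argument names are illustrative and not part of the released benchmark tools.

```python
def correctness(ate, aoe, eps, phi):
    """Eq. (1): 1 if both errors are within their thresholds, else 0."""
    return 1 if ate <= eps and aoe <= phi else 0

def correct_rates(times, ates, aoes, t_min, t_max, eps, phi, delta):
    """Eqs. (2) and (3): return (CR, CR-T) for one sequence.

    times: sorted timestamps t_0..t_N of the pose estimates
    ates, aoes: per-estimate errors, in the same units as eps and phi
    delta: how long a single correct estimate is counted as valid
    """
    total = 0.0
    for k, t_k in enumerate(times):
        # Assumption: the interval after the last estimate ends at t_max.
        t_next = times[k + 1] if k + 1 < len(times) else t_max
        total += min(t_next - t_k, delta) * correctness(ates[k], aoes[k], eps, phi)
    cr = total / (t_max - t_min)
    cr_t = total / (t_max - times[0])
    return cr, cr_t
```

For instance, with estimates at 2, 3, 4, 5 and 8 s in a sequence spanning 0 to 10 s, one incorrect estimate at 4 s and $\delta$ = 1.5 s, the correctly covered time is 5 s, so CR is 0.5 while CR-T, which discounts the first 2 s before tracking started, is 0.625.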

Correctness Score of Re-localization (CS-R). As tracking and re-localization are often implemented with different methods in common SLAM pipelines, they should be evaluated separately. The correctness of re-localization can be decided by the same ATE threshold as in CR. But besides correctness, we would also like to know how much time it takes to re-localize. Therefore, we define a score of re-localization as

$$\mathrm{C}_{\epsilon,\phi}\mathrm{S}_{\tau}\text{-R}=e^{-(t_0-t_{\min})/\tau}\cdot c_{\epsilon,\phi}(p_0) \tag{4}$$

where $\tau$ is a scaling factor. Note that for an immediate correct re-localization with $t_0=t_{\min}$, the score equals 1. The score drops as the time for re-localization increases. For normal evaluation cases we would suggest to set $\tau$ to […].
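
Eq. (4) can be sketched in a few lines. The value of `tau` used in the example call is an arbitrary illustrative choice, not the setting recommended by the benchmark, and the function name is hypothetical.

```python
import math

def relocalization_score(t0, t_min, first_pose_correct, tau):
    """Eq. (4): exponentially decayed score of the first pose estimate.

    t0: time of the first pose estimate in the sequence
    t_min: start time of the sequence
    first_pose_correct: correctness c(p_0) of that first estimate (bool or 0/1)
    tau: scaling factor in seconds
    """
    return math.exp(-(t0 - t_min) / tau) * (1 if first_pose_correct else 0)

# Illustrative call: re-localizing 30 s after the start with a hypothetical tau of 30 s.
score = relocalization_score(t0=35.0, t_min=5.0, first_pose_correct=True, tau=30.0)
```

The exponential form keeps the score in [0, 1]: an incorrect first estimate scores 0 regardless of how quickly it was produced, and a correct one loses a factor of $e$ for every $\tau$ seconds of delay.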

### IV-B Accuracy Metrics

To evaluate the accuracy of pose estimation without being affected by incorrect results, we suggest using statistics of ATE and RPE over one or more trajectories with only correct estimates. For example, C-RPE RMSE is the root mean square error of the RPE of correct pose estimates selected by an ATE threshold of 0.1 meter.
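
A minimal sketch of such a filtered statistic is given below, assuming per-pose ATE and RPE values are already available as parallel lists; the function name and the choice to return `None` when no estimate qualifies are illustrative assumptions.

```python
import math

def rmse_of_correct(errors, ates, ate_threshold):
    """RMSE of `errors` restricted to estimates whose ATE is within the threshold.

    errors: per-pose error values to summarize (e.g. RPE or ATE)
    ates: per-pose ATE values used to select correct estimates
    Returns None if no estimate passes the threshold.
    """
    selected = [e for e, a in zip(errors, ates) if a <= ate_threshold]
    if not selected:
        return None
    return math.sqrt(sum(e * e for e in selected) / len(selected))
```

With the 0.1 meter threshold from the text, a pose whose ATE is 1.5 m is dropped before the RMSE is taken, so a single mismatched pose cannot dominate the accuracy figure.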

## V Experiments

The OpenLORIS-Scene datasets and the proposed metrics are tested with open-source SLAM algorithms. The algorithms are chosen to cover most data types listed in Table I, and to represent a diverse set of SLAM techniques. ORB_SLAM2 is a feature-based SLAM algorithm [13]. It can optimize poses with absolute scale by using either stereo features or depth measurements. DSO, on the contrary, tracks the camera’s states with a fully direct probabilistic model [4]. DS-SLAM improves over ORB_SLAM2 by removing features on moving objects [21]. VINS-Mono provides robust pose estimates with absolute scale by fusing pre-integrated IMU measurements and feature observations [16]. InfiniTAM is a dense SLAM system based on point cloud matching with an iterative closest point (ICP) algorithm [9]. ElasticFusion combines the merits of dense reconstruction and globally consistent mapping by using a deformable model [20].

### V-A Per-sequence Evaluation

Method. First we test each data sequence separately, as done in most existing works. For each algorithm, the ground-truth trajectory is transformed into the target frame of pose estimation, for example, the color sensor of the D435i for ORB_SLAM2 with RGB-D input. Then the estimated trajectory is aligned with the ground-truth using the method of Horn. For DSO, an optimal scaling factor is calculated with Umeyama’s method [19]. Then the ATE of each matched pose is calculated and the RMSE over each sequence is reported. The only difference between our ATE calculation process and conventional ones is that we interpolate the ground-truth trajectory so that each estimated pose gets an exact match on the timeline, as opposed to matching the closest ground-truth pose. The reason is that our laser-based ground-truth trajectories have a lower rate than the MCS-based ones.
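
The pipeline above (interpolate the ground-truth, align, then compute the ATE RMSE) can be sketched as follows. This is a deliberately simplified planar analogue, not the evaluation code of the benchmark: positions are 2D, the alignment is the closed-form least-squares rigid fit in the plane rather than Horn's 3D method, the optional scale follows the same least-squares idea as Umeyama's method, and all function names are hypothetical.

```python
import bisect
import math

def interpolate_xy(gt_times, gt_xy, t):
    """Linearly interpolate the ground-truth position at time t (None if out of range)."""
    if t < gt_times[0] or t > gt_times[-1]:
        return None
    i = min(bisect.bisect_right(gt_times, t) - 1, len(gt_times) - 2)
    w = (t - gt_times[i]) / (gt_times[i + 1] - gt_times[i])
    (x0, y0), (x1, y1) = gt_xy[i], gt_xy[i + 1]
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))

def align_2d(est, gt, with_scale=False):
    """Least-squares scale, rotation and translation mapping est points onto gt points."""
    n = len(est)
    mex, mey = sum(p[0] for p in est) / n, sum(p[1] for p in est) / n
    mgx, mgy = sum(p[0] for p in gt) / n, sum(p[1] for p in gt) / n
    dot = cross = var = 0.0
    for (ex, ey), (gx, gy) in zip(est, gt):
        ex, ey, gx, gy = ex - mex, ey - mey, gx - mgx, gy - mgy
        dot += ex * gx + ey * gy
        cross += ex * gy - ey * gx
        var += ex * ex + ey * ey
    theta = math.atan2(cross, dot)
    scale = math.hypot(dot, cross) / var if with_scale else 1.0
    c, s = math.cos(theta), math.sin(theta)
    tx = mgx - scale * (c * mex - s * mey)
    ty = mgy - scale * (s * mex + c * mey)
    return scale, theta, (tx, ty)

def ate_rmse(est_times, est_xy, gt_times, gt_xy, with_scale=False):
    """ATE RMSE after alignment, matching each estimate to interpolated ground-truth."""
    pairs = []
    for t, p in zip(est_times, est_xy):
        g = interpolate_xy(gt_times, gt_xy, t)
        if g is not None:
            pairs.append((p, g))
    scale, theta, (tx, ty) = align_2d([p for p, _ in pairs],
                                      [g for _, g in pairs], with_scale)
    c, s = math.cos(theta), math.sin(theta)
    sq = 0.0
    for (ex, ey), (gx, gy) in pairs:
        ax = scale * (c * ex - s * ey) + tx
        ay = scale * (s * ex + c * ey) + ty
        sq += (ax - gx) ** 2 + (ay - gy) ** 2
    return math.sqrt(sq / len(pairs))
```

Interpolating rather than picking the nearest ground-truth sample matters most when the ground-truth rate is low relative to the robot's speed, since the nearest sample can then be noticeably displaced from the true position at the estimate's timestamp.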

Result. The results are visualized in Fig. 2, with blue line segments indicating successful localization and blank otherwise. The success rate indicated by CR and the accuracy indicated by ATE RMSE are calculated for each sequence. In the figure only statistics over each scene are shown, where CR and ATE RMSE are averaged with weights given by the time span of each sequence and the count of pose estimates, respectively. All the algorithms can track successfully most of the time in office, but other scenes are challenging. For example, most algorithms tend to get lost in corridor because of the featureless walls and low light, yet VINS-Mono can fully track some of the sequences in this scene. Note that VINS-Mono fails to initialize in some low-light sequences in corridor, and those are not included in the average CR. Nevertheless, VINS-Mono shows the best robustness among the tested algorithms.
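
The per-scene aggregation can be sketched as two weighted means. The dictionary keys are hypothetical, and the sketch assumes that "averaged" means a weighted arithmetic mean of the per-sequence values, which is one reading of the description above.

```python
def scene_statistics(per_sequence):
    """Aggregate per-sequence results into per-scene CR and ATE RMSE.

    per_sequence: list of dicts with keys
      "cr" and "duration" (seconds), "ate_rmse" and "num_poses".
    CR is weighted by sequence duration, ATE RMSE by the number of pose estimates.
    """
    total_time = sum(s["duration"] for s in per_sequence)
    total_poses = sum(s["num_poses"] for s in per_sequence)
    cr = sum(s["cr"] * s["duration"] for s in per_sequence) / total_time
    ate = sum(s["ate_rmse"] * s["num_poses"] for s in per_sequence) / total_poses
    return cr, ate
```

Sequences on which an algorithm never initializes would be left out of `per_sequence` under this scheme, matching the note above that such sequences are not included in the average CR.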

The wheel odometry data in OpenLORIS-Scene are evaluated along with SLAM algorithms in Fig. 2. It can be seen that our odometry data provide reliable tracking results even in large scenes. We think that odometry should not be neglected by designers of practical SLAM algorithms for service robots.

Metrics discussion. If we compare the CR of DS-SLAM and ORB_SLAM2 with the same inputs, the former tends to get lost more often since it uses fewer features to localize, but it succeeds in market, which is highly dynamic. If we also consider their ATEs, a consistent negative correlation can be found between the ATE and CR of these two similar algorithms. The reason is that the longer an algorithm keeps tracking, the more error is likely to accumulate. It implies that evaluating algorithms purely by ATE could be misleading. On the other hand, considering only CR could also be misleading. For example, DSO yields a high CR in corridor and market, but the estimated trajectories are actually erroneous. Its CR would be much lower if we set a proper ATE threshold.

### V-B Lifelong SLAM Evaluation

Method. To test whether SLAM algorithms could continuously localize in changed scenes, we feed the sequences of each scene one by one to the algorithm. There may be a significant view change when switching to the next sequence. The algorithm could either wait for a successful re-localization (e.g. ORB_SLAM2), or start with a fresh map and then try to align it with the old map by loop closing (e.g. VINS-Mono). DSO and ElasticFusion are excluded from this test since the implementations we use do not support re-localization. For ORB_SLAM2 RGB-D, we use a revised version with a few engineering improvements but no algorithmic changes. For each scene, we align the estimated trajectory of the first sequence to the ground-truth, use the resulting transformation matrix to transform all the estimated trajectories of this scene, and then compare them with the ground-truth.
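
The key point of this procedure is that a single transform, obtained from the first sequence only, is applied to every later sequence before judging correctness. A minimal planar sketch is given below; poses are simplified to (x, y, yaw), the transform is assumed to have been computed beforehand, estimates and ground-truth are assumed to be already matched in time, and all names are illustrative.

```python
import math

def transform_pose(pose, theta, tx, ty):
    """Apply a planar rigid transform (rotation theta, translation tx, ty) to (x, y, yaw)."""
    x, y, yaw = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty, yaw + theta)

def angle_diff_deg(a, b):
    """Absolute difference between two angles in radians, returned in degrees in [0, 180]."""
    d = (a - b + math.pi) % (2.0 * math.pi) - math.pi
    return abs(math.degrees(d))

def judge_sequences(est_sequences, gt_sequences, first_seq_transform,
                    ate_thr, aoe_thr_deg):
    """Judge each estimated pose of every sequence after applying one shared transform.

    first_seq_transform: (theta, tx, ty) obtained by aligning only the first sequence.
    Returns one list of booleans (correct / incorrect) per sequence.
    """
    theta, tx, ty = first_seq_transform
    results = []
    for est, gt in zip(est_sequences, gt_sequences):
        flags = []
        for p, g in zip(est, gt):
            q = transform_pose(p, theta, tx, ty)
            ate = math.hypot(q[0] - g[0], q[1] - g[1])
            aoe = angle_diff_deg(q[2], g[2])
            flags.append(ate <= ate_thr and aoe <= aoe_thr_deg)
        results.append(flags)
    return results
```

Because later sequences are not re-aligned individually, a map that was not correctly merged with the first one shows up as large errors in the shared frame instead of being hidden by a per-sequence alignment.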

Result. The results are shown in Fig. 3, with red crosses and line segments indicating incorrect pose estimates, judged by an ATE threshold of 1/3/5 meters for small/medium/large scenes and an AOE threshold of 30°. It can be seen that re-localization is challenging. For example, most algorithms completely fail to re-localize in the 2nd-5th sequences of home.

Metrics discussion. From the results we see that the metrics are imperfect. For example, for corridor and market, some algorithms get an incorrect initial localization for the first sequence, which is technically unsound. The reason is that large drifts have been accumulated over the long trajectories, and after aligning the full trajectory to the ground-truth, its initial part has a large error. It suggests that we should set even larger ATE thresholds for large scenes, and that further refinement of the accuracy judgement method should be discussed. Apart from the false alarms in the initial and final parts of corridor-1 and market-1, the metrics succeed in recognizing incorrect localization and give meaningful statistics.

Factor analysis. Correct re-localization is rare in Fig. 3 partly because we have deliberately selected the most challenging sequences in the collected data. In most scenes, the challenge comes from mixed factors including changed viewpoints, changed illumination, changed things and dynamic objects. The office data have been designed to help disentangle those factors. Therefore, we conduct another set of tests with specified sequence pairs in office. The two sequences in each pair differ in one key factor, as described in Section III-D. The re-localization scores are listed in Table III. The results suggest that changed viewpoints and illumination are the most difficult to deal with. The former is expected, as natural scenes are likely to generate different visual and geometric features from different viewpoints. The latter might be mitigated by carefully tuning algorithms and devices. We expect that deep learning based features and semantic information should be able to help address both problems.

## VI Conclusion

This work introduces the OpenLORIS-Scene datasets and metrics for benchmarking lifelong SLAM for long-term robot deployment. The datasets capture scene changes caused by day-night shifts and human activities, which can be a major challenge for lifelong SLAM algorithms. New metrics are introduced to evaluate the robustness and accuracy of SLAM algorithms separately. With the proposed datasets and metrics, we hope to find shortcomings of existing SLAM algorithms and to encourage new designs with more robust localization capabilities, such as by introducing high-level scene understanding capabilities. The datasets can also serve as a testbed for the maturity of future SLAM algorithms for real-world deployment on service robots.

## Acknowledgement

The authors would like to thank Yusen Qin, Dongyan Zhai and Jin Wang for customizing the robot and lending it for this project, and for performing odometer-camera calibration. We thank Yijia He for providing LiDAR-camera calibration tools and guidance; Phillip Schmidt, Chuan Chen, Hon Pong Ho and Yu Meng for the technical support of RealSense cameras; Mihai Bujanca and Bruno Bodin for helping integrate OpenLORIS-Scene into SLAMBench; and Yinyao Zhang, Long Shi and Zeyuan Dong for helping collect and maintain the data. We also thank all the anonymous participants in data collection.

## References

• [1] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. H. Mayer, A. Nisbet, M. Luján, S. Furber, A. J. Davison, P. H.J. Kelly, and M. O’Boyle (2018-05) SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA).
• [2] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart (2016) The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163.
• [3] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Transactions on Robotics 32 (6), pp. 1309–1332.
• [4] J. Engel, V. Koltun, and D. Cremers (2018-03) Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 611–625.
• [5] M. Fehr, F. Furrer, I. Dryanovski, J. Sturm, I. Gilitschenski, R. Siegwart, and C. Cadena (2017) TSDF-based change detection for consistent long-term dense reconstruction and dynamic object discovery. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5237–5244.
• [6] P. Furgale, J. Rehder, and R. Siegwart (2013-11) Unified temporal and spatial calibration for multi-sensor systems. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1280–1286. ISSN 2153-0858.
• [7] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR).
• [8] A. Grunnet-Jepsen, J. N. Sweetser, and J. Woodfill (Website).
• [9] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. H. S. Torr, and D. W. Murray (2015) Very high frame rate volumetric integration of depth images on mobile device. IEEE Transactions on Visualization and Computer Graphics (Proceedings International Symposium on Mixed and Augmented Reality 2015) 22 (11).
• [10] S. Kohlbrecher, J. Meyer, O. von Stryk, and U. Klingauf (2011-11) A flexible and scalable SLAM system with full 3D motion estimation. In Proc. IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR).
• [11] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716.
• [12] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison (2016) SceneNet RGB-D: 5M photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079.
• [13] R. Mur-Artal and J. D. Tardós (2017-10) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.
• [14] A. Pronobis and B. Caputo (2009-05) COLD: COsy Localization Database. International Journal of Robotics Research (IJRR) 28 (5), pp. 588–594.
• [15] Q. Zhang and R. Pless (2004-09) Extrinsic calibration of a camera and laser range finder (improves camera calibration). In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. 3, pp. 2301–2306.
• [16] T. Qin, P. Li, and S. Shen (2018-08) VINS-Mono: a robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics 34 (4), pp. 1004–1020.
• [17] D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stückler, and D. Cremers (2018) The TUM VI benchmark for evaluating visual-inertial odometry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1680–1687.
• [18] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012-10) A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS).
• [19] S. Umeyama (1991-04) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (4), pp. 376–380. ISSN 1939-3539.
• [20] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger (2016) ElasticFusion: real-time dense SLAM and light source estimation. The International Journal of Robotics Research 35 (14), pp. 1697–1716.
• [21] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei (2018-10) DS-SLAM: a semantic visual SLAM towards dynamic environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168–1174.
• [22] Y. Zhao, S. Xu, S. Bu, H. Jiang, and P. Han (2019) GSLAM: A general SLAM framework and benchmark. arXiv:1902.07995.