Place Recognition for Stereo Visual Odometry using LiDAR descriptors

Jiawei Mo and Junaed Sattar The authors are with the Department of Computer Science and Engineering, Minnesota Robotics Institute, University of Minnesota Twin Cities, Minneapolis, MN, USA. {moxxx066, junaed} at umn.edu.
Abstract

Place recognition is a core component in SLAM, and in most visual SLAM systems, it is based on the similarity between 2D images. However, the 3D points generated by visual odometry, and the structure information embedded within, are not exploited. In this paper, we adapt place recognition methods for 3D point clouds into stereo visual odometry. Stereo visual odometry generates 3D point clouds with a consistent scale. Thus, we are able to use global LiDAR descriptors for 3D point clouds to determine the similarity between places. 3D point clouds are more reliable than 2D visual cues (e.g., 2D features) against environmental changes such as varying illumination and can benefit visual SLAM systems in long-term deployment scenarios. Extensive evaluation on a public dataset (Oxford RobotCar) demonstrates the accuracy and efficiency of using 3D point clouds for place recognition over 2D methods.

1 Introduction

Visual SLAM has been one of the most active research areas in mobile robotics over the past couple of decades. Many field robots, especially those operating where GPS signal is weak or unavailable (e.g., in urban or underwater settings), rely on vision-based systems for navigation during deployment. In such systems, visual odometry is used to build a local map and estimate ego-motion to assist robot navigation. However, error accumulates throughout the process, which can cause odometry estimates to diverge from the correct path. Some form of "loop closure" (e.g., [10]), which requires a robot to recognize previously-visited places, is needed to bring non-local constraints into the system and obtain a globally consistent map and trajectory. Place recognition is thus essential to detect loop closures and improve the accuracy of visual odometry methods.

Classical place recognition methods for vision-based systems usually rely on 2D images. Each location is represented by images taken at that place. To determine the possibility of two locations being the same place, the similarity of their corresponding images is evaluated. There is extensive literature on image similarity, such as feature Bag-of-Words [10], and GIST [21], which is discussed in Sec. 2.

Figure 1: Snapshots of the RobotCar dataset in different seasons: (a) Parks Road in spring, (b) Parks Road in winter, (c) Holywell Street in spring, (d) Holywell Street in winter.

However, the visual odometry methods themselves provide additional information which can be used to detect loop closures. The depth of points (i.e., their distance from the camera) in 2D images can be partially or fully recovered by monocular or multi-camera visual odometry, respectively. The 3D structure of the scene can potentially provide important information for place recognition, yet 2D loop closure methods ignore it. The 3D structure is also more robust than 2D images in a dynamic environment (e.g., under varying illumination). The motivation is biological as well: humans strongly rely on 3D structures for place recognition (e.g., [8]).

On the other hand, a rich body of literature exists on place recognition methods using 3D point clouds generated by LiDAR (Light Detection and Ranging) sensors. LiDARs scan the 3D structure of the environment, rather than its visual appearance, making LiDAR-based place recognition more robust against environmental changes such as appearance and brightness (e.g., Fig. 1). Another benefit of LiDAR methods is their high computational efficiency, and our evaluations demonstrate this when comparing 2D image-based and 3D LiDAR methods (see Sec. 4).

In this work, we adopt LiDAR place recognition methods into visual odometry systems for loop closure detection. The goal is to enable accurate and robust place recognition in a computationally efficient way for a vision-based system in a dynamic environment. The proposed approach mimics a LiDAR range scan using globally consistent 3D point clouds generated by stereo visual odometry, which enables us to adopt global point cloud descriptors for this purpose.

Several challenges must be overcome for applying LiDAR-based methods to vision-based systems. First, the 3D points generated by visual odometry are distributed in a “pyramid-like” shape due to the much narrower field-of-view of cameras (excluding omnidirectional cameras) compared to most LiDARs. The pose of the pyramid changes with the camera, which is not desirable for place recognition. The second challenge is how to (and even if it is necessary to) adapt grayscale intensity data from the vision-based system into LiDAR-based methods, as such information is not available to LiDAR sensors. We address these challenges in this work.

To the best of our knowledge, this is the first approach which uses global LiDAR descriptors for place recognition in visual SLAM systems. The main contributions of this work, as discussed in Sec. 3, are as follows:

  • Adapting global LiDAR descriptors to a vision-based system for place recognition,

  • Augmenting LiDAR descriptors with visual information,

  • Achieving robustness against visual appearance changes with high accuracy and computational efficiency.

We evaluate the proposed method on the Oxford RobotCar Dataset [16]. We demonstrate the robustness of our method against drastic visual appearance change across seasons as recorded in that dataset, and do so with high accuracy and computational efficiency over existing methods. Further performance improvement is achieved by augmenting the LiDAR descriptor with grayscale intensity information.

2 Related Work

In the field of visual SLAM, ORB-SLAM2 [20] is one of the recent works demonstrating high accuracy and computational efficiency. In ORB-SLAM2, loop closure is detected using an ORB [23] feature bag-of-words (FBoW), so place recognition is based on the re-occurrence of 2D image features. FBoW organizes ORB features in a vocabulary tree to speed up the query process. LSD-SLAM [7] adopts FAB-MAP [6] for place recognition, with smaller weights assigned to highly repetitive features to reduce perceptual aliasing. Besides FBoW, Fisher vectors [22] and VLAD [13] also build on local features. More recently, researchers have replaced hand-crafted features (e.g., ORB) with learned features and achieved better performance (e.g., NetVLAD [1] and [3]). On the other hand, global image descriptors are also used for place recognition. GIST [21] is one example, which encodes spatial layout properties (spatial frequencies) of the scene. It exhibits high accuracy if the viewing angle does not change significantly.

Instead of using a single image, SeqSLAM [18] recognizes a place by matching a sequence of images, and achieves highly accurate results even in dynamic environments with changing appearance. Matching a sequence of images does increase accuracy; however, its underlying image matcher (Sum of Absolute Differences) is sensitive to changes in viewing angle and visual appearance.

A number of 3D place recognition methods have been designed for RGB-D cameras or LiDARs. RGB-D Mapping [12] uses ICP [2] to detect loop closure, with RANSAC [9] used to get an initial pose for ICP.

For LiDARs, place recognition methods can be categorized into local descriptors and global descriptors. Local descriptors use a subset of the points and describe them locally in a neighborhood. Examples are Spin Image [14] and SHOT [26]. Spin Image describes a keypoint by a histogram of points lying in each bin of a vertical cylinder centered at the keypoint. SHOT creates a sphere around a keypoint and describes the keypoint by the histogram of normal angles in each bin in the sphere.

Global methods describe the entire set of points. These methods are more computationally efficient; however, they require the scale of the point cloud to be consistent. Recent examples include NDT [17], M2DP [11], Scan Context [15], and DELIGHT [5]. NDT classifies keypoints into line, plane, and sphere classes according to their neighborhoods; a histogram of these three classes represents the point cloud. M2DP projects the points onto multiple planes; the histograms of point counts in the bins of each projection plane are concatenated into a signature of the point cloud, and SVD is used to reduce its dimension. Scan Context normalizes the point cloud with respect to the vertical direction, then represents it by the maximal height of each bin on the horizontal plane. DELIGHT focuses on LiDAR intensity: the scan sphere is divided into 16 parts, and the histogram of LiDAR intensities in each part is computed to represent the point cloud.

Cieslewski et al. [4] looked into the possibility of adopting 3D point cloud descriptors for vision-based systems. They proposed the NBLD descriptor [4] for 3D points triangulated from Structure-from-Motion or visual SLAM: a keypoint is described by the neighboring points inside a vertical cylinder, the point density of each bin in the cylinder is computed, and the densities are compared with their neighborhoods to create a binary descriptor of that keypoint. Ye et al. [27] extended NBLD with a neural network: the vertical cylinder of NBLD is built in the same way, but a trained network describes the cylinder instead of the point-density comparison. These are novel approaches for adopting local point cloud descriptors into vision-based systems for place recognition; however, the grayscale intensity of the 3D points is not used.

3 Methodology

Similar to the idea of [4], our method recognizes places based on the 3D points generated by visual odometry. The main difference is that the visual odometry in this work is running on stereo cameras. Specifically, we use SO-DSO [19] as our stereo visual odometry, which demonstrates high accuracy and computational efficiency. However, any stereo visual odometry, or even multi-camera visual odometry, is applicable here. Since the 3D point cloud is generated by stereo visual odometry, its scale is consistent. Therefore, it is possible to describe the 3D point cloud using global LiDAR descriptors, rather than the local descriptors in [4]. The goal is high accuracy, robustness to environmental change, and high computational efficiency.

3.1 Point Cloud Preprocessing

Due to the narrow field-of-view of the cameras, the 3D points generated by stereo visual odometry are located in a pyramid shape with the vertex of this pyramid being at the focal point. The NBLD local descriptor can operate directly in the pyramid shape, as NBLD describes 3D points individually. However, if we apply a global descriptor directly inside the pyramid shape, no pair of descriptors will be matched unless they are extracted at an identical location and angle.

Figure 2: A demonstration of the points generated by visual odometry, with different colors representing different keyframes. P1 and P2 are 3D points generated by the red keyframe. P1 is used to imitate both red and blue LiDAR scans; P2 is only used to imitate the blue LiDAR scan.

To make the descriptors less sensitive to location and viewing angle, we propose a simple but effective method that transforms the pyramid-shaped 3D points from stereo visual odometry into omnidirectional, LiDAR-like (spherical) 3D points. Stereo visual odometry generates keyframes, each consisting of a camera pose and its associated 3D points, as illustrated in Fig. 2. Let the desired LiDAR sensing range be fixed (in meters). For each keyframe from the stereo visual odometry, a fair number of 3D points are too far away to be used for imitating the current scan but can be used for subsequent scans; P2 in Fig. 2 is an example. On the other hand, like P1 in Fig. 2, a single 3D point can be included in multiple imitated LiDAR scans.

The proposed method is illustrated in Fig. 3. The goal is to imitate a LiDAR scan using the 3D points from stereo visual odometry. We maintain what we refer to as a local points list. For each incoming keyframe from the stereo VO, we first store all of its 3D points in the list. To imitate a LiDAR scan at this keyframe, we iterate through the local points list: if a point lies within the desired LiDAR range, we copy it, transform it into the current keyframe coordinate frame (the current pose), and store it in another list of spherical points within range; otherwise, if the point has fallen behind the camera (as the robot moves) and beyond the range, we remove it from the local points list. Here we assume the camera motion is predominantly in the forward direction. The spherical points may contain duplicates, which need to be filtered. To best imitate a LiDAR, we transform the points into a polar coordinate frame centered at the camera; for each bearing angle, we keep the closest point and eliminate duplicates along the viewing ray. We call this final set the filtered points.

Caching local points enables us to imitate LiDAR scans with points behind the camera, and to have a denser point cloud. Since visual odometry generates locally accurate camera poses and 3D points, concatenating 3D points transformed from multiple nearby keyframes to imitate a LiDAR scan is feasible.
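As a concrete illustration, the following is a minimal sketch (in Python/NumPy) of the scan-imitation step described above. It is not the authors' implementation: the function name `imitate_scan`, the default range and angular resolution, the camera-looks-along-+z convention, and the assumption that cached points carry a grayscale intensity column are all illustrative choices.

```python
import numpy as np

def imitate_scan(local_points, T_world_cam, lidar_range=30.0, angular_res_deg=1.0):
    """Imitate an omnidirectional LiDAR scan at the current keyframe pose.

    local_points : (N, 4) array; columns are x, y, z (world frame) and grayscale intensity.
    T_world_cam  : 4x4 camera-to-world pose of the current keyframe.
    Returns (filtered_points, keep_mask); keep_mask is False for points that are both
    behind the camera and beyond the range, which the caller evicts from its cache.
    """
    # Transform cached world points into the current keyframe (camera) frame.
    T_cam_world = np.linalg.inv(T_world_cam)
    pts_h = np.hstack([local_points[:, :3], np.ones((len(local_points), 1))])
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]

    dist = np.linalg.norm(pts_cam, axis=1)
    in_range = dist <= lidar_range
    # Points behind the camera AND out of range will not re-enter later scans
    # (assuming predominantly forward motion), so they can be dropped.
    evict = (~in_range) & (pts_cam[:, 2] < 0.0)

    # Keep only the closest point per bearing (azimuth/elevation) bin,
    # mimicking a LiDAR ray that returns a single range per direction.
    az = np.degrees(np.arctan2(pts_cam[:, 0], pts_cam[:, 2]))
    el = np.degrees(np.arcsin(np.clip(pts_cam[:, 1] / np.maximum(dist, 1e-9), -1, 1)))
    az_bin = np.floor(az / angular_res_deg).astype(int)
    el_bin = np.floor(el / angular_res_deg).astype(int)

    closest = {}
    for i in np.flatnonzero(in_range):
        key = (az_bin[i], el_bin[i])
        if key not in closest or dist[i] < dist[closest[key]]:
            closest[key] = i
    idx = np.array(sorted(closest.values()), dtype=int)
    filtered_points = np.hstack([pts_cam[idx], local_points[idx, 3:4]])
    return filtered_points, ~evict
```

In a full pipeline, the caller would append each new keyframe's points to the cached list, call `imitate_scan` at the new pose, and keep only the entries flagged by the returned mask before the next keyframe arrives.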

Figure 3: An overview of the proposed approach. The basis lies in the “Point Cloud Preprocessing” block, where 3D points obtained by stereo VO are used to imitate a LiDAR scan, so that efficient place recognition can be performed.

3.2 Point Cloud Description

The next step in the proposed method is to describe the filtered points, for which we rely on global descriptors. Global descriptors are preferable for two reasons: first, they are computationally efficient to compute and match; second, since our point clouds are generated by visual odometry, they are not as consistent and dense as those from a LiDAR. Many local descriptors, such as Spin Image, depend on surface normals, which require dense point clouds and would be problematic in this case. We choose three global descriptors that are robust to sparse and inconsistent point clouds: DELIGHT [5], M2DP [11], and Scan Context [15].

DELIGHT

DELIGHT operates on LiDAR intensities. The LiDAR scan sphere is divided into 16 non-overlapping bins based on radius, azimuth, and elevation. Each bin is described by the histogram of the LiDAR intensities of all points located inside it, and the histograms are concatenated to form the descriptor of the entire LiDAR scan. To make the descriptor less sensitive to rotation and translation, the raw LiDAR scan is aligned to a reference frame obtained via Principal Component Analysis (PCA) [25]. As discussed in [5], there are four possible reference frames due to the axis-direction ambiguity of PCA; hence, four versions of the descriptor are kept for each point cloud to allow robust matching.

Analogous to LiDAR scan intensities, the 3D points from the visual odometry have grayscale intensities. We simply replace LiDAR intensities with grayscale intensities to fit the DELIGHT descriptor into our system. Each histogram is composed of 256 bins since the grayscale intensity ranges from 0 to 255.
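To make the adaptation concrete, here is a hedged sketch of a DELIGHT-style descriptor built from grayscale intensities, assuming the imitated scan has already been aligned to the PCA reference frame. The 2 (radial) x 4 (azimuth) x 2 (elevation) partition, the `inner_radius` default, and the function names are illustrative assumptions; only the 16-cell division, the 256-bin histograms, and the chi-squared comparison come from the description above.

```python
import numpy as np

def delight_like_descriptor(points, intensities, inner_radius=10.0,
                            n_azimuth=4, n_elevation=2, n_bins=256):
    """Concatenated per-cell grayscale histograms of a PCA-aligned point cloud.

    points      : (N, 3) array, already aligned to the PCA reference frame.
    intensities : (N,) array of grayscale values in [0, 255].
    With the defaults, 2 x 4 x 2 = 16 cells, one 256-bin histogram each.
    """
    r = np.linalg.norm(points, axis=1)
    radial = (r > inner_radius).astype(int)                              # 0 = inner, 1 = outer
    az = np.arctan2(points[:, 1], points[:, 0])                          # (-pi, pi]
    el = np.arcsin(np.clip(points[:, 2] / np.maximum(r, 1e-9), -1, 1))   # [-pi/2, pi/2]
    az_cell = np.minimum(((az + np.pi) / (2 * np.pi) * n_azimuth).astype(int), n_azimuth - 1)
    el_cell = np.minimum(((el + np.pi / 2) / np.pi * n_elevation).astype(int), n_elevation - 1)

    cell = (radial * n_azimuth + az_cell) * n_elevation + el_cell
    n_cells = 2 * n_azimuth * n_elevation
    desc = np.zeros((n_cells, n_bins))
    for c in range(n_cells):
        mask = cell == c
        if mask.any():
            desc[c], _ = np.histogram(intensities[mask], bins=n_bins, range=(0, 255))
    return desc.ravel()

def chi_squared_distance(h1, h2, eps=1e-9):
    """Chi-squared distance used to compare DELIGHT-style histogram descriptors."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```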

M2DP

M2DP is a global descriptor that demonstrates high accuracy and efficiency. The point cloud is projected onto multiple planes, and the distribution of each projection is computed by counting the points falling into the individual bins that partition the projection plane. The distributions of all projections are concatenated to form a descriptor of the point cloud, and SVD is used to compress it for computational and memory efficiency. As with DELIGHT, PCA is used to align the point cloud to make the descriptor less sensitive to rotation and translation. In our experiments, we also keep the four versions of the descriptor arising from the PCA ambiguity to improve accuracy, which is not part of the original M2DP.

We also augment the M2DP descriptor with grayscale intensity. Specifically, when projecting the point cloud onto each plane, we not only count the number of points projected onto each bin but also calculate the average grayscale intensity of all points projected onto the bin. Therefore, we have two kinds of descriptors (namely point count descriptor and grayscale intensity descriptor) for each point cloud representing a particular place. To make the intensity descriptor less sensitive to illumination, we binarize the intensity: for each bin, we set the intensity to 0 or 1 if it is smaller or larger than the global average intensity, respectively.
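A simplified stand-in for one such augmented projection is sketched below; the real M2DP uses a specific family of projection planes plus SVD compression, which are omitted here. The polar binning parameters (`n_rings`, `n_sectors`, `max_radius`) and the function name are assumptions.

```python
import numpy as np

def project_plane_signature(points, intensities, normal,
                            n_rings=8, n_sectors=16, max_radius=30.0):
    """Per-bin point count and binarized mean intensity for one projection plane.

    Points are projected onto the plane with the given (unit) normal and binned in
    polar coordinates (rings x sectors). Each bin stores (a) the number of points and
    (b) 0/1 depending on whether its mean grayscale intensity is below/above the
    cloud's global mean intensity.
    """
    normal = normal / np.linalg.norm(normal)
    # Build an orthonormal basis (u, v) spanning the projection plane.
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper); u /= np.linalg.norm(u)
    v = np.cross(normal, u)

    x, y = points @ u, points @ v
    r = np.clip(np.hypot(x, y), 0, max_radius - 1e-6)
    theta = np.arctan2(y, x) + np.pi
    ring = (r / max_radius * n_rings).astype(int)
    sector = np.minimum((theta / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    bin_id = ring * n_sectors + sector

    n_bins = n_rings * n_sectors
    counts = np.bincount(bin_id, minlength=n_bins).astype(float)
    intensity_sum = np.bincount(bin_id, weights=intensities, minlength=n_bins)
    mean_intensity = np.divide(intensity_sum, counts, out=np.zeros(n_bins), where=counts > 0)
    binary_intensity = (mean_intensity > intensities.mean()).astype(float)
    return counts, binary_intensity
```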

Scan Context

Scan Context is a straightforward yet effective descriptor designed for LiDAR scans obtained in urban areas. The scan is aligned with respect to the gravitational axis, which is measured externally (e.g., with an IMU). The horizontal plane is then divided into multiple bins based on radius and azimuth; the maximum height in each bin is found, and these values are concatenated to form the descriptor. Descriptors are compared over all possible yaw angles, and root shifting [15] is used to gain robustness to translation.

To make use of Scan Context in our system, we make the following modifications. First, since we do not wish to add sensors to the system, we adopt the PCA alignment used in DELIGHT and M2DP instead of relying on gravitational-axis alignment and root shifting. However, our experiments show that PCA alone is not enough to find the optimal yaw angle; thus, descriptors are compared over all possible yaw angles in both the forward and backward directions in our implementation. Second, due to the PCA ambiguity, we replace the maximum height in the original Scan Context with the height range (maximum height minus minimum height). Lastly, we augment the descriptor with grayscale intensity using the same method as for M2DP. Consequently, we have both a structure descriptor and an intensity descriptor for each place.
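The following sketch illustrates the modified descriptor and its brute-force yaw search. The grid size (`n_rings`, `n_sectors`), the maximum radius, and the function names are assumptions; the height-range cells, the forward/backward yaw comparison, and the Euclidean distance follow the description above (the analogous intensity channel is omitted for brevity).

```python
import numpy as np

def scan_context_height_range(points, n_rings=20, n_sectors=60, max_radius=30.0):
    """Height-range variant of Scan Context on a PCA-aligned point cloud.

    Each (ring, sector) cell on the horizontal plane stores max(z) - min(z)
    over the points falling into it (0 for empty cells).
    """
    r = np.clip(np.hypot(points[:, 0], points[:, 1]), 0, max_radius - 1e-6)
    theta = np.arctan2(points[:, 1], points[:, 0]) + np.pi
    ring = (r / max_radius * n_rings).astype(int)
    sector = np.minimum((theta / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)

    zmin = np.full((n_rings, n_sectors), np.inf)
    zmax = np.full((n_rings, n_sectors), -np.inf)
    for i in range(len(points)):
        zmin[ring[i], sector[i]] = min(zmin[ring[i], sector[i]], points[i, 2])
        zmax[ring[i], sector[i]] = max(zmax[ring[i], sector[i]], points[i, 2])

    desc = np.zeros((n_rings, n_sectors))
    occupied = np.isfinite(zmin)
    desc[occupied] = zmax[occupied] - zmin[occupied]
    return desc

def scan_context_distance(d1, d2):
    """Smallest Euclidean distance over all yaw shifts, forward and reversed."""
    best = np.inf
    for candidate in (d2, d2[:, ::-1]):           # forward and backward direction
        for shift in range(candidate.shape[1]):   # all possible yaw angles
            best = min(best, np.linalg.norm(d1 - np.roll(candidate, shift, axis=1)))
    return best
```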

3.3 Place Recognition

Using the descriptors, we can determine the similarity between places. For DELIGHT, we generate a difference matrix by computing the descriptor distance from each place in the query database to every place in the reference database. When computing the distance between two places, we take the shortest distance from the query descriptor to all four descriptors of the reference place; the distance itself is based on the chi-squared test. For M2DP and Scan Context, the distance is simply the Euclidean distance between descriptors. However, since we have two kinds of descriptors for each place (a structure descriptor and an intensity descriptor), we obtain two separate difference matrices, $D_{st}$ and $D_{int}$. To fuse them into a single difference matrix, we first normalize them row-wise ($N(\cdot)$ in Eq. 1, assuming queries along the rows), then add them weighted by their standard deviations $\sigma(\cdot)$, as shown in Eq. 1. Experimentally, we found that a difference matrix with a higher standard deviation discriminates better between different places.

$D = \sigma\big(N(D_{st})\big)\, N(D_{st}) + \sigma\big(N(D_{int})\big)\, N(D_{int})$   (1)

With the difference matrix, each query place (row) is matched to a reference place whose difference value is smallest (along the row). Optionally, we can adopt SeqSLAM to match a sequence on the difference matrix to further improve accuracy.
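A minimal sketch of the fusion and matching steps is given below. The exact form of the row-wise normalization $N(\cdot)$ is not spelled out in the text, so a per-row min-max normalization is assumed here; the function names are illustrative.

```python
import numpy as np

def fuse_difference_matrices(D_st, D_int):
    """Fuse structure and intensity difference matrices following Eq. 1.

    Each matrix is normalized row-wise (queries along rows), then the two are
    added with their standard deviations as weights, so the matrix that
    discriminates better between places contributes more.
    """
    def row_normalize(D):
        mins = D.min(axis=1, keepdims=True)
        ranges = np.maximum(D.max(axis=1, keepdims=True) - mins, 1e-9)
        return (D - mins) / ranges

    N_st, N_int = row_normalize(D_st), row_normalize(D_int)
    return N_st.std() * N_st + N_int.std() * N_int

def match_places(D_fused):
    """Match each query (row) to the reference place with the smallest difference."""
    matches = D_fused.argmin(axis=1)
    scores = D_fused[np.arange(len(D_fused)), matches]
    return matches, scores
```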

4 Experimental Evaluation

To evaluate the proposed method, we compare the results of our technique with that of [27], which is the most relevant work to our own. In that work, DSO is used to generate 3D point clouds, which are organized as 3D patches by NBLD afterward. Subsequently, a convolutional neural network is trained to describe the 3D patches. There are three major differences between their work and ours: first, we use our own SO-DSO [19] instead of DSO so that the 3D point clouds have accurate and consistent scale; second, we preprocess the 3D point clouds as discussed in Sec. 3.1; third, we use global LiDAR descriptors for eventual place recognition.

Dataset

Since the authors of [27] have not published their code, we evaluate our algorithms using the same dataset and settings for a fair comparison. Specifically, we employ the Oxford RobotCar Dataset [16] and use the same segment, illustrated in Fig. 5(a) of [27], for testing. When computing precision and recall, we also use the same GPS distance threshold (in meters) as [27]. For comparison, the testing sequence pairs in Table 1 are identical to those in [27] and cover all combinations of seasons. These tests are challenging for place recognition since the visual appearance and brightness change drastically. Snapshots of the RobotCar dataset used for evaluation are shown in Fig. 1.

Test      a            b            c            d            e            f            g            h            i            j
Dates     05-19/05-22  05-19/08-13  05-19/10-30  05-19/02-10  08-13/07-14  08-13/10-30  08-13/02-10  10-30/11-28  10-30/02-10  02-10/12-12
Seasons   Spr./Spr.    Spr./Sum.    Spr./Fall    Spr./Win.    Sum./Sum.    Sum./Fall    Sum./Win.    Fall/Fall    Fall/Win.    Win./Win.
Table 1: The driving sequences from RobotCar dataset and their recording seasons; the proposed approach was tested on all of these sequences.

Implementation

The authors of M2DP have published their Matlab code, which we reimplement in C++ and augment with grayscale intensity information; we similarly reimplement DELIGHT and Scan Context. When preprocessing the point clouds as discussed in Sec. 3.1, we fix the LiDAR range and use a 1-degree angular resolution. The parameters of M2DP and Scan Context are set to their default values; for DELIGHT, the radii of the inner and outer spheres are set to fixed values (in meters). Our implementations are available online (https://github.com/jiawei-mo/3d_place_recognition).

For comprehensive validation, we include two 2D image-based methods, FBoW and GIST, representing local and global 2D methods, respectively. The FBoW implementation comes from the ORB-SLAM2 code; the GIST implementation is available online (http://lear.inrialpes.fr/software).

Results

Test a b c d e f g h i j
CNN 0.774 0.736 0.589 0.419 0.764 0.557 0.489 0.599 0.443 0.597
NBLD 0.651 0.700 0.611 0.351 0.672 0.496 0.379 0.454 0.351 0.491
NetV. 0.482 0.583 0.427 0.537 0.640 0.259 0.512 0.003 0.078 0.158
DELI. 0.789 0.560 0.309 0.035 0.702 0.466 0.019 0.348 0.004 0.014
M2DP 0.907 0.864 0.533 0.317 0.893 0.605 0.305 0.557 0.360 0.627
S.C. 0.953 0.957 0.771 0.641 0.923 0.789 0.576 0.656 0.525 0.808
FBoW 0.256 0.122 0.047 0.170 0.081 0.138 0.256 0.002 0.139 0.002
GIST 0.932 0.918 0.679 0.778 0.914 0.694 0.738 0.003 0.606 0.000
(a) The area under the precision-recall curve (AUC).
Test a b c d e f g h i j
DELI. 0.024 0.030 0.015 0.000 0.028 0.021 0.005 0.009 0.000 0.003
M2DP 0.267 0.091 0.003 0.027 0.220 0.033 0.092 0.084 0.008 0.120
S.C. 0.714 0.607 0.439 0.327 0.759 0.465 0.271 0.294 0.205 0.588
FBoW 0.002 0.002 0.003 0.000 0.003 0.004 0.000 0.000 0.000 0.000
GIST 0.794 0.377 0.242 0.176 0.503 0.242 0.156 0.000 0.109 0.000
(b) Maximal recall at 100% precision.
Table 2: Place recognition accuracy.
Method DELI. M2DP S.C. FBoW GIST
Imitate LiDAR Scan (c++) 1.649 1.649 1.643 - -
Desc. extraction (c++) 0.023 50.78 0.162 39.45 158.6
Query descriptor (Matlab) 240.8 7.921 14.47 219.2 (c++) 5.486
Total 242.5 60.35 16.28 258.7 164.1
Table 3: Run time analysis in milliseconds.

Accuracy

Table 2(a) shows the area under the precision-recall curve (AUC) of each algorithm on the tests in Table 1. The data for CNN, NBLD, and NetVLAD are taken directly from [27], whereas the rows marked "DELI.", "M2DP", and "S.C." represent our approach using these three global descriptors. A larger AUC means more places are recognized with fewer errors; to some extent, AUC reflects the discriminative power of an algorithm for place recognition. Table 2(b) shows the maximal recall at 100% precision (no false positives, i.e., no errors). A larger recall indicates that more places are recognized before any error is made, which matters because a single false positive can significantly affect the accuracy of the entire SLAM algorithm. In the best case, both AUC and recall are 1.
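For reference, the two metrics can be computed from per-query match scores and ground-truth labels roughly as sketched below; this is a generic formulation and may differ in detail from the evaluation protocol of [27] (e.g., in how queries without a ground-truth match are counted).

```python
import numpy as np

def pr_metrics(scores, correct, n_positives=None):
    """AUC of the precision-recall curve and maximal recall at 100% precision.

    scores      : (Q,) match score per query (smaller = more confident match).
    correct     : (Q,) bool, whether the matched reference lies within the GPS
                  distance threshold of the query's ground-truth position.
    n_positives : number of queries that have a true match in the reference
                  database; defaults to all queries.
    """
    order = np.argsort(scores)                    # accept matches from most to least confident
    correct = np.asarray(correct, dtype=float)[order]
    tp = np.cumsum(correct)
    accepted = np.arange(1, len(correct) + 1)
    precision = tp / accepted
    if n_positives is None:
        n_positives = len(scores)                 # assume every query revisits a mapped place
    recall = tp / max(n_positives, 1)

    # Step-wise area under the precision-recall curve.
    auc = np.sum(np.diff(recall, prepend=0.0) * precision)
    # Largest recall reachable before the first false positive.
    perfect = precision == 1.0
    max_recall_at_full_precision = recall[perfect].max() if perfect.any() else 0.0
    return auc, max_recall_at_full_precision
```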

Scan Context ("S.C.") achieves both the highest AUC and the highest recall in most tests. Scan Context depends on the height range for place recognition; in the RobotCar dataset, the maximum height usually comes from buildings and the minimum height from the ground, so the height range is not very sensitive to the seasons. GIST performs best in the remaining tests; it works well because the viewing angle is mostly unchanged in the RobotCar dataset. M2DP is the next-best performing approach, with relatively high accuracy when there is little visual appearance change (e.g., Test a: Spring-Spring). However, across seasons the trees along the streets look vastly different (e.g., due to losing leaves), as illustrated in Fig. 1, and the accuracy drops since M2DP relies heavily on structure. DELIGHT, which depends purely on grayscale intensity, suffers even more from seasonal appearance changes. FBoW, unexpectedly, has the worst performance; a potential reason is that ORB is sensitive to changing scene factors such as pedestrians. Fig. 4 shows the effect of seasonal changes on accuracy.

Figure 4: Plots of AUCs at different seasons, when compared with Spring as the baseline.
Figure 5: Places recognized (Red) of Scan Context at 100% precision on Test d: Spring(lower)-Winter(upper).

Fig. 5 shows the places recognized by Scan Context at 100% precision. With reference to Fig. 1, most places along Holywell Street are correctly recognized since it is mostly lined with buildings, but most places along Parks Road are not recognized because there are many trees on both sides of the road.

In conclusion, Scan Context, as adopted in this work, outperforms [27] in terms of accuracy.

Efficiency

Table 3 reports the run-time required to query a place in the database, measured on an Intel i7-6700 with 16GB of RAM. The table is based on Test a, querying places from 05-19 against the database of 05-22, which contains 2675 places with 3708.56 3D points per place on average.

For descriptor generation, DELIGHT is the fastest due to its straightforward mechanism, and Scan Context is the second-fastest because calculating the height range is efficient. M2DP is the slowest among the proposed 3D methods: the entire set of points is projected onto multiple planes, followed by SVD compression, which is computationally expensive; moreover, we run M2DP four times to obtain descriptors for all possible reference frames (Sec. 3.2). For GIST, its extremely high accuracy on the RobotCar dataset comes at a high computational cost.

For the place query, GIST is the quickest since it simply computes the Euclidean distance between two descriptors. It is followed by M2DP: since all possible descriptors of each place are precomputed, the distance between two places is the smallest Euclidean distance among them. Scan Context is slightly slower because the query descriptor is matched against all possible yaws. DELIGHT is much slower since its distances are based on the chi-squared test.

Scan Context achieves the highest overall efficiency: it processes each place in 16.28 ms in total (Table 3), i.e., more than 60 frames per second, and can thus run in real time in most visual SLAM systems. Since Scan Context performs best on the RobotCar dataset in terms of both accuracy and efficiency, we mainly focus on it in the following comparisons.

(a) Summer-Fall
(b) Spring-Spring
(c) Spring-Fall
(d) Summer-Winter
Figure 6: Precision-recall curves of Scan Context compared with those of [27]. M2DP* is the variant modified in [27] into a local descriptor, which is not the M2DP of Sec. 3.2.

Fig. 6 shows the comparison of the precision-recall curve between the proposed Scan Context and the methods proposed in [27]. It further validates the conclusion that Scan Context proposed in this work outperforms [27] in accuracy.

Test       a      b      c      d      e      f      g      h      i      j
Structure  0.966  0.946  0.745  0.651  0.924  0.731  0.556  0.636  0.542  0.791
           0.837  0.145  0.085  0.281  0.486  0.062  0.103  0.049  0.044  0.099
Intensity  0.751  0.582  0.235  0.111  0.771  0.279  0.069  0.269  0.127  0.453
           0.135  0.106  0.014  0.032  0.108  0.028  0.018  0.022  0.033  0.046
Fused      0.953  0.957  0.771  0.641  0.923  0.789  0.576  0.656  0.525  0.808
           0.714  0.607  0.439  0.327  0.759  0.465  0.271  0.294  0.205  0.588
Table 4: AUCs (top sub-rows) and maximal recalls at 100% precision (bottom sub-rows) of Scan Context with structure and/or grayscale intensity.

Intensity Contribution

Table 4 shows the AUCs and maximal recall of Scan Context using structure only, intensity only, and both structure and intensity. Scan Context with intensity only performs poorly on the RobotCar dataset, since grayscale intensity changes drastically throughout different seasons. However, augmenting the structure descriptor with intensity information clearly improves the maximal recall, even though it does not obviously improve the AUCs.

Test      a      b      c      d      e      f      g      h      i      j
SAD       0.751  0.690  0.413  0.401  0.762  0.545  0.261  0.002  0.272  0.002
          0.324  0.257  0.084  0.044  0.328  0.108  0.033  0.000  0.000  0.000
SeqSLAM   0.918  0.885  0.717  0.470  0.896  0.791  0.449  0.001  0.380  0.001
          0.842  0.816  0.391  0.199  0.779  0.610  0.192  0.000  0.196  0.000
S.C.      0.829  0.881  0.537  0.462  0.841  0.562  0.386  0.456  0.367  0.619
          0.478  0.592  0.295  0.191  0.603  0.081  0.116  0.113  0.183  0.373
SeqS.C.   0.934  0.904  0.827  0.778  0.916  0.822  0.831  0.646  0.602  0.827
          0.912  0.849  0.602  0.423  0.889  0.487  0.547  0.277  0.280  0.739
Table 5: AUCs (top sub-rows) and maximal recalls (bottom sub-rows) at 100% precision of the original SeqSLAM and the SeqSLAM with Scan Context descriptor.

SeqSLAM

We also explore SeqSLAM as a post-processing step for Scan Context. Specifically, we replace the image-difference (SAD: Sum of Absolute Differences) descriptor in SeqSLAM with the proposed Scan Context to generate the difference matrix, which is then contrast-enhanced and sequence-matched. We refer to the modified SeqSLAM as SeqScanContext.

Table 5 reports the improvement SeqSLAM brings to Scan Context, compared with the original SeqSLAM as implemented by OpenSeqSlam [24]. We use the images_diff settings, in which the sequences are sub-sampled; a large sub-sampling factor means a large gap between places. Hence, the accuracy of Scan Context in Table 5 is lower than in the previous tables, but it still improves upon the SAD matcher of the original SeqSLAM. With SeqSLAM as post-processing, both the AUCs and recalls of Scan Context improve, especially under large visual appearance change (e.g., Tests c, d, f, g, i).
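A heavily simplified sketch of this post-processing step is shown below: it standardizes the columns of the difference matrix and scores short linear sequences over a few assumed velocities, which captures the idea but omits the local contrast enhancement and trajectory-search details of OpenSeqSlam. All parameter values and names are illustrative.

```python
import numpy as np

def sequence_match(D, seq_len=10, velocities=(0.8, 1.0, 1.2)):
    """Simplified SeqSLAM-style sequence matching on a difference matrix D (queries x references).

    The score of a candidate reference (column) for a query (row) is the mean
    difference accumulated along a short linear trajectory ending at that cell,
    searched over a few assumed velocities; the best-scoring reference wins.
    """
    # Simple contrast enhancement: standardize each column
    # (OpenSeqSlam applies this within a local sliding window instead).
    D = (D - D.mean(axis=0, keepdims=True)) / (D.std(axis=0, keepdims=True) + 1e-9)

    n_q, n_r = D.shape
    matches = np.full(n_q, -1)
    scores = np.full(n_q, np.inf)
    for q in range(seq_len, n_q):          # the first seq_len queries stay unmatched
        for r in range(seq_len, n_r):
            best = np.inf
            for v in velocities:
                qs = np.arange(q - seq_len, q)
                rs = np.clip(np.round(r - v * (q - qs)).astype(int), 0, n_r - 1)
                best = min(best, D[qs, rs].mean())
            if best < scores[q]:
                scores[q], matches[q] = best, r
    return matches, scores
```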

5 Conclusions

In this paper, we propose a novel place recognition approach for stereo visual odometry. Instead of 2D image similarity, we depend on the 3D points generated by the visual odometry to determine the correlation between places. The 3D points from stereo systems, with accurate and consistent scale, are used to mimic LiDAR scans and fed into three global LiDAR point cloud descriptors, which are DELIGHT, M2DP, and Scan Context. We augment the descriptors with grayscale intensity information. Experiments on RobotCar dataset show the accuracy and efficiency of the proposed method.

For the next step, we will integrate the proposed method into state-of-the-art stereo visual odometry algorithms to detect loop closure, and quantify the performance and accuracy. Furthermore, we intend to implement the proposed algorithm on board physical field robots and validate its performance in visually challenging environments.

References

  • [1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307. Cited by: §2.
  • [2] P. J. Besl and N. D. McKay (1992) Method for Registration of 3-D Shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, Vol. 1611, pp. 586–607. Cited by: §2.
  • [3] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford (2017) Deep Learning Features at Scale for Visual Place Recognition. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3223–3230. Cited by: §2.
  • [4] T. Cieslewski, E. Stumm, A. Gawel, M. Bosse, S. Lynen, and R. Siegwart (2016) Point Cloud Descriptors for Place Recognition using Sparse Visual Information. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 4830–4836. Cited by: §2, §3.
  • [5] K. P. Cop, P. V. Borges, and R. Dubé (2018) DELIGHT: An Efficient Descriptor for Global Localisation using LiDAR Intensities. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3653–3660. Cited by: §2, §3.2, §3.2.
  • [6] M. Cummins and P. Newman (2008) FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance. The International Journal of Robotics Research 27 (6), pp. 647–665. Cited by: §2.
  • [7] J. Engel, T. Schöps, and D. Cremers (2014) LSD-SLAM: Large-Scale Direct Monocular SLAM. In European conference on computer vision, pp. 834–849. Cited by: §2.
  • [8] R. Epstein and N. Kanwisher (1998) A Cortical Representation of the Local Visual Environment. Nature 392 (6676), pp. 598. Cited by: §1.
  • [9] M. A. Fischler and R. C. Bolles (1981) Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §2.
  • [10] D. Gálvez-López and J. D. Tardos (2012) Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197. Cited by: §1, §1.
  • [11] L. He, X. Wang, and H. Zhang (2016) M2DP: A Novel 3D Point Cloud Descriptor and its Application in Loop Closure Detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 231–237. Cited by: §2, §3.2.
  • [12] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox (2012) RGB-D mapping: Using Kinect-Style Depth Cameras for Dense 3D Modeling of Indoor Environments. The International Journal of Robotics Research 31 (5), pp. 647–663. Cited by: §2.
  • [13] H. Jégou, M. Douze, C. Schmid, and P. Pérez (2010) Aggregating Local Descriptors into a Compact Image Representation. In CVPR 2010-23rd IEEE Conference on Computer Vision & Pattern Recognition, pp. 3304–3311. Cited by: §2.
  • [14] A. E. Johnson and M. Hebert (1999) Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE Transactions on pattern analysis and machine intelligence 21 (5), pp. 433–449. Cited by: §2.
  • [15] G. Kim and A. Kim (2018) Scan Context: Egocentric Spatial Descriptor for Place Recognition within 3D Point Cloud Map. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4802–4809. Cited by: §2, §3.2, §3.2.
  • [16] W. Maddern, G. Pascoe, C. Linegar, and P. Newman (2017) 1 Year, 1000km: The Oxford RobotCar Dataset. The International Journal of Robotics Research 36 (1), pp. 3–15. Cited by: §1, §4.
  • [17] M. Magnusson (2009) The Three-Dimensional Normal-Distributions Transform: An Efficient Representation For Registration, Surface Analysis, and Loop Detection. Ph.D. Thesis, Örebro universitet. Cited by: §2.
  • [18] M. J. Milford and G. F. Wyeth (2012) SeqSLAM: Visual Route-Based Navigation for Sunny Summer Days and Stormy Winter Nights. In 2012 IEEE International Conference on Robotics and Automation, pp. 1643–1649. Cited by: §2.
  • [19] J. Mo and J. Sattar (2019) Extending Monocular Visual Odometry to Stereo Camera System by Scale Optimization. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). arXiv preprint arXiv:1905.12723. Cited by: §3, §4.
  • [20] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §2.
  • [21] A. Oliva and A. Torralba (2006) Building the Gist of a Scene: The Role of Global Image Features in Recognition. Progress in brain research 155, pp. 23–36. Cited by: §1, §2.
  • [22] F. Perronnin and C. Dance (2007) Fisher Kernels on Visual Vocabularies for Image Categorization. In 2007 IEEE Conference on Computer Vision & Pattern Recognition, pp. 1–8. Cited by: §2.
  • [23] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: An Efficient Alternative to SIFT or SURF.. In ICCV, Vol. 11, pp. 2. Cited by: §2.
  • [24] B. Talbot, S. Garg, and M. Milford (2018) OpenSeqSlam2.0: An Open Source Toolbox for Visual Place Recognition under Changing Conditions. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7758–7765. Cited by: §4.
  • [25] F. Tombari, S. Salti, and L. Di Stefano (2010) Unique Signatures of Histograms for Local Surface Description. In European Conference on Computer Vision, pp. 356–369. Cited by: §3.2.
  • [26] F. Tombari, S. Salti, and L. Di Stefano (2011) A Combined Texture-Shape Descriptor for Enhanced 3D Feature Matching. In 2011 18th IEEE international conference on image processing, pp. 809–812. Cited by: §2.
  • [27] Y. Ye, T. Cieslewski, A. Loquercio, and D. Scaramuzza (2017) Place Recognition in Semi-Dense Maps: Geometric and Learning-Based Approaches. In Proc. Brit. Mach. Vis. Conf., pp. 72–1. Cited by: §2, Figure 6, §4, §4, §4, §4, §4.