Fast Loop Closure Detection via Binary Content

Abstract

Loop closure detection plays an important role in reducing localization drift in Simultaneous Localization And Mapping (SLAM). It aims to find repeated scenes in historical data so that localization can be reset. To tackle the loop closure problem, existing methods often rely on matching visual features, which achieves good accuracy but requires substantial computational resources. Moreover, feature point based methods ignore the patterns of an image, i.e., the shapes of the objects and their distribution in the image. We believe that this information is usually unique to a scene and can be used to improve the performance of traditional loop closure detection methods. In this paper, we compress this information into a binary image and use it to accelerate an existing loop closure detection method. The proposed method greatly reduces the computational cost without sacrificing recall rate. It consists of three parts: binary content construction, fast image retrieval and precise loop closure detection. No offline training is required. We compare our method with state-of-the-art loop closure detection methods, and the results show that it outperforms the traditional methods in both recall rate and speed.

I Introduction

Over the past decades, loop closure detection has become an important part of visual SLAM. In its early stages, SLAM relied only on visual odometry, which inevitably accumulates drift, so navigation and mapping often failed in the long run. Later, it was found that graph based optimization, with the help of loop closure detection, can greatly correct the drift error, and loop closure detection became an essential component of modern SLAM [22]. Nowadays, a full visual SLAM system consists of a front-end and a back-end. In the front-end, visual odometry estimates the frame-to-frame transition directly, but it suffers from cumulative drift in real applications. In the back-end, loop closure partially resets localization to minimize the transitional measurement error by matching the current frame with historical data [16]. Visual SLAM has been widely applied in robotics, for example in cleaning robots, drones and autonomous cars, and has become a promising technique in the field.

Fig. 1: Framework of Binary Content Based Loop Closure Detection.

The idea of loop closure detection is to find repeated scenes in the historical data so that two places can be linked together. The link between two places acts as an additional constraint on the map. After applying graph optimization, the drift error can be minimized based on those constraints. Experiments have shown that loop closure detection can greatly improve the performance of SLAM [22]. However, the problem poses several challenges. First, the database grows with time, so its size can become very large without proper compression. Second, the complexity of indexing grows in proportion to the database size, so the demand for computational resources also gradually increases. Lastly, two frames of the same place taken at different times may differ slightly due to variations in lighting conditions, dynamic objects, etc. Therefore, loop closure detection remains a challenging topic in visual SLAM.

Existing works on loop closure detection share the common idea of using hand-crafted feature points and feature descriptors, such as FAB-MAP [6], Bag of Visual Words (BoVW) [9], VLAD [13] and Fisher Vector [19]. These methods extract feature points from each image frame and convert them into descriptors. The descriptors are then stored in a database in sequence, and a loop closure is found by comparing the current descriptor against the database. Because the number of comparisons grows with time, the comparison of descriptors must be fast. However, there is always a trade-off between speed and precision: achieving higher accuracy takes more computational resources. For example, in FAB-MAP [6], it takes 400 ms to extract SIFT features from a single frame on a normal computer. Image descriptors such as Fisher Vector contain high-order statistics and therefore take more time to process. In short, existing methods focus on creating accurate image descriptors but lack satisfactory efficiency.

In this paper, we argue that extracting and comparing feature descriptors takes too much computation and becomes a burden on the processing system over a long run. Existing feature point based methods achieve satisfactory recall rates but are difficult to run in real time. Meanwhile, we note that, apart from feature points, the distribution of objects or salient patterns is also important information. For a given scene, the geometric distribution of the objects and the shape of each object are usually unique and can therefore be used for loop closure detection. However, feature point methods do not use this information. A useful property of this pattern information is that it does not involve any color information, so if it can be expressed in binary format, processing speed can be improved. Hence we introduce this feature for loop closure detection, expressing the object distribution information as the binary content of an image. We can then verify loop closure places by checking the similarity of the binary contents of two images. At the same time, we keep an existing feature point method on top of binary content indexing to achieve a high recall rate. The new framework consists of three parts: binary content construction, fast image retrieval and precise loop closure detection. It introduces a binary map into loop closure detection to reduce the computational cost of indexing, while applying precise image matching to guarantee precision. Compared to existing methods, our method requires no offline training. Our experiments also show that it outperforms existing methods in both recall rate and speed. The main contributions of this paper are as follows:

  • We propose a binary content based fast loop closure detection method, which combines the advantages of fast binary operations and the traditional loop closure detection approach.

  • The performance is greatly improved. Our method is much faster than existing methods without reducing recall rate or recall precision.

  • Compared to existing methods, the proposed method does not require any offline training. It can be easily integrated into a SLAM system.

This paper is organized as follows: Section II reviews related work on loop closure detection. Section III describes the details of the proposed method. Section IV shows the experimental results and a comparison with existing works, followed by the conclusion in Section V.

II Related Work

Most loop closure detection methods adopt the Bag of Words structure, which originates from natural language processing. In this model, a text is represented as the multiset of its words, regardless of grammar or word order. The same idea is applied to loop closure detection in methods such as FAB-MAP and DBoW2 [6, 9]. FAB-MAP defines a probabilistic model over the bag-of-words representation [6]. It uses a Chow-Liu tree [5] to approximate the co-occurrences between SURF feature points. The work in [7] tests on datasets of 70 km and 1,000 km in length and achieves a satisfactory recall rate with only a few false positives. DBoW2 [9] creates a tree vocabulary by offline training over a large dataset. New feature points are marked with a sequence number according to the vocabulary, so that the co-occurrence of frames can be estimated from the Euclidean distance of the feature points in the vocabulary.

Other research works aim to find an effective and efficient image descriptor for loop closure detection. The work in [17] uses the SURF feature descriptor for loop detection. It achieves a satisfactory result but consumes a lot of computational resources. The work in [13] extracts a VLAD vector from each image. VLAD is a first-order statistic of the non-probabilistic Fisher Vector [19], which can be obtained by training a codebook of k visual words using k-means. The similarity is estimated by measuring the Euclidean distance between the related vectors. In recent years, binary descriptors such as Binary Robust Invariant Scalable Keypoints (BRISK) and Binary Robust Independent Elementary Features (BRIEF) have also been used in loop closure detection [15, 1, 20, 2, 14]. They take advantage of fast binary operations and use probability theory to represent features. However, they carry some uncertainty, so their accuracy may sometimes drop.

Fig. 2: Example of Log spectral residual approach. (a) Raw image. (b) Result.

Fig. 3: Examples of binary content extraction. (a), (c), (e) First, second and third frames. (b), (d), (f) Corresponding binary content extraction results.

Another trend in loop closure detection is the use of Deep Learning based descriptors. The work in [4] conducts a comprehensive evaluation and shows the advantages of Deep Learning based features. In [12], the authors apply a pre-trained Convolutional Neural Network (CNN) model, where the outputs of the intermediate layers are used as image descriptors. Using a GPU brings the processing time down to the millisecond level. In [23], the authors apply PCANet [3] to extract features as image descriptors. It takes only 10-60 ms on the City Center dataset on an NVIDIA GPU, with a recall rate of up to 20%. Deep learning methods show good performance in loop closure detection. However, their application is limited by the need for a GPU, which is costly for robotic systems.

Recently, another research work uses objects for loop closure detection [21]. It performs loop closure detection based on the objects cropped from each image. It achieves very satisfactory speed but at the cost of recall rate. Another problem is that it can fail if there are repetitive objects in the scene.

III Framework

The proposed binary content based loop closure detection framework consists of three parts: binary content construction, fast image retrieval and precise loop closure detection, as shown in Fig. 1. To use the object distribution information, the first step, binary content construction, extracts the objects or salient regions from the image and then compresses the extracted parts into a compact binary image. After that, fast image retrieval performs binary image indexing at high speed and filters out most unmatched pairs. Lastly, precise loop closure detection conducts a further check on the result to remove any false positives. Because most unmatched pairs are already filtered out during fast binary content indexing, this final check takes only limited computational resources. The details of each step are explained in this section.
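
As a rough illustration of this coarse-to-fine structure, the Python sketch below chains a cheap scoring stage with an expensive verification stage. The names `detect_loop`, `coarse_score`, `fine_check` and `coarse_threshold`, as well as the toy numeric "frames", are hypothetical and not taken from the paper. They only show how the first stage limits how often the second stage has to run.

```python
def detect_loop(query, database, coarse_score, fine_check, coarse_threshold):
    """Return (accepted indices, number of expensive checks performed).

    coarse_score: cheap similarity function (stands in for binary content matching).
    fine_check:   expensive boolean test (stands in for feature point matching).
    """
    # Stage 1: keep only database entries whose cheap score passes the threshold.
    candidates = [
        i for i, entry in enumerate(database)
        if coarse_score(query, entry) >= coarse_threshold
    ]
    # Stage 2: run the expensive check on the surviving candidates only.
    accepted = [i for i in candidates if fine_check(query, database[i])]
    return accepted, len(candidates)


# Toy stand-ins (assumption): frames are plain numbers, similarity is negative distance.
database = [5, 9, 20, 40]
loops, n_checks = detect_loop(
    query=10,
    database=database,
    coarse_score=lambda q, e: -abs(q - e),
    fine_check=lambda q, e: abs(q - e) <= 1,
    coarse_threshold=-5,
)
print(loops, n_checks)
```

In this toy run only two of the four database entries reach the expensive stage, which is the effect the framework relies on.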

III-A Binary Content Construction

The extracted binary content should be a highly representative summary of the original image. However, binary content cannot encode the color or grey level of a pixel, so we only operate at the level of salient regions. A salient region generally refers to an image part that contains rich texture. The location and shape of the salient regions are useful for loop detection: different images have different salient regions, so they can serve as a criterion for searching for paired images. To extract salient regions, we apply the Log spectral residual method [11], which has a low computational cost and a high extraction capability, and requires no prior knowledge. Given an input image $I(x)$, we define the following notations:

  • $A(f)$: The real part of the Fast Fourier Transform of image $I(x)$, $A(f) = \Re(\mathcal{F}[I(x)])$.

  • $P(f)$: The imaginary part of the Fast Fourier Transform of image $I(x)$, $P(f) = \Im(\mathcal{F}[I(x)])$.

  • $L(f)$: The log spectral of $A(f)$, $L(f) = \log(A(f))$.

The Log spectral residual $R(f)$ is defined as:

$$R(f) = L(f) - h_n(f) * L(f) \tag{1}$$

where $h_n(f)$ is an average filter of an $n \times n$ matrix. The salient region map $S(x)$ can be derived by recovering equation (1) with a Gaussian filter $g(x)$:

$$\hat{S}(x) = g(x) * \mathcal{F}^{-1}\big[\exp\big(R(f) + P(f)\big)\big]^2 \tag{2a}$$

$$S(x) = \begin{cases} 1, & \hat{S}(x) > \tau \\ 0, & \text{otherwise} \end{cases} \tag{2b}$$

where the threshold $\tau$ indicates the level of salient region extraction. A larger $\tau$ means that less salient areas are ignored and only highly salient regions or objects are retained. A demonstration of the Log spectral residual approach is shown in Fig. 2, where only the crafts are kept after filtering. The salient regions contain the most representative information of the image and in most cases are unique to each image. By binarizing each frame into a salient region map and storing it, the database is built up for later processing.
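
The extraction step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it uses the amplitude and phase of the spectrum, as in the usual statement of the spectral residual method [11], and the function names, the filter size `n`, the Gaussian width `sigma`, and the threshold rule (a multiple `k` of the mean saliency) are all assumptions.

```python
import numpy as np


def box_filter(a, n=3):
    """n x n average filter with wrap-around borders (the spectrum is periodic)."""
    out = np.zeros_like(a, dtype=float)
    r = n // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
    return out / (n * n)


def gaussian_blur(a, sigma=2.0):
    """Separable Gaussian smoothing; assumes each image side exceeds the kernel length."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    a = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 0, a)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 1, a)


def binary_content(image, n=3, sigma=2.0, k=3.0):
    """Return a boolean salient region map for a 2-D grey-level image."""
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)  # small offset avoids log(0)
    phase = np.angle(spectrum)
    # Spectral residual: log spectrum minus its local average.
    residual = log_amplitude - box_filter(log_amplitude, n)
    # Back to the image domain, then smooth the squared magnitude.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = gaussian_blur(saliency, sigma)
    # Assumed threshold rule: keep pixels above k times the mean saliency.
    return saliency > k * saliency.mean()


rng = np.random.default_rng(0)
demo_mask = binary_content(rng.random((32, 32)), k=1.0)
print(demo_mask.shape, int(demo_mask.sum()))
```

A larger `k` keeps fewer pixels, matching the role of the threshold described above. The resulting boolean array is what would be stored in the database for each frame.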

III-B Fast Image Retrieval

Fast image retrieval aims to match the binary content of the current frame against the database. The key idea is to use fast logical operations for the search. Similar scenes share a similar distribution of salient regions: when a place is revisited, the lighting or view angle may change slightly, but the distribution remains largely the same. Hence, we compare the salient region maps of two images with an element-wise similarity check:

(3)

which yields a similarity factor for the two images by counting the number of "true" values in the resulting matrix. Fast image retrieval is then performed by simply thresholding this similarity factor. In addition, we define a binary image center

(4)

computed from the coordinates of the pixels in the image. By setting a threshold, we can simply filter out unmatched pairs. An example of binary content based fast indexing is shown in Fig. 3. We randomly pick 3 frames from the KITTI dataset [10]. The first and second frames are taken at the same place but at different times, while the third frame is taken at another, similar place. The first and second frames therefore form a loop closure pair, but the first and third frames do not. Applying the fast indexing to both pairs, the resulting similarity factors show that the second frame is much more similar to the first frame than the third frame is.
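
To make the idea of an element-wise logical comparison concrete, here is a small pure-Python sketch. The exact forms of equations (3) and (4) are not reproduced here. The agreement fraction computed with XOR and the centroid of the "true" pixels are assumed stand-ins, and the function names and toy maps are hypothetical.

```python
def pack(mask):
    """Pack a 2-D list of 0/1 values into one Python integer (row-major bitset)."""
    bits = 0
    for row in mask:
        for value in row:
            bits = (bits << 1) | (1 if value else 0)
    return bits


def popcount(bits):
    """Count the number of 'true' bits."""
    return bin(bits).count("1")


def similarity(bits_a, bits_b, n_pixels):
    """Assumed similarity factor: fraction of pixels on which two maps agree."""
    differing = popcount(bits_a ^ bits_b)  # one XOR compares every pixel at once
    return 1.0 - differing / n_pixels


def centre(mask):
    """Assumed binary image centre: centroid (row, col) of the true pixels."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    n = len(coords)
    return (sum(r for r, _ in coords) / n, sum(c for _, c in coords) / n)


# Toy 3 x 4 maps: b is a slightly changed view of a, c is a different scene.
a = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
b = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
c = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]]

sim_ab = similarity(pack(a), pack(b), 12)
sim_ac = similarity(pack(a), pack(c), 12)
print(round(sim_ab, 3), round(sim_ac, 3), centre(a), centre(c))
```

With a threshold between the two scores, the pair (a, b) would pass on to the precise check while (a, c) would be discarded. Packing each map into an integer means the comparison costs a single XOR and a bit count, which is why this stage can be fast.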

Fig. 4: An example of SURF feature points matching.

Dataset        Image Size    Source of Ground Truth
KITTI          370 × 1226    GPS
New College    640 × 480     GPS
City Center    640 × 480     GPS
TABLE I: Information of Different Datasets.

Dataset        Mean Time (ms)    Average Recall Rate (%)    Precision (%)
KITTI          130               54.9                       100
New College    92                20.9                       100
City Center    86                27.7                       100
TABLE II: Loop detection results of our approach.

Fig. 5: Loop closure detection result of the proposed method. (a) Ground truth of KITTI sequence 00. (b) Loop closure detection result.

III-C Precise Loop Closure Detection

Fast image retrieval removes most unmatched pairs. However, binary content only captures the structure of the image content, which makes it fast but not accurate enough on its own. Since the traditional approach based on SURF feature descriptors performs well at matching images, we add a feature point based comparison to further increase precision.

SURF feature points are extracted from each frame because of their high precision in image matching [1], and SURF descriptors are used to examine each candidate image pair. Fig. 4 shows an example of feature matching. The number of matched feature points indicates the similarity of an image pair.
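
The text does not state how matched feature points are counted, so the sketch below swaps in a common choice, the nearest-neighbour ratio test, purely for illustration. The descriptors are toy 2-D vectors rather than real SURF descriptors, and `ratio`, `min_matches` and the function names are hypothetical.

```python
import math


def count_matches(desc_a, desc_b, ratio=0.7):
    """Count descriptors in desc_a whose nearest neighbour in desc_b is unambiguous."""
    matches = 0
    for d in desc_a:
        dists = sorted(math.dist(d, e) for e in desc_b)
        # Ratio test (assumption): best distance must be clearly below the second best.
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            matches += 1
    return matches


def is_loop_closure(desc_a, desc_b, min_matches=2, ratio=0.7):
    """Accept the pair when enough feature points match (threshold is an assumption)."""
    return count_matches(desc_a, desc_b, ratio) >= min_matches


# Toy descriptors: frame_b contains near-copies of frame_a's points, frame_c does not.
frame_a = [[0.0, 0.0], [10.0, 10.0]]
frame_b = [[0.0, 0.1], [10.0, 10.1], [5.0, 5.0]]
frame_c = [[5.0, 5.0], [5.0, 5.1], [5.1, 5.0]]

print(count_matches(frame_a, frame_b), count_matches(frame_a, frame_c))
print(is_loop_closure(frame_a, frame_b), is_loop_closure(frame_a, frame_c))
```

Because this comparison is brute force over all descriptor pairs, it is only affordable when it runs on the few candidates that survive fast image retrieval.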

IV Experiment Results

To demonstrate its robustness, we test the proposed method on several datasets: the KITTI, New College and City Center datasets [10, 18, 8]. Information on each dataset is given in Table I. The most important performance indexes for loop closure detection are recall rate, recall precision and speed. Recall precision is the ratio of correct loop closure detections to the total number of loop closures detected. The higher the recall precision the better, since any false positive can easily cause filter divergence. Recall rate is the ratio of correctly detected loop pairs to the total number of loop pairs in the ground truth. In this section, we provide a detailed analysis of the proposed method.
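
The two ratios defined above can be computed as follows. This is a minimal sketch with made-up frame index pairs. It assumes that a detection counts as correct only when the exact pair appears in the ground truth, whereas a real evaluation would typically allow some tolerance.

```python
def precision_recall(detected, ground_truth):
    """Return (recall precision, recall rate) for sets of loop closure pairs."""
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    # Convention chosen here (assumption): empty sets give 0.0 instead of dividing by zero.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical pairs of (current frame index, matched earlier frame index).
ground_truth = {(200, 10), (201, 11), (300, 50), (301, 51)}
no_false_positive = {(200, 10), (300, 50)}
one_false_positive = {(200, 10), (250, 99)}

print(precision_recall(no_false_positive, ground_truth))
print(precision_recall(one_false_positive, ground_truth))
```

The first case shows how a method can reach 100% recall precision while detecting only part of the loop pairs, which is the operating point reported in the tables.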

Fig. 6: Comparison of binary content extraction with existing methods. (a) Ground truth. (b) Binary content-based approach. (c) FABMAP. (d) DBoW2.

Sequence       Method          Mean Time (ms)    Recall Rate (%)    Precision (%)
Sequence 00    Our Approach    130               54.9               100
Sequence 00    FABMAP          1124              32.2               97.7
Sequence 00    DBoW2           460               57.2               100
Sequence 02    Our Approach    129               47.7               100
Sequence 02    FABMAP          1162              23.4               49
Sequence 02    DBoW2           448               38.9               100
Sequence 05    Our Approach    118               62.5               100
Sequence 05    FABMAP          1021              35.3               98
Sequence 05    DBoW2           355               54.0               100
TABLE III: Quality Analysis of FABMAP and the proposed method on different datasets.

IV-A Experiment result on public dataset

We conduct the tests on an Intel® NUC mini computer, which is popular in robotics related applications. The proposed method is tested on the datasets mentioned above. The loop closure detection results are collected and displayed in Matlab for visualization. An example of our loop closure detection approach on the KITTI dataset is shown in Fig. 5. The figure plots the trajectory of the camera: the first image marks the ground truth loop closures with black circles, while the second image marks the detection results with red circles. Each circle refers to a loop closure pair. The figure shows that no false positive is detected and that most loop closure places are identified. Our method achieves a recall rate of 54.9% and a recall precision of 100% on this sequence. More test results can be found in Table II. In total we pick 5 recordings with loop closures from the KITTI dataset, and our method achieves a recall rate of more than 50% on average without any false positive. It also achieves a recall rate of 20.9% on the New College dataset and 27.7% on the City Center dataset without false positives. Meanwhile, our method still runs at about 10 Hz on average.
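
The quoted frame rate can be related to the mean times in Table II with a short calculation. How the average was actually computed is not stated, so averaging the per-frame time over the three datasets is only one plausible reading.

```python
# Mean processing time per frame in milliseconds, taken from Table II.
mean_time_ms = {"KITTI": 130, "New College": 92, "City Center": 86}

# Frames per second for each dataset: 1000 ms divided by the time per frame.
rate_hz = {name: 1000.0 / t for name, t in mean_time_ms.items()}

# Overall rate based on the average time per frame across the three datasets
# (assumption: the source does not say how its average was formed).
average_time_ms = sum(mean_time_ms.values()) / len(mean_time_ms)
average_rate_hz = 1000.0 / average_time_ms

for name, hz in rate_hz.items():
    print(f"{name}: {hz:.1f} Hz")
print(f"average: {average_rate_hz:.1f} Hz")
```

Under this reading the per-dataset rates fall between roughly 8 and 12 Hz, with an overall figure close to 10 Hz.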

IV-B Comparison with other methods

We further compare our method with state-of-the-art methods, namely FABMAP and DBoW2 [6, 9]. For consistency, all experiments are conducted on the Intel® NUC mini computer. To obtain a clear comparison, we pick the largest datasets with loop closures, since the efficiency gap widens as the database size increases. We use KITTI sequences 00, 02 and 05, with more than 10k frames in total. We first test KITTI sequence 05 with each method; the result is shown in Fig. 6. In the experiment, we finely tuned the thresholds of both FABMAP and DBoW2 to obtain their best recall rate and recall precision, whereas our approach does not require tuning any parameter for a specific dataset. In addition, both FABMAP and DBoW2 require offline training on a similar dataset in advance, while the proposed method does not. Our method reports most of the loop closure places correctly, while FABMAP produces false positives and DBoW2 fails to report loop closures in some places. Detailed results for all three sequences are shown in Table III. The proposed method is about 3 times faster than DBoW2 and about 9 times faster than FABMAP. Our approach can afford the sophisticated SURF feature to achieve high precision because the feature-wise comparison does not occur frequently. Hence it also provides reliable precision and recall rate. A demonstration of the experiment result can be found at https://youtu.be/YCRd3N0LwSA.
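
The speed-up factors can be checked directly against the mean times in Table III. The snippet below only divides the reported numbers. The dictionary layout and the function name are illustrative.

```python
# Mean time per frame in milliseconds, copied from Table III.
mean_time_ms = {
    "Sequence 00": {"Our Approach": 130, "FABMAP": 1124, "DBoW2": 460},
    "Sequence 02": {"Our Approach": 129, "FABMAP": 1162, "DBoW2": 448},
    "Sequence 05": {"Our Approach": 118, "FABMAP": 1021, "DBoW2": 355},
}


def speedups(times):
    """Ratio of each baseline's mean time to the proposed method's mean time."""
    ours = times["Our Approach"]
    return {m: t / ours for m, t in times.items() if m != "Our Approach"}


speedup_table = {seq: speedups(times) for seq, times in mean_time_ms.items()}
for seq, row in speedup_table.items():
    print(seq, {method: round(s, 1) for method, s in row.items()})
```

Across the three sequences the ratios range from about 3.0 to 3.5 for DBoW2 and from about 8.6 to 9.0 for FABMAP.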

V Conclusion

In this paper, we have presented a fast loop closure detection method via binary content. Traditional approaches such as FABMAP and DBoW2 use feature descriptors to compress the image content and build a descriptor vocabulary for indexing. However, these methods require intensive mathematical calculation to estimate the similarity of two images, which is less efficient than binary operations. We observe that operating on binary images can give a similar result at a higher speed than feature descriptors. Based on this observation, we proposed a new framework for loop closure detection which consists of three parts: binary content construction, fast image retrieval and precise loop closure detection. The experimental results demonstrate that it detects most loop closure places without false positives. The proposed method was also compared with state-of-the-art methods such as FABMAP and DBoW2, and the results show that it outperforms these approaches in both recall rate and speed. In addition, no offline training is required, so it is easy to implement.

Acknowledgment

The author would like to thank Mr. Wang Chen for many great suggestions during the course of this research work.

References

  1. H. Bay, T. Tuytelaars and L. Van Gool (2006) SURF: speeded up robust features. In European Conference on Computer Vision, pp. 404–417.
  2. M. Calonder, V. Lepetit, C. Strecha and P. Fua (2010) BRIEF: binary robust independent elementary features. In European Conference on Computer Vision, pp. 778–792.
  3. T. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng and Y. Ma (2015) PCANet: a simple deep learning baseline for image classification. IEEE Transactions on Image Processing 24 (12), pp. 5017–5032.
  4. K. Chatfield, K. Simonyan, A. Vedaldi and A. Zisserman (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
  5. C. Chow and C. Liu (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14 (3), pp. 462–467.
  6. M. Cummins and P. Newman (2008) FAB-MAP: probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research 27 (6), pp. 647–665.
  7. M. Cummins (2009) Highly scalable appearance-only SLAM - FAB-MAP 2.0. Proc. Robotics: Sciences and Systems (RSS), 2009.
  8. J. Engel, V. Usenko and D. Cremers (2016) A photometrically calibrated benchmark for monocular visual odometry. arXiv preprint arXiv:1607.02555.
  9. D. Gálvez-López and J. D. Tardós (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197.
  10. A. Geiger, P. Lenz, C. Stiller and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237.
  11. X. Hou and L. Zhang (2007) Saliency detection: a spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
  12. Y. Hou, H. Zhang and S. Zhou (2015) Convolutional neural network-based image representation for visual loop closure detection. In IEEE International Conference on Information and Automation, 2015, pp. 2238–2245.
  13. Y. Huang, F. Sun and Y. Guo (2016) VLAD-based loop closure detection for monocular SLAM. In IEEE International Conference on Information and Automation (ICIA), 2016, pp. 511–516.
  14. S. Leutenegger, M. Chli and R. Y. Siegwart (2011) BRISK: binary robust invariant scalable keypoints. In IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2548–2555.
  15. D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
  16. S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke and M. J. Milford (2016) Visual place recognition: a survey. IEEE Transactions on Robotics 32 (1), pp. 1–19.
  17. R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.
  18. M. Smith, I. Baldwin, W. Churchill, R. Paul and P. Newman (2009) The New College vision and laser data set. The International Journal of Robotics Research 28 (5), pp. 595–599.
  19. Y. Uchida, S. Sakazawa and S. Satoh (2016) Image retrieval with Fisher vectors of binary features. ITE Transactions on Media Technology and Applications 4 (4), pp. 326–336.
  20. D. G. Viswanathan (2009) Features from accelerated segment test (FAST).
  21. H. Wang, C. Wang and L. Xie. Loop closure detection via salient object.
  22. B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid and J. Tardós (2009) A comparison of loop closing techniques in monocular SLAM. Robotics and Autonomous Systems 57 (12), pp. 1188–1197.
  23. Y. Xia, J. Li, L. Qi and H. Fan (2016) Loop closure detection for visual SLAM using PCANet features. In International Joint Conference on Neural Networks (IJCNN), 2016, pp. 2274–2281.