MILD: Multi-Index Hashing for Appearance Based Loop Closure Detection
Abstract
Loop Closure Detection (LCD) has proven to be extremely useful in globally consistent visual Simultaneous Localization and Mapping (SLAM) and appearance-based robot relocalization. Methods exploiting binary features in a bag-of-words representation have recently gained popularity for their efficiency, but suffer from low recall due to the inherent drawback that high-dimensional binary feature descriptors lack well-defined centroids. In this paper, we propose a real-time LCD approach called MILD (Multi-Index Hashing for Loop closure Detection), in which image similarity is measured by direct feature matching to achieve high recall. With the aid of Multi-Index Hashing (MIH), this introduces no extra computational complexity. A theoretical analysis of the approximate image similarity measurement using MIH is presented, which reveals the trade-off between efficiency and accuracy from a probabilistic perspective. Extensive comparisons with state-of-the-art LCD methods demonstrate the superiority of MILD in both efficiency and accuracy.
Lei Han, Lu Fang 
Robotic Institute, Hong Kong University of Science and Technology 
lhanaf@connect.ust.hk, eefang@ust.hk 
Index Terms: Place Recognition, Visual Relocalization, Loop Closure Detection, Multi-Index Hashing
1 Introduction
Visual Loop Closure Detection (LCD) tries to detect previously visited places based on appearance information of the scene. LCD can play an important part in globally consistent visual Simultaneous Localization and Mapping (SLAM) systems [1, 2] and appearance-based robot relocalization [3]. For visual SLAM, state-of-the-art approaches [2] only handle a local window of recently added frames, while the previous frames are marginalized out due to the limitation of computational complexity, resulting in the accumulation of state (position and orientation) error. LCD is introduced to identify places that have already been visited, thus creating an observation between a historical state and the current state. The accumulated error can be effectively reduced based on this observation.
The most widely used LCD methods are local feature based methods, which model image similarity based on hand-crafted features. Since [10], most methods [4, 5, 6, 7, 8, 9] use the Bag of Words (BOW) scheme to represent an image: feature points are extracted from the image and clustered into different centroids called visual words. A histogram of the visual words that appear is then used to represent the image. The similarity of an image pair is computed from the difference between the visual word histograms. One well-known drawback of BOW is the perceptual aliasing introduced in the clustering step when two dissimilar features are clustered into the same visual word. The performance of clustering depends on the quality of a dictionary trained either beforehand [4] or online [5].
Conventional methods [4, 5, 6] using real-valued features like SIFT [11] or SURF [12] suffer from high computational complexity in feature extraction and feature classification. To deal with this problem, recent methods like BOBW [7], IBuILD [8] and ORB-SLAM [9] have proposed to use efficient binary features like ORB [13] or BRISK [14]. While binary feature based LCD methods can run in real time, the accuracy of these methods (typically measured by precision and recall metrics [5]) is unsatisfactory.
In this paper, MILD (Multi-Index hashing for appearance based Loop closure Detection) is proposed as an appearance based LCD approach exploiting the efficiency of binary features. Instead of using the BOW representation widely adopted by previous methods, image similarity is measured by direct feature matching. With the aid of Multi-Index Hashing (MIH) [15], this introduces no additional computational complexity. Contributions of this paper include:

We propose a novel LCD system based on Multi-Index Hashing (MILD). In particular, we do not explicitly find the exact nearest neighbor of each feature or use a BOW representation for images. Instead, MIH is used to approximate the image similarity measurement, so that redundant computations between dissimilar features can be avoided.

The approximated image similarity measurement based on MIH is analyzed from a probabilistic perspective, which reveals the trade-off between accuracy and complexity in MILD and explains its high accuracy and low complexity compared with state-of-the-art algorithms.
2 Related Work
In this work, we focus on LCD using local image features. Approaches such as global image descriptors [17, 18] or the use of illumination-invariant components to improve image similarity measurement under different lighting conditions [19] are not discussed, but can be combined with ours for a more robust LCD system.
The accuracy of binary feature based LCD methods is unsatisfactory. The authors of [20, 21] investigate this problem and find that binary features are not straightforward to cluster using existing nearest neighbor search methods, due to the high dimensionality and the nature of the binary descriptor space. To overcome this deficiency, [21] projects binary features into a real-valued vector space and implements nearest neighbor search in this space.
An alternative way to perform LCD is direct feature matching, as proposed by [16, 22]. Instead of using a BOW representation, [16] proposes to use raw features to represent an image directly (BoRF), which significantly improves the recall performance. [22] adopts Locality Sensitive Hashing (LSH) for fast approximate nearest neighbor search based on the SIFT feature. These methods suffer from high computational complexity and do not scale well as the number of candidate images increases.
We address this problem with Multi-Index Hashing (MIH), proposed by [15] to hash long binary codes for fast information retrieval. Recently, [23] uses MIH for exact nearest neighbor search and tries to find the optimal substring length given the database size, code length and search radius, so as to minimize the upper bound of the search cost. Experiments show that the search cost grows rapidly as the search radius increases.
As a method of nearest neighbor search, MIH has already been used in different applications like image relocalization [24] and image search [25]. [24] follows the same procedure as [23] and reports that MIH is inefficient in finding the exact nearest neighbor for each feature, while [25] only explores the use of partial binary descriptors created in MIH as direct codebook indices, and follows a traditional BOW method to measure image similarity.
In contrast to the previous methods, we do not explicitly find the exact nearest neighbor of each feature or use a BOW representation for images. Instead, MIH is used to approximate the image similarity function proposed in [26]. The accuracy and efficiency of this approximation are analyzed from a probabilistic perspective.
3 MILD: Multi-Index Hashing for Loop closure Detection
The framework of MILD is shown in Fig. 1. MILD can be divided into two stages. The first stage calculates the similarity between the current image $I_n$ and the candidate set constructed from all the previous images $\{I_1, \dots, I_{n-1}\}$. We denote $F_i = \{f_i^1, \dots, f_i^N\}$ as the binary local feature set representing an image $I_i$, where $N$ stands for the number of features. Here the ORB feature [13] is used for its computational efficiency and rotation invariance, with each descriptor being a 256-bit binary sequence. In the second stage, given the image similarity, a Bayesian filter is applied to calculate the probability of loop closure for each candidate.
3.1 Image Similarity Measurement
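We define the similarity of image pair $(I_i, I_j)$ as
(1) 
where $\phi(\cdot, \cdot)$ refers to binary feature similarity [26], i.e.,
$$\phi(f_i^p, f_j^q) = \begin{cases} \exp\left(-d^2/\sigma^2\right), & d \le d_0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
Here $d$ denotes the Hamming distance between binary features $f_i^p$ and $f_j^q$, $\sigma$ is the weighting parameter, and $d_0$ is the predefined Hamming distance threshold.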
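The feature similarity described above can be illustrated with a minimal sketch. It assumes the thresholded exponential weighting $\exp(-d^2/\sigma^2)$ for $d \le d_0$, which is the form commonly associated with [26]. The function names and the values of `sigma` and `d0` are hypothetical and chosen only for illustration.

```python
import math


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def feature_similarity(a: bytes, b: bytes, sigma: float, d0: int) -> float:
    """Assumed form: exp(-d^2 / sigma^2) if d <= d0, otherwise 0."""
    d = hamming_distance(a, b)
    if d > d0:
        return 0.0
    return math.exp(-(d * d) / (sigma * sigma))


# Two hypothetical 256-bit (32-byte) descriptors that differ in 3 bits.
f1 = bytes(32)
f2 = bytes([0b00000111]) + bytes(31)

# sigma and d0 are illustrative values, not the settings used in the paper.
print(hamming_distance(f1, f2))
print(feature_similarity(f1, f2, sigma=16.0, d0=50))
```

Feature pairs beyond the threshold contribute nothing, which is what makes the valid similarity measurements sparse.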
A straightforward way to calculate the image similarity is a linear search over all the candidates. However, the computational cost may be prohibitive for large datasets. The number of repeated or highly similar features between the current image and the previous images is limited, which implies that the valid similarity measurements are highly sparse. We therefore propose to use Multi-Index Hashing (MIH) to avoid invalid computations, since MIH is capable of distinguishing similar features. More analysis is provided in Section 3.3.
As illustrated in Fig. 2, in MIH a long binary feature is hashed $m$ times based on its $m$ disjoint substrings. More precisely, if the Hamming distance between two features is smaller than $r$ and each feature is divided into $m$ disjoint substrings, then in at least one substring the Hamming distance between the two features is smaller than $r/m$ [23]. This implies that for two features with a small Hamming distance, the probability that they fall into the same entry in at least one hash table is close to 1. The image similarity measurement in Eqn. (1) can then be approximated using MIH, where the database is constructed online from the candidate set and the image similarity is measured during the query stage. In practice, database construction and query are implemented with MIH simultaneously.

Database construction: For every input image $I_i$ and its feature set $F_i$, all features are hashed into the $m$ hash tables by separating each feature $f$ into $m$ substrings $\{h_1(f), \dots, h_m(f)\}$, where $h_k(f)$ is the hash index of $f$ in the $k$th hash table.

Query: For the newly arrived query image $I_n$ and its binary feature set $F_n$, the similarity between $I_n$ and each candidate is initialized as 0. Let $\Omega(f)$ be the collection of candidate features that fall into the same entry as the query feature $f$. Then the similarity in Eqn. (1) can be approximated by
(3)
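Examining Eqn. (3), the approximated similarity can be calculated by a one-pass traversal of the features of the query image during the hashing process. The collection of features sharing a hash entry with a query feature is a subset of all candidate features. The probability that a candidate feature falls into this collection (denoted as the recall probability) is related to the Hamming distance between the two features and to the number of hash tables in MIH. The detailed analysis of the approximation error between the approximated and the exact similarity is provided in Section 3.3.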
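The database construction and query steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class `MihIndex`, the unnormalized score accumulation, the de-duplication of candidates seen in several tables, and the values of `sigma` and `d0` are all hypothetical. The split into 16 substrings of 16 bits follows the experimental setup.

```python
import math
from collections import defaultdict

NUM_TABLES = 16  # substrings per 256-bit descriptor, as in the experimental setup
KEY_BYTES = 2    # 16 bits per substring


def substring_keys(desc: bytes) -> list:
    """Split a 32-byte descriptor into NUM_TABLES disjoint 16-bit hash keys."""
    return [int.from_bytes(desc[KEY_BYTES * k:KEY_BYTES * (k + 1)], "big")
            for k in range(NUM_TABLES)]


def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


class MihIndex:
    """Hypothetical multi-index hash database over binary descriptors."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(NUM_TABLES)]
        self.features = []  # feature id -> (image id, descriptor)

    def insert(self, image_id, descriptors):
        """Database construction: hash every feature into all tables."""
        for desc in descriptors:
            fid = len(self.features)
            self.features.append((image_id, desc))
            for table, key in zip(self.tables, substring_keys(desc)):
                table[key].append(fid)

    def query(self, descriptors, sigma=16.0, d0=50):
        """Accumulate an unnormalized similarity score per candidate image,
        visiting only features that share a hash entry with a query feature."""
        scores = defaultdict(float)
        for desc in descriptors:
            seen = set()  # avoid counting a candidate found in several tables twice
            for table, key in zip(self.tables, substring_keys(desc)):
                for fid in table.get(key, ()):
                    if fid in seen:
                        continue
                    seen.add(fid)
                    image_id, cand = self.features[fid]
                    d = hamming(desc, cand)
                    if d <= d0:  # assumed thresholded weighting
                        scores[image_id] += math.exp(-(d * d) / (sigma * sigma))
        return dict(scores)


a = bytes(range(32))
b = bytes([a[0] ^ 0b1]) + a[1:]  # differs from a in exactly one bit
c = bytes(255 - x for x in a)    # differs from a in all 256 bits

index = MihIndex()
index.insert(0, [a])
index.insert(1, [c])
scores = index.query([b])
print(scores)
```

In this toy example only image 0 is ever visited: the descriptor of image 1 shares no substring with the query, so its distance is never computed. This is the source of the savings over linear search.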
3.2 Bayesian Inference
Bayesian inference is used to select true loop closure based on image similarity measurement and temporal coherency of camera movement [5]. To enable the detection of multiple loop closures, we propose to extend the random variable representing loop closure hypotheses at time (denoted as ) to be a binary random variable , where is the event that current image closes the loop with the past image . In this way, the time evolution model is formulated as
(4) 
where . Thus, the belief can be computed as
(5) 
Since the image similarity measurement is given by Eqn. (3), the likelihood is computed as in [5]
(6) 
where and are the mean and standard deviation of sequence . Finally, the loop closure probability given all the previous similarity measurements can be computed as
(7) 
where is defined as a fixed value to normalize the output loop closure probability. The candidates whose loop closure probability is larger than the threshold will be the detected loop closures.
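To make the inference step more concrete, the following is a rough sketch of a recursive per-candidate Bayes update. It is not a transcription of Eqns. (4) to (7). The likelihood follows the scheme usually attributed to [5] (scores above mean plus one standard deviation are boosted, all others are neutral). The time-evolution model in `predict`, its kernel, and the constant `p_new` are hypothetical choices made only for illustration.

```python
import statistics


def likelihood(scores):
    """Assumed likelihood in the style of [5]: (s - sigma) / mu for scores
    above mu + sigma, and 1 (neutral) otherwise."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if mu <= 0:
        return [1.0] * len(scores)
    return [(s - sigma) / mu if s > mu + sigma else 1.0 for s in scores]


def predict(belief, p_new=0.01, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical time-evolution model: belief diffuses to temporally
    neighbouring candidates, plus a small chance p_new of a fresh loop."""
    n = len(belief)
    out = []
    for j in range(n):
        num = den = 0.0
        for offset, w in zip((-1, 0, 1), kernel):
            if 0 <= j + offset < n:
                num += w * belief[j + offset]
                den += w
        out.append(p_new + (1.0 - p_new) * num / den)
    return out


def update(belief, scores):
    """Binary Bayes update per candidate, using the likelihood as a ratio
    between the 'loop' and 'no loop' hypotheses."""
    predicted = predict(belief)
    ratios = likelihood(scores)
    return [r * p / (r * p + (1.0 - p)) for r, p in zip(ratios, predicted)]


# Hypothetical similarity scores for ten candidates; the last one stands out.
scores = [0.1] * 9 + [2.0]
belief = update([0.05] * 10, scores)
print(belief)
```

Because each candidate carries its own binary belief, several candidates can exceed the detection threshold at once, which matches the goal of detecting multiple loop closures.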
3.3 Analysis of MIH
Suppose the binary feature is divided into $m$ disjoint substrings. The probability that a feature pair with Hamming distance $d$ falls into the same entry in at least one of the $m$ hash tables is denoted as the recall probability $P_{recall}(d, m)$. Under the assumption that Hamming errors are uniformly distributed, this is equivalent to throwing $d$ independent balls into $m$ bins at random and asking for the probability that at least one bin receives no ball, which is a solved problem [27]:
$$P_{recall}(d, m) = 1 - \frac{m!}{m^d} \left\{ {d \atop m} \right\} \qquad (8)$$
Here $\left\{ {d \atop m} \right\}$ is the Stirling partition number [27]. Fig. 3 shows how the recall probability changes with the Hamming distance $d$, as well as the influence of $m$ on the recall probability. As expected, a larger $d$ yields a smaller recall probability, while a larger $m$ makes the decrease of the recall probability more gradual.
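As a numerical illustration of the balls-into-bins argument, the sketch below evaluates the recall probability as one minus the probability that all bins are occupied, i.e. $1 - m!\,S(d, m)/m^d$ with $S(d, m)$ the Stirling partition number. The function names are hypothetical. Exact rational arithmetic is used only to avoid overflow for large $d$.

```python
from fractions import Fraction
from math import factorial


def stirling2(n: int, k: int) -> int:
    """Stirling partition number S(n, k), computed row by row with the
    recurrence S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1)."""
    if k > n:
        return 0
    row = [1] + [0] * k  # row for n = 0: S(0, 0) = 1
    for _ in range(n):
        new = [0] * (k + 1)
        for j in range(1, k + 1):
            new[j] = j * row[j] + row[j - 1]
        row = new
    return row[k]


def recall_probability(d: int, m: int) -> float:
    """Probability that at least one of m substrings is error free when
    d bit errors are spread uniformly at random over the substrings."""
    all_hit = Fraction(factorial(m) * stirling2(d, m), m ** d)
    return float(1 - all_hit)


for d in (8, 16, 32, 48, 64, 96):
    print(d, round(recall_probability(d, 16), 4))
```

Whenever $d < m$ there are fewer errors than substrings, so at least one substring must match exactly and the recall probability is 1.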
In LCD, for each feature in the query image, candidate features describing the same place are referred to as inliers and the others as outliers. The computational complexity of the approximated similarity is then proportional to the average probability of outliers falling into the same hash entry as the query feature, and its accuracy can be modeled as the average probability of inliers falling into that entry. The unavoidable computations of similarity calculation for inliers are excluded from the complexity. Using the statistics of the distance distribution for inliers and outliers of the ORB feature [13], the Hamming distances of outliers and inliers can each be modeled as a Gaussian distribution. Based on this approximation, the accuracy and complexity can be calculated as
(9) 
Given Eqn. (9), the influence of different $m$ on the trade-off between accuracy and complexity of MILD is further presented in Fig. 4. A higher accuracy indicates that the approximation error between the approximated and the exact image similarity is smaller, while a lower complexity indicates higher efficiency. Although both accuracy and complexity grow monotonically with $m$, there exists an interval of $m$ that achieves a good balance of high accuracy and low complexity. An appropriate $m$ can be chosen for different applications according to their relative emphasis on accuracy and complexity. For example, in MILD, $m = 16$ guarantees relatively high accuracy and very low computational cost. Experiments show that MILD enables loop closure detection within 15 ms for a database containing more than 1000 images, which is efficient enough for a real-time LCD system.
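The trade-off can be explored numerically with a small sketch that averages the recall probability over Gaussian distance distributions. The Gaussian means and standard deviations below are hypothetical placeholders, not the ORB statistics used in the paper, and the function names are illustrative. The recall probability is computed with an occupancy recursion, which gives the same value as the Stirling-number expression.

```python
import math


def recall_curve(m: int, max_d: int) -> list:
    """recall[d]: probability that at least one of m bins is still empty
    after d uniformly random throws, for d = 0 .. max_d."""
    occ = [1.0] + [0.0] * m  # occ[j]: probability that exactly j bins are occupied
    recall = [1.0 - occ[m]]
    for _ in range(max_d):
        new = [0.0] * (m + 1)
        for j in range(1, m + 1):
            # stay at j occupied bins, or occupy a new bin coming from j - 1
            new[j] = occ[j] * j / m + occ[j - 1] * (m - j + 1) / m
        occ = new
        recall.append(1.0 - occ[m])
    return recall


def gaussian_weights(mean: float, std: float, max_d: int) -> list:
    """Discretized Gaussian over Hamming distances 0 .. max_d, normalized."""
    w = [math.exp(-0.5 * ((d - mean) / std) ** 2) for d in range(max_d + 1)]
    total = sum(w)
    return [x / total for x in w]


def expected_recall(m: int, mean: float, std: float, max_d: int = 256) -> float:
    """Average recall probability under a Gaussian distance distribution."""
    recall = recall_curve(m, max_d)
    weights = gaussian_weights(mean, std, max_d)
    return sum(r * w for r, w in zip(recall, weights))


# Hypothetical distance statistics: inliers are close, outliers are far apart.
INLIER = (30.0, 15.0)
OUTLIER = (120.0, 20.0)

for m in (4, 8, 16, 32):
    accuracy = expected_recall(m, *INLIER)
    complexity = expected_recall(m, *OUTLIER)
    print(m, round(accuracy, 4), round(complexity, 6))
```

Under these placeholder statistics both quantities increase with $m$, but the outlier term stays far smaller than the inlier term over a range of $m$, which is the kind of interval the analysis refers to.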
4 Experiments and Discussions
To evaluate the performance of MILD, we conduct extensive experiments on different datasets (footnote 1) and compare with state-of-the-art methods: Angeli [5], RTAB-MAP [6] and BOWP [28], which are based on SIFT/SURF features, as well as BOBW [7] and IBuILD [8], which use binary features (footnote 2). The implementation of MILD will be publicly available online.
Footnote 1: NewCollege [4] contains 1073 images. CityCentre [4] contains 1237 images. Lip6Indoor [5] has 388 images. Lip6Outdoor [5] has 1063 images.
Footnote 2: All the experiments are run on an Intel Core i7 @ 2.3 GHz processor with 8 GB RAM. Only one core is used when comparing the computational efficiency of MILD with other algorithms. In MILD, 800 ORB features are extracted for each image. Each feature descriptor is divided into 16 substrings of 16 bits each. MILD additionally uses a feature Hamming distance threshold and a loop closure probability threshold.
4.1 Subjective Analysis
For a better understanding of MILD, we show intermediate results of MILD on the NewCollege dataset [4] in Fig. 5, where the approximated image similarity measurement using MIH is illustrated in Fig. 5(a). Given the image similarity measurement, Bayesian inference is employed to select loop closures among candidates, as shown in Fig. 5(b). Compared with the ground truth of loop closures (Fig. 5(c)), the proposed MILD works effectively: the image similarity score in Fig. 5(a) is high when an image pair is a true loop closure, and the detected loop closures in Fig. 5(b) closely resemble the ground truth.
4.2 Objective Evaluation
The quantitative comparisons regarding accuracy (recall rate at 100% precision) and complexity on different datasets are presented in Table 1, where the results of the compared methods are taken directly from the referenced papers. Examining Table 1, we have the following observations:

Although we do not assume a single loop closure in the inference stage, which potentially introduces more outliers, MILD still achieves competitive performance in both accuracy and complexity: its accuracy is comparable to that of SIFT/SURF feature based methods, and it runs in real time.





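Table 1: Recall rate at 100% precision (upper row of each method) and processing time (lower row) on different datasets. A dash marks an entry with no value.

| Method | Metric | CityCentre | NewCollege | Lip6Indoor | Lip6Outdoor |
|---|---|---|---|---|---|
| Angeli [5] | Recall | - | - | 80% | 71% |
| | Time | - | - | 460 ms | 753 ms |
| RTAB-MAP [6] | Recall | 81% | 89% | 98% | 95% |
| | Time | 700 ms | 700 ms | 100 ms | 400 ms |
| BOWP [28] | Recall | 86% | 77% | 92% | 94% |
| | Time | 441 ms | 393 ms | 69 ms | 120 ms |
| BOBW [7] | Recall | 30.6% | 55.9% | - | - |
| | Time | 20 ms | 20 ms | - | - |
| IBuILD [8] | Recall | 38% | - | 41.9% | 25.5% |
| | Time | - | - | - | - |
| MILD | Recall | 83% | 87.3% | 94.5% | 93.4% |
| | Time | 36 ms | 35 ms | 7 ms | 9 ms |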
Regarding memory, MIH takes 32 bytes per feature to store the feature descriptor, plus 4 bytes in each hash table to store the corresponding image index and feature index. The only fixed overhead of MILD is $2^s$ pointers for each hash table, where $s$ is the substring length in bits. In our experiments, $s = 16$ and there are $m = 16$ hash tables in total. For example, the minimum memory required for the NewCollege dataset is acceptable for modern mobile devices.
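The memory accounting above can be turned into a quick estimate. The sketch below is a back-of-the-envelope calculation from the stated per-feature costs, not a figure reported by the authors. The function name is hypothetical, and the 8-byte pointer size is an assumption for a 64-bit platform.

```python
def mih_memory_bytes(num_images: int,
                     features_per_image: int,
                     descriptor_bytes: int = 32,
                     entry_bytes: int = 4,
                     num_tables: int = 16,
                     substring_bits: int = 16,
                     pointer_bytes: int = 8):  # assumed 64-bit pointers
    """Return (per-feature storage, fixed table overhead) in bytes."""
    per_feature = descriptor_bytes + entry_bytes * num_tables
    feature_storage = num_images * features_per_image * per_feature
    fixed_overhead = num_tables * (2 ** substring_bits) * pointer_bytes
    return feature_storage, fixed_overhead


# NewCollege: 1073 images, 800 ORB features per image.
features, fixed = mih_memory_bytes(1073, 800)
print(features / 1e6, "MB for features")
print(fixed / 1e6, "MB for hash table pointers")
```

Under these assumptions the per-feature storage dominates, and the fixed overhead depends only on the number of tables and the substring length, not on the dataset size.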
5 Conclusions and Future Work
While MIH has recently shown large potential in exact nearest neighbor search [23], we extend its application to approximate nearest neighbor search and propose a novel Multi-Index Hashing scheme for the Loop closure Detection problem (MILD). Theoretical analysis reveals the trade-off between accuracy and efficiency of MIH in image similarity measurement. Experiments on public datasets show that MILD achieves competitive performance, with high accuracy and low complexity, compared with state-of-the-art LCD approaches.
In our work, a uniform distribution of binary codes is assumed. In practice, however, many features fall into the same entry in the hashing process, and such entries are discarded for efficiency. It would be interesting to incorporate prior knowledge on the non-uniform distribution of different features to improve MILD.
References
 [1] Hauke Strasdat, Local accuracy and global consistency for efficient visual slam, Ph.D. thesis, Citeseer, 2012.
 [2] Jakob Engel, Thomas Schöps, and Daniel Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
 [3] Brian Williams, Georg Klein, and Ian Reid, “Automatic relocalization and loop closing for real-time monocular slam,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 9, pp. 1699–1712, 2011.
 [4] Mark Cummins and Paul Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
 [5] Adrien Angeli, David Filliat, Stéphane Doncieux, and Jean-Arcady Meyer, “Fast and incremental method for loop-closure detection using bags of visual words,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1027–1037, 2008.
 [6] Mathieu Labbe and Francois Michaud, “Appearance-based loop closure detection for online large-scale and long-term operation,” IEEE Transactions on Robotics, vol. 29, no. 3, pp. 734–745, 2013.
 [7] Dorian Gálvez-López and Juan D Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
 [8] Sheraz Khan and Dirk Wollherr, “Ibuild: Incremental bag of binary words for appearance based loop closure detection,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 5441–5447.
 [9] Raúl Mur-Artal and Juan D Tardós, “Fast relocalisation and loop closing in keyframe-based slam,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 846–853.
 [10] Josef Sivic and Andrew Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 1470–1477.
 [11] David G Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
 [12] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
 [13] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571.
 [14] Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2548–2555.
 [15] Dan Greene, Michal Parnas, and Frances Yao, “Multi-index hashing for information retrieval,” in Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on. IEEE, 1994, pp. 722–731.
 [16] Hong Zhang, “Borf: Loop-closure detection with scale invariant visual features,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 3125–3130.
 [17] Aude Oliva and Antonio Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
 [18] Jana Kosecka, Liang Zhou, Philip Barber, and Zoran Duric, “Qualitative image based localization in indoors environments,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. IEEE, 2003, vol. 2, pp. II–3.
 [19] Will Maddern, Alex Stewart, Colin McManus, Ben Upcroft, Winston Churchill, and Paul Newman, “Illumination invariant imaging: Applications in robust vision-based localisation, mapping and classification for autonomous vehicles,” in Proceedings of the Visual Place Recognition in Changing Environments Workshop, IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 2014, vol. 2, p. 3.
 [20] Marius Muja and David G Lowe, “Fast matching of binary features,” in Computer and Robot Vision (CRV), 2012 Ninth Conference on. IEEE, 2012, pp. 404–410.
 [21] Simon Lynen, Michael Bosse, Paul Furgale, and Roland Siegwart, “Placeless place-recognition,” in 2014 2nd International Conference on 3D Vision. IEEE, 2014, vol. 1, pp. 303–310.
 [22] Hossein Shahbazi and Hong Zhang, “Application of locality sensitive hashing to real-time loop closure detection,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2011, pp. 1228–1233.
 [23] Mohammad Norouzi, Ali Punjani, and David J Fleet, “Fast exact search in hamming space with multi-index hashing,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 6, pp. 1107–1119, 2014.
 [24] Youji Feng, Lixin Fan, and Yihong Wu, “Fast localization in large-scale environments using supervised indexing of binary features,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 343–358, 2016.
 [25] Junjie Cai, Qiong Liu, Francine Chen, Dhiraj Joshi, and Qi Tian, “Scalable image search with multiple index tables,” in Proceedings of International Conference on Multimedia Retrieval. ACM, 2014, p. 407.
 [26] Liang Zheng, Shengjin Wang, and Qi Tian, “Coupled binary embedding for large-scale image retrieval,” IEEE transactions on image processing, vol. 23, no. 8, pp. 3368–3380, 2014.
 [27] Ronald L Graham, Concrete mathematics: a foundation for computer science, Pearson Education India, 1994.
 [28] Nishant Kejriwal, Swagat Kumar, and Tomohiro Shibata, “High performance loop closure detection using bag of word pairs,” Robotics and Autonomous Systems, vol. 77, pp. 55–65, 2016.