Loop Closure Detection with RGB-D Feature Pyramid Siamese Networks
In visual Simultaneous Localization and Mapping (SLAM), detecting loop closures has been an important but difficult task. Currently, most solutions are based on the bag-of-words approach. Yet the application of deep neural networks to this task has not been fully explored, owing to the lack of an appropriate architecture design and of sufficient training data. In this paper we demonstrate the applicability of deep neural networks by addressing both issues. Specifically, we show that a feature pyramid Siamese neural network can achieve state-of-the-art performance on pairwise loop closure detection. The network is trained and tested on large-scale RGB-D datasets with a novel automatic loop closure labeling algorithm. Each image pair is labeled by how much the images overlap, allowing loop closure to be computed directly rather than by labor-intensive manual labeling. We present an algorithm to adapt any large-scale generic RGB-D dataset for use in training deep loop-closure networks. We show for the first time that deep neural networks are capable of detecting loop closures, and we provide a method for generating large-scale datasets for use in evaluating and training loop closure detectors.
1 Introduction and Related Work
Loop closure detection is the process of detecting whether an agent has returned to a previously visited location. This is critical for correcting accumulated errors over a large timescale in many real-world navigation applications. In a vision-based simultaneous localization and mapping (SLAM) system, loop closures are often detected via comparison of image pairs through the journey.
Detecting loop closures is challenging because of its inherent susceptibility to variations within the scenes. A loop closure detection algorithm must be discriminative enough to avoid classifying similar rooms as a positive detection, while being resilient to small changes in object location, shifted viewpoints, different lighting conditions, and shadows that could drastically alter the visual appearance of the same scene.
1.1 Bag-of-Words Approach
The bag-of-words methodology was first proposed for text document analysis and was further adapted for computer vision applications. For image analysis, a visual analogue of a word is used in the bag-of-words model, which is based on a vector quantization process that clusters low-level visual features of local regions or points, such as color, texture, and so forth.
Currently, the bag-of-words approach is the state-of-the-art method for loop closure detection [7, 10, 20, 21]. Each image is represented as a histogram of the frequencies of the words in a dictionary that is generated offline from a large number of images. Similarity is computed by comparing the histograms of an image pair, together with certain heuristics such as spatial constraints or dynamic islands. Image pairs with high similarity are deemed possible loop closures.
Bag-of-words models, most prominently DBoW2, are built on the clustering of visual features. Various types of feature descriptors have been used, such as SIFT, SURF, BRIEF, and ORB. Each of these features has its own characteristics; some are invariant to illumination or scale but complex to compute, while others are efficient but less distinctive. These features are hand-crafted, so none of them can be robust to all application scenarios at all times. In addition, these image representations describe the local appearance of individual patches, limiting their descriptive power with respect to global descriptor methods [37, 19].
1.2 Convolutional Neural Networks
Convolutional neural networks are very powerful for learning visual representations, recognizing increasingly complicated visual patterns through the stacking of convolutional layers. With very deep architecture designs, convolutional neural networks have achieved impressive performance on classification [14, 15] and object detection [12, 25]. The ability to learn visual representations has been transferred to other tasks such as face recognition and fine-grained classification.
The success of deep convolutional neural networks suggests their capability of learning more detailed and general representations of images. Such a representation can be used to accurately indicate similarity. In fact, by ranking the similarity between images in a database, deep neural networks have already been applied to image retrieval tasks [13, 24].
There have also been some small-scale experiments applying convolutional neural networks to loop closure detection [37, 34]. However, these network designs do not sufficiently utilize the information from the environment, so their performance falls short of the state of the art set by bag-of-words models. For instance, off-the-shelf usage of convolutional features did not achieve state-of-the-art performance [29, 13] unless offline data whitening is applied, which is impractical in an online procedure.
Furthermore, there is a serious lack of large-scale training data adequate for training deep neural networks. In order for the networks to generalize, a dataset should contain sufficiently large numbers of images from both positive cases and negative cases. Meanwhile, there should be enough difficult loop closures that do not look very similar, as well as confusing non-closure image pairs that do look similar. However, most available loop closure datasets only contain several hundred to a few thousand images and fewer than 10 loop closure instances [7, 1], and are therefore inadequate for training.
Moreover, the ground truth matrices provided in many existing datasets are usually not based on visual similarity but on scene categories (e.g., kitchen or bedroom). Other larger image datasets do not provide the ground truth for loop closures at all [30, 22]. To the best of our knowledge, there is currently no proper dataset for training a deep neural network for loop closure detection.
1.3 Our Contribution
We address the existing problems by designing a novel Siamese architecture and training the network on large-scale datasets to obtain state-of-the-art results.
To achieve this goal and better utilize information from the environment, we add an input channel for depth information. This provides information about the structure of the scene and is invariant to lighting conditions. The input is passed through a feature pyramid to capture object representations at different scales.
We train the network end-to-end on large datasets with millions of image pairs. The datasets are simulated from the Stanford 2D-3D-S dataset and the ScanNet dataset. We reserve one Stanford area and 3 ScanNet scenes for use as a test set. A corresponding depth image is also generated for each color image, as shown in Table 1. We also present an algorithm for generating loop closure datasets from any similar RGB-D dataset.
By addressing the two problems that we mentioned, our model is able to achieve state-of-the-art performance on several large-scale realistic datasets that we have labeled. Therefore, we have successfully shown for the first time that deep convolutional neural networks can be effectively applied to loop closure detection. We will release our source code and the labeled datasets to the public.
2 Proposed Network
Besides off-the-shelf usage [29, 37, 34] of convolutional features, different pooling methods have been experimented with, such as max pooling and sum pooling. Generalized-mean pooling (GeM) makes it possible to adjust the pooling behavior between max and average with a parameter that can be learned through end-to-end training.
The Feature Pyramid Network is a novel architecture design that has achieved excellent performance on object detection. Its success in detecting smaller objects inspires us to use the features from different convolution layers to accurately embed the image at different scales.
Our network consists of a fully convolutional feature pyramid network, a set of $n$ pooling layers, where $n$ is the number of output scales from the feature pyramid, and a fully connected layer. Our proposed architecture is depicted in Figure 1.
At the top level, the Siamese network converts a pair of images into a pair of 2048-dimensional vectors and computes their cosine similarity. The convolutional part of the network takes one image as input, and the 4 residual blocks with lateral layers output 4 feature maps in a spatial pyramid. Each feature map is fed to a different generalized-mean pooling layer. The 4 pooled outputs are concatenated after being passed through L2-normalization, and the concatenated feature vector is whitened by a fully connected layer to generate the final output.
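More formally, the network converts each pair of images into a pair of embedding vectors. Each 4-channel input image is passed through the feature pyramid to produce 4 feature maps, one per scale. Each feature map is reduced by its own generalized-mean pooling layer to a vector of length 512. Concatenating the 4 pooled vectors gives a vector of length 2048, which the fully connected layer maps to a vector of the same length; this vector is the final embedding of the image.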
The comparison between images is based on cosine similarity for its naturally normalized metrics and good performance on face detection. Loop closures are predicted above a certain threshold of similarity. During the training phase, the difference between ground truth and similarity provides the loss from this prediction and is used for back propagation. We refer to the proposed network as Feature Pyramid Siamese Network (FPSN).
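As an illustration of the comparison step, the following minimal Python sketch computes the cosine similarity between two embeddings and applies a decision threshold. The function names and the default threshold of 0.5 are assumptions made for this example; the text does not state the operating threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def is_loop_closure(u, v, threshold=0.5):
    """Predict a loop closure when the similarity exceeds the threshold.

    The default threshold is a placeholder, not a value from the paper.
    """
    return cosine_similarity(u, v) > threshold
```

Because the score depends only on the angle between the embeddings, rescaling either vector leaves the prediction unchanged.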
2.1 Feature Pyramid Convolution Layers
The recent success of the feature pyramid network on object detection tasks, especially its success in detecting smaller objects, suggests that intermediate outputs of a network are inherently helpful for building semantic feature maps at different scales, which is very important to the fine-grained image comparison in loop closure detection. We therefore modify a pretrained ResNet50 into a feature pyramid model as the original paper did, except that we add no padding in the lateral convolutions to ease the pressure on GPU memory. The 3 intermediate outputs are passed down through sequential lateral convolution layers to generate 4 final output vectors.
Additionally, we argue that the relative distance between objects is invariant to changes in shadows and other lighting conditions, and that depth information could therefore be extremely useful. Noticing that the patterns in a depth image are very similar to those in an RGB image, we copied the weights from an RGB channel as the initial weights for the depth channel.
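A minimal sketch of this initialization is given below. It assumes the first-layer weights are stored as a nested list indexed as [output filter][input channel][row][column], and that the depth channel is initialized from the third color channel, as stated in Section 4.1. The function name `extend_first_conv` is hypothetical.

```python
def extend_first_conv(weights_rgb):
    """Turn 3-input-channel convolution weights into 4-input-channel weights.

    weights_rgb[o][c][ky][kx] holds the kernel of output filter o for input
    channel c. The new fourth (depth) channel starts as a copy of the third.
    The input is left unmodified.
    """
    extended = []
    for filt in weights_rgb:
        if len(filt) != 3:
            raise ValueError("expected exactly 3 input channels per filter")
        # Copy the three existing kernels, then append a copy of the third.
        channels = [[row[:] for row in kernel] for kernel in filt]
        channels.append([row[:] for row in filt[2]])
        extended.append(channels)
    return extended


# One output filter with 1x1 kernels for the three color channels.
rgb_weights = [[[[0.1]], [[0.2]], [[0.3]]]]
rgbd_weights = extend_first_conv(rgb_weights)
```

Starting the depth channel from pretrained weights, rather than from random values, lets the first layer respond to depth edges in the same way it already responds to color edges.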
In particular, the network takes one 4-channel RGB-D image as input and outputs 4 feature maps, one at each scale of the pyramid.
2.2 Generalized-Mean Pooling
The feature pyramid map generated in Section 2.1 is a global descriptor, which may contain similar objects that may cause confusion during comparison. To address this problem, we add a generalized-mean pooling layer to learn to propose and pool the key regions for comparison.
Suppose the input of a generalized-mean pooling layer is a feature map $\mathcal{X}$ of shape $C \times H \times W$, and $\mathcal{X}_c$ is the feature map on the $c$-th channel; then the output $f_c$ for that channel is given by
$$f_c = \left( \frac{1}{|\mathcal{X}_c|} \sum_{x \in \mathcal{X}_c} x^{p} \right)^{1/p}$$
Average pooling and max pooling are two special cases of generalized-mean pooling. When $p = 1$, all elements in the feature map are accounted for equally, which makes it effectively an average pooling layer. When $p \to \infty$, the pool pays its full attention to the maximum element, which results in max pooling. The parameter $p$ is learned through back propagation to strike an appropriate balance between these two extremes. Table 2 and Table 3 provide examples of different pooling parameters with one positive image pair taken from the first row of Table 1, at the second and the fourth scale of the feature pyramid.
In particular, the pooling layers output 4 vectors, each of length 512. The concatenated features form a 2048-dimensional vector.
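The behavior of generalized-mean pooling between its two extremes can be checked with a small sketch. It assumes non-negative activations (for example, after a ReLU) and treats one channel as a flat list; the function name `gem_pool` and the sample values are illustrative only.

```python
def gem_pool(channel, p):
    """Generalized mean of one channel's activations.

    channel: flat list of non-negative floats (assumed post-ReLU).
    p: pooling parameter; p = 1 gives the average, large p approaches the max.
    """
    if not channel:
        raise ValueError("channel must contain at least one activation")
    return (sum(x ** p for x in channel) / len(channel)) ** (1.0 / p)


activations = [0.1, 0.4, 0.2, 0.9]
for p in (1.0, 3.0, 50.0):
    # The pooled value rises from the mean towards the maximum as p grows.
    print(p, gem_pool(activations, p))
```

In the network, one such value is produced per channel, so a feature map with 512 channels yields a vector of length 512.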
2.3 Fully Connected Layer
The principal component analysis projection, in its mathematical form, is equivalent to a fully connected layer. Therefore, to make the network entirely end-to-end, we add a fully connected layer to perform online data whitening instead of performing principal component analysis offline.
We maintain the dimensionality to preserve the information while performing discriminative large-margin metric learning, in which one learns a new space where relevant images are closer. The layer takes a 2048-dimensional vector and outputs a vector of the same length as the final encoding of the input image.
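The equivalence between a PCA-style projection and a fully connected layer can be illustrated with a small sketch. A projection of the form W(x - mean) is the same as a linear layer Wx + b with bias b = -W·mean. The 2-dimensional matrix and vectors below are toy values chosen for this example, not learned parameters.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pca_project(W, mean, x):
    """Offline-style projection: center the vector, then apply W."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return matvec(W, centered)


def fully_connected(W, b, x):
    """Linear layer: W x + b."""
    return [yi + bi for yi, bi in zip(matvec(W, x), b)]


W = [[0.5, -0.5], [0.25, 0.75]]     # toy projection matrix
mean = [1.0, 2.0]                   # toy data mean
b = [-v for v in matvec(W, mean)]   # bias that absorbs the centering step
x = [3.0, -1.0]

# Both routes give the same output vector.
print(pca_project(W, mean, x), fully_connected(W, b, x))
```

Since the two forms coincide, the projection can be trained jointly with the rest of the network instead of being fitted in a separate offline step.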
2.4 Cosine Loss
The output vectors are compared in pairs. The model takes two images $I_1$ and $I_2$, and computes the similarity between their embeddings $\mathbf{v}_1$ and $\mathbf{v}_2$. We use cosine similarity, defined as:
$$\mathrm{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \, \|\mathbf{v}_2\|}$$
The ground truth of a pair is 1 if it is a loop closure, 2 if the pair is unusable, and 0 otherwise. The loss is computed based on the difference between predicted similarity and ground truth. A margin is set to further distinguish positive cases from negative cases.
Let be the ground truth, be the predicted cosine similarity, be the margin. The formula for the loss is
3 Generating Large-Scale Datasets
3.1 Traditional Test Sets are Insufficient
In the literature, the New College dataset, the City Centre dataset, the Lip6 Indoor dataset, and the Lip6 Outdoor dataset are some of the most used datasets for loop closure detection. Although these datasets can have several hundred loop closure pairs, they typically have only one loop closure sequence, which provides almost no variety in the positive pairs tested. In addition, the way these loop closure datasets are labeled is often not well suited for pairwise comparison. In fact, Lip6 Indoor and Lip6 Outdoor have asymmetric ground truth matrices and do not mark images along the diagonal as positives.
To address this problem, we developed an algorithm that generates a larger and more diverse loop closure dataset by detecting loop closures offline, using much more information than would be available to an agent. Several very large datasets provide high-quality depth and pose estimates, which we use to generate training and testing data for our neural network. The algorithm is also open source, so that others may develop their own datasets and refine their loop closure algorithms.
3.2 Stanford 2D-3D-S Dataset
The Stanford 2D-3D-S dataset contains 70,496 RGB-D images that originate from 3 different buildings of mainly educational and office use. The dataset is collected in 6 large-scale indoor areas covering over 6,000 m² using the Matterport camera, which combines 3 structured-light sensors at different pitches to capture 18 RGB and depth images during a 360° rotation at each scan location. Each 360° sweep is performed in increments of 60°, providing 6 triplets of RGB-D data per location. The output is the reconstructed 3D textured meshes of the scanned area, the raw RGB-D images, and camera metadata. This data is then post-processed to refine the depth of each image in conjunction with its pose.
3.3 ScanNet Dataset
ScanNet is an RGB-D video dataset containing 2.5 million views in 1,513 RGB-D scans of 707 unique indoor environments collected using the Occipital Structure RGB-D sensor. The Occipital sensor collects 640x480 images at 30 Hz, similar to the Microsoft Kinect. ScanNet contains a variety of small spaces such as offices, apartments, and bathrooms. Each scan has been annotated with instance-level semantic category labels through crowdsourcing. We select the largest scans from this dataset for testing, as many of the rooms are too small for use in loop closure.
3.4 Automatic Loop Closure Labeling
Each image pair from an RGB-D dataset is assigned a score based on backprojecting the point cloud from one image into the other, which we use to separate positive from negative pairs. However, there are a few more steps to prevent false positives and speed up the process.
We first subsample the dataset based on a fixed ratio to avoid generating too many pairs of similar images. We also filter out images that have too little texture or do not have valid depth information.
The volumetric overlap between the two point clouds is then calculated by comparing the convex hulls of the pair of point clouds. The volume of the intersection of the two hulls, divided by the volume of the larger of the two hulls, is used to filter out pairs of images before moving on to the next step. Image pairs with low overlap are marked as low-confidence negatives, and we do not use them during training or testing.
For the final step, for each pair we backproject the point cloud of one image into the camera of the other, then downsample and threshold the result to obtain the percent coverage the second image represents in the first. Provided that the depth camera is fully calibrated, we project the location of each point of the point cloud into image coordinates. We then downsample the image to compensate for the sparseness of the point cloud and count the number of non-zero pixels. The percent image overlap is the confidence associated with each image pair. For our purposes, we mark pairs with greater than 50% overlap as positive and do not use the rest.
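The overlap ratio used in this filtering step can be sketched as follows. To keep the example dependency-free, axis-aligned bounding boxes are swapped in for the convex hulls described above; this is a simplification for illustration, not the method used in the paper, and all function names are hypothetical.

```python
def bounding_box(points):
    """Axis-aligned bounding box (stand-in for a convex hull) of 3D points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def box_volume(box):
    """Volume of a box given as (lower corner, upper corner); 0 if empty."""
    lower, upper = box
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= max(0.0, hi - lo)
    return volume


def box_intersection(a, b):
    """Intersection of two boxes; may be empty (upper below lower)."""
    lower = tuple(max(p, q) for p, q in zip(a[0], b[0]))
    upper = tuple(min(p, q) for p, q in zip(a[1], b[1]))
    return lower, upper


def volumetric_overlap(points_a, points_b):
    """Intersection volume divided by the volume of the larger box."""
    box_a, box_b = bounding_box(points_a), bounding_box(points_b)
    larger = max(box_volume(box_a), box_volume(box_b))
    if larger == 0.0:
        return 0.0
    return box_volume(box_intersection(box_a, box_b)) / larger
```

Dividing by the larger volume keeps the ratio low when a small point cloud sits entirely inside a much larger one, so such pairs are still filtered out.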
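The projection and counting step above can be sketched as follows. The sketch assumes a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, and assumes the points have already been transformed into the coordinate frame of the target camera. The function name, the cell size, and the use of a coarse grid for downsampling are choices made for this example.

```python
def projected_coverage(points_cam, fx, fy, cx, cy, width, height, cell=8):
    """Fraction of coarse grid cells in the target image hit by any point.

    points_cam: iterable of (x, y, z) in the target camera frame, z forward.
    cell: side length in pixels of one downsampled cell (assumed value).
    """
    cols = (width + cell - 1) // cell
    rows = (height + cell - 1) // cell
    hit = set()
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point lies behind the camera
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            hit.add((int(v) // cell, int(u) // cell))
    return len(hit) / (rows * cols)


def is_positive_pair(coverage, min_coverage=0.5):
    """A pair is positive when coverage is strictly greater than 50%."""
    return coverage > min_coverage
```

Counting occupied cells on a coarse grid, instead of individual pixels, compensates for the sparseness of the projected point cloud.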
From Stanford 2D-3D-S dataset, we automatically labeled around 25,434 images from 6 areas: area1, area3, area4, area5a, area5b, area6. We reserve area 5a with 5,000 images for testing purposes, and use all the other areas for training. The training set consists of around 21,000 images that yield millions of usable image pairs. On average, image pairs that are labelled as loop closures take up around 1 in every 400 usable pairs, which we believe is consistent with the probability of loop closure occurring in large-scale indoor navigation scenarios.
From the ScanNet dataset, we similarly label images from 3 scenes: scene0000_01, scene0000_02 and scene0002_01. Because the data are captured by a relatively inexpensive device, the motion blur in the RGB images and the errors in depth are more severe than those in Stanford 2D-3D-S. All three of these sample sets are used for testing purposes.
4.1 Training Procedure
We start by training a plain ResNet50 embedder as a baseline. The architecture is identical to a ResNet50 except that we discard the last pooling layer and the fully connected layer at the end. We instead feed the output of the last convolutional layer to a generalized-mean pooling layer, and then to a fully connected layer after L2-normalization.
The weights of the convolutional layers are initialized from the classification model pre-trained on ImageNet. The initial pooling parameter is 3, which turns out to be very close to the trained value, and the weights of the fully connected layer are initialized from a random Gaussian distribution.
We use 4 GeForce GTX 1080 Ti GPUs for training. Multiple images are associated to each query to reduce GPU-memory consumption. Specifically, either 7 or 352 images are loaded in a tuple for every query. In each tuple, the first image is always the query, and the second is always the positive image; all the other images in this tuple are negative.
The training alternates between two stages. In one stage we use 32 tuples of 7 images to learn the similarity between query and positive images. In the other stage we use 1 tuple that contains 352 images, to bring the ratio between positive and negative cases to 1:350, which is as close to the ratio found in the training dataset as can be obtained within the constraints of GPU memory. At this stage the model tries to distinguish real loop closures from similar but different image pairs.
To obtain the benefits of both Adam for fast convergence and Stochastic Gradient Descent (SGD) for better generalization, we start the first round of 2 stages with Adam, using a learning rate of 1e-6, momentum of 0.9, and weight decay of 5e-5, for 24 hours per stage. Then we switch to SGD with a learning rate of 1e-5, momentum of 0.9, and weight decay of 5e-5. After 3 iterations, the model converges to over 99.9% accuracy on the training set.
Then, we train a 4-channel ResNet50 embedder. The procedure is identical to that of the plain ResNet50 above, except that the first convolutional layer takes input in 4 channels, where the initial weight of the fourth channel is copied from the third channel. We also adjust the limit on the number of images in the second stage to 302 to account for the change in memory capacity. Finally, we train the feature pyramid model initialized from the 4-channel ResNet50 model, in which the number of images at stage 2 is limited to 252.
4.2 Testing Procedure
We reserve area 5a from the Stanford 2D-3D-S dataset and scene0000_01, scene0000_02 and scene0002_01 from ScanNet for testing. Some examples can be seen in Table 1.
Area 5a is a typical teaching building at Stanford University. The dataset for this area contains 5,000 RGB images and associated depth images of the same dimensions. Each RGB-D image is then resized to a fixed size per channel. Each depth image is scaled to [0, 1] by min-max normalization. Any pixel with a depth beyond 6 meters is treated as unusable, so we set its depth value to 0. In total, there are 8,743,938 negative pairs and 25,036 positive pairs in the dataset, a ratio of roughly 350:1.
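The depth preprocessing can be sketched as follows. The order of operations is an assumption of this example: pixels beyond 6 meters are excluded first, the remaining depths are min-max normalized, and the excluded pixels are set to 0. The function name and the nested-list image representation are illustrative.

```python
MAX_DEPTH_M = 6.0  # pixels deeper than this are treated as unusable


def preprocess_depth(depth_m):
    """Normalize a depth image (2D list, meters) to the range [0, 1].

    Assumed order: drop pixels beyond MAX_DEPTH_M, min-max normalize the
    rest, and write 0 for the dropped pixels.
    """
    valid = [d for row in depth_m for d in row if d <= MAX_DEPTH_M]
    if not valid:
        return [[0.0 for _ in row] for row in depth_m]
    lo, hi = min(valid), max(valid)
    span = hi - lo
    result = []
    for row in depth_m:
        result.append([
            (d - lo) / span if d <= MAX_DEPTH_M and span > 0.0 else 0.0
            for d in row
        ])
    return result
```

With this ordering, the nearest valid pixel also maps to 0, so the sketch does not distinguish it from an unusable pixel.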
The same preprocessing of depth information is applied to ScanNet scenes. Scene0000_01 contains 197 images with 23,576 negative pairs and 331 positive pairs (71:1). Scene0000_02 contains 102 images with 6,264 negative pairs and 152 positive pairs (41:1). Scene0002_01 contains 241 images with 31,420 negative pairs and 259 positive pairs (121:1).
The model predictions for each dataset are stored as a matrix $P$, where each element $P_{ij}$ holds the cosine similarity between image $i$ and image $j$. The matrix is compared directly against the ground truth matrix $G$: a prediction counts as correct if $P_{ij}$ is above a certain threshold when $G_{ij}$ is 1 (positive pair), or below that threshold when $G_{ij}$ is 0 (negative pair). We thereby compute the number of true positive, true negative, false positive, and false negative cases for each value of the threshold. By varying the threshold from 0 to 1, we compute the corresponding precision and recall, and then plot the precision-recall curve for the dataset.
We test three networks on the above datasets: a ResNet50 network trained using only RGB images, a ResNet50 network trained using both RGB and depth images, and the Feature Pyramid Siamese Network (FPSN) proposed in Figure 1 using both RGB and depth images. We compare these networks against DBoW2, the popular open-source implementation of bag-of-words image comparison. The vocabulary file for DBoW2 is the one used in another state-of-the-art solution, ORB-SLAM2. We attempted to train a vocabulary using the Stanford dataset, but it achieved subpar results. The precision-recall curves are shown in Figure 4. Our network achieves state-of-the-art performance on all test sets.
On the Stanford 5a dataset, a ResNet50 network trained only with RGB images achieves performance similar to that of DBoW2, with lower precision than DBoW2 at low recall but higher precision at higher recall. We then see further improvement with the addition of depth information, as well as with the use of FPSN.
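The threshold sweep can be sketched as follows. The sketch assumes symmetric matrices, so each unordered pair is counted once from the upper triangle, and it skips pairs whose label is neither 0 nor 1 (such as the label 2 used for unusable pairs). The function name and the conventions for empty denominators are choices made for this example.

```python
def precision_recall_curve(similarity, ground_truth, thresholds):
    """Return a list of (threshold, precision, recall) triples.

    similarity[i][j]: predicted cosine similarity of images i and j.
    ground_truth[i][j]: 1 for a loop closure, 0 for a negative pair;
    any other label is skipped.
    """
    n = len(similarity)
    curve = []
    for t in thresholds:
        tp = fp = fn = 0
        for i in range(n):
            for j in range(i + 1, n):  # visit each unordered pair once
                label = ground_truth[i][j]
                if label not in (0, 1):
                    continue
                predicted = similarity[i][j] > t
                if predicted and label == 1:
                    tp += 1
                elif predicted:
                    fp += 1
                elif label == 1:
                    fn += 1
        # Conventions for this sketch: precision is 1 with no predictions,
        # recall is 0 with no positive pairs.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((t, precision, recall))
    return curve
```

A call such as `precision_recall_curve(P, G, [k / 100 for k in range(101)])` sweeps the threshold from 0 to 1 in steps of 0.01, where `P` and `G` stand for a prediction matrix and a ground truth matrix.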
On the different areas of the ScanNet dataset, while all of our networks still achieve state-of-the-art performance, we see that the ResNet50 with depth information no longer outperforms ResNet50 without depth information. We believe that this is due to the significant difference in depth camera characteristics between our training dataset (collected using high-quality Matterport sensors) and the ScanNet test datasets (collected using portable Occipital Structure sensors). More specifically, the network trained on the Stanford dataset would not have known the depth map characteristics of the ScanNet dataset. We do, however, see FPSN consistently outperform ResNet50, both with depth information, indicating the importance of multi-scale feature detection for loop closure.
5 Conclusion and Future Work
In this paper we have successfully demonstrated the applicability of deep neural networks to the task of pairwise loop closure detection. We show that the inclusion of a depth channel provides new and useful information about the structure of the scene, but may be subject to worse results when the noise of the sensor used for evaluation does not match the noise of the sensor in the training set. Finally we show that the use of our Feature Pyramid Siamese Network architecture improves detection results. Our network achieves the state-of-the-art performance, even outperforming bag-of-words in many cases. We further provide an algorithm for generating training data from large RGB-D datasets, opening the door for further improvements on our results via deep neural networks.
For this paper, we are only able to find quality RGB-D datasets for indoor environments. As such, we were not able to test our detector on outdoor environments. As depth cameras improve, it will be easier to collect data for a wider set of environments. With such data, it will be possible to apply our algorithm to create an all-purpose loop closure dataset similar to ImageNet that provides a thorough training and testing platform for loop closure.
There are a variety of possible extensions previously proposed for the bag-of-words approach that may allow for improved compute efficiency of this approach. We plan to explore how similar approaches can be developed and applied to deep loop closure detectors enabling these detectors to be used in a wider variety of applications.
[1] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, 24(5):1027–1037, 2008.
[2] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints, Feb. 2017.
[3] A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision, pages 1269–1277, 2015.
[4] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
[5] A. Bosch, X. Muñoz, and R. Martí. Which is the best way to organize/classify images by content? Image and vision computing, 25(6):778–791, 2007.
[6] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010.
[7] M. Cummins and P. Newman. Fab-map: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6):647–665, 2008.
[8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[9] P. Dollár, Z. Tu, P. Perona, and S. J. Belongie. Integral channel features. In A. Cavallaro, S. Prince, and D. C. Alexander, editors, BMVC, pages 1–11. British Machine Vision Association, 2009.
[10] D. Galvez-López and J. D. Tardos. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, Oct 2012.
[11] E. Garcia-Fidalgo and A. Ortiz. ibow-lcd: An appearance-based loop closure detection approach using incremental bags of binary words. CoRR, abs/1802.05909, 2018.
[12] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[13] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[16] N. S. Keskar and R. Socher. Improving generalization performance by switching from adam to SGD. CoRR, abs/1712.07628, 2017.
[17] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
[18] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
[19] M. J. Milford and G. F. Wyeth. Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1643–1649. IEEE, 2012.
[20] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
[21] R. Mur-Artal and J. D. Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[22] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[23] I. Occipital. Structure sensor-3d scanning, augmented reality, and more for mobile devices, 2016.
[24] F. Radenović, G. Tolias, and O. Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[25] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016.
[26] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. In Computer Vision (ICCV), 2011 IEEE international conference on, pages 2564–2571. IEEE, 2011.
[27] G. Salton and M. McGill. Introduction to modern information retrieval. McGraw-Hill computer science series. McGraw-Hill, 1983.
[28] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[29] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
[30] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation and Recognition, 2011.
[31] C.-F. Tsai. Bag-of-words representation in image annotation: A review. ISRN Artificial Intelligence, 2012, 2012.
[32] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. CoRR, abs/1801.09414, 2018.
[33] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
[34] Y. Xia, J. Li, L. Qi, H. Yu, and J. Dong. An evaluation of deep learning in loop closure detection for visual slam. In Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 2017 IEEE International Conference on, pages 85–91. IEEE, 2017.
[35] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
[36] H. Zhang, B. Li, and D. Yang. Keyframe detection for appearance-based visual slam. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2071–2076. IEEE, 2010.
[37] X. Zhang, Y. Su, and X. Zhu. Loop closure detection for visual slam systems using convolutional neural network. In Automation and Computing (ICAC), 2017 23rd International Conference on, pages 1–6. IEEE, 2017.