A Holistic Visual Place Recognition Approach using Lightweight CNNs for Severe ViewPoint and Appearance Changes
Recently, deep and complex Convolutional Neural Network (CNN) architectures have achieved encouraging results for Visual Place Recognition under strong viewpoint and appearance changes. However, the significant computation and memory overhead of these CNNs limits their practical deployment on resource-constrained mobile robots, which are usually battery-operated. Achieving state-of-the-art accuracy with light-weight CNN architectures is thus highly desirable but challenging. In this paper, a holistic approach is presented that combines novel regions-based features from a light-weight CNN architecture, pretrained on a place-/scene-centric image database, with a Vector of Locally Aggregated Descriptors (VLAD) encoding methodology adapted specifically for the Visual Place Recognition problem. The proposed approach is evaluated on a number of challenging benchmark datasets (under strong viewpoint and appearance variations) and achieves an average performance boost of 10% over state-of-the-art algorithms in terms of Area Under the Curve (AUC) calculated under precision-recall curves.
Given a query image, an image retrieval system aims to retrieve all images within a large database that contain objects similar to those in the query image. Visual Place Recognition (VPR) can likewise be interpreted as a system that tries to recognize a place by matching it against the places in a stored database. As with a range of other computer vision applications, deep-learned CNN features have shown promising results for the VPR problem and have shifted the focus from traditional handcrafted feature techniques to CNNs.
Using a pre-trained CNN for VPR, there are three standard approaches to produce a compact image representation: (a) the entire image is directly fed into the CNN and its layer responses are extracted; (b) the CNN is applied to user-defined regions of the image, and prominent activations are pooled and aggregated from the layers representing those regions; (c) the entire image is fed into the CNN and salient regions are identified by directly extracting distinguishing patterns from the convolutional layer responses. Generally, category (a) results in global image representations that are not robust against severe viewpoint variations and partial occlusion. Image representations emerging from category (b) usually handle viewpoint changes better but are computationally intensive. On the other hand, image representations resulting from category (c) address both appearance and viewpoint variations. In this paper, we focus on category (c).
The works by  and  are considered state of the art in identifying prominent regions by directly extracting unique patterns from convolutional layer responses for the VPR problem. In , the authors used a VGG16 network pretrained on ImageNet and used late convolutional layer activations for region identification. For regional feature encoding, bag-of-words (BoW) was employed on a separate training dataset to learn a regional codebook. The system is tested on five severe viewpoint-variant and moderate condition-variant benchmark datasets with AUC-PR curves as the evaluation metric. It claims to outperform FAB-MAP, SeqSLAM and other pooling techniques such as cross-pooling, sum/average-pooling and max-pooling.
Despite its good AUC performance, the method proposed in  has some shortcomings. A common strategy for improving CNN accuracy is to make the network deeper by adding more layers (provided sufficient data and strong regularization). However, increasing the network size means more computation and more memory at both training and test time (for storing the outputs of intermediate layers and the parameters), which is not ideal for resource-constrained robots that are usually battery-operated. Utilizing VGG16 for feature extraction along with a BoW regional dictionary degrades the performance of the method proposed by  in real-time applications. Moreover, employing a CNN model pre-trained on an object-centric database in  causes the CNN to put more emphasis on objects than on the place itself. This affects the pooled regional feature representations and leads to failure cases.
To bridge these research gaps, this paper proposes a holistic approach targeted at a CNN architecture comprising a small number of layers (such as AlexNet) pretrained on a place-/scene-centric image database to reduce the memory and computational cost for resource-constrained mobile robots. The proposed method detects novel CNN-based regional features and combines them with a VLAD feature encoding methodology adapted specifically for the VPR problem. The motivation behind employing VLAD comes from its better performance in various CNN-based image retrieval tasks while utilizing a smaller visual word dictionary compared to BoW. To the best of our knowledge, this is the first work that combines novel CNN-based regional features with VLAD adapted for VPR.
As opposed to , which uses a VGG-16 architecture pretrained on an object-centric dataset and utilizes a lower convolutional layer for feature descriptors and a higher convolutional layer for identifying landmarks, the method proposed in this paper extracts and aggregates the descriptors lying under the regions using a single convolutional layer. The presented approach achieves enhanced accuracy by employing the AlexNet architecture, which comprises a small number of layers, pretrained on the Places365 dataset. Evaluation on several viewpoint- and condition-variant benchmark place recognition datasets shows an average performance boost of 10% over state-of-the-art algorithms based on AUC computed on precision-recall curves. In Figure 1, for a query image (a), our proposed system retrieved image (c) from the stored database; (b) and (d) highlight the top distinguishing regions that our proposed methodology identified under severe viewpoint and condition variation for VPR.
The rest of the paper is organized as follows. Section II provides a literature review of VPR and the CNN models used in a range of image retrieval tasks. In Section III, the proposed methodology is presented in detail. Section IV presents the implementation details and the results achieved on several benchmark place recognition datasets. The conclusion is presented in Section V.
II Literature Review
This section provides an overview of major developments in VPR under simultaneous viewpoint and appearance changes using handcrafted features and CNN-based features.
FAB-MAP  is the first work that used handcrafted features (more specifically, SURF features) combined with BoW encoding methodology for VPR. It demonstrated robustness under viewpoint changes due to the invariance properties of SURF. Another work based on sequence matching of images named SeqSLAM  achieved remarkable results under severe appearance changes. However, it is unable to deal with simultaneous condition- and viewpoint-variation.
The first CNN-based VPR system is introduced in , which is followed by ,  and . In , the authors used Overfeat trained on ImageNet. The Eynsham and QUT datasets, with multiple traverses of the same route under environmental changes, are used as benchmark datasets. Using the Euclidean distance on the pooled layer responses, test images are matched against the reference images. On the other hand,  and  used a landmark-based approach combined with pretrained CNN models. In , the authors introduced two CNN models for the specific task of VPR (named AmosNet and HybridNet), which were trained and fine-tuned from the original object-centric CaffeNet on the place-recognition-centric SPED dataset ( million images). The SPED dataset consists of thousands of places with severe condition variance among the same places over different times. The results showed that HybridNet outperformed AmosNet, CaffeNet and PlaceNet on four publicly available datasets exhibiting strong appearance and moderate viewpoint changes.  presented an approach that identifies pivotal landmarks by directly extracting prominent patterns from the responses of the later convolutional layers of a deep object-centric VGG16 neural network for VPR. It achieves state-of-the-art performance on five severe viewpoint- and condition-variant datasets. Recently,  introduced a context-flexible attention model and combined it with a pretrained object-centric deep VGG-16 model fine-tuned on the SPED dataset to learn more powerful condition-invariant regional features. The system has shown state-of-the-art performance on three severe condition- and moderate viewpoint-variant datasets, which reveals that identifying context-based regions using a fine-tuned deep neural network is effective for severe condition-invariant VPR. However, the efficiency of that approach may be compromised under simultaneous severe viewpoint and condition variation.
Moreover, performance and efficient resource usage are two important aspects to consider in real-life robotic applications. Thus, in this paper, we focus on resource- and computation-efficient VPR under simultaneous severe viewpoint and condition variation by utilizing a pretrained place-/scene-centric shallow CNN model while maintaining the accuracy required for real-time robotic applications.
III Proposed Technique
In this section, the key steps of the proposed methodology are described in detail. It starts with the idea of stacking feature map activations and extracting CNN-based regions from the middle and late convolutional layers. It then illustrates how to aggregate the stacked feature descriptors lying under those CNN-based regions. Finally, it shows how to adapt and integrate VLAD on the aggregated CNN-based regions to determine the match between two images. The workflow of the proposed methodology is shown in Figure 2.
III-A Stacking of Convolutional Activations for Making Descriptors
For an image $I$, the output of a convolutional layer $L$ of the CNN model is a tensor $M^{L}$ of dimensions $W \times H \times K$, where $K$ denotes the number of feature maps. We can also interpret $M^{L}_{k}$ as the set of $W \times H$ activations (layer responses) of the $k$-th feature map. Each activation value of a feature map can be considered the result of a convolutional operation of some filter on the input image. For the $K$ feature maps in the convolutional layer, we stack the activations at each spatial location across all the feature maps into $K$-dimensional feature representations, shown with different colors in Figure 2 (c):

$$D^{L} = \{\, d_{l} \in \mathbb{R}^{K} \;\; \forall\, l \in \{(i,j) \mid i = 1,\dots,W;\; j = 1,\dots,H\} \,\}, \qquad d_{l} = \big[ M^{L}_{1}(l), \dots, M^{L}_{K}(l) \big] \qquad (1)$$

In (1), $D^{L}$ represents the $K$-dimensional feature descriptors at convolutional layer $L$, where $M^{L}_{k}$ is the $k$-th feature map and $L$ is the convolutional layer of the model.
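The stacking step can be illustrated with a minimal NumPy sketch. The function name `stack_descriptors`, the channel-first layout `(K, H, W)` and the random 384 x 13 x 13 example tensor (a shape typical of AlexNet conv3 for a 227 x 227 input) are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def stack_descriptors(feature_maps):
    """Stack activations across feature maps into per-location descriptors.

    feature_maps: array of shape (K, H, W), one 2-D map per filter
                  (channel-first layout is an assumption of this sketch).
    Returns an array of shape (H * W, K); row h * W + w is the
    K-dimensional descriptor at spatial location (h, w).
    """
    K, H, W = feature_maps.shape
    return feature_maps.reshape(K, H * W).T

# Hypothetical stand-in for a ReLU-activated conv3 output.
rng = np.random.default_rng(0)
conv = np.maximum(rng.standard_normal((384, 13, 13)), 0.0)
D = stack_descriptors(conv)
print(D.shape)
```

Each row of `D` corresponds to one $d_l$ in (1), so a layer with $W \times H$ spatial locations yields $W \cdot H$ descriptors of length $K$.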
III-B Identification of Regions of Interest
To make use of regions-based CNN features, the most prominent regions are first identified by grouping the non-zero and spatially 8-connected activations in each of the feature maps, as shown in Figure 2 (d). The energy of each of the $G$ identified regions is calculated by averaging over all the activations lying under it, and the top $N$ most energetic regions are picked along with their bounding boxes. Figure 3 shows sample images with the top , and novel regions. We can see that the identified regions focus on buildings, trees, pedestrians and road signals. The selected regions are collected in (2):

$$R^{L} = \{\, r_{t} \;\; \forall\, t \in \{1,\dots,N\} \,\} \qquad (2)$$
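In (2), $R^{L}$ represents the $N$ most energetic regions of interest from all the $K$ feature maps, where $t \in \{1,\dots,N\}$ indexes the regions. To pool the energetic regions-based CNN representations, the descriptors in (1) that fall under the regions in (2) are aggregated using (3):

$$f_{t} = \sum_{l \in r_{t}} d_{l}, \qquad t = 1,\dots,N \qquad (3)$$

where $d_{l}$ is the $K$-dimensional descriptor from (1) at spatial location $l$. This gives the final regions-based CNN features $F^{L} = \{f_{1},\dots,f_{N}\}$ for the $N$ energetic regions representing an image at convolutional layer $L$ (see Figure 2 (e)).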
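The region identification and aggregation steps can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the helper names `connected_regions` and `top_region_features` are hypothetical, the aggregation is assumed to be a sum over the pixels of each region (rather than over its bounding box), and the input is assumed to be a ReLU-activated `(K, H, W)` tensor.

```python
from collections import deque
import numpy as np

# Offsets of the 8 neighbours of a pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def connected_regions(fmap):
    """Group non-zero, 8-connected activations of one feature map.

    Returns a list of regions, each a list of (row, col) locations.
    """
    H, W = fmap.shape
    seen = np.zeros((H, W), dtype=bool)
    regions = []
    for r in range(H):
        for c in range(W):
            if fmap[r, c] == 0 or seen[r, c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r, c] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in NEIGHBOURS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < H and 0 <= nx < W
                            and fmap[ny, nx] != 0 and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions

def top_region_features(feature_maps, n_regions):
    """Return the aggregated descriptors of the most energetic regions.

    feature_maps: (K, H, W) activations; output shape is (n, K) with
    n <= n_regions, ordered from highest to lowest region energy.
    """
    K, H, W = feature_maps.shape
    descriptors = feature_maps.reshape(K, H * W).T  # (H*W, K)
    candidates = []
    for k in range(K):
        for pixels in connected_regions(feature_maps[k]):
            rows, cols = map(list, zip(*pixels))
            energy = feature_maps[k, rows, cols].mean()  # mean activation
            candidates.append((energy, pixels))
    candidates.sort(key=lambda item: item[0], reverse=True)
    features = []
    for _, pixels in candidates[:n_regions]:
        idx = [r * W + c for r, c in pixels]
        features.append(descriptors[idx].sum(axis=0))  # assumed sum pooling
    return np.array(features)
```

For a toy input with two 3 x 3 feature maps, a region found in one map still pools the full $K$-dimensional descriptors at its locations, which is what lets a single convolutional layer provide both the regions and their features.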
III-C Region-Based Vocabulary and Extraction of VLADs for Image Matching
Vector of Locally Aggregated Descriptors (VLAD) adopts K-means-based vector quantization: it accumulates the quantization residues of the features quantized to each dictionary cluster and concatenates the accumulated vectors into a single vector representation. To employ VLAD on the regional aggregated features, a pretrained regions-based vocabulary is needed. Thus, a separate dataset of images is collected and the regions-based aggregation described above is applied to it. To learn a diverse regional vocabulary, we employed place-recognition-centric images of places from Query247 (taken at day, evening and night times). Other images come from the benchmark place recognition dataset St. Lucia, with frames of two traverses captured in a suburban environment at multiple times of the day. The remaining images consist of multiple viewpoint- and condition-variant traverses of two urban road routes collected from Mapillary (https://www.mapillary.com/). Aggregated ROIs-Descriptors are identified for all the images and clustered into $V$ regions. K-means is employed for clustering the regions such that $c_{v}$ represents the $v$-th region center in the regional codebook $C$:

$$C = \{\, c_{v} \in \mathbb{R}^{K} \;\; \forall\, v \in \{1,\dots,V\} \,\} \qquad (5)$$
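The regional dictionary $C$ in (5) consists of aggregated ROIs-Descriptors clustered into $V$ regions. Using the learned codebook, the regional features $F^{L} = \{f_{1},\dots,f_{N}\}$ of each image in the benchmark test and reference traverses are quantized to predict the clusters/labels $Z$:

$$Z = q(F^{L}) = \{\, z_{t} \in \{1,\dots,V\} \;\; \forall\, t \in \{1,\dots,N\} \,\} \qquad (6)$$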
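Learning the regional codebook and quantizing new regions against it can be sketched with a plain Lloyd-style K-means in NumPy. The function names `quantize` and `kmeans`, the iteration count, the random initialization and the toy data are all assumptions for illustration; the paper does not specify which K-means implementation or settings were used, and nearest-center assignment is assumed for the quantization function.

```python
import numpy as np

def quantize(features, centers):
    """Assign each regional feature to its nearest codebook center.

    Returns zero-based labels (the text indexes clusters from 1).
    """
    diff = features[:, None, :] - centers[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def kmeans(features, n_clusters, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm; returns centers of shape (V, K)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), n_clusters, replace=False)
    centers = features[start].copy()
    for _ in range(n_iter):
        labels = quantize(features, centers)
        for v in range(n_clusters):
            members = features[labels == v]
            if len(members):  # keep the old center if a cluster empties
                centers[v] = members.mean(axis=0)
    return centers

# Toy "regional features": two well-separated groups in 2-D.
train = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
codebook = kmeans(train, n_clusters=2)
labels = quantize(train, codebook)
```

In practice the codebook would be learned once on the separate vocabulary dataset and then reused unchanged to label the regions of every test and reference image.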
In (6), $Z$ contains, for every region, the label of the codebook cluster under which it falls, where $q$ is the quantization function that maps the regions onto the learned codebook. Using the original regions-based features $F^{L}$, the predicted labels $Z$ and the regional codebook $C$, the VLAD descriptor for each codebook region can be retrieved using (7):

$$v_{j} = \sum_{t \,:\, z_{t} = j} \left( f_{t} - c_{j} \right), \qquad j = 1,\dots,V \qquad (7)$$
In (7), for the regions $f_{t}$ that fall in region $j$ of the codebook, the sum of the residues between the regions and the codebook's region center $c_{j}$ is calculated. Sometimes, a few regions/words appear more frequently in an image than statistically expected, a phenomenon known as visual word burstiness. A standard technique, power normalization, is applied in (8) to counter it, where each component $v$ undergoes the non-linear transformation $v \mapsto \mathrm{sign}(v)\,|v|^{\alpha}$. In (9), power normalization is followed by $L_{2}$ normalization. For every image, $V$ such components are stored to obtain the final VLAD representation.
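The residual accumulation and the two normalization steps can be sketched as below. The function name `region_vlad` is hypothetical, the exponent `alpha=0.5` is a commonly used value chosen here as an assumption (the exponent is not recoverable from the text), and applying the L2 normalization separately to each of the $V$ components is likewise an assumption of this sketch.

```python
import numpy as np

def region_vlad(features, labels, centers, alpha=0.5, eps=1e-12):
    """Build a per-cluster VLAD representation of shape (V, K).

    features: (N, K) aggregated regional features of one image
    labels:   (N,) zero-based codebook labels of those features
    centers:  (V, K) regional codebook
    """
    V, K = centers.shape
    vlad = np.zeros((V, K))
    for f, z in zip(features, labels):
        vlad[z] += f - centers[z]  # accumulate residues per cluster
    # Power normalization to reduce visual word burstiness
    # (alpha = 0.5 is an assumed value).
    vlad = np.sign(vlad) * np.abs(vlad) ** alpha
    # L2-normalize each cluster component (assumed to be per component);
    # clusters with no assigned region stay all-zero.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    return vlad / np.maximum(norms, eps)

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
features = np.array([[4.0, 0.0], [0.0, -9.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
vlad = region_vlad(features, labels, centers)
```

With unit-length components, the dot product between the same component of two images is directly their cosine similarity, which is convenient for the matching step.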
Using (12), all the cosine dot products of the $V$ regions are summed up to reach a single score $S_{A,X}$:

$$S_{A,X} = \sum_{j=1}^{V} v^{A}_{j} \cdot v^{X}_{j} \qquad (12)$$

For each test image "A", this cosine matching is performed against all the reference images and, at the end, the reference image "X" with the highest similarity score is picked as the matched image.
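The matching step reduces to a sum of per-component dot products followed by an arg-max over the reference set. The sketch below assumes each image is already represented by a `(V, K)` array of unit-normalized VLAD components; the names `match_score` and `best_match` are illustrative only.

```python
import numpy as np

def match_score(vlad_a, vlad_x):
    """Sum of the dot products between corresponding VLAD components."""
    return float((vlad_a * vlad_x).sum())

def best_match(query_vlad, reference_vlads):
    """Return the index of the best-scoring reference and all scores."""
    scores = [match_score(query_vlad, ref) for ref in reference_vlads]
    return int(np.argmax(scores)), scores

query = np.array([[1.0, 0.0], [0.0, 1.0]])
references = [
    np.array([[0.0, 1.0], [1.0, 0.0]]),  # orthogonal in both components
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # identical to the query
    np.array([[1.0, 0.0], [0.0, 0.0]]),  # one empty component
]
index, scores = best_match(query, references)
```

Because every component contributes to the sum, a reference image with many weakly similar regions can outscore one with a few strongly similar regions, which is relevant to the failure cases discussed later.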
IV Results and Analysis
This section presents the results obtained for the proposed method on several benchmark datasets under severe viewpoint and appearance changes. It also discusses the results by comparing with other state-of-the-art algorithms.
More specifically, the challenging benchmark VPR datasets Berlin A100, Berlin Halenseestrasse and Berlin Kudamm (see  for a more detailed introduction), collected from the crowdsourced geotagged photo-mapping platform Mapillary, were used to evaluate the proposed approach. Each dataset covers two traverses of the same route uploaded by different users. One traverse is used as the reference database and the other as the test database (see TABLE I). Another dataset, Garden Point, was captured at the QUT campus, with one traverse taken in daytime on the left-side walk and the other recorded at night on the right-side walk. The Synthesized Nordland dataset was recorded on a train, with one traverse taken in winter and the other in spring. Viewpoint variance was added by cropping the frames of the winter traverse to keep 75% resemblance. For all the Mapillary datasets, ground truths were generated using the geotagged information under different conditions and viewpoints, matching each image of one traverse with the closely resembling images of the other traverse. For Garden Point and Synthesized Nordland, the ground truths were obtained by parsing the frames and maintaining place-level resemblance.
|GardenPoint||200||200||campus||very strong||very strong|
The proposed method is implemented in Python, and the average system runtime over 5 iterations is recorded with images (comprising test and reference images). AlexNet pretrained on the Places365 dataset is employed as the CNN model for regions-based feature extraction with input image size. AlexNet is a light-weight CNN model that contains five convolutional and three fully connected layers. For all the baseline experiments, we utilize only conv3, although conv4 and conv5 can also be employed.
For a single image, a forward pass takes around using Caffe on an Intel Xeon Gold 6134 @3.2GHz. We extract and aggregate ROIs-Descriptors for conv3 with a total time comparable with the state-of-the-art methods  (see Table II). The VLADs are retrieved and matched using aggregated ROIs-Descriptors on and clustered dictionary . For direct comparison with , we use with . The results are also reported for with utilizing AUC-PR as the accuracy criterion. The choice of clustered dictionary is based on the number of regions: with more regions, we used a larger regional dictionary, and with fewer regions, we used a dictionary with fewer clustered regions. Table II shows that at with and with , our average matching time is and faster than .
| Methodology | Our Region-VLAD (Python) | Region-BoW (MATLAB) |
|---|---|---|
| GPU/CPU | Intel Xeon Gold 6134 @3.2GHz (32 cores) | Titan X Pascal GPU |
| Forward pass time (ms) | 15.574639 | 59 |
| Extraction and aggregation time (s) | 0.328, 0.361, 0.394, 0.402, 0.443, 0.452 | 0.349 |
| Regions "V" | 64, 128, 256 (repeated for each of the six extraction settings) | 10k visual words |
For Berlin Halenseestrasse and Synthesized Nordland, the proposed method significantly outperforms all other state-of-the-art methods in both settings (i.e., and ), as shown in Figure 5 and Figure 6. For Berlin Kudamm, our approach with a higher number of regions achieves state-of-the-art results (see Figure 7). For Berlin A100, Region-BoW performs slightly better than the proposed method (see Figure 8). The AUC-PR curves of the benchmark datasets for all other approaches were taken from .
Both Garden Point traverses exhibit strong viewpoint and condition variance, with strong temporal coherence between the frames. Taking advantage of the sequential information, SeqSLAM managed to beat all other techniques, while our approach with more regional features has shown slightly better performance than Region-BoW, which highlights the benefit of employing more regions under simultaneous viewpoint and condition variation (see Figure 9). The median AUC-PR performance across all five benchmark datasets is shown for all the methods in Figure 10. It is evident that the proposed Region-VLAD methodology achieves considerably better results than the state-of-the-art VPR approaches.
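For readers unfamiliar with the metric, the following sketch shows one common way to compute the area under a precision-recall curve from per-query matching scores. It is not the authors' evaluation script: the function name `auc_pr`, the threshold sweep over the best-match scores, the definition of recall relative to the number of correctly matched queries, and the trapezoidal integration starting from the point (recall 0, precision 1) are all assumptions of this illustration.

```python
import numpy as np

def auc_pr(scores, correct):
    """Area under the precision-recall curve via the trapezoidal rule.

    scores:  best-match similarity score of each query
    correct: whether each query's best match agrees with the ground truth
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(correct, dtype=bool)[order]
    tp = np.cumsum(hits)    # true positives as the threshold is lowered
    fp = np.cumsum(~hits)   # false positives as the threshold is lowered
    precision = tp / (tp + fp)
    recall = tp / max(int(hits.sum()), 1)
    # Start the curve at recall 0 with precision 1 (a common convention).
    precision = np.concatenate(([1.0], precision))
    recall = np.concatenate(([0.0], recall))
    widths = np.diff(recall)
    heights = (precision[1:] + precision[:-1]) / 2.0
    return float(np.sum(widths * heights))

example = auc_pr([0.9, 0.8, 0.7, 0.6], [True, True, False, True])
```

Under these assumptions, a method whose wrong matches all receive lower scores than its correct matches reaches an AUC of 1.0, so the metric rewards well-calibrated similarity scores as well as correct retrievals.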
The variation in the AUC-PR values across the benchmark datasets is due to a number of factors. The first is the environment of the dataset on which the CNN is trained. The Places365 database consists of scenes/labels, with each label containing different places exhibiting the same scene; for example, in Berlin Halenseestrasse and Berlin Kudamm, the frames contain objects like signals, buildings, cars and trees. This affects the CNN layer responses, as the network tries to find the objects on which it was trained. Hence, even from a different viewpoint of the same place, the CNN still manages to focus on the common objects, which results in better accuracy. However, we also observed that if the places contain more common objects like cars and pedestrians, and exhibit more condition variation, as in Berlin A100, then employing a scene-centric CNN sometimes deteriorates the performance. Secondly, the diversity and size of the dataset employed to build the regional vocabulary also play a crucial role, along with the equal contribution of the VLAD encoding approach for region matching. The clustered regions in the vocabulary might suit one dataset more than others. We can also see that picking more regions generally boosts the accuracy, but it sometimes degrades the performance: each region contributes to the final matching score, which might result in a wrong match if multiple reference images exhibit a similar scene, and the inclusion of more but less energetic regions lowers the overall final score for the correct match. For VLAD, the separate dataset contains only images, whereas images were employed for the BoW methodology to learn the regional dictionary. The bigger the dataset, the more diverse the dictionary will be. However, due to our system's runtime memory limitation and the need to load images for region and feature aggregation, we have confined ourselves to images.
Also, we have kept variety in our dataset to learn diverse regional features, which is reflected in our results with a small vocabulary size. Clustering the regions using K-means to build the regional vocabulary is also important: we generated the dictionary twice using the same dataset, and the AUC-PR curves across all the benchmark datasets using the two dictionaries vary with an average marginal difference of , which highlights the importance of clustering. Lastly, employing a CNN architecture with fewer layers helps to reduce the time cost and also highlights the potential to boost performance with our proposed Region-VLAD approach for VPR.
Some sample matched (green) and unmatched (red) images using the proposed methodology are shown in Figure 11 and Figure 12. For the correct matches, our proposed methodology successfully identifies the common regions (shown with different colored boxes in Figure 11) under simultaneous viewpoint and appearance changes. For those queries where the retrieved images are not matched, as in Figure 12, the identified layer's regions are shown. We observed that the mismatches are due to regions that are common across the images, e.g., trees, lamp posts, cars and buildings. Colored boxes on the regions show the areas that confuse the CNN and result in a wrong match. The failure cases also highlight the importance of CNN training. For some unmatched scenarios, we observed that the retrieved images are quite similar and geographically close to the test images but, due to the ground truth priorities, we considered those cases as unmatched. Datasets and results are placed at .
For Visual Place Recognition on resource-constrained mobile robots, achieving state-of-the-art accuracy with light-weight CNN architectures is highly desirable but challenging. This paper has taken a step in this direction and presented a holistic approach targeted at a CNN architecture comprising a small number of layers pretrained on a place-/scene-centric image database to reduce the memory and computational cost for resource-constrained mobile robots. The proposed method detects novel CNN-based regional features and combines them with a VLAD encoding methodology adapted specifically for the VPR problem. The proposed method achieves state-of-the-art results on severe viewpoint- and condition-variant benchmark place recognition datasets.
In the future, it would be valuable to analyze the performance of the proposed methodology on other shallow/deep CNN models that are individually trained or fine-tuned on place recognition datasets. Combining CNN-based regional features from multiple convolutional layers is also a worthwhile research direction for VPR under environmental changes.
-  S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
-  H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
-  Z. Chen, O. Lam, A. Jacobson, and M. Milford, “Convolutional neural network-based place recognition,” arXiv preprint arXiv:1411.1509, 2014.
-  N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” Proceedings of Robotics: Science and Systems XII, 2015.
-  P. Panphattarasap and A. Calway, “Visual place recognition using landmark distribution descriptors,” in Asian Conference on Computer Vision. Springer, 2016, pp. 487–502.
-  Z. Chen, F. Maffra, I. Sa, and M. Chli, “Only look once, mining distinctive landmarks from convnet for visual place recognition,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 9–16.
-  Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, 2018.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV). IEEE, 2003, p. 1470.
-  J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (roc) curve.” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
-  M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
-  M. J. Milford and G. F. Wyeth, “Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1643–1649.
-  L. Liu, C. Shen, and A. van den Hengel, “Cross-convolutional-layer pooling for image recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 11, pp. 2305–2313, 2017.
-  A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1269–1277.
-  G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of cnn activations,” arXiv preprint arXiv:1511.05879, 2015.
-  B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE transactions on pattern analysis and machine intelligence, 2017.
-  H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3304–3311.
-  R. Arandjelovic and A. Zisserman, “All about vlad,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2013, pp. 1578–1585.
-  N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of convnet features for place recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 4297–4304.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
-  Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3223–3230.
-  A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1808–1817.
-  H. Jégou, M. Douze, and C. Schmid, “On the burstiness of visual elements,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1169–1176.
-  T.-T. Do, T. Hoang, D.-K. L. Tan, and N.-M. Cheung, “From selective deep convolutional features to compact binary representations for image retrieval,” arXiv preprint arXiv:1802.02899, 2018.
-  “Results and datasets,” https://github.com/Ahmedest61/CNN-Region-VLAD-VPR/.