Condition-Invariant Multi-View Place Recognition

Jose M. Facil, Daniel Olid, Luis Montesano and Javier Civera. Jose M. Facil, Luis Montesano and Javier Civera are with the Robotics, Perception and Real Time Group, I3A, University of Zaragoza, {jmfacil,montesano,jcivera}@unizar.es. Daniel Olid is now at Opel España; his contribution to this work was made while he was a student at the University of Zaragoza. Luis Montesano is also with Bitbrain. This work was supported in part by the Spanish government (project DPI2015-67275), in part by the Aragon regional government (Grupo DGA-T45_17R/FSE), and by NVIDIA Corporation through the donation of Titan X and Titan Xp GPUs.
Abstract

Visual place recognition is particularly challenging when places suffer changes that modify their appearance. Such changes are indeed common, e.g., due to weather, night/day cycles, seasonal features or dynamic content. In this paper we build on recent place recognition research using deep networks and explore how it can be improved by exploiting the information from multiple views. Specifically, we propose three different alternatives (Descriptor Grouping, Descriptor Fusion and Recurrent Descriptors) for deep networks to combine visual features of several frames in a sequence. We show that our approaches produce more compact and better-performing descriptors than single- and multi-view baselines in the literature on two public databases.

I Introduction

Given a dataset of images taken at different places, visual place recognition [1, 2, 3] aims to identify the place of a new query image by associating it to one or several images of the dataset taken in the same location. Recent advances in computer vision have improved the performance of these algorithms, which are currently applied in several different applications such as image retrieval (e.g., [4]), mapping and navigation in robotics [5, 6, 7], autonomous driving [8] and augmented reality (AR) [9].

One of the main challenges of visual place recognition is dealing with changes in the appearance of places [10]. Indeed, place recognition is reasonably robust under small changes in viewpoint and illumination, due to the invariance of local features and rigidity checks [6]. In contrast, non-rigid scene changes, wide baseline matching and extreme illumination variations are considerably more challenging and result in lower performance. Using multiple frames in a sequence can improve the robustness of place recognition against such changes. But the sequence models proposed by the state of the art [11, 12] are handcrafted for a certain set of assumptions (e.g., overlapping trajectories, similar velocity patterns), and their performance suffers if those assumptions do not hold. They also typically require a high number of frames.

Descriptors directly extracted from CNNs have shown good generalization properties [13], but they usually do not exploit multi-view information. Improvements usually come at the cost of large descriptors (i.e., with sizes in the order of thousands or hundreds of thousands of elements). The complexity of all place recognition algorithms depends on the size of the descriptor and the number of images in the database, the latter typically being high. This limits the applicability of these techniques in applications such as robotics and AR/VR, in which processing time is limited due to real-time constrained loops and limited on-board computational power.

In this paper we target place recognition in the presence of challenging changes in the condition of an environment, which eventually happen in most scenes as time passes. Examples are day/night illumination, seasonal and weather changes, or objects that are moved (cars, people or furniture). We propose and evaluate three different deep network architectures that exploit multi-view and temporal information for place recognition: naïve descriptor grouping, learning the fusion of single-view descriptors, and recurrent networks using LSTM (Long Short-Term Memory) layers [14]. An overview of our proposal can be seen in Fig. 1. To the best of our knowledge, ours are the first models that use deep learning to combine multi-view information for the purpose of place recognition.

Fig. 1: Overview of our proposal. We extract descriptors (using deep networks) for small sequences of frames. We use such descriptors to find the closest match in a database of already visited places.

We evaluated our models and compared them to state-of-the-art single-view deep models and a non-deep sequential one using two standard datasets: the Partitioned Nordland [15] and Alderley [11]. The experimental results show that our three proposed multi-view models perform better than single-view networks, and that we also outperform SeqSLAM, a baseline for recognition from image sequences that does not use deep learning. Furthermore, our learned descriptors are at least one order of magnitude smaller than those of the state of the art, showing that multi-view learning is able to extract the relevant information for place recognition.

The rest of the paper is organized as follows. Section II reviews the related work. Section III gives the details of our network architectures, and Section IV of our training. Finally, Section V presents the experimental results and Section VI the conclusions and lines for future work. Our code and a video showing our results can be found on our project website: http://webdiis.unizar.es/~jmfacil/cimvpr/. A reduced version of the video accompanies the paper as supplementary material.

Fig. 2: Multi-view models proposed in this paper. The sequence descriptors are used to retrieve a visited place via Nearest Neighbour. From left to right: (a) Descriptor Grouping, where the descriptor of a sequence is the concatenation of all the single-image descriptors. (b) Descriptor Fusion, where the output of the CNNs serves as input to a fully-connected layer that combines the information into a single descriptor. (c) Recurrent Descriptors, where the output of the CNNs serves as input to an LSTM network that integrates the single-image features over time to create a multi-image descriptor.

II Related Work

There have been many works addressing visual place recognition or related problems. For a general overview, we refer the reader to two surveys, [16] on topological mapping and [1] exclusively on visual place recognition. In this section we focus on the works most relevant to ours.

II-A Single-View Place Recognition

Most works on visual place recognition extract descriptors from a single frame. Some approaches are based on handcrafted holistic image descriptors, like low-resolution thumbnails [11] or GIST descriptors [17]. Although such approaches are very efficient, their performance degrades with large illumination and viewpoint changes and occlusions. Feature-based approaches (e.g., FAB-MAP [18] and DBoW [6]), relying on local information around salient points, are more robust to those changes.

These descriptors are, however, not robust to appearance changes due to scene dynamics, seasonal and weather changes, or extreme viewpoint or lighting variations. To address this, [19] used PCA to reduce the dimensionality of descriptors, eliminating the dimensions that are influenced by condition changes. [20] incorporates attention in order to focus on the most relevant image features for place recognition. Lowry and Andreasson [21] presented a model using the SURF detector and HOG features, and studied the use of Bag of Words and Vectors of Locally Aggregated Descriptors (VLAD) for place matching.

Descriptors based on CNNs have shown a high degree of robustness against appearance changes. Sünderhauf et al. [22, 23] showed that CNNs outperform other models, especially for drastic appearance changes. They used AlexNet [24], pretrained on ImageNet [25]. The features of AlexNet contain semantic information about the whole scene, which improves the invariance to certain appearance changes. Thereafter, many other works have studied CNNs as condition-invariant feature extractors [13, 26, 27, 28, 29, 15]. Gómez-Ojeda et al. [13] were the first to train a network as a single-image feature extractor for visual place recognition under appearance changes. NetVLAD [26] proposed a new type of layer inspired by VLAD, an image representation commonly used in image retrieval. Chen et al. [28] proposed a network trained to classify the place where the image was taken. Olid et al. [15] built a model upon a pre-trained VGG-16 [30] and fine-tuned it for the task in a triplet-Siamese architecture.

II-B Multi-View Place Recognition

Only a few works consider temporal and multi-view information for place recognition, but all of them have shown that sequences provide useful extra information. For instance, DBoW [6] incorporates a temporal consistency constraint. SeqSLAM [11] and follow-up works (e.g., [31]) use sequence matching, similarly to [32]. Differently from our approach, they assume a linear temporal correlation for sequence matching; and where they use downsampled images, we use semantic information for the matching.

More recently, [12] used a graph of single-view descriptors (HOG and AlexNet-based) to model and match image sequences. Their approach is similar to SeqSLAM, with two main differences. The straightforward one is that they use different descriptors. The second one, more subtle, is that their search for the best-matching sequence does not assume a constant speed relation between the sequences: SeqSLAM looks for straight lines in the similarity matrix, while [12] uses a more sophisticated model. In any case, neither of them can address changes in the sequence direction. Moreover, they typically rely on long-term sequence matching (i.e., query and database sequences having many consecutive matching frames), which is not always the case. Both sequence models are handcrafted and, to the best of our knowledge, there are no models that, like ours, learn from data how to combine single-view CNN descriptors.

III Network Architectures

In this section, we discuss four different models: A single-view one for place recognition, based on ResNet-50, and three different extensions for multi-view place recognition.

III-A Single-View ResNet-50

Our first network is based on the model presented in [15]. The main difference is that we start from ResNet-50 [33] pretrained on ImageNet [25] as our backbone, instead of VGG-16 [30]. Although it is common to directly use the descriptors of different layers (see Section V-A for results on this), in our case we added and trained a fully connected layer after ResNet-50 to learn a 128-dimensional descriptor specifically designed for the task of visual place recognition. We chose a size of 128 experimentally, as a good compromise between performance and compactness.
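As a rough illustration (not our released code), the following PyTorch sketch shows the structure described above: a fixed ResNet-50 backbone followed by a trainable fully-connected layer producing the 128-dimensional descriptor. For simplicity it taps the final pooled features, whereas our experiments use an intermediate layer (3d-2b); every dimension other than the 128-D output is an assumption.

```python
# Hedged sketch of the single-view descriptor network (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class SingleViewDescriptor(nn.Module):
    def __init__(self, descriptor_size=128):
        super().__init__()
        backbone = models.resnet50(pretrained=True)   # ImageNet pre-training
        # Keep everything up to (and including) global average pooling;
        # the backbone stays frozen, only the extra layer is trained.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Extra fully-connected layer trained for place recognition.
        self.fc = nn.Linear(2048, descriptor_size)

    def forward(self, image):                    # image: (B, 3, H, W)
        feat = self.backbone(image).flatten(1)   # (B, 2048)
        return self.fc(feat)                     # (B, 128)
```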

III-B Descriptor Grouping

In order to include temporal information into the descriptors, our first approach is the naïve concatenation of the descriptors of consecutive frames, see Fig. 2. Thus, starting from our previous single-view model, first we choose a window of n frames (n = 3 in our experiments) to work with, second we generate the n individual descriptors, and third we concatenate them, resulting, as shown in the figure, in a 3 × 128 = 384-dimensional descriptor for the sequence. Notice that this model is trained only from single-view samples. Hence, the relation between consecutive frames is not learned and this model only provides a filtering effect.
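A minimal sketch of the grouping step, assuming per-frame 128-D descriptors as above (the function name is ours, for illustration):

```python
# Descriptor Grouping sketch: the sequence descriptor is the plain
# concatenation of the single-view descriptors of the frames in the window
# (3 frames of 128 elements give a 384-D descriptor).
import torch

def group_descriptors(frame_descriptors):
    # frame_descriptors: list of n tensors of shape (B, 128)
    return torch.cat(frame_descriptors, dim=1)   # (B, n * 128)
```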

III-C Descriptor Fusion

Descriptor Grouping, as the simplest strategy to consider several frames, is limited in its capability to weight certain features differently (i.e., features of some of the frames may be more representative of the place than others). For that reason we wanted to build a model that learns to fuse the information of our n-frame window into a more discriminant, as well as smaller, 128-dimensional descriptor. With this Descriptor Fusion strategy, we add an extra fully connected layer that learns how to combine the outputs of ResNet-50 into a single compact descriptor. See Fig. 2 for an illustration of this approach. As this network is able to learn how to weight the features from different frames, it can model more complex cases. For example, when sequences are recorded in reverse order Descriptor Grouping is limited, while Descriptor Fusion has the capability of learning a suitable fusion.
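The sketch below illustrates such a fusion layer. The per-frame feature dimension is an assumption (our experiments feed intermediate ResNet-50 features); only the 128-D output and the 3-frame window come from the text.

```python
# Descriptor Fusion sketch: an extra fully-connected layer learns to combine
# the concatenated per-frame features into one compact 128-D descriptor.
import torch
import torch.nn as nn

class DescriptorFusion(nn.Module):
    def __init__(self, per_frame_dim=2048, n_frames=3, descriptor_size=128):
        super().__init__()
        self.fuse = nn.Linear(per_frame_dim * n_frames, descriptor_size)

    def forward(self, frame_features):     # list of n tensors, each (B, per_frame_dim)
        stacked = torch.cat(frame_features, dim=1)   # (B, n * per_frame_dim)
        return self.fuse(stacked)                    # (B, 128)
```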

III-D Recurrent Descriptors

Descriptor Fusion does not explicitly exploit the sequential nature of the data. In this model we wanted to update the sequence descriptor online as new frames arrive, keeping the most relevant previous information. With that intention, we propose a Recurrent Neural Network (see Recurrent Descriptors in Fig. 2). In this model, every frame is the input to a ResNet-50, and the top layers serve as the input to an LSTM network [14] that generates a 128-dimensional descriptor. LSTMs keep an inner state that is updated with each input frame, and the output depends on the state and the input. Differently from the previous models, keeping a recurrent inner state allows this network to produce a descriptor from the first frame and update it sequentially as more frames arrive.
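A hedged sketch of the recurrent variant, again with an assumed per-frame feature dimension; it shows how the LSTM state lets the descriptor be produced from the first frame and refined as new frames arrive.

```python
# Recurrent Descriptors sketch: an LSTM consumes the per-frame features and
# its output at the last processed frame is used as the 128-D descriptor.
import torch
import torch.nn as nn

class RecurrentDescriptor(nn.Module):
    def __init__(self, per_frame_dim=2048, descriptor_size=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=per_frame_dim,
                            hidden_size=descriptor_size,
                            batch_first=True)

    def forward(self, frame_features):          # (B, n_frames, per_frame_dim)
        outputs, _ = self.lstm(frame_features)  # (B, n_frames, 128)
        return outputs[:, -1, :]                # descriptor after the last frame
```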

IV Training

IV-A Convention for Same Place

Fig. 3: Same-place convention, illustrated with an example where the query-sequence has a length of 3 frames. A place represents a set of frames that are considered to be in the same place. Notice that a frame can be in more than one place. A query-sequence is an input sequence for our model. We want to recover the corresponding place of a query-sequence.

Since our descriptor is generated from a sequence of images (query-sequence) instead of a single image, we must define when two query-sequences of frames are considered to be at the same place (the definition of a place being dataset-dependent). To illustrate this definition we make use of Fig. 3. The figure shows a sequence of frames, several examples of query-sequences, and the sets of frames that we consider to be the same place. During training, we consider two query-sequences to be at the same place if they contain two frames (one per query-sequence) that belong to the same place. For instance, in Fig. 3, query-sequence 1 and query-sequence 2 belong to the same place, as the first frame of query-sequence 1 belongs to place 1, the same place as the first frame of query-sequence 2.
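The convention can be expressed as a small check; representing places as sets of frame identifiers is an assumption made here only for illustration.

```python
# Same-place check sketch: two query-sequences are considered the same place
# if some place contains at least one frame of each sequence.
def same_place(seq_a, seq_b, places):
    """seq_a, seq_b: iterables of frame ids; places: iterable of sets of frame ids."""
    frames_a, frames_b = set(seq_a), set(seq_b)
    return any(place & frames_a and place & frames_b for place in places)
```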

IV-B Model Training

Fig. 4: Triplet architecture. We used this scheme for training all our models.

We start from ResNet-50 pre-trained on ImageNet in a standard classification task. We add the extra layers and train them on our place-recognition task in our datasets. We trained all the models proposed in this work using a triplet architecture (see Fig. 4 for a scheme and [15, 13] for more details). In a few words, triplet architectures are given three training samples: an anchor, a positive example and a negative one. During training, the objective is to reduce the distance between the anchor and positive descriptors, and to increase the distance between the anchor and the negative one. The loss we use to achieve this is the Wohlhart-Lepetit loss [34],

$L(d_a, d_p, d_n) = \max\left(0,\; 1 - \frac{\lVert d_a - d_n \rVert}{m + \lVert d_a - d_p \rVert}\right)$    (1)

where m (the margin) is a parameter that limits the difference between the distances, d_a is the descriptor generated for the anchor image, d_p is the descriptor for the positive sample and d_n is the descriptor for the negative sample (see Fig. 4). A sketch of this loss is given below; the specific training details for each model follow in the next subsections.
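The following is a hedged PyTorch sketch of Eq. (1); the margin value shown is a placeholder, not necessarily the one used in our experiments.

```python
# Hedged sketch of the Wohlhart-Lepetit triplet loss of Eq. (1).
import torch

def wohlhart_lepetit_loss(d_a, d_p, d_n, margin=0.01):
    # d_a, d_p, d_n: (B, D) descriptors of anchor, positive and negative samples.
    dist_pos = torch.norm(d_a - d_p, dim=1)   # ||d_a - d_p||
    dist_neg = torch.norm(d_a - d_n, dim=1)   # ||d_a - d_n||
    # Zero loss once the negative is sufficiently farther than the positive.
    return torch.clamp(1.0 - dist_neg / (margin + dist_pos), min=0.0).mean()
```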

IV-B1 Descriptor Grouping

This model is trained as a single-view place recognition model. Hence, the data triplets consist of single images (an anchor image, a positive example and a negative one). During training, every single image generates a 128-dimensional descriptor, which results in a 384-element descriptor (3 × 128) at test time.

IV-B2 Descriptor Fusion

Our second model learns a fusion of features for an image sequence (query-sequence). Therefore, we concatenate the outputs of the ResNet-50 for each image and add an extra fully-connected layer to generate a single descriptor of 128 elements. We train this model by generating triplet samples of 3-frame query-sequences.

IV-B3 Recurrent Descriptors

In our last model we perform a sequential update of the image descriptors using a Recurrent Neural Network, specifically an LSTM layer. In order to force the network to learn from the three images instead of only the last one, we apply random sampling to one of the images of the query-sequence and add dropout on the LSTM layer.

V Experimental Results

In this section we evaluate the three proposed models on two datasets, the Partitioned Nordland [15] and Alderley [11], and we compare them with state-of-the-art single-view and sequence-based methods.

Experimental setup: For every method, we retrieve the single nearest neighbour as the matched place for a query image or sequence. We count it as a correct match if it satisfies the same-place convention for the dataset (i.e., each dataset has its own ground-truth frame correspondence). During the experiments, we set the query-sequence length to 3 frames for all our multi-view models. We observed in our experiments that, for more than 3 frames, the performance did not improve much.

To compare different models we compute the precision of the model when recall is equal to 1. Hence, we retrieve a place for every query (the nearest neighbor) and compute the fraction of correct matches over the total number of queries.
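As a concrete, hedged illustration of this protocol, the sketch below retrieves the nearest neighbour for every query descriptor and computes the fraction of correct matches; the ground-truth check is passed in as a function, since the same-place convention is dataset dependent, and all names are ours for illustration.

```python
# Evaluation sketch: precision when recall is 1, i.e. the fraction of queries
# whose single nearest neighbour in the reference database is a correct match.
import numpy as np

def precision_at_recall_one(query_desc, db_desc, is_correct):
    # query_desc: (Q, D), db_desc: (N, D); is_correct(q_idx, db_idx) -> bool.
    # Squared Euclidean distances between every query and every database entry.
    d2 = ((query_desc[:, None, :] - db_desc[None, :, :]) ** 2).sum(axis=2)  # (Q, N)
    nn_idx = d2.argmin(axis=1)
    matches = [is_correct(q, int(n)) for q, n in enumerate(nn_idx)]
    return sum(matches) / len(matches)
```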

V-A Partitioned Nordland Dataset

Our experiments use the train-test split proposed by Olid et al. [15], training our models on the training images and evaluating their performance on the held-out test images. For every model we fixed the ResNet-50 parameters and trained only the additional layers. We evaluated the performance of the descriptors of different layers of ResNet-50; the best performing features are those of layer bn3d-branch2b (3d-2b in the table), so we use these in our experiments. We train the additional layers for several full epochs over the triplet examples.

Method               # frames   Descriptor size   W vs S   S vs W
VGG16 (pool4)            1           100352          51%      21%
VGG16 (pool5)            1            25088          13%       7%
VGG16 (fc6)              1             4096           6%       3%
VGG16 (fc7)              1             4096           4%       3%
ResNet-50 (3a-2a)        1           100352          42%      32%
ResNet-50 (3d-2b)        1           100352          73%      42%
ResNet-50 (4a-2a)        1            50176          62%      41%
ResNet-50 (4b-2a)        1            50176          62%      31%
ResNet-50 (4c-2a)        1            50176          50%      40%
ResNet-50 (4f-2b)        1            50176          12%       8%
ResNet-50 (5a-2a)        1           100352          43%      24%
Hybridnet [28]           1             4096          77%       -
Amosnet [28]             1             4096          69%      48%
Lowry et al. [19]        1             1860          67%      66%
Olid et al. [15]         1              128          75%      79%
ours (single-view)       1              128          77%      75%
SeqSLAM [11]             3             6144          31%      33%
SeqSLAM [11]             -            20480          71%      70%
SeqSLAM [11]             -           204800          95%      94%
ours (grouping)          3              384          92%      92%
ours (fusion)            3              128          87%      86%
ours (recurrent)         3              128          85%      86%
(higher is better)
TABLE I: Results on the Partitioned Nordland Dataset [15]. 1st column: method. 2nd column: number of frames used for recognition (1 stands for single-view). 3rd column: descriptor size in 32-bit floating point numbers. 4th column: Winter vs Summer (W vs S), query taken from winter and matched against the summer database. 5th column: Summer vs Winter (S vs W), query taken from summer and matched against the winter database.

Quantitative Results: The results in Table I show the performance of different models for visual place recognition on the Partitioned Nordland. In this table we report the hardest recognition cases, which are representative of the rest, specifically using Winter and Summer as query and database and vice versa. The upper part of the table shows single-view models and the lower part shows the multi-view models.

Our three proposals, ours (grouping), ours (fusion) and ours (recurrent), clearly outperform our single-view approach ours (single-view). Notice that they also outperform the state-of-the-art baselines. Among our multi-view proposals, ours (grouping) is the one achieving the best performance (92% precision using 3 frames). Notice that its descriptor size, 384, is smaller than most of the single-view and multi-view baselines.

Our methods also outperform SeqSLAM [11], a state-of-the-art baseline able to model information from several frames, when both use the same number of frames (specifically, 3). As all multi-view approaches improve their performance when increasing the number of frames, we increased the number of frames used by SeqSLAM. Notice that, in order to outperform our approach, the number of frames has to be increased until the descriptor size reaches 204800.

Fig. 5: Precision on the Partitioned Nordland Dataset [15] for our three models: Descriptor Grouping, Descriptor Fusion and Recurrent Descriptors. We evaluate the fraction of correct matches between all the seasons. S stands for Summer, F for Fall, W for Winter and Sp for Spring. Notice that our best performing model, Descriptor Grouping, maintains a high fraction of correct matches in every combination.

Fig. 5 shows the results of all our proposals for all query-reference combinations. As mentioned before, winter is always the hardest case. Notice, however, that none of our models suffers a dramatic performance drop in any combination.

Fig. 6: Examples of matched places for the single-view and the multi-view grouping models in the Partitioned Nordland Dataset [15]. The retrieved image is framed in green if it is a correct match or in red if it is incorrect. Mismatched frames are very similar and could even fool humans if they are not carefully inspected.

Qualitative Results: Fig. 6 shows some examples of matched places with the grouping model and illustrates when multi-view methods achieve better performance. Notice that, although our single-view method fails in these examples, some of the places are indeed very similar and would be hard to match even by humans.

Fig. 7: Experiment setup details. (a) Reverse Gear, in which the sequence is played in reverse order for one of the seasons (Fall in the figure). (b) Random Speed, in which the vehicle speed is modified for both seasons, reference (Winter) and query (Spring). In both cases, we mark with a dashed green box the same-place three-frame sequences.
Method             # frames    NT       RG      RS         M/S
SeqSLAM [11]           3        33%    0.08%     9%     14.0 ± 13.9
SeqSLAM [11]           -        70%    0.03%     8%     26.01 ± 31.27
ours (grouping)        3        92%     74%     36%     67.3 ± 23.3
ours (fusion)          3        86%     80%     78%     81.33 ± 3.4
ours (recurrent)       3        86%     82%     84%     84.0 ± 1.6
(higher is better)
TABLE II: Experimental results for Reverse Gear and Random Speed in the Partitioned Nordland Dataset [15]. 1st column: method. 2nd column: number of frames used for recognition. 3rd-5th columns report Summer vs Winter experiments: NT stands for "Normal Test" and corresponds to the results shown in Table I; RG stands for "Reverse Gear" (the query frames are all in reversed order, i.e., simulating that the train has used a reverse gear); RS stands for "Random Speed" (the speed of the train is simulated to be random, which means some of the frames are lost). The speed variations are independent for the query and the reference databases, which implies that there is no longer a one-to-one frame correspondence. 6th column (M/S): mean ± standard deviation over NT, RG and RS.

Sequence Speed Changes: Inspecting the previous results (Table I), Descriptor Grouping (ours (grouping)), trained only on single views and then applied to multiple views by concatenation, is the best performing model. This is surprising at first sight, as the other two models were trained on multi-view data. We designed two extra experiments (Reverse Gear and Random Speed) to illustrate why this happens.

The Reverse Gear experiment consists of changing the direction of the train motion in one of the sequences at test time (e.g., when testing Winter vs Fall, the sequence of Fall is played in reverse order, see Fig. 7). This experiment helps to discern how much a model exploits the multi-view information rather than just the sequence consistency. Table II shows that, as we expected, the models trained with multi-view examples (ours (fusion) and ours (recurrent)) have learned to exploit multiple views: their performance only degrades by 6 and 4 points respectively. On the other side, ours (grouping) drops by 18 points.

In the Random Speed experiment we modified the speed of the train motion in one of the sequences at test time. Specifically, we modified the frame rate along the sequence simulating changes in the train velocity, see Fig. 7 (in our experiments the velocity was randomly re-scaled at every moment of the sequence). The speed is modified for the whole sequence, implying that the one-to-one correspondence of plain Nordland no longer holds. Table II shows that Random Speed is the most challenging setup for the ours (grouping) approach, dropping its precision to 36%. ours (fusion) and ours (recurrent) keep their performance at a level very similar to the standard Nordland setup (78% and 84% respectively).
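For clarity, a hedged sketch of how such a random-speed resampling can be simulated is shown below; the speed factors are illustrative only, not the ones used in our experiments.

```python
# Illustrative Random Speed simulation: a random speed factor decides how many
# frames are skipped at each step, breaking the one-to-one correspondence
# between the query and the reference sequences.
import random

def random_speed(frames, speed_factors=(1, 2, 3), seed=0):
    rng = random.Random(seed)
    resampled, i = [], 0
    while i < len(frames):
        resampled.append(frames[i])
        i += rng.choice(speed_factors)   # jump ahead by a random number of frames
    return resampled
```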

We ran the RG and RS experiments with the state-of-the-art multi-view baseline SeqSLAM [11], observing that its performance drops in both. This is to be expected, as SeqSLAM assumes a linear relation between the velocities of the query and the reference sequences (sequence consistency).

The last column of Table II (M/S) summarizes the conclusions of both experiments, reporting the mean (higher is better) and standard deviation (lower is better) over all experiments (NT, RG and RS). Observe that ours (recurrent) is the best performing, presenting both the highest average precision and the smallest variation. This confirms our hypothesis: the sequence descriptors that learn to combine single-view features (ours (fusion) and ours (recurrent)) are more resilient than those based on plain concatenation (ours (grouping)) or handcrafted relations (SeqSLAM).

V-B Alderley

Fig. 8: Examples of matched places for the single-view and the multi-view grouping models in the Alderley Dataset [11]. The retrieved image is framed in green if it is a correct match or in red if it is incorrect.
Method               Trained on    # frames    Size     D vs N
Olid et al. [15]      Nordland         1        128      0.15%
Olid et al. [15]      Alderley         1        128      6.84%
ours (single-view)    Nordland         1        128      1.65%
ours (single-view)    Alderley         1        128      6.8%
SeqSLAM [11]          -                3       6144      3.91%
SeqSLAM [11]          -                -      20480      9.90%
ours (grouping)       Nordland         3        384      1.73%
ours (grouping)       Alderley         3        384      7.82%
(higher is better)
TABLE III: Results on the Alderley Dataset [11]. 1st column: method. 2nd column: dataset the model has been trained on. 3rd column: number of frames used for recognition (1 implies single-view). 4th column: descriptor size in 32-bit floating point numbers. 5th column: Day vs Night (D vs N), queries are daytime images while the reference database is composed of nighttime images.

We also evaluated our approach on the Alderley dataset [11], which contains images of a car trip during the day and the same trip at night. It is a very challenging dataset due to the extreme illumination changes. We used the last part of both sequences as test samples. As the car velocity is similar in both recordings, we only evaluate ours (grouping).

Table III shows that our multi-view approach is the best performing model, outperforming the rest both in precision and in descriptor compactness. We also compare the difference between training on the Partitioned Nordland Dataset and on the Alderley Dataset. Fine-tuning on Alderley clearly helps on the task, as the varying conditions (seasons vs. day/night) impact the visual appearance differently.

Qualitative results. Fig. 8 shows several test samples. Notice the increased challenge with respect to the Nordland dataset, with the presence of severe illumination changes plus inclusion of artificial illumination and dynamic objects.

V-C Execution time

Method             Descriptor size    Descriptor extraction (ms)    Search 1 vs 10K (ms)
ours (fusion)             128                     17                        3.86
ours (recurrent)          128                     22                        3.86
ours (grouping)           384                     15                       10.70
-                        1860                      -                       49.21
-                        4096                      -                      111.44
-                        6144                      -                      166.13
-                       20480                      -                      688.24
-                      204800                      -                     9279.62
(lower is better)
TABLE IV: Execution time of our models. 1st column: method. 2nd column: descriptor size. 3rd column: time in milliseconds needed to extract one descriptor. 4th column: given a query descriptor and a reference database of 10,000 descriptors, time in milliseconds needed to find the best match.

We compared the execution time of all our models in the upper part of Table IV. The third column (Descriptor Extraction) shows the time needed to extract the descriptor of a query 3-frame sequence on an NVIDIA TITAN Xp. In this part of our pipeline, Descriptor Grouping proved to be the fastest method, as it uses the simplest network.

The last column (Search) shows the time needed to find the best match (Nearest Neighbour, NN) given a query and a database of 10,000 examples. Notice that in this second part of our pipeline our methods Descriptor Fusion and Recurrent Descriptors run faster. This was expected, as their descriptor sizes are smaller by a factor equal to the query-sequence size (3 times in our experiments). Our NN algorithm consists of an exhaustive search through the database: we iterate over all the visited places, compute the distance between their descriptor and the new query, and select the minimum. As distance function we use the Squared Euclidean Distance. The computational complexity of this search is linear in the number of database elements, and each distance evaluation is linear in the descriptor size, so the total cost is proportional to the product of both, which explains why the search time grows with the descriptor size.
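A minimal sketch of this exhaustive search, matching the description above (the vectorized form and function name are ours, for illustration):

```python
# Exhaustive nearest-neighbour search sketch: one distance evaluation per
# database entry, each costing time proportional to the descriptor size d.
import numpy as np

def nearest_neighbour(query, database):
    # query: (d,) descriptor; database: (n_db, d) reference descriptors.
    distances = np.sum((database - query) ** 2, axis=1)  # squared Euclidean
    best = int(np.argmin(distances))
    return best, float(distances[best])
```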

Additionally, we computed the search time corresponding to the sizes of some of the other descriptors used in Tables I and III (bottom part of the table). As expected, the time increases with the descriptor size. Notice that high-dimensional descriptors rule out the use of more efficient data structures, such as KD-trees, to speed up the search, since it is not possible to reject candidates by using the difference of a single coordinate [35].

VI Conclusions

In this work we have introduced three deep learning-based multi-view place recognition models that outperform existing baselines both in accuracy and in descriptor compactness. We analyzed different approaches to combine the information of the features from multiple views (grouping, fusion and recurrent), and we evaluated them on different experimental setups in two public datasets: Partitioned Nordland and Alderley. Each of the models we propose has its own strengths and weaknesses. On the one side, Descriptor Grouping exploits the sequential consistency of the frames in a sequence, achieving the best performance in the standard Nordland/Alderley benchmarks, where the inter-frame motion is similar in different runs. On the other side, Descriptor Fusion and Recurrent Descriptors are able to learn more complex relations between frames and hence proved to be better in cases where the velocities differ or the frame ordering changes. We also evaluated the computational cost of the approaches, demonstrating their potential for robotic applications.

References

  • [1] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
  • [2] R. Arandjelović and A. Zisserman, “Dislocation: Scalable descriptor distinctiveness for location recognition,” in Asian Conference on Computer Vision.   Springer, 2014, pp. 188–204.
  • [3] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi, “Visual place recognition with repetitive structures,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 883–890.
  • [4] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3456–3465.
  • [5] A. Pronobis, B. Caputo, P. Jensfelt, and H. I. Christensen, “A discriminative approach to robust visual place recognition,” in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2006, pp. 3829–3836.
  • [6] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [7] S. H. Lee and J. Civera, “Loosely-Coupled Semi-Direct Monocular SLAM,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 399–406, 2019.
  • [8] C. McManus, W. Churchill, W. Maddern, A. D. Stewart, and P. Newman, “Shady dealings: Robust, long-term visual localisation using illumination invariance,” in 2014 IEEE international conference on robotics and automation (ICRA).   IEEE, 2014, pp. 901–906.
  • [9] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt, “Scalable 6-dof localization on mobile devices,” in European conference on computer vision.   Springer, 2014, pp. 268–283.
  • [10] S. Garg, N. Sünderhauf, and M. Milford, “Semantic-Geometric Visual Place Recognition: A New Perspective for Reconciling Opposing Views,” International Journal of Robotics Research, 2019.
  • [11] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on.   IEEE, 2012, pp. 1643–1649.
  • [12] T. Naseer, W. Burgard, and C. Stachniss, “Robust visual localization across seasons,” IEEE Transactions on Robotics, vol. 34, no. 2, pp. 289–302, 2018.
  • [13] R. Gomez-Ojeda, M. Lopez-Antequera, N. Petkov, and J. Gonzalez-Jimenez, “Training a convolutional neural network for appearance-invariant place recognition,” arXiv preprint arXiv:1505.07428, 2015.
  • [14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [15] D. Olid, J. M. Fácil, and J. Civera, “Single-View Place Recognition under Seasonal Changes,” arXiv preprint arXiv:1808.06516, 2018.
  • [16] E. Garcia-Fidalgo and A. Ortiz, “Vision-based topological mapping and localization methods: A survey,” Robotics and Autonomous Systems, vol. 64, pp. 1–20, 2015.
  • [17] A. C. Murillo, G. Singh, J. Kosecká, and J. J. Guerrero, “Localization in urban environments using a panoramic gist descriptor,” IEEE Transactions on Robotics, vol. 29, no. 1, pp. 146–160, 2013.
  • [18] M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
  • [19] S. Lowry and M. J. Milford, “Supervised and unsupervised linear learning techniques for visual place recognition in changing environments,” IEEE Transactions on Robotics, vol. 32, no. 3, pp. 600–613, 2016.
  • [20] Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, 2018.
  • [21] S. Lowry and H. Andreasson, “Lightweight, viewpoint-invariant visual place recognition in changing environments,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 957–964, 2018.
  • [22] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of convnet features for place recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on.   IEEE, 2015, pp. 4297–4304.
  • [23] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” Proceedings of Robotics: Science and Systems XII, 2015.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [26] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
  • [27] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera, “Fusion and binarization of CNN features for robust topological localization across seasons,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on.   IEEE, 2016, pp. 4656–4663.
  • [28] Z. Chen, A. Jacobson, N. Sunderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep Learning Features at Scale for Visual Place Recognition,” arXiv preprint arXiv:1701.05105, 2017.
  • [29] M. Lopez-Antequera, R. Gomez-Ojeda, N. Petkov, and J. Gonzalez-Jimenez, “Appearance-invariant place recognition by discriminatively training a convolutional neural network,” Pattern Recognition Letters, vol. 92, pp. 89–95, 2017.
  • [30] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [31] E. Pepperell, P. I. Corke, and M. J. Milford, “All-environment visual place recognition with smart,” in 2014 IEEE international conference on robotics and automation (ICRA).   IEEE, 2014, pp. 1612–1618.
  • [32] P. Newman, D. Cole, and K. Ho, “Outdoor slam using visual appearance and laser ranging,” in Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006.   IEEE, 2006, pp. 1180–1187.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [34] P. Wohlhart and V. Lepetit, “Learning descriptors for object recognition and 3d pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3109–3118.
  • [35] R. Marimont and M. Shapiro, “Nearest neighbour searches and the curse of dimensionality,” IMA Journal of Applied Mathematics, vol. 24, no. 1, pp. 59–70, 1979.