Image-based Navigation using Visual Features and Map
Building on progress in feature representations for image retrieval, image-based localization has seen a surge of research interest. Image-based localization has the advantage of being inexpensive and efficient, often avoiding the use of 3D metric maps altogether. That said, the need to maintain a large number of reference images for effective localization in a scene nonetheless calls for them to be organized in a map structure of some kind.
The problem of localization often arises as part of a navigation process. We are, therefore, interested in summarizing the reference images as a set of landmarks, which meet the requirements for image-based navigation. A contribution of the paper is to formulate such a set of requirements for the two sub-tasks involved: map construction and self localization. These requirements are then exploited for compact map representation and accurate self-localization, using the framework of a network flow problem. During this process, we formulate the map construction and self-localization problems as convex quadratic and second-order cone programs, respectively. We evaluate our methods on publicly available indoor and outdoor datasets, where they outperform existing methods significantly.
Vision-based navigation is one of the key components of robotics, self-driving cars and many mobile applications. It is tackled either by using a 3D map representation such as in Structure-from-Motion (SfM) based methods [10, 15, 22, 21, 26, 7] and Simultaneous Localization and Mapping (SLAM) methods [19, 8, 6, 5, 9] or by using a map purely represented with geo-tagged images [4, 23, 1, 13]. In contrast to SfM and SLAM-based methods, localization by image retrieval (or simply image-based localization) is inexpensive, with a simple map representation, which also scales better in larger spaces [4, 22]. The problem of image-based localization is posed as the matching of one or more query images taken at unknown locations to a set of reference images captured at known locations in a map. Recent developments in learning image feature representations for object and place recognition [14, 23, 1, 4] have made image retrieval a viable method for localization.
Despite the increased interest, image-based navigation methods are largely error-prone due to matching inaccuracies. Some existing methods address this by learning better feature representations for place recognition [4, 23, 1, 13]. However, errors in matches cannot be avoided in realistic settings with changes in illumination, camera pose and dynamic objects. Methods that directly regress poses [25, 12, 11] also naturally suffer from similar problems. We argue that, in addition to feature representation, the success of localization in navigation is determined by several other important factors. In particular, current methods do not address the problem of adequate map representation.
Many methods use a large (or even complete) reference image set in order to localize a given query image [1, 13, 2]. Although a large reference set has a higher chance of containing a reference image similar (in pose and illumination) to the query, it not only leads to higher memory requirements but may also become sub-optimal for the matching process. Another important neglected aspect of image-based localization is the order of query image sequences, which is key to the success of visual SLAM methods. Unlike SLAM, localization by retrieval often works with a much sparser sequence of query images. Exploiting information from such interleaved image sequences is very challenging. In this context, one existing method localizes a sequence of query images by assuming a linear change of features over time. However, this assumption is rather naive, since it fails as soon as some objects appear (or disappear) in images. As a consequence, we are interested in two questions: what are the desired criteria of a good map representation for image retrieval-based navigation, and how can we benefit from such a representation during image localization?
In this paper, we address the task of navigation on a map in which geometric relationships exist between images or landmarks. Given the visual features and locations of the reference images, we identify three key problems: map construction by image selection, path planning, and localization using a history of matches for multiple images. In particular, we provide new methods for map construction and for matching multiple images to the reference images of the map. We cast the construction and representation of the map as landmark selection from a sequence of images, using the principles of optimal transport. For that purpose, we introduce rules that direct how images should be selected for the map representation and derive the costs accordingly. We model the rules as the problem of computing a flow from source images to target images, given the geometric locations and visual features of the images, and solve it using Quadratic Programming (QP). Our second contribution addresses the localization of multiple query images on the map, which we model as bipartite graph matching. We solve the localization by computing a flow between the landmark images as sources and the query images as targets in the bipartite graph, using Second-Order Cone Programming (SOCP). We evaluate both landmark selection and localization on publicly available indoor and outdoor datasets, and show that we significantly outperform the state of the art.
Let us consider a graph G = (V, E) with a set of vertices V and a set of directed edges E. For each edge (i, j) ∈ E, we define the flow capacity c_ij and the flow cost rate w_ij, respectively. Let f_ij be the flow for (i, j) ∈ E, such that the flow of an edge is non-negative and cannot exceed its capacity, 0 ≤ f_ij ≤ c_ij. For each vertex i ∈ V, we define the total outgoing flow f_i^out = Σ_{j:(i,j)∈E} f_ij and the total incoming flow f_i^in = Σ_{j:(j,i)∈E} f_ji, such that the net flow is f_i = f_i^out − f_i^in and the absolute flow is |f|_i = f_i^out + f_i^in. We consider two sets S, T ⊂ V for source and target vertices respectively, such that S ∩ T = ∅. For each source vertex i ∈ S, we are given the net outgoing flow s_i = f_i. Similarly, t_j = −f_j is the given net incoming flow of target vertex j ∈ T. For the remaining vertices, we apply the rule of conservation of flow: the sum of the flows entering a vertex must equal the sum of the flows exiting it, i.e. f_i = 0. We also ensure that the flow between the sources and targets is conserved by imposing the constraint Σ_{i∈S} s_i = Σ_{j∈T} t_j. Now, we wish to transport the source flows to the target flows with minimal transportation cost, by solving the following optimization problem,
min_f Σ_{(i,j)∈E} w_ij f_ij, subject to the constraints above. (1)
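For intuition, the generic minimum-cost flow problem above can be prototyped with the classical successive-shortest-path algorithm. The sketch below uses illustrative function and variable names (it is not the paper's implementation) and sends a required amount of flow from a source to a sink at minimal cost:

```python
def min_cost_flow(n, edge_list, source, sink, demand):
    """Successive-shortest-path min-cost flow on a directed graph.

    edge_list: (u, v, capacity, cost_rate) tuples; returns the total
    transportation cost of sending `demand` units from source to sink.
    """
    graph = [[] for _ in range(n)]          # vertex -> incident edge ids
    to, cap, cost = [], [], []
    for u, v, c, w in edge_list:            # forward edge at even id,
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0.0); cost.append(-w)
    total = 0.0
    while demand > 1e-9:
        # Bellman-Ford: cheapest augmenting path in the residual graph
        dist = [float('inf')] * n
        dist[source] = 0.0
        prev = [-1] * n
        for _ in range(n - 1):
            for u in range(n):
                if dist[u] == float('inf'):
                    continue
                for e in graph[u]:
                    if cap[e] > 1e-9 and dist[u] + cost[e] < dist[to[e]] - 1e-12:
                        dist[to[e]] = dist[u] + cost[e]
                        prev[to[e]] = e
        if dist[sink] == float('inf'):
            raise ValueError('infeasible flow problem')
        push, v = demand, sink
        while v != source:                  # bottleneck capacity on the path
            push = min(push, cap[prev[v]]); v = to[prev[v] ^ 1]
        v = sink
        while v != source:                  # augment along the path
            cap[prev[v]] -= push; cap[prev[v] ^ 1] += push; v = to[prev[v] ^ 1]
        total += push * dist[sink]
        demand -= push
    return total

# two parallel routes from source 0 to sink 3; the cheap route saturates first
edges = [(0, 1, 1.0, 1.0), (1, 3, 1.0, 1.0),   # cheap route, capacity 1
         (0, 2, 2.0, 2.0), (2, 3, 2.0, 2.0)]   # expensive route, capacity 2
total_cost = min_cost_flow(4, edges, source=0, sink=3, demand=2.0)
```

One unit takes the cheap route (cost 2) before its capacity is exhausted, and the second unit is forced onto the expensive one (cost 4), so the optimum is 6.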
3 Image-based Navigation
We rely only on images and the scene topology for all three sub-tasks of navigation—map representation, path planning, and self localization. During these processes, visual features of images and their locations on the topological map are considered. In the following, we provide the exact problem setup addressed in this paper, followed by our solutions for each of the three sub-tasks.
3.1 Problem Setup
We consider a map and a set of images together with their location coordinates and visual features. Using this information, we construct a graph whose vertices represent images and whose directed edges represent pairwise relations between images. Efficient navigation demands a compact representation of this graph, supporting path planning and the self-localization of image sequences.
3.2 Map Representation
For a given set of vertices, we wish to summarize them as a much smaller set of landmarks. To do so, we first define the following measure,
where the distance is measured between each vertex and the landmark that is geometrically closest to it. While summarizing the landmarks, we consider the following four rules.
Rule 3.1 (Geometric Representation)
Landmarks must be well distributed geometrically, i.e. the selected landmarks must minimize the following,
Rule 3.2 (Visual Representation)
Landmarks must be useful for localizing images using their visual features. More precisely, all images must have a small feature distance to their geometrically closest landmark, i.e. in terms of the feature distance, landmarks must also respect,
Rule 3.3 (Navigation Assurance)
Landmarks must support navigation from any source to any target location, using only visual features. In other words, the next landmark along the path must not only be close, it must also be distinct from the current one, to avoid confusion. That is, given the ordered sequence of landmarks along a path, two consecutive landmarks must lie within the navigation radius, such that,
and their visual features must be distinct such that,
This ensures that the navigation process can find the next landmark without getting confused with the previous one.
Rule 3.4 (Map Compactness)
The number of landmarks must be small, i.e. it must not exceed a given maximum number of landmarks.
Landmark summarization for image-based navigation is thus a multi-objective problem that must balance the above four rules.
3.3 Path Planning
The task of path planning is to choose an ordered set of landmarks that help to travel from a given source to a target location, along the shortest path using only the landmark images. Since the rules of map representation already ensure a good set of landmarks, the task of path planning simply becomes a problem of finding the shortest path along the selected landmarks. Such a path can be found using existing methods such as Dijkstra’s algorithm.
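As a minimal sketch of this step, Dijkstra's algorithm over a landmark graph can be written as follows; the adjacency structure, weights, and names are illustrative, not the paper's code:

```python
import heapq

def shortest_path(adj, source, target):
    """Dijkstra on an adjacency dict {u: [(v, weight), ...]}; returns (dist, path)."""
    dist = {source: 0.0}
    prev = {}
    pq = [(0.0, source)]
    done = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    path = [target]                      # walk predecessors back to the source
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# landmarks as nodes, edge weights = geometric distance between landmarks
adj = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0), (3, 5.0)], 2: [(3, 1.0)]}
d, path = shortest_path(adj, 0, 3)
```

Here the ordered set of landmarks to traverse is simply the returned shortest path.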
3.4 Self Localization
Given a sequence of images and landmarks along a path, the task of self-localization is to find the most consistent set of matches. We assume that an ordered sequence of images, captured along a path, is given to us. We wish to localize these images by matching them to the landmarks. In this work, we formulate self-localization as a bipartite graph matching problem between the sequence images and the landmark images. Let a mapping generate the desired matching pairs of sequence and landmark images. For the purpose of self-localization, we want the matching process to favour the following two rules.
Rule 3.5 (Visual Matching)
The visual distance between matched pairs must be minimized, i.e. given the visual features of each matched pair, the mapping corresponding to the best matching is found as
Rule 3.6 (Geometric Matching)
Neighbouring images in the query sequence must be matched to neighbouring landmarks, or to the same landmark, i.e.
4 Map Construction using Network Flow
To represent the map using only images, we define the graph as discussed in the previous section. Each edge represents the relationship between a pair of images, with flow capacity and cost rate defined as,
for weights associated with the geometric and visual measures. Recall that these two quantities are, respectively, the geometric and visual distances between the pair of images. Here, we first define the landmark selection process for map representation in the context of network flow.
Definition 4.1 (Landmarks)
Graph vertices with absolute flow greater than a given flow threshold are the desired landmarks, i.e. the landmarks are the vertices whose absolute flow exceeds this threshold.
In the following, we make use of (9) within the formulation of (1), with additional constraints, in order to obtain landmarks that favour rules 3.1–3.4. We also provide the reason behind our choice of the cost rate and the capacity expressed in (9).
4.1 Geometric Representation
The geometric representation Rule 3.1 is in fact the well-known k-Center Problem, which is itself NP-hard. However, there exist simple greedy approximation algorithms that solve the k-Center Problem with an approximation factor of 2. We use a similar approach to choose a set of anchor points by solving,
for a covering radius and a point-to-set distance. Note that the distance bound is tightened to compensate for the approximation factor of 2. Using the obtained set of anchor points, we impose the following constraint on the absolute flow to favour the geometric representation Rule 3.1,
for a neighbourhood flow threshold, summed over the neighbouring vertices of each anchor point within a given radius. The constraint in (11) ensures flow around every anchor point, thus encouraging the landmarks to be well distributed. In fact, one can alternatively maximize the neighbourhood flow to guarantee the feasibility of the network flow problem, by adding a corresponding term with a constant weight to the original cost of (1).
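The greedy farthest-point heuristic mentioned above can be sketched as follows; it is the standard 2-approximation for the k-Center Problem, and all names here are illustrative:

```python
import math

def greedy_k_center(points, k):
    """Greedy farthest-point heuristic: a 2-approximation for k-Center.

    Repeatedly adds the point farthest from the current centers, so each
    iteration is linear in the number of points (O(nk) overall).
    """
    centers = [0]                                  # seed with an arbitrary point
    d = [math.dist(p, points[0]) for p in points]  # distance to nearest center
    while len(centers) < k:
        far = max(range(len(points)), key=lambda i: d[i])
        centers.append(far)
        for i, p in enumerate(points):
            d[i] = min(d[i], math.dist(p, points[far]))
    return centers, max(d)                         # indices and covering radius

# two tight clusters plus one outlier; three anchors cover all of them
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.1), (10.0, 0.0)]
centers, radius = greedy_k_center(points, 3)
```

The returned covering radius is what the anchor-point constraint can be tightened against to compensate for the factor of 2.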
4.2 Visual Representation
The rule of visual representation demands that no image be visually too far from its geometrically closest landmark. Therefore, all nodes with distinct visual features in a local neighbourhood must carry significant absolute flow. We ensure such flow by introducing a flow sensitivity for every edge. The flow sensitivity increases the cost rate as the flow approaches capacity, such that the new cost rate is given by,
for a base cost rate and a sensitivity. We define the sensitivity using the feature distribution around a vertex as follows,
The sensitivity encourages the flow to spread before the maximum capacity of the cheapest edge is used up. This is particularly important when a diverse set of visual features is clustered together geometrically. In such cases, the risk is that the flow passes primarily through only one vertex, thus selecting only one landmark, since both the incoming and outgoing edges offer low cost and sufficient capacity. This violates the visual representation rule. In such circumstances, the sensitivity encourages the flow to spread out so that more than one landmark is selected, favouring Rule 3.2. Note that the sensitivity is high where feature diversity is high. On the other hand, if there is only one distinct visual feature in a neighbourhood, the flow sensitivity of the edges to that vertex is very low. Using (12) and (13), the new cost corresponding to any edge can be expressed as,
where the inequality is a rotated cone constraint,
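To see why a flow-dependent (sensitivity) cost spreads flow, consider the toy case of two parallel edges whose costs grow quadratically with flow. The closed-form split below is only an illustration of the effect, not the paper's QP formulation, and all names are hypothetical:

```python
def split_flow(total, w1, s1, w2, s2):
    """Minimize w1*f1 + s1*f1**2 + w2*f2 + s2*f2**2 with f1 + f2 = total, f >= 0.

    Stationarity gives equal marginal costs, w1 + 2*s1*f1 = w2 + 2*s2*f2,
    clipped to the feasible interval [0, total].
    """
    f1 = (w2 - w1 + 2.0 * s2 * total) / (2.0 * (s1 + s2))
    f1 = min(max(f1, 0.0), total)
    return f1, total - f1

# equal base cost and equal sensitivity: the flow spreads evenly
f1, f2 = split_flow(2.0, w1=1.0, s1=0.5, w2=1.0, s2=0.5)
# edge 1 has zero sensitivity: all flow collapses onto it
g1, g2 = split_flow(2.0, w1=1.0, s1=0.0, w2=1.0, s2=0.5)
```

With sensitivity on both edges the flow splits evenly, whereas removing the quadratic penalty from one edge lets it absorb everything, which is exactly the single-landmark collapse the sensitivity term prevents.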
4.3 Navigation Assurance
The formulation of network flow ensures that all flow must travel from the source to the sink vertices. Therefore, the network flow problem is naturally suited to the navigation task. While we encourage the flow to make larger geometric jumps by keeping the capacity directly proportional to the geometric distance (using (9)), we ensure that all jumps are smaller than the navigation radius by constructing the graph such that,
Furthermore, we minimize the flow between two vertices with similar features by keeping the cost rate inversely proportional to the feature distance (using (9)). This encourages the selection of distinct consecutive features along the flow path, thus favouring the objective of (6). The construction of a locally connected graph, together with our choice of cost rate and capacity, favours the navigation assurance Rule 3.3.
4.4 Map Compactness
Given a threshold on the absolute flow for a vertex to qualify as a landmark, we determine a set of landmarks by controlling the source and target flows. Starting from a small input/output flow, we gradually increase it to generate new landmarks, as long as the flow problem remains feasible and the number of landmarks stays below a given upper bound. In this process, the most important landmarks are generated first. Therefore, one may further control the compactness by choosing the desired number of initial landmarks.
4.5 Map Construction Algorithm
In the following, we present the flow formulation that forms the core of our landmark selection method, and summarize the full process, from graph representation to map construction, as a landmark selection algorithm.
Given a graph with a cost rate, capacity, and sensitivity for each edge, a set of anchor points, source and target vertices, and a neighbourhood flow threshold, the flows required for map construction can be obtained by solving the following network flow problem.
The flow problem of (17) is convex and can be solved optimally using Quadratic Programming (QP). In Algorithm 1, we summarize the complete process of obtaining landmarks, starting from images with features and locations. Note that the flow problem needs to be solved multiple times to obtain the desired compactness, as discussed in Section 4.4. This can be done either by gradually increasing the input/output flow (as discussed earlier), or by performing a bisection search on the landmark flow threshold.
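The bisection search on the landmark flow threshold can be sketched as follows, assuming the absolute flows of all vertices have already been computed by the QP; the function and variable names are illustrative:

```python
def bisect_threshold(absolute_flow, k_max, iters=40):
    """Bisection on the landmark flow threshold so that the number of
    vertices with absolute flow above the threshold stays within k_max."""
    lo, hi = 0.0, max(absolute_flow)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(1 for f in absolute_flow if f > mid) > k_max:
            lo = mid        # too many landmarks: raise the threshold
        else:
            hi = mid        # within budget: try lowering it
    return hi

flows = [0.1, 0.4, 0.9, 1.2, 2.0]     # toy absolute flows per vertex
tau = bisect_threshold(flows, k_max=2)
```

The returned threshold keeps only the vertices with the largest absolute flow, which matches the intuition that the most important landmarks emerge first.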
5 Network Flow for Self Localization
We formulate the self-localization of an ordered sequence of images with respect to landmarks as a bipartite graph matching problem. For this task, we construct a complete bipartite graph with directed edges from the landmarks to the query images. In addition, we introduce auxiliary source and target vertices. The source is connected to all landmark vertices with directed edges; similarly, directed edges connect all query vertices to the target. Combining these vertices and edges, we represent the flow network as a single graph, as shown in Figure 1. In this section, we solve the bipartite graph matching problem using the network flow formulation of (1), with additional constraints to obtain matches that respect Rules 3.5 and 3.6.
5.1 Visual Matching
To obtain visually similar matches, we define the flow cost rate between any landmark and a query image using the visual distance between them. On the other hand, no cost is added for the flow from source to landmarks and from query images to target. Furthermore, we introduce a robust loss for feature matching such that the cost rate is defined as,
where the loss applied to the feature distance is the Huber function. To ensure that an image cannot be matched to more than one landmark, we limit the maximum absolute flow at every query image to one. This translates to the following capacity constraints,
We allow many query images to be matched to a single landmark by setting the source-to-landmark capacity higher than one. Additionally, (19) also ensures that every query image is matched.
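A minimal sketch of the robust cost rate is given below, assuming a standard Huber loss applied to the Euclidean feature distance (the actual feature dimensionality and Huber parameter of the paper are not specified here):

```python
def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond delta (robust to outliers)."""
    a = abs(r)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def cost_rate(feat_landmark, feat_query, delta=1.0):
    """Edge cost rate between a landmark and a query image: Huber loss of
    the Euclidean distance between their visual feature vectors."""
    d = sum((a - b) ** 2 for a, b in zip(feat_landmark, feat_query)) ** 0.5
    return huber(d, delta)
```

The linear tail of the Huber loss keeps occasional gross feature mismatches (e.g. from dynamic objects) from dominating the matching cost.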
5.2 Geometric Matching
Recall that we are given only the visual features of the query images, along with the visual features and geometric locations of the landmarks. Our task is therefore to infer the geometric locations of the query images. To do so, for a given flow between the landmarks and the query images, we first define the location of each query image as follows,
Note that the absolute flow of every query image is one. Therefore, (20) guarantees that each query image lies within the convex polytope defined by the landmark locations. Now, the geometric matching Rule 3.6, for a navigation radius and a sequential query image pair, can be expressed as the following quadratic constraint,
5.3 Self Localization Algorithm
We perform self-localization via bipartite graph matching, using network flow. In the following, we first present the proposed network flow formulation for self-localization, as one of our results. Subsequently, we summarize our self-localization method as an algorithm.
Consider a graph constructed using vertices representing a sequence of query images and landmarks at known locations (as shown in Figure 1), with cost rate and capacity defined using (18)–(19). Given a navigation radius and the source and target vertices, the flows required for self-localization can be obtained by solving the following flow problem.
6.1 Datasets and Experimental Setup
The following paragraphs describe how we obtain the location coordinates, visual features and edges introduced in Section 3.1 for the Oxford Robotcar and COLD-Freiburg datasets.
The COLD-Freiburg sequences directly provide the necessary location coordinates. For the Oxford Robotcar dataset we use UTM coordinates, i.e. northing and easting. We exclude any sequences with inaccurate or incomplete GPS and INS trajectories. Visual inspection reveals that the northing and easting trajectories provided in the INS files are more reliable than those provided by GPS; we therefore use the INS coordinates. Given the large size of the Oxford Robotcar dataset and the limited download speed for public users, we restrict ourselves to a randomly selected subset of sequences and only consider roughly the first 1250m of each run.
To obtain the visual features, we use off-the-shelf NetVLAD features with PCA and whitening. This results in a visual feature of length 4096 per image.
In order to find the edges for a set of reference images from the COLD-Freiburg dataset, we consider any connection between images that are less than 2m apart. If the connection does not intersect any walls on the given floor plan, it is added as an edge. For the Oxford Robotcar dataset, a connection between two images is added if and only if the integrated distance along the driven path between the two points is smaller than a threshold of 12m. We use this geodesic distance to avoid edges that cut corners.
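The edge construction for the Oxford Robotcar case can be sketched as follows, using cumulative arc length along the ordered route as the geodesic distance; the function name and the toy route are illustrative:

```python
import math

def route_edges(coords, max_path_dist):
    """Connect images i, j if the distance integrated along the driven route
    between them stays below max_path_dist (avoids edges that cut corners)."""
    # cumulative arc length along the ordered route
    arc = [0.0]
    for a, b in zip(coords, coords[1:]):
        arc.append(arc[-1] + math.dist(a, b))
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if arc[j] - arc[i] < max_path_dist:
                edges.append((i, j))
    return edges

# three sides of a square: the endpoints are Euclidean-close (distance 1)
# but the driven path between them is length 3, so no corner-cutting edge
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
edges = route_edges(coords, max_path_dist=2.5)
```

Using the path distance rather than the straight-line distance is what rejects the edge between the first and last image, even though they are geometrically close.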
6.3 Landmark Selection
In this section, we validate our map construction approach introduced in Section 4 on real-world data. Starting from 4853 images from the first 1.25km of the rainy Oxford Robotcar sequence 2015-10-29 12:18:17, we build a summarized reference set. For a total of five different setups, Figure 2 shows the distribution of the feature distance and the geometric distance from each point in the original set to its geometrically nearest neighbour (2) in the summarized set. First, we study a baseline setup obtained by sampling uniformly along the path of the captured reference sequence. To illustrate the impact of anchors (Section 4.1) and sensitivity (Section 4.2) on geometric and visual representation respectively, we then switch off these constraints selectively.
The results in Figure 2 clearly show that a reference set summarized using network flow has better geometric and visual representation than a uniformly sampled baseline with the same number of images. It can be seen that introducing anchors reduces the number of points with high geometric distance, and introducing sensitivity reduces the number of points with high feature distance.
In this section, we illustrate the feasibility of our self-localization algorithm. As a reference set, we take 600 images from the same rainy Oxford Robotcar sequence (2015-10-29 12:18:17) as in Section 6.3. In order to distinguish between the impact of our map building and our self-localization method, we employ uniform sampling to summarize the reference set for this experiment. As a query sequence, we uniformly sample 125 images from the overcast Oxford Robotcar sequence acquired on 2015-02-13 09:16:26, using a step size of 20 images. The query sequence path is shorter than the path from which the reference set was constructed.
The leftmost subplot in Figure 3 illustrates the unrefined top-1 feature matches between the query sequence images and the reference set images. The second subplot in Figure 3 shows the top-1 feature matches after applying our self localization algorithm. It is evident that our approach greatly improves the localization for this example by removing inconsistent matches.
The third subplot from the left in Figure 3 shows the visual distance matrix between the query sequence and the reference set (ordered according to topology, i.e. the originally driven route). The matrix shows that matches do not occur independently: neighbours of matching images also have low feature distance. The true matches (i.e. the matches with smallest geographic distance) are indicated in black. In red, we plot the refined matches of our self-localization algorithm. As a comparison, in green we show the matches produced by SeqSLAM, the current state of the art for capitalizing on sequential information to improve image matches.
Finally, the rightmost subplot in Figure 3 shows the percentage of correctly localized images for any given distance threshold. For example, at a maximum tolerance of 80m, our algorithm has an accuracy of 68.7%, while SeqSLAM reaches 60.9%.
6.4 Quantitative Evaluation
We evaluate the quantitative performance of our method on the first 1250m of the Oxford Robotcar outdoor dataset, as well as the COLD-Freiburg indoor dataset. For both datasets we randomly choose one reference and three query sequences. For the Oxford Robotcar dataset, the reference is the rainy sequence from 2015-10-29 12:18:17. The query sequences were taken in three different conditions: Sun and clouds (2014-11-18 13:20:12), snow (2015-02-03 08:45:10) and overcast (2015-02-13 09:16:26). From the COLD-Freiburg dataset we use the second sunny sequence of the extended part A as a reference. As query sequences, we use the first sunny, cloudy and night sequences taken on the extended part A.
Figure 4 plots the percentage of correctly localized query images for a given distance threshold, for each of the six different query sequences. The number of images in the reference set is 415 for Oxford Robotcar and 50 for COLD. In red, we report the unrefined top-1 localization accuracy, i.e. the accuracy of returning the best feature matches from a uniformly sampled reference set without any use of sequence information. In black, we report the top-1 accuracy of the state-of-the-art baseline, using a uniformly sampled reference set and matching refined by SeqSLAM. By incorporating sequential information, SeqSLAM clearly outperforms the unrefined top-1 localization. However, the top-1 accuracy achieved by our map building algorithm in combination with our self-localization is even higher, as shown in solid green. For some distance thresholds, the top-1 accuracy of our method even beats the unrefined top-10 reference shown in blue. On the COLD dataset, the improvement of our method is particularly noticeable in the range between 10m and 15m.
While our method shows significant improvement on all sequences presented in Figure 4, it fails on sequences with non-distinctive image features, such as the outdoor night sequences in the Oxford Robotcar dataset. This is shown in Figure 5. It can be observed that, for these sequences, the baseline method using SeqSLAM also fails.
In this paper, we have formulated a set of requirements for map building and self-localization in the context of image-based navigation. Based on these requirements, we proposed a method to perform map building by selecting the images most suitable for navigation. To improve self-localization, we proposed a method that can use multiple query images. We modeled both methods using network flow and solved them as convex quadratic and second-order cone programs, respectively. Our experiments on challenging real-world datasets show that our approach significantly outperforms existing methods.
-  R. Arandjelovic and A. Zisserman. DisLocation: Scalable descriptor distinctiveness for location recognition. In ACCV, 2014.
-  A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool. Night-to-day image translation for retrieval-based localization. arXiv preprint arXiv:1809.09767, 2018.
-  M. ApS. The MOSEK optimization toolbox for MATLAB manual. Version 7.1 (Revision 28)., 2015.
-  R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, pages 5297–5307, 2016.
-  G. Bresson, Z. Alsayed, L. Yu, and S. Glaser. Simultaneous localization and mapping: A survey of current trends in autonomous driving. IEEE Transactions on Intelligent Vehicles, 20:1–1, 2017.
-  R. Castle, G. Klein, and D. W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In Wearable Computers, 2008. ISWC 2008. 12th IEEE International Symposium on, pages 15–22, 2008.
-  S. Choudhary and P. Narayanan. Visibility probability structure from sfm datasets and applications. In European conference on computer vision, pages 130–143. Springer, 2012.
-  A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1052–1067, 2007.
-  E. Eade and T. Drummond. Scalable monocular slam. In CVPR, volume 1, pages 469–476, 2006.
-  A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR, 2009.
-  A. Kendall, R. Cipolla, et al. Geometric loss functions for camera pose regression with deep learning. In CVPR, 2017.
-  A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. 2015.
-  H. J. Kim, E. Dunn, and J.-M. Frahm. Learned contextual feature reweighting for image geo-localization. In CVPR, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In ECCV, 2010.
-  W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
-  M. J. Milford and G. F. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In ICRA, pages 1643–1649, 2012.
-  E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization and 3d reconstruction. In CVPR, 2006.
-  A. Pronobis and B. Caputo. COLD: COsy Localization Database. The International Journal of Robotics Research (IJRR), 28(5):588–594, May 2009.
-  T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):1744–1756, 2017.
-  T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are large-scale 3d models really necessary for accurate visual localization? In CVPR, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  J. F. Sturm. Using SEDUMI 1.02, a MATLAB toolbox for optimization over symmetric cones, 2001.
-  H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
-  B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-scale image-based localization. In ICCV, 2015.