Automated Map Reading: Image Based Localisation in 2-D Maps Using Binary Semantic Descriptors


We describe a novel approach to image based localisation in urban environments using semantic matching between images and a 2-D map. It contrasts with the vast majority of existing approaches which use image to image database matching. We use highly compact binary descriptors to represent semantic features at locations, significantly increasing scalability compared with existing methods and having the potential for greater invariance to variable imaging conditions. The approach is also more akin to human map reading, making it more suited to human-system interaction. The binary descriptors indicate the presence or not of semantic features relating to buildings and road junctions in discrete viewing directions. We use CNN classifiers to detect the features in images and match descriptor estimates with a database of location tagged descriptors derived from the 2-D map. In isolation, the descriptors are not sufficiently discriminative, but when concatenated sequentially along a route, their combination becomes highly distinctive and allows localisation even when using non-perfect classifiers. Performance is further improved by taking into account left or right turns over a route. Experimental results obtained using Google StreetView and OpenStreetMap data show that the approach has considerable potential, achieving localisation accuracy of around 85% using routes corresponding to approximately 200 meters.

I Introduction

Image based localisation and place recognition have been looked at extensively as an alternative to infrastructure dependent sensing such as GPS, especially when operating in urban environments. The vast majority of systems adopt an image to image database matching approach, in which environment images are matched to a database of location tagged images [1]. Although these have demonstrated impressive performance, they are also limited in three key respects. The first is scalability - localisation is dependent on having a very large database of images or image features and thus scaling to very large areas is problematic. The second relates to invariance - matching is impacted significantly by variable imaging conditions and so maintaining performance at all times over extended periods is challenging. Finally, such schemes do not align well with how it is believed that humans perceive and undertake location-based activities, which are thought to be based on some form of 2-D map representation [2, 3, 4], and thus these approaches do not lend themselves naturally to human-system interaction.

Motivated by the above, we consider an alternative approach using image to 2-D map matching, in which we link images to semantic features on a 2-D map of an environment to give localisation. We therefore move away from matching images and instead match semantic information. This is akin to human map reading, in which a person relates the surrounding visual appearance of an environment to the semantic information they can perceive on a map, such as buildings, road layout, etc. This renders the approach better suited to human-system interaction. Moreover, the abstraction and compression provided by semantic description also gives potential for significant gains in scalability - our semantic descriptors are many orders of magnitude smaller than images or sets of image features - and improved invariance to variable imaging conditions, since via training, the detection of semantic features in images can be made less dependent on specific appearance.

Fig. 1: Binary semantic descriptors (BSDs). 4-bit binary descriptors are used to represent locations indicating the presence or not of semantic features in 4 directions (front/back facing - junctions; left/right facing - gaps between buildings). These are derived from a 2-D map and compared bitwise with descriptors estimated via classifiers from images captured in the same directions to establish localisation w.r.t the map. On their own the descriptors are not sufficiently distinctive, but when combined sequentially along routes as shown in Figure 2, then localisation becomes possible.

In this paper we present preliminary investigations into the approach. Our central idea is to characterise locations by a small number of semantic features relating to road junctions, buildings, etc, and then represent each location by a binary semantic descriptor (BSD), with each bit indicating the presence or not of a given feature in a given viewing direction. This gives a very compact representation (we use 4-bit descriptors in this work) and so increases scalability. We design classifiers to recognise the features in images, allowing us to estimate the descriptors and hence in principle recognise locations by comparison with a database of location tagged descriptors derived directly from the 2-D map. The approach is illustrated in Figure 1.

However, due to their simplicity, the above descriptors are not sufficiently discriminative on their own; there are many locations having the same descriptors and when coupled with non-perfect classifiers, localisation is not possible. Nevertheless, when the descriptors are concatenated sequentially, the resulting route descriptors do become highly distinctive, to the extent that localisation is possible despite non-perfect classifiers. In essence, the pattern of semantic features observed along a route becomes unique provided the route is sufficiently long (in the experiments reported below we achieved localisation after approximately 200 meters). Moreover, when the direction of travel between locations along a route is also taken into account, e.g. left and right turns, performance is further improved. This route based approach is illustrated in Figure 2. Note that it is feasible because of the compact nature of the map representation, i.e. a small number of bits per location, and is something that would be difficult to achieve using the comparatively large representations used in image to image database matching.

In this paper, we present an implementation using Google StreetView (GSV) and OpenStreetMap (OSM) data, with the latter providing vector maps and the former giving 360 degree images at regular locations along roads. We used road junctions and gaps between buildings as our semantic features, assuming the former to be present or not in front and back facing views, and the latter to be present or not in left and right facing views. This gives us 4-bit descriptors for each location. We trained convolutional neural network (CNN) classifiers to recognise the features in images, achieving accuracy of around 75%. In experiments on an area of around , we achieved localisation accuracy in excess of 85% when using routes consisting of 20 or more locations, corresponding to distances of approximately 200 meters. Although initial localisation is delayed as the route evolves, once bootstrapped to the correct location, the method successfully tracks the route at the same rate as location images are captured and achieves this using a significantly smaller database than required in image to image database matching. The results suggest that the method has considerable potential.

Fig. 2: Route based localisation. (a) Images captured in four directions (front, back, left and right facing) at locations along a route are converted to BSDs using binary classifiers (b) and concatenated to produce route descriptors (c). These are compared bit-wise (d) with a database of ground-truth BSDs (e) derived from the 2-D map to determine the closest matching route. Routes are then compared in terms of their turn patterns (f) to give a final ranking of possible locations of the images w.r.t the 2-D map (g).

II Related Work

Approaches to image based localisation and place recognition have almost exclusively focused on image to image database matching, in which environments are represented by sets of location tagged images or image features [1]. The key concerns in such methods are the invariance of representations to changes in viewpoint and changes in appearance caused by different lighting and weather conditions. For example, the FAB-MAP algorithms [5] use image features with a degree of viewpoint invariance to give large-scale matching over long routes of up to 1000 km, whilst other methods have sought to deal with changing appearance either through invariant representations [6], storing multiple representations [7] or learning models of appearance change [8]. More recent work has looked at leveraging the power of deep learning methods to gain improved matching [9, 10]. However, in all cases, large scale localisation requires large amounts of memory, in the order of hundreds of gigabytes [5].

In contrast, although there is a body of work which has looked at using computer vision to extract navigation features from paper maps, see e.g. the survey in [11], there has been very little work on linking maps to images for localisation as described in this paper. There has been some work on utilising semantic information in the form of identifying key landmarks and objects in images, such as buildings, traffic lights, bollards, etc, and using these to represent locations, see e.g. [12, 13]. These approaches have the potential to provide good invariance, including with ultra-wide viewpoint changes [12], and reduced representational size, but to date they have been limited in scale and not linked to map information. Closer in spirit to our work in terms of alignment with human wayfinding is the PhotoMap application described in [14] and [15]. Images of ‘You are here’ public maps are geo-referenced with online maps by hand to provide specialised local data alongside navigation information on mobile devices, recognising the value of pictorial map data for human spatial cognition.

There has been some recent work on estimating 6-D camera pose using a combination of GPS, images and map data in urban environments as described in [16, 17, 18, 19, 20]. The methods described use building edges and planar facades extracted from images to align with 2-D and 2.5-D maps geo-localised using GPS and so give improved estimates of camera position and orientation. However, these methods focus on obtaining precise metric estimates of camera pose for applications such as Augmented Reality [18] based on clear views of building facades. As such they would be difficult to extend to general localisation.

The closest work to that presented here is that described by Seff and Xiao [21]. In a similar manner to our detectors described below, they use a CNN approach to recognise semantic features in images of urban settings, such as junctions, number of lanes, drivability, bike lanes, one-way vs two-way, etc. The network training is based on ground-truth features obtained from OSM and images from GSV, in much the same manner as in our approach. However, their focus is on using the outputs of the classifier to validate map locations provided by GPS for self-driving car applications, rather than for general localisation. In addition, they consider locations in isolation, in contrast to our use of route information.

Given the above and to the best of our knowledge, we therefore believe that the approach presented here is the first of its kind in terms of systematically linking 2-D map data with images for position localisation over large areas.

III Overview

The main components of the approach are illustrated in Figure 2. From a 2-D vector map, i.e. OSM, we generate binary semantic descriptors (BSDs) for locations spaced at regular intervals along roads in an urban environment. Each descriptor consists of 4 bits, with each bit indicating the presence or not of a semantic feature in a given viewing direction. We used four directions - front, back, left and right facing - and two feature types - junctions and gaps between buildings. The latter were chosen since they are easily identified in the vector map and as described below, they can be reliably detected in images using trained classifiers.

A database of location tagged route descriptors is then created by computing all possible routes within the area of interest up to a certain length in terms of the number of adjacent locations and then concatenating the set of associated BSDs as indicated in Figure 2d, where the circular discs represent the BSDs and the black/white segments indicate individual bits. Note that each route descriptor is then of length $4n$ bits, where $n$ is the number of locations in the route. Thus, although the number of possible routes can be very large, the route database has a small memory footprint. For example, in the experiments described below, for an area of approximately , the number of possible routes containing 40 locations (each approximately 400 meters long represented by a 160-bit route descriptor) is just under . The route descriptor database is then around 800 MB in raw form, i.e. prior to any compression, which would be possible due to significant overlap between routes. This contrasts, for example, with the 177 GB reported in [5] required for image features to represent a single 1000 km route, i.e. equivalent to 71 MB for a single 400 meter route.
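The per-route figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative and decimal units (1 GB = 1000 MB) are assumed.

```python
# Back-of-envelope check of the per-route storage figures quoted in the text.
BITS_PER_BSD = 4
LOCATIONS_PER_ROUTE = 40

route_descriptor_bits = BITS_PER_BSD * LOCATIONS_PER_ROUTE
route_descriptor_bytes = route_descriptor_bits // 8

# Image-feature storage reported in [5]: 177 GB for a single 1000 km route.
# Assumption: decimal units, so 177 GB = 177,000 MB.
feature_mb_per_km = 177 * 1000 / 1000
feature_mb_per_400m = feature_mb_per_km * 0.4

print(route_descriptor_bits, "bits =", route_descriptor_bytes, "bytes per route descriptor")
print(round(feature_mb_per_400m, 1), "MB of image features per 400 m")
```

A 40-location route descriptor therefore occupies 20 bytes, against roughly 71 MB of image features for a route of the same length.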

Localisation w.r.t the map then proceeds as follows. Images in the four viewing directions are captured at a location, i.e. within GSV in our case. Each image is then fed to a binary classifier, which detects the presence or not of a semantic feature, i.e. a junction for the front and back facing views and a gap between buildings for the left and right facing views. This gives a 4-bit BSD as illustrated in Figure 1, with each bit indicating the presence or not of the feature in each viewing direction.

The above BSD could be compared with those for all locations in the 2-D map to give localisation, but as noted earlier, their simplicity means that they are not sufficiently distinctive, with many locations having the same descriptor. Instead, as shown in Figure 2a-c, we concatenate BSDs as the ‘user’ moves along a route in the environment, capturing images and generating descriptors at regular intervals, creating a route descriptor. In our case, we have a virtual user moving in GSV and generate BSDs at each successive GSV location (approximately every 10 meters). At each location, the current route descriptor is then used to query the database, with Hamming distances used to provide a ranked list of likely locations, as illustrated in Figure 2d-g.

To add further discrimination, we also compare the turn patterns - the position of left or right turns in a route - associated with the query and database routes, requiring that these are identical for a valid match. The motivation here is that direction changes of, for example, an autonomous vehicle can be detected reliably and hence can be used to eliminate spurious matches between route descriptors. The database route having the lowest Hamming distance w.r.t the query route and also the same turn pattern then provides the location estimate.
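The matching step described above can be sketched as a brute-force search. This is an illustration rather than the implementation used in the experiments: the function names, the representation of descriptors and turn patterns as bit strings, and the tuple layout of the database are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("descriptors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def rank_routes(query_desc, query_turns, database):
    """Rank database routes by Hamming distance to the query descriptor.

    database: iterable of (route_id, route_descriptor, turn_pattern).
    Only routes with an identical turn pattern are treated as valid matches.
    """
    candidates = [
        (hamming(query_desc, desc), route_id)
        for route_id, desc, turns in database
        if turns == query_turns and len(desc) == len(query_desc)
    ]
    return sorted(candidates)


# Toy database of two-location routes (8-bit descriptors, 1-bit turn patterns).
db = [
    ("A", "10100001", "0"),
    ("B", "10100000", "1"),  # identical descriptor but a different turn pattern
    ("C", "00001111", "0"),
]
print(rank_routes("10100000", "0", db))  # [(1, 'A'), (6, 'C')]
```

Route B is rejected despite having zero Hamming distance, because its turn pattern differs from that of the query.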

In the following sections, we provide details of the BSD generation, the design and training of the binary classifiers, the generation and comparison of the turn patterns and a probabilistic interpretation of the approach. Section VIII provides details of the GSV/OSM experiments and results and we conclude with a brief discussion of future work.

IV Binary Semantic Descriptors

We denote the finite set of locations in an area of interest by , where is the total number of locations. Associated with each location is a BSD, which we denote by the binary string , with denoting the th bit, and define as the set of all descriptors. In this work, and each bit of a BSD denotes the presence or not of a junction or a gap between buildings in one of four viewing directions centred on location . These are derived from the vector map as follows


where (,) and (,) denote the (front,back) and (left,right) viewing directions at location , respectively. The functions and return 1 if there exists a junction or a gap between buildings, respectively, in direction , and 0 otherwise. As illustrated in Figure 3, a feature is deemed to be present in a viewing direction if one lies within the relevant quadrant of a circle of a given radius centred on the location of interest, where the front and back viewing directions are aligned with that of the road upon which the location sits. In the experiments we set the viewing distance radius to be 30m, which is similar to that used in [21].
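As an illustration of the quadrant test described above, the following minimal sketch derives a BSD from point features. Several aspects are assumptions made for the example: planar coordinates in metres, a heading measured in degrees counter-clockwise from the x-axis, junctions and gaps represented as single points, the bit order (front, back, left, right), and all function names.

```python
import math


def _in_quadrant(loc, heading_deg, point, centre_offset_deg, radius_m):
    """True if `point` lies within `radius_m` of `loc` and within 45 degrees
    either side of the direction `heading_deg + centre_offset_deg`."""
    dx, dy = point[0] - loc[0], point[1] - loc[1]
    if math.hypot(dx, dy) > radius_m:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed angular difference to the quadrant centre, wrapped to [-180, 180).
    diff = (bearing - (heading_deg + centre_offset_deg) + 180.0) % 360.0 - 180.0
    return abs(diff) <= 45.0


def make_bsd(loc, heading_deg, junctions, gaps, radius_m=30.0):
    """Return the 4-bit descriptor (front, back, left, right) for one location."""
    front = any(_in_quadrant(loc, heading_deg, j, 0.0, radius_m) for j in junctions)
    back = any(_in_quadrant(loc, heading_deg, j, 180.0, radius_m) for j in junctions)
    left = any(_in_quadrant(loc, heading_deg, g, 90.0, radius_m) for g in gaps)
    right = any(_in_quadrant(loc, heading_deg, g, -90.0, radius_m) for g in gaps)
    return (int(front), int(back), int(left), int(right))


# Junction 20 m ahead, another 40 m behind (outside the radius), gap 10 m to the left.
print(make_bsd((0.0, 0.0), 0.0, [(20.0, 0.0), (-40.0, 0.0)], [(0.0, 10.0)]))
```

In this sketch a feature lying exactly on a quadrant boundary is counted in both neighbouring quadrants; a real implementation would need to settle that convention.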

Fig. 3: Generation of a BSD from the vector map.

For localisation we need to estimate a BSD for a location from images captured in each of the four viewing directions. We do this using binary classifiers, trained to detect the presence or not of the relevant semantic feature. Given image at location in viewing direction , the estimated BSD is given by


where and return 1 if a junction or a gap, respectively, are detected in image , and 0 otherwise, i.e. they mirror the BSD generation functions in Equation 1.
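The estimation step can be sketched as follows. The classifiers are passed in as plain callables returning a score in [0, 1]; this interface, the 0.5 decision threshold, the bit order (front, back, left, right) and the names are assumptions for illustration, not details taken from the trained networks.

```python
def estimate_bsd(images, junction_clf, gap_clf, threshold=0.5):
    """Estimate a 4-bit BSD from four directional images.

    images: dict with keys 'front', 'back', 'left', 'right'.
    junction_clf, gap_clf: callables mapping an image to a score in [0, 1].
    """
    return (
        int(junction_clf(images["front"]) >= threshold),
        int(junction_clf(images["back"]) >= threshold),
        int(gap_clf(images["left"]) >= threshold),
        int(gap_clf(images["right"]) >= threshold),
    )


# Stand-in classifiers and images, used only to exercise the function.
fake_images = {"front": "junction", "back": "road", "left": "wall", "right": "gap"}
fake_junction_clf = lambda img: 0.9 if img == "junction" else 0.1
fake_gap_clf = lambda img: 0.8 if img == "gap" else 0.2

print(estimate_bsd(fake_images, fake_junction_clf, fake_gap_clf))  # (1, 0, 0, 1)
```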

We use a CNN approach to design the two binary classifiers. For training data, we make use of the correspondence between the vector maps in OSM and the images in GSV in a similar manner to that used in [21]. For each feature type - junctions and gaps between buildings - we collect positive samples by identifying the locations of the relevant features in OSM and storing the images from the corresponding locations (based on latitude and longitude) and relevant viewing directions from GSV, ensuring that we get a uniform mix of viewing scenarios. For example, in the case of junctions, we use front and back facing images aligned with the road and ensure that we have examples that cover the range of distances from the junction up to the viewing radius used in the generation of the BSDs. The training set is then completed by collecting approximately the same number of negative samples in the corresponding viewing directions but not containing the feature of interest. In the experiments, we used training sets consisting of 440,000 images per classifier taken from 220,000 locations in 23 different cities in the UK. None of these locations were used for evaluating the classifiers or in the localisation experiments.

We implemented the classifiers by using our training dataset to fine-tune an off-the-shelf pre-trained CNN. Specifically, we started from the pre-trained Places205-AlexNet model [22], designed for scene classification in urban environments, which aligns with our application, and derived from the pre-trained AlexNet model [23]. We used colour images cropped from GSV panoramas in the required viewing direction corresponding to a horizontal field of view and resized to pixels. The latter results in some distortion but given that we used the same process for both training and testing, this was not deemed to be an issue. Examples of positive and negative images from the training dataset are shown in Figure 4. We tested performance of each classifier using two test sets of 8000 images taken from the same 23 cities but at locations not within the training set and with an equal number of positive and negative samples, i.e. feature present and not present.

Both classifiers gave good balanced performance in detecting the presence and non-presence of junctions and gaps, with precision and recall values of on the test set. Examples of correct classifications (true positives and true negatives) and incorrect classifications (false positives and false negatives) are shown in Figure 5. Note that the latter illustrate the difficulty of the task. For example, the bottom left view in Figure 5b contains a junction which is significantly obscured and was incorrectly classified as containing no junction, whilst the 2-D map indicates that the bottom right view should contain a gap, but the site appears to be under redevelopment and has been incorrectly classified as not containing a gap. The latter is an example of inaccuracies within the OSM data.

Fig. 4: Examples of positive (feature present) and negative (feature not present) images from the training datasets used for the semantic classifiers: (a) junction (top) and no junction (bottom); (b) gap (top) and no gap (bottom).

Fig. 5: Examples of semantic classifications: (a) true positives (top) and true negatives (bottom); (b) false positives (top) and false negatives (bottom). In both (a) and (b) examples are arranged as: junction (top-left); gap (top-right); no junction (bottom-left); no gap (bottom-right).

V Route Descriptors and Turn Patterns

As noted earlier and as we demonstrate later, on their own the above binary descriptors are not sufficiently discriminative to identify a location uniquely and allow localisation. This is true even if we were able to design perfect classifiers for extracting the descriptors from images. The simplicity of the representation, whilst being extremely compact, means that there are many locations with similar descriptors. We address this ambiguity in two ways. First, we concatenate descriptors along routes corresponding to adjacent locations, constructing route descriptors, which prove to be highly discriminative once the routes reach a certain length. Once this length is reached, localisation can proceed at the rate that new locations are visited, i.e. enabling tracking, by matching with a database of all possible route descriptors constructed offline. Secondly, we introduce further disambiguation by incorporating turn patterns observed along routes into the representation, i.e. the sequence of ‘no turn’ and ‘turn’ (left or right) events at each location along a route, and using these to identify the most likely match within the database.

Let be an adjacency matrix, such that if locations and are adjacent, and otherwise. Locations are regarded as adjacent if on the 2-D map they are connected by a road and there are no other locations between them. A route is then defined as a finite sequence of adjacent locations, i.e. the route is of length , where defines a sequence of adjacent locations such that , . For simplicity we have restricted ourselves to routes that do not loop or turn back on themselves, i.e. , , , but the method could be readily extended to deal with such cases. We define as the set of all such routes up to length defined amongst all the locations in . Associated with each route is a route descriptor, consisting of the sequence of BSDs corresponding to the locations along the route, i.e. , and we define as the set of all route descriptors corresponding to the routes in .
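The construction of the route set and the associated route descriptors can be sketched with a depth-first enumeration of non-looping paths. The adjacency-list representation (a dict rather than a matrix), the use of bit strings for BSDs and the function names are assumptions for illustration; routes are treated as ordered, so each path appears once per direction of travel.

```python
def enumerate_routes(adjacency, max_len):
    """Return all non-looping routes of 1..max_len adjacent locations.

    adjacency: dict mapping a location id to an iterable of adjacent ids.
    """
    routes = []

    def extend(route, visited):
        routes.append(tuple(route))
        if len(route) == max_len:
            return
        for nxt in adjacency[route[-1]]:
            if nxt not in visited:  # no loops or turning back
                route.append(nxt)
                visited.add(nxt)
                extend(route, visited)
                route.pop()
                visited.remove(nxt)

    for start in adjacency:
        extend([start], {start})
    return routes


def route_descriptor(route, bsds):
    """Concatenate the 4-bit BSD strings of the locations along a route."""
    return "".join(bsds[loc] for loc in route)


# Three locations along a single road: a - b - c.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
bsds = {"a": "1000", "b": "0000", "c": "0110"}
routes = enumerate_routes(adjacency, max_len=3)
print(len(routes), route_descriptor(("a", "b", "c"), bsds))
```

The number of routes grows quickly with the maximum length, which is why the compactness of each route descriptor matters.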

To incorporate turn information into the representation, we define a binary turn pattern associated with a route . The th bit of indicates whether a left or right turn is present between locations and , i.e. , where denotes the front facing direction at location and


where denotes the absolute value of the smallest angle between and , and is an angle threshold, which we set to be to ensure that we only include significant turns. Thus represents the sequence of turns that take place along a route. We define to be the set of such turn patterns corresponding to the routes in .
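A turn pattern of this kind can be computed from the front facing directions at successive locations, as in the sketch below. The heading representation (degrees), the strict inequality against the threshold and the function names are assumptions; the threshold of 60 degrees in the example is a placeholder chosen for illustration and is not the value used in our experiments.

```python
def smallest_angle_deg(a, b):
    """Absolute value of the smallest angle between two headings, in [0, 180]."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def turn_pattern(headings_deg, threshold_deg):
    """One bit per pair of successive locations: 1 if a significant turn occurs."""
    return tuple(
        int(smallest_angle_deg(h0, h1) > threshold_deg)
        for h0, h1 in zip(headings_deg, headings_deg[1:])
    )


# Five locations: straight, turn, straight, then a turn across the 0/360 wrap.
print(turn_pattern([0.0, 5.0, 95.0, 100.0, 350.0], threshold_deg=60.0))  # (0, 1, 0, 1)
```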

VI Localisation and Bootstrapping

Consider an autonomous system making its way through an urban environment, moving between locations in along a specific route of length . At any given location, our goal is to identify its current location by recognising the route taken to date, consisting of the current location plus the previous locations, say. We do this by comparing its estimated route descriptor (obtained by concatenating the estimated BSDs at each location) with those in and its turn pattern with those in , hence determining the most likely route from those in .

It is important to note that in this work we assume that there is a one-to-one correspondence between the locations in our 2-D map and the locations in the environment. This enables us to do a direct comparison between estimated route descriptors and those in the database. When using GSV and OSM data this can be ensured by selecting OSM locations corresponding to the known locations in GSV. In a practical system, we would need a method of forming such one-to-one correspondence or alternatively, a means of overcoming the lack of it. We discuss this further in Section IX.

We define the most likely route as being the route whose route descriptor is closest to and whose turn pattern matches , i.e. such that




where denotes the Hamming distance between two binary strings and . For long routes (, say) the number of elements in becomes very large (, rising to near for ), and thus we use an efficient pattern matching algorithm based on a BK-tree [24] to find the closest route descriptor in Equation (4).
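For readers unfamiliar with BK-trees, the following is a minimal generic sketch of the data structure with a Hamming metric. It is not the implementation of [24] nor the one used in our experiments; the class name, the node layout and the encoding of descriptors as Python integers are assumptions for illustration.

```python
def hamming_int(a, b):
    """Hamming distance between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


class BKTree:
    """Metric tree in which each child edge is labelled by its distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is [item, {edge_distance: child_node}]

    def add(self, item):
        if self.root is None:
            self.root = [item, {}]
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = [item, {}]
                return
            node = child

    def query(self, item, radius):
        """Return sorted (distance, item) pairs within `radius` of `item`."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node_item, children = stack.pop()
            d = self.distance(item, node_item)
            if d <= radius:
                results.append((d, node_item))
            # Triangle inequality: only subtrees with |edge - d| <= radius can match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


tree = BKTree(hamming_int)
for desc in (0b0000, 0b0001, 0b0011, 0b1111):
    tree.add(desc)
print(tree.query(0b0001, radius=1))  # [(0, 1), (1, 0), (1, 3)]
```

A nearest-neighbour search can be obtained by querying with an increasing radius until a result is returned, after which the turn pattern constraint can be applied to the candidates.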

Note from the above that we assume that the turn pattern for the query route is correct, but allow errors in the estimate of the route descriptor due to the non-perfect classifiers used in the semantic feature detectors. Our motivation for the former is that in practice detecting significant left or right turns by an autonomous system can be achieved reliably and thus requiring an exact match is reasonable. Note, however, that as we show later, turn patterns alone are not sufficient to achieve localisation, as many routes share the same turn pattern, and it is their combination with route descriptors that gives the required level of distinctiveness.

The above provides an indication of the most likely location given the current route. However, it gives no indication as to the confidence in the estimate. There are a number of possibilities for this, including basing it on the distance between the estimated route descriptor and the best matching route descriptor and/or the distance of the estimate from the second best matching route descriptor. We found that a consistency metric proved to be most effective, in which we deem a route to be localised if there is sufficient overlap between the most likely routes for a number of successive locations. We required that 80% of the locations be the same and that this occur for 5 successive locations. In essence, if successive query routes are matching with routes that have significant overlap then it is a good indicator that successful localisation has been achieved.
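One possible reading of this consistency criterion is sketched below. The precise definition of overlap is an assumption: here it is the fraction of locations in the previous best route that also appear in the current best route, and the criterion is met when each of the last 5 best routes overlaps sufficiently with its predecessor. The function names are likewise illustrative.

```python
def overlap_fraction(prev_route, curr_route):
    """Fraction of locations in prev_route that also occur in curr_route."""
    if not prev_route:
        return 0.0
    curr = set(curr_route)
    return sum(loc in curr for loc in prev_route) / len(prev_route)


def is_localised(best_routes, min_overlap=0.8, n_successive=5):
    """best_routes: the most likely route found at each successive location."""
    if len(best_routes) < n_successive:
        return False
    recent = best_routes[-n_successive:]
    return all(
        overlap_fraction(a, b) >= min_overlap
        for a, b in zip(recent, recent[1:])
    )


# Best matches that grow by one location at a time are fully consistent.
growing = [tuple(range(1, k)) for k in range(6, 11)]
print(is_localised(growing))                                   # True
print(is_localised(growing[:4] + [(20, 21, 22, 23, 24)]))      # False
```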

We also demonstrate later that once the above consistency criterion has been met, the query route length can be fixed and localisation can proceed by successively updating the query route by appending the latest BSD onto the end and removing the first descriptor. Thus, the phase during which the query route grows can be regarded as a bootstrapping process, during which the route descriptor extends until it becomes sufficiently distinct to allow localisation. Once complete, continuous tracking can take place using the fixed length query at the same rate as BSDs are created at successive locations. An example of bootstrapping and tracking is shown in the video submitted as supplementary material.
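The fixed length update used during tracking amounts to a sliding window over the BSD sequence. A minimal sketch, assuming BSDs are stored as 4-character bit strings and using a bounded double-ended queue (the names are illustrative):

```python
from collections import deque


def make_tracking_window(bootstrapped_bsds):
    """Fix the query length at the number of BSDs gathered during bootstrapping."""
    return deque(bootstrapped_bsds, maxlen=len(bootstrapped_bsds))


window = make_tracking_window(["0000", "1000", "0010"])
window.append("0101")  # the oldest BSD is dropped automatically
query_descriptor = "".join(window)
print(query_descriptor)  # 100000100101
```

Each new location therefore costs one append and one database query, independent of how long the system has been tracking.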

VII Probabilistic Formulation

The above localisation process can also be considered in probabilistic terms. Given an estimated BSD obtained at a single location , say, then the conditional probability that corresponds to can be written as


where we assume that all descriptors are equally likely. Note that the term expresses the uniqueness of the ground-truth descriptor derived from the 2-D map. Since our descriptors are only 4-bits long, then for a large number of locations, e.g. 6000 in the experiments, , indicating that many locations have the same descriptors and hence localisation is not possible. Given that we have an estimate of the accuracy of our classifiers and hence the detectors and , we can approximate the likelihood in terms of the Hamming distance between and , i.e.


where is the probability of correctly detecting the presence or not of both junctions and gaps (we assume the same value for both probabilities for simplicity, but as noted in Section IV, we also observed similar values in practice of ).

Extending the above to routes, we obtain the following conditional probability that the route descriptor estimate corresponds to route


Hence from Equations (6) and (7) and assuming independence between descriptors


where denotes the Hamming distance between and . Here, expresses the uniqueness of the route descriptor , which as we demonstrate later is high for sufficiently long routes and thus , giving


Using this expression we can obtain an estimate of the likelihood ratio of one route over another for a given


where is the Hamming distance between and . Hence for , this gives a likelihood ratio of for a difference in Hamming distance from the estimated route descriptor, which is significant. For example, a route whose descriptor is bits closer in Hamming distance to the estimated descriptor, is times more likely to be the correct route. Thus, as we demonstrate in the next section, even with a detector accuracy of only 75% for individual BSDs, the concatenation of descriptors along routes can lead to a high degree of distinctiveness for long enough routes.
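The effect of Hamming distance on the relative likelihood of two routes can be illustrated numerically. The sketch below assumes independent bit errors with a common per-bit accuracy p, so that a descriptor of n bits at Hamming distance h has likelihood p^(n-h) (1-p)^h; this form and the function names are assumptions consistent with the description above rather than a transcription of the equations.

```python
import math


def descriptor_likelihood(hamming_dist, n_bits, p):
    """Likelihood of an estimate at the given Hamming distance, assuming
    each bit is detected correctly and independently with probability p."""
    return p ** (n_bits - hamming_dist) * (1.0 - p) ** hamming_dist


def likelihood_ratio(delta_hamming, p):
    """Ratio of likelihoods for two equal-length route descriptors whose
    Hamming distances to the estimate differ by delta_hamming bits."""
    return (p / (1.0 - p)) ** delta_hamming


p = 0.75
print(likelihood_ratio(1, p))   # 3.0: each extra matching bit triples the likelihood
print(likelihood_ratio(5, p))

# The closed-form ratio agrees with the ratio of the two likelihoods.
direct = descriptor_likelihood(10, 160, p) / descriptor_likelihood(15, 160, p)
print(math.isclose(direct, likelihood_ratio(5, p)))
```

Under these assumptions, with p = 0.75 a route whose descriptor is 5 bits closer to the estimate is 3^5 = 243 times more likely.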

Fig. 6: Histogram showing the distribution of 4-bit ground-truth (blue) and estimated (red) BSDs obtained from OSM and GSV images, respectively.

VIII Experiments

We evaluated the performance of the method using GSV and OSM data for a region of London. None of the locations within the region were used to train the semantic classifiers. The region consisted of 6656 GSV locations and from each location we gathered images corresponding to the four viewing directions, from which we estimated 4-bit BSDs using the classifiers described in Section IV.

To illustrate the distribution of descriptors across the region and the performance of the classifiers, Figure 6 shows the histogram of 4-bit ground-truth descriptors (obtained from OSM, shown in blue) and estimated descriptors, shown in red, where the horizontal axis corresponds to the 16 possible 4-bit patterns. The predominance of BSDs with pattern ‘0000’, corresponding to locations at which there are neither gaps between buildings to the left or right nor junctions towards the front or back, results from the fact that many locations between junctions have these characteristics, as can be seen from the 2-D map of the area shown in Figure 11. Note that the distribution of the estimated descriptors is close to that of the ground-truth due to the accuracy of the classifiers, i.e. approximately 75%.
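A histogram of this kind can be produced by counting occurrences of each of the 16 possible patterns. The sketch below assumes BSDs are stored as 4-character bit strings; the function name and the toy data are illustrative only and are not drawn from our dataset.

```python
from collections import Counter


def bsd_histogram(bsds):
    """Count occurrences of each of the 16 possible 4-bit patterns."""
    counts = Counter(bsds)
    patterns = [format(i, "04b") for i in range(16)]
    return {pattern: counts.get(pattern, 0) for pattern in patterns}


toy_bsds = ["0000", "0000", "1000", "0010", "0000", "1100"]
hist = bsd_histogram(toy_bsds)
print(hist["0000"], hist["1000"], hist["1111"])  # 3 1 0
```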

To assess the performance of the route based localisation, we considered route lengths up to a maximum of locations. We tested the method using 150 test routes and for each we recorded the route length at which localisation was achieved according to the consistency criterion, i.e. 5 successive consistent localisations. The results are shown in Figure 7, which shows the percentage of routes that were correctly localised within route lengths of 0-5, 0-10, …, 0-40 locations. We have shown the results for three methods of matching routes: using only turn patterns (grey); using only route BSDs (yellow); and using both BSDs and turn patterns (blue). Note that the latter outperforms the others by a significant margin and that BSDs alone also significantly outperform turn patterns, which only manage to localise of routes even with a route length of 40 locations. This clearly demonstrates the potential of the BSD approach. Note in particular that over 85% of routes are correctly localised even when only using routes consisting of up to 20 locations, which corresponds to approximately 200 meters in GSV.

Fig. 7: Accuracy of localisation (% of correctly identified routes) versus route length using turn patterns (grey), route descriptors (yellow), and route descriptors with turn patterns (blue).
Fig. 8: Accuracy of localisation (% of correctly identified routes) versus classifier accuracy for different ranges of route length.

It is also interesting to consider the significance of the classifier accuracy. Given that we know the ground-truth BSDs, we investigated using BSDs ‘estimated’ by classifiers of different accuracy (we assumed the same accuracy for detecting the presence or absence of both junctions and gaps). A plot of the percentage of correctly localised routes versus classifier accuracy is shown in Figure 8 for the different route length ranges used in Figure 7. Note that at 75% accuracy, which is what we obtained from our trained classifiers, over 85% of routes are predicted to be correctly localised within 0-20 locations, which agrees with our findings in Figure 7. Note also that if classifier accuracy were increased to beyond 80%, then 80-90% of routes could be correctly localised using locations, which again illustrates the potential of the BSD approach.
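One way such ‘estimated’ BSDs could be generated from ground-truth is to flip each bit independently with probability one minus the assumed classifier accuracy. The sketch below does this; the function name `simulate_bsd`, the bit-string representation and the independence of errors across bits are all assumptions, not details taken from the experiments.

```python
import random


def simulate_bsd(ground_truth, accuracy, rng):
    """Return a noisy copy of a ground-truth BSD bit string.

    Each bit is kept with probability `accuracy` and flipped otherwise,
    mimicking a classifier with that per-bit accuracy (an assumption).
    """
    flipped = {"0": "1", "1": "0"}
    return "".join(
        bit if rng.random() < accuracy else flipped[bit]
        for bit in ground_truth
    )


rng = random.Random(0)  # fixed seed so that runs are repeatable
perfect = simulate_bsd("0101", 1.0, rng)   # no bit is flipped
inverted = simulate_bsd("0101", 0.0, rng)  # every bit is flipped
noisy = simulate_bsd("0101", 0.75, rng)    # roughly one bit in four flipped
```

Sweeping `accuracy` over a range of values and repeating the route matching for each would produce a curve of the kind plotted against classifier accuracy.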

Fig. 9: Examples of BSD ground-truths, from OSM, and BSD estimates, from the classification of the GSV images in four directions as shown.

Examples of estimated BSDs, their corresponding images in the four viewing directions and the ground-truth BSDs from OSM for part of a route are shown in Figure 9. Note the deviation of the BSD estimates from the ground-truth, which results from the inaccuracy of the classifiers. The challenging nature of the detection task can be seen from the images. This confirms the utility of concatenating BSDs along a route in order to gain uniqueness and hence enable localisation. This is further illustrated in Figure 10, which shows the distribution of Hamming distances from descriptors in the database for a given test (query) route at lengths of 15 (left) and 30 (right) locations, with and without using turn patterns (bottom and top, respectively). The correct matches for lengths 15 and 30 have Hamming distances of 15 and 26, respectively. When the test route length is 15 locations, the correct route is not the closest (there are other Hamming distances with values ), although using turns (bottom) significantly reduces the number of routes close to the query route (note that the vertical axes in Figure 10 have significantly different ranges). With 30 locations and without using turns, the correct route does become joint closest with 18 others, and a significant number of other routes remain close by. In contrast, using turn patterns with 30 locations drastically reduces the number of candidate routes and the correct route becomes the closest by a Hamming distance margin of over 20.

Fig. 10: Histograms of Hamming distances between a test route descriptor and those in the database for route lengths of 15 (left) and 30 (right) locations, with (bottom) and without (top) using turn patterns.

To illustrate the localisation process, Figure 11 shows snapshots of the localisation of a test route at route lengths of 2 and 24 locations. It shows the OSM 2-D map, with locations indicated by coloured squares along roads, where the colour indicates the closeness between their route descriptor and that of the test route (where locations share routes, the closest descriptor difference is shown). The latest location along the test route is indicated by the orange/red circle, where orange indicates that the route has yet to be correctly and consistently localised, and red indicates that localisation has been achieved. The BSDs, estimated and ground-truth, along with their images, are shown below the 2-D maps. Note that the bottom row of images shows the views at the closest (best) match locations, but these are not used in the matching process. With a route length of 2, the majority of locations have a low likelihood of being correct (dark blue), whilst a small number of disparate locations have a high likelihood (dark red). This reflects the lack of distinctiveness of two 4-bit BSDs. In contrast, once 24 locations are reached, the route has been successfully localised and the vast majority of other locations/routes have been eliminated (their squares are not shown), reflecting the confidence of the localisation. A video showing the complete process has been submitted as supplementary material.


Fig. 11: Snapshots of the localisation process for test route lengths of 2 (top) and 24 locations (bottom). See text for explanation.

IX Conclusions

We have presented a novel approach to position localisation in urban areas and, to the best of our knowledge, it is the first example of linking 2-D maps to images over large areas. The key contribution is the demonstration that compact binary semantic descriptors concatenated over routes are sufficiently distinctive to enable localisation and that the representation is vastly smaller than that used in image to image database approaches. Moreover, the use of simple semantic classification offers the potential for invariance to changing environment conditions, which is something that we wish to demonstrate in future. In addition, the reported work relies on an assumption of one-to-one correspondence between map and image locations, achieved by using OSM and GSV data. This assumption needs to be addressed in order to develop a practical system, which we are in the process of doing.




  1. S. Lowry, N. Sunderhauf, P. Newman, J. Leonard, D. Cox, P. Corke, and M. Milford, “Visual place recognition: A survey,” Robotics, IEEE Trans. on, vol. 32, pp. 1–19, 2016.
  2. K. Lynch, The image of the city.   MIT Press, 1960.
  3. J. O’Keefe and L. Nadel, The hippocampus as a cognitive map.   Clarendon Press, 1978.
  4. B. Tversky, “Cognitive maps, cognitive collages, and spatial mental models,” in Proc. Euro. Conf. on Spatial Info. Theory (COSIT), 1993.
  5. M. Cummins and P. Newman, “Appearance-only SLAM at large scale with FAB-MAP 2.0,” Int. J. of Robotics Research, vol. 30, pp. 1100–1123, 2010.
  6. M. Milford and G. Wyeth, “Seqslam : visual route-based navigation for sunny summer days and stormy winter nights,” in IEEE Int. Conf. on Robotics and Automation, 2012.
  7. P. Biber and T. Duckett, “Experimental analysis of sample-based maps for long-term slam,” Int. J. on Robotics Research, vol. 28, no. 1, 2009.
  8. S. Lowry, M. Milford, and G. Wyeth, “Transforming morning to afternoon using linear regression techniques,” in Proc. IEEE Int. Conf. on Robotics and Automation, 2014.
  9. N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proc. of Robotics: Science and Systems, 2015.
  10. P. Panphattarasap and A. Calway, “Visual place recognition using landmark distribution descriptors,” in Proc Asian Conf. on Computer Vision, 2016.
  11. S. Chiang, Y.; Leyk and C. A. Knoblock, “A survey of digital map processing techniques,” ACM Computing Surveys, vol. 47, no. 1, 2014.
  12. R. Frampton and A. Calway, “Place recognition from disparate views,” in Proc. British Machine Vision Conf. (BMVC), 2013.
  13. A. Mousavian and K. J., “Semantically guided location recognition for outdoor scenes,” in Proc IEEE Int. Conf. on Robotics and Automation, 2015.
  14. K. Cheverst, J. Schöning, A. Krüger, and M. Rohs, “Photomap: Snap, grab and walk away with a “you are here” map,” in Proc. MobileHCI ’08 : Workshop on Mobile Interaction with the Real World, 2008.
  15. J. Schöning, A. Krüger, K. Cheverst, M. Rohs, M. Löchtefeld, and F. Taher, “Photomap: Using spontaneously taken images of public maps for pedestrian navigation tasks on mobile devices,” in Proc. Int. Conf. on Human-Computer Interaction with Mobile Devices and Services.   ACM, 2009.
  16. T.-J. Cham, A. Ciptadi, W.-C. Tan, M.-T. Pham, and L.-T. Chia, “Estimating camera pose from a single urban ground-view omnidirectional image and a 2d building outline map,” IEEE Conf. on Computer Vision and Pattern Recognition, 2010.
  17. A. Mousavian and J. Kosecka, “Semantic image based geolocation given a map,” ArXiv e-prints, 2016.
  18. C. Arth, C. Pirchheim, J. Ventura, D. Schmalstieg, and V. Lepetit, “Instant outdoor localization and slam initialization from 2.5d maps,” IEEE Trans. on Vis. and Computer Graphics, vol. 21, no. 11, 2015.
  19. A. Armagan, M. Hirzer, P. M. Roth, and V. Lepetit, “Accurate camera registration in urban environments using high-level feature matching,” in Proc. British Machine Vision Conf., 2017.
  20. A. Armagan, M. Hirzer, P. Roth, and V. Lepetit, “Learning to align semantic segmentation and 2.5 d maps for geolocalization,” in IEEE Conf. on Computer Vision and Pattern Recognitiion, 2017.
  21. A. Seff and J. Xiao, “Learning from maps: Visual common sense for autonomous driving,” ArXiv e-prints, 2016.
  22. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Proc. Int. Conf. on Neural Information Processing Systems, 2014.
  23. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Int. Conf. on Neural Information Processing Systems, 2012.
  24. W. A. Burkhard and R. M. Keller, “Some approaches to best-match file searching,” Commun. ACM, vol. 16, no. 4, 1973.