Is This The Right Place? Geometric-Semantic Pose Verification for Indoor Visual Localization

Hajime Taira  Ignacio Rocco  Jiri Sedlar  Masatoshi Okutomi  Josef Sivic
Tomas Pajdla  Torsten Sattler  Akihiko Torii
Tokyo Institute of Technology, Inria, CIIRC, CTU in Prague, Chalmers University of Technology
WILLOW project, Département d’Informatique de l’École Normale Supérieure, ENS/INRIA/CNRS UMR 8548, PSL Research University.
CIIRC - Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague.

Visual localization in large and complex indoor scenes, dominated by weakly textured rooms and repeating geometric patterns, is a challenging problem with high practical relevance for applications such as Augmented Reality and robotics. To handle the ambiguities arising in this scenario, a common strategy is, first, to generate multiple estimates for the camera pose from which a given query image was taken. The pose with the largest geometric consistency with the query image, e.g., in the form of an inlier count, is then selected in a second stage. While a significant amount of research has concentrated on the first stage, there is considerably less work on the second stage. In this paper, we thus focus on pose verification. We show that combining different modalities, namely appearance, geometry, and semantics, considerably boosts pose verification and consequently pose accuracy. We develop multiple hand-crafted approaches as well as a trainable approach for joint geometric-semantic verification and show significant improvements over the state of the art on a very challenging indoor dataset.

Figure 1: Using further modalities for indoor visual localization. Given a set of camera pose estimates for a query image (a, g), we seek to identify the most accurate estimate. (b, h) Due to severe occlusion and weak textures, a state-of-the-art method [72] fails to identify the correct camera pose. To overcome those difficulties, we use several modalities along with visual appearance: (top) surface normals and (bottom) semantics. (c, i) Our approach verifies the estimated pose by comparing the semantics and surface normals extracted from the query (d, j) and database (f, l).

1 Introduction

Visual localization is the problem of estimating the 6 Degree-of-Freedom (DoF) pose from which an image was taken with respect to a 3D scene. Visual localization is vital to applications such as Augmented and Mixed Reality [44, 15], intelligent systems such as self-driving cars and other autonomous robots [41], and 3D reconstruction [61].

State-of-the-art approaches for accurate visual localization are based on matches between 2D image and 3D scene coordinates [10, 11, 56, 84, 40, 16, 71, 46, 47, 72]. These 2D-3D matches are either established using explicit feature matching [72, 56, 84, 71, 40] or via learning-based scene coordinate regression [10, 11, 16, 65, 46, 47]. At large scale or in complex scenes with many repeating structural elements, establishing unique 2D-3D matches becomes a hard problem due to global ambiguities [40, 71, 84]. A strategy to avoid such ambiguities is to restrict the search space for 2D-3D matching. For example, image retrieval [69, 49] can be used to identify a few parts of the scene most likely to be seen in a query image [72, 29, 58]. 2D-3D matching is then performed for each such retrieved place, resulting in one pose hypothesis per place. Subsequently, the “best” pose is selected as the final pose estimate for the query image.

Traditionally, the “best” pose has been defined as the pose with the largest number of inlier matches [29, 61]. Yet, it has been shown that a (weighted) inlier count is not a good decision criterion in the presence of repeated structures and global ambiguities [55]. Rather than only accounting for positive evidence in the form of the inliers, [72] proposed to compare the query photo against an image of the scene rendered using the estimated pose. [72] have shown that such a pixel-wise comparison, termed Dense Pose Verification (DensePV), leads to a significantly better definition of the “best” pose and subsequently improves pose accuracy.

In this paper, we follow the approach from [72], which focuses purely on comparing low-level appearance and geometry information between the re-rendered and the actual query image. In contrast to [72], we ask whether the pose verification stage, and thus the pose accuracy of visual localization approaches, can be improved further. To this end, we analyze the impact of using further geometric and semantic modalities as well as learning in the verification stage. In detail, this paper makes the following contributions: 1) We investigate the impact of using multiple modalities during pose verification for indoor visual localization in challenging scenarios. We hand-design several modifications of the original DensePV approach that integrate additional 3D geometry as well as normal and semantic information. We show that these approaches improve upon the original DensePV strategy, setting a new state of the art in localization performance on the highly challenging InLoc dataset [72]. None of these approaches requires fine-tuning on the actual dataset used for localization, and they are thus generally applicable. We are not aware of prior work that combines such modalities. 2) We also investigate a trainable approach for pose verification. We show that it outperforms the original DensePV strategy, which uses a hand-crafted representation. However, it is not able to outperform our novel modifications, even though it is trained on data depicting the scene used for localization. 3) We show that there is still significant room for improvement by more advanced combinations, opening up avenues for future work. In addition, we show that a standard approach for semantic pose verification used for outdoor scenes in the literature [74, 21, 22, 73] is not applicable for indoor scenes. 4) We make our source code and training data publicly available.

2 Related Work

Visual localization. Structure-based visual localization uses a 3D scene model to establish 2D-3D matches between pixel positions in the query image and 3D points in the model [84, 71, 56, 40, 72, 11, 29, 16, 65, 14, 42]. The scene model can be represented either explicitly, e.g., a Structure-from-Motion (SfM) point cloud [39, 40, 56, 29] or a laser scan [72], or implicitly, e.g., through a convolutional neural network (CNN) [45, 10, 11] or a random forest [16, 45, 65, 46, 47]. In the former case, 2D-3D matches are typically established by matching local features such as SIFT [43]. In contrast, methods based on implicit scene representations directly regress 3D scene coordinates from 2D image patches [16, 45, 10, 11]. In both cases, the camera pose is estimated from the resulting 2D-3D matches by applying an $n$-point-pose solver [27, 35, 38] inside a RANSAC [25, 19, 18] loop. Methods based on scene coordinate regression are significantly more accurate than approaches based on local features [72, 11]. Yet, they currently do not scale to larger and more complex scenes [72].

Closely related to visual localization is the place recognition problem [77, 76, 17, 1, 83, 4, 26, 34, 55, 36, 60]. Given a database of geo-tagged images, place recognition approaches aim to identify the place depicted in a given query image, e.g., via image retrieval [69, 49, 75, 19, 3]. The geo-tag of the most similar database image is then often used to approximate the pose of the query image [83, 76, 30, 31]. Place recognition approaches can also be used as part of a visual localization pipeline [72, 29, 13, 62, 53]: 2D-3D matching can be restricted to the parts of the scene visible in a short list of visually similar database images, resulting in one pose estimate per retrieved image. This restriction helps to avoid global ambiguities in a scene, e.g., caused by similar structures found in unrelated parts of a scene, during matching [54]. Such retrieval-based methods currently constitute the state-of-the-art for large-scale localization in complex scenes [72, 55, 53]. In this paper, we follow this strategy. However, unlike previous work focused on improving the retrieval [4, 1, 34, 26, 20, 36] or matching [72, 56], we focus on the pose verification stage, i.e., the problem of selecting the “best” pose from the estimated poses.

An alternative to the localization approaches outlined above is to train a CNN that directly regresses the camera pose from a given input image [32, 33, 12, 9, 50, 78]. However, it was recently shown that such methods do not consistently outperform a simple image retrieval baseline [59].

Semantic visual localization. In the context of long-time operation in dynamic environments, the appearance and geometry of a scene can change drastically over time [57, 62, 72]. However, the semantic description of each scene part remains invariant to such changes. Semantic visual localization approaches [62, 74, 73, 5, 34, 50, 70, 7, 6, 22, 52, 63, 80, 68, 81] thus use scene understanding, e.g., via semantic segmentation or object detection, as some form of invariant scene representation. Previous work has focused on improving the feature detection and description [62, 34], feature association [37, 70, 80, 63, 7, 74, 52], image retrieval [34, 5, 73, 62, 68, 81], and pose estimation stages [50, 70, 73, 74]. In contrast, this paper focuses on the pose verification stage.

Pose verification. Most similar to this paper are works on camera pose verification. The classical approach is to select the pose with the largest number of (weighted) inliers among all candidate poses [29, 61, 25]. However, a (weighted) inlier count is not an appropriate decision criterion in scenes with repetitive structures as an incorrect pose might have more inliers than the correct one [55]. Instead, it is necessary to explicitly account for such structures [55]. Still, focusing on the geometric consistency of feature matches only accounts for positive evidence. In order to take all pixels into account, [72] propose to re-render the scene from the estimated pose. They compare the resulting image with the original query photo using densely extracted RootSIFT [43, 2] features. In this paper, we build on their Dense Pose Verification (DensePV) approach and integrate additional modalities (surface normals and semantic segmentation) into the verification process. These additional modalities further improve the performance of the pose verification stage. While DensePV is a hand-crafted approach, we also propose a trainable variant.

[21, 22, 74, 73] use semantic scene understanding for pose verification: given a pose, they project the 3D points in a scene model into a semantic segmentation of the query image. They measure semantic consistency as the percentage of 3D points projected into an image region with the correct label. Besides determining whether an estimated camera pose is consistent with the scene geometry [22, 21], this measure can be used to identify incorrect matches [74] and to refine pose estimates [73, 70]. We show that this approach, which has so far been used in outdoor scenes, is not applicable in the indoor scenarios considered in this paper.

View synthesis. Following [72], we use view synthesis to verify estimated camera poses by re-rendering the scene from the estimated viewpoints. View synthesis has also been used to enable localization under strong appearance [8, 76] or viewpoint [66, 76, 64] changes. However, we are not aware of any previous work that combines multiple modalities and proposes a trainable verification approach.

3 Geometric-Semantic Pose Verification

In this paper, we are interested in analyzing the benefits of using more information than pure appearance for camera pose verification in indoor scenes. As such, we propose multiple approaches for pose verification based on the combination of appearance, scene geometry, and semantic information. We integrate our approach into the InLoc pipeline [72], a state-of-the-art visual localization approach for large-scale indoor scenes. In Sec. 3.1, we first review the InLoc algorithm. Sec. 3.2 then discusses how additional geometric information can be integrated into InLoc’s pose verification stage. Similarly, Sec. 3.3 discusses how semantic information can be used for pose verification.

Since obtaining large training datasets for indoor scenes can be hard, this section focuses on verification algorithms that do not require training data. Sec. 4 then introduces a trainable verification approach.

3.1 Indoor Localization with Pose Verification

The InLoc pipeline represents the scene through a set of RGB-D images with known poses. Given an RGB image as an input query, it first identifies a set of locations in the scene potentially visible in the query via image retrieval. For each location, it performs feature matching and re-ranks the locations based on the number of matches passing a 2D geometric verification stage. Camera poses are then estimated and verified for the top-ranked locations only.

Candidate location retrieval. InLoc uses the NetVLAD [1] descriptor to identify the 100 database images most visually similar to the query. For retrieval, the depth maps available for each database image are ignored.

Dense feature matching and pose estimation (DensePE). NetVLAD aggregates densely detected CNN features into a compact image-level descriptor. Given the top 100 retrieved images, InLoc performs mutually nearest neighbor matching of the densely extracted CNN features and performs spatial verification by fitting homographies. For the top 10 candidates with the largest number of homography-inliers, InLoc estimates a 6DoF camera pose: The dense 2D-2D matches between the query image and a retrieved database image define a set of 2D-3D matches when taking the depth map of the database image into account. The pose is then estimated using standard P3P-RANSAC [25].
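The shortlisting step described above (retrieve candidates, then keep the ones with the most homography inliers) can be sketched as follows. This is a minimal illustration rather than the InLoc code: the function name, the image identifiers, and the dictionary of inlier counts are hypothetical, and the feature matching and homography fitting that would produce the counts are not shown.

```python
def shortlist_candidates(retrieved, inlier_counts, top_k=10):
    """Re-rank retrieved database images by their homography-inlier count.

    retrieved: list of image ids ordered by retrieval similarity (hypothetical ids).
    inlier_counts: dict mapping an image id to its number of homography inliers.
    """
    # sorted() is stable, so images with equal counts keep their retrieval order.
    ranked = sorted(retrieved, key=lambda img: inlier_counts.get(img, 0), reverse=True)
    return ranked[:top_k]
```

Only the images returned by such a shortlist would then be passed on to pose estimation.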

Dense pose verification (DensePV). In its final stage, InLoc selects the “best” among the 10 estimated camera poses. To this end, InLoc re-renders the scene from each estimated pose using the color and depth information of the database RGB-D scan: The colored point cloud corresponding to the database RGB-D panoramic scan from which the retrieved database image $D$ originated is projected into the estimated pose of the query image $Q$ to form a synthetic query image $Q_D$. InLoc’s dense pose verification stage then densely extracts RootSIFT [43, 2] descriptors from both the synthetic and the real query image (RootSIFT is used for robustness to uniform illumination changes). It then evaluates the similarity between the two images as the median of the inverse Euclidean distance between descriptors corresponding to the same pixel position. Let

$$S_D(x, y) = \left\| \mathbf{d}(Q, x, y) - \mathbf{d}(Q_D, x, y) \right\|^{-1} \tag{1}$$

be the local descriptor similarity function between the RootSIFT descriptors $\mathbf{d}(\cdot, x, y)$ extracted at pixel position $(x, y)$ in $Q$ and $Q_D$. The similarity score between $Q$ and $Q_D$ then is

$$\mathrm{DensePV}(Q, Q_D) = \operatorname*{median}_{x, y} \, S_D(x, y). \tag{2}$$

The median is used instead of the mean as it is more robust to outliers. Invalid pixels, i.e., pixels into which no 3D point projects, are not considered in Eq. 2.

InLoc finally selects the pose estimated using the database image $D$ that maximizes $\mathrm{DensePV}(Q, Q_D)$.
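To make the scoring rule concrete, the following sketch computes the median of inverse descriptor distances over valid pixels and picks the candidate with the highest score. It is an illustration under stated assumptions, not the InLoc implementation: descriptors are stored as nested lists of tuples, the `eps` guard and the score of minus infinity for views without valid pixels are our own choices, and all function names are hypothetical.

```python
import math
from statistics import median

def descriptor_similarity(d_query, d_synth, eps=1e-12):
    """Inverse Euclidean distance between two descriptors at the same pixel."""
    # eps guards against division by zero for identical descriptors (assumption).
    return 1.0 / max(math.dist(d_query, d_synth), eps)

def dense_pv(query_desc, synth_desc, valid):
    """Median descriptor similarity over valid pixels.

    query_desc, synth_desc: H x W grids (nested lists) of descriptors.
    valid: H x W grid of booleans, False where no 3D point projects.
    """
    sims = [
        descriptor_similarity(dq, ds)
        for row_q, row_s, row_v in zip(query_desc, synth_desc, valid)
        for dq, ds, ok in zip(row_q, row_s, row_v)
        if ok
    ]
    # A view without any valid pixel gets the lowest possible score (assumption).
    return median(sims) if sims else float("-inf")

def select_best_pose(query_desc, candidates):
    """Return the name of the candidate with the highest score.

    candidates: dict mapping a candidate name to (synth_desc, valid).
    """
    return max(candidates, key=lambda name: dense_pv(query_desc, *candidates[name]))
```

Because the median ignores the magnitude of the worst-matching half of the pixels, a few badly rendered regions do not dominate the score, which is the robustness argument made above.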

3.2 Integrating Scene Geometry

Eq. 2 measures the similarity in appearance between the original query image and its synthesized version. The original formulation in the InLoc pipeline has two drawbacks: 1) it only considers the 3D geometry seen from a single scan location corresponding to the retrieved database image $D$. As the pose of the query image can substantially differ from the pose of the database image, this can lead to large regions in the synthesized image into which no 3D points are projected (cf. Fig. 1(i)). 2) indoor scenes are often dominated by large untextured parts such as white walls. The image appearance of these regions remains constant even under strong viewpoint changes. As such, considering only image appearance in these regions does not provide sufficient information for pose selection. In the following, we propose strategies to address these problems.

Merging geometry through scan-graphs. To avoid large regions of missing pixels in the synthesized images, we use 3D data from multiple database RGB-D scans when re-rendering the query. We construct an image-scan-graph (cf. Fig. 2) that describes which parts of the scene are related to each database image and are thus used for generating the synthetic query image. Given a retrieved database image $D$, the graph enables us to re-render the query view using more 3D points than those visible in the panoramic RGB-D scan associated with $D$. (The InLoc dataset used in our experiments consists of multiple panoramic RGB-D scans, each subdivided into multiple database images.) To construct the graph, we first select the ten spatially closest RGB-D panoramic scans for each database image. We estimate the visibility of each 3D scan in the database image by projecting the 3D points into it while handling occlusions via depth and normal information. We establish a graph edge between the database image and a scan if more than 10% of the database image pixels share the 3D points originating from the scan.

Given a query image $Q$, the retrieved database image $D$, and the estimated camera pose obtained using DensePE, we can leverage the constructed scan-graph to render multiple synthetic query images, one for each scan connected to $D$ in the graph. These views are then combined by taking depth and normal directions into account to handle occlusions.

Our approach assumes that the scans are dense and rather complete and that the different scans are registered accurately w.r.t. each other. These assumptions do not always hold in practice. Yet, our experiments show that using the scan-graph improves localization performance by reducing the number of invalid pixels in synthesized views compared to using individual scans (cf. Sec. 5).
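The graph construction described above (ten nearest scans per database image, edge if more than 10% of the image pixels are covered) can be sketched as follows. This is a simplified illustration: the function name and data layout are hypothetical, and the `coverage` callable merely stands in for the projection and occlusion test, which is not implemented here.

```python
import math

def build_image_scan_graph(image_positions, scan_positions, coverage,
                           num_nearest=10, min_coverage=0.10):
    """Link each database image to the scans that cover enough of its pixels.

    image_positions / scan_positions: dicts mapping ids to 3D positions.
    coverage: callable (image_id, scan_id) -> fraction of image pixels that
        receive a visible 3D point from the scan (placeholder for the
        projection with occlusion handling).
    """
    graph = {}
    for image_id, image_pos in image_positions.items():
        # Restrict the search to the spatially closest scans.
        nearest = sorted(
            scan_positions,
            key=lambda scan_id: math.dist(image_pos, scan_positions[scan_id]),
        )[:num_nearest]
        # Keep a scan only if it covers more than the required pixel fraction.
        graph[image_id] = [s for s in nearest if coverage(image_id, s) > min_coverage]
    return graph
```

At query time, one synthetic view would be rendered per scan listed for the retrieved database image, and the views would then be merged.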

Figure 2: Image-scan-graph for the InLoc dataset [72]. (a) Example RGB-D panoramic scan. (b) Neighboring database image. (c) 3D points of the RGB-D panoramic scan projected onto the view of the database image. (d) Red dots show where RGB-D panoramic scans are captured. Blue lines indicate links between panoramic scans and database images, established based on visual overlap.

Measuring surface normal consistency. The problem of a lack of information in weakly textured regions can also be addressed by considering other complementary image modalities, such as surface normals. When rendering the synthetic view, we can make use of the depth information in the RGB-D images to create a normal map with respect to a given pose. For each 3D point that projects into a 2D point $(x, y)$ in image space, the normal vector is computed by fitting a plane in a local 3D neighborhood. This 3D neighborhood is defined as the set of 3D points that project within a pixel patch around $(x, y)$. This results in a normal map $N_D$ as seen from the pose estimated via the database image $D$, where each entry $N_D(x, y)$ corresponds to a unit-length surface normal direction. On the query image side, we use a neural network [82] to predict a surface normal map $N_Q$.

We define two verification approaches using surface normal consistency. Both are based on the cosine similarity between normals estimated at pixel position $(x, y)$:

$$S_N(x, y) = N_Q(x, y)^\top N_D(x, y). \tag{3}$$

The first strategy, termed dense normal verification (DenseNV), mirrors DensePV but considers the normal similarities $S_N$ instead of the descriptor similarities $S_D$. With $Q$ the query image and $Q_D$ its synthesized view, it is defined as

$$\mathrm{DenseNV}(Q, Q_D) = \operatorname*{median}_{x, y} \, S_N(x, y). \tag{4}$$

The surface normal similarity maps $S_N$ can contain richer information than the descriptor similarity maps $S_D$ in the case of untextured regions. Yet, the contrary will be the case for highly textured regions. Therefore, we propose a second strategy (DensePNV), which includes surface normal consistency as a weighting term for the descriptor similarity:

$$\mathrm{DensePNV}(Q, Q_D) = \operatorname*{median}_{x, y} \, w(x, y) \, S_D(x, y), \tag{5}$$

where the weighting term $w$ shifts and normalizes the normal similarities as

$$w(x, y) = \frac{1 + S_N(x, y)}{2}. \tag{6}$$

Through $w$, the normal similarities act as an attention mechanism on the descriptor similarities, focusing the attention on image regions where normals are consistent.
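The two normal-based scores can be illustrated with a short sketch. It assumes that the weighting term maps a cosine similarity from [-1, 1] to [0, 1] via (1 + s) / 2, which is one reading of “shifts and normalizes”; pixels are given as flat lists, invalid pixels are assumed to have been removed beforehand, and all function names are hypothetical.

```python
from statistics import median

def normal_similarity(n_query, n_render):
    """Cosine similarity of two unit-length normals (their dot product)."""
    return sum(a * b for a, b in zip(n_query, n_render))

def normal_weight(s_n):
    """Map a cosine similarity from [-1, 1] to a weight in [0, 1] (assumed form)."""
    return (1.0 + s_n) / 2.0

def dense_nv(normals_query, normals_render):
    """Median normal similarity over pixels (lists of unit normals)."""
    return median(normal_similarity(q, r) for q, r in zip(normals_query, normals_render))

def dense_pnv(desc_sims, normals_query, normals_render):
    """Median of descriptor similarities weighted by normal consistency."""
    return median(
        normal_weight(normal_similarity(q, r)) * s_d
        for s_d, q, r in zip(desc_sims, normals_query, normals_render)
    )
```

With this weighting, a pixel whose predicted and rendered normals point in opposite directions contributes a weighted similarity of zero, while a pixel with identical normals keeps its full descriptor similarity.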

Implementation details. For the query images, for which no depth information is available, surface normals are extracted using [82]. The original implementation from [82] first crops the input image into a square shape and rescales it to a fixed resolution. However, the cropping operation can decrease the field of view and thus remove potentially important information (cf. the appendix). To preserve the field of view, we modified the network configuration to predict surface normals for rectangular images and scale each image such that its longer side is 256 pixels.
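The rescaling rule (longer side set to 256 pixels, no cropping) amounts to a simple computation of the target size. The helper below is a hypothetical sketch of that arithmetic only; rounding to the nearest integer is our assumption, and the actual image resampling is not shown.

```python
def rescale_longer_side(width, height, target=256):
    """Return (new_width, new_height) with the longer side equal to target.

    The aspect ratio is preserved up to rounding, so the field of view is kept.
    """
    scale = target / max(width, height)
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, round(width * scale)), max(1, round(height * scale))
```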

3.3 Integrating Scene Semantics

DensePV, DenseNV, and DensePNV implicitly assume that the scene is static, i.e., that the synthesized query image should look identical to the real query photo. In practice, this assumption is often violated as scenes change over time. For example, posters on walls or bulletin boards might be changed or furniture might be moved around. Handling such changes requires a higher-level understanding of the scene, which we model via semantic scene understanding.

Projective Semantic Consistency (PSC) [74, 73, 21]. A standard approach to using scene understanding for pose verification is to measure semantic consistency [74, 73, 21]: These methods use a semantically labeled 3D point cloud, e.g., obtained by projecting semantic labels obtained from RGB images onto a point cloud, and a semantic segmentation of the query image. The labeled 3D point cloud is projected into the query image via an estimated pose. Semantic consistency is then computed by counting the number of matching labels between the query and synthetic image.
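A minimal sketch of the PSC measure, assuming the labeled 3D points have already been projected to integer pixel coordinates with the estimated pose. The function name and data layout are hypothetical; the sketch returns both the raw count of matching labels and the fraction among points that land inside the image.

```python
def projective_semantic_consistency(projected_points, query_labels):
    """Count projected 3D points whose label agrees with the query segmentation.

    projected_points: iterable of ((u, v), label), with u the column and v the
        row of the pixel a labeled 3D point projects to.
    query_labels: H x W grid (nested lists) of class labels for the query image.
    Returns (number of matches, fraction of matches among in-image points).
    """
    height = len(query_labels)
    width = len(query_labels[0]) if height else 0
    matches = 0
    visible = 0
    for (u, v), label in projected_points:
        # Points that project outside the image are skipped.
        if 0 <= u < width and 0 <= v < height:
            visible += 1
            if query_labels[v][u] == label:
                matches += 1
    fraction = matches / visible if visible else 0.0
    return matches, fraction
```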

Ignoring transient objects. PSC works well in outdoor scenes, where there are relatively many classes and where points projecting into “empty” regions such as sky clearly indicate incorrect / inaccurate pose estimates. Yet, we will show in Sec. 5 that it does not work well in indoor scenes. This is due to the fact that there are no “empty” regions and that most pixels belong to walls, floors, or ceilings. Instead of enforcing semantic consistency everywhere, we use semantic information to determine where we expect geometric and appearance information to be unreliable.

We group the semantic classes into five “superclasses”: people, transient, stable, fixed, and outdoor. The transient superclass includes easily-movable objects, e.g., chairs, books, or trash cans. The stable superclass contains objects that are moved infrequently, e.g., tables, couches, or wardrobes. The fixed superclass contains objects that are unlikely to move, e.g., walls, floors, and ceilings. When computing DensePV, DenseNV, or DensePNV scores, we ignore pixels in the query image belonging to the people and transient superclasses. We refer to these approaches as DensePV+S, DenseNV+S, and DensePNV+S.
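The masking idea behind the +S variants can be sketched as follows. The class-to-superclass table is only an illustrative subset built from the examples named in the text, not the full mapping of the 150 ADE20K classes, and treating unlisted labels as reliable is our assumption. The function names are hypothetical.

```python
from statistics import median

# Illustrative subset built from the examples in the text; the full mapping of
# the 150 ADE20K classes is not reproduced here.
SUPERCLASS = {
    "person": "people",
    "chair": "transient", "book": "transient", "trash can": "transient",
    "table": "stable", "couch": "stable", "wardrobe": "stable",
    "wall": "fixed", "floor": "fixed", "ceiling": "fixed",
}
IGNORED = {"people", "transient"}

def keep_pixel(label):
    """True if a query pixel with this class label should contribute to the score."""
    # Labels missing from the illustrative table are kept (assumption).
    return SUPERCLASS.get(label) not in IGNORED

def masked_median(similarities, labels):
    """Median similarity over pixels not labeled as people or transient objects."""
    kept = [s for s, label in zip(similarities, labels) if keep_pixel(label)]
    return median(kept) if kept else float("-inf")
```

The same mask can be applied to descriptor similarities, normal similarities, or their weighted product before taking the median.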

Implementation details. Semantics are extracted using the CSAIL Semantic Segmentation/Scene Parsing approach [87, 86] based on a Pyramid Scene Parsing Network [85], trained on the ADE20K dataset [87, 86] containing 150 classes. Details on the mapping of classes to superclasses are provided in the appendix.

4 Trainable Pose Verification

In the previous section, we developed several methods for camera pose verification that did not require any training data. Motivated by the recent success of trainable methods for several computer vision tasks, this section presents a trainable approach for pose verification (TrainPV), where we train a pose verification scoring function from examples of correct and incorrect poses. We first describe the proposed model (cf. Fig. 3), then how we obtained training data, and finally the loss used for training.

Network architecture for pose verification.

Our network design follows an approach similar to that of DensePV: given the original query image $Q$ and a synthetic query image $Q_D$, we first extract dense feature descriptors $\mathbf{d}(Q, x, y)$ and $\mathbf{d}(Q_D, x, y)$ using a fully convolutional network. This feature extraction network plays the role of the dense RootSIFT descriptor of DensePV. Then, a descriptor similarity score map $S_D$ is computed via the cosine similarity (the descriptors are L2-normalized beforehand):

$$S_D(x, y) = \mathbf{d}(Q, x, y)^\top \mathbf{d}(Q_D, x, y). \tag{7}$$

Finally, the 2D descriptor similarity score map given by Eq. 7 is processed by a score regression CNN that estimates the agreement between $Q$ and $Q_D$, resulting in a scalar score $s_{Q_D}$.

Figure 3: Network architecture for Trainable Pose Verification. The input images $Q$ and $Q_D$ are passed through a feature extraction network to obtain dense descriptors. These are then combined by computing the descriptor similarity map $S_D$. Finally, a score regression CNN produces the score $s_{Q_D}$ of the trainable pose verification model.

This score regression CNN is composed of several convolution layers followed by ReLU non-linearities and a final average pooling layer. The intuition is that the successive convolution layers can identify coherent similarity (and dissimilarity) patterns in the descriptor similarity score map $S_D$. A final average pooling then aggregates the positive and negative evidence over the score map to accept or reject the candidate pose. Note that our architecture bears a resemblance to recent methods for image matching [51] and optical flow estimation [24]. Contrary to these methods, which estimate global geometric transformations or local displacement fields, our input images ($Q$, $Q_D$) are already spatially aligned and we seek to measure their agreement.

Training data.

In order to train the proposed network, we need appropriate annotated training data. For this, we use additional video sequences recorded for the InLoc benchmark [72], which are separate from the actual test images. We created 6DoF camera poses for the new images via manual annotation and Structure-from-Motion (SfM) (cf. the appendix). For each image, we generated pose candidates for training in two different ways.

The first approach randomly perturbs the ground-truth pose with 3D translations and rotations of bounded magnitude. We use the randomly perturbed poses to generate synthetic images by projecting the 3D point cloud from the InLoc database scan associated with that image.

The second approach uses the DensePE pipeline [72] as a way of generating realistic estimated poses for the additional images. For this, we run the images through the localization pipeline, obtaining pose estimates and the corresponding database images. Then we run synthetic image rendering on these poses and use these images for training. Note that, contrary to the randomized approach where images are generated from the correct query-database image pair, here the synthetic images might be generated from unrelated pairs. This is because the localization pipeline might fail, thereby producing “hard negatives”: examples corresponding to other similar-looking but different locations.

In both cases, for each real and synthetic image pair, both the ground-truth ($P$) and the estimated ($\hat{P}$) poses are known. In order to generate a scalar score that can be used as a training signal, we compute the mean 2D reprojection error $r$ of the 3D point cloud $\{X_i\}_{i=1}^{n}$ in image space:

$$r(Q, Q_D) = \frac{1}{n} \sum_{i=1}^{n} \left\| \rho(P, X_i) - \rho(\hat{P}, X_i) \right\|, \tag{8}$$

where $\rho$ is the 3D-2D projection function.
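As an illustration of this training signal, the sketch below computes the mean 2D reprojection error between a ground-truth and an estimated pose. It assumes a simple pinhole model with a pose given as a rotation matrix and translation vector mapping world to camera coordinates, and it assumes all points lie in front of both cameras; the function names and the `focal` and `center` parameters are hypothetical.

```python
import math

def project(pose, point, focal=1.0, center=(0.0, 0.0)):
    """Project a 3D point with a pinhole camera (assumed projection model).

    pose: (R, t) with R a 3x3 rotation (nested lists) and t a 3-vector that map
        world coordinates to camera coordinates.
    """
    rotation, translation = pose
    xc = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
          for i in range(3)]
    # Perspective division; the point is assumed to lie in front of the camera.
    return (focal * xc[0] / xc[2] + center[0], focal * xc[1] / xc[2] + center[1])

def mean_reprojection_error(pose_gt, pose_est, points, focal=1.0, center=(0.0, 0.0)):
    """Mean image-space distance between projections under the two poses."""
    errors = [
        math.dist(project(pose_gt, p, focal, center), project(pose_est, p, focal, center))
        for p in points
    ]
    return sum(errors) / len(errors)
```

A perfect pose estimate yields an error of zero, and the error grows with both translational and rotational deviations, weighted by how the scene points are distributed in depth.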

Training loss.

A suitable loss is needed in order to train the above network for pose verification. Given a query image $Q$ and a set of candidate synthesized query images $\{Q_{D_i}\}_{i=1}^{N}$, we would like to re-rank the candidates in the order given by the average reprojection errors (cf. Eq. 8).

In order to do this, we assume that each synthetic image $Q_{D_i}$ has an associated discrete probability $p(Q, Q_{D_i})$ of corresponding to the best matching pose for the query image $Q$ among the $N$ candidates. This probability should be inversely related to the reprojection error from Eq. 8, such that a pose with a high reprojection error has little probability of being the best match. Then, the scores $s_i$ produced by our trained pose verification CNN for the candidates $Q_{D_i}$ can be used to model an estimate $\hat{p}$ of this probability as

$$\hat{p}(Q, Q_{D_i}) = \frac{\exp(s_i)}{\sum_{k=1}^{N} \exp(s_k)}. \tag{9}$$

To define the ground-truth probability distribution $p$, we make use of the reprojection error from Eq. 8:

$$p(Q, Q_{D_i}) = \frac{\exp(-\tilde{r}_i)}{\sum_{k=1}^{N} \exp(-\tilde{r}_k)}, \tag{10}$$

where $\tilde{r}_i = r(Q, Q_{D_i}) / \min_k r(Q, Q_{D_k})$ is the relative reprojection error with respect to the minimum value within the considered candidates. The soft-max function is used to obtain a normalized probability distribution (relative reprojection errors are used to prevent saturation of the soft-max function).

The training loss $\mathcal{L}$ is defined as the cross-entropy between the ground-truth and estimated distributions $p$ and $\hat{p}$:

$$\mathcal{L} = -\sum_{i=1}^{N} p(Q, Q_{D_i}) \log \hat{p}(Q, Q_{D_i}), \tag{11}$$

where the sum is over the $N$ candidate poses.

Note that because the ground-truth score distribution $p$ is fixed, minimizing the cross-entropy between $p$ and $\hat{p}$ is equivalent to minimizing the Kullback-Leibler divergence between these two distributions. Thus, the minimum is achieved when $\hat{p}$ matches $p$ exactly. Also note that, at the optimum, the ground-truth ranking between the candidate poses is respected, as desired.
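The loss construction can be checked numerically with a small sketch. It assumes that the relative reprojection error is the ratio of each error to the smallest error among the candidates (a plain difference would leave the soft-max unchanged, since the soft-max is invariant to constant shifts) and that the smallest error is strictly positive. All function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable soft-max over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def target_distribution(reproj_errors):
    """Ground-truth distribution from reprojection errors (ratio form, assumed)."""
    r_min = min(reproj_errors)  # assumed to be strictly positive
    return softmax([-(r / r_min) for r in reproj_errors])

def cross_entropy(p, p_hat):
    """Cross-entropy between target p and estimate p_hat."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, p_hat))

def entropy(p):
    """Entropy of p; constant with respect to the network scores."""
    return -sum(pi * math.log(pi) for pi in p)

def kl_divergence(p, p_hat):
    """Kullback-Leibler divergence from p_hat to p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_hat))
```

Since the cross-entropy equals the entropy of the target plus the Kullback-Leibler divergence, and the entropy term does not depend on the network scores, both objectives share the same minimizer.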

Implementation details.

The feature extraction network is composed of a fully convolutional ResNet-18 architecture (up to the conv4-2 layer) [28], pretrained on ImageNet [23]. Its weights are kept fixed during training, as the large number of parameters would lead to overfitting on our small training sets. The score regression CNN is composed of four padded convolutional layers, each followed by a ReLU non-linearity. Each convolutional layer operates on the same fixed number of channels as input and output, except the first one, which takes the single-channel descriptor similarity map as input, and the last one, which outputs a single-channel tensor. Finally, an average pooling layer is used to obtain the final score estimate. The score regression CNN is trained for 10 epochs using the PyTorch framework [48] with the Adam optimizer.

5 Experimental Evaluation

Dataset. We evaluate our approach in the context of indoor visual localization on the recently proposed InLoc dataset [72]. The dataset is based on the 3D laser scan model from [79] and depicts multiple floors in multiple university buildings. The 10k database images correspond to a set of perspective images created from RGB-D panoramic scans captured using a camera mounted on a laser scanner, i.e., a depth map is available for each database image. The 329 query images were recorded using an iPhone7 about a year after the database images and at different times of the day compared to the database images. The resulting changes in scene appearance between the query and database images make the dataset significantly more challenging than other indoor datasets such as 7 Scenes [65].

Evaluation measure. Following [72, 57], we measure the error of an estimated pose as the difference in position and orientation from the reference pose provided by the dataset. We report the percentage of query images whose poses differ by no more than $X$ meters and $Y$ degrees from the reference pose for different pairs of thresholds $(X, Y)$.
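The evaluation measure reduces to counting queries whose position and orientation errors both fall below a threshold pair. The helper below is a hypothetical sketch of that computation; the error values in the usage note are made up for illustration and are not results from the dataset.

```python
def localized_percentage(errors, max_position_m, max_rotation_deg):
    """Percentage of queries within both error thresholds.

    errors: list of (position_error_in_meters, rotation_error_in_degrees),
        one entry per query image.
    """
    hits = sum(
        1 for position, rotation in errors
        if position <= max_position_m and rotation <= max_rotation_deg
    )
    return 100.0 * hits / len(errors)
```

For example, with the made-up errors `[(0.10, 2.0), (0.40, 4.0), (0.40, 8.0), (3.00, 1.0)]`, one of four queries satisfies the pair (0.25 m, 5 degrees), whereas three of four satisfy (1.00 m, 10 degrees).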

Baselines. Our approach is based on the InLoc approach [72], which is the current state of the art in large-scale indoor localization and thus serves as our main baseline (cf. Sec. 3.1). We build on top of the code released by the authors of [72]. For a given input image, we first retrieve the top 100 database images via NetVLAD with a VGG-16 [67] model pre-trained on Pitts30K [1]. Feature matching is then performed between the query and the retrieved images, also using the densely extracted CNN features of NetVLAD’s VGG-16 [67] architecture. After re-ranking the image list according to the number of homography-inliers, we estimate pose candidates for the top 10 best-matched images using a set of dense inlier matches and database depth information (DensePE).

For each candidate, DensePV renders the view with respect to the RGB-D panoramic scan from which the database image originated. The similarity between the original and rendered view is computed as the median of the inverse Euclidean distances between densely extracted hand-crafted features [43, 2].

As a semantic baseline, we project the database 3D points, labeled via the database image, into the query and count the number of points with consistent labels (PSC).
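A minimal sketch of such a projective consistency count is given below. It assumes a pinhole camera with intrinsics K and a world-to-camera pose (R, t), rounds projections to the nearest pixel, and ignores occlusions; these choices and all names are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

def projective_semantic_consistency(points_w, labels, R, t, K, query_labels):
    """Count projected 3D points whose label matches the query segmentation.

    points_w: (N, 3) 3D points in world coordinates.
    labels: (N,) integer class labels of the points (from the database image).
    R, t: assumed world-to-camera rotation (3x3) and translation (3,).
    K: (3, 3) pinhole intrinsics.
    query_labels: (H, W) integer label map of the query image.
    """
    H, W = query_labels.shape
    pts_c = points_w @ R.T + t
    in_front = pts_c[:, 2] > 0          # discard points behind the camera
    pts_c, labels = pts_c[in_front], labels[in_front]
    uvw = pts_c @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return int(np.sum(query_labels[v[inside], u[inside]] == labels[inside]))
```

The candidate pose with the highest count would then be selected.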

As reported in [72], both DSAC [11, 10] and PoseNet [33, 32] fail to train on the InLoc dataset. We thus do not consider them in our experiments.

| Method | [0.25, 5] | [0.50, 5] | [1.00, 10] | [2.00, 10] |
|---|---|---|---|---|
| w/o scan-graph | | | | |
| DensePE [72] | 35.0 | 46.2 | 57.1 | 61.1 |
| DensePV [72] | 38.9 | 55.6 | 69.9 | 74.2 |
| PSC | 30.4 | 44.4 | 55.9 | 58.4 |
| DensePV+S | 39.8 | 57.8 | 71.1 | 75.1 |
| DenseNV | 32.2 | 45.6 | 58.1 | 62.9 |
| DenseNV+S | 31.6 | 46.5 | 60.5 | 64.4 |
| DensePNV | 40.1 | 58.1 | 72.3 | 76.6 |
| DensePNV+S | 40.1 | 59.0 | 72.6 | 76.3 |
| w/ scan-graph | | | | |
| DensePV | 39.8 | 59.0 | 69.0 | 71.4 |
| PSC | 28.3 | 43.2 | 55.0 | 58.4 |
| DensePV+S | 41.3 | 61.7 | 71.4 | 74.2 |
| DenseNV | 34.3 | 50.5 | 62.9 | 66.6 |
| DenseNV+S | 35.9 | 51.4 | 64.4 | 68.4 |
| DensePNV | 40.4 | 60.5 | 72.9 | 75.4 |
| DensePNV+S | 41.0 | 60.5 | 72.3 | 75.1 |
| TrainPV (random) | 39.5 | 56.5 | 72.3 | 76.3 |
| TrainPV (DPE) | 39.5 | 56.8 | 72.3 | 76.3 |
| Oracle (Upper-bound) | 43.5 | 63.8 | 77.5 | 80.5 |

Column headers give the error thresholds as [meters, degrees].
Table 1: The impact of using the scan-graph for pose verification evaluated on the InLoc dataset [72]. We report the percentage of queries localized within given positional and rotational error bounds.
Figure 4: The impact of geometric and semantic information on the pose verification stage. We validate the performance of the proposed methods that consider additional geometric and semantic information on the InLoc dataset [72]. Each curve shows the percentage of the queries localized within varying distance thresholds (x–axis) and a fixed rotational error of at most degrees.
Figure 5: Typical failure cases of view synthesis using the scan-graph. Top: Synthetic images obtained during DensePV with the scan-graph, affected by (a) misalignment of the 3D scans to the floor plan, (b) sparsity of the 3D scans, and (c) intensity changes. Bottom: A typical failure case of DensePV with the scan-graph: (d) query image, (e) re-rendered query, (f) error map computed with RootSIFT.

The impact of using additional modalities. Tab. 1 and Fig. 4 compare the localization performance of the baseline pose verification methods against our novel variants proposed in Sec. 3. DenseNV and PSC perform worst, even compared to the baseline DensePE. This is not surprising as both completely ignore the visual appearance and instead focus on information that by itself is less discriminative (surface normals and semantic segmentation, respectively). On the other hand, combining geometric and / or semantic information with appearance information improves the localization performance compared to DensePV. This clearly validates our idea of using multiple modalities.

We observe the biggest improvement by using our scan-graph, which is not surprising as it reduces the number of invalid pixels and thus adds more information to the rendered images. DensePV+S using a scan-graph shows the best performance at higher accuracy levels. DensePNV using the scan-graph combines appearance and normal information and consistently shows more than a 5% performance gain compared to DensePV. Yet, DensePNV+S with the scan-graph shows less improvement compared to its single-scan variant and even performs worse for larger error thresholds. This is partially due to inaccurate depths and camera poses of the database images (cf. Fig. 5 (a–c)). There are also failures where a single scan already provides a rather complete view. Fig. 5 (d–f) shows such an example: due to weak textures, the rendering appears similar to the query image. Such failures cannot be resolved using the scan-graph.

Interestingly, simply combining all modalities does not necessarily lead to the best performance. To determine whether the modalities are simply not complementary or whether this is due to the way they are combined, we create an oracle. The oracle is computed from four of our proposed variants (DensePV [72], DensePV w/ scan-graph, DensePV+S w/ scan-graph, and DensePNV w/ scan-graph): Each variant provides a top-ranked pose and the oracle, having access to the ground truth, simply selects the pose with the smallest error. As can be seen from Tab. 1 and Fig. 4, the oracle performs clearly better than any of our proposed variants. We also observed that DenseNV+S provides better poses than the oracle (which does not use DenseNV+S) for about 9% of the queries, which could lead to further improvements. This shows that the different modalities are indeed complementary. Therefore, we attribute the diminishing returns observed for DensePNV+S to the way we combine semantic and normal information. We assume that better results could be obtained with normals and semantics if one reasons about the consistency of image regions rather than on a pixel level (as is done by using the median).
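The oracle selection can be sketched in a few lines. The text does not state how position and orientation errors are combined into a single "smallest error", so this example assumes that the position error is compared first and the orientation error is used only to break ties; the function name and data layout are likewise hypothetical.

```python
def oracle_select(top_poses):
    """Pick the variant whose top-ranked pose has the smallest error.

    top_poses: dict mapping a variant name to the (position error in meters,
    orientation error in degrees) of its top-ranked pose for one query.
    Assumption: errors are ordered by position first, orientation second.
    """
    return min(top_poses.items(), key=lambda item: (item[1][0], item[1][1]))
```

Because the oracle needs the ground-truth errors, it is only an upper bound for any selection strategy over the same variants and cannot be used at test time.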

Trainable pose verification. We next evaluate two trainable approaches (TrainPV), which are trained on randomly perturbed views (random) or on views selected based on DensePE estimation (DPE) (cf. Sec. 4). Even though both are trained using only appearance information, they are still able to use higher-level scene context as they use dense features extracted from a pre-trained fully convolutional network. Tab. 1 compares both TrainPV variants with the baselines and our hand-crafted approaches. Even though both variants use different training sets, they achieve nearly the same performance. (Footnote 6: We are considering a discrete re-ranking problem on a few candidate poses per query. As such, it is not surprising to have very similar results.) This indicates that the choice of training set is not critical in our setting. The results show that TrainPV outperforms the DensePV baseline, but not necessarily our hand-crafted variants based on multiple modalities. This result validates our idea of pose verification based on different sources of information. We tried variants of TrainPV that use multiple modalities, but did not observe further improvements.

6 Conclusion

We have presented a new pose verification approach to improve large-scale indoor camera localization, which is extremely challenging due to the existence of repetitive structures, weakly-textured scenes, and dynamically appearing/disappearing objects over time. To address these challenges, we have developed and validated multiple strategies to combine appearance, geometry, and semantics for pose verification, showing significant improvements over a current state-of-the-art indoor localization baseline. To encourage further progress on the challenging indoor localization problem, we make our code publicly available.

Acknowledgments. This work was partially supported by JSPS KAKENHI Grant Numbers 15H05313, 17H00744, 17J05908, EU-H2020 project LADIO No. 731970, ERC grant LEAP No. 336845, the CIFAR Learning in Machines & Brains program, and the EU Structural and Investment Funds, Operational Programme Research, Development and Education under the project IMPACT (reg. no. CZ). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Quadro P6000 GPU.


This appendix provides additional details that could not be included in the paper due to space constraints: Sec. A describes the construction of the image-scan graph in more detail (cf. Sec. 3.2 in the paper). Sec. B shows that avoiding reduction of the field-of-view of a camera before extracting surface normals improves performance (cf. lines 467-477 in the paper). Sec. C provides details on the construction of the “superclasses” (cf. Sec. 3.3 in the paper) and justifies the design choice made in the paper. Sec. D details the construction of the training sets used by our trainable verification approach (cf. Sec. 4 in the paper). Finally, Sec. E shows qualitative results (cf. Fig. 4 in the paper).

Appendix A Image-Scan Graph

The original InLoc dataset includes RGB-D panoramic scans and perspective RGB-D image cutouts from the scans as the database. To render more complete synthetic query images with fewer missing pixels, we construct an image-scan graph that enables us to render the synthetic images using the 3D points visible in multiple adjacent panoramic scans (cf. Sec. 3.2 in the paper). Fig. A shows how we generate the graph: For each perspective database image, we compute the visual overlap with adjacent panoramic scans by projecting their 3D point clouds into the perspective database image, while taking occlusions into account. Based on the ratio of pixels in the rendered view that correspond to 3D points in the scans, we establish edges between the perspective database image and the panoramic scans that have more than % overlap.

Figure A: Image-scan graph for the InLoc dataset [72]. For each perspective database image (e) which is cut out from the RGB-D panoramic scan (b), we compute the overlap with each adjacent scan (a, c) by projecting their 3D points into the perspective view (d, f). Red dots show where RGB-D panoramic scans (and corresponding perspective database images) are located. Blue lines indicate edges between panoramic scans and perspective database images, established based on visual overlap.
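The edge construction described above can be sketched as follows, assuming the per-pair overlap ratios have already been computed by rendering each scan into each perspective database image. The function name and data layout are hypothetical, and the overlap threshold is left as a parameter because its value is not reproduced here.

```python
import numpy as np

def build_image_scan_graph(overlap, min_overlap):
    """Connect each database image to the scans that sufficiently overlap it.

    overlap: (num_images, num_scans) array; overlap[i, j] is the fraction of
        pixels of database image i covered by projected 3D points of scan j.
    min_overlap: overlap ratio in [0, 1] that an edge must exceed (a free
        parameter of this sketch).
    Returns a dict mapping each image index to a list of connected scan indices.
    """
    overlap = np.asarray(overlap, dtype=float)
    return {
        i: [int(j) for j in np.flatnonzero(overlap[i] > min_overlap)]
        for i in range(overlap.shape[0])
    }
```

At verification time, the synthetic view for a candidate pose would then be rendered from the 3D points of all scans connected to the corresponding database image.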

Appendix B Cropping before Normal Estimation

As mentioned in Sec. 3.2 of the paper, the original Taskonomy pipeline uses images of size pixels as input when estimating surface normals. Using the original pipeline thus requires cropping and re-scaling the images to pixels. Since the cropping reduces the field-of-view and thus potentially discards useful information, we modified the pipeline to avoid cropping. Tab. A compares several of our pose verification methods that use normals (DenseNV and DensePNV) with and without cropping. As a reference, we also report results for DensePV [72], which does not use normal information. Using cropping reduces the performance in most cases, especially when only using normal information for verification (DenseNV). The results validate our design choice of preserving the field-of-view of the input images by avoiding cropping.

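| Method | [0.25, 5] | [0.50, 5] | [1.00, 10] | [2.00, 10] |
|---|---|---|---|---|
| DensePV [72] | 38.9 | 55.6 | 69.9 | 74.2 |
| DenseNV (cropped) | 29.5 | 43.5 | 54.1 | 59.6 |
| DenseNV | 32.2 | 45.6 | 58.1 | 62.9 |
| DensePNV (cropped) | 39.5 | 56.8 | 71.7 | 76.9 |
| DensePNV | 40.1 | 58.1 | 72.3 | 76.6 |

Column headers give the error thresholds as [meters, degrees].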
Table A: The impact of image cropping on pose verification, evaluated on the InLoc dataset [72]. We report the percentage of queries localized within given positional and rotational error bounds.

Appendix C Semantic Superclass Construction

Below is the full mapping of the 150 semantic classes of the ADE20K dataset [87, 86] to the five “superclasses” that we use to generate semantic masks. Each {} corresponds to one class of the CSAIL semantic segmentation model pre-trained on the MIT ADE20K dataset [87, 86], and each class is described by the labels inside the braces.

  • people: {person, individual, someone, somebody, mortal, soul}

  • transient: {plant, flora, plant, life}, {curtain, drape, drapery, mantle, pall}, {chair}, {mirror}, {rug, carpet, carpeting}, {armchair}, {seat}, {desk}, {lamp}, {cushion}, {base, pedestal, stand}, {box}, {grandstand, covered, stand}, {case, display, case, showcase, vitrine}, {pillow}, {screen, door, screen}, {flower}, {book}, {computer, computing, machine, computing, device, data, processor, electronic, computer, information, processing, system}, {swivel, chair}, {hovel, hut, hutch, shack, shanty}, {towel}, {apparel, wearing, apparel, dress, clothes}, {ottoman, pouf, pouffe, puff, hassock}, {bottle}, {plaything, toy}, {stool}, {barrel, cask}, {basket, handbasket}, {bag}, {cradle}, {ball}, {food, solid, food}, {trade, name, brand, name, brand, marque}, {pot, flowerpot}, {animal, animate, being, beast, brute, creature, fauna}, {bicycle, bike, wheel, cycle}, {screen, silver, screen, projection, screen}, {blanket, cover}, {sconce}, {vase}, {tray}, {ashcan, trash, can, garbage, can, wastebin, ash, bin, ash-bin, ashbin, dustbin, trash, barrel, trash, bin}, {fan}, {plate}, {monitor, monitoring, device}, {radiator}, {glass, drinking, glass}

  • stable: {bed}, {cabinet}, {table}, {painting, picture}, {sofa, couch, lounge}, {shelf}, {wardrobe, closet, press}, {bathtub, bathing, tub, bath, tub}, {chest, of, drawers, chest, bureau, dresser}, {refrigerator, icebox}, {pool, table, billiard, table, snooker, table}, {bookcase}, {coffee, table, cocktail, table}, {bench}, {countertop}, {stove, kitchen, stove, range, kitchen, range, cooking, stove}, {arcade, machine}, {television, television, receiver, television, set, tv, tv, set, idiot, box, boob, tube, telly, goggle, box}, {poster, posting, placard, notice, bill, card}, {canopy}, {washer, automatic, washer, washing, machine}, {oven}, {microwave, microwave, oven}, {dishwasher, dish, washer, dishwashing, machine}, {sculpture}, {shower}, {clock}

  • fixed: {wall}, {floor, flooring}, {ceiling}, {windowpane, window}, {door, double, door}, {railing, rail}, {column, pillar}, {sink}, {fireplace, hearth, open, fireplace}, {stairs, steps}, {stairway, staircase}, {toilet, can, commode, crapper, pot, potty, stool, throne}, {chandelier, pendant, pendent}, {bannister, banister, balustrade, balusters, handrail}, {escalator, moving, staircase, moving, stairway}, {buffet, counter, sideboard}, {stage}, {conveyer, belt, conveyor, belt, conveyer, conveyor, transporter}, {swimming, pool, swimming, bath, natatorium}, {step, stair}, {bulletin, board, notice, board}

  • outdoor: {building, edifice}, {sky}, {tree}, {road, route}, {grass}, {sidewalk, pavement}, {earth, ground}, {mountain, mount}, {car, auto, automobile, machine, motorcar}, {water}, {house}, {sea}, {field}, {fence, fencing}, {rock, stone}, {signboard, sign}, {counter}, {sand}, {skyscraper}, {path}, {runway}, {river}, {bridge, span}, {blind, screen}, {hill}, {palm, palm, tree}, {kitchen, island}, {boat}, {bar}, {bus, autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger, vehicle}, {light, light, source}, {truck, motortruck}, {tower}, {awning, sunshade, sunblind}, {streetlight, street, lamp}, {booth, cubicle, stall, kiosk}, {airplane, aeroplane, plane}, {dirt, track}, {pole}, {land, ground, soil}, {van}, {ship}, {fountain}, {waterfall, falls}, {tent, collapsible, shelter}, {minibike, motorbike}, {tank, storage, tank}, {lake}, {hood, exhaust, hood}, {traffic, light, traffic, signal, stoplight}, {pier, wharf, wharfage, dock}, {crt, screen}, {flag}

As detailed in lines 518-521 in the paper, we construct semantic masks by ignoring pixels belonging to the people and transient superclasses. This design choice was motivated by preliminary experiments with different ways to use semantic information. More precisely, we evaluated three variants of DensePV+S with semantic masks generated by the criteria listed below:


(A) We keep regions corresponding to the stable and fixed superclasses as informative and discard regions assigned to the other superclasses.

(B) We consider regions assigned to the superclass people as non-informative and regard all other regions as informative.

(C) We determine regions corresponding to the people and transient superclasses as non-informative and regard all other regions as informative.
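The three masking criteria can be written as a lookup from superclass to "informative", as in the minimal sketch below. It assumes that every pixel has already been mapped to one of the five superclasses and stored as an integer index; the labels A, B, and C follow the order in which the criteria are listed, and all names are hypothetical.

```python
import numpy as np

# Assumed integer encoding of the five superclasses in the per-pixel map.
SUPERCLASSES = ("people", "transient", "stable", "fixed", "outdoor")

# Superclasses regarded as informative under each criterion.
INFORMATIVE = {
    "A": ("stable", "fixed"),
    "B": ("transient", "stable", "fixed", "outdoor"),
    "C": ("stable", "fixed", "outdoor"),
}

def semantic_mask(superclass_map, variant="C"):
    """Boolean mask that is True for pixels kept during pose verification.

    superclass_map: (H, W) integer array of indices into SUPERCLASSES.
    variant: one of "A", "B", "C".
    """
    keep = [SUPERCLASSES.index(name) for name in INFORMATIVE[variant]]
    return np.isin(superclass_map, keep)
```

Such a mask could then be combined with the mask of validly rendered pixels before the median descriptor distance is computed, so that masked regions do not influence the score.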

Tab. B shows the comparison of DensePV [72] and DensePV+S with each type of semantic masking. All variations of DensePV+S considerably outperform the baseline. The best results are obtained with DensePV+S (C), which is the variant used in the paper.

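| Method | [0.25, 5] | [0.50, 5] | [1.00, 10] | [2.00, 10] |
|---|---|---|---|---|
| DensePV [72] | 38.9 | 55.6 | 69.9 | 74.2 |
| DensePV+S (A) | 39.8 | 57.4 | 71.1 | 75.1 |
| DensePV+S (B) | 39.2 | 56.2 | 70.5 | 75.1 |
| DensePV+S (C) | 39.8 | 57.8 | 71.1 | 75.1 |

Column headers give the error thresholds as [meters, degrees].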
Table B: The impact of semantic masks, evaluated on the InLoc dataset [72]. We report the percentage of queries localized within given positional and rotational error bounds.

Appendix D Training Data Generation

To train our learnable pose verification (cf. Sec. 4), we use additional video sequences kindly provided by the authors of [72], which were captured by them separately from the test images of the InLoc dataset. The images were captured using iPhone7 video streams in the same building as the InLoc dataset. In order to use the images for training, we created 6DoF ground-truth poses for them and used these poses to generate pose candidates for training. Fig. B shows the spatial distributions of the training images that we generated and manually verified. Note that there is little overlap between the original queries [72] and our training images, both on the first floor (a) and the second floor (b) of the building.

The ground-truth poses are computed as follows: 1) From the original video sequences, we pick one frame every four seconds (key frames) and generate manually verified 6DoF camera poses for them in a similar manner as for the original InLoc dataset [72]. 2) We additionally reconstruct the video frames adjacent to each key frame using Structure-from-Motion (SfM) [61]. In the bundle adjustment step, we fix the 3D points that originate from key-frame features with depth information with respect to the database scans, which enables us to recover the scale of the SfM reconstruction. 3) We visually inspect all poses and manually discard the images with incorrect poses. We also verify the reference poses by computing the overlap ratio with the relevant database scan with respect to the depth. We finally accepted 3,442 images that have more than % overlap. Of these, the 2,600 images in DUC1 (first floor) are used for training and the 842 images in DUC2 (second floor) are used for validation.

(a) First floor.
(b) Second floor.
Figure B: Spatial distributions of the training images. The orange dots in the figures show the camera positions of the training images, which we estimated and manually verified. The blue dots correspond to the positions of the original InLoc queries. Gray dots are the scanned 3D points of the InLoc dataset, showing the structure of the building.

Appendix E Qualitative Results

Fig. C shows example localization results obtained by various methods on the InLoc dataset [72]. Fig. C (a) is an example on which the original DensePV [72] selects an inaccurate pose estimate, while our methods succeed when using the image-scan graph. Views rendered using only a single scan often cover only a part of the query view, which results in inaccurate pose verification, i.e., DensePV selects a query pose behind the wall. The image-scan graph enables us to use 3D points seen from the multiple scans related to the query. This results in a more complete synthesized image from a more accurate pose estimate, which is subsequently chosen by our approach.

Fig. C (b) is a typical scene on which DensePV with the scan-graph fails, while DensePV+S succeeds in accurately localizing the query. In this case, the query image is dominated by transient objects (shutter blinds) and people, which do not appear in the database images. Pose verification methods using only 3D structures (DensePV [72], DensePV w/ scan-graph) fail to achieve accurate localization in such scenes. DensePV+S discards the less-informative regions in the image based on semantic labels, which improves results. Using normals instead of semantic information has a similar effect in this scene.

The effectiveness of measuring surface normal consistency is shown in Fig. C (c). The query image shows a significant amount of weakly textured surfaces and regions of over-saturated pixels. Appearance-based pose verification methods (DensePV [72], DensePV w/ scan-graph) and our semantic-based DensePV+S approach fail to select an accurate pose candidate, since the scene appearance has largely changed between the query and the retrieved database image. On the other hand, DensePNV additionally compares surface normal directions, which provide useful information for this challenging query photo.

The benefit of combining semantics with surface normal consistency is shown in Fig. C (d). In the query, there are a number of transient objects, e.g., chairs, movable tables, and people. This results in an inaccurate pose being selected by DensePNV since it directly computes surface normals even on such inconsistent objects. Using a semantic mask, DensePNV+S achieves better pose selection, ignoring those less informative regions.

| Query | DensePV [72] | DensePV w/ scan-graph | DensePV+S | DensePNV | DensePNV+S |
|---|---|---|---|---|---|
| (a) | 2.99 m, 17.64° | 0.42 m, 1.25° | 0.42 m, 1.25° | 0.42 m, 1.25° | 0.42 m, 1.25° |
| (b) | 12.39 m, 26.71° | 12.39 m, 26.71° | 0.40 m, 2.37° | 0.40 m, 2.37° | 0.40 m, 2.37° |
| (c) | 4.70 m, 88.15° | 14.29 m, 9.02° | 56.85 m, 170.5° | 0.93 m, 2.69° | 0.93 m, 2.69° |
| (d) | 0.60 m, 1.38° | 38.00 m, 107.87° | 38.00 m, 107.87° | 15.14 m, 163.03° | 0.60 m, 1.38° |
Figure C: Qualitative examples of visual localization on the InLoc dataset [72]. Each row in the figure shows the query image (left) and the rendered views corresponding to the camera poses selected by different methods. The numbers under the synthesized images indicate the position and orientation errors with respect to the ground-truth poses. The scan-graph was used for the methods shown in columns 3 to 6.


  • [1] Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proc. CVPR, 2016.
  • [2] Relja Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. In Proc. CVPR, 2012.
  • [3] Relja Arandjelović and Andrew Zisserman. All about VLAD. In Proc. CVPR, 2013.
  • [4] Relja Arandjelović and Andrew Zisserman. Dislocation: Scalable descriptor distinctiveness for location recognition. In Proc. ACCV, 2014.
  • [5] Relja Arandjelović and Andrew Zisserman. Visual vocabulary with a semantic twist. In Proc. ACCV, 2014.
  • [6] Shervin Ardeshir, Amir Roshan Zamir, Alejandro Torroella, and Mubarak Shah. GIS-Assisted Object Detection and Geospatial Localization. In Proc. ECCV, 2014.
  • [7] Nikolay Atanasov, Menglong Zhu, Kostas Daniilidis, and George J. Pappas. Localization from semantic observations via the matrix permanent. Intl. J. of Robotics Research, 35(1-3):73–99, 2016.
  • [8] Mathieu Aubry, Bryan C. Russell, and Josef Sivic. Painting-to-3D Model Alignment via Discriminative Visual Elements. ACM Trans. Graph., 33(2):14:1–14:14, Apr 2014.
  • [9] Vassileios Balntas, Shuda Li, and Victor Adrian Prisacariu. RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. In Proc. ECCV, 2018.
  • [10] Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. DSAC - Differentiable RANSAC for Camera Localization. In Proc. CVPR, 2017.
  • [11] Eric Brachmann and Carsten Rother. Learning less is more-6D camera localization via 3d surface regression. In Proc. CVPR, 2018.
  • [12] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-Aware Learning of Maps for Camera Localization. In Proc. CVPR, 2018.
  • [13] Song Cao and Noah Snavely. Graph-based discriminative learning for location recognition. In Proc. CVPR, 2013.
  • [14] Song Cao and Noah Snavely. Minimal scene descriptions from structure from motion models. In Proc. CVPR, 2014.
  • [15] Robert Castle, Georg Klein, and David W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In ISWC, 2008.
  • [16] Tommaso Cavallari, Stuart Golodetz, Nicholas A. Lord, Julien Valentin, Luigi Di Stefano, and Philip H. S. Torr. On-The-Fly Adaptation of Regression Forests for Online Camera Relocalisation. In Proc. CVPR, 2017.
  • [17] David M. Chen, Georges Baatz, Kevin Köser, Sam S Tsai, Ramakrishna Vedantham, Timo Pylvänäinen, Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. City-scale landmark identification on mobile devices. In Proc. CVPR, 2011.
  • [18] Ondřej Chum and Jiří Matas. Matching with PROSAC-progressive sample consensus. In Proc. CVPR, 2005.
  • [19] Ondřej Chum and Jiří Matas. Optimal randomized RANSAC. IEEE PAMI, 30(8):1472–1482, 2008.
  • [20] Ondřej Chum, Andrej Mikulik, Michal Perdoch, and Jiří Matas. Total recall II: Query expansion revisited. In Proc. CVPR, 2011.
  • [21] Andrea Cohen, Torsten Sattler, and Marc Pollefeys. Merging the Unmatchable: Stitching Visually Disconnected SfM Models. In Proc. ICCV, 2015.
  • [22] Andrea Cohen, Johannes Lutz Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. Indoor-Outdoor 3D Reconstruction Alignment. In Proc. ECCV, 2016.
  • [23] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
  • [24] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proc. ICCV, pages 2758–2766, 2015.
  • [25] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24(6):381–395, 1981.
  • [26] Petr Gronat, Guillaume Obozinski, Josef Sivic, and Tomas Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In Proc. CVPR, 2013.
  • [27] Bert M. Haralick, Chung-Nan Lee, Karsten Ottenberg, and Michael Nölle. Review and analysis of solutions of the three point perspective pose estimation problem. IJCV, 13(3):331–356, 1994.
  • [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
  • [29] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and Horst Bischof. From Structure-from-Motion point clouds to fast location recognition. In Proc. CVPR, 2009.
  • [30] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proc. ECCV, 2008.
  • [31] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Packing bag-of-features. In Proc. ICCV, 2009.
  • [32] Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proc. CVPR, 2017.
  • [33] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proc. ICCV, 2015.
  • [34] Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Learned contextual feature reweighting for image geo-localization. In Proc. CVPR, 2017.
  • [35] L. Kneip, D. Scaramuzza, and R. Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In Proc. CVPR, 2011.
  • [36] Jan Knopp, Josef Sivic, and Tomas Pajdla. Avoiding confusing features in place recognition. In Proc. ECCV, 2010.
  • [37] Nikolay Kobyshev, Hayko Riemenschneider, and Luc Van Gool. Matching Features Correctly through Semantic Understanding. In Proc. 3DV, 2014.
  • [38] Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Real-Time Solution to the Absolute Pose Problem with Unknown Radial Distortion and Focal Length. In Proc. ICCV, 2013.
  • [39] Yunpeng Li, Noah Snavely, and Daniel P. Huttenlocher. Location recognition using prioritized feature matching. In Proc. ECCV, 2010.
  • [40] Yunpeng Li, Noah Snavely, Daniel P. Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3d point clouds. In Proc. ECCV, 2012.
  • [41] Hyon Lim, Sudipta N. Sinha, Michael F. Cohen, and Matthew Uyttendaele. Real-time image-based 6-DOF localization in large-scale environments. In Proc. CVPR, 2012.
  • [42] Liu Liu, Hongdong Li, and Yuchao Dai. Efficient Global 2D-3D Matching for Camera Localization in a Large-Scale 3D Map. In Proc. ICCV, 2017.
  • [43] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
  • [44] Simon Lynen, Torsten Sattler, Michael Bosse, Joel A. Hesch, Marc Pollefeys, and Roland Siegwart. Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization. In Proc. RSS, 2015.
  • [45] Daniela Massiceti, Alexander Krull, Eric Brachmann, Carsten Rother, and Philip H.S. Torr. Random Forests versus Neural Networks - What’s Best for Camera Relocalization? In Proc. Intl. Conf. on Robotics and Automation, 2017.
  • [46] Lili Meng, Jianhui Chen, Frederick Tung, James J. Little, Julien Valentin, and Clarence W. de Silva. Backtracking Regression Forests for Accurate Camera Relocalization. In Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, 2017.
  • [47] Lili Meng, Frederick Tung, James J. Little, Julien Valentin, and Clarence W. de Silva. Exploiting Points and Lines in Regression Forests for RGB-D Camera Relocalization. In Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, 2018.
  • [48] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [49] James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007.
  • [50] Noha Radwan, Abhinav Valada, and Wolfram Burgard. VLocNet++: Deep multitask learning for semantic visual localization and odometry. IEEE Robotics And Automation Letters (RA-L), 3(4):4407–4414, 2018.
  • [51] Ignacio Rocco, Relja Arandjelović, and Josef Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.
  • [52] Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H. J. Kelly, and Andrew J. Davison. SLAM++: Simultaneous Localisation and Mapping at the Level of Objects. In Proc. CVPR, 2013.
  • [53] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proc. CVPR, 2019.
  • [54] Torsten Sattler, Michal Havlena, Filip Radenovic, Konrad Schindler, and Marc Pollefeys. Hyperpoints and fine vocabularies for large-scale location recognition. In Proc. ICCV, 2015.
  • [55] Torsten Sattler, Michal Havlena, Konrad Schindler, and Marc Pollefeys. Large-scale location recognition and the geometric burstiness problem. In Proc. CVPR, 2016.
  • [56] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE PAMI, 39(9):1744–1756, 2017.
  • [57] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6DOF outdoor visual localization in changing conditions. In Proc. CVPR, 2018.
  • [58] Torsten Sattler, Akihiko Torii, Josef Sivic, Marc Pollefeys, Hajime Taira, Masatoshi Okutomi, and Tomas Pajdla. Are large-scale 3D models really necessary for accurate visual localization? In Proc. CVPR, 2017.
  • [59] Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixé. Understanding the Limitations of CNN-based Absolute Camera Pose Regression. In Proc. CVPR, 2019.
  • [60] Grant Schindler, Matthew Brown, and Richard Szeliski. City-Scale Location Recognition. In Proc. CVPR, 2007.
  • [61] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In Proc. CVPR, 2016.
  • [62] Johannes Lutz Schönberger, Marc Pollefeys, Andreas Geiger, and Torsten Sattler. Semantic Visual Localization. In Proc. CVPR, 2018.
  • [63] Markus Schreiber, Carsten Knöppel, and Uwe Franke. LaneLoc: Lane marking based localization using highly accurate maps. In Proc. IV, 2013.
  • [64] Qi Shan, Changchang Wu, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M. Seitz. Accurate Geo-Registration by Ground-to-Aerial Image Matching. In Proc. 3DV, 2014.
  • [65] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proc. CVPR, 2013.
  • [66] Dominik Sibbing, Torsten Sattler, Bastian Leibe, and Leif Kobbelt. SIFT-realistic rendering. In Proc. 3DV, 2013.
  • [67] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
  • [68] Gautam Singh and Jana Košecká. Semantically Guided Geo-location and Modeling in Urban Environments. In Large-Scale Visual Geo-Localization, 2016.
  • [69] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
  • [70] Erik Stenborg, Carl Toft, and Lars Hammarstrand. Long-term Visual Localization using Semantically Segmented Images. In Proc. ICRA, 2018.
  • [71] Linus Svärm, Olof Enqvist, Fredrik Kahl, and Magnus Oskarsson. City-Scale Localization for Cameras with Known Vertical Direction. IEEE PAMI, 39(7):1455–1461, 2017.
  • [72] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In Proc. CVPR, 2018.
  • [73] Carl Toft, Carl Olsson, and Fredrik Kahl. Long-term 3D Localization and Pose from Semantic Labellings. In Proc. ICCV Workshops, 2017.
  • [74] Carl Toft, Erik Stenborg, Lars Hammarstrand, Lucas Brynte, Marc Pollefeys, Torsten Sattler, and Fredrik Kahl. Semantic Match Consistency for Long-Term Visual Localization. In Proc. ECCV, 2018.
  • [75] Giorgos Tolias and Hervé Jégou. Local visual query expansion: Exploiting an image collection to refine local descriptors. Technical Report RR-8325, INRIA, 2013.
  • [76] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proc. CVPR, 2015.
  • [77] Akihiko Torii, Josef Sivic, Tomas Pajdla, and Masatoshi Okutomi. Visual place recognition with repetitive structures. In Proc. CVPR, 2013.
  • [78] Florian Walch, Caner Hazirbas, Laura Leal-Taixé, Torsten Sattler, Sebastian Hilsenbeck, and Daniel Cremers. Image-Based Localization Using LSTMs for Structured Feature Correlation. In Proc. ICCV, 2017.
  • [79] Erik Wijmans and Yasutaka Furukawa. Exploiting 2D floorplan for building-scale panorama RGBD alignment. In Proc. CVPR, 2017.
  • [80] Fisher Yu, Jianxiong Xiao, and Thomas A. Funkhouser. Semantic alignment of LiDAR data at city scale. In Proc. CVPR, 2015.
  • [81] X. Yu, S. Chaturvedi, C. Feng, Y. Taguchi, T.-Y. Lee, C. Fernandes, and S. Ramalingam. VLASE: Vehicle Localization by Aggregating Semantic Edges. In Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, 2018.
  • [82] Amir Roshan Zamir, Alexander Sax, William B. Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proc. CVPR, 2018.
  • [83] Amir Roshan Zamir and Mubarak Shah. Accurate image localization based on Google Maps Street View. In Proc. ECCV, 2010.
  • [84] Bernhard Zeisl, Torsten Sattler, and Marc Pollefeys. Camera Pose Voting for Large-Scale Image-Based Localization. In Proc. ICCV, 2015.
  • [85] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proc. CVPR, 2017.
  • [86] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proc. CVPR, 2017.
  • [87] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 127(3):302–321, 2019.