Image-based localization using LSTMs for structured feature correlation
In this work we propose a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes. CNNs allow us to learn suitable feature representations for localization that are robust against motion blur and illumination changes. We make use of LSTM units on the CNN output, which play the role of a structured dimensionality reduction on the feature vector, leading to drastic improvements in localization performance. We provide an extensive quantitative comparison of CNN-based and SIFT-based localization methods, showing the weaknesses and strengths of each. Furthermore, we present a new large-scale indoor dataset with accurate ground truth from a laser scanner. Experimental results on both indoor and outdoor public datasets show that our method outperforms existing deep architectures and can localize images in hard conditions, e.g., in the presence of mostly textureless surfaces, where classic SIFT-based methods fail.
Being able to localize a vehicle or device by estimating a camera pose from an image is a fundamental requirement for many computer vision applications such as navigating autonomous vehicles, mobile robotics and Augmented Reality, and Structure-from-Motion (SfM).
Most state-of-the-art approaches [33, 52, 64, 46] rely on local features such as SIFT to solve the problem of image-based localization. Given an SfM model of a scene, where each 3D point is associated with the image features from which it was triangulated, one proceeds in two stages [34, 45]: (i) establishing 2D-3D matches between features extracted from the query image and 3D points in the SfM model via descriptor matching; (ii) using these correspondences to determine the camera pose, usually by employing an $n$-point solver inside a RANSAC loop. Pose estimation can only succeed if enough correct matches have been found in the first stage. Consequently, limitations of either the feature detector, e.g., under motion blur or strong illumination changes, or the descriptor, e.g., under strong viewpoint changes, will cause localization approaches to fail.
Recently, two approaches have tackled the problem of localization with end-to-end learning. PlaNet formulates localization as a classification problem, where the current position is matched to the best position in the training set. While this approach is suitable for localization in extremely large environments, it only recovers position, not orientation, and its accuracy is bounded by the spatial extent of the training samples. More similar in spirit to our approach, PoseNet [26, 25] formulates 6DoF pose estimation as a regression problem. In this paper, we show that PoseNet is significantly less accurate than state-of-the-art SIFT methods [33, 52, 64, 46] and propose a novel network architecture that significantly outperforms PoseNet.
In this paper, we propose to directly regress the camera pose from an input image. To do so, we leverage Convolutional Neural Networks (CNNs), which allow us to learn suitable feature representations for localization that are more robust against motion blur and illumination changes. As we can see from the PoseNet results, regressing the pose after the high dimensional output of a fully connected (FC) layer is not optimal. Our intuition is that the high dimensionality of the FC output makes the network prone to overfitting to training data. PoseNet deals with this problem with careful dropout strategies. We propose to make use of Long Short-Term Memory (LSTM) units on the FC output, which perform structured dimensionality reduction and choose the most useful feature correlations for the task of pose estimation. Overall, we improve localization accuracy by 32-37% wrt. previous deep learning architectures [26, 25]. Furthermore, we are the first to provide an extensive comparison with a state-of-the-art SIFT-based method, which sheds light on the strengths and weaknesses of each approach. Finally, we introduce a new dataset for large-scale indoor localization, consisting of 1,095 high resolution images covering a total area of 5,575 m². Each image is associated with ground truth pose information. We show that this sequence cannot be handled with SIFT-based methods, as it contains large textureless areas and repetitive structures. In contrast, our approach robustly handles this scenario and localizes images on average within 1.31m of their ground truth location.
To summarize, our contribution is three-fold: (i) we propose a new CNN+LSTM architecture for camera pose regression in indoor and outdoor scenes. Our approach significantly outperforms previous work on CNN-based localization [26, 25]. (ii) we provide the first extensive quantitative comparison of CNN-based and SIFT-based localization methods. We show that classic SIFT-based methods still outperform all CNN-based methods by a large margin on existing benchmark datasets. (iii) we introduce TUM-LSI (dataset available at https://tum-lsi.vision.cs.tum.edu), a new challenging large indoor dataset exhibiting repetitive structures and weakly textured surfaces, and provide accurate ground truth poses. We show that CNN-based methods can handle such a challenging scenario while SIFT-based methods fail completely. Thus, we are the first to demonstrate the usefulness of CNN-based methods in practice.
1.2 Related work
Local feature-based localization. There are two traditional ways to approach the localization problem. Location recognition methods represent a scene by a database of geo-tagged photos. Given a query image, they employ image retrieval techniques to identify the database photo most similar to the query [55, 54, 2, 44, 63]. The geo-tag of the retrieved image is often used to approximate the camera pose of the query, even though a more accurate estimate can be obtained by retrieving multiple relevant images [62, 65, 47].
More relevant to our approach are structure-based localization techniques that use a 3D model, usually obtained from Structure-from-Motion, to represent a scene. They determine the full 6DoF camera pose of a query photo from a set of 2D-3D correspondences established via matching features found in the query against descriptors associated with the 3D points. The computational complexity of matching grows with the size of the model. Thus, prioritized search approaches [34, 10, 46] terminate correspondence search as soon as a fixed number of matches has been found. Similarly, descriptor matching can be accelerated by using only a subset of all 3D points [34, 9], which at the same time reduces the memory footprint of the 3D models. The latter can also be achieved by quantizing the descriptors [38, 43].
For more complex scenes, e.g., large-scale urban environments or even large collections of landmark scenes, 2D-3D matches are usually less unique as there often are multiple 3D points with similar local appearance. This causes problems for the pose estimation stage, as accepting more matches leads to more wrong matches and RANSAC’s run-time grows exponentially with the ratio of wrong matches. Consequently, Sattler et al. use co-visibility information between 3D points to filter out wrong matches before pose estimation [46, 43]. Similarly, Li et al. use co-visibility information to adapt RANSAC’s sampling strategy, enabling them to avoid drawing samples unlikely to lead to a correct pose estimate. Assuming that the gravity direction and a rough prior on the camera’s height are known, Svärm et al. propose an outlier filtering step whose run-time does not depend on the inlier ratio. Zeisl et al. adapt this approach into a voting scheme, reducing the computational complexity of outlier filtering with respect to the number of matches.
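The sensitivity of RANSAC to wrong matches can be illustrated with the standard stopping criterion. The following is a minimal sketch, not taken from any of the cited methods: the function name, the 99% confidence level, and the example inlier ratios are illustrative assumptions, and a minimal sample of three matches is assumed.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int = 3,
                      confidence: float = 0.99) -> int:
    """Samples needed so that, with probability `confidence`, at least one
    minimal sample contains only inliers (standard RANSAC bound)."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    p_good = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))


# Illustrative inlier ratios: the required number of samples rises sharply
# as the share of wrong matches increases.
for ratio in (0.5, 0.2, 0.1, 0.05):
    print(f"inlier ratio {ratio:.2f}: {ransac_iterations(ratio)} samples")
```

Filtering wrong matches before pose estimation raises the inlier ratio and therefore shrinks this bound, which is the motivation behind the co-visibility and voting strategies above.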
The overall run-time of classical localization approaches depends on the number of features found in a query image, the number of 3D points in the model, and the number of found correspondences and/or the percentage of correct matches. In contrast, our approach directly regresses the camera pose from a single feed-forward pass through a network. As such, the run-time of our approach only depends on the size of the network used.
As we will show, SIFT-based methods do not work for our new challenging indoor LSI dataset due to repetitive structures and large textureless regions present in indoor scenes. This further motivates the use of alternative approaches based, e.g., on deep learning.
Localization utilizing machine learning. In order to boost location recognition performance, Gronat et al. and Cao & Snavely learn linear classifiers on top of standard bag-of-words vectors [8, 20]. They divide the database into distinct places and train classifiers to distinguish between them.
Donoser & Schmalstieg cast feature matching as a classification problem, where the descriptors associated with each 3D model point form a single class. They employ an ensemble of random ferns to efficiently compute matches.
Aubry et al. learn feature descriptors specifically for the task of localizing paintings against 3D scene models.
In the context of re-localization for RGB-D images, Guzman-Rivera et al. and Shotton et al. learn random forests that predict a 3D point position for each pixel in an image [49, 21]. The resulting 2D-3D matches are then used to estimate the camera pose using RANSAC. Rather than predicting point correspondences, Valentin et al. explicitly model the uncertainty of the predicted 3D point positions and use this uncertainty during pose estimation, allowing them to localize more images. Brachmann et al. adapt the random forest-based approach to not rely on depth measurements during test time. Still, they require depth data during the training stage in order to predict 3D coordinates for each pixel. In contrast, our approach directly regresses the camera pose from an RGB image, and thus only needs a set of image-pose pairs as input for training.
Deep learning. CNNs have been successfully applied to most tasks in computer vision since their major success in image classification [30, 50, 23] and object detection [17, 16, 41]. One of the major drawbacks of deep learning is its need for large datasets for training. A common approach used for many tasks is to fine-tune deep architectures pre-trained on the seemingly unrelated task of image classification on ImageNet. This has been successfully applied, among others, to object detection, object segmentation [39, 29], semantic segmentation [22, 40], and depth and normal estimation. Similarly, we take pre-trained networks, e.g. GoogLeNet, which can be seen as feature extractors, and then fine-tune them for the task of camera pose regression.
LSTM is a type of Recurrent Neural Network (RNN) designed to accumulate or forget relevant contextual information in its hidden state. It has been successfully applied to handwriting recognition and, in natural language processing, to machine translation. Recently, CNNs and LSTMs have been combined in the computer vision community to tackle, for example, visual recognition in videos. While most methods in the literature apply LSTMs to a temporal sequence, recent works have started to use the memory capabilities of LSTMs to encode contextual information. ReNet replaced convolutions by RNNs sweeping the image vertically and horizontally. Spatial LSTMs have been used for person re-identification, parsing the detection bounding box horizontally and vertically in order to capture spatial dependencies between body parts. The same idea has been employed for semantic segmentation and for semantic object parsing. We use LSTMs to better correlate features coming out of the convolutional and FC layers, efficiently reducing feature dimensionality in a structured way that improves pose estimation compared to using dropout on the feature vector to prevent overfitting. A similar approach was simultaneously proposed for object recognition, where LSTMs are used to obtain contextual information.
End-to-end learning has also been used for localization and location recognition. DSAC proposes a differentiable RANSAC so that a matching function that optimizes pose quality can be learned. Arandjelović et al. employ CNNs to learn compact image representations, where each image in a database is represented by a single descriptor. Weyand et al. cast localization as a classification problem. They adaptively subdivide the earth’s surface into a set of tiles, where a finer quantization is used for regions exhibiting more images. The CNN then learns to predict the corresponding tile for a given image, thus providing the approximate position from which a photo was taken. Focusing on accurate 6DoF camera pose estimation, the PoseNet method by Kendall et al. uses CNNs to model pose estimation as a regression problem. An extension of the approach repeatedly evaluates the CNN with a fraction of its neurons randomly disabled, resulting in multiple different pose estimates that can be used to predict pose uncertainty. One drawback of the PoseNet approach is its relative inaccuracy. In this paper, we show how a CNN+LSTM architecture is able to produce significantly more accurate camera poses compared to PoseNet.
2 Deep camera pose regression
In this section, we develop our framework for learning to regress camera poses directly from images. Our goal is to train a CNN+LSTM network to learn a mapping from an image $I$ to a pose $P$, $P = f(I)$, where $f(\cdot)$ is the neural network. Each pose $P = [\mathbf{p}, \mathbf{q}]$ is represented by its 3D camera position $\mathbf{p} \in \mathbb{R}^3$ and a quaternion $\mathbf{q} \in \mathbb{R}^4$ for its orientation.
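A sketch of the per-image training objective referred to as Eq. 1, assuming Euclidean norms for both terms as in PoseNet and writing estimated quantities with a hat (the symbol names and the placement of the quaternion normalization are assumptions chosen to match the surrounding description):

```latex
L_i = \left\lVert \hat{\mathbf{p}}_i - \mathbf{p}_i \right\rVert_2
      + \beta \cdot \left\lVert \frac{\hat{\mathbf{q}}_i}{\lVert \hat{\mathbf{q}}_i \rVert} - \mathbf{q}_i \right\rVert_2
\tag{1}
```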
In the training loss (Eq. 1), $(\mathbf{p}, \mathbf{q})$ and $(\hat{\mathbf{p}}, \hat{\mathbf{q}})$ are ground truth and estimated position-orientation pairs, respectively. We represent the orientation with quaternions and thus normalize the predicted orientation to unit length. $\beta$ determines the relative weight of the orientation error wrt. the positional error and is in general bigger for outdoor scenes, as errors tend to be relatively larger. All hyperparameters used for the experiments are detailed in Section 4.
2.1 CNN architecture: feature extraction
Training a neural network from scratch for the task of pose regression would be impractical for several reasons: (i) we would need a really large training set, (ii) compared to classification problems, where each output label is covered by at least one training sample, the output in regression is continuous and infinite. Therefore, we leverage a pre-trained classification network, namely GoogLeNet, and modify it in a similar fashion as in PoseNet. At the end of the convolutional layers average pooling is performed, followed by a fully connected layer which outputs a 2048-dimensional vector (cf. Figure 2). This can be seen as a feature vector that represents the image to be localized. This architecture is used in PoseNet to predict camera poses by using yet another fully connected regression layer at the end that outputs the 7-dimensional position and orientation vector (the quaternion vector is normalized to unit length at test time).
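The final regression layer described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `regress_pose` is a hypothetical helper, the weights are random and untrained, and the feature vector is a random stand-in for the CNN output.

```python
import numpy as np

FEATURE_DIM = 2048  # size of the FC feature vector described above
POSE_DIM = 7        # 3 position values + 4 quaternion values


def regress_pose(features, weights, bias, normalize=True):
    """Linear regression layer mapping a feature vector to (position, quaternion)."""
    pose = weights @ features + bias           # shape (7,)
    position, quaternion = pose[:3], pose[3:]
    if normalize:                              # unit-length quaternion at test time
        quaternion = quaternion / np.linalg.norm(quaternion)
    return position, quaternion


rng = np.random.default_rng(0)
features = rng.standard_normal(FEATURE_DIM)                     # stand-in for CNN features
weights = 0.01 * rng.standard_normal((POSE_DIM, FEATURE_DIM))   # untrained, illustrative
bias = np.zeros(POSE_DIM)

position, quaternion = regress_pose(features, weights, bias)
print(position.shape, quaternion.shape, np.linalg.norm(quaternion))
```

With 2048 inputs and 7 outputs, this single layer already has more than 14,000 free parameters, which gives a sense of why overfitting is a concern when training data is limited.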
2.2 Structured feature correlation with LSTMs
After the convolutional layers of GoogLeNet, an average pooling layer gathers the information of each feature channel for the entire image. Following PoseNet, we use a fully connected (FC) layer after pooling to learn the correlation among features. As we can see from the PoseNet results shown in Section 4, regressing the pose after the high dimensional output of an FC layer is not optimal. Intuitively, the dimensionality of the 2048D embedding of the image through the FC layer is typically relatively large compared to the amount of available training data. As a result, the linear pose regressor has many degrees of freedom and it is likely that overfitting leads to inaccurate predictions for test images dissimilar to the training images. One could directly reduce the dimensions of the FC layer, but we empirically found that dimensionality reduction performed by a network with LSTM memory blocks is more effective. Compared to applying dropout within PoseNet to avoid overfitting, our approach consistently estimates more accurate positions, which justifies our use of LSTMs.
Even though Long Short-Term Memory (LSTM) units have typically been applied to temporal sequences, recent works [4, 58, 57, 7, 35] have used the memory capabilities of LSTMs in image space. In our case, we treat the 2048-dimensional output feature vector as our sequence. We propose to insert four LSTM units after the FC layer, which have the function of reducing the dimensionality of the feature vector in a structured way. The memory units identify the most useful feature correlations for the task of pose estimation.
Reshaping the input vector. In practice, inputting the 2048-D vector directly to the LSTM did not show good results. Intuitively, this is because even though the memory unit of the LSTM is capable of remembering distant features, a vector of length 2048 is too long for the LSTM to correlate from the first to the last feature. We therefore propose to reshape the vector to a matrix and to apply four LSTMs in the up, down, left and right directions as depicted in Figure 2. These four outputs are then concatenated and passed to the fully connected pose prediction layers. This acts as a structured dimensionality reduction, which greatly improves pose estimation accuracy.
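The reshaping and four-directional sweep can be sketched with a small NumPy LSTM. This is a minimal illustration, not the authors' implementation: the 32×64 matrix shape, the hidden size of 32, the use of the final hidden state as each LSTM's output, and the random untrained weights are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random, untrained LSTM parameters; the four gates are stacked row-wise."""
    scale = 1.0 / np.sqrt(hidden_dim)
    return {
        "W": rng.uniform(-scale, scale, (4 * hidden_dim, input_dim)),
        "U": rng.uniform(-scale, scale, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }


def lstm_last_hidden(sequence, params):
    """Run an LSTM over `sequence` (steps x input_dim); return the final hidden state."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x in sequence:
        gates = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, o, g = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
    return h


def directional_lstm_features(feature_vector, shape=(32, 64), hidden_dim=32, seed=0):
    """Reshape the FC output to a matrix, sweep it in four directions and
    concatenate the final hidden states of the four LSTMs."""
    rows, cols = shape
    matrix = feature_vector.reshape(rows, cols)
    rng = np.random.default_rng(seed)
    sweeps = {
        "down": matrix,          # rows visited top to bottom
        "up": matrix[::-1],      # rows visited bottom to top
        "right": matrix.T,       # columns visited left to right
        "left": matrix.T[::-1],  # columns visited right to left
    }
    outputs = []
    for sequence in sweeps.values():
        params = init_lstm(sequence.shape[1], hidden_dim, rng)  # one LSTM per direction
        outputs.append(lstm_last_hidden(sequence, params))
    return np.concatenate(outputs)


fc_output = np.random.default_rng(1).standard_normal(2048)  # stand-in for the FC output
reduced = directional_lstm_features(fc_output)
print(reduced.shape)  # (128,)
```

Under these assumptions each LSTM sees a sequence of at most 64 steps instead of 2048, and the concatenated output has 128 dimensions, so the pose regressor that follows has far fewer parameters than one attached directly to the 2048-D vector.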
3 A large-scale indoor localization dataset
Machine learning and in particular deep learning are inherently data-intensive endeavors. Specifically, supervised learning requires not only data but also associated ground truth labelling. For some tasks such as image classification or outdoor image-based localization, large training and testing datasets have already been made available to the community. For indoor scenes, only small datasets covering a spatial extent the size of a room are currently available.
We introduce the TU Munich Large-Scale Indoor (TUM-LSI) dataset covering an area that is two orders of magnitude larger than the typically used 7Scenes dataset. It comprises 1,095 high-resolution images with geo-referenced pose information for each image. The dataset spans a whole building floor with a total area of 5,575 m². Image locations are spaced roughly one meter apart, and at each location we provide a set of six wide-angle pictures, five taken in different horizontal directions (covering the full 360°) and one pointing up (see Figure 3). Our new dataset is very challenging due to repeated structural elements with nearly identical appearance, e.g. two nearly identical staircases, that create global ambiguities. Notice that such global ambiguities often only appear at larger scale and are thus missing from the 7Scenes dataset. In addition, there is a general lack of well-textured regions, whereas 7Scenes mostly depicts highly textured areas. Both problems make this dataset challenging for approaches that only consider (relatively) small image patches.
In order to generate ground truth pose information for each image, we captured the data using the NavVis M3 (www.navvis.com) indoor mapping platform. This mobile system is equipped with six Panasonic 16-Megapixel system cameras and three Hokuyo laser range finders. Employing SLAM, the platform is able to reconstruct the full trajectory with sub-centimeter accuracy.
4 Experimental results
We present results on several datasets, demonstrating the efficacy of our method in outdoor scenes like Cambridge Landmarks and small-scale indoor scenes such as 7Scenes. The two datasets are very different from each other: 7Scenes has a very high number of images in a very small spatial extent and is hence more suited for applications such as Augmented Reality, while Cambridge Landmarks has sparser coverage and larger spatial extent, the perfect scenario for image-based localization. In the experiments, we show that our method can be applied to both scenarios and delivers competitive results. We provide comparisons to previous CNN-based approaches, as well as a state-of-the-art SIFT-based localization method. Furthermore, we provide results for our new TUM-LSI dataset. SIFT-based methods fail on TUM-LSI due to textureless surfaces and repetitive structures, while our method is able to localize images with an average accuracy of 1.31m for an area of 5,575 m².
Experimental setup. We initialize the GoogLeNet part of the network with the Places weights and randomly initialize the remaining weights. All networks take images of size pixel as input. We use random crops during training and central crops during testing. A mean image is computed separately for each training sequence and is subtracted from all images. All experiments are performed on an NVIDIA Titan X using TensorFlow with Adam for optimization. Random shuffling is performed for each batch, and regularization is only applied to weights, not biases. For all sequences we use the following hyperparameters: batch size 75, regularization , auxiliary loss weights , dropout probability 0.5, and the parameters for Adam: , and . The $\beta$ of Eq. 1 balances the orientation and positional penalties. To ensure a fair comparison, for Cambridge Landmarks and 7Scenes, we take the same values as PoseNet: for the indoor scenes $\beta$ is between 120 and 750 and for the outdoor scenes between 250 and 2000. For TUM-LSI, we set .
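The cropping and mean-subtraction steps of this setup can be sketched as follows. This is an illustration only: the helper names, the crop size of 224, the image size, and the choice to subtract the mean before cropping are assumptions, and the images are random stand-ins.

```python
import numpy as np


def random_crop(image, size, rng):
    """Training-time crop at a random location."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]


def center_crop(image, size):
    """Test-time crop around the image centre."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]


def preprocess(image, mean_image, size, rng=None):
    """Subtract the per-sequence mean image, then crop (random if `rng` is given)."""
    centered = image.astype(np.float32) - mean_image
    if rng is not None:
        return random_crop(centered, size, rng)
    return center_crop(centered, size)


rng = np.random.default_rng(0)
CROP = 224  # hypothetical crop size chosen for this example
train_images = rng.integers(0, 256, size=(4, 256, 341, 3), dtype=np.uint8)  # stand-in data
mean_image = train_images.astype(np.float32).mean(axis=0)  # one mean image per sequence

train_input = preprocess(train_images[0], mean_image, CROP, rng)  # random crop
test_input = preprocess(train_images[1], mean_image, CROP)        # central crop
print(train_input.shape, test_input.shape)
```

Random crops act as a mild form of data augmentation during training, while the deterministic central crop keeps test-time predictions reproducible.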
|Scene||Area (m²) or Volume (m³)||Active Search (w/o)||Active Search (w/)||PoseNet||Bayesian PoseNet||Proposed + Improvement (pos, ori)|
|King’s College||5600||, (0)||, (0)||,||,||, (48,32)|
|Old Hospital||2000||, (2)||, (2)||,||,||, (35,20)|
|Shop Façade||875||, (0)||, (0)||,||,||, (19,8)|
|St Mary’s Church||4800||, (0)||, (0)||,||,||, (43,21)|
|Average All||–||–||–||,||,||, (37,19)|
|Average by Active Search||–||,||,||–||–||,|
|Chess||6||, (0)||, (0)||,||,||, (25,29)|
|Fire||2.5||, (1)||, (1)||,||,||, (28,17)|
|Heads||1||, (1)||, (1)||,||,||, (27,-14)|
|Office||7.5||, (34)||, (34)||,||,||, (37,-5)|
|Pumpkin||5||, (71)||, (68)||,||,||, (30,17)|
|Red Kitchen||18||, (0)||, (0)||,||,||, (37,-2)|
|Stairs||7.5||, (3)||, (0)||,||,||, (15,0.7)|
|Average All||–||–||–||,||,||, (29,5)|
|Average by Active Search||–||,||,||–||–||,|
Comparison with state-of-the-art. We compare results to two CNN-based approaches: PoseNet and Bayesian PoseNet. On Cambridge Landmarks and 7Scenes, results for the two PoseNet variants [26, 25] were taken directly from the authors’ publication. For the new TUM-LSI dataset, their model was fine-tuned with the training images. The hyperparameters used are the same as for our method, except for the Adam parameter , which showed better convergence.
To the best of our knowledge, CNN-based approaches have not been quantitatively compared to SIFT-based localization approaches. We feel this comparison is extremely important to understand how deep learning can make an impact in image-based localization, and which challenges remain to be overcome. We therefore present results of a state-of-the-art SIFT-based method, namely Active Search.
Active Search estimates the camera poses wrt. an SfM model, where each 3D point is associated with SIFT descriptors extracted from the training images. Since none of the datasets provides both an SfM model and the SIFT descriptors, we constructed such models from scratch using VisualSFM [61, 60] and COLMAP and registered them against the ground truth poses of the training images. Thus, the camera poses reported for Active Search contain both the errors made by Active Search and those of the reconstruction and registration processes. The models used for localization do not contain any contribution from the testing images.
Active Search uses a visual vocabulary to accelerate descriptor matching. We trained a vocabulary containing 10k words from training images of the Cambridge dataset and a vocabulary containing 1k words from training images of the smaller 7Scenes dataset. Active Search uses these vocabularies for prioritized search for efficient localization, where matching is terminated once a fixed number of correspondences has been found. We report results both with (w/) and without (w/o) prioritization. In the latter case, we simply do not terminate matching early but try to find as many correspondences as possible. For querying with Active Search, we use calibrated cameras with a known focal length, obtained from the SfM reconstructions, but ignore radial distortion. As such, camera poses are estimated using a 3-point-pose solver inside a RANSAC loop. Poses estimated from only a few matches are usually rather inaccurate. Following common practice [34, 33], Active Search only considers a testing image as successfully localized if its pose was estimated from at least 12 inliers.
4.1 Large-scale outdoor localization
We present results for outdoor image-based localization on the publicly available Cambridge Landmarks dataset in Table 1. We report results for Active Search only for images with at least 12 inliers and give the number of images where localization fails in parentheses. In order to compare the methods fairly, we provide the average accuracy for all images (Average All), and also the average accuracy for only those images that Active Search was able to localize (Average by Active Search). Note that we do not report results on the Street dataset of Cambridge Landmarks. It is a unique sequence because the training database consists of four distinct video sequences, each filmed in a different compass direction. This results in training images at similar positions, but with very different orientations. Even with the hyperparameters set by the authors of PoseNet, training did not converge for any of the implemented methods.
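The two averages can be illustrated with a short sketch. The per-image inlier counts and errors below are hypothetical values made up for the example (they are not results from Table 1), and a plain mean is assumed as the aggregation; only the 12-inlier threshold comes from the text.

```python
import numpy as np

MIN_INLIERS = 12  # threshold stated in the text for a successful localization

# Hypothetical per-image results for one scene (not values from the paper).
active_search_inliers = np.array([85, 40, 7, 130, 0, 22])
cnn_position_error_m = np.array([0.9, 1.4, 1.1, 0.7, 2.0, 1.2])

# Images that Active Search localizes according to the inlier threshold.
localized = active_search_inliers >= MIN_INLIERS
num_failed = int((~localized).sum())

# "Average All": CNN error over every test image.
average_all = cnn_position_error_m.mean()
# "Average by Active Search": CNN error restricted to the localized subset.
average_by_active_search = cnn_position_error_m[localized].mean()

print(num_failed, round(average_all, 3), round(average_by_active_search, 3))
```

Restricting both methods to the same subset of images keeps the comparison from being skewed by images that only one of them attempts to localize.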
In parentheses next to our results, we show the rounded percentage improvement wrt. PoseNet in both position and orientation, separated by a comma. As we can see, the proposed method on average reduces the positional error by 37.5% wrt. PoseNet and the orientation error by 19%. For example, in King’s College the positional error is reduced by more than 40%, going from 1.92m for PoseNet to 0.99m for our method. This shows that the proposed LSTM-based structured output is efficient in encoding feature correlations, leading to large improvements in localization performance. It is important to note that none of the CNN-based methods is able to match the precision of Active Search, especially when computing the orientation of the camera. Since Active Search requires 12 inliers to consider an image as localized, it is able to reject inaccurate poses. In contrast, our method always provides a localization result, even if it is sometimes less accurate. Depending on the application, one behavior or the other might be more desirable. As an example, we show in Figure 6 an image from Old Hospital, where a tree is occluding part of the building. In this case, Active Search is not able to localize the image, while our method still produces a reasonably accurate pose. This phenomenon becomes more important in indoor scenes, where Active Search is unable to localize a substantially larger number of images, mostly due to motion blur.
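As a worked example of the improvement figures, the King’s College positional errors quoted above can be plugged into the usual relative-reduction formula. The function name is a hypothetical helper introduced for this sketch.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage by which `new_error` is lower than `baseline_error`."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# King's College positional errors quoted in the text (metres).
kings_college = relative_improvement(1.92, 0.99)
print(round(kings_college))  # 48
```

The rounded value of 48% agrees with the positional entry listed for King’s College in Table 1.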
Interestingly, for our method, the average error for all images is lower than the average only for the images that Active Search can also localize. This means that for images where Active Search cannot return a pose, our method actually provides a very accurate result. This “complementary” behavior between SIFT- and CNN-based methods could be exploited in future research. Overall, our method shows a strong performance in outdoor image-based localization, as seen in Figure 7, where PoseNet provides less accurate poses. In order to better understand how the network localizes an image, Figure 4 plots the class activation maps for the King’s College sequence. Notice how strong activations cluster around distinctive building elements, e.g., the towers and entrance.
4.2 Small-scale indoor localization
In this section, we focus on localization in small indoor spaces, for which we use the publicly available 7Scenes dataset. Results at the bottom of Table 1 show that we also outperform the previous CNN-based PoseNet by 29% in positional error and 5.3% in orientation error. For example, on Pumpkin we achieve a positional error reduction from 0.61m for PoseNet to 0.33m for our method. As for the Cambridge Landmarks dataset, we observe that our approach consistently outperforms Bayesian PoseNet, which uses dropout to limit overfitting during training. These experimental results validate our strategy of using LSTMs for structured dimensionality reduction in an effort to avoid overfitting.
Two methods that use RGB-D data achieve a lower error than the CNN-based approaches, but still a higher one than Active Search: one achieves 0.06m positional error and 2.7° orientation error, while the other scores 0.08m and 1.60°. Note that these methods require RGB-D data for training and/or testing. It is unclear, though, how well such methods would work in outdoor scenarios with stereo data. In theory, multi-view stereo methods could be used to obtain the required depth maps for outdoor scenes. However, such depth maps are usually substantially noisier and contain significantly more outliers than the data obtained with RGB-D sensors. In addition, the accuracy of the depth maps decreases quadratically with the distance to the scene, which is usually much larger for outdoor scenes than for indoor scenes. In prior work, the authors report that 63.4% of all test images for Stairs can be localized with a position error less than 5cm and an orientation error smaller than 5°. With and without prioritization, Active Search localizes 77.8% and 80.2%, respectively, within these error bounds. Unfortunately, median registration errors of 2-5cm observed when registering the other SfM models against the 7Scenes dataset prevent us from a more detailed comparison.
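The quadratic loss of depth accuracy with distance mentioned above can be made explicit with the standard rectified-stereo relation, used here as an illustrative assumption (focal length $f$, baseline $b$, disparity $d$, and disparity error $\delta d$ are symbols introduced for this sketch):

```latex
z = \frac{f\,b}{d}
\quad\Longrightarrow\quad
\left| \frac{\partial z}{\partial d} \right| = \frac{f\,b}{d^{2}} = \frac{z^{2}}{f\,b}
\quad\Longrightarrow\quad
\delta z \approx \frac{z^{2}}{f\,b}\,\delta d
```

For a fixed disparity error, doubling the distance to the scene therefore roughly quadruples the depth error, which is why depth from stereo is less reliable for distant outdoor structures than for nearby indoor ones.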
As we can see in Table 1, if an image can be localized, Active Search performs better than CNN-based approaches. However, we note that for Office and Pumpkin the number of images not localized is fairly large, 34 and 71, respectively. We provide the average accuracy for all images (Average All), and also the average accuracy for only those images that Active Search was able to localize (Average by Active Search). Note that for our method, the two averages are extremely similar, i.e., we are able to localize those images with the same accuracy as all the rest, showing robustness wrt. motion blur, which heavily affects SIFT-based methods. This shows the potential of CNN-based methods.
4.3 Complex large-scale indoor localization
|Area||# train/test||PoseNet ||Proposed|
In our last experiment, we present results on the new TUM-LSI dataset. It covers a total area of 5,575 m², the same order of magnitude as the outdoor localization dataset, and much larger than typical indoor datasets like 7Scenes.
Figure 3 shows an example from the dataset that contains large textureless surfaces. These surfaces are known to cause problems for methods based on local features. In fact, we were not able to obtain correct SfM reconstructions for the TUM-LSI dataset. The lack of texture in most parts of the images, combined with repetitive scene elements, causes both VisualSFM and COLMAP to fold repetitive structures onto themselves. For example, the two separate stairwells (red floor in Figure 3) are mistaken for a single stairwell (cf. Figure 5). As the resulting 3D model does not reflect the true 3D structure of the scene, there is no point in applying Active Search or any other SIFT-based method. Notice that such repetitive structures would cause Active Search to fail even if a good model were provided.
For the experiments on the TUM-LSI dataset, we ignore the ceiling-facing cameras. As we can see in Table 2, our method outperforms PoseNet by almost 30% in positional error and 55% in orientation error, showing a similar improvement as for other datasets. To the best of our knowledge, we are the first to showcase a scenario where CNN-based methods succeed while SIFT-based approaches fail. On this challenging sequence, our method achieves an average error of around 1m. In our opinion, this demonstrates that CNN-based methods are indeed a promising avenue to tackle hard localization problems such as repetitive structures and textureless walls, which are predominant in modern buildings and are a problem for classic SIFT-based localization methods.
In this paper, we address the challenge of image-based localization of a camera or an autonomous system with a novel deep learning architecture that combines a CNN with LSTM units. Rather than precomputing feature points and building a map as done in traditional SIFT-based localization techniques, we determine a direct mapping from input image to camera pose. With a systematic evaluation on existing indoor and outdoor datasets, we show that our LSTM-based structured feature correlation can lead to drastic improvements in localization performance compared to other CNN-based methods. Furthermore, we are the first to show a comparison of SIFT-based and CNN-based localization methods, showing that classic SIFT approaches still outperform all published CNN-based methods to date on standard benchmark datasets. To answer the ensuing question whether CNN-based localization is a promising direction of research, we demonstrate that our approach succeeds in a very challenging scenario where SIFT-based methods fail. To this end, we introduce a new challenging large-scale indoor sequence with accurate ground truth. Besides aiming to close the gap in accuracy between SIFT- and CNN-based methods, we believe that exploring CNN-based localization in hard scenarios is a promising research direction.
Acknowledgements. This work was partially funded by the ERC Consolidator grant 3D Reloaded and a Sofja Kovalevskaja Award from the Alexander von Humboldt Foundation, endowed by the Federal Ministry of Education and Research.
-  R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  R. Arandjelović and A. Zisserman. DisLocation: Scalable descriptor distinctiveness for location recognition. In Asian Conference on Computer Vision (ACCV), 2014.
-  M. Aubry, B. C. Russell, and J. Sivic. Painting-to-3D Model Alignment via Discriminative Visual Elements. ACM Transactions on Graphics (TOG), 2014.
-  S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC - Differentiable RANSAC for Camera Localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with LSTM recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  S. Cao and N. Snavely. Graph-Based Discriminative Learning for Location Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  S. Cao and N. Snavely. Minimal Scene Descriptions from Structure from Motion Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  S. Choudhary and P. J. Narayanan. Visibility probability structure from SfM datasets and applications. In European Conference on Computer Vision (ECCV), 2012.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  M. Donoser and D. Schmalstieg. Discriminative Feature-to-Point Matching in Image-Based Localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM (CACM), 1981.
-  Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards Internet-scale Multi-view Stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
-  Y. Furukawa and J. Ponce. Accurate, Dense, and Robust Multi-View Stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010.
-  R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks (ICNN), 1996.
-  A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Conference on Neural Information Processing Systems (NIPS), 2009.
-  P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla. Learning per-location classifiers for visual place recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi. Multi-Output Learning for Camera Relocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision (ACCV), 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation (NECO), 1997.
-  A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. In IEEE International Conference on Robotics and Automation (ICRA), 2016.
-  A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
-  L. Kneip, D. Scaramuzza, and R. Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
-  I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. In International Conference on Learning Representations (ICLR), 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Conference on Neural Information Processing Systems (NIPS), 2012.
-  Z. Kukelova, M. Bujnak, and T. Pajdla. Real-time solution to the absolute pose problem with unknown radial distortion and focal length. In IEEE International Conference on Computer Vision (ICCV), 2013.
-  B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide pose estimation using 3D point clouds. In European Conference on Computer Vision (ECCV), 2012.
-  Y. Li, N. Snavely, and D. P. Huttenlocher. Location Recognition Using Prioritized Feature Matching. In European Conference on Computer Vision (ECCV), 2010.
-  X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  H. Lim, S. N. Sinha, M. F. Cohen, and M. Uyttendaele. Real-time image-based 6-DOF localization in large-scale environments. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 2004.
-  S. Lynen, T. Sattler, M. Bosse, J. Hesch, M. Pollefeys, and R. Siegwart. Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization. In Robotics: Science and Systems (RSS), 2015.
-  K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. V. Gool. Convolutional oriented boundaries. In European Conference on Computer Vision (ECCV), 2016.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems (NIPS), 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
-  T. Sattler, M. Havlena, F. Radenović, K. Schindler, and M. Pollefeys. Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Large-Scale Location Recognition And The Geometric Burstiness Problem. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2D-to-3D matching. In IEEE International Conference on Computer Vision (ICCV), 2011.
-  T. Sattler, B. Leibe, and L. Kobbelt. Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016 (to appear).
-  T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Conference on Neural Information Processing Systems (NIPS), 2014.
-  L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson. City-Scale Localization for Cameras with Known Vertical Direction. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016 (to appear).
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual Place Recognition with Repetitive Structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. Torr. Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision (ECCV), 2016.
-  F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio. ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393, 2015.
-  T. Weyand, I. Kostrikov, and J. Philbin. PlaNet - photo geolocation with convolutional neural networks. In European Conference on Computer Vision (ECCV), 2016.
-  C. Wu. Towards linear-time incremental structure from motion. In International Conference on 3D Vision (3DV), 2013.
-  C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore Bundle Adjustment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
-  A. R. Zamir and M. Shah. Accurate image localization based on Google Maps Street View. In European Conference on Computer Vision (ECCV), 2010.
-  A. R. Zamir and M. Shah. Image Geo-Localization Based on Multiple Nearest Neighbor Feature Matching Using Generalized Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2014.
-  B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-scale image-based localization. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  W. Zhang and J. Kosecka. Image based localization in urban environments. In International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), 2006.
-  B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Conference on Neural Information Processing Systems (NIPS), 2014.