Where is this? Video geolocation based on neural network features

Abstract

In this work we propose a method that geolocates videos within a delimited widespread area based solely on the visual content of their frames. Our method tackles video geolocation through traditional image retrieval techniques, using Google Street View as the reference source. To achieve this goal we represent images with the deep learning features obtained from NetVLAD, whose similarity can be measured directly as the L2 distance between feature vectors. We propose a family of voting-based methods to aggregate frame-wise geolocation results, which boosts the video geolocation result. The best aggregation found through our experiments considers both NetVLAD and SIFT similarity, as well as the geolocation density of the most similar results. To test our proposed method, we gathered a new video dataset from the Pittsburgh Downtown area to benefit and stimulate more work in this area. Our system achieved a precision of 90% while geolocating videos within a range of 150 meters, or two blocks, from the original position.


1 Introduction

Unlabeled video is readily available from many sources and has become especially widespread through social media networks. One common type of shared video is egocentric, recorded from a first-person point of view, such as footage captured during a trip showing the visited places, or at social events such as parades or festivals. It has also become popular to share recordings of public safety events such as shootings or demonstrations. This last type of video is of utmost importance when gathering evidence for event reconstruction. Geolocating such videos would help determine where a public safety event, such as a robbery or an act of vandalism, took place.

A common issue with multimedia obtained from social media networks is the lack of geotagging information, which is either removed by the user on purpose or stripped automatically before publication to protect user privacy. Given that social media networks are a rich and vast multimedia source, it would be useful to have a method capable of geolocating videos based only on their visual content.

Determining the location of a single frame reduces to the geolocalization of an image, which is a well-developed area. One of the most common practices consists of building a reference database of images with known locations and searching for the reference images most similar to a query image. The image locations are usually provided in geodetic coordinates: latitude, longitude and altitude. The set of reference images can be obtained from services such as Google Street View (GSV), which provides multi-angle images taken along major roads, or from photo-hosting services such as Flickr, which stores user-provided geolocation along with the images. Nowadays the data itself is not as critical a factor as the definition of a reliable image descriptor together with a suitable retrieval strategy. Recently, the NetVLAD image descriptor paired with approximate K-Nearest Neighbors (KNN) search has been used to solve the image geolocation problem [\citenameArandjelović et al.2015, \citenameTorii et al.2013].
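As a rough illustration of this retrieval strategy, the sketch below indexes L2-normalized reference descriptors and retrieves the nearest neighbors of a query descriptor. The matrix sizes, random data and variable names are placeholders for illustration only, not the actual pipeline of the paper.

```python
# Sketch: image geolocation as nearest-neighbour retrieval over NetVLAD descriptors.
# The descriptor matrices and coordinate array below are random stand-ins.
import numpy as np
from sklearn.neighbors import NearestNeighbors

ref_descriptors = np.random.rand(1000, 4096).astype(np.float32)  # one row per geotagged image
ref_coords = np.random.rand(1000, 2)                              # (latitude, longitude) per image
query_descriptor = np.random.rand(1, 4096).astype(np.float32)    # descriptor of the query frame

# NetVLAD descriptors are L2-normalized, so Euclidean distance orders them by similarity.
ref_descriptors /= np.linalg.norm(ref_descriptors, axis=1, keepdims=True)
query_descriptor /= np.linalg.norm(query_descriptor, axis=1, keepdims=True)

knn = NearestNeighbors(n_neighbors=10).fit(ref_descriptors)
distances, indices = knn.kneighbors(query_descriptor)

# The candidate locations for the query frame are the coordinates of its nearest references.
candidate_locations = ref_coords[indices[0]]
print(candidate_locations)
```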

In this work, we focus our efforts on geolocating a video based solely on the visual modality. To achieve this we take advantage of state-of-the-art image geolocation techniques to query the keyframes of a video with an unknown location. Our proposed method aggregates the results obtained from querying the video keyframes against our reference image database, built from the Google Street View Pittsburgh 250k dataset. The aggregation considers different voting schemes: simple voting, weighted-rank voting and density-based voting.

The main contributions of our work are: (1) the definition of the task of geolocating videos based only on their visual content, (2) the creation of a high-quality testing dataset consisting of 50 videos taken around the Pittsburgh downtown area, (3) the definition of a technique to visualize the matching area of images using NetVLAD features to facilitate their interpretation, and (4) the proposal of a novel reference-image aggregation strategy that significantly improves the precision of video geolocation over a strong baseline based on image retrieval.

The ideas presented in this paper are organized as follows: Section 2 gives a detailed background related to this project. The characteristics and properties of NetVLAD are introduced in Section 3. Section 4 and Section 5 describe our video pre-processing strategy and reference image aggregation strategy, respectively. The experimental settings and results can be found in Section 6 and Section 7. Finally, in Section 8 and Section 9 we conclude and discuss next steps and further ideas that might improve the method.

Figure 1: System architecture

2 Related Work

To our knowledge, only a few research projects address the problem of geolocating videos. Song et al. [\citenameSong et al.2012] proposed a web video geolocating algorithm that propagates geotags along the web video social relationship graph. For this reason, their algorithm can only locate videos from social networks. Snoek et al. [\citenameSnoek et al.2011] use video content along with various additional signals to infer the location of videos. Their method requires a predefined set of coarse candidate locations, such as Trafford (UK) and Baghdad (Iraq), into which the query video is classified. One limitation of this method is that it restricts the predicted locations to a predefined candidate set. Moreover, it only infers the city or town where the video was taken, but cannot give accurate geodetic coordinates.

In this work we tackle the video geolocation problem through image geolocation techniques and cast it as an image retrieval task. The problem of image place recognition has been studied extensively [\citenameArandjelović et al.2015], [\citenameArandjelović and Zisserman2014], [\citenameCao and Snavely2013], [\citenameChen et al.2011], [\citenameGronat et al.2013], [\citenameKnopp et al.2010], [\citenameSattler et al.2011], [\citenameSchindler et al.2007]. In these methods, the query image location is typically estimated using the locations of the most visually similar images retrieved from a large geotagged database.

In our approach we take advantage of NetVLAD features, the current state-of-the-art image descriptor for the geolocation task. The NetVLAD network is formed by replacing the fully connected and softmax layers of a regular convolutional neural network (CNN) with a NetVLAD layer. This layer takes the feature maps of the last convolutional layer (before max-pooling) as input and applies a VLAD [\citenameJégou et al.2010] aggregation over the depth-wise local descriptors to form locally aggregated descriptors, which are intra-normalized, concatenated and finally L2-normalized as a whole vector.
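The following numpy sketch illustrates the VLAD-style aggregation the NetVLAD layer performs, using hard cluster assignments and random stand-in tensors purely for clarity; the actual layer uses learned soft assignments trained end-to-end.

```python
# Sketch of VLAD-style aggregation over a conv feature map (hard assignments for brevity;
# the real NetVLAD layer uses learned soft assignments). All tensors here are random stand-ins.
import numpy as np
from scipy.spatial.distance import cdist

H, W, D, K = 30, 40, 512, 64                # spatial size, descriptor depth, number of clusters
conv_features = np.random.rand(H * W, D)    # local descriptors from the last conv layer
centroids = np.random.rand(K, D)            # cluster centres (learned in the real network)

# Assign each local descriptor to its nearest centroid and accumulate residuals per cluster.
assignments = cdist(conv_features, centroids).argmin(axis=1)
vlad = np.zeros((K, D))
for k in range(K):
    members = conv_features[assignments == k]
    if len(members):
        vlad[k] = (members - centroids[k]).sum(axis=0)

# Intra-normalize each cluster's residual vector, flatten, then L2-normalize the whole descriptor.
norms = np.linalg.norm(vlad, axis=1, keepdims=True)
vlad = np.divide(vlad, norms, out=np.zeros_like(vlad), where=norms > 0)
descriptor = vlad.flatten()
descriptor /= np.linalg.norm(descriptor)
print(descriptor.shape)   # (K * D,), i.e. 32768-dimensional before any dimensionality reduction
```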

The learning process of NetVLAD is based on a weakly supervised triplet ranking loss, which aims to minimize the L2 distance between a training image and its most similar (positive) images while maximizing its distance to dissimilar images. This design is the key component that enables NetVLAD to localize images from a retrieval perspective.

3 Feature Analysis

As the NetVLAD features are the key component of our proposed method, we performed an exploratory analysis of the features to determine the potential and limitations of the image embedding used in this work.

3.1 Rotation and Scale Invariance

In our first feature analysis we tried to determine whether the NetVLAD descriptors are rotation invariant, like the features obtained by SIFT [\citenameLowe1999] or AKAZE [\citenameAlcantarilla and Solutions2011]. In this experiment we fully rotated the image in steps of 10 degrees. Figure 2(a) shows the results, demonstrating that this image representation is not rotation invariant, as the distance increases as soon as we begin to rotate the image.

Additionally, scale invariance was tested by repeatedly scaling down an image in steps of 0.1. As we can observe in Figure 2(b), the results of this experiment demonstrate that the NetVLAD feature similarity is not scale invariant either: as we scale down the image the distance increases, which corresponds to a decrease in similarity.
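A sketch of how such an invariance probe can be run is shown below. It assumes a `netvlad_descriptor` function that maps an image to its NetVLAD vector; that function, the image path, and the step sizes are illustrative placeholders.

```python
# Sketch of the invariance probe: rotate/scale an image in steps and track the descriptor
# distance to the original. `netvlad_descriptor` stands in for the real NetVLAD forward pass.
import cv2
import numpy as np

def netvlad_descriptor(image):
    return np.random.rand(4096)          # placeholder for the actual feature extractor

def l2(a, b):
    return float(np.linalg.norm(a - b))

image = cv2.imread("query.jpg")           # placeholder path to a query image
reference = netvlad_descriptor(image)
h, w = image.shape[:2]

for angle in range(0, 360, 10):           # rotation probe, 10-degree steps
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h))
    print("rotation", angle, l2(reference, netvlad_descriptor(rotated)))

for scale in np.arange(1.0, 0.05, -0.1):  # scale probe, steps of 0.1
    scaled = cv2.resize(image, None, fx=scale, fy=scale)
    print("scale", round(scale, 1), l2(reference, netvlad_descriptor(scaled)))
```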

(a) Rotation Analysis: angle vs distance
(b) Scale Analysis: scale vs distance
Figure 2: Invariance Analysis

3.2 Visualizing the similarity

Arandjelovic et al. [\citenameArandjelović et al.2015] visualize the results of the NetVLAD network by showing a heatmap of the highest activation output value of the CNN in order to compare against other networks, but they do not provide a visualization of the network's activations based on actual image resemblance.

In an attempt to visualize the similarity between images using NetVLAD descriptors, we followed the approach of Zeiler and Fergus [\citenameZeiler and Fergus2014], who partially occlude an image and record, at the location of the occlusion, the change in the classification output. In our case, since the NetVLAD feature is nothing more than the last convolutional layer before max-pooling to which the VLAD operator has been applied, the image similarity can be visualized as a heatmap that records how the similarity between the partially occluded query image and the most similar reference image changes with the position of the occlusion.

Our proposed visualization procedure is as follows. First, we calculate the NetVLAD descriptor of the most similar image from the reference dataset. Then we partially occlude the query image with a black patch whose size and position match the convolutional filter of the first layer of the network. Next, we calculate the distance between the occluded query image and the most similar reference image, and write this value into the block of the output heatmap (which has the size of the query image) corresponding to the occluded region. This procedure is repeated for each partially occluded image until the similarity heatmap is completely filled. Finally, we normalize the heatmap values to the range [0,1] and superimpose the heatmap on the query image to visualize the attention of the model.
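A minimal sketch of this occlusion-based heatmap is shown below. The `netvlad_descriptor` function, the image paths and the patch size are illustrative assumptions, not the exact settings used in the paper.

```python
# Sketch of the occlusion-based similarity heatmap between a keyframe and its best reference match.
import cv2
import numpy as np

def netvlad_descriptor(image):
    return np.random.rand(4096)              # placeholder for the real NetVLAD forward pass

query = cv2.imread("keyframe.jpg")            # placeholder paths
best_match = cv2.imread("top_reference.jpg")
match_desc = netvlad_descriptor(best_match)

patch = 32                                    # occlusion block size (assumed to follow the first-layer filter)
h, w = query.shape[:2]
heatmap = np.zeros((h, w), dtype=np.float32)

for y in range(0, h, patch):
    for x in range(0, w, patch):
        occluded = query.copy()
        occluded[y:y + patch, x:x + patch] = 0                 # black out one block
        dist = np.linalg.norm(netvlad_descriptor(occluded) - match_desc)
        heatmap[y:y + patch, x:x + patch] = dist               # large change = important region

heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)  # normalize to [0, 1]
overlay = cv2.applyColorMap((heatmap * 255).astype(np.uint8), cv2.COLORMAP_JET)
cv2.imwrite("attention.jpg", cv2.addWeighted(query, 0.5, overlay, 0.5, 0))
```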

(a) Procedure
(b) Salient regions example
Figure 3: NetVLAD Visual Attention

In Figure 3 we show on the left a graphical description of the aforementioned procedure, and on the right an example of the output of our visualization. As we can see in the example, the similarity focuses on salient structures such as corners of buildings, street lamps, sidewalks, signs, or high-contrast edges. These regions are no different from the ones that other image features, such as SIFT or BRIEF [\citenameCalonder et al.2012], focus on.

4 Video Processing

4.1 Keyframe Extraction

Before passing to the image retrieval phase, it is necessary to extract the keyframes, i.e., the set of frames that are representative of the query video. Extracting keyframes greatly improves the efficiency of the method, as the number of query images is reduced from the total number of frames in the video to only a few. The keyframe extraction step also benefits the precision of the method, as many blurry and noisy frames are filtered out.

In this work, we use the difference of color histograms for keyframe extraction. The idea behind this method is to select frames that differ in color space, which avoids selecting redundant scenes. To obtain the keyframes, we first extract the video I-frames, whose decoding does not depend on other video frames. Then the color histogram is calculated for each I-frame by quantizing each of the RGB channels into 4 bins, giving a total of 4 x 4 x 4 = 64 bins. The result is a 64-dimensional representation vector for each frame. After obtaining the color histograms, we calculate the Manhattan distance between the previously selected keyframe and each of the following consecutive I-frame histograms. An I-frame is selected as a keyframe when its distance with respect to the previous keyframe is larger than a certain threshold.
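A sketch of this selection loop is shown below, assuming the I-frames have already been decoded into a list of images. The bin count follows the description above, while the threshold value and helper names are illustrative placeholders.

```python
# Sketch of color-histogram keyframe selection over already-decoded I-frames.
# 4 bins per RGB channel -> 4 * 4 * 4 = 64-dimensional histogram per frame.
import cv2
import numpy as np

def color_histogram(frame):
    hist = cv2.calcHist([frame], [0, 1, 2], None, [4, 4, 4], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()          # 64-d vector

def select_keyframes(iframes, threshold=0.4):
    # `threshold` is an illustrative value; in practice it is tuned empirically.
    keyframes, last_hist = [], None
    for frame in iframes:
        hist = color_histogram(frame)
        if last_hist is None or np.abs(hist - last_hist).sum() > threshold:  # Manhattan distance
            keyframes.append(frame)
            last_hist = hist
    return keyframes
```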

5 Aggregation Strategy

In this section we present a family of voting-based methods that aggregate the geotagged image retrieval results into a single predicted location for the query video. First we introduce the baseline simple voting method. Then we present weighted-rank voting, which biases towards top-ranked retrieved images, and density-based voting, which favors areas where the candidate locations are densely concentrated. Finally, we present a weighted voting method that considers the image similarity in both the NetVLAD space and the SIFT space.

5.1 Simple Voting

We set plain majority voting as the baseline for predicting the location of the video, as this is a direct mapping of the original image retrieval problem that Arandjelovic et al. [\citenameArandjelović et al.2015] formulated. We will refer to this baseline method as simple voting.

For simple voting, consider a video formed by N keyframes. For each keyframe, the ranking considers the top k most similar images found in the approximate KNN phase of the pipeline, so each video gets N x k geotagged images. These images usually overlap in their locations, since two or more query keyframes may retrieve reference images from the same location. One of the main reasons is that GSV contains 24 pictures for each location, taken at different yaws and pitches, so images from nearby locations overlap with each other. This redundancy in visual information enables the voting strategy, since the geotagged reference images are treated as N x k (or fewer) location votes. The predicted location is the one that obtains the voting majority.
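A compact sketch of simple voting is shown below. The input format (a list of per-keyframe candidate coordinates) and the rounding used to merge numerically identical positions are illustrative assumptions.

```python
# Sketch of simple voting: every retrieved geotagged image casts one vote for its location,
# and the most voted location wins.
from collections import Counter

def simple_voting(candidates_per_keyframe):
    """candidates_per_keyframe: list over keyframes, each a list of (lat, lon) tuples
    for the top-k retrieved reference images of that keyframe."""
    votes = Counter()
    for candidates in candidates_per_keyframe:
        for lat, lon in candidates:
            votes[(round(lat, 5), round(lon, 5))] += 1   # merge identical panorama positions
    return votes.most_common(1)[0][0]                    # predicted (lat, lon)
```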

Nonetheless, using a simple voting scheme revealed several problems with the reference dataset. For instance, the retrieved locations contained noisy results or outliers on the map. Another problem was that in some cases the retrieved locations did not agree and each location got only one vote; in that case, simple voting performs no better than a random decision.

5.2 Weighted-Rank Voting

Due to the problems found in the simple voting approach, we explored weighted-rank voting [\citenameFishburn1967] as a second aggregation method. In simple voting we consider the top k most similar images per keyframe to be equally important. However, in reality, images ranked lower in the retrieval list are less similar to the keyframe and therefore less likely to be at the same location as the video. Instead of giving each image an equal vote, we assign to each retrieved image a vote weight equal to the inverse of its ranking position, as described in eq. 1.

w_i = 1 / r_i        (1)

where r_i is the retrieval rank of image i within its keyframe's candidate list.
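A minimal sketch of weighted-rank voting under the same assumed input format as above:

```python
# Sketch of weighted-rank voting: the vote of the r-th retrieved image is weighted by 1/r (eq. 1).
from collections import defaultdict

def weighted_rank_voting(candidates_per_keyframe):
    """Each keyframe contributes a ranked list of (lat, lon) candidates, best first."""
    votes = defaultdict(float)
    for candidates in candidates_per_keyframe:
        for rank, (lat, lon) in enumerate(candidates, start=1):
            votes[(round(lat, 5), round(lon, 5))] += 1.0 / rank
    return max(votes, key=votes.get)
```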

5.3 Density-based Voting

Weighted-rank voting only takes into consideration the preferences of the retrieved images under each keyframe. However, some keyframes can be less informative, or more difficult for the image retrieval process, so the results from these uninformative keyframes are less reliable. As a third method, we considered density-based voting, which compares all of the candidate locations of a video globally. The intuition is that buildings in the video can appear in multiple geotagged images taken from different places that are close to each other; thus, the area where the candidate locations are most concentrated is more likely to cover the correct location of the video. Density-based voting considers the top k most similar images per keyframe, as in the previous methods. The main difference is that this method begins by clustering the images based on their geodetic coordinates and then only allows the images from the densest cluster to vote for the location prediction. In this way, the voting is constrained to a smaller area on the map. For the clustering step we selected DBSCAN [\citenameEster et al.1996] due to its ability to detect outliers.
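The sketch below illustrates this clustering-then-voting step with scikit-learn's DBSCAN. The `eps` and `min_samples` values, and the fallback when no cluster forms, are illustrative choices rather than the paper's tuned parameters.

```python
# Sketch of density-based voting: cluster candidate coordinates with DBSCAN, keep the most
# populated cluster, and vote only within it.
from collections import Counter
import numpy as np
from sklearn.cluster import DBSCAN

def density_based_voting(candidate_coords, eps=1e-4, min_samples=5):
    """candidate_coords: array of shape (n, 2) with (lat, lon) of all retrieved images."""
    coords = np.asarray(candidate_coords)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    valid = labels[labels != -1]                     # -1 marks DBSCAN outliers
    if len(valid) == 0:                              # fall back to plain voting if nothing clusters
        densest = np.ones(len(coords), dtype=bool)
    else:
        densest = labels == Counter(valid).most_common(1)[0][0]
    votes = Counter(map(tuple, np.round(coords[densest], 5)))
    return votes.most_common(1)[0][0]
```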

(a) Candidate locations shown in Google Maps
(b) Scatter graph of candidate locations
(c) Scatter graph of DBSCAN candidate locations
Figure 4: Comparison of voting methods.

Figure 4 gives an example of the results obtained through the proposed methods. Figure 4(a) shows the heatmap of location votes from the retrieved geotagged images for a particular video, where red spots represent more votes for a location, blue represents fewer votes, and the yellow star represents the ground truth. Figure 4(b) plots the coordinates of the candidate locations. This plot reveals why simple voting fails: although many candidate locations are scattered around the ground truth, they do not share exactly the same position, and averaging the locations would introduce error. In contrast, Figure 4(c) shows the result after clustering the locations with DBSCAN, where each color represents a cluster: the light blue dots are outliers, while the thick green markers represent the cluster with the largest population. In our method, only the densest cluster is allowed to vote. As we can see, adding the clustering step filters out most of the false-positive locations, as the ground truth is usually located within the densest cluster.

5.4 NetVLAD + SIFT

As previously described, we use NetVLAD features to retrieve geotagged images, so the weighted-rank voting method is also based on the similarity of NetVLAD features. In our analysis, although NetVLAD had good precision in general, it did make errors in some cases. The next method considers another widely used visual feature, SIFT, to refine the voting. We assign a SIFT score to each image, computed as the average distance among the top matching keypoints obtained through brute-force matching between the query keyframe and the top-ranked images retrieved through NetVLAD feature matching.

As in our previous methods, we retrieve the top k most similar geotagged images per keyframe based on the NetVLAD features. Then we extract SIFT features for these images and keyframes. Each geotagged image is assigned a weight that linearly combines the NetVLAD ranking and the SIFT similarity score, as described in eq. 2.

w_i = λ · (1 / r_i) + (1 − λ) · s_i        (2)

where r_i is the NetVLAD retrieval rank of image i, s_i is its SIFT similarity score, and λ balances the two terms.
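A sketch of this combined weighting is given below. The mixing parameter `lam` and the way the average keypoint-match distance is converted into a similarity score are illustrative assumptions, not the exact formulation tuned in the paper.

```python
# Sketch of the NetVLAD+SIFT weighting (eq. 2): each retrieved image gets a vote that linearly
# mixes its inverse NetVLAD rank with a SIFT matching score.
import cv2
import numpy as np

sift = cv2.SIFT_create()
bf = cv2.BFMatcher(cv2.NORM_L2)

def sift_score(keyframe_gray, reference_gray, top_matches=20):
    _, d1 = sift.detectAndCompute(keyframe_gray, None)   # grayscale 8-bit images expected
    _, d2 = sift.detectAndCompute(reference_gray, None)
    if d1 is None or d2 is None:
        return 0.0
    matches = sorted(bf.match(d1, d2), key=lambda m: m.distance)[:top_matches]
    if not matches:
        return 0.0
    avg_dist = np.mean([m.distance for m in matches])
    return 1.0 / (1.0 + avg_dist)        # smaller average distance -> higher similarity score

def combined_weight(rank, keyframe_gray, reference_gray, lam=0.5):
    return lam * (1.0 / rank) + (1.0 - lam) * sift_score(keyframe_gray, reference_gray)
```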

6 Experiments

6.1 Datasets

Pittsburgh 250K

The dataset used for testing the performance of this work is Pittsburgh 250K, which is formed by 254,064 images obtained from GSV. Each image has a resolution of 640x480 pixels, and the images were taken at 10,586 different locations, mainly in Pittsburgh Downtown. At each location the images were captured at 12 yaws spaced 30 degrees apart and at two pitches: one at street level (0 degrees) and the second at 30 degrees above the ground.

Pittsburgh Urban Video

For testing purposes we built a video dataset named Pittsburgh Urban Video. This video corpus is formed by 50 geotagged videos recorded at a common resolution. Most of the videos simulate a tourist recording while visiting a city. The majority of the videos were recorded by panning urban views, including high tilts toward the skyscrapers found in the Pittsburgh downtown area. We incorporated recordings in landscape and portrait mode, rotated videos, fast movement, quick zoom-ins and zoom-outs, and close-ups of objects that provide little information, such as waste bins, flower pots or traffic lights. By creating a custom test set we ensure that the test images were not included in the training set and can validate the performance of the proposed method.

6.2 Experimental Setup

NetVLAD feature extraction. The geolocation system required computing offline the NetVLAD descriptors for all the images in the Pittsburgh 250K dataset and all the keyframes from the Pittsburgh Urban Video dataset. Due to the scale of the processing, we distributed the workload across 25 computing nodes, in batches of approximately 10,000 images. After each node finished its task, the data was gathered into one single data block. Distributing the workload reduced the computation time considerably compared to processing on a single node.

Once the descriptors from the training and testing datasets were obtained, we performed a series of experiments to determine which method is the most suitable for geolocalizing videos.

Evaluation. Location coordinates form a continuous space, which makes it nearly impossible to measure the system's precision or recall based on an exact location prediction. For this reason, we evaluate our methods through the precision of the predicted location within a certain distance range, as described in eq. 3.

P(D) = (1 / V) · Σ_{i=1..V} 1[d_i ≤ D]        (3)

where V is the total number of testing videos and d_i is the spatial distance from the predicted location of video i to its ground truth. A prediction is considered correct if it is within D meters of the ground truth. However, as D grows larger, even randomly picking locations in the city could achieve high precision, and the measure becomes meaningless. In this work, we consider the prediction to be meaningful only if it is within about a block of the ground truth; as a coarse estimate, two blocks in the Pittsburgh Downtown area span roughly 150 meters. Therefore the final evaluation measures precision at distance thresholds up to 150 meters.
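The sketch below shows one way to compute this precision-at-distance measure, using the haversine formula as the ground distance; the function names and input format are illustrative.

```python
# Sketch of the precision@D evaluation (eq. 3): the fraction of test videos whose predicted
# location falls within D meters of the ground truth.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    r = 6371000.0                                    # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def precision_at(predictions, ground_truths, D):
    """predictions, ground_truths: lists of (lat, lon); D: distance threshold in meters."""
    hits = sum(haversine_m(*p, *g) <= D for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```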

7 Experimental Results

In this section we first compare the voting scheme to a random baseline, then evaluate the video geolocation precision for each of the aggregation methods, and finally compare all aggregation methods to an oracle system.

7.1 Video Geolocation Random Baseline

Voting methods assume that the retrieved results overlap: if every candidate location received only one vote, voting would be no different from a random decision. In our first experiment we compare the simple voting method to a random baseline.

The random baseline considers the top k most similar images per keyframe as candidates, which results in N x k candidate geotagged images for a video with N keyframes. The random baseline then selects the location of one image, chosen uniformly at random from these candidates, as the prediction of the system.

Figure 6: (a) Comparison of the random baseline (dashed line) and simple voting (solid line) for the top k candidate reference images per keyframe. (b) Comparison of the proposed aggregation methods. In both plots, the x-axis is the distance from the ground truth and the y-axis is the video geolocation precision defined in eq. 3.

Figure 6(a) shows the performance of the random baseline and of simple voting at various values of k. The random baseline is relatively strong when k is small, as the random choices are narrowed down to the most similar images. Nonetheless, the voting method performs better than the best random performance. This experiment validates the location-overlap assumption and demonstrates that voting can aggregate evidence from retrieved images and improve video location prediction. The location overlap can clearly be seen in Figure 5, where the most salient results tend to come from nearby locations rather than from the exact same location, due to the nature of salient shared structures such as buildings, bridges or even billboards.

Figure 5: Retrieval result example

7.2 Video Geolocation Effectiveness

The next experiment analyzes the performance of the proposed aggregation methods. We also tested an oracle system: consider the top k candidates per keyframe and select the candidate location closest to the ground truth as the prediction. This is the upper bound of any aggregation method given a fixed image retrieval output. The performance of the oracle system also reflects the effectiveness of the NetVLAD image retrieval process.

Figure 6(b) reports the video geolocation performance of the 5 proposed methods: simple voting, weighted-rank voting, density-based voting with equal weights, density-based voting with weighted rank, and density-based voting with NetVLAD+SIFT weighting. The oracle performance is shown as a dashed black line. All methods considered the same number of candidates per keyframe, and for the NetVLAD+SIFT weighting scheme the mixing weight λ was selected through a coarse parameter sweep. As shown in Figure 6(b), the precision gradually improves as more signals are aggregated. We found that the best method combines voting, location density, NetVLAD similarity and SIFT similarity. It is worth noticing that at short error distances all the methods fall far below the oracle. This means that although some near-correct locations are retrieved from the database, our aggregation methods fail to select them as the final prediction. As the error distance threshold increases, however, the differences between the methods also increase, pushing the best method closer to the oracle precision.

As shown in Table 1, NetVLAD+SIFT+Density improves the precision over simple voting by 15% at the 150-meter threshold: 90% of its predictions are within 150 meters, or two blocks, of the original location. Moreover, at the 150-meter distance, NetVLAD+SIFT+Density is only four percentage points below the oracle.

Method               | d=5         | d=10        | d=30       | d=50        | d=100       | d=150
Random               | 0.00        | 0.04        | 0.16       | 0.28        | 0.36        | 0.42
Simple Voting        | 0.04        | 0.14        | 0.44       | 0.62        | 0.74        | 0.78
NetVLAD+SIFT+Density | 0.06 (+50%) | 0.12 (-14%) | 0.48 (+9%) | 0.68 (+10%) | 0.86 (+16%) | 0.90 (+15%)
Oracle               | 0.10        | 0.30        | 0.58       | 0.78        | 0.90        | 0.94
Table 1: Precision of the random baseline, the simple voting baseline, the best voting method (NetVLAD+SIFT+Density) and the oracle system at distance thresholds d (in meters). Percentages in parentheses show the relative gain or loss with respect to simple voting.

8 Discussion

In this work, we have presented a method capable of localizing videos by combining traditional retrieval techniques, such as mixed ranking, with neural network based image features. During our experiments, the NetVLAD feature space showed good performance on a relatively large-scale image corpus; however, we showed that these features are not invariant to scale or rotation, which limits the model and may cause it to perform poorly on videos taken with a tilted camera.

We also presented a method to visualize the attention of the NetVLAD model via the similarity heatmap between two images. The method proved insightful for understanding the attention of the model to structures and colors within a city landscape.

In our experiments we showed that a traditional information retrieval approach, aggregating ranked image retrieval results through voting, improves the geolocation of videos. The best model in our experiments was the mixed voting that combines weighted-rank and density-based voting and incorporates SIFT matching scores. This final mixed voting method achieved a precision of 0.9 at a distance of 150 meters, which is approximately two blocks within Pittsburgh Downtown.

9 Future Work

The analysis based on the oracle system showed that the aggregation fails to consistently select the correct nearby locations, which indicates that there is still room for improvement of our method on the Pittsburgh Urban Video dataset.

After observing the images in the Google Street View Pittsburgh 250k dataset, we realized that one possible way to increase the performance of our method is to occlude the sky while measuring the similarity between images. Shen and Wang [\citenameShen and Wang2013] proposed a method to segment the sky through incremental detection of the horizon, which is well suited to the kind of urban imagery found in GSV. Another approach is to use neural network based image segmentation methods such as Mask R-CNN [\citenameHe et al.2017] or DeepLab [\citenameChen et al.2018].

It also came to our attention that in our best method the SIFT matching score carried a large weight in the final vote for the location of the videos. This is an invitation to explore the incorporation of other traditional image features such as AKAZE or BRIEF, the latter being perhaps the most suitable descriptor due to its low computational cost given the large scale of the problem.

As a final note, since NetVLAD features lack rotation and scale invariance, we also recommend augmenting the reference database by randomly rotating and scaling the original content to overcome this limitation.

References

  1. Pablo F Alcantarilla and TrueVision Solutions. 2011. Fast explicit diffusion for accelerated features in nonlinear scale spaces. IEEE Trans. Patt. Anal. Mach. Intell, 34(7):1281–1298.
  2. Relja Arandjelović and Andrew Zisserman. 2014. Dislocation: Scalable descriptor distinctiveness for location recognition. In Asian Conference on Computer Vision, pages 188–204. Springer.
  3. Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. 2015. NetVLAD: CNN architecture for weakly supervised place recognition. arXiv preprint arXiv:1511.07247.
  4. Michael Calonder, Vincent Lepetit, Mustafa Ozuysal, Tomasz Trzcinski, Christoph Strecha, and Pascal Fua. 2012. BRIEF: Computing a local binary descriptor very fast. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(7):1281–1298.
  5. Song Cao and Noah Snavely. 2013. Graph-based discriminative learning for location recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 700–707.
  6. David M Chen, Georges Baatz, Kevin Köser, Sam S Tsai, Ramakrishna Vedantham, Timo Pylvänäinen, Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. 2011. City-scale landmark identification on mobile devices. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 737–744. IEEE.
  7. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848.
  8. M Ester, HP Kriegel, J Sander, and X Xu. 1996. Density-based spatial clustering of applications with noise. In Int. Conf. Knowledge Discovery and Data Mining, volume 240.
  9. PC Fishburn. 1967. Additive utilities with incomplete product set: Applications to priorities and sharings. Operations Research Society of America (ORSA), Baltimore, MD, USA.
  10. Petr Gronat, Guillaume Obozinski, Josef Sivic, and Tomas Pajdla. 2013. Learning and calibrating per-location classifiers for visual place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907–914.
  11. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE.
  12. Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3304–3311. IEEE.
  13. Jan Knopp, Josef Sivic, and Tomas Pajdla. 2010. Avoiding confusing features in place recognition. In European Conference on Computer Vision, pages 748–761. Springer.
  14. David G Lowe. 1999. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE.
  15. Torsten Sattler, Bastian Leibe, and Leif Kobbelt. 2011. Fast image-based localization using direct 2D-to-3D matching. In 2011 International Conference on Computer Vision, pages 667–674. IEEE.
  16. Grant Schindler, Matthew Brown, and Richard Szeliski. 2007. City-scale location recognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE.
  17. Yehu Shen and Qicong Wang. 2013. Sky region detection in a single image for autonomous ground robot navigation. International Journal of Advanced Robotic Systems, 10.
  18. Jasper Snoek, Luciano Sbaiz, and Hrishikesh Aradhye. 2011. From videos to places: Geolocating the world’s videos. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 823–832. IEEE.
  19. Yi-Cheng Song, Yong-Dong Zhang, Juan Cao, Tian Xia, Wu Liu, and Jin-Tao Li. 2012. Web video geolocation by geotagged social resources. IEEE Transactions on Multimedia, 14(2):456–470.
  20. Akihiko Torii, Josef Sivic, Tomas Pajdla, and Masatoshi Okutomi. 2013. Visual place recognition with repetitive structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 883–890.
  21. Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer vision–ECCV 2014, pages 818–833. Springer.