Learning Discriminative Affine Regions via Discriminability
We present an accurate method for estimating the affine shape of local features. The method is trained in a novel way, exploiting the recently proposed HardNet triplet loss. The loss function is driven by patch descriptor differences, avoiding problems with symmetries. Moreover, the training process does not require precisely geometrically aligned patches. The affine shape is represented in a way amenable to learning by stochastic gradient descent.
When plugged into a state-of-the-art wide baseline matching algorithm, the performance on standard datasets improves in both the number of challenging pairs matched and the number of inliers.
Finally, AffNet, in combination with the Hessian detector and the HardNet descriptor, improves the bag-of-visual-words based state of the art on Oxford5k and Paris6k by a large margin: 4.5 and 4.2 mAP points respectively.
The source code and trained networks are available at https://github.com/ducha-aiki/affnet
Handcrafted local features, as a source of correspondences, are still a component of the state of the art in 3D reconstruction [40, 41], two-view matching, and 6DOF image localization. Classical local features have also been successfully used to provide supervision for CNN-based image retrieval.
The gold standard local feature in image retrieval is the Hessian-Affine detector [24, 32, 45] combined with the RootSIFT descriptor [23, 2]. This is mostly because the affine-covariant shape allows robust matching of images separated by a wide baseline [25, 28], unlike scale-covariant features such as ORB or difference of Gaussians (DoG) that rely on tests carried out on circular neighborhoods.
On the other hand, DoG is the de facto standard feature for large-scale 3D reconstruction. The main reason, besides a lower computational cost, is that the affine adaptation procedure reduces the number of local features by 20%-40% [25, 29], which significantly hurts the 3D reconstruction quality. Solving the problem of the drop in the number of correspondences caused by the non-repeatability of the affine adaptation procedure, and thus connecting 3D reconstruction and image retrieval engines, brings potentially significant benefits, as demonstrated in [42, 35].
This paper presents a CNN-based method for affine-shape and orientation estimation of local features and makes two contributions. First, we propose a novel method for learning the affine shape, orientation and, potentially, other parameters related to geometric and appearance properties of features. The learning method does not require precise ground truth, which reduces the need for manual annotation. The second contribution is the learned AffNet itself, which significantly outperforms previous methods for affine shape estimation and improves the state of the art in image retrieval by a large margin.
1.1 Related work
The area of learning local features has been active recently, but attention has focused dominantly on learning descriptors [52, 11, 4, 51, 27, 54, 9] and translation-covariant detectors [46, 53, 22, 39]. The authors are not aware of any recent work on learning or improving local feature affine shape estimation. The most closely related work is thus the following. Yi et al. proposed to learn feature orientation by minimizing the descriptor distance between positive patches, which correspond to the same point on a 3D surface. This avoids hand-picking a "canonical" orientation, instead learning the one best suited for descriptor matching. We have observed that direct application of this approach to affine shape estimation leads to learning a degenerate shape.
We extend, and at the same time fix, this idea by maximizing the descriptor distance between the positive and the hardest negative descriptor. This extension is inspired by the descriptor and metric learning used in [27, 13].
Yi et al. proposed a multi-stage framework for learning the descriptor, orientation and a translation-covariant detector. The detector was learned by maximizing the intersection-over-union and minimizing the reprojection error between corresponding regions.
Lenc and Vedaldi introduced the “covariant constraint” for learning various types of local feature detectors. The proposed covariant loss is the Frobenius norm of the difference between the local affine frames. The disadvantage of such an approach is that it can lead to features which, while repeatable, are not necessarily suited for the matching task. On top of that, a common drawback of the Yi et al. and Lenc and Vedaldi methods is that they require knowing the exact geometric relationship between patches, which increases the amount of work needed to prepare the training dataset.
Zhang et al. proposed to “anchor” the detected features to pre-defined features with known good discriminability, such as TILDE. We remark that, despite showing images of affine-covariant features, the results presented in the paper are for translation-covariant features. Savinov et al. proposed a ranking approach for unsupervised learning of a feature detector. While this is natural and efficient for learning the coordinates of the feature center, it is problematic to apply to affine shape estimation, as it requires sampling and scoring many possible shapes.
Finally, Choy et al. trained a “Universal correspondence network” (UCN) for direct correspondence estimation with a contrastive loss on patch descriptor distance. The approach is related to the current work, but the two methods differ in several important aspects. First, UCN used an ImageNet-pretrained network which is only fine-tuned; we learn the affine shape estimation from scratch. Second, UCN uses dense feature extraction and negative examples extracted from the same image. While this may be a good setup for short baseline stereo, it does not work well for wide baselines, where affine features are usually sought. Finally, we use the hard triplet margin loss instead of the contrastive one.
2 Learning affine shape
2.1 Training procedure
The main blocks of the proposed training procedure are shown in Figure 2. First, a batch of matching patch pairs (P_i, P'_i) is generated, where P_i and P'_i correspond to the same point on a 3D surface. Transformation matrices T_i and T'_i, consisting of random rotation and skew, are generated randomly and independently. The patches P_i and P'_i are warped by T_i and T'_i respectively into transformed patches. The central patch is then cropped, and the pair of transformed patches is fed into the AffNet convolutional neural network, which predicts a pair of affine transformations A_i, A'_i; these are applied to the transformed patches via spatial transformers (ST).
The geometrically normalized patches are cropped and fed into the descriptor network, e.g. SIFT, producing descriptors s_i, s'_i. The descriptors are then used to form triplets by the procedure proposed in , followed by the triplet margin loss

L = (1/n) * sum_i max(0, 1 + d(s_i, s'_i) - d(s_i, N_i)),

where d(s_i, s'_i) is the distance between the matching pair and d(s_i, N_i) is the distance to the hardest negative in the batch for that pair.
Using triplets is crucial for learning an affine shape, and this differs from the works [49, 22], where only positive distances were minimized. Removing the negative example from the triplet leads to a situation where the CNN learns to predict a transformation which collapses the ellipse into a line. Such a collapsed ellipse corresponds to a uniformly colored patch with a very small descriptor distance between positive patches. We did not succeed in learning an affine shape regressor by minimizing only positive distances, despite testing a wide range of hyperparameters.
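The hardest-in-batch triplet margin loss described above can be sketched in plain Python. This is an illustrative re-implementation, not the paper's code; the actual pipeline operates on GPU tensors over large batches.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """HardNet-style loss: for each matching pair (s_i, s'_i), the negative
    is the closest non-matching descriptor found anywhere in the batch."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        d_pos = l2(anchors[i], positives[i])
        # hardest negative: nearest descriptor belonging to another pair
        d_neg = min(
            min(l2(anchors[i], positives[j]), l2(anchors[j], positives[i]))
            for j in range(n) if j != i
        )
        total += max(0.0, margin + d_pos - d_neg)
    return total / n
```

Note how the degenerate "collapsed ellipse" solution is penalized: if all descriptors become identical, both the positive and the hardest-negative distances shrink to zero and the loss saturates at the margin instead of vanishing.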
2.2 Affine shape parametrization
Full affine shape, a.k.a. an oriented ellipse or a local affine frame (LAF), is defined by the 6 parameters of the affine matrix. Two of them form the translation vector. The remaining four can be decomposed in the following way:

A = lambda * R(psi) * diag(t, 1) * R(phi),

where lambda is an isotropic scale, R(psi) and R(phi) are rotations, and diag(t, 1) is a diagonal matrix with t >= 1. Parameter t represents anisotropic scaling along the axis defined by the rotation R(phi), and phi is the orientation of the ellipse (see Figure 4).
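This decomposition can be checked numerically. The sketch below (illustrative, with the symbol names lambda, psi, t, phi assumed) composes the four parameters into a 2x2 affine matrix; since rotations have unit determinant, the identity det(A) = lambda^2 * t must hold.

```python
import math

def rot(theta):
    """2x2 rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def compose_affine(lam, psi, t, phi):
    """A = lambda * R(psi) * diag(t, 1) * R(phi): lam is the isotropic
    scale, t >= 1 the anisotropic scaling, phi its direction."""
    m = matmul(matmul(rot(psi), [[t, 0.0], [0.0, 1.0]]), rot(phi))
    return [[lam * m[i][j] for j in range(2)] for i in range(2)]

def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]
```

For example, compose_affine(2.0, 0.3, 1.5, -0.7) yields a matrix whose determinant is 2.0**2 * 1.5 = 6.0 up to floating-point error.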
Translation and scale are obtained by the keypoint detector, typically from an image pyramid. Orientation can be obtained from an orientation estimator applied to the normalized patch. A commonly used procedure is to use the dominant gradient orientation  or OriNet .
We need to estimate the two remaining parameters: the anisotropic scale and its direction. While this could be done directly, we found that it is easier for the CNN to learn an "upright" residual to the identity matrix:

A = I + [[d1, 0], [d2, d3]],

where (d1, d2, d3) are the three network outputs. The matrix A is then scale-normalized, so that the area of the corresponding ellipse is equal to that of the unit circle.
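A minimal sketch of this parametrization, assuming the three outputs enter as residuals d1, d2, d3 to the identity matrix (the exact layout in the released code may differ): the upright matrix keeps a zero upper-right element, and dividing by sqrt(det) makes the ellipse area equal to that of the unit circle.

```python
import math

def upright_shape(d1, d2, d3):
    """Build the upright shape matrix as a residual to the identity,
    then normalize so that det(A) = 1 (unit-circle area)."""
    a = [[1.0 + d1, 0.0],
         [d2, 1.0 + d3]]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det <= 0:
        raise ValueError("degenerate shape prediction")
    s = 1.0 / math.sqrt(det)
    return [[s * x for x in row] for row in a]
```

The determinant check also makes explicit why a separate scale estimate is unnecessary: any overall scale in the prediction is divided out by the normalization.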
We also tried to learn a scale correction, but the network kept increasing the scale until it reached the boundaries of the patch or of the image it is sampled from. This is in line with the known observation that increasing the local feature scale increases its discriminability when the scene is planar.
Another possibility is to estimate the matrix of image moments, as done by the classical affine shape adaptation procedure , and then form an upright LAF by the matrix inverse square root; however, we found that this leads to instability in training.
We randomly generated affine transformations consisting of a random rotation, tied for a pair of corresponding patches, and anisotropic scaling in a random direction, whose magnitude is gradually increased during training from an initial value of 3 to 4.5 at the middle of training. The tilt parameter is uniformly sampled from that range. We did not use larger tilts because of black boundaries appearing in the image. Such boundaries, if they appear in a patch shown to the network, give it a strong clue about the transformation and hurt generalization considerably.
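The magnitude schedule above can be expressed as a simple curriculum function. This is a sketch; the exact shape of the schedule (here linear until mid-training, constant afterwards) is an assumption based on the text.

```python
def max_tilt_at(epoch, n_epochs, start=3.0, end=4.5):
    """Augmentation-magnitude curriculum: grows linearly from `start`
    to `end` by the middle of training, then stays constant."""
    half = n_epochs / 2.0
    if epoch >= half:
        return end
    return start + (end - start) * (epoch / half)
```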
2.4 Implementation details
The CNN architecture is adopted from HardNet and shown in Figure 3, but the number of channels in all layers is decreased by 2x and the final 128-d output is replaced by a 3-d output predicting the ellipse shape parameters. The network formula is 16C3-16C3-32C3/2-32C3-64C3/2-64C3-3C8, where 32C3/2 stands for a 3x3 convolution with 32 filters and stride 2. Zero-padding is applied to all convolutional layers to preserve the spatial size, except for the final one. A batch normalization layer followed by a ReLU non-linearity is added after each layer, except the last one, which is followed by a hyperbolic tangent activation. Dropout regularization with a 0.25 dropout rate is applied before the last convolutional layer. Grayscale input patches are normalized by subtracting the per-patch mean and dividing by the per-patch standard deviation.
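The per-patch input normalization can be sketched as follows; this is an illustrative pure-Python version of the standard mean/std normalization, with a small epsilon (an assumption, not stated in the text) to guard against constant patches.

```python
import math

def normalize_patch(patch, eps=1e-8):
    """Per-patch normalization of a grayscale patch (list of rows):
    subtract the patch mean, divide by the patch standard deviation."""
    flat = [p for row in patch for p in row]
    n = len(flat)
    mean = sum(flat) / n
    var = sum((p - mean) ** 2 for p in flat) / n
    std = math.sqrt(var) + eps
    return [[(p - mean) / std for p in row] for row in patch]
```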
We implemented the SIFT descriptor in a differentiable manner in the PyTorch library and used it for training AffNet. We tried several other descriptor variants: HardNet, TFeat and raw pixels, but SIFT works best for training. The first two are too robust to small geometric changes and have also been trained on the same dataset we use, thus providing weak gradients. Raw pixels, on the contrary, make training very unstable and prone to local minima.
Optimization is done by stochastic gradient descent with a learning rate of 0.005, momentum of 0.9 and weight decay of 0.0001. The learning rate was linearly decayed to zero within 20 epochs. Training is done with the PyTorch library and takes 24 hours on a Titan X GPU; the bottleneck is the data augmentation procedure.
Inference time is 0.1 ms per patch on a Titan X, including patch sampling, which is done on the CPU; a Baumberg iteration takes 0.05 ms per patch on the CPU.
2.5 Training dataset
The UBC Phototour dataset is used for training. It consists of three subsets: Liberty, Notre Dame and Yosemite, with about 400k normalized 64x64 patches in each, detected by DoG and multiscale Harris detectors and verified by a 3D reconstruction model. We randomly sample 10M pairs for training.
Note that although positive patches correspond to roughly the same point on a 3D surface, they are not perfectly aligned, and thus exhibit position, scale, rotation and affine noise. Using the HPatches dataset for training might improve results, but we decided to keep it for testing repeatability.
3 Empirical evaluation
Repeatability of affine detectors: the Hessian detector + affine shape estimator was benchmarked following the classical work by Mikolajczyk et al. , but on the recently introduced, larger HSequences dataset , using the VLBenchmarks toolbox .
HSequences consists of two subsets. The Illumination part contains 57 image sixplets with illumination changes, both natural and artificial. There is no viewpoint difference in this subset; the geometric relation between images in a sixplet is the identity.
The second part is Viewpoint, where 59 image sixplets vary in scale and rotation, but mostly in horizontal tilt. The average viewpoint change is a bit smaller than in the well-known graffiti sequence from the Oxford-Affine dataset .
Local features are detected in a pair of images and reprojected by the ground truth homography to the reference image, and the closest reprojected region is found for each region from the reference image.
A correspondence is considered correct when the overlap error of the pair is less than 40%. The repeatability score for a given pair of images is the ratio between the number of correct correspondences and the smaller of the numbers of regions detected in the part of the scene common to both images.
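The protocol reduces to simple arithmetic once overlaps are computed; a sketch with hypothetical helper names:

```python
def is_correct(overlap, threshold=0.4):
    """A region pair is a correct correspondence when the overlap error
    (1 - intersection/union of the two regions) is below 40%."""
    return (1.0 - overlap) < threshold

def repeatability(n_correct, n_regions_a, n_regions_b):
    """Correct correspondences divided by the smaller number of regions
    detected in the part of the scene shared by the two images."""
    denom = min(n_regions_a, n_regions_b)
    return n_correct / denom if denom else 0.0
```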
Results are shown in Figure 6. The original affine shape estimation procedure, implemented in , is denoted Baum SS 19, as patches are sampled from the scale space. AffNet takes patches sampled from the original image, so for a fair comparison we also tested Baum versions where patches are sampled from the original image, with 19 and 33 pixel patch sizes.
AffNet slightly outperforms all variants of the Baumberg procedure on images with viewpoint changes in terms of repeatability, and more significantly in the number of correspondences. The difference is even bigger for the images with illumination changes only, where AffNet performs almost the same as the plain Hessian, which is the upper bound here, as this part of the dataset has no viewpoint changes.
The difference between AffNet and Baumberg procedure repeatability is illustrated in Figure 7.
One of the reasons for this difference is the feature rejection strategy. The iterative Baumberg procedure rejects a feature in one of three cases. First, too elongated ellipses, with a long-to-short axis ratio of more than six, are rejected. Second, features which touch the boundary of the image are rejected; this holds for the AffNet post-processing procedure as well. But as AffNet tends to output less elongated shapes, these two situations happen much less often, increasing the number of surviving features by 25%. Third, features whose shape has not converged to a fixed state within sixteen iterations are discarded as well. The third case is quite rare and happens in approximately 1% of cases.
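The rejection logic can be summarized as follows; the argument names and thresholds mirror the three rules above and are illustrative, not taken from the released code.

```python
def keep_feature(axis_ratio, touches_boundary, iterations, converged,
                 max_ratio=6.0, max_iters=16):
    """Baumberg-style post-processing: discard overly elongated ellipses,
    boundary-touching features, and shapes that failed to converge."""
    if axis_ratio > max_ratio:       # rule 1: too elongated
        return False
    if touches_boundary:             # rule 2: touches the image border
        return False
    if not converged and iterations >= max_iters:
        return False                 # rule 3: no convergence in budget
    return True
```

Since AffNet predicts its shape in a single pass and outputs less elongated ellipses, only the first two rules apply to it, and they fire far less often.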
Examples of shapes estimated by AffNet and the Baumberg procedure are shown in Figure 5. In general, AffNet tends to output more circular-shaped patches than its competitor. The overlap error between shapes estimated in different conditions is also smaller.
3.2 Wide baseline stereo
(Table 1 datasets: EF, EVD, OxAff, SymB, GDB, LTLL.)
We conducted an experiment on wide baseline stereo, following the local feature detector benchmark protocol defined in , on a set of two-view matching datasets [12, 47, 55, 10]. The local features are detected by the benchmarked detectors, described by HardNet++  and HalfRootSIFT , and geometrically verified by RANSAC . The following two metrics are reported: the number of successfully matched image pairs and the average number of correct inliers per matched pair. We replaced the original affine shape estimator in Hessian-Affine with AffNet, on top of the Hessian and Adaptive-threshold Hessian (AdHess) detectors. Second, we also tested HesAffNet in a view-synthesis-based matcher called MODS .
The results are shown in Table 1. Our method outperforms the original one in the number of registered image pairs and/or the number of correct inliers on all datasets, including painting-to-photo pairs in SymB  and multimodal pairs in GDB , despite not being trained for those domains.
In the view synthesis setup, AffNet performs almost identically to the classical method and does not bring any benefit, as the view synthesis procedure itself handles viewpoint differences. Moreover, AffNet fails on the hardest pair in the EVD dataset. A possible reason is that the network was not trained for the severe anisotropic blur which occurs in synthesized images.
3.3 Learned feature orientation for wide baseline stereo
In addition to learning the affine shape, we ran an experiment on learning feature orientation by the same procedure. The differences to the AffNet learning are marginal: a 2-d output and rotation-only augmentation of the input patches. Two loss functions are tested: minimization of the distance between positive patches, as in , and our HardNet-based triplet loss. The difference to affine shape learning is that rotation does not change the content of the patch, and robust repeatability is the only desired property.
The evaluation protocol is identical to the previous subsection. HesAffNet is used as the detector, with a combination of the HalfRootSIFT and HardNet++ descriptors. Our networks are denoted ONetPos and ONetPosNeg respectively. We compare them to dominant gradient orientation  with a single assignment – DomOri, with multiple assignments – DomOriMulti, and to OriNet .
Results are in Table 4. All methods show similar performance; we can thus recommend our method as more universal than that of Yi et al. , since ours can be used for learning both orientation and affine shape.
3.4 Image retrieval
We evaluate the proposed approach, and compare it against related ones, on the practical application of image retrieval with local features. As a local feature detector, all methods use the multi-scale Hessian-affine detector ; for our method only, the Baumberg method for affine shape estimation is replaced with the proposed AffNet. This leads to an increase in the number of used features, from 12.5M to 17.5M for Oxford5k and from 15.6M to 21.2M for Paris6k, because more features survive, as explained in Section 3.1. As feature descriptors, we compare handcrafted descriptors such as SIFT  and RootSIFT , as well as the state-of-the-art learned descriptor HardNet++ . For all methods, a 128-D descriptor is produced per local feature.
Standard image retrieval datasets are used for the evaluation, i.e., Oxford5k  and Paris6k  datasets. Both datasets contain a set of images (5062 for Oxford5k and 6391 for Paris6k) depicting 11 different landmarks together with distractors. For each of the 11 landmarks there are 5 different query regions defined by a bounding box, constituting 55 query regions per dataset. The performance is reported as mean average precision (mAP) .
First, we make a comparison on the traditional bag-of-words (BoW)  image retrieval pipeline. A flat vocabulary with one million centroids is created with the k-means algorithm and approximate nearest neighbor search . All descriptors of an image are first assigned to the respective centroids of the vocabulary, and then aggregated into a histogram of occurrences, the BoW image representation. Because of the sparse nature of the BoW representation, an inverted file is used for very efficient image search . To provide a more thorough comparison, we also apply methods known to improve image retrieval performance, i.e., spatial verification (SV)  and standard query expansion (QE) . The comparison is given in Table 2. On both Oxford5k and Paris6k, and in all settings, our approach achieves the best results, in most cases outperforming the second best approach by a large margin. This experiment clearly shows the benefit of using AffNet in the local feature detection pipeline. Interestingly, a pure BoW representation with HesAffNet–HardNet++ outperforms the handcrafted HesAff–RootSIFT approach even when the latter uses spatial verification. This is a strong benefit, because the spatial verification step has a very high memory and processing cost.
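The BoW assignment step can be sketched as follows. For clarity, this uses exact nearest-centroid search; the actual pipeline uses approximate search over a one-million-word vocabulary.

```python
def l2sq(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bow_histogram(descriptors, centroids):
    """Assign each local descriptor to its nearest vocabulary centroid
    and accumulate the bag-of-words histogram of occurrences."""
    hist = [0] * len(centroids)
    for d in descriptors:
        idx = min(range(len(centroids)), key=lambda k: l2sq(d, centroids[k]))
        hist[idx] += 1
    return hist
```

Because most of the million histogram bins are zero for any given image, search is implemented with an inverted file mapping each visual word to the images containing it.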
Additionally, we make a comparison with state-of-the-art local-feature-based image retrieval methods. For our method (HesAffNet–HardNet++–HQE), a visual vocabulary of 65k visual words is learned, with the additional Hamming embedding (HE)  technique that further refines descriptor assignments with a 128-bit binary signature. We follow the same procedure as the HesAff–RootSIFT–HQE  method. Specifically, we use: (i) weighting of the votes as a decreasing function of the Hamming distance ; (ii) burstiness suppression ; (iii) multiple assignment of features to visual words [34, 18]; and (iv) QE with feature aggregation . All parameters are set as in . The performance of our method is the best reported on both Oxford5k and Paris6k. Furthermore, we outperform a method learning the vocabulary on the same dataset: mAP 89.1 was reported  on Oxford5k by learning it on the same dataset comprising the relevant images. We even outperform a method that uses a significantly higher number of local features  (mAP 89.4 reported on Oxford5k), i.e., 22M compared to the 17.5M used here.
We presented a method for learning the affine shape of local features in a weakly supervised manner. The resulting AffNet regressor bridges the gap between the performance of similarity-covariant and affine-covariant detectors on images with a short baseline and big illumination differences, and it improves the performance of affine-covariant detectors in the wide baseline setup. We also presented observations on affine shape parametrization suitable for CNN-based learning.
AffNet applied to the output of the Hessian detector improves the state of the art in wide baseline matching, affine detector repeatability and image retrieval. The source code and trained CNNs are available at https://github.com/ducha-aiki/affnet
The authors were supported by the Czech Science Foundation Project GACR P103/12/G084, the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center, the CTU student grant SGS17/185/OHK3/3T/13, and the MSMT LL1303 ERC-CZ grant.
-  PyTorch. http://pytorch.org.
-  R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2911–2918, 2012.
-  V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference (BMVC), 2016.
-  A. Baumberg. Reliable feature matching across widely separated views. In CVPR, pages 1774–1781. IEEE Computer Society, 2000.
-  M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision (IJCV), 74(1):59–73, 2007.
-  C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
-  K. Cordes, B. Rosenhahn, and J. Ostermann. Increasing the accuracy of feature evaluation benchmarks using differential evolution. In IEEE Symposium Series on Computational Intelligence (SSCI) - IEEE Symposium on Differential Evolution (SDE), Apr. 2011.
-  A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. A. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1734–1747, 2016.
-  B. Fernando, T. Tommasi, and T. Tuytelaars. Location recognition over large time lags. Computer Vision and Image Understanding, 139:21 – 28, 2015.
-  X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.
-  D. C. Hauagge and N. Snavely. Image matching using local symmetry features. In Computer Vision and Pattern Recognition (CVPR), pages 206–213, 2012.
-  A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification. ArXiv e-prints, Mar. 2017.
-  S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv 1502.03167, 2015.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. ArXiv e-prints, June 2015.
-  H. Jegou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Computer Vision and Pattern Recognition (CVPR), pages 1169–1176, 2009.
-  H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision (IJCV), 87(3):316–336, 2010.
-  H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek. Accurate image search using the contextual dissimilarity measure. Pattern Analysis and Machine Intelligence (PAMI), 32(1):2–11, 2010.
-  A. Kelman, M. Sofka, and C. V. Stewart. Keypoint descriptors for matching across multiple image modalities and non-linear intensity variations. In CVPR 2007, 2007.
-  K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC. In British Machine Vision Conference (BMVC), 2012.
-  K. Lenc, V. Gulshan, and A. Vedaldi. Vlbenchmarks, 2012.
-  K. Lenc and A. Vedaldi. Learning Covariant Feature Detectors, pages 100–117. Springer International Publishing, Cham, 2016.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
-  K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision (IJCV), 60(1):63–86, 2004.
-  K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision (IJCV), 65(1):43–72, 2005.
-  A. Mikulik, M. Perdoch, O. Chum, and J. Matas. Learning vocabularies over a fine quantization. International Journal of Computer Vision (IJCV), 103(1):163–175, 2013.
-  A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas. Working hard to know your neighbor’s margins: Local descriptor learning loss. In Proceedings of NIPS, Dec. 2017.
-  D. Mishkin, J. Matas, and M. Perdoch. Mods: Fast and robust method for two-view matching. Computer Vision and Image Understanding, 141:81 – 93, 2015.
-  D. Mishkin, J. Matas, M. Perdoch, and K. Lenc. Wxbs: Wide baseline stereo generalizations. Arxiv 1504.06603, 2015.
-  M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision Theory and Application (VISSAPP), pages 331–340, 2009.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
-  M. Perdoch, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9–16, 2009.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
-  F. Radenovic, J. L. Schonberger, D. Ji, J.-M. Frahm, O. Chum, and J. Matas. From dusk till dawn: Modeling in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5488–5496, 2016.
-  F. Radenovic, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European Conference on Computer Vision (ECCV), pages 3–20, 2016.
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In International Conference on Computer Vision (ICCV), pages 2564–2571, 2011.
-  T. Sattler, W. Maddern, A. Torii, J. Sivic, T. Pajdla, M. Pollefeys, and M. Okutomi. Benchmarking 6DOF Urban Visual Localization in Changing Conditions. ArXiv e-prints, July 2017.
-  N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. ArXiv e-prints, Nov. 2016.
-  J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
-  J. L. Schonberger, H. Hardmeier, T. Sattler, and M. Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  J. L. Schonberger, F. Radenovic, O. Chum, and J.-M. Frahm. From single image query to detailed 3D reconstruction. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5126–5134, 2015.
-  J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), pages 1470–1477, 2003.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
-  G. Tolias and H. Jegou. Visual query expansion with or without geometry: refining local descriptors by feature aggregation. Pattern Recognition, 47(10):3466–3476, 2014.
-  Y. Verdie, K. Yi, P. Fua, and V. Lepetit. Tilde: a temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5279–5288, 2015.
-  G. Yang, C. V. Stewart, M. Sofka, and C.-L. Tsai. Registration of challenging image pairs: Initialization, estimation, and decision. Pattern Analysis and Machine Intelligence (PAMI), 29(11):1973–1989, 2007.
-  K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In European Conference on Computer Vision (ECCV), pages 467–483, 2016.
-  K. M. Yi, Y. Verdie, P. Fua, and V. Lepetit. Learning to Assign Orientations to Feature Points. In Proceedings of the Computer Vision and Pattern Recognition, 2016.
-  G. Yu and J.-M. Morel. Asift: An algorithm for fully affine invariant comparison. Image Processing On Line, 1:11–38, 2011.
-  Y. Tian, B. Fan, and F. Wu. L2-Net: Deep learning of discriminative patch descriptor in euclidean space. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  X. Zhang, F. Yu, S. Karaman, and S.-F. Chang. Learning discriminative and transformation covariant local feature detectors. In CVPR, 2017.
-  X. Zhang, F. X. Yu, S. Kumar, and S.-F. Chang. Learning Spread-out Local Feature Descriptors. ArXiv e-prints, Aug. 2017.
-  C. L. Zitnick and K. Ramnath. Edge foci interest points. In International Conference on Computer Vision (ICCV), pages 359–366, 2011.