Learning Discriminative Affine Regions via Discriminability

Dmytro Mishkin, Filip Radenović and Jiři Matas
Visual Recognition Group, CTU in Prague
{mishkdmy, filip.radenovic, matas}@cmp.felk.cvut.cz

We present an accurate method for estimating the affine shape of local features. The method is trained in a novel way, exploiting the recently proposed HardNet triplet loss. The loss function is driven by patch descriptor differences, avoiding problems with symmetries. Moreover, such a training process does not require precisely geometrically aligned patches. The affine shape is represented in a way amenable to learning by stochastic gradient descent.

When plugged into a state-of-the-art wide baseline matching algorithm, the performance on standard datasets improves in both the number of challenging pairs matched and the number of inliers.

Finally, AffNet combined with the Hessian detector and the HardNet descriptor improves the bag-of-visual-words based state of the art on Oxford5k and Paris6k by a large margin, 4.5 and 4.2 mAP points respectively.

The source code and trained networks are available at https://github.com/ducha-aiki/affnet

Figure 1: Four challenging image pairs with significant structural (top two rows), illumination (3rd row) and extreme viewpoint (bottom) changes. Correct correspondences of the Hessian-AffNet detector verified by RANSAC are overlaid. Classical Hessian-Affine with Baumberg iterations fails on these pairs, i.e. the number and repeatability of its correspondences are insufficient for RANSAC to recover the dominant homography. Images are from [8, 12].

1 Introduction

Figure 2: AffNet training. Corresponding patches undergo random affine transformations, are cropped and fed into AffNet, which outputs an affine transformation to an unknown canonical shape. ST – the spatial transformer warps the patch into the estimated canonical shape. The patch is described by a differentiable version of the SIFT descriptor. The descriptor distance matrix is calculated and used to form triplets, according to the HardNet triplet margin loss [27].

Figure 3: AffNet architecture. Numbers on top denote the spatial size of the feature map, while those at the bottom denote the number of channels. /2 stands for stride 2.

Handcrafted local features, as a source of correspondences, are still a component of the state of the art in 3D reconstruction [40, 41], two-view matching [28] and 6DOF image localization [38]. Classical local features have also been successfully used for providing supervision for CNN-based image retrieval [36].

The gold standard for local features in image retrieval is the Hessian-Affine detector [24, 32, 45] combined with the RootSIFT descriptor [23, 2]. This is mostly because the affine-covariant shape allows robust matching of images separated by a wide baseline [25, 28], unlike scale-covariant features like ORB [37] or difference of Gaussians (DoG) [23], which rely on tests carried out on circular neighborhoods.

On the other hand, DoG is the de-facto standard feature for large scale 3D reconstruction [41]. The main reason, besides a lower computational cost, is that the affine adaptation procedure [5] reduces the number of local features by 20%-40% [25, 29], which significantly hurts the 3D reconstruction quality. Solving the problem of the drop in the number of correspondences caused by the non-repeatability of the affine adaptation procedure, and thus connecting 3D reconstruction and image retrieval engines, brings potentially significant benefits, as demonstrated in [42, 35].

The paper presents a CNN-based method for affine shape and orientation estimation of local features and makes two contributions. First, we propose a novel method for learning the affine shape, orientation and potentially other parameters related to the geometric and appearance properties of the feature. The learning method does not require precise ground truth, which reduces the need for manual annotation. The second contribution is the learned AffNet itself, which significantly outperforms previous methods for affine shape estimation and improves the state of the art in image retrieval by a large margin.

1.1 Related work

The area of learning local features has been active recently, but the attention has focused predominantly on learning descriptors [52, 11, 4, 51, 27, 54, 9] and translation-covariant detectors [46, 53, 22, 39]. The authors are not aware of any recent work on learning or improving local feature affine shape estimation. The most closely related work is thus the following. Yi et al. [49] proposed to learn feature orientation by minimizing the descriptor distance between positive patches, i.e. those that correspond to the same point on a 3D surface. This avoids hand-picking a "canonical" orientation and instead learns the one that is best suited for descriptor matching. We have observed that a direct application of the approach of [49] to affine shape estimation leads to learning a degenerate shape.

We extend, and at the same time fix, this idea by maximizing the descriptor distance between the positive and the hardest negative descriptor. This extension is inspired by the descriptor and metric learning used in [27, 13].

Yi et al. [48] proposed a multi-stage framework for learning the descriptor, orientation and translation-covariant detector. The detector was learned by maximizing the intersection-over-union and minimizing the reprojection error between corresponding regions.

Lenc and Vedaldi [22] introduced the “covariant constraint” for learning various types of local feature detectors. The proposed covariant loss is the Frobenius norm of the difference between the local affine frames. The disadvantage of such an approach is that it could lead to features that, while being repeatable, are not necessarily suited for the matching task. On top of that, the common drawback of the methods of Yi et al. [48] and Lenc and Vedaldi [22] is that they require the exact geometric relationship between patches to be known, which increases the amount of work needed to prepare the training dataset.

Zhang et al. [53] proposed to “anchor” the detected features to some pre-defined features with known good discriminability, like TILDE [46]. We remark that despite showing images of affine-covariant features, the results presented in that paper are for translation-covariant features. Savinov et al. [39] proposed a ranking approach for unsupervised learning of a feature detector. While this is natural and efficient for learning the coordinates of the center of the feature, it is problematic to apply it to affine shape estimation. The reason is that it requires sampling and scoring many possible shapes.

Finally, Choy et al. [7] trained a “Universal correspondence network” (UCN) for direct correspondence estimation with a contrastive loss on the patch descriptor distance. The approach is related to the current work, but the two methods differ in several important aspects. First, UCN uses an ImageNet-pretrained network that is only fine-tuned, whereas we learn the affine shape estimation from scratch. Second, UCN uses dense feature extraction and negative examples extracted from the same image. While this could be a good setup for short baseline stereo, it does not work well for wide baseline, where affine features are usually sought. Finally, we use the hard triplet margin loss instead of the contrastive one.

2 Learning affine shape

2.1 Training procedure

The main blocks of the proposed training procedure are shown in Figure 2. First, a batch of matching patch pairs is generated, where the two patches of each pair correspond to the same point on a 3D surface. Rotation and skew transformation matrices are randomly and independently generated. The two patches of each pair are warped by their respective matrices into transformed patches. Then the center patch is cropped, and the pair of transformed patches is fed into the convolutional neural network AffNet, which predicts a pair of affine transformations. These are applied to the transformed patches via spatial transformers (ST) [15].

The geometrically normalized patches are then cropped and fed into the descriptor network, e.g. SIFT, obtaining descriptors $(s_i, s'_i)$. The descriptors are then used to form triplets by the procedure proposed in [27], followed by the triplet margin loss:

$$L = \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\; 1 + d(s_i, s'_i) - d(s_i, N)\right), \qquad (1)$$

where $d(s_i, s'_i)$ is the distance between matching pairs and $d(s_i, N)$ is the distance to the hardest negative in the batch for the $i$-th pair.
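
The triplet formation can be illustrated with a minimal sketch in plain Python. It assumes Euclidean descriptor distances, a margin of 1 and the hardest-in-batch negative mining of [27], where the hardest negative for a pair is the closest non-matching descriptor found in either the row or the column of the batch distance matrix. The function names are hypothetical, and the sketch operates on precomputed descriptors rather than on a differentiable pipeline.

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptors given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """Mean triplet margin loss with the hardest in-batch negative.

    anchors[i] and positives[i] describe the same 3D point; all other
    descriptors in the batch act as negatives for pair i (needs n >= 2).
    The margin value is an assumption of this sketch.
    """
    n = len(anchors)
    # dist[i][j] = d(anchor_i, positive_j); the diagonal holds positive pairs.
    dist = [[l2(a, p) for p in positives] for a in anchors]
    total = 0.0
    for i in range(n):
        d_pos = dist[i][i]
        # Closest non-matching descriptor in row i or in column i.
        d_neg = min(
            min(dist[i][j] for j in range(n) if j != i),
            min(dist[j][i] for j in range(n) if j != i),
        )
        total += max(0.0, margin + d_pos - d_neg)
    return total / n
```

Dropping the `d_neg` term from this loss leaves only the positive distance, which is the setting that collapses to a degenerate shape.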

Using triplets is crucial for learning the affine shape and differs from the works [49, 22], where only positive distances were minimized. Removing the negative example from the triplet leads to a situation where the CNN learns to predict a transformation that collapses the ellipse into a line. Such a collapsed ellipse corresponds to a uniformly colored patch, which yields a very small descriptor distance between positive patches. We have not succeeded in learning an affine shape regressor by minimizing only the positive distances, despite testing a wide range of hyperparameters.

2.2 Affine shape parametrization

Figure 4: Affine camera model (2) with its anisotropic scale, latitude, longitude and scale parameters. Image courtesy of [28].

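The full affine shape, a.k.a. an oriented ellipse or a local affine frame (LAF), is defined by the 6 parameters of the affine matrix. Two of them form a translation vector $(x, y)$. The remaining four can be decomposed in the following way [50]:

$$A = \lambda\, R(\psi)\, T_t\, R(\phi) = \lambda \begin{pmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{pmatrix} \begin{pmatrix} t & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}, \qquad (2)$$

where $\lambda > 0$ is an isotropic scale, $R(\psi)$ and $R(\phi)$ are rotations, and $T_t$ is a diagonal matrix with $t > 1$. Parameter $t$ represents anisotropic scaling along the axis defined by the rotation $\phi$, and $\psi$ is the orientation of the ellipse (see Figure 4).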
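
As an illustration of this decomposition, the following sketch composes a 2x2 affine matrix from an isotropic scale, two rotation angles and an anisotropic scale. It assumes the product order scale · rotation · diag(t, 1) · rotation used in [50]; the helper names and the argument names `psi`, `tilt` and `phi` are chosen for this example only.

```python
import math


def rot(angle):
    """2x2 counter-clockwise rotation matrix as nested tuples."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s), (s, c))


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def compose_affine(scale, psi, tilt, phi):
    """Return scale * R(psi) * diag(tilt, 1) * R(phi)."""
    stretch = ((tilt, 0.0), (0.0, 1.0))
    m = matmul2(matmul2(rot(psi), stretch), rot(phi))
    return tuple(tuple(scale * v for v in row) for row in m)


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]
```

Since rotations have unit determinant, the determinant of the composed matrix equals the squared scale times the anisotropic scale, which gives a quick sanity check of the parametrization.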

Translation and scale are obtained by the keypoint detector, typically from an image pyramid. Orientation can be obtained from an orientation estimator applied to the normalized patch. A commonly used procedure is to use the dominant gradient orientation [23] or OriNet [49].

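We need to estimate the two remaining parameters, the anisotropic scale $t$ and the rotation $\phi$ that defines its axis. While this could be done directly, we found that it is easier for a CNN to learn an "upright" residual to the identity matrix $I$:

$$A' = I + \begin{pmatrix} a'_{11} & 0 \\ a'_{21} & a'_{22} \end{pmatrix} = \begin{pmatrix} 1 + a'_{11} & 0 \\ a'_{21} & 1 + a'_{22} \end{pmatrix}. \qquad (3)$$

Matrix $A'$ is then scale-normalized, so that the area of the ellipse is equal to that of the unit circle.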
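
A possible reading of this parametrization is sketched below. It is an assumption of the sketch, not a statement of the released implementation, that the three network outputs fill the lower-triangular residual (zero upper-right entry, so that the vertical direction stays vertical) and that scale normalization means dividing by the square root of the determinant. The function and argument names are hypothetical.

```python
import math


def upright_shape(a11, a21, a22):
    """Build I + residual as a lower-triangular matrix and set det to 1.

    Inputs are assumed to lie in (-1, 1), e.g. after a tanh activation,
    which keeps both diagonal entries positive.
    """
    m = ((1.0 + a11, 0.0), (a21, 1.0 + a22))
    det = m[0][0] * m[1][1]  # the upper-right entry is zero
    if det <= 0.0:
        raise ValueError("degenerate shape: non-positive determinant")
    s = math.sqrt(det)
    # Unit determinant means the ellipse area equals the unit circle area.
    return tuple(tuple(v / s for v in row) for row in m)
```

A zero residual returns the identity, i.e. a circular region, so the network only has to learn deviations from the detector's isotropic shape.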

We also tried to learn a scale correction, but the network increased the scale until it reached the boundaries of the patch or of the image it is sampled from. This is in line with the known observation [25] that increasing the local feature scale increases its discriminability when the scene is planar.

Another possibility is to estimate the matrix of image moments, as done by the classical affine shape adaptation procedure [5], and then make an upright LAF via the inverse matrix square root, but we have found that this leads to instability in training.

Figure 5: AffNet (top) and Baumberg procedure (bottom) estimated affine shape. One ellipse is detected in the reference image, the other is a reprojected closest match from the second image. Baumberg ellipses tend to be more elongated, AffNet ones usually have a smaller overlap error.

2.3 Augmentation

We randomly generated affine transformations, which consist of a random rotation, tied for the pair of corresponding patches, and anisotropic scaling in a random direction by a magnitude that is gradually increased during training from an initial value of 3 to 4.5 at the middle of training. The tilt parameter is sampled from a uniform distribution. We did not use larger tilts because black boundaries of the image start to appear. Such boundaries, if they appear in a patch shown to the network, give it a strong clue about the transformation and hurt generalization considerably.
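
The augmentation can be sketched as follows. Several details are assumptions of this sketch: the ramp of the maximum tilt is taken to be linear and to stay constant after mid-training, the tilt is drawn uniformly between 1 and the current maximum, and the stretch direction is drawn independently for each patch. All function names are hypothetical.

```python
import math
import random


def rot(angle):
    """2x2 counter-clockwise rotation matrix as nested tuples."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s), (s, c))


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def max_tilt(progress, start=3.0, end=4.5):
    """Maximum tilt for training progress in [0, 1] (assumed linear ramp)."""
    ramp = min(max(progress, 0.0) / 0.5, 1.0)
    return start + (end - start) * ramp


def sample_pair_transforms(progress, rng):
    """Return two 2x2 transforms sharing one rotation, with separate tilts."""
    theta = rng.uniform(0.0, 2.0 * math.pi)  # rotation tied for the pair
    transforms = []
    for _ in range(2):
        direction = rng.uniform(0.0, math.pi)
        tilt = rng.uniform(1.0, max_tilt(progress))
        # Stretch by `tilt` along `direction`: R(-d) * diag(tilt, 1) * R(d).
        stretch = matmul2(
            matmul2(rot(-direction), ((tilt, 0.0), (0.0, 1.0))),
            rot(direction),
        )
        transforms.append(matmul2(rot(theta), stretch))
    return tuple(transforms)
```

Passing an explicit `random.Random` instance keeps the sampled transformations reproducible across runs.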

2.4 Implementation details

The CNN architecture is adopted from HardNet [27] and shown in Figure 3, but the number of channels in all layers is halved and the last 128-d output is replaced by a 3-d output predicting the ellipse shape parameters. The network formula is 16C3-16C3-32C3/2-32C3-64C3/2-64C3-3C8, where 32C3/2 stands for a 3x3 convolutional kernel with 32 filters and stride 2. Zero padding is applied to all convolutional layers except the final one, to preserve the spatial size. A batch normalization [14] layer followed by a ReLU [31] non-linearity is added after each layer except the last one, which is followed by a hyperbolic tangent activation. Dropout [44] regularization with a 0.25 dropout rate is applied before the last convolutional layer. Grayscale input patches of size 32 × 32 pixels are normalized by subtracting the per-patch mean and dividing by the per-patch standard deviation.
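
The network formula can be checked with a short shape calculation. The sketch below assumes a 32 × 32 input patch, which is the size for which the final unpadded 8x8 kernel yields a single 3-d output; it also shows the per-patch normalization on a flat list of pixel values. The names `LAYERS`, `feature_map_sizes` and `normalize_patch` are hypothetical, and the small `eps` is an addition of this sketch to avoid division by zero on constant patches.

```python
import math

# (channels, kernel, stride, padding) read off the formula
# 16C3-16C3-32C3/2-32C3-64C3/2-64C3-3C8; only the last layer is unpadded.
LAYERS = [
    (16, 3, 1, 1), (16, 3, 1, 1),
    (32, 3, 2, 1), (32, 3, 1, 1),
    (64, 3, 2, 1), (64, 3, 1, 1),
    (3, 8, 1, 0),
]


def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_sizes(input_size=32):
    """Return (channels, spatial size) after every convolutional layer."""
    sizes = []
    size = input_size
    for channels, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes.append((channels, size))
    return sizes


def normalize_patch(pixels, eps=1e-8):
    """Subtract the per-patch mean and divide by the per-patch std."""
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    return [(p - mean) / (std + eps) for p in pixels]
```

For a 32 × 32 input the spatial sizes are 32, 32, 16, 16, 8, 8 and finally 1, so the last layer outputs exactly three numbers per patch.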

We have implemented the SIFT descriptor in a differentiable manner in the PyTorch [1] library and used it for training AffNet. We have tried several other descriptor variants: HardNet, TFeat and raw pixels, but SIFT works best for training. The first two are too robust to small geometric changes and have also been trained on the same dataset as we use, thus providing weak gradients. In contrast, raw pixels make training very unstable and prone to local minima.

Optimization is done by stochastic gradient descent with a learning rate of 0.005, momentum of 0.9 and weight decay of 0.0001. The learning rate is linearly decayed to zero within 20 epochs. Training is done with the PyTorch library. It takes 24 hours on a Titan X GPU, and the bottleneck is the data augmentation procedure.

Inference time is 0.1 ms per patch on a Titan X, including patch sampling, which is done on the CPU. A Baumberg iteration takes 0.05 ms per patch on the CPU.
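
The learning rate schedule can be written in a few lines. This is a minimal sketch that assumes the decay is applied per optimization step over the whole training run; the function name and the step-based granularity are assumptions, and only the base rate of 0.005 comes from the text.

```python
def learning_rate(step, total_steps, base_lr=0.005):
    """Linearly decay the learning rate from base_lr to zero."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - frac)
```

With `total_steps` set to the number of steps in 20 epochs, the rate reaches zero exactly at the end of training.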

2.5 Training dataset

The UBC Phototour [6] dataset is used for training. It consists of three subsets: Liberty, Notre Dame and Yosemite, with about 2 × 400k normalized 64x64 patches in each, detected by DoG and multiscale Harris detectors respectively. Patches are verified by a 3D reconstruction model. We randomly sample 10M pairs for training.

Note that although the patches of a positive pair correspond to roughly the same point on a 3D surface, they are not perfectly aligned and thus exhibit position, scale, rotation and affine noise. Using the HPatches dataset [3] for training may improve results, but we decided to keep it for testing repeatability.

3 Empirical evaluation

3.1 Repeatability

Figure 6: Mean (top) and median (bottom) repeatability and number of correspondences by the Mikolajczyk protocol [25] on the HSequences dataset [3]. Left – images with illumination differences, right – with viewpoint and scale changes. SS stands for a patch taken from the level of the scale pyramid where the detection took place, image – from the original image, in the same way as for description. 19 and 33 are patch sizes. The Hessian-Affine implementation is taken from [32]. Note that for illumination change plain Hessian is effectively an upper bound and our AffNet is quite close to it.

Repeatability of affine detectors (the Hessian detector combined with an affine shape estimator) was benchmarked following the classical work of Mikolajczyk et al. [25], but on the recently introduced, larger HSequences [3] dataset, using the VLBenchmarks toolbox [21].

HSequences consists of two subsets. The Illumination part contains 57 image sixplets with illumination changes, both natural and artificial. There is no difference in viewpoint in this subset; the geometric relation between images in a sixplet is the identity.

The second part is Viewpoint, where 59 image sixplets vary in scale and rotation, but mostly in horizontal tilt. The average viewpoint change is a bit smaller than in the well-known graffiti sequence from the Oxford-Affine dataset [25].

Local features are detected in pairs of images and reprojected by the ground truth homography to the reference image, and the closest reprojected region is found for each region from the reference image.

A correspondence is considered correct when the overlap error of the pair is less than 40%. The repeatability score for a given pair of images is the ratio between the number of correct correspondences and the smaller number of detected regions in the common part of the scene among the two images.
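
The final scoring step of this protocol can be sketched as below. The sketch assumes that the overlap error of the closest reprojected region has already been computed for each reference region, and that the region counts are restricted to the common part of the scene; computing the ellipse overlap itself is left out. The function and argument names are hypothetical.

```python
def repeatability(overlap_errors, n_ref, n_other, max_error=0.4):
    """Repeatability score for one image pair.

    overlap_errors: overlap error (0..1) of the closest reprojected region
        for each region of the reference image.
    n_ref, n_other: numbers of regions detected in the common part of the
        scene in the reference and in the other image.
    """
    correct = sum(1 for e in overlap_errors if e < max_error)
    denom = min(n_ref, n_other)
    return correct / denom if denom else 0.0
```

The number of correct correspondences, `correct`, is the second quantity reported alongside the repeatability score.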

Results are shown in Figure 6. The original affine shape estimation procedure, implemented in [32], is denoted Baum SS 19, as patches are sampled from the scale space. AffNet takes patches sampled from the original image. For a fair comparison, we therefore also tested Baum versions where patches are sampled from the original image, with patch sizes of 19 and 33 pixels.

AffNet slightly outperforms all the variants of the Baumberg procedure for images with viewpoint change in terms of repeatability, and more significantly in the number of correspondences. The difference is even bigger for the images with illumination change only, where AffNet performs almost the same as plain Hessian, which is the upper bound here, as this part of the dataset has no viewpoint changes.

The difference between AffNet and Baumberg procedure repeatability is illustrated in Figure 7.

One of the reasons for this difference is the feature-rejection strategy. The Baumberg iterative procedure rejects a feature in one of three cases. First, overly elongated ellipses, with a long-to-short axis ratio of more than six, are rejected. Second, features that touch the boundary of the image are rejected. The same holds for the AffNet post-processing procedure, but as AffNet tends to output less elongated shapes, these two situations happen much less often, increasing the number of surviving features by 25%. Third, features whose shape has not converged to a fixed state within sixteen iterations are also discarded. The third case is quite rare and happens in approximately 1% of cases.
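
The first two rejection tests can be sketched for a region given by a center, a scale and a 2x2 shape matrix. The axis ratio is the ratio of the singular values of the shape matrix. How exactly the original implementations test for touching the image boundary is not stated, so the axis-aligned extent check below is an assumption, as are the function names.

```python
import math


def axis_ratio(a):
    """Long-to-short axis ratio of the ellipse A * (unit circle)."""
    (p, q), (r, s) = a
    frob = p * p + q * q + r * r + s * s  # s1^2 + s2^2
    det = abs(p * s - q * r)              # s1 * s2
    disc = math.sqrt(max(frob * frob - 4.0 * det * det, 0.0))
    s1 = math.sqrt((frob + disc) / 2.0)
    s2 = math.sqrt(max((frob - disc) / 2.0, 0.0))
    return math.inf if s2 == 0.0 else s1 / s2


def keep_feature(a, center, scale, image_size, max_ratio=6.0):
    """Reject elongated regions and regions reaching the image border."""
    if axis_ratio(a) > max_ratio:
        return False
    (p, q), (r, s) = a
    # Half-extents of the ellipse along x and y are the row norms of A.
    half_w = scale * math.hypot(p, q)
    half_h = scale * math.hypot(r, s)
    cx, cy = center
    width, height = image_size
    return (cx - half_w >= 0 and cx + half_w <= width - 1
            and cy - half_h >= 0 and cy + half_h <= height - 1)
```

Less elongated shapes pass both tests more often, which is consistent with the larger number of surviving features reported for AffNet.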

Examples of shapes estimated by AffNet and by the Baumberg procedure are shown in Figure 5. In general, AffNet tends to output more circular patches than its competitor. The overlap error between shapes estimated in different conditions is also smaller.

3.2 Wide baseline stereo

                    EF [55]      EVD [28]     OxAff [25]   SymB [12]    GDB [47]     LTLL [10]
Detector            33    inl.   15    inl.   40    inl.   46    inl.   22    inl.   172   inl.
HesAff (original)   33    78     2     38     40    1008   34    153    17    199    26    34
HesAffNet           33    112    2     48     40    1181   37    203    19    222    46    36
AdHesAff            33    111    3     33     40    1330   35    190    19    286    28    35
AdHesAffNet         33    165    4     42     40    1567   37    275    21    336    48    39
WxBS-MODS           33    34     15    43     40    117    45    33     20    55     103   23
WxBS-MODS-AffNet    33    33     14    45     40    117    42    39     20    57     94    25

Table 1: Comparison of the affine shape estimators on wide baseline stereo with Hessian and adaptive Hessian detectors, following the protocol [29] on wide baseline stereo datasets. For each dataset, the number of matched image pairs and the average number of inliers (inl.) are reported. The numbers in the header correspond to the number of image pairs in each dataset. Best results are in bold. Note that MODS-based results cannot be compared to HesAff-based ones, because the iterative procedure in MODS stops matching as soon as a correct geometric transformation between the images is found.

We conducted an experiment on wide baseline stereo, following the local feature detector benchmark protocol defined in [29], on the set of two-view matching datasets [12, 47, 55, 10]. The local features are detected by the benchmarked detector, described by HardNet++ [27] and HalfRootSIFT [19], and geometrically verified by RANSAC [20]. Two metrics are reported: the number of successfully matched image pairs and the average number of correct inliers per matched pair. First, we replaced the original affine shape estimator in Hessian-Affine with AffNet, both for Hessian and for adaptive threshold Hessian (AdHess). Second, we also tested HesAffNet in the view-synthesis based matcher called MODS [28].

The results are shown in Table 1. Our method outperforms the original one in the number of registered image pairs and/or the number of correct inliers on all datasets, including painting-to-photo pairs in SymB [12] and multimodal pairs in GDB [47], despite not being trained for those domains.

In the view synthesis setup, AffNet performs almost identically to the classical method and does not bring any benefit, as the view synthesis procedure handles viewpoint differences. Moreover, AffNet fails on the hardest pair in the EVD dataset. A possible reason is that the network was not trained for the severe anisotropic blur that occurs in synthesized images.

3.3 Learned feature orientation for wide baseline stereo

In addition to learning the affine shape, we ran an experiment on learning feature orientation by the same procedure. The differences to AffNet learning are marginal: a 2-d output and rotation-only augmentation of input patches. Two loss functions are tested: minimization of the distance between positive patches, as in [49], and our HardNet-based loss. The difference to affine shape learning is that rotation does not change the content of the patch, and robust repeatability is the only desired property.

The evaluation protocol is identical to that of the previous subsection. HesAffNet is used as the detector and a combination of HalfRootSIFT and HardNet++ as descriptors. Our networks are denoted ONetPos and ONetPosNeg respectively. We compared them to the dominant gradient orientation [23] with single assignment (DomOri) and multiple assignments (DomOriMulti), and to OriNet [49].

Results are in Table 4. All the methods show similar performance; we can therefore recommend our method as more universal than the method of Yi et al. [49], since ours can be used for both orientation and affine shape learning.

Figure 7: Hessian-AffNet produces 104 geometrically verified matches on graf1-6 pair [25] – top row, Hessian-Affine produces only 55 matches – bottom.

3.4 Image retrieval

                          Oxford5k                  Paris6k
Detector–Descriptor       BoW    BoW+SV  BoW+QE    BoW    BoW+SV  BoW+QE
HesAff–RootSIFT [2]       55.1   63.0    78.4      59.3   63.7    76.4
HesAff–HardNet++ [27]     60.8   69.6    84.5      65.0   70.3    79.1
HesAffNet–HardNet++       68.3   77.8    89.0      65.7   73.4    83.3

Table 2: Performance (mAP) evaluation on bag-of-words (BoW) image retrieval. A vocabulary consisting of 1M visual words is learned on an independent dataset, that is, when evaluating on Oxford5k, the vocabulary is learned with features of Paris6k and vice versa. SV: spatial verification. QE: query expansion. The best results are highlighted in bold.

                                            Oxford5k       Paris6k
Method                              VS      SA     MA      SA     MA
HesAff–SIFT–BoW [32]                1M      78.4   –       82.2   –
HesAff–SIFT–BoW-fVocab [26]         16M     74.0   84.9    73.6   82.4
HesAff–RootSIFT–HQE [45]            65k     85.3   88.0    81.3   82.8
HesAff–HardNet++–HQE [27]           65k     86.8   88.3    82.8   84.9
HesAffNet–HardNet++–HQE             65k     87.9   89.5    84.2   85.9

Table 3: Performance (mAP) comparison with the state of the art in image retrieval with local features. The vocabulary is learned on an independent dataset, that is, when evaluating on Oxford5k, the vocabulary is learned with features of Paris6k and vice versa. All presented results are with spatial verification and query expansion. VS: vocabulary size. SA: single assignment. MA: multiple assignments. The best results are highlighted in bold.

We evaluate the proposed approach, and compare it against the related ones, on the practical application of image retrieval with local features. As the local feature detector, we use the multi-scale Hessian-Affine detector [25] for all methods. For our method only, we replace the Baumberg method for affine shape estimation with the proposed AffNet. This increases the number of features used, from 12.5M to 17.5M for Oxford5k and from 15.6M to 21.2M for Paris6k, because more features survive, as explained in Section 3.1. As the feature descriptor, we compare methods that use handcrafted descriptors such as SIFT [23] and RootSIFT [2], as well as a state-of-the-art learned descriptor, namely HardNet++ [27]. For all methods, a 128-D descriptor is produced per local feature.

Standard image retrieval datasets are used for the evaluation, i.e., Oxford5k [33] and Paris6k [34] datasets. Both datasets contain a set of images (5062 for Oxford5k and 6391 for Paris6k) depicting 11 different landmarks together with distractors. For each of the 11 landmarks there are 5 different query regions defined by a bounding box, constituting 55 query regions per dataset. The performance is reported as mean average precision (mAP) [33].

First, we make a comparison on the traditional bag-of-words (BoW) [43] image retrieval pipeline. Namely, a flat vocabulary with one million centroids is created with the k-means algorithm and approximate nearest neighbor search [30]. All descriptors of an image are first assigned to their respective centroids of the vocabulary, and then they are aggregated into a histogram of occurrences, which forms the BoW image representation. Because of the sparse nature of the BoW representation, an inverted file is used for a very efficient image search [43]. To provide a more thorough comparison, we also apply methods known to improve image retrieval performance, i.e., spatial verification (SV) [33] and standard query expansion (QE) [34]. The comparison is given in Table 2. On both the Oxford5k and Paris6k datasets, and in all settings, our approach achieves the best results; in most cases it outperforms the second best approach by a large margin. This experiment clearly shows the benefit of using AffNet in the local feature detection pipeline. Interestingly, a pure BoW representation with HesAffNet–HardNet++ outperforms the handcrafted approach of HesAff–RootSIFT, even when the latter uses spatial verification. This is a strong benefit, because the spatial verification step has a very high memory and processing cost.
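
The BoW pipeline with an inverted file can be illustrated by a toy sketch. It deliberately simplifies the setup described above: centroids are given rather than learned by k-means, the nearest centroid is found exhaustively instead of by approximate search, and images are scored by a raw dot product of word counts without tf-idf weighting or normalization. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def nearest_word(descriptor, centroids):
    """Index of the closest centroid (exhaustive search, for clarity)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[k])),
    )


def bow_histogram(descriptors, centroids):
    """Histogram of visual word occurrences for one image."""
    return Counter(nearest_word(d, centroids) for d in descriptors)


def build_inverted_file(database, centroids):
    """Map each visual word to the (image id, count) pairs containing it."""
    inverted = defaultdict(list)
    for image_id, descriptors in database.items():
        for word, count in bow_histogram(descriptors, centroids).items():
            inverted[word].append((image_id, count))
    return inverted


def query(descriptors, inverted, centroids):
    """Rank database images by the dot product of BoW histograms."""
    scores = Counter()
    for word, q_count in bow_histogram(descriptors, centroids).items():
        # Only images that share a visual word with the query are touched.
        for image_id, count in inverted.get(word, ()):
            scores[image_id] += q_count * count
    return scores.most_common()
```

Because a query only visits the lists of the visual words it contains, the cost of a search grows with the number of shared words rather than with the size of the database.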

                    EF [55]      EVD [28]     OxAff [25]   SymB [12]    GDB [47]     LTLL [10]
Detector            33    inl.   15    inl.   40    inl.   46    inl.   22    inl.   172   inl.
DomOri [23]         33    97     2     30     40    1070   35    181    19    190    36    36
ONetPosNeg          33    96     3     27     40    1058   36    175    19    189    36    35
ONetPos             33    96     3     27     40    1058   36    176    19    188    37    34
OriNet [49]         33    96     2     31     40    1050   35    178    19    186    37    34
DomOriMulti [23]    33    116    2     52     40    1181   38    208    19    233    46    38

Table 4: Comparison of the local feature orientation estimators on wide baseline stereo with Hessian and adaptive Hessian detectors, following the protocol [29] on wide baseline stereo datasets. For each dataset, the number of matched image pairs and the average number of inliers (inl.) are reported. The numbers in the header correspond to the number of image pairs in each dataset. We do not highlight the best results, as they are nearly identical for all methods.

Additionally, we make a comparison with the state-of-the-art local-feature-based image retrieval methods. In the case of our proposed method (HesAffNet–HardNet++–HQE), a visual vocabulary of 65k visual words is learned, with the additional Hamming embedding (HE) [17] technique that further refines descriptor assignments with a 128-bit binary signature. We follow the same procedure as the HesAff–RootSIFT–HQE [45] method. Specifically, we use: (i) weighting of the votes as a decreasing function of the Hamming distance [16]; (ii) burstiness suppression [16]; (iii) multiple assignments of features to visual words [34, 18]; and (iv) QE with feature aggregation [45]. All parameters are set as in [45]. The performance of our method is the best reported on both Oxford5k and Paris6k. Furthermore, we outperform a method that learns the vocabulary on the same dataset, i.e., mAP 89.1 was reported [2] on Oxford5k by learning it on the same dataset comprising the relevant images. We even outperform a method that uses a significantly larger number of local features [45] (mAP 89.4 was reported on Oxford5k), i.e., 22M compared to 17.5M used here.

4 Conclusions

We presented a method for learning the affine shape of local features in a weakly-supervised manner. The resulting AffNet regressor bridges the gap between the performance of similarity-covariant and affine-covariant detectors on images with a short baseline and big illumination differences, and it improves the performance of affine-covariant detectors in the wide baseline setup. We also presented some observations related to an affine shape parametrization suitable for CNN-based learning.

AffNet applied to the output of the Hessian detector improves the state of the art in wide baseline matching, affine detector repeatability and image retrieval. The source code and trained CNN are available at https://github.com/ducha-aiki/affnet


The authors were supported by the, PLACE-FOR-UCU-Eleks - Czech Science Foundation Project GACR P103/12/G084, the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center, the CTU student grant SGS17/185/OHK3/3T/13, and the MSMT LL1303 ERC-CZ grant.


  • [1] PyTorch. http://pytorch.org.
  • [2] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2911–2918, 2012.
  • [3] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [4] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In British Machine Vision Conference (BMVC), 2016.
  • [5] A. Baumberg. Reliable feature matching across widely separated views. In CVPR, pages 1774–1781. IEEE Computer Society, 2000.
  • [6] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision (IJCV), 74(1):59–73, 2007.
  • [7] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
  • [8] K. Cordes, B. Rosenhahn, and J. Ostermann. Increasing the accuracy of feature evaluation benchmarks using differential evolution. In IEEE Symposium Series on Computational Intelligence (SSCI) - IEEE Symposium on Differential Evolution (SDE), Apr. 2011.
  • [9] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. A. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1734–1747, 2016.
  • [10] B. Fernando, T. Tommasi, and T. Tuytelaars. Location recognition over large time lags. Computer Vision and Image Understanding, 139:21 – 28, 2015.
  • [11] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.
  • [12] D. C. Hauagge and N. Snavely. Image matching using local symmetry features. In Computer Vision and Pattern Recognition (CVPR), pages 206–213, 2012.
  • [13] A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification. ArXiv e-prints, Mar. 2017.
  • [14] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv 1502.03167, 2015.
  • [15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. ArXiv e-prints, June 2015.
  • [16] H. Jegou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Computer Vision and Pattern Recognition (CVPR), pages 1169–1176, 2009.
  • [17] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision (IJCV), 87(3):316–336, 2010.
  • [18] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek. Accurate image search using the contextual dissimilarity measure. Pattern Analysis and Machine Intelligence (PAMI), 32(1):2–11, 2010.
  • [19] A. Kelman, M. Sofka, and C. V. Stewart. Keypoint descriptors for matching across multiple image modalities and non-linear intensity variations. In CVPR 2007, 2007.
  • [20] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized ransac. In BMVC 2012, 2012.
  • [21] K. Lenc, V. Gulshan, and A. Vedaldi. Vlbenchmarks, 2012.
  • [22] K. Lenc and A. Vedaldi. Learning Covariant Feature Detectors, pages 100–117. Springer International Publishing, Cham, 2016.
  • [23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
  • [24] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision (IJCV), 60(1):63–86, 2004.
  • [25] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision (IJCV), 65(1):43–72, 2005.
  • [26] A. Mikulik, M. Perdoch, O. Chum, and J. Matas. Learning vocabularies over a fine quantization. International Journal of Computer Vision (IJCV), 103(1):163–175, 2013.
  • [27] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas. Working hard to know your neighbor’s margins: Local descriptor learning loss. In Proceedings of NIPS, Dec. 2017.
  • [28] D. Mishkin, J. Matas, and M. Perdoch. MODS: Fast and robust method for two-view matching. Computer Vision and Image Understanding, 141:81–93, 2015.
  • [29] D. Mishkin, J. Matas, M. Perdoch, and K. Lenc. WxBS: Wide baseline stereo generalizations. ArXiv 1504.06603, 2015.
  • [30] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision Theory and Applications (VISAPP), pages 331–340, 2009.
  • [31] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
  • [32] M. Perdoch, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9–16, 2009.
  • [33] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
  • [34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
  • [35] F. Radenovic, J. L. Schonberger, D. Ji, J.-M. Frahm, O. Chum, and J. Matas. From dusk till dawn: Modeling in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5488–5496, 2016.
  • [36] F. Radenovic, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European Conference on Computer Vision (ECCV), pages 3–20, 2016.
  • [37] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In International Conference on Computer Vision (ICCV), pages 2564–2571, 2011.
  • [38] T. Sattler, W. Maddern, A. Torii, J. Sivic, T. Pajdla, M. Pollefeys, and M. Okutomi. Benchmarking 6DOF Urban Visual Localization in Changing Conditions. ArXiv e-prints, July 2017.
  • [39] N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. ArXiv e-prints, Nov. 2016.
  • [40] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
  • [41] J. L. Schonberger, H. Hardmeier, T. Sattler, and M. Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [42] J. L. Schonberger, F. Radenovic, O. Chum, and J.-M. Frahm. From single image query to detailed 3D reconstruction. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5126–5134, 2015.
  • [43] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), pages 1470–1477, 2003.
  • [44] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
  • [45] G. Tolias and H. Jegou. Visual query expansion with or without geometry: refining local descriptors by feature aggregation. Pattern Recognition, 47(10):3466–3476, 2014.
  • [46] Y. Verdie, K. Yi, P. Fua, and V. Lepetit. TILDE: A temporally invariant learned detector. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5279–5288, 2015.
  • [47] G. Yang, C. V. Stewart, M. Sofka, and C.-L. Tsai. Registration of challenging image pairs: Initialization, estimation, and decision. Pattern Analysis and Machine Intelligence (PAMI), 29(11):1973–1989, 2007.
  • [48] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In European Conference on Computer Vision (ECCV), pages 467–483, 2016.
  • [49] K. M. Yi, Y. Verdie, P. Fua, and V. Lepetit. Learning to Assign Orientations to Feature Points. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [50] G. Yu and J.-M. Morel. ASIFT: An algorithm for fully affine invariant comparison. Image Processing On Line, 1:11–38, 2011.
  • [51] Y. Tian, B. Fan, and F. Wu. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [52] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [53] X. Zhang, F. Yu, S. Karaman, and S.-F. Chang. Learning discriminative and transformation covariant local feature detectors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [54] X. Zhang, F. X. Yu, S. Kumar, and S.-F. Chang. Learning Spread-out Local Feature Descriptors. ArXiv e-prints, Aug. 2017.
  • [55] C. L. Zitnick and K. Ramnath. Edge foci interest points. In International Conference on Computer Vision (ICCV), pages 359–366, 2011.