LF-Net: Learning Local Features from Images
We present a novel deep architecture and a training strategy to learn a local feature pipeline from scratch, using collections of images without the need for human supervision. To do so we exploit depth and relative camera pose cues to create a virtual target that the network should achieve on one image, provided the outputs of the network for the other image. While this process is inherently non-differentiable, we show that we can optimize the network in a two-branch setup by confining it to one branch, while preserving differentiability in the other. We train our method on both indoor and outdoor datasets, with depth data from 3D sensors for the former, and depth estimates from an off-the-shelf Structure-from-Motion solution for the latter. Our models outperform the state of the art on sparse feature matching on both datasets, while running at 60+ fps for QVGA images.
Yuki Ono Sony Imaging Products & Solutions Inc. email@example.com Eduard Trulls École Polytechnique Fédérale de Lausanne firstname.lastname@example.org Pascal Fua École Polytechnique Fédérale de Lausanne email@example.com Kwang Moo Yi Visual Computing Group, University of Victoria firstname.lastname@example.org
Preprint. Work in progress.
1 Introduction
Establishing correspondences across images is at the heart of many Computer Vision algorithms, such as those for wide-baseline stereo, object detection, and image retrieval. With the emergence of SIFT , sparse methods that find interest points and then match them across images became the de facto standard. In recent years, many of these approaches have been revisited using deep nets [46, 10, 30, 47], which has also sparked a revival for dense matching [8, 49, 50, 42, 40].
However, dense methods tend to fail in complex scenes with occlusions , while sparse methods still suffer from severe limitations. Some can only train individual parts of the feature extraction pipeline  while others can be trained end-to-end but still require the output of hand-crafted detectors to initialize the training process [46, 10, 47]. For the former, reported gains in performance may fade away when they are integrated into the full pipeline. For the latter, parts of the image which hand-crafted detectors miss are simply discarded for training.
In this paper, we propose a sparse-matching method with a novel deep architecture, which we name LF-Net, for Local Feature Network, that is trainable end-to-end and does not require using a hand-crafted detector to generate training data. Instead, we use image pairs for which we know the relative pose and corresponding depth maps, which can be obtained either with laser scanners or Structure-from-Motion (SfM) algorithms, without any further annotation.
Being thus given dense correspondence data, we could train a feature extraction pipeline by selecting a number of keypoints over two images, computing descriptors for each keypoint, using the ground truth to determine which ones match correctly across images, and using those to learn good descriptors. This is, however, not feasible in practice. First, extracting multiple maxima from a score map is inherently not differentiable. Second, performing this operation independently on each image produces two disjoint sets of keypoints that typically yield very few ground-truth matches, which we need to train the descriptor network and, in turn, to guide the detector towards keypoints that are distinctive and good for matching.
We therefore propose to create a virtual target response for the network, using the ground-truth geometry in a non-differentiable way. Specifically, we run our detector on the first image, find the maxima, and then optimize the weights so that when run on the second image it produces a clean response map with sharp maxima at the right locations. Moreover, we warp the keypoints selected in this manner to the other image using the ground truth, guaranteeing a large pool of ground truth matches. Note that while we break differentiability in one branch, the other one can be trained end to end, which lets us learn discriminative features by learning the entire pipeline at once. We show that our method greatly outperforms the state-of-the-art.
2 Related work
Since the appearance of SIFT , local features have played a crucial role in computer vision, becoming the de facto standard for wide-baseline image matching . They are versatile [21, 44, 26] and remain useful in many scenarios. This remains true even in competition with deep network alternatives, which typically involve dense matching [8, 49, 50, 42, 40] and tend to work best on narrow baselines, as they can suffer from occlusions, which local features are robust against.
Typically, feature extraction and matching comprises three stages: finding interest points, estimating their orientation, and creating a descriptor for each. SIFT , along with more recent methods [5, 29, 1, 46] implements the entire pipeline. However, many other approaches target some of their individual components, be it feature point extraction [28, 41], orientation estimation , or descriptor generation [38, 33, 37]. One problem with this approach is that increasing the performance of one component does not necessarily translate into overall improvements [46, 32].
We briefly introduce some representative algorithms below, separating those that rely on hand-crafted features from those that use Machine Learning techniques extensively.
SIFT  was the first widely successful attempt at designing an integrated solution for local feature extraction. Many subsequent efforts focused on reducing its computational requirements. For instance, SURF  used Haar filters and integral images for fast keypoint detection and descriptor extraction. DAISY  computed dense descriptors efficiently from convolutions of oriented gradient maps. The literature on this topic is very extensive—we refer the reader to .
While methods such as FAST used machine learning techniques to extract keypoints, most early efforts in this area targeted descriptors, e.g., using metric learning or convex optimization. However, with the advent of deep learning, there has been a renewed push towards replacing all the components of the standard pipeline by convolutional neural networks.
– Keypoints. In [41], piecewise-linear convolutional filters were used to make keypoint detection robust to severe lighting changes. In [30], neural networks are trained to rank keypoints. The latter is relevant to our work because no annotations are required to train the keypoint detector, but both methods are optimized for repeatability and not for the quality of the associated descriptors.
– Orientations. The method of [45] is the only one we know of that focuses on improving orientation estimates. It uses a Siamese network to predict the orientations that minimize the distance between the orientation-dependent descriptors of matching keypoints, assuming that the keypoints have been extracted using some other technique.
– Descriptors. The bulk of methods focus on descriptors. In [12, 48], the comparison metric is learned by training Siamese networks. Later works, starting with [33], rely on hard sample mining for training and the L2 norm for comparisons. A triplet-based loss function was introduced in [3], and in [23], negative samples are mined over the entire training batch. More recent efforts further increased performance using spectral pooling [43] and novel loss formulations [18]. However, none of these take into account what kind of keypoints they are working with, and they typically use only SIFT.
Crucially, performance improvements in popular benchmarks for any single one of these three components do not always survive when evaluating the whole pipeline [46, 32]. For example, keypoints are often evaluated on repeatability, which can be misleading because they may be repeatable but useless for matching purposes. Descriptors can prove very robust against photometric and geometric transformations, but this may be unnecessary or even counterproductive when patches are well-aligned, and results on the most common benchmark are heavily saturated.
This was demonstrated in [46], which integrated previous efforts [41, 45, 33] into a fully-differentiable architecture, reformulating the entire keypoint extraction pipeline with deep networks. It showed that not only is joint training necessary for optimal performance, but also that standard SIFT still outperforms many modern baselines. However, their approach still relies on SIFT keypoints for training, and as a result it cannot learn where SIFT itself fails. Along the same lines, a deep network was introduced in [10] to match images with a keypoint-based formulation, assuming a homography model. However, it was largely trained on synthetic images or real images with affine transformations, and its effectiveness on practical wide-baseline stereo problems remains unproven.
3 Method
Fig. 1 depicts our full training pipeline. We first present its individual components and the manner in which they are connected into a feature pipeline, that is, a single branch. We then introduce our loss function that is computed from the output of two such branches. Finally, we present the training scheme we developed to learn their weights.
3.1 LF-Net: a Local Feature Network
Our architecture has two main components. The first one is a dense, multi-scale, fully convolutional network that returns keypoint locations, scales, and orientations. It is designed to achieve fast inference time, and be agnostic to image size. The second is a network that outputs local descriptors for patches cropped around keypoints produced by the first network.
In the remainder of this section, we assume that the images have been undistorted using the camera calibration data. We convert them to grayscale for simplicity and normalize them individually using their mean and standard deviation. As will be discussed in Section 4.1, depth maps and camera parameters can all be obtained using off-the-shelf SfM algorithms. As depth measurements are often missing around 3D object boundaries, especially when computed by SfM algorithms, image regions for which we do not have depth measurements are masked and discarded during training.
Feature map generation.
We first use a fully convolutional network to generate a rich feature map from an image, the corresponding depth map, and the intrinsic and extrinsic camera parameters. The feature map, in turn, can later be used to extract keypoints and their attributes, that is, location, score, scale, and orientation. We do this for two reasons. First, it has been shown that using such a mid-level representation to estimate multiple quantities helps increase the predictive power of deep nets. Second, it allows for larger batch sizes, that is, using more images simultaneously, which is key to training a robust detector.
In practice, we use a simple ResNet layout with three blocks. Each block contains convolutional filters followed by batch normalization, leaky-ReLU activations, and another set of convolutions. All convolutions are zero-padded to have the same output size as the input, and have 16 output channels. In our experiments, this has proved more successful than more recent architectures relying on strided convolutions and pixel shuffling.
Scale-invariant keypoint detection.
To detect scale-invariant keypoints we propose a novel approach to scale-space detection that relies on the feature map. To generate a scale-space response, we resize the feature map several times, at uniform intervals over a range of scale factors. Each resized map is convolved with its own independent filter, which yields one score map per scale. To increase the saliency of keypoints, we perform a differentiable form of non-maximum suppression by applying a softmax operator over 15×15 windows in a convolutional manner, which results in sharper score maps. Since the non-maximum suppression results are scale-dependent, we resize each sharpened score map back to the original image size. Finally, we merge all the resized maps into a final scale-space score map with a softmax-like operation built on the Hadamard (element-wise) product.
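As a rough illustration of the merging step, the sketch below assumes that the softmax-like operation weights each resized score map by a per-pixel softmax taken across scales and then sums the element-wise products. That reading, the function name `merge_scale_space`, and the nested-list layout of the maps are assumptions made for this example (plain Python is used to keep it self-contained), not details confirmed by the text.

```python
import math

def merge_scale_space(score_maps):
    """Merge N same-sized score maps (lists of rows) into a single map.

    Assumed form: S = sum_n h_n * softmax_n(h_n), where the softmax runs
    across the scale axis at every pixel and * is the element-wise product.
    """
    height, width = len(score_maps[0]), len(score_maps[0][0])
    merged = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [m[y][x] for m in score_maps]
            peak = max(values)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in values]
            total = sum(exps)
            merged[y][x] = sum(v * e / total for v, e in zip(values, exps))
    return merged
```

Under this assumption the merged response at a pixel is dominated by the scale with the strongest score while remaining differentiable with respect to every scale.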
From this scale-invariant map we choose the top-scoring pixels as keypoints, and further apply a local softargmax for sub-pixel accuracy. While selecting the top keypoints is not differentiable, this does not stop gradients from back-propagating through the selected points. Furthermore, the sub-pixel refinement through softargmax also makes it possible for gradients to flow through with respect to keypoint coordinates. To predict the scale at each keypoint, we simply apply a softargmax operation over the scale dimension of the per-scale score maps. A simpler alternative would have been to directly regress the scale once a keypoint has been detected. However, this turned out to be less effective in practice.
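To make the sub-pixel refinement concrete, here is a minimal sketch of a 2D softargmax over a small score window centred on a selected keypoint. The function name `soft_argmax_2d`, the `temperature` parameter, and the list-of-rows patch layout are illustrative assumptions, not part of the described implementation.

```python
import math

def soft_argmax_2d(patch, temperature=1.0):
    """Return the softmax-weighted mean (dx, dy) offset inside a score patch.

    `patch` is a list of rows; offsets are measured from the patch centre.
    `temperature` is a hypothetical knob controlling how peaked the weights are.
    """
    height, width = len(patch), len(patch[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    peak = max(max(row) for row in patch)  # for numerical stability
    total = x_sum = y_sum = 0.0
    for y, row in enumerate(patch):
        for x, score in enumerate(row):
            w = math.exp((score - peak) / temperature)
            total += w
            x_sum += w * (x - cx)
            y_sum += w * (y - cy)
    return x_sum / total, y_sum / total
```

Because the output is a smooth function of every score in the window, a loss defined on keypoint coordinates can propagate gradients back into the score map, which a hard argmax would not allow.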
To learn orientations we follow the approach of [45, 46], but on the shared feature representation instead of the image. We apply a single convolution on the feature map which outputs two values for each pixel. They are taken to be the sine and cosine of the orientation and used to compute a dense orientation map using the arctan function.
As discussed above, we extract from the score map the highest-scoring feature points and their image locations. Together with the scale map and orientation map, this gives us quadruplets, each consisting of a keypoint's two image coordinates, its scale, and its orientation, for which we want to compute descriptors.
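The conversion from the two predicted channels to an angle can be sketched as follows, assuming the angle is recovered with the two-argument arctangent. The function name and the list-of-rows layout are hypothetical choices for this example.

```python
import math

def orientation_from_sin_cos(sin_map, cos_map):
    """Convert per-pixel (sine, cosine) predictions into angles in radians.

    The two channels need not be normalised: atan2 depends only on their
    ratio and signs, so it returns a valid angle in (-pi, pi] regardless.
    """
    return [[math.atan2(s, c) for s, c in zip(sin_row, cos_row)]
            for sin_row, cos_row in zip(sin_map, cos_map)]
```

Predicting two values rather than the angle itself avoids the discontinuity at the wrap-around point of the angular range.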
To this end, we consider image patches around the selected keypoint locations. We crop them from the normalized images and resize them to a fixed size. To preserve differentiability, we use the bilinear sampling scheme of [17] for cropping. Our descriptor network comprises three convolutional filters with a stride of 2 and 64, 128, and 256 channels respectively. Each one is followed by batch normalization and ReLU activation. After the convolutional layers, we have a fully-connected 512-channel layer, followed by batch normalization, ReLU, and a final fully-connected layer to reduce the dimensionality to 256. The descriptors are normalized.
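The cropping step can be illustrated with a small bilinear sampler that extracts a rotated, rescaled patch around a keypoint. This is only a sketch: the names `bilinear` and `crop_patch`, the border clamping, and the convention that `scale` is the patch side length in input pixels are all assumptions made for the example.

```python
import math

def bilinear(image, x, y):
    """Bilinearly interpolate `image` (list of rows) at real-valued (x, y)."""
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)   # clamp to the image border
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom

def crop_patch(image, kx, ky, scale, theta, out_size):
    """Sample an out_size x out_size patch centred on (kx, ky).

    The sampling grid is rotated by `theta` radians and spans roughly
    `scale` input pixels per side (assumed convention).
    """
    half = (out_size - 1) / 2.0
    step = scale / out_size  # input pixels per output pixel
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    patch = []
    for v in range(out_size):
        row = []
        for u in range(out_size):
            dx, dy = (u - half) * step, (v - half) * step
            row.append(bilinear(image,
                                kx + cos_t * dx - sin_t * dy,
                                ky + sin_t * dx + cos_t * dy))
        patch.append(row)
    return patch
```

Since every output pixel is a weighted sum of four input pixels with weights that vary smoothly in the keypoint location, scale, and orientation, the crop stays differentiable with respect to all of them.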
3.2 Loss functions
We formulate our training objective as a combination of two types of loss functions, image-level and patch-level. Keypoint detection requires image-level operations and also affects where patches are extracted, thus we use both image-level and patch-level losses. For the other components, we use patch-level losses as they operate independently for each patch once keypoints are selected.
Given the ground truth pose and depth, we propose to select keypoints from the warped score map for with standard, non-differentiable non-maximum suppression, and generate a clean score map by placing Gaussian kernels with standard deviation =0.5 at those locations. We denote this operation . Note that while it is non-differentiable, it only takes place on branch , and thus has no effect in the optimization. For warping we apply rigid-body transforms  with the projective camera model. We call this the SE(3) module , which in addition to the score map takes as input the camera intrinsics, extrinsics, and depth—note that we omit the latter three for brevity. Mathematically, we write
Here, as mentioned before, occluded image regions are not used for optimization.
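The construction of the clean target map can be sketched as below: one Gaussian bump with standard deviation 0.5 is placed at each keypoint surviving non-maximum suppression. The function name `clean_score_map`, the truncation of kernels at three standard deviations, and the use of `max` to combine overlapping bumps are assumptions of this example rather than stated details.

```python
import math

def clean_score_map(height, width, keypoints, sigma=0.5):
    """Build a target map with one Gaussian bump per (x, y) keypoint.

    Overlapping bumps are combined with max so every peak stays at 1.0;
    this is an illustrative choice, not one stated in the text.
    """
    target = [[0.0] * width for _ in range(height)]
    radius = max(1, int(math.ceil(3 * sigma)))  # truncate kernels at 3 sigma
    for kx, ky in keypoints:
        cx, cy = int(round(kx)), int(round(ky))
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                value = math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
                target[y][x] = max(target[y][x], value)
    return target
```

The map is built from fixed keypoint locations, so no gradient is needed through it; it only serves as the regression target for the score map of the differentiable branch.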
With existing methods [33, 3, 23], the pool of pair-wise relationships is predefined before training, assuming a detector is given. More importantly, forming these pairs from two disconnected sets of keypoints will produce too many outliers for the training to ever converge. Finally, we want the gradients to flow back to the keypoint detector network, so that we are able to learn keypoints that are good for matching.
We propose to solve this problem by leveraging the ground truth camera motion and depth to form sparse patch correspondences on the fly, by warping the detected keypoints. Note that we are only able to do this as we warp over branch and back-propagate through branch .
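Warping a keypoint with known depth from one view to the other amounts to back-projection, a rigid-body transform, and re-projection. The sketch below assumes pinhole intrinsics with zero skew and a relative pose given as a rotation `R` and translation `t` from the source camera to the destination camera; the function name and argument layout are hypothetical.

```python
def warp_keypoint(x, y, depth, K_src, K_dst, R, t):
    """Map pixel (x, y) with known depth from the source to the destination view.

    K_src, K_dst: 3x3 intrinsics (zero skew assumed). R (3x3) and t (length 3)
    take source-camera coordinates to destination-camera coordinates.
    Returns (x', y', depth') or None if the point falls behind the camera.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    X = (x - K_src[0][2]) * depth / K_src[0][0]
    Y = (y - K_src[1][2]) * depth / K_src[1][1]
    P = (X, Y, depth)
    # Apply the rigid-body transform.
    Q = [sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3)]
    if Q[2] <= 0:
        return None
    # Project with the destination intrinsics.
    return (K_dst[0][0] * Q[0] / Q[2] + K_dst[0][2],
            K_dst[1][1] * Q[1] / Q[2] + K_dst[1][2],
            Q[2])
```

One way to detect occlusions, under the same assumptions, would be to compare the returned depth with the destination depth map at the warped location and discard keypoints for which the two disagree.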
More specifically, once keypoints are selected from , we warp their spatial coordinates to , similarly as we do for the score maps to compute the image-level loss, but in the opposite direction. Note that we form the keypoint with scale and orientation from branch , as they are not as sensitive as the location, and we empirically found that it helps the optimization. We then extract descriptors at these corresponding regions and . If a keypoint falls on occluded regions after warping, we drop it from the optimisation process. With these corresponding regions and their associated decriptors and we form which is used to train the keypoint, orientation, and scale components. Mathematically we write
Triplet loss for descriptors.
To learn the descriptor, we also need to consider non-corresponding pairs of patches. Similar to , we form a triplet loss to learn the ideal embedding space for the patches. However, for the positive pair we use the ground-truth geometry to find a match, as described above. For the negative—non-matching—pairs, we employ a progressive mining strategy to obtain the most informative patches possible. Specifically, we sort the negatives for each sample by loss in decreasing order and sample randomly over the top , where , where is the current iteration, i.e., we start with a pool of the 64 hardest samples and reduce it as the networks converge, up to a minimum of 5. Sampling informative patches is critical to learn discriminative descriptors, and random sampling will provide too many easy negative samples.
With the matching and non-matching pairs, we form the triplet loss as:
where , i.e., it can be any non-corresponding sample, and =1 is the margin.
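The mining strategy and the loss can be sketched together, assuming a standard hinge-style triplet loss on Euclidean distances between descriptors with a margin of 1. The helper names, the `top_k` argument (standing in for the shrinking pool size, whose schedule is not reproduced here), and the `rng` parameter are assumptions of this example.

```python
import math
import random

def l2_distance(a, b):
    """Euclidean distance between two descriptors given as sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)

def sample_hard_negative(anchor, positive, candidates, top_k, rng=random):
    """Pick one negative at random among the `top_k` highest-loss candidates."""
    ranked = sorted(candidates,
                    key=lambda neg: triplet_loss(anchor, positive, neg),
                    reverse=True)
    return rng.choice(ranked[:max(1, top_k)])
```

Shrinking `top_k` over the course of training concentrates sampling on ever harder negatives, while the random draw within the pool keeps some diversity among the selected samples.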
Loss function for each component.
In summary, the loss functions used to learn each component, i.e., the detector, orientation, and descriptor, are as follows.
Note that the orientation loss reduces to , as the gradient of with respect to the orientation component is zero.
3.3 Training and inference
As shown in Fig. 1, we formulate the learning problem in terms of a two-branch architecture which takes as input two images of the same scene, and , , along with their respective depth maps, camera poses, and calibration matrices, which can be obtained from conventional SfM methods. One distinctive characteristic of our setup is that branch holds the components which break differentiability, and is thus never back-propagated. To do this in a mathematically sound way, we take inspiration from Q-learning  and use the parameters of the network at the previous iteration for this branch. This allows us to have a differentiable setup without changing the loss function. To make the optimization more stable, we also flip the images on each branch and merge the gradients before updating.
We emphasize here that with our loss, the gradients for the patch-wise loss can safely back-propagate through branch , including the top selection, to the image-level networks. Likewise, the softargmax operator used for keypoint extraction allows the optimization to differentiate the patch-wise loss with respect to the location of the keypoints.
At test time, we simply run the differentiable part of this framework. Although differentiability is no longer a concern, we still rely, for simplicity, on the spatial softmax for non-maximum suppression, and the softargmax and spatial transformers for patch sampling. Even so, our implementation can extract 512 keypoints from QVGA frames (320×240) at 62 fps and from VGA frames (640×480) at 25 fps (42 and 20 fps respectively for 1024 keypoints), on a Titan X PASCAL.
3.4 Implementation details
We extract 512 keypoints for training, as larger numbers become problematic due to memory constraints. This also allows us to maintain a batch with multiple image pairs (6), which helps convergence. Note that at test time we can choose as many keypoints as desired. As datasets with natural images are composed of mostly upright images and are thus rather biased in terms of orientation, we perform data augmentation by randomly rotating the input patches by up to 45, and transform the camera’s roll angle accordingly. We also perform scale augmentation by resizing the input patches by to for the indoors data (see Section 4.1), and transforming the focal length accordingly. For optimization, we use ADAM  with a learning rate of . To balance and we take . Our implementation is written in TensorFlow and will be made public.
4 Experiments
4.1 Datasets
We consider both indoors and outdoors images as their characteristics drastically differ, as shown in Fig. 2. For indoors data we rely on ScanNet, an RGB-D dataset with over 2.5M images, including accurate camera poses from SfM reconstructions. These sequences show office settings with specularities and very significant blurring artefacts, and the depth maps are incomplete due to sensing failures, especially around 3D object boundaries. The dataset provides training, validation, and test splits that we use accordingly. As this dataset is very large, we only use a fraction of the sequences for training and validation, and randomly choose 40 test sequences for evaluation, mostly due to time constraints in computing the baselines. To prevent selecting pairs of images that do not share any field of view, we sample images 30 frames apart, guaranteeing enough scene overlap. At test time, we consider multiple values for the frame difference to evaluate increasing baselines.
For outdoors data we use 25 photo-tourism image collections of popular landmarks collected by [36, 15]. We run COLMAP to obtain dense 3D reconstructions, including dense but noisy and inaccurate depth maps for every image. We post-process the depth maps by projecting each image pixel to 3D space at the estimated depth, and mark it as invalid if the closest 3D point from the reconstruction is further than a threshold. The resulting depth maps are still noisy, but many occluded pixels are filtered out as shown in Fig. 2. To guarantee a reasonable degree of overlap for each image pair we perform a visibility check using the SfM points visible over both images. We consider bounding boxes twice the size of those containing these points to extract roughly corresponding image regions, while ignoring very small ones. We use 14 sequences for training and validation, splitting the images into training and validation subsets with a 70:30 ratio, and sample up to 50 pairs from each different scene. For testing we use the remaining 11 sequences, which were not used for training or validation, and sample up to 1 pairs from each set. We use square patches for training, for either data type.
4.2 Baselines and metrics
We consider full local feature pipelines, SIFT, SURF, ORB, A-KAZE, and LIFT, using the authors’ release for LIFT and OpenCV for the rest. For ScanNet, we test on 320×240 images, which is commensurate with the patches cropped while training. We do the same for the baselines, as their performance seems to be better than at higher resolutions, probably due to the low-texture nature of the images. For the outdoors dataset, we resize the images so that the largest dimension is 640 pixels, as they are richer in texture, and all methods work better at this resolution. Similarly, we extract 1024 keypoints for outdoors images, but limit them to 512 for ScanNet, as the latter contains very little texture.
To evaluate the entire local feature pipeline performance, we use the Matching Score, which is defined as the ratio of estimated correspondences that are correct according to the ground-truth geometry, after obtaining them through nearest neighbour matching with the descriptors. As our data exhibits complex geometry, and to emphasize accurate localization of keypoints, we use a 5-pixel threshold instead of an overlap measure.
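A simplified version of this metric can be written as follows. The sketch assumes one-directional nearest-neighbour matching and a caller-supplied ground-truth warp; the function name, the `warp_a_to_b` callback, and the choice to skip keypoints without valid geometry are assumptions of the example and may differ from the exact evaluation protocol.

```python
import math

def matching_score(kps_a, descs_a, kps_b, descs_b, warp_a_to_b, threshold=5.0):
    """Fraction of nearest-neighbour matches within `threshold` pixels of the truth.

    `warp_a_to_b(x, y)` is a hypothetical caller-supplied function returning the
    ground-truth location in image B, or None where no geometry is available.
    """
    correct = total = 0
    for (xa, ya), da in zip(kps_a, descs_a):
        target = warp_a_to_b(xa, ya)
        if target is None:
            continue  # skip keypoints without valid ground truth
        # Nearest neighbour in descriptor space.
        best = min(range(len(descs_b)), key=lambda j: math.dist(da, descs_b[j]))
        total += 1
        if math.dist(kps_b[best], target) <= threshold:
            correct += 1
    return correct / total if total else 0.0
```

Because the score depends on both where keypoints are detected and how well their descriptors match, it measures the pipeline as a whole rather than any single component.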
As an ablation study, we also consider the case where , i.e., we do not train the detector with the patch-wise loss, effectively splitting the training of the detection/scale and orientation/descriptor. We denote this as ‘split’.
4.3 Results on outdoors data
Table 1. Columns: Sequence, SIFT, SURF, A-KAZE, ORB, LIFT, Ours (split), Ours (joint).
For this experiment we provide results independently for each sequence, in addition to the average. Due to the nature of the data, results vary from sequence to sequence. We provide quantitative results in Table 1 and qualitative examples in Fig. 3. Our approach outperforms the closest competitor, LIFT, by 58% relative. Note that training the networks separately (i.e., setting ) still produces state of the art results, but converges very early, and training them jointly increases performance by 43% relative.
4.4 Results on indoors data
Figure panels: (a) SIFT, (b) SURF, (c) A-KAZE, (d) LF-Net (ours).
To evaluate performance over different baselines we sample image pairs at different frame difference values: 10, 20, 30, and 60. At 10 the images are very similar, whereas at 60 there is a significant degree of camera motion—note that our method is trained exclusively at a 30-frame difference. Results are shown in Table 2. Our approach outperforms the closest competitor by 45% relative. As before, training all components jointly increases performance by a large margin, 31% relative. Additionally, we test the models trained on SfM data (from Section 4.3) on ScanNet to showcase their generalization power: 33% relative w.r.t. the closest competitor. Note that we do not perform the reverse (training indoors, testing outdoors) as ScanNet does not contain the types of photometric transformations present in photo-tourism data. We provide qualitative examples in Fig. 3.
5 Conclusions
We have proposed LF-Net, a novel deep architecture to learn local features. It embeds the entire feature extraction pipeline, and can be trained end-to-end with just a collection of images. To allow training from scratch without hand-crafted priors, we devise a two-branch setup and create virtual target responses iteratively. We run this non-differentiable process in one branch while optimizing over the other, which we keep differentiable, and show they converge to an optimal solution. Our method outperforms the state of the art by a large margin, on both indoor and outdoor datasets, at 60 fps for QVGA images. We will release code and learned models, for reproducibility.
This work was partially supported by systems supplied by Compute Canada.
References
[1] Alcantarilla, P. F., Bartoli, A., and Davison, A. J. (2012). KAZE Features. In ECCV.
[2] Alcantarilla, P. F., Nuevo, J., and Bartoli, A. (2013). Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces. In BMVC.
[3] Balntas, V., Johns, E., Tang, L., and Mikolajczyk, K. (2016). PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. arXiv preprint.
[4] Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded Up Robust Features. In ECCV.
[5] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). SURF: Speeded Up Robust Features. CVIU, 110(3), 346–359.
[6] Brown, M., Hua, G., and Winder, S. (2011). Discriminative Learning of Local Image Descriptors. PAMI.
[7] Chapelle, O. and Wu, M. (2009). Gradient Descent Optimization of Smoothed Information Retrieval Metrics. Information Retrieval, 13(3), 216–235.
[8] Choy, C., Gwak, J., Savarese, S., and Chandraker, M. (2016). Universal Correspondence Network. In NIPS.
[9] Dai, A., Chang, A., Savva, M., Halber, M., Funkhouser, T., and Nießner, M. (2017). ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In CVPR.
[10] DeTone, D., Malisiewicz, T., and Rabinovich, A. (2017). SuperPoint: Self-Supervised Interest Point Detection and Description. arXiv preprint arXiv:1712.07629.
[11] Engel, J., Schöps, T., and Cremers, D. (2014). LSD-SLAM: Large-Scale Direct Monocular SLAM. In ECCV.
[12] Han, X., Leung, T., Jia, Y., Sukthankar, R., and Berg, A. C. (2015). MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In CVPR.
[13] Hartley, R. and Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
[14] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In CVPR, pages 770–778.
[15] Heinly, J., Schoenberger, J., Dunn, E., and Frahm, J.-M. (2015). Reconstructing the World in Six Days. In CVPR.
[16] Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML.
[17] Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. (2015). Spatial Transformer Networks. In NIPS, pages 2017–2025.
[18] Keller, M., Chen, Z., Maffra, F., Schmuck, P., and Chli, M. (2018). Learning Deep Descriptors with Scale-Aware Triplet Networks. In CVPR.
[19] Kingma, D. and Ba, J. (2015). Adam: A Method for Stochastic Optimisation. In ICLR.
[20] Kokkinos, I. (2017). UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory. In CVPR.
[21] Lowe, D. (2004). Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2).
[22] Mikolajczyk, K. and Schmid, C. (2004). A Performance Evaluation of Local Descriptors. PAMI, 27(10), 1615–1630.
[23] Mishchuk, A., Mishkin, D., Radenovic, F., and Matas, J. (2017). Working Hard to Know Your Neighbor’s Margins: Local Descriptor Learning Loss. In NIPS.
[24] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533.
[25] Mukherjee, D., Wu, Q. M. J., and Wang, G. (2015). A Comparative Experimental Study of Image Feature Detectors and Descriptors. MVA, 26(4), 443–466.
[26] Mur-Artal, R., Montiel, J., and Tardós, J. (2015). ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 31(5), 1147–1163.
[27] Rosten, E. and Drummond, T. (2006). Machine Learning for High-Speed Corner Detection. In ECCV.
[28] Rosten, E., Porter, R., and Drummond, T. (2010). Faster and Better: A Machine Learning Approach to Corner Detection. PAMI, 32, 105–119.
[29] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). ORB: An Efficient Alternative to SIFT or SURF. In ICCV.
[30] Savinov, N., Seki, A., Ladicky, L., Sattler, T., and Pollefeys, M. (2017). Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection. In CVPR.
[31] Schönberger, J. and Frahm, J. (2016). Structure-from-Motion Revisited. In CVPR.
[32] Schönberger, J., Hardmeier, H., Sattler, T., and Pollefeys, M. (2017). Comparative Evaluation of Hand-Crafted and Learned Local Features. In CVPR.
[33] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., and Moreno-Noguer, F. (2015). Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV.
[34] Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Learning Local Feature Descriptors Using Convex Optimisation. PAMI.
[35] Strecha, C., Bronstein, A., Bronstein, M., and Fua, P. (2012). LDAHash: Improved Matching with Smaller Descriptors. PAMI, 34(1).
[36] Thomee, B., Shamma, D., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L. (2016). YFCC100M: the New Data in Multimedia Research. In CACM.
[37] Tian, Y., Fan, B., and Wu, F. (2017). L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In CVPR.
[38] Tola, E., Lepetit, V., and Fua, P. (2010). DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo. PAMI, 32(5), 815–830.
[39] Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance Normalization: the Missing Ingredient for Fast Stylization. arXiv preprint.
[40] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., and Brox, T. (2017). DeMoN: Depth and Motion Network for Learning Monocular Stereo. In CVPR.
[41] Verdie, Y., Yi, K. M., Fua, P., and Lepetit, V. (2015). TILDE: A Temporally Invariant Learned DEtector. In CVPR.
[42] Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., and Fragkiadaki, K. (2017). SfM-Net: Learning of Structure and Motion from Video. arXiv preprint.
[43] Wei, X., Zhang, Y., Gong, Y., and Zheng, N. (2018). Kernelized Subspace Pooling for Deep Local Descriptors. In CVPR.
[44] Wu, C. (2013). Towards Linear-Time Incremental Structure from Motion. In 3DV.
[45] Yi, K. M., Verdie, Y., Fua, P., and Lepetit, V. (2016a). Learning to Assign Orientations to Feature Points. In CVPR.
[46] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. (2016b). LIFT: Learned Invariant Feature Transform. In ECCV.
[47] Yi, K. M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., and Fua, P. (2018). Learning to Find Good Correspondences. In CVPR.
[48] Zagoruyko, S. and Komodakis, N. (2015). Learning to Compare Image Patches via Convolutional Neural Networks. In CVPR.
[49] Zamir, A. R., Wekel, T., Agrawal, P., Malik, J., and Savarese, S. (2016). Generic 3D Representation via Pose Estimation and Matching. In ECCV.
[50] Zhou, T., Brown, M., Snavely, N., and Lowe, D. (2017). Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR.