Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences
Interest point descriptors have fueled progress on almost every problem in computer vision. Recent advances in deep neural networks have enabled task-specific learned descriptors that outperform hand-crafted descriptors on many problems. We demonstrate that commonly used metric learning approaches do not optimally leverage the feature hierarchies learned in a Convolutional Neural Network (CNN), especially when applied to the task of geometric feature matching. While a metric loss applied to the deepest layer of a CNN is often expected to yield ideal features irrespective of the task, in fact the growing receptive field as well as striding effects cause shallower features to be better at high-precision matching tasks. We leverage this insight, together with explicit supervision at multiple levels of the feature hierarchy for better regularization, to learn more effective descriptors in the context of geometric matching tasks. Further, we propose to use activation maps at different layers of a CNN as an effective and principled replacement for the multi-resolution image pyramids often used for matching tasks. We propose concrete CNN architectures employing these ideas, and evaluate them on multiple datasets for 2D and 3D geometric matching as well as optical flow, demonstrating state-of-the-art results and generalization across datasets.
The advent of repeatable high curvature point detectors [1, 2, 3] heralded a revolution in computer vision that shifted the emphasis of the field from holistic object models and direct matching of image patches to highly discriminative hand-crafted descriptors. These descriptors made a mark on a wide array of problems in computer vision, with pipelines created to solve tasks such as optical flow, object detection, 3D reconstruction and action recognition.
The current decade is witnessing as wide-ranging a revolution, brought about by the widespread use of deep neural networks. Yet there exist computer vision pipelines that, thanks to extensive engineering efforts of the past decades, have proven impervious to end-to-end learned solutions. Despite some recent efforts [9, 10, 11], deep learning solutions do not yet outperform, or achieve the same level of generality as, state-of-the-art methods on problems such as structure from motion (SfM) and object pose estimation. Indeed, we see a consensus emerging that some of the systems employing interest point detectors and descriptors are here to stay, but it might instead be advantageous to leverage deep learning for their individual components.
Recently, a few convolutional neural network (CNN) architectures [14, 15, 16, 17] have been proposed with the aim of learning strong geometric feature descriptors for matching images, and have yielded mixed results [18, 19]. We posit that the ability of CNNs to learn representation hierarchies, which has made them so valuable for many visual recognition tasks, becomes a hurdle when it comes to low-level geometric feature learning, unless specific design choices are made in training and inference to exploit that hierarchy. This paper presents such strategies for the problem of dense geometric correspondence.
Most recent works employ various metric learning losses and extract feature descriptors from the deepest layers [14, 15, 16, 17], with the expectation that the loss would yield good features right before the location of the loss layer. On the contrary, several studies [20, 21] suggest that deeper layers respond to high-level abstract concepts and are by design invariant to local transformations in the input image. Shallower layers, in contrast, are found to be more sensitive to local structure, which is not exploited by most deep-learning based approaches for geometric correspondence that use only deeper layers. To address this, we propose a novel hierarchical metric learning approach that combines the best characteristics of various levels of feature hierarchies, to simultaneously achieve robustness and localization sensitivity. Our framework is widely applicable, which we demonstrate through improved matching for interest points in both 2D and 3D data modalities, on the KITTI Flow and 3DMatch datasets, respectively.
Further, we leverage recent studies that highlight the importance of carefully marshaling the training process: (i) by deeply supervising [23, 24] intermediate feature layers to learn task-relevant features, and (ii) by on-the-fly hard negative mining that forces each iteration of training to achieve more. Finally, we exploit the intermediate activation maps generated within the CNN itself as a proxy for the image pyramids traditionally used to enable coarse-to-fine matching. Thus, at test time, we employ a hierarchical matching framework, using deeper features to perform coarse matching that benefits from greater context and higher-level visual concepts, followed by a fine-grained matching step that searches over shallower features. Figure 1 illustrates our proposed approach.
In summary, our contributions include:
We demonstrate that while in theory metric learning should produce good features irrespective of the layer the loss is applied to, in fact shallower features are superior for high-precision geometric matching tasks, whereas deeper features help obtain greater recall.
We propose a CNN-driven scheme for coarse-to-fine hierarchical matching, avoiding heuristically or manually tuned pyramid approaches.
We experimentally validate our ideas by comparing against state-of-the-art geometric matching approaches and feature fusion baselines, and by performing an ablative analysis of our proposed solution. We evaluate on the tasks of 2D and 3D interest point matching and refinement, as well as optical flow, demonstrating state-of-the-art results and generalization ability.
2 Related Work
With the use of deep neural networks, many new ideas have emerged both pertaining to learned feature descriptors and directly learning networks for low-level vision tasks in an end-to-end fashion, which we review next.
SIFT, SURF and BRISK were designed to complement high curvature point detectors, with one of them even proposing its own algorithm for such a detector. In fact, despite the interest in learned methods, they are still the state-of-the-art for precision [18, 19], even if they are less effective in achieving high recall rates.
While early work [28, 29, 30] leveraged intermediate activation maps of a CNN trained with an arbitrary loss for the task of keypoint matching, most recent methods rely on an explicit metric loss [31, 32, 14, 15, 16, 33, 34] to learn descriptors. The hidden assumption behind the use of contrastive or triplet loss at the final layer of a CNN is that this explicit loss will cause the relevant features to emerge at the top of the feature hierarchy. But it has also been observed that early layers of the CNN are the ones that learn local geometric features. Consequently, many of these works demonstrate superior performance to handcrafted descriptors on semantic matching tasks but often lag behind on geometric matching.
Matching in 2D
LIFT is a moderately deep architecture for end-to-end interest point detection and matching, which uses features at a single level of hierarchy and does not perform dense matching. Universal Correspondence Network (UCN) combines a fully convolutional network in a Siamese setup with a spatial transformer module and contrastive loss for dense correspondence, achieving state-of-the-art results on semantic matching tasks but not on geometric matching. Like UCN, we use a GPU to speed up k-nearest neighbour search for on-the-fly hard negative mining, albeit across multiple feature learning layers. Recently, AutoScaler explicitly applies a learned feature extractor on multiple scales of the input image. While this addresses the issue that a deep layer may have an unnecessarily large receptive field when learning on the basis of contrastive loss, we argue that it is more elegant for the CNN to “look at the image” at multiple scales, rather than separately process multiple scales.
Matching in 3D
Descriptors for matching in 3D voxel grid representations are learned by 3DMatch, employing a Siamese 3D CNN setup on a 30x30x30 cm voxel grid with a contrastive loss. It performs self-supervised learning by utilizing RGB-D scene reconstructions to obtain ground truth correspondence labels for training, outperforming a state-of-the-art hand-crafted descriptor. Thus, 3DMatch provides an additional testbed to validate our ideas, where we report positive results from incorporating our hierarchical metric learning and matching into the approach.
Learned Optical Flow
Recent works achieve state-of-the-art results on optical flow by training CNNs in an end-to-end fashion [38, 39], followed by Conditional Random Field (CRF) inference to capture detailed boundaries. We also demonstrate the efficacy of our matching on optical flow benchmarks. However, we do not use heavily engineered or end-to-end learning for minimizing flow metrics; rather, we show that our matches along with an off-the-shelf interpolant already yield strong results.
Recent works [23, 24, 41, 42] suggest that providing explicit supervision to intermediate layers of a CNN can yield higher performance on unseen data, by regularizing the training process. However, to the best of our knowledge, the idea has neither been tested on the task of keypoint matching nor have the learned intermediate features been evaluated. We do both in our work.
Image Pyramids and Hierarchical Fusion
Downsampling pyramids have been a steady fixture of computer vision for exploiting information across multiple scales. Recently, many techniques have also been developed for fusing features from different layers within a CNN and producing output at high resolution, e.g. semantic segmentation [44, 45, 46, 47], depth estimation, and optical flow [38, 39]. Inspired by prior work on image alignment, we argue that the growing receptive field in deep CNN layers provides a natural way to parse an image at multiple scales. Thus, in our hierarchical matching scheme, we employ features extracted from a deeper layer with greater receptive field and higher-level semantic notions for coarsely locating the corresponding point, followed by shallower features for precise localization. We demonstrate performance improvements in correspondence estimation by using our approach over prior techniques for feature fusion such as [44, 46].
We introduce new ideas for recent metric learning frameworks that learn interest point descriptors. In the following, we first identify the general principles behind our framework, then propose concrete neural network architectures that realize them. In this section, we limit our discussion to models for 2D images. We detail and validate our ideas on the 3DMatch architecture in Section 4.3.
3.1 Hierarchical Metric Learning
We follow the standard CNN-based metric learning setup proposed as the Siamese architecture. This involves two Fully Convolutional Networks (FCN) with tied weights, parsing two images of the same scene. We extract features from the intermediate convolutional layer activation maps at the locations corresponding to the interest points, and after normalization compute their Euclidean distance. At training time, separate contrastive losses are applied to multiple levels in the feature hierarchy to encourage the network to learn embedding functions that minimize the distance between the descriptors of matching points, while maximizing the distance between unmatched points.
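The descriptor comparison described above can be sketched as follows. This is a minimal illustration using plain Python lists in place of activation maps; the function names and the toy values are assumptions made here, not part of the original system.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a descriptor to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def descriptor_at(feature_map, row, col):
    """Read and normalize the channel vector at an integer location.

    feature_map is indexed as feature_map[row][col] -> list of channels.
    """
    return l2_normalize(feature_map[row][col])

def descriptor_distance(desc_a, desc_b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(desc_a, desc_b)))

# Toy 1x2 activation maps with 2 channels each (hypothetical values).
ref_map = [[[3.0, 4.0], [1.0, 0.0]]]
tgt_map = [[[6.0, 8.0], [0.0, 2.0]]]

ref_desc = descriptor_at(ref_map, 0, 0)
d_match = descriptor_distance(ref_desc, descriptor_at(tgt_map, 0, 0))
d_other = descriptor_distance(ref_desc, descriptor_at(tgt_map, 0, 1))
```

Because both descriptors are normalized first, the pair that differs only in magnitude ends up at distance zero, while the pair pointing in a different direction does not.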
Correspondence Contrastive Loss (CCL)
We borrow the correspondence contrastive loss formulation introduced in , and adapted from . Here, represents the feature extracted from the -th feature level of the reference image at a pixel location ; similarly, represents the feature extracted from the -th feature level of the target image at a pixel location . Let represent a dataset of triplets , where is a location in the reference image , is a location in the target image , and is if and only if are a match. Let be a margin parameter and be a window size. We define:
Then, our training loss, , sums CCL losses over multiple levels :
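For orientation, a standard contrastive formulation consistent with the definitions above can be sketched as below. The notation is an assumption introduced here for illustration: $F_l^{I}(p)$ denotes the level-$l$ feature of image $I$ at location $p$, $s_i \in \{0, 1\}$ the match label, $m$ the margin, $N$ the number of triplets, and $L$ the number of supervised levels. The role of the window size is not modeled in this sketch.

```latex
\begin{align}
d_l(p_i, p'_i) &= \bigl\lVert F_l^{I}(p_i) - F_l^{I'}(p'_i) \bigr\rVert_2, \\
\mathrm{CCL}_l(I, I', \mathcal{D}) &= \frac{1}{2N} \sum_{i=1}^{N}
  \Bigl[ s_i \, d_l(p_i, p'_i)^2
  + (1 - s_i) \, \max\bigl(0,\; m - d_l(p_i, p'_i)\bigr)^2 \Bigr], \\
\mathcal{L}(I, I', \mathcal{D}) &= \sum_{l=1}^{L} \mathrm{CCL}_l(I, I', \mathcal{D}).
\end{align}
```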
Our rationale in applying CCL losses at multiple levels of the feature hierarchy is twofold. First, recent studies [23, 24] indicate that deep supervision of CNN architectures contributes to improved regularization, by encouraging the network early on to learn task-relevant features. Second, both deep and shallow layers can be supervised for matching simultaneously within one network.
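A minimal sketch of supervising several feature levels at once is given below. It assumes the common contrastive form (squared distance for matches, squared hinge on the margin for non-matches) with a 1/(2N) normalization; the function names and numbers are hypothetical and the window size is not modeled.

```python
def contrastive_loss(distances, labels, margin):
    """Mean contrastive loss over (distance, label) pairs.

    Label 1 pulls a matching pair together; label 0 pushes a
    non-matching pair apart until it is at least `margin` away.
    """
    total = 0.0
    for d, s in zip(distances, labels):
        if s == 1:
            total += d * d
        else:
            total += max(0.0, margin - d) ** 2
    return total / (2.0 * len(distances))

def hierarchical_loss(levels, margin):
    """Sum the contrastive loss over every supervised feature level.

    `levels` is a list of (distances, labels) pairs, one per level, so each
    level may carry its own set of mined negatives.
    """
    return sum(contrastive_loss(d, s, margin) for d, s in levels)

# Hypothetical descriptor distances at a shallow and a deep level.
shallow = ([0.2, 0.5, 1.5], [1, 0, 0])
deep = ([0.1, 1.2, 0.9], [1, 0, 0])
loss = hierarchical_loss([shallow, deep], margin=1.0)
```

Keeping one (distances, labels) pair per level leaves room for each level to use different negatives.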
Hard Negative Mining
Since our training data includes only positive correspondences, we actively search for hard negative matches “on-the-fly” to speed up training and to leverage the latest instance of network weights. We adopt the approach of UCN, but in contrast to it, our hard negative mining happens independently for each of the feature levels being supervised.
We visualize one specific instantiation of the above ideas in Figure 2, adapting the VGG-M architecture for the task. We retain the first 5 convolutional layers, initializing them with weights pre-trained for ImageNet classification. We use ideas from semantic segmentation literature [52, 47] to increase the resolution of the intermediate activation maps by (a) eliminating down-sampling in the second convolutional and pooling layers (setting their stride value to 1, down from 2), (b) increasing the pooling window size for the second layer from x to x, and (c) dilating the subsequent convolutional layers (conv3, conv4 and conv5) to retain their pretrained receptive fields.
At training, the network is provided with a pair of images and a set of point correspondences. The network is replicated in a Siamese scheme during training (with shared weights), where each sub-network processes one image from the pair. Thus, after each feed-forward pass, we have 4 feature maps: 2 shallow ones and 2 deep ones, respectively from the second and fifth convolutional layers (conv2, conv5). We apply supervision after these same layers (conv2, conv5).
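The idea of mining a hard negative for one query descriptor can be sketched as follows. This is an illustrative brute-force search, not the GPU nearest-neighbour routine used in practice; the exclusion radius around the true match and all names are assumptions made for this example.

```python
import math

def mine_hard_negative(query_desc, target_descs, target_locs, true_loc, exclude_radius):
    """Return (index, distance) of the closest target descriptor that is not the true match.

    Candidates within `exclude_radius` pixels of the true location are skipped,
    so the mined point is a genuine negative rather than a near-duplicate positive.
    """
    best_idx, best_dist = None, float("inf")
    for idx, (desc, loc) in enumerate(zip(target_descs, target_locs)):
        if math.dist(loc, true_loc) <= exclude_radius:
            continue
        d = math.dist(query_desc, desc)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist

# Hypothetical target descriptors and their pixel locations at one feature level.
target_locs = [(10, 10), (11, 10), (40, 12), (80, 50)]
target_descs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
neg_idx, neg_dist = mine_hard_negative([1.0, 0.0], target_descs, target_locs,
                                       true_loc=(10, 10), exclude_radius=4)
```

To mine independently per level, the same function would be called once with the shallow descriptors and once with the deep descriptors, which can yield different negatives for the same query point.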
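The effect of trading stride for dilation can be checked with the usual receptive-field recurrence. The sketch below uses a hypothetical two-layer stack, not the actual VGG-M configuration, to show that removing a stride and dilating the following convolution keeps the receptive field while increasing the output resolution.

```python
def receptive_field(layers):
    """Return (receptive_field, output_stride) after a stack of layers.

    Each layer is a (kernel, stride, dilation) tuple applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump  # growth measured in input pixels
        jump *= stride                       # spacing between adjacent outputs
    return rf, jump

# Hypothetical stacks for illustration only.
strided = [(3, 2, 1), (3, 1, 1)]  # stride-2 layer followed by a plain 3x3 conv
dilated = [(3, 1, 1), (3, 1, 2)]  # stride removed, next conv dilated by 2

rf_strided = receptive_field(strided)
rf_dilated = receptive_field(dilated)
```

Both stacks see the same 7-pixel window of the input, but the dilated variant produces one output per input pixel instead of one per two pixels.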
We also experiment with a GoogLeNet baseline as employed in UCN. Specifically, we augment the network with a x convolutional layer and L2 normalization following the fourth convolutional block (inception_4a/output) for learning deep features, as in UCN. In addition, for learning shallow features, we augment the network with a x convolutional layer right after the second convolutional layer (conv2/x), followed by L2 normalization, but before the corresponding non-linear ReLU squashing function. We extract the shallow and deep feature maps based on the normalized outputs after the second convolutional layer conv2/x and the inception_4a/output layers respectively. We provide the detailed architecture of our GoogLeNet variant as supplementary material.
We implement our system in Caffe and use the ADAM stochastic minimization algorithm to train our network for iterations using a base learning rate of on a single P6000 GPU. Pre-trained layers are fine-tuned with a learning rate multiplier of 0.1 whereas the weights of the newly-added feature-extraction layers are randomly initialized using Xavier’s method. We use a weight decay parameter of and L2 weight regularization. During training, each batch consists of three randomly chosen image pairs and we randomly choose 1K positive correspondences from each pair. It takes the VGG-M variant of our system around 43 hours to train whereas it takes 30 hours to train our GoogLeNet-based variant.
3.2 Hierarchical Matching
We adapt and train our networks as described in the previous section, optimizing network weights for matching using features extracted from different layers. Yet, we find that features from different depths offer complementary capabilities, as predicted by earlier works [20, 21] and confirmed by our empirical evaluation in Section 4. Specifically, features extracted from shallower layers obtain superior matching accuracies for smaller distance thresholds (precision), whereas those from deeper layers provide better accuracies for larger distance thresholds (recall). Such coarse-to-fine matching has long been known in computer vision; however, recent work highlights that employing CNN feature hierarchies for the task is more robust (at least in the context of image alignment).
To establish correspondences, we compare the deep and shallow features of the two input images as follows. The shallow feature coordinates and the deep feature coordinates of a point in the reference image are related by a scaling factor. We first use the deep feature descriptor of the reference point to find, via nearest neighbor search, the point in the target image whose deep feature descriptor is closest to it. (If the scaled coordinate is fractional, we use bilinear interpolation to compute the deep descriptor.) Next, we refine this location by searching within a circle of radius 32 pixels around the coarse match (assuming that the input images have the same size, so that the same scaling factor applies to both) to find the point whose shallow feature descriptor is closest to that of the reference point, forming the final correspondence.
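The two-stage search can be sketched as below. This is a brute-force illustration on tiny list-based feature maps, not the CUDA implementation; the function names, the integer `scale`, and the use of integer division in place of bilinear interpolation are simplifying assumptions made here.

```python
import math

def nearest(query, feature_map, candidates):
    """Return the (row, col) in `candidates` whose descriptor is closest to `query`."""
    return min(candidates, key=lambda rc: math.dist(query, feature_map[rc[0]][rc[1]]))

def hierarchical_match(ref_pt, ref_shallow, ref_deep, tgt_shallow, tgt_deep, scale, radius):
    """Coarse-to-fine match of one reference point given in shallow coordinates.

    `scale` is the assumed integer ratio between shallow and deep map resolutions.
    """
    r, c = ref_pt
    # Coarse step: nearest neighbour over the whole deep target map.
    deep_query = ref_deep[r // scale][c // scale]
    deep_cells = [(i, j) for i in range(len(tgt_deep)) for j in range(len(tgt_deep[0]))]
    ci, cj = nearest(deep_query, tgt_deep, deep_cells)
    coarse = (ci * scale, cj * scale)  # map back to shallow coordinates
    # Fine step: shallow descriptors within `radius` of the coarse estimate.
    window = [(i, j)
              for i in range(len(tgt_shallow)) for j in range(len(tgt_shallow[0]))
              if math.dist((i, j), coarse) <= radius]
    return nearest(ref_shallow[r][c], tgt_shallow, window)

def grid(rows, cols, fill):
    """Build a rows x cols map whose every cell holds a copy of `fill`."""
    return [[list(fill) for _ in range(cols)] for _ in range(rows)]

# Toy data: the shallow descriptor [1, 0] appears twice in the target image.
ref_shallow, tgt_shallow = grid(4, 4, [0.0, 0.0]), grid(4, 4, [0.0, 0.0])
ref_shallow[0][1] = [1.0, 0.0]
tgt_shallow[0][0] = [1.0, 0.0]  # distractor with identical local appearance
tgt_shallow[3][2] = [1.0, 0.0]  # intended match
ref_deep = grid(2, 2, [0.0, 0.0])
ref_deep[0][0] = [0.0, -1.0]
tgt_deep = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]

match = hierarchical_match((0, 1), ref_shallow, ref_deep, tgt_shallow, tgt_deep,
                           scale=2, radius=1.5)
```

In this toy setup a shallow-only search cannot tell the two identical descriptors apart, whereas the coarse deep match restricts the refinement window to the neighbourhood of the intended point.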
Our proposed hierarchical matching is implemented on CUDA and run on a single P6000 GPU, requiring an average of seconds to densely extract features and compute correspondences for a pair of input images of size .
In this section, we first benchmark our proposed method for 2D correspondence estimation against standard metric learning and matching approaches, feature fusion, as well as state-of-the-art learned and hand-crafted methods for extracting correspondences. Next, we show how our method for correspondence estimation can be applied for optical flow and compare it against recent optical flow methods. Finally, we incorporate our ideas in a state-of-the-art 3D fully convolutional network and show improved performance. In the following, we denote our method as HiLM, which is short for Hierarchical metric Learning and Matching.
4.1 2D Correspondence Experiments
We empirically evaluate our ideas against different approaches for dense correspondence estimation. We first consider metric learning and matching approaches based on feature sets extracted from a single convolutional layer, where we separately train five networks, based on the VGG-M baseline in Figure 2. (LIFT is not designed for dense matching and hence is not included in our experiments; note that LIFT also uses features from only a single convolutional layer.) Each of the five networks has a different depth, and we refer to the i-th network as convi-net to indicate that the network is truncated at the i-th convolutional layer (convi), for i = 1, ..., 5. We train a convi-net network by adding a convolutional layer, L2 normalization, and CCL loss after the output of the last layer (convi). Figure 3 (a) shows one branch of the conv3-net baseline as an example.
In addition, we also compare our method against two alternatives for fusing features from different layers, inspired by ideas from semantic segmentation [44, 46]. One is hypercolumn-fusion (Figure 3 (b)), where feature sets from all layers (first through fifth) are concatenated for every interest point and a set of 1x1 convolution kernels is trained to fuse the features before L2 normalization and CCL loss. Instead of upsampling deeper feature maps as in Hariharan et al., we extract deep features at higher resolution by reducing the stride of multiple convolutional/pooling layers while dilating the subsequent convolutions appropriately, as shown in Figure 3. The other approach we consider is topdown-fusion, where refinement modules similar to those of prior work are used to refine the top-level conv5 features gradually down the network by combining them with lower-level features down to conv2 (please see supplementary material for details).
We evaluate on KITTI Flow 2015, where all networks are trained on of the image pairs and the remaining are used for evaluation. For a fair comparison, we use the same train-test split for all methods and train each with 1K correspondences per image pair and for 50K iterations. During testing, we use the correspondences in each image pair (obtained using all non-occluded ground truth flows) for evaluation. Specifically, each method predicts a point in the target image that matches the point from the reference image.
Following prior works [29, 15, 17], we use Percentage of Correct Keypoints (PCK) as our evaluation metric. Given a pixel threshold, the PCK measures the percentage of predicted points that lie within that many pixels of the ground truth corresponding point (and so are considered correct matches up to that threshold).
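The hypercolumn-style fusion at a single interest point can be sketched as follows. A 1x1 convolution evaluated at one location is a matrix-vector product, which is what the example computes; the weights shown are hypothetical stand-ins for learned kernels.

```python
import math

def hypercolumn(per_layer_descs):
    """Concatenate the descriptors of one point across layers."""
    return [v for desc in per_layer_descs for v in desc]

def fuse(column, weights):
    """Apply a 1x1 convolution at one location, then L2-normalize the result.

    `weights` has one row per output channel and one column per input channel.
    """
    fused = [sum(w * x for w, x in zip(row, column)) for row in weights]
    norm = math.sqrt(sum(v * v for v in fused)) or 1.0
    return [v / norm for v in fused]

# Two hypothetical layers with 2 and 1 channels at the same interest point.
column = hypercolumn([[1.0, 2.0], [3.0]])
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # made-up kernels for illustration
descriptor = fuse(column, weights)
```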
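The metric can be computed with a few lines of code. The sketch below assumes that a prediction exactly at the threshold distance counts as correct, which the text does not specify; the sample points are made up.

```python
import math

def pck(predicted, ground_truth, threshold):
    """Percentage of predictions within `threshold` pixels of the ground truth."""
    correct = sum(1 for p, g in zip(predicted, ground_truth)
                  if math.dist(p, g) <= threshold)
    return 100.0 * correct / len(ground_truth)

# Hypothetical predictions and ground-truth locations (x, y) in pixels.
predicted = [(10, 10), (20, 25), (30, 30), (0, 0)]
ground_truth = [(10, 11), (20, 20), (33, 34), (40, 40)]

pck_at_2 = pck(predicted, ground_truth, threshold=2)
pck_at_6 = pck(predicted, ground_truth, threshold=6)
```

Sweeping the threshold over a range of values yields a PCK curve: small thresholds reward precise localization, large thresholds reward coarse correctness.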
Single-Layer and Feature Fusion Descriptors
We plot PCK curves obtained for all methods under consideration in Figure 4 where we split the graph into sub-graphs based on the pixel threshold range. These plots reveal that, for smaller thresholds, shallower features (e.g. conv2-net with @ pixels) provide higher PCK than deeper ones (e.g. conv5-net with @ pixels), with the exception of conv1-net which performs worst. On the other hand, deeper features have better performance for higher thresholds (e.g. conv5-net with versus conv2-net with @ 15 pixels). The graph suggests that deeper features are more suited for rough correspondence estimation whereas shallower features are more suited for obtaining more refined locations. This suggests that, for best performance, one would need to utilize the shallower as well as deeper features produced by the network rather than using just the output of the last layer for correspondence estimation.
The plot also indicates that while baseline approaches for fusing features improve the PCK for smaller thresholds (e.g. hypercolumn-fusion with versus conv5-net with @ pixels), they do not perform on par with the simple conv2-based features (e.g. conv2-net with @ pixels).
Different variants of our full approach achieve the highest PCK for smaller thresholds (e.g. HiLM (conv2+conv4) with @ 5 pixels), without losing accuracy for higher thresholds, by utilizing the power of both the shallow and the deep features. In fact, our method is able to outperform the conv2 features (e.g. conv2-net with @ pixels) although it uses them for refining the rough correspondences estimated by the deeper layers. This is explained by the relative invariance of deeper features to local structure, which helps to avoid matching patches that have similar local appearance but belong to different objects.
We also perform experiments on cross-domain generalization ability. Specifically, we train HiLM (conv2+conv5) on MPI Sintel and evaluate it on KITTI Flow 2015 as in the previous experiment, plotting the result in Figure 4 (black curve). As expected, the Sintel model is subpar compared to the same model trained on KITTI ( vs. @ pixels); however, it outperforms both hypercolumn-fusion () and topdown-fusion () trained on KITTI, across all PCK thresholds. Similar generalization results are obtained when cross-training with HPatches (see supplementary material).
We also compare the performance of (a) our HiLM (conv2+conv5, VGG-M), (b) a variant of our method based on GoogLeNet/UCN (described in Section 3), (c) the original UCN, and (d) the following hand-crafted descriptors: SIFT, KAZE, DAISY. We use the same KITTI Flow 2015 evaluation set utilized in the previous experiment. To evaluate hand-crafted approaches, we use them to compute the descriptors at test pixels in the reference image (for which ground truth correspondences are available) and match the resulting descriptors against the descriptors computed on the target image over a grid of 4 pixel spacing in both directions.
Figure 5 compares the resulting PCKs and shows that our HiLM (VGG-M) outperforms UCN for smaller thresholds (e.g. HiLM (VGG-M) with versus UCN with @ 2 pixels). That difference in performance is not the result of a baseline shift, since our GoogLeNet variant (same baseline network as UCN) has similar or slightly better performance compared to our VGG-M variant. The graph also indicates the relatively higher invariance of CNN-based descriptors to local structure that allows them to obtain a higher percentage of roughly-localized correspondences (e.g. UCN with , HiLM (VGG-M) with , and HiLM (GoogLeNet) with , all at 10 pixel threshold).
4.2 Optical Flow Experiments
In this section, we demonstrate the application of our geometric correspondences for obtaining dense optical flows using KITTI Flow 2015. We emphasize that the objective here is not to outperform approaches that have been extensively engineered [59, 60, 39] for optical flows, including those that minimize the flow metric (end-point error) directly, e.g. FlowNet2. Yet, we consider it useful to garner insights from flow benchmarks since the tasks (i.e. geometric correspondence and optical flow) are conceptually similar.
For dense optical flow estimation, we leverage GoogLeNet as our backbone architecture. However, at test time, we modify the trained network to obtain dense per-pixel correspondences. To this end: (i) we set the stride to 1 in the first convolutional and pooling layers (conv1 and pool1), (ii) we set the kernel size of the first pooling layer (pool1) to 5 instead of 3, (iii) we set the dilation offset of the second convolutional layer (conv2) to 4, and (iv) we set the stride of the second pooling layer (pool2) to 4. These changes allow us to obtain our shallow feature maps at the same resolution as the input images ( x ) and the deep feature maps at x , and to obtain dense per-pixel correspondences faster and with significantly fewer requirements on the GPU memory as compared to an approach that would process the feature maps at full resolution through all layers of the network.
We first extract and match feature descriptors for every pixel in the input images using our proposed method. These initial matches are usually contaminated by outliers or incorrect matches. Therefore, we follow the protocol of AutoScaler for outlier removal. In particular, we enforce local motion constraints using a window of and perform forward-backward consistency checks with a threshold of pixel. These filtered matches are then fed to EpicFlow interpolation for producing the final optical flow output. Figure 6 illustrates an example of this procedure.
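The forward-backward consistency check can be sketched as below. The dictionary-based match representation, the function name and the threshold value are assumptions made for illustration, and the local motion constraint is not modeled.

```python
import math

def forward_backward_filter(forward, backward, threshold):
    """Keep matches p -> q whose backward match q -> p' lands within `threshold` of p.

    forward:  dict mapping a reference pixel to its matched target pixel
    backward: dict mapping a target pixel to its matched reference pixel
    """
    kept = {}
    for p, q in forward.items():
        p_back = backward.get(q)
        if p_back is not None and math.dist(p, p_back) <= threshold:
            kept[p] = q
    return kept

# Hypothetical matches: the third one fails the round trip.
forward = {(0, 0): (5, 5), (1, 0): (6, 5), (2, 0): (9, 9)}
backward = {(5, 5): (0, 0), (6, 5): (1, 0), (9, 9): (7, 7)}
kept = forward_backward_filter(forward, backward, threshold=1.0)
```

Only matches that survive this filter would be passed on to the interpolation step.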
We tabulate our quantitative evaluation results on KITTI Flow 2015 in Table 1. As mentioned earlier, our objective is not necessarily to obtain the best optical flow performance; rather, we wish to emphasize that we are able to provide high-quality interest point matches. In fact, many recent works [59, 60] focus on embedding rich domain priors at the level of explicit object classes into their models, which allows them to make good guesses when data is missing (e.g. due to occlusions, truncations, homogeneous surfaces). Yet, on foreground pixels we are able to outperform the methods in our comparisons, with one exception (i.e. by Fl-fg, HiLM with versus other methods with –, excluding that exception with ). As expected, we do not get as good matches in regions of the image where relatively less structure is present (e.g. background), and for such regions methods [59, 60] employing strong prior models have significant advantages. However, even on background regions, we are able to either beat or perform on par with most of our competitors (i.e. by Fl-bg, versus –), including machinery proposed for optical flows such as [28, 40, 63]. Overall, we outperform 6 state-of-the-art methods evaluated in Table 1 (i.e. by Fl-all), including a multi-scale correspondence approach.
We visualize qualitative results over several test images in Figure 7, to contrast DeepFlow2, EpicFlow, and SPM-BP against our method. As expected from the earlier discussion, we observe superior results for our method on the image regions belonging to the vehicles, because of strong local structures, whereas, for instance, in the first column (fourth row) SPM-BP entirely fails on the blue car. We observe errors in the estimates of our method largely in regions which are occluded (surroundings of other cars) or truncated (lower portion of the images), where the competing methods visualized here also have high errors.
4.3 3D Correspondence Experiments
To demonstrate the generality of our contributions to different data modalities, we now consider an extension of our proposed method in Section 3 to 3D correspondence estimation. In the following, we first present the details of our network architecture and then discuss the results of our quantitative evaluation.
We use 3DMatch  as our baseline architecture and make the following changes to incorporate hierarchical metric learning. We insert two x convolutional layers (stride of each) and one x pooling layer (stride of ) after the second convolutional layer of 3DMatch to obtain a -dimensional vector, which serves as the shallow feature descriptor. Our deep feature descriptor is computed after the eighth convolutional layer in the same manner as 3DMatch. Our hierarchical metric learning scheme again employs two CCL losses (Section 3.1) for learning shallow and deep feature descriptors simultaneously. We disable hard negative mining in this experiment to enable a fair comparison with 3DMatch. Our network is implemented using the Marvin framework  and trained with stochastic gradient descent using a base learning rate of for K iterations on a single TITAN XP GPU. We use pre-trained weights provided by 3DMatch to initialize the common layers in our network, which have a learning rate multiplier of , whereas the weights of the newly added layers are initialized using Xavier’s method and have a learning rate multiplier of . We generate correspondence data for training our network using the same procedure provided by 3DMatch.
3DMatch evalutes the classification accuracy of putative correspondences, using fixed keypoint locations and binary labels. Since our method enables local refinement with shallow features and hence shifts hypothesized correspondence location in space, we define a protocol suitable to measure refinement performance. We employ PCK as our evaluation metric, similar to 2D correspondence evaluation [29, 15, 17]. We generate test data consisting of K ground truth correspondences using the same procedure as 3DMatch. We use a region of xx cm centered on the reference keypoint (in the reference “image”) following  to compute the reference descriptor. This is matched against putative keypoints in a xx cm region (in the target “image”), to refine this coarse prior estimate333In fact, the ground truth keypoint correspondence lies at the center of this region, but this knowledge is not available to the method in any way.. Specifically, we divide this region into subvolumes of xx cm and employ our hierarchical matching approach to exhaustively search 444We use a sampling gap of 3 cm along all three dimensions in searching for subvolumes to reduce computational costs. for the subvolume whose descriptor is most similar to the reference descriptor. In particular, once the coarse matching using deeper feature descriptors yields an approximate location in the xx cm region, we constrain the refinement by shallow feature descriptors to a search radius of cm around the approximate location returned from the coarse matching.
We compare our complete framework, namely, HiLM (conv2+conv8) against variants which are trained with hierarchical metric loss but rely either on deep or shallow features for matching (HiL (conv8) and HiL (conv2), respectively), and 3DMatch which use only deep features. Figure 9 shows the PCK curves of all competing methods computed over 10K test correspondences generated by the procedure of 3DMatch. From the results, our shallow features trained with hierarchical metric learning are able to outperform their deep counterparts for most PCK thresholds (e.g. HiL (conv2) with versus HiL (conv8) with @ 9 cm). By utilizing both deep and shallow features, our complete framework achieves higher PCK numbers than its variants and outperforms 3DMatch across all PCK thresholds (e.g. HiLM (conv2+conv8) with versus 3DMatch with @ 9 cm).
We draw inspiration from recent studies [20, 21] as well as conventional intuitions about CNN architectures to enhance learned representations for dense 2D and 3D geometric matching. Convolutional network architectures naturally learn hierarchies of features; thus, a contrastive loss applied at a deep layer will return features that are less sensitive to local image structure. We propose to remedy this by employing features at multiple levels of the feature hierarchy for interest point description. Further, we leverage recent ideas in deep supervision to explicitly obtain task-relevant features at intermediate layers. Finally, we exploit the growth of the receptive field with increasing layer depth as a proxy to replace conventional coarse-to-fine image pyramid approaches for matching. We thoroughly evaluate these ideas, realized as concrete network architectures, on challenging benchmark datasets. On the task of explicit keypoint matching, our approach outperforms hand-crafted descriptors, a state-of-the-art descriptor learning approach, as well as various ablative baselines including hypercolumn-fusion and topdown-fusion. Further, on optical flow computation, it outperforms several competing methods even without extensive engineering or leveraging higher-level semantic scene understanding. Finally, augmenting a recent 3D descriptor learning framework with our ideas yields performance improvements, hinting at wider applicability. Our future work will explore applications of our geometric correspondences, such as flexible ground surface modeling [66, 67] and geometric registration and refinement [68, 16].
Part of this work was done during Mohammed E. Fathy’s internship at NEC Laboratories America, Inc., in Cupertino, CA. The authors also thank Christopher B. Choy for his help with the source code of Universal Correspondence Network.
Appendix A Supplementary Material
A.1 Generalization Results
As mentioned in Section 4.1 of the main paper, we perform experiments to evaluate the generalization ability of our approach by training and testing on different scenes. In the following, we first provide the details of network training on MPI Sintel and HPatches, and then present the results of our networks when testing on KITTI Flow 2015. Note that the evaluation is conducted on the same test set from KITTI Flow 2015 that was used in Section 4.1 of the main paper, and there is no fine-tuning on KITTI Flow 2015.
Network Training on MPI Sintel
The MPI Sintel dataset contains synthetic image pairs with ground truth optical flows. We train different variants of our proposed approach on MPI Sintel following the procedure that we use for training on KITTI Flow 2015 described in Section 4.1 of the main paper.
Network Training on HPatches
We also train our networks on the full image set of the HPatches dataset, which consists of 116 sequences, with 57 sequences having illumination variations and 59 sequences having geometric (projective) variations. Each sequence has 6 images and 5 ground truth homography transformations between the first image and each of the remaining images. To train our networks, we use all 116 sequences, each with 5 image pairs. An input image pair is preprocessed for training by randomly cropping a subimage from the reference image and using the ground truth homography to help compute the coordinates of the corresponding cropped region in the target image. We use these cropped regions as input for our networks and randomly select 1000 corresponding points between the cropped images for training. The training is run for 50K iterations and we use a batch of 4 randomly chosen image pairs per iteration.
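The preprocessing above relies on mapping points through a ground truth homography. The following sketch shows one way such correspondences could be sampled; it is an illustration under stated assumptions rather than our training pipeline. The function names, the crop and image sizes, the rejection of points that fall outside the target image, and the translation-only homography are all hypothetical.

```python
import random

def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # assumes w != 0 for points of interest

def sample_correspondences(H, crop, target_size, n, rng):
    """Sample up to n point pairs (reference, target) from a reference crop.

    crop is (x0, y0, width, height) in the reference image; pairs whose
    target point falls outside the target image are discarded.
    """
    x0, y0, cw, ch = crop
    tw, th = target_size
    pairs = []
    for _ in range(20 * n):  # bounded number of attempts
        if len(pairs) == n:
            break
        ref = (rng.uniform(x0, x0 + cw), rng.uniform(y0, y0 + ch))
        tgt = apply_homography(H, ref)
        if 0 <= tgt[0] < tw and 0 <= tgt[1] < th:
            pairs.append((ref, tgt))
    return pairs

# Hypothetical homography: a pure translation by (10, -5) pixels.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, -5.0], [0.0, 0.0, 1.0]]
pairs = sample_correspondences(H, (50, 50, 100, 100), (640, 480), 1000, random.Random(0))
print(len(pairs), pairs[0])
```

Dividing by the third homogeneous coordinate is what distinguishes a projective mapping from an affine one; for the translation used here that coordinate is always 1.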
Generalization Results of HiLM
We show in Figure 10 the generalization results of different variants of our proposed approach when training on MPI Sintel  and testing on KITTI Flow 2015. Our Sintel-trained models outperform both feature fusion baselines, namely hypercolumn-fusion  and topdown-fusion , that were trained on KITTI, across all PCK thresholds, e.g. HiLM (conv2+conv3) Sintel, HiLM (conv2+conv4) Sintel, HiLM (conv2+conv5) Sintel with , , PCK respectively, versus, hypercolumn-fusion, topdown-fusion with , respectively, @ 5 pixels. However, as expected, our networks trained on Sintel do not perform as well as the same networks trained on KITTI, e.g. HiLM (conv2+conv3), HiLM (conv2+conv4), HiLM (conv2+conv5) with , , respectively, @ 5 pixels.
Generalization Results of conv2-net
It is worth noting that, for a wide range of pixel thresholds, the performance of HiLM (trained on MPI Sintel) is better than the performance of most single-layer and feature fusion baselines, even when those baselines are trained on KITTI Flow 2015. The only exception is the conv2-net baseline trained on KITTI Flow 2015. We further evaluate the generalization performance of the conv2-net baseline by training versions of it on MPI Sintel and HPatches and testing them on KITTI Flow 2015, plotting the results in Figures 12 and 13. The results show that the generalization performance of conv2-net is not on par with HiLM, which indicates the benefit of our hierarchical learning that combines information from multiple levels of the feature hierarchy.
A.2 3D Correspondence Results
In addition to the 3D correspondence estimation results with xx cm search volumes presented in Section 4.3 of the main paper, we also conduct similar experiments with larger search volumes. In particular, given the descriptor in the reference “image”, we search over all candidate keypoints in a xx cm region in the target “image” to find the keypoint whose descriptor is most similar to the reference descriptor. Figure 14 presents the results for xx cm search regions. The observations are similar to those from the xx cm search volume experiments. Specifically, our shallow features trained with hierarchical metric loss are usually more effective than their deep counterparts, e.g. HiL (conv2) with versus HiL (conv8) with , @ 12 cm. In addition, our complete framework outperforms both of its variants, and achieves higher PCK numbers than 3DMatch across all PCK thresholds, e.g. HiLM (conv2+conv8) with versus 3DMatch with , @ 12 cm.
A.3 Network Architectures
We present the network architectures of our GoogLeNet variant and the topdown-fusion baseline in Figures 15 and 16 respectively. Our GoogLeNet variant is described in Section 3.1 of the main paper, whereas the topdown-fusion baseline is inspired by ideas from  for fusing features from different layers in a top-down scheme and used in our comparisons in Section 4.1 of the main paper.
References
- [1] Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the Alvey Vision Conference (AVC). (1988)
- [2] Lindeberg, T.: Feature detection with automatic scale selection. IJCV 30(2) (1998) 79–116
- [3] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2) (2004) 91–110
- [4] Zhang, Z., Deriche, R., Faugeras, O., Luong, Q.T.: A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence 78 (1995) 87–119
- [5] Brox, T., Malik, J.: Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation. PAMI 33(3) (2011) 500–513
- [6] Dalal, N., Triggs, B.: Histogram of Oriented Gradients for Human Detection. In: CVPR. (2005)
- [7] Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. TOG 25(3) (2006) 835–846
- [8] Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action Recognition by Dense Trajectories. In: CVPR. (2011)
- [9] Kendall, A., Grimes, M., Cipolla, R.: PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In: ICCV. (2015)
- [10] Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: SfM-Net: Learning of Structure and Motion from Video. In: ArXiv. (2017)
- [11] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: DSAC - Differentiable RANSAC for Camera Localization. In: CVPR. (2017)
- [12] Wang, S., Clark, R., Wen, H., Trigoni, N.: DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In: ICRA. (2017)
- [13] Rad, M., Lepetit, V.: BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In: ICCV. (2017)
- [14] Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: Learned Invariant Feature Transform. In: ECCV. (2016)
- [15] Choy, C.B., Gwak, J., Savarese, S., Chandraker, M.: Universal Correspondence Network. In: NIPS. (2016)
- [16] Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., Funkhouser, T.: 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In: CVPR. (2017)
- [17] Wang, S., Luo, L., Zhang, N., Li, J.: AutoScaler: Scale-Attention Networks for Visual Correspondence. In: BMVC. (2017)
- [18] Schönberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative Evaluation of Hand-Crafted and Learned Local Features. In: CVPR. (2017)
- [19] Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR. (2017)
- [20] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. (2014)
- [21] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Object Detectors Emerge in Deep Scene CNNs. In: ICLR. (2015)
- [22] Menze, M., Geiger, A.: Object Scene Flow for Autonomous Vehicles. In: CVPR. (2015)
- [23] Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-Supervised Nets. In: AISTATS. (2015)
- [24] Li, C., Zia, M.Z., Tran, Q.H., Yu, X., Hager, G.D., Chandraker, M.: Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing. In: CVPR. (2017)
- [25] Czarnowski, J., Leutenegger, S., Davison, A.J.: Semantic Texture for Robust Dense Tracking. In: ICCVW. (2017)
- [26] Bay, H., Tuytelaars, T., Gool, L.V.: SURF: Speeded Up Robust Features. In: ECCV. (2006)
- [27] Leutenegger, S., Chli, M., Siegwart, R.Y.: BRISK: Binary Robust Invariant Scalable Keypoints. In: ICCV. (2011)
- [28] Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: Large Displacement Optical Flow with Deep Matching. In: ICCV. (2013)
- [29] Long, J.L., Zhang, N., Darrell, T.: Do Convnets Learn Correspondence? In: NIPS. (2014)
- [30] Lin, K., Lu, J., Chen, C.S., Zhou, J.: Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks. In: CVPR. (2016)
- [31] Zbontar, J., LeCun, Y.: Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches. The Journal of Machine Learning Research (JMLR) 17 (2016) 1–32
- [32] Gadot, D., Wolf, L.: PatchBatch: A Batch Augmented Loss for Optical Flow. In: CVPR. (2016)
- [33] Yang, T.Y., Hsu, J.H., Lin, Y.Y., Chuang, Y.Y.: DeepCD: Learning Deep Complementary Descriptors for Patch Representations. In: ICCV. (2017)
- [34] Zhang, X., Yu, F.X., Kumar, S., Chang, S.F.: Learning Spread-out Local Feature Descriptors. In: ICCV. (2017)
- [35] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial Transformer Networks. In: NIPS. (2015)
- [36] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR. (2005)
- [37] Rusu, R.B., Blodow, N., Beetz, M.: Fast Point Feature Histograms (FPFH) for 3D registration. In: ICRA. (2009)
- [38] Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., v.d. Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning Optical Flow with Convolutional Networks. In: ICCV. (2015)
- [39] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In: CVPR. (2017)
- [40] Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow. In: CVPR. (2015)
- [41] Zamir, A.R., Wu, T.L., Sun, L., Shen, W., Shi, B.E., Malik, J., Savarese, S.: Feedback Networks. In: CVPR. (2017)
- [42] Li, C., Zia, M.Z., Tran, Q.H., Yu, X., Hager, G.D., Chandraker, M.: Deep Supervision with Intermediate Concepts. In: ArXiv. (2018)
- [43] Lucas, B.D., Kanade, T.: Optical Navigation by the Method of Differences. In: IJCAI. (1985)
- [44] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR. (2015)
- [45] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI. (2015)
- [46] Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: ECCV. (2016)
- [47] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. PAMI (2017)
- [48] Eigen, D., Puhrsch, C., Fergus, R.: Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In: NIPS. (2014)
- [49] Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation. In: CVPR. (2015)
- [50] Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional nets. In: BMVC. (2014)
- [51] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. IJCV 115(3) (2015) 211–252
- [52] Yu, F., Koltun, V.: Multi-Scale Context Aggregation by Dilated Convolutions. In: ICLR. (2016)
- [53] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
- [54] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: ACM Multimedia. (2014)
- [55] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: ICLR. (2014)
- [56] Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: ECCV. (2012)
- [57] Alcantarilla, P.F., Bartoli, A., Davison, A.J.: KAZE features. In: ECCV. (2012)
- [58] Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide-baseline stereo. PAMI 32(5) (2010) 815–830
- [59] Bai, M., Luo, W., Kundu, K., Urtasun, R.: Exploiting Semantic Information and Deep Matching for Optical Flow. In: ECCV. (2016)
- [60] Sevilla-Lara, L., Sun, D., Jampani, V., Black, M.J.: Optical Flow with Semantic Segmentation and Localized Layers. In: CVPR. (2016)
- [61] Bailer, C., Varanasi, K., Stricker, D.: CNN-based Patch Matching for Optical Flow with Thresholded Hinge Embedding Loss. In: CVPR. (2017)
- [62] Li, Y., Min, D., Brown, M.S., Do, M.N., Lu, J.: SPM-BP: Sped-up PatchMatch Belief Propagation for Continuous MRFs. In: ICCV. (2015)
- [63] Chen, Q., Koltun, V.: Full Flow: Optical Flow Estimation by Global Optimization over Regular Grids. In: CVPR. (2016)
- [64] Wang, S., Fanello, S., Rhemann, C., Izadi, S., Kohli, P.: The Global Patch Collider. In: CVPR. (2016)
- [65] Marvin: A minimalist GPU-only N-dimensional ConvNet framework. http://marvin.is Accessed: 2015-11-10.
- [66] Lee, B., Daniilidis, K., Lee, D.D.: Online Self-Supervised Monocular Visual Odometry for Ground Vehicles. In: ICRA. (2015)
- [67] Dhiman, V., Tran, Q.H., Corso, J.J., Chandraker, M.: A Continuous Occlusion Model for Road Scene Understanding. In: CVPR. (2016)
- [68] Choi, S., Zhou, Q.Y., Koltun, V.: Robust Reconstruction of Indoor Scenes. In: CVPR. (2015)