Universal Correspondence Network


We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. Our fully convolutional architecture, along with a novel correspondence contrastive loss, allows faster training by effective reuse of computations, accurate gradient computation through the use of thousands of examples per image pair, and faster testing with $O(n)$ feed forward passes for $n$ keypoints, instead of $O(n^2)$ for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations. Extensive experiments on KITTI, PASCAL, and CUB-2011 datasets demonstrate the significant advantages of our features over prior works that use either hand-constructed or learned features.



1 Introduction

Correspondence estimation is the workhorse that drives several fundamental problems in computer vision, such as 3D reconstruction, image retrieval or object recognition. Applications such as structure from motion or panorama stitching that demand sub-pixel accuracy rely on sparse keypoint matches using descriptors like SIFT sift (). In other cases, dense correspondences in the form of stereo disparities, optical flow or dense trajectories are used for applications such as surface reconstruction, tracking, video analysis or stabilization. In yet other scenarios, correspondences are sought not between projections of the same 3D point in different images, but between semantic analogs across different instances within a category, such as beaks of different birds or headlights of cars. Thus, in its most general form, the notion of visual correspondence estimation spans the range from low-level feature matching to high-level object or scene understanding.

Traditionally, correspondence estimation relies on hand-designed features or domain-specific priors. In recent years, there has been an increasing interest in leveraging the power of convolutional neural networks (CNNs) to estimate visual correspondences. For example, a Siamese network may take a pair of image patches and generate their similarity as the output seeingByMoving (); comparingPatches (); stereoMatchingWithCNN (). Intermediate convolution layer activations from the above CNNs are also usable as generic features.

However, such intermediate activations are not optimized for the visual correspondence task. Such features are trained for a surrogate objective function (patch similarity) and do not necessarily form a metric space for visual correspondence; thus, metric operations such as distance do not have an explicit interpretation. In addition, patch similarity is inherently inefficient, since features have to be extracted even for overlapping regions within patches. Further, it requires $O(n^2)$ feed-forward passes to compare each of $n$ patches with $n$ other patches in a different image.

In contrast, we present the Universal Correspondence Network (UCN), a CNN-based generic discriminative framework that learns both geometric and semantic visual correspondences. Unlike many previous CNNs for patch similarity, we use deep metric learning to directly learn the mapping, or feature, that preserves similarity (either geometric or semantic) for generic correspondences. The mapping is, thus, invariant to projective transformations, intra-class shape or appearance variations, or any other variations that are irrelevant to the considered similarity. We propose a novel correspondence contrastive loss that allows faster training by efficiently sharing computations and effectively encoding neighborhood relations in feature space. At test time, correspondence reduces to a nearest neighbor search in feature space, which is more efficient than evaluating pairwise patch similarities.

The UCN is fully convolutional, allowing efficient generation of dense features. We propose an on-the-fly active hard-negative mining strategy for faster training. In addition, we propose a novel adaptation of the spatial transformer spatialTransformer (), called the convolutional spatial transformer, designed to make our features invariant to particular families of transformations. By learning the optimal feature space that compensates for affine transformations, the convolutional spatial transformer imparts the ability to mimic patch normalization of descriptors such as SIFT. Figure 1 illustrates our framework.

The capabilities of UCN are compared to a few important prior approaches in Table 1. Empirically, the correspondences obtained from the UCN are denser and more accurate than most prior approaches specialized for a particular task. We demonstrate this experimentally by showing state-of-the-art performances on sparse SFM on KITTI, as well as dense geometric or semantic correspondences on both rigid and non-rigid bodies in KITTI, PASCAL and CUB datasets.

Figure 1: Various types of correspondence problems have traditionally required different specialized methods: for example, SIFT or SURF for sparse structure from motion, DAISY or DSP for dense matching, SIFT Flow or FlowWeb for semantic matching. The Universal Correspondence Network accurately and efficiently learns a metric space for geometric correspondences, dense trajectories or semantic correspondences.

To summarize, we propose a novel end-to-end system that optimizes a general correspondence objective, independent of domain, with the following main contributions:

  • Deep metric learning with an efficient correspondence contrastive loss for learning a feature representation that is optimized for the given correspondence task.

  • Fully convolutional network for dense and efficient feature extraction, along with fast active hard negative mining.

  • Fully convolutional spatial transformer for patch normalization.

  • State-of-the-art correspondences across sparse SFM, dense matching and semantic matching, encompassing rigid bodies, non-rigid bodies and intra-class shape or appearance variations.

Figure 2: System overview: The network is fully convolutional, consisting of a series of convolutions, pooling, nonlinearities and a convolutional spatial transformer, followed by channel-wise L2 normalization and correspondence contrastive loss. As inputs, the network takes a pair of images and coordinates of corresponding points in these images (blue: positive, red: negative). Features that correspond to the positive points (from both images) are trained to be closer to each other, while features that correspond to negative points are trained to be a certain margin apart. Before the last L2 normalization and after the FCNN, we placed a convolutional spatial transformer to normalize patches or take larger context into account.
Features Dense Geometric Corr. Semantic Corr. Trainable Efficient Metric Space
SIFT sift ()
DAISY daisy ()
Conv4 convnetsLearnCorrespondence ()
DeepMatching deepMatching ()
Patch-CNN comparingPatches ()
LIFT lift ()
Table 1: Comparison of prior state-of-the-art methods with UCN (ours). The UCN generates dense and accurate correspondences for either geometric or semantic correspondence tasks. The UCN directly learns the feature space to achieve high accuracy and has distinct efficiency advantages, as discussed in Section 3.

2 Related Works

Correspondences Visual features form basic building blocks for many computer vision applications. Carefully designed features and kernel methods have influenced many fields such as structure from motion, object recognition and image classification. Several hand-designed features, such as SIFT, HOG, SURF and DAISY have found widespread applications sift (); surf (); daisy (); hog ().

Recently, many CNN-based similarity measures have been proposed. A Siamese network is used in comparingPatches () to measure patch similarity. A driving dataset is used to train a CNN for patch similarity in seeingByMoving (), while stereoMatchingWithCNN () also uses a Siamese network for measuring patch similarity for stereo matching. A CNN pretrained on ImageNet is analyzed for visual and semantic correspondence in convnetsLearnCorrespondence (). Correspondences are learned in warpNet () across both appearance and a global shape deformation by exploiting relationships in fine-grained datasets. In contrast, we learn a metric space in which metric operations have direct interpretations, rather than optimizing the network for patch similarity and using the intermediate features. For this, we implement a fully convolutional architecture with a correspondence contrastive loss that allows faster training and testing and propose a convolutional spatial transformer for local patch normalization.

Metric learning using neural networks Neural networks are used in signatureVerification () for learning a mapping where the Euclidean distance in the space preserves semantic distance. The loss function for learning similarity metric using Siamese networks is subsequently formalized by faceVerification (); DrLIM (). Recently, a triplet loss is used by deepRanking () for fine-grained image ranking, while the triplet loss is also used for face recognition and clustering in facenet (). Mini-batches are used for efficiently training the network in songCVPR16 ().

CNN invariances and spatial transformations A CNN is invariant to some types of transformations such as translation and scale due to convolution and pooling layers. However, explicitly handling such invariances in forms of data augmentation or explicit network structure yields higher accuracy in many tasks sicnn (); spatialPyramidPooling (); spatialTransformer (). Recently, a spatial transformer network is proposed in spatialTransformer () to learn how to zoom in, rotate, or apply arbitrary transformations to an object of interest.

Fully convolutional neural network Fully connected layers are converted into convolutional filters in fcnn () to propose a fully convolutional framework for segmentation. Changing a regular CNN to a fully convolutional network for detection leads to speed and accuracy gains in fast-rcnn (). Similar to these works, we gain the efficiency of a fully convolutional architecture through reusing activations for overlapping regions. Further, since the number of training instances is much larger than the number of images in a batch, variance in the gradient is reduced, leading to faster training and convergence.

3 Universal Correspondence Network

We now present the details of our framework. Recall that the Universal Correspondence Network is trained to directly learn a mapping that preserves similarity instead of relying on surrogate features. We discuss the fully convolutional nature of the architecture, a novel correspondence contrastive loss for faster training and testing, active hard negative mining, as well as the convolutional spatial transformer that enables patch normalization.

Fully Convolutional Feature Learning To speed up training and use resources efficiently, we implement fully convolutional feature learning, which has several benefits. First, the network can reuse some of the activations computed for overlapping regions. Second, we can train several thousand correspondences for each image pair, which provides the network an accurate gradient for faster learning. Third, hard negative mining is efficient and straightforward, as discussed subsequently. Fourth, unlike patch-based methods, it can be used to extract dense features efficiently from images of arbitrary sizes.

During testing, the fully convolutional network is faster as well. Patch similarity based networks such as seeingByMoving (); comparingPatches (); stereoMatchingWithCNN () require $O(n^2)$ feed forward passes, where $n$ is the number of keypoints in each image, as compared to only $O(n)$ for our network. We note that extracting intermediate layer activations as a surrogate mapping is a comparatively suboptimal choice since those activations are not directly trained on the visual correspondence task.

Correspondence Contrastive Loss Learning a metric space for visual correspondence requires that corresponding points (in different views) be mapped to neighboring points in the feature space. To encode these constraints, we propose a generalization of the contrastive loss faceVerification (); DrLIM (), called correspondence contrastive loss. Let $\mathcal{F}_I(\mathbf{x})$ denote the feature in image $I$ at location $\mathbf{x} = (x, y)$. The loss function takes features from images $I$ and $I'$, at coordinates $\mathbf{x}$ and $\mathbf{x}'$, respectively (see Figure 3). If the coordinates $\mathbf{x}$ and $\mathbf{x}'$ correspond to the same 3D point, we use the pair as a positive pair that is encouraged to be close in the feature space, otherwise as a negative pair that is encouraged to be at least margin $m$ apart. We denote $s = 1$ for a positive pair and $s = 0$ for a negative pair. The full correspondence contrastive loss is given by

$$L = \frac{1}{2N} \sum_{i=1}^{N} \Big[ s_i \, \big\| \mathcal{F}_I(\mathbf{x}_i) - \mathcal{F}_{I'}(\mathbf{x}'_i) \big\|^2 + (1 - s_i) \max\big(0,\; m - \big\| \mathcal{F}_I(\mathbf{x}_i) - \mathcal{F}_{I'}(\mathbf{x}'_i) \big\| \big)^2 \Big] \tag{1}$$

For each image pair, we sample correspondences from the training set. For instance, for the KITTI dataset, if we use each laser scan point, we can train up to 100k points in a single image pair. However, in practice, we used 3k correspondences to limit memory consumption. This allows more accurate gradient computations than the traditional contrastive loss, which yields one example per image pair. We again note that the number of feed forward passes at test time is $O(n)$, compared to $O(n^2)$ for Siamese network variants seeingByMoving (); stereoMatchingWithCNN (); comparingPatches (). Table 2 summarizes the advantages of a fully convolutional architecture with correspondence contrastive loss.
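As an illustration of the loss described above, the following is a minimal Python sketch rather than the Caffe layer used in our implementation. It assumes the convention $s = 1$ for a positive pair and $s = 0$ for a negative pair, a squared hinge on the margin, and normalization by $2N$. The function names, the dictionary-based feature maps and the default margin value are hypothetical choices made for readability.

```python
import math


def feature_distance(f1, f2):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def correspondence_contrastive_loss(feats_a, feats_b, pairs, margin=1.0):
    """Loss over many correspondences drawn from ONE image pair.

    feats_a, feats_b: dicts mapping (x, y) -> feature vector (stand-ins for
                      dense feature maps; an assumption of this sketch).
    pairs: list of (loc_a, loc_b, s), with s = 1 (positive) or 0 (negative).
    """
    total = 0.0
    for loc_a, loc_b, s in pairs:
        d = feature_distance(feats_a[loc_a], feats_b[loc_b])
        if s == 1:
            total += d ** 2                        # pull positives together
        else:
            total += max(0.0, margin - d) ** 2     # push negatives past the margin
    return total / (2 * len(pairs))


# Toy example: one positive pair and two negative pairs.
feats_a = {(0, 0): (1.0, 0.0)}
feats_b = {(0, 0): (1.0, 0.0), (5, 3): (0.6, 0.8), (9, 9): (-1.0, 0.0)}
pairs = [((0, 0), (0, 0), 1), ((0, 0), (5, 3), 0), ((0, 0), (9, 9), 0)]
loss = correspondence_contrastive_loss(feats_a, feats_b, pairs)
```

In this toy example only the negative at (5, 3) contributes, since the positive pair has zero distance and the negative at (9, 9) already lies beyond the margin.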

Figure 3: Correspondence contrastive loss takes three inputs: two dense features extracted from images and a correspondence table for positive and negative pairs.

Methods                   # examples per image pair   # feed forwards per test
Siamese Network           1
Triplet Loss              2
Contrastive Loss          1
Corres. Contrast. Loss
Table 2: Comparisons between metric learning methods for visual correspondence. Feature learning allows faster test times. Correspondence contrastive loss allows us to use many more correspondences in one pair of images than other methods.

Hard Negative Mining The correspondence contrastive loss in Eq. (1) consists of two terms. The first term minimizes the distance between positive pairs and the second term pushes negative pairs to be at least margin $m$ away from each other. Thus, the second term is only active when the distance between the features $\mathcal{F}_I(\mathbf{x}_i)$ and $\mathcal{F}_{I'}(\mathbf{x}'_i)$ is smaller than the margin $m$. This boundary defines the metric space, so it is crucial to find the negatives that violate the constraint and train the network to push them away. However, random negative pairs do not contribute to training since they are generally far from each other in the embedding space.

Instead, we actively mine negative pairs that violate the constraints the most to dramatically speed up training. We extract features from the first image and find the nearest neighbor in the second image. If the location is far from the ground truth correspondence location, we use the pair as a negative. We compute the nearest neighbor for all ground truth points on the first image. This mining process is time consuming since it requires $O(n_1 n_2)$ comparisons for $n_1$ and $n_2$ feature points in the two images, respectively. Our experiments use a few thousand points for $n_1$, with $n_2$ being all the features on the second image, which can be very large. We use a GPU implementation to speed up the K-NN search knn-cuda () and embed it as a Caffe layer to actively mine hard negatives on-the-fly.
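The mining procedure can be sketched in a few lines of Python. This is a brute-force CPU illustration under stated assumptions, not the GPU K-NN layer used in our implementation: the function names, the dictionary-based feature maps and the `pixel_thresh` parameter are hypothetical, and the pixel threshold value is left to the caller.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mine_hard_negatives(feats_a, feats_b, gt_matches, pixel_thresh):
    """Return (loc_a, loc_b) pairs to be used as hard negatives.

    feats_a: dict (x, y) -> feature for ground-truth points in image 1.
    feats_b: dict (x, y) -> feature for every location in image 2.
    gt_matches: dict mapping a location in image 1 to its true match in image 2.
    pixel_thresh: a nearest neighbor farther than this from the true match
                  (in pixels) is treated as a hard negative.
    """
    negatives = []
    for loc_a, loc_gt in gt_matches.items():
        f_a = feats_a[loc_a]
        # Brute-force nearest neighbor in FEATURE space over image 2.
        nn_loc = min(feats_b, key=lambda loc_b: euclidean(f_a, feats_b[loc_b]))
        # Compare in PIXEL space against the ground-truth location.
        if euclidean(nn_loc, loc_gt) > pixel_thresh:
            negatives.append((loc_a, nn_loc))
    return negatives


feats_a = {(10, 10): (1.0, 0.0)}
feats_b = {(12, 10): (0.0, 1.0), (50, 40): (0.9, 0.1)}
gt_matches = {(10, 10): (12, 10)}
hard = mine_hard_negatives(feats_a, feats_b, gt_matches, pixel_thresh=16.0)
```

Here the feature at (50, 40) is the nearest neighbor in feature space but lies far from the true match at (12, 10), so the pair is returned as a hard negative. The double loop makes the quadratic cost of the search explicit, which is why a GPU implementation is used in practice.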

Convolutional Spatial Transformer CNNs are known to handle some degree of scale and rotation invariances. However, handling spatial transformations explicitly using data-augmentation or a special network structure has been shown to be more successful in many tasks spatialTransformer (); spatialPyramidPooling (); warpNet (); sicnn (). For visual correspondence, finding the right scale and rotation is crucial, which is traditionally achieved through patch normalization mser (); sift (). A series of simple convolutions and poolings cannot mimic such complex spatial transformations.

To mimic patch normalization, we borrow the idea of the spatial transformer layer spatialTransformer (). However, instead of a global image transformation, each keypoint in the image can undergo an independent transformation. Thus, we propose a convolutional version to generate the transformed activations, called the convolutional spatial transformer. As demonstrated in our experiments, this is especially important for correspondences across large intra-class shape variations.

(a) SIFT

(b) Spatial transformer

(c) Convolutional spatial transformer
Figure 4: (a) SIFT normalizes for rotation and scaling. (b) The spatial transformer takes the whole image as an input to estimate a transformation. (c) Our convolutional spatial transformer applies an independent transformation to features.

The proposed transformer takes its input from a lower layer and, for each output feature, applies an independent spatial transformation. The transformation parameters are also extracted convolutionally. Since each feature undergoes an independent transformation, the transformed activations are placed inside a larger activation without overlap and then go through a successive strided convolution that combines each set of transformed activations independently. The stride has to be equal to the kernel size of the spatial transformer. Figure 4 illustrates the convolutional spatial transformer module.
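The data flow of this module can be sketched for a single-channel activation in plain Python. This is an illustrative sketch under several assumptions, not our Caffe layer: each location carries a 2x2 linear transformation about the kernel center, the transformation parameters are passed in as `thetas` rather than predicted by a convolution, border samples are clamped, and all function names are hypothetical.

```python
import math


def bilinear_sample(U, xs, ys):
    """Bilinearly interpolate the 2D grid U[y][x] at the fractional point (xs, ys)."""
    H, W = len(U), len(U[0])
    # Clamp to the grid so border samples stay well defined (an assumption).
    xs = min(max(xs, 0.0), W - 1.0)
    ys = min(max(ys, 0.0), H - 1.0)
    x0, y0 = int(math.floor(xs)), int(math.floor(ys))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = xs - x0, ys - y0
    return (U[y0][x0] * (1 - wx) * (1 - wy) + U[y0][x1] * wx * (1 - wy)
            + U[y1][x0] * (1 - wx) * wy + U[y1][x1] * wx * wy)


def conv_spatial_transformer(U, thetas, kernel):
    """U: H x W activation; thetas[y][x]: 2x2 matrix; kernel: k x k weights (k odd)."""
    H, W, k = len(U), len(U[0]), len(kernel)
    half = k // 2
    # Step 1: place each location's transformed k x k patch into a larger
    # (H*k) x (W*k) activation so that patches do not overlap.
    tiled = [[0.0] * (W * k) for _ in range(H * k)]
    for y in range(H):
        for x in range(W):
            (a, b), (c, d) = thetas[y][x]
            for j in range(k):
                for i in range(k):
                    xt, yt = i - half, j - half  # grid coords, origin at kernel center
                    tiled[y * k + j][x * k + i] = bilinear_sample(
                        U, x + a * xt + b * yt, y + c * xt + d * yt)
    # Step 2: convolution with stride k, so each output sees exactly one patch.
    return [[sum(kernel[j][i] * tiled[y * k + j][x * k + i]
                 for j in range(k) for i in range(k))
             for x in range(W)] for y in range(H)]


U = [[float(x + 10 * y) for x in range(5)] for y in range(5)]
identity = ((1.0, 0.0), (0.0, 1.0))
shrink = ((0.5, 0.0), (0.0, 0.5))
corner = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # picks the top-left sample
out_identity = conv_spatial_transformer(U, [[identity] * 5 for _ in range(5)], corner)
out_shrink = conv_spatial_transformer(U, [[shrink] * 5 for _ in range(5)], corner)
```

With the identity transformation the corner weight reads the neighbor at offset (-1, -1), while the shrinking transformation samples halfway toward the center, which shows how a per-location transformation changes the receptive field of an otherwise ordinary convolution.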

4 Experiments

We use the Caffe caffe () package for implementation. Since it does not support the new layers we propose, we implement the correspondence contrastive loss layer, the convolutional spatial transformer layer, the K-NN layer based on knn-cuda () and the channel-wise L2 normalization layer. To keep the network fully convolutional, we use neither a flattening layer nor a fully connected layer, and the network generates features at every fourth pixel. For accurate localization, we then extract features densely using bilinear interpolation to mitigate quantization error for sparse correspondences. Please refer to the supplementary materials for the network implementation details and visualization.

For each experiment setup, we train and test three variations of networks. The first has hard negative mining and the spatial transformer (Ours-HN-ST). The second is the same network without the spatial transformer (Ours-HN). The third is the same network without the spatial transformer or hard negative mining, instead using random negative samples that are at least a certain number of pixels apart from the ground truth correspondence location (Ours-RN). With this configuration of networks, we verify the effectiveness of each component of the Universal Correspondence Network.

Datasets and Metrics We evaluate our UCN on three different tasks: geometric correspondence, semantic correspondence and accuracy of correspondences for camera localization. For geometric correspondence (matching images of the same 3D point in different views), we use two optical flow datasets, the KITTI 2015 Flow benchmark and the MPI Sintel dataset. For semantic correspondences (finding the same functional part from different instances), we use the PASCAL-Berkeley dataset with keypoint annotations pascal-voc-2011 (); pascal_keypoints () and a subset used by FlowWeb flowWeb (). We also compare against prior state-of-the-art on the Caltech-UCSD Bird dataset wah2011caltech (). To test the accuracy of correspondences for camera motion estimation, we use the raw KITTI driving sequences which include Velodyne scans, GPS and IMU measurements. Velodyne points are projected in successive frames to establish correspondences and any points on moving objects are removed.

To measure performance, we use the percentage of correct keypoints (PCK) metric convnetsLearnCorrespondence (); flowWeb (); warpNet () (or equivalently “accuracy@T” deepMatching ()). We extract features densely or on a set of sparse keypoints (for semantic correspondence) from a query image and find the nearest neighboring feature in the second image as the predicted correspondence. The correspondence is classified as correct if the predicted keypoint is closer than $T$ pixels to the ground truth (in short, PCK@$T$). Unlike many prior works, we do not apply any post-processing, such as global optimization with an MRF. This is to capture the performance of raw correspondences from UCN, which already surpasses previous methods.
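For concreteness, the PCK metric can be computed as in the short Python sketch below. The function name and the list-of-tuples input format are hypothetical, and the strict inequality follows the "closer than" wording above.

```python
import math


def pck(predicted, ground_truth, threshold):
    """Fraction of predicted keypoints closer than `threshold` pixels to ground truth.

    predicted, ground_truth: equal-length lists of (x, y) pixel coordinates.
    """
    correct = sum(
        1 for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) < threshold
    )
    return correct / len(ground_truth)


predicted = [(0.0, 0.0), (10.0, 0.0), (3.0, 4.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
score = pck(predicted, ground_truth, threshold=10.0)  # distances are 0, 10 and 5
```

In the example, the second prediction lies exactly at the threshold and is therefore not counted as correct.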

Method                         MPI-Sintel   KITTI
SIFT-NN sift ()                68.4         48.9
HOG-NN hog ()                  71.2         53.7
SIFT-flow siftflow ()          89.0         67.3
DaisyFF daisyff ()             87.3         79.6
DSP DSP ()                     85.3         58.0
DM best () deepMatching ()     89.2         85.6
Ours-HN                        91.5         86.5
Ours-HN-ST                     90.7         83.4
Table 3: Matching performance PCK@10px on KITTI Flow 2015 kitti_flow () and MPI-Sintel sintel (). Note that DaisyFF, DSP, DM use global optimization whereas we only use the raw correspondences from nearest neighbor matches.
(a) PCK performance for dense features NN
(b) PCK performance on keypoints NN
Figure 5: Comparison of PCK performance on KITTI raw dataset (a) PCK performance of the densely extracted feature nearest neighbor (b) PCK performance for keypoint features nearest neighbor and the dense CNN feature nearest neighbor
(a) Original image pair and keypoints
(b) SIFT sift () NN matches
(c) DAISY daisy () NN matches
(d) Ours-HN NN matches
Figure 6: Visualization of nearest neighbor (NN) matches on KITTI images (a) from top to bottom, first and second images and FAST keypoints and dense keypoints on the first image (b) NN of SIFT matches on second image. (c) NN of dense DAISY matches on second image. (d) NN of our dense UCN matches on second image.

Geometric Correspondence We pick random correspondences in each KITTI or MPI Sintel image during training. We consider a correspondence a hard negative if the nearest neighbor in the feature space is more than a threshold number of pixels away from the ground truth correspondence. We used the same architecture and training scheme for both datasets. Following convention deepMatching (), we measure PCK at a 10 pixel threshold and compare with the state-of-the-art methods in Table 3. SIFT-flow siftflow (), DaisyFF daisyff (), DSP DSP (), and DM best deepMatching () use additional global optimization to generate more accurate correspondences. In contrast, our raw correspondences alone outperform all of these methods. We note that the spatial transformer does not improve performance in this case, likely due to overfitting to a smaller training set. As we show in the next experiments, its benefits are more apparent with a larger-scale dataset and greater shape variations.

We also used KITTI raw sequences to generate a large number of correspondences, and we split different sequences into train and test sets. The details of the split are in the supplementary material. We plot PCK for different thresholds for various methods with densely extracted features on the larger KITTI raw dataset in Figure 5(a). Our features are more accurate than all traditional features including SIFT sift (), DAISY daisy () and KAZE kaze (). Due to dense extraction at the original image scale without rotation, SIFT does not perform well. So, we also extract all features except ours sparsely on SIFT keypoints and plot PCK curves in Figure 5(b). All the prior methods improve (SIFT dramatically so), but our UCN features still perform significantly better even with dense extraction. Also note the improved performance of the convolutional spatial transformer. PCK curves for geometric correspondences on individual semantic classes such as road or car are in the supplementary material.

aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mean
conv4 flow 28.2 34.1 20.4 17.1 50.6 36.7 20.9 19.6 15.7 25.4 12.7 18.7 25.9 23.1 21.4 40.2 21.1 14.5 18.3 33.3 24.9
SIFT flow 27.6 30.8 19.9 17.5 49.4 36.4 20.7 16.0 16.1 25.0 16.1 16.3 27.7 28.3 20.2 36.4 20.5 17.2 19.9 32.9 24.7
NN transfer 18.3 24.8 14.5 15.4 48.1 27.6 16.0 11.1 12.0 16.8 15.7 12.7 20.2 18.5 18.7 33.4 14.0 15.5 14.6 30.0 19.9
Ours RN 31.5 19.6 30.1 23.0 53.5 36.7 34.0 33.7 22.2 28.1 12.8 33.9 29.9 23.4 38.4 39.8 38.6 17.6 28.4 60.2 36.0
Ours HN 36.0 26.5 31.9 31.3 56.4 38.2 36.2 34.0 25.5 31.7 18.1 35.7 32.1 24.8 41.4 46.0 45.3 15.4 28.2 65.3 38.6
Ours HN-ST 37.7 30.1 42.0 31.7 62.6 35.4 38.0 41.7 27.5 34.0 17.3 41.9 38.0 24.4 47.1 52.5 47.5 18.5 40.2 70.5 44.0
Table 4: Per-class PCK on PASCAL-Berkeley correspondence dataset pascal_keypoints () (, ).
Figure 7: Qualitative semantic correspondence results on PASCAL pascal-voc-2011 () correspondences with Berkeley keypoint annotation pascal_keypoints () and Caltech-UCSD Bird dataset wah2011caltech (). Columns, from left to right: Query, Ground Truth, Ours HN-ST, VGG conv4_3 NN.
Method                                        mean
conv4 flow convnetsLearnCorrespondence ()     24.9   11.8   4.08
SIFT flow                                     24.7   10.9   3.55
fc7 NN                                        19.9   7.8    2.35
ours-RN                                       36.0   21.0   11.5
ours-HN                                       38.6   23.2   13.1
ours-HN-ST                                    44.0   25.9   14.4
Table 5: Mean PCK on PASCAL-Berkeley correspondence dataset pascal_keypoints () (). Even without any global optimization, our nearest neighbor search outperforms all methods by a large margin.

Figure 8: PCK on CUB dataset wah2011caltech (), compared with various other approaches including WarpNet warpNet () (.)

Semantic Correspondence The UCN can also learn semantic correspondences invariant to intra-class appearance or shape variations. We independently train on the PASCAL dataset pascal-voc-2011 () with various annotations pascal_keypoints (); flowWeb () and on the CUB dataset wah2011caltech (), with the same network architecture.

We again use PCK as the metric pck (). To account for variable image size, we consider a predicted keypoint to be correctly matched if it lies within Euclidean distance $\alpha \cdot L$ of the ground truth keypoint, where $L$ is the size of the image and $\alpha$ is a variable we control. For comparison, our definition of $L$ varies depending on the baseline. Since intraclass correspondence alignment is a difficult task, preceding works use either geometric DSP () or learned warpNet () spatial priors. However, even our raw correspondences, without spatial priors, achieve stronger results than previous works.

As shown in Tables 4 and 5, our approach outperforms that of Long et al. convnetsLearnCorrespondence () by a large margin on the PASCAL dataset with Berkeley keypoint annotation, for most classes and also overall. Note that our result is purely from nearest neighbor matching, while convnetsLearnCorrespondence () uses global optimization too. We also train and test UCN on the CUB dataset wah2011caltech (), using the same cleaned test subset as WarpNet warpNet (). As shown in Figure 8, we outperform WarpNet by a large margin. However, please note that WarpNet is an unsupervised method. Please see Figure 7 for qualitative matches. Results on FlowWeb datasets are in the supplementary material, with similar trends.

Finally, we observe that there is a significant performance improvement obtained through use of the convolutional spatial transformer, in both PASCAL and CUB datasets. This shows the utility of estimating an optimal patch normalization in the presence of large shape deformations.

Camera Motion Estimation We use KITTI raw sequences to get more training examples for this task. To augment the data, we randomly crop and mirror the images and to make effective use of our fully convolutional structure, we use large images to train thousands of correspondences at once.

We establish correspondences with nearest neighbor matching, use RANSAC to estimate the essential matrix and decompose it to obtain the camera motion. Among the four candidate rotations, we choose the one with the most inliers as the estimate $R_{pred}$, whose angular deviation with respect to the ground truth $R_{gt}$ is reported as $\theta = \arccos\big((\mathrm{Tr}(R_{pred}^\top R_{gt}) - 1)/2\big)$. Since translation may only be estimated up to scale, we report the angular deviation between unit vectors along the estimated and ground truth translation from GPS-IMU.
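The two error measures can be sketched in Python as follows. The sketch assumes the standard identity that the angle of a relative rotation $R$ is $\arccos((\mathrm{Tr}(R) - 1)/2)$ and covers only the error computation; the RANSAC and essential matrix steps are omitted. The function names and the nested-list matrix format are hypothetical.

```python
import math


def _clamp(v):
    """Guard acos against tiny floating point excursions outside [-1, 1]."""
    return max(-1.0, min(1.0, v))


def rotation_deviation_deg(R_pred, R_gt):
    """Angle (degrees) of the relative rotation between two 3x3 rotation matrices."""
    # Tr(R_pred^T R_gt) equals the elementwise product of the two matrices, summed.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    return math.degrees(math.acos(_clamp((trace - 1.0) / 2.0)))


def translation_deviation_deg(t_pred, t_gt):
    """Angle (degrees) between two translation directions; scale is ignored."""
    dot = sum(a * b for a, b in zip(t_pred, t_gt))
    norms = math.sqrt(sum(a * a for a in t_pred)) * math.sqrt(sum(b * b for b in t_gt))
    return math.degrees(math.acos(_clamp(dot / norms)))


def rot_z(deg):
    """Rotation about the z-axis, used here only to build a test case."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


ang_dev = rotation_deviation_deg(rot_z(0.0), rot_z(30.0))
trans_dev = translation_deviation_deg((1.0, 0.0, 0.0), (0.0, 2.0, 0.0))
```

Because the translation vectors are normalized inside the function, the unknown scale of the estimated translation does not affect the reported angle.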

In Table 6, we list decomposition errors for various features. Note that sparse features such as SIFT are designed to perform well in this setting, but our dense UCN features are still quite competitive. Intermediate features such as seeingByMoving () learn to optimize patch similarity, so our UCN significantly outperforms them since it is trained directly on the correspondence task.

Features                            Ang. Dev. (deg)   Trans. Dev. (deg)
SIFT sift ()                        0.307             4.749
DAISY daisy ()                      0.309             4.516
SURF surf ()                        0.344             5.790
KAZE kaze ()                        0.312             4.584
Agrawal et al. seeingByMoving ()    0.394             9.293
Ours-HN                             0.317             4.147
Ours-HN-ST                          0.325             4.728
Table 6: Essential matrix decomposition performance using various features. The performance is measured as angular deviation from the ground truth rotation and the angle between predicted translation and the ground truth translation. All features generate very accurate estimation.

5 Conclusion

We have proposed a novel deep metric learning approach to visual correspondence estimation, that is shown to be advantageous over approaches that optimize a surrogate patch similarity objective. We propose several innovations, such as a correspondence contrastive loss in a fully convolutional architecture, on-the-fly active hard negative mining and a convolutional spatial transformer. These lend capabilities such as more efficient training, accurate gradient computations, faster testing and local patch normalization, which lead to improved speed or accuracy. We demonstrate in experiments that our features perform better than prior state-of-the-art on both geometric and semantic correspondence tasks, even without using any spatial priors or global optimization. In future work, we will explore applications of our correspondences for rigid and non-rigid motion or shape estimation as well as applying global optimization.


This work was part of C. Choy’s internship at NEC Labs. We acknowledge the support of Korea Foundation of Advanced Studies, Toyota Award #122282, ONR N00014-13-1-0761, and MURI WF911NF-15-1-0479.

A.1 Network Architecture

We use the ImageNet pretrained GoogLeNet googlenet (), from the bottom conv1 to the inception_4a layer, but we used stride 2 for the bottom 2 layers and 1 for the rest of the network. We followed the convention of facenet (); songCVPR16 () to normalize the features, which we found to stabilize the gradients during training. Since we are densely extracting features convolutionally, we implement the channel-wise normalization layer which makes all features have a unit L2 norm.
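The channel-wise normalization can be illustrated with a small Python sketch. It assumes a feature map stored as nested lists indexed `[channel][y][x]` and a small `eps` guard against division by zero; both the storage layout and the function name are hypothetical and do not reflect the Caffe layer itself.

```python
import math


def channelwise_l2_normalize(feature_map, eps=1e-12):
    """Scale the channel vector at every spatial location to unit L2 norm.

    feature_map: nested lists indexed as [channel][y][x].
    """
    C, H, W = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            norm = math.sqrt(sum(feature_map[c][y][x] ** 2 for c in range(C)))
            for c in range(C):
                out[c][y][x] = feature_map[c][y][x] / max(norm, eps)
    return out


# Two channels, one row, two columns: feature vectors (3, 4) and (0, 2).
fmap = [[[3.0, 0.0]], [[4.0, 2.0]]]
normalized = channelwise_l2_normalize(fmap)
```

After normalization every location holds a unit-length feature, so Euclidean distances between features are bounded, which makes a fixed margin in the contrastive loss meaningful.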

After the inception_4a layer, we place the correspondence contrastive loss layer which takes features from both images as well as the respective correspondence coordinates in each image. The correspondences are densely sampled from either flow or matched keypoints. Since the semantic keypoint correspondences are sparse, we augment them with random negative coordinates. When we use the active hard-negative sampling, we place the K-NN layer which returns the nearest neighbor of query image keypoints in the reference image.

We visualize the universal correspondence network in Fig. A1. The model includes the hard negative mining, the convolutional spatial transformer, and the correspondence contrastive loss. The Caffe prototxt file and the interactive web visualization using netscope () are available at http://cvgl.stanford.edu/projects/ucn/.

Figure A1: Visualization of the universal correspondence network with the hard negative mining layer and the convolutional spatial transformer. The Siamese network shares the same weights for all layers. To implement the Siamese network in Caffe, we appended a suffix to all layer names on the second network. Each image goes through the universal correspondence network and the two output features are fed into the K-NN layer to find the hard negatives on-the-fly. After the hard negative mining, the pairs are used to compute the correspondence contrastive loss.

a.2 Convolutional Spatial Transformer

The convolutional spatial transformer consists of a number of affine spatial transformers, and this number depends on the size of the image. For each spatial transformer, the origin of the coordinate system is at the center of the corresponding kernel. We distinguish the coordinates of the points sampled from the previous (input) layer from the coordinates of the points on the output layer. Typically, the output coordinates are the nodes of a regular grid. The coordinates of the sampled points and the target points are related by an affine transformation with learned parameters.

To get the output at each target point, we use bilinear interpolation to sample the input values around the corresponding sampled point, using the four neighboring input values at the lower left, lower right, upper left, and upper right.
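The coordinate relation and the interpolation step can be written out as a sketch in the notation commonly used for spatial transformer networks. All symbols here are assumptions chosen for illustration: $(x^s_i, y^s_i)$ are the sampled input coordinates, $(x^t_i, y^t_i)$ the output grid coordinates, $\theta_{jk}$ the transformation parameters, $U$ the input and $V$ the output. The purely linear $2 \times 2$ form is also an assumption, relying on the origin being at the kernel center.

```latex
\begin{align}
\begin{pmatrix} x^s_i \\ y^s_i \end{pmatrix}
  &= \begin{pmatrix} \theta_{11} & \theta_{12} \\ \theta_{21} & \theta_{22} \end{pmatrix}
     \begin{pmatrix} x^t_i \\ y^t_i \end{pmatrix}, \\
V_i &= (1-a)(1-b)\,U_{00} + a\,(1-b)\,U_{10} + (1-a)\,b\,U_{01} + a\,b\,U_{11}.
\end{align}
```

Here $x_0 = \lfloor x^s_i \rfloor$, $y_0 = \lfloor y^s_i \rfloor$, $a = x^s_i - x_0$, $b = y^s_i - y_0$, and $U_{pq} = U(x_0 + p,\, y_0 + q)$, so that $U_{00}$, $U_{10}$, $U_{01}$, $U_{11}$ play the roles of the lower-left, lower-right, upper-left, and upper-right neighbors when "lower" is read as the smaller $y$ index.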

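The gradients with respect to the input features are given by the bilinear interpolation weights of the four neighboring values.

Finally, the gradients with respect to the transformation parameters are obtained by the chain rule through the sampled coordinates.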
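The forward sampling and both gradients can be illustrated for a single output point with the sketch below. It assumes a single-channel input indexed as `U[y, x]`, a linear 2x2 parameter matrix `theta` applied about a kernel `center`, and no boundary handling. These names and conventions are hypothetical and chosen only to keep the example short.

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample U (indexed U[y, x]) at the real-valued point (xs, ys).

    Returns the value, its derivatives w.r.t. xs and ys, and the four
    interpolation weights, which are the gradients w.r.t. the neighbors.
    """
    x0, y0 = int(np.floor(xs)), int(np.floor(ys))
    a, b = xs - x0, ys - y0
    u00, u10 = U[y0, x0], U[y0, x0 + 1]
    u01, u11 = U[y0 + 1, x0], U[y0 + 1, x0 + 1]
    value = (1 - a) * (1 - b) * u00 + a * (1 - b) * u10 + (1 - a) * b * u01 + a * b * u11
    dv_dx = (1 - b) * (u10 - u00) + b * (u11 - u01)
    dv_dy = (1 - a) * (u01 - u00) + a * (u11 - u10)
    weights = np.array([(1 - a) * (1 - b), a * (1 - b), (1 - a) * b, a * b])
    return value, dv_dx, dv_dy, weights

def transformer_point(U, theta, xt, yt, center):
    """Output value at target offset (xt, yt) and its gradient w.r.t. theta."""
    cx, cy = center
    xs = theta[0, 0] * xt + theta[0, 1] * yt + cx
    ys = theta[1, 0] * xt + theta[1, 1] * yt + cy
    value, dv_dx, dv_dy, weights = bilinear_sample(U, xs, ys)
    # Chain rule: d(value)/d(theta_jk) = d(value)/d(coord_j) * d(coord_j)/d(theta_jk).
    grad_theta = np.array([[dv_dx * xt, dv_dx * yt],
                           [dv_dy * xt, dv_dy * yt]])
    return value, grad_theta, weights

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 8))
theta = np.array([[1.1, 0.2], [-0.1, 0.9]])
value, grad_theta, weights = transformer_point(U, theta, 1.0, -1.0, (3.5, 3.5))
print(value, grad_theta, weights)
```

Because bilinear interpolation is linear in each coordinate within a grid cell, the analytic gradient with respect to `theta` can be checked against central finite differences.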

a.3 Additional tests for semantic correspondence

PASCAL VOC comparison with FlowWeb. We compared the performance of UCN with FlowWeb flowWeb (). As shown in Tab. A1, our approach outperforms FlowWeb. Note that FlowWeb is an optimization method in an unsupervised setting; we therefore split their data per class to train and test our network.

Method              aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv    mean
DSP                 17    30    5     19      33    34    9      3      17     12    12     18    17
FlowWeb flowWeb ()  29    41    5     34      54    50    14     4      21     16    15     33    26
Ours-RN             33.3  27.6  10.5  34.8    53.9  41.1  18.9   0      16.0   22.2  17.5   39.5  31.5
Ours-HN             35.3  44.6  11.2  39.7    61.0  45.0  16.5   4.2    18.2   32.4  24.0   48.3  36.7
Ours-HN-ST          38.6  50.0  12.6  40.0    67.7  57.2  26.7   4.2    28.1   27.8  27.8   45.1  43.0
Table A1: PCK on 12 rigid PASCAL VOC classes, as split in FlowWeb flowWeb () (, ).

Qualitative semantic match results. Please refer to Figs. A2 and A3 for additional qualitative semantic match results.

Figure A2: Additional qualitative semantic correspondence results on PASCAL pascal-voc-2011 () correspondences with Berkeley keypoint annotation pascal_keypoints (). Columns: Query, Ground Truth, Ours HN-ST, VGG conv4_3 NN.
Figure A3: Additional qualitative semantic correspondence results on Caltech-UCSD Bird dataset wah2011caltech (). Columns: Query, Ground Truth, Ours HN-ST, VGG conv4_3 NN.

a.4 Additional KITTI Raw Results

We used a subset of the KITTI raw video sequences for all our experiments. The dataset has 9268 frames, which amounts to 15 minutes of driving. Each frame consists of a Velodyne scan, stereo RGB images, and GPS-IMU sensor input. In addition, we used proprietary segmentation data from NEC to evaluate the performance on different semantic classes.

Scene type City Road Residential
Training 1, 2, 5, 9, 11, 13, 14, 27, 28, 29, 48, 51, 56, 57, 59, 84, 15, 32, 19, 20, 22, 23, 35, 36, 39, 46, 61, 64, 79,
Testing 84, 91 52, 70, 79, 86, 87,
Table A2: KITTI Correspondence Dataset: we used a subset of all KITTI raw sequences to construct a dataset.

We excluded sequences 17, 18, and 60 because the scenes in these videos are mostly static. We also excluded sequence 93 because its GPS-IMU inputs are too noisy.

In Figure A5, we plot the variation in PCK at 30 pixels for various camera baselines in our test set. We label semantic classes on the KITTI raw sequences and evaluate the PCK performance on different semantic classes in Figure A4. The curves have the same color codes as Figure 5 in the main paper.
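For reference, PCK at a fixed pixel threshold can be computed as in the sketch below. The function names and the use of an inclusive threshold (error less than or equal to the threshold counts as correct) are assumptions made for illustration.

```python
import numpy as np

def pck(predicted, ground_truth, threshold):
    """Fraction of keypoints predicted within `threshold` pixels of the ground truth.

    predicted, ground_truth: arrays of shape (N, 2) holding (x, y) pixel coordinates.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    errors = np.linalg.norm(predicted - ground_truth, axis=1)
    return float(np.mean(errors <= threshold))

def pck_curve(predicted, ground_truth, thresholds):
    """PCK evaluated at several pixel thresholds, e.g. to draw a PCK curve."""
    return np.array([pck(predicted, ground_truth, t) for t in thresholds])

# Hypothetical matches with errors of 0, 10, and 40 pixels.
pred = np.array([[0, 0], [10, 0], [0, 40]])
gt = np.zeros((3, 2))
print(pck(pred, gt, threshold=30))
```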

Figure A4: PCK evaluations for semantic classes on the KITTI raw dataset.
Figure A5: PCK performance for various camera baselines on the KITTI raw dataset.

a.5 KITTI Dense Correspondences

In this section, we present more qualitative results of nearest neighbor matches using our universal correspondence network on KITTI images in Fig. A6.

Figure A6: Visualization of dense feature nearest neighbor matches on the KITTI dataset. For each row, we visualize the query points (left) and the nearest neighbor matches (right) on images with 1 frame difference.
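The nearest neighbor matching visualized in these figures can be sketched as a brute-force search over the dense feature map of the second frame. This is a minimal illustration, assuming (C, H, W) feature maps and integer (x, y) query coordinates. The function name is hypothetical, and the actual system may use a GPU K-NN implementation instead.

```python
import numpy as np

def nearest_neighbor_matches(feat_query, feat_ref, query_coords):
    """For each (x, y) query keypoint, return the (x, y) location in the
    reference feature map whose feature has the smallest Euclidean distance."""
    C, H, W = feat_ref.shape
    q = feat_query[:, query_coords[:, 1], query_coords[:, 0]].T   # (N, C)
    r = feat_ref.reshape(C, -1).T                                 # (H*W, C)
    # Squared distances via ||q||^2 - 2 q.r + ||r||^2, shape (N, H*W).
    d2 = (q ** 2).sum(1)[:, None] - 2.0 * q @ r.T + (r ** 2).sum(1)[None, :]
    idx = np.argmin(d2, axis=1)
    ys, xs = np.divmod(idx, W)
    return np.stack([xs, ys], axis=1)

# Toy check: the reference features are the query features shifted right by one pixel.
rng = np.random.default_rng(0)
feat_q = rng.standard_normal((16, 5, 6))
feat_r = np.roll(feat_q, shift=1, axis=2)
points = np.array([[0, 0], [2, 3], [4, 1]])
print(nearest_neighbor_matches(feat_q, feat_r, points))
```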

a.6 Sintel Dense Correspondences

Figure A7: Visualization of dense feature nearest neighbor matches on the Sintel dataset sintel (). For each row, we visualize the query points (left) and the nearest neighbor matches (right) on images with 1 frame difference.

In this section, we present more qualitative results of nearest neighbor matches using our universal correspondence network on Sintel images in Fig. A7.


  1. P. Agrawal, J. Carreira, and J. Malik. Learning to See by Moving. In ICCV, 2015.
  2. P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In ECCV, 2012.
  3. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU, 2008.
  4. L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d pose annotations. In ICCV, 2009.
  5. J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a Siamese time delay neural network. In NIPS, 1994.
  6. D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
  7. S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, June 2005.
  8. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  9. S. Dasgupta. Netscope: network architecture visualizer or something, 2015.
  10. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results.
  11. V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast gpu-based implementations and application to high-dimensional feature matching. In ICIP, 2010.
  12. R. Girshick. Fast R-CNN. ArXiv e-prints, Apr. 2015.
  13. R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  14. M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. NIPS, 2015.
  15. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  16. K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  17. A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly Supervised Matching for Single-view Reconstruction. ArXiv e-prints, Apr. 2016.
  18. A. Kanazawa, A. Sharma, and D. Jacobs. Locally Scale-invariant Convolutional Neural Network. In Deep Learning and Representation Learning Workshop: NIPS, 2014.
  19. J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In CVPR, 2013.
  20. C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. PAMI, 33(5), May 2011.
  21. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
  22. J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
  23. D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  24. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.
  25. M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
  26. J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. DeepMatching: Hierarchical Deformable Dense Matching. Oct. 2015.
  27. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  28. H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
  29. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  30. E. Tola, V. Lepetit, and P. Fua. DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo. PAMI, 2010.
  31. J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
  32. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  33. H. Yang, W. Y. Lin, and J. Lu. DAISY filter flow: A generalized approach to discrete dense correspondences. In CVPR, 2014.
  34. Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. PAMI, 2013.
  35. K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
  36. S. Zagoruyko and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. CVPR, 2015.
  37. J. Zbontar and Y. LeCun. Computing the stereo matching cost with a CNN. In CVPR, 2015.
  38. T. Zhou, Y. J. Lee, S. X. Yu, and A. A. Efros. FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, June 2015.