AnchorNet: A Weakly Supervised Network to Learn Geometry-sensitive Features For Semantic Matching


David Novotny     Diane Larlus     Andrea Vedaldi
Visual Geometry Group, Dept. of Engineering Science, University of Oxford
{david,vedaldi}@robots.ox.ac.uk
Computer Vision Group, Xerox Research Centre Europe
diane.larlus@xrce.xerox.com
Abstract

Despite significant progress of deep learning in recent years, state-of-the-art semantic matching methods still rely on legacy features such as SIFT or HoG. We argue that the strong invariance properties that are key to the success of recent deep architectures on the classification task make them unfit for dense correspondence tasks, unless a large amount of supervision is used. In this work, we propose a deep network, termed AnchorNet, that produces image representations that are well-suited for semantic matching. It relies on a set of filters whose response is geometrically consistent across different object instances, even in the presence of strong intra-class, scale, or viewpoint variations. Trained only with weak image-level labels, the final representation successfully captures information about the object structure and improves results of state-of-the-art semantic matching methods such as the deformable spatial pyramid or the proposal flow methods. We show positive results on the cross-instance matching task where different instances of the same object category are matched as well as on a new cross-category semantic matching task aligning pairs of instances each from a different object class.

1 Introduction

Matching, i.e. the problem of establishing correspondences between images, is one of the tent-poles of image understanding. It is well known that, given matches between images of the same object or scene, it is possible to estimate 3D geometry (stereo and structure from motion) and motion (visual odometry, optical flow, and tracking). But matching can be applied to much more abstract levels of understanding as well. For example, aligning different object instances of the same type [32, 21] makes it possible to discover analogies between objects, inducing abstractions such as object categories.

Figure 1: We propose AnchorNet, a novel deep architecture that produces an image representation which significantly improves state-of-the-art semantic matching methods. Key to its success is a set of filters with a sparse response that is geometrically consistent across different instances of a category or of two similar categories. Although these filters are learned in a weakly supervised manner (i.e. only image-level labels are used) they tend to anchor reliably on meaningful object parts.

While reliable techniques exist for low-level matching, high-level matching of different object instances remains a heavily-researched topic. Most of the work in this area has focused on finding powerful geometric regularizers, such as hierarchical correspondences [35] or deformable spatial pyramids [32], to compensate for the still brittle visual descriptors. Surprisingly, even powerful convolutional neural network (CNN) descriptors have been found lacking for cross-instance matching [37, 21, 64], and in fact comparable or even inferior to old hand-crafted features such as SIFT [38] and HoG [11] for this task.

Figure 2: Example responses of anchor filters discovered by AnchorNet. (a) and (b) show the class-specific filters for the bird and dog classes respectively, while (c) depicts the class-agnostic filters across different categories (one filter per row).

It is unclear why CNN representations, which perform well for many challenging vision tasks, including object detection [16] and segmentation [36], image captioning [58], and visual question answering [1], have not been found to work as well for cross-instance matching. Our hypothesis is that this is due to the fact that CNNs are trained on large datasets such as Imagenet ILSVRC [12] purely for the image classification task. By learning with the sole purpose of predicting a global image label, CNNs become insensitive to local details and geometry and hence work poorly for matching. This effect can be reversed by fine-tuning the model on substantial amounts of data strongly supervised with bounding box [16] or keypoint [9] annotations. While this turns CNNs into excellent object and keypoint detectors, it defeats the purpose of using CNN features as generic descriptors for discovering correspondences in an unsupervised manner, as matching requires.

In this paper, we address this issue by introducing a new deep architecture that can learn representations that work well for cross-instance matching (Figure 1), while using exactly the same supervision as traditional pre-training – namely image-level labels used to train categorizers on ILSVRC12 [12]. Using only image-level labels for matching amounts to weak supervision since the labels do not provide any information on the geometry of objects or scenes.

Our key insight is that a set of diverse and sparse filter responses provides a powerful representation for establishing matches. Convolutional features that respond sparsely on an image tend to automatically anchor to distinctive image structures such as semantic object parts. Further enforcing diversity of the filter bank responses results in a good coverage. This yields a unique description for all object fragments which is an essential property that enables reliable estimation of dense semantic correspondences.

We incorporate this idea by extracting from information-rich residual hypercolumns (section 3.1) a bank of distinctive and diverse filters with orthogonal responses (section 3.2, Figure 2). In this framework, which we call AnchorNet, geometric consistency is not imposed explicitly, but emerges spontaneously. We also show how to compress banks of class-specific filters into a class-agnostic bank (section 3.3) which works well for all classes.

Extensive experiments show that the proposed representation can be seamlessly leveraged by state-of-the-art semantic matching methods such as the Deformable Spatial Pyramid [32] or Proposal Flow [21] in order to improve their performance (section 4.1). For the first time, we also show that high-level correspondences can be established between objects of different categories, including new ones, unseen during the training of our network (section 4.2).

2 Related Work

Finding dense correspondences. The classical matching methods estimate very accurate pixel correspondences between two images of the same scene, in the presence of moderate viewpoint variations [25, 39, 44]. Early methods use different hand-crafted features such as SIFT [38], HoG [11], SURF [4] or DAISY [53]. This task has many applications including stereo matching [44], optical flow [25, 60], or wide baseline matching [39, 62].

Recent works have generalized the notion of flow to image pairs that are only semantically related [34, 46, 32, 51, 21]. This requires handling a higher degree of variability in appearance. The semantic alignment task also finds many applications such as image completion [3], enhancement [20], or segmentation [34], and video depth estimation [30]. The SIFT Flow algorithm [35, 34] pioneered the idea of dense correspondences across different scenes and proposed a multi-resolution image pyramid together with a hierarchical optimization algorithm for efficiency. This approach was extended by the Deformable Spatial Pyramid (DSP) algorithm [32], which introduced a multi-scale regularization with a hierarchically connected pyramid of graphs. The generalized deformable spatial pyramid [28] improves over DSP by enforcing additional spatial constraints at a significant computational cost. The PatchMatch method [2] and its extension [3] target general purpose matching, including cross-instance matching. The method of [5] builds an exemplar-LDA classifier for every pixel to obtain dense correspondences that improve the performance of scene flows. Proposal Flow [21] leverages the recent development in object proposals and uses local and geometric consistency constraints to establish dense semantic correspondences. Finally, WarpNet [29] learns correspondences by exploiting the relationships within a fine-grained dataset.

A few methods [26, 27, 45, 31, 41, 63] have posed the problem of finding correspondences as the joint alignment of multiple pairs of images, defining the task of collective alignment. These methods assume sets of images that share a category label and consistent viewpoints. The latest method in this field is FlowWeb [63], which builds a fully connected graph with images as nodes and pairwise flow fields as edges. Yet, this method scales poorly with the size of the image collection, and it is not straightforward to establish pairwise alignments between new samples.

Deep features for correspondences. Long et al. [37] studied the application of CNN features pre-trained on large classification datasets for finding correspondences between object instances. They found that CNN features perform on par with hand-crafted alternatives such as SIFT for the weakly-supervised keypoint transfer problem, and can outperform them when keypoint supervision is available. This work paved the way to new deep architectures trained for finding dense correspondences between images of the same object or scene instance [13, 59, 52]. Recently, Choy et al. [9] proposed a deep architecture that performs well at cross-instance alignment, but requires strong supervision in the form of many keypoint matches.

The question of how to train deep features without keypoint annotations remains unanswered, and state-of-the-art semantic matching methods [32, 21] still rely on hand-engineered SIFT and HoG features, respectively.

3 Method

Figure 3: The proposed AnchorNet architecture. First, images are described using hypercolumn descriptors. Sparse filters are discovered for each category using a set of discriminability and diversity losses. Finally, a denoising autoencoder learns how to share these filters between categories, leading to a final category-agnostic representation that generalizes to new classes.

The output of a deep convolutional layer in a CNN is a tensor of height $H$, width $W$, and $D$ feature channels. Thus, at each spatial location $(u,v)$, one obtains a $D$-dimensional feature vector. As noted by [10], such CNN feature vectors are analogous to hand-crafted dense descriptors like HoG and Dense-SIFT and can often be used as a plug-and-play replacement for the latter in applications. However, as noted in e.g. [37] and shown in the experiments, this substitution does not work well for cross-instance matching algorithms such as DSP [32] and Proposal Flow [21].

Since CNNs can be turned into excellent keypoint detectors by fine-tuning on data strongly annotated with keypoint labels [9, 54], the reason for this failure must lie in the way most CNNs are pre-trained on image classification tasks. Note that collecting keypoint annotations for every category does not scale and defeats the purpose of cross-instance matching, which is to discover such correspondences automatically. As a solution, we propose a new architecture that, while using the same image-level supervision as the standard pre-training on the classification task, learns features with better geometric awareness.

Our method is motivated by a simple observation. Suppose that learning encourages a feature to respond very locally (ideally at a point). A convolutional filter can do this only by responding to a visual structure that occurs uniquely in each image, hence a distinctive part or keypoint of an object. We call this the anchoring principle. A geometry-aware representation suitable for semantic matching should discover a complete set of such features that ultimately covers the whole object. We can do so by learning a bank of filters that respond to complementary image locations. We call this the diversity principle. Note that diversity indirectly encourages anchoring, as, if features respond to different parts of an image, they must also respond locally. Armed with these insights, we propose next an architecture termed AnchorNet that follows the two principles. We then show that they are sufficient to significantly boost the geometric awareness of the resulting features. A diagram of our network is presented in Figure 3.

3.1 Residual hypercolumns

We base our AnchorNet architecture on the powerful residual architectures of [24]. We select the ResNet50 model as a good compromise between speed and accuracy.

In order to improve the geometric sensitivity of the representation, we follow [22] and extract hypercolumns (HC). A HC at location $(u,v)$ in the image is created by concatenating the convolutional feature responses at that location from different layers of the network. Recall that, in most CNN architectures, deeper features have reduced resolution; HC compensates for this by upsampling the responses to a common size before concatenation. We denote the resulting network $\Phi$, so that $\Phi(x)$ is the HC tensor computed from the input image $x$.

In more detail, we bilinearly upsample and concatenate the rectified outputs of the res2c, res4c and res5c layers [24] into a hypercolumn tensor. Before concatenation, descriptors extracted at each layer are compressed by PCA to 256 dimensions (PCA is implemented as a filter bank) and normalized to balance their energies. This results in 768-dimensional HC vectors.
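To make this construction concrete, the following PyTorch-style sketch assembles hypercolumns from precomputed layer outputs; the function name, tensor shapes and the exact ordering of the PCA and normalization steps are our assumptions based on the description above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hypercolumn(layer_feats, pca_banks, out_size):
    """Assemble residual hypercolumns from precomputed conv features.

    layer_feats: list of feature maps from res2c, res4c and res5c,
                 each of shape (N, C_i, H_i, W_i).
    pca_banks:   matching list of 1x1 conv layers implementing the per-layer
                 PCA projection to 256 channels (assumed learned offline).
    out_size:    common spatial size (H, W) of the hypercolumn tensor.
    """
    columns = []
    for feat, pca in zip(layer_feats, pca_banks):
        x = F.relu(feat)                      # rectified layer output
        x = pca(x)                            # PCA implemented as a filter bank
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        x = F.normalize(x, dim=1)             # balance energies across layers
        columns.append(x)
    return torch.cat(columns, dim=1)          # 3 x 256 = 768-dim HC vectors

# Hypothetical usage with ResNet50 channel counts (256, 1024, 2048):
# pca_banks = [torch.nn.Conv2d(c, 256, kernel_size=1) for c in (256, 1024, 2048)]
# hc = hypercolumn([res2c, res4c, res5c], pca_banks, out_size=(56, 56))
```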

3.2 Learning anchoring features for an object type

The residual HC are high-capacity descriptors reflecting both high-level semantics and low-level image details. While this suggests that they should contain enough information for establishing matches, their direct utilization leads to suboptimal results. Thus, we learn a bank of convolutional filters, termed anchor filters, that compress the HC responses into a compact set of response maps suitable for matching. To this end, we learn filters that satisfy two properties: discriminability and diversity.

Discriminability constraints. We start by learning filters predictive of an object category. As a result, the filters tend to focus on relevant foreground objects, and rarely on the background. Without loss of generality, we first consider a binary setting where images either contain object instances of a single object category $c$ (label $y = +1$) or irrelevant background (label $y = -1$). We later extend this to multiple categories in section 3.3.

Learning uses a large dataset of images with cheap-to-obtain image-level class labels. We follow common deep networks [50, 24, 33] and use ILSVRC 12 [47] for training.

Discriminability is encouraged by minimizing the following loss function:

$$\mathcal{L}_{\max}(x, y; F_c) = \sum_{k=1}^{K} \ell\big({-y}\,\operatorname{gmax}(F_{ck} * \Phi(x))\big), \qquad (1)$$

where $\Phi(x)$ denotes the HC tensor extracted from image $x$, $F_c = (F_{c1}, \dots, F_{cK})$ is the bank of $K$ filters learned for category $c$, the function $\ell(z) = \log(1 + e^{z})$ is the smooth version of ReLU [42], and gmax is the global max-pooling operator.

Minimizing $\mathcal{L}_{\max}$ identifies the strongest response of each filter in the image and then enhances or suppresses it depending on whether the image contains the object. A disadvantage is that, due to the global max-pooling, the backpropagated signal is extremely sparse, which makes learning slow. To speed up convergence, we introduce a secondary loss $\mathcal{L}_{\mathrm{avg}}$ that, for negative images only, generates much denser gradients by using global average pooling (gavg) instead of max pooling:

$$\mathcal{L}_{\mathrm{avg}}(x, y; F_c) = [y = -1] \sum_{k=1}^{K} \ell\big(\operatorname{gavg}(F_{ck} * \Phi(x))\big), \qquad (2)$$

where $[\cdot]$ denotes the indicator function.

Using global average pooling is meaningful for the negative images, where all responses should be suppressed, but not for the positive ones, where only selected responses should be enhanced.
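A compact PyTorch-style sketch of these two terms, using Eqs. (1) and (2) as reconstructed above (the softplus form of $\ell$ and the reductions over filters and images are our assumptions):

```python
import torch
import torch.nn.functional as F

def discriminability_losses(responses, y):
    """responses: (N, K, H, W) response maps of the K class-specific filters
    applied to the hypercolumn tensors; y: (N,) labels in {+1, -1}
    (+1 = image of the target category, -1 = background image)."""
    y = y.float()
    gmax = responses.amax(dim=(2, 3))       # strongest response of each filter
    gavg = responses.mean(dim=(2, 3))       # average response of each filter

    # Eq. (1): enhance the max response on positives, suppress it on negatives.
    l_max = F.softplus(-y[:, None] * gmax).sum(dim=1).mean()

    # Eq. (2): on negative images only, suppress *all* responses (denser gradients).
    is_neg = (y < 0).float()
    l_avg = (is_neg[:, None] * F.softplus(gavg)).sum(dim=1).mean()
    return l_max, l_avg
```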

Diversity constraints. Discriminability alone encourages filters to respond to the object; however different filters may learn to respond to redundant highly-distinctive object parts. In order to obtain good coverage (and ultimately good anchoring), we require the filters of one class to be active on diverse regions.

The diversity constraint is implemented by two diversity losses $\mathcal{L}_{F\perp}$ and $\mathcal{L}_{M\perp}$, encouraging orthogonality of the filters and of their responses, respectively. $\mathcal{L}_{F\perp}$ makes filters orthogonal by penalizing their correlations, as follows:

$$\mathcal{L}_{F\perp}(F_c) = \sum_{k \neq q} \sum_{u} \frac{\langle F_{cku},\, F_{cqu}\rangle^2}{\|F_{cku}\|^2\, \|F_{cqu}\|^2}, \qquad (3)$$

where $F_{cku}$ is the column of filter $F_{ck}$ at spatial location $u$. Note that orthogonal filters are likely to respond to different image structures, but this is not necessarily the case. Thus, we introduce a second term that directly decorrelates the filters' response maps $M_k = F_{ck} * \Phi(x)$:

$$\mathcal{L}_{M\perp}(x; F_c) = \sum_{k \neq q} \frac{\langle M_k,\, M_q\rangle^2}{\|M_k\|^2\, \|M_q\|^2}. \qquad (4)$$

This term is further regularized by smoothing the response maps with a Gaussian kernel $g_\sigma$, i.e. replacing $M_k$ by $g_\sigma * M_k$ prior to computing the loss; this encourages filter responses to spread farther apart by dilating their activations. Note that inducing diversity among classifier predictions has been explored before [15, 19, 18, 48, 6]; however, none of these works considers diversity as a loss used to train a deep representation as we propose.
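The two diversity terms can be sketched as follows (again PyTorch-style; the squared normalized correlations and the averaging over filter pairs follow our reconstruction of Eqs. (3) and (4)):

```python
import torch
import torch.nn.functional as F

def diversity_losses(filters, responses, eps=1e-8):
    """filters:   (K, C, h, w) class-specific filter bank F_c;
    responses: (N, K, H, W) response maps M_k (optionally Gaussian-smoothed).
    Returns the filter orthogonality term of Eq. (3) and the response
    decorrelation term of Eq. (4)."""
    K = filters.shape[0]
    off_diag = ~torch.eye(K, dtype=torch.bool, device=filters.device)

    # Eq. (3): decorrelate filter columns, one spatial location at a time.
    cols = F.normalize(filters.flatten(2), dim=1, eps=eps)    # (K, C, h*w)
    corr_f = torch.einsum("kcu,qcu->kqu", cols, cols)         # (K, K, h*w)
    l_filters = (corr_f ** 2)[off_diag].mean()

    # Eq. (4): decorrelate the (smoothed) response maps of different filters.
    maps = F.normalize(responses.flatten(2), dim=2, eps=eps)  # (N, K, H*W)
    corr_m = torch.einsum("nku,nqu->nkq", maps, maps)         # (N, K, K)
    l_maps = (corr_m ** 2)[:, off_diag].mean()
    return l_filters, l_maps
```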

Discussion. By making a large number of filters both discriminative and diverse, our method indirectly encourages them to become highly-specialized and hence to respond to unique parts of objects (the anchoring principle). This happens automatically, without enforcing such geometric properties explicitly. This intuition is strongly supported by our experiments. Examples of the filters learned for the bird and dog classes are presented in Figure 2 (a) and (b). It is apparent that filters fire on consistent object parts despite large intraclass variations, demonstrating the power of our formulation and its applicability to matching.

3.3 Class-agnostic representation

In the previous section we defined category-specific anchoring filters. In this section, we extend them to be generic to any category. This makes it possible to use the same representation for every image, irrespective of its label, to match instances across different categories (e.g. dog vs cat), and even to handle new categories.

First, a filter bank $F_c$ is learned for each object category $c$ using the method above. Each bank is learned by considering only images of that object class and a common background class. Since filters are not learned to discriminate between objects, and since the diversity losses are applied only within each bank, different filter banks can develop correlations. Figure 2 illustrates this by showing that filters learned for the “dog” and “bird” classes capture similar concepts such as eyes or noses.

We take advantage of the overlap between different banks by introducing a new bank of filters $G$ that projects the class-specific responses of the filters $F_c$ to general-purpose response maps applicable to objects of any class.

In order to learn the projections end-to-end, we add a denoising autoencoder (DAE) [57] to our architecture. DAE minimizes the reconstruction loss

$$\mathcal{L}_{\mathrm{DAE}} = d\big(\bar{M},\; G \ast^{T} (G * \rho(\bar{M}))\big), \qquad \bar{M} = M - \mu, \qquad (5)$$

where $d$ is the distance between the normalized tensors $\bar{M}$ and its reconstruction, and $\ast^T$ is the convolution transpose operator [56]. Here $\bar{M}$ denotes the stack of class-specific heatmaps $M$ centered by removing their mean $\mu$, estimated online during training. We have observed that centering followed by normalization greatly improves the convergence properties of $\mathcal{L}_{\mathrm{DAE}}$. The function $\rho$ injects noise by randomly setting to zero 25% of the feature channels of the tensor $\bar{M}$.

The decorrelation loss of eq. (3) is applied to the compression filters $G$ as well in order to encourage their diversity.

Note that the reconstruction loss $\mathcal{L}_{\mathrm{DAE}}$, when optimized end-to-end with the rest of the model, encourages the maps $M$ to shrink (because, if $M$ is zero everywhere, the autoencoder has a trivial optimum). This is however prevented by the decorrelation losses $\mathcal{L}_{F\perp}$ and $\mathcal{L}_{M\perp}$. $\mathcal{L}_{\mathrm{DAE}}$ thus works as a regularizer enforcing part sharing. Examples of the learned class-agnostic filters are shown in Figure 2 (c).
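The class-agnostic stage can be sketched as a small denoising autoencoder with channel dropout; the 1x1 kernels, the tied encoder/decoder shapes and the squared-error choice for the distance $d$ are our assumptions, while the 256-d code, the channel-dropping noise and the centre-then-normalize preprocessing follow the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassAgnosticDAE(nn.Module):
    """Compress the stack of class-specific heatmaps M into class-agnostic
    response maps and reconstruct the (centered, normalized) input."""

    def __init__(self, in_channels, code_channels=256, drop=0.25):
        super().__init__()
        self.encode = nn.Conv2d(in_channels, code_channels, kernel_size=1)        # filters G
        self.decode = nn.ConvTranspose2d(code_channels, in_channels, kernel_size=1)
        self.drop = drop

    def forward(self, m, mean):
        m_bar = F.normalize(m - mean, dim=1)          # center, then normalize
        if self.training:                             # rho: drop 25% of the channels
            keep = torch.rand(m_bar.shape[:2], device=m_bar.device) > self.drop
            noisy = m_bar * keep[:, :, None, None].float()
        else:
            noisy = m_bar
        code = self.encode(noisy)                     # class-agnostic maps (ANet features)
        recon = self.decode(code)
        loss = F.mse_loss(F.normalize(recon, dim=1), m_bar)
        return code, loss
```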

Denoising autoencoders have been used for domain adaptation before [7, 17]. In a similar spirit, the last part of our network transforms a set of class (domain) specific filters into a domain-invariant representation that can accommodate any class, even ones not seen during training.

Network training. AnchorNet is optimized with stochastic gradient descent (SGD) by minimizing the sum of the proposed losses $\mathcal{L}_{\max}$, $\mathcal{L}_{\mathrm{avg}}$, $\mathcal{L}_{F\perp}$, $\mathcal{L}_{M\perp}$, and $\mathcal{L}_{\mathrm{DAE}}$, with mini-batches of size 16, a fixed learning rate, and a momentum of 0.0005. Parameters of the network are initialized with the ResNet50 model pre-trained on ILSVRC12. We use a two-stage optimization to speed up the training process. First, the class-specific filters are trained independently for each object class, keeping the rest of the network parameters fixed. Then, we attach the autoencoder and the reconstruction loss and fine-tune all the network parameters end-to-end. Further details are provided in the supplementary material.
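Putting the pieces together, a schematic stage-1 training step might look as follows; the model API and the per-loss weight names are hypothetical, and the step reuses the loss sketches given earlier (stage 2 would additionally include the reconstruction loss of the autoencoder):

```python
import torch

def stage1_step(model, optimizer, batch, weights):
    """One SGD step over the class-specific filters; the hypercolumn trunk is
    kept frozen in this stage. `model(x)` is assumed to return the filter
    response maps and the filter weights for the current category."""
    x, y = batch
    responses, filters = model(x)                                 # hypothetical API
    l_max, l_avg = discriminability_losses(responses, y)          # Eqs. (1)-(2)
    l_filters, l_maps = diversity_losses(filters, responses)      # Eqs. (3)-(4)
    loss = (weights["max"] * l_max + weights["avg"] * l_avg
            + weights["filters"] * l_filters + weights["maps"] * l_maps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```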

4 Experiments

mean aero bike bird boat bottle bus car cat chair cow dog horse mbike person plant sheep sofa table train tv
Pairwise alignment methods
DSP + ANet-class 0.45 0.31 0.49 0.32 0.53 0.75 0.51 0.47 0.23 0.53 0.37 0.20 0.33 0.41 0.22 0.46 0.45 0.77 0.45 0.48 0.74
DSP + ANet 0.45 0.29 0.47 0.29 0.52 0.73 0.50 0.46 0.25 0.53 0.37 0.21 0.34 0.39 0.20 0.44 0.45 0.77 0.45 0.51 0.74
DSP + HC 0.41 0.29 0.45 0.24 0.51 0.73 0.48 0.44 0.20 0.52 0.32 0.16 0.28 0.35 0.19 0.39 0.37 0.74 0.44 0.48 0.67
DSP + SIFT [32] 0.39 0.25 0.46 0.21 0.48 0.63 0.50 0.45 0.19 0.48 0.30 0.14 0.26 0.35 0.13 0.40 0.37 0.66 0.37 0.48 0.62
Proposal Flow + ANet-class 0.43 0.26 0.43 0.28 0.54 0.71 0.50 0.45 0.24 0.54 0.32 0.21 0.28 0.35 0.21 0.45 0.40 0.74 0.46 0.50 0.70
Proposal Flow + ANet 0.42 0.26 0.41 0.26 0.53 0.70 0.49 0.45 0.25 0.54 0.31 0.19 0.28 0.31 0.17 0.43 0.39 0.74 0.44 0.52 0.69
Proposal Flow + HC 0.42 0.26 0.42 0.26 0.54 0.70 0.50 0.45 0.23 0.53 0.32 0.18 0.27 0.32 0.18 0.43 0.38 0.74 0.45 0.51 0.64
Proposal Flow + HoG [21] 0.41 0.25 0.45 0.23 0.54 0.70 0.49 0.44 0.19 0.53 0.30 0.16 0.25 0.35 0.16 0.41 0.35 0.74 0.44 0.50 0.63
Baseline: NoFlow 0.39 0.27 0.40 0.22 0.50 0.73 0.46 0.42 0.20 0.51 0.30 0.15 0.25 0.32 0.18 0.38 0.34 0.74 0.44 0.47 0.64
Collective alignment methods
FlowWeb [63] 0.43 0.33 0.53 0.24 0.51 0.72 0.54 0.51 0.20 0.52 0.32 0.15 0.29 0.45 0.19 0.41 0.39 0.73 0.41 0.51 0.68
Table 1: Weighted IoU for pairwise semantic part matching on PASCAL Parts. The proposed methods are in bold.

We thoroughly compare our method with existing techniques for semantic matching (section 4.1). Then, we assess how well our features allow matches to be established across images of different categories (section 4.2), which, to the best of our knowledge, has never been demonstrated before.

Note that for all reported results, training only uses ILSVRC12 [12] images and labels, where the categories are merged according to the PASCAL-ILSVRC class mapping from [12] (e.g. sofa is a merge of “studio couch” and “day bed”). In this manner, 231 ILSVRC classes are used as positive examples spread over the 20 PASCAL VOC classes; the remaining 769 classes are used to form the set of negative (background) images. Even when we report results on one of the PASCAL VOC [14] classes, none of the PASCAL VOC training data is used.

4.1 Dense pairwise semantic matching

We follow the standard practice [63, 21] of using a dataset with manually annotated semantic keypoints or regions and assess how well a semantic matching method, in combination with different types of features, transfers the annotations from one image to another. We experiment on three datasets, following their respective evaluation protocols.

Figure 4: Segmentation mask transfer on PASCAL Parts for DSP+ANet (ours), Proposal Flow + HoG, and DSP + SIFT.
mean aero bike boat bottle bus car chair mbike sofa table train tv
Pairwise alignment methods
DSP + ANet-class 0.24 0.23 0.28 0.06 0.38 0.44 0.39 0.14 0.19 0.16 0.11 0.13 0.41
DSP + ANet 0.23 0.22 0.25 0.06 0.35 0.42 0.34 0.14 0.17 0.17 0.13 0.14 0.40
DSP + HC 0.20 0.20 0.23 0.05 0.39 0.36 0.25 0.10 0.15 0.12 0.10 0.12 0.28
DSP + SIFT [32] 0.18 0.17 0.30 0.05 0.19 0.33 0.34 0.09 0.17 0.12 0.09 0.12 0.18
Proposal Flow + ANet-class 0.17 0.17 0.21 0.05 0.25 0.26 0.27 0.10 0.14 0.12 0.07 0.10 0.24
Proposal Flow + ANet 0.16 0.16 0.19 0.05 0.22 0.26 0.25 0.10 0.12 0.11 0.05 0.12 0.23
Proposal Flow + HC 0.16 0.17 0.21 0.05 0.23 0.27 0.24 0.09 0.13 0.12 0.05 0.11 0.20
Proposal Flow + HoG [21] 0.17 0.20 0.26 0.05 0.20 0.31 0.29 0.10 0.17 0.13 0.05 0.13 0.21
Baseline: NoFlow 0.17 0.18 0.17 0.05 0.39 0.31 0.17 0.09 0.12 0.11 0.07 0.11 0.24
Collective alignment methods
FlowWeb [63] 0.26 0.29 0.41 0.05 0.34 0.54 0.50 0.14 0.21 0.16 0.04 0.15 0.33
Table 2: PCK () for semantic keypoint transfer on the 12 rigid classes of the PASCAL Parts dataset.

Compared methods. The most successful cross-instance matching methods include DSP [32] and Proposal Flow [21] (PF). In their original formulation, these methods performed best with the Dense SIFT [38] feature for DSP, and the whitened version of HoG [23] for PF. In the following experiments, we replace these descriptors with our representation, as follows.

For DSP, the learned filter banks produce a dense field of feature vectors which are bilinearly upsampled to the original image size, normalized, and passed to DSP as a plug-and-play replacement of Dense SIFT. For PF, we mimic their use of HoG: every object proposal serves as a pooling region for the set of filter activations that are extracted once for every image. The pooling is performed by reading off the filter activations inside the region and resizing them to a fixed spatial resolution using bilinear interpolation. This tensor is then vectorized and normalized to form the final descriptor of the proposal region. We use the variant of PF that extracts 1000 selective search boxes [55] per image. The rest of the matching procedure is identical to the original PF algorithm.
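A schematic version of this pooling step is given below; the coordinate convention and the target grid size (which the text leaves unspecified) are placeholders, so this is an illustration rather than the exact procedure.

```python
import torch
import torch.nn.functional as F

def proposal_descriptor(feature_map, box, out_size):
    """Pool AnchorNet activations inside one object proposal, mimicking the
    way Proposal Flow pools HoG. feature_map: (1, C, H, W) dense activations
    computed once per image; box: (x0, y0, x1, y1) in feature-map coordinates;
    out_size: target (h, w) of the pooled grid."""
    x0, y0, x1, y1 = box
    crop = feature_map[:, :, y0:y1, x0:x1]            # read off activations in the box
    crop = F.interpolate(crop, size=out_size, mode="bilinear", align_corners=False)
    return F.normalize(crop.flatten(1), dim=1)        # vectorize and L2-normalize
```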

We compare both the class-agnostic (ANet) and class-specific (ANet-class) variants of our anchor filters. The class-agnostic variant uses the 256-dimensional features produced by the autoencoder filters $G$, whereas ANet-class uses the output of the class-specific filters $F_c$ corresponding to a given PASCAL VOC object category $c$. Thus, ANet-class assumes knowledge of the object class label, while ANet is universally applicable without requiring additional image-specific information. As baseline descriptors we consider SIFT, HoG and HC descriptors formed by concatenating the PCA-projected layers of ResNet50 (res2c, res4c and res5c; section 3.1). We also report the NoFlow baseline that predicts zero displacement for every pixel.

While we focus on pairwise matching, an alternative is to align many images together, a task known as co-alignment. Among various co-alignment methods, including [26, 45, 31], FlowWeb [63] is currently the state of the art. Due to its superior performance, we only report results for FlowWeb; however, while FlowWeb works very well, it is important to note that it is also substantially more expensive than pairwise matching, does not scale well, and cannot accommodate new image pairs.

Evaluation of segmentation mask transfer. We compare the various methods on the task of transferring semantic part segmentation masks, strictly following the protocol of [63]. Dense semantic matches, as determined by DSP or PF given a descriptor, are used to warp the part segmentation mask from a source to a target image. The matching quality is assessed as the average weighted intersection-over-union (IoU) between the predicted masks and the ground-truth ones for different semantic parts. The results are reported in Table 1; qualitative results are provided in Figure 4.
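For reference, the transfer-and-score step can be sketched as follows; the flow convention (a per-pixel offset from target to source), the nearest-neighbour sampling and the plain per-part IoU are our assumptions, and the part weighting of the benchmark (which follows [63]) is not reproduced here.

```python
import numpy as np

def transfer_mask(src_mask, flow):
    """Warp a binary source part mask into the target image using a dense
    flow field, where flow[y, x] = (dy, dx) maps target pixel (y, x) to
    source pixel (y + dy, x + dx)."""
    h, w = src_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return src_mask[sy, sx]

def iou(pred_mask, gt_mask):
    """Plain intersection-over-union between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0
```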

We make the following observations. First, the ResNet50 features perform at most marginally better than SIFT or HoG, while both ANet and ANet-class features improve performance for both DSP (+6% IoU) and PF (+1% IoU). Second, the class-specific features ANet-class perform on par with the class-agnostic features ANet, demonstrating the ability of our domain generalization approach to compress the class-specific filters into the class-agnostic ones. Third, our features, in combination with DSP, exhibit the best average performance among all the compared methods. Remarkably, both ANet and ANet-class outperform all co-alignment methods, including FlowWeb [63], achieving state-of-the-art results on this dataset. This is an interesting finding, as the co-alignment methods exploit small viewpoint and appearance variations in order to improve pairwise alignments.

Evaluation of keypoint matching. We also evaluate performance on matching semantic keypoints. The corresponding annotations are provided by [61] for the 12 rigid PASCAL VOC categories. As in the previous section, we use the dataset from [63] and, strictly following their evaluation protocol, assess the matching accuracy using PCK, with the misalignment tolerance parameter chosen as in [63].
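For completeness, a common PCK computation looks like the sketch below; normalizing the tolerance by the larger image side is one standard convention and may differ in detail from the benchmark code of [63].

```python
import numpy as np

def pck(pred_pts, gt_pts, img_size, alpha):
    """Percentage of Correct Keypoints: a transferred keypoint counts as
    correct if it lands within alpha * max(H, W) of its ground-truth location."""
    h, w = img_size
    tol = alpha * max(h, w)
    dists = np.linalg.norm(np.asarray(pred_pts) - np.asarray(gt_pts), axis=1)
    return float((dists <= tol).mean())
```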

Table 2 contains the results of this experiment. Our features improve the original DSP results by a large margin (+6% PCK), obtaining state-of-the-art results on this dataset among the pairwise alignment methods. Pairwise matching in fact becomes competitive with the results obtained by FlowWeb in co-alignment, although the latter uses more information. Proposal Flow is generally weaker on this task and is not helped by the better features.

Evaluation of region matching. As a third benchmark dataset, we use the PF dataset and corresponding protocol as described in detail in [21]. The dataset contains 10 image sets of 4 object types and the task is to establish matches between annotated semantic regions within the image sets. We report region matching precision using the definitions specified in [21]. Table 3 contains the results obtained by using the code and data made available by [21].

We evaluate our deep features in combination with the two matching methods presented in [21]: the best performing local offset matching (LOM), and the naive appearance matching (NAM). ANet is compared with the best performing feature from [21], i.e. HoG [23]. We observe that using ANet-class features in combination with both matching methods (LOM, NAM) brings a significant performance improvement. Note in particular that ANet-class is sufficiently powerful to make the NAM baseline, which does not use any sophisticated geometric reasoning, competitive with LOM+HoG, which uses geometric reasoning but hand-crafted features (LOM+ANet-class is even better).

AuCs for PCR
Matching \ Feature   ANet-class   ANet   HoG [21]
NAM: baseline 0.41 0.36 0.29
LOM: Proposal Flow 0.46 0.43 0.43
Table 3: Region matching on the PF dataset.
mean bicycle→mbike mbike→bicycle bus→car car→bus bus→car dog→cat cat→dog sheep→dog dog→sheep horse→cow cow→horse sheep→cow cow→sheep
DSP + ANet 0.37 0.35 0.45 0.52 0.35 0.36 0.25 0.25 0.34 0.27 0.31 0.47 0.37 0.58
DSP + HC 0.32 0.27 0.44 0.48 0.32 0.34 0.20 0.21 0.22 0.23 0.27 0.40 0.28 0.54
DSP + SIFT [32] 0.29 0.28 0.40 0.40 0.27 0.30 0.16 0.16 0.20 0.19 0.26 0.31 0.28 0.50
Proposal Flow + ANet 0.35 0.32 0.38 0.50 0.32 0.37 0.23 0.27 0.30 0.25 0.29 0.41 0.32 0.53
Proposal Flow + HC 0.33 0.31 0.34 0.49 0.29 0.35 0.22 0.24 0.28 0.23 0.28 0.41 0.32 0.53
Proposal Flow + HOG [21] 0.31 0.30 0.43 0.48 0.30 0.35 0.19 0.21 0.22 0.19 0.25 0.37 0.29 0.50
Baseline: NoFlow 0.27 0.26 0.44 0.35 0.26 0.25 0.17 0.18 0.22 0.17 0.22 0.29 0.26 0.49
Table 4: Weighted IoU for cross-category semantic part matching on PASCAL Parts.
Table 5: Semantic matching on the AnimalParts dataset. For each method, we report the average PCK over all possible 12×12 domain pairs. An overview of the individual cross-category results can be found in Figure 5.
Figure 5: Per-domain semantic matching on the AnimalParts dataset. Cells are colored proportionally to the matching performance on a given animal class pair. Columns denote the source domains, rows the targets.
Figure 6: Cross-class alignments on the AnimalParts dataset. Given a target image (top row) and a source image (bottom row), we establish semantic correspondences between parts of different animal classes. The alignment warps the source image into the target image. We compare Proposal Flow + ANet (ours, 2nd row) and Proposal Flow + HoG [21] (3rd row).

4.2 Generalization across categories

The previous section considered the task of aligning different object instances of the same category. Here, we depart from this scenario and instead consider cross-category matching, where correspondences are established between objects of different categories. To the best of our knowledge, this is the first time this task has been considered.

For evaluation, we first use the PASCAL Parts [8] data from [63]. Parts with different location qualifiers are merged into one (e.g. “left-leg” and “right-leg” are merged into “leg”) to ensure shareability across categories. Overall, there are 9 object categories and 13 shared part types.

Second, we consider the AnimalParts [43] dataset, introduced as a test-bed to study the transferability of semantic part detectors. Here, we reuse the dataset in order to assess the transferability of ANet filters trained without explicit part supervision. AnimalParts includes only a few part types (“eye” and “foot”), but a large number of different categories: 100 animals from the ILSVRC12 dataset. In order to present results compactly, animals are grouped into 12 families based on the WordNet [40] hierarchy. For each pair of super-classes, 40 image pairs are randomly sampled for evaluation, resulting in 7K image pairs in total. PCK is computed for each pair of super-classes, and the results are averaged over such pairs. The class-specific ANet-class does not apply here since the goal is to match across categories and most of these categories were not seen during training.

Tables 4 and 5 and Figure 5 show that ANet works substantially better than the alternative features. For AnimalParts, the best results are obtained with Proposal Flow in combination with our features, with a 7% PCK improvement over the PF + HoG baseline. The fact that AnimalParts contains categories unseen at training time (e.g. reptiles) demonstrates the scalability and generalization of the proposed approach. For PASCAL Parts, similarly to the intra-class matching experiment (section 4.1), DSP performs best. Here ANet attains a 16% relative improvement over the best previously published method (Proposal Flow + HoG). Figure 6 provides qualitative results.

5 Conclusion

In this paper we have examined the problem of dense semantic matching. Employing the concept of filter anchoring, we have designed a novel deep architecture, termed AnchorNet. Supervised with only image-level labels, AnchorNet automatically learns a set of filters that respond in a sparse and geometrically consistent manner across object instances. Thanks to these filters, our architecture produces powerful representations for image matching. We experimentally validated these features in conjunction with state-of-the-art semantic matching methods, attaining state-of-the-art performance on the segmentation transfer and keypoint matching tasks. The versatility of our representation has been demonstrated on the new task of cross-category matching, where we report positive results on two test-beds.

Acknowledgments. We would like to thank Xerox Research Center Europe and ERC 677195-IDIU for supporting this research.

Appendix A Learning details

In this section we provide additional details about the learning protocol of AnchorNet. Training converges after visiting a fixed budget of training samples per class in stage 1 and a further budget of samples in stage 2 (two days on a single NVIDIA Tesla M40 GPU). The learning rate was kept fixed, with a minibatch size of 16 and the momentum set to the standard value of 0.0005. The training data were augmented as in [24].

The losses were balanced as follows. The discriminability losses $\mathcal{L}_{\max}$ and $\mathcal{L}_{\mathrm{avg}}$ were given fixed weights. The weight of $\mathcal{L}_{\mathrm{DAE}}$ was set to a higher value, which is necessary due to the inhibition of the gradient by the normalization that takes place just before computing $\mathcal{L}_{\mathrm{DAE}}$. The weights of the diversity losses $\mathcal{L}_{F\perp}$ and $\mathcal{L}_{M\perp}$ were set as high as possible so that they are treated approximately as hard constraints. Importantly, $\mathcal{L}_{\mathrm{DAE}}$ is optimized only when visiting positive samples, as reconstructing the activations of negative samples would waste the capacity of the autoencoder. During the first training stage, we sample positive and negative samples with equal probability. Furthermore, during stage 2, we ensure that the distribution of positive samples is uniform over the set of 20 PASCAL categories. This causes the positive samples from any given object category to be less frequent than the negative samples; hence, in order to rebalance the losses in stage 2, we decrease the weights of negative samples accordingly. Finally, because the gradients from $\mathcal{L}_{\mathrm{DAE}}$ exhibit high magnitudes, we decrease the learning rate on the layers below the first autoencoder layer during the second stage.

Appendix B Additional experimental results

Tables F and G extend Tables 1 and 2 from the paper. On top of the features already provided in Tables 1 and 2, we include more baseline features: res4c and res5c, which are extracted from the ResNet50 architecture, and the features of Simon et al. [49]. The method of [49] selects part-like convolutional feature channels using a mixture of constellation models; however, if two different aspects are detected in two images, the set of common features is too sparse for matching. Thus, we converted their output to dense descriptors for use in DSP and PF by 1) modifying the HC from the ResNet50 architecture by retaining only their part-like channels across all aspects (denoted Constellation-HC) and 2) backpropagating the part-like channel activations to the input image, as they do, and using the image-level activations as dense descriptors (Constellation-BP).

Additionally, to quantify the impact of the diversity losses $\mathcal{L}_{F\perp}$ and $\mathcal{L}_{M\perp}$, we also report the performance of the features produced by the ANet-class method optimized without the diversity losses, with DSP used as the matching algorithm (DSP + ANet-class w/o div.).

We observe that the res4c and res5c features, as well as all the variants of the constellation features, perform on par with the hypercolumn features (HC). The drop in performance of DSP + ANet-class w/o div. compared to DSP + ANet-class highlights the contribution of the diversity losses.

mean aero bike bird boat bottle bus car cat chair cow dog horse mbike person plant sheep sofa table train tv
Pairwise alignment methods
DSP + ANet-class 0.45 0.31 0.49 0.32 0.53 0.75 0.51 0.47 0.23 0.53 0.37 0.20 0.33 0.41 0.22 0.46 0.45 0.77 0.45 0.48 0.74
DSP + ANet-class w/o div. 0.41 0.27 0.42 0.25 0.51 0.72 0.46 0.42 0.21 0.52 0.32 0.19 0.30 0.33 0.18 0.44 0.34 0.75 0.42 0.48 0.64
DSP + ANet 0.45 0.29 0.47 0.29 0.52 0.73 0.50 0.46 0.25 0.53 0.37 0.21 0.34 0.39 0.20 0.44 0.45 0.77 0.45 0.51 0.74
DSP + res4c 0.41 0.28 0.43 0.23 0.50 0.73 0.47 0.43 0.20 0.52 0.31 0.15 0.27 0.34 0.19 0.39 0.36 0.74 0.44 0.48 0.65
DSP + res5c 0.40 0.27 0.42 0.23 0.50 0.73 0.47 0.42 0.20 0.51 0.31 0.15 0.26 0.33 0.19 0.39 0.35 0.74 0.44 0.48 0.65
DSP + HC 0.41 0.29 0.45 0.24 0.51 0.73 0.48 0.44 0.20 0.52 0.32 0.16 0.28 0.35 0.19 0.39 0.37 0.74 0.44 0.48 0.67
DSP + SIFT [32] 0.39 0.25 0.46 0.21 0.48 0.63 0.50 0.45 0.19 0.48 0.30 0.14 0.26 0.35 0.13 0.40 0.37 0.66 0.37 0.48 0.62
DSP + Constellation-HC 0.40 0.28 0.42 0.23 0.50 0.73 0.47 0.42 0.20 0.52 0.31 0.15 0.27 0.34 0.19 0.38 0.36 0.74 0.44 0.48 0.65
DSP + Constellation-BP 0.40 0.27 0.41 0.23 0.50 0.73 0.46 0.42 0.20 0.51 0.31 0.15 0.26 0.33 0.18 0.38 0.35 0.73 0.44 0.47 0.64
Proposal Flow + ANet-class 0.43 0.26 0.43 0.28 0.54 0.71 0.50 0.45 0.24 0.54 0.32 0.21 0.28 0.35 0.21 0.45 0.40 0.74 0.46 0.50 0.70
Proposal Flow + ANet 0.42 0.26 0.41 0.26 0.53 0.70 0.49 0.45 0.25 0.54 0.31 0.19 0.28 0.31 0.17 0.43 0.39 0.74 0.44 0.52 0.69
Proposal Flow + res4c 0.42 0.27 0.44 0.26 0.54 0.70 0.50 0.45 0.23 0.53 0.32 0.18 0.28 0.33 0.17 0.44 0.39 0.74 0.45 0.52 0.66
Proposal Flow + res5c 0.39 0.23 0.34 0.25 0.53 0.70 0.47 0.43 0.22 0.52 0.30 0.18 0.26 0.27 0.17 0.41 0.38 0.73 0.45 0.49 0.60
Proposal Flow + HC 0.42 0.26 0.42 0.26 0.54 0.70 0.50 0.45 0.23 0.53 0.32 0.18 0.27 0.32 0.18 0.43 0.38 0.74 0.45 0.51 0.64
Proposal Flow + HoG [21] 0.41 0.25 0.45 0.23 0.54 0.70 0.49 0.44 0.19 0.53 0.30 0.16 0.25 0.35 0.16 0.41 0.35 0.74 0.44 0.50 0.63
Proposal Flow + Constellation-HC 0.40 0.26 0.39 0.25 0.53 0.68 0.48 0.43 0.21 0.52 0.30 0.17 0.26 0.31 0.15 0.42 0.37 0.72 0.44 0.50 0.62
Proposal Flow + Constellation-BP 0.39 0.25 0.38 0.23 0.53 0.68 0.47 0.41 0.20 0.51 0.29 0.16 0.25 0.30 0.15 0.41 0.35 0.71 0.43 0.49 0.60
Baseline: NoFlow 0.39 0.27 0.40 0.22 0.50 0.73 0.46 0.42 0.20 0.51 0.30 0.15 0.25 0.32 0.18 0.38 0.34 0.74 0.44 0.47 0.64
Collective alignment methods
FlowWeb [63] 0.43 0.33 0.53 0.24 0.51 0.72 0.54 0.51 0.20 0.52 0.32 0.15 0.29 0.45 0.19 0.41 0.39 0.73 0.41 0.51 0.68
Table F: Weighted IoU for pairwise semantic part matching (not to be confused with object or part detection or segmentation) on PASCAL Parts. The methods that use our proposed features are in bold.
mean aero bike boat bottle bus car chair mbike sofa table train tv
Pairwise alignment methods
DSP + ANet-class 0.24 0.23 0.28 0.06 0.38 0.44 0.39 0.14 0.19 0.16 0.11 0.13 0.41
DSP + ANet-class w/o div. 0.17 0.19 0.18 0.06 0.31 0.31 0.18 0.10 0.13 0.12 0.08 0.12 0.24
DSP + ANet 0.23 0.22 0.25 0.06 0.35 0.42 0.34 0.14 0.17 0.17 0.13 0.14 0.40
DSP + HC 0.20 0.20 0.23 0.05 0.39 0.36 0.25 0.10 0.15 0.12 0.10 0.12 0.28
DSP + res4c 0.19 0.20 0.22 0.05 0.39 0.35 0.24 0.10 0.14 0.11 0.09 0.12 0.27
DSP + res5c 0.17 0.19 0.19 0.05 0.38 0.32 0.19 0.09 0.13 0.11 0.08 0.11 0.25
DSP + SIFT [32] 0.18 0.17 0.30 0.05 0.19 0.33 0.34 0.09 0.17 0.12 0.09 0.12 0.18
DSP + Constellation-HC [49] 0.18 0.20 0.21 0.05 0.39 0.33 0.20 0.10 0.13 0.12 0.09 0.12 0.26
DSP + Constellation-BP [49] 0.17 0.19 0.19 0.05 0.39 0.32 0.19 0.10 0.12 0.12 0.08 0.12 0.25
Proposal Flow + ANet-class 0.17 0.17 0.21 0.05 0.25 0.26 0.27 0.10 0.14 0.12 0.07 0.10 0.24
Proposal Flow + ANet 0.16 0.16 0.19 0.05 0.22 0.26 0.25 0.10 0.12 0.11 0.05 0.12 0.23
Proposal Flow + HC 0.16 0.17 0.21 0.05 0.23 0.27 0.24 0.09 0.13 0.12 0.05 0.11 0.20
Proposal Flow + res4c 0.17 0.19 0.24 0.05 0.23 0.28 0.27 0.09 0.15 0.12 0.05 0.13 0.21
Proposal Flow + res5c 0.11 0.13 0.11 0.04 0.21 0.21 0.19 0.07 0.08 0.08 0.05 0.09 0.14
Proposal Flow + HoG [21] 0.17 0.20 0.26 0.05 0.20 0.31 0.29 0.10 0.17 0.13 0.05 0.13 0.21
Proposal Flow + Constellation-HC [49] 0.14 0.18 0.17 0.04 0.19 0.25 0.20 0.08 0.12 0.10 0.05 0.10 0.17
Proposal Flow + Constellation-BP [49] 0.13 0.16 0.15 0.04 0.19 0.25 0.18 0.07 0.10 0.10 0.06 0.10 0.17
Baseline: NoFlow 0.17 0.18 0.17 0.05 0.39 0.31 0.17 0.09 0.12 0.11 0.07 0.11 0.24
Collective alignment methods
FlowWeb [63] 0.26 0.29 0.41 0.05 0.34 0.54 0.50 0.14 0.21 0.16 0.04 0.15 0.33
Table G: PCK () for semantic keypoint transfer on the 12 rigid classes of the PASCAL Parts dataset.

References

  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 2009.
  • [3] C. Barnes, E. Shechtman, D. B. Goldman, and A. Finkelstein. The generalized PatchMatch correspondence algorithm. In Proc. ECCV, 2010.
  • [4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). CVIU, 110(3):346–359, 2008.
  • [5] H. Bristow, J. Valmadre, and S. Lucey. Dense semantic correspondence where every pixel is a classifier. In Proc. ICCV, 2015.
  • [6] T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE IT, 57:4680–4688, 2011.
  • [7] M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In Proc. ICML, 2012.
  • [8] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proc. CVPR, 2014.
  • [9] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In Proc. NIPS. 2016.
  • [10] M. Cimpoi, S. Maji, and A. Vedaldi. Deep convolutional filter banks for texture recognition and segmentation. In Proc. CVPR, 2015.
  • [11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
  • [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. CVPR, 2009.
  • [13] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proc. ICCV, 2015.
  • [14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
  • [15] A. Gane, T. Hazan, and T. S. Jaakkola. Learning with maximum a-posteriori perturbation models. In Proc. AISTATS, 2014.
  • [16] R. Girshick. Fast r-cnn. In Proc. ICCV, 2015.
  • [17] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proc. ICML, 2011.
  • [18] A. Guzman-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In Proc. NIPS, 2012.
  • [19] A. Guzman-Rivera, P. Kohli, D. Batra, and R. A. Rutenbar. Efficiently enforcing diversity in multi-output structured prediction. In Proc. AISTATS, 2014.
  • [20] Y. HaCohen, E. Shechtman, D. B. Goldman, and D. Lischinski. Non-rigid dense correspondence with applications for image enhancement. ACM Transactions on Graphics (Proc. SIGGRAPH), 2011.
  • [21] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow. In Proc. CVPR, 2016.
  • [22] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proc. CVPR, 2015.
  • [23] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In Proc. ECCV, 2012.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proc. CVPR, 2016.
  • [25] B. K. P. Horn and B. G. Schunck. Determining optical flow: A retrospective. Artif. Intell., 59(1-2):81–87, 1993.
  • [26] G. B. Huang, V. Jain, and E. G. Learned-Miller. Unsupervised joint alignment of complex images. In Proc. ICCV, 2007.
  • [27] G. B. Huang, M. A. Mattar, H. Lee, and E. G. Learned-Miller. Learning to align from scratch. In Proc. NIPS, 2012.
  • [28] J. Hur, H. Lim, C. Park, and S. C. Ahn. Generalized deformable spatial pyramid: Geometry-preserving dense correspondence estimation. In Proc. CVPR, 2015.
  • [29] A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly supervised matching for single-view reconstruction. In Proc. CVPR, 2016.
  • [30] K. Karsch, C. Liu, and S. B. Kang. Depth extraction from video using non-parametric sampling. In Proc. ECCV, 2012.
  • [31] I. Kemelmacher-Shlizerman and S. M. Seitz. Collection flow. In Proc. CVPR, 2012.
  • [32] J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In Proc. CVPR, 2013.
  • [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
  • [34] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. PAMI, 33(5):978–994, 2011.
  • [35] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. Sift flow: Dense correspondence across different scenes. In Proc. ECCV, 2008.
  • [36] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
  • [37] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In Proc. NIPS, 2014.
  • [38] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [39] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC, 2002.
  • [40] G. A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38:39–41, 1995.
  • [41] H. Mobahi, C. Liu, and W. T. Freeman. A compositional model for low-dimensional image set representation. In Proc. CVPR, 2014.
  • [42] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proc. ICML, 2010.
  • [43] D. Novotny, D. Larlus, and A. Vedaldi. I have seen enough: Transferring parts across categories. In Proc. BMVC, 2016.
  • [44] M. Okutomi and T. Kanade. A multiple-baseline stereo. PAMI, 15(4):353–363, 1993.
  • [45] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. Rasl: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Proc. CVPR, 2010.
  • [46] W. Qiu, X. Wang, X. Bai, A. Yuille, and Z. Tu. Scale-space sift flow. In Proc. WACV, 2014.
  • [47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • [48] M. Schiegg, F. Diego, and F. A. Hamprecht. Learning diverse models: The coulomb structured support vector machine. In Proc. ECCV, 2016.
  • [49] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In Proc. ICCV, 2015.
  • [50] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
  • [51] M. Tau and T. Hassner. Dense correspondences across scenes and scales. PAMI, 38(5):875–888, 2016.
  • [52] J. Thewlis, S. Zheng, P. Torr, and A. Vedaldi. Fully-trainable deep matching. In Proc. BMVC, 2016.
  • [53] E. Tola, V. Lepetit, and P. Fua. DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo. PAMI, 32(5):815–830, 2010.
  • [54] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proc. CVPR, 2015.
  • [55] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104:154–171, 2013.
  • [56] A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proc. ACM Int. Conf. on Multimedia, 2015.
  • [57] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proc. ICML, 2008.
  • [58] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proc. CVPR, 2015.
  • [59] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 17(1):2287–2318, 2016.
  • [60] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In Proc. ICCV, 2013.
  • [61] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, 2014.
  • [62] H. Yang, W.-Y. Lin, and J. Lu. Daisy filter flow: A generalized discrete approach to dense correspondences. In Proc. CVPR, 2014.
  • [63] T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In Proc. CVPR, 2015.
  • [64] T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proc. CVPR, 2016.