Dynamic Context Correspondence Network for Semantic Alignment
Establishing semantic correspondence is a core problem in computer vision and remains challenging due to large intra-class variations and a lack of annotated data. In this paper, we aim to incorporate global semantic context in a flexible manner to overcome the limitations of prior work that relies on local semantic representations. To this end, we first propose a context-aware semantic representation that incorporates spatial layout for robust matching against local ambiguities. We then develop a novel dynamic fusion strategy based on an attention mechanism to weave together the advantages of local and context features by integrating semantic cues from multiple scales. We instantiate our strategy by designing an end-to-end learnable deep network, named the Dynamic Context Correspondence Network (DCCNet). To train the network, we adopt a multi-auxiliary task loss to improve the efficiency of our weakly-supervised learning procedure. Our approach achieves superior or competitive performance over previous methods on several challenging datasets, including PF-Pascal, PF-Willow, and TSS, demonstrating its effectiveness and generality.
Estimating dense correspondence across related images is a fundamental task in computer vision [scharstein2002taxonomy, hirschmuller2007stereo, horn1981determining]. While early works have focused on correspondence between images depicting the same object or scene, semantic alignment aims to find dense correspondence between different objects belonging to the same category [liu2011sift]. Such semantic correspondence has attracted much attention recently [han2017scnet, Rocco2018, kim2018recurrent] due to its potential use in a broad range of real-world applications such as image editing [dale2009image], co-segmentation [taniai2016joint], 3D reconstruction and scene recognition [agarwal2011building, nikandrova2015category]. However, this task remains extremely challenging because of large intra-class variations, viewpoint changes, background clutter and the lack of data with dense annotation [Rocco2017, Rocco2018].
There has been tremendous progress in semantic correspondence recently, thanks to learned feature representations based on convolutional neural networks (CNNs) and the adoption of weak supervision strategies in network training [Rocco18b, Rocco2018, Rocco2017, kim2018recurrent, kim2017dctm, kim2017fcss]. Most existing approaches learn a convolutional feature embedding so that similar image patches are mapped close to each other in the feature space, and use nearest neighbor search or geometric models for correspondence estimation [Rocco2017, Rocco2018, kim2018recurrent, kim2017dctm]. In order to achieve localization precision and robustness against deformations, such feature representations typically capture local image patterns, which are insufficient to encode global semantic cues. Consequently, they are particularly sensitive to large intra-class variations and the presence of repetitive patterns. While recent efforts [kim2017fcss, Rocco18b] introduce local neighborhood cues to improve the matching quality, their effectiveness is limited by the local operations and short-range context.
In this work, we aim to address the aforementioned limitations by incorporating global context information and a fusion mechanism that weaves the advantages of both local and spatial features for accurate semantic matching, as shown in Fig. 1. To this end, we first introduce a context-aware semantic representation that integrates appearance features with a self-similarity pattern descriptor, which enables us to capture global semantic context with spatial layout cues. In addition, we propose a pixel-wise attention mechanism that dynamically combines correlation maps derived from local features and context-aware semantic features. The key idea of our approach is to reduce matching ambiguities and to improve localization accuracy simultaneously by the dynamic blending of information from multiple spatial scales.
Concretely, we develop a novel Dynamic Context Correspondence Network (DCCNet), which consists of three main modules: a spatial context network, a correlation network and an attention fusion network. Given an input image pair, we first compute their convolutional (conv) features using a backbone CNN (e.g., ResNet [he2016deep]). The conv features are fed into our first module, the spatial context network, which computes context-aware semantic features that are robust against repetitive patterns and ambiguous matching. Our second module, the correlation network, has two shared branches that generate two correlation score maps for the context-aware semantic and the original conv features, respectively. The third module, the attention fusion net, predicts a pixel-wise weight mask to fuse the two correlation score maps for the final correspondence prediction. Our network is fully differentiable and is trained with a weakly-supervised strategy in an end-to-end manner. To improve training efficiency, we propose a new hybrid loss with multiple auxiliary tasks.
We evaluate our method by extensive experiments on three public benchmarks, including PF-Willow [ham2016proposal], PF-PASCAL [ham2018proposal] and TSS datasets [taniai2016joint]. The experimental results demonstrate the strong performance of our model, which outperforms the prior state-of-the-art approaches in most cases. We also conduct a detailed ablation study to illustrate the benefits of our proposed modules.
The main contributions of this work can be summarized as follows:
We propose a context-aware semantic representation to generate robust matching against repetitive patterns and local ambiguities in the semantic correspondence problem.
We develop a novel dynamic fusion strategy based on an attention mechanism to integrate multiple levels of feature representation. To the best of our knowledge, we are the first to adaptively combine spatial context information with local appearance in the semantic correspondence task.
We design a multi-auxiliary task loss to regularize the training process for weakly-supervised semantic correspondence task and achieve superior or competitive performance on public benchmarks.
2 Related Work
Traditional methods of semantic matching mostly utilize hand-crafted features to find similar image patches with additional spatial smoothness constraints in their alignment models [liu2011sift, tola2010daisy, taniai2016joint]. SIFT Flow [liu2011sift] extends classical optical flow to establish correspondences across similar scenes using dense SIFT descriptors. Taniai et al. [taniai2016joint] adopt HOG descriptors to jointly perform co-segmentation and dense alignment. Due to the lack of semantics in their feature representations, these approaches often suffer from inaccurate matching when facing large appearance changes caused by intra-class variations.
Recently, CNNs have been successfully applied to semantic matching thanks to their learned feature representations, which are more robust to appearance or shape variations. Early attempts [novotny2017anchornet, kim2017fcss] employ learnable feature descriptors with hand-crafted alignment models, while other approaches [han2017scnet, kim2017fcss] require external modules to generate object proposals for feature extraction; none of these are end-to-end trainable. More recent work tends to use fully trainable networks to learn the features and alignment jointly. Rocco et al. [Rocco2017] propose a network architecture for geometric matching using a self-supervised strategy based on synthetic images, and further improve it with weakly-supervised learning in [Rocco2018]. Follow-up work extends this strategy in several directions by improving the global transformation model [hongsuck2018attentive], developing a cycle-consistency loss [chendeep], estimating locally-varying geometric fields [kim2018recurrent, jeon2018parn], or exploiting neighborhood consensus to produce consistent flows [Rocco18b]. However, most CNN-based approaches rely on dense matching of conv features, which are incapable of encoding global context [luo2016understanding, chen2017rethinking].
Spatial Context in Correspondence
Spatial context has been explored for semantic matching in the literature before the deep learning era. In particular, Shechtman and Irani propose the Local Self-Similarity (LSS) descriptor [shechtman2007matching] to capture self-similarity structure, which has been extended to deep learning based correspondence estimation [kim2015dasc, kim2016deep]. More recent work, FCSS [kim2017fcss] and its extension [kim2017dctm], reformulates LSS as a CNN module, computing local self-similarity with a learned sparse sampling pattern in object proposals. In contrast, our method exploits a larger spatial context and computes a dense self-similarity descriptor, which is more robust against repetitive patterns and encodes richer context. We also combine this descriptor with local conv features, further improving the discriminative capability of our features and stabilizing training.
Attention mechanisms have been widely used in computer vision tasks to focus on relevant information. For instance, attention-based dynamic fusion is adopted for confidence measures in stereo matching [kim2019laf]. In semantic segmentation, Chen et al. [chen2016attention] propose an attention mechanism that learns to fuse multi-scale features at each pixel location. In semantic correspondence, recent methods design attention modules for suppressing background regions in images [chendeep, hongsuck2018attentive]. By contrast, our work addresses the challenge of integrating local and context cues in semantic matching, for which, to the best of our knowledge, dynamic fusion has not been explored before.
We now describe our method for estimating a robust and accurate semantic correspondence between two images. Our goal is to seek a flexible feature representation that enables us to capture global semantic contexts as well as informative local features. To this end, we introduce a learnable context-aware semantic representation that augments each local convolutional feature with a global context descriptor. Such a context-aware feature is integrated into the correlation computation by a dynamic fusion mechanism, which combines correlation scores from the context-aware feature and the local conv feature in a selective manner to generate high-quality matching predictions.
Below we begin with a brief introduction to the semantic correspondence task and an overview of our framework in Sec. 3.1. We then present our proposed context-aware semantic feature and its encoder network in Sec. 3.2, our correlation network in Sec. 3.3, and the dynamic fusion module in Sec. 3.4. Finally, we describe our multi-auxiliary task loss in Sec. 3.5.
3.1 Problem Setting and Overview
Given an input image pair $(I^A, I^B)$, the goal of semantic alignment is to estimate a dense correspondence between pixels in the two images. A common strategy is to infer the correspondence from a correlation map, which describes the matching similarities between any two locations from different images. Formally, let $I^A \in \mathbb{R}^{h_a \times w_a \times 3}$ and $I^B \in \mathbb{R}^{h_b \times w_b \times 3}$, where $(h_a, w_a)$ and $(h_b, w_b)$ are the heights and widths of the two images, respectively. The correlation map is denoted as $S \in \mathbb{R}^{h_a \times w_a \times h_b \times w_b}$, where $S_{ijkl} = \mathrm{sim}(I^A_{ij}, I^B_{kl})$ and $\mathrm{sim}(\cdot, \cdot)$ is a similarity function. To achieve point-to-point spatial correspondence between the two images, we can perform a hard assignment in either of two possible directions, from $I^A$ to $I^B$, or vice versa (cf. [Rocco18b]). Specifically, we have the following mapping from $I^A$ to $I^B$:
$$(k^*, l^*) = \operatorname*{arg\,max}_{k,\,l} S_{ijkl}, \quad \text{(1)}$$
which assigns each location $(i, j)$ in $I^A$ to its best-matching location $(k^*, l^*)$ in $I^B$.
By doing so, we convert the semantic correspondence problem to a correlation map prediction task, in which our goal is to find a functional mapping from the image pair to an optimal correlation map that generates the accurate pixel-wise correspondences.
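To make the hard-assignment step concrete, the following is a minimal NumPy sketch of converting a toy 4D correlation map into per-pixel correspondences; the function name `hard_assign` and the toy tensor are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def hard_assign(S):
    """Map each source location (i, j) to its best-matching target
    location (k, l) by a hard argmax over the correlation map.

    S : correlation map of shape (hA, wA, hB, wB); S[i, j, k, l] is the
        similarity between location (i, j) in image A and (k, l) in image B.
    Returns an (hA, wA, 2) integer array of matched target coordinates.
    """
    hA, wA, hB, wB = S.shape
    flat = S.reshape(hA, wA, hB * wB)      # collapse the target grid
    idx = flat.argmax(axis=-1)             # best target index per source pixel
    return np.stack(np.unravel_index(idx, (hB, wB)), axis=-1)

# toy correlation map: source pixel (0, 0) matches target (1, 1) best
S = np.zeros((2, 2, 2, 2))
S[0, 0, 1, 1] = 1.0
matches = hard_assign(S)
```

In practice the correlation map is computed on downsampled feature grids, and the resulting correspondences are interpolated back to the image plane.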
A typical deep learning based approach aims to build a high-quality correlation map based on a learned feature representation. Formally, we first compute the conv features of the images by an embedding network, which is pre-trained on a large dataset (e.g., ImageNet). Denoting the embedding network as $\mathcal{F}$, we generate the image feature maps as follows,
$$f^A = \mathcal{F}(I^A), \qquad f^B = \mathcal{F}(I^B),$$
where $f^A, f^B \in \mathbb{R}^{h \times w \times d}$ are the normalized conv feature representations of the input image pair $(I^A, I^B)$, and $d$ is the number of feature channels.
Given the conv features, we then build a correlation network that learns a mapping from the feature pair $(f^A, f^B)$ to their correlation map $S$. Formally,
$$S = \mathcal{C}(f^A, f^B; \theta),$$
where $\mathcal{C}$ is the mapping function implemented by the deep network and $\theta$ denotes its parameters. Given a feature-wise correspondence, we can derive the pixel-wise correspondences in Eq. (1) by interpolation on the image plane.
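The feature normalization and raw correlation computation described above can be sketched as follows; this is a simplified stand-in for the learned pipeline, and the helper names `l2_normalize` and `raw_correlation` are ours.

```python
import numpy as np

def l2_normalize(f, eps=1e-8):
    """Channel-wise L2 normalization of an (h, w, d) feature map."""
    return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

def raw_correlation(fA, fB):
    """4D correlation map of two (h, w, d) feature maps:
    S[i, j, k, l] = <fA[i, j], fB[k, l]>."""
    return np.einsum('ijd,kld->ijkl', fA, fB)

rng = np.random.default_rng(0)
fA = l2_normalize(rng.standard_normal((4, 4, 8)))
fB = l2_normalize(rng.standard_normal((4, 4, 8)))
S = raw_correlation(fA, fB)
```

With normalized features, each entry of `S` is a cosine similarity, so matching an image against itself places the maximum of each row at the identical location.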
While such deep correspondence networks (e.g. [Rocco18b]) provide a powerful framework to learn a flexible representation for matching, in practice they are sensitive to large intra-class variations and repetitive patterns in images due to the lack of global context. In this work, we propose a novel correspondence network to tackle these challenges in semantic correspondence. Our network is capable of capturing the global context of each feature location and dynamically integrating context-aware semantic cues with local semantic information to reduce matching ambiguities. Hence we refer to our network as the Dynamic Context Correspondence Network (DCCNet). DCCNet is composed of three main modules: 1) a spatial context encoder, 2) a correlation network and 3) a dynamic fusion network. Below we introduce the details of each module; an overview of our network is illustrated in Fig. 2.
3.2 Spatial Context Encoder
Taking as input the conv features of the image pairs, the first component of DCCNet is a spatial context encoder that incorporates global semantic context into the conv feature. To achieve this, we propose a self-similarity based operator to describe the spatial context, as shown in Fig. 3. Specifically, the spatial context encoder consists of two modules: a) spatial context generation, b) context-aware semantic feature generation, which will be detailed below.
Spatial Context Generation
Inspired by LSS [shechtman2007matching], we design a novel self-similarity based descriptor on top of deep conv features to encode spatial context at each location in an image. Concretely, given the conv feature map $f \in \mathbb{R}^{h \times w \times d}$ of an image (we omit the superscript here for clarity), we first apply a zero padding of size $(k-1)/2$ ($k$ is odd) on the feature map to get the padded feature map $\hat{f}$. For location $(i, j)$ in $f$, its spatial context descriptor is defined as a self-similarity vector computed between its own local feature $f_{ij}$ and the features in its neighboring region of size $k \times k$ centered at $(i, j)$. Specifically, we compute the self-similarity features as follows:
$$c_{ij} = \big[\, f_{ij}^{\top} \hat{f}_{i+u,\, j+v} \,\big]_{0 \le u,\, v < k} \in \mathbb{R}^{k^2},$$
where $c_{ij}$ is the spatial context descriptor of location $(i, j)$ and $c \in \mathbb{R}^{h \times w \times k^2}$ denotes the spatial context of the image. We refer to the neighborhood size $k$ as the kernel size of the context descriptor. With varying kernel sizes, the descriptor is able to encode the spatial context at different scales.
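A direct, unoptimized NumPy sketch of the dense self-similarity descriptor may help clarify the construction; the `spatial_context` function below is an assumed reference implementation of the equation above, not the paper's code.

```python
import numpy as np

def spatial_context(f, k=3):
    """Dense self-similarity spatial context descriptor.

    For each location (i, j) of an (h, w, d) feature map f, compute the
    dot products between f[i, j] and every feature in the k x k
    neighborhood centered at (i, j) (zero padding at the borders),
    giving an (h, w, k*k) descriptor map.
    """
    assert k % 2 == 1, "kernel size must be odd"
    h, w, d = f.shape
    p = (k - 1) // 2
    fp = np.pad(f, ((p, p), (p, p), (0, 0)))   # zero-padded feature map
    c = np.empty((h, w, k * k))
    for i in range(h):
        for j in range(w):
            patch = fp[i:i + k, j:j + k, :]    # k x k neighborhood
            c[i, j] = (patch * f[i, j]).sum(-1).ravel()
    return c

rng = np.random.default_rng(0)
f = rng.standard_normal((5, 5, 8))
c = spatial_context(f, k=3)
```

The center entry of each descriptor is the squared norm of the local feature itself, and entries that fall in the zero padding are exactly zero.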
It is worth noting that our spatial context descriptor differs from non-local graph networks [wang2018non] in encoding context information, as our descriptor maintains spatial structure, which is important for matching, while graph propagation typically uses aggregation operators to integrate out spatial cues. Our representation also differs from FCSS [kim2017fcss] in several aspects. First, we use a large context to compute self-similarity instead of a local window in order to achieve robustness toward repetitive patterns. Second, FCSS [kim2017fcss] relies on object proposals to remove background while we learn to select informative semantic cues. Moreover, we empirically find that the spatial context descriptor alone is insufficient for high-quality matching, and therefore combine it with local conv features, which will be described below.
Context-aware Semantic Feature
The second module of our spatial context encoder computes a context-aware semantic feature for each location on the conv feature map. While the spatial context descriptor encodes second-order statistics in a neighborhood of a feature location, it lacks the local semantic cues represented by the original conv feature. In order to capture different aspects of semantic objects, we employ a simple fusion step to generate a context-aware semantic representation, which provides better matching quality. Concretely, we apply a non-linear transformation over the concatenation of $f_{ij}$ and $c_{ij}$ as below:
$$g_{ij} = \sigma\!\left(W \left[ f_{ij}; c_{ij} \right]\right),$$
where $\sigma$ is a nonlinear function (ReLU) and the weight matrix $W$ transforms the features into a $d_g$-dimensional space. We use $g \in \mathbb{R}^{h \times w \times d_g}$ to denote the context-aware semantic features of an image, and add superscripts to represent the context-aware semantic features $g^A$ and $g^B$ from the images $I^A$ and $I^B$, respectively.
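The fusion step amounts to a shared per-location linear transform followed by a ReLU; below is a sketch under the assumption that the weight matrix `W` is given (in the network it is learned).

```python
import numpy as np

def context_aware_feature(f, c, W):
    """Fuse local conv features f (h, w, d) with spatial context
    descriptors c (h, w, k*k) into context-aware features via a shared
    linear map W of shape (d_out, d + k*k) followed by ReLU."""
    x = np.concatenate([f, c], axis=-1)        # (h, w, d + k*k)
    g = np.einsum('od,ijd->ijo', W, x)         # per-location linear transform
    return np.maximum(g, 0.0)                  # ReLU

rng = np.random.default_rng(0)
f = rng.standard_normal((5, 5, 8))   # local features, d = 8
c = rng.standard_normal((5, 5, 9))   # context descriptors, k = 3
W = rng.standard_normal((16, 17))    # maps 8 + 9 dims to d_out = 16
g = context_aware_feature(f, c, W)
```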
3.3 Correlation Network
The second module of DCCNet is a correlation network that takes in the feature representations of an image pair and produces a correlation map. While any correlation computation module can be used here, we adopt the neighborhood consensus module [Rocco18b] in this work for its superior performance. Specifically, for each type of feature representation of an image pair, say the context-aware semantic features $(g^A, g^B)$ or the local semantic features $(f^A, f^B)$, we feed them into the correlation network to generate their corresponding correlation map:
$$S_g = N\!\left(\mathrm{Corr}(g^A, g^B)\right), \qquad S_f = N\!\left(\mathrm{Corr}(f^A, f^B)\right),$$
where $N$ is the neighborhood consensus operator and $\mathrm{Corr}$ is the correlation operation. We use $N$ to refine the correlation maps based on local neighborhood information. In addition, the mutual nearest neighbor consistency constraint [Rocco18b] is applied before and after $N$; it is merged into $N$ above for simplicity as it does not contain learnable parameters. We refer the reader to [Rocco18b] for more details. We now have two correlation maps, $S_g$ and $S_f$, that describe the pixelwise correspondence using context-aware semantic cues and local semantic features, respectively.
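While the full neighborhood consensus module involves learned 4D convolutions, its non-learnable mutual nearest-neighbor filtering component can be sketched directly, following the NC-Net formulation (the name `mutual_nn_filter` is ours).

```python
import numpy as np

def mutual_nn_filter(S, eps=1e-8):
    """Soft mutual nearest-neighbor filtering of a 4D correlation map
    (cf. NC-Net): each score is rescaled by how close it is to being
    the maximum along both matching directions."""
    max_B = S.max(axis=(2, 3), keepdims=True)  # best score per source pixel
    max_A = S.max(axis=(0, 1), keepdims=True)  # best score per target pixel
    return S * (S / (max_B + eps)) * (S / (max_A + eps))

S = np.random.default_rng(0).random((4, 4, 4, 4))
S_filtered = mutual_nn_filter(S)
```

For non-negative scores the two ratios are at most one, so filtering only suppresses scores; matches that are maxima in both directions are preserved almost unchanged.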
3.4 Dynamic Fusion Network
While the context-aware semantic feature allows us to encode more global visual patterns, the spatial context encoder in Sec. 3.2 adopts a spatially-invariant fusion mechanism (i.e., a global embedding) to combine local cues and spatial context, which turns out to be sub-optimal for feature locations with distracting neighboring regions. An effective solution is to introduce a spatially varying fusion mechanism that balances the context and local conv features specifically for each location. To that end, we propose a dynamic fusion strategy to achieve adaptive fusion for different locations in each image pair. Our fusion utilizes the scores from the two correlation maps computed in Sec. 3.3 at each location and determines which one is more trustworthy using a location-specific weight.
Specifically, given two correlation maps, and , we introduce the third module of DCCNet, a dynamic fusion network, to integrate two correlation scores. Motivated by [chen2016attention], we exploit an attention mechanism to generate a location-aware weight mask for correlation map fusion. The attention-based dynamic fusion consists of the following two modules: 1) correlation map embedding, 2) attention-based fusion, which will be described below.
Our dynamic fusion strategy is associated with the matching direction. Here we describe the dynamic fusion in the direction from image $I^A$ to image $I^B$ for clarity, as shown in Fig. 4; the other direction is handled similarly.
Correlation Map Embedding
In order to predict the attention mask, we first compute a feature representation from the correlation maps. Concretely, we apply an embedding function $\phi$ to produce a correlation map embedding:
$$E_g = \phi(S_g; \theta_\phi), \qquad E_f = \phi(S_f; \theta_\phi),$$
where $\phi$ is implemented by a 4D convolutional neural network and $\theta_\phi$ denotes its learnable parameters. $E_g$ and $E_f$ have the same dimensions as $S_g$ and $S_f$, in $\mathbb{R}^{h \times w \times h \times w}$. By this module, we extract the 4D correlation features $E_g$ and $E_f$ before reshaping them in the next attention module, which produces the weight mask and fusion result.
Attention-based Fusion
To compute the attention weight mask, we first reshape $E_g$ and $E_f$ into the tensor forms $\tilde{E}_g \in \mathbb{R}^{l \times h \times w}$ and $\tilde{E}_f \in \mathbb{R}^{l \times h \times w}$, where $l = h \times w$. We then compute a fusion weight map for each image pair, which indicates whether the local conv feature is more informative than the context-aware semantic feature at each location. For the direction of image $I^A$ to $I^B$, we stack the reshaped correlation maps $\tilde{E}_g$ and $\tilde{E}_f$ along the first axis, followed by an attention network to predict the fusion weights:
$$[M_g; M_f] = \mathrm{Att}\!\left(\mathrm{cat}(\tilde{E}_g, \tilde{E}_f)\right),$$
where $\mathrm{cat}$ is the concatenation operator along the first dimension, and $M_g$ and $M_f$ are the attention weight masks for $\tilde{E}_g$ and $\tilde{E}_f$, respectively. The attention network $\mathrm{Att}$ is implemented by a fully convolutional layer followed by a softmax operator to normalize the attention weights. Given the attention masks, we fuse the correlation maps in an adaptive way as follows,
$$\tilde{S}^{A} = M_g \odot \tilde{S}_g + M_f \odot \tilde{S}_f,$$
where $\odot$ is the element-wise multiplication with broadcasting for producing the weighted correlation maps, and $\tilde{S}_g$ and $\tilde{S}_f$ are the correspondingly reshaped correlation maps. The output correlation $S^A$ is generated by reshaping $\tilde{S}^A$ back into the 4D form in $\mathbb{R}^{h \times w \times h \times w}$. Similarly, the adaptively fused correlation $S^B$ for the other direction can also be computed by this module. Finally, the two refined correlation maps $S^A$ and $S^B$ are used to find semantic correspondences (cf. [Rocco18b]).
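The attention-based fusion can be illustrated with a toy sketch in which the attention network is reduced to a single linear layer `Wa` over the stacked embeddings (a hypothetical stand-in for the paper's conv layer); per-pixel softmax weights then blend the two correlation maps.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_fuse(Sg, Sf, Eg, Ef, Wa):
    """Pixel-wise attention fusion of two correlation maps.

    Sg, Sf : (h, w, h, w) correlation maps (context-aware / local).
    Eg, Ef : (h, w, l) reshaped correlation embeddings, l = h * w.
    Wa     : (2, 2 * l) weights of a linear attention layer.
    Returns the fused (h, w, h, w) correlation map.
    """
    h, w = Sg.shape[:2]
    x = np.concatenate([Eg, Ef], axis=-1)      # stack the two embeddings
    logits = np.einsum('cd,ijd->ijc', Wa, x)   # (h, w, 2)
    m = softmax(logits, axis=-1)               # per-pixel fusion weights
    Sg2, Sf2 = Sg.reshape(h, w, -1), Sf.reshape(h, w, -1)
    fused = m[..., :1] * Sg2 + m[..., 1:] * Sf2
    return fused.reshape(h, w, h, w)

rng = np.random.default_rng(0)
h = w = 3
Sg, Sf = rng.random((h, w, h, w)), rng.random((h, w, h, w))
Eg, Ef = rng.random((h, w, h * w)), rng.random((h, w, h * w))
Wa = rng.standard_normal((2, 2 * h * w))
S_fused = dynamic_fuse(Sg, Sf, Eg, Ef, Wa)
```

Because the two weights sum to one at every pixel, each fused score is a convex combination of the corresponding scores in the two input maps.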
3.5 Learning with Multi-auxiliary Task Loss
We learn the model parameters of our DCCNet in a weakly-supervised manner from a set of matched images. Given two images $I^A$ and $I^B$, the outputs of our model are the fused correlation maps $S^A$ and $S^B$. We first adopt the weakly-supervised training loss proposed in NC-Net [Rocco18b], which has the functional form:
$$\mathcal{L}(I^A, I^B) = -y \left(\bar{s}^A + \bar{s}^B\right),$$
where $y$ denotes the ground-truth label of the image pair, with $y = +1$ for positive and $y = -1$ for negative pairs. $\bar{s}^A$ and $\bar{s}^B$ are the mean matching scores over all hard assigned matches of a given image pair in the two matching directions. To minimize this loss, the model should maximize the scores of positive and minimize the scores of negative matching pairs, respectively. We denote this loss term, computed on the fused correlation maps, as $\mathcal{L}_{dyn}$.

To learn an effective dynamic fusion strategy, we further introduce additional supervision from two auxiliary tasks. Specifically, we also use the correlation map $S_f$ of the local semantic features and the correlation map $S_g$ of the context-aware semantic features to generate matching results, and denote their correspondence losses as $\mathcal{L}_{f}$ and $\mathcal{L}_{g}$, respectively. Here we compute the auxiliary task losses $\mathcal{L}_{f}$ and $\mathcal{L}_{g}$ following the same procedure as in $\mathcal{L}_{dyn}$. The overall training loss is then defined as
$$\mathcal{L}_{total} = \mathcal{L}_{dyn} + \alpha \mathcal{L}_{f} + \beta \mathcal{L}_{g},$$
where $\alpha$ and $\beta$ are hyper-parameters that balance the main and auxiliary task losses.
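The multi-auxiliary task loss can be sketched as follows, assuming the mean hard-assigned matching scores are taken directly from the correlation maps; the helper names and default weights are ours, chosen to mirror the notation above.

```python
import numpy as np

def matching_loss(S, y):
    """NC-Net-style weak supervision loss for one image pair: the mean
    of the hard-assigned matching scores in both directions, negated
    for positive pairs (y = +1) and kept for negative pairs (y = -1)."""
    s_A = S.max(axis=(2, 3)).mean()            # best score per source pixel
    s_B = S.max(axis=(0, 1)).mean()            # best score per target pixel
    return -y * (s_A + s_B)

def multi_auxiliary_loss(S_fused, S_f, S_g, y, alpha=1.0, beta=1.0):
    """Total loss: main loss on the fused map plus weighted auxiliary
    losses on the local-feature and context-feature correlation maps."""
    return (matching_loss(S_fused, y)
            + alpha * matching_loss(S_f, y)
            + beta * matching_loss(S_g, y))

rng = np.random.default_rng(0)
S1, S2, S3 = (rng.random((3, 3, 3, 3)) for _ in range(3))
loss_pos = multi_auxiliary_loss(S1, S2, S3, y=+1)
loss_neg = multi_auxiliary_loss(S1, S2, S3, y=-1)
```

For positive pairs the loss rewards high matching scores (it becomes more negative), while for negative pairs the same scores are penalized.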
We evaluate our DCCNet on the weakly-supervised semantic correspondence task by conducting a series of experiments on three public datasets: PF-PASCAL [ham2018proposal], PF-WILLOW [ham2016proposal] and TSS [taniai2016joint]. In this section, we introduce our experimental settings and report the evaluation results in detail. We first describe the implementation details in Sec. 4.1, followed by the quantitative results on the three datasets in Sec. 4.2, Sec. 4.3 and Sec. 4.4, respectively. Finally, an ablation study and comprehensive analysis are provided in Sec. 4.5.
4.1 Implementation details
We implement our DCCNet with the PyTorch framework [paszke2017automatic]. For the feature extractor, we use a ResNet-101 [he2016deep] pre-trained on ImageNet, with its parameters fixed and truncated at the conv4_23 layer. The kernel size of the spatial context encoder and the output dimension of the context-aware semantic features (set to 1024) are determined by validation. For the correlation network, we follow [Rocco18b] and stack three 4D convolutional layers with a kernel size of 5×5×5×5, and set the channel number of the intermediate layers to 16. For the dynamic fusion net, we choose the same 4D conv layers as in the correlation network for the correlation embedding module, and the attention mask prediction layer is implemented with a conv layer.
To train the model, we set $\alpha$ and $\beta$ in the multi-auxiliary task loss to 1 by validation. The model parameters are randomly initialized except for the feature extractor. The model is trained for 5 epochs on 4 GPUs with early stopping to avoid overfitting. We use the Adam optimizer [kingma2014adam] with a learning rate of $5 \times 10^{-4}$.
Images of all three datasets are first resized to 400×400. Our model is trained on the PF-PASCAL benchmark [ham2018proposal]. To further validate the generalization capacity of our model, we test the trained model on the PF-WILLOW dataset [ham2016proposal] and the TSS dataset [taniai2016joint] without any further finetuning. Finally, we conduct the ablation study on the PF-PASCAL dataset [ham2018proposal].
4.2 PF-Pascal Benchmark
Dataset and Evaluation Metric
The PF-PASCAL [ham2018proposal] benchmark is built from the PASCAL 2011 keypoint annotation dataset [bourdev2009poselets], which consists of 20 object categories. Following the dataset split in [khan2017], we partition the total of 1351 image pairs into a training set of 735 pairs, a validation set of 308 pairs and a test set of 308 pairs. The model learning is performed in a weakly-supervised manner: keypoint annotations are not used for training but for evaluation only. We report the percentage of correct keypoints (PCK) metric [yang2013articulated], which measures the percentage of keypoints whose transfer errors are below a given threshold $\alpha$. In line with previous work, we report PCK ($\alpha = 0.1$) w.r.t. image size.
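The PCK metric itself is straightforward to compute; below is a small sketch assuming the threshold is taken with respect to the larger image dimension (the function and example coordinates are illustrative).

```python
import numpy as np

def pck(pred_kps, gt_kps, img_size, alpha=0.1):
    """Percentage of correct keypoints: a transferred keypoint counts as
    correct when its distance to the ground truth is at most
    alpha * max(image height, image width)."""
    pred, gt = np.asarray(pred_kps, float), np.asarray(gt_kps, float)
    dists = np.linalg.norm(pred - gt, axis=-1)
    return (dists <= alpha * max(img_size)).mean()

gt = [(10, 10), (50, 50), (90, 90)]
pred = [(12, 10), (80, 50), (90, 91)]   # the second keypoint is far off
score = pck(pred, gt, img_size=(100, 100), alpha=0.1)
```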
As shown in Table 1, we compare our proposed method with previous methods, including NC-Net [Rocco18b], WeakAlign [Rocco2018], RTN [kim2018recurrent], CNNGeo [Rocco2017], Proposal Flow [ham2018proposal], UCN [choy2016universal] and different versions of SCNet [khan2017]. Our approach achieves an overall PCK of 82.3, outperforming the prior state of the art [Rocco18b].
Fig. 5 shows qualitative comparisons with NC-Net [Rocco18b]. We can see that our model is robust against repetitive patterns thanks to our proposed context-aware semantic representation and dynamic fusion. More qualitative results can be found in the suppl. material.
4.3 PF-WILLOW Benchmark
Dataset and evaluation metric
The PF-WILLOW dataset consists of 900 image pairs selected from a total of 100 images [ham2016proposal]. We report PCK scores at multiple thresholds w.r.t. bounding box size in order to compare with prior methods.
Table 2 compares the PCK accuracies of our DCCNet to those of the state-of-the-art semantic correspondence techniques. Our proposed method improves the PCK accuracies over the previously published best performance at two of the three thresholds. Our model also achieves a competitive PCK at the remaining threshold, only marginally lower than the state-of-the-art result, partially due to large scale variations in this dataset that are unseen during training. Fig. 6 shows qualitative results on the PF-WILLOW dataset, which further demonstrate the strength of our method.
4.4 TSS Benchmark
Dataset and evaluation metric
The TSS dataset contains 400 image pairs in total, divided into three groups: FG3DCAR, JODS, and PASCAL. Ground-truth flows and foreground masks are provided for each image pair; we use them only for evaluation under our weakly-supervised setting. Following Taniai et al. [taniai2016joint], we report the PCK over foreground objects w.r.t. image size.
Table 3 presents quantitative results on the TSS benchmark. Our method outperforms previous methods on one of the three groups of the TSS dataset, and our average performance over the three groups achieves a new state of the art. This shows that our method can generalize to novel datasets despite a moderate change in data distribution. Qualitative results are presented in Fig. 7.
4.5 Ablation Study
To understand the effectiveness of our model components, we conduct a series of ablation studies focusing on: 1) effects of individual modules, 2) kernel sizes in the spatial context, 3) different fusion methods and 4) the multi-auxiliary task losses. We select NC-Net [Rocco18b] as our baseline and report PCK ($\alpha = 0.1$) on the PF-PASCAL [ham2018proposal] test split.
Effects of Individual Modules
We consider five different ablation settings; the overall results are shown in Table 4. First, we note that applying our proposed spatial context encoder (Baseline+S) generates a large performance improvement over NC-Net [Rocco18b]. Second, adding dynamic fusion with the auxiliary loss (Baseline+SDA) provides a further boost. Below we present a detailed analysis of each module via the other three ablation settings.
Spatial Context Encoder
Table 5 shows the effects of incorporating context with different kernel sizes. When using our spatial context encoder alone (Baseline+S), the performance first increases and then drops as the kernel size grows, due to degradation of the context-aware features as more background clutter is included. Our dynamic fusion and auxiliary loss (Baseline+SDA) effectively alleviate this degradation problem.
Dynamic Fusion
We study the effects of our dynamic fusion by replacing it with a simple average fusion of the two correlation maps, referring to the resulting model as Baseline+SAA. From Table 4 we can see that our dynamic fusion model (Baseline+SDA, 82.3) yields significantly better results than average fusion (Baseline+SAA, 80.2), showing the necessity of our attention module. Moreover, Baseline+SAA underperforms the model with the context-aware semantic feature alone (Baseline+S) due to its global averaging. In contrast, the pixel-wise weight mask from the attention net enables each location to adaptively merge different scales of semantic cues. We also evaluate the model without correlation map embedding during dynamic fusion (Baseline+SCA), which generates worse results, indicating the efficacy of 4D correlation map features in the dynamic fusion network.
| Method | Spatial context | Fusion strategy | Aux. loss | PCK |
| Baseline+SCA | ✓ | Dynamic w/o Corr Embedding | ✓ | 79.9 |
| Baseline+SAA | ✓ | Average w/ Corr Embedding | ✓ | 80.2 |
| Baseline+SD | ✓ | Dynamic w/ Corr Embedding | ✗ | 81.0 |
| Baseline+SDA | ✓ | Dynamic w/ Corr Embedding | ✓ | 82.3 |
Multi-auxiliary task loss
To validate the effect of our proposed auxiliary task loss, we train a model without the two additional loss terms, which is referred to as Baseline+SD. Table 4 shows that our model with the auxiliary loss terms (Baseline+SDA) attains higher PCK scores than the Baseline+SD model, reaching the state-of-the-art result of 82.3. This improvement indicates the effectiveness of our multi-auxiliary task loss in regularizing the training process for the weakly-supervised semantic correspondence task. With the multi-auxiliary task loss, our local feature and context-aware semantic feature branches receive stronger supervision signals, which in turn benefits the fusion branch and produces better overall matching results.
In this work, we have proposed an effective deep correspondence network, DCCNet, for the semantic alignment problem. Compared to prior work, our approach introduces several innovations in semantic matching. First, we develop a learnable context-aware semantic representation that is robust against repetitive patterns and local ambiguities. In addition, we design a novel dynamic fusion module to adaptively combine semantic cues from multiple spatial scales. Finally, we adopt a multi-auxiliary task loss to better regularize the learning of our dynamic fusion strategy. We demonstrate the efficacy of our approach through extensive experimental evaluations on the PF-PASCAL, PF-WILLOW and TSS datasets. The results show that DCCNet achieves superior or comparable performance to the prior state-of-the-art approaches on all three datasets.
This work was supported in part by the NSFC Grant No.61703195 and the Shanghai NSF Grant No.18ZR1425100.