GLU-Net: Global-Local Universal Network for Dense Flow and Correspondences
Establishing dense correspondences between a pair of images is an important and general problem, covering geometric matching, optical flow and semantic correspondences. While these applications share fundamental challenges, such as large displacements, pixel-accuracy, and appearance changes, they are currently addressed with specialized network architectures, designed for only one particular task. This severely limits the generalization capabilities of such networks to new scenarios, where e.g. robustness to larger displacements or higher accuracy is required.
In this work, we propose a universal network architecture that is directly applicable to all the aforementioned dense correspondence problems. We achieve both high accuracy and robustness to large displacements by investigating the combined use of global and local correlation layers. We further propose an adaptive resolution strategy, allowing our network to operate on virtually any input image resolution. The proposed GLU-Net achieves state-of-the-art performance for geometric and semantic matching as well as optical flow, when using the same network and weights.
Finding pixel-to-pixel correspondences between images continues to be a fundamental problem in Computer Vision [Hartley, Forsyth]. This is due to its many important applications, including visual localization [johannes2017, Taira2018], 3D-reconstruction [AgarwalFSSCSS11], structure-from-motion [SchonbergerF16], image manipulation [HaCohenSGL11, LiuYT11], action recognition [SimonyanZ14] and autonomous driving [JanaiGBG17]. Due to the astonishing developments in deep learning in recent years and its impressive performance, end-to-end trainable Convolutional Neural Networks (CNNs) are now applied for this task in all the aforementioned domains.
The general problem of estimating correspondences between pairs of images can be divided into several different tasks, depending on the origin of the images. In the geometric matching task [Hartley], the images constitute different views of the same scene, taken by a single or multiple cameras. The images may be taken from radically different viewpoints, leading to large displacements and appearance transformations between the frames. On the other hand, optical flow [Horn1981, Baker2011] aims to estimate accurate pixel-wise displacements between two consecutive frames of a sequence or video. In the semantic matching problem [LiuYT11, Ham2016] (also referred to as semantic flow), the task is instead to find semantically meaningful correspondences between different instances of the same scene category or object, such as ‘car’ or ‘horse’. Current methods generally address one of these tasks, using specialized architectures that generalize poorly to related correspondence problems. In this work, we therefore set out to design a universal architecture that jointly addresses all aforementioned tasks.
One key architectural aspect shared by a variety of correspondence networks is the reliance on correlation layers, computing local similarities between deep features extracted from the two images. These provide strong cues when establishing correspondences. Optical flow methods typically employ local correlation layers [Dosovitskiy2015, Ilg2017a, Sun2018, Sun2019, Hui2018, Hui2019], evaluating similarities in a local neighborhood around an image coordinate. While suitable for small displacements, they are unable to capture large viewpoint changes. On the contrary, geometric and semantic matching architectures utilize global correlations [Melekhov2019, Rocco2017a, Rocco2018a, Rocco2018b, Kim2019], where similarities are evaluated between all pairs of locations in the dense feature maps. While capable of handling long-range matches, global correlation layers are computationally infeasible at high resolutions. Moreover, they constrain the input image size to a pre-determined resolution, which severely hampers accuracy for high-resolution images.
Contributions: In this paper, we propose GLU-Net, a Global-Local Universal Network for estimating dense correspondences. Our architecture is robust to large viewpoint changes and appearance transformations, while capable of estimating small displacements with high accuracy. The main contributions of this work are: (i) We introduce a single unified architecture, applicable to geometric matching, semantic matching and optical flow. (ii) Our network carefully integrates global and local correlation layers to handle both large and small displacements. (iii) To circumvent the fixed input resolution imposed by the global cost volume, we propose an adaptive resolution strategy that enables our network to take any image resolution as input, crucial for high-accuracy displacements. (iv) We train our network in a self-supervised manner, relying on synthetic warps of real images, thus requiring no annotated ground-truth flow.
We perform comprehensive experiments on the three aforementioned tasks, providing detailed analysis of our approach and thorough comparisons with recent state-of-the-art. Our approach outperforms previous methods for dense geometric correspondences on the HPatches [Lenc] and ETH3D [ETH3d] datasets, while setting a new state-of-the-art for semantic correspondences on the TSS [Taniai2016] dataset. Moreover, our network, without any retraining or fine-tuning, generalizes to optical flow by providing highly competitive results on the KITTI [Geiger2013] dataset. Both training code and models will be available at [glu-net].
2 Related work
Finding correspondences between a pair of images is a classical computer vision problem, uniting optical flow, geometric correspondences and semantic matching. This problem dates back several decades [Horn1981], with most classical techniques relying on hand crafted [Alcantarilla2012, harris, Leutenegger2011, Alahi2012, Bay2006, Lowe2004, Rublee2011] or trained [OnoTFY18, suerpoint, GLAMpoint] feature detectors/descriptors, or variational formulations [Horn1981, Baker2011, LiuYT11]. In recent years, CNNs have revolutionised most areas within vision, including different aspects of the image correspondence problem. Here, we focus on Convolutional Neural Network (CNN)-based methods for generating dense correspondences or flow fields, as these are most related to our work.
Optical Flow: Dosovitskiy et al. [Dosovitskiy2015] constructed the first trainable CNN for optical flow estimation, FlowNet, based on a U-Net denoising autoencoder architecture [Vincent2010] and trained it on the large synthetic FlyingChairs dataset. Ilg et al. [Ilg2017a] stacked several basic FlowNet models into a large one, called FlowNet2, which performed on par with classical state-of-the-art on the Sintel benchmark [Butler2012]. Subsequently, Ranjan and Black [Ranjan2017] introduced SpyNet, a compact spatial image pyramid network.
Recent notable contributions to end-to-end trainable optical flow include PWC-Net [Sun2018, Sun2019] and LiteFlowNet [Hui2018], followed by LiteFlowNet2 [Hui2019]. They employ multiple constrained correlation layers operating on a feature pyramid, where the features at each level are warped by the current flow estimate, yielding more compact and effective networks. Nevertheless, while these networks excel at small to medium displacements with small appearance changes, they perform poorly on strong geometric transformations or when the visual appearance is significantly different.
Geometric Correspondence: Unlike optical flow, geometric correspondence estimation focuses on large geometric displacements, which can cause significant appearance distortions between the frames. Motivated by recent advancements in optical flow architectures, Melekhov et al. [Melekhov2019] introduced DGC-Net, a coarse-to-fine CNN-based framework that generates dense 2D correspondences between image pairs. It relies on a global cost volume constructed at the coarsest resolution. However, the input size is constrained to a fixed resolution ($240 \times 240$), severely limiting its performance on higher resolution images. Rocco et al. [Rocco2018b] aim at increasing the performance of the global correlation layer by proposing an end-to-end trainable neighborhood consensus network, NC-Net, to filter out ambiguous matches and keep only the locally and cyclically consistent ones. Furthermore, Laskar et al. [Laskar2019] utilize a modified version of DGC-Net, focusing on image retrieval.
Semantic Correspondence: Unlike optical flow or geometric matching, semantic correspondence poses additional challenges due to intra-class appearance and shape variations among different instances from the same object or scene category. Rocco et al. [Rocco2017a, Rocco2018a] proposed the CNNGeo matching architecture, predicting globally parametrized affine and TPS transformations between image pairs. Other approaches aim to predict richer geometric deformations [Choy2016, KimMHLS19, Kim2018, Rocco2018b] using e.g. Spatial Transformer Networks [Jaderberg2015]. Recently, Jeon et al. [Jeon] introduced PARN, a pyramidal model where dense affine transformation fields are progressively estimated in a coarse-to-fine manner. SAM-Net [Kim2019] obtains better results by jointly learning semantic correspondence and attribute transfer. Huang et al. [DCCNet] proposed DCCNet, which fuses correlation maps derived from local features and from a newly designed context-aware semantic feature representation.
We address the problem of finding pixel-wise correspondences between a pair of images $I_s \in \mathbb{R}^{H \times W \times 3}$ and $I_t \in \mathbb{R}^{H \times W \times 3}$. In this work, we put no particular assumptions on the origin of the image pair itself. It may correspond to two different views of the same scene, two consecutive frames in a video, or two images with similar semantic content. Our goal is to estimate a dense displacement field $\mathbf{w} \in \mathbb{R}^{H \times W \times 2}$, often referred to as flow, that warps image $I_s$ towards $I_t$ such that,
$$I_t(\mathbf{x}) \approx I_s(\mathbf{x} + \mathbf{w}(\mathbf{x})). \quad (1)$$
The flow $\mathbf{w}$ represents the pixel-wise 2D motion vectors in the target image coordinate system. It is directly related to the pixel correspondence map $\mathbf{m}(\mathbf{x}) = \mathbf{x} + \mathbf{w}(\mathbf{x})$, which maps an image coordinate $\mathbf{x}$ in the target image to its corresponding position in the source image.
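The relation between flow, correspondence map and warping can be illustrated with a minimal NumPy sketch. It assumes the convention stated above (flow expressed in target coordinates, correspondence map equal to coordinate plus flow); the function names `flow_to_mapping` and `warp_nearest` are hypothetical, and nearest-neighbour sampling is used here only for brevity.

```python
import numpy as np

def flow_to_mapping(flow):
    """Convert a flow field of shape (H, W, 2) into a correspondence map m(x) = x + w(x)."""
    h, w = flow.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))  # x indexes columns, y indexes rows
    grid = np.stack([xs, ys], axis=-1).astype(flow.dtype)
    return grid + flow

def warp_nearest(source, flow):
    """Warp the source towards the target: out(x) = source(x + w(x)), nearest-neighbour."""
    h, w = flow.shape[:2]
    m = np.rint(flow_to_mapping(flow)).astype(int)
    valid = (m[..., 0] >= 0) & (m[..., 0] < w) & (m[..., 1] >= 0) & (m[..., 1] < h)
    out = np.zeros_like(source)
    out[valid] = source[m[..., 1][valid], m[..., 0][valid]]
    return out, valid

# Toy pair: the target equals the source shifted one pixel to the left,
# so every target pixel finds its match one pixel to the right in the source.
source = np.arange(20, dtype=np.float64).reshape(4, 5)
target = np.roll(source, shift=-1, axis=1)
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0
warped, valid = warp_nearest(source, flow)
```

Pixels whose correspondence falls outside the source image are reported through the `valid` mask; on the remaining pixels the warped source reproduces the target.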
In this work, we design an architecture capable of robustly finding both long-range correspondences and accurate estimation of pixel-wise displacements. We thereby achieve a universal network for predicting dense flow fields, applicable to geometric matching, semantic correspondences and optical flow. The overall architecture follows a CNN feature-based coarse-to-fine strategy, which has proved widely successful for specific tasks [Hui2018, Sun2018, Melekhov2019, Jeon, Kim2019]. However, contrary to previous works, our architecture combines global and local correlation layers, as discussed in Section 3.1 and 3.2, to benefit from their complementary properties. We further circumvent the input resolution restriction imposed by the global correlation layer by introducing an adaptive resolution strategy in Section 3.3. It is based on a two-stream feature pyramid, which allows dense correspondence prediction for any input resolution image. Our final architecture is detailed in Section 3.4 and the training procedure explained in 3.5.
Figure 2: (a) Local-Net, (b) Global-Net, (c) GLOCAL-Net, (d) GLU-Net (Ours).
3.1 Local and Global Correlations
Current state-of-the-art architectures [Sun2018, Hui2018, DCCNet, Jeon, Melekhov2019] for estimating image correspondences or optical flow rely on measuring local similarities between the source and target images. This is performed in a deep feature space, which provides a discriminative embedding with desirable invariances. The result, generally referred to as a correlation or cost volume, provides an extremely powerful cue when deriving the final correspondence or flow estimate. The correlation can be performed in a local or global manner.
Local correlation: In a local correlation layer, the feature similarity is only evaluated in the neighborhood of the target image coordinate, specified by a search radius $R$. Formally, the correlation $c^l$ between the target $F_t^l$ and source $F_s^l$ feature maps is defined as,
$$c^l(\mathbf{x}, \mathbf{d}) = F_t^l(\mathbf{x})^T F_s^l(\mathbf{x} + \mathbf{d}), \quad \|\mathbf{d}\|_\infty \le R, \quad (2)$$
where $\mathbf{x} \in \mathbb{Z}^2$ is a coordinate in the target feature map and $\mathbf{d} \in \mathbb{Z}^2$ is the displacement from this location. The displacement is constrained to $\|\mathbf{d}\|_\infty \le R$, i.e. the maximum motion in any direction is $R$. We let $l$ denote the level in the feature pyramid. While most naturally thought of as a 4-dimensional tensor, the two displacement dimensions are usually vectorized into one to simplify further processing in the CNN. The resulting 3D correlation volume $c^l$ thus has a dimensionality of $H^l \times W^l \times (2R+1)^2$.
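A local correlation with a given search radius can be sketched in NumPy as follows. This is an illustrative, unoptimized version under the assumption that source locations outside the feature map contribute zero similarity (zero padding); the function name `local_correlation` and the channel ordering of the vectorized displacements are choices made here, not details from the paper.

```python
import numpy as np

def local_correlation(feat_t, feat_s, radius):
    """Local cost volume between feature maps of shape (H, W, D).

    Returns an array of shape (H, W, (2 * radius + 1) ** 2). The channel for a
    displacement (dy, dx) is (dy + radius) * (2 * radius + 1) + (dx + radius).
    """
    h, w, _ = feat_t.shape
    # Zero-pad the source so that displaced lookups outside the map give zero.
    padded = np.pad(feat_s, ((radius, radius), (radius, radius), (0, 0)))
    channels = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shifted[y, x] equals feat_s[y + dy, x + dx] (or zero outside the map).
            shifted = padded[radius + dy: radius + dy + h, radius + dx: radius + dx + w]
            channels.append(np.sum(feat_t * shifted, axis=-1))
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(0)
feat_t = rng.standard_normal((6, 7, 8))
feat_s = rng.standard_normal((6, 7, 8))
corr = local_correlation(feat_t, feat_s, radius=1)
print(corr.shape)  # (6, 7, 9)
```

The channel count depends only on the search radius, which is why such a layer can be applied at any spatial resolution.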
Global correlation: A global correlation layer evaluates the pairwise similarities between all locations in the target and source feature maps. The correlation volume $C^l$ contains at each target image location $\mathbf{x} \in \mathbb{Z}^2$ the scalar products between the corresponding feature vector and the vectors extracted from all source feature map coordinates $\mathbf{x}' \in \mathbb{Z}^2$,
$$C^l(\mathbf{x}, \mathbf{x}') = F_t^l(\mathbf{x})^T F_s^l(\mathbf{x}'). \quad (3)$$
As for the local cost volume, we vectorize the source dimensions, leading to a 3D tensor of size $H^l \times W^l \times (H^l W^l)$.
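For comparison, a global correlation reduces to a single matrix product between the flattened feature maps. The sketch below is a minimal NumPy illustration; the function name `global_correlation` and the row-major flattening of source coordinates are assumptions made for this example.

```python
import numpy as np

def global_correlation(feat_t, feat_s):
    """Global cost volume between a target map (H, W, D) and a source map (Hs, Ws, D).

    Returns shape (H, W, Hs * Ws); channel y' * Ws + x' holds the scalar product
    with the source feature at row y', column x'.
    """
    h, w, d = feat_t.shape
    hs, ws, _ = feat_s.shape
    corr = feat_t.reshape(h * w, d) @ feat_s.reshape(hs * ws, d).T
    return corr.reshape(h, w, hs * ws)

rng = np.random.default_rng(0)
feat_t = rng.standard_normal((4, 5, 8))
feat_s = rng.standard_normal((4, 5, 8))
corr = global_correlation(feat_t, feat_s)
print(corr.shape)  # (4, 5, 20)
```

Here the channel count equals the number of source locations, so it changes whenever the feature-map size changes.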
Comparison: Local and global correlation layers have a few key contrary properties and behaviors. Local correlations are popularly employed in architectures designed for optical flow [Dosovitskiy2015, Sun2018, Hui2018], where the displacements are generally small. Thanks to their restricted search region, local correlation layers can be applied for high-resolution feature maps, which allows accurate estimation of small displacements. On the other hand, a local correlation based architecture is limited to a certain maximum range of displacements. Conversely, a global correlation based architecture does not suffer from this limitation, encapsulating arbitrary long-range displacements.
The major disadvantage of the global cost volume is that its dimensionality scales with the size of the feature map $H^l \times W^l$. Therefore, due to the quadratic scaling in computation and memory, global cost volumes are only suitable at coarse resolutions. Moreover, post-processing layers implemented with 2D convolutions expect a fixed channel dimensionality. Since the channel dimension of the cost volume depends on its spatial dimensions $H^l \times W^l$, this effectively constrains the network input resolution to a fixed pre-determined value, referred to as $H_L \times W_L$. The network can thus not leverage the more detailed structure in high-resolution images and lacks precision, since the images require down-scaling to $H_L \times W_L$ before being processed by the network. Architectures with only local correlations (Local-Net) or with a single global correlation (Global-Net) are represented in Figure 2a, b.
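The scaling argument can be made concrete by counting cost-volume entries. The short sketch below uses hypothetical feature-map sizes and a hypothetical search radius of 4, chosen only to illustrate the quadratic growth of the global volume against the linear growth of the local one.

```python
def global_volume_entries(h, w):
    """Entries in a global cost volume: one score per (target, source) location pair."""
    return (h * w) ** 2

def local_volume_entries(h, w, radius):
    """Entries in a local cost volume: (2R + 1)^2 scores per target location."""
    return h * w * (2 * radius + 1) ** 2

coarse = global_volume_entries(16, 16)          # 65,536 entries
fine = global_volume_entries(64, 64)            # 16,777,216 entries
local_fine = local_volume_entries(64, 64, 4)    # 331,776 entries
```

Increasing the side length of the feature map by a factor of 4 multiplies the global volume by 256, whereas the local volume only grows by a factor of 16.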
3.2 Global-Local Architecture
We introduce a unified network that leverages the advantages of both global and local correlation layers and which also circumvents the limitations of both. Our goal is to handle any kind of geometric transformations - including large displacements - while achieving high precision for detailed and small displacements. This is performed by carefully integrating global and local correlation layers in a feature pyramid based network architecture.
Inspired by DGC-Net [Melekhov2019], we employ a global correlation layer at the coarsest level. The purpose of this layer is to handle the long-range correspondences. Since these are best captured in the coarsest scale, only a single global correlation is needed. In subsequent layers, the dense flow field is refined by computing image feature similarity using local correlations. This allows precise estimation of the displacements. Combining global and local correlation layers allows us to achieve robust and accurate prediction of both long and small-range motions. Such an architecture is visualized with GLOCAL-Net in Figure 2c. However, this network is still restricted to a certain input resolution. Next, we introduce a design strategy that circumvents this issue.
3.3 Adaptive resolution
As previously discussed, the global correlation layer imposes a pre-determined input resolution for the network to ensure a constant channel dimensionality of the global cost volume. This severely limits the applicability and accuracy of the correspondence network, since higher resolution images require down-scaling before being processed by the network, followed by up-scaling of the resulting flow. In this section, we address this key issue by introducing an architecture capable of taking images of any resolution, while still benefiting from a global correlation.
Our adaptive-resolution architecture consists of two sub-networks, which operate on two different image resolutions. The first, termed L-Net, takes source and target images downscaled to a fixed resolution $H_L \times W_L$, which allows a global correlation layer to be integrated. The H-Net, on the other hand, operates directly on the original image resolution $H \times W$, which is not constrained to any specific value. It refines the flow estimate generated by the L-Net with local correlations applied to a shallow feature pyramid constructed directly from the original images. It is schematically represented in Figure 2d.
Both sub-networks are based on a coarse-to-fine architecture, employing the same feature extractor backbone. In detail, the L-Net relies on a global correlation at the coarsest level in order to effectively handle any kind of geometric transformations, including very large displacements. Subsequent levels of L-Net employ local correlations to refine the flow field. It is then up-sampled to the coarsest resolution of H-Net, where it serves as the initial flow estimate used for warping the source features $F_s$. Subsequently, the flow prediction is refined numerous times within H-Net, which operates on the full scale images, thus providing a very detailed, sub-pixel accurate final estimation of the dense flow field relating $I_s$ and $I_t$.
For high-resolution images, the upscaling factor between the finest pyramid level of L-Net and the coarsest level of H-Net (see Figure 2d) can be significant. Our adaptive resolution strategy allows additional refinement steps of the flow estimate between those two levels during inference, thus improving the accuracy of the estimated flow, without training any additional weights. This is performed by recursively applying the layer weights at intermediate resolutions obtained by down-sampling the source and target features from the coarsest level of H-Net. In summary, our adaptive resolution network is capable of seamlessly predicting an accurate flow field in the original input resolution, while also benefiting from robustness to long-range correspondences provided by the global layer. The entire network is trained end-to-end.
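One way to picture the additional refinement steps is as a schedule of intermediate feature resolutions between the two sub-networks. The sketch below is purely hypothetical: it assumes the intermediate sizes are obtained by repeatedly halving the coarsest H-Net feature size until the finest L-Net size is reached, which is not stated in the text, and the function name `intermediate_sizes` is invented for this example.

```python
def intermediate_sizes(lnet_finest, hnet_coarsest):
    """Hypothetical schedule of feature sizes strictly between the two levels.

    Repeatedly halves the coarsest H-Net size while it stays above the finest
    L-Net size, and returns the sizes ordered from coarse to fine.
    """
    sizes = []
    size = hnet_coarsest // 2
    while size > lnet_finest:
        sizes.append(size)
        size //= 2
    return sizes[::-1]

# Assumed sizes: a 32-pixel-wide finest L-Net level and a 250-pixel-wide
# coarsest H-Net level (e.g. a 2000-pixel-wide image at 1/8 resolution).
schedule = intermediate_sizes(32, 250)
```

With these assumed sizes the flow would be refined at two extra resolutions before reaching the coarsest H-Net level, while a small image needs no extra step.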
3.4 Architecture details
In this section, we provide a detailed description of our architecture. While any feature extractor backbone can be employed, we use the VGG-16 [Chatfield14] network trained on ImageNet [Hinton2012] to provide a fair comparison to previous works in geometric [Melekhov2019] and semantic correspondences [Jeon].
For our L-Net, we set the input resolution to $H_L \times W_L = 256 \times 256$. It is composed of two pyramid levels, using Conv5-3 ($16 \times 16$ resolution) and Conv4-3 ($32 \times 32$ resolution) respectively. The former employs global correlation, while the latter is based on a local correlation. The H-Net is composed of two feature pyramid levels extracted from the original image resolution $H \times W$. For this purpose, we employ Conv4-3 and Conv3-3 having resolutions $\frac{H}{8} \times \frac{W}{8}$ and $\frac{H}{4} \times \frac{W}{4}$ respectively. The H-Net is purely based on local correlation layers. Our final architecture, composed of four pyramid levels in total, is detailed in Figure 3. Next, we describe the various architectural components.
Coarsest resolution and mapping estimation: We compute a global correlation from the $L^2$-normalized source and target features. The cost volume is further post-processed by applying channel-wise $L^2$-normalisation followed by ReLU [relu] to strongly down-weight ambiguous matches [Rocco2017a]. Similar to DGC-Net [Melekhov2019], the resulting global correlation $C$ is then fed into a correspondence map decoder $M_{\text{top}}$ to estimate a 2D dense correspondence map $\mathbf{m}^1$ at the coarsest level of the feature pyramid:
$$\mathbf{m}^1 = M_{\text{top}}\left(C(F_t^1, F_s^1)\right). \quad (4)$$
The correspondence map is then converted to a displacement field, as $\mathbf{w}^1(\mathbf{x}) = \mathbf{m}^1(\mathbf{x}) - \mathbf{x}$.
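The coarsest-level pipeline can be sketched in NumPy under strong simplifications. The normalisation and ReLU steps follow the order described above; the learned correspondence map decoder is replaced here by a plain argmax over source locations, which is a stand-in chosen for illustration and not the decoder used in the paper. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_correlation(feat_t, feat_s):
    """Global cost volume (H, W, Hs * Ws) computed from L2-normalized features."""
    h, w, d = feat_t.shape
    corr = l2_normalize(feat_t).reshape(h * w, d) @ l2_normalize(feat_s).reshape(-1, d).T
    return corr.reshape(h, w, -1)

def postprocess(corr):
    """Channel-wise L2-normalisation followed by ReLU."""
    return np.maximum(l2_normalize(corr, axis=-1), 0.0)

def argmax_mapping(corr, source_width):
    """Stand-in decoder: pick the best-scoring source location per target pixel."""
    idx = corr.argmax(axis=-1)
    return np.stack([idx % source_width, idx // source_width], axis=-1)

def mapping_to_flow(mapping):
    """Convert a correspondence map m(x) into a flow w(x) = m(x) - x."""
    h, w = mapping.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    return mapping - np.stack([xs, ys], axis=-1)

rng = np.random.default_rng(0)
feat_s = rng.standard_normal((4, 5, 16))
feat_t = np.roll(feat_s, shift=-1, axis=1)  # target features shifted one column left
corr = postprocess(global_correlation(feat_t, feat_s))
flow = mapping_to_flow(argmax_mapping(corr, source_width=5))
```

In this toy setting every target location away from the wrap-around column recovers a horizontal displacement of one pixel, including the long-range match in the last column.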
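Subsequent flow estimations: The flow is refined by local correlation modules. At level $l$, the flow decoder $M$ infers the residual flow $\Delta\tilde{\mathbf{w}}^l$ as,
$$\Delta\tilde{\mathbf{w}}^l = M\left(c(F_t^l, \tilde{F}_s^l; R), \mathrm{up}(\mathbf{w}^{l-1})\right). \quad (5)$$
Here, $c$ is a local correlation (2) with search radius $R$ and $\tilde{F}_s^l(\mathbf{x}) = F_s^l(\mathbf{x} + \mathrm{up}(\mathbf{w}^{l-1})(\mathbf{x}))$ is the warped source feature map according to the upsampled flow $\mathrm{up}(\mathbf{w}^{l-1})$. The complete flow field is computed as $\tilde{\mathbf{w}}^l = \Delta\tilde{\mathbf{w}}^l + \mathrm{up}(\mathbf{w}^{l-1})$.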
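The data flow of one refinement level (upsample the previous flow, warp the source features, add a residual) can be sketched as below. This is a simplified illustration: upsampling and warping use nearest-neighbour operations, and the residual is a constant placeholder standing in for the learned flow decoder. The names `upsample_flow` and `warp_features` are hypothetical.

```python
import numpy as np

def upsample_flow(flow, factor=2):
    """Upsample a flow field (H, W, 2) by pixel replication.

    Displacements are multiplied by `factor` because they are measured in
    pixels of the finer grid.
    """
    up = np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1)
    return up * factor

def warp_features(feat_s, flow):
    """Sample source features at x + flow(x), nearest-neighbour with border clamping."""
    h, w = flow.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x_src = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y_src = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return feat_s[y_src, x_src]

coarse_flow = np.zeros((2, 3, 2))
coarse_flow[..., 0] = 1.0                  # one pixel to the right at the coarse level
up_flow = upsample_flow(coarse_flow)       # two pixels to the right at the finer level

rng = np.random.default_rng(0)
feat_s = rng.standard_normal((4, 6, 8))
warped = warp_features(feat_s, up_flow)    # input to the local correlation

residual = np.full((4, 6, 2), 0.25)        # placeholder for the decoder output
flow = residual + up_flow                  # complete flow at this level
```

Because the source features are pre-warped by the upsampled flow, the local correlation only has to cover the remaining residual motion.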
Flow refinement: Contextual information has been shown to be advantageous for pixel-wise prediction tasks [Chen2017, DCCNet]. We thus use a sub-network R, called the refinement network, to post-process the estimated flow at the highest levels of L-Net and H-Net (L2 and L4 in Figure 3) by effectively enlarging the receptive field size. It takes the features $f^l$ of the second last layer from the flow decoder as input and outputs the refined flow $\mathbf{w}^l = R(f^l) + \tilde{\mathbf{w}}^l$, where $\tilde{\mathbf{w}}^l$ is the flow estimated by the decoder. For the other pyramid level (L3), the final flow field is $\mathbf{w}^l = \tilde{\mathbf{w}}^l$.
Cyclic consistency: Since the quality of the correlation is of primary importance for the flow estimation process, we introduce an additional filtering step on the global cost volume to enforce the reciprocity constraint on matches. We employ the soft mutual nearest neighbor filtering introduced by [Rocco2018b] and apply it to post-process the global correlation.
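The soft mutual nearest neighbor filtering of [Rocco2018b] re-weights each score by its ratio to the best score along both matching directions. The sketch below applies it to a cost volume flattened to a 2D matrix indexed as (target location, source location); it assumes non-negative scores (e.g. after ReLU), and the small `eps` term is added here for numerical safety.

```python
import numpy as np

def soft_mutual_nn(corr, eps=1e-5):
    """Soft mutual nearest-neighbour filtering of a matrix corr[target, source].

    Each score is multiplied by its ratio to the best score of the same target
    (over all sources) and by its ratio to the best score of the same source
    (over all targets). Assumes non-negative scores.
    """
    best_per_target = corr.max(axis=1, keepdims=True)
    best_per_source = corr.max(axis=0, keepdims=True)
    return corr * (corr / (best_per_target + eps)) * (corr / (best_per_source + eps))

# Both targets prefer source 0, but source 0 prefers target 0.
corr = np.array([[0.9, 0.1],
                 [0.8, 0.2]])
filtered = soft_mutual_nn(corr)
```

The mutually best pair keeps almost its full score, while the competing one-directional match between target 1 and source 0 is attenuated.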
Loss: We train our network in a single phase. We fix the pre-trained feature backbone during training and following FlowNet [Dosovitskiy2015], we apply supervision at every pyramid level using the endpoint error (EPE) loss with respect to the ground truth displacements.
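A multi-level endpoint error loss can be sketched as follows. The per-level weights and the use of a mean over pixels are assumptions made for this example; the text only states that EPE supervision is applied at every pyramid level.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Per-pixel Euclidean distance between predicted and ground-truth flow (H, W, 2)."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def multiscale_epe_loss(preds, gts, weights):
    """Weighted sum over pyramid levels of the mean endpoint error (weights are hypothetical)."""
    return sum(w * endpoint_error(p, g).mean() for p, g, w in zip(preds, gts, weights))

# Two levels with a constant error vector of (3, 4), i.e. an EPE of 5 everywhere.
gts = [np.tile([3.0, 4.0], (2, 2, 1)), np.tile([3.0, 4.0], (4, 4, 1))]
preds = [np.zeros((2, 2, 2)), np.zeros((4, 4, 2))]
loss = multiscale_epe_loss(preds, gts, weights=[1.0, 0.5])
```

In practice the ground-truth flow would be resized and rescaled to each level's resolution before the comparison.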
Dataset: Our network is solely trained on pairs generated by applying random warps to the original images. Since our network is designed to also estimate correspondences between high-resolution images, training data of sufficient resolution is preferred in order to utilize the full potential of our architecture. We use a combination of the DPED [Ignatov2017], CityScapes [Cordts2016] and ADE-20K [Zhou2019] datasets, which provide high-resolution images with sufficiently diverse content. On the combined dataset, we apply the same synthetic transformations as in [Melekhov2019]. The resulting image pairs are cropped to a fixed size for training. We call this dataset DPED-CityScape-ADE. We provide additional training and architectural details in the appendix.
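With synthetic warps, the ground-truth flow is known in closed form from the transformation itself. The sketch below shows this for a single homography, used here as a stand-in for the synthetic transformations; the function name `flow_from_homography` and the example matrices are illustrative assumptions.

```python
import numpy as np

def flow_from_homography(hom, h, w):
    """Dense ground-truth flow (h, w, 2) induced by a 3x3 homography.

    Each target pixel x is mapped to hom @ x (in homogeneous coordinates) and
    the flow is the difference between the mapped and the original coordinate.
    """
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)
    mapped = pts @ hom.T
    mapped = mapped[..., :2] / mapped[..., 2:3]
    return mapped - pts[..., :2]

translation = np.array([[1.0, 0.0, 2.0],
                        [0.0, 1.0, -1.0],
                        [0.0, 0.0, 1.0]])
scaling = np.diag([2.0, 2.0, 1.0])
flow_translation = flow_from_homography(translation, 4, 5)
flow_scaling = flow_from_homography(scaling, 4, 5)
```

Warping the original image with the resulting correspondence map then yields a training pair with dense supervision and no manual annotation.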
4 Experimental Validation
In this section, we comprehensively evaluate our approach for three diverse problems: geometric matching, semantic correspondences and optical flow. Importantly, we use the same network and model weights, trained on DPED-CityScape-ADE, for all three applications. More detailed results are available in the supplementary material.
4.1 Geometric matching
We first apply our universal correspondence network for the task of geometric matching. In this problem, the images consist of different views of the same scene and include large geometric transformations.
HP: We use the HPatches dataset [Lenc], consisting of 59 sequences of real images with varying photometric and geometric changes. Each image sequence contains a source image and 5 target images taken under different viewpoints, at varying image sizes. In addition to evaluating on the original image resolution (referred to as HP), we also evaluate on downscaled ($240 \times 240$) images and ground-truths (HP-240) following [Melekhov2019].
ETH3D: To validate our approach for real 3D scenes, where image transformations are not constrained to simple homographies, we also employ the multi-view dataset ETH3D [ETH3d]. It contains 10 image sequences depicting indoor and outdoor scenes, captured by a completely unconstrained moving camera and used for benchmarking 3D reconstruction. The authors additionally provide a set of sparse geometrically consistent image correspondences (output by [SchonbergerF16]) that have been optimized over the entire image sequence using the reprojection error. We sample image pairs from each sequence at different intervals to analyze varying magnitudes of geometric transformations, and use the provided points as sparse ground truth correspondences. This results in about 500 image pairs in total for each selected interval.
Metrics: In line with [Melekhov2019], we employ the Average End-Point Error (AEPE) and Percentage of Correct Keypoints (PCK) as the evaluation metrics. AEPE is defined as the Euclidean distance between estimated and ground truth flow fields, averaged over all valid pixels of the target image. PCK is computed as the percentage of correspondences $\tilde{\mathbf{x}}_j$ whose Euclidean distance error $\|\tilde{\mathbf{x}}_j - \mathbf{x}_j\|$, w.r.t. the ground truth $\mathbf{x}_j$, is smaller than a threshold $\delta$.
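Both metrics follow directly from the per-pixel endpoint error. The sketch below is a minimal NumPy version; the function names are hypothetical, and counting errors equal to the threshold as correct is an assumption made here.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Per-pixel Euclidean distance between predicted and ground-truth flow."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def aepe(flow_pred, flow_gt, valid):
    """Average endpoint error over the valid pixels."""
    return endpoint_error(flow_pred, flow_gt)[valid].mean()

def pck(flow_pred, flow_gt, valid, threshold):
    """Fraction of valid pixels with an endpoint error within `threshold` pixels."""
    return (endpoint_error(flow_pred, flow_gt)[valid] <= threshold).mean()

# Four pixels with endpoint errors 0, 0.5, 3 and 10.
flow_gt = np.zeros((2, 2, 2))
flow_pred = np.array([[[0.0, 0.0], [0.5, 0.0]],
                      [[3.0, 0.0], [6.0, 8.0]]])
valid = np.ones((2, 2), dtype=bool)

print(aepe(flow_pred, flow_gt, valid))       # 3.375
print(pck(flow_pred, flow_gt, valid, 1.0))   # 0.5
print(pck(flow_pred, flow_gt, valid, 5.0))   # 0.75
```

The example shows why the two metrics are complementary: a single large outlier dominates AEPE, while PCK at a small threshold measures how many correspondences are precise.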
Compared methods: We compare with DGC-Net [Melekhov2019], which is the current state-of-the-art for dense geometric matching, as well as with two state-of-the-art optical flow methods, PWC-Net [Sun2018] and LiteFlowNet [Hui2018], both trained on Flying-Chairs [Dosovitskiy2015] followed by 3D-things [Ilg2017a]. We use the pre-trained weights provided by the authors.
|Method||HP-240 AEPE||HP-240 PCK-1px||HP-240 PCK-5px||HP AEPE||HP PCK-1px||HP PCK-5px|
|LiteFlowNet [Hui2018]||20.48||28.13 %||57.28 %||118.30||13.94 %||32.00 %|
|PWC-Net [Sun2018, Sun2019]||21.45||20.93 %||54.52 %||94.54||13.20 %||37.53 %|
|DGC-Net [Melekhov2019]||9.07||50.01 %||77.40 %||33.26||12.00 %||58.06 %|
|GLU-Net (Ours)||7.40||59.92 %||83.47 %||25.05||39.55 %||78.54 %|
Results: We first present results on HP and HP-240 in Table 1. Our model outperforms all others by a large margin both in terms of accuracy (PCK) and robustness (AEPE). It is interesting to note that while our model is already better than DGC-Net on the small resolution HP-240, the gap in performance further broadens when increasing the image resolution. Particularly, GLU-Net obtains a PCK-1px value more than three times higher than that of DGC-Net on HP (39.55 % versus 12.00 %). This demonstrates the benefit of our adaptive resolution strategy, which enables processing high-resolution images with great accuracy. Figure 4 shows qualitative examples of different networks applied to HP images and ETH3D image pairs taken by two different cameras. Our GLU-Net is robust to large viewpoint variations as well as drastic changes in illumination.
In Figure 5, we plot AEPE and PCK-5px obtained on the ETH3D scenes for different intervals between image pairs. For small intervals, finding correspondences strongly resembles the optical flow task, while increasing the interval leads to larger displacements. Therefore, specialised optical flow methods PWC-Net [Sun2018] and LiteFlowNet [Hui2018] obtain slightly better AEPE and PCK for low intervals, but rapidly degrade for larger ones. In all cases, our approach consistently outperforms DGC-Net [Melekhov2019] in both metrics by a large margin.
4.2 Semantic matching
Here, we perform experiments for the task of semantic matching, where images depict different instances of the same object category, such as cars or horses. We use the same model and weights as in the previous section.
Dataset and metric: We use the TSS dataset [Taniai2016], which provides dense flow field annotations for the foreground object in each pair. It contains 400 image pairs, divided into three groups: FG3DCAR, JODS, and PASCAL, according to the origins of the images. Following Taniai et al. [Taniai2016], we report the PCK with a distance threshold equal to $\alpha \cdot \max(h_s, w_s)$, where $h_s$ and $w_s$ are the dimensions of the source image and $\alpha = 0.05$.
Compared methods: We compare to several recent state-of-the-art methods specialised in semantic matching [Rocco2018a, Rocco2018b, Jeon, DCCNet, Kim2018, Kim2019]. In addition to our universal network, we evaluate a version that adopts two architectural details that are used in the semantic correspondence literature. Specifically, we add a consensus network [Rocco2018b] for the global correlation layer and concatenate features from different levels in the L-Net, similarly to [Jeon] (see Section 4.4 for an analysis). We call this version Semantic-GLU-Net. To accommodate reflections, which do not occur in geometric correspondence scenarios, we infer the flow field on original and flipped versions of the target image and output the flow field with least horizontal average magnitude.
Results: We report results on TSS in Table 2. Our universal network obtains state-of-the-art performance on average over the three TSS groups. Moreover, individual results on FG3DCAR and PASCAL are very close to the best metrics. This shows the generalization properties of our network, which is not trained on the same magnitude of semantic data. In contrast, most specialized approaches are fine-tuned on PASCAL data [Ham2016]. Finally, including architectural details specifically for semantic matching, termed Semantic-GLU-Net, further improves our performance, setting a new state-of-the-art on TSS with a substantial PCK improvement over the previous best. Interestingly, we outperform methods that use a deeper, more powerful feature backbone. Qualitative examples of our approach are shown in Figure 6.
4.3 Optical flow
Finally, we apply our network, with the same weights as previously, for the task of optical flow estimation. Here, the image pairs stem from consecutive frames of a video.
Dataset and metric: For optical flow evaluation, we use the KITTI dataset [Geiger2013], which is composed of real road sequences captured by a car-mounted stereo camera rig. The 2012 set only consists of static scenes while the 2015 set is extended to dynamic scenes. For this task, we follow the standard evaluation metric, namely the Average End-Point Error (AEPE). We also use the KITTI-specific F1 metric, which represents the percentage of outliers.
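The outlier percentage can be sketched as below, assuming the usual KITTI convention that a pixel counts as an outlier when its endpoint error exceeds both 3 pixels and 5 % of the ground-truth flow magnitude. The function name `kitti_f1` and the strict inequalities are choices made for this example.

```python
import numpy as np

def kitti_f1(flow_pred, flow_gt, valid, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of valid pixels whose error exceeds both the absolute and relative threshold."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    outlier = (err > abs_thresh) & (err > rel_thresh * mag)
    return 100.0 * outlier[valid].mean()

flow_gt = np.array([[[100.0, 0.0], [10.0, 0.0]],
                    [[1.0, 0.0], [0.0, 0.0]]])
flow_pred = np.array([[[96.0, 0.0], [14.0, 0.0]],
                      [[2.0, 0.0], [5.0, 0.0]]])
valid = np.ones((2, 2), dtype=bool)
f1 = kitti_f1(flow_pred, flow_gt, valid)
```

In this toy case the first pixel has a 4-pixel error but is not an outlier, because the error is below 5 % of its large ground-truth motion; two of the four pixels are outliers.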
Compared methods: We employ state-of-the-art PWC-Net [Sun2018, Sun2019] and LiteFlowNet [Hui2018] trained on Flying-Chairs [Dosovitskiy2015] and 3D-things [Ilg2017a]. We also compare to DGC-Net [Melekhov2019] for completeness.
Results: Since we do not finetune our model, we only evaluate on the KITTI training sets. For fair comparison, we compare to models not finetuned on the KITTI training data. The results are shown in Table 3 and a qualitative example is illustrated in Figure 7.
|Method||KITTI-2012 AEPE-all||KITTI-2012 F1-all [%]||KITTI-2015 AEPE-all||KITTI-2015 F1-all [%]|
|PWC-Net [Sun2018, Sun2019]||4.14||20.01||10.35||33.67|
Our network obtains the best (lowest) AEPE on both KITTI-2012 and KITTI-2015. Nevertheless, we observe that our approach yields a larger F1 measure, i.e. more outliers, on KITTI-2015 compared to approaches specifically trained and designed for optical flow. This is largely due to our self-supervised training data, which currently does not model independently moving objects or occlusions; these could be included to pursue a more purposed optical flow solution. Yet, our approach demonstrates competitive results for this challenging task, without training on any optical flow data. This clearly shows that our network can not only robustly estimate long-range correspondences, but also accurate small displacements.
4.4 Ablation study
Here, we perform a detailed analysis of our approach.
Local-global architecture: We first analyze the impact of global and local correlation layers in our dense correspondence framework. We compare using only local layers (Local-Net), a global layer (Global-Net) and our combination (GLOCAL-Net), presented in Figure 2. As shown in Table 4, Local-Net fails on the HP dataset, due to its inability to capture large displacements. While the Global-Net can handle large viewpoint changes, it achieves inferior accuracy compared to GLOCAL-Net, which additionally integrates local correlation layers.
Adaptive resolution: By further adding the adaptive resolution strategy (Section 3.3), our approach (GLU-Net in Table 4) achieves a large performance gain in all metrics compared to GLOCAL-Net. This improvement is most prominent for high resolution images, i.e. the original HP data.
Iterative refinement: From Table 4, applying iterative refinement (it-R) clearly benefits accuracy for high-resolution images (HP). This further allows us to seamlessly add extra flow refinements, without incurring any additional network weights, in order to process images of high resolution.
Global correlation: Lastly, we explore design choices for the global correlation block in our architecture. As shown in Table 5, adding cyclic consistency (CC) [Rocco2018b] as a post-processing step brings improvements for all datasets. Subsequently adding NC-Net and concatenating features of L-Net (Concat-F) leads to a major overall gain on the HP [Lenc] and TSS [Taniai2016] datasets. However, we observe a slight degradation in accuracy on KITTI [Geiger2013]. We therefore only include these components for the Semantic-GLU-Net version (Section 4.2) and not in our universal GLU-Net.
Table 4: |Local-Net||Global-Net||GLOCAL-Net||GLU-Net (no CC, no it-R)||GLU-Net (no CC, it-R)|
Table 5: |No CC||+ CC (Ours)||+ NC-Net||+ Concat-F|
We propose a universal coarse-to-fine architecture for estimating dense flow fields from a pair of images. By carefully combining global and local correlation layers, our network effectively estimates long-range displacements while also achieving high accuracy. Crucially, we introduce an adaptive resolution strategy to counter the fixed input resolution otherwise imposed by the global correlation. Our universal GLU-Net is thoroughly evaluated for the three diverse tasks of geometric correspondences, semantic matching and optical flow. When using the same model weights, our network achieves state-of-the-art performance on all above tasks, demonstrating its universal applicability.
In this supplementary material, we first provide details about the architecture of the different modules of our network GLU-Net in Section A. We then explain the training procedure in more depth in Section B. Finally, we present additional qualitative results and more detailed quantitative experiments in Section C.
Appendix A Architecture details
In this section, we provide additional details about cyclic consistency as a post-processing step of the global correlation. We also give a detailed architectural description of the mapping and flow decoders, along with the refinement network. Lastly, we explain in depth the iterative refinement allowed by our adaptive resolution strategy. In the following, a convolution layer or block refers to the composition of a 2D-convolution followed by batch norm [IoffeS15] and ReLU [relu] (Conv-BN-ReLU).
A.1 Cyclic consistency post-processing step for improved global correlation
Since the quality of the correlation layer output is of primary importance for the flow estimation process, we introduce an additional filtering step on the global cost volume to enforce the reciprocity constraint on matches. To encourage matched features to be mutual nearest neighbors, we employ the soft mutual nearest neighbor filtering introduced by [Rocco2018b] and apply it to post-process the global correlation.
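The soft mutual nearest neighbor module filters a global correlation $C$ into $\hat{C}$ such that:

$$\hat{C}(i,j,k,l) = r^{s}(i,j,k,l)\, r^{t}(i,j,k,l)\, C(i,j,k,l),$$

with $r^{s}(i,j,k,l)$ and $r^{t}(i,j,k,l)$ the ratios of the score of the particular match $C(i,j,k,l)$ with the best scores along each pair of dimensions corresponding to images $I_s$ and $I_t$ respectively, where $(i,j)$ indexes positions in $I_s$ and $(k,l)$ positions in $I_t$. We present the formula for $r^{s}$ below, the same applies for $r^{t}$.

$$r^{s}(i,j,k,l) = \frac{C(i,j,k,l)}{\max_{a,b} C(a,b,k,l)}$$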
This cyclic consistency post-processing step does not add any training weights.
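As an illustration, the filtering step described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the released implementation: the function name `soft_mutual_nn`, the axis layout (source positions on the first two axes, target positions on the last two), the non-negativity of the scores and the small `eps` guard against division by zero are all illustrative choices.

```python
import numpy as np

def soft_mutual_nn(corr, eps=1e-8):
    """Soft mutual nearest neighbor filtering of a 4D correlation volume.

    corr: array of shape (Hs, Ws, Ht, Wt) with non-negative scores
          (axis layout is an assumption of this sketch).
    """
    # Best score over all source positions, for every target position.
    best_over_source = corr.max(axis=(0, 1), keepdims=True)
    # Best score over all target positions, for every source position.
    best_over_target = corr.max(axis=(2, 3), keepdims=True)
    ratio_s = corr / (best_over_source + eps)
    ratio_t = corr / (best_over_target + eps)
    # A match keeps its full score only if it is the best in both directions.
    return ratio_s * ratio_t * corr

# Tiny example: 1x2 source positions, 1x2 target positions.
corr = np.array([[1.0, 0.5], [0.2, 0.8]]).reshape(1, 2, 1, 2)
filtered = soft_mutual_nn(corr)
```

Mutually best matches are left almost unchanged, while all other scores are down-weighted; no learnable parameters are involved.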
A.2 Mapping decoder
In this sub-section, we give additional details of the mapping decoder for the global correlation layer (Eq. 4 and Figure 3 in the paper). We compute a global correlation from the $L_2$-normalized source and target features. The cost volume is further post-processed by applying channel-wise $L_2$-normalisation followed by ReLU [relu] to strongly down-weight ambiguous matches [Rocco2017a]. Similar to DGC-Net [Melekhov2019], the resulting global correlation C is then fed into a correspondence map decoder to estimate a 2D dense correspondence map at the coarsest level of the feature pyramid.
The output mapping estimate is parameterized such that each predicted pixel location in the map belongs to the interval $[-1, 1]$, representing width- and height-normalized image coordinates. The correspondence map is then re-scaled to image coordinates and converted to a displacement field.
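The conversion from a normalized correspondence map to a displacement field can be sketched as follows. This is a hedged illustration only: the function name `mapping_to_flow`, the `(H, W, 2)` layout with `(x, y)` channel order, and the convention that the normalized range $[-1, 1]$ spans pixel centers `0` to `W-1` (respectively `H-1`) are assumptions of this sketch.

```python
import numpy as np

def mapping_to_flow(mapping):
    """Convert a normalized correspondence map to a displacement field.

    mapping: (H, W, 2) array; channel 0 is x, channel 1 is y, both assumed
             to lie in [-1, 1] (normalized image coordinates).
    Returns an (H, W, 2) flow in pixels: matched position minus pixel position.
    """
    h, w, _ = mapping.shape
    # Re-scale normalized coordinates to pixel coordinates.
    x_pix = (mapping[..., 0] + 1.0) * (w - 1) / 2.0
    y_pix = (mapping[..., 1] + 1.0) * (h - 1) / 2.0
    # Regular pixel grid of the image.
    grid_y, grid_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Displacement = matched position - own position.
    return np.stack([x_pix - grid_x, y_pix - grid_y], axis=-1)

# The identity mapping corresponds to a zero displacement field.
h, w = 4, 5
ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
identity = np.stack([xs, ys], axis=-1)
zero_flow = mapping_to_flow(identity)
```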
The decoder consists of 5 feed-forward convolutional blocks with a $3 \times 3$ spatial kernel. The numbers of feature channels of the convolutional layers are 128, 128, 96, 64, and 32, respectively. The final output of the mapping decoder is the result of a linear 2D convolution, without any activation.
A.3 Flow decoder
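Here, $c$ is a local correlation volume with search radius $R$ and $\tilde{F}_s^{l}$ is the source feature map warped according to the upsampled flow $\text{up}(w^{l-1})$. The complete flow field is then computed as $w^{l} = \Delta w^{l} + \text{up}(w^{l-1})$, where $\Delta w^{l}$ is the residual flow predicted by the decoder.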
The flow decoder at level 4 (see Figure 3) additionally takes an input obtained by applying a transposed convolution layer to the features of the second-to-last layer of the flow decoder at the previous level. This additional input was first introduced and utilized in PWC-Net [Sun2018] at every pyramid level. It enables the decoder of the current level to obtain some information about the correlation at the previous level. In GLU-Net, this additional input to the flow decoder only appears in H-Net since in L-Net, a global correlation and mapping decoder precede the flow decoder.
For the flow decoder, we employ an architecture similar to that of PWC-Net [Sun2018]. It consists of 5 convolutional layers with DenseNet connections [Huang2017]. The numbers of feature channels of the convolutional layers are 128, 128, 96, 64, and 32, respectively, and the spatial kernel of each convolution is $3 \times 3$. DenseNet connections are used since they have been shown to lead to significant improvement in image classification [Huang2017] and optical flow estimation [Sun2018]. The final output of the flow decoder is the result of a linear 2D convolution, without any activation.
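To make the effect of the DenseNet connections concrete, the short sketch below counts the input channels seen by each layer of such a decoder. It is an illustration under assumptions: the helper names are hypothetical, the search radius of 4 and the decoder input made of the correlation volume plus a 2-channel up-sampled flow are example values, not settings stated above.

```python
def local_correlation_channels(search_radius):
    """Number of candidate displacements in a (2R+1) x (2R+1) search window."""
    return (2 * search_radius + 1) ** 2

def dense_decoder_input_channels(in_channels, layer_widths=(128, 128, 96, 64, 32)):
    """Input channels of each densely connected conv layer, followed by the
    input channels of the final linear convolution."""
    counts = []
    current = in_channels
    for width in layer_widths:
        counts.append(current)
        # Dense connection: the layer output is concatenated to its own input.
        current += width
    counts.append(current)
    return counts

# Hypothetical example: search radius 4, plus a 2-channel up-sampled flow.
corr_channels = local_correlation_channels(4)
channels = dense_decoder_input_channels(corr_channels + 2)
```

With these example values, the final linear convolution sees the decoder input together with the outputs of all five layers.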
A.4 Refinement network
Here, we explain in more detail the refinement network R (Figure 3 in the paper). The refinement network aims to refine the pixel-level flow field, thus preventing erroneous flows from being amplified by up-sampling and passed to the next pyramid level. Its architecture is the same as that of the context network employed in PWC-Net [Sun2018]. It is a feed-forward CNN with 7 dilated convolutional layers [YuK15], with varying dilation rates. Dilated convolutions enlarge the receptive field without increasing the number of weights. From bottom to top, the dilation constants are 1, 2, 4, 8, 16, 1, and 1. The spatial kernel is set to $3 \times 3$ for all convolutional layers.
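The benefit of the dilation schedule can be checked with a short calculation of the receptive field of a stack of stride-1 convolutions. The sketch below is illustrative only; the helper name `receptive_field` is hypothetical and a kernel size of 3 is assumed.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, along one axis) of stacked stride-1
    convolutions with the given dilation rates."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation pixels.
        field += (kernel_size - 1) * dilation
    return field

dilated = receptive_field([1, 2, 4, 8, 16, 1, 1])
plain = receptive_field([1] * 7)
```

Under these assumptions the dilated stack covers 67 pixels along each axis, against 15 for seven undilated layers, while both use the same number of weights per filter.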
A.5 Details about Local-Net, Global-Net and GLOCAL-Net
In Figure 2 of the main paper, we introduced Local-Net, Global-Net and GLOCAL-Net to investigate the differences between architectures based on local correlation layers, a global correlation layer or a combination of the two, respectively. All three networks are composed of three pyramid levels and use the same feature extractor backbone VGG-16 [Chatfield14]. The mapping and flow decoders have the same architecture as those used for GLU-Net and described above. For Global-Net, the pyramid levels following the global correlation level employ concatenation of the target and warped source feature maps, as suggested in DGC-Net [Melekhov2019]. They are fed to the flow estimation decoder along with the up-sampled flow from the previous resolution. Finally, Global-Net and GLOCAL-Net are both restricted to a pre-determined input resolution due to their global correlation at the coarsest pyramid level. On the other hand, Local-Net, which only relies on local correlations, can take input images of any resolution.
A.6 Iterative refinement
Here we provide more details about the iterative refinement procedure described in Section 3.3 in the paper. For high-resolution images, the upscaling factor between the finest pyramid level of L-Net and the coarsest pyramid level of H-Net (see Figure 8) can be significant. Our adaptive resolution strategy allows additional refinement steps of the flow estimate between those two levels during inference, thus improving the accuracy of the estimated flow, without training any additional weights. This is performed by recursively applying the layer weights of the coarsest H-Net level at intermediate resolutions obtained by down-sampling the source and target features from that level.
In particular, we apply iterative refinement if the ratio between the resolutions of the coarsest H-Net level and the finest L-Net level is larger than three. We then iteratively perform refinements at intermediate resolutions, obtained by reducing the resolution of the coarsest H-Net level by a factor of 2 in each step, until the ratio between the resolution of the coarsest intermediate level and that of the finest L-Net level is smaller than 2.
In more detail, we construct a local correlation layer from the source and target feature maps of the coarsest H-Net level, down-sampled to the desired intermediate resolution. We then apply the weights of that level's decoder to the local correlation, thereby obtaining an intermediate refinement of the flow field. This process is illustrated in Figure 8, where the gap between the two levels allows for two additional flow field refinements.
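The rule for choosing the intermediate resolutions can be sketched as a small helper. This is one possible reading of the procedure described above, not the released code: the function name, the use of a single side length per level, truncation to integers, and the example side lengths of 32 and 250 are all hypothetical.

```python
def intermediate_resolutions(lnet_finest, hnet_coarsest,
                             start_ratio=3.0, stop_ratio=2.0):
    """Side lengths of the intermediate refinement levels, coarse to fine.

    lnet_finest:   side length of the finest L-Net level
    hnet_coarsest: side length of the coarsest H-Net level
    """
    # No extra refinement unless the gap between the two levels is large.
    if hnet_coarsest / lnet_finest <= start_ratio:
        return []
    levels = []
    current = hnet_coarsest / 2.0
    while True:
        levels.append(current)
        # Stop once the coarsest intermediate level is close to the L-Net level.
        if current / lnet_finest < stop_ratio:
            break
        current /= 2.0
    # Refinements are applied from the coarsest to the finest resolution.
    return [int(r) for r in reversed(levels)]

steps = intermediate_resolutions(32, 250)
```

With these example values the gap allows two additional refinements, at side lengths 62 and 125, before the coarsest H-Net level itself is processed.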
Appendix B Training details
Here, we provide additional details about the training procedure and the training dataset.
We freeze the weights of the feature extractor during training. Let $\theta$ denote the learnable parameters of the network. Let $w^{l}_{\theta}$ denote the flow field estimated by the network at the $l$-th pyramid level, and let $w^{l}_{GT}$ refer to the corresponding dense flow ground-truth, computed from the random warp. We employ the multi-scale training loss, first introduced in FlowNet [Dosovitskiy2015],

$$\mathcal{L}(\theta) = \sum_{l} \alpha_{l} \sum_{x} \left\| w^{l}_{\theta}(x) - w^{l}_{GT}(x) \right\| + \lambda \left\| \theta \right\|,$$

where $\alpha_{l}$ are the weights applied to each pyramid level and the second term of the loss, weighted by $\lambda$, regularizes the weights of the network. We do not apply any mask during training, which means that occluded regions (that do not have visible matches) are included in the training loss. Since the image pairs are related by synthetic transformations, these regions do have a correct ground-truth flow value.
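The data term of this multi-scale loss can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the function name is hypothetical, the per-pixel error is taken to be the Euclidean end-point error summed over all pixels (no mask, as stated above), and the regularization term is left out because it is commonly delegated to the optimizer as weight decay.

```python
import numpy as np

def multiscale_flow_loss(pred_flows, gt_flows, level_weights):
    """Weighted sum over pyramid levels of the summed end-point error.

    pred_flows, gt_flows: lists of (H_l, W_l, 2) arrays, one per level
    level_weights:        list of per-level weights
    """
    total = 0.0
    for pred, gt, weight in zip(pred_flows, gt_flows, level_weights):
        # Per-pixel Euclidean distance between estimated and true flow.
        end_point_error = np.linalg.norm(pred - gt, axis=-1)
        # No mask: every pixel, including occluded ones, contributes.
        total += weight * end_point_error.sum()
    return float(total)

# Two hypothetical levels with illustrative weights.
pred = [np.zeros((2, 2, 2)), np.zeros((1, 1, 2))]
gt = [np.full((2, 2, 2), [3.0, 4.0]), np.array([[[0.0, 1.0]]])]
loss = multiscale_flow_loss(pred, gt, [0.5, 2.0])
```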
For our adaptive resolution strategy, we down-sample and scale the ground truth from original resolution to in order to obtain the ground truth flow fields for L-Net. Similarly to FlowNet [Dosovitskiy2015] and PWC-Net [Sun2018], we down-sample the ground truth from the base resolution to the different pyramid resolutions without further scaling, so as to obtain the supervision signals at the different levels.
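The two ways of preparing the ground truth differ only in whether the flow vectors are rescaled after resizing. The sketch below illustrates this distinction; the function name is hypothetical, and nearest-neighbor sampling is used purely to keep the example short, since the interpolation scheme is not specified above.

```python
import numpy as np

def resize_flow(flow, new_h, new_w, rescale_vectors):
    """Resize an (H, W, 2) flow field with nearest-neighbor sampling.

    rescale_vectors=True additionally scales the (x, y) displacements by the
    resize ratios, so that they are expressed in the new pixel units.
    """
    h, w, _ = flow.shape
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = flow[rows][:, cols].astype(float)
    if rescale_vectors:
        resized[..., 0] *= new_w / w
        resized[..., 1] *= new_h / h
    return resized

flow = np.full((4, 4, 2), [8.0, 4.0])
# Ground truth for a network operating at a different input resolution.
scaled = resize_flow(flow, 2, 2, rescale_vectors=True)
# Supervision at a coarser pyramid level of the same network.
unscaled = resize_flow(flow, 2, 2, rescale_vectors=False)
```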
To use the full potential of our GLU-Net, training should be performed on high-resolution images. We create the training dataset following the procedure in DGC-Net [Melekhov2019], but enforce the condition of high resolution. We use the same synthetic transformations (affine, thin-plate and homographies), but apply them to our higher-resolution images collected from the DPED [Ignatov2017], CityScapes [Cordts2016] and ADE-20K [Zhou2019] datasets. DPED images are very large; however, the DPED training dataset is composed of only approximately 5000 sets of images taken by four different cameras. We use the images from two cameras, resulting in around images. CityScapes additionally adds about images. We complement with a random sample of ADE-20K images with a minimum resolution of .
B.3 Implementation details
As a preprocessing step, the training images are mean-centered and normalized using the mean and standard deviation of the ImageNet dataset [Hinton2012]. For the training of Global-Net, Local-Net and GLOCAL-Net, we use a batch size of 32 and an initial learning rate of which is gradually decreased during training. The weights in the training loss are set to be .
Our final network GLU-Net is trained with a batch size of 16 and the learning rate initially equal to . The weights in the training loss are set to be . Our system is implemented using PyTorch [pytorch] and our networks are trained using the Adam optimizer [adam] with learning rate decay of .
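|Method||Metric||HP-240 I||HP-240 II||HP-240 III||HP-240 IV||HP-240 V||HP-240 all||HP I||HP II||HP III||HP IV||HP V||HP all|
|GLU-Net (Ours)||PCK-1px [%]||87.89||67.49||62.31||47.76||34.14||59.92||61.72||42.43||40.57||29.47||23.55||39.55|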
Appendix C Detailed results
C.1 Run time
We compare the run time of our method with state-of-the-art approaches over the HP-240 images in Table 6. The timings have been obtained on the same desktop with an NVIDIA GTX 1080 Ti GPU. The HP-240 images are of size $240 \times 240$, which corresponds to the pre-determined input resolution of DGC-Net. For PWC-Net, LiteFlowNet and GLU-Net, the images are resized to before being passed through the networks. We do not consider this resizing in the estimated time. They all output a flow at a quarter of the input image resolution. We up-scale to the image resolution with bilinear interpolation. This up-scaling operation is included in the estimated time.
Our network GLU-Net obtains a run time similar to that of PWC-Net and is three times faster than DGC-Net. This is because PWC-Net, LiteFlowNet and GLU-Net output a flow at a quarter of the image resolution, whereas DGC-Net refines the estimated flow field with two additional pyramid levels up to its fixed input resolution.
C.2 Geometric matching
We provide the detailed results on HP and ETH3D datasets, as well as extensive additional qualitative examples.
Results on HPatches dataset
Comparison to different training datasets: DGC-Net [Melekhov2019] is trained on pairs of images created by applying synthetic transformations to the Tokyo Time Machine dataset [Arandjelovic2016] and cropping them (denoted as tokyo). Since we cannot train GLU-Net on the same tokyo dataset due to its small resolution, for completeness, we additionally trained our GLOCAL-Net, which also has a fixed input resolution, on tokyo and compare the results to GLOCAL-Net trained on our CityScape-DPED-ADE dataset. It is important to note that GLOCAL-Net has 3 pyramid levels, the finest one at resolution for a pre-determined input resolution of . On the other hand, DGC-Net [Melekhov2019] has 5 pyramid levels, the last one applied on input resolution . We evaluate those networks on HP-240 and HP and present the results in Table 8.
GLOCAL-Net obtains similar results on both HP and HP-240 datasets, independently of its training data, tokyo or CityScape-DPED-ADE. Since both datasets were created by applying the same synthetic transformations, this supports the observation that transformation and displacement statistics are more important for generalization properties than image content [Mayer2018, Schuster2019b, Sun2019].
|Method (training data)||HP-240 AEPE||HP-240 PCK-1px||HP-240 PCK-5px||HP AEPE||HP PCK-1px||HP PCK-5px|
|DGC-Net (tokyo)||9.07||50.01 %||77.40 %||33.26||12.00 %||58.06 %|
|GLOCAL-Net (CityScape-DPED-ADE)||8.77||48.53 %||78.12 %||31.64||10.23 %||56.72 %|
|GLOCAL-Net (tokyo)||8.48||41.00 %||77.86 %||31.16||7.31 %||49.08 %|
|GLU-Net (CityScape-DPED-ADE)||7.40||59.92 %||83.47 %||25.05||39.55 %||78.54 %|
Detailed results on HP and HP-240: Detailed results obtained by different models on the various viewpoints of the HP and HP-240 datasets are presented in Table 7, which corresponds to Table 1 of the main paper. We outperform all other methods for each viewpoint ID on both low-resolution (HP-240) and high-resolution images (HP). In particular, the PCK-1px of our network on HP is roughly 3 to 4 times higher than that of DGC-Net. Additional qualitative examples are shown in Figure 12.
We additionally present the PCK curves computed over the different viewpoints of HP, as a function of the relative distance threshold. We do not set pixel-level thresholds for the PCK curves since HP image pairs have different resolutions in general. GLU-Net achieves better accuracy (better PCK) for all thresholds compared to PWC-Net [Sun2018], LiteFlowNet [Hui2018] and DGC-Net [Melekhov2019]. Importantly, GLU-Net obtains significantly better PCK for low thresholds.
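For reference, the two metrics used throughout these tables and curves can be sketched as follows. This is a hedged illustration: the function names are hypothetical, no validity mask is applied, a pixel counts as correct when its error is at most the threshold, and the relative threshold is assumed to be a fraction of the larger image side.

```python
import numpy as np

def aepe(pred_flow, gt_flow):
    """Average end-point error: mean Euclidean distance between flows."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=-1).mean())

def pck(pred_flow, gt_flow, threshold_px):
    """Percentage of pixels whose end-point error is within threshold_px."""
    error = np.linalg.norm(pred_flow - gt_flow, axis=-1)
    return float((error <= threshold_px).mean() * 100.0)

def pck_relative(pred_flow, gt_flow, alpha):
    """PCK with a threshold relative to the image size (assumed: max side)."""
    height, width = gt_flow.shape[:2]
    return pck(pred_flow, gt_flow, alpha * max(height, width))

# Hypothetical 2x2 example with end-point errors 0, 0.5, 3 and 10 pixels.
gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[..., 0] = [[0.0, 0.5], [3.0, 10.0]]
metrics = (aepe(pred, gt), pck(pred, gt, 1.0), pck(pred, gt, 5.0))
```

A relative threshold makes the curves comparable across image pairs of different resolutions, which a fixed pixel threshold does not.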
Results on ETH3D
In the main paper, Figure 5, we quantitatively evaluated our approach over pairs of ETH3D images sampled from consecutive frames at different intervals. In Table 9, we give the corresponding detailed evaluation metrics (AEPE and PCK) obtained by PWC-Net, LiteFlowNet, DGC-Net and GLU-Net.
|interval = 3||PCK-1px [%]||58.87||54.14||31.50||47.47|
|interval = 5||PCK-1px [%]||53.64||47.02||25.23||40.22|
|interval = 7||PCK-1px [%]||46.97||39.69||20.90||34.41|
|interval = 9||PCK-1px [%]||39.52||32.61||17.61||30.25|
|interval = 11||PCK-1px [%]||31.10||26.15||14.88||26.54|
|interval = 13||PCK-1px [%]||24.89||21.30||12.83||23.45|
|interval = 15||PCK-1px [%]||19.92||17.03||10.69||20.48|
Here, we additionally provide qualitative examples of the different networks and GLU-Net applied to pairs of images at different intervals in Figure 11. While optical flow methods achieve good results for low intervals, the source images warped according to their output flows get worse as the interval between the images of a pair increases. On the other hand, our model produces flow fields of constant quality.
Qualitative results: We additionally use ETH3D images to demonstrate the superiority of our approach in dealing with extreme viewpoint changes on the one hand, and radical illumination and appearance variations on the other hand.
In addition to the medium-resolution images evaluated previously, ETH3D [ETH3d] also provides several additional scenes taken with high-resolution cameras, acquiring images at 24 Megapixel (). Since the images of a sequence are taken by a single camera, consecutive pairs of images show only small lighting variations; however, they are related by very wide viewpoint changes. As there are no ground-truth correspondences provided along with the images, we only evaluate qualitatively on consecutive pairs of images. The original images of are down-sampled by a factor of for practical purposes. We show qualitative results over a few of those images in Figure 10. GLU-Net is capable of handling very large motions, where the other methods partly (DGC-Net) or completely fail (PWC-Net and LiteFlowNet).
On the other hand, our network can also handle large appearance changes due to variations in illumination or to the use of different optics. For this purpose, we utilize additional examples of pairs of images from ETH3D taken by two different cameras simultaneously. The camera of the first images has a field of view of 54 degrees while the other camera has a field of view of 83 degrees. They capture images at a resolution of or depending on the scenes and on the camera. The exposure settings of the cameras are set to automatic for all datasets, allowing the device to adapt to illumination changes. Qualitative examples of state-of-the-art methods and GLU-Net applied to such pairs of images are presented in Figure 13. GLU-Net is robust to changes in lighting conditions as well as to artifacts. While DGC-Net [Melekhov2019] obtains satisfactory results, the image warped according to its output flow is often blurry, whereas we always obtain sharp, almost perfect warped source images.
C.3 Semantic correspondences
In Figure 14, we present additional qualitative results on the TSS [Taniai2016] dataset of our universal network (GLU-Net) and its modified version (Semantic-GLU-Net), which includes NC-Net [Rocco2018b] and feature concatenation [Jeon].
C.4 Detailed ablative analysis
In this section, we provide additional ablation experiments. All networks are trained on the CityScape-DPED-ADE dataset.
Coarse-to-fine approach: We first defend the use of a coarse-to-fine approach with a feature pyramid. We report AEPE and PCK metrics for the flow estimates obtained at different levels of the feature pyramid of the GLU-Net model in Table 10. On the flow field estimated at each level, we apply bilinear interpolation to the original image resolution and multiply the estimated flow with the corresponding scale factor for the levels of L-Net. The end-point error decreases from the coarsest level to the finest level of the pyramid while the accuracy (PCK) increases. This supports the use of a pyramidal model.
Scale pyramid level of the adaptive resolution: In Table 11, we present the influence of the pyramid level at which the adaptive resolution module is integrated in the four-level pyramid network. Having a single level in L-Net (corresponding to the global correlation layer) and three pyramid levels in H-Net (referred to as 3L) leads to poor results on all datasets, even compared to GLOCAL-Net. On the other hand, both other alternatives (1 or 2 levels in H-Net) bring about major improvements in robustness (AEPE) and accuracy (PCK) on the HPatches dataset, particularly on the high-resolution images HP. However, having only one level in H-Net (1L) degrades the performance obtained on the semantic dataset TSS. Having both H-Net and L-Net comprise 2 pyramid levels (2L) appears to be the best option to achieve competitive results on geometric matching, optical flow and semantic matching.
|Level||AEPE||PCK-1px [%]||PCK-5px [%]|
|Level 1||45.49||0.70||13.53|
|Level 2||30.00||6.27||50.29|
|Level 3||26.43||30.47||74.44|
|Level 4||25.05||39.55||78.54|
|GLOCAL-Net||1L = 1 H-Net level||2L = 2 H-Net levels||3L = 3 H-Net levels|