Domain-invariant Stereo Matching Networks

Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu, Benjamin Wah, Philip Torr
University of Oxford  Baidu Research  CUHK
Abstract

State-of-the-art stereo matching networks have difficulties in generalizing to new unseen environments due to significant domain differences, such as color, illumination, contrast, and texture. In this paper, we aim at designing a domain-invariant stereo matching network (DSMNet) that generalizes well to unseen scenes. To achieve this goal, we propose i) a novel “domain normalization” approach that regularizes the distribution of learned representations to allow them to be invariant to domain differences, and ii) a trainable non-local graph-based filter for extracting robust structural and geometric representations that can further enhance domain-invariant generalization. When trained on synthetic data and generalized to real test sets, our model performs significantly better than all state-of-the-art models. It even outperforms some deep learning models (e.g. MC-CNN [54]) fine-tuned with test-domain data. The code and dataset will be available at https://github.com/feihuzhang/DSMNet.

1 Introduction

Stereo reconstruction is a fundamental problem in computer vision, robotics and autonomous driving. It aims to estimate 3D geometry by computing disparities between matching pixels in a stereo image pair. Recently, many end-to-end deep neural network models (e.g. [56, 4, 17]) have been developed for stereo matching, achieving impressive accuracy on several datasets and benchmarks.

However, state-of-the-art stereo matching networks (supervised [56, 4, 17] and unsupervised [59, 45]) cannot generalize well to unseen data without fine-tuning or adaptation. Their difficulties lie in the large domain differences (such as color, illumination, contrast and texture) between stereo images in various datasets. As illustrated in Fig. 1, models pre-trained on one specific dataset produce poor results on other unseen real scenes.

(a) Training Scenes  (b) Test Scenes  (c) Feature Map of GANet [56]  (d) Feature Map of our DSMNet  (e) Results of GANet [56]  (f) Results of Our DSMNet

Figure 1: Visualization of the feature maps and disparity results. The state-of-the-art GANet [56] is used for comparison. Models are trained on synthetic data (SceneFlow [30]) and tested on novel real scenes (KITTI [31]). The feature maps from GANet have many artifacts (e.g. noise). Our DSMNet mainly captures the structure and shape information as robust features, and there are no distortions or artifacts in the feature maps. It can produce accurate disparity estimations in the novel test scenes. The same observations hold for more models (e.g. PSMNet [4], HD [53]) and datasets in the supplementary material.

Domain adaptation and transfer learning methods (e.g. [45, 11, 3]) attempt to transfer or adapt from one source domain to another new domain. Typically, a large number of stereo images from the new domain are required for the adaptation, but such data cannot be easily obtained in many real scenarios. Even in this case, we still need a good method for disparity estimation when no data from the new domain are available for adaptation.

Thus, it is desirable to design a model that generalizes well to unseen data without re-training or adaptation. The difficulties in developing such a domain-invariant stereo matching network (DSMNet) come from the significant domain differences between stereo images in various scenes/datasets (e.g. Fig. 1(a) and 1(b)). Such differences make the learned features unstable, distorted and noisy, leading to many wrong matching results.

Fig. 1 visualizes the features learned by some state-of-the-art stereo matching models [56, 4, 53]. Due to the limited effective receptive field of convolutional neural networks [28], they capture domain-sensitive local patterns (e.g. local contrast, edge and texture) when constructing matching features. These patterns, however, break down and produce many artifacts (e.g. noise) in the feature maps when applied to novel test data (Fig. 1(c)). The artifacts and distortions in the features inhibit robust matching, leading to wrong matching results (Fig. 1(e)).

In this paper, we propose two novel neural network layers for constructing a robust deep stereo matching network that generalizes across domains without further fine-tuning or adaptation. Firstly, to reduce the domain shifts/differences between different datasets/scenes, we propose a novel domain normalization layer that fully regulates the feature distribution in both the spatial (height and width) and the channel dimensions. Secondly, to eliminate the artifacts and distortions in the features, we propose a learnable non-local graph-based filtering layer that captures more robust structural and geometric representations (e.g. shape and structure, as illustrated in Fig. 1(d)) for domain-invariant stereo matching.

We formulate our method as an end-to-end deep neural network model and train it only with synthetic data. In our experiments, without any fine-tuning or adaptation on the real test datasets, our DSMNet far outperforms: 1) almost all state-of-the-art stereo matching models (e.g. GANet [56]) trained on the same synthetic dataset, 2) most of the traditional methods (e.g. cost filtering [14], SGM [13], etc.), and 3) most of the unsupervised/self-supervised models trained on the target test domains. Our model even surpasses some supervised deep neural network models fine-tuned on the target domains (e.g. MC-CNN [54], Content-CNN [29], DispNetC [30], etc.).

2 Related Work

2.1 Deep Neural Networks for Stereo Matching

In recent years, deep neural networks have seen great success in the task of stereo matching [17, 4, 56]. These models can be categorized into three types: 1) learning better features for traditional stereo matching algorithms, 2) correlation-based end-to-end deep neural networks, 3) cost-volume based stereo matching networks.

In the first category, deep neural networks have been used to compute patch-wise similarity scores as the matching costs [57, 54]. The costs are then fed into the traditional cost aggregation and disparity computation/refinement methods [13] to get the final disparity maps. The models are, however, limited by the traditional matching cost aggregation step and often produce wrong predictions in occluded regions, large textureless/reflective regions and around object edges.

DispNetC [30], a typical method in the second category, computes the correlations by warping between stereo views and attempts to predict the per-pixel disparity by minimizing a regression training loss. Many other state-of-the-art methods, including iResNet [25], CRL [36], SegStereo [51], EdgeStereo [42], HD [53], and MADNet [45], are all based on color or feature correlations between the left and right views for disparity estimation.

The recently developed cost-volume based models explicitly learn the feature extraction, cost volume construction, and regularization, all end to end. Examples include GC-Net [17], PSMNet [4], StereoNet [18], AnyNet [49], GANet [56] and EMCUA [34]. They all utilize a similarity cost as an extra dimension to build a 4D cost volume in which the real geometric context is maintained.
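As a rough illustration of this design (a generic sketch with our own tensor names, not the implementation of any particular cited model), a 4D concatenation-based cost volume can be built from left/right feature maps as follows:

```python
import torch

def build_concat_cost_volume(feat_left, feat_right, max_disp):
    """Build a 4D cost volume [B, 2C, D, H, W] by concatenating left features
    with right features shifted by each candidate disparity."""
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = feat_left
            volume[:, c:, d] = feat_right
        else:
            volume[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            volume[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return volume

# Usage: features with 32 channels and 64 disparity hypotheses.
cost = build_concat_cost_volume(torch.randn(1, 32, 80, 160),
                                torch.randn(1, 32, 80, 160), max_disp=64)
print(cost.shape)  # torch.Size([1, 64, 64, 80, 160])
```

The 3D convolutions of the regularization network then operate on this volume over the disparity, height and width dimensions.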

There are also others that combine the correlation and cost volume strategies (e.g. [12]).

The common feature of these models is that they all require a large number of training samples with ground truth depth/disparity. More importantly, a model trained on one specific domain cannot generalize well to new scenes without fine-tuning or retraining.

2.2 Adaptation and Self-supervised Learning

Self-supervised Learning:

A recent trend of training stereo matching networks in an unsupervised manner relies on image reconstruction losses obtained by warping between the left and right views [59, 58]. However, such methods cannot handle occlusions and reflective regions, where there is no correspondence between the left and the right views. They also do not generalize well to other new domains.
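As a rough illustration of such a reconstruction loss (a generic sketch, not the exact loss of [59, 58]), the right view can be warped to the left view with the predicted disparity and compared photometrically:

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Warp the right image to the left view using the left-view disparity map.
    img_right: [B, 3, H, W]; disp_left: [B, 1, H, W] with positive disparities."""
    b, _, h, w = img_right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().to(img_right.device).expand(b, h, w)
    ys = ys.float().to(img_right.device).expand(b, h, w)
    # A pixel (x, y) in the left view corresponds to (x - d, y) in the right view.
    x_src = xs - disp_left.squeeze(1)
    grid_x = 2.0 * x_src / (w - 1) - 1.0   # normalize to [-1, 1] for grid_sample
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(img_right, grid, align_corners=True, padding_mode="border")

def reconstruction_loss(img_left, img_right, disp_left):
    # Simple L1 photometric loss between the left image and the warped right image.
    warped = warp_right_to_left(img_right, disp_left)
    return (warped - img_left).abs().mean()
```

Occluded pixels have no valid source in the right view, which is exactly where such a loss breaks down.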

Domain Adaptation:

Some methods pre-train the models on synthetic data and then exploit cross-domain knowledge to adapt to a new domain [11, 37]. Others focus on online or offline adaptation [44, 45, 43, 39]. For example, MADNet [45] adapts the pre-trained model online and in real time, but it has poor accuracy even after the adaptation. Moreover, domain adaptation approaches require a large number of stereo images from the target domain, which cannot be easily obtained in many real scenarios; even then, a good method is still needed for disparity estimation when no target-domain data are available for adaptation.

2.3 Cross-Domain Generalization

Different from domain adaptation, domain generalization is a much harder problem that assumes no access to target-domain information for adaptation or fine-tuning. Many approaches explore the idea of domain-invariant feature learning. Earlier approaches focus on data-driven strategies to learn invariant features from different source domains [32, 10, 20]. Some recent methods utilize meta-learning, taking variations in multiple source domains to generalize to novel test distributions [1, 21]. Other approaches [23, 22] employ an adversarial network to learn domain-invariant representations/features for image recognition. Choy et al. [6] develop a universal feature learning framework for visual correspondences using deep metric learning.

In contrast to the above approaches, some methods modify batch or instance normalization to improve generalization and robustness for style transfer or image recognition [33, 24, 35].

In summary, for stereo matching, little work has been done to improve the generalization ability of end-to-end deep neural network models, let alone to develop domain-invariant stereo matching networks.

3 Proposed DSMNet

To overcome the challenges in cross-domain generalization, we develop in the following sections our domain-invariant stereo matching network. It includes domain normalization to remove the influence of domain shifts (e.g. color, style, illumination), as well as non-local graph-based filtering and aggregation to capture the non-local structural and geometric context as robust features for domain-invariant stereo reconstruction.

Figure 2: Normalization methods. Each subplot shows a feature map tensor, with $N$ as the batch axis, $C$ as the channel axis, and ($H$, $W$) as the spatial axes. The blue elements in set $S_i$ are normalized by the same mean and variance. The proposed domain normalization consists of image-level normalization (blue, Eq. (1)) and pixel-level normalization of each $C$-channel feature vector (green, Eq. (3)).

3.1 Domain Normalization

Batch normalization (BN) has become the default feature normalization operation for constructing end-to-end deep stereo matching networks [17, 4, 56, 42, 45, 30]. Although it can reduce the internal covariate shift effects in training deep networks, it is domain-dependent and has a negative influence on the model’s cross-domain generalization ability.

BN normalizes the features as follows:

$$\hat{x}_i = \frac{1}{\sigma_i}\,(x_i - \mu_i) \qquad (1)$$

Here $x_i$ and $\hat{x}_i$ are the input and output features, respectively, and $i$ indexes the elements in a tensor (i.e. feature maps, as illustrated in Fig. 2) of size $N \times C \times H \times W$ ($N$: batch size, $C$: channels, $H$: spatial height, $W$: spatial width). $\mu_i$ and $\sigma_i$ are the corresponding channel-wise mean and standard deviation (std) and are computed by:

$$\mu_i = \frac{1}{|S_i|}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{|S_i|}\sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon} \qquad (2)$$

where $S_i$ is the set of elements in the same channel as element $i$ (Fig. 2), and $\epsilon$ is a small constant that avoids division by zero.

The mean and standard deviation are computed per batch in the training phase, and the accumulated values over the training set are used for inference. However, different domains may have different $\mu$ and $\sigma$ caused by color shifts, contrast, and illumination (Fig. 1(a) and 1(b)). Thus the $\mu$ and $\sigma$ computed for one dataset are not transferable to others.

(a) Instance Norm  (b) Domain Norm

Figure 3: Norm distributions of the features on different datasets (from left to right: synthetic SceneFlow, KITTI, Middlebury, CityScapes and ETH3D). We choose the output feature of the feature extraction network for our study. The norm of the $C$-channel feature vector of each pixel is counted for the distribution. Instance normalization can only reduce the image-level differences, but does not normalize the $C$-channel feature vectors at the pixel level.

Instance normalization (IN) [33, 38] overcomes the dependency on dataset statistics by normalizing each sample separately, i.e. the elements in $S_i$ are confined to come from the same channel of the same sample, as illustrated in Fig. 2. In theory, IN is domain-invariant, and normalization across the spatial dimensions ($H$, $W$) reduces image-level appearance/style variations.

However, matching of stereo views is realized at the pixel level: an accurate correspondence is found for each pixel using its $C$-channel feature vector. Any inconsistency in the feature norms and scaling will significantly influence the matching costs and similarity measurements.

Fig. 3 illustrates that IN cannot regulate the norm distribution of the pixel-wise feature vectors, which varies across datasets/domains.
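The following toy example (our own illustration, not code from the paper) shows why unregulated per-pixel feature norms are problematic: a domain-dependent rescaling of the feature vectors changes the magnitude of a dot-product matching cost, whereas per-pixel L2-normalized features are unaffected:

```python
import torch

torch.manual_seed(0)
f_left = torch.randn(8)     # C-channel feature vector of a left-view pixel
f_right = torch.randn(8)    # a candidate matching pixel in the right view

def dot_cost(a, b):
    return (a * b).sum()

def normalized_cost(a, b):
    # L2-normalize each pixel's feature vector before comparing.
    return (a / a.norm() * (b / b.norm())).sum()

scale = 3.0                 # e.g. a higher-contrast target domain
print(dot_cost(f_left, f_right), dot_cost(scale * f_left, scale * f_right))
# The raw cost changes by scale**2, so cost magnitudes and margins shift across domains.
print(normalized_cost(f_left, f_right),
      normalized_cost(scale * f_left, scale * f_right))  # unchanged
```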

(a) 8-connected graph  (b) directed graph $G_1$  (c) directed graph $G_2$

Figure 4: Illustration of the graph construction. The 8-connected graph is separated into two directed graphs $G_1$ and $G_2$.

We propose in Fig. 2 our domain-invariant normalization (DN). Our method normalizes features along the spatial axes ($H$, $W$) to induce style-invariant representations, similar to IN, as well as along the channel dimension ($C$) to enhance the local invariance.

Our DN is realized as follows:

$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}, \qquad y_i = \frac{\hat{x}_i}{\sqrt{\sum_{k \in S'_i} \hat{x}_k^2} + \epsilon} \qquad (3)$$

where $S'_i$ (green region in Fig. 2) includes the elements from the same sample ($N$ axis) and the same spatial location ($H$, $W$ axes). $\hat{x}_i$ is computed as in Eqs. (1) and (2), but with the elements of $S_i$ taken from the same channel of the same sample (blue region in Fig. 2). In DN, besides the normalization across the spatial dimensions, we also employ an L2 normalization along the channel axis. The two operations collaborate to address the sensitivity to domain shifts as well as to suppress noise and extreme values in the feature vectors. As illustrated in Fig. 3, this helps regulate the norm distribution of the features across different datasets and improves the robustness to local domain shifts (e.g. texture pattern, noise, contrast).

Finally, a trainable per-channel scale $\gamma$ and shift $\beta$ are added to enhance the discriminative representation ability, as in BN and IN. The final formulation is as follows:

$$z_i = \gamma\, y_i + \beta \qquad (4)$$
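To make the operation concrete, a minimal PyTorch sketch of domain normalization as we read Eqs. (1)–(4) is given below; the channel-wise L2 step and the variable names reflect our reading of Eq. (3) rather than the released implementation:

```python
import torch
import torch.nn as nn

class DomainNorm(nn.Module):
    """Domain normalization: per-sample spatial normalization (as in IN),
    then L2 normalization of each pixel's C-channel vector, then scale/shift."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):                                   # x: [N, C, H, W]
        mu = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)        # Eqs. (1)-(2), per sample
        x_hat = x_hat / (x_hat.norm(dim=1, keepdim=True) + self.eps)  # Eq. (3), per pixel
        return self.gamma * x_hat + self.beta                # Eq. (4)

# Usage: a drop-in replacement for BatchNorm2d in the feature extraction network.
y = DomainNorm(32)(torch.randn(2, 32, 48, 96))
```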

3.2 Non-local Aggregation

We propose a graph-based filter that robustly exploits non-local contextual information and reduces the dependence on local patterns (see Fig. 1(c)) for domain-invariant stereo matching.

3.2.1 Formulation

Our inspiration comes from traditional graph-based filters that are remarkably effective in employing non-local structural information for structure-preserving texture and detail removing/smoothing [55], denoising [55, 5], as well as depth-aware estimation and enhancement [26, 52].

For a 2D image/feature map, we construct an 8-connected graph $G = (\mathcal{V}, \mathcal{E})$ by connecting each pixel/node $p$ to its eight neighbors (see Fig. 4). To avoid loops and achieve fast non-local information aggregation over the graph, we split it into two reverse directed graphs $G_1$ and $G_2$ (see Fig. 4(b) and 4(c)).

We assign a weight $w(q,p)$ to each directed edge $(q,p) \in \mathcal{E}$, and a feature (or color) vector $\mathbf{x}_p$ to each node $p$. We also allow each node to propagate information to itself with a self-loop weight $w(p,p)$. For graph $G_1$ (and likewise $G_2$), our non-local filter is defined as follows:

$$\mathbf{y}_p = \frac{\sum_{q \in \mathcal{V}} w_{q \to p}\, \mathbf{x}_q}{\sum_{q \in \mathcal{V}} w_{q \to p}}, \qquad w_{q \to p} = \sum_{P_{q \to p}} \prod_{(m,n) \in P_{q \to p}} w(m,n) \qquad (5)$$

Here, $P_{q \to p}$ is a feasible path from node $q$ to node $p$. Note that the self-loop edge $(q,q)$ is included in the path and counts for the start node $q$. Unlike traditional geodesic filters, we consider all valid paths from the source node $q$ to the target node $p$. The propagation weight along a path is the product of all edge weights along the path. The weight $w_{q \to p}$ is then defined as the sum of the weights of all feasible paths from $q$ to $p$, which determines how much information is diffused to $p$ from $q$.

For the edge weight $w(q,p)$, we define it in a self-regularized manner as follows:

(6)

where $\mathbf{x}_p$ and $\mathbf{x}_q$ represent the feature vectors of nodes $p$ and $q$, respectively. This definition does not introduce new parameters and thus is more robust for cross-domain generalization.

Compared to other local filters, such as the Gaussian, median, and mean filters that can only propagate information within a local region determined by the kernel size, our proposed non-local filter allows long-range information to propagate, with weights accumulated spatially along all feasible paths in the graph.

For stable training and to avoid extreme values, we further add a normalization constraint to the weights associated with each node $p$ in the graph:

$$\sum_{q \in \mathcal{N}_p} w(q,p) = 1 \qquad (7)$$

Here, $\mathcal{N}_p$ is the set of connected neighbors of $p$ (including $p$ itself), and $(q,p)$ is the directed edge connecting $q$ to $p$. Fig. 4(b) gives examples of the neighbor sets of different nodes.
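One simple way to satisfy the constraint in Eq. (7) (our own sketch; the paper does not prescribe this exact operator) is to pass the raw incoming-edge weights of each node, self-loop included, through a softmax:

```python
import torch

def normalize_incoming_weights(raw_weights):
    """raw_weights: [..., K] unnormalized weights of the K incoming edges of a node
    (self-loop included). Returns weights that are positive and sum to 1 (Eq. (7))."""
    return torch.softmax(raw_weights, dim=-1)

w = normalize_incoming_weights(torch.tensor([0.5, 1.2, -0.3, 2.0]))
print(w, w.sum())   # the four weights sum to 1
```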

If Eq. (7) holds, we can further derive $\sum_{q \in \mathcal{V}} w_{q \to p} = 1$ (the proof is available in the supplementary material). Eq. (5) can then be simplified as follows:

$$\mathbf{y}_p = \sum_{q \in \mathcal{V}} w_{q \to p}\, \mathbf{x}_q \qquad (8)$$

Such a transformation not only increases the robustness in training but also reduces the computational costs.

3.2.2 Linear Implementation

Eq. (8) can be realized as an iterative linear aggregation, where the node representation is sequentially updated following the direction of the graph (e.g. from top to bottom, then left to right in $G_1$). In each step, $\mathbf{y}_p$ is updated as:

$$\mathbf{y}_p = w(p,p)\, \mathbf{x}_p + \sum_{q \in \mathcal{N}_p \setminus \{p\}} w(q,p)\, \mathbf{y}_q \qquad (9)$$

Finally, we repeat the aggregation process for both $G_1$ and $G_2$, where the representation updated with $G_1$ is used as the input for the aggregation with $G_2$ (similar to PatchMatch stereo [2]). The aggregation of Eq. (9) is a linear process with time complexity $O(n)$ (for a graph with $n$ nodes). During training, backpropagation can be realized by reversing the propagation equation, which is also a linear process (details in the supplementary material).
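As an illustration of this sequential update, the sketch below (our simplified rendering, assuming weights on a self-loop plus the three incoming edges from the previous row, normalized as in Eq. (7), and wrap-around at the image borders for brevity) implements the top-to-bottom pass of Eq. (9):

```python
import torch

def aggregate_top_down(x, w):
    """x: [C, H, W] input features; w: [4, H, W] normalized weights for the
    self-loop and the three incoming edges from the previous row
    (upper-left, up, upper-right); the 4 weights of each pixel sum to 1 (Eq. (7)).
    Returns y: [C, H, W], the top-to-bottom aggregation of Eq. (9)."""
    c, h, width = x.shape
    y = torch.zeros_like(x)
    y[:, 0] = x[:, 0]                          # first row has no incoming edges
    for r in range(1, h):
        prev = y[:, r - 1]                     # already-aggregated previous row
        up_left = torch.roll(prev, shifts=1, dims=-1)    # neighbor at column c-1 (wraps at border)
        up_right = torch.roll(prev, shifts=-1, dims=-1)  # neighbor at column c+1 (wraps at border)
        y[:, r] = (w[0, r] * x[:, r] + w[1, r] * up_left
                   + w[2, r] * prev + w[3, r] * up_right)
    return y

# Usage with softmax-normalized weights (see the sketch after Eq. (7)).
x = torch.randn(32, 24, 48)
w = torch.softmax(torch.randn(4, 24, 48), dim=0)
y = aggregate_top_down(x, w)
```

A left-to-right pass of the same form completes $G_1$, and the two reversed passes realize $G_2$.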

3.2.3 Relations to Existing Approaches

We show that the recently proposed semi-global aggregation (SGA) layer[56] and affinity-based propagation approach [27] are special cases of our graph-based non-local filter (Eq. (8)). In addition, we compare it with non-local neural networks [48, 50] and the attention mechanism [15].

Semi-global Aggregation (SGA) [56] is proposed as a differentiable approximation of SGM [13] and can be presented as follows:

$$C^A_{\mathbf{r}}(\mathbf{p},d) = w_0\, C(\mathbf{p},d) + w_1\, C^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d) + w_2\, C^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d-1) + w_3\, C^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d+1) + w_4\, \max_i C^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r},i), \quad \text{s.t. } \sum_{i=0}^{4} w_i = 1 \qquad (10)$$

The aggregations are done in four directions $\mathbf{r}$ (left, right, up and down). Taking the right-to-left propagation as an example, we can construct the propagation graph in Fig. 5(a). The vertical coordinate represents the disparity $d$, and the horizontal coordinate represents the indexes of the pixels/nodes. Compared to our non-local graph in Fig. 4(b), the edges connecting top and bottom nodes are removed, and the maximum of each column is densely connected to every node of the next column (red edges). The SGA layer can then be realized by our proposed non-local filter in Eq. (8): the nodes of the previous column (together with its maximum) form the neighborhood $\mathcal{N}_p$ of node $p$, and $w_0, \dots, w_4$ are the corresponding edge weights.
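A rough sketch of this recurrence for a single (right-to-left) direction is given below; it is our own simplified rendering of Eq. (10), with disparity borders handled by wrap-around and no claim to match the released GANet code:

```python
import torch

def sga_right_to_left(cost, w):
    """One direction of semi-global aggregation (Eq. (10)), right to left.
    cost: [D, H, W] matching costs; w: [5, H, W] per-pixel weights that sum to 1
    over the 5 terms (current cost, previous column at d, d-1, d+1, and its maximum)."""
    d, h, width = cost.shape
    agg = cost.clone()                                     # rightmost column keeps the raw cost
    for x in range(width - 2, -1, -1):                     # propagate from right to left
        prev = agg[:, :, x + 1]                            # aggregated costs of column x+1
        prev_dm1 = torch.roll(prev, shifts=1, dims=0)      # disparity d-1 (wraps at the border)
        prev_dp1 = torch.roll(prev, shifts=-1, dims=0)     # disparity d+1 (wraps at the border)
        prev_max = prev.max(dim=0, keepdim=True).values    # maximum of the previous column
        agg[:, :, x] = (w[0, :, x] * cost[:, :, x] + w[1, :, x] * prev
                        + w[2, :, x] * prev_dm1 + w[3, :, x] * prev_dp1
                        + w[4, :, x] * prev_max)
    return agg

# Usage: 48 disparities on a 24x32 cost slice, with softmax-normalized weights.
out = sga_right_to_left(torch.randn(48, 24, 32), torch.softmax(torch.randn(5, 24, 32), dim=0))
```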

(a) SGA [56]  (b) one-way propagation [27]  (c) three-way propagation [27]

Figure 5: Special cases of our non-local filter. (a) Semi-global aggregation (SGA) layer [56]. The dark blue node represents the maximum of each column. (b) and (c) are the affinity-based spatial propagations [27]. They aggregate from column $t-1$ to column $t$.

The Affinity-based Spatial Propagation in [27] can be written as:

$$\mathbf{h}_{k,t} = \Big(1 - \sum_{j \in \mathcal{N}_k} p_{j,t}\Big)\, \mathbf{x}_{k,t} + \sum_{j \in \mathcal{N}_k} p_{j,t}\, \mathbf{h}_{j,t-1} \qquad (11)$$

where $p_{j,t}$ are the learned affinities and $\mathcal{N}_k$ contains the connected nodes of the previous column. The term $1 - \sum_{j} p_{j,t}$ is equal to our self-loop weight $w(p,p)$. The graphs for filtering can be constructed as in Fig. 5(b) and 5(c) for the one-way and three-way propagations [27], respectively.

Figure 6: Overview of the network architecture. Synthetic data are used for training, while data from other new domains (e.g. the real KITTI dataset) are used for testing. The backbone of the state-of-the-art GANet [56] is used as the baseline. The proposed domain normalization is used after each convolutional layer in the feature extraction and guidance networks. Several non-local filter layers are employed for both feature extraction and cost aggregation.

The Non-local Neural Networks and Attention mechanisms [48, 50, 15] are implemented without spatial and structural awareness: the similarity between two pixels only considers their feature differences, not their spatial distance. Therefore, they easily smooth out depth edges and thin structures (as illustrated in the supplementary material). Our non-local filter spatially aggregates messages along the paths in the graph, which avoids over-smoothing and better preserves the structure of the disparity maps.

3.3 Network Architecture

As illustrated in Fig. 6, we utilize the backbone of GANet as the baseline architecture. The local guided aggregation layer of [56] is removed since it is domain-dependent and captures many local patterns that are very sensitive to local domain shifts.

We replace the original batch normalization layers with our proposed domain normalization layers. The feature extraction network uses a total of seven of the proposed non-local filtering layers. For 3D aggregation of the cost volume, two further non-local filters are added to filter the cost volume in each channel/disparity slice. All the details of the network architecture are presented in Table 8 in the supplementary material.

4 Experimental Results

In our experiments, we train our method only with synthetic data and test it on four real datasets to evaluate its domain generalization ability. During training, we use disparity regression [17] for disparity prediction and the smooth L1 loss to compute the errors for back-propagation (the same as in [56, 4]). All the models are optimized with Adam. We train with a batch size of 8 on four GPUs using random crops from the input images. The maximum disparity is set to 192. We train the model on the synthetic dataset for 10 epochs with a constant learning rate of 0.001. All other training settings are kept the same as those in [56].
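For reference, disparity regression [17] and the smooth L1 training loss can be sketched as follows (a standard formulation; tensor shapes and variable names are our own):

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost_volume, max_disp=192):
    """Soft-argmin disparity regression [17].
    cost_volume: [B, D, H, W] aggregated matching costs (lower = better match)."""
    prob = F.softmax(-cost_volume, dim=1)                 # per-pixel distribution over disparities
    disp_values = torch.arange(max_disp, dtype=prob.dtype,
                               device=prob.device).view(1, max_disp, 1, 1)
    return (prob * disp_values).sum(dim=1)                # expected disparity, [B, H, W]

def training_loss(pred_disp, gt_disp, max_disp=192):
    """Smooth L1 loss evaluated only on valid ground-truth pixels."""
    mask = (gt_disp > 0) & (gt_disp < max_disp)
    return F.smooth_l1_loss(pred_disp[mask], gt_disp[mask])
```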

4.1 Datasets

The KITTI stereo 2012 [9] and 2015 [31] datasets provide about 400 image pairs of outdoor driving scenes for training, where the disparity labels are transformed from Velodyne LiDAR points. The Cityscapes dataset [7] provides a large number of high-resolution stereo images collected from outdoor city driving scenes; its disparity labels are pre-computed by SGM [13], which is not accurate enough for training deep neural network models. The Middlebury stereo dataset [40] targets indoor scenes at higher resolutions, but it provides no more than 50 image pairs, which are not enough to train robust deep neural networks. In addition, the ETH3D dataset [41] provides 27 pairs of gray-scale images for training.

These existing real datasets are all limited by their small size or poor ground-truth labels, making them insufficient for training deep learning models. Hence, we only use them as test sets for evaluating our models’ cross-domain generalization ability.

We mainly use synthetic data to train our domain-invariant models. The existing SceneFlow synthetic dataset [30] contains 35k training image pairs. However, it contains only a limited number of outdoor driving scenes, with few settings of camera baselines and image resolutions. We therefore use CARLA [8] to generate a new supplementary synthetic dataset (20k stereo pairs) with more diverse settings, including two image resolutions, three different focal lengths, and five different camera baselines (in the range of 0.2–1.5 m). This supplementary dataset significantly improves the diversity of the training set (it will be published with the paper).

Using synthetic data has two advantages: it avoids the difficulties of labeling large amounts of real data, and it eliminates the negative influence of wrong depth values in real datasets.

4.2 Ablation Study

We evaluate the performance of our DSMNet under various settings, including different architectures, normalization strategies and numbers (0–9) of the proposed non-local filter (NLF) layers. As listed in Table 1, the full-setting DSMNet far outperforms the baseline, improving the accuracy by about 3% on the KITTI and 8% on the Middlebury datasets. Our proposed domain normalization improves the accuracy by about 1.5%, and the NLF layers contribute another 1.4% on the KITTI dataset.

Normalize | NLF (feature) | NLF (cost volume) | Backbone | Midd. (3-pixel) | KITTI (2-pixel)
BN | – | – | ours | 30.3 | 9.4
DN | – | – | ours | 27.1 | 7.9
DN | +3 | – | ours | 24.2 | 7.1
DN | +7 | – | ours | 22.9 | 6.8
DN | +9 | – | ours | 22.4 | 6.8
DN | +7 | +2 | ours | 21.8 | 6.5
BN | – | – | PSMNet | 39.5 | 16.3
BN | – | – | GANet | 32.2 | 11.7
DN | +7 | +2 | PSMNet | 26.1 | 8.5
DN | +7 | +2 | GANet | 23.7 | 7.3
Table 1: Ablation study. Models are trained on synthetic data (SceneFlow). Threshold error rates (%) are used for evaluation.
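The threshold error rates reported here and in the later tables can be computed as in the following sketch (a minimal version; the evaluation masks in practice follow each benchmark’s protocol):

```python
import torch

def threshold_error_rate(pred_disp, gt_disp, threshold=3.0):
    """Percentage of valid pixels whose disparity error exceeds `threshold` pixels
    (e.g. the 3-pixel and 2-pixel errors used in Table 1)."""
    valid = gt_disp > 0                                   # pixels with a ground-truth disparity
    err = (pred_disp[valid] - gt_disp[valid]).abs() > threshold
    return 100.0 * err.float().mean().item()

# Usage on dummy maps.
rate = threshold_error_rate(torch.rand(1, 100, 200) * 192, torch.rand(1, 100, 200) * 192)
```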

Moreover, our proposed layers are generic and can be seamlessly integrated into other deep stereo matching models. Here, we replace our backbone model with GANet [56] and PSMNet [4]. Compared with the original PSMNet and GANet, the accuracies are improved by 4–8% on the KITTI dataset and 8–13% on the Middlebury dataset in these cross-domain evaluations.

(a) Input view
(b) HD[53]
(c) PSMNet[4]
(d) Our DSMNet
Figure 7: Comparisons with state-of-the-art models. Models are trained on synthetic data and evaluated on high-resolution real datasets (Middlebury and CityScapes). Our DSMNet can produce much more accurate disparity estimations. (See the supplementary material for more results.)

4.3 Component Analysis and Comparisons

To further validate the superiority of the proposed layers, we compare each of them with other related normalization and non-local strategies.

Models | Middlebury (full) | KITTI
Batch Norm | 29.1 | 7.3
Instance Norm | 27.1 | 6.4
Adaptive Norm [33] | 28.2 | 6.8
Attention [15] | 25.2 | 5.9
Feature Denoising [50] | 25.9 | 6.1
Affinity [27] | 23.1 | 5.2
DSMNet (full setting) | 20.1 | 4.1
Table 2: Comparisons with existing normalization and filtering/attention strategies.
Normalization Strategies.

Table 2 compares our domain normalization with batch normalization [16], instance normalization [47], and the recently proposed adaptive batch-instance normalization [33]. We keep all other settings the same as our DSMNet and only replace the normalization method for training and evaluation. Our domain normalization is superior to others for domain-invariant stereo matching because it can fully regulate the feature vectors’ distribution and remove both image-level and local contrast differences for cross-domain generalization.

Non-local Approaches.

Finally, we compare our graph-based non-local filter with other related strategies, including affinity-based propagation [27], non-local neural network denoising [50], and non-local attention [15] (Table 2). Our graph-based filtering strategy is better at capturing the structural and geometric context for robust domain-invariant stereo matching. The non-local neural network denoising [50] and non-local attention [15] lack spatial constraints, which usually leads to over-smoothing of the depth edges (as shown in the supplementary material). Affinity-based propagations [27] are special cases of our proposed filtering strategy and are not as effective in feature and cost volume aggregation for stereo matching.

Models | KITTI 2012 | KITTI 2015 | Midd. full | Midd. half | Midd. quarter | ETH3D | Carla
CostFilter [14] | 21.7 | 18.9 | 57.2 | 40.5 | 17.6 | 31.1 | 41.1
PatchMatch [2] | 20.1 | 17.2 | 50.2 | 38.6 | 16.1 | 24.1 | 30.1
SGM [13] | 7.1 | 7.6 | 38.1 | 25.2 | 10.7 | 12.9 | 20.2
Training set: SceneFlow
HD [53] | 23.6 | 26.5 | 50.3 | 37.9 | 20.3 | 54.2 | 35.7
GwcNet [12] | 20.2 | 22.7 | 47.1 | 34.2 | 18.1 | 30.1 | 33.2
PSMNet [4] | 15.1 | 16.3 | 39.5 | 25.1 | 14.2 | 23.8 | 25.9
GANet [56] | 10.1 | 11.7 | 32.2 | 20.3 | 11.2 | 14.1 | 18.8
Our DSMNet | 6.2 | 6.5 | 21.8 | 13.8 | 8.1 | 6.2 | 9.8
Training set: SceneFlow + Carla
HD [53] | 19.1 | 19.5 | 47.3 | 35.2 | 19.5 | 45.2 | –
GwcNet [12] | 17.2 | 18.1 | 45.2 | 31.8 | 17.2 | 29.4 | –
PSMNet [4] | 10.3 | 11.0 | 35.5 | 23.7 | 13.8 | 20.3 | –
GANet [56] | 7.2 | 7.6 | 31.9 | 19.7 | 11.4 | 13.5 | –
Our DSMNet | 3.9 | 4.1 | 20.1 | 13.6 | 8.2 | 6.0 | –
Table 3: Evaluations on the KITTI, Middlebury, and ETH3D validation datasets. Threshold error rates (%) are used.

4.4 Cross-Domain Evaluations

In this section, we compare our proposed DSMNet with state-of-the-art stereo matching models by training with synthetic data and evaluating on real test sets.

Comparisons with State-of-the-Art Models.

In Table 3 and Fig. 7, we compare our DSMNet with other state-of-the-art deep neural network models on the four real datasets. All the models are trained on synthetic data (either SceneFlow or a mixture of SceneFlow and Carla). We find that DSMNet outperforms the state-of-the-art models by 3–30% in error rate on all these datasets. It is also far better than traditional stereo matching algorithms such as SGM [13], CostFilter [14] and PatchMatch [2].

Models | Training Set | Error Rates (%)
Our DSMNet | Synthetic | 3.71
MC-CNN-acrt [54] | KITTI (gt) | 3.89
DispNetC [30] | KITTI (gt) | 4.34
Content-CNN [29] | KITTI (gt) | 4.54
MADNet-finetune [45] | KITTI (gt) | 4.66
Weakly Supervised [46] | KITTI (gt) | 4.97
MADNet [45] | KITTI (no gt) | 8.23
OASM-Net [19] | KITTI (no gt) | 8.98
Unsupervised [59] | KITTI (no gt) | 9.91
Table 4: Evaluation on the KITTI 2015 benchmark.
(a) Input view
(b) MC-CNN [54]
(c) PSMNet [4]
(d) HD [53]
(e) GANet-deep [56]
(f) Our DSMNet-synthetic
Figure 8: Comparisons with the fine-tuned state-of-the-art models. Our model is trained only with synthetic data. All others are fine-tuned on the KITTI target scenes. As indicated by the arrows, our DSMNet produces more accurate object boundaries.
Evaluation on the KITTI Benchmark.

Table 4 presents the performance of our DSMNet on the KITTI benchmark [31]. Our model far outperforms most of the unsupervised/self-supervised models trained on the KITTI domain. It is even better than supervised stereo matching networks (including MC-CNN [54], Content-CNN [29], and DispNetC [30]) trained or fine-tuned on the KITTI dataset. When compared with other fine-tuned state-of-the-art models (e.g. PSMNet [4], HD [53], GANet-deep [56]), our DSMNet (without fine-tuning) produces more accurate object boundaries (Fig. 8).

4.5 Fine-tuning

In this section, we show DSMNet’s best performance when fine-tuned on the target domain. We fine-tune the model pre-trained on synthetic data for a further 700 epochs using the KITTI 2015 training set. The learning rate for fine-tuning begins at 0.001 for the first 300 epochs and decreases to 0.0001 for the rest. The results are submitted to the KITTI benchmark for evaluation.

Table 5 compares the results of the fine-tuned DSMNet with those of other state-of-the-art DNN models. We find that DSMNet outperforms most of the recent models (including PSMNet [4], HD [53], GwcNet [12] and GANet-15 [56]) by a noteworthy margin. This implies that DSMNet does not sacrifice within-domain accuracy (after fine-tuning on a specific dataset) in order to improve its cross-domain generalization ability.

We also separately test the effectiveness of our non-local filtering strategy. Using the current best “GANet-deep” [56] (including the local guided aggregation layer) as the baseline, we add five filtering layers for feature extraction. All other settings are kept the same as in the original GANet. After training on synthetic data and fine-tuning on the KITTI training dataset, the model achieves a new state-of-the-art accuracy (1.77% error) on the KITTI 2015 benchmark. This shows that our graph-based filter improves not only cross-domain generalization but also the accuracy on the test domains.

Models | Non-Occluded | All Areas
GANet + our NLF | 1.58 | 1.77
GANet-deep [56] | 1.63 | 1.81
DSMNet-finetune | 1.71 | 1.90
GANet-15 [56] | 1.73 | 1.93
HD [53] | 1.87 | 2.02
GwcNet-g [12] | 1.92 | 2.11
PSMNet [4] | 2.14 | 2.32
GC-Net [17] | 2.61 | 2.87
Table 5: Evaluation on the KITTI 2015 benchmark (fine-tuning).

4.6 Efficiency and Parameters

Our proposed non-local filtering is a linear process that can be realized efficiently. The inference time increases only slightly (by no more than 5%) compared with the baseline. Moreover, no new parameters are introduced by the proposed domain normalization and non-local filtering layers. Detailed comparisons are available in the supplementary material.

5 Conclusion

In this paper, we have proposed two end-to-end trainable neural network layers for our domain-invariant stereo matching network. Our novel domain normalization can fully regulate the distribution of learned features to address significant domain shifts, and our non-local graph-based filter can capture more robust non-local structural and geometric features for accurate disparity estimation in cross-domain situations. We have verified our model on four real datasets and have shown its superior accuracy when compared to other state-of-the-art stereo matching networks in cross-domain generalization.

References

  • [1] Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) Metareg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems (NIPS), pp. 998–1008. Cited by: §2.3.
  • [2] M. Bleyer, C. Rhemann, and C. Rother (2011) PatchMatch stereo-stereo matching with slanted support windows.. In British Machine Vision Conference (BMVC), pp. 1–11. Cited by: §3.2.2, §4.4, Table 3.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3722–3731. Cited by: §1.
  • [4] J. Chang and Y. Chen (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418. Cited by: Table 6, Figure 10, 11(c), Figure 1, §1, §1, §1, §2.1, §2.1, §3.1, 7(c), 8(c), §4.2, §4.4, §4.5, Table 3, Table 5, §4.
  • [5] X. Chen, S. Bing Kang, J. Yang, and J. Yu (2013) Fast patch-based denoising using approximated patch geodesic paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1211–1218. Cited by: §3.2.1.
  • [6] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker (2016) Universal correspondence network. In Advances in Neural Information Processing Systems (NIPS), pp. 2414–2422. Cited by: §2.3.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3213–3223. Cited by: Figure 10, §4.1.
  • [8] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: Appendix E, §4.1.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: §4.1.
  • [10] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi (2015) Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision (ICCV), pp. 2551–2559. Cited by: §2.3.
  • [11] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang (2018) Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 484–500. Cited by: §1, §2.2.
  • [12] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li (2019) Group-wise correlation stereo network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3273–3282. Cited by: §2.1, §4.5, Table 3, Table 5.
  • [13] H. Hirschmuller (2008) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341. Cited by: §1, §2.1, §3.2.3, §4.1, §4.4, Table 3.
  • [14] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz (2013) Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2), pp. 504–511. Cited by: §4.4, Table 3.
  • [15] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 603–612. Cited by: Figure 12, §F.3, §3.2.3, §3.2.3, §4.3, Table 2.
  • [16] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.3.
  • [17] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. CoRR, vol. abs/1703.04309. Cited by: §1, §1, §2.1, §2.1, §3.1, Table 5, §4.
  • [18] S. Khamis, S. R. Fanello, C. Rhemann, A. Kowdle, J. P. C. Valentin, and S. Izadi (2018) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. CoRR abs/1807.08865. Cited by: §2.1.
  • [19] A. Li and Z. Yuan (2018) Occlusion aware stereo matching via cooperative unsupervised learning. In Asian Conference on Computer Vision, pp. 197–213. Cited by: Table 4.
  • [20] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5542–5550. Cited by: §2.3.
  • [21] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018) Learning to generalize: meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.3.
  • [22] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot (2018) Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5400–5409. Cited by: §2.3.
  • [23] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao (2018) Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639. Cited by: §2.3.
  • [24] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117. Cited by: §2.3.
  • [25] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2811–2820. Cited by: §2.1.
  • [26] M. Liu, O. Tuzel, and Y. Taguchi (2013) Joint geodesic upsampling of depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 169–176. Cited by: §3.2.1.
  • [27] S. Liu, S. De Mello, J. Gu, G. Zhong, M. Yang, and J. Kautz (2017) Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1520–1530. Cited by: Figure 5, Figure 5, §3.2.3, §3.2.3, §4.3, Table 2.
  • [28] W. Luo, Y. Li, R. Urtasun, and R. Zemel (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pp. 4898–4906. Cited by: §1.
  • [29] W. Luo, A. G. Schwing, and R. Urtasun (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5695–5703. Cited by: §1, §4.4, Table 4.
  • [30] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4040–4048. Cited by: Appendix E, Figure 10, Figure 1, §1, §2.1, §3.1, §4.1, §4.4, Table 4.
  • [31] M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070. Cited by: Figure 10, Figure 1, §4.1, §4.4.
  • [32] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5715–5725. Cited by: §2.3.
  • [33] H. Nam and H. Kim (2018) Batch-instance normalization for adaptively style-invariant neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 2558–2567. Cited by: §2.3, §3.1, §4.3, Table 2.
  • [34] G. Nie, M. Cheng, Y. Liu, Z. Liang, D. Fan, Y. Liu, and Y. Wang (2019) Multi-level context ultra-aggregation for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3283–3291. Cited by: §2.1.
  • [35] X. Pan, P. Luo, J. Shi, and X. Tang (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 464–479. Cited by: §2.3.
  • [36] J. Pang, W. Sun, J. SJ. Ren, C. Yang, and Q. Yan (2017) Cascade residual learning: a two-stage convolutional neural network for stereo matching. IEEE International Conference on Computer Vision Workshops (ICCVW). Cited by: §2.1.
  • [37] J. Pang, W. Sun, C. Yang, J. Ren, R. Xiao, J. Zeng, and L. Lin (2018) Zoom and learn: generalizing deep stereo matching to novel domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2070–2079. Cited by: §2.2.
  • [38] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2337–2346. Cited by: §3.1.
  • [39] M. Poggi, D. Pallotti, F. Tosi, and S. Mattoccia (2019) Guided stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 979–988. Cited by: §2.2.
  • [40] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling (2014) High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition, pp. 31–42. Cited by: Figure 10, §4.1.
  • [41] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3260–3269. Cited by: §4.1.
  • [42] X. Song, X. Zhao, L. Fang, and H. Hu (2019) EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. arXiv preprint arXiv:1903.01700. Cited by: §2.1, §3.1.
  • [43] A. Tonioni, M. Poggi, S. Mattoccia, and L. Di Stefano (2017-10) Unsupervised adaptation for deep stereo. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
  • [44] A. Tonioni, O. Rahnama, T. Joy, L. D. Stefano, T. Ajanthan, and P. H. Torr (2019) Learning to adapt for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9661–9670. Cited by: §2.2.
  • [45] A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia, and L. D. Stefano (2019) Real-time self-adaptive deep stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 195–204. Cited by: §1, §1, §2.1, §2.2, §3.1, Table 4.
  • [46] S. Tulyakov, A. Ivanov, and F. Fleuret (2017) Weakly supervised learning of deep metrics for stereo reconstruction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1339–1348. Cited by: Table 4.
  • [47] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.3.
  • [48] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803. Cited by: §3.2.3, §3.2.3.
  • [49] Y. Wang, Z. Lai, G. Huang, B. H. Wang, L. Van Der Maaten, M. Campbell, and K. Q. Weinberger (2018) Anytime stereo image depth estimation on mobile devices. arXiv preprint arXiv:1810.11408. Cited by: §2.1.
  • [50] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 501–509. Cited by: Figure 12, §F.3, §3.2.3, §3.2.3, §4.3, Table 2.
  • [51] G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia (2018) SegStereo: exploiting semantic information for disparity estimation. arXiv preprint arXiv:1807.11699. Cited by: §2.1.
  • [52] Q. Yang (2012) A non-local cost aggregation method for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1402–1409. Cited by: §3.2.1.
  • [53] Z. Yin, T. Darrell, and F. Yu (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6044–6053. Cited by: Figure 10, 11(b), Figure 1, §1, §2.1, 7(b), 8(d), §4.4, §4.5, Table 3, Table 5.
  • [54] J. Zbontar and Y. LeCun (2015) Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1592–1599. Cited by: Domain-invariant Stereo Matching Networks, §1, §2.1, 8(b), §4.4, Table 4.
  • [55] F. Zhang, L. Dai, S. Xiang, and X. Zhang (2015) Segment graph based image filtering: fast structure-preserving smoothing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 361–369. Cited by: §3.2.1.
  • [56] F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr (2019) GA-net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 185–194. Cited by: Table 6, Figure 10, Figure 1, Figure 1, §1, §1, §1, §1, §2.1, §2.1, Figure 5, Figure 5, Figure 6, §3.1, §3.2.3, §3.2.3, §3.3, 8(e), §4.2, §4.4, §4.5, §4.5, Table 3, Table 5, §4.
  • [57] F. Zhang and B. W. Wah (2018) Fundamental principles on learning new features for effective dense matching. IEEE Transactions on Image Processing 27 (2), pp. 822–836. Cited by: §2.1.
  • [58] Y. Zhong, Y. Dai, and H. Li (2017) Self-supervised learning for stereo matching with self-improving ability. arXiv preprint arXiv:1709.00930. Cited by: §2.2.
  • [59] C. Zhou, H. Zhang, X. Shen, and J. Jia (2017) Unsupervised learning of stereo matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1567–1575. Cited by: §1, §2.2, Table 4.

Supplementary Material

Appendix A Proof of Footnote 1

Following all the variable definitions in the paper, we prove here that, if the normalization constraint of Eq. (7) holds, then for every node $p$:

$$\sum_{q \in \mathcal{V}} w_{q \to p} = 1 \qquad (12)$$

Since any path that reaches node $p$ must pass through one of its neighbors, we can expand the total weight as

$$\sum_{q \in \mathcal{V}} w_{q \to p} = w(p,p) + \sum_{t \in \mathcal{N}_p \setminus \{p\}} w(t,p) \sum_{q \in \mathcal{V}} w_{q \to t}.$$

Following the propagation order of the directed graph (Fig. 4), we can prove Eq. (12) by mathematical induction:

For the first node $p$ in the propagation order, there are no incoming edges, so $\mathcal{N}_p = \{p\}$ and $\sum_q w_{q \to p} = w(p,p) = 1$ by Eq. (7).

Assume that $\sum_q w_{q \to t} = 1$ holds for every node $t$ that precedes $p$ in the propagation order.

Then, for node $p$:

$$\sum_{q \in \mathcal{V}} w_{q \to p} = w(p,p) + \sum_{t \in \mathcal{N}_p \setminus \{p\}} w(t,p) \cdot 1 = \sum_{t \in \mathcal{N}_p} w(t,p) = 1.$$

The last equality follows from Eq. (7), since all incoming neighbors of $p$ precede it in the propagation order.

This completes the proof of Eq. (12).

Appendix B Backpropagation

The backpropagation for $\mathbf{x}_p$ and the weights $w(q,p)$ in Eq. (9) can be computed inversely. Assume the gradient arriving from the next layer at node $p$ is $\mathbf{e}_p$. The backpropagation can be implemented as:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_p} = w(p,p)\, \mathbf{g}_p, \qquad \frac{\partial \mathcal{L}}{\partial w(p,p)} = \mathbf{g}_p^{\top} \mathbf{x}_p, \qquad \frac{\partial \mathcal{L}}{\partial w(q,p)} = \mathbf{g}_p^{\top} \mathbf{y}_q \qquad (13)$$

where $\mathbf{g}_p$ is a temporary gradient variable that can be calculated iteratively (similar to Eq. (9)):

$$\mathbf{g}_p = \mathbf{e}_p + \sum_{q:\, p \in \mathcal{N}_q \setminus \{q\}} w(p,q)\, \mathbf{g}_q \qquad (14)$$

The propagation of Eq. (14) is an inverse process that follows the reverse of the forward aggregation order (e.g. from right to left, then bottom to top for $G_1$).
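As a quick sanity check of these gradients (our own test; it relies on autograd rather than the hand-derived backward pass), torch.autograd.gradcheck can be applied to a 1-D instance of the forward recurrence of Eq. (9):

```python
import torch
from torch.autograd import gradcheck

def linear_chain(x, w_self, w_prev):
    # 1-D instance of Eq. (9): y[0] = x[0]; y[t] = w_self[t] * x[t] + w_prev[t] * y[t-1]
    ys = [x[0]]
    for t in range(1, x.shape[0]):
        ys.append(w_self[t] * x[t] + w_prev[t] * ys[-1])
    return torch.stack(ys)

x = torch.randn(6, dtype=torch.double, requires_grad=True)
w_self = torch.rand(6, dtype=torch.double, requires_grad=True)
w_prev = (1.0 - w_self).detach().requires_grad_(True)   # Eq. (7): the two incoming weights sum to 1
print(gradcheck(linear_chain, (x, w_self, w_prev)))      # True: autograd matches numerical gradients
```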

Appendix C Details of the Architecture

Table 8 presents the details of the parameters of the DSMNet. It has seven non-local filtering layers which are used in feature extraction and cost aggregation. The proposed Domain Normalization layer is used to replace Batch Normalization after each 2D convolutional layer in the feature extraction and guidance networks.

Appendix D Efficiency and Parameters

As shown in Table 6, our proposed non-local filtering is a linear process that can be realized efficiently. The inference time increases by about 5% compared with the baseline. Moreover, no new parameters are introduced by the proposed domain normalization and non-local filtering layers.

Methods | Elapsed Time | Number of Parameters
GANet-deep [56] | 1.8 s | 60M
Baseline | 1.4 s | 48M
Our DSMNet | 1.5 s | 48M
PSMNet [4] | 0.4 s | 52M
DSMNet (PSMNet) | 0.42 s | 52M
Table 6: Efficiency (elapsed time) and number of parameters.

Appendix E Carla Dataset

Since the synthetic SceneFlow dataset [30] only has a limited number (about 7,000) of stereo pairs of driving scenes, we use the CARLA [8] platform to produce stereo pairs of outdoor driving scenes. As shown in Table 7, the new Carla supplementary dataset has more diverse settings, including two image resolutions, three different focal lengths, and six different camera baselines (in the range of 0.2–1.5 m). This supplementary dataset significantly improves the diversity of the training set. As shown in Fig. 9, the Carla scenes still have significant domain differences (e.g. color, textures) compared with the real scenes (e.g. KITTI, CityScapes), but our DSMNet can extract shape and structure information for robust stereo matching. These features transfer better to the real scenes and produce more accurate disparity estimations.

Dataset | Number of pairs | Focal lengths | Baseline settings (m)
SceneFlow | 34,000 | 450, 1050 | 0.54
Carla Stereo | 20,000 | 640, 670, 720 | 0.2, 0.3, 0.5, 1.0, 1.2, 1.5
Table 7: Statistics of the Carla stereo dataset.
No. | Layer Description | Output Tensor
Feature Extraction
input | normalized image pair as input | HW3
1 | 3×3 conv, DN, ReLU | HW32
2 | 3×3 conv, stride 3, DN, ReLU | HW32
3 | 3×3 conv, DN, ReLU | HW32
4 | NLF, DN, ReLU | HW32
5 | 3×3 conv, stride 2, DN, ReLU | HW48
6 | NLF, DN, ReLU | HW48
7 | 3×3 conv, DN, ReLU | HW48
8-9 | repeat 5, 7 | HW64
10-11 | repeat 8-9 | HW96
12-13 | repeat 8-9 | HW128
14 | 3×3 deconv, stride 2, DN, ReLU | HW96
15 | 3×3 conv, DN, ReLU | HW96
16-17 | repeat 14-15 | HW64
18-19 | repeat 14-15 | HW48
20 | NLF, DN, ReLU | HW48
21-22 | repeat 14-15 | HW32
23-41 | repeat 4-22 | HW32
42 | NLF, DN, ReLU | HW32
concatenation | (11,14) (9,16) (7,18) (4,21) (20,24) (17,27) (15,29) (13,31) (18,25) (30,33) (28,35) (26,37) (23,40) |
cost volume | by feature concatenation | HW6432
Guidance Branch
input | concatenate 1 and up-sampled 35 as input | HW64
(1) | 3×3 conv, DN, ReLU | HW16
(2) | 3×3 conv, stride 3, DN, ReLU | HW32
(3) | 3×3 conv, DN, ReLU | HW32
(4) | 3×3 conv (no BN & ReLU) | HW20
(5) | split, reshape, normalize | HW5
(6)-(8) | from (3), repeat (3)-(5) | HW5
(9)-(11) | from (6), repeat (6)-(8) | HW5
(12) | from (2), 3×3 conv, stride 2, DN, ReLU | HW32
(13) | 3×3 conv, DN, ReLU | HW32
(14) | 3×3 conv (no BN & ReLU) | HW20
(15) | split, reshape, normalize | HW5
(16)-(18) | from (13), repeat (13)-(15) | HW5
(19)-(21) | from (16), repeat (13)-(15) | HW5
(22)-(24) | from (19), repeat (13)-(15) | HW5
Cost Aggregation
input | 4D cost volume | HW6464
| 3×3×3 3D conv | HW6432
| SGA: weight matrices from (5) | HW6432
| NLF | HW6432
| 3×3×3 3D conv | HW6432
output | 3×3×3 3D-to-2D conv, upsampling | HW193
| softmax, regression, loss weight: 0.2 | HW1
| 3×3×3 3D conv, stride 2 | HW3248
| 3×3×3 3D conv | HW3248
| SGA: weight matrices from (15) | HW3248
| 3×3×3 3D conv, stride 2 | HW1664
| 3×3×3 3D deconv, stride 2 | HW3248
| 3×3×3 3D conv | HW3248
| SGA: weight matrices from (18) | HW3248
| 3×3×3 3D deconv, stride 2 | HW6432
| 3×3×3 3D conv | HW6432
| SGA: weight matrices from (8) | HW6432
| NLF | HW6432
output | 3×3×3 3D-to-2D conv, upsampling | HW193
| softmax, regression, loss weight: 0.6 | HW1
| repeat | HW6432
final output | 3×3×3 3D-to-2D conv, upsampling | HW193
| regression, loss weight: 1.0 | HW1
connection | concatenate: (4,12), (7,9), (8,19), (11,16), (15,23), (18,20); add: (1,4) |
Table 8: Parameters of the network architecture of “DSMNet”
(a) left view
(b) right view
(c) disparity map
Figure 9: Example of the Carla stereo data.

Appendix F More Results

F.1 Feature Visualization

As compared in Fig. 10, the features of the state-of-the-art models are mainly local patterns, which produce many artifacts (e.g. noise) under domain shifts. Our DSMNet mainly captures the non-local structure and shape information, which is robust for cross-domain generalization. There are no artifacts in the feature maps of our DSMNet.

F.2 Disparity Results on Different Datasets

More results and comparisons are provided in Fig. 11. All the models are trained on the synthetic dataset and tested on the real KITTI, Middlebury, ETH3D and Cityscapes datasets.

(a) Input view
(b) GANet-synthetic
(c) GANet-finetune
(d) HD-synthetic
(e) PSMNet-synthetic
(f) DSMNet-synthetic
Figure 10: Comparison and visualization of the feature maps for cross-domain tests. (b) GANet [56], (d) HD [53] and (e) PSMNet [4] are trained on the synthetic dataset (SceneFlow [30]) and tested on other real scenes/datasets (from top to bottom: KITTI [31], Middlebury [40] and CityScapes [7]). Their features are mainly local patterns and contain many artifacts (e.g. noise) under domain shifts. (c) GANet fine-tuned on the test dataset for comparison; the artifacts are suppressed after fine-tuning. (f) Our DSMNet trained on the synthetic data: no distortions or artifacts appear in the feature maps. It mainly captures the non-local structure and shape information, which is more robust for cross-domain generalization.
(a) Input view
(b) HD[53]
(c) PSMNet[4]
(d) Our DSMNet
Figure 11: Comparisons with the state-of-the-art models on four real datasets (from top to bottom: KITTI, Middlebury, ETH3D and Cityscapes). All the models are trained on the synthetic dataset. Our DSMNet produces accurate disparity estimations on these new datasets without fine-tuning.

F.3 Comparisons with Other Non-local Strategies

Our graph-based filtering strategy is better at capturing the structural and geometric context for robust domain-invariant stereo matching. The non-local neural network denoising [50] and non-local attention [15] lack spatial constraints, which usually leads to over-smoothing of the depth edges and thin structures (as shown in Fig. 12).

Figure 12: Comparisons with non-local attention mechanism [15] (second row) and non-local denoising [50] strategy (third row). When using these strategies, the thin structures (e.g. poles) are easily eroded by the background. These non-local strategies easily smooth out the disparity maps. As a comparison, our DSMNet (last row) can keep the thin structures of the disparity maps.