Domain-invariant Stereo Matching Networks
State-of-the-art stereo matching networks have difficulty generalizing to new unseen environments due to significant domain differences, such as color, illumination, contrast, and texture. In this paper, we aim at designing a domain-invariant stereo matching network (DSMNet) that generalizes well to unseen scenes. To achieve this goal, we propose i) a novel "domain normalization" approach that regularizes the distribution of learned representations to make them invariant to domain differences, and ii) a trainable non-local graph-based filter for extracting robust structural and geometric representations that further enhance domain-invariant generalization. When trained on synthetic data and generalized to real test sets, our model performs significantly better than all state-of-the-art models. It even outperforms some deep learning models (e.g. MC-CNN) fine-tuned with test-domain data. The code and dataset will be available at https://github.com/feihuzhang/DSMNet.
1 Introduction
Stereo reconstruction is a fundamental problem in computer vision, robotics and autonomous driving. It aims to estimate 3D geometry by computing disparities between matching pixels in a stereo image pair. Recently, many end-to-end deep neural network models (e.g. [56, 4, 17]) have been developed for stereo matching that achieve impressive accuracy on several datasets or benchmarks.
However, state-of-the-art stereo matching networks (supervised [56, 4, 17] and unsupervised [59, 45]) cannot generalize well to unseen data without fine-tuning or adaptation. Their difficulties lie in the large domain differences (such as color, illumination, contrast and texture) between stereo images in various datasets. As illustrated in Fig. 1, the pre-trained models on one specific dataset produce poor results on other real and unseen scenes.
Domain adaptation and transfer learning methods (e.g. [45, 11, 3]) attempt to transfer or adapt from one source domain to another new domain. Typically, a large number of stereo images from the new domain are required for the adaptation. However, these cannot be easily obtained in many real scenarios. And, in this case, we still need a good method for disparity estimation even without data from the new domain for adaptation.
Thus, it is desirable to design a model that can generalize well to unseen data without re-training or adaptation. The difficulties for developing such a domain invariant stereo matching network (DSMNet) come from the significant domain differences between stereo images in various scenes/datasets (e.g. Fig. 1(a) and 1(b)). Such differences make the learned features unstable, distorted and noisy, leading to many wrong matching results.
Fig. 1 visualizes the features learned by some state-of-the-art stereo matching models [56, 4, 53]. Due to the limited effective receptive field of convolutional neural networks, they capture domain-sensitive local patterns (e.g. local contrast, edges and texture) when constructing matching features, which break down and produce many artifacts (e.g. noise) in the feature maps when applied to novel test data (Fig. 1(c)). The artifacts and distortions in the features inhibit robust matching, leading to wrong matching results (Fig. 1(e)).
In this paper, we propose two novel neural network layers for constructing the robust deep stereo matching network for cross-domain generalization without further fine-tuning or adaptation. Firstly, to reduce the domain shifts/differences between different datasets/scenes, we propose a novel domain normalization layer that fully regulates the feature’s distribution in both the spatial (height and width) and the channel dimensions. Secondly, to eliminate the artifacts and distortions in the features, we propose a learnable non-local graph-based filtering layer that can capture more robust structural and geometric representations (e.g. shape and structure, as illustrated in Fig. 1(d)) for domain-invariant stereo matching.
We formulate our method as an end-to-end deep neural network model and train it only with synthetic data. In our experiments, without any fine-tuning or adaptation on the real test datasets, our DSMNet far outperforms: 1) almost all state-of-the-art stereo matching models (e.g. GANet) trained on the same synthetic dataset, 2) most traditional methods (e.g. cost-volume filtering, SGM, etc.), and 3) most unsupervised/self-supervised models trained on the target test domains. Our model even surpasses some supervised deep neural network models (e.g. MC-CNN, Content-CNN, DispNetC, etc.) that were fine-tuned on the target domains.
2 Related Work
2.1 Deep Neural Networks for Stereo Matching
In recent years, deep neural networks have seen great success in the task of stereo matching [17, 4, 56]. These models can be categorized into three types: 1) learning better features for traditional stereo matching algorithms, 2) correlation-based end-to-end deep neural networks, 3) cost-volume based stereo matching networks.
In the first category, deep neural networks have been used to compute patch-wise similarity scores as the matching costs [57, 54]. The costs are then fed into the traditional cost aggregation and disparity computation/refinement methods  to get the final disparity maps. The models are, however, limited by the traditional matching cost aggregation step and often produce wrong predictions in occluded regions, large textureless/reflective regions and around object edges.
DispNetC, a typical method in the second category, computes the correlations by warping between stereo views and attempts to predict the per-pixel disparity by minimizing a regression training loss. Many other state-of-the-art methods, including iResNet, CRL, SegStereo, EdgeStereo, HD³, and MADNet, are all based on color or feature correlations between the left and right views for disparity estimation.
The recently developed cost-volume based models explicitly learn feature extraction, cost volume, and regularization function all end to end. Examples include GC-Net, PSM-Net , StereoNet , AnyNet , GANet  and EMCUA . They all utilize a similarity cost as the third dimension to build the 4D cost volume in which the real geometric context is maintained.
There are also methods that combine the correlation and cost-volume strategies.
The common feature of these models is that they all require a large number of training samples with ground truth depth/disparity. More importantly, a model trained on one specific domain cannot generalize well to new scenes without fine-tuning or retraining.
2.2 Adaptation and Self-supervised Learning
A recent trend of training stereo matching networks in an unsupervised manner relies on image reconstruction losses obtained by warping between the left and right views [59, 58]. However, these methods cannot handle occluded and reflective regions, where there is no correspondence between the left and right views. Also, they cannot generalize well to other new domains.
Some methods pre-train the models on synthetic data and then explore cross-domain knowledge to adapt [11, 37] to a new domain. Others focus on online or offline adaptation [44, 45, 43, 39]. For example, MADNet is proposed to adapt the pre-trained model online and in real time, but its accuracy remains poor even after adaptation. Moreover, domain adaptation approaches require a large number of stereo images from the target domain, which cannot be easily obtained in many real scenarios.
2.3 Cross-Domain Generalization
Unlike domain adaptation, domain generalization is a much harder problem that assumes no access to target-domain information for adaptation or fine-tuning. There are many approaches that explore the idea of domain-invariant feature learning. Previous approaches focus on developing data-driven strategies to learn invariant features from different source domains [32, 10, 20]. Some recent methods utilize meta-learning that takes variations in multiple source domains to generalize to novel test distributions [1, 21]. Other approaches [23, 22] employ an invariant adversarial network to learn domain-invariant representations/features for image recognition. Choy et al. develop a universal feature learning framework for visual correspondences using deep metric learning.
In contrast to the above approaches, there are methods that try to improve the batch or instance normalization in order to improve the generalization and robustness for style transfer or image recognition [33, 24, 35].
In summary, for stereo matching, little work has been done on improving the generalization ability of end-to-end deep neural network models, and in particular on developing domain-invariant stereo matching networks.
3 Proposed DSMNet
To overcome the challenges in cross-domain generalization, we develop in the following sections our domain-invariant stereo matching networks. These include domain normalization to remove the influence of domain shifts (e.g. color, style, illumination), as well as non-local graph-based filtering and aggregation to capture the non-local structural and geometric context as robust features for domain-invariant stereo reconstruction.
3.1 Domain Normalization
Batch normalization (BN) has become the default feature normalization operation for constructing end-to-end deep stereo matching networks [17, 4, 56, 42, 45, 30]. Although it can reduce the internal covariate shift effects in training deep networks, it is domain-dependent and has negative influence on the model’s cross-domain generalization ability.
BN normalizes the features as follows:

$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i} \qquad (1)$$

Here $x$ and $\hat{x}$ are the input and output features, respectively, and $i$ indexes elements in a tensor (i.e. feature maps, as illustrated in Fig. 2) of size $N \times C \times H \times W$ ($N$: batch size, $C$: channels, $H$: spatial height, $W$: spatial width). $\mu_i$ and $\sigma_i$ are the corresponding channel-wise mean and standard deviation (std) and are computed by:

$$\mu_i = \frac{1}{|S_i|} \sum_{j \in S_i} x_j, \qquad \sigma_i = \sqrt{\frac{1}{|S_i|} \sum_{j \in S_i} (x_j - \mu_i)^2 + \epsilon} \qquad (2)$$

where $S_i$ is the set of elements in the same channel as element $i$ (Fig. 2), and $\epsilon$ is a small constant to avoid division by zero.
The mean $\mu$ and standard deviation $\sigma$ are computed per batch in the training phase, and the accumulated values of the training set are utilized for inference. However, different domains may have different $\mu$ and $\sigma$ caused by color shifts, contrast, and illumination (Fig. 1(a) and 1(b)). Thus the $\mu$ and $\sigma$ computed for one dataset are not transferable to others.
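As a toy illustration of this dependency (hypothetical synthetic data, not from the paper), the following numpy sketch accumulates BN statistics on one "domain" and applies them to a color/contrast-shifted one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical "domains": same content, shifted contrast and brightness.
# Tensors are N x C x H x W, as in Eq. (1).
domain_a = rng.normal(loc=0.0, scale=1.0, size=(8, 3, 16, 16))
domain_b = 0.5 * domain_a + 2.0  # contrast and brightness shift

# BN statistics accumulated on domain A (per channel, over N, H, W).
mu_a = domain_a.mean(axis=(0, 2, 3), keepdims=True)
sigma_a = domain_a.std(axis=(0, 2, 3), keepdims=True) + 1e-5

# Applying domain A's statistics to domain B leaves it badly normalized:
b_normed = (domain_b - mu_a) / sigma_a
print(b_normed.mean(), b_normed.std())  # far from (0, 1)
```

The normalized features of domain B retain a large mean offset and a shrunken scale, which is exactly the non-transferability described above.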
Instance normalization (IN) [33, 38] overcomes the dependency on dataset statistics by normalizing each sample separately, with the elements in $S_i$ confined to come from the same sample, as illustrated in Fig. 2. In theory, IN is domain-invariant, and normalization across the spatial dimensions ($H$, $W$) reduces image-level appearance/style variations.
However, matching of stereo views is realized at the pixel level by finding an accurate correspondence for each pixel using its $C$-channel feature vector. Any inconsistency in the feature norm and scaling will significantly influence the matching cost and similarity measurements. Fig. 3 illustrates that IN cannot regulate the norm distribution of the pixel-wise feature vectors, which varies across datasets/domains.
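To make the norm-sensitivity argument concrete, here is a small hypothetical numpy example (toy feature vectors, not from the paper) in which a domain-dependent gain on the features flips the L2 matching-cost ranking, while normalizing each vector along the channel axis restores it:

```python
import numpy as np

# Hypothetical pixel feature vectors: a reference pixel, its true match,
# and a distractor. A domain-dependent gain flips the L2 cost ranking.
f_ref = np.array([1.0, 0.0, 1.0])
f_match = np.array([0.9, 0.1, 1.1])   # correct correspondence
f_wrong = np.array([0.5, 0.3, 0.2])   # distractor

cost = lambda a, b: np.sum((a - b) ** 2)
assert cost(f_ref, f_match) < cost(f_ref, f_wrong)  # correct ranking

gain = 2.5  # domain shift, e.g. higher contrast scaling feature responses
assert cost(f_ref, gain * f_match) > cost(f_ref, gain * f_wrong)  # flipped

# Normalizing each feature vector along the channel axis removes the gain:
unit = lambda v: v / np.linalg.norm(v)
assert cost(unit(f_ref), unit(gain * f_match)) < cost(unit(f_ref), unit(gain * f_wrong))
```

This is only a toy motivation for the channel-wise normalization introduced below, not the paper's actual feature space.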
Fig. 2 illustrates our proposed domain-invariant normalization (DN). Our method normalizes features along the spatial axes ($H$, $W$) to induce style-invariant representations, similar to IN, as well as along the channel dimension ($C$) to enhance local invariance.
Our DN is realized as follows:

$$y_i = \frac{\hat{x}_i}{\sqrt{\sum_{j \in S_i^C} \hat{x}_j^2} + \epsilon} \qquad (3)$$

where $S_i^C$ (green region in Fig. 2) includes the elements from the same example ($N$ axis) and the same spatial location ($H$, $W$ axes), and $\hat{x}$ is computed as in Eq. (1) and (2) with the elements in $S_i$ taken from the same channel and sample (blue region in Fig. 2). In DN, besides the normalization across the spatial dimensions, we also employ an $L_2$ normalization along the channel axis. The two collaborate to address the sensitivity to domain shifts and to suppress noise and extreme values in the feature vectors. As illustrated in Fig. 3, this helps regulate the norm distribution of the features in different datasets and improves the robustness to local domain shifts (e.g. texture patterns, noise, contrast).
Finally, trainable per-channel scale $\gamma$ and shift $\beta$ parameters are added to enhance the discriminative representation ability, as in BN and IN. The final formulation is:

$$y_i' = \gamma_c\, y_i + \beta_c \qquad (4)$$

where $c$ is the channel index of element $i$.
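A minimal numpy sketch of the DN layer described above (a hypothetical implementation for illustration; the paper's actual layer may differ in details such as the placement of the epsilon terms):

```python
import numpy as np

def domain_norm(x, gamma, beta, eps=1e-5):
    """Sketch of Domain Normalization (DN) on an N x C x H x W tensor:
    instance-style normalization over the spatial axes (H, W) per sample and
    channel, followed by L2 normalization of each pixel's C-channel feature
    vector, then trainable per-channel scale/shift."""
    # Spatial normalization (per sample, per channel), as in IN.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)
    # Channel-wise L2 normalization of every pixel's feature vector.
    norm = np.sqrt((x_hat ** 2).sum(axis=1, keepdims=True)) + eps
    y = x_hat / norm
    return gamma.reshape(1, -1, 1, 1) * y + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8, 8))
y = domain_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# With unit gamma and zero beta, every pixel's feature vector has unit norm:
print(np.abs(np.linalg.norm(y, axis=1) - 1.0).max())
```

After DN, pixel-wise feature vectors lie on (approximately) the unit sphere regardless of the input's brightness or contrast, which is the property Fig. 3 illustrates.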
3.2 Non-local Aggregation
We propose a graph-based filter that robustly exploits non-local contextual information and reduces the dependence on local patterns (see Fig. 1(c)) for domain-invariant stereo matching.
3.2.1 Non-local Graph-based Filter
Our inspiration comes from traditional graph-based filters, which are remarkably effective in employing non-local structural information for structure-preserving texture/detail removal and smoothing, denoising [55, 5], as well as depth-aware estimation and enhancement [26, 52].
For a 2D image/feature map, we construct an 8-connected graph $G = (V, E)$ by connecting each pixel (node) to its eight neighbors (see Fig. 4). To avoid loops and achieve fast non-local information aggregation over the graph, we split it into two reverse directed graphs $G_1$ and $G_2$ (see Fig. 4(b) and 4(c)).
We assign a weight $w_{u v}$ to each directed edge $(u, v) \in E$, and a feature (or color) vector $x_v$ to each node $v \in V$. We also allow each node $v$ to propagate information to itself with weight $w_{v v}$. For graph $G_r$ ($r \in \{1, 2\}$), our non-local filter is defined as follows:

$$y_v = \sum_{u \in V} w(u, v)\, x_u, \qquad w(u, v) = \sum_{P \in \mathcal{P}_{u \to v}} \prod_{(i, j) \in P} w_{i j} \qquad (8)$$

Here, $P$ is a feasible path from $u$ to $v$, and $\mathcal{P}_{u \to v}$ is the set of all such paths. Note that the self-loop $(u, u)$ is included in each path, so the self-weight $w_{u u}$ counts for the start node $u$. Unlike traditional geodesic filters, we consider all valid paths from the source node $u$ to the target node $v$. The propagation weight along a path $P$ is the product of all edge weights along the path, and the aggregation weight $w(u, v)$ is the sum of the weights of all feasible paths from $u$ to $v$, which determines how much information is diffused from $u$ to $v$.
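To make the path-sum definition concrete, a hypothetical 3-node chain (toy weights, chosen only for illustration) shows that summing products of edge weights over all feasible paths reproduces the sequential recurrence of Eq. (9):

```python
import numpy as np

# Toy directed chain 0 -> 1 -> 2 with self-weights and edge weights.
w_self = np.array([0.6, 0.5, 0.4])
w_edge = np.array([0.4, 0.5])      # w_edge[0] = w_{0->1}, w_edge[1] = w_{1->2}
x = np.array([1.0, 2.0, 3.0])

# Sequential recurrence (Eq. (9)):
y0 = w_self[0] * x[0]
y1 = w_self[1] * x[1] + w_edge[0] * y0
y2 = w_self[2] * x[2] + w_edge[1] * y1

# Path-sum form (Eq. (8)): the contribution of x[0] to y2 follows the single
# feasible path 0 -> 1 -> 2, starting with the self-weight of node 0.
w_02 = w_self[0] * w_edge[0] * w_edge[1]
w_12 = w_self[1] * w_edge[1]
w_22 = w_self[2]
y2_paths = w_02 * x[0] + w_12 * x[1] + w_22 * x[2]
```

The two formulations agree exactly, which is why the filter admits the linear-time implementation described in Sec. 3.2.2.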
For the edge weight $w_{u v}$, we define it in a self-regularized manner from the features themselves, e.g. as the cosine similarity

$$w_{u v} = \frac{x_u^{\top} x_v}{\|x_u\|\, \|x_v\|}$$

where $x_u$ and $x_v$ represent the feature vectors of nodes $u$ and $v$, respectively. This definition does not introduce new parameters and is thus more robust for cross-domain generalization.
Compared to local filters, such as the Gaussian, median, and mean filters, which can only propagate information within a local region determined by the filter kernel size, our proposed non-local filter allows the propagation of long-range information, with weights accumulated spatially along all feasible paths in the graph.
For stable training and to avoid extreme values, we further add a normalization constraint to the weights associated with each node $v$ in the graph:

$$\sum_{u \in \mathcal{N}(v)} |w_{u v}| = 1$$

Here, $\mathcal{N}(v)$ is the set of the connected neighbors of $v$ (including $v$ itself), and $w_{u v}$ is the weight of the directed edge connecting $u$ to $v$. For example, in Fig. 4(b), an interior node aggregates from itself and all its incoming neighbors, while a node on the graph boundary has a smaller neighbor set.
Such a transformation not only increases the robustness in training but also reduces the computational costs.
3.2.2 Linear Implementation
Eq. (8) can be realized as an iterative linear aggregation, in which the node representations are sequentially updated following the direction of the graph (e.g. from top to bottom, then left to right in $G_1$). In each step, node $v$ is updated as:

$$y_v = w_{v v}\, x_v + \sum_{u \in \mathcal{N}(v) \setminus \{v\}} w_{u v}\, y_u \qquad (9)$$
Finally, we repeat the aggregation for both $G_1$ and $G_2$, where the updated representation from $G_1$ is used as the input for the aggregation with $G_2$ (similar to PatchMatch stereo). The aggregation of Eq. (9) is a linear process with time complexity $O(n)$ (for a graph with $n$ nodes). During training, backpropagation can be realized by reversing the propagation equation, which is also a linear process (details in the supplementary material).
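The scanline special case of this recurrence can be sketched as follows (a hypothetical numpy illustration with a single incoming edge per node; the cosine edge weight standing in for the self-regularized definition, and the specific normalization split, are assumptions of this sketch):

```python
import numpy as np

def aggregate_scanline(x):
    """1D sketch of the linear aggregation of Eq. (9): each node has a
    self-weight and one incoming edge weight, normalized so that their
    absolute values sum to 1 (the constraint above). Messages are propagated
    left-to-right (G1), then right-to-left on the reverse graph (G2)."""
    def one_pass(z):
        y = z.copy()
        for v in range(1, len(z)):
            # Parameter-free edge weight from the features themselves.
            cos = z[v] @ z[v - 1] / (np.linalg.norm(z[v]) * np.linalg.norm(z[v - 1]) + 1e-8)
            w_raw = abs(cos)
            w_edge = w_raw / (1.0 + w_raw)   # normalized: w_self + w_edge = 1
            w_self = 1.0 / (1.0 + w_raw)
            y[v] = w_self * z[v] + w_edge * y[v - 1]  # Eq. (9) recurrence
        return y
    y1 = one_pass(x)                 # forward graph G1 (left to right)
    return one_pass(y1[::-1])[::-1]  # reverse graph G2, seeded with G1's output

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))     # 16 nodes on a scanline, 8-dim features
out = aggregate_scanline(feats)
```

Because the normalized weights form a convex combination at every step, the aggregated features stay bounded by the input range, which is the stability property the normalization constraint is meant to provide.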
3.2.3 Relations to Existing Approaches
We show that the recently proposed semi-global aggregation (SGA) layer and affinity-based propagation approach  are special cases of our graph-based non-local filter (Eq. (8)). In addition, we compare it with non-local neural networks [48, 50] and the attention mechanism .
In SGA, the aggregations are done in four directions, namely left-to-right, right-to-left, top-to-bottom, and bottom-to-top. Taking the right-to-left propagation as an example, we can construct the propagation graph in Fig. 5(a). The vertical coordinate represents the disparity $d$, and the horizontal coordinate represents the indexes of the pixels/nodes. Compared to our non-local graph in Fig. 4(b), edges connecting top and bottom nodes are removed, and the maximum of each column is densely connected to every node of the next column (red edges). The SGA layer can then be realized by our proposed non-local filter in Eq. (8), where $\mathcal{N}(v)$ contains the neighborhood nodes of $v$ in this graph and $w_{u v}$ are the corresponding edge weights.
The affinity-based spatial propagation approach can be achieved as a special case of Eq. (9):

$$y_v = \Big(1 - \sum_{u \in \mathcal{N}(v) \setminus \{v\}} w_{u v}\Big) x_v + \sum_{u \in \mathcal{N}(v) \setminus \{v\}} w_{u v}\, y_u$$

i.e. Eq. (9) with the self-weight set to $w_{v v} = 1 - \sum_{u \neq v} w_{u v}$, and with each node connected to only three neighbors in the previous row or column (a three-way connection).
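As a toy check (hypothetical numpy sketch with a single predecessor per node and made-up affinities) that this update is indeed an instance of Eq. (9):

```python
import numpy as np

def affinity_propagation_1d(x, w):
    """1D affinity-based spatial propagation: each node blends its own input
    with the propagated value of its single predecessor, using affinity w and
    the derived self-weight 1 - w (the special case of Eq. (9) above)."""
    y = x.copy()
    for v in range(1, len(x)):
        y[v] = (1.0 - w[v]) * x[v] + w[v] * y[v - 1]
    return y

x = np.array([0.0, 0.0, 10.0, 0.0, 0.0])   # an impulse on a scanline
w = np.full(5, 0.5)                         # uniform affinity
out = affinity_propagation_1d(x, w)
print(out)  # the impulse is diffused only in the propagation direction
```
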
The non-local neural networks and attention mechanisms [48, 50, 15] are implemented without spatial or structural awareness: the similarity between two pixels considers only their feature difference, not their spatial distance. Therefore, they easily smooth out depth edges and thin structures (as illustrated in the supplementary material). Our non-local filter spatially aggregates messages along paths in the graph, which avoids over-smoothing and better preserves the structure of the disparity maps.
3.3 Network Architecture
As illustrated in Fig. 6, we utilize the backbone of GANet as the baseline architecture. The local guided aggregation layer of GANet is removed, since it is domain-dependent and captures many local patterns that are very sensitive to local domain shifts.
We replace the original batch normalization layers with our proposed domain normalization layers for feature extraction. The feature extraction network uses a total of seven of the proposed non-local filtering layers. For 3D aggregation, two further non-local filters are added to filter the cost volume in each channel/depth. All the details of the network architecture are presented in Table I of the supplementary material.
4 Experimental Results
In our experiments, we train our method only with synthetic data and test it on four real datasets to evaluate its domain generalization ability. During training, we use disparity regression for disparity prediction and the smooth L1 loss to compute the errors for back-propagation (the same as in [56, 4]). All the models are optimized with Adam. We train with a batch size of 8 on four GPUs using random crops from the input images. The maximum disparity is set to 192. We train the model on the synthetic dataset for 10 epochs with a constant learning rate of 0.001. All other training settings are kept the same as in the GANet baseline.
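As a concrete sketch of these two components (a hypothetical numpy illustration with a toy cost volume, rather than the actual PyTorch pipeline), disparity regression via soft-argmin and the smooth L1 loss can be written as:

```python
import numpy as np

def disparity_regression(cost_volume):
    """Soft-argmin disparity regression over a D x H x W cost volume:
    softmax over the negated costs along the disparity axis, followed by
    the expectation of the disparity values."""
    logits = -cost_volume
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p = p / p.sum(axis=0, keepdims=True)
    disps = np.arange(cost_volume.shape[0], dtype=float).reshape(-1, 1, 1)
    return (p * disps).sum(axis=0)

def smooth_l1(pred, gt):
    """Smooth L1 (Huber-style) loss on the regressed disparities."""
    diff = np.abs(pred - gt)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

# Toy cost volume whose minimum cost sits at disparity 5 for every pixel.
d = np.arange(16, dtype=float).reshape(-1, 1, 1)
cost = (d - 5.0) ** 2 * np.ones((16, 2, 2))
pred = disparity_regression(cost)     # each pixel regresses to about 5.0
loss = smooth_l1(pred, np.full((2, 2), 5.0))
```

The soft-argmin makes the prediction differentiable with respect to the cost volume, which is what allows end-to-end training of the aggregation layers.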
4.1 Datasets
The KITTI stereo 2012 and 2015 datasets provide about 400 image pairs of outdoor driving scenes for training, where the disparity labels are derived from Velodyne LiDAR points. The Cityscapes dataset provides a large number of high-resolution (2048×1024) stereo images collected from outdoor city driving scenes; its disparity labels are pre-computed by SGM, which is not accurate enough for training deep neural network models. The Middlebury stereo dataset is designed for indoor scenes at higher resolution, but it provides no more than 50 image pairs, which are not enough to train robust deep neural networks. In addition, the ETH3D dataset provides 27 pairs of grayscale images for training.
These existing real datasets are all limited by their small quantity or poor ground-truth labels, making them insufficient for training deep learning models. Hence, we just use them as test sets for evaluating our models’ cross-domain generalization ability.
We mainly use synthetic data to train our domain-invariant models. The existing Scene Flow synthetic dataset contains 35k training image pairs with a resolution of 960×540. This dataset covers only a limited number of outdoor driving scenes and provides stereo pairs with few settings of camera baseline and image resolution. We therefore use CARLA to generate a new supplementary synthetic dataset (with 20k stereo pairs) with more diverse settings, including two image resolutions, three different focal lengths, and five different camera baselines (in a range of 0.2-1.5 m). This supplementary dataset significantly improves the diversity of the training set and will be published with the paper.
The two advantages in using synthetic data are that it can avoid all the difficulties of labeling a large amount of real data, and that it can eliminate the negative influence of wrong depth values in real datasets.
4.2 Ablation Study
We evaluate the performance of our DSMNet with numerous settings, including different architectures, normalization strategies and numbers (0-9) of the proposed non-local filter (NLF) layers. As listed in Table 1, the full-setting DSMNet far outperforms the baseline in accuracy by 3% on the KITTI and 8% on the Middlebury datasets. Our proposed domain normalization improves the accuracy by about 1.5%, and the NLF layers contribute another 1.4% on the KITTI dataset.
Moreover, our proposed layers are generic and can be seamlessly integrated into other deep stereo matching models. Here, we integrate them into GANet and PSMNet. The accuracies are improved by 4-8% on the KITTI dataset and 8-13% on the Middlebury dataset in cross-domain evaluations, compared with the original PSMNet and GANet.
4.3 Component Analysis and Comparisons
To further validate the superiority of the proposed layers, we compare each of them with other related normalization and non-local strategies.
Table 2 compares our domain normalization with batch normalization , instance normalization , and the recently proposed adaptive batch-instance normalization . We keep all other settings the same as our DSMNet and only replace the normalization method for training and evaluation. Our domain normalization is superior to others for domain-invariant stereo matching because it can fully regulate the feature vectors’ distribution and remove both image-level and local contrast differences for cross-domain generalization.
Finally, we compare our graph-based non-local filter with other related strategies, including affinity-based propagation , non-local neural network denoising , and non-local attention  (in Table 2). Our graph-based filtering strategy is better for capturing the structural and geometric context for robust domain-invariant stereo matching. The non-local neural network denoising  and non-local attention  do not have spatial constraints that usually lead to over smoothness of the depth edges (as shown in the supplementary material). Affinity-based propagations  are special cases of our proposed filtering strategy and are not as effective in feature and cost volume aggregations for stereo matching.
4.4 Cross-Domain Evaluations
In this section, we compare our proposed DSMNet with state-of-the-art stereo matching models by training with synthetic data and evaluating on real test sets.
Comparisons with State-of-the-Art Models.
In Table 3 and Fig. 7, we compare our DSMNet with other state-of-the-art deep neural network models on the four real datasets. All the models are trained on synthetic data (either SceneFlow or a mixture of SceneFlow and Carla). We find that DSMNet far outperforms the state-of-the-art models by 3-30% in error rates on all these datasets. It is also far better than traditional stereo matching algorithms such as SGM, CostFilter and PatchMatch.
|Models|Training Set|Error Rates (%)|
|MADNet|KITTI (no gt)|8.23|
|OASM-Net|KITTI (no gt)|8.98|
|Unsupervised|KITTI (no gt)|9.91|
Evaluation on the KITTI Benchmark.
Table 4 presents the performance of our DSMNet on the KITTI benchmark. Our model far outperforms most of the unsupervised/self-supervised models trained on the KITTI domain. It is even better than supervised stereo matching networks (including MC-CNN, Content-CNN, and DispNetC) trained or fine-tuned on the KITTI dataset. When compared with other fine-tuned state-of-the-art models (e.g. PSMNet, HD³, GANet-deep), our DSMNet (without fine-tuning) produces more accurate object boundaries (Fig. 8).
4.5 Fine-tuning on the Target Domain
In this section, we show DSMNet's best performance when fine-tuned on the target domain. We fine-tune the model pre-trained on synthetic data for a further 700 epochs using the KITTI 2015 training set. The learning rate begins at 0.001 for the first 300 epochs and decreases to 0.0001 for the rest. The results are submitted to the KITTI benchmark for evaluation.
Table 5 compares the results of the fine-tuned DSMNet with those of other state-of-the-art DNN models. We find that DSMNet outperforms most recent models (including PSMNet, HD³, GwcNet and GANet-15) by a noteworthy margin. This implies that DSMNet's improved cross-domain generalization ability does not come at the expense of accuracy when fine-tuned on one specific dataset.
We also separately test the effectiveness of our non-local filtering strategy. Using the current best model, GANet-deep (including its local guided aggregation layer), as the baseline, we add five of our filtering layers for feature extraction. All other settings are kept the same as in the original GANet. After training on synthetic data and fine-tuning on the KITTI training set, the model achieves a new state-of-the-art accuracy (1.77%) on the KITTI 2015 benchmark. This shows that our graph-based filter improves not only cross-domain generalization but also accuracy on the test domains.
4.6 Efficiency and Parameters
Our proposed non-local filtering is a linear process that can be realized efficiently: the inference time increases only slightly, by no more than 5% compared with the baseline. Moreover, no new parameters are introduced by the proposed domain normalization and non-local filtering layers. Detailed comparisons are available in the supplementary material.
5 Conclusion
In this paper, we have proposed two end-to-end trainable neural network layers for our domain-invariant stereo matching network. Our novel domain normalization fully regulates the distribution of learned features to address significant domain shifts, and our non-local graph-based filter captures more robust non-local structural and geometric features for accurate disparity estimation in cross-domain settings. We have verified our model on four real datasets and shown its superior accuracy compared to other state-of-the-art stereo matching networks in cross-domain generalization.
-  (2018) Metareg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems (NIPS), pp. 998–1008. Cited by: §2.3.
-  (2011) PatchMatch stereo: stereo matching with slanted support windows. In British Machine Vision Conference (BMVC), pp. 1–11. Cited by: §3.2.2, §4.4, Table 3.
-  (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3722–3731. Cited by: §1.
-  (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418. Cited by: Table 6, Figure 10, 11(c), Figure 1, §1, §1, §1, §2.1, §2.1, §3.1, 7(c), 8(c), §4.2, §4.4, §4.5, Table 3, Table 5, §4.
-  (2013) Fast patch-based denoising using approximated patch geodesic paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1211–1218. Cited by: §3.2.1.
-  (2016) Universal correspondence network. In Advances in Neural Information Processing Systems (NIPS), pp. 2414–2422. Cited by: §2.3.
-  (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3213–3223. Cited by: Figure 10, §4.1.
-  (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: Appendix E, §4.1.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: §4.1.
-  (2015) Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision (ICCV), pp. 2551–2559. Cited by: §2.3.
-  (2018) Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 484–500. Cited by: §1, §2.2.
-  (2019) Group-wise correlation stereo network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3273–3282. Cited by: §2.1, §4.5, Table 3, Table 5.
-  (2008) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341. Cited by: §1, §2.1, §3.2.3, §4.1, §4.4, Table 3.
-  (2013) Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2), pp. 504–511. Cited by: §4.4, Table 3.
-  (2019) Ccnet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 603–612. Cited by: Figure 12, §F.3, §3.2.3, §3.2.3, §4.3, Table 2.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.3.
-  (2017) End-to-end learning of geometry and context for deep stereo regression. CoRR, vol. abs/1703.04309. Cited by: §1, §1, §2.1, §2.1, §3.1, Table 5, §4.
-  (2018) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. CoRR abs/1807.08865. Cited by: §2.1.
-  (2018) Occlusion aware stereo matching via cooperative unsupervised learning. In Asian Conference on Computer Vision, pp. 197–213. Cited by: Table 4.
-  (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5542–5550. Cited by: §2.3.
-  (2018) Learning to generalize: meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence.
-  (2018) Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5400–5409.
-  (2018) Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639.
-  (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117.
-  (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2811–2820.
-  (2013) Joint geodesic upsampling of depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 169–176.
-  (2017) Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1520–1530.
-  (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4898–4906.
-  (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5695–5703.
-  (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4040–4048.
-  (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070.
-  (2017) Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5715–5725.
-  (2018) Batch-instance normalization for adaptively style-invariant neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 2558–2567.
-  (2019) Multi-level context ultra-aggregation for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3283–3291.
-  (2018) Two at once: enhancing learning and generalization capacities via IBN-Net. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 464–479.
-  (2017) Cascade residual learning: a two-stage convolutional neural network for stereo matching. In IEEE International Conference on Computer Vision Workshops (ICCVW).
-  (2018) Zoom and learn: generalizing deep stereo matching to novel domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2070–2079.
-  (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2337–2346.
-  (2019) Guided stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 979–988.
-  (2014) High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pp. 31–42.
-  (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3260–3269.
-  (2019) EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. arXiv preprint arXiv:1903.01700.
-  (2017) Unsupervised adaptation for deep stereo. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
-  (2019) Learning to adapt for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9661–9670.
-  (2019) Real-time self-adaptive deep stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 195–204.
-  (2017) Weakly supervised learning of deep metrics for stereo reconstruction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1339–1348.
-  (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
-  (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803.
-  (2018) Anytime stereo image depth estimation on mobile devices. arXiv preprint arXiv:1810.11408.
-  (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 501–509.
-  (2018) SegStereo: exploiting semantic information for disparity estimation. arXiv preprint arXiv:1807.11699.
-  (2012) A non-local cost aggregation method for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1402–1409.
-  (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6044–6053.
-  (2015) Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1592–1599.
-  (2015) Segment graph based image filtering: fast structure-preserving smoothing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 361–369.
-  (2019) GA-Net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 185–194.
-  (2018) Fundamental principles on learning new features for effective dense matching. IEEE Transactions on Image Processing 27 (2), pp. 822–836.
-  (2017) Self-supervised learning for stereo matching with self-improving ability. arXiv preprint arXiv:1709.00930.
-  (2017) Unsupervised learning of stereo matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1567–1575.
Appendix A Proof of Footnote 1
Following the variable definitions in the paper, we prove the claim by induction.
Since any path that reaches a node must pass through one of its neighbors, we can expand the recursion to obtain
Base case: when , for ,
Inductive hypothesis: assume that when , .
Inductive step: we can then derive that, for :
Here, , since .
This yields the equivalence of Eq. (12).
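Since the exact symbols of the proof are defined in the paper, the induction above can also be illustrated numerically on a simplified 1-D analogue. The recursion and weights below are hypothetical stand-ins, not the paper's exact filter; the check only demonstrates the structure the proof relies on, namely that unrolling a linear recursive aggregation yields an explicit weighted sum over paths.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=n)           # input features along a 1-D path
w = rng.uniform(0, 1, size=n)    # hypothetical edge weights

def recursive_aggregate(x, w):
    """Recursive form: y[i] = (1 - w[i]) * x[i] + w[i] * y[i-1]."""
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = (1 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

def explicit_aggregate(x, w):
    """Unrolled form: y[i] = sum_j c(i, j) * x[j], where c(i, j) is the
    product of edge weights along the unique path from node j to node i."""
    n = len(x)
    y = np.zeros_like(x)
    for i in range(n):
        for j in range(i + 1):
            c = 1.0 if j == 0 else (1 - w[j])
            c *= np.prod(w[j + 1:i + 1])   # accumulated weights on path j -> i
            y[i] += c * x[j]
    return y

# the two forms agree, which is the equivalence the induction establishes
assert np.allclose(recursive_aggregate(x, w), explicit_aggregate(x, w))
```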
Appendix B Backpropagation
The backpropagation for and in Eq. (9) can be computed in reverse. Assuming the gradient from the next layer is , the backpropagation can be implemented as:
where is a temporary gradient variable that can be calculated iteratively (similarly to Eq. (9)):
The propagation of Eq. (14) is an inverse process, carried out in the reverse of the forward traversal order.
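The reverse-order backward pass can be sketched on the same simplified 1-D recursion used above. This is a hypothetical stand-in for Eqs. (9) and (14), not the paper's exact filter: the temporary variable `t` plays the role of the iteratively computed gradient, traversed in the reverse of the forward order, and the result is verified against finite differences.

```python
import numpy as np

def forward(x, w):
    # forward recursion: y[i] = (1 - w[i]) * x[i] + w[i] * y[i-1]
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = (1 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

def backward(g, x, w):
    # Given upstream gradients g = dL/dy, compute dL/dx by visiting the
    # nodes in REVERSE order; t is the iteratively accumulated temporary
    # gradient, mirroring the forward recursion run backwards.
    n = len(g)
    t = np.empty_like(g)
    t[n - 1] = g[n - 1]
    for i in range(n - 2, -1, -1):
        t[i] = g[i] + w[i + 1] * t[i + 1]
    dx = (1 - w) * t
    dx[0] = t[0]          # y[0] = x[0], so the local gradient there is 1
    return dx

rng = np.random.default_rng(1)
n = 6
x = rng.normal(size=n)
w = rng.uniform(0, 1, size=n)
g = rng.normal(size=n)    # upstream gradient dL/dy

# verify against central finite differences of L = g . y
dx = backward(g, x, w)
eps = 1e-6
num = np.array([
    (g @ forward(x + eps * np.eye(n)[k], w) -
     g @ forward(x - eps * np.eye(n)[k], w)) / (2 * eps)
    for k in range(n)
])
assert np.allclose(dx, num, atol=1e-5)
```

Because each node is visited once in each direction, the backward pass has the same linear cost as the forward aggregation.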
Appendix C Details of the Architecture
Table 8 details the layer parameters of DSMNet. It has seven non-local filtering (NLF) layers, used in feature extraction and cost aggregation. The proposed Domain Normalization (DN) layer replaces Batch Normalization after each 2D convolutional layer in the feature extraction and guidance networks.
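The Domain Normalization layer used throughout the table can be sketched as follows. This is a minimal NumPy sketch under one plausible reading of DN, namely per-image spatial normalization of each channel combined with an L2 normalization along the channel dimension at every pixel; consult the main paper for the exact formulation.

```python
import numpy as np

def domain_norm(x, eps=1e-5):
    """Sketch of a Domain Normalization layer for an (N, C, H, W) tensor."""
    # 1) Normalize each channel of each image over its spatial dimensions
    #    (as in instance normalization): this removes per-image statistics
    #    such as brightness and contrast that vary across domains.
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    x = (x - mu) / np.sqrt(var + eps)
    # 2) L2-normalize the feature vector at every pixel along the channel
    #    dimension, so only the direction (structure) of the feature
    #    survives, not its magnitude.
    norm = np.sqrt((x ** 2).sum(axis=1, keepdims=True) + eps)
    return x / norm

feat = np.random.default_rng(2).normal(size=(2, 8, 4, 4))
out = domain_norm(feat)
# every pixel's channel vector now has (approximately) unit length
assert np.allclose((out ** 2).sum(axis=1), 1.0, atol=1e-3)
```

Note that, consistent with Appendix D, this operation introduces no learnable parameters.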
Appendix D Efficiency and Parameters
As shown in Table 6, our proposed non-local filtering is a linear process that can be implemented efficiently. The inference time increases by only about 5% compared with the baseline. Moreover, no new parameters are introduced by the proposed domain normalization and non-local filtering layers.
Appendix E Carla Dataset
Since the synthetic Sceneflow dataset  contains only a limited number (about 7,000) of stereo pairs for driving scenes, we use the Carla  platform to produce additional stereo pairs of outdoor driving scenes. As shown in Table 7, the new Carla supplementary dataset has more diverse settings, including two image resolutions, three different focal lengths, and six different camera baselines (ranging from 0.2 m to 1.5 m). This supplementary dataset significantly improves the diversity of the training set. As shown in Fig. 9, the Carla scenes still exhibit significant domain differences (e.g. color and texture) compared with real scenes (e.g. KITTI, Cityscapes), but our DSMNet can extract shape and structure information for robust stereo matching. These representations transfer better to real scenes and produce more accurate disparity estimation.
| Dataset | Number of pairs | Focal lengths | Baseline settings (m) | Resolutions |
|---|---|---|---|---|
| Carla Stereo | 20,000 | 640, 670, 720 | 0.2, 0.3, 0.5, 1.0, 1.2, 1.5 | , |
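The camera settings in Table 7 can be enumerated as below; the snippet is a hypothetical illustration (variable names and units are assumptions, and the two image resolutions are omitted as in the table). The standard relation d = f·b/z shows why varying baselines and focal lengths diversifies the disparity ranges seen during training.

```python
from itertools import product

# Camera settings taken from Table 7:
focal_lengths = [640, 670, 720]                  # assumed to be in pixels
baselines = [0.2, 0.3, 0.5, 1.0, 1.2, 1.5]       # in metres

# All focal-length/baseline combinations available when rendering pairs.
configs = [{"focal": f, "baseline": b}
           for f, b in product(focal_lengths, baselines)]

def disparity(focal, baseline, depth):
    """Stereo disparity (in pixels) of a point at `depth` metres."""
    return focal * baseline / depth

# At a fixed depth, the widest baseline (1.5 m) produces 7.5x larger
# disparities than the narrowest one (0.2 m), widening the label range.
ratio = disparity(720, 1.5, 10.0) / disparity(720, 0.2, 10.0)
```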
| No. | Layer Description | Output Tensor |
|---|---|---|
| input | normalized image pair as input | H×W×3 |
| 1 | 3×3 conv, DN, ReLU | H×W×32 |
| 2 | 3×3 conv, stride 3, DN, ReLU | H×W×32 |
| 3 | 3×3 conv, DN, ReLU | H×W×32 |
| 4 | NLF, DN, ReLU | H×W×32 |
| 5 | 3×3 conv, stride 2, DN, ReLU | H×W×48 |
| 6 | NLF, DN, ReLU | H×W×48 |
| 7 | 3×3 conv, DN, ReLU | H×W×48 |
| 14 | 3×3 deconv, stride 2, DN, ReLU | H×W×96 |
| 15 | 3×3 conv, DN, ReLU | H×W×96 |
| 20 | NLF, DN, ReLU | H×W×48 |
| 42 | NLF, DN, ReLU | H×W×32 |
| concatenation | (11,14) (9,16) (7,18) (4,21) (20,24) (17,27) (15,29) (13,31) (18,25) (30,33) (28,35) (26,37) (23,40) | |
| output | by feature concatenation | H×W×64×32 |
| No. | Layer Description | Output Tensor |
|---|---|---|
| input | concatenation of layer 1 and up-sampled layer 35 as input | H×W×64 |
| (1) | 3×3 conv, DN, ReLU | H×W×16 |
| (2) | 3×3 conv, stride 3, DN, ReLU | H×W×32 |
| (3) | 3×3 conv, DN, ReLU | H×W×32 |
| (4) | 3×3 conv (no BN & ReLU) | H×W×20 |
| (5) | split, reshape, normalize | H×W×5 |
| (6)-(8) | from (3), repeat (3)-(5) | H×W×5 |
| (9)-(11) | from (6), repeat (6)-(8) | H×W×5 |
| (12) | from (2), 3×3 conv, stride 2, DN, ReLU | H×W×32 |
| (13) | 3×3 conv, DN, ReLU | H×W×32 |
| (14) | 3×3 conv (no BN & ReLU) | H×W×20 |
| (15) | split, reshape, normalize | H×W×5 |
| (16)-(18) | from (13), repeat (13)-(15) | H×W×5 |
| (19)-(21) | from (16), repeat (13)-(15) | H×W×5 |
| (22)-(24) | from (19), repeat (13)-(15) | H×W×5 |
| No. | Layer Description | Output Tensor |
|---|---|---|
| input | 4D cost volume | H×W×64×64 |
| | 3×3×3 3D conv | H×W×64×32 |
| | SGA: weight matrices from (5) | H×W×64×32 |
| | 3×3×3 3D conv | H×W×64×32 |
| output | 3×3×3 3D-to-2D conv, upsampling | H×W×193 |
| | softmax, regression, loss weight: 0.2 | H×W×1 |
| | 3×3×3 3D conv, stride 2 | H×W×32×48 |
| | 3×3×3 3D conv | H×W×32×48 |
| | SGA: weight matrices from (15) | H×W×32×48 |
| | 3×3×3 3D conv, stride 2 | H×W×16×64 |
| | 3×3×3 3D deconv, stride 2 | H×W×32×48 |
| | 3×3×3 3D conv | H×W×32×48 |
| | SGA: weight matrices from (18) | H×W×32×48 |
| | 3×3×3 3D deconv, stride 2 | H×W×64×32 |
| | 3×3×3 3D conv | H×W×64×32 |
| | SGA: weight matrices from (8) | H×W×64×32 |
| output | 3×3×3 3D-to-2D conv, upsampling | H×W×193 |
| | softmax, regression, loss weight: 0.6 | H×W×1 |
| final output | 3×3×3 3D-to-2D conv, upsampling | H×W×193 |
| | regression, loss weight: 1.0 | H×W×1 |
| connection | concatenate: (4,12), (7,9), (8,19), (11,16), (15,23), (18,20); add: (1,4) | |
Appendix F More Results
F.1 Feature Visualization
As compared in Fig. 10, the features of state-of-the-art models are dominated by local patterns, which produce many artifacts (e.g. noise) under domain shifts. Our DSMNet mainly captures non-local structure and shape information, which is robust for cross-domain generalization; its feature maps are free of such artifacts.
F.2 Disparity Results on Different Datasets
More results and comparisons are provided in Fig. 11. All models are trained on the synthetic dataset and tested on the real KITTI, Middlebury, ETH3D and Cityscapes datasets.
F.3 Comparisons with Other Non-local Strategies
Our graph-based filtering strategy is better at capturing structural and geometric context for robust domain-invariant stereo matching. Non-local neural network denoising  and non-local attention  lack spatial constraints, which usually leads to over-smoothing of depth edges and thin structures (as shown in Fig. 12).