Do End-to-end Stereo Algorithms Under-utilize Information?



Deep networks for stereo matching typically leverage 2D or 3D convolutional encoder-decoder architectures to aggregate cost and regularize the cost volume for accurate disparity estimation. Due to content-insensitive convolutions and down-sampling and up-sampling operations, these cost aggregation mechanisms do not take full advantage of the information available in the images. Disparity maps suffer from over-smoothing near occlusion boundaries, and erroneous predictions in thin structures. In this paper, we show how deep adaptive filtering and differentiable semi-global aggregation can be integrated in existing 2D and 3D convolutional networks for end-to-end stereo matching, leading to improved accuracy. The improvements are due to utilizing RGB information from the images as a signal to dynamically guide the matching process, in addition to being the signal we attempt to match across the images. We show extensive experimental results on the KITTI 2015 and Virtual KITTI 2 datasets comparing four stereo networks (DispNetC, GCNet, PSMNet and GANet) after integrating four adaptive filters (segmentation-aware bilateral filtering, dynamic filtering networks, pixel adaptive convolution and semi-global aggregation) into their architectures. Our code is available at


Stevens Institute of Technology

1 Introduction

Progress in learning-based stereo matching has been rapid, from early work that focused on individual aspects of the stereo matching pipeline [32], such as similarity computation [43], to recent end-to-end systems [7, 23, 29, 45]. Remarkably, the evolution of learning-based stereo has largely mirrored that of conventional algorithms: initial end-to-end systems resembled winner-take-all stereo, with the disparity of each pixel determined almost independently based on a small amount of local context, while GANet [45], arguably the most effective current algorithm, includes a differentiable form of the Semi-Global Matching (SGM) algorithm [20], which has been by far the most popular conventional optimization technique for over a decade.

Despite their success, deep stereo networks seem to under-utilize information present in their inputs. Specifically, in this paper we demonstrate how many existing networks leverage RGB information from the images to extract features that facilitate matching, but leave additional information unexploited. We show that the accuracy of representative network architectures can be improved by integrating into them modules that are sensitive to pixel similarity, image edges or semantics and act like adaptive filters.

In our experiments, we have integrated four components that can be thought of as adaptive or guided filters into four existing networks for stereo matching. Specifically, we have experimented with segmentation-aware bilateral filtering (SABF) [17], dynamic filtering networks (DFN) [22], pixel adaptive convolution (PAC) [34] and semi-global aggregation (SGA) from GANet [45] integrated into DispNetC [29], GCNet [23], PSMNet [7], and GANet [45]. The set of filters is diverse: SABF, for example, is pre-trained to embed its input via a semantic segmentation loss, while SGA aggregates matching cost along rows and columns of cost volume slices. The backbone networks are also diverse, spanning 2D [29] and 3D [7, 23] convolutional networks and GANet that foregoes 3D convolutions in favor of the SGM-like aggregation mechanism. The number of parameters of the backbones ranges from 2.8 to 42.2 million.

The contributions of this paper are:

  • Several novel deep architectures for stereo matching.

  • Evidence that further progress in stereo matching is possible by leveraging image context as guidance for refinement, filtering and aggregation of the matching volume.

  • A comparison of four filtering methods leading to the conclusion that SGA typically achieves the highest accuracy among them.

2 Related Work

We review deep learning based end-to-end stereo matching and content-guided or adaptive filtering in CNNs.

End-to-end Stereo Matching.   End-to-end stereo matching methods can be generally grouped into two categories: 2D CNNs for correlation-based disparity estimation and 3D CNNs for cost volume based disparity regression.

A representative work in the first category is DispNetC [29]. It computes the correlations among features between stereo views at different disparity values, and regresses the disparity via an encoder-decoder 2D CNN architecture. Many other methods [26, 31, 33, 36, 41, 42, 44] extend this paradigm. CRL [31] employs a two-stage network for cascade disparity residual learning. Feature constancy is added in iResNet [26] to further refine the disparity. SegStereo [41], DSNet [44] and EdgeStereo [33] are multi-task learning approaches jointly optimizing stereo matching with semantic segmentation or edge detection.

State-of-the-art stereo networks [7, 16, 23, 45] largely fall in the second category. Unlike the correlation-based cost volume in 2D CNN stereo networks, 3D CNNs generate a 4D cost volume by concatenating the deep features from the Siamese branches along the channel dimension at each disparity level and each pixel position. GCNet [23] and PSMNet [7] apply 3D convolutional layers for cost aggregation, followed by a differentiable soft-argmin layer for disparity regression. GwcNet [16] leverages group-wise correlation of channel-split features to generate a hybrid cost volume that can be processed by a smaller 3D convolutional aggregation sub-network. GANet [45] includes local guided aggregation (LGA) and semi-global aggregation (SGA) layers for efficient cost aggregation, which are complementary to 3D convolutional layers. LGA aggregates the cost volume locally to refine thin structures, while SGA is a differentiable counterpart of SGM [20].

Content Guided and Adaptive Filtering in CNNs. Existing content-adaptive CNNs fall into two general classes. In the first class, conventional image-adaptive filters (e.g. bilateral filters [1, 35], guided image filters [18] and non-local means [2, 4], among others) have been adapted to be differentiable and used as content-adaptive neural network layers [6, 8, 9, 14, 21, 24, 27, 28, 37, 38, 46]. Wu et al. [38] propose novel layers to perform guided filtering [18] inside CNNs. Wang et al. [37] present non-local neural networks to mimic non-local means [4] for capturing long-range dependencies. The bilateral inception module by Gadde et al. [14] can be inserted into existing CNN segmentation architectures for improved results. It performs bilateral filtering to propagate information between superpixels based on the spatial and color similarity. Harley et al. [17] integrate segmentation information in the CNN by firstly learning segmentation-aware embeddings, then generating local foreground attention masks, and combining the masking filters with convolutional filters to perform segmentation-aware convolution (see details in Sec. 3.2). Deformable convolutions [12, 47] produce spatially varying modifications to the convolutional filters, where the modifications are represented as offsets in favor of learning geometric-invariant features. Pixel adaptive convolution (PAC) [34] mitigates the content-agnostic drawback of standard convolutions by multiplying the convolutional filter weights by a spatial kernel function. PAC has been applied in joint image upsampling networks [25] and in a learnable dense conditional random field (CRF) framework [9, 21, 24, 46]. See Sec. 3.2 for more details.

Another class of content-adaptive CNNs focuses on learning spatial position-aware filter weights using separate sub-networks. These approaches are called “Dynamic Filter Networks” (DFN) [22, 39, 40] or kernel prediction networks [3], which have been used in several computer vision tasks. Jia et al. [22] propose the first DFN where the convolutional filters are generated dynamically depending on input pixels. The filter weights, provided by a filter-generating network conditioned on an input, are applied to another input through the dynamic filtering layer (see details in Sec. 3.2). It is extended by Wu et al. [39] with an additional attention mechanism and a dynamic sampling strategy to allow the position-specific kernels to also learn from multiple neighboring regions.

3 Approach

In this section, we describe our approach for adapting state-of-the-art stereo matching networks [7, 23, 29, 45] by integrating deep filtering techniques to improve their accuracy. We show how four filtering techniques, segmentation-aware bilateral filtering (SABF) [17], dynamic filtering networks (DFN) [22], pixel adaptive convolution (PAC) [34] and semi-global aggregation (SGA) [45], can be used to filter the cost volume for accurate disparity estimation.

3.1 Cost Volume in Stereo Matching

Figure 1: Cost volume in 2D and 3D CNNs for stereo matching. (a) Deep features are extracted for each view. (b) A left feature vector $\mathbf{F}_L(x, y)$ and several right counterparts $\mathbf{F}_R(x-d, y)$, at disparity $d$. (c) The left feature vector is either correlated (top branch for 2D CNNs) or concatenated (bottom branch for 3D CNNs) with the corresponding right feature vectors. (d) The cost volume, including the highlighted cost volume slice (green), cost volume fiber (blue) and cost feature (light blue) that is a scalar for 2D CNNs or a vector for 3D CNNs.

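Given a rectified stereo pair $I_L$ and $I_R$ with dimensions $H \times W$ ($H$: height, $W$: width), stereo matching finds the correspondence between a reference pixel $p_L = (x, y)$ in the left image $I_L$ and a target pixel $p_R = (x-d, y)$ in the right image $I_R$. The cost volume (or matching volume) is defined as a 3D or 4D tensor $\mathbf{C}$ with dimensions $D \times H \times W$ or $2F \times D \times H \times W$ ($D$: disparity range, $F$: feature dimensionality for each of the views), to represent the likelihood of a reference pixel $p_L$ corresponding to a target pixel $p_R$, with disparity $d \in [0, D)$. Its construction is illustrated in Fig. 1. Specifically, the extracted left feature vector $\mathbf{F}_L(x, y)$ of the reference pixel is either correlated (in 2D CNNs, e.g. DispNetC [29] and iResNet [26]) or concatenated (in 3D CNNs, e.g. GCNet [23], PSMNet [7] and GANet [45]) with the corresponding feature $\mathbf{F}_R(x-d, y)$ of the target pixel, for each disparity $d$. It is formulated in Eq. 1:

$$\mathbf{C}(x, y, d) = \begin{cases} \langle \mathbf{F}_L(x, y),\ \mathbf{F}_R(x-d, y) \rangle & \text{for 2D CNNs} \\ \mathbf{F}_L(x, y) \oplus \mathbf{F}_R(x-d, y) & \text{for 3D CNNs} \end{cases} \tag{1}$$

where $\langle \cdot, \cdot \rangle$ denotes correlation and $\oplus$ indicates concatenation, resulting in a 3D or 4D tensor for 2D and 3D CNNs, respectively. We denote a cost volume slice at disparity $d$ by $\mathbf{C}_d$ (i.e., $\mathbf{C}_d = \mathbf{C}(\cdot, \cdot, d)$).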
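To make the two constructions concrete, the following is a minimal NumPy sketch rather than the implementation used in any of the backbones. It assumes unary features stored as `(F, H, W)` arrays, uses a plain inner product as the correlation, and leaves columns with no valid match (`x < d`) at zero. The function name `build_cost_volume` and the tensor layout are illustrative assumptions.

```python
import numpy as np

def build_cost_volume(feat_left, feat_right, max_disp, mode="correlation"):
    """Build a cost volume from unary features of shape (F, H, W).

    Returns (D, H, W) for mode="correlation" (2D CNNs) or
    (2F, D, H, W) for mode="concatenation" (3D CNNs).
    The layout is an assumption made for this sketch.
    """
    F, H, W = feat_left.shape
    if mode == "correlation":
        volume = np.zeros((max_disp, H, W), dtype=feat_left.dtype)
    elif mode == "concatenation":
        volume = np.zeros((2 * F, max_disp, H, W), dtype=feat_left.dtype)
    else:
        raise ValueError("mode must be 'correlation' or 'concatenation'")

    for d in range(max_disp):
        # Left pixel (x, y) is paired with right pixel (x - d, y).
        left = feat_left[:, :, d:]
        right = feat_right[:, :, :W - d]
        if mode == "correlation":
            # Inner product over the feature channels gives a scalar cost.
            volume[d, :, d:] = (left * right).sum(axis=0)
        else:
            # Stack left and right features along the channel axis.
            volume[:F, d, :, d:] = left
            volume[F:, d, :, d:] = right
    return volume

# Usage: the right view is the left view shifted by 2 pixels.
rng = np.random.default_rng(0)
fl = rng.standard_normal((4, 5, 8))
fr = np.roll(fl, -2, axis=2)
cv3d = build_cost_volume(fl, fr, max_disp=6, mode="correlation")
cv4d = build_cost_volume(fl, fr, max_disp=6, mode="concatenation")
```

In the usage example the correlation at the true disparity equals the squared norm of the left feature, which is why correlation-based volumes can be read as similarity scores.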

3.2 Content-Adaptive Filtering Modules

In this subsection, we present four content-adaptive filtering approaches which can be effectively incorporated as content-adaptive CNN layers in state-of-the-art networks for end-to-end stereo matching.

3.2.1 Segmentation-aware Bilateral Filtering Module

Segmentation-aware bilateral filtering (SABF) [17] was proposed to enforce smoothness while preserving region boundaries or motion discontinuities in dense prediction tasks, such as semantic segmentation and optical flow estimation. As shown in Fig. 2, here we adapt SABF to stereo matching to filter the cost volume (Sec. 3.1), by (i) learning to embed pixels in a feature space where semantic dissimilarity between pixels can be measured by a distance function [10]; (ii) creating local foreground (relative to a given pixel) attention filters; and (iii) filtering the cost volume so as to capture the relevant foreground and be robust to appearance variations in the background or occlusions.

Figure 2: Integrating the segmentation-aware bilateral filtering (SABF) module. The upper branch shows how the embedding $e$ is learned by an embedding network from an input image $I$. Pairwise embedding distances are converted (Eq. 4) to SABF filter weights $w$, as shown at the bottom. The overall figure shows how SABF filters a cost volume slice $\mathbf{C}_d$ to obtain a segmentation-aware filtered result $\tilde{\mathbf{C}}_d$.

Learning the Embedding.   Given an RGB image $I$, consisting of pixels $p_i$, and its semantic segmentation label map $L$, the embedding function $f$ is implemented as an Embedding Network (see the detailed architecture in the supplement) that maps pixels into an embedding space as $f: \mathbb{R}^3 \rightarrow \mathbb{R}^M$, or $e_i = f(p_i)$, where $e_i$ is the embedding for RGB pixel $p_i$, with $M$ as the dimension of the embedding space. See Fig. 2. The embeddings are learned via a loss function over pixel pairs sampled in a neighborhood around each pixel. Specifically, for any two pixels $p_i$ and $p_j$ and corresponding object class labels $l_i$ and $l_j$, the pairwise loss is:

$$\ell_{i,j} = \begin{cases} \max\left(\|e_i - e_j\| - \alpha,\ 0\right) & \text{if } l_i = l_j \\ \max\left(\beta - \|e_i - e_j\|,\ 0\right) & \text{if } l_i \neq l_j \end{cases} \tag{2}$$

where $\|\cdot\|$ indicates the vector norm, and $\alpha$ and $\beta$ are the thresholds for same-label and different-label pairs. Therefore, the total loss for all the pixels in the image is defined in Eq. 3:

$$\mathcal{L}_e = \sum_{i} \sum_{j \in \mathcal{N}_i} \ell_{i,j} \tag{3}$$

where $j \in \mathcal{N}_i$ spans the spatial neighbors of index $i$. We follow the implementation of Harley et al. [17], where three overlapping neighborhoods with different dilation rates are used for a good trade-off between long-range pairwise connectivity and computational efficiency.
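The pairwise embedding loss can be illustrated with a small sketch. It assumes a hinge-style form in which same-label pairs are penalized only when their embedding distance exceeds a lower threshold, and different-label pairs only when it falls below an upper threshold. The L1 norm and the default values `alpha=0.5` and `beta=2.0` are placeholders chosen for illustration, not values confirmed by this paper.

```python
import numpy as np

def pairwise_embedding_loss(e_i, e_j, same_label, alpha=0.5, beta=2.0):
    """Hinge-style loss on the distance between two pixel embeddings.

    alpha, beta and the L1 norm are illustrative assumptions.
    """
    dist = np.abs(e_i - e_j).sum()
    if same_label:
        # Pull same-class embeddings to within distance alpha.
        return max(dist - alpha, 0.0)
    # Push different-class embeddings at least beta apart.
    return max(beta - dist, 0.0)

def neighborhood_loss(embedding, labels, i, neighbors, alpha=0.5, beta=2.0):
    """Sum the pairwise loss between pixel i and a list of neighbor indices.

    embedding: (N, M) array of per-pixel embeddings; labels: (N,) class ids.
    """
    return sum(
        pairwise_embedding_loss(embedding[i], embedding[j],
                                labels[i] == labels[j], alpha, beta)
        for j in neighbors
    )

# Usage on three pixels: two of class 0 and one of class 1.
emb = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0]])
lab = np.array([0, 0, 1])
loss_0 = neighborhood_loss(emb, lab, 0, [1, 2])
```

Pixel 1 is close to pixel 0 and shares its label, so it contributes nothing; pixel 2 has a different label and sits exactly at distance 2.0, so it also contributes nothing under these placeholder thresholds.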

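Applying the SABF Layer.   Once the embedding is learned, the SABF filter weights $w_{i,j}$ are obtained by converting the pairwise distances between pixels $i$ and $j$, in image space and in embedding space, into (unnormalized) probabilities using two Gaussian distributions as in Eq. 4:

$$w_{i,j} = \exp\left(-\frac{\|\mathbf{p}_i - \mathbf{p}_j\|^2}{2\sigma_s^2}\right) \exp\left(-\frac{\|e_i - e_j\|^2}{2\sigma_r^2}\right) \tag{4}$$

where $\mathbf{p}_i$ are the pixel coordinates, and $\sigma_s$ and $\sigma_r$ are two predefined standard deviations. Then, given an input feature $x_i$, we can efficiently compute a segmentation-aware filtered result $y_i$ via the SABF layer:

$$y_i = \frac{\sum_{j \in \Omega(i)} w_{i,j}\, x_j}{\sum_{j \in \Omega(i)} w_{i,j}} \tag{5}$$

where $\Omega(i)$ defines an $s \times s$ filtering window centered at pixel $i$. The input $x$ and output $y$ here are from a raw cost volume slice $\mathbf{C}_d$ and the filtered slice $\tilde{\mathbf{C}}_d$, respectively (Fig. 2).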
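As an illustration of how such a layer acts on one cost volume slice, here is a direct (unoptimized) NumPy sketch. It assumes the weights are the product of a spatial Gaussian and a Gaussian on the embedding distance, and that the result is normalized by the sum of weights. The window size and the two standard deviations are hypothetical defaults, and `sabf_filter` is an illustrative name.

```python
import numpy as np

def sabf_filter(cost_slice, embedding, window=5, sigma_s=2.0, sigma_r=1.0):
    """Bilateral-style filtering of an (H, W) slice guided by an (M, H, W) embedding.

    window, sigma_s and sigma_r are illustrative values, not the paper's settings.
    """
    H, W = cost_slice.shape
    r = window // 2
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            # Clip the window at the image border.
            y0, y1 = max(0, y - r), min(H, y + r + 1)
            x0, x1 = max(0, x - r), min(W, x + r + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial Gaussian on pixel coordinates.
            spatial = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            # Range Gaussian on embedding distance to the center pixel.
            diff = embedding[:, y0:y1, x0:x1] - embedding[:, y, x][:, None, None]
            rng_w = np.exp(-(diff ** 2).sum(axis=0) / (2 * sigma_r ** 2))
            w = spatial * rng_w
            out[y, x] = (w * cost_slice[y0:y1, x0:x1]).sum() / w.sum()
    return out

# Usage: two segments with very different embeddings and different costs.
emb = np.zeros((1, 6, 6))
emb[:, :, 3:] = 100.0
cost = np.where(np.arange(6)[None, :] < 3, 1.0, 5.0) * np.ones((6, 1))
filtered = sabf_filter(cost, emb)
```

Because pixels across the segment boundary receive essentially zero weight, the step edge in `cost` survives filtering, which is the behavior that counters over-smoothing near occlusion boundaries.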

3.2.2 Dynamic Filtering Networks (DFN) Module

Dynamic Filter Networks (DFN) [22] are a content-adaptive filtering technique. As shown in Fig. 4(b), the filters $\mathcal{F}_\theta$ in the DFN are dynamically generated by a separate filter-generating network conditioned on an input $I_A$. Then, they are applied to another input $I_B$ via the dynamic filtering layer. In our implementation, $I_A$ is the deep feature of the reference image $I_L$, and $I_B$ is a cost volume slice $\mathbf{C}_d$.

Filter-Generating Network.   Given an input $I_A \in \mathbb{R}^{h \times w \times c_A}$ ($h$: height, $w$: width, $c_A$: channel size of $I_A$), the filter-generating network generates dynamic filters $\mathcal{F}_\theta$, parameterized by $\theta \in \mathbb{R}^{s \times s \times c_B \times n \times m}$ ($s$: filter window size, $c_B$: channel size of $I_B$, $n$: the number of filters).

Dynamic Local Filtering Layer.   The generated filters $\mathcal{F}_\theta$ are applied to input images or feature maps $I_B$ via the dynamic filtering layer to output the filtered result $G = \mathcal{F}_\theta(I_B)$. Specifically, the dynamic filtering layer in the DFN [22] has two types of instances: the dynamic convolutional layer (with $m = 1$) and the dynamic local filtering layer (with $m = h \times w$). The latter is adopted in this paper because it guarantees content-adaptive filtering by applying a specific local filter $\mathcal{F}_\theta^{(i,j)}$ to the neighborhood centered around each pixel coordinate $(i, j)$ of the input $I_B$:

$$G(i, j) = \mathcal{F}_\theta^{(i,j)}\left(I_B(i, j)\right) \tag{6}$$

Therefore, the operations in Eq. 6 are not only input content specific but also spatial position specific.
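The dynamic local filtering step can be sketched as follows. This is only an illustration under simplifying assumptions: a single-channel input, one filter per pixel, zero padding at the border, and filters supplied directly as an array instead of being predicted by a filter-generating network (which is not modeled here). The name `dynamic_local_filter` is hypothetical.

```python
import numpy as np

def dynamic_local_filter(inp, filters):
    """Apply a different s x s filter at every pixel of an (H, W) input.

    filters has shape (H, W, s, s); in a DFN it would be the output of the
    filter-generating network, which this sketch does not implement.
    """
    H, W = inp.shape
    s = filters.shape[2]
    r = s // 2
    padded = np.pad(inp, r, mode="constant")
    out = np.empty((H, W), dtype=np.float64)
    for i in range(H):
        for j in range(W):
            # Neighborhood centered at (i, j), weighted by its own filter.
            patch = padded[i:i + s, j:j + s]
            out[i, j] = (patch * filters[i, j]).sum()
    return out

# Usage: position-specific filters that scale each pixel by its row index.
H, W = 4, 5
inp = np.arange(H * W, dtype=np.float64).reshape(H, W)
row_scale = np.arange(H, dtype=np.float64)[:, None] * np.ones((1, W))
filters = np.zeros((H, W, 3, 3))
filters[:, :, 1, 1] = row_scale
out = dynamic_local_filter(inp, filters)
```

A standard convolution is the special case in which `filters[i, j]` is identical for all `(i, j)`; the per-pixel variation is what makes the layer position specific.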

3.2.3 Pixel Adaptive Convolutional (PAC) Module

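Pixel adaptive convolution (PAC), proposed by Su et al. [34], is a content-adaptive convolution that alleviates the drawback of standard convolution, which ignores local image content, while retaining its favorable spatial sharing property compared with existing content-adaptive filters, e.g. the DFN [22]. As illustrated in Fig. 4(a), PAC modifies a conventional spatially invariant convolution filter $\mathbf{W}$ at each position by multiplying it with a position-specific filter $K$, the adapting kernel. $K$ has a pre-defined form, e.g. Gaussian: $K(\mathbf{f}_i, \mathbf{f}_j) = \exp\left(-\frac{1}{2}\|\mathbf{f}_i - \mathbf{f}_j\|^2\right)$, where $\mathbf{f}_i$ and $\mathbf{f}_j$ are the adapting features, corresponding to pixels $i$ and $j$. The adapting features can be either hand-crafted (e.g. position and color features) or deep features. We use deep features extracted from the left image $I_L$ as the adapting features $\mathbf{f}$.

Given an input $\mathbf{v}_i \in \mathbb{R}^{c}$ at pixel $i$, the output $\mathbf{v}'_i \in \mathbb{R}^{c'}$ filtered by PAC is defined as

$$\mathbf{v}'_i = \sum_{j \in \Omega(i)} K(\mathbf{f}_i, \mathbf{f}_j)\, \mathbf{W}[\mathbf{p}_i - \mathbf{p}_j]\, \mathbf{v}_j + \mathbf{b} \tag{7}$$

where $c$ and $c'$ denote the feature dimension of $\mathbf{v}$ and $\mathbf{v}'$, respectively. $\Omega(i)$ defines an $s \times s$ convolution window. $\mathbf{W} \in \mathbb{R}^{c' \times c \times s \times s}$ and $\mathbf{b} \in \mathbb{R}^{c'}$ indicate the convolution filter weights and biases. We adopt the notation $[\mathbf{p}_i - \mathbf{p}_j]$ to spatially index filter weights with 2D spatial offsets, where $\mathbf{p}_i$ are pixel coordinates. In our approach $\mathbf{v}$ and $\mathbf{v}'$ are from the raw and filtered cost volume slices $\mathbf{C}_d$ and $\tilde{\mathbf{C}}_d$, respectively.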
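A naive NumPy sketch of PAC with a Gaussian adapting kernel is given below, to show how the shared filter weights are modulated per pixel. It assumes zero padding, the cross-correlation indexing common in deep learning frameworks, and a loop-based implementation chosen for clarity rather than speed. The function name `pac_filter` and the argument layout are illustrative.

```python
import numpy as np

def pac_filter(v, f, weight, bias):
    """Pixel-adaptive convolution with a Gaussian adapting kernel.

    v: (c, H, W) input; f: (cf, H, W) adapting features;
    weight: (c_out, c, s, s) shared filter; bias: (c_out,).
    """
    c, H, W = v.shape
    c_out, _, s, _ = weight.shape
    r = s // 2
    v_pad = np.pad(v, ((0, 0), (r, r), (r, r)))
    f_pad = np.pad(f, ((0, 0), (r, r), (r, r)))
    out = np.zeros((c_out, H, W))
    for y in range(H):
        for x in range(W):
            v_win = v_pad[:, y:y + s, x:x + s]
            f_win = f_pad[:, y:y + s, x:x + s]
            # Adapting kernel: Gaussian on feature distance to the center pixel.
            diff = f_win - f[:, y, x][:, None, None]
            k = np.exp(-0.5 * (diff ** 2).sum(axis=0))
            # Shared weights are modulated tap by tap by the adapting kernel.
            out[:, y, x] = np.einsum("ocij,cij,ij->o", weight, v_win, k) + bias
    return out

# Usage: a 3x3 box filter on a constant input, guided by a feature edge.
v = np.ones((1, 5, 5))
f = np.zeros((1, 5, 5))
f[:, :, 3:] = 100.0
box = np.ones((1, 1, 3, 3))
res = pac_filter(v, f, box, np.zeros(1))
```

At pixel `(2, 2)` the three taps that fall across the feature edge get a kernel value of essentially zero, so the box filter sums six neighbors instead of nine. With constant adapting features the kernel is one everywhere and PAC reduces to a standard convolution.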

Figure 3: Integrating the SGA module. The bottom branch shows the guidance network which generates aggregation weights $w$ in four directions. The top branch illustrates SGA, which iteratively aggregates the cost volume $\mathbf{C}$ by traversing from pixel $\mathbf{p} - \mathbf{r}$ to $\mathbf{p}$ in a direction $\mathbf{r}$, over the entire image and each disparity $d$. The maximum response among the four directions is selected as the output $\mathbf{C}^A$.

3.2.4 Semi-Global Aggregation Module

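In contrast to the above filters that leverage local context, Semi-Global Aggregation (SGA) [45] iteratively aggregates the cost volume considering both pixelwise costs and pairwise smoothness constraints in four directions. The constraints are originally defined and approximately solved by Semi-Global Matching (SGM) [20] as an energy function of the disparity map. SGA learns to aggregate the cost volume to approximately minimize the energy, and also supports backpropagation for end-to-end stereo matching, as shown in Fig. 3. The aggregated cost volume $\mathbf{C}^A_{\mathbf{r}}$ is recursively defined as:

$$\mathbf{C}^A_{\mathbf{r}}(\mathbf{p}, d) = w_0(\mathbf{p}, \mathbf{r})\,\mathbf{C}(\mathbf{p}, d) + w_1(\mathbf{p}, \mathbf{r})\,\mathbf{C}^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r}, d) + w_2(\mathbf{p}, \mathbf{r})\,\mathbf{C}^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r}, d-1) + w_3(\mathbf{p}, \mathbf{r})\,\mathbf{C}^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r}, d+1) + w_4(\mathbf{p}, \mathbf{r})\,\max_{i}\mathbf{C}^A_{\mathbf{r}}(\mathbf{p}-\mathbf{r}, i) \tag{8}$$

where $\mathbf{r}$ is a unit direction vector along which the cost of pixel $\mathbf{p}$ at disparity $d$ is aggregated. The weights $w_i(\mathbf{p}, \mathbf{r})$ are predicted by a guidance sub-network and normalized (s.t. $\sum_{i=0}^{4} w_i(\mathbf{p}, \mathbf{r}) = 1$) to avoid very large accumulated values when traversing along the path. The final aggregated cost $\mathbf{C}^A$ is obtained by picking the maximum among four directions, namely, left, right, up and down, i.e., $\mathbf{r} \in \{(-1, 0), (1, 0), (0, -1), (0, 1)\}$, defined as:

$$\mathbf{C}^A(\mathbf{p}, d) = \max_{\mathbf{r}}\ \mathbf{C}^A_{\mathbf{r}}(\mathbf{p}, d) \tag{9}$$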
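The recursive aggregation can be sketched in NumPy as below. The sketch assumes the five-term form used by GANet's SGA layer (current cost, previous aggregated cost at the same disparity, at the two neighboring disparities, and the maximum over disparities), with weights that are already normalized to sum to one. Clamping at the disparity borders and the array layout are choices made for this illustration; the guidance sub-network that predicts the weights is not modeled.

```python
import numpy as np

def sga_left_to_right(cost, weights):
    """Aggregate a (D, H, W) cost volume along rows, from left to right.

    weights: (5, H, W), assumed non-negative and summing to 1 over axis 0.
    """
    D, H, W = cost.shape
    agg = np.empty_like(cost, dtype=np.float64)
    agg[:, :, 0] = cost[:, :, 0]
    for x in range(1, W):
        prev = agg[:, :, x - 1]                        # (D, H)
        prev_minus = np.vstack([prev[:1], prev[:-1]])  # disparity d-1, clamped
        prev_plus = np.vstack([prev[1:], prev[-1:]])   # disparity d+1, clamped
        prev_max = prev.max(axis=0, keepdims=True)     # max over disparities
        w = weights[:, :, x]                           # (5, H)
        agg[:, :, x] = (w[0] * cost[:, :, x] + w[1] * prev
                        + w[2] * prev_minus + w[3] * prev_plus
                        + w[4] * prev_max)
    return agg

def sga(cost, weights):
    """Run four directional passes and keep the maximum response.

    weights: (4, 5, H, W), one weight set per direction.
    """
    flip = lambda a: a[:, :, ::-1]
    swap = lambda a: a.swapaxes(1, 2)
    outs = [
        sga_left_to_right(cost, weights[0]),
        flip(sga_left_to_right(flip(cost), flip(weights[1]))),
        swap(sga_left_to_right(swap(cost), swap(weights[2]))),
        swap(flip(sga_left_to_right(flip(swap(cost)), flip(swap(weights[3]))))),
    ]
    return np.maximum.reduce(outs)

# Usage with random normalized weights on a constant cost volume.
rng = np.random.default_rng(3)
w = rng.random((4, 5, 3, 6))
w /= w.sum(axis=1, keepdims=True)
agg_const = sga(np.full((4, 3, 6), 2.0), w)
```

Since the weights form a convex combination, a constant volume is left unchanged, which illustrates why the normalization prevents accumulated values from growing along the path.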

Figure 4: (a) PAC filters the cost volume slice $\mathbf{C}_d$ by multiplying it with a standard spatially invariant convolutional filter $\mathbf{W}$ and a spatially varying filter $K$ that depends on input pixel features $\mathbf{f}$. The output is the filtered slice $\tilde{\mathbf{C}}_d$. (b) DFN filters $\mathcal{F}_\theta$ are dynamically generated by the filter-generating network conditioned on left pixel features $\mathbf{F}_L$. $\mathbf{C}_d$ is filtered by $\mathcal{F}_\theta$ to output $\tilde{\mathbf{C}}_d$. (c) Overview of network architecture: a stereo pair is fed into a weight-sharing Siamese sub-network for feature learning; the extracted features of each view are correlated or concatenated to generate the cost volume $\mathbf{C}$; content-adaptive filtering modules are applied to $\mathbf{C}$ and output the filtered volume $\tilde{\mathbf{C}}$, followed by state-of-the-art encoder-decoder cost aggregation and disparity regression, as shown in two branches for 2D and 3D convolutional architectures.

3.3 Network Architecture

As illustrated in Fig. 4, our architecture takes as backbones four state-of-the-art 2D and 3D CNNs for stereo matching, i.e., DispNetC [29], GCNet [23], PSMNet [7] and GANet [45], and adapts SABF (Fig. 2), PAC (Fig. 4(a)), DFN (Fig. 4(b)) and SGA (Fig. 3), to aggregate the cost volume.

3.3.1 Network Backbones

Given a rectified stereo pair, the backbone architectures described below include unary feature extraction from the images by a weight-sharing Siamese structure, cost volume computation and regularization, and disparity regression. Unary features are denoted by . 2D or 3D convolutional and transpose convolutional layers have or kernels, unless otherwise specified.

DispNetC Backbone.   DispNetC [29] is an encoder-decoder architecture with an explicit 1D correlation layer. The encoder downsamples the input images via convolution by up to , while the decoder gradually upsamples the feature maps via transposed convolution at 6 different scales ranging from to . Features (, ) are correlated by a 1D correlation layer [13] to form a 3D cost volume () with a maximum disparity of . is further contracted to resolution and expanded by alternate transpose convolutions and loss layers. There are six intermediate disparity maps which are all interpolated to the input resolution. The loss is computed on all six disparity maps, but only the last one is used as output.

GCNet Backbone.   GCNet [23] is a 3D CNN architecture that models geometry in stereo matching. We add an initial bilinear interpolation layer to downsample the input stereo pair to half resolution. GCNet extracts (, ) via a convolution layer and eight residual blocks [19] and then builds a 4D cost volume () by concatenating with its counterpart from the other view across each disparity level. is downsampled by up to via four encoding blocks (each with three convolutions with strides equal to , , and ) and upsampled by via five decoding blocks (each has a skip-connection from the early layer and one transposed convolution with stride ), resulting in an aggregated () volume. It is interpolated to input resolution and regressed to predict the disparity map via the differentiable soft argmin:


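$$\hat{d} = \sum_{d=0}^{D-1} d \cdot \sigma(-c_d) \tag{10}$$

where the cost $c_d$ is first converted to a probability of each disparity value $d$ via the softmax operation $\sigma(\cdot)$. The disparity map $\hat{d}$ comprises the expected values of $d$ for each pixel.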
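The soft argmin can be written in a few lines of NumPy. This sketch assumes the GCNet convention in which lower cost means a better match, so the cost is negated before the softmax; networks that treat the volume as a similarity score would omit the negation. The `(D, H, W)` layout is an assumption of the example.

```python
import numpy as np

def soft_argmin(cost):
    """Differentiable disparity regression from a (D, H, W) cost volume.

    Assumes lower cost = better match (negated before the softmax).
    """
    logits = -cost
    # Subtract the per-pixel maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    disparities = np.arange(cost.shape[0]).reshape(-1, 1, 1)
    # Expected disparity under the per-pixel probability distribution.
    return (prob * disparities).sum(axis=0)

# Usage: a sharp minimum at d = 3 versus a flat (uninformative) cost.
sharp = np.full((5, 2, 2), 10.0)
sharp[3] = -50.0
disp_sharp = soft_argmin(sharp)
disp_flat = soft_argmin(np.zeros((5, 2, 2)))
```

A flat cost yields the mean of the disparity range rather than a valid match, which is one reason cost aggregation before regression matters.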

PSMNet Backbone.   PSMNet [7] learns (, ) via three convolution layers followed by four basic residual blocks [19]. is further processed separately by spatial pyramid pooling (SPP), a convolution, bilinear interpolation and feature concatenation. This generates the SPP features (, ) used to construct a 4D () cost volume . is regularized by a stacked hourglass with three encoding-decoding blocks. Each hourglass generates a regularized volume, from which an intermediate disparity is obtained via the soft argmin (Eq. 10). The training loss is computed on all three intermediate disparities.

GANet Backbone.   GANet [45] introduces local guided aggregation (LGA) and semi-global aggregation (SGA) layers which are complementary to 3D convolutional layers. The unary features (, ) are extracted through a stacked hourglass CNN with skip connections. They are then used to construct a 4D cost volume (), which is fed into a cost aggregation block (built of alternate 3D convolution and transpose convolution layers, SGA and LGA) for regularization, refinement and disparity regression via the soft argmin (Eq. 10). The guidance sub-network generates the weight matrices for SGA (Sec. 3.2) and LGA. The LGA layer is used before disparity regression and locally refines the 4D cost volume for several times. We adopt the GANet-deep version which achieves the best accuracy among all variants.

Network Capacity.  Network capacities are listed in Table 1. The number of parameters of the four backbone networks (i.e., row W/O, meaning no filters applied) increases in the order GCNet < PSMNet < GANet < DispNetC. The order of the filtering modules according to increasing number of parameters is PAC < DFN < SABF < SGA.

3.3.2 Loss Function

The smooth $L_1$ loss function (see the supplement) is used for end-to-end training. The GCNet backbone has one disparity output, while the other backbones produce multiple intermediate disparity maps. The network loss is their weighted average. The corresponding weights are defined as 1) GANet: , , and ; 2) PSMNet: , , and ; 3) DispNetC: , , , , , and as training starts on Scene Flow, while in finetuning, only the final disparity is activated. When integrating the SABF filter into the backbones, the embedding loss in Eq. 3 is added with a weight of .
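A minimal sketch of a weighted multi-output loss follows. It assumes the standard smooth L1 definition with a transition point at 1, evaluates only pixels with ground truth, and normalizes by the sum of the weights to obtain a weighted average (some implementations use an unnormalized weighted sum instead). The weights in the usage example are arbitrary placeholders, not the values used for any backbone.

```python
import numpy as np

def smooth_l1(pred, gt, valid):
    """Mean smooth L1 loss over valid pixels (transition point at 1)."""
    diff = np.abs(pred - gt)[valid]
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

def multi_output_loss(preds, gt, valid, weights):
    """Weighted average of smooth L1 losses over several disparity outputs."""
    total = sum(w * smooth_l1(p, gt, valid) for w, p in zip(weights, preds))
    return total / sum(weights)

# Usage with two intermediate outputs and placeholder weights.
gt = np.array([[10.0, 20.0], [30.0, 0.0]])
valid = gt > 0                      # pixels without ground truth are skipped
coarse = gt + 2.0                   # larger error on the early output
fine = gt + 0.5                     # smaller error on the final output
loss = multi_output_loss([coarse, fine], gt, valid, weights=(1.0, 3.0))
```

The quadratic region near zero keeps gradients small for nearly correct pixels, while the linear region limits the influence of outliers.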


| Filters | DispNetC No. | DispNetC (%) | PSMNet No. | PSMNet (%) | GANet No. | GANet (%) | GCNet No. | GCNet (%) |
|---|---|---|---|---|---|---|---|---|
| W/O | 42.2 | - | 5.2 | - | 6.6 | - | 2.8 | - |
| SABF | 44.0 | 4.2 | 7.0 | 34 | 8.4 | 27 | 4.6 | 62.4 |
| DFN | 42.6 | 0.8 | 5.6 | 6.4 | 6.9 | 5.1 | 3.2 | 11.8 |
| PAC | 42.3 | 0.1 | 5.3 | 2.0 | 6.7 | 1.6 | 2.9 | 3.6 |
| SGA | 45.2 | 7.0 | 8.3 | 58.8 | - | - | 5.9 | 108 |

Table 1: Network capacity. For each combination of the filtering techniques and network backbones, columns No. show the number of parameters in millions, and columns (%) are the relative increase in the number of parameters w.r.t. the backbone baselines (row W/O). Inapplicable entries are marked by “-”.

4 Experiments

4.1 Datasets

Our networks are trained from scratch on Scene Flow [29], then finetuned on Virtual KITTI 2 (VKT2) [5, 15] or KITTI 2015 (KT15) [30]. Scene Flow is a large-scale synthetic dataset containing training and testing images with dense ground truth disparity maps. We exclude the pixels with disparities in training. Virtual KITTI 2 (VKT2) is a synthetic clone of the real KITTI dataset. It contains five sequence clones of Scene 01, 02, 06, 18 and 20, and nine variants with diverse weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by , ). Since there is no designated validation set, we select (i) Scene06 (i.e., VKT2-Val-S6 with frames) and (ii) multiple blocks of consecutive frames from Scene 01, 02, 18, 20 (i.e., VKT2-Val-WoS6 with frames), and use the remaining images for training. The former scene remains unseen during training, while networks observe similar frames to the latter. We evaluate these settings separately. KITTI 2015 (KT15) is a real dataset of street views. It contains training stereo image pairs with sparsely labeled disparity from LiDAR data. We divided the training data into a training set ( images) and a validation set ( images) for our experiments.

Metrics.   For KT15, we adopt the bad-3 error (i.e., percentage of pixels with disparity error px or of the true disparity) counted over non-occluded (noc) or all pixels with ground truth. For VKT2, in addition to bad-3, we also use the endpoint error (EPE) and bad-1 error (px or ) over all pixels.
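For reference, both error measures can be computed as in the sketch below. The thresholds are passed as arguments; the values in the usage example (3 px with 5%, and 1 px) follow the public KITTI 2015 development kit convention, under which a pixel counts as an outlier only if it exceeds both the absolute and the relative threshold. How the two thresholds are combined here is therefore an assumption based on that convention.

```python
import numpy as np

def epe(pred, gt, valid):
    """Endpoint error: mean absolute disparity error over valid pixels."""
    return np.abs(pred - gt)[valid].mean()

def bad_k(pred, gt, valid, px, rel=None):
    """Percentage of valid pixels whose error exceeds px pixels.

    If rel is given, the error must also exceed rel * |gt| (KITTI-style).
    """
    err = np.abs(pred - gt)[valid]
    bad = err > px
    if rel is not None:
        bad &= err > rel * np.abs(gt[valid])
    return 100.0 * bad.mean()

# Usage on four pixels with errors of 0.5, 4, 0 and 4 px.
gt = np.array([10.0, 100.0, 50.0, 20.0])
pred = np.array([10.5, 104.0, 50.0, 24.0])
valid = np.ones(4, dtype=bool)
epe_val = epe(pred, gt, valid)
bad3 = bad_k(pred, gt, valid, px=3.0, rel=0.05)
bad1 = bad_k(pred, gt, valid, px=1.0)
```

The second pixel has a 4 px error on a true disparity of 100, which is below 5% of the ground truth, so it is not counted as an outlier by the combined criterion.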


Seen locations (i.e., Scene01, 02, 18 and 20) from VKT2 validation set VKT2-Val-WoS6:

| Filters | DispNetC EPE (px) | DispNetC bad-1 (%) | DispNetC bad-3 (%) | PSMNet EPE (px) | PSMNet bad-1 (%) | PSMNet bad-3 (%) | GANet EPE (px) | GANet bad-1 (%) | GANet bad-3 (%) | GCNet EPE (px) | GCNet bad-1 (%) | GCNet bad-3 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W/O | 0.68 | 11.54 | 3.72 | 0.45 | 7.08 | 2.21 | 0.33 | 5.34 | 1.70 | 0.62 | 9.71 | 3.18 |
| SABF | 0.65* | 10.86* | 3.48* | 0.36* | 5.92* | 1.98* | 0.34 | 5.50 | 1.76 | 0.60* | 9.89 | 3.19 |
| DFN | 0.57* | 9.85* | 3.26* | 0.42* | 6.45* | 2.17* | 0.37 | 6.23 | 1.99 | 0.60* | 9.20* | 3.11* |
| PAC | 0.58* | 9.92* | 3.39* | 0.52 | 7.81 | 2.61 | 0.40 | 7.01 | 2.20 | 0.75 | 12.98 | 4.02 |
| SGA | 0.57* | 9.37* | 3.21* | 0.40* | 6.08* | 2.14* | - | - | - | 0.55* | 9.24* | 2.98* |

Totally unseen location (i.e., Scene06) from VKT2 validation set VKT2-Val-S6:

| Filters | DispNetC EPE (px) | DispNetC bad-1 (%) | DispNetC bad-3 (%) | PSMNet EPE (px) | PSMNet bad-1 (%) | PSMNet bad-3 (%) | GANet EPE (px) | GANet bad-1 (%) | GANet bad-3 (%) | GCNet EPE (px) | GCNet bad-1 (%) | GCNet bad-3 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W/O | 0.70 | 10.28 | 3.12 | 0.48 | 5.16 | 1.96 | 0.30 | 3.09 | 1.0563 | 0.59 | 7.48 | 2.25 |
| SABF | 0.69* | 9.75* | 3.00* | 0.44* | 4.47* | 1.73* | 0.28* | 3.16 | 0.97* | 0.56* | 7.51 | 2.23* |
| DFN | 0.599* | 8.54* | 2.791* | 0.39* | 4.83* | 1.69* | 0.29* | 3.54 | 1.0561* | 0.55* | 6.81* | 2.14* |
| PAC | 0.603* | 8.73* | 2.96* | 0.52 | 5.78 | 1.98 | 0.35 | 4.36 | 1.47 | 0.73 | 11.87 | 2.99 |
| SGA | 0.607* | 8.02* | 2.794* | 0.42* | 4.34* | 1.71* | - | - | - | 0.53* | 7.45* | 2.29 |

Table 2: Evaluation on the validation sets of VKT2-Val-WoS6 and VKT2-Val-S6. Results are shown for each combination of filters (rows) and backbones (columns) w.r.t. the baselines (rows W/O). Results that improve on the baseline are marked with *. GANet already contains SGA, resulting in blank entries “-”.
Figure 5: Results of (a) DispNetC, (b) PSMNet, (c) GCNet and (d) GANet on each category of validation set VKT2-Val-S6.

4.2 Implementation Details

Architecture Details.   Our models are implemented with PyTorch. Each convolution layer is followed by batch normalization (BN) and ReLU unless otherwise specified. We use the official code of PSMNet and GANet, and implement DispNetC (no PyTorch version) and GCNet (no official code). Our implementations achieve similar or better results on the KT15 benchmark, i.e., bad-3 errors (all) and (noc) versus the authors’ (all) and (noc) for DispNetC, and (all) and (noc) versus the authors’ (all) and (noc) for GCNet. We use and in Eq. 4 for SABF. We investigate how the filter window and dilation rate affect the filtering output, and find that with achieve a good balance in accuracy, space requirements and runtime. Please see the supplementary material for ablation studies.

Training Details.   The SABF embedding network is pre-trained on Cityscapes [11]. For data augmentation we randomly crop image patches and do channel-wise standardization by subtracting the mean and dividing by the standard deviation. For fair comparison, we follow the baselines and set the maximum disparity to . We use the official pre-trained models on Scene Flow for GANet and PSMNet, and train DispNetC and GCNet on Scene Flow from scratch for epochs with a learning rate (lr) of . Training is optimized with Adam (, ), except for GCNet which uses RMSprop (). For KT15, our models and baselines are finetuned for 600 epochs (with lr of 0.001 for the first epochs and 0.0001 for the next epochs). For VKT2, we finetune all algorithms for epochs, with lr of 0.001 at first, then divided by 10 at epoch 5 and epoch 18.
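The step schedule described for VKT2 finetuning can be expressed as a small function. This is a sketch under the assumption that epochs are counted from zero and that each drop takes effect at the start of the named epoch; the function name `vkt2_learning_rate` is illustrative.

```python
import math

def vkt2_learning_rate(epoch, base_lr=1e-3, milestones=(5, 18), gamma=0.1):
    """Step schedule: start at base_lr and divide by 10 at each milestone.

    Zero-based epoch counting is an assumption of this sketch.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Usage: learning rate before, between and after the two milestones.
schedule = [vkt2_learning_rate(e) for e in (0, 4, 5, 17, 18)]
```

In PyTorch, the same behavior is available through `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[5, 18]` and `gamma=0.1`.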

4.3 Quantitative Results

All the results are on the validation sets since we could not submit all combinations to the benchmarks. Due to page limits, qualitative results are available in the supplement.

Virtual KITTI 2 Evaluation.  We finetune our models (the backbones pre-trained on Scene Flow, after integrating the filters) on the VKT2 dataset. The errors (EPE, bad-1 and bad-3) on the VKT2-Val-WoS6 and VKT2-Val-S6 validation sets are listed in Table 2. In most cases, our models (the filter rows) achieve higher accuracy than the baselines (rows W/O). The exception is standard GANet, which already performs well since it always includes SGA and LGA for global and local matching volume aggregation. Fig. 5 plots EPE on individual categories of VKT2-Val-S6. SABF, DFN and SGA boost 2D and 3D CNNs, while PAC improves 2D CNNs, but not 3D CNNs. When moving from familiar (VKT2-Val-WoS6) to unseen (VKT2-Val-S6) validation scenes, DFN achieves better adaptation than SGA, and the SABF and DFN variants outperform standard GANet in EPE and bad-3 error.

KITTI 2015 Evaluation.  The results on KT15 are shown in Table 3. Our networks obtain improved accuracy for all the backbones except for GANet. All the filters boost the other three backbones by leveraging image context as guidance, and SGA achieves the highest accuracy among them.


| Filters | DispNetC noc | DispNetC all | PSMNet noc | PSMNet all | GANet noc | GANet all | GCNet noc | GCNet all |
|---|---|---|---|---|---|---|---|---|
| W/O | 2.59 | 3.02 | 1.46 | 1.60 | 0.97 | 1.10 | 2.06 | 2.64 |
| SABF | 2.26* | 2.63* | 1.28* | 1.40* | 1.07 | 1.17 | 1.76* | 2.10* |
| DFN | 2.37* | 2.78* | 1.23* | 1.34* | 0.99 | 1.11 | 1.70* | 2.08* |
| PAC | 2.38* | 2.72* | 1.29* | 1.48* | 1.13 | 1.23 | 1.71* | 2.03* |
| SGA | 1.90* | 2.18* | 1.17* | 1.32* | - | - | 1.69* | 1.91* |

Table 3: KITTI 2015 bad-3 validation results (%). Results that improve on the baseline are marked with *. GANet contains SGA, resulting in blank entries “-”.

Runtime.   In Table 4, we compare the GPU memory consumption and runtime in inference mode on pairs of frames with dimension . All experiments are run on the same machine, with the same configuration of disparity range , filter size and dilation rate .

Comparison and Summary.   Our results show that most architectures benefit from adaptive filtering, with the exception of GANet, which already includes such filtering in the form of SGA. It is worth pointing out that GANet has the largest memory footprint and the longest runtime among the backbones (Table 4), and more parameters than PSMNet and GCNet (Table 1). Lighter architectures, e.g. PSMNet+SABF or PSMNet+DFN, can achieve competitive results. DispNetC trails in terms of accuracy, but has about 20% of the memory footprint of GANet, and its combination with DFN strikes a good balance between accuracy and processing requirements.

5 Conclusions

We demonstrate how deep adaptive or guided filtering can be integrated into representative 2D and 3D CNNs for stereo matching with improved accuracy. Our extensive experimental results on Virtual KITTI 2 and KITTI 2015 highlight how our filtering modules effectively leverage the RGB information to dynamically guide the matching, resulting in further progress in stereo matching. SGA, a component of GANet, is the most effective filtering mechanism and improves every backbone it is added to. More broadly, our work shows that current state-of-the-art methods do not take full advantage of available information, with the exception of GANet, which shows superior performance due to SGA and LGA but has more parameters than GCNet and PSMNet. Integrating even the smaller filtering modules leads to decreases in error on the order of 10%.


| Filters | DispNetC Mem. | DispNetC Time | PSMNet Mem. | PSMNet Time | GANet Mem. | GANet Time | GCNet Mem. | GCNet Time |
|---|---|---|---|---|---|---|---|---|
| W/O | 1394 | 18.35 | 5151 | 315.57 | 7178 | 1894.70 | 4280 | 146.83 |
| SABF | 1888 | 24.32 | 5386 | 563.42 | 7920 | 2488.72 | 4424 | 379.37 |
| DFN | 1422 | 28.33 | 5246 | 432.32 | 7466 | 2041.53 | 4298 | 255.20 |
| PAC | 1535 | 25.34 | 5168 | 514.91 | 8274 | 2383.44 | 4400 | 334.73 |
| SGA | 7066 | 489.60 | 11070 | 823.00 | - | - | 9916 | 655.18 |

Table 4: GPU memory consumption (Mem., in MiB) and runtime (Time, in ms). Each filter (rows) is combined with each backbone (columns) and compared against the baselines (row W/O). GANet already contains SGA, resulting in blank entries “-”.

Acknowledgements.   This research has been partially supported by the National Science Foundation under Awards IIS-1527294 and IIS-1637761.


In this supplementary material, we present the loss function, more details of the filtering modules in the network, an ablation study, and additional qualitative results, some of which are mentioned but not fully discussed in the main paper due to the page limit.

Loss Function.   The loss is evaluated over the valid pixels which have ground truth disparity. We adopt the smooth loss function for end-to-end training:


where counts the valid pixels, measures the absolute error of disparity prediction and ground truth .
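As an illustration of the loss described above, the following is a minimal sketch that assumes the "smooth" loss is the smooth L1 loss commonly used for disparity regression: the mean, over the N valid pixels, of a per-pixel penalty that is quadratic for absolute errors below 1 and linear otherwise. The function names, the `beta` threshold and the flat-list layout are assumptions made for this example and are not taken from the authors' code.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= beta."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def masked_smooth_l1_loss(pred, gt, valid):
    """Average smooth L1 penalty over pixels that have ground-truth disparity.

    pred, gt: flat sequences of predicted / ground-truth disparities.
    valid:    flat sequence of booleans marking pixels with ground truth.
    """
    terms = [smooth_l1(p - g) for p, g, v in zip(pred, gt, valid) if v]
    if not terms:
        return 0.0  # no supervised pixel in this sample
    return sum(terms) / len(terms)


# Toy data (made up for illustration): the last pixel is excluded from the loss.
gt = [10.0, 20.0, 30.0, 0.0]
pred = [10.5, 24.5, 29.0, 7.0]
valid = [True, True, True, False]

toy_loss = masked_smooth_l1_loss(pred, gt, valid)  # (0.125 + 4.0 + 0.5) / 3
```

The masking step is what restricts the loss to valid pixels; the large error at the fourth pixel does not contribute because it has no ground truth.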

Deep Adaptive Filtering Architectures.   The integrated deep adaptive or guided filters include segmentation-aware bilateral filtering (SABF) [17], dynamic filtering networks (DFN) [22], pixel adaptive convolution (PAC) [34] and semi-global aggregation (SGA). Please see our source code for the detailed architectures. Fig. 6 shows the embedding network of SABF.

Figure 6: The embedding network in the segmentation-aware bilateral filtering (SABF) module. Each convolutional layer is followed by a ReLU layer which is not drawn. Convolutional layers, e.g. conv1_0, are defined by as filter, as in/out feature planes, and as the output dimension). Max pooling layers are implemented as filter with stride of for downsampling (i.e., ).

Ablation Study.    We perform ablation studies to investigate how the filter window and the dilation rate can affect the filtering output and disparity estimation. Out of a large number of possible combinations, we show two representative ones, DispNetC+SABF and PSMNet+SABF, in Table 5. We find that with achieve a good balance in accuracy, space and runtime. The GPU memory consumption and runtime, averaged over 500 runs, are measured when we test a stereo pair, and the bad-3 (noc,all) errors are evaluated on the KITTI 2015 validation set. Please note that a filter (with dilation rate ) covers regions in the cost feature space, which is equivalent to regions in the RGB image space, due to the cost volume being a quarter of the size of the input images. In the following experiments, we keep using with for our different architectures. Please note that the results in Table 5 are obtained on a 16-core Intel Core i7-9800X CPU at 3.80GHz, and an NVIDIA TITAN Xp GPU with 12GB of RAM.
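The relation between filter window, dilation rate and covered region can be sketched as follows. The example uses the standard formula for the spatial extent of a dilated filter and the factor of 4 implied by the cost volume being a quarter of the input size. The function names and the concrete values (a 5-tap window with dilation 2) are hypothetical and chosen only for illustration; they are not the settings used in the experiments.

```python
def dilated_extent(kernel_size, dilation):
    """Side length (in cost-feature pixels) covered by a dilated square filter."""
    return dilation * (kernel_size - 1) + 1


def image_space_extent(kernel_size, dilation, downsample=4):
    """Approximate side length covered in the RGB image.

    downsample=4 reflects a cost volume at a quarter of the input resolution.
    """
    return downsample * dilated_extent(kernel_size, dilation)


# Hypothetical setting: 5x5 window with dilation rate 2.
feature_side = dilated_extent(5, 2)      # 9 pixels per side in the cost feature space
image_side = image_space_extent(5, 2)    # 36 pixels per side in the RGB image space
```

Under these assumptions, increasing either the window or the dilation rate enlarges the image region that guides each cost-volume location, which is the trade-off against memory and runtime explored in Table 5.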

Effectiveness Comparison.    In Table 7, we further compare the effectiveness of those variants in terms of parameter increase (column P(%)) and error decrease (column E(%)), evaluated on the KITTI 2015 validation set using the results in Table 6. Please note that Table 6 is originally included in the main paper as Table 3, and we repeat it here to explain the effectiveness figures in Table 7.


| filter size | DispNetC+SABF EPE(px) | bad-3(%) | EPE(px) | bad-3(%) | Mem. (MiB) | Time (ms) | PSMNet+SABF EPE(px) | bad-3(%) | EPE(px) | bad-3(%) | Mem. (MiB) | Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.875 | 3.14 | 0.845 | 2.99 | 2132 | 45.26 | 0.639 | 1.63 | 0.643 | 1.61 | 4681 | 449.40 |
| | 0.867 | 3.13 | 0.841 | 2.90 | 2228 | 48.99 | 0.657 | 1.54 | 0.630 | 1.46 | 4989 | 633.42 |
| | 0.832 | 2.83 | 0.795 | 2.46 | 2588 | 54.21 | 0.650 | 1.50 | 0.642 | 1.54 | 4709 | 939.40 |
| | 0.825 | 2.84 | 0.854 | 3.00 | 3008 | 60.56 | 0.868 | 1.77 | 0.689 | 1.91 | 4953 | 1226.72 |

Table 5: Illustration of the effects of different filter window sizes and dilation rates. We compute the bad-3 (noc,all) errors on the KITTI 2015 validation set and the averaged GPU memory consumption and runtime to test a pair of frames with dimension .


| Filters | DispNetC noc | DispNetC all | PSMNet noc | PSMNet all | GANet noc | GANet all | GCNet noc | GCNet all |
|---|---|---|---|---|---|---|---|---|
| W/O | 2.59 | 3.02 | 1.46 | 1.60 | 0.97 | 1.10 | 2.06 | 2.64 |
| SABF | 2.26* | 2.63* | 1.28* | 1.40* | 1.07 | 1.17 | 1.76* | 2.10* |
| DFN | 2.37* | 2.78* | 1.23* | 1.34* | 0.99 | 1.11 | 1.70* | 2.08* |
| PAC | 2.38* | 2.72* | 1.29* | 1.48* | 1.13 | 1.23 | 1.71* | 2.03* |
| SGA | 1.90* | 2.18* | 1.17* | 1.32* | - | - | 1.69* | 1.91* |

Table 6: KITTI 2015 bad-3 validation results (%). Results that improve on the baseline (row W/O) are marked with *. GANet contains SGA, resulting in blank entries “-”.


| Filters | DispNetC E(%) | DispNetC P(%) | PSMNet E(%) | PSMNet P(%) | GANet E(%) | GANet P(%) | GCNet E(%) | GCNet P(%) |
|---|---|---|---|---|---|---|---|---|
| SABF | 12.9 | 4.2 | 12.4 | 34 | -5.9 | 27 | 20.6 | 62.4 |
| DFN | 7.9 | 0.8 | 16.2 | 6.4 | -0.1 | 5.1 | 21.5 | 11.8 |
| PAC | 9.9 | 0.1 | 7.8 | 2.0 | -12 | 1.6 | 23.3 | 3.6 |
| SGA | 27.8 | 7.0 | 17.7 | 58.8 | - | - | 27.7 | 108 |

Table 7: Effectiveness comparison on the KITTI 2015 val-30 dataset. For each combination of network backbone and filtering, columns E(%) and P(%) indicate the relative decrease of error and increase of the number of parameters, respectively, w.r.t. the backbone baselines.
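The E(%) column can be related to the bad-3 errors with a short calculation. The sketch below assumes that E(%) is the relative decrease of the "all" bad-3 error with respect to the W/O baseline; this is an inference from the numbers rather than something the text states. Under that assumption the DispNetC entries are reproduced from the rounded values of Table 6, while the other backbones differ by a few tenths of a percent, presumably because the original figures were computed from unrounded errors. The function name is an illustrative choice.

```python
def relative_decrease(baseline, value):
    """Relative decrease (in percent) of `value` with respect to `baseline`."""
    return 100.0 * (baseline - value) / baseline


# DispNetC "all" bad-3 errors (%) taken from Table 6.
baseline_all = 3.02
filtered_all = {"SABF": 2.63, "DFN": 2.78, "PAC": 2.72, "SGA": 2.18}

# Assumed reconstruction of the DispNetC E(%) column, rounded to one decimal.
e_column = {
    name: round(relative_decrease(baseline_all, err), 1)
    for name, err in filtered_all.items()
}
```

A negative value, as for GANet in Table 7, simply means that the error increased after adding the filter. The P(%) column would follow the same pattern with parameter counts, which are not listed here.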

Qualitative Results.    In Figs. 7 to 10, we show reference images and disparity maps generated by each backbone without modifications and by the same backbone after integrating one of the filtering techniques.

















Figure 7: Results using DispNetC as backbone. (a) DispNetC vs DispNetC+PAC on Virtual KITTI 2 Scene06 validation set. (b) DispNetC vs DispNetC+SABF on KITTI 2015 validation set. In all rows, the left image is the reference image of the stereo pair, the middle image is the disparity map from the unmodified backbone, and the right image is the disparity map of the backbone with the integrated filter.
















Figure 8: Results using GCNet as backbone. (a) GCNet vs GCNet+SGA on Virtual KITTI 2 Scene06 validation set. (b) GCNet vs GCNet+SGA on KITTI 2015 validation set. In all rows, the left image is the reference image of the stereo pair, the middle image is the disparity map from the unmodified backbone, and the right image is the disparity map of the backbone with the integrated filter.
















Figure 9: Results using PSMNet as backbone. (a) PSMNet vs PSMNet+DFN on Virtual KITTI 2 Scene06 validation set. (b) PSMNet vs PSMNet+PAC on KITTI 2015 validation set. In all rows, the left image is the reference image of the stereo pair, the middle image is the disparity map from the unmodified backbone, and the right image is the disparity map of the backbone with the integrated filter.
















Figure 10: Results using GANet as backbone. (a) GANet vs GANet+SABF on Virtual KITTI 2 Scene06 validation set. (b) GANet vs GANet+PAC on KITTI 2015 validation set. In all rows, the left image is the reference image of the stereo pair, the middle image is the disparity map from the unmodified backbone, and the right image is the disparity map of the backbone with the integrated filter.


  1. V. Aurich and J. Weule (1995) Non-linear Gaussian filters performing edge preserving diffusion. In Mustererkennung 1995, pp. 538–545. Cited by: §2.
  2. S. P. Awate and R. T. Whitaker (2005) Higher-order image statistics for unsupervised, information-theoretic, adaptive, image filtering. In CVPR, Vol. 2, pp. 44–51. Cited by: §2.
  3. S. Bako, T. Vogels, B. McWilliams, M. Meyer, J. Novák, A. Harvill, P. Sen, T. Derose and F. Rousselle (2017) Kernel-predicting convolutional networks for denoising monte carlo renderings. ACM Trans. Graph. 36 (4), pp. 97–1. Cited by: §2.
  4. A. Buades, B. Coll and J. Morel (2005) A non-local algorithm for image denoising. In CVPR, Vol. 2, pp. 60–65. Cited by: §2.
  5. Y. Cabon, N. Murray and M. Humenberger (2020) Virtual KITTI 2. arXiv preprint arXiv:2001.10773. Cited by: §4.1.
  6. S. Chandra and I. Kokkinos (2016) Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, pp. 402–418. Cited by: §2.
  7. J. Chang and Y. Chen (2018) Pyramid stereo matching network. In CVPR, pp. 5410–5418. Cited by: §1, §1, §2, §3.1, §3.3, §3.3, §3.
  8. L. Chen, J. T. Barron, G. Papandreou, K. Murphy and A. L. Yuille (2016) Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, pp. 4545–4554. Cited by: §2.
  9. L. Chen, A. Schwing, A. Yuille and R. Urtasun (2015) Learning deep structured models. In ICML, pp. 1785–1794. Cited by: §2.
  10. S. Chopra, R. Hadsell and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546. Cited by: §3.2.
  11. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §4.2.
  12. J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei (2017-10) Deformable convolutional networks. In ICCV, Cited by: §2.
  13. A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766. Cited by: §3.3.
  14. R. Gadde, V. Jampani, M. Kiefel, D. Kappler and P. Gehler (2016) Superpixel convolutional networks using bilateral inceptions. In ECCV, pp. 597–613. Cited by: §2.
  15. A. Gaidon, Q. Wang, Y. Cabon and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §4.1.
  16. X. Guo, K. Yang, W. Yang, X. Wang and H. Li (2019) Group-wise correlation stereo network. In CVPR, Cited by: §2.
  17. A. W. Harley, K. G. Derpanis and I. Kokkinos (2017) Segmentation-aware convolutional networks using local attention masks. In ICCV, pp. 5038–5047. Cited by: §1, §2, §3.2, §3.2, §3, Do End-to-end Stereo Algorithms Under-utilize Information?.
  18. K. He, J. Sun and X. Tang (2012) Guided image filtering. PAMI 35 (6), pp. 1397–1409. Cited by: §2.
  19. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.3, §3.3.
  20. H. Hirschmüller (2008) Stereo processing by semiglobal matching and mutual information. PAMI 30 (2), pp. 328–341. Cited by: §1, §2, §3.2.
  21. V. Jampani, M. Kiefel and P. V. Gehler (2016) Learning sparse high dimensional filters: image filtering, dense crfs and bilateral neural networks. In CVPR, pp. 4452–4461. Cited by: §2.
  22. X. Jia, B. De Brabandere, T. Tuytelaars and L. V. Gool (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675. Cited by: §1, §2, §3.2, §3.2, §3.2, §3, Do End-to-end Stereo Algorithms Under-utilize Information?.
  23. A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In ICCV, pp. 66–75. Cited by: §1, §1, §2, §3.1, §3.3, §3.3, §3.
  24. P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with Gaussian edge potentials. In NIPS, pp. 109–117. Cited by: §2.
  25. Y. Li, J. Huang, A. Narendra and M. Yang (2016) Deep joint image filtering. In ECCV, pp. 154–169. Cited by: §2.
  26. Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou and J. Zhang (2018) Learning for disparity estimation through feature constancy. In CVPR, pp. 2811–2820. Cited by: §2, §3.1.
  27. G. Lin, C. Shen, A. Van Den Hengel and I. Reid (2016) Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, pp. 3194–3203. Cited by: §2.
  28. S. Liu, S. De Mello, J. Gu, G. Zhong, M. Yang and J. Kautz (2017) Learning affinity via spatial propagation networks. In NIPS, pp. 1520–1530. Cited by: §2.
  29. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pp. 4040–4048. Cited by: §1, §1, §2, §3.1, §3.3, §3.3, §3, §4.1.
  30. M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In CVPR, pp. 3061–3070. Cited by: §4.1.
  31. J. Pang, W. Sun, J. S. Ren, C. Yang and Q. Yan (2017-10) Cascade residual learning: a two-stage convolutional neural network for stereo matching. In ICCV Workshop on Geometry Meets Deep Learning, Cited by: §2.
  32. D. Scharstein and R. Szeliski (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47 (1-3), pp. 7–42. Cited by: §1.
  33. X. Song, X. Zhao, L. Fang and H. Hu (2020) EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. IJCV. Cited by: §2.
  34. H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller and J. Kautz (2019) Pixel-adaptive convolutional neural networks. In CVPR, Cited by: §1, §2, §3.2, §3, Do End-to-end Stereo Algorithms Under-utilize Information?.
  35. C. Tomasi and R. Manduchi (1998) Bilateral filtering for gray and color images. In Sixth international conference on computer vision (IEEE Cat. No. 98CH36271), pp. 839–846. Cited by: §2.
  36. A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia and L. Di Stefano (2019-06) Real-time self-adaptive deep stereo. In CVPR, Cited by: §2.
  37. X. Wang, R. Girshick, A. Gupta and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §2.
  38. H. Wu, S. Zheng, J. Zhang and K. Huang (2018) Fast end-to-end trainable guided filter. In CVPR, pp. 1838–1847. Cited by: §2.
  39. J. Wu, D. Li, Y. Yang, C. Bajaj and X. Ji (2018-09) Dynamic filtering with large sampling field for convnets. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  40. T. Xue, J. Wu, K. Bouman and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NIPS, pp. 91–99. Cited by: §2.
  41. G. Yang, H. Zhao, J. Shi, Z. Deng and J. Jia (2018) SegStereo: exploiting semantic information for disparity estimation. In ECCV, pp. 636–651. Cited by: §2.
  42. Z. Yin, T. Darrell and F. Yu (2019-06) Hierarchical discrete distribution decomposition for match density estimation. In CVPR, Cited by: §2.
  43. J. Žbontar and Y. LeCun (2016) Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17 (1-32), pp. 2. Cited by: §1.
  44. W. Zhan, X. Ou, Y. Yang and L. Chen (2019) DSNet: joint learning for scene segmentation and disparity estimation. In IEEE International Conference on Robotics and Automation, pp. 2946–2952. Cited by: §2.
  45. F. Zhang, V. Prisacariu, R. Yang and P. H. Torr (2019) GA-net: guided aggregation net for end-to-end stereo matching. In CVPR, Cited by: §1, §1, §2, §3.1, §3.2, §3.3, §3.3, §3.
  46. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang and P. H. S. Torr (2015) Conditional random fields as recurrent neural networks. In ICCV, Cited by: §2.
  47. X. Zhu, H. Hu, S. Lin and J. Dai (2019-06) Deformable convnets v2: more deformable, better results. In CVPR, Cited by: §2.