# Widening siamese architectures for stereo matching

###### Abstract

Computational stereo is one of the classical problems in computer vision. Numerous algorithms and solutions have been reported in recent years focusing on developing methods for computing similarity, aggregating it to obtain spatial support and finally optimizing an energy function to find the final disparity. In this paper, we focus on the feature extraction component of stereo matching architectures and we show that standard CNN operations can be used to improve the quality of the features used to find point correspondences. Furthermore, we propose a simple feature space aggregation that hugely simplifies the correlation learning problem. Our results on benchmark data are compelling and show promising potential even without refining the solution.

Centre for Medical Image Computing and Department of Computer Science, University College London, UK

## I Introduction

Computational stereo is one of the classical problems in computer vision, whereby two cameras placed at different viewpoints can be used to extract 3D information by analyzing the relative position of objects in the two perspectives of the scene. Finding relative displacements between image pairs from stereo cameras is usually called stereo matching [1, 2]. By using the fundamental constraints of the two-view geometry of two perspective cameras, it is possible to reduce the stereo matching problem to a 1D search space in horizontally rectified images. Despite the reduced search space, accurately finding stereo correspondences in real-world images is still very challenging because of occlusions, reflective surfaces, repetitive patterns, and textureless or low-detail regions, all of which can affect the similarity metric that underpins the search.

Since the first winning entry in the ImageNet Large Scale Visual Recognition Challenge, deep learning has been at the forefront of most computer vision breakthroughs [3]. Convolutional neural networks (CNNs) are able to learn very complex non-linear representations from raw visual data, creating effective and versatile models for complex problems. CNNs are now widely used across different vision problems and in a vast range of applications, such as robotics and medical endoscopic imaging. Deep learning models have also recently been applied to stereo matching and are now among the most accurate methods reported on the common public evaluation datasets [4, 5, 6].

One of the first successful uses of deep learning for stereo matching treats the problem as a binary classification [7], where CNNs are trained to recognize whether two input patches are centered around corresponding pixels. However, because each pixel is processed individually, and no spatial constraints are imposed on the decision, the resulting disparity map can be quite noisy. To mitigate noise, extensive post-processing steps are used to smooth the result using hand-crafted regularization functions. Several improvements have been reported since then, usually by stacking extra convolution layers after the feature extraction, allowing the CNNs to learn their own spatial regularization. The current top-ranked stereo method on the KITTI benchmark dataset also focuses on context and consistency by using a very deep end-to-end learned architecture with 3D convolutions that is able to infer disparity maps capable of beating any hand-crafted regularized method [8].

While spatial consistency is essential for good stereo matching, there has been limited focus on the quality of the high-level representation learned to match corresponding points. Several methods proposed different architectures, correlation operations or regularization approaches, but the majority of CNN stereo methods start with a relatively shallow siamese architecture that acts as a feature extractor for the stereo image pair. In this work, we take a step back from deep complex CNN architectures and focus on the type of features that are used to find correspondences. We propose the use of pooling and deconvolution operations in the siamese architecture, which allows the extraction of features with a wider receptive field around the target pixels. The intuition is that a wider context view allows the extraction of more visual cues, enabling better point correspondence. Furthermore, we propose a simple feature space transformation that significantly simplifies the learning problem, allowing the CNNs to learn end-to-end correlation with a very shallow architecture. Our main objective is to show that improvements can be achieved simply by enhancing the way stereo features are extracted and aggregated. Because siamese architectures are part of most matching CNNs available, this work can easily be combined with more complex approaches (hand-crafted or deep learned) present in the literature.

## II Related work

A huge range of approaches has been proposed to solve the stereo matching problem in the last decades. For the sake of brevity, we will focus on the work that exploits deep learning as a viable way to find point correspondences in image pairs [1, 2].

The introduction of large-scale, high-resolution datasets, such as KITTI [4, 5] and Middlebury [6], has opened the opportunity for the use of learning approaches in stereo matching. As stated before, [7] used a siamese CNN to classify pairs of points as matching or non-matching. The method required an extensive post-processing step, where edge and texture information were used as smoothness constraints.

More recently, [9] expanded on Zbontar’s work and proposed a way to obtain disparity values for all possible displacements without manually pairing patch candidates. In other words, a wider image is passed through one of the branches of the siamese architecture and the computed features are correlated with the ones extracted from the target patch. This allows the computation of matching costs for all disparities with one pass of the CNN. This work also shows that the inner product is a fast and effective way to compute feature correlation. Again, because inference for each pixel is made independently, hand-crafted regularization is used to smooth the results.

Currently, the top-performing stereo methods on the KITTI datasets [4, 5] focus on end-to-end network learning with spatial regularization and do not use any type of hand-crafted post-processing. [10] employ a second network that is trained to smooth the matching cost obtained by a deep residual architecture. Kendall et al. [8] use a 37-layer network with multi-scale 3D convolutions to learn how to match a block of concatenated features from both images. [11] tackled the matching problem in two stages: first, a tweaked version of DispNet is used to estimate disparities with more detail and then a second network is used to rectify the results of the first stage. [12] also achieved excellent performance by combining CNNs and conditional random fields into a hybrid model for stereo estimation. Despite the huge difference in architectures and training methodology, all these methods start roughly the same way, with a siamese architecture that acts as a feature descriptor for the stereo image pair. Most recent work chooses to focus on the spatial regularization rather than the feature extraction step. We argue that significant improvements can be achieved by simply increasing the amount of context that is extracted by the siamese architecture.

The work presented here is most similar to the one developed by Luo et al. [9] but with two major contributions. First, we show that the loss of detail from pooling operations can be compensated with deconvolution operations if these are applied in the feature space, before computing correlation. This hugely increases the global receptive field of the feature extractors, resulting in more robust matching even before spatial regularization. Second, we show that a simple feature aggregation can be used to simplify the learning problem, resulting in an effective, more easily learned, data-driven correlation metric. To reiterate, because we only propose to improve the feature extraction step, our aim is not to beat the current state of the art for full stereo matching pipelines. Our contribution provides a very effective and fast stereo matching network that can easily be further improved by plugging it into most current CNN stereo matching models.

## III Methodology

Let us consider a stereo image pair $I_L$ and $I_R$ with size $H \times W$. A typical stereo algorithm computes a cost volume $C$ such that:

$$C(x, y, d) = s\big(I_L(x, y),\, I_R(x - d, y)\big), \qquad 0 \le d < D \tag{1}$$

where $D$ is the maximum disparity under consideration and the function $s$ returns a similarity score between the two pixels, which are indexed in the horizontal and vertical directions by $x$ and $y$. Typically, $s$ is a similarity function between handcrafted representations of small patches around the pixels [2]. Alternatively, CNNs can be used to learn complex, high-dimensional feature extractors that allow a more robust patch comparison [7].
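As a minimal sketch of the cost volume defined above, the NumPy snippet below scores every pixel of a rectified pair against every candidate disparity. The function name `cost_volume` and the similarity used here (negative absolute intensity difference) are illustrative assumptions; they stand in for the handcrafted or learned similarity functions discussed in the text.

```python
import numpy as np

def cost_volume(left, right, max_disp):
    """Similarity score for every pixel and every candidate disparity.

    left, right: 2D float arrays (H, W) from a rectified stereo pair.
    Returns an (H, W, max_disp) array; entries whose match would fall
    outside the right image are set to -inf.
    """
    h, w = left.shape
    volume = np.full((h, w, max_disp), -np.inf)
    for d in range(max_disp):
        # Pixel (x, y) in the left image is compared with (x - d, y) in the right.
        volume[:, d:, d] = -np.abs(left[:, d:] - right[:, : w - d])
    return volume

# Toy check: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((4, 32))
left = np.roll(right, 3, axis=1)
volume = cost_volume(left, right, max_disp=8)
disparity = np.argmax(volume, axis=2)  # best-scoring disparity per pixel
```

In this toy pair, every column that has a valid match at a shift of 3 pixels recovers that shift; the first three columns cannot, because their true match lies outside the right image.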

Some of the most accurate stereo algorithms proposed in recent years employ CNNs to score the patch similarity measure [13, 7, 9, 10, 8]. Even though these methods proceed with different approaches, every model starts with a siamese architecture that processes the left and the right images. While subsequent layers may allow more complex correlation inference or spatial regularization of the cost volume, the matching is still in essence based on the features extracted by the siamese branches. As a consequence, the architecture of the siamese CNN plays a crucial role in the quality of the stereo matching, much like the role of a traditional low level vision similarity metric. We therefore focus on enhancing the underlying siamese network in order to improve performance.

### III-A Siamese network architecture

We construct our network by layering sequential blocks of 2D convolutions, batch normalization and a rectified linear unit (ReLU). Like most architectures, we use convolution layers with 64 neurons, and the parameters are shared between branches. The last layers are added without batch normalization and ReLU operations.

Generally speaking, wider patches allow the extraction of more visual cues and help achieve more accurate matching, especially in textureless regions or areas affected by the aperture problem. The area around the target pixel that is considered in the matching process depends on the global receptive field of the CNN architecture. If we denote the network input, indexed by the coordinates $(x, y)$, as $z^{0}_{x,y}$, then a network with $n$ layers will output $z^{n}_{x,y}$. Mathematically, we can define the global receptive field as the range of pixels in $z^{0}$ that affects each $z^{n}_{x,y}$. Intuitively, the global receptive field is the size of the region that a CNN uses towards making a single prediction.

More convolution layers and larger filters allow small increases in the global receptive field but cause a considerable increase in computation time and memory requirements. A common practice in classification CNNs is the use of strided pooling to downsample feature maps within the network, allowing for much wider global receptive fields [13]. Pooling operations have also been reported to provide translation invariance to CNN models [13]. However, the properties that make pooling useful in classification tasks are not desirable for stereo matching, so most stereo algorithms avoid this operation. The loss of detail from feature downsampling makes it harder to recognize very small differences, something crucial for pixel-level matching. We address this problem by using transpose convolution (deconvolution) operations.
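To illustrate why strided pooling widens the global receptive field so much faster than extra convolutions, the sketch below applies the standard receptive-field recurrence to a stack of layers. The 3×3 convolutions, the 2×2 pooling and the two example stacks are assumptions chosen for illustration and are not the exact architectures evaluated in this paper.

```python
def global_receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a layer stack.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    field, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field

conv = (3, 1)  # assumed 3x3 convolution, stride 1
pool = (2, 2)  # assumed 2x2 max pooling, stride 2

plain = global_receptive_field([conv] * 4)                        # 9 pixels
pooled = global_receptive_field([conv, conv, pool, conv, conv])   # 14 pixels
```

With the same four convolutions, a single pooling layer grows the field from 9 to 14 pixels, because every layer placed after the pooling step covers twice as many input pixels per filter tap.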

Deconvolution operations allow CNNs to learn filters capable of upsampling feature maps. The operation is especially useful in pixel-level applications, such as semantic segmentation or generative networks. For example, for optical flow, where the matching search space is two-dimensional, FlowNet [14] sequentially downsamples the feature maps with pooling operations and uses a series of deconvolutions to obtain a dense prediction map. Unlike FlowNet, we argue that it is easier to match upsampled features than to upsample matching scores. Because of this, we choose to implement deconvolution layers before computing any correlation metric. As shown in Figure 1, we use as many stride-2 deconvolutions as there are max-pooling layers within the CNN. This creates a dense feature space that can be used to compute a correlation score for every possible disparity level.
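The pairing of one stride-2 deconvolution with each max-pooling layer can be sketched on a single-channel feature map as follows. This is only a shape illustration: in the network the deconvolution kernel is learned, whereas the fixed 2×2 kernel of ones used here (which reduces to nearest-neighbor upsampling) is an assumption.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def transposed_conv_stride2(x, kernel):
    """Stride-2 transposed convolution of an (H, W) map with a (k, k) kernel.

    Each input value scatters a scaled copy of the kernel into the output;
    overlapping contributions are summed. Output size is 2 * (H - 1) + k.
    """
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros((2 * (h - 1) + k, 2 * (w - 1) + k))
    for i in range(h):
        for j in range(w):
            out[2 * i : 2 * i + k, 2 * j : 2 * j + k] += x[i, j] * kernel
    return out

features = np.arange(16.0).reshape(4, 4)
pooled = max_pool_2x2(features)                                # shape (2, 2)
upsampled = transposed_conv_stride2(pooled, np.ones((2, 2)))   # shape (4, 4)
```

The upsampled map has the resolution of the original features, so a correlation score can again be computed for every pixel and disparity.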

### III-B Correlation layer

Several stereo matching CNNs use the inner product as a correlation metric between feature vectors extracted from the siamese branches [9, 7, 13]. The operation is computationally efficient and differentiable, which allows backpropagation during training. In these cases, the CNN learns feature extractors that maximize the inner product between the features of two corresponding points. While this provides a fast and effective way to compute correlation, it would be preferable to allow the network to learn a correlation that best fits the stereo data. Note that the inner product only measures one direction/component of similarity between vectors, whereas the network could learn more complex relationships.
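A minimal sketch of inner-product correlation between the two siamese outputs is given below. The feature maps are random stand-ins for learned features, and the function name `inner_product_scores` is hypothetical; the point is only that features are extracted once and then reused for every candidate disparity.

```python
import numpy as np

def inner_product_scores(feat_left, feat_right, max_disp):
    """Inner-product matching score for every pixel and candidate disparity.

    feat_left, feat_right: (H, W, F) feature maps from the two siamese branches.
    Returns (H, W, max_disp); invalid disparities (x - d < 0) score -inf.
    """
    h, w, _ = feat_left.shape
    scores = np.full((h, w, max_disp), -np.inf)
    for d in range(max_disp):
        # Dot product over the feature axis between (x, y) and (x - d, y).
        scores[:, d:, d] = np.sum(feat_left[:, d:] * feat_right[:, : w - d], axis=-1)
    return scores

# Toy features: unit-norm vectors, with the left map equal to the right map
# shifted by 2 pixels, so the true disparity is 2 wherever it is observable.
rng = np.random.default_rng(0)
feat_right = rng.standard_normal((4, 24, 16))
feat_right /= np.linalg.norm(feat_right, axis=-1, keepdims=True)
feat_left = np.roll(feat_right, 2, axis=1)
scores = inner_product_scores(feat_left, feat_right, max_disp=8)
disparity = np.argmax(scores, axis=2)
```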

Recent methods choose to concatenate the output from the siamese network along the feature dimension and follow it with more convolution layers [7, 10, 13]. To a certain extent, this allows the CNN to learn how to correlate matching points, but the maximum disparity that the network is able to find is intrinsically related to the global receptive field of the layers stacked after the siamese portion of the CNN.

Let us consider the case where we want to find the disparity map for a left stereo image $I_L$ with $H \times W$ dimensions. Considering $D$, the maximum disparity possible between the stereo pair, correlation needs to be computed with all pixels within a range of $D$ pixels in the right stereo image $I_R$, just as described in Equation 1. By using a siamese network with an $F$-dimensional output, it is possible to extract two feature volumes with $H \times W \times F$ dimensions. To learn how to match pixels for $D$ possible disparities from the concatenated volume, the network needs to process $2F$ values in its third dimension and to account for a range of $D$ pixels in the input second dimension. In other words, the correlation layers would need enough neurons to process these values, and their global receptive field would need to be at least $D$ in the image width dimension. Using the common approach where we stack $n$ layers of $3 \times 3$ convolution blocks, the global receptive field of a network is equal to $2n + 1$. In the KITTI dataset [4], for example, where $D = 256$, it would take at least 128 layers of $3 \times 3$ convolutions for a network to have a global receptive field wide enough to match pixels 256 apart without downsampling the feature space. This is not only challenging from a computational point of view, but it also greatly complicates the learning process. Beyond learning how to correlate features of matching points, the model would also need to associate feature positions with the intended disparity. We propose a new correlation layer that greatly simplifies the learning problem, needing as few as two convolution layers to compute a disparity map for any maximum disparity $D$.
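The layer count quoted above can be checked with a short calculation, assuming stride-1 $3 \times 3$ convolutions (each of which adds two pixels to the receptive field) and a disparity range of 256 pixels:

```latex
\begin{aligned}
r(n) &= 1 + 2n, \\
r(n) \ge 256 \;&\Longrightarrow\; n \ge \tfrac{256 - 1}{2} = 127.5
\;\Longrightarrow\; n = 128, \qquad r(128) = 257 .
\end{aligned}
```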

Defining the $F$-dimensional feature vectors computed from $I_L$ and $I_R$ as $f_L$ and $f_R$, respectively, we construct a new feature space $\Phi$ as:

$$\Phi(x, y, d) = f_L(x, y) \oplus f_R(x - d, y), \qquad 0 \le d < D \tag{2}$$

where $\oplus$ represents a concatenation operation. Note that we are still concatenating vectors along the feature dimension, but we replicate the left features and pair them with the right features of every possible disparity. The new feature space has the dimensions $H \times W \times D \times 2F$ where, for all pixels, there is a paired $2F$-dimensional feature vector for all possible disparities. This simple transformation radically changes what kind of information convolution filters receive. Let us consider applying to $\Phi$ a single convolution layer that outputs a single value from each $2F$-dimensional input. Note that a single value would be computed for $D$ disparities for all pixels, using only the corresponding right and left feature pairing as input. This way, the correlation layer only needs to learn how to correlate two concatenated $F$-dimensional vectors, independently of their original position, considerably simplifying the learning problem. This layer would output a map with one value per pixel and disparity, which can be easily transformed to the intended disparity volume with an $H \times W \times D$ shape. Beyond this, in this feature space, filters that span several neighboring entries allow the network to learn a correlation metric that accounts for neighbor disparity pairs, creating the opportunity for a more robust disparity correlation. Finally, because the filters learned during training always correlate $F$-dimensional feature pairs, $\Phi$ can be rebuilt for a variable number of max disparities without needing to retrain the model.
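The feature space transformation described above can be sketched in NumPy as follows. Several details are assumptions made for illustration: right features that would fall outside the image are zero-padded, the correlation network uses position-wise (1×1) filters only, and the hidden width of 6 and the random weights are placeholders for learned parameters.

```python
import numpy as np

def paired_feature_space(feat_left, feat_right, max_disp):
    """Pair each left feature with the right feature at every disparity.

    feat_left, feat_right: (H, W, F). Returns (H, W, max_disp, 2F).
    Right features that would fall outside the image are left as zeros.
    """
    h, w, f = feat_left.shape
    space = np.zeros((h, w, max_disp, 2 * f))
    space[..., :f] = feat_left[:, :, None, :]         # replicate left features
    for d in range(max_disp):
        space[:, d:, d, f:] = feat_right[:, : w - d]  # right feature at x - d
    return space

def pointwise_correlation(space, w_hidden, w_out):
    """Two 1x1 layers applied independently to every (pixel, disparity) pair."""
    hidden = np.maximum(space @ w_hidden, 0.0)        # ReLU hidden layer
    return (hidden @ w_out)[..., 0]                   # (H, W, max_disp)

rng = np.random.default_rng(0)
feat_left = rng.standard_normal((3, 20, 4))
feat_right = rng.standard_normal((3, 20, 4))
w_hidden = rng.standard_normal((8, 6))   # 2F = 8 inputs, 6 hidden units
w_out = rng.standard_normal((6, 1))

# The same weights score 8 or 16 candidate disparities without retraining.
scores_8 = pointwise_correlation(paired_feature_space(feat_left, feat_right, 8), w_hidden, w_out)
scores_16 = pointwise_correlation(paired_feature_space(feat_left, feat_right, 16), w_hidden, w_out)
```

Because each score depends only on one left/right feature pair, the first 8 disparity slices of `scores_16` coincide with `scores_8`, which is what allows the feature space to be rebuilt for a different maximum disparity at test time.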

In our experimental results, we compare the performance of siamese architectures trained with inner product and with our correlation layer. We use the simplest architecture that allows non-linear logical operations [15]: a single activated hidden layer followed by a single output neuron, both implemented as convolution filters.

Table I: Percentage of pixels with a disparity error above 2, 3 and 5 pixels on the KITTI 2012 validation set.

| Siamese CNN | Correlation | >2 pixel Non-Occ | >2 pixel All | >3 pixel Non-Occ | >3 pixel All | >5 pixel Non-Occ | >5 pixel All | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 4 conv, 1 pool | inner prod | 12.42 | 14.18 | 11.38 | 13.16 | 9.98 | 11.76 | 1.15 |
| 4 conv, 1 pool | learned | 11.27 | 13.05 | 10.39 | 12.13 | 9.08 | 10.82 | 5.25 |
| 7 conv, 2 pool | inner prod | 7.57 | 9.45 | 6.72 | 8.61 | 5.64 | 7.53 | 1.15 |
| 7 conv, 2 pool | learned | 6.65 | 8.23 | 5.84 | 7.58 | 4.80 | 6.48 | 5.27 |
| 9 conv, 3 pool | inner prod | 7.47 | 9.34 | 6.50 | 8.36 | 5.31 | 7.17 | 1.16 |
| 9 conv, 3 pool | learned | 7.57 | 10.29 | 6.59 | 9.05 | 5.34 | 7.80 | 5.28 |

### III-C Training

We train our models with stereo image pairs from the KITTI datasets [4, 5], where the true displacement of a sparse number of pixels is known. We randomly extract small patches from the left stereo image and the patch at the same coordinates from the right image, extended by the maximum disparity under consideration. This allows us to sample diverse training batches while being memory efficient. We treat each disparity value as a mutually exclusive class in a classification problem. The values output by the correlation step are used in a softmax loss. All parameters are trained with stochastic gradient descent using the standard Adam optimizer [16].
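Treating each disparity as a mutually exclusive class leads to a softmax cross-entropy over the disparity axis, evaluated only where ground truth exists. The NumPy sketch below is one way to write that loss; the function name and the boolean `valid` mask for the sparse labels are assumptions introduced for illustration.

```python
import numpy as np

def disparity_softmax_loss(scores, gt_disp, valid):
    """Mean cross-entropy over pixels with known ground-truth disparity.

    scores:  (H, W, D) correlation outputs, one class per disparity.
    gt_disp: (H, W) integer ground-truth disparities.
    valid:   (H, W) boolean mask of labelled pixels (labels are sparse).
    """
    logits = scores[valid]                                # (N, D)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = gt_disp[valid]
    return -log_prob[np.arange(target.size), target].mean()

scores = np.zeros((2, 3, 4))   # uniform scores over 4 disparities
gt_disp = np.array([[0, 1, 2], [3, 0, 1]])
valid = np.array([[True, False, True], [True, True, False]])
loss = disparity_softmax_loss(scores, gt_disp, valid)   # log(4) for uniform scores
```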

### III-D Testing

During testing, memory constrains one-pass computation of disparity maps for high-resolution images with large maximum displacements. Instead of processing subsections of the image individually, we follow the procedure suggested by [9]. First, we extract the feature representation for all pixels of the stereo image pair with the siamese architecture. Then, in the correlation step, the same feature values are reused to compute the disparities of multiple pixels. This results in a significant increase in inference speed.

## IV Experimental Evaluation

We train and evaluate our models using both the KITTI 2012 [4] and KITTI 2015 [5] datasets. Both are composed of rectified natural images captured by a stereo camera. KITTI 2012 consists only of static environments while moving objects are present in KITTI 2015. Just like most methods [7, 9, 8, 10], we use the sparse available labels from non-occluded pixels for training.

We evaluate our methodology by training three different siamese architectures with 4, 7 and 9 convolution layers and with 1, 2 and 3 max pooling layers, respectively. We also compare all models trained with inner product and with the proposed correlation architecture.

All parameters are randomly initialized with a normalized Gaussian distribution and input images are normalized to have zero mean and unit standard deviation. Every CNN is trained for 75K iterations. Training is done with patches randomly extracted from the left image, with a patch size chosen for each architecture. We use the biggest batch size that our system allowed for each model. For CNNs trained with inner product, this translates to batches of 128, 32 and 20 for the 4-, 7- and 9-layer architectures, respectively, and batches of 128, 20 and 8 for the same models trained with our correlation architecture. All models were implemented in TensorFlow [17] and run on an NVIDIA Titan X GPU.
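The input normalization and patch sampling can be sketched as below. The patch size of 9, the maximum disparity of 32 and the helper names are hypothetical values chosen for the example, and the random image is a stand-in for a real stereo frame.

```python
import numpy as np

def normalize(image):
    """Scale an image to zero mean and unit standard deviation."""
    return (image - image.mean()) / image.std()

def sample_patch_pair(left, right, patch, max_disp, rng):
    """Random left patch plus the right strip covering all candidate matches.

    Returns a (patch, patch) left crop and a (patch, patch + max_disp) right
    crop whose last `patch` columns share coordinates with the left crop.
    """
    h, w = left.shape
    y = rng.integers(0, h - patch + 1)
    x = rng.integers(max_disp, w - patch + 1)  # keep x - max_disp inside the image
    left_crop = left[y : y + patch, x : x + patch]
    right_crop = right[y : y + patch, x - max_disp : x + patch]
    return left_crop, right_crop

rng = np.random.default_rng(0)
image = normalize(rng.random((50, 200)))
# Using the same image for both views makes the coordinate alignment easy to see.
left_crop, right_crop = sample_patch_pair(image, image, patch=9, max_disp=32, rng=rng)
```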

Table II: Percentage of pixels with a disparity error above 2, 3 and 5 pixels on the KITTI 2015 validation set.

| Siamese CNN | Correlation | >2 pixel Non-Occ | >2 pixel All | >3 pixel Non-Occ | >3 pixel All | >5 pixel Non-Occ | >5 pixel All | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| 4 conv, 1 pool | inner prod | 11.19 | 12.68 | 10.01 | 11.50 | 8.57 | 10.05 | 1.15 |
| 4 conv, 1 pool | learned | 8.26 | 10.72 | 7.10 | 9.71 | 6.82 | 8.40 | 5.25 |
| 7 conv, 2 pool | inner prod | 7.80 | 9.36 | 6.81 | 8.37 | 5.75 | 7.30 | 1.15 |
| 7 conv, 2 pool | learned | 6.79 | 8.21 | 5.92 | 7.30 | 4.92 | 6.24 | 5.27 |
| 9 conv, 3 pool | inner prod | 6.89 | 8.47 | 6.02 | 7.61 | 5.18 | 6.74 | 1.16 |
| 9 conv, 3 pool | learned | 7.47 | 8.96 | 6.42 | 7.88 | 5.41 | 6.82 | 5.28 |

### IV-A KITTI 2012

The KITTI 2012 dataset consists of 194 image pairs for training and 195 for testing. Because no ground truth is given for the testing images, and multiple online submissions are not allowed, we evaluate our models by splitting the training data into training and validation sets. As in the work developed by [9], we randomly use 160 image pairs for training and 34 for validation. Again, our main objective is to study and improve the siamese architecture that initializes most recent CNN stereo matching systems, so we do not implement an end-to-end system capable of competing with current state-of-the-art systems. The performance of our models on the validation set is shown in Table I.
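For reference, the N-pixel error reported in the tables can be computed as sketched below, assuming the usual KITTI convention that it is the percentage of labelled pixels whose absolute disparity error exceeds the threshold. The function name and the `valid` mask (which would select either non-occluded or all labelled pixels) are assumptions for illustration.

```python
import numpy as np

def n_pixel_error(pred, gt, valid, threshold):
    """Percentage of labelled pixels whose disparity error exceeds `threshold`."""
    wrong = np.abs(pred[valid] - gt[valid]) > threshold
    return 100.0 * wrong.mean()

pred = np.array([[10.0, 12.0, 20.0, 7.0, 3.0]])
gt = np.array([[10.0, 15.0, 20.0, 7.5, 0.0]])
valid = np.array([[True, True, True, True, False]])  # last pixel has no label
error_2px = n_pixel_error(pred, gt, valid, threshold=2)  # 1 of 4 labelled pixels is wrong
```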

When we use the inner product for feature correlation, a direct comparison with the same-depth architectures from [9] allows us to verify the effect of pooling and deconvolution layers. All our models outperform the corresponding networks proposed by [9], which shows the benefit of our pooling/deconvolution approach. Despite the overall increase in performance, Table I shows that there is a limit to the benefit of increasing the receptive field through downsampling pooling layers. While the 2-pixel error is reduced substantially from the 4-layer to the 7-layer architecture, the extra pooling layer in the 9-layer architecture did not greatly decrease the matching error.

Table I also shows that slightly better matching was achieved by learning correlations from the proposed feature space. While the overall increase in performance is small, the correlation layer substantially improves matching in edge regions when compared to the inner product counterpart. Matching improvements are present in the 4-layer and 7-layer architectures when the correlation layer is used, but slightly worse performance is achieved in the 9-layer architecture. This indicates that the loss of detail from successive pooling might hinder the ability of the network to learn a good correlation function. The best results were achieved with the 7-layer architecture, where the receptive field is big enough for robust matching, but the loss of detail is not enough to stop the network from computing an effective correlation. Figure 3 shows that, even without spatial regularization, our architecture is able to smoothly match low-detail regions while maintaining sharp edges in cars and trees.

### IV-B KITTI 2015

KITTI 2015 has 200 image pairs for training and 200 for testing. Again, just like [9], we randomly split the training set into 160 images for training and 40 for validation. This allows a better direct comparison with their method.

A similar analysis to the one made for KITTI 2012 is valid for the KITTI 2015 results. Bigger receptive fields allow lower matching errors for features learned with the inner-product implementation. When learning a correlation, the best compromise between a wider global receptive field and less loss of detail is found in the 7-layer architecture. In Figure 4, we continue to predict large smooth disparities in low-texture regions, even without any post-processing. This shows that wider global receptive fields allow a much more effective correlation computation. Furthermore, even with the downsampling operation within the networks, features capable of representing small structures like traffic signs, fences and trees can be successfully extracted. Stacking further layers should easily allow spatial regularization to be learned without a significant increase in computation cost, since the concatenation and reshaping operations of the feature space transformation are the bottleneck of the method.

### IV-C Comparisons with other methods

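Table III: 2-pixel error on the KITTI 2012 and KITTI 2015 validation sets compared with other siamese architectures without spatial regularization.

| Method | KITTI 2012 Non-Occ | KITTI 2012 All | KITTI 2015 Non-Occ | KITTI 2015 All |
|---|---|---|---|---|
| MC-CNN-acrt [7] | 15.02 | 16.92 | 15.20 | 16.83 |
| MC-CNN-fast [7] | 17.72 | 19.56 | 18.47 | 20.04 |
| Luo et al. [9] | 10.87 | 12.86 | 9.96 | 11.67 |
| Ours + inner product | 7.57 | 10.29 | 6.89 | 8.47 |
| Ours + correlation | 6.65 | 8.23 | 6.79 | 8.21 |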

As stated before, we do not propose a full pipeline for stereo matching. Our main objective is to study and improve a crucial part of most current CNN stereo matching models: the siamese architecture. Because of this, we compare our work with other architectures that do not use spatial regularization. These results are presented in Table III.

Table III shows that, when compared with other non-regularized siamese architectures, our wider models have a significantly lower 2-pixel error on both the KITTI 2012 and KITTI 2015 datasets. Furthermore, the proposed space transformation makes it possible to learn a shallow correlation layer that outperforms all other siamese architectures.

The results reported do not guarantee that replacing the siamese architectures of more complex models, such as the one proposed by [8], will improve matching performance, but they show promising potential even without spatial regularization. If nothing else, our models, just like the ones proposed by [9], provide a simple, fast and easy-to-train approach, but with much more accurate results.

## V Conclusion

As in so many areas of computing, deep learning has allowed us to move at an incredible speed towards a robust solution for stereo matching. As computation power increases, there is a natural tendency to move to bigger and more complex CNN models. In this work, we demonstrated that big improvements are still possible through small, problem-specific adaptations that simplify the learning problem. For future work, we plan to incorporate the recent approaches that use context for regularization, allowing us to take full advantage of the proposed feature extractor.

## References

- [1] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47(1-3):7–42, 2002.
- [2] Myron Z Brown, Darius Burschka, and Gregory D Hager. Advances in computational stereo. IEEE transactions on pattern analysis and machine intelligence, 25(8):993–1008, 2003.
- [3] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [4] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- [5] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- [6] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pages 31–42. Springer, 2014.
- [7] Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(65):1–32, 2016.
- [8] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. arXiv preprint arXiv:1703.04309, 2017.
- [9] Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
- [10] Amit Shaked and Lior Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. arXiv preprint arXiv:1701.00165, 2016.
- [11] Jiahao Pang, Wenxiu Sun, JS Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCVW, volume 3, 2017.
- [12] Patrick Knöbelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid CNN-CRF models for stereo. CoRR, abs/1611.10229, 2016.
- [13] Haesol Park and Kyoung Mu Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, 2016.
- [14] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
- [15] Raúl Rojas. Neural networks: a systematic introduction. Springer Science & Business Media, 2013.
- [16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [17] Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.