Semi-supervised learning of deep metrics for stereo reconstruction
Abstract
Deep-learning metrics have recently demonstrated extremely good performance at matching image patches for stereo reconstruction. However, training such metrics requires a large amount of labeled stereo images, which can be difficult or costly to collect for certain applications.
The main contribution of our work is a new semi-supervised method for learning deep metrics from unlabeled stereo images, given coarse information about the scenes and the optical system. Our method alternately optimizes the metric with standard stochastic gradient descent, and applies stereo constraints to regularize its prediction.
Experiments on reference datasets show that, for a given network architecture, training with this new method without ground truth produces a metric with performance as good as state-of-the-art baselines trained with said ground truth.
This work has three practical implications. First, it helps to overcome limitations of training sets, in particular noisy ground truth. Second, it allows much more training data to be used during learning. Third, it allows a deep metric to be tuned for a particular stereo system, even if ground truth is not available.
1 Introduction
The stereo reconstruction problem consists in estimating a depth map from two images taken from different viewpoints. The problem has many practical applications in robotics [34], remote sensing [43], and 3D graphics [47].
It has been heavily investigated for several decades [40], and recent developments have focused on designing high-order, region-based and object-specific priors [60, 10, 55, 17, 24, 29, 52, 51], and on improving the efficiency of large-scale stereo [36, 25, 16, 7]. Perhaps the most significant recent breakthrough was the use of deep metrics [12, 58], which led to considerable gains in processing speed and reconstruction accuracy (see Tables 4, 5, and 6). Our work improves upon this line of research.
2 Related work
Stereo reconstruction algorithms rely on epipolar geometry [18], according to which every non-occluded point in one stereo view corresponds to a point in the other view lying on a line that does not depend on the scene, but only on the optical system. This line is called an epipolar line, and for a calibrated stereo system it is known for every image point. Furthermore, for a pinhole camera, all the points lying on a given epipolar line in the second view correspond to points lying on a common epipolar line in the first view. Two such epipolar lines are called conjugate.
It is a standard procedure to warp the stereo views so that conjugate epipolar lines become horizontal and vertically aligned. This is called stereo rectification, and in a rectified stereo pair every point in the first view corresponds to a horizontally shifted point in the second view. The extent of this shift, also known as the disparity, allows computing the distance to the corresponding 3-D point, which is the ultimate goal of stereo reconstruction.
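For a rectified pinhole pair, the disparity-to-depth relation is the standard Z = f·B/d, with focal length f (in pixels), baseline B, and disparity d. A minimal sketch; the KITTI-like numbers in the usage line are illustrative assumptions, not values from this paper:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Distance to a 3-D point from its disparity, for a rectified
    pinhole stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Illustrative values: focal length 721 px, baseline 0.54 m
print(depth_from_disparity(30.0, 721.0, 0.54))  # ~12.98 m
```

Note that depth is inversely proportional to disparity, which is why a bounded distance range of the scene translates into a bounded disparity range.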
So at the core of the stereo reconstruction process lies the matching of similar patches in two images along epipolar lines and the estimation of the disparity. It is not a trivial task, since the local appearance of a physical point in the two views might differ due to radiometric and geometric distortions. The patch matching is usually performed using invariant similarity measures and descriptors, also known as features. Historically, the former were more popular for the stereo reconstruction, while the latter were used for matching sparse points of interest.
2.1 Similarity measures
The invariant similarity measures [21, 19] are popular for stereo reconstruction, probably due to their low computational complexity. The simplest similarity measures are the sum of absolute differences (SAD) and the sum of squared differences (SSD). Zero-mean variants of these measures (ZSAD, ZSSD), as well as the sum of absolute gradient differences (GSAD), are invariant to local brightness changes, which can also be achieved by combining SAD and SSD with background subtraction by mean, Laplacian of Gaussian (LoG) [20] or bilateral filters [4]. Non-parametric similarity measures, such as Rank and Census [56], are invariant to arbitrary order-preserving local intensity transformations, and measures such as Mutual Information (MI) [23] explicitly model the joint intensity distribution of the two images, and are invariant to arbitrary intensity transformations. All these methods are invariant to radiometric distortions only.
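To make these invariances concrete, here is a minimal numpy sketch of SAD, SSD, and ZSAD; the brightness-offset example shows why the zero-mean variant is invariant to local brightness changes while plain SAD is not:

```python
import numpy as np

def sad(p, q):   # sum of absolute differences
    return float(np.abs(p - q).sum())

def ssd(p, q):   # sum of squared differences
    return float(((p - q) ** 2).sum())

def zsad(p, q):  # zero-mean SAD: invariant to local brightness offsets
    return float(np.abs((p - p.mean()) - (q - q.mean())).sum())

p = np.arange(9.0).reshape(3, 3)
q = p + 5.0          # same patch, uniformly brighter
print(sad(p, q))     # 45.0: SAD is fooled by the brightness offset
print(zsad(p, q))    # 0.0: ZSAD cancels the offset
```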
2.2 Descriptors
Invariant descriptors are popular for sparse point matching, and are designed to be invariant to both radiometric and geometric distortions. They are all either local histograms of oriented image gradients, such as SIFT [30], or binary strings of local pairwise pixel comparisons, such as BRIEF [9]. Although descriptors are rarely used for stereo, there are some exceptions, such as DAISY [48], which can be efficiently computed densely.
Recently, the community has moved from these fully hand-crafted descriptors to data-driven descriptors, incorporating machine-learning approaches. Most such descriptors perform discriminative dimensionality reduction, either by feature selection, as in VGG [45], linear feature extraction, as in LDAHash [46], or boosting, as in BinBoost [50].
2.3 Deep metrics
As in other application domains of machine learning, the current trend is to move beyond “shallow” models, in which the learned quantities interact linearly with hand-designed nonlinearities but are not involved in further recombinations.
The resulting “deep metrics” demonstrate extremely good performance compared to other similarity measures and descriptors both for sparse point matching [22, 14, 44, 57, 54] and stereo reconstruction [58, 12].
Standard deep metric networks have a Siamese architecture, introduced in [8]. They consist of two “embedding” sub-networks with complete weight sharing that join into a common “head”. Each embedding sub-network is convolutional; it takes an image patch as input and outputs the patch’s descriptor. The “head” is usually fully connected; it takes the two descriptors as input and outputs a similarity measure. The Siamese architecture was first used for image patch matching in its classic form in [22]. It was later shown that the “head” network may be replaced by a fixed similarity measure [44] or the cosine similarity [58], that the embedding sub-networks need not share weights [57], and, finally, that the explicit notion of a descriptor might not be necessary [57].
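A minimal sketch of the weight-sharing idea with a fixed cosine “head”. The single shared linear-plus-ReLU map below is a toy stand-in for the convolutional embedding sub-network, not the architecture of [58]:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 81))   # shared embedding weights (toy stand-in
                                    # for the convolutional sub-network)

def embed(patch):
    # Both branches apply the SAME weights W: this is the Siamese sharing.
    return np.maximum(W @ patch.ravel(), 0.0)   # linear map + ReLU

def similarity(p, q):
    # Fixed cosine "head" on the two descriptors.
    a, b = embed(p), embed(q)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

p = rng.standard_normal((9, 9))
print(similarity(p, p))             # ~1.0 for identical patches
```

Because the head is fixed, training only updates the shared embedding, which is what makes dense evaluation at test time cheap.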
2.4 Supervised learning of deep metrics
Existing methods for training a Siamese network for patch matching are supervised, using a training set composed of positive and negative examples. Each positive (respectively negative) example is a pair composed of a reference patch and its matching (respectively a non-matching) patch from another image.
Training either takes one example at a time, positive or negative, and adapts the similarity [44, 12, 22, 57, 54], or takes both a positive and a negative example at each step and maximizes the difference between the similarities, hence aiming at making the two patches of the positive pair “more similar” than the two patches of the negative pair [58, 26, 6]. This latter scheme is known as triplet contrastive learning.
Although the supervised learning of deep metrics works very well, the complexity of the models requires very large labeled training sets, which are hard to collect for real applications. Besides, even when such large sets are available, the ground truth is produced automatically from sensors and is thus usually noisy and/or may suffer from gross errors. This can be mitigated by augmenting the training set with random perturbations [58] or synthetic training data [14, 33]. However, synthesis procedures are hand-crafted and do not account for the regularities specific to the stereo system and target scene at hand.
2.5 Semisupervised learning
Our work is inspired by Multi-Instance Learning (MIL) [5] and Self-Training [49]. The main idea behind MIL is to use “coarsely” labeled data, where one label indicates whether a group of samples contains at least one positive sample. This makes it possible to deal with low geometric accuracy, or even with the absence of geometric information and a labeling at the scene level. It has been applied with success to deep learning [53].
Another strategy to relax the requirement for detailed labeling is Self-Training, where the training set is enriched with unlabeled data. As in transductive learning, self-training works by leveraging the information carried by the unlabeled data about the structure of the data population [11, 37].
Our most efficient method uses dynamic programming (DP) to regularize the noisy predictions of the metric as it is being trained. A similar idea appeared in [27], in a different context, to train a deep network to recognize handwritten characters, using word-wise labels to infer character-wise labels. It has also been used to automatically segment sequences of action demonstrations into macro-actions to deal with non-Markovian decision processes [28], and the shortest-paths algorithm, a generalization of dynamic programming to multiple paths, was used to train a person detector from videos with time-sparse ground truth [3].
3 Method
We start by formulating in § 3.1 the task of semi-supervised deep metric learning for stereo; then in § 3.2 we review the stereo matching constraints we consider, and in § 3.3 we describe how we use them to drive the training.
3.1 Problem formulation
We are provided with a semi-supervised training set of N examples. Each training example is a triplet of series of grayscale patches:

(x^r_1, …, x^r_n): reference patches extracted from a horizontal line of a left rectified stereo image,

(x^p_1, …, x^p_n): positive patches extracted from the corresponding horizontal line in the right rectified stereo image, and

(x^n_1, …, x^n_n): negative patches extracted from another horizontal line of a right rectified stereo image,

where n is the number of patches per line, and N is the number of training examples. In addition to the training set, we are provided with the maximum possible disparity d_max, which depends on the optical system and on prior knowledge about the scene.
Our goal is to learn a deep metric s(·, ·) such that, for any series of reference and positive image patches, the row-wise maxima of the similarity matrix S^p, with entries S^p_{ij} = s(x^r_i, x^p_j), correspond to the true matches.
Note that, in contrast to [22, 57, 44, 14, 54, 12, 58], in our case each training example is not a pair of patches but a triplet of series of patches, each taken on a horizontal line of a rectified stereo image, so that we can utilize constraints and loss functions defined jointly on such families of patches. Additionally, processing lines as a whole significantly speeds up training by allowing shared computations to be reused.
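Under this formulation, inference along one epipolar line reduces to taking row-wise maxima of the similarity matrix restricted to the valid disparity band. A sketch, assuming the convention that disparity d = i − j with 0 ≤ d ≤ d_max (an illustrative choice, not prescribed by the paper):

```python
import numpy as np

def wta_disparities(sim, d_max):
    """Row-wise winner-take-all matches under the disparity-range constraint.

    sim[i, j] is the similarity between reference patch i and positive
    patch j on the same epipolar line; valid matches satisfy
    0 <= i - j <= d_max."""
    n = sim.shape[0]
    masked = np.full_like(sim, -np.inf)
    for i in range(n):
        lo = max(0, i - d_max)
        masked[i, lo:i + 1] = sim[i, lo:i + 1]   # keep only the valid band
    j_star = masked.argmax(axis=1)
    return np.arange(n) - j_star                 # disparity d_i = i - j*

sim = np.eye(5) + 0.1 * np.random.default_rng(1).random((5, 5))
print(wta_disparities(sim, d_max=2))             # all zeros: diagonal wins
```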
3.2 Matching constraints
The stereo matching problem satisfies the following constraints:

Epipolar constraint. Every non-occluded reference patch has a matching positive patch [18, pp. 239–241].

Disparity range constraint. The offset of the reference patch index with respect to the matching positive patch index is bounded by a maximum disparity d_max. This comes from the stereo system parameters (focal length, pixel size, baseline) and the distance range of the scenes.

Uniqueness constraint. The matching positive patch is unique [32].

Continuity constraint. The offsets of the reference patch indices with respect to the matching positive patch indices are similar for nearby reference patches everywhere except at depth discontinuities [32].

Ordering constraint. The reference patches are ordered on their line in the same way as the matching positive patches are on theirs.
These constraints result in a particular shape of the positive similarity matrix, as pictured in Figure 1.
3.3 Proposed semisupervised methods
We developed several semi-supervised methods that use different subsets of the stereo constraints during training. All methods alternate between two steps: (1) improving the metric, given the current estimate of the matches for the positive examples, and (2) recomputing these matches under the constraints, given the current estimate of the metric. They can be used in combination with any deep metric architecture and any gradient-based optimization method.
To each of our methods corresponds a loss function optimized in each of the two steps mentioned above. It takes as input one or more of the matrices S, S^p and S^n, defined respectively as follows:

(1)  S_{ij} = s(x^r_i, x^p_j),

(2)  S^p_{ij} = s(x^r_i, x^p_j) if 0 ≤ i − j ≤ d_max, and −∞ otherwise,

(3)  S^n_{ij} = s(x^r_i, x^n_j) if 0 ≤ i − j ≤ d_max, and −∞ otherwise,

where s(·, ·) is the learned metric, x^r_i, x^p_j and x^n_j denote the reference, positive and negative patches of a training line, and d_max is the maximum possible disparity.
In the next sections we describe each method in detail.
3.3.1 MIL method
This method is inspired by the Multi-Instance Learning (MIL) paradigm [5] and uses only the epipolar (E) and disparity range (D) constraints from § 3.2.
From these two constraints, we know that every non-occluded reference patch has a matching positive patch in a known index interval, but does not have a matching negative patch. Therefore, for every reference patch, the similarity of the best reference-positive match should be greater than the similarity of the best reference-negative match. Our training objective is to push these two similarities apart.
The training loss for the MIL method is

(4)  L_MIL = (1/|R|) Σ_{i ∈ R} max(0, μ + max_j S^n_{ij} − max_j S^p_{ij}) + (1/|C|) Σ_{j ∈ C} max(0, μ + max_i S^n_{ij} − max_i S^p_{ij}),

where R is the set of rows of the similarity matrix that are guaranteed to contain a correct match (see Fig. 1), C is the set of valid columns of the similarity matrix that are guaranteed to contain a correct match, S^p_{ij} (respectively S^n_{ij}) is the similarity between the i-th reference patch and the j-th positive (respectively negative) patch restricted to the valid disparity band, and μ is a loss margin. Note that the disparity range constraint is taken into account automatically if we use the similarity matrices as defined in § 3.3.
Experiments show that this method learns metrics insensitive to small shifts from the optimal match. This problem results in a blocky similarity matrix, where blocks correspond to areas in which the metric is unable to find a unique match. This issue motivates the CONTRASTIVE method described in the next section.
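The MIL objective described above can be sketched as a hinge loss pushing, for each reference patch, the best positive match above the best negative match by a margin. Only the row-wise half is shown for brevity, and the matrix contents and margin value are illustrative:

```python
import numpy as np

def mil_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge loss pushing each reference patch's best positive-line match
    above its best negative-line match by `margin` (row-wise half only)."""
    best_pos = sim_pos.max(axis=1)   # best reference-positive similarity
    best_neg = sim_neg.max(axis=1)   # best reference-negative similarity
    return float(np.maximum(0.0, margin + best_neg - best_pos).mean())

rng = np.random.default_rng(0)
sp = rng.random((8, 8)) + np.eye(8)  # positives: clear diagonal matches
sn = rng.random((8, 8))              # negatives: no structure
print(mil_loss(sp, sn))              # small: positives already dominate
```

Note that the loss never asks *which* positive patch matches, only that some patch on the positive line beats every patch on the negative line, which is exactly the multi-instance relaxation.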
3.3.2 CONTRASTIVE method
This method uses the epipolar, the disparity range, and the uniqueness constraints (E), (D), and (U) from § 3.2.
From the epipolar and disparity range constraints we know that every non-occluded reference patch has a matching positive patch in a known index interval. Furthermore, according to the uniqueness constraint, the matching positive patch is unique. Therefore, for every patch, the similarity of the best match should be greater than that of the second-best match. Our training objective is to push these two quantities apart.
The training loss for this CONTRASTIVE method is

(5)  L_CON = (1/n) Σ_i max(0, μ + max_j S̄^row_{ij} − max_j S^p_{ij}) + (1/n) Σ_j max(0, μ + max_i S̄^col_{ij} − max_i S^p_{ij}),

where n is the number of patches per line, S̄^row is the positive similarity matrix S^p with its row-wise maxima masked out, and S̄^col is S^p with its column-wise maxima masked out. To mask out elements of the similarity matrix, we simply substitute them with −∞.
Experiments show that this method suffers from the opposite problem to the MIL method: it produces an over-sharpened metric, sensitive even to small shifts from the exact match. This is also detrimental to the performance, since our goal is a metric invariant to small geometric transformations, such as shifts. We solve the problem by masking out all spatial neighbors within a given radius of the maxima in the two masked matrices. See the supplementary materials for details.
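The CONTRASTIVE idea, row-wise only, can be sketched as follows; the neighbor masking within a radius of the maximum implements the shift-tolerance fix described above (margin and radius values are illustrative):

```python
import numpy as np

def contrastive_row_loss(sim, margin=0.2, radius=2):
    """For each row: push the best match above the best *non-neighboring*
    runner-up by `margin`. Columns within `radius` of the maximum are
    masked out so that small shifts are not penalized."""
    losses = []
    for row in sim:
        j = int(row.argmax())
        masked = row.copy()
        masked[max(0, j - radius):j + radius + 1] = -np.inf  # mask neighborhood
        second = masked.max()
        losses.append(max(0.0, margin + second - row[j]))
    return float(np.mean(losses))

sim = np.eye(6) * 2.0 + np.random.default_rng(3).random((6, 6))
print(contrastive_row_loss(sim))   # 0.0: maxima already well separated
```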
3.3.3 MIL-CONTRASTIVE method
As shown in the previous sections, the CONTRASTIVE and MIL methods have complementary properties and use the stereo constraints in orthogonal ways. We can therefore combine them into a new method that we call MIL-CONTRASTIVE.
3.3.4 CONTRASTIVE-DP method
This method uses all the constraints listed in § 3.2. The only difference with CONTRASTIVE is that it finds the best match under the continuity (C) and ordering (O) constraints using dynamic programming (DP), instead of taking independent maxima.
Formally, it solves

(6)  p* = argmax_{p ∈ P} (1/|p|) Σ_{(i,j) ∈ p} S^p_{ij},

where P is the set of paths p = ((i_1, j_1), …, (i_K, j_K)) over the positive similarity matrix that are continuous in the following sense:

(i_{k+1} − i_k, j_{k+1} − j_k) ∈ {(0, 1), (1, 0), (1, 1)} for all k,

which means that only down, right and diagonal steps are allowed. This enforces the continuity and the ordering constraints (C) and (O) in the solution. Notice also that we search for the path with maximum average energy rather than maximum total energy, to prevent a bias toward longer paths and consequently smaller disparities.
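The constrained best-match search can be sketched with the classic DP recurrence over down, right and diagonal steps. For brevity, this sketch maximizes the total similarity along the path rather than the average that (6) prescribes:

```python
import numpy as np

def best_path(sim):
    """Monotone path maximizing total similarity with down, right and
    diagonal steps only (continuity + ordering). The paper maximizes the
    *average* similarity; this sketch uses the total for simplicity."""
    n, m = sim.shape
    acc = np.full((n, m), -np.inf)
    acc[0, 0] = sim[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = max(acc[i - 1, j] if i else -np.inf,
                       acc[i, j - 1] if j else -np.inf,
                       acc[i - 1, j - 1] if i and j else -np.inf)
            acc[i, j] = sim[i, j] + prev
    # Backtrack from the bottom-right corner, preferring diagonal steps.
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        cands = [((i - 1, j - 1), acc[i - 1, j - 1] if i and j else -np.inf),
                 ((i - 1, j), acc[i - 1, j] if i else -np.inf),
                 ((i, j - 1), acc[i, j - 1] if j else -np.inf)]
        (i, j) = max(cands, key=lambda c: c[1])[0]
        path.append((i, j))
    return path[::-1]

print(best_path(np.eye(4)))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Diagonal steps correspond to constant disparity, while vertical and horizontal steps correspond to the occlusions discussed below.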
Given the best match-path p* found by the dynamic programming, we define our loss function as

(7)  L_DP = (1/|p*|) Σ_{(i,j) ∈ p*} max(0, μ + max_{j'} S̄_{ij'} − S^p_{ij}),

where μ is a loss margin and S̄ is the positive similarity matrix S^p in which all neighbors within a given radius of the elements belonging to p* are masked out by setting their values to −∞.
The best match-path computed by the dynamic programming might contain vertical and horizontal segments. These segments correspond to patches that are occluded by foreground objects in one of the views, and thus do not have correct matches. Therefore, in our experiments we ignore vertical and horizontal segments longer than a threshold during learning. For more details, please refer to the supplementary materials.
4 Experiments
Our experiments were done in the Torch framework [13]. Optimization was performed with the ADAM method with standard settings, using mini-batches of size equal to the training image height, and no data augmentation of any sort. The weights and biases of our deep metric network were initialized in the standard way, by random sampling from a zero-mean uniform distribution.
We guarantee the reproducibility of all experiments in this section by using only publicly available datasets, and by making our code available online under an open-source license after publication.
4.1 Datasets
In our experiments we use three popular benchmark datasets: KITTI’12 [15], KITTI’15 [34] and Middlebury (MB) [40, 41, 39, 21, 38]. These datasets have online scoreboards [1, 2], showing comparative performance of all participating stereo methods.
KITTI’12 and KITTI’15 datasets each consist of 200 training and 200 test rectified stereo pairs of resolution 1226×370, acquired from cars moving around a city. About 30% of the pixels in the training set are supplied with a ground truth disparity acquired by a laser altimeter, with an error of less than 3 pixels. The disparity range is about 230 pixels. Each dataset is supplied with an extension (respectively KITTI’12-EXT and KITTI’15-EXT) that contains 19 additional stereo pairs for each scene, without ground truth disparity. This allows us to use 40× more training data for the semi-supervised learning than for the supervised one (actually even more, considering that only about 30% of pixels in the training set have labels).
The Middlebury dataset (MB) consists of 60 training and 30 test rectified stereo pairs. The images are acquired by different stereo systems and contain different artificial scenes. Their resolutions vary from 380×430 to 3000×2000 pixels, and their disparity ranges from 30 to 800 pixels. The training images are provided with a dense ground truth disparity acquired by a structured-light system with an error of less than 0.2 pixels.
4.2 Performance measure
To estimate the performance of deep metrics, we compute a prediction error rate defined as the proportion of non-occluded patches for which the predicted disparity is off by more than 3 pixels.
The motivation behind this work is to improve the metric as a means to match patches in a standalone manner, so we do not take into account the interplay with the additional post-processing that may be applied in a complete stereo pipeline. Performance regarding this main objective is measured by picking the patch with the largest similarity among the patches that lie within the valid disparity range on the epipolar line. We call this the winner-take-all (WTA) error rate.
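The WTA error rate itself is straightforward to compute; a sketch with made-up disparities and the 3-pixel tolerance:

```python
import numpy as np

def wta_error_rate(pred_disp, gt_disp, valid_mask, tol=3.0):
    """Fraction of valid (non-occluded, labeled) pixels whose predicted
    disparity is off by more than `tol` pixels."""
    err = np.abs(pred_disp - gt_disp) > tol
    return float(err[valid_mask].mean())

gt = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([10.5, 24.5, 30.0, 0.0])
mask = np.array([True, True, True, True])
print(wta_error_rate(pred, gt, mask))  # 0.5: two of four are off by > 3 px
```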
A second measure is the error rate of a complete stereo pipeline with the deep metric plugged in. This is a performance measure of direct practical interest, although not the objective we optimize during training.
4.3 Deep metric architecture
The main contribution of this work is a new semi-supervised training method, not a deep metric architecture; therefore we simply adopt the overall architecture of the well-performing MC-CNN-fst network from [58], shown in Table 1, and substitute our learning method for theirs.
Parameter  KITTI’12/’15  MB 

Number of CNN layers  4  5 
Number of features per layer  64  64 
Receptive field  3x3x64  3x3x64 
Activation function  ReLU  ReLU 
Equivalent patch size  9x9  11x11 
Similarity metric  Cosine  Cosine 
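The “equivalent patch size” row of Table 1 follows from the receptive-field arithmetic of stacked 3×3 convolutions with stride 1 and no pooling; a one-line sketch:

```python
def equivalent_patch_size(n_layers, kernel=3):
    """Receptive field of a stack of `kernel` x `kernel` convolutions
    (stride 1, no pooling): each layer grows it by kernel - 1."""
    return 1 + n_layers * (kernel - 1)

print(equivalent_patch_size(4))  # 9  -> 9x9 patches (KITTI networks)
print(equivalent_patch_size(5))  # 11 -> 11x11 patches (Middlebury network)
```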
4.4 Comparison of semisupervised methods
In this experiment we compare the performance of the proposed semi-supervised methods on the KITTI’12 dataset, using the winner-take-all (WTA) error (see § 4.2). The results are shown in Table 2.
Method  WTA error, [%]  Time, [hr] 

MIL  18.45  45 
CONTRASTIVE  17.63  30 
MIL-CONTRASTIVE  16.12  65 
CONTRASTIVE-DP  14.61  68 
The main conclusion is that semi-supervised methods that use more stereo constraints during learning perform better. For example, MIL, which uses only the epipolar and disparity range constraints, has the largest WTA error, whereas CONTRASTIVE-DP, which uses the epipolar, disparity range, continuity, uniqueness and ordering constraints, has the smallest.
In all following sections, we use only the best-performing CONTRASTIVE-DP method, and refer to it as MC-CNN-SS, where SS stands for semi-supervised.
4.5 Comparison with supervised method
In this section, we compare the proposed semi-supervised method with our reference fully supervised deep-metric baseline [58] on the three datasets, using the winner-take-all (WTA) error (see § 4.2).
The results are shown in Table 3. As we can see, our method outperforms the supervised method in terms of WTA error on two of the sets, and does virtually as well on the third. This is remarkable considering that our method does not use ground truth disparity during learning.
The success of our method in the case of the KITTI’12 and KITTI’15 sets can be attributed to the fact that these sets have a large amount of unlabeled stereo data that our method can exploit. In fact, these sets contain more than 40× as much unlabeled data as labeled training data.
In the case of the MB dataset, our method does not have this advantage over the supervised method: the set has only 30% more unlabeled than labeled training data. This is probably why our method shows slightly worse performance than the supervised method on this dataset.
Method  WTA error, [%]  

KITTI’12  KITTI’15  MB  
MC-CNN-fst [58]  15.44  15.38  29.94 
MC-CNN-SS-fst (ours)  13.90  14.08  30.06 
CENSUS 9x9 [56]  53.52  50.35  64.53 
AD 9x9  32.36  30.67  59.39 
4.6 Stereo benchmarking
In this section we investigate how well our semi-supervised deep metric performs when combined with a complete stereo pipeline. For that, we plugged it into the stereo pipeline from [58], and tuned the parameters of the pipeline using a simple coordinate descent method, starting from the default values of [58]. Note that we used specific metric and pipeline parameters for each dataset.
Then we computed disparity maps for the test sets, whose ground truth is withheld, and uploaded the results to the evaluation web sites of the respective datasets [1, 2]. The obtained evaluation results are shown in Tables 4, 5 and 6. As we can see, the results with our metric, trained without ground truth, are very close to those of the fully supervised method across all benchmarks.
These are very encouraging results, given in particular that we did not optimize the deep metric and the pipeline parameters jointly, and considering the performance in the winner-take-all setup of § 4.5.
Regarding the processing time, note that the network structure used for our method is identical to that of MC-CNN-fst [58], except for the pipeline parameters. The difference in processing times in Tables 4, 5 and 6 is only due to hardware differences.
#  Date  Algorithm  Pipeline Err, [%]  Time, [s] 

1  01/19/15  NTDE [24]  7.62  300 
2  08/28/15  MC-CNN-acrt [58]  8.29  254 
3  11/03/15  MC-CNN+RBS [7]  8.62  345 
4  01/26/16  MC-CNN-fst [58]  9.69  2.94 
5  14/11/16  MC-CNN-SS (ours)  12.3  5.59 
6  10/13/15  MDP [29]  12.6  130 
7  04/19/15  MeshStereo [60]  13.4  146 
#  Date  Algorithm  Pipeline Err, [%]  Time, [s] 

1  27/04/16  PBCP [42]  2.36  68 
2  26/10/15  Displets v2 [17]  2.37  265 
3  21/08/15  MC-CNN-acrt [58]  2.43  67 
4  30/03/16  cfusion [35]  2.46  70 
5  16/04/15  PRSM [52]  2.78  300 
6  21/08/15  MC-CNN-fst [58]  2.82  0.8 
7  03/08/15  SPSst [55]  2.83  2 
8  14/11/16  MC-CNN-SS (ours)  3.02  1.35 
9  03/03/14  VCSF [51]  3.05  300 
#  Date  Algorithm  Pipeline Err, [%]  Time, [s] 

1  26/10/15  Displets v2 [17]  3.43  265 
2  27/04/16  PBCP [42]  3.61  68 
3  21/08/15  MC-CNN-acrt [58]  3.89  2.94 
4  16/04/15  PRSM [52]  4.27  300 
5  06/11/15  DispNetC [33]  4.34  0.06 
6  11/04/16  ContentCNN [31]  4.54  1 
7  21/08/15  MC-CNN-fst [58]  4.62  0.8 
8  14/11/16  MC-CNN-SS (ours)  4.97  1.35 
9  03/08/15  SPSst [55]  5.31  2 
4.7 What does deep metric learn?
In Figure 2 we show positive similarity matrices before and after training with MC-CNN-SS on the KITTI’12 dataset. While one cannot visually distinguish the best match in the similarity matrices before training, it becomes clearly visible afterwards. This suggests that training improves the discriminative ability of the deep metric.
In Figure 3 we show failure cases of the learned deep metric. Most failures happen when the ground truth match is visually indistinguishable from the incorrect match picked by the deep metric. This happens when the reference patch comes from a flat image area, an area with a repetitive texture, or an area with a horizontal edge.
Notably, some failures are triggered by probable errors in the ground truth. Such errors might worsen the outcome of supervised learning, but do not affect our semi-supervised learning, since it does not use the ground truth.
4.8 Generalization across datasets
In this experiment, we study how a deep metric trained with our semi-supervised method on one dataset performs on the other datasets, in terms of WTA error.
From Table 7 it appears that a metric always performs better when the training and test populations come from the same dataset. This confirms that our semi-supervised method has great practical value: it allows tuning the descriptor for the particular stereo system at hand, even if no dataset with ground truth is available.
Training set  WTA error, [%]  

KITTI’12  KITTI’15  MB  
KITTI’12  13.90  15.52  34.85 
KITTI’15  16.61  14.08  36.66 
MB  14.22  15.00  30.06 
5 Conclusion
We proposed novel semi-supervised techniques for training patch similarity measures for stereo reconstruction. These techniques make it possible to train with datasets for which ground truth is not available, by relying on simple constraints derived from properties of the optical sensor and from rough knowledge about the scenes to process.
We applied this framework to the training of a “deep metric”, that is, a deep Siamese neural network that takes two patches as input and predicts a similarity measure. Benchmarking on standard datasets shows that the resulting performance is as good as or better than published results with the same network trained on the same, but fully labeled, datasets (see Table 3).
This very good performance can be explained by the strong redundancy of a fully labeled dataset, due to the continuity of surfaces, coupled with inevitable labeling errors. The latter can degrade the performance of a fully supervised training process, and can only be mitigated by using prior knowledge about the regularity of the labeling, similar to the constraints we use.
The techniques we propose open the way, first, to using stereo reconstruction based on deep metrics on datasets for which no ground truth exists, such as planetary measurements. Second, they will allow the training of larger neural networks with very large unlabeled datasets. Our experiments show that the network we use does benefit from one order of magnitude more training samples than is available to the supervised method, as shown in Table 3. We expect this effect to be even more significant when our training method is used with larger networks that would overfit existing labeled training sets.
References
 [1] KITTI 2012, 2015 stereo scoreboards. http://www.cvlibs.net/datasets/kitti/. Accessed: 20161114.
 [2] Middlebury scoreboard. http://vision.middlebury.edu/stereo/. Accessed: 20161114.
 [3] K. Ali, D. Hasler, and F. Fleuret. FlowBoost – Appearance learning from sparsely annotated video. In CVPR, pages 1433–1440, 2011.
 [4] A. Ansar, A. Castano, and L. Matthies. Enhanced real-time stereo using bilateral filtering. 3DPVT, 2004.
 [5] B. Babenko. Multiple instance learning: algorithms and applications. NCBI Google Scholar, 2008.
 [6] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PNNet: Conjoined Triple Deep Network for Learning Local Image Descriptors. CoRR, 2016.
 [7] J. T. Barron and B. Poole. The fast bilateral solver. ECCV, 2016.
 [8] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a ”siamese” time delay neural network. In NIPS, 1994.
 [9] M. Calonder, V. Lepetit, M. Özuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a local binary descriptor very fast. PAMI, 2012.
 [10] A. Chakrabarti, Y. Xiong, S. J. Gortler, and T. Zickler. LowLevel Vision by Consensus in a Spatial Hierarchy of Regions. CVPR, 2015.
 [11] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting Visual Knowledge from Web Data. In ICCV, 2013.
 [12] Z. Chen, X. Sun, and L. Wang. A Deep Visual Correspondence Embedding Model for Stereo Matching Costs. ICCV, 2015.
 [13] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlablike environment for machine learning. In BigLearn, NIPS Workshop, 2011.
 [14] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT. ARXIV, 2014.
 [15] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
 [16] A. Geiger, M. Roser, and R. Urtasun. Efficient largescale stereo matching. ACCV, 2010.
 [17] F. Güney and A. Geiger. Displets: Resolving Stereo Ambiguities using Object Knowledge. CVPR, 2015.
 [18] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
 [19] H. Hirschmüller. Evaluation of stereo matching costs on images with radiometric differences. PAMI, 2008.
 [20] H. Hirschmüller, P. R. Innocent, and J. Garibaldi. RealTime CorrelationBased Stereo Vision with Reduced Border Errors. IJCV, 47, 2002.
 [21] H. Hirschmuller and D. Scharstein. Evaluation of Cost Functions for Stereo Matching. CVPR, pages 1–8, 2007.
 [22] M. Jahrer, M. Grabner, and H. Bischof. Learned local descriptors for recognition and matching. Computer Vision Winter Workshop, 2008.
 [23] J. Kim, V. Kolmogorov, and R. Zabih. Visual correspondence using energy minimization and mutual information. In ICCV, 2003.
 [24] K. R. Kim and C. S. Kim. Adaptive smoothness constraints for efficient stereo matching using texture and edge information. In ICIP, 2016.
 [25] J. Kowalczuk, E. T. Psota, and L. C. Pérez. Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences. Transactions on Circuits and Systems for Video Technology, 2012.
 [26] B. G. V. Kumar, G. Carneiro, and I. Reid. Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimising Global Loss Functions. CVPR, 2016.
 [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
 [28] L. Lefakis and F. Fleuret. Dynamic Programming Boosting for Discriminative MacroAction Discovery. ICML, 32:1548–1556, 2014.
 [29] A. Li, D. Chen, Y. Liu, and Z. Yuan. Coordinating multiple disparity proposals for stereo computation. In CVPR, 2016.
 [30] D. G. Lowe. Distinctive Image Features from ScaleInvariant Keypoints. IJCV, 2004.
 [31] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In CVPR, 2016.
 [32] D. Marr and T. Poggio. A Computational Theory of Human Stereo Vision. Biological Sciences, 204(1156):301–328, 1979.
 [33] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. CVPR, 2016.
 [34] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
 [35] V. Ntouskos and F. Pirri. Confidence driven TGV fusion. arXiv preprint arXiv:1603.09302, 2016.
 [36] E. T. Psota, J. Kowalczuk, M. Mittek, and L. C. Perez. MAP Disparity Estimation Using Hidden Markov Trees. ICCV, 2015.
 [37] S. E. Reed and H. Lee. Training deep neural networks on noisy labels with bootstrapping. ICLR, 2015.
 [38] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. Lecture Notes in Computer Science, 2014.
 [39] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In CVPR, 2007.
 [40] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. IJCV, 2001.
 [41] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. CVPR, 2003.
 [42] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In BMVC, 2016.
 [43] D. E. Shean, O. Alexandrov, Z. M. Moratto, B. E. Smith, I. R. Joughin, C. Porter, and P. Morin. An automated, open-source pipeline for mass production of digital elevation models (DEMs) from very-high-resolution commercial stereo satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 2016.
 [44] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. ICCV, 2015.
 [45] K. Simonyan, A. Vedaldi, and A. Zisserman. Descriptor Learning Using Convex Optimization. PAMI, 2013.
 [46] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. PAMI, 2012.
 [47] C. Strecha, T. Pylvänäinen, and P. Fua. Dynamic and scalable large scale image reconstruction. In CVPR, 2010.
 [48] E. Tola, V. Lepetit, and P. Fua. DAISY: A Fast Descriptor for Dense Wide Baseline Stereo and Multiview Reconstruction. PAMI, 2010.
 [49] I. Triguero, S. García, and F. Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 2013.
 [50] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua. Learning Image Descriptors with the Boosting-Trick. NIPS, 2012.
 [51] C. Vogel, S. Roth, and K. Schindler. View-consistent 3D scene flow estimation over multiple frames. In ECCV, 2014.
 [52] C. Vogel, K. Schindler, and S. Roth. 3D scene flow estimation with a piecewise rigid scene model. IJCV, 2015.
 [53] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep Multiple Instance Learning for Image Classification and Auto-Annotation. CVPR, 2015.
 [54] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. CVPR, 2015.
 [55] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation. ECCV, 2014.
 [56] R. Zabih and J. Woodfill. Non-parametric Local Transforms for Computing Visual Correspondence. ECCV, 1994.
 [57] S. Zagoruyko and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. CVPR, 2015.
 [58] J. Žbontar and Y. LeCun. Computing the Stereo Matching Cost With a Convolutional Neural Network. CVPR, 2015.
 [59] J. Žbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.
 [60] C. Zhang and Z. Li. MeshStereo: A Global Stereo Model with Mesh Alignment Regularization for View Interpolation. ICCV, 2015.