Self-supervised learning for stereo reconstruction on aerial images


Recent developments established deep learning as an indispensable tool to boost the performance of dense matching and stereo estimation. On the downside, training these networks requires a substantial amount of training data to be successful. Consequently, the application of these models outside of the laboratory is far from straightforward. In this work we propose a self-supervised training procedure that allows us to adapt our network to the specific (imaging) characteristics of the dataset at hand, without the requirement of external ground truth data. We instead generate interim training data by running our intermediate network on the whole dataset, followed by conservative outlier filtering. Bootstrapped from a pre-trained version of our hybrid CNN-CRF model, we alternate the generation of training data and network training. With this simple concept we are able to lift the completeness and accuracy of the pre-trained version significantly. We also show that our final model compares favorably to other popular stereo estimation algorithms on an aerial dataset.

Patrick Knöbelreiter   Christoph Vogel   Thomas Pock
{knoebelreiter, vogel, pock}
Institute for Computer Graphics and Vision
Graz University of Technology

Index Terms—  large scale 3D, dense matching, CNN

1 Introduction

Acquired with modern high resolution cameras, aerial images can provide accurate 3D measurements of the observed scene via dense image matching. Consequently, through the years, stereo estimation has emerged as an attractive alternative to LiDAR (Light Detection and Ranging) in various tasks, like high resolution Digital Surface Model (DSM) generation or orthoimage production [1], leading to a simplified processing pipeline and reduced (flight) costs. Furthermore, the generation of stereo data is a common first step in many different applications, e.g. in 3D change detection [2] or semantic 3D reconstruction of urban scenes [3].

Recently, machine learning, and in particular deep learning, has affected many low-level vision tasks including stereo estimation, leading to considerably improved performance. Here, convolutional neural networks (CNNs) can be used to replace different parts in conventional stereo pipelines, e.g. the feature generation for computing the data term [4]. A different path is to directly formulate stereo estimation as a regression task [5]. In this work, we follow the former approach, which naturally requires fewer parameters, leading to networks that are easier to train and, in our experience, also generalize better. Especially the latter feature is attractive for our aerial reconstruction task. Dense ground truth data is notoriously hard to acquire, while artificial datasets [5] usually lack the photogrammetric properties of 'real world scenes' and especially of the specific dataset under consideration. The problem of missing ground truth data is further magnified by the fact that CNNs demand a lot of labeled training data to reach their full performance. While LiDAR measurements could provide at least (very) sparse ground truth, such an approach would negate the advantages of utilizing image based matching at all, with the additional problem that these measurements appear too sparse to be of use for CNN training. Nevertheless, the aim of this work is to utilize CNNs for stereo estimation of aerial scenes. To that end, we propose a self-supervised learning framework. Instead of formulating the problem as an unsupervised learning task, which ultimately leads to a fully generative approach, we rather directly utilize the dataset that has to be reconstructed as training data. In that sense, we are able to learn the specific imaging characteristics at hand. Starting from a pre-trained version of our network, we generate the training data simply by applying our reconstruction method on the whole dataset. To secure the integrity of our training data we employ strict and conservative outlier filtering and apply our training procedure on the unmasked, but still dense data. Our experiments indicate that this concept can lead to highly accurate reconstructions, improving the completeness (and accuracy) by 5 (4) percent compared to our pre-trained model and by up to 22 (24) percent compared to other competing stereo methods.

Fig. 1: Visualization of the textured 3D point cloud of Vaihingen generated by our algorithm.

2 Related Work

Commonly, dense stereo estimation from aerial images is formulated as a label-based Markov-Random Field (MRF) energy optimization problem, where methods operate on rectified image pairs. A popular representative is Semi-Global Matching (SGM) [6, 7] that approximately solves the MRF energy via dynamic programming (DP), with four scanlines per pixel. Later, the work of Zbontar et al. [4] paved the way for deep learning for stereo. They propose to replace the usual, handcrafted features that are used to define the data term in the energy, with a learned representation. Later, Luo et al. [8] exchange the patch-wise training of [4] with a method that learns the features on whole images instead, introducing a differentiable cost volume formulation in the CNN. Both methods rely on SGM to find a solution of their energy formulation and employ various post-processing steps to refine the solution. Mayer et al. [5] instead directly formulate the problem as an end-to-end regression task. Their CNN possesses several millions of parameters and, hence, requires a large amount of synthetic data for training.

To overcome the requirement for a sufficient amount of training data, the recent trend is to use only weak supervision. Tonioni et al. [9] generate their training data from a traditional formulation [10], but estimate a confidence score for the established matches with another CNN [11]. For training their regression network, the loss function combines the confidence weight to penalize deviations from their generated training data with an additional smoothness constraint on the solution. In contrast, we generate our training data using a state-of-the-art learned model [12] and employ a geometrically motivated consistency check with a hard, conservatively chosen threshold. Tulyakov et al. [13] explicitly utilize a pre-defined list of matching constraints to guide the learning. To that end, they are restricted to training the network per scanline to encode the constraints in the learning procedure. Another regression based approach is proposed by Zhou et al. [14]. They start from a randomly initialized network and construct their training data using their own reliable predictions. Matches are considered reliable if they survive a left-right (LR) consistency check. The network is then trained using only the reliable matches. The method is similar in spirit to our approach. In contrast, we advocate starting from a much better initialization using a pre-trained model [12]. In our experience this procedure is beneficial for both training time and final accuracy. Apart from that, our model is much closer to the traditional MRF problem.

3 Self-Supervised Dense Matching

In our setting we assume access to a larger set of already rectified image pairs on which we want to perform stereo matching. What we do not assume is access to ground truth data for any of these image pairs, which could be used for training. Our objective is to still apply a state-of-the-art stereo CNN and boost its performance on this specific dataset. In a nutshell, we exploit a pre-trained version of the CNN and, during training, its continuously improving successors to generate our own training data.
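The alternation described above can be sketched as a short loop. This is only an illustration of the control flow: the callables `predict_and_filter` and `fit` are hypothetical placeholders for the disparity computation with outlier filtering and for the network training, and the default of two rounds simply mirrors the two training iterations reported in the experiments.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Pair = TypeVar("Pair")
Label = TypeVar("Label")


def self_train(
    model: Model,
    pairs: Sequence[Pair],
    predict_and_filter: Callable[[Model, Pair], Label],
    fit: Callable[[Model, List[Tuple[Pair, Label]]], Model],
    rounds: int = 2,
) -> Model:
    """Alternate between generating filtered training data and retraining."""
    for _ in range(rounds):
        # Run the current model on the whole dataset; the (hypothetical)
        # callable also masks out disparities that fail the outlier filter.
        training_set = [(pair, predict_and_filter(model, pair)) for pair in pairs]
        # Retrain on the self-generated data; the result seeds the next round.
        model = fit(model, training_set)
    return model
```

Passing the two steps in as callables keeps the sketch independent of any particular network or filter.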

CNN-CRF Model. In this work we utilize the hybrid CNN-CRF model proposed in [12] that incorporates deep learning into classical energy minimization. Our CNN-CRF model minimizes the following typical CRF-type energy defined on the pixel graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ of an image with the usual 4-connected neighborhood structure $\mathcal{E}$:

$$\min_{x \in \mathcal{X}} \; \sum_{i \in \mathcal{V}} f_i(x_i) + \sum_{(i,j) \in \mathcal{E}} w_{ij}\, \rho(x_i, x_j) \tag{1}$$

The solution $x$ of (1) is a member of the set of mappings $\mathcal{X} = \mathcal{L}^{\mathcal{V}}$, representing a disparity map with disparities of range $\mathcal{L} = \{0, \dots, L-1\}$. Here, the data term $f$ and the regularizer, i.e. the edge weights $w$, are each represented as a CNN. The optimization of the CRF energy is performed via a massively parallel and highly efficient variant of dual decomposition. The whole system can be learned end-to-end [12]. In this work, however, we focus on the data term and keep the edge weights $w_{ij}$ and the penalty function $\rho$ in (1) fixed.
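To make the structure of the energy concrete, the following NumPy sketch evaluates it for a given disparity map. It is not the dual decomposition solver of [12]; it only scores a labeling. The array layout, the function names and the three-level penalty `sgm_penalty` (with hypothetical constants `p1` and `p2`, applied to the absolute disparity difference) are assumptions made for illustration.

```python
import numpy as np


def sgm_penalty(diff: np.ndarray, p1: float = 1.0, p2: float = 4.0) -> np.ndarray:
    """Hypothetical penalty: 0 for equal labels, p1 for a jump of 1, p2 otherwise."""
    return np.where(diff == 0, 0.0, np.where(diff == 1, p1, p2))


def crf_energy(disparity, unary, w_h, w_v, rho=sgm_penalty) -> float:
    """Evaluate a CRF energy on a 4-connected pixel grid.

    disparity: (H, W) integer labels
    unary:     (H, W, L) data costs, one per pixel and label
    w_h:       (H, W-1) weights of horizontal edges
    w_v:       (H-1, W) weights of vertical edges
    """
    rows, cols = np.indices(disparity.shape)
    # Data term: cost of the chosen label at every pixel.
    data = unary[rows, cols, disparity].sum()
    # Pairwise term: weighted penalty on label differences along each edge.
    diff_h = np.abs(disparity[:, 1:] - disparity[:, :-1])
    diff_v = np.abs(disparity[1:, :] - disparity[:-1, :])
    smooth = (w_h * rho(diff_h)).sum() + (w_v * rho(diff_v)).sum()
    return float(data + smooth)
```

In this layout, retraining only the data term corresponds to changing `unary` while `w_h`, `w_v` and `rho` stay untouched.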

Generating the training data. To bootstrap our procedure, we directly use the publicly available model ( with a 7 layered data term CNN, which was trained on the Middlebury Stereo 2014 dataset [15]. It has been shown in [12] that the model generalizes well to unknown scenes, which arguably makes it a good candidate for generating our initial training data. However, because the original training images are completely different from our aerial dataset, the reconstruction still contains outliers and erroneous regions. Therefore, directly using the resulting disparity images for training a new data term will rather harm the performance than improve our method. To mitigate this problem, our training procedure has to distinguish between regions, where it can trust the generated ground truth and where not.

Filtering the generated data. We use the common left-right consistency check to filter unreliable matches. To this end, we first compute two disparity maps, $d^l$ and $d^r$, for each image pair, where either the left image ($l$) or the right one ($r$) serves as the reference frame. For our filter we then require that matching points in the left and right image are in mutual correspondence for both disparity maps. More precisely, with the convention that a pixel $(u, v)$ of the left image is matched to the pixel $(u - d^l(u, v), v)$ of the right image, the pixel survives the left-right consistency check if

$$\left| d^l(u, v) - d^r\big(u - d^l(u, v),\, v\big) \right| \le \epsilon,$$

where $\epsilon$ is a threshold that is fixed in our experiments. This simple check gets rid of most of the wrong pixels and is, in our experience, already sufficient to retrain our model.
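A minimal NumPy sketch of such a left-right consistency check follows. It assumes positive disparities and the convention that a left pixel in column `u` is matched to column `u - d` in the right image. The default `eps` is a hypothetical value rather than the threshold used in the experiments, and the function name is illustrative.

```python
import numpy as np


def left_right_mask(disp_left: np.ndarray, disp_right: np.ndarray,
                    eps: float = 1.0) -> np.ndarray:
    """Return a boolean (H, W) mask of left-image pixels that pass the check."""
    h, w = disp_left.shape
    cols = np.arange(w)[None, :]
    # Column of the matching pixel in the right image (nearest integer).
    target = np.rint(cols - disp_left).astype(np.int64)
    inside = (target >= 0) & (target < w)
    # Clip only to keep the lookup in bounds; such pixels are rejected below.
    target_clipped = np.clip(target, 0, w - 1)
    disp_right_at_match = np.take_along_axis(disp_right, target_clipped, axis=1)
    consistent = np.abs(disp_left - disp_right_at_match) <= eps
    return inside & consistent
```

Pixels whose match falls outside the right image are rejected as well, which is in line with a conservative filter.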

Training. As stated above, the model consists of two networks, one for the data term $f$ and one for the regularizer, i.e. the edge weights $w$. In our experience, training edge costs for our regularizer requires the edges also to be represented in the training data. However, pixels near occlusions rarely survive our consistency check and are, thus, underrepresented in our self-supervised training data. Consequently, we keep the edge costs fixed and only retrain the network represented by $f$ for the aerial images. In particular, we generate a one-hot encoding of our ground truth disparity maps and perform maximum likelihood training, i.e. we minimize the following loss function w.r.t. the parameters $\theta$ of the network:

$$\min_{\theta} \; -\sum_{i} \sum_{d \in \mathcal{L}} p^{gt}_i(d) \, \log p_i(d; \theta) \;=\; -\sum_{i} \log p_i(d^*_i; \theta),$$

where $p(\theta)$ is the correlation volume predicted by the model, interpreted as a probability distribution over the disparities at each pixel $i$, and $p^{gt}$ is the one-hot encoding of the ground truth disparity map. The sums run over the pixels with a valid generated disparity. The second equality comes from the fact that the one-hot encoding puts all the probability mass on the ground truth disparity $d^*_i$.
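The equality between the one-hot form and the "picked" form of the loss can be checked numerically. The sketch below assumes that the correlation volume is turned into per-pixel probabilities with a softmax over the disparity axis and that filtered pixels are excluded with a boolean mask. Both are assumptions for illustration, as are all function names.

```python
import numpy as np


def log_softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (disparity) axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def nll_one_hot(scores, target, valid) -> float:
    """Cross-entropy against a one-hot encoding of the target disparities."""
    log_prob = log_softmax(scores)
    one_hot = np.eye(scores.shape[-1])[target]          # (H, W, L)
    per_pixel = -(one_hot * log_prob).sum(axis=-1)
    return float(per_pixel[valid].sum())


def nll_picked(scores, target, valid) -> float:
    """Same loss, reading the log-probability at the target disparity directly."""
    log_prob = log_softmax(scores)
    per_pixel = -np.take_along_axis(log_prob, target[..., None], axis=-1)[..., 0]
    return float(per_pixel[valid].sum())
```

Here `scores` has shape (H, W, L), `target` holds integer disparities (any in-range value at masked pixels) and `valid` marks the pixels that survived the filtering.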

4 Experiments

In this section we evaluate the effectiveness of our self learning algorithm in the context of aerial images. We compare the depth maps generated by the well-known Semi-Global Matching algorithm [6, 16] with the pre-trained CNN-CRF model and our model, refined via self-supervised training.

Dataset. We evaluate our method on the Vaihingen dataset of the ISPRS Urban classification and 3D reconstruction benchmark [17]. The Vaihingen dataset consists of 20 aerial images of size pixels. In each image of the dataset the blue channel has been replaced by the response of an infrared camera, which leads to further deviation between the pre-trained and refined model. Nevertheless, we could observe similar behavior for the Toronto dataset of the same benchmark [17] where the color channels are RGB. All images are registered in a global coordinate system. Additionally a laser point cloud is provided, which we use for our evaluation. We perform all our experiments at half resolution.

Both algorithms, SGM and CNN-CRF, require rectified input images. In order to limit the memory consumption during training, we additionally divide the images into parts.

Fig. 2: Visual comparison of disparity maps. Left: Generated training data. Right: Improved disparity map after retraining. Most of the (dark blue) artefacts are gone after the self-training. Color-coding from cold (small height) to warm (large height).

Performance evaluation. We use the provided laser scanned depth values as our reference data to compare the different models. The pipeline for the evaluation consists of (i) computing the disparity map in pixel space for an image pair, (ii) using the disparity to compute the metric depth value for all pixels in the reference image, (iii) projecting the laser point cloud into the reference image and (iv) computing the metric difference for all valid pixels in pixel space. Additionally, we compute the recall of the reconstructed points, given by

$$\text{recall} = \frac{\left| \mathcal{P}_{\text{model}} \cap \mathcal{P}_{\text{laser}} \right|}{\left| \mathcal{P}_{\text{laser}} \right|},$$

where $\mathcal{P}_{\text{model}}$ is the set of pixels with a valid (surviving the consistency check) disparity and $\mathcal{P}_{\text{laser}}$ the set of pixels with a laser measurement. A recall of 100% would mean that every pixel captured by the laser scanner is also captured by our model. We perform the evaluation using all available images and, therefore, report the numbers achieved on the whole dataset. Note that we use the laser measurements only for evaluation.
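Steps (ii) and (iv) and the recall can be sketched in a few lines of NumPy. The depth conversion assumes an ideal rectified pinhole pair, where depth equals focal length times baseline divided by disparity. The parameter names `focal_px` and `baseline_m` and all function names are hypothetical, and the projection of the laser point cloud in step (iii) is taken as given.

```python
import numpy as np


def disparity_to_depth(disparity: np.ndarray, focal_px: float,
                       baseline_m: float) -> np.ndarray:
    """Depth Z = f * B / d for a rectified pair; non-positive disparities give NaN."""
    depth = np.full(disparity.shape, np.nan, dtype=np.float64)
    positive = disparity > 0
    depth[positive] = focal_px * baseline_m / disparity[positive]
    return depth


def recall(valid_model: np.ndarray, valid_laser: np.ndarray) -> float:
    """Fraction of laser pixels that also carry a valid model disparity."""
    return float((valid_model & valid_laser).sum() / valid_laser.sum())


def metric_error(depth_model, depth_laser, valid_model, valid_laser) -> np.ndarray:
    """Absolute depth differences at pixels valid in both model and laser data."""
    both = valid_model & valid_laser
    return np.abs(depth_model[both] - depth_laser[both])
```

The share of entries of `metric_error(...)` below a distance threshold then gives an accuracy figure at that threshold.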

Table 1 compares the recall and the accuracy achieved by the baseline SGM model, our pre-trained model used to bootstrap the training and our model after the first and the second training iteration. The accuracy is given as the percentage of pixels within a defined 3D distance to the laser measurements. In our setting one disparity value corresponds to a 3D displacement of  to  meters. Each training iteration increases the recall and the accuracy at 0.5 m and 1 m; only the accuracy at 0.3 m drops slightly in the second iteration (from 65.2 to 64.5 percent). Compared to the SGM baseline, our final model increases the recall by 16.4 percentage points, and the self-trained models increase the accuracy by between 2.6 and 12.7 percentage points. Compared to the pre-trained version, the final model gains 4.7 percentage points in recall and 1.6 to 2.3 percentage points in accuracy. This shows that self learning is a suitable option to use deep learning on stereo data without ground truth. Fig. 2 visually compares the depth map obtained from the pre-trained network with the one computed with the retrained model. Our model is able to close the gaps in the reconstruction during retraining. A closer inspection reveals that masked regions mainly occur near building-ground edges and correspond to occlusions and, hence, cannot survive the consistency check. This underlines our findings from Table 1: self-training can improve the accuracy and performance and lead to significantly denser reconstructions.
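The differences between the rows of Table 1 can be recomputed directly from its entries. The short sketch below copies the table values and reports gains in percentage points; the dictionary layout and the helper `gain` are illustrative only.

```python
# Rows of Table 1: (recall, accuracy at 0.3 m, at 0.5 m, at 1 m), all in percent.
table = {
    "SGM":        (76.0, 52.5, 69.8, 86.7),
    "Pt-Net":     (87.7, 62.9, 76.4, 87.1),
    "Training 1": (92.1, 65.2, 78.6, 88.9),
    "Training 2": (92.4, 64.5, 78.7, 89.3),
}


def gain(model: str, reference: str) -> tuple:
    """Per-column improvement of `model` over `reference` in percentage points."""
    return tuple(round(m - r, 1) for m, r in zip(table[model], table[reference]))


print(gain("Training 2", "SGM"))     # (16.4, 12.0, 8.9, 2.6)
print(gain("Training 2", "Pt-Net"))  # (4.7, 1.6, 2.3, 2.2)
print(gain("Training 1", "SGM"))     # (16.1, 12.7, 8.8, 2.2)
```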

Model        Recall [%]   Accuracy [%]
                          0.3 m   0.5 m   1 m
SGM          76.0         52.5    69.8    86.7
Pt-Net       87.7         62.9    76.4    87.1
Training 1   92.1         65.2    78.6    88.9
Training 2   92.4         64.5    78.7    89.3

Table 1: Evaluation of the models and comparison with the laser ground truth. The self-learned models increase the performance on the target domain significantly compared to SGM and the pre-trained network (Pt-Net) used for retraining.

Fig. 3: Close-up comparison between our computed depth values and the laser depth values.

Fig. 4: Detailed view of a church. Note how accurately the church tower and the facade is reconstructed.

Fig. 3 compares the density of points between our computed depth map and the laser depth values projected into the image space. We are able to densely reconstruct the scene, whereas the laser provides only a sparse depth map. Fig. 4 shows a detailed reconstruction of our algorithm. In this visualization the color-coding is chosen to highlight high-frequency variations in depth. Here, especially the tower and the arches in the facade of the church prove that our model can deliver highly precise reconstruction from aerial images.

5 Conclusion & Future Work

We have shown that, without the requirement of any labeled training data, state-of-the-art machine learning approaches for stereo matching can be used to compute high quality depth maps from aerial images. Starting from a pre-trained version, our proposed self-supervised learning framework constructs the training data with a previous version of the learning algorithm itself and additionally relies on conservative consistency checking to reject most of the potential outliers. Our experiments indicate that this concept works for large scale aerial images, whose imaging characteristics are quite far from the initial dataset used for pre-training. Nevertheless, the perceptual quality as well as the raw performance numbers are increased significantly compared to baseline models.


  • [1] S. Gehrke, K. Morin, M. Downey, N. Boehrer, and T. Fuchs, “Semi-global matching: An alternative to LiDAR for DSM generation?,” 2012.
  • [2] R. Qin, J. Tian, and P. Reinartz, “3D change detection – approaches and applications,” in ISPRS Journal of Photogrammetry and Remote Sensing, 2016.
  • [3] M. Blaha, C. Vogel, A. Richard, J. D. Wegner, T. Pock, and K. Schindler, “Large-scale semantic 3D reconstruction: An adaptive multi-resolution model for multi-class volumetric labeling,” in Computer Vision and Pattern Recognition, 2016.
  • [4] J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, 2016.
  • [5] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Conf. on Computer Vision and Pattern Recognition, 2016.
  • [6] H. Hirschmüller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Conference on Computer Vision and Pattern Recognition, 2005, vol. 2.
  • [7] H. Hirschmüller, M. Buder, and I. Ernst, “Memory efficient semi-global matching,” in The XXII Congress of the International Society for Photogrammetry and Remote Sensing, 2012.
  • [8] W. Luo, A. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Conference on Computer Vision and Pattern Recognition, 2016.
  • [9] A. Tonioni, M. Poggi, S. Mattoccia, and L. Di Stefano, “Unsupervised adaptation for deep stereo,” in International Conf. on Computer Vision, 2017.
  • [10] R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in European Conference on Computer Vision, 1994.
  • [11] M. Poggi, F. Tosi, and S. Mattoccia, “Learning from scratch a confidence measure,” in British Conf. on Machine Vision, 2016.
  • [12] P. Knöbelreiter, C. Reinbacher, A. Shekhovtsov, and T. Pock, “End-to-End Training of Hybrid CNN-CRF Models for Stereo,” in Computer Vision and Pattern Recognition, 2017.
  • [13] S. Tulyakov, A. Ivanov, and F. Fleuret, “Weakly supervised learning of deep metrics for stereo reconstruction,” in International Conf. on Computer Vision, 2017.
  • [14] C. Zhou, H. Zhang, X. Shen, and J. Jia, “Unsupervised learning of stereo matching,” in International Conf. on Computer Vision, 2017.
  • [15] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition, 2014.
  • [16] S. Audet, Y. Kitta, Y. Noto, R. Sakamoto, and T. Akihiro, “libsgm,”
  • [17] F. Rottensteiner, G. Sohn, M. Gerke, and J. D. Wegner, “ISPRS test project on urban classification and 3D building reconstruction,” Commission III-Photogrammetric Computer Vision and Image Analysis, 2013.