Deep-Learning Assisted High-Resolution Binocular Stereo Depth Reconstruction
This work presents dense stereo reconstruction using high-resolution images for infrastructure inspections. State-of-the-art stereo reconstruction methods, both learning and non-learning, consume too many computational resources on high-resolution data. Recent learning-based methods achieve top ranks on most benchmarks. However, they generalize poorly when task-specific training data are lacking. We propose to use a less resource-demanding non-learning method, guided by a learning-based model, to handle high-resolution images and achieve accurate stereo reconstruction. The deep-learning model produces an initial disparity prediction with uncertainty for each pixel of the down-sampled stereo image pair. The uncertainty serves as a self-measurement of the model's generalization ability and as the per-pixel searching range around the initially predicted disparity. The downstream process performs a modified version of the Semi-Global Block Matching method with the up-sampled per-pixel searching range. The proposed deep-learning assisted method is evaluated on the Middlebury dataset and on high-resolution stereo images collected by our customized binocular stereo camera. The combination of learning and non-learning methods achieves better performance on 12 out of 15 cases of the Middlebury dataset. In our infrastructure inspection experiments, the average 3D reconstruction error is less than 0.004 m.
UAV (Unmanned Aerial Vehicle) technology is widely integrated into infrastructure inspection, which requires dense 3D reconstruction of facilities such as bridges and power grids. Binocular stereo cameras are widely used for dense depth reconstruction due to their simplicity and low-cost hardware. However, stereo matching for high-resolution images is still challenging because of the heavy computation brought by the large disparity searching range and pixel count. In addition, image-based reconstructions suffer from lack of texture, slanted surfaces, and inadequate lighting, leading to reconstruction failure and sparse disparity predictions in these difficult image regions. Our goal is dense and accurate depth reconstruction for high-resolution images as shown in Fig. 1.
High-resolution stereo images have a large number of pixels and large searching ranges for potential disparities. Recently, deep-learning methods tend to outperform non-learning ones. However, the image size is limited by both the available training data and the computing hardware. Image sizes of common binocular stereo datasets are under 1 megapixel, e.g., KITTI 2015, Scene Flow, and NYU. Techniques such as encoder-decoder with skip connections and spatial pooling [12, 30] could reduce memory requirements. However, most of the methods also adopt the cost volume concept, which consumes a substantial amount of GPU memory. Recent works reduce memory use even further, but still not enough to fit a pair of our 4K images.
Non-learning methods such as Semi-Global Matching (SGM) consume over 50GB of CPU memory on 4K images (12 megapixels) with a disparity range of 1000 (measured from the SGM part of SPS-Stereo). The SGBM (Semi-Global Block Matching, which runs a simplified version of SGM by default) method implemented in the OpenCV package could handle our 4K image pairs directly with restricted computing resources. However, the model parameters are case dependent and SGBM might fail to predict disparities in many image regions, as shown later in Fig. 4. In these failure regions, searching for a stereo match within a large disparity range is difficult because there may be multiple disparities with similar matching cost. Intuitively, the search may be easier if the range is narrowed.
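A rough back-of-the-envelope estimate makes the memory pressure of a dense cost volume concrete. The sketch below assumes one 16-bit cost per pixel per candidate disparity and two such volumes (raw and aggregated costs). The 4000×3000 image size, the 2-byte cost type, the two-volume count, and the narrowed range of 64 are illustrative assumptions, not measurements from SPS-Stereo or OpenCV.

```python
def cost_volume_gib(width, height, num_disparities, bytes_per_cost=2, num_volumes=2):
    """Estimate the memory (in GiB) of dense SGM-style cost volumes.

    Assumes every pixel stores one cost per candidate disparity and that
    `num_volumes` such volumes (e.g. raw and aggregated costs) are kept.
    """
    entries = width * height * num_disparities
    return entries * bytes_per_cost * num_volumes / 2**30


# Illustrative 12-megapixel image with a disparity range of 1000.
full_range = cost_volume_gib(4000, 3000, 1000)
# Same image, but with a hypothetical narrowed range of 64 candidates per pixel.
narrow_range = cost_volume_gib(4000, 3000, 64)
print(f"full range: {full_range:.1f} GiB, narrowed range: {narrow_range:.1f} GiB")
```

Under these assumptions the footprint scales linearly with the number of candidate disparities, which is why narrowing the per-pixel range matters as much for memory as for matching ambiguity.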
We also observed that a deep-learning model could estimate its uncertainty of the disparity prediction. The uncertainty is a good hint of a possible disparity range which narrows the disparity searching range. Our work follows the above observations. The key contributions are:
We propose a hybrid approach that uses a learning-based model to guide a non-learning method in order to achieve high efficiency and accuracy in high-resolution stereo reconstruction tasks.
We train a deep-learning model to produce both disparity and uncertainty. The uncertainty is further utilized as the per-pixel searching range by a non-learning method.
We show in the experiments that the combination of learning and non-learning methods could accurately process high-resolution stereo images.
II Related Work
The majority of recent work related to depth estimation uses deep-learning models. Some target high-resolution images, e.g., Pillai et al. apply subpixel-convolutional layers to an encoder-decoder architecture to deal with large images. Wang et al. propose to initially predict depth from down-sampled images. Then the depth is incrementally up-scaled and corrected by a deep-learning model. Wofk et al. perform network pruning to reduce the resource requirement. These models could not handle the 4K image size of our stereo camera. Besides, high-resolution training data are not available. Considering the above limitations, we train and test our deep-learning model on low-resolution stereo images.
Our approach needs a deep-learning model to estimate the uncertainty of its disparity prediction. Our inspiration is derived from the work of Gal and Kendall on uncertainties of deep-learning methods. According to them, a model could learn to estimate the aleatoric uncertainty, which partially depends on the individual input data. During inference, the trained model predicts the possible error it might make on the current input. Our work embraces this method and makes a deep-learning model predict per-pixel disparity uncertainty. Then the uncertainty is directly utilized to compute the possible disparity range.
In our proposed approach, we guide an SGM algorithm and achieve better performance by directly providing depth estimation. Similar methods are used for depth completion or sensor fusion in stereo vision. Most of the fused depth information comes from direct sparse measurement, such as LIDAR [18, 3, 25, 13] and ToF (Time-of-Flight) sensors. In our work, the depth information comes from a deep-learning model on a per-pixel basis. Like Fischer et al., we modify the cost aggregation process of SGM, with an implementation inspired by the work of Shivakumar et al.
III Technical Approach
Fig. 2 shows the processing pipeline of the proposed approach. A deep-learning model (referred to as PSMNU) predicts both disparity and uncertainty from down-sampled images. The initial disparity is up-sampled to the original resolution. Then an occlusion proposal is derived from the disparity. The disparity and the occlusion proposal are processed by a guided filter. The uncertainty estimation is used to determine the per-pixel searching range (PPSR) which is also up-sampled. Finally, a modified SGBM algorithm (referred to as SGBMP) takes in the filtered disparity, occlusion proposal, and PPSR to weight the aggregated matching cost. With this pipeline, we recover accurate dense disparity predictions for 4K stereo images.
III-A Deep-learning model with uncertainty
We build PSMNU on PSMNet, which has promising performance. Following Kendall, we modify PSMNet to predict aleatoric uncertainties. Let $f$ represent our deep-learning model as a function which maps stereo images $\mathbf{I}$ to the disparity $\mathbf{d}$ of the left image. This mapping can be considered a random process and is assumed to follow a per-pixel Gaussian distribution expressed in (1).
where $\hat{d}_i$ is our disparity prediction for pixel $i$ and $d_i$ is the true disparity. $f_i$ denotes a mapping which produces a disparity at pixel $i$. The probability density of our model predicting a $\hat{d}_i$ equal to $d_i$ upon seeing $\mathbf{I}$ is represented as $p(d_i \mid f_i(\mathbf{I}))$. $\sigma_i$ is the standard deviation of the Gaussian distribution at pixel $i$. $\sigma_i$ also represents the uncertainty. We change the last regression layer of PSMNet to make $f$ output two channels. One channel is for $\hat{d}_i$ and the other for $\sigma_i$. We refer to our model as PSMNU (PSMNet with Uncertainty) in this work. To stabilize the computation and avoid division by zero, $s_i = \log \sigma_i^2$ is produced in practice. By using the loss function defined in (2), PSMNU needs no ground truth for $\sigma_i$.
with $N$ being the number of pixels and $l_i$ defined in (3).
In general, a lower $\sigma_i$ means the model is more confident. When $f$ is not confident in $\hat{d}_i$, $l_i$ tends to be large. To lower the loss, the model has to predict a large $s_i$ to attenuate $l_i$, but this is regularized by the last term of (2), which penalizes the model for predicting large $s_i$ values. In contrast, if a small $l_i$ is predicted, the model is allowed to give a small $s_i$ to lower the regularization term and also the loss function. $s_i$ (and $\sigma_i$) behaves consistently with its role in the Gaussian distribution defined in (1). A large $\sigma_i$ leads to a uniform and lower probability of $\hat{d}_i$ being equal to $d_i$, while a small $\sigma_i$ indicates that $\hat{d}_i$ is close to $d_i$.
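The trade-off between attenuation and regularization can be checked numerically. The sketch below assumes the common log-variance form of the Gaussian negative log-likelihood, 0.5·exp(−s)·l + 0.5·s averaged over pixels, where s is the predicted log-variance and a squared disparity error stands in for the per-pixel error term l. The exact weighting and error term of (2) and (3) are not reproduced here, and the function name is hypothetical.

```python
import math


def aleatoric_loss(pred, log_var, target):
    """Mean heteroscedastic loss over pixels (illustrative form only).

    pred, log_var, target: flat sequences of equal length.
    log_var[i] plays the role of s = log(sigma^2) for pixel i.
    """
    if not (len(pred) == len(log_var) == len(target)) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for d_hat, s, d in zip(pred, log_var, target):
        err = (d_hat - d) ** 2  # assumed per-pixel error term
        # exp(-s) attenuates the error; +s penalizes large uncertainty.
        total += 0.5 * math.exp(-s) * err + 0.5 * s
    return total / len(pred)


# A pixel with a large error: admitting uncertainty lowers the loss.
print(aleatoric_loss([10.0], [0.0], [0.0]), aleatoric_loss([10.0], [4.0], [0.0]))
# A pixel with a small error: claiming confidence lowers the loss.
print(aleatoric_loss([0.1], [4.0], [0.0]), aleatoric_loss([0.1], [-4.0], [0.0]))
```

No ground truth for the uncertainty appears anywhere in the function; the log-variance is shaped only by the disparity error it attenuates.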
We train PSMNU on the Scene Flow dataset (FlyingThings3D, Monkaa) with full-resolution images and a disparity range of 256. Later, the 4K images are down-sampled to 1/4 width and then fed to PSMNU. Therefore, the up-sampled disparity prediction covers 1024 pixels, which is enough for our tests. PSMNU is trained with a mini-batch of 4 on 4 NVIDIA TITAN X GPUs for 5 epochs. A prediction result from PSMNU on Middlebury Stereo Evaluation V3 is shown in Fig. 3 (first down-sampled to 768×1024; the result is up-sampled back to full resolution). The prediction has an average error of 1.18 pixels (also shown in Tab. I) compared with the true disparity. The uncertainty map shows that high uncertainty exists at most of the object edges, where disparities become discontinuous and occlusions happen.
III-B SGBM with per-pixel searching range (PPSR)
We use the predicted disparity and uncertainty from PSMNU to determine a disparity range for each pixel. This PPSR is represented as and we set for all our experiments. We implement our method by extending the SGBM method of OpenCV and name it SGBMP. Note that the disparity and the uncertainty are scaled before the up-sampling to ensure consistency between image scales. Once the PPSR is obtained, we focus on modifying the cost aggregation part of the SGBM method. For each pixel, the SGM aggregation cost of each candidate disparity inside the PPSR is expressed by (4).
where is the aggregated matching cost defined in (14) of , is the weighted , and are constant parameters. Equation (4) imposes a prior on to favor predicted by PSMNU. However, SGBMP trusts by a discount according to the value. It is further controlled by and the disparity distance between and . is a factor which adjusts the global weighting of . Typically, . We empirically decide in our experiments. In tests with our 4K images, gets saturated easily in some regions due to the large size of the image. SGBM already takes care of this issue following (13) of . We further scale down the stereo matching cost by a factor of 3 before the cost aggregation.
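The way a low-resolution prediction becomes a full-resolution per-pixel searching range can be sketched as follows. The block assumes the range is an interval centred on the predicted disparity with a half-width of `k` times the predicted standard deviation. The multiplier `k`, the nearest-neighbour up-sampling, the clamping bounds, and the function name are all hypothetical and are not the values or the implementation used in SGBMP.

```python
import math


def per_pixel_search_range(disp_lr, sigma_lr, scale, k, min_disp, max_disp):
    """Build full-resolution (low, high) search bounds for one image row.

    disp_lr, sigma_lr: low-resolution disparity and standard deviation.
    scale: width ratio between full and low resolution (e.g. 4).
    k: hypothetical multiplier controlling the half-width of the range.
    """
    ranges = []
    for d, s in zip(disp_lr, sigma_lr):
        d_full = d * scale  # disparities grow with the image width
        s_full = s * scale  # and so does the uncertainty, in pixels
        low = max(min_disp, math.floor(d_full - k * s_full))
        high = min(max_disp, math.ceil(d_full + k * s_full))
        # Nearest-neighbour up-sampling: repeat each bound `scale` times.
        ranges.extend([(low, high)] * scale)
    return ranges


# A range of 256 at 1/4 width covers 256 * 4 = 1024 pixels at full width.
row = per_pixel_search_range([10.0, 20.0], [0.5, 2.0], scale=4, k=2.0,
                             min_disp=0, max_disp=256 * 4)
print(row[0], row[4])
```

Confident pixels get a narrow interval and uncertain pixels a wide one, so the matcher spends its candidates where the network is least sure.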
SGBM applies a uniqueness ratio check (or peak ratio check, referred to as the UR check below) and an occlusion check to its disparity predictions. The UR check is controlled by the uniquenessRatio parameter of SGBM. SGBMP disables the UR check. This keeps disparity predictions from being eliminated by the case-dependent uniquenessRatio in difficult regions such as areas with low or repetitive texture. SGBMP also performs an occlusion check similar to that of SGBM. However, when the disparity from PSMNU has a significant error (such as the region marked by the red circle in Fig. 3 (c)), the error may survive the occlusion check, leading to occlusions of some other pixels. A special guided filter is developed to process the occlusion proposal derived from PSMNU’s initial disparity prediction. The occlusion proposals are represented as a logical mask with the same size as the left image. A window is centered on each pixel coordinate, and we run the guided filter described in Algorithm 1 in its horizontal version. We then run the vertical version of the filter on the result of the horizontal filtering. The window consists of 3 pixels in a row for the horizontal version and a 3-pixel column for the vertical one. The initial disparity from PSMNU gets updated after the guided filtering. This revised disparity is then used to generate the PPSR for SGBMP.
IV-A Comparison with true disparity
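To illustrate why a uniqueness ratio check removes predictions in repetitive or low-texture regions, the sketch below models the check on the documented meaning of `uniquenessRatio`: the best cost must beat every non-adjacent rival by the given percentage margin. It is a simplified per-pixel illustration with a hypothetical function name, assuming non-negative costs, and is not a copy of the OpenCV implementation.

```python
def passes_uniqueness_check(costs, uniqueness_ratio):
    """Return (best_index, is_unique) for one pixel's aggregated costs.

    costs: non-negative aggregated matching cost per candidate disparity.
    uniqueness_ratio: required winning margin in percent (0 disables it).
    """
    best = min(range(len(costs)), key=costs.__getitem__)
    min_cost = costs[best]
    for d, cost in enumerate(costs):
        if abs(d - best) <= 1:
            continue  # the winner and its direct neighbours are not rivals
        if cost * (100 - uniqueness_ratio) < min_cost * 100:
            return best, False  # a rival is within the margin
    return best, True


# Two nearly equal minima, as produced by repetitive texture.
ambiguous = [50, 10, 50, 11, 50]
print(passes_uniqueness_check(ambiguous, 15))
print(passes_uniqueness_check(ambiguous, 0))
```

With a margin of 0 no rival can undercut the minimum, so every prediction survives, which is the behaviour SGBMP relies on when it leaves disambiguation to the narrowed searching range.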
Among various openly available binocular stereo datasets, the Middlebury Stereo Evaluation V3 has a large image size. We compared SGBM and SGBMP on all the 15 training cases of this dataset. The metrics of bad1.0, invalid, and avgErr defined by the Middlebury dataset are utilized. We also evaluate the standard deviation of avgErr, denoted as stdErr. stdErr measures the noise level of a disparity prediction. Lower stdErr means less noise. As discussed previously, we disable the UR check for SGBMP to make it possible to find a good stereo match inside difficult regions. In a second SGBM run, we also turn off the UR check for fair comparison and we name this run SGBMUR. The parameters are the same across the cases except minDisparity and numDisparity. The parameters are listed in Fig. 4.
Fig. 4 shows results of all the tests while Tab. I lists the detailed values of the metrics associated with the (a)-(c) rows in Fig. 4.
SGBM invalidates many disparity predictions, leading to high invalid and low avgErr, e.g., Jadeplant (Fig. 4 (c)). For our infrastructure inspection tasks, we prefer low invalid and low avgErr. PSMNU assists SGBMP to achieve this desired performance on 12 out of the 15 cases of the Middlebury dataset. Detailed results of all the 15 cases can be found on the project web-page.
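The four metrics can be sketched for a flattened disparity map as below. The block assumes that invalidated predictions and missing ground truth are encoded as `None`, that the bad-pixel rate and the error statistics are taken over pixels that have both a prediction and ground truth, and that stdErr is the population standard deviation of the absolute errors. These conventions and the function name are assumptions and may differ from the official Middlebury evaluation code.

```python
import math


def disparity_metrics(pred, truth, bad_threshold=1.0):
    """Compute bad-pixel rate, invalid rate, avgErr and stdErr.

    pred: predicted disparities, None where the method gave no prediction.
    truth: true disparities, None where no ground truth exists.
    """
    errors = []
    invalid = 0
    evaluated = 0
    for p, t in zip(pred, truth):
        if t is None:
            continue  # pixels without ground truth are ignored
        evaluated += 1
        if p is None:
            invalid += 1
        else:
            errors.append(abs(p - t))
    if not errors:
        raise ValueError("no pixel has both a prediction and ground truth")
    avg = sum(errors) / len(errors)
    std = math.sqrt(sum((e - avg) ** 2 for e in errors) / len(errors))
    bad = sum(1 for e in errors if e > bad_threshold)
    return {
        "bad": 100.0 * bad / len(errors),        # percent above the threshold
        "invalid": 100.0 * invalid / evaluated,  # percent without prediction
        "avgErr": avg,
        "stdErr": std,
    }


print(disparity_metrics([10.0, 12.5, None, 7.0], [10.5, 10.0, 3.0, None]))
```

Reporting invalid next to avgErr matters because a method can lower its average error simply by discarding the pixels it finds hard.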
Fig. 4 (a) Adirondack: When PSMNU performs well for most of the pixels, the prediction of SGBMP is better than or close to PSMNU.
Fig. 4 (b) PlaytableP: If there are some regions where PSMNU makes large errors, SGBMP could compensate and give better disparity predictions. This could be illustrated in the floor region of the PlaytableP case.
Fig. 4 (c) Jadeplant: When PSMNU performs poorly, SGBMP still manages to achieve an avgErr similar to that of SGBM but a lower (and better) invalid value.
Results of all the other 12 training cases provided by the Middlebury dataset are shown in Rows (d)-(f) of Fig. 4. Since PSMNU does not explicitly invalidate disparities, the invalid values of PSMNU in Tab. I are the results of manually masking the leftmost disparity predictions, and these values are for reference. The masked regions correspond to the areas in the left images where SGBM and SGBMP could not make any disparity predictions. Due to the nature of the difficult regions, such as textureless surfaces, disparity predictions may contain a high level of noise despite having a low avgErr. The stdErr values in Tab. I evaluate the noise level. On 12 out of the 15 cases of the Middlebury dataset, SGBMP achieves the lowest stdErr and invalid at the same time (the associated column names of Rows (d)-(f) in Fig. 4 are marked by *). In 10 of these 12 cases, SGBMP also has the lowest avgErr (the column names are marked by +).
The down-sampled image size for PSMNU is 768×1024 for all the 15 cases. With this image size, PSMNU consumes around 8GB of GPU memory for a single pair of stereo images. Tab. I also shows the execution time of the Adirondack case as an example. PSMNU’s average execution time is about 1.5s. Based on our experimental results, SGBMP needs roughly 50% more time than SGBM on average. We also submit all the results, including all the cases without ground truth, to the Middlebury Evaluation V3 web-site.
IV-B Performance on real-world stereo images
Tests on the Middlebury dataset show a promising performance gain of SGBMP. In this section, we test SGBMP on the high-resolution stereo images obtained by our experimental hardware in real-world infrastructure inspection scenarios. We use 4 identical 4K cameras to build 2 stereo cameras with identical baselines. These stereo cameras are installed on a handheld platform and a UAV as shown in Fig. 5. However, for this work, images are captured without flying. The cameras are externally triggered and hardware-synchronized with other sensors such as the LIDAR and IMU. We have to deal with many large and slanted surfaces and low-texture regions. The lenses have a significant vignetting effect under low lighting conditions, making the left and right images differ in brightness. We collected over 600 pairs of stereo images for 4 concrete structures and 2 building surfaces. In Fig. 6, results from one camera position are shown for 5 test cases. The parameters adopted for SGBM and SGBMP are the same as in Fig. 4 except for a uniquenessRatio of 0 for SGBMP and various minDisparity and numDisparity values. minDisparity and numDisparity are selected individually for each test case to make sure that the true disparities are inside the selected ranges. The current computing resource forces PSMNU to work with 1/16 of the original image size.
As illustrated in Fig. 6, SGBMP improves the accuracy compared with SGBM, while achieving high-resolution results. This could be attributed to the robust performance of PSMNU on real-world data. Regarding the execution time, SGBMP again needs about 50% more time than SGBM.
During the experiments, the camera may perform poorly under inadequate lighting. We show 3 such types of cases in Fig. 7, and they are extremely difficult for SGBM to reconstruct densely. Rows (a) and (b) of Fig. 7 have roughly the same camera positions as Rows (a) and (b) in Fig. 6. However, the lighting conditions are worse and the images appear darker. In Fig. 7 (c), the target is simply a flat concrete wall with minor decorations. The brightness level is so low that we could observe the vignetting effects on the borders of each image. These stereo image pairs also have inconsistent color due to the lighting. Most of the valid disparity predictions of SGBM are around object boundaries. In contrast, PSMNU and SGBMP keep their performance and recover most of the pixels of the foreground objects with accurate disparities.
Point clouds from a FARO survey scanner are utilized as the true depth to evaluate the absolute accuracy of SGBMP. We have scanned the stone pillar and the bridge support shown as the (b) and (c) rows in Fig. 6. The camera poses are also obtained from . For every predicted 3D point in the SGBMP cloud, a plane is fitted to the neighboring points from the survey scanner found within a radius of 0.05m. Then the point-to-plane distance is utilized as the reconstruction error. As shown in Fig. 1 and Fig. 8, the average errors are lower than 0.004m with the majority of the predicted points having errors lower than 0.01m. We observe that the error of SGBMP becomes larger as the 3D points lie farther from the camera. Large errors occur near the object edges where large depth discontinuities and occlusions are present.
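The point-to-plane error described above can be sketched in a few lines. The block assumes a brute-force neighbour search and a least-squares plane whose normal is the direction of smallest variance of the neighbours, found here by power iteration so that the example needs only the standard library. The function names are hypothetical; a real evaluation would more likely use a k-d tree and a linear-algebra library.

```python
import math


def fit_plane(points):
    """Least-squares plane through 3D points: returns (centroid, unit normal)."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[sum((p[a] - c[a]) * (p[b] - c[b]) for p in points) / n
            for b in range(3)] for a in range(3)]
    # The normal is the eigenvector of `cov` with the smallest eigenvalue.
    # Power iteration on (trace * I - cov) converges towards it.
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    shifted = [[(trace if a == b else 0.0) - cov[a][b] for b in range(3)]
               for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(200):
        w = [sum(shifted[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate neighbourhood: cannot fit a plane")
        v = [x / norm for x in w]
    return c, v


def point_to_plane_error(query, reference_points, radius=0.05):
    """Distance from `query` to the plane fitted to nearby reference points.

    Returns None when fewer than 3 reference points lie within `radius`.
    """
    neighbours = [p for p in reference_points if math.dist(p, query) <= radius]
    if len(neighbours) < 3:
        return None
    centroid, normal = fit_plane(neighbours)
    return abs(sum((query[k] - centroid[k]) * normal[k] for k in range(3)))


# Synthetic flat reference surface sampled every 1 cm on the plane z = 0.
scan = [(x * 0.01, y * 0.01, 0.0) for x in range(-4, 5) for y in range(-4, 5)]
print(point_to_plane_error((0.0, 0.0, 0.003), scan))
```

Measuring against a locally fitted plane rather than the nearest scanner point keeps the sparsity of the reference cloud from inflating the reported error.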
We present a high-resolution binocular stereo depth reconstruction pipeline that combines a deep-learning model with a non-learning method. Our deep-learning model, PSMNU, estimates its uncertainty on disparity prediction, and we use the uncertainty as a per-pixel searching range for the true disparity. With restricted computing resources, PSMNU produces accurate disparity predictions with associated uncertainties on down-sampled stereo images. The initial disparity prediction and the per-pixel disparity searching range are sent to the downstream non-learning method, SGBMP. SGBMP then predicts a dense disparity map with improved accuracy and a high valid pixel rate on high-resolution stereo images. We evaluate our approach on the Middlebury Stereo Evaluation V3 dataset. SGBMP delivers superior accuracy over both SGBM and PSMNU for most of the cases. The absolute accuracy is also evaluated on our 4K infrastructure inspection images. We compare SGBMP with the point clouds collected by a survey scanner. The experiments show that the average reconstruction error is below 0.004m.
Although we show significant improvements over various scenarios, the proposed method could still give bad predictions if PSMNU fails. To make PSMNU more robust, we plan to explore multi-task learning and multi-view depth reconstruction in future studies.
Special thanks to Huai Yu (Wuhan University, China) and Daisuke Hayashi (Shimizu Corporation, Japan) for helping with the data collection.
- (2011) PatchMatch stereo-stereo matching with slanted support windows. In BMVC, Vol. 11, pp. 1–11. Cited by: §I.
- (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418. Cited by: §III-A.
- (2017) Fusion of stereo and lidar data for dense depth map computation. In 2017 Workshop on Research, Education and Development of Unmanned Aerial Systems (RED-UAS), pp. 186–191. Cited by: §II.
- (2009) Aleatory or epistemic? does it matter?. Structural Safety 31 (2), pp. 105–112. Cited by: §II.
- (2011) Combination of time-of-flight depth and stereo using semiglobal optimization. In 2011 IEEE International Conference on Robotics and Automation, pp. 3548–3553. Cited by: §II.
- (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §II.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
- (2005) Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 2, pp. 807–814. Cited by: §III-B.
- (2012) A quantitative evaluation of confidence measures for stereo vision. IEEE transactions on pattern analysis and machine intelligence 34 (11), pp. 2121–2133. Cited by: §III-B.
- (2019) Geometry and uncertainty in deep learning for computer vision. Ph.D. Thesis, University of Cambridge. Cited by: §I, §II, §III-A, §III-A.
- (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §I.
- (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §I.
- (2018) Sparse-to-dense: depth prediction from sparse depth samples and a single image. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §II.
- (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). Note: arXiv:1512.02134. Cited by: §I, §III-A.
- (2015) Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I.
- (2013) A survey on time-of-flight stereo fusion. In Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications, pp. 105–127. Cited by: §II.
- (2019) Superdepth: self-supervised, super-resolved monocular depth estimation. In Proc. ICRA. Cited by: §II.
- (2016) High-resolution lidar-based depth mapping using bilateral filter. In 2016 IEEE 19th international conference on intelligent transportation systems (ITSC), pp. 2469–2474. Cited by: §II.
- (2014) High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition, pp. 31–42. Cited by: §III-A, §IV-A.
- (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47 (1-3), pp. 7–42. Cited by: §II.
- (2019) Real time dense depth estimation by fusing stereo with sparse depth measurements. In The International Conference on Robotics and Automation. Cited by: §II.
- (2012) Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pp. 746–760. Cited by: §I.
- StereoSGBM class reference. Note: Accessed: 2019-08-21. Cited by: §I, §IV-A.
- Stereosgbm.cpp. Note: Accessed: 2019-09-08. Cited by: §III-B.
- (2017) Sparsity invariant cnns. In 2017 International Conference on 3D Vision (3DV), pp. 11–20. Cited by: §II.
- (2019) Anytime stereo image depth estimation on mobile devices. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5893–5900. Cited by: §II.
- (2019) FastDepth: fast monocular depth estimation on embedded systems. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6101–6108. Cited by: §II.
- (2014) Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pp. 756–771. Cited by: §I.
- (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §I.
- (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §I.
- (2019) A joint optimization approach of lidar-camera fusion for accurate dense 3-d reconstructions. IEEE Robotics and Automation Letters 4 (4), pp. 3585–3592. Cited by: §IV-B, §IV-B.