# Real-Time Subpixel Fast Bilateral Stereo

###### Abstract

Stereo vision technique has been widely used in robotic systems to acquire 3-D information. In recent years, many researchers have applied bilateral filtering in stereo vision to adaptively aggregate the matching costs. This has greatly improved the accuracy of the estimated disparity maps. However, the process of filtering the whole cost volume is very time consuming and therefore the researchers have to resort to some powerful hardware for the real-time purpose. This paper presents the implementation of fast bilateral stereo on a state-of-the-art GPU. By highly exploiting the parallel computing architecture of the GPU, the fast bilateral stereo performs in real time when processing the Middlebury stereo datasets.

stereo vision, disparity maps, real-time, fast bilateral stereo, GPU.

## I Introduction

We live in a 3-D world but our eyes can only perceive objects in 2-D. The miracle of our depth perception is due to our brain’s ability to analyse the difference between the two 2-D images which are projected on the retinas of our eyes [1]. As for the digital images captured by cameras, they are 2-D in nature. By comparing the difference between the images captured using a pair of synchronised cameras, the 3-D information of the surrounding environment can be obtained [2]. This process is commonly referred to as stereo vision, and it is very similar to the human binocular vision [3]. For a well-calibrated stereo rig, the horizontal distance between each pair of corresponding points in the left image and in the right image is defined as disparity [4].

The two important factors in stereo vision are speed and accuracy [5]. A lot of research has been carried out over the past decade to improve both the precision of the disparity maps and the execution speed of the algorithms [6, 7]. However, the stereo vision algorithms designed to achieve better disparity accuracy usually have higher computational complexity and lower processing efficiency [8]. Hence, speed and accuracy are two desirable but conflicting properties, and it is very challenging to achieve both of them simultaneously [7]. Therefore, the main motivation of developing a stereo vision algorithm is to improve the trade-off between accuracy and speed [7]. In most circumstances, a desirable trade-off entirely depends on the target application. For example, a real-time performance is required for the stereo vision systems applied in mobile robotics because the other systems, e.g., obstacle detection and robot navigation, can be easily implemented in real time if the 3-D information is available.

The state-of-the-art algorithms for disparity estimation can mainly be classified as local and global [4]. The local algorithms [9, 10, 11, 5] simply match a series of blocks from the left and right images and select the correspondence with the lowest cost or the highest correlation. This optimisation is also known as Winner-Take-All (WTA) [5]. Unlike the local algorithms, the global algorithms process the stereo matching using some sophisticated techniques, e.g., Belief Propagation (BP) [12] and Graph Cut (GC) [13]. These techniques are usually developed based on the Markov Random Fields (MRF) [14], where the process of finding the best disparities translates to a probability maximisation problem [4]. However, finding the optimum values for the smoothness parameters is a difficult task due to the occlusion problem. Over-penalising the smoothness term can reduce the ambiguities around the discontinuities but on the other hand can cause errors for continuous areas [4]. In [15], Mozerov and Weijer proved that the initial energy optimisation problem in a fully connected MRF model can be formulated as an adaptive cost aggregation using bilateral filtering. Since then, a lot of endeavours have been made in local algorithms to improve the accuracy of the disparity maps by performing bilateral filtering on the cost volume before estimating the disparities. These algorithms are also known as fast bilateral stereo (FBS). However, filtering the whole cost volume is always computationally intensive and therefore the FBS has to be implemented on some powerful hardware for the purpose of real-time [4].

This paper presents the implementation of the FBS on an NVIDIA GTX 1080 GPU. Firstly, we provide the readers with some mathematical preliminaries about the FBS. The cost computation is optimised based on our previously published algorithm in [5]. Furthermore, each matching cost is saved in two 3-D cost volumes, hence reducing the computations by when estimating both the left and right disparity maps. Then, we provide some details on the practical implementation, such as the use of different types of device memory. The source code of our implementation is publicly available at: https://bitbucket.org/rangerfan/bilateral_stereo_gpu_middlebury.git.

The remainder of this paper is organised as follows: Section II describes the proposed disparity estimation algorithm. Implementation details are provided in Section III. In Section IV, we present the experimental results and evaluate the performance of the proposed stereo vision system. Section V summaries the paper and provides some recommendations for future work.

## Ii Algorithm Description

In general, a disparity estimation algorithm usually consists of four steps: cost computation, cost aggregation, disparity optimisation and disparity refinement [15]. However, the sequential use of these steps entirely depends on the chosen algorithm [16]. In the following subsections, we discuss each step in the proposed disparity estimation algorithm.

### Ii-a Cost Computation

In stereo matching, the most commonly used approaches for cost computation are the Absolute Difference (AD) and the Squared Difference (SD) [17]. However, these two approaches are very sensitive to the intensity difference between the left and right images, which may further lead to some incorrect matches in the process of disparity estimation. In this paper, we use the Normalised Cross-Correlation (NCC) as the cost function to measure the similarity between each pair of blocks selected from the left and right images. Although the NCC is more computationally intensive than the AD and the SD, it can provide more accurate results when the intensity difference is involved [4]. The cost function of the NCC is as follows [5]:

(1) |

where

(2) |

(3) |

is the correlation cost. denotes the intensity of a pixel at in the left image and represents the intensity of a pixel at in the right image. The centre of the reference square block is . Its width is and the number of pixels within each block is . is set to in this paper. and represent the means of the pixel intensities within the left and right blocks, respectively. and denote the standard deviations of the left and right blocks, respectively.

In practical implementation, the values of , , and are pre-calculated and stored in a static program storage for direct indexing [4]. Therefore, only the dot product in Eq. (1) needs to be calculated when computing the correlation cost between each pair of left and right blocks. This greatly reduces unnecessary computations. More details on the NCC implementation are provided in our previously published work [5].

The calculated correlation costs are simultaneously stored in the left and right 3-D cost volumes, as shown in Fig. 1. It is to be noted that the value of at the position of in the left cost volume is the same as that at the position of in the right cost volume. In the next step, each correlation cost in the two 3-D cost volumes is updated by aggregating the costs from its neighbourhood system.

### Ii-B Cost Aggregation

In global algorithms, finding the best disparities is equivalent to maximising the joint probability in Eq. (4) [4].

(4) |

where represents a vertex at the site of in the disparity map and denotes the intensity differences which correspond to different disparities . represents the neighbourhood system of . In this paper, the value of is set to and is an 8-connected neighbourhood. expresses the compatibility between each possible disparity and the corresponding block similarity. expresses the compatibility between and its neighbourhood system . However, Eq. (4) is commonly formulated as an energy function, as follows [4]:

(5) |

where corresponds to the correlation cost in this paper and determines the aggregation strategy. In the MRF model, the method to formulate an adaptive is important because the intensity in a discontinuous area usually greatly differs from those of its neighbours [18]. Therefore, some authors formulated as a piece-wise model to discriminate the discontinuous areas [19]. However, the process of minimising the energy function in Eq. (5) results in a high computational complexity, making real-time performance challenging. Since Tomasi et al. introduced the bilateral filter in [20], many authors have investigated its application to aggregate the matching costs [8, 21, 22]. Mozerov and Weijer also proved that the bilateral filtering is a feasible solution for the energy minimisation problem in the MRF model [15]. These methods are also known as fast bilateral stereo, where both intensity difference and spatial distance provide a Gaussian weighting function to adaptively constrain the cost aggregation from the neighbours. A general representation of the cost aggregation in the FBS is as follows:

(6) |

where

(7) |

(8) |

and are based on the spatial distance and the colour similarity, respectively. and are two parameters used to control the values of and , respectively. The costs within a square block are aggregated adaptively and an updated cost can be obtained.

Although the FBS has shown a good performance in terms of matching accuracy, it usually takes a long time to process the whole cost volume. Therefore, the FBS has to been implemented on some powerful hardware in order to perform in real time.

### Ii-C Disparity Optimisation

As discussed in Section II-B, the process of energy minimisation in global algorithms can be realised by performing bilateral filtering on the initial cost volumes. The best disparities can therefore be found by simply performing WTA on the left and right aggregated cost volumes. The left and right disparity maps, i.e., and , are shown in Fig. 1.

### Ii-D Disparity Refinement

This step usually involves several disparity map post-processing algorithms, such as weighted median filtering [8], left-right consistency (LRC) check [23] and subpixel enhancement [4]. In this subsection, we only discuss the latter two algorithms.

Firstly, according to the uniqueness constraint stated in [5], for an arbitrary point in the left image , there exists at most one corresponding point in the right image , namely:

(9) |

Therefore, the LRC check is always performed to remove the incorrect matches in the occlusion areas and find an outline in the disparity map [5]. The corresponding LRC check result is illustrated in Fig. 1.

Furthermore, since the distance between a 3-D object and the camera focal point is inversely proportional to the disparity value, a disparity error larger than one pixel may result in a non-negligible difference in the 3-D reconstruction result [4]. Therefore, a subpixel enhancement is always performed to increase the resolution of the disparity values. This can be achieved by fitting a parabola to three correlation costs , and around the initial disparity and then selecting the centre line of as the subpixel disparity [4]:

(10) |

The corresponding subpixel disparity map is shown in Fig. 1. In the next section, we provide more details on the implementation.

## Iii Implementation

The proposed algorithm is implemented on an NVIDIA GTX 1080 GPU to achieve real-time performance. An overview of the practical implementation is shown in Fig. 2. The GPU has 8 GB GDDR5X and 20 streaming multi-processors (SMs). Each SM consists of 128 streaming processors (SPs), and therefore the GPU has 2560 SPs in total. The Single Instruction Multiple Data (SIMD) architecture allows the SPs on the same SM to execute the same instruction but process different data at each clock cycle [24]. The host memory and the device memory are communicated with each other via Graphical/Memory Controller Hub (GMCH) and I/O Controller Hub (ICH), which are also known as the Intel northbridge and southbridge, respectively.

In our implementation, the left and right images are first sent to the global memory of the GPU from the host memory. Compared with the texture memory and constant memory which are read-only and cached on-chip, the global memory is off-chip and therefore less efficient in terms of memory requesting [5]. Furthermore, a thread is more likely to fetch the data from the closest addresses that its nearby threads accessed, and thus the use of cache in global memory is impossible [24]. Hence, we use the texture memory to optimise the caching for the left and right images. We first create two 2-D texture objects on the texture memory. Then, the texture objects are bound directly to the addresses of the left and right images in the global memory. The value of a pixel in the left or right image can thus be fetched from the texture objects instead of the global memory. This greatly minimises the memory requests from the global memory and further improves the speed of data fetching. Then, the instruction unit in each SM sends the same instructions to the SPs to compute and aggregate the correlation costs, and then optimise and refine the disparities.

In the first step, , , and are pre-calculated and their values are stored in global memory. When calculating the correlation cost using Eq. (1), the values of , , and are fetched from the global memory and the values of and are fetched from the texture memory. This significantly reduces the expensive computations of and and the memory requests from the global memory. The value of each correlation cost is saved in two 3-D cost volumes in the global memory.

As discussed in Section II-B, bilateral filtering is performed on the left and right cost volumes to aggregate the correlation costs adaptively. Due to the fact that constant memory is read-only and beneficial for the data that will not change over the course of a kernel execution, the values of and are pre-calculated and stored in the constant memory to reduce the unnecessary computations in Eq. (7) and (8). The left and right aggregated cost volumes are then stored in the global memory.

In the third step, each SP searches for the highest correlation cost between and . The position in the axis which corresponds to the highest cost is then selected as the desirable disparity for the position of . The estimated left and right disparity maps are then stored in the global memory for the following processes.

Finally, the instruction of disparity refinement is executed on each SM. In this paper, the left image is set as the reference, and therefore the LRC check is performed to remove the incorrect matches in the left disparity map. Then, each SP executes Eq. (10) and stores the corresponding subpixel disparity values in the global memory. The subpixel disparity map is then sent to the host memory via GMCH and ICH for display.

The performance of the proposed implementation is discussed in Section IV.

## Iv Experimental Results

As discussed in Section I, accuracy and speed are two main aspects of stereo vision and achieving desirable results entirely depends on a good trade-off between these two factors. Hence in the following subsections, we evaluate both the precision of the proposed algorithm and the execution speed of the practical implementation.

### Iv-a Accuracy Evaluation

In this subsection, we use Middlebury 2001 datasets [16] and 2003 datasets [25] to evaluate the accuracy of the proposed stereo matching algorithm. Some experimental results are illustrated in Fig. 3.

To quantify the accuracy of the estimated disparity maps, Barron et al. proposed to compute the Percentage of Error Pixels as follows [26]:

(11) |

where

(12) |

The threshold is used to determine whether an estimated disparity is correct or not and it is set to in this paper. stands for the total number of disparities used for evaluation. represents the estimated disparity map and denotes the corresponding ground truth data. To provide an in-depth performance evaluation of the proposed algorithm, we compute the values of for different regions of interest: the disparity map excluding occlusions (nonocc), the disparity map excluding occlusions and texture-less areas (nonocctextl), and the disparity map excluding occlusions and discontinuities (nonoccdiscont). The corresponding values of with respect to different input images and regions of interest are shown in Table I. Next, we use the “Cones” dataset to evaluate the performance of the algorithm with respect to different values of parameters in bilateral filtering, i.e., , and . The corresponding evaluation results are shown in Fig. 4, 5 and 6.

Region of interest | Cones | Teddy | Venus | Sawtooth |
---|---|---|---|---|

nonoccl | 8.2264 | 10.9244 | 2.8573 | 7.3800 |

nonocctextl | 8.2928 | 10.8922 | n/a | n/a |

nonoccdiscont | 2.7580 | 4.8556 | n/a | n/a |

Fig. 4 shows the percentage of error pixels with respect to different values of . The parameter in Eq. (7) is a Gaussian weighting function which is dependent on . Therefore, the curves in Fig. 4 for each region of interest have a local minimum. Similarly, it can be observed that each curve in Fig. 5 has a local minimum. This also conforms to the Gaussian weighting function in Eq. (8). Finally, it can be seen that for each curve in Fig. 6, the value of decreases with increasing . However, this decrease in is observed up to the point when is equal to . Beyond the latter value, increases slightly for each region of interest. This indicates that the performance of the FBS depends on the values of , and and we have to find their optimum values in order to achieve the best performance in terms of accuracy.

### Iv-B Speed Evaluation

In addition to accuracy, the execution speed of the proposed algorithm is also quantified in order to give an evaluation of the overall system’s performance. However, due to the fact that image size and disparity range are not constant among different datasets, a general way to depict the performance in terms of processing speed is given in millions of disparity evaluations per second as follows [7]:

(13) |

where and represent the width and height of the disparity map, denotes the maximum search range and represents the algorithm runtime in seconds. Our implementation achieves a performance of when is set to .

Furthermore, we use the “Cones” dataset to evaluate the speed performance with respect to different aggregation sizes. The runtime corresponding to different values of is shown in Fig. 7. It can be seen that increases with increasing . When is equal to , the execution speed of the proposed implementation is around fps. Although a better accuracy can be achieved when is set to (see Fig. 6), the processing speed is doubled (see Fig. 7). Therefore, provides a better trade-off between accuracy and speed in the proposed algorithm.

## V Conclusion and Future Work

In this paper, we presented the implementation of the FBS on a GTX 1080 GPU. On the algorithm side, the NCC is first simplified as a dot product. The values of and are pre-calculated and stored in a static program storage for direct indexing. Furthermore, each correlation cost is simultaneously stored in two 3-D cost volumes, thus decreasing the computations by approximately when estimating both the left and right disparity maps. Then, the correlation costs are aggregated adaptively by performing bilateral filtering on the left and right cost volumes. The left and right disparity maps are estimated by simply performing WTA on the filtered cost volumes. Finally, the disparities in half-occluded areas are removed and the subpixel resolution is achieved by conducting parabola interpolations around the initial disparities. By highly exploiting the parallel computing architecture, the proposed FBS performs in real time on the GTX 1080 GPU.

However, performing the bilateral filtering on the whole cost volume is still a time-consuming process. Therefore, we plan to use a group of reliable feature points to suggest the search range for their neighbours. Then, only the correlation costs around the suggested search range are calculated and the bilateral filtering is performed only on the space around the calculated disparities.

## Vi Acknowledgement

This paper is supported by Shenzhen Science, Technology and Innovation Commission (SZSTI) JCYJ20160428154842603 also supported by the Research Grant Council of Hong Kong SAR Government, China, under Project No. 11210017 and No. 16212815 and No. 21202816, NSFC U1713211 awarded to Prof. Ming Liu; Shenzhen Science, Technology and Innovation Commission (SZSTI) JCYJ20170818153518789 and National Natural Science Foundation of China No. 61603376, awarded to Dr. Lujia Wang.

## References

- [1] Ning Qian, “Binocular disparity and the perception of depth,” Neuron, vol. 18, no. 3, pp. 359–368, 1997.
- [2] Richard Hartley and Andrew Zisserman, Multiple view geometry in computer vision, Cambridge university press, 2003.
- [3] Trucco Emanuele and Verri Alessandro, “Introductory techniques for 3-d computer vision,” 1998.
- [4] Rui Fan, Xiao Ai, and Naim Dahnoun, “Road surface 3D reconstruction based on dense subpixel disparity map estimation,” IEEE Transactions on Image Processing, vol. PP, no. 99, pp. 1, 2018.
- [5] Rui Fan and Naim Dahnoun, “Real-time implementation of stereo vision based on optimised normalised cross-correlation and propagated search range on a gpu,” in Imaging Systems and Techniques (IST), 2017 IEEE International Conference on. IEEE, 2017, pp. 241–246.
- [6] Beau Jeffrey Tippetts, Real-Time Stereo Vision for Resource Limited Systems, Brigham Young University, 2012.
- [7] Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James Archibald, “Review of stereo vision algorithms and their suitability for resource-limited systems,” Journal of Real-Time Image Processing, vol. 11, no. 1, pp. 5–25, 2016.
- [8] Qingxiong Yang, Liang Wang, Ruigang Yang, Henrik Stewénius, and David Nistér, “Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 492–504, 2009.
- [9] Radim Sara, “Finding the largest unambiguous component of stereo matching,” Computer Vision-ECCV 2002, pp. 900–914, 2002.
- [10] Radim Sara, R., “Robust correspondence recognition for computer vision,” in Compstat 2006-Proceedings in Computational Statistics, pp. 119–131. Springer, 2006.
- [11] Jan Cech and Radim Sara, “Efficient sampling of disparity space for fast and accurate matching,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
- [12] Alexander T Ihler, W Fisher John III, and Alan S Willsky, “Loopy belief propagation: Convergence and effects of message errors,” Journal of Machine Learning Research, vol. 6, no. May, pp. 905–936, 2005.
- [13] Yuri Boykov, Olga Veksler, and Ramin Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
- [14] Marshall F Tappen and William T Freeman, “Comparison of graph cuts with belief propagation for stereo, using identical mrf parameters,” in Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, 2003, p. 900.
- [15] Mikhail G Mozerov and Joost van de Weijer, “Accurate stereo matching by two-step energy minimization,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1153–1163, 2015.
- [16] Daniel Scharstein and Richard Szeliski, “Middlebury stereo vision page,” 2002.
- [17] Daniel Scharstein and Richard Szeliski, “Middlebury stereo evaluation-version 2,” The Middlebury Computer Vision Pages (online), available from (accessed 2015-03-02), 2011.
- [18] Stan Z Li, Markov random field modeling in computer vision, Springer Science & Business Media, 2012.
- [19] Vladimir Kolmogorov and Ramin Zabih, “Computing visual correspondence with occlusions using graph cuts,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. IEEE, 2001, vol. 2, pp. 508–515.
- [20] Carlo Tomasi and Roberto Manduchi, “Bilateral filtering for gray and color images,” in Computer Vision, 1998. Sixth International Conference on. IEEE, 1998, pp. 839–846.
- [21] Asmaa Hosni, Christoph Rhemann, Michael Bleyer, Carsten Rother, and Margrit Gelautz, “Fast cost-volume filtering for visual correspondence and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 2, pp. 504–511, 2013.
- [22] Zhen Zhang, Advanced stereo vision disparity calculation and obstacle analysis for intelligent vehicles, Ph.D. thesis, University of Bristol, 2013.
- [23] Xing Mei, Xun Sun, Mingcai Zhou, Shaohui Jiao, Haitao Wang, and Xiaopeng Zhang, “On building an accurate stereo matching system on graphics hardware,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 467–474.
- [24] Rui Fan and Naim Dahnoun, “Real-time stereo vision-based lane detection system,” Measurement Science and Technology, vol. 29, no. 7, 2018.
- [25] Daniel Scharstein and Richard Szeliski, “High-accuracy stereo depth maps using structured light,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. IEEE, 2003, vol. 1, pp. I–I.
- [26] John L Barron, David J Fleet, and Steven S Beauchemin, “Performance of optical flow techniques,” International journal of computer vision, vol. 12, no. 1, pp. 43–77, 1994.