A Learning-based Framework for Hybrid Depth-from-Defocus and Stereo Matching
Depth from defocus (DfD) and stereo matching are the two most studied passive depth sensing schemes. The techniques are essentially complementary: DfD can robustly handle repetitive textures that are problematic for stereo matching, whereas stereo matching is insensitive to defocus blur and can handle a large depth range. In this paper, we present a unified learning-based technique to conduct hybrid DfD and stereo matching. Our input is an image triplet: a stereo pair and a defocused image of one of the stereo views. We first apply depth-guided light field rendering to construct a comprehensive training dataset for such hybrid sensing setups. Next, we adopt the hourglass network architecture to separately conduct depth inference from DfD and stereo. Finally, we explore different connection methods between the two separate networks to integrate them into a unified solution that produces high-fidelity 3D disparity maps. Comprehensive experiments on real and synthetic data show that our new learning-based hybrid 3D sensing technique can significantly improve accuracy and robustness in 3D reconstruction.
Acquiring the 3D geometry of a scene is a key task in computer vision. Applications are numerous, from classical object reconstruction and scene understanding to the more recent visual SLAM and autonomous driving. Existing approaches can be broadly categorized into active and passive 3D sensing. Active sensing techniques such as LIDAR and structured light offer depth maps in real time but require complex and expensive imaging hardware. Passive scanning systems are typically more cost-effective and can conduct non-intrusive depth measurements, but maintaining their robustness and reliability remains challenging.
Stereo matching and depth from defocus (DfD) are the two best-known passive depth sensing techniques. Stereo recovers depth from the parallax of feature points between views. At its core are matching correspondences between feature points and filling the gaps by imposing specific priors, e.g., those induced by a Markov Random Field. DfD, in contrast, infers depth by analyzing blur variations at the same pixel captured with different focus settings (focal depth, aperture, etc.). Neither technique, however, is perfect on its own: stereo suffers from ambiguities caused by repetitive texture patterns and fails on edges lying along epipolar lines, whereas DfD is inherently limited by the aperture size of the optical system.
It is important to note that DfD and stereo are complementary: stereo provides accurate depth estimation even for distant objects, whereas DfD can reliably handle repetitive texture patterns. In computational imaging, a number of hybrid sensors have been designed to combine the benefits of the two. In this paper, we seek to leverage deep learning techniques to infer depth in such hybrid DfD and stereo setups. Recent advances in neural networks have revolutionized both high-level and low-level vision by learning a non-linear mapping between the input and output. Yet most existing solutions have exploited only stereo cues [21, 39, 40], and very little work addresses deep learning for hybrid stereo and DfD, or even DfD alone, mainly due to the lack of a fully annotated DfD dataset.
In our setup, we adopt a three-image setting: an all-focus stereo pair and a defocused image of one of the stereo views, the left in our case. We have physically constructed such a hybrid sensor using a Lytro Illum camera. We first generate a comprehensive training dataset for such an imaging setup. Our dataset is based on FlyingThings3D [22], which contains stereo color pairs and ground truth disparity maps. We then apply occlusion-aware light field rendering to synthesize the defocused image. Next, we adopt the hourglass network [23] architecture to extract depth from stereo and defocus respectively. The hourglass network features a multi-scale architecture that consolidates both local and global context to output per-pixel depth. We use a stacked hourglass network to repeat the bottom-up, top-down depth inference, allowing for refinement of the initial estimates. Finally, we explore different connection methods between the two separate networks to integrate them into a unified solution that produces high-fidelity 3D depth maps. Comprehensive experiments on real and synthetic data show that our new learning-based hybrid 3D sensing technique can significantly improve accuracy and robustness in 3D reconstruction.
1.1 Related Work
Learning based Stereo. Stereo matching is probably one of the most studied problems in computer vision. We refer the readers to the comprehensive surveys [29, 3]. Here we only discuss the most relevant work. Our work is motivated by recent advances in deep neural networks. One stream focuses on learning the matching function. The seminal work by Žbontar and LeCun [40] leveraged a convolutional neural network (CNN) to predict the matching cost of image patches, then enforced smoothness constraints to refine depth estimation. Zagoruyko and Komodakis [39] investigated multiple network architectures to learn a general similarity function for wide baseline stereo. Han et al. [10] described a unified approach that includes both feature representation and feature comparison functions. Luo et al. [21] used a product layer to facilitate the matching process, and formulated depth estimation as a multi-class classification problem. Other network architectures [5, 20, 25] have also been proposed to serve a similar purpose.
Another stream of studies exploits end-to-end learning. Mayer et al. [22] proposed a multi-scale network with a contractive part and an expanding part for real-time disparity prediction. They also generated three synthetic datasets for disparity, optical flow and scene flow estimation. Knöbelreiter et al. [17] presented a hybrid CNN+CRF model. They first utilized CNNs for computing unary and pairwise costs, then fed the costs into a CRF for optimization. The hybrid model is trained in an end-to-end fashion. In this paper we employ an end-to-end learning approach for depth inference for its efficiency and compactness.
Depth from Defocus. The amount of blur at each pixel carries information about the object’s distance. To recover scene geometry, earlier DfD techniques [31, 27, 37] rely on images captured with different focus settings (moving the objects, the lens or the sensor, changing the aperture size, etc.). More recently, Favaro and Soatto [9] formulated the DfD problem as a forward diffusion process where the amount of diffusion depends on the depth of the scene. Coded-aperture methods [18, 41] recovered scene depth and an all-focus image from images captured by a camera with a binary coded aperture. Based on a per-pixel linear constraint from image derivatives, Alexander et al. [1] introduced a monocular computational sensor to simultaneously recover depth and motion of the scene.
Varying the size of the aperture [26, 8, 33, 2] has also been extensively investigated. This approach does not change the distance between the lens and sensor, thus avoiding magnification effects. Our DfD setting uses a defocused and all-focus image pair as input, which can be viewed as a special case of varying the aperture. To tackle the task of DfD, we utilize a multi-scale CNN architecture. Unlike conventional hand-crafted features and engineered cost functions, our data-driven approach is capable of learning more discriminative features from the defocused image and inferring the depth at a fraction of the computational cost.
Hybrid Stereo and DfD Sensing. In the computational imaging community, there have been a handful of works that aim to combine stereo and DfD. Early approaches [16, 32] use a coarse estimation from DfD to reduce the search space of correspondence matching in stereo. Rajagopalan et al. [28] used a defocused stereo pair to recover depth and restore the all-focus image. Recently, Tao et al. [35] analyzed the variances of the epipolar image (EPI) to infer depth: the horizontal variance after vertical integration of the EPI encodes the defocus cue, while the vertical variance represents the disparity cue. Both cues are then jointly optimized in an MRF framework. Takeda et al. [34] analyzed the relationship between the point spread function and binocular disparity in the frequency domain, and jointly resolved the depth and deblurred the image. Wang et al. [36] presented a hybrid camera system that is composed of two calibrated auxiliary cameras and an uncalibrated main camera. The calibrated cameras were used to infer depth and the main camera provided DfD cues for boundary refinement. Our approach instead leverages a neural network to combine DfD and stereo estimations. To our knowledge, this is the first approach that employs deep learning for stereo and DfD fusion.
2 Training Data
The key to any successful learning based depth inference scheme is a plausible training dataset. Numerous datasets have been proposed for stereo matching but very few are readily available for defocus based depth inference schemes. To address the issue, we set out to create a comprehensive DfD dataset. Our DfD dataset is based on FlyingThings3D [22], a synthetic dataset consisting of everyday objects randomly placed in space. When generating the dataset, Mayer et al. [22] separated the 3D models and textures into disjoint training and testing parts. In total there are 25,000 stereo images with ground truth disparities. In our dataset we only select stereo frames whose largest disparity is less than 100 pixels.
The synthesized color images in FlyingThings3D are all-focus images. To simulate defocused images, we adopt the Virtual DSLR approach [38]. Virtual DSLR takes a color and disparity image pair as input, and outputs a defocused image with quality comparable to that of images captured by an expensive DSLR. The algorithm resembles the refocusing technique in light field rendering without requiring the actual creation of the light field, thus reducing both memory and computational cost. Further, Virtual DSLR takes special care of occlusion boundaries to avoid the color bleeding and discontinuities commonly observed in brute-force blur-based defocus synthesis.
For the scope of this paper, we assume circular apertures, although more complex ones can easily be synthesized, e.g., for coded-aperture setups. To emulate different focus settings of the camera, we randomly set the focal plane, and select the size of the blur kernel in the range of pixels. Finally, we add Poisson noise to both the defocused image and the stereo pair to simulate the noise contained in real images. We emphasize that the added noise is critical in real scene experiments, as will be discussed in Section 5.2. Our final training dataset contains 750 training samples and 160 testing samples, with each sample containing one stereo pair and the defocused image of the left view. The resolution of the generated images is , the same as the ones in FlyingThings3D. Figure 1 shows two samples of our training set.
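Two ingredients of this data generation step, a circular-aperture blur kernel and Poisson noise, can be illustrated with a small NumPy sketch. This is not the Virtual DSLR renderer: the blur below is uniform over the image and ignores depth and occlusion. The function names, the kernel radius and the `peak` photon count are assumptions made for illustration only.

```python
import numpy as np


def disc_kernel(radius: int) -> np.ndarray:
    """Normalized circular-aperture blur kernel of the given pixel radius."""
    y, x = np.mgrid[-radius : radius + 1, -radius : radius + 1]
    kernel = (x * x + y * y <= radius * radius).astype(np.float64)
    return kernel / kernel.sum()


def blur(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Depth-independent blur of an (H, W) image with edge padding."""
    r = kernel.shape[0] // 2
    padded = np.pad(image, r, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy : dy + h, dx : dx + w]
    return out


def add_poisson_noise(image: np.ndarray, peak: float, rng) -> np.ndarray:
    """Poisson noise for an image in [0, 1]; `peak` is an assumed photon count at white."""
    noisy = rng.poisson(image * peak) / peak
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((16, 16))      # stand-in for a rendered view in [0, 1]
kernel = disc_kernel(radius=2)    # hypothetical radius
defocused = blur(image, kernel)
noisy = add_poisson_noise(defocused, peak=255.0, rng=rng)
```

A smaller `peak` gives stronger noise relative to the signal, so it acts as the knob for how noisy the simulated sensor is.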
3 DfD-Stereo Network Architecture
Depth inference requires integration of both fine- and large-scale structures. For DfD and stereo, the depth cues can be distributed at various scales in an image. For instance, a textureless background requires understanding of a large region, while objects with complex shapes need attentive evaluation of fine details. To capture the contextual information across different scales, a number of recent approaches adopt multi-scale networks and the corresponding solutions have shown plausible results [7, 13]. In addition, recent studies [30] have shown that a deep network with small kernels is very effective in image recognition tasks. In comparison to large kernels, multiple layers of small kernels maintain a large receptive field while reducing the number of parameters to avoid overfitting. Therefore, a general principle in designing our network is a deep multi-scale architecture with small convolutional kernels.
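The trade-off between kernel size and parameter count can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions, counts weights for a single input/output channel pair and ignores biases; the helper names are illustrative and not taken from the paper.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def stacked_weights(kernel_size: int, num_layers: int) -> int:
    """Weights per input/output channel pair, biases ignored."""
    return num_layers * kernel_size * kernel_size


# Three stacked 3x3 layers cover the same window as one 7x7 layer.
rf_small = stacked_receptive_field(3, 3)
rf_large = stacked_receptive_field(7, 1)
# The stacked version needs fewer weights for that window.
w_small = stacked_weights(3, 3)
w_large = stacked_weights(7, 1)
print(rf_small, rf_large, w_small, w_large)  # 7 7 27 49
```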
3.1 Hourglass Network for DfD and Stereo
Based on the observations above, we construct multi-scale networks that follow the hourglass (HG) architecture [23] for both DfD and stereo. Figure 2 illustrates the structure of our proposed network.
The HG network features a contractive part and an expanding part with skip layers between them. The contractive part is composed of convolution layers for feature extraction, and max pooling layers for aggregating high-level information over large areas. Specifically, we perform several rounds of max pooling to dramatically reduce the resolution, allowing smaller convolutional filters to extract features that span the entire image. The expanding part mirrors the contractive part, with max pooling replaced by nearest-neighbor upsampling layers. A skip layer that contains a residual module connects each pair of max pooling and upsampling layers so that the spatial information at each resolution is preserved. Elementwise addition between the skip layer and the upsampled feature map follows to integrate the information across two adjacent resolutions. Both the contractive and expanding parts utilize a large number of residual modules [12]. Figure 2 (a) shows one HG structure.
One pair of contractive and expanding networks can be viewed as one iteration of prediction. By stacking multiple HG networks together, we can further reevaluate and refine the initial prediction. In our experiments we find that a two-stack network is sufficient to provide satisfactory performance. Adding more networks only marginally improves the results, at the expense of longer training time. Further, since our stacked HG network is very deep, we also insert auxiliary supervision after each HG network to facilitate the training process. Specifically, we first apply a convolution after each HG to generate an intermediate depth prediction. By comparing the prediction against the ground truth depth, we compute a loss. Finally, the intermediate prediction is remapped to the feature space by applying another convolution, then added back to the features output from the previous HG network. Our two-stack HG network has two intermediate losses, whose weights are equal to the weight of the final loss.
Before the two-stack HG network, we add a siamese network, whose two branches share the same architecture and weights. By using convolution layers that have a stride of 2, the siamese network serves to shrink the size of the feature map, thus reducing the memory usage and computational cost of the HG network. After the HG network, we apply deconvolution layers to progressively recover the image to its original size. At each scale the upsampled low-resolution features are fused with high-resolution features from the siamese network. This upsampling process with multi-scale guidance allows structures to be resolved at both fine and large scales. In our experiments, the downsample/upsample process largely facilitates the training and produces results that are very close to those obtained from full resolution patches. Finally, the network produces a pixel-wise disparity prediction at the end. For DfD and stereo, we utilize the same HG architecture, which we call HG-DfD-Net and HG-Stereo-Net. Figure 2 (b) shows the overall structure of both networks.
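The data flow of a single hourglass (pooling, skip layer, nearest-neighbor upsampling and elementwise addition) can be sketched on a single-channel map with NumPy. This is a structural toy, not the trained network: `block` is a placeholder for a residual module and defaults to the identity, and the recursion depth and map size are arbitrary choices.

```python
import numpy as np


def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling on an (H, W) map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def upsample_nearest_2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbor upsampling by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def hourglass(x: np.ndarray, depth: int, block=lambda f: f) -> np.ndarray:
    """Recursive hourglass; `block` stands in for a residual module."""
    skip = block(x)                           # skip layer at this resolution
    low = block(max_pool_2x2(x))              # contractive step
    if depth > 1:
        low = hourglass(low, depth - 1, block)
    low = block(low)
    return skip + upsample_nearest_2x(low)    # elementwise addition


features = np.arange(64, dtype=np.float64).reshape(8, 8)
out = hourglass(features, depth=3)
print(out.shape)  # (8, 8)
```

Each level of recursion halves the resolution, so `depth=3` on an 8x8 map reaches a 1x1 bottleneck before the expanding part restores the original size.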
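The equally weighted supervision described above can be written as a sum of MAE terms over every prediction in the stack. The sketch below is a minimal NumPy illustration that omits the regularization term; the function names and the toy predictions are assumptions.

```python
import numpy as np


def mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between a prediction and the ground truth."""
    return float(np.mean(np.abs(pred - target)))


def stacked_loss(predictions, target, weights=None) -> float:
    """Weighted sum of MAE over intermediate and final predictions."""
    if weights is None:
        weights = [1.0] * len(predictions)    # equal weights, as stated in the text
    return sum(w * mae(p, target) for w, p in zip(weights, predictions))


gt = np.zeros((2, 2))
# Two intermediate predictions followed by the final one (toy values).
preds = [np.full((2, 2), 3.0), np.full((2, 2), 2.0), np.full((2, 2), 1.0)]
total = stacked_loss(preds, gt)
print(total)  # 6.0
```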
3.2 Network Fusion
The most brute-force approach to integrating DfD and stereo is to directly concatenate the output disparity maps from the two branches and apply more convolutions. However, such an approach does not make use of the features readily present in the branches and hence neglects cues for deriving an appropriate combination of the predicted maps. Consequently, such naïve approaches tend to average the results of the two branches rather than improving on them, as shown in Table 1.
Instead, we propose HG-Fusion-Net to fuse DfD and stereo, as illustrated in Figure 3. HG-Fusion-Net consists of two HG networks, with extra connections between them. Each connection applies a convolution to the features of one network and adds the result to the other. In doing so, the two sub-networks can exchange information at various stages, which is critical for different cues from the two networks to interact with each other. The convolution kernel serves as a transformation of feature space, consolidating new cues into the other branch.
In our network, we set up pairs of interconnections at two spots, one at the beginning of each hourglass. At the cost of only four convolutions, the interconnections greatly multiply the paths through the network. The HG-Fusion-Net can be regarded as an ensemble of original HG networks with different lengths, which enables much stronger representation power. In addition, the fused network avoids solving the whole problem all at once: it first collaboratively solves the stereo and DfD sub-problems, then merges them into one coherent solution.
In addition to the above proposal, we also explore multiple variants of the HG-Fusion-Net. With no interconnection, the HG-Fusion-Net simply degenerates to the brute-force approach. A compromise between our HG-Fusion-Net and the brute-force approach is to use only one pair of interconnections. We choose to keep the first pair, the one before the first hourglass, since it enables the network to exchange information early. Apart from the number of interconnections, we also investigate identity interconnections, which directly add features to the other branch without going through a convolution. We present the quantitative results of all the models in Table 1.
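One pair of interconnections can be sketched as follows. The kernel size of the connecting convolution is not stated in the text as extracted, so the sketch assumes a 1x1 convolution, which reduces to a per-pixel linear map over channels. The function names, channel count and weight scale are likewise assumptions.

```python
import numpy as np


def conv1x1(features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 convolution: features (C_in, H, W), weight (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, features)


def interconnect(feat_a, feat_b, w_a_to_b, w_b_to_a):
    """Exchange features between two branches through a pair of convolutions."""
    new_a = feat_a + conv1x1(feat_b, w_b_to_a)
    new_b = feat_b + conv1x1(feat_a, w_a_to_b)
    return new_a, new_b


rng = np.random.default_rng(0)
dfd_feat = rng.standard_normal((4, 8, 8))       # toy DfD branch features
stereo_feat = rng.standard_normal((4, 8, 8))    # toy stereo branch features
w_ds = 0.1 * rng.standard_normal((4, 4))
w_sd = 0.1 * rng.standard_normal((4, 4))
dfd_out, stereo_out = interconnect(dfd_feat, stereo_feat, w_ds, w_sd)

# With identity weights the connection simply adds the other branch's features.
eye = np.eye(4)
dfd_id, stereo_id = interconnect(dfd_feat, stereo_feat, eye, eye)
```

Two such pairs, one before each hourglass, account for the four convolutions mentioned above.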
Optimization. The inputs of HG-DfD-Net, HG-Stereo-Net and HG-Fusion-Net are a defocused/all-focus image pair, a stereo pair, and a stereo pair plus the defocused image of the left view, respectively. All networks are trained in an end-to-end fashion. For the loss we use the mean absolute error (MAE) with -norm regularization. We adopt the MXNet [4] deep learning framework to implement and train our models. Our implementation applies batch normalization [14] after each convolution layer, and uses PReLU layers [11] to add nonlinearity to the network while avoiding “dead” filters. We also use the technique from [11] to initialize the weights. For the network solver we choose the Adam optimizer [15] and set the initial learning rate to 0.001, the weight decay to 0.002, β1 = 0.9 and β2 = 0.999. We train and test all the models on an NVIDIA Tesla K80 graphics card.
Data Preparation and Augmentation. To prepare the data, we first stack the stereo/defocus pair along the channel dimension, then extract patches from the stacked image with a stride of 64 to increase the number of training samples. Because the HG network contains multiple max pooling layers for downsampling, each patch needs to be cropped so that both its height and width are the nearest multiple of 64. In the training phase, we use patches of size as input. The large patch contains enough contextual information to recover depth from both defocus and stereo. To improve the generalization of the network, we also augment the data by flipping the patches horizontally and vertically. We perform the data augmentation on the fly at almost no additional cost.
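For reference, a single Adam update for one scalar weight with the hyperparameters quoted above can be sketched in plain Python. Folding the weight decay into the gradient and the value of `eps` are assumptions of this sketch; the framework's own optimizer may handle both differently.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              wd=0.002, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    g = grad + wd * w                      # weight decay folded into the gradient (assumption)
    m = beta1 * m + (1.0 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)         # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w1, m1, v1 = adam_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to 1, so the weight moves by roughly one learning rate regardless of the gradient magnitude.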
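The preparation steps above (channel stacking, cropping to a multiple of 64, strided patch extraction and flipping) can be sketched with NumPy. The patch size is not given in the text as extracted, so the value 128 below is a placeholder, as are the image dimensions and function names. Flips are applied jointly to the input stack and its disparity map.

```python
import numpy as np


def crop_to_multiple(image: np.ndarray, multiple: int = 64) -> np.ndarray:
    """Crop a (C, H, W) stack so that H and W are multiples of `multiple`."""
    _, h, w = image.shape
    return image[:, : h - h % multiple, : w - w % multiple]


def extract_patches(image: np.ndarray, patch: int, stride: int = 64):
    """Yield (C, patch, patch) patches from a (C, H, W) stack."""
    _, h, w = image.shape
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            yield image[:, top : top + patch, left : left + patch]


def flip(stack, disparity, horizontal: bool, vertical: bool):
    """Apply the same flips to the (C, H, W) stack and its (H, W) disparity."""
    if horizontal:
        stack, disparity = stack[:, :, ::-1], disparity[:, ::-1]
    if vertical:
        stack, disparity = stack[:, ::-1, :], disparity[::-1, :]
    return stack, disparity


all_focus = np.zeros((3, 150, 200))      # toy all-focus view
defocused = np.ones((3, 150, 200))       # toy defocused view
stacked = np.concatenate([all_focus, defocused], axis=0)   # (6, 150, 200)
cropped = crop_to_multiple(stacked)                         # (6, 128, 192)
patches = list(extract_patches(cropped, patch=128))         # placeholder patch size
```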
5.1 Synthetic Data
We train the HG-DfD-Net, HG-Stereo-Net and HG-Fusion-Net separately, and then conduct experiments on test samples from the synthetic data. Figure 4(a) compares the results of the three networks. We observe that results from HG-DfD-Net show clearer depth edges, but also exhibit noise in flat regions. In contrast, HG-Stereo-Net provides smooth depth. However, there is depth bleeding across boundaries, especially when there are holes, such as the tire of the motorcycle in the first row. We suspect that the depth bleeding is due to occlusion, by which DfD is less affected. Finally, HG-Fusion-Net combines the strengths of the two, producing smooth depth while keeping sharp depth boundaries. Table 1 also quantitatively describes the performance of different models on our synthetic dataset. Results from Table 1 confirm that HG-Fusion-Net achieves the best result for almost all metrics, with a notable margin over using stereo or defocus cues alone. The brute-force fusion approach without interconnection only averages the results from HG-DfD-Net and HG-Stereo-Net, making no further improvement. The networks with fewer or identity interconnections perform slightly worse than the HG-Fusion-Net, but still considerably better than the network without interconnection. This demonstrates that interconnections can efficiently broadcast information across branches and largely facilitate mutual optimization.
We also conduct another experiment on a scene with a staircase textured by horizontal stripes, as illustrated in Figure 4(b). The scene is rendered from the front view, making it extremely challenging for stereo since all the edges are parallel to the epipolar line. In contrast, DfD is able to extract the depth due to its 2D aperture. Figure 4(b) shows the resulting depths enclosed in the red box of the front view, demonstrating the effectiveness of our learning-based DfD on such a difficult scene. Note that the inferred depth is not perfect. This is mainly due to the fact that our training data lacks objects with stripe textures. We can improve the result by adding similar textures to the training set.
Table 1: Performance of the different models on the synthetic dataset. Columns: 1 px, 3 px, 5 px, MAE (px), Time (s).
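The accuracy columns of such a table can be computed from a predicted and a ground truth disparity map as sketched below. The sketch assumes that the 1 px, 3 px and 5 px columns report the fraction of pixels whose absolute disparity error exceeds the given threshold; the surviving text does not state this definition, and the function name and toy values are illustrative.

```python
import numpy as np


def disparity_metrics(pred, gt, thresholds=(1.0, 3.0, 5.0)):
    """Fraction of pixels with error above each threshold, plus the MAE."""
    err = np.abs(pred - gt)
    metrics = {f"> {t:g} px": float(np.mean(err > t)) for t in thresholds}
    metrics["MAE (px)"] = float(np.mean(err))
    return metrics


gt = np.zeros((2, 2))
pred = np.array([[0.5, 2.0], [4.0, 6.0]])
m = disparity_metrics(pred, gt)
print(m)  # {'> 1 px': 0.75, '> 3 px': 0.5, '> 5 px': 0.25, 'MAE (px)': 3.125}
```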
5.2 Real Scene
To conduct experiments on real scenes, we use a light field (LF) camera to capture the LF and generate the defocused image. An LF camera captures a rich set of rays to describe the visual appearance of the scene. In free space, the LF is commonly represented by a two-plane parameterization , where is the camera plane and is the image plane [19]. To conduct digital refocusing, we can move the synthetic image plane, which leads to the following photography equation [24]:
By varying , we can refocus the image at different depths. Note that by fixing , we obtain the sub-aperture image, which amounts to the image captured using a sub-region of the main lens aperture. Therefore, Eqn. 1 corresponds to shifting and adding the sub-aperture images .
In our experiment we use a Lytro Illum camera as our capturing device. We first mount the camera on a translation stage and move the LF camera horizontally to capture two LFs. Then we extract the sub-aperture images from each LF using the Light Field Toolbox [6]. The two central sub-aperture images are used to form a stereo pair. We also use the central sub-aperture image in the left view as the all-focus image due to its small aperture size. Finally, we apply the shift-and-add algorithm to generate the defocused image. Both the defocused and sub-aperture images have the size of .
The results on real scenes are shown in Fig. 5. We have conducted tests on both indoor and outdoor scenes. In general, both HG-DfD-Net and HG-Stereo-Net preserve depth edges well, but results from HG-DfD-Net are noisier. HG-Fusion-Net produces the best results, with smooth depth and sharp depth boundaries. The plant in the first row of Fig. 5 presents challenges for both stereo and DfD methods due to the heavy occlusion of branches and leaves, but HG-Fusion-Net manages to identify the fine structure of the leaves and generate correct depth values. We have also trained HG-Fusion-Net on a clean dataset without Poisson noise, and show the results in the last column of Fig. 5. The inferred depths exhibit severe noise patterns on real data, confirming the necessity of adding noise to the dataset to simulate real images.
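Shift-and-add refocusing can be sketched in a few lines of NumPy. The light field is assumed to be stored as a 4D array of sub-aperture images indexed by view and pixel; `slope` plays the role of the refocusing parameter, and its name and sign convention are assumptions of this sketch. Shifts are rounded to whole pixels and wrap around the border, which a practical implementation would replace with interpolation and proper padding.

```python
import numpy as np


def shift_and_add(sub_apertures: np.ndarray, slope: float) -> np.ndarray:
    """Refocus a light field stored as (U, V, H, W) sub-aperture images.

    Each view is shifted in proportion to its offset from the central view,
    then all views are averaged.
    """
    n_u, n_v = sub_apertures.shape[:2]
    cu, cv = (n_u - 1) / 2.0, (n_v - 1) / 2.0
    out = np.zeros(sub_apertures.shape[2:], dtype=np.float64)
    for u in range(n_u):
        for v in range(n_v):
            dy = int(round(slope * (u - cu)))
            dx = int(round(slope * (v - cv)))
            out += np.roll(sub_apertures[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (n_u * n_v)


# Toy light field: one bright point that moves by 1 px per view.
views = np.zeros((3, 3, 9, 9))
for u in range(3):
    for v in range(3):
        views[u, v, 4 + (u - 1), 4 + (v - 1)] = 1.0

in_focus = shift_and_add(views, slope=-1.0)     # shifts cancel the disparity
out_of_focus = shift_and_add(views, slope=0.0)  # point spreads over 3x3 pixels
```

With `slope=-1.0` all nine copies of the point align at the center pixel, while `slope=0.0` leaves them spread over a 3x3 neighborhood, which is the defocus blur for that depth.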
We have presented a learning based solution for a hybrid DfD and stereo depth sensing scheme. We have adopted the hourglass network architecture to separately extract depth from defocus and stereo. We have then explored multiple neural network architectures for linking both networks to improve depth inference. Comprehensive experiments show that our proposed approach preserves the strengths of DfD and stereo while effectively suppressing their weaknesses. In addition, we have created a large synthetic dataset for our setup that includes image triplets of a stereo pair and a defocused image along with the corresponding ground truth disparity.
Our immediate future work is to explore different DfD inputs and their interaction with stereo. For instance, instead of using a single defocused image, we can vary the aperture size to produce a stack of images where objects at the same depth exhibit different blur profiles. Learning based approaches can be directly applied to the profile for depth inference or can be combined with our current framework for conducting hybrid depth inference. We have presented one DfD-Stereo setup. Another minimal design was shown in , where a stereo pair with different focus distances is used as input. In the future, we will study the pros and cons of different hybrid DfD-stereo setups and tailor suitable learning-based solutions for fully exploiting the advantages of such setups.
[1] E. Alexander, Q. Guo, S. J. Koppal, S. J. Gortler, and T. E. Zickler. Focal flow: Measuring distance and velocity with defocus and differential motion. In ECCV, pages 667–682, 2016.
[2] V. M. Bove. Entropy-based depth from focus. Journal of the Optical Society of America, 10(10):561–566, 1993.
[3] M. Z. Brown, D. Burschka, and G. D. Hager. Advances in computational stereo. TPAMI, 25(8):993–1008, 2003.
[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[5] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. A deep visual correspondence embedding model for stereo matching costs. In ICCV, pages 972–980, 2015.
[6] D. Dansereau, O. Pizarro, and S. Williams. Decoding, calibration and rectification for lenselet-based plenoptic cameras. In CVPR, pages 1027–1034, 2013.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
[8] J. Ens and P. Lawrence. A matrix based method for determining depth from focus. In CVPR, pages 600–606, 1991.
[9] P. Favaro, S. Soatto, M. Burger, and S. J. Osher. Shape from defocus via diffusion. TPAMI, 30(3):518–531, 2007.
[10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR, pages 3279–3286, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[13] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In ECCV, 2016.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[16] W. N. Klarquist, W. S. Geisler, and A. C. Bovik. Maximum-likelihood depth-from-defocus for active vision. In International Conference on Intelligent Robots and Systems, pages 374–379 vol.3, 1995.
[17] P. Knöbelreiter, C. Reinbacher, A. Shekhovtsov, and T. Pock. End-to-end training of hybrid cnn-crf models for stereo. arXiv preprint arXiv:1611.10229, 2016.
[18] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM Trans. Graph., 26(3), 2007.
[19] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages 31–42, 1996.
[20] Z. Liu, Z. Li, J. Zhang, and L. Liu. Euclidean and hamming embedding for image patch description with convolutional networks. In CVPR Workshops, pages 72–78, 2016.
[21] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In TPAMI, pages 5695–5703, 2016.
[22] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pages 4040–4048, 2016.
[23] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. ECCV, pages 483–499, 2016.
[24] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan. Light field photography with a hand-held plenoptic camera. Stanford University Computer Science Tech Report, 2:1–11, 2005.
[25] H. Park and K. M. Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, 2016.
[26] A. P. Pentland. A new sense for depth of field. TPAMI, pages 523–531, 1987.
[27] A. N. Rajagopalan and S. Chaudhuri. Optimal selection of camera parameters for recovery of depth from defocused images. In CVPR, pages 219–224, 1997.
[28] A. N. Rajagopalan, S. Chaudhuri, and U. Mudenagudi. Depth estimation and image restoration using defocused stereo pairs. TPAMI, 26(11):1521–1525, 2004.
[29] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision, 47(1-3), 2002.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[31] M. Subbarao and G. Surya. Depth from defocus: A spatial domain approach. Int. J. Comput. Vision, 13(3):271–294, 1994.
[32] M. Subbarao, T. Yuan, and J. Tyan. Integration of defocus and focus analysis with stereo for 3d shape recovery. In Proc. SPIE, volume 3204, pages 11–23, 1997.
[33] G. Surya and M. Subbarao. Depth from defocus by changing camera aperture: a spatial domain approach. CVPR, pages 61–67, 1993.
[34] Y. Takeda, S. Hiura, and K. Sato. Fusing depth from defocus and stereo with coded apertures. In CVPR, pages 209–216, 2013.
[35] M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi. Depth from combining defocus and correspondence using light-field cameras. In ICCV, pages 673–680, 2013.
[36] T. C. Wang, M. Srikanth, and R. Ramamoorthi. Depth from semi-calibrated stereo and defocus. In CVPR, pages 3717–3726, 2016.
[37] M. Watanabe and S. K. Nayar. Rational filters for passive depth from defocus. Int. J. Comput. Vision, 27(3):203–225, May 1998.
[38] Y. Yang, H. Lin, Z. Yu, S. Paris, and J. Yu. Virtual DSLR: high quality dynamic depth-of-field synthesis on mobile platforms. In Digital Photography and Mobile Imaging XII, pages 1–9, 2016.
[39] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, pages 4353–4361, 2015.
[40] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, pages 1592–1599, 2015.
[41] C. Zhou, S. Lin, and S. Nayar. Coded aperture pairs for depth from defocus. In ICCV, pages 325–332, 2010.