Generating and Exploiting Probabilistic Monocular Depth Estimates
Despite the remarkable success of modern monocular depth estimation methods, the accuracy achievable from a single image is limited, making it practically useful to incorporate other sources of depth information. Currently, depth estimation from different combinations of sources is treated as a set of distinct applications, each solved via a separate network trained to use the available sources as input. In this paper, we propose a common versatile model that outputs a probability distribution over scene depth given an input color image, as a sample approximation using outputs from a conditional GAN. This distributional output is useful even in the monocular setting, and can be used to estimate depth, pairwise ordering, etc. More importantly, these outputs can be combined with a variety of other depth cues—such as user guidance and partial measurements—for use in different application settings, without retraining. We demonstrate the efficacy of our approach through experiments on the NYUv2 dataset for a number of tasks, and find that our results from a common model, trained only once, are comparable to those from state-of-the-art methods with separate task-specific models.
1 Introduction
Recent neural network-based methods [7, 48, 3, 9, 23] have become surprisingly successful at predicting scene depth from only a single color image. This success confirms that even a single view of a scene contains considerable information about scene geometry. However, purely monocular depth map estimates are far from precise, and this is likely to always be true given the ill-posed nature of the task. Fortunately, many practical systems are able to rely on other, also imperfect, sources of depth information—limited measurements from depth sensors, interactive user guidance, consistency across video frames or multiple views, etc. It is therefore desirable to combine monocular depth cues with information from these other sources, to yield estimates that are more accurate than possible from any one source alone.
However, depth maps predicted by monocular estimators cannot be directly combined with other depth cues. Instead, researchers have considered depth estimation from different combinations of cues as different applications in their own right (\eg, depth up-sampling , estimation from sparse  and line  measurements, etc.), and solved each by learning separate estimators that take their corresponding set of cues, in addition to the color image, as input. This requires determining the types of inputs that will be available for each application setting, constructing a corresponding training set, choosing an appropriate network architecture, and then training the network—a process that is often onerous to duplicate for multiple settings.
In this paper, we propose a single network model for extracting and summarizing the depth information present in a single color image, in a manner that can be directly utilized in different applications and combined with different external depth cues, without retraining. Given a color image, our model outputs a probability distribution over scene depth conditioned on the image input. We use a conditional GAN [11, 37] to output multiple plausible depth estimates for individual patches in the image plane. We then use the set of estimates for each patch to form a sample approximation of the joint distribution over depth values in that patch, and combine distributions for overlapping patches to obtain a distribution for the entire depth map. Thus, rather than a “best guess” for depth at each pixel, our model outputs a rich characterization of the information and ambiguity about depth values and their spatial dependencies.
As illustrated in Fig. 1, and demonstrated through experiments on the NYUv2 dataset , our distributional output is versatile enough to enable a diverse variety of applications. It is useful even in the purely monocular setting—when only a single image is available—and can be used to produce accurate depth predictions, a measure of confidence in these predictions, as well as estimates of relative ordering of pairs of scene points. More importantly, it is also able to incorporate additional information to produce improved depth estimates in diverse application settings: producing multiple depth maps for user selection, incorporating user annotation of erroneous regions, incorporating a small number of depth measurements—along a single line, within a smaller field of view, at random as well as regular sparse locations—and selecting the optimal locations for these measurements. Crucially, all of these applications are enabled by the same network model that is trained only once, while achieving accuracy comparable to state-of-the-art methods that rely on separate task-specific models.
2 Related Work
Monocular Depth Estimation.
Early work on estimating scene depth from a single color image, first attempted by Saxena \etal , relied on hand-crafted features [42, 22, 43, 39], graphical models [42, 34, 54], and databases of exemplars [20, 17]. More recently, Eigen \etal  showed that, given a large enough database of image-depth pairs , convolutional neural networks could be trained to achieve significantly more reliable depth estimates. Since then, there have been steady gains in accuracy through the development of improved neural network-based methods [7, 53, 50, 40, 31, 3, 26, 13, 24, 9], as well as strategies for unsupervised and semi-supervised learning [10, 21, 4]. Beyond estimating absolute depth, some works have also looked at ordinal depth relations between pairs of points in the scene from an input color image [55, 4].
Depth from Partial Measurement.
Since making dense depth measurements is slow and expensive, it is useful to be able to recover a high quality dense depth map from a small number of direct measurements, by exploiting monocular cues from a color image. A popular way of combining color information with partial measurements is by requiring color and depth edges to co-occur. This approach is often successful for “depth inpainting”, i.e., filling in gaps of missing measurements in a depth map (common in measurements from structured light sensors). A notable and commonly-used example is the colorization method of Levin \etal . Other methods along this line include [14, 33, 32, 36, 6], while Zhang and Funkhouser  used a neural network to predict normals and occlusion boundaries to aid inpainting.
However, when working with a very small number of measurements, the task is significantly more challenging (see discussion in ) and requires relying more heavily on the monocular cue. In this regime, the solution has been to train a network that takes the color image and the provided sparse samples as input. Researchers have demonstrated the efficacy of this approach with measurements along a single horizontal line from a line sensor , random sparse measurements [47, 35, 16, 49, 44], and sub-sampled measurements on a regular grid [28, 12, 5]. Moreover, each of these methods trains separate networks for different settings, such as different networks for different sparsity levels in , and different resolution grids in .
Monocular depth estimators commonly output a single estimate of the depth value at each pixel, preventing their use in different estimation settings. Some existing methods do produce distributional outputs, but as per-pixel variance maps [18, 13] or per-pixel probability distributions . Note that depth values at different locations are not statistically independent, i.e., different values at different locations may be plausible independently, but not in combination. Thus, per-pixel distributions provide only a limited characterization that, while useful in some applications, cannot be used more generally, \eg, to infer relative depth, or spatially propagate information from sparse measurements.
Like us, Chakrabarti \etal  also consider joint distributions over local depth values, albeit to eventually produce a depth map. They use a factorization into independent distributions for different depth derivatives, and train a network to output these distributions. But their outputs do not provide a way to solve other inference tasks (this was not their goal). Also, their factorization into pre-selected derivatives with a fixed parametric form is still a restrictive assumption that does not fully capture local depth dependencies.
In this work, we use a more general form for the conditional joint distribution of depth values in local regions. We train a conditional GAN [11, 37] to produce multiple estimates of depth in local patches from an image input. Conditional GANs have been used to produce outputs that are more “natural” than those from networks trained with regression loss alone . In our case, we run our GAN model multiple times to generate multiple plausible estimates, treat these as samples from a distribution, and use these samples to approximate the distribution itself.
3 Proposed Method
Given the RGB image $X$ of a scene, our goal is to reason about its corresponding depth map $Z$, represented as a vector containing depth values for all $N$ pixels in the image. Rather than predict a single estimate for $Z$, we seek to output a distribution $p(Z|X)$, to more generally characterize the depth information and ambiguity present in the image. We form this distribution as a product of functions defined on individual overlapping patches as
\begin{equation}
p(Z|X) \propto \prod_j \psi_j(P_j Z),
\end{equation}
where $\psi_j$ is a potential function for the $j$-th patch, and $P_j$ a sparse matrix that crops out that patch from $Z$ (for patches of $K$ pixels, each $P_j$ is a $K \times N$ matrix).
We now describe our approach, which trains a conditional GAN to generate multiple depth estimates for each patch, uses these estimates to construct the patch potential functions, and then leverages the resulting distribution for inference.
3.1 Diverse Patch Depth Estimates from a GAN
Conditional GANs  train a “generator” network to produce estimates so as to match conditional distributions of data in a training set. The generator takes the conditioning variables and a noise source as input, and is trained adversarially against a discriminator that also uses the same conditioning inputs. We employ a conditional GAN to generate multiple plausible estimates for the depth of each patch , given the input image . For large networks and high-dimensional inputs, GAN training typically suffers from issues of instability, as well as reduced output diversity from mode-collapse . Note that the latter is especially a concern in our setting: most applications that use conditional GANs (\eg, ) are concerned with generating only a single estimate at test time, and use the conditional GAN framework to ensure these estimates are plausible. In contrast, we invoke our generator multiple times on the same input image at test time, and need the multiple outputs for each patch to be diverse so as to faithfully characterize local depth ambiguity.
Accordingly, we use a pre-trained feature extractor to reduce the complexity of our generator and discriminator networks. Specifically, we take a pre-trained network from a state-of-the-art monocular depth estimation method (DORN ), remove the last two convolution layers, and treat the remaining network as our feature extractor. Our generator and discriminator networks then both operate on the corresponding feature map output, rather than on the image itself. To generate estimates for a given patch, a small spatial feature map window, with receptive field centered on the patch, is provided as input to the generator and discriminator. Moreover, as is common in recent methods for conditional generation , our generator uses dropout  rather than an explicit random vector input.
Figure 2 includes a schematic of our conditional GAN setup, with further architecture details provided in the supplementary. We carry out standard adversarial training on the generator-discriminator pair, keeping the pre-trained feature extraction network fixed. Since both our generator and discriminator have significantly lower complexity than would be required if operating directly on the input image, we find training to be stable and our learned generator successful in producing plausible yet diverse estimates. At test time, we run the feature extractor once, and then run the generator multiple times with different instantiations of dropout to generate a diverse set of estimates for each patch. This is efficient because the bulk of the computation happens in the feature extraction layers, and is not repeated.
3.2 Sample Approximation for Patch Potentials
We use the generated outputs from our generator to form a sample approximation to the per-patch potential functions, and thus the joint distribution over the depth map in (1). Given a set $\mathcal{D}_j$ of $S$ different estimates of the depth of patch $j$, we define its potential function as
\begin{equation}
\psi_j(d) = \sum_{d' \in \mathcal{D}_j} \exp\left(-\frac{\|d - d'\|^2}{2\sigma^2}\right).
\end{equation}
This can be interpreted as forming a kernel density estimate from the depth samples in $\mathcal{D}_j$ using a Gaussian kernel, where the bandwidth $\sigma$ is a scalar hyper-parameter (while $\sigma$ can be estimated based on the variance between generated samples and true patch depths, its actual value is not used in any of the tasks we consider).
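As a concrete illustration, this sample-based potential can be evaluated directly from the generator's outputs. Below is a minimal NumPy sketch (the function name `patch_potential` and the array shapes are our own, not from the paper's released code):

```python
import numpy as np

def patch_potential(d, samples, sigma):
    """Unnormalized KDE potential for one patch: a sum of Gaussian
    kernels centered on each generated depth sample for that patch.

    d       : (K,) candidate depth values for the patch
    samples : (S, K) depth estimates from S runs of the generator
    sigma   : scalar kernel bandwidth
    """
    sq_dists = np.sum((samples - d[None, :]) ** 2, axis=1)  # (S,)
    return float(np.sum(np.exp(-sq_dists / (2.0 * sigma ** 2))))
```

In practice the potential need not be evaluated explicitly for the tasks in Sec. 4: inference works with the samples themselves, which is why the bandwidth drops out.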
Unlike the independent per-pixel [18, 13, 30] or per-derivative  distributions, the samples from our generator lead to more general joint patch potentials that can express complex spatial dependencies between depth values in local regions. Moreover, our joint distribution, defined in terms of overlapping patches, models dependencies across the entire depth map. This enables information propagation across the entire scene, and reasoning about the global plausibility of scene depth estimates.
3.3 Inference with Distributional Outputs
Inference by Expectation.
A natural way to compute estimates of certain properties or functions of the depth map is simply as their expectation under our output distribution. When these properties depend on depths of individual points or nearby sets of points, this can be done by considering all patches that contain these points, all generator samples for each patch, and averaging across this entire set. In Sec. 4, we will show examples of using this strategy to compute point and pair-wise properties of depth values in the monocular setting.
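For instance, the per-pixel expected depth can be computed by overlap-averaging per-patch sample means. The following is a minimal NumPy sketch, assuming patches are raster-ordered by their top-left corners (the function name and shapes are illustrative):

```python
import numpy as np

def expected_depth(patch_means, H, W, K, stride):
    """Per-pixel expected depth: average the patch-mean depth of every
    overlapping K x K patch covering each pixel.

    patch_means : (num_patches, K*K) mean over generator samples,
                  ordered by raster-scanning patch top-left corners
    """
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    idx = 0
    for y in range(0, H - K + 1, stride):
        for x in range(0, W - K + 1, stride):
            acc[y:y + K, x:x + K] += patch_means[idx].reshape(K, K)
            cnt[y:y + K, x:x + K] += 1
            idx += 1
    return acc / np.maximum(cnt, 1)  # uncovered pixels stay zero
```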
Inference by Mode Computation.
Several applications require computing a global depth map estimate, potentially based on additional information or constraints available during inference. Note that our patch potentials are multi-modal functions, defined as a mixture of Gaussian components centered on each sample from the conditional GAN. Based on this observation, we propose recovering global depth map estimates as modes of our distributional output, by selecting one mode or sample for every patch, instead of averaging across them.
This is done through a joint optimization over the global depth map $Z$ and per-patch depths $\{d_j\}$ as
\begin{equation}
\min_{Z,\,\{d_j \in \mathcal{D}_j\}} \; \sum_j \|P_j Z - d_j\|^2 \;+\; \sum_j C_j(d_j) \;+\; C_G(Z),
\end{equation}
where the per-patch depths are constrained to be among the corresponding discrete sets of generated samples. The first term in (3.3) simply corresponds to a scaled negative log-likelihood of our output distribution. The other two terms represent different ways of introducing additional information—either as costs on individual patches, or on the global depth map. For different inference applications in Sec. 4, we will use appropriately defined costs in one of these two forms to incorporate external depth cues.
We use a simple iterative algorithm to carry out the optimization in (3.3). We begin with an initial estimate of the global map as the mean per-pixel depth (i.e., averaged across all patches that contain each pixel, and all samples from each patch), and apply alternating updates till convergence as
\begin{equation}
d_j \leftarrow \arg\min_{d \in \mathcal{D}_j} \; \|P_j Z - d\|^2 + C_j(d),
\end{equation}
\begin{equation}
Z \leftarrow \arg\min_{Z} \; \sum_j \|P_j Z - d_j\|^2 + C_G(Z).
\end{equation}
The updates to patch estimates can be done independently, and in parallel, for different patches. The cost in (4) is the sum of the squared distance from the corresponding crop of the current global estimate, and the external cost when available. We can compute these costs for all samples in the patch's sample set, and select the one with the lowest cost. Note that since the external per-patch cost does not depend on the global estimate, it need only be computed once at the start of optimization.
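This per-patch update can be sketched as a simple argmin over the patch's sample set; the NumPy version below is illustrative (names and shapes are our own):

```python
import numpy as np

def update_patch(crop, samples, ext_cost=None):
    """Per-patch update: among a patch's generated samples, select the
    one closest (in squared distance) to the current global estimate's
    crop, plus an optional external per-sample cost."""
    cost = np.sum((samples - crop[None, :]) ** 2, axis=1)
    if ext_cost is not None:
        cost = cost + ext_cost
    return samples[np.argmin(cost)]
```

Since the selection is independent per patch, all patches can be updated in parallel.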
The update to the global map in (5) depends on the form of the external global cost $C_G$. If no such cost is present, the update is simply the overlap-average of the currently selected samples for each patch. For the applications in Sec. 4 that do involve a global cost, we find it sufficient to solve (5) by first initializing to the overlap-average, and then carrying out a small number of gradient descent steps as
\begin{equation}
Z \leftarrow Z - \eta\, \nabla_Z C_G(Z),
\end{equation}
where the scalar step-size $\eta$ is a hyper-parameter.
4 Applications and Results
In this section, we describe results for using our probabilistic outputs and inference strategies for various applications—for different inference tasks in the monocular setting, and by combination with different costs and constraints based on additional information when available. We report performance for all applications on the NYUv2 dataset . Crucially, all results from our method reported in all tables and figures in this section are from the same network model, which is trained only once.
We use raw frames from scenes in the official train split of NYUv2  to construct our training and validation sets, and report performance using standard error metrics (see ) on the “valid” crop, including filled-in values, of the full-resolution official test images. As mentioned, we use feature extraction layers from a pre-trained DORN model . The DORN architecture works on rescaled input images and outputs depth maps at a lower resolution (of ), and so we operate our conditional GAN at the same resolution. However, our outputs are rescaled back to the original full resolution to compute error metrics, and in applications with input depth measurements, these are also provided at the original resolution and then rescaled (see supplementary for details). For our distribution, we use overlapping patch-sizes of side  with stride four, and generate 100 samples per patch. Generating samples takes 4.8s on a 1080Ti GPU for each image, while inference from these samples is faster (see supplementary for per-application run times). Our code and trained models are available at https://projects.ayanc.org/prdepth/.
4.1 Monocular Inference
Our distributional output is useful even when a single color image is the only input, and we now discuss applications for reasoning about scene geometry in this setting.
Predicting Depth and Confidence.
Our outputs can be used for the standard monocular estimation task, i.e., predicting a depth map of the scene given a color image. We can recover this estimate from our model as the mean of the distribution. This corresponds to simply averaging all the estimates for each pixel’s depth—from all the patches that include it, and from all generated estimates for each patch. This can be computed efficiently by first averaging all generated samples for each patch to get a mean per-patch depth estimate, and then taking per-pixel means as the overlap-average of patches. Another possibility is to predict the mode of the distribution, by solving the optimization in (3.3) without any additional costs.
Along with an estimate of each pixel’s depth value, we can also output a measure of confidence in these predictions. We do so by computing the variance of each pixel’s depth value across patches and samples from our distributional output—which relates to the per-pixel variance under our distribution (differing by a constant). This variance map gives us a measure of our model’s relative confidence in its estimates at different pixels.
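This confidence map can be accumulated in one pass over patches and samples, using the identity Var[d] = E[d²] − (E[d])². A minimal NumPy sketch, with illustrative names and shapes:

```python
import numpy as np

def confidence_variance(patch_samples, H, W, K, stride):
    """Per-pixel variance of depth across all generated samples of all
    K x K patches covering each pixel, via E[d^2] - (E[d])^2.

    patch_samples : list of (S, K*K) arrays, raster-ordered by patch
                    top-left corners
    """
    s1 = np.zeros((H, W))
    s2 = np.zeros((H, W))
    cnt = np.zeros((H, W))
    idx = 0
    for y in range(0, H - K + 1, stride):
        for x in range(0, W - K + 1, stride):
            for sample in patch_samples[idx]:        # each: (K*K,)
                p = sample.reshape(K, K)
                s1[y:y + K, x:x + K] += p
                s2[y:y + K, x:x + K] += p * p
                cnt[y:y + K, x:x + K] += 1
            idx += 1
    cnt = np.maximum(cnt, 1)
    mean = s1 / cnt
    return s2 / cnt - mean ** 2
```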
In Table 1, we compare the accuracy of our mean and mode depth estimates to those of other monocular depth estimation methods. (For [23, 9], we recompute these numbers on the official NYUv2 crop from their provided test-set estimates;  also used a different definition of RMSE, as the mean of per-image RMSE, in their paper, and we report results using the standard definition here.) We find that in the monocular setting, the mean and mode estimates are nearly identical. Moreover, these estimates also have nearly the same accuracy as those from DORN , whose feature extractor our model is based on. This shows that our rich distributional outputs come “for free”, without adversely affecting our ability to recover depth compared to standard monocular estimation.
Table 1 also includes the results of using our distributional output in combination with an oracle that selects the most accurate patch estimate from our generator’s samples, and computes the depth map from these samples by overlap-average. These estimates are significantly more accurate, demonstrating that our generated samples contain estimates close to true depth. The oracle performance also represents an upper bound for tasks that incorporate additional information using per-patch costs in (3.3).
Figure 3 evaluates our confidence measure as a predictor of accuracy. We show depth predictions and error and confidence maps from our model for two example images from the NYUv2 test set, and find that regions with relatively higher error also tend to be those where our model has high variance, and thus low confidence—often corresponding to reflective surfaces and isolated far away parts of the scene. We also show a more systematic evaluation of accuracy vs confidence (Fig. 3, right), with errors averaged across the entire test set, over different subsets of only the most confident pixels. The error drops rapidly as we discard a small fraction of pixels with the highest variance.
Predicting Pairwise Depth Ordering.
Another monocular task, introduced in , is to predict the ordinal relative depth of pairs of nearby points in the scene: whether the points are at similar depths (within some threshold), and if not, which point is nearer. Instead of predicting this ordering from an estimated depth map (as done in [4, 55]), we use our distributional output and look at the relative depth in all samples in all patches that contain a pair of queried points, outputting the ordinal relation that is most frequent.
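A minimal sketch of this voting scheme, assuming a log-depth threshold for the "similar depth" decision (the exact similarity criterion and the name `ordinal_relation` are our assumptions):

```python
import numpy as np

def ordinal_relation(samples_a, samples_b, tau=0.02):
    """Vote across paired depth samples for two query points: returns
    '=' (similar depth), '<' (a nearer), or '>' (b nearer), using a
    log-depth threshold tau to decide similarity."""
    votes = {'=': 0, '<': 0, '>': 0}
    for da, db in zip(samples_a, samples_b):
        if abs(np.log(da) - np.log(db)) < tau:
            votes['='] += 1
        elif da < db:
            votes['<'] += 1
        else:
            votes['>'] += 1
    return max(votes, key=votes.get)
```

In the paper's setting, the paired samples would come from the patches containing both query points, so that each vote reflects a jointly plausible configuration.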
Table 2 compares the performance of our method with that of  and , who use correctness of pairwise ordering as an objective during training. Results are reported in terms of the WKDR error metrics, on a standard set of point pairs on the NYUv2 test set (see ). We also show results predicting ordering from our mean depth map prediction (see supplementary for more details). We find that using our distributional output leads to better predictions than using simply the mean estimate, and that these are comparable to those from the task-optimized model of .
4.2 Incorporating User Guidance
Depth estimates are often useful in interactive image editing and graphics applications. We now describe ways of using our distributional output to include feedback from a user in the loop for improved depth accuracy.
Diverse Estimates for User Selection.
We use Batra \etal’s approach  to derive multiple diverse “global” estimates of the depth map from our distribution, and propose presenting these as alternatives to the user. We set the first estimate to our mean estimate, and generate every subsequent estimate by finding a mode using (3.3) with per-patch costs defined as
\begin{equation}
C_j(d) = -\gamma \sum_{t' < t} \|d - P_j \hat{Z}_{t'}\|^2,
\end{equation}
where $\hat{Z}_{t'}$ are the previously generated estimates. This introduces a preference for samples that are different from the corresponding patches in previous estimates, weighted by a scalar hyper-parameter $\gamma$ (set on a validation set).
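The diversity cost for one patch can be sketched as follows, with a negative squared distance to each previous estimate's crop so that samples differing from earlier depth maps receive lower (preferred) cost (names are illustrative):

```python
import numpy as np

def diversity_cost(samples, prev_crops, gamma):
    """Per-patch diversity cost: negative squared distance to the
    corresponding crops of previously presented estimates."""
    cost = np.zeros(len(samples))
    for crop in prev_crops:
        cost -= gamma * np.sum((samples - crop[None, :]) ** 2, axis=1)
    return cost
```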
Figure 4 illustrates the performance of this approach, on an example image and quantitatively over the entire test set. As a proxy for user guidance, we automatically select among the estimates for each scene based on minimum error with the ground-truth. We find that accuracy improves quickly even when selecting among a small number of modes , suggesting that this method can deliver performance gains with fairly minimal user input.
Using Annotations of Erroneous Regions.
As a simple extension, we consider also obtaining annotations of regions with high error from the user, in each estimate. Note that we only get the locations of these regions, not their correct depth values. Given this annotation, we define a mask $M_{t'}$ that is one within the region and zero elsewhere, and now recover each subsequent estimate with a modified cost:
\begin{equation}
C_j(d) = -\gamma \sum_{t' < t} \|(P_j M_{t'}) \odot (d - P_j \hat{Z}_{t'})\|^2,
\end{equation}
where $\odot$ denotes element-wise multiplication, and the masks focus the cost on regions marked as erroneous.
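The masked variant simply restricts each penalty to the annotated region via element-wise multiplication; a minimal NumPy sketch with illustrative names:

```python
import numpy as np

def masked_diversity_cost(samples, prev_crops, masks, gamma):
    """As the diversity cost, but each previous estimate's penalty is
    restricted (via an element-wise mask) to user-marked regions."""
    cost = np.zeros(len(samples))
    for crop, mask in zip(prev_crops, masks):
        diff = mask[None, :] * (samples - crop[None, :])
        cost -= gamma * np.sum(diff ** 2, axis=1)
    return cost
```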
Figure 4 also includes results for this form of user guidance, where user annotation of regions is simulated by choosing windows with the highest error against the ground truth, such that they have no more than 50% overlap with previously marked regions for the same image. We find that the error of the selected estimate now drops dramatically faster with an increasing number of estimates.
4.3 Depth Completion
We now consider applications where a small number of depth values are available, \eg, from a sensor that makes limited measurements for efficiency. As illustrated in Fig. 5, our model can use these measurements along with monocular cues to produce accurate estimates of a full depth map.
Dense Depth from Sparse Measurements.
Assuming an input sparse set of depth measurements $m$ at isolated points in the scene, we estimate the depth map by using these measurements to define a global cost in (3.3) as
\begin{equation}
C_G(Z) = \beta\, \|\mathcal{M}(Z) - m\|^2,
\end{equation}
where $\mathcal{M}(\cdot)$ represents sampling at the measured locations. Based on this, we define the gradients to be applied in (6) for computing the global depth updates as
\begin{equation}
\nabla_Z C_G(Z) = \beta\, \mathcal{M}^T\big(\mathcal{M}(Z) - m\big),
\end{equation}
where $\mathcal{M}^T$ represents the transpose of the sampling operation. Since both the weight $\beta$ and the step-size in (6) are hyper-parameters, we simply set $\beta = 1$, and set the step-size (as well as the number of gradient steps) based on a validation set.
We apply this technique to two kinds of sparse inputs. We first consider measurements at arbitrary, randomly selected points as in [47, 35, 16, 49, 44]. In this case, the transpose sampling operation is computed as a nearest-neighbor fill—by copying values for every point in the full image plane from their nearest sampled location. Table 3 reports the accuracy of the completed depth maps using our method for different numbers of randomly placed measurements, and compares it to those obtained using , as well as using the learning-based methods of Ma and Karaman , and Wang \etal . Our estimates are significantly more accurate than those from , and comparable to [35, 49] (both  and  evaluate their methods on a centered crop at half-resolution, while we report our performance at the official full-resolution valid crop for NYUv2 to be consistent with the benchmark; our performance at half-resolution is similar, and reported in the supplementary), even though the latter not only use networks trained for this specific completion task, but train different networks for different numbers of measurements.
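A minimal NumPy sketch of this gradient computation, with the transpose realized as a brute-force nearest-neighbor fill (the function names and the brute-force search are illustrative; at full resolution a distance transform would be the practical choice):

```python
import numpy as np

def nn_fill(values, coords, shape):
    """Nearest-neighbor fill: every pixel takes the value of its
    closest measured location (brute force; fine at small sizes)."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    pixels = np.stack([ys.ravel(), xs.ravel()], axis=1)      # (H*W, 2)
    d2 = ((pixels[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return values[np.argmin(d2, axis=1)].reshape(H, W)

def sparse_grad(z, meas_vals, coords):
    """Gradient of ||M(z) - m||^2 (weight set to 1), with the transpose
    sampling M^T realized as a nearest-neighbor fill of residuals."""
    resid = z[coords[:, 0], coords[:, 1]] - meas_vals
    return nn_fill(resid, coords, z.shape)
```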
Instead of placing points randomly, we also consider choosing an optimal set of locations to measure based on the color image, given a budget on the total number of measurements. We select these points as local maxima of the variance map described in Sec. 4.1. We also include results for depth maps reconstructed from these optimally placed measurements in Table 3, and find that these are more accurate.
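The selection of measurement sites can be sketched as a greedy pick of variance-map maxima with local suppression (the suppression-window scheme is our assumption; the text specifies only local maxima of the variance map):

```python
import numpy as np

def pick_locations(var_map, budget, radius=2):
    """Greedily select measurement sites at local maxima of the
    per-pixel variance map, suppressing a window around each pick."""
    v = np.array(var_map, dtype=float)
    picks = []
    for _ in range(budget):
        y, x = np.unravel_index(np.argmax(v), v.shape)
        picks.append((int(y), int(x)))
        v[max(0, y - radius):y + radius + 1,
          max(0, x - radius):x + radius + 1] = -np.inf
    return picks
```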
We next consider the setting of depth up-sampling, where the sparse input measurements lie on a regular lower-resolution grid. Because of the regular spacing between measured samples, we are able to use bi-linear interpolation for the transpose operation in (10). We evaluate our method for two sub-sampling levels in Table 4, and compare it to  and the method of Chen \etal . Again, we perform better than , and competitively with the task-specific networks of , which are separately trained for different sampling levels, especially for 96x sub-sampling.
We also consider the case when the available measurements are dense in a contiguous, but small, portion of the image plane—such as from a sensor with a smaller field-of-view (FOV), or along a single line . In this case, we define a measurement vector $m$ and mask $h$ as sparse vectors of length $N$ that are zero in locations without measurements. At measured locations, $m$ contains the measured values, while the mask $h$ is set to one. We use these to define a per-patch cost for use with (3.3) as
\begin{equation}
C_j(d) = \alpha\, \|(P_j h) \odot (d - P_j m)\|^2,
\end{equation}
where the weight $\alpha$ is determined on a validation set.
We report results for this approach in Table 5, with measurements given either as small centered windows in the image (corresponding to a small-FOV camera), or along a vertically centered horizontal line. We compare our approach with , and for the case of line measurements, with the learning-based method of Liao \etal . (Note that  use measurements along a line simulated to be horizontal in world co-ordinates, leading to different vertical positions at each horizontal co-ordinate; however, due to a lack of exact details for replicating this setting, we simply use a line that is horizontal in the image plane.) Our approach again outperforms , and in comparison to , has slightly higher RMSE but is better on all other metrics.
5 Conclusion
Using distributional estimates of depth from a single image, our approach enables a variety of applications without the need for repeated training. While in this paper we focused on applications where the final output was depth or some direct function of scene geometry, in future work we are interested in exploring how our distributional outputs can be used to manage ambiguity in downstream processing—such as for re-rendering or path planning.
Acknowledgments. This work was supported by the NSF under award no. IIS-1820693.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
-  D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In Proc. ECCV, 2012.
-  A. Chakrabarti, J. Shao, and G. Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In NeurIPS, 2016.
-  W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NeurIPS, 2016.
-  Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from RGB and sparse sensing. In Proc. ECCV, 2018.
-  D. Doria and R. J. Radke. Filling large holes in LiDAR data by inpainting depth gradients. In Proc. CVPR Workshops, 2012.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. ICCV, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
-  H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proc. CVPR, 2018.
-  R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. ECCV, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
-  S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. CVPR, 2017.
-  M. Heo, J. Lee, K.-R. Kim, H.-U. Kim, and C.-S. Kim. Monocular depth estimation using whole strip masking and reliability-based refinement. In Proc. ECCV, 2018.
-  D. Herrera, J. Kannala, J. Heikkilä, et al. Depth map inpainting under a second-order smoothness prior. In Scandinavian Conference on Image Analysis, 2013.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017.
-  M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi. Sparse and dense data with CNNs: Depth completion and semantic segmentation. In Proc. Intl. Conference on 3D Vision (3DV), 2018.
-  K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. PAMI, 2014.
-  A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS, 2017.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  J. Konrad, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee. Learning-based, automatic 2d-to-3d image and video conversion. IEEE Trans. on Image Processing, 2013.
-  Y. Kuznietsov, J. Stuckler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. CVPR, 2017.
-  L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Proc. CVPR, 2014.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Proc. Intl. Conference on 3D Vision (3DV), 2016.
-  J.-H. Lee, M. Heo, K.-R. Kim, and C.-S. Kim. Single-image depth estimation based on fourier domain analysis. In Proc. CVPR, 2018.
-  A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. In ACM Transactions on Graphics (TOG), 2004.
-  J. Li, R. Klein, and A. Yao. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proc. ICCV, 2017.
-  Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. ECCV, 2016.
-  Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu. Parse geometry from a line: Monocular depth estimation with partial laser observation. In Proc. ICRA, 2017.
-  C. Liu, J. Gu, K. Kim, S. Narasimhan, and J. Kautz. Neural rgb-d sensing: Depth and uncertainty from a video camera. arXiv preprint arXiv:1901.02571, 2019.
-  F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2016.
-  J. Liu and X. Gong. Guided depth enhancement via anisotropic diffusion. In Pacific-Rim Conference on Multimedia, 2013.
-  J. Liu, X. Gong, and J. Liu. Guided inpainting and filtering for kinect depth maps. In Proc ICPR, 2012.
-  M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In Proc. CVPR, 2014.
-  F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proc. ICRA, 2018.
-  K. Matsuo and Y. Aoki. Depth image enhancement using local tangent plane approximations. In Proc. CVPR, 2015.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proc. CVPR, 2018.
-  R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. CVPR, 2016.
-  A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In Proc. CVPR, 2016.
-  A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NeurIPS, 2006.
-  A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. PAMI, 2009.
-  J. Shi, X. Tao, L. Xu, and J. Jia. Break ames room illusion: depth from general single images. ACM Transactions on Graphics (TOG), 2015.
-  S. S. Shivakumar, T. Nguyen, S. W. Chen, and C. J. Taylor. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion. arXiv preprint arXiv:1902.00761, 2019.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In Proc. ECCV, 2012.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
-  W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool. Sparse and noisy lidar completion with rgb guidance and uncertainty. arXiv preprint arXiv:1902.05356, 2019.
-  P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proc. CVPR, 2015.
-  T.-H. Wang, F.-E. Wang, J.-T. Lin, Y.-H. Tsai, W.-C. Chiu, and M. Sun. Plug-and-play: Improve depth prediction via sparse data propagation. In Proc. ICRA, 2019.
-  X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proc. CVPR, 2015.
-  D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In Proc. CVPR, 2017.
-  Y. Zhang and T. Funkhouser. Deep depth completion of a single rgb-d image. In Proc. CVPR, 2018.
-  Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with cnns. In Proc. ICCV, 2015.
-  W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In Proc. CVPR, 2015.
-  D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. CVPR, 2015.
Appendix A Architecture and Training
Our conditional GAN consists of a pre-trained feature extractor, a generator, and a discriminator. As mentioned in the paper, we take the pre-trained DORN model, remove its last two convolutional layers, and use it as our feature extractor. This feature extractor takes an RGB image, resized to 257×353 from the original 640×480 in NYUv2, and outputs a 2560-dimensional feature map at a lower resolution of 33×45. Our conditional GAN takes this feature map as input and reasons about an output depth map at the same 257×353 resolution. We consider overlapping patches at stride 4, giving us a total of 57×81 patches, each of size 33×33. In other words, each forward pass of our generator produces an output of size 57×81×33×33, and we run this multiple (100) times to get multiple samples for each patch.
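As a sanity check on the patch-grid arithmetic above, the number of overlapping stride-4 windows over the output resolution can be computed directly (a small illustrative sketch; the function name is ours):

```python
def patch_grid(height, width, patch=33, stride=4):
    """Number of overlapping patch positions along each image axis."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows, cols

# 33x33 patches at stride 4 over the 257x353 output depth map
rows, cols = patch_grid(257, 353)  # -> (57, 81), i.e. a 57x81 patch grid
```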
We describe the architectures of our generator and discriminator in Tables 6 and 7, respectively. Notice that we are able to generate outputs efficiently in a fully convolutional way, using reshape operations and transpose-convolution layers to generate the depth samples for each patch. In the generator, the output of each patch depends on a small receptive field in the input feature map, and we use dropout as the noise source. While the overlapping patches do have overlapping receptive fields in the feature map, we ensure that they have independent instantiations of dropout noise values. The discriminator has a two-stream architecture: one stream processes the feature map, and the other processes the depth patch (either from the generator or the ground truth). Outputs from both streams are concatenated and sent through two more layers to predict a true/fake conditional label for each input depth patch.
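The "independent dropout noise per patch" idea can be illustrated with a minimal numpy sketch (this is illustrative, not the paper's implementation): for every sample, each patch's feature vector receives its own freshly drawn dropout mask, so overlapping patches produce independent stochastic outputs.

```python
import numpy as np

def sample_with_dropout(patch_features, num_samples=3, rate=0.5, seed=0):
    """Draw stochastic per-patch outputs using dropout as the noise source.

    patch_features: (P, C) array, one C-dim feature vector per patch.
    Returns (num_samples, P, C); each (sample, patch) pair gets an
    independent Bernoulli keep-mask, with kept units scaled by 1/(1-rate)
    as in standard inverted dropout.
    """
    rng = np.random.default_rng(seed)
    P, C = patch_features.shape
    keep = rng.random((num_samples, P, C)) >= rate  # independent per patch
    return np.where(keep, patch_features / (1.0 - rate), 0.0)

samples = sample_with_dropout(np.ones((4, 8)), num_samples=5)
# samples.shape == (5, 4, 8); every patch is masked independently per sample
```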
For training, we use Adam, with β₁ and β₂ set to 0.5 and 0.9, respectively. As is typical for stabilizing GAN training, we update the discriminator at every iteration while updating the generator only once every five iterations. We use a batch size of 4 and train for 240k iterations.
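The alternating update schedule can be sketched as a simple counting loop (illustrative only; the actual optimizer steps are omitted, and the function name is ours):

```python
def training_schedule(num_iters, gen_every=5):
    """Count discriminator vs. generator updates when the discriminator
    is updated every iteration and the generator once every gen_every
    iterations."""
    d_updates = g_updates = 0
    for it in range(1, num_iters + 1):
        d_updates += 1             # discriminator: every iteration
        if it % gen_every == 0:    # generator: once every five iterations
            g_updates += 1
    return d_updates, g_updates

# over the full 240k-iteration run:
d, g = training_schedule(240000)  # -> 240000 discriminator, 48000 generator
```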
Appendix B Output Depth Resolution
As mentioned above, our distributional output corresponds to the lower DORN resolution of 257×353 for the depth map. However, all error metrics in the paper are computed (inside the valid crop) at the full 640×480 resolution. To do so, we resize our method's outputs to 640×480 by bilinear interpolation. Moreover, in all applications with additional inputs, these are also provided at the original higher resolution. For user annotations, erroneous regions are marked as windows at the full resolution, and we map the locations of these windows to the lower resolution to construct our masks. Similarly, for depth from sparse measurements, the inputs correspond to sparse measurements of depth at the full resolution, and our global cost is defined in terms of a full-resolution depth map (we scale our depth map to the full resolution, and scale the gradients back). For depth un-cropping, we again provide depth measurements at the full resolution, and scale these to the DORN resolution to construct our measurement and mask vectors. Thus, all inputs and all evaluation metrics are based on the standard benchmark resolution.
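The mapping of a full-resolution annotation window to a mask at the lower output resolution amounts to scaling the window coordinates by the resolution ratio. A sketch of this coordinate mapping (the rounding convention here is our assumption, not the paper's):

```python
import numpy as np

def window_to_lowres_mask(window, full_hw=(480, 640), low_hw=(257, 353)):
    """Map a full-resolution window (top, left, bottom, right) to a
    binary mask at the lower output resolution by scaling coordinates.
    Floors the top-left corner and ceils the bottom-right so the mask
    covers the whole annotated region."""
    top, left, bottom, right = window
    sh = low_hw[0] / full_hw[0]
    sw = low_hw[1] / full_hw[1]
    mask = np.zeros(low_hw, dtype=bool)
    mask[int(top * sh):int(np.ceil(bottom * sh)),
         int(left * sw):int(np.ceil(right * sw))] = True
    return mask
```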
Appendix C Inference Hyperparameters
For user guidance and depth un-cropping, the weighting hyperparameter is chosen based on a small validation set, with different values for the two tasks. Moreover, for user guidance, we find that slowly increasing this weight to its final value during optimization leads to convergence to better solutions. For depth completion from sparse measurements (both random and regularly spaced), we set the step size and number of steps based on a validation set as well.
Table 6: Generator architecture. Dropout in rows 6–9 serves as the noise source.

#    Layer                              Output size
0    features from feature extractor    1 × 33 × 45 × 2560
1    resize                             1 × 65 × 89 × 2560
2    conv 1×1                           1 × 65 × 89 × 1024
3    conv 1×1                           1 × 65 × 89 × 512
4    conv 3×3, dilation=2               1 × 61 × 85 × 512
5    conv 3×3, dilation=2               1 × 57 × 81 × 256
6    reshape                            (57*81) × 1 × 1 × 256   (dropout as noise)
7    conv 1×1                           (57*81) × 1 × 1 × 256   (dropout as noise)
8    conv 1×1                           (57*81) × 1 × 1 × 256   (dropout as noise)
9    conv 1×1                           (57*81) × 1 × 1 × 256   (dropout as noise)
10   conv_transpose 3×3                 (57*81) × 3 × 3 × 256
11   conv_transpose 3×3                 (57*81) × 5 × 5 × 128
12   conv_transpose 3×3                 (57*81) × 7 × 7 × 64
13   resize                             (57*81) × 13 × 13 × 64
14   conv_transpose 3×3                 (57*81) × 15 × 15 × 32
15   conv_transpose 3×3                 (57*81) × 17 × 17 × 16
16   resize                             (57*81) × 33 × 33 × 16
17   conv 1×1 + tanh                    (57*81) × 33 × 33 × 1
18   reshape
Appendix D Running Time
Our method works by first generating multiple (100) samples for each overlapping patch, and then running inference, either by computing an expectation over these samples or by running an optimization for mode selection. While sample generation has a consistent running time across all applications, the time taken for optimization differs, even among mode-selection applications, based on the number of iterations needed to converge. We report these running times in Table 8, measured on an NVIDIA 1080Ti GPU.
Table 7: Discriminator architecture. Stream (a) processes the feature map; stream (b) processes the depth patch; the final rows process their concatenation.

#     Layer                              Output size
0.a   features from feature extractor    1 × 33 × 45 × 2560
1.a   resize                             1 × 65 × 89 × 2560
2.a   conv 3×3, dilation=2               1 × 61 × 85 × 1024
3.a   conv 3×3, dilation=2               1 × 57 × 81 × 256
4.a   reshape                            (57*81) × 1 × 1 × 256
0.b   true/fake depth patches
1.b   reshape                            (57*81) × 33 × 33 × 1
2.b   conv 3×3, stride=2                 (57*81) × 16 × 16 × 8
3.b   conv 2×2, stride=2                 (57*81) × 8 × 8 × 16
4.b   conv 2×2, stride=2                 (57*81) × 4 × 4 × 32
5.b   conv 2×2, stride=2                 (57*81) × 2 × 2 × 64
6.b   reshape                            (57*81) × 1 × 1 × 256
0     concat 4.a and 6.b                 (57*81) × 1 × 1 × 512
1     conv 1×1                           (57*81) × 1 × 1 × 1024
2     conv 1×1                           (57*81) × 1 × 1 × 512
3     conv 1×1                           (57*81) × 1 × 1 × 256
4     conv 1×1 + sigmoid                 (57*81) × 1 × 1 × 1
Table 8: Running times on an NVIDIA 1080Ti GPU.

Task                                       Time
Mean Depth Estimate                        0.01s
Depth from Random Sparse Measurements      0.80s
Selection with Annotation                  4.08s
Appendix E Additional Application Details and Results
E.1 Predicting Pairwise Depth Ordering
For predicting ordinal depth ordering for a pair of points p₁ and p₂ in the image plane, we adopt the standard definition of the ground-truth label in terms of the ratio of their depths d(p₁) and d(p₂):

  ℓ(p₁, p₂) = +1 if d(p₁)/d(p₂) > 1 + τ,  −1 if d(p₂)/d(p₁) > 1 + τ,  0 otherwise,

where the threshold τ is equal to 0.02, as in prior work. To compute predictions from our distribution mean, which is a per-pixel best guess, we simply look at the relationship between the predicted depths at the two points. Note that, as in prior work, we select a different threshold (based on a validation set) for use in (12) at prediction time, so as to balance WKDR⁼ and WKDR≠.
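The thresholded-ratio labeling described above can be written as a small function (a sketch; the symbol names are ours, and the sign convention, +1 meaning the first point is deeper, is one common choice):

```python
def ordinal_label(d1, d2, tau=0.02):
    """Ordinal relationship of two depths under a ratio threshold tau:
    +1 if the first point is sufficiently deeper, -1 if sufficiently
    closer, and 0 if the two depths are roughly equal."""
    if d1 / d2 > 1.0 + tau:
        return 1
    if d2 / d1 > 1.0 + tau:
        return -1
    return 0
```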
To make use of our probabilistic outputs for better ordinal prediction, we look at the depth ordering of a given query pair in all samples of all patches that contain the pair (here, using the true threshold τ), and output the most frequent label. In rare cases where no patch includes both points of a query pair, we simply use the prediction from the mean depth estimate as our output.
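This majority vote over per-sample orderings, with its fallback to the mean-depth prediction, can be sketched as follows (function names are illustrative):

```python
from collections import Counter

def vote_ordinal(sample_labels, fallback):
    """Majority vote over the ordinal labels collected from all samples
    of all patches containing the query pair; when no patch covers both
    points, fall back to the label from the mean depth estimate."""
    if not sample_labels:
        return fallback
    return Counter(sample_labels).most_common(1)[0][0]

# e.g. labels gathered across patches/samples for one query pair:
vote_ordinal([1, 1, -1, 0, 1], fallback=0)  # -> 1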
E.2 Incorporating User Guidance
We include more results for the user guidance tasks in Figure 6, demonstrating improvements over the mean depth prediction as users select among our generated modes. Note that when provided a limited annotation of a small erroneous region, our method generates estimates that not only correct that region, but also propagate improvements outside the input bounding box.
E.3 Depth Completion
Optimal Locations for Sparse Measurements.
For arbitrarily placed sparse depth measurements, our method can do better than random sampling by selecting, from the color image alone, an optimal set of locations to measure given a budget on the total number of measurements. Specifically, we select local maxima of the monocular variance map from our output distribution, which correspond to the points where our model is most uncertain about depth. Figure 7 shows the points selected by this approach, and the resulting improvement in predicted depth over random sampling.
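One way to realize "select local maxima of the variance map under a budget" is a greedy pick with non-maximum suppression; the sketch below illustrates the idea, and the suppression radius is our assumption, not a value from the paper:

```python
import numpy as np

def select_measurement_sites(variance, budget, min_dist=5):
    """Greedily pick the highest-variance locations, suppressing a
    square neighborhood around each chosen site so that measurement
    locations spread across the image."""
    var = variance.astype(float).copy()
    sites = []
    for _ in range(budget):
        i, j = np.unravel_index(np.argmax(var), var.shape)
        sites.append((int(i), int(j)))
        var[max(0, i - min_dist):i + min_dist + 1,
            max(0, j - min_dist):j + min_dist + 1] = -np.inf
    return sites
```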
Note that [35, 49] evaluate their methods by reporting errors on a centered crop of half-resolution depth maps, and also derive their input sparse measurements at this half-resolution. In contrast, our results in Table 3 of the paper use the official benchmark metrics (in the valid crop at full resolution), for consistency with other evaluations in our paper and elsewhere. For a more direct comparison to [35, 49], we also evaluated our method by replicating their setting. Specifically, to provide input sparse measurements, we first down-sample the ground-truth depth map and randomly sample depth values from this down-sampled map. We provide these as inputs to our method (which resizes them back to full resolution to compute the global cost). Then, we take the full-resolution depth map estimates produced by our method, down-sample them to half resolution, and compute error metrics on the same centered crop as [35, 49]. We report these results in Table 9, and find them similar to those from the standard evaluation in Table 3 of the paper.
Table 9: Comparison to [35, 49] in their half-resolution, centered-crop evaluation setting. Columns: #, Method, error metrics (lower is better), and accuracy metrics (higher is better).
We also include results for the various depth completion tasks on more example scenes in Figure 8.