Generating and Exploiting Probabilistic Monocular Depth Estimates
Abstract
Despite the remarkable success of modern monocular depth estimation methods, the accuracy achievable from a single image is limited, making it practically useful to incorporate other sources of depth information. Currently, depth estimation from different combinations of sources is treated as a set of distinct applications, each solved via a separate network trained to use the available sources as input. In this paper, we propose a common versatile model that outputs a probability distribution over scene depth given an input color image, as a sample approximation using outputs from a conditional GAN. This distributional output is useful even in the monocular setting, and can be used to estimate depth, pairwise ordering, etc. More importantly, these outputs can be combined with a variety of other depth cues—such as user guidance and partial measurements—for use in different application settings, without retraining. We demonstrate the efficacy of our approach through experiments on the NYUv2 dataset for a number of tasks, and find that our results from a common model, trained only once, are comparable to those from state-of-the-art methods with separate task-specific models.
1 Introduction
Recent neural network-based methods [7, 48, 3, 9, 23] have become surprisingly successful at predicting scene depth from only a single color image. This success confirms that even a single view of a scene contains considerable information about scene geometry. However, purely monocular depth map estimates are far from precisely accurate, and this is likely to always be true given the ill-posed nature of the task. Fortunately, many practical systems are able to rely on other, also imperfect, sources of depth information—limited measurements from depth sensors, interactive user guidance, consistency across video frames or multiple views, etc. It is therefore desirable to be able to combine monocular cues for depth with information from these other sources, to yield estimates that are more accurate than possible from any one source alone.
However, depth maps predicted by monocular estimators cannot be directly combined with other depth cues. Instead, researchers have considered depth estimation from different combinations of cues as distinct applications in their own right (\eg, depth upsampling [5], estimation from sparse [35] and line [29] measurements, etc.), and solved each by learning a separate estimator that takes the corresponding set of cues, in addition to the color image, as input. This requires determining the types of inputs that will be available for each application setting, constructing a corresponding training set, choosing an appropriate network architecture, and then training the network—a process that is often onerous to duplicate for multiple settings.
In this paper, we propose a single network model for extracting and summarizing the depth information present in a single color image, in a manner that can be directly utilized in different applications and combined with different external depth cues, without retraining. Given a color image, our model outputs a probability distribution over scene depth conditioned on the image input. We use a conditional GAN [11, 37] to output multiple plausible depth estimates for individual patches in the image plane. We then use the set of estimates for each patch to form a sample approximation of the joint distribution over depth values in that patch, and combine distributions for overlapping patches to obtain a distribution for the entire depth map. Thus, rather than a “best guess” for depth at each pixel, our model outputs a rich characterization of the information and ambiguity about depth values and their spatial dependencies.
As illustrated in Fig. 1, and demonstrated through experiments on the NYUv2 dataset [45], our distributional output is versatile enough to enable a diverse variety of applications. It is useful even in the purely monocular setting—when only a single image is available—and can be used to produce accurate depth predictions, a measure of confidence in these predictions, as well as estimates of the relative ordering of pairs of scene points. More importantly, it is also able to incorporate additional information to produce improved depth estimates in diverse application settings: producing multiple depth maps for user selection, incorporating user annotation of erroneous regions, incorporating a small number of depth measurements—along a single line, within a smaller field of view, at random as well as regular sparse locations—and selecting the optimal locations for these measurements. Crucially, all of these applications are enabled by the same network model that is trained only once, while achieving accuracy comparable to state-of-the-art methods that rely on separate task-specific models.
2 Related Work
Monocular Depth Estimation.
First attempted by Saxena \etal [41], early work in estimating scene depth from a single color image relied on hand-crafted features [42, 22, 43, 39], graphical models [42, 34, 54], and databases of exemplars [20, 17]. More recently, Eigen \etal [8] showed that, given a large enough database of image-depth pairs [45], convolutional neural networks could be trained to achieve significantly more reliable depth estimates. Since then, there have been steady gains in accuracy through the development of improved neural network-based methods [7, 53, 50, 40, 31, 3, 26, 13, 24, 9], as well as strategies for unsupervised and semi-supervised learning [10, 21, 4]. Beyond estimating absolute depth, some works have also looked at predicting ordinal depth relations between pairs of points in the scene from an input color image [55, 4].
Depth from Partial Measurement.
Since making dense depth measurements is slow and expensive, it is useful to be able to recover a high-quality dense depth map from a small number of direct measurements, by exploiting monocular cues from a color image. A popular way of combining color information with partial measurements is by requiring color and depth edges to co-occur. This approach is often successful for “depth inpainting”, i.e., filling in gaps of missing measurements in a depth map (common in measurements from structured-light sensors). A notable and commonly used example is the colorization method of Levin \etal [25]. Other methods along this line include [14, 33, 32, 36, 6], while Zhang and Funkhouser [52] used a neural network to predict normals and occlusion boundaries to aid inpainting.
However, when working with a very small number of measurements, the task is significantly more challenging (see discussion in [5]) and requires relying more heavily on the monocular cue. In this regime, the solution has been to train a network that takes the color image and the provided sparse samples as input. Researchers have demonstrated the efficacy of this approach with measurements along a single horizontal line from a line sensor [29], random sparse measurements [47, 35, 16, 49, 44], and subsampled measurements on a regular grid [28, 12, 5]. Moreover, each of these methods also trains separate networks for different settings, such as different networks for different sparsity levels in [35], and for different resolution grids in [5].
Probabilistic Outputs.
Monocular depth estimators commonly output a single estimate of the depth value at each pixel, preventing their use in different estimation settings. Some existing methods do produce distributional outputs, but only as per-pixel variance maps [18, 13] or per-pixel probability distributions [30]. Note that depth values at different locations are not statistically independent, i.e., different values at different locations may be plausible independently, but not in combination. Thus, per-pixel distributions provide only a limited characterization that, while useful in some applications, cannot be used more generally, \eg, to infer relative depth, or to spatially propagate information from sparse measurements.
Like us, Chakrabarti \etal [3] also consider joint distributions over local depth values, albeit to eventually produce a depth map. They use a factorization into independent distributions for different depth derivatives, and train a network to output these distributions. However, their outputs do not provide a way to solve other inference tasks (this was not their goal). Also, their factorization into pre-selected derivatives with a fixed parametric form is still a restrictive assumption that does not fully capture local depth dependencies.
In this work, we use a more general form for the conditional joint distribution of depth values in local regions. We train a conditional GAN [11, 37] to produce multiple estimates of depth in local patches from an image input. Conditional GANs have been used to produce outputs that are more “natural” than those from networks trained with regression loss alone [15]. In our case, we run our GAN model multiple times to generate multiple plausible estimates, treat these as samples from a distribution, and use these samples to approximate the distribution itself.
3 Proposed Method
Given the RGB image $I$ of a scene, our goal is to reason about its corresponding depth map $D$, represented as a vector containing depth values for all $N$ pixels in the image. Rather than predict a single estimate for $D$, we seek to output a distribution $p(D \mid I)$, to more generally characterize the depth information and ambiguity present in the image. We form this distribution as a product of functions defined on individual overlapping patches as
\[ p(D \mid I) \propto \prod_i \psi_i\left(C_i D;\ I\right) \tag{1} \]
where $\psi_i$ is a potential function for the $i^{\text{th}}$ patch, and $C_i$ a sparse matrix that crops out that patch from $D$ (for patches of size $K \times K$, each $C_i$ is a $K^2 \times N$ matrix).
We now describe our approach, which trains a conditional GAN to generate multiple depth estimates for each patch, uses these to construct the functions $\psi_i$, and then leverages the resulting distribution $p(D \mid I)$ for inference.
3.1 Diverse Patch Depth Estimates from a GAN
Conditional GANs [37] train a “generator” network to produce estimates so as to match conditional distributions of data in a training set. The generator takes the conditioning variables and a noise source as input, and is trained adversarially against a discriminator that also uses the same conditioning inputs. We employ a conditional GAN to generate multiple plausible estimates for the depth $C_i D$ of each patch, given the input image $I$. For large networks and high-dimensional inputs, GAN training typically suffers from instability, as well as from reduced output diversity due to mode collapse [1]. Note that the latter is especially a concern in our setting: most applications that use conditional GANs (\eg, [15]) are concerned with generating only a single estimate at test time, and use the conditional GAN framework to ensure these estimates are plausible. In contrast, we invoke our generator multiple times on the same input image at test time, and need the multiple outputs for each patch to be diverse so as to faithfully characterize local depth ambiguity.
Accordingly, we use a pre-trained feature extractor to reduce the complexity of our generator and discriminator networks. Specifically, we take a pre-trained network from a state-of-the-art monocular depth estimation method (DORN [9]), remove the last two convolution layers, and treat the remaining network as our feature extractor. Our generator and discriminator networks then both operate on the corresponding feature-map output, rather than on the image itself. To generate estimates for a given patch, a small spatial feature-map window, with receptive field centered on the patch, is provided as input to the generator and discriminator. Moreover, as is common in recent methods for conditional generation [15], the generator uses dropout [46] rather than an explicit random vector input.
Figure 2 includes a schematic of our conditional GAN setup, with further architecture details provided in the supplementary. We carry out standard adversarial training on the generator-discriminator pair, keeping the pre-trained feature extraction network fixed. Since both our generator and discriminator have significantly lower complexity than would be required if operating directly on the input image, we find training to be stable and our learned generator successful in producing plausible yet diverse estimates. At test time, we run the feature extractor once, and then run the generator multiple times with different instantiations of dropout to generate a diverse set of estimates for each patch. This is efficient because the bulk of the computation happens in the feature extraction layers, and is not repeated.
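To make the sampling mechanism concrete, the following toy sketch (all weights and sizes are hypothetical, not the paper's architecture) shows how keeping dropout active at test time turns a deterministic network into a sampler: repeated calls on the same features yield different plausible patch estimates.

```python
import numpy as np

def generator(features, rng, p_drop=0.5):
    """Toy stand-in for the patch generator: a tiny two-layer network whose
    hidden units are randomly dropped even at test time, so repeated calls
    on the same features yield different depth samples. Weights and sizes
    here are hypothetical, not the paper's architecture."""
    W1 = np.full((8, features.size), 0.1)   # fixed "trained" weights
    W2 = np.full((4, 8), 0.2)
    h = np.maximum(W1 @ features, 0.0)      # ReLU features
    keep = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)  # inverted dropout
    return W2 @ (h * keep)                  # a 4-pixel "patch" estimate

rng = np.random.default_rng(0)
feat = np.ones(6)
# Run the generator many times to build the per-patch sample set S_i.
samples = [generator(feat, rng) for _ in range(100)]
```

Because only the dropout masks change between calls, the expensive feature computation can be shared across all samples, mirroring the efficiency argument above.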
3.2 Sample Approximation for Patch Potentials
We use the generated outputs from our generator to form a sample approximation to the per-patch potential functions $\psi_i$, and thus the joint distribution over the depth map in (1). Given a set $S_i$ of different generated estimates of the depth $C_i D$ of patch $i$, we define its potential function as

\[ \psi_i(z) = \sum_{d \in S_i} \exp\left(-\frac{\|z - d\|^2}{2\sigma^2}\right) \tag{2} \]

This can be interpreted as forming a kernel density estimate from the depth samples in $S_i$ using a Gaussian kernel, where the Gaussian bandwidth $\sigma$ is a scalar hyperparameter. (While $\sigma$ can be estimated based on the variance between the samples in $S_i$ and true patch depths, its actual value is not used in any of the tasks we consider.)
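As a concrete illustration, the potential of Eq. (2) can be evaluated directly from the sample set; the patch vectors and bandwidth value below are illustrative.

```python
import numpy as np

def patch_potential(z, samples, sigma=0.1):
    """psi_i(z) from Eq. (2): a kernel density estimate, summing Gaussian
    kernels of bandwidth sigma centered on each generated patch sample."""
    sq = np.sum((np.asarray(samples) - z) ** 2, axis=1)
    return float(np.sum(np.exp(-sq / (2.0 * sigma ** 2))))
```

The potential is large near any generated sample and falls off away from all of them, which is what makes it multi-modal.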
Unlike the independent per-pixel [18, 13, 30] or per-derivative [3] distributions, the samples from our generator lead to more general joint patch potentials $\psi_i$ that can express complex spatial dependencies between depth values in local regions. Moreover, our joint distribution $p(D \mid I)$, defined in terms of overlapping patches, models dependencies across the entire depth map. This enables information propagation across the entire scene, and reasoning about the global plausibility of scene depth estimates.
3.3 Inference with Distributional Outputs
Inference by Expectation.
A natural way to compute estimates of certain properties or functions of the depth map is simply as their expectation under our output distribution. When these properties depend on the depths of individual points or nearby sets of points, this can be done by considering all patches that contain these points, all generator samples for each patch, and averaging across this entire set. In Sec. 4, we will show examples of using this strategy to compute point and pairwise properties of depth values in the monocular setting.
Inference by Mode Computation.
Several applications require computing a global depth map estimate, potentially based on additional information or constraints available during inference. Note that our patch potentials are multi-modal functions, defined as a mixture of Gaussian components centered on each sample from the conditional GAN. Based on this observation, we propose recovering global depth map estimates as modes of our distributional output $p(D \mid I)$, by selecting one mode or sample in $S_i$ for every patch, instead of averaging across them.
This is done through a joint optimization over the global depth map $D$ and per-patch depths $\{d_i\}$ as:

\[ D, \{d_i\} = \arg\min_{D,\, \{d_i \in S_i\}}\ \sum_i \|C_i D - d_i\|^2 + \sum_i L_i(d_i) + \lambda\, L_G(D) \tag{3} \]

where the per-patch depths $d_i$ are constrained to be among the corresponding discrete sets $S_i$ of generated samples. The first term in (3) simply corresponds to a scaled negative log-likelihood of our output distribution. The other two terms represent different ways of introducing additional information: either as costs $L_i$ on individual patches, or as a cost $L_G$, with weight $\lambda$, on the global depth map. For the different inference applications in Sec. 4, we will use appropriately defined costs in one of these two forms to incorporate external depth cues.
We use a simple iterative algorithm to carry out the optimization in (3). We begin with an initial estimate of $D$ as the mean per-pixel depth (i.e., averaged across all patches that contain each pixel, and all samples from each patch), and apply alternating updates to $\{d_i\}$ and $D$ till convergence as

\[ d_i \leftarrow \arg\min_{d \in S_i}\ \|C_i D - d\|^2 + L_i(d) \tag{4} \]

\[ D \leftarrow \arg\min_{D}\ \sum_i \|C_i D - d_i\|^2 + \lambda\, L_G(D) \tag{5} \]
The updates to the patch estimates in (4) can be done independently, and in parallel, for different patches. The cost in (4) is the sum of the squared distance from the corresponding crop of the current global estimate and, when available, the external cost $L_i$. We compute these costs for all samples in $S_i$, and select the one with the lowest cost. Note that since $L_i(d)$ does not depend on $D$, it need only be computed once at the start of the optimization.
The update to the global map in (5) depends on the form of the external global cost $L_G$. If no such cost is present, $D$ is given simply by the overlap-average of the currently selected samples $d_i$ for each patch. For the applications in Sec. 4 that do involve $L_G$, we find it sufficient to solve (5) by first initializing $D$ to the overlap-average, and then carrying out a small number of gradient descent steps as

\[ D \leftarrow D - \eta\, \lambda\, \nabla_D L_G(D) \tag{6} \]

where the scalar step-size $\eta$ is a hyperparameter.
4 Applications and Results
In this section, we describe results for using our probabilistic outputs and inference strategies in various applications: for different inference tasks in the monocular setting, and in combination with different costs and constraints based on additional information when available. We report performance for all applications on the NYUv2 dataset [45]. Crucially, all results from our method reported in all tables and figures in this section are from the same network model, trained only once.
Preliminaries.
We use raw frames from scenes in the official train split of NYUv2 [45] to construct our training and validation sets, and report performance using standard error metrics (see [7]) on the “valid” crop, including filled-in values, of the full-resolution official test images. As mentioned, we use feature extraction layers from a pre-trained DORN model [9]. The DORN architecture works on rescaled input images and outputs depth maps at a lower resolution, and so we operate our conditional GAN at the same resolution. However, our outputs are rescaled back to the original full resolution to compute error metrics, and in applications with input depth measurements, these are also provided at the original resolution and then rescaled (see supplementary for details). For our distribution, we use overlapping patches of side $K$ with a stride of four, and generate 100 samples per patch to construct $p(D \mid I)$. Generating samples takes 4.8s on a 1080Ti GPU for each image, while inference from these samples is faster (see supplementary for per-application run times). Our code and trained models are available at https://projects.ayanc.org/prdepth/.
4.1 Monocular Inference
Our distributional output is useful even when a single color image is the only input, and we now discuss applications for reasoning about scene geometry in this setting.
Predicting Depth and Confidence.
Our outputs can be used for the standard monocular estimation task, i.e., predicting a depth map of the scene given a color image. We can recover this estimate from our model as the mean of the distribution $p(D \mid I)$. This corresponds to simply averaging all the estimates for each pixel’s depth—from all the patches that include it, and from all generated estimates for each patch. This can be computed efficiently by first obtaining a mean estimate of each patch’s depth by averaging its generated samples, and then computing per-pixel means as the overlap-average of these patch means. Another possibility is to predict $D$ as the mode of $p(D \mid I)$, by solving the optimization in (3) without any additional costs $L_i$ or $L_G$.
Along with an estimate of each pixel’s depth value, we can also output a measure of confidence in these predictions. We do so by computing the variance of each pixel’s depth value across patches and samples from our distributional output—which relates to the per-pixel variance under $p(D \mid I)$ (differing by the constant $\sigma^2$). This variance map gives us a measure of our model’s relative confidence in its estimates at different pixels.
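In a toy 1-D analogue, the mean prediction and the variance-based confidence can be accumulated in a single pass over patches and samples; the setup is illustrative, not the paper's implementation.

```python
import numpy as np

def mean_and_confidence(samples, crops, n_pix):
    """Per-pixel mean depth and variance across all patches and samples.
    High variance signals low confidence. `samples[i]` is an (S, K) array
    for patch i and `crops[i]` its pixel indices (1-D toy setup)."""
    s1 = np.zeros(n_pix)   # sum of values
    s2 = np.zeros(n_pix)   # sum of squares
    n = np.zeros(n_pix)    # number of contributions
    for s, c in zip(samples, crops):
        s1[c] += s.sum(axis=0)
        s2[c] += (s ** 2).sum(axis=0)
        n[c] += s.shape[0]
    mean = s1 / n
    var = s2 / n - mean ** 2
    return mean, var
```

Pixels where the generated samples disagree receive a high variance, and would be flagged as low-confidence.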
Table 1. Lower is better: rms, log10, rel. Higher is better (%): δ<1.25, δ<1.25^2, δ<1.25^3.

Method           rms    log10  rel    δ<1.25  δ<1.25^2  δ<1.25^3
Eigen [7]        0.641  -      0.158  76.9    95.0      98.8
Chakrabarti [3]  0.620  -      0.149  80.6    95.8      98.7
Li [27]          0.635  0.063  0.143  78.8    95.8      99.1
Xu [51]          0.586  0.052  0.121  81.1    95.4      98.7
Laina [23]       0.584  0.059  0.136  82.2    95.6      98.9
Qi [38]          0.569  0.057  0.128  83.4    96.0      99.0
DORN [9]         0.545  0.050  0.114  85.8    96.2      98.7
Ours (mean)      0.536  0.053  0.125  85.2    96.2      98.8
Ours (mode)      0.536  0.053  0.125  85.1    96.6      99.0
Ours (oracle)    0.253  0.017  0.041  96.7    99.2      99.8
In Table 1, we compare the accuracy of our mean and mode depth estimates to those of other monocular depth estimation methods. (For [23, 9], we recompute these numbers on the official NYUv2 crop from their provided test-set estimates. [9] also used a different definition of RMSE, as the mean of per-image RMSE, in their paper; we report results using the standard definition here.) We find that in the monocular setting, the mean and mode estimates are nearly identical. Moreover, these estimates also have nearly the same accuracy as those from DORN [9], whose feature extractor our model is based on. This shows that our rich distributional outputs come “for free”, without adversely affecting our ability to recover depth compared to standard monocular estimation.
Table 1 also includes the results of using our distributional output in combination with an oracle that selects the most accurate patch estimate from our generator’s samples, and computes the depth map from these samples by overlap-average. These estimates are significantly more accurate, demonstrating that our generated samples contain estimates close to the true depth. The oracle performance also represents an upper bound for tasks that incorporate additional information using per-patch costs in (3).
Figure 3 evaluates our confidence measure as a predictor of accuracy. We show depth predictions and error and confidence maps from our model for two example images from the NYUv2 test set, and find that regions with relatively higher error also tend to be those where our model has high variance, and thus low confidence—often corresponding to reflective surfaces and isolated far away parts of the scene. We also show a more systematic evaluation of accuracy vs confidence (Fig. 3, right), with errors averaged across the entire test set, over different subsets of only the most confident pixels. The error drops rapidly as we discard a small fraction of pixels with the highest variance.
Predicting Pairwise Depth Ordering.
Another monocular task, introduced in [55], is to predict the ordinal relative depth of pairs of nearby points in the scene: whether the points are at similar depths (within some threshold), and if not, which point is nearer. Instead of predicting this ordering from an estimated depth map (as done in [4, 55]), we use our distributional output and look at the relative depth of the two points in all samples of all patches that contain both queried points, outputting the ordinal relation that is most frequent.
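A sketch of this voting rule, under the assumption that co-occurring depth samples for the two points are available and that "similar depth" is judged by a log-ratio threshold tau (the exact thresholding rule used in [55] may differ):

```python
import numpy as np

def ordinal_relation(samples_a, samples_b, tau=0.02):
    """Vote over co-occurring samples of the depths at two queried points:
    each sample pair is labeled 'equal' if its log-depth ratio is within
    tau, else by which point is nearer; the most frequent label wins.
    Returns -1 (A nearer), 0 (similar), or 1 (B nearer)."""
    ratio = np.log(np.asarray(samples_a)) - np.log(np.asarray(samples_b))
    labels = np.where(np.abs(ratio) <= tau, 0, np.where(ratio < 0, -1, 1))
    vals, counts = np.unique(labels, return_counts=True)
    return int(vals[np.argmax(counts)])
```

Voting over joint samples, rather than comparing a single mean depth map, is what lets the prediction reflect the correlation between the two points' depths.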
Table 2 compares the performance of our method with that of [4] and [55], who use correctness of pairwise ordering as an objective during training. Results are reported in terms of the WKDR error metrics, on a standard set of point pairs on the NYUv2 test set (see [55]). We also show results predicting ordering from our mean depth map prediction (see supplementary for more details). We find that using our distributional output leads to better predictions than using simply the mean estimate, and that these are comparable to those from the taskoptimized model of [4].
Method               WKDR   WKDR^=  WKDR^≠
Zoran [55]           43.5%  44.2%   41.4%
Chen [4]             28.3%  30.6%   28.6%
Ours (mean)          33.2%  29.3%   35.7%
Ours (distribution)  28.9%  26.1%   30.7%
4.2 Incorporating User Guidance
Depth estimates are often useful in interactive image editing and graphics applications. We now describe ways of using our distributional output to include feedback from a user in the loop for improved depth accuracy.
Diverse Estimates for User Selection.
We use Batra \etal’s approach [2] to derive multiple diverse “global” estimates $\{D^{(m)}\}$ of the depth map from our distribution $p(D \mid I)$, and propose presenting these as alternatives to the user. We set the first estimate $D^{(1)}$ to our mean estimate, and generate every subsequent estimate $D^{(m)}$ by finding a mode using (3) with per-patch costs defined as

\[ L_i^{(m)}(d) = -\beta \sum_{m' < m} \left\|d - C_i D^{(m')}\right\|^2 \tag{7} \]

This introduces a preference for samples that are different from the corresponding patches in previous estimates, weighted by a scalar hyperparameter $\beta$ (set on a validation set).
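The diversity cost of Eq. (7) is straightforward to evaluate for a candidate sample; the value of beta below is illustrative.

```python
import numpy as np

def diversity_cost(d, prev_patches, beta=0.1):
    """Per-patch cost of Eq. (7): a negative (i.e., rewarding) term that
    grows with the distance of sample d from the corresponding patches of
    previously returned estimates, DivMBest-style."""
    return -beta * sum(float(np.sum((d - p) ** 2)) for p in prev_patches)
```

Samples far from all previous solutions get a lower (more favorable) cost, pushing each new mode away from those already shown to the user.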
Figure 4 illustrates the performance of this approach, on an example image and quantitatively over the entire test set. As a proxy for user guidance, we automatically select among the estimates for each scene based on minimum error with respect to the ground truth. We find that accuracy improves quickly even when selecting among a small number of modes, suggesting that this method can deliver performance gains with fairly minimal user input.
Using Annotations of Erroneous Regions.
As a simple extension, we also consider obtaining annotations from the user of regions with high error in each estimate $D^{(m)}$. Note that we only get the locations of these regions, not their correct depth values. Given this annotation, we define a mask $R^{(m)}$ that is one within the annotated region and zero elsewhere, and now recover each subsequent estimate with a modified cost $L_i^{(m)}$:

\[ L_i^{(m)}(d) = -\beta \sum_{m' < m} \left\|\left(C_i R^{(m')}\right) \odot \left(d - C_i D^{(m')}\right)\right\|^2 \tag{8} \]

where $\odot$ denotes element-wise multiplication, and the masks focus the cost on regions marked as erroneous.
Figure 4 also includes results for this form of user guidance, where user annotation of regions is simulated by choosing windows with the highest error against the ground truth, such that they have no more than 50% overlap with previously marked regions for the same image. We find that the error of the selected estimate now drops significantly faster with an increasing number of estimates.
4.3 Depth Completion
We now consider applications where a small number of depth values are available, \eg, from a sensor that makes limited measurements for efficiency. As illustrated in Fig. 5, our model can use these measurements along with monocular cues to produce accurate estimates of a full depth map.
Dense Depth from Sparse Measurements.
Assuming an input sparse set of depth measurements $g$ at isolated points in the scene, we estimate the depth map by using these measurements to define a global cost in (3) as

\[ L_G(D) = \|M(D) - g\|^2 \tag{9} \]

where $M(\cdot)$ represents sampling at the measured locations. Based on this, we define the gradients to be applied in (6) for computing the global depth updates as

\[ \nabla_D L_G(D) = 2\, M^T\!\left(M(D) - g\right) \tag{10} \]

where $M^T$ represents the transpose of the sampling operation. Since both the weight $\lambda$ and the step-size $\eta$ in (6) are hyperparameters, we simply set $\lambda = 1$, and set the step-size (as well as the number of gradient steps) based on a validation set.
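A minimal sketch of the gradient updates of (6) with the cost of (9), for a 1-D depth vector; for simplicity the transpose here scatters the residual back only at the measured locations, whereas for random sparse points the paper uses a nearest-neighbor fill. The step size and iteration count are illustrative.

```python
import numpy as np

def measurement_grad(D, idx, g):
    """Gradient of the global cost of Eq. (9), L_G(D) = ||M(D) - g||^2,
    scattered back to the full map: here the transpose simply places the
    residual at the measured locations idx."""
    grad = np.zeros_like(D)
    grad[idx] = 2.0 * (D[idx] - g)   # 2 M^T (M(D) - g), as in Eq. (10)
    return grad

def apply_measurements(D, idx, g, eta=0.25, steps=50):
    """A few gradient steps of Eq. (6), pulling the depth map D toward the
    sparse measurements g (step size and step count are illustrative)."""
    for _ in range(steps):
        D = D - eta * measurement_grad(D, idx, g)
    return D
```

With this identity-scatter transpose, only measured pixels move; it is the interleaved overlap-average step of (5) that then propagates the measurement information to the rest of the map.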
Table 3. Lower is better: rms, mrms, rel. Higher is better (%): δ<1.25, δ<1.25^2, δ<1.25^3. "Opt." denotes optimized measurement placement.

# meas.  Method       rms    mrms   rel    δ<1.25  δ<1.25^2  δ<1.25^3
20       Levin [25]   0.703  0.602  0.175  75.5    93.0      97.9
20       Ma [35]      -      0.351  0.078  92.8    98.4      99.6
20       Ours         0.391  0.329  0.078  92.5    98.5      99.7
20       Ours (Opt.)  0.363  0.307  0.078  92.4    98.5      99.7
50       Levin [25]   0.507  0.436  0.117  86.4    97.1      99.3
50       Ma [35]      -      0.281  0.059  95.5    99.0      99.7
50       Ours         0.344  0.288  0.064  94.2    98.8      99.7
50       Ours (Opt.)  0.313  0.264  0.062  94.6    99.0      99.8
100      Levin [25]   0.396  0.340  0.085  92.2    98.5      99.6
100      Wang [49]    0.372  -      0.089  91.5    98.3      99.6
100      Ours         0.302  0.254  0.053  95.5    99.2      99.8
100      Ours (Opt.)  0.271  0.229  0.052  95.8    99.3      99.8
200      Levin [25]   0.305  0.264  0.061  95.7    99.2      99.8
200      Ma [35]      -      0.230  0.044  97.1    99.4      99.8
200      Ours         0.262  0.220  0.043  96.7    99.4      99.9
200      Ours (Opt.)  0.239  0.203  0.048  96.3    99.4      99.9
We apply this technique for two kinds of sparse inputs. We first consider measurements at arbitrary randomly selected points as in [47, 35, 16, 49, 44]. In this case, the transpose sampling operation $M^T$ is computed as a nearest-neighbor fill, by copying values for every point in the full image plane from their nearest sampled location. Table 3 reports the accuracy of the completed depth maps using our method for different numbers of randomly placed measurements, and compares it to those obtained using [25], as well as the learning-based methods of Ma and Karaman [35] and Wang \etal [49]. Our estimates are significantly more accurate than those from [25], and comparable to [35, 49], even though the latter not only use networks trained for this specific completion task, but train different networks for different numbers of measurements. (Both [35] and [49] evaluate their methods on a centered crop at half-resolution, while we report our performance on the official full-resolution valid crop for NYUv2 to be consistent with the benchmark. Our performance at half-resolution is similar, and is reported in the supplementary.)
Instead of placing points randomly, we also consider choosing an optimal set of locations to measure based on the color image, given a budget on the total number of measurements. We select these points as local maxima of the variance map described in Sec. 4.1. We also include results for depth maps reconstructed from these optimally placed measurements in Table 3, and find them to be more accurate.
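One simple way to realize this selection is greedy non-maximum suppression on the variance map; the suppression radius below is an assumption, since the text only specifies that points are chosen as local maxima.

```python
import numpy as np

def pick_measurement_sites(var_map, budget, radius=2):
    """Greedy selection of measurement locations as local maxima of the
    variance (low-confidence) map: repeatedly take the highest-variance
    pixel and suppress a small neighborhood around it."""
    v = np.array(var_map, dtype=float)
    sites = []
    for _ in range(budget):
        r, c = np.unravel_index(np.argmax(v), v.shape)
        sites.append((int(r), int(c)))
        v[max(0, r - radius):r + radius + 1,
          max(0, c - radius):c + radius + 1] = -np.inf
    return sites
```

The suppression step keeps the budgeted measurements from clustering in a single high-variance region.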
We next consider the setting of depth upsampling, where the sparse input measurements lie on a regular lower-resolution grid. Because of the regular spacing between measured samples, we are able to use bilinear interpolation for the transpose operation $M^T$ in (10). We evaluate our method at two subsampling levels in Table 4, and compare it to [25] and the method of Chen \etal [5]. Again, we perform better than [25], and competitively with the task-specific networks of [5]—which are separately trained for different sampling levels—especially at 96x subsampling.
Table 4. Lower is better: rms, log10, rel. Higher is better (%): δ<1.25, δ<1.25^2, δ<1.25^3.

Factor  Method      rms    log10  rel    δ<1.25  δ<1.25^2  δ<1.25^3
48x     Levin [25]  0.319  0.027  0.065  95.4    99.1      99.8
48x     Chen [5]    0.193  -      0.032  98.3    99.7      99.9
48x     Ours        0.251  0.017  0.040  97.1    99.5      99.9
96x     Levin [25]  0.512  0.050  0.120  85.9    97.1      99.4
96x     Chen [5]    0.318  -      0.072  94.2    98.9      99.8
96x     Ours        0.335  0.026  0.061  94.7    99.1      99.8
Depth Uncropping.
We also consider the case when the available measurements are dense in a contiguous, but small, portion of the image plane—such as from a sensor with a smaller field-of-view (FOV), or along a single line [29]. In this case, we define $G$ and $R$ as sparse vectors of length $N$ that are zero at locations without measurements. At measured locations, $G$ contains the measured values, while the mask $R$ is set to one. We use these to define a per-patch cost for use with (3) as

\[ L_i(d) = \alpha\, \left\|\left(C_i R\right) \odot \left(d - C_i G\right)\right\|^2 \tag{11} \]

where the weight $\alpha$ is determined on a validation set.
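Evaluating the per-patch cost of Eq. (11) then amounts to a masked squared error against the cropped measurement vector; the names and the value of alpha below are illustrative.

```python
import numpy as np

def uncrop_cost(d, g_patch, mask_patch, alpha=1.0):
    """Per-patch cost of Eq. (11): squared error between the sample d and
    the cropped measurement vector, restricted by the cropped mask to
    locations that were actually measured."""
    return alpha * float(np.sum((mask_patch * (d - g_patch)) ** 2))
```

Patches that lie entirely outside the measured region incur zero cost, so their samples are selected purely by consistency with the evolving global estimate.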
We report results for this approach in Table 5, with measurements given either as small centered windows in the image (corresponding to a small-FOV camera), or along a vertically centered horizontal line. We compare our approach with [25], and for the case of line measurements, with the learning-based method of Liao \etal [29]. (Note that [29] use measurements along a line simulated to be horizontal in world coordinates, leading to different vertical positions at each image column. Due to a lack of exact details for replicating this setting, we simply use a line that is horizontal in the image plane.) Our approach again outperforms [25], and in comparison to [29], has slightly higher RMSE but is better on all other metrics.
Table 5. Lower is better: rms, log10, rel. Higher is better (%): δ<1.25, δ<1.25^2, δ<1.25^3.

Size         Method      rms    log10  rel    δ<1.25  δ<1.25^2  δ<1.25^3
60x80        Levin [25]  1.357  0.141  0.424  50.5    73.6      85.7
60x80        Ours        0.500  0.049  0.115  86.9    96.9      99.1
120x160      Levin [25]  1.104  0.118  0.348  57.5    79.2      90.0
120x160      Ours        0.469  0.045  0.107  88.2    97.1      99.1
240x320      Levin [25]  0.664  0.072  0.196  74.2    91.8      96.7
240x320      Ours        0.391  0.036  0.086  91.0    97.7      99.3
330x440      Levin [25]  0.378  0.040  0.102  90.2    97.4      99.2
330x440      Ours        0.314  0.027  0.066  93.5    98.3      99.6
Single Line  Levin [25]  1.003  0.101  0.281  63.8    83.2      92.3
Single Line  Liao [29]   0.442  0.043  0.104  87.8    96.4      98.9
Single Line  Ours        0.457  0.041  0.098  89.7    97.5      99.3
5 Conclusion
Using distributional estimates of depth from a single image, our approach enables a variety of applications without the need for repeated training. While in this paper we focused on applications where the final output was depth or some direct function of scene geometry, in future work we are interested in exploring how our distributional outputs can be used to manage ambiguity in downstream processing, such as for re-rendering or path planning.
Acknowledgments. This work was supported by the NSF under award no. IIS-1820693.
References
 [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [2] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In Proc. ECCV, 2012.
 [3] A. Chakrabarti, J. Shao, and G. Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In NeurIPS, 2016.
 [4] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NeurIPS, 2016.
 [5] Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from RGB and sparse sensing. In Proc. ECCV, 2018.
 [6] D. Doria and R. J. Radke. Filling large holes in LiDAR data by inpainting depth gradients. In Proc. CVPR Workshops, 2012.
 [7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. ICCV, 2015.
 [8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
 [9] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proc. CVPR, 2018.
 [10] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. ECCV, 2016.
 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
 [12] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. CVPR, 2017.
 [13] M. Heo, J. Lee, K.-R. Kim, H.-U. Kim, and C.-S. Kim. Monocular depth estimation using whole strip masking and reliability-based refinement. In Proc. ECCV, 2018.
 [14] D. Herrera, J. Kannala, J. Heikkilä, et al. Depth map inpainting under a second-order smoothness prior. In Proc. Scandinavian Conference on Image Analysis, 2013.
 [15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017.
 [16] M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi. Sparse and dense data with CNNs: Depth completion and semantic segmentation. In Proc. Intl. Conference on 3D Vision (3DV), 2018.
 [17] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. PAMI, 2014.
 [18] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, 2017.
 [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [20] J. Konrad, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee. Learning-based, automatic 2D-to-3D image and video conversion. IEEE Trans. on Image Processing, 2013.
 [21] Y. Kuznietsov, J. Stuckler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. CVPR, 2017.
 [22] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Proc. CVPR, 2014.
 [23] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Proc. Intl. Conference on 3D Vision (3DV), 2016.
 [24] J.-H. Lee, M. Heo, K.-R. Kim, and C.-S. Kim. Single-image depth estimation based on Fourier domain analysis. In Proc. CVPR, 2018.
 [25] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. ACM Transactions on Graphics (TOG), 2004.
 [26] J. Li, R. Klein, and A. Yao. A two-streamed network for estimating fine-scaled depth maps from single RGB images. In Proc. ICCV, 2017.
 [27] J. Li, R. Klein, and A. Yao. A two-streamed network for estimating fine-scaled depth maps from single RGB images. In Proc. ICCV, 2017.
 [28] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. ECCV, 2016.
 [29] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu. Parse geometry from a line: Monocular depth estimation with partial laser observation. In Proc. ICRA, 2017.
 [30] C. Liu, J. Gu, K. Kim, S. Narasimhan, and J. Kautz. Neural RGB-D sensing: Depth and uncertainty from a video camera. arXiv preprint arXiv:1901.02571, 2019.
 [31] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2016.
 [32] J. Liu and X. Gong. Guided depth enhancement via anisotropic diffusion. In Proc. Pacific-Rim Conference on Multimedia, 2013.
 [33] J. Liu, X. Gong, and J. Liu. Guided inpainting and filtering for Kinect depth maps. In Proc. ICPR, 2012.
 [34] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In Proc. CVPR, 2014.
 [35] F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proc. ICRA, 2018.
 [36] K. Matsuo and Y. Aoki. Depth image enhancement using local tangent plane approximations. In Proc. CVPR, 2015.
 [37] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [38] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In Proc. CVPR, 2018.
 [39] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. CVPR, 2016.
 [40] A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In Proc. CVPR, 2016.
 [41] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NeurIPS, 2006.
 [42] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. PAMI, 2009.
 [43] J. Shi, X. Tao, L. Xu, and J. Jia. Break Ames room illusion: Depth from general single images. ACM Transactions on Graphics (TOG), 2015.
 [44] S. S. Shivakumar, T. Nguyen, S. W. Chen, and C. J. Taylor. DFuseNet: Deep fusion of RGB and sparse depth information for image guided dense depth completion. arXiv preprint arXiv:1902.00761, 2019.
 [45] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. ECCV, 2012.
 [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
 [47] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool. Sparse and noisy LiDAR completion with RGB guidance and uncertainty. arXiv preprint arXiv:1902.05356, 2019.
 [48] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proc. CVPR, 2015.
 [49] T.-H. Wang, F.-E. Wang, J.-T. Lin, Y.-H. Tsai, W.-C. Chiu, and M. Sun. Plug-and-play: Improve depth prediction via sparse data propagation. In Proc. ICRA, 2019.
 [50] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proc. CVPR, 2015.
 [51] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In Proc. CVPR, 2017.
 [52] Y. Zhang and T. Funkhouser. Deep depth completion of a single RGB-D image. In Proc. CVPR, 2018.
 [53] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with CNNs. In Proc. ICCV, 2015.
 [54] W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In Proc. CVPR, 2015.
 [55] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. ICCV, 2015.
Appendices
Appendix A Architecture and Training
Our conditional GAN consists of a pretrained feature extractor, a generator, and a discriminator. As mentioned in the paper, we take the pretrained DORN model [9], remove its last two convolutional layers, and use it as our feature extractor. This feature extractor takes an RGB image, resized to 257×353 from the original 640×480 in NYUv2, and outputs a 2560-dimensional feature map at a lower resolution of 33×45. Our conditional GAN takes this feature map as input, and reasons about an output depth map at the same 257×353 resolution. We consider overlapping patches at stride 4, giving us a 57×81 grid of patches, each of size 33×33. In other words, for each forward pass of our generator, we want to produce an output of size (57×81)×33×33, and then run this multiple (100) times to get multiple samples for each patch.
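The patch-grid arithmetic above can be checked directly; this small helper (ours, not the paper's code) computes the number of overlapping windows along each axis:

```python
def num_patches(length, patch=33, stride=4):
    """Number of overlapping windows of size `patch` at the given stride
    along an axis of the given length."""
    return (length - patch) // stride + 1

# A 257x353 output with 33x33 patches at stride 4 gives a 57x81 patch grid.
rows, cols = num_patches(257), num_patches(353)
```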
We describe our architectures for the discriminator and generator in Tables 6 and 7, respectively. Notice that we are able to generate outputs efficiently in a fully convolutional way, using reshape operations and transposed convolution layers to generate the depth samples for each patch. In the generator, the output of each patch depends on a small receptive field in the input feature map, and we use dropout as the noise source. While the overlapping patches do have overlapping receptive fields in the feature map, we make sure that they have independent instantiations of dropout noise values. The discriminator has a two-stream architecture: one stream processes the feature map, and the other processes the depth patch (either from the generator or ground truth). Outputs from both streams are concatenated and sent through two more layers to predict a true/fake conditional label for each input depth patch.
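As a sanity check on the generator's per-patch decoding path, the spatial sizes follow from standard shape rules: a stride-1, unpadded 3×3 transposed convolution grows each side by 2, and each resize roughly doubles it (we assume align-corners-style 2× upsampling, out = 2·in − 1, which matches the shapes reported in Table 7). A sketch of that bookkeeping, not the actual implementation:

```python
def tconv3x3(size):
    """Stride-1, unpadded 3x3 transposed convolution: out = in + k - 1."""
    return size + 2

def resize2x(size):
    """Align-corners style 2x upsampling: out = 2 * in - 1 (an assumption
    consistent with the reported shapes)."""
    return 2 * size - 1

# Per-patch spatial size, from a 1x1 feature to a 33x33 depth patch.
sizes = [1]
for op in (tconv3x3, tconv3x3, tconv3x3, resize2x,
           tconv3x3, tconv3x3, resize2x):
    sizes.append(op(sizes[-1]))
# sizes traces: 1 -> 3 -> 5 -> 7 -> 13 -> 15 -> 17 -> 33
```

The final 1×1 convolution (with tanh) leaves the 33×33 spatial size unchanged.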
For training, we use Adam [19], with β₁ and β₂ set to 0.5 and 0.9, respectively. As is common for stabilizing GAN training, we update the discriminator at every iteration, while updating the generator only once every five iterations. We use a batch size of 4 and train for 240k iterations.
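The alternating update schedule can be expressed as a simple counter over training iterations (a sketch, not the authors' training code):

```python
def gan_update_counts(num_iters, gen_every=5):
    """Count optimizer steps: the discriminator is updated every
    iteration, the generator only once every `gen_every` iterations."""
    disc_updates, gen_updates = 0, 0
    for it in range(1, num_iters + 1):
        disc_updates += 1            # discriminator step each iteration
        if it % gen_every == 0:
            gen_updates += 1         # generator step every fifth iteration
    return disc_updates, gen_updates
```

Over the full 240k-iteration schedule this gives 240k discriminator updates and 48k generator updates.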
Appendix B Output Depth Resolution
As mentioned above, our distributional output corresponds to the lower DORN [9] resolution of 257×353 for the depth map. However, all error metrics in the paper are computed (inside the valid crop) at the full 640×480 resolution. To do so, we resize our method's outputs to 640×480 by bilinear interpolation. Moreover, in all applications with additional inputs, these are also provided at the original higher resolution. For user annotations, erroneous regions are marked as windows at the full resolution, and we map the locations of these windows to the lower resolution to construct our masks. Similarly, for depth from sparse measurements, the inputs correspond to sparse measurements of depth at the full resolution, and our global cost is defined in terms of a full-resolution depth map (we scale our depth map to the full resolution, and scale the gradients back). For depth uncropping, we again provide depth measurements at the full resolution, and scale these to the DORN resolution to construct our measurement and mask vectors. Thus, all inputs and all evaluation metrics are based on the standard benchmark resolution.
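Mapping a user-marked window from the full 640×480 frame down to the 353×257 output grid is a rescaling of its coordinates; a minimal sketch (helper name and outward-rounding convention are ours), using integer arithmetic to avoid floating-point rounding surprises:

```python
def map_window_to_lowres(box, full_hw=(480, 640), low_hw=(257, 353)):
    """Scale a (top, left, bottom, right) window from the full-resolution
    frame to the low-resolution grid, rounding outward so the mapped
    window still covers the original region.

    -(-a // b) is integer ceiling division.
    """
    top, left, bottom, right = box
    fh, fw = full_hw
    lh, lw = low_hw
    return (top * lh // fh, left * lw // fw,
            -(-bottom * lh // fh), -(-right * lw // fw))
```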
Appendix C Inference Hyperparameters
For user guidance and depth uncropping, the value of the data-term weight is chosen based on a small validation set, separately for each application. Moreover, for user guidance, we find that slowly increasing this weight from a small initial value to its final value during optimization leads to convergence to better solutions. For depth completion from sparse measurements (both random and regularly spaced), we set the step size and the number of steps based on a validation set as well.
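The gradual increase of the data-term weight during optimization can be implemented as a simple geometric schedule. Since the paper's exact start and end values are not reproduced here, the values in the example below are placeholders only:

```python
import numpy as np

def annealed_weights(w_start, w_final, num_steps):
    """Geometrically increase a weight from w_start to w_final over
    num_steps optimization steps (num_steps >= 2)."""
    exponents = np.arange(num_steps) / (num_steps - 1)
    return w_start * (w_final / w_start) ** exponents
```

Step t of the optimization would then use `annealed_weights(...)[t]` as the weight on the data term.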
Table 7: Generator architecture.

| No. | Layer | Output Shape |
|---|---|---|
| 0 | features from feature extractor | 1×33×45×2560 |
| 1 | resize | 1×65×89×2560 |
| 2 | conv 1×1 | 1×65×89×1024 |
| 3 | conv 1×1 | 1×65×89×512 |
| 4 | conv 3×3, dilation=2 | 1×61×85×512 |
| 5 | conv 3×3, dilation=2 | 1×57×81×256 |
| 6 | reshape (dropout as noise) | (57·81)×1×1×256 |
| 7 | conv 1×1 (dropout as noise) | (57·81)×1×1×256 |
| 8 | conv 1×1 (dropout as noise) | (57·81)×1×1×256 |
| 9 | conv 1×1 (dropout as noise) | (57·81)×1×1×256 |
| 10 | conv_transpose 3×3 | (57·81)×3×3×256 |
| 11 | conv_transpose 3×3 | (57·81)×5×5×128 |
| 12 | conv_transpose 3×3 | (57·81)×7×7×64 |
| 13 | resize | (57·81)×13×13×64 |
| 14 | conv_transpose 3×3 | (57·81)×15×15×32 |
| 15 | conv_transpose 3×3 | (57·81)×17×17×16 |
| 16 | resize | (57·81)×33×33×16 |
| 17 | conv 1×1 + tanh | (57·81)×33×33×1 |
| 18 | reshape | |
Appendix D Running Time
Our method works by first generating multiple (100) samples for each overlapping patch, and then running inference, either by computing an expectation over these samples or by running an optimization for mode selection. While sample generation has a consistent running time for all applications, the time taken for optimization differs, even among mode-selection applications, depending on the number of iterations needed to converge. We report these running times in Table 8, measured on an NVIDIA 1080Ti GPU.
Table 6: Discriminator architecture.

| No. | Layer | Output Shape |
|---|---|---|
| 0.a | features from feature extractor | 1×33×45×2560 |
| 1.a | resize | 1×65×89×2560 |
| 2.a | conv 3×3, dilation=2 | 1×61×85×1024 |
| 3.a | conv 3×3, dilation=2 | 1×57×81×256 |
| 4.a | reshape | (57·81)×1×1×256 |
| 0.b | true/fake depth patches | |
| 1.b | reshape | (57·81)×33×33×1 |
| 2.b | conv 3×3, stride=2 | (57·81)×16×16×8 |
| 3.b | conv 2×2, stride=2 | (57·81)×8×8×16 |
| 4.b | conv 2×2, stride=2 | (57·81)×4×4×32 |
| 5.b | conv 2×2, stride=2 | (57·81)×2×2×64 |
| 6.b | reshape | (57·81)×1×1×256 |
| 0 | concat: 4.a and 6.b | (57·81)×1×1×512 |
| 1 | conv 1×1 | (57·81)×1×1×1024 |
| 2 | conv 1×1 | (57·81)×1×1×512 |
| 3 | conv 1×1 | (57·81)×1×1×256 |
| 4 | conv 1×1 + sigmoid | (57·81)×1×1×1 |
Table 8: Running times.

| Task | Time |
|---|---|
| Sample Generation | 4.8s |
| Inference: Mean Depth Estimate | 0.01s |
| Inference: Depth from Random Sparse Measurements | 0.80s |
| Inference: Depth Upsampling | 0.50s |
| Inference: Depth Uncropping | 1.18s |
| Inference: User Selection | 3.32s |
| Inference: Selection with Annotation | 4.08s |
Appendix E Additional Application Details and Results
E.1 Predicting Pairwise Depth Ordering
For predicting the ordinal depth relationship of a pair of points in the image plane with depths $d_1$ and $d_2$, we adopt the definition of the ground-truth label from [55] as
$$\ell = \begin{cases} +1, & d_1/d_2 > 1 + \tau \\ -1, & d_2/d_1 > 1 + \tau \\ \;\;0, & \text{otherwise,} \end{cases} \qquad (12)$$
where the threshold $\tau$ is equal to 0.02, as in [55]. To compute predictions from our distribution mean, which is a per-pixel best guess, we simply look at the relationship between the predicted depths of the corresponding points. Note that, like [4], we select a different threshold (based on a validation set) for use in (12) at prediction time, so as to balance the WKDR metrics on equal and unequal pairs.
To make use of our probabilistic outputs for better ordinal prediction, we look at the depth ordering of a given query pair in all samples in all patches that contain this pair (here, using the true threshold $\tau = 0.02$), and output the label that is most frequent. In rare cases where no patch includes both points in a query pair, we simply use the prediction from the mean depth estimate as our output.
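The sample-based ordinal prediction can be sketched as follows (helper names are ours; the sign convention for the label, +1 when the first point is deeper, is an assumption):

```python
from collections import Counter

def ordinal_label(d1, d2, tau=0.02):
    """Ordinal relationship of two depths under a ratio threshold tau:
    +1 if the first point is deeper, -1 if the second is, 0 otherwise."""
    if d1 / d2 > 1.0 + tau:
        return 1
    if d2 / d1 > 1.0 + tau:
        return -1
    return 0

def vote_ordinal(sample_pairs, tau=0.02):
    """Majority vote over (d1, d2) depth pairs gathered from every
    sample of every patch containing the query pair."""
    votes = Counter(ordinal_label(d1, d2, tau) for d1, d2 in sample_pairs)
    return votes.most_common(1)[0][0]
```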
E.2 Incorporating User Guidance
We include more results for the user guidance tasks in Figure 6, demonstrating improvements over the mean depth prediction as users select among our generated modes. Note that when provided a limited annotation of a small erroneous region, our method generates estimates that not only correct that region, but also propagate improvements outside the input bounding box.
E.3 Depth Completion
Optimal Locations for Sparse Measurements.
For arbitrarily placed sparse depth measurements, our method is able to do better than random sampling by selecting an optimal set of locations to measure from the color image, given a budget on the total number of measurements. Specifically, we select local maxima of the monocular variance map from our output distribution, which represents points where our model is most uncertain about depth. Figure 7 demonstrates the selected points using this approach, and the resulting improvement in predicted depth over random sampling.
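Selecting measurement sites at local maxima of the per-pixel variance map can be sketched in NumPy, via a strict 3×3 local-maximum test followed by a greedy highest-variance pick (a simple stand-in, not the authors' exact procedure):

```python
import numpy as np

def pick_measurement_sites(variance, budget):
    """Pick up to `budget` (row, col) locations at strict 3x3 local
    maxima of the variance map, highest variance first."""
    # Pad with -inf so border pixels can still qualify as maxima.
    v = np.pad(variance, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones_like(variance, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = v[1 + dy:1 + dy + variance.shape[0],
                        1 + dx:1 + dx + variance.shape[1]]
            is_max &= variance > shifted  # strictly greater than neighbor
    ys, xs = np.nonzero(is_max)
    order = np.argsort(-variance[ys, xs])[:budget]
    return list(zip(ys[order].tolist(), xs[order].tolist()))
```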
Half-resolution Comparison to [35, 49].
Note that [35, 49] evaluate their methods by reporting errors on a centered crop of half-resolution depth maps, and also derive their input sparse measurements at this half resolution. In contrast, our results in Table 3 in the paper use the official benchmark metrics (in the valid crop at full resolution), for consistency with other evaluations in our paper and elsewhere. For a more direct comparison to [35, 49], we also evaluated our method by replicating their setting. Specifically, to provide input sparse measurements, we first downsample the ground-truth depth map and randomly sample depth values from this downsampled map. We then provide these as inputs to our method (which resizes them back to the full resolution to compute the global cost). Then, we take the full-resolution depth map estimates produced by our method, downsample them to half resolution, and compute error metrics on the same centered crop as [35, 49]. We report these results in Table 9, and find that they are similar to those from the standard evaluation in Table 3 of the paper.
Table 9: Half-resolution comparison (rms, mrms, rel: lower is better; δ thresholds, in %: higher is better).

| # meas. | Method | rms | mrms | rel | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|
| 20 | Ma [35] | | 0.351 | 0.078 | 92.8 | 98.4 | 99.6 |
| | Ours | 0.399 | 0.337 | 0.081 | 92.1 | 98.4 | 99.6 |
| 50 | Ma [35] | | 0.281 | 0.059 | 95.5 | 99.0 | 99.7 |
| | Ours | 0.338 | 0.285 | 0.062 | 94.4 | 98.9 | 99.8 |
| 100 | Wang [49] | 0.372 | | 0.089 | 91.5 | 98.3 | 99.6 |
| | Ours | 0.294 | 0.248 | 0.051 | 95.8 | 99.2 | 99.8 |
| 200 | Ma [35] | | 0.230 | 0.044 | 97.1 | 99.4 | 99.8 |
| | Ours | 0.252 | 0.213 | 0.041 | 97.0 | 99.5 | 99.9 |
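The half-resolution protocol described above amounts to straightforward shape bookkeeping, sketched below (helper names and the crop size are illustrative choices of ours, not values from [35, 49]):

```python
import numpy as np

def center_crop(img, crop_h, crop_w):
    """Take a centered crop of the given size from a 2D array."""
    h, w = img.shape
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return img[top:top + crop_h, left:left + crop_w]

def half_res_protocol(depth_gt, depth_est_full, num_meas, rng,
                      crop_hw=(228, 304)):
    """Sketch of the evaluation setting of [35, 49]: sparse inputs are
    sampled from a half-resolution ground truth, and errors are computed
    on a centered crop at half resolution."""
    half_gt = depth_gt[::2, ::2]        # half-res ground truth (nearest)
    idx = rng.choice(half_gt.size, size=num_meas, replace=False)
    measurements = half_gt.ravel()[idx]  # sparse inputs to the method
    half_est = depth_est_full[::2, ::2]  # half-res estimate for scoring
    return measurements, center_crop(half_gt, *crop_hw), \
        center_crop(half_est, *crop_hw)
```

Error metrics (rms, mrms, rel, δ thresholds) would then be computed between the two cropped maps.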

More Results.
We also include results for the various depth completion tasks for more example scenes in Fig. 8.