Generating and Exploiting Probabilistic Monocular Depth Estimates

Zhihao Xia, Patrick Sullivan, Ayan Chakrabarti
Washington University in St. Louis

Despite the remarkable success of modern monocular depth estimation methods, the accuracy achievable from a single image is limited, making it practically useful to incorporate other sources of depth information. Currently, depth estimation from different combinations of sources is treated as a set of different applications, each solved via a separate network trained to use the corresponding set of available sources as input. In this paper, we propose a common versatile model that outputs a probability distribution over scene depth given an input color image, as a sample approximation using outputs from a conditional GAN. This distributional output is useful even in the monocular setting, and can be used to estimate depth, pairwise ordering, etc. More importantly, these outputs can be combined with a variety of other depth cues—such as user guidance and partial measurements—for use in different application settings, without retraining. We demonstrate the efficacy of our approach through experiments on the NYUv2 dataset for a number of tasks, and find that our results from a common model, trained only once, are comparable to those from state-of-the-art methods with separate task-specific models.



1 Introduction

Recent neural network-based methods [7, 48, 3, 9, 23] have become surprisingly successful at predicting scene depth from only a single color image. This success confirms that even a single view of a scene contains considerable information about scene geometry. However, purely monocular depth map estimates are far from being precisely accurate, and this is likely to always be true given the ill-posed nature of the task. Fortunately, many practical systems are able to rely on other, yet also imperfect, sources of depth information—limited measurements from depth sensors, interactive user guidance, consistency across video frames or multiple views, etc. It is therefore desirable to be able to combine monocular cues for depth with information from these other sources, to yield estimates that are more accurate than possible from one source alone.

Figure 1: Overview of our Approach. Given an input color image, we use a conditional GAN to generate depth estimates in overlapping patches, with multiple plausible estimates for each patch. These estimates together are used to form a joint probability distribution over the depth map. This distributional output enables several inference tasks in the monocular setting, as well as in applications where additional depth information is available—all with a model that is trained only once.

However, depth maps predicted by monocular estimators can not be directly combined with other depth cues. Instead, researchers have considered depth estimation from different combinations of cues as different applications in their own right (\eg, depth up-sampling [5], estimation from sparse [35] and line [29] measurements, etc.), and solved each by learning separate estimators that take their corresponding set of cues, in addition to the color image, as input. This requires determining the types of inputs that will be available for each application setting, constructing a corresponding training set, choosing an appropriate network architecture, and then training the network—a process that is often onerous to duplicate for multiple settings.

In this paper, we propose a single network model for extracting and summarizing the depth information present in a single color image, in a manner that can be directly utilized in different applications and combined with different external depth cues, without retraining. Given a color image, our model outputs a probability distribution over scene depth conditioned on the image input. We use a conditional GAN [11, 37] to output multiple plausible depth estimates for individual patches in the image plane. We then use the set of estimates for each patch to form a sample approximation of the joint distribution over depth values in that patch, and combine distributions for overlapping patches to obtain a distribution for the entire depth map. Thus, rather than a “best guess” for depth at each pixel, our model outputs a rich characterization of the information and ambiguity about depth values and their spatial dependencies.

As illustrated in Fig. 1, and demonstrated through experiments on the NYUv2 dataset [45], our distributional output is versatile enough to enable a diverse variety of applications. It is useful even in the purely monocular setting—when only a single image is available—and can be used to produce accurate depth predictions, a measure of confidence in these predictions, as well as estimates of relative ordering of pairs of scene points. More importantly, it is also able to incorporate additional information to produce improved depth estimates in diverse application settings: producing multiple depth maps for user selection, incorporating user annotation of erroneous regions, incorporating a small number of depth measurements—along a single line, within a smaller field of view, at random as well as regular sparse locations—and selecting the optimal locations for these measurements. Crucially, all of these applications are enabled by the same network model that is trained only once, while achieving accuracy comparable to state-of-the-art methods that rely on separate task-specific models.

2 Related Work

Monocular Depth Estimation.

First attempted by Saxena \etal [41], early work on estimating scene depth from a single color image relied on hand-crafted features [42, 22, 43, 39], graphical models [42, 34, 54], and databases of exemplars [20, 17]. More recently, Eigen \etal [8] showed that, given a large enough database of image-depth pairs [45], convolutional neural networks could be trained to achieve significantly more reliable depth estimates. Since then, there have been steady gains in accuracy through the development of improved neural network-based methods [7, 53, 50, 40, 31, 3, 26, 13, 24, 9], as well as strategies for unsupervised and semi-supervised learning [10, 21, 4]. Beyond estimating absolute depth, some works have also looked at predicting ordinal depth relations between pairs of points in the scene from an input color image [55, 4].

Depth from Partial Measurement.

Since making dense depth measurements is slow and expensive, it is useful to be able to recover a high quality dense depth map from a small number of direct measurements, by exploiting monocular cues from a color image. A popular way of combining color information with partial measurements is by requiring color and depth edges to co-occur. This approach is often successful for “depth inpainting”, i.e., filling in gaps of missing measurements in a depth map (common in measurements from structured light sensors). A notable and commonly-used example is the colorization method of Levin \etal [25]. Other methods along this line include [14, 33, 32, 36, 6], while Zhang and Funkhouser [52] used a neural network to predict normals and occlusion boundaries to aid inpainting.

However, when working with a very small number of measurements, the task is significantly more challenging (see discussion in [5]) and requires relying more heavily on the monocular cue. In this regime, the solution has been to train a network that takes the color image and the provided sparse samples as input. Researchers have demonstrated the efficacy of this approach with measurements along a single horizontal line from a line sensor [29], random sparse measurements [47, 35, 16, 49, 44], and sub-sampled measurements on a regular grid [28, 12, 5]. Moreover, each of these methods trains separate networks for different settings, such as different networks for different sparsity levels in [35], and different resolution grids in [5].

Probabilistic Outputs.

Monocular depth estimators commonly output a single estimate of the depth value at each pixel, preventing their use in different estimation settings. Some existing methods do produce distributional outputs, but only as per-pixel variance maps [18, 13] or per-pixel probability distributions [30]. Note that depth values at different locations are not statistically independent, i.e., different values at different locations may be plausible independently, but not in combination. Thus, per-pixel distributions provide only a limited characterization that, while useful in some applications, cannot be used more generally, \eg, to infer relative depth, or spatially propagate information from sparse measurements.

Like us, Chakrabarti \etal [3] also consider joint distributions over local depth values, albeit to eventually produce a depth map. They use a factorization into independent distributions for different depth derivatives, and train a network to output these distributions. But, their outputs do not provide a way to solve other inference tasks (this was not their goal). Also, their factorization into pre-selected derivatives with a fixed parametric form is still a restrictive assumption that does not fully capture local depth dependencies.

In this work, we use a more general form for the conditional joint distribution of depth values in local regions. We train a conditional GAN [11, 37] to produce multiple estimates of depth in local patches from an image input. Conditional GANs have been used to produce outputs that are more “natural” than those from networks trained with regression loss alone [15]. In our case, we run our GAN model multiple times to generate multiple plausible estimates, treat these as samples from a distribution, and use these samples to approximate the distribution itself.

Figure 2: Conditional GAN Schematic. To reduce complexity of our generator and discriminator networks and ensure stable training, we use pre-trained feature extraction layers from a state-of-the-art monocular model [9], that was trained to make deterministic depth map predictions. These layers are kept fixed during training. For a given patch, a corresponding small centered window in the feature map is provided as the conditioning input to both the generator and discriminator networks.

3 Proposed Method

Given the RGB image $x$ of a scene, our goal is to reason about its corresponding depth map $y$, represented as a vector containing depth values for all $N$ pixels in the image. Rather than predict a single estimate for $y$, we seek to output a distribution $p(y \mid x)$, to more generally characterize the depth information and ambiguity present in the image. We form this distribution as a product of functions defined on individual overlapping patches as

\[ p(y \mid x) \propto \prod_i \psi_i(P_i\, y), \tag{1} \]

where $\psi_i$ is a potential function for the $i$-th patch, and $P_i$ a sparse matrix that crops out that patch from $y$ (for patches of size $p \times p$, each $P_i$ is a $p^2 \times N$ matrix).

We now describe our approach, which trains a conditional GAN to generate multiple depth estimates for each patch, uses these to construct the potential functions $\psi_i$, and then leverages the resulting distribution $p(y \mid x)$ for inference.

3.1 Diverse Patch Depth Estimates from a GAN

Conditional GANs [37] train a “generator” network to produce estimates so as to match conditional distributions of data in a training set. The generator takes the conditioning variables and a noise source as input, and is trained adversarially against a discriminator that also uses the same conditioning inputs. We employ a conditional GAN to generate multiple plausible estimates for the depth $P_i\, y$ of each patch, given the input image $x$. For large networks and high-dimensional inputs, GAN training typically suffers from issues of instability, as well as reduced output diversity from mode collapse [1]. Note that the latter is especially a concern in our setting: most applications that use conditional GANs (\eg, [15]) are concerned with generating only a single estimate at test time, and use the conditional GAN framework to ensure these estimates are plausible. In contrast, we invoke our generator multiple times on the same input image at test time, and need the multiple outputs for each patch to be diverse so as to faithfully characterize local depth ambiguity.

Accordingly, we use a pre-trained feature extractor to reduce the complexity of our generator and discriminator networks. Specifically, we take a pre-trained network from a state-of-the-art monocular depth estimation method (DORN [9]), remove the last two convolution layers, and treat the remaining network as our feature extractor. Our generator and discriminator networks then both operate on the corresponding feature map output, rather than on the image itself. To generate estimates for a given patch, a small spatial feature map window, with receptive field centered on the patch, is provided as input to the generator and discriminator. Moreover, as is common in recent methods for conditional generation [15], our generator uses dropout [46] rather than an explicit random vector input.

Figure 2 includes a schematic of our conditional GAN setup, with further architecture details provided in the supplementary. We carry out standard adversarial training on the generator-discriminator pair, keeping the pre-trained feature extraction network fixed. Since both our generator and discriminator have significantly lower complexity than would be required if operating directly on the input image, we find training to be stable and our learned generator successful in producing plausible yet diverse estimates. At test time, we run the feature extractor once, and then run the generator multiple times with different instantiations of dropout to generate a diverse set of estimates for each patch. This is efficient because the bulk of the computation happens in the feature extraction layers, and is not repeated.
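As a sketch of this test-time sampling (a toy generator with hypothetical layer sizes, not the paper's architecture), keeping dropout active at inference turns repeated forward passes on the same features into diverse patch estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 49-d feature window conditioning a 64-unit
# generator that emits a 64-value (8x8) depth patch.
W1 = rng.normal(0, 0.1, (64, 49))
W2 = rng.normal(0, 0.1, (64, 64))

def generator(feat, drop=0.5):
    """One stochastic forward pass: dropout stays ON at test time,
    acting as the generator's noise source."""
    h = np.maximum(W1 @ feat, 0.0)        # ReLU hidden layer
    h *= rng.random(h.shape) > drop       # a fresh random dropout mask
    return W2 @ h                         # one plausible patch estimate

feat = rng.normal(0, 1.0, 49)             # fixed conditioning features
samples = np.stack([generator(feat) for _ in range(100)])
# Different dropout masks give diverse estimates for the same input.
diversity = samples.std(0).mean()
```

The expensive feature extraction would run once; only the small generator head is repeated per sample.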

3.2 Sample Approximation for Patch Potentials

We use the generated outputs from our generator to form a sample approximation to the per-patch potential functions $\psi_i$, and thus the joint distribution over the depth map in (1). Given a set $S_i$ of different estimates of the depth of patch $i$, we define its potential function as

\[ \psi_i(z) = \sum_{z' \in S_i} \exp\left(-\frac{\|z - z'\|^2}{2\sigma^2}\right). \tag{2} \]

This can be interpreted as forming a kernel density estimate from the depth samples in $S_i$ using a Gaussian kernel, where the Gaussian bandwidth $\sigma$ is a scalar hyper-parameter. (While $\sigma$ can be estimated based on the variance between $S_i$ and true patch depths, its actual value is not used in any of the tasks we consider.)
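The sample-based potential above is just a Gaussian kernel density estimate over a patch's generated samples; a minimal sketch (illustrative sizes and bandwidth):

```python
import numpy as np

def patch_potential(z, samples, sigma=0.1):
    """KDE potential psi_i(z): a sum of Gaussian kernels centered on
    each generated depth sample of the patch."""
    d2 = ((samples - z) ** 2).sum(axis=1)      # squared distances to samples
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

rng = np.random.default_rng(1)
samples = rng.normal(2.0, 0.05, (100, 64))     # 100 samples of an 8x8 patch
near = patch_potential(samples[0], samples)    # query near the samples
far = patch_potential(samples[0] + 5.0, samples)  # implausible patch depth
```

Plausible patch depths (close to some sample) get high potential, implausible ones essentially zero.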

Unlike the independent per-pixel [18, 13, 30] or per-derivative [3] distributions, the samples from our generator lead to more general joint patch potentials $\psi_i$ that can express complex spatial dependencies between depth values in local regions. Moreover, our joint distribution $p(y \mid x)$, defined in terms of overlapping patches, models dependencies across the entire depth map. This enables information propagation across the entire scene, and reasoning about the global plausibility of scene depth estimates.

3.3 Inference with Distributional Outputs

Inference by Expectation.

A natural way to compute estimates of certain properties or functions of the depth map is simply as their expectation under our output distribution. When these properties depend on the depths of individual points or nearby sets of points, this can be done by considering all patches that contain these points, all generator samples for each patch, and averaging across this entire set. In Sec. 4, we will show examples of using this strategy to compute point and pair-wise properties of depth values in the monocular setting.

Inference by Mode Computation.

Several applications require computing a global depth map estimate, potentially based on additional information or constraints available during inference. Note that our patch potentials $\psi_i$ are multi-modal functions, defined as a mixture of Gaussian components centered on each sample from the conditional GAN. Based on this observation, we propose recovering global depth map estimates as modes of our distributional output $p(y \mid x)$, by selecting one mode or sample in $S_i$ for every patch, instead of averaging across them.

This is done through a joint optimization over the global depth map $y$ and per-patch depths $\{z_i\}$ as:

\[ y^*, \{z_i^*\} = \arg\min_{y,\, \{z_i \in S_i\}} \; \sum_i \|P_i\, y - z_i\|^2 \;+\; \sum_i C_i(z_i) \;+\; C_G(y), \tag{3} \]

where the per-patch depths $z_i$ are constrained to be among the corresponding discrete sets $S_i$ of generated samples. The first term in (3) simply corresponds to a scaled negative log-likelihood of our output distribution. The other two terms represent different ways of introducing additional information—either as costs $C_i$ on individual patches, or a cost $C_G$ on the global depth map. For different inference applications in Sec. 4, we will use appropriately defined costs in one of these two forms to incorporate external depth cues.

We use a simple iterative algorithm to carry out the optimization in (3). We begin with an initial estimate of $y$ as the mean per-pixel depth (i.e., averaged across all patches that contain each pixel, and all samples from each patch), and apply alternating updates to $\{z_i\}$ and $y$ until convergence as

\[ z_i \leftarrow \arg\min_{z \in S_i} \|P_i\, y - z\|^2 + C_i(z), \tag{4} \]
\[ y \leftarrow \arg\min_{y} \sum_i \|P_i\, y - z_i\|^2 + C_G(y). \tag{5} \]

The updates to the patch estimates $z_i$ can be done independently, and in parallel, for different patches. The cost in (4) is the sum of the squared distance from the corresponding crop of the current global estimate, and the external cost $C_i$ when available. We can compute these costs for all samples in $S_i$, and select the one with the lowest cost. Note that since $C_i(z)$ does not depend on $y$, it need only be computed once at the start of optimization.

The update to the global map $y$ in (5) depends on the form of the external global cost $C_G$. If no such cost is present, $y$ is given simply by the overlap-average of the currently selected samples for each patch. For the applications in Sec. 4 that do involve $C_G$, we find it sufficient to solve (5) by first initializing $y$ to the overlap-average, and then carrying out a small number of gradient descent steps as

\[ y \leftarrow y - \eta\, \nabla_y C_G(y), \tag{6} \]

where the scalar step-size $\eta$ is a hyper-parameter.
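The alternating scheme can be sketched on a toy problem (a 3-pixel "depth map" with two overlapping 2-pixel patches, hypothetical sample sets, and no external costs):

```python
import numpy as np

# Two overlapping patches of a 3-pixel depth map: pixels (0,1) and (1,2).
patches = [np.array([0, 1]), np.array([1, 2])]
# Hypothetical generator samples per patch (each row is one sample).
S = [np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 5.0]]),
     np.array([[2.0, 3.0], [2.1, 3.1], [0.0, 0.0]])]

def overlap_average(selected):
    """Average the selected patch depths wherever patches overlap."""
    num, den = np.zeros(3), np.zeros(3)
    for idx, z in zip(patches, selected):
        num[idx] += z
        den[idx] += 1
    return num / den

# Initialize the global map from the mean of all samples, then alternate.
y = overlap_average([s.mean(0) for s in S])
for _ in range(10):
    # Patch update: pick, per patch, the sample closest to the crop of y.
    selected = [s[((s - y[idx]) ** 2).sum(1).argmin()]
                for idx, s in zip(patches, S)]
    # Global update: with no global cost, y is the overlap-average.
    y = overlap_average(selected)
```

The outlier samples ([5,5] and [0,0]) are rejected, and the two patches agree on the shared pixel.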

4 Applications and Results

In this section, we describe results for using our probabilistic outputs and inference strategies for various applications—for different inference tasks in the monocular setting, and by combination with different costs and constraints based on additional information when available. We report performance for all applications on the NYUv2 dataset [45]. Crucially, all results from our method reported in all tables and figures in this section are from the same network model, that is trained only once.


We use raw frames from scenes in the official train split of NYUv2 [45] to construct our training and validation sets, and report performance using standard error metrics (see [7]) on the “valid” crop, including filled-in values, of the full-resolution official test images. As mentioned, we use feature extraction layers from a pre-trained DORN model [9]. The DORN architecture works on rescaled input images and outputs depth maps at a lower resolution, so we operate our conditional GAN at the same resolution. However, our outputs are rescaled back to the original full resolution to compute error metrics, and in applications with input depth measurements, these are also provided at the original resolution and then rescaled (see supplementary for details). For our distribution, we use overlapping patches with stride four, and generate 100 samples per patch to construct $p(y \mid x)$. Generating samples takes 4.8s on a 1080Ti GPU for each image, while inference from these samples is faster (see supplementary for per-application run times). Our code and trained models are available at

4.1 Monocular Inference

Our distributional output is useful even when a single color image is the only input, and we now discuss applications for reasoning about scene geometry in this setting.

Predicting Depth and Confidence.

Our outputs can be used for the standard monocular estimation task, i.e., predicting a depth map of the scene given a color image. We can recover this estimate from our model as the mean of the distribution $p(y \mid x)$. This corresponds to simply averaging all the estimates for each pixel’s depth—from all the patches that include it, and from all generated estimates for each patch. This can be computed efficiently by first averaging all generated samples for each patch to obtain a mean per-patch depth, and then taking per-pixel means as the overlap-average of patches. Another possibility is to predict the depth map as the mode of $p(y \mid x)$, by solving the optimization in (3) without any additional costs $C_i$ or $C_G$.

Along with an estimate of each pixel’s depth value, we can also output a measure of confidence in these predictions. We do so by computing the variance of each pixel’s depth value across patches and samples from our distributional output, which relates to the per-pixel variance under $p(y \mid x)$ (differing by the constant $\sigma^2$). This variance map gives us a measure of our model’s relative confidence in its estimates at different pixels.
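Both the mean prediction and the variance-based confidence follow from one accumulation pass over patch samples; a minimal numpy sketch with a hypothetical patch layout:

```python
import numpy as np

def pixel_mean_and_var(sample_sets, coords, shape, p):
    """Per-pixel mean depth and variance across all samples of all
    patches covering each pixel (high variance = low confidence)."""
    s, s2, n = np.zeros(shape), np.zeros(shape), np.zeros(shape)
    for (r, c), samples in zip(coords, sample_sets):   # samples: (T, p, p)
        s[r:r+p, c:c+p] += samples.sum(0)
        s2[r:r+p, c:c+p] += (samples ** 2).sum(0)
        n[r:r+p, c:c+p] += samples.shape[0]
    mean = s / n
    var = s2 / n - mean ** 2
    return mean, var

rng = np.random.default_rng(2)
p, shape = 4, (6, 6)
coords = [(0, 0), (0, 2), (2, 0), (2, 2)]   # patch top-left corners
sample_sets = [2.0 + 0.1 * rng.standard_normal((100, p, p)) for _ in coords]
mean, var = pixel_mean_and_var(sample_sets, coords, shape, p)
```

Here every pixel is covered by at least one patch, so the normalizer `n` is never zero.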

Figure 3: Accuracy and Confidence. (Left) For two example images, we show predicted and ground truth depth maps, along with confidence and error maps. (Color maps are normalized for each scene separately). Our model produces accurate depth estimates overall, and its confidence predicts points where these estimates may be incorrect, such as reflective and distant surfaces. (Right) We show the improvement in error, computed over the entire test set, after discarding different fractions of pixels where our model is least confident.
Method | rms ↓ | log10 ↓ | rel ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑
Eigen [7] | 0.641 | - | 0.158 | 76.9 | 95.0 | 98.8
Chakrabarti [3] | 0.620 | - | 0.149 | 80.6 | 95.8 | 98.7
Li [27] | 0.635 | 0.063 | 0.143 | 78.8 | 95.8 | 99.1
Xu [51] | 0.586 | 0.052 | 0.121 | 81.1 | 95.4 | 98.7
Laina [23] | 0.584 | 0.059 | 0.136 | 82.2 | 95.6 | 98.9
Qi [38] | 0.569 | 0.057 | 0.128 | 83.4 | 96.0 | 99.0
DORN [9] | 0.545 | 0.050 | 0.114 | 85.8 | 96.2 | 98.7
Ours (mean) | 0.536 | 0.053 | 0.125 | 85.2 | 96.2 | 98.8
Ours (mode) | 0.536 | 0.053 | 0.125 | 85.1 | 96.6 | 99.0
Ours (oracle) | 0.253 | 0.017 | 0.041 | 96.7 | 99.2 | 99.8
Table 1: Accuracy of depth maps estimated using the proposed approach in the monocular setting, compared to other monocular depth estimation methods on NYUv2 [45].

In Table 1, we compare the accuracy of our mean and mode depth estimates to those of other monocular depth estimation methods. (For [23, 9], we recompute these numbers on the official NYUv2 crop from their provided test set estimates; [9] also used a different definition of RMSE, as the mean of per-image RMSEs, in their paper, while we report results using the standard definition here.) We find that in the monocular setting, the mean and mode estimates are nearly identical. Moreover, these estimates also have nearly the same accuracy as those from DORN [9], whose feature extractor our model is based on. This shows that our rich distributional outputs come “for free”, without adversely affecting our ability to recover depth compared to standard monocular estimation.

Table 1 also includes the results of using our distributional output in combination with an oracle that selects the most accurate patch estimate from our generator’s samples, and computes the depth map from these samples by overlap-averaging. These estimates are significantly more accurate, demonstrating that our generated samples contain estimates close to the true depth. The oracle performance also represents an upper bound for tasks that incorporate additional information using per-patch costs in (3).

Figure 3 evaluates our confidence measure as a predictor of accuracy. We show depth predictions and error and confidence maps from our model for two example images from the NYUv2 test set, and find that regions with relatively higher error also tend to be those where our model has high variance, and thus low confidence—often corresponding to reflective surfaces and isolated far away parts of the scene. We also show a more systematic evaluation of accuracy vs confidence (Fig. 3, right), with errors averaged across the entire test set, over different subsets of only the most confident pixels. The error drops rapidly as we discard a small fraction of pixels with the highest variance.

Predicting Pairwise Depth Ordering.

Another monocular task, introduced in [55], is to predict the ordinal relative depth of pairs of nearby points in the scene: whether the points are at similar depths (within some threshold), and if not, which point is nearer. Instead of predicting this ordering from an estimated depth map (as done in [4, 55]), we use our distributional output and look at the relative depth in all samples in all patches that contain a pair of queried points, outputting the ordinal relation that is most frequent.

Table 2 compares the performance of our method with that of [4] and [55], who use correctness of pairwise ordering as an objective during training. Results are reported in terms of the WKDR error metrics, on a standard set of point pairs on the NYUv2 test set (see [55]). We also show results predicting ordering from our mean depth map prediction (see supplementary for more details). We find that using our distributional output leads to better predictions than using simply the mean estimate, and that these are comparable to those from the task-optimized model of [4].

Method | WKDR | WKDR= | WKDR≠
Zoran [55] | 43.5% | 44.2% | 41.4%
Chen [4] | 28.3% | 30.6% | 28.6%
Ours (mean) | 33.2% | 29.3% | 35.7%
Ours (distribution) | 28.9% | 26.1% | 30.7%
Table 2: Error rates for predicted pairwise ordinal depth ordering using our common model, compared to other methods that used accurate ordering as an objective during training.
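The vote over samples described above can be sketched as follows (the tolerance τ on the log-depth ratio is a hypothetical stand-in for the thresholds used by the WKDR metrics):

```python
import numpy as np

def ordinal_relation(depth_a, depth_b, tau=0.02):
    """Majority vote over paired depth samples for points a and b:
    '=' (similar depth), '<' (a nearer), or '>' (a farther)."""
    log_ratio = np.log(depth_a / depth_b)
    votes = {
        '=': int(np.sum(np.abs(log_ratio) <= tau)),
        '<': int(np.sum(log_ratio < -tau)),
        '>': int(np.sum(log_ratio > tau)),
    }
    return max(votes, key=votes.get)   # most frequent relation

rng = np.random.default_rng(3)
a = 2.0 + 0.05 * rng.standard_normal(500)   # samples of point a's depth
b = 2.5 + 0.05 * rng.standard_normal(500)   # point b is mostly farther
rel = ordinal_relation(a, b)
```

In the full method the paired samples come from every patch containing both queried points.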

4.2 Incorporating User Guidance

Depth estimates are often useful in interactive image editing and graphics applications. We now describe ways of using our distributional output to include feedback from a user in the loop for improved depth accuracy.

Figure 4: Results with User Guidance. (Left) We show examples of multiple generated global depth map estimates from our output distributions that are presented to the user for selection, with the user optionally also marking erroneous regions in each estimate (bottom). (Right) We show average errors over the NYUv2 test set for the best estimate selected among depth map predictions for each image, and find this error decreases quickly (especially with region annotations) as we go from a single estimate to even small values of .

Diverse Estimates for User Selection.

We use Batra \etal’s approach [2] to derive multiple diverse “global” estimates of the depth map from our distribution $p(y \mid x)$, and propose presenting these as alternatives to the user. We set the first estimate $y^1$ to our mean estimate, and generate every subsequent estimate $y^k$ by finding a mode using (3) with per-patch costs defined as

\[ C_i^k(z) = -\beta \sum_{t=1}^{k-1} \|z - P_i\, y^t\|^2. \tag{7} \]

This introduces a preference for samples that are different from the corresponding patches in previous estimates, weighted by a scalar hyper-parameter $\beta$ (set on a validation set).

Figure 4 illustrates the performance of this approach, on an example image and quantitatively over the entire test set. As a proxy for user guidance, we automatically select among the $K$ estimates for each scene based on minimum error with respect to the ground truth. We find that accuracy improves quickly even when selecting among a small number of modes $K$, suggesting that this method can deliver performance gains with fairly minimal user input.
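The effect of this diversity preference can be sketched per patch (hypothetical weight β; selecting among a patch's samples purely by this cost picks the sample most different from earlier estimates):

```python
import numpy as np

def diversity_cost(z, prev_crops, beta=0.1):
    """Negative squared distance to the crops of previous estimates:
    lower cost for samples that differ from earlier depth maps."""
    return -beta * sum(float(((z - c) ** 2).sum()) for c in prev_crops)

samples = np.array([[1.0, 1.0], [1.0, 1.1], [3.0, 3.0]])  # toy patch samples
prev = [np.array([1.0, 1.0])]            # crop of the first (mean) estimate
costs = [diversity_cost(z, prev) for z in samples]
pick = int(np.argmin(costs))             # the most different sample wins
```

In the full objective this cost is traded off against agreement with the current global map, so diversity never overrides plausibility entirely.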

Using Annotations of Erroneous Regions.

As a simple extension, we consider also obtaining annotations of high-error regions from the user, in each estimate $y^t$. Note that we only get the locations of these regions, not their correct depth values. Given this annotation, we define a mask $m^t$ that is one within the marked region and zero elsewhere, and now recover each subsequent estimate $y^k$ with a modified cost $C_i^k$:

\[ C_i^k(z) = -\beta \sum_{t=1}^{k-1} \|(P_i\, m^t) \circ (z - P_i\, y^t)\|^2, \tag{8} \]

where $\circ$ denotes element-wise multiplication, and the masks focus the cost on regions marked as erroneous.

Figure 4 also includes results for this form of user guidance, where user annotation of regions is simulated by choosing windows with the highest error against the ground truth, such that they have no more than 50% overlap with previously marked regions for the same image. We find that the error of the selected estimate now drops much faster with an increasing number of estimates $K$.

4.3 Depth Completion

We now consider applications where a small number of depth values are available, \eg, from a sensor that makes limited measurements for efficiency. As illustrated in Fig. 5, our model can use these measurements along with monocular cues to produce accurate estimates of a full depth map.

Dense Depth from Sparse Measurements.

Given an input sparse set of depth measurements $m$ at isolated points in the scene, we estimate the depth map by using these measurements to define a global cost in (3) as

\[ C_G(y) = \lambda\, \|\mathcal{S}(y) - m\|^2, \tag{9} \]

where $\mathcal{S}(\cdot)$ represents sampling at the measured locations. Based on this, we define the gradients to be applied in (6) for computing the global depth updates as

\[ \nabla_y\, C_G(y) = 2\lambda\, \mathcal{S}^T\!\big(\mathcal{S}(y) - m\big), \tag{10} \]

where $\mathcal{S}^T$ represents the transpose of the sampling operation. Since both the weight $\lambda$ and the step-size $\eta$ in (6) are hyper-parameters, we simply fix $\lambda$, and set the step-size (as well as the number of gradient steps) based on a validation set.
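The global update with sparse measurements can be sketched as plain gradient steps on the measurement cost (hypothetical values; the full method interleaves these steps with per-patch sample re-selection):

```python
import numpy as np

def refine_with_sparse(y, meas_idx, meas_val, lam=1.0, eta=0.2, steps=50):
    """Gradient descent on lam * ||S(y) - m||^2: sampling S is fancy
    indexing, and its transpose scatters the residual back."""
    y = y.copy()
    for _ in range(steps):
        grad = np.zeros_like(y)
        grad[meas_idx] = 2 * lam * (y[meas_idx] - meas_val)
        y -= eta * grad
    return y

y0 = np.full(10, 2.0)            # e.g., an overlap-average initialization
idx = np.array([3, 7])           # measured pixel locations
vals = np.array([1.0, 3.0])      # measured depth values
y = refine_with_sparse(y0, idx, vals)
```

With this (diagonal) cost, the steps pull measured pixels to their measurements and leave the rest to the patch-agreement term.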

# meas. | Method | rms ↓ | m-rms ↓ | rel ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑

Levin [25] 0.703 0.602 0.175 75.5 93.0 97.9
Ma [35] - 0.351 0.078 92.8 98.4 99.6
Ours 0.391 0.329 0.078 92.5 98.5 99.7
Opt. Ours 0.363 0.307 0.078 92.4 98.5 99.7

Levin [25] 0.507 0.436 0.117 86.4 97.1 99.3
Ma [35] - 0.281 0.059 95.5 99.0 99.7
Ours 0.344 0.288 0.064 94.2 98.8 99.7
Opt. Ours 0.313 0.264 0.062 94.6 99.0 99.8

Levin [25] 0.396 0.340 0.085 92.2 98.5 99.6
Wang [49] 0.372 - 0.089 91.5 98.3 99.6
Ours 0.302 0.254 0.053 95.5 99.2 99.8
Opt. Ours 0.271 0.229 0.052 95.8 99.3 99.8

Levin [25] 0.305 0.264 0.061 95.7 99.2 99.8
Ma [35] - 0.230 0.044 97.1 99.4 99.8
Ours 0.262 0.220 0.043 96.7 99.4 99.9
Opt. Ours 0.239 0.203 0.048 96.3 99.4 99.9
Table 3: Performance on dense depth estimation from arbitrary sparse measurements. Results for [35, 49] are with task-specific networks, separately trained for different numbers of measurements. We also show performance when choosing optimal measurement locations (opt) using our model. (Note that “m-rms” corresponds to mean over per-image RMSE values.)
Figure 5: Depth Completion from Limited Measurements. We show results for estimating a full depth map from different forms of partial measurements, on one example image. Our model is able to exploit even a small number of measurements to significantly improve accuracy over the purely monocular case—\eg, for the chair and far top-left corner of this scene.

We apply this technique for two kinds of sparse inputs. We first consider measurements at arbitrary randomly selected points, as in [47, 35, 16, 49, 44]. In this case, the transpose sampling operation is computed as a nearest-neighbor fill—by copying values for every point in the full image plane from their nearest sampled location. Table 3 reports the accuracy of the completed depth maps using our method for different numbers of randomly placed measurements, and compares them to those obtained using [25], as well as the learning-based methods of Ma and Karaman [35] and Wang \etal [49]. Our estimates are significantly more accurate than those from [25], and comparable to [35, 49] (both [35] and [49] evaluate their methods on a centered crop at half-resolution, while we report our performance at the official full-resolution valid crop for NYUv2 to be consistent with the benchmark; our performance at half-resolution is similar, and reported in the supplementary), even though the latter not only use networks trained for this specific completion task, but train different networks for different numbers of measurements.
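The nearest-neighbor fill used for the transpose operation can be sketched in a few lines (a brute-force version for clarity; a distance transform would be used at full image sizes):

```python
import numpy as np

def nn_fill(sparse_depth, mask):
    """Copy each pixel's value from its nearest measured location
    (the transpose of sampling, realized as a fill)."""
    H, W = mask.shape
    ys, xs = np.nonzero(mask)                  # measured pixel coordinates
    gy, gx = np.mgrid[0:H, 0:W]
    # Squared distance from every pixel to every measurement: (H, W, M).
    d2 = (gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2
    nearest = d2.argmin(-1)                    # index of nearest measurement
    return sparse_depth[ys[nearest], xs[nearest]]

sparse = np.zeros((4, 4))
sparse[0, 0], sparse[3, 3] = 1.0, 5.0          # two measured depths
mask = sparse > 0
filled = nn_fill(sparse, mask)
```

The result is a dense map that is piecewise constant around each measurement, to which the residual scatter is applied.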

Instead of placing points randomly, we also consider choosing an optimal set of locations to measure based on the color image, given a budget on the total number of measurements. We select these points as local maxima of the variance map described in Sec. 4.1. We also include results for depth maps reconstructed from these optimally placed measurements in Table 3, and find them to be more accurate.

We next consider the setting of depth up-sampling, where the sparse input measurements lie on a regular lower-resolution grid. Because of the regular spacing between measured samples, we are able to use bi-linear interpolation for the transpose operation in (10). We evaluate our method for two sub-sampling levels (48× and 96×) in Table 4, and compare it to [25] and the method of Chen et al. [5]. Again, we perform better than [25], and competitively with the task-specific networks of [5], which are separately trained for different sampling levels—especially for 96× sub-sampling.
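For the regular-grid case, the transpose operation is just standard bi-linear interpolation of the low-resolution measurement grid. A minimal NumPy sketch (edges are simply clamped here, an assumption for illustration):

```python
import numpy as np

def bilinear_upsample(grid, factor):
    """Bi-linearly interpolate a regular low-resolution grid of depth
    measurements up to full resolution."""
    h, w = grid.shape
    H, W = h * factor, w * factor
    # Continuous source coordinates for every output pixel (clamped at edges).
    ys = np.clip(np.arange(H) / factor, 0, h - 1)
    xs = np.clip(np.arange(W) / factor, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = grid[np.ix_(y0, x0)] * (1 - wx) + grid[np.ix_(y0, x1)] * wx
    bot = grid[np.ix_(y1, x0)] * (1 - wx) + grid[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

# Example: up-sample a 2x2 grid by a factor of 2.
out = bilinear_upsample(np.array([[0.0, 2.0], [4.0, 6.0]]), 2)
```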

Sub-samp.   Method          lower is better           higher is better
                          rms     log10   rel        δ1     δ2     δ3
48×         Levin [25]    0.319   0.027   0.065      95.4   99.1   99.8
            Chen [5]      0.193    -      0.032      98.3   99.7   99.9
            Ours          0.251   0.017   0.040      97.1   99.5   99.9

96×         Levin [25]    0.512   0.050   0.120      85.9   97.1   99.4
            Chen [5]      0.318    -      0.072      94.2   98.9   99.8
            Ours          0.335   0.026   0.061      94.7   99.1   99.8

Table 4: Performance for depth up-sampling. Results for [5] are with separate networks trained at each sub-sampling level.

Depth Un-cropping.

We also consider the case when the available measurements are dense in a contiguous, but small, portion of the image plane—such as from a sensor with a smaller field-of-view (FOV), or along a single line [29]. In this case, we define a measurement vector z and a mask m as sparse vectors, with one entry per pixel, that are zero in locations without measurements. At measured locations, z contains the measured values, while the mask m is set to one. We use these to define a per-patch cost for use with (3.3) as

  λ Σ_i m_i (x_i − z_i)²,

where x is the candidate depth patch and the weight λ is determined on a validation set.
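One plausible instantiation of such a masked measurement cost is a weighted squared error over measured locations (a minimal sketch; the function and argument names are illustrative):

```python
import numpy as np

def uncrop_cost(depth, meas, mask, weight=1.0):
    """Per-patch measurement cost for depth un-cropping: a weighted squared
    error between a candidate depth patch and the sparse measurement vector,
    counted only where the mask is one."""
    return weight * np.sum(mask * (depth - meas) ** 2)

# Example: only the first and last entries are measured.
cost = uncrop_cost(np.array([1.0, 2.0, 3.0]),
                   np.array([1.0, 0.0, 5.0]),
                   np.array([1.0, 0.0, 1.0]))
```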

We report results for this approach in Table 5, with measurements given either as small centered windows in the image (corresponding to a small-FOV camera), or along a vertically centered horizontal line. We compare our approach with [25], and for the case of line measurements, with the learning-based method of Liao et al. [29] (note that [29] use measurements along a line simulated to be horizontal in world co-ordinates, leading to different vertical positions at each horizontal co-ordinate; due to lack of exact details for replicating this setting, we simply use a line that is horizontal in the image plane). Our approach again outperforms [25], and in comparison to [29], has slightly higher RMSE but is better on all other metrics.

Size         Method          lower is better           higher is better
                           rms     log10   rel        δ1     δ2     δ3
60×80        Levin [25]    1.357   0.141   0.424      50.5   73.6   85.7
             Ours          0.500   0.049   0.115      86.9   96.9   99.1

             Levin [25]    1.104   0.118   0.348      57.5   79.2   90.0
             Ours          0.469   0.045   0.107      88.2   97.1   99.1

             Levin [25]    0.664   0.072   0.196      74.2   91.8   96.7
             Ours          0.391   0.036   0.086      91.0   97.7   99.3

             Levin [25]    0.378   0.040   0.102      90.2   97.4   99.2
             Ours          0.314   0.027   0.066      93.5   98.3   99.6

Single line  Levin [25]    1.003   0.101   0.281      63.8   83.2   92.3
             Liao [29]     0.442   0.043   0.104      87.8   96.4   98.9
             Ours          0.457   0.041   0.098      89.7   97.5   99.3

Table 5: Performance of the proposed approach for depth un-cropping, from measurements in small centered windows of increasing size, and along a single horizontal line. For the centered windows, we compute error metrics only over un-observed pixels.

5 Conclusion

Using distributional estimates of depth from a single image, our approach enables a variety of applications without the need for repeated training. While in this paper we focused on applications where the final output was depth or some direct function of scene geometry, we are interested in exploring how our distributional outputs can be used to manage ambiguity in downstream processing, such as re-rendering or path planning, in future work.

Acknowledgments. This work was supported by the NSF under award no. IIS-1820693.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [2] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse m-best solutions in markov random fields. In Proc. ECCV, 2012.
  • [3] A. Chakrabarti, J. Shao, and G. Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In NeurIPS, 2016.
  • [4] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NeurIPS, 2016.
  • [5] Z. Chen, V. Badrinarayanan, G. Drozdov, and A. Rabinovich. Estimating depth from rgb and sparse sensing. In Proc. ECCV, 2018.
  • [6] D. Doria and R. J. Radke. Filling large holes in lidar data by inpainting depth gradients. In Proc. CVPR Workshops, 2012.
  • [7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. ICCV, 2015.
  • [8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  • [9] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proc. CVPR, 2018.
  • [10] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Proc. ECCV, 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [12] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang. Learning dynamic guidance for depth image enhancement. In Proc. CVPR, 2017.
  • [13] M. Heo, J. Lee, K.-R. Kim, H.-U. Kim, and C.-S. Kim. Monocular depth estimation using whole strip masking and reliability-based refinement. In Proc. ECCV, 2018.
  • [14] D. Herrera, J. Kannala, J. Heikkilä, et al. Depth map inpainting under a second-order smoothness prior. In Scandinavian Conference on Image Analysis, 2013.
  • [15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017.
  • [16] M. Jaritz, R. De Charette, E. Wirbel, X. Perrotton, and F. Nashashibi. Sparse and dense data with cnns: Depth completion and semantic segmentation. In Proc. Intl. Conference on 3D Vision (3DV), 2018.
  • [17] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. PAMI, 2014.
  • [18] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] J. Konrad, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee. Learning-based, automatic 2d-to-3d image and video conversion. IEEE Trans. on Image Processing, 2013.
  • [21] Y. Kuznietsov, J. Stuckler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. CVPR, 2017.
  • [22] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Proc. CVPR, 2014.
  • [23] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Intl. Conference on 3D Vision (3DV), 2016.
  • [24] J.-H. Lee, M. Heo, K.-R. Kim, and C.-S. Kim. Single-image depth estimation based on fourier domain analysis. In Proc. CVPR, 2018.
  • [25] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. In ACM Transactions on Graphics (TOG), 2004.
  • [26] J. Li, R. Klein, and A. Yao. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proc. ICCV, 2017.
  • [27] J. Li, R. Klein, and A. Yao. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proc. ICCV, 2017.
  • [28] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. ECCV, 2016.
  • [29] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu. Parse geometry from a line: Monocular depth estimation with partial laser observation. In Proc. ICRA, 2017.
  • [30] C. Liu, J. Gu, K. Kim, S. Narasimhan, and J. Kautz. Neural rgb-d sensing: Depth and uncertainty from a video camera. arXiv preprint arXiv:1901.02571, 2019.
  • [31] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2016.
  • [32] J. Liu and X. Gong. Guided depth enhancement via anisotropic diffusion. In Pacific-Rim Conference on Multimedia, 2013.
  • [33] J. Liu, X. Gong, and J. Liu. Guided inpainting and filtering for kinect depth maps. In Proc ICPR, 2012.
  • [34] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In Proc. CVPR, 2014.
  • [35] F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proc. ICRA, 2018.
  • [36] K. Matsuo and Y. Aoki. Depth image enhancement using local tangent plane approximations. In Proc. CVPR, 2015.
  • [37] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [38] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proc. CVPR, 2018.
  • [39] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In Proc. CVPR, 2016.
  • [40] A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In Proc. CVPR, 2016.
  • [41] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NeurIPS, 2006.
  • [42] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. PAMI, 2009.
  • [43] J. Shi, X. Tao, L. Xu, and J. Jia. Break ames room illusion: depth from general single images. ACM Transactions on Graphics (TOG), 2015.
  • [44] S. S. Shivakumar, T. Nguyen, S. W. Chen, and C. J. Taylor. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion. arXiv preprint arXiv:1902.00761, 2019.
  • [45] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In Proc. ECCV, 2012.
  • [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
  • [47] W. Van Gansbeke, D. Neven, B. De Brabandere, and L. Van Gool. Sparse and noisy lidar completion with rgb guidance and uncertainty. arXiv preprint arXiv:1902.05356, 2019.
  • [48] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proc. CVPR, 2015.
  • [49] T.-H. Wang, F.-E. Wang, J.-T. Lin, Y.-H. Tsai, W.-C. Chiu, and M. Sun. Plug-and-play: Improve depth prediction via sparse data propagation. In Proc. ICRA, 2019.
  • [50] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proc. CVPR, 2015.
  • [51] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In Proc. CVPR, 2017.
  • [52] Y. Zhang and T. Funkhouser. Deep depth completion of a single rgb-d image. In Proc. CVPR, 2018.
  • [53] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with cnns. In Proc. ICCV, 2015.
  • [54] W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In Proc. CVPR, 2015.
  • [55] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. CVPR, 2015.


Appendix A Architecture and Training

Our conditional GAN consists of a pre-trained feature extractor, a generator, and a discriminator. As mentioned in the paper, we take the pre-trained DORN model [9], remove its last two convolutional layers, and use it as our feature extractor. This feature extractor takes an RGB image, resized to 257×353 from the original 640×480 in NYUv2, and outputs a 2560-dimensional feature map at a lower 33×45 resolution. Our conditional GAN takes this feature map as input, and reasons about an output depth map at the same 257×353 resolution. We consider overlapping patches at stride 4, giving us a total of 57×81 patches, each of size 33×33. In other words, for each forward pass of our generator, we want to produce an output of size 57×81×33×33, and then run this multiple (100) times to get multiple samples for each patch.

We describe our architectures for the generator and discriminator in Tables 6 and 7, respectively. Notice that we are able to generate outputs efficiently in a fully convolutional way—using reshape operations and transpose-convolution layers to generate the depth samples for each patch. In the generator, the output for each patch depends on a small receptive field in the input feature map, and we use dropout as the noise source. While the overlapping patches do have overlapping receptive fields in the feature map, we ensure that they have independent instantiations of dropout noise values. The discriminator has a two-stream architecture: one stream processes the feature map, and the other processes the depth patch (either from the generator or from the ground truth). Outputs from both streams are concatenated and passed through further 1×1 convolutional layers to predict a true/fake conditional label for each input depth patch.

For training, we use Adam [19], setting β1 and β2 to 0.5 and 0.9, respectively. As is typically done to stabilize GAN training, we update the discriminator at every iteration, while only updating the generator once every five iterations. We use a batch size of 4 and train for 240k iterations.

Appendix B Output Depth Resolution

As mentioned above, our distributional output corresponds to the lower DORN [9] resolution of 257×353 for the depth map. However, all error metrics in the paper are computed (inside the valid crop) at full resolution. To do so, we resize our method's outputs to 640×480 by bilinear interpolation. Moreover, in all applications with additional inputs, these are also provided at the original higher resolution. For user annotations, erroneous regions are marked as windows at the full resolution, and we map the locations of these windows to the lower resolution to construct our masks. Similarly, for depth completion, the inputs correspond to sparse measurements of depth at the full resolution, and our global cost is defined in terms of a full-resolution depth map (we scale our depth map to the full resolution, and scale the gradients back). For depth un-cropping, we again provide depth measurements at the full resolution, and scale these to the DORN resolution to construct our measurement and mask vectors. Thus, all inputs and all evaluation metrics are based on the standard benchmark resolution.
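The mapping of a full-resolution annotation window to low-resolution mask coordinates can be sketched as simple corner scaling (a sketch; the corner convention and outward rounding are assumptions for illustration):

```python
import math

def window_to_mask(y0, x0, y1, x1, full=(480, 640), low=(257, 353)):
    """Map a full-resolution window (top-left y0,x0 to bottom-right y1,x1,
    exclusive) to the lower DORN output resolution, rounding outward so the
    low-resolution mask covers the entire annotated window."""
    sy, sx = low[0] / full[0], low[1] / full[1]
    return (math.floor(y0 * sy), math.floor(x0 * sx),
            math.ceil(y1 * sy), math.ceil(x1 * sx))

# Example: the full image maps to the full low-resolution mask.
box = window_to_mask(0, 0, 480, 640)
```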

Appendix C Inference Hyperparameters

For user-guidance and depth un-cropping, the value of the cost weight is chosen based on a small validation set, separately for user-guidance and for un-cropping. Moreover, for user-guidance, we find that slowly increasing this weight from a small initial value to its final value during optimization leads to convergence to better solutions. For depth completion from sparse measurements (both random and regularly spaced), we set the step-size and the number of steps based on a validation set as well.

No.  Layer                             Output Shape
0    features from feature extractor   1 × 33 × 45 × 2560
1    resize                            1 × 65 × 89 × 2560
2    conv 1×1                          1 × 65 × 89 × 1024
3    conv 1×1                          1 × 65 × 89 × 512
4    conv 3×3, dilation=2              1 × 61 × 85 × 512
5    conv 3×3, dilation=2              1 × 57 × 81 × 256
6    reshape (dropout as noise)        (57·81) × 1 × 1 × 256
7    conv 1×1 (dropout as noise)       (57·81) × 1 × 1 × 256
8    conv 1×1 (dropout as noise)       (57·81) × 1 × 1 × 256
9    conv 1×1 (dropout as noise)       (57·81) × 1 × 1 × 256
10   conv_transpose 3×3                (57·81) × 3 × 3 × 256
11   conv_transpose 3×3                (57·81) × 5 × 5 × 128
12   conv_transpose 3×3                (57·81) × 7 × 7 × 64
13   resize                            (57·81) × 13 × 13 × 64
14   conv_transpose 3×3                (57·81) × 15 × 15 × 32
15   conv_transpose 3×3                (57·81) × 17 × 17 × 16
16   resize                            (57·81) × 33 × 33 × 16
17   conv 1×1 + tanh                   (57·81) × 33 × 33 × 1
18   reshape                           57 × 81 × 33 × 33

Table 6: Generator architecture. All dropout layers use the same drop probability. Every convolutional or transpose-convolutional layer is followed by a ReLU, except the last, which uses tanh as its activation. Valid padding is used throughout the generator.

Appendix D Running Time

Our method works by first generating multiple (100) samples for each overlapping patch, and then running inference, either by computing an expectation over these samples or by running an optimization for mode-selection. While sample generation has a consistent running time for all applications, the time taken for optimization differs, even among mode-selection applications, based on the number of iterations taken to converge. We report these running times in Table 8, measured on an NVIDIA 1080Ti GPU.

No.   Layer                             Output Shape
0.a   features from feature extractor   1 × 33 × 45 × 2560
1.a   resize                            1 × 65 × 89 × 2560
2.a   conv 3×3, dilation=2              1 × 61 × 85 × 1024
3.a   conv 3×3, dilation=2              1 × 57 × 81 × 256
4.a   reshape                           (57·81) × 1 × 1 × 256
0.b   true/fake depth patches           -
1.b   reshape                           (57·81) × 33 × 33 × 1
2.b   conv 3×3, stride=2                (57·81) × 16 × 16 × 8
3.b   conv 2×2, stride=2                (57·81) × 8 × 8 × 16
4.b   conv 2×2, stride=2                (57·81) × 4 × 4 × 32
5.b   conv 2×2, stride=2                (57·81) × 2 × 2 × 64
6.b   reshape                           (57·81) × 1 × 1 × 256
0     concat: 4.a and 6.b               (57·81) × 1 × 1 × 512
1     conv 1×1                          (57·81) × 1 × 1 × 1024
2     conv 1×1                          (57·81) × 1 × 1 × 512
3     conv 1×1                          (57·81) × 1 × 1 × 256
4     conv 1×1 + sigmoid                (57·81) × 1 × 1 × 1

Table 7: Two-stream discriminator architecture. Every convolutional layer is followed by a Leaky ReLU, except the last, which is followed by a sigmoid. Valid padding is used throughout the discriminator.
Task Time
Sample Generation 4.8s
Mean Depth Estimate 0.01s
Depth from Random Sparse Measurements 0.80s
Depth Up-sampling 0.50s
Depth Un-cropping 1.18s
User Selection 3.32s
Selection with Annotation 4.08s
Table 8: Running times for sample generation, and for inference in different applications. Note that for user-guidance, the reported time is per generated mode.

Appendix E Additional Application Details and Results

e.1 Predicting Pairwise Depth Ordering

For predicting ordinal depth ordering for a pair of points p and q in the image plane, we adopt the definition of the ground-truth label from [55] as

  ℓ(p, q) = +1 if d_p / d_q > 1 + τ,  −1 if d_p / d_q < 1 / (1 + τ),  and 0 otherwise,

where the threshold τ is equal to 0.02 as in [55]. To compute predictions from our distribution mean, which is a per-pixel best guess, we simply look at the relationship between the predicted depths at the corresponding points. Note that, like [4], we select a different threshold (based on a validation set) for use in (12) for prediction, so as to balance the WKDR= and WKDR≠ metrics.

To make use of our probabilistic outputs for better ordinal prediction, we look at the depth ordering of a given query pair in all samples of all patches that contain the pair (here, we use the true threshold τ), and output the label that is most frequent. In rare cases where no patch includes both points of a query pair, we simply use the prediction from the mean depth estimate as our output.
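The voting rule can be sketched as follows (a minimal illustration assuming the ratio-threshold label of [55]; the sign convention and names are ours):

```python
from collections import Counter

def ordinal_vote(sample_pairs, tau=0.02):
    """Predict the depth ordering of a query pair by majority vote over
    (depth_p, depth_q) values drawn from all samples of all patches that
    contain both points."""
    votes = []
    for dp, dq in sample_pairs:
        ratio = dp / dq
        if ratio > 1 + tau:
            votes.append(1)    # p judged farther than q (convention assumed)
        elif ratio < 1 / (1 + tau):
            votes.append(-1)   # p judged closer than q
        else:
            votes.append(0)    # roughly equal depth
    return Counter(votes).most_common(1)[0][0]

# Example: two samples say p is farther, one says roughly equal.
label = ordinal_vote([(2.0, 1.0), (2.0, 1.0), (1.0, 1.0)])
```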

e.2 Incorporating User Guidance

We include more results for the user guidance tasks in Figure 6, demonstrating improvements over the mean depth prediction as users select among our generated modes. Note that when provided a limited annotation of a small erroneous region, our method generates estimates that not only correct that region, but also propagate improvements outside the input bounding box.

Figure 6: Additional Results for User Guidance. For each input example, the top rows show multiple diverse global estimates produced by our method for user selection. The bottom rows show results where the user also provides a bounding box annotating an erroneous region (simulated in our experiments by considering errors with respect to ground-truth depth).

e.3 Depth Completion

Optimal Locations for Sparse Measurements.

For arbitrarily placed sparse depth measurements, our method is able to do better than random sampling by selecting an optimal set of locations to measure from the color image, given a budget on the total number of measurements. Specifically, we select local maxima of the monocular variance map from our output distribution, which represents points where our model is most uncertain about depth. Figure 7 demonstrates the selected points using this approach, and the resulting improvement in predicted depth over random sampling.

Figure 7: Results for Optimal Sampling. In this example, our selected samples avoid regions where our monocular distributions have high confidence, and are instead concentrated in regions where depth is more ambiguous. This leads to more accurate estimates than random sampling.

Half-resolution Comparison to [35, 49].

Note that [35, 49] evaluate their methods by reporting errors on a centered crop of half-resolution depth maps, and also derive their input sparse measurements at this half-resolution. In contrast, our results in Table 3 in the paper represent the official benchmark metrics (in the valid crop at full resolution) for consistency with other evaluations, in our paper and elsewhere. For a more direct comparison to [35, 49], we also evaluated our method by replicating their setting. Specifically, to provide input sparse measurements, we first down-sample the ground-truth depth map and randomly sample depth values from this down-sampled map. We then provide these as inputs to our method (which resizes them back to the full resolution to compute the global cost). We then take the full-resolution depth map estimates produced by our method, down-sample them to half-resolution, and compute error metrics on the same centered crop as [35, 49]. We report these results in Table 9, and find they are similar to the standard evaluation in Table 3 in the paper.

# meas.  Method          lower is better           higher is better
                       rms     m-rms   rel        δ1     δ2     δ3
         Ma [35]        -      0.351   0.078      92.8   98.4   99.6
         Ours          0.399   0.337   0.081      92.1   98.4   99.6

         Ma [35]        -      0.281   0.059      95.5   99.0   99.7
         Ours          0.338   0.285   0.062      94.4   98.9   99.8

         Wang [49]     0.372    -      0.089      91.5   98.3   99.6
         Ours          0.294   0.248   0.051      95.8   99.2   99.8

         Ma [35]        -      0.230   0.044      97.1   99.4   99.8
         Ours          0.252   0.213   0.041      97.0   99.5   99.9

Table 9: Performance on depth estimation from arbitrary sparse measurements, using the same evaluation setting as [35, 49] (half-resolution, evaluated on a center-crop).

More Results.

We also include results for the various depth completion tasks for more example scenes in Fig. 8.

Figure 8: Additional Results for Depth Completion from Limited Measurements. Shown here are our depth predictions and corresponding error-maps when incorporating different kinds of depth measurements.