Sparse-to-Continuous: Enhancing Monocular Depth Estimation using Occupancy Maps
Abstract
This paper addresses the problem of single image depth estimation (SIDE), focusing on improving the accuracy of deep neural network predictions. In a supervised learning scenario, the quality of predictions is intrinsically related to the training labels, which guide the optimization process. For indoor scenes, structured-light-based depth sensors (e.g. Kinect) are able to provide dense, albeit short-range, depth maps. For outdoor scenes, on the other hand, LiDARs are still considered the standard sensor, although they provide comparatively much sparser measurements, especially in areas further away. Rather than modifying the neural network structure to deal with sparse depth maps, this paper introduces a novel technique for the densification of depth maps based on the Hilbert Maps framework. A continuous occupancy map is produced from the 3D points of LiDAR scans, and the resulting reconstructed surface is projected into a 2D depth map with arbitrary resolution. Experiments conducted with various subsets of the KITTI dataset show the improvement produced by the proposed Sparse-to-Continuous technique, without the introduction of extra information into the training methodology.
I Introduction
Robotic platforms have been increasingly present in our society, performing progressively more complex activities in the most diverse environments. One of the driving factors behind this breakthrough is the development of sophisticated perceptual systems, which allow these platforms to understand the environment around them as well as, or better than, humans.
Nowadays, distinct sensors allow the capture of three-dimensional information, and amongst the most advanced are the rangefinders using LiDAR technology [1]. Nonetheless, these sensors can be extremely expensive depending on the range and level of detail required by the application. Since it is also possible to reconstruct 3D structures from 2D observations of the scene [2], visual systems have been employed as an alternative due to their reduced cost and size, with the added benefit of perceiving colors. However, estimating depth from 2D images is a challenging task and is described as an ill-posed problem, since the observed image may result from several possible projections of the actual real-world scene [3]. This problem has been extensively studied in Stereo Vision [4, 5, 6, 7] and Single Image Depth Estimation (SIDE) [3, 8, 9, 10], and in this work we focus on the latter.
Deep convolutional neural networks (CNNs) have strongly influenced how recent works address the SIDE task, with significant improvements in the accuracy of estimates and the level of detail present in depth maps. Many of these methods model monocular depth estimation as a regression problem and are supervised, often using sparse depth maps as ground truth, since these are readily available from other sensors (e.g. LiDAR rangefinders).
However, the degree of sparsity present in these maps is very high. For instance, in a ( pixels) image from the KITTI Depth dataset, 84.78% of its pixels do not contain valid information. One palliative is to use datasets that provide a large number of examples, such as KITTI Raw Data [1] and KITTI Depth [11], for the effective training of deep neural networks. Some recent works also propose the use of secondary information (e.g. low-resolution depth maps, surface normals, semantic maps), feeding it to the network alongside the RGB images as extra inputs [12, 13, 14], or focus on the development of network architectures that are more suitable for processing sparse information [11].
Existing works have also proposed the use of rendered depth images from synthetic datasets, which are also continuous [15, 16]. However, the generalization power of networks trained on this type of dataset is still questioned [11], mainly due to the gap in realism between virtual environments and real-world scenes.
In this work, we address the SIDE task and propose the use of occupancy models to interpolate raw sparse LiDAR measurements and generate continuous depth images, which then serve to train a ResNet-based architecture in a supervised manner. These continuous images have four times more information than sparse ones, which makes network convergence faster and easier, reduces the number of images needed for training and, more importantly, improves the quality of network predictions. Finally, to demonstrate the benefit of training deep convolutional networks using our proposed method, we compare the estimates obtained when training on three different datasets with varying levels of sparsity, as illustrated in Figure 1. To produce the occupancy models necessary for continuous projections, we employ the Hilbert Maps framework [17], due to its efficient training and query properties and its scalability to large datasets.
II Related Work
Depth Estimation from a single image is an ill-posed problem, since the observed image may be generated by several possible projections of the actual real-world scene [3]. In addition, it is inherently ambiguous, as the proposed methods attempt to retrieve depth information directly from color intensities [18]. Despite these difficulties, other tasks such as obstacle detection, semantic segmentation and scene structure benefit greatly from the presence of depth estimates [19], which makes this task particularly useful.
Early approaches relied on hand-crafted features, manually selected to suit the application, and typically made strong geometric assumptions [20, 21]. More recently, automatic optimization methods were proposed to automate the generation of visual cues [3, 8, 22, 23, 19, 12, 10]. Commonly, monocular depth estimation is modeled as a regression problem whose parameters are optimized by minimizing a cost function, often using sparse depth maps as ground truth for supervised learning.
Early works employed techniques such as Markov Random Fields (MRFs) [24, 25] and Conditional Random Fields (CRFs) [26] to perform this task. More recently, deep learning concepts have also been used to address the SIDE problem, where deep convolutional neural networks (CNNs) are responsible for extracting the visual features [3, 27, 8, 18]. The success of these techniques strongly influenced how subsequent works addressed the SIDE task, which in turn significantly improved the accuracy of estimates and the level of detail present in depth maps [12, 10].
In parallel to the supervised learning approach, some works focus on minimizing photometric reconstruction errors between stereo images [28, 9] or video sequences [29], which allows them to be trained in an unsupervised way (i.e. without depth maps as ground truth).
Depth Map Completion has been widely studied in computer vision and image processing, and deals with decreasing the sparsity level of depth maps. Monocular Depth Estimation differs from it in that it seeks to map RGB images directly to depth maps. Briefly, existing Depth Map Completion methods seek to predict distances for pixels where the depth sensor does not provide information. Currently, there are two types of approaches associated with this problem.
The first one, Non-Guided Depth Upsampling, aims to generate denser maps using only sparse maps obtained directly from 3D data or SLAM features. These methods resemble those proposed for the Depth Super-Resolution task [11], where the goal is to retrieve accurate high-resolution depth maps. More recently, deep convolutional neural networks have also been employed in super-resolution for both image [30, 31] and depth [32, 33] applications. Other works focus on inpainting the missing depth information, e.g., Uhrig et al. [11] employed sparse convolutional layers to process the irregularly distributed 3D laser data. However, methods that predict depth when trained only on raw information usually do not perform well [14].
The second approach, Image Guided Depth Completion, incorporates some form of guidance to achieve superior performance, e.g. using sparse maps and RGB images of the scene (RGB-D data) as inputs. Besides low-resolution sparse samples obtained from low-cost LiDAR or SLAM features [12, 34], other auxiliary information can also be employed, such as semantic labels [35], 2D laser points [13], and surface normal and occlusion boundary maps [14].
Synthetic Datasets have also been employed to retrieve depth information [15, 16]. These datasets provide high-quality dense depth maps that are extracted directly from virtual environments. Some of the most used available datasets are: Apolloscape [36], SUNCG [37], SYNTHIA [38] and Virtual KITTI [39]. However, it remains open for discussion whether the complexity and realism of the information in such synthetic datasets are sufficient to train algorithms that can be successfully deployed in real-world situations [11].
III Methodology
III-A Occupancy Maps
A common way to store range-based sensor data is through the use of point clouds, which can be projected back onto a 2D plane to produce depth images containing distance estimates for all pixels that have a corresponding world point. Assuming a rectified camera projection matrix $\mathbf{P}_{rect}$, a rectifying rotation matrix $\mathbf{R}_{rect}$ and a rigid body transformation matrix from camera to range-based sensor $\mathbf{T}$ (all expressed in homogeneous coordinates), a 3D point $\mathbf{P}$ can be projected into pixel $\mathbf{u}$ as such:
$$\mathbf{u} = \mathbf{P}_{rect} \, \mathbf{R}_{rect} \, \mathbf{T} \, \mathbf{P} \qquad (1)$$
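The projection chain described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not stated in the text: the function name `project_points`, a 3×4 projection matrix, 4×4 homogeneous rotation and rigid-body matrices, and the use of the third homogeneous coordinate as the depth value.

```python
import numpy as np

def project_points(points_xyz, P_rect, R_rect, T):
    """Project N x 3 sensor-frame points to pixels.

    Assumed shapes (not given in the text): P_rect is 3 x 4, while
    R_rect and T are 4 x 4 homogeneous matrices.
    Returns pixel coordinates (N x 2), depths (N,) and a validity mask.
    """
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # N x 4 homogeneous points
    cam = (P_rect @ R_rect @ T @ pts_h.T).T            # N x 3 image-space points
    depth = cam[:, 2]                                  # third coordinate used as depth
    in_front = depth > 1e-6                            # points behind the camera are invalid
    uv = np.full((n, 2), np.nan)
    uv[in_front] = cam[in_front, :2] / depth[in_front, None]  # perspective division
    return uv, depth, in_front
```

Points behind the image plane are flagged rather than projected, since the perspective division is meaningless for them.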
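An example of this projection can be seen in Figure 1a, where we can see the sparsity generated by directly projecting point cloud information, most notably in areas further away from the sensor. Spatial dependency modeling is a crucial aspect in computer vision, and the introduction of such irregular gaps can severely impact performance. Because of that, here we propose projecting not the point cloud itself, but rather its occupancy model, as generated by the Hilbert Maps (HM) framework [17]. This methodology has recently been successfully applied to the modeling of large-scale 3D environments [40], producing a continuous occupancy function that can be queried at arbitrary resolutions. Assuming a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^3$ is a point in the three-dimensional space and $y_i \in \{-1, +1\}$ is a classification variable that indicates the occupancy property of $\mathbf{x}_i$, the probability of non-occupancy for a query point $\mathbf{x}_*$ is given by:
$$p(y_* = -1 \mid \Phi(\mathbf{x}_*), \mathbf{w}) = \frac{1}{1 + \exp\left(\mathbf{w}^T \Phi(\mathbf{x}_*)\right)} \qquad (2)$$
where $\Phi(\mathbf{x}_*)$ is the feature vector and $\mathbf{w}$ are the weight parameters that describe the discriminative model $p(y \mid \Phi(\mathbf{x}), \mathbf{w})$. We employ the same feature vector from [40], defined by a series of squared exponential kernel evaluations against an inducing point set $\mathcal{M} = \{\mathcal{M}_1, \dots, \mathcal{M}_M\}$, with $\mathcal{M}_j = \{\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}$, obtained by clustering the point cloud and calculating mean and variance estimates for each subset of points:
$$\Phi(\mathbf{x}, \mathcal{M}) = \left[ k(\mathbf{x}, \mathcal{M}_1), \dots, k(\mathbf{x}, \mathcal{M}_M) \right]^T \qquad (3)$$
$$k(\mathbf{x}, \mathcal{M}_j) = \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right) \qquad (4)$$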
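As a rough illustration of how such an occupancy model is queried, the sketch below evaluates squared exponential kernel features against a set of cluster means and covariances and passes them through a logistic function to obtain a probability of non-occupancy. The function names, array shapes and the use of full 3×3 covariances are assumptions made for this example; it is not the implementation used in [17] or [40].

```python
import numpy as np

def hm_features(x, means, covs):
    """Squared exponential kernel of a query point x (3,) against M clusters.

    means: (M, 3) cluster means; covs: (M, 3, 3) cluster covariances.
    """
    diff = x - means                                      # (M, 3)
    inv = np.linalg.inv(covs)                             # (M, 3, 3), batched inverse
    maha = np.einsum("mi,mij,mj->m", diff, inv, diff)     # squared Mahalanobis distances
    return np.exp(-0.5 * maha)                            # (M,) feature vector

def prob_free(x, w, means, covs):
    """Probability of non-occupancy: 1 / (1 + exp(w^T Phi(x)))."""
    return 1.0 / (1.0 + np.exp(w @ hm_features(x, means, covs)))
```

With zero weights the model is uninformative and returns 0.5 everywhere; positive weights on nearby clusters push the query towards "occupied".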
Clustering is performed using the Quick-Means algorithm proposed in [41], due to its computational efficiency and ability to produce consistent cluster densities. However, this algorithm is modified to account for variable cluster densities within a function, in this case the distance from origin. This is achieved by setting , where and are the inner and outer radii used to define cluster size and is a scaling constant. The intuition is that areas further from the center will have fewer points, and therefore larger clusters are necessary to properly interpolate over such sparse structures. The trade-off for this increased interpolative power is a loss in structural detail, since a larger volume will be modeled by the same cluster. The optimal weight parameters $\mathbf{w}$ are calculated by minimizing the following negative log-likelihood loss function:
$$\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} \log\left( 1 + \exp\left( -y_i \mathbf{w}^T \Phi(\mathbf{x}_i) \right) \right) + R(\mathbf{w}) \qquad (5)$$
where $R(\mathbf{w})$ is a regularization function such as the elastic net [42], and $\Phi(\mathbf{x}_i)$ and $y_i$ are the feature vector and occupancy label of the $i$-th training point. Once the occupancy model has been trained, it can be used to produce a reconstruction of the environment, and each pixel is then checked for collision in the 3D space, producing depth estimates. An example of a reconstructed depth image is depicted in Figure 1c, where we can see that virtually all previously empty areas were filled by the occupancy model, while maintaining spatial dependencies intact (up to the reconstructive capabilities of the HM framework).
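One simple way to picture the per-pixel collision check is naive ray marching: sample points along the ray of each pixel and record the depth of the first sample that the occupancy model considers occupied. The sketch below follows that idea under several assumptions that are not part of the text: the callable `prob_occupied`, rays given with a unit z component, a fixed step size, a 0.5 occupancy threshold, and a maximum depth of 80. It is meant only to illustrate the principle, not to reproduce the authors' rendering procedure.

```python
import numpy as np

def render_depth(prob_occupied, ray_dirs, max_depth=80.0, step=0.1, thresh=0.5):
    """Depth image from an occupancy model by marching along camera rays.

    prob_occupied: callable mapping (K, 3) points to (K,) occupancy probabilities.
    ray_dirs: (H, W, 3) ray directions in the camera frame, scaled so that z = 1.
    Pixels whose ray never hits an occupied sample keep the value 0.
    """
    h, w, _ = ray_dirs.shape
    depth = np.zeros((h, w))
    ts = np.arange(step, max_depth + step, step)      # sampled depths along each ray
    for v in range(h):
        for u in range(w):
            pts = ts[:, None] * ray_dirs[v, u][None, :]          # (K, 3) samples
            hits = np.nonzero(prob_occupied(pts) > thresh)[0]    # occupied samples
            if hits.size:
                depth[v, u] = pts[hits[0], 2]                    # first collision
    return depth
```

Because the model is continuous, the ray set (and therefore the output resolution) can be chosen freely, at the cost of one model query per sample.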
III-B Continuous Depth Images
When datasets do not provide ground truths directly, it is still possible to obtain them using the 3D LiDAR scans and the extrinsic/intrinsic parameters of the RGB cameras. In this case, a sparse depth image can be generated by directly projecting the point cloud of the scene onto the image plane of the visual sensor [3, 9]. Continuous depth images, in turn, can be obtained by interpolating the measured points into continuous surfaces prior to the projection. In this work, the Hilbert Maps technique was used on the LiDAR scans to generate these surfaces. After restricting the continuous map to the region under the left camera's field of view, we projected the remaining depth values onto the image plane.
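The sparse case can be sketched as a simple rasterization of already-projected points. The helper below is hypothetical (its name, the rounding to the nearest pixel and the convention that 0 marks a missing value are assumptions); when several points fall on the same pixel it keeps the nearest one.

```python
import numpy as np

def sparse_depth_image(uv, depth, height, width):
    """Rasterize projected points into a depth image (0 = no measurement).

    uv: (N, 2) finite pixel coordinates (column, row); depth: (N,) depths.
    """
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    ok = (depth > 0) & (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    img = np.full((height, width), np.inf)
    # Unbuffered minimum: repeated pixels keep the smallest (nearest) depth.
    np.minimum.at(img, (rows[ok], cols[ok]), depth[ok])
    img[np.isinf(img)] = 0.0
    return img
```

The fraction of valid pixels of such an image is then simply `(img > 0).mean()`, which is the sparsity measure discussed throughout the paper.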
III-C Data Augmentation
Two types of random online transformations were performed, thus artificially increasing the number of training data samples.
Flips: The input image and the corresponding depth map are flipped horizontally with 50% probability.
Color Distortion: Adjusts the intensity of color components on an RGB image randomly. The order of the following transformations is also chosen randomly:

Brightness by a random value

Saturation by a random value

Hue by a random value

Contrast by a random value
As pointed out by [3], the world-space geometry of the scene is not preserved by image scaling and translation transformations. Therefore, we opted not to use these transformations, nor rotations, although the latter are geometry-preserving. We believe that aggressive color distortions prevent the network from becoming biased towards relating pixel intensity to depth values, so that it focuses on learning the scene's geometric relationships.
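The augmentation scheme described in this section can be sketched as follows. The ranges of the random values are not given in the text, so the ones used here are placeholders, and the specific color operations (for instance, hue shifts implemented as a rotation in YIQ space) are assumptions chosen to keep the example self-contained. Images are assumed to be float arrays in [0, 1] with shape H×W×3.

```python
import numpy as np

_RGB2YIQ = np.array([[0.299, 0.587, 0.114],
                     [0.596, -0.274, -0.322],
                     [0.211, -0.523, 0.312]])
_YIQ2RGB = np.linalg.inv(_RGB2YIQ)

def brightness(img, rng):
    return img + rng.uniform(-0.2, 0.2)               # placeholder range

def contrast(img, rng):
    mean = img.mean(axis=(0, 1), keepdims=True)
    return (img - mean) * rng.uniform(0.7, 1.3) + mean

def saturation(img, rng):
    gray = (img @ _RGB2YIQ[0])[..., None]             # luma channel
    return gray + (img - gray) * rng.uniform(0.7, 1.3)

def hue(img, rng):
    a = rng.uniform(-0.3, 0.3)                        # rotation angle in radians
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, np.cos(a), -np.sin(a)],
                    [0.0, np.sin(a), np.cos(a)]])
    return img @ (_YIQ2RGB @ rot @ _RGB2YIQ).T        # rotate chroma, keep luma

def augment(rgb, depth, rng):
    """Random horizontal flip (p = 0.5) plus color distortions in random order."""
    if rng.random() < 0.5:
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]     # flip image and depth together
    ops = [brightness, saturation, hue, contrast]
    for i in rng.permutation(len(ops)):               # random order of the color ops
        rgb = ops[i](rgb, rng)
    return np.clip(rgb, 0.0, 1.0), depth
```

Only the flip is applied to the depth map; the color distortions leave the supervision signal untouched.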
III-D Loss Functions
We employed three different loss functions for adjusting the internal parameters of the presented deep neural network. The motivation behind this is simply to determine which one is more suitable for approximating the outputs ($y_i$) to the reference values ($y_i^*$) for the $i$-th pixel. The mathematical expressions for each one are presented as follows:
III-D1 Squared Euclidean Norm (mse)
Also known as the $\mathcal{L}_2$ norm, it is the most commonly used cost function for neural network optimization, and consists of computing the squared Euclidean distances (Equation 6) between predictions ($y_i$) and ground truths ($y_i^*$) [18].
$$\mathcal{L}_{mse}(y, y^*) = \frac{1}{n} \sum_i (y_i - y_i^*)^2 \qquad (6)$$
III-D2 Scale Invariant Mean Squared Error (eigen)
In the context of depth prediction, discovering the global scale of a scene is a naturally ambiguous task. Eigen & Fergus [27] verified that subtracting the mean scale from the evaluated scenes (second term in the equation below) results in a significant improvement in network predictions.
$$\mathcal{L}_{eigen\_grads}(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \Big( \sum_i d_i \Big)^2 + \frac{1}{n} \sum_i \left[ (\nabla_x d_i)^2 + (\nabla_y d_i)^2 \right],$$
where $d_i$ is the difference between the prediction and the ground truth at pixel $i$, as defined in [27].
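III-D3 Adaptive BerHu Penalty (berhu)
Also known as the Reverse Huber function, it was proposed in order to obtain more robust estimates by penalizing predictions that are close to the source (i.e. vehicle) differently from those that are distant [18, 43, 44].
$$\mathcal{L}_{berhu}(y, y^*) = \begin{cases} |y_i - y_i^*| & \text{if } |y_i - y_i^*| \le c, \\ \dfrac{(y_i - y_i^*)^2 + c^2}{2c} & \text{otherwise}, \end{cases} \qquad (7)$$
with the threshold $c$ set as in [18], i.e., $c = \frac{1}{5} \max_i |y_i - y_i^*|$. As shown in Equation 7, the penalization adapts according to how far the predictions are from the reference depths: small residuals are subject to the $\mathcal{L}_1$ norm, whereas large residuals are subject to the $\mathcal{L}_2$ norm.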
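The three losses can be written compactly in NumPy for a single pair of dense maps, which may help when comparing their behavior. This is a sketch under assumptions: the scale-invariant loss uses log-space residuals and a default $\lambda = 0.5$ as in Eigen & Fergus [27], image gradients are approximated by finite differences, and the BerHu threshold is one fifth of the largest absolute residual as in Laina et al. [18]. Masking of invalid pixels is omitted for brevity.

```python
import numpy as np

def mse_loss(y, y_ref):
    """Mean squared error between prediction and reference."""
    return np.mean((y - y_ref) ** 2)

def eigen_grads_loss(y, y_ref, lam=0.5):
    """Scale-invariant error with gradient term (log-space residuals assumed)."""
    d = np.log(y) - np.log(y_ref)
    n = d.size
    gx = np.diff(d, axis=1)                 # horizontal finite differences
    gy = np.diff(d, axis=0)                 # vertical finite differences
    grad_term = (np.sum(gx ** 2) + np.sum(gy ** 2)) / n
    return np.sum(d ** 2) / n - lam * np.sum(d) ** 2 / n ** 2 + grad_term

def berhu_loss(y, y_ref):
    """Reverse Huber: L1 for small residuals, scaled L2 for large ones."""
    r = np.abs(y - y_ref)
    c = 0.2 * r.max()                       # adaptive threshold
    if c == 0:
        return 0.0
    return np.mean(np.where(r <= c, r, (r ** 2 + c ** 2) / (2 * c)))
```

With `lam=1.0` the scale-invariant loss is unchanged when the prediction is multiplied by a constant factor, which is the property that motivates its second term.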
III-E Network Architecture
In this work, we used the Fully Convolutional Residual Network (FCRN) topology proposed by Laina et al. [18]. This network was selected because it has a smaller number of trainable parameters and requires fewer training images, without losing performance. In addition, the residual blocks present in the architecture allow the construction of a deeper model capable of predicting more accurate and higher-resolution output maps. More specifically, the FCRN (Figure 2) is based on the ResNet-50 topology, but the fully-connected layers have been replaced by a set of residual upsampling blocks, also referred to as up-projections, which are layers responsible for deconvolving and recovering the spatial resolution of feature maps. This network was trained end-to-end in a supervised way but, unlike the authors who proposed it, we modified its output to predict distances in meters rather than distances in log-space. The network uses RGB images of pixels as inputs for training the 63M trainable parameters and provides an output map with size of pixels.
IV Experimental Results
IV-A Implementation details
We implemented the network using TensorFlow [45]. Our models were trained on the KITTI Depth and KITTI Raw Data datasets using an NVIDIA Titan X GPU with 12 GB of memory. We used a batch size of 4 and 300,000 training steps. The initial learning rate was 0.0001, reduced by 5% every 1000 steps. Besides learning rate decay, we also employed a dropout of 50% and normalization as regularization [46, 47].
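The learning rate schedule described above corresponds to an exponential decay, which can be written as a small function. Whether the reduction is applied in discrete steps (staircase) or continuously is not stated in the text, so both variants are shown and the staircase default is an assumption.

```python
def learning_rate(step, base_lr=1e-4, decay=0.95, decay_steps=1000, staircase=True):
    """Exponentially decayed learning rate: base_lr * decay ** (step / decay_steps).

    With staircase=True the rate drops by 5% once every `decay_steps` steps;
    otherwise it decays smoothly between those points.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay ** exponent
```

After 300,000 steps the rate has been reduced 300 times, so by the end of training the updates are several orders of magnitude smaller than at the start.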
IV-B Datasets
Three different datasets are considered in this work: KITTI Discrete (sparse), KITTI Depth (semi-dense), and KITTI Continuous (dense), including frames from the “city”, “residential”, “road”, “campus” and “person” sequences. Typically, the resolution of the RGB images and depth maps used is pixels.
KITTI Depth: Due to their complexity, training and evaluating deep convolutional neural networks requires a large number of annotated image pairs. KITTI Depth is a large-scale dataset created to allow supervised end-to-end training, since other datasets such as Middlebury [48], Make3D [25], and KITTI [49, 50] do not have enough data to adjust all internal parameters of a deep neural network [11]. The dataset is paired with scenes from the KITTI Raw Data dataset [1] and consists of 92750 semi-dense depth maps (ground truth). The depth images were obtained by accumulating 11 laser scans, whose outliers were removed by enforcing consistency between the LiDAR points and the reconstructed depth maps generated by semi-global matching (SGM) [11].
KITTI Discrete/Continuous: Unlike the procedure used in KITTI Depth to make the depth images less sparse, which consisted of accumulating different scans of the laser sensor, we used an occupancy model to make the depth maps denser. In other words, the goal was to increase the number of valid pixels available in depth images for training. In this sense, this alternative requires a smaller number of training images than other techniques that use datasets with sparse/semi-dense ground truth information. KITTI Continuous is also based on 3D Velodyne point clouds, but we first interpolate its measurements as surfaces to generate the continuous depth images (more details in Section III-B). The KITTI Discrete dataset was built alongside KITTI Continuous and consists of depth images that are the direct projections of point clouds onto the 2D image plane. However, in this dataset version the generated depth maps are very sparse, since they use only one LiDAR scan.
For the Discrete and Continuous datasets, we randomly shuffled the images and corresponding depth maps before splitting them into training and test sets with an 80%/20% ratio. The resulting number of pairs for each dataset is presented in Table I. On average, the number of valid pixels available in the KITTI Discrete and KITTI Depth datasets represents only 6.63% and 24.33% of the number available in the KITTI Continuous dataset, i.e., it is respectively about 15 and 4 times smaller.
Table I

| Dataset | Train | Test | Total | Avg. valid pixels |
|---|---|---|---|---|
| KITTI Depth | 85898 | 6852¹ | 92750 | 70910 |
| KITTI Discrete | 25742 | 6436 | 32178 | 19323 |
| KITTI Continuous | 25742 | 6436 | 32178 | 291453 |

¹ Validation subset used instead of the test subset, since the test subset does not have depth maps for the corresponding input images.
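The percentages and factors quoted for the three datasets follow directly from the average number of valid pixels listed in Table I, as the short check below shows. The dictionary keys and variable names are only illustrative.

```python
# Average number of valid pixels per depth image, taken from Table I.
valid_pixels = {
    "KITTI Depth": 70910,
    "KITTI Discrete": 19323,
    "KITTI Continuous": 291453,
}

reference = valid_pixels["KITTI Continuous"]

# Share of valid pixels relative to KITTI Continuous, in percent.
share = {name: 100.0 * count / reference for name, count in valid_pixels.items()}

# How many times more valid pixels KITTI Continuous provides.
factor = {name: reference / count for name, count in valid_pixels.items()}

# Fraction of image pairs used for training in the Discrete/Continuous split.
train_fraction = 25742 / 32178
```

Here `share` gives roughly 6.63% for KITTI Discrete and 24.33% for KITTI Depth, `factor` gives roughly 15 and 4, and `train_fraction` is approximately 0.8.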
IV-C Evaluation Metrics
Since the final results are generally a set of predictions for the test images, qualitative (visual) analysis may be biased and is not sufficient to determine whether one approach is better than another. For this reason, several works use the following metrics to evaluate their methods and thus compare them with other techniques in the literature [3, 51, 19]:
Threshold ($\delta$): % of $y_i$ s.t. $\max\left( \frac{y_i}{y_i^*}, \frac{y_i^*}{y_i} \right) = \delta < thr$, with $thr \in \{1.25, 1.25^2, 1.25^3\}$
Abs Relative Difference: $\frac{1}{N} \sum_i \frac{|y_i - y_i^*|}{y_i^*}$
Squared Relative Difference: $\frac{1}{N} \sum_i \frac{(y_i - y_i^*)^2}{y_i^*}$
RMSE (linear): $\sqrt{\frac{1}{N} \sum_i (y_i - y_i^*)^2}$
RMSE (log): $\sqrt{\frac{1}{N} \sum_i (\log y_i - \log y_i^*)^2}$
where $N$ is the number of valid pixels in all evaluated images. In addition, in order to compare our results with other works, we also use the evaluation protocol of restricting ground-truth depth values and predictions to a range, in this case the interval. In other words, we discard depths below and cap distances above . Some works [28, 9] require different intervals for a fair comparison.
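These metrics are straightforward to compute once valid pixels have been selected. The function below is a sketch in NumPy following the standard definitions from Eigen et al. [3]; its name, the dictionary keys and the way the range restriction is applied (masking on the ground truth, clipping the prediction) are assumptions, and the range limits are left as required arguments because their values are not reproduced here.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth, max_depth):
    """Error and accuracy metrics over pixels with valid ground truth."""
    valid = gt >= min_depth                          # drop missing or too-close depths
    gt_v = np.minimum(gt[valid], max_depth)          # cap far distances
    pred_v = np.clip(pred[valid], min_depth, max_depth)
    ratio = np.maximum(pred_v / gt_v, gt_v / pred_v)
    diff = pred_v - gt_v
    log_diff = np.log(pred_v) - np.log(gt_v)
    return {
        "abs_rel": np.mean(np.abs(diff) / gt_v),
        "sq_rel": np.mean(diff ** 2 / gt_v),
        "rmse": np.sqrt(np.mean(diff ** 2)),
        "rmse_log": np.sqrt(np.mean(log_diff ** 2)),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```

The first four entries are errors (lower is better), while the three threshold accuracies are fractions of pixels (higher is better).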
IV-D Benchmark Evaluation
In this section, we perform the benchmark evaluation of our method trained on the presented datasets and compare them with existing works. Since there are no test images for the KITTI Raw Data, we evaluated the network predictions on two different test splits, which were resized to the original size using bilinear upsampling, and compared them to the corresponding groundtruth depth maps.
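Resizing a predicted depth map back to the original image size with bilinear interpolation can be sketched as below. The function name and the align-corners sampling convention are assumptions for this example; deep learning frameworks provide their own resize operations, whose conventions may differ.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of a 2D map using the align-corners convention."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)             # sample rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)             # sample columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                          # vertical interpolation weights
    wx = (xs - x0)[None, :]                          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

Interpolating across depth discontinuities produces intermediate values at object boundaries, which is one reason evaluation is restricted to pixels with valid ground truth.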
IV-D1 Eigen Split
As already mentioned, the KITTI Raw dataset does not have an official training/test split, so Eigen et al. subdivided the available images into 33,131 for training and 697 for evaluation [3]. Like other works in the literature, we also use the test subset to evaluate our methods, which allows us to compare them directly with state-of-the-art algorithms. Since this dataset does not provide ground-truth depth images, they need to be generated using the methodology presented in Section III-B. In Table II we detail how our approaches perform on this test split alongside other results of leaderboard algorithms. The last two rows show how our method improved over the baseline, causing the network to reach current state-of-the-art results. A qualitative comparison between our results and the current state of the art is presented in Figure 3.
Like DORN [10], our method also detects the obstacles present in the scenes well, with the noticeable difference that ours provides a certain margin of safety around the obstacles, due to the reconstructive properties of the Hilbert Maps framework, as shown in Section III-A, while achieving similar performance with a simpler architecture.
Table II

Abs Rel, Sqr Rel, RMSE and RMSE (log): lower is better. Threshold accuracies ($\delta$): higher is better.

| Approach | Supervised | Range | Abs Rel | Sqr Rel | RMSE | RMSE (log) | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$ |
|---|---|---|---|---|---|---|---|---|---|
| Make 3D [25] | Yes | | 0.280 | 3.012 | 8.734 | 0.361 | 0.601 | 0.820 | 0.926 |
| Mancini et al. [52] | Yes | | - | - | 7.508 | 0.524 | 0.318 | 0.617 | 0.813 |
| Eigen et al. [3], coarse 28×144 | Yes | | 0.194 | 1.531 | 7.216 | 0.273 | 0.679 | 0.897 | 0.967 |
| Eigen et al. [3], fine 27×142 | Yes | | 0.190 | 1.515 | 7.156 | 0.270 | 0.692 | 0.899 | 0.967 |
| Liu et al. [8], DCNF-FCSP FT | Yes | | 0.217 | 1.841 | 6.986 | 0.289 | 0.647 | 0.882 | 0.961 |
| Ma & Karaman [12], only RGB | Yes | | 0.208 | - | 6.266 | - | 0.591 | 0.900 | 0.962 |
| Fu et al. [10], DORN (ResNet) | Yes | | 0.072 | 0.307 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 |
| Kuznietsov et al. [53] | No | | 0.262 | 4.537 | 6.182 | 0.338 | 0.768 | 0.912 | 0.955 |
| Godard et al. [9] (CS+K) | No | | 0.136 | 1.512 | 5.763 | 0.236 | 0.836 | 0.935 | 0.968 |
| Zhou et al. [29] (w/o explainability) | No | | 0.208 | 1.551 | 5.452 | 0.273 | 0.695 | 0.900 | 0.964 |
| Garg et al. [28], L12 Aug 8x | No | | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |
| Zhou et al. [29] (CS+K) | No | | 0.190 | 1.436 | 4.975 | 0.258 | 0.735 | 0.915 | 0.968 |
| Godard et al. [9] (CS+K) | No | | 0.118 | 0.932 | 4.941 | 0.215 | 0.858 | 0.947 | 0.974 |
| Kuznietsov et al. [53] | Yes | | 0.117 | 0.597 | 3.531 | 0.183 | 0.861 | 0.964 | 0.989 |
| Ma & Karaman [12], RGBd 500 Samples | Yes | | 0.073 | - | 3.378 | - | 0.935 | 0.976 | 0.989 |
| Fu et al. [10], DORN (ResNet) | Yes | | 0.071 | 0.268 | 2.271 | 0.116 | 0.936 | 0.985 | 0.995 |
| KITTI Depth, only valid, (Ours) | Yes | | 0.195 | 1.417 | 4.040 | 0.236 | 0.718 | 0.841 | 0.883 |
| KITTI Continuous, only valid, (Ours) | Yes | | 0.071 | 0.267 | 2.536 | 0.133 | 0.820 | 0.894 | 0.908 |
IV-D2 Eigen Split (Continuous)
The above-mentioned Eigen test split has been used in the literature for evaluating depth estimation methods for years. However, the depth maps (ground truths) provided by this split are sparse. For a fair comparison between the methods trained on the sparse and continuous datasets, we increased the number of evaluation points, since our technique improves the prediction quality not only for the sparse points of the original depth map but also for the scene as a whole. This modification makes it possible to further highlight the benefits of our technique. More specifically, we generated an evaluation split using 638 test images from the original testing set, but with their corresponding continuous versions.
IV-E Ablation Studies
Besides evaluating the presented architecture on the different versions of the KITTI datasets, we conducted various ablation studies to identify the best training combination. More specifically, we trained using different datasets and loss functions. We also studied the influence of using all pixels, including sky and reflecting surfaces, or only valid pixels, which have corresponding depth information. The obtained results are presented in Table III. The models trained on the continuous dataset showed a decrease of 63.6% and 37.2%, respectively, in the AbsRel and RMSE error metrics compared to methods trained under the same circumstances but using semi-sparse maps.
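Table III

Abs Rel, Sqr Rel, RMSE and RMSE (log): lower is better. Threshold accuracies ($\delta$): higher is better.

| Dataset | Pixels | Loss | Abs Rel | Sqr Rel | RMSE | RMSE (log) | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$ |
|---|---|---|---|---|---|---|---|---|---|
| discrete | valid | | 0.200 | 1.525 | 4.315 | 0.244 | 0.715 | 0.833 | 0.878 |
| depth | valid | | 0.195 | 1.417 | 4.040 | 0.236 | 0.718 | 0.841 | 0.883 |
| continuous'city | all | | 0.180 | 1.413 | 6.088 | 0.486 | 0.659 | 0.806 | 0.858 |
| continuous'city | all | | 0.144 | 1.099 | 4.548 | 0.717 | 0.727 | 0.837 | 0.875 |
| continuous | valid | | 0.125 | 0.661 | 4.232 | 0.195 | 0.728 | 0.860 | 0.898 |
| continuous | valid | | 0.124 | 0.653 | 4.176 | 0.197 | 0.733 | 0.861 | 0.898 |
| continuous | all | | 0.118 | 0.697 | 4.272 | 0.636 | 0.756 | 0.865 | 0.891 |
| continuous | all | | 0.110 | 0.488 | 3.300 | 0.297 | 0.773 | 0.872 | 0.898 |
| continuous | valid | | 0.103 | 0.408 | 2.976 | 0.159 | 0.782 | 0.882 | 0.906 |
| continuous | all | | 0.093 | 0.500 | 3.561 | 0.481 | 0.790 | 0.878 | 0.899 |
| continuous | valid | | 0.071 | 0.267 | 2.536 | 0.133 | 0.820 | 0.894 | 0.908 |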
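The relative reductions reported for the ablation study can be re-derived from the table, taking the "depth, valid" row as the baseline and the best "continuous, valid" row as the improved model. The helper name below is only illustrative.

```python
def relative_reduction(baseline, improved):
    """Relative decrease of an error metric, in percent."""
    return 100.0 * (baseline - improved) / baseline

abs_rel_reduction = relative_reduction(0.195, 0.071)   # Abs Rel: depth vs. continuous
rmse_reduction = relative_reduction(4.040, 2.536)      # RMSE: depth vs. continuous
```

Rounded to one decimal place, these evaluate to 63.6% and 37.2%, matching the figures quoted in the text.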
Figure 4 illustrates the qualitative comparison between the predictions obtained when training on the proposed datasets. As can be noted, the continuous depth images improved the quality of the distance estimates: the predicted images are much less blurred, i.e., object edges are better defined, and the measurements are more accurate with respect to the ground-truth maps. The main causes of blurred predictions when training on sparse datasets are the use of 2D convolutional filters over widely sparse regions and the intermittent nature of the depth information, since whether a given pixel has a distance value depends on where the laser points are reprojected.
V Conclusion
In this paper, we present a novel data preprocessing step that applies occupancy models (i.e. Hilbert Maps) to the Single Image Depth Estimation problem, generating continuous depth maps for training our deep residual network, in contrast to typical supervised approaches that use sparse ones. This training process does not require any other type of sensor or extra information, only RGB images as input and continuous maps as supervision, which significantly improved the quality of network predictions over typical sparse maps. Moreover, the proposed methodology presented superior performance even when using 60% fewer examples than models trained on the KITTI Depth dataset, as a consequence of increasing the valid information present in the ground truth maps from 15.2% to 62.6%. The main limitation of the proposed preprocessing method is the computational cost required to compute each continuous depth map used for training. Future work will focus on optimizing the method itself, mainly tackling the aforementioned problem, and on refining the network topology by incorporating new layers that are better suited to the SIDE task. Uhrig et al. [11] employed sparse convolutions to deal with the sparsity present in depth maps. Similarly, we suggest the development or use of more suitable layers, e.g. sub-pixel convolutional layers [54], for processing the available information, which is now continuous but still has empty areas.
Acknowledgment
This research was supported by funding from the Brazilian National Council for Scientific and Technological Development (CNPq), under grants 130463/2017-5 and 465755/2014-3, the São Paulo Research Foundation (FAPESP) grant 2014/50851-0, and the Faculty of Engineering & Information Technologies, The University of Sydney, under the Faculty Research Cluster Program. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPUs used in this research.
References
 [1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research (IJRR), vol. 32, no. 11, pp. 1231–1237, 2013.
 [2] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess, “Unsupervised learning of 3D structure from images,” in Proc. of the 30th International Conference on Neural Information Processing Systems (NIPS). USA: Curran Associates Inc., 2016, pp. 5003–5011.
 [3] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proc. of the 27th International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
 [4] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang, “A Deep Visual Correspondence Embedding Model for Stereo Matching Costs,” in Proc. of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 972–980.
 [5] J. Žbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” in Proc. of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1592–1599.
 [6] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4040–4048.
 [7] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient Deep Learning for Stereo Matching,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5695–5703.
 [8] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Proc. of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5162–5170.
 [9] C. Godard, O. M. Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6602–6611.
 [10] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep Ordinal Regression Network for Monocular Depth Estimation,” in Proc. of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [11] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant CNNs,” in 2017 International Conference on 3D Vision (3DV), Oct 2017, pp. 11–20.
 [12] F. Ma and S. Karaman, “Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image,” in Proc. of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
 [13] Y. Liao, L. Huang, Y. Wang, S. Kodagoda, Y. Yu, and Y. Liu, “Parse geometry from a line: Monocular depth estimation with partial laser observation,” in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 5059–5066.
 [14] Y. Zhang and T. Funkhouser, “Deep Depth Completion of a Single RGB-D Image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 175–185.
 [15] M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia, J. Delmerico, and D. Scaramuzza, “Toward Domain Independence for Learning-Based Monocular Depth Estimation,” IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1778–1785, 2017.
 [16] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, “J-MOD²: Joint monocular obstacle detection and depth estimation,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1490–1497, 2018.
 [17] F. Ramos and L. Ott, “Hilbert maps: Scalable continuous occupancy mapping with stochastic gradient descent,” International Journal of Robotics Research (IJRR), vol. 35, no. 14, pp. 1717–1730, 2016.
 [18] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proc. of the 4th International Conference on 3D Vision (3DV), 2016, pp. 239–248.
 [19] Y. Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
 [20] D. Hoiem, A. A. Efros, and M. Hebert, “Automatic photo pop-up,” ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.
 [21] V. Hedau, D. Hoiem, and D. Forsyth, “Thinking inside the box: Using appearance models and context based on room geometry,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6316 LNCS, no. PART 6, pp. 224–237, 2010.
 [22] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, “Towards unified depth and semantic prediction from a single image,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2800–2809.
 [23] A. Roy, “Monocular Depth Estimation Using Neural Regression Forest,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5506–5514.
 [24] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning Depth from Single Monocular Images,” Advances in Neural Information Processing Systems, vol. 18, pp. 1161–1168, 2006.
 [25] A. Saxena, M. Sun, and A. Y. Ng, “Make3D: Learning 3D scene structure from a single still image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.
 [26] M. Liu, M. Salzmann, and X. He, “Discrete-Continuous Depth Estimation from a Single Image,” in Proc. of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 716–723.
 [27] D. Eigen and R. Fergus, “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture,” in Proc. of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2650–2658.
 [28] R. Garg, B. G. Vijay Kumar, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in Proc. of the European Conference on Computer Vision (ECCV), 2016, pp. 740–756.
 [29] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, no. 6, 2017, p. 7.
 [30] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
 [31] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 184–199.
 [32] G. Riegler, M. Rüther, and H. Bischof, “ATGV-Net: Accurate depth super-resolution,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 268–284.
 [33] X. Song, Y. Dai, and X. Qin, “Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network,” in Asian Conference on Computer Vision. Springer, 2016, pp. 360–376.
 [34] C. S. Weerasekera, T. Dharmasiri, R. Garg, T. Drummond, and I. Reid, “Just-in-Time Reconstruction: Inpainting Sparse Maps using Single View Depth Predictors as Priors,” in Proc. of the 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
 [35] N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller, “Semantically guided depth upsampling,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9796 LNCS, pp. 37–48, 2016.
 [36] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang, “The ApolloScape Dataset for Autonomous Driving,” 2018. [Online]. Available: http://arxiv.org/abs/1803.06184
 [37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic Scene Completion from a Single Depth Image,” in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 190–198.
 [38] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3234–3243.
 [39] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual Worlds as Proxy for Multi-Object Tracking Analysis,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4340–4349.
 [40] V. Guizilini and F. Ramos, “Large-scale 3D scene reconstruction with Hilbert maps,” in Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2016.
 [41] ——, “Learning to reconstruct 3D structures for occupancy mapping,” in Proceedings of Robotics: Science and Systems (RSS), 2017.
 [42] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
 [43] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, no. 7, pp. 59–72, 2007.
 [44] L. Zwald and S. Lambert-Lacroix, “The BerHu penalty and the grouped effect,” Jul. 2012. [Online]. Available: http://arxiv.org/abs/1207.6868
 [45] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “TensorFlow: A system for large-scale machine learning,” in Proc. of the 12th USENIX Symposium on Operating Systems Design and Implementation, vol. 16, 2016, pp. 265–283.
 [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
 [47] I. Goodfellow, Y. Bengio, and A. Courville, “Deep Learning,” Nature Methods, vol. 13, no. 1, p. 35, 2015.
 [48] D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” International Journal of Computer Vision, vol. 47, no. 1–3, pp. 7–42, 2002.
 [49] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
 [50] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3061–3070.
 [51] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 07-12-June, 2015, pp. 1119–1127.
 [52] M. Mancini, G. Costante, P. Valigi, and T. A. Ciarfuglia, “Fast Robust Monocular Depth Estimation for Obstacle Detection with Fully Convolutional Networks,” in International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4296–4303.
 [53] Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in Proc. of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6647–6655.
 [54] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.