Monocular Depth Estimation by Learning from Heterogeneous Datasets
Depth estimation provides essential information for autonomous driving and driver assistance. Monocular Depth Estimation is especially interesting from a practical point of view, since a single camera is cheaper than many other options and avoids the need for the continuous calibration strategies required by stereo-vision approaches. State-of-the-art methods for Monocular Depth Estimation are based on Convolutional Neural Networks (CNNs). A promising line of work consists of introducing additional semantic information about the traffic scene when training CNNs for depth estimation. In practice, this means that the depth data used for CNN training is complemented with images having pixel-wise semantic labels, which are usually difficult to annotate (e.g. crowded urban images). Moreover, so far it has been common practice to assume that the same raw training data is associated with both types of ground truth, i.e., depth and semantic labels. The main contribution of this paper is to show that this hard constraint can be circumvented, i.e., that we can train CNNs for depth estimation by leveraging depth and semantic information coming from heterogeneous datasets. To illustrate the benefits of our approach, we combine the KITTI depth and Cityscapes semantic segmentation datasets, outperforming state-of-the-art results on Monocular Depth Estimation.
Depth estimation provides essential information at all levels of driving assistance and automation. Active sensors such as RADAR and LIDAR provide sparse depth information, and post-processing techniques can be used to obtain dense depth information from such sparse data. In practice, active sensors are calibrated with cameras to perform scene understanding based on both depth and semantic information. Image-based object detection, classification and segmentation, as well as pixel-wise semantic segmentation, are key technologies providing such semantic information.
Since a camera sensor is often involved in driving automation, obtaining depth directly from it is an appealing approach, and thus has been a traditional topic since the very beginning of ADAS (Advanced Driver-Assistance Systems) development. Vision-based depth estimation approaches can be broadly divided into stereoscopic and monocular camera-based settings. The former includes attempts to mimic binocular human vision. Nowadays, there are robust methods for dense depth estimation based on stereo vision, able to run in real-time. However, due to operational conditions and mounting constraints, a stereo camera setup can lose calibration. This can compromise depth accuracy and may require applying on-the-fly calibration procedures [8, 9].
On the other hand, monocular depth estimation avoids the calibration problem. Compared to the stereo setting, one disadvantage is the lack of scale information, since stereo cameras allow direct estimation of scale by triangulation. However, there are other depth cues, such as occlusion and semantic object size, which are successfully exploited by the human visual system. These cues can be leveraged in monocular vision for estimating scale and distances to traffic participants. Hence, monocular depth estimation can indeed support detection and tracking algorithms [11, 12, 13]. Dense monocular depth estimation is also of great interest since higher-level 3D scene representations, such as the well-known Stixels [14, 15] or semantic Stixels, can be computed on top of it. Previous attempts to address dense monocular depth estimation rely on either super-pixels or pixel-wise semantic segmentation; in both cases, however, they use hand-crafted features and are applied to photos mainly dominated by static traffic scenes.
State-of-the-art approaches to monocular dense depth estimation rely on CNNs [19, 20, 21, 22]. Recent work [23, 24] has shown that combining depth and pixel-wise semantic segmentation in the training dataset can improve accuracy. These methods require that each training image has a per-pixel association between depth and semantic class ground truth, e.g. obtained from an RGB-D camera. Unfortunately, creating such datasets requires considerable effort, especially for outdoor scenarios, and currently no such dataset is publicly available for autonomous driving. Instead, there are several popular datasets, such as KITTI, containing depth, and Cityscapes, containing semantic segmentation labels; however, none of them contains both depth and semantic ground truth for the same set of RGB images.
Depth ground truth usually relies on a LIDAR calibrated with a camera system, and the manual annotation of pixel-wise semantic classes is quite time consuming (e.g. 60-90 minutes per image). Furthermore, in future systems the LIDAR may be replaced by four-plane LIDARs, whose depth cues are far sparser, which makes accurate monocular depth estimation even more relevant.
Accordingly, in this paper we propose a new method to train CNNs for monocular depth estimation by leveraging depth and semantic information from multiple heterogeneous datasets. In other words, the training process can benefit from a dataset containing only depth ground truth for a set of images, together with a different dataset that only contains pixel-wise semantic ground truth (for a different set of images). In Sect. II we review the state-of-the-art on monocular dense depth estimation, whereas in Sect. III we describe our proposed method in more detail. Sect. IV shows quantitative results for the KITTI dataset, and qualitative results for KITTI and Cityscapes datasets. In particular, by combining KITTI depth and Cityscapes semantic segmentation datasets, we show that the proposed approach can outperform the state-of-the-art in KITTI (see Fig. 1). Finally, in Sect. V we summarize the presented work and draw possible future directions.
II. Related Work
First attempts to perform monocular dense depth estimation relied on hand-crafted features [18, 27]. However, as in many other Computer Vision tasks, CNN-based approaches are currently dominating the state-of-the-art, and so our approach falls into this category too.
Eigen et al. proposed a CNN for coarse-to-fine depth estimation. Liu et al. presented a network architecture with a CRF-based loss layer which allows end-to-end training. Laina et al. developed an encoder-decoder CNN with a reverse Huber loss layer. Cao et al. discretized the ground-truth depth into several bins (classes) for training an FCN-residual network that predicts these classes pixel-wise, followed by a CRF post-processing enforcing local depth coherence. Fu et al. proposed a hybrid model between classification and regression to predict high-resolution discrete depth maps and low-resolution continuous depth maps simultaneously. Overall, we share with these methods the use of CNNs, as well as tackling the problem as a combination of classification and regression when using depth ground truth; but our method also leverages pixel-wise semantic segmentation ground truth during training (not needed at test time) with the aim of producing a more accurate model, as confirmed in Sect. IV.
There are previous methods using depth and semantics during training. The motivation behind them is the importance of object borders and, to some extent, object-wise consistency in both tasks (depth estimation and semantic segmentation). Arsalan et al. presented a CNN consisting of two separate branches, each one responsible for minimizing the corresponding semantic segmentation or depth estimation loss during training. Jafari et al. introduced a CNN that fuses state-of-the-art results for depth estimation and semantic labeling by balancing the cross-modality influences between the two cues. Both methods assume that pixel-wise depth and semantic class ground truth is available for each training RGB image. Training and testing are performed in indoor scenarios, where an integrated RGB-D sensor is used (valid neither for outdoor scenarios nor for distances beyond 5 meters). In fact, the lack of publicly available datasets with such joint ground truth has limited the application of these methods outdoors. In contrast, a key aspect of our proposal is its ability to leverage disjoint depth and semantic ground truth from different datasets, which has allowed us to address driving scenarios.
The works introduced so far rely on supervised deep training, and thus eventually require abundant high-quality depth ground truth. Therefore, alternative unsupervised and semi-supervised approaches have also been proposed, which rely on stereo image pairs for training a disparity estimator instead of a depth estimator; at testing time, however, the estimation is done from monocular images. Garg et al. trained a CNN where the loss function describes the photometric reconstruction error between a rectified stereo pair of images. Godard et al. used a more complex loss function with additional terms for smoothing and enforcing left-right consistency to improve convergence during CNN training. Kuznietsov et al. proposed a semi-supervised approach to estimate inverse depth maps, combining an appearance matching loss similar to those above with a supervised objective based on sparse ground truth depth coming from LIDAR; this additional supervision helps to improve estimation accuracy. All these approaches have been challenged with driving data and represent the current state-of-the-art.
Note that autonomous driving is pushing forward 3D mapping, where LIDAR sensing plays a key role; thus, calibrated depth and RGB data are regularly generated. Therefore, although unsupervised and semi-supervised approaches are appealing, at the moment we have decided to assume that depth ground truth is available, focusing instead on incorporating RGB images with pixel-wise class ground truth during training. Overall, our method outperforms the state-of-the-art (Sect. IV).
III. Proposed Approach for Monocular Depth Estimation
III-A. Overall Training Strategy
As mentioned, in contrast to previous works using depth and semantic information, we propose to leverage heterogeneous datasets to train a single CNN for depth estimation; i.e. training can rely on one dataset having only depth ground truth and a different dataset having only pixel-wise semantic labels. To achieve this, we divide the training process into two phases. In the first phase, we use a multi-task learning approach for pixel-wise depth and semantic CNN-based classification (Fig. 2); this means that at this stage depth is discretized, a strategy that has been shown to be useful for supporting instance segmentation. In the second phase, we focus on depth estimation: in particular, we add CNN layers that perform regression, taking the depth classification layers as input (Fig. 3).
Multi-task learning has been shown to improve the performance of different visual models (e.g. combining semantic segmentation and surface normal prediction tasks in indoor scenarios, or combining object detection and attribute prediction on PASCAL VOC images). We use a network architecture consisting of one common sub-net followed by two additional sub-net branches. We denote the layers in the common sub-net as DSC (depth-semantic classification) layers, the depth-specific sub-net as DC layers and the semantic segmentation-specific sub-net as SC layers. At training time we apply a conditional calculation of gradients during back-propagation, which we call conditional flow. More specifically, the common sub-net is always active, but the origin of each data sample determines which specific sub-net branch is also active during back-propagation (Fig. 2). We alternate batches of depth and semantic ground truth samples.
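The conditional flow just described can be sketched as a simple dispatch rule: the dataset a batch comes from decides which branch receives gradients. The sub-net names follow the text; the dispatch logic itself is our illustrative assumption (the paper's actual implementation is a MatConvNet modification):

```python
def active_subnets(batch_source):
    """Return the sub-nets that receive gradients for a batch of the
    given origin: the common trunk (DSC) is always active, plus the
    branch matching the batch's ground-truth type."""
    if batch_source == "depth":
        return ["DSC", "DC"]      # trunk + depth classification branch
    if batch_source == "semantic":
        return ["DSC", "SC"]      # trunk + semantic segmentation branch
    raise ValueError(batch_source)

def alternating_batches(depth_batches, semantic_batches):
    """Interleave depth and semantic ground-truth batches, as the
    training strategy alternates between the two."""
    for d, s in zip(depth_batches, semantic_batches):
        yield d, "depth"
        yield s, "semantic"
```

In a training loop, `active_subnets` would gate which parameters are updated after each back-propagation step.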
Phase one mainly aims at obtaining a depth model (DSC+DC). Incorporating semantic information provides cues to preserve depth ordering and per-object depth coherence (DSC+SC). Phase two then uses the pre-trained depth model (DSC+DC), which we further extend with regression layers to obtain a depth estimator, denoted DSC-DRN (Fig. 3). We use standard losses for the classification and regression tasks, i.e. cross-entropy and L1 losses respectively.
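The two standard losses can be written down directly. This is a minimal per-pixel numpy sketch; any class weighting or masking of invalid pixels is omitted, since the text does not specify it:

```python
import numpy as np

def cross_entropy(class_probs, target_class):
    """Phase-one classification loss (depth bins / semantic classes)
    for one pixel; `class_probs` is a softmax output over classes."""
    return float(-np.log(class_probs[target_class]))

def l1_loss(pred_depth, gt_depth):
    """Phase-two regression loss on continuous depth values."""
    pred = np.asarray(pred_depth, float)
    gt = np.asarray(gt_depth, float)
    return float(np.mean(np.abs(pred - gt)))
```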
III-B. Network Architecture
Our CNN architecture is inspired by the FCNDROPOUT of Ros, which follows a convolution-deconvolution scheme. Fig. 4 details our overall CNN architecture. First, we define a basic set of four consecutive layers: Convolution, Batch Normalization, Dropout and ReLU. We build convolutional blocks (ConvBlk) based on this basic set; a block chains between two and four such sets in a pipeline, and each block is followed by an average-pooling layer. Deconvolutional blocks (DeconvBlk) are based on one deconvolution layer together with skip-connection features that provide more scope to the learning process. Note that, to achieve better localization accuracy, these features originate from the common layers (DSC) and are bypassed to both the depth classification (DC) branch and the semantic segmentation (SC) branch. In the same way, we introduce skip connections between the ConvBlk and DeconvBlk of the added regression layers.
At phase 1, the network comprises 9 ConvBlk and 11 DeconvBlk elements. At phase 2, only the depth-related layers are active. By adding 2 ConvBlk with 2 DeconvBlk elements to the (DSC+DC) branch we obtain the (DSC-DRN) network. Here, the weights of the (DSC+DC)-network part are initialized from phase 1. Note that at testing time only the depth estimation network (DSC-DRN) is required, consisting of 9 ConvBlk and 7 DeconvBlk elements.
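The block structure above can be summarized schematically (a sketch of the layer ordering only; all hyper-parameters such as kernel sizes, channel counts and dropout rates are omitted because the text does not give them):

```python
def conv_blk(num_sets):
    """One ConvBlk: `num_sets` chained [Conv, BatchNorm, Dropout, ReLU]
    sets followed by average pooling, as described in the text
    (blocks contain between two and four sets)."""
    assert 2 <= num_sets <= 4
    layers = []
    for _ in range(num_sets):
        layers += ["Conv", "BatchNorm", "Dropout", "ReLU"]
    return layers + ["AvgPool"]
```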
IV. Experimental Results
We evaluate our approach on the KITTI dataset, following the commonly used Eigen et al. split for depth estimation. It consists of 22,600 training images and 697 testing images, i.e. RGB images with associated LIDAR data. To generate dense depth ground truth for each RGB image we follow Premebida et al. We use half down-sampled images for training and testing. Moreover, we use 2,975 images from the Cityscapes dataset with per-pixel semantic labels.
IV-B. Implementation Details
We implement and train our CNN using MatConvNet, which we modified to include the conditional flow back-propagation. We use batch sizes of 10 and 5 images for the depth and semantic branches, respectively, and the ADAM solver with pre-selected momentum, weight decay and solver parameters. Smoothing via L0 gradient minimization is applied as pre-processing for the RGB images. We include data augmentation consisting of small image rotations, horizontal flips, blur, contrast changes, as well as salt & pepper, Gaussian, Poisson and speckle noises. For depth classification we use a linear binning of 24 levels over the range [1,80) m.
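The linear depth binning can be made concrete as follows. This is a sketch under the stated range and bin count; the clipping behavior at the range ends is our assumption, as is using the bin center for the inverse mapping:

```python
import numpy as np

def depth_to_bin(depth_m, num_bins=24, d_min=1.0, d_max=80.0):
    """Map a metric depth (m) to one of `num_bins` uniform classes
    covering [d_min, d_max), as used for phase-one classification."""
    width = (d_max - d_min) / num_bins
    d = np.clip(depth_m, d_min, d_max - 1e-6)  # keep inside [d_min, d_max)
    return int((d - d_min) // width)

def bin_to_depth(bin_idx, num_bins=24, d_min=1.0, d_max=80.0):
    """Bin center: a coarse metric depth for a predicted class."""
    width = (d_max - d_min) / num_bins
    return d_min + (bin_idx + 0.5) * width
```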
We compare our approach to supervised methods such as Liu et al. and Cao et al., unsupervised methods such as Garg et al. and Godard et al., and the semi-supervised method of Kuznietsov et al. Liu et al., Cao et al. and Kuznietsov et al. did not release their trained models, but they reported their results on the same Eigen et al. split that we use. Garg et al. and Godard et al. provide a Caffe model and a TensorFlow model respectively, trained on our same split (Eigen et al.'s KITTI split comes from stereo pairs). We have followed the authors' instructions to run these models for estimating disparity, computing the final depth from the camera parameters of the KITTI stereo rig (focal length and baseline). In addition to the KITTI data, Godard et al. also added 22,973 stereo images coming from Cityscapes, while we use 2,975 images from the Cityscapes semantic segmentation challenge (19 classes). Quantitative results are shown in Table I for two different distance ranges, namely [1,50] m (cap 50m) and [1,80] m (cap 80m). As in previous works, we follow the metrics proposed by Eigen et al. Note how our method outperforms the state-of-the-art models in all metrics but one (where it is second best), in both distance ranges.
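Turning the disparity predicted by the Garg/Godard models into metric depth uses the standard stereo relation depth = f·B / disparity. The KITTI-like numbers in the usage comment are illustrative, not the exact rig calibration:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth (m) from disparity (px), given the rig's focal length (px)
    and stereo baseline (m): depth = f * B / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative KITTI-like values (not the exact calibration):
# disparity_to_depth(50.0, 721.0, 0.54) gives roughly 7.8 m
```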
In Table I we also assess different aspects of our model. In particular, we compare our depth estimation results with (DSC-DRN) and without the support of the semantic segmentation task. In the latter case, we distinguish two scenarios. In the first, denoted DC-DRN, we discard the SC sub-net from the first phase, so that we first train the depth classifier and later add the regression layers to retrain the network. In the second, denoted DRN, we train the depth branch directly for regression, i.e. without pre-training a depth classifier. We see that for both cap 50m and cap 80m, DC-DRN and DRN are on par; however, we obtain the best performance when we introduce the semantic segmentation task during training. Without the semantic information, DC-DRN and DRN do not reach the performance of DSC-DRN. This suggests that our approach can exploit the additional information provided by the semantic segmentation task to learn a better depth estimator.
Table I (excerpt). The first five metric columns are error measures (lower is better); the last three are threshold accuracies (higher is better):

| Method | cap (m) | Abs Rel | Sq Rel | RMSE | RMSE log | Scale inv. log | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| Liu fine-tune | 80 | 0.217 | 1.841 | 6.986 | 0.289 | - | 0.647 | 0.882 | 0.961 |
| Godard – K | 80 | 0.155 | 1.667 | 5.581 | 0.265 | 0.066 | 0.798 | 0.920 | 0.964 |
| Godard – K + CS | 80 | 0.124 | 1.240 | 5.393 | 0.230 | 0.052 | 0.855 | 0.946 | 0.975 |
| Godard – K | 50 | 0.149 | 1.235 | 4.823 | 0.259 | 0.065 | 0.800 | 0.923 | 0.966 |
| Godard – K + CS | 50 | 0.117 | 0.866 | 4.063 | 0.221 | 0.052 | 0.855 | 0.946 | 0.975 |
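The metrics reported in Table I follow Eigen et al.; a minimal numpy sketch over valid ground-truth pixels:

```python
import numpy as np

def eigen_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMSE, RMSE log and the threshold accuracies
    (max(pred/gt, gt/pred) < 1.25**k) proposed by Eigen et al."""
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    sq_rel = float(np.mean((pred - gt) ** 2 / gt))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    rmse_log = float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = tuple(float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3))
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```

A perfect prediction yields zero for all error measures and 1.0 for every threshold accuracy.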
Fig. 5 shows qualitative results on KITTI. Note how well relative depth is estimated, and how clearly vehicles, pedestrians, trees, poles and fences can be distinguished. Fig. 6 shows similar results for Cityscapes, illustrating generalization, since the model was trained on KITTI. In this case, images are resized at testing time to the KITTI image size, and the result is resized back to the Cityscapes image size using bilinear interpolation.
V. Conclusion
We have presented a method to leverage depth and semantic ground truth from different datasets for training a CNN-based depth-from-mono estimation model. To the best of our knowledge, this allows addressing outdoor driving scenarios with such a training paradigm (i.e. depth and semantics) for the first time. In order to validate our approach, we have trained a CNN using depth ground truth from the KITTI dataset as well as pixel-wise semantic class ground truth from the Cityscapes dataset. Quantitative results on standard metrics show that the proposed approach improves performance, even yielding new state-of-the-art results. As future work we plan to incorporate temporal coherence.
Acknowledgments
Antonio M. López acknowledges the Spanish project TIN2017-88709-R (Ministerio de Economía, Industria y Competitividad), the Spanish DGT project SPIP2017-02237, the Generalitat de Catalunya CERCA Program and its ACCIO agency.
-  C. Premebida, J. Carreira, J. Batista, and U. Nunes, “Pedestrian detection combining RGB and dense LIDAR data,” in IROS, 2014.
-  F. Yang and W. Choi, “Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers,” in CVPR, 2016.
-  Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, “Traffic-sign detection and classification in the wild,” in CVPR, 2016.
-  K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in ICCV, 2017.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017.
-  H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE T-PAMI, vol. 30, no. 2, pp. 328–341, 2008.
-  D. Hernández, A. Chacón, A. Espinosa, D. Vázquez, J. Moure, and A. López, “Embedded real-time stereo estimation via semi-global matching on the GPU,” Procedia Comp. Sc., vol. 80, pp. 143–153, 2016.
-  T. Dang, C. Hoffmann, and C. Stiller, “Continuous stereo self-calibration by camera parameter tracking,” IEEE T-IP, vol. 18, no. 7, pp. 1536–1550, 2009.
-  E. Rehder, C. Kinzig, P. Bender, and M. Lauer, “Online stereo camera calibration from scratch,” in IV, 2017.
-  J. Cutting and P. Vishton, “Perceiving layout and knowing distances: The integration, relative potency, and contextual use of different information about depth,” in Handbook of perception and cognition - Perception of space and motion, W. Epstein and S. Rogers, Eds. Academic Press, 1995.
-  D. Ponsa, A. López, F. Lumbreras, J. Serrat, and T. Graf, “3D vehicle sensor based on monocular vision,” in ITSC, 2005.
-  D. Hoiem, A. Efros, and M. Hebert, “Putting objects in perspective,” IJCV, vol. 80, no. 1, pp. 3–15, 2008.
-  D. Cheda, D. Ponsa, and A. López, “Pedestrian candidates generation using monocular cues,” in IV, 2012.
-  H. Badino, U. Franke, and D. Pfeiffer, “The stixel world - a compact medium level representation of the 3D-world,” in DAGM, 2009.
-  D. Hernández, L. Schneider, A. Espinosa, D. Vázquez, A. López, U. Franke, M. Pollefeys, and J. Moure, “Slanted stixels: Representing SF’s steepest streets,” in BMVC, 2017.
-  L. Schneider, M. Cordts, T. Rehfeld, D. Pfeiffer, M. Enzweiler, U. Franke, M. Pollefeys, and S. Roth, “Semantic stixels: Depth is not enough,” in IV, 2016.
-  A. Saxena, M. Sun, and A. Ng., “Make3D: Learning 3D scene structure from a single still image,” IEEE T-PAMI, vol. 31, no. 5, pp. 824–840, 2009.
-  B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in CVPR, 2010.
-  Y. Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE T-CSVT, 2017.
-  H. Fu, M. Gong, C. Wang, and D. Tao, “A compromise principle in deep monocular depth estimation,” arXiv:1708.08267, 2017.
-  C. Godard, O. Mac Aodha, and G. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in CVPR, 2017.
-  Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in CVPR, 2017.
-  A. Mousavian, H. Pirsiavash, and J. Košecká, “Joint semantic segmentation and depth estimation with deep convolutional networks,” in 3DV, 2016.
-  O. Jafari, O. Groth, A. Kirillov, M. Yang, and C. Rother, “Analyzing modular cnn architectures for joint depth prediction and semantic segmentation,” in ICRA, 2017.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” IJRR, vol. 32, no. 11, pp. 1231–1237, 2013.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
-  L. Ladicky, J. Shi, and M. Pollefeys, “Pulling things out of perspective,” in CVPR, 2014.
-  D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in NIPS, 2014.
-  F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE T-PAMI, vol. 38, no. 10, pp. 2024–2039, 2016.
-  I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 3DV, 2016.
-  R. Garg, V. Kumar, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in ECCV, 2016.
-  J. Uhrig, M. Cordts, U. Franke, and T. Brox, “Pixel-level encoding and depth layering for instance-level semantic labeling,” in GCPR, 2016.
-  I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch networks for multi-task learning,” in CVPR, 2016.
-  G. Ros, “Visual scene understanding for autonomous vehicles: understanding where and what,” Ph.D. dissertation, Comp. Sc. Dpt. at Univ. Autònoma de Barcelona, 2016.
-  A. Vedaldi and K. Lenc, “MatConvNet – convolutional neural networks for MATLAB,” in ACM-MM, 2015.
-  L. Xu, C. Lu, Y. Xu, and J. Jia, “Image smoothing via l0 gradient minimization,” ACM Trans. on Graphics, vol. 30, no. 6, pp. 174:1–174:12, 2011.
-  T. Zhou, M. Brown, N. Snavely, and D. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, 2017.