Enforcing geometric constraints of virtual normal for depth prediction
Abstract
Monocular depth prediction plays a crucial role in understanding D scene geometry. Although recent methods have achieved impressive progress in evaluation metrics such as the pixelwise relative error, most methods neglect the geometric constraints in the 3D space. In this work, we show the importance of the highorder 3D geometric constraints for depth prediction. By designing a loss term that enforces one simple type of geometric constraints, namely, virtual normal directions determined by randomly sampled three points in the reconstructed 3D space, we can considerably improve the depth prediction accuracy. Significantly, the byproduct of this predicted depth being sufficiently accurate is that we are now able to recover good 3D structures of the scene such as the point cloud and surface normal directly from the depth, eliminating the necessity of training new submodels as was previously done. Experiments on two benchmarks: NYU DepthV2 and KITTI demonstrate the effectiveness of our method and stateoftheart performance. Code is available at:
1 Introduction
Monocular depth prediction aims to predict distances between scene objects and the camera from a single monocular image. It is a critical task for understanding the 3D scene, such as recognizing a 3D object and parsing a 3D scene.
Although the monocular depth prediction is an illposed problem because many 3D scenes can be projected to the same 2D image, many deep convolutional neural networks (DCNN) based methods [7, 8, 12, 14, 24, 27, 35] have achieved impressive results by using a large amount of labelled data, thus taking advantage of prior knowledge in labelled data to solve the ambiguity.
These methods typically formulate the optimization problem as either pointwise regression or classification. That is, with the i.i.d. assumption, the overall loss is summing over all pixels. To improve the performance, some endeavours have been made to employ other constraints besides the pixelwise term. For example, a continuous conditional random field (CRF) [28] is used for depth prediction, which takes pairwise information into account. Other highorder geometric relations [9, 31] are also exploited, such as designing a gravity constraint for local regions [9] or incorporating the depthtosurfacenormal mutual transformation inside the optimization pipeline [31]. Note that, for the above methods, almost all the geometric constraints are ‘local’ in the sense that they are extracted from a small neighborhood in either 2D or 3D. Surface normal is ‘local’ by nature as it is defined by the local tangent plane. As the ground truth depth maps of most datasets are captured by consumerlevel sensors, such as the Kinect, depth values can fluctuate considerably. Such noisy measurement would adversely affect the precision and subsequently the effectiveness of those local constraints inevitably. Moreover, local constraints calculated over a small neighborhood have not fully exploited the structure information of the scene geometry that may be possibly used to boost the performance.
To address these limitations, here we propose a more stable geometric constraint from a global perspective to take longrange relations into account for predicting depth, termed virtual normal. A few previous methods already made use of 3D geometric information in depth estimation, almost all of which focus on using surface normal. We instead reconstruct the 3D point cloud from the estimated depth map explicitly. In other words, we generate the 3D scene by lifting each RGB pixel in the 2D image to its corresponding 3D coordinate with the estimated depth map. This 3D point cloud serves as an intermediate representation. With the reconstructed point cloud, we can exploit many kinds of 3D geometry information, not limited to the surface normal. Here we consider the longrange dependency in the 3D space by randomly sampling three noncolinear points with the large distance to form a virtual plane, of which the normal vector is the proposed virtual normal (VN). The direction divergence between groundtruth and predicted VN can serve as a highorder 3D geometry loss. Owing to the longrange sampling of points, the adverse impact caused by noises in depth measurement is much alleviated compared to the computation of the surface normal, making VN significantly more accurate. Moreover, with randomly sampling we can obtain a large number of such constraints, encoding the global 3D geometric. Second, by converting estimated depth maps from images to 3D point cloud representations it opens many possibilities of incorporating algorithms for 3D point cloud processing to 2D images and 2.5D depth processing. Here we show one instance of such possibilies.
By combining the highorder geometric supervision and the pixelwise depth supervision, our network can predict not only an accurate depth map but also the highquality 3D point cloud, subsequently other geometry information such as the surface normal. It is worth noting that we do not use a new model or introduce network branches for estimating the surface normal. Instead it is computed directly from the reconstructed point cloud. The second row of Fig. 1 demonstrates an example of our results. By contrast, although the previously stateoftheart method [18] predicts the depth with low errors, the reconstructed point cloud is far away from the original shape (see, e.g., left part of ‘sofa’). The surface normal also contains many errors. We are probably the first to achieve highquality monocular depth and surface normal prediction with a single network.
Experimental results on NYUDv2 [36] and KITTI [13] datasets demonstrate stateoftheart performance of our method. Besides, when training with the lightweight backbone, MobileNetV2 [34], our framework provides a better tradeoff between network parameters and accuracy. Our method outperforms other stateoftheart realtime systems by up to % with a comparable number of network parameters. Furthermore, from the reconstructed point cloud, we directly calculate the surface normal, with a precision being on par with that of specific DCNN based surface normal estimation methods.
In summary, our main contributions of this work are as follow.

We demonstrate the effectiveness of enforcing a highorder geometric constraint in the 3D space for the depth prediction task. Such global geometry information is instantiated with a simple yet effective concept termed virtual normal (VN). By enforcing a loss defined on VNs, we demonstrate the importance of 3D geometry information in depth estimation, and design a simple loss to exploit it.

Our method can reconstruct highquality 3D scene point clouds, from which other 3D geometry features can be calculated, such as the surface normal. In essence, we show that for depth estimation, one should not consider the information represented by depth only. Instead, converting depth into 3D point clouds and exploiting 3D geometry is likely to improve many tasks including depth estimation.

Experimental results on NYUDV2 and KITTI illustrate that our method achieves stateoftheart performance.
1.1 Related Work
Monocular Depth Prediction. Depth prediction from images is a longstanding problem. Previous work can be divided into active methods and passive methods. The former ones use the assistant optical information for prediction, such as coded patterns [41], while the latter ones completely focus on image contents. Monocular depth prediction [2, 7, 8, 28, 43] has been extensively studied recently. As limited geometric information can be directly extracted from the monocular image, it is essentially an illposed problem. Recently, owing to the structural features from very deep convolution neural network, such as ResNet [16], various DCNNbased methods learn to predict depth with deep CNN features. Fu et al. [12] proposed an encoderdecoder network, which extracts multiscale features from the encoder and is trained in an endtoend manner without iterative refinement. They achieved stateoftheart performance on several datasets. Jiao et al. [19] proposed an attentiondriven loss, which merges the semantic priors to improve the prediction precision on unbalanced distribution datasets.
Most previous methods only adopted the pixelwise depth supervision to train a network. By contrast, Liu et al. [28] combined DCNN with the continuous conditional random field (CRF) to exploit consistency information of neighbouring pixels. CRF establishes a pairwise constraint for local regions. Furthermore, several highorder constraints are investigated. Chen et al. [5] applied the generative adversarial training to lead the network to learn a contextaware and patchlevel loss automatically. Note that most of these methods directly work with the depth, instead of in the 3D space.
Surface Normal. Surface normal is an important geometry information for 3D scene understanding. Several datadriven methods [7, 8, 10, 11, 39, 42] have achieved promising results. Eigen et al. [7] proposed a CNN with different output channels to directly predict depth map, surface normal and semantic labels. Bansal et al. [1] proposed a twostream network to predict the surface normal first, which is further joined with the input image to learn the pose. Note that most of these methods formulate surface normal prediction and depth prediction as multiple different tasks.
2 Our Method
Our approach resolves the monocular depth prediction and reconstructs the highquality scene 3D point cloud from the predicted depth at the same time. The pipeline is illustrated in Fig. 2.
We take an RGB image as the input of an encoderdecoder network and predict the depth map . From the , the 3D scene point cloud can be reconstructed. The ground truth point cloud is reconstructed from .
We enforce two types of supervision for training the network.We firstly follow standard monocular depth prediction methods to enforce pixelwise depth supervision over with . With the reconstructed point clouds, we then align the spatial relationship between the and the using the proposed virtual normal.
When the network is well trained, we not only obtain accurate depth map but also highquality point clouds. From the reconstructed point clouds, other 3D features can be directly calculated, such as the surface normal.
2.1 Highorder Geometric Constraints
Surface Normal. The surface normal is an important ‘local’ feature for many pointcloud based applications such as registration [33] and object detection [17, 15]. It appears to be a promising 3D cue for improving depth prediction. One can apply the angular difference between groundtruth and calculated surface normal to be a geometric constraint. One major issue of this approach is, when computing surface normal from either a depth map or 3D point cloud, it is sensitive to noise. Moreover, surface normal only considers shortrange local information.
We follow [21] to calculate the surface normal. It assumes that local 3D points locate in the same plane, of which the normal vector is the surface normal. In practice groundtruth depth maps are usually captured by a consumerlevel sensor with limited precision, so depth maps are contaminated by noise. The reconstructed point clouds in the local region can vary considerably due to noises as well as the size of local patch for sampling (Fig. 3(a)). We experiment on the NYUDV2 dataset to test the robustness of the surface normal computation. Five different sampling sizes around the target pixel are employed to sample points, which are used to calculate its surface normal. The sample area is . The Mean Difference Error (Mean) [7] between calculated surface normals is evaluated. From Fig. 3(b), we can learn that the surface normal varies significantly with different sampling sizes. For example, the Mean between 33 and 1111 is 22°. Such unstable surface normal negatively affects its effectiveness for learning. Likewise, other 3D geometric constraints demonstrating the ‘local’ relative relations also encounter this problem.
Virtual Normal. In order to enforce robust highorder geometric supervision in the 3D space, we propose the virtual normal (VN) to establish 3D geometric connections between regions in a much larger range. The point cloud can be reconstructed from the depth based on the pinhole camera model. For each pixel , the 3D location in the world coordinate can be obtained by the prospective projection. We set the camera coordinate as the world coordinate. Then the 3D coordinate is denoted as follows:
(1) 
where is the depth. and are the focal length along the and coordinate axis respectively. and are the 2D coordinate of the optical center.
We randomly sample groups points from the depth map, with three points in each group. The corresponding 3D points are . Three points in a group are restricted to be noncolinear based on the restriction . is the angle between two vectors.
(2) 
where are hyperparameters. In all experiments, we set ,
In order to sample more longrange points, which have ambiguous relative locations in 3D space, we perform longrange restriction for each group in .
(3) 
where in our experiments.
Therefore, three 3D points in each group can establish a plane. We compute the normal vector of the plane to encode geometric relations, which can be written as
(4) 
where is the normal vector of the virtual plane .
Robustness to Depth Noise. Compared with local surface normal, our virtual normal is more robust to noise. In Fig. 4, we sample three 3D points with large distance. and are assumed to locate on the plane, is on the axis. When varies to , the direction of the virtual normal changes from to . is the intersection point between plane and axis. Because of restrictions and , the difference between and is usually very small, which is simple to show:
(5) 
Furthermore, we conduct a simple experiment to verify the robustness of our proposed virtual normal against data noise. We create an unit sphere and then add gaussian noise to simulate the ideal noisefree data and the real noisy data (see Fig. (a)a). We then sample 100K groups points from the noisy surface and the ideal one to compute the virtual normal respectively, while 100K points are sampled to compute the surface normal as well. For the gaussian noise, we use different deviations to simulate different noise levels by varying deviation , and the mean being . The experimental results are illustrated in Fig. (c)c. We can learn that our proposed virtual normal is much more robust to the data noise than the surface normal. Other local constraints are also sensitive to data noise.
Most ‘local’ geometric constraints, such as the surface normal, actually enforcing the firstorder smoothness of the surface but are less useful for helping the depth map prediction. In contrast, the proposed VN establishes longrange relations in the 3D space. Compared with pairwise CRFs, VN encodes triplet based relations, thus being of high order.
Virtual Normal Loss. We can sample a large number of triplets and compute corresponding VNs. With the sampled VNs, we compute the divergence as the Virtual Normal Loss (VNL):
(6) 
where the is the number of valid sampling groups satisfying . In experiments we have employed online hard example mining.
Pixelwise Depth Supervision. We also use a standard pixelwise depth map loss. We quantize the realvalued depth and formulate the depth prediction as a classification problem instead of regression, and employ the crossentropy loss. In particular we follow [2] to use the weighted crossentropy loss (WCEL), with the weight being the information gain. See [2] for details.
To obtain the accurate depth map and recover highquality 3D information, we combine WCEL and VNL together to supervise the network output. The overall loss is:
(7) 
where is a tradeoff parameter, which is set to in all experiments to make the two terms roughly of the same scale.
3 Experiments
In this section, we conduct several experiments to compare ours against stateoftheart methods. We evaluate our methods on two datasets, NYUDV2 and KITTI.
3.1 Datasets
NYUDV2. The NYUDV2 dataset consists of 464 different indoor scenes, which are further divided into 249 scenes for training and 215 for testing. We randomly sample 29K images from the training set to form NYUDLarge. Note that DORN uses the whole training set, which is significantly larger than that what we use. Apart from the whole dataset, there are officially annotated 1449 images (NYUDSmall), in which 795 images are split for training and others are for testing. In the ablation study, we use the NYUDSmall data.
KITTI. The KITTI dataset contains over 93K outdoor images and depth maps with the resolution around . All images are captured on driving cars by stereo cameras and a Lidar. We test on 697 images from 29 scenes split by Eigen et al. [8], validate on 888 images, and train on about 23488 images from the remaining 32 scenes.
3.2 Implementation Details
The pretrained ResNeXt101 [40] model on ImageNet [6] is used as our backbone model. A polynomial decaying method with the base learning rate 0.0001 and the power of is applied for SGD. The weight decay and the momentum are set to 0.0005 and 0.9 respectively. Batch size is in our experiments. The model is trained for epochs on NYUDLarge and KITTI, and is trained for epochs on NYUDSmall in the ablation study. We perform the data augmentation on the training samples by the following methods. For NYUDV2, the RGB image and the depth map are randomly resized with ratio , randomly flipped in the horizon, and finally randomly cropped with the size for NYUDV2. The similar process is applied for KITTI but resizing with the ratio and cropping with . Note that the depth map should be scaled with the corresponding resizing ratio.
3.3 Evaluation Metrics
We follow previous methods [24] to evaluate the performance of monocular depth prediction quantitatively based on following metrics: mean absolute relative error (rel), mean error (), root mean squared error (rms) , root mean squared log error (rms (log)) and the accuracy under threshold ().
3.4 Comparison with Stateoftheart
In this section, we detail the comparison of our methods with stateoftheart methods.
Method  rel  log10  rms  

Lower is better  Higher is better  
Saxena et al. [35]    
Karsch et al. [20]        
Liu et al. [29]        
Ladicky et al. [23]        
Li et al. [25]  
Roy et al. [32]        
Liu et al. [28]  
Wang et al. [38]  
Eigen et al. [7]    
Chakrabarti [3]    
Li et al. [26]  
Laina et al. [24]  
DORN [12]  
Ours 
NYUDV2. In this experiment, we compare with other stateoftheart methods on the NYUDV2 dataset. Table 1 demonstrates that our proposed method outperforms other stateoftheart methods across all evaluation metrics significantly. Compare to DORN, we have improved the accuracy from to over all evaluation metrics that they report.
In addition to the quantitative comparison, we demonstrate some visual results between our method and the stateoftheart DORN in Fig. 6. Clearly, the predicted depth by the proposed method is much more accurate. The plane of ours is much smoother and has fewer errors (see the wall regions colored with red in the 1st, 2nd, and 3rd row). Furthermore, the last row in Fig. 6 manifests that our predicted depth is more accurate in the complicated scene. We have fewer errors in shelf and desk regions.
KITTI. In order to demonstrate that our proposed method can still reach the stateoftheart performance on outdoor scenes, we test our method on the KITTI dataset. Results in Table 2 show that our method has outperformed all other methods on all evaluation metrics except root mean square (rms) error. The rms error is only slightly behind that of DORN. Note that for outdoor scenes, the rms (log) error, instead of rms, is usually the metric of interest, in which ours is better.
3.5 Ablation Studies
In this section, we conduct several ablation studies to analyze the details of our approach.
Effectiveness of VNL. In this study, in order to prove the effectiveness of the proposed VNL we compare it with two types of pixelwise depth map supervision, a pairwise geometric supervision, and a highorder geometric supervision: 1) the ordinary crossentropy loss (CEL); 2) the loss (); 3) the surface normal loss (SNL); 4) the pairwise geometric loss (PL). We reconstruct the point cloud from the depth map and further recover the surface normal from the point cloud. The angular discrepancy between the ground truth and recovered surface normal is defined as the surface normal loss, which is a highorder geometric supervision in 3D space. The pairwise loss is the direction difference of two vectors in 3D, which are established by randomly sampling paired points in groundtruth and predicted point cloud. The loss function of PL is as follow,
(8) 
where and are paired points sampled from the ground truth and the predicted point cloud respectively. is the total number of pairs.
We also employ the longrange restriction for the paired points. Therefore, similar to VNL, PL can also be seen as a global geometric supervision in 3D space. The experimental results are reported in Table. 3. WCEL is the baseline for all following experiments.
Metrics  rel  log10  rms  

Pixelwise Depth Supervision  
CEL  
WCEL  
WCEL+L1  
Pixelwise Depth Supervision + Geometric Supervision  
WCEL+PL^{‡}  
WCEL+PL+VNL  
WCEL+SNL^{†}  
WCEL+VNL^{‡} (Ours) 

‘Local’ geometric supervision in 3D.

‘Global’ geometric supervision in 3D.
Firstly, we analyze the effect of pixelwise depth supervision for prediction performance. As WCE employs an weight in the CE loss, its performance is slightly better than that of CEL. However, when we enforce two pixelwise supervision (WCEL+L1) on the depth map, the performance cannot improve any more. Thus using two pixelwise loss terms does not help.
Secondly, we analyze the effectiveness of the supplementary 3D geometric constraint (PL, SNL, VNL). Compared with the baseline (WCEL), three supplementary 3D geometric constraints can promote the network performance with varying degrees. Our proposed VNL combining with WCEL has the best performance, which has improved the baseline performance by up to %.
Thirdly, we analyze the difference of three geometric constraints. As SNL can only exploit geometric relations of homogeneous local regions, its performance is the lowest among the three constraints over all evaluation metrics. Compared with SNL, since PL constrains the global geometric relations, its performance is clearly better. However, the performance of WCEL+PL is not as good as our proposed WCEL+VNL. When we further add our VNL on top of WCEL+PL, the precision can further be slightly improved and is comparable to WCEL+VNL. Therefore, although PL is a global geometric constraint in 3D, the pairwise constraint cannot encode as strong geometry information as our proposed VNL.
At last, in order to further demonstrate the effectiveness of VNL, we analyze the results of network trained with and without VNL supervision on the KITTI dataset. The visual comparison is shown in Fig. 7. One can see that VNL can improve the performance of the network in ambiguous regions. For example, the sign (1st row), the distant pedestrian (2nd row), and traffic light in the last row of the figure can demonstrate the effectiveness of the proposed VNL.
In conclusion, the geometric constraints in the 3D space can significantly boost the network performance. Moreover, the global and highorder constraints can enforce stronger supervision than the ‘local’ and pairwise ones in 3D space.
Impact of the Amount of Samples. Previously, we have proved the effectiveness of VNL. Here the impact of the size of samples for VNL is discussed. We sample six different sizes of point groups, 0K, 20K, 40K, 60K, and 80K and 100K, to establish VNL. ‘0K’ means that the model is trained without VNL supervision. The rel error is reported for evaluation. Fig. 8 demonstrates that ‘rel’ slumps by % with 20K point groups to establish VNL. However, it only drops slightly when the samples for VNL increase from 20K to 100K. Therefore, the performance saturates with more samples, when samples reach a certain number in that the diversity of samples is enough to construct the global geometric constraint.
Lightweight Backbone Network. We train the network with the MobileNetV2 backbone to evaluate the effectiveness of the proposed geometric constraint on the light network. We train it on the NYUDLarge for epochs. Results in Table 4 show that the proposed VNL can improve the performance by  . Comparing with previous stateoftheart methods, we have improved the accuracy by around over all evaluation metrics and achieved a better tradeoff between parameters and the accuracy.
3.6 Recovering 3D Features from Estimated Depth
We have argued that, with geometric constraints in the 3D space, the network can achieve more accurate depth and also obtain higherquality 3D information. Here we show the recovered 3D point cloud and the surface normal to support this.
3D Point Cloud. Firstly, we compare the reconstructed 3D point cloud from our predicted depth and that of DORN. Fig. 9 demonstrate that the overall quality of ours outperforms theirs significantly. Although our predicted depth is only slightly better than theirs on evaluation metrics, the reconstructed wall (see the 2nd row in 9) of ours is much flatter and has fewer errors. The shape of the bed is more similar to the ground truth. From the bird view, it is hard to recognize the bed shape of their results. The point cloud in Fig. 1 also leads to a similar conclusion.
Surface Normal. Lastly, we compare the calculated surface normal with previous stateoftheart methods and demonstrate the quantitative results in Table 5. The ground truth is obtained as described in [7]. We first compare our geometrically calculated results with DCNNbased optimization methods. Although we do not optimize a submodel to achieve the surface normal, our results can outperform most of the previous methods and even are the best on metric.
Method  Mean  Median  

Lower is better  Higher is better  
Predicted Surface Normal from the Network  
3DP [10]  
Ladicky et al. [42]  
Fouhey et al. [11]  
Wang et al. [39]  
Eigen et al. [7]  
Calculated Surface Normal from the Point cloud  
GTGeoNet^{†} [31]  
DORN^{‡} [12]  
Ours 

Cited from the original paper.

Using authors’ released models.
Furthermore, we compare the surface normals directly computed from the reconstructed point cloud with that of DORN [12] and GeoNet [31]. Note that we run the released code and model of DORN to obtain depth maps and then calculate surface normals from the depth, while the evaluation of GeoNet is cited from the original paper. In Table 5, we can see that, with highorder geometric supervision, our method outperforms DORN and GeoNet by a large margin, and even is close to Eigen method which trains to output normals. It suggests that our method can lead the model to learn the shape from images.
Apart from the quantitative comparison, the visual effect is shown in Fig. 10, demonstrating that our directly calculated surface normals are not only accurate in planes (the 1st row), but also are of higher quality in regions with sophisticated curved surface (the 2nd and last row).
4 Conclusion
In this paper, we have proposed to construct a longrange geometric constraint (VNL) in the 3D space for monocular depth prediction. In contrast to previous methods with only pixelwise depth supervision in 2D space, our method can not only obtain the accurate depth maps but also recover highquality 3D features, such as the point cloud and the surface normal, eliminating necessities to optimize a new submodel. Compared with other 3D constrains, our proposed VNL is more robust to noise and can encode strong global constraints. Experimental results on NYUDV2 and KITTI have proved the effectiveness of our method and the stateoftheart performance.
In particular, to demonstrate that our method is able to produce sensible local shapes, the normals directly derived from the estimated depth of our method outperform many other recent depth estimation methods and are close to that of those trained to output normals. We hope that our method provides a useful tool and stimulates insight into predicting not only depth but also shape from monocular images.
5 Appendix
5.1 Model
An overview architecture of our model is illustrated in Fig.11. The network is mainly composed of two parts, an encoder to establish features in different levels from , and a decoder to reconstruct the depth map. Inspired by [27], the decoder is composed of several adaptive merging blocks (AMB) to fuse features from different levels and dilated residual blocks (DRB) to transform features. In order to improve the receptive field of the decoder, we set the dilation rates of all convolutions in DRB to 2 and insert an Astrous Spatial Pyramid Pooling (ASPP) module (dilation rate: 2, 4, 8) [4] between the encoder and the decoder. Furthermore, we establish 4 flip connections from different levels of encoder blocks to the decoder to merge more lowlevel features. The AMB will learn a merging parameter for adaptive merging. Apart from features from the highest level with 512 channels, other flips’ features dimension are 256. At last, a prediction module, a convolution and a softmax, is applied to transfer the features dimensions from 256 channels to 150 depth bins.
5.2 Predicted Depth and Surface Normal
5.3 3D point cloud
In order to further show the quality of reconstructed point cloud from the predicted depth, we randomly select 3 scenes from the testing part of NYUDV2 and KITTI. 3 views are randomly selected to display the reconstructed point cloud. The results are shown in Fig. 15 and Fig. 16.
Acknowledgments
We would like to thank Huawei Technologies for the donation of GPU cloud computing resources. We are particularly grateful to one of the reviewers who sees the value of our work and has provided constructive comments.
References
 [1] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d3d alignment via surface normal prediction. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5965–5974, 2016.
 [2] Y. Cao, Z. Wu, and C. Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol., 2017.
 [3] A. Chakrabarti, J. Shao, and G. Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In Proc. Advances in Neural Inf. Process. Syst., pages 2658–2666, 2016.
 [4] L.C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoderdecoder with atrous separable convolution for semantic image segmentation. In arXiv: Comp. Res. Repository, volume abs/1802.02611, 2018.
 [5] R. Chen, F. Mahmood, A. Yuille, and N. J. Durr. Rethinking monocular depth estimation with adversarial training. In arXiv: Comp. Res. Repository, volume abs/1808.07528, 2018.
 [6] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248–255. Ieee, 2009.
 [7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multiscale convolutional architecture. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2650–2658, 2015.
 [8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multiscale deep network. In Proc. Advances in Neural Inf. Process. Syst., pages 2366–2374, 2014.
 [9] X. Fei, A. Wang, and S. Soatto. Geosupervised visual depth prediction. In arXiv: Comp. Res. Repository, volume abs/1807.11130, 2018.
 [10] D. F. Fouhey, A. Gupta, and M. Hebert. Datadriven 3d primitives for single image understanding. In Proc. IEEE Int. Conf. Comp. Vis., pages 3392–3399, 2013.
 [11] D. F. Fouhey, A. Gupta, and M. Hebert. Unfolding an indoor origami world. In Proc. Eur. Conf. Comp. Vis., pages 687–702. Springer, 2014.
 [12] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2002–2011, 2018.
 [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. SAGE Int. J. Robotics Research, 32(11):1231–1237, 2013.
 [14] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang. Learning monocular depth by distilling crossdomain stereo networks. In Proc. Eur. Conf. Comp. Vis., pages 484–500, 2018.
 [15] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgbd images for object detection and segmentation. In Proc. Eur. Conf. Comp. Vis., pages 345–360. Springer, 2014.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 770–778, 2016.
 [17] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit. Multimodal templates for realtime detection of textureless objects in heavily cluttered scenes. In Proc. IEEE Int. Conf. Comp. Vis., pages 858–865. IEEE, 2011.
 [18] J. Hu, M. Ozay, Y. Zhang, and T. Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In IEEE Winter Conf. on Applications of Comp. Vis., 2019.
 [19] J. Jiao, Y. Cao, Y. Song, and R. Lau. Look deeper into depth: Monocular depth estimation with semantic booster and attentiondriven loss. In Proc. Eur. Conf. Comp. Vis., pages 53–69, 2018.
 [20] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using nonparametric sampling. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2144–2158, 2014.
 [21] K. Klasing, D. Althoff, D. Wollherr, and M. Buss. Comparison of surface normal estimation methods for range sensing applications. In IEEE Int. Conf. Robotics & Automation, pages 3206–3211. IEEE, 2009.
 [22] Y. Kuznietsov, J. Stückler, and B. Leibe. Semisupervised deep learning for monocular depth map prediction. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2215–2223. IEEE, 2017.
 [23] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 89–96, 2014.
 [24] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Int. Conf. on 3D Vision, pages 239–248. IEEE, 2016.
 [25] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1119–1127, 2015.
 [26] J. Li, R. Klein, and A. Yao. A twostreamed network for estimating finescaled depth maps from single rgb images. In Proc. IEEE Int. Conf. Comp. Vis., pages 22–29, 2017.
 [27] R. Li, K. Xian, C. Shen, Z. Cao, H. Lu, and L. Hang. Deep attentionbased classification network for robust depth prediction. In arXiv: Comp. Res. Repository, volume abs/1807.03959, 2018.
 [28] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):2024–2039, 2016.
 [29] M. Liu, M. Salzmann, and X. He. Discretecontinuous depth estimation from a single image. In IEEE Trans. Pattern Anal. Mach. Intell., pages 716–723, 2014.
 [30] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, and I. Reid. Realtime joint semantic segmentation and depth estimation using asymmetric annotations. In arXiv: Comp. Res. Repository, volume abs/1809.04766, 2018.
 [31] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 283–291, 2018.
 [32] A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5506–5514, 2016.
 [33] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz. Aligning point cloud views using persistent feature histograms. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots & Systems, pages 3384–3391. IEEE, 2008.
 [34] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4510–4520, 2018.
 [35] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell., 31(5):824–840, 2009.
 [36] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In Proc. Eur. Conf. Comp. Vis., pages 746–760. Springer, 2012.
 [37] A. Spek, T. Dharmasiri, and T. Drummond. Cream: Condensed realtime models for depth prediction using convolutional neural networks. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots & Systems, pages 540–547. IEEE, 2018.
 [38] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2800–2809, 2015.
 [39] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 539–547, 2015.
 [40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5987–5995. IEEE, 2017.
 [41] W. Yin, X. Cheng, J. Xie, H. Cui, and Y. Chen. Highspeed 3d profilometry employing hsi color model for color surface with discontinuities. Elsevier Optics & Laser Technology., 96:81–87, 2017.
 [42] B. Zeisl, M. Pollefeys, et al. Discriminatively trained dense surface normal estimation. In Proc. Eur. Conf. Comp. Vis., pages 468–484. Springer, 2014.
 [43] K. Zheng, Z.J. Zha, Y. Cao, X. Chen, and F. Wu. Lanet: Layoutaware dense network for monocular depth estimation. pages 1381–1388. ACM, 2018.