Structure-Attentioned Memory Network for Monocular Depth Estimation
Monocular depth estimation is a challenging task that aims to predict a corresponding depth map from a given single RGB image. Recent deep learning models have been proposed to predict the depth from the image by learning the alignment of deep features between the RGB image and the depth domains. In this paper, we present a novel approach, named Structure-Attentioned Memory Network, to more effectively transfer domain features for monocular depth estimation by taking into account the common structure regularities (e.g., repetitive structure patterns, planar surfaces, symmetries) in domain adaptation. To this end, we introduce a new Structure-Oriented Memory (SOM) module to learn and memorize the structure-specific information between RGB image domain and the depth domain. More specifically, in the SOM module, we develop a Memorable Bank of Filters (MBF) unit to learn a set of filters that memorize the structure-aware image-depth residual pattern, and also an Attention Guided Controller (AGC) unit to control the filter selection in the MBF given image features queries. Given the query image feature, the trained SOM module is able to adaptively select the best customized filters for cross-domain feature transferring with an optimal structural disparity between image and depth. In summary, we focus on addressing this structure-specific domain adaption challenge by proposing a novel end-to-end multi-scale memorable network for monocular depth estimation. The experiments show that our proposed model demonstrates the superior performance compared to the existing supervised monocular depth estimation approaches on the challenging KITTI and NYU Depth V2 benchmarks.
Depth estimation is an important component in many 3D computer vision tasks like visual Simultaneous Localization and Mapping (visual SLAM). Traditional approaches have made significant progress in binocular or multi-view depth estimation by taking advantage of geometry constraints of either spatial (i.e. stereo camera) or temporal (i.e. video sequence) pairs. With the prevalence of deep convolutional neural networks, researchers have been trying to relax the constraints by tackling monocular depth estimation. Recent works (Wang2015Towards; Roy2016Monocular; Kuznietsov2017Semi; Kim2016Unified; Fu2018Deep; Eigen2015Predicting) have demonstrated promising results using regression-based deep learning models. Their models are trained by minimizing image-level losses with supervised signal on predicted results. Nevertheless, the cross-modality variance between the RGB image and the depth map still makes monocular depth prediction an ill-posed problem. Based on this observation, some researchers have considered solving the problem with additional feature-level structural constraints by minimizing the cross-modality residual complexity between image features and depth features. Most existing methods either consider the pixel-wise or structure-wise alignment in this regard. For instance, several architectures utilize the micro discrepancy loss as similarity measures such like sum of squared differences, correlation coefficients (Myronenko2010Intensity) and maximum mean discrepancy (Ghifary2015Domain; Long2015Learning) to align the RGB images features with depth features from pixel to pixel independently without considering the spatial dependencies. Another line of work has tried to apply the adversarial adaptation methods (Kundu2018AdaDepth; Tzeng2017Adversarial; Judy2015Simultaneous) in conjunction with task-specific losses that concentrate on macro spatial distribution similarity between the image features and depth ones. In this paper, we seek a way to address this domain adaption challenge on both pixel-wise discrepancies and the structure dependencies by extracting the structure-specific information between the two domains (as shown in Figure 1).
In order to explore the pixel-wise discrepancies as well as the structure dependencies between the image features and depth features, we propose a memorable domain adaptation network, with an image-encoder-depth-decoder regression network backbone, and a specifically designed Structure-Oriented Memory (SOM) module coupled with a cross-modality residual complexity loss to minimize the gap between latent distribution of the image and depth map from both the pixel-level and structure-level. Given the observation that similar type of scenes (e.g. roadside scenes) often share common structural regularities (e.g. repetitive structure patterns, planar surfaces, symmetries), a set of filters could be trained to learn a specific structural image-depth residual patterns. Therefore, in our SOM module, we build a Memorable Bank of Filters (MBF) to store and learn the structure-ware filters, then we construct an Attention Guided Controller (AGC) to learn to automatically select the appropriate filters (from the MBF) to capture the significant information from the given image features (generated by the image encoder) for the further depth estimation. Finally, the customized image features are fed into the depth decoder network to output the corresponding depth maps. Importantly, comparing to the direct alignment between the two domains features (e.g. direct applying loss between and ), our introduced SOM module not only improves the fitting ability, but also reduces the training burden of the image encoder simultaneously. The experiments conducted on two well-known large scale benchmarks KITTI and NYU Depth V2, demonstrate that our proposed model obtains the state-of-the-art performance on monocular depth estimation tasks. Moreover, the performance margin between model trained with SOM and the one trained with direct alignment, validate the effectiveness of our proposed SOM module.
In summary, our contributions in this paper are as follows:
We introduce memory strategies to address monocular depth estimation by designing a novel Structure-Oriented Memory (SOM) module with a Memorable Bank of Filters (MBF) and an Attention Guided Controller (AGC) for feature-level cross-modality domain adaptation.
We propose a novel end-to-end deep learning Structure-Attentioned Memory Network, which seamlessly integrates a front-end regression network with the SOM module that operates at feature-level to substantially improve the depth prediction performance.
We achieve state-of-the-art performance on two large scale benchmarks: KITTI and NYU Depth V2, which validates the effectiveness of the proposed method.
The remainder of our paper is organized as follows. We present a brief review of the related literature in Section Related Works, after which we introduce the proposed method in details in Section Proposed Method. In Section Experiments, we provide the qualitative and quantitative experimental results, as well as ablation studies that demonstrate the effectiveness of the proposed method. Finally, we conclude the paper in Section Conclusion.
Monocular depth estimation is a fundamental problem in computer vision which has widespread application in graphics, robotics and AR/VR. While previous works mainly tackle this using hand-crafted image features or probabilistic models such as Markov Random Fields (MRFs) (Saxena2009Make3D), recent success of deep learning based methods (Wang2015Towards; Roy2016Monocular; Kuznietsov2017Semi; Kim2016Unified; Fu2018Deep; Eigen2015Predicting) have inspired researchers to use deep learning techniques to address the challenging depth estimation problem. The learning based monocular depth estimation approaches can be mainly summarized into two categories, the supervised and the unsupervised/semi-supervised methods.
Supervised Methods A majority of works focus on supervised learning to use the learned features from CNNs to do accurate depth prediction. Eigen2014Depth first brought CNNs to depth regression task by integrating coarse and refined features with a two-stage network. The multi-task learning strategies were also applied in depth estimation to boost the performance. Liu2010Single utilized the semantic segmentation as objectness cues for depth estimation. Furthermore, Shi2014Pulling and Xu2018PAD performed joint prediction of the pixel-level semantic labels as well as the depth. Surface normal information was also adopted in many recent works (Eigen2015Predicting; Zhou2017Unsupervised; Wang2015Towards; Qi2018GeoNetG). Besides, some research works also demonstrated the robustness of multi-scale feature fusion in pixel-level prediction tasks (e.g. semantic segmentation, depth estimation). Fu2018Deep adopted the dilated convolution to enlarge the perceptive field without decreasing spatial resolution of the feature maps. In Buyssens2012Multiscale’s work, inputs at different resolutions are utilized to build a multi-stream architecture. Instead of regression, there are also methods that discretize the depth range and transfer the regression problem to a classification problem. In the work of Fu2018Deep, the space-increasing discretization is proposed to reduce the over-strengthened loss for the large depth values.
Unsupervised/Semi-supervised Methods Another line of methods on monocular image depth prediction goes along the unsupervised/semi-supervised direction which mostly takes advantage of geometry constraints (e.g. epipolar geometry) on either spatial (between left-right pairs) or temporal (forward-backward) relationship. Garg2016Unsupervised proposed to estimate the depth map from a pair of stereo images by imposing the left-right consistency loss. Zhan2018Unsupervised jointly learned a single view depth estimator and monocular odometry estimator using stereo video sequences, which enables the use of both spatial and temporal photometric warp constraints. Moreover, following the trend of adversarial learning, the generative adversarial networks (GANs) have been utilized in the depth estimation problem. Kundu2018AdaDepth proposed an unsupervised domain adaptation strategy for adapting depth predictions from synthetic RGB-D pairs to natural scenes in the depth estimation task.
Cross-Modality Domain Adaptation In addition to the recent depth estimation methods, research works focused on the cross-modality domain adaption are also highly relevant to ours. The existence of cross modality, or domain shift, is commonly seen in real-world application, which is the consequence of data captured by different sensors (e.g. optical camera, LiDAR or stereo camera), or varying conditions (i.e. background). In different domains, semantic labels are shared whereas the data distributions are usually different to a large extent. For example, Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) in bio-medical image analysis (Dou2018Unsupervised); RGB images, depth maps and point clouds in 2D, 2.5D and 3D computer vision tasks. Numerous approaches have been proposed to address the domain adaptation needs in different visual tasks. Here we briefly review some domain adaption methods using deep learning techniques.
Most deep domain adaptation methods utilize a siamese architecture with two streams for source and target models respectively, and the network is trained with a discrepancy loss to minimize the pixel-wise shift between domains. Long2015Learning used maximum mean discrepancy together with a task-specific loss to adapt the source and target, while Sun2016Deep proposed the deep correlation alignment algorithm to match the mean and covariance. Bloesch_2018_CVPR proposed to learn a dense representation using an auto-encoder. Mandikal20183D trained the network with constrain in latent space to transfer feature from 2D to 3D in order to directly predict 3D point cloud from a single image. In our work, we aim to design a domain adaptive (SOM) module using memory mechanism, so that the image features can be automatically customized to obtain a better depth prediction.
The monocular depth estimation problem can be defined as a nonlinear mapping from the RGB image to the geometric depth map , which can be learned in a supervised fashion given a training set . To learn the mapping function, we propose Structure-Attentioned Memory Network as shown in Figure 2, which is composed of a (pre-trained) depth auto-encoder, an image encoder and a depth predictor equipped with SOM module. All the components are trained into two stages. In the first stage, a series of ‘target’ depth features are learned by training a depth map auto-encoder . In the second stage, we train an image encoder , SOM modules and a depth predictor to map the 2D image to the depth map in an end-to-end manner. Particularly, encodes the RGB image to the ‘source’ image features , which act as queries to obtain image-depth residual patterns from SOM module. The residual is then concatenated to the source feature to form a newly transferred feature set (which is expected to be aligned with the target feature with supervision) is fed to the predictor to estimate the output depth map. We will elaborate the network structures from two stages separately.
Stage 1: Depth Auto-Encoder
In order to learn a strong and robust prior over the depth map as a reference in the latent matching process, we train a depth auto-encoder which takes a ground truth depth map as input, and outputs a reconstructed depth map . As shown in Figure 3 (Stage 1), we use the DenseNet based encoder-decoder structure. Specifically, DenseNet-121 is utilized for constructing the depth encoder (Figure 3 (a)), in which four feature maps with cascading resolutions are extracted from different blocks (shallow to deep) for depth decoding. In order to make sure that the object contours as well as details are well preserved, we use a Feature Pyramid Network (FPN) to build the depth decoder, fusing multi-scale features in a pyramid structure. Specifically, as shown in Figure 3 (b), four features with sizes , , and of the input are derived. Starting from the deepest feature, each feature map is first upsampled by a factor of , and element-wisely added to its following feature map. After the fusion process of the multi-scale feature maps, each of the newly generated feature maps is upsampled to size of the original input (or the size of the shallowest feature map), and concatenated together to form a feature volume. Finally, the output depth map is predicted via extra CNN layers on the concatenated feature volume. The FPN decoder is able to preserve details in the depth map decoding process. We will show more experimental comparison between different decoder structure to demonstrate its effectiveness in the Experiments section.
Stage 2: Depth Prediction with SOM Module for Latent Space Adaptation
In the second stage, we aim to train the network in an end-to-end manner to effectively transfer the features derived from image encoder from image domain to depth domain, as a strong prior over the ground truth depth, so as to better deduce the depth from the transferred prior. To this end, this stage contains three major components as shown in Figure 3 (c), (d) and (e): the image encoder, the SOM module for latent space adaptation, and the depth predictor . Each component of the network will be explained below.
Image Encoder and Depth Predictor as Regression Backbone In order to make sure that the network derive both depth features and image features at the same scale, we design the encoder-decoder based backbone ((c) and (d) in Figure 3) for stage 2 exactly the same as those of stage 1 but without weight sharing. Specifically, the structure of image encoder ((c) in Figure 3) is identical to that of depth encoder ((e) in Figure 3), and similarly for ((b)) and ((e)).
SOM Module for Latent Space Adaptation In the latent space, we propose an additional structure oriented memory module consisting of two collaborative units: a Memorable Bank of Filters (MBF) that stores a bank of learned filters to detect the cross-modality residual complexity between the depth feature and the image feature, and an Attention Guided Controller (AGC) which controls the interaction between the image feature with the MBF. The image feature as a specific query feature selects filters from MBF with an attention guided read controller, and the MBF is updated through a write controller that is naturally integrated into the back propagation to make the network can be trained end-to-end. The proposed SOM reading and writing process are as follows.
SOM Reading Different from reading by ‘addressing’ in general memory concept, the proposed SOM module is reading by ‘attention’, which means each memory slot is assigned with a weight, and the whole memory is merged per weights as reading output. As demonstrated in Figure 4, given the query feature , in order to obtain weights for each memory slot, we build a LSTM-based read controller to learn the weights. Specifically, each filter from the memory slot is firstly convolved on the feature, and the intermediate outputs are denoted as , where is the memory size, and is formulated as: , is the kernel, is the bias, and is the convolution operation. The intermediate outputs could be thought of as the ‘unweighted/unbiased’ output that takes each filter/memory slot equally. Then in order to further add weighted attention on the result pool, a Bi-Directional Convolutional Long Short Term Memory is applied as the read controller on to explore the correlation within the pool, so as to aggregate the memory slots with strong attention. Particularly, read controller processes from two directions and computes the forward hidden sequence by iterating the input from to , and the backward hidden sequence by iterating the input from to . The forward/backward flow of the LSTM cell is formulated as below: where is the hidden sequence, is the logistic sigmoid function, is the convolution operator and denotes the Hadamard product. represent input gate, forget gate, output gate, and cell activation vector respectively, and is the hidden-input gate matrix, while is the input-output gate matrix. The final attention sequence is computed with regard to both and as follows: , where to , and each after softmax operation in the output sequence is associated with the weight for each memory slot (refer to value in Figure 4, the redder the color, the higher the attention), therefore . The memory output is a combination of the output sequence that focuses more on the slot with higher attention, while less on lower attention value: . Finally, is concatenated with the query feature itself to reproduce a transferred feature that is supposed to match the distribution of the depth feature .
SOM Writing The proposed memory writer can be seamlessly integrated to network back propagation. The attention learned from the read controller will also operate in the memory writing process, and specifically, the slot with higher attention will be updated to a larger extent and vice versa. As shown in Figure 2, there are two backward flows that affect the writing of the memory (red arrows in Figure 2): one comes from the output branch, and the other comes from the latent matching branch. The update rule could be formulated (in a simplified form) as , where is the attention for each slot, is the learning rate, and is the total gradient from both branches.
We design multiple objectives to constrain the joint training of the network with details as follows.
Depth Estimation Objective The depth estimation objective poses constraints on the front-end pipeline of the single image depth estimation. A common way for supervising regression tasks is to adopt or loss between the prediction and the ground truth, which means that larger values have much heavier influence on the loss. However, in depth estimation task, the larger the depth value is, the farther the object is to the camera, which means that the information is less rich for the estimator, leading to unnecessarily large loss (Fu2018Deep). Therefore, in order to reduce the over-emphasized error on large depth values, we use the logarithm mean squared error ()) loss to make the predictor focus more on closer objects which makes up the main portion in a depth map. The objective is formulated as , where is the ground truth depth map, while is the predicted depth map.
Auto-Encoder Objective The objective for the depth auto-encoder is utilized in the first training stage. To make sure that the depth features and the image features are in the same scale with same constraints, we also applied the on the auto-encoder as , where is the ground truth depth map, while is the reconstructed depth map.
Cross-Modality Residual Complexity Objective The latent adaptation objective is applied to constrain the SOM module to minimize feature distribution discrepancies. We use loss between the ‘target’ depth features (pretrained from stage 1) and the SOM transferred image features. The objective is a sum of feature alignment losses at different levels as , where is the number of features involved in latent matching.
Gradient and Surface Normal Constraints To further strengthen the network by pulling out the model from local minima, we added extra constraints on the predicted depth map including the gradient loss and the surface normal loss to finetune the training following commonly used techniques. The gradient loss is defined as , and specifically, we adopt Sobel filter to calculate the gradient both vertically and horizontally; is the image gradient of the ground truth depth map, while is the image gradient of the predicted depth map. The surface normal loss is defined as the similarity between the surface normal of the ground truth depth map with the predicted depth map as , formulated with the corresponding gradient.
In total, the training objectives are summarized as follows: (1) In training stage 1, the total loss is: ; (2) In training stage 2, the total loss is a weighted sum of , , and , which is formulated as: , where is the weight for each objective.
|Method||Error (lower is better)||Accuracy (higher is better)|
|Abs Rel||Sq Rel||RMSE|
In this section, we present our experiments on two large-scale datasets by introducing the implementation details, benchmark performance, and ablation studies validating the effectiveness of the proposed approach.
Implementation Details The proposed method is implemented using the TensorFlow 1.10 framework and runs on a single NVIDIA TITAN X GPU with 12 GB memory. The encoder-decoder structure from both stage 1 and stage 2 are identical but without weight sharing. The depth auto-encoder is trained from scratch, while the image encoder is initialized with ImageNet (Olga2015ImageNet) pre-trained parameters. For multi-scale feature fusion, we consider four levels of feature maps which are derived from different blocks of the DenseNet-121 backbone with the feature map sizes 1/4, 1/8, 1/16 and 1/32 of the input images. For instance, in NYU Depth V2 dataset, with the input resolution , four feature maps with cascading sizes , , , are extracted. The network is trained with initial learning rate , and decreased every 10 epochs. The weight decay and momentum set to and respectively. We used the Adam optimizer and batch normalization during training, with normalization decay 0.97. We set the weights for each objective as , , , and . The gradient loss is added after 4k steps of training, and the surface normal loss is added after 8k steps of training.
Data Augmentation We employ several data augmentation techniques on NYU Depth V2 dataset to prevent overfitting from limited amount of data, including: (i) Random Cropping by of the image height/ width; (ii) Scaling the original image by the factor interval of ; (iii) Random Flipping of the images horizontally; (iii) Rotating the images randomly with the degree of ; (iv) Color jitter of brightness (by -10 to 10 of original value), contrast (by a factor of 0.5 to 2.0), saturation and hue (by -20 to 20 of original value).
Evaluation Metrics Below is a list of evaluation metrics the quantitative evaluation is performed: (1) the absolute mean relative error (Abs Rel): , (2) the squared relative error (Sq Rel): , (3) the root mean squared error (RMSE): , (4) log mean squared error (): , (5) average log 10 error (Avg ): , and (6) accuracy with threshold (t=, , ): .
Results on KITTI Dataset (Eigen split) The KITTI dataset is a large scale dataset for autonomous driving, which contains depth images captured with LiDAR sensor mounted on a driving vehicle. In our experiment, to compare the results at the same level, we follow the experimental protocol proposed by Eigen2015Predicting, in which around images (resolution ) from scenes are utilized as training data, and around images from scenes are used for validation. Following the previous works, the depth value of the RGB image is scaled to 0-80m. During training, the depth maps are down-scaled to resolution , and up-sampled to the original size in evaluation process. Table 1 shows the comparison with the state-of-the-art methods on KITTI dataset. We compared with state-of-the-art methods (Saxena2009Make3D; Liu2015Deep; Zhou2017Unsupervised; Eigen2014Depth; Garg2016Unsupervised; Kundu2018AdaDepth; Zhan2018Unsupervised; Godard2017Unsupervised; Kuznietsov2017Semi). Particularly, the methods proposed by Saxena2009Make3D; Liu2015Deep; Zhou2017Unsupervised; Eigen2014Depth; Kundu2018AdaDepth only employ monocular images in both training and testing, while approaches in Zhan2018Unsupervised; Garg2016Unsupervised; Kuznietsov2017Semi; Godard2017Unsupervised are unsupervised methods that use stereo images in training and apply single image during testing. The proposed method outperforms all these methods by a large margin, and Figure 5 displays a few visualized prediction results on examples randomly chosen from the validation dataset.
Results on NYU Depth V2 Dataset The NYU Depth V2 dataset contains 120K pairs of RGB-D (resolution ) captured by Kinect. The dataset is manually selected and annotated into 1449 RGB-D pairs, in which 795 images are used for training, and the rest for validation. The depth value ranges from 0 to 10m. In the training process, the depth maps are down-scaled to resolution , and in testing/ evaluation, the predicted depth map is upsampled to the original resolution. Table 2 shows the comparison of the proposed method with state-of-the-art methods (official test split). We compare with both hand-crafted feature based approaches (Saxena2009Make3D; Kevin2012Depth; Shi2014Pulling) and deep learning based ones (Liu2014Discrete; Zhuo2015Indoor; Li2015Depth; Wang2015Towards; Xu2018PAD; Liu2016Learning; Roy2016Monocular). Figure 6 shows examples of predicted depth maps on the NYU Depth V2 dataset.
Ablation Studies To further demonstrate the effectiveness of the proposed method, we conduct ablation studies from two aspects on NYU Depth V2 dataset. Firstly, we compare the performance of the depth estimation pipeline with different decoder structures: (1) The decoder that simply uses symmetric structure with the encoder that cascadingly upsample the feature map until the output size. (2) The decoder that takes four different feature maps from the encoder and fuses them in a pyramid fashion (as described in Section Stage 1: Depth Auto-Encoder). The qualitative comparison are shown in Table 2 ( and ). As can be seen from the evaluation results, the decoder structure with pyramid multi-sacle feature fusion out-performs the one that only takes the latent feature as input by a large margin, especially in the metric. Therefore, it is obvious that the mixture of features from different levels are beneficial for the details compensation (i.e. contour, edges).
To validate the effectiveness of the proposed SOM module, we compare the performance of the proposed method with SOM settings against direct alignment and analyze the results. Firstly, we add the feature alignment loss for latent feature maps based on the structure to test the performance of direct feature alignment (). The quantitative results of direct alignment rarely improved compared with the one that is trained without feature alignment loss, reflecting the limited capability of the encoder for feature adaptation. Then, we add the SOM module at feature level () and compare the results with the baseline structure that goes without memory. The large margin quantitative improvement in Table 2 implies that structure-specific feature alignment with memory mechanism (SOM) is superior to other approaches such as direct alignment.
In this paper, we developed a novel memory guided network named Structure-Attentioned Memory Network for monocular depth estimation, consisting of the encoder-decoder based structure, as well as the external SOM module which is trained to learn and memorize the structure attentioned image-depth-residual pattern in cross-modality latent alignment. The proposed method achieves state-of-the-art performance on challenging large-scale benchmarks, and each component is validated to be effective in the ablation study.