CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion
Depth completion deals with the problem of converting a sparse depth map to a dense one, given the corresponding color image. The convolutional spatial propagation network (CSPN) is one of the state-of-the-art (SoTA) methods for depth completion, and it recovers structural details of the scene. In this paper, we propose CSPN++, which further improves its effectiveness and efficiency by learning adaptive convolutional kernel sizes and the number of iterations for the propagation, so that the context and computational resource needed at each pixel can be dynamically assigned on request. Specifically, we formulate the learning of the two hyper-parameters as an architecture selection problem, where various configurations of kernel sizes and numbers of iterations are first defined, and then a set of soft weighting parameters is trained to either properly assemble or select from the pre-defined configurations at each pixel. In our experiments, we find that weighted assembling, which we refer to as "context-aware CSPN", can lead to significant accuracy improvements, while weighted selection, "resource-aware CSPN", can reduce the computational resource significantly with similar or better accuracy. Besides, the resource needed for CSPN++ can be adjusted automatically w.r.t. the computational budget. Finally, to avoid the side effects of noisy or inaccurate sparse depths, we embed a gated network inside CSPN++, which further improves the performance. We demonstrate the effectiveness of CSPN++ on the KITTI depth completion benchmark (http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_completion), where it significantly improves over CSPN and other SoTA methods.
Image guided depth completion, or depth completion for short in this paper, is the task of converting a sparse depth map, obtained from devices such as LiDAR [velodyne] or algorithms such as structure-from-motion (SfM) [wu2011visualsfm] and simultaneous localization and mapping (SLAM) [engel2014lsd], to a per-pixel dense depth map with the help of reference images. The technique has a wide range of applications for the perception of indoor/outdoor moving robots such as self-driving vehicles [chen2015deepdriving] and home/indoor robots [desouza2002vision], or for applications such as augmented reality [ventura2008depth].
One of the state-of-the-art (SoTA) methods for this task is CSPN, which is an efficient local linear propagation model with affinities learned by a convolutional neural network (CNN). In CSPN, [cheng2018learning] claim that three important properties should be considered for the depth completion task: 1) depth preservation, where the depth value at sparse points should be maintained, 2) structure alignment, where the detailed structures, such as edges and object boundaries in the estimated depth map, should be aligned with the given image, and 3) transition smoothness, where the depth transition between sparse points and their neighborhoods should be smooth.
In real applications, depths from devices like LiDAR, or from algorithms such as SfM or SLAM, can be noisy [wvangansbeke_depth_2019] due to system or environmental errors. Datasets like KITTI adopt stereo and multiple frames to compensate for these errors in evaluation. In this paper, we do not assume that the sparse depth map is the ground truth; rather, we consider that it may include errors as well. The depth value at sparse points should therefore be maintained conditionally, with respect to its accuracy. Secondly, all pixels are treated equally in CSPN, while intuitively the pixels at geometrical edges and object boundaries deserve more attention for structure alignment and transition smoothness. Therefore, in CSPN++, we propose to find a proper propagation context for each pixel, to further improve the performance of depth completion.
To be specific, as illustrated in Fig. 1, in CSPN++, numerous configurations of convolutional kernel size and number of iterations are first defined for each pixel $x$; then we use $\alpha_x$ to weight the different proposals of kernel size, and $\lambda_x$ to weight the outputs after different iterations. Based on these hyper-parameters, we derive context-aware and resource-aware variants of CSPN++. In context-aware CSPN (CA-CSPN), we propose to assemble the outputs, which makes CSPN++ structurally similar to networks such as InceptionNet [szegedy2016rethinking] or DenseNet [huang2017densely], where the gradient from the final output can be directly back-propagated to earlier propagation stages. We find that the model learns a stronger representation, yielding a significant performance boost compared to CSPN.
In resource-aware CSPN (RA-CSPN), CSPN++ sequentially selects one convolutional kernel and one number of iterations for each pixel by minimizing the computational resource usage. The learned computational resource allocation speeds up CSPN significantly (2-5$\times$ in our experiments) while also improving accuracy. In addition, RA-CSPN can be automatically adapted to a provided computational budget, with awareness of accuracy, through a budget rounding operation during training and inference.
In summary, our contributions lie in two aspects:
Based on the observation that sparse depths can be erroneous, we propose a gate network to guide the depth preservation, which makes the output more robust to noisy sparse depths.
We propose an effective method to adapt the kernel sizes and the number of iterations for each pixel with respect to image content for CSPN, which induces two variants, named context-aware and resource-aware CSPN. The former significantly improves its performance, and the latter speeds up the algorithm and lets CSPN++ adapt to computational budgets.
Depth estimation, completion, enhancement/refinement and models for dense prediction with dynamic context and compression have been central problems in computer vision and robotics for a long time. Here we summarize those works in several aspects without enumerating them all due to space limits, and we mainly clarify their core relationships with the CSPN++ proposed in this paper.
The task of depth completion [uhrig2017sparsity] has recently attracted a lot of interest in robotics due to its wide application for enhancing 3D perception [LiaoHWKYL16]. The provided depths are usually from LiDAR, SfM or SLAM, yielding a map with valid depth available only at some of the pixels. Within this field, some works directly convert sparse 3D points to dense ones without image guidance [Zimmermann2017Learning, Ladicky_2017_ICCV, uhrig2017sparsity], and produce impressive results with deep learning. However, conventionally, jointly considering the structures from reference images for guiding depth completion/enhancement [liu2013guided, ferstl2013image] yields better results. With the rise of deep learning for depth estimation from a single image [eigen2014depth, wang2016surge], researchers have adopted similar strategies for image guided depth completion. For example, [Ma2017SparseToDense] propose to treat the sparse depth map as an additional input to a ResNet based depth predictor [laina2016deeper], producing results superior to the depth output from a CNN with image inputs only. Later works focus on improving the efficiency [ku2018defense], separately modeling the features from the image and the sparse depths [tang2019learning], recovering the structural details of depth maps [cheng2018depth], combining with a multi-level CRF [xu2018structured], or adopting auxiliary training losses using normals [zhang2018deep] or 3D representations [qiu2019deeplidar, Chen2019depthcompletion] with a self-supervised learning strategy [ma2019self].
Among all of these works, we treat CSPN [cheng2018depth] as our baseline strategy due to its clear motivation and good theoretical guarantee of stability in training and inference, and the resulting CSPN++ provides a significant boost in both effectiveness and efficiency.
Context Aware Architectures.
Assembling multiple contexts inside a network for dense predictions has been an effective component for recognition tasks in computer vision. From our perspective, the assembling strategies can be horizontal or vertical. Horizontal strategies assemble outputs from multiple branches in a single layer of a network, and include modules such as Inception/Xception [szegedy2016rethinking], pyramid spatial pooling (PSP) [zhao2016pyramid], and atrous spatial pyramid pooling (ASPP) [ChenPSA17]. Vertical strategies assemble outputs from different layers, and include modules such as HighwayNet [srivastava2015highway], DenseNet [huang2017densely], etc. Some recent works combine the two strategies, such as HRNet [sun2019deep] or DenseASPP [yang2018denseaspp]. Most recently, to better condition the context on each pixel or on the provided image, attention mechanisms have been introduced inside the network at the cost of additional computation, either for context selection, such as SkipNet [wang2018skipnet] and non-local networks [wang2018non], or for context deformation, such as spatial transformer networks [jaderberg2015spatial] and deformable networks [zhu2019deformable].
In the field of depth completion, [cheng2018learning] propose the atrous convolutional spatial pyramid fusion (ACSF) module, which extends ASPP by additionally adding an affinity for each pixel, yielding stronger performance; it can be treated as a case of combining horizontal assembling with attention from affinity values. In our case, CA-CSPN of CSPN++ extends the context assembling idea into CSPN with both horizontal and vertical strategies via attention. Horizontally, it assembles multiple kernel sizes, and vertically it assembles the outputs from different iteration stages, as illustrated in Fig. 1. Here we would like to note that, mathematically, in the forward process, performing one step of CA-CSPN with kernels of $7\times 7$, $5\times 5$ and $3\times 3$ together is equivalent to performing CSPN with a single $7\times 7$ kernel, since the full process is linear. However, the backward learning process is different due to the auxiliary parameters (the kernel-size weights $\alpha_x$ and the iteration weights $\lambda_x$), and our results are significantly better.
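The forward equivalence noted above follows from linearity and can be checked numerically. The following is a minimal sketch in one dimension rather than the authors' implementation: the signal, the three kernels and the weights `alpha` are made-up illustrative values, and a single global weight per kernel stands in for the per-pixel weights.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation with an odd-length kernel ('same' size)."""
    r = len(kernel) // 2
    n = len(signal)
    return [sum(kernel[j + r] * signal[i + j]
                for j in range(-r, r + 1) if 0 <= i + j < n)
            for i in range(n)]


def pad_kernel(kernel, size):
    """Embed a small odd-length kernel in the center of a zero kernel."""
    pad = (size - len(kernel)) // 2
    return [0.0] * pad + list(kernel) + [0.0] * pad


signal = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
kernels = {3: [0.2, 0.5, 0.3],
           5: [0.1, 0.2, 0.4, 0.2, 0.1],
           7: [0.05, 0.1, 0.15, 0.4, 0.15, 0.1, 0.05]}
alpha = {3: 0.5, 5: 0.3, 7: 0.2}  # illustrative assembling weights

# Branch-wise: run each kernel separately, then take the weighted sum.
branch_out = {k: conv1d_same(signal, kernels[k]) for k in kernels}
assembled = [sum(alpha[k] * branch_out[k][i] for k in kernels)
             for i in range(len(signal))]

# Merged: fold the weights into one 7-tap kernel and convolve once.
merged = [sum(alpha[k] * pad_kernel(kernels[k], 7)[j] for k in kernels)
          for j in range(7)]
single = conv1d_same(signal, merged)

assert all(abs(a - b) < 1e-9 for a, b in zip(assembled, single))
```

The two paths agree in the forward pass only; during training the branch-wise form exposes separate weights to the gradient, which is the difference the text points to.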
Resource Aware Inference.
In addition, the dynamic context intuition can also be applied for efficient prediction by stopping the computation once a proper context has been obtained, which is also known as adaptive inference [graves2016adaptive]. Specifically, the relevant strategies have been adopted in image classification, such as the multi-scale dense network (MSDNet) [huang2018multi], in object detection, such as trade-off balancing [huang2017speed], and in semantic segmentation, such as the regional convolution network (RCN), which treats each pixel differently [li2017not].
In RA-CSPN of CSPN++, we first embed such an idea in depth completion, and adopt the functionality of RCN in CSPN for efficient inference. To minimize the computation, each pixel sequentially chooses one kernel size and then one number of iterations from the proposed configurations. Besides, we can easily add a provided computation budget, such as latency or memory constraints, into our optimization target, which can be back-propagated for operation selection, similar to resource-constrained architecture search algorithms [zhou2019epnas, cai2018proxylessnas].
To make the paper self-contained, we first briefly review CSPN [cheng2018learning], and then show how we extend it with context and resource awareness. Given a depth map $D_o$ that is output from a network taking an image $X$ as input, CSPN updates it to a new depth map $D_n$. Without loss of generality, we follow their formulation by embedding the depth $D_o$ into a hidden representation $H$, and the updating equation for one propagation step can be written as,
$$H_{x,t+1} = \kappa_x(0) \odot H_{x,t} + \sum_{x_n \in N_k(x)} \kappa_x(x_n) \odot H_{x_n,t}, \qquad (1)$$
$$\kappa_x(x_n) = \frac{\hat{\kappa}_x(x_n)}{\sum_{x_n \in N_k(x)} |\hat{\kappa}_x(x_n)|}, \qquad \kappa_x(0) = 1 - \sum_{x_n \in N_k(x)} \kappa_x(x_n),$$
where the right-hand side of Eq. (1) represents one step of CSPN given a predefined convolutional kernel size $k$. $N_k(x)$ is the set of neighborhood pixels in a $k \times k$ kernel, and the affinities $\hat{\kappa}_x(\cdot)$ output from a network are properly normalized, which guarantees the stability of the module. The whole process iterates $N$ times to obtain the final result. Here, the configuration ($k$ and $N$) needs to be tuned in the experiments, and it impacts the final performance significantly in their paper.
For depth completion, CSPN preserves the depth value at the valid pixels of a sparse depth map by adding a replacement operation at the end of each step. Formally, let $H^s$ be the corresponding embedding of the sparse depth map $D_s$; the replacement step after performing Eq. (1) is,
$$H_{x,t+1} = (1 - m_x)\, H_{x,t+1} + m_x\, H^s_x, \qquad (2)$$
where $m_x = \mathbb{I}(d^s_x > 0)$ is an indicator for the validity of the sparse depth $d^s_x$ at $x$.
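The propagation and replacement steps reviewed above can be sketched on a tiny grid as follows. This is an illustrative pure-Python version and not the authors' CUDA implementation: the depth itself is used as the hidden representation, out-of-image neighbors are treated as zeros, and the names `cspn_step`, `replace_sparse` and the dictionary layout of `raw_affinity` are assumptions made for this example.

```python
def cspn_step(H, raw_affinity, k=3):
    """One CSPN propagation step on a 2D grid (list of lists of floats).

    raw_affinity[y][x] maps a neighbor offset (dy, dx) to the unnormalized
    affinity predicted for pixel (y, x). Affinities are divided by the sum
    of their absolute values, and the center weight is one minus their sum.
    """
    r = k // 2
    h, w = len(H), len(H[0])
    offsets = [(dy, dx) for dy in range(-r, r + 1)
               for dx in range(-r, r + 1) if (dy, dx) != (0, 0)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            norm = sum(abs(raw_affinity[y][x][o]) for o in offsets) or 1.0
            acc, total = 0.0, 0.0
            for dy, dx in offsets:
                kappa = raw_affinity[y][x][(dy, dx)] / norm
                total += kappa
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:  # zero padding outside
                    acc += kappa * H[ny][nx]
            out[y][x] = (1.0 - total) * H[y][x] + acc
    return out


def replace_sparse(H, H_sparse, valid):
    """Replacement step: keep the sparse value wherever it is valid."""
    return [[H_sparse[y][x] if valid[y][x] else H[y][x]
             for x in range(len(H[0]))] for y in range(len(H))]
```

A full run would alternate `cspn_step` and `replace_sparse` for the chosen number of iterations.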
Context and Resource Aware CSPN
In this section, we elaborate how CSPN++ enhances CSPN by learning a proper configuration for each pixel through additional predicted parameters. Specifically, we predict $\alpha_x$ for weighting the various convolutional kernel sizes, and $\lambda_x$ for weighting the different numbers of iterations given a kernel size $k$. As shown in Fig. 2, both variables depend on the image content, and are predicted from a backbone shared with the CSPN affinity and the estimated depths.
Given the predicted $\alpha_x$ and $\lambda_x$, context-aware CSPN (CA-CSPN) first assembles the results from different steps. Formally, the propagation from step $t$ to step $t+1$ could be written as,
where $\sigma(\cdot)$ is the sigmoid function, applied to the raw outputs of the network. In the process, the hidden representation progressively aggregates the output from each step of CSPN based on $\lambda_x$. Finally, we assemble the different outputs from the various kernels after $N$ iterations,
Here, both $\alpha_x$ and $\lambda_x$ are properly normalized by their norm, so that our output maintains the stabilization property of CSPN for training and inference.
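To make the assembling concrete, the sketch below combines per-kernel, per-iteration outputs at a single pixel. It is a simplified illustration and not the paper's exact formulation: it assumes the weights are obtained by applying a sigmoid to raw network outputs and dividing by their sum, and it applies the iteration weights in one final weighted sum instead of progressively. The function names and the nested-list layout are hypothetical.

```python
import math


def normalized_sigmoid(logits):
    """Apply a sigmoid to each logit, then divide by the sum of the results."""
    s = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    total = sum(s)
    return [v / total for v in s]


def assemble_pixel(outputs, lambda_logits, alpha_logits):
    """Combine per-kernel, per-iteration outputs at one pixel.

    outputs[k][t]       : value after sampled iteration t with kernel k
    lambda_logits[k][t] : raw network output weighting iteration t of kernel k
    alpha_logits[k]     : raw network output weighting kernel k
    """
    alpha = normalized_sigmoid(alpha_logits)
    result = 0.0
    for k, per_iter in enumerate(outputs):
        lam = normalized_sigmoid(lambda_logits[k])
        # Vertical assembling over iterations, then horizontal over kernels.
        result += alpha[k] * sum(l * v for l, v in zip(lam, per_iter))
    return result
```

Because both sets of weights are non-negative and sum to one, the assembled value is a convex combination of the branch outputs, which is one way to see why the output stays bounded by them.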
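When sparse points are available, CSPN++ adopts a confidence variable predicted at each valid depth in the sparse depth map, which is output from the shared backbone in our framework (Fig. 2). Therefore, the replacement step for CSPN++ can be modified accordingly,
$$H_{x,t+1} = (1 - w_x m_x)\, H_{x,t+1} + w_x m_x\, H^s_x,$$
where $m_x$ is the validity indicator and $H^s_x$ the sparse-depth embedding used in the replacement step of CSPN, $w_x = \sigma(\hat{w}_x)$ is the confidence, and $\hat{w}_x$ is predicted from a network after a convolutional layer.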
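A per-pixel sketch of this confidence-gated replacement is given below. It assumes the confidence is a sigmoid of a raw network output and that it linearly blends the propagated value with the sparse depth; the function name and arguments are hypothetical.

```python
import math


def gated_replace(h, h_sparse, valid, w_logit):
    """Blend the propagated value with the sparse depth using a confidence gate.

    h        : propagated value at the pixel
    h_sparse : sparse depth (embedding) at the pixel
    valid    : True if a sparse depth measurement exists at the pixel
    w_logit  : raw confidence predicted by the network
    """
    if not valid:
        return h  # no measurement: keep the propagated value
    w = 1.0 / (1.0 + math.exp(-w_logit))
    return (1.0 - w) * h + w * h_sparse
```

With a very confident measurement the result approaches the hard replacement used in CSPN, while a low confidence lets the propagated value dominate.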
Complexity and computational resource analysis.
From CSPN, we know that theoretically, with a sufficient number of GPU cores and large memory storage, the overall complexity for CSPN with a kernel size of $k$ and $N$ iterations is $O(\log_2(k^2) N)$. In CA-CSPN, with several induced convolutional kernels, the computational complexity is $O(\log_2(k_{max}^2) N)$, where $k_{max}$ is the maximum kernel size, since all branches can be performed simultaneously.
However, in the real application, the expected computational resource is limited and latency of memory request with large convolutional kernel could be time consuming. Therefore, we need to utilize a better metric for estimating the cost. Here, we adopt the popularly used memory cost and Mult-Adds/FLOPs as an indicator of latency or computational resource usage on a device. Specifically, based on the CUDA implementation of convolution with im2col [jia2014caffe], performing CSPN with a kernel would require memory cost of , and FLOPs of , given a single feature block with a size of . In summary, given kernels, the latency from big estimation for CA-CSPN would be . Finally, we would like to note that the memory and computational configuration varies with given devices, so does the latency estimation. A better strategy would be directly testing over the target device as proposed in [cai2018proxylessnas]. Here, we just provide a reasonable estimation with the commonly used GPU.
Network architectures. As illustrated in Fig. 2, for the backbone network, we adopt the same ResNet-34 structure proposed in [ma2019self]. The only modification is at the end of the network, where it outputs, using a convolutional layer, the per-pixel estimates of the assembling parameters $\alpha_x$ and $\lambda_x$, the noise guidance for replacement, and the affinity matrix. To handle the affinity values for the various propagation kernels, we use a shared affinity matrix, since the affinity between different pixels should be independent of the propagation context; this also saves memory inside the network.
Training context-aware CSPN. Given the proposed architecture, and based on our computational resource analysis w.r.t. latency, we add an additional regularization term to the general optimization target, which minimizes the expected computational cost by treating $\alpha_x$ and $\lambda_x$ as probabilities of configuration selection. It is shown to be effective in improving the final performance in our experiments. Formally, the overall target for training CA-CSPN can be written as,
In this objective, the network parameters are regularized by weight decay, and the expected computational cost is computed from the assembling variables based on our analysis. The objective also involves the height and width of the feature map, the output depth map from CA-CSPN, and the ground truth depth map. Our system can be trained end-to-end.
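The idea of reading the assembling weights as selection probabilities can be illustrated with a small per-pixel cost function. This is only a sketch: the cost model, in which running a kernel of size k for t iterations costs k * k * t units, is an assumption made for illustration, and `expected_cost` and its arguments are hypothetical names.

```python
def expected_cost(alpha, lam, kernel_sizes, iters):
    """Expected per-pixel cost with alpha and lam read as probabilities.

    alpha[i]     : probability of selecting kernel_sizes[i]
    lam[i][j]    : probability of stopping after iters[j] given kernel i
    Assumes (illustration only) that kernel size k run for t iterations
    costs k * k * t units.
    """
    return sum(a * sum(l * k * k * t for l, t in zip(lam_k, iters))
               for a, lam_k, k in zip(alpha, lam, kernel_sizes))
```

Adding such a term to the loss pushes probability mass toward cheaper configurations unless the depth error argues otherwise.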
Resource Aware Configuration
As introduced in our complexity analysis, CSPN with a large kernel size and many propagation iterations is time consuming. Therefore, to accelerate it, we further propose resource-aware CSPN (RA-CSPN), which selects the best kernel size and number of iterations for each pixel based on the estimated $\alpha_x$ and $\lambda_x$. Formally, its propagation step can be written as,
Here each pixel is treated differently by selecting a best learned configuration, and we follow the same process of replacement as Eq. (2) for handling depth completion.
Computational resource analysis.
Given the selected configuration of convolutional kernel and number of iterations at each pixel, the latency estimation for each image that we proposed in the section "Complexity and computational resource analysis" now depends on the average iteration step and the average kernel size in the image. Both numbers are guaranteed to be smaller than the maximum number of iterations and the maximum kernel size.
In our case, training RA-CSPN does not need to modify the multi-branch architecture shown in Fig. 1, but switches from the weighted-average assembling over iterations and kernels used in CA-CSPN to a max selection, in which only one path is adopted for each pixel. In addition, we need to modify the loss function used for training CA-CSPN by changing the expected computational cost as,
In practice, to implement configuration selection, we can reuse the same training pipeline as CA-CSPN by converting the obtained soft weighting values in $\alpha_x$ and $\lambda_x$ to a one-hot representation through an argmax operation.
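The conversion from soft weights to a hard selection can be sketched per pixel as follows. The helper names are hypothetical, and the sketch only covers the forward selection (kernel first, then iteration count for that kernel); how gradients are passed through the argmax during training is not addressed here.

```python
def one_hot_argmax(weights):
    """Return a one-hot list marking the position of the largest weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return [1.0 if i == best else 0.0 for i in range(len(weights))]


def select_config(alpha, lam):
    """Pick one kernel first, then one iteration option for that kernel.

    alpha[k]  : soft weight of kernel k at a pixel
    lam[k][t] : soft weight of iteration option t given kernel k
    Returns (kernel index, iteration index).
    """
    k = one_hot_argmax(alpha).index(1.0)
    t = one_hot_argmax(lam[k]).index(1.0)
    return k, t
```

For example, `select_config([0.2, 0.7, 0.1], [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]])` picks the second kernel and, for that kernel, the second iteration option.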
Practically, there are two issues we need to handle when making the algorithm efficient at testing: 1) how to perform different convolution simultaneously at different pixels, and 2) how to continue the propagation for pixels whose neighborhood pixels stop their diffusion/propagation process. To handle these issues, we follow the idea of regional convolution [li2017not].
Specifically, as shown in Fig. 3, to tackle the first issue, we group pixels into multiple regions based on the predicted kernel size, and prepare the corresponding matrix before convolution for each group using region-wise im2col. Then, the generated matrices can be processed simultaneously at each pixel using region-wise convolution. To tackle the second issue, when the propagation of one pixel stops at time step $t$, we directly copy its feature at step $t$ to the next step for computing convolutions at later stages. In summary, RA-CSPN can be performed in a single forward pass with less resource usage.
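The two mechanisms described above, grouping pixels by selected kernel size and carrying forward the features of pixels that have stopped, can be sketched as below. This only illustrates the semantics: it computes the full step and then masks it, so it does not save any computation, unlike a real region-wise im2col. The function names and the `step_fn` callback are hypothetical.

```python
from collections import defaultdict


def group_by_kernel(kernel_map):
    """Map each selected kernel size to the pixel coordinates that use it."""
    regions = defaultdict(list)
    for y, row in enumerate(kernel_map):
        for x, k in enumerate(row):
            regions[k].append((y, x))
    return dict(regions)


def step_with_stopping(H, step_fn, stop_iter, t):
    """Advance propagation from step t to t + 1, freezing stopped pixels.

    stop_iter[y][x] is the number of iterations selected for pixel (y, x).
    step_fn(H) returns the propagated grid. Pixels whose iterations are
    used up keep their value, so neighbors can still read it later.
    """
    proposed = step_fn(H)
    return [[proposed[y][x] if t < stop_iter[y][x] else H[y][x]
             for x in range(len(H[0]))] for y in range(len(H))]
```

In an efficient implementation, each group returned by `group_by_kernel` would be processed with its own kernel size, and stopped pixels would be skipped rather than recomputed.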
Table 1 (column headers): Method | SPP | CSPN configuration (Normal, assemble kernel, assemble iter.) | GR | LR | Results, lower is better (RMSE (mm), MAE (mm))
Learning with provided computational budget.
Finally, in real applications, rather than being allowed an optimal amount of computational resource, a deployed model is usually subject to a hard constraint on either memory or inference latency. Thanks to the adaptive resource usage of CSPN++, we are able to put the required budget directly into our optimization target during training. Formally, given a target memory cost and a target latency cost for resource-aware CSPN, the optimization target used for training could be modified as,
where the constraints bound the expected memory cost and the expected latency cost, the latter being the expected cost defined in the RA-CSPN training objective. The two constraints can be added to our target easily with Lagrange multipliers. Formally, our optimization target with resource budgets is,
where the hinge loss is adopted as our surrogate function for satisfying the constraints.
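A hinge-style surrogate for the two budget constraints can be sketched as follows. The function and the multiplier names `mu_mem` and `mu_lat` are hypothetical, and the costs are assumed to be expressed in the same normalized units as the budgets.

```python
def budget_penalty(expected_mem, expected_lat, mem_budget, lat_budget,
                   mu_mem=1.0, mu_lat=1.0):
    """Penalty that is zero inside the budgets and grows linearly outside.

    mu_mem and mu_lat play the role of the Lagrange multipliers.
    """
    def hinge(v):
        return max(0.0, v)

    return (mu_mem * hinge(expected_mem - mem_budget)
            + mu_lat * hinge(expected_lat - lat_budget))
```

The penalty would be added to the depth loss, so configurations within budget are not pushed any further toward cheaper settings by this term.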
Last but not least, since our primal problem, i.e. optimization with a deep neural network, is highly non-convex, there is no guarantee during training that all samples will satisfy the constraints. In addition, during testing, the predicted configuration might also violate the given constraints. Therefore, for these cases, we propose a resource rounding strategy that hard-constrains the overall computation to stay within the budgets. Specifically, we calculate the average cost at each pixel, and for the pixels violating the budget, as illustrated in Fig. 1, we find the Pareto optimal frontier [mock2011pareto] that satisfies the constraint, and pick the configuration with the largest number of iterations, since it obtains the largest receptive field.
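The rounding step can be sketched for a single pixel as a search over the predefined configurations. This is an illustration under stated assumptions: the default cost model (k * k * t units for kernel size k and t iterations) and the candidate values used in the usage note are made up for the example, and ties between equal iteration counts are broken in favor of the larger kernel.

```python
def round_to_budget(kernel_sizes, iters, budget,
                    cost=lambda k, t: k * k * t):
    """Return the feasible (kernel size, iterations) pair with most iterations.

    Returns None when no predefined configuration fits the budget.
    """
    feasible = [(k, t) for k in kernel_sizes for t in iters
                if cost(k, t) <= budget]
    if not feasible:
        return None
    # Prefer more iterations; among equal iterations prefer the larger kernel.
    return max(feasible, key=lambda c: (c[1], c[0]))
```

With the illustrative candidates `kernel_sizes=[3, 5, 7]` and `iters=[3, 6, 9, 12]`, a budget of 150 units yields `(3, 12)`, and a budget of 80 units yields `(3, 6)`.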
Table 2 (column headers): Method | kernel | iter. | m. c. (memory cost) | l. c. (latency cost) | Results (lower is better)
For experiments, we mainly evaluate CSPN++ on the KITTI depth completion benchmark [uhrig2017sparsity]. In this section, we first introduce the dataset, metrics and our implementation details. Then, an extensive ablation study of CSPN++ is conducted on the validation set to verify our insight into each proposed component. Finally, we provide a comparison of CSPN++ against other SoTA methods on the testing set.
Dataset. The KITTI Depth Completion benchmark is a large real-world self-driving dataset with street views from a driving vehicle. It consists of 86k training, 7k validation and 1k testing depth maps with corresponding raw LiDAR scans and reference images. The sparse depth maps are obtained by projecting the raw LiDAR points into the view of the camera, and the ground truth dense depth maps are generated by first projecting the accumulated LiDAR scans of multiple timestamps, and then removing outlier depths caused by occlusion and moving objects through comparison with stereo depths from image pairs.
Metrics. We adopt the same error metrics as the KITTI depth completion benchmark, including root mean square error (RMSE), mean absolute error (MAE), inverse RMSE (iRMSE) and inverse MAE (iMAE), where inverse indicates the inverse depth representation, i.e. converting a depth $d$ to $1/d$.
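The four metrics can be written compactly as below. This is a minimal sketch and not the benchmark's official evaluation code: it assumes flat lists of positive predicted depths, treats pixels with a non-positive ground truth as unlabeled, and does not apply any unit conversion.

```python
import math


def depth_metrics(pred, gt):
    """RMSE/MAE on depth and iRMSE/iMAE on inverse depth over labeled pixels.

    pred, gt: flat lists of depths. Pixels with gt <= 0 are skipped because
    they are assumed to carry no ground truth.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    err = [p - g for p, g in pairs]
    inv_err = [1.0 / p - 1.0 / g for p, g in pairs]
    return {
        "rmse": math.sqrt(sum(e * e for e in err) / n),
        "mae": sum(abs(e) for e in err) / n,
        "irmse": math.sqrt(sum(e * e for e in inv_err) / n),
        "imae": sum(abs(e) for e in inv_err) / n,
    }
```

Because the inverse metrics operate on 1/d, they weight errors at close range more heavily than errors far away.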
Implementation details. We train our network with four NVIDIA Tesla P40 GPUs, and use a batch size of 8. In all our experiments, we adopt kernel sizes of $3\times 3$, $5\times 5$ and $7\times 7$, and sample outputs after a fixed set of propagation steps. All our models are trained with the Adam optimizer. The learning rate is reduced by half every 5 epochs from its initial value. For training context-aware CSPN, the weight decay parameter is set to 0.0005, and the parameter for resource regularization is set to 0.1. All of these parameters, including those of the resource-aware CSPN objective, are chosen to balance the value scales of the different losses, without exhaustive tuning.
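The step schedule described above, halving the learning rate every 5 epochs, can be written as a one-line function. The initial rate is passed in as `base_lr`, a hypothetical parameter name, and the value used in the usage note is only an example.

```python
def learning_rate(epoch, base_lr):
    """Learning rate at a given epoch when it is halved every 5 epochs."""
    return base_lr * 0.5 ** (epoch // 5)
```

For instance, with an example `base_lr` of `1e-3`, epochs 0 to 4 use `1e-3`, epochs 5 to 9 use `5e-4`, and epochs 10 to 14 use `2.5e-4`.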
Ablation study of context-aware CSPN (CA-CSPN). Here, we conduct experiments to verify each module adopted in our framework, including our baselines, i.e. CSPN with spatial pyramid pooling (SPP), and our newly proposed modules in context-aware CSPN. To make the validation efficient, we train each network for only 10 epochs to obtain its results. In the baselines, SPP uses a fixed set of pooling sizes, and CSPN uses a single kernel size and a fixed number of iterations. As shown in Tab. 1, by adding the SPP and CSPN modules to the baseline from [ma2019self], we significantly reduce the depth error, thanks to the pyramid context induced by SPP and the structure refined by CSPN. With additional confidence guided replacement (GR) (Eq. (5)), our module better handles noisy sparse depths, and the RMSE is significantly reduced. Then, at rows with 'assemble kernel', we add the component of learning to horizontally assemble predictions from different kernel sizes via the learned $\alpha_x$, which further reduces the error. At rows with 'assemble iter.', we include the component of learning to vertically assemble outputs after different iterations via the learned $\lambda_x$. Finally, at rows with 'LR', we add our proposed latency regularization term into the training losses, yielding the best results of our context-aware CSPN.
In Fig. 4, we visualize the learned configurations of the kernel-size weights $\alpha_x$ and the iteration weights $\lambda_x$ at each pixel. Typically, we find that the majority of pixels on the ground and on walls only need a small kernel and few iterations for recovery, while pixels further away and around object and surface boundaries need large kernels and more iterations to obtain a larger context for reconstruction. This agrees with our intuition, since in real cases sparse points are denser close by and the structure is simpler in planar regions, which makes depth estimation easier there.
Ablation study of resource-aware CSPN (RA-CSPN). To verify the efficiency of our proposed RA-CSPN, we study the computational improvement w.r.t. vanilla CSPN and CA-CSPN. As listed in Tab. 2, at row 'CSPN', we list its memory cost and latency on device. At row 'CA-CSPN', although the memory cost and latency are larger in practice, the expected kernel size and number of iteration steps are much smaller thanks to our latency regularization term. This indicates that most pixels only need a small kernel and few iterations to obtain better results. At row 'RA-CSPN', we train with the resource-aware objective, and show that RA-CSPN not only outperforms CSPN in efficiency (almost 3$\times$ faster), but also improves the RMSE. More importantly, we can train RA-CSPN with a computational budget to fit different devices, as proposed in Eq. (10). At the last row, with a hard constraint that the memory cost (m.c.) and latency cost (l.c.) are less than 35% of those of the vanilla CSPN, we find that our method actively adjusts the kernel sizes and numbers of iterations. In this case, the 'iter.' value decreases from 0.439 to 0.303 while the 'kernel' value increases from 0.268 to 0.333, which means that the network automatically chooses larger kernel sizes with fewer iterations to satisfy our hard constraints, while still producing better results, which demonstrates the effectiveness of our method.
Comparisons against other methods
Finally, to compare against other SoTA methods in depth estimation accuracy, we use our best model obtained from CA-CSPN, and finetune it for another 30 epochs before submitting the results to the KITTI test server. As summarized in Tab. 3, CA-CSPN outperforms all other methods significantly and currently ranks 2nd on the benchmark. However, our results are better in three out of the four metrics. Here, we would like to note that our results are also better than those of methods that adopt additional datasets, e.g. DeepLiDAR [qiu2019deeplidar] uses CARLA [dosovitskiy2017carla] to better learn dense depth and surface normal tasks jointly, and FusionNet [wvangansbeke_depth_2019] uses semantic segmentation models pre-trained on Cityscapes [cordts2016cityscapes]. Our plain model is trained only on the KITTI dataset and outperforms all other methods.
In Fig. 5, we qualitatively compare the dense depth maps estimated by our proposed method with those of UberATG-FuseNet [Chen2019depthcompletion], together with the corresponding error maps. We find that our results are better at recovering detailed scene structure.
In this paper, we propose CSPN++ for depth completion, which outperforms the previous SoTA strategy CSPN [cheng2018learning] by a large margin. Specifically, we elaborate two variants using the same framework of model selection, i.e. context-aware CSPN and resource-aware CSPN. The former significantly reduces the estimation error, while the latter achieves much better efficiency with accuracy comparable to the former. We hope CSPN++ can motivate researchers to better adopt data-driven strategies for effectively learning hyper-parameters in various tasks. In the future, we would like to merge the two variants, and consider replacing more modules in the network with CSPN for multiple tasks such as segmentation and detection.