CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion
Abstract
Depth completion deals with the problem of converting a sparse depth map to a dense one, given the corresponding color image. The convolutional spatial propagation network (CSPN) is one of the state-of-the-art (SoTA) methods for depth completion, which recovers structural details of the scene. In this paper, we propose CSPN++, which further improves its effectiveness and efficiency by learning adaptive convolutional kernel sizes and the number of iterations for the propagation, so that the context and computational resources needed at each pixel can be dynamically assigned on request. Specifically, we formulate the learning of the two hyperparameters as an architecture selection problem, where various configurations of kernel sizes and numbers of iterations are first defined, and then a set of soft weighting parameters is trained to either properly assemble or select from the predefined configurations at each pixel. In our experiments, we find that weighted assembling, which we refer to as "context-aware CSPN", leads to significant accuracy improvements, while weighted selection, "resource-aware CSPN", reduces the computational resources significantly with similar or better accuracy. Besides, the resources needed for CSPN++ can be adjusted automatically w.r.t. the computational budget. Finally, to avoid the side effects of noisy or inaccurate sparse depths, we embed a gated network inside CSPN++, which further improves the performance. We demonstrate the effectiveness of CSPN++ on the KITTI depth completion benchmark (http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_completion), where it significantly improves over CSPN and other SoTA methods.
Introduction
Image-guided depth completion, or depth completion for short in this paper, is the task of converting a sparse depth map from devices such as LiDAR [velodyne] or algorithms such as structure-from-motion (SfM) [wu2011visualsfm] and simultaneous localization and mapping (SLAM) [engel2014lsd] to a per-pixel dense depth map with the help of reference images. The technique has a wide range of applications for the perception of indoor/outdoor moving robots such as self-driving vehicles [chen2015deepdriving], home/indoor robots [desouza2002vision], or applications such as augmented reality [ventura2008depth].
One of the state-of-the-art (SoTA) methods for this task is CSPN, which is an efficient local linear propagation model with affinity learned by a convolutional neural network (CNN). In CSPN, [cheng2018learning] claim that three important properties should be considered for the depth completion task: 1) depth preservation, where the depth value at sparse points should be maintained, 2) structure alignment, where the detailed structures, such as edges and object boundaries in the estimated depth map, should be aligned with the given image, and 3) transition smoothness, where the depth transition between sparse points and their neighborhoods should be smooth.
In real applications, depths from devices like LiDAR, or algorithms such as SfM or SLAM, could be noisy [wvangansbeke_depth_2019] due to system or environmental errors. Datasets like KITTI adopt stereo and multiple frames to compensate for these errors for evaluation. In this paper, we do not assume that the sparse depth map is the ground truth; rather, we consider that it may include errors as well. So the depth value at sparse points should be conditionally maintained with respect to its accuracy. Secondly, all pixels are considered equally in CSPN, while intuitively the pixels at geometrical edges and object boundaries deserve more focus for structure alignment and transition smoothness. Therefore, in CSPN++, we propose to find a proper propagation context to further improve the performance of depth completion.
To be specific, as illustrated in Fig. 1, in CSPN++, numerous configurations of convolutional kernel size and number of iterations are first defined for each pixel; we then learn one set of per-pixel weights over the proposed kernel sizes, and another over the outputs after different iterations. Based on these hyperparameters, we derive context-aware and resource-aware variants of CSPN++. In context-aware CSPN (CA-CSPN), we propose to assemble the outputs, which makes CSPN++ structurally similar to networks such as InceptionNet [szegedy2016rethinking] or DenseNet [huang2017densely], where the gradient from the final output can be directly back-propagated to earlier propagation stages. We find that the model learns a stronger representation, yielding a significant performance boost compared to CSPN.
In resource-aware CSPN (RA-CSPN), CSPN++ sequentially selects one convolutional kernel and one number of iterations for each pixel by minimizing the computational resource usage; the learned resource allocation speeds up CSPN significantly (2-5× in our experiments) while also improving accuracy. In addition, RA-CSPN can be automatically adapted to a provided computational budget, while remaining aware of accuracy, through a budget rounding operation during training and inference.
In summary, our contribution lies in two aspects:

Based on the observation that sparse depths can be erroneous, we propose a gate network to guide the depth preservation and make the output more robust to noisy sparse depths.

We propose an effective method to adapt the kernel sizes and iteration number for each pixel with respect to image content for CSPN, which induces two variants, named context-aware and resource-aware CSPN. The former significantly improves performance, and the latter speeds up the algorithm and lets CSPN++ adapt to computational budgets.
Related Work
Depth estimation, completion, enhancement/refinement, and models for dense prediction with dynamic context and compression have been central problems in computer vision and robotics for a long time. Here we summarize those works in several aspects without enumerating them all due to space limits, and we mainly clarify their core relationships with the CSPN++ proposed in this paper.
Depth Completion.
The task of depth completion [uhrig2017sparsity] has recently attracted a lot of interest in robotics due to its wide application in enhancing 3D perception [LiaoHWKYL16]. The provided depths usually come from LiDAR, SfM or SLAM, yielding a map with valid depth available at only some of the pixels. Within this field, some works directly convert sparse 3D points to dense ones without image guidance [Zimmermann2017Learning, Ladicky_2017_ICCV, uhrig2017sparsity], and produce impressive results with deep learning. However, conventionally, jointly considering the structures from reference images for guiding depth completion/enhancement [liu2013guided, ferstl2013image] yields better results. With the rise of deep learning for depth estimation from a single image [eigen2014depth, wang2016surge], researchers have adopted similar strategies for image-guided depth completion. For example, [Ma2017SparseToDense] propose to treat the sparse depth map as an additional input to a ResNet-based depth predictor [laina2016deeper], producing results superior to the depth output from a CNN with image input only. Later works focus on improving efficiency [ku2018defense], separately modeling the features from image and sparse depths [tang2019learning], recovering the structural details of depth maps [cheng2018depth], combining with a multi-level CRF [xu2018structured], or adopting auxiliary training losses using normals [zhang2018deep] or 3D representations [qiu2019deeplidar, Chen2019depthcompletion] from a self-supervised learning strategy [ma2019self].
Among all of these works, we treat CSPN [cheng2018depth] as our baseline strategy due to its clear motivation and good theoretical guarantee on the stability of training and inference, and the resulting CSPN++ provides a significant boost in both effectiveness and efficiency.
Context Aware Architectures.
Assembling multiple contexts inside a network for dense predictions has been an effective component for recognition tasks in computer vision. From our perspective, the assembling strategies can be horizontal or vertical. Horizontal strategies assemble outputs from multiple branches in a single layer of a network, and include modules such as Inception/Xception [szegedy2016rethinking], pyramid spatial pooling (PSP) [zhao2016pyramid], and atrous spatial pyramid pooling (ASPP) [ChenPSA17]. Vertical strategies assemble outputs from different layers, and include modules such as HighwayNet [srivastava2015highway] and DenseNet [huang2017densely]. Some recent works combine the two strategies, such as HRNet [sun2019deep] or DenseASPP [yang2018denseaspp]. Most recently, to make the context better conditioned on each pixel or on the provided image, attention mechanisms have been introduced inside the network at the cost of additional computation, either for context selection, as in SkipNet [wang2018skipnet] and non-local networks [wang2018non], or for context deformation, as in spatial transformer networks [jaderberg2015spatial] and deformable networks [zhu2019deformable].
In the field of depth completion, [cheng2018learning] propose the atrous convolutional spatial pyramid fusion (ACSF) module, which extends ASPP by additionally adding affinity for each pixel, yielding stronger performance; it can be treated as a case of combining horizontal assembling with attention from affinity values. In our case, CA-CSPN of CSPN++ extends the context assembling idea into CSPN with both horizontal and vertical strategies via attention. Horizontally, it assembles multiple kernel sizes, and vertically it assembles the outputs from different iteration stages, as illustrated in Fig. 1. We note that, mathematically, in the forward process, performing one step of CA-CSPN with kernels of 7×7, 5×5 and 3×3 together is equivalent to performing CSPN with a single 7×7 kernel, since the full process is linear. However, the backward learning process is different due to the auxiliary weighting parameters, and our results are significantly better.
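The forward equivalence noted above follows from the linearity of convolution and can be checked numerically. The following is a minimal pure-Python sketch, not the CSPN++ implementation: the helper names `correlate_same` and `embed`, the random kernel values, and the fixed mixing weights are all assumptions made for illustration, and the weights are shared across pixels for simplicity.

```python
import random

def correlate_same(img, ker):
    """Zero-padded 2-D cross-correlation with an odd square kernel."""
    h, w, r = len(img), len(img[0]), len(ker) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(-r, r + 1):
                for b in range(-r, r + 1):
                    y, x = i + a, j + b
                    if 0 <= y < h and 0 <= x < w:
                        acc += ker[a + r][b + r] * img[y][x]
            out[i][j] = acc
    return out

def embed(ker, size):
    """Centre a small kernel inside a size x size grid of zeros."""
    off = (size - len(ker)) // 2
    big = [[0.0] * size for _ in range(size)]
    for a, row in enumerate(ker):
        for b, v in enumerate(row):
            big[a + off][b + off] = v
    return big

random.seed(0)
img = [[random.random() for _ in range(9)] for _ in range(9)]
kernels = {k: [[random.random() for _ in range(k)] for _ in range(k)]
           for k in (3, 5, 7)}
weights = {3: 0.2, 5: 0.3, 7: 0.5}  # illustrative per-kernel weights

# Weighted sum of three separate one-step propagations.
outs = {k: correlate_same(img, kernels[k]) for k in kernels}
mixed = [[sum(weights[k] * outs[k][i][j] for k in kernels)
          for j in range(9)] for i in range(9)]

# One propagation with the single merged 7x7 kernel.
padded = {k: embed(kernels[k], 7) for k in kernels}
merged = [[sum(weights[k] * padded[k][a][b] for k in kernels)
           for b in range(7)] for a in range(7)]
single = correlate_same(img, merged)

max_diff = max(abs(mixed[i][j] - single[i][j])
               for i in range(9) for j in range(9))
```

The two outputs agree up to floating-point error, so any gain from assembling kernels has to come from how the weights shape training rather than from extra expressiveness in the forward pass.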
Resource Aware Inference.
In addition, the dynamic context intuition can also be applied for efficient prediction by stopping the computation once a proper context has been obtained, which is also known as adaptive inference [graves2016adaptive]. Specifically, relevant strategies have been adopted in image classification, such as the multi-scale dense network (MSDNet) [huang2018multi], in object detection, such as trade-off balancing [huang2017speed], and in semantic segmentation, such as the regional convolution network (RCN), which treats each pixel differently [li2017not].
In RA-CSPN of CSPN++, we first embed such an idea in depth completion, and adopt the functionality of RCN in CSPN for efficient inference. To minimize the computation, each pixel chooses one kernel size and then one number of iterations sequentially from the proposed configurations. Besides, we can easily add a provided computation budget, such as latency or memory constraints, into our optimization target, which can be back-propagated for operation selection, similar to resource-constrained architecture search algorithms [zhou2019epnas, cai2018proxylessnas].
Preliminaries
To make the paper self-contained, we first briefly review CSPN [cheng2018learning], and then demonstrate how we extend it with context and resource awareness. Given a depth map output by a network that takes an image as input, CSPN updates it to a new depth map. Without loss of generality, we follow their formulation by embedding the depth into a hidden representation, and the updating equation for one propagation step can be written as,
(1) 
where the update represents one CSPN step given a predefined convolutional kernel size. The neighborhood of a pixel consists of the pixels covered by the kernel, and the affinities output from a network are properly normalized, which guarantees the stability of the module. The whole process is iterated a fixed number of times to obtain the final results. This number of iterations needs to be tuned in the experiments, and it impacts the final performance significantly in their paper.
For depth completion, CSPN preserves the depth value at the valid pixels of a sparse depth map by adding a replacement operation at the end of each step. Formally, taking the corresponding embedding of the sparse depth map, the replacement step after performing Eq. (1) is,
(2) 
where an indicator marks whether the sparse depth at a pixel is valid.
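The propagation and replacement steps reviewed above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: it follows the commonly described CSPN form, in which neighbor affinities are divided by the sum of their absolute values and the center weight makes the weights sum to one, and it skips neighbors that fall outside the image. The names `cspn_step` and `replace_sparse` and all numeric values are hypothetical.

```python
def cspn_step(h, affinity):
    """One propagation step on an H x W grid of hidden depths.

    affinity maps each neighbor offset (dy, dx) != (0, 0) to an
    H x W grid of raw affinities predicted for the center pixel.
    """
    rows, cols = len(h), len(h[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Normalise by the sum of absolute affinities for stability.
            norm = sum(abs(a[i][j]) for a in affinity.values()) or 1.0
            acc, used = 0.0, 0.0
            for (dy, dx), a in affinity.items():
                y, x = i + dy, j + dx
                if 0 <= y < rows and 0 <= x < cols:
                    w = a[i][j] / norm
                    acc += w * h[y][x]
                    used += w
            # The center weight makes all applied weights sum to one.
            out[i][j] = (1.0 - used) * h[i][j] + acc
    return out

def replace_sparse(h, sparse, valid):
    """Keep the measured depth wherever the sparse map is valid."""
    return [[sparse[i][j] if valid[i][j] else h[i][j]
             for j in range(len(h[0]))] for i in range(len(h))]

offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
           if (dy, dx) != (0, 0)]
affinity = {o: [[0.1 * (n + 1)] * 4 for _ in range(4)]
            for n, o in enumerate(offsets)}
depth = [[2.0] * 4 for _ in range(4)]
sparse = [[0.0] * 4 for _ in range(4)]
valid = [[False] * 4 for _ in range(4)]
sparse[1][2], valid[1][2] = 5.0, True

depth = replace_sparse(cspn_step(depth, affinity), sparse, valid)
```

Because the applied weights sum to one, a constant depth map is left unchanged by the propagation, and only the valid sparse pixel is overwritten by the replacement.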
Context and Resource Aware CSPN
In this section, we elaborate how CSPN++ enhances CSPN by learning a proper configuration for each pixel, which introduces additional parameters to predict. Specifically, it predicts one set of weights over the various convolutional kernel sizes and, for each kernel size, another set of weights over the different numbers of iterations. As shown in Fig. 2, both variables are image-content dependent, and are predicted from a shared backbone together with the CSPN affinity and estimated depths.
Context-Aware CSPN
Given these predicted weights, context-aware CSPN (CA-CSPN) first assembles the results from different steps. Formally, the propagation from one step to the next can be written as,
(3) 
where the iteration weights are obtained by applying the sigmoid function to the raw outputs from the network. In this process, the hidden representation progressively aggregates the output from each step of CSPN according to the iteration weights. Finally, we assemble the different outputs from the various kernels after the final iteration,
(4) 
Here, both sets of weights are properly normalized by their norm, so that our output maintains the stabilization property of CSPN for training and inference.
When there are sparse points available, CSPN++ adopts a confidence variable predicted at each valid depth in the sparse depth map, which is output from the shared backbone in our framework (Fig. 2). Therefore, the replacement step for CSPN++ can be modified accordingly,
(5) 
where the confidence is computed from a value predicted by the network after a convolutional layer.
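As a rough per-pixel illustration of the assembling and confidence-guided replacement described above, the sketch below normalizes hypothetical kernel and iteration weights and blends the propagated value with the sparse measurement. The exact forms of Eqs. (3)-(5) are not reproduced here: the L1 normalization, the sigmoid gate, and the names `assemble_pixel` and `gated_replace` are assumptions chosen for illustration.

```python
import math

def l1_normalize(ws):
    """Scale weights so that their absolute values sum to one."""
    s = sum(abs(w) for w in ws) or 1.0
    return [w / s for w in ws]

def assemble_pixel(outputs, iter_weights, kernel_weights):
    """Blend outputs[k][t], the value at one pixel for kernel k after
    the t-th sampled iteration, across iterations and then kernels."""
    result = 0.0
    for k, alpha in enumerate(l1_normalize(kernel_weights)):
        lambdas = l1_normalize(iter_weights[k])
        result += alpha * sum(l * o for l, o in zip(lambdas, outputs[k]))
    return result

def gated_replace(h, sparse, valid, raw_conf):
    """Blend the propagated value with the sparse depth by confidence."""
    w = 1.0 / (1.0 + math.exp(-raw_conf)) if valid else 0.0
    return (1.0 - w) * h + w * sparse

blended = assemble_pixel([[5.0, 5.0], [5.0, 5.0]],
                         [[0.2, 0.8], [0.5, 0.5]], [0.3, 0.7])
half_trusted = gated_replace(2.0, 5.0, True, 0.0)
```

With a raw confidence of zero the gate is 0.5, so the pixel lands halfway between the propagated and the measured depth; a very confident measurement recovers the hard replacement of Eq. (2).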
Complexity and computational resource analysis.
From CSPN, we know that theoretically, with a sufficient number of GPU cores and large memory storage, the overall complexity for CSPN with a kernel size of and iteration is . In CA-CSPN, with induced convolutional kernels, the computation complexity is , where is the maximum kernel size, since all branches can be performed simultaneously.
However, in real applications, the expected computational resource is limited, and the latency of memory requests with large convolutional kernels could be time consuming. Therefore, we need a better metric for estimating the cost. Here, we adopt the commonly used memory cost and Mult-Adds/FLOPs as indicators of latency or computational resource usage on a device. Specifically, based on the CUDA implementation of convolution with im2col [jia2014caffe], performing CSPN with a kernel would require memory cost of , and FLOPs of , given a single feature block with a size of . In summary, given kernels, the latency from big-O estimation for CA-CSPN would be . Finally, we would like to note that the memory and computational configuration varies across devices, and so does the latency estimation. A better strategy would be to test directly on the target device, as proposed in [cai2018proxylessnas]. Here, we just provide a reasonable estimation for a commonly used GPU.
Network architectures. As illustrated in Fig. 2, for the backbone network, we adopt the same ResNet-34 structure proposed in [ma2019self]. The only modification is at the end of the network, where it outputs the per-pixel estimation of the assembling parameters, the noisy guidance for replacement, and the affinity matrix using a convolutional layer with a kernel. For handling the affinity values for various propagation kernels, we use a shared affinity matrix, since the affinity between different pixels should be independent of the context of propagation, which saves memory inside the network.
Training context-aware CSPN. Given the proposed architecture, based on our computational resource analysis w.r.t. latency, we add an additional regularization term to the general optimization target, which minimizes the expected computational cost by treating the assembling weights as probabilities of configuration selection. It is shown to be effective in improving the final performance in our experiments. Formally, the overall target for training CA-CSPN can be written as,
(6)  
where the objective is minimized over the network parameters and includes a weight decay regularization term. The resource term is the expected computational cost given the assembling variables, based on our analysis above. The objective also involves the height and width of the feature. The data term compares the output depth map from CA-CSPN with the ground truth depth map. Our system can be trained end-to-end.
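To make the idea of an expected computational cost concrete, the following sketch treats the per-pixel weights as selection probabilities. The cost model is an assumption and not the one from the paper: it charges a configuration the kernel area times the number of iterations. The function name and the candidate values are likewise hypothetical.

```python
def expected_cost(kernel_probs, iter_probs, kernel_sizes, iter_counts):
    """Expected per-pixel cost under an assumed k*k*t cost model.

    kernel_probs[k] is the probability of picking kernel_sizes[k];
    iter_probs[k][t] is the probability of stopping after
    iter_counts[t] iterations given that kernel.
    """
    cost = 0.0
    for p_k, k, probs_t in zip(kernel_probs, kernel_sizes, iter_probs):
        exp_iters = sum(p * t for p, t in zip(probs_t, iter_counts))
        cost += p_k * (k * k) * exp_iters
    return cost

sizes, iters = [3, 5, 7], [3, 6, 9, 12]  # illustrative candidates
uniform_t = [0.25, 0.25, 0.25, 0.25]
last_t = [0.0, 0.0, 0.0, 1.0]

# All probability mass on the largest kernel and longest propagation.
worst = expected_cost([0.0, 0.0, 1.0],
                      [uniform_t, uniform_t, last_t], sizes, iters)
# All probability mass on the smallest kernel and shortest propagation.
best = expected_cost([1.0, 0.0, 0.0],
                     [[1.0, 0.0, 0.0, 0.0], uniform_t, uniform_t],
                     sizes, iters)
```

Averaging such a term over all pixels and adding it to the depth loss with a small weight penalizes pixels that put probability mass on expensive configurations.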
Resource Aware Configuration
As introduced in our complexity analysis, CSPN with a large kernel size and long propagation is time consuming. Therefore, to accelerate it, we further propose resource-aware CSPN (RA-CSPN), which selects the best kernel size and number of iterations for each pixel based on the estimated weights. Formally, its propagation step can be written as,
where  (7) 
Here each pixel is treated differently by selecting a best learned configuration, and we follow the same process of replacement as Eq. (2) for handling depth completion.
Computational resource analysis.
Given the selected configuration of convolutional kernel and number of iterations at each pixel, the latency estimation for each image that we proposed in the complexity and computational resource analysis above changes accordingly: it now depends on the average iteration step and the average kernel size in the image. Both numbers are guaranteed to be no larger than the maximum number of iterations and the maximum kernel size.
Training RACSPN.
In our case, training RA-CSPN does not need to modify the multi-branch architecture shown in Fig. 1, but switches from the weighted average assembling described in Eq. (3) and Eq. (4) to max selection, so that only one path is adopted for each pixel. In addition, we need to modify our loss function in Eq. (6) by changing the expected computational cost as,
where  (8) 
In practice, to implement configuration selection, we can reuse the same training pipeline as CA-CSPN by converting the obtained soft weighting values for kernel sizes and iterations to a one-hot representation through an argmax operation.
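The argmax conversion and the sequential choice of kernel and then iteration count can be sketched per pixel as follows. The function names and candidate lists are hypothetical and only illustrate the selection logic.

```python
def one_hot_argmax(weights):
    """Turn soft weights into a one-hot vector at the largest entry."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return [1.0 if i == best else 0.0 for i in range(len(weights))]

def select_config(kernel_weights, iter_weights, kernel_sizes, iter_counts):
    """Pick the kernel first, then the iteration count for that kernel."""
    k_idx = one_hot_argmax(kernel_weights).index(1.0)
    t_idx = one_hot_argmax(iter_weights[k_idx]).index(1.0)
    return kernel_sizes[k_idx], iter_counts[t_idx]

config = select_config(
    [0.2, 0.5, 0.3],
    [[0.7, 0.1, 0.2], [0.1, 0.3, 0.6], [0.4, 0.4, 0.2]],
    kernel_sizes=[3, 5, 7],
    iter_counts=[3, 6, 12],
)
```

Ties are resolved in favor of the first maximal entry, which matches the behavior of Python's built-in `max`.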
Efficient testing.
Practically, there are two issues we need to handle when making the algorithm efficient at testing: 1) how to perform different convolution simultaneously at different pixels, and 2) how to continue the propagation for pixels whose neighborhood pixels stop their diffusion/propagation process. To handle these issues, we follow the idea of regional convolution [li2017not].
Specifically, as shown in Fig. 3, to tackle the first issue, we group pixels into multiple regions based on the predicted kernel size, and prepare the corresponding matrix before convolution for each group using region-wise im2col. The generated matrices can then be processed simultaneously at each pixel using region-wise convolution. To tackle the second issue, when the propagation of one pixel stops at some time step, we directly copy its feature to the next step so that it can be used for computing convolutions at later stages. In summary, RA-CSPN can be performed in a single forward pass with less resource usage.
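The two mechanisms above, grouping pixels by selected kernel size and carrying stopped pixels forward unchanged, can be illustrated with a small sketch. It is not the region-wise im2col implementation: `step_fn` stands in for a real propagation step and is evaluated on the full grid for clarity, and all names and values are hypothetical.

```python
def group_by_kernel(kernel_map):
    """Collect pixel coordinates per selected kernel size."""
    groups = {}
    for i, row in enumerate(kernel_map):
        for j, k in enumerate(row):
            groups.setdefault(k, []).append((i, j))
    return groups

def propagate_with_stops(h, stop_map, step_fn, max_iters):
    """Run max_iters steps; a pixel whose selected iteration count is
    reached keeps (copies) its feature for all later steps."""
    for t in range(max_iters):
        updated = step_fn(h)
        h = [[updated[i][j] if t < stop_map[i][j] else h[i][j]
              for j in range(len(h[0]))] for i in range(len(h))]
    return h

kernel_map = [[3, 3], [5, 7]]
stop_map = [[1, 3], [2, 0]]
groups = group_by_kernel(kernel_map)

# Stand-in step that adds one, so the result counts applied updates.
add_one = lambda grid: [[v + 1.0 for v in row] for row in grid]
result = propagate_with_stops([[0.0, 0.0], [0.0, 0.0]],
                              stop_map, add_one, max_iters=3)
```

Stopped pixels still expose their last feature to neighbors that keep propagating, which is what the copy operation is for.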
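Method | SPP | CSPN: normal | CSPN: assemble kernel | CSPN: assemble iter. | GR | LR | RMSE (mm, lower is better) | MAE (mm, lower is better)
--- | --- | --- | --- | --- | --- | --- | --- | ---
[ma2019self] |  |  |  |  |  |  | 799.08 | 265.98
[ma2019self] | ✓ |  |  |  |  |  | 788.23 | 247.55
CSPN | ✓ | ✓ |  |  |  |  | 765.78 | 213.54
CSPN | ✓ | ✓ |  |  | ✓ |  | 756.27 | 215.21
CA-CSPN | ✓ |  | ✓ |  | ✓ |  | 732.46 | 210.61
CA-CSPN | ✓ |  | ✓ | ✓ | ✓ |  | 732.34 | 209.20
CA-CSPN | ✓ |  | ✓ | ✓ | ✓ | ✓ | 725.43 | 207.88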
Learning with provided computational budget.
Finally, in real applications, rather than being free to use an optimal amount of computational resources, a deployed model usually faces a hard constraint on either the memory or the latency of inference. Thanks to the adaptive resource usage of CSPN++, we are able to directly put the required budget into our optimization target during training. Formally, given a target memory cost and a target latency cost for resource-aware CSPN, our optimization target in Eq. (6) can be modified as,
s.t.  (9) 
where the two constrained quantities are the expected memory cost and the expected latency cost defined in Eq. (8). The two constraints can be added to our target easily with Lagrange multipliers. Formally, our optimization target with resource budgets is,
(10)  
where the hinge loss is adopted as our surrogate function for satisfying the constraints.
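A hinge surrogate of this kind only penalizes the part of the expected cost that exceeds the budget. The sketch below shows the shape of such an objective; the function name, the multiplier values, and the use of normalized costs are assumptions for illustration and do not reproduce Eq. (10) exactly.

```python
def budget_loss(task_loss, exp_memory, exp_latency,
                mem_budget, lat_budget, mu_mem=1.0, mu_lat=1.0):
    """Task loss plus hinge penalties for exceeding each budget."""
    mem_pen = max(0.0, exp_memory - mem_budget)
    lat_pen = max(0.0, exp_latency - lat_budget)
    return task_loss + mu_mem * mem_pen + mu_lat * lat_pen

within = budget_loss(1.0, 0.30, 0.30, 0.35, 0.35)
over = budget_loss(1.0, 0.50, 0.30, 0.35, 0.35, mu_mem=2.0)
```

When both expected costs are within budget the penalty vanishes, so the constraint does not push the model toward using fewer resources than allowed.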
Last but not least, since our primal problem, i.e. optimization with a deep neural network, is highly non-convex, there is no guarantee during training that all samples will satisfy the constraints. In addition, during testing, the predicted configuration might also violate the given constraints. Therefore, for these cases, we propose a resource rounding strategy that enforces a hard constraint keeping the overall computation within the budgets. Specifically, we calculate the average cost at each pixel, and for the pixels violating the budget, as illustrated in Fig. 1, we are able to find the Pareto optimal frontier [mock2011pareto] that satisfies the constraint, and we pick the configuration with the largest number of iterations, since it obtains the largest receptive field.
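The rounding strategy can be sketched for a single pixel as follows. The cost model (kernel area times iterations), the candidate lists, and the function name are assumptions for illustration; only the selection rule, taking the feasible Pareto-optimal configuration with the most iterations, comes from the text.

```python
def round_to_budget(kernel_sizes, iter_counts, budget,
                    cost=lambda k, t: k * k * t):
    """Return the feasible, non-dominated (kernel, iters) pair with the
    most iterations, or None if no configuration fits the budget."""
    feasible = [(k, t) for k in kernel_sizes for t in iter_counts
                if cost(k, t) <= budget]
    if not feasible:
        return None
    # A configuration is dominated if another feasible one is at least
    # as large in both kernel size and iteration count.
    frontier = [c for c in feasible
                if not any(o != c and o[0] >= c[0] and o[1] >= c[1]
                           for o in feasible)]
    return max(frontier, key=lambda c: (c[1], c[0]))

choice = round_to_budget([3, 5, 7], [3, 6, 9, 12], budget=120)
```

With these illustrative numbers the frontier holds the 3×3 kernel with 12 iterations and the 5×5 kernel with 3 iterations, and the former is chosen because it propagates longer.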
Experiments
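Method | Kernel | Iter. | Budget m.c. | Budget l.c. | Expected kernel size (CSPN = 1.0) | Expected iterations (CSPN = 1.0) | Memory cost (MB) | Latency cost (ms) | RMSE (mm, lower is better)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
CSPN | 7×7 | 12 |  |  | 1.0 | 1.0 | 829 | 28.88 | 756.27
CA-CSPN | assemble | 12 |  |  | 0.680 | 1.0 | 2125 | 67.23 | 732.46
CA-CSPN | assemble | assemble |  |  | 0.316 | 0.446 | 2125 | 67.23 | 725.43
RA-CSPN | select | select |  |  | 0.268 | 0.439 | 626.29 | 10.03 | 732.32
RA-CSPN | select | select | 0.35 | 0.35 | 0.333 | 0.303 | 625.30 | 9.84 | 742.17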
For experiments, we mainly evaluate CSPN++ on the KITTI depth completion benchmark [uhrig2017sparsity]. In this section, we first introduce the dataset, metrics and our implementation details. Then, an extensive ablation study of CSPN++ is conducted on the validation set to verify our insight into each proposed component. Finally, we provide a qualitative comparison of CSPN++ against other SoTA methods on the testing set.
Experimental setup
Dataset. The KITTI depth completion benchmark is a large real-world self-driving dataset with street views from a driving vehicle. It consists of 86k training, 7k validation and 1k testing depth maps with corresponding raw LiDAR scans and reference images. The sparse depth maps are obtained by projecting the raw LiDAR points through the view of the camera, and the ground truth dense depth maps are generated by first projecting the accumulated LiDAR scans of multiple timestamps, and then removing outlier depths from occlusion and moving objects through comparison with stereo depths from image pairs.
Metrics. We adopt the same error metrics as the KITTI depth completion benchmark, including root mean square error (RMSE), mean absolute error (MAE), inverse RMSE (iRMSE) and inverse MAE (iMAE), where inverse indicates the inverse depth representation, i.e. converting a depth d to 1/d.
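For reference, the four metrics can be computed as in the sketch below. It assumes depths given in millimeters, inverse depths reported in 1/km, and ground truth pixels with value zero treated as invalid, which matches the usual KITTI convention; the function name is hypothetical and predictions are assumed to be strictly positive.

```python
import math

def depth_metrics(pred_mm, gt_mm):
    """RMSE and MAE in mm, iRMSE and iMAE in 1/km, over valid pixels."""
    pairs = [(p, g) for p, g in zip(pred_mm, gt_mm) if g > 0]
    n = len(pairs)
    err = [p - g for p, g in pairs]
    # 1e6 / depth_in_mm is the inverse depth in 1/km.
    inv_err = [1e6 / p - 1e6 / g for p, g in pairs]
    rmse = lambda e: math.sqrt(sum(v * v for v in e) / n)
    mae = lambda e: sum(abs(v) for v in e) / n
    return {"RMSE": rmse(err), "MAE": mae(err),
            "iRMSE": rmse(inv_err), "iMAE": mae(inv_err)}

# The third pixel has no ground truth and is ignored.
metrics = depth_metrics([1000.0, 2000.0, 3000.0], [1000.0, 4000.0, 0.0])
```

The inverse metrics weight errors on nearby points more heavily, since a fixed error in millimeters changes 1/d far more at short range than at long range.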
Implementation details. We train our network with four NVIDIA Tesla P40 GPUs, and use a batch size of 8. In all our experiments, we adopt kernel sizes of , and , and sample outputs after times of propagation. All our models are trained with the Adam optimizer with . The learning rate starts from and is reduced by half every 5 epochs. For training context-aware CSPN in Eq. (6), the weight decay parameter is set to 0.0005, and the resource regularization parameter is set to 0.1. For training resource-aware CSPN in Eq. (8), we set and . All our parameters are chosen to balance the value scales of the different losses, without exhaustive tuning.
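The step schedule described above, halving the learning rate every 5 epochs, can be written as a one-line function. The base learning rate is left as a parameter here because its value is not stated in this text; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr, step=5, factor=0.5):
    """Learning rate after halving once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

schedule = [learning_rate(e, base_lr=1.0) for e in (0, 4, 5, 10)]
```

Epochs are counted from zero, so the first halving takes effect at the start of the sixth epoch.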
Ablation studies
Ablation study of context-aware CSPN (CA-CSPN). Here, we conduct experiments to verify each module adopted in our framework, including our baselines, i.e. CSPN with spatial pyramid pooling (SPP), and our newly proposed modules in context-aware CSPN. Specifically, to make the validation efficient, we only train each network for 10 epochs to obtain its results. For SPP, we adopt pooling sizes of , and for CSPN, we use a kernel size of 7×7 and set the number of iterations to 12. As shown in Tab. 1, by adding the SPP and CSPN modules to the baseline from [ma2019self], we can significantly reduce the depth error due to the induced pyramid context in SPP and the refined structure from CSPN. With additional confidence guided replacement (GR) (Eq. (5)), our module better handles the noisy sparse depths, and the RMSE is significantly reduced from 765.78 to 756.27. Then, in the rows with 'assemble kernel', we add the component that learns to horizontally assemble predictions from different kernel sizes via the learned kernel weights. It further reduces the error from 756.27 to 732.46. In the rows with 'assemble iter.', we include the component that learns to vertically assemble outputs after different iterations via the learned iteration weights. Finally, in the rows with 'LR', we add our proposed latency regularization term (Eq. (6)) to the training losses, yielding the best results of our context-aware CSPN.
In Fig. 4, we visualize the learned configurations of kernel size and number of iterations at each pixel. Typically, we find that the majority of pixels on the ground and walls only need a small kernel and few iterations for recovery, while pixels further away and around object and surface boundaries need large kernels and more iterations to obtain a larger context for reconstruction. This agrees with our intuition, since in real cases sparse points are denser close by and the structure is simpler in planar regions, which makes depth estimation easier.
Ablation study of resource-aware CSPN (RA-CSPN). To verify the efficiency of our proposed RA-CSPN, we study the computational improvement w.r.t. vanilla CSPN and CA-CSPN. As listed in Tab. 2, in the row 'CSPN', we list its memory cost and latency on device. In the rows 'CA-CSPN', although the memory cost and latency are larger in practice, the expected kernel size and iteration steps are much smaller when using our latency regularization term. This indicates that most pixels only need a small kernel and few iterations to obtain better results. In the rows 'RA-CSPN', we train with the resource-aware objective in Eq. (8), and show that RA-CSPN not only outperforms CSPN in efficiency (almost 3× faster), but also improves RMSE from 756.27 to 732.32. More importantly, we can train RA-CSPN with a computational budget to fit different devices, as proposed in Eq. (10). In the last row, with a hard constraint that the m.c. and l.c. are less than 35% of those of vanilla CSPN, we find that our method adjusts kernel sizes and iterations actively. In this case, the expected number of iterations drops from 0.439 to 0.303 while the expected kernel size rises from 0.268 to 0.333, which means that the network automatically chooses larger kernel sizes with fewer iterations to satisfy our hard constraints, while still producing better results than CSPN, demonstrating the effectiveness of our method.
Comparisons against other methods
Finally, to compare against other SoTA methods on depth estimation accuracy, we use our best model from CA-CSPN and fine-tune it for another 30 epochs before submitting the results to the KITTI test server. As summarized in Tab. 3, CA-CSPN outperforms all other methods significantly and currently ranks 2nd on the benchmark. However, our results are better in three out of the four metrics. We note that our results are also better than those of methods that adopt additional datasets, e.g. DeepLiDAR [qiu2019deeplidar] uses CARLA [dosovitskiy2017carla] to better learn dense depth and surface normal tasks jointly, and FusionNet [wvangansbeke_depth_2019] uses semantic segmentation models pretrained on CityScape [cordts2016cityscapes]. Our plain model is trained only on the KITTI dataset and outperforms all other methods.
In Fig. 5, we qualitatively compare the dense depth maps estimated by our proposed method with those of UberATG-FuseNet [Chen2019depthcompletion], together with the corresponding error maps. We find that our results are better at recovering detailed scene structure.
Method | iRMSE (1/km) | iMAE (1/km) | RMSE (mm) | MAE (mm)
--- | --- | --- | --- | ---
SC [uhrig2017sparsity] | 4.94 | 1.78 | 1601.33 | 481.27
CSPN [cheng2018depth] | 2.93 | 1.15 | 1019.64 | 279.46
NConv [eldesokey2019confidence] | 2.60 | 1.03 | 829.98 | 233.26
StD [ma2019self] | 2.80 | 1.21 | 814.73 | 249.95
FN [wvangansbeke_depth_2019] | 2.19 | 0.93 | 772.87 | 215.02
DL [qiu2019deeplidar] | 2.56 | 1.15 | 758.38 | 226.25
Uber [Chen2019depthcompletion] | 2.34 | 1.14 | 752.88 | 221.19
CA-CSPN | 2.07 | 0.90 | 743.69 | 209.28
Conclusion
In this paper, we propose CSPN++ for depth completion, which outperforms the previous SoTA strategy CSPN [cheng2018learning] by a large margin. Specifically, we elaborate two variants using the same framework of model selection, i.e. context-aware CSPN and resource-aware CSPN. The former significantly reduces estimation error, while the latter achieves much better efficiency with accuracy comparable to the former. We hope CSPN++ can motivate researchers to better adopt data-driven strategies for effectively learning hyperparameters in various tasks. In the future, we would like to merge the two variants, and consider replacing more modules in the network with CSPN for multiple tasks such as segmentation and detection.