LMSCNet: Lightweight Multiscale 3D Semantic Completion
Abstract
We introduce a new approach for multiscale 3D semantic scene completion from voxelized sparse 3D LiDAR scans. As opposed to the literature, we use a 2D UNet backbone with comprehensive multiscale skip connections to enhance feature flow, along with 3D segmentation heads. On the SemanticKITTI benchmark, our method performs on par with the state of the art on semantic completion and better than all other published methods on occupancy completion – while being significantly lighter and faster. As such, it provides a great performance/speed trade-off for mobile-robotics applications. The ablation studies demonstrate that our method is robust to lower-density inputs and that it enables very fast semantic completion at the coarsest level. Our code is available at https://github.com/cv-rits/LMSCNet.
1 Introduction
Understanding 3D surroundings is a natural ability for humans. While past experience allows us to reason about the geometry and semantics of an entire scene, this proves difficult for computers given the inherently sparse nature of 3D sensors [2] (due to sparse sensing, limited field-of-view, and occlusions). Still, a comprehensive 3D sensing of the scene is crucial for applications like mobile robotics, and especially for autonomous driving. Recently, semantic scene completion was proposed [36] as a new generative task, where both completion and semantic labels are inferred for the whole scene.
Unlike images, conveniently encoded as 2D tensors, 3D data raises representation challenges. It is thus common to encode the latter as voxel grids processed by 3D Convolutional Neural Networks (CNNs) [11, 36, 12, 26]. This shows good results but also requires heavy computation, as the memory requirement grows cubically with the input voxel resolution [28]. Consequently, most of the literature limits the predicted resolution and network depth, and is unable to perform the task at the same spatial resolution as the input [36, 15, 41]. This drawback has limited the deployment of such methods in real-time applications – augmented and virtual reality [38], robotics perception and navigation [23], scene understanding [25], among others – that would greatly benefit from semantic scene completion of sparse LiDAR scans.
We tackle this problem and propose a Lightweight Multiscale Semantic Completion network, coined LMSCNet, where a 3D voxel grid is processed with considerably lighter 2D convolutions without the need for additional modalities [1, 15]. This is achieved by convolving along one spatial axis (close in spirit to bird-eye-view processing [5]), while mapping back to the third dimension with 3D segmentation heads. In our proposal, multiscale completion is also possible thanks to an informative feature-map flow, preserving computational efficiency and enabling very fast inference at coarse levels. Fig. LABEL:fig:intro shows the multiscale output of our LMSCNet on the SemanticKITTI dataset [1], using a single sparse LiDAR scan encoded as a voxel grid as input. While some works use progressive multiscale losses [25, 10, 12], the literature ignores the benefit of multiscale completion, which we prove useful for reducing inference times. To summarize, the main contributions of our work are:
-
a novel 3D semantic scene completion pipeline using an occupancy grid,
-
a lightweight architecture mixing 2D/3D convolutions, leading to significantly fewer parameters,
-
a modular multiscale pipeline which allows coarser inference at very high speed,
-
state-of-the-art performance on semantic completion on SemanticKITTI [1] and better performance on completion.
2 Related Works
To process 3D data such as point-clouds, some use bird-eye-view [6] or 2D spherical [29] projection. Still, the common strategy relies either on point [32] or voxel [33] networks.
The inherent limitation of voxel representation is the staggering memory requirement due to empty voxels, which led to optimized structures [33] or use of sparse convolutions [18] to prevent dilation of the data manifold.
When it comes from real sensors, 3D data is inherently sparse (e.g. LiDAR scans, stereo, etc.), and its densification was initially framed as a completion or reconstruction task. Recent works, though, also assign semantic labels to their output, subsequently referring to this as semantic completion.
3D completion & reconstruction. Early works approximate missing data as a set of primitive local surfaces, either from the data structure [37, 35, 11] or using Truncated Signed Distance Functions (TSDFs) [8, 30, 34], while others use continuous energy minimization [21].
More recently, learning methods boosted the completion of occluded and unseen regions [14, 17, 43]. In [22], voxel labels are predicted using a Conditional Random Field (CRF), while [11] uses a 3D Encoder-Predictor to correlate the observed scene with a priori known 3D shapes. A few works also benefit from Signed Distance Function representations as they provide a richer gradient flow [31, 33, 10]. For memory reasons, [10] uses a TSDF input with a sparse encoder and a partially dense decoder to propagate features in unknown regions. While this effectively reduces memory, we argue that because TSDFs are denser than occupancy grids, it would require a bigger and denser decoder, thus negating the benefit of any sparse encoding. Other end-to-end completion networks were also proposed in [9, 4]. Despite the discretization it implies, we prefer a voxel-based representation due to the size limitations of point-based networks, even though the latter show promising object completion results [40].
3D Semantic Scene Completion. SSCNet [36] was the first work to combine semantic segmentation and scene completion end-to-end with 3D CNNs. Further works also use additional RGB data by projecting or fusing semantic features from an image network [15, 26, 24, 27]. An alternative to using 3D data only is to encode LiDAR scans as a spherical projection [29], which enriches the neighboring information [1, 19]. While this boosts performance, it also increases the network complexity and subsequently the inference time.
Generative Adversarial Networks (GANs) have also been proposed to enforce realistic outputs [39, 7] but are harder to train.
To lower memory consumption with the preferred voxelized representations, Spatial Group Convolutions (SGC) [41] divide input into groups for efficient processing at the cost of small performance drops.
Different from the literature, we rely solely on 3D voxelized occupancy data avoiding any preprocessing (as for TSDFs, SDFs, etc.), and propose a lightweight architecture with additional multiscale capability.

3 Method
We tackle the problem of dense 3D semantic completion, where the task is to assign a semantic label to each individual voxel. Given a sparse 3D voxel grid, the goal is to predict the complete 3D semantic scene representation, where each voxel is assigned a semantic label in $\{c_0, c_1, \ldots, c_N\}$, where $N$ is the number of semantic classes and $c_0$ stands for free voxels.
Our architecture, coined LMSCNet and shown in Fig. 1, uses a lightweight UNet-style architecture to predict 3D semantic completion at multiple scales, allowing fast coarse inference, beneficial for mobile robotics applications. Instead of greedy 3D convolutions, we mostly employ 2D convolutions over the horizontal plane, treating the height axis as a feature dimension, similar to a bird-eye view. In the following we detail our custom lightweight 2D/3D architecture (Sec. 3.1), the multiscale reconstruction (Sec. 3.1), and the overall training pipeline (Sec. 3.2).
3.1 Lightweight multiscale 2D/3D architecture
To infer a dense output from the sparse input voxel grid, we use a standard encoder-decoder UNet architecture with 4 levels, thus learning features at decreasing resolutions. At each level, a series of convolution operations is applied, followed by a pooling that downscales the resolution by 2. The reduction of spatial dimensions in UNets is beneficial for semantic tasks as it subsequently increases the kernels' field-of-view at no cost. Note that dilated convolutions (a.k.a. ‘atrous’) with increasing dilation rates cannot be used in the encoder due to the sparse input nature. Though dense convolutions in the encoder imply a dilation of the input manifold [18], we argue this is beneficial for 3D semantic completion, given the sparse-to-dense nature of the task.
2D backbone. To preserve a lightweight architecture, we use 2D convolutions along the X, Y dimensions, thus turning the height dimension (Z) into a feature dimension. Notice that we directly process 3D data, in contrast to other 2D/3D works that rely on 2.5D data (e.g. depth [19, 26], bird-eye view [6]). While using 2D convolutions implies losing 3D spatial connectivity, it also enables significantly lighter operations. To further reduce the memory requirements, we keep a minimum number of features in each convolution layer. Along with the standard skip connections, we also enhance the information flow in the decoder by concatenating the output of each level to all lower levels. Technically, we upsample coarse feature maps with learned ad-hoc deconvolutions before concatenating them to lower levels, which is shown with purple deconv blocks along blue and gray arrows in Fig. 1. Intuitively, this enables our network to use high-level features from coarser resolutions, thus enhancing the spatial contextual information.
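To make the idea concrete, below is a minimal PyTorch sketch of such a 2D block operating on a voxelized scan; the layer widths, the two-convolution layout and the (B, Z, X, Y) tensor ordering are illustrative assumptions rather than the exact configuration of the released code.

```python
import torch
import torch.nn as nn

class Conv2DBlock(nn.Module):
    """2D convolutions over the (X, Y) plane: the height axis Z is folded
    into the channel dimension, which keeps the block lightweight."""
    def __init__(self, in_height, out_feats):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_height, out_feats, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_feats, out_feats, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (B, Z, X, Y) occupancy grid, with Z used as input channels
        return self.block(x)

# Example on a SemanticKITTI-sized grid (256x256x32 voxels, batch of 1).
grid = torch.zeros(1, 32, 256, 256)                  # (B, Z, X, Y)
feats = Conv2DBlock(in_height=32, out_feats=64)(grid)
print(feats.shape)                                   # torch.Size([1, 64, 256, 256])
```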
3D segmentation head. Different from other works handling point clouds as bird-eye views, the task of 3D semantic completion actually requires retrieving the third dimension “lost” with 2D convolutions. In other words, while 2D CNNs output 3D feature maps, our decoder must output a 4D tensor, the last dimension being the semantic class-wise probability distribution.
To address this, we introduce 3D segmentation heads, depicted as gray blocks in Fig. 1. The heads use a series of dense and dilated convolutions. The latter, in the form of Atrous Spatial Pyramid Pooling (ASPP) [5, 26], are beneficial to fuse information from different receptive fields thanks to convolutions with increasing dilation rates (here 1, 2 and 3). Note that dilated convolutions, though light and powerful, are not appropriate for sparse inputs and, as such, cannot be used in the encoder.
In our segmentation head, the benefit of preceding ASPP with dense 3D convolutions is twofold: a) to further densify the feature maps, b) to decouple the segmentation-head features from the backbone features. This last property is required to enable the multiscale capacity, which we now describe.
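A minimal PyTorch sketch of such a head is given below, using the dilation rates 1, 2 and 3 mentioned above; the channel counts, the summation-based fusion of the ASPP branches and the way the 2D feature channels are re-interpreted as the height axis are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ASPP3D(nn.Module):
    """Parallel dense 3D convolutions with dilation rates 1, 2 and 3,
    fused here by summation (illustrative fusion choice)."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=3,
                      padding=rate, dilation=rate)
            for rate in (1, 2, 3)
        ])

    def forward(self, x):
        return torch.relu(sum(branch(x) for branch in self.branches))

class SegmentationHead3D(nn.Module):
    """Dense 3D convolution to densify the lifted features, followed by
    ASPP and a per-voxel classifier over the semantic classes + free."""
    def __init__(self, feat_3d, n_classes):
        super().__init__()
        self.densify = nn.Conv3d(1, feat_3d, kernel_size=3, padding=1)
        self.aspp = ASPP3D(feat_3d)
        self.classify = nn.Conv3d(feat_3d, n_classes, kernel_size=1)

    def forward(self, x2d):
        # x2d: (B, C, X, Y) backbone features; C is re-read as the Z axis
        x3d = x2d.unsqueeze(1)               # (B, 1, Z, X, Y)
        x3d = torch.relu(self.densify(x3d))  # decouple from backbone features
        x3d = self.aspp(x3d)                 # multi-receptive-field fusion
        return self.classify(x3d)            # (B, n_classes, Z, X, Y)

# Example: per-voxel class scores from a (1, 32, 128, 128) 2D feature map.
scores = SegmentationHead3D(feat_3d=8, n_classes=20)(torch.randn(1, 32, 128, 128))
print(scores.shape)                          # torch.Size([1, 20, 32, 128, 128])
```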
Multiscale completion. In the same vein as [41, 10], we aim to output multiscale completion to enable both coarse scene representation and faster scene completion at lower resolution – beneficial for mobile robotics applications. We subsequently attach a 3D segmentation head after each level of the 2D UNet architecture, thus providing outputs at relative scales of 1:1, 1:2, 1:4 and 1:8 with respect to the input. A sample output is shown in Fig. 2. As already mentioned, we noticed experimentally the importance of separating the segmentation features from the main features of the 2D backbone, which again justifies the additional 3D convolutions in the segmentation head. The main interest of our multiscale architecture is that it infers semantic scene completion at any desired scale, reducing the computation and memory requirements. This is further analyzed in Sec. 4.1.1.
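The sketch below illustrates the multiscale wiring only, with a 1×1×1 3D convolution standing in for the full segmentation head described above; the number of levels, channel counts and the early-exit logic for coarse-only inference are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiscaleHeads(nn.Module):
    """One per-level classifier so inference can stop at a coarse scale."""
    def __init__(self, level_channels, n_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv3d(c, n_classes, kernel_size=1) for c in level_channels]
        )

    def forward(self, level_features, coarsest_only=False):
        # level_features[l]: decoder features at scale 1:2^l (l = 0 is full size).
        if coarsest_only:                       # fastest, coarsest completion
            return {3: self.heads[3](level_features[3])}
        return {l: head(f)
                for l, (head, f) in enumerate(zip(self.heads, level_features))}

# Toy decoder features at scales 1:1, 1:2, 1:4 and 1:8 of a 256x256x32 grid.
feats = [torch.randn(1, 8, 32 // 2**l, 256 // 2**l, 256 // 2**l) for l in range(4)]
outputs = MultiscaleHeads([8, 8, 8, 8], n_classes=20)(feats)
print({l: tuple(o.shape) for l, o in outputs.items()})
```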

3.2 Training strategy
We train our LMSCNet from scratch in a standard end-to-end fashion from pairs of sparse input voxel grids ($x$) and semi-dense semantically labeled voxel grids ($y$). It is important to note that in a real setup, a dense ground truth is impractical for scene completion, due to occlusions and sensor field-of-view limitations. As such, the ground truth is also sparse and encoded with $N{+}2$ classes ($N$ semantic classes, 1 free class, 1 unknown). Similar to others [36, 26, 15], we use a sparse loss strategy, back-propagating the gradient only where the ground truth is known.
For each scale $s$, we train with a weighted cross-entropy loss defined as

$\mathcal{L}_s = -\frac{1}{N_v}\sum_{i=1}^{N_v}\sum_{c} w_c\, y_{i,c}\, \log\big(\hat{y}_{i,c}\big)$ (1)

where $\hat{y}$ is the network output, $i$ a voxel index ($N_v$ being the number of voxels with known ground truth), and $y_i$ a one-hot vector (i.e. $y_{i,c}=1$ if voxel $i$ is labeled with class $c$, $y_{i,c}=0$ otherwise).
Note that semantic tasks are by nature highly class-imbalanced problems. This is especially true in outdoor settings, where classes like road or vegetation are prevalent. We account for this class imbalance in Eq. 1 by weighting each class loss according to the inverse of the class frequency as in [29], thus using weights $w_c$ that decrease with the frequency $f_c$ of class $c$ (computed with a small $\epsilon$ for numerical stability).
Finally, the complete network loss is a weighted sum of all level losses

$\mathcal{L} = \sum_{s} \lambda_s\, \mathcal{L}_s$ (2)

where $\lambda_s$ is the per-level loss weight, written for generality, though we use $\lambda_s = 1$ for all levels, which works well and preserves the multiscale capacity. Note that some of our choices were guided by faster training or inference speed. For example, unlike [41, 42, 13, 10, 36] we avoid using Truncated Signed Distance Function (TSDF) variants that require a greedy computation time and were found to be of little benefit [15, 1]. We also tried to encode the input with $N{+}2$ classes, that is, with an unknown class, but we noticed little improvement – if any – at the cost of a large pre-processing time for ray casting.
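A minimal PyTorch sketch of this training loss is given below; the per-class weighting (here the inverse log of per-class voxel counts), the label value used for unknown voxels and the tensor layouts are assumptions, not the exact implementation of the released code.

```python
import torch
import torch.nn.functional as F

def semantic_completion_loss(logits_per_scale, target_per_scale,
                             class_counts, ignore_index=255):
    """Weighted cross-entropy per scale (Eq. 1), summed with lambda_s = 1
    over all scales (Eq. 2). Voxels labeled `ignore_index` (unknown ground
    truth) are excluded, so no gradient is back-propagated for them."""
    # Assumed weighting: inverse log of the per-class voxel counts over the
    # training set (counts >> 1, so the weights stay positive).
    weights = 1.0 / torch.log(class_counts.float() + 0.001)
    total = 0.0
    for logits, target in zip(logits_per_scale, target_per_scale):
        # logits: (B, C, Z, X, Y) scores; target: (B, Z, X, Y) class indices
        total = total + F.cross_entropy(logits, target, weight=weights,
                                        ignore_index=ignore_index)
    return total
```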
Table 1: Results on the hidden SemanticKITTI test set. Left: scene completion (precision, recall, IoU); right: semantic scene completion (per-class IoU and mIoU). Class frequencies are given in parentheses.

| Approach | precision | recall | IoU | road (15.30%) | sidewalk (11.13%) | parking (1.12%) | other-ground (0.56%) | building (14.1%) | car (3.92%) | truck (0.16%) | bicycle (0.03%) | motorcycle (0.03%) | other-vehicle (0.20%) | vegetation (39.3%) | trunk (0.51%) | terrain (9.17%) | person (0.07%) | bicyclist (0.07%) | motorcyclist (0.05%) | fence (3.90%) | pole (0.29%) | traffic-sign (0.08%) | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSCNet [36] | 31.71 | 83.40 | 29.83 | 27.55 | 16.99 | 15.60 | 6.04 | 20.88 | 10.35 | 1.79 | 0 | 0 | 0.11 | 25.77 | 11.88 | 18.16 | 0 | 0 | 0 | 14.40 | 7.90 | 3.67 | 9.53 |
| *SSCNet-full [36] | 59.64 | 75.52 | 49.98 | 51.15 | 30.76 | 27.12 | 6.44 | 34.53 | 24.26 | 1.18 | 0.54 | 0.78 | 4.34 | 35.25 | 18.17 | 29.01 | 0.25 | 0.25 | 0.03 | 19.87 | 13.10 | 6.73 | 16.14 |
| TS3D [15] | 31.58 | 84.18 | 29.81 | 28.00 | 16.98 | 15.65 | 4.86 | 23.19 | 10.72 | 2.39 | 0 | 0 | 0.19 | 24.73 | 12.46 | 18.32 | 0.03 | 0.05 | 0 | 13.23 | 6.98 | 3.52 | 9.54 |
| TS3D+DNet [1] | 25.85 | 88.25 | 24.99 | 27.53 | 18.51 | 18.89 | 6.58 | 22.05 | 8.04 | 2.19 | 0.08 | 0.02 | 3.96 | 19.48 | 12.85 | 20.22 | 2.33 | 0.61 | 0.01 | 15.79 | 7.57 | 6.99 | 10.19 |
| TS3D+DNet+SATNet [1] | 80.52 | 57.65 | 50.60 | 62.20 | 31.57 | 23.29 | 6.46 | 34.12 | 30.70 | 4.85 | 0 | 0 | 0.07 | 40.12 | 21.88 | 33.09 | 0 | 0 | 0 | 24.05 | 16.89 | 6.94 | 17.70 |
| LMSCNet (ours) | 77.11 | 66.19 | 55.32 | 64.04 | 33.12 | 24.91 | 3.22 | 38.67 | 29.48 | 2.53 | 0 | 0 | 0.11 | 40.53 | 18.97 | 30.77 | 0 | 0 | 0 | 20.52 | 15.72 | 0.54 | 17.01 |
| LMSCNet-singlescale (ours) | 81.55 | 65.07 | 56.72 | 64.80 | 34.68 | 29.02 | 4.62 | 38.08 | 30.89 | 1.47 | 0 | 0 | 0.81 | 41.31 | 19.89 | 32.05 | 0 | 0 | 0 | 21.32 | 15.01 | 0.84 | 17.62 |
* Own implementation.
[Figure 3: Qualitative comparison on the SemanticKITTI validation set. Columns: Input | SSCNet-full [36] | LMSCNet (ours) | Ground Truth. Images omitted.]
4 Experiments
We evaluate our LMSCNet method by training on the recent semantic scene completion benchmark SemanticKITTI [1], which provides 3D voxel grids built from semantically labeled scans of an HDL-64E rotating LiDAR in outdoor urban scenes [16]. In [1], inputs are voxelized single scans, while the ground truth is obtained from the voxelized aggregation of successive registered scans. Grids are 256×256×32 with 0.2m voxel size, and it is important to note that input and ground truth are sparse, with average densities of 6.7% and 65.8%, respectively. We use the standard mIoU as a semantic completion metric, measuring the intersection over union averaged over all classes (20 semantic classes + free). Additionally, we consider the completion metrics IoU, Precision, and Recall to provide a sense of the scene completion quality, regardless of the assigned semantic labels (i.e. considering the binary free/occupied setting). Note that completion is crucial for obstacle avoidance in mobile robotics.
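For reference, a minimal NumPy sketch of the binary completion metrics is given below, assuming boolean occupancy grids and a mask of voxels whose ground truth is known; the official benchmark scripts may differ in implementation details.

```python
import numpy as np

def completion_scores(pred_occ, gt_occ, known_mask):
    """Completion precision / recall / IoU on the binary free-vs-occupied
    task, evaluated only where the ground truth is known."""
    p = pred_occ[known_mask].astype(bool)
    g = gt_occ[known_mask].astype(bool)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    iou = tp / max(tp + fp + fn, 1)
    return precision, recall, iou
```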
Implementation details. We train using the original train/val splits with 3834/815 grids [1], adding x-y flipping augmentation for better generalization. The Adam optimizer is used with a learning rate of 0.001 and a decaying schedule. Training fits on a single 11GB GPU with a batch size of 4, taking around 48 hours to converge (about 80 epochs).
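The x-y flipping augmentation amounts to randomly mirroring the voxel grid along its horizontal axes, applied identically to the input and the ground truth; a small sketch is given below, assuming (X, Y, Z) array ordering.

```python
import numpy as np

def random_xy_flip(voxels, labels, rng=np.random):
    """Randomly mirror input and ground truth along the X and/or Y axis."""
    # voxels, labels: arrays of shape (X, Y, Z)
    if rng.rand() < 0.5:
        voxels, labels = voxels[::-1, :, :], labels[::-1, :, :]
    if rng.rand() < 0.5:
        voxels, labels = voxels[:, ::-1, :], labels[:, ::-1, :]
    return np.ascontiguousarray(voxels), np.ascontiguousarray(labels)
```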
4.1 Performance
In the following we report performance against four state-of-the-art methods: SSCNet [36], TS3D [15], TS3D+DNet [1], and TS3D+DNet+SATNet [1]. Because the SSCNet output is 4× downsampled, we also report performance using deconvolution to reach the full input resolution, hereafter denoted SSCNet-full. We refer to the supplementary for details on the required architecture adjustments.
Hereafter, we denote our multiscale architecture as LMSCNet. We detail semantic completion performance and then demonstrate the speed and lightness of our architecture.
Semantic Scene Completion
Performance on the SemanticKITTI benchmark [1] is reported in Tab. 1 against all published methods and SSCNet-full. The evaluation was conducted on the official server (i.e. hidden test set) hence, with the full size ground truth.
Overall, we perform on par with the best methods, though 2nd on the semantic completion metric (mIoU). On the latter, TS3D+DNet+SATNet [1] is slightly better despite its significantly heavier and slower network. Note also that TS3D uses an additional RGB input, and all TS3D+DNet variants also use the LiDAR refraction intensity. Conversely, LMSCNet is more versatile as it only uses an occupancy grid input. Notice that the highly imbalanced class frequencies (shown in parentheses in Tab. 1) also illustrate the task complexity. Specifically, we outperform others on the four largest classes but perform on par or lower on the others, which advocates for some improvement in our balancing strategy. On the completion metrics (IoU), our method outperforms all others by a comfortable margin. Again, completion is of high importance for practical mobile robotics applications.
In addition to the multiscale proposal (LMSCNet), we also report LMSCNet-singlescale – a variation of LMSCNet trained with the full-scale loss only – which logically performs a little better at full size, though at the cost of losing the crucial multiscale capacity.
Qualitative performance. We compare qualitatively the full-size outputs of our LMSCNet and SSCNet-full in Fig. 3, with view pairs from 4 scenes of the SemanticKITTI validation set.
Multiscale performance.
LMSCNet scale | IoU | mIoU |
---|---|---|
1:1 (full size) | 54.22 | 16.78 |
1:2 | 56.27 | 16.78 |
1:4 | 59.36 | 17.19 |
1:8 | 65.45 | 17.37 |
Tab. 2 shows the multiscale performance of our method on the SemanticKITTI validation set, where the scale is relative to the full-size resolution (level 0). From Sec. 3.1, the scale at level $l$ is $1{:}2^{l}$. Ground truths at lower resolutions were obtained by majority-vote pooling of the full-size ground truth. From the above table, our architecture maintains a good performance at all resolutions, with the best performance logically reached at the lowest resolution (highest level). Qualitative multiscale completion is visible in Fig. 2. We argue that our architecture reaches its multiscale capacity thanks to the disentanglement of the segmentation features with our custom head. Additionally, at coarser resolutions our network reaches very fast inference, which is described in detail in the following section.
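One simple way to obtain such coarse ground truth is a per-block majority vote, sketched below; the unknown-label encoding, the exclusion of unknown voxels from the vote and the tie-breaking are assumptions.

```python
import numpy as np

def majority_pool(labels, factor, ignore=255):
    """Downscale a label grid by `factor` along each axis with a per-block
    majority vote; blocks with no known voxel stay unknown."""
    X, Y, Z = labels.shape
    blocks = labels.reshape(X // factor, factor,
                            Y // factor, factor,
                            Z // factor, factor)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5).reshape(
        X // factor, Y // factor, Z // factor, -1)
    out = np.empty(blocks.shape[:3], dtype=labels.dtype)
    for idx in np.ndindex(out.shape):
        votes = blocks[idx]
        votes = votes[votes != ignore]       # unknown voxels do not vote
        out[idx] = np.bincount(votes).argmax() if votes.size else ignore
    return out

# Example: pool 256x256x32 labels to the 1:2, 1:4 and 1:8 scales.
# coarse = [majority_pool(gt, 2**l) for l in (1, 2, 3)]
```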
Architectures comparison.
Method | Params (M) | FLOPs (G) | FPS |
---|---|---|---|
*SSCNet [36] | 0.93 | 82.5 | 56.90 |
*SSCNet-full [36] | 1.09 | 769.6 | 45.94 |
*TS3D [15] | 43.77 | 2016.7 | 9.79 |
*TS3D+DNet [1] | 51.31 | 847.1 | 8.72 |
*TS3D+DNet+SATNet [1] | 50.57 | 905.2 | 1.27 |
LMSCNet | 0.35 | 72.6 | 21.28 |
LMSCNet (1:2) | 0.32 | 13.7 | 126.38 |
LMSCNet (1:4) | 0.28 | 5.7 | 323.46 |
LMSCNet (1:8) | 0.24 | 4.4 | 372.24 |
* Own implementation to compute network statistics.
Tab. 3 reports network statistics for our architecture and all the above-mentioned baselines. Even at full size, LMSCNet has significantly fewer parameters (0.35M) and a lower computational cost for inference (72.6G FLOPs). Compared to any TS3D baseline, it requires at least an order of magnitude fewer parameters and FLOPs. However, SSCNet (original or full) is roughly twice as fast as LMSCNet, though with more parameters and worse performance (cf. Tab. 1). Since lighter models do not always run faster, due to the sequential nature of some operations on GPU, we conjecture that the higher speed of SSCNet is caused by its lower number of convolutional operations compared to full-scale LMSCNet (16 vs. 25).
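The parameter counts in Tab. 3 can be reproduced with a one-liner over the model parameters; the rough timing helper below is only indicative, since FPS depends on the GPU (and FLOP counting requires an external profiler).

```python
import time
import torch

def count_parameters_m(model):
    """Trainable parameters in millions, as reported in Tab. 3."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def rough_fps(model, input_shape=(1, 32, 256, 256), runs=50):
    """Indicative frames-per-second estimate (add torch.cuda.synchronize()
    around the timed loop when benchmarking on GPU)."""
    model.eval()
    x = torch.zeros(input_shape)
    with torch.no_grad():
        model(x)                              # warm-up
        start = time.time()
        for _ in range(runs):
            model(x)
    return runs / (time.time() - start)
```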
In the last rows of Tab. 3, we report statistics for coarser completion, removing the unnecessary parts of our network at inference. Lower-resolution inference allows significant speedups, reaching 372 FPS at the highest scale – 6× faster than SSCNet and roughly 300× faster than TS3D+DNet+SATNet. Fig. 5 illustrates the performance versus speed of the architectures. Notice that even at full scale we provide a better speed-performance balance. Because semantic completion is an application of high interest for mobile robotics, like autonomous driving, our lighter architecture is beneficial for embedded GPUs and enables coarse scene analysis at high speed.
[Figure 6: Qualitative completion from simulated lower-resolution LiDAR inputs (8 layers | 32 layers). Images omitted.]
4.2 Ablation studies
To study the benefit of our design choices, we conduct a series of ablation studies on SemanticKITTI validation set. This is done by modifying important blocks of our architecture and evaluating its performance.
Influence of input resolution. We evaluate the robustness of our approach by retrieving the original 64-layer KITTI scans used in SemanticKITTI and simulating 8/16/32-layer LiDARs through layer subsampling.
Fig. 6 shows the quantitative and qualitative performance using simulated and original LiDAR. As expected, inputs with fewer layers deteriorate the performance, especially in areas far from the sensor location, but our network still performs reasonably well on semantics (mIoU) and completion (IoU). This is visible in the middle image, as the 8-layer input (2.10% density) is sufficient to retrieve the general outline of the scene.
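Layer subsampling simply keeps one LiDAR ring out of 2, 4 or 8 before voxelization, as sketched below; the availability of a per-point ring index is an assumption (for the HDL-64E it can be recovered from the vertical angle).

```python
import numpy as np

def subsample_layers(points, ring_index, keep_every):
    """Simulate a lower-resolution LiDAR by keeping one ring out of
    `keep_every` (2, 4 or 8 for 32-, 16- or 8-layer sensors)."""
    # points: (N, 3) scan coordinates; ring_index: (N,) laser id in [0, 63]
    mask = (ring_index % keep_every) == 0
    return points[mask]
```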
Method | IoU | mIoU |
---|---|---|
LMSCNet (ours) | 54.22 | 16.78 |
w/o Deconv | 52.79 | 15.64 |
w/o ASPP | 53.81 | 16.21 |
w/o Multiscale UNet | 53.54 | 16.22 |
Deconv versus Upsampling. As we aim to preserve a lightweight architecture, we tried removing the parameter-greedy deconv layers from our network (cf. Sec. 3.1), replacing them with upsampling layers. From Tab. 4, removing the deconv layers introduces a 1.43% and 1.14% performance drop on completion and semantic completion respectively, while saving only 3% of the parameters.
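The ablation swaps a learned transpose convolution for a parameter-free upsampling layer, roughly as below; the 2D layer type (matching the 2D backbone), kernel size and interpolation mode are assumptions.

```python
import torch.nn as nn

c = 16  # illustrative channel count
# Learned upsampling (transpose convolution; kernel size is an assumption):
learned_up = nn.ConvTranspose2d(c, c, kernel_size=4, stride=2, padding=1)
# "w/o Deconv" variant: parameter-free upsampling in its place:
parameter_free_up = nn.Upsample(scale_factor=2, mode='nearest')
```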
Dilated convolutions. We evaluate the benefit of dilated convolutions in the decoder by ablating the ASPP blocks from the segmentation head (see Fig. 1). Tab. 4 indicates that, without ASPP, completion (IoU) and semantic completion (mIoU) drop by 0.41% and 0.57%, respectively. We conjecture that the ASPP boost comes from the increasing receptive fields of the inner dilated convolutions, which provide richer features.
Multiscale UNet decoder.
Unlike a vanilla UNet decoder, we concatenate the features at the end of each decoder level to all other levels (cf. Fig. 1). This is intended to aggregate multiscale features and should intuitively help finer resolutions benefit from coarser semantic features.
We assess the benefit of our multiscale UNet decoder by evaluating a vanilla UNet decoder in the last row of Tab. 4, which shows that our proposal boosts completion by 0.68% and semantic completion by 0.56%.
5 Conclusion
We proposed a novel method, coined LMSCNet, for 3D semantic scene completion, which benefits from mixing 2D/3D convolutions to preserve a lightweight architecture while enabling inference at multiple scales. On the challenging SemanticKITTI benchmark, we perform on par with other methods on semantic completion with a much lighter architecture and a faster inference speed. On completion, we outperform the state of the art. Results show that the loss of 3D spatial connectivity caused by the 2D convolutions does not impair performance. We attribute this to the limited variance along the height axis in the targeted application (i.e. constant sensor viewpoint, urban outdoor scenes), and conjecture that data with higher variance in all directions would be more affected. Also of interest for mobile robotics, our proposal is robust to much lower input densities, and our multiscale capacity enables scene completion at lower resolutions at very high speed.
Supplementary
[Figure 8: Additional qualitative results. Top: performance in SemanticKITTI [1] (64 layers) – columns: Input | SSCNet-full [36] | LMSCNet (ours) | Ground Truth. Bottom: performance in nuScenes [3] (32 layers) – columns: Input | SSCNet-full [36] | LMSCNet (ours). Images omitted.]
Appendix A Technical details
A.1 Baseline implementations
Hereafter, we provide additional details on the implementation of the baselines listed in main article Tab. 1.
TS3D baselines. We compare our method with 3 variants of the Two Stream 3D (TS3D) network as reported in [1], which originate from [15]. As in the original work, TS3D uses an additional RGB modality, while TS3D+DNet and TS3D+DNet+SATNet use the LiDAR intensity instead. The network is modified in two ways: first, by directly projecting onto the input grid the semantic labels obtained from a LiDAR-based semantic segmentation network [29] (TS3D+DNet); and second, by exchanging the 3D CNN backbone for SATNet [26] (TS3D+DNet+SATNet). The semantic labels obtained from the 2D branch in the 3 variants are one-hot encoded and lifted to the 3D grid, resulting in an (N+1)×H×W×D input tensor.
SSCNet baselines. Following the practice in [1], we use SSCNet [36] without the flipped TSDF as input encoding. However, while [1] only compares with SSCNet, whose predictions are at a quarter of the original input resolution, we also propose SSCNet-full, which outputs full-size predictions. This is done by applying a 4×4×4 transpose convolution to the last layer of the network to retrieve the original dimensions. For data balancing, we use their strategy of randomly subsampling the occluded free space, conserving a 2:1 free-occupied ratio.
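In code, this upsampling head is a single layer; the sketch below assumes one channel per class and a 1:4-resolution input (64×64×8 for a 256×256×32 grid).

```python
import torch
import torch.nn as nn

n_classes = 20                       # illustrative: N semantic classes + free
upsample_head = nn.ConvTranspose3d(n_classes, n_classes,
                                    kernel_size=4, stride=4)

coarse = torch.randn(1, n_classes, 8, 64, 64)   # SSCNet output at 1:4 scale
full = upsample_head(coarse)                    # -> (1, n_classes, 32, 256, 256)
print(full.shape)
```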
A.2 Architecture comparison
Fig. 9 provides additional insight into the benefit of each architecture, where the top-right corner corresponds to the best speed-performance trade-off. Even though SSCNet-full achieves faster inference than our method at the original scale, its performance is slightly lower and its outputs noisier, as observed in Fig. 8. TS3D+DNet+SATNet achieves slightly higher semantic performance, but its inference time and number of parameters are considerably higher, as seen in Tab. 3 of the main article. Considering this, our network keeps the best speed-performance trade-off. The interest of the coarser-scale inferences of our method is highlighted by their considerably lower inference times and high performance.

Appendix B Qualitative results
Fig. 8 provides further qualitative results of our method on both the SemanticKITTI [1] and nuScenes [3] datasets. Notice that our network produces smoother and less noisy reconstructions. Even though the SemanticKITTI ground truth accumulates scans of dynamic objects, as seen in rows 3-4, our network reconstructs the vehicles correctly. This can be explained by the abundance of parked vehicles in the dataset. Performance on nuScenes is shown in rows 5 to 8. Observe again the smoothness of the reconstruction compared to SSCNet-full, with fewer noisy objects in the middle of the road. Notice that the network has been trained on SemanticKITTI only, which explains the frequent vegetation predictions on nuScenes. We refer the reader to the supplementary video for more qualitative insights.
Footnotes
- In Eq. 2, losses from heterogeneous resolutions can be summed due to the ad-hoc normalization in Eq. 1
- Note that SemanticKITTI benchmark (i.e. test set) does not provide any visual results. Hence, we omit TS3D baselines due to retraining complexities and their use of additional modalities (RGB or LiDAR intensity).
- We keep every 2nd, 4th and 8th layer to simulate 32-, 16- and 8-layer LiDARs, respectively. Note that, unlike [20], SemanticKITTI uses the KITTI odometry set, in which the scans are already untwisted.
References
- [1] (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. International Conference on Computer Vision (ICCV), pp. 9296–9306.
- [2] (2017) A survey of surface reconstruction from point clouds. Computer Graphics Forum 36, pp. 301–329.
- [3] (2020) nuScenes: a multimodal dataset for autonomous driving. Conference on Computer Vision and Pattern Recognition (CVPR).
- [4] (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), pp. 667–676.
- [5] (2018) DeepLab: semantic image segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. Transactions on Pattern Analysis and Machine Intelligence (PAMI) 40, pp. 834–848.
- [6] (2017) Multi-view 3D object detection network for autonomous driving. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1907–1915.
- [7] (2019) 3D semantic scene completion from a single depth image using adversarial training. International Conference on Image Processing (ICIP), pp. 1835–1839.
- [8] (1996) A volumetric method for building complex models from range images. SIGGRAPH.
- [9] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2443.
- [10] (2020) SG-NN: sparse generative neural networks for self-supervised scene completion of RGB-D scans. Conference on Computer Vision and Pattern Recognition (CVPR).
- [11] (2017) Shape completion using 3D-encoder-predictor CNNs and shape synthesis. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6545–6554.
- [12] (2018) ScanComplete: large-scale scene completion and semantic segmentation for 3D scans. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4578–4587.
- [13] (2019) EdgeNet: semantic scene completion from RGB-D images. arXiv abs/1908.02893.
- [14] (2016) Structured prediction of unobserved voxels from a single depth image. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5431–5440.
- [15] (2019) Two stream 3D semantic scene completion. Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 416–425.
- [16] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361.
- [17] (2015) Joint 3D object and layout inference from a single RGB-D image. German Conference on Pattern Recognition (GCPR).
- [18] (2018) 3D semantic segmentation with submanifold sparse convolutional networks. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9224–9232.
- [19] (2018) View-volume network for semantic scene completion from a single depth image. International Joint Conference on Artificial Intelligence (IJCAI).
- [20] (2018) Sparse and dense data with CNNs: depth completion and semantic segmentation. International Conference on 3D Vision (3DV), pp. 52–60.
- [21] (2006) Poisson surface reconstruction. Symposium on Geometry Processing (SGP).
- [22] (2013) 3D scene understanding by voxel-CRF. International Conference on Computer Vision (ICCV), pp. 1425–1432.
- [23] (2013) Reinforcement learning in robotics: a survey. International Journal of Robotics Research (IJRR) 32, pp. 1238–1274.
- [24] (2019) RGBD based dimensional decomposition residual network for 3D semantic scene completion. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7685–7694.
- [25] (2009) Towards total scene understanding: classification, annotation and segmentation in an automatic framework. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2036–2043.
- [26] (2018) See and think: disentangling semantic scene completion. NeurIPS.
- [27] (2020) 3D gated recurrent fusion for semantic scene completion. arXiv abs/2002.07269.
- [28] (2019) Point-voxel CNN for efficient 3D deep learning. NeurIPS.
- [29] (2019) RangeNet++: fast and accurate LiDAR semantic segmentation. International Conference on Intelligent Robots and Systems (IROS), pp. 4213–4220.
- [30] (2011) KinectFusion: real-time dense surface mapping and tracking. International Symposium on Mixed and Augmented Reality (ISMAR), pp. 127–136.
- [31] (2019) DeepSDF: learning continuous signed distance functions for shape representation. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 165–174.
- [32] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. NeurIPS.
- [33] (2017) OctNetFusion: learning depth fusion from data. International Conference on 3D Vision (3DV), pp. 57–66.
- [34] (2019) 3D surface reconstruction from voxel-based LiDAR data. Intelligent Transportation Systems Conference (ITSC), pp. 2681–2686.
- [35] (2013) 3DNN: viewpoint invariant 3D geometry matching for scene understanding. International Conference on Computer Vision (ICCV), pp. 1873–1880.
- [36] (2017) Semantic scene completion from a single depth image. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 190–198.
- [37] (2005) Shape from symmetry. International Conference on Computer Vision (ICCV), pp. 1824–1831.
- [38] (2010) A survey of augmented reality technologies, applications and limitations. International Journal of Virtual Reality (IJVR) 9, pp. 1–20.
- [39] (2018) Adversarial semantic scene completion from a single depth image. International Conference on 3D Vision (3DV), pp. 426–434.
- [40] (2018) PCN: point completion network. International Conference on 3D Vision (3DV), pp. 728–737.
- [41] (2018) Efficient semantic scene completion network with spatial group convolution. European Conference on Computer Vision (ECCV).
- [42] (2019) Cascaded context pyramid for full-resolution 3D semantic scene completion. International Conference on Computer Vision (ICCV), pp. 7800–7809.
- [43] (2017) Learning for active 3D mapping. International Conference on Computer Vision (ICCV), pp. 1548–1556.
