# Bayesian Spatial Kernel Smoothing for Scalable Dense Semantic Mapping

###### Abstract

This paper develops a Bayesian continuous 3D semantic occupancy map from noisy point cloud measurements. In particular, we generalize the Bayesian kernel inference model for occupancy (binary) map building to semantic (multi-class) maps. The method nicely reverts to the original occupancy mapping framework when only one occupied class exists in obtained measurements. First, using Categorical likelihood and its conjugate prior distribution, we extend the counting sensor model for binary classification to a multi-class classification problem which results in a unified probabilistic model for both occupancy and semantic probabilities. Secondly, by applying a Bayesian spatial kernel inference to the semantic counting sensor model, we relax the independent grid assumption and bring smoothness and continuity to the map inference. These latter properties enable the method to exploit local correlations present in the environment to predict semantic probabilities in regions unobserved by the sensor while increasing the performance. Lastly, computational efficiency and scalability are achieved by leveraging sparse kernels and a test-data octrees data structure. The evaluations using multiple sequences of stereo camera and LiDAR datasets show that the proposed method consistently outperforms the compared baselines. We also present a qualitative evaluation using data collected by a biped robot platform on the University of Michigan - North Campus.

## I Introduction

Robotic mapping is the problem of inferring a representation of the robot’s surroundings using noisy measurements as it navigates through an environment. This problem is traditionally solved using occupancy grid mapping techniques [29, 8, 21]. As robotic systems move toward more challenging behaviors in more complex scenarios, such systems require richer maps so that the robot understands the significance of the scene and objects within. Hence, the integration of semantic knowledge into the map has been the focus of robotic research in recent years [31, 45, 24, 43, 38, 46].

A semantic occupancy map as shown in Fig. 1, besides possessing properties similar to an occupancy grid map, maintains for each cell a set of probabilities of semantic classes. These probabilities are often updated using a Bayes filter [40, 46], and then Conditional Random Fields (CRF) or Markov Random Fields (MRF) are subsequently applied to mitigate discontinuities and inconsistencies in the semantic map [22, 43, 38, 46, 48]. In principle, CRF models encourage label consistency among neighboring grids in super-voxels [38] or 2D superpixels [48, 46]. However, CRF optimization is only applied as a post-processing step, and therefore, it is unable to predict semantics of partially observed regions in the map.

Occupancy grid maps assume the grids are statistically independent. However, a series of investigations on continuous occupancy mapping shows that taking local spatial correlations into account increases mapping performance [32, 23, 15, 44, 13, 7, 6]. Building on a similar idea, continuous semantic maps [12, 9] can deal with sparse sensor measurements by inferring semantics of partially observed regions from neighboring measurements. Recent work on Bayesian generalized kernel inference for occupancy map prediction (BGKOctoMap) proposed in [6] uses a kernel inference approach to generalize the counting sensor model [16] to continuous maps while maintaining the scalability of the method.

In this paper, we extend the continuous counting sensor model developed in [6] to the continuous semantic counting sensor model. The resulting inference model reduces to the original BGKOctoMap when only one occupied class exists in the obtained measurements. More precisely, the main contributions of this paper are as follows. First, we develop a probabilistic model for semantic occupancy mapping which models occupancy and semantic probabilities in a unified framework. Secondly, we improve the mapping performance of the semantic counting sensor model by using Bayesian kernel inference. Lastly, we present extensive experiments using both stereo camera and LiDAR datasets. The evaluations show that the proposed method consistently outperforms the state-of-the-art systems.

The remainder of the paper is organized as follows. Related work is given in Section II. Section III presents preliminaries and the semantic counting sensor model. Section IV describes how to apply the Bayesian kernel inference for continuous mapping. Experimental results are presented in Section V. Discussions on the limitations of this work and ideas for future work are provided in Section VI. Finally, Section VII concludes the paper.

## II Related Work

We give an overview of discrete 3D semantic mapping followed by a review of related work on Bayesian kernel inference. Early semantic mapping work uses traditional pixel-wise image segmentation methods and directly transfers image labels from 2D to 3D. Labels from multiple images are fused in 3D through a statistical method [20, 37] or a Bayesian update [40], without any further 3D optimization. He et al. [20] build a semantic octomap by using an MRF for image segmentation and selecting the most frequent label of the 3D points inside each grid as the semantic label for that grid. Sengupta et al. [38] build a semantic volumetric map by using a CRF for 2D semantic segmentation and assigning labels by a voting scheme. Stückler et al. [40] use random decision forests to segment object classes in images and fuse soft labels in a voxel-based 3D map using a Bayesian update. While these methods are similar to our semantic counting sensor model, the latter is a closed-form Bayesian inference which outputs the mean and variance of the posterior.

To deal with noisy 2D predictions, 3D CRF optimization has been introduced as a refinement technique and it is widely used in 3D semantic mapping [22, 41, 43]. In [25, 38, 48], a higher-order dense CRF model is used to further optimize the semantic predictions for 3D elements. Basic CRF models encourage label consistency for adjacent 3D elements, while higher-order dense CRFs can model long-range relationships within a region, such as grids in super-voxels [38] or grids corresponding to 2D superpixels [48], and further improve the mapping performance. With the advent of deep learning methods, recent work uses deep Convolutional Neural Networks (CNNs) for 2D image segmentation, and follows the same framework for building 3D semantic maps [26, 46]. However, CRF optimization post-processes the inferred occupied grids, which does not change the principle of discrete semantic map inference.

Bayesian Kernel Inference (BKI) was introduced in [42] as an approximation to Gaussian processes that requires only $O(\log N)$ computations instead of $O(N^3)$, where $N$ is the number of training points. It generalizes local kernel estimation to the context of Bayesian inference for the exponential family of distributions. Instead of approximating inference on the model, the approximation is made at the stage of model selection. Assuming latent training parameters are conditionally independent given the target parameters, exact inference on this model is possible for any likelihood function from the exponential family. In [33], BKI is successfully applied to a visual odometry problem for modeling sensor uncertainty. In [35], BKI is applied to a Bernoulli-distributed random event with a beta-distributed prior to model collisions in safe high-speed navigation, achieving safe behavior in a novel environment with no relevant training data. BKI was first used in the context of mapping problems in [7, 6], to generalize the discrete counting sensor model [16] to continuous occupancy mapping. Following the same idea, we apply BKI in our semantic counting sensor model and generalize it to continuous semantic mapping. In particular, we use BKI on a Categorical likelihood with a Dirichlet distribution as its conjugate prior.

## III Preliminaries and Semantic Counting Sensor Model

The counting sensor model describes occupancy probability via a Bernoulli likelihood function. It counts for each grid how often a beam has ended in that grid and how often a beam has passed through it. This model has comparable performance to Bayesian updates in occupancy grid mapping [17]. The semantic counting sensor model is its natural generalization from occupancy (binary) mapping to semantic (multi-class) mapping.

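Let $\mathcal{K} = \{1, 2, \dots, K\}$ be the set of semantic class labels, i.e., categories, and $\mathcal{X} \subset \mathbb{R}^3$ be the map spatial support. For any map point $x_i \in \mathcal{X}$, we have a measurement tuple $y_i = (y_i^1, \dots, y_i^K)$, where $\sum_{k=1}^{K} y_i^k = 1$ and $y_i^k \in \{0, 1\}$. In practice, $y_i$ is the output of a softmax function computed using the output of a deep network for multi-class classification. The training set (data) can be defined as $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{N}$.

Assuming map cells are indexed by $j \in \mathbb{Z}^+$, the $j$th map cell can take on one of $K$ possible categories with the probability of each category separately specified as $\theta_j = (\theta_j^1, \dots, \theta_j^K)$, where $\sum_{k=1}^{K} \theta_j^k = 1$. The $j$th map cell with semantic probability $\theta_j$ is described by a Categorical distribution as:

$$p(y_i \mid \theta_j) = \prod_{k=1}^{K} (\theta_j^k)^{y_i^k}. \tag{1}$$

In semantic mapping, we seek the posterior over $\theta_j$, i.e., $p(\theta_j \mid \mathcal{D})$.

For incremental Bayesian inference, we adopt a Dirichlet prior distribution over $\theta_j$, given by $\mathrm{Dir}(K, \alpha_0)$, as the conjugate prior of the Categorical likelihood, where $\alpha_0 = (\alpha_0^1, \dots, \alpha_0^K)$, $\alpha_0^k \in \mathbb{R}^+$, are concentration parameters (hyperparameters). Applying Bayes' rule, the posterior is given by $\mathrm{Dir}(K, \alpha_j)$, $\alpha_j = (\alpha_j^1, \dots, \alpha_j^K)$, where $\alpha_j^k$ is

$$\alpha_j^k := \alpha_0^k + \sum_{i,\, x_i \in \mathcal{X}_j} y_i^k, \tag{2}$$

and $\mathcal{X}_j \subset \mathcal{X}$ denotes the region covered by the $j$th cell. Because the sum in (2) counts the number of measurements that fall into the $j$th cell and indicate the $k$th category, we call this model the Semantic Counting Sensor Model (S-CSM). Given concentration parameters $\alpha_j$, the mode of $\theta_j^k$ has the following closed form, which is also the maximum-a-posteriori estimate of $\theta_j^k$:

$$\hat{\theta}_j^k = \frac{\alpha_j^k - 1}{\sum_{k'=1}^{K} \alpha_j^{k'} - K}, \quad \alpha_j^k > 1. \tag{3}$$

We also have the closed-form expected value and variance of $\theta_j^k$ as follows:

$$\mathbb{E}[\theta_j^k] = \frac{\alpha_j^k}{\sum_{k'=1}^{K} \alpha_j^{k'}}, \qquad \mathbb{V}[\theta_j^k] = \frac{\mathbb{E}[\theta_j^k]\left(1 - \mathbb{E}[\theta_j^k]\right)}{\sum_{k'=1}^{K} \alpha_j^{k'} + 1}. \tag{4}$$

We use (2) to calculate the parameters of the posterior Dirichlet distribution for cell $j$. Given the posterior parameter $\alpha_j$, the statistics of cell $j$ can be computed by (3) and (4).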
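The update in (2) and the statistics in (3) and (4) can be illustrated with a minimal sketch. The dictionary-based grid, the function names, and the uniform prior value passed as `alpha0` are assumptions made for illustration and are not taken from the released implementation.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_index(point: Sequence[float], resolution: float) -> Cell:
    """Map a 3D point to the integer index of the grid cell that contains it."""
    return tuple(int(c // resolution) for c in point)


def scsm_update(alphas: Dict[Cell, List[float]],
                points: Sequence[Sequence[float]],
                labels: Sequence[Sequence[float]],
                resolution: float,
                alpha0: float) -> None:
    """Eq. (2): add each measurement y_i to the concentration of its own cell."""
    for x, y in zip(points, labels):
        j = cell_index(x, resolution)
        # A cell seen for the first time starts from the (assumed uniform) prior.
        alpha = alphas.setdefault(j, [alpha0] * len(y))
        for k, y_k in enumerate(y):
            alpha[k] += y_k


def dirichlet_stats(alpha: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Eq. (4): per-class posterior mean and variance of Dir(K, alpha)."""
    eta = sum(alpha)
    mean = [a / eta for a in alpha]
    var = [m * (1.0 - m) / (eta + 1.0) for m in mean]
    return mean, var


def dirichlet_mode(alpha: Sequence[float]) -> Optional[List[float]]:
    """Eq. (3): MAP estimate; returns None unless every alpha_k exceeds 1."""
    if min(alpha) <= 1.0:
        return None
    denom = sum(alpha) - len(alpha)
    return [(a - 1.0) / denom for a in alpha]
```

Because the update only accumulates counts, measurements can be fused one scan at a time without storing earlier scans.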

For free-class measurements, we use free-space points linearly interpolated along each sensor beam. We note that in the particular case when $K = 2$, with $k = 1$ representing the free-space class and $k = 2$ the occupied class, the semantic counting sensor model reverts to the original counting sensor model.
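As a hedged illustration of the free-space sampling described above, the sketch below places evenly spaced points on the segment between the sensor origin and the beam endpoint. The `spacing` parameter and the function name are hypothetical; the text only states that the points are linearly interpolated along each beam.

```python
import math
from typing import List, Sequence, Tuple


def free_space_samples(origin: Sequence[float],
                       endpoint: Sequence[float],
                       spacing: float) -> List[Tuple[float, ...]]:
    """Return points `spacing` apart on the beam, excluding origin and endpoint."""
    length = math.dist(origin, endpoint)
    n = int(length / spacing)
    samples = []
    for s in range(1, n + 1):
        t = s * spacing / length
        if t >= 1.0:
            # Never emit the endpoint itself: it carries the occupied-class label.
            break
        samples.append(tuple(o + t * (e - o) for o, e in zip(origin, endpoint)))
    return samples
```

Each returned point would then be paired with a measurement tuple that indicates the free-space class before it is passed to the map update.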

However, the semantic counting sensor model inherits the traditional occupancy grid mapping limitations because the posterior parameters for each cell are only correlated with measurements that directly fall into or pass through the cell. To mitigate this shortcoming, we use BKI to convert the discrete semantic counting sensor model to a continuous model by taking into account local correlations in the map.

## IV Continuous Semantic Mapping via Bayesian Kernel Inference

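Bayesian kernel inference, as introduced by Vega-Brown et al. [42], relates the extended likelihood $p(y_i \mid \theta_*, x_i, x_*)$ and the likelihood $p(y_i \mid \theta_i)$ by a smoothness constraint, where $\theta_*$ is the value of the latent variable for the query point $x_*$. In this framework, the maximum entropy distribution $g$, satisfying $D_{\mathrm{KL}}(g \,\|\, f) \le \rho(x_*, x)$ for a bound $\rho$ on the divergence, has the form $g(y) \propto f(y)^{k(x_*, x)}$, where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler Divergence (KLD), and $k(\cdot, \cdot)$ is a kernel function. Let $g$ be the extended likelihood and $f$ the likelihood. We define a smooth distribution over semantics as having bounded KLD between the two distributions. Given a kernel function operating on 3D spatial inputs, $k: \mathcal{X} \times \mathcal{X} \to [0, 1]$, we have

$$\prod_{i=1}^{N} p(y_i \mid \theta_*, x_i, x_*) \propto \prod_{i=1}^{N} p(y_i \mid \theta_*)^{k(x_*, x_i)}. \tag{5}$$

Assuming the measurements are conditionally independent given $\theta_*$, applying Bayes' rule together with (5) gives

$$p(\theta_* \mid x_*, \mathcal{D}) \propto \left[\prod_{i=1}^{N} p(y_i \mid \theta_*)^{k(x_*, x_i)}\right] p(\theta_* \mid x_*). \tag{6}$$

We adopt the Categorical likelihood and place a prior distribution $\mathrm{Dir}(K, \alpha_0)$ over $\theta_*$. Subsequently, (6) becomes:

$$p(\theta_* \mid x_*, \mathcal{D}) \propto \left[\prod_{i=1}^{N} \prod_{k=1}^{K} (\theta_*^k)^{k(x_*, x_i)\, y_i^k}\right] \prod_{k=1}^{K} (\theta_*^k)^{\alpha_0^k - 1} \tag{7}$$

$$= \prod_{k=1}^{K} (\theta_*^k)^{\alpha_0^k + \sum_{i=1}^{N} k(x_*, x_i)\, y_i^k - 1}, \tag{8}$$

which is proportional to the posterior $\mathrm{Dir}(K, \alpha_*)$, $\alpha_* = (\alpha_*^1, \dots, \alpha_*^K)$, where $\alpha_*^k$ is defined as

$$\alpha_*^k := \alpha_0^k + \sum_{i=1}^{N} k(x_*, x_i)\, y_i^k. \tag{9}$$

The mode, mean, and variance for the continuous model can be computed exactly as given in (3) and (4).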

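Compared with (2), (9) considers not only measurements that fall into a cell but also adjacent measurements, each weighted by the kernel function according to its distance from the query point. We note that the kernel needs to be neither positive-definite nor symmetric. To reduce the computational complexity, we choose the sparse kernel [27] as

$$k(x, x') = \begin{cases} \sigma_0 \left[ \dfrac{1}{3}\left(2 + \cos\left(2\pi \dfrac{d}{l}\right)\right)\left(1 - \dfrac{d}{l}\right) + \dfrac{1}{2\pi} \sin\left(2\pi \dfrac{d}{l}\right) \right], & \text{if } d < l, \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

where $d = \|x - x'\|$, $l > 0$ is the length-scale, and $\sigma_0$ is the kernel scale parameter (signal variance).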
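The kernel-weighted update in (9) with the sparse kernel in (10) can be sketched as follows. This is an illustrative sketch rather than the released C++ code: the function names are hypothetical, the default values are the length-scale and kernel scale reported in Table I, and the loop visits every training point instead of using a spatial index.

```python
import math
from typing import List, Sequence


def sparse_kernel(d: float, length_scale: float, sigma0: float) -> float:
    """Eq. (10): compactly supported kernel, exactly zero for d >= length_scale."""
    if d >= length_scale:
        return 0.0
    r = d / length_scale
    return sigma0 * ((2.0 + math.cos(2.0 * math.pi * r)) * (1.0 - r) / 3.0
                     + math.sin(2.0 * math.pi * r) / (2.0 * math.pi))


def sbki_update(alpha: List[float],
                query: Sequence[float],
                points: Sequence[Sequence[float]],
                labels: Sequence[Sequence[float]],
                length_scale: float = 0.3,
                sigma0: float = 0.1) -> List[float]:
    """Eq. (9): add kernel-weighted measurements to the concentration at `query`."""
    for x, y in zip(points, labels):
        w = sparse_kernel(math.dist(query, x), length_scale, sigma0)
        if w > 0.0:
            # Only points inside the kernel support contribute.
            for k, y_k in enumerate(y):
                alpha[k] += w * y_k
    return alpha
```

Since the kernel vanishes beyond the length-scale, only measurements within that radius of a query point need to be retrieved, which is what makes a block-wise neighborhood search sufficient.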

The derived continuous semantic model can deal with sparse and noisy sensor measurements better and allows for queries at arbitrary resolution. In the context of semantic occupancy mapping, the query points are chosen to be the grid centroids. Thus, (9) can be used to recursively update the posterior parameters for each grid. We use a block to contain a number of grids according to the block depth, where each block is an octree of grids. For every block of test data, the corresponding training data is comprised of all portions of the new measurements that pass through the block’s extended block [22], which is defined as the set of neighboring blocks with faces adjacent to the block containing the test data of interest.
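The extended-block lookup described above can be sketched in a few lines. The integer block keys, the `block_size` parameter, and the choice to include the block itself alongside its six face-adjacent neighbors are assumptions for illustration; the text defines the extended block only through face adjacency.

```python
from typing import Dict, List, Sequence, Tuple

BlockKey = Tuple[int, int, int]

# Offsets to the six blocks that share a face with a given block.
FACE_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def block_key(point: Sequence[float], block_size: float) -> BlockKey:
    """Integer key of the block containing a 3D point."""
    x, y, z = point
    return (int(x // block_size), int(y // block_size), int(z // block_size))


def extended_block(key: BlockKey) -> List[BlockKey]:
    """The block itself (an assumption) plus its six face-adjacent neighbors."""
    i, j, k = key
    return [key] + [(i + di, j + dj, k + dk) for di, dj, dk in FACE_OFFSETS]


def training_points(key: BlockKey,
                    points_by_block: Dict[BlockKey, list]) -> list:
    """Collect the new measurements stored under any block of the extended block."""
    collected = []
    for neighbor in extended_block(key):
        collected.extend(points_by_block.get(neighbor, []))
    return collected
```

Restricting the training data this way keeps the per-block cost bounded regardless of how large the overall map grows.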

###### Example 1 (Three-dimensional Toy Example).

Figure 2 illustrates a three-dimensional toy example of the continuous semantic mapping via Bayesian kernel inference using a simulated dataset made in Gazebo, with annotated semantic labels. The simulated dataset has dimensions . We manually annotate the raw data into three semantic classes: ground, wall, and cylindrical obstacles. Semantic occupancy maps with resolution for both S-CSM and Semantic Bayesian Kernel Inference (S-BKI) models are built using the annotated point cloud as sensor measurements. The figure shows that S-CSM can reconstruct the 3D environment with correct semantic information but has a limited predictive capability where sensor coverage is sparse. The S-BKI map can interpolate the gaps in the walls due to the continuity and smoothness of Bayesian kernel inference. We also found that Bayesian kernel inference decreases the variance of the wall by considering neighboring measurements. There are some artifacts, however, on the periphery of the wall where the variance is relatively high.

## V Experimental Results

We now present three experiments using the KITTI dataset [10], SemanticKITTI dataset [2], and a 3D bipedal robot. In the first experiment, we compare our methods with the semantic occupancy mapping system in [46], which reports the best results on the KITTI stereo dataset. In the SemanticKITTI experiment, we show the segmentation accuracy improvement of S-BKI over S-CSM, and over a point-cloud-segmentation deep neural network [28] that we used for prior prediction. Finally, we qualitatively compare our two models on data collected by a Cassie robot. The methods are implemented in C++ (source code: https://github.com/ganlumomo/BKISemanticMapping) and make use of the Learning-Aided 3D Mapping Library [6], the Robot Operating System (ROS) [34], and the Point Cloud Library (PCL) [36]. We also make use of the test-data octrees data structure in [44] for fast data retrieval and memory requirement reduction. The parameters, shown in Table I, were manually tuned but remained fixed through all experiments. For baselines, we used the authors' open source implementations without any modification.

Hyperparameter | Description | Value |
---|---|---|
$l$ | Kernel length-scale | 0.3 |
$\sigma_0$ | Kernel scale | 0.1 |
$\alpha_0^k$ | Dirichlet prior | 0.001 |

### V-A KITTI Dataset

The KITTI dataset with semantically labeled images contains 40 test images from sequence 05 [25] and 25 test images from sequence 15 [37] of the KITTI odometry dataset. We qualitatively and quantitatively compare the mapping performance of our methods with the state-of-the-art semantic mapping system in [46]. The input data of this dataset consists of stereo camera images; we first transform the stereo images into 3D point clouds using the provided camera calibration, then we use an image segmentation deep neural network [47] on the left image to obtain semantic measurements.

For a fair comparison, we adopt the same data pre-processing methods as used by Yang et al. [46]. We use ELAS [11] to generate depth maps from stereo image pairs, ORB-SLAM [30] to estimate 6DoF camera poses, and the dilated CNN [47] for prior semantic label predictions. The superpixels needed by Yang's CRF module are generated by the SLIC algorithm [1]. The common parameters for occupancy mapping in the three methods are set according to Yang's work: a resolution of 0.1 m, and free and occupied thresholds of 0.47 and 0.6, respectively.
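The conversion from stereo output to a 3D point cloud follows the standard pinhole back-projection. The sketch below assumes a rectified disparity map stored as nested lists and hypothetical calibration arguments (`fx`, `fy`, `cx`, `cy` in pixels, `baseline` in meters); it does not reproduce the ELAS interface or the KITTI calibration file format.

```python
from typing import List, Sequence, Tuple


def disparity_to_points(disparity: Sequence[Sequence[float]],
                        fx: float, fy: float,
                        cx: float, cy: float,
                        baseline: float) -> List[Tuple[float, float, float]]:
    """Back-project valid disparities into 3D points in the left camera frame."""
    points = []
    for v, row in enumerate(disparity):
        for u, d in enumerate(row):
            if d <= 0.0:
                # Non-positive disparity marks an invalid or unmatched pixel.
                continue
            z = fx * baseline / d      # depth from disparity
            x = (u - cx) * z / fx      # horizontal offset from the optical axis
            y = (v - cy) * z / fy      # vertical offset from the optical axis
            points.append((x, y, z))
    return points
```

Each point keeps the pixel ordering of the left image, so the semantic prediction at pixel (u, v) can be attached to the corresponding 3D point directly.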

#### V-A1 Qualitative Results

The 3D view of the semantic map built by the S-BKI model is given in Fig. 1. Our approach is able to recognize and reconstruct general objects such as road, sidewalk, building, fence, and vegetation. We also show the same view of the corresponding variance map of S-BKI in Fig. 1. Most of the grids on the surface have relatively low variance (cyan). The middle grids, where the sensor measurements are dense, have the lowest variance (blue), while the grids on the margins of the scans, where the sensor measurements are sparse, have relatively high variance (red). The uneven parts of the road in the semantic map also have high variance, which might be caused by discontinuities in the estimated camera poses.

We also found that a small portion of grids of the fence on the left side are misclassified as vegetation, where the corresponding variance is high. This nice property enabled us to reject misclassified grids by setting a threshold variance. If the variance is too high, we can regard the state of the grid as unknown and thus build safer semantic maps for robot navigation.
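The variance-based rejection described above can be sketched as a small decision rule. The function name and the `max_variance` threshold are hypothetical; the text does not report the threshold value used.

```python
from typing import Optional, Sequence


def label_with_rejection(alpha: Sequence[float],
                         max_variance: float) -> Optional[int]:
    """Return the most likely class index, or None (unknown) if its variance is too high."""
    eta = sum(alpha)
    best = max(range(len(alpha)), key=lambda k: alpha[k])
    mean = alpha[best] / eta
    # Dirichlet marginal variance of the winning class, as in Eq. (4).
    variance = mean * (1.0 - mean) / (eta + 1.0)
    return best if variance <= max_variance else None
```

A grid supported by many consistent measurements has a large concentration sum and therefore a small variance, so it passes the test, whereas a grid with little or conflicting evidence is reported as unknown.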

To compare the mapping performance with Yang's semantic mapping system with CRF optimization, we project the semantic maps onto 2D left camera views and compare them with 2D ground truth images, as shown in Fig. 3. In the ground truth images, the road, buildings, and vegetation are continuous and dense. The projected image from Yang's semantic map contains more gaps than those of S-CSM and S-BKI, while the projected image of S-BKI has the fewest holes in those regions and therefore resembles the ground truth most closely.

Metric | Method | Building | Road | Vege. | Sidewalk | Car | Sign | Fence | Pole | Average |
---|---|---|---|---|---|---|---|---|---|---|
IoU Exclusive | Yang et al. [46] | 86.2 | 91.5 | 85.3 | 74.1 | 77.1 | 16.8 | 78.5 | 28.0 | 67.2 |
IoU Exclusive | S-CSM | 86.3 | 93.2 | 84.3 | 80.0 | 76.8 | 25.5 | 77.5 | 30.1 | 69.2 |
IoU Exclusive | S-BKI | 87.4 | 93.3 | 84.7 | 79.9 | 76.9 | 18.6 | 78.7 | 29.2 | 68.6 |
IoU | Yang et al. [46] | 32.5 | 70.1 | 45.2 | 55.7 | 39.5 | 13.0 | 46.6 | 18.9 | 40.2 |
IoU | S-CSM | 40.2 | 74.1 | 49.5 | 62.1 | 42.1 | 20.3 | 47.7 | 22.8 | 44.9 |
IoU | S-BKI | 45.6 | 75.5 | 52.8 | 62.9 | 43.3 | 14.9 | 49.3 | 22.9 | 46.0 |

Metric | Method | Building | Road | Vege. | Sidewalk | Car | Sign | Fence | Pole | Average |
---|---|---|---|---|---|---|---|---|---|---|
IoU Exclusive | Yang et al. [46] | 95.6 | 90.4 | 92.8 | 70.0 | 94.4 | 0.1 | 84.5 | 49.5 | 72.2 |
IoU Exclusive | S-CSM | 94.4 | 95.4 | 90.7 | 84.5 | 95.0 | 22.2 | 79.3 | 51.6 | 76.6 |
IoU Exclusive | S-BKI | 94.6 | 95.4 | 90.4 | 84.2 | 95.1 | 27.1 | 79.3 | 51.3 | 77.2 |
IoU | Yang et al. [46] | 32.9 | 85.8 | 59.0 | 79.3 | 61.0 | 0.9 | 46.8 | 33.9 | 50.0 |
IoU | S-CSM | 42.6 | 87.3 | 62.9 | 77.9 | 62.6 | 17.1 | 47.7 | 34.8 | 54.1 |
IoU | S-BKI | 49.3 | 88.8 | 69.1 | 78.2 | 63.6 | 22.0 | 49.3 | 36.7 | 57.1 |

#### V-A2 Quantitative Results

We follow the evaluation method given in [46] by projecting the 3D semantic map onto the 2D left image plane, ignoring voxels that are too far from the camera (40 meters for all the methods), and calculating the standard metric of Intersection over Union (IoU) based on the labeled ground truth in the left images. IoU is defined as TP/(TP+FN+FP), where TP, FN, and FP denote the numbers of true positives, false negatives, and false positives, respectively.

Yang et al. [46] *exclude* the data that has not been projected onto images (gray color in the projected images), *even when there exists corresponding ground truth data of it* (as shown in the ground truth images). For a fair comparison, we follow this approach for all three methods and call it *IoU Exclusive*. However, this evaluation ignores the classification error of gaps in the map, and cannot show the advantage of continuous mapping. Therefore, we compute a more rigorous *IoU* by taking all projected data except the sky class into account.
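The difference between the two metrics can be made concrete with a small sketch over flattened label images. Here `None` marks a pixel onto which no map voxel was projected, and the `ignore` argument holds ground-truth classes to skip (for example sky); both conventions, and the treatment of unprojected pixels as false negatives in the full IoU, are assumptions about how the description above would be implemented.

```python
from typing import Hashable, Iterable, Optional, Sequence


def class_iou(pred: Sequence[Optional[Hashable]],
              truth: Sequence[Hashable],
              cls: Hashable,
              ignore: Iterable[Hashable] = ()) -> Optional[float]:
    """IoU = TP / (TP + FP + FN); unprojected pixels (None) count as misses."""
    skip = set(ignore)
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if t in skip:
            continue
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else None


def class_iou_exclusive(pred: Sequence[Optional[Hashable]],
                        truth: Sequence[Hashable],
                        cls: Hashable,
                        ignore: Iterable[Hashable] = ()) -> Optional[float]:
    """Same metric, but pixels without a projection are dropped before counting."""
    kept = [(p, t) for p, t in zip(pred, truth) if p is not None]
    return class_iou([p for p, _ in kept], [t for _, t in kept], cls, ignore)
```

With this convention a map with holes is penalized by the full IoU but not by the exclusive variant, which is why the two metrics rank gap-filling methods differently.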

The quantitative results are given in Tables II and III, where the two metrics are computed. S-BKI has the highest IoU in almost all semantic classes compared with S-CSM and Yang et al. [46], and S-CSM is the second-best method. We reiterate that IoU Exclusive is not a reasonable metric for mapping performance evaluations; nevertheless, S-CSM and S-BKI still outperform the compared baseline using this metric. Under IoU Exclusive, as expected, S-CSM and S-BKI perform similarly.

There are two main reasons why S-CSM outperforms Yang’s work. First, Yang’s semantic mapping uses a separate Bayesian filter to update occupancy which gives larger weights to recent data when taking an average, while S-CSM gives equal weights to all data. In other words, if recent data is noisy, S-CSM would outperform the Bayesian filtering. Secondly, even if the 3D CRF model further optimizes the grid labels, it is only post-processing pre-calculated occupied grids and, therefore, it cannot recover the correct semantic labels for misclassified occupancy or unknown grids. In contrast, the counting sensor model uses a statistical model to infer the grid statistics. By adding the Bayesian kernel inference, S-BKI outperforms S-CSM as it can fill the gaps in the map using nearby measurements. Even for fully observed regions, by considering local correlations the map becomes more robust to noisy measurements.

### V-B SemanticKITTI Dataset

We also evaluate our mapping algorithms using LiDAR data from the SemanticKITTI dataset [2]. SemanticKITTI is a large-scale dataset based on the KITTI odometry dataset. It provides dense annotations for each scan of sequences 00-10, including camera poses estimated from a surfel-based SLAM approach (SuMa) [3]. The input data of this dataset is collected by a Velodyne HDL-64E laser scanner. The semantic measurements are generated by a 3D point cloud semantic segmentation deep neural network.

We evaluate our mapping methods on all sequences with ground truth semantic labels. RangeNet++ [28] provides several pre-trained models and their predictions on the SemanticKITTI dataset. We choose SqueezeSegV2 with K-Nearest Neighbor processing (SqueezeSegV2-KNN) [28] to compute semantic measurements given the LiDAR points. All maps are built with a resolution of 0.1 m and without any pre-processing of the input data.

Seq. | Method | Car | Bicycle | Motorcycle | Truck | Other Vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other Ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic Sign | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00 | Sq.-KNN | 92.1 | 18.3 | 55.0 | 76.5 | 62.9 | 34.2 | 52.0 | 61.4 | 94.7 | 71.0 | 87.9 | 1.2 | 89.8 | 54.6 | 82.2 | 53.1 | 79.3 | 38.6 | 51.5 | 60.9 |
00 | S-CSM | 95.6 | 23.5 | 69.8 | 88.3 | 74.4 | 47.9 | 71.6 | 56.9 | 96.3 | 78.1 | 91.2 | 3.1 | 93.6 | 64.2 | 87.4 | 70.1 | 83.5 | 61.1 | 70.7 | 69.9 |
00 | S-BKI | 96.9 | 26.5 | 75.8 | 93.5 | 80.1 | 61.5 | 77.5 | 71.0 | 96.2 | 79.2 | 91.5 | 6.6 | 94.6 | 66.5 | 88.9 | 73.4 | 84.5 | 65.8 | 76.2 | 75.0 |
01 | Sq.-KNN | 83.8 | n/a | n/a | n/a | 82.9 | n/a | n/a | 67.9 | 92.6 | n/a | n/a | 70.5 | 58.0 | 71.4 | 72.1 | 18.0 | 71.5 | 21.8 | 68.9 | 64.0 |
01 | S-CSM | 89.8 | n/a | n/a | n/a | 91.0 | n/a | n/a | 70.3 | 93.4 | n/a | n/a | 74.2 | 64.4 | 73.8 | 75.1 | 26.3 | 74.7 | 31.9 | 78.7 | 70.3 |
01 | S-BKI | 91.0 | n/a | n/a | n/a | 96.0 | n/a | n/a | 70.7 | 94.3 | n/a | n/a | 75.2 | 67.1 | 75.1 | 76.4 | 30.6 | 76.1 | 36.2 | 81.4 | 72.5 |
02 | Sq.-KNN | 90.9 | 14.5 | 50.8 | n/a | 56.4 | 38.6 | n/a | 59.9 | 93.9 | 68.1 | 84.9 | 50.9 | 79.1 | 66.1 | 82.5 | 48.9 | 68.3 | 25.7 | 35.9 | 59.7 |
02 | S-CSM | 95.4 | 28.5 | 73.4 | n/a | 80.3 | 60.3 | n/a | 75.1 | 94.8 | 74.4 | 87.4 | 61.7 | 85.0 | 71.8 | 86.7 | 66.9 | 72.9 | 43.5 | 55.7 | 71.4 |
02 | S-BKI | 95.8 | 31.1 | 76.4 | n/a | 83.3 | 62.5 | n/a | 79.5 | 94.8 | 75.0 | 87.4 | 63.6 | 85.6 | 72.1 | 87.1 | 68.8 | 73.4 | 45.9 | 60.4 | 73.1 |
03 | Sq.-KNN | 88.4 | 21.9 | n/a | 12.4 | 60.1 | 16.3 | n/a | n/a | 92.8 | 57.9 | 83.2 | n/a | 77.4 | 70.1 | 79.3 | 41.6 | 62.3 | 35.9 | 47.3 | 56.5 |
03 | S-CSM | 92.4 | 29.7 | n/a | 23.1 | 65.4 | 17.6 | n/a | n/a | 94.3 | 69.4 | 86.9 | n/a | 80.4 | 73.8 | 83.2 | 52.3 | 66.9 | 53.5 | 62.0 | 63.0 |
03 | S-BKI | 94.5 | 42.4 | n/a | 48.8 | 73.6 | 23.8 | n/a | n/a | 94.3 | 73.2 | 87.2 | n/a | 82.1 | 74.7 | 84.1 | 55.7 | 67.4 | 57.3 | 66.7 | 68.0 |
04 | Sq.-KNN | 84.9 | n/a | n/a | n/a | 68.1 | 20.8 | n/a | n/a | 95.8 | 26.1 | 68.4 | 61.5 | 49.3 | 76.4 | 82.6 | 14.0 | 67.6 | 36.0 | 44.6 | 56.9 |
04 | S-CSM | 88.3 | n/a | n/a | n/a | 71.2 | 23.2 | n/a | n/a | 96.5 | 40.5 | 72.5 | 64.0 | 52.1 | 78.5 | 85.5 | 19.4 | 72.5 | 50.8 | 57.6 | 62.3 |
04 | S-BKI | 87.7 | n/a | n/a | n/a | 82.5 | 37.3 | n/a | n/a | 96.2 | 55.7 | 72.3 | 68.3 | 56.9 | 80.3 | 87.1 | 24.4 | 72.7 | 55.5 | 67.0 | 67.5 |
05 | Sq.-KNN | 89.1 | 8.6 | 15.4 | 82.5 | 70.9 | 31.0 | 55.0 | n/a | 94.7 | 84.8 | 85.0 | 61.5 | 87.0 | 72.4 | 75.5 | 30.3 | 64.6 | 27.6 | 39.5 | 59.8 |
05 | S-CSM | 93.4 | 15.4 | 28.9 | 86.4 | 78.4 | 39.8 | 69.4 | n/a | 96.4 | 90.1 | 88.5 | 70.0 | 90.9 | 77.7 | 81.2 | 46.8 | 69.5 | 47.6 | 57.4 | 68.2 |
05 | S-BKI | 93.2 | 27.4 | 46.0 | 89.0 | 84.1 | 47.5 | 83.3 | n/a | 94.2 | 88.0 | 83.6 | 75.2 | 92.4 | 75.3 | 82.1 | 53.5 | 69.5 | 50.2 | 63.3 | 72.1 |
06 | Sq.-KNN | 85.4 | 17.1 | 50.2 | 86.7 | 66.1 | 27.6 | 64.3 | n/a | 87.6 | 56.0 | 74.9 | 66.5 | 83.9 | 38.4 | 61.9 | 32.0 | 89.5 | 40.1 | 52.7 | 60.1 |
06 | S-CSM | 91.8 | 22.7 | 62.5 | 89.8 | 75.4 | 43.3 | 92.1 | n/a | 91.1 | 68.2 | 80.4 | 70.5 | 89.4 | 49.3 | 69.7 | 50.1 | 92.2 | 60.0 | 77.9 | 70.9 |
06 | S-BKI | 92.6 | 28.7 | 67.9 | 93.5 | 81.4 | 62.7 | 95.4 | n/a | 90.3 | 70.7 | 79.9 | 71.8 | 91.7 | 53.6 | 73.7 | 54.7 | 91.9 | 66.4 | 84.8 | 75.1 |
07 | Sq.-KNN | 92.4 | 21.3 | 64.0 | 83.6 | 69.8 | 53.2 | 63.6 | n/a | 93.9 | 75.9 | 89.3 | n/a | 90.9 | 59.7 | 76.5 | 45.9 | 82.8 | 40.2 | 54.0 | 68.1 |
07 | S-CSM | 94.9 | 25.9 | 76.8 | 82.6 | 81.5 | 64.2 | 88.0 | n/a | 95.8 | 80.9 | 92.0 | n/a | 93.8 | 66.6 | 80.8 | 59.8 | 84.7 | 55.4 | 73.2 | 76.2 |
07 | S-BKI | 93.8 | 29.2 | 80.2 | 82.7 | 87.8 | 70.1 | 92.7 | n/a | 93.9 | 77.0 | 87.7 | n/a | 94.1 | 63.4 | 81.4 | 84.1 | 84.5 | 53.2 | 77.6 | 77.2 |
08 | Sq.-KNN | 86.7 | 14.4 | 24.6 | 21.0 | 23.3 | 23.5 | 40.9 | n/a | 90.1 | 32.4 | 74.8 | 1.2 | 79.6 | 42.7 | 79.2 | 36.5 | 71.1 | 28.3 | 24.8 | 44.1 |
08 | S-CSM | 90.5 | 23.0 | 34.9 | 26.8 | 29.1 | 32.4 | 49.4 | n/a | 92.6 | 38.7 | 79.0 | 1.1 | 84.6 | 51.6 | 83.3 | 48.3 | 72.9 | 44.1 | 31.6 | 50.8 |
08 | S-BKI | 92.3 | 30.0 | 39.7 | 29.3 | 32.1 | 38.8 | 54.7 | n/a | 92.9 | 40.9 | 79.9 | 1.1 | 86.6 | 54.6 | 84.9 | 52.3 | 74.2 | 47.9 | 34.7 | 53.7 |
09 | Sq.-KNN | 89.2 | 5.3 | 48.0 | 79.8 | 61.3 | 37.3 | n/a | n/a | 91.0 | 59.0 | 79.9 | 38.9 | 80.9 | 62.9 | 77.0 | 32.3 | 61.7 | 31.8 | 52.6 | 58.2 |
09 | S-CSM | 93.9 | 12.2 | 71.9 | 85.6 | 71.6 | 47.5 | n/a | n/a | 91.8 | 67.0 | 83.1 | 23.4 | 88.9 | 65.7 | 82.6 | 42.9 | 64.9 | 52.4 | 53.0 | 64.6 |
09 | S-BKI | 96.0 | 22.8 | 80.2 | 90.5 | 79.7 | 60.7 | n/a | n/a | 91.7 | 70.0 | 83.8 | 30.7 | 90.8 | 69.1 | 84.0 | 46.3 | 66.0 | 59.1 | 58.2 | 69.4 |
10 | Sq.-KNN | 84.0 | 8.1 | 36.2 | 49.3 | 10.2 | 40.9 | n/a | n/a | 89.4 | 59.6 | 78.5 | 42.7 | 76.7 | 64.2 | 77.6 | 29.0 | 67.8 | 30.7 | 47.9 | 52.0 |
10 | S-CSM | 91.0 | 14.6 | 51.8 | 67.7 | 16.6 | 52.8 | n/a | n/a | 92.1 | 69.7 | 83.7 | 51.3 | 81.7 | 70.0 | 82.2 | 43.3 | 72.4 | 51.7 | 64.1 | 62.1 |
10 | S-BKI | 93.8 | 24.6 | 60.3 | 76.2 | 21.2 | 65.0 | n/a | n/a | 92.3 | 73.4 | 84.8 | 54.5 | 83.0 | 71.2 | 83.4 | 47.3 | 73.4 | 56.2 | 67.9 | 66.4 |
Average | Sq.-KNN | 87.9 | 14.4 | 43.0 | 61.5 | 57.5 | 32.3 | 55.2 | 63.1 | 92.4 | 59.1 | 80.7 | 43.9 | 77.5 | 61.7 | 76.9 | 34.7 | 71.5 | 32.4 | 47.2 | 57.6 |
Average | S-CSM | 92.5 | 21.7 | 58.7 | 68.8 | 66.8 | 42.9 | 74.1 | 67.4 | 94.1 | 67.7 | 84.5 | 46.6 | 82.3 | 67.5 | 81.6 | 47.8 | 75.2 | 50.2 | 62.0 | 65.9 |
Average | S-BKI | 93.4 | 29.2 | 65.8 | 75.4 | 72.9 | 93.0 | 80.7 | 73.7 | 93.7 | 72.3 | 83.8 | 49.0 | 84.1 | 68.7 | 83.0 | 53.7 | 75.8 | 54.0 | 67.1 | 72.1 |

#### V-B1 Qualitative Results

Examples of qualitative results of the S-BKI semantic map using sequence 04 and 05 are shown in Fig. 4. The figures highlight that the proposed methods work not only with dense stereo camera data but also with LiDAR data which is sparser. Sequence 05 is a large-scale dataset with 2761 LiDAR scans and S-BKI can successfully reconstruct road, vegetation, terrain, and cars.

#### V-B2 Quantitative Results

We compute the IoU metric for 3D predictions of SqueezesegV2-KNN, S-CSM and S-BKI. Once the maps are inferred, we query the semantic labels for all points of each scan and compare map labels with ground truth labels. We acknowledge that SqueezesegV2-KNN is not a semantic mapping system, but to the best of our knowledge, we are the first semantic mapping work which reports quantitative results using SemanticKITTI dataset.
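The query step described above can be sketched as a lookup of each scan point in the inferred map. The dictionary from voxel index to concentration parameters and the function name are assumptions for illustration; points that fall into voxels the map never updated are returned as `None`.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Voxel = Tuple[int, ...]


def query_labels(alphas: Dict[Voxel, Sequence[float]],
                 points: Sequence[Sequence[float]],
                 resolution: float) -> List[Optional[int]]:
    """Return the most likely class of the voxel containing each point."""
    labels: List[Optional[int]] = []
    for x in points:
        key = tuple(int(c // resolution) for c in x)
        alpha = alphas.get(key)
        if alpha is None:
            labels.append(None)  # voxel was never observed or inferred
        else:
            # The class with the largest concentration also has the largest mean.
            labels.append(max(range(len(alpha)), key=lambda k: alpha[k]))
    return labels
```

The returned labels can then be compared point by point with the ground-truth annotations of the same scan to accumulate the per-class IoU counts.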

Quantitative results on sequences 00-10 are given in Table IV. For all sequences, our semantic mapping methods improve the prior segmentation IoU by fusing multiple scans. We note that S-BKI outperforms S-CSM in almost all semantic classes, which shows the advantage of Bayesian kernel inference and continuous semantic maps. When S-CSM outperforms S-BKI, the IoUs are close to each other.

### V-C Experimental Results on a Cassie Bipedal Robot

Finally, we test our mapping methods on data collected using the bipedal robot Cassie Blue shown in Fig. 5. Cassie Blue has a custom-designed torso carrying an Intel RealSense depth camera capable of providing both RGB images and corresponding organized point clouds in outdoor environments. We collected data on the Wave Field of the University of Michigan - North Campus, as shown in the top left image of Fig. 6.

To obtain semantic measurements, we manually annotated 1194 training images and 457 validation images from the NCLT dataset [4]. The NCLT dataset was selected because it shares a similar environmental domain as the Wave Field data, which includes *background, water, road, sidewalk, terrain, building, vegetation, car, person, bike, pole, stair, traffic sign and sky* for a total of 14 classes. We used these images to fine-tune a modified 2D segmentation network MobileNet [39] with a pre-trained model on the ImageNet dataset [5] for efficiency. The fine-tuned network is used to segment the RGB images, and the organized point clouds can then directly be used together with the corresponding semantic labels for each point.

The qualitative results are given in Fig. 6. To further test the mapping performance of our methods on sparse data, we downsample the point clouds per scan to a resolution of 0.2 m, and build a semantic occupancy map with a resolution of 0.1 m. The mapping drift after one full round of the Wave Field is due to the odometry system used in the experiment [19, 18]. The details of both maps are given in Fig. 7. While the robot is navigating along the sidewalk, S-CSM produces discontinuous semantic maps from sparse sensor measurements, which may cause the robot's planner to regard the gaps in the map as unwalkable areas, a practical problem when we conduct autonomous walking experiments with Cassie Blue (video: https://www.youtube.com/watch?v=LhFC45jweFM&t=32s). In contrast, the S-BKI model produces a continuous and smooth map, where gaps are assigned labels inferred from local correlations in the map.
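The per-scan downsampling can be illustrated with a simple voxel-grid filter. This is a sketch under stated assumptions: the text does not say which filter was used, and keeping the first point and label seen in each voxel is one of several reasonable choices (a centroid-based filter would be another).

```python
from typing import Dict, List, Sequence, Tuple


def voxel_downsample(points: Sequence[Sequence[float]],
                     labels: Sequence[object],
                     voxel_size: float) -> Tuple[List[Sequence[float]], List[object]]:
    """Keep one labeled point per voxel of edge length `voxel_size`."""
    kept: Dict[Tuple[int, ...], int] = {}
    for index, x in enumerate(points):
        key = tuple(int(c // voxel_size) for c in x)
        # The first point that lands in a voxel represents that voxel.
        kept.setdefault(key, index)
    indices = sorted(kept.values())
    return [points[i] for i in indices], [labels[i] for i in indices]
```

With a 0.2 m filter and a 0.1 m map, many map cells receive no direct measurement, which is the sparse regime where kernel-based inference is expected to matter most.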

## VI Discussions and Limitations

There are still several limitations to this work. First, the length-scale of the kernel function trades off predictive ability and classification accuracy. When the length-scale is large, the model can extrapolate large-scale trends in data, and thus be more predictive; however, the classification accuracy may drop for small objects in the environment. In the current approach, we manually tune the length-scale and use the same scale everywhere, independent of the class. Optimizing the hyperparameters in a Bayesian framework can be helpful. In addition, varying the length-scale based on geometric features and semantic properties may further improve semantic mapping performance. Secondly, the memory and storage required for large-scale mapping is another limitation. We currently store the entire semantic map in computer memory without any pruning. However, with the current test-data octrees data structure, even when storing the map after pruning, the savings in memory consumption are not substantial. How to compress the continuous semantic maps is an interesting future research direction. We also note that the current software accompanying the paper is not real-time for large input data. Developing a real-time semantic mapping system based on this work is another interesting direction for future work.

## VII Conclusion

In this paper, we extended the counting sensor model for occupancy grid mapping to a semantic counting sensor model for semantic occupancy mapping. To relax the independent-grid assumption in occupancy grid mapping, we used Bayesian spatial kernel inference to generalize the semantic counting sensor model to continuous semantic mapping. Extensive experimental results show that the proposed methods work with both dense stereo camera and LiDAR data. We improved the mapping performance over the state-of-the-art semantic mapping system on the KITTI dataset, and increased the segmentation accuracy over a 3D deep neural network with KNN processing on the SemanticKITTI dataset. We labeled the NCLT dataset and collected data using the Cassie Blue biped robot to further evaluate the mapping performance in real-world experiments. The S-BKI model consistently outperforms S-CSM, which shows the advantage of using Bayesian kernel inference in continuous mapping.

## Acknowledgment

This article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would like to thank Yukai Gong for the development of the feedback controller utilized in the Cassie experiments as well as Bruce Huang, Zhenyu Gan, Omar Harib, Eva Mungai, and Grant Gibson for their help in collecting experimental data.

## References

- [1] (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34 (11), pp. 2274–2282.
- [2] (2019) SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proc. IEEE Int. Conf. Comput. Vis.
- [3] (2018) Efficient surfel-based SLAM using 3D laser range data in urban environments. In Proc. Robot.: Sci. Syst. Conf.
- [4] (2016) University of Michigan North Campus long-term vision and lidar dataset. Int. J. Robot. Res. 35 (9), pp. 1023–1035.
- [5] (2009) ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 248–255.
- [6] (2019) Learning-Aided 3-D Occupancy Mapping With Bayesian Generalized Kernel Inference. IEEE Trans. Robot.
- [7] (2017) Bayesian generalized kernel inference for occupancy map prediction. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 3118–3124.
- [8] (1987) Sonar-based real-world mapping and navigation. IEEE J. Robot. Autom. 3 (3), pp. 249–265.
- [9] (2017) Sparse Bayesian inference for dense semantic mapping. arXiv preprint arXiv:1709.07973.
- [10] (2012) Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 3354–3361.
- [11] (2010) Efficient large-scale stereo matching. In Proc. Asian Conf. Comput. Vis., pp. 25–38.
- [12] (2017) Gaussian processes semantic map representation. arXiv preprint arXiv:1707.01532.
- [13] (2018) Gaussian processes autonomous mapping and exploration for range-sensing mobile robots. Auton. Robot. 42 (2), pp. 273–290.
- [14] (2019) Sampling-based incremental information gathering with applications to robotic exploration and environmental monitoring. Int. J. Robot. Res. 38 (6), pp. 658–685.
- [15] (2014) Exploration on continuous Gaussian process frontier maps. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 6077–6082.
- [16] (2003) Map building with mobile robots in dynamic environments. In Proc. IEEE Int. Conf. Robot. and Automation, Vol. 2, pp. 1557–1563.
- [17] (2005) Mapping with mobile robots. Ph.D. Thesis, University of Freiburg, Freiburg im Breisgau, Germany.
- [18] (2019) Contact-aided invariant extended Kalman filtering for robot state estimation. arXiv preprint arXiv:1904.09251.
- [19] (2018) Contact-aided invariant extended Kalman filtering for legged robot state estimation. In Proc. Robot.: Sci. Syst. Conf., Pittsburgh, Pennsylvania.
- [20] (2013) Nonparametric semantic segmentation for 3D street scenes. In Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., pp. 3697–3703.
- [21] (2013) OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Auton. Robot. 34 (3), pp. 189–206.
- [22] (2013) 3D scene understanding by voxel-CRF. In Proc. IEEE Int. Conf. Comput. Vis., pp. 1425–1432.
- [23] (2013) GPmap: a unified framework for robotic mapping based on sparse Gaussian processes. In Proc. Int. Conf. Field Service Robot.
- [24] (2015) Semantic mapping for mobile robotics tasks: a survey. Robot. and Auton. Syst. 66, pp. 86–103.
- [25] (2014) Joint semantic segmentation and 3D reconstruction from monocular video. In Proc. European Conf. Comput. Vis., pp. 703–718.
- [26] (2017) SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 4628–4635.
- [27] (2009) A sparse covariance function for exact Gaussian process inference in large datasets. In Proc. Int. Joint Conf. Artif. Intell., pp. 1936–1942.
- [28] (2019) RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. In Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst.
- [29] (1985) High resolution maps from wide angle sonar. In Proc. IEEE Int. Conf. Robot. and Automation, Vol. 2, pp. 116–121.
- [30] (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31 (5), pp. 1147–1163.
- [31] (2008) Towards semantic maps for mobile robots. Robot. and Auton. Syst. 56 (11), pp. 915–926.
- [32] (2012) Gaussian process occupancy maps. Int. J. Robot. Res. 31 (1), pp. 42–62.
- [33] (2016) PROBE-GK: predictive robust estimation using generalized kernels. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 817–824.
- [34] (2009) ROS: an open-source Robot Operating System. In ICRA workshop on open source software, Vol. 3, pp. 5.
- [35] (2018) Bayesian learning for safe high-speed navigation in unknown environments. In Robot. Res., pp. 325–341.
- [36] (2011) 3D is here: Point cloud library (PCL). In Proc. IEEE Int. Conf. Robot. and Automation, pp. 1–4.
- [37] (2013) Urban 3D semantic modelling using stereo vision. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 580–585.
- [38] (2015) Semantic octree: Unifying recognition, reconstruction and representation via an octree constrained higher order MRF. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 1874–1879.
- [39] (2018) RtSeg: real-time semantic segmentation comparative study. In Proc. Int. Conf. Image Process., pp. 1603–1607.
- [40] (2012) Semantic mapping using object-class segmentation of RGB-D images. In Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., pp. 3005–3010.
- [41] (2013) Mesh based semantic modelling for indoor and outdoor scenes. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 2067–2074.
- [42] (2014) Nonparametric Bayesian inference on multivariate exponential families. In Proc. Advances Neural Inform. Process. Syst. Conf., pp. 2546–2554.
- [43] (2015) Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 75–82.
- [44] (2016) Fast, accurate Gaussian process occupancy maps via test-data octrees and nested Bayesian fusion. In Proc. IEEE Int. Conf. Robot. and Automation, pp. 1003–1010.
- [45] (2008) Semantic mapping using mobile robots. IEEE Trans. Robot. 24 (2), pp. 245–258.
- [46] (2017) Semantic 3D occupancy mapping through efficient high order CRFs. In Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., pp. 590–597.
- [47] (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
- [48] (2016) Building 3D semantic maps for mobile robots using RGB-D camera. Intell. Service Robot. 9 (4), pp. 297–309.