Object Detection and Classification in Occupancy Grid Maps using Deep Convolutional Networks
(J. B. Frias thanks the University of Vigo for funding his research period at Karlsruhe Institute of Technology (KIT), Germany.)
A detailed environment perception is a crucial component of automated vehicles. However, to deal with the amount of perceived information, we also require segmentation strategies. Based on a grid map environment representation, well-suited for sensor fusion, free-space estimation and machine learning, we detect and classify objects using deep convolutional neural networks. As input for our networks we use a multi-layer grid map that efficiently encodes 3D range sensor information. The inference output consists of a list of rotated bounding boxes with associated semantic classes. We conduct extensive ablation studies, highlight important design considerations when using grid maps and evaluate our models on the KITTI Bird's Eye View benchmark. Qualitative and quantitative benchmark results show that we achieve robust detection and state-of-the-art accuracy solely using top-view grid maps from range sensor data.
We require a detailed environment representation for the safe use of mobile robotic systems, e.g. in automated driving. To enable higher level scene understanding and decrease computational cost for existing methods, information needs to be further filtered, segmented and categorized. This task can be accomplished by object detection, shape estimation and classification methods, in the following simply referred to as object detection. Given an input environment representation, the object detector should output a list of oriented shapes and their corresponding most likely semantic classes.
In this work we represent the environment by top-view grid maps, in the following referred to as grid maps. Occupancy grid maps, first introduced in [1], encode surface point positions and free space from a point of view in a two-dimensional grid. As all traffic participants move on a common ground surface, full 3D information may not be required; instead, the scene can be represented in 2D with obstacles occupying areas along the drivable path. Multi-layer grids are well-suited for sensor fusion [2], and their organized 2D representation enables the use of efficient convolutional operations for deep learning, in contrast to sparse point sets. Whereas objects in camera images vary in scale due to the projective mapping, grid maps represent an orthographic top view composed of metric fixed-size cells, making objects scale-invariant. In addition, object projections in camera images might overlap, which is not the case for multiple objects in occupancy grid maps.
Here, we first present an overview on object detection and semantic classification in multi-layer grid maps. We then train different meta-architectures, show the influence of various parameters and discuss their effects on performance in detail. By making specific design considerations for the grid map domain we are able to train object detectors in an end-to-end fashion achieving state-of-the-art accuracy at reasonable processing time compared to recent 3D object detection approaches. Finally, we compare the most promising object detection models to recent state-of-the-art approaches on the KITTI bird’s eye view benchmark.
First, we review and compare related work on object detection in grid maps and other domains in Section II. We then present our preprocessing to obtain training examples in Section III. After recalling our general training strategy and metrics we provide information on the grid map domain adaptation in Section IV. We perform a quantitative and qualitative evaluation of different configurations in Section V. Finally, we conclude our work and propose future plans for object detection in Section VI.
II Related Work
II-A Object Detection Meta-Architectures
Recently, a notable number of state-of-the-art object detectors has been based on the Faster R-CNN meta-architecture [3]. In Faster R-CNN, detection happens in two stages: a region proposal network (RPN) and a classification and box refinement network. In the RPN, features are extracted from the input and used to predict class-agnostic box candidates in a grid of anchors tiled in space, scale and aspect ratio. The feature slice corresponding to each box proposal is then sequentially fed into the box classifier. In the original Faster R-CNN implementation each feature slice is fed into two dense layers before performing classification and box refinement, whereas in R-FCN [4] the dense layers are omitted, reducing the amount of computation per region. In contrast to Faster R-CNN and R-FCN, single shot detectors (SSDs) [5] predict bounding boxes and semantic classes with a single feed-forward CNN, significantly reducing inference time but also lowering the overall accuracy.
II-B Feature Extractors
The detection stage input consists of high-level features. These features may be computed by a deep feature extractor such as Resnet [6], Inception [7] or MobileNet [8]. Resnets implement layers as residual functions, gain accuracy from increased depth and were successfully applied in the ILSVRC and COCO 2015 challenges. Among other aspects, Inception and MobileNet use factorized convolutions to optimize accuracy and computation time. With Inception units, the depth and width of networks can be increased without increasing computational cost. MobileNets further reduce the number of parameters by using depth-wise separable convolutions.
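As a rough illustration of the parameter savings from depthwise separable convolutions (the function names below are ours, not from the MobileNet paper), compare the weight counts of a standard convolution and its depthwise separable factorization:

```python
def conv_weights(k, c_in, c_out):
    """Weights of a standard k x k convolution: one k x k x c_in kernel
    per output channel."""
    return k * k * c_in * c_out

def separable_conv_weights(k, c_in, c_out):
    """Depthwise separable factorization: one k x k kernel per input
    channel, followed by a 1 x 1 convolution mixing channels."""
    return k * k * c_in + c_in * c_out

# For a 3 x 3 convolution with 32 input and 64 output channels the
# factorization needs roughly 8x fewer weights (18432 vs. 2336).
print(conv_weights(3, 32, 64), separable_conv_weights(3, 32, 64))
```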
II-C Object Detection in Aerial Images
Here, we compare the object detection task in grid maps to that in (scale-corrected) satellite or aerial images, which has a long research history [9, 10, 11]. One approach uses 1420 labeled samples in high-resolution panchromatic images to train a vehicle detector, reducing false positives by selecting only hypotheses on surfaces semantically classified as streets. Whereas atmospheric conditions might limit aerial image quality due to camera views far from the scene, top-view grid maps suffer from occlusions due to a view from within the scene. These problems can be tackled either by fusing multiple measurements from different views or by learned environment reconstruction [12]. However, the shadows and occlusions cast by cars are also considered among the most relevant features, together with the rectangular shape and the windshield layout [13].
II-D KITTI Bird's Eye View Benchmark
Training deep networks requires a comparably large amount of labeled data. The KITTI Bird's Eye View Evaluation 2017 [14] consists of 7481 training and 7518 test camera images as well as corresponding range sensor data represented as point sets. Training and test data contain 80,256 labeled objects in total, which are represented as oriented 3D bounding boxes (7 parameters). As summarized in Table I, there are eight semantic classes labeled in the training set, although not all classes are used to determine the benchmark result.
Table I: Labeled semantic classes in the KITTI training set (Class | Occurrence | Max. length | Max. width).
Currently, successful benchmark submissions share a two-stage structure comprised of an RPN and a box refinement and classification network [15, 16]. They first extract features from sensor data, create axis-aligned object proposals and perform classification and box regression on the best candidates. Whereas the region proposal in [15] is based only on a grid map, [16] also incorporate camera images to generate proposals. To further increase accuracy, [16] train two separate networks for cars and pedestrians/cyclists, respectively.
II-E Choice of Input Features
The choice of grid cell features varies heavily across different publications. [17, 18, 15] use the (normalized) number of detections and characteristics derived from detection reflectances. As the reduction of 3D range sensor information to 2D implies a loss of information, features that encode height information might be relevant. Some approaches use the average height and an estimate of its standard deviation as features, whereas others use four height values, equally sampled in the interval between the lowest and the highest point coordinate of each cell.
Higher-level features are also possible. The approach in [19] uses evidence measures for occupied and free cells, the average velocity and its auto-covariance matrix estimated by a particle filter. Other works estimate the standard deviations in the two principal horizontal directions or local planarity. However, as we aim to train object detectors in an end-to-end fashion, we do not consider handcrafted features in this work. On the one hand, it sometimes seems arbitrary how certain features are picked, and there is no evidence of gaining accuracy from higher-level features in combination with the training of deep networks. On the other hand, higher-level features such as velocity estimates might not be available at all times.
II-F Box Encoding
Similar to the feature encoding of grid cells, a variety of box encodings is used in related work. [15] use eight 3D points (24 parameters) for box regression and recover the box orientation in the direction of the longer box side. In contrast, [16] use four ground points and the heights of the upper and lower box face, respectively (14 parameters). They explicitly regress the sine and cosine of the orientation to handle angle wrapping and increase regression robustness. An encoding that needs the minimum number of 2D box parameters is presented in [20]: boxes are represented by two points and one height parameter (5 parameters).
III Grid Map Processing
We perform minimal preprocessing in order to obtain occupancy grid maps. As labeled objects are only available within the camera image, we remove all points outside the camera's field of view (see Figure 2). We then apply optional ground surface segmentation as described in Section III-A and estimate different grid cell features summarized in Section III-B. The resulting multi-layer grid maps have a size of 80 m × 80 m and a cell size of either 10 cm or 15 cm.
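The mapping from metric point coordinates to grid cells can be sketched as follows. This is a simplified illustration under assumed conventions (a square grid centered at the sensor origin; the paper does not specify the exact layout):

```python
import numpy as np

def points_to_cell_indices(points_xy, extent=80.0, cell_size=0.15):
    """Map metric (x, y) coordinates to integer cell indices of a square
    grid of the given extent, centered at the sensor origin (assumed
    layout). Returns the indices of in-grid points and the inlier mask."""
    points_xy = np.asarray(points_xy, dtype=float)
    shifted = points_xy + extent / 2.0          # move origin to grid corner
    idx = np.floor(shifted / cell_size).astype(int)
    n_cells = int(round(extent / cell_size))
    mask = np.all((idx >= 0) & (idx < n_cells), axis=1)
    return idx[mask], mask
```

Per-cell features (counts, min/max z, mean intensity) can then be accumulated by scattering point attributes into the cells addressed by these indices.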
III-A Ground Surface Segmentation
Recent approaches create top-view images including all available range sensor points. However, it remains unclear whether ground surface points significantly influence the object detection accuracy. Therefore, we optionally split ground from non-ground points. As we observed the ground to be flat in most scenarios, we fit a ground plane to the representative point set; however, any other method for ground surface estimation can be used as well. For each scan, we perform nonlinear least-squares optimization [21] to find the optimal plane parameters

\[ \theta^{*} = \operatorname*{arg\,min}_{\theta} \sum_{p_i \in P} \rho\!\left( \lVert d_i(\theta) \rVert^{2} \right) \]

which minimize the accumulated point-to-plane error for all points \(p_i\) of the point set \(P\), where \(d_i\) denotes the distance vector between \(p_i\) and its plane projection point. The loss function \(\rho\) is chosen to be the Cauchy loss with a small scale (5 cm) to strictly robustify against outliers. We then remove all points whose signed distance to the plane falls below a fixed threshold.
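A minimal sketch of such a robust plane fit, using SciPy's built-in Cauchy loss rather than the Ceres solver used in the paper. For simplicity it parametrizes the plane as z = ax + by + c and measures vertical rather than orthogonal point-to-plane distances; the removal threshold is our own placeholder, as the paper's value is elided above:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_ground_plane(points, scale=0.05):
    """Fit z = a*x + b*y + c with a Cauchy loss (scale 5 cm) so that
    non-ground points act as outliers. points: (N, 3) array."""
    def residuals(params):
        a, b, c = params
        return points[:, 2] - (a * points[:, 0] + b * points[:, 1] + c)
    return least_squares(residuals, x0=np.zeros(3),
                         loss="cauchy", f_scale=scale).x

def remove_ground(points, plane, threshold=0.2):
    """Keep only points whose signed height above the fitted plane
    exceeds the (assumed) threshold."""
    a, b, c = plane
    height = points[:, 2] - (a * points[:, 0] + b * points[:, 1] + c)
    return points[height >= threshold]
```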
III-B Grid Cell Features
We use the full point set or a non-ground subset to construct a multi-layer grid map containing different features. Inspired by other contributions (e.g. [15, 16]), we investigate whether there is evidence for better convergence or accuracy when normalizing the number of detections per cell.
As one example, we follow the approach presented in [22] and estimate the decay rate

\[ \lambda_c = \frac{H_c}{\sum_{r \in R_c} \Delta_{r,c}} \]

for each cell \(c\) as the ratio of the number of detections \(H_c\) and the sum of distances \(\Delta_{r,c}\) traveled through the cell for all rays \(r \in R_c\). We determine \(H_c\) and \(\Delta_{r,c}\) by casting rays from the sensor origin to the end points using the slab method proposed in [23]. In another configuration, we use the number of detections and observations per cell directly. To encode height information we use the minimum and maximum z coordinate of all points within a cell instead of splitting the z range into several intervals (e.g. as in [15, 16]). In all configurations we determine the average reflected energy, in the following termed intensity. Figure 2 depicts the grid cell features presented. Table II summarizes the feature configurations used for evaluation.
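The slab method for determining the distance a ray travels through a cell can be sketched as follows (a 2D illustration under our own variable names):

```python
import numpy as np

def distance_through_cell(origin, end, cell_min, cell_max):
    """Length of the segment origin->end inside the axis-aligned cell
    [cell_min, cell_max], computed with the slab method: the segment is
    clipped against each pair of axis-parallel planes in turn."""
    origin = np.asarray(origin, dtype=float)
    end = np.asarray(end, dtype=float)
    direction = end - origin
    length = np.linalg.norm(direction)
    direction = direction / length
    t_near, t_far = 0.0, length
    for axis in range(origin.size):
        if abs(direction[axis]) < 1e-12:
            # Ray parallel to this slab: it must start inside the slab.
            if not cell_min[axis] <= origin[axis] <= cell_max[axis]:
                return 0.0
            continue
        t0 = (cell_min[axis] - origin[axis]) / direction[axis]
        t1 = (cell_max[axis] - origin[axis]) / direction[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return 0.0  # segment misses the cell
    return t_far - t_near
```

The decay rate of a cell is then the number of detections in that cell divided by the sum of these distances over all rays touching it.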
Table II: Feature configurations.
F1: Intensity, min. / max. z coordinate, detections, observations
F2: Intensity, min. / max. z coordinate, decay rate
F3: Intensity, detections, observations
F1*: Same as F1 but with ground surface removed
Out of the total number of training examples we use 2331 (31%) samples for internal evaluation, referred to as the validation set. As summarized in Table IV, we train networks with several configurations, varying one parameter at a time. Due to our limited computational resources we train all networks using SGD with batch normalization [24] and the Momentum optimizer with a momentum of 0.9. We pretrain each feature extractor for 250k iterations with a grid cell size of 15 cm, using separate initial learning rates for the Resnet and Inception variants. Starting from the trained baseline networks, we then train each configuration for another 200k iterations with the learning rate lowered by a factor of 2. A few networks are compared against other methods by uploading inferred labels to the KITTI benchmark.
IV-A Box Encoding
As mentioned in Section II-F, several box encodings are in use. We want to use as few parameters as possible, as we assume this to be beneficial for box regression accuracy. Although the orientation estimation might be more problematic, we adapt the approach in [16] and encode the orientation by its sine and cosine, providing an explicit and smooth mapping over the full angle range (B1). To compare against other encodings, we also represent boxes by position, extent and orientation (B2), as well as by two points and a width [20] (B3). The encodings are summarized in Table III.
Table III: Box encodings (Box Encoding Id | Parameters).
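The sine / cosine orientation encoding of B1 and its inverse can be sketched as follows; the decode step makes the angle-wrapping benefit explicit (function names are ours):

```python
import math

def encode_orientation(yaw):
    """B1-style orientation target: a smooth 2D representation of the
    angle that avoids the discontinuity at the +/- pi boundary."""
    return math.sin(yaw), math.cos(yaw)

def decode_orientation(s, c):
    """Recover the angle; atan2 handles all four quadrants."""
    return math.atan2(s, c)
```

Two nearby orientations on opposite sides of the ±π boundary have nearly identical (sin, cos) targets, whereas a direct angle regression target would jump by almost 2π.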
IV-B Data Augmentation
Because convolutional filters are not rotationally invariant, we increase the number of training samples by augmenting different viewing angles. Similar to related approaches, we randomly flip the grid map around its x-axis (pointing to the front). Subsequently, we randomly rotate each grid map by a bounded random angle around the sensor origin. Label boxes are transformed accordingly.
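A sketch of this augmentation, assuming a grid layout where rows index x, columns index y and the sensor sits at the grid center, with boxes stored as (x, y, length, width, yaw); these conventions are our assumptions, not the paper's:

```python
import numpy as np
from scipy import ndimage

def flip_around_x(grid, boxes):
    """Mirror the grid map across the x-axis and adjust the boxes:
    y and yaw change sign, extents stay unchanged."""
    flipped = np.flip(grid, axis=1).copy()
    out = boxes.copy()
    out[:, 1] *= -1.0   # y -> -y
    out[:, 4] *= -1.0   # yaw -> -yaw
    return flipped, out

def rotate_around_origin(grid, boxes, angle):
    """Rotate the grid map by `angle` (radians) around its center and
    transform the box centers and yaw accordingly."""
    rotated = ndimage.rotate(grid, np.degrees(angle), axes=(0, 1),
                             reshape=False, order=1)
    c, s = np.cos(angle), np.sin(angle)
    out = boxes.copy()
    out[:, 0] = c * boxes[:, 0] - s * boxes[:, 1]
    out[:, 1] = s * boxes[:, 0] + c * boxes[:, 1]
    out[:, 4] = boxes[:, 4] + angle
    return rotated, out
```

Note that the sign convention of the image rotation must match the box transform under the chosen index layout; this needs to be verified for any concrete grid convention.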
IV-C Proposal Generation
In contrast to [16], we aim to train one network for many classes (see Table I). However, as vans and cars as well as sitting persons and pedestrians are similar, and only very few training samples are available for some classes, we merge each of these pairs into one class.
Working on fixed-scale grid maps, we can further adapt the object proposal generation to our domain via its size, aspect ratio and stride. Table I summarizes the maximum length and width for each semantic class. From these, we determine a small set of anchor sizes that enclose most objects closely. Note that we determine the combined extent for cars / vans and pedestrians / sitting persons, as we treat them as the same class. Trams might not fit completely into the largest feature maps; however, we expect that they can still be distinguished properly due to their large size. We choose the feature slice aspect ratios to be 1:1, 2:1 and 1:2 and the stride to be 16 times the grid cell size.
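Tiling anchors over the feature map with class-derived sizes, the chosen aspect ratios and a stride of 16 cells could look like this (the function and parameter names are our assumptions):

```python
import numpy as np

def tile_anchors(fmap_shape, stride, sizes, ratios=(1.0, 2.0, 0.5)):
    """Return axis-aligned anchors (cx, cy, h, w) tiled over a feature
    map. `sizes` are base edge lengths (metric) derived from the class
    extents; `stride` is the metric spacing of anchor centers
    (16 x cell size in the paper)."""
    anchors = []
    for i in range(fmap_shape[0]):
        for j in range(fmap_shape[1]):
            cy = (i + 0.5) * stride
            cx = (j + 0.5) * stride
            for size in sizes:
                for ratio in ratios:
                    # Keep the anchor area size**2 while varying the ratio.
                    h = size * np.sqrt(ratio)
                    w = size / np.sqrt(ratio)
                    anchors.append((cx, cy, h, w))
    return np.asarray(anchors)
```

Because grid maps have a fixed metric scale, a single anchor size per (merged) class already covers the full size range of that class, unlike in camera images where anchors must additionally span scale.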
To train the RPN we use the same multi-task loss as presented in [3]. However, for the box classification and regression stage we extend this metric by another sibling output layer and define the multi-task loss similar to [3] as

\[ L(p, u, t, t', v, v') = L_{\text{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\text{loc}}(t, v) + \mu \, [u \geq 1] \, L_{\text{loc}}(t', v') . \]

For each proposal, a discrete probability distribution \(p\) over the classes is computed by the softmax function. Here, \(L_{\text{cls}}\) denotes the multi-class cross entropy loss for the true class \(u\). \(t\) is the predicted bounding-box regression offset given in [3], which specifies a scale-invariant translation and a log-space height / width shift relative to an object proposal, and \(t'\) denotes the predicted inclined bounding-box regression offset. For the localization losses \(L_{\text{loc}}\) we use the robust smooth L1 loss. \(v\) denotes the true bounding-box regression target and \(v'\) the true inclined bounding-box regression target, depending on the box encoding used (see Table III). The hyperparameters \(\lambda\) and \(\mu\) balance the different loss terms and are kept fixed in all experiments. The difference between the two bounding box representations is also depicted in Figure 1(f).
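The multi-task loss can be sketched numerically; this is a minimal NumPy version under our own signatures, where the indicator [u >= 1] gates both localization terms to non-background proposals:

```python
import numpy as np

def smooth_l1(pred, target):
    """Robust smooth L1 loss: quadratic for errors below 1, linear beyond."""
    d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def multi_task_loss(p, u, t, v, t_inc, v_inc, lam=1.0, mu=1.0):
    """Cross entropy for the true class u plus two weighted localization
    terms, active only for non-background proposals (u >= 1)."""
    loss = -np.log(p[u])
    if u >= 1:
        loss += lam * smooth_l1(t, v) + mu * smooth_l1(t_inc, v_inc)
    return loss
```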
Table IV: Network configurations and results (Net | Meta Arch. | Feat. Extr. | Grid Map: Feat., Res. | Box Enc. | KITTI Evaluation: Cars, Cyclists, Pedestrians | Time).
Table IV summarizes the evaluation results on the validation set for different network configurations.
We evaluate the overall accuracy based on the average precision for the KITTI Bird’s Eye View Evaluation using an Intersection over Union (IoU) threshold of 0.7 for cars and an IoU of 0.5 for cyclists and pedestrians. The evaluation is divided into the three difficulties Easy (E), Moderate (M) and Hard (H) based on occlusion level, maximal truncation and minimum bounding box size.
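The IoU criterion can be sketched for axis-aligned boxes as follows; the benchmark actually evaluates rotated boxes in the ground plane, which requires polygon intersection, but the axis-aligned case conveys the idea:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

A detection counts as a true positive only if its IoU with a ground-truth box reaches the class-specific threshold (0.7 for cars, 0.5 for cyclists and pedestrians).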
Table IV summarizes the quantitative evaluation results using the KITTI benchmark metric.
The largest gain in accuracy is achieved by decreasing the grid cell size, as for Net 5. However, the box encoding also has a large impact on accuracy. While in Net 6 angles cannot be recovered robustly, the angle encoding in B1 yields better results. Unfortunately, the network training for box encoding B3 did not converge at all. This might be due to an issue during data augmentation when boxes (and grid maps) are rotated. The input features also have an impact on detection accuracy: normalization via the decay rate model appears to yield better results than using the number of detections and observations directly. This is advantageous, as the number of grid map layers can be decreased this way. Ground surface removal has a minor impact on the detection of cars and other large objects but reduces accuracy for cyclists and pedestrians. We believe this is caused by removing detections close to the ground surface.
Our test results (submitted as TopNet variants) are similar to the validation results, yielding state-of-the-art benchmark results. This shows that no overfitting on the validation data occurred, likely due to our data augmentation strategies.
Figure 3 depicts two scenarios for qualitative comparison of three network configurations.
V-C Inference Time
We evaluated the processing times on a 2.5 GHz six-core Intel Xeon E5-2640 CPU with 15 MB cache and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB graphics memory. In comparison to the other networks, Net 5 has the highest inference time. This is due to the larger grid size, as we keep the grid map extent of 80 m × 80 m fixed across evaluations. Net 2 has a slightly shorter inference time due to the R-FCN meta-architecture. Using the InceptionV2 feature extractor in Net 3 also decreases the inference time compared to using a Resnet101. A different number of grid map layers or different box encodings have no significant impact on the inference time.
We presented our approach to object detection and classification based on multi-layer grid maps using deep convolutional networks.
By specifically adapting preprocessing, input features, data augmentation, object encodings and proposal generation to the grid map domain, we show that our networks achieve state-of-the-art benchmark results using only multi-layer grid maps from range sensor data. We identify the input feature selection together with the resolution as important factors for network accuracy and training / inference time.
As a next step we aim to develop a framework for semi-supervised learning of object detectors, hopefully increasing generalization and thus overall robustness. Finally, we want to develop a tracking framework based on grid maps by coupling detections with predictions in an end-to-end learnable framework.
-  A. Elfes, “Using Occupancy Grids for Mobile Robot Perception and Navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.
-  D. Nuss, T. Yuan, G. Krehl, M. Stuebler, S. Reuter, and K. Dietmayer, “Fusion of Laser and Radar Sensor Data with a Sequential Monte Carlo Bayesian Occupancy Filter,” in 2015 IEEE Intelligent Vehicles Symposium (IV), 2015, pp. 1074–1081.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
-  J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object Detection via Region-based Fully Convolutional Networks," May 2016.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," Dec 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," Feb 2016.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," Apr 2017.
-  S. Hinz, "Detection and Counting of Cars in Aerial Images," in Proceedings 2003 International Conference on Image Processing (ICIP), vol. 3, 2003, pp. 997–1000.
-  T. T. Nguyen, H. Grabner, H. Bischof, and B. Gruber, “On-line Boosting for Car Detection from Aerial Images,” in 2007 IEEE International Conference on Research, Innovation and Vision for the Future, 2007, pp. 87–95.
-  S. Kluckner, G. Pacher, H. Grabner, H. Bischof, and J. Bauer, “A 3D Teacher for Car Detection in Aerial Images,” in 2007 IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.
-  S. Wirges, F. Hartenbach, and C. Stiller, “Evidential Occupancy Grid Map Augmentation using Deep Learning,” ArXiv e-prints, 2018.
-  T. Zhao and R. Nevatia, “Car Detection in Low Resolution Aerial Images,” Image and Vision Computing, vol. 21, no. 8, pp. 693–703, 2003.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we Ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” Computer Vision and Pattern Recognition, 2012 IEEE Conference on, pp. 3354–3361, 2012.
-  X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-View 3D Object Detection Network for Autonomous Driving,” CVPR, pp. 1907–1915, 2017.
-  J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, “Joint 3D Proposal Generation and Object Detection from View Aggregation,” ArXiv e-prints, 2017.
-  A. Golovinskiy, V. G. Kim, and T. Funkhouser, "Shape-based Recognition of 3D Point Clouds in Urban Environments," in IEEE 12th International Conference on Computer Vision (ICCV), 2009, pp. 2154–2161.
-  P. Babahajiani, L. Fan, and M. Gabbouj, “Object Recognition in 3D Point Cloud of Urban Street Scene,” C. V. Jawahar and S. Shan, Eds. Cham: Springer International Publishing, 2015, pp. 177–190.
-  S. Hoermann, P. Henzler, M. Bach, and K. Dietmayer, “Object Detection on Dynamic Occupancy Grid Maps Using Deep Learning and Automatic Label Generation,” ArXiv e-prints, 2018.
-  Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo, “R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection,” 2017.
-  S. Agarwal, K. Mierle, and Others, “Ceres Solver,” http://ceres-solver.org.
-  A. Schaefer, L. Luft, and W. Burgard, “An Analytical Lidar Sensor Model Based on Ray Path Information,” IEEE International Conference on Robotics and Automation, vol. 2, no. 3, pp. 1405–1412, 2017.
-  T. L. Kay and J. T. Kajiya, “Ray Tracing Complex Scenes,” in ACM SIGGRAPH Computer Graphics, vol. 20, no. 4, 1986, pp. 269–278.
-  S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," Feb 2015.