Patch Refinement - Localized 3D Object Detection
We introduce Patch Refinement a two-stage model for accurate 3D object detection and localization from point cloud data. Patch Refinement is composed of two independently trained Voxelnet-based networks, a Region Proposal Network (RPN) and a Local Refinement Network (LRN). We decompose the detection task into a preliminary Bird’s Eye View (BEV) detection step and a local 3D detection step. Based on the proposed BEV locations by the RPN, we extract small point cloud subsets ("patches"), which are then processed by the LRN, which is less limited by memory constraints due to the small area of each patch. Therefore, we can apply encoding with a higher voxel resolution locally. The independence of the LRN enables the use of additional augmentation techniques and allows for an efficient, regression focused training as it uses only a small fraction of each scene. Evaluated on the KITTI 3D object detection benchmark, our submission from January 28, 2019, outperformed all previous entries on all three difficulties of the class car, using only 50 % of the available training data and only LiDAR information.
Object detection and localization is one of the key challenges in autonomous driving Geiger et al. (2012) and robotics Wolf et al. (2016). Compared to the performance of 2D object detection in images the results in 3D object detection lag behind considerably, mainly caused by increased difficulty of the localization task. Instead of fitting axis-aligned rectangles around the part of objects that is visible in the image plane, the main challenge in 3D detection and localization is the amodal oriented 3D bounding box prediction in 3D space, which includes the occluded or truncated parts of objects.
Despite the recent advancement of image-only 3D detectors Wang et al. (2019); Li et al. (2019) point cloud information from LiDAR sensors is a requirement to achieve highly accurate 3D localization. Unfortunatly, LiDAR data is represented as an unordered, sparse set of points in a continuous 3D space and can not be directly processed by standard convolutional neural networks (CNNs). Deep CNNs operating on multiple discretized 2D Bird’s Eye View (BEV) maps achieved early success in both 3D Chen et al. (2017); Ku et al. (2018) and BEV Yang et al. (2018) detection. Simply discretizing point clouds is inevitably linked to a loss of information. Current state-of-the-art 3D object detection is largely based on the seminal work PointNet Qi et al. (2017). Pointnets are used in combination with 2D image detectors to perform 3D localization on point cloud subsets Qi et al. (2018); Shin et al. (2019). Two-stage approaches Shi et al. (2019) use PointNets in an initial segmentation stage and a subsequent localization stage. Voxelnet Zhou and Tuzel (2018) introduces Voxel Feature Encoding (VFE) layers which utilize PointNet to learn an embedding of local geometry within voxels, which can then be processed by 3D and 2D convolutional stages. Others Yan et al. (2018); Lang et al. (2019) add further improvements to Voxelnet.
Voxelnet is a single-stage model, as such it applies VFE-encoding with a uniform resolution on the whole scene, although such a resolution is only necessary at locations that contain objects. As a consequence, it is severely limited by memory constraints, especially during training, where only a batch size of 2 can be processed by a GPU with 11 GB of memory. Given the typical sparseness of objects within a LiDAR scene, only a few subsets contain viable information to train regression Engelcke et al. (2017). But in the case of Voxelnet Batch Normalization layers Ioffe and Szegedy (2015) break this local independence.
To be able to train a small detector focused on 3D bounding box regression, we decompose the task into a preliminary BEV detection step and a local 3D detection step, similar to the two-stage approach of R-CNN Girshick et al. (2014). Object sizes are bound and unaffected by the distance to the sensor. We construct a Local Refinement Network (LRN) that operates on small subsets of points within a fixed-sized cuboid, which we term ”patches”. The RPN does not have to perform warping and independence of the LRN can be achieved by training with some noise to account for proposed locations that are slightly offset. We favor an independent approach because it enables additional augmentation options, the higher resolution features have to be calculated in any case and it allows us to evaluate the regression ability of the LRN without the influence of an RPN.
Our work comprises the following key contributions:
We demonstrate that it is a viable approach to decompose the 3D object detection task in autonomous driving scenarios into a preliminary BEV detection followed by a local 3D detection via two independently trained networks.
We show that it takes only a few simple modifications to utilize the Voxelnet Zhou and Tuzel (2018) architecture either as an efficient Region Proposal Network or as a Local Refinement Network that is capable to perform highly accurate 3D bounding box regression and is not limited to fixed input space.
We report the beneficial effect of adding the corner bounding box parametrization of AVOD Ku et al. (2018) as auxiliary regression targets, even without applying the proposed decoding algorithm.
2 Related Work
MV3D Chen et al. (2017) and AVOD Ku et al. (2018) are two-stage models that fuse image and point cloud information and perform regression on the resulting 2D BEV feature maps, projections of point clouds and camera information. While this enables the use of 2D CNNs, these approaches cannot capture the full geometric complexity of the 3D input scene, due to information loss caused by discretization.
Frustum PointNets Qi et al. (2018) projects the proposals of an image-based 2D detector onto the point cloud. The resulting frustum is then further processed by a sequence of PointNets. It demonstrates that accurate amodal bounding box prediction can be performed without context information, which is actively removed by a segmentation PointNet. Noticeably, detection scores are calculated without taking the LiDAR representation into account.
Voxelnet Zhou and Tuzel (2018) applies a 3D grid to divide the input space into voxels. Followed by a sparse voxel-wise input encoding via a sequence of PointNet-based VFE layers. This enables the network to learn structures within voxels and to embed the point cloud into a structured representation while retaining the most important geometric information in the data. Which are subsequently processed by 3D and 2D CNNs. Both our RPN and LRN are based on Voxelnet, modified to better accomplish their respective tasks.
SECOND Yan et al. (2018) modifies Voxelnet by replacing the costly dense 3D convolutions with efficient sparse convolutions, reducing both run-time and memory consumption considerably. Furthermore, they propose ground truth sampling, an augmentation technique that populates training scenes with additional objects from other scenes. Besides speeding-up training by increasing the average number of objects per frame, this augmentation provides strong protection against overfitting on context information Shi et al. (2019); Lang et al. (2019).
PointPillars Lang et al. (2019) proposes a VoxelNet-based model without 3D convolutional middle layers. Instead, they apply a voxel grid with a vertical resolution of one and the encoding of the vertical information is solely performed within the VFE-layers. Further optimized towards speed, the model achieves the highest frame rates within the pool of current 3D object detection models. Our RPN follows a similar design, but we use a vertical resolution of two and concatenation.
PointRCNN Shi et al. (2019) is a two-stage approach utilizing PointNets, that introduces a novel LiDAR-only bottom-up 3D proposal generation first stage, followed by a second stage to refine predictions. Similar to our model it follows the R-CNN approach and pools the relevant subset of the input point cloud for each proposal. Unlike our LRN, the second stage of Point R-CNN reuses higher-level features and relies on the RPN to transform the proposals into a canonical representation.
3.1 Local Refinement Network
We follow the Voxelnet approach and apply a 3D voxel grid to the input, grouping points to voxels. This is followed by a VFE stage and then by a reduction step from 3D to 2D BEV feature maps. These are then processed by our 2D convolutional backbone network.
Grouping Points into Voxels
We use the efficient sparse encoding algorithm of Voxelnet that processes only non-empty voxels. Although we train only on small regions of the input scene, we preserve absolute (global) point coordinates. This way our model can learn that objects farther from the sensor are typically represented with fewer measurements than nearby objects.
Voxel Feature Encoding
While the grouping algorithm of Voxelnet processes only non-empty voxels, the input to the VFE layers consists of dense tensors with a large proportion of padded zeros (roughly 90 % with the default setting of at most 35 points per voxel). As it has a regularizing effect on the running mean and variance this zero-padding alleviates the use of Batch Normalization (BN) Ioffe and Szegedy (2015). We relinquish the use of zero padding and remove the BN in our VFE-stage. Instead, we apply per-sample normalization. Overall, our modified VFE layer requires less memory while also increasing the predictive performance of our model.
The activation tensor resulting from the VFE stage is still three dimensional. Via multiple 3D-convolutional layers, the vertical resolution of the activation tensor is reduced to one resulting in 2D feature maps. The result is a 2D representation in BEV with vertical information encoded locally.
Figure 3 depicts the architecture of our backbone network. It uses multiple blocks of 2D convolutional layers to generate intermediate feature maps of different resolutions. We then apply transposed convolutional layers to get feature maps with equal dimensions before the output layers, where feature maps are concatenated and combined with convolutions at each anchor position. The resulting feature maps are labeled and in Figure 3. For the regression head, the receptive field of has been chosen to cover the majority of objects while the receptive field of is slightly larger to cover outliers. Restricting the receptive field of the regression head reduces distractions from the environment when predicting exact bounding boxes. In contrast, we perform the detection based on higher-level feature maps that have a larger receptive field.
We use a variant of residual box parametrization with a direction classifier Yan et al. (2018) and sine and cosine encoding for orientation. From the vector representing the three box-center coordinates , height , width , length and yaw around the -axis of an oriented 3D bounding box, we calculate residual regression targets . In , , the subscripts and denote bottom and top respectively, between a ground truth vector and an anchor vector from the same vector space as described in Equation (1):
where denotes the component of the vector (the notation for the other components follows respectively).
Additionally, we use the box parametrization of AVOD Ku et al. (2018) as auxiliary regression targets. We transform ground truth boxes and anchor boxes into a 2D corner representation in BEV for the - and -coordinates of each corner resulting in eight variables , where the subscripts enumerate corners. Then we calculate the targets . While we do not use these additional regression parameters during inference, the added training task improves performance considerably. We use axis-aligned 2D Intersection over Union (IoU) in BEV as a similarity measure between the ground truth boxes rotated to the nearest axis and the corresponding anchor type. For detection, positive anchors have to surpass an IoU threshold of , while anchors with an IoU below are treated as negatives. For regression, the positive threshold is set to . Our default choice for training the detection head is the balanced sampling of anchors with a ratio of 3 to 1 of negative and positive anchors. For detection and direction we use the binary cross-entropy loss function, denoted and for the regression targets we use the smooth L1 loss (also known as Huber loss), denoted . Balancing is achieved via the hyperparameters , with the default values of 1, 1 and 2 respectively.
Let and denote the sampled positive and negative detection anchors and let denote the sigmoid activation of the classification output. Further, let denote the positive regression anchors. is the total number of positive regression anchors. The overall loss function is given as in Equation (2).
3.2 Patch Construction
The independence of the LRN enables us to construct patches from ground truth annotations.
Inspired by the beneficial effect of ground truth sampling in SECOND, we try to achieve protection against overfitting on context information in a similar way. First, we create lists of objects and related surrounding areas (surfaces) in the training set. We then remove points within the slightly enlarged bounding boxes of the objects present in the respective surface. We then rotate each surface to align its center with the depth axis in front of the sensor car. Finally, we sort both object and surface lists based on the absolute distance to the sensor. During training, we combine objects with surfaces that appear in a similar distance to the sensor. To ensure that the object lines up with the surface, we look up the vertical coordinate of the original object and place the augmented object at the corresponding height. We apply surface sampling based on the difficulty levels easy, moderate and hard with the respective probabilities of , and .
Global & Per-Object Augmentation
We then proceed with standard augmentation techniques. Scaling and mirroring of the patch and the object individually. Contrary to Zhou and Tuzel (2018); Yan et al. (2018) we do not apply per-object rotation and vertical translation. As per-object rotation introduces self-occlusion artifacts and vertical translation creates unreasonable patches where objects appear below ground or fly above it. Especially in the context of self-driving, the training data for car objects is heavily biased, with a strong peak for parallel orientation to the ego-vehicle. A major benefit of working on a per-object basis without a fixed input space is the ability to use an augmentation for a full range of rotations in the forward view around the global -axis by . Therefore, every object is learned to be recognized in every possible orientation inside the patch, respecting the global position in which the so rotated object would happen to appear. Consequently, both the perpendicular and the parallel anchor types are trained equally well.
Random Cropping - LRN Detection Objective
At this stage, we have an augmented object placed somewhere upon a surface. To achieve robustness against imperfect proposals by an actual RPN and to create a training task for the detection head, we sample noise from a circular area ( and meters). Finally, we crop the patch at the BEV location that is determined by the object center and the offset from the sampled noise. Therefore, the objective of the detection head of the LRN will be to revert this offset and determine the correct object center within the small anchor grid. A task that is closely related to regression and designed to achieve a correlation between detection score and the ability to accurately regress an object.
3.3 Region Proposal Network
Our RPN is a slim version of the LRN network, the main source of simplification is the circumvention of the vertical reduction stage. Instead of multiple 3D convolutional layers, we use a vertical voxel resolution of two and concatenate the two resulting feature planes. While this change reduces the 3D detection results considerably, the BEV detection results remain nearly unchanged.
During inference, we take the proposals of an RPN to extract patches. To increase the similarity with our training task, we remove points within the slightly enlarged bounding box of additional proposals falling within the patch. Similar to training we rotate the patch upon the depth axis (improvement of +0.15 AP).
We evaluate our method on the KITTI 3D object detection benchmark Geiger et al. (2012), which provides samples of LiDAR point clouds as well as RGB camera images. The data set consists of 7,481 training samples and 7,518 test samples. Detections are evaluated on three difficulty levels (easy, moderate, hard). Average precision (AP) is used as the metric, successful detections for the class car require an IoU of at least 70,%.
We subdivide the original training samples into a 3,712 samples train set and a 3,769 samples validation set as described in Chen et al. (2015), which we use for all our experiments and the submission to the test server. Furthermore, we use a non-maximum suppression threshold of 0.01. We train our model on a single 1080Ti GPU.
4.1 Implementation Details
Region Proposal Network
For our RPN we use the input space and BEV resolution of Voxelnet, a vertical resolution of 2, the described loss and a backbone configuration of ABC/AX. To compensate for the smaller receptive fields due to the circumvented 3D convolutions, we add 2 layers to the convolutional block 1. We first pre-train the RPN on patches with the local objective and only afterward train it for its final task as an RPN, simply by changing its input to whole point clouds and decreasing the learning rate by a factor of 0.1 and the batch size to 4. The RPN adapts to the new input within a few epochs only. Due to the increased imbalance between foreground and background objects, present when processing whole point clouds, we choose Focal Loss Lin et al. (2017) with default parameters and to train the detection head.
Local Refinement Network
For the width and depth dimensions, we chose voxel sizes of meters and patch sizes of meters. For the vertical dimension, we chose a voxel height of meters. In order to reduce the vertical dimension from 19 to 1, we use a sequence of 4 3D convolutional layers with a kernel-size of 3, vertical strides of (2,1,2,1), and no padding. To compensate for the smaller receptive fields due to the increased resolution we add 5 layers to the convolutional block 1 and 2 layers to convolutional block 2. The convolutional block related to the extended feature maps X consists of 6 3x3 convolutions without initial down-sampling. The voxel resolution is in the encoding stage. The regression and detection heads operate on a anchor grid. We train with a batch size of 32 samples for 5 million samples with Adam Optimizer, an initial learning rate of and after one million samples we multiply the learning rate by every samples.
4.2 Evaluation on the KITTI Test Set
Table 1 presents the results of our method on the KITTI test set using the Average Precision (AP) metric. Consuming only LiDAR data and 50 % of the training data, our submission outperformed all previous methods for 3D object detection on cars. Noticeably, on the easy difficulty, the effect of the local training objective of the detection head is most prominent. A comparison of the precision-recall curves provided by the KITTI benchmark showed that our model has a distinct advantage to better avoid high ranked false positive detections that do not pass the 70 % IoU threshold.
|Method||Modality||3D Object Detection||BEV Detection|
|MV3D Chen et al. (2017)||RGB + LiDAR||71.09||62.35||55.12||86.02||76.90||68.49|
|Voxelnet Zhou and Tuzel (2018)||LiDAR||77.47||65.11||57.73||89.35||79.26||77.39|
|F-PointNet* Qi et al. (2018)||RGB + LiDAR||81.20||70.39||62.19||88.70||84.00||75.33|
|AVOD-FPN* Ku et al. (2018)||RGB + LiDAR||81.94||71.88||66.38||88.53||83.79||77.90|
|SECOND Yan et al. (2018)||LiDAR||83.13||73.66||66.20||88.07||79.37||77.95|
|PointPillars* Lang et al. (2019)||LiDAR||79.05||74.99||68.30||88.35||86.10||79.83|
|PointRCNN Shi et al. (2019)||LiDAR||85.94||75.76||68.32||89.47||85.68||79.10|
|MMF* Liang et al. (2019)||RGB + LiDAR||86.81||76.75||68.41||89.49||87.47||79.10|
4.3 Ablation Studies on the KITTI Validation Set
|without auxiliary targets||-0.42||-0.57||-0.67|
|no surface sampling||-1.25||-1.26||-1.93|
|reduced global rotation||-1.05||-0.96||-0.94|
|backbone BC/AX red.||-0.06||-0.03||-0.15|
We trained our network with VFE-layers as proposed in Voxelnet, with zero-padding and BN. In this case, we observe a large performance drop overall difficulty levels when used in our model. We hypothesize that the reason is that typically the batch statistics of those features are highly variable due to a low number of points and a varying input space. Additionally, we applied BN without zero-padding. This performs poorly both on whole scenes, as well as on patches.
We analyze the effectiveness of our augmentation strategies: surface sampling and additional global rotation. Table 2 shows two strong performance drops when either of those is not employed. For surface sampling, this drop is expected, since we fully rely on the construction of artificial patches to avoid the model to overfit on context information. The drop caused by reduced global rotation could be related to an increased orientation bias. The data set comprises largely objects parallel to the sensor car and few objects in the perpendicular orientation. As we rotate all objects upon the depth axis and sample rotation only from , the number of perpendicular objects is further reduced. Overall, both augmentation strategies are vital components of the training procedure.
Auxiliary Regression Targets
In our experiments without auxiliary regression targets, we observe slower, more unstable learning. Additionally, with auxiliary regression targets, the model becomes more decisive in rejecting false positives and preserves the level of recall with fewer proposals. Table 2 shows a decrease in performance if auxiliary regression targets are omitted.
We validate our architecture design against variants (see Figure 3), in which we modified the connections of the regression head and detection head to the feature maps A, B, C, and X. In the standard variant, the detection head is connected to B and C, while the regression head is connected to A and X. We denote this backbone as BC/AX. As a first variant, the backbone ABC/AX uses an additional feature map for detection, which leads to earlier overfitting of the detection head. A comparison of the backbone variant C/AX and backbone variant B/AX using only one feature map for detection, suggests that the higher level map C is of greater importance. The backbone variant BC/A relinquishes the use of the additional regression map, which has a negative effect on the performance on easy and moderate difficulty levels. The backbone variant “BC/AX red.” uses fewer layers in the convolutional blocks 1 and 2, which achieves almost identical performance. Table 2 shows the decrease in performance for the different backbone variants.
We study the influence of additional objects present in a patch. We calculate thresholds based on the percentile of the detection scores over the validation set. We then remove additional objects only if their respective detection score exceeds a given threshold. We observe that removing those objects where the RPN is most confident is of greater importance and that the detection of objects of difficulty hard is affected the most from additional objects within a patch. We conclude that the effect is caused by distraction effects of additional objects with distinct features.
4.4 Additional Experiments
Refinement of Other Models
We study how our LRN performs when the region proposals are provided by other detection models. To this end, we construct validation patches based on the predictions generated by two other models. Our experiments show only marginal differences between our RPN and two state-of-the-art detection models, namely SECOND v1.5 and PointRCNN (see Table 3). The results underline that the regression capability of the RPN is of low importance. Additionally, we compare our RPN to proposals created from ground truth labels. The gap increases with difficulty and suggests further improvement can be achieved via an RPN that is more capable to distinguish objects described by a low amount of LiDAR points, e.g. a fusion-based RPN.
|RPN||RPN - AP Score 3D||Refined - AP Score 3D|
We further investigate the option of a two-phase training procedure for our lightweight RPN. As the patches occur at their original distance to the sensor, we train a detector that can operate on whole scenes. First, we pre-train by concentrating on the domain of patches only, then we switch to the domain of whole scenes. The pre-trained RPN surpasses the moderate 3D-scores of Voxelnet () after only one additional epoch.
We have proposed Patch Refinement, a two-stage model for 3D object detection from sparse LiDAR point clouds. We demonstrate that a modified Voxelnet is capable of highly accurate local bounding box regression and a simplified Voxelnet is an adequate choice for an RPN to complement such an LRN. As the LRN operates on local point cloud subsets only, it can refine the proposals of an arbitrary RPN on demand. Further improvements regarding accuracy may be attainable by using a ground plane estimation algorithm and the integration of image information in the RPN stage.
-  (2015) 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems 28, pp. 424–432. Cited by: §4.
-  (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915. External Links: Cited by: §1, §2, Table 1.
-  (2017) Vote3deep: fast object detection in 3d point clouds using efficient convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 1355–1361. Cited by: §1.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. External Links: Cited by: §1, §4.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. External Links: Cited by: §1, §3.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. External Links: Cited by: §1, §3.1.
-  (2018) Joint 3d proposal generation and object detection from view aggregation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. External Links: Cited by: 3rd item, §1, §2, §3.1, Table 1.
-  (2019) PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: §1, §2, §2, Table 1.
-  (2019) Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7644–7652. Cited by: §1.
-  (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: Table 1.
-  (2017-10) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2999–3007. External Links: Cited by: §4.1.
-  (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §1, §2, Table 1.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1.
-  (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–779. Cited by: §1, §2, §2, Table 1.
-  (2019) Roarnet: a robust 3d object detection based on region approximation refinement. In 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 2510–2515. Cited by: §1.
-  (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8445–8453. Cited by: §1.
-  (2016-01) Enhancing semantic segmentation for robotics: the power of 3-d entangled forests. IEEE Robotics and Automation Letters 1 (1), pp. 49–56. External Links: Cited by: §1.
-  (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §1, §2, §3.1, §3.2, Table 1.
-  (2018) Pixor: real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §1.
-  (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499. External Links: Cited by: 2nd item, §1, §2, §3.2, Table 1.