PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module
Deng Cai is the corresponding author.
LIDAR point clouds and RGB images are both essential for 3D object detection, so many state-of-the-art 3D detection algorithms are dedicated to fusing these two types of data effectively. However, fusion methods based on the Bird’s Eye View (BEV) or voxel format are not accurate. In this paper, we propose a novel fusion approach named the Point-based Attentive Cont-conv Fusion (PACF) module, which fuses multi-sensor features directly on 3D points. In addition to continuous convolution, we add a Point-Pooling and an Attentive Aggregation operation to make the fused features more expressive. Moreover, based on the PACF module, we propose a 3D multi-sensor multi-task network called Pointcloud-Image RCNN (PI-RCNN for short), which handles the image segmentation and 3D object detection tasks. PI-RCNN employs a segmentation sub-network to extract full-resolution semantic feature maps from images and then fuses the multi-sensor features via the PACF module. Benefiting from the effectiveness of the PACF module and the expressive semantic features from the segmentation module, PI-RCNN improves 3D object detection considerably. We demonstrate the effectiveness of the PACF module and PI-RCNN on the KITTI 3D Detection benchmark, where our method achieves state-of-the-art results on the 3D AP metric.
With the rapid development of autonomous driving, 3D detection attracts more and more attention. LIDAR is the most common 3D sensor in autonomous driving, and several existing works detect 3D objects from LIDAR points[zhou2018voxelnet, yan2018second, yang2018pixor, lang2019pointpillars, shi2019pointrcnn, Chen2019fastpointrcnn, DBLP:journals/corr/abs-1901-08373]. However, although LIDAR points capture the 3D structure of objects, they carry little semantic information and are sparse. The lack of semantics leads to ambiguous scenes that are hard for a model to handle, and the sparsity of LIDAR points, especially for distant objects, makes recognition difficult. These challenges are illustrated in Figure 1.
Meanwhile, some works[mousavian20173d, li2019gs3d, ku2019monocular] try to estimate the 3D location and dimensions of objects from monocular images. Compared with point clouds, RGB images have a more regular and dense data format and much richer semantic information for distinguishing vehicles from background. However, the 2D nature of images means that 3D detection algorithms based on monocular images suffer from low precision.
To address these challenges, many state-of-the-art methods [chen2017multi, ku2018joint, liang2018deep, qi2018frustum, liang2019multi] combine the data of multiple sensors to remedy the semantic loss of point clouds. [chen2017multi, ku2018joint] directly merge the features from images and BEV(birds-eye-view) maps. [qi2018frustum] employ a cascade structure to predict 3D objects via a frustum from the 2D detection bounding box. [liang2018deep] apply continuous convolution[Wang_2018_CVPR] to fuse multi-sensor features.
However, direct fusion as in [chen2017multi, ku2018joint] ignores the very different perspectives of RGB images and bird’s-eye-view maps. Frustum-based 3D detection[qi2018frustum] inherits the weaknesses of 2D detection and, because of occlusion, includes many points from the background or from other instances. Although [liang2018deep] apply continuous convolution to overcome the challenge of different perspectives, their fusion on the BEV map is not accurate. The BEV format quantizes the 3D world into a pseudo-image, so the neighbor search and fusion on the BEV map suffer from a loss of precision.
To overcome these shortcomings, we propose a novel fusion module named the Point-based Attentive Continuous-Convolution Fusion module (PACF module for short). Different from [liang2018deep, liang2019multi], we directly apply continuous convolution on raw points. Meanwhile, inspired by multi-task works[gao2019nddr, liang2019multi], we combine the image segmentation task with 3D detection to take full advantage of the semantic information from images. Specifically, we fuse the semantic features output by a segmentation model with the features of LIDAR points via our proposed PACF module. Moreover, based on the PACF module, we propose a robust multi-sensor 3D object detection network named Point-Image RCNN (PI-RCNN for short).
Our proposed PI-RCNN is inspired by two observations: (1) The most significant information we can obtain from a 2D image is the segmentation mask, and once we have the segmentation mask, we naturally get the 2D locations and bounding boxes of objects in the image; (2) Objects do not intersect in 3D space, so we can obtain a segmentation of the LIDAR points from the 3D object labels alone.
PI-RCNN is composed of two sub-networks: an image segmentation sub-network and a point-based 3D detection sub-network. The segmentation sub-network is a lightweight fully convolutional network, which outputs a prediction mask of the same size as the original input image. The detection sub-network is a 3D detector which takes raw LIDAR points as input. The PACF module bridges the two sub-networks and combines the features from the RGB image and the LIDAR points to benefit 3D object detection. With the fused features, PI-RCNN effectively improves the performance of 3D object detection. Experiments on the KITTI[geiger2013vision] dataset demonstrate the effectiveness of our approach: PI-RCNN achieves state-of-the-art results on the 3D AP metric.
We summarize our contributions into three aspects:
We propose a novel fusion method, named PACF module, to fuse the multi-sensor features. PACF module conducts point-wise continuous convolution directly on 3D points and applies a Point-Pooling and an Attentive Aggregation operation to obtain better fusion performance.
Based on the PACF module, we design an efficient multi-sensor 3D object detection algorithm named Point-Image RCNN (PI-RCNN for short). In addition, PI-RCNN combines multiple tasks (image segmentation and 3D object detection) to improve the performance of 3D detection.
We conduct extensive experiments on KITTI dataset and demonstrate the effectiveness of our approach.
3D Object Detection from Single Sensor
3D Object Detection from RGB-images. [mousavian20173d, li2019gs3d] employ geometric constraints from 2D bounding box predictions to estimate the pose of 3D objects and obtain their location through camera calibration. [Chen_2016_CVPR] exploit instance and semantic segmentation along with geometric priors to infer 3D objects from monocular images. [wangcvpr2019] generate a set of pseudo points via depth estimation on the RGB image and reason about 3D objects on the generated 3D points. However, because a monocular image carries no direct depth information, depth estimation from it is inaccurate, so 3D detection based on RGB images suffers from low precision.
3D Object Detection from Point Clouds. Because traditional CNNs cannot be applied directly to LIDAR points, many algorithms try various ways to address this issue. In the most common paradigm, point clouds are first converted to a fixed-size pseudo-image which can be processed by a standard CNN, for example, BEV[ku2018joint, liang2018deep, lang2019pointpillars] or voxels[zhou2018voxelnet, yan2018second, yang2018pixor, wang2019voxel].
There are also algorithms leveraging raw 3D points to detect 3D objects. [qi2017pointnet, qi2017pointnet++] exploit raw points to classify point clouds or predict point segmentation. [shi2019pointrcnn] employ PointNet++[qi2017pointnet++] to generate 3D proposals from raw point clouds and a point-based RCNN to conduct refinement in a local range.
3D Object Detection from Multi Sensors
[chen2017multi] take the RGB image, front view, and bird’s-eye view as input and exploit a 3D RPN to generate 3D proposals. [ku2018joint] develop the idea of [chen2017multi], propose a feature pyramid backbone to extract features from the BEV map, and merge features from the BEV map and the RGB image by a crop-and-resize operation. [qi2018frustum] use a 3D frustum projected from the 2D bounding box to estimate 3D objects. [liang2018deep] apply continuous convolution[Wang_2018_CVPR] to fuse BEV features with the neighbor points’ features retrieved from the image.
However, direct fusion methods like [ku2018joint, chen2017multi] are too coarse: the rectangular RoIs (Regions of Interest) on images include a lot of background noise, and these methods ignore the difference in perspective between the bird’s-eye-view map and the image. [liang2018deep] employ continuous convolution to avoid the perspective issue, but their BEV-based fusion method suffers from a loss of precision, and there is still much room to better utilize the semantic information of images. Although [liang2019multi] claim that they apply “point-wise” continuous convolution, their method still conducts fusion on the BEV map and does not achieve real “point-wise” fusion directly on LIDAR points.
In this section, we present our proposed fusion module, the Point-based Attentive Continuous-Convolution Fusion module (PACF module for short). Different from [liang2018deep, liang2019multi], the PACF module conducts real “point-wise” continuous convolution directly on 3D LIDAR points and adds a Point-Pooling operation and an Attentive Aggregation to make the fusion more robust and the fused features more expressive. Moreover, based on the PACF module, we propose Point-Image RCNN (PI-RCNN for short), a multi-sensor 3D detection network which combines multiple tasks. PI-RCNN combines image segmentation and 3D object detection and exploits the semantic features from image segmentation to supplement the LIDAR points. The overall architecture of PI-RCNN is illustrated in Figure 2. PI-RCNN is composed of two sub-networks. One is the segmentation sub-network, which takes RGB images as input and outputs semantic features. The other is a point-based 3D detection network, which generates and refines 3D proposals from raw LIDAR points. The PACF module is the bridge between the two sub-networks: it conducts the fusion operation directly on 3D points instead of on a BEV or voxel format pseudo-image and merges the semantic features from the RGB image with the features from LIDAR points. Benefiting from the effectiveness of the PACF module, PI-RCNN can detect 3D objects more precisely.
Point-based Attentive ContFuse Module
Fusion for multi-sensor data. The different data formats and perspectives are the main challenges of fusing features from 2D images and 3D points. RGB images only represent the 2D projection of the real 3D world onto the camera image plane, while LIDAR points capture the 3D structure of the scene. [chen2017multi, ku2018joint] convert the LIDAR points to BEV (bird’s-eye-view) pseudo-images and directly fuse the features from BEV maps and RGB images. However, the proposals on the BEV map and on RGB images have different perspectives, so the direct fusion is too coarse to produce accurate and beneficial features. ContFuse[liang2018deep] projects the image features into the BEV map and fuses the features of the neighbor points with continuous convolution[Wang_2018_CVPR]. However, the BEV format is only a quantization of the 3D point cloud and suffers from precision loss, so the neighbor search and fusion on BEV are not accurate, especially along the Z-axis of the LIDAR coordinate system. Although MMF[liang2019multi] builds a dense correspondence between image and BEV, it still does not apply real “point-wise” continuous convolution directly on 3D points.
PACF module. To address these issues, we propose a novel fusion module, the PACF module, which achieves more accurate and robust fusion. The details of the PACF module are illustrated in Figure 3. Given a feature map extracted from an RGB image and the raw LIDAR points, the PACF module outputs a set of discrete 3D points whose features contain the semantic information from the RGB image. In detail, the PACF module consists of five steps. (1) We search the nearest neighbor points in a distance range ( as default) for each 3D point. (2) We project the neighbor points onto the feature maps extracted from the 2D image plane via camera calibration. (3) We retrieve the corresponding semantic features from images and combine the image features with the geometric offset of the 3D points. (4) We apply attentive continuous convolution to fuse the semantic+geometric features of the k-nearest neighbor points. (5) We conduct a Point-Pooling operation on the outputs of step (3) and concatenate the result with the outputs of step (4) as the final features of the target points.
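Steps (1) to (3) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the function names are hypothetical, the neighbor search is a brute-force k-nearest search without the distance range, the points are assumed to be already expressed in the camera frame so that a single 3x4 projection matrix `P` maps them to pixels, and image features are read at the nearest pixel.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest points (ego point included) for every point."""
    # Brute-force pairwise squared distances, shape (N, N); fine for a small N.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    return np.argsort(dist2, axis=1)[:, :k]          # (N, k)

def project_to_image(points, P):
    """Project (N, 3) points to pixel coordinates with a 3x4 projection matrix."""
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    uvw = homo @ P.T                                  # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]                   # (N, 2) as (u, v)

def gather_neighbor_features(points, feat_map, P, k):
    """Return (N, k, C + 3): image features of each neighbor plus its 3D offset."""
    H, W, _ = feat_map.shape
    idx = knn_indices(points, k)                      # step (1)
    uv = np.rint(project_to_image(points, P)).astype(int)   # step (2)
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    point_feats = feat_map[v, u]                      # (N, C), nearest-pixel lookup
    neigh_feats = point_feats[idx]                    # step (3): (N, k, C)
    offsets = points[idx] - points[:, None, :]        # neighbor minus target, (N, k, 3)
    return np.concatenate([neigh_feats, offsets], axis=-1)
```

The output tensor is the per-neighbor semantic+geometric input that steps (4) and (5) operate on.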
The attentive continuous convolution is improved based on ContFuse[liang2018deep]. We denote as the coordinate of point , as the semantic features retrieved from the output of segmentation sub-network. Note, we concatenate the final segmentation mask and the feature maps of the second-last layer as the semantic features, so is a -d vector, where is the channel number of the feature maps. The continuous convolution is defined as:
where and N is the number of LIDAR points, and K is the number of neighbor points (including ego point), is the coordinate of target point , is the coordinate of neighbor points , so represents the geometric offset from the target point to the neighbor point , is a -d row vector, and is the output of continuous convolution. in Equation 1 approximates continuous convolution, which converts the input into the output, where are channel numbers of the input and output features respectively.
Inspired by the Pooling operation in CNN and attentive mechanism, we add a Point-Pooling operation and an Attentive Aggregation to strengthen the continuous convolution. In detail, we conduct a Pooling operation on the features of neighbor points. The Point-Pooling can be represented as:
where is the features of all neighbors, represents the pooled features for each target point . The POOL is conducted along the point-axis. In practice, we exploit Max-Pooling to obtain the most expressive features from neighbor points. Besides, we conduct an Attentive Aggregation to merge the features of neighbor points. In practice, we employ another MLP to aggregate neighbors, that is to say, for each target point :
where represents the features of neighbor points outputted by the , the aggregates the neighbor features into -d features of target point through a set of learnable parameters. The final output of the PACF module is the concatenation of above three parts:
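The three branches described above (continuous convolution, Point-Pooling, and Attentive Aggregation) can be sketched in NumPy as follows. This is a rough illustration under assumptions that the text does not confirm: the continuous-convolution branch sums the per-neighbor MLP outputs, the Attentive Aggregation is an MLP over the flattened per-neighbor outputs, and all names (`mlp`, `pacf_fuse`, `cont_w`, `aggr_w`) are hypothetical.

```python
import numpy as np

def mlp(x, weights):
    """Apply a small ReLU MLP to the last axis of x; weights is a list of (W, b)."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:      # no activation after the last layer
            x = np.maximum(x, 0.0)
    return x

def pacf_fuse(neigh, cont_w, aggr_w):
    """neigh: (N, K, D) semantic+geometric features of the K neighbors of each point."""
    N, K, _ = neigh.shape
    per_neighbor = mlp(neigh, cont_w)            # (N, K, C), MLP shared across neighbors
    y_cont = per_neighbor.sum(axis=1)            # continuous-convolution branch (assumed sum)
    y_pool = neigh.max(axis=1)                   # Point-Pooling: max along the point axis
    flat = per_neighbor.reshape(N, K * per_neighbor.shape[-1])
    y_attn = mlp(flat, aggr_w)                   # Attentive Aggregation over all neighbors
    return np.concatenate([y_cont, y_pool, y_attn], axis=1)
```

Because the aggregation MLP sees all K neighbor features at once, its learned weights can emphasize or suppress individual neighbors, which is the sense in which it acts like attention.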
Improvements compared with previous methods. Our proposed PACF module differs from [liang2018deep, liang2019multi] in five ways. First, they both fuse features on the pixels of BEV. However, the BEV format quantizes the real 3D space into a 2D pseudo-image, so the neighbor search and feature fusion applied on BEV pixels are not accurate. In contrast, we conduct the neighbor search, continuous convolution, and final fusion directly on raw 3D points instead of BEV, which avoids the quantization loss. Second, in addition to the MLP for continuous convolution, we add another learnable MLP to fuse the features from neighbor points, which can be considered an attention mechanism over the features of neighbors. Third, to avoid interpolation loss, we retrieve the image features from a feature map with a larger resolution, whose size is consistent with the original size of the image. Fourth, we combine the image segmentation task with 3D object detection. Instead of using image features learned from the 3D detection task, we first pre-train the image sub-network on a segmentation dataset. In the Experiments Section, we compare the features pre-trained on the segmentation task with the features learned from 3D detection. We argue that the features learned under the supervision of semantic segmentation are more expressive, and that the combination of multiple tasks (image segmentation and 3D detection) is robust. Finally, inspired by the pooling operation in CNNs and the attention mechanism, we conduct point-wise pooling over the features of neighbor points and add a learnable Attentive Aggregation operation to merge the features of neighbors more effectively.
We argue that these improvements make a significant difference. In the Experiments Section, we will conduct ablation experiments to analyze the effects of these differences.
Main Architecture of PI-RCNN
PI-RCNN is a multi-task 3D detection network and is composed of two sub-networks: image segmentation sub-network and 3D Detection sub-network.
Semantic Segmentation Sub-Network. To obtain robust semantic features from RGB images, we first analyze which image features are most beneficial for 3D object detection. For the 2D object detection task, the feature extractor is usually pre-trained on a classification dataset such as ImageNet[deng2009imagenet], which is sufficient for detecting 2D bounding boxes, because the target of 2D object detection is only to predict a rectangular bounding box and does not demand fine-grained features inside 2D proposals. As long as the RoI features capture part of the object, the detector’s head can classify and regress the proposals correctly. However, such features are insufficient for the dense correspondence between image pixels and LIDAR points.
We argue that image features learned from 3D detection labels are too coarse for the correspondence between image pixels and 3D points. We observe that once we get the segmentation mask from the RGB image, we can project the 3D points onto the 2D image plane to retrieve the corresponding segmentation of the 3D points. Because the segmentation mask is a pixel-level prediction, which does not include background pixels as a 2D bounding box does, it can give each point more accurate semantic information and help the detection sub-network predict 3D objects more precisely. The features output by the pre-trained segmentation sub-network and by the sub-network without pre-training are compared in Figure 4. Therefore, we combine the image segmentation task with 3D detection and use the outputs of a segmentation network as the semantic features of RGB images. Besides, segmentation feature maps have a larger resolution than the outputs of a classification backbone, which makes the projection and fusion between LIDAR points and image pixels more accurate. In the Experiments Section, we conduct an ablation study to verify the effectiveness of pre-training on a segmentation dataset. Note that we do not need to pre-train an instance-level segmentation sub-network, because the segmentation supervision only helps us obtain semantic features for fusion, and we detect objects based on LIDAR points. We use UNet[ronneberger2015u], a lightweight fully convolutional network, as the segmentation sub-network of PI-RCNN. In practice, it can be replaced with other lightweight segmentation networks.
3D Detection Sub-Network. We argue that point-wise fusion is more robust than fusion based on the BEV map. To conduct the point-wise fusion operation, we need a 3D detection network based on raw 3D points. Therefore, we employ PointRCNN[shi2019pointrcnn], a two-stage 3D detection network whose inputs are raw LIDAR points, as the detection sub-network of PI-RCNN. PointRCNN employs PointNet++[qi2017pointnet++] as its first stage to generate 3D proposals from raw LIDAR points. Its second stage transforms the points in each proposal to canonical coordinates to refine the 3D bounding box.
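The canonical transformation used by the second stage can be illustrated with a short NumPy sketch. It is a simplified illustration, not PointRCNN's code: the function name `to_canonical` is hypothetical, and the vertical axis is assumed to be z, so the rotation is about z (an implementation working in camera coordinates would rotate about a different axis).

```python
import numpy as np

def to_canonical(points, center, heading):
    """Express points in a proposal's local frame.

    points: (N, 3) array, center: (3,) box center, heading: yaw angle in radians.
    The box center becomes the origin and the heading direction becomes +x.
    """
    c, s = np.cos(-heading), np.sin(-heading)
    # Rotation by -heading about the (assumed vertical) z axis.
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (points - center) @ R.T
```

After this transform, every proposal is refined in the same local frame regardless of where it sits in the scene or how it is oriented.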
We provide two fusion strategies. The main difference between the two strategies is the location of the fusion module. The comparison is illustrated in Figure 5. We denote these two versions of fusion strategy as PI-RCNN V1 and PI-RCNN V2 respectively. In the Experiments Section, we will analyze the performance of two fusion strategies.
PI-RCNN V1: We fuse the features from multiple sensors in the “middle-way”, as illustrated in Figure 5. In this strategy, the semantic features from the image act as a supplement to the 3D point features output by the first stage of the detection sub-network.
PI-RCNN V2: We can also conduct the fusion operation at the beginning of the detection network. After obtaining the output of the segmentation sub-network, we concatenate the image features with the raw LIDAR points as the input of the detection sub-network. With this fusion strategy, we can replace the detection sub-network with other 3D detectors that take inputs of arbitrary format. For example, when using a 3D detection algorithm based on the BEV map or voxel format, the semantic features can act as extra features of the LIDAR points.
For the 3D detection sub-network, we follow the loss function introduced by [shi2019pointrcnn]. The loss of the detection sub-network is defined as:
where are defined the same as original paper.
For the training of the image segmentation sub-network, we need a semantic segmentation label as supervision. As mentioned in [shi2019pointrcnn], 3D objects do not overlap with each other, so we can get the segmentation of points from the 3D detection labels. Hence, we can obtain a sparse segmentation mask by projecting the point segmentation onto the 2D image plane, and we compute the loss only on the pixels with supervision. To address the imbalance between foreground and background, we employ the Focal Loss[lin2017focal]:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where $p_t = p$ for a foreground point and $p_t = 1 - p$ otherwise, and $p$ is the score output by the network. We keep the default settings of the original paper.
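The sparse supervision and the masked Focal Loss can be sketched in NumPy as follows. This is an illustrative sketch with hypothetical function names. It assumes the points have already been projected to pixel coordinates `uv`, and it uses the default values from the focal loss paper (alpha = 0.25, gamma = 2).

```python
import numpy as np

def sparse_mask_from_points(uv, point_is_fg, height, width):
    """Rasterize per-point foreground labels into a sparse image-sized label map.

    Returns (label, valid): label is 1.0 on foreground pixels, valid marks the
    pixels that received at least one projected point.
    """
    label = np.zeros((height, width), dtype=np.float64)
    valid = np.zeros((height, width), dtype=bool)
    u = np.rint(uv[:, 0]).astype(int)
    v = np.rint(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, fg = u[inside], v[inside], point_is_fg[inside]
    valid[v, u] = True
    # A pixel is foreground if any point that lands on it is foreground.
    np.maximum.at(label, (v, u), fg.astype(np.float64))
    return label, valid

def masked_focal_loss(scores, label, valid, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean focal loss over supervised pixels only; scores are probabilities."""
    p = np.clip(scores[valid], eps, 1.0 - eps)
    y = label[valid]
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))
```

Pixels that no LIDAR point projects onto are excluded from the loss rather than treated as background, which matches computing the loss only on pixels with supervision.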
Therefore, the total loss is the sum of the detection loss and the weighted segmentation loss:
where is the weight of segmentation loss. For the sake of simplicity, we use as the default setting.
Although our proposed PI-RCNN can be trained end-to-end without pre-training on a segmentation dataset, we observe that initialization is essential for the performance of 3D detection. So in practice, we pre-train the segmentation sub-network on a semantic segmentation dataset. During this pre-training, we treat only the category of interest as foreground and view all other categories as background.
Implementation and Training Details
Network Architecture. For the segmentation sub-network, considering the need for real-time detection, we follow the network structure of UNet[ronneberger2015u], a lightweight fully convolutional network. The segmentation sub-network can be replaced with other segmentation networks. Because our primary goal is to use semantic features to improve the performance of 3D object detection, we do not pay much attention to the architecture of the segmentation sub-network and employ the same settings for it in all experiments.
For the 3D detection sub-network, we use a point-based 3D detection algorithm, PointRCNN[shi2019pointrcnn]. PointRCNN is a two-stage 3D detector that predicts 3D objects directly from raw LIDAR points. For a fair comparison, we use settings consistent with the original paper in all experiments. Note that with the “PI-RCNN V2” fusion strategy, we can in theory replace the detection sub-network with almost any other LIDAR-based 3D detection algorithm, whatever input format it takes. For the sake of simplicity, in all the following experiments, we employ the “PI-RCNN V1” fusion strategy by default.
Input Representation. For the detection sub-network, we take raw 3D points as the input, instead of BEV or voxel format. We follow the settings in [shi2019pointrcnn] for the 3D points input. We set the region of concern of LIDAR points as in LIDAR coordinate and subsample 16,384 points in the viewable region of camera as inputs. For the RGB-image, we resize the RGB-image to due to the demand of upsampling operation in the segmentation sub-network. When testing, we find that sampling the input points like training is better than inputting all the points. So we test all our models with the same subsampling strategy. Although this will bring some randomness to the evaluation results, we find that the results are stable( for 3D AP(M)) for one model.
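A fixed-size point subsampling step can be sketched in NumPy as follows. It is a minimal sketch with a hypothetical function name, and the behaviour for scenes with fewer points than requested (padding by resampling existing points) is an assumption, since the text does not specify it.

```python
import numpy as np

def subsample_points(points, num_points=16384, rng=None):
    """Return exactly num_points rows drawn from points (shape (N, D))."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    if n >= num_points:
        # Enough points: sample without replacement.
        choice = rng.choice(n, num_points, replace=False)
    else:
        # Too few points: keep all of them and pad with random repeats.
        extra = rng.choice(n, num_points - n, replace=True)
        choice = np.concatenate([np.arange(n), extra])
    return points[choice]
```

Passing a seeded `rng` at test time makes the sampled subset, and therefore the evaluation result, reproducible for a given model.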
Data Augmentation. To guarantee the correct correspondence between LIDAR points and image pixels, we do not use data augmentation when training. This is different from most 3D detection algorithms based only on LIDAR.
When pretraining the segmentation sub-network, we apply data augmentation to obtain better performance. In detail, we randomly flip the image horizontally, and randomly center-crop the image with a ratio 0.8. Besides the spatial augmentation, we enhance the brightness, contrast, saturation of images with a random factor in . We apply all above augmentations with a probability 0.5.
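The pre-training augmentation can be sketched in NumPy as follows. This is an illustrative sketch: the function name is hypothetical, only brightness scaling is shown for the photometric part (contrast and saturation are omitted), the crop keeps the central 0.8 fraction of each side, and the factor range is left as a caller-supplied argument rather than a fixed value.

```python
import numpy as np

def augment(image, label, rng, factor_range, p=0.5, crop_ratio=0.8):
    """image: (H, W, 3) floats in [0, 1]; label: (H, W) segmentation label."""
    if rng.random() < p:
        # Horizontal flip, applied to image and label together.
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < p:
        # Center crop keeping crop_ratio of the height and width.
        H, W = label.shape
        h, w = int(H * crop_ratio), int(W * crop_ratio)
        top, left = (H - h) // 2, (W - w) // 2
        image = image[top:top + h, left:left + w]
        label = label[top:top + h, left:left + w]
    if rng.random() < p:
        # Brightness scaling by a random factor; the label is unchanged.
        lo, hi = factor_range
        image = np.clip(image * rng.uniform(lo, hi), 0.0, 1.0)
    return image, label
```

The spatial transforms are applied identically to image and label so that pixel-level supervision stays aligned.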
Results on KITTI Dataset
| PACF | PointPool | Att Aggr | 3D AP (Car) |

| Image Features | 3D AP (Car) |
We evaluate PI-RCNN on KITTI [geiger2013vision] dataset. KITTI 3D detection dataset contains 7481 training samples and 7518 testing samples. The training samples are provided with labels, while the results in testing set must be submitted to the official test server to evaluate. We follow the common train/val split mentioned in [chen2017multi] to divide 7481 training samples into train split with 3712 samples and val split with 3769 samples. We evaluate our approach on Car class and compare PI-RCNN with state-of-the-art 3D detectors on both val split and testing split of KITTI dataset. For all the following experiments, the models are trained on train split and evaluated on val or test split.
We compare PI-RCNN with other state-of-the-art methods on both testing and val split. The evaluation results on testing and val set are shown in Table 1 and 2 respectively. We follow the implementation released by PointRCNN[shi2019pointrcnn]. Note, we do not use any data augmentation when training on 3D detection task due to the need of multi-sensor fusion. So when comparing with PointRCNN, we only compare with the results of our re-implementation without data augmentation on the val split. On the testing split, PI-RCNN surpasses the previous state-of-the-art methods on the metric of 3D AP. On the val split, PI-RCNN outperforms the state-of-the-art multi-sensor 3D detectors. Meanwhile, our PI-RCNN outperforms the baseline PointRCNN in Moderate, in Hard, and in Easy in the absence of data augmentation. The results demonstrate the effectiveness of our proposed PI-RCNN.
We conduct ablation studies to analyze the effects of the PACF module and PI-RCNN. All models are trained on the train split and evaluated on the val split of KITTI dataset. All evaluations on the val split are performed via 40 recall positions instead of the 11 recall positions.
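The difference between the 11-position and 40-position AP metrics can be illustrated with a short NumPy sketch. It assumes the usual interpolated-AP definition, in which the precision at each sampled recall position is the maximum precision achieved at any recall at or above that position; the function and constant names are hypothetical.

```python
import numpy as np

R11 = np.arange(11) / 10.0        # recall positions 0, 0.1, ..., 1.0
R40 = np.arange(1, 41) / 40.0     # recall positions 1/40, 2/40, ..., 1.0

def interpolated_ap(recall, precision, positions):
    """Average the interpolated precision over the given recall positions."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    total = 0.0
    for r in positions:
        # Precisions at recall >= r (small tolerance for float rounding).
        reachable = precision[recall >= r - 1e-9]
        total += reachable.max() if reachable.size else 0.0
    return total / len(positions)
```

The 40-position variant omits the recall position 0 and samples the curve more densely, so the two metrics give different values for the same precision-recall curve and should not be compared directly.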
PACF module. We conduct some ablation experiments about PACF module. We first analyze the effect of hyper-parameter , and the results are shown in Table 4. As mentioned in [liang2018deep], the continuous convolution might learn to ignore the noise of distant points, so for the sake of simplicity, we use for all experiments. The best result comes from the setting. Meanwhile, we observe that and are even worse than . The reason might be that large involves distant points and brings noises for the features of the target point. Then we study the effects of the Point-Pooling and Attentive Aggregation. The results are shown in Table 3. Table 3 shows that the additional Point-Pooling and Attentive Aggregation operations are beneficial for the feature fusion.
PI-RCNN V1 vs. V2. As mentioned above, there are two fusion strategies to choose from. We analyze these two versions of PI-RCNN, and the comparison results are shown in Table 5. V1 slightly outperforms V2, which suggests that fusion in the “middle” of the detection sub-network is better than fusion at the beginning. One possible reason is that stage 1 of the detection sub-network learns to generate 3D proposals mainly from the 3D information of LIDAR points, so the supplementary features appended at the beginning do not contribute as much as fusion in the “middle”.
Semantic Features. The choice of which layer of the segmentation sub-network supplies the fused image features is important for PI-RCNN performance. Table 6 shows the effects of different image features, where seg-mask represents the final segmentation mask output and last represents the feature maps output by the layer before the final output. Table 6 shows that the seg-mask+last setting gets the best results. Meanwhile, although the final segmentation mask contains only 1 channel, the results of the seg-mask setting are comparable with those of the seg-mask+last setting. This small difference demonstrates the necessity of the multi-task combination.
Segmentation Pretraining. As mentioned above, our multi-task model PI-RCNN can be trained end-to-end under the supervision of 3D object annotations alone. However, the results in Table 7 suggest that pre-training the segmentation sub-network improves performance. The reason might be that the segmentation supervision derived from 3D object annotations is too sparse. Meanwhile, Figure 4 shows that without pre-training, the output segmentation mask is much coarser than the output of the pre-trained model.
In this paper, we propose a Point-based Attentive Cont-Conv Fusion (PACF) module and a multi-sensor multi-task 3D object detection network named PI-RCNN. PI-RCNN combines image segmentation and 3D detection. Our proposed framework is simple but effective, and PI-RCNN achieves state-of-the-art results on the KITTI 3D Detection benchmark.
This work was supported in part by the National Key Research and Development Program of China (Grant No. 2018AAA0101400) and in part by the National Natural Science Foundation of China (Grant No. 61936006).