FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving
Moving Object Detection (MOD) is an important task for achieving robust autonomous driving. An autonomous vehicle has to estimate collision risk with other interacting objects in the environment and calculate an optimal trajectory. Collision risk is typically higher for moving objects than static ones due to the need to estimate the future states and poses of the objects for decision making. This is particularly important for near-range objects around the vehicle, which are typically detected by a fisheye surround-view system that captures a 360° view of the scene. In this work, we propose a CNN architecture for moving object detection using fisheye images captured in an autonomous driving environment. As motion geometry is highly non-linear and unique for fisheye cameras, we will make an improved version of the current dataset public to encourage further research. To target embedded deployment, we design a lightweight encoder sharing weights across sequential images. The proposed network runs at 15 fps on a 1 teraflops automotive embedded system at an accuracy of 40% IoU and 69.5% mIoU.
1 Introduction
Vision-based driver assistance systems [horgan2015vision] have become very common in commercial vehicles and are gradually moving towards higher levels of autonomous driving. Deep learning, semantic segmentation in particular [siam2017deep], has played a significant role in enabling this progress.
Deep learning algorithms are computationally intensive and it is necessary to design efficient algorithms [siam2018rtseg, briot2018analysis] for deployment. Deep learning algorithms also have the advantage of being used for various tasks by sharing the encoded features [sistu2019neurall, chennupati2019multinet++].
Autonomous driving scenes are highly dynamic: many moving objects interact with each other, forming a very complex environment to deal with. Knowing the motion information helps generic foreground detection [jain2017fusionseg] and improves semantic segmentation [rashed2019motion]. Only a few classes are movable, and this can be leveraged to improve the classification accuracy. For example, object classes like buildings or poles are static and will not have dominant motion vectors after ego-motion compensation.
There are two types of motion in an autonomous driving scene. The first is the motion of the surrounding obstacles and the second is the motion of the ego-vehicle. Ego-motion makes it difficult to detect moving objects because even static objects are perceived as moving. Motion segmentation implies two tasks that are performed jointly. The first is object detection, in which we highlight only objects of specific classes of interest, namely pedestrians and vehicles, and discard any motion perceived from the background due to ego-motion. The second is motion classification, in which a binary classifier predicts whether the object is moving or static.
In this work, we collect 5k samples of fisheye images captured using real cameras mounted on a moving vehicle in a 360° surround-view setup (illustrated in Figure 1). We generate the annotation using a semi-automatic approach in the form of binary masks highlighting moving obstacles. We use the generated annotation to train an adapted end-to-end network, based on [gamal2018shuffleseg], for moving object detection. The algorithm leverages a two-stream mid-fusion approach; however, we use two sequential images which encode motion across time, and the network implicitly learns to distinguish between ego-motion and obstacle motion for the final motion segmentation task. The contributions of our work can be listed as follows:
Generation of the first public automotive dataset for fisheye images with MOD annotations.
Implementation of an efficient two-stream network architecture suitable for embedded systems.
Empirical study of different training and data augmentation schemes.
2 Moving Object Detection
The detection and localization of moving obstacles is critically important for Advanced Driver Assistance Systems (ADAS) and autonomous vehicles, as it is essential for emergency braking, for supporting decisions about the vehicle's next navigation step, and for avoiding possible collisions [heimberger2017].
In automotive scenarios, rear-view and surround-view fisheye cameras are commonly deployed in existing vehicles for viewing applications. From a static observation point, the detection of moving obstacles is almost trivial as any non-zero optical flow will be due to motion in the scene or noise in the image. For a moving observer, the problem is challenging as the entire scene relative to the camera is moving and is additionally complicated when we consider fisheye cameras, which exhibit complex patterns of motion due to the non-linear projection and strong lens distortion.
Related work: The classical approach to the detection of moving objects is based on the geometrical understanding of the scene, where the ego-vehicle motion and the displacement vectors of the pixels between two frames are known. Arguably the most famous constraint used in motion detection is the epipolar constraint [soumya2012, clarke1996], which can be combined with additional geometrical constraints in order to detect multiple types of motion [klappstein2006]. However, even if the geometry of moving objects is well known, their detection still presents challenges caused by intrinsic geometrical limitations. In the search to overcome the limitations of the classical approach, there has been promising work on using CNNs to solve the moving object detection problem, such as MODNet [siam2017modnet] and MPNet [wang2018]. Given the use of fisheye cameras in surround-view systems, it is of utmost importance for research to explore this direction and provide a CNN architecture for moving object detection on fisheye images. One of the main challenges of detecting moving objects with a CNN is to make it scene agnostic, so that the detection is based only on motion cues and not on appearance cues.
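To make the classical epipolar check concrete, the following is a minimal sketch; it is not the method proposed in this paper. It assumes that the ego-motion between two frames is known as a rotation and a translation with the convention X2 = R X1 + t, and that each pixel has already been back-projected to a bearing vector using the fisheye calibration, so no particular lens model is needed. The function name `epipolar_residual` and the demo values are illustrative assumptions.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(dot(row, v) for row in m)


def epipolar_residual(x1, x2, rotation, translation):
    """Sine of the angle between bearing x2 and the epipolar plane of x1.

    Assumed convention: a static point satisfies X2 = R X1 + t, which
    implies x2 . (t x (R x1)) = 0 for the corresponding bearings.
    """
    normal = cross(translation, mat_vec(rotation, x1))
    n_len = math.sqrt(dot(normal, normal))
    x2_len = math.sqrt(dot(x2, x2))
    if n_len == 0.0 or x2_len == 0.0:
        return 0.0  # degenerate case: no translation, or bearing at the epipole
    return abs(dot(x2, normal)) / (n_len * x2_len)


# Demo: the camera drives 1 m forward along z with no rotation.
R_IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
T_FORWARD = (0.0, 0.0, -1.0)
POINT_FRAME1 = (2.0, 0.0, 10.0)

static_res = epipolar_residual(POINT_FRAME1, (2.0, 0.0, 9.0), R_IDENTITY, T_FORWARD)
lateral_res = epipolar_residual(POINT_FRAME1, (2.0, 0.5, 9.0), R_IDENTITY, T_FORWARD)
parallel_res = epipolar_residual(POINT_FRAME1, (2.0, 0.0, 7.0), R_IDENTITY, T_FORWARD)
```

In this toy setup the static point and the point moving parallel to the ego-motion both give a zero residual, while the laterally moving point does not. The second zero is one example of the geometrical limitations mentioned above: motion that stays within the epipolar plane cannot be detected by this constraint alone.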
3 Dataset Creation
Fisheye Cameras: Fisheye cameras are commonly used for near-field sensing in use cases like parking and traffic jam assist. They provide a wide field of view, so just four cameras are required for full 360° coverage. This advantage comes at a cost: the projection geometry of fisheye cameras is significantly more complex. Thus, models learnt on rectilinear cameras do not generalize well to fisheye cameras. This motivated us to create a new dataset focused on parking scenes, with other vehicles and pedestrians being the main moving objects.
Table 1: MOD results on fisheye images.
| Model | Number of samples | mIoU (%) | MOD IoU (%) |
|---|---|---|---|
| Trained on rectilinear KITTI data | 1300 | 53.5 | 10 |
| Trained on fisheye data | 3638 | 69.5 | 39.8 |
| + weight sharing in two stream | 3638 | 69.5 | 39.6 |
| + static objects scene augmentation | 5849 | 70 | 42 |
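The two metric columns of the table can be illustrated with a short sketch. It assumes that MOD IoU is the intersection over union of the moving class and that mIoU is the mean of the IoU over the two classes, moving and static; the paper does not spell this out, so this is an assumption. The function names and the toy masks are illustrative only.

```python
def class_iou(pred, target, cls):
    """IoU of one class over flattened per-pixel label lists."""
    inter = 0
    union = 0
    for p, t in zip(pred, target):
        p_is = p == cls
        t_is = t == cls
        inter += int(p_is and t_is)
        union += int(p_is or t_is)
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes=(0, 1)):
    """Unweighted mean of the per-class IoU (0 = static, 1 = moving)."""
    return sum(class_iou(pred, target, c) for c in classes) / len(classes)


# Toy flattened masks: 1 marks a moving pixel, 0 a static pixel.
pred_mask = [0, 0, 0, 1, 1, 0, 0, 0]
true_mask = [0, 0, 1, 1, 0, 0, 0, 0]

moving_iou = class_iou(pred_mask, true_mask, 1)   # 1 / 3
static_iou = class_iou(pred_mask, true_mask, 0)   # 5 / 7
miou = mean_iou(pred_mask, true_mask)             # 11 / 21
```

Under this assumption, the second row of the table (mIoU 69.5, MOD IoU 39.8) would correspond to a static-class IoU of roughly 99.2, which is plausible given how strongly static pixels dominate the images.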
Semi-automated annotation procedure: There is no public dataset for fisheye images that focuses on autonomous driving scenes, so we introduce our own dataset, which has 1 Megapixel images captured at 30 fps. In order to train our network end-to-end for moving object detection, we developed a pipeline that generates MOD annotation to be used as ground truth, as illustrated in the procedure below:
Previously generated object annotation bounding boxes are parsed to identify the object positions within the scene.
The object positions from the annotations are used to extract from the LiDAR data those points that are within the annotated object.
The extracted point cloud is then processed to classify the object as moving or static.
After processing, the points are projected onto the image using the camera calibration information.
The resulting set of 2D points is converted to a convex hull polygon.
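The last step of the procedure above can be sketched as follows. The sketch assumes that the LiDAR points of one object have already been projected to 2D pixel coordinates; the fisheye projection itself depends on the camera calibration and is not shown. Andrew's monotone chain algorithm is used here as one standard way to compute a convex hull; the paper does not state which algorithm the pipeline uses, and the function names are illustrative.

```python
def _turn(o, a, b):
    """Cross product of OA and OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2D points as a list of vertices (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    lower = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and _turn(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and _turn(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Toy projected points of one object: four corners plus two interior points.
projected = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
polygon = convex_hull(projected)
```

For the toy input, `polygon` contains only the four corner points; the interior points are discarded. Filling such a polygon for every object classified as moving would then give a binary mask of the kind described above.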
Dataset Statistics: The fisheye dataset was generated from parking scenes only and contains a total of 5139 frames, selected using the sampling strategy discussed in [DBLP:conf/visapp/UricarHKY19]. We split the data into 70% for training and 30% for testing. A total of 3638 frames from 73 different scenes were used to train the network, and 1501 frames from 70 different scenes were used to test it. The total number of moving object annotations in the training dataset is 6296. The average number of moving objects per frame is 1.4, mainly pedestrians and cars. The average percentage of moving pixels in a frame is 0.54%, and the average percentage of static pixels is 99.46%, including background and static objects. The dataset will be released as part of the WoodScape project [yogamani2019woodscape].
4 Proposed Model and Experiments
The architecture used is based on [gamal2018shuffleseg], where the network is adapted to accept two inputs in a two-stream fashion as proposed by [siam2017modnet, jain2017fusionseg, 8594088]. However, those methods used optical flow images to capture motion information and RGB images to understand scene semantics. Optical flow requires preprocessing, especially for fisheye images, which are distorted depending on the fisheye camera parameters. In our approach, we train the network end-to-end using temporally sequential images which encode both semantics and motion together. The network encoder is responsible for the feature extraction phase before the feature maps are upsampled to the input image size. The encoder utilizes point-wise group convolutions and channel shuffling, which dramatically reduce computation cost while maintaining a high accuracy level. The decoder is composed of three deconvolution layers which provide the final output image size. The main advantage of this approach is its low complexity: the lightweight architecture fits on autonomous driving embedded platforms while still providing good accuracy. The network is trained to classify the output pixels into two classes, moving and non-moving. The number of static pixels far exceeds the number of moving pixels, because background pixels are considered static, in addition to static foreground pixels such as static vehicles and pedestrians. Weighted cross-entropy is utilized to overcome the class imbalance problem.
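The effect of weighting the cross-entropy can be illustrated with a minimal sketch. The paper does not state how the class weights are chosen; the sketch assumes normalized inverse class frequencies, computed from the pixel statistics reported for the dataset (99.46% static, 0.54% moving), and normalizes the loss by the sum of the applied weights. All function names are illustrative.

```python
import math


def inverse_frequency_weights(freqs):
    """Class weights proportional to 1 / frequency, normalized to sum to 1."""
    inv = [1.0 / f for f in freqs]
    total = sum(inv)
    return [w / total for w in inv]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over pixels.

    probs:   per-pixel list of class probabilities, e.g. [p_static, p_moving]
    labels:  per-pixel class index (0 = static, 1 = moving)
    weights: per-class weight
    """
    loss = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        loss += -weights[y] * math.log(max(p[y], 1e-12))
        weight_sum += weights[y]
    return loss / weight_sum


# Pixel frequencies reported for the dataset: static, moving.
class_weights = inverse_frequency_weights([0.9946, 0.0054])

# One confidently missed moving pixel among well-classified static pixels.
pixel_probs = [[0.99, 0.01]] * 4
pixel_labels = [0, 0, 0, 1]
unweighted = weighted_cross_entropy(pixel_probs, pixel_labels, [1.0, 1.0])
weighted = weighted_cross_entropy(pixel_probs, pixel_labels, class_weights)
```

With only two classes, the normalized inverse frequencies are simply the swapped frequencies, so a moving pixel is weighted about 184 times more than a static pixel. In the toy example the missed moving pixel therefore dominates the weighted loss, whereas in the unweighted loss it is averaged down by the static pixels.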
Figure 2 illustrates the network architecture we use, where two temporally sequential images are processed separately in two encoders. This setup allows the network to understand the motion within the surrounding scene. The network is trained to generate a binary mask for MOD where each pixel can be moving or static. Throughout all experiments, weighted cross-entropy has been utilized to overcome the class imbalance problem. Adam optimizer is set at rate . L2 regularization with weight decay of has been utilized to avoid over-fitting the data. The network encoder is initialized with weights pre-trained on ImageNet.
Results: Table 1 presents our results on the MOD task using fisheye images. The first row represents the usage of pre-trained weights, where the network was trained on 1100 images using the dataset provided by [siam2017modnet] and inference was done on fisheye images. The results show that a model trained on rectilinear images does not generalize to fisheye images. The second row shows results where the network is trained on 3k fisheye images; a significant improvement is observed, with 40% IoU compared to 10% when trained on the KittiMoSeg [siam2017modnet] dataset. Overall, the detection results are reasonable, and the main issue is false positives, with static pedestrians being detected as moving objects. This was due to small movements of pedestrians while standing still in the dataset. To improve efficiency, we used shared weights in the two encoders so that the encoder output for the previous image can be re-used from the previous iteration. This resulted in very little decrease in accuracy, as shown in the third row. Finally, we augmented the dataset with scenes which contained only static objects, which did not need any annotation, so that the numbers of moving and static objects become balanced. This resulted in a slight increase in accuracy, as shown in the fourth row. The proposed network is very efficient and runs in real time at 15 fps on a 1 teraflops automotive embedded system.
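The efficiency gain from weight sharing can be sketched as follows. When both streams use the same encoder weights, the features computed for the current image can be cached and re-used as the features of the previous image at the next time step, so only one encoder pass is needed per new frame. The class `SharedEncoderMOD` and the toy encoder and decoder are hypothetical placeholders for illustration; they are not the network described in this paper.

```python
class SharedEncoderMOD:
    """Two-stream inference with one shared encoder and a feature cache."""

    def __init__(self, encoder, fuse_and_decode):
        self.encoder = encoder                  # shared by both streams
        self.fuse_and_decode = fuse_and_decode  # fusion + decoder stand-in
        self._prev_features = None

    def step(self, frame):
        """Process one new frame; returns None until two frames were seen."""
        features = self.encoder(frame)  # the only encoder pass for this step
        prev = self._prev_features
        self._prev_features = features  # cache for the next time step
        if prev is None:
            return None
        return self.fuse_and_decode(prev, features)


# Toy stand-ins (assumptions): frames are flat lists of pixel values.
encoder_calls = []


def toy_encoder(frame):
    encoder_calls.append(1)
    return [2 * v for v in frame]


def toy_decoder(prev_features, cur_features):
    # Marks positions whose features changed between the two frames.
    return [int(a != b) for a, b in zip(prev_features, cur_features)]


model = SharedEncoderMOD(toy_encoder, toy_decoder)
frames = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
masks = [model.step(f) for f in frames]
```

For the three toy frames, the encoder is called three times, whereas encoding both images of each of the two frame pairs independently would take four calls. On a long sequence this approaches half the encoder cost, which is the saving that weight sharing makes possible.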
5 Conclusion
In this paper, we introduced a new moving object detection dataset for fisheye cameras. We showed that a model trained on the rectilinear KITTI dataset does not generalize well to fisheye images. We designed an efficient architecture for moving object segmentation and provided baseline experiments. We also tested different training and augmentation techniques to improve accuracy. We will make an improved version of the dataset public in order to encourage further research. In future work, we plan to incorporate geometric priors into the loss function to improve accuracy.