Real-time Detection, Tracking, and Classification of Moving and Stationary Objects using Multiple Fisheye Images
The ability to detect pedestrians and other moving objects is crucial for an autonomous vehicle. This must be done in real-time with minimum system overhead. This paper discusses the implementation of a surround view system to identify moving as well as static objects that are close to the ego vehicle. The algorithm works on 4 views captured by fisheye cameras which are merged into a single frame. The moving object detection and tracking solution uses minimal system overhead to isolate regions of interest (ROIs) containing moving objects. These ROIs are then analyzed using a deep neural network (DNN) to categorize the moving object. With deployment and testing on a real car in urban environments, we have demonstrated the practical feasibility of the solution.
Autonomous cars are currently primed for mass adoption by the consumer market. The recent slew of announcements by several car makers and automotive suppliers indicates that we continue on the path to having autonomous vehicles share our streets with other objects, such as pedestrians, vehicles, strollers, and shopping carts. Unlike most other consumer technologies, autonomous vehicles that behave erroneously can cause significant harm to humans and property. Therefore, the ability to detect and recognize various objects around the car is paramount to ensuring safe operation of the vehicle. In this paper, we present a solution to detect and classify several moving objects around the vehicle in real time. This solution was developed for use in Carnegie Mellon University’s autonomous driving research vehicle, a Cadillac SRX.
1.1 Related work
Several approaches to detecting obstacles around the car have been explored. Information about objects around the car can be obtained through various sensors, including LIDAR- and RADAR-based approaches. Others use stereo cameras to exploit the additional depth information available. Bertozzi et al. use images from 4 fisheye cameras to detect and track objects. Another approach is to combine the data from various sensors to arrive at a more accurate estimate. This sensor fusion-based approach has also been explored for Carnegie Mellon University’s autonomous driving research vehicle. Once the information about the environment has been obtained, objects can be detected either by processing one frame at a time or by comparing adjacent frames. Single-frame detection can be done using feature-based methods or machine learning. These methods analyze a single frame to find various object classes such as pedestrians and cyclists. Their disadvantage is that they impose a high computational demand and can miss potentially dangerous objects that were not known a priori when training the model. Detection based on adjacent frames takes advantage of object motion to reduce the search space and computation time. Optical flow-based methods are the most popular for this latter approach. When combined with tracking, these methods can detect objects even in the absence of relative motion, provided the object was moving at some time in the past. Yet others combine many of these methods to develop a hybrid approach.
1.2 Our Contributions
We designed our solution to meet the practical requirement of running accurately and in real time on the vehicle, and made several design choices to that end. The moving object detector runs on the CPU, while the DNN-based classifier runs on a GPU. One of the major challenges is to maintain a low computation overhead, which is critical for real-time processing of simultaneous streams from 4 fisheye cameras. To achieve such a low computation overhead, we eschewed the de-warping stage that similar solutions use. Our algorithm works directly on the raw images from the fisheye cameras. Instead of processing the 4 video streams individually, we merge them into a single video sequence and process it as a single stream. We also deal with fixed regions of interest in the 4 video streams, as shown in Figure 1. This narrows down the search space and transforms the task of finding where a moving object is into the simpler one of deciding whether a particular region of interest contains a moving object. To ensure that a previously detected moving object continues to be detected even after it stops moving, we developed a lightweight form of tracking based on the sparse Lucas-Kanade algorithm.
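The merge-then-crop idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the 2x2 tiling order, the choice of three equal-width ROIs per view, and the helper names `merge_quad`, `fixed_rois`, and `crop` are all hypothetical, and images are represented as plain lists of pixel rows so that the sketch has no dependencies.

```python
def merge_quad(front, right, rear, left):
    """Tile four equally sized frames into one 2x2 mosaic (tiling order is an assumption)."""
    top = [a + b for a, b in zip(front, right)]
    bottom = [a + b for a, b in zip(rear, left)]
    return top + bottom


def fixed_rois(view_w, view_h):
    """Return 12 hypothetical ROIs as (x, y, w, h): left/center/right thirds of each view."""
    rois = []
    third = view_w // 3
    for row in range(2):
        for col in range(2):
            for k in range(3):
                rois.append((col * view_w + k * third, row * view_h, third, view_h))
    return rois


def crop(image, roi):
    """Cut one ROI out of the merged frame."""
    x, y, w, h = roi
    return [pixel_row[x:x + w] for pixel_row in image[y:y + h]]


def make_frame(value, w, h):
    """Build a constant-valued test frame."""
    return [[value] * w for _ in range(h)]


W, H = 6, 4
quad = merge_quad(make_frame(1, W, H), make_frame(2, W, H),
                  make_frame(3, W, H), make_frame(4, W, H))
rois = fixed_rois(W, H)
```

Because the ROIs are fixed in the merged frame, the per-frame work reduces to answering a yes/no question for each of the 12 crops rather than searching the whole image.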
Another priority of ours was that our solution must be portable enough to be platform-independent. The detector and tracking modules were implemented in OpenCV to achieve this objective. The modules were also developed to be independent from each other. This allows us to enable only a subset of the modules as necessary. This also allows the modules to be integrated without much effort with newer solutions. We have done extensive testing of the entire solution on a real car as well as using offline computation. Our entire solution has been ported to x86, TI TDA2x, and NVIDIA TX2. Performance and accuracy measurements were taken for all these system implementations.
2 Architecture and Algorithms
The overall architecture of our solution includes detection, tracking and classification modules. The detection module detects if there is a moving object in any of the regions of interest. The tracking module is responsible for ensuring that a previously-detected object continues to be detected even if it stops moving. The classification module uses deep learning to categorize the moving object. The function and design of these modules are explained in the following sections.
The detection module detects if there is a moving object in any of the regions of interest. The different regions of interest were chosen to align with practical needs to indicate vehicle and other objects approaching from the left, right or center of each field of view.
Figure 2 illustrates the pipeline of our detection and tracking modules. As shown, the first stage in detection is the background subtraction of adjacent frames. This narrows our search space to only the pixels where we believe motion to be present. In this background-subtracted image, we find the appropriate feature locations to track. Similar solutions use various feature descriptors such as SIFT , SURF , BRIEF  and ORB . We decided to use the ORB feature descriptor chiefly because of its efficiency of computation. Once the feature points are extracted, we use the sparse Lucas Kanade algorithm  to find the corresponding feature points in the subsequent frame. The vectors connecting the feature points from the first frame to the corresponding ones in the second frame were summed for every region of interest separately. The length of the resultant vector was thresholded to determine if the region of interest contains any moving object.
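The final decision step of this pipeline (summing the flow vectors per ROI and thresholding the length of the resultant) can be sketched as follows. The sketch assumes that feature extraction and sparse Lucas-Kanade tracking have already produced pairs of matched points; the function names, the ROI rectangles, and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def roi_index(point, rois):
    """Return the index of the first ROI (x, y, w, h) containing the point, or None."""
    px, py = point
    for i, (x, y, w, h) in enumerate(rois):
        if x <= px < x + w and y <= py < y + h:
            return i
    return None


def detect_motion(matches, rois, threshold):
    """Sum flow vectors per ROI and flag ROIs whose resultant length exceeds the threshold."""
    sums = [[0.0, 0.0] for _ in rois]
    for p0, p1 in matches:
        i = roi_index(p0, rois)
        if i is not None:
            sums[i][0] += p1[0] - p0[0]
            sums[i][1] += p1[1] - p0[1]
    return [math.hypot(sx, sy) > threshold for sx, sy in sums]


rois = [(0, 0, 100, 100), (100, 0, 100, 100)]
matches = [
    ((10, 10), (16, 10)), ((20, 30), (25, 31)),      # coherent motion in ROI 0
    ((150, 50), (151, 50)), ((160, 60), (159, 60)),  # opposing jitter in ROI 1
]
flags = detect_motion(matches, rois, threshold=5.0)
```

Summing the vectors before taking the length means that small, incoherent displacements (for example, noise on static features) tend to cancel, whereas features on one moving object reinforce each other.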
Simple detection has the drawback that an object stops being detected if it enters a region of interest and then stops moving. This behavior is not what we intend. We would like to know of every object around the car that can pose a danger, even if it is not currently moving. Hence, we introduced a tracking module that tracks a previously detected object even after it stops moving. Tracking of moving objects in a video stream is a well-studied problem with many state-of-the-art trackers available. Most of these trackers perform well, with high sensitivity and specificity, but their computation overhead is high. Since we are only concerned with detecting the presence of moving objects in the regions of interest, our requirements are less demanding than those addressed by these turn-key solutions. We need not track multiple objects, since we only generate warnings for the presence or absence of moving objects in each ROI. We also do not need to track exact bounding boxes for the objects. We leverage these looser requirements to obtain a lower computation overhead. In our solution, once we detect the presence of moving objects, we store their feature points. When the detector transitions from positive to negative, we run another iteration of the sparse Lucas-Kanade algorithm on the stored feature points from the previous frame. If the feature points are found in the current frame with high confidence and the resultant motion vector for the feature points is small, we conclude that the moving object is still present in the ROI but has stopped moving. This ROI is considered to contain the moving object until the detector module freshly detects a moving object, at which point the stored features are discarded.
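The per-ROI tracking logic described above behaves like a small latch, which can be sketched as follows. This is a minimal sketch, not the actual implementation: the class name `RoiTracker`, the thresholds, and the `retrack` callable (a stand-in for re-running sparse Lucas-Kanade on the stored points and returning a confidence and a resultant motion length) are all assumptions made for illustration.

```python
class RoiTracker:
    """Latch for one ROI: keeps reporting an object after it stops moving."""

    def __init__(self, retrack, min_confidence=0.8, max_motion=2.0):
        # retrack is a hypothetical callable: points -> (confidence, motion_length)
        self.retrack = retrack
        self.min_confidence = min_confidence
        self.max_motion = max_motion
        self.stored_points = None
        self.prev_detected = False
        self.held = False

    def update(self, detected, points=None):
        """Process one frame; return True if the ROI is considered occupied."""
        if detected:
            # A fresh detection replaces any stored features and clears the latch.
            self.stored_points = points
            self.held = False
        elif self.prev_detected and self.stored_points:
            # Positive-to-negative transition: check whether the object merely stopped.
            confidence, motion = self.retrack(self.stored_points)
            self.held = (confidence >= self.min_confidence
                         and motion <= self.max_motion)
            if not self.held:
                self.stored_points = None
        self.prev_detected = detected
        return detected or self.held


stub = {"result": (0.95, 0.3)}          # stand-in for the Lucas-Kanade outcome
tracker = RoiTracker(lambda pts: stub["result"])
history = [
    tracker.update(True, [(10, 10), (12, 14)]),  # object moving
    tracker.update(False),                       # object stopped, latch engages
    tracker.update(False),                       # still reported as present
]
```

The expensive re-tracking step runs only on the transition frame, which is what keeps the overhead well below that of a general-purpose tracker.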
The detector and tracking modules are used to find the regions of interest with moving objects. However, to make meaningful use of this information, the system needs to know the nature of the moving object. Specifically, we would like to categorize the moving object as pedestrian, bicycle, shopping cart, etc. This information can then be used by the autonomous vehicle software to take action tailored to the category of the detected object. For this purpose, we introduce a deep neural network-based classification module.
Since AlexNet in 2012, Convolutional Neural Networks (CNNs) have proved to be the state-of-the-art methodology in computer vision. They have outperformed traditional approaches in various tasks, such as recognition, segmentation, and detection. Over time, CNNs have grown very deep and complex, which has increased their accuracy. However, these networks are also large and slow to respond. In many real-world applications like our system, the model needs to execute on a small, low-power embedded system that lacks a GPU powerful enough to run a computationally heavy deep network in real time. In 2017, Howard et al. proposed MobileNet, which substitutes the standard convolution layers with depth-wise and point-wise convolution layers to achieve small, low-latency models that meet the requirements of our application.
MobileNet factorizes a standard convolution into a depth-wise convolution and a $1 \times 1$ point-wise convolution, which together are referred to as a depth-wise separable convolution. The depth-wise convolution applies a single filter to each input channel. The output map is then fed into a $1 \times 1$ convolution that combines the outputs of the depth-wise convolution layer. The factorization thus splits a standard one-step convolution operation into 2 separate steps. Because the separation restricts space requirements and simplifies the computation, depth-wise separable convolution can achieve a reduction in computation of
$$\frac{1}{N} + \frac{1}{D_K^2}.$$
For a standard convolution, the computational cost is
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F,$$
where $M$ is the number of input channels, $N$ is the number of output channels, $D_K$ is the kernel size, and $D_F$ is the feature map size.
Because of the factorization, the computation contains two parts. For the depth-wise convolution part, the computational cost is
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F,$$
and the $1 \times 1$ convolution has a computational cost of
$$M \cdot N \cdot D_F \cdot D_F.$$
To sum up, the total cost of depth-wise separable convolution is
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F,$$
leading to a reduction in computation of:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}.$$
According to the MobileNet paper, MobileNet uses $3 \times 3$ depth-wise separable convolutions, resulting in between 8 and 9 times less computation than standard convolutions, with a reduction in accuracy of only about 1%.
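As a worked check of the cost comparison, the following sketch evaluates both costs for one layer. The symbol names follow the MobileNet paper's notation (kernel size `dk`, input channels `m`, output channels `n`, feature map size `df`); the specific layer dimensions below are illustrative assumptions, not values taken from our system.

```python
def standard_cost(dk, m, n, df):
    """Multiply-accumulates of a standard convolution layer."""
    return dk * dk * m * n * df * df


def separable_cost(dk, m, n, df):
    """Multiply-accumulates of a depth-wise convolution plus a 1x1 point-wise convolution."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 512 input and output channels, 14x14 feature map.
dk, m, n, df = 3, 512, 512, 14
ratio = separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)
speedup = 1 / ratio
```

For this layer the ratio equals 1/512 + 1/9, so the separable form needs roughly 8.8 times fewer operations. With a 3x3 kernel the 1/9 term dominates whenever the number of output channels is large, which is why the saving stays between 8 and 9 times.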
MobileNet provided us features extracted from images as descriptors. We utilized these descriptors to build a classifier. Here, we chose to use SoftMax together with Cross Entropy as the loss function.
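SoftMax, defined as
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K,$$
where $\mathbf{z}$ is the vector of real-valued outputs of the final layer and $K$ is the number of categories, is normally used after the final layer to map real values to the interval $[0, 1]$, in order to represent a probability distribution over possible categories.
Then, we use cross entropy to compute the loss and provide a gradient for the back-propagation phase, which is defined as:
$$L = -\sum_{i=1}^{K} y_i \log \hat{y}_i,$$
where $y_i$ is the one-hot ground-truth label and $\hat{y}_i = \sigma(\mathbf{z})_i$ is the predicted probability for category $i$.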
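A minimal sketch of the SoftMax and cross-entropy computations for a single sample is given below. It uses the standard textbook definitions with a one-hot target; the example logits and the assumption of three categories are purely illustrative, and a real training setup would rely on the deep learning framework's built-in operators.

```python
import math


def softmax(logits):
    """Map real-valued scores to a probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cross entropy against a one-hot target: minus the log-probability of the true class."""
    return -math.log(max(probs[target_index], 1e-12))  # clamp to avoid log(0)


# Illustrative scores for three hypothetical categories.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss_correct = cross_entropy(probs, 0)  # true class has the highest score
loss_wrong = cross_entropy(probs, 2)    # true class has the lowest score
```

The loss is small when the network assigns a high probability to the true category and grows as that probability shrinks, which is the signal that back-propagation uses.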
Before training, we first obtain the frames from the video and then split them into regions of interest. Because MobileNet takes input images of dimension $224 \times 224$, we resized all images of differing resolutions to $224 \times 224$ and labeled them as shown in Figure 3.
This section describes our experiments measuring the performance of our moving-object detection system on different hardware platforms, using images acquired in various locations.
3.1 Hardware and Implementation
In this evaluation, we examined the practical feasibility and efficacy of our algorithm using multiple platforms from different vendors. Figure 4 shows our surround view system prototype, built using a TI TDA2x evaluation kit from Spectrum Digital to capture data and evaluate our detection/tracking algorithm in real-life scenarios. We also integrated our algorithm on an NVIDIA TX2 embedded platform and an x86 desktop to compare the relative performance of our detection, tracking and classification modules across different platforms. The TI TDA2x system is equipped with a dual-core 1.0-GHz ARM A15 processor and two C66x DSPs. The NVIDIA TX2 system is equipped with a quad-core 2.0-GHz ARMv8 A57 processor, a dual-core 2.0-GHz ARMv8 Denver processor, and an integrated Pascal GPU. The x86 desktop is equipped with a 3.3-GHz Intel i9-7900X processor and an NVIDIA Titan Xp GPU. We ran 64-bit Ubuntu 16.04 with OpenCV 2.4.17 for the detection/tracking module. CUDA 8.0 with cuDNN 6.0, TensorFlow, Python 3, and OpenCV 3.0 were used for the classification module implementation. To obtain the best classification result, we used the MobileNet 1.0 model in our system. We discuss the speed of different MobileNet model versions in Section 3.2.
|Core||Job description||Used APIs||Run Time||Total|
|CPU||Difference of two adjacent frames||cv::cvtColor, cv::absdiff, cv::threshold|| || |
|CPU||ORB feature point extraction||cv::OrbFeatureDetector|| || |
|CPU||Sparse optical flow calculation||cv::calcOpticalFlowPyrLK|| || |
|GPU||Difference of two adjacent frames||gpu::cvtColor, gpu::absdiff, gpu::threshold|| || |
|GPU||ORB feature point extraction||gpu::ORB_GPU|| || |
|GPU||Sparse optical flow calculation||gpu::opticalFlowPyrLK.sparse|| || |
Figure 5 captures our implemented system architecture. The different components of the system architecture are as follows:
Capture/Merge: This component captures the images from the 4 fisheye cameras and merges them into a single quad view image as shown in Figure 1.
Detection/Tracking: These modules take the quad view image as input and process it to detect the moving objects. The outputs from the modules are the indexes of the ROIs containing moving objects.
Detection Output Transfer: The output from the detection/tracking modules is transferred to the classification module using socket communication.
Object Classification: This module takes the quad view image and selected ROI indexes as input and processes them to classify the moving objects into different categories.
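The hand-off of ROI indexes between the detection/tracking side and the classification side could be sketched as below. The wire format is entirely hypothetical (a 32-bit frame id followed by a 16-bit bitmask with one bit per ROI), as are the function names; the sketch uses a local socket pair only to show one message making the round trip.

```python
import socket
import struct

NUM_ROIS = 12
# Hypothetical message layout: frame id (uint32) + ROI bitmask (uint16), network byte order.
MESSAGE = struct.Struct("!IH")


def pack_detection(frame_id, roi_indexes):
    """Encode the indexes of ROIs containing moving objects as a bitmask."""
    mask = 0
    for i in roi_indexes:
        if not 0 <= i < NUM_ROIS:
            raise ValueError("ROI index out of range")
        mask |= 1 << i
    return MESSAGE.pack(frame_id, mask)


def unpack_detection(payload):
    """Decode a message back into (frame_id, sorted list of ROI indexes)."""
    frame_id, mask = MESSAGE.unpack(payload)
    return frame_id, [i for i in range(NUM_ROIS) if mask & (1 << i)]


def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("socket closed early")
        data += chunk
    return data


def roundtrip(frame_id, roi_indexes):
    """Send one detection message through a local socket pair and decode it."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(pack_detection(frame_id, roi_indexes))
        return unpack_detection(recv_exact(receiver, MESSAGE.size))
    finally:
        sender.close()
        receiver.close()
```

A fixed-size message keeps the receiving side simple: the classifier reads a known number of bytes per frame and then crops only the flagged ROIs from the quad view image.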
3.2 Experiment Results
Experiment 1: Detection only
Experiment 2: Detection and tracking
Experiment 3: Classification only
Experiment 4: Detection, tracking and classification
The metrics for Experiment 1 and 2 were compared to evaluate the improvement in accuracy on adding the tracking module. Likewise, the metrics from Experiment 3 and 4 were compared to evaluate the speed and accuracy trade offs when combining all the modules.
We collected video sequences at 1280x720 resolution at 30 fps using our surround view system equipped with four 190° FOV fisheye cameras, as shown in Figure 4. We used 7 video sequences to create our training data. 1009 frames were selected from the video data and 12 pre-fixed ROIs were sliced into individual images. The images were labeled with presence information and the category of objects for training our MobileNet-based classification module. To create test data, we labeled 5 other video sequences with manually-annotated ground truths of object presence and target class in 12 ROIs. These training and testing data were obtained from various indoor and outdoor parking lots from local shopping centers and buildings in Pittsburgh, USA.
Table 1 shows a comparison of the execution times of the detection module between GPU and CPU on NVIDIA TX2. We observe that running the detection module on the GPU instead of the CPU does not result in a significant performance improvement. This is because the algorithms deal with sparse features instead of processing all the pixels in the image. Therefore, the detection and tracking module can be selected to run either on a GPU or a CPU as per system availability. We chose to pin the detection and tracking operation to the CPU. This allowed us to dedicate the GPU resource to the DNN-based classification algorithm to maximize its performance.
Table 2 shows the measured average frames per second (fps) of each algorithm module running on different platforms. Unfortunately, we could not evaluate the classification module on the TI TDA2x platform since it does not support CUDA. We notice that the overall performance of the integrated “detection, tracking, and classification” approach is significantly better than that of the “classification only” scheme. This is because fewer ROIs need to be processed by the classification module when it uses the filtered ROI index information received from the detection and tracking module. We also observe that the “detection, tracking, and classification” experiment was practically feasible on an embedded platform like the NVIDIA TX2 by dedicating the entire GPU resource to the classification module. The NVIDIA TX2 supports a feature inherent in integrated GPUs called zero-copy memory. Zero-copy memory eliminates the need for copying data to and from the DRAM associated with the GPU. This allows both the CPU and GPU components of GPU-intensive programs to share memory space. This feature enables us to reduce the latency of transferring the captured input image from the CPU to the classification module on the GPU.
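To evaluate the accuracy of our algorithms, we defined the following metrics for the detection/tracking and classification modules, where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$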
Our measured values for the precision and recall are shown in Table 3. As we can see from the table, the addition of the tracking module improves the recall metric at the cost of less precision. Consider the case when a moving object comes to a stop at one of the ROIs. This object will not be detected in the absence of the tracking module. This ROI will be counted as a false negative in all the frames until the object starts moving again. The result is a low recall value. The addition of the tracking module does reduce the number of false negatives leading to a higher recall. But, we also see a reduction in precision. This happens in the scenario when the tracking module erroneously latches onto a non-moving object. This ROI then counts as a false positive for the subsequent frames resulting in a lower precision value. We believe that the overall compromise is however preferable for an autonomous vehicle system where we would rather have more false positives than false negatives.
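The trade-off discussed above can be illustrated with a small computation over per-frame ROI labels. The sequences below are invented for illustration and are not measured data: they model one ROI in which an object moves for three frames, stands still for three frames, and then leaves, with a tracker that holds on for one frame too long.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from per-frame presence flags for one ROI."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative ground truth: object present in frames 0-5, absent in frames 6-7.
actual = [1, 1, 1, 1, 1, 1, 0, 0]
detection_only = [1, 1, 1, 0, 0, 0, 0, 0]  # misses the object once it stops
with_tracking = [1, 1, 1, 1, 1, 1, 1, 0]   # holds on one frame after it leaves

p_det, r_det = precision_recall(detection_only, actual)
p_trk, r_trk = precision_recall(with_tracking, actual)
```

In this toy case detection alone gives a precision of 1.0 and a recall of 0.5, while adding the tracker raises recall to 1.0 and lowers precision to 6/7, mirroring the direction of the change reported in Table 3.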
Figure 6 shows the speed of different MobileNet models on x86 and NVIDIA TX2 with respect to the different input image resolutions. The appropriate MobileNet model can be chosen to fit the latency and size budget for the platform in use. We decided to use MobileNet_v1_1.0_224  to achieve the best accuracy in real-time.
4 Conclusion and Future Work
We have presented a solution that can detect, track and recognize moving objects in pre-determined regions of interest in real-time with good accuracy. Our solution was evaluated on x86, TI TDA2x and NVIDIA TX2 platforms, and their relative performance was measured for various combinations of active modules. The final solution was also deployed on a real car and extensively tested in real-world situations. The system performed with good accuracy and precision in these real-world tests. Because the system was built from the ground up using a modular approach, the breakdown of performance and accuracy measurements could be taken with ease. In fact, these modules can be effectively added to any existing autonomous vehicle solution with minimum porting time and system overhead.
In the future, we plan to train the classification module to detect several additional categories. We would also like to add an ego-motion compensator to be able to use the detection and tracking module even when the vehicle is traveling at high speeds. Another next step is to integrate the entire solution with other existing algorithms on the NVIDIA Drive PX2 and deploy on our autonomous driving research vehicle.
- J. Wei, J. M. Snider, J. Kim, J. M. Dolan, R. Rajkumar, and B. Litkouhi, “Towards a viable autonomous driving research platform,” in Intelligent Vehicles Symposium (IV), 2013 IEEE. IEEE, 2013, pp. 763–770.
- H. Wang, B. Wang, B. Liu, X. Meng, and G. Yang, “Pedestrian recognition and tracking using 3d lidar for autonomous vehicle,” Robotics and Autonomous Systems, vol. 88, pp. 71–78, 2017.
- N. Bernini, M. Bertozzi, L. Castangia, M. Patander, and M. Sabbatelli, “Real-time obstacle detection using stereo vision for autonomous ground vehicles: A survey,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 873–878.
- M. Bertozzi, L. Castangia, S. Cattani, A. Prioletti, and P. Versari, “360 detection and tracking algorithm of both pedestrian and vehicle using fisheye images,” in Intelligent Vehicles Symposium (IV), 2015 IEEE. IEEE, 2015, pp. 132–137.
- H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar, “A multi-sensor fusion system for moving object detection and tracking in urban driving environments,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 1836–1843.
- J. Choi, “Realtime on-road vehicle detection with optical flows and haar-like feature detectors,” Tech. Rep., 2012.
- D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer vision, 1999. The proceedings of the seventh IEEE international conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
- H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” Computer vision–ECCV 2006, pp. 404–417, 2006.
- M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “Brief: Binary robust independent elementary features,” Computer Vision–ECCV 2010, pp. 778–792, 2010.
- E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Computer Vision (ICCV), 2011 IEEE international conference on. IEEE, 2011, pp. 2564–2571.
- S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International journal of computer vision, vol. 56, no. 3, pp. 221–255, 2004.
- M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking vot2015 challenge results,” in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 1–23.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- “TDA2x SoC family,” http://processors.wiki.ti.com/index.php/TDA2x.
- “TDA2x Vision Evaluation Module Kit,” http://www.spectrumdigital.com/tda2x-vision-evaluation-module-kit.
- “NVIDIA Jetson TX1/TX2 Embedded Platforms,” http://www.nvidia.com/object/embedded-systems-dev-kits-modules.html.
- J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
- “MobileNets: Open-Source Models for Efficient On-Device Vision,” https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html.