Traffic Danger Recognition With Surveillance Cameras Without Training Data

Lijun Yu
Peking University
Beijing, China
yulijun@pku.edu.cn
This work was done when Lijun Yu was a visiting scholar at Carnegie Mellon University and later a research intern at MIX Labs.
   Dawei Zhang
MIX Labs
Beijing, China
dawei@mixlabs.xyz
   Xiangqun Chen
Peking University
Beijing, China
cherry@sei.pku.edu.cn
   Alexander Hauptmann
Carnegie Mellon Univ.
Pittsburgh, PA, US
alex@cs.cmu.edu
Abstract

We propose a traffic danger recognition model that works with arbitrary traffic surveillance cameras to identify and predict car crashes. There are too many cameras to monitor manually. Therefore, we developed a model to predict and identify car crashes from surveillance cameras based on a 3D reconstruction of the road plane and prediction of trajectories. For normal traffic, it supports real-time proactive safety checks of speeds and distances between vehicles to provide insights about possible high-risk areas. We achieve good prediction and recognition of car crashes without using any labeled training data of crashes. Experiments on the BrnoCompSpeed dataset show that our model can accurately monitor the road, with mean errors of 1.80% for distance measurement, 2.77 km/h for speed measurement, 0.24 m for car position prediction, and 2.53 km/h for speed prediction.

1 Introduction

Surveillance cameras are widely installed, recording and storing massive data every day. But anomalous events are very rare and it is impossible for humans to monitor all these cameras. Car crashes are a crucial safety issue nowadays. Leveraging the recent development of computer vision algorithms, we are developing an automatic system for traffic surveillance on highways and streets.

We have built a model that can predict and recognize crashes from surveillance cameras. One benefit is that ambulances could be dispatched to the crash scene immediately, saving lives. As accidents are relatively rare, our model also supports proactive safety checks based on normal traffic flow. Real-time speed and distance measurements lead to insights about high-risk areas, such as places where cars frequently get too close. This helps improve traffic safety in the long term.

As accidents are rare in regular surveillance videos, it is arduous to collect and build a labeled dataset of car crashes covering all possible situations. Taking this reality into account, we propose a model that requires no labeled crash data for training. Physically, a collision occurs when two cars gradually get closer and finally come into contact. We can therefore predict their trajectories and check for overlapping positions, which indicate a collision. In severe crashes, vehicles may be deformed and undetectable afterwards, but the crash is still recognized ahead of time based on the predictions.

Our model consists of five steps. A camera calibration method is applied to transform points on the image to the road plane. Object detection and tracking algorithms identify each vehicle and trace its history. A 3D bounding box is built to obtain the projection of a car on the road. Position and speed are estimated and predicted into the future. Finally, the model recognizes danger based on distances between vehicles and overlaps in the predicted trajectories.

We run this model on the BrnoCompSpeed dataset [19], which contains highway surveillance videos with ground truth speed and distance measurements. We evaluate the performance of each step and show convincing results. The model performs an effective 3D reconstruction of the road plane, with a mean distance measurement error of 1.80% along the road. Building on efficient detection and tracking of vehicles, it measures speeds with a mean error of 2.77 km/h. It predicts vehicle trajectories reliably, with mean errors of 0.24 m for car positions and 2.53 km/h for speeds at 0.12 seconds ahead. This enables refined recognition of traffic danger from all the measurements and predictions. Importantly, all these results are achieved without any labeled training data.

The key contributions of this paper are:

  • A traffic danger recognition model for surveillance cameras based on a 3D reconstruction of the road and prediction of trajectories. It does not need any labeled training data of car crashes.

  • Results show that the model monitors the road accurately, with mean errors of 1.80% for distance measurement, 2.77 km/h for speed measurement, 0.24 m for car position prediction, and 2.53 km/h for speed prediction.

2 Related Work

Camera Calibration. Calibration methods are employed to derive the intrinsic (focal length, principal point) and extrinsic (rotation, translation) parameters of a camera. The accuracy of calibration is critical for the 3D reconstruction and further processing. Different methods may require various forms of user inputs, such as drawing parallel lines [12], camera position [22, 14], and average vehicle size [3] or speed [16]. Fully automatic calibration is also achievable according to [4, 18].

Object Detection. Object detection models are utilized to identify vehicles in video frames. Models such as Fast R-CNN [6] and Faster R-CNN [15] rely on region proposal algorithms and deep convolutional neural networks to obtain bounding boxes of objects. Mask R-CNN [7] further extends Faster R-CNN by predicting object masks simultaneously.

Multiple Object Tracking. Vehicles detected in adjacent frames need to be associated correctly. The SORT algorithm [2] supports fast online tracking with a Kalman filter [9] and the Hungarian algorithm [10]. Deep SORT [23] additionally integrates appearance information to improve performance.

Anomaly Detection. Traffic danger recognition is one specific aspect of anomaly detection. Multiple instance learning [20] requires sufficient annotated training data, and motion pattern based learning for traffic anomalies [24] also uses labeled data. Our approach, in contrast, is built upon no labeled videos of car crashes.

3 Methodology

Our traffic danger recognition model consists of five steps. Camera calibration provides geometry parameters and a transformation from image coordinates to road plane coordinates. Object detection and tracking algorithms provide the types, positions, and masks of vehicles and trace their histories. 3D bounding boxes are built to localize vehicles in world space and project them onto the road plane. Positions and speeds are calculated from adjacent frames with smoothing, and predicted into the future. Finally, we recognize danger from vehicle distances and potential overlaps in the predictions.

3.1 Camera Model and Calibration

We adopt a traffic camera model similar to that of Sochor et al. [18], as shown in Figure 1. We follow the practice of Dubská et al. [5] in setting up the directions of the three vanishing points. Given a known plane, points in an image can be reprojected to points on that plane in world space. This reprojection enables a 3D reconstruction of vehicles on the road.

Figure 1: Traffic camera model. A world coordinate system is defined in which one coordinate plane is parallel to the image and passes through its top left corner; the camera lies on a coordinate plane and points at the principal point in the center of the image. The road plane is the plane of the road surface. The directions of the three vanishing points are: the first in the direction of traffic, the second parallel to the road plane and perpendicular to the first, and the third perpendicular to the road plane.

Although some automatic calibration methods have been developed, they do not achieve sufficient accuracy for our model. We therefore retain a manual calibration, which requires labeling two groups of parallel lines for each camera view. From these we derive two vanishing points in the image space using a least-squares method as in [12]. With Algorithm 1, extracted from the supplementary material of the dataset [19], we derive the road plane in world space and project image points to world points on the plane.
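The least-squares step can be sketched as follows; this is our minimal illustration (not the authors' code), estimating a vanishing point as the point closest, in the squared-distance sense, to a group of labeled lines given by pairs of image points. The example input lines are hypothetical.

```python
import numpy as np

def vanishing_point(lines):
    """Estimate the vanishing point of a group of (roughly) parallel lines.

    lines: list of ((x1, y1), (x2, y2)) image-point pairs, one pair per line.
    Returns the point minimizing the sum of squared distances to all lines.
    """
    A, b = [], []
    for (x1, y1), (x2, y2) in lines:
        # Line through the two points: nx*x + ny*y = c with unit normal (nx, ny).
        nx, ny = y2 - y1, x1 - x2            # normal of the direction vector
        norm = np.hypot(nx, ny)
        nx, ny = nx / norm, ny / norm
        A.append([nx, ny])
        b.append(nx * x1 + ny * y1)
    # Least-squares solution of A @ p = b gives the vanishing point p.
    p, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return p  # (x, y) in image coordinates

# Example with two hypothetical annotated lane markings:
vp = vanishing_point([((100, 600), (480, 300)), ((900, 620), (560, 305))])
```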

We rotate the world coordinate system so that one coordinate plane is parallel to the road plane; plane coordinates of a point are then obtained by dropping the perpendicular axis. The rotation parameters are acquired by solving for the rotation that aligns the road-plane normal with the perpendicular coordinate axis (Equation 1).
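As a hedged illustration of this rotation (our own construction via Rodrigues' formula and our variable names, not necessarily the paper's Equation 1), one can compute a rotation that maps the road-plane normal onto the z-axis and then drop the third coordinate:

```python
import numpy as np

def rotation_to_plane(normal):
    """Rotation matrix R such that R @ normal is parallel to the z-axis."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)                     # rotation axis (unnormalized)
    s, c = np.linalg.norm(v), np.dot(n, z)
    if s < 1e-12:                          # normal already (anti-)parallel to z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])    # cross-product matrix of v
    return np.eye(3) + vx + vx @ vx * ((1 - c) / s ** 2)  # Rodrigues' formula

def plane_coords(world_point, normal):
    """2D road-plane coordinates: rotate, then drop the axis along the normal."""
    return (rotation_to_plane(normal) @ np.asarray(world_point, dtype=float))[:2]
```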
Algorithm 1 (from the supplementary material of [19]) projects an image point to a world point on the road plane. Its inputs are an image point, the calibration (focal length and vanishing points), and a plane offset; its output is the corresponding world point on the plane. For points, lower case represents image coordinates and upper case stands for world coordinates. The plane offset is arbitrary and is usually set to 10 as in [19].
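As a hedged illustration of this projection under a simple pinhole model (the function name, the plane parameterization, and the camera-centered coordinates are our assumptions, not details taken from [19]), the image point is back-projected into a ray through the camera center and intersected with the road plane:

```python
import numpy as np

def image_to_road(point, focal, principal_point, plane):
    """Project an image point onto the road plane.

    point, principal_point: (x, y) in pixels.
    focal: focal length in pixels.
    plane: (a, b, c, d) with a*X + b*Y + c*Z + d = 0 in camera coordinates.
    Returns the 3D point on the plane, or None if the ray is parallel to it.
    """
    x, y = point
    px, py = principal_point
    # Ray direction through the camera center for a pinhole camera.
    ray = np.array([x - px, y - py, focal], dtype=float)
    normal, d = np.asarray(plane[:3], dtype=float), float(plane[3])
    denom = normal @ ray
    if abs(denom) < 1e-12:
        return None                    # ray parallel to the road plane
    t = -d / denom                     # camera center is at the origin
    return t * ray                     # world point on the plane
```

Applying the rotation from the previous sketch to the returned point and dropping the third coordinate then yields 2D road-plane coordinates.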

3.2 Object Detection and Tracking

We select Mask R-CNN by He et al. [7] as our object detection model, which outputs detection scores, object types, bounding boxes, and object masks. We use Abdulla's implementation [1] with weights trained on the Microsoft COCO dataset [13] and select three types of objects as targets: car, bus, and truck. We then apply a filter to the detected objects, as shown in Figure 2 (a sketch of the filter follows the figure caption). The filter follows three rules:

  1. Vehicles should not be too small in size.

  2. Vehicles should be in the road area.

  3. Vehicles should be completely visible.

Figure 2: Object detection: raw detections (left) and filtered objects (right). The white car at the top left is filtered out by rule 1, the red one at the bottom right by rule 3, and the cars at the top right by rule 2.
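A minimal sketch of such a filter is given below; the area threshold, border margin, box format, and the use of a road polygon with OpenCV's point-in-polygon test are our assumptions rather than values from the paper.

```python
import numpy as np
import cv2  # only used for the point-in-polygon test

def keep_detection(box, frame_shape, road_polygon,
                   min_area=1500, border_margin=2):
    """Apply the three filtering rules to one detection.

    box: (x1, y1, x2, y2) in pixels, frame_shape: (height, width),
    road_polygon: Nx2 array of polygon vertices outlining the road area.
    """
    x1, y1, x2, y2 = box
    h, w = frame_shape

    # Rule 1: vehicles should not be too small in size.
    if (x2 - x1) * (y2 - y1) < min_area:
        return False

    # Rule 2: vehicles should be in the road area (test the box center).
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    contour = np.asarray(road_polygon, dtype=np.int32).reshape(-1, 1, 2)
    if cv2.pointPolygonTest(contour, (float(cx), float(cy)), False) < 0:
        return False

    # Rule 3: vehicles should be completely visible (box not touching the border).
    if (x1 <= border_margin or y1 <= border_margin
            or x2 >= w - border_margin or y2 >= h - border_margin):
        return False
    return True
```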

We use Deep SORT by Wojke et al. [23] to track vehicles across frames. Each vehicle receives a unique ID from the tracking model, which remains robust through brief losses of detection.

3.3 3D Bounding Box

We estimate the contour of a vehicle from its Mask R-CNN mask using the border-following algorithm by Suzuki et al. [21]. For each of the three vanishing points, we calculate the tilt angles of the lines passing through that vanishing point and each point of the contour, and thereby find the tangent lines of the contour through the three vanishing points. We adapt the algorithm from Sochor [17] to build 3D bounding boxes of cars, as described in Algorithm 2 and Figure 3.

Algorithm 2: Build the 3D bounding box from the tangent lines through the vanishing points. For each vanishing point, one tangent line has the minimum tilt angle and the other the maximum; intersections of these lines yield the corner points of the box, with a case distinction depending on their configuration. The positions of the points are shown in Figure 3.
Figure 3: 3D bounding box: tangent lines of the contour and their intersections (top left), derived lines and intersections (top right), the final result (bottom left), and vehicles at other viewing angles (bottom right). The blue, green, and red lines pass through the first, second, and third vanishing points, respectively.
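The tangent-line search described above can be sketched as follows; this is our simplified illustration, not Sochor's exact algorithm, and it assumes the contour angles do not straddle the ±π discontinuity of atan2.

```python
import numpy as np

def tangent_lines(contour, vanishing_point):
    """Find the two tangent lines of a contour through a vanishing point.

    contour: Nx2 array of image points, vanishing_point: (x, y).
    Returns the two contour points whose lines to the vanishing point have
    the minimum and maximum tilt angles; together with the vanishing point,
    each pair defines one tangent line.
    """
    contour = np.asarray(contour, dtype=float)
    vp = np.asarray(vanishing_point, dtype=float)
    deltas = contour - vp
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])  # tilt angle of each line
    return contour[np.argmin(angles)], contour[np.argmax(angles)]
```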

3.4 Trajectory Prediction

To obtain the current location of a vehicle, we take the bottom points of its 3D bounding box and project them onto the road plane as described in Section 3.1. Which corners of the box form the bottom set depends on the direction of the vehicle (Equation 2).

The center position of a vehicle at frame t is the mean of its projected bottom points B_t,

    C_t = \frac{1}{|B_t|} \sum_{P \in B_t} P,    (3)

and a recent speed is calculated from adjacent frames as

    v_t = f \, \lVert C_t - C_{t-1} \rVert,    (4)

where t denotes the frame number and f is the frame rate of the video. Exponential smoothing is applied to obtain a smoothed speed

    \hat{v}_t = \alpha \, v_t + (1 - \alpha) \, \hat{v}_{t-1}.    (5)

With an optional scale factor from the calibration, we can convert the speed to its real-world value.
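A minimal sketch of the center, speed, and smoothing computations of Equations 3-5 as reconstructed above (function names and units are our own):

```python
import numpy as np

def vehicle_center(bottom_points):
    """Equation 3: center as the mean of the projected bottom points."""
    return np.mean(np.asarray(bottom_points, dtype=float), axis=0)

def raw_speed(center_t, center_prev, fps):
    """Equation 4: speed from two adjacent frames (plane units per second)."""
    return fps * np.linalg.norm(np.asarray(center_t, dtype=float)
                                - np.asarray(center_prev, dtype=float))

def smooth_speed(speed_t, smoothed_prev, alpha):
    """Equation 5: exponential smoothing of the speed estimate."""
    return alpha * speed_t + (1.0 - alpha) * smoothed_prev
```

Multiplying the smoothed value by the scale factor obtained from calibration converts it to a real-world speed such as km/h.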

To predict the trajectories, we assume:

  1. The future is divided into time slots with equal lengths.

  2. The vehicle centers follow normal distributions.

  3. The vehicle shapes do not change.

We predict the speed, acceleration, center coordinates, and variance at the beginning of each slot as a snapshot. Within a slot, we assume a fixed acceleration and variance, so the speed and center coordinates can be calculated according to kinematic equations. In this way, predictions are available for an arbitrary time in the future.

For now, we use a simple linear prediction method with the current measurement as the only snapshot, assuming the acceleration is always zero. Conditional random field [11] and long short-term memory [8] models are planned to be tested in the future.
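A minimal sketch of this constant-velocity prediction per time slot; the variance growth rate is a placeholder of ours, not a value from the paper.

```python
import numpy as np

def predict_snapshots(center, velocity, slot_length, n_slots,
                      var0=0.0, var_growth=0.05):
    """Predict per-slot snapshots under a constant-velocity model.

    center: (x, y) road-plane position, velocity: (vx, vy) per second,
    slot_length: seconds per slot.  Returns a list of
    (predicted center, predicted velocity, position variance) snapshots.
    """
    center = np.asarray(center, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    snapshots = []
    for k in range(1, n_slots + 1):
        t = k * slot_length
        snapshots.append((center + velocity * t,    # zero acceleration
                          velocity.copy(),          # speed assumed unchanged
                          var0 + var_growth * t))   # uncertainty grows with time
    return snapshots
```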

3.5 Danger Recognition

We use two ways to recognize dangerous situations. The first is the distance measurement between vehicles, which not only tells where cars are about to crash but also provides a proactive safety check for areas where cars often get too close. The second is the danger map, which detects overlaps of vehicles in the predictions that indicate crashes.

The distance between two vehicles is defined as the minimum distance over all pairs of points, one from each of the two quadrangles.

Lemma 1.

Let (P, Q) be the pair with the minimum distance among all pairs of points taken from the two quadrangles respectively; then at least one of P and Q must be a vertex.

Proof.

Suppose neither P nor Q is a vertex, so each lies in the interior of an edge, say e_P and e_Q. If e_P is parallel to e_Q, there must be another pair of points including at least one vertex that attains the same distance. If e_P is not parallel to e_Q, then the nearest distance between e_P and e_Q cannot be attained in the interior of both edges, which contradicts the supposition. Therefore, at least one of P and Q is a vertex. ∎

With Lemma 1, we can calculate the minimum distance between two quadrangles Q_1 and Q_2 as

    d'(Q_1, Q_2) = \min_{V \text{ vertex of } Q_1,\; e \text{ edge of } Q_2} d(V, e),    (6)

    d(Q_1, Q_2) = \min\bigl( d'(Q_1, Q_2),\; d'(Q_2, Q_1) \bigr),    (7)

where d(V, e) is the distance between a point and an edge. In this way, the minimum distance is computed from only 32 candidates. Distances are calculated for all vehicle pairs, and an alert is raised when a distance falls below a threshold.
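Following Lemma 1 and Equations 6-7 as given above, the 32-candidate minimum distance can be sketched as follows (a straightforward illustration with our own helper names, not the authors' implementation):

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Distance from point p to the segment with endpoints a and b."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def quadrangle_distance(q1, q2):
    """Minimum distance between two quadrangles (lists of 4 vertices each).

    By Lemma 1 the minimum is attained by a vertex of one quadrangle and an
    edge of the other, so only 2 * 4 * 4 = 32 candidates need to be checked.
    """
    def one_sided(verts, other):
        edges = [(other[i], other[(i + 1) % 4]) for i in range(4)]
        return min(point_segment_distance(v, a, b)
                   for v in verts for a, b in edges)
    return min(one_sided(q1, q2), one_sided(q2, q1))
```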

We accumulate the probability of a car's box according to the distribution of its center to obtain the heat map of a vehicle, which represents the probability of its position at a specific time in the future. We then aggregate the heat maps of all vehicles in a scene into a danger map, which represents the probability that two or more vehicles coexist in the same location. Figure 4 shows a sample result of danger recognition.
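A hedged sketch of the danger-map aggregation on a discretized road-plane grid, assuming the per-cell occupancy probabilities of different vehicles are independent (the grid representation and the independence assumption are ours):

```python
import numpy as np

def danger_map(heat_maps):
    """Probability that two or more vehicles occupy the same grid cell.

    heat_maps: list of HxW arrays, each the occupancy probability of one
    vehicle per road-plane grid cell at the predicted time.
    """
    heat_maps = [np.clip(h, 0.0, 1.0) for h in heat_maps]
    p_none = np.ones_like(heat_maps[0])
    for h in heat_maps:
        p_none *= (1.0 - h)                   # no vehicle in the cell
    p_exactly_one = np.zeros_like(p_none)
    for i, h in enumerate(heat_maps):
        others = np.ones_like(p_none)
        for j, g in enumerate(heat_maps):
            if j != i:
                others *= (1.0 - g)
        p_exactly_one += h * others           # only vehicle i in the cell
    return 1.0 - p_none - p_exactly_one       # at least two vehicles coexist
```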

Figure 4: Danger recognition: four vehicles in a sample prediction on the road plane, with vehicle IDs at the bottom right and speeds at the top left. Distances are shown for nearby vehicles, and the danger map is shown in black at the overlap of vehicles 7 and 10.

4 Experiments

4.1 Dataset and Setup

We use the BrnoCompSpeed dataset [19] to evaluate the performance of our model. It consists of surveillance videos of 6 sessions, each recorded from 3 directions, on highways in the Czech Republic. The dataset provides ground truth distance measurement lines and vehicle speeds from a LiDAR sensor. It also provides calibration results from several systems [19, 18].

We run our model on each of the 18 videos for 10 minutes. The videos are processed at the original resolution of 1080p and downsampled from 50 fps to 25 fps. We do not use a lower frame rate because Deep SORT performs worse below 25 fps. We use the calibration results from [18], which provide vanishing points acquired by manual calibration from parallel lines, along with scale factors inferred from speeds. The smoothing parameter is set according to preliminary experiments. The trajectory prediction is set for 0.12 and 0.24 seconds ahead, corresponding to 3 and 6 frames respectively.

4.2 Calibration Error

We measure the calibration error to test the provided calibration results and the correctness of our coordinate transformation algorithm, which maps a point from image space to the road plane.

We calculate the distance between the endpoints of the given measurement lines in our plane coordinate system. The lines are divided into two groups according to their directions: along the road or across it. The average length of the given lines differs between the groups. We collect absolute and relative errors of the measured distances and report the mean and median values in Table 1.

                              Mean      Median
Distance along the road
  Absolute Error (m)          0.2618    0.1684
  Relative Error              1.80%     1.42%
Distance across the road
  Absolute Error (m)          0.1633    0.1646
  Relative Error              2.06%     2.07%
Table 1: Calibration error: distance measurement error on all videos.

The results show that our model can accurately measure distances in the real world based solely on surveillance camera views and calibration parameters. The error in each direction is much smaller than the dimensions of typical vehicles. Given the high speeds on the highway, these errors are even smaller than the movement of a vehicle between two adjacent frames. The model thus provides an effective 3D reconstruction of the road plane with little error.

4.3 Vehicle Detection and Tracking Error

We measure vehicle detection and tracking error to evaluate the Mask R-CNN and Deep SORT models. For each vehicle detected and tracked, we record the time and position of every appearance. Based on this appearance history, we estimate the period during which the vehicle is inside the measurement area of the LiDAR sensors; if more than two LiDAR sensors are set up, the largest measurement area is used. We then calculate the intersection over union (IoU) between the estimated period and the true period of existence in the ground truth to obtain a similarity matrix, and employ the Hungarian algorithm [10] to solve the matching problem. Matches with an IoU below a threshold are dropped. We report the recall on each video in Table 4.
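A hedged sketch of this matching step using SciPy's Hungarian solver (`linear_sum_assignment`); the IoU threshold value here is our placeholder, not the one used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def temporal_iou(a, b):
    """IoU of two time intervals a = (start, end) and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_vehicles(estimated, ground_truth, iou_threshold=0.3):
    """Match estimated vehicle periods to ground-truth periods by temporal IoU."""
    iou = np.array([[temporal_iou(e, g) for g in ground_truth]
                    for e in estimated])
    rows, cols = linear_sum_assignment(-iou)      # maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_threshold]
```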

We find that Mask R-CNN sometimes fails at certain viewing angles or for certain types of vehicles. For lost detections, Deep SORT can still track as long as the gap is short enough; in other cases, tracking also fails, which causes the losses. Despite this, the combination of Mask R-CNN and Deep SORT achieves an overall recall of 94.0% (Table 4), which shows that it is effective for vehicle detection and tracking in this task.

4.4 Speed Estimation Error

We use the matched vehicles from the previous section to evaluate the performance of our speed estimation. As the ground truth only provides an average speed for each vehicle, we use the smoothed speed of a vehicle at its last appearance for comparison. We collect the absolute and relative errors of the estimated speed of each vehicle and report them in Table 2.

                          Mean      Median
Absolute Error (km/h)     2.7708    1.8625
Relative Error            3.68%     2.55%
Table 2: Speed estimation error: estimated speeds compared with ground truth from LiDAR sensors on all videos.

According to the dataset, the average speed in each session is mostly between 60 km/h and 90 km/h. For highway traffic, a mean error of 2.77 km/h shows that our model can precisely measure vehicle speeds. This accurate measurement is the foundation for the subsequent predictions and danger recognition.

4.5 Prediction Error

We evaluate the two levels of prediction separately. For each level, we collect the absolute error of the location prediction, plus the absolute and relative errors of the speed prediction, for each vehicle. As the smoothed speed is not stable at the beginning of a track, predictions from vehicles with a history of fewer than 5 frames (0.2 seconds) are excluded. The mean and median values of each metric are shown in Table 3.

Level +0.12s                                Mean      Median
Location Prediction   Absolute Error (m)    0.2433    0.1736
Speed Prediction      Absolute Error (km/h) 2.5313    1.8373
                      Relative Error        4.55%     2.52%

Level +0.24s                                Mean      Median
Location Prediction   Absolute Error (m)    0.3563    0.3256
Speed Prediction      Absolute Error (km/h) 3.0134    2.4995
                      Relative Error        5.71%     3.92%
Table 3: Prediction error: location and speed prediction error at the two prediction levels on all videos.

Although the prediction mechanism currently deployed is rather simple, it provides results well beyond our expectations. As a vehicle at 75 km/h moves about 2.5 meters in 0.12 seconds, a mean error of 0.24 m for location prediction is quite acceptable. The gap between the mean and median values indicates that some outliers hurt the performance, but most predictions are still within an error of 2 km/h. For highway traffic, crashes usually develop within 0.12 seconds, so this horizon is sufficient for the danger map to work. The +0.24 s prediction provides additional advance information, and it is reasonable for it to have a slightly larger error than the +0.12 s prediction.

Video ID 1C 1L 1R 2C 2L 2R 3C 3L 3R 4C
Vehicle Matching Recall 95.0% 92.2% 97.0% 82.2% 92.6% 92.5% 81.8% 100% 100% 92.4%
Video ID 4L 4R 5C 5L 5R 6C 6L 6R Mean
Vehicle Matching Recall 93.2% 98.2% 83.0% 98.9% 97.5% 98.8% 99.5% 96.4% 94.0%
Table 4: Vehicle detection and tracking error: vehicle matching recall on each video. The number in Video ID is Session ID, and the letter denotes the direction according to C-center, L-left, R-right.

5 Conclusions

We propose a traffic danger recognition model that works with arbitrary surveillance cameras and does not require any labeled training data of crashes. The model consists of five steps: camera calibration, object detection and tracking, 3D bounding box construction, trajectory prediction, and danger recognition. We evaluate each step experimentally, showing that the model accurately estimates the speed and position of vehicles by projecting them onto a 3D-reconstructed road plane. It is suitable for crash detection and proactive safety checks.

A demo of our model on a real crash scene is available on YouTube (https://www.youtube.com/playlist?list=PLssAerj8zfUR5wBc7N6gmCFTm0azCHSIf). In the future, a complete test set of videos containing real crashes will be processed to report detection accuracy. The trajectory prediction model could be improved with conditional random fields or recurrent neural networks. We will also test automatic camera calibration methods to obtain performance similar to manual calibration, so that the system can work on arbitrary surveillance cameras with zero manual input.

References

  • [1] W. Abdulla. Mask r-cnn for object detection and instance segmentation on keras and tensorflow. https://github.com/matterport/Mask_RCNN, 2017.
  • [2] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3464–3468. IEEE, 2016.
  • [3] D. J. Dailey, F. W. Cathey, and S. Pumrin. An algorithm to estimate mean traffic speed using uncalibrated cameras. IEEE Transactions on Intelligent Transportation Systems, 1(2):98–107, 2000.
  • [4] M. Dubská, A. Herout, R. Juránek, and J. Sochor. Fully automatic roadside camera calibration for traffic surveillance. IEEE Transactions on Intelligent Transportation Systems, 16(3):1162–1171, 2015.
  • [5] M. Dubská, A. Herout, and J. Sochor. Automatic camera calibration for traffic understanding. In BMVC, volume 4, page 8, 2014.
  • [6] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [9] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
  • [10] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • [11] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
  • [12] S. C. Lee and R. Nevatia. Robust camera calibration tool for video surveillance camera in urban environment. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 62–67. IEEE, 2011.
  • [13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [14] T.-W. Pai, W.-J. Juang, and L.-J. Wang. An adaptive windowing prediction algorithm for vehicle speed estimation. In Intelligent Transportation Systems, 2001. Proceedings. 2001 IEEE, pages 901–906. IEEE, 2001.
  • [15] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [16] T. N. Schoepflin and D. J. Dailey. Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation. IEEE Transactions on Intelligent Transportation Systems, 4(2):90–98, 2003.
  • [17] J. Sochor. Traffic analysis from video. Diplomová práce, Brno University of Technology, Faculty of Information Technology, 2014.
  • [18] J. Sochor, R. Juránek, and A. Herout. Traffic surveillance camera calibration by 3d model bounding box alignment for accurate vehicle speed measurement. Computer Vision and Image Understanding, 161:87–98, 2017.
  • [19] J. Sochor, R. Juránek, J. Špaňhel, L. Maršík, A. Širokỳ, A. Herout, and P. Zemčík. Comprehensive data set for automatic single camera visual speed measurement. IEEE Transactions on Intelligent Transportation Systems, 2018.
  • [20] W. Sultani, C. Chen, and M. Shah. Real-world anomaly detection in surveillance videos. Center for Research in Computer Vision (CRCV), University of Central Florida (UCF), 2018.
  • [21] S. Suzuki et al. Topological structural analysis of digitized binary images by border following. Computer vision, graphics, and image processing, 30(1):32–46, 1985.
  • [22] K. Wang, H. Huang, Y. Li, and F.-Y. Wang. Research on lane-marking line based camera calibration. In Vehicular Electronics and Safety, 2007. ICVES. IEEE International Conference on, pages 1–6. IEEE, 2007.
  • [23] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In Image Processing (ICIP), 2017 IEEE International Conference on, pages 3645–3649. IEEE, 2017.
  • [24] Y. Xu, X. Ouyang, Y. Cheng, S. Yu, L. Xiong, C.-C. Ng, S. Pranata, S. Shen, and J. Xing. Dual-mode vehicle motion pattern learning for high performance road traffic anomaly detection. In CVPR Workshop (CVPRW) on the AI City Challenge, 2018.