RWF-2000: An Open Large Scale Video Database for Violence Detection
In recent years, surveillance cameras have been widely deployed in public places, and the general crime rate has been reduced significantly due to these ubiquitous devices. However, these cameras usually provide cues and evidence after crimes are committed; they are rarely used to prevent or stop criminal activities in time. Manually monitoring the large amount of video data from surveillance cameras is both time- and labor-consuming. Therefore, automatically recognizing violent behaviors from video signals becomes essential. In this paper, we summarize several existing video datasets for violence detection and propose a new video dataset (https://github.com/mchengny/RWF2000-Video-Database-for-Violence-Detection) with more than 2,000 videos captured by surveillance cameras in real-world scenes. We also present a new method, named Flow Gated Network, that utilizes the merits of both 3D-CNNs and optical flow. The proposed approach obtains an accuracy of 86.75% on the test set of the proposed RWF-2000 database.
Ming Cheng, Kunjing Cai, Ming Li
Data Science Research Center, Duke Kunshan University, Kunshan, China
School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
Index Terms: Violence detection, Video classification, Optical flow
1 Introduction
Recently, video-based violent behavior detection has attracted more and more attention. There is an increasing number of surveillance cameras in public places, which can collect evidence and deter potential criminals. However, it is too expensive to manually monitor such a large amount of video data in real time. Thus, automatically recognizing criminal scenes from videos becomes both essential and challenging.
Generally, video-based violence detection is defined as detecting violent behaviors in video data. It can be viewed as a subset of human action recognition, which aims at recognizing general human actions. Compared to still images, video data carries an additional temporal dimension. A set of consecutive frames can represent a continuous motion, and neighboring frames contain much redundant information due to high inter-frame correlation. Thus, many researchers have devoted their efforts to properly fusing spatial and temporal information.
Some early methods rely on detecting the presence of highly relevant objects (e.g., gun shooting, blaze, blood, blast) rather than directly recognizing the violent events [3, 1, 2]. In 2011, Nievas et al. [11, 12] first released two video datasets for violence detection, namely the Hockey Fight dataset and the Movies Fight dataset. Moreover, in 2012, Hassner et al. [9] proposed the Crowd Violence dataset. Since then, most works have focused on developing methods that directly recognize violence in videos.
Methods that directly recognize violence from videos mainly fall into two categories: traditional feature extraction with shallow classifiers, and end-to-end deep learning frameworks. In the former, researchers try to build a powerful set of video descriptors and feed them to a linear classifier (e.g., SVM). Based on this principle, many classical methods were presented: ViF, STIPs, SCOF, iDT, etc. In recent years, many end-to-end deep learning based methods have been proposed, e.g., the two-stream method, ConvLSTM, C3D, TSN, and ECO. Currently, most state-of-the-art results are obtained by deep learning based methods.
Although there are some existing video datasets for violence detection, they still suffer from small scale, limited diversity, and low imaging resolution. To address the problem of insufficient high-quality data, we collect a new video dataset (RWF-2000) and freely release it to the research community. This dataset has a larger scale than previous ones. From the modeling perspective, it is very difficult to design a powerful hand-crafted feature extractor due to the complexity of video data. Hence, we present a novel model with a self-learned fusion mechanism, which can exploit both appearance and temporal features well.
This paper is organized as follows. Section 2 summarizes three widely used datasets for violence detection and presents a new one with large scale and rich diversity. Section 3 presents both a novel cropping strategy to reduce the size of the input video and our method that utilizes both RGB videos and the optical flow. Section 4 presents experimental results on the proposed dataset and also compares our method with others on some standard datasets. Section 5 gives a brief conclusion and discusses some planned future works.
2 Proposed Dataset
The Hockey Fight dataset, the Movies Fight dataset, and the Crowd Violence dataset are widely used for video-based violence detection [11, 12, 9]; however, they still suffer from small scale and limited diversity.
The Hockey Fight dataset contains 1,000 video clips extracted from hockey games of the National Hockey League (NHL). Each clip is approximately 2 seconds long with a resolution of 360×288 and is labeled as a fight or non-fight action. Since all the videos are taken in a single scene, the diversity of this dataset is limited.
The Movies Fight dataset contains 200 video clips extracted from various action movies. Half of the videos are labeled as fight actions, and the others belong to non-fight actions. This dataset comes from different movies with diverse scenes, while the number of videos is too small. Currently, it is no longer suitable to be used in the deep learning domain.
As for the Crowd Violence dataset, it is mainly designed to evaluate violence detection in crowded scenes. It contains 246 video clips captured in crowded places (e.g., parades, stadiums). This dataset is close to real-world scenes; however, the number of videos is still insufficient, and its imaging quality is inferior.
To address these problems, we collect a new Real-World Fighting (RWF) 2000 dataset from YouTube, which consists of 2,000 video clips captured by surveillance cameras in real-world scenes. Figure 1 offers a glance at the RWF-2000 dataset. Each video file is a 5-second clip at 30 fps. Half of the videos contain violent behaviors, while the others show non-violent actions. All videos in this dataset are captured by surveillance cameras in the real world, and none of them have been modified by multimedia technologies. Therefore, they are close to real violent events captured by surveillance cameras and can be used for real applications. Furthermore, the number of videos is much larger than in previous datasets for violence detection (shown in Figure 2).
3 Proposed Method
3.1 Cropping and Sampling
Since consecutive frames in a video are highly correlated, and the region of interest for recognizing human activity is usually a small area, we implement both cropping and sampling strategies to reduce the amount of input video data.
Firstly, we employ Farneback's method to compute the dense optical flow between neighboring frames. The computed dense optical flow is a field of 2-D displacement vectors, so we calculate the norm of each vector to obtain a heat map indicating the motion intensity. We take the sum of all the heat maps as the final motion intensity map, and the region of interest is then extracted from the location with the most significant motion intensity (shown in Figure 4). Secondly, we implement a sparse sampling strategy: for each input video, we sparsely sample frames at a uniform interval to generate a fixed-length video clip.
By adopting both cropping and sampling, the amount of input data is significantly reduced. In this project, the target length of the video clip is set to 64 frames, and the size of the cropped region is 224×224.
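The two steps above can be sketched in a few lines of numpy. This is a minimal illustration rather than the released code: it assumes the dense flow has already been computed (e.g., by Farneback's method in OpenCV), and all function and argument names are ours.

```python
import numpy as np

def crop_and_sample(frames, flow, out_len=64, crop=224):
    """Crop around the strongest motion, then uniformly sample frames.

    frames: (T, H, W, 3) uint8 video; flow: (T-1, H, W, 2) dense optical
    flow (e.g., from cv2.calcOpticalFlowFarneback). Illustrative only.
    """
    # Motion intensity map: per-pixel flow norm, summed over all frame pairs.
    intensity = np.linalg.norm(flow, axis=-1).sum(axis=0)          # (H, W)
    # Center the crop window on the most intense location, clipped so the
    # window stays inside the frame.
    H, W = intensity.shape
    y, x = np.unravel_index(np.argmax(intensity), intensity.shape)
    y0 = int(np.clip(y - crop // 2, 0, max(H - crop, 0)))
    x0 = int(np.clip(x - crop // 2, 0, max(W - crop, 0)))
    cropped = frames[:, y0:y0 + crop, x0:x0 + crop]
    # Sparse sampling: out_len frame indices at a uniform interval.
    idx = np.linspace(0, len(cropped) - 1, out_len).astype(int)
    return cropped[idx]                                            # (out_len, crop, crop, 3)
```

For a 150-frame 240×320 input, this returns a (64, 224, 224, 3) clip centered on the region of maximal accumulated motion.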
3.2 Flow Gated Network
Most previous methods try to extract appearance features from individual frames and then fuse them to model temporal information. Ng et al. summarized various architectures for temporal feature pooling, but most of them are human-designed and tested one by one. Since the utilization of motion information may be limited by such coarse pooling mechanisms, we design a temporal pooling mechanism that is learned by the network itself.
Figure 3 shows the structure of our proposed model, which consists of four parts: the RGB channel, the Optical Flow channel, the Merging Block, and the fully connected (FC) layers. The RGB channel and the Optical Flow channel are built from cascaded 3D CNNs, and they have consistent structures so that their outputs can be fused. The Merging Block is also composed of basic 3D CNNs and processes the information after the self-learned temporal pooling. Finally, the FC layers generate the output.
The highlight of this model is that a branch of the optical flow channel is utilized to build the pooling mechanism. ReLU activation is adopted at the end of the RGB channel, while the sigmoid function is placed at the end of the Optical Flow channel. The outputs of the two channels are then multiplied together and processed by temporal max-pooling. Since the output of the sigmoid function lies between 0 and 1, it acts as a scaling factor that adjusts the output of the RGB channel. Meanwhile, since max-pooling preserves only local maxima, RGB-channel outputs multiplied by values close to one are more likely to be retained, while those multiplied by values close to zero are more likely to be dropped. This is a kind of self-learned pooling mechanism, which utilizes a branch of the optical flow as a gate to determine what information the model should preserve or drop. The detailed parameters of the model structure are described in Table 1.
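The gating step can be illustrated with a few lines of numpy; the shapes and names here are hypothetical, and the snippet shows only the fusion itself, not the full 3D-CNN branches.

```python
import numpy as np

def flow_gate(rgb_feat, flow_feat):
    """Fuse the two branches: ReLU on the RGB branch, sigmoid gate from
    the flow branch, element-wise product, then temporal max-pooling.

    rgb_feat, flow_feat: (T, H, W, C) pre-activation outputs of the two
    3D-CNN branches (illustrative shapes, not the paper's exact sizes).
    """
    gate = 1.0 / (1.0 + np.exp(-flow_feat))          # sigmoid in (0, 1)
    gated = np.maximum(rgb_feat, 0.0) * gate         # ReLU(RGB) scaled by the gate
    # Temporal max-pooling: time steps whose gate is near zero rarely win
    # the maximum, so the flow branch decides what survives.
    return gated.max(axis=0)                         # (H, W, C)
```

When `flow_feat` is strongly negative everywhere (gate near 0), the pooled output is suppressed toward zero, which is exactly the drop behavior described above.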
| Block Name | Type | Filter Shape | t |
| --- | --- | --- | --- |
| FC layers | FC layer | 128 | 1 |
Table 3: Comparison with other methods on the Hockey Fight dataset.

| Method | Accuracy (%) |
| --- | --- |
| HOF + BoW | 88.6 |
| MoSIFT + BoW | 90.9 |
| HOG + BoW | 91.7 |
| MoWLD + BoW | 91.9 |
| MoWLD + Sparse Coding | 93.7 |
| MoSIFT + KDE + Sparse Coding | 94.3 |
| MoWLD + KDE + Sparse Coding | 94.9 |
| MoIWLD + KDE + SRC | 96.8 |
| Flow Gated Network (ours) | 98 |
Table 4: Comparison with other methods on the Movies Fight dataset.

| Method | Accuracy (%) |
| --- | --- |
| HOG + BoW | 49 |
| HOF + BoW | 59 |
| MoSIFT + BoW | 84.2 |
| Extreme Acceleration + SVM | 85.4 |
| Extreme Acceleration + AdaBoost | 99.97 |
| Flow Gated Network (ours) | 100 |
Table 5: Comparison with other methods on the Crowd Violence dataset.

| Method | Accuracy (%) |
| --- | --- |
| MoWLD + BoW | 82.56 |
| Flow Gated Network (ours) | 84.44 |
| MoWLD + Sparse Coding | 86.39 |
| MoSIFT + KDE + Sparse Coding | 89.05 |
| MoWLD + KDE + Sparse Coding | 89.78 |
| MoIWLD + KDE + SRC | 93.19 |
4 Experimental Results
We split the RWF-2000 dataset into a training set (70%), a validation set (10%), and a test set (20%). During training, we use the SGD optimizer with momentum (0.9) and learning rate decay (1e-6). We also introduce brightness transformation and random rotation for data augmentation, which may help simulate the various lighting conditions and camera perspectives in real-world scenes.
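A rough sketch of these two augmentations is given below; this is a pure-numpy version with illustrative parameter ranges, since the paper does not specify its exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_clip(clip, angle_deg):
    """Rotate every frame of a (T, H, W, C) clip by the same angle
    (nearest-neighbor resampling, border pixels clamped)."""
    T, H, W, C = clip.shape
    th = np.deg2rad(angle_deg)
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Inverse mapping: for each output pixel, locate its source pixel.
    sy = cy + (ys - cy) * np.cos(th) - (xs - cx) * np.sin(th)
    sx = cx + (ys - cy) * np.sin(th) + (xs - cx) * np.cos(th)
    sy = np.clip(np.round(sy).astype(int), 0, H - 1)
    sx = np.clip(np.round(sx).astype(int), 0, W - 1)
    return clip[:, sy, sx]

def augment(clip, max_shift=32.0, max_angle=10.0):
    """Random brightness shift followed by a random small rotation.
    max_shift/max_angle are illustrative guesses, not the paper's values."""
    shift = rng.uniform(-max_shift, max_shift)
    bright = np.clip(clip.astype(np.float32) + shift, 0, 255).astype(np.uint8)
    return rotate_clip(bright, rng.uniform(-max_angle, max_angle))
```

Applying the same brightness offset and rotation angle to all frames of a clip keeps the augmentation temporally consistent, which matters for motion-based features.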
After 6,000 iterations of training, our model obtains an accuracy of 86.75% on the test set (shown in Table 2). Additionally, we evaluate the proposed model on three public datasets for violence detection and compare it with other approaches, including both traditional and deep learning based methods.
Table 3 presents the comparison of our model with others on the Hockey Fight dataset. It shows that our model achieves relatively higher accuracy.
Table 4 shows the experimental results on the Movies Fight dataset. Many methods have achieved nearly 100% accuracy, which indicates that this dataset is no longer suitable for model evaluation due to the potential over-fitting problem.
Table 5 reports the accuracy of our model on the Crowd Violence dataset. Compared to the others, our model does not obtain the best performance. This is because the input of our model relies on computing the optical flow, and the poor imaging quality of this dataset may introduce large errors into the estimated flow.
5 Conclusion
This paper presents both a novel dataset and a method for violence detection in surveillance videos. The proposed RWF-2000 dataset is currently the largest surveillance video dataset for violence detection. Moreover, a special pooling mechanism driven by optical flow is employed, which learns temporal feature pooling instead of relying on human-designed strategies. In the future, we will explore methods without the optical flow channel. Also, to train a more robust and reliable model, the size of the dataset needs to be further expanded.
This research was funded in part by the National Natural Science Foundation of China (61773413), Natural Science Foundation of Guangzhou City (201707010363), Six Talent Peaks project in Jiangsu Province (JY-074), Science and Technology Program of Guangzhou City (201903010040).
-  (2011) Violence detection in movies. In CGIV, pp. 119–124. Cited by: §1.
-  (2003) Semantic context detection based on hierarchical audio models. In MIR, pp. 109–115. Cited by: §1.
-  (2005) DOVE: detection of movie violence using motion intensity analysis on skin and blood. Philippine Computing Science Congress 6, pp. 150–156. Cited by: §1.
-  (2010) Violence detection in video using spatio-temporal features. In SIBGRAPI, pp. 224–230. Cited by: §1.
-  (2014) Fast violence detection in video. In VISAPP, Vol. 2, pp. 478–485. Cited by: Table 4.
-  (2014) Violence detection in video by using 3d convolutional neural networks. In ISVC, pp. 551–558. Cited by: Table 3.
-  (2015) Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pp. 2625–2634. Cited by: §1.
-  (2003) Two-frame motion estimation based on polynomial expansion. In SCIA, pp. 363–370. Cited by: §3.1.
-  (2012) Violent flows: real-time detection of violent crowd behavior. In CVPRW, pp. 1–6. Cited by: §1, §1, §2, Table 5.
-  (2014) Detection of violent crowd behavior based on statistical characteristics of the optical flow. In FSKD, pp. 565–569. Cited by: §1.
-  (2011) Hockey fight detection dataset. In CAIP, pp. 332–339. Cited by: §1, §2.
-  (2011) Movies fight detection dataset. In CAIP, pp. 332–339. Cited by: §1, §2.
-  (2011) Violence detection in video using computer vision techniques. In CAIP, pp. 332–339. Cited by: Table 3, Table 4.
-  (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: §1.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, pp. 4489–4497. Cited by: §1.
-  (2013) Action recognition with improved trajectories. In ICCV, pp. 3551–3558. Cited by: §1.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV, pp. 20–36. Cited by: §1.
-  (2014) Violent video detection based on mosift feature and sparse coding. In ICASSP, pp. 3538–3542. Cited by: Table 3, Table 5.
-  (2015) Beyond short snippets: deep networks for video classification. In CVPR, pp. 4694–4702. Cited by: §3.2.
-  (2016) Discriminative dictionary learning with motion weber local descriptor for violence detection. IEEE Transactions on Circuits and Systems for Video Technology 27 (3), pp. 696–709. Cited by: Table 3, Table 5.
-  (2017) MoWLD: a robust motion image descriptor for violence detection. Multimedia Tools and Applications 76 (1), pp. 1419–1438. Cited by: Table 3, Table 5.
-  (2016) A new method for violence detection in surveillance scenes. Multimedia Tools and Applications 75 (12), pp. 7327–7349. Cited by: Table 5.
-  (2017) Violent interaction detection in video based on deep learning. In Journal of Physics: Conference Series, Vol. 844, pp. 012044. Cited by: Table 3, Table 4.
-  (2018) Eco: efficient convolutional network for online video understanding. In ECCV, pp. 695–712. Cited by: §1.