Multi-Stream Single Shot Spatial-Temporal Action Detection

Abstract

We present a 3D Convolutional Neural Network (CNN) based single shot detector for spatial-temporal action detection tasks. Our model includes: (i) two short-term appearance and motion streams, taking a single RGB image and a single optical flow image as input, respectively, to capture the spatial and temporal information of the current frame; (ii) two long-term 3D ConvNet based streams, working on sequences of consecutive RGB and optical flow images to capture the context from past frames. Our model achieves strong performance for action detection in video and can be easily integrated into any current two-stream action detection method. We report a frame-mAP of 71.30% on the challenging UCF101-24 [journals/corr/abs-1212-0402] actions dataset, achieving the state-of-the-art result among one-stage methods. To the best of our knowledge, our work is the first system to combine 3D CNNs and SSD for action detection tasks.

© IEEE 2019. To appear in The 26th IEEE International Conference on Image Processing (ICIP).

Pengfei Zhang, Yu Cao, Benyuan Liu
Department of Computer Science, University of Massachusetts Lowell, USA

Keywords: Action Detection, spatial-temporal action localization, 3D convolutional neural networks, SSD

1 Introduction

The objective of action detection is to recognize and localize all human action instances in a given video across both space and time. It is a fundamental task for video understanding and is important for practical applications such as video surveillance and human-robot interaction. Action detection is challenging due to two main difficulties: (i) it is hard to capture visual representations in the large spatial-temporal search space; (ii) it is difficult to analyze a video both quickly and accurately, while detection speed is essential for many application scenarios such as fall and violence detection.

To investigate spatial-temporal video representations, researchers have leveraged hand-crafted features such as dense trajectories [NIPS2014_5353, 6751553] and optical flow [NIPS2014_5353, 7410719] to build two-stream networks [6909619] that combine spatial and temporal information. However, most of these approaches overlook a fundamental issue in action detection, namely, the specific representation of spatial-temporal information for various actions. Many of them only use optical flow, an estimation of the motion of each pixel between two images, as the source of temporal information. Conventional approaches estimate optical flow between adjacent frames, which only represents temporal information over short time periods but lacks the long-term information that is also important for human action recognition. With the success that 2D CNNs have achieved in visual representation on the spatial domain, it is natural to extend them to 3D to capture both spatial and temporal information. In action recognition tasks, even though 3D CNNs (I3D [8099985]) have achieved the best results so far, the improvement brought by 3D CNNs, compared to 2D CNN approaches based on hand-crafted features [8237852], has not reached its full potential. In this work, we revisit the role of optical flow and 3D convolution in temporal reasoning for action detection. To explore the contributions of the four streams, 2D RGB, 2D optical flow (OF), 3D RGB and 3D OF, in action detection, we propose a multi-stream architecture and examine the performance of different stream combinations for various types of actions. We demonstrate that, in the single-stream setting, the 3D CNN based model outperforms its 2D counterpart for both RGB and optical flow inputs. However, in a two-stream framework, the best choice of appearance and motion streams differs across actions due to the large intra-class variability. As a result, the best frame-level mean average precision (mAP) is achieved by the fusion of all four streams, which adapts to a variety of actions.

As for the second challenge, detection speed, although many conventional action detection methods [7410719, peng:hal-01349107, 7298676, 8578731, ren15fasterrcnn] achieve good results, their two-stage architectures perform region proposal and classification in two separate steps. While accuracy is improved, this significantly slows down detection, making these methods too slow for many realistic scenarios. To accelerate detection, inspired by [8237655, 8237734], our model adopts the one-stage Single Shot MultiBox Detector (SSD) [DBLP:conf/eccv/LiuAESRFB16] as the detection framework. It merges the two stages into a single network, carries out localization and classification simultaneously, and thus accelerates the entire process.

The key contributions of this paper include: (i) we leverage the single-stage object detection architecture SSD to build a time-efficient action detector; (ii) we explore different combinations of 2D and 3D streams for the detection task on a variety of action videos; (iii) experimental results show that our model outperforms previous one-stage action detection methods on the challenging untrimmed sports video dataset UCF101-24.

2 Related work

Our research builds on previous works in two fields:
Spatial-temporal action localization. Gkioxari and Malik [7298676] applied a two-stream R-CNN based framework to produce frame-level detections, and then linked the results into tubes with a dynamic programming method. Weinzaepfel et al. [7410719] extracted EdgeBoxes as action proposals and then used a tracking-by-detection method instead of the linking method. Both Saha et al. [ren15fasterrcnn] and Peng et al. [peng:hal-01349107] leveraged two-stream Faster R-CNN for action detection. Singh et al. [8237655] applied the single-stage detection method SSD to perform online detection. Kalogeiton et al. [8237734] extended SSD's anchor boxes to anchor cuboids to generate spatial-temporal proposals.
3D CNN. Ji et al. [6165309] and Tran et al. [7410867] extended the 2D convolutional kernel to 3 dimensions, and many subsequent studies such as I3D and P3D [8237852] have achieved success in video related tasks. The most recent state-of-the-art result is achieved by Gu et al. [8578731] based on I3D and Faster R-CNN. Hou et al. [hou2017end] designed a C3D version of a one-stage action detection method; however, it is an offline algorithm that cannot perform frame-level incremental detection.

3 Model Description

Multi-stream model. The architecture of our model is illustrated in Fig. 1. Our model consists of 4 streams: 2D and 3D RGB streams, and 2D and 3D optical flow streams. The conventional 2D RGB and optical flow streams are employed to capture short-term spatial-temporal features, while the 3D streams are added to learn long-term features. The two 2D streams share the same architecture, but are trained individually and have their own parameters. The same applies to the 3D streams.

For the target action instances at time $t$, the 2D RGB stream's input is the current frame $I_t$, while the input of the 2D optical flow stream is extracted from the frame pair $(I_{t-1}, I_t)$ using Brox et al.'s [Brox2004] method. The input dimension of both 2D streams is $C \times H \times W$, where $C$, $H$ and $W$ denote the number of channels, the height and the width of the input frame, respectively. To perform spatial-temporal reasoning with the 3D CNN, the 3D RGB stream's input is a sequence of $K$ consecutive frames $(I_{t-K+1}, \dots, I_t)$. Similarly, the 3D optical flow stream's input is the $K$ optical flow images extracted from the RGB frame pairs $(I_{t-K}, I_{t-K+1})$ to $(I_{t-1}, I_t)$. The input dimension is $C \times K \times H \times W$. The clip length $K$ is fixed in our experiments.
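As a concrete illustration, the sketch below shows how the stream inputs could be assembled as tensors (PyTorch is used only for illustration; the 300×300 resolution and the value of K are assumptions rather than the exact settings of the paper):

```python
import torch

# Illustrative shapes; C, H, W and the clip length K follow the notation above.
C, H, W, K = 3, 300, 300, 8  # 300x300 and K=8 are assumptions for this sketch

# 2D streams: a single RGB frame I_t (or a single optical flow image),
# shaped C x H x W.
rgb_2d_input = torch.randn(C, H, W)

# 3D streams: a clip of K consecutive frames ending at time t,
# stacked along a new temporal axis, shaped C x K x H x W.
frames = [torch.randn(C, H, W) for _ in range(K)]
rgb_3d_input = torch.stack(frames, dim=1)   # -> (C, K, H, W)

assert rgb_3d_input.shape == (C, K, H, W)
```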

Figure 1: Illustration of the proposed two-stream architecture. 3D SSD takes consecutive video frames as input and extracts both spatial and temporal information.

2D SSD network. Each of the 2D networks consists of 3 main parts: a backbone network, extra convolutional layers and detection heads. The backbone network is a truncated VGG-16 whose last two fully connected layers, fc6 and fc7, are converted to convolutional layers. Eight extra layers are added to the end of the backbone network to predict the default bounding boxes' offsets and their confidences for actions. Each of the selected layers has a different spatial output dimension, representing action instances at different scales. The final predictions are produced synchronously by two detection heads: a localization head and a classification head. We use a VGG-16 model pretrained on ImageNet to initialize our model and fine-tune it on the action dataset.
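A minimal sketch of the two detection heads is given below (PyTorch, for illustration only; the channel counts, anchor numbers and class names are assumptions, not the exact configuration used here):

```python
import torch
import torch.nn as nn

class SSDHeads(nn.Module):
    """Illustrative localization/classification heads attached to a set of
    feature maps of different spatial scales."""
    def __init__(self, feature_channels, num_anchors, num_classes):
        super().__init__()
        # One 3x3 conv per selected feature map for box offsets (4 per anchor)
        self.loc = nn.ModuleList(
            nn.Conv2d(c, num_anchors * 4, kernel_size=3, padding=1)
            for c in feature_channels)
        # One 3x3 conv per selected feature map for class confidences
        self.conf = nn.ModuleList(
            nn.Conv2d(c, num_anchors * num_classes, kernel_size=3, padding=1)
            for c in feature_channels)

    def forward(self, feature_maps):
        # Each selected layer predicts offsets and confidences for default
        # boxes at its own scale; predictions are flattened and concatenated.
        locs, confs = [], []
        for fmap, loc_layer, conf_layer in zip(feature_maps, self.loc, self.conf):
            locs.append(loc_layer(fmap).permute(0, 2, 3, 1).flatten(start_dim=1))
            confs.append(conf_layer(fmap).permute(0, 2, 3, 1).flatten(start_dim=1))
        return torch.cat(locs, dim=1), torch.cat(confs, dim=1)
```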
3D SSD network. As for the 3D streams, keeping the extra layers and detection heads unchanged, we inflate all convolutional and pooling layers in the backbone network from 2D to 3D, then apply temporal pooling to bridge the gap between the 3D and 2D networks. To initialize the network, we repeat the weights of the pretrained model's 2D kernels T times, where T is the size of the inflated kernel in the temporal dimension. In our model, we convert all 3×3 kernels to 3×3×3 kernels, and set the temporal padding of all layers and the temporal stride of the pooling layers accordingly.
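The inflation of a pretrained 2D convolution could look as follows (an illustrative PyTorch sketch; the temporal padding choice and the optional rescaling by T are assumptions, not details confirmed by the paper):

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, T: int) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into 3D by repeating its weights
    T times along a new temporal axis, as described above. Some
    implementations also divide the weights by T to preserve activation
    magnitudes; that is an optional variation, not used here."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(T,) + conv2d.kernel_size,
        stride=(1,) + conv2d.stride,
        padding=(T // 2,) + conv2d.padding,   # temporal padding is an assumption
        bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                              # (out, in, kH, kW)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, T, 1, 1))
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage sketch: inflate one VGG-16 layer with a temporal kernel size of 3.
vgg_layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)
inflated = inflate_conv2d(vgg_layer, T=3)
```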
Temporal pooling. We connect the 3D and 2D layers with a temporal pooling layer. This layer performs mean-pooling along the temporal dimension, transforming an input feature map of dimension $C \times K \times H \times W$ into an output of dimension $C \times H \times W$.
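A minimal sketch of this bridging step (PyTorch; the feature map shape is illustrative):

```python
import torch

def temporal_pool(x: torch.Tensor) -> torch.Tensor:
    """Mean-pool a 3D feature map over its temporal axis, turning a
    (N, C, K, H, W) tensor into (N, C, H, W) so that it can feed the
    2D extra layers and detection heads."""
    return x.mean(dim=2)

features_3d = torch.randn(1, 512, 8, 19, 19)   # illustrative shape
features_2d = temporal_pool(features_3d)        # -> (1, 512, 19, 19)
```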

Figure 2: Details of the inflated 3D backbone.

Figure 3: UCF101-24 frame-level average precision for each action class compared to the 2D RGB baseline; the value for each class is computed as the difference in AP from the baseline.

Fusion Method. We adopt late fusion [6909619, NIPS2014_5353, 7780582] to merge the spatial and temporal information of the streams. In this step, we first choose one stream as the appearance stream, such as the 2D or 3D RGB stream, keep its bounding box regression results, and then set each box's confidence score to the average score of the corresponding boxes from all fused streams. In the rest of the paper, we denote the late fusion of an appearance stream $A$ with motion streams $M_1, \dots, M_n$ as $A + M_1 + \dots + M_n$, where $n$ is the number of motion streams.
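The score averaging step could be sketched as follows (NumPy, for illustration only; it assumes all streams predict over the same set of default boxes so that scores align index by index):

```python
import numpy as np

def late_fuse(appearance_scores, motion_scores_list):
    """Average per-box confidence scores across streams. The appearance
    stream's box coordinates are kept unchanged; only the scores are fused.
    Each score array has shape (num_boxes, num_classes)."""
    all_scores = [appearance_scores] + list(motion_scores_list)
    return np.mean(all_scores, axis=0)

# Usage sketch: boxes come from the appearance stream, scores are averaged.
scores_2d_rgb = np.random.rand(8732, 25)   # illustrative box/class counts
scores_3d_of = np.random.rand(8732, 25)
fused_scores = late_fuse(scores_2d_rgb, [scores_3d_of])
```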

4 Experiments

To evaluate the performance of the 3D SSD streams, we examine different stream combinations and their detection accuracy on the UCF101-24 dataset. Singh et al.'s real-time 2D SSD framework [8237655] is used as the baseline. We keep their fusion and linking methods unchanged and focus on the performance improvement resulting from the 3D SSD streams.

4.1 Settings

Datasets. We choose the first split of UCF101-24 dataset to evaluate our model. It contains 24 sport classes in 3,207 untrimmed videos. Each video is annotated with bounding boxes for each action instance at frame level and each frame may contain multiple actors.
Evaluation metrics. We evaluate detection accuracy by mean average precision (mAP) at both the frame and video levels. At the frame level, if the Intersection-over-Union (IoU) between a predicted bounding box and the ground truth is greater than a threshold and the box's action category is classified correctly, we mark it as a correct detection. For the video-level metric, after we link the frame-level detections into tubes, we evaluate them with the spatial-temporal overlap between the predicted and annotated tubes. As in [7298676, 7410719, ren15fasterrcnn], we present the performance of our model in Tables 1, 2 and 3, for frame-mAP with IoU threshold 0.5 and video-mAP with IoU thresholds 0.2, 0.5, and 0.5:0.95.
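The frame-level correctness criterion can be sketched as follows (plain Python, for illustration; boxes are assumed to be given as (x1, y1, x2, y2) coordinates):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(pred_box, pred_label, gt_box, gt_label, thresh=0.5):
    """A detection counts as correct at frame level when its IoU with the
    ground truth exceeds the threshold and the action class matches."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= thresh
```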

4.2 Performance

We first analyse the performance of the single streams, then discuss the contribution of each stream in an ablation study of different two-stream combinations, and finally show how to obtain the best result for various types of data.

Single-Stream. We report the comparison of the 2D and 3D streams for RGB and optical flow in Table 1. The 2D streams adopt the same architecture and experimental setup as in Singh et al. [8237655]. At both the frame and video levels, each of our 3D streams outperforms the corresponding 2D stream. In particular, at the video level, our 3D RGB network improves the mAP by 5.53 and 4.79 points for IoU thresholds 0.2 and 0.5, respectively, and our 3D OF network improves the mAP by 3.76 and 4.46 points. These results indicate that the temporal information brought in by 3D convolution significantly improves the single-stream model's performance.

Method            | video-mAP @0.2 | @0.5  | @0.5:0.95 | frame-mAP @0.5
2D RGB [8237655]  | 69.8           | 40.9  | 18.7      | 64.96
2D OF [8237655]   | 63.7           | 30.8  | 11.0      | 47.26
ours-3D RGB       | 75.33          | 45.69 | 19.15     | 65.10
ours-3D OF        | 67.46          | 35.26 | 12.51     | 50.85
Table 1: Comparison of video and frame mAP between 2D and 3D RGB and optical flow (OF) streams.

Two-Stream. In this section, we answer the following two questions: (i) which stream is the best appearance stream? (ii) which stream is the best motion stream?
As for the appearance stream, because the 3D RGB stream contains both spatial and temporal information, our model can choose either the 2D or the 3D RGB stream as the appearance stream. As shown in Table 2, the result of 2D RGB + 2D OF outperforms that of 3D RGB + 2D OF by 0.75 points in frame-mAP with the same late fusion method. Meanwhile, when the motion stream is 3D OF, the combination with the 2D RGB appearance stream outperforms that with 3D RGB by 0.65 points. This can be explained by the fact that the 2D RGB stream contains more accurate spatial information for the current frame, while 3D convolution introduces some noise from the previous frames.

Method                      | video-mAP @0.2 | @0.5  | @0.5:0.95 | frame-mAP @0.5
2D RGB+2D OF (b) [8237655]  | 73.0           | 44.0  | 19.2      | 68.31
2D RGB+2D OF (u) [8237655]  | 73.5           | 46.3  | 20.4      | 64.97
2D RGB+2D OF (l) [8237655]  | 76.43          | 45.18 | 20.08     | 67.81
ours-3D RGB+2D OF (l)       | 76.02          | 47.38 | 19.35     | 67.06
ours-2D RGB+3D RGB (l)      | 76.18          | 46.52 | 20.94     | 68.72
ours-3D RGB+3D OF (l)       | 76.84          | 46.38 | 19.2      | 68.82
ours-2D RGB+3D OF (l)       | 77.19          | 47.75 | 21.11     | 69.47
Table 2: Comparison between different two-stream fusion combinations. (b) boost fusion, (u) union fusion, (l) late fusion.

The candidates for the motion stream are: 3D RGB, 2D and 3D optical flow. Comparing the 3D RGB stream with the 2D optical flow stream, we find that 2D RGB + 3D RGB performs better than 2D RGB + 2D OF in frame-mAP and in video-mAP for IoU thresholds 0.5 and 0.5:0.95. A more detailed frame-level average precision analysis for each of the 24 action classes is shown in Fig. 3. Based on how the actors and the background move with respect to the camera, the videos of UCF101-24 can be divided into 3 categories: (i) active background videos: videos where the camera moves along with the actors and the background changes sharply, for example, Rope Climbing, Skiing and Skateboarding. For these three classes, the 2D RGB + 3D RGB combination clearly outperforms the 2D RGB + 2D OF combination. The poor performance of the 2D optical flow stream is caused by the noise produced by the fast changing background. (ii) fixed background videos: videos where the camera is fixed, the background does not change much and the actors move quickly over a short time frame, such as Salsa Spin, Cliff Diving and Basketball Dunk. Because optical flow contains more accurate short-term temporal information than 3D RGB, the performance of 2D RGB + 2D OF is better than that of 2D RGB + 3D RGB. (iii) For other videos with more complex circumstances, the 3D RGB stream's contribution is similar to or slightly better than that of optical flow.

Figure 4: Fixed background and Active background videos.

While 2D OF and 3D RGB outperform each other in different scenarios, the best performance among all two-stream combinations is achieved by 2D RGB + 3D OF. It improves the frame-mAP of 2D RGB + 2D OF by 1.66 points and that of 2D RGB + 3D RGB by 0.75 points, which indicates that 3D optical flow is the best choice of motion stream in a two-stream framework. However, as shown in Fig. 3, the 3D optical flow stream still inherits the drawback of 2D optical flow, resulting in poor performance on active background videos.
Multi-Stream. We present the fusion results of the three-stream and four-stream models in Table 3. Compared to Singh et al.'s [8237655] two-stream model, our three-stream model (2D RGB + 3D RGB + 2D OF) obtains a 2.23 point improvement in frame-mAP with the 3D RGB stream integrated, and a 3.49 point improvement with the fusion of all four streams. To the best of our knowledge, our model outperforms all one-stage methods in both action localization and classification accuracy. In practice, we also need to consider the time required to prepare a stack of optical flow images, which is important for developing an online real-time system. For different kinds of action videos and applications, our model can be flexibly reorganized or integrated into other models to meet these requirements.

Method                            | frame-mAP @0.5
(SSD) Kalogeiton et al. [8237734] | 67.10
Hou et al. [8237882]              | 67.3
(SSD) Singh et al. [8237655]      | 67.81
ours-2D RGB+3D RGB+2D OF          | 70.04
ours-2D RGB+3D RGB+3D OF          | 71.10
ours-3D RGB+3D OF+2D RGB+2D OF    | 71.28
ours-2D RGB+2D OF+3D RGB+3D OF    | 71.30
Table 3: Comparison of frame-mAP with the state-of-the-art on the UCF101-24 dataset (split 1).

5 Conclusions and future plans

This paper introduced a multi-stream action detector which achieves state-of-the-art results among one-stage methods on the UCF101-24 dataset. We presented an empirical study of the properties of different combinations of the 2D RGB, 2D OF, 3D RGB and 3D OF streams. Based on these experiments, the following conclusions can be drawn: (i) the 2D RGB stream is a better choice for the appearance stream compared to the other streams; (ii) for active background videos, the 3D RGB motion stream is more tolerant of environmental noise; (iii) optical flow, especially the 3D stream, performs well for videos with a fixed background and significant short-term action instances. Future work will be devoted to two directions: (i) optimizing the framework with other one-stage methods, such as the YOLO [DBLP:conf/cvpr/RedmonF17] series; (ii) improving the temporal convolutional module with more lightweight 3D kernels to accelerate the whole forward process.

References
