Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition
In this report, we describe in detail our approach to the ActivityNet 2018 Kinetics-600 challenge. Although spatial-temporal modelling methods, which adopt either an end-to-end framework such as I3D [1] or a two-stage framework (i.e., CNN+RNN), have been proposed in existing state-of-the-art work for this task, video modelling is far from being well solved. In this challenge, we propose the spatial-temporal network (StNet) for better joint spatial-temporal modelling and comprehensive video understanding. In addition, given that a video source contains multi-modal information, we integrate both early-fusion and later-fusion strategies for multi-modal information via our proposed improved temporal Xception network (iTXN). Our StNet RGB single model achieves 78.99% top-1 precision on the Kinetics-600 validation set, and our improved temporal Xception network, which integrates the RGB, flow and audio modalities, reaches 82.35%. After model ensemble, we achieve top-1 precision as high as 85.0% on the validation set and rank first among all submissions.
1 Introduction
The main challenge lies in extracting discriminative spatial-temporal descriptors from video sources for the human action recognition task. CNN+RNN architectures for video sequence modelling [2, 3] and purely ConvNet-based video recognition [4, 5, 6, 7, 8, 1, 9] are the two major research directions. Although considerable progress has been made in recent years, action recognition from video is far from being well solved.
For the CNN+RNN solutions, the feed-forward CNN part is used for spatial modelling, while the temporal modelling part, e.g., LSTM [10] or GRU [11], makes end-to-end optimization more difficult due to its recurrent architecture. Taking the feature sequence extracted from a video as input, many other sequence modelling frameworks and feature encoding methods aim at better temporal coding for video classification. In [12], fast-forward LSTM (FF-LSTM) and the temporal Xception network are proposed for effective sequence modelling, and a considerable gain in video recognition accuracy is observed over traditional RNN models. NetVLAD [13], ActionVLAD [14] and Attention Clusters [15] have recently been proposed to integrate local features for action recognition, and these encoding methods achieve good results. Nevertheless, training the CNN and RNN parts separately is harmful for integrated spatial-temporal representation learning.
ConvNet-based solutions for action recognition can generally be categorized into 2D ConvNets and 3D ConvNets. Among these solutions, 2D or 3D two-stream architectures achieve state-of-the-art recognition performance. 2D two-stream architectures [4, 7] extract classification scores from evenly sampled RGB frames and optical flow fields. The final prediction is obtained by simply averaging the classification scores. In this way, temporal dynamics are barely explored due to poor temporal modelling. As a remedy for this drawback, multiple 3D ConvNet models have been proposed for end-to-end spatial-temporal modelling, such as T-ResNet [6], P3D [9], ECO [16], ARTNet [17] and S3D [18]. Among these 3D ConvNet frameworks, the state-of-the-art solution is the non-local neural network [19], which is based on I3D [1] for video modelling and leverages the spatial-temporal non-local relationships therein. However, 3D CNNs are computationally costly, and training 3D CNN models inflated from deeper networks suffers from a performance drop due to batch size reduction.
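To make the limitation of score averaging concrete, the prediction rule of 2D two-stream architectures can be sketched as below. This is a minimal illustration and not the implementation of [4, 7]: the function name `two_stream_predict`, the equal weighting of the two streams and the array layout are assumptions made here.

```python
import numpy as np


def two_stream_predict(rgb_scores, flow_scores):
    """Fuse two streams by plain score averaging.

    rgb_scores:  (num_rgb_samples, num_classes) class scores of sampled RGB frames
    flow_scores: (num_flow_samples, num_classes) class scores of sampled flow fields
    Returns the predicted class index and the fused score vector.
    """
    rgb_mean = np.mean(rgb_scores, axis=0)    # average over sampled frames
    flow_mean = np.mean(flow_scores, axis=0)  # average over sampled flow fields
    fused = (rgb_mean + flow_mean) / 2.0      # equal stream weights (assumption)
    return int(np.argmax(fused)), fused
```

Because the scores are averaged over the sampled frames, feeding the frames in reversed order yields the same fused scores, which is one way to see why temporal dynamics are barely explored by this scheme.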
In this challenge, we propose a novel framework called Spatial-temporal Network (StNet) to jointly model spatial-temporal correlations for video understanding. StNet first models local spatial-temporal correlation by applying 2D convolution over a 3N-channel super image, which is formed by sampling N successive RGB frames from a video and concatenating them in the channel dimension. As for long range temporal dynamics, StNet treats the 2D feature maps of T uniformly sampled super images as 3D feature maps whose temporal dimension is T, and relies on 3D convolution with a temporal kernel size of 3 and a spatial kernel size of 1 to capture long range temporal dependency. With our proposed StNet, both local spatial-temporal relationships and long range temporal dynamics can be modelled in an end-to-end fashion. In addition, a large number of convolution kernel parameters is avoided, because local spatial-temporal correlation is modelled with 2D convolution and the spatial kernel size of the 3D convolution in StNet is set to 1.
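As a rough illustration of the parameter saving, the sketch below counts the weights of a single 3D convolution layer with a temporal-only kernel (temporal size 3, spatial size 1) and with a full 3×3×3 kernel. The channel width of 512 and the helper name `conv3d_weight_count` are hypothetical and chosen only for the arithmetic; biases are ignored.

```python
def conv3d_weight_count(c_in, c_out, k_t, k_h, k_w):
    """Number of weights in a 3D convolution layer (bias terms not counted)."""
    return c_in * c_out * k_t * k_h * k_w


channels = 512  # hypothetical channel width, for illustration only

temporal_only = conv3d_weight_count(channels, channels, 3, 1, 1)  # 3x1x1 kernel
full_3d = conv3d_weight_count(channels, channels, 3, 3, 3)        # 3x3x3 kernel

# The full kernel carries 3 * 3 = 9 times as many weights per layer.
ratio = full_3d // temporal_only
```

Under these assumptions the temporal-only layer has 786,432 weights and the full 3D layer has 7,077,888, a factor of 9 that is independent of the channel width.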
A video source contains multi-modal information: appearance information in the RGB frames, motion information among successive video frames and acoustic information in the audio signal. Existing work has shown that fusing multi-modal information is helpful [7, 15, 12]. In this challenge, we also utilize multiple modalities to boost the recognition performance. We improve our formerly proposed temporal Xception network [12] and enable it to integrate both early-fusion and later-fusion features of multi-modal information. This model is referred to as the improved temporal Xception network (iTXN) in the following.
2 Spatial-Temporal Modelling
The proposed StNet can be constructed from existing state-of-the-art 2D CNN frameworks, such as ResNet [20], Inception-Resnet [21] and so on. Taking ResNet as an example, Fig. 1 illustrates how StNet is built. Similar to TSN [7], we choose to model long range temporal dynamics by sampling temporal snippets rather than inputting the whole video sequence. One of the differences from TSN is that we sample T temporal segments, each of which consists of N contiguous RGB frames rather than one single frame. These N frames are stacked to form a super image whose channel size is 3N, so the input to the network is a tensor of size T × 3N × H × W. We insert two temporal modelling blocks right after the Res3 and Res4 blocks. The temporal modelling blocks are designed to capture the long-range temporal dynamics inside a video sequence, and they can be implemented easily with Conv3d-BN3d-ReLU. Note that existing 2D CNN frameworks are powerful enough for spatial modelling, so we set both the height and the width kernel size of the 3D convolution to 1 to save model parameters, while the temporal kernel size is empirically set to 3. As an augmentation, we append a temporal Xception block [12] to the global average pooling layer for further temporal modelling. Details of the temporal Xception block can be found in the rightmost block of Fig. 2.
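The construction of the network input from super images can be sketched as follows, writing T for the number of sampled segments and N for the number of contiguous frames per segment. This is a sketch under stated assumptions rather than the authors' data pipeline: the function name `build_super_images`, the (frames, 3, H, W) layout of the decoded video and the choice to take the N frames from the centre of each segment are all assumptions made here.

```python
import numpy as np


def build_super_images(video, num_segments, frames_per_segment):
    """Stack contiguous RGB frames into super images.

    video: array of shape (num_frames, 3, H, W)
    Returns an array of shape (T, 3N, H, W), with T = num_segments and
    N = frames_per_segment.
    """
    num_frames, _, height, width = video.shape
    segment_len = num_frames // num_segments
    if segment_len < frames_per_segment:
        raise ValueError("video too short for the requested sampling")

    super_images = []
    for t in range(num_segments):
        # Take N contiguous frames from the centre of segment t (assumption).
        start = t * segment_len + (segment_len - frames_per_segment) // 2
        clip = video[start:start + frames_per_segment]         # (N, 3, H, W)
        # Merge the frame and colour axes into one channel axis of size 3N.
        super_images.append(clip.reshape(-1, height, width))   # (3N, H, W)
    return np.stack(super_images)                               # (T, 3N, H, W)
```

With 7 segments of 5 frames each, a 70-frame video of 8×8 frames gives an array of shape (7, 15, 8, 8). The first 2D convolution of the backbone then sees 15 input channels instead of 3.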
Building StNet from other 2D CNN frameworks such as InceptionResnet V2 [21], ResNeXt [22] and SENet [23] is quite similar to what we have done with ResNet, so we do not elaborate on the details here. In our current setting, N is set to 5, and T is 7 in the training phase and 25 in the testing phase. As can be seen, StNet is an end-to-end framework for joint spatial-temporal modelling. A large majority of its parameters can be initialized from its 2D CNN counterpart. The remaining parameters are initialized according to the following rules: 1) the weights of Conv1 are initialized following what the authors have done in I3D [1]; 2) the parameters of 1D or 3D BatchNorm layers are initialized to be an identity mapping; 3) the biases of 1D or 3D Conv layers are initially set to zero and the weights are all set to 1/(3 × C_i), where C_i is the input channel size.
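The effect of a constant initialization on a temporal convolution can be illustrated with the NumPy sketch below. It is not the authors' code: the function names are hypothetical, zero padding of one frame on each side is assumed, and the constant 1/(3 × C_i), with C_i the input channel size, is taken as an assumption for the initial weight value.

```python
import numpy as np


def init_temporal_conv(c_in, c_out):
    """Constant weights 1 / (3 * c_in) and zero biases (assumed initial values)."""
    weight = np.full((c_out, c_in, 3), 1.0 / (3 * c_in))
    bias = np.zeros(c_out)
    return weight, bias


def temporal_conv(x, weight, bias):
    """Convolution with temporal kernel size 3 and spatial kernel size 1.

    x:      (C_in, T, H, W) feature maps
    weight: (C_out, C_in, 3) temporal kernels
    bias:   (C_out,)
    The time axis is zero padded by one frame on each side, so T is preserved.
    """
    t = x.shape[1]
    padded = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0)))
    out = np.zeros((weight.shape[0],) + x.shape[1:], dtype=float)
    for k in range(3):
        # Kernel tap k sees the input shifted by (k - 1) frames.
        out += np.einsum("oc,cthw->othw", weight[:, :, k], padded[:, k:k + t])
    return out + bias[:, None, None, None]
```

With this initialization every output channel starts as the average of the input over all channels and three neighbouring time steps, so the newly inserted layer begins as a smooth temporal average rather than a random projection.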
3 Multi-Modal Fusion
Videos consist of multiple modalities. For instance, appearance information is contained in RGB frames, motion information is implicitly shown by the gradual change of frames over time, and audio can provide acoustic information. For a video recognition system, utilizing such multi-modal information effectively is beneficial for performance improvement. Existing works [15, 12] have demonstrated this point.
In this work, we also follow the common practice of boosting recognition performance by integrating multi-modal information, i.e., appearance, motion and audio. Appearance can be explored from RGB frames with existing 2D/3D solutions as well as our proposed StNet. To better utilize motion information, we extract optical flow from video sequences not only with the TV_L1 algorithm [24] but also with the Farneback algorithm [25]. As for audio information, we simply follow what has been done in [26, 12].
Fusing multi-modal information has been extensively researched in the literature. Early-fusion and later-fusion are the most common methods. In this paper, we propose to combine early-fusion and later-fusion in one single framework. As shown in Fig. 2, pre-extracted features of RGB, TV_L1 flow, Farneback flow and audio are concatenated along the channel dimension, and the result is fed into a temporal Xception block for early fusion. These four feature modalities are also encoded individually with temporal Xception blocks. Afterwards, the early-fusion feature vector is concatenated with the individually encoded features of the four modalities for classification.
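The data flow of this combined fusion can be sketched as below. The sketch only mirrors the wiring described above and is not iTXN itself: the temporal Xception block is replaced by a plain temporal mean (`encode`), which is a stand-in chosen here for brevity, and the function names, modality keys and feature sizes are hypothetical.

```python
import numpy as np


def encode(sequence):
    """Stand-in for a temporal Xception block: maps a (T, C) sequence to (C,).

    A plain temporal mean is used here only to keep the sketch short;
    the real block is a learned network.
    """
    return sequence.mean(axis=0)


def fuse(features):
    """Combine early and later fusion of per-modality feature sequences.

    features: dict mapping a modality name to a (T, C_m) array, same T for all.
    Returns one vector: the early-fusion code followed by the individual codes.
    """
    names = sorted(features)
    # Early fusion: concatenate all modalities along the channel axis, then encode.
    early_input = np.concatenate([features[n] for n in names], axis=1)
    early = encode(early_input)
    # Later fusion: encode every modality on its own.
    individual = [encode(features[n]) for n in names]
    # The classifier would take the concatenation of all five codes as input.
    return np.concatenate([early] + individual)


rng = np.random.default_rng(0)
example = {
    "rgb": rng.normal(size=(4, 3)),
    "tvl1_flow": rng.normal(size=(4, 2)),
    "farneback_flow": rng.normal(size=(4, 2)),
    "audio": rng.normal(size=(4, 1)),
}
fused = fuse(example)  # length 2 * (3 + 2 + 2 + 1) = 16
```

With learned encoders the early-fusion branch can capture cross-modal interactions that the individual branches cannot, which is the motivation for keeping both in one classifier input.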
4 Experiments
In this section, we report experimental results to verify the effectiveness of our proposed frameworks. All the base RGB, flow and audio models evaluated in the following subsections are pre-trained on the Kinetics-400 training set and finetuned on the Kinetics-600 training set. All the results reported below are evaluated on the Kinetics-600 validation set.
4.1 Spatial-Temporal Modelling
To show the effectiveness of the proposed StNet, we trained StNet with InceptionResnet V2 [21] and SE-ResNeXt 101 [23, 22] backbones, denoted as StNet-IRv2 and StNet-se101 respectively, together with a series of baseline RGB models. The state-of-the-art 2D CNN model for action recognition is TSN [7], so we implemented TSN with InceptionResnet V2 and SE-ResNeXt 152 backbone networks. In the following, we denote these two models as TSN-IRv2 and TSN-se152 respectively. We also applied VLAD encoding + SVM to the TSN-IRv2 Conv2d_7b feature. The nonlocal neural network is the state-of-the-art 3D CNN model for video classification, so we also finetuned the nonlocal-Res50-I3D net as a baseline model with the code released by the authors. Due to time limitations, we could not afford to train a model as large as nonlocal-Res101-I3D.
Table 1: Top-1 precision of RGB models on the Kinetics-600 validation set.
| Model | Top-1 |
|---|---|
| TSN-IRv2 (T=50, cropsize=331) | 76.16% |
| TSN-se152 (T=50, cropsize=256) | 76.22% |
| TSN-IRv2 + VLAD + SVM | 75.6% |
| Nonlocal Res50-I3D (1 crop of 32 frames) | 71.1% |
| Nonlocal Res50-I3D (30 crops) | 78.6% |
| StNet-se101 (T=25, cropsize=256) | 76.08% |
| StNet-IRv2 (T=25, cropsize=331) | 78.99% |
Evaluation results are presented in Table 1. We can see from this table that StNet-IRv2 outperforms TSN-IRv2 by 2.83% in top-1 precision, and it also achieves better performance than the nonlocal-Res50-I3D net. Please note that our StNet-se101 performs comparably to TSN-se152 despite its shallower backbone, which also demonstrates the superiority of the StNet framework.
4.2 Multi-Modal Fusion
In this work, we exploit not only RGB information but also TV_L1 flow [24], Farneback flow [25] and audio information [26] extracted from video sources. The recognition performance of each individual modality is listed in Table 2. For multi-modality fusion, the StNet-IRv2 RGB feature, TSN-IRv2 TV_L1 flow feature, TSN-se152 Farneback flow feature and TSN-VGG audio feature are used for better complementarity.
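Table 2: Recognition performance of individual modalities on the Kinetics-600 validation set.
| Model | Precision |
|---|---|
| TSN-IRv2 Farneback flow | 69.3% |
| TSN-se152 Farneback flow | 71.3% |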
To evaluate iTXN, which is designed for multi-modal fusion, we compared it with several baselines: AttentionClusters [15], and the Fast-Forward LSTM and temporal Xception network proposed in [12]. The results are shown in Table 3. From this table, we can see that iTXN is an effective framework for integrating multiple modalities.
Our final results are obtained by ensembling multiple single-modality models and several multi-modal models with a gradient boosting decision tree (GBDT) [27]. After model ensemble, we achieve top-1 and top-5 precision of 85.0% and 96.9% on the validation set.
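Table 3: Multi-modal fusion results on the Kinetics-600 validation set.
| Model | Top-1 | Top-5 |
|---|---|---|
| temporal Xception network | 81.8% | 95.6% |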
5 Conclusion
In this challenge, we proposed StNet, a novel end-to-end framework that jointly models spatial-temporal patterns in videos for human action recognition. To better integrate the multi-modal information that is naturally contained in video sources, we improved the temporal Xception network to combine both early-fusion and later-fusion of multiple modalities. Experimental results demonstrate the effectiveness of the proposed StNet and iTXN.
References
- [1] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
- [2] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 2625–2634
- [3] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 4694–4702
- [4] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems. (2014) 568–576
- [5] Feichtenhofer, C., Pinz, A., Wildes, R.: Spatiotemporal residual networks for video action recognition. In: Advances in Neural Information Processing Systems. (2016) 3468–3476
- [6] Feichtenhofer, C., Pinz, A., Wildes, R.P.: Temporal residual networks for dynamic scene recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 4728–4737
- [7] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision, Springer (2016) 20–36
- [8] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision. (2015) 4489–4497
- [9] Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
- [10] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8) (1997) 1735–1780
- [11] Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)
- [12] Bian, Y., Gan, C., Liu, X., Li, F., Long, X., Li, Y., Qi, H., Zhou, J., Wen, S., Lin, Y.: Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. arXiv preprint arXiv:1708.03805 (2017)
- [13] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
- [14] Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., Russell, B.: Actionvlad: Learning spatio-temporal aggregation for action classification. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
- [15] Long, X., Gan, C., de Melo, G., Wu, J., Liu, X., Wen, S.: Attention clusters: Purely attention based local feature integration for video classification. arXiv preprint arXiv:1711.09550 (2017)
- [16] Zolfaghari, M., Singh, K., Brox, T.: Eco: Efficient convolutional network for online video understanding. arXiv preprint arXiv:1804.09066 (2018)
- [17] Wang, L., Li, W., Li, W., Van Gool, L.: Appearance-and-relation networks for video classification. arXiv preprint arXiv:1711.09125 (2017)
- [18] Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851 (2017)
- [19] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. arXiv preprint arXiv:1711.07971 (2017)
- [20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
- [21] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI. (2017) 4278–4284
- [22] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 5987–5995
- [23] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
- [24] Pérez, J.S., Meinhardt-Llopis, E., Facciolo, G.: Tv-l1 optical flow estimation. Image Processing On Line 2013 (2013) 137–150
- [25] Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. Image analysis (2003) 363–370
- [26] Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B., et al.: Cnn architectures for large-scale audio classification. In: Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, IEEE (2017) 131–135
- [27] Friedman, J.H.: Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4) (2002) 367–378