Pose-Based Two-Stream Relational Networks for Action Recognition in Videos

Wei Wang    Jinjin Zhang    Chenyang Si    Liang Wang
Center for Research on Intelligent Perception and Computing (CRIPAC)
  
National Laboratory of Pattern Recognition (NLPR)
Center for Excellence in Brain Science and Intelligence Technology (CEBSIT)
  
Institute of Automation
   Chinese Academy of Sciences (CASIA)
University of Chinese Academy of Sciences (UCAS)
{wangwei, wangliang}@nlpr.ia.ac.cn, {jinjin.zhang, chenyang.si}@cripac.ia.ac.cn
Abstract

Recently, pose-based action recognition has gained more and more attention due to its better performance compared with traditional appearance-based methods. However, two problems remain to be solved. First, existing pose-based methods generally recognize human actions from captured 3D human poses, which are very difficult to obtain in real scenarios. Second, few pose-based methods model the action-related objects when recognizing human-object interactions, in which objects play an important role. To solve the problems above, we propose a pose-based two-stream relational network (PSRN) for action recognition. In PSRN, one stream models the temporal dynamics of the targeted 2D human pose sequences, which are directly extracted from raw videos, and the other stream models the action-related objects from a randomly sampled video frame. Most importantly, instead of fusing the two streams in the class score layer as before, we propose a pose-object relational network to model the relationship between human poses and action-related objects. We evaluate the proposed PSRN on two challenging benchmarks, i.e., Sub-JHMDB and PennAction. Experimental results show that our PSRN obtains state-of-the-art performance on Sub-JHMDB (80.2%) and PennAction (98.1%). Our work opens a new door to action recognition by combining 2D human poses extracted from raw videos with image appearance.

Keywords:
Action Recognition; 2D Human Pose; Relational Modelling; Two-Stream; Selective Attention

1 Introduction

Recognition of human actions is a very important task in computer vision with a wide range of applications, e.g., intelligent video surveillance, robot vision and human-computer interaction. Action recognition is also very challenging, since it has to analyze human motion and understand its temporal characteristics. Moreover, in most cases, the objects that humans interact with provide an important clue for action recognition.

For a long time, learning and extracting effective spatio-temporal features from videos has been the mainstream direction of action recognition, e.g., the well-known spatio-temporal interest points (STIPs) [1] and improved dense trajectories (iDT) [2]. Recently, with the advent of deep learning, deep convolutional networks for action recognition have received significant attention due to their large improvement in recognition accuracy. Tran et al. [3] extend the traditional 2D convolutional network to a 3D version to learn spatio-temporal features for video action recognition. Simonyan et al. [4] propose the first two-stream ConvNet architecture for both temporal optical flow and spatial appearance, which is followed by several improved architectures [5][6]. Although these two-stream approaches achieve better performance on benchmarks such as UCF101 [7], the optical flow in the temporal stream generally demands a high computation cost, and the two streams lack a principled fusion strategy.

Actually, 2D/3D human poses, as the trajectories of skeleton joints, are more effective representations for characterizing the dynamics of human actions compared with optical flow. Analyzing pose/skeleton sequences is another approach to action recognition. Du et al. [8] propose a hierarchical recurrent neural network for skeleton-based action recognition by dividing the human skeleton into five parts according to human physical structure. To extract discriminative skeleton features, Song et al. [9] propose a spatio-temporal attention model. Pose-based methods have achieved great success on datasets such as NTU RGB+D [10]. However, two urgent problems remain. First, instead of starting from raw video, most pose-based methods recognize actions directly from captured 3D human poses, which are very difficult to obtain in real scenarios. Second, existing pose-based methods generally fail to exploit spatial appearance, which contains the objects associated with particular actions. Although Du et al. [11] combine pose and appearance for action recognition, they only introduce pose-based attention to learn human-part appearance features at each time-step, and do not model the action-related objects.

From the analysis above, fusing both human poses and action-related objects becomes a natural and reasonable choice for recognizing human actions. Recent progress in realtime 2D pose estimation from images and videos, e.g., OpenPose [12], has made action recognition from estimated poses plausible. The work on relational reasoning [13] inspires us to provide a principled way to model the relationship between human poses and action-related objects.

In this paper, we propose a pose-based two-stream relational network (PSRN) for action recognition. Fig. 1 shows the architecture of PSRN, which contains a temporal pose stream, a spatial object stream and a pose-object relational network. In the pose stream, we first extract the multi-person 2D poses of video frames with an improved OpenPose. Due to the varying number of persons in videos, we assume there is only one person performing the action, and select the targeted pose from the multi-person poses with an attention mechanism. To better represent human poses, we utilize not only the skeleton positions but also the skeleton velocities, both of which are modelled with LSTM-RNNs to characterize the pose dynamics. The last hidden representations of these two LSTM-RNNs, i.e., $h_{pos}$ and $h_{vel}$, will be used for later pose-object relational modelling. The object stream extracts the feature maps of a randomly sampled video frame with a VGG16 network to represent the action-related objects. In the pose-object relational network, instead of fusing the two streams in the class score layer as before, we take each column of the object feature maps as an object, and combine all the columns with the pose hidden representations $h_{pos}$ and $h_{vel}$ to obtain a relation feature $r$. Finally, we supervise PSRN with three classification losses on the pose hidden representations $h_{pos}$, $h_{vel}$ and the relation feature $r$, respectively. This multi-loss objective helps our model learn better.

We perform experiments on two challenging action recognition datasets, namely PennAction [14] and Sub-JHMDB [15], to verify the effectiveness of our model. Experimental results show that the proposed PSRN outperforms the state-of-the-art on these datasets.

The main contributions of this paper are summarized as follows:

  1. We propose a new two-stream architecture for both 2D pose sequence and image appearance, which provides effective representations for the temporal dynamics of human actions and action-related objects.

  2. We propose a principled strategy to fuse the two streams by modelling the relationship between human poses and action-related objects with a relational network.

  3. The proposed pose-based two-stream relational network (PSRN) achieves the best results on two challenging benchmarks, which verifies its effectiveness.

  4. Our work opens a new door to action recognition by utilizing 2D human poses directly extracted from raw videos, which no longer requires captured 3D human poses or expensive optical flow as input.

The remainder of this paper is organized as follows. In Section 2, we introduce the related work on action recognition and relational modelling. In Section 3, we illustrate the details of the proposed PSRN. Experimental results are presented in Section 4. Finally, we conclude and discuss the paper in Section 5.

Figure 1: The architecture of the proposed pose-based two-stream relational network (PSRN). It contains a temporal pose stream, a spatial object stream and a pose-object relational network. The pose stream models the temporal dynamics of the targeted 2D human pose sequences, and the object stream extracts visual feature maps from a randomly sampled video frame to represent the action-related objects. Pose-object relational network provides a principled two-stream fusion strategy by modelling the relationship between human poses and action-related objects.

2 Related Work

In this section, we briefly review the existing literature that closely relates to the proposed pose-based two-stream relational network, including 2D human pose estimation, pose-based action recognition, two-stream action recognition and relational modelling.

2D human pose estimation     There exist two categories of 2D human pose estimation methods: single-person and multi-person. For single-person pose estimation, Toshev et al. [16] propose an architecture called DeepPose, which is the first pose estimation method using deep networks. Beyond using ConvNets for joint estimation, some works build graphical models to learn spatial relationships between joints [17]. Recently, several well-designed deep networks have achieved impressive results [18]. For multi-person pose estimation, Cao et al. [12] exploit part affinity fields (PAF) to associate body parts with individuals in a bottom-up way. Some top-down approaches consist of two stages: a person detector followed by single-person pose estimation for each detection [19]. He et al. propose an end-to-end solution called Mask R-CNN by extending Faster R-CNN [20]. In this paper, we finetune the PAF model proposed in [12] on a larger keypoint detection dataset and obtain much more accurate poses on the experimental action datasets.

Pose-based action recognition     Recognition of human actions based on pose sequences has received much attention due to their effective representation of action dynamics. Previously, various handcrafted features, e.g., the relative positions of joints and the covariance matrix of joint locations over time [21], were proposed as discriminative descriptors for pose sequences. Recently, a large number of approaches using recurrent neural networks (RNNs) have been proposed to model the temporal dynamics of pose sequences. Du et al. [8] propose an end-to-end hierarchical RNN for skeleton-based action recognition by dividing the human skeleton into five parts according to human physical structure. Song et al. [9] propose an end-to-end spatial and temporal attention model for action recognition. Zhang et al. [22] propose a view adaptive model which can adapt to the most suitable viewpoints for action recognition. Although pose-based methods have achieved great success on several benchmarks, most of them recognize actions directly from captured 3D human poses, which are very difficult to obtain in real scenarios. In this paper, we model the action dynamics with 2D human poses extracted from raw videos.

Two-stream action recognition     Inspired by the two-stream hypothesis of perception and action [23], Simonyan et al. [4] propose a two-stream ConvNet architecture to process temporal motion information and spatial appearance in parallel, and then fuse the classification scores of the two streams. Several extensions to that work have been proposed to explore two-stream architectures. Feichtenhofer et al. [5] investigate convolutional spatial and temporal fusion, e.g., bilinear fusion and 3D pooling, especially the multiplicative interactions of space-time features. For long-range temporal structure modeling, Wang et al. [6] propose a temporal segment network for video-based action recognition, which combines a sparse temporal sampling strategy and video-level supervision. It is well known that the optical flow used in the temporal stream generally demands a high computation cost to obtain reasonable accuracy. In this paper, we use 2D pose sequences to represent motion information, which is much more natural and reasonable than optical flow. In addition, we propose to fuse the two streams with a novel pose-object relational network.

Relational modelling     Graph-based methods inherently support relation-centric computation, e.g., graph neural networks [24] and interaction networks [25]. Santoro et al. [13] recently propose a simple plug-and-play general solution to relational reasoning. To handle complex tasks that require an order of magnitude more steps of relational reasoning, e.g., Sudoku puzzles, Palm et al. [26] propose a recurrent relational network. In this paper, we follow the idea of [13] and propose a pose-object relational network to model the relationship between human poses and action-related objects.

3 Pose-Based Two-Stream Relational Network

In this section, we explain the proposed pose-based two-stream relational network in detail. First, we introduce the procedure of estimating 2D human poses from raw videos. Then, we describe the temporal pose stream and spatial object stream, respectively. Next, we illustrate the pose-object relational network. Finally, we give the details of learning PSRN.

3.1 2D Human Pose Estimation

This work does not aim to propose a new pose estimation method, so we choose the approach presented in [12] to estimate the multi-person poses of action videos. This approach, referred to as PAF, proposes part affinity fields to associate body parts in a bottom-up way, which maintains high accuracy and achieves realtime performance. In our experiments, we retrain PAF on the larger AI Challenger keypoint dataset (https://challenger.ai/competition/keypoint/subject), which has 210,000 annotated images in the training set. Compared to the 18 keypoints in the COCO 2016 keypoints challenge dataset used in [12], this dataset annotates only 14 keypoints for each person: it omits the 5 keypoints on the two eyes, two ears and nose, and adds a new keypoint on the top of the head. To obtain more accurate detection results on the action datasets, we replace the original VGG16 with a more powerful Inception-ResNet-V2 in the front-end of PAF. The retrained model performs much better than the original implementation, improving the mAP from 0.48 to 0.52 on the AI Challenger dataset. Fig. 2(a) and (b) show several testing results from the AI Challenger dataset using [12] and our implementation, respectively. Although our retrained model improves the accuracy of keypoint detection, there still exist several kinds of failures on the action video datasets, as shown in Fig. 2(c), e.g., missing keypoints on a body part or the whole body, and keypoints detected on action-related objects. These failures pose great challenges for further action recognition. However, the experimental results in Section 4 show that the proposed PSRN still achieves better results.

Figure 2: Results of 2D human pose estimation. (a) and (b) Testing results on the AI Challenger dataset using [12] and our implementation, respectively. Our retrained model performs much better than the implementation in [12]. (c) Several kinds of failure cases of our retrained model on action videos, e.g., missing keypoints on a body part or the whole body, and keypoints detected on action-related objects.

3.2 Temporal Pose Stream

Considering that there are a varying number of persons in the action videos, and that some human poses extracted in Section 3.1 may be incomplete, we need to perform pose filling to obtain consistent and complete pose data. We count the number of persons in each frame and choose the maximum number $N$ as the common person number for all the videos. If the keypoints of some body part are missing or the number of detected persons is smaller than $N$, we fill the missing parts/persons with placeholder keypoints so that each frame has a complete representation for $N$ persons. Fig. 3(a) illustrates the procedure of pose filling, which adds virtual human poses with all placeholder keypoints and complements partial poses.
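For clarity, the following minimal NumPy sketch pads the detected poses of one frame to the common person number $N$; the use of NaN to mark missing keypoints and of zeros as the placeholder fill value are illustrative assumptions, since the exact fill value is not specified above.

    import numpy as np

    def fill_poses(frame_poses, num_persons, num_keypoints=14):
        # frame_poses: list of (num_keypoints, 2) arrays for the detected persons,
        # with missing keypoints marked as NaN (an assumption of this sketch).
        filled = np.zeros((num_persons, num_keypoints, 2), dtype=np.float32)
        for i, pose in enumerate(frame_poses[:num_persons]):
            pose = np.asarray(pose, dtype=np.float32).copy()
            pose[np.isnan(pose)] = 0.0      # complement missing body parts
            filled[i] = pose
        # rows beyond the detected persons stay all-zero: virtual poses
        return filled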

For each of the $N$ human poses, Fig. 3(b) shows its 14 keypoints, which are divided into five parts annotated by ellipse boxes. Each part is initially represented by the concatenation of its keypoint positions and then transformed into a $d$-dim vector by a multilayer perceptron (MLP). The representation of a human pose is the concatenation of these five $d$-dim vectors.
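As an illustration, the sketch below encodes one filled pose into the concatenated part-based representation; the keypoint index layout, the sharing of the shoulders between the head and arm groups, the single-layer ReLU MLP and the random stand-in weights are all assumptions, not the exact configuration used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 14-keypoint layout: 0-2 r.shoulder/elbow/wrist,
    # 3-5 l.shoulder/elbow/wrist, 6-8 r.hip/knee/ankle, 9-11 l.hip/knee/ankle,
    # 12 head top, 13 neck; the shoulders are reused in the head group.
    PARTS = [[12, 13, 0, 3],   # head   -> 8-dim input
             [0, 1, 2],        # r. arm -> 6-dim input
             [3, 4, 5],        # l. arm -> 6-dim input
             [6, 7, 8],        # r. leg -> 6-dim input
             [9, 10, 11]]      # l. leg -> 6-dim input
    D = 100                    # per-part feature dimension (see Section 4.2)

    # One MLP layer per part; random weights stand in for learned parameters.
    weights = [(rng.normal(0.0, 0.01, (D, 2 * len(p))), np.zeros(D)) for p in PARTS]

    def encode_pose(pose):
        # pose: (14, 2) keypoint array -> concatenated 5*D-dim representation
        feats = [np.maximum(W @ pose[idx].reshape(-1) + b, 0.0)
                 for idx, (W, b) in zip(PARTS, weights)]
        return np.concatenate(feats)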

Actually, most human actions are performed by a specific person, so it is very difficult to obtain a compact representation of the action dynamics by directly concatenating all the human poses at each frame. We propose to select the targeted pose from all the detected poses by virtue of an attention mechanism.

At each time-step $t$, we have obtained $N$ $5d$-dim vectors for all the human poses, which are denoted as

$X_t = \{x_t^1, x_t^2, \ldots, x_t^N\}$    (1)

We adopt the soft attention mechanism introduced in Xu et al. [27], which is based on the following LSTM implementation:

$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} M \begin{pmatrix} h_{t-1} \\ z_t \end{pmatrix}$    (2)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$    (3)
$h_t = o_t \odot \tanh(c_t)$    (4)

where $i_t$ is the input gate, $f_t$ is the forget gate, $o_t$ is the output gate, $g_t$ is the intermediate memory state, $c_t$ is the updated memory state and $h_t$ is the hidden state. $z_t$ represents the input to the LSTM at time-step $t$, which will be explained later. $M$ denotes the trainable parameters of an affine transformation, and $m$ is the dimension of all the gates and states. $\sigma$ and $\odot$ denote the sigmoid activation and element-wise multiplication, respectively.
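The following NumPy sketch implements one step of Eqs. (2)-(4); the bias term and the exact parameter shapes are minor assumptions added for completeness.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(z_t, h_prev, c_prev, M, b):
        # z_t: (5d,) selected pose input; h_prev, c_prev: (m,) previous states
        # M: (4m, m + 5d) affine parameters; b: (4m,) bias (assumed)
        a = M @ np.concatenate([h_prev, z_t]) + b
        i, f, o, g = np.split(a, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c_t = f * c_prev + i * g           # Eq. (3)
        h_t = o * np.tanh(c_t)             # Eq. (4)
        return h_t, c_t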

Figure 3: (a) Pose filling to obtain consistent and complete pose data. We add virtual human poses and complement incomplete poses with placeholder keypoints. (b) Similar to [8], the 14 keypoints of a human pose are divided into five parts, annotated by ellipse boxes, according to human physical structure. The representation of a human pose is the concatenation of the five part representations.

The soft attention computes a positive weight $\alpha_t^i$ to measure the importance of each pose representation $x_t^i$ based on the previous hidden state $h_{t-1}$. The weight is generally computed by a multilayer perceptron $f_{att}$ followed by a softmax operation:

$e_t^i = f_{att}(x_t^i, h_{t-1}), \quad \alpha_t^i = \dfrac{\exp(e_t^i)}{\sum_{j=1}^{N} \exp(e_t^j)}$    (5)

Fig. 4 shows the learned attention weights for several action videos. The size of the keypoint area denotes the weight value. We can see that the targeted human pose has larger weights than the non-targeted poses.

Once the attention weights have been computed, the selected pose representation $z_t$ is a summation of the weighted pose representations:

$z_t = \sum_{i=1}^{N} \alpha_t^i x_t^i$    (6)

As shown in Eqn. (2), $z_t$ is the selected compact pose representation which is input to the LSTM.
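A minimal sketch of this attention-based pose selection (Eqs. (5)-(6)) is given below; the one-hidden-layer tanh scoring function is a typical but assumed parameterisation of $f_{att}$.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def select_pose(pose_feats, h_prev, Wx, Wh, w):
        # pose_feats: (N, 5d) pose vectors x_t^i; h_prev: (m,) hidden state h_{t-1}
        # Wx, Wh, w: parameters of an assumed one-hidden-layer scoring MLP f_att
        scores = np.array([w @ np.tanh(Wx @ x + Wh @ h_prev) for x in pose_feats])
        alpha = softmax(scores)            # attention weights, Eq. (5)
        z_t = alpha @ pose_feats           # weighted sum of poses, Eq. (6)
        return z_t, alpha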

Up to the last time-step $T$, the hidden representation $h_T$ of the LSTM can be used to classify actions and to build the pose-object relational network later. For clarity, we replace $h_T$ with $h_{pos}$ to denote the representation of pose position.

To further explore the temporal dynamics of the pose sequence, we compute the pose velocity $v_t$ by differencing the selected pose representations $z_t$ and $z_{t-1}$:

$v_t = z_t - z_{t-1}$    (7)

The pose velocity sequence can be modelled by another LSTM. The last hidden representation $h_{vel}$ of this LSTM can also be used to classify actions and to build the pose-object relational network later.
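The velocity sequence of Eq. (7) can be computed as below; whether the first time-step is padded (e.g., with zeros) is not specified above and is left out of this sketch.

    import numpy as np

    def pose_velocities(selected_poses):
        # selected_poses: (T, 5d) sequence of z_t
        z = np.asarray(selected_poses)
        return z[1:] - z[:-1]              # (T-1, 5d) velocities v_t, Eq. (7)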

Figure 4: The learned attention weights for multiple human poses. The size of the keypoint area denotes the weight value. We can see that the targeted human pose has larger weights than the non-targeted human poses.

3.3 Spatial Object Stream

Spatial appearance provides a useful clue for action recognition, since some actions are tightly associated with particular objects. In the literature, there is a large amount of work on action classification from still images [28].

In this work, we randomly sample a video frame and extract its feature maps ($7 \times 7 \times 512$) with a VGG16 network. Each 512-dim column vector is considered as an object representation $o_j$. We send the resulting 49 object representations into the next module for pose-object relational modelling.
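As a small illustration, the reshaping from feature maps to an object set can be written as follows, assuming a channels-last layout.

    import numpy as np

    def to_object_set(feature_maps):
        # feature_maps: (7, 7, 512) VGG16 conv features, channels last (assumed)
        fmap = np.asarray(feature_maps)
        h, w, c = fmap.shape
        return fmap.reshape(h * w, c)      # 49 object vectors o_j, each 512-dim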

3.4 Pose-Object Relational Network

After obtaining the pose representations $h_{pos}$ and $h_{vel}$ from the temporal pose stream, and the object representations $o_j$ from the spatial object stream, we model the relationship between human poses and action-related objects, which provides a principled two-stream fusion strategy. Inspired by [13], which performs relational reasoning on question answering tasks, we define the pose-object relation as a composite function below:

$r = f_{\phi}\Big(\sum_{j} g_{\theta}(o_j, h_{pos}, h_{vel})\Big)$    (8)

where $r$ is the relation representation, and $g_{\theta}$ and $f_{\phi}$ are simple multilayer perceptrons (MLPs) with parameters $\theta$ and $\phi$, respectively.
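A minimal sketch of Eq. (8) is shown below; concatenating each object vector with the two pose representations before feeding $g_{\theta}$, as in [13], is an assumption about the exact input format.

    import numpy as np

    def relu_mlp(x, layers):
        # layers: list of (W, b) pairs; ReLU between layers, linear output
        for i, (W, b) in enumerate(layers):
            x = W @ x + b
            if i < len(layers) - 1:
                x = np.maximum(x, 0.0)
        return x

    def pose_object_relation(objects, h_pos, h_vel, g_layers, f_layers):
        # objects: iterable of 512-dim vectors o_j; h_pos, h_vel: pose features
        pooled = sum(relu_mlp(np.concatenate([o, h_pos, h_vel]), g_layers)
                     for o in objects)     # sum over all object-pose pairs
        return relu_mlp(pooled, f_layers)  # relation feature r, Eq. (8)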

3.5 Learning PSRN

The proposed PSRN is a principled framework for video-based action recognition. To achieve better performance, the design of the network architecture and the procedure of network training require careful treatment. In this section, we introduce the details of both.

Network Architecture    To further improve performance, we enhance the temporal pose stream by replacing the original two unidirectional LSTMs (uni-LSTMs) with two bidirectional LSTMs (bi-LSTMs) for both pose position and pose velocity. To model long-term dependencies on pose sequences with LSTMs, we adopt a lookback structure which looks at the outputs from the last $K$ steps when generating the output for the current step. Compared to other two-stream architectures, e.g., [6] and [11], which adopt very deep convolutional networks for optical flow and image appearance, our two-stream network, which only contains a VGG16 network on images and two bidirectional LSTMs on pose sequences, is not very complicated.

Network Training    Considering that both of the two streams can be used to perform the recognition task, we supervise our model with the following total loss:

$L = L_{pos} + L_{vel} + L_{rel} + \lambda \Omega(W)$    (9)

where $L_{pos}$, $L_{vel}$ and $L_{rel}$ are three cross-entropy losses corresponding to the pose position representation $h_{pos}$, the pose velocity representation $h_{vel}$ and the relation representation $r$, respectively. $\Omega(W)$ is the weight decay regularization over all the model parameters, and $\lambda$ is the weight decay coefficient, which is kept fixed in the experiments. It should be noted that the three losses share the same coefficient, because using different coefficients brings no obvious improvement in the experiments.

Due to the similar form of these three cross-entropy losses, we take $L_{pos}$ as an example:

$L_{pos} = -\sum_{n=1}^{V} \sum_{c=1}^{C} y_{n,c} \log \hat{y}_{n,c}$    (10)

where there are $V$ action videos and $C$ classes of human actions. $y_{n,c}$ is the groundtruth label of the $n$-th video for class $c$, while $\hat{y}_{n,c}$ is its prediction when using pose positions.
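The total objective of Eqs. (9)-(10) can be sketched as follows; treating the regularizer as a squared L2 norm and summing (rather than averaging) the per-video cross-entropies are assumptions of this sketch.

    import numpy as np

    def cross_entropy(probs, labels):
        # probs, labels: (V, C) predicted distributions and one-hot groundtruth
        return -np.sum(labels * np.log(probs + 1e-12))   # Eq. (10)

    def total_loss(p_pos, p_vel, p_rel, labels, params, lam):
        # Eq. (9): equally weighted cross-entropies plus weight decay
        l2 = sum(np.sum(w ** 2) for w in params)         # assumed L2 form
        return (cross_entropy(p_pos, labels) + cross_entropy(p_vel, labels)
                + cross_entropy(p_rel, labels) + lam * l2)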

During training, we use the Adam optimizer [29] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. In practice, it is difficult for the proposed PSRN to converge to a good solution when training the whole network from scratch. We therefore adopt a three-stage training strategy to provide better initializations for the final end-to-end training. In the first stage, we train the temporal pose stream with the losses $L_{pos}$ and $L_{vel}$. The initial learning rate is set to 0.0001 and halved after 78,000 iterations. In the second stage, we fix the learned parameters of the temporal pose stream, and only train the spatial object stream and the pose-object relational network with the loss $L_{rel}$. The VGG16 network in the spatial object stream is initialized with a widely-used model pretrained on the ImageNet dataset. For better convergence, we vary the learning rate in a way similar to the warm-up strategy introduced in [30], which increases the learning rate exponentially from 1e-6 to 1e-4 over the first 2,000 warm-up steps, and then halves it after 28,000 iterations. In the third stage, we train the whole network, initialized with the previously learned parameters, with the total loss $L$.
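As an illustration of the second-stage schedule, the sketch below follows the numbers given above; whether the 28,000-iteration decay is counted from the start of the stage or from the end of warm-up, and whether the rate is halved once or repeatedly, are assumptions.

    def learning_rate(step, warmup_steps=2000, start_lr=1e-6, base_lr=1e-4,
                      decay_step=28000):
        # exponential warm-up from start_lr to base_lr, then a single halving
        if step < warmup_steps:
            return start_lr * (base_lr / start_lr) ** (step / warmup_steps)
        return base_lr if step < decay_step else 0.5 * base_lr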

4 Experiments

In this section, we first introduce the experimental datasets and implementation details. Then, we compare the proposed PSRN with several state-of-the-art methods. Finally, we perform model analysis by evaluating the key model components and analyzing the confusion matrix of the classification results on PennAction.

4.1 Datasets

We evaluate our pose-based two-stream relational network (PSRN) on two benchmarks for pose-related action recognition, i.e., Sub-JHMDB [15] and PennAction [14]. These two datasets are very challenging due to their rich variation in appearance and dynamics. It should be noted that although full-body human joints are annotated for the videos in these datasets, we do not use them during model training or testing. In addition, we use the evaluation protocol introduced in [11] to report classification accuracy for both datasets.

Sub-JHMDB:  It is a subset of JHMDB which contains 316 clips distributed over 12 action categories, i.e., swing_baseball, climb_stairs, kick_ball, walk, jump, pullup, push, pick, catch, run, shoot_ball and golf. Each clip contains between 15 and 40 frames of size 320 × 240. There are 3 train/test splits. Similar to [11], we compare the average accuracy over these three splits with the state-of-the-art.

PennAction:  It consists of 2326 challenging consumer videos distributed over 15 action categories, i.e., baseball_pitch, baseball_swing, bench_press, bowling, clean_and_jerk, golf_swing, jump_rope, jumping_jacks, pullup, pushup, situp, squat, strum_guitar, tennis_forehand and tennis_serve. This dataset has rich annotations, consisting of action class labels, 2D keypoint positions with their corresponding visibilities, and camera viewpoints.

4.2 Implementation Details

Unless otherwise specified, we run our PSRN on the two datasets with the same implementation details. The sampled video frame is randomly cropped and resized to 224 × 224 for the VGG16 network to extract spatial feature maps (7 × 7 × 512), and the videos are transformed to multiple scales for the retrained PAF to estimate human poses (14 keypoints). For convenience, we rescale the pose positions to [0, 1]. As introduced in Section 3.2, human poses are divided into five parts with the following position dimensions: head (8-dim), left/right arm (6-dim) and left/right leg (6-dim). The body part positions are initially transformed into 100-dim representations with MLPs. The representation of a human pose is then the concatenated 500-dim vector, which is input to the LSTM-based pose stream. Both the position and velocity LSTMs have 512-dim hidden representations. The four-layer MLP for $g_{\theta}$ and the two-layer MLP for $f_{\phi}$ in the pose-object relational network consist of 512 units per layer with ReLU non-linearities. During both training and testing, we randomly sample 10 frames from each video as the input to the temporal pose stream, and resample one frame from these 10 frames as the input to the spatial object stream. The number of lookback steps $K$ in our LSTM implementation is set to 5. We implement our PSRN in TensorFlow (https://www.tensorflow.org).
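For concreteness, the sampling of stream inputs described above can be sketched as follows; sampling without replacement and sorting the chosen indices are assumptions.

    import numpy as np

    def sample_inputs(num_frames, rng=np.random.default_rng()):
        # 10 frame indices for the pose stream, one of them for the object stream
        pose_idx = np.sort(rng.choice(num_frames, size=10, replace=False))
        obj_idx = int(rng.choice(pose_idx))
        return pose_idx, obj_idx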

4.3 Experimental Results

To evaluate the performance of the proposed PSRN, we compare it with recent state-of-the-art approaches in pose-based action recognition on Sub-JHMDB and PennAction. Some of them are hand-crafted approaches (Action Bank [31], MST [32], AOG [33] and Hierarchical [34]), while the others exploit deep networks (P-CNN [35], JDD [36], Pose+idt-fv [37], RPAN [11]). The experimental results are shown in Table 1. We can see that our PSRN with only the pose stream achieves results comparable with some approaches. This is mainly credited to the attention mechanism and the bidirectional LSTMs on both pose positions and pose velocities. Our PSRN with the two-stream relational network achieves the best results and outperforms all the other methods on both datasets. This further verifies the role of action-related objects in action recognition and the effectiveness of our pose-object relational modelling.

State-of-the-art Year Sub-JHMDB PennAction
Action Bank [31] 2013 - 83.9
MST [32] 2014 45.3 74.0
AOG [33] 2015 61.2 85.5
P-CNN [35] 2015 66.8 -
Hierarchical [34] 2016 77.5 -
JDD [36] 2016 77.7 87.4
Pose+idt-fv [37] 2017 74.6 92.9
RPAN [11] 2017 78.6 97.4
Our PSRN (only pose stream) 71.7 95.9
Our PSRN (two-stream) 80.2 98.1
Table 1: Comparison with the state-of-the-art approaches on Sub-JHMDB (average over three splits) and PennAction. Our PSRN with two-stream relational network achieves the best performance.

4.4 Model Analysis

To understand the properties of the proposed PSRN, we first evaluate the effectiveness of several key model components on Sub-JHMDB (the second split) and PennAction, i.e., the attention mechanism for pose selection, the bi-LSTMs for pose sequence modelling and the two-stream relational network. Then we analyze the confusion matrix of the classification results on PennAction.

Architecture Analysis:  Tables 2 and 3 show the classification accuracies of several PSRN variants, which either replace the bidirectional LSTMs with unidirectional LSTMs (uni-LSTM+Attention) or remove the attention mechanism for selecting targeted poses (bi-LSTM). Moreover, we report the accuracies from the pose position, pose velocity, pose stream fusion (position and velocity), and the proposed two-stream relation fusion (relational network). We can see that under the same stream settings, our PSRN with bi-LSTMs and attention achieves better performance than the other two variants, which verifies the effectiveness of the bi-LSTMs and the attention mechanism. With the same network architecture, our two-stream relation fusion outperforms the other streams, which verifies the effectiveness of the proposed pose-object relational network.

Stream uni-LSTM+Attention bi-LSTM bi-LSTM+Attention
Pose Position 65.3 53.8 70.0
Pose Velocity 66.3 57.5 71.2
Pose Stream Fusion 68.8 60.0 73.8
Two-stream Fusion 77.5 65.0 83.7
Table 2: Evaluating the effectiveness of bi-LSTMs, attention mechanism and two-stream relational network on the second split of Sub-JHMDB.
Stream uni-LSTM+Attention bi-LSTM bi-LSTM+Attention
Pose Position 93.9 84.8 94.0
Pose Velocity 94.7 90.9 95.4
Pose Stream Fusion 95.4 92.3 95.9
Two-stream Fusion 96.9 94.9 98.1
Table 3: Evaluating the effectiveness of bi-LSTMs, attention mechanism and two-stream relational network on PennAction.

Confusion Matrix:  Furthermore, we analyze the classification results with a confusion matrix and compare it with RPAN [11] on PennAction. Fig. 5 (a) and (b) show the confusion matrices of RPAN and our PSRN, respectively. Our PSRN misclassifies 20 videos, compared with 31 for RPAN. The improvements mainly come from 4 human actions, i.e., bench_press (3), squat (4), tennis_forehand (5) and tennis_serve (2). We can see that all four classes have action-related objects, and our model captures these objects and their relations to the corresponding actions by virtue of the proposed pose-object relational network. Moreover, RPAN often misclassifies tennis_forehand as tennis_serve and squat as bench_press, while our PSRN reduces these confusions although these action pairs have very similar spatial appearance. This verifies the effectiveness of our pose-based temporal stream, which models and distinguishes action dynamics better.

Figure 5: Confusion matrix comparison on PennAction. (a) RPAN [11] and (b) our PSRN. The values in the matrix are the numbers of test videos. Our PSRN misclassifies far fewer videos than RPAN (20 vs. 31). The improvements mainly come from 4 human actions, i.e., bench_press (3), squat (4), tennis_forehand (5) and tennis_serve (2). Moreover, RPAN often misclassifies tennis_forehand as tennis_serve and squat as bench_press, while our PSRN reduces these confusions by 6−2=4 and 3, respectively. The comparison results show the advantages of our pose-based two-stream relational network.

5 Conclusion and Future Work

In this paper, we propose a pose-based two-stream relational network (PSRN) for action recognition. One stream models the temporal dynamics of the targeted 2D human pose sequences, while the other stream represents the spatial objects. Moreover, we propose a principled way to fuse these two streams. The proposed PSRN achieves the best performance on two challenging benchmarks.

More efforts should be made to further improve our work. First, our current relational model cannot explicitly tell us which objects are related to human poses/actions. We could propose a selective relational network that assigns weights to all the objects and further selects the most relevant object for pose-object relational modelling. Second, this work focuses on single-person action recognition based on the hypothesis that most human actions are performed by a specific person; it should be extended to handle multi-person action recognition. Third, 2D pose estimation is independent of the other parts of PSRN in this version. To perform end-to-end learning and achieve optimal performance, we need to seamlessly integrate pose estimation into our model.

References

  • [1] Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)
  • [2] Wang, H., Klaser, A., Schmid, C., Liu, C.: Action recognition by dense trajectories. In: CVPR (2011)
  • [3] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015)
  • [4] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)
  • [5] Feichtenhofer, C., Pinz, A., Zisserman, A.: Spatiotemporal multiplier networks for video action recognition. In: CVPR (2017)
  • [6] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV (2016)
  • [7] Soomro, K., Zamir, A., Shah, M.: Ucf101: a dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402 (2012)
  • [8] Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: CVPR (2015)
  • [9] Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: AAAI (2017)
  • [10] Shahroudy, A., Liu, J., Ng, T., Wang, G.: Ntu rgb+d: A large scale dataset for 3d human activity analysis. In: CVPR (2016)
  • [11] Du, W., Wang, Y., Qiao, Y.: Rpan: An end-to-end recurrent pose-attention network for action recognition in videos. In: ICCV (2017)
  • [12] Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR (2017)
  • [13] Santoro, A., Raposo, D., Barrett, D., Malinowski, M., Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural network module for relational reasoning. In: NIPS (2017)
  • [14] Zhang, W., Zhu, M., Derpanis, K.: From actemes to action: A strongly-supervised representation for detailed action understanding. In: ICCV (2013)
  • [15] Jhuang, H., Gall, J., Zuffi, S., Schmid, C.: Towards understanding action recognition. In: ICCV (2013)
  • [16] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: CVPR (2014)
  • [17] Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS (2014)
  • [18] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV (2016)
  • [19] Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. In: CVPR (2017)
  • [20] He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: ICCV (2017)
  • [21] Gowayyed, M., Torki, M., Hussein, M., El-Saban, M.: Histogram of oriented displacements (hod): describing trajectories of human joints for action recognition. In: IJCAI (2013)
  • [22] Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: ICCV (2017)
  • [23] Goodale, M., Milner, A.: Separate visual pathways for perception and action. In: Trends in Neurosciences (1992)
  • [24] Scarselli, F., Gori, M., Tsoi, A., Hagenbuchner, M., Monfardini, G.: The graph neural network model. In: TNN (2009)
  • [25] Battaglia, P., Pascanu, R., Lai, M., Rezende, D., Kavukcuoglu, K.: Interaction networks for learning about objects, relations and physics. In: NIPS (2016)
  • [26] Palm, R., Paquet, U., Winther, O.: Recurrent relational networks for complex relational reasoning. CoRR abs/1711.08028 (2017)
  • [27] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML (2015)
  • [28] Guo, G., Lai, A.: A survey on still image based human action recognition. In: PR (2014)
  • [29] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
  • [30] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
  • [31] Zhang, W., Zhu, M., Derpanis, K.G.: From actemes to action: A strongly-supervised representation for detailed action understanding. In: ICCV (2013)
  • [32] Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.C.: Cross-view action modeling, learning and recognition. In: CVPR (2014)
  • [33] Nie, B., Xiong, C., Zhu, S.: Joint action recognition and pose estimation from video. In: CVPR (2015)
  • [34] Lillo, I., Niebles, J., Soto, A.: A hierarchical pose-based approach to complex action understanding using dictionaries of actionlets and motion poselets. In: CVPR (2016)
  • [35] Chéron, G., Laptev, I., Schmid, C.: P-cnn: Pose-based cnn features for action recognition. In: ICCV (2015)
  • [36] Cao, C., Zhang, Y., Zhang, C., Lu, H.: Action recognition with joints-pooled 3d deep convolutional descriptors. In: ICCV (2016)
  • [37] Iqbal, U., Garbade, M., Gall, J.: Pose for action - action for pose. In: FG (2017)