Action Machine: Rethinking Action Recognition in Trimmed Videos

Jiagang Zhu, Wei Zou, Liang Xu, Yiming Hu, Zheng Zhu, Manyu Chang,
Junjie Huang, Guan Huang, Dalong Du
Institute of Automation, Chinese Academy of Sciences (CASIA)
University of Chinese Academy of Sciences (UCAS)
Horizon Robotics, Inc.
Xiamen University
{zhujiagang2015, wei.zou, huyiming2016, zhuzheng2014, huangjunjie2016}@ia.ac.cn
{liang.xu, guan.huang, dalong.du}@horizon.ai
{changmanyu}@stu.xmu.edu.cn
Abstract

Existing methods for video action recognition mostly do not distinguish the human body from the environment and easily overfit to scenes and objects. In this work, we present a conceptually simple, general and high-performance framework for action recognition in trimmed videos, aiming at person-centric modeling. The method, called Action Machine, takes as inputs videos cropped by person bounding boxes. It extends the Inflated 3D ConvNet (I3D) by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition, and is fast to train and test. Action Machine benefits from the multi-task training of action recognition and pose estimation, and from the fusion of predictions from RGB images and poses. On NTU RGB-D, Action Machine achieves state-of-the-art performance with top-1 accuracies of 97.2% and 94.3% on cross-view and cross-subject respectively. Action Machine also achieves competitive performance on three smaller action recognition datasets: Northwestern UCLA Multiview Action3D, MSR Daily Activity3D and UTD-MHAD. Code will be made available.

1 Introduction

With the release of Kinetics-400 [5] and Kinetics-600 [4] in the last two years, action recognition in videos has shown a trend similar to that of object recognition after ImageNet [11]. A variety of tasks, including trimmed video classification [51], temporal action recognition in untrimmed videos [28] and spatial-temporal action detection [9], have been quite popular in recent competitions [16, 15].

This paper studies action recognition in trimmed videos. To some extent, advances in this field are hampered by biases in dataset collection, the lack of annotations and the legacy of object recognition in images [11]. For example, the videos in UCF-101 [43] and HMDB-51 [25] are rich in scenes and objects, while missing person bounding box annotations (except their subsets, UCF-24 [43] and JHMDB-21 [21], which are used for spatial-temporal action detection). Previous methods [42, 49, 59, 60, 5], which do not explicitly distinguish the human body in videos, tend to predict an action according to the scenes and objects, since convolutional neural networks (CNNs) find it easier to classify objects and things than human motions. They can be easily distracted by irrelevant cues when recognizing an action. As shown in Fig. 1(b), the video frame with ground-truth class carry is predicted as the wrong action drop trash by the baseline Inflated 3D ConvNet (I3D) [5], because the model has learned that the trash can and the action drop trash always appear together in a video (Fig. 1(a)).

Figure 1: Visualizing the class-specific activation maps of the Inflated 3D ConvNet (I3D) [5] with Class Activation Mapping (CAM) [58]. Video frames of two action classes from Northwestern UCLA Multiview Action3D [48], i.e., drop trash and carry, performed by a man and a woman respectively, are displayed. The results of our person-centric modeling method (subfigures (c) and (d)) are more related to body movements, while the baseline I3D (subfigures (a) and (b)) overfits to the trash can.

This motivates us to design a model that can explicitly capture human body movements from videos while following the line of RGB- and CNN-based methods in action recognition. Pose (skeleton) data is lightweight, easy to understand and highly relevant to human action. It can be readily estimated by deep models, thanks to the recent advances of human pose estimation in single images [29, 35, 8, 53, 34, 36] and in videos [53]. Pose estimation methods are usually based on person bounding boxes, which greatly filter out non-human clutter in RGB images. In view of this, person-centric action recognition has the potential to benefit from joint training with pose estimation. Thanks to large-scale annotated datasets [29] and powerful deep networks [18], poses estimated from images in the wild are more robust than the skeletons captured by depth sensors such as Kinect, which are limited to indoor pose-based action recognition [30, 56, 61, 12, 57, 54]. Thus, action recognition in videos can be naturally formulated as a multi-task learning problem comprising RGB-based action recognition, pose estimation and pose-based action recognition.

In this work, a person-centric modeling method for human action recognition is proposed, called Action Machine, which shares a similar spirit with Convolutional Pose Machines (CPM) [52] in its sequential model design. The proposed method (Fig. 2) extends the Inflated 3D ConvNet (I3D) [5] by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition. In detail, the video frames are cropped by the bounding boxes of the target persons and taken as the inputs of I3D. For frame-wise pose estimation, a 2D deconvolution head is added to the last convolutional layer of I3D, in parallel with the existing head for RGB-based action recognition. After the pose estimation task, a 2D CNN is applied to the pose sequences for pose-based action recognition. At test time, the predictions of the two classification heads are fused by summation. Some class-specific activation maps of Action Machine are shown in Fig. 1(c) and (d), indicating that only the regions that truly correspond to the action are activated. The main contributions of this work are summarized as follows:

  1. We present a conceptually simple and general framework for action recognition in trimmed videos, called Action Machine, aiming at person-centric modeling.

  2. The proposed techniques for explicitly modeling human body movements, including person cropping, multi-task training of action recognition and pose estimation, and the fusion of predictions from RGB images and poses, help to improve model performance.

  3. We showcase the generality of our framework via extensive experiments on four human action datasets. Action Machine achieves state-of-the-art performance on NTU RGB-D [38] and Northwestern UCLA Multiview Action3D [48]. It also achieves competitive performance on MSR Daily Activity3D [47] and UTD-MHAD [6]. Action Machine is a high-performance framework while being fast to train and test.

In the remainder of this paper, related works are given in Section 2. Section 3 describes our proposed approach. In Section 4, our method is evaluated on the datasets. Finally, discussions and conclusions are given in Section 5.

Figure 2: Action Machine. It consists of the following steps: First, the videos after person cropping are used as the inputs of I3D for RGB-based action recognition. Then a 2D deconvolution head is added to the last convolutional layer of I3D for frame-wise pose estimation. Third, the estimated pose sequences are fed into a 2D CNN for pose-based action recognition. The proposed method is trained in a multi-task manner. Finally, the predictions of two heads for action recognition are fused by summation at test time.

2 Related works

2.1 Deep learning for action recognition

RGB-based methods. The two-stream ConvNet [42] employs RGB images and optical flow stacks as the inputs of two networks and fuses their predictions by late fusion. Temporal Segment Network (TSN) [49] improves the performance of the two-stream ConvNet by sparsely sampling video frames and learning video-level predictions. Deep networks with Temporal Pyramid Pooling (DTPP) [59] sample enough frames from videos and learn video-level representations end-to-end. Using a single network, C3D [45] learns spatial-temporal patterns from video clips with 3D convolutions. In 2017, DeepMind released a large-scale video action dataset, Kinetics [5], and proposed the Inflated 3D ConvNet (I3D). Non-local Net [51] equips I3D with an attention mechanism, extracting long-range interactions in the spatial-temporal domain. The above models take as inputs random spatial crops of video frames during training and can easily overfit to the scenes and objects in videos, because they fail to focus explicitly on the human body. Different from them, we use detected bounding boxes to crop the target persons from videos as the model inputs, eliminating the effect of background context.

Pose (skeleton)-based methods. Compared with RGB images, skeleton data has the merits of being lightweight and free from scene cues. Previous studies on pose-based action recognition can be categorized into RNN-based [30, 56, 61], CNN-based [12, 57] and GCN-based (Graph Convolutional Network) [54] methods. RNN-based methods [30, 56, 61] treat the skeleton data as coordinate vectors and capture the sequential information of the skeleton. CNN-based methods [12, 22] represent a skeleton sequence as a pseudo-image and recognize the underlying action in the same way as image classification. GCN-based methods [54] capture joint interactions on skeleton graphs, explicitly considering the adjacency relationships among joints in a non-Euclidean space. In this work, we follow the CNN-based methods [12, 57] and use a 2D CNN for pose-based action recognition.

2.2 Human pose estimation

Human pose estimation methods fall into top-down methods [8, 53, 35, 17], in which a pose estimator is applied to the output of a person detector, and bottom-up methods [3, 52], in which keypoint proposals are grouped together into person instances. In this work, we adopt a top-down method and resort to an off-the-shelf detector [10] for bounding boxes. For pose estimation, a 2D deconvolution head is added to the last convolutional layer of I3D. This is inspired by Mask R-CNN [17], which extends Faster R-CNN [37] to support keypoint estimation. Action Machine does not involve a detection task during training, and the person cropping operations are applied to the images instead of the features.

2.3 Multi-task learning for action recognition

The chained multi-stream network [63] unifies three sources, RGB images, optical flow and body part masks, for action recognition and detection. It introduces a Markov chain model to fuse these cues successively. In [34], Soft-argmax is extended to regress 2D and 3D poses directly, leading to end-to-end training of pose estimation and action recognition. Different from the above two works, Action Machine is based on I3D, which has fewer parameters than C3D [45] because of its 2D+1D convolutions [51]. It is also easy to train because pre-trained weights are transferred from a 2D CNN, and it does not need the costly optical flow maps required by the two-stream ConvNet [42]. The pose estimation method we use is detection-based, detecting keypoints by regressing heatmaps, and can obtain more accurate poses than the regression-based pose estimation in [34]. Meanwhile, the pose estimation head can benefit from the temporal context of the I3D output.

3 Action Machine

Figure 3: We extend I3D by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers, as can be inferred from context (conv preserves the spatial dimension while deconv increases it). The output convs of the heatmap and offset map are 1×1, and the deconvs are 4×4 with stride 2. 'res5' denotes the fifth stage of I3D with ResNet-50. '×8' denotes that the 2D pose estimation operations are shared across the 8 frames of the temporal dimension.

As shown in Fig. 2, the pipeline of Action Machine consists of the following steps. First, the videos after person cropping are taken as the inputs of I3D for RGB-based action recognition. Second, a 2D deconvolution head is added to the last convolutional layer of I3D for frame-wise pose estimation; the heatmaps produced by the pose estimation head are converted into joint coordinates by an argmax operation. Third, the joint coordinates, arranged as a 2D map, are taken as inputs by a 2D CNN for pose-based action recognition. The proposed method is trained in a multi-task manner. Finally, the two sources of predictions, i.e., RGB images and poses, are fused by summation at test time.

Network input. All video frames are fed to a published detector, Deformable ConvNets [10], to obtain person bounding boxes, and the category confidence threshold is set to 0.99 to suppress most false positive detections. The minimal bounding box enclosing all the detected boxes in a video is used for person cropping. In this way, the problem of missed detections in individual frames is mostly solved, and cropping the video with a single shared box aligns the features along the temporal dimension.
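
This cropping step can be sketched in a few lines. The snippet below is illustrative rather than our exact implementation: it assumes detections are given as NumPy rows of the hypothetical form [x1, y1, x2, y2, score], keeps those above the 0.99 threshold, takes the minimal enclosing box over the whole video, and crops every frame with it.

```python
import numpy as np

def enclosing_person_box(detections_per_frame, score_thresh=0.99):
    """Return one box [x1, y1, x2, y2] shared by all frames of a video.

    detections_per_frame: list with one array per frame, each row being
    [x1, y1, x2, y2, score] from an off-the-shelf person detector.
    """
    kept = [d[d[:, 4] >= score_thresh, :4]
            for d in detections_per_frame if len(d) > 0]
    boxes = np.concatenate([b for b in kept if len(b) > 0], axis=0)
    # Minimal box enclosing every confident detection in the video.
    return np.array([boxes[:, 0].min(), boxes[:, 1].min(),
                     boxes[:, 2].max(), boxes[:, 3].max()])

def crop_video(frames, box):
    """Crop every frame (H, W, 3) with the shared box."""
    x1, y1, x2, y2 = np.maximum(box, 0).astype(int)
    return [f[y1:y2, x1:x2] for f in frames]
```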

Backbone. We use I3D with a ResNet-50 [18] backbone, shown in Table 1, for feature extraction. In order to estimate the pose of each frame, we remove the temporal max pooling after the first stage of I3D. The output feature map of the backbone has a size of 2048×8×7×7 and is used by both RGB-based action recognition and pose estimation.

layer output size
conv1: 5×7×7, 64, stride 1, 2, 2 8×112×112
pool1: 1×3×3 max, stride 1, 2, 2 8×56×56
res2 8×56×56
res3 8×28×28
res4 8×14×14
res5 8×7×7

Table 1: Our ResNet-50 I3D model for video. The dimensions of 3D output maps and filter kernels are given as T×H×W (2D kernels as H×W), with the number of channels following. The input size is 8×224×224. The residual blocks of res2 to res5 follow the ResNet-50 design.
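
I3D initializes its 3D convolutions by inflating ImageNet-pretrained 2D weights along the temporal axis (we transfer pre-trained weights this way in Sec. 4.3). The snippet below is a generic PyTorch-style sketch of the inflation idea, not our MXNet code: the 2D kernel is repeated T times along time and divided by T so that responses keep roughly the same scale as in the 2D network.

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate a 2D conv kernel (out, in, kH, kW) to 3D (out, in, t, kH, kW)."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, t, 1, 1)  # copy the kernel along the temporal axis
    return w3d / t                                # rescale so activations match the 2D net

# Example: inflate a 7x7 conv1 kernel to the 5x7x7 kernel of Table 1.
w2d = torch.randn(64, 3, 7, 7)
w3d = inflate_conv_weight(w2d, t=5)
assert w3d.shape == (64, 3, 5, 7, 7)
```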

RGB-based action recognition. As shown in Fig. 3, global average pooling is performed after the last convolutional layer of I3D to get a 2048-d feature $x$.

Consider a dataset of $N$ videos with $C$ categories, $\{(v_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the label. Formally, the prediction can be obtained directly as

$p = \sigma(Wx + b),$   (1)

where $\sigma$ is the softmax operation, and $W \in \mathbb{R}^{C \times 2048}$ and $b \in \mathbb{R}^{C}$ are the parameters of the fully connected layer. In the training stage, combining with the cross-entropy loss, the final loss function is

$L_{rgb} = -\sum_{c=1}^{C} y_c \log p_c,$   (2)

where $p_c$ is the value of the $c$-th dimension of $p$.
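
Concretely, this head is global average pooling over the 8×7×7 output of res5 followed by a single fully connected layer, trained with cross-entropy. A minimal PyTorch-style sketch (the class count and batch size are illustrative):

```python
import torch
import torch.nn as nn

class RGBActionHead(nn.Module):
    def __init__(self, num_classes: int, in_channels: int = 2048):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_classes)   # W and b of Eq. (1)

    def forward(self, feat):                            # feat: (N, 2048, 8, 7, 7)
        x = feat.mean(dim=(2, 3, 4))                    # global average pooling -> (N, 2048)
        return self.fc(x)                               # logits; softmax is folded into the loss

head = RGBActionHead(num_classes=60)
logits = head(torch.randn(2, 2048, 8, 7, 7))
loss_rgb = nn.CrossEntropyLoss()(logits, torch.tensor([3, 17]))  # Eq. (2)
```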

Pose estimation. Given the output features of I3D, pose estimation is performed independently for each temporal position. Inspired by Mask R-CNN [17], a 2D deconvolution head is added to the last convolutional layer of I3D, as shown in Fig. 3. By default, two deconvolutional layers with batch normalization [20] and ReLU activation [24] are used. Each layer has 256 filters with 4×4 kernels and a stride of 2. Following [36], a 1×1 convolutional layer is added at the end to generate the predicted heatmaps for all keypoints (one channel per keypoint) and offsets (two channels per keypoint for the $x$- and $y$-directions), for a total of $3K$ output channels, where $K$ is the number of keypoints.

Given the image crop, let $f_k(x) = 1$ if the $k$-th keypoint is located at position $x$, and 0 otherwise. Here $k \in \{1, \ldots, K\}$ indexes the keypoint type and $x$ indexes the pixel locations on the image-crop grid. For each position $x$ and each keypoint $k$, we compute the probability $h_k(x) = 1$ if $\|x - l_k\| \le R$, which means the point $x$ is within a disk of radius $R$ from the location $l_k$ of the $k$-th keypoint. $K$ such heatmaps are trained by solving a binary classification problem for each position and keypoint independently. For each position $x$ and each keypoint $k$, we also predict the 2D offset vector $F_k(x) = l_k - x$ from the pixel to the corresponding keypoint. $K$ such vector fields are trained by solving a 2D regression problem for each position and keypoint independently.

The output of the heatmap branch yields the heatmap probabilities $\hat{h}_k(x)$ for each position $x$ and each keypoint $k$. The training target for the heatmap branch is a map of zeros and ones, with $h_k(x) = 1$ if $\|x - l_k\| \le R$ and 0 otherwise. The corresponding loss function is the sum of the smooth $L_1$ loss over each position and keypoint independently:

$L_{h} = \sum_{k=1}^{K} \sum_{x} \mathrm{smooth}_{L_1}\big(\hat{h}_k(x) - h_k(x)\big),$   (3)

where $\mathrm{smooth}_{L_1}$ is the smooth $L_1$ loss.

For training the offset regression branch, the differences between the predicted and ground-truth offsets are penalized with the smooth $L_1$ loss. The offset loss is only computed for positions within a disk of radius $R$ from each keypoint:

$L_{o} = \sum_{k=1}^{K} \sum_{x:\,\|x - l_k\| \le R} \mathrm{smooth}_{L_1}\big(\hat{F}_k(x) - (l_k - x)\big).$   (4)

The final loss function for pose estimation has the form

$L_{pose} = \lambda_{h} L_{h} + \lambda_{o} L_{o},$   (5)

where $\lambda_{h}$ and $\lambda_{o}$ are two scalar factors that balance the loss terms.
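
The targets and losses of Eq. (3)-(5) can be written compactly. The PyTorch-style sketch below builds the disk-of-radius-$R$ heatmap and offset targets on an h×w grid for K keypoints and evaluates both smooth $L_1$ terms; the tensor layouts and the batching are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def pose_targets(keypoints, h, w, radius):
    """keypoints: (K, 2) ground-truth (x, y) locations on the heatmap grid.
    Returns heatmap targets (K, h, w) and offset targets (K, 2, h, w)."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    grid = torch.stack([xs, ys], dim=0)               # (2, h, w), (x, y) per position
    off = keypoints[:, :, None, None] - grid[None]    # l_k - x for every position
    heat = (off.norm(dim=1) <= radius).float()        # 1 inside the disk of radius R
    return heat, off

def pose_loss(pred_heat, pred_off, gt_heat, gt_off, lambda_h=1.0, lambda_o=1.0):
    l_h = F.smooth_l1_loss(pred_heat, gt_heat, reduction="sum")              # Eq. (3)
    mask = gt_heat[:, None].expand_as(gt_off)                                # offsets supervised inside the disk only
    l_o = F.smooth_l1_loss(pred_off * mask, gt_off * mask, reduction="sum")  # Eq. (4)
    return lambda_h * l_h + lambda_o * l_o                                   # Eq. (5)
```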

At test time, for the $k$-th keypoint, the argmax operation is performed on the $k$-th heatmap to yield the coarse location

$\hat{x}_k = \arg\max_{x} \hat{h}_k(x).$   (6)

The accurate coordinate of the $k$-th keypoint is obtained by adding the corresponding offset $\hat{F}_k(\hat{x}_k)$ to $\hat{x}_k$.
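
Decoding is then a per-keypoint argmax refined by the offset predicted at that position, as in the sketch below (PyTorch tensors with illustrative shapes; not our exact implementation).

```python
import torch

def decode_keypoints(heat, off):
    """heat: (K, h, w) heatmap probabilities; off: (K, 2, h, w) predicted offsets.
    Returns (K, 2) keypoint coordinates (x, y) on the heatmap grid."""
    k, h, w = heat.shape
    flat_idx = heat.view(k, -1).argmax(dim=1)        # Eq. (6): coarse location per keypoint
    ys, xs = flat_idx // w, flat_idx % w
    coarse = torch.stack([xs, ys], dim=1).float()    # (K, 2) as (x, y)
    refine = off[torch.arange(k), :, ys, xs]         # offset predicted at the argmax position
    return coarse + refine
```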

Pose-based action recognition. The coordinates of the 2D poses are arranged into a tensor of size $T \times K \times 2$, where $T$ denotes the number of input frames. An extra confidence channel is added for each predicted joint, obtained by max pooling over its heatmap and passed through a ReLU activation. The resulting tensor is then fed into a ResNet-18 [18] for pose-based action recognition, as shown in Fig. 3. Due to the low spatial dimension of the input pose sequences, all the pooling operations in the used ResNet-18 are removed and all the stride-2 convolutions are replaced with stride 1. Global average pooling is performed after the last convolutional layer of ResNet-18 to get a 512-d feature. The prediction $q$ of the pose stream is optimized with the cross-entropy loss

$L_{pa} = -\sum_{c=1}^{C} y_c \log q_c,$   (7)

where $q_c$ is the value of the $c$-th dimension of $q$.
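
A sketch of this pose stream under the modifications above, using torchvision's ResNet-18 with its max pooling removed and all stride-2 convolutions set to stride 1. The exact layout of the pose "image" is an assumption; here it is treated as a 3-channel T×K map carrying x, y and a confidence value per joint.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_pose_cnn(num_classes: int) -> nn.Module:
    net = resnet18(num_classes=num_classes)
    net.maxpool = nn.Identity()                 # drop the max pooling
    for m in net.modules():                     # low-resolution input: no spatial down-sampling
        if isinstance(m, nn.Conv2d) and m.stride == (2, 2):
            m.stride = (1, 1)
    return net

# Pose sequence as a (3, T, K) map: x, y coordinates plus a per-joint confidence
# taken as the ReLU of the max over the corresponding heatmap.
T, K = 8, 17
coords = torch.rand(2, 2, T, K)                 # (batch, 2, T, K)
conf = torch.rand(2, 1, T, K)
pose_map = torch.cat([coords, conf], dim=1)     # (batch, 3, T, K)

net = build_pose_cnn(num_classes=60)
logits = net(pose_map)                          # trained with cross-entropy as in Eq. (7)
```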

Multi-task training. Action Machine has three tasks: RGB-based action recognition, pose estimation and pose-based action recognition. They are jointly optimized by the following loss function:

$L = \lambda_{rgb} L_{rgb} + \lambda_{pose} L_{pose} + \lambda_{pa} L_{pa},$   (8)

where $\lambda_{rgb}$, $\lambda_{pose}$ and $\lambda_{pa}$ are the loss weights of RGB-based action recognition, pose estimation and pose-based action recognition respectively; they are all set to 1.0 by default. Note that the gradients of pose-based action recognition do not backpropagate into the pose estimation head because of the argmax operation in Equation (6). We do not use the Soft-argmax of [34] on the heatmaps for end-to-end training, because we find that the keypoint quality of that approach on COCO [29] is lower than ours.
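
The joint objective is then a straightforward weighted sum of the three task losses, as in the sketch below (the loss tensors are assumed to come from the previous snippets; the note on detaching reflects the non-differentiable argmax mentioned above).

```python
def total_loss(l_rgb, l_pose, l_pa, w_rgb=1.0, w_pose=1.0, w_pa=1.0):
    """Eq. (8): weighted sum of the three task losses (all weights 1.0 by default)."""
    return w_rgb * l_rgb + w_pose * l_pose + w_pa * l_pa

# The pose stream consumes decoded coordinates, so gradients from the pose-based
# classifier do not flow back into the pose estimation head (the argmax of Eq. (6)
# is non-differentiable); detaching the heatmaps/offsets makes this explicit:
#   coords = decode_keypoints(pred_heat.detach(), pred_off.detach())
```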

Fusion of RGB and pose-based action recognition. In order to combine the strengths of the predictions from RGB images and poses, the predicted probabilities of the two heads are fused by summation at test time.

4 Experiments

4.1 Datasets

The proposed method has been evaluated on five datasets: COCO [29] for pose estimation, and four trimmed video action datasets, NTU RGB-D [38], Northwestern-UCLA Multiview Action 3D [48], MSR Daily Activity3D [47] and UTD-MHAD [6], for action recognition. (The videos in these datasets usually contain a single person, except for the 11 classes of NTU RGB-D in which the videos have two persons, which makes them appropriate for validating the effect of person-centric modeling with simple human detection alone. Datasets like UCF-101 [43] and HMDB-51 [25] have group activities (sports) with more than one person and a variety of scenes; handling them would require tracking and person re-identification for our use, and additional annotation cost, which is beyond the scope of this paper.)

COCO [29]. The COCO train, validation, and test sets contain more than 200K images and 250K person instances annotated with keypoints, of which 150K instances are publicly available for training and validation. Our models are trained on the COCO train2017 set (57K images and 150K person instances) and tested on the val2017 set.

NTU RGB-D [38]. This dataset is acquired with a Kinect v2 sensor. It contains more than 56K videos and 4 million frames with 60 different activities including individual activities, interactions between two people, and health-related events. The actions are performed by 40 subjects and recorded from 80 viewpoints. We follow the cross-subject and cross-view protocol from [38].

Northwestern-UCLA Multiview Action 3D (N-UCLA) [48]. This dataset contains 1494 sequences, covering 10 action categories, such as drop trash or sit down. Each sequence is captured simultaneously by 3 Kinect v1 cameras. Each action is performed one to six times by ten different subjects. We follow the cross-view protocol defined by [48]. It has three cross-view combinations: xview1, xview2 and xview3. The combination xview1 means that the samples from view 2 and 3 are for training, and the samples from view 1 are for testing.

MSR Daily Activity3D (MSR) [47]. This dataset contains 320 videos shot with a Kinect v1 sensor. 16 daily activities are performed twice by 10 subjects from a single viewpoint. Following [47], we use the videos from subject 1, 3, 5, 7 and 9 for training, and the remaining ones for testing.

UTD-MHAD [6]. This dataset is collected using a Microsoft Kinect sensor and a wearable inertial sensor in an indoor environment. It contains 27 actions performed by 8 subjects and has 861 sequences. Cross-subject protocol [6] is used for testing.

4.2 Experimental settings for pose estimation

Training. The ground-truth human box is extended in height or width to a fixed aspect ratio (height : width = 4 : 3). It is then cropped from the image and resized to a fixed resolution; the default resolution is 384×288. Data augmentation includes scale (30%), rotation (30 degrees) and flip. Our models are pre-trained on ImageNet [11]. ResNet-50 is used by default. We train our models on a 4-GPU machine with 2 images per GPU in a mini-batch (a total mini-batch size of 8 images). We train for 22 epochs in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 at epochs 17 and 21. SGD is used, with a momentum of 0.9 and a weight decay of 0.0001. The L1 norm of each gradient is clipped at 2 independently. MXNet [7] is used for implementation.
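
The box preprocessing above (and, with a 1:1 ratio, in Sec. 4.3) can be sketched as follows: pad the detected box in height or width until it reaches the target aspect ratio, then crop and resize. This is an illustrative sketch with hypothetical function names; OpenCV is only used for the resize.

```python
import numpy as np
import cv2

def fix_aspect_ratio(box, aspect=4.0 / 3.0):
    """Extend box [x1, y1, x2, y2] in height or width to height:width = aspect."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    if h / w < aspect:        # too wide -> grow the height
        h = w * aspect
    else:                     # too tall -> grow the width
        w = h / aspect
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def crop_and_resize(img, box, out_hw=(384, 288)):
    x1, y1, x2, y2 = fix_aspect_ratio(box).round().astype(int)
    x1, y1 = max(x1, 0), max(y1, 0)
    patch = img[y1:y2, x1:x2]
    return cv2.resize(patch, (out_hw[1], out_hw[0]))   # cv2 expects (width, height)
```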

Testing. The detected person bounding boxes on COCO val2017 are used. Following the common practice in [35, 8, 53], the joint location is predicted on the averaged heatmaps of the original images and their horizontal flips. Following [1], we use the mean average precision (AP) over 10 OKS (object keypoint similarity) thresholds for evaluation.

4.3 Experimental settings for action recognition

Preprocessing. As shown in Table 2, for all datasets we downscale the videos while keeping their aspect ratios. The sampling stride is selected according to the frame rate of the videos and the model performance.

Training. Our models are pre-trained on ImageNet [11]. Then the pre-trained weights are inflated from 2D to 3D, as shown in Table 1. For the small datasets, including N-UCLA [48], MSR [47] and UTD-MHAD [6], we also try pre-training our models on NTU RGB-D [38]. The models are fine-tuned using 8-frame clips with the sampling strides shown in Table 2. The start frame is randomly sampled during training. Data augmentation includes jittering of the bounding box center, width and height, and random mirroring. The bounding box is then extended in height or width to a fixed aspect ratio (height : width = 1 : 1; we do not keep the same aspect ratio as in 2D pose estimation on COCO, because we use a shared box for all frames in a video and a moving person is likely to cover a range different from a still person in a single image). It is then cropped from the image and resized to a fixed resolution; the default resolution is 224×224. We train our models on a 4-GPU machine with 4 clips (32 images) per GPU in a mini-batch (a total mini-batch size of 16 clips). We train for 85 epochs in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 at epochs 42 and 68. SGD is used, with a momentum of 0.9 and a weight decay of 0.0001. The L1 norm of each gradient is clipped at 2 independently. Dropout with a ratio of 0.5 is used before the fully connected layers of RGB-based and pose-based action recognition. The pose annotations of the video frames are obtained using the detected boxes and the pose models trained on COCO [29]. MXNet [7] is used for implementation.
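
Clip sampling follows the strides of Table 2: 8 frames at a fixed stride with a random start frame. The sketch below returns frame indices only (the clip is then cropped and augmented as described above); the clamping behavior for very short videos is an assumption.

```python
import random

def sample_training_clip(num_frames, clip_len=8, stride=8):
    """Return clip_len frame indices with the given sampling stride.

    The start frame is drawn at random; indices are clamped to the last
    frame when a short video cannot cover the full span.
    """
    span = (clip_len - 1) * stride + 1
    start = random.randint(0, max(num_frames - span, 0))
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

# e.g. an NTU RGB-D video of 120 frames with stride 8 (Table 2)
print(sample_training_clip(120, clip_len=8, stride=8))
```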

Testing. Following [51], fully convolutional inference is performed spatially on the videos using three crops, i.e., the upper-left, center and lower-right crops around the bounding box center. 10 clips are evenly sampled from the full-length video and the softmax scores are computed for each of them individually. The final prediction is the average of the softmax scores over all clips. We report the top-1 accuracy using the model of the last epoch.
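
Putting the test-time protocol together: 10 evenly spaced clips, three spatial crops each, per-clip softmax, score averaging, and the sum fusion of the RGB and pose heads from Sec. 3. The sketch below assumes a hypothetical `model` callable that returns the two sets of logits for each preprocessed input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_video(model, clips_and_crops):
    """clips_and_crops: iterable of preprocessed inputs, one per (clip, crop) pair,
    e.g. 10 evenly sampled clips x 3 spatial crops = 30 forward passes.
    `model` returns (rgb_logits, pose_logits) for each input."""
    scores = []
    for x in clips_and_crops:
        rgb_logits, pose_logits = model(x)
        # Sum fusion of the two classification heads, after softmax.
        scores.append(softmax(rgb_logits) + softmax(pose_logits))
    return int(np.argmax(np.mean(scores, axis=0)))   # top-1 class
```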

Dataset resolution resize to frame rate stride
NTU RGB-D [38] 1080×1920 256×454 30 8
N-UCLA [48] 480×640 256×340 12 1
MSR [47] 480×640 256×340 30 8
UTD-MHAD [6] 480×640 256×340 15 4
Table 2: Video preprocessing and configurations.

4.4 Pose estimation on COCO

Figure 4: Results of our pose estimation method on COCO [29].
Method Backbone Input Size AP
8-stage Hourglass [35] - 256×192 66.9
CPN [8] ResNet-50 384×288 70.6
Simple Baseline [53] ResNet-50 384×288 72.2
Ours ResNet-50 384×288 72.7
Table 3: Comparison with Hourglass [35], CPN [8] and Simple Baseline [53] on COCO val2017 dataset.

As shown in Table 3, our method is compared with state-of-the-art methods: Hourglass [35], CPN [8] and Simple Baseline [53] on COCO val2017. Our method achieves competitive performance with the above methods. Fig. 4 shows some results of our pose estimation method on COCO val2017 dataset. The proposed method equipped with I3D is used for the next action recognition experiments.

4.5 Action recognition

In this section, Action Machine is compared with other approaches on four human action datasets. Results are shown in Tables 4, 5, 6 and 7, where ✓ denotes that the corresponding modality is used as the model input at test time. Note that Action Machine does not take human poses as input at test time, because it has learned to estimate poses from RGB images during training.

Performance on NTU RGB-D. In Table 4, our model is compared with pose-based methods [46, 13, 38, 38, 30, 23, 22, 56, 54, 41] and RGB-based methods [40, 63, 34, 2, 32]. Action Machine with a single modality (RGB) as input at test time achieves state-of-the-art performance. Specifically, Action Machine outperforms PoseMap [32] by 2 and 2.6 points in top-1 accuracy on cross-view and cross-subject respectively. Compared to PoseMap [32], Action Machine is conceptually simple and easy to implement.

Pose RGB xview xsub
Lie Group [46] ✓ - 52.8 50.1
H-RNN [13] ✓ - 64.0 59.1
Deep LSTM [38] ✓ - 67.3 60.7
PA-LSTM [38] ✓ - 70.3 62.9
ST-LSTM+TS [30] ✓ - 77.7 69.2
Temporal Conv [23] ✓ - 83.1 74.3
C-CNN+MTLN [22] ✓ - 84.8 79.6
VA-LSTM [56] ✓ - 87.6 79.4
ST-GCN [54] ✓ - 88.3 81.5
SR-TSL [41] ✓ - 92.4 84.8
Chained [63] - ✓ - 80.8
DSSCA-SSLM [40] - ✓ - 74.9
2D-3D-Softargmax [34] - ✓ - 85.5
Glimpse Clouds [2] - ✓ 93.2 86.6
PoseMap [32] ✓ ✓ 95.2 91.7
Action Machine (Ours) - ✓ 97.2 94.3
Table 4: Performance on NTU RGB-D, accuracy (%).

Performance on N-UCLA. In Table 5, our model is compared with pose-based methods [46, 13, 31, 26] and an RGB-based method [2]. Action Machine with a single modality (RGB) as input outperforms previous state-of-the-art approaches. Without using LSTMs and extra handcrafted rules as Glimpse Clouds [2] does, Action Machine gains 4.7 points in average top-1 accuracy on cross-view.

Pose RGB xview1 xview2 xview3 Avg
Lie Group [46] ✓ - - - - 74.2
H-RNN [13] ✓ - - - - 78.5
Enhanced viz. [31] ✓ - - - - 86.1
Ensemble TS-LSTM [26] ✓ - - - - 89.2
Glimpse Clouds [2] - ✓ 83.4 89.5 90.1 87.6
Action Machine (Ours) - ✓ 88.3 92.2 96.5 92.3
Table 5: Performance on N-UCLA, accuracy (%).

Performance on MSR. In Table 6, our model is compared with pose-based methods [47, 14, 55, 44] and depth-based methods [62, 39, 33, 40]. Without using the depth modality, Action Machine achieves competitive performance compared to DSSCA-SSLM [40], which is based on handcrafted features of RGB and depth. However, on the cross-subject split of NTU RGB-D (Table 4), a much larger dataset than MSR, DSSCA-SSLM [40] is 19.4 points lower than ours, showing the robustness of our method with respect to the amount of data.

Pose RGB Depth xsub
Action Ensemble [47] ✓ - - 68.0
Efficient Pose-Based [14] ✓ - - 73.1
Moving Pose [55] ✓ - - 73.8
Moving Poselets [44] ✓ - - 74.5
Depth Fusion [62] - - ✓ 88.8
MMMP [39] ✓ - ✓ 91.3
DL-GSGC [33] ✓ - ✓ 95.0
DSSCA-SSLM [40] - ✓ ✓ 97.5
Action Machine (Ours) - ✓ - 93.0
Table 6: Performance on MSR Daily Activity3D, accuracy (%).

Performance on UTD-MHAD. In Table 7, our model is compared with [50, 19, 27, 32]. Without using the 3D poses extracted by a depth sensor as these methods do, Action Machine with the RGB modality as input achieves competitive performance.

Pose RGB xsub
JTM [50] ✓ - 85.8
Optical Spectra [19] ✓ - 86.9
JDM [27] ✓ - 88.1
PoseMap [32] ✓ ✓ 94.5
Action Machine (Ours) - ✓ 92.5
Table 7: Performance on UTD-MHAD, accuracy (%).

4.6 Ablation study

Ablation studies are performed on NTU RGB-D and N-UCLA to verify the effectiveness of our techniques for person-centric modeling in Action Machine. There are four basic configurations, as illustrated below:

(a) Activation maps of RGBAction person crop
(b) Activation maps of KPS RGBAction
Figure 5: Visualizing the class-specific activation maps of our model with the Class Activation Mapping (CAM) [58]. Activation maps of video snippets of three action classes, i.e., sit down, stand up, carry are shown from top to the bottom. It is clear that the multi-task training of RGB-based action recognition and pose estimation can make the model focus on the spatial-temporal regions related to the action class.

RGBAction random crop. The baseline I3D model takes as inputs the random crops of videos and performs action recognition using RGB feature.

RGBAction person crop. The I3D model takes as inputs the videos after person cropping and performs action recognition using RGB feature.

KPS RGBAction. The I3D model takes as inputs the videos after person cropping, performs action recognition using RGB feature, and adds a head for pose estimation.

KPS PoseAction RGBAction. The I3D model takes as inputs the videos after person cropping, adds a head for pose estimation, and performs action recognition using RGB and pose feature. The model trained from KPS RGBAction is used as the pre-trained model. We fix it and only train the ResNet-18 or ResNet-50 for pose-based action recognition. We report the results of pose-based action recognition and the sum fusion of predictions from RGB images and poses.

As shown in Table 8, on the cross-subject split of NTU RGB-D, person cropping improves the model accuracy by 0.9 points over random cropping. Our full model outperforms the baseline RGBAction random crop by 2 points. Due to the high accuracy of the baseline, the improvement on cross-view is less obvious. Similar gains can also be observed on the small subsets of NTU RGB-D (xview-s, xsub-s), which we originally used for fast training and testing in our implementation (see Appendix A).

xview xsub xview-s xsub-s
RGBAction random crop 97.2 92.3 94.3 61.2
RGBAction person crop 97.7 93.2 94.5 67.9
KPS RGBAction 97.3 93.8 95.0 71.2
KPS PoseAction RGBAction
(ResNet-18)
90.1/97.1 84.9/94.1 87.8/95.9 62.9/72.7
KPS PoseAction RGBAction
(ResNet-50)
91.3/97.2 85.5/94.3 89.9/96.1 66.0/73.5
Table 8: Ablation studies on NTU RGB-D, accuracy (%). In the rows that contain a slash (/), the number on the left of the slash is the accuracy of pose-based action recognition and the number on the right is the accuracy of the fusion of RGB and pose results. xview-s and xsub-s denote the small subsets of NTU RGB-D cross-view and cross-subject respectively.

As shown in Table 9, on N-UCLA, Action Machine outperforms the baseline by a large margin. Specifically, RGBAction person crop with the person cropping technique improves the accuracy by 1.6 and 4.3 points on xview1 and xview3 over the baseline RGBAction random crop respectively. Person cropping does not bring an accuracy gain on xview2, because the test crops of the front-view images (Fig. 5, first and second rows) of this dataset are already close to those cropped by person boxes. Jointly training pose estimation and RGB-based action recognition, i.e., KPS RGBAction, improves accuracy by about 3 to 7 points. Overall, using ResNet-18, our final model exceeds the baseline by 7.2 points. With a stronger backbone, i.e., ResNet-50 for pose-based action recognition, and NTU RGB-D pre-training, the accuracies of our models, either solely from poses or from the fusion of RGB images and poses, are further improved.

To better understand how our approach learns discriminative features for action recognition, the class-specific activation maps of our models are visualized with the Class Activation Mapping (CAM) [58] approach in Fig. 5. The videos are sampled from N-UCLA and cover three classes (sit down, stand up, carry). These maps show that jointly training RGB-based action recognition with pose estimation makes the model focus on the motions of the human body. For example, KPS RGBAction (Fig. 5(b)) pays more attention to the standing and sitting process, while RGBAction person crop (Fig. 5(a)) seems to focus on the object (the chair). In particular, in the carry example, KPS RGBAction is significantly activated only by the hand (center of the body), whereas RGBAction person crop is activated by the trash can, leading to a wrong prediction (drop trash).

xview1 xview2 xview3 Avg
RGBAction random crop 81.6 82.4 86.3 83.4
RGBAction person crop 83.2 82.4 90.6 85.4
KPS RGBAction 86.3 90 94.9 90.4
KPS PoseAction RGBAction
(ResNet-18)
79.7/87.5 81/90.4 87.5/94.1 82.7/90.6
KPS PoseAction RGBAction
(ResNet-50)
84.2/89.6 81.8/90 88.4/94.3 84.8/91.3
KPS PoseAction RGBAction
(ResNet-18, NTU pre-training)
85.5/88.6 88.0/91.6 93.2/96.5 88.9/92.2
KPS PoseAction RGBAction
(ResNet-50, NTU pre-training)
83.8/88.3 87.6/92.2 93.2/96.5 88.2/92.3
Table 9: Ablation studies on N-UCLA, accuracy (%). In the rows that contain a slash (/), the number on the left of the slash is the accuracy of pose-based action recognition and the number on the right is the accuracy of the fusion of RGB and pose results.

4.7 Timing

Inference: We train a ResNet-50-I3D model that shares features between RGB-based action recognition and pose estimation with two deconvolutional layers, followed by a ResNet-50 for pose-based action recognition. This model runs at 55 ms per clip (8 frames) on an Nvidia Titan X GPU. As the dimension of the pose sequences is small, substituting ResNet-18 for ResNet-50 in pose-based action recognition does not make much difference: it takes 50 ms. The I3D baseline takes 30 ms. Action Machine is thus fast at test time and adds only a small overhead to I3D.

Training: Action Machine is also fast to train. Training with ResNet-50-I3D on the cross-view split of NTU RGB-D takes 32 hours (0.66 s per mini-batch of 16 clips, i.e., 128 frames) in our synchronized 4-GPU implementation.

5 Discussions and Conclusions

We propose a person-centric modeling method, Action Machine, for human action recognition in trimmed videos. It has three complementary tasks: RGB-based action recognition, pose estimation and pose-based action recognition. By using person bounding boxes and human poses, Action Machine achieves competitive performance compared with other approaches on four video action datasets [38, 48, 47, 6]. However, in our implementation, it is hard to strictly discard non-human clutter (e.g., the trash can in Fig. 5) because of the bounding box quality and other postprocessing steps. Besides, in our multi-task training, the ground-truth pose annotations are estimated by a model trained on COCO [29] and may not be abundant enough to train the pose estimation task well, due to the paucity of videos. Joint training of pose estimation on COCO and action recognition on videos may relieve this problem, as we can exploit the data richness of COCO.

References

  • [1] COCO: COCO Leaderboard. http://cocodataset.org.
  • [2] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor. Glimpse clouds: Human activity recognition from unstructured feature points. In CVPR, 2018.
  • [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [4] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. CoRR, abs/1808.01340, 2018.
  • [5] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [6] C. Chen, R. Jafari, and N. Kehtarnavaz. Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In ICIP, 2015.
  • [7] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
  • [8] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
  • [9] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
  • [10] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • [12] Y. Du, Y. Fu, and L. Wang. Skeleton based action recognition with convolutional neural network. In ACPR, 2015.
  • [13] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
  • [14] A. Eweiwi, M. S. Cheema, C. Bauckhage, and J. Gall. Efficient pose-based action recognition. In ACCV, 2014.
  • [15] B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Alwassel, V. Escorcia, R. Krishna, S. Buch, and C. D. Dao. The activitynet large-scale activity recognition challenge 2018 summary. CoRR, abs/1808.03766, 2018.
  • [16] B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Alwassel, R. Krishna, V. Escorcia, K. Hata, and S. Buch. Activitynet challenge 2017 summary. CoRR, abs/1710.08011, 2017.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [19] Y. Hou, Z. Li, P. Wang, and W. Li. Skeleton optical spectra-based action recognition using convolutional neural networks. TCSVT, 2018.
  • [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [21] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
  • [22] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid. A new representation of skeleton sequences for 3d action recognition. In CVPR, 2017.
  • [23] T. S. Kim and A. Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In CVPRW, 2017.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS. 2012.
  • [25] H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre. Hmdb51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering, 2013.
  • [26] I. Lee, D. Kim, S. Kang, and S. Lee. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In ICCV, 2017.
  • [27] C. Li, Y. Hou, P. Wang, and W. Li. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Processing Letters, 2017.
  • [28] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In ECCV, 2018.
  • [29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [30] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In ECCV, 2016.
  • [31] M. Liu, H. Liu, and C. Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 2017.
  • [32] M. Liu and J. Yuan. Recognizing human actions as the evolution of pose estimation maps. In CVPR, 2018.
  • [33] J. Luo, W. Wang, and H. Qi. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In ICCV, 2013.
  • [34] D. C. Luvizon, D. Picard, and H. Tabia. 2d/3d pose estimation and action recognition using multitask deep learning. In CVPR, 2018.
  • [35] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, ECCV, 2016.
  • [36] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
  • [37] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS. 2015.
  • [38] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In CVPR, 2016.
  • [39] A. Shahroudy, T. Ng, Y. Gong, and G. Wang. Multimodal multipart learning for action recognition in depth videos. PAMI, 2016.
  • [40] A. Shahroudy, T. Ng, Y. Gong, and G. Wang. Deep multimodal feature analysis for action recognition in rgb+d videos. PAMI, 2018.
  • [41] C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV, 2018.
  • [42] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS. 2014.
  • [43] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
  • [44] L. Tao and R. Vidal. Moving poselets: A discriminative and interpretable skeletal motion representation for action recognition. In ICCVW, 2015.
  • [45] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [46] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, 2014.
  • [47] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, 2012.
  • [48] J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu. Cross-view action modeling, learning, and recognition. In CVPR, 2014.
  • [49] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [50] P. Wang, W. Li, C. Li, and Y. Hou. Action recognition based on joint trajectory maps with convolutional neural networks. Knowledge-Based Systems, 2018.
  • [51] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • [52] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [53] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
  • [54] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
  • [55] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV, 2013.
  • [56] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV, 2017.
  • [57] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View adaptive neural networks for high performance skeleton-based human action recognition. CoRR, abs/1804.07453, 2018.
  • [58] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
  • [59] J. Zhu, W. Zou, and Z. Zhu. End-to-end video-level representation learning for action recognition. In ICPR, 2018.
  • [60] J. Zhu, W. Zou, and Z. Zhu. Two stream gated fusion convnets for action recognition. In ICPR, 2018.
  • [61] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI, 2016.
  • [62] Y. Zhu, W. Chen, and G. Guo. Fusing multiple features for depth-based action recognition. ACM Trans. Intell. Syst. Technol., 2015.
  • [63] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In ICCV, 2017.

Appendix

Appendix A NTU RGB-D small subset setting

In Section 4.6, for ablation studies of different configurations of our models, we use the small subsets of NTU RGB-D (xview-s, xsub-s), which are designed by us for the fast training and testing.

For xview-s, the sample videos of the original cross-view split with subject ID larger than 5 are discarded. For this evaluation, the training and testing sets have 3,839 and 1,917 samples (about 1/10th of the full cross-view split), respectively.

For xsub-s, based on the original cross-subject split, we pick all the samples of camera 1 and discard samples of cameras 2 and 3. The sample videos with subject ID larger than 10 are discarded. For this evaluation, the training and testing sets have 4,317 and 1,439 samples (about 1/10th of the full cross-subject split), respectively.

Appendix B Cross-dataset recognition task

In order to show the advantage of person-centric modeling over the baseline, we further test our trained models on different datasets. Specifically, we train our models on NTU RGB-D cross-subject and test them on the test sets of N-UCLA, MSR Daily and UTD-MHAD respectively. The shared category mappings between each smaller dataset and NTU RGB-D are shown in Tables 10, 11 and 12, and the test videos are limited to those with these ground-truth classes. Because of the different sources of the videos, the scene contexts and objects in the training dataset are largely different from those in the testing dataset. A model that does not capture human body motion will behave worse than one that learns to focus on it. Results are shown in Table 13. It is clearly observed that our proposed person-centric modeling techniques, including person cropping, multi-task training of action recognition and pose estimation, and the fusion of predictions from RGB images and poses, help to improve the performance of the baseline model RGBAction random crop on different datasets. Existing methods based on RGB images easily overfit to the scenes and objects of specific datasets without focusing on human body movements, even though they may achieve high performance on those datasets. In contrast, Action Machine is more generalizable and extensible.

N-UCLA NTU RGB-D
pick up with one hand pickup
pick up with two hands pickup
stand up standing up (from sitting position)
sit down sitting down
throw throw
Table 10: Shared category mapping between N-UCLA and NTU RGB-D.
MSR Daily NTU RGB-D
drink drink water
eat eat meal/snack
read book reading
call cellphone make a phone call/answer phone
write on a paper writing
cheer up cheer up
stand up standing up (from sitting position)
sit down sitting down
Table 11: Shared category mapping between MSR Daily and NTU RGB-D.
UTD-MHAD NTU RGB-D
sit to stand standing up (from sitting position)
stand to sit sitting down
Table 12: Shared category mapping between UTD-MHAD and NTU RGB-D.
N-UCLA MSR Daily UTD-MHAD
RGBAction random crop 70.0 70.8 100
RGBAction person crop 70.0 78.4 100
KPS RGBAction 76.4 79.7 100
KPS PoseAction RGBAction
(ResNet-18)
68.8/76.4 58.2/79.7 100/100
KPS PoseAction RGBAction
(ResNet-50)
69.2/77.3 63.2/81.0 100/100
Table 13: Cross-dataset testing on N-UCLA (xview3), MSR Daily and UTD-MHAD, accuracy (%). The models in the first column are trained on NTU RGB-D cross-subject. In the rows that contain a slash (/), the number on the left of the slash is the accuracy of pose-based action recognition and the number on the right is the accuracy of the fusion of RGB and pose results.
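
Cross-dataset evaluation only needs a class-name mapping and a filter over the test set, as sketched below. The mapping mirrors Table 10; the sample format and helper names are hypothetical.

```python
# Mapping from N-UCLA class names to NTU RGB-D class names (Table 10).
NUCLA_TO_NTU = {
    "pick up with one hand": "pickup",
    "pick up with two hands": "pickup",
    "stand up": "standing up (from sitting position)",
    "sit down": "sitting down",
    "throw": "throw",
}

def filter_and_remap(test_samples, ntu_class_to_id, mapping=NUCLA_TO_NTU):
    """Keep only test videos whose ground-truth class is shared with NTU RGB-D
    and relabel them with the NTU class id expected by the trained model.

    test_samples: iterable of (video_path, class_name) pairs (hypothetical format).
    """
    kept = []
    for path, name in test_samples:
        if name in mapping:
            kept.append((path, ntu_class_to_id[mapping[name]]))
    return kept
```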

Appendix C Pose estimation results on action recognition datasets

We visualize the video frames with detected boxes and estimated poses on action recognition datasets in Fig. 6, 7, 8, 9. We use the testing model KPS RGBAction, which performs RGB-based action recognition and pose estimation simultaneously. In general, the estimated poses are accurate.

Figure 6: Visualizing video frames from the test set of NTU RGB-D cross-view. Video frames with detected boxes (yellow) and estimated poses of nine action classes, i.e., wear a shoe, cheer up, pointing to something with finger, falling, writing, put the palms together, reading, brushing teeth, tear up paper, are shown from top to bottom.
Figure 7: Visualizing the video frames of the test set of N-UCLA xview3. The video frames with detected boxes (yellow) and estimated poses of four action classes, i.e., sit down, throw, donning, pick up with two hands are shown from top row to the bottom.
Figure 8: Visualizing the video frames of the test set of MSR Daily. The video frames with detected boxes (yellow) and estimated poses of four action classes, i.e., sit down, call cellphone, play guitar, walk are shown from top row to the bottom.
Figure 9: Visualizing the video frames of the test set of UTD-MHAD. The video frames with detected boxes (yellow) and estimated poses of four action classes, i.e., two hand push, front boxing, cross arms in the chest, forward lunge (left foot forward) are shown from top row to the bottom.