Action Machine: Rethinking Action Recognition in Trimmed Videos
Existing methods for video action recognition mostly do not distinguish the human body from the environment, and easily overfit to scenes and objects. In this work, we present a conceptually simple, general, and high-performance framework for action recognition in trimmed videos, aimed at person-centric modeling. The method, called Action Machine, takes as input videos cropped by person bounding boxes. It extends the Inflated 3D ConvNet (I3D) by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition, and is fast to train and test. Action Machine benefits from the multi-task training of action recognition and pose estimation, and from the fusion of predictions from RGB images and poses. On NTU RGB-D, Action Machine achieves state-of-the-art performance with top-1 accuracies of 97.2% and 94.3% on cross-view and cross-subject respectively. Action Machine also achieves competitive performance on three smaller action recognition datasets: Northwestern UCLA Multiview Action3D, MSR Daily Activity3D, and UTD-MHAD. Code will be made available.
With the release of Kinetics-400 and Kinetics-600 in the last two years, action recognition in videos has followed a trend similar to that of object recognition after ImageNet. A variety of tasks, including trimmed video classification, temporal action recognition in untrimmed videos, and spatial-temporal action detection, have been quite popular in recent competitions [16, 15].
This paper studies action recognition in trimmed videos. To some extent, advances in this field have been hampered by biases in dataset collection, a lack of annotations, and the legacy of object recognition in images. For example, the videos in UCF-101 and HMDB-51 are rich in scenes and objects, while missing person bounding box annotations (except their subsets, UCF-24 and JHMDB-21, which are for spatial-temporal action detection). Previous methods [42, 49, 59, 60, 5], which do not explicitly separate the human body from the rest of the video, tend to predict an action according to the scenes and objects, since convolutional neural networks (CNNs) find it easier to classify objects and things than human motions. They can be easily distracted by irrelevant cues when recognizing an action. As shown in Fig. 1(b), a video frame with ground-truth class carry is predicted as the wrong action drop trash by the baseline Inflated 3D ConvNet (I3D), because the model has learned that the trash can and the action drop trash always appear together in a video (Fig. 1(a)).
This motivates us to design a model that can explicitly capture human body movements from videos, while following the line of RGB- and CNN-based methods in action recognition. Pose (skeleton) data is lightweight, easy to interpret, and highly relevant to human action. It can be readily estimated by deep models, thanks to recent advances in human pose estimation in single images [29, 35, 8, 53, 34, 36] and in videos. Pose estimation methods are usually based on person bounding boxes, which filter out most non-human clutter in RGB images. In view of this, person-centric action recognition has the potential to benefit from joint training with pose estimation. Thanks to large-scale annotated datasets and powerful deep networks, poses estimated from images in the wild are more robust than skeletons captured by depth sensors like Kinect, which are limited to indoor scenes, as used in pose-based action recognition [30, 56, 61, 12, 57, 54]. Thus, action recognition in videos can be naturally formulated as a multi-task learning problem comprising RGB-based action recognition, pose estimation, and pose-based action recognition.
In this work, we propose a person-centric modeling method for human action recognition, called Action Machine, which shares a similar spirit with Convolutional Pose Machines (CPM) in its sequential model design. The proposed method (Fig. 6) extends the Inflated 3D ConvNet (I3D) by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition. In detail, the video frames are cropped by the bounding boxes of target persons and taken as the inputs of I3D. For frame-wise pose estimation, a 2D deconvolution head is added to the last convolutional layer of I3D, in parallel with the existing head for RGB-based action recognition. After pose estimation, a 2D CNN is applied to the pose sequences for pose-based action recognition. At test time, the predictions of the two classification heads are fused by summation. Class-specific activation maps of Action Machine are shown in Fig. 1(c) and (d), indicating that only the regions that really correspond to the action are activated. The main contributions of this work are summarized as follows:
We present a conceptually simple and general framework for action recognition in trimmed videos, called Action Machine, aiming at person-centric modeling.
The proposed techniques for explicitly modeling human body movements, including person cropping, multi-task training of action recognition and pose estimation, and the fusion of predictions from RGB images and poses, help to improve model performance.
We showcase the generality of our framework via extensive experiments on four human action datasets. Action Machine achieves state-of-the-art performance on NTU RGB-D and Northwestern UCLA Multiview Action3D. It also achieves competitive performance on MSR Daily Activity3D and UTD-MHAD. Action Machine is a high-performance framework that is fast to train and test.
2 Related works
2.1 Deep learning for action recognition
RGB-based methods. Two-stream ConvNet employs RGB images and optical flow stacks as the inputs of two networks and fuses their predictions by late fusion. Temporal Segment Network (TSN) improves on two-stream ConvNet by sparsely sampling video frames and learning video-level predictions. Deep networks with Temporal Pyramid Pooling (DTPP) sample enough frames from videos and learn a video-level representation end-to-end. Using one network, C3D learns spatial-temporal patterns from video clips by 3D convolutions. In 2017, DeepMind released a large-scale video action dataset, Kinetics, and proposed the Inflated 3D ConvNet (I3D). Non-local Net equips I3D with an attention mechanism, extracting long-range interactions in the spatial-temporal domain. The above models take as inputs random spatial crops of video frames during training and can easily overfit to the scenes and objects in videos because they fail to focus explicitly on the human body. Different from them, we use detected bounding boxes to crop the target persons from videos as the inputs of the model, eliminating the effect of background context.
Pose (skeleton)-based methods. Compared with RGB images, skeleton data has the merits of being lightweight and free from scene cues. Previous studies on pose-based action recognition can be categorized into RNN-based [30, 56, 61], CNN-based [12, 57], and GCN-based (Graph Convolutional Network) methods. RNN-based methods [30, 56, 61] treat the skeleton data as vectors and capture the sequential information of the skeleton. CNN-based methods [12, 22] represent a skeleton sequence as a pseudo-image and recognize the underlying action in the same way as image classification. GCN-based methods capture joint interactions on skeleton graphs, explicitly considering the adjacency relationships among joints in a non-Euclidean space. In this work, we follow the CNN-based methods [12, 57] and use a 2D CNN for pose-based action recognition.
2.2 Human pose estimation
Human pose estimation methods fall into top-down methods [8, 53, 35, 17], in which a pose estimator is applied to the output of a person detector, and bottom-up methods [3, 52], in which keypoint proposals are grouped into person instances. In this work, we adopt a top-down method and resort to an off-the-shelf detector for bounding boxes. For pose estimation, a 2D deconvolution head is added to the last convolutional layer of I3D. This is inspired by Mask R-CNN, which extends Faster R-CNN to support keypoint estimation. Action Machine does not involve a detection task during training, and the person cropping operations are imposed on the images instead of the features.
2.3 Multi-task learning for action recognition
Chained multi-stream network unifies three sources: RGB images, optical flow, and body part masks for action recognition and detection. It introduces a Markov chain model to fuse these cues successively. In , Soft-argmax is extended to regress 2D and 3D pose directly, leading to end-to-end training of pose estimation and action recognition. Different from the above two works, Action Machine is based on I3D, which has fewer parameters than C3D because of its 2D+1D convolutions. It is also easy to train, since it transfers pre-trained weights from a 2D CNN, and it does not need the costly optical flow maps required by two-stream ConvNet. The pose estimation method we use is detection-based, detecting keypoints by regressing heatmaps, and can produce more accurate poses than the regression-based pose estimation in . Meanwhile, the pose estimation head can benefit from the temporal context of the I3D output.
3 Action Machine
As shown in Fig. 6, the pipeline of Action Machine consists of the following steps. First, the videos after person cropping are taken as the inputs of I3D for RGB-based action recognition. Second, a 2D deconvolution head attached to the last convolutional layer of I3D performs frame-wise pose estimation; the heatmaps produced by the pose estimation head are converted into joint coordinates by an argmax operation. Third, the transformed joint coordinates, arranged in a 2D shape, are taken as inputs by a 2D CNN for pose-based action recognition. The proposed method is trained in a multi-task manner. Finally, the two sources of predictions, i.e., from RGB images and poses, are fused by summation at test time.
Network input. All video frames are fed to a published detector, i.e., Deformable CNN, for person bounding boxes, and the category confidence threshold is set to 0.99 to avoid most false positive detections. The minimal bounding box enclosing all the detected boxes in a video is used for person cropping. In this way, the problem of missing detections in some frames is mostly solved, and cropping a video by a box shared across all frames aligns the features along the temporal dimension.
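The shared crop box described above can be sketched as follows; this is a minimal numpy illustration (the function name and detection format are our own, not from the paper's MXNet implementation):

```python
import numpy as np

def person_crop_box(detections, conf_thresh=0.99):
    """Compute the minimal box enclosing all confident person detections
    across a video, yielding one crop box shared by every frame.

    detections: list of (x1, y1, x2, y2, score) tuples over all frames.
    Returns (x1, y1, x2, y2), or None if no detection passes the threshold.
    """
    kept = [d[:4] for d in detections if d[4] >= conf_thresh]
    if not kept:
        return None
    boxes = np.asarray(kept, dtype=np.float64)
    # Union box: min over top-left corners, max over bottom-right corners.
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()
    return (x1, y1, x2, y2)
```

Because the same box is applied to every frame, a person detected in only some frames is still covered in all of them, which is how missed detections are tolerated.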
Backbone. We use I3D with a ResNet-50 backbone, shown in Table 10, for feature extraction. In order to estimate the pose of each frame, we remove the temporal max pooling after the first stage of I3D. The output feature map of the backbone has a size of 2048×8×7×7, used by both RGB-based action recognition and pose estimation.
| 5×7×7, 64, stride 1, 2, 2 | 8×112×112 |
| 1×3×3 max, stride 1, 2, 2 | 8×56×56 |
RGB-based action recognition. As shown in Fig. 7, global average pooling is performed after the last convolutional layer of I3D to get a 2048-d feature $\mathbf{f}$.
Consider a dataset of $N$ videos $\{(x_i, y_i)\}_{i=1}^{N}$ with $C$ categories, where $y_i$ is the label. Formally, the prediction can be obtained directly as
$$p = \sigma(W\mathbf{f} + b),$$
where $\sigma$ is the softmax operation, and $W$ and $b$ are the parameters of the fully connected layer. In the training stage, combining with the cross-entropy loss, the final loss function is
$$L_{rgb} = -\log p_{y},$$
where $p_{y}$ is the value of the $y$-th dimension of $p$.
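The classification head's softmax and cross-entropy computation can be sketched numerically as follows (a minimal numpy version for illustration; function names are our own):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()          # subtract max to avoid overflow in exp
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """Loss for a classification head: -log p_y for the true label y."""
    p = softmax(logits)
    return -np.log(p[label])
```

The loss is small when the probability assigned to the true class is high, and grows without bound as that probability approaches zero.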
Pose estimation. Given the output features of I3D, pose estimation is performed at each temporal position. Inspired by Mask R-CNN, a 2D deconvolution head is added to the last convolutional layer of I3D, as shown in Fig. 7. By default, two deconvolutional layers with batch normalization and ReLU activation are used. Each layer has 256 filters with a 4×4 kernel and a stride of 2. Following , a 1×1 convolutional layer is added at the end to generate predicted heatmaps for all keypoints (one channel per keypoint) and offsets (two channels per keypoint for the $x$- and $y$-directions), for a total of $3K$ output channels, where $K$ is the number of keypoints.
Given the image crop, let $f_k(x_i) = 1$ if the $k$-th keypoint is located at position $x_i$, and 0 otherwise. Here $k \in \{1, \dots, K\}$ indexes the keypoint type and $i$ indexes the pixel locations on the image crop grid. For each position $x_i$ and each keypoint $k$, we compute the probability $h_k(x_i) = 1$ if $\|x_i - l_k\| \le R$, which means the point $x_i$ is within a disk of radius $R$ from the location $l_k$ of the $k$-th keypoint. $K$ such heatmaps are trained by solving a binary classification problem for each position and keypoint independently. For each position $x_i$ and each keypoint $k$, we also predict the 2D offset vector $F_k(x_i) = l_k - x_i$ from the pixel to the corresponding keypoint. $K$ such vector fields are trained by solving a 2D regression problem for each position and keypoint independently.
The output of the heatmap branch yields the heatmap probabilities $\hat{h}_k(x_i)$ for each position $x_i$ and each keypoint $k$. The training target for the heatmap branch is a map of zeros and ones, with $h_k(x_i) = 1$ if $\|x_i - l_k\| \le R$ and 0 otherwise. The corresponding loss function is the sum of a smooth $L_1$ loss over each position and keypoint independently:
$$L_h = \sum_{k=1}^{K} \sum_{i} \mathrm{smooth}_{L_1}\big(\hat{h}_k(x_i) - h_k(x_i)\big),$$
where $\mathrm{smooth}_{L_1}$ is the smooth $L_1$ loss.
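The smooth $L_1$ penalty used by both branches can be sketched as follows (a minimal numpy version; the function name and the `beta` transition point are illustrative conventions, not taken from the paper):

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Elementwise smooth L1: quadratic near zero, linear in the tails.

    Behaves like 0.5*x^2/beta for |x| < beta and |x| - 0.5*beta otherwise,
    so it is less sensitive to outliers than a plain L2 loss.
    """
    a = np.abs(x)
    return np.where(a < beta, 0.5 * a * a / beta, a - 0.5 * beta)
```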
For training the offset regression branch, the differences between the predicted and ground-truth offsets are penalized by the smooth $L_1$ loss. The offset loss $L_o$ is only computed for positions within a disk of radius $R$ from each keypoint.
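As a concrete illustration of the training targets described above, the following numpy sketch builds the binary disk heatmaps and the per-pixel offset fields for one crop (the function name and array layout are our own choices, not the paper's):

```python
import numpy as np

def make_targets(keypoints, H, W, R=4.0):
    """Build pose-estimation training targets for one image crop.

    keypoints: (K, 2) array of (x, y) keypoint locations on the HxW grid.
    Returns:
      heat:   (K, H, W) binary disks of radius R around each keypoint.
      offset: (K, 2, H, W) per-pixel (dx, dy) vectors pointing from each
              pixel to the keypoint; supervised only where heat == 1.
    """
    K = len(keypoints)
    ys, xs = np.mgrid[0:H, 0:W]
    heat = np.zeros((K, H, W), dtype=np.float32)
    offset = np.zeros((K, 2, H, W), dtype=np.float32)
    for k, (kx, ky) in enumerate(keypoints):
        dx, dy = kx - xs, ky - ys
        heat[k] = (dx * dx + dy * dy <= R * R).astype(np.float32)
        offset[k, 0], offset[k, 1] = dx, dy
    return heat, offset
```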
The final loss function for pose estimation has the form
$$L_{kps} = \lambda_h L_h + \lambda_o L_o,$$
where $\lambda_h$ and $\lambda_o$ are two scalar factors to balance the loss function.
At test time, for the $k$-th keypoint, an argmax operation is performed on the $k$-th heatmap to yield the coarse location
$$\hat{x}_k = \arg\max_{x_i} \hat{h}_k(x_i).$$
The accurate coordinate of the $k$-th keypoint is obtained by adding the corresponding predicted offset $\hat{F}_k(\hat{x}_k)$ to $\hat{x}_k$.
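This argmax-plus-offset decoding step can be sketched in numpy as follows (a hypothetical helper for illustration; the paper's implementation is in MXNet):

```python
import numpy as np

def decode_keypoint(heat_k, offset_k):
    """Recover one keypoint's coordinates from its predicted heatmap
    and offset field via argmax followed by offset refinement.

    heat_k:   (H, W) heatmap for keypoint k.
    offset_k: (2, H, W) predicted (dx, dy) offsets.
    Returns the refined (x, y) coordinate.
    """
    i = np.argmax(heat_k)
    y, x = np.unravel_index(i, heat_k.shape)   # coarse location
    dx, dy = offset_k[0, y, x], offset_k[1, y, x]
    return x + dx, y + dy                      # sub-pixel refinement
```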
Pose-based action recognition. The coordinates of the 2D poses are transformed into a tensor of size $T \times K \times 2$, where $T$ denotes the number of input frames and $K$ the number of keypoints. An extra confidence channel is added for each predicted joint, obtained by max pooling over its heatmap and passed through a ReLU activation. The tensor is then fed into a ResNet-18 for pose-based action recognition, as shown in Fig. 7. Due to the low spatial dimension of the input pose sequences, all pooling operations in this ResNet-18 are removed and all stride-2 convolutions are replaced with stride 1. Global average pooling is performed after the last convolutional layer of ResNet-18 to get a 512-d feature. The prediction of the pose stream is optimized with the cross-entropy loss $L_{pose}$.
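The construction of the pose-stream input tensor, including the ReLU-ed confidence channel obtained by max pooling the heatmaps, might look like the following numpy sketch (function name and the T×K×3 layout are our own illustration):

```python
import numpy as np

def pose_tensor(coords, heatmaps):
    """Stack decoded joint coordinates with a confidence channel.

    coords:   (T, K, 2) decoded (x, y) joints for T frames, K keypoints.
    heatmaps: (T, K, H, W) predicted heatmaps.
    Returns a (T, K, 3) tensor of (x, y, confidence), where confidence
    is the max heatmap value passed through ReLU.
    """
    conf = heatmaps.max(axis=(2, 3))    # max pooling over each heatmap
    conf = np.maximum(conf, 0.0)        # ReLU
    return np.concatenate([coords, conf[..., None]], axis=-1)
```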
Multi-task training. Action Machine has three tasks: RGB-based action recognition, pose estimation, and pose-based action recognition. They are jointly optimized by the following loss function:
$$L = \lambda_{rgb} L_{rgb} + \lambda_{kps} L_{kps} + \lambda_{pose} L_{pose},$$
where $\lambda_{rgb}$, $\lambda_{kps}$, and $\lambda_{pose}$ are the loss weights of RGB-based action recognition, pose estimation, and pose-based action recognition respectively; they are all set to 1.0 by default. (Note that the gradients of pose-based action recognition do not backpropagate into the pose estimation head because of the argmax operation in keypoint decoding. We do not use the Soft-argmax of  on the heatmaps for end-to-end training, because we find the keypoint quality of that approach on COCO is lower than ours.)
Fusion of RGB- and pose-based action recognition. In order to combine the strengths of the predictions from RGB images and poses, the predicted probabilities of the two heads are fused by summation at test time.
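The sum fusion at test time might look like the following sketch (a minimal numpy illustration with hypothetical names):

```python
import numpy as np

def fuse_predictions(p_rgb, p_pose):
    """Sum-fuse the softmax scores of the RGB and pose heads and pick
    the top-1 class, as done at test time."""
    scores = p_rgb + p_pose
    return int(np.argmax(scores)), scores
```

Summation lets either head recover from the other's mistake: a class only wins if it is favored by the combined evidence of both modalities.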
4 Experiments
4.1 Datasets
The proposed method has been evaluated on five datasets: COCO for pose estimation, and four trimmed video action datasets for action recognition: NTU RGB-D, Northwestern-UCLA Multiview Action 3D, MSR Daily Activity3D, and UTD-MHAD. (The videos in these datasets usually contain one person, except the 11 classes of NTU RGB-D in which the videos have two persons, so they are appropriate for validating the effect of person-centric modeling with simple human detection alone. Datasets like UCF-101 and HMDB-51 have group activities (sports) with more than one person and a variety of scenes; handling them would require tracking and person re-identification for our use, plus additional annotation cost, which is beyond the scope of this paper.)
COCO. The COCO train, validation, and test sets contain more than 200k images and 250k person instances annotated with keypoints; 150k of these instances are publicly available for training and validation. Our models are trained on the COCO train2017 set (57k images and 150k person instances) and tested on the val2017 set.
NTU RGB-D. This dataset was acquired with a Kinect v2 sensor. It contains more than 56k videos and 4 million frames, with 60 different activities including individual activities, interactions between two people, and health-related events. The actions are performed by 40 subjects and recorded from 80 viewpoints. We follow the cross-subject and cross-view protocols from .
Northwestern-UCLA Multiview Action 3D (N-UCLA) . This dataset contains 1494 sequences, covering 10 action categories, such as drop trash or sit down. Each sequence is captured simultaneously by 3 Kinect v1 cameras. Each action is performed one to six times by ten different subjects. We follow the cross-view protocol defined by . It has three cross-view combinations: xview1, xview2 and xview3. The combination xview1 means that the samples from view 2 and 3 are for training, and the samples from view 1 are for testing.
MSR Daily Activity3D (MSR) . This dataset contains 320 videos shot with a Kinect v1 sensor. 16 daily activities are performed twice by 10 subjects from a single viewpoint. Following , we use the videos from subjects 1, 3, 5, 7, and 9 for training, and the remaining ones for testing.
4.2 Experimental settings for pose estimation
Training. The ground-truth human box is extended in height or width to a fixed aspect ratio (height : width = 4 : 3). It is then cropped from the image and resized to a fixed resolution; the default is 384×288. Data augmentation includes scale (30%), rotation (30 degrees), and flip. Our models are pre-trained on ImageNet; ResNet-50 is used by default. We train on a 4-GPU machine with 2 images per GPU in a mini-batch (a total mini-batch size of 8 images). We train for 22 epochs in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 at epochs 17 and 21. SGD is used, with a momentum of 0.9 and a weight decay of 0.0001. The L1 norm of all gradients is clipped at 2 independently. MXNet is used for implementation.
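The aspect-ratio fixing step above (growing the box in one dimension only, never shrinking) can be sketched as follows; the function name and the center-preserving convention are our assumptions:

```python
def fix_aspect(box, target=4.0 / 3.0):
    """Extend a box in height or width so that height/width == target
    (4:3 by default, as used for the pose-estimation crops), keeping
    the box center fixed.

    box: (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    if h / w < target:        # too wide -> grow height
        h = w * target
    else:                     # too tall -> grow width
        w = h / target
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

Extending rather than cropping guarantees the whole person stays inside the resized input.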
Testing. The detected person bounding boxes on COCO val2017 are used. Following the common practice in [35, 8, 53], the joint locations are predicted on heatmaps averaged over the original images and their horizontal flips. Following , we use mean average precision (AP) over 10 OKS (object keypoint similarity) thresholds for evaluation.
4.3 Experimental settings for action recognition
Preprocessing. As shown in Table 12, for all datasets we resize the videos to make them smaller while keeping their aspect ratios. The sampling stride is selected according to the frame rate of the videos and model performance.
Training. Our models are pre-trained on ImageNet. The pre-trained weights are then inflated from 2D to 3D, as shown in Table 10. For the small datasets, including N-UCLA, MSR, and UTD-MHAD, we also try pre-training our models on NTU RGB-D. The models are fine-tuned using 8-frame clips with the sampling strides shown in Table 12. The start frame is randomly sampled during training. Data augmentation includes jittering of the bounding box center, width, and height, and random mirroring. The bounding box is then extended in height or width to a fixed aspect ratio (height : width = 1 : 1; we do not keep the same aspect ratio as 2D pose estimation on COCO, because we use a box shared by all frames in a video, and a moving person in a video is likely to cover a range different from a still person in a single image). It is then cropped from the image and resized to a fixed resolution; the default is 224×224. We train on a 4-GPU machine with 4 clips (32 images) per GPU in a mini-batch (a total mini-batch size of 16 clips). We train for 85 epochs in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 at epochs 42 and 68. SGD is used, with a momentum of 0.9 and a weight decay of 0.0001. The L1 norm of all gradients is clipped at 2 independently. Dropout with a ratio of 0.5 is used before the fully connected layers of RGB-based and pose-based action recognition. The pose annotations of video frames are obtained using the detection boxes and the models trained on COCO. MXNet is used for implementation.
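The step learning-rate schedule used above (drop by 10× at fixed milestone epochs) can be sketched generically as follows; the function name is illustrative:

```python
def learning_rate(epoch, base_lr=0.01, milestones=(42, 68), gamma=0.1):
    """Step schedule: start at base_lr and multiply by gamma at each
    milestone epoch, matching the action-recognition training setup."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

With `milestones=(17, 21)` the same helper reproduces the pose-estimation schedule.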
Testing. Following , fully convolutional inference is performed spatially on the videos, using three crops, i.e., at the upper left, center, and lower right of the bounding box center. 10 clips are evenly sampled from each full-length video and softmax scores are computed on each individually. The final prediction is the average of the softmax scores over all clips. We report top-1 accuracy using the model from the last epoch.
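The even clip sampling and score averaging at test time can be sketched as follows (a minimal numpy illustration; function names and the start-index convention are our assumptions):

```python
import numpy as np

def clip_starts(num_frames, clip_len=8, stride=1, num_clips=10):
    """Evenly spaced start indices for sampling num_clips clips of
    clip_len frames (with the given sampling stride) from a video."""
    span = (clip_len - 1) * stride          # frames covered by one clip
    last = max(num_frames - 1 - span, 0)    # last valid start index
    return np.linspace(0, last, num_clips).astype(int)

def video_prediction(clip_scores):
    """Average per-clip softmax scores into a video-level prediction."""
    return np.mean(np.asarray(clip_scores), axis=0)
```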
4.4 Pose estimation on COCO
| Method | Backbone | Input size | AP |
| 8-stage Hourglass | - | 256×192 | 66.9 |
| Simple Baseline | ResNet-50 | 384×288 | 72.2 |
As shown in Table 3, our method is compared with state-of-the-art methods, Hourglass , CPN , and Simple Baseline , on COCO val2017. Our method achieves performance competitive with the above methods. Fig. 4 shows some results of our pose estimation method on the COCO val2017 dataset. The proposed method equipped with I3D is used for the following action recognition experiments.
4.5 Action recognition
In this section, Action Machine is compared with other approaches on four human action datasets. Results are shown in Tables 13, 5, 6 and 7. Note that Action Machine does not take human poses as input at test time, because it has learned to estimate poses from RGB images during training.
Performance on NTU RGB-D. In Table 13, our model is compared with pose-based methods [46, 13, 38, 30, 23, 22, 56, 54, 41] and RGB-based methods [40, 63, 34, 2, 32]. Action Machine with a single modality (RGB) as input at test time achieves state-of-the-art performance. Specifically, Action Machine outperforms PoseMap by 2 and 2.6 points in top-1 accuracy on cross-view and cross-subject respectively. Compared to PoseMap, Action Machine is conceptually simple and easy to implement.
| Method | Cross-view | Cross-subject |
| Lie Group | 52.8 | 50.1 |
| Deep LSTM | 67.3 | 60.7 |
| Temporal Conv | 83.1 | 74.3 |
| Glimpse Clouds | 93.2 | 86.6 |
| Action Machine (Ours) | 97.2 | 94.3 |
Performance on N-UCLA. In Table 5, our model is compared with pose-based methods [46, 13, 31, 26] and RGB-based methods . Action Machine with a single modality (RGB) as input outperforms previous state-of-the-art approaches. Without using an LSTM or the extra handcrafted rules of Glimpse Clouds , Action Machine has an accuracy gain of 4.7 points in average top-1 accuracy on cross-view.
| Method | xview1 | xview2 | xview3 | Average |
| Lie Group | - | - | - | 74.2 |
| Enhanced viz. | - | - | - | 86.1 |
| Ensemble TS-LSTM | - | - | - | 89.2 |
| Glimpse Clouds | 83.4 | 89.5 | 90.1 | 87.6 |
| Action Machine (Ours) | 88.3 | 92.2 | 96.5 | 92.3 |
Performance on MSR. In Table 6, our model is compared with pose-based methods [47, 14, 55, 44] and depth-based methods [62, 39, 33, 40]. Without using the depth modality, Action Machine achieves competitive performance compared to DSSCA-SSLM , which is based on handcrafted features including RGB and depth. Moreover, on the cross-subject split of NTU RGB-D (Table 13), a larger dataset than MSR, DSSCA-SSLM  is 19.4 points lower than ours, showing the robustness of our method with respect to the amount of data.
| Method | Accuracy |
| Action Ensemble | 68.0 |
| Efficient Pose-Based | 73.1 |
| Moving Pose | 73.8 |
| Moving Poselets | 74.5 |
| Depth Fusion | 88.8 |
| Action Machine (Ours) | 93.0 |
Performance on UTD-MHAD. In Table 7, our model is compared with [50, 19, 27, 32]. Without using the 3D poses extracted by a depth sensor as these methods do, Action Machine with the RGB modality as input achieves competitive performance.
4.6 Ablation study
Ablation studies are performed on NTU RGB-D and N-UCLA to verify the effectiveness of our techniques for person-centric modeling in Action Machine. There are four basic configurations, as illustrated below:
RGBAction random crop. The baseline I3D model takes as inputs random crops of the videos and performs action recognition using RGB features.
RGBAction person crop. The I3D model takes as inputs the videos after person cropping and performs action recognition using RGB features.
KPS RGBAction. The I3D model takes as inputs the videos after person cropping, performs action recognition using RGB features, and adds a head for pose estimation.
KPS PoseAction RGBAction. The I3D model takes as inputs the videos after person cropping, adds a head for pose estimation, and performs action recognition using both RGB and pose features. The model trained with the KPS RGBAction configuration is used as the pre-trained model; we freeze it and train only the ResNet-18 or ResNet-50 for pose-based action recognition. We report the results of pose-based action recognition and of the sum fusion of predictions from RGB images and poses.
As shown in Table 8, on the cross-subject split of NTU RGB-D, person cropping improves the model accuracy by 0.9 points over random cropping. Our full model outperforms the baseline RGBAction random crop by 2 points. Due to the high accuracy of the baseline, the improvement on cross-view is not obvious. Similar gains can also be observed on the small subsets of NTU RGB-D (xview-s, xsub-s), which we originally used for fast training and testing in our implementation.
| Configuration | xview | xsub | xview-s | xsub-s |
| RGBAction random crop | 97.2 | 92.3 | 94.3 | 61.2 |
| RGBAction person crop | 97.7 | 93.2 | 94.5 | 67.9 |
As shown in Table 9, on N-UCLA, Action Machine outperforms the baseline by a large margin. Specifically, RGBAction person crop with the person cropping technique improves accuracy by 1.6 and 4.3 points on xview1 and xview3 respectively over the baseline RGBAction random crop. Person cropping does not bring an accuracy gain on xview2, because the test crops of the front-view images (Fig. 8, first and second rows) on this dataset are close to those cropped by person boxes. Jointly training pose estimation and RGB-based action recognition, i.e., KPS RGBAction, improves accuracy by about 3 to 7 points. Overall, using ResNet-18, our final model exceeds the baseline by 7.2 points. By using a stronger backbone, i.e., ResNet-50, for pose-based action recognition, together with NTU RGB-D pre-training, the accuracies of our models, whether from poses alone or from the fusion of RGB images and poses, are further improved.
To better understand how our approach learns discriminative features for action recognition, the class-specific activation maps of our models are visualized with the Class Activation Mapping (CAM)  approach in Fig. 8. The videos are sampled from N-UCLA and include three classes (sit down, stand up, carry). These maps show that jointly training RGB-based action recognition with pose estimation makes the model focus on the motions of the human body. For example, KPS RGBAction (Fig. 8(b)) pays more attention to the standing and sitting process, while RGBAction person crop (Fig. 8(a)) seems to focus on the object (the chair). In particular, in the carry example, KPS RGBAction is significantly activated only by the hand (at the center of the body), whereas RGBAction person crop is activated by the trash can, leading to a wrong prediction (drop trash).
| Configuration | xview1 | xview2 | xview3 | Average |
| RGBAction random crop | 81.6 | 82.4 | 86.3 | 83.4 |
| RGBAction person crop | 83.2 | 82.4 | 90.6 | 85.4 |
Inference: We train a ResNet-50-I3D model that shares features between RGB-based action recognition and pose estimation with two deconvolutional layers, followed by a ResNet-50 for pose-based action recognition. This model runs at 55 ms per clip (8 frames) on an Nvidia TitanX GPU. As the dimension of the pose sequences is small, substituting ResNet-18 for ResNet-50 in pose-based action recognition does not make much difference: the model then takes 50 ms. I3D alone takes 30 ms. Action Machine is thus fast to test, adding only a small overhead to I3D.
Training: Action Machine is also fast to train. Training with ResNet-50-I3D on the cross-view split of NTU RGB-D takes 32 hours (0.66 s per mini-batch of 16 clips, i.e., 128 frames) in our synchronized 4-GPU implementation.
5 Discussions and Conclusions
We propose a person-centric modeling method, Action Machine, for human action recognition in trimmed videos. It has three complementary tasks: RGB-based action recognition, pose estimation, and pose-based action recognition. By using person bounding boxes and human poses, Action Machine achieves competitive performance compared with other approaches on four video action datasets [38, 48, 47, 6]. However, in our implementation it is hard to strictly discard non-human clutter (e.g., the trash can in Fig. 8) because of the bounding box quality and other post-processing steps. Besides, in our multi-task training, the ground-truth pose annotations are estimated by a model trained on COCO and may not be abundant enough for training the pose estimation task, due to the paucity of videos. Joint training of pose estimation on COCO and action recognition on videos may relieve this problem, as we could exploit the data richness of COCO.
-  COCO: COCO Leaderboard. http://cocodataset.org.
-  F. Baradel, C. Wolf, J. Mille, and G. W. Taylor. Glimpse clouds: Human activity recognition from unstructured feature points. In CVPR, 2018.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. CoRR, abs/1808.01340, 2018.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
-  C. Chen, R. Jafari, and N. Kehtarnavaz. Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In ICIP, 2015.
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
-  Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
-  C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
-  Y. Du, Y. Fu, and L. Wang. Skeleton based action recognition with convolutional neural network. In ACPR, 2015.
-  Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
-  A. Eweiwi, M. S. Cheema, C. Bauckhage, and J. Gall. Efficient pose-based action recognition. In ACCV, 2014.
-  B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Alwassel, V. Escorcia, R. Krishna, S. Buch, and C. D. Dao. The activitynet large-scale activity recognition challenge 2018 summary. CoRR, abs/1808.03766, 2018.
-  B. Ghanem, J. C. Niebles, C. Snoek, F. C. Heilbron, H. Alwassel, R. Krishna, V. Escorcia, K. Hata, and S. Buch. Activitynet challenge 2017 summary. CoRR, abs/1710.08011, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Y. Hou, Z. Li, P. Wang, and W. Li. Skeleton optical spectra-based action recognition using convolutional neural networks. TCSVT, 2018.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
-  Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid. A new representation of skeleton sequences for 3d action recognition. In CVPR, 2017.
-  T. S. Kim and A. Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In CVPRW, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS. 2012.
-  H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre. Hmdb51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering, 2013.
-  I. Lee, D. Kim, S. Kang, and S. Lee. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In ICCV, 2017.
-  C. Li, Y. Hou, P. Wang, and W. Li. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Processing Letters, 2017.
-  T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In ECCV, 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In ECCV, 2016.
-  M. Liu, H. Liu, and C. Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 2017.
-  M. Liu and J. Yuan. Recognizing human actions as the evolution of pose estimation maps. In CVPR, 2018.
-  J. Luo, W. Wang, and H. Qi. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In ICCV, 2013.
-  D. C. Luvizon, D. Picard, and H. Tabia. 2d/3d pose estimation and action recognition using multitask deep learning. In CVPR, 2018.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
-  G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS. 2015.
-  A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In CVPR, 2016.
-  A. Shahroudy, T. Ng, Y. Gong, and G. Wang. Multimodal multipart learning for action recognition in depth videos. PAMI, 2016.
-  A. Shahroudy, T. Ng, Y. Gong, and G. Wang. Deep multimodal feature analysis for action recognition in rgb+d videos. PAMI, 2018.
-  C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV, 2018.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS. 2014.
-  K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human action classes from videos in the wild. CoRR, abs/1212.0402, 2012.
-  L. Tao and R. Vidal. Moving poselets: A discriminative and interpretable skeletal motion representation for action recognition. In ICCVW, 2015.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
-  R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, 2014.
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, 2012.
-  J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu. Cross-view action modeling, learning, and recognition. In CVPR, 2014.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
-  P. Wang, W. Li, C. Li, and Y. Hou. Action recognition based on joint trajectory maps with convolutional neural networks. Knowledge-Based Systems, 2018.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
-  B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
-  S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
-  M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV, 2013.
-  P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV, 2017.
-  P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View adaptive neural networks for high performance skeleton-based human action recognition. CoRR, abs/1804.07453, 2018.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
-  J. Zhu, W. Zou, and Z. Zhu. End-to-end video-level representation learning for action recognition. In ICPR, 2018.
-  J. Zhu, W. Zou, and Z. Zhu. Two stream gated fusion convnets for action recognition. In ICPR, 2018.
-  W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI, 2016.
-  Y. Zhu, W. Chen, and G. Guo. Fusing multiple features for depth-based action recognition. ACM Trans. Intell. Syst. Technol., 2015.
-  M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In ICCV, 2017.
Appendix A NTU RGB-D small subset setting
In Section 4.6, for ablation studies of different configurations of our models, we use small subsets of NTU RGB-D (xview-s, xsub-s), which we designed for fast training and testing.
For xview-s, sample videos of the original cross-view split with subject IDs larger than 5 are discarded. Under this setting, the training and testing sets have 3,839 and 1,917 samples respectively (about 1/10 of the full cross-view split).
For xsub-s, based on the original cross-subject split, we keep all samples of camera 1 and discard those of cameras 2 and 3; sample videos with subject IDs larger than 10 are also discarded. Under this setting, the training and testing sets have 4,317 and 1,439 samples respectively (about 1/10 of the full cross-subject split).
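The filtering rules above can be sketched as follows. This is a minimal illustration, assuming NTU RGB-D sample names follow the standard `SsssCcccPpppRrrrAaaa` pattern (setup, camera, performer/subject, replication, action); the function names are hypothetical, not part of our released code.

```python
import re

def parse_ntu_name(name):
    """Extract the camera and subject IDs from an NTU RGB-D sample name."""
    m = re.match(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})", name)
    setup, camera, subject, rep, action = map(int, m.groups())
    return camera, subject

def in_xview_small(name):
    # xview-s: keep only samples whose subject ID is at most 5
    _, subject = parse_ntu_name(name)
    return subject <= 5

def in_xsub_small(name):
    # xsub-s: keep camera 1 only, with subject ID at most 10
    camera, subject = parse_ntu_name(name)
    return camera == 1 and subject <= 10
```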
Appendix B Cross-dataset recognition task
To demonstrate the advantage of person-centric modeling over the baseline, we further test our trained models on different datasets. Specifically, we train our models on the NTU RGB-D cross-subject split and test them on the test sets of N-UCLA, MSR Daily and UTD-MHAD respectively. The shared category mappings between each smaller dataset and NTU RGB-D are shown in Tables 10, 11 and 12, and the test videos are limited to those whose ground-truth classes appear in the mappings. Because the videos come from different sources, the scene contexts and objects in the training dataset differ largely from those in the testing datasets, so a model that fails to capture human body motion will perform worse than one that learns to focus on it. Results are shown in Table 13. The proposed person-centric modeling techniques, namely person cropping, multi-task training of action recognition and pose estimation, and the fusion of predictions from RGB images and poses, improve the performance of the baseline model RGBAction random crop across datasets. Existing RGB-based methods easily overfit the scenes and objects of a specific dataset instead of focusing on human body movements, even though they may achieve high in-domain performance. In contrast, Action Machine generalizes better.
Table 10: Shared category mapping between N-UCLA and NTU RGB-D.

| N-UCLA | NTU RGB-D |
| --- | --- |
| pick up with one hand | pickup |
| pick up with two hands | pickup |
| stand up | standing up (from sitting position) |
| sit down | sitting down |
Table 11: Shared category mapping between MSR Daily and NTU RGB-D.

| MSR Daily | NTU RGB-D |
| --- | --- |
| call cellphone | make a phone call/answer phone |
| write on a paper | writing |
| cheer up | cheer up |
| stand up | standing up (from sitting position) |
| sit down | sitting down |
Table 12: Shared category mapping between UTD-MHAD and NTU RGB-D.

| UTD-MHAD | NTU RGB-D |
| --- | --- |
| sit to stand | standing up (from sitting position) |
| stand to sit | sitting down |
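The cross-dataset evaluation protocol described in this appendix can be sketched as follows: a test video is kept only if its ground-truth class has a mapping to an NTU RGB-D class, and a prediction counts as correct when the predicted NTU class matches the mapped one. This is a minimal illustration; `NUCLA_TO_NTU` and `cross_dataset_accuracy` are hypothetical names, with the mapping taken from Table 10.

```python
# Category mapping from N-UCLA class names to NTU RGB-D class names (Table 10).
NUCLA_TO_NTU = {
    "pick up with one hand": "pickup",
    "pick up with two hands": "pickup",
    "stand up": "standing up (from sitting position)",
    "sit down": "sitting down",
}

def cross_dataset_accuracy(samples, predict):
    """samples: (video, source_label) pairs; predict: video -> NTU class name.

    Videos whose source label has no NTU mapping are excluded from evaluation.
    """
    kept = [(v, NUCLA_TO_NTU[y]) for v, y in samples if y in NUCLA_TO_NTU]
    correct = sum(predict(v) == y for v, y in kept)
    return correct / len(kept)
```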
Table 13: Cross-dataset top-1 accuracy (%) of models trained on the NTU RGB-D cross-subject split.

| Method | N-UCLA | MSR Daily | UTD-MHAD |
| --- | --- | --- | --- |
| RGBAction random crop | 70.0 | 70.8 | 100 |
| RGBAction person crop | 70.0 | 78.4 | 100 |