A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition
This paper presents a new framework for human action recognition from 3D skeleton sequences. Previous studies do not fully utilize the temporal relationships between video segments within a human action. Some studies successfully used very deep Convolutional Neural Network (CNN) models but often suffer from data insufficiency. In this study, we first segment a skeleton sequence into distinct temporal segments in order to exploit the correlations between them. The temporal and spatial features of the skeleton sequences are then extracted simultaneously by a fine-to-coarse (F2C) CNN architecture optimized for human skeleton sequences. We evaluate our proposed method on the NTU RGB+D and SBU Kinect Interaction datasets. It achieves accuracies of 79.6% and 84.6% on NTU RGB+D with the cross-subject and cross-view protocols, respectively, which are comparable to the state-of-the-art performance. In addition, our method significantly improves the accuracy for actions involving two-person interactions.
Thao Le Minh, Nakamasa Inoue, Koichi Shinoda Department of Computer Science Tokyo Institute of Technology, Tokyo, Japan email@example.com, firstname.lastname@example.org, email@example.com
In the past few years, human action recognition has become an active area of research, driven by the dramatic growth of societal applications in areas including security surveillance systems, human-computer-interaction-based games, and the health-care industry. The conventional approach based on RGB data is not robust against intra-class variations and illumination variations. With the advancement of 3D sensing technologies, in particular affordable RGB-D cameras such as Microsoft Kinect, these problems have been remedied to some extent. Human action recognition studies utilizing 3D skeleton data have thus drawn a great deal of attention [Han et al., 2017, Presti and La Cascia, 2016].
Human action recognition based on 3D skeleton data is fundamentally a time series problem, and accordingly, a large body of previous studies has focused on extracting motion patterns from skeleton time sequences. Earlier methods utilized hand-crafted features to represent the intra-frame relationships throughout the skeleton sequences [Yang and Tian, 2014, Wang et al., 2012]. In the deep learning approach, end-to-end learning based on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) has been utilized to learn the temporal dynamics [Du et al., 2015, Song et al., 2017, Zhu et al., 2016, Liu et al., 2016, Shahroudy et al., 2016, Liu et al., 2017a]. Recent studies have shown the superiority of Convolutional Neural Networks (CNNs) over RNNs with LSTM for this task [Ke et al., 2017a, Liu et al., 2017c, Ke et al., 2017b, Liu et al., 2017b]. Most of the CNN-based studies encode the trajectories of human joints in an image space that represents the spatio-temporal information of skeleton data. The encoded feature is then fed into a deep CNN pre-trained on large-scale image datasets, for example ImageNet [Russakovsky et al., 2015], under the notion of transfer learning [Pan and Yang, 2010]. This CNN-based method is, however, weak in handling long temporal sequences, and thus usually fails to distinguish actions with similar distance variations but different durations, such as "handshaking" and "giving something to another person".
Motivated by the success of the generative model for CAPTCHA images [George et al., 2017], we believe 3D human action recognition systems can also benefit from a network structure specific to this application domain. The first step is to segment a given skeleton sequence into different temporal segments. Here, we assume that temporal features at different time-steps have different correlations. We further utilize a new F2C CNN-based network architecture to model high-level features. By utilizing both the temporal relationships between temporal segments and the spatial connectivities among human body parts, it is expected to outperform naive deep CNN networks. To the best of our knowledge, this is the first attempt to use an F2C network for 3D human action recognition.
2 Related Studies
Deep learning techniques have drawn great attention in the field of 3D human action recognition. End-to-end network architectures can discriminate actions from raw skeleton data without any handcrafted features. [Zhu et al., 2016] adopted three LSTM layers to exploit the co-occurrence features of skeleton joints at different layers. A hierarchical LSTM-based network that models different body parts performed better than naive LSTM architectures [Du et al., 2015].
The use of deep learning techniques in this area of research exploded when the NTU RGB+D dataset [Shahroudy et al., 2016] was released. [Shahroudy et al., 2016] introduced a part-aware LSTM to learn the long-term dynamics of a long skeleton sequence from multimodal inputs extracted from human body parts. [Liu et al., 2016] further employed a spatio-temporal LSTM (ST-LSTM) to handle both the spatial and the temporal dependencies. ST-LSTM is also enhanced with a tree-structure-based traversal method for transmitting the input data of each frame into the network. In addition, this method used a trust gate mechanism to exclude noisy data from the input.
CNNs are powerful for the task of object recognition from images. Transfer learning techniques enable them to perform well even with a limited number of data samples [Wagner et al., 2013, Long et al., 2015]. Motivated by this, [Ke et al., 2017a] were the first to apply transfer learning to 3D human action recognition. They used a VGG model [Chatfield et al., 2014] pre-trained on ImageNet to extract high-level features from cosine distance features between joint vectors and their normalized magnitudes. [Ke et al., 2017b] further transformed the cylindrical coordinates of an original skeleton sequence into three clips of gray-scale images. The clips are then processed by a pre-trained VGG19 model [Simonyan and Zisserman, 2014] to extract image features. Multi-task learning was also proposed by [Ke et al., 2017b] for the final classification, which achieved the state-of-the-art performance on the NTU RGB+D dataset.
Our study addresses two problems of the previous studies: (1) the loss of temporal information of a skeleton sequence during training, and (2) the need for a CNN structure specific to skeleton data. We believe that very deep CNN models such as VGG [Simonyan and Zisserman, 2014], AlexNet [Krizhevsky et al., 2012], or ResNet [He et al., 2016] are overqualified for data as sparse as human skeletons. Moreover, the available skeleton datasets are relatively small compared to image datasets. Thus, we believe a network architecture that is able to leverage the geometric dependencies of human joints is promising for solving this issue.
3 Fine-to-Coarse CNN for 3D Human Action Recognition
This section presents our proposed method for 3D skeleton-based action recognition which exploits the geometric dependency of human body parts and the temporal relationship in a time sequence of skeletons (Figure 1). It consists of two phases: feature representation and high-level feature learning with a F2C network architecture.
3.1 Feature Representation
We encode the geometry of human body originally given in an image space into local coordinate systems to extract the relative geometric relationships among human joints in a video frame. We select six joints in a human skeleton as reference joints in order to generate whole-body-based (WB) features and body-part-based (BP) features. The hip joint is chosen as the origin of the coordinate system presenting the WB features, while the other reference joints, namely the head, the left shoulder, the right shoulder, the left hip, and the right hip, are selected exactly the same as [Ke et al., 2017a] to represent the BP features. The WB features represent the motions of human joints around the base of the spine, while the BP features represent the variation of appearance and deformation of the human pose when viewed from different body parts. We believe that the combined use of WB and BP is robust against coordinate transformations.
Different from the other studies using BP features [Shahroudy et al., 2016, Liu et al., 2016, Ke et al., 2017a], we extract a velocity together with a joint position from each joint of the raw skeleton. The velocity represents the variations over time and has been widely employed in many previous studies, mostly in the handcrafted-feature-based approaches [Zanfir et al., 2013, Kerola et al., 2016, Zhang et al., 2017]. It is robust against speed changes, and accordingly is effective for discriminating actions with similar distance variations but different speeds, such as punching and pushing.
3.1.1 Whole-body-based Feature
In the $t$-th frame of a sequence of skeletons with $N$ joints, the 3D position of the $i$-th joint is denoted as:
$$p_i^t = (x_i^t, y_i^t, z_i^t) \in \mathbb{R}^3.$$
The relative inter-joint positions are highly discriminative for human actions [Luo et al., 2013]. The relative position of joint $i$ at time $t$ is described as:
$$\tilde{p}_i^t = p_i^t - p_c^t,$$
where $p_c^t$ denotes the position of the centre-hip joint. The velocity features are the first derivatives of the position features [Zanfir et al., 2013]:
$$v_i^t = \tilde{p}_i^{t+1} - \tilde{p}_i^t.$$
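These two features can be sketched in a few lines of NumPy. This is a minimal sketch under our own conventions; the `hip_idx` default below is an assumption for illustration, not the dataset's actual joint ordering:

```python
import numpy as np

def wb_features(seq, hip_idx=0):
    """Whole-body features from a skeleton sequence.

    seq: array of shape (T, N, 3) -- T frames, N joints, 3D positions.
    hip_idx: index of the centre-hip reference joint (assumed 0 here).
    Returns (positions, velocities):
      positions:  (T, N, 3) joint positions relative to the hip
      velocities: (T-1, N, 3) frame-to-frame differences of the
                  relative positions (first-order derivative).
    """
    rel = seq - seq[:, hip_idx:hip_idx + 1, :]   # subtract hip position per frame
    vel = rel[1:] - rel[:-1]                     # first differences over time
    return rel, vel
```

The hip column of the relative positions is identically zero, which is why the reference joint carries no information of its own.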
In addition, we follow the limb normalization procedure [Zanfir et al., 2013] to reduce the problem caused by variations in body size among human subjects. We first compute the average limb length between each pair of connected joints over the training dataset, and then use these lengths to modify the locations of the corresponding joints while keeping the joint angles unchanged.
In order to extract the spatial features of a human skeleton at time $t$ over the set of joints, we first define a spatial configuration of a joint chain. We believe that the order of the joints greatly affects the learning ability of a 2D CNN, since joints in adjacent body parts share more spatial relations than a random pair of joints. For example, in most actions, the joints of the right arm are more correlated with those of the left arm than with those of the left leg. With this intention, we concatenate the joints in the following order: left arm, right arm, torso, left leg, right leg. Note that the torso in the context of this paper includes the head joint of the human skeleton. Let $T$ be the number of frames in a given skeleton sequence. In the next step, we compute each feature of the skeleton data over the $T$ frames and stack them as feature rows. Consequently, we obtain the WB features as two 2D arrays, one corresponding to the joint locations and one to the velocities. Finally, we project these 2D array features into RGB image space using a linear transformation: each of the three components of each skeleton joint is represented as one of the three corresponding components of a pixel in a color image, normalized to the range 0 to 255. The two sets of color images are further up-scaled by cubic spline interpolation, a commonly used technique in image processing that minimizes the interpolation error [Hou and Andrews, 1978]. We call these two RGB images skeleton images. Figure 2(a) illustrates our procedure for WB feature generation.
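The projection into a skeleton image can be sketched as follows. Per-channel min-max normalization is our assumption for the linear transformation (the paper only specifies normalizing values to 0-255), and `scipy.ndimage.zoom` with `order=3` performs the cubic spline up-scaling:

```python
import numpy as np
from scipy.ndimage import zoom

def to_skeleton_image(feat, out_size=(224, 224)):
    """Map a (T, N, 3) feature array to an RGB skeleton image.

    Each joint's (x, y, z) becomes a pixel's (R, G, B); rows are
    frames, columns are joints ordered along the joint chain.
    Values are linearly normalized per channel to [0, 255]
    (an assumed normalization scheme), then the image is up-scaled
    with cubic spline interpolation (order=3).
    """
    lo = feat.min(axis=(0, 1), keepdims=True)
    hi = feat.max(axis=(0, 1), keepdims=True)
    img = 255.0 * (feat - lo) / (hi - lo + 1e-8)
    sy = out_size[0] / img.shape[0]
    sx = out_size[1] / img.shape[1]
    img = zoom(img, (sy, sx, 1), order=3)        # cubic spline up-scaling
    return np.clip(img, 0, 255).astype(np.uint8)
```

Applying this once to the location array and once to the velocity array yields the two skeleton images described above.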
3.1.2 Body-part-based Feature
In order to represent the BP features, we choose five joints corresponding to five human body parts as the reference joints: the head, the left shoulder, the right shoulder, the left hip, and the right hip, as in [Ke et al., 2017a]. They are relatively stable in most actions. Mathematically, the position feature of skeleton joint $i$ at time $t$ with respect to reference joint $r$ is presented as:
$$\tilde{p}_{i,r}^t = p_i^t - p_r^t,$$
where $p_r^t$ denotes the position of the reference joint. Similarly, the velocity feature is given by:
$$v_{i,r}^t = \tilde{p}_{i,r}^{t+1} - \tilde{p}_{i,r}^t.$$
Similar to the WB feature representation, for each skeleton at time $t$ we obtain five feature vectors of joint locations and five vectors of velocities corresponding to the five distinct reference joints. We then place all BP features side by side to produce one row feature and stack these rows along the temporal axis to obtain a 2D array feature. Finally, we apply a linear transformation to represent these array features as RGB images and further up-scale them using cubic spline interpolation. As a result, we obtain two BP-based skeleton images from each skeleton sequence: one corresponding to the joint locations and the other to the velocities. The whole process is illustrated in Figure 2(b).
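The side-by-side placement of the five reference-joint blocks can be sketched as below; the reference-joint indices passed in are hypothetical and depend on the dataset's joint ordering:

```python
import numpy as np

def bp_features(seq, ref_idx):
    """Body-part-based features: positions relative to five reference
    joints (head, shoulders, hips), placed side by side per frame.

    seq: (T, N, 3) skeleton sequence.
    ref_idx: indices of the five reference joints (dataset-dependent).
    Returns (pos, vel): pos has shape (T, 5*N, 3), vel (T-1, 5*N, 3).
    """
    # one (T, N, 3) block of relative positions per reference joint
    parts = [seq - seq[:, r:r + 1, :] for r in ref_idx]
    pos = np.concatenate(parts, axis=1)   # blocks side by side along the joint axis
    vel = pos[1:] - pos[:-1]              # first differences over time
    return pos, vel
```

The resulting (T, 5N, 3) array is exactly the 2D-array-per-channel layout that `to`-image projection and up-scaling then turn into a BP skeleton image.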
3.2 Fine-to-Coarse Network Architecture
In this section, we explain the detail of our proposed F2C network architecture for high-level feature learning. Figure 3 illustrates our network structure in three dimensions.
Our F2C network takes the three color channels of a single skeleton image generated in the feature representation phase as input. Accordingly, the input of our F2C network has two dimensions: the spatial dimension, which describes the geometric dependencies of human joints along the joint chain, and the temporal dimension of the time-feature representation over the frames of a skeleton sequence. Let $n$ be the number of segments along the temporal axis and $b$ the number of body parts ($b = 5$); each skeleton image is then considered as a set of $n \times b$ slices (Figure 3). If $m$ is the number of frames in one temporal segment and $d$ is the dimension of one body part along the spatial axis, each input slice has size $m \times d$. In the next step, we concatenate the slices over both the spatial axis and the temporal axis simultaneously. In other words, along the spatial dimension we concatenate each body part belonging to the human limbs (arms and legs) with the torso, while along the temporal dimension we concatenate two consecutive temporal segments. Each concatenated 2D array feature is then passed through a convolutional layer and a max pooling layer. The same fusion procedure is applied before the next convolutional layer. In short, our F2C network consists of three layer-concatenation steps and, accordingly, three convolutional blocks. In the last step, the extracted image features are flattened to obtain a 1D array feature as output.
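A shape-level sketch of the first fusion step may clarify the slicing. It assumes seven temporal segments and five body parts with the torso at index 2 of the joint chain, which matches the slice counts listed in the network configuration (35 slices of 32x44 in, 24 slices of 64x88 out); the function name and structure are ours, not the authors' implementation:

```python
import numpy as np

def first_fusion(image, n_seg=7, parts=5, torso=2):
    """Cut a skeleton image into an n_seg x parts grid of slices, then
    pair each limb with the torso (spatial axis) and each temporal
    segment with its successor (temporal axis)."""
    h, w, _ = image.shape
    sh, sw = h // n_seg, w // parts              # per-slice height and width
    grid = [[image[i*sh:(i+1)*sh, j*sw:(j+1)*sw] for j in range(parts)]
            for i in range(n_seg)]
    limbs = [j for j in range(parts) if j != torso]
    fused = []
    for i in range(n_seg - 1):                   # consecutive temporal pairs
        for j in limbs:                          # each limb paired with the torso
            top = np.concatenate([grid[i][j], grid[i][torso]], axis=1)
            bot = np.concatenate([grid[i+1][j], grid[i+1][torso]], axis=1)
            fused.append(np.concatenate([top, bot], axis=0))
    return fused
```

With a 224x224 input this produces (7-1) x 4 = 24 fused slices of 64x88, each of which would then go through a convolution and max pooling block.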
Our network can be viewed as a procedure that eliminates unwanted connections between layers of a conventional CNN. We believe traditional CNN models include redundant connections for capturing human-body-geometric features. Many actions only require the movement of the upper body (e.g., hand waving, clapping) or the lower body (e.g., sitting, kicking), while others require movements of the whole body (e.g., moving towards another person, picking up something). For this reason, the bottom layers in our proposed method can discriminate "fine" actions, which require the movements of certain body parts, while the top layers are discriminative for "coarse" actions involving movements of the whole body.
4 Experiments and Discussion
4.1 Datasets and Experimental Conditions
We conduct experiments on two publicly available skeleton benchmark datasets: NTU RGB+D [Shahroudy et al., 2016] and the SBU Kinect Interaction dataset [Yun et al., 2012]. As the method proposed by [Ke et al., 2017a] is closely related to this paper, we use their method as our baseline. We also compare our proposed method with other state-of-the-art methods reported on the same datasets.
NTU RGB+D is currently the largest skeleton-based human action dataset, with 56,880 sequences. The skeleton data were collected using Microsoft Kinect v2 sensors. Each skeleton contains 25 human joints. The dataset covers 60 distinct action classes in three groups: daily actions, health-related actions, and two-person interactive actions. All actions are performed by 40 distinct subjects and recorded simultaneously by three camera sensors located at different horizontal angles: -45°, 0°, and +45°. To increase the number of camera views, the height and distance of the cameras are varied in each setup. This dataset is challenging due to the large variations in viewpoints and sequence lengths. In our experiments, we use the two standard evaluation protocols proposed by the original study [Shahroudy et al., 2016], namely cross-subject (CS) and cross-view (CV).
conv3-64: 3x3 convolution, 64 filters

| Input of 224x224 RGB image |
| 35 input slices of 32x44 |
| 24 input slices of 64x88 |
| 10 fused feature slices of 32x44 |
| 4 fused feature slices of 16x22 |
| Method | CS | CV |
|---|---|---|
| Lie Group [Vemulapalli et al., 2014] | 50.1 | 52.8 |
| Part-aware LSTM [Shahroudy et al., 2016] | 62.9 | 70.3 |
| ST-LSTM + Trust Gate [Liu et al., 2016] | 69.2 | 77.7 |
| Temporal Perceptive Network [Hu et al., 2017] | 75.3 | 84.0 |
| Context-aware attention LSTM [Liu et al., 2018] | 76.1 | 84.0 |
| Enhanced skeleton visualization [Liu et al., 2017c] | 76.0 | 82.6 |
| Temporal CNNs [Kim and Reiter, 2017] | 74.3 | 83.1 |
| Clips+CNN+Concatenation [Ke et al., 2017b] | 77.1 | 81.1 |
| Clips+CNN+MTLN [Ke et al., 2017b] | 79.6 | 84.8 |
| SkeletonNet [Ke et al., 2017a] | 75.9 | 81.2 |
| Pos + Vel + VGG | 68.1 | 72.4 |
| Pos + F2C network | 76.6 | 81.7 |
| Pos + Vel + F2C network (F2CSkeleton) | 79.6 | 84.6 |
SBU Kinect Interaction Dataset is another skeleton-based dataset, collected using the Microsoft Kinect sensor. It contains 282 skeleton sequences divided into 21 subsets, covering eight types of two-person interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. Each skeleton contains 15 joints. Seven subjects performed the actions in the same laboratory environment. We also augment the data as in [Ke et al., 2017a] before performing five-fold cross-validation: each skeleton image is first resized to 250x250 and then randomly cropped into 20 sub-images of size 224x224. Eventually, we obtain a dataset of 11,280 samples.
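The augmentation step can be sketched as follows; the cubic-spline resize mirrors the skeleton-image up-scaling, and the fixed seed is only for reproducibility of the sketch:

```python
import numpy as np
from scipy.ndimage import zoom

def augment(img, n_crops=20, crop=224, size=250, seed=0):
    """Resize a skeleton image to size x size with cubic spline
    interpolation, then take n_crops random crop x crop sub-images,
    following the augmentation of [Ke et al., 2017a]."""
    rng = np.random.default_rng(seed)
    img = zoom(img, (size / img.shape[0], size / img.shape[1], 1), order=3)
    crops = []
    for _ in range(n_crops):
        y = rng.integers(0, size - crop + 1)     # random top-left corner
        x = rng.integers(0, size - crop + 1)
        crops.append(img[y:y + crop, x:x + crop])
    return crops
```

With 282 sequences, two skeleton images each (location and velocity), and 20 crops per image, this yields the 11,280 samples stated above.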
Implementation details The proposed model was implemented using Keras (https://github.com/keras-team/keras) with the TensorFlow backend. For a fair comparison with the previous studies, transfer learning is applied to improve the classification performance. More specifically, our F2C network architecture is first trained on ImageNet with the input image dimension set to 224x224. The pre-trained weights are then applied in all experiments.
For the NTU RGB+D dataset, we first remove the 302 samples with missing skeletons reported by [Shahroudy et al., 2016]. 20% of the training samples are used as a validation set. The first fully connected layer has 256 hidden units, while the output layer has the same size as the number of actions in the dataset. The network is trained using Adam for stochastic optimization [Kingma and Ba, 2015]. The learning rate is set to 0.001 and exponentially decayed over 25 epochs. We use a batch size of 32. The same experimental settings are applied to all the experiments.
We set the number of temporal segments to seven, as this shows the best performance on the NTU RGB+D dataset. Considering that body-part features contribute differently to an action, we do not share weights between input slices during training. This increases the number of parameters but improves the generalization ability of the network. Table 2 shows the details of our network configuration.
4.2 Experimental Results
[Table 3 excerpt, per-class accuracies (%): "pat on back" 54.7, 46.2, 82.8, 80.7; "touch other's pocket" 66.9, 50.6, 90.9, 95.3]
| Method | Accuracy |
|---|---|
| Deep LSTM + Co-occurrence [Zhu et al., 2016] | 90.4 |
| ST-LSTM + Trust Gate [Liu et al., 2016] | 93.3 |
| SkeletonNet [Ke et al., 2017a] | 93.5 |
| Clips+CNN+Concatenation [Ke et al., 2017b] | 92.9 |
| Clips+CNN+MTLN [Ke et al., 2017b] | 93.6 |
| Context-aware attention LSTM [Liu et al., 2018] | 94.9 |
NTU RGB+D dataset We compare the performance of our method with the previous studies in Table 2, using classification accuracy as the evaluation metric.
Pos + Vel + VGG In this experiment, we use VGG16 pre-trained on ImageNet instead of our F2C network. This experiment examines the significance of the proposed F2C network for high-level feature learning against conventional deep CNN models.
Pos + F2C network In this experiment, we only use joint position with the proposed F2C network architecture. This experiment examines the importance of incorporating velocity feature to improve the classification performance.
Pos + Vel + F2C network (F2CSkeleton) This is our proposed method.
As shown in Table 2, our proposed method outperforms the results reported by [Vemulapalli et al., 2014, Shahroudy et al., 2016, Liu et al., 2016, Hu et al., 2017, Liu et al., 2018, Liu et al., 2017c, Ke et al., 2017a] under the same testing conditions. In particular, we gain over 3.0% improvement on our baseline [Ke et al., 2017a] with both the CS and CV testing protocols. Similarly, our method is around 2.5% better than the method with feature concatenation [Ke et al., 2017b]. However, [Ke et al., 2017b] using the Multi-Task Learning Network (MTLN) obtained a slightly better performance than our method with the CV protocol. The MTLN learning paradigm works as a hierarchical method to effectively learn the intrinsic correlations between multiple related tasks [Zhang and Yeung, 2014] and thus outperforms a mere concatenation. We believe our method can also benefit from MTLN, and we will include this as part of our future work to improve our network.
Table 2 also shows that our F2C network performs significantly better than VGG16. In particular, our F2C network improves the accuracy from 68.1% to 79.6% with CS protocol and from 72.4% to 84.6% with CV protocol. The incorporation of velocity improves the performance about 3.0 points in both testing protocols.
Our method outperforms SkeletonNet on all the two-person interactions. Table 3 shows our classification performance with CV protocol. Two-person interactions usually require the movement of the whole body. Top layers of our tailored network architecture can learn the whole body motion better than the naive CNN models originally designed for detecting generic objects in a still image.
On the other hand, our method performs poorly on two classes, namely "brushing teeth" (58.3%) and "brushing hair" (47.6%). The confusion matrix reveals that "brushing teeth" is often misclassified as either "cheer up" or "hand waving", while "brushing hair" is misclassified as "hand waving". This may be because the head joint, which is selected as the reference joint for the torso, is not stationary enough compared to the other reference joints in these action types.
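The confusion analysis above can be reproduced from model predictions with a row-normalized confusion matrix; a minimal sketch (our own utility, not part of the paper's pipeline):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction
    of class-i samples predicted as class j, so off-diagonal peaks in
    a row reveal which classes an action is confused with."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row, 1)               # avoid division by zero for empty rows
```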
SBU Kinect Interaction dataset Table 4 shows the comparisons of our proposed method with the previous studies on SBU dataset. As can be seen, our proposed method achieved the best performance on this dataset over all the other previous methods. In particular, our method gains more than 5.0 points improvement compared to the two state-of-the-art CNN-based methods [Ke et al., 2017a, Ke et al., 2017b], and about 4.0 points better than [Liu et al., 2018]. These results again confirm that our method has superior performance on two-person interaction actions.
This paper addresses two problems of the previous studies: the loss of temporal information in a skeleton sequence when modeling with CNNs, and the need for a network model specific to human skeleton sequences. We first propose to segment a skeleton sequence to retrieve the dependencies between temporal segments of an action. We also propose an F2C CNN architecture for exploiting the spatio-temporal features of skeleton data. As a result, our method, with only three network blocks, shows generalization ability superior to very deep CNN models. We achieve accuracies of 79.6% and 84.6% on the large skeleton dataset NTU RGB+D with the cross-subject and cross-view protocols, respectively, which reaches the state-of-the-art.
In the future, as noted above, we will adopt the notion of multi-task learning. In addition, since we do not share weights between input slices during training, our network has more trainable parameters than general CNN models with the same input size and number of filters. We believe our method will work better if we reduce the number of feature maps in the convolutional layers. The current skeleton data is also challenging due to noisy joints. For example, by manually checking the skeleton data from the first data collection setup of NTU RGB+D, we found that about 8.8% of the detections were noisy. Because our method does not apply any algorithm to remove this noise from the input, taking it into consideration is a promising direction for better performance.
This work was supported by JSPS KAKENHI 15K12061 and by JST CREST Grant Number JPMJCR1687, Japan.
- Chatfield et al., 2014 Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC).
- Du et al., 2015 Du, Y., Wang, W., and Wang, L. (2015). Hierarchical recurrent neural network for skeleton based action recognition. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 1110–1118.
- George et al., 2017 George, D., Lehrach, W., Kansky, K., Lázaro-Gredilla, M., Laan, C., Marthi, B., Lou, X., Meng, Z., Liu, Y., Wang, H., et al. (2017). A generative vision model that trains with high data efficiency and breaks text-based captchas. Science, 358(6368):eaag2612.
- Han et al., 2017 Han, F., Reily, B., Hoff, W., and Zhang, H. (2017). Space-time representation of people based on 3d skeletal data: A review. Proc. of Computer Vision and Image Understanding, 158:85–105.
- He et al., 2016 He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 770–778.
- Hou and Andrews, 1978 Hou, H. and Andrews, H. (1978). Cubic splines for image interpolation and digital filtering. IEEE Transactions on acoustics, speech, and signal processing, 26(6):508–517.
- Hu et al., 2017 Hu, Y., Liu, C., Li, Y., Song, S., and Liu, J. (2017). Temporal perceptive network for skeleton-based action recognition. In Proc. of British Machine Vision Conference (BMVC), pages 1–2.
- Ke et al., 2017a Ke, Q., An, S., Bennamoun, M., Sohel, F., and Boussaid, F. (2017a). Skeletonnet: Mining deep part features for 3-d action recognition. IEEE signal processing letters, 24(6):731–735.
- Ke et al., 2017b Ke, Q., Bennamoun, M., An, S., Sohel, F., and Boussaid, F. (2017b). A new representation of skeleton sequences for 3d action recognition. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 4570–4579. IEEE.
- Kerola et al., 2016 Kerola, T., Inoue, N., and Shinoda, K. (2016). Graph regularized implicit pose for 3d human action recognition. In Proc. of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–4. IEEE.
- Kim and Reiter, 2017 Kim, T. S. and Reiter, A. (2017). Interpretable 3d human action analysis with temporal convolutional networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1623–1631. IEEE.
- Kingma and Ba, 2015 Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. Proc. of International Conference on Learning Representations (ICLR).
- Krizhevsky et al., 2012 Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.
- Liu et al., 2017a Liu, J., Shahroudy, A., Xu, D., Chichung, A. K., and Wang, G. (2017a). Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Liu et al., 2016 Liu, J., Shahroudy, A., Xu, D., and Wang, G. (2016). Spatio-temporal lstm with trust gates for 3d human action recognition. In Proc. of European Conference on Computer Vision, pages 816–833. Springer.
- Liu et al., 2018 Liu, J., Wang, G., Duan, L.-Y., Abdiyeva, K., and Kot, A. C. (2018). Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Transactions on Image Processing, 27(4):1586–1599.
- Liu et al., 2017b Liu, M., Chen, C., and Liu, H. (2017b). 3d action recognition using data visualization and convolutional neural networks. In IEEE International Conference on Multimedia and Expo (ICME), pages 925–930. IEEE.
- Liu et al., 2017c Liu, M., Liu, H., and Chen, C. (2017c). Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362.
- Long et al., 2015 Long, M., Cao, Y., Wang, J., and Jordan, M. I. (2015). Learning transferable features with deep adaptation networks. In Proc. of the 32nd International Conference on Machine Learning, ICML, pages 97–105.
- Luo et al., 2013 Luo, J., Wang, W., and Qi, H. (2013). Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In Proc. of Computer vision (ICCV), pages 1809–1816. IEEE.
- Pan and Yang, 2010 Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359.
- Presti and La Cascia, 2016 Presti, L. L. and La Cascia, M. (2016). 3d skeleton-based human action classification: A survey. Proc. of Pattern Recognition, 53:130–147.
- Russakovsky et al., 2015 Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
- Shahroudy et al., 2016 Shahroudy, A., Liu, J., Ng, T.-T., and Wang, G. (2016). Ntu rgb+d: A large scale dataset for 3d human activity analysis. In Proc. of Computer Vision and Pattern Recognition (CVPR).
- Simonyan and Zisserman, 2014 Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
- Song et al., 2017 Song, S., Lan, C., Xing, J., Zeng, W., and Liu, J. (2017). An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), volume 1, page 7.
- Vemulapalli et al., 2014 Vemulapalli, R., Arrate, F., and Chellappa, R. (2014). Human action recognition by representing 3d skeletons as points in a lie group. In Proc. of Computer Vision and Pattern Recognition, pages 588–595.
- Wagner et al., 2013 Wagner, R., Thom, M., Schweiger, R., Palm, G., and Rothermel, A. (2013). Learning convolutional neural networks from few samples. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–7. IEEE.
- Wang et al., 2012 Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 1290–1297. IEEE.
- Yang and Tian, 2014 Yang, X. and Tian, Y. (2014). Effective 3d action recognition using eigenjoints. Journal of Visual Communication and Image Representation, 25(1):2–11.
- Yun et al., 2012 Yun, K., Honorio, J., Chattopadhyay, D., Berg, T. L., and Samaras, D. (2012). Two-person interaction detection using body-pose features and multiple instance learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 28–35. IEEE.
- Zanfir et al., 2013 Zanfir, M., Leordeanu, M., and Sminchisescu, C. (2013). The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In Proc. of the IEEE International Conference on Computer Vision, pages 2752–2759.
- Zhang et al., 2017 Zhang, S., Liu, X., and Xiao, J. (2017). On geometric features for skeleton-based action recognition using multilayer lstm networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 148–157. IEEE.
- Zhang and Yeung, 2014 Zhang, Y. and Yeung, D.-Y. (2014). A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12.
- Zhu et al., 2016 Zhu, W., Lan, C., Xing, J., Zeng, W., Li, Y., Shen, L., Xie, X., et al. (2016). Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), volume 2, page 8.