Two-Stream RNN/CNN for Action Recognition in 3D Videos
Abstract
The recognition of actions from video sequences has many applications in health monitoring, assisted living, surveillance, and smart homes. Despite advances in sensing, in particular related to 3D video, the methodologies to process the data are still subject to research. We demonstrate superior results by a system which combines recurrent neural networks with convolutional neural networks in a voting approach. The gated-recurrent-unit-based neural networks are particularly well-suited to distinguish actions based on long-term information from optical tracking data; the 3D-CNNs focus more on detailed, recent information from video data. The resulting features are merged in an SVM which then classifies the movement. In this architecture, our method improves recognition rates of state-of-the-art methods by 14% on standard data sets.
I Introduction
Recognition of human activity in 3D videos has received increasing attention since 2010 [1, 2, 3, 4, 5, 6, 7]. Compared to 2D videos, 3D videos provide more spatial information and can therefore be more informative. Action recognition with 3D videos is applied in different fields, such as health monitoring for patients, assisted living for disabled people, and robot perception and cognition.
Following this line of research, this paper proposes and applies novel deep-learning methods on what is currently the largest 3D action recognition dataset. Our results are compared with the best existing approaches and are shown to be superior. Our proposed deep-learning methods consist mainly of three parts: a novel skeleton-based recurrent neural network structure, a 3D-convolutional neural network [8] for RGB videos, and a new two-stream fusion method to combine RNN and CNN. All methods are evaluated on the NTU RGB+D dataset [2]. The dataset was published in 2016 and contains more than 56k action samples in four different modalities: RGB videos, depth map sequences, 3D skeletal data, and infrared videos. The dataset consists of 60 different action classes including daily, health-related, and mutual actions. In this paper, we use both the 3D skeletal data and RGB videos.
Traditional studies on 3D action recognition use different kinds of methods [1, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52] to compute handcrafted features, while deep-learning approaches [6, 53, 7, 54, 3, 2, 5, 4] are end-to-end trainable and can be applied directly on raw data. Focussing on the latter, for skeleton-based activity analysis, [3, 2, 5, 4] used different kinds of recurrent neural networks to achieve state-of-the-art performance on various 3D action datasets. Du et al. [3] propose a hierarchical RNN, which is fed with five manually divided groups of the human skeleton: two hands, two legs, and one torso. Inspired by this, Shahroudy et al. [2] present a novel long short-term memory (LSTM) [55] cell, called part-aware LSTM, which is also fed with five separate parts of the skeleton. Building on these two ideas, Zhu et al. [5] provide a novel deep RNN structure, which automatically learns the co-occurrence of joints from skeleton data, similar to grouping the data into five human body parts. Most recently, Liu et al. [4] propose a skeleton tree traversal algorithm and a new gating mechanism to improve robustness against noise and occlusion.
Our proposed RNN structure makes a different contribution. Our method is inspired by recent normalization techniques [56] and a novel recurrent neuron mechanism [57]. With these advanced deep-learning techniques embedded into our RNN structure, it can be trained with 13 times fewer iterations, and each iteration consumes 20% less computational time, compared to a normal RNN model with LSTM cells. Our contribution focuses on making the network much easier to train, less inclined to overfit, and deep enough to represent the data. More importantly, our proposed RNN structure outperforms all other skeleton-based methods on the largest 3D action recognition dataset.
To process RGB videos, our method is inspired by different kinds of convolutional neural networks [8, 58, 59, 60]. We use a 3D-CNN [8] model on the RGB videos of the NTU RGB+D dataset. We compare the results with the proposed RNN models and fuse their outputs.
To combine the RNN and CNN models, we propose two fusion structures: decision fusion and feature fusion. The first is very simple to use, whereas the second provides better performance. For decision fusion, we illustrate a voting method based on the confidence of the classifiers. For feature fusion, we propose a novel two-stream RNN/CNN structure, shown in Fig. 4, which combines temporal and spatial features from the RNN and CNN models and boosts the performance by a significant margin. Our two-stream RNN/CNN structure outperforms the current state-of-the-art method [4] by more than 14% on both the cross-subject and cross-view settings.
To summarize, our contributions in this paper are:

A novel RNN structure is proposed, which converges 13 times faster during training and costs 20% less computational power at each forward pass, compared to a normal LSTM;

Two fusion methods, i.e., decision fusion and feature fusion, are proposed to combine the proposed RNN structure and a 3D-CNN structure [8]. The decision fusion is easier to use, while the feature fusion has superior performance.
II Methodology
In this section, we first introduce the concept of recurrent neural networks and batch normalization, and then describe the proposed RNN structure: a deep bidirectional gated recurrent neural network with batch normalization and dropout. Afterwards, the applied 3D-CNN model and the two-stream RNN/CNN fusion architectures, i.e., decision fusion and feature fusion, are introduced.
II-A Recurrent Neural Network
II-A1 Vanilla Recurrent Neural Network
Recurrent neural networks can handle sequence information with varied lengths of time steps. An RNN transforms the input into an internal hidden state at each time step and passes this state, along with the next input, to the neuron, time step after time step. The neuron learns when to remember and forget information with nonlinear activation functions:
$$h_t = \sigma(W_x x_t + W_h h_{t-1} + b_h) \tag{1}$$
$$y_t = W_y h_t + b_y \tag{2}$$
where $t$ represents time steps, $x_t$, $h_t$, and $y_t$ are the input, hidden state, and output, $W$ and $b$ are weight matrices and bias vectors, and $\sigma$ represents a nonlinear activation function such as a standard logistic sigmoid function $\mathrm{sigm}$ or a hyperbolic tangent function $\tanh$.
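As a minimal sketch of the recurrence described above, the following pure-Python example applies one hidden-state update per time step and carries the state across sequences of different lengths. It assumes the common textbook form with a tanh activation; the weight values and the names `rnn_step` and `run_rnn` are illustrative and not taken from the paper.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One vanilla RNN update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_x[i][j] * x[j] for j in range(len(x)))
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(s))
    return h

def run_rnn(sequence, W_x, W_h, b):
    """Carry the hidden state through a sequence of any length."""
    h = [0.0] * len(b)
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Toy weights (hypothetical): 2 inputs, 2 hidden units.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

h_short = run_rnn([[1.0, 0.0]], W_x, W_h, b)
h_long = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_x, W_h, b)
```

The same weights serve both the one-step and the three-step sequence, which is what lets an RNN process action samples of varying duration.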
Multiple layers of RNN can be stacked to increase the complexity:
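$$h_t^{l} = \sigma(W_x^{l} h_t^{l-1} + W_h^{l} h_{t-1}^{l} + b_h^{l}) \tag{3}$$
$$h_t^{0} = x_t \tag{4}$$
$$y_t = W_y h_t^{L} + b_y \tag{5}$$
where $l = 1, \dots, L$ denotes the layer number.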
In practice, a vanilla RNN does not retain information over long time spans, a limitation related to the vanishing gradient problem.
II-A2 Long Short-Term Memory
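This problem can be solved by LSTM [55], which stores information in gated cells at the neurons. This allows errors to be backpropagated through hundreds or thousands of time steps:
$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} (W_x x_t + W_h h_{t-1} + b) \tag{6}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{7}$$
$$h_t = o_t \odot \tanh(c_t) \tag{8}$$
where $i_t$, $f_t$, and $o_t$ denote input gate, forget gate, and output gate, respectively. $g_t$ represents new candidate values, which could be added to the cell state $c_t$. Here $W_x$, $W_h$, and $b$ stack the parameters of all four blocks. We use $\odot$ for elementwise multiplication.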
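To make the gating concrete, here is a small pure-Python sketch of one LSTM update in its standard form. The parameter layout (`params` mapping each gate name to a weight/bias triple) and the toy values are assumptions chosen for illustration; they are not the configuration used in the experiments.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, b, x, h):
    """Return W x + U h + b as a list (one entry per hidden unit)."""
    return [b[k]
            + sum(W[k][j] * x[j] for j in range(len(x)))
            + sum(U[k][j] * h[j] for j in range(len(h)))
            for k in range(len(b))]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps 'i', 'f', 'o', 'g' to (W, U, b)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]
    g = [math.tanh(v) for v in affine(*params["g"], x, h_prev)]
    # Cell state: keep part of the old memory, add part of the candidate.
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(len(c_prev))]
    h = [o[k] * math.tanh(c[k]) for k in range(len(c))]
    return h, c

# Toy setting (hypothetical): forget gate nearly open, input gate nearly closed.
zero = [[0.0]]
params = {
    "i": (zero, zero, [-10.0]),
    "f": (zero, zero, [10.0]),
    "o": (zero, zero, [0.0]),
    "g": (zero, zero, [0.0]),
}
h, c = [0.0], [0.7]
for _ in range(100):
    h, c = lstm_step([1.0], h, c, params)
```

With the forget gate close to one, the cell state stays near its initial value even after 100 steps, which illustrates how the gated cell carries information over long sequences.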
II-A3 Gated Recurrent Unit
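An improvement to LSTM called gated recurrent unit (GRU) was proposed in [57]. GRU has a simpler structure and can be computed faster. The three gates of LSTM are combined into two gates in GRU, an update gate $z_t$ and a reset gate $r_t$. GRU also combines cell state and hidden state into one state $h_t$. The mathematical description is as follows:
$$\begin{pmatrix} r_t \\ z_t \end{pmatrix} = \mathrm{sigm}(W_x x_t + W_h h_{t-1} + b) \tag{9}$$
$$\tilde{h}_t = \tanh(\tilde{W}_x x_t + \tilde{W}_h (r_t \odot h_{t-1}) + \tilde{b}) \tag{10}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{11}$$
where $\tilde{h}_t$ denotes new candidate state values.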
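The following sketch shows one GRU update in pure Python, together with a rough parameter count that illustrates why the GRU is the lighter cell. It assumes the common convention in which the update gate weights the candidate state (some references swap the roles of $z$ and $1 - z$); the helper names and the parameter layout are hypothetical. The sizes 150 and 300 are the input dimension and layer width reported later in the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, b, x, h):
    """Return W x + U h + b as a list (one entry per hidden unit)."""
    return [b[k]
            + sum(W[k][j] * x[j] for j in range(len(x)))
            + sum(U[k][j] * h[j] for j in range(len(h)))
            for k in range(len(b))]

def gru_step(x, h_prev, params):
    """One GRU update; params maps 'r', 'z', 'h' to (W, U, b)."""
    r = [sigmoid(v) for v in affine(*params["r"], x, h_prev)]
    z = [sigmoid(v) for v in affine(*params["z"], x, h_prev)]
    gated = [r[k] * h_prev[k] for k in range(len(h_prev))]
    cand = [math.tanh(v) for v in affine(*params["h"], x, gated)]
    return [(1.0 - z[k]) * h_prev[k] + z[k] * cand[k]
            for k in range(len(h_prev))]

def recurrent_params(n_in, n_hidden, n_blocks):
    """Weights and biases of a cell with n_blocks gate/candidate blocks."""
    return n_blocks * (n_hidden * (n_in + n_hidden) + n_hidden)

# Update gate almost closed (toy values): the state is carried over unchanged.
zero = [[0.0]]
params = {"r": (zero, zero, [0.0]),
          "z": (zero, zero, [-20.0]),
          "h": (zero, zero, [0.0])}
h_next = gru_step([1.0], [0.3], params)

lstm_count = recurrent_params(150, 300, 4)  # four blocks: i, f, o, g
gru_count = recurrent_params(150, 300, 3)   # three blocks: r, z, candidate
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width, which is consistent with its lower cost per step.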
II-A4 Bidirectional Recurrent Neural Network
A bidirectional RNN [61] performs a forward pass and a backward pass, which run over the input data from $t = 1$ to $T$ and from $t = T$ to $1$, respectively.
For classification, the output of an RNN can be passed to a fully-connected layer with softmax activation functions; this allows us to interpret the output as a probability.
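A small sketch of the two passes and of the softmax follows. It assumes that the final states of both directions are concatenated into one feature vector, which is one common choice and not necessarily the one used here; `toy_step` is a hypothetical stand-in for a recurrent cell, and the softmax is applied directly to the features only to keep the example short (in a classifier a fully-connected layer would come first).

```python
import math

def softmax(scores):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bidirectional_features(sequence, step_fwd, step_bwd, size):
    """Run one pass from t = 1 to T and one from T to 1, then concatenate."""
    h_f = [0.0] * size
    for x in sequence:
        h_f = step_fwd(x, h_f)
    h_b = [0.0] * size
    for x in reversed(sequence):
        h_b = step_bwd(x, h_b)
    return h_f + h_b

# Toy recurrent step (hypothetical stand-in for a GRU cell).
def toy_step(x, h):
    return [0.5 * h[0] + x[0]]

features = bidirectional_features([[1.0], [2.0], [3.0]], toy_step, toy_step, 1)
probs = softmax(features)
```

The two directions end in different states for the same input, so the concatenated vector carries information that a single pass would miss.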
II-A5 Batch Normalization
When training a deep neural network, internal covariate shift [56] slows down the training process. Internal covariate shift refers to the change in the distribution of each layer’s inputs during training, caused by the changing parameters of the previous layers. To reduce it, we could whiten the layer activations, but this takes too much computation power. Batch normalization, which becomes part of the neural network structure, approximates this process by standardizing the activations using a statistical estimate of the mean and standard deviation for each training minibatch:
$$\mathrm{BN}(x; \gamma, \beta) = \beta + \gamma \odot \frac{x - \widehat{\mathbb{E}}[x]}{\sqrt{\widehat{\mathrm{Var}}[x] + \epsilon}} \tag{12}$$
where $\widehat{\mathbb{E}}[x]$ and $\widehat{\mathrm{Var}}[x]$ are the minibatch estimates of mean and variance, and $\gamma$ and $\beta$ are scale and shift parameters for the activation $x$. With these, the identity transformation can be represented for each activation. $\epsilon$ is a constant added as a regularization parameter for numerical stability. The division in Eq. (12) is performed elementwise. $\gamma$ and $\beta$ are learned during training and fixed during inference.
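The standardization step can be sketched in a few lines of pure Python. The function below normalizes each feature over a minibatch and then applies a scale and a shift; the value of `eps` and the toy minibatch are assumptions for illustration only.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standardize each feature over the minibatch, then scale and shift."""
    n = len(batch)
    d = len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n for j in range(d)]
    return [[gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
             for j in range(d)]
            for row in batch]

# Two features on very different scales (toy minibatch of two samples).
batch = [[1.0, 10.0], [3.0, 30.0]]
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

After normalization both features lie close to -1 and +1 despite their different original scales. At inference time, stored estimates of the mean and variance would be used in place of the minibatch statistics.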
II-A6 Proposed RNN Structure
For skeleton-based action recognition tasks, the data set consists of the 3D coordinates of a number of body joints. We feed this information, together with action labels, to an RNN. This RNN has two bidirectional layers, each of which consists of 300 GRU cells. The recurrent layers are followed by a batch normalization layer, which standardizes the activations from the RNN layer. The normalized activations then flow to a fully-connected layer with 600 rectified linear unit (ReLU) [62] activation functions. During training, in each iteration the network randomly drops out 25% of the neurons between the batch normalization layer and the fully-connected layer to reduce overfitting. Lastly, a softmax layer maps the compressed motion information (features) to 60 action classes. Fig. 1 shows the structure of this RNN.
To highlight the improvements of this final proposed model, we compare our approach to simpler models. These models are a standard RNN, an LSTM-RNN, LSTM plus batch normalization (“LSTM-BN”), LSTM-BN with dropout (“LSTM-BN-DP”), GRU-BN-DP, and a bidirectional GRU-BN-DP which we call “BI-GRU-BN-DP”. All these models have one recurrent layer. The next step in complexity is adding an extra layer of hidden units to the last model (“2 layer BI-GRU-BN-DP”). Finally, we add another fully-connected layer on top, before the softmax layer, and call this model “2 layer BI-GRU-BN-DP-H”. Sec. III discusses the results of all models.
II-B Convolutional Neural Network
To process RGB videos, we choose the 3D-CNN model from [8], as it shows promising performance on 2D video action recognition tasks. We believe that 3D convolution nets are more suitable for learning features from videos than 2D convolution nets.
2D convolution generates a series of 2D feature maps from images. Analogously, a 3D convolution processes frame clips, where the third dimension is the time step, which results in a series of 3D feature volumes, as shown in Fig. 2. This compressed representation contains spatio-temporal information from the video clips. To learn a rich set of features, multiple layers of convolution and max-pooling operations are stacked into one model.
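To illustrate how a 3D convolution yields a feature volume rather than a feature map, here is a naive single-channel sketch in pure Python. It assumes no padding and a stride of one, and the clip and kernel sizes are toy values; the actual model in [8] is far larger and runs on a GPU.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (cross-correlation, no padding)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over a small block spanning time, height and width.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Toy clip: 4 frames of 4 x 4 pixels, all ones; 3 x 3 x 3 kernel of ones.
clip = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
features = conv3d_valid(clip, kernel)
```

The output keeps a time axis (here of length 2), so each value summarizes a small spatio-temporal neighbourhood instead of a single frame.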
Specifically, the 3D-CNN model [8] that we choose has five convolutional groups, each with one or two convolutional layers and one max-pooling layer, followed by two fully-connected layers and one softmax output layer. The details of this model are presented in Fig. 3.
We fine-tune this model starting from parameters pre-trained [8] on the Sports-1M dataset, which has approximately one million YouTube videos. This reduces overfitting and demands less training time on the current dataset.
II-C Two-stream RNN/CNN
Having the proposed RNN structure for the skeleton data and the 3D-CNN model for the RGB videos, we want to combine the strengths of both networks. To improve the performance, we propose two fusion models, decision fusion and feature fusion.
II-C1 Decision Fusion
In the case of decision fusion, we use a simple but efficient voting method, inspired by majority voting. Because we have only two classifiers, we cannot apply majority voting. Instead, the fusion method predicts based on voting confidence.
We first split the dataset into training, validation and testing. The same training set is used to train the RNN and CNN nets. The validation set is then used to find the best parameters, trust weights and for the voting method. We initialize the trust weights with equal values, which means and for both RNN and CNN classifiers. Afterwards, we compare the confidences, which are the highest probabilities of softmax output from both classifiers for each prediction. The more confident one wins:
(13) 
where is the fused prediction for sample ; and denote RNN and CNN prediction, respectively.
Based on this concept, we fuse the predictions from RNN and CNN. We evaluate the performance on the validation dataset and search for the best trust weights for decision fusion. With only two trust weights to set, little tuning is needed.
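The confidence-based vote and the search on the validation set might look roughly like the sketch below. How the trust weights enter the rule is an assumption here: each weight multiplies the highest softmax probability of its classifier, and the two weights are taken to sum to one so that a single value has to be searched. The function names, the default of 0.5, and the toy probabilities are all hypothetical.

```python
def decision_fusion(p_rnn, p_cnn, w_rnn=0.5, w_cnn=0.5):
    """Return the class index predicted by the more confident classifier."""
    conf_rnn = w_rnn * max(p_rnn)
    conf_cnn = w_cnn * max(p_cnn)
    winner = p_rnn if conf_rnn >= conf_cnn else p_cnn
    return winner.index(max(winner))

def search_trust_weight(val_rnn, val_cnn, labels, grid):
    """Pick the RNN weight w (CNN gets 1 - w) with the best validation accuracy."""
    best_acc, best_w = -1.0, None
    for w in grid:
        preds = [decision_fusion(a, b, w, 1.0 - w)
                 for a, b in zip(val_rnn, val_cnn)]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_acc, best_w

# Toy validation set: softmax outputs of both classifiers for two samples.
val_rnn = [[0.6, 0.4], [0.55, 0.45]]
val_cnn = [[0.3, 0.7], [0.2, 0.8]]
labels = [0, 1]
best_acc, best_w = search_trust_weight(val_rnn, val_cnn, labels, [0.5, 0.55, 0.7])
```

In this toy case neither equal weights nor a strong preference for the RNN classifies both samples correctly, while an intermediate weight does, which is the kind of trade-off the validation search resolves.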
II-C2 Feature Fusion
Another way to combine these two neural networks is feature fusion.
We first train the RNN and CNN models on the training dataset. During training, neural nets learn discriminative information from raw data. We therefore use the trained RNN model to extract temporal features from the 3D skeleton data and the trained CNN model to extract spatio-temporal features from the RGB videos. Both sets of features come from the first fully-connected layer in each model. The features are concatenated, L2-normalized, and finally fed to a linear SVM classifier.
The SVM parameter $C$ is found using the validation dataset. Then the model is tested on the test set. The feature fusion structure for the two streams of RNN and CNN features is presented in Fig. 4.
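The concatenation and normalization steps are simple enough to sketch directly. The function names are hypothetical, and the constant feature values only serve to check the dimensions (600 for the RNN and 4096 for the CNN, as reported in the implementation details).

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def fuse_features(rnn_feat, cnn_feat):
    """Concatenate both streams, then L2-normalize the joint vector."""
    return l2_normalize(list(rnn_feat) + list(cnn_feat))

small = fuse_features([3.0], [4.0])
fused = fuse_features([1.0] * 600, [1.0] * 4096)
```

The fused vectors would then be passed to a linear SVM; any standard linear SVM implementation with a regularization parameter could be used for that step.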
III Experiments
The models introduced in the previous section are evaluated in the experiments. The dataset is first introduced in this section, then the setups and parameter settings for the experiments are illustrated. We compare the results of the proposed models with the current best methods. In the end, we analyze and discuss the problems related to deep learning methods for 3D action recognition.
III-A NTU RGB+D Dataset [2]
The proposed approaches are evaluated on the NTU RGB+D dataset [2], which is, to our knowledge, currently the largest publicly available 3D action recognition dataset. The dataset consists of more than 56k action videos and 4 million frames, which were collected by 3 Kinect V2 cameras from 40 distinct subjects, and is divided into 60 different action classes including 40 daily (drinking, eating, reading, etc.), 9 health-related (sneezing, staggering, falling down, etc.), and 11 mutual (punching, kicking, hugging, etc.) actions. It has four major data modalities provided by the Kinect sensor: 3D coordinates of 25 joints for each person (skeleton), RGB frames, depth maps, and IR sequences. In this paper, we use the first two modalities, since they are the two most informative modalities.
The large intra-class and viewpoint variations make this dataset challenging. However, the large number of action samples makes it highly suitable for data-driven methods.
This dataset has two standard evaluation criteria [2]. The first one is a cross-subject test, in which half of the subjects are used for training and the other half are used for testing. The second one is a cross-view test, in which two viewpoints are used for training and one is excluded for evaluation.
III-B Implementation details
In our experiments, the implementation consists of RNN, CNN, and Fusion. For all these models we use the same training, validation and testing splits. The validation set is composed of 10% of the subjects in the training set in [2]. The remaining subjects in the training set [2] make up the training set.
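The two evaluation protocols and the validation hold-out can be sketched as follows. The sample records, the field names `subject` and `camera`, the particular subject IDs used for training, and the rule for choosing which subjects are held out are all hypothetical; the official subject and camera assignments are defined in [2].

```python
def cross_subject_split(samples, train_subjects):
    """Samples of the listed subjects train the model, the rest test it."""
    train = [s for s in samples if s["subject"] in train_subjects]
    test = [s for s in samples if s["subject"] not in train_subjects]
    return train, test

def cross_view_split(samples, test_camera):
    """One camera is excluded for evaluation, the others are used for training."""
    train = [s for s in samples if s["camera"] != test_camera]
    test = [s for s in samples if s["camera"] == test_camera]
    return train, test

def hold_out_validation(train_subjects, fraction=0.1):
    """Set aside a fraction of the training subjects for validation."""
    subjects = sorted(train_subjects)
    n_val = max(1, round(len(subjects) * fraction))
    return set(subjects[n_val:]), set(subjects[:n_val])

# Toy index: 40 subjects, each recorded by 3 cameras.
samples = [{"subject": s, "camera": c} for s in range(1, 41) for c in (1, 2, 3)]
train_subjects = set(range(1, 21))  # hypothetical choice of 20 subjects
cs_train, cs_test = cross_subject_split(samples, train_subjects)
cv_train, cv_test = cross_view_split(samples, test_camera=1)
fit_subjects, val_subjects = hold_out_validation(train_subjects)
```

Splitting by subject rather than by sample keeps all recordings of one person on the same side of the split, so the validation accuracy reflects generalization to unseen people.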
III-B1 RNN Implementation
In the RNN experiments, we use two human skeletons as input; each skeleton has 25 joints with 3D coordinates. Since the longest sequence has 300 time steps, we pad all action sequences to a length of 300. The dimension of each action sample is thus 300 (time steps) × 150 (coordinates).
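A minimal sketch of how one action sample reaches the shape 300 × 150: two skeletons of 25 joints with three coordinates each are flattened into 150 values per frame, and short sequences are padded to 300 frames. Padding with zeros at the end is an assumption, as the text does not state the padding value or position; the function names and joint values are hypothetical.

```python
def flatten_frame(skeleton_a, skeleton_b):
    """Concatenate two skeletons of 25 (x, y, z) joints into 150 values."""
    return [v for joint in skeleton_a + skeleton_b for v in joint]

def pad_sequence(frames, max_len=300, dim=150):
    """Copy the frames and append zero frames until max_len is reached."""
    padded = [list(f) for f in frames[:max_len]]
    padded.extend([0.0] * dim for _ in range(max_len - len(padded)))
    return padded

skeleton = [(0.1, 0.2, 0.3)] * 25        # toy joint positions
frame = flatten_frame(skeleton, skeleton)
sample = pad_sequence([frame] * 80)      # an 80-frame action
```

Every sample then has the same shape, so sequences of different durations can be stacked into one minibatch.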
We use TensorFlow [63] with TFlearn [64] and run the experiments on either one NVIDIA GTX 1080 GPU or one NVIDIA GTX TITAN X GPU. We train the network using the RMSprop [65] optimizer with a learning rate of 0.001, a decay of 0.9, and a momentum of 0. We train the network from scratch using minibatches of 1000 sequences for one-layer models and minibatches of 650 sequences for two-layer models. For all RNN nets, we use 300 neurons for each single-directional layer, double the number of neurons for bidirectional layers, and a 75% keep probability for dropout. For batch normalization, we initialize as 1.0, as 0.0, and set as . The estimated means and variances are fixed during inference.
For a fair comparison, the parameters above are the same for all proposed RNN models; only the structure changes.
III-B2 CNN Implementation
We use the 3D-CNN model [8] in Caffe [66] and train it on RGB frames from the NTU RGB+D dataset, with pre-trained parameters [8] from the Sports-1M dataset. From RGB videos, we extract the frames, crop and resize them from pixels to pixels [8]. Videos are split into non-overlapping 16-frame clips.
We refer to the input of the CNN model as having a size of $c \times l \times h \times w$, where $c$ is the number of channels, $l$ is the number of time steps, and $h$ and $w$ are the height and width of the frame, respectively. The network takes video clips as input and predicts one of the 60 action labels. It further resizes the input frames to a resolution of 128 × 171 pixels, so the input dimensions are 3 × 16 × 128 × 171. During training, we apply jittering to the input clips by randomly cropping them to 3 × 16 × 112 × 112. We fine-tune the network with a stochastic gradient descent optimizer using minibatches of 44 clips and an initial learning rate of 0.0001. The learning rate is halved whenever no training progress is observed [65]. Training stopped after around 20 epochs.
For video-based prediction, the model averages the predictions over all 16-frame clips split from the same video and provides the final prediction for the input video. A similar idea is applied for extracting features from the fc6 layer: the 4096-dimensional feature vectors are averaged over all clips in the same video, resulting in one 4096-dimensional vector for each video.
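The clip handling described above can be sketched as follows. The function names are hypothetical, and dropping a trailing remainder shorter than 16 frames is an assumption, since the text does not say how incomplete clips are treated. The same averaging function applies to softmax outputs and to fc6 feature vectors.

```python
def split_into_clips(num_frames, clip_len=16):
    """Frame index ranges of non-overlapping clips (a short tail is dropped)."""
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]

def average_vectors(vectors):
    """Elementwise mean over per-clip vectors (probabilities or features)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

clips = split_into_clips(50)
clip_probs = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]  # toy per-clip softmax outputs
video_probs = average_vectors(clip_probs)
video_label = video_probs.index(max(video_probs))
```

Averaging lets a single uncertain clip be outvoted by the other clips of the same video.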
III-B3 Fusion Implementation
We fuse the best RNN structure, 2 layer BI-GRU-BN-DP-H, with the 3D-CNN model, first using decision fusion, then using feature fusion.
For decision fusion, we first extract the softmax output, then search for the fusion parameters, trust weight and for RNN and CNN from the validation split. The parameters are and for the cross subject setup, and and for the cross view setup.
For feature fusion, we extract the RNN features (600 dimensions) from the fully-connected layer and the CNN features (4096 dimensions) from the fc6 layer [8]. We then concatenate them into one feature array (4,696 dimensions) and apply L2 normalization. In the end, we have normalized RNN/CNN features from the training, validation, and testing splits. We use the training and validation splits to find the optimal value of $C$ for the linear SVM [67] model. For both cross-subject and cross-view setups, we find that gives the best validation accuracy.
Among all the models in this paper, the feature fusion model shows the best testing results. We refer to this model as the two-stream RNN/CNN structure, as shown in Fig. 4.
III-C Experimental Results and Analysis
The evaluation results are shown in Tab. I. The first 16 rows are skeleton-based methods. The 3D-CNN model (17th row) uses RGB videos as input. The decision fusion (18th row) and feature fusion (19th row) models use the best RNN structure, which is the 2 Layer BI-GRU-BN-DP-H (16th row), and the 3D-CNN (17th row) model.
Tab. I shows that our RNN structure, the 1 Layer LSTM-BN, already outperforms the baseline method part-aware LSTM reported in [2] because batch normalization improves the LSTM model. Adding a dropout procedure reduces overfitting and further improves the results (rows 11, 12). From rows 12 and 13 we can see that the performances of LSTM and GRU cells are similar [68]: GRU is better in the cross-subject test and LSTM is better in the cross-view test. On the other hand, GRU is faster than LSTM both in computational speed and in convergence rate. As presented in Fig. 5 (left), for 1k training steps, the same model performs 5.42% more accurately and takes 20% less computational time when using GRU cells than when using LSTM cells.


| Nr. | Method | cross subject | cross view |
|-----|--------|---------------|------------|
| 01 | Skeleton Quads [9, 2] | 38.62% | 41.36% |
| 02 | Lie Group [10, 2] | 50.08% | 52.76% |
| 03 | FTP Dynamic Skeletons [11, 2] | 60.23% | 65.22% |
| 04 | HBRNN-L [3, 2] | 59.07% | 63.97% |
| 05 | Deep RNN [2] | 56.29% | 64.09% |
| 06 | Deep LSTM [2] | 60.69% | 67.29% |
| 07 | Part-aware LSTM [2] | 62.93% | 70.27% |
| 08 | ST-LSTM (Tree) + Trust Gate [4] | 69.2% | 77.7% |
| 09 | 1 Layer RNN | 18.74% | 20.27% |
| 10 | 1 Layer LSTM | 60.99% | 64.68% |
| 11 | 1 Layer LSTM-BN | 64.07% | 71.86% |
| 12 | 1 Layer LSTM-BN-DP | 64.69% | 73.48% |
| 13 | 1 Layer GRU-BN-DP | 65.21% | 70.36% |
| 14 | 1 Layer BI-GRU-BN-DP | 64.78% | 73.12% |
| 15 | 2 Layer BI-GRU-BN-DP | 66.21% | 72.46% |
| 16 | 2 Layer BI-GRU-BN-DP-H | 70.70% | 80.23% |
| 17 | 3D-CNN [8] | 79.75% | 83.95% |
| 18 | Decision Fusion | 82.05% | 86.68% |
| 19 | Feature Fusion | 83.74% | 93.65% |

The addition of the extra fully-connected layer brings another significant improvement. This increases the complexity of the neural network, which helps the model capture more inherent features from the 3D skeleton data [69]. The recurrent layers before the fully-connected layer can be seen as a temporal feature extractor, which compacts the input information (dimension 300 × 150) into 600 dimensions. The latter part of the RNN structure can be considered a classifier that learns to map these 600-dimensional features to 60 different action categories. Altogether, our novel RNN model, 2 Layer BI-GRU-BN-DP-H, outperforms all the other skeleton-based models including ST-LSTM (Tree traversal) + Trust Gate [4].
Then, we use the RGB video data to train the 3D-CNN model. We use the voting method based on confidence to fuse the 2 Layer BI-GRU-BN-DP-H and 3D-CNN models. In the next step, we utilize a linear SVM [8] to fuse the fc6 features from the CNN and the fc features from the RNN. This further improves results by over 13% in comparison to our best RNN, and by more than 14% compared to literature models [2, 4]. This boost is due to the features from the RNN and CNN models being highly complementary. The RNN model uses 50 3D coordinates for two human bodies over 300 time steps and learns to find the long-term motion pattern. The CNN model, in contrast, has 2D RGB frames, which additionally carry spatio-temporal information about objects, such as cups, pens, and books. However, the CNN model can only memorize information over 16 time steps; longer memorization is prohibited by GPU memory limitations. These facts make the features from the RNN and CNN models highly complementary, as the testing results in rows 16, 17, and 19 of Tab. I show.
III-D Discussion
To better analyze and improve the performance of the model, we take a closer look at actions that are highly confusing to the two-stream RNN/CNN structure. As presented in Fig. 6 and 7, such action pairs include reading vs. writing, putting on a shoe vs. taking off a shoe, and rubbing two hands vs. clapping. These actions are shown in a video at https://www.youtube.com/watch?v=G0PXKCEgIoA. Fig. 8 shows some classified action samples.
There could be several reasons for this observation. First, these actions are sometimes inherently confusing. Secondly, there are flaws in the data. Kinect depth information, from which the NTU skeleton data is created, is quite noisy [70, 71]. Correspondingly, the 3D skeleton data used in our RNNs are also quite noisy [4]. RGB video data are more accurate and stable, but single frames carry no 3D information. Thirdly, the 3D-CNN model [8] is trained with small video clips, which are 16 time steps long. The CNN model is therefore adapted to find only short-term temporal features in these clips. As GPU memory and computing power increase, the model could also be adapted to find long-term temporal features in each whole video. Lastly, although the RNN model can memorize whole action sequences and give final predictions, it has no information about the appearances and movements of surrounding objects, which could be discriminative information for the classification task.
IV Conclusion and Future Work
In this paper, we propose a novel RNN structure for 3D skeletons that achieves state-of-the-art performance on the largest 3D action recognition dataset. The proposed RNN model can also be trained 13 times faster and saves 20% computational power on each training step. Additionally, the RGB videos from the same dataset are used to fine-tune a 3D-CNN model. In the end, an efficient fusion structure, two-stream RNN/CNN, is introduced to fuse the capabilities of both RNN and CNN models. The results of this method are 13% higher than using the proposed RNN alone, and 14% higher than the best published result in the literature. In the future, we want to consider using the other sensor modalities such as depth maps and IR sequences and to determine the best architecture to fuse all these modalities.
References
 [1] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2010, pp. 9–14.
 [2] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A large scale dataset for 3D human activity analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [3] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
 [4] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatiotemporal LSTM with trust gates for 3d human action recognition,” arXiv preprint arXiv:1607.07043, 2016.
 [5] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, “Cooccurrence feature learning for skeleton based action recognition using regularized deep lstm networks,” arXiv preprint arXiv:1603.07772, 2016.
 [6] K. Wang, X. Wang, L. Lin, M. Wang, and W. Zuo, “3d human activity recognition with reconfigurable convolutional neural networks,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 97–106.
 [7] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, “Action recognition from depth maps using deep convolutional neural networks,” 2015.
 [8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 4489–4497.
 [9] G. Evangelidis, G. Singh, and R. Horaud, “Skeletal quads: Human action recognition using joint quadruples,” in International Conference on Pattern Recognition, 2014, pp. 4513–4518.
 [10] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3D skeletons as points in a lie group,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
 [11] J.F. Hu, W.S. Zheng, J. Lai, and J. Zhang, “Jointly learning heterogeneous features for RGBD activity recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5344–5352.
 [12] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, “Histogram of oriented principal components for crossview action recognition,” 2016.
 [13] S. Gaglio, G. L. Re, and M. Morana, “Human activity recognition process using 3d posture data,” IEEE Transactions on HumanMachine Systems, vol. 45, no. 5, pp. 586–597, 2015.
 [14] C. Chen, K. Liu, and N. Kehtarnavaz, “Realtime human action recognition based on depth motion maps,” Journal of realtime image processing, pp. 1–9, 2013.
 [15] B. Ni, G. Wang, and P. Moulin, “Rgbdhudaact: A colordepth video database for human daily activity recognition,” in Consumer Depth Cameras for Computer Vision. Springer, 2013, pp. 193–208.
 [16] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Unstructured human activity detection from rgbd images,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 842–849.
 [17] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1290–1297.
 [18] L. Xia, C.C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2012, pp. 20–27.
 [19] V. Bloom, V. Argyriou, and D. Makris, “Dynamic feature selection for online action recognition,” in International Workshop on Human Behavior Understanding. Springer, 2013, pp. 64–76.
 [20] Y.C. Lin, M.C. Hu, W.H. Cheng, Y.H. Hsieh, and H.M. Chen, “Human action recognition and retrieval using sole depth information,” in Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012, pp. 1053–1056.
 [21] C. Zhang, Y. Tian, and E. Capezuti, “Privacy preserving automatic fall detection for elderly using rgbd cameras,” in International Conference on Computers for Handicapped Persons. Springer, 2012, pp. 625–633.
 [22] O. Oreifej and Z. Liu, “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 716–723.
 [23] H. S. Koppula, R. Gupta, and A. Saxena, “Learning human activities and object affordances from rgbd videos,” The International Journal of Robotics Research, vol. 32, no. 8, pp. 951–970, 2013.
 [24] F. Negin, F. Özdemir, C. B. Akgül, K. A. Yüksel, and A. Erçil, “A decision forest based feature selection framework for action recognition from rgbdepth cameras,” in International Conference Image Analysis and Recognition. Springer, 2013, pp. 648–657.
 [25] P. Wei, N. Zheng, Y. Zhao, and S.C. Zhu, “Concurrent action detection with structural prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3136–3143.
 [26] M. Munaro, S. Michieletto, and E. Menegatti, “An evaluation of 3d motion flow and 3d pose estimation for human action recognition,” in RSS Workshops: RGBD: Advanced Reasoning with Depth Cameras, 2013.
 [27] C. Ellis, S. Z. Masood, M. F. Tappen, J. J. Laviola Jr, and R. Sukthankar, “Exploring the tradeoff between accuracy and observational latency in action recognition,” International Journal of Computer Vision, vol. 101, no. 3, pp. 420–436, 2013.
 [28] A. Mansur, Y. Makihara, and Y. Yagi, “Inverse dynamics for action recognition,” IEEE transactions on cybernetics, vol. 43, no. 4, pp. 1226–1236, 2013.
 [29] Z. Yang, L. Zicheng, and C. Hong, “Rgbdepth feature for 3d human activity recognition,” China Communications, vol. 10, no. 7, pp. 93–103, 2013.
 [30] V. Carletti, P. Foggia, G. Percannella, A. Saggese, and M. Vento, “Recognition of human actions from rgbd videos using a reject option,” in International Conference on Image Analysis and Processing. Springer, 2013, pp. 436–445.
 [31] D. Kastaniotis, I. Theodorakopoulos, G. Economou, and S. Fotopoulos, “Gaitbased gender recognition using pose information for real time applications,” in Digital Signal Processing (DSP), 2013 18th International Conference on. IEEE, 2013, pp. 1–6.
 [32] A.A. Liu, W.Z. Nie, Y.T. Su, L. Ma, T. Hao, and Z.X. Yang, “Coupled hidden conditional random fields for rgbd human action recognition,” Signal Processing, vol. 112, pp. 74–82, 2015.
 [33] D. Huang, S. Yao, Y. Wang, and F. De La Torre, “Sequential maxmargin event detectors,” in European conference on computer vision. Springer, 2014, pp. 410–424.
 [34] I. Lillo, A. Soto, and J. Carlos Niebles, “Discriminative hierarchical modeling of spatiotemporally composable human activities,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 812–819.
 [35] G. Yu, Z. Liu, and J. Yuan, “Discriminative orderlet mining for real-time recognition of human-object interaction,” in Asian Conference on Computer Vision. Springer, 2014, pp. 50–65.
 [36] C. Wu, J. Zhang, S. Savarese, and A. Saxena, “Watch-n-patch: Unsupervised understanding of actions and relations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4362–4370.
 [37] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, “Exemplar-based recognition of human–object interactions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647–660, 2016.
 [38] C. Chen, R. Jafari, and N. Kehtarnavaz, “UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 168–172.
 [39] Z. Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, “Human daily action analysis with multi-view and color-depth data,” in European Conference on Computer Vision. Springer, 2012, pp. 52–61.
 [40] Z. Zhang, W. Liu, V. Metsis, and V. Athitsos, “A viewpoint-independent statistical method for fall detection,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3626–3630.
 [41] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Berkeley mhad: A comprehensive multimodal human action database,” in Applications of Computer Vision (WACV), 2013 IEEE Workshop on. IEEE, 2013, pp. 53–60.
 [42] S. M. Amiri, M. T. Pourazad, P. Nasiopoulos, and V. C. Leung, “Non-intrusive human activity monitoring in a smart home environment,” in e-Health Networking, Applications & Services (Healthcom), 2013 IEEE 15th International Conference on. IEEE, 2013, pp. 606–610.
 [43] P. Wei, Y. Zhao, N. Zheng, and S.-C. Zhu, “Modeling 4d human-object interactions for event and object recognition,” in 2013 IEEE International Conference on Computer Vision. IEEE, 2013, pp. 3272–3279.
 [44] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, “Cross-view action modeling, learning and recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2649–2656.
 [45] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, “Hopc: Histogram of oriented principal components of 3d pointclouds for action recognition,” in European Conference on Computer Vision. Springer, 2014, pp. 742–757.
 [46] A.-A. Liu, Y.-T. Su, P.-P. Jia, Z. Gao, T. Hao, and Z.-X. Yang, “Multiple/single-view human action recognition via part-induced multitask structural learning,” IEEE transactions on cybernetics, vol. 45, no. 6, pp. 1194–1208, 2015.
 [47] Y. Song, J. Tang, F. Liu, and S. Yan, “Body surface context: A new robust feature for action recognition from depth videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 952–964, 2014.
 [48] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person interaction detection using body-pose features and multiple instance learning,” in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2012, pp. 28–35.
 [49] T. Hu, X. Zhu, W. Guo, and K. Su, “Efficient interaction recognition through positive action representation,” Mathematical Problems in Engineering, vol. 2013, 2013.
 [50] C. Wolf, E. Lombardi, J. Mille, O. Celiktutan, M. Jiu, E. Dogan, G. Eren, M. Baccouche, E. Dellandréa, C.-E. Bichot et al., “Evaluation of video activity localizations integrating quality and quantity measurements,” Computer Vision and Image Understanding, vol. 127, pp. 14–30, 2014.
 [51] V. Bloom, V. Argyriou, and D. Makris, “G3di: A gaming interaction dataset with a real time detection and evaluation framework,” in Workshop at the European Conference on Computer Vision. Springer, 2014, pp. 698–712.
 [52] C. Van Gemeren, R. T. Tan, R. Poppe, and R. C. Veltkamp, “Dyadic interaction detection from pose and flow,” in International Workshop on Human Behavior Understanding. Springer, 2014, pp. 101–115.
 [53] L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, and L. Zhang, “A deep structured model with radius–margin bound for 3d human activity recognition,” International Journal of Computer Vision, pp. 1–18, 2015.
 [54] P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. Ogunbona, “Convnets-based action recognition from depth maps through virtual cameras and pseudo-coloring,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 1119–1122.
 [55] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [56] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [57] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
 [58] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
 [59] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
 [60] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
 [61] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
 [62] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
 [63] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
 [64] A. Damien et al., “Tflearn,” https://github.com/tflearn/tflearn, 2016.
 [65] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
 [66] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
 [67] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [68] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
 [69] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of control, signals and systems, vol. 2, no. 4, pp. 303–314, 1989.
 [70] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
 [71] T. Mallick, P. P. Das, and A. K. Majumdar, “Characterizations of noise in kinect depth images: a review,” IEEE Sensors Journal, vol. 14, no. 6, pp. 1731–1740, 2014.