Sliding Dictionary Based Sparse Representation For Action Recognition
Abstract
The task of action recognition has been at the forefront of research, given its applications in gaming, surveillance and health care. In this work, we propose a simple, yet very effective approach which works seamlessly for both offline and online action recognition using the skeletal joints. We construct a sliding dictionary which contains the training data along with their time stamps. This is used to compute the sparse coefficients of the input action sequence, which is divided into overlapping windows, and each window gives a probability score for each action class. In addition, we compute another simple feature, which calibrates each of the action sequences to the training sequences, and models the deviation of the action from each of the training sequences. Finally, a score-level fusion of the two heterogeneous but complementary features is obtained for each window, and the scores of the available windows are successively combined to give the confidence scores of each action class. This way of combining the scores makes the approach suitable for scenarios where only part of the sequence is available. Extensive experimental evaluation on three publicly available datasets shows the effectiveness of the proposed approach for both offline and online action recognition tasks.
1 Introduction
There has been considerable interest in the computer vision community in recognizing human actions from 3D data ever since the inception of the Kinect sensor. Kinect provides multichannel data which has paved unprecedented avenues for problems of action recognition in particular, and computer vision in general. This interest has grown manifold after the work of Shotton et al. [14], which estimates the 3D joint locations of humans in real time from a single depth image. Although the joint positions obtained from Kinect can be noisy, especially in the presence of partial occlusions, the relative simplicity and compact representation offered by 3D skeletal joints have prompted researchers to exploit the advantages arising from this compactness. Since then, there has been an increase in research on action recognition by modeling 3D skeletal joints, and recent advances [7][15] have indicated that 3D skeletal joints (Figure 1) are a better, simple yet efficient way of representing human actions. Online action recognition, where we may be required to recognize the action from partial data, also finds wide applications in user interfaces and in gaming [8].
Here, we propose a simple, yet very effective approach which works seamlessly for both offline and online action recognition using the 3D skeletal joints. The proposed approach is based on learning sparse representations over a sliding dictionary which is constructed from the training data utilizing their time stamps. Given an input action sequence, it is divided into several overlapping windows, and for each window, we compute its probability of belonging to the different action classes. The reconstruction error for each class, computed with the sparse coefficients, is used to compute this probability. In addition, we compute another simple feature, which calibrates each of the action sequences to the training sequences, and models the deviation of the action from each of the training sequences. The combined score for each window is computed using a score-level fusion of these two features, and the scores for all the available windows are successively combined to give the probability of each action class. Extensive experimental evaluation on three publicly available datasets, namely the UTD-MHAD, UT Kinect and MSRC-12 datasets, and comparisons with the state-of-the-art show the effectiveness of the proposed approach for both offline and online action recognition tasks. The main contributions of the proposed work are as follows:

We propose a sliding dictionary based sparse representation framework for action recognition.

The approach works seamlessly for both offline and online action recognition.

The approach is simple, yet very effective, as demonstrated by results on three datasets.
2 Related Work
A diverse set of approaches to action recognition using 3D skeletal joints exists in the literature. Yang and Tian [19] use a Naive-Bayes Nearest Neighbour classifier on an offset feature generated by taking the difference of joint positions between specific frames. The feature space is scaled down to a lower dimension using PCA. Xia et al. [18] model postures by projecting the 3D joints to the 3D bins of a histogram (HOJ3D), and a Hidden Markov Model is used for classification. Wang et al. [17] employ a multiple kernel learning method to extract the most informative 3D joint pairs, and each of these joint pairs is modeled according to the relative positions of the joints with respect to each other. Vemulapalli et al. [15] use dynamic time warping to account for rate variations. They model human actions as curves in a Lie group, and a Fourier Temporal Pyramid is employed to handle the temporal misalignment, after warping the curves obtained for each class to its nominal curve. A one-versus-all linear SVM is used for classification. This method gives good results, but the numerical and computational complexities arising from this way of modeling make it less feasible. Hussein et al. [7] build the covariance of 3D joints (Cov3DJ) descriptor and compute Cov3DJ over a temporal hierarchy to account for the order of motion in time.
Zhu et al. [22] perform a feature-level fusion of spatiotemporal features and 3D joint features using random forests. Bloom et al. [2] extract multiple types of features from 3D joints which can be computed in real time: pairwise joint differences, joint velocities with respect to different frames and their magnitudes, and joint angles between three joints. These features obtained from all the joints are concatenated to get a single feature vector. We refer the reader to [1] for a detailed discussion of action recognition using 3D skeletal joints. Kviatkovsky et al. [8] use the covariance descriptor for action recognition by extending it to the spatiotemporal domain. It is extended to online action recognition by creating a buffer of features extracted using the covariance descriptor. The buffer is updated when a new frame is added, and an on-demand nearest neighbour classifier is used for classification. A hierarchy of bio-inspired multiple skeletal configurations is used in [3], such that each of the configurations represents the motion of a set of joints at a particular temporal scale. These skeletal configurations are modeled as Linear Dynamic Systems. Li et al. [9] build an action graph of sampled 3D points from depth maps, such that each node in the action graph represents a posture common to the set of actions to be classified. A Gaussian Mixture Model is used to model the distribution of the sampled 3D points. In [11], the most informative 3D joints in an action sequence are extracted, where the information is a measure of the variance of joint angles in the time series. An action is represented as a sequence of these sampled informative 3D joints, and an SVM is used for classification. Ye et al. [20] overview different action recognition approaches related to skeletal representations and depth maps, and compare the performance of different algorithms on various standard datasets in the literature.
3 Proposed Approach
Here, we describe the proposed approach for the task of action recognition. First, we describe the sliding window based sparse representation framework, followed by the difference of 3D joints feature and the score-level fusion of the two features.
3.1 Sliding Dictionary Based Sparse Representation
To make the proposed approach applicable for both offline and online action recognition, we propose a sliding dictionary based sparse representation of the input action sequence. Let $C$ be the number of different action classes and $n_c$, $c = 1, \ldots, C$, be the number of training sequences of class $c$. Let $N_w$ denote the total number of overlapping windows for each action sequence. The number of frames in each window may be different and depends on the total number of frames in that sequence. This is for offline action recognition, when we know the total number of frames a priori. Let $D_w^c$ denote the collection of all the feature vectors for class $c$ and window $w$, given as $D_w^c = [x_{w,1}^c, \ldots, x_{w,n_c}^c]$. Here $x_{w,i}^c$ denotes the feature vector of the $i$-th training example of class $c$ for window $w$. Thus the combined data for all the classes for window $w$ will be denoted by
$D_w = [D_w^1, D_w^2, \ldots, D_w^C]$   (1)
The complete dictionary $D$ is constructed from the training feature vectors of all classes and all the windows as follows:
$D = [D_1, D_2, \ldots, D_{N_w}]$   (2)
So the dictionary consists of all the features computed from all the training action sequences which are time stamped using their window index.
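As a concrete sketch, the per-window, per-class assembly described above might look as follows in NumPy (the array layout, random placeholder features and variable names are our own illustration, not the authors' code):

```python
import numpy as np

# Toy setup: 2 classes, 3 training sequences each, 4-dim features, 5 windows.
# feats[c, i, w] is the feature vector of training sequence i of class c
# for window w (random placeholders here).
rng = np.random.default_rng(0)
C, n_train, n_windows, d = 2, 3, 5, 4
feats = rng.standard_normal((C, n_train, n_windows, d))

def window_dictionary(feats, w):
    """D_w: feature vectors of all classes for window w, stacked as columns."""
    return np.hstack([feats[c, :, w, :].T for c in range(feats.shape[0])])

# Complete dictionary: per-window blocks side by side, so each atom keeps
# its window index (its time stamp) implicitly through its position.
D = np.hstack([window_dictionary(feats, w) for w in range(n_windows)])
```

Keeping the atoms grouped by window index is what allows the sliding selection described next.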
Given a test action sequence, it is similarly divided into overlapping windows, and the feature vector $x_w$ corresponding to each window $w$ is computed. For computing the corresponding sparse coefficients, instead of using the whole dictionary, we use a sliding dictionary based on the window index of the input sequence. For example, for computing the sparse coefficient for $x_w$, the dictionary used is as follows
$\tilde{D}_w = [D_{w-\delta}, \ldots, D_w, \ldots, D_{w+\delta}]$   (3)
i.e. the dictionary elements corresponding to window $w$ and the $\delta$ windows before and after it are used. This sliding window ensures that the temporal evolution of the sequence is maintained, i.e. the initial part of the test sequence is not matched with the last part of a training sequence. Multiple windows of the training sequences are considered in the dictionary to handle the temporal misalignments and rate variations in the action sequences. The corresponding sparse coefficient $\alpha_w$ is obtained by solving the following
$\hat{\alpha}_w = \arg\min_{\alpha} \frac{1}{2} \|x_w - \tilde{D}_w \alpha\|_2^2 + \lambda \|\alpha\|_1$   (4)
Here, we use the standard sparse coding solver SPAMS [10] to solve for the sparse coefficient $\hat{\alpha}_w$.
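A minimal sketch of the sliding-dictionary selection and the sparse coding step is given below. We substitute a simple ISTA iteration for the SPAMS lasso solver mentioned in the text; the window-indexed storage and parameter names are our assumptions:

```python
import numpy as np

def sliding_dictionary(window_dicts, w, delta=1):
    """Select the dictionary for test window w: the training atoms of
    window w plus delta windows before and after, clipped at the ends.
    window_dicts[v] is a (d, n_atoms) array of atoms for window v."""
    lo, hi = max(0, w - delta), min(len(window_dicts) - 1, w + delta)
    return np.hstack(window_dicts[lo:hi + 1])

def ista_lasso(D, x, lam=0.1, n_iter=500):
    """Solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1 by iterative
    soft-thresholding (a simple stand-in for the SPAMS solver)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L          # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return a

# Usage: with an orthonormal toy dictionary, the lasso solution is a
# soft-thresholded copy of x.
alpha = ista_lasso(np.eye(3), np.array([1.0, 0.0, 0.0]), lam=0.1)
```

In practice any lasso solver (SPAMS, scikit-learn, etc.) can replace `ista_lasso`; only the objective in Eq. (4) matters.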
For window $w$, we compute the probability of the test window belonging to each of the action classes using the reconstruction error. Let $\tilde{D}_w^c$ be the matrix obtained from $\tilde{D}_w$ corresponding to the dictionary atoms belonging to action class $c$, and let $\hat{\alpha}_w^c$ be the corresponding sparse coefficients obtained from $\hat{\alpha}_w$ for the same class. The reconstruction of $x_w$ using the dictionary atoms corresponding to only class $c$ is given by $\hat{x}_w^c = \tilde{D}_w^c \hat{\alpha}_w^c$. The reconstruction error for window $w$ for action class $c$ is computed by taking the Euclidean distance between the reconstructed feature and the original feature as
$e_w(c) = \|x_w - \tilde{D}_w^c \hat{\alpha}_w^c\|_2$   (5)
A lower reconstruction error for class $c$ implies that the window is more likely to belong to that class than to classes for which the reconstruction error is high. The probability that window $w$ of the test sequence belongs to action class $c$ using the sliding dictionary is thus given by
$P_s(c \mid w) = \frac{\exp(-e_w(c))}{\sum_{c'=1}^{C} \exp(-e_w(c'))}$   (6)
The reason behind computing the probability of all classes instead of assigning it to a particular class is that a small part of a sequence may appear similar to many action classes. For example, if we look at just the initial part of a walking and jogging sequence, they may appear very similar, which means that only a small segment is not sufficient to infer the class label.
But as we see more and more windows, the probability of the action belonging to the correct class will increase and that of the incorrect classes will decrease.
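The class-wise reconstruction errors and the resulting per-window probabilities can be sketched as below; the softmax over negative errors is one plausible normalization, and the atom-labeling scheme is our assumption:

```python
import numpy as np

def window_class_probs(x, D, alpha, atom_class):
    """Probability of a test window belonging to each class, from the
    per-class reconstruction error of the sparse code. atom_class[j] is
    the class label of dictionary column j."""
    classes = np.unique(atom_class)
    errs = np.array([
        np.linalg.norm(x - D[:, atom_class == c] @ alpha[atom_class == c])
        for c in classes
    ])
    p = np.exp(-errs)              # lower error -> higher probability
    return classes, p / p.sum()

# Toy example: x is explained perfectly by the class-0 atom, so class 0
# gets the higher probability.
D = np.eye(2)
cls, p = window_class_probs(np.array([1.0, 0.0]), D,
                            np.array([1.0, 0.0]), np.array([0, 1]))
```

Because every class receives a nonzero probability, ambiguous early windows (e.g. walk vs. jog) keep several hypotheses alive until later windows disambiguate them.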
Feature Used: Since the framework presented above is general, any appropriate feature can be used as the input.
In this work, we have used covariance descriptor of the 3D skeletal joints as the input feature.
Let $T$ be the number of frames in window $w$ and $J$ be the number of joints.
Let $F$ be a $T \times 3J$ dimensional matrix such that row $t$ is the vector of the 3D coordinates of all the joints of frame $t$.
The covariance of the window is calculated as:
$C_w = \frac{1}{T-1} \sum_{t=1}^{T} (f_t - \bar{f})(f_t - \bar{f})^{\top}$   (7)
where $f_t$ denotes the $t$-th row of $F$, $\bar{f}$ is the sample mean of the rows of $F$, and $\top$ is the transpose operator. $C_w$ is a symmetric matrix by definition. Hence, only the upper triangular part is taken and concatenated to get a single feature vector of dimension $3J(3J+1)/2$, which is the covariance descriptor [7]. The descriptor is finally normalized to have unit norm.
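The covariance descriptor computation for one window can be sketched as follows (the dimensions follow the description above; the code itself is our illustration):

```python
import numpy as np

def cov3dj(frames):
    """Covariance-of-3D-joints descriptor for one window.
    frames: (T, 3J) array, one row of stacked joint coordinates per frame.
    Returns the unit-norm upper-triangular part, length 3J*(3J+1)/2."""
    C = np.cov(frames, rowvar=False)         # (3J, 3J) sample covariance
    f = C[np.triu_indices(C.shape[0])]       # symmetric: keep upper triangle
    return f / np.linalg.norm(f)             # unit-norm feature vector

# Usage with random placeholder data: 12 frames, 20 joints (as in Kinect).
rng = np.random.default_rng(1)
desc = cov3dj(rng.standard_normal((12, 3 * 20)))
```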
3.2 Difference of 3D Joints Feature
In this work, we augment the sparse representation based feature with another very simple feature, the difference of 3D joints. We show that the combination of these two simple features is very effective for the task of action recognition and compares well with the state-of-the-art on several available datasets. For this feature also, given an input video sequence, we divide it into overlapping segments as done for the previous feature. This is done for both the training sequences and the testing ones. For this feature, the probability of a test window belonging to a given action class is computed from the difference of the joint locations of the test window with respect to all the training sequences of that class. For a test sequence, to compute its distance from a particular training sequence, we first compute the baseline difference, which essentially aligns the two sequences in terms of their joint locations in the first frame. Let $f_{te}^{(1)}$ and $f_{tr}^{(1)}$ refer to the first frames of the testing and training sequences respectively, i.e. the neutral pose; then this baseline difference is given by
$b = f_{te}^{(1)} - f_{tr}^{(1)}$   (8)
Now, for window $w$ of the test sequence, we consider the corresponding windows of the training sequences. We compute the differences between the joint locations of the frames in the test sequence window and those of all the frames in the selected window of the training sequence. The distance between a test frame $f_{te}$ of window $w$ and a frame $f_{tr}$ in the chosen window of the training sequence (where $f_{tr}$ is the vector of joint locations of that training frame) is computed as
$d(f_{te}, f_{tr}) = \|(f_{te} - f_{tr}) - b\|_2$   (9)
For each action class, the least distances from the training sequences of that class are accumulated, and their mean is computed, which is denoted by the score $s_w(c)$. This is used to further compute the probability of window $w$ of the test sequence belonging to class $c$, i.e.
$P_d(c \mid w) = \frac{\exp(-s_w(c))}{\sum_{c'=1}^{C} \exp(-s_w(c'))}$   (10)
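The calibration and frame-wise distance just described might be implemented as follows; the choice of k (how many least distances are kept) and the exact aggregation are our assumptions:

```python
import numpy as np

def diff_joints_score(test_win, train_win, test_first, train_first, k=3):
    """Mean of the k smallest frame-to-frame distances between a test
    window and one training window, after subtracting the first-frame
    (neutral-pose) baseline. Windows are (T, 3J); poses are (3J,)."""
    baseline = test_first - train_first          # first-frame calibration
    dists = np.array([np.linalg.norm((t - r) - baseline)
                      for t in test_win for r in train_win])
    return np.sort(dists)[:min(k, dists.size)].mean()

# Sanity check: a sequence shifted by a constant pose offset scores ~0,
# since the baseline subtraction removes the offset.
rng = np.random.default_rng(2)
train = rng.standard_normal((4, 6))
score = diff_joints_score(train + 1.0, train, train[0] + 1.0, train[0])
```

A lower score means the test window deviates less from that class, which Eq. (10) then turns into a higher class probability.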
3.3 Score Level Fusion
We perform score-level fusion by combining the probabilities of the sparse representation feature and the difference of 3D joints feature. The final score for window $w$ is then obtained as follows
$S_w(c) = \beta_1 P_s(c \mid w) + \beta_2 P_d(c \mid w)$   (11)
such that $\beta_1 + \beta_2 = 1$. Here $\beta_1$ and $\beta_2$ are parameters used to obtain a trade-off between the two scores. $S_w(c)$ is the confidence score that window $w$ of the test sequence belongs to action class $c$. Suppose $W$ windows of the test sequence are available (for online recognition with partial data, $W$ may be less than the total number of windows of the entire test sequence). Then the final confidence score for all the action classes is computed, and the predicted action class is given as follows:
$\hat{c} = \arg\max_{c} \sum_{w=1}^{W} S_w(c)$   (12)
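A sketch of the fusion and the accumulated decision rule follows; treating the fusion as a convex combination is our reading of the text, and the weight value is an assumption:

```python
import numpy as np

def predict_action(p_sparse, p_diff, beta=0.5):
    """Fuse the two per-window probability scores and pick the class with
    the highest confidence accumulated over the W available windows.
    p_sparse, p_diff: (W, C) arrays of window-level class probabilities;
    beta in [0, 1] weights the sparse-representation score."""
    fused = beta * p_sparse + (1.0 - beta) * p_diff   # per-window fusion
    confidence = fused.sum(axis=0)                    # accumulate windows
    return int(np.argmax(confidence)), confidence

# Two windows, two classes: class 0 accumulates the larger confidence.
pred, conf = predict_action(np.array([[0.7, 0.3], [0.6, 0.4]]),
                            np.array([[0.2, 0.8], [0.9, 0.1]]))
```

Because the accumulation runs over whatever windows exist so far, the same rule serves both the offline case (all windows) and the online case (a prefix of windows).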
4 Experimental Evaluation
We now perform an extensive evaluation of the proposed approach on three publicly available datasets: the UTD-MHAD, UT Kinect Action and MSRC-12 Gesture datasets.
4.1 UT Kinect Action Dataset
The UT Kinect dataset [18] consists of 10 subjects performing 10 actions, and each subject performs every action two times. This is a challenging dataset due to high variations in actions of the same class. Subjects 1, 3, 5, 7, 9 are used for training and subjects 2, 4, 6, 8, 10 for testing, similar to the cross-subject setup of [22]. The confusion matrix for this dataset is shown in Figure 3. We see that except for three actions, the approach performs perfectly for all the other actions. Table 1 reports the results of the proposed approach and the other state-of-the-art approaches for action recognition. The other results are directly taken from [15]. We observe that the recognition accuracy achieved by the proposed approach is comparable with the state-of-the-art [15] and significantly higher than all the other recently proposed approaches.
4.2 UT Dallas Multimodal Dataset
The UTD-MHAD dataset [5] is a recent dataset gathered using both a Microsoft Kinect sensor and a wearable inertial sensor. The dataset consists of 27 actions performed by 8 subjects (4 females and 4 males). Each action is performed by each subject 4 times. Out of the total of 864 sequences, we remove the corrupt sequences as in [5] and use only the skeletal positions of the remaining 861 sequences. It is a comparatively difficult dataset as a large number of actions are pooled together.
We employ the experimental protocol of [5], where half of the subjects (odd numbered) are used for training and the other half (even numbered) for testing. Table 2 shows the results obtained using the proposed approach along with comparisons with the state-of-the-art. Note that the result obtained using the collaborative representation classifier method is 79.1% using both Kinect and inertial data; however, using only the Kinect data, it was found to be 66.1%. Also, the Local Binary Pattern method operates on depth maps, which are less noisy compared to skeletal data. We see that the proposed approach outperforms both the other approaches and gives the best result for this dataset.
4.3 MSRC-12 Gesture Dataset
The MSRC-12 Kinect gesture dataset [6] contains sequences of human movements, represented as body-part locations, and the associated gesture is to be recognized by the system. The dataset consists of 594 sequences and a total of 6,244 gesture instances. The motion files contain tracks of 20 joints estimated using the Kinect pose estimation pipeline. The body poses are captured at a sample rate of 30 Hz with an accuracy of about two centimeters in the joint positions. This is a very large dataset and is a good test of the scalability of the proposed approach.
We have used the test setup of [7], where half of the subjects are used for training and the other half for testing. The experiment was repeated 20 times, each time taking half of the subjects at random, and we report the average over all 20 iterations. The results of the proposed approach and the comparison with the approach in [7] are shown in Table 3. We see that for this dataset also, the proposed approach performs better than the state-of-the-art result.
Method                      | Accuracy (%)
Covariance descriptors [7]  | 91.7
Proposed Approach           | 92.89
4.4 Online Action Recognition
The above experiments were performed using the whole action sequence and compared with the state-of-the-art in the offline scenario. Since the proposed approach relies on an incremental update of the probability of each action class with the availability of each window, it adapts seamlessly to online action recognition as well. In this work, for both online and offline action recognition, we assume the action has already been detected from an unsegmented video using a suitable detection approach as in [13], [21], [6]. This detection step would provide the action points on the unsegmented sequence and the beginning and ending points of an action. All further computations required for the recognition are carried out by our algorithm, even when only partial data is available.
The main difference between the online and offline versions of action recognition is that in the online scenario, only a partial number of frames is available, and the recognition has to be done using this incomplete information. The total number of frames in the sequence is also not known a priori, unlike in the offline scenario. In this work, we evaluate the usefulness of the proposed algorithm for online action recognition on the offline datasets themselves, UT Kinect and UT Dallas, under the constraint that only partial data is available. Since the total number of frames is not known beforehand, we perform a frame-level computation of the probabilities. The training sequences are divided into overlapping windows as before, but for the testing sequence, we consider windows of variable length. For each frame, we take the maximum probability among all the windows which have that particular frame as the middle one to be the probability of that frame belonging to a particular action class. The score-level fusion is performed as in the offline scenario. As more and more frames become available, the confidence score keeps accumulating, increasing for the correct class and decreasing for the incorrect classes, until the detection algorithm gives the 'end of action' signal. We achieve a recognition accuracy of 88.89% on the UT Kinect dataset and 79.07% on the UTD dataset using the covariance descriptor as the feature for the dictionary. The online scenario on these datasets is challenging due to the fact that only partial data is available. We observe that the recognition accuracy decreases only marginally from the offline to the online scenario, demonstrating the seamless adaptability of the proposed approach.
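The frame-level rule described above (take the maximum over all variable-length windows centred on a frame) can be sketched as follows; the data layout is our assumption:

```python
import numpy as np

def frame_probs(window_probs, centers, frame):
    """Online frame-level score: element-wise maximum of the class
    probabilities of every window whose middle frame is `frame`.
    window_probs: (W, C) per-window class probabilities;
    centers: (W,) middle-frame index of each window.
    Returns None if no window is centred on the frame yet."""
    mask = np.asarray(centers) == frame
    return window_probs[mask].max(axis=0) if mask.any() else None

# Three variable-length windows over two classes; windows 0 and 1 are
# both centred on frame 3, so frame 3 takes their element-wise maximum.
probs = np.array([[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]])
p3 = frame_probs(probs, [3, 3, 4], frame=3)
```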
Since the proposed approach is based on an incremental update of confidence scores, we first perform an experiment to see how the scores get updated for some example actions of the UT Kinect Action dataset. Figure 4 shows how the scores are updated as a function of the number of sliding windows for two test actions, walk (left) and sit down (right).
Two noteworthy observations can be made from the above mentioned plots:

As more frames become available, the confidence (score) of the correct action class increases as compared to the incorrect action classes.

Depending on the relative pace at which the score of the correct class outgrows those of the incorrect classes, a partial number of frames may be sufficient to get the correct class.
We note from the experimental results that even with very few frames, the approach predicts the correct action class with reasonable accuracy, signifying its applicability to the online action recognition task. We report results on the UT Kinect Action dataset, but we have observed similar behaviour for the other datasets as well.
We perform another experiment to observe the number of frames required to get a reasonable accuracy for action classification. We consider an increasing number of frames and plot the recognition accuracy. Figure 5 shows the recognition accuracies on the UT Kinect dataset as a function of the fraction of available frames out of the total number of frames in the whole video sequence. We observe that well before all the frames are available, the recognition accuracy reaches close to the highest accuracy obtained by the proposed algorithm, and a reasonable accuracy is obtained with only a small fraction of the frames. We observed similar performance for the other datasets. This justifies the usefulness of the proposed approach for online action recognition.
5 Conclusion and Future Work
In this paper, we have presented an algorithm which employs the covariance of 3D joints descriptor to construct a sliding dictionary. This dictionary is designed such that temporal variations are accounted for. We have also introduced a frame-level difference of skeletal joints feature, which calibrates the test action to the training sequences. A score-level fusion of the two scores gives the final confidence score for each action class. Extensive experiments on different datasets show that, despite being simple, the proposed approach is very effective for both online and offline action recognition.
The proposed framework is general, and more suitable features can be seamlessly added to improve the recognition accuracy; this will be one direction of our future work. In future, we would also like to use dynamic time warping to better handle temporal misalignments, which can result in a further boost in accuracy. Another direction of future work is to extend the approach to actions with lateral shifts, e.g. the same action performed once with the right hand and once with the left. We would like to generate synthetic training data to address this problem as part of our future work.
References
 [1] J. Aggarwal and L. Xia. Human activity recognition from 3D data: A review. Pattern Recognition Letters, 48:70–80, 2014.
 [2] V. Bloom, D. Makris, and V. Argyriou. G3D: A gaming action dataset and real time action recognition evaluation framework. In Computer Vision and Pattern Recognition Workshops, pages 7–12, 2012.
 [3] R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal. Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. In Computer Vision and Pattern Recognition Workshops, pages 471–478, 2013.
 [4] C. Chen, R. Jafari, and N. Kehtarnavaz. Action recognition from depth sequences using depth motion maps-based local binary patterns. In Winter Conference on Applications of Computer Vision, pages 1092–1099, 2015.
 [5] C. Chen, R. Jafari, and N. Kehtarnavaz. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In International Conference on Image Processing, 2015.
 [6] S. Fothergill, H. Mentis, P. Kohli, and S. Nowozin. Instructing people for training gestural interactive systems. In Conference on Human Factors in Computing Systems, pages 1737–1746, 2012.
 [7] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban. Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In International Joint Conference on Artificial Intelligence, pages 2466–2472, 2013.
 [8] I. Kviatkovsky, E. Rivlin, and I. Shimshoni. Online action recognition using covariance of shape and motion. Computer Vision and Image Understanding, 129:15–26, 2014.
 [9] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In Computer Vision and Pattern Recognition Workshops, pages 9–14, 2010.
 [10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
 [11] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, 25(1):24–38, 2014.
 [12] E. Ohn-Bar and M. M. Trivedi. Joint angles similarities and HOG2 for action recognition. In Computer Vision and Pattern Recognition Workshops, pages 465–470, 2013.
 [13] A. Sharaf, M. Torki, M. E. Hussein, and M. El-Saban. Real-time multi-scale action detection from 3D skeleton data. In Winter Conference on Applications of Computer Vision, pages 998–1005. IEEE, 2015.
 [14] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.
 [15] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In Computer Vision and Pattern Recognition, pages 588–595, 2014.
 [16] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In Computer Vision and Pattern Recognition, 2013.
 [17] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Computer Vision and Pattern Recognition, pages 1290–1297, 2012.
 [18] L. Xia, C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3D joints. In Computer Vision and Pattern Recognition Workshops, pages 20–27, 2012.
 [19] X. Yang and Y. Tian. Eigenjoints-based action recognition using Naive-Bayes-Nearest-Neighbor. In Computer Vision and Pattern Recognition Workshops, pages 14–19, 2012.
 [20] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall. A survey on human motion analysis from depth data. In Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications, pages 149–187. Springer, 2013.
 [21] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In International Conference on Computer Vision, pages 2752–2759. IEEE, 2013.
 [22] Y. Zhu, W. Chen, and G. Guo. Fusing spatiotemporal features and joints for 3D action recognition. In Computer Vision and Pattern Recognition Workshops, pages 486–491, 2013.