Skeleton-Based Action Recognition with Synchronous Local and Non-Local Spatio-Temporal Learning and Frequency Attention
Abstract
Benefiting from its succinctness and robustness, skeleton-based human action recognition has recently attracted much attention. Most existing methods utilize local networks, such as recurrent neural networks, convolutional neural networks, and graph convolutional networks, to extract spatiotemporal dynamics hierarchically. As a consequence, local and non-local dependencies, which respectively carry more details and more semantics, are captured asynchronously at different levels of layers. Moreover, being limited to the spatiotemporal domain, these methods ignore patterns in the frequency domain. To better extract information from multiple domains, we propose a residual frequency attention (rFA) block to focus on discriminative patterns in the frequency domain, and a synchronous local and non-local (SLnL) block to simultaneously capture details and semantics in the spatiotemporal domain. To optimize the whole process, we also propose a soft-margin focal loss (SMFL), which automatically conducts adaptive data selection and encourages intrinsic margins in classifiers. Extensive experiments are performed on several large-scale action recognition datasets, and our approach significantly outperforms other state-of-the-art methods.
Guyue Hu^{1,3}, Bo Cui^{1,3}, Shan Yu^{1,2,3}
^{1}Brainnetome Center, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
^{2}CAS Center for Excellence in Brain Science and Intelligence Technology
^{3}University of Chinese Academy of Sciences
{guyue.hu, bo.cui, shan.yu}@nlpr.ia.ac.cn
1 Introduction
Skeleton-based human action recognition has recently attracted much attention due to its succinctness of representation and robustness to variations of viewpoints, appearances, and surrounding distractions (?). Skeleton-based human actions are naturally sequences, thus many works apply recurrent neural networks (RNNs) with long short-term memory (LSTM) to model the temporal dependencies (?; ?; ?). Many researchers also treat the skeletal data as 2D pseudo-images and feed them into various CNNs to model the spatiotemporal dynamics, which achieves better results (?; ?; ?). To exploit the structural information of the human body, Yan et al. (?) construct a graph based on the physical connections of human joints, which obtains promising performance.
All the aforementioned methods follow the same paradigm: first prepare the skeletal data as sequences, pseudo-images, or graphs, then feed them into stacked networks to mine spatiotemporal features hierarchically with the softmax loss (?). Despite the significant improvements in performance, three serious problems remain to be solved: 1) The recurrent and convolutional operations are neighborhood-based local operations (?), so local-range detailed information and non-local semantic information are mainly captured asynchronously, in the lower and higher layers respectively, which hinders the fusion of details and semantics in action dynamics. 2) Human actions such as shaking hands, brushing teeth, and clapping have characteristic frequency patterns, but previous works are always limited to the spatiotemporal dynamics and ignore periodic patterns in the frequency domain. 3) In the softmax-loss scenario, networks cannot learn classifiers equipped with a margin between the positive and negative samples because of the softmax operation, and cannot conduct any data selection without extra structures.
To move beyond these limitations and better extract information from multiple domains, we propose a novel model and a novel loss, referred to as SLnL-rFA and SMFL, respectively. The SLnL-rFA model is equipped with synchronous local and non-local (SLnL) blocks for spatiotemporal learning and a residual frequency attention (rFA) block for mining frequency patterns. To optimize the multi-domain feature learning process, the proposed soft-margin focal loss (SMFL) automatically conducts data selection and encourages intrinsic margins in classifiers. Fig. 1 shows the pipeline of our method. An adaptive transform network first augments and transforms the skeletal action. Then, the residual frequency attention (rFA) block is applied to select discriminative frequency patterns, followed by synchronous local and non-local (SLnL) blocks and local blocks in the spatiotemporal domain, where SLnL is designed to extract local details and non-local semantics synchronously. Finally, the three classifiers, with inputs from position, velocity, and concatenated features, are optimized in a multi-task learning scenario according to our soft-margin focal loss.
Our main contributions lie in the following four aspects:

Moving beyond the spatiotemporal domain, we propose a residual frequency attention (rFA) block, which sheds new light on exploiting frequency information for skeleton-based action recognition.

We propose a synchronous local and non-local (SLnL) block to simultaneously mine local detailed dynamics and non-local semantic information in early-stage layers.

We propose a general soft-margin focal loss (SMFL), which automatically conducts data selection during training and encourages classifiers with intrinsic soft-margins, while retaining a clear probabilistic interpretation.

Our approach outperforms state-of-the-art methods by significant margins on the largest indoor dataset (NTU RGB+D) and the largest unconstrained dataset (Kinetics) for skeleton-based action recognition.
2 Related Works
Frequency domain analysis.
Generalized frequency domain analysis covers several large classes of methods, such as the discrete Fourier transform (DFT), the short-time Fourier transform (STFT), and the wavelet transform, which are classical tools in signal analysis and image processing (?). With the booming of deep learning techniques (?; ?), methods based on the spatiotemporal domain dominate the field of computer vision, and only a few works pay attention to the frequency domain. For example, frequency domain analysis of critical-point trajectories (?) and frequency divergence images (?) are applied to RGB-based action recognition, and scattering convolution networks with wavelet filters are used for object classification (?). The current work revisits the frequency domain and exploits discriminative frequency patterns to improve skeleton-based action recognition.
Non-local operation.
Non-local means is a classical filtering algorithm that allows distant pixels to contribute to the target pixel (?). Block-matching (?) explores groups of non-locally similar patches and is a solid baseline for image denoising; it is widely used in computer vision tasks such as super-resolution (?) and image denoising (?). The popular self-attention (?) in machine translation can also be viewed as a non-local operation. Recently, different non-local blocks have been inserted into CNNs for video classification (?) and into RNNs for image restoration (?). Their local and non-local operations apply to different objects at different levels of layers, while our SLnL operates on the same objects simultaneously; thus only SLnL can fuse local and non-local information synchronously.
Reformed softmax loss.
The softmax loss (?), consisting of the last fully connected layer, the softmax function, and the cross-entropy loss, is widely applied in supervised learning due to its simplicity and clear probabilistic interpretation. However, recent works (?; ?) have exposed its limitations in feature discriminability and have stimulated two types of improvements. One type directly refines or combines the cross-entropy loss with other losses, such as the contrastive loss (?) or the triplet loss (?). The other type reformulates the softmax function with a geometric or algebraic margin (?; ?) to encourage intra-class compactness and inter-class separability of the learned features, which completely destroys the probabilistic meaning of the original softmax function. With a simple modification to the cross-entropy loss, our SMFL encourages intrinsic soft-margins in classifiers while maintaining a clear probabilistic interpretation, as proved in the next section.
Adaptive data selection.
The contributions of easy and hard samples differ over the course of training, thus an adaptive data selection strategy significantly impacts model performance and training efficiency (?). Some previous studies adopt heuristic rules to adjust the sampling probabilities of the training data, such as curriculum learning (?), self-paced learning (?), and online batch selection (?). Fan et al. (?) use a deep reinforcement learning framework to automatically learn what data to learn. However, these methods require extra data selection networks or complex modifications to the mainstream shuffle-based training pipeline, whereas the focal loss (?) encourages effective data selection with only a simple modification to the loss function. Our soft-margin focal loss adopts this paradigm as well.
3 Methods
The overall pipeline of our method is shown in Fig.1 (see Introduction for details). In this section, we will introduce each component separately.
3.1 Transform Network
A skeletal human action is represented by the 3D locations of $J$ body joints in a $T$-frame video. We first enrich the representation from a single rectangular coordinate system to multiple representations in adaptive oblique coordinate systems through the coordinate transformer, then adaptively augment the number of joints and rearrange their order through the skeleton transformer.
In order to learn better expressions of human actions from multiple aspects, we transform each joint coordinate $v \in \mathbb{R}^3$ in the original rectangular coordinate system into new coordinates $v_k = A_k v$ ($k = 1, \dots, K$) corresponding to $K$ oblique coordinate systems, where $A_k$ is the transition matrix from the original coordinate system to the $k$-th new coordinate system. For convenience, the new coordinates are concatenated as $\tilde{v} = [v_1; \dots; v_K]$, and similarly for the transition matrices $A = [A_1; \dots; A_K]$. Therefore, the expression of human actions can be enriched by learning the concatenated transform matrix $A$ end to end. Note that the view adaptation network (?) is a special case of our multi-coordinate transformer in which only one rectangular coordinate transform is applied.
Directly taking an action as a multi-channel spatiotemporal image loses structural information among human skeletons and is limited to the original joints. Following Li et al. (?), we introduce an adaptive skeleton transformer to augment the number of joints and rearrange their order. Each skeleton, originally a structureless permutation of $J$ joints, is adaptively augmented and rearranged into an optimal permutation through the transform $X' = XW$, where the transform matrix $W \in \mathbb{R}^{J \times J'}$ is learned by end-to-end training and $J'$ denotes the number of new joints. As a result, the network selects important body joints and structural relationships automatically.
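Both transformers reduce to learned matrix multiplications, which can be sketched in a few lines of numpy (a hypothetical illustration: the shapes follow the settings reported in the experiments, i.e. $K = 10$ coordinate systems and 64 new joints, and the random matrices stand in for the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, K, J_new = 64, 25, 10, 64      # frames, joints, coordinate systems, new joints

x = rng.standard_normal((T, J, 3))   # a skeletal action with 3D joint locations
A = rng.standard_normal((K, 3, 3))   # "learnable" transition matrices, one per system
W = rng.standard_normal((J, J_new))  # "learnable" skeleton-transform matrix

# Coordinate transformer: express every joint in K adaptive coordinate systems,
# then concatenate the K representations along the channel dimension.
x_multi = np.einsum('kcd,tjd->tjkc', A, x).reshape(T, J, K * 3)

# Skeleton transformer: augment and rearrange joints with a learned linear map.
x_out = np.einsum('tjc,jn->tnc', x_multi, W)
print(x_out.shape)  # (64, 64, 30)
```

In the real model both `A` and `W` are trained end to end together with the rest of the network.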
3.2 Residual Frequency Attention
Previous works have always concentrated on the spatiotemporal domain, but many actions contain inherent frequency-sensitive patterns, such as shaking hands and brushing teeth, which motivates us to revisit the frequency domain. Classical operations in the frequency domain, such as high-pass, low-pass, and band-pass filters, have only a few parameters, which is far from enough; thus we propose a more general frequency attention block (Fig. 2) equipped with abundant learnable parameters to adaptively select frequency components.
Given a transformed action $X \in \mathbb{R}^{C \times T \times S}$ after the transform network, where $C$, $T$, and $S$ denote the numbers of channels, frames, and joints respectively, the 2D discrete Fourier transform (DFT) maps the pseudo spatiotemporal image $x_c$ in each channel $c$ to $F_c$ in the frequency domain via

$F_c(u, v) = R_c(u, v) + i\, I_c(u, v) = \sum_{t=0}^{T-1} \sum_{s=0}^{S-1} x_c(t, s)\, e^{-i 2\pi (ut/T + vs/S)},$   (1)

where $u$ and $v$ are frequencies, $c$ indexes the channel of the spatiotemporal image, and $R_c$/$I_c$ denotes the cosine/sinusoidal component. The frequency spectrum is $|F_c| = \sqrt{R_c^2 + I_c^2}$ and the phase spectrum is $\varphi_c = \arctan(I_c / R_c)$. In practice, the DFT and its inverse (IDFT) are computed with the fast Fourier transform (FFT) algorithm and its inverse (IFFT), which reduces the computational complexity from $O(N^2)$ to $O(N \log N)$ in our case.
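The decomposition in Eq. 1 can be reproduced directly with an off-the-shelf FFT routine; a minimal numpy sketch (illustrative shapes, single channel):

```python
import numpy as np

x = np.random.default_rng(1).standard_normal((64, 25))  # one channel: T x S pseudo-image

F = np.fft.fft2(x)               # 2D DFT, computed via the FFT
R, I = F.real, F.imag            # cosine / sinusoidal components
spectrum = np.sqrt(R**2 + I**2)  # frequency spectrum |F|
phase = np.arctan2(I, R)         # phase spectrum

x_rec = np.fft.ifft2(F).real     # the IDFT recovers the pseudo-image
print(np.allclose(x, x_rec))  # True
```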
For each action, the attention weights $A_R$ and $A_I$ are functions of its cosine and sinusoidal components in the frequency domain, i.e.

$A_R = \sigma\big(W_2\, \delta(W_1 \bar{R})\big), \quad A_I = \sigma\big(W'_2\, \delta(W'_1 \bar{I})\big),$   (2)

where $\bar{R}$ and $\bar{I}$ are the channel-averaged cosine and sinusoidal components, $\sigma$ is the sigmoid function, and $\delta$ is a nonlinearity. Specifically, after a channel-averaging operation, each component is fed into two fully connected (FC) layers to learn adaptive weights for each frequency, followed by a sigmoid transfer function. The first FC layer serves as a bottleneck layer (?) for dimensionality reduction with a ratio factor $r$. Then, the learned attention weights are duplicated to every channel to attend to the input frequency image via

$\tilde{R}_c = A_R \odot R_c,$   (3)

$\tilde{I}_c = A_I \odot I_c,$   (4)

where $\odot$ denotes element-wise multiplication. Finally, a spatiotemporal residual connection is applied to obtain the output after attention, i.e.

$\tilde{X}_c = \mathcal{F}^{-1}\big(\tilde{R}_c + i\, \tilde{I}_c\big) + X_c,$   (5)

where $\mathcal{F}^{-1}$ denotes the efficient 2-dimensional IFFT.
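Putting Eqs. 1-5 together, the rFA forward pass can be sketched as follows (a simplified numpy illustration with plain matrices in place of the learned FC layers; the shapes and the bottleneck ratio value are assumptions for the example):

```python
import numpy as np

def rfa(x, W1r, W2r, W1i, W2i):
    """Residual frequency attention sketch; x has shape (C, T, S)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))
    relu = lambda z: np.maximum(z, 0.0)
    F = np.fft.fft2(x)                      # per-channel 2D DFT over (T, S)
    R, I = F.real, F.imag                   # cosine / sinusoidal components
    r_bar = R.mean(axis=0).ravel()          # channel averaging
    i_bar = I.mean(axis=0).ravel()
    # two FC layers + sigmoid per component (Eq. 2), weights shared over channels
    A_r = sigmoid(W2r @ relu(W1r @ r_bar)).reshape(R.shape[1:])
    A_i = sigmoid(W2i @ relu(W1i @ i_bar)).reshape(I.shape[1:])
    # attend in the frequency domain (Eqs. 3-4), then IFFT plus residual (Eq. 5)
    return np.fft.ifft2(A_r * R + 1j * (A_i * I)).real + x

rng = np.random.default_rng(2)
C, T, S, ratio = 4, 16, 8, 4                # bottleneck ratio factor (assumed value)
n, d = T * S, (T * S) // ratio
W1r, W2r = rng.standard_normal((d, n)), rng.standard_normal((n, d))
W1i, W2i = rng.standard_normal((d, n)), rng.standard_normal((n, d))
y = rfa(rng.standard_normal((C, T, S)), W1r, W2r, W1i, W2i)
print(y.shape)  # (4, 16, 8)
```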
3.3 Synchronous Local and Non-Local Spatiotemporal Learning
Non-Local Module.
A general non-local operation takes a multi-channel signal $X \in \mathbb{R}^{N \times C_1}$ as its input and generates a multi-channel output $Y \in \mathbb{R}^{N \times C_2}$. Here $C_1$ and $C_2$ are the numbers of channels, and $N = |\Omega|$ is the number of positions, where $\Omega$ is the set that enumerates all positions of the signal (image, video, feature map, etc.). Let $x_i$ and $y_i$ denote the $i$-th row vectors of $X$ and $Y$ respectively; the non-local operation can be formulated as follows:

$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j \in \Omega} f(x_i, x_j)\, g(x_j),$   (6)

where the multi-channel unary transform $g(\cdot)$ computes the embedding of $x_j$, the multi-channel binary transform $f(\cdot, \cdot)$ computes the affinity between the positions $i$ and $j$, and $\mathcal{C}(x)$ is a normalization factor. With different choices of $f$ and $g$, such as Gaussian, embedded Gaussian, and dot product, various non-local operations can be constructed. For simplicity, we only consider $g$ and $f$ in the form of a linear embedding and an embedded Gaussian respectively, and set $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$, i.e.

$g(x_j) = W_g\, x_j,$   (7)

where $W_g$ contains learnable transform parameters, and

$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)},$   (8)

$\theta(x_i) = W_\theta\, x_i, \quad \phi(x_j) = W_\phi\, x_j,$   (9)

$y_i = \sum_{\forall j \in \Omega} \frac{e^{\theta(x_i)^{T} \phi(x_j)}}{\sum_{\forall k \in \Omega} e^{\theta(x_i)^{T} \phi(x_k)}}\, g(x_j),$   (10)

where $W_\theta, W_\phi \in \mathbb{R}^{C_e \times C_1}$ are learnable and $C_e$ denotes the embedding channel. To weigh how important the non-local information is compared to the local information, a weighting transform $W_z$ is appended, i.e.

$z_i = W_z\, y_i + x_i,$   (11)

where $z_i$ is the output of the non-local module. Note that a non-local module can be dropped into a pretrained model without breaking its initial behavior by initializing $W_z$ as 0. A non-local module with a multi-dimensional input can be implemented with some transpose operations, several convolutional layers with kernels of size 1, and a softmax layer; Fig. 3(a) shows a 2D example.
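A minimal numpy sketch of the embedded-Gaussian non-local module of Eqs. 6-11 (the shapes are illustrative; note how the zero initialization of $W_z$ leaves the input unchanged):

```python
import numpy as np

def nonlocal_module(x, W_theta, W_phi, W_g, W_z):
    """Embedded-Gaussian non-local operation; x has shape (N, C)."""
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g  # linear embeddings (Eqs. 7-9)
    logits = theta @ phi.T                           # pairwise affinities theta^T phi
    f = np.exp(logits - logits.max(axis=1, keepdims=True))
    f /= f.sum(axis=1, keepdims=True)                # softmax normalization C(x)
    y = f @ g                                        # aggregate over all positions (Eq. 10)
    return y @ W_z + x                               # weighting + residual (Eq. 11)

rng = np.random.default_rng(3)
N, C, Ce = 20, 8, 4                                  # positions, channels, embedding channels
x = rng.standard_normal((N, C))
W_theta, W_phi, W_g = (rng.standard_normal((C, Ce)) for _ in range(3))
W_z = np.zeros((Ce, C))                              # zero init: module acts as identity
print(np.allclose(nonlocal_module(x, W_theta, W_phi, W_g, W_z), x))  # True
```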
Baseline local block.
The local operation is defined as

$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j \in \mathcal{N}_i} f(x_i, x_j)\, g(x_j),$   (12)

where $\mathcal{N}_i \subset \Omega$ is the local neighbor set of the target position $i$. The convolution is a typical local operation with an identity affinity $f$, a linear transform $g$, an identity normalization factor $\mathcal{C}(x)$, and $\mathcal{N}_i$ being the neighbors around the target center with the same shape as the kernel. Our baseline local block is constructed from convolution operations. As shown in Fig. 3(b), two convolutional layers with 1D kernels along the temporal and spatial dimensions are applied to learn temporal local (tLocal) and spatial local (sLocal) features respectively, followed by a convolutional layer for spatiotemporal local (stLocal) features. The block also contains a residual path, a rectified linear unit (ReLU), and a batch normalization (BN) layer.
Synchronous local and non-local (SLnL) block.
In order to synchronously exploit local details and non-local semantics in human actions, three non-local modules are merged in parallel into the above baseline local block. As shown in Fig. 3(c), two 1D non-local modules explore temporal non-local (tNonLocal) and spatial non-local (sNonLocal) information respectively, followed by a 2D non-local module for spatiotemporal non-local (stNonLocal) patterns. We define the affinity field as the range of pixel indices that can contribute to the target position in the next layer of a local or non-local module, which is a more general concept than the receptive field of CNNs. The affinity field in Fig. 3(d) clearly shows that our SLnL can mine and fuse local details and non-local semantics synchronously. Note that our SLnL is significantly different from methods (?; ?) that only insert a few non-local modules after stacked local networks, so that the local and non-local operations are still conducted separately in different layers with different resolutions. In contrast, our SLnL captures local and non-local patterns simultaneously in every layer (Fig. 3(d)).
3.4 Soft-Margin Focal Loss
A common challenge for classification tasks is that the difficulty of discrimination differs among samples and classes, but most previous works on skeleton-based action recognition use the softmax loss, which does not take this challenge into consideration. There are two possible measures to alleviate it, i.e. data selection and margin encouraging.
Intuitively, the larger the predicted probability of a sample is, the farther away from the decision boundary it might be, and vice versa. Motivated by this intuition, we construct a soft-margin (SM) loss term as follows:

$\mathcal{L}_{SM} = m + \log\big(1 - (1 - e^{-m})\, p_y\big),$   (13)

where $p_y$ is the estimated posterior probability of the ground-truth class, and $m > 0$ is a margin parameter. $\mathcal{L}_{SM} \geq 0$ for the fact that $p_y \leq 1$. As Fig. 4 shows, when the posterior probability is small, the sample is more likely to be close to the boundary, thus we penalize it with a large margin loss; otherwise, a small margin loss is imposed. To further illustrate the idea, we introduce the term $\mathcal{L}_{SM}$ into the cross-entropy loss, leading to a soft-margin cross-entropy (SMCE) loss,
$\mathcal{L}_{SMCE} = -\log p_y + \mathcal{L}_{SM} = -\log p_y + m + \log\big(1 - (1 - e^{-m})\, p_y\big).$   (14)
Assume $x$ is the learned feature before the last FC layer; the FC layer transforms it into the scores of $n$ classes by multiplying $W = [W_1, \dots, W_n]$, where $W_j$ is the parameter of the linear classifier corresponding to class $j$, i.e. $s_j = W_j^{T} x$. Followed by a softmax layer, we have $p_j = e^{s_j} / \sum_k e^{s_k}$ and, in particular, $p_y = e^{s_y} / \sum_k e^{s_k}$; then SMCE can be rewritten as:
$\mathcal{L}_{SMCE} = -\log p_y + m + \log\big(1 - (1 - e^{-m})\, p_y\big)$   (15)

$= -\log \frac{p_y\, e^{-m}}{1 - (1 - e^{-m})\, p_y}$   (16)

$= -\log \frac{e^{s_y - m}}{e^{s_y - m} + \sum_{j \neq y} e^{s_j}}.$   (17)
Comparing the standard softmax loss with Eq. 17, only the score of the ground-truth class is replaced, from $s_y$ to $s_y - m$. Optimizing a model with SMCE, we will obtain classifiers that are encouraged to satisfy $s_y - m > s_j$ for $j \neq y$. As a result, an intrinsic margin between the positive samples (belonging to a specific class) and the negative samples (not belonging to that class) of each class will be formed in the classifiers by adding the SM loss term into the loss function.
In addition, the focal loss (?), defined as

$\mathcal{L}_{FL} = -(1 - p_y)^{\gamma} \log p_y,$   (18)
where $\gamma \geq 0$ is a focusing parameter, can encourage adaptive data selection without any damage to the original model structure or training process. As Fig. 4 shows, the relative loss for well-classified easy samples is reduced by FL compared to CE. Although FL pays more attention to hard samples, it has no margin around the decision boundary. Similar to SMCE, we introduce the term $\mathcal{L}_{SM}$ into FL to obtain the soft-margin focal loss (SMFL) as follows:
$\mathcal{L}_{SMFL} = -(1 - p_y)^{\gamma} \log p_y + m + \log\big(1 - (1 - e^{-m})\, p_y\big).$   (19)
Finally, our SMFL encourages intrinsic margins in classifiers and maintains FL's advantage of data selection as well.
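The losses above can be checked numerically; the following sketch (with hypothetical parameter values) implements Eqs. 13, 18, and 19, and verifies that SMCE (the $\gamma = 0$ case) coincides with the margin-shifted softmax form of Eq. 17:

```python
import numpy as np

def soft_margin_focal_loss(scores, y, m=0.5, gamma=2.0):
    """Soft-margin focal loss sketch (Eqs. 13, 18, 19)."""
    p = np.exp(scores - scores.max())
    p /= p.sum()                                     # softmax probabilities
    p_y = p[y]
    sm = m + np.log(1.0 - (1.0 - np.exp(-m)) * p_y)  # soft-margin term, in [0, m]
    fl = -(1.0 - p_y) ** gamma * np.log(p_y)         # focal loss
    return fl + sm

rng = np.random.default_rng(4)
scores, y, m = rng.standard_normal(10), 3, 0.5       # example scores and label
loss = soft_margin_focal_loss(scores, y, m=m)

# SMCE is the gamma = 0 case; it should match the margin-shifted softmax of Eq. 17,
# where only the ground-truth score s_y is replaced by s_y - m.
smce = soft_margin_focal_loss(scores, y, m=m, gamma=0.0)
shifted = scores.copy()
shifted[y] -= m
ref = -(shifted[y] - np.log(np.exp(shifted).sum()))
print(np.isclose(smce, ref))  # True
```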
Our two-stream model (Fig. 1) predicts three probability vectors $p^{(pos)}$, $p^{(vel)}$, and $p^{(con)}$ from three modes, namely position, velocity, and their concatenation. We optimize it as a pseudo multi-task learning problem with the proposed SMFL, i.e. each classifier produces a loss component via
$\mathcal{L}_r = \mathcal{L}_{SMFL}\big(p^{(r)}, y\big),$   (20)
where $r \in \{pos, vel, con\}$ is the mode type, and $y$ is the one-hot class label. Thus the final loss is as follows:
$\mathcal{L} = \mathcal{L}_{pos} + \mathcal{L}_{vel} + \mathcal{L}_{con}.$   (21)
During inference, only $p^{(con)}$ is used to predict the final class. Note that the proposed SMFL, as well as the byproduct SMCE, is universal for classification tasks.
4 Experiments
4.1 Datasets and Experimental Details
NTU RGB+D.
The NTU RGB+D (NTU) dataset (?) is currently the largest indoor action recognition dataset. It contains 56,000 action clips of 60 action classes performed by 40 subjects. Each clip consists of the locations of 25 joints for one or two persons. There are two evaluation protocols for this dataset, i.e. cross-subject (CS) and cross-view (CV). For the cross-subject evaluation, 40,320 samples from 20 subjects are used for training and 16,560 samples from the remaining subjects for testing. For the cross-view evaluation, samples are split by camera view, with two camera views for training and the remaining one for testing.
Kinetics.
Kinetics (?) is by far the largest unconstrained action recognition dataset; it contains 300,000 video clips of 400 classes retrieved from YouTube. The skeletons are estimated by Yan et al. (?) from the RGB videos with the OpenPose toolbox. Each joint consists of 2D coordinates $(x, y)$ in the pixel coordinate system and a confidence score $c$, and is thus finally represented by a tuple $(x, y, c)$; each skeleton frame is recorded as an array of 18 tuples. We use the released skeletal dataset to train our model and evaluate the performance by the top-1 and top-5 accuracies, as recommended by Kay et al. (?).
Implementation Details.
During data preparation, we randomly crop sequences with a ratio uniformly drawn from [0.5, 1] for training, and centrally crop sequences with a fixed ratio of 0.95 for inference. Due to the variety of action lengths, we resize the sequences to a fixed length of 64/128 (NTU/Kinetics) frames with bilinear interpolation along the frame dimension. Finally, the obtained data are fed into a batch normalization layer to normalize the scale.
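The fixed-length resizing step can be sketched as linear interpolation along the frame axis (a simplified stand-in for the bilinear resize described above; `resize_sequence` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def resize_sequence(x, length=64):
    """Linearly interpolate a (T, J, C) skeleton sequence to a fixed frame count."""
    T = x.shape[0]
    src = np.linspace(0.0, T - 1.0, length)  # fractional source frame indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src - lo)[:, None, None]            # interpolation weights per target frame
    return (1.0 - w) * x[lo] + w * x[hi]

x = np.random.default_rng(5).standard_normal((100, 25, 3))  # 100-frame NTU-style clip
y = resize_sequence(x, length=64)
print(y.shape)  # (64, 25, 3)
```

The first and last frames are preserved exactly, since the endpoint weights are 1 and 0.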
Each stream of the model for NTU is composed of 6 blocks (Fig. 3) with local kernels of size 3 and 64, 64, 128, 128, 256, and 256 channels respectively, and max-pooling is applied after every two blocks. For Kinetics, two additional blocks with 512 channels are appended, and the local kernels of the first two blocks are changed to size 5. The numbers of new coordinate systems and new joints in the transform network are set to 10 and 64 respectively for both datasets.
During training, we apply the Adam optimizer with a weight decay of 0.0005. The learning rate is initialized to 0.001, followed by an exponential decay at a rate of 0.98/0.95 (NTU/Kinetics) per epoch. Dropout with a ratio of 0.2 is applied to each block to alleviate overfitting. The model is trained for 300/100 (NTU/Kinetics) epochs with a batch size of 32/128 (NTU/Kinetics).
4.2 Experimental Results
To validate the effectiveness of the proposed SLnL-rFA in constrained and unconstrained environments, we perform experiments on the NTU RGB+D and Kinetics datasets and compare its performance against other state-of-the-art methods. Because there are no previous methods capable of mining patterns in the frequency domain for skeleton-based action recognition, we only compare our method with those operating in the spatiotemporal domain.
On NTU RGB+D, we compare with one RNN-based method (?), three LSTM-based methods (?; ?; ?), two CNN-based methods (?; ?), one graph convolutional method (?), and one graph and LSTM hybrid method (?). As the local components of our SLnL block are CNN-based while the non-local components are designed to learn the affinity between each target position (node) and every position (node) in the figure (graph), our SLnL-rFA can be treated as a variant of CNN and graph hybrid methods. As shown in Table 1, the CNN-based methods are generally better than the RNN- or LSTM-based methods, and graph-based or hybrid graph-based methods also perform well. Our method consistently outperforms the state-of-the-art approaches by a large margin for both the cross-subject (CS) and cross-view (CV) evaluations. Specifically, our SLnL-rFA outperforms the best CNN-based method (HCN) by 2.6% (CS) and 3.8% (CV), and outperforms the recently reported LSTM and graph hybrid method (SR-TSL) by margins of 4.3% (CS) and 2.5% (CV), respectively.
Table 1: Comparison with state-of-the-art methods on NTU RGB+D (accuracy in %).

Methods  CS  CV
HRNN (?)  59.1  64.0
PA-LSTM (?)  62.9  70.3
ST-LSTM+TG (?)  69.2  77.7
VA-LSTM (?)  79.4  87.6
ST-GCN (?)  81.5  88.3
TS-CNN (?)  83.2  89.3
HCN (?)  86.5  91.1
SR-TSL (?)  84.8  92.4
SLnL-rFA (Ours)  89.1  94.9
Table 2: Comparison with state-of-the-art methods on Kinetics (accuracy in %).

Methods  top-1  top-5
Feature Enc. (?)  14.9  25.8
Deep LSTM (?)  16.4  35.3
Temporal Conv. (?)  20.3  40.0
ST-GCN (?)  30.7  52.8
SLnL-rFA (Ours)  36.6  59.1
On Kinetics, we compare with four representative methods, including hand-crafted features (?), a deep LSTM network (?), a temporal convolutional network (?), and a graph convolutional network (?). As shown in Table 2, the deep models outperform the hand-crafted-feature method, and the CNN-based methods work better than the LSTM-based method. Our method outperforms the state-of-the-art approach (ST-GCN) by large margins of 5.9% (top-1) and 6.3% (top-5) in recognition accuracy.
4.3 Ablation Study
To analyze the effectiveness of every proposed component, extensive ablation studies are conducted on NTU RGB+D.
Raw data vs. transformed data.
The baseline model (Baseline) of this section contains only the local blocks of Fig. 3(b) and is optimized with the CE loss. The proposed coordinate and skeleton transformer (CST), the coordinate transformer alone (CT), the skeleton transformer alone (ST), and a CNN variant with the same depth (CNNV) are applied to transform the data. In addition, three models without a transformer, i.e. using raw position data (PosN), raw velocity data (VelN), or both (PosVelN), are compared. As shown in Table 4, the performances achieved with transformed data consistently surpass those with the raw data. The improvements of CT and ST indicate that representing actions in adaptive multiple coordinate systems is better than using the original coordinate system, and that the augmented and rearranged data encode more structural information than the raw data. Even with the same depth, the improvement of CNNV is insignificant, indicating that our improvement is not simply induced by adding depth. Finally, CST performs the best, indicating that the coordinate transform and the skeleton transform are complementary to each other.
Comparisons on loss function.
We first upgrade the Baseline with the above CST for this section. The model is optimized with the cross-entropy loss (CE), the focal loss (FL), the soft-margin cross-entropy loss (SMCE), and the soft-margin focal loss (SMFL), respectively. To save space, at most the two best parameter settings for each loss are listed in Table 4. Because FL can conduct adaptive data selection, it performs better than CE. Benefiting from the encouraged margins between positive and negative samples, both SMCE and SMFL perform better than their original versions, i.e. CE and FL. Finally, SMFL achieves the best results because of its combined advantages of adaptive data selection and intrinsic margin encouraging.
How to select discriminative frequency patterns?
Attention methods  CS (%)  CV (%)
No Frequency Attention^{✝}  86.9  92.6
Amplitude FA (aFA)  84.7  89.8
Shared FA (sFA)  87.3  92.9
Dependent FA (dFA)  87.5  93.2
Residual FA (rFA)^{*}  87.7  93.6

^{✝} Baseline; ^{*} Adopted.
We further add the SMFL to the above baseline for this section. To validate the effectiveness of the proposed rFA, we compare it with several variants. The amplitude frequency attention (aFA) is built on the frequency spectrum instead of the sinusoidal and cosine components. Shared FA (sFA) learns shared attention parameters for the sinusoidal and cosine components, while dependent FA (dFA) learns two sets of parameters independently. The rFA is formed by applying the residual learning trick to dFA in the spatiotemporal domain (Fig. 2). In Table 5, we observe that aFA harms the model because the phase-angle information is completely missing when only the frequency spectrum is used. The dFA outperforms the sFA because it has more parameters to model the frequency patterns. The rFA finally achieves the best results, outperforming the baseline by a large margin and indicating that frequency information is effective for action recognition.
Comparisons of methods with different affinity fields.
Affinity Field  CS (%)  CV (%)
Local (Local = 6, SLnL = 0)^{✝}  87.7  93.6
tSLnL (SLnL = 1, Local = 5)  88.1  93.9
sSLnL (SLnL = 1, Local = 5)  88.0  94.1
SLnL (SLnL = 1, Local = 5)  88.3  94.3
SLnL (SLnL = 2, Local = 4)  88.6  94.6
SLnL (SLnL = 3, Local = 3)^{*}  88.8  94.9
SLnL (SLnL = 4, Local = 2)  88.9  94.8
SLnL (SLnL = 5, Local = 1)^{*}  89.1  94.7
SLnL (SLnL = 6, Local = 0)  88.8  94.7

^{✝} Baseline; ^{*} Adopted.
We further add an rFA block to the baseline for this section. Although non-local dependencies can be captured in the higher layers of hierarchical local networks, we argue that synchronously exploring non-local information in the early stages is preferable. We merge one temporal non-local module (tSLnL), one spatial non-local module (sSLnL), or one spatiotemporal block (SLnL) into the baseline to examine their effectiveness. As shown in Table 6, non-local information from both the temporal and spatial dimensions is helpful during the early stages. In addition, benefiting from the synchronous fusion of local details and non-local semantics, our SLnL boosts the recognition performance. To further investigate the properties of deeper SLnL, we replace more local blocks in the baseline with SLnL blocks. Table 6 shows that more SLnL blocks in the lower layers generally lead to better results, but the improvement in the higher layers is relatively small because the affinity field of local operations grows with depth. The results clearly show that synchronously extracting local details and non-local semantics is vital for modeling the spatiotemporal dynamics of human actions.
5 Conclusion
In this work, we propose a novel method (SLnL-rFA) to model skeleton-based actions from multiple domains, which significantly outperforms the state-of-the-art methods. The SLnL synchronously extracts local details and non-local semantics in the spatiotemporal domain. The rFA adaptively selects discriminative frequency patterns, which sheds new light on exploiting information in the frequency domain for skeleton-based action recognition. In addition, we propose a novel soft-margin focal loss for general recognition tasks, which encourages intrinsic margins in classifiers with a clear probabilistic interpretation, and conducts adaptive data selection without any damage to the original model structure or training process. In the future, we will apply our method to other related computer vision tasks such as general object recognition and event detection.
References
 [Beaudry, Péteri, and Mascarilla 2014] Beaudry, C.; Péteri, R.; and Mascarilla, L. 2014. Action recognition in videos using frequency analysis of critical point trajectories. In ICIP, 1445–1449.
 [Bengio et al. 2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML, 41–48.
 [Buades, Coll, and Morel 2005] Buades, A.; Coll, B.; and Morel, J. 2005. A nonlocal algorithm for image denoising. In CVPR, 60–65.
 [Cruz and Street 2017] Cruz, A. C., and Street, B. 2017. Frequency divergence image: A novel method for action recognition. In 14th IEEE International Symposium on Biomedical Imaging, 1160–1164.
 [Dabov et al. 2007] Dabov, K.; Foi, A.; Katkovnik, V.; and Egiazarian, K. O. 2007. Image denoising by sparse 3d transformdomain collaborative filtering. IEEE Trans. Image Processing 16(8):2080–2095.
 [Dai et al. 2016] Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. RFCN: object detection via regionbased fully convolutional networks. In NIPS, 379–387.
 [Du, Wang, and Wang 2015] Du, Y.; Wang, W.; and Wang, L. 2015. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 1110–1118.
 [Fan et al. 2017] Fan, Y.; Tian, F.; Qin, T.; Bian, J.; and Liu, T. 2017. Learning what data to learn. In ICML Workshops.
 [Fernando et al. 2015] Fernando, B.; Gavves, E.; M., J. O.; Ghodrati, A.; and Tuytelaars, T. 2015. Modeling video evolution for action recognition. In CVPR, 5378–5387.
 [Glasner, Bagon, and Irani 2009] Glasner, D.; Bagon, S.; and Irani, M. 2009. Superresolution from a single image. In ICCV, 349–356.
 [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
 [Kay et al. 2017] Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; Suleyman, M.; and Zisserman, A. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
 [Ke et al. 2017] Ke, Q.; Bennamoun, M.; An, S.; Sohel, F. A.; and Boussaïd, F. 2017. A new representation of skeleton sequences for 3d action recognition. In CVPR, 4570–4579.
 [Kim and Reiter 2017] Kim, T. S., and Reiter, A. 2017. Interpretable 3d human action analysis with temporal convolutional networks. In CVPR Workshops, 1623–1631.
 [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1106–1114.
 [Kumar, Packer, and Koller 2010] Kumar, M. P.; Packer, B.; and Koller, D. 2010. Selfpaced learning for latent variable models. In NIPS, 1189–1197.
 [Lefkimmiatis 2017] Lefkimmiatis, S. 2017. Nonlocal color image denoising with convolutional neural networks. In CVPR, 5882–5891.
 [Li et al. 2017] Li, C.; Zhong, Q.; Xie, D.; and Pu, S. 2017. Skeletonbased action recognition with convolutional neural networks. In ICME Workshops, 597–600.
 [Li et al. 2018] Li, C.; Zhong, Q.; Xie, D.; and Pu, S. 2018. Cooccurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In IJCAI, 786–792.
 [Lin et al. 2017] Lin, T.; Goyal, P.; Girshick, R. B.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In ICCV, 2999–3007.
 [Liu et al. 2016a] Liu, J.; Shahroudy, A.; Xu, D.; and Wang, G. 2016a. Spatiotemporal LSTM with trust gates for 3d human action recognition. In ECCV, 816–833.
 [Liu et al. 2016b] Liu, W.; Wen, Y.; Yu, Z.; and Yang, M. 2016b. Largemargin softmax loss for convolutional neural networks. In ICML, 507–516.
 [Liu et al. 2018] Liu, D.; Wen, B.; Fan, Y.; Loy, C. C.; and Huang, T. S. 2018. Nonlocal recurrent network for image restoration. arXiv preprint arXiv:1806.02919.
 [Loshchilov and Hutter 2015] Loshchilov, I., and Hutter, F. 2015. Online batch selection for faster training of neural networks. In ICML Workshops.
 [Oyallon, Belilovsky, and Zagoruyko 2017] Oyallon, E.; Belilovsky, E.; and Zagoruyko, S. 2017. Scaling the scattering transform: Deep hybrid networks. In ICCV, 5619–5628.
 [Schroff, Kalenichenko, and Philbin 2015] Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR, 815–823.
 [Shahroudy et al. 2016] Shahroudy, A.; Liu, J.; Ng, T.; and Wang, G. 2016. NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, 1010–1019.
 [Si et al. 2018] Si, C.; Jing, Y.; Wang, W.; Wang, L.; and Tan, T. 2018. Skeletonbased action recognition with spatial reasoning and temporal stack learning. In ECCV.
 [Song et al. 2017] Song, S.; Lan, C.; Xing, J.; Zeng, W.; and Liu, J. 2017. An endtoend spatiotemporal attention model for human action recognition from skeleton data. In AAAI, 4263–4270.
 [Sonka, Hlavac, and Boyle 2014] Sonka, M.; Hlavac, V.; and Boyle, R. 2014. Image processing, analysis, and machine vision. Cengage Learning.
 [Sun et al. 2014] Sun, Y.; Chen, Y.; Wang, X.; and Tang, X. 2014. Deep learning face representation by joint identificationverification. In NIPS, 1988–1996.
 [Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 6000–6010.
 [Wang et al. 2016] Wang, P.; Li, Z.; Hou, Y.; and Li, W. 2016. Action recognition based on joint trajectory maps using convolutional neural networks. In ACM MM, 102–106.
 [Wang et al. 2017] Wang, X.; Girshick, R. B.; Gupta, A.; and He, K. 2017. Nonlocal neural networks. In CVPR, 7794–7803.
 [Wang et al. 2018] Wang, X.; Zhang, S.; Lei, Z.; Liu, S.; Guo, X.; and Li, S. Z. 2018. Ensemble softmargin softmax loss for image classification. In IJCAI, 992–998.
 [Yan, Xiong, and Lin 2018] Yan, S.; Xiong, Y.; and Lin, D. 2018. Spatial temporal graph convolutional networks for skeletonbased action recognition. In AAAI.
 [Zhang et al. 2017] Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; and Zheng, N. 2017. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV, 2136–2145.