Skeleton-Based Action Recognition with Synchronous Local and Non-local Spatio-temporal Learning and Frequency Attention

Guyue Hu1, 3, Bo Cui1, 3, Shan Yu1, 2, 3
1Brainnetome Center, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2CAS Center for Excellence in Brain Science and Intelligence Technology
3University of Chinese Academy of Sciences
{, bo.cui, shan.yu}

Benefiting from its succinctness and robustness, skeleton-based human action recognition has recently attracted much attention. Most existing methods utilize local networks, such as recurrent neural networks, convolutional neural networks, and graph convolutional networks, to extract spatio-temporal dynamics hierarchically. As a consequence, the local and non-local dependencies, which respectively carry more details and semantics, are captured asynchronously at different levels of layers. Moreover, being limited to the spatio-temporal domain, these methods ignore patterns in the frequency domain. To better extract information from multiple domains, we propose a residual frequency attention (rFA) block to focus on discriminative patterns in the frequency domain, and a synchronous local and non-local (SLnL) block to simultaneously capture details and semantics in the spatio-temporal domain. To optimize the whole process, we also propose a soft-margin focal loss (SMFL), which automatically conducts adaptive data selection and encourages intrinsic margins in classifiers. Extensive experiments are performed on several large-scale action recognition datasets, and our approach significantly outperforms other state-of-the-art methods.


1 Introduction

Skeleton-based human action recognition has recently attracted much attention due to the succinctness of its representation and its robustness to variations in viewpoints, appearances and surrounding distractions (?). Skeleton-based human actions are naturally sequences, so many works apply Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) to model the temporal dependencies (???). Many researchers also treat the skeletal data as 2D pseudo-images and feed them into various CNNs to model spatio-temporal dynamics, which achieves more effective results (???). To exploit the structural information of the human body, Yan et al. (?) construct a graph based on the physical connections of human joints, which obtains promising performance.

All the aforementioned methods can be summarized as one paradigm: first prepare the skeletal data as sequences, pseudo-images or graphs, then feed them into stacked networks to mine spatio-temporal features hierarchically with the softmax loss (?). Despite the significant improvements in performance, three serious problems remain to be solved: 1) The recurrent and convolutional operations are neighborhood-based local operations (?), so the local-range detailed information and non-local semantic information are mainly captured asynchronously, in the lower and higher layers respectively, which hinders the fusion of details and semantics in action dynamics. 2) Human actions such as shaking hands, brushing teeth, and clapping have characteristic frequency patterns, but previous works are always limited to spatio-temporal dynamics and ignore periodic patterns in the frequency domain. 3) With the softmax loss, networks cannot learn classifiers equipped with a margin between positive and negative samples because of the softmax operation, and cannot conduct any data selection without extra structure.

Figure 1: The overall pipeline of the proposed method. The position and velocity information of human joints are fed into a transform network, a residual frequency attention network, synchronous local and non-local blocks, and local blocks sequentially. Treated as a pseudo multi-task learning problem, the proposed model is optimized according to our soft-margin focal loss.

To move beyond these limitations and better extract information from multiple domains, we propose a novel model and a novel loss, referred to as SLnL-rFA and SMFL respectively. The SLnL-rFA model is equipped with synchronous local and non-local (SLnL) blocks for spatio-temporal learning, and a residual frequency attention (rFA) block for mining frequency patterns. To optimize the multi-domain feature learning process, the proposed soft-margin focal loss (SMFL) automatically conducts data selection and encourages intrinsic margins in classifiers. Fig.1 shows the pipeline of our method. An adaptive transform network first augments and transforms the skeletal action. Then, the residual frequency attention (rFA) block is applied to select discriminative frequency patterns, followed by synchronous local and non-local (SLnL) blocks and local blocks in the spatio-temporal domain, where SLnL is designed to extract local details and non-local semantics synchronously. Finally, the three classifiers, with inputs from position, velocity and concatenated features, are optimized in a multi-task learning scenario according to our soft-margin focal loss.

Our main contributions lie in the following four aspects:

  • Moving beyond the spatio-temporal domain, we propose a residual frequency attention block, which sheds new light on exploiting frequency information for skeleton-based action recognition.

  • We propose a synchronous local and non-local (SLnL) block to simultaneously mine local detailed dynamics and non-local semantic information in early-stage layers.

  • We propose a general soft-margin focal loss, which automatically conducts data selection during training and encourages classifiers with intrinsic soft-margins, while retaining a clear probabilistic interpretation.

  • Our approach outperforms the state-of-the-art methods by significant margins on the largest indoor dataset, NTU RGB+D, and the largest unconstrained dataset, Kinetics, for skeleton-based action recognition.

2 Related Works

Frequency domain analysis.

Generalized frequency domain analysis contains several large classes of methods, such as the discrete Fourier transform (DFT), the short-time Fourier transform (STFT) and the wavelet transform, which are classical tools in the fields of signal analysis and image processing (?). Due to the booming of deep learning techniques (??), methods based on the spatio-temporal domain dominate the field of computer vision, with only a few works paying attention to the frequency domain. For example, frequency domain analysis of critical point trajectories (?) and the frequency divergence image (?) are applied to RGB-based action recognition. Scattering convolution networks with wavelet filters are used for object classification (?). The current work revisits the frequency domain and exploits discriminative frequency patterns to improve skeleton-based action recognition.

Non-local operation.

Non-local means is a classical filtering algorithm that allows distant pixels to contribute to the target pixel (?). Block-matching (?) explores groups of non-local similarity between patches and is a solid baseline for image denoising. It is widely used in computer vision tasks such as super-resolution (?) and image denoising (?). The popular self-attention (?) in machine translation can also be viewed as a non-local operation. Recently, different non-local blocks have been inserted into CNNs for video classification (?) and into RNNs for image restoration (?). Their local and non-local operations apply to different objects in different layers, while our SLnL operates on the same objects simultaneously, so only SLnL can fuse local and non-local information synchronously.

Reformed softmax loss.

The softmax loss (?), consisting of the last fully connected layer, the softmax function, and the cross-entropy loss, is widely applied in supervised learning due to its simplicity and clear probabilistic interpretation. However, recent works (??) have exposed its limitations on feature discriminability and have stimulated two types of methods for improvement. One type directly refines or combines the cross-entropy loss with other losses, such as the contrastive loss (?) and the triplet loss (?). The other type reformulates the softmax function with a geometrical or algebraic margin (??) to encourage intra-class compactness and inter-class separability of feature learning, which completely destroys the probabilistic meaning of the original softmax function. With a simple modification to the cross-entropy loss, our SMFL encourages intrinsic soft-margins in classifiers and maintains a clear probabilistic interpretation, which will be proved in the next section.

Adaptive data selection.

The contributions of easy data and hard data differ across the training process, so an adaptive data selection strategy significantly impacts model performance and training efficiency (?). Some previous studies adopt heuristic rules to adjust the sampling probabilities of the training data, such as curriculum learning (?), self-paced learning (?), and online batch selection (?). Fan et al. (?) use a deep reinforcement learning framework to automatically learn what data to learn. However, the aforementioned methods require extra data selection networks or complex modifications to the mainstream shuffle-based training pipeline, while the focal loss (?) introduces only a simple modification to the loss function to encourage effective data selection. Thus, our soft-margin focal loss also adopts this paradigm.

3 Methods

The overall pipeline of our method is shown in Fig.1 (see Introduction for details). In this section, we will introduce each component separately.

3.1 Transform Network

A skeletal human action is represented by the 3D locations of $N$ body joints in a $T$-frame video. We first enrich the representation from a single rectangular coordinate system to multiple representations in adaptive oblique coordinate systems through the coordinate transformer, then adaptively augment the number and rearrange the order of joints through the skeleton transformer.

In order to learn better expressions of human actions from multiple aspects, we transform each coordinate $v \in \mathbb{R}^3$ of the joints in the original rectangular coordinate system to $S$ new coordinates $v_s = A_s v$ corresponding to $S$ oblique coordinate systems, where $A_s \in \mathbb{R}^{3 \times 3}$ is the transition matrix from the original coordinate system to the new coordinate system $s$. For convenience, the new coordinates are concatenated as $\tilde{v} = [v_1; v_2; \dots; v_S]$, and similarly for the transition matrices $A = [A_1; A_2; \dots; A_S]$. Therefore, the expressions of human actions can be enriched by learning the concatenated transform matrix $A$ end-to-end. Note that the view adaption network (?) is a special case of our multi-coordinate transformer in which only one rectangular coordinate transform ($S = 1$) is applied.

Directly taking an action as a 3-channel spatio-temporal image would lose structural information among human skeletons and be limited to the original joints. Following Li et al. (?), we introduce an adaptive skeleton transformer to augment the number and rearrange the order of joints. Each skeleton, with its original structureless permutation of $N$ joints, is adaptively augmented and rearranged into an optimal permutation through the transform function $Y = XW$, where the transform matrix $W \in \mathbb{R}^{N \times N'}$ is learned by end-to-end training, and $N'$ denotes the number of new joints. As a result, the network selects important body joints and structural relationships automatically.
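As an illustration, both transformers above amount to learnable linear maps applied along different axes of the action tensor. The following numpy sketch (shapes and the einsum formulation are our own assumptions, not the paper's implementation) shows the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, S, N_new = 64, 25, 10, 64   # frames, joints, coordinate systems, new joints

X = rng.standard_normal((T, N, 3))          # raw 3D joint locations

# Coordinate transformer: S learnable 3x3 transition matrices map each joint
# coordinate into S oblique coordinate systems, concatenated along channels.
A = rng.standard_normal((S, 3, 3))          # would be learned end-to-end
X_coord = np.einsum('tnc,scd->tnsd', X, A)  # (T, N, S, 3)
X_coord = X_coord.reshape(T, N, S * 3)      # (T, N, 3S)

# Skeleton transformer: a learnable N x N' matrix augments and rearranges joints.
W = rng.standard_normal((N, N_new))         # would be learned end-to-end
X_out = np.einsum('tnc,nm->tmc', X_coord, W)  # (T, N', 3S)

print(X_out.shape)  # (64, 64, 30)
```

In training, `A` and `W` would simply be parameter tensors optimized jointly with the rest of the network.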

Figure 2: The residual frequency attention block. The spatio-temporal domain and the frequency domain can be switched conveniently through the 2D-FFT and 2D-IFFT. The attention for the sinusoidal component and the cosine component is conducted in the frequency domain, and the residual connection is applied in the spatio-temporal domain.
(a) 2D Non-local module
(b) Baseline local block
(c) SLnL block
(d) The affinity field of SLnL
Figure 3: (a) A 2D example of the non-local module. (b) The structure of the baseline local block. (c) The structure of the proposed synchronous local and non-local (SLnL) block. (d) The affinity field of SLnL. Note that the affinity field is a more general concept than the receptive field of CNNs. Red and blue represent local and non-local modules respectively in (d).

3.2 Residual Frequency Attention

Previous works always concentrate on the spatio-temporal domain, but many actions contain inherent frequency-sensitive patterns, such as shaking hands and brushing teeth, which motivates us to revisit the frequency domain. The classical operations in the frequency domain, such as high-pass, low-pass, and band-pass filters, have only a few parameters, which is far from enough; thus we propose a more general frequency attention block (Fig. 2), equipped with abundant learnable parameters, to adaptively select frequency components.

Given a transformed action $X \in \mathbb{R}^{T \times N \times C}$ after the transform network ($T$ frames, $N$ joints, $C$ channels), the 2D discrete Fourier transform (DFT) transforms the pseudo spatio-temporal image $x_c(t, n)$ in each channel to $F_c(u, v)$ in the frequency domain via

$$F_c(u, v) = \sum_{t=0}^{T-1} \sum_{n=0}^{N-1} x_c(t, n)\, e^{-j2\pi\left(\frac{ut}{T} + \frac{vn}{N}\right)} = R_c(u, v) - j\, I_c(u, v),$$

where $(u, v)$ and $c$ are the frequencies and the channel of the spatio-temporal image respectively, and $R_c(u, v)$ / $I_c(u, v)$ denotes the cosine/sinusoidal component. The frequency spectrum is $|F_c(u, v)| = \sqrt{R_c^2(u, v) + I_c^2(u, v)}$ and the phase spectrum is $\varphi_c(u, v) = \arctan\left(I_c(u, v) / R_c(u, v)\right)$. In practice, the DFT and its inverse (IDFT) are computed through the fast Fourier transform (FFT) algorithm and its inverse (IFFT), which reduces the computational complexity from $O(T^2 N^2)$ to $O(TN \log(TN))$ in our case.
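The decomposition into cosine and sinusoidal components can be checked numerically with numpy's FFT; in this sketch, R and I are read off the real and (negated) imaginary parts of the complex output, following the sign convention $F = R - jI$:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 8, 6
x = rng.standard_normal((T, N))      # one channel of the pseudo spatio-temporal image

F = np.fft.fft2(x)                   # 2D DFT, computed in O(TN log TN) via the FFT
R, I = F.real, -F.imag               # cosine and sinusoidal components (F = R - jI)

magnitude = np.sqrt(R**2 + I**2)     # frequency spectrum, equals |F|
phase = np.arctan2(-F.imag, F.real)  # phase spectrum (arctan2 keeps the quadrant)

# The inverse transform recovers the signal exactly (up to float error).
x_rec = np.fft.ifft2(R - 1j * I).real
print(np.allclose(x, x_rec))         # True
```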

For each action, the attention weights $A_R$ and $A_I$ are complex functions of its cosine and sinusoidal components in the frequency domain, i.e.

$$[A_R, A_I] = f_{att}(R, I), \quad A_R, A_I \in [0, 1]^{T \times N},$$

where $R, I \in \mathbb{R}^{T \times N \times C}$. Specifically, after a channel-averaging operation, each component is fed into two fully connected (FC) layers to learn adaptive weights for each frequency, followed by a sigmoid transform function. The first FC layer serves as a bottleneck layer (?) for dimensionality reduction with a ratio factor $r$. Then, the learned attention weights are duplicated to every channel to pay attention to the input frequency image via

$$\tilde{R}_c = A_R \odot R_c, \quad \tilde{I}_c = A_I \odot I_c,$$

where $\odot$ denotes element-wise multiplication. Finally, a spatio-temporal residual connection is applied to obtain the output $Y$ after attention, i.e.

$$Y = X + \mathcal{F}^{-1}\left(\tilde{R} - j\tilde{I}\right),$$

where $\mathcal{F}^{-1}$ denotes the efficient 2-dimensional IFFT.
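A forward-pass sketch of this attention in numpy, with randomly initialized matrices standing in for the learned FC layers and independent parameters for the two branches (the exact sub-network layout is an assumption for illustration):

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(z / 2.0))   # numerically stable sigmoid

rng = np.random.default_rng(2)
T, N, C, r = 16, 25, 4, 4            # frames, joints, channels, bottleneck ratio

X = rng.standard_normal((T, N, C))
F = np.fft.fft2(X, axes=(0, 1))      # per-channel 2D FFT
R, I = F.real, -F.imag               # cosine / sinusoidal components (F = R - jI)

# Hypothetical two-FC attention sub-network; weights would normally be learned.
d = T * N
W1_R, W2_R = rng.standard_normal((d, d // r)), rng.standard_normal((d // r, d))
W1_I, W2_I = rng.standard_normal((d, d // r)), rng.standard_normal((d // r, d))

def freq_attention(comp, W1, W2):
    avg = comp.mean(axis=-1).reshape(-1)          # channel averaging -> (T*N,)
    w = sigmoid(avg @ W1 @ W2).reshape(T, N, 1)   # one weight per frequency
    return w * comp                               # duplicated to every channel

R_att = freq_attention(R, W1_R, W2_R)
I_att = freq_attention(I, W1_I, W2_I)

# Back to the spatio-temporal domain via the inverse FFT, plus the residual.
Y = X + np.fft.ifft2(R_att - 1j * I_att, axes=(0, 1)).real
print(Y.shape)  # (16, 25, 4)
```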

3.3 Synchronous Local and Non-local Spatio-temporal Learning

Non-local Module.

A general non-local operation takes a multi-channel signal $X \in \mathbb{R}^{L \times C_1}$ as its input and generates a multi-channel output $Y \in \mathbb{R}^{L \times C_2}$. Here $C_1$ and $C_2$ are channels, and $L$ is the number of positions in $\Omega$, where $\Omega$ is the set that enumerates all positions of the signal (image, video, feature map, etc.). Let $x_i$ and $y_i$ denote the $i$-th row vectors of $X$ and $Y$ respectively; the non-local operation can be formulated as follows:

$$y_i = \frac{1}{C(x)} \sum_{\forall j \in \Omega} f(x_i, x_j)\, g(x_j),$$

where the multi-channel unary transform $g$ computes the embedding of $x_j$, the multi-channel binary transform $f$ computes the affinity between the positions $i$ and $j$, and $C(x)$ is a normalization factor. With different choices of $f$ and $g$, such as Gaussian, embedded Gaussian and dot product, various non-local operations can be constructed. For simplicity, we only consider $g$ and $f$ in the form of a linear embedding and an embedded Gaussian respectively, and set $C(x) = \sum_{\forall j} f(x_i, x_j)$, i.e.

$$g(x_j) = W_g\, x_j,$$

where $W_g$ contains learnable transform parameters, and

$$f(x_i, x_j) = e^{\theta(x_i)^{\mathsf T} \phi(x_j)},$$

where $\theta(x_i) = W_\theta x_i$, $\phi(x_j) = W_\phi x_j$, and $C_e$ denotes the embedding channel. To weigh how important the non-local information is compared to the local information, a weighting function is appended, i.e.

$$z_i = W_z\, y_i + x_i,$$

where $W_z$ is learnable. Note that a non-local module can be dropped into a pre-trained model without breaking its initial behavior by initializing $W_z$ as 0. A non-local module with a 2-dimensional input can be implemented with some transpose operations, several convolutional layers with kernel size 1, and a softmax layer; Fig.3(a) shows a 2D example.
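A minimal numpy sketch of the embedded-Gaussian non-local operation, including the zero-initialized weighting that makes the module a drop-in identity at the start of training (dimensions are arbitrary illustration values):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
L, C1, Ce = 10, 8, 4              # positions, input channels, embedding channels

X = rng.standard_normal((L, C1))
W_theta = rng.standard_normal((C1, Ce))
W_phi = rng.standard_normal((C1, Ce))
W_g = rng.standard_normal((C1, Ce))
W_z = np.zeros((Ce, C1))          # zero init: the module starts as an identity

# Embedded-Gaussian affinity; normalizing over j is exactly a softmax over
# positions, so every output aggregates information from ALL positions.
affinity = softmax((X @ W_theta) @ (X @ W_phi).T, axis=1)   # (L, L)
Y = affinity @ (X @ W_g)          # non-local aggregation of embeddings
Z = Y @ W_z + X                   # weighted residual output

print(np.allclose(Z, X))          # True: zero-initialized W_z preserves behavior
```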

Baseline local block.

The local operation is defined as

$$y_i = \frac{1}{C(x)} \sum_{\forall j \in \mathcal{N}_i} f(x_i, x_j)\, g(x_j),$$

where $\mathcal{N}_i \subset \Omega$ is the local neighbor set of the target position $i$. The convolution is a typical local operation with identity affinity $f(x_i, x_j) = 1$, linear transform $g(x_j) = W_g x_j$, identity normalization factor $C(x) = 1$, and $\mathcal{N}_i$ being the neighbors around the target center with the same shape as the kernel. Our baseline local block is constructed from convolution operations. As shown in Fig.3(b), two convolutional layers with kernels of $k \times 1$ and $1 \times k$ are applied to learn temporal local (tLocal) features and spatial local (sLocal) features respectively, and a $k \times k$ convolutional layer learns spatio-temporal local (stLocal) features. The block also contains a residual path, a rectified linear unit (ReLU) and a batch normalization (BN) layer.
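For illustration, the local operation with identity affinity and a shared linear transform $g$ can be sketched in 1D as follows (a simplified instance: a standard convolution would additionally use a separate transform per neighborhood offset):

```python
import numpy as np

rng = np.random.default_rng(4)
L, C, k = 12, 3, 3                 # positions, channels, neighborhood size

X = rng.standard_normal((L, C))
W_g = rng.standard_normal((C, C))  # linear transform g(x_j) = W_g x_j

# Local operation with identity affinity f = 1 and C(x) = 1: each output
# position aggregates only its k nearest neighbors (zero padding at the
# borders), in contrast to the non-local operation, which sums over ALL j.
pad = k // 2
Xp = np.pad(X, ((pad, pad), (0, 0)))
Y = np.stack([Xp[i:i + k].sum(axis=0) for i in range(L)]) @ W_g

print(Y.shape)  # (12, 3)
```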

Synchronous local and non-local (SLnL) block.

In order to synchronously exploit local details and non-local semantics in human actions, three non-local modules are merged in parallel into the above baseline local block. As shown in Fig.3(c), two 1D non-local modules explore temporal non-local (tNon-Local) and spatial non-local (sNon-Local) information respectively, followed by a 2D non-local module for spatio-temporal non-local (stNon-Local) patterns. We define the affinity field as the range of pixel indices that can contribute to the target position in the next layer of the local or non-local modules, which is a more general concept than the receptive field of CNNs. The affinity field in Fig.3(d) clearly shows that our SLnL can mine and fuse local details and non-local semantics synchronously. Note that our SLnL is significantly different from the methods (??) that only insert a few non-local modules after stacked local networks, so that the local and non-local operations are still conducted separately in different layers with different resolutions. In contrast, our SLnL simultaneously captures local and non-local patterns in every layer (Fig.3(d)).

3.4 Soft-Margin Focal Loss

A common challenge for classification tasks is that discrimination difficulty differs among samples and classes, but most previous works on skeleton-based action recognition use the softmax loss, which does not take this challenge into consideration. There are two possible measures to alleviate it, i.e. data selection and margin encouraging.

Intuitively, the larger the predicted probability of a sample, the farther away from the decision boundary it is likely to be, and vice versa. Motivated by this intuition, we construct a soft-margin (SM) loss term as follows:

$$\mathcal{L}_{SM} = \log\frac{1 + m}{p_t + m},$$

where $p_t$ is the estimated posterior probability of the ground-truth class, and $m$ is a margin parameter. $\mathcal{L}_{SM} \geq 0$ due to the fact that $p_t \leq 1$. As Fig.4 shows, when the posterior probability is small, the sample is more likely to be close to the boundary, so we penalize it with a large margin loss. Otherwise, a small margin loss is imposed. To further illustrate the idea, we introduce the term into the cross-entropy loss, leading to a soft-margin cross-entropy (SMCE) loss,

$$\mathcal{L}_{SMCE} = -\log(p_t) + \mathcal{L}_{SM}.$$
Assuming that $x$ is the learned feature before the last FC layer, the FC layer transforms it into the scores $z$ of $K$ classes by multiplying $W = [w_1, \dots, w_K]$, where $w_k$ is the parameter of the linear classifier corresponding to the class $k$, i.e. $z_k = w_k^{\mathsf T} x$. Followed by a softmax layer, we have $p_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$ and $p_t = e^{z_t} / \sum_{j=1}^{K} e^{z_j}$; then the SMCE can be rewritten as:

$$\mathcal{L}_{SMCE} = -\log\frac{e^{\,z_t - \mathcal{L}_{SM}}}{\sum_{j=1}^{K} e^{z_j}}.$$

Comparing the standard softmax loss with Eq.17, only the score $z_t$ of the ground-truth class is replaced by $z_t - \mathcal{L}_{SM}$. Optimizing the model with SMCE, we will obtain classifiers that meet the constraint $z_t - \mathcal{L}_{SM} \geq z_k$ ($k \neq t$). As a result, an intrinsic margin between the positive (belonging to a specific class) samples and the negative (not belonging to the specific class) samples of each class will be formed in classifiers by adding the SM loss term into the loss function.

Figure 4: Comparison among the proposed soft-margin focal loss (SMFL), the soft-margin cross-entropy (SMCE) loss, the cross-entropy (CE) loss, the focal loss (FL), and the soft-margin (SM) loss term. The focusing parameter $\gamma$ and the margin parameter $m$ of the losses are expressed as $(\gamma, m)$.

In addition, the focal loss (?), defined as

$$\mathcal{L}_{FL} = -(1 - p_t)^{\gamma} \log(p_t),$$

where $\gamma$ is a focusing parameter, can encourage adaptive data selection without any damage to the original model structure and training process. As Fig.4 shows, the relative loss for well-classified easy samples is reduced by FL compared to CE. Although FL pays more attention to hard samples, it has no margin around the decision boundary. Similar to SMCE, we introduce the term $\mathcal{L}_{SM}$ into FL to obtain the soft-margin focal loss (SMFL) as follows:

$$\mathcal{L}_{SMFL} = -(1 - p_t)^{\gamma} \log(p_t) + \mathcal{L}_{SM}.$$
Finally, our SMFL can encourage intrinsic margins in classifiers and maintain FL’s advantage of data selection as well.
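The loss terms can be compared numerically. This sketch assumes the soft-margin term takes the form $\log\frac{1+m}{p_t+m}$, which is consistent with the stated properties (non-negative for $p_t \leq 1$, large near the boundary, vanishing at $p_t = 1$); it only illustrates the qualitative behavior shown in Fig.4:

```python
import numpy as np

def sm_term(p_t, m):
    # soft-margin term: large when p_t is small (near the boundary), 0 at p_t = 1
    return np.log((1.0 + m) / (p_t + m))

def focal_loss(p_t, gamma):
    # focal loss: down-weights well-classified easy samples relative to CE
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def smfl(p_t, gamma, m):
    # soft-margin focal loss: data selection (FL) plus an intrinsic margin (SM)
    return focal_loss(p_t, gamma) + sm_term(p_t, m)

p = np.array([0.1, 0.5, 0.9, 0.999])   # predicted probability of the true class
gamma, m = 2.0, 0.4

print(sm_term(p, m))      # decreases monotonically toward 0
print(focal_loss(p, gamma))
print(smfl(p, gamma, m))  # always at least as large as FL, since SM >= 0
```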

Our two-stream model (Fig.1) predicts three probability vectors $p^{pos}$, $p^{vel}$, $p^{con}$ from three modes, including position, velocity, and their concatenation. We optimize it as a pseudo multi-task learning problem with the proposed SMFL, i.e. each classifier produces a loss component via

$$\mathcal{L}^{s} = \mathcal{L}_{SMFL}(p^{s}, y), \quad s \in \{pos, vel, con\},$$

where $s$ is the mode type, and $y$ is the one-hot class label. Thus the final loss is as follows:

$$\mathcal{L} = \mathcal{L}^{pos} + \mathcal{L}^{vel} + \mathcal{L}^{con}.$$
During inference, only $p^{con}$ is used to predict the final class. Note that the proposed SMFL, as well as the by-product SMCE, is universal for all classification tasks.

4 Experiments

4.1 Datasets and Experimental details

NTU RGB+D.

The NTU RGB+D (NTU) dataset (?) is currently the largest indoor action recognition dataset. It contains 56,000 action clips in 60 action classes performed by 40 subjects. Each clip consists of 25 joint locations with one or two persons. There are two evaluation protocols for this dataset, i.e. cross-subject (CS) and cross-view (CV). For the cross-subject evaluation, 40,320 samples from 20 subjects are used for training and 16,540 samples from the remaining subjects are used for testing. For the cross-view evaluation, samples are split by camera view, with two camera views for training and the remaining one for testing.


Kinetics.

Kinetics (?) is by far the largest unconstrained action recognition dataset; it contains 300,000 video clips in 400 classes retrieved from YouTube. The skeletons are estimated by Yan et al. (?) from the RGB videos with the OpenPose toolbox. Each joint consists of 2D coordinates $(x, y)$ in the pixel coordinate system and a confidence score $c$, and is thus finally represented by a tuple $(x, y, c)$. Each skeleton frame is recorded as an array of 18 tuples. We use the released skeletal dataset to train our model, and evaluate the performance by the top-1 and top-5 accuracies as recommended by Kay et al. (?).

Implementation Details.

During data preparation, we randomly crop sequences with a ratio uniformly drawn from [0.5, 1] for training, and centrally crop sequences with a fixed ratio of 0.95 for inference. Due to the variety in action length, we resize the sequences to a fixed length of 64/128 (NTU/Kinetics) frames with bilinear interpolation along the frame dimension. Finally, the obtained data are fed into a batch normalization layer to normalize the scale.
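The frame-dimension resizing step can be sketched with simple linear interpolation along time (the paper's exact bilinear implementation may differ):

```python
import numpy as np

def resize_sequence(seq, target_len):
    """Linearly interpolate a (T, N, C) skeleton sequence to target_len frames."""
    T = seq.shape[0]
    src = np.arange(T, dtype=float)
    dst = np.linspace(0.0, T - 1.0, target_len)
    flat = seq.reshape(T, -1)
    # Interpolate every joint/channel trajectory independently along time.
    out = np.stack([np.interp(dst, src, flat[:, i]) for i in range(flat.shape[1])],
                   axis=1)
    return out.reshape(target_len, *seq.shape[1:])

rng = np.random.default_rng(5)
clip = rng.standard_normal((47, 25, 3))   # a variable-length action clip
fixed = resize_sequence(clip, 64)         # fixed length used for NTU
print(fixed.shape)  # (64, 25, 3)
```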

Each stream of the model for NTU is composed of 6 blocks in total (Fig.3), with local kernels of 3 and channels of 64, 64, 128, 128, 256, and 256 respectively; max-pooling is applied after every two blocks. For Kinetics, two additional blocks with 512 channels are appended, and the local kernels of the first two blocks are changed to 5. The numbers of new coordinate systems and new joints in the transform network are set to 10 and 64 respectively for both datasets.

During training, we apply the Adam optimizer with a weight decay of 0.0005. The learning rate is initialized to 0.001, followed by an exponential decay with a rate of 0.98/0.95 (NTU/Kinetics) per epoch. Dropout with a ratio of 0.2 is applied to each block to alleviate overfitting. The model is trained for 300/100 (NTU/Kinetics) epochs with a batch size of 32/128 (NTU/Kinetics).

4.2 Experimental Results

To validate the effectiveness of the proposed SLnL-rFA in constrained and unconstrained environments, we perform experiments on the NTU RGB+D and Kinetics datasets, and compare the performance against other state-of-the-art methods. Because there are no previous methods capable of mining patterns in the frequency domain for skeleton-based action recognition, we only compare our method with ones in the spatio-temporal domain.

On NTU RGB+D, we compare with one RNN-based method (?), three LSTM-based methods (???), two CNN-based methods (??), one graph convolutional method (?), and one graph and LSTM hybridized method (?). As the local components of our SLnL block are CNN-based while the non-local components are designed to learn the affinity between each target position (node) and every other position (node) in the feature map (graph), our SLnL-rFA can be treated as a variant of a CNN and graph hybridized method. As shown in Table 1, the CNN-based methods are generally better than RNN- or LSTM-based methods, and graph-based or hybrid graph-based methods also perform well. Our method consistently outperforms the state-of-the-art approaches by a large margin for both cross-subject (CS) and cross-view (CV) evaluations. Specifically, our SLnL-rFA outperforms the best CNN-based method (HCN) by 2.6% (CS) and 3.8% (CV), and also outperforms the recently reported LSTM and graph hybridized method (SR-TSL) by margins of 4.3% (CS) and 2.5% (CV), respectively.

Methods CS CV
H-RNN (?) 59.1 64.0
PA-LSTM (?) 70.3 62.9
ST-LSTM+TG (?) 69.2 77.7
VA-LSTM (?) 79.4 87.6
ST-GCN (?) 81.5 88.3
TS-CNN (?) 83.2 89.3
HCN (?) 86.5 91.1
SR-TSL (?) 84.8 92.4
SLnL-rFA (Ours) 89.1 94.9
Table 1: Comparisons of action recognition accuracy (%) on NTU RGB+D.
Methods top-1 top-5
Feature Enc. (?) 14.9 25.8
Deep LSTM (?) 16.4 35.3
Temporal Conv. (?) 20.3 40.0
ST-GCN (?) 30.7 52.8
SLnL-rFA (Ours) 36.6 59.1
Table 2: Comparisons of recognition accuracy on Kinetics.

On Kinetics, we compare with four characteristic methods, including hand-crafted features (?), deep LSTM network (?), temporal convolutional network (?), and graph convolutional network (?). As shown in Table 2, the deep models outperform the hand-crafted features method, and the CNN-based methods work better than the LSTM-based methods. Our method outperforms the state-of-the-art approach (ST-GCN) by a large margin of 5.9% (top-1) and 6.3% (top-5) for recognition accuracies.

4.3 Ablation Study

To analyze the effectiveness of every proposed component, extensive ablation studies are conducted on NTU RGB+D.

Raw data vs. transformed data.

Transform CS CV
Pos-N 83.1 89.2
Vel-N 82.7 88.4
PosVel-N 84.5 90.6
CNN-V 84.7 90.7
C-T 85.0 91.1
S-T 85.1 90.9
CS-T* 85.5 91.3
Baseline; * Adopted.
Table 3: Comparisons of transform methods.
Loss types CS CV
CE 85.5 91.3
FL(2, -) 85.8 91.9
FL(3, -) 85.6 91.8
SMCE(-, 0.4) 86.4 92.0
SMCE(-, 0.6) 86.2 92.3
SMFL(2, 0.4)* 86.9 92.5
SMFL(2, 0.6)* 86.5 92.6
Baseline; * Adopted.
Table 4: Comparisons of loss functions.

The baseline model (Baseline) of this section contains only the local blocks in Fig.3(b) and is optimized with the CE loss. The proposed coordinate and skeleton transformer (CS-T), the coordinate transformer (C-T), the skeleton transformer (S-T) and a CNN variant with the same depth (CNN-V) are applied to transform the data. Also, three models without a transformer, i.e. using raw position data (Pos-N), raw velocity data (Vel-N), and both of them (PosVel-N), are compared. As shown in Table 3, the performance achieved with transformed data consistently outperforms that with the raw data. The improvements of C-T and S-T indicate that representing actions in adaptive multiple coordinate systems is better than in the original coordinate system, and that the augmented and rearranged data encode more structural information than the raw data. Even with the same depth, the improvement of CNN-V is insignificant, indicating that our improvement is not merely induced by adding depth. Finally, CS-T performs the best, indicating that the coordinate transform and the skeleton transform are complementary to each other.

Comparisons on loss function.

We first strengthen the Baseline with the above CS-T for this section. The model is optimized with the cross-entropy loss (CE), the focal loss (FL), the soft-margin cross-entropy loss (SMCE), and the soft-margin focal loss (SMFL), respectively. To save space, at most the two best parameter settings for each loss are listed in Table 4. Because FL can conduct adaptive data selection, it performs better than CE. Benefiting from the encouraged margins between positive and negative samples, both SMCE and SMFL perform better than their corresponding original versions, i.e. CE and FL. Finally, SMFL performs the best because of its combined advantages of adaptive data selection and intrinsic margin encouraging.

How to select discriminative frequency patterns?

Attention methods CS (%) CV (%)
No Frequency Attention 86.9 92.6
Amplitude FA (aFA) 84.7 89.8
Shared FA (sFA) 87.3 92.9
Dependent FA (dFA) 87.5 93.2
Residual FA (rFA)* 87.7 93.6
Baseline; * Adopted.

Table 5: Comparisons of different attention methods.

We further strengthen the Baseline for this section by adding the SMFL. To validate the effectiveness of the proposed rFA, we compare it with several variants. The amplitude frequency attention (aFA) is built on the frequency spectrum instead of the sinusoidal and cosine components. Shared FA (sFA) learns shared attention parameters for the sinusoidal and cosine components, while dependent FA (dFA) learns two sets of parameters independently. The rFA is formed by applying the residual learning trick to dFA in the spatio-temporal domain (Fig.2). In Table 5, we observe that aFA harms the model, because the phase information is completely lost when only the frequency spectrum is used. The dFA outperforms the sFA because it has more parameters to model the frequency patterns. The rFA finally achieves the best result, outperforming the baseline by a large margin, indicating that frequency information is effective for action recognition.

Comparisons of methods with different affinity fields.

Affinity Field CS (%) CV (%)
Local (N_l=6, N_nl=0) 87.7 93.6
tSLnL (N_nl=1, N_l=5) 88.1 93.9
sSLnL (N_nl=1, N_l=5) 88.0 94.1
SLnL (N_nl=1, N_l=5) 88.3 94.3
SLnL (N_nl=2, N_l=4) 88.6 94.6
SLnL (N_nl=3, N_l=3)* 88.8 94.9
SLnL (N_nl=4, N_l=2) 88.9 94.8
SLnL (N_nl=5, N_l=1)* 89.1 94.7
SLnL (N_nl=6, N_l=0) 88.8 94.7
Baseline; * Adopted.

Table 6: Comparisons of methods with various affinity fields. N_nl denotes the number of non-local (SLnL) blocks and N_l the number of local blocks.

We further strengthen the Baseline with an rFA block for this section. Although non-local dependencies can be captured in the higher layers of hierarchical local networks, we argue that synchronously exploring non-local information in early stages is preferable. We merge one temporal non-local block (tSLnL), spatial non-local block (sSLnL), or spatio-temporal block (SLnL) into the baseline to examine their effectiveness. As shown in Table 6, the non-local information from both the temporal and spatial dimensions during early stages is helpful. In addition, benefiting from the synchronous fusion of local details and non-local semantics, our SLnL boosts the recognition performance. To further investigate the properties of deeper SLnL, we replace local blocks in the baseline with SLnL blocks. Table 6 shows that more SLnL blocks in lower layers generally lead to better results, but the improvement from higher layers is relatively small because the affinity field of local operations increases with depth. The results clearly show that synchronously extracting local details and non-local semantics is vital for modeling the spatio-temporal dynamics of human actions.

5 Conclusion

In this work, we propose a novel method (SLnL-rFA) to model skeleton-based actions from multiple domains, which significantly outperforms the state-of-the-art methods. The SLnL synchronously extracts local details and non-local semantics in the spatio-temporal domain. The rFA adaptively selects discriminative frequency patterns, which sheds new light on exploiting information in the frequency domain for skeleton-based action recognition. In addition, we propose a novel soft-margin focal loss for general recognition tasks, which encourages intrinsic margins in classifiers with a clear probabilistic interpretation, and conducts adaptive data selection without any damage to the original model structure or training process. In the future, we will apply our method to other related computer vision tasks such as general object recognition and event detection.

