Infrared and 3D skeleton feature fusion for RGB-D action recognition
Abstract
A challenge of skeleton-based action recognition is the difficulty of classifying actions with similar motions and object-related actions. Visual cues from other streams help in that regard. RGB data are sensitive to illumination conditions and thus unusable in the dark. To alleviate this issue and still benefit from a visual stream, we propose a modular network (FUSION) combining skeleton and infrared data. A 2D convolutional neural network (CNN) is used as a pose module to extract features from skeleton data. A 3D CNN is used as an infrared module to extract visual cues from videos. Both feature vectors are then concatenated and exploited conjointly using a multilayer perceptron (MLP). Skeleton data also condition the infrared videos, providing a crop around the performing subjects and thus virtually focusing the attention of the infrared module. Ablation studies show that using pre-trained networks on other large scale datasets as our modules and data augmentation yield considerable improvements on the action classification accuracy. The strong contribution of our cropping strategy is also demonstrated. We evaluate our method on the NTU RGB+D dataset, the largest dataset for human action recognition from depth cameras, and report state-of-the-art performances.
I Introduction
Human action recognition is an important computer vision field, with applications ranging from video surveillance and robotics to automated driving systems, among others. It has been studied for decades but remains highly relevant due to its potential applications and recent rapid development [50].

Consumer-grade depth cameras such as the Intel RealSense [20] and Microsoft Kinect [58], coupled with advanced human pose estimation algorithms [38], have allowed 3D skeleton data to be obtained in real time. Key joints of the human body are extracted into a 3D space, providing a high-level representation of an action. Skeleton data are robust to the surrounding environment and illumination variations, and may be generalized to various viewpoints [1], [10], [31], [50]. Earlier works have indicated that key joints are powerful descriptors of human motion [16]. Their low dimensionality and high representation power make skeleton data a prime input for action recognition tasks.
These data opened the door for new action recognition algorithms, which are broadly categorized into RGB and 3D skeleton approaches. However, it has been demonstrated that visual and skeleton inputs can work in symbiosis [32]. Actions with similar body motion, such as writing versus typing on a keyboard, prove difficult to classify with skeleton data only. In this respect, skeleton data might benefit from the visual cues of RGB streams.
Depth cameras offer four different data streams: RGB, depth, infrared (IR) videos and 3D skeleton data. To our knowledge, infrared videos from depth cameras have never been used as an input source for action recognition. We argue that the lack of large scale datasets proposing IR videos in addition to the other streams is in part responsible. Moreover, RGB and IR images are quite similar, the former offering a richer representation of a scene and therefore usually being the better candidate. However, IR remains usable in the dark, which is valuable for security applications where skeleton data alone are insufficient. The recent introduction of large scale datasets like NTU RGB+D [33] and PKU-MMD [27] containing IR videos motivates the evaluation of methods using this stream. Video understanding is a well-studied computer vision task, but modeling spatiotemporal features and long-term dependencies remains an issue.
Another challenge in video action classification is the volume of information. To reduce the complexity of the videos, downscaling the frames is often employed but also comes with a decrease in the quality of the information. Moreover, discriminating clues may only occur in a small portion of the frames, becoming undetectable in the process. An alternative proposal is to focus on regions of interest. Visual attention models are capable of focusing on important cues and disregard other areas [4], [30], [35].
In this work, we intend to address the difficulty of differentiating actions with similar motions with an additional visual stream insensitive to illumination conditions. Furthermore, we evaluate the potential of IR videos as a standalone source. We propose a model fusing video and pose data (FUSION). Pose has a double purpose. It is used as an input stream in its own right and also conditions the IR sequences, providing a crop around the subjects and facilitating the classification. The general outline of the network is illustrated in Fig. 1.
The pose network is an 18-layer ResNet [11] taking as input the entire skeleton sequence. The sequence is mapped to an RGB image which is then rescaled to fit the input size of the CNN. The IR network is a ResNet (2+1)D (R(2+1)D) [45] where a fixed number of random frames taken from evenly spaced subsequences are used as inputs. The features of each module are then fused using a concatenation scheme before proposing a final classification with a multilayer perceptron (MLP).
Our main contributions are as follows:
- We demonstrate the importance of IR streams from depth cameras for human action recognition.
- We propose a fusion network taking skeleton and IR sequences as inputs, which has never been attempted before.
- We perform extensive ablation studies. We isolate the different modules of our model and study their individual representation power. We also evaluate the impact of data augmentation, transfer learning, 2D-skeleton conditioned IR sequences and IR sequence length on the accuracy score.
- We achieve state-of-the-art results compared to methods using different streams.
Codes, documentation and supplementary materials can be found on the project page.
II Related Work
II-A Skeleton-based approaches
Skeleton-based human action recognition has received a lot of attention due to the high-level representation and powerful discriminating nature of skeleton data. Traditional approaches focus on handcrafted features [14], [46], [49]. These could be the dynamics of joint motion, the covariance matrix of joint trajectories [14] or the representation of joints in a Lie group [46]. Design choices prove challenging and result in suboptimal performance. Recent deep-learning methods report improved accuracy. There exist three main frameworks: sequence-based models, image-based models and graph-based models.
Sequence models exploit skeleton data as time series of key joints which are then fed to recurrent neural networks (RNN) [8], [24], [28], [33], [42], [48], [56]. The part-aware long short-term memory (LSTM) RNN [33] uses different memory cells for different regions of the body, then fuses them for the final classification. Similarly, in [8], a bidirectional RNN studies separate body parts individually in earlier layers and conjointly in deeper ones. In an effort to model time and spatial dependencies simultaneously, Liu et al. propose a 2D recurrent model [28]. Recurrent models were among the early deep learning efforts for skeleton-based action recognition. While vastly improving upon the results of traditional methods, they remain insufficient. The sequence length has to be fixed during training, which is not ideal and requires a sampling strategy. Moreover, sequence models tend to be much slower than their image-based counterparts.
Image models represent skeleton data as 2D images which are then used as inputs for convolutional neural networks (CNN) [7], [19], [21], [25], [29], [52]. An intuitive method is to assign the x, y and z coordinates of a skeleton sequence to the channels of an RGB image [7], [25]. Each joint corresponds to a row and each frame to a column, or inversely. Pixel intensity is then normalized between 0 and 255 based on the maximal coordinate values of the dataset [7] or sequence [25]. Other works utilize the relative coordinates between joints to generate multiple images [19]. Wang et al. project the 3D coordinates on orthogonal 2D planes and encode the trajectories into a hue, saturation, value (HSV) space [52]. A model pre-trained on ImageNet [5] is leveraged. A similar approach is used in [12]. More recent works focus on view-invariant transformations [18], [29] or networks [57] with improved results. In [21], a temporal convolutional network is deployed with interpretability of the results as a major objective. CNNs are able to learn from entire sequences rather than sampled frames: the image generated from the skeleton sequence is resized to accommodate the fixed input shape of the CNN, which means an entire sequence can be used at once, an advantage compared to recurrent methods.
Graph neural networks have received a lot of attention as of late due to their effective representation of skeleton data [53]. There exist two main graph model architectures: graph neural networks (GNN), which combine graph and recurrent networks, and graph convolutional networks (GCN), which aim to generalize traditional convolutional networks. Two types of GCNs derive from the latter: spectral and spatial. Spatial GCNs leverage the convolution operator for each node using its nearest neighbors [40]. Yan et al. [54] make the best of the graph representation to learn both spatial and temporal features. Li et al. generalize the graph representation to actional and structural links [26]. In [39], a temporal attention mechanism is adopted to enhance the classification while exploring the co-occurrence relationship between spatial and temporal domains. In [37], length and direction of bones are used in addition to joint coordinates while adapting the topology of the graph. Shi et al. represent skeleton data as a directed acyclic graph based on kinematic dependencies of joints and bones [36]. GCNs report the current state-of-the-art results on benchmark datasets. However, carefully designed CNNs show comparable results [57]. Also, CNNs can be pre-trained on other large scale datasets, which actually improves the performance of image-based skeleton action recognition models [57]. To our knowledge, an ImageNet-style [5] transfer learning is impractical for GCNs.
II-B RGB-based video classification

Traditional approaches focus on handcrafted features in the form of spatiotemporal interest points. Among those, improved Dense Trajectories (iDT) [47], which uses estimated camera movements for feature correction, is considered the state of the art. After the widespread use of deep learning on single images, many attempts have been made to propose benchmarks for video classification.
Soon after [47], two breakthrough papers [17], [41] would form the backbone of future efforts. In [17], Karpathy et al. explore different ways of fusing temporal information using pre-trained 2D CNNs. In [41], handcrafted features, in the form of optical flow, are used symbiotically with the raw video. Two parallel networks compute spatial and temporal features. A few drawbacks include the inability to effectively capture long-range temporal information and the heavy calculations required to compute optical flow.
Later research propositions fall into five main frameworks. Heavy networks, the computation of handcrafted features and the absence of a benchmark for long-term temporal features remain issues. In [45], Tran et al. explore different forms of spatiotemporal convolutions and their impact on video understanding. A (2+1)D convolution block separating spatial and temporal filters allows for a greater non-linearity compared to a standard 3D block with an equivalent number of parameters, as illustrated in Fig. 2. Separating convolutions yields state-of-the-art results on benchmark datasets such as Sports-1M [17], Kinetics [3], UCF101 [43] and HMDB51 [23].
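To make the factorization concrete, the sketch below builds a single (2+1)D block: a 1xkxk spatial convolution, a non-linearity, then a kx1x1 temporal convolution. This is an illustrative sketch in PyTorch, not the authors' implementation; the intermediate width `mid_channels` is left as a free parameter here, whereas R(2+1)D chooses it so the block matches the parameter count of the equivalent full 3D convolution.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Sketch of a (2+1)D block: spatial conv, non-linearity, then temporal conv."""
    def __init__(self, in_channels, out_channels, mid_channels, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, mid_channels, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: a clip of 16 frames at 112x112 resolution
clip = torch.randn(2, 3, 16, 112, 112)
block = Conv2Plus1D(in_channels=3, out_channels=64, mid_channels=45)
print(block(clip).shape)  # torch.Size([2, 64, 16, 112, 112])
```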
II-C Mixed inputs action recognition
Depth cameras provide different streams, in other words, different representations of the same action. Some works have attempted to improve classification by combining streams. It can be argued that skeleton-based approaches prove most effective at discriminating actions with broad movements. However, for actions involving similar joint positions and trajectories, such as reading versus playing on a phone, skeleton-based models do not perform as well. Visual streams can provide important cues such as the type of object held. RGB and depth streams have been studied extensively. However, to our knowledge, we are the first to use IR data from depth cameras for action recognition.
In [13], [34], [51], the complementary role of RGB and depth is demonstrated. In [59], pose, motion and raw RGB images are fed to three parallel 3D CNNs. Although visual information greatly improves upon the pose baseline, results are only comparable with the then state-of-the-art methods using skeleton data alone. In [32], human-object interactions are modeled using both skeleton and depth data. An end-to-end network is proposed to learn view-invariant representations of skeleton data and held objects. Once again, visual information increases the accuracy, but the results do not justify the complexity of a fusion approach compared to the skeleton-only approaches of the time. The same year, Baradel et al. used RGB and skeleton data conjointly in a pertinent way [2]. Pose information is used as an input but also conditions the RGB stream. The 3D skeleton data are projected onto the RGB sequences to extract crops around the hands of the subject, serving as another input. The RGB stream thus provides important cues about a held object and inter-subject interactions, significantly improving the results. This work shows that not all body parts need to be focused on, unlike the approach in [32]. But it requires as many streams as there are hands, which is memory inefficient. Furthermore, when the hands are close together, the information provided may be redundant.
III Proposed Model
We design a deep neural network using skeleton and IR data, called “Full Use of Infrared and Skeleton in Optimized Network” (FUSION). The network consists of two parallel modules and an MLP. One module interprets skeleton data, the other IR videos. The features extracted from each individual stream are then fused using a concatenation scheme. The MLP is used as the final module and outputs a probability distribution over the action classes. The network is trained in an end-to-end fashion by optimizing the classification score.
We note $S = \{s_{j,t,c}\}$ a skeleton sequence, where $j$ denotes a joint index, $t$ a frame index and $c$ a coordinate axis ($x$, $y$ and $z$). We note $V = \{v_t\}$ a sampled IR sequence, as detailed in Section III-B3, where $t$ is taken between $1$ and $T$, with $T$ the number of sampled frames.
In the following sections, we present the individual modules of our FUSION model: a 2D CNN as the pose module, a 3D CNN as the IR module and an MLP as the stream fusion module.
III-A Pose module

A skeleton sequence requires careful treatment for optimal results. First, a skeleton sequence is normalized to be position invariant, meaning the distance between the subject and the camera is accounted for. The sequence is then transcribed to an RGB image, with multi-subject interactions in mind. The handcrafted RGB image is then fed to a 2D CNN.
Prior normalization step
Each skeleton sequence is normalized by translating the global coordinate system of the camera to a local coordinate system corresponding to a key joint of the main subject. We choose the middle of the spine as the new origin. This is illustrated in Fig. 3.
We adopt a sequence-wise normalization. In other words, the translation vector is computed for the first frame and applied to each subsequent frame, meaning the subject may move away from the new local coordinate system, as follows:
$$\hat{S}_{:,\,:,\,c} = S_{:,\,:,\,c} - s_{j_{spine},\,1,\,c} \quad (1)$$
where $\hat{S}$ is the normalized skeleton sequence and $j_{spine}$ corresponds to the middle of the spine for the Kinect 2 skeleton [58]. The “:” notation signifies that all values are considered across this dimension.
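As an illustration, the following minimal NumPy sketch performs this sequence-wise centering, assuming the sequence is stored as a (joints, frames, 3) array; the spine-middle joint index used below is an assumption based on the Kinect 2 joint layout.

```python
import numpy as np

SPINE_MID = 1  # assumed index of the "middle of the spine" joint (Kinect 2 layout)

def center_sequence(skeleton):
    """Translate a (J, T, 3) skeleton sequence to the local frame of the main subject.

    The translation vector is taken from the first frame only, so the subject
    may still drift away from the new origin over time (sequence-wise scheme).
    """
    origin = skeleton[SPINE_MID, 0, :]           # (3,) position at frame 0
    return skeleton - origin[None, None, :]      # broadcast over joints and frames

# Example: 25 joints, 80 frames
seq_hat = center_sequence(np.random.rand(25, 80, 3))
assert np.allclose(seq_hat[SPINE_MID, 0], 0.0)   # the new origin is at the spine
```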
Skeleton data to skeleton 2D maps
A skeleton sequence is mapped to an image similar to [7], a skeleton map. Each coordinate axis, $x$, $y$ and $z$, is attributed to a channel of an RGB image. Each key joint corresponds to a row while the columns represent the different frames.
We apply a dataset-wise normalization [7]. We note $c_{min}$ and $c_{max}$ the minimal and maximal values of the coordinates after the normalization step, computed over the entire dataset. The pixels of the skeleton map are recalculated using a min-max strategy in the $[0, 255]$ range, as follows:

$$M_{:,\,:,\,c} = 255 \times \frac{\hat{S}_{:,\,:,\,c} - c_{min}}{c_{max} - c_{min}} \quad (2)$$
where $M$ is the normalized skeleton map, with $c$ denoting both the coordinate axis and the image channel.
To accommodate the fixed input size of the 2D CNN, the skeleton map is resized to a standard size.
Multi subject strategy

Our network is scalable to multiple subjects. We concatenate the different skeleton maps across the joint dimension. With $J$ being the total number of joints per subject, the first $J$ rows correspond to the first subject, the subsequent $J$ rows to the second subject, and so on. We limit the number of subjects to two, corresponding to the maximum of the NTU RGB+D dataset [33]. Nonetheless, this method may be generalized to a greater number of subjects. Should the skeleton sequence comprise only one subject, the rows of the second subject are set to zero.
In case of multiple subjects, the coordinates of the additional subjects are translated to the local coordinate system of the main subject (Fig. 4).
The advantages of our method are manifold. Firstly, this alleviates the need for individual networks for different subjects. Secondly, this representation allows a second subject to still intervene if its skeleton is detected after the first frame. Thirdly, the distance information is kept, as the coordinates of each subject are translated to the local coordinate system of the first subject. Lastly, the skeleton map is resized to a standard size to accommodate the fixed input size of the pose module. This implies that the network is able to learn from raw sequences of different lengths.
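The map construction and the two-subject stacking described above can be sketched as follows, using NumPy and Pillow for illustration; the dataset-wide bounds c_min and c_max are assumed to be computed offline, and this is not the authors' code.

```python
import numpy as np
from PIL import Image

def skeleton_map(subject1, subject2, c_min, c_max, out_size=224):
    """Build an RGB skeleton map from up to two centered (J, T, 3) skeleton sequences.

    Rows are joints (both subjects stacked), columns are frames, channels are x/y/z.
    Pass subject2=None when only one subject is present (its rows are set to zero).
    """
    if subject2 is None:
        subject2 = np.zeros_like(subject1)
    stacked = np.concatenate([subject1, subject2], axis=0)          # (2J, T, 3)

    # Dataset-wise min-max normalization to the [0, 255] range, as in (2)
    pixels = np.clip(255.0 * (stacked - c_min) / (c_max - c_min), 0, 255)

    # Resize to the fixed input size expected by the 2D CNN
    img = Image.fromarray(pixels.astype(np.uint8))
    return np.asarray(img.resize((out_size, out_size)))

# Example with random data: 25 joints per subject, 60 frames
s1, s2 = np.random.randn(25, 60, 3), np.random.randn(25, 60, 3)
print(skeleton_map(s1, s2, c_min=-3.0, c_max=3.0).shape)            # (224, 224, 3)
```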
CNN used
The transformed skeleton map is used as input. We use an existing CNN with weights pre-trained on ImageNet, as we find this improves the classification score even when the images are handcrafted. We choose an 18-layer ResNet [11] for its compromise between accuracy and speed.
We extract a pose feature vector $f_P$ from the skeleton map $M$ with the pose module $h_P$ with parameters $\theta_P$ (3). Here, and for the rest of the paper, subscripts of modules and parameters refer to a module, not an index.

$$f_P = h_P(M;\, \theta_P) \quad (3)$$
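A possible instantiation of the pose module, assuming torchvision's ImageNet-pretrained ResNet-18 (torchvision >= 0.13 API) with its classification head removed so that the 512-dimensional feature vector $f_P$ is exposed; this is a sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseModule(nn.Module):
    """ResNet-18 backbone returning a 512-d feature vector from a skeleton map."""
    def __init__(self, pretrained=True):
        super().__init__()
        # "DEFAULT" loads the ImageNet weights (torchvision >= 0.13)
        backbone = models.resnet18(weights="DEFAULT" if pretrained else None)
        backbone.fc = nn.Identity()      # drop the classifier, keep the 512-d features
        self.backbone = backbone

    def forward(self, skeleton_maps):    # (batch, 3, 224, 224)
        return self.backbone(skeleton_maps)   # (batch, 512)

pose_net = PoseModule(pretrained=False)       # set True to download the weights
print(pose_net(torch.randn(4, 3, 224, 224)).shape)   # torch.Size([4, 512])
```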
III-B IR module
The action performed by a subject is only a small region inside the frames of an IR sequence. The 2D skeleton data are used to capture the region of interest and virtually focus the attention of the network, with multiple potential subjects in mind. Because the IR module requires a video input with a fixed number of frames, a subsampling strategy is deployed. A 3D CNN is used to exploit the IR data.
Cropping strategy

Traditionally, 3D CNNs require a lot of parameters to account for the complex task of video understanding. Thus, the frames are heavily downscaled to reduce memory needs. In the process, discriminating information may be lost. In an action video of daily activities, the background provides little to no context. We would like our model to focus only on the subject, as this is where the action happens. We argue that a crop around the subject provides ample cues about the action performed. Depth information, coupled with pose estimation algorithms, provides a turnkey solution for human detection. We propose a cropping strategy, shown in Fig. 5 by a green parallelepiped, to virtually force the model to focus on the subject.
Given a 3D skeleton sequence projected onto the 2D frames of the IR stream, we extract the maximal and minimal pixel positions across all joints and frames. This creates a fixed bounding box capturing the subject over the spatial and temporal domains. We empirically choose a 20-pixel offset to account for potential skeleton inaccuracy. The IR stream is padded with zeros should the box coordinates with the offset exceed the IR frame range.
The advantages of our method are as follows. Providing a crop around the region of interest reduces the size of the frames without decreasing the quality. The downscaling factor is thus smaller, which better preserves the aspect of the image. Furthermore, it alleviates the need for an attention mechanism, as the cropping strategy may be seen as a hard attention scheme in itself. Also, the network does not have to learn information from the background, which is noise in our case, as it is reduced to a minimum.
Multi subject strategy
The cropping strategy can be generalized to multiple subjects. The bounding box is enlarged to account for the other subjects. We take the maximal and minimal values across all joints, frames and subjects.
For a given sequence, the bounding box is immobile regardless of the number of subjects. This preserves the camera dynamics: we do not want to add confusion to the sequence by introducing a virtual camera movement with a mobile bounding box.
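A minimal sketch of the cropping step, assuming the 2D pixel coordinates of the joints (projected from the 3D skeleton onto the IR frames, over all subjects, joints and frames) are already available; the exact projection and padding details of the original implementation may differ.

```python
import numpy as np

def crop_sequence(frames, joints_2d, offset=20):
    """Crop an IR sequence around the subjects with a fixed, offset bounding box.

    frames:    array (T, H, W) of IR frames.
    joints_2d: array (..., 2) of (column, row) pixel coordinates over all
               subjects, joints and frames of the sequence.
    """
    T, H, W = frames.shape
    cols, rows = joints_2d[..., 0], joints_2d[..., 1]
    left, right = int(cols.min()) - offset, int(cols.max()) + offset
    top, bottom = int(rows.min()) - offset, int(rows.max()) + offset

    # Zero-pad the frames so the (possibly out-of-range) box can always be sliced
    pad_l, pad_r = max(0, -left), max(0, right - W)
    pad_t, pad_b = max(0, -top), max(0, bottom - H)
    padded = np.pad(frames, ((0, 0), (pad_t, pad_b), (pad_l, pad_r)))
    return padded[:, top + pad_t:bottom + pad_t, left + pad_l:right + pad_l]

# Example: 60 IR frames of 424x512 pixels, 2 subjects x 25 joints
ir = np.random.rand(60, 424, 512)
joints = np.stack([np.random.uniform(100, 300, (2, 25, 60)),    # columns
                   np.random.uniform(80, 250, (2, 25, 60))],    # rows
                  axis=-1)
print(crop_sequence(ir, joints).shape)   # (60, box_height, box_width)
```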
Sampling strategy
Contrary to the pose network, a given IR sequence is not treated in its entirety. A 3D CNN requires a sequence with a fixed number of frames $T$. Choices must be made regarding the value of $T$ and the sampling strategy. A potential approach would be to take $T$ adjacent frames in a sequence, but such a subsequence might not be enough to correctly capture the essence of the action. Instead, we propose a scheme where the raw sequence is divided into $T$ windows of equal duration, similar to [28], as illustrated in Fig. 6. A random frame is taken from each window, creating a new sequence of length $T$. This is a form of data augmentation, as a raw sequence may yield different sampled sequences.
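The window-based sampling can be sketched as follows; the implementation details (NumPy, boundary handling) are assumptions, but the principle matches the description above: one random frame per window.

```python
import numpy as np

def sample_frames(sequence_length, T=20, rng=None):
    """Pick one random frame index from each of T equal-duration windows."""
    rng = rng if rng is not None else np.random.default_rng()
    edges = np.linspace(0, sequence_length, T + 1).astype(int)   # window boundaries
    return np.array([rng.integers(lo, max(lo + 1, hi))
                     for lo, hi in zip(edges[:-1], edges[1:])])

# Example: a 93-frame IR clip reduced to T = 20 frames
indices = sample_frames(93, T=20, rng=np.random.default_rng(0))
print(indices)   # 20 increasing indices, one per window; re-sampling gives a new set
```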


3D CNN used
The new sampled sequences are used as inputs for the 3D CNN. We use an 18-layer deep R(2+1)D network [45] pre-trained on Kinetics-400 [3]. R(2+1)D is an elegant network which revisits 3D convolutions. Tran et al. showed that factoring spatial and temporal convolutions yields state-of-the-art results on benchmark RGB action recognition datasets. Separating spatial and temporal convolutions with a nonlinear activation function in between allows for a more complex function representation with the same number of parameters.
We extract a stream feature vector $f_{IR}$ from the sampled IR sequence $V$ with the IR module $h_{IR}$ with parameters $\theta_{IR}$, as follows:

$$f_{IR} = h_{IR}(V;\, \theta_{IR}) \quad (4)$$
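A possible instantiation of the IR module, assuming torchvision's r2plus1d_18 pretrained on Kinetics-400 (torchvision >= 0.13 API), with the single-channel IR frames duplicated to three channels as described later in Section IV-A; again a sketch rather than the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

class IRModule(nn.Module):
    """R(2+1)D-18 backbone returning a 512-d feature vector from an IR clip."""
    def __init__(self, pretrained=True):
        super().__init__()
        # "DEFAULT" loads the Kinetics-400 weights (torchvision >= 0.13)
        backbone = r2plus1d_18(weights="DEFAULT" if pretrained else None)
        backbone.fc = nn.Identity()      # drop the classifier, keep the 512-d features
        self.backbone = backbone

    def forward(self, ir_clips):         # (batch, 1, T, 112, 112), single-channel IR
        rgb_like = ir_clips.repeat(1, 3, 1, 1, 1)   # duplicate to 3 channels
        return self.backbone(rgb_like)              # (batch, 512)

ir_net = IRModule(pretrained=False)
print(ir_net(torch.randn(2, 1, 20, 112, 112)).shape)   # torch.Size([2, 512])
```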
III-C Stream fusion
Both pose and IR modules output their own feature vectors. An MLP serves as the final module and returns a probability distribution for each action class in a dataset.
Features of both streams are fused using a concatenation scheme. The MLP consists of three layers with batch normalization [15] before computation. The ReLU activation function is used for all neurons. Lastly, a softmax activation function is deployed to normalize the last layer’s output into a probability distribution.
The class probability distribution $\hat{y}$ is outputted by the MLP $h_{MLP}$ with parameters $\theta_{MLP}$ (5). Inputs $f_P$ and $f_{IR}$ correspond to the feature vectors computed by the pose and IR modules, and $\oplus$ denotes concatenation.

$$\hat{y} = h_{MLP}(f_P \oplus f_{IR};\, \theta_{MLP}) \quad (5)$$
We tried a scheme where the pose and IR modules of our network would each emit their own prediction. We would then average the predictions at the logit level, with weights learned during backpropagation. However, this led the network’s final classification to be attributed solely to one module or the other. Instead, we believe that an MLP allows the features of the different streams to be interpreted conjointly.
IV Network Architecture
IV-A Architecture
Pose module
The pose network is an 18-layer deep ResNet [11]. The network takes as input a tensor of dimensions 3x224x224, where 3 corresponds to the RGB channels and 224 to the height and width of the image. The output, $f_P$, is a 1D vector of 512 features.
IR module
The IR network is an 18-layer deep R(2+1)D [45]. It takes as input a video of dimensions 3xTx112x112, where 3 corresponds to the RGB channels, $T$ to the length of the sequence and 112 to the height and width of the frames. The output, $f_{IR}$, is a 1D vector of 512 features.
To be able to leverage the pre-trained R(2+1)D CNN, which is originally trained on RGB images, the single-channel grayscale IR frames are duplicated across three channels.
Classification module
The classification module is an MLP network with three layers. The first layer expects a vector of 1024 features and comprises 256 units. The second layer consists of 128 units. The last layer has as many units as there are different action classes in a dataset. Finally, the softmax function is used to normalize the predictions to a probability distribution. Batch normalization is applied before the layers. A dropout scheme has been tested in place of batch normalization but was not found to be superior. The ReLU activation function is used for all layers except the last.
The entire network is detailed in Fig. 7.
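A sketch of the classification module with the layer sizes given above (the use of PyTorch is an assumption). It returns the softmax probabilities of (5); for training with a cross-entropy criterion, one would typically keep the raw logits and apply the softmax only at inference.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuse pose and IR features and output class probabilities (sketch)."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            # Batch normalization before each layer, ReLU on all but the last
            nn.BatchNorm1d(2 * feat_dim), nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.BatchNorm1d(256), nn.Linear(256, 128), nn.ReLU(),
            nn.BatchNorm1d(128), nn.Linear(128, num_classes),
        )

    def forward(self, f_pose, f_ir):
        fused = torch.cat([f_pose, f_ir], dim=1)        # (batch, 1024)
        return torch.softmax(self.mlp(fused), dim=1)    # class probabilities

clf = FusionClassifier(num_classes=60)
probs = clf(torch.randn(4, 512), torch.randn(4, 512))
print(probs.shape, float(probs.sum(dim=1)[0]))          # (4, 60), sums to ~1.0
```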
IV-B Data augmentation
To prevent overfitting and reinforce the generalization capabilities of our model, we perform data augmentation during training.
The skeleton sequences have limited viewpoints, but their representation makes them excellent candidates for augmentation through geometric transformations. The skeleton sequences are augmented by performing a random rotation around the $x$, $y$ and $z$ axes. For each sequence during training, we apply a random rotation on each axis, with the angle drawn uniformly within fixed bounds.
We approach IR data augmentation with the following scheme. For each sequence during training, we perform a horizontal mirroring transformation on the frames with a 50% probability. The two streams are augmented independently.
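Both augmentations can be sketched as follows in NumPy. The rotation bound of 15 degrees used in the example is an assumed placeholder, since the exact angle range is not specified in this excerpt.

```python
import numpy as np

def rotate_skeleton(skeleton, max_angle_deg, rng):
    """Randomly rotate a (J, T, 3) skeleton sequence around the x, y and z axes."""
    ax, ay, az = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, size=3))
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return skeleton @ (rot_z @ rot_y @ rot_x).T      # rotate every (x, y, z) joint

def flip_ir(frames, rng):
    """Horizontally mirror a (T, H, W) IR clip with 50% probability."""
    return frames[:, :, ::-1].copy() if rng.random() < 0.5 else frames

rng = np.random.default_rng(0)
aug_skel = rotate_skeleton(np.random.rand(25, 80, 3), max_angle_deg=15, rng=rng)
aug_ir = flip_ir(np.random.rand(20, 112, 112), rng=rng)   # streams augmented independently
```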
IV-C Training
V Experiments
We evaluate the performances of our proposed model on the NTU RGB+D dataset, the largest benchmark to date [33]. We also perform extensive ablation studies to understand the individual contributions of our modules.
V-A NTU RGB+D dataset
The NTU RGB+D dataset is the largest human action recognition dataset to date captured with a Microsoft Kinect V2 [58]. To our knowledge, it is also the only one including IR sequences. It contains 60 different classes ranging from daily to health-related actions, spread across 56,880 clips and 40 subjects, and includes 80 different views. An action may require up to two subjects. The various setups, views and orientations result in great diversity, which makes NTU RGB+D a challenging dataset.
There are two benchmark evaluations for this dataset: Cross-Subject (CS) and Cross-View (CV). The former splits the 40 subjects into training and testing groups. The latter uses the samples acquired from cameras 2 and 3 for training while the samples from camera 1 are used for testing.
V-B Experimental settings
TABLE I: Best results of the pose module alone on NTU RGB+D (accuracy, %).

Method | Pose | IR | CS | CV
---|---|---|---|---
Pose network | X | - | 82.3 | 89.5

TABLE II: Best results of the IR module alone on NTU RGB+D (accuracy, %).

Method | Pose | IR | CS | CV
---|---|---|---|---
IR network | - | X | 89.8 | 94.1
For consistency, we do not modify the following hyperparameters across experiments. We set the batch size to 16, which allows the model and a batch to fit on most high-end GPUs. Gradient clipping with a threshold of 10 is used to avoid exploding gradients. The Adam optimizer [22] is used to train the networks, with a learning rate of 0.0001 kept constant during training.
The pose and IR modules each require a fixed input size. Skeleton maps are resized to 224x224 images. IR frames are resized to 112x112.
To ensure consistency and reproducibility, we use a pseudorandom number generator fed with a fixed seed. Following [33], we sample 5% of the training set as our validation set.
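A sketch of a training step with these settings (batch size 16, Adam at a learning rate of 1e-4, gradient-norm clipping at 10). The use of PyTorch and of a cross-entropy loss on logits are assumptions, as the loss function itself is not detailed in this excerpt.

```python
import torch
import torch.nn as nn

BATCH_SIZE, LEARNING_RATE, CLIP_NORM = 16, 1e-4, 10.0   # values from Section V-B

def train_one_epoch(model, optimizer, loader, device="cuda"):
    """One epoch of end-to-end training of a FUSION-style model returning logits."""
    criterion = nn.CrossEntropyLoss()    # assumed loss, not stated in this excerpt
    model.train()
    for skeleton_maps, ir_clips, labels in loader:
        optimizer.zero_grad()
        logits = model(skeleton_maps.to(device), ir_clips.to(device))
        loss = criterion(logits, labels.to(device))
        loss.backward()
        # Clip the gradient norm at 10 to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        optimizer.step()

# Typical setup (model and dataset are placeholders):
# optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
```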
TABLE III: Influence of pre-training (P) and of the cropping strategy (C) on the individual modules (accuracy, %).

Method | Pose | IR | CS | CV
---|---|---|---|---
Pose module | X | - | 78.7 | 85.1
Pose module - P | X | - | 80.7 | 87.0
IR module | - | X | 76.8 | 76.3
IR module - P | - | X | 84.0 | 84.6
IR module - C | - | X | 84.6 | 88.6
IR module - CP | - | X | 90.1 | 91.2
TABLE IV: Influence of data augmentation (A) on the pre-trained (P) modules and on the FUSION network; C denotes cropped IR sequences (accuracy, %).

Method | Pose | IR | CS | CV
---|---|---|---|---
Pose module - P | X | - | 80.7 | 87.0
Pose module - PA | X | - | 82.3 | 89.5
IR module - P | - | X | 84.0 | 84.6
IR module - PA | - | X | 85.0 | 87.5
IR module - CP | - | X | 90.1 | 91.2
IR module - CPA | - | X | 89.8 | 94.1
FUSION - CP | X | X | 90.8 | 94.0
FUSION - CPA | X | X | 91.6 | 94.5
TABLE V: Influence of the cropping strategy (C) on the IR module; P: pre-training, A: data augmentation (accuracy, %).

Method | Pose | IR | CS | CV
---|---|---|---|---
IR module | - | X | 76.8 | 76.3
IR module - C | - | X | 84.6 | 88.6
IR module - PA | - | X | 85.0 | 87.5
IR module - CPA | - | X | 89.8 | 94.1
V-C Ablation studies
In this section, we isolate the pose and IR modules and study their individual contributions with regard to different parameters. Action classification accuracy on the NTU RGB+D dataset is used as the comparison metric. We evaluate the impact of transfer learning, data augmentation, pose conditioning of IR sequences and the number of sampled frames $T$. Finally, we compare our results with the current state of the art.
TABLE VI: Influence of the number of sampled frames $T$ on the IR module and the FUSION network (accuracy, %).

Method | Pose | IR | CS, T=8 | CS, T=12 | CS, T=16 | CS, T=20 | CV, T=8 | CV, T=12 | CV, T=16 | CV, T=20
---|---|---|---|---|---|---|---|---|---|---
IR module - CPA | - | X | 86.8 | 89.5 | 90.0 | 89.8 | 88.8 | 91.3 | 93.0 | 94.1
FUSION - CPA | X | X | 88.7 | 90.4 | 90.3 | 91.6 | 92.4 | 94.4 | 94.3 | 94.5
Pose module
We evaluate the performances of our pose module as a standalone. The IR module does not intervene. We also adjust the input size of the classification MLP. Optimal results are achieved by combining pre-training with data augmentation. Table I shows the best results of the pose module on NTU RGB+D: 82.3% on CS and 89.5% on CV.
The CV benchmark is a much easier task, hence the better results compared to CS. The test actions are already seen during training but from a different point of view with a different camera. Although the different setups yield different joint position estimations for a given sequence [57], the geometric nature of skeleton data allows for a better generalization. This is not the case for the CS task as the test sequences are completely novel. Consequently, the following discussions will only address the CS benchmark.
The confusion matrix reveals the pose module’s strong ability to correctly classify actions with intense kinetic movements. Actions such as sitting down, standing up, falling, jumping, staggering, and walking toward or away from another subject are classified with over 95% accuracy. Unsurprisingly, actions with similar skeleton motions prove the most challenging. Writing is the trickiest, with only 40% accuracy, and is often mislabeled as typing on a keyboard. The incorrectly classified actions fall under two categories: similar-motion actions and object-related actions. We believe this will always be a limitation of pose-only networks.
Infrared module
The other part of the FUSION network, and arguably the most important contributor, is the infrared module. In a similar fashion as above, the input size of the MLP is adjusted while keeping the number of neurons equal. Optimal results are achieved with a pre-trained network, with data augmentation, on pose-conditioned inputs, for a sequence length of $T = 20$. Table II shows the performance of the IR module as a standalone: 89.8% on CS and 94.1% on CV.
The confusion matrix reveals a more balanced accuracy score over the different actions of the NTU RGB+D dataset. Some actions, such as touching another person’s pocket or staggering, prove more difficult to recognize for the IR module compared to the pose module. This reinforces our intuition that pose and visual streams are complementary. However, some object-oriented actions are still difficult to discern correctly. For instance, writing is more often than not mislabeled as playing with a phone. We propose two possible explanations. Firstly, the object information might be lost during the rescaling process, even with our cropping strategy in place. Secondly, the grayscale and noisy nature of IR frames might not be clear enough to discern the object correctly. But other object-related actions, such as dropping an object or brushing hair, see an impressive improvement of over 10%.
Influence of pre-training
Pre-training a network is an elegant way to transfer a learned task to a new one. It has been shown to provide impressive results even on handcrafted images [57]. Furthermore, it helps with the overfitting issue smaller datasets may demonstrate.
We evaluate the impact of this strategy on our network. Table III shows the effect of pre-training on the different modules.
The pose network enjoys a noticeable increase in accuracy of about 2% for both benchmarks (78.7% to 80.7% on CS). It is pre-trained on ImageNet, which consists of real-life images. The skeleton maps used as inputs are handcrafted. Even then, a pre-training scheme shows encouraging results.
The impact of pre-training on the IR module’s accuracy is significant. For uncropped sequences, the accuracy increases by about 7% for both benchmarks (76.8% to 84.0% on CS). For cropped sequences, the gain is over 5% for the cross-subject benchmark (84.6% to 90.1%) and almost 3% for cross-view (88.6% to 91.2%).
The greater contribution of transfer learning for the IR module compared to the pose module might be explained by the greater resemblance of IR vs. RGB videos compared to handcrafted vs. real-life images. Nonetheless, such findings further emphasize the power of transfer learning.
Influence of data augmentation
Data augmentation consists of virtually enlarging the dataset, thus hopefully preventing overfitting and reducing variance between training and test sets. We perform augmentation for the different data streams. Table IV shows the performances of data augmentation on the different modules with pre-trained networks. Overall, data augmentation yields favorable results.
The pose module alone enjoys an increase of about 2% accuracy for both benchmarks (80.7% to 82.3% on CS). The IR module alone seems to benefit more from data augmentation on the CV benchmark than on the CS benchmark. For the CV benchmark, the increase is about 3% whether the input sequence is cropped (91.2% to 94.1%) or not (84.6% to 87.5%). For the CS benchmark, the improvements are not significant. When the modules are fused into our FUSION network, data augmentation remains favorable but the gain is smaller, below 1% (90.8% to 91.6% on CS). However, this is to be expected: as the baseline results increase, the gains diminish.
Transfer learning vs. data augmentation
Transfer learning and data augmentation are two strategies to improve the generalization of a network. Transfer learning leverages the parameters learned on another dataset, while data augmentation virtually enlarges the current dataset. A small dataset might lead to overfitting, which results in a growing gap between training and validation performance as the training error continues to decrease.
Our model is able to reach a negligible training error, even with individual modules, showcasing an overfitting issue. Having studied the impact of both methods on performance, transfer learning shows much better results. This might be explained by the already large size of the NTU RGB+D dataset, which mitigates the potential of data augmentation. Nonetheless, it is remarkable how a model can yield vastly different performances based on the initialization of its parameters. The black box nature of deep learning makes it difficult to interpret how a model learns. Perhaps future works will focus on understanding the internal representation of a network to guide its learning rather than implementing ever more complex models.
Influence of pose-conditioned cropped IR sequences
In this section, we evaluate the impact of our cropping strategy, detailed in Section III-B1, on the performance of the IR module as a standalone. Table V shows a significant increase in performance.
Our baseline for this comparison, the IR module without transfer learning and data augmentation on uncropped sequences, reports unsatisfactory results (76.8% on CS). With transfer learning and data augmentation, we are able to increase the accuracy by almost 10% on average across both benchmarks (76.8% to 85.0% on CS). However, we find that our cropping strategy alone reaps similar benefits (76.8% to 84.6% on CS). When combining all three strategies, we further improve the classification score by about 5% (89.8% on CS). The average gain across both benchmarks is thus above 15%, which is considerable.
We demonstrate the power of a pragmatic approach. An identical model is able to perform significantly better thanks to careful design choices.
Influence of sequence length
Sequences of the NTU RGB+D dataset are at most a couple of seconds long. We study the impact of the length $T$ of the sampled IR sequence on the classification performance of two networks: the IR module alone and the complete FUSION model. Both models are pre-trained and fed with augmented data. The IR sequences are pose-conditioned. Table VI reports the impact of different values of $T$ on the accuracy score.
As a general tendency, the greater the value of $T$, the better the results. Best results are achieved for $T = 20$ in three out of four scenarios (on CS: 89.8% for the IR module alone and 91.6% for FUSION). The exception is the IR module as a standalone on the CS benchmark, where the optimal value is $T = 16$ (90.0%). However, the difference in accuracy is negligible. For the FUSION network, excellent results are achieved with as few as $T = 12$ frames (90.4% on CS and 94.4% on CV). FUSION networks with a smaller value of $T$ are much faster, showcasing a trade-off between speed and accuracy.
Comparison with the state of the art
TABLE VII: Comparison with the state of the art on NTU RGB+D (accuracy, %).

Method | Pose | RGB | Depth | IR | CS | CV
---|---|---|---|---|---|---
Lie Group [46] | X | - | - | - | 50.1 | 82.8
HBRNN [8] | X | - | - | - | 59.1 | 64.0
Deep LSTM [33] | X | - | - | - | 60.7 | 67.3
PA-LSTM [33] | X | - | - | - | 62.9 | 70.3
ST-LSTM [28] | X | - | - | - | 69.2 | 77.7
STA-LSTM [42] | X | - | - | - | 73.4 | 81.2
VA-LSTM [56] | X | - | - | - | 79.2 | 87.7
TCN [21] | X | - | - | - | 74.3 | 83.1
C+CNN+MTLN [19] | X | - | - | - | 79.6 | 84.8
Synthesized CNN [29] | X | - | - | - | 80.0 | 87.2
3scale ResNet [25] | X | - | - | - | 85.0 | 92.3
DSSCA-SSLM [34] | - | X | X | - | 74.9 | -
[32] | X | - | X | - | 75.2 | 83.1
CMSN [59] | X | X | - | - | 80.8 | -
STA-HANDS [2] | X | X | - | - | 84.8 | 90.6
Coop CNN [51] | - | X | X | - | 86.4 | 89.0
ST-GCN [54] | X | - | - | - | 81.5 | 88.3
DGNN [36] | X | - | - | - | 89.9 | 96.1
Pose module - PA | X | - | - | - | 82.3 | 89.5
IR module - CPA | - | - | - | X | 89.8 | 94.1
FUSION - CPA | X | - | - | X | 91.6 | 94.5
We compare our FUSION model with the state of the art (Table VII). We divide current methods into five frameworks: handcrafted features, RNN-based methods, CNN-based methods, fusion methods and GCN-based methods. The current best results are obtained using skeleton data only with GCNs. We achieve better results than the current state of the art on the CS benchmark (91.6%), with a 1.7% accuracy increase. On the CV benchmark, results are comparable (94.5% for FUSION against 96.1% for DGNN [36]). We conclude that IR data can be used effectively to interpret human actions.
We significantly improve upon current fusion methods, once again validating the complementary role of pose and visual data.
VI Conclusion
We propose an end-to-end trainable network using skeleton and infrared data for human action recognition. A pose module extracts features from skeleton data and an infrared module learns from videos. The 3D skeleton is used as an input source and also conditions the infrared stream, providing a crop around the subjects. The two stream features are then concatenated, and a final prediction is outputted. The pose and infrared modules report strong individual performances, which is largely due to the power of transfer learning, as they are both pre-trained on other large scale datasets. When working in symbiosis, the results are further improved. We are the first to conjointly use pose and infrared streams. Our method improves the state of the art on the largest RGB-D action recognition dataset to date. Compared to other stream fusion approaches, our method requires less preprocessing and is more memory efficient.
Our work demonstrates the strong representational power of infrared data, which opens the door for applications where illumination conditions render RGB videos unusable. The complementary role of pose and visual streams is further illustrated, which is in line with previous work. Given the modular nature of our proposed network, future works could focus on more modern pose modules such as graph neural networks.
Acknowledgment
This work was supported by research grants from the Natural Sciences and Engineering Research Council of Canada and an industrial funding from Aerosystems International Inc. The authors would also like to thank their collaborators from Aerosystems International Inc.
References
- (2014) Human activity recognition from 3D data: a review. Pattern Recognition Letters 48, pp. 70–80.
- (2017) Pose-conditioned spatio-temporal attention for human action recognition. arXiv preprint arXiv:1703.10106.
- (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
- (2015) Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17 (11), pp. 1875–1886.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634.
- (2015) Skeleton based action recognition with convolutional neural network. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 579–583.
- (2015) Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118.
- (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941.
- (2017) Space-time representation of people based on 3D skeletal data: a review. Computer Vision and Image Understanding 158, pp. 85–105.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2016) Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology 28 (3), pp. 807–811.
- (2018) Deep bilinear learning for RGB-D action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 335–351.
- (2013) Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In Twenty-Third International Joint Conference on Artificial Intelligence.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- (1973) Visual perception of biological motion and a model for its analysis. Perception & Psychophysics 14 (2), pp. 201–211.
- (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732.
- (2017) SkeletonNet: mining deep part features for 3-D action recognition. IEEE Signal Processing Letters 24 (6), pp. 731–735.
- (2017) A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297.
- (2017) Intel RealSense stereoscopic depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–10.
- (2017) Interpretable 3D human action analysis with temporal convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1623–1631.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2011) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563.
- (2017) Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1012–1020.
- (2017) Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 601–604.
- (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3595–3603.
- (2017) PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475.
- (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, pp. 816–833.
- (2017) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition 68, pp. 346–362.
- (2014) Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212.
- (2016) 3D skeleton-based human action classification: a survey. Pattern Recognition 53, pp. 130–147.
- (2017) Learning action recognition model from depth and skeleton videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5832–5841.
- (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019.
- (2017) Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (5), pp. 1045–1058.
- (2015) Action recognition using visual attention. arXiv preprint arXiv:1511.04119.
- (2019) Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7912–7921.
- (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026–12035.
- (2011) Real-time human pose recognition in parts from single depth images. In CVPR 2011, pp. 1297–1304.
- (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1227–1236.
- (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
- (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576.
- (2017) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Thirty-First AAAI Conference on Artificial Intelligence.
- (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
- (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
- (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595.
- (2013) Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558.
- (2017) Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 499–508.
- (2012) Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290–1297.
- (2018) RGB-D-based human motion recognition with deep learning: a survey. Computer Vision and Image Understanding 171, pp. 118–139.
- (2018) Cooperative training of deep aggregation networks for RGB-D action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2016) Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 102–106.
- (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
- (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2015) Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515.
- (2017) View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126.
- (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2012) Microsoft Kinect sensor and its effect. IEEE MultiMedia 19 (2), pp. 4–10.
- (2017) Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2904–2913.
