Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition
Human pose estimation and action recognition are related tasks, since both depend strongly on the representation and analysis of the human body. Nonetheless, most recent methods in the literature handle the two problems separately. In this work, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can solve both problems efficiently and still achieve state-of-the-art or comparable results on each task while running with a throughput of more than 100 frames per second. The proposed method benefits from extensive parameter sharing between the two tasks by unifying the processing of still images and video clips in a single pipeline, allowing the model to be trained seamlessly and simultaneously with data from different categories. Additionally, we provide important insights for end-to-end training of the proposed multi-task model by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at https://github.com/dluvizon/deephar.
Human action recognition has been intensively studied in recent years, both because it is a very challenging problem and because of the many applications that can benefit from it. Similarly, human pose estimation has progressed rapidly with the advent of powerful methods based on convolutional neural networks (CNNs) and deep learning. Although action recognition benefits from precise body poses, the two problems are usually handled as distinct tasks in the literature, or action recognition is used as a prior for pose estimation [66, 25]. To the best of our knowledge, no recent method in the literature tackles both problems jointly to the benefit of action recognition. In this paper, we propose a unique end-to-end trainable multi-task framework to handle human pose estimation and action recognition jointly, as illustrated in Fig. 1.
One of the major advantages of deep learning methods is their capability to perform end-to-end optimization. This is all the more true for multi-task problems, where related tasks can benefit from one another, as suggested by Kokkinos. Action recognition and pose estimation are usually hard to stitch together for a beneficial joint optimization, typically requiring 3D convolutions or heat map transformations. Detection-based approaches require the non-differentiable argmax function to recover the joint coordinates in a post-processing stage, which breaks the backpropagation chain needed for end-to-end learning. We propose to solve this problem by extending the differentiable soft-argmax [36, 67] to joint 2D and 3D pose estimation. This allows us to stack action recognition on top of pose estimation, resulting in a multi-task framework that is trainable end-to-end.
In comparison with our previous work, we propose a new network architecture carefully designed to predict poses and actions simultaneously at different feature map resolutions. Each prediction is supervised and re-injected into the network for further refinement. Unlike in our previous work, where we first predicted poses and then actions, here poses and actions are predicted in parallel and successively refined, strengthening the multi-task aspect of our method. Another improvement is the proposed depth estimation approach for 3D poses, which allows us to avoid learning costly volumetric heat maps while improving the overall accuracy of the method.
The main contributions of our work are as follows. First, we propose a new multi-task method for jointly estimating 2D/3D human poses and recognizing associated actions. Our method is trained end-to-end for both tasks simultaneously with multimodal data, including still images and video clips. Second, we propose a new regression approach for 3D pose estimation from single frames, benefiting at the same time from images “in-the-wild” with 2D annotated poses and from 3D data. This has proven to be a very efficient way to learn good visual features, which is also very important for action recognition. Third, our action recognition approach is based only on RGB images, from which we extract 3D poses and visual information. Despite that, our multi-task method achieves state-of-the-art results in both 2D and 3D scenarios, even when compared with methods using ground-truth poses. Fourth, the proposed network architecture is scalable without any additional training procedure, which allows us to choose the right trade-off between speed and accuracy a posteriori. Finally, we show that the hard problem of multi-tasking pose estimation and action recognition can be tackled efficiently by a single, carefully designed architecture that handles both problems together and better than separately. As a result, our method provides acceptable pose and action predictions at more than 180 frames per second (FPS), while achieving its best scores at 90 FPS on a consumer GPU.
The remainder of this paper is organized as follows. In Section 2 we review the most relevant works related to our method. The proposed multi-task framework is presented in Section 3. Extensive experiments on both pose estimation and action recognition are presented in Section 4, followed by our conclusions in Section 5.
2 Related Work
In this section, we present some of the most relevant methods related to our work, which are divided into human pose estimation and action recognition. Since an extensive literature review is beyond the scope of this paper, we encourage readers to refer to the surveys in [49, 22] for pose estimation and action recognition, respectively.
2.1 Human Pose Estimation
2D Pose Estimation
The problem of human pose estimation has been intensively studied in recent years, from Pictorial Structures [1, 18, 44] to more recent CNN-based approaches [41, 30, 45, 23, 48, 62, 6, 58, 59, 43]. Two distinct families of methods for pose estimation can be identified in the literature: detection-based and regression-based methods. Recent detection methods handle pose estimation as a heat map prediction problem, where each pixel in a heat map represents the detection score of a given body joint being localized at this pixel [7, 20]. Exploring the concepts of stacked architectures, residual connections, and multiscale processing, Newell et al. proposed the Stacked Hourglass (SHG) networks, which significantly improved scores on 2D pose estimation challenges. Since then, state-of-the-art methods have frequently proposed complex variations of the SHG architecture. For example, Chu et al. proposed an attention model based on conditional random fields (CRF), and Yang et al. replaced the residual unit of SHG with the Pyramid Residual Module (PRM). Very recently, a network that maintains a high-resolution flow was proposed, resulting in more precise predictions. With the emergence of Generative Adversarial Networks (GANs), Chou et al. proposed to use a discriminative network to distinguish between estimated and target heat maps. This process could increase the quality of predictions, since the generator is stimulated to produce more plausible predictions. Another application of GANs in that sense is to enforce the structural representation of the human body.
However, none of the previously mentioned detection-based approaches provides body joint coordinates directly. To recover the body joint coordinates, predicted heat maps have to be converted to joint positions, generally using the argument of the maximum a posteriori probability (MAP), called argmax. On the other hand, regression-based approaches use a nonlinear function to project the input image directly to the desired output, which can be the joint coordinates. Following this paradigm, Toshev and Szegedy proposed a holistic solution based on cascade regression for body part regression, and Carreira et al. proposed the Iterative Error Feedback. The limitation of current regression methods is that the regression function is frequently sub-optimal. To tackle this weakness, the soft-argmax function has been proposed to compute body joint coordinates from heat maps in a differentiable way.
3D Pose Estimation
Recently, deep architectures have been used to learn 3D representations from RGB images [69, 57, 37, 56, 38, 46] thanks to the availability of highly precise 3D data, and are now able to surpass depth sensors. Chen and Ramanan divided the problem of 3D pose estimation into two parts. First, they target 2D pose estimation considering the camera coordinates; second, the estimated 2D poses are matched to 3D representations by means of a nonparametric shape model. However, this is an ill-defined problem, since two different 3D poses can have the same 2D projection. Other methods propose to regress the 3D relative position of joints, which usually presents a lower variance than the absolute position. For example, Sun et al. proposed a bone representation of the human body. However, since the errors accumulate, such a structural transformation might affect tasks that depend on the extremities of the human body, like action recognition.
Pavlakos et al. proposed the volumetric stacked hourglass architecture, but the method suffers from a significant increase in the number of parameters and in the memory required to store all the gradients. A similar technique has been used elsewhere, but instead of using argmax for coordinate estimation, the authors use a numerical integral regression, which is similar to the soft-argmax operation. More recently, Yang et al. proposed to use adversarial networks to distinguish between generated and ground truth poses, improving predictions in uncontrolled environments. Unlike in our previous work, we show that a volumetric representation is not required for 3D prediction. Similarly to methods for hand pose estimation and for 3D human pose estimation, we predict 2D depth maps which encode the relative depth of each body joint.
2.2 Action Recognition
2D Action Recognition
In this section, we revisit some methods that exploit pose information for action recognition. For example, classical methods for feature extraction have been used in [63, 27], where the key idea is to use body joint locations to select visual features in space and time. 3D convolutions have been described as the best option to handle the temporal dimension of image sequences [8, 10, 60], but they involve a high number of parameters and cannot efficiently benefit from the abundant still images during training. Another option to integrate the temporal aspect is to analyze motion from image sequences [13, 19], but these methods require the difficult estimation of optical flow. Unconstrained temporal and spatial analyses are also promising approaches to tackle action recognition, since it is very likely that, in a sequence of frames, some very specific regions in a few frames are more relevant than the remaining parts. Inspired by this observation, Baradel et al. proposed an attention model called Glimpse Clouds, which learns to focus on specific image patches in space and time, aggregating the patterns and soft-assigning each feature to workers that contribute to the final action decision. The influence of occlusions could be alleviated by multi-view videos, and inaccurate pose sequences could be replaced by heat maps for better accuracy. However, this improvement is not observed when pose predictions are sufficiently precise.
2D action recognition methods usually use the body joint information only to extract localized visual features [63, 13], as an attention mechanism. Methods that directly exploit the body joints usually do not generate them, or present lower precision with estimated poses. Our approach removes these limitations by performing pose estimation together with action recognition. As such, our model only needs the input RGB frames while still performing discriminative visual recognition guided by the estimated body joints.
3D Action Recognition
Unlike video-based action recognition, 3D action recognition is mostly based on skeleton data as the primary information [35, 47]. With depth sensors such as the Microsoft Kinect, it is possible to capture 3D skeletal data without the complex installation procedure frequently required for motion capture (MoCap) systems. However, due to the required infrared projector, depth sensors are limited to indoor environments, have a low range of operation, and are not robust to occlusions, frequently resulting in noisy skeletons. To cope with noisy skeletons, Spatio-Temporal LSTM networks have been widely used to learn the reliability of skeleton sequences or as an attention mechanism [32, 52]. In addition to the skeleton data, multimodal approaches can also benefit from visual cues. In that direction, pose-conditioned attention mechanisms have been proposed to focus on image patches centered around the hands.
Since our architecture predicts precise 3D poses from RGB frames, we do not have to cope with the noisy skeletons from Kinect. Moreover, we show in the experiments that, despite being based on temporal convolution instead of the more common LSTM, our system is able to reach state-of-the-art performance on 3D action recognition, indicating that action recognition does not necessarily require long-term memory.
3 Proposed Multi-task Approach
The goal of the proposed method is to jointly handle human pose estimation and action recognition, prioritizing the use of predicted poses for action recognition and benefiting from shared computations between the two tasks. For convenience, we define the input of our method as either a still RGB image $\mathbf{I} \in \mathbb{R}^{W \times H \times 3}$ or a video clip (sequence of images) $\mathbf{V} \in \mathbb{R}^{T \times W \times H \times 3}$, where $T$ is the number of frames in a video clip and $W \times H$ is the frame size. This distinction is important because we handle pose estimation as a single frame problem. The outputs of our method for each frame are the predicted human pose $\hat{\mathbf{p}}$ and the per body joint confidence scores $\hat{\mathbf{c}} \in \mathbb{R}^{N_J}$, where $N_J$ is the number of body joints. When taking a video clip as input, the method also outputs a vector of action probabilities $\hat{\mathbf{a}} \in \mathbb{R}^{N_a}$, where $N_a$ is the number of action classes. To simplify notation, in this section we omit batch normalization layers and ReLU activations, which are used in between convolutional layers as a common practice in deep neural networks.
3.1 Network Architecture
Unlike our previous work, where poses and actions are predicted sequentially, here we strengthen the multi-task aspect of our method by predicting and refining poses and actions in parallel. This is implemented by the proposed architecture, illustrated in Fig. 2. Input images are fed through the entry flow, which extracts low-level visual features. The extracted features are then processed by a sequence of downscaling and upscaling pyramids indexed by $p$, which are respectively composed of downscaling and upscaling units (DU and UU), and prediction blocks (PB), indexed by $l$. Each PB is supervised on pose and action predictions, which are then re-injected into the network, producing a new feature map that is refined by further downscaling and upscaling pyramids. Downscaling and upscaling units are respectively composed of max pooling or upsampling layers followed by a residual unit, which is a standard or a depthwise separable convolution with a skip connection. These units are detailed in Fig. 3.
To handle human poses and actions in a unified framework, the network can operate in two distinct modes: (i) single frame processing or (ii) video clip processing. In the first operational mode (single frame), only the layers related to pose estimation are active, whose connections correspond to the blue arrows in Fig. 2. In the second operational mode (video clip), both pose estimation and action recognition layers are active. In this case, layers in the single frame processing part handle each video frame as a single sample in the batch. Regardless of the operational mode, pose estimation is always performed from single frames, which prevents the method from depending on temporal information for this task. For video clip processing, the information flows from single frame processing (pose estimation) and from video clip processing (action recognition) are independently propagated from one prediction block to another, as shown in Fig. 2 by blue and red arrows, respectively.
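The video clip mode described above, where single frame layers treat every frame as an independent batch sample, can be illustrated with a small reshaping sketch. This is only an illustration under assumptions: the `(batch, frames, height, width, channels)` layout and the helper names `clips_to_frames` and `frames_to_clips` are hypothetical and are not taken from the released code.

```python
import numpy as np

def clips_to_frames(clips):
    """Flatten (B, T, H, W, C) clips into (B*T, H, W, C) independent frames."""
    b, t, h, w, c = clips.shape
    return clips.reshape(b * t, h, w, c)

def frames_to_clips(per_frame, num_frames):
    """Regroup per-frame outputs (B*T, ...) into per-clip outputs (B, T, ...)."""
    return per_frame.reshape(-1, num_frames, *per_frame.shape[1:])

# Two toy clips of four 8x8 RGB frames each (sizes chosen for illustration).
clips = np.arange(2 * 4 * 8 * 8 * 3, dtype=np.float32).reshape(2, 4, 8, 8, 3)
frames = clips_to_frames(clips)         # single frame layers see 8 samples
restored = frames_to_clips(frames, 4)   # video clip layers see 2 clips again
```

Because the reshape only merges the batch and time axes, pose estimation never sees which clip a frame came from, which matches the statement that poses are always estimated from single frames.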
Multi-task Prediction Block
The main challenge related to the design of the network architecture is how to handle multimodal data (single frames and video clips) in a unified way and how to allow the refinement of predictions for both poses and actions. To this end, we propose a multi-task prediction block (PB), detailed in Fig. 4. In the PB, pose and action are simultaneously predicted and re-injected into the network for further refinement. In the global architecture, each PB is indexed by pyramid $p$ and level $l$, and produces three feature maps: $\mathcal{X}_t^{p,l}$, $\mathcal{M}_t^{p,l}$, and $\mathcal{Y}^{p,l}$.
Namely, $\mathcal{X}_t^{p,l}$ is a tensor of single frame features, which is propagated from one PB to another, $\mathcal{M}_t^{p,l}$ is a tensor of multi-task (single frame) features used for both pose and action, and $\mathcal{Y}^{p,l}$ is a tensor of video clip features, exclusively used for action predictions and also propagated from one PB to another. $t = \{1, \dots, T\}$ is the index of single frames in a video clip, and $N_f$ and $N_v$ are respectively the size of single frame features and video clip features.
For pose estimation, prediction blocks take as input the single frame features $\mathcal{X}_t^{p-1,l}$ from the previous pyramid and the features $\mathcal{X}_t^{p,l\mp1}$ from lower or higher levels, respectively for downscaling and upscaling pyramids. A similar propagation of previous video clip features $\mathcal{Y}^{p-1,l}$ and $\mathcal{Y}^{p,l\mp1}$ happens for action. Note that both the single frame and the video clip feature maps are three-dimensional tensors (2D maps plus channels) that can be easily handled by 2D convolutions.
The tensor of multi-task features is defined by:
where DU is the downscaling unit (replaced by UU for upscaling pyramids), RU is the residual unit, $*$ is a convolution, and $\mathbf{W}$ is a weight matrix. Then, the tensor of multi-task features is used to produce body joint probability maps:
and body joint depth maps:
where $\Phi$ is the spatial softmax, and $\mathbf{W}_h$ and $\mathbf{W}_d$ are weight matrices. Probability maps $\mathbf{h}$ and body joint depth maps $\mathbf{d}$ encode, respectively, the probability of a body joint being at a given location and the depth with respect to the root joint, normalized in the interval $[0, 1]$. Both $\mathbf{h}$ and $\mathbf{d}$ contain one 2D map per body joint.
3.2 Pose Regression
Once a set of body joint probability maps and depth maps is computed from multi-task features, we aim to estimate the corresponding 3D points by a differentiable and non-parametrized function. For that, we decouple the problem into 2D pose estimation and depth estimation, and the final 3D pose is the concatenation of the intermediate parts.
The Soft-argmax Layer for 2D Estimation
Given a 2D input signal, the main idea is to consider that the argument of the maximum (argmax) can be approximated by the expectation of the input signal after being normalized to have the properties of a distribution. Indeed, for a sufficiently pointy (Leptokurtic) distribution, the expectation should be close to the maximum a posteriori (MAP) estimation. For a 2D heat map as input, the normalized exponential function (softmax) can be used, since it alleviates the undesirable influences of values below the maximum and increases the “pointiness” of the resulting distribution, producing a probability map, as defined in Equation 6.
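Let us define a single probability map for the $j$-th joint as $\mathbf{h}_j$, such that $\mathbf{h} = [\mathbf{h}_1, \dots, \mathbf{h}_{N_J}]$. Then, the expected coordinates $(x_j, y_j)$ are given by the function $\Psi$:
$$\Psi(\mathbf{h}_j) = \left( \sum_{c=1}^{W_h} \sum_{l=1}^{H_h} \frac{c}{W_h}\, h_{l,c},\ \sum_{c=1}^{W_h} \sum_{l=1}^{H_h} \frac{l}{H_h}\, h_{l,c} \right), \qquad (8)$$
where $W_h \times H_h$ is the size of the input probability map, and $l$ and $c$ are line and column indexes of $\mathbf{h}_j$. According to Equation 8, the coordinates are constrained to the interval $[0, 1]$, which corresponds to the normalized limits of the input image.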
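The soft-argmax idea described above (a spatial softmax followed by the expectation of the pixel coordinates) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function names are hypothetical, and the coordinate grid (column `c` mapped to `c / W`, with indexes starting at 1) is an assumption about the normalization.

```python
import numpy as np

def spatial_softmax(heatmap):
    """Normalize a 2D heat map into a positive map that sums to one."""
    e = np.exp(heatmap - np.max(heatmap))  # subtract the max for stability
    return e / e.sum()

def soft_argmax_2d(prob):
    """Expected (x, y) location of a 2D probability map, in (0, 1]."""
    h, w = prob.shape
    cols = np.arange(1, w + 1) / w   # assumed grid: column c -> c / W
    rows = np.arange(1, h + 1) / h   # assumed grid: line l -> l / H
    x = float((prob.sum(axis=0) * cols).sum())
    y = float((prob.sum(axis=1) * rows).sum())
    return x, y

# A sharply peaked heat map: the expectation lands on the peak.
heatmap = np.zeros((5, 5))
heatmap[1, 3] = 20.0
x, y = soft_argmax_2d(spatial_softmax(heatmap))

# A flat map has no preferred location: the expectation is the grid mean.
x_flat, y_flat = soft_argmax_2d(np.full((4, 4), 1.0 / 16.0))
```

Every operation here is differentiable, which is what allows gradients from later stages to reach the heat maps; a hard argmax would not allow that.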
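Unlike in our previous work, where volumetric heat maps were required to estimate the third dimension of body joints, here we follow a similar approach in which specialized depth maps are used to encode the depth information. Similarly to the probability maps decomposition from section 3.2.1, here we define $\mathbf{d}_j$ as a depth map for the $j$-th body joint. Thus, the regressed depth coordinate $z_j$ is defined by:
$$z_j = \sum_{c=1}^{W_h} \sum_{l=1}^{H_h} h_{l,c}\, d_{l,c}, \qquad (9)$$
where $h_{l,c}$ and $d_{l,c}$ are the elements of the probability map $\mathbf{h}_j$ and of the depth map $\mathbf{d}_j$ at line $l$ and column $c$, and $W_h \times H_h$ is the size of both maps. Since $\mathbf{h}_j$ is a normalized unitary and positive probability map, Equation 9 represents a spatially weighted pooling of depth map $\mathbf{d}_j$ based on the 2D body joint location.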
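The spatially weighted pooling of the depth map can be illustrated as follows. This is a sketch under the assumption that the probability map and the depth map of a joint share the same spatial size; `soft_depth` is a hypothetical name.

```python
import numpy as np

def soft_depth(prob, depth_map):
    """Depth of one joint: depth map averaged with the probability map as weights."""
    return float((prob * depth_map).sum())

# Toy 3x3 depth map with values between 0 and 1.
depth_map = np.arange(9, dtype=float).reshape(3, 3) / 8.0

# All probability mass on one pixel: the depth at that pixel is returned.
peaked = np.zeros((3, 3))
peaked[2, 0] = 1.0
z_peaked = soft_depth(peaked, depth_map)

# Mass split over two pixels: the result is the average of both depths.
split = np.zeros((3, 3))
split[0, 0] = 0.5
split[2, 2] = 0.5
z_split = soft_depth(split, depth_map)
```

The second case shows why a pointy probability map matters: an ambiguous 2D location blends the depths of unrelated pixels.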
Body Joint Confidence Scores
The probability of a certain body joint being present (even if occluded) in the image is computed as the maximum value in the corresponding probability map. Considering a pose layout with $N_J$ body joints, the estimated joint confidence vector is represented by $\hat{\mathbf{c}} \in \mathbb{R}^{N_J}$. If the probability map is very pointy, this score is close to 1. On the other hand, if the probability map is uniform or has more than one region with high response, the confidence score drops.
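The behavior of this confidence score on pointy, uniform, and multi-modal maps can be checked with a short sketch. The helper names are hypothetical, and the toy heat maps are made up for illustration.

```python
import numpy as np

def spatial_softmax(heatmap):
    """Normalize a 2D heat map into a probability map."""
    e = np.exp(heatmap - np.max(heatmap))
    return e / e.sum()

def joint_confidence(prob):
    """Confidence of one joint: the maximum of its probability map."""
    return float(prob.max())

pointy = np.zeros((4, 4))
pointy[2, 1] = 20.0                      # one strong response

two_peaks = np.zeros((4, 4))
two_peaks[0, 0] = 20.0                   # two equally strong responses
two_peaks[3, 3] = 20.0

c_pointy = joint_confidence(spatial_softmax(pointy))
c_uniform = joint_confidence(spatial_softmax(np.zeros((4, 4))))
c_two = joint_confidence(spatial_softmax(two_peaks))
```

With these inputs, the single-peak score is close to 1, the two-peak score is close to 0.5, and the uniform score is 1/16, in line with the qualitative description above.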
As systematically noted in recent works [7, 20, 40, 42], prediction re-injection is a very efficient way to improve the precision of estimated poses. Unlike all previous methods based on direct heat map regression, our approach can benefit from prediction re-injection at different resolutions, since our pose regression method is invariant to the feature map resolution. Specifically, in each PB, at every pyramid and level, we compute a new set of features based on features from previous blocks and on the current prediction, as follows:
where the two weight matrices are related to the re-injection of 2D pose and depth information, respectively. With this approach, subsequent PBs at different pyramids and levels are able to refine predictions, considering different sets of features at different resolutions.
3.3 Human Action Recognition
Another important advantage of our method is its ability to integrate high-level pose information with low-level visual features in a multi-task framework. This characteristic allows sharing the single frame processing pipeline for both pose estimation and visual feature extraction. Additionally, visual features are trained using both action sequences and still images captured “in-the-wild”, which has proven to be a very efficient way to learn robust visual representations. As shown in Fig. 4, the action prediction part takes as input two different sources of information: pose features and appearance features. Additionally, similarly to the pose prediction part, action features from previous pyramids ($\mathcal{Y}^{p-1,l}$) and levels ($\mathcal{Y}^{p,l\mp1}$) are also aggregated in each prediction.
To exploit the rich information encoded in body joint positions, we convert a sequence of $T$ poses with $N_J$ joints each into an image-like representation. Similar representations were previously used in [5, 28]. We choose to encode the temporal dimension as the vertical axis, the joints as the horizontal axis, and the coordinates of each point ($x, y$ for 2D, $x, y, z$ for 3D) as the channels. With this approach, we can use classical 2D convolutions to extract patterns directly from the temporal sequence of body joints. The predicted coordinates of each body joint are weighted by their confidence scores, so that points that are not present in the image (and consequently cannot be correctly predicted) have less influence on action recognition. A graphical representation of pose features is presented in Fig. 4(a).
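The image-like pose representation can be sketched as a simple array operation. The layout below (time as rows, joints as columns, coordinates as channels) follows the description above; the function name and the toy values are assumptions made for illustration.

```python
import numpy as np

def pose_features(poses, confidences):
    """Build an image-like tensor from a pose sequence.

    poses:       (T, N_J, D) joint coordinates, with D = 2 or D = 3
    confidences: (T, N_J) joint confidence scores in [0, 1]
    returns:     (T, N_J, D) array with rows = time, columns = joints,
                 channels = coordinates weighted by the joint confidence
    """
    poses = np.asarray(poses, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    return poses * confidences[..., np.newaxis]

# Three frames, two joints, 3D coordinates (toy values).
poses = np.ones((3, 2, 3))
conf = np.array([[1.0, 0.0],
                 [0.5, 1.0],
                 [1.0, 1.0]])
feat = pose_features(poses, conf)
```

A joint with zero confidence contributes an all-zero pixel, so a 2D convolution sliding over this array is less influenced by joints that were not visible.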
In addition to the pose information, visual cues are very important for action recognition, since they bring contextual information. In our method, localized visual information is encoded as appearance features, which are extracted by a process similar to the one used for pose features, with the difference that the former relies on local visual information instead of joint coordinates. To extract localized appearance features, we multiply each channel of the tensor of multi-task features $\mathcal{M}_t$ by each channel of the probability maps $\mathbf{h}_t$ (outer product of $\mathcal{M}_t$ and $\mathbf{h}_t$), which are learned as a byproduct of the pose estimation process. Then, the spatial dimensions are collapsed by a sum, resulting in the appearance features for time $t$, of size $N_J \times N_f$, where $N_J$ is the number of body joints and $N_f$ is the number of feature channels. For a sequence of $T$ frames, we concatenate the appearance feature maps for $t = \{1, \dots, T\}$, resulting in the video clip appearance features of size $T \times N_J \times N_f$. To clarify this process, a graphical representation is shown in Fig. 4(b).
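The appearance feature extraction (channel-wise product of features and probability maps, followed by a spatial sum) can be written as a single contraction. This is a sketch assuming channels-last arrays; the function names and the toy sizes are hypothetical.

```python
import numpy as np

def appearance_features(features, prob_maps):
    """Pool features around each joint.

    features:  (H, W, N_f) multi-task feature map of one frame
    prob_maps: (H, W, N_J) joint probability maps of the same frame
    returns:   (N_J, N_f) one feature vector per body joint
    """
    return np.einsum("hwj,hwf->jf", prob_maps, features)

def clip_appearance_features(clip_features, clip_prob_maps):
    """Stack per-frame appearance features into a (T, N_J, N_f) array."""
    return np.stack([appearance_features(m, h)
                     for m, h in zip(clip_features, clip_prob_maps)])

rng = np.random.default_rng(0)
T, H, W, N_f, N_J = 3, 4, 4, 5, 2
feats = rng.normal(size=(T, H, W, N_f))

# One-hot probability maps: joint 0 at (1, 2), joint 1 at (3, 0).
probs = np.zeros((T, H, W, N_J))
probs[:, 1, 2, 0] = 1.0
probs[:, 3, 0, 1] = 1.0

clip_app = clip_appearance_features(feats, probs)
```

With one-hot probability maps, each joint simply picks the feature vector at its location; softer maps yield a weighted average of the features around the joint.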
We argue that our multi-task framework has two benefits for the appearance-based part. First, it is computationally very efficient, since most of the computations are shared. Second, the extracted visual features are more robust, since they are trained simultaneously for different but related tasks and on different datasets.
Action Features Aggregation and Re-injection
Some actions are hard to distinguish from others using only the high-level pose representation. For example, the actions drink water and make a phone call are very similar if we take into account only the body joints, but are easily separated if we have the visual information corresponding to the objects cup and phone. On the other hand, other actions are not directly related to visual information but to body movements, like salute and touch chest, and in this case the pose information can provide complementary information. In our method, we combine visual cues and body movements by aggregating pose and appearance features. This aggregation is a straightforward process, since both feature types have the same spatial dimensions.
Similarly to the single frame features re-injection mechanism discussed in section 3.2.4, our approach also allows action features re-injection, as detailed in the action prediction part in Fig. 4. We demonstrate in the experiments that this technique also improves action recognition results with no additional parameters.
Decoupled Action Poses
Since the multi-task architecture is trained simultaneously on pose estimation and on action recognition, we may have an effect of competing gradients from poses and actions, especially in the predicted poses, which are used as the output for the first task and as the input for the second task. To mitigate that influence, late in the training process, we propose to decouple estimated poses (used to compute pose scores) from action poses (used by the action recognition part), as illustrated in Fig. 6.
Specifically, we first train the network on pose estimation for about one half of the full training iterations, then we replicate only the last layers that project the multi-task feature map to heat maps and depth maps (parameters $\mathbf{W}_h$ and $\mathbf{W}_d$), resulting in a “copy” of the probability maps and depth maps. Note that this replica corresponds to a simple convolution from the feature space to the number of joints, which is almost insignificant in terms of parameters and computations. The “copy” of this layer is a new convolutional layer with its weights initialized with $\mathbf{W}$. Finally, for the remaining training, the action recognition part propagates its loss through the replica poses. This process allows the original pose predictions to stay specialized on the first task, while the replicated poses partially absorb the action gradients and are optimized according to the action recognition task. Although the replicated poses are not directly supervised in the final training stage (which corresponds to a few more epochs), we show in our experiments that they still remain coherent with the supervised estimated poses.
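The replication step can be pictured with a toy example in which the projection layer is modeled as a 1x1 convolution, that is, a per-pixel matrix product. The kernel size, the array sizes, and the random update standing in for an action gradient are all assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_joints = 8, 4          # hypothetical sizes

# Pose head weights: a 1x1 convolution is a per-pixel matrix product.
w_pose = rng.normal(size=(num_features, num_joints))
# Replica used by the action branch, initialized from the pose head.
w_action = w_pose.copy()

features = rng.normal(size=(6, 6, num_features))   # multi-task feature map
maps_pose = features @ w_pose
maps_action = features @ w_action
same_at_start = bool(np.allclose(maps_pose, maps_action))

# A made-up update driven by the action loss changes only the replica.
w_before = w_pose.copy()
w_action -= 0.1 * rng.normal(size=w_action.shape)
pose_head_untouched = bool(np.array_equal(w_pose, w_before))
replica_diverged = not np.allclose(features @ w_action, maps_pose)
```

Right after replication both heads produce identical maps, so the action branch starts from the same poses it was trained with; afterwards only the replica drifts toward what the action loss prefers.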
4 Experiments
In this section, we present quantitative and qualitative results by evaluating the proposed method on two different tasks and on two different modalities: human pose estimation and human action recognition on 2D and 3D scenarios. Since our method relies on body coordinates, we consider four publicly available datasets mostly composed of full poses, which are detailed as follows.
MPII Human Pose Dataset is a well-known 2D human pose dataset composed of about 25K images collected from YouTube videos. 2D poses were manually annotated with up to 16 body joints.
Human3.6M is a 3D human pose dataset composed of videos of 11 subjects performing 17 different activities, all recorded simultaneously by 4 cameras. High-precision 3D poses were captured by a MoCap system, from which 17 body joints are used for evaluation.
Penn Action is a 2D dataset for action recognition composed of 2,326 videos of sportspeople performing 15 different actions. Human poses were manually annotated with up to 13 body joints.
NTU RGB+D is a large-scale 3D action recognition dataset composed of 56K videos in Full HD with 60 actions performed by 40 different actors and recorded by 3 cameras in 17 different configurations. Each color video has an associated depth map video and 3D Kinect poses.
On 2D pose estimation, we evaluate our method on the MPII validation set composed of 3K images, using the probability of correct keypoints measure with respect to the head size (PCKh). On 3D pose estimation, we evaluate our method on Human3.6M by measuring the mean per joint position error (MPJPE) after alignment of the root joint. We follow the most common evaluation protocol [65, 54, 37, 38, 42] by taking five subjects for training (S1, S5, S6, S7, S8) and evaluating on two subjects (S9, S11) on one of every 64 frames. We use ground truth person bounding boxes for a fair comparison with previous methods on single person pose estimation. We report results using a single cropped bounding box per sample.
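Both pose metrics are simple to compute once predictions and ground truth are available as arrays. The sketch below is an illustration under assumptions: the root joint is taken to be index 0, and the PCKh threshold of half the head size (`alpha=0.5`) is the commonly used value rather than something stated in the text above.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean per joint position error after aligning the root joints."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred = pred - pred[root]          # express both poses relative to the root
    gt = gt - gt[root]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints closer to the ground truth than alpha * head size."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    dist = np.linalg.norm(diff, axis=-1)
    return float((dist <= alpha * head_size).mean())

# 3D toy pose in millimeters: a global shift is removed by root alignment.
gt3d = np.array([[0, 0, 0], [100, 0, 0], [0, 200, 0], [0, 0, 300]], dtype=float)
pred3d = gt3d + 50.0
pred3d[3] += [30.0, 40.0, 0.0]        # one joint is off by 50 mm
err = mpjpe(pred3d, gt3d)

# 2D toy pose in pixels with per-joint errors of 0, 5, 10 and 1 pixels.
gt2d = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
pred2d = gt2d + np.array([[0, 0], [3, 4], [6, 8], [0, 1]], dtype=float)
score = pckh(pred2d, gt2d, head_size=10.0)
```

In this toy case the 50 mm error on one of four joints gives an MPJPE of 12.5 mm, and three of the four 2D joints fall within the 5-pixel threshold.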
On action recognition, we report results using the percentage of correct action classification score. We use the proposed evaluation protocol for Penn Action, splitting the data as 50/50 for training/testing, and the more realistic cross-subject scenario for NTU, in which 20 subjects are used for training and the remaining ones are used for testing. Our method is evaluated on single-clip and/or multi-clip. In the first case, we crop a single clip of $T$ frames from the middle of the video. In the second case, we crop multiple video clips at a fixed temporal spacing from one another, and the final predicted action is the average decision among all clips from one video.
In our experiments, we consider two scenarios: A) 2D pose estimation and action recognition, for which we use the MPII and Penn Action datasets, respectively, and B) 3D pose estimation and action recognition, using the MPII, Human3.6M, and NTU datasets.
4.2 Implementation and Training Details
For the pose estimation task, we train the network using the elastic net loss function on predicted poses:
where $\hat{\mathbf{p}}_j$ and $\mathbf{p}_j$ are respectively the estimated and the ground truth positions of the $j$-th body joint. The same loss is used for both 2D and 3D cases, but only available values ($x, y$ for 2D and $x, y, z$ for 3D) are taken into account for backpropagation, depending on the dataset. We use poses in the camera coordinate system, with $(x, y)$ lying on the image plane and $z$ corresponding to the depth distance, normalized in the interval $[0, 1]$, where the top-left image corner corresponds to $(0, 0)$ and the bottom-right image corner corresponds to $(1, 1)$. For depth normalization, the root joint is assumed to have $z = 0.5$, and a range of 2 meters is used to represent the remaining joints. If a given body joint falls outside the cropped bounding box during training, we set the ground truth confidence flag to zero; otherwise we set it to one. The ground truth confidence information is used to supervise predicted joint confidence scores with the binary cross entropy loss. Despite providing additional information, the supervision on confidence scores has negligible influence on the precision of estimated poses. For the action recognition part, we use categorical cross entropy loss on predicted actions.
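A masked elastic net loss of the kind described above can be sketched as the sum of an L1 term and a squared L2 term per joint, with unavailable coordinates excluded. The equal weighting of the two terms, the averaging over joints, and the boolean mask layout are assumptions made for this illustration.

```python
import numpy as np

def elastic_net_loss(pred, target, valid):
    """L1 plus squared L2 error per joint, averaged over joints.

    pred, target: (N_J, D) normalized joint coordinates
    valid:        (N_J, D) boolean mask, False where no annotation exists
                  (for example the depth of a 2D-only dataset)
    """
    diff = np.where(valid, np.asarray(pred, float) - np.asarray(target, float), 0.0)
    per_joint = np.abs(diff).sum(axis=-1) + (diff ** 2).sum(axis=-1)
    return float(per_joint.mean())

pred = np.array([[0.5, 0.5, 0.5],
                 [0.2, 0.4, 0.9]])
target = np.array([[0.5, 0.5, 0.5],
                   [0.1, 0.6, 0.0]])

# 2D-only supervision: the depth column is masked out.
mask_2d = np.array([[True, True, False],
                    [True, True, False]])
loss_2d = elastic_net_loss(pred, target, mask_2d)

# Full 3D supervision: the large depth error now contributes.
mask_3d = np.ones_like(mask_2d)
loss_3d = elastic_net_loss(pred, target, mask_3d)
```

Masked entries produce a zero difference and therefore a zero gradient, which is how one network can be trained on 2D and 3D datasets with the same loss.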
Since the pose estimation part is the most computationally expensive, we chose to use separable convolutions for single frame layers and standard convolutions for video clip processing layers (action recognition layers). We performed experiments with the network architecture using 4 levels and up to 8 pyramids. No further significant improvement was noticed on pose estimation by using more than 8 pyramids. On action recognition, this limit was observed at 4 pyramids. For that reason, when using the full model with 8 pyramids, the action recognition part starts only at the 5th pyramid, reducing the computational load. In our experiments, we used normalized RGB images as input, which the entry flow network reduces to the feature map of the initial level. At each level, the spatial resolution is reduced by a factor of 2 and the number of features is increased arithmetically. For action recognition, we used a different number of features for Penn Action and for NTU.
For all the experiments, we first initialize the network by training pose estimation only, for about 32k iterations with mini batches of 32 images (equivalent to 40 epochs on MPII). Then, all the weights related to pose estimation are frozen and only the action recognition part is trained, for 2 epochs on Penn Action and 50 epochs on NTU. Finally, the full network is trained in a multi-task scenario, simultaneously for pose estimation and action recognition, until the validation scores plateau. Training the network on pose estimation for a few epochs provides a good general initialization and a better convergence of the action recognition part. The intermediate stage of action-only training has two purposes: first, it provides a good initialization of the action part, since it is built on top of the pre-initialized pose estimator; and second, it is about 3 times faster than performing multi-task training directly while resulting in similar scores. This process is especially useful for NTU, due to the large amount of training data. The training procedure takes about one day for the pose estimation initialization, then about two days for Penn Action and three days for NTU for the remaining stages, using a desktop GeForce GTX 1080Ti GPU.
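The three-stage procedure can be summarized as a small schedule. The sketch below is a hypothetical description in plain Python, not the training script from the repository; the dictionary keys and the `training_stages` and `samples_per_epoch` names are assumptions made for illustration.

```python
def training_stages(action_dataset):
    """Return the three training stages for 'penn_action' or 'ntu'."""
    action_epochs = {"penn_action": 2, "ntu": 50}[action_dataset]
    return [
        # Stage 1: pose estimation only, to initialize the network.
        {"name": "pose_init", "train_pose": True, "train_action": False,
         "duration": "about 32k iterations"},
        # Stage 2: pose weights frozen, only the action part is trained.
        {"name": "action_only", "train_pose": False, "train_action": True,
         "duration": f"{action_epochs} epochs"},
        # Stage 3: full multi-task training of both parts.
        {"name": "multi_task", "train_pose": True, "train_action": True,
         "duration": "until validation scores plateau"},
    ]


def samples_per_epoch(iterations, batch_size, epochs):
    """Samples per epoch implied by an iteration count and an epoch count."""
    return iterations * batch_size // epochs


stages = training_stages("ntu")
# 32k iterations of 32 images over 40 epochs imply 25,600 samples per epoch.
implied = samples_per_epoch(32_000, 32, 40)
```

Keeping the stages as data makes it easy to check which parts of the network receive gradients at each step before wiring the schedule into an actual training loop.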
For initialization on pose estimation, the network was optimized with RMSprop and an initial learning rate of 0.001. For action and multi-task training, we use RMSprop for Penn Action with the learning rate reduced by a factor of 0.1 after 15 and 25 epochs and, for NTU, vanilla SGD with Nesterov momentum of 0.9 and an initial learning rate of 0.01, reduced by a factor of 0.1 after 50 and 55 epochs. We weight the loss on body joint confidence scores and action estimations by a factor of 0.01, since the gradients from the cross entropy loss are much stronger than the gradients from the elastic net loss on pose estimation. This parameter was chosen empirically, and we did not observe a significant variation in the results with slightly different values (e.g., 0.02). Each iteration is performed on 4 batches of 8 frames, composed of random images for pose estimation and video clips for action. We train the model by alternating one batch containing pose estimation samples only and another batch containing action samples only. This strategy yielded slightly better results than batches composed of mixed pose and action samples. We augment the training data with random rotations from to , scaling from to , video temporal subsampling by a factor from 3 to 10, random horizontal flipping, and random color shifting. During evaluation, we also subsampled Penn Action and NTU videos by a factor of 6 and 8, respectively.
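The step learning-rate schedule, the loss weighting and the alternating batches can be sketched as follows. This is an illustrative sketch under stated assumptions: epochs are counted from zero, the four batches of an iteration are assumed to alternate starting with a pose batch, and `step_decay_lr`, `total_loss` and `batch_kinds` are hypothetical names rather than functions from the released code.

```python
def step_decay_lr(epoch, initial_lr, milestones, factor=0.1):
    """Learning rate at `epoch` (0-indexed), scaled by `factor` per milestone reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return initial_lr * factor ** reached


def total_loss(pose_loss, confidence_loss, action_loss, ce_weight=0.01):
    """Down-weight the cross entropy terms relative to the pose loss."""
    return pose_loss + ce_weight * (confidence_loss + action_loss)


def batch_kinds(num_batches=4):
    """Alternate pose-only and action-only batches within one iteration.

    Starting with a pose batch is an assumption made for this sketch.
    """
    return ["pose" if i % 2 == 0 else "action" for i in range(num_batches)]


# NTU schedule: SGD with initial rate 0.01, decayed after 50 and 55 epochs.
ntu_rates = [step_decay_lr(e, 0.01, (50, 55)) for e in (0, 50, 55)]
```

The same `step_decay_lr` helper covers the Penn Action schedule by passing the milestones `(15, 25)`.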
4.3 Evaluation on 3D Pose Estimation
Our results compared to previous approaches are shown in Table 1. Our multi-task method achieves a state-of-the-art average prediction error of 48.6 millimeters on Human3.6M for 3D pose estimation, improving on our previous work by 4.6 mm. Considering only the pose estimation task, our average error is 49.5 mm, 0.9 mm higher than the multi-task result, which shows the benefit of multi-task training for 3D pose estimation. For the activity “Sit down”, which is the most challenging case, we improve on previous methods (e.g., Yang et al.) by 21 mm. The generalization of our method is demonstrated by qualitative results of 3D pose estimation for all datasets in Fig. 10. Note that a single model and a single training procedure were used to produce all the images and scores, including 3D pose estimation and 3D action recognition, as discussed in the following.
|3D heat maps (ours , only H36M)||61.7||63.5||56.1||60.1||60.0||57.6||64.6||75.1|
|3D heat maps (ours )||49.2||51.6||47.6||50.5||51.8||48.5||51.7||61.5|
|Methods||Sit Down||Smoke||Photo||Wait||Walk||Walk Dog||Walk Pair||Average|
|3D heat maps (ours , only H36M)||95.4||63.4||73.3||57.0||48.2||66.8||55.1||63.8|
|3D heat maps (ours )||70.9||53.7||60.3||48.9||44.4||57.9||48.9||53.2|
Method not using ground-truth bounding boxes.
Methods using extra 2D data for training.
4.4 Evaluation on Action Recognition
For action recognition, we evaluate our method considering both 2D and 3D scenarios. For the first, a single model was trained using MPII for single frames (pose estimation) and Penn Action for video clips. In the second scenario, we use Human3.6M for 3D pose supervision, MPII for data augmentation, and NTU video clips for action. Similarly, a single model was trained for all the reported 3D pose and action results.
For 2D, the pose estimation was trained using mixed data from MPII (80%) and Penn Action (20%), using 16 body joints. Results are shown in Table 2. We reached the state-of-the-art action classification score of 98.7% on Penn Action, improving our previous work  by 1.3%. Our method outperformed all previous methods, including the ones using ground truth (manually annotated) poses.
|Our previous work ||✓||-||✓||-||98.6|
Including UCF101 data; using add. deep features.
For 3D, we trained our multi-task network using mixed data from Human3.6M (50%), MPII (37.5%) and NTU (12.5%) for pose estimation and NTU video clips for action recognition. Our results compared to previous methods are presented in Table 3. Our approach reached 89.9% of correctly classified actions on NTU, which is a strong result considering the hard task of classifying among 60 different actions in the cross-subject split. Our method improves previous results by at least 3.3% and our previous work by 4.4%, which shows the effectiveness of the proposed approach.
|Our previous work ||✓||-||✓||85.5|
Ground truth poses used on test to select visual features.
4.5 Ablation Study
We performed several experiments on the proposed network architecture in order to identify the arrangement with the best trade-off between performance and computational cost on both tasks. In Table 4, we show the results on 2D pose estimation and on action recognition considering different network layouts. For example, in the first line, a single PB is used at pyramid 1 and level 2. In the second line, a pair of full downscaling and upscaling pyramids is used, but with supervision only at the last PB. This results in 97.5% accuracy on action recognition and 84.2% PCKh on pose estimation. The third line uses exactly the same network, but with supervision on all PBs, which brings an improvement of 0.9% on pose and 0.6% on action with the same number of parameters. Finally, the last line shows results with the full network, reaching 88.3% on MPII and 98.2% on Penn Action (single-clip), with a single multi-task model.
Pose and Appearance Features
The proposed method benefits from both pose and appearance features, which are complementary for the action recognition task. Additionally, the confidence score is complementary to the pose itself and leads to marginal action recognition gains if used to weight pose predictions. Similar results are achieved if confidence scores are concatenated to poses. In Table 5, we present results on pose estimation and on action recognition for different feature extraction strategies. Considering pose features or appearance features alone, the results on Penn Action are 97.4% and 97.9%, respectively, which is 0.7% and 0.2% lower than with combined features. We also show in the last row the influence of decoupled action poses, resulting in a small gain of 0.1% on action scores and 0.3% on pose estimation, which shows that decoupling action poses brings additional improvements, especially for pose estimation. When decoupled poses are not used, note that the best score on pose estimation is obtained when poses are not directly used for action, which further supports the evidence of competing losses.
|Action features||MPII val. PCKh||PennAction Acc.|
|Pose features only||84.9||97.7|
|Appearance features only||85.2||97.9|
|Combined + decoupled poses||85.4||98.2|
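Two simple ways of injecting the confidence scores into the pose features, weighting and concatenation, can be sketched as below. How the confidence is applied in the actual network is not detailed in this section, so the element-wise product used here and the helper names `weight_by_confidence` and `concat_confidence` are assumptions.

```python
def weight_by_confidence(poses, confidences):
    """Scale every coordinate of a joint by that joint's confidence score."""
    return [[c * v for v in joint] for joint, c in zip(poses, confidences)]


def concat_confidence(poses, confidences):
    """Append each joint's confidence score to its coordinates."""
    return [list(joint) + [c] for joint, c in zip(poses, confidences)]


poses = [[0.2, 0.4], [0.6, 0.8]]   # two joints with (x, y) coordinates
confidences = [1.0, 0.5]           # predicted confidence per joint

weighted = weight_by_confidence(poses, confidences)
concatenated = concat_confidence(poses, confidences)
```

Weighting keeps the feature dimension unchanged, while concatenation adds one value per joint and leaves it to the following layers to decide how to use it.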
Additionally, we can observe that decoupled action poses remain coherent with supervised poses, as shown in Fig. 7, which suggests that the initial pose supervision is a good initialization overall. Nonetheless, in some cases, decoupled probability maps can drift to regions in the image that are more relevant for action recognition, as illustrated in Fig. 8. For example, feet heat maps can drift to objects in the hands, since the latter are more informative with respect to the performed action.
Single-task vs. multi-task
In this part, we compare the results on human action recognition considering single-task and multi-task training protocols. In Table 6, the first row shows results on the PennAction and NTU datasets when training with action supervision only, i.e., with the full network architecture (including pose estimation layers) but without pose supervision. In the second row, we show the results when using the manually annotated poses from PennAction for pose supervision. We did not use NTU (Kinect) poses for supervision since they are very noisy. From this, we can notice an improvement of almost 10% on PennAction, obtained only by adding pose supervision. When mixing with MPII data, the accuracy further increases by 0.8%. On NTU, multi-tasking brings a significant improvement of 1.9%. We believe that the improvement of multi-tasking on PennAction is much more evident because this is a small dataset, which makes it difficult to learn good representations for complex actions without explicit pose information. On the contrary, NTU is a large scale dataset, more suitable for learning approaches. As a consequence, the gap between single and multi-task on NTU is smaller, but still relevant.
|Training protocol||PennAction Acc.||NTU Acc.|
|Single-task (action only)||87.5||88.0|
|Multi-task (same dataset)||97.4||–|
|Multi-task (+MPII +H36M for 3D)||98.2||89.9|
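The gains quoted above follow directly from Table 6. As a quick check, the snippet below recomputes them from the table values; the dictionary layout and variable names are only illustrative.

```python
# Accuracy (%) from Table 6; None marks the configuration that was not evaluated.
table6 = {
    "single_task": {"penn_action": 87.5, "ntu": 88.0},
    "multi_task_same_dataset": {"penn_action": 97.4, "ntu": None},
    "multi_task_extra_data": {"penn_action": 98.2, "ntu": 89.9},
}


def gain(after, before):
    """Absolute accuracy difference in percentage points, rounded to one decimal."""
    return round(after - before, 1)


penn_pose_gain = gain(table6["multi_task_same_dataset"]["penn_action"],
                      table6["single_task"]["penn_action"])
penn_mpii_gain = gain(table6["multi_task_extra_data"]["penn_action"],
                      table6["multi_task_same_dataset"]["penn_action"])
ntu_gain = gain(table6["multi_task_extra_data"]["ntu"],
                table6["single_task"]["ntu"])
```

The three values are differences in percentage points between rows of the same column, so they compare training protocols on an identical test set.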
Once the network is trained, it can easily be cut to perform faster inference. For instance, the full model with 8 pyramids can be cut at the 4th or 2nd pyramid, which generally degrades the performance but allows faster predictions. To show the trade-off between precision and speed, we cut the trained multi-task model at different prediction blocks and estimate the throughput in frames per second (FPS), evaluating pose estimation precision and action recognition accuracy. We consider mini batches of 16 images for pose estimation and single video clips of 8 frames for action. The results are shown in Fig. 9. For both 2D and 3D scenarios, the best predictions are obtained at more than 90 FPS. For the 3D scenario, pose estimation on Human3.6M can be performed at more than 180 FPS and still reach a competitive error of 57.3 millimeters, while for action recognition on NTU, at the same speed, we still obtain state-of-the-art results with 87.7% of correctly classified actions, or even results comparable with recent approaches at more than 240 FPS. Finally, we show our results for both 2D and 3D scenarios compared to previous methods in Table 7, considering different inference speeds. Note that our method is the only one to perform both pose and action estimation in a single prediction, while achieving state-of-the-art results at a very high speed.
|Methods||MPII PCKh||H36M error (mm)||PennAction Acc.||NTU Acc.|
|Yang et al. ||88.6||58.6||-||-|
|Ours  @ 85 fps||-||53.2||97.4||85.5|
|Ours 2D @ 240 fps||85.5||-||97.5||-|
|Ours 2D @ 120 fps||88.3||-||98.7||-|
|Ours 3D @ 240 fps||80.7||63.9||-||86.6|
|Ours 3D @ 180 fps||83.8||57.3||-||87.7|
|Ours 3D @ 90 fps||87.0||48.6||-||89.9|
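The speed versus accuracy trade-off can be turned into a simple selection rule: given a required throughput, pick the most accurate operating point that is fast enough. The sketch below uses the three 3D rows of the table above; `OPERATING_POINTS`, `pick_operating_point` and `measure_fps` are hypothetical names, the FPS figures are the nominal values from the table, and the timing helper measures any Python callable rather than the actual network.

```python
import time

# (nominal FPS, Human3.6M error in mm, NTU accuracy in %) for the 3D model.
OPERATING_POINTS = [
    (240, 63.9, 86.6),
    (180, 57.3, 87.7),
    (90, 48.6, 89.9),
]


def pick_operating_point(min_fps):
    """Most accurate point (lowest pose error) running at `min_fps` or faster."""
    fast_enough = [p for p in OPERATING_POINTS if p[0] >= min_fps]
    if not fast_enough:
        return None
    return min(fast_enough, key=lambda p: p[1])


def measure_fps(predict, frames_per_call, num_calls=10):
    """Estimate the throughput of `predict` in frames per second."""
    start = time.perf_counter()
    for _ in range(num_calls):
        predict()
    elapsed = time.perf_counter() - start
    return frames_per_call * num_calls / max(elapsed, 1e-9)


choice = pick_operating_point(min_fps=100)
```

For a budget of 100 FPS the rule selects the 180 FPS cut, since the 90 FPS model is too slow and the 240 FPS cut is less accurate. With a real model, `measure_fps` would be called with a function that runs one mini batch of 16 images, so `frames_per_call=16`.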
In this work, we presented a new approach for human pose estimation and action recognition using multi-task deep learning. The proposed method for 3D pose estimation provides highly precise estimations with low resolution feature maps and avoids the expensive volumetric heat maps by predicting specialized depth maps per body joint. The proposed CNN architecture, along with the pose regression method, allows multi-scale pose and action supervision and re-injection, resulting in a highly efficient, densely supervised approach. Our method can be trained with mixed 2D and 3D data, benefiting from precise indoor 3D data as well as “in-the-wild” images manually annotated with 2D poses, which led to significant improvements for 3D pose estimation. The proposed method can also be trained with single frames and video clips simultaneously and in a seamless way.
More importantly, we show that the hard problem of multi-tasking human pose estimation and action recognition can be handled by a carefully designed architecture, resulting in a better solution for each task than learning them separately. In addition, we show that jointly learning human poses results in a consistent improvement in action recognition. Finally, with a single training procedure, our multi-task model can be cut at different levels for pose and action predictions, resulting in a highly scalable approach.
This work was partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) – Grant 233342/2014-1.