Anatomy-aware 3D Human Pose Estimation in Videos

Abstract

In this work, we propose a new solution for 3D human pose estimation in videos. Instead of directly regressing the 3D joint locations, we draw inspiration from the human skeleton anatomy and decompose the task into bone direction prediction and bone length prediction, from which the 3D joint locations can be completely derived. Our motivation is the fact that the bone lengths of a human skeleton remain consistent across time. This prompts us to develop effective techniques that utilize global information across all the frames in a video for high-accuracy bone length prediction. Moreover, for the bone direction prediction network, we propose a fully-convolutional propagating architecture with long skip connections. Essentially, it predicts the directions of different bones hierarchically without using any time-consuming memory units (e.g., LSTM). A novel joint shift loss is further introduced to bridge the training of the bone length and bone direction prediction networks. Finally, we employ an implicit attention mechanism to feed the 2D keypoint visibility scores into the model as extra guidance, which significantly mitigates the depth ambiguity in many challenging poses. Our full model outperforms the previous best results on the Human3.6M and MPI-INF-3DHP datasets, and comprehensive evaluation validates the effectiveness of our model.

1 Introduction

3D human pose estimation in videos has been widely studied in recent years. It has extensive applications in action recognition, sports analysis and human-computer interaction. Current state-of-the-art approaches [24, 14, 4] typically decompose the task into 2D keypoint detection followed by 3D pose estimation. Given an input video, they first detect the 2D keypoints of each frame, and then predict the 3D joint locations of a frame based on the 2D keypoints.

When estimating the 3D joint locations from 2D keypoints, the challenge is to resolve depth ambiguity, as multiple 3D poses with different joint depths can be projected to the same 2D keypoints. Exploiting temporal information from the video has been demonstrated to be effective for reducing the depth ambiguity. Typically, to predict the 3D joint locations of a frame in a video, recent approaches [24, 25, 12] utilize temporal networks that additionally feed the adjacent frames’ 2D keypoints as input. These approaches consider the adjacent local frames most associated with the current frame, and extract their information as extra guidance. However, such approaches are limited to exploiting information only from the neighboring frames. Given a 1-minute input video with a frame rate of 50, even if we choose the existing temporal network with the largest temporal window size (i.e., 243 frames) [24], it is limited to using a concentrated short segment (about one-twelfth of the video’s length) to predict a single frame. Such a design can easily make existing temporal networks fail when the current frame and its adjacent input frames correspond to a complex pose, because none of the input frames provide reliable and high-confidence information to the networks.

Considering this, we propose a novel approach that can effectively capture knowledge from both local and distant frames to estimate the 3D joint locations of the current frame by exploiting the anatomic properties of the human skeleton. We refer to this as anatomy awareness. Specifically, based on the anatomy of the human skeleton, we decompose the task of 3D joint location prediction into two sub-tasks – bone direction prediction and bone length prediction. We demonstrate that the combination of the two new tasks is essentially equivalent to the original task. The motivation is based on a simple fact that the bone lengths of a person remain consistent over time. Hence, when we predict the bone lengths of a particular frame, we can leverage frames distributed over the entire video for more accurate and smoother prediction. One problem in training the bone length prediction network is that the training dataset typically contains only a few skeletons. For example, the training set of Human3.6M contains 5 actors, corresponding to 5 bone length settings. Directly training the network on the data from the 5 actors leads to serious overfitting. Therefore, we propose two mechanisms, a dedicated network design and data augmentation, to prevent overfitting.

As for the bone directions, we adopt the temporal convolutional network in [24] to predict the direction of each bone in 3D space for each frame. Motivated by [12], we believe it is beneficial to predict the directions of different bones hierarchically, instead of all at once as in [24]. Following the human skeleton anatomy, the directions of simple torso bones (e.g., the lumbar vertebrae) with less motion variation should be predicted first, and then guide the prediction of challenging limb bones (e.g., arms and legs). In [12], this strategy is implemented straightforwardly by a recurrent neural network (RNN) that predicts different joints step by step for a single frame. However, the high computational complexity of the RNN precludes the network from holding a large temporal window, which has been shown to improve performance. To solve this issue, based on [24], we propose a fully-convolutional propagating architecture, which contains multiple sub-networks, each predicting the directions of all the bones. The hierarchical prediction is implicitly performed via long skip connections between adjacent sub-networks. Additionally, motivated by [26], we create an effective joint shift loss that allows the two sub-tasks (i.e., bone direction prediction and bone length prediction) to be learned jointly. The joint shift loss penalizes the error of the predicted relative joint shift between long-range joint pairs, for example the left hand and the right foot. Thus, it provides a strong extra supervision signal that trains the two networks to coordinate with each other and produce robust predictions.

Last but not least, we propose a simple yet effective approach to further reduce the depth ambiguity. Specifically, we incorporate 2D keypoint visibility scores into the model as a new feature, which indicates the probability of each 2D keypoint being visible in a frame and provides extra knowledge of the depth relation between specific joints. We argue that these scores are useful for poses in which body parts are occluded or where the relative depth matters. For example, if a person keeps her/his hands in front of the chest in a frontal view, the model can be confused about whether the hands are in front of the chest (visible) or behind the back (occluded), since occluded 2D keypoints can sometimes still be predicted. Furthermore, we adopt an implicit attention mechanism to dynamically adjust the importance of the visibility scores for better performance.

Our contributions are summarized as follows:

  • We are the first to decompose the task of 3D joint estimation into bone direction prediction and bone length prediction. As such, the bone length prediction branch can fully utilize frames across the entire video.

  • We propose a new fully-convolutional architecture for hierarchical bone direction prediction.

  • We feed the visibility scores of 2D keypoint detection into the model to better resolve the depth ambiguity.

  • Our model design is inspired by the human skeleton anatomy. It achieves state-of-the-art performance on both the Human3.6M and MPI-INF-3DHP datasets and outperforms the baselines.

2 Related Work

Previous works on 3D pose estimation typically fall into two categories based on their training pipelines. Approaches of the first category [23, 18, 31, 21, 27, 9, 10, 6] train a convolutional neural network (CNN) to estimate the 3D pose directly from the original input images. In [23], Pavlakos et al. integrate the volumetric representation with a coarse-to-fine supervision scheme to predict 3D volumetric heatmaps. Dabral et al. [6] create a weakly-supervised ConvNet pose estimator and propose an illegal-angle loss and a symmetry loss for the network training. Sun et al. [27] present an effective integral regression approach that unifies the heat map representation and joint regression approaches. Kanazawa et al. [9] propose an end-to-end CNN framework for reconstructing a full 3D mesh of a human body from a single RGB image. These image-based approaches can directly capture the rich knowledge contained in images. However, without intermediate features and supervision, the model’s performance is also affected by the image background, lighting and the person’s clothing.

Approaches of the second category [30, 17, 2, 25, 12, 24, 20, 13, 3, 4] build a 3D joint estimation model on top of a high-performance 2D keypoint detector. The predicted 2D keypoints are lifted to 3D joint locations. In an earlier work, Chen et al. [2] regard 3D pose estimation as a matching problem. They find the best-matching 3D pose for the 2D keypoint input with a nearest-neighbor (NN) model. Martinez et al. [17] propose an effective fully-connected residual network to regress the 3D joint locations from 2D keypoint input. Lee et al. [12] introduce an LSTM-based framework to reconstruct 3D depth from the centroid to edge joints, while Chen et al. [4] present a weakly-supervised method that learns a geometry-aware representation to bridge multi-view images for pose estimation. Overall, approaches that follow such an “image-2D-3D” pipeline outperform their end-to-end counterparts. One important reason is that the 2D detector can be trained on large-scale indoor/outdoor images, which provides the 3D model with a strong intermediate representation to build upon.

When estimating the 3D pose in videos, recent approaches [19, 15, 25, 24, 10] incorporate temporal information into the model to alleviate incoherent and jittery predictions. Mehta et al. [19] apply temporal filtering across 2D and 3D poses from previous frames to predict a temporally consistent 3D pose. Lin et al. [15] present the Recurrent 3D Pose Sequence Machine, which automatically learns the image-dependent structural constraint and sequence-dependent temporal context through multi-stage sequential refinement. Rayat et al. [25] predict temporally consistent 3D poses by learning the temporal context of a sequence using a sequence-to-sequence LSTM-based network. Pavllo et al. [24] introduce a fully-convolutional model which enables parallel processing of multiple frames and supports very long 2D keypoint sequences as input. All these approaches essentially leverage the adjacent frames to benefit the current frame’s prediction. Compared with them, we are the first to make all the frames in a video contribute to the 3D prediction. It should be noted that Sun et al. [26] also transform the 3D joints into a bone-based representation. They train the model to regress short- and long-range relative shifts between different joints. We demonstrate that completely decomposing the task into bone length and bone direction prediction achieves the best performance and makes better use of the relative joint shift supervision.

3 Our Model

In this section, we formally present our 3D pose estimation model. In Section 3.1, we first describe the overall anatomy-aware framework that decomposes the 3D joint location prediction task into bone length and bone direction prediction. In Section 3.2, we present the fully-convolutional propagating network for hierarchical bone direction prediction. In Section 3.3, the architecture and training details of the bone length prediction network are presented. In Section 3.4, we describe the framework’s overall training strategy. In Section 3.5, an implicit attention mechanism is introduced to incorporate the keypoint visibility scores into the model as extra guidance. The overall architecture of the proposed framework is shown in Figure 1.

3.1 Anatomy-aware Framework

Figure 1: The overview of the proposed anatomy-aware framework. It predicts the bone directions and bone lengths of the current frame using consecutive local frames and randomly sampled frames across the entire video, respectively. For better viewing, please zoom in on the screen.

As in [24, 25, 4], given the predicted 2D keypoints of each frame in a video, we aim at predicting the normalized 3D locations of pre-defined joints for each frame. The 3D location of the joint “Pelvis” is commonly defined as the origin of the 3D coordinates. Given a human joint set that contains $J$ joints as in Figure 2, they correspond to $J-1$ directed bones, with each joint being the vertex of at least one bone. This enables us to transform the 3D joint coordinates into the representation of bone lengths and bone directions.

Figure 2: The joint and bone representation of a human pose.

Formally, to predict the 3D joint locations of a specific (i.e., current) frame, we decompose the task into predicting the length and direction of each bone. For the $k$-th joint, its 3D location $\mathbf{J}_k$ can be derived as:

$$\mathbf{J}_k = \sum_{b \in \mathcal{B}_k} L_b\,\hat{\mathbf{D}}_b \qquad (1)$$

Here $\hat{\mathbf{D}}_b$ and $L_b$ are the unit direction vector and the length of bone $b$, respectively. $\mathcal{B}_k$ contains all the bones in the path from “Pelvis” to the $k$-th joint.
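To make this decomposition concrete, below is a minimal NumPy sketch of the conversion between joint positions and the bone representation implied by Equation 1. The 17-joint parent table and the helper names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical parent table for a 17-joint Human3.6M-style skeleton rooted at the pelvis (joint 0).
# PARENT[k] is the parent joint of joint k; the pelvis has no parent.
PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def joints_to_bones(joints):
    """Decompose (J, 3) joint positions into per-bone lengths and unit directions."""
    lengths, directions = [], []
    for k, p in enumerate(PARENT):
        if p < 0:                      # skip the root joint
            continue
        vec = joints[k] - joints[p]    # bone vector from parent to child
        length = np.linalg.norm(vec)
        lengths.append(length)
        directions.append(vec / (length + 1e-8))
    return np.asarray(lengths), np.asarray(directions)

def bones_to_joints(lengths, directions):
    """Recompose joint positions by accumulating bones along the path from the pelvis (Eq. 1)."""
    joints = np.zeros((len(PARENT), 3))
    bone_idx = 0
    for k, p in enumerate(PARENT):
        if p < 0:
            continue                   # the pelvis stays at the origin
        joints[k] = joints[p] + lengths[bone_idx] * directions[bone_idx]
        bone_idx += 1
    return joints
```

Since every non-root joint is reached by exactly one path of bones from the pelvis, the two functions invert each other (up to the root position), which is why the two sub-tasks together are equivalent to direct 3D joint prediction.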

We use two separate sub-networks to predict the bone lengths and directions of the current frame, respectively, as bone length prediction needs global input to ensure consistency across all the frames, whereas bone directions should be estimated within a local temporal window. Meanwhile, to ensure consistency between predicted bone lengths and directions, motivated by [26], we add a joint shift loss between the two predictions in addition to their own losses, as shown in Figure 1. Specifically, the joint shift loss is defined as follows:

$$\mathcal{L}_{JS} = \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \left\| \hat{\mathbf{S}}_{i,j} - \mathbf{S}_{i,j} \right\|_2^2 \qquad (2)$$

Here $\mathbf{S}_{i,j}$ is the 3-dimensional ground-truth relative joint shift of the current frame from the $i$-th joint to the $j$-th joint, and $\hat{\mathbf{S}}_{i,j}$ is the corresponding predicted relative joint shift derived from the predicted bone lengths and bone directions of the current frame. $\mathcal{P}$ contains all the joint pairs that are not directly connected by a bone. With the joint shift loss, the two sub-networks are connected and enforced to learn from each other jointly. We describe the details of the two sub-networks in the following two sections.
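As a rough PyTorch-style sketch, the joint shift loss could be computed as follows; the tensor shapes, the pair list and the averaging scheme are assumptions for illustration.

```python
import torch

def joint_shift_loss(pred_lengths, pred_dirs, gt_joints, pairs, bone_paths):
    """
    Hypothetical sketch of the joint shift loss (Eq. 2).

    pred_lengths: (B, J-1)      predicted bone lengths for the current frame
    pred_dirs:    (B, J-1, 3)   predicted unit bone directions for the current frame
    gt_joints:    (B, J, 3)     ground-truth 3D joint locations (pelvis at the origin)
    pairs:        list of (i, j) joint pairs that are NOT directly connected by a bone
    bone_paths:   bone_paths[k] lists the bone indices on the path from the pelvis to joint k
    """
    B, J = gt_joints.shape[:2]
    # Recompose the predicted joints from bone lengths and directions (Eq. 1).
    pred_joints = [gt_joints.new_zeros(B, 3)]            # pelvis at the origin
    for k in range(1, J):
        bones = bone_paths[k]
        pred_joints.append((pred_lengths[:, bones, None] * pred_dirs[:, bones]).sum(dim=1))
    pred_joints = torch.stack(pred_joints, dim=1)        # (B, J, 3)

    # Penalize the error of the relative shift between every long-range joint pair.
    loss = 0.0
    for i, j in pairs:
        pred_shift = pred_joints[:, j] - pred_joints[:, i]
        gt_shift = gt_joints[:, j] - gt_joints[:, i]
        loss = loss + ((pred_shift - gt_shift) ** 2).sum(dim=-1).mean()
    return loss / len(pairs)
```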

3.2 Bone Direction Prediction Network

We adopt the temporal fully-convolutional network proposed by Pavllo et al. [24] as the backbone architecture of our bone direction prediction network. Specifically, the 2D keypoints of $T$ consecutive frames are concatenated to form the input to the network, with the 2D keypoints of the current frame in the center. In essence, to predict the bone directions of the current frame, the temporal network captures the information of the current frame as well as the context from its adjacent frames. A bone direction loss based on the mean squared error is applied to train the network:

$$\mathcal{L}_{D} = \frac{1}{J-1} \sum_{b=1}^{J-1} \left\| \hat{\mathbf{D}}_b - \mathbf{D}_b \right\|_2^2 \qquad (3)$$

Here $\hat{\mathbf{D}}_b$ and $\mathbf{D}_b$ represent the predicted and ground-truth 3-dimensional direction vector of bone $b$ in the current frame, respectively.

It should be noted that the joint shift loss introduced in Section 3.1 makes the predicted directions of different bones mutually relevant. For example, if the predicted direction of the left lower arm is inaccurate, the predicted direction of the left upper arm will also be affected, since the model is encouraged to regress a long-range shift from the left shoulder to the left wrist. Intuitively, it would benefit the overall prediction if we could first predict the easy, high-confidence cases and let them guide the subsequent prediction of other joints. As poses may vary significantly, it is difficult to pre-determine the hierarchy of the prediction. Motivated by [12], we therefore propose a fully-convolutional propagating architecture with long skip connections and let the network itself learn the prediction hierarchy, as shown in Figure 3.

Figure 3: The architecture of the bone direction prediction network. Long skip connections are added between adjacent sub-networks.

Specifically, the architecture is a stack of several sub-networks, with each sub-network being a temporal fully-convolutional network with residual blocks as proposed by [24]. The output of each sub-network is the predicted bone directions of the current frame. Except for the top sub-network, we temporally duplicate the single-frame output of each sub-network to match the input length expected by the next sub-network, where $T$ is the input frame number of the bottom sub-network and $s$ is the stride of the 1D convolution layers in the network. For each residual block of a specific sub-network, we concatenate its output with the output of the corresponding residual block in the adjacent upper sub-network at the channel level. This forms the long skip connections between adjacent sub-networks. We adopt an independent training strategy for each sub-network: each sub-network is trained by the loss of the bone direction prediction network, and back-propagation is blocked between different sub-networks. In this way, the bottom sub-networks are not affected by the upper ones, and instead propagate high-confidence predictions to guide the subsequent predictions. In the process, the model automatically learns the hierarchical order of the prediction. In Section 4, we demonstrate the effectiveness of the proposed architecture.
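The sketch below illustrates the three core ideas of this architecture, stacked sub-networks, blocked back-propagation between them, and channel-level long skip connections, in PyTorch. The layer sizes, the simple output-duplication scheme and the class names are simplifications assumed for illustration rather than the exact architecture built on [24].

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """Simplified temporal sub-network: 2D keypoints of T frames -> bone directions of the center frame."""
    def __init__(self, in_dim, hidden=256, out_dim=48, skip_dim=0):
        super().__init__()
        self.conv_in = nn.Sequential(nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1), nn.ReLU())
        # The second stage also sees the (optional) long-skip feature from the previous sub-network.
        self.conv_mid = nn.Sequential(nn.Conv1d(hidden + skip_dim, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x, skip=None):
        h1 = self.conv_in(x)                                   # (B, hidden, T)
        h2 = h1 if skip is None else torch.cat([h1, skip], dim=1)
        h2 = self.conv_mid(h2)                                 # (B, hidden, T)
        center = h2[:, :, h2.shape[-1] // 2]                   # feature of the center frame
        return self.head(center), h1                           # prediction + feature for the skip connection

class PropagatingNet(nn.Module):
    """Stack of sub-networks; each one is supervised independently (gradients blocked in between)."""
    def __init__(self, in_dim, n_subnets=2, hidden=256, out_dim=48):
        super().__init__()
        self.subnets = nn.ModuleList(
            [SubNet(in_dim, hidden, out_dim, skip_dim=0)] +
            [SubNet(out_dim, hidden, out_dim, skip_dim=hidden) for _ in range(n_subnets - 1)]
        )

    def forward(self, keypoints):                              # keypoints: (B, 2*J_2d, T)
        preds, x, skip = [], keypoints, None
        T = keypoints.shape[-1]
        for net in self.subnets:
            pred, feat = net(x, skip)
            preds.append(pred)
            # Block back-propagation so upper sub-networks cannot disturb the lower ones,
            # then duplicate the single-frame prediction T times as the next temporal input.
            x = pred.detach().unsqueeze(-1).repeat(1, 1, T)
            skip = feat.detach()
        return preds                                           # each entry gets its own bone direction loss
```

During training, every entry of preds would receive its own bone direction loss (Equation 3), so lower sub-networks are optimized independently and only propagate their detached predictions and features upward.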

3.3 Bone Length Prediction Network

Figure 4: Detailed structure of the bone length prediction network. We illustrate the dimension of each input/output, including the batch size, the output channel number of the fully-connected layers, the number of residual blocks, and the sizes of the 2D keypoint set and the 3D joint set.

As discussed in Section 3.1, the prediction of bone lengths requires global input from the entire video. However, taking too many frames as input would make the computation prohibitively expensive. To capture the global context efficiently, we choose to randomly sample $N$ frames across the entire video as the input to the network. The detailed structure of the network is shown in Figure 4.

We adopt a fully-connected residual network for bone length prediction. Specifically, it has the same structure and layer number as the bottom sub-network of the bone direction prediction network. However, since the randomly sampled frames do not have temporal connections, we replace each 1D convolution layer in the network with a fully-connected layer. This adapts the network to single-frame inputs instead of multi-frame consecutive inputs. The fully-connected network predicts the bone lengths of each sampled frame, as sketched below.
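As a rough illustration, one fully-connected residual block and the resulting per-frame network might look as follows; the hidden width, normalization and dropout choices are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FCResidualBlock(nn.Module):
    """One fully-connected residual block: two Linear layers with a skip connection."""
    def __init__(self, dim=1024, dropout=0.25):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):
        return x + self.block(x)

class BoneLengthNet(nn.Module):
    """Per-frame network: 2D keypoints of one sampled frame -> 3D joint locations of that frame."""
    def __init__(self, n_joints_2d=17, n_joints_3d=17, dim=1024, n_blocks=2):
        super().__init__()
        self.inp = nn.Linear(2 * n_joints_2d, dim)
        self.blocks = nn.Sequential(*[FCResidualBlock(dim) for _ in range(n_blocks)])
        self.out = nn.Linear(dim, 3 * n_joints_3d)

    def forward(self, kpts_2d):                              # kpts_2d: (B, 2 * n_joints_2d)
        x = self.blocks(self.inp(kpts_2d))
        return self.out(x).view(kpts_2d.shape[0], -1, 3)     # (B, J, 3) per-frame joint prediction
```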

Intuitively, we can average the predicted bone lengths over the sampled frames to obtain the predicted bone lengths of the current frame. In this way, a bone length loss similar to the bone direction loss can be applied to train the fully-connected network:

$$\mathcal{L}_{L} = \frac{1}{J-1} \left\| \hat{\mathbf{L}} - \mathbf{L} \right\|_2^2 \qquad (4)$$

Here $\hat{\mathbf{L}}$ and $\mathbf{L}$ are the predicted and ground-truth $(J-1)$-dimensional bone length vectors of the current frame.

However, the training datasets usually contain only a very limited number of actors, and the bone lengths in all the videos performed by the same actor are identical. Such a training loss would therefore lead to severe overfitting.

To solve this problem, instead of directly predicting the bone lengths, the fully-connected residual network is modified to predict the 3D joint locations of each sampled frame, supervised by the mean per joint position error (MPJPE) loss as in [24]:

$$\mathcal{L}_{J} = \frac{1}{J} \sum_{k=1}^{J} \left\| \hat{\mathbf{J}}_k - \mathbf{J}_k \right\|_2 \qquad (5)$$

Here $\hat{\mathbf{J}}_k$ and $\mathbf{J}_k$ are the predicted and ground-truth 3-dimensional locations of the $k$-th joint. Since each sampled frame predicts its own set of 3D joint locations and the number of input frames is usually large, minimizing the MPJPE over the predictions of all the frames would make the bone length prediction network converge much faster than the bone direction prediction network, which degrades the performance of jointly training them with Equation 2. We therefore choose to randomly sample one frame from the $N$ input frames and calculate its corresponding MPJPE loss. We find that this training strategy works quite well, as shown in the experiments.

Since the 3D joint locations vary from frame to frame, the overfitting problem is largely avoided. The bone lengths of each frame can then be derived from its predicted 3D joint locations accordingly.

When averaging the bone length predictions from the $N$ input frames, the prediction accuracy for a specific bone depends on the pose in each frame. For example, some bones in a certain pose are hardly visible due to occlusion or foreshortening, which makes their prediction unreliable. To address this, we further incorporate a self-attention module on top of the fully-connected network to predict the bone length vector of the current frame:

$$\mathbf{a}_t = \underset{t = 1, \dots, N}{\mathrm{softmax}}\left(\alpha\, W \hat{\mathbf{J}}_t\right), \qquad \hat{\mathbf{L}} = \sum_{t=1}^{N} \mathbf{a}_t \odot \hat{\mathbf{L}}_t \qquad (6)$$

Here $\hat{\mathbf{J}}_t$ and $\hat{\mathbf{L}}_t$ are the predicted $3J$-dimensional 3D joint location vector and the corresponding derived bone length vector of the $t$-th input frame, respectively. $W$ is a learnable matrix of the self-attention module, and $\mathbf{a}_t$ is the $(J-1)$-dimensional attention weight vector that indicates the bone-specific importance of the $t$-th frame's predicted bone lengths for $\hat{\mathbf{L}}$; the softmax is taken over the $N$ frames for each bone, and $\alpha$ is a hyper-parameter that controls the degree of attention. During training, the fully-connected residual network and the bone length self-attention module are optimized independently by $\mathcal{L}_{J}$ and $\mathcal{L}_{L}$, respectively.
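A possible PyTorch realization of this re-weighting step is sketched below; the exact attention input and temperature handling follow the description above but remain assumptions.

```python
import torch
import torch.nn as nn

class BoneLengthAttention(nn.Module):
    """Re-weight per-frame bone length estimates with bone-specific attention over the N sampled frames."""
    def __init__(self, n_joints=17, alpha=10.0):
        super().__init__()
        n_bones = n_joints - 1
        # Learnable matrix W mapping the 3J-dim joint vector of a frame to one logit per bone.
        self.W = nn.Linear(3 * n_joints, n_bones, bias=False)
        self.alpha = alpha                                    # hyper-parameter controlling the degree of attention

    def forward(self, pred_joints, pred_lengths):
        """
        pred_joints:  (B, N, 3*J)   predicted 3D joint vectors of the N sampled frames
        pred_lengths: (B, N, J-1)   bone lengths derived from those joints
        returns:      (B, J-1)      attention-weighted bone lengths for the current frame
        """
        logits = self.alpha * self.W(pred_joints)             # (B, N, J-1)
        attn = torch.softmax(logits, dim=1)                   # normalize over the N frames, per bone
        return (attn * pred_lengths).sum(dim=1)               # weighted average of per-frame lengths
```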

To further address the overfitting problem of the self-attention module, we augment the training data by generating samples with varied bone lengths. In particular, for each training video, we randomly create a new group of bone lengths and modify the ground-truth 3D joint locations of each frame to make them consistent with the new bone lengths. Because the camera parameters are available, we can reconstruct the 2D keypoints of each frame from its modified ground-truth 3D joint locations. For each training iteration, we additionally feed a batch of randomly sampled frames from the augmented videos and use the corresponding 2D keypoints and bone lengths to optimize the self-attention module by $\mathcal{L}_{L}$. As the self-attention module is only used for bone length re-weighting, we consider it valid to train this module on a combination of predicted 2D keypoints and reconstructed clean 2D keypoints. A sketch of this augmentation step is given below.
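The following NumPy sketch shows one way such an augmentation could be implemented, reusing the hypothetical joints_to_bones/bones_to_joints helpers from Section 3.1 and a simple pinhole camera model; the scaling range and camera handling are assumptions, not the paper's exact procedure.

```python
import numpy as np

def augment_sequence(joints_3d_seq, fx, fy, cx, cy, rng, scale_range=(0.8, 1.2)):
    """
    joints_3d_seq: (F, J, 3) ground-truth joints of one video in absolute camera coordinates.
    Returns the modified 3D joints and the corresponding reprojected 2D keypoints.
    """
    n_bones = joints_3d_seq.shape[1] - 1
    # One random rescaling factor per bone, shared by every frame of the video,
    # so the augmented video still has bone lengths that are consistent over time.
    bone_scale = rng.uniform(*scale_range, size=n_bones)

    new_joints, new_kpts = [], []
    for joints in joints_3d_seq:
        root = joints[0]                                    # keep the pelvis trajectory unchanged
        lengths, directions = joints_to_bones(joints)       # helpers from the sketch in Section 3.1
        rescaled = bones_to_joints(lengths * bone_scale, directions) + root
        new_joints.append(rescaled)
        # Pinhole projection of the modified skeleton back to 2D keypoints.
        u = fx * rescaled[:, 0] / rescaled[:, 2] + cx
        v = fy * rescaled[:, 1] / rescaled[:, 2] + cy
        new_kpts.append(np.stack([u, v], axis=-1))
    return np.stack(new_joints), np.stack(new_kpts)
```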

3.4 Overall Loss Function

By combining the losses of each sub-network and the joint shift loss, the overall loss function of our framework is given as:

$$\mathcal{L} = \lambda_{D}\,\mathcal{L}_{D} + \lambda_{L}\,\mathcal{L}_{L} + \lambda_{J}\,\mathcal{L}_{J} + \lambda_{JS}\,\mathcal{L}_{JS} \qquad (7)$$

Here $\lambda_{D}$, $\lambda_{L}$, $\lambda_{J}$ and $\lambda_{JS}$ are hyper-parameters regulating the importance of each term.

During the training process, only the parameters of the bone direction prediction network are updated by the joint shift loss. Essentially, the joint shift loss supervises the model to predict robust bone directions that match the predicted bone lengths under long-range objectives. In Section 4, we show that the proposed anatomy-aware framework exploits the potential of the joint shift loss better than [26].

During the inference process, to estimate the 3D joint locations of a specific frame, we adopt the same strategy as in the training process. We still randomly sample $N$ frames of the video to predict the bone lengths of this frame. We find that taking all the frames of the video as input for bone length prediction does not lead to better performance. In Section 4, we provide more details regarding the frame sampling strategy for bone length prediction.

3.5 Incorporating the Visibility Scores

Figure 5: Detailed network structure for feeding the visibility scores. $T$ is the input frame number of the bone direction prediction network, and $s$ and $C$ are the stride and output channel number of the 1D convolution layers, respectively.

Our 3D joint prediction network takes the predicted 2D keypoints as input, which sometimes have ambiguities. For example, when a person has his/her legs crossed in a frontal view, the corresponding 2D keypoints cannot provide information about which leg is in front. A more common situation occurs when the person puts his/her hands in front of the chest or behind the back; the 3D model will then be confused by the relative depth between the hands and the chest. Even when temporal information is exploited, these problems persist.

We provide a simple yet effective approach to solve the problem without feeding the original images into the model. Specifically, we predict the visibility scores of each 2D keypoint, and incorporate the scores into 3D joint estimation. The visibility scores indicate the probability of each keypoint being visible in the frame, which can be extracted from most 2D keypoint detectors.

We argue that the importance of a specific keypoint’s visibility score is related to the corresponding pose. For example, the visibility scores of the hands become useless if the hands are stretched far away from the body. Therefore, we adopt an implicit attention mechanism to adaptively adjust the importance of the visibility scores.

Given the 2D keypoint sequence of the $T$ consecutive frames as the input of the network in Section 3.2, we additionally feed the keypoint visibility score sequence of the same frames, as shown in Figure 5. A 1D convolutional block is first applied, which maps the visibility score sequence into a temporal hidden feature that has the same dimension as the hidden feature of the 2D keypoint sequence. After that, we perform element-wise multiplication of the two temporal hidden features to obtain the weighted visibility score feature. In the end, we concatenate the weighted visibility score feature and the hidden feature of the 2D keypoint sequence at the channel level and feed the combined feature to the next 1D convolution layer of the network. Similar to [11], the hidden feature of the 2D keypoint sequence can be regarded as implicit attention weights that adjust the importance of the visibility scores; a sketch of this gating is given below.
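A minimal PyTorch sketch of this implicit-attention gating, with illustrative channel sizes and module names, is shown below.

```python
import torch
import torch.nn as nn

class VisibilityGate(nn.Module):
    """Gate the visibility-score feature with the 2D-keypoint feature, then concatenate at the channel level."""
    def __init__(self, n_joints=17, channels=1024, kernel=3):
        super().__init__()
        self.kpt_conv = nn.Sequential(nn.Conv1d(2 * n_joints, channels, kernel), nn.BatchNorm1d(channels), nn.ReLU())
        self.score_conv = nn.Sequential(nn.Conv1d(n_joints, channels, kernel), nn.BatchNorm1d(channels), nn.ReLU())

    def forward(self, kpts, scores):
        """
        kpts:   (B, 2*J, T) 2D keypoint sequence
        scores: (B, J, T)   keypoint visibility score sequence
        """
        kpt_feat = self.kpt_conv(kpts)                   # hidden feature of the 2D keypoints
        score_feat = self.score_conv(scores)             # hidden feature of the visibility scores
        weighted = kpt_feat * score_feat                 # implicit attention: element-wise gating
        # The combined feature is passed on to the remaining 1D convolution layers of the network.
        return torch.cat([kpt_feat, weighted], dim=1)    # (B, 2*channels, T - kernel + 1)
```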

4 Experiments

4.1 Datasets and Evaluation

We evaluate the proposed model on two well-established 3D human pose estimation datasets: Human3.6M [8] and MPI-INF-3DHP [18].

  • Human3.6M contains 3.6 million video frames with the corresponding annotated 3D and 2D human joint positions from 11 actors. Each actor performs 15 different activities captured from 4 unique camera views. Following previous works [24, 12, 25, 26, 17], the model is trained on five subjects (S1, S5, S6, S7, S8) and evaluated on two subjects (S9 and S11) on a 17-joint skeleton. We follow two standard protocols to evaluate models on Human3.6M. The first one (i.e., Protocol 1) is the mean per-joint position error (MPJPE) in millimeters, which measures the mean Euclidean distance between the predicted and ground-truth joint positions without any transformation. The second one (i.e., Protocol 2) is the normalized variant P-MPJPE, computed after aligning the predicted 3D pose to the ground truth with a similarity transformation (a sketch of both metrics follows this list).

  • MPI-INF-3DHP is a recently proposed 3D dataset consisting of both constrained indoor and complex outdoor scenes. It records 8 actors performing 8 activities from 14 camera views. Following [19, 18], on a 14-joint skeleton, we consider all the 8 actors in the training set and select sequences from 8 camera views in total (5 chest-high cameras, 2 head-high cameras and 1 knee-high camera) for training. Evaluation is performed on the independent MPI-INF-3DHP test set. We report the Percentage of Correct Keypoints (PCK) within 150mm range, Area Under Curve (AUC), and MPJPE.
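For reference, the two Human3.6M protocols can be computed roughly as in the following NumPy sketch, which uses a standard Procrustes alignment for Protocol 2 and is an illustration rather than the official evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol 1: mean Euclidean distance per joint; pred/gt of shape (J, 3), in millimeters."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Protocol 2: MPJPE after aligning pred to gt with a similarity transform (scale + rotation + translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    X, Y = pred - mu_p, gt - mu_g                      # centered point sets
    M = X.T @ Y                                        # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(M)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, sign])                      # fix a possible reflection
    R = Vt.T @ D @ U.T                                 # optimal rotation
    scale = np.trace(D @ np.diag(S)) / (X ** 2).sum()  # optimal scale
    aligned = scale * X @ R.T + mu_g                   # similarity-transformed prediction
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```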

4.2 Implementation details

For Human3.6M, we use the predicted 2D keypoints released by [24] from the Cascaded Pyramid Network (CPN) as the input of our 3D pose model. For MPI-INF-3DHP, the predicted 2D keypoints are acquired from the pretrained AlphaPose model [7]. In addition to the 2D keypoints, the keypoint visibility scores for both datasets are also extracted from the pretrained AlphaPose model.

We use the Adam optimizer to train our model in an end-to-end manner. For each training iteration, the mini-batch size is set to 1024 for both original samples and augmented samples. We set the weights of the four loss terms in Equation 7 to 0.015, 0.05, 1 and 0.1. For the bone length self-attention module, we set the attention hyper-parameter $\alpha$ to 10 in Equation 6. The sampled frame number $N$ of the bone length prediction network is set to 50 for both the training and inference processes. For the architecture proposed in Section 3.2, the number of sub-networks is set to 2. As in [24], the output channel number of each 1D convolution layer and fully-connected layer is set to 1024. In our actual implementation, instead of manually deriving the 3D joint locations and relative joint shifts from the predicted bone lengths and bone directions, we regress the two objectives by feeding the concatenation of the predicted bone length vector and bone direction vector into two fully-connected layers, respectively. The fully-connected layers are trained together with the whole network. This achieves slightly better performance.

4.3 Experiment results

Protocol 1 Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Martinez et al. [17] ICCV’17 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9
Sun et al. [26] ICCV’17 52.8 54.8 54.2 54.3 61.8 67.2 53.1 53.6 71.7 86.7 61.5 53.4 61.6 47.1 53.4 59.1
Pavlakos et al. [22] CVPR’18 48.5 54.4 54.4 52.0 59.4 65.3 49.9 52.9 65.8 71.1 56.6 52.9 60.9 44.7 47.8 56.2
Yang et al. [28] CVPR’18 51.5 58.9 50.4 57.0 62.1 65.4 49.8 52.7 69.2 85.2 57.4 58.4 43.6 60.1 47.7 58.6
Luvizon et al. [16] CVPR’18 49.2 51.6 47.6 50.5 51.8 60.3 48.5 51.7 61.5 70.9 53.7 48.9 57.9 44.4 48.9 53.2
Hossain & Little [25] ECCV’18 48.4 50.7 57.2 55.2 63.1 72.6 53.0 51.7 66.1 80.9 59.0 57.3 62.4 46.6 49.6 58.3
Lee et al. [12] ECCV’18 40.2 49.2 47.8 52.6 50.1 75.0 50.2 43.0 55.8 73.9 54.1 55.6 58.2 43.3 43.3 52.8
Chen et al. [4] CVPR’19 41.1 44.2 44.9 45.9 46.5 39.3 41.6 54.8 73.2 46.2 48.7 42.1 35.8 46.6 38.5 46.3
Pavllo et al. [24] (243 frames, Causal) CVPR’19 45.9 48.5 44.3 47.8 51.9 57.8 46.2 45.6 59.9 68.5 50.6 46.4 51.0 34.5 35.4 49.0
Pavllo et al. [24] (243 frames) CVPR’19 45.2 46.7 43.3 45.6 48.1 55.1 44.6 44.3 57.3 65.8 47.1 44.0 49.0 32.8 33.9 46.8
Lin et al. [14] BMVC’19 42.5 44.8 42.6 44.2 48.5 57.1 42.6 41.4 56.5 64.5 47.4 43.0 48.1 33.0 35.1 46.6
Cai et al. [1] ICCV’19 44.6 47.4 45.6 48.8 50.8 59.0 47.2 43.9 57.9 61.9 49.7 46.6 51.3 37.1 39.4 48.8
Cheng et al. [5] ICCV’19 (*) - - - - - - - - - - - - - - - 44.8
Yeh et al. [29] NIPS’19 44.8 46.1 43.3 46.4 49.0 55.2 44.6 44.0 58.3 62.7 47.1 43.9 48.6 32.7 33.3 46.7
Ours (243 frames, Causal) 42.5 45.4 42.3 45.2 49.1 56.1 43.8 44.9 56.3 64.3 47.9 43.6 48.1 34.3 35.2 46.6
Ours (81 frames) 42.1 43.8 41.0 43.8 46.1 53.5 42.4 43.1 53.9 60.5 45.7 42.1 46.2 32.2 33.8 44.6
Ours (243 frames) 41.4 44.0 41.6 42.6 46.4 53.4 41.7 41.3 53.6 60.4 45.8 41.7 45.6 32.2 33.6 44.3
Protocol 2 Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Martinez et al. [17] ICCV’17 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7
Sun et al. [26] ICCV’17 42.1 44.3 45.0 45.4 51.5 53.0 43.2 41.3 59.3 73.3 51.0 44.0 48.0 38.3 44.8 48.3
Pavlakos et al. [22] CVPR’18 34.7 39.8 41.8 38.6 42.5 47.5 38.0 36.6 50.7 56.8 42.6 39.6 43.9 32.1 36.5 41.8
Yang et al. [28] CVPR’18 26.9 30.9 36.3 39.9 43.9 47.4 28.8 29.4 36.9 58.4 41.5 30.5 29.5 42.5 32.2 37.7
Hossain & Little [25] ECCV’18 35.7 39.3 44.6 43.0 47.2 54.0 38.3 37.5 51.6 61.3 46.5 41.4 47.3 34.2 39.4 44.1
Chen et al. [4] CVPR’19 36.9 39.3 40.5 41.2 42.0 34.9 38.0 51.2 67.5 42.1 42.5 37.5 30.6 40.2 34.2 41.6
Pavllo et al. [24] (243 frames, Causal) CVPR’19 35.1 37.7 36.1 38.8 38.5 44.7 35.4 34.7 46.7 53.9 39.6 35.4 39.4 27.3 28.6 38.1
Pavllo et al. [24] (243 frames) CVPR’19 34.1 36.1 34.4 37.2 36.4 42.2 34.4 33.6 45.0 52.5 37.4 33.8 37.8 25.6 27.3 36.5
Lin et al. [14] BMVC’19 32.5 35.3 34.3 36.2 37.8 43.0 33.0 32.2 45.7 51.8 38.4 32.8 37.5 25.8 28.9 36.8
Cai et al. [1] ICCV’19 35.7 37.8 36.9 40.7 39.6 45.2 37.4 34.5 46.9 50.1 40.5 36.1 41.0 29.6 33.2 39.0
Cheng et al. [5] ICCV’19 (*) - - - - - - - - - - - - - - - 34.1
Ours (243 frames, Causal) 33.6 36.0 34.4 36.6 37.5 42.6 33.5 33.8 44.4 51.0 38.3 33.6 37.7 26.7 28.2 36.5
Ours (81 frames) 33.1 35.3 33.4 35.9 36.1 41.7 32.8 33.3 42.6 49.4 37.0 32.7 36.5 25.5 27.9 35.6
Ours (243 frames) 32.1 35.0 33.5 34.9 36.3 40.9 32.2 31.8 42.4 49.0 37.1 32.4 35.6 25.0 27.4 35.0
Table 1: Quantitative comparisons of the Mean Per Joint Position Error (mm) between the estimated pose and the ground-truth on Human3.6M under Protocols 1 and 2. (*) We report the result without data augmentation using virtual cameras.

Table 1 shows the quantitative results of our proposed full model and other baselines on Human3.6M. As in [24], we present the performance of our 81-frame and 243-frame models, which receive 81 and 243 consecutive frames, respectively, as the input of the bone direction prediction network. We also experiment with a causal version of our model to enable real-time prediction. During the training/inference process, the causal model draws the $T$ consecutive and $N$ randomly sampled frames only from the current and past frames for the current frame’s estimation. Overall, our model achieves the lowest average error under both Protocol 1 and Protocol 2. On most of the actions, we achieve the best performance. Compared with the baseline model [24] that shares the same 2D keypoint detector, our model achieves remarkably better performance on complex activities such as “Sitting” (-3.7mm) and “Sitting down” (-5.4mm). We attribute this to the accurate prediction of the bone lengths for these activities. Even when the person bends his/her body, the joint shift loss, based on the predicted bone lengths, can effectively guide the model to predict high-quality bone directions. Figure 6 shows qualitative results from the baseline and our full model on “Sitting” and “Sitting down” poses.

Moreover, our model sharply improves the lower bound of 3D pose estimation when using the ground-truth 2D keypoints as input. For this experiment, data augmentation is not applied, as it can be regarded as using extra ground-truth 2D keypoints. From Table 2, the gap between our model and the baseline is nearly 5mm under Protocol 1. It indicates that if the performance of the underlying 2D keypoint detector improves, our model can translate this into an even larger gain.

Table 3 shows the quantitative results of our full model and other baselines on MPI-INF-3DHP. Overall, MPI-INF-3DHP contains fewer training samples than Human3.6M. This leads to better performance for the 81-frame models than for the 243-frame models. Still, our model outperforms the baselines by a large margin.

Figure 6: Qualitative comparison between the proposed 243-frame model and the baseline 243-frame model [24] on typical poses.
Protocol 1 Protocol 2
Martinez et al. [17] 45.5 37.1
Hossain & Little [25] 41.6 31.7
Lee et al. [12] 38.4 -
Pavllo et al. (243 frames) [24] 37.2 27.2
Ours (243 frames) 32.3 25.2
Table 2: Quantitative comparisons of models trained/evaluated on Human3.6M using the ground-truth 2D input.
PCK AUC MPJPE
Mehta et al. [18] 3DV’17 75.7 39.3 117.6
Mehta et al. [19] ACM ToG’17 76.6 40.4 124.7
Pavllo et al. [24] (81 frames) CVPR’19 86.0 51.9 84.0
Pavllo et al. [24] (243 frames) CVPR’19 85.5 51.5 84.8
Lin et al. [14] BMVC’19 82.4 49.6 81.9
Ours (81 frames) 87.9 54.0 78.8
Ours (243 frames) 87.8 53.8 79.1
Table 3: Quantitative comparisons of different models on MPI-INF-3DHP.

4.4 Ablation Study

Model MPJPE
Baseline [24] 47.7
AF 48.4
AF + ML 46.6
AF + ML + BoneAtt 46.5
AF + ML + BoneAtt + AUG 46.3
AF + ML + BoneAtt + AUG + JSL 45.8
AF + ML + BoneAtt + AUG + JSL + LSC 45.1
AF + ML + BoneAtt + AUG + JSL + LSC + ScoreAtt (full model) 44.6
Table 4: Comparison of different models under Protocol 1 on Human3.6M. Baseline represents the baseline 81-frame model [24]. The other rows represent the proposed anatomy-aware framework (AF) that decomposes the task into bone length prediction and bone direction prediction. ML refers to using the MPJPE loss to solve the overfitting problem of the fully-connected residual network as in Section 3.3. BoneAtt refers to using the bone length self-attention module for bone length re-weighting instead of directly averaging the predicted bone lengths. AUG refers to applying data augmentation to solve the overfitting problem of the self-attention module. JSL, LSC and ScoreAtt refer to applying the joint shift loss, incorporating the fully-convolutional propagating architecture with long skip connections, and feeding the visibility scores through the implicit attention mechanism, respectively.

We next perform ablation experiments on Human3.6M under Protocol 1. For all the comparisons, we use the 81-frame models for the baseline [24] and ours. They receive 81 consecutive frames as the input to predict the 3D joint locations and the bone directions of the current frame, respectively.

We first show how each proposed module improves the model’s performance in Table 4. For the naive anatomy-aware framework, we adopt Baseline as the bone direction prediction network and train the bone length prediction network and bone direction prediction network with the bone length loss and bone direction loss, respectively. We observe a performance drop from 47.7mm to 48.4mm, caused by the overfitting of the bone length prediction network. To our surprise, when we apply the MPJPE loss to solve the overfitting problem of the underlying fully-connected residual network, the bone length prediction is sharply improved and the MPJPE drops to 46.6mm. In addition, we find that the bone length self-attention module (together with its data augmentation), the joint shift loss, the fully-convolutional propagating architecture and the attentive feeding of the visibility scores further reduce the error by about 0.3mm, 0.5mm, 0.7mm and 0.5mm, respectively.

To further prove the validity of the proposed modules, the following models are compared:

  • Baseline + JSL refers to applying the joint shift loss on the baseline model. The model is trained to predict the 3D joint locations and the relative joint shifts with the MPJPE loss and the joint shift loss, respectively. The training approach is essentially similar to [26]. We tried different relative weights between the two losses; the best MPJPE obtained is 47.5mm. This suggests that the joint shift loss does not significantly improve the performance of the baseline model without decomposing the task into bone length prediction and bone direction prediction.

  • AF(conse) + ML + BoneAtt + AUG + JSL. One may question whether the distant frames indeed help the prediction of bone lengths. To validate this, we investigate a model that receives $N$ consecutive local frames, with the current one in the center, as the input of the bone length prediction network for training/inference. As in Section 4.2, $N$ is still set to 50 for both the training and inference processes. The error increases from 45.8mm to 46.7mm. This demonstrates that randomly sampling frames from the entire video for bone length prediction indeed improves the model’s performance, which is consistent with our motivation for decomposing the task.

  • AF + ML + BoneAtt + AUG + JSL + Baseline-D. It should be noted that, as the bone direction prediction network, LSC is two times deeper than Baseline. For a fair comparison, we further design Baseline-D as the bone direction prediction network. It has the same structure as LSC but with the long skip connections removed. Also, as in Baseline, the loss is only applied at the top and back-propagation is not blocked between different sub-networks. The MPJPE is still 45.8mm. This indicates that simply increasing the depth and parameter count of the temporal network does not help.

  • AF + ML + BoneAtt + AUG + JSL + LSC + Fusion refers to the model in which we directly concatenate the visibility scores with the input 2D keypoints before feeding them into the network, instead of utilizing the implicit attention. The MPJPE is 44.8mm. This shows that the implicit attention mechanism provides a more effective way to incorporate the visibility score feature.

5 Conclusion

We present a new solution to 3D human pose estimation in videos. Instead of directly regressing the 3D joint locations, we transform the task into predicting the bone lengths and bone directions. For bone length prediction, we make use of frames across the entire video and propose an effective fully-connected residual network with a bone length re-weighting mechanism. For bone direction prediction, we add long skip connections to a fully-convolutional architecture for hierarchical prediction. Extensive experiments demonstrate that the combination of bone length and bone direction is an effective intermediate representation bridging 2D keypoints and 3D joint locations. In the future, we will focus on jointly training the 2D detector and the 3D estimator and on making better use of the anatomic properties of the human skeleton.

References

  1. Y. Cai, L. Ge, J. Liu, J. Cai, T. Cham, J. Yuan and N. M. Thalmann (2019) Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2272–2281.
  2. C. Chen and D. Ramanan (2017) 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7035–7043.
  3. C. Chen, A. Tyagi, A. Agrawal, D. Drover, R. MV, S. Stojanov and J. M. Rehg (2019) Unsupervised 3D pose estimation with geometric self-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5714–5724.
  4. X. Chen, K. Lin, W. Liu, C. Qian and L. Lin (2019) Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10895–10904.
  5. Y. Cheng, B. Yang, B. Wang, W. Yan and R. T. Tan (2019) Occlusion-aware networks for 3D human pose estimation in video. In Proceedings of the IEEE International Conference on Computer Vision, pp. 723–732.
  6. R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma and A. Jain (2018) Learning 3D human pose from structure and motion. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 668–683.
  7. H. Fang, S. Xie, Y. Tai and C. Lu (2017) RMPE: regional multi-person pose estimation. In ICCV.
  8. C. Ionescu, D. Papava, V. Olaru and C. Sminchisescu (2013) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339.
  9. A. Kanazawa, M. J. Black, D. W. Jacobs and J. Malik (2018) End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122–7131.
  10. A. Kanazawa, J. Y. Zhang, P. Felsen and J. Malik (2019) Learning 3D human dynamics from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5614–5623.
  11. J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim, J. Ha and B. Zhang (2016) Multimodal residual learning for visual QA. In Advances in Neural Information Processing Systems, pp. 361–369.
  12. K. Lee, I. Lee and S. Lee (2018) Propagating LSTM: 3D pose estimation based on joint interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–135.
  13. C. Li and G. H. Lee (2019) Generating multiple hypotheses for 3D human pose estimation with mixture density network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9887–9895.
  14. J. Lin and G. H. Lee (2019) Trajectory space factorization for deep video-based 3D human pose estimation. arXiv preprint arXiv:1908.08289.
  15. M. Lin, L. Lin, X. Liang, K. Wang and H. Cheng (2017) Recurrent 3D pose sequence machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 810–819.
  16. D. C. Luvizon, D. Picard and H. Tabia (2018) 2D/3D pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5137–5146.
  17. J. Martinez, R. Hossain, J. Romero and J. J. Little (2017) A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649.
  18. D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu and C. Theobalt (2017) Monocular 3D human pose estimation in the wild using improved CNN supervision. In 2017 International Conference on 3D Vision (3DV), pp. 506–516.
  19. D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas and C. Theobalt (2017) VNect: real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG) 36 (4), pp. 44.
  20. F. Moreno-Noguer (2017) 3D human pose estimation from a single image via distance matrix regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2823–2832.
  21. S. Park, J. Hwang and N. Kwak (2016) 3D human pose estimation using convolutional neural networks with 2D pose information. In European Conference on Computer Vision, pp. 156–169.
  22. G. Pavlakos, X. Zhou and K. Daniilidis (2018) Ordinal depth supervision for 3D human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7307–7316.
  23. G. Pavlakos, X. Zhou, K. G. Derpanis and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7025–7034.
  24. D. Pavllo, C. Feichtenhofer, D. Grangier and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753–7762.
  25. M. Rayat Imtiaz Hossain and J. J. Little (2018) Exploiting temporal information for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 68–84.
  26. X. Sun, J. Shang, S. Liang and Y. Wei (2017) Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2602–2611.
  27. X. Sun, B. Xiao, F. Wei, S. Liang and Y. Wei (2018) Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545.
  28. W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li and X. Wang (2018) 3D human pose estimation in the wild by adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5255–5264.
  29. R. A. Yeh, Y. Hu and A. G. Schwing (2019) Chirality nets for human pose regression. arXiv preprint arXiv:1911.00029.
  30. X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis and K. Daniilidis (2016) Sparseness meets deepness: 3D human pose estimation from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4966–4975.
  31. X. Zhou, X. Sun, W. Zhang, S. Liang and Y. Wei (2016) Deep kinematic pose regression. In European Conference on Computer Vision, pp. 186–201.