DeepSkeleton: Skeleton Map for 3D Human Pose Regression

Qingfu Wan
Fudan University
   Wei Zhang
Fudan University
   Xiangyang Xue
Fudan University

Despite recent success in 2D human pose estimation, 3D human pose estimation remains an open problem. A key challenge is the ill-posed nature of depth ambiguity. This paper presents a novel intermediate feature representation for regression, named the skeleton map. It distills structural context from the RGB image and discards irrelevant properties such as illumination and texture. It is simple, clean, and can be easily generated via a deconvolutional network. For the first time, we show that a regression network trained on skeleton maps alone can match the performance of state-of-the-art 3D human pose estimation methods. We further exploit multiple 3D hypothesis generation to obtain a reasonable 3D pose that is consistent with the 2D pose detection. The effectiveness of our approach is validated on the challenging in-the-wild dataset MPII and the indoor dataset Human3.6M.

1 Introduction

Human pose estimation works generally fall into two groups: 2D human pose estimation and 3D human pose estimation. The past year has witnessed a revolution in 2D human pose estimation, thanks to the development of heatmap-based networks and the availability of deep residual networks [16]. However, research on 3D human pose estimation has lagged significantly behind its 2D counterpart.

Prior works on 3D human pose estimation can be roughly categorized into two families: regression methods and reconstruction methods. Regression methods directly learn a mapping function from the input image to the target 3D joint locations [22, 23, 24, 33, 36, 42, 43, 57, 44, 32]. Although popular in 3D human pose estimation, their performance has not been as strong as expected. Reconstruction methods typically follow a two-stage scheme: in the first stage a 2D pose estimator is employed, and then the 3D pose is reconstructed by minimizing the reprojection error via optimization [1, 4, 37, 13, 59] or matching [18, 7]. While optimization can generate only one 3D output, recent work of Jahangiri and Yuille [18] has highlighted the importance of having multiple hypotheses under the widely known problem of depth ambiguity.

A fundamental challenge limiting previous methods in 3D human pose estimation is insufficient training data. Most top-performing methods in 3D human pose estimation are restricted to laboratory environments, where appearance variation is far lower than in outdoor scenes. On the other hand, no accurate 3D ground truth exists for in-the-wild datasets to date. Fusing 2D and 3D data, which dates back to at least [56], has therefore become an emerging trend [42, 57]. This naturally raises a question: for accurate 3D human pose estimation, is mixing different data sources really indispensable?

In this work, we argue that combined training is not necessary if we better exploit the data that we already have. We put forth a novel, expressive intermediate feature representation called the skeleton map. See Figure 1 for an example.

Figure 1: An example of skeleton map. 15 body parts are drawn in different colors.

The core insight is that a human can easily reason about 3D human pose from a skeleton map using prior knowledge about human kinematics, anatomy and anthropometrics, rather than image observation. This suggests that most image cues, e.g. lighting, texture and human-object interaction, are useless for 3D pose inference.

Skeleton map is simple, compact and effective. Unlike full human body part segmentation map which requires pixel-wise segmentation, it only models the connection between adjacent joints in the human skeleton. It contains rich structural information and exhibits inherent occlusion-aware property for regression.

Skeleton map is general for both indoor and outdoor scenarios. Researchers previously argued that 2D and 3D data complement each other and that mixing them is central to better shared feature learning. But with our pure and succinct skeleton map representation, we are able to achieve near state-of-the-art performance. We then take a leap forward to generate multiple hypotheses, in the same spirit as the "wisdom of the crowd" [27]. Skeleton maps at different scales are conveniently generated by deconvolutional networks for subsequent regression. This can be interpreted as implicit data augmentation without data fusion or data synthesis. This desirable property further helps to resolve the depth ambiguity.

To the best of our knowledge, this is the first work to

  • Perform regression directly from skeleton map alone.

  • Generate multiple hypotheses without 3D library.

  • Unite segmentation, pose regression and heatmap detection into a framework we call DeepSkeleton for 3D human pose estimation.

DeepSkeleton achieves 86.5mm average joint error on the 3D Human3.6M dataset and delivers plausible 3D poses on the 2D MPII dataset.

2 Related Work

For a comprehensive review of the literature on 2D human pose estimation we refer the reader to [15, 30, 41]. Here we survey the previous works most relevant to our approach.

3D Human Pose Estimation Li and Chan [22] pioneered the work of direct 3D human pose regression from 2D image. Several approaches have been proposed to learn the pose structure from data. Tekin et al. [43] learn latent pose representation using an auto-encoder. Zhou et al. [58] integrate a generative forward kinematics layer into the network to learn the joint angle parameters. Realizing the difficulty to minimize per-joint error, [36, 32, 42] advocate to predict relative joint locations to multiple joints. More recently, Rogez et al. [38] simplify the problem by local residual regression in the classified pose class. Another research direction has been focused on inferring 3D pose from 2D estimates. The underlying premise is that 2D human pose estimation can be regarded as nearly solved[49, 34, 6, 5, 10, 11, 14, 27, 55]. Consequently, the challenge of 3D pose estimation has been shifted from predicting accurate 2D towards predicting depth from RGB image. As an example, Moreno-Noguer [33] employs FCN[29] to obtain pairwise 3D distance matrix from 2D observation for absolute 3D pose recovery. Tekin et al. [44] fuse image space and heatmap space to combine the best of both worlds. Tome et al. [46] adopt an iterative multi-stage architecture where 2D confidence is progressively lifted to 3D and projected back to 2D, ensuring the match between 2D observation and 3D prediction. Zhou et al. [60] combine heatmap and 3D geometric prior in EM algorithm to reconstruct 3D skeleton from 2D joints. Zhou et al. [57] output depth from 2D heatmap and intermediate feature maps in a weakly-supervised fashion. Nevertheless, the inherent depth ambiguity presents the foremost obstacle impeding the progress of 3D human pose estimation. Recent techniques ameliorate this issue through dual-source training. Examples include [56, 57, 42]. Our proposed DeepSkeleton stands in contrast to the latent assumption of these works that shared feature learning from both 2D and 3D data is essential. 
Our results suggest that 2D and 3D data sources may be completely independent, which is in line with recent observations [46, 31].

Exemplar based pose estimation Most previous works on 3D human pose estimation generate only a single 3D pose. This is problematic, as multiple 3D poses may have similar 2D projections. With early reference to [9], generating multiple hypotheses has been repurposed for 3D human pose estimation in [18]. In their work, a generative model is responsible for generating multiple 3D poses from a 3D MoCap library. Similarly, Chen and Ramanan [7] retrieve 3D pose using nearest neighbour search. DeepSkeleton bears some resemblance to these works, except that it does not need a large-scale offline MoCap dataset.

Semantic Segmentation The fully convolutional network (FCN) [29] made the first attempt to solve semantic segmentation using a deep neural network. More recent works enjoy the benefits of residual connections [28, 8]. However, accurate pixel-wise annotation of segmentation maps is time-intensive. Unlike full human body part segmentation [35, 51, 26], synthesizing the ground truth of our proposed skeleton map only requires 2D annotation.

Joint Semantic Segmentation and Pose Estimation Segmentation and pose estimation are two intrinsically complementary tasks, in the sense that pose estimation benefits from the topology of semantic parts, while the estimated pose skeleton offers a natural cue for alignment with part instances. One of the most significant works that jointly solve these two problems is Shotton et al. [40], which derives an intermediate body part segmentation representation for 3D joint prediction from depth images. A large body of literature has been devoted to this field [12, 21, 54, 20, 2]. Tripathi et al. [48] demonstrate the impact of a human pose prior on person instance segmentation. Perhaps the approach most related to ours is Xia et al. [52]. Their skeleton label map serves as a prior to regularize the segmentation task, and is derived from 2D pose prediction. In contrast, our skeleton map is independent of 2D detection and is taken directly as input to the subsequent regression module. Another major difference is that they aim to solve multi-person 2D pose estimation, while we target single-person 3D pose estimation.

3 Methodology

Given an RGB image of a person, our goal is to output the 3D locations of all body joints. We break the problem down into three steps:

  • Segmentation (Section 3.2):

    For each configuration, defined by a crop scale and a stick width, a foreground skeleton map and a background skeleton map are generated via a deconvolutional network.

  • Regression (Section 3.3):

    Skeleton maps are fed individually into separate regression networks, each of which takes the skeleton maps of one configuration as input and outputs one 3D pose hypothesis, resulting in multiple 3D hypotheses.

  • Matching (Section 3.4):

    To match the 2D observation, the hypothesis with minimum projection error to the 2D joint detection is selected as the final output.

3.1 Skeleton Map

A skeleton map draws a stick of a given width between neighbouring joints in the human skeleton and assigns different colors to distinguish body parts; an example is depicted in Figure 1. The human skeleton we use, in which 15 body parts are defined, follows the paradigm of [42] except that we define the thorax as the root in all our experiments. A skeleton map, like a body part segmentation map, encodes part relationships in the body segments and imposes a strong prior on human pose. However, training networks for full human body semantic segmentation needs labor-intensive, dense and precise annotations, which is impractical for large human pose datasets, e.g. Human3.6M [17]. The simplicity of the skeleton map naturally addresses this issue. But what is a good architecture for skeleton map generation?
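To make the definition concrete, the following minimal sketch rasterizes a skeleton map from 2D joint annotations by painting every pixel that lies within half a stick width of a bone. The joint names, the two-bone toy skeleton, the colors and the helper names `point_segment_distance` and `render_skeleton_map` are illustrative assumptions, not the 15-part skeleton or color scheme used in this work.

```python
import math

def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point P to the segment AB."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of P onto AB, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def render_skeleton_map(joints, bones, colors, size, stick_width):
    """Return a size x size image as rows of RGB tuples.

    joints: dict name -> (x, y) in pixel coordinates
    bones:  list of (joint_a, joint_b) pairs, one per body part
    colors: list of RGB tuples, one per bone
    A pixel is painted when it lies within stick_width / 2 of a bone;
    later bones overwrite earlier ones (no occlusion reasoning here).
    """
    image = [[(0, 0, 0)] * size for _ in range(size)]
    for (a, b), color in zip(bones, colors):
        (ax, ay), (bx, by) = joints[a], joints[b]
        for y in range(size):
            for x in range(size):
                if point_segment_distance(x, y, ax, ay, bx, by) <= stick_width / 2.0:
                    image[y][x] = color
    return image

# Toy example: two bones of a hypothetical arm on a 56x56 canvas.
joints = {"shoulder": (10, 10), "elbow": (28, 28), "wrist": (46, 28)}
bones = [("shoulder", "elbow"), ("elbow", "wrist")]
colors = [(255, 0, 0), (0, 255, 0)]
skeleton_map = render_skeleton_map(joints, bones, colors, size=56, stick_width=4)
```

Only 2D joint coordinates are needed to produce such a target, which is why ground truth skeleton maps are cheap to synthesize compared with dense part segmentation.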

3.2 Deconvolution for Generating Skeleton Map

Deconvolutional Network Design A simple choice is to employ the encoder-decoder architecture. In practice, we apply ResNet-50[16] in a fully convolutional manner, producing pixel-level real values. We replace the fully connected layer after pool5 with deconvolutional layers. The network structure shown in Figure 2 starts with a 224×224 image and extracts features along the downsampling process. Herein only res2c, res3d, res4f and res5c are sketched for brevity. The last fully connected layer is simply discarded. The deconvolutional module built upon pool5 gradually processes feature maps to the final output: a three-channel 56×56 skeleton map. It comprises repeated blocks of an upsampling layer (with initial weights set to bilinear upsampling) followed by a residual module. To encourage the learning of local and global context, high-level feature maps are combined with low-level feature maps through skip connections, analogous to those used in deep networks[29, 34]. The output is composed of three channels representing the skeleton map. Rather than performing per-pixel classification into body part classes, we found that per-pixel regression results in better segmentation accuracy. We opt for sigmoid cross entropy loss in training. However, one common problem in training deep networks is the notorious vanishing gradient. To remedy this issue, each blob feeding into the pixel-wise summation layer branches off and connects to a residual block. Intermediate supervision is then applied on the output of each residual block, allowing for learning skeleton maps at multiple resolutions.
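Initializing an upsampling layer to bilinear interpolation is a common practice for learnable deconvolution layers. The sketch below builds such a kernel, assuming the kernel-size convention found in public FCN code (size 2f - f mod 2 for upsampling factor f); the function name `bilinear_kernel` is illustrative and the exact layer configuration of the network is not specified here.

```python
def bilinear_kernel(upsample_factor):
    """2D bilinear interpolation kernel for a transposed convolution
    whose stride equals `upsample_factor`."""
    size = 2 * upsample_factor - upsample_factor % 2
    factor = (size + 1) // 2
    # The kernel center sits on a pixel for odd sizes, between pixels otherwise.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    weights_1d = [1.0 - abs(i - center) / factor for i in range(size)]
    # The 2D kernel is the outer product of the 1D triangle weights.
    return [[wy * wx for wx in weights_1d] for wy in weights_1d]

kernel = bilinear_kernel(2)  # 4x4 kernel for 2x upsampling
```

With this initialization the layer starts out as plain bilinear upsampling and is then free to deviate from it during training.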

Figure 2: Deconvolutional Network Architecture. The blocks from input image up to res2c are not drawn for simplicity. Intermediate supervision is not plotted. See Section 3.2 for details.

After the initial performance gain brought by common practice, i.e. skip connections and intermediate supervision, investigations of prevalent conv-deconv network architectures, e.g. RefineNet[28] and Stacked Hourglass[34], did little to further improve the segmentation accuracy.

Deal with truncation Truncation, that is, partial visibility of human body joints caused by the image boundary, poses a noticeable challenge, especially for in-the-wild 3D human pose estimation. In our case, it manifests as the deconvolutional network being uncertain whether to plot the segments associated with a cropped endpoint joint, i.e. wrist or ankle, due to lack of image evidence. A standard way to deal with this is to use multiple image crops, obtained by multiplying the rough person scale provided in the dataset by a rescaling factor. The 2D joint ground truth is rescaled in the cropped window accordingly. Note that the indoor dataset faces no truncation problem, so its crop scale is always set to 1.0.
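The multi-crop procedure can be sketched as follows. The sketch assumes the MPII convention in which the provided person scale is expressed relative to a 200-pixel person height, and a square crop centered on the person; the helper names, the example rescaling factor 1.25 and the 224-pixel output size are illustrative choices rather than the settings used in the experiments.

```python
def crop_window(center, person_scale, rescale_factor, reference_height=200.0):
    """Square crop (x0, y0, side) around `center`.

    person_scale is assumed to follow the MPII convention (person height
    divided by 200 px); rescale_factor enlarges or shrinks the crop.
    """
    side = reference_height * person_scale * rescale_factor
    cx, cy = center
    return (cx - side / 2.0, cy - side / 2.0, side)

def joints_to_crop(joints, window, output_size):
    """Map image-space 2D joints into the crop resized to output_size."""
    x0, y0, side = window
    s = output_size / side
    return [((x - x0) * s, (y - y0) * s) for x, y in joints]

# One person, one hypothetical rescaling factor.
window = crop_window(center=(300.0, 200.0), person_scale=1.0, rescale_factor=1.25)
joints_in_crop = joints_to_crop([(300.0, 200.0), (175.0, 75.0)], window, output_size=224)
```

Larger rescaling factors keep more of the surrounding image in view, so joints near the boundary of a tight crop are less likely to be cut off.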

Deal with stick width Several endeavours have been made to effectively extract features for human pose estimation, most of which focus on multi-stream learning[53]. DeepSkeleton differs in that it directly modifies the target output by changing the stick width. Our design of the multi-scale skeleton map is motivated by the claim that each convolutional layer is responsible for skeleton pixels whose scale is less than its receptive field[39]. In concrete terms, only convolutional layers with a receptive field larger than the stick width can capture features of body parts. Hence, coarse segmentation (large stick width) and fine segmentation (small stick width) rely on varying combinations of low-level and high-level features. Note that care has to be taken when deciding the body segment width of the ground truth skeleton map. We consider two practical concerns: (1) the parts should be small enough to localize semantic body parts; (2) the parts should be large enough for convolution and deconvolution. The stick width is set empirically within a fixed range.
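The relation between receptive field and stick width can be illustrated with the standard receptive-field recurrence for stacked convolution and pooling layers. The layer list below is a ResNet-style stem given only as an example, and the stick width of 10 pixels is a hypothetical value, not one used in the experiments.

```python
def receptive_fields(layers):
    """Receptive field size (in input pixels) after each layer.

    layers: list of (name, kernel_size, stride) tuples applied in order.
    Uses the recurrence r_out = r_in + (k - 1) * jump, jump_out = jump * stride,
    where jump is the spacing of adjacent output units in input pixels.
    """
    rf, jump, out = 1, 1, []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        out.append((name, rf))
    return out

# Illustrative ResNet-style stem followed by one 3x3 convolution.
stem = [("conv1_7x7", 7, 2), ("pool1_3x3", 3, 2), ("conv_3x3", 3, 1)]
fields = receptive_fields(stem)

stick_width = 10  # hypothetical stick width in input pixels
layers_covering_stick = [name for name, rf in fields if rf > stick_width]
```

In this toy setting the first convolution alone sees less than one stick width, so a wider stick shifts the burden towards deeper layers, which is the intuition behind varying the stick width.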

Deal with occlusion Severe occlusion hinders accurate human pose estimation. Answers to simple queries such as "Is the left upper leg in front of the right upper leg?" can actually provide important occlusion cues for regression. Motivated by the inside/outside score map [25], in this work the foreground skeleton map displays body parts that are occluding others, while the background skeleton map displays those occluded by others. In other words, the skeleton map explicitly models the occlusion relationship of two overlapping body parts. As far as we know, this straightforward formulation has not been used before in the literature on human pose estimation. In more detail, recall that each 2D point on a bone corresponds to a ray through the camera optical center. Assume that a 3D point on one bone and a 3D point on another bone yield the same 2D projection. Of the two, call the point with smaller depth (closer to the camera) the near point and the other the far point. The foreground skeleton map assigns the color of the near point's bone to that pixel. In contrast, the background skeleton map assigns the color of the far point's bone, pretending that the farther bone is occluding the nearer one. See Figure 3 for an example. This inherent occlusion-aware property of the skeleton map is important for regression.
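The foreground and background assignment can be sketched with a simple per-pixel depth comparison. This is a minimal illustration under our own assumptions: bones carry integer labels instead of colors, depth is interpolated linearly along each bone in image space (an approximation to perspective-correct depth), and the function names are hypothetical.

```python
import math

def closest_point_param(px, py, a, b):
    """Parameter t in [0, 1] of the point on segment AB closest to P,
    together with the distance from P to that point."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    t = 0.0 if seg_len_sq == 0 else ((px - a[0]) * dx + (py - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    dist = math.hypot(px - (a[0] + t * dx), py - (a[1] + t * dy))
    return t, dist

def render_fg_bg(bones, size, stick_width):
    """bones: list of (joint_a, joint_b, label); joints are (x, y, depth).

    Returns two label maps (0 = empty). Where sticks overlap, the
    foreground map keeps the bone nearest to the camera and the
    background map keeps the farthest one.
    """
    fg = [[0] * size for _ in range(size)]
    bg = [[0] * size for _ in range(size)]
    near = [[math.inf] * size for _ in range(size)]
    far = [[-math.inf] * size for _ in range(size)]
    for a, b, label in bones:
        for y in range(size):
            for x in range(size):
                t, dist = closest_point_param(x, y, a, b)
                if dist > stick_width / 2.0:
                    continue
                depth = a[2] + t * (b[2] - a[2])  # linear depth along the bone
                if depth < near[y][x]:
                    near[y][x], fg[y][x] = depth, label
                if depth > far[y][x]:
                    far[y][x], bg[y][x] = depth, label
    return fg, bg

# Two crossing bones: bone 1 is closer to the camera than bone 2.
bones = [((5, 5, 1.0), (25, 25, 1.0), 1), ((25, 5, 2.0), (5, 25, 2.0), 2)]
fg, bg = render_fg_bg(bones, size=32, stick_width=3)
```

At the crossing point the two maps disagree (bone 1 in the foreground map, bone 2 in the background map), and everywhere else they coincide, so the pair of maps encodes which part is in front.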

Figure 3: An example of occlusion handling. From left to right: raw image, predicted foreground skeleton map, predicted background skeleton map, inferred 2D landmarks, inferred 3D joint positions. Note how the left/right body parts are distinguishably segmented in the foreground/background skeleton maps.

3.3 Regression

Plagued by the cluttered background of RGB images, a longstanding research direction in 3D human pose estimation has been to extract better features from raw RGB input. We show that regression from skeleton maps alone is feasible. We employ the state-of-the-art ResNet-50[16] as the backbone network. Since the skeleton maps generated by the deconvolutional network are 56×56, they are first rescaled to the input resolution of the regression network and concatenated together; the result is taken as input and processed along the downsampling path. The last fully connected layer is repurposed to output the 3D positions of all joints. Euclidean distance loss is applied for back propagation. Multiple 3D predictions are made by training independent regression networks for different skeleton map inputs. We want to emphasize that naïvely concatenating all the skeleton maps as input (early fusion), or learning parallel streams for multiple inputs and fusing the feature responses subsequently (late fusion), does not help. As an alternative, one might consider concatenating the skeleton map with the raw RGB image, which, however, does not boost performance in our observation. Therefore, we stick with the original design, i.e. learning 3D solely from the intermediate skeleton map representation.
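For reference, a Euclidean regression loss and its gradient can be written as below. The sketch assumes the batch-averaged form 1/(2N) Σ‖pred − gt‖² used by Caffe's EuclideanLoss layer; whether the regression network uses exactly this normalization is an assumption, and the function names are illustrative.

```python
def euclidean_loss(predictions, targets):
    """Batch-averaged squared L2 loss, 1/(2N) * sum ||pred - gt||^2.

    predictions, targets: lists of N flat vectors of length 3 * J,
    where J is the number of joints.
    """
    n = len(predictions)
    total = 0.0
    for pred, gt in zip(predictions, targets):
        total += sum((p - g) ** 2 for p, g in zip(pred, gt))
    return total / (2.0 * n)

def euclidean_loss_grad(predictions, targets):
    """Gradient of the loss with respect to each prediction entry."""
    n = len(predictions)
    return [[(p - g) / n for p, g in zip(pred, gt)]
            for pred, gt in zip(predictions, targets)]

pred = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]  # one sample, two joints
gt = [[0.0, 0.0, 3.0, 1.0, 5.0, 1.0]]
loss = euclidean_loss(pred, gt)
grad = euclidean_loss_grad(pred, gt)
```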

3.4 Matching

Now that we have multiple 3D pose hypotheses, the problem boils down to selecting the optimal hypothesis as the final 3D output. The simplest way is to choose the candidate whose projection best matches the 2D pose detection result. Given the camera projection matrix and the 2D pose detection, we select the 3D pose hypothesis that minimizes the reprojection error.

We use the pre-trained state-of-the-art 2D detector Stacked Hourglass[34] for generating the 2D pose detection. No finetuning is employed. We remark that our 3D hypotheses are completely independent of 2D pose detection; rather, they are learnt from multi-level discriminative skeleton maps.
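The matching step can be sketched as a search over hypotheses for the smallest reprojection error. The sketch assumes a pinhole camera with known focal length and principal point, hypotheses expressed in absolute camera coordinates, and the mean per-joint pixel distance as the error measure; these choices and all names are illustrative assumptions, since the exact projection model and norm are not restated here.

```python
import math

def project(point3d, focal, principal):
    """Pinhole projection of a camera-space point (x, y, z) to pixels."""
    x, y, z = point3d
    return (focal[0] * x / z + principal[0], focal[1] * y / z + principal[1])

def reprojection_error(pose3d, pose2d, focal, principal):
    """Mean pixel distance between projected 3D joints and 2D detections."""
    dists = [math.dist(project(p3, focal, principal), p2)
             for p3, p2 in zip(pose3d, pose2d)]
    return sum(dists) / len(dists)

def select_hypothesis(hypotheses, pose2d, focal, principal):
    """Index and pose of the hypothesis that best matches the 2D detection."""
    errors = [reprojection_error(h, pose2d, focal, principal) for h in hypotheses]
    best = min(range(len(hypotheses)), key=errors.__getitem__)
    return best, hypotheses[best]

# Toy data: three two-joint hypotheses (mm) and one 2D detection (pixels).
focal, principal = (1000.0, 1000.0), (500.0, 500.0)
detection = [(500.0, 500.0), (600.0, 500.0)]
hypotheses = [
    [(0.0, 0.0, 2000.0), (300.0, 0.0, 2000.0)],
    [(0.0, 0.0, 2000.0), (200.0, 0.0, 2000.0)],
    [(0.0, 0.0, 2000.0), (200.0, 100.0, 2000.0)],
]
best_index, best_pose = select_hypothesis(hypotheses, detection, focal, principal)
```

Because the criterion only looks at the 2D projection, two hypotheses that differ purely in depth can tie, which is one reason the choice of selection rule matters.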

4 Discussions

In principle, any expressive intermediate representation can be used for regression. Heatmaps, for example, have been explored in [50] to bridge the gap between real and synthetic datasets. Carreira et al. [6] stack heatmaps with the RGB image for coarse-to-fine regression. Yet an obvious drawback of heatmaps is that different body joints are encoded as discrete Gaussian peaks, so the dependence between joints is not well exploited. The skeleton map overcomes this problem by explicitly connecting the adjacent joints of each bone by a stick. The colored semantic body parts offer a strong cue for regression learning. For generating a skeleton map, perhaps the easiest way is to first detect 2D joints and then draw lines between neighbouring joints. However, this introduces two disadvantages: (1) occlusion relationship information is completely discarded; (2) inaccurate 2D detection affects the subsequent regression. Our initial exploration shows that this alternative has no apparent benefits.

5 Experiments

Our approach is evaluated on the largest human pose benchmarks, MPII and Human3.6M. MPII [3] is a 2D real-world human pose dataset. It contains around 25K natural images collected from YouTube with a variety of poses and complicated image appearances. Cluttered background, multiple people, severe truncation and occlusion make it the most challenging 2D human pose dataset. For 2D pose evaluation on MPII, we use the PCKh[3] metric, which measures the percentage of joints whose distance to the ground truth is below a certain threshold.
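As an illustration of the metric, PCKh@0.5 counts a joint as correct when its 2D error is at most half the head segment length. The sketch below evaluates a single person and takes the head size as a given input; how the head size is derived from annotations, and the aggregation over a whole dataset, are left out. The function name and toy values are illustrative.

```python
import math

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints whose 2D error is at most alpha * head_size."""
    threshold = alpha * head_size
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(gt)

# Toy example with four joints and a head size of 10 pixels.
gt_2d = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
pred_2d = [(3.0, 4.0), (10.0, 0.0), (20.0, 20.0), (30.0, 6.0)]
score = pckh(pred_2d, gt_2d, head_size=10.0)
```

Normalizing by head size rather than by a fixed pixel threshold makes the score comparable across people who appear at different scales.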

Human3.6M[17] is a large-scale 3D human pose dataset consisting of 3.6M video frames captured in a controlled laboratory environment. 6 male and 5 female actors performing 17 daily activities are captured by a motion capture system in 4 different camera views. The variation in clothing and background appearance is limited compared to MPII. Following standard practice in [57, 22, 58, 60], five subjects (S1, S5, S6, S7, S8) are used in training. Every frame of two subjects (S9, S11) is used in testing. MPJPE (mean per joint position error)[17] is used as the evaluation metric after aligning 3D poses to the root joint. We represent 3D pose in the local camera coordinate system following the methodology of Zhou et al. [57].
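The evaluation protocol can be sketched in a few lines: both poses are translated so that the root joint sits at the origin, and the Euclidean distances of all joints are averaged. The root index and the three-joint toy poses below are illustrative assumptions.

```python
import math

def root_align(pose, root_index=0):
    """Translate a pose so that its root joint is at the origin."""
    root = pose[root_index]
    return [tuple(c - r for c, r in zip(joint, root)) for joint in pose]

def mpjpe(pred, gt, root_index=0):
    """Mean per joint position error after root alignment (same unit as input)."""
    pred_aligned = root_align(pred, root_index)
    gt_aligned = root_align(gt, root_index)
    errors = [math.dist(p, g) for p, g in zip(pred_aligned, gt_aligned)]
    return sum(errors) / len(errors)

# Toy poses in millimetres; joint 0 plays the role of the root.
gt_pose = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
pred_pose = [(50.0, 50.0, 50.0), (150.0, 50.0, 80.0), (150.0, 290.0, 50.0)]
error_mm = mpjpe(pred_pose, gt_pose)
```

Root alignment removes the global translation, so the metric measures the articulated pose rather than the absolute position of the person.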

5.1 Implementation Detail

For training the network we use Caffe[19] with 15 GPUs. Deconvolutional network training starts with base learning rate 0.00001 and mini-batch size 12. For training the regression network, we set the base learning rate to 0.01 and the mini-batch size to 32. The learning rate is dropped by a factor of 10 after the error plateaus on the validation set. The network is trained until convergence. For optimization, stochastic gradient descent is adopted. Weight decay is 0.0002, and momentum is 0.9. No data augmentation or data fusion is used.

5.2 Baseline Settings

To validate the effectiveness of skeleton map and multiple hypothesis, we test two baselines:

  • Direct RGB

    It performs regression directly from raw RGB input.

  • Ours w/o Mul-Hyp

    It performs regression from only one skeleton map.

Unless otherwise specified, all single-hypothesis (single skeleton map) experiments on both Human3.6M and MPII use the same fixed configuration.

Our final system is denoted as Ours w Mul-Hyp (equivalent to DeepSkeleton).

5.3 Experiments on 3D dataset Human3.6M

Comparison with state-of-the-art For fair comparison, we compare in Table 1 with state-of-the-art methods that do not mix 2D and 3D data for training. Note that Compositional Pose Regression [42] provides results with and without 2D data. The [42] row in Table 1 therefore refers to Compositional Pose Regression without extra 2D training data, with results taken from the original paper. Table 1 shows that our final system Ours w Mul-Hyp outperforms the main competitor Tome et al. [46]. Notably, it surpasses competing methods on the actions Sit, SitDown and Photo by a large margin. The improvement comes from our novel skeleton map representation and the expressiveness of multiple hypotheses. Visualized 3D poses are displayed in Figure 5.

Method             Direction  Discuss  Eat    Greet  Phone  Pose   Purchase  Sit
Tekin[45]          102.4      147.7    88.8   125.3  118.0  112.4  129.2     138.9
Chen[7]            89.9       97.6     90.0   107.9  107.3  93.6   136.1     133.1
Zhou[60]           87.4       109.3    87.1   103.2  116.2  106.9  99.8      124.5
Xingyi[58]         91.8       102.4    97.0   98.8   113.4  90.0   93.8      132.2
[42]               90.2       95.5     82.3   85.0   87.1   87.9   93.4      100.3
Tome[46]           65.0       73.5     76.8   86.4   86.3   68.9   74.8      110.2
Moreno-Noguer[33]  69.5       80.2     78.2   87.0   100.8  76.0   69.7      104.7
Ours w Mul-Hyp     75.6       75.0     94.9   82.4   107.7  91.2   86.6      73.3

Method             SitDown  Smoke  Photo  Wait   Walk   WalkDog  WalkTogether  Avg
Tekin[45]          224.9    118.4  182.7  138.8  55.1   126.3    65.8          125.0
Chen[7]            240.1    106.7  139.2  106.2  87.0   114.1    90.6          114.2
Zhou[60]           199.2    107.4  143.3  118.1  79.4   114.2    97.7          113.0
Xingyi[58]         159.0    106.9  125.2  94.4   79.0   126.0    99.0          107.3
[42]               135.4    91.4   94.5   87.3   78.0   90.4     86.5          92.4
Tome[46]           173.9    85.0   110.7  85.8   71.4   86.3     73.1          88.4
Moreno-Noguer[33]  113.9    89.7   102.7  98.5   79.2   82.4     77.2          87.3
Ours w Mul-Hyp     80.5     83.9   81.5   97.1   99.4   78.9     87.0          86.5

Table 1: Comparison with state-of-the-art on Human3.6M. No mixed 2D and 3D data training is used in any of the methods. MPJPE (mean per joint position error, in mm) is used as the evaluation metric.

Comparison with Regression from RGB Table 2 shows that Ours w/o Mul-Hyp significantly improves baseline Direct RGB by 12.3mm(relative 10.8%), demonstrating the strength of skeleton map.

Method            Avg MPJPE
Direct RGB        114.2
Ours w/o Mul-Hyp  101.9
Ours w Mul-Hyp    86.5

Table 2: Comparison of ours with regression from raw RGB on Human3.6M. Mul-Hyp=multiple hypotheses. MPJPE metric is used.

Does the skeleton map bring better 2D estimation, or better depth estimation? To answer this question, we evaluate the average joint error given ground truth depth and ground truth 2D respectively in the following. Without loss of generality, we restrict ourselves to generating one hypothesis.

Impact of Skeleton Map on 2D Estimation We make use of ground truth depth and predicted 2D to recover 3D pose in the camera coordinate system. Table 3 reports the result of average joint error using different input sources for regression network. One can see that 25.9mm(relative 27.9%) error reduction is obtained after feeding predicted skeleton map to regression network. Further 17.4mm decrease is achieved by using ground truth skeleton map for regression. This can be interpreted as skeleton map simplifies the 2D learning procedure and prevents overfitting. Strong shape prior serves as important regularization cue for learning 2D location.
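Combining a 2D location with a depth value to obtain a 3D point in the camera coordinate system amounts to back-projection. The sketch below assumes a pinhole camera with known focal length and principal point; the function name and the example intrinsics are illustrative, and the exact recovery procedure used in this experiment is not spelled out in the text.

```python
def backproject(pixel, depth, focal, principal):
    """Lift a pixel (u, v) with known depth z to camera coordinates (x, y, z)."""
    u, v = pixel
    x = (u - principal[0]) * depth / focal[0]
    y = (v - principal[1]) * depth / focal[1]
    return (x, y, depth)

# Predicted 2D joint (pixels) paired with ground truth depth (mm).
joint_3d = backproject(pixel=(600.0, 450.0), depth=2000.0,
                       focal=(1000.0, 1000.0), principal=(500.0, 500.0))
```

The same function covers the complementary experiment, where the pixel comes from ground truth 2D and the depth from the network prediction.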

Method                          Avg MPJPE
Direct RGB w GT Depth, Pred 2D  92.8
Pred Ske w GT Depth, Pred 2D    66.9
GT Ske w GT Depth, Pred 2D      49.5
Table 3: Performance given ground truth depth of different regression input on Human3.6M. Pred(GT) Ske=Use predicted(ground truth) skeleton map for regression. Direct RGB=Use RGB for regression. Pred 2D=Use predicted 2D. GT Depth=Use ground truth depth. MPJPE metric is used.

Impact of Skeleton Map on Depth Estimation To gain insight into the importance of skeleton map for depth estimation, we use ground truth 2D and predicted depth to acquire 3D joints. We see in Table 4 that depth regression from predicted skeleton map shows evident superiority over RGB image, yielding 21.5mm(relative 22.7%) error reduction. This indicates that skeleton map is more favorable for depth prediction.

Impact of Multiple Hypotheses Next we elaborate on the effect of using multiple hypotheses. We first assume that the ground truth skeleton map is provided. In Table 5, multiple hypotheses slightly improve the accuracy, but to a lesser extent than expected. This implies that the ground truth skeleton map is sufficiently powerful to reduce ambiguity. We then move to a realistic scenario where the ground truth skeleton map is unavailable. Quite surprisingly, using multiple hypotheses reduces the average MPJPE from 101.9mm to 86.5mm in Table 2, which largely narrows the performance gap between ground truth and predicted skeleton maps. Generated multiple hypotheses are illustrated in Figure 4, where the third hypothesis is chosen as the final output based on simple matching. One could argue that similar performance might be accomplished by ensembling multiple runs of the same regression network. To examine this, we take the regression outputs of different runs from a single skeleton map, denoted as Ensemble. The result in Table 6 suggests that our multi-level skeleton map provides more information than a single skeleton map. A natural question arises: what is the performance upper bound of multiple hypotheses? To investigate this, we select the optimal 3D hypothesis with minimum 3D error to the ground truth 3D pose, producing an error of 68.3mm. This is promising, as it would surpass most state-of-the-art works without an offline 3D pose library. However, how to select the optimal 3D hypothesis remains unclear.
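The upper bound experiment corresponds to an oracle that has access to the ground truth 3D pose when choosing among hypotheses. A minimal sketch, with illustrative function names and toy single-joint poses:

```python
import math

def pose_error(pred, gt):
    """Mean per-joint Euclidean distance between two 3D poses."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def oracle_select(hypotheses, gt):
    """Index and error of the hypothesis closest to the ground truth pose."""
    errors = [pose_error(h, gt) for h in hypotheses]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]

# Toy example: three one-joint hypotheses against one ground truth pose (mm).
gt_pose = [(0.0, 0.0, 0.0)]
candidates = [[(30.0, 40.0, 0.0)], [(0.0, 0.0, 10.0)], [(0.0, 60.0, 80.0)]]
oracle_index, oracle_error = oracle_select(candidates, gt_pose)
```

The oracle is not usable at test time; it only measures how good the best hypothesis in the pool is, and hence how much a better selection rule could gain over reprojection matching.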

Method                          Avg MPJPE
Direct RGB w GT 2D, Pred Depth  94.6
Pred Ske w GT 2D, Pred Depth    73.1
GT Ske w GT 2D, Pred Depth
Table 4: Performance given ground truth 2D of different regression input on Human3.6M. Pred(GT) Ske=Use predicted(ground truth) skeleton map for regression. Direct RGB=Use RGB for regression. GT 2D=Use ground truth 2D. Pred Depth=Use predicted depth. MPJPE metric is used.
Figure 4: Visualization of multiple hypotheses on Human3.6M. Bottom shows the predicted hypotheses and ground truth hypotheses(white) from a novel viewpoint. Top shows the projection of 3D hypotheses and raw image. The third hypothesis is the final output.
Figure 5: Qualitative results on Human3.6M(First row) and MPII(Second to fourth row). 3D poses are illustrated from a novel viewpoint. Note that 3D pose results for natural images(MPII) are quite plausible. Different colors are used to differentiate MPII from Human3.6M.
Method Avg MPJPE
 GT Ske w/o Mul-Hyp 76.2
GT Ske w Mul-Hyp
Table 5: Performance gain from multiple hypotheses given ground truth skeleton map on Human3.6M. MPJPE metric is used.

Method            Direction  Discuss  Eat    Greet  Phone  Pose   Purchase  Sit
Ours w/o Mul-Hyp  94.5       89.1     103.8  101.7  131.4  97.0   107.7     84.4
Ours w Mul-Hyp    75.6       75.0     94.9   82.4   107.7  91.2   86.6      73.3

Method            SitDown  Smoke  Photo  Wait   Walk   WalkDog  WalkTogether  Avg
Ours w/o Mul-Hyp  97.7     93.8   98.0   111.6  110.9  97.6     111.0         101.9
Ours w Mul-Hyp    80.5     83.9   81.5   97.1   99.4   78.9     87.0          86.5
Table 6: Performance gain from multiple hypotheses given predicted skeleton map on Human3.6M. MPJPE metric is used.

5.4 Experiments on 2D dataset MPII

We present 2D and 3D pose estimation for in-the-wild dataset MPII. We use MPII validation set [47] including 2958 images for ablation study.

Pseudo 3D Ground Truth MPII only provides 2D annotation, but training our network requires 3D pose ground truth. We use a state-of-the-art 3D reconstruction approach [59] to initialize the 3D pose from 2D landmarks. Note that most of the reconstructed poses are already reasonable despite occasional incorrect inference. We then introduce human assistance, where a human expert is presented with the initialized 3D pose along with the input image and asked to manually adjust wrong limb orientations. We stress that the goal of semi-automatic annotation is to resolve the depth ambiguity as far as possible by aligning the 3D pose with the image observation. Since accurate 3D MoCap pose is impractical for natural images, we call this pseudo 3D ground truth.

Method All
 Wei[49] 88.5
Newell[34] 90.9
Chu[11] 91.5


Rogez(LCR-Net)[38] 74.2
Ours w Mul-Hyp 73.1
Table 7: Comparison with state-of-the-art on MPII test set. PCKh@0.5 is used as evaluation metric. All denotes PCKh@0.5 of all joints. Top section: 2D detection based. Middle section: 2D regression based. Bottom section: 3D regression based.

Comparison with state-of-the-art Previous approaches generally fall into three families: 2D detection based, 2D regression based and 3D regression based. Our method belongs to the 3D regression based family, in which our closest competitors are [38, 42]. The comparison is not completely fair, as they both use additional 3D data. Sun et al. [42] integrate a two-stage state-of-the-art 2D regression based method, IEF[6], into their network. For completeness, we also report the Stage 0 result provided in their paper, which amounts to direct regression without an ad-hoc stage. We observe in Table 7 that Ours w Mul-Hyp is on par with state-of-the-art 3D regression methods without 3D data or post processing. Qualitative results are shown in Figure 5.

Comparison with Regression from RGB Table 8 compares Ours w/o Mul-Hyp with Direct RGB. We observe that each joint gains tremendous improvement. For instance, elbow PCKh@0.5 is improved by 12.0%(relative 20.7%) and ankle PCKh@0.5 is improved by 5.7%(relative 17.2%). This again demonstrates the remarkable merit of skeleton map.

Method            Head  Sho.  Elb.  Wri.
Direct RGB        79.1  75.1  58.0  46.9
Ours w/o Mul-Hyp  90.6  83.4  70.0  54.5

Method            Hip   Knee  Ank.  Mean
Direct RGB        64.5  49.0  33.1  61.8
Ours w/o Mul-Hyp  74.2  59.4  38.8  71.2

Table 8: Comparison to direct regression from RGB on MPII validation set. Regression from only one skeleton map increases mean PCKh@0.5 by 9.4.

Impact of Multiple Hypotheses Table 9 shows the effect of multi-scale and multi-crop skeleton map. We observe the same conclusion as in Table 6. Using multi-scale skeleton map results in 9.4%(relative 24.2%) improvement of ankle PCKh@0.5. Multi-crop skeleton map yields extra 7.0% improvement. It is noteworthy that ensemble of the same regression network falls behind our final system, indicating multi-level skeleton map is able to capture diverse semantic features from input image.

Method Head Sho. Elb. Wri.
 Base 90.6 83.4 70.0 54.5

Hip Knee Ank. Mean
 Base 74.2 59.4 38.8 71.2
Table 9: Comparison to ours with single hypothesis on MPII validation set. Base: Ours w/o Mul-Hyp. Mul-S: vary the stick width of the skeleton map. Mul-C: vary the crop size of the raw image. Ensemble: 18 different runs of regression from one skeleton map. PCKh@0.5 metric is used.

Performance Upper Bound One remaining question is: what is the limit of the skeleton map when applied to natural, unconstrained scenes? To assess the upper bound, we perform regression from a single ground truth skeleton map on MPII. As Table 10 shows, regression from a single ground truth skeleton map achieves 94.5% overall PCKh@0.5. This validates the effectiveness of the skeleton map representation.

Method              Head  Sho.  Elb.  Wri.  Hip   Knee  Ank.  Mean
Ours w/o Mul-Hyp    90.6  83.4  70.0  54.5  74.2  59.4  38.8  71.2
GT Ske w/o Mul-Hyp  -     -     -     -     -     -     -     94.5

Table 10: Comparison of regression from one predicted skeleton map (Ours w/o Mul-Hyp) and regression from one ground truth skeleton map (GT Ske w/o Mul-Hyp) on MPII validation set. PCKh@0.5 metric is used.
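A ground truth skeleton map can be thought of as the annotated joints connected by sticks of a chosen width. The sketch below rasterizes such a map in pure Python to make the idea concrete. The bone list, the integer label per bone, and the distance-based rasterization are illustrative assumptions and not the exact rendering used in this work; `render_skeleton_map` and `point_segment_distance` are hypothetical helper names.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to segment (ax, ay)-(bx, by)."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def render_skeleton_map(joints, bones, height, width, stick_width):
    """Draw each bone as a stick; pixel value k marks the k-th bone, 0 is background."""
    canvas = [[0] * width for _ in range(height)]
    for label, (i, j) in enumerate(bones, start=1):
        (ax, ay), (bx, by) = joints[i], joints[j]
        for y in range(height):
            for x in range(width):
                if point_segment_distance(x, y, ax, ay, bx, by) <= stick_width / 2.0:
                    canvas[y][x] = label
    return canvas


# One horizontal bone on a 3 x 7 canvas, drawn with two stick widths.
joints = [(1.0, 1.0), (5.0, 1.0)]
bones = [(0, 1)]
thin = render_skeleton_map(joints, bones, height=3, width=7, stick_width=1.0)
thick = render_skeleton_map(joints, bones, height=3, width=7, stick_width=3.0)
for row in thin:
    print(row)
```

Varying `stick_width` yields maps at different scales from the same joints, which is the kind of variation the Mul-S setting in Table 9 refers to.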

6 Conclusion

We have successfully shown how to push the limit of 3D human pose estimation using skeleton maps without fusing different data sources. The skeleton map is an effective abstraction of the input which, when combined with multiple hypotheses generation, achieves compelling results on both indoor and in-the-wild datasets. We also carry out an extensive experimental evaluation to understand the performance upper bound of our novel intermediate representation. We expect a better segmentation network to further narrow the performance gap between ground truth and predicted skeleton maps. We hope the idea of combining semantic segmentation and pose estimation inspires a new research direction in 3D human pose estimation.


References

  • [1] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1455, 2015.
  • [2] K. Alahari, G. Seguin, J. Sivic, and I. Laptev. Pose estimation and segmentation of people in 3d movies. In Proceedings of the IEEE International Conference on Computer Vision, pages 2112–2119, 2013.
  • [3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
  • [4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
  • [5] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pages 717–732. Springer, 2016.
  • [6] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4733–4742, 2016.
  • [7] C.-H. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. arXiv preprint arXiv:1612.06524, 2016.
  • [8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  • [9] E. Cho and D. Kim. Accurate human pose estimation by aggregating multiple pose hypotheses using modified kernel density approximation. IEEE Signal Processing Letters, 22(4):445–449, 2015.
  • [10] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4715–4723, 2016.
  • [11] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. arXiv preprint arXiv:1702.07432, 2017.
  • [12] J. Dong, Q. Chen, X. Shen, J. Yang, and S. Yan. Towards unified human parsing and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 843–850, 2014.
  • [13] X. Fan, K. Zheng, Y. Zhou, and S. Wang. Pose locality constrained representation for 3d human pose reconstruction. In European Conference on Computer Vision, pages 174–188. Springer, 2014.
  • [14] G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In European Conference on Computer Vision, pages 728–743. Springer, 2016.
  • [15] W. Gong, X. Zhang, J. Gonzàlez, A. Sobral, T. Bouwmans, C. Tu, and E.-h. Zahzah. Human pose estimation from monocular images: A comprehensive survey. Sensors, 16(12):1966, 2016.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  • [18] E. Jahangiri and A. L. Yuille. Generating multiple diverse hypotheses for human 3d pose consistent with 2d joint detections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 805–814, 2017.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
  • [20] P. Kohli, J. Rihan, M. Bray, and P. H. Torr. Simultaneous segmentation and pose estimation of humans using dynamic graph cuts. International Journal of Computer Vision, 79(3):285–298, 2008.
  • [21] L. Ladicky, P. H. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3585, 2013.
  • [22] S. Li and A. B. Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In Asian Conference on Computer Vision, pages 332–347. Springer, 2014.
  • [23] S. Li, Z.-Q. Liu, and A. B. Chan. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 482–489, 2014.
  • [24] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2848–2856, 2015.
  • [25] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709, 2016.
  • [26] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan. Human parsing with contextualized convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1386–1394, 2015.
  • [27] I. Lifshitz, E. Fetaya, and S. Ullman. Human pose estimation using deep consensus voting. In European Conference on Computer Vision, pages 246–260. Springer, 2016.
  • [28] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv preprint arXiv:1611.06612, 2016.
  • [29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [30] D. C. Luvizon, H. Tabia, and D. Picard. Human pose regression by combining indirect part detection and contextual information. arXiv preprint arXiv:1710.02322, 2017.
  • [31] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. arXiv preprint arXiv:1705.03098, 2017.
  • [32] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3d human pose estimation using transfer learning and improved cnn supervision. arXiv preprint arXiv:1611.09813, 2016.
  • [33] F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. arXiv preprint arXiv:1611.09010, 2016.
  • [34] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
  • [35] G. L. Oliveira, A. Valada, C. Bollen, W. Burgard, and T. Brox. Deep learning for human part discovery in images. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 1634–1641. IEEE, 2016.
  • [36] S. Park, J. Hwang, and N. Kwak. 3d human pose estimation using convolutional neural networks with 2d pose information. In Computer Vision–ECCV 2016 Workshops, pages 156–169. Springer, 2016.
  • [37] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human pose from 2d image landmarks. Computer Vision–ECCV 2012, pages 573–586, 2012.
  • [38] G. Rogez, P. Weinzaepfel, and C. Schmid. Lcr-net: Localization-classification-regression for human pose. In CVPR 2017-IEEE Conference on Computer Vision & Pattern Recognition, 2017.
  • [39] W. Shen, K. Zhao, Y. Jiang, Y. Wang, Z. Zhang, and X. Bai. Object skeleton extraction in natural images by fusing scale-associated deep side outputs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 222–230, 2016.
  • [40] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.
  • [41] K. Sun, C. Lan, J. Xing, W. Zeng, D. Liu, and J. Wang. Human pose estimation using global and local normalization. arXiv preprint arXiv:1709.07220, 2017.
  • [42] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. arXiv preprint arXiv:1704.00159, 2017.
  • [43] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3d human pose with deep neural networks. arXiv preprint arXiv:1605.05180, 2016.
  • [44] B. Tekin, P. Marquez Neila, M. Salzmann, and P. Fua. Learning to fuse 2d and 3d image cues for monocular body pose estimation. In International Conference on Computer Vision (ICCV), number EPFL-CONF-230311, 2017.
  • [45] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d body poses from motion compensated sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–1000, 2016.
  • [46] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. arXiv preprint arXiv:1701.00295, 2017.
  • [47] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
  • [48] S. Tripathi, M. Collins, M. Brown, and S. Belongie. Pose2instance: Harnessing keypoints for person instance segmentation. arXiv preprint arXiv:1704.01152, 2017.
  • [49] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
  • [50] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In European Conference on Computer Vision, pages 365–382. Springer, 2016.
  • [51] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human part segmentation with auto zoom net. arXiv preprint arXiv:1511.06881, 2015.
  • [52] F. Xia, P. Wang, X. Chen, and A. Yuille. Joint multi-person pose estimation and semantic part segmentation. arXiv preprint arXiv:1708.03383, 2017.
  • [53] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
  • [54] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in fashion photographs. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3570–3577. IEEE, 2012.
  • [55] W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3073–3082, 2016.
  • [56] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4948–4956, 2016.
  • [57] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3d human pose estimation in the wild: A weakly-supervised approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 398–407, 2017.
  • [58] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. In Computer Vision–ECCV 2016 Workshops, pages 186–201. Springer, 2016.
  • [59] X. Zhou, M. Zhu, S. Leonardos, and K. Daniilidis. Sparse representation for 3d shape estimation: A convex relaxation approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1648–1661, 2017.
  • [60] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4966–4975, 2016.