Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image
Although significant improvement has been achieved in 3D human pose estimation, most previous methods consider only the single-person case. In this work, we propose the first fully learning-based, camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. The pipeline of the proposed system consists of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation models. Our system achieves results comparable with the state-of-the-art 3D single-person pose estimation models without any groundtruth information and significantly outperforms previous 3D multi-person pose estimation methods on publicly available datasets. The code is available at https://github.com/mks0601/3DMPPE_ROOTNET_RELEASE and https://github.com/mks0601/3DMPPE_POSENET_RELEASE.
The goal of 3D human pose estimation is to localize semantic keypoints of a human body in 3D space. It is an essential technique for human behavior understanding and human-computer interaction. Recently, many methods [37, 43, 52, 26, 49, 44] utilize deep convolutional neural networks (CNNs) and have achieved noticeable performance improvement on large-scale publicly available datasets [16, 28].
Most of the previous 3D human pose estimation methods [37, 43, 52, 26, 49, 44] are designed for the single-person case. They crop the human area in an input image with a groundtruth box or a box predicted by a human detection model. The cropped patch of a human body is fed into the 3D pose estimation model, which then estimates the 3D location of each keypoint. As their models take a single cropped image, estimating the absolute camera-centered coordinate of each keypoint is difficult. To handle this issue, many methods [37, 43, 52, 26, 49, 44] estimate the 3D pose relative to a reference point in the body, e.g., the center joint (i.e., pelvis) of a human, called the root. The final 3D pose is obtained by adding the 3D coordinates of the root to the estimated root-relative 3D pose. Prior information on bone length or the groundtruth has been commonly used for the localization of the root.
Recently, many top-down approaches [13, 6, 47] for the 2D multi-person pose estimation have shown noticeable performance improvement. These approaches first detect humans by using a human detection module, and then estimate the 2D pose of each human by a 2D single-person pose estimation module. Although they are straightforward when used in 2D cases, extending them to 3D cases is challenging. Note that for the estimation of 3D multi-person poses, we need to know the absolute distance to each human from the camera as well as the 2D bounding boxes. However, existing human detectors provide 2D bounding boxes only.
In this study, we propose a general framework for 3D multi-person pose estimation. To the best of our knowledge, this study is the first to propose a fully learning-based camera distance-aware top-down approach whose components are compatible with most of the previous human detection and 3D human pose estimation methods. The pipeline of the proposed system consists of three modules. First, a human detection network (DetectNet) detects the bounding boxes of humans in an input image. Second, the proposed 3D human root localization network (RootNet) estimates the camera-centered coordinates of the detected humans’ roots. Third, a root-relative 3D single-person pose estimation network (PoseNet) estimates the root-relative 3D pose for each detected human. Figures 1 and 2 show the qualitative results and overall pipeline of our framework, respectively.
We show that our approach outperforms previous 3D multi-person pose estimation methods [41, 29] on several publicly available 3D single- and multi-person pose estimation datasets [16, 29] by a large margin. Also, even without any groundtruth information (i.e., the bounding box and the 3D location of the root), our method achieves comparable performance with the state-of-the-art 3D single-person pose estimation methods that use the groundtruth in the inference time. Note that our framework is new but follows previous conventions of object detection and 3D human pose estimation networks. Thus, previous detection and pose estimation methods can be easily plugged into our framework, which makes the proposed framework flexible and easy to use.
Our contributions can be summarized as follows.
We propose a new general framework for 3D multi-person pose estimation from a single RGB image. The framework is the first fully learning-based, camera distance-aware top-down approach whose components are compatible with most of the previous human detection and 3D human pose estimation models.
Our framework outputs the absolute camera-centered coordinates of multiple humans’ keypoints. For this, we propose a 3D human root localization network (RootNet). This model makes it easy to extend the 3D single-person pose estimation techniques to the absolute 3D pose estimation of multi-person.
We show that our method significantly outperforms previous 3D multi-person pose estimation methods on several publicly available datasets. Also, it achieves comparable performance with the state-of-the-art 3D single-person pose estimation methods without any groundtruth information.
2 Related works
2D multi-person pose estimation. There are two main approaches in the multi-person pose estimation. The first one, top-down approach, deploys a human detector that estimates the bounding boxes of humans. Each detected human area is cropped and fed into the pose estimation network. The second one, bottom-up approach, localizes all human body keypoints in an input image first, and then groups them using some clustering techniques.
[34, 13, 6, 47, 31, 30] are based on the top-down approach. Papandreou et al.  predicted 2D offset vectors and 2D heatmaps for each joint. They fused the estimated vectors and heatmaps to generate highly localized heatmaps. Chen et al.  proposed a cascaded pyramid network whose cascaded structure refines an initially estimated pose by focusing on hard keypoints. Xiao et al.  used a simple pose estimation network that consists of a deep backbone network and several upsampling layers.
[38, 14, 3, 33, 21] are based on the bottom-up approach. Cao et al.  proposed the part affinity fields (PAFs) that model the association between human body keypoints. They grouped the localized keypoints of all persons in the input image by using the estimated PAFs. Newell et al.  introduced a pixel-wise tag value to assign localized keypoints to a certain human. Kocabas et al.  proposed a pose residual network for assigning detected keypoints to each person.
3D single-person pose estimation. Current 3D single-person pose estimation methods can be categorized into single- and two-stage approaches. The single-stage approach directly localizes the 3D body keypoints from the input image. The two-stage methods utilize the high accuracy of 2D human pose estimation. They initially localize body keypoints in a 2D space and lift them to a 3D space.
[23, 45, 37, 43, 44] are based on the single-stage approach. Li et al.  proposed a multi-task framework that jointly trains both the pose regression and body part detectors. Tekin et al.  modeled high-dimensional joint dependencies by adopting an auto-encoder structure. Pavlakos et al.  extended the U-net shaped network to estimate a 3D heatmap for each joint. They used a coarse-to-fine approach to boost performance. Sun et al.  introduced compositional loss to consider the joint connection structure. Sun et al.  used soft-argmax operation to obtain the 3D coordinates of body joints in a differentiable manner.
[35, 5, 26, 52, 7, 49, 4] are based on the two-stage approach. Park et al.  estimated the initial 2D pose and utilized it to regress the 3D pose. Martinez et al.  proposed a simple network that directly regresses the 3D coordinates of body joints from 2D coordinates. Zhou et al.  proposed a geometric loss to facilitate weakly supervised learning of the depth regression module with images in the wild. Yang et al.  utilized adversarial loss to handle the 3D human pose estimation in the wild.
3D multi-person pose estimation. Few studies have been conducted on 3D multi-person pose estimation from a single RGB image. Rogez et al.  proposed a top-down approach called LCR-Net, which consists of localization, classification, and regression parts. The localization part detects a human from an input image, and the classification part classifies the detected human into several anchor-poses. The anchor-pose is defined as a pair of 2D and root-relative 3D pose. It is generated by clustering poses in the training set. Then, the regression part refines the anchor-poses. Mehta et al.  proposed a bottom-up approach system. They introduced an occlusion-robust pose-map formulation which supports pose inference for more than one person through PAFs .
3D human root localization in 3D multi-person pose estimation. Rogez et al. estimated both the 2D pose in the image coordinate space and the 3D pose in the camera-centered coordinate space simultaneously. They obtained the 3D location of the human root by minimizing the distance between the estimated 2D pose and the projected 3D pose, similar to what Mehta et al. did. However, this strategy cannot be generalized to other 3D human pose estimation methods because it requires both the 2D and 3D estimations. For example, many works [44, 52, 37, 49] estimate the 2D image coordinates and root-relative depth values of keypoints. As their methods do not output root-relative camera-centered coordinates of keypoints, such a distance minimization strategy cannot be used. Moreover, contextual information cannot be exploited because the image feature is not considered. For example, the strategy cannot distinguish between a child close to the camera and an adult far from the camera because their scales in the 2D image are similar.
3 Overview of the proposed model
The goal of our system is to recover the absolute camera-centered coordinates of multiple persons’ keypoints $\{\mathbf{P}_j^{abs}\}_{j=1}^{J}$, where $J$ denotes the number of joints. To address this problem, we construct our system based on the top-down approach that consists of DetectNet, RootNet, and PoseNet. The DetectNet detects a human bounding box of each person in the input image. The RootNet takes the cropped human image from the DetectNet and localizes the root of the human $\mathbf{R} = (x_R, y_R, Z_R)$, in which $x_R$ and $y_R$ are pixel coordinates, and $Z_R$ is the absolute depth value. The same cropped human image is fed to the PoseNet, which estimates the root-relative 3D pose $\mathbf{P}_j^{rel} = (x_j, y_j, Z_j^{rel})$, in which $x_j$ and $y_j$ are pixel coordinates and $Z_j^{rel}$ is the root-relative depth value. We convert $Z_j^{rel}$ into $Z_j^{abs}$ by adding $Z_R$ and transform $x_j$ and $y_j$ to the original image space before cropping. After applying the back-projection formula, the final absolute 3D pose $\{\mathbf{P}_j^{abs}\}_{j=1}^{J}$ is obtained.
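As a minimal sketch of the final step described above, the following Python snippet adds the root depth to each root-relative depth and back-projects the pixel coordinates with a pinhole model. The function names, the principal point (`cx`, `cy`), and the focal lengths in pixel units (`fx`, `fy`) are assumptions made for illustration; the text does not spell out the back-projection formula.

```python
def back_project(x, y, z, fx, fy, cx, cy):
    """Map pixel coordinates (x, y) with absolute depth z to camera-centered (X, Y, Z).

    Assumes a pinhole camera with focal lengths (fx, fy) in pixels and a
    principal point (cx, cy); both are illustrative assumptions.
    """
    return ((x - cx) * z / fx, (y - cy) * z / fy, z)


def absolute_pose(root, rel_pose, fx, fy, cx, cy):
    """Combine a root estimate with a root-relative pose.

    root:     (x_root, y_root, z_root), pixel coordinates plus absolute depth.
    rel_pose: list of (x_j, y_j, z_j_rel), pixel coordinates in the original
              image space plus root-relative depth.
    """
    _, _, z_root = root
    return [
        back_project(x, y, z_rel + z_root, fx, fy, cx, cy)
        for x, y, z_rel in rel_pose
    ]
```

For example, with a root at depth 3000 and a joint 100 pixels to the right of the principal point and 100 deeper than the root, the joint is placed at depth 3100 and its lateral offset scales with that depth.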
4 DetectNet
We use Mask R-CNN as the framework of DetectNet. Mask R-CNN consists of three parts. The first one, the backbone, extracts useful local and global features from the input image by using a deep residual network (ResNet) and a feature pyramid network. Based on the extracted features, the second part, the region proposal network, proposes human bounding box candidates. The RoIAlign layer extracts the features of each proposal and passes them to the third part, which is the classification head network. The head network determines whether the given proposal is a human or not and estimates bounding box refinement offsets. It achieves state-of-the-art performance on publicly available object detection datasets. Due to the high performance and publicly available code [9, 27], we use Mask R-CNN as the DetectNet in our pipeline.
5 RootNet
5.1 Model design
The RootNet estimates the camera-centered coordinates of the human root $\mathbf{R} = (x_R, y_R, Z_R)$ from a cropped human image. To obtain them, the RootNet separately estimates the 2D image coordinates $(x_R, y_R)$ and a depth value (i.e., the distance from the camera, $Z_R$) of the human root. The estimated 2D image coordinates are back-projected to the camera-centered coordinate space using the estimated depth value, which becomes the final output.
Considering that an image provides sufficient information on where the human root is located in image space, the 2D estimation part can learn to localize it easily. By contrast, estimating the depth only from a cropped human image is difficult because the input does not provide information on the relative position of the camera and human. To resolve this issue, we introduce a new distance measure, $k$, which is defined as follows:
$$k = \sqrt{\alpha_x \alpha_y \frac{A_{real}}{A_{img}}} \quad (1)$$
where $\alpha_x$, $\alpha_y$, $A_{real}$, and $A_{img}$ are the focal lengths divided by the per-pixel distance factors (pixel) of the $x$- and $y$-axes, the area of the human in real space ($mm^2$), and the area of the human in image space ($pixel^2$), respectively. $k$ approximates the absolute depth from the camera to the object using the ratio of its actual area to its imaged area, given the camera parameters. Eq 1 can be easily derived by considering a pinhole camera projection model. The distance ($d$) between the camera and object can be calculated as follows:
$$d = \alpha_x \frac{l_{x,real}}{l_{x,img}} = \alpha_y \frac{l_{y,real}}{l_{y,img}} \quad (2)$$
where $l_{x,real}$, $l_{x,img}$, $l_{y,real}$, and $l_{y,img}$ are the lengths of an object in real space ($mm$) and in image space (pixel), on the $x$- and $y$-axes, respectively. By multiplying the two representations of $d$ in Eq 2 and taking the square root, we obtain the 2D extended version of the depth measure, $k$ in Eq 1. Assuming that $A_{real}$ is constant and using $\alpha_x$ and $\alpha_y$ from datasets, the distance between the camera and an object can be measured from the area of the bounding box. As we only consider humans, we assume that $A_{real}$ is $2000mm \times 2000mm$. The area of the human bounding box is used as $A_{img}$ after extending it to a fixed aspect ratio (i.e., height:width = 1:1). Figure 3 shows that such an approximation provides a meaningful correlation between $k$ and the real depth values of the human root in 3D human pose estimation datasets [16, 29].
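The distance measure can be sketched in a few lines of Python. This is an illustrative sketch, not the released implementation: the function names are hypothetical, the box is extended to a 1:1 aspect ratio by growing its shorter side (an assumption about how the extension is done), and the 2000 mm × 2000 mm real-space area is treated here as an assumed constant.

```python
import math


def extend_to_square(width, height):
    """Extend a box to a 1:1 aspect ratio by growing its shorter side.

    Returns the area of the extended box in pixel^2. Growing the shorter
    side is an assumption made for this sketch.
    """
    side = max(width, height)
    return side * side


def distance_measure(alpha_x, alpha_y, area_real, area_img):
    """Square root of (alpha_x * alpha_y * area_real / area_img).

    alpha_x, alpha_y: focal lengths in pixel units.
    area_real:        assumed real-space area of a human in mm^2.
    area_img:         area of the extended bounding box in pixel^2.
    """
    return math.sqrt(alpha_x * alpha_y * area_real / area_img)
```

As a sanity check, a 300 × 600 pixel box is extended to 600 × 600 pixels; with focal lengths of 1500 pixels and an assumed real area of 2000 mm × 2000 mm, the measure evaluates to 5000 mm, which matches the pinhole relation 1500 × 2000 / 600.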
Although $k$ can represent how far the human is from the camera, it can be wrong in several cases because it assumes that $A_{img}$ is the area of $A_{real}$ (i.e., $2000mm \times 2000mm$) in the image space when the distance between the human and the camera is $k$. However, as $A_{img}$ is obtained by extension of the 2D bounding box, it can take a different value according to the human's appearance, although the distance to the camera is the same. For example, as shown in Figure 4 (a), two humans have different $A_{img}$ although they are at the same distance to the camera. On the other hand, in some cases, $A_{img}$ can be the same even with different distances from the camera. For example, in Figure 4 (b), a child and an adult have similar $A_{img}$; however, the child is closer to the camera than the adult.
To handle this issue, we design the RootNet to utilize the image feature to correct $A_{img}$ and, eventually, $k$. The image feature can give a clue to the RootNet about how much $A_{img}$ has to be changed. For example, in Figure 4 (a), the left image can tell the RootNet to increase the area because the human is in a crouching posture. Also, in Figure 4 (b), the right image can tell the RootNet to increase the area because the input image contains a child. Specifically, the RootNet outputs the correction factor $\gamma$ from the image feature. The estimated $\gamma$ is multiplied by the given $A_{img}$, which becomes $A_{img}^{\gamma}$. From $A_{img}^{\gamma}$, $k$ is calculated, and it becomes the final depth value.
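The effect of the correction factor can be illustrated with a short sketch. The names below are hypothetical; the only step taken from the text is that the correction factor multiplies the image-space area before the distance measure is computed.

```python
import math


def distance_measure(alpha_x, alpha_y, area_real, area_img):
    """Area-based distance measure: sqrt(alpha_x * alpha_y * area_real / area_img)."""
    return math.sqrt(alpha_x * alpha_y * area_real / area_img)


def corrected_depth(alpha_x, alpha_y, area_real, area_img, gamma):
    """Depth after scaling the image-space area by the correction factor gamma."""
    return distance_measure(alpha_x, alpha_y, area_real, gamma * area_img)
```

Because the area sits under a square root in the denominator, scaling it by a factor gamma divides the depth by the square root of gamma: a factor of 4 (for instance, a crouching person whose box is too small) halves the estimated depth.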
5.2 Camera normalization
Our RootNet outputs the correction factor $\gamma$ only from an input image, so its output does not belong to a specific camera space. Therefore, the RootNet can be trained and tested with data of various $\alpha_x$ and $\alpha_y$. Also, it can estimate the 3D absolute coordinate of the human root in the camera-normalized space (i.e., $\alpha_x = \alpha_y = 1$) if $\alpha_x$ and $\alpha_y$ are unknown. This property makes our RootNet very flexible and useful.
5.3 Network architecture
The network architecture of the RootNet, which comprises three components, is visualized in Figure 5. First, a backbone network extracts the useful global feature of the input human image using ResNet. Second, the 2D image coordinate estimation part takes a feature map from the backbone part and upsamples it using three consecutive deconvolutional layers with batch normalization layers and ReLU activation function. Then, a 1-by-1 convolution is applied to produce a 2D heatmap of the root. Soft-argmax extracts the 2D image coordinates $(x_R, y_R)$ from the 2D heatmap. The third component is the depth estimation part. It also takes a feature map from the backbone part and applies global average pooling. Then, the pooled feature map goes through a 1-by-1 convolution, which outputs a single scalar value $\gamma$. The final absolute depth value $Z_R$ is obtained by multiplying $k$ with $1/\sqrt{\gamma}$. In practice, we implemented the RootNet to output $\gamma' = 1/\sqrt{\gamma}$ directly and multiply it with the given $k$ to obtain the absolute depth value $Z_R$ (i.e., $Z_R = \gamma' k$).
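The soft-argmax step can be sketched in plain Python as follows. This is a simplified illustration on a nested list rather than a tensor, and the function name is hypothetical; it computes the expected pixel coordinate under a softmax over the whole heatmap, which is what makes the operation differentiable in a real network.

```python
import math


def soft_argmax_2d(heatmap):
    """Return the expected (x, y) location under a softmax over all pixels.

    heatmap: list of rows of raw scores; x indexes columns, y indexes rows.
    """
    # Subtract the peak before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    weights = [[math.exp(value - peak) for value in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y
```

A sharply peaked heatmap returns (almost exactly) the peak location, while a flat heatmap returns the center of the map, so the output varies smoothly with the scores instead of jumping between discrete pixels.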
5.4 Loss function
We train the RootNet by minimizing the $L1$ distance between the estimated and groundtruth coordinates. The loss function $L_{root}$ is defined as follows:
$$L_{root} = \| \mathbf{R} - \mathbf{R}^{*} \|_1$$
where $*$ indicates the groundtruth.
6 PoseNet
6.1 Model design
The PoseNet estimates the root-relative 3D pose $\mathbf{P}_j^{rel} = (x_j, y_j, Z_j^{rel})$ from a cropped human image. Many works have been presented for this topic [37, 43, 52, 49, 28, 26, 44]. Among them, we use the model of Sun et al., which is the current state-of-the-art method. This model consists of two parts. The first part is the backbone, which extracts a useful global feature from the cropped human image using ResNet. Second, the pose estimation part takes a feature map from the backbone part and upsamples it using three consecutive deconvolutional layers with batch normalization layers and ReLU activation function. A 1-by-1 convolution is applied to the upsampled feature map to produce the 3D heatmaps for each joint. The soft-argmax operation is used to extract the 2D image coordinates $(x_j, y_j)$ and the root-relative depth values $Z_j^{rel}$.
6.2 Loss function
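We train the PoseNet by minimizing the $L1$ distance between the estimated and groundtruth coordinates. The loss function $L_{pose}$ is defined as follows:
$$L_{pose} = \frac{1}{J} \sum_{j=1}^{J} \| \mathbf{P}_j^{rel} - \mathbf{P}_j^{rel*} \|_1$$
where $*$ indicates the groundtruth.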
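A minimal sketch of such a coordinate loss is given below, assuming an L1 distance averaged over joints. The function name and the `has_depth` flag are hypothetical; the flag mirrors the practice, described in the experimental protocol, of setting the depth-axis loss to zero for samples from 2D-only datasets.

```python
def pose_l1_loss(pred, target, has_depth=True):
    """Mean over joints of the per-joint L1 distance.

    pred, target: lists of (x, y, z) tuples, one per joint.
    has_depth:    when False, the z term is skipped, as for samples that
                  only carry 2D annotations.
    """
    total = 0.0
    for (px, py, pz), (tx, ty, tz) in zip(pred, target):
        total += abs(px - tx) + abs(py - ty)
        if has_depth:
            total += abs(pz - tz)
    return total / len(pred)
```

The same function applies to the root loss by passing single-element lists, since the root is just one 3D point.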
7 Implementation details
A publicly released Mask R-CNN model pre-trained on the COCO dataset is used for the DetectNet without fine-tuning on the human pose estimation datasets [16, 29]. For the RootNet and PoseNet, PyTorch is used for implementation. Their backbone part is initialized with the publicly released ResNet-50 pre-trained on the ImageNet dataset, and the weights of the remaining part are initialized by a Gaussian distribution with $\sigma = 0.001$. The weights are updated by the Adam optimizer with a mini-batch size of 128. The initial learning rate is set to $1 \times 10^{-3}$ and reduced by a factor of 10 at the 17th epoch. We use 256$\times$256 as the size of the input image of the RootNet and PoseNet. We perform data augmentation including rotation ($\pm 30^{\circ}$), horizontal flip, color jittering, and synthetic occlusion in training. Horizontal flip augmentation is performed in testing for the PoseNet following Sun et al. We train the RootNet and PoseNet for 20 epochs with four NVIDIA 1080 Ti GPUs, which took two days for each model.
8.1 Dataset and evaluation metric
Human3.6M dataset. The Human3.6M dataset is the largest 3D single-person pose benchmark. It consists of 3.6 million video frames. 11 subjects performing 15 activities are captured from 4 camera viewpoints. The groundtruth 3D poses are obtained using a motion capture system. Two evaluation metrics are widely used. The first one is the mean per joint position error (MPJPE), which is calculated after aligning the human root of the estimated and groundtruth 3D poses. The second one is MPJPE after further alignment (i.e., Procrustes analysis (PA)). This metric is called PA MPJPE. To evaluate the localization of the absolute 3D human root, we introduce the mean of the Euclidean distance between the estimated coordinates of the root $\mathbf{R}$ and the groundtruth ones $\mathbf{R}^{*}$, i.e., the mean of the root position error (MRPE), as a new metric:
$$MRPE = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{R}^{(i)} - \mathbf{R}^{(i)*} \right\|_2$$
where the superscript $i$ is the index of the sample, and $N$ denotes the total number of test samples.
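The root position metric is straightforward to compute; a minimal sketch with a hypothetical function name follows. It averages the Euclidean distance between estimated and groundtruth root coordinates over all test samples.

```python
import math


def mrpe(pred_roots, gt_roots):
    """Mean Euclidean distance between predicted and groundtruth roots.

    pred_roots, gt_roots: equally long lists of (X, Y, Z) tuples.
    """
    distances = [math.dist(p, g) for p, g in zip(pred_roots, gt_roots)]
    return sum(distances) / len(distances)
```

Note that `math.dist` requires Python 3.8 or later.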
MuCo-3DHP and MuPoTS-3D datasets. Mehta et al. proposed a 3D multi-person pose estimation dataset. The training set, MuCo-3DHP, is generated by compositing the existing MPI-INF-3DHP 3D single-person pose estimation dataset. The test set, MuPoTS-3D, was captured outdoors and includes 20 real-world scenes with groundtruth 3D poses for up to three subjects. The groundtruth is obtained with a multi-view marker-less motion capture system. For evaluation, the 3D percentage of correct keypoints (3DPCK$_{rel}$) and the area under the 3DPCK curve over various thresholds (AUC$_{rel}$) are used after root alignment with the groundtruth. 3DPCK$_{rel}$ treats a joint’s prediction as correct if it lies within 15cm of the groundtruth joint location. We additionally define 3DPCK$_{abs}$, which is the 3DPCK without root alignment, to evaluate absolute camera-centered coordinates. To evaluate the localization of the absolute 3D human root, we use the average precision of 3D human root location (AP$_{25}^{root}$), which considers a prediction correct when the Euclidean distance between the estimated and groundtruth coordinates is smaller than 25cm.
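The difference between the root-aligned and the absolute variant of 3DPCK can be sketched as follows. The function names are hypothetical, coordinates are assumed to be in millimeters (so the 15cm threshold becomes 150), and a strict inequality is used for the threshold test, which is an assumption.

```python
import math


def _center(pose, root_index):
    """Translate a pose so that its root joint sits at the origin."""
    root = pose[root_index]
    return [tuple(c - r for c, r in zip(joint, root)) for joint in pose]


def pck3d(pred, gt, threshold=150.0, root_index=None):
    """Fraction of joints closer than `threshold` to the groundtruth.

    pred, gt:   lists of (X, Y, Z) tuples in mm.
    root_index: if given, both poses are root-aligned first (the
                root-relative variant); if None, absolute coordinates
                are compared directly.
    """
    if root_index is not None:
        pred, gt = _center(pred, root_index), _center(gt, root_index)
    hits = sum(math.dist(p, g) < threshold for p, g in zip(pred, gt))
    return hits / len(gt)
```

A pose that is perfect up to a 200 mm shift of the whole body scores 1.0 with root alignment and 0.0 without it, which is why the absolute variant is sensitive to root localization errors.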
8.2 Experimental protocol
Human3.6M dataset. Two experimental protocols are widely used. Protocol 1 uses six subjects (S1, S5, S6, S7, S8, S9) in training and S11 in testing. PA MPJPE is used as an evaluation metric. Protocol 2 uses five subjects (S1, S5, S6, S7, S8) in training and two subjects (S9, S11) in testing. MPJPE is used as an evaluation metric. We use every 5th and 64th frame of videos in training and testing, respectively, following [43, 44]. When training on the Human3.6M dataset, we used the additional MPII 2D human pose estimation dataset following [52, 37, 43, 44]. Each mini-batch consists of half Human3.6M and half MPII data. For MPII data, the loss value of the $z$-axis becomes zero for both the RootNet and PoseNet, following Sun et al.
|Settings||MRPE||MPJPE||Seconds per frame|
|Disjointed learning (Ours)||120.0||57.3||0.141|
MuCo-3DHP and MuPoTS-3D datasets. Following the previous protocol, we composite 400K frames, half of which are background augmented. For augmentation, we use images from the COCO dataset except for images with humans. We use the additional COCO 2D human keypoint detection dataset when training our models on the MuCo-3DHP dataset, following Mehta et al. Each mini-batch consists of half MuCo-3DHP and half COCO data. For COCO data, the loss value of the $z$-axis becomes zero for both the RootNet and PoseNet, following Sun et al.
8.3 Ablation study
In this study, we show how each component of our proposed framework affects the 3D multi-person pose estimation accuracy. To evaluate the performance of the DetectNet, we use the average precision of the bounding box (AP$^{box}$) following the metrics of the COCO object detection benchmark.
Disjointed pipeline. To demonstrate the effectiveness of the disjointed pipeline (i.e., separated DetectNet, RootNet, and PoseNet), we compare the MRPE, MPJPE, and running time of joint and disjointed learning of the RootNet and PoseNet in Table 1. The running time includes the DetectNet and is measured using a single TitanX Maxwell GPU. For the joint learning, we combine the RootNet and PoseNet into a single model that shares the backbone part (i.e., ResNet). The image feature from the backbone is fed to the RootNet and PoseNet branches in parallel. Compared with the joint learning, our disjointed learning gives lower error under a similar running time. We believe that this is because the tasks of the RootNet and PoseNet are not highly correlated, so jointly training them can make training harder, resulting in lower accuracy.
Effect of the DetectNet. To show how the performance of the human detection affects the accuracy of the final 3D human root localization and 3D multi-person pose estimation, we compare AP$_{25}^{root}$, AUC$_{rel}$, and 3DPCK$_{abs}$ using the DetectNet with various backbones (i.e., ResNet-50, ResNeXt-101-32) and the groundtruth box in the second, third, and fourth rows of Table 2, respectively. The table shows that based on the same RootNet (i.e., Ours), a better human detection model improves both the 3D human root localization and the 3D multi-person pose estimation performance. However, the groundtruth box does not improve overall accuracy considerably compared with the other DetectNet models. Therefore, we have sufficient reasons to believe that the given boxes cover most of the person instances with such a high detection AP$^{box}$. We can also conclude that the bounding box estimation accuracy does not have a large impact on the 3D multi-person pose estimation accuracy.
Effect of the RootNet. To show how the performance of the 3D human root localization affects the accuracy of the 3D multi-person pose estimation, we compare AUC$_{rel}$ and 3DPCK$_{abs}$ using various RootNet settings in Table 2. The first and second rows show that based on the same DetectNet (i.e., R-50), our RootNet exhibits significantly higher AP$_{25}^{root}$ and 3DPCK$_{abs}$ compared with the setting in which $k$ is directly utilized as a depth value. We use the $x$ and $y$ of the RootNet when $k$ is used as a depth value. This result demonstrates that the RootNet successfully corrects the $k$ value. The fourth and last rows show that the groundtruth human root provides similar AUC$_{rel}$, but significantly higher 3DPCK$_{abs}$ compared with our RootNet. This finding shows that better human root localization is required to achieve more accurate absolute 3D multi-person pose estimation results.
Effect of the PoseNet. All settings in Table 2 provide similar AUC$_{rel}$. In particular, the first and last rows of the table show that using the groundtruth box and human root does not provide significantly higher AUC$_{rel}$. As the results in the table are based on the same PoseNet, we can conclude that AUC$_{rel}$, which is an evaluation of the root-relative 3D human pose estimation, highly depends on the accuracy of the PoseNet.
8.4 Comparison with state-of-the-art methods
Human3.6M dataset. We compare our proposed system with state-of-the-art 3D human pose estimation methods on the Human3.6M dataset  in Tables 3 and 4. As most of the previous methods use the groundtruth information (i.e., bounding box or 3D root location) in inference time, we report the performance of the PoseNet using the groundtruth 3D root location. Note that our full model does not require any groundtruth information in inference time. The tables show that our method achieves comparable performance despite not using any groundtruth information in inference time. Moreover, it significantly outperforms previous 3D multi-person pose estimation methods [25, 29].
MuCo-3DHP and MuPoTS-3D datasets. We compare our proposed system with the state-of-the-art 3D multi-person pose estimation methods on the MuPoTS-3D dataset  in Tables 5 and 6. The proposed system significantly outperforms them in most of the test sequences and joints.
Those comparisons clearly show that our approach outperforms previous 3D multi-person pose estimation methods.
Although our proposed method outperforms previous 3D multi-person pose estimation methods by a large margin, room for improvement is substantial. As shown in Table 2, using the groundtruth 3D root location brings significant 3DPCK improvement. Recent advances in depth map estimation from a single RGB image [8, 22] can give a clue for improving the 3D human root localization model.
Our framework can also be used in applications other than 3D multi-person pose estimation. For example, recent methods for 3D human mesh model reconstruction [2, 19, 18] reconstruct full 3D mesh model from a single person. Joo et al.  utilized 2D multi-view input for 3D multi-person mesh model reconstruction. In our framework, if the PoseNet is replaced with existing human mesh reconstruction model [2, 19, 18], 3D multi-person mesh model reconstruction can be performed from a single RGB image. This shows our framework can be applied to many 3D instance-aware vision tasks which take a single RGB image as an input.
We propose a novel and general framework for 3D multi-person pose estimation from a single RGB image. Our framework consists of human detection, 3D human root localization, and root-relative 3D single-person pose estimation models. Since any existing human detection and 3D single-person pose estimation models can be plugged into our framework, it is very flexible and easy to use. The proposed system outperforms previous 3D multi-person pose estimation methods by a large margin and achieves comparable performance with 3D single-person pose estimation methods without any groundtruth information while they use it in inference time. To the best of our knowledge, this work is the first to propose a fully learning-based camera distance-aware top-down approach whose components are compatible with most of the previous human detection and 3D human pose estimation models. We hope that this study provides a new basis for 3D multi-person pose estimation, which has only barely been explored.
Supplementary Material of “Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image”
In this supplementary material, we present more experimental results that could not be included in the main manuscript due to the lack of space.
1 Derivation of Equation 1
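We provide a derivation of Equation 1 of the main manuscript with reference to Figure 6, which shows a pinhole camera model. The green and blue arrows represent the human root joint centered $x$- and $y$-axes, respectively. The yellow lines show rays passing through the pinhole. $d$, $f$, and $l_{x,sensor}$ are the distance between the camera and the human root joint ($mm$), the focal length ($mm$), and the length of the human on the image sensor ($mm$), respectively.
According to the pinhole geometry in the figure,
$$d : f = l_{x,real} : l_{x,sensor}$$
Let $p_x$ be the per-pixel distance factor in the $x$-axis, so that $l_{x,sensor} = p_x l_{x,img}$ and $\alpha_x = f / p_x$. Then,
$$d = f \frac{l_{x,real}}{l_{x,sensor}} = \frac{f}{p_x} \frac{l_{x,real}}{l_{x,img}} = \alpha_x \frac{l_{x,real}}{l_{x,img}}$$
The above equations are also valid in the $y$-axis. Therefore,
$$d = \alpha_x \frac{l_{x,real}}{l_{x,img}} = \alpha_y \frac{l_{y,real}}{l_{y,img}}$$
Multiplying the two expressions for $d$ and taking the square root, with $A_{real} = l_{x,real} l_{y,real}$ and $A_{img} = l_{x,img} l_{y,img}$, gives $d = \sqrt{\alpha_x \alpha_y A_{real} / A_{img}}$, which is the distance measure $k$ of Equation 1.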
2 Comparison of 3D human root localization with previous approaches
|Baseline [41, 28]||267.8||27.5||28.3||261.9|
|W/o limb joints||226.2||24.5||24.9||220.2|
Previous approaches [41, 28] simultaneously estimate 2D image coordinates and 3D camera-centered root-relative coordinates of keypoints. Then, the absolute camera-centered coordinates of the human root are obtained by minimizing the distance between the 2D predictions and the projected 3D predictions. For optimization, a linear least-squares formulation is used. To measure the errors of their method, we implemented it using the ResNet-152-based model of Sun et al. as a 2D pose estimator and the model of Martinez et al. as a 3D pose estimator, which are state-of-the-art methods. In addition, to minimize the effect of outliers in the 3D-to-2D fitting, we excluded limb joints when fitting. Also, we performed RANSAC with various numbers of joints to obtain the optimal joint set for fitting instead of using a heuristically selected joint set.
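The 3D-to-2D fitting used by this baseline can be sketched as a small linear least-squares problem. The snippet below is one possible linearization of the reprojection constraint, written for illustration; it is not necessarily the exact formulation of the cited approaches. It assumes a single focal length, 2D coordinates measured relative to the principal point, and hypothetical function names. For each joint, the constraint u (Z + T_z) = f (X + T_x), and likewise for v, is linear in the unknown root translation T.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve_3x3(m, rhs):
    """Solve m @ t = rhs with Cramer's rule (adequate for a 3x3 system)."""
    d = _det3(m)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(_det3(replaced) / d)
    return tuple(solution)


def fit_root_translation(points_2d, points_3d_rel, focal):
    """Least-squares root translation (T_x, T_y, T_z).

    points_2d:     list of (u, v), pixels relative to the principal point.
    points_3d_rel: list of root-relative (X, Y, Z) in camera orientation.
    focal:         focal length in pixels (assumed equal on both axes).
    """
    ata = [[0.0] * 3 for _ in range(3)]  # accumulates A^T A
    atb = [0.0] * 3                      # accumulates A^T b
    for (u, v), (x, y, z) in zip(points_2d, points_3d_rel):
        rows = (((focal, 0.0, -u), u * z - focal * x),
                ((0.0, focal, -v), v * z - focal * y))
        for a, b in rows:
            for i in range(3):
                atb[i] += a[i] * b
                for j in range(3):
                    ata[i][j] += a[i] * a[j]
    return _solve_3x3(ata, atb)
```

With noise-free inputs the true translation is recovered; with noisy 2D or 3D predictions the depth component is typically the least constrained, since it is determined only by how the 2D spread of the joints relates to their 3D spread.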
Table 7 shows our RootNet significantly outperforms previous approaches. Furthermore, the RootNet can be designed independently of the PoseNet, giving design flexibility to both models. In contrast, the previous 3D root localization methods [41, 28] require both of 2D and 3D predictions for the root localization, which results in lack of generalizability.
3 Running time of the proposed framework
In Table 8, we report seconds per frame for each component of our framework. The running time is measured using a single TitanX Maxwell GPU. As the table shows, most of the running time is consumed by the DetectNet. It is hard to directly compare running time with previous works [41, 28] because they did not report it. However, we expect no big difference because their models are similar to existing models whose speeds are 0.2 and 0.11 seconds per frame, respectively.
4 Absolute 3D multi-person pose estimation errors
5 Qualitative results
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
-  F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  J. Y. Chang and K. M. Lee. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation. CVIU, 2018.
-  C.-H. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. In CVPR, 2017.
-  Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
-  H.-S. Fang, Y. Xu, W. Wang, X. Liu, and S.-C. Zhu. Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, 2018.
-  H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
-  R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  J. C. Gower. Generalized procrustes analysis. Psychometrika, 1975.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  S. Huang, M. Gong, and D. Tao. A coarse-fine network for keypoint localization. In ICCV, 2017.
-  E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI, 2014.
-  E. Jahangiri and A. L. Yuille. Generating multiple diverse hypotheses for human 3d pose consistent with 2d joint detections. In ICCV, 2017.
-  H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In CVPR, 2018.
-  A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
-  M. Kocabas, S. Karagoz, and E. Akbas. Multiposenet: Fast multi-person pose estimation using pose residual network. In ECCV, 2018.
-  R. Li, K. Xian, C. Shen, Z. Cao, H. Lu, and L. Hang. Deep attention-based classification network for robust depth prediction. arXiv preprint arXiv:1807.03959, 2018.
-  S. Li and A. B. Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In ACCV, 2014.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In ICCV, 2017.
-  F. Massa and R. Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
-  D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, 2017.
-  D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt. Single-shot multi-person 3d pose estimation from monocular rgb. In 3DV, 2018.
-  G. Moon, J. Y. Chang, and K. M. Lee. Multi-scale aggregation r-cnn for 2d multi-person pose estimation. In CVPRW, 2019.
-  G. Moon, J. Y. Chang, and K. M. Lee. Posefix: Model-agnostic general human pose refinement network. In CVPR, 2019.
-  F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. In CVPR, 2017.
-  A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
-  G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
-  S. Park, J. Hwang, and N. Kwak. 3d human pose estimation using convolutional neural networks with 2d pose information. In ECCV, 2016.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. In CVPR, 2017.
-  L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In NIPS, 2016.
-  G. Rogez, P. Weinzaepfel, and C. Schmid. Lcr-net: Localization-classification-regression for human pose. In CVPR, 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
-  X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In ICCV, 2017.
-  X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In ECCV, 2018.
-  B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3d human pose with deep neural networks. In BMVC, 2016.
-  D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. In CVPR, 2017.
-  B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
-  W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3d human pose estimation in the wild by adversarial learning. In CVPR, 2018.
-  H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In CVPR, 2016.
-  Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
-  X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Weakly-supervised transfer for 3d human pose estimation in the wild. In ICCV, 2017.
-  X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Monocap: Monocular human motion capture using a cnn coupled with a geometric prior. TPAMI, 2019.