Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image
Abstract
Although significant improvement has been achieved in 3D human pose estimation, most of the previous methods only consider the single-person case. In this work, we propose the first fully learning-based, camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. The pipeline of the proposed system consists of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation models. Our system achieves results comparable to the state-of-the-art 3D single-person pose estimation models without any ground-truth information and significantly outperforms previous 3D multi-person pose estimation methods on publicly available datasets. The code is available at https://github.com/mks0601/3DMPPE_ROOTNET_RELEASE and https://github.com/mks0601/3DMPPE_POSENET_RELEASE.
1 Introduction
The goal of 3D human pose estimation is to localize semantic keypoints of a human body in 3D space. It is an essential technique for human behavior understanding and human-computer interaction. Recently, many methods [37, 43, 52, 26, 49, 44] utilize deep convolutional neural networks (CNNs) and have achieved noticeable performance improvement on large-scale publicly available datasets [16, 28].
Most of the previous 3D human pose estimation methods [37, 43, 52, 26, 49, 44] are designed for the single-person case. They crop the human area in an input image with a ground-truth box or a box predicted by a human detection model [11]. The cropped patch of a human body is fed into the 3D pose estimation model, which then estimates the 3D location of each keypoint. As these models take a single cropped image, estimating the absolute camera-centered coordinates of each keypoint is difficult. To handle this issue, many methods [37, 43, 52, 26, 49, 44] estimate the 3D pose relative to a reference point in the body, e.g., the center joint (i.e., pelvis) of a human, called the root. The final 3D pose is obtained by adding the 3D coordinates of the root to the estimated root-relative 3D pose. Prior information on bone length [37] or the ground-truth [44] has been commonly used to localize the root.
Recently, many top-down approaches [13, 6, 47] for 2D multi-person pose estimation have shown noticeable performance improvement. These approaches first detect humans by using a human detection module, and then estimate the 2D pose of each human by a 2D single-person pose estimation module. Although they are straightforward in 2D cases, extending them to 3D cases is challenging. Note that for the estimation of 3D multi-person poses, we need to know the absolute distance from the camera to each human as well as the 2D bounding boxes. However, existing human detectors provide 2D bounding boxes only.
In this study, we propose a general framework for 3D multi-person pose estimation. To the best of our knowledge, this study is the first to propose a fully learning-based, camera distance-aware top-down approach whose components are compatible with most of the previous human detection and 3D human pose estimation methods. The pipeline of the proposed system consists of three modules. First, a human detection network (DetectNet) detects the bounding boxes of humans in an input image. Second, the proposed 3D human root localization network (RootNet) estimates the camera-centered coordinates of the detected humans' roots. Third, a root-relative 3D single-person pose estimation network (PoseNet) estimates the root-relative 3D pose for each detected human. Figures 1 and 2 show the qualitative results and overall pipeline of our framework, respectively.
We show that our approach outperforms previous 3D multi-person pose estimation methods [41, 29] on several publicly available 3D single- and multi-person pose estimation datasets [16, 29] by a large margin. Also, even without any ground-truth information (i.e., the bounding box and the 3D location of the root), our method achieves performance comparable to the state-of-the-art 3D single-person pose estimation methods that use the ground-truth at inference time. Note that our framework is new but follows the previous conventions of object detection and 3D human pose estimation networks. Thus, previous detection and pose estimation methods can be easily plugged into our framework, which makes the proposed framework flexible and easy to use.
Our contributions can be summarized as follows.

We propose a new general framework for 3D multi-person pose estimation from a single RGB image. The framework is the first fully learning-based, camera distance-aware top-down approach whose components are compatible with most of the previous human detection and 3D human pose estimation models.

Our framework outputs the absolute camera-centered coordinates of multiple humans' keypoints. For this, we propose a 3D human root localization network (RootNet). This model makes it easy to extend 3D single-person pose estimation techniques to absolute 3D multi-person pose estimation.

We show that our method significantly outperforms previous 3D multi-person pose estimation methods on several publicly available datasets. Also, it achieves performance comparable to the state-of-the-art 3D single-person pose estimation methods without any ground-truth information.
2 Related works
2D multi-person pose estimation. There are two main approaches in multi-person pose estimation. The first one, the top-down approach, deploys a human detector that estimates the bounding boxes of humans. Each detected human area is cropped and fed into the pose estimation network. The second one, the bottom-up approach, localizes all human body keypoints in an input image first, and then groups them using clustering techniques.
[34, 13, 6, 47, 31, 30] are based on the top-down approach. Papandreou et al. [34] predicted 2D offset vectors and 2D heatmaps for each joint. They fused the estimated vectors and heatmaps to generate highly localized heatmaps. Chen et al. [6] proposed a cascaded pyramid network whose cascaded structure refines an initially estimated pose by focusing on hard keypoints. Xiao et al. [47] used a simple pose estimation network that consists of a deep backbone network and several upsampling layers.
[38, 14, 3, 33, 21] are based on the bottom-up approach. Cao et al. [3] proposed part affinity fields (PAFs) that model the association between human body keypoints. They grouped the localized keypoints of all persons in the input image by using the estimated PAFs. Newell et al. [33] introduced a pixel-wise tag value to assign localized keypoints to a certain human. Kocabas et al. [21] proposed a pose residual network for assigning detected keypoints to each person.
3D single-person pose estimation. Current 3D single-person pose estimation methods can be categorized into single- and two-stage approaches. The single-stage approach directly localizes the 3D body keypoints from the input image. The two-stage approach exploits the high accuracy of 2D human pose estimation: it initially localizes body keypoints in 2D space and then lifts them to 3D space.
[23, 45, 37, 43, 44] are based on the single-stage approach. Li et al. [23] proposed a multi-task framework that jointly trains both the pose regression and body part detectors. Tekin et al. [45] modeled high-dimensional joint dependencies by adopting an auto-encoder structure. Pavlakos et al. [37] extended the U-net-shaped network to estimate a 3D heatmap for each joint. They used a coarse-to-fine approach to boost performance. Sun et al. [43] introduced a compositional loss to consider the joint connection structure. Sun et al. [44] used the soft-argmax operation to obtain the 3D coordinates of body joints in a differentiable manner.
[35, 5, 26, 52, 7, 49, 4] are based on the two-stage approach. Park et al. [35] estimated the initial 2D pose and utilized it to regress the 3D pose. Martinez et al. [26] proposed a simple network that directly regresses the 3D coordinates of body joints from 2D coordinates. Zhou et al. [52] proposed a geometric loss to facilitate weakly supervised learning of the depth regression module with in-the-wild images. Yang et al. [49] utilized an adversarial loss to handle 3D human pose estimation in the wild.
3D multi-person pose estimation. Few studies have been conducted on 3D multi-person pose estimation from a single RGB image. Rogez et al. [41] proposed a top-down approach called LCR-Net, which consists of localization, classification, and regression parts. The localization part detects a human from an input image, and the classification part classifies the detected human into several anchor-poses. An anchor-pose is defined as a pair of a 2D pose and a root-relative 3D pose, and is generated by clustering poses in the training set. Then, the regression part refines the anchor-poses. Mehta et al. [29] proposed a bottom-up approach. They introduced an occlusion-robust pose-map formulation which supports pose inference for more than one person through PAFs [3].
3D human root localization in 3D multi-person pose estimation. Rogez et al. [41] estimated both the 2D pose in the image coordinate space and the 3D pose in the camera-centered coordinate space simultaneously. They obtained the 3D location of the human root by minimizing the distance between the estimated 2D pose and the projected 3D pose, similar to what Mehta et al. [28] did. However, this strategy cannot be generalized to other 3D human pose estimation methods because it requires both the 2D and 3D estimations. For example, many works [44, 52, 37, 49] estimate the 2D image coordinates and root-relative depth values of keypoints. As their methods do not output root-relative camera-centered coordinates of keypoints, such a distance minimization strategy cannot be used. Moreover, contextual information cannot be exploited because the image feature is not considered. For example, it cannot distinguish between a child close to the camera and an adult far from the camera because their scales in the 2D image are similar.
3 Overview of the proposed model
The goal of our system is to recover the absolute camera-centered coordinates of multiple persons' keypoints $\{P^{abs}_j\}_{j=1}^{J}$, where $J$ denotes the number of joints. To address this problem, we construct our system based on the top-down approach, which consists of DetectNet, RootNet, and PoseNet. The DetectNet detects a human bounding box of each person in the input image. The RootNet takes the cropped human image from the DetectNet and localizes the root of the human $R = (x_R, y_R, Z_R)$, in which $x_R$ and $y_R$ are pixel coordinates and $Z_R$ is the absolute depth value. The same cropped human image is fed to the PoseNet, which estimates the root-relative 3D pose $P^{rel}_j = (x_j, y_j, Z^{rel}_j)$, in which $x_j$ and $y_j$ are pixel coordinates and $Z^{rel}_j$ is the root-relative depth value. We convert $Z^{rel}_j$ into the absolute depth $Z^{abs}_j$ by adding $Z_R$, and transform $x_j$ and $y_j$ to the original image space before cropping. Applying the back-projection formula, the final absolute 3D pose $P^{abs}_j$ is obtained.
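The composition of the three outputs above can be sketched in a few lines of numpy. The function and argument names below are ours (not from the released code), and the camera intrinsics (fx, fy, cx, cy) are assumed to be known:

```python
import numpy as np

def back_project(x, y, Z, fx, fy, cx, cy):
    """Back-project pixel coordinates (x, y) with absolute depth Z (mm)
    into camera-centered 3D coordinates via the pinhole camera model."""
    X = (x - cx) * Z / fx
    Y = (y - cy) * Z / fy
    return np.stack([X, Y, np.broadcast_to(Z, np.shape(X))], axis=-1)

def absolute_pose(root_xyZ, rel_pose_xyZ, fx, fy, cx, cy):
    """Combine a RootNet-style output (x_R, y_R, Z_R) with a PoseNet-style
    output (x_j, y_j, Z_j^rel) into an absolute camera-centered 3D pose.
    Pixel coordinates are assumed to be already mapped back to the
    original (uncropped) image space."""
    x_r, y_r, Z_r = root_xyZ
    x, y, Z_rel = rel_pose_xyZ[:, 0], rel_pose_xyZ[:, 1], rel_pose_xyZ[:, 2]
    Z_abs = Z_rel + Z_r  # add the root depth to every joint's relative depth
    return back_project(x, y, Z_abs, fx, fy, cx, cy)
```

A joint lying exactly at the principal point with zero relative depth back-projects to (0, 0, Z_R), which matches the pinhole model.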
4 DetectNet
We use Mask R-CNN [11] as the framework of the DetectNet. Mask R-CNN [11] consists of three parts. The first one, the backbone, extracts useful local and global features from the input image by using a deep residual network (ResNet) [12] and a feature pyramid network [24]. Based on the extracted features, the second part, the region proposal network, proposes human bounding box candidates. The RoIAlign layer extracts the features of each proposal and passes them to the third part, the classification head network. The head network determines whether the given proposal is a human or not and estimates bounding box refinement offsets. Mask R-CNN achieves state-of-the-art performance on publicly available object detection datasets [25]. Due to its high performance and publicly available code [9, 27], we use Mask R-CNN [11] as the DetectNet in our pipeline.
5 RootNet
5.1 Model design
The RootNet estimates the camera-centered coordinates of the human root $R = (x_R, y_R, Z_R)$ from a cropped human image. To obtain them, the RootNet separately estimates the 2D image coordinates $(x_R, y_R)$ and the depth value $Z_R$ (i.e., the distance from the camera) of the human root. The estimated 2D image coordinates are back-projected to the camera-centered coordinate space using the estimated depth value, which becomes the final output.
Considering that an image provides sufficient information on where the human root is located in image space, the 2D estimation part can learn to localize it easily. By contrast, estimating the depth only from a cropped human image is difficult because the input does not provide information on the relative position of the camera and the human. To resolve this issue, we introduce a new distance measure, $k$, which is defined as follows:
k = \sqrt{\alpha_x \alpha_y \frac{A_{real}}{A_{img}}},    (1)
where $\alpha_x$, $\alpha_y$, $A_{real}$, and $A_{img}$ are the focal lengths divided by the per-pixel distance factors (pixel) of the $x$ and $y$ axes, the area of the human in real space ($mm^2$), and the area of the human in image space ($pixel^2$), respectively. $k$ approximates the absolute depth from the camera to the object using the ratio between the actual area and the imaged area of the object, given the camera parameters. Eq. 1 can be easily derived by considering a pinhole camera projection model. The distance $d$ between the camera and the object can be calculated as follows:
d = \alpha_x \frac{l_{x,real}}{l_{x,img}} = \alpha_y \frac{l_{y,real}}{l_{y,img}},    (2)
where $l_{x,real}$, $l_{y,real}$, $l_{x,img}$, and $l_{y,img}$ are the lengths of an object in real space ($mm$) and in image space ($pixel$) along the $x$ and $y$ axes, respectively. By multiplying the two representations of $d$ in Eq. 2 and taking the square root, we obtain the 2D extended version of the depth measure, $k$, in Eq. 1. Assuming that $A_{real}$ is constant and using $\alpha_x$ and $\alpha_y$ from the datasets, the distance between the camera and an object can be measured from the area of the bounding box. As we only consider humans, we assume that $A_{real}$ is $2000mm \times 2000mm$. The area of the human bounding box is used as $A_{img}$ after extending the box to a fixed aspect ratio (i.e., height:width = 1:1). Figure 3 shows that such an approximation provides a meaningful correlation between $k$ and the real depth values of the human root in 3D human pose estimation datasets [16, 29].
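Under these assumptions ($A_{real}$ fixed and $\alpha_x$, $\alpha_y$ given by the camera), Eq. 1 reduces to a one-line computation. The sketch below is illustrative; the function name and the use of max(w, h) to extend the box to a 1:1 aspect ratio are our assumptions, not the released implementation:

```python
import math

def k_value(bbox, alpha_x, alpha_y, A_real_mm2=2000.0 * 2000.0):
    """Distance measure k of Eq. 1: k = sqrt(alpha_x * alpha_y * A_real / A_img).
    bbox = (x_min, y_min, width, height) in pixels; the box is first extended
    to a 1:1 aspect ratio before its area is used as A_img."""
    _, _, w, h = bbox
    side = max(w, h)     # one plausible way to extend to height:width = 1:1
    A_img = side * side  # bounding box area in pixel^2
    return math.sqrt(alpha_x * alpha_y * A_real_mm2 / A_img)
```

For instance, a 500×500-pixel box seen by a camera with $\alpha_x = \alpha_y = 1500$ gives $k = 1500 \cdot 2000 / 500 = 6000$ mm, i.e., the person is roughly 6 m away under the constant-area assumption.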
Although $k$ can represent how far the human is from the camera, it can be wrong in several cases because it assumes that $A_{img}$ is the imaged area of $A_{real}$ (i.e., $2000mm \times 2000mm$) when the distance between the human and the camera is $k$. However, as $A_{img}$ is obtained by extending the 2D bounding box, it can have a different value according to the human's appearance, even though the distance to the camera is the same. For example, as shown in Figure 4 (a), two humans have different $A_{img}$ although they are at the same distance from the camera. On the other hand, in some cases, $A_{img}$ can be the same even with different distances from the camera. For example, in Figure 4 (b), a child and an adult have similar $A_{img}$; however, the child is closer to the camera than the adult.
To handle this issue, we design the RootNet to utilize the image feature to correct $A_{img}$ and, eventually, $k$. The image feature can give the RootNet a clue about how much $A_{img}$ has to be changed. For example, in Figure 4 (a), the left image can tell the RootNet to increase the area because the human is in a crouching posture. Also, in Figure 4 (b), the right image can tell the RootNet to increase the area because the input image contains a child. Specifically, the RootNet outputs the correction factor $\gamma$ from the image feature. The estimated $\gamma$ is multiplied by the given $A_{img}$, which becomes the corrected area $A^{c}_{img} = \gamma A_{img}$. From $A^{c}_{img}$, the corrected $k$ is calculated by Eq. 1, and it becomes the final depth value.
5.2 Camera normalization
Our RootNet outputs the correction factor $\gamma$ only from an input image and thus is not tied to a specific camera space. Therefore, the RootNet can be trained and tested with data of various $\alpha_x$ and $\alpha_y$. Also, it can estimate the 3D absolute coordinates of the human root in the camera-normalized space (i.e., $\alpha_x = \alpha_y = 1$) if $\alpha_x$ and $\alpha_y$ are unknown. This property makes our RootNet very flexible and useful.
5.3 Network architecture
The network architecture of the RootNet, which comprises three components, is visualized in Figure 5. First, a backbone network extracts a useful global feature of the input human image using ResNet [12]. Second, the 2D image coordinate estimation part takes a feature map from the backbone and upsamples it using three consecutive deconvolutional layers with batch normalization layers [15] and the ReLU activation function. Then, a 1-by-1 convolution is applied to produce a 2D heatmap of the root. Soft-argmax [44] extracts the 2D image coordinates $(x_R, y_R)$ from the 2D heatmap. The third component is the depth estimation part. It also takes a feature map from the backbone and applies global average pooling. Then, the pooled feature map goes through a 1-by-1 convolution, which outputs a single scalar, the correction factor $\gamma$. The final absolute depth value is obtained by calculating $k$ from the corrected area $\gamma A_{img}$. In practice, we implemented the RootNet to output $\gamma^{-1/2}$ directly and multiply it with $k$ to obtain the absolute depth value (i.e., $Z_R = \gamma^{-1/2} k$).
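As a minimal sketch of how the correction enters the depth computation (the function name is ours): since $A^{c}_{img} = \gamma A_{img}$ and $k \propto 1/\sqrt{A_{img}}$, the corrected distance is $k / \sqrt{\gamma} = \gamma^{-1/2} k$:

```python
import math

def corrected_depth(k, gamma):
    """Final absolute root depth given the correction factor gamma.
    The corrected area is A_img^c = gamma * A_img, so plugging it into
    Eq. 1 rescales the distance to k / sqrt(gamma); equivalently, the
    network can output gamma^(-1/2) and multiply it with k directly."""
    return k / math.sqrt(gamma)
```

With gamma = 1 the measure $k$ is left untouched; gamma > 1 (a larger corrected area) pulls the estimated root closer to the camera.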
5.4 Loss function
We train the RootNet by minimizing the $L1$ distance between the estimated and ground-truth coordinates. The loss function $L_{root}$ is defined as follows:

L_{root} = \left\| R - R^{*} \right\|_1,    (3)

where $R^{*}$ indicates the ground-truth.
6 PoseNet
6.1 Model design
The PoseNet estimates the root-relative 3D pose from a cropped human image. Many works have been presented on this topic [37, 43, 52, 49, 28, 26, 44]. Among them, we use the model of Sun et al. [44], which is the current state-of-the-art method. This model consists of two parts. The first part is the backbone, which extracts a useful global feature from the cropped human image using ResNet [12]. Second, the pose estimation part takes a feature map from the backbone and upsamples it using three consecutive deconvolutional layers with batch normalization layers [15] and the ReLU activation function. A 1-by-1 convolution is applied to the upsampled feature map to produce a 3D heatmap for each joint. The soft-argmax operation is used to extract the 2D image coordinates $(x_j, y_j)$ and the root-relative depth values $Z^{rel}_j$.
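The soft-argmax step can be illustrated with a small numpy sketch. The paper applies it to 3D heatmaps per joint; the 2D version below is a simplified stand-in with hypothetical names, showing how a softmax over the heatmap followed by an expectation yields differentiable coordinates:

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable soft-argmax over a 2D heatmap: a softmax over all
    pixels, then the expected (x, y) coordinate under that distribution."""
    h, w = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())  # numerically stable softmax
    probs /= probs.sum()
    xs, ys = np.arange(w), np.arange(h)
    x = (probs.sum(axis=0) * xs).sum()  # expectation of the column index
    y = (probs.sum(axis=1) * ys).sum()  # expectation of the row index
    return x, y
```

Unlike a hard argmax, this expectation is smooth in the heatmap values, so gradients can flow from a coordinate-space loss back into the network.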
6.2 Loss function
We train the PoseNet by minimizing the $L1$ distance between the estimated and ground-truth coordinates. The loss function $L_{pose}$ is defined as follows:

L_{pose} = \left\| P^{rel} - P^{rel*} \right\|_1,    (4)

where $P^{rel*}$ indicates the ground-truth.
7 Implementation details
The publicly released Mask R-CNN model [27] pretrained on the COCO dataset [25] is used for the DetectNet without fine-tuning on the human pose estimation datasets [16, 29]. The RootNet and PoseNet are implemented in PyTorch [36]. Their backbone part is initialized with the publicly released ResNet-50 [12] pretrained on the ImageNet dataset [42], and the weights of the remaining parts are initialized by a Gaussian distribution with $\sigma = 0.001$. The weights are updated by the Adam optimizer [20] with a mini-batch size of 128. The initial learning rate is set to $1 \times 10^{-3}$ and reduced by a factor of 10 at the 17th epoch. We use 256×256 as the size of the input image of the RootNet and PoseNet. We perform data augmentation including rotation ($\pm 30°$), horizontal flip, color jittering, and synthetic occlusion [51] in training. Horizontal flip augmentation is performed in testing for the PoseNet following Sun et al. [44]. We train the RootNet and the PoseNet for 20 epochs each with four NVIDIA 1080 Ti GPUs, which took two days per model.
8 Experiment
8.1 Dataset and evaluation metric
Human3.6M dataset. The Human3.6M dataset [16] is the largest 3D single-person pose benchmark. It consists of 3.6 million video frames. 11 subjects performing 15 activities are captured from 4 camera viewpoints. The ground-truth 3D poses are obtained using a motion capture system. Two evaluation metrics are widely used. The first one is the mean per joint position error (MPJPE) [16], which is calculated after aligning the human root of the estimated and ground-truth 3D poses. The second one is MPJPE after further alignment (i.e., Procrustes analysis (PA) [10]). This metric is called PA MPJPE. To evaluate the localization of the absolute 3D human root, we introduce the mean of the Euclidean distance between the estimated coordinates of the root $R$ and the ground-truth ones $R^{*}$, i.e., the mean of the root position error (MRPE), as a new metric:
MRPE = \frac{1}{N} \sum_{i=1}^{N} \left\| R^{(i)} - R^{*(i)} \right\|_2,    (5)
where the superscript $i$ is the index of the sample, and $N$ denotes the total number of test samples.
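Eq. 5 transcribes directly into numpy (the function name is ours):

```python
import numpy as np

def mrpe(pred_roots, gt_roots):
    """Mean Root Position Error (Eq. 5): the mean Euclidean distance
    between predicted and ground-truth camera-centered root coordinates,
    both given as (N, 3) arrays in millimeters."""
    return np.linalg.norm(pred_roots - gt_roots, axis=1).mean()
```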
MuCo-3DHP and MuPoTS-3D datasets. Mehta et al. [29] proposed a 3D multi-person pose estimation dataset. The training set, MuCo-3DHP, is generated by compositing the existing MPI-INF-3DHP 3D single-person pose estimation dataset [28]. The test set, the MuPoTS-3D dataset, was captured outdoors and includes 20 real-world scenes with ground-truth 3D poses for up to three subjects. The ground-truth is obtained with a multi-view marker-less motion capture system. For evaluation, the 3D percentage of correct keypoints (3DPCK) and the area under the 3DPCK curve over various thresholds (AUC) are used after root alignment with the ground-truth. A joint's prediction is treated as correct if it lies within 15cm of the ground-truth joint location. We additionally define 3DPCK_abs, which is the 3DPCK without root alignment, to evaluate the absolute camera-centered coordinates. To evaluate the localization of the absolute 3D human root, we use the average precision of the 3D human root location (AP_25^root), which considers a prediction correct when the Euclidean distance between the estimated and ground-truth coordinates is smaller than 25cm.
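The two 3DPCK variants differ only in whether the poses are root-aligned before thresholding; a minimal sketch for a single person (names and signature are ours):

```python
import numpy as np

def pck_3d(pred, gt, threshold=150.0, root_align=True, root_idx=0):
    """3DPCK: fraction of joints whose prediction lies within `threshold`
    mm (15 cm by default) of the ground truth. With root_align=True both
    poses are translated so their root joints coincide (the standard
    3DPCK); root_align=False gives the absolute variant, 3DPCK_abs.
    pred, gt: (J, 3) arrays of camera-centered coordinates in mm."""
    if root_align:
        pred = pred - pred[root_idx]
        gt = gt - gt[root_idx]
    dists = np.linalg.norm(pred - gt, axis=1)
    return (dists < threshold).mean()
```

A prediction that is a pure translation of the ground truth scores 100% on 3DPCK but can score 0% on 3DPCK_abs, which is exactly the gap the RootNet is meant to close.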
8.2 Experimental protocol
Human3.6M dataset. Two experimental protocols are widely used. Protocol 1 uses six subjects (S1, S5, S6, S7, S8, S9) in training and S11 in testing. PA MPJPE is used as the evaluation metric. Protocol 2 uses five subjects (S1, S5, S6, S7, S8) in training and two subjects (S9, S11) in testing. MPJPE is used as the evaluation metric. We use every 5th frame of videos in training and every 64th frame in testing, following [43, 44]. When training on the Human3.6M dataset, we use the additional MPII 2D human pose estimation dataset [1] following [52, 37, 43, 44]. Each mini-batch consists of half Human3.6M and half MPII data. For MPII data, the loss value of the $z$-axis becomes zero for both the RootNet and the PoseNet, following Sun et al. [44].
Settings  MRPE (mm)  MPJPE (mm)  Time (sec.)

Joint learning  138.2  116.7  0.132 
Disjointed learning (Ours)  120.0  57.3  0.141 
DetectNet  RootNet  AP  AP_25^root  AUC  3DPCK_abs

R50  k as depth  43.8  5.2  39.2  9.6
R50  Ours  43.8  28.5  39.8  31.5 
X-101-32  Ours  45.0  31.0  39.8  31.5
GT  Ours  100.0  31.4  39.8  31.6 
GT  GT  100.0  100.0  39.8  80.2 
Methods  Dir.  Dis.  Eat  Gre.  Phon.  Pose  Pur.  Sit  SitD.  Smo.  Phot.  Wait  Walk  WalkD.  WalkP.  Avg 
With groundtruth information in inference time  
Yasin [50]  88.4  72.5  108.5  110.2  97.1  81.6  107.2  119.0  170.8  108.2  142.5  86.9  92.1  165.7  102.0  108.3 
Rogez [40]                                88.1 
Chen [5]  71.6  66.6  74.7  79.1  70.1  67.6  89.3  90.7  195.6  83.5  93.3  71.2  55.7  85.9  62.5  82.7 
Moreno [32]  67.4  63.8  87.2  73.9  71.5  69.9  65.1  71.7  98.6  81.3  93.3  74.6  76.5  77.7  74.6  76.5 
Zhou [53]  47.9  48.8  52.7  55.0  56.8  49.0  45.5  60.8  81.1  53.7  65.5  51.6  50.4  54.8  55.9  55.3 
Martinez [26]  39.5  43.2  46.4  47.0  51.0  41.4  40.6  56.5  69.4  49.2  56.0  45.0  38.0  49.5  43.1  47.7 
Kanazawa [19]                                56.8 
Sun [43]  42.1  44.3  45.0  45.4  51.5  43.2  41.3  59.3  73.3  51.0  53.0  44.0  38.3  48.0  44.8  48.3 
Fang [7]  38.2  41.7  43.7  44.9  48.5  40.2  38.2  54.5  64.4  47.2  55.3  44.3  36.7  47.3  41.7  45.7 
Sun [44]  36.9  36.2  40.6  40.4  41.9  34.9  35.7  50.1  59.4  40.4  44.9  39.0  30.8  39.8  36.7  40.6 
Ours (PoseNet)  31.0  30.6  39.9  35.5  34.8  30.2  32.1  35.0  43.8  35.7  37.6  30.1  24.6  35.7  29.3  34.0 
Without groundtruth information in inference time  
Ours (Full)  32.5  31.5  41.5  36.7  36.3  31.9  33.2  36.5  44.4  36.7  38.7  31.2  25.6  37.1  30.5  35.2 
Methods  Dir.  Dis.  Eat  Gre.  Phon.  Pose  Pur.  Sit  SitD.  Smo.  Phot.  Wait  Walk  WalkD.  WalkP.  Avg 
With groundtruth information in inference time  
Chen [5]  89.9  97.6  90.0  107.9  107.3  93.6  136.1  133.1  240.1  106.7  139.2  106.2  87.0  114.1  90.6  114.2 
Tome [46]  65.0  73.5  76.8  86.4  86.3  68.9  74.8  110.2  173.9  85.0  110.7  85.8  71.4  86.3  73.1  88.4 
Moreno [32]  69.5  80.2  78.2  87.0  100.8  76.0  69.7  104.7  113.9  89.7  102.7  98.5  79.2  82.4  77.2  87.3 
Zhou [53]  68.7  74.8  67.8  76.4  76.3  84.0  70.2  88.0  113.8  78.0  98.4  90.1  62.6  75.1  73.6  79.9 
Jahangiri [17]  74.4  66.7  67.9  75.2  77.3  70.6  64.5  95.6  127.3  79.6  79.1  73.4  67.4  71.8  72.8  77.6 
Mehta [28]  57.5  68.6  59.6  67.3  78.1  56.9  69.1  98.0  117.5  69.5  82.4  68.0  55.3  76.5  61.4  72.9 
Martinez [26]  51.8  56.2  58.1  59.0  69.5  55.2  58.1  74.0  94.6  62.3  78.4  59.1  49.5  65.1  52.4  62.9 
Kanazawa [19]                                88.0 
Fang [7]  50.1  54.3  57.0  57.1  66.6  53.4  55.7  72.8  88.6  60.3  73.3  57.7  47.5  62.7  50.6  60.4 
Sun [43]  52.8  54.8  54.2  54.3  61.8  53.1  53.6  71.7  86.7  61.5  67.2  53.4  47.1  61.6  63.4  59.1 
Sun [44]  47.5  47.7  49.5  50.2  51.4  43.8  46.4  58.9  65.7  49.4  55.8  47.8  38.9  49.0  43.8  49.6 
Ours (PoseNet)  50.5  55.7  50.1  51.7  53.9  46.8  50.0  61.9  68.0  52.5  55.9  49.9  41.8  56.1  46.9  53.3 
Without groundtruth information in inference time  
Rogez [41]  76.2  80.2  75.8  83.3  92.2  79.9  71.7  105.9  127.1  88.0  105.7  83.7  64.9  86.6  84.0  87.7 
Mehta [29]  58.2  67.3  61.2  65.7  75.8  62.2  64.6  82.0  93.0  68.8  84.5  65.1  57.6  72.0  63.6  69.9 
Ours (Full)  51.5  56.8  51.2  52.2  55.2  47.7  50.9  63.3  69.9  54.2  57.4  50.4  42.5  57.5  47.7  54.4 
Methods  S1  S2  S3  S4  S5  S6  S7  S8  S9  S10  S11  S12  S13  S14  S15  S16  S17  S18  S19  S20  Avg 

Accuracy for all groundtruths  
Rogez [41]  67.7  49.8  53.4  59.1  67.5  22.8  43.7  49.9  31.1  78.1  50.2  51.0  51.6  49.3  56.2  66.5  65.2  62.9  66.1  59.1  53.8 
Mehta [29]  81.0  60.9  64.4  63.0  69.1  30.3  65.0  59.6  64.1  83.9  68.0  68.6  62.3  59.2  70.1  80.0  79.6  67.3  66.6  67.2  66.0 
Ours  94.4  77.5  79.0  81.9  85.3  72.8  81.9  75.7  90.2  90.4  79.2  79.9  75.1  72.7  81.1  89.9  89.6  81.8  81.7  76.2  81.8 
Accuracy only for matched groundtruths  
Rogez [41]  69.1  67.3  54.6  61.7  74.5  25.2  48.4  63.3  69.0  78.1  53.8  52.2  60.5  60.9  59.1  70.5  76.0  70.0  77.1  81.4  62.4 
Mehta [29]  81.0  65.3  64.6  63.9  75.0  30.3  65.1  61.1  64.1  83.9  72.4  69.9  71.0  72.9  71.3  83.6  79.6  73.5  78.9  90.9  70.8 
Ours  94.4  78.6  79.0  82.1  86.6  72.8  81.9  75.8  90.2  90.4  79.4  79.9  75.3  81.0  81.0  90.7  89.6  83.1  81.7  77.3  82.5 
Methods  Hd.  Nck.  Sho.  Elb.  Wri.  Hip  Kn.  Ank.  Avg 

Rogez [41]  49.4  67.4  57.1  51.4  41.3  84.6  56.3  36.3  53.8 
Mehta [29]  62.1  81.2  77.9  57.7  47.2  97.3  66.3  47.6  66.0 
Ours  79.1  92.6  85.1  79.4  67.0  96.6  85.7  73.1  81.8 
MuCo-3DHP and MuPoTS-3D datasets. Following the previous protocol [29], we composite 400K frames, of which half are background-augmented. For augmentation, we use images from the COCO dataset [25], except for images with humans. We use the additional COCO 2D human keypoint detection dataset [25] when training our models on the MuCo-3DHP dataset, following Mehta et al. [29]. Each mini-batch consists of half MuCo-3DHP and half COCO data. For COCO data, the loss value of the $z$-axis becomes zero for both the RootNet and the PoseNet, following Sun et al. [44].
8.3 Ablation study
In this study, we show how each component of our proposed framework affects the 3D multi-person pose estimation accuracy. To evaluate the performance of the DetectNet, we use the average precision of the bounding box (AP) following the metrics of the COCO object detection benchmark [25].
Disjointed pipeline. To demonstrate the effectiveness of the disjointed pipeline (i.e., separated DetectNet, RootNet, and PoseNet), we compare the MRPE, MPJPE, and running time of joint and disjointed learning of the RootNet and PoseNet in Table 1. The running time includes the DetectNet and is measured using a single TitanX Maxwell GPU. For joint learning, we combine the RootNet and PoseNet into a single model that shares the backbone part (i.e., ResNet [12]). The image feature from the backbone is fed to the RootNet and PoseNet branches in parallel. Compared with joint learning, our disjointed learning gives lower error under a similar running time. We believe that this is because the tasks of the RootNet and PoseNet are not highly correlated, so jointly training all tasks makes training harder, resulting in lower accuracy.
Effect of the DetectNet. To show how the performance of the human detection affects the accuracy of the final 3D human root localization and 3D multi-person pose estimation, we compare AP, AP_25^root, AUC, and 3DPCK_abs using the DetectNet with various backbones (i.e., ResNet-50 [12] and ResNeXt-101-32 [48]) and the ground-truth box in the second, third, and fourth rows of Table 2, respectively. The table shows that, based on the same RootNet (i.e., Ours), a better human detection model improves both the 3D human root localization and 3D multi-person pose estimation performance. However, the ground-truth box does not improve the overall accuracy considerably compared with the other DetectNet models. Therefore, we have sufficient reason to believe that the estimated boxes already cover most of the person instances, given such a high detection AP. We can also conclude that the bounding box estimation accuracy does not have a large impact on the 3D multi-person pose estimation accuracy.
Effect of the RootNet. To show how the performance of the 3D human root localization affects the accuracy of the 3D multi-person pose estimation, we compare AP_25^root and 3DPCK_abs using various RootNet settings in Table 2. The first and second rows show that, based on the same DetectNet (i.e., R50), our RootNet exhibits significantly higher AP_25^root and 3DPCK_abs compared with the setting in which $k$ is directly utilized as the depth value. We use the $x_R$ and $y_R$ of the RootNet when $k$ is used as the depth value. This result demonstrates that the RootNet successfully corrects the value of $k$. The fourth and last rows show that the ground-truth human root provides similar AUC, but significantly higher 3DPCK_abs, compared with our RootNet. This finding shows that better human root localization is required to achieve more accurate absolute 3D multi-person pose estimation results.
Effect of the PoseNet. All settings in Table 2 provide similar AUC. In particular, the first and last rows of the table show that using the ground-truth box and human root does not provide significantly higher AUC. As the results in the table are based on the same PoseNet, we can conclude that AUC, which evaluates the root-relative 3D human pose estimation, highly depends on the accuracy of the PoseNet.
8.4 Comparison with stateoftheart methods
Human3.6M dataset. We compare our proposed system with state-of-the-art 3D human pose estimation methods on the Human3.6M dataset [16] in Tables 3 and 4. As most of the previous methods use ground-truth information (i.e., the bounding box or 3D root location) at inference time, we report the performance of the PoseNet using the ground-truth 3D root location. Note that our full model does not require any ground-truth information at inference time. The tables show that our method achieves comparable performance despite not using any ground-truth information at inference time. Moreover, it significantly outperforms previous 3D multi-person pose estimation methods [41, 29].
MuCo-3DHP and MuPoTS-3D datasets. We compare our proposed system with the state-of-the-art 3D multi-person pose estimation methods on the MuPoTS-3D dataset [29] in Tables 5 and 6. The proposed system significantly outperforms them in most of the test sequences and joints.
These comparisons clearly show that our approach outperforms previous 3D multi-person pose estimation methods.
9 Discussion
Although our proposed method outperforms previous 3D multi-person pose estimation methods by a large margin, substantial room for improvement remains. As shown in Table 2, using the ground-truth 3D root location brings a significant 3DPCK_abs improvement. Recent advances in depth map estimation from a single RGB image [8, 22] can give a clue for improving the 3D human root localization model.
Our framework can also be used in applications other than 3D multi-person pose estimation. For example, recent methods for 3D human mesh reconstruction [2, 19, 18] reconstruct a full 3D mesh model of a single person. Joo et al. [18] utilized 2D multi-view input for 3D multi-person mesh model reconstruction. In our framework, if the PoseNet is replaced with an existing human mesh reconstruction model [2, 19, 18], 3D multi-person mesh model reconstruction can be performed from a single RGB image. This shows that our framework can be applied to many 3D instance-aware vision tasks that take a single RGB image as input.
10 Conclusion
We propose a novel and general framework for 3D multi-person pose estimation from a single RGB image. Our framework consists of human detection, 3D human root localization, and root-relative 3D single-person pose estimation models. Since any existing human detection and 3D single-person pose estimation models can be plugged into our framework, it is flexible and easy to use. The proposed system outperforms previous 3D multi-person pose estimation methods by a large margin and achieves performance comparable to that of 3D single-person pose estimation methods without using any ground-truth information at inference time, whereas they rely on it. To the best of our knowledge, this work is the first to propose a fully learning-based, camera distance-aware top-down approach whose components are compatible with most previous human detection and 3D human pose estimation models. We hope that this study provides a new basis for 3D multi-person pose estimation, which has been only barely explored.
Supplementary Material of “Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image”
In this supplementary material, we present additional experimental results that could not be included in the main manuscript due to space limitations.
1 Derivation of Equation 1
We provide a derivation of Equation 1 of the main manuscript with reference to Figure 6, which shows a pinhole camera model. The green and blue arrows represent the real-space length of the human centered at the root joint, $l_{real}$, and its length on the image sensor, $l_{img}$, respectively. The yellow lines show rays, and $F$ is the hole. $d$, $f$, and $l_{img}$ are the distance between the camera and the human root joint (mm), the focal length (mm), and the length of the human on the image sensor (mm), respectively.

According to the definition of $f$,

$$d : l_{real} = f : l_{img}, \qquad \text{i.e.,} \qquad d = \frac{f \, l_{real}}{l_{img}}.$$

Let $p_x$ be the per-pixel distance factor (mm/pixel) in the $x$-axis, so that a length of $l_{x,img}$ pixels in the image corresponds to $p_x \, l_{x,img}$ mm on the sensor. Then,

$$d = \frac{f \, l_{x,real}}{p_x \, l_{x,img}} = \alpha_x \frac{l_{x,real}}{l_{x,img}}, \qquad \alpha_x = \frac{f}{p_x}.$$

The above equations are also valid in the $y$-axis. Therefore,

$$d^2 = \alpha_x \alpha_y \, \frac{l_{x,real} \, l_{y,real}}{l_{x,img} \, l_{y,img}} = \alpha_x \alpha_y \, \frac{A_{real}}{A_{img}}.$$

Finally,

$$d = \sqrt{\alpha_x \alpha_y \, \frac{A_{real}}{A_{img}}} = k.$$
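The distance measure $k$ of Equation 1 can thus be computed directly from the camera intrinsics and the detected bounding box. The sketch below (function names are ours, for illustration only) computes $k$ using the constant real-space human area of 2000 mm × 2000 mm assumed in the main manuscript, and back-projects an image-space root location to camera-centered coordinates given a depth:

```python
import math

# Assumed constant real-space area of a human (2000 mm x 2000 mm),
# following the main manuscript.
A_REAL_MM2 = 2000.0 * 2000.0

def distance_measure_k(alpha_x, alpha_y, bbox_w_px, bbox_h_px, a_real=A_REAL_MM2):
    """Equation 1: k = sqrt(alpha_x * alpha_y * A_real / A_img), where
    alpha_x, alpha_y are the focal lengths divided by the per-pixel
    distance factors (i.e., focal lengths in pixel units) and A_img is
    the bounding-box area in pixels^2."""
    a_img = bbox_w_px * bbox_h_px
    return math.sqrt(alpha_x * alpha_y * a_real / a_img)

def back_project_root(u, v, d, alpha_x, alpha_y, cx, cy):
    """Recover camera-centered root coordinates (mm) from the image
    location (u, v), a depth d, and the principal point (cx, cy)."""
    return ((u - cx) / alpha_x * d, (v - cy) / alpha_y * d, d)
```

For a square bounding box with $\alpha_x = \alpha_y = f$, this reduces to $d = f \sqrt{A_{real}/A_{img}}$, matching the similar-triangles derivation above.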
2 Comparison of 3D human root localization with previous approaches
We compare previous absolute 3D human root localization methods [41, 28] with the proposed RootNet on the Human3.6M dataset [16] based on protocol 2.
Table 7: MRPE comparison with previous absolute 3D human root localization methods on the Human3.6M dataset (mm).

Methods | MRPE | MRPE_x | MRPE_y | MRPE_z
Baseline [41, 28] | 267.8 | 27.5 | 28.3 | 261.9
W/o limb joints | 226.2 | 24.5 | 24.9 | 220.2
RANSAC | 213.1 | 24.3 | 24.3 | 207.1
RootNet (Ours) | 120.0 | 23.3 | 23.0 | 108.1
Table 8: Running time of each component of the proposed framework (seconds per frame).

DetectNet | RootNet | PoseNet | Total
0.120 | 0.010 | 0.011 | 0.141
Sequence-wise absolute 3D multi-person pose estimation accuracy of our method on the MuPoTS-3D dataset.

Methods | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | S11 | S12 | S13 | S14 | S15 | S16 | S17 | S18 | S19 | S20 | Avg

Accuracy for all ground truths:
Ours | 59.5 | 44.7 | 51.4 | 46.0 | 52.2 | 27.4 | 23.7 | 26.4 | 39.1 | 23.6 | 18.3 | 14.9 | 38.2 | 26.5 | 36.8 | 23.4 | 14.4 | 19.7 | 18.8 | 25.1 | 31.5

Accuracy only for matched ground truths:
Ours | 59.5 | 45.3 | 51.4 | 46.2 | 53.0 | 27.4 | 23.7 | 26.4 | 39.1 | 23.6 | 18.3 | 14.9 | 38.2 | 29.5 | 36.8 | 23.6 | 14.4 | 20.0 | 18.8 | 25.4 | 31.8
Joint-wise absolute 3D multi-person pose estimation accuracy of our method on the MuPoTS-3D dataset (Hd.: head, Nck.: neck, Sho.: shoulder, Elb.: elbow, Wri.: wrist, Kn.: knee, Ank.: ankle).

Methods | Hd. | Nck. | Sho. | Elb. | Wri. | Hip | Kn. | Ank. | Avg
Ours | 37.3 | 35.3 | 33.7 | 33.8 | 30.4 | 30.3 | 31.0 | 25.0 | 31.5
Previous approaches [41, 28] simultaneously estimate the 2D image coordinates and the 3D camera-centered, root-relative coordinates of keypoints. The absolute camera-centered coordinates of the human root are then obtained by minimizing the distance between the 2D predictions and the projected 3D predictions, using a linear least-squares formulation. To measure the errors of their method, we implemented it using the ResNet-152-based model of Sun et al. [44] as the 2D pose estimator and the model of Martinez et al. [26] as the 3D pose estimator, both of which are state-of-the-art methods. In addition, to minimize the effect of outliers in the 3D-to-2D fitting, we excluded limb joints when fitting. We also performed RANSAC with various numbers of joints to find the optimal joint set for fitting, instead of using a heuristically selected one.
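As a concrete illustration, this baseline's 3D-to-2D fitting can be written as a small linear least-squares problem. The sketch below is our own outline under an ideal pinhole-camera assumption; `fit_root_translation` and its interface are illustrative, not taken from the compared methods' code. It linearizes the projection constraints u = cx + fx·(X+Tx)/(Z+Tz) (and likewise for v) and solves for the absolute root translation:

```python
import numpy as np

def fit_root_translation(pose_3d, pose_2d, f, c):
    """Estimate the absolute root translation (Tx, Ty, Tz) by linear
    least squares, aligning the projected root-relative 3D pose with
    the 2D predictions.

    pose_3d: (J, 3) root-relative 3D joints (mm), camera coordinates
    pose_2d: (J, 2) 2D joints (pixels)
    f: (fx, fy) focal lengths (pixels); c: (cx, cy) principal point
    """
    num_joints = pose_3d.shape[0]
    A = np.zeros((2 * num_joints, 3))
    b = np.zeros(2 * num_joints)
    for i in range(num_joints):
        X, Y, Z = pose_3d[i]
        u = pose_2d[i, 0] - c[0]
        v = pose_2d[i, 1] - c[1]
        # u*(Z+Tz) = fx*(X+Tx)  =>  fx*Tx - u*Tz = u*Z - fx*X
        A[2 * i] = [f[0], 0.0, -u]
        b[2 * i] = u * Z - f[0] * X
        A[2 * i + 1] = [0.0, f[1], -v]
        b[2 * i + 1] = v * Z - f[1] * Y
    T, *_ = np.linalg.lstsq(A, b, rcond=None)
    return T
```

In this formulation, every joint contributes two linear constraints, which is why outlier joints (e.g., poorly localized limbs) can degrade the fit and why joint-set selection matters.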
Table 7 shows that our RootNet significantly outperforms the previous approaches. Furthermore, the RootNet can be designed independently of the PoseNet, which gives design flexibility to both models. In contrast, the previous 3D root localization methods [41, 28] require both 2D and 3D predictions for root localization, which limits their generalizability.
3 Running time of the proposed framework
In Table 8, we report seconds per frame for each component of our framework. The running time is measured using a single TitanX Maxwell GPU. As the table shows, most of the running time is consumed by the DetectNet. It is hard to directly compare running time with previous works [41, 28] because they did not report it. However, we expect there would be little difference, because the models of [41] and [28] are similar to those of [39] and [3], whose speeds are 0.2 and 0.11 seconds per frame, respectively.
4 Absolute 3D multi-person pose estimation errors
The sequence-wise and joint-wise accuracies of our method on the MuPoTS-3D dataset are reported in the tables above.
5 Qualitative results
References
 [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
 [2] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016.
 [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
 [4] J. Y. Chang and K. M. Lee. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation. CVIU, 2018.
 [5] C.-H. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. In CVPR, 2017.
 [6] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
 [7] H.S. Fang, Y. Xu, W. Wang, X. Liu, and S.C. Zhu. Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI, 2018.
 [8] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
 [9] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
 [10] J. C. Gower. Generalized procrustes analysis. Psychometrika, 1975.
 [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [13] S. Huang, M. Gong, and D. Tao. A coarse-fine network for keypoint localization. In ICCV, 2017.
 [14] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
 [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
 [16] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI, 2014.
 [17] E. Jahangiri and A. L. Yuille. Generating multiple diverse hypotheses for human 3d pose consistent with 2d joint detections. In ICCV, 2017.
 [18] H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In CVPR, 2018.
 [19] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
 [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.
 [21] M. Kocabas, S. Karagoz, and E. Akbas. Multiposenet: Fast multi-person pose estimation using pose residual network. In ECCV, 2018.
 [22] R. Li, K. Xian, C. Shen, Z. Cao, H. Lu, and L. Hang. Deep attentionbased classification network for robust depth prediction. arXiv preprint arXiv:1807.03959, 2018.
 [23] S. Li and A. B. Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In ACCV, 2014.
 [24] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
 [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
 [26] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In ICCV, 2017.
 [27] F. Massa and R. Girshick. maskrcnnbenchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnnbenchmark, 2018.
 [28] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, 2017.
 [29] D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt. Single-shot multi-person 3d pose estimation from monocular rgb. In 3DV, 2018.
 [30] G. Moon, J. Y. Chang, and K. M. Lee. Multi-scale aggregation r-cnn for 2d multi-person pose estimation. CVPRW, 2019.
 [31] G. Moon, J. Y. Chang, and K. M. Lee. Posefix: Model-agnostic general human pose refinement network. In CVPR, 2019.
 [32] F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. In CVPR, 2017.
 [33] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
 [34] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
 [35] S. Park, J. Hwang, and N. Kwak. 3d human pose estimation using convolutional neural networks with 2d pose information. In ECCV, 2016.
 [36] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [37] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. In CVPR, 2017.
 [38] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
 [39] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [40] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In NIPS, 2016.
 [41] G. Rogez, P. Weinzaepfel, and C. Schmid. Lcr-net: Localization-classification-regression for human pose. In CVPR, 2017.
 [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
 [43] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In ICCV, 2017.
 [44] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In ECCV, 2018.
 [45] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3d human pose with deep neural networks. BMVC, 2016.
 [46] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. In CVPR, 2017.
 [47] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
 [48] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
 [49] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3d human pose estimation in the wild by adversarial learning. In CVPR, 2018.
 [50] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In CVPR, 2016.
 [51] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
 [52] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Weakly-supervised transfer for 3d human pose estimation in the wild. In ICCV, 2017.
 [53] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Monocap: Monocular human motion capture using a cnn coupled with a geometric prior. TPAMI, 2019.