Consensus-based Optimization for 3D Human Pose Estimation in Camera Coordinates

Diogo C. Luvizon           Hadi Tabia           David Picard
ETIS UMR 8051, Paris Seine University, ENSEA, CNRS, F-95000, Cergy, France
Advanced Technologies, Samsung Research Institute, Campinas, Brazil
IBISC, Univ. d'Évry Val d'Essonne, Université Paris Saclay
LIGM, UMR 8049, École des Ponts, UPE, Champs-sur-Marne, France

3D human pose estimation is frequently seen as the task of estimating 3D poses relative to the root body joint. Alternatively, in this paper, we propose a 3D human pose estimation method in camera coordinates, which allows an effective combination of 2D annotated data and 3D poses, as well as a straightforward multi-view generalization. To that end, we cast the problem into a different perspective, where 3D poses are predicted in the image plane, in pixels, and the absolute depth is estimated in millimeters. Based on this, we propose a consensus-based optimization algorithm for multi-view predictions from uncalibrated images, which requires a single monocular training procedure. Our method improves the state of the art on well-known 3D human pose datasets, reducing the prediction error by 32% in the most common benchmark. In addition, we also report our results in absolute pose position error, achieving 80mm for monocular estimations and 51mm for multi-view, on average.

1 Introduction

Figure 1: Absolute 3D human pose estimated from a single image (top-left) with occlusion and projected into a different view (top-right). Our multi-view consensus-based approach (bottom) results in a more precise absolute pose estimation and effectively handles cases of occlusion.

3D human pose estimation is a very active research topic, mainly due to the many applications that benefit from precise human poses, such as sports performance analysis, 3D model fitting, and human behavior understanding, among others. Despite the many recent works on 3D human pose estimation, most of the methods in the literature are limited to the problem of relative pose prediction [7, 40, 47, 4, 2], where the root body joint is centered at the origin and the remaining joints are estimated relative to this center. This limitation hinders generalization to multi-view scenarios, since predictions are not in camera coordinates. Conversely, when estimations are in camera coordinates, predicted human poses can be easily projected from one view to another, as illustrated in Fig. 1.

The methods in the state of the art frequently handle 3D human pose estimation as a regression task, directly converting the input images to predicted poses in millimeters [38, 18]. However, this is an ill-posed learning problem, because identical distances in pixels can correspond to different distances in millimeters. For example, a person close to the camera with a hand next to the head has a distance (head to hand, in mm) much shorter than a person far from the camera with her arm extended, although both result in the same distance in pixels. Consequently, those methods have to learn the intrinsic parameters indirectly. Moreover, by predicting 3D poses directly in millimeters, the abundant images with annotated 2D poses in pixels cannot be easily exploited, since these 2D data have no associated 3D information, and relative poses predicted from one camera cannot be easily projected into a different view, making it more difficult to handle occlusion cases in multi-view scenarios.

In our method, we tackle these limitations by casting the problem of 3D human pose estimation into a different perspective: instead of directly predicting the pose in millimeters relative to the root joint, we predict coordinates in the image plane, in pixels, and the absolute depth in millimeters. Both 2D human pose and absolute depth estimation are well known problems in the literature [3, 10, 8, 16], including the absolute depth estimation benchmark NYUv2 [24], but are usually not correlated. By casting the 3D human pose estimation in that way, we are able to effectively merge 2D and 3D datasets, making the best use of each. In addition, we are able to handle the challenging cases of occlusion by learning a consensus-based optimization to merge predictions from different views, considering estimations in the camera coordinates. Our method results not only in a robust approach to handle occlusion cases, but also in a new multi-view method for absolute 3D human pose estimation capable of estimating the camera extrinsic parameters.

Considering the exposed limitations of relative 3D human pose estimation, we aim to fill the gap of current methods by addressing the more complex problem of absolute 3D human pose estimation, where predictions are performed with respect to a static reference frame, i.e., the camera position, and not the person's root joint. In that direction, we present our contributions: First, we propose an absolute 3D human pose estimation method for monocular cameras. Second, we propose a consensus-based optimization for multi-view absolute 3D human pose estimation from uncalibrated images, which requires only a single monocular training procedure. As a result, the proposed method sets new state-of-the-art results on the challenging test set from Human3.6M, improving previous results by 10% with monocular predictions and by 32% when considering multiple views.

The remainder of this paper is organized as follows. In section 2 we present the related work. Our method is explained in section 3. The experiments are presented in section 4 and in section 5 we conclude this paper.

2 Related work

In this section, we review the methods most related to our work, giving special attention to monocular (relative and absolute) and multi-view 3D human pose estimation. We recommend the survey in [35] for readers seeking a more detailed review.

2.1 Monocular relative 3D human pose estimation

In the last decade, monocular 3D human pose estimation has been a very active research topic in the community [1, 7, 40, 49, 39, 18, 13]. Many recent works have proposed to directly predict relative 3D poses from images [38, 21, 37, 48, 28], which requires the model to learn a complex projection from 2D pixels to millimeters in three dimensions. Another drawback is their limited ability to benefit from the abundant 2D data, since the manually annotated images have no associated 3D information.

A common approach to directly use 2D data during training is to first learn a 2D pose estimator, then lift 3D poses from 2D estimations [17, 31, 43, 40, 21, 7]. However, lifting 3D from 2D points alone is an ill-defined problem since no visual cues are available, frequently resulting in ambiguity and, consequently, limited precision. Other methods assume that the absolute location of the root joint is provided during inference [48, 19], so that the inverse projection from pixels to millimeters can be performed. In our approach, this assumption is not made, since we estimate the 3D pose in absolute coordinates. The only additional information we need is the intrinsic parameters for monocular camera prediction, which are often given by the manufacturer or can be obtained with standard tools.

In contrast to the previous works, we are able to train our method simultaneously with 3D and 2D annotated data in an effective way, since one part of our prediction is performed in the image plane and is completely independent from 3D information. Moreover, estimating the first two coordinates in pixels in the image plane is a better defined problem than estimating floating 3D positions directly in millimeters. These advantages translate into higher accuracy for our method.

2.2 Monocular absolute 3D human pose estimation

In contrast to relative estimation, in absolute pose prediction the 3D coordinates of the human body are predicted with respect to the camera. A simple approach is to infer the distance to the camera by assuming a normalized and constant body size [47], which is an unrealistic assumption. Inspired by the many works on depth estimation, [26] proposed to predict the depth of body joints individually. The drawback of this method is that it struggles to capture the human body structure, since errors in the estimated depth of individual joints can degenerate the final pose.

Very recently, a multi-person absolute pose estimation method was proposed in [23]. The authors proposed to predict the absolute distance from the person to the camera based on the area of the cropped 2D bounding box. However, it is known from the literature on absolute depth estimation [9, 8] that not only the size of objects is important, but also that the position of objects in the image is an informative cue for predicting their depth. For example, a person at the bottom of an image is more likely to be close to the camera than a person at the top of the same image. In our approach, we combine three different sources of information to predict the distance of the root joint: the size of the bounding box (including its aspect ratio), the position in the image, and deep convolutional features that provide additional visual cues.

2.3 Multi-view 3D human pose estimation

For the challenging cases of occlusion or cluttered background, multiple views can be decisive for disambiguating uncertain positions of body joints (see Fig. 1). To handle this, many approaches have proposed multi-view solutions for 3D human pose estimation [4, 2, 6, 5, 12], mostly exploring the classical concept of pictorial structures from multi-view images. More recently, deep neural networks have been used to estimate relative 3D poses from a set of 2D predictions from different views [33, 29, 27]. As an example, Pavlakos et al. [29] proposed to collect 3D poses from multi-view 2D images, which are used to learn a second model that performs 3D estimations. Since these methods estimate 3D from multi-view 2D only, they often require both intrinsic and extrinsic parameters, with the exception of [33], which estimates the calibration.

From the recent literature, we can notice that current multi-view approaches are still completely dependent on the camera intrinsic parameters and often require a complete calibration setup, which can be prohibitive in some circumstances. Available methods are also limited to the inference of 3D from multiple 2D predictions, requiring multi-view datasets for training. Alternatively, we propose to predict absolute 3D poses from each individual view, which has two important advantages over previous methods. First, it allows us to easily combine predictions from multiple calibrated cameras, while requiring a single monocular training procedure. Second, we are able to estimate camera calibration, both intrinsic and extrinsic, from multi-view images, by a consensus-based optimization without retraining the model. The superiority of our approach is evidenced by its strong results, even when considering unknown and uncalibrated cameras.

3 Proposed method

The goal of our method is to predict 3D human poses in absolute coordinates with respect to the camera position. For this, we believe that the most effective approach is to predict each body joint in image pixel coordinates and in absolute depth, orthogonal to the image plane, in millimeters. Then, the predicted pixel coordinates and depth can be projected to the world, considering a pinhole camera model.

We further split the problem into relative 3D pose estimation and absolute depth estimation. The motivation for this comes from the idea that a well cropped bounding box around the person is better for predicting its pose than the full image frame, since a better resolution can be attained and the person's scale is automatically handled by the image crop. Additionally, providing a separate loss on the relative depth of each joint helps the network to learn the human body structure, which would be more difficult to learn directly from absolute coordinates due to the position shift.

Estimating the absolute depth from monocular images is a hard problem, especially from cropped regions. Recent works on depth estimation have demonstrated that neural networks rely on both pictorial cues and geometry information to predict depth [9]. For the specific problem of 3D human pose estimation, the structure of the human body is also an important piece of domain knowledge to be exploited. Considering our motivations and the exposed challenges, we propose to predict 3D poses relative to a cropped region centered at the person, which makes it easier for the network to encode the human body structure, and absolute depth from combined local pictorial cues and the global position and size of the cropped region.

Specifically, given an image and a person bounding box region, we define the problem as learning a function that outputs three quantities: the estimated relative pose, composed of body joints in the (u, v, d) format; the body joint confidence scores, which provide an additional level of confidence for each predicted body joint coordinate; and the estimated absolute depth of the person's root joint. The person bounding box is defined by its central position and size, and can be obtained using a standard person detector [32]. The parametrized regression function is implemented as a deep convolutional neural network (CNN), detailed as follows.

3.1 Network architecture

U-Nets are widely used for human pose estimation due to their multi-scale processing capabilities [25], while ResNet [11] is often preferred to produce CNN features. Since we want precise pose predictions and informative visual features for absolute depth estimation, we propose a combined network composed of a ResNet cut at block 4f as backbone, followed by 2 U-blocks, as shown in Fig. 2. This architecture is called ResNet-U and, together with a few fully connected layers that regress the absolute depth and the confidence scores, implements the regression function in our approach. The details of each part of our method are discussed in the following.

Figure 2: Proposed ResNet-U architecture. Input images and bounding box parameters are fed into a neural network that predicts the absolute depth, the human pose, and the confidence scores.

3.2 3D human pose regression

As previously stated, as a first step we want to estimate the 3D human pose relative to the cropped bounding box. Specifically, we predict the pixel coordinates in the image plane, given the information from the cropped region. Since it is difficult to predict the absolute depth from an arbitrarily cropped region, at this stage we predict the relative depth of each body joint with respect to the location of the person. Therefore, the human pose estimation problem can be naturally split into two parts: image plane pose estimation and body joint depth estimation, as detailed next.

3.2.1 Relative UVD pose estimation

For the pose prediction in the image plane, we use the soft-argmax operation [20, 44] on the predicted probability distributions from the U-Nets. This probability distribution is defined as a normalized feature map (positive, with unitary sum) for each body joint. The third dimension of the pose is composed of the depth per body joint, with respect to the location of the person. This prediction could be integrated into the soft-argmax by extending the feature map to a volumetric representation [38]. However, depending on the resolution and on the number of body joints, this approach can be costly. Instead, we propose to predict a normalized depth map (in the interval [0, 1]) per joint, corresponding to an interval of 2 meters, with 0.5 being the depth of the root joint. By restricting the estimated depth to this range, we ensure that the bounding box prediction is well defined inside a small region, e.g., 2 meters, corresponding to the enclosure of an average person. The regressed depth inside the bounding box is defined by:


where the depth range is set to 2 meters in our method. Note that in Equation 1 we pool the regions from the depth maps corresponding to the high probability locations of the body joints.
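To make the soft-argmax and the depth pooling concrete, the following is a minimal NumPy sketch under assumed conventions: the per-joint heatmap is taken to be already normalized to unit sum, and the pooled normalized depth is mapped to millimeters using the 2-meter range with the 0.5 root-depth offset described above.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Expected (u, v) coordinates under a normalized heatmap of shape (H, W)."""
    H, W = heatmap.shape
    us = np.arange(W)
    vs = np.arange(H)
    u = (heatmap.sum(axis=0) * us).sum()  # marginal over rows, expectation over columns
    v = (heatmap.sum(axis=1) * vs).sum()  # marginal over columns, expectation over rows
    return u, v

def pooled_depth(heatmap, depth_map, depth_range=2000.0):
    """Pool the normalized depth map at high-probability joint locations.
    The 0.5 offset centers the 2 m range (2000 mm) at the root joint depth."""
    d_norm = (heatmap * depth_map).sum()
    return depth_range * (d_norm - 0.5)
```

Because the heatmap acts as a spatial probability distribution, both the coordinates and the pooled depth are differentiable expectations, which is what allows end-to-end training.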

3.2.2 Absolute depth estimation

Once we have estimated the body joint coordinates in pixels and the depth with respect to the location of the person, we then predict the absolute depth of the person with respect to the camera. For this, we use two complementary sources of information: the bounding box position and size, and deep visual features. The position and size of the bounding box provide rough global information about the scale and position of the person in the image. Additionally, the visual features extracted from the bounding box region by means of the ResNet provide informative visual cues that are used to refine the absolute person distance estimation.

Both extracted features are then fed to a fully-connected network with 256 neurons at each level and a single neuron as output, which is activated by a smoothed sigmoid-like function, defined as:


where z_max is the maximum depth, set to 10 meters in our experiments. The output is then supervised with the absolute depth of the root joint. This process is illustrated in Fig. 2, bottom left. We demonstrate in our experiments that the two types of information, visual and bounding box, are complementary for the task of absolute depth prediction.
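A minimal sketch of this depth head is given below, assuming two fully-connected layers of 256 units as described above. The exact smoothed sigmoid-like activation is not reproduced here, so a plain sigmoid is used as a stand-in, and the `weights` dictionary layout is purely illustrative.

```python
import numpy as np

Z_MAX = 10000.0  # maximum depth in mm (10 m), as in the paper

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def absolute_depth_head(bbox_feats, visual_feats, weights):
    """Sketch of the absolute depth head: concatenated bounding-box and
    visual features pass through two 256-unit fully-connected layers and
    a single output neuron squashed to (0, Z_MAX).
    `weights` maps layer names to (W, b) pairs (illustrative layout only);
    a plain sigmoid replaces the paper's smoothed sigmoid-like function."""
    x = np.concatenate([bbox_feats, visual_feats])
    for key in ("fc1", "fc2"):
        W, b = weights[key]
        x = np.maximum(W @ x + b, 0.0)  # ReLU hidden layers
    w_out, b_out = weights["out"]
    zeta = w_out @ x + b_out
    return Z_MAX * sigmoid(float(zeta))
```

Bounding the output between 0 and z_max keeps the regressed depth in a physically plausible range regardless of the network's raw activation.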

3.2.3 Absolute 3D pose reconstruction

In order to accomplish our objective of estimating the absolute 3D pose, we combine the estimated pose in the bounding box with the predicted absolute depth. The absolute depth z of each body joint results from the combination of these two distinct predictions, which are individually supervised, and corresponds to the absolute distance in millimeters from the body joint to the camera. The other two absolute coordinates required to build the final absolute 3D pose, x and y, can then be computed by the following equation:

x = (u - c_x) * z / f_x,    y = (v - c_y) * z / f_y,    (3)

where f_x, f_y are the camera focal lengths and c_x, c_y the coordinates of the camera center, both in pixels, considering the x and y axes. Note that these parameters are camera intrinsics and can be easily obtained: the focal length is often given by the manufacturer, the projection center frequently corresponds to the image center in pixels, or both values can be estimated with standard tools. Nevertheless, we propose in the following a method to estimate the camera parameters, both intrinsic and extrinsic, without any prior knowledge, directly from the predictions of our method, considering a multi-view scenario with uncalibrated cameras.
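Whatever the source of the intrinsics, the back-projection itself follows the standard pinhole model. A small sketch, assuming per-joint (u, v) in pixels and absolute depth z in millimeters:

```python
import numpy as np

def reconstruct_absolute_pose(uvd, f, c):
    """Back-project image-plane joints to absolute camera coordinates
    using the pinhole model.
    uvd: (J, 3) array of (u, v, z) with u, v in pixels and z in mm;
    f: (fx, fy) focal lengths in pixels; c: (cx, cy) principal point."""
    u, v, z = uvd[:, 0], uvd[:, 1], uvd[:, 2]
    x = (u - c[0]) * z / f[0]
    y = (v - c[1]) * z / f[1]
    return np.stack([x, y, z], axis=1)
```

A joint projected exactly onto the principal point maps to x = y = 0 for any depth, which is a quick sanity check for the intrinsics.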

3.3 Consensus-based optimization

One of the main advantages of estimating absolute instead of relative 3D poses is the possibility to project the predictions from one camera to another, simply by applying a rotation and a translation. This advantage has important consequences in multi-view scenarios. For example, when the camera calibration is known, the predictions of different monocular cameras can be combined in a common reference, resulting in more precise predictions. For the cases where no information about the camera calibration is available, we propose a consensus-based algorithm that estimates both intrinsic and extrinsic parameters, resulting in a completely uncalibrated multi-view approach. This algorithm is explained as follows.

Let us consider our model's predictions from two distinct cameras, along with their corresponding poses in absolute camera coordinates, P^1 and P^2, respectively for cameras 1 and 2. Then, we define the projection of P^2 into camera 1 as:

P^{2->1} = R * P^2 + t,    (4)

where R and t are a rotation matrix and a translation vector from camera 2 to camera 1. Our goal is to minimize the projection error from camera 2 to camera 1 (and vice-versa) by optimizing a set of camera parameters, which includes the intrinsics of both cameras and the extrinsic parameters between them. Specifically, let us define the optimization problem as:


By solving Equation 5 for individual variables, we obtain the following update rules for alternating steepest gradient descent (details about the derivatives are given in the appendix):


where the auxiliary term is given by:


Note that for Equations 7, 8, and 9 the parameters for the vertical axis follow a similar form, with the horizontal components replaced by their vertical counterparts. For the intrinsic parameters of camera 2, the same equations are used, except that the camera indexes are swapped in each variable. Additionally, the reverse projection, from camera 1 to camera 2, is given by isolating P^2 in Equation 4.

Finally, we can solve the global optimization problem by alternating the optimization of camera extrinsic and intrinsic parameters. This process is detailed in Algorithm 1.

1: Compute the absolute poses of both cameras from Equation 3
2: Initialize the extrinsic parameters using Equation 6
3: repeat
4:     Update R and t using rigid Procrustes alignment
5:     Update the translation using Equation 6
6:     Alternately update the intrinsic parameters of cameras 1 and 2, using Equations 7 and 8
7:     Update the absolute poses of both cameras from Equation 3
8: until convergence
Algorithm 1 Camera parameters optimization.
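The rigid Procrustes update in Algorithm 1 admits a standard closed-form (Kabsch) solution via SVD. The sketch below illustrates that standard solution; the paper's exact variant may differ, e.g., in how confidence weights enter.

```python
import numpy as np

def rigid_procrustes(P_src, P_dst):
    """Closed-form (Kabsch) estimate of the rotation R and translation t
    minimizing ||(R @ P_src.T).T + t - P_dst|| for corresponding 3D point
    sets P_src, P_dst of shape (N, 3)."""
    mu_s, mu_d = P_src.mean(axis=0), P_dst.mean(axis=0)
    # Cross-covariance of the centered point sets
    H = (P_src - mu_s).T @ (P_dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Correct the sign to avoid reflections (det(R) must be +1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

With noise-free correspondences the recovery is exact, which makes this a convenient unit test for the alignment step.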

3.3.1 Body joint confidence scores

Since the proposed consensus-based optimization algorithm relies on estimated poses, it can be affected by the precision of the predicted joint positions. Although the average error of our method is very low compared to previous approaches, we also propose a confidence score that indicates whether the network is “confident” or not for each predicted body joint. This score varies from 0 to 1 and is implemented by a DNN that takes estimated poses as input (see Fig. 2, bottom right). The ground truth score for each joint is defined as follows:


where e is the distance error between the predicted and ground truth joint position, μ is the average prediction error, and σ is the error standard deviation. By estimating Equation 10, we can remove predicted joints with error higher than the average simply by discarding joints with low predicted confidence. The predicted confidence score makes Algorithm 1 more robust, and it is also important for multi-view estimation: when predicting poses in a multi-view scenario, each body joint from each view is weighted by its corresponding predicted confidence score.
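As an illustration of the confidence-weighted multi-view merge described above, a hypothetical helper (both the function and its signature are assumptions for illustration, not the paper's implementation) could look like:

```python
import numpy as np

def merge_views(poses, scores, eps=1e-8):
    """Confidence-weighted merge of per-view absolute poses.
    poses: (V, J, 3) poses already projected into a common camera frame;
    scores: (V, J) predicted per-joint confidence in [0, 1].
    Returns a (J, 3) merged pose; eps guards against all-zero weights."""
    w = scores[..., None]
    return (w * poses).sum(axis=0) / (w.sum(axis=0) + eps)
```

A joint that is occluded in one view (low confidence) is then dominated by the views where it is clearly visible, which is the behavior illustrated in Fig. 3.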

4 Experiments

In this section, we present the results of our method on two well known datasets, as well as a sequence of ablation studies to provide insights about our approach.

4.1 Datasets

Human3.6M [14] is a large scale dataset with 3D human poses collected by a motion capture system (MoCap) and RGB images captured by 4 synchronized cameras. A total of 15 activities are performed by 11 actors, 5 females and 6 males, resulting in 3.6 million images. Poses are composed of 23 body joints, from which 17 are used for evaluation.

MPI-INF-3DHP [22] is a dataset for 3D human pose estimation captured with a marker-less MoCap system, which allows outdoor video recording, e.g., TS5 and TS6 from testing. A total of 8 activities are performed by 8 different actors in two distinct sequences. Human poses are composed of 28 body joints, from which 17 are used for evaluation. The activities involve complex exercising poses, which makes this dataset more challenging than Human3.6M. However, the marker-less motion capture is visually less precise than the ground truth poses from [14]. Despite having a training set captured by 8 different cameras, test samples are captured by a single monocular camera.

4.2 Evaluation protocols and metrics

Three evaluation protocols are widely used for Human3.6M. In protocol 1, six subjects are used for training and only one for evaluation. Since this protocol uses a Procrustes alignment between predictions and ground truth, we do not consider it in our work. In protocol 2, five subjects (S1, S5, S6, S7, S8) are dedicated to training and S9 and S11 to evaluation, and evaluation videos are sub-sampled at every 64th frame. The third protocol is the official test set (S2, S3, S4), for which ground truth poses are withheld by the authors and evaluation is performed over all test frames (almost 1 million images) through a server. In our experiments, we report our scores on the most challenging official test set. Additionally, we consider protocol 2 for the ablation studies and for comparison with multi-view approaches.

The standard metric for Human3.6M is the mean per joint position error (MPJPE), which measures the average joint error after centering both predictions and ground truth poses at the origin. We also evaluated our method considering the mean root position error (MRPE) [23], which measures the average error related to the absolute pose estimation. This metric is considered only for validation, since the server does not support this protocol.
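For reference, the two metrics can be sketched as follows, assuming the root joint is at index 0 of the joint array (the actual index depends on the skeleton layout):

```python
import numpy as np

ROOT = 0  # assumed index of the root (pelvis) joint

def mpjpe(pred, gt):
    """Mean per joint position error in mm, after centering both poses
    at their root joint. pred, gt: (J, 3) poses."""
    p = pred - pred[ROOT]
    g = gt - gt[ROOT]
    return float(np.linalg.norm(p - g, axis=1).mean())

def mrpe(pred, gt):
    """Mean root position error in mm: the absolute error of the root
    joint, i.e., the quality of the absolute localization."""
    return float(np.linalg.norm(pred[ROOT] - gt[ROOT]))
```

Note that a constant global offset of the predicted pose leaves MPJPE unchanged but shows up fully in MRPE, which is why both metrics are needed for absolute pose evaluation.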

For MPI-INF-3DHP, evaluation is performed on a test set composed of 6 videos/subjects, of which 2 are recorded in outdoor scenes, resulting in almost 25K frames. The authors of [22] proposed three evaluation metrics: the mean per joint position error, in millimeters, the 3D Percentage of Correct Keypoints (PCK), and the Area Under the Curve (AUC) for different thresholds on PCK. The standard threshold for PCK is 150mm. Differently from previous work, we use the real 3D poses to compute the error instead of the normalized 3D poses, since the last is not compatible with a constant camera projection. Since evaluation is performed on monocular images, we use the available intrinsic camera parameters to recover absolute poses in millimeters.

4.3 Implementation details

During training, for both the absolute and relative 3D pose estimations, we supervise our network using the elastic net loss (L1+L2) [50], applied respectively to the pose and depth predictions. The final loss is then represented by:


Once the first part of our network is trained, we compute the average prediction error on the training set, which is used to train the confidence score network using the mean absolute error (MAE). RMSprop and Adam are used for optimization, respectively for the first and second training processes, starting from an initial learning rate that is decreased after 150K and 170K iterations. Batches of 24 images are used. The full training process takes less than two days on a GTX 1080 Ti GPU. We augmented the training data with common techniques, such as random rotations, re-scaling (from 0.7 to 1.3), horizontal flipping, color gains (from 0.9 to 1.1), and artificial occlusions with rectangular black boxes. Additionally, we augmented the training data in a 50/50 ratio with 2D images from MPII [3], which has become a standard data augmentation technique for 3D human pose estimation.
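A minimal sketch of the elastic net (L1+L2) supervision is shown below, assuming an unweighted sum of the two terms averaged over all coordinates (the paper's relative weighting between the pose and depth losses is not specified here):

```python
import numpy as np

def elastic_net_loss(pred, target):
    """Elastic net loss (L1 + L2): the sum of the absolute and squared
    differences, averaged over all coordinates. Unweighted combination
    assumed for illustration."""
    diff = pred - target
    return float(np.mean(np.abs(diff) + diff ** 2))
```

Combining L1 and L2 keeps gradients informative for small residuals (L2 term) while remaining robust to occasional large errors (L1 term).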

Methods Directions Discussion Eating Greeting Phoning Posing Purchases Sitting
Popa et al. [30] 60 56 68 64 78 67 68 106
Zanfir et al. [45] 54 54 63 59 72 61 68 101
Zanfir et al. [46] 49 47 51 52 60 56 56 82
Shi et al. [36] 51 50 54 54 62 57 54 72
Ours monocular 42 44 52 47 54 48 49 66
Ours multi-view est. calib. 40 36 44 39 44 42 41 66
Ours multi-view GT calib. 31 33 41 34 41 37 37 51
Methods Sit. Down Smoking Photo Waiting Walking Walk.Dog Walk.Pair Average
Popa et al. [30] 119 77 85 64 57 78 62 73
Zanfir et al. [45] 109 74 81 62 55 75 60 69
Zanfir et al. [46] 94 64 69 61 48 66 49 60
Shi et al. [36] 76 62 65 59 49 61 54 58
Ours monocular 76 54 61 47 44 55 44 52
Ours multi-view est. calib. 70 46 49 43 34 46 34 45
Ours multi-view GT calib. 56 43 44 37 33 42 32 39
Table 1: Comparison with results from related methods on Human3.6M test set using MPJPE (millimeters error) evaluation.
Methods Camera calib. Directions Discussion Eating Greeting Phoning Posing Purchases Sitting
PVH-TSP [41] GT 92.7 85.9 72.3 93.2 86.2 101.2 75.1 78.0
Trumble et al. [42] GT 41.7 43.2 52.9 70.0 64.9 83.0 57.3 63.5
Pavlakos et al. [29] GT 41.1 49.1 42.7 43.4 55.6 46.9 40.3 63.6
Ours Estimated 59.3 40.7 38.7 39.1 41.7 39.5 40.6 64.1
Ours GT 31.0 33.7 33.8 33.4 38.6 32.2 36.3 48.2
Methods Camera calib. Sit. Down Smoking Photo Waiting Walking Walk.Dog Walk.Pair Average
PVH-TSP [41] GT 83.5 94.8 85.8 82.0 114.6 94.9 79.7 87.3
Trumble et al. [42] GT 61.0 95.0 70.0 62.3 66.2 53.7 52.4 62.5
Pavlakos et al. [29] GT 97.5 119.9 52.1 42.6 51.9 41.7 39.3 56.8
Ours Estimated 69.5 42.0 44.6 39.6 31.0 40.2 35.3 44.7
Ours GT 51.5 39.2 38.8 32.4 29.6 38.9 33.2 36.9
Table 2: Comparison with related multi-view methods on Human3.6M validation set, protocol 2. We report our scores in mm error (MPJPE), considering ground truth and estimated camera calibration. Note that all previous methods use ground truth camera calibration.
Method Stand Exercise Sit Crouch On the Floor Sports Misc. Total (PCK) AUC MPJPE
Rogez et al. [34] 70.5 56.3 58.5 69.4 39.6 57.7 57.6 59.7 27.6 158.4
Zhou et al. [48] 85.4 71.0 60.7 71.4 37.8 70.9 74.4 69.2 32.5 137.1
Mehta et al. [22] 86.6 75.3 74.8 73.7 52.2 82.1 77.5 75.7 39.3 117.6
Kocabas et al. [15] 77.5 108.99
Ours monocular 83.8 79.6 79.4 78.2 73.0 88.5 81.6 80.6 42.1 112.1

Methods using normalized 3D human poses for evaluation.

Table 3: Results on MPI-INF-3DHP compared to the state-of-the-art.

4.4 Comparison with the state-of-the-art

4.4.1 Human3.6M

In Table 1, we show our results on the test set from Human3.6M. We provide results of our method considering monocular predictions and multi-view predictions, with both estimated and ground truth camera calibration. In all cases our method obtains state-of-the-art results by a fair margin, reducing the prediction error by more than 10% in the monocular scenario. In the multi-view setup, our method achieves 39mm error, reducing errors by more than 32% on average. In the most challenging activity (Sitting Down), our method performs better than all previous approaches on average. We believe our multi-view results are close to the best of what can be achieved, given the precision of the annotations. These results demonstrate the effectiveness of our method, considering that the test set from Human3.6M is very challenging and labels are withheld by the authors.

For a fairer comparison, we also consider results from multi-view approaches only, in Table 2. We present our scores considering ground truth and estimated camera calibration, while all previous methods use the available calibration from the dataset. Still, our method obtains 36.9mm error, which improves the state of the art by 35%. In this comparison, we do not consider methods that make use of the ground truth absolute position of the root joint, since our method estimates this information.

4.4.2 Mpi-Inf-3dhp

Our results on MPI-INF-3DHP are shown in Table 3. We do not report multi-view results on this dataset, since the testing samples were captured by a single camera. Contrary to common practice on this dataset, we evaluated our method using non-normalized 3D poses, since otherwise it would not be possible to perform the inverse camera projection. Nevertheless, our method achieves results comparable to the state of the art, even against methods using normalized 3D poses.

4.5 Ablation studies

In this part, we present additional experiments to provide insights about our method and our design choices.

4.5.1 Network architecture

We evaluated three different network architectures, as presented in Table 4. An off-the-shelf ResNet achieved 62.2mm and 53.7mm error, respectively, when cut at blocks 4 and 5. The proposed ResNet-U improves on ResNet block 5 by 3.2mm while requiring 2.7M fewer parameters.

ResNet block 4 ResNet block 5 ResNet-U
MPJPE Param. MPJPE Param. MPJPE Param.
62.2 10.5M 53.7 26M 50.5 23.3M
Table 4: Evaluation of the network architecture, considering the backbone only (ResNet) cut at block 4 and block 5, and the refinement network (ResNet-U).

4.5.2 Absolute depth estimation

In Table 5, we evaluate the influence of visual features and bounding box position on the absolute depth estimation, considering the mean root position error in mm (MRPE). As can be observed, using only bounding box features is insufficient to precisely predict the absolute depth, but when combined with visual features the error is further reduced by 20mm compared to CNN features alone, which evidences the usefulness of the global bounding box information for this task.

Features:    Bounding box   CNN features   Combined
MRPE (mm):   375.4          100.1          80.1
Table 5: Absolute root joint position error in mm based on different features combinations.
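The feature combination evaluated above can be sketched as a simple concatenation of bounding box parameters with pooled visual features, followed by a regressor. The feature dimensions and the (randomly initialized, untrained) linear head below are hypothetical, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 4 normalized bounding-box values
# (x, y, width, height) and a 512-d pooled visual feature vector.
bbox_feat = rng.standard_normal(4)
visual_feat = rng.standard_normal(512)

# Combined representation: plain concatenation, then a linear
# regressor (untrained here) mapping to a scalar absolute depth.
combined = np.concatenate([bbox_feat, visual_feat])
W = rng.standard_normal(combined.shape[0])
b = 0.0
depth_mm = combined @ W + b  # absolute root depth, in millimeters
```

In the actual network the regressor is learned jointly with the backbone; the point of the sketch is only that both feature sources feed a single depth output.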

4.5.3 The effect of multiple camera views

Figure 3: On top, the absolute prediction from camera 1 is projected to camera 2 with considerable errors in occluded joints. At the bottom, predictions from cameras 1, 3, 4 are projected to camera 2 and merged, improving the prediction quality significantly.

Since the proposed method predicts 3D poses in absolute camera coordinates and is also capable of estimating the extrinsic camera parameters, we can use multiple cameras to predict the same pose at inference time. When considering multi-view scenarios, we can either use camera calibration, when provided, or we can use our consensus-based optimization algorithm.
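A minimal sketch of merging per-view absolute predictions into a single pose, assuming known (or estimated) extrinsics and a plain average over views; the actual consensus-based algorithm may weight views differently, so this is an illustrative simplification:

```python
import numpy as np

def merge_views(poses, extrinsics):
    """Merge absolute 3D poses predicted independently in each
    camera frame by mapping them to the world frame and averaging.

    poses:      list of (J, 3) arrays, one per camera, in that
                camera's coordinates.
    extrinsics: list of (R, t) pairs with X_cam = R @ X_world + t.
    """
    world_poses = []
    for pose, (R, t) in zip(poses, extrinsics):
        # Invert the rigid transform; R^-1 = R^T for rotations.
        world_poses.append((pose - t) @ R)
    return np.mean(world_poses, axis=0)
```

With perfect predictions and extrinsics the merged pose equals the true world pose; in practice the averaging attenuates per-view errors, particularly on occluded joints.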

In Table 6 we present our results on both 3D pose estimation and absolute root position error, with estimated and ground-truth camera parameters. We use several combinations of cameras in order to show the influence of the number of views. As we can see, each additional camera lowers the error by about 5mm on average, which is significant on Human3.6M. We can also notice that our consensus optimization approach provides highly precise estimates, even under uncalibrated conditions.
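The per-view reduction can be checked directly from the ground-truth-camera MPJPE values in Table 6, taking the best configuration for each number of views:

```python
# Best MPJPE (mm) with ground-truth cameras per number of views (Table 6).
mpjpe = {1: 50.5, 2: 45.7, 3: 41.8, 4: 36.9}

# Average reduction per additional camera, going from 1 to 4 views.
avg_drop = (mpjpe[1] - mpjpe[4]) / (4 - 1)
print(round(avg_drop, 2))  # prints 4.53, i.e. about 5mm per added view
```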

Additionally, Fig. 3 shows an example with highly occluded body parts where predictions from multiple cameras result in a significantly better reconstruction. Note that in this case we project the estimated absolute 3D pose to a new point of view, not used during inference. Despite the heavily occluded joints in some views, the resulting absolute pose is very coherent and shows a reduced shift when our consensus-based algorithm is used.

Method               GT camera        Estimated camera
                     MPJPE / MRPE     MPJPE / MRPE
Monocular            50.5 / 80.1      —
Monocular + h.flip   49.2 / 79.9      —
Cameras 1,2          45.7 / 73.3      52.2 / 167.0
Cameras 1,4          46.2 / 74.9      59.0 / 171.0
Cameras 1,2,3        41.8 / 57.4      47.9 / 143.8
Cameras 1,2,3,4      36.9 / 51.0      44.7 / 130.7
Table 6: Results of our method on 3D human pose estimation and on root joint absolute error (MPJPE / MRPE) considering single and multi-view with different camera combinations.

Finally, in Fig. 4 we present some qualitative results of predicted absolute 3D poses by our method.

Figure 4: Absolute 3D pose predictions from monocular single images by our method.

5 Conclusions

In this paper, we have proposed a new method for predicting 3D human poses in absolute camera coordinates, together with a new algorithm for optimizing multi-view predictions. We show that, by casting the problem into a new perspective, we can benefit from training with 2D and 3D data interchangeably, while performing 3D predictions more effectively. These improvements boost monocular 3D pose estimation significantly. As a further consequence of absolute prediction, we show that multi-view estimation can be easily performed from multiple absolute monocular estimations, resulting in much higher precision than previous methods in the literature, even with multiple uncalibrated images.


  • [1] A. Agarwal and B. Triggs (2006-01) Recovering 3d human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (1), pp. 44–58. External Links: ISSN 0162-8828 Cited by: §2.1.
  • [2] S. Amin, M. Andriluka, M. Rohrbach, and B. Schiele (2013) Multi-view pictorial structures for 3d human pose estimation.. In BMVC, Cited by: §1, §2.3.
  • [3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014-06) 2D human pose estimation: new benchmark and state of the art analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.3.
  • [4] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic (2014) 3D pictorial structures for multiple human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1669–1676. Cited by: §1, §2.3.
  • [5] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic (2016-10) 3D pictorial structures revisited: multiple human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 38 (10), pp. 1929–1942. External Links: ISSN 0162-8828, Link, Document Cited by: §2.3.
  • [6] M. Burenius, J. Sullivan, and S. Carlsson (2013) 3D pictorial structures for multiple view articulated pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3618–3625. Cited by: §2.3.
  • [7] C. Chen and D. Ramanan (2017-07) 3D human pose estimation = 2d pose estimation + matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, §2.1.
  • [8] W. Chen, Z. Fu, D. Yang, and J. Deng (2016) Single-image depth perception in the wild. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 730–738. External Links: Link Cited by: §1, §2.2.
  • [9] T. v. Dijk and G. d. Croon (2019-10) How do neural networks see depth in single images?. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2, §3.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2366–2374. External Links: Link Cited by: §1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • [12] M. Hofmann and D. M. Gavrila (2012-01-01) Multi-view 3d human pose estimation in complex environment. International Journal of Computer Vision 96 (1), pp. 103–124. External Links: Document Cited by: §2.3.
  • [13] C. Ionescu, F. Li, and C. Sminchisescu (2011-11) Latent structured models for human pose estimation. In International Conference on Computer Vision (ICCV), pp. 2220–2227. External Links: ISSN 1550-5499 Cited by: §2.1.
  • [14] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014-07) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI 36 (7), pp. 1325–1339. Cited by: §4.1, §4.1.
  • [15] M. Kocabas, S. Karagoz, and E. Akbas (2019-06) Self-supervised learning of 3d human pose using multi-view geometry. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 3.
  • [16] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pp. 239–248. Cited by: §1.
  • [17] K. Lee, I. Lee, and S. Lee (2018-09) Propagating lstm: 3d pose estimation based on joint interdependency. In The European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [18] S. Li, W. Zhang, and A. B. Chan (2015-12) Maximum-margin structured learning with deep networks for 3d human pose estimation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1.
  • [19] D. C. Luvizon, D. Picard, and H. Tabia (2018-06) 2D/3d pose estimation and action recognition using multitask deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [20] D. C. Luvizon, H. Tabia, and D. Picard (2019) Human pose regression by combining indirect part detection and contextual information. Computers and Graphics 85, pp. 15 – 22. External Links: ISSN 0097-8493, Document Cited by: §3.2.1.
  • [21] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: §2.1, §2.1.
  • [22] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on, External Links: Link Cited by: §4.1, §4.2, Table 3.
  • [23] G. Moon, J. Y. Chang, and K. M. Lee (2019-10) Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2, §4.2.
  • [24] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In ECCV, Cited by: §1.
  • [25] A. Newell, K. Yang, and J. Deng (2016) Stacked Hourglass Networks for Human Pose Estimation. European Conference on Computer Vision (ECCV), pp. 483–499. Cited by: §3.1.
  • [26] B. X. Nie, P. Wei, and S. Zhu (2017) Monocular 3d human pose estimation by predicting depth on joints. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3467–3475. Cited by: §2.2.
  • [27] J. C. Núñez, R. Cabido, J. F. Vélez, A. S. Montemayor, and J. J. Pantrigo (2019) Multiview 3d human pose estimation using improved least-squares and lstm networks. Neurocomputing 323, pp. 335 – 343. External Links: ISSN 0925-2312, Document, Link Cited by: §2.3.
  • [28] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [29] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Harvesting multiple views for marker-less 3d human pose annotations. In CVPR, Cited by: §2.3, Table 2.
  • [30] A. Popa, M. Zanfir, and C. Sminchisescu (2017) Deep multitask architecture for integrated 2d and 3d human sensing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4714–4723. Cited by: Table 1.
  • [31] M. Rayat Imtiaz Hossain and J. J. Little (2018-09) Exploiting temporal information for 3d human pose estimation. In The European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [32] J. Redmon and A. Farhadi (2017-07) YOLO9000: better, faster, stronger. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.
  • [33] H. Rhodin, J. Spörri, I. Katircioglu, V. Constantin, F. Meyer, E. Müller, M. Salzmann, and P. Fua (2018) Learning monocular 3d human pose estimation from multi-view images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8437–8446. Cited by: §2.3.
  • [34] G. Rogez, P. Weinzaepfel, and C. Schmid (2017-06) LCR-Net: Localization-Classification-Regression for Human Pose. In Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: Table 3.
  • [35] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris (2016) 3D human pose estimation: a review of the literature and analysis of covariates. Computer Vision and Image Understanding 152, pp. 1 – 20. External Links: ISSN 1077-3142, Document, Link Cited by: §2.
  • [36] Y. Shi, X. Han, N. Jiang, K. Zhou, K. Jia, and J. Lu (2018) FBI-pose: towards bridging the gap between 2d images and 3d human poses using forward-or-backward information. External Links: 1806.09241 Cited by: Table 1.
  • [37] X. Sun, J. Shang, S. Liang, and Y. Wei (2017-10) Compositional human pose regression. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • [38] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018-09) Integral human pose regression. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.1, §3.2.1.
  • [39] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua (2016) Structured prediction of 3d human pose with deep neural networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, Cited by: §2.1.
  • [40] D. Tome, C. Russell, and L. Agapito (2017-07) Lifting from the deep: convolutional 3d pose estimation from a single image. In CVPR, Cited by: §1, §2.1, §2.1.
  • [41] M. Trumble, A. Gilbert, C. Malleson, A. Hilton, and J. Collomosse (2017) Total capture: 3d human pose estimation fusing video and inertial sensors. In 2017 British Machine Vision Conference (BMVC), Cited by: Table 2.
  • [42] M. Trumble, A. Gilbert, A. Hilton, and J. Collomosse (2018) Deep autoencoder for combined human pose estimation and body model upscaling. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: Table 2.
  • [43] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang (2018-06) 3D human pose estimation in the wild by adversarial learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [44] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) LIFT: Learned Invariant Feature Transform. European Conference on Computer Vision (ECCV). Cited by: §3.2.1.
  • [45] A. Zanfir, E. Marinoiu, and C. Sminchisescu (2018-06) Monocular 3d pose and shape estimation of multiple people in natural scenes - the importance of multiple scene constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
  • [46] A. Zanfir, E. Marinoiu, M. Zanfir, A. Popa, and C. Sminchisescu (2018) Deep network for the integrated 3d sensing of multiple people in natural images. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8410–8419. External Links: Link Cited by: Table 1.
  • [47] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis (2016-06) Sparseness meets deepness: 3d human pose estimation from monocular video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2.
  • [48] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei (2017-10) Towards 3d human pose estimation in the wild: a weakly-supervised approach. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1, §2.1, Table 3.
  • [49] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei (2016) Deep kinematic pose regression. Computer Vision ECCV 2016 Workshops. Cited by: §2.1.
  • [50] H. Zou and T. Hastie (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67, pp. 301–320. Cited by: §4.3.