DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare

Yuanlu Xu   Song-Chun Zhu   Tony Tung
Facebook Reality Labs, Sausalito, USA  University of California, Los Angeles, USA
merayxu@gmail.com, sczhu@stat.ucla.edu, tony.tung@fb.com
Abstract

We present DenseRaC, a novel end-to-end framework for jointly estimating 3D human pose and body shape from a monocular RGB image. Our two-step framework takes the body pixel-to-surface correspondence map (i.e., IUV map) as a proxy representation and then estimates parameterized human pose and shape. Specifically, given an estimated IUV map, we develop a deep neural network that optimizes 3D body reconstruction losses and further integrates a render-and-compare scheme to minimize differences between the input and the rendered output, i.e., dense body landmarks, body part masks, and adversarial priors. To boost learning, we further construct a large-scale synthetic dataset (MOCA) utilizing web-crawled Mocap sequences, 3D scans and animations. The generated data covers diversified camera views, human actions and body shapes, and is paired with full ground truth. Our model jointly learns to represent the 3D human body from hybrid datasets, mitigating the problem of unpaired training data. Our experiments show that DenseRaC obtains superior performance over the state of the art on public benchmarks for various human-related tasks.

1 Introduction

Though much progress has been made in human pose estimation, body segmentation and action recognition, lifting such estimations into the 3D world remains underexplored, due to the difficulty of data acquisition, ambiguities in monocular inputs and nuisances in natural images (e.g., illumination, occlusion, texture). Existing learning-based methods [22, 39, 55] rely heavily on sparse 2D/3D landmarks (i.e., skeleton joints), body part masks or silhouettes. However, recovering 3D human pose and body shape from such limited information is ambiguous.

Figure 1: DenseRaC estimates 3D human poses and body shapes given people-in-the-wild images. The proposed framework handles scenarios with multiple people, all genders, and various clothing in real time. Here, we show results on Internet images [16].

In this paper, we propose DenseRaC, a new framework for 3D human pose and body shape estimation from a monocular RGB image, as illustrated in Fig. 2:

•   The task is solved in a two-step framework, first by estimating pixel-to-surface correspondences (i.e., IUV images) from the RGB inputs, and then by lifting the estimated IUV images to 3D human pose and body shape.

•   A parametric human pose and body shape representation is integrated into the forward pass and backward propagation, inspired by recent work [22, 39].

•   An IUV image based dense render-and-compare scheme is incorporated into the framework. We minimize 3D reconstruction errors as well as discrepancies between inputs and rendered images from estimated outputs.

We learn the proposed model with both unpaired and paired data, compatible with different levels of supervision. The end-to-end training minimizes multiple losses defined jointly on human pose and body shape, including parameter regression, 3D reconstruction, landmark reprojection and body part segmentation losses, as well as an adversarial loss on impossible configurations (see Sec. 3.3).

To boost learning, we further construct a large-scale synthetic dataset covering diversified human poses and body shapes. The synthetic data is generated using web-crawled 3D animations and scanned all-gender body shapes from anthropometric studies, and is rendered from various camera views (see Sec. 4). Learning from synthetic data mitigates the problem of unpaired, partially paired, or inaccurately annotated training data in popular public people-in-the-wild and Mocap benchmarks, and improves the model's robustness to varied camera views and occlusions.

Figure 2: Illustration of DenseRaC. Our two-step framework uses pixel-to-surface correspondences of the human body as the intermediate representation, fed either with estimations on realistic images from DensePose-RCNN or with rendered images of synthetic 3D humans. Given IUV images, we develop a deep neural network conducting parametric pose and shape regression and a differentiable renderer performing render-and-compare. The proposed framework optimizes losses on 3D reconstruction and on discrepancies between inputs and rendered outputs by end-to-end learning.

In our experiments, we evaluate DenseRaC on three tasks: 3D pose estimation, semantic body segmentation and 3D body reconstruction. Qualitative and quantitative experimental results show DenseRaC outperforms existing methods on both public benchmarks and the newly proposed synthetic dataset (see Sec. 5).

To the best of our knowledge, this is the first end-to-end framework that introduces a pixel-to-surface correspondence map as the intermediate representation, together with a corresponding dense render-and-compare scheme, for learning 3D human pose and body shape. We believe DenseRaC shows great potential for numerous real-world applications in surveillance, entertainment, AR/VR, etc. Some featured results are shown in Fig. 1.

2 Related Work

The proposed method is mainly related to research in three fields.

Monocular 3D pose estimation is a longstanding problem in computer vision. Current approaches train deep networks on large-scale training sets to regress 3D human joint transformations [18, 27]. Deep neural network architectures enable direct body localization with pose prediction, an advantage over traditional model-based methods that require good initialization [3, 26]. Several methods predict 3D pose directly from monocular data [52, 41, 50, 38, 32, 15, 19, 47]. On the other hand, many approaches lift 2D human poses [7, 4], used as an intermediate representation, and learn a model for 2D-3D pose space mapping [61, 62, 63, 34, 8]. The state of the art in this line of work obtains strong performance on popular benchmarks limited to instrumented laboratory environments, yet shows unsatisfactory results on in-the-wild images. Another common issue is that most existing methods do not incorporate a physically plausible human skeleton model and lack constraints on the estimated results, which requires extra post-processing for graphics-related applications.

3D human body reconstruction aims at recovering the full 3D mesh of the human body from single RGB images or video sequences, rather than only the major 3D skeleton joints. For example, Zuffi et al. [64] integrated a realistic body model with part-based graphical models [58, 57, 59], jointly emphasizing graphics-like models of human body shape and part-based human pose inference. In [31, 3, 26, 53], a skinned body model (SMPL) is used to formulate body shape as a linear function of a deformation basis (i.e., with blend shapes). In [51, 42, 22, 39], SMPL is adopted as the parametric representation of the 3D human body and DNNs are developed to estimate its parameters end-to-end. Guler et al. [12, 11] build an FCN for human shape estimation by learning dense image-to-template correspondences. Other work [6, 55, 20] focuses on reconstructing 3D body shapes from RGB or RGBD images and does not directly estimate 3D human pose and body shape. These approaches are also suitable for multi-view video capture setups [35, 54]. In this paper, we use a SMPL variant as the parametric representation of the 3D human body and further develop a render-and-compare framework based on dense pixel-to-surface correspondences.

Learning from synthetic humans. Modeling 3D humans in arbitrary scenes requires representative training sets. A number of previous works have considered automatically generating data to assist 3D models, e.g., upper bodies [40] and full-body silhouettes [1]. Hattori et al. [13] artificially render pedestrians in a scene while leveraging camera parameters and geometric layout, and further train a scene-specific pedestrian detector. In [44], real 2D pose samples are reshaped by adding small perturbations and augmented with different backgrounds. Rogez et al. [49], for a given 3D pose, combine local image patches from several images under kinematic constraints to create a new synthetic image. Rahmani et al. [46] fit synthetic 3D human models to Mocap skeletons and render human poses from numerous virtual viewpoints. Varol et al. [56] also generate a synthetic human body dataset with random factors (e.g., pose, shape, texture, background). These datasets cannot solely serve to train models that generalize to real data, due to the gap between synthesized and realistic images. In this paper, we propose to use pixel-to-surface correspondence maps to bridge this gap. Joint training on hybrid datasets proves effective in improving performance on realistic data. To the best of our knowledge, we are the first to address joint human pose and body shape estimation using such training modalities.

3 DenseRaC Framework

Figure 3: Illustration of mapping from pixel to 3D surface. Our framework estimates an IUV image and dense 3D landmarks from an RGB input, whose pixels refer to 3D points on the body model.

As illustrated in Fig. 2, the proposed framework estimates 3D human poses and body shapes in two steps: first obtaining pixel-to-surface correspondences (i.e., IUV images) and then lifting these intermediate IUV images to 3D surfaces. There are two sources of IUV inputs: i) estimations from RGB inputs using a pre-trained DensePose-RCNN [11], and ii) IUV images rendered from synthetic data.

Instead of directly estimating 3D point clouds, voxels or depth maps, our framework employs a compact and expressive 3D human body model parameterized by a 3D human pose and a body shape. The 3D human pose is represented as a tree structure with 58 relative 3D rotations between parent and child joints, while the body shape is represented by 50 shape coefficients, as elaborated in Sec. 3.5.

3.1 Network Architecture

Given IUV inputs, we design a network architecture consisting of three modules:

•   A generator composed of a base network backbone (i.e., ResNet-50 [14]) that extracts expressive feature maps, and a regressor that takes the flattened feature maps (i.e., a 2048D feature vector) from the base network as input and estimates 3D human body parameters and camera parameters (i.e., a 227D concatenated vector). The camera model is assumed to be an orthographic projection, parameterized by a scale factor and a camera translation. The regressor is composed of 3 fully connected layers with 1024 nodes each. Inspired by [22], the regressor models an iterative update towards the final output, starting from the parameter mean (see the sketch after this list). The weights are shared across all three layers, simulating the recursive tree structure within the 3D human pose.

•   A differentiable renderer that creates 2D projections of the reconstructed 3D human body mesh, using the estimated camera parameters (see Sec. 3.3). We implement a differentiable rasterizer which creates an IUV image suitable for gradient flow. Following a render-and-compare scheme, we define three losses to measure and minimize the differences between the input IUV image and the IUV image rendered from our model output.

•   A discriminator that penalizes impossible configurations for unpaired data. We design two shallow networks, each with two fully connected layers, as the discriminator: one discriminates 3D human poses and the other body shapes. The numbers of nodes in the pose and shape sub-networks are set to 512 and 64, respectively.
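
To make the iterative regression concrete, the following minimal sketch (in NumPy, with illustrative dimensions and a toy linear mapping in place of the actual fully connected layers) shows how a shared-weight regressor repeatedly refines the parameter estimate starting from the parameter mean; the names and random weights below are ours, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the iterative parameter regression used by the generator:
# starting from the parameter mean, a shared regressor repeatedly predicts a
# correction conditioned on [image feature, current estimate].
# Dimensions and the toy linear "regressor" are illustrative only.

FEAT_DIM, PARAM_DIM, N_ITERS = 2048, 227, 3   # 227 = pose + shape + camera

rng = np.random.default_rng(0)
W = rng.normal(0, 1e-3, size=(FEAT_DIM + PARAM_DIM, PARAM_DIM))  # shared weights

def regress_params(feat, param_mean, n_iters=N_ITERS):
    """Iteratively refine parameters; the same weights W are reused each step."""
    params = param_mean.copy()
    for _ in range(n_iters):
        x = np.concatenate([feat, params])        # condition on current estimate
        delta = x @ W                             # predicted correction
        params = params + delta                   # additive update
    return params

feat = rng.normal(size=FEAT_DIM)                  # e.g., a ResNet-50 global feature
param_mean = np.zeros(PARAM_DIM)                  # mean pose/shape/camera parameters
print(regress_params(feat, param_mean).shape)     # (227,)
```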

3.2 IUV as Proxy Representation

As illustrated in Fig. 3, we utilize the IUV image as a proxy representation. An IUV map, similar to a UV map in graphics, defines one-to-one pixel-to-surface correspondences from the 2D image to the 3D surface mesh. Each pixel of an IUV image carries a body part index I and (U, V) coordinates that map to a unique point on the body model surface (see Sec. 3.5).
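
The following minimal sketch illustrates this pixel-to-surface mapping under simplifying assumptions: per-part lookup tables of vertex (U, V) coordinates and 3D positions stand in for the actual body model, and the nearest (U, V) entry is returned as the corresponding surface point.

```python
import numpy as np

# Minimal sketch of the pixel-to-surface mapping: each IUV pixel (I, U, V)
# indexes a body part I and continuous (U, V) coordinates inside that part,
# which identify a unique point on the template surface. The per-part vertex
# tables below are random placeholders standing in for the body model.

rng = np.random.default_rng(0)
N_PARTS = 24
part_vertex_uv = {p: rng.random((300, 2)) for p in range(1, N_PARTS + 1)}   # (u, v) per vertex
part_vertex_xyz = {p: rng.random((300, 3)) for p in range(1, N_PARTS + 1)}  # 3D rest positions

def pixel_to_surface(i, u, v):
    """Return the 3D template point corresponding to an IUV pixel (nearest (u, v))."""
    uv = part_vertex_uv[i]
    idx = np.argmin(np.sum((uv - np.array([u, v])) ** 2, axis=1))
    return part_vertex_xyz[i][idx]

print(pixel_to_surface(3, 0.42, 0.77))  # a 3D point on body part 3
```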

As also discussed in [39], an RGB input contains much more information about the human target than the 2D joints, silhouettes, or body part masks traditionally used as proxy representations. However, information such as appearance, illumination or clothing may not be relevant for inferring the 3D geometry, and may even cause the network to overfit to nuisance factors. Similar to [39], we also observe that explicit body part representations are more useful for the task of 3D human pose and body shape estimation than RGB images and plain silhouettes. Better part segmentation produces better 3D reconstruction accuracy, while providing full spatial coverage of the person (compared to joint heatmaps). While increasing the number of segmentation parts above a certain threshold (12) only incrementally improves 3D pose prediction accuracy, it greatly improves body shape estimation (see Sec. 5). We argue that prior work only estimates an average body shape.

Note that we use two sources of IUV images as inputs, i.e., IUV images estimated from realistic images by [11] and IUV images of virtual humans synthesized by our renderer (see Sec. 3.3). The IUV estimation could also be obtained by other off-the-shelf models or by two-stage/end-to-end training. Both inputs go through our neural network model and are used to estimate 3D human pose and body shape parameters. There are thus several benefits to the IUV image representation: i) improved robustness against nuisances of lighting and texture in natural images, ii) richer geometric information about the 3D human body (body part masks and dense landmarks), and iii) a unified representation of realistic and synthetic data for joint learning.

3.3 Dense Render-and-Compare

In this paper, 3D human pose and body shape are represented compactly by a parametric model (see Sec. 3.5). The parameterized 3D human body is inferred and fitted to the input image, together with camera parameters. The human body surface is represented as a 3D triangular mesh, and body posing is obtained by standard linear blend skinning. To fully compare a reconstructed 3D human body to a 2D observation of it, we integrate a differentiable renderer, i.e., a computer graphics technique that creates a 2D image from a 3D object using differentiable operations [29, 23], and develop an end-to-end weakly-supervised training scheme.

Rendering consists of projecting the 3D vertices of a mesh onto a 2D image plane and rasterizing it (i.e., sampling the faces). The 3D-to-2D projection is obtained by a combination of differentiable transformations [33]. Rasterization is a discrete operation that requires a gradient definition to allow back-propagation in a neural network. In [29], the authors approximate derivatives at occlusion boundaries, which are discontinuous, while colors are interpolated between vertices (i.e., there is no differentiation with respect to texture). In [23], the authors obtain approximate gradients by blurring the image to avoid sudden pixel color changes. This produces non-zero gradients and enables gradient flow from pixel (color) values to vertex positions. However, lighting and material properties in natural images are complex to model and integrate into neural networks.

In contrast, our IUV representation is invariant to background, lighting conditions and surface texture such as clothing (see Sec. 3.2). In addition, UV values within each body part I are continuous with respect to neighboring pixels (see Fig. 3). This allows gradients to be computed naturally on the mesh surface and at boundaries, and back-propagated through the network layers.

Our renderer creates an IUV image comparable to the output of [11] (see Fig. 4). Self-occlusion is handled by depth buffering: our rasterizer draws, at each pixel, only the surface face closest to the camera (and facing it). During back-propagation, we only pass gradients for pixels corresponding to visible regions.
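
The sketch below illustrates the depth-buffering and visibility logic with an orthographic camera; it splats per-vertex IUV values rather than sampling triangle faces, so it is a simplified stand-in for the actual rasterizer, and all inputs are random placeholders.

```python
import numpy as np

# Minimal sketch of depth buffering for IUV rendering: project vertices with an
# orthographic camera, splat each vertex's (I, U, V) into the image, and keep
# only the point closest to the camera at every pixel. The resulting visibility
# mask is what gates the gradient flow during back-propagation.

def render_iuv(verts, iuv, scale, trans, H=224, W=224):
    """verts: (N, 3) camera-space vertices; iuv: (N, 3) per-vertex (I, U, V)."""
    xy = scale * verts[:, :2] + trans                        # orthographic projection
    px = ((xy + 1) * 0.5 * np.array([W - 1, H - 1])).astype(int)
    depth = verts[:, 2]

    image = np.zeros((H, W, 3))
    zbuf = np.full((H, W), np.inf)
    visible = np.zeros(len(verts), dtype=bool)
    for k, (x, y) in enumerate(px):                          # pass 1: depth buffer
        if 0 <= x < W and 0 <= y < H:
            zbuf[y, x] = min(zbuf[y, x], depth[k])
    for k, (x, y) in enumerate(px):                          # pass 2: closest point wins
        if 0 <= x < W and 0 <= y < H and depth[k] <= zbuf[y, x]:
            image[y, x] = iuv[k]
            visible[k] = True
    return image, visible

rng = np.random.default_rng(0)
verts = rng.uniform(-1, 1, (1000, 3))
iuv = rng.random((1000, 3))
img, vis = render_iuv(verts, iuv, scale=0.8, trans=np.zeros(2))
print(img.shape, int(vis.sum()))
```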

Different from [53, 24] where render-and-compare losses are computed upon silhouettes and 2D depth maps, we compute dense render-and-compare losses using IUV values between ground-truth IUV images and rendered ones (see Sec. 3.4). The differentiable renderer (including IUV rasterizer) and losses are implemented with differentiable operations using a neural net framework with automatic differentiation [5, 53, 24].

3.4 Loss Terms

Our model integrates a dense render-and-compare module with corresponding loss computations in the backward propagation, thereby extending previous methods [42, 22, 39, 55]. The loss function is defined as

$\mathcal{L} = \mathcal{L}_{rac} + \mathbb{1}_{3d}\,\mathcal{L}_{3d} + \mathbb{1}_{para}\,\mathcal{L}_{para}$,   (1)

where $\mathbb{1}_{(\cdot)}$ indicates the existence of the corresponding annotation, and $\mathcal{L}_{rac}$, $\mathcal{L}_{3d}$ and $\mathcal{L}_{para}$ denote the render-and-compare loss, the 3D reconstruction loss and the parameter regression loss, respectively.

•   Render-and-Compare Loss is evaluated under three measurements, that is,

$\mathcal{L}_{rac} = \mathcal{L}_{lm} + \mathcal{L}_{mask} + \mathcal{L}_{adv}$,   (2)

where $\mathcal{L}_{lm}$, $\mathcal{L}_{mask}$ and $\mathcal{L}_{adv}$ denote the landmark reprojection loss, the part mask loss and the adversarial loss, respectively.

Landmark Reprojection Loss measures displacement between ground truth and estimated dense 2D landmarks:

$\mathcal{L}_{lm} = \sum_{i=1}^{N} v_i \, \| x_i - \hat{x}_i \|^2$,   (3)

where $v_i$ indicates the visibility (1 if visible, 0 otherwise) of the $i$-th 2D landmark ($N$ in total), and $x_i$ and $\hat{x}_i$ represent the $i$-th 2D landmark from the ground truth and from the 3D mesh reprojection, respectively. To correctly localize the landmarks from the ground truth (i.e., the IUV image estimated by DensePose [11]), we formulate this as a point-to-point greedy matching problem and solve the correspondences by k-Nearest Neighbor (k-NN) search. Specifically, we first create a k-D tree over the IUV values of the 3D body mesh vertices. For any input IUV image, we search for the 1-NN of each visible pixel and obtain a matched pair with the closest 3D body mesh vertex within a distance threshold. Empirically, this yields 100-300 matching pairs, considered as near-optimal one-to-one 2D/3D dense landmark correspondences. This serves as a weakly-supervised scaffold to densely fit the 3D human body to the re-projected 2D image. Note that the matching is computed offline and serves as a pre-processing step on the IUV inputs, as shown in Fig. 5.
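
A minimal sketch of this offline matching step is given below; it assumes SciPy's `cKDTree`, a placeholder vertex IUV table and an illustrative threshold value rather than the one used in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

# Minimal sketch of the offline pixel-to-vertex matching: build a k-D tree over
# the per-vertex IUV values of the body mesh, then match each visible pixel of
# an input IUV image to its nearest vertex within a distance threshold.

rng = np.random.default_rng(0)
vertex_iuv = rng.random((7324, 3))          # placeholder (part index scaled to [0,1], u, v) per vertex
tree = cKDTree(vertex_iuv)

iuv_image = rng.random((224, 224, 3))
mask = iuv_image[..., 0] > 0.5              # "visible" foreground pixels (placeholder)
pixels = np.argwhere(mask)
pixel_iuv = iuv_image[mask]

THRESHOLD = 0.05                            # illustrative value, not the paper's
dist, idx = tree.query(pixel_iuv, k=1, distance_upper_bound=THRESHOLD)
valid = np.isfinite(dist)                   # pairs within the threshold
matches = list(zip(map(tuple, pixels[valid]), idx[valid]))   # (pixel, vertex id)
print(len(matches), "dense 2D/3D landmark correspondences")
```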

Part Mask Loss provides semantic information about the location of body parts:

$\mathcal{L}_{mask} = \sum_{p} \big(1 - \mathrm{IoU}(M_p, \hat{M}_p)\big)$,   (4)

where $p$ is the body part index and $\mathrm{IoU}(M_p, \hat{M}_p)$ represents the intersection over union of the ground-truth and rendered masks of part $p$. We keep the same body segments (12 parts) and mapping as specified in [11].
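
The sketch below shows one plausible way to compute such an IoU-based part mask penalty with soft masks; the exact formulation and weighting in the paper may differ, and the masks here are random placeholders.

```python
import numpy as np

# Minimal sketch of the part mask loss: for each body part, compute a (soft)
# intersection-over-union between the ground-truth part mask and the rendered
# one, and penalize 1 - IoU.

def part_mask_loss(gt_masks, pred_masks, eps=1e-6):
    """gt_masks, pred_masks: (P, H, W) arrays with values in [0, 1]."""
    inter = np.sum(gt_masks * pred_masks, axis=(1, 2))
    union = np.sum(gt_masks + pred_masks - gt_masks * pred_masks, axis=(1, 2))
    iou = inter / (union + eps)
    return np.sum(1.0 - iou)

rng = np.random.default_rng(0)
gt = (rng.random((12, 224, 224)) > 0.7).astype(float)    # 12 body parts, as in Sec. 3.4
pred = rng.random((12, 224, 224))
print(part_mask_loss(gt, pred))
```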

Adversarial Loss constrains configuration plausibility. Unlike [22], which uses unpaired or MoSh-based [30] weakly-supervised SMPL annotations, we use ground-truth 3D human poses and body shapes from our synthetic dataset, which contains much larger action variation than most Mocap sequences (see Sec. 4). We believe such long-tail poses are crucial for the adversarial loss to find the decision boundary. In this way, millions of synthetic samples serve both as paired ground truth and as an unpaired adversarial prior for realistic datasets. We follow the GAN loss definitions in [9] and train our generator and discriminator jointly.

•   3D Reconstruction Loss measures the deviation of the reconstructed 3D human body from the ground truth:

$\mathcal{L}_{3d} = \sum_{j} \| X_j - \hat{X}_j \|^2$,   (5)

where $X_j$ and $\hat{X}_j$ represent the $j$-th 3D keypoint position from the input and from the generated 3D mesh, respectively.

•   Parameter Regression Loss measures the mean square error between the estimated parameters and the ground truth:

$\mathcal{L}_{para} = \big\| [R(\theta), \beta] - [R(\hat{\theta}), \hat{\beta}] \big\|^2$,   (6)

where $R(\theta)$ denotes the rotation matrices corresponding to the pose parameters $\theta$, and $\beta$ denotes the shape coefficients. Notably, pose parameters are first transformed into rotation matrices; losses are computed on these matrices and gradients are automatically back-propagated. This avoids the singularity problem of XYZ-Euler-based 3D rotations and requires no extra constraints on the rotation matrices, that is,

$R(\theta)\,R(\theta)^{\top} = I, \quad \det\!\big(R(\theta)\big) = 1$,   (7)

where $\det(\cdot)$ denotes the matrix determinant.
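
For illustration, the sketch below converts an axis-angle rotation (our assumption for the relative joint parameterization) to a rotation matrix via the Rodrigues formula and verifies that the properties in Eq. (7) hold by construction; the loss on rotation matrices is then a simple squared difference.

```python
import numpy as np

# Minimal sketch of the rotation handling in the regression loss: convert a
# relative joint rotation (assumed axis-angle here) to a rotation matrix, then
# check the Eq. (7) properties, R R^T = I and det(R) = 1, which hold by
# construction and therefore need no explicit constraint.

def axis_angle_to_matrix(r):
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    k = r / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])                 # cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

R = axis_angle_to_matrix(np.array([0.3, -0.2, 0.5]))
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # True True

# Pose loss on rotation matrices instead of Euler angles (ground truth vs. estimate).
R_gt = axis_angle_to_matrix(np.array([0.25, -0.15, 0.55]))
print(np.mean((R - R_gt) ** 2))                      # per-joint squared error
```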

Figure 4: IUV images from MOCA generated by rasterizing 3D bodies obtained with 3D poses from Mixamo and body shapes from CAESAR. MOCA contains 2M+ images with fully paired ground truth.

3.5 Human Body Model

We use a body shape model similar to SMPL [31, 3]. The statistical body model is obtained by PCA on pose-normalized 3D models of real humans, obtained by non-rigid registration of a body template to 3D scans of the CAESAR dataset (http://store.sae.org/caesar/), which represents the anthropometric variability of 4,400 men and women. The body template mesh has 7,324 vertices, 14,644 triangular faces and a skeletal rig with body and hand joints.

Our model is trained with all 3D scans in the dataset, resulting in a statistical model that can describe bodies from unseen in-the-wild images regardless of gender. An arbitrary body shape can then be described by a set of shape coefficients (i.e., shape parameters or shape blend shapes) using a linear representation. Truncating the shape coefficients to 50 principal components enables reconstruction of all-gender body shapes without noticeable distortion: e.g., the SMPL-Male model with 10 coefficients does not reconstruct female shapes well (RMSE = 9.9mm), while an all-gender model does (RMSE = 6.3/3.4mm with 10/50 coefficients, respectively).
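
A minimal sketch of this linear shape model is shown below; the template mesh and PCA basis are random stand-ins for the statistical model learned from the CAESAR scans.

```python
import numpy as np

# Minimal sketch of the linear shape model: an arbitrary body shape is the
# template mesh plus a linear combination of PCA shape blend shapes.

N_VERTS, N_COEFFS = 7324, 50
rng = np.random.default_rng(0)
v_template = rng.random((N_VERTS, 3))                        # mean (template) mesh, placeholder
shape_basis = rng.normal(0, 0.01, (N_COEFFS, N_VERTS, 3))    # PCA blend shapes, placeholder

def shape_from_coeffs(beta):
    """Reconstruct a rest-pose mesh from 50 shape coefficients."""
    return v_template + np.tensordot(beta, shape_basis, axes=1)

beta = rng.normal(0, 1, N_COEFFS)
print(shape_from_coeffs(beta).shape)                         # (7324, 3)
```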

Considering potential applications in AR/VR and 3D animation, and to better utilize annotations, we enrich the standard 24-joint SMPL skeleton with 28 joints for modeling the fingers and 5 more joints on the spine and head for better flexibility. We further add a root node for global translation and rotation, leading to a skeleton with 58 joints.

4 MOCA Synthetic Dataset

The literature provides several datasets to evaluate 3D human pose (e.g., H3.6M [18], MPI-INF-3DHP [36]), but only a few for joint 3D pose and body shape (e.g., SURREAL [56] and UP-3D [26]). However, SURREAL is dedicated to body segmentation and depth estimation and only has a coarse skeleton (24 major body joints), while UP-3D has weakly-supervised shapes (from SMPL fitted to LSP and MPII), which are arguably imprecise [55].

Hence, we propose MOCA, a large-scale synthetic dataset with 2,089,104 images containing ground-truth body shapes and 3D poses, as shown in Fig. 4. For diverse human poses and actions, we turn to a popular collection of 3D human animations (i.e., Mixamo, http://www.mixamo.com), whose sources mainly come from Mocap systems and artist designs. We implement a web crawler to fetch high-fidelity animations. Notably, Mixamo supports tuning parameters (e.g., limb length, energy, overdrive) for each action sequence to generate variants. As we observe that certain parameter settings may introduce artifacts, we keep the default settings for all sequences. We collect a set of 2,446 3D animation sequences with 261,138 frames at 30 fps, covering a wide range of action categories in sports, combat, daily and social activities. We extract a finer 3D skeleton with fingers and facial bones using Maya and re-map those joints onto our body model.

We then generate 2,781 bodies using the 3D scans from the CAESAR dataset and compute the corresponding (PCA) shape coefficients. By combining a 3D pose and a body shape, we pose body models into specific pose-and-shape configurations by standard linear blend skinning.
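
The sketch below illustrates standard linear blend skinning with placeholder skinning weights and identity joint transforms; it is meant to show the blending formula, not the actual rigging of our body model.

```python
import numpy as np

# Minimal sketch of linear blend skinning: each vertex is a weighted sum of its
# position transformed by the global (4x4) transforms of the joints it is
# attached to. Vertices, weights and transforms are placeholders.

N_VERTS, N_JOINTS = 7324, 58
rng = np.random.default_rng(0)
verts = rng.random((N_VERTS, 3))                       # rest-pose vertices (after blend shapes)
weights = rng.random((N_VERTS, N_JOINTS))
weights /= weights.sum(axis=1, keepdims=True)          # skinning weights sum to 1 per vertex
G = np.tile(np.eye(4), (N_JOINTS, 1, 1))               # per-joint global transforms (identity here)

def linear_blend_skinning(verts, weights, G):
    v_h = np.concatenate([verts, np.ones((len(verts), 1))], axis=1)   # homogeneous coordinates
    per_joint = np.einsum('jab,vb->jva', G, v_h)[..., :3]             # each joint moves every vertex
    return np.einsum('vj,jva->va', weights, per_joint)                # blend by skinning weights

posed = linear_blend_skinning(verts, weights, G)
print(np.allclose(posed, verts))   # True: identity transforms leave the mesh unchanged
```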

The complete combination of all 3D poses and body shapes would produce an enormous number of 3D human body samples. Currently, we randomly select 8 body shapes for each action sequence. We further add a random camera view for each sequence and render them as IUV image sequences using our IUV rasterizer (see Sec. 3.3), obtaining a dataset with 2,089,104 frames in total and fully paired ground truth of body shape, 3D pose and camera view. For the training/testing partition, we set the ratio to 90%/10%: we synthesize the training set with the first 2,201 Mixamo action sequences and 2,502 CAESAR body shapes, and leave the remaining 246 action sequences and 279 body shapes visible only to the testing set.

5 Experiments

We evaluate DenseRaC on several public large-scale benchmarks for three tasks: 3D pose estimation, body shape estimation and semantic body segmentation. We further assess 3D human reconstruction results (i.e., mesh-level reconstruction and joint & shape parameter estimation) on the proposed large synthetic dataset MOCA, which contains ground-truth 3D pose and body shape. Our experiments compare favorably to the state of the art. Estimated 3D poses and body shapes are stable on videos (see additional materials). Our qualitative results also show natural hand poses (e.g., opened, clenched).

5.1 Datasets

Figure 5: Pre-processed training samples from public benchmarks. Left: original image, right: estimated IUV image, ground-truth keypoint annotations (yellow) and dense landmarks (red).

We use five public human benchmarks plus our synthesized MOCA for model training and evaluation, i.e., LSP [21], MPII [2], COCO [28], H3.6M [17, 18] and MPI-INF-3DHP [36]. We adopt the standard training/validation/testing partitions on all datasets and calibrate the loss terms using cross-validation. When a certain dataset is used for evaluation, all data from the other datasets are used in training.

For all training and testing samples, we crop out each person from the whole image using ground-truth bounding boxes. All samples are resized to 150-180 pixels in height with preserved aspect ratio, and further adjusted to 224×224 by padding or cropping. We then run IUV image estimation [11] on all samples. Since a sample may contain multiple people and false alarms, we compute a saliency score for each detected person mask, defined in terms of the center of the person mask and the center of the image. We then pick the person mask with the largest saliency score and suppress the other detection responses.
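
The following sketch illustrates this selection and resizing step; the saliency score here is simply the negative distance between the mask center and the image center, an illustrative proxy for the score described above, and the resizing uses nearest-neighbour indexing to stay dependency-free.

```python
import numpy as np

# Minimal sketch of the person-selection and resizing step used in pre-processing.

def select_person(masks, image_hw):
    """Pick the detection whose mask center is most central (proxy saliency)."""
    cy, cx = np.array(image_hw) / 2.0
    scores = []
    for m in masks:
        ys, xs = np.nonzero(m)
        scores.append(-np.hypot(ys.mean() - cy, xs.mean() - cx))
    return int(np.argmax(scores))

def resize_and_pad(img, target=224, person_height=170):
    """Scale so the crop is ~150-180 px tall, then pad/crop to target x target."""
    scale = person_height / img.shape[0]
    h, w = int(img.shape[0] * scale), int(img.shape[1] * scale)
    ys = (np.arange(h) / scale).astype(int)                 # nearest-neighbour sampling
    xs = (np.arange(w) / scale).astype(int)
    small = img[ys][:, xs]
    out = np.zeros((target, target) + img.shape[2:], dtype=img.dtype)
    y0, x0 = (target - h) // 2, (target - w) // 2
    out[max(y0, 0):max(y0, 0) + min(h, target), max(x0, 0):max(x0, 0) + min(w, target)] = \
        small[:min(h, target), :min(w, target)]
    return out

masks = [np.zeros((240, 320), bool) for _ in range(2)]
masks[0][10:40, 10:40] = True            # off-center detection
masks[1][100:160, 140:200] = True        # detection near the image center
print(select_person(masks, (240, 320)))  # 1

crop = np.random.rand(300, 150, 3)       # a person crop from a bounding box
print(resize_and_pad(crop).shape)        # (224, 224, 3)
```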

For the training set, we further run pixel-to-surface matching (as described in Sec. 3.3) to create dense correspondences. We discard samples with fewer than 200 corresponding pairs, as IUV image estimation usually fails in such cases. As illustrated in Fig. 5, this pre-processing suppresses nuisances in the training samples quite well. During training, all samples are further augmented with random jittering of translation, scaling and reflection to improve model robustness. We also randomly black out a rectangular image region in the synthetic samples to simulate occlusion in realistic scenarios.
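
A minimal sketch of this augmentation is shown below; jitter ranges and probabilities are illustrative, scale jitter is omitted for brevity, and a full implementation would also swap left/right part labels after reflection.

```python
import numpy as np

# Minimal sketch of training-time augmentation: translation jitter, horizontal
# reflection, and a random blacked-out rectangle simulating occlusion on
# synthetic samples. Ranges and probabilities are illustrative only.

def augment(iuv, rng, jitter=10, p_flip=0.5, p_blackout=0.5):
    h, w = iuv.shape[:2]
    out = iuv.copy()
    if rng.random() < p_flip:
        out = out[:, ::-1]                                   # horizontal reflection
        # NOTE: a real IUV flip must also swap left/right part labels.
    out = np.roll(out, shift=(rng.integers(-jitter, jitter + 1),
                              rng.integers(-jitter, jitter + 1)), axis=(0, 1))
    if rng.random() < p_blackout:                            # simulated occlusion
        bh, bw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
        y0, x0 = rng.integers(0, h - bh), rng.integers(0, w - bw)
        out[y0:y0 + bh, x0:x0 + bw] = 0
    return out

rng = np.random.default_rng(0)
sample = rng.random((224, 224, 3))
print(augment(sample, rng).shape)
```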

To unify the skeleton structure across all datasets, we use the same 14 joints as in LSP for joint-related computations while maintaining our 58-joint skeleton in the backend.

5.2 Implementation Details

In these experiments, the whole framework is implemented in TensorFlow and runs on a DGX workstation with 2 Intel E5 CPUs, 512GB of memory and 8 Titan V100 GPUs. Data synthesis and pre-processing (i.e., IUV image estimation) are implemented with multi-GPU data parallelism. The multi-GPU renderer processes around 300 fps and takes 2 days to generate the 2 million MOCA samples (2.7TB in total). Data pre-processing on the realistic datasets takes 12 hours to prepare 0.8 million samples.

For learning, only a single GPU is used due to the difficulty of gradient transfer and a potential performance drop. We use a batch size of 128, separate learning rates for the generator and the discriminator, and Adam as the optimizer. Our full model is jointly trained on all datasets for 30 epochs. Empirically, for one batch, the forward pass takes around 15ms and the backward propagation around 130ms, of which IUV image render-and-compare accounts for 55ms of overhead. The total training process takes around a week to complete. For inference, IUV images are first estimated at around 15 fps and then the forward pass of our model is called, which runs at 120 fps, thus achieving real-time performance.

5.3 3D Pose Estimation

 

H3.6M                             Protocol #1   Protocol #2   Protocol #3
                                  MPJPE         MPJPE         MPJPE
Martinez et al. (ICCV'17) [34]    62.9          47.7          84.8
Fang et al. (AAAI'18) [8]         60.3          45.7          72.8
Rhodin et al. (CVPR'18) [48]      66.8          -             -
Yang et al. (CVPR'18) [60]        58.6          37.7          -
Hossain et al. (ECCV'18) [15]     51.9          42.0          -

Lassner et al. (CVPR'17) [26]     80.7          -             -
HMR (CVPR'18) [22]                88.0          56.8          77.3
Pavlakos et al. (CVPR'18) [42]    -             75.9          -
NBF (3DV'18) [39]                 -             59.9          -
DenseRaC baseline                 82.4          53.9          77.0
+ render-and-compare              79.5          51.4          75.9
+ synthetic data                  76.8          48.0          74.1

MPI-INF-3DHP                      Protocol #1              Protocol #2
                                  PCK    AUC    MPJPE      PCK    AUC    MPJPE
Mehta et al. (3DV'17) [36]        75.7   39.3   117.6      -      -      -
Mehta et al. (TOG'17) [37]        76.6   40.4   124.7      83.9   47.3   98.0
HMR (CVPR'18) [22]                72.9   36.5   124.2      86.3   47.8   89.8
DenseRaC baseline                 73.1   36.7   123.1      86.8   47.8   88.7
+ render-and-compare              74.7   38.6   124.9      87.5   48.3   86.7
+ synthetic data                  76.9   41.1   114.2      89.0   49.1   83.5

 

Table 1: Quantitative comparisons of mean per joint position error (MPJPE), PCK and AUC between the estimated 3D pose and ground truth on H3.6M under Protocol #1, #2, #3 and MPI-INF-3DHP under Protocol #1, #2. - indicates results not reported. Lower MPJPE, higher PCK and AUC indicate better performance. Best scores are marked in bold.
Figure 6: Qualitative comparison of results estimated by DenseRaC versus the state of the art [22, 39, 55]. DenseRaC estimates 3D human poses and body shapes closest to reality. Note that all examples come from the test set. Best viewed in color.

We first evaluate our method for the task of 3D pose estimation on H3.6M [18] and MPI-INF-3DHP [36] datasets.

For H3.6M, we use three evaluation protocols to measure performance: i) Protocol #1 uses 5 subjects (S1, S5, S6, S7 and S8) for training and 2 subjects (S9 and S11) for testing. Sequences are down-sampled to 10 fps and all 4 cameras and trials are used for evaluation. The error is measured between estimated and ground-truth 3D joints. ii) Protocol #2 uses the same subjects for training and testing as Protocol #1, while evaluation is only conducted on sequences captured from the frontal camera (i.e., "cam 3") of trial 1 on all frames. Predictions are post-processed via rigid transformations (i.e., per-frame Procrustes analysis) before comparison. iii) Protocol #3 uses the same subjects, frame rates and trials for training and testing as Protocol #1, except that camera views are further partitioned: the first three cameras (i.e., "cam 0, 1, 2") are used for training and the last camera (i.e., "cam 3") for testing.

 

UP-3D                            Body Part          Fg/Bg
                                 Accuracy   F1      Accuracy   F1
SMPL on DpCut (ECCV'16) [3]      87.7       0.64    91.9       0.88
SMPL, UP-P91 (ICCV'17) [26]      87.3       0.61    91.0       0.86
HMR (CVPR'18) [22]               87.1       0.60    91.7       0.87
BodyNet (ECCV'18) [55]           -          -       92.8       0.84
DenseRaC                         87.9       0.64    92.4       0.88

MOCA                             Body Part          Fg/Bg
                                 Accuracy   F1      Accuracy   F1
HMR (CVPR'18) [22]               86.6       0.19    92.1       0.60
DenseRaC                         89.3       0.27    96.4       0.68

 

Table 2: Quantitative comparisons of foreground and part segmentation on UP-3D and MOCA datasets. Accuracy unit is in %. - indicates results not reported. Best scores are marked in bold.

For MPI-INF-3DHP, we use all sequences from S1-S7 as training set and sequences from S8 as testing set. We regard Protocol #1 as the default comparison and Protocol #2 as applying rigid transformations before comparison.
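
For reference, the sketch below shows the rigid (Procrustes) alignment applied before computing MPJPE under these protocols; it implements the standard similarity alignment and is not taken from the paper's evaluation code.

```python
import numpy as np

# Minimal sketch of Procrustes-aligned MPJPE: rigidly align predicted 3D joints
# to the ground truth (rotation, isotropic scale, translation) before measuring
# the mean per joint position error.

def procrustes_aligned_mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions; returns MPJPE after similarity alignment."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g                  # center both point sets
    U, S, Vt = np.linalg.svd(X.T @ Y)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                  # avoid improper rotations (reflections)
        D[-1, -1] = -1
    R = U @ D @ Vt                                 # optimal rotation mapping X onto Y
    s = np.trace(np.diag(S) @ D) / np.sum(X ** 2)  # optimal isotropic scale
    aligned = s * X @ R + mu_g
    return float(np.mean(np.linalg.norm(aligned - gt, axis=1)))

rng = np.random.default_rng(0)
gt = rng.normal(size=(14, 3)) * 100.0              # 14 LSP-style joints (placeholder)
theta = 0.4
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
pred = 1.3 * gt @ Rz + 50.0                        # rotated, scaled, translated copy
print(procrustes_aligned_mpjpe(pred, gt))          # ~0: alignment removes the similarity transform
print(procrustes_aligned_mpjpe(gt + rng.normal(size=gt.shape), gt))
```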

We compare our method with both task-oriented 3D pose estimators representing the state of the art [50, 63, 34, 36, 37, 8, 48, 60, 15] and four parametric body model based estimators [26, 22, 42, 39]. We set up two baselines to validate the effectiveness of the two key components of the proposed framework: render-and-compare and joint learning with synthetic data. In "DenseRaC baseline", we use the SMPL model and the same losses as [22], only switching the input source from RGB images to IUV images. The variant "+ render-and-compare" adds the proposed dense render-and-compare losses (dense landmarks and part masks) to the framework. The variant "+ synthetic data" switches to our human body model and further uses the augmented synthetic data for joint learning.

As reported in Table 1, each component of DenseRaC contributes to the final performance and leads DenseRaC to outperform state-of-the-art parametric body model estimators by a large margin. Also note that DenseRaC is comparable with the latest task-oriented 3D pose estimators.

5.4 Human Body Segmentation

Given the rendered images of our outputs, we further employ semantic segmentation as another task to measure how similar the reconstructed 3D human body looks to the person in the input image. We evaluate human body segmentation on the LSP subset of UP-3D [26] and on MOCA. For UP-3D, we post-process our 24 body part masks by merging them into the annotated 6 body part masks (i.e., head, torso, left and right leg, and left and right arm) and evaluate body part and foreground segmentation, while on MOCA we evaluate both body part segmentation (ignoring 4 subtle body parts, i.e., hands and feet) and foreground segmentation. We measure segmentation accuracy and mean F1 score and report metrics and comparisons in Table 2. It can be observed that our method achieves comparable or better performance than the state of the art [3, 26, 22, 55] on all datasets.
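
For completeness, the sketch below shows one straightforward way to compute pixel accuracy and a macro-averaged F1 score from predicted and ground-truth label maps; the exact averaging conventions used in the benchmarks may differ.

```python
import numpy as np

# Minimal sketch of the segmentation metrics: pixel accuracy and per-class F1
# averaged over classes, computed between predicted and ground-truth label maps
# (0 = background, 1..K = body parts). Label maps below are random placeholders.

def accuracy_and_mean_f1(pred, gt, num_classes):
    acc = float(np.mean(pred == gt))
    f1s = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        f1s.append(2 * tp / (2 * tp + fp + fn + 1e-9))
    return acc, float(np.mean(f1s))

rng = np.random.default_rng(0)
gt = rng.integers(0, 7, (224, 224))      # 6 merged parts + background, as in the UP-3D protocol
pred = rng.integers(0, 7, (224, 224))
print(accuracy_and_mean_f1(pred, gt, 7))
```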

5.5 3D Human Body Reconstruction

 

Methods                        MPJPE   MPVPE   Param. MSE
HMR (CVPR'18) [22], unpaired   110.2   -       -
HMR (CVPR'18) [22], paired     91.9    -       -
DenseRaC,                      133.0   174.5   18.227
DenseRaC,                      131.5   173.6   17.820
DenseRaC,                      122.8   161.5   16.305
DenseRaC,                      107.9   142.3   13.608
DenseRaC,                      88.6    121.1   11.901
DenseRaC,                      86.5    119.8   10.496
DenseRaC,                      82.9    111.0   8.943
DenseRaC,                      82.4    110.7   8.722
DenseRaC,                      80.4    105.4   8.164
DenseRaC, full                 80.3    105.2   8.151

 

Table 3: Quantitative comparisons of MPJPE, MPVPE and pose & shape parameter mean square error (MSE) on the MOCA dataset. Lower values are better. See text for detailed explanations.

Since 3D pose estimation and semantic body segmentation evaluate only partial aspects of the reconstructed 3D human body, we further evaluate the reconstruction using two metrics on the MOCA dataset: mean per mesh vertex position error (MPVPE) and parameter regression error. These two metrics consider the 3D human body as a whole and give a better indication of how well it is reconstructed. For comparison, we re-train HMR to take IUV images as input, using 2D/3D joint supervision (i.e., only the 14 2D/3D joints in LSP format) and its original unpaired data (MoSh [30] on H3.6M and external Mocap) for the adversarial prior. As reported in Table 3, DenseRaC still significantly outperforms this competitive method.

Ablative Studies. We set up variants of DenseRaC to validate the effectiveness of each loss term. We also define two loss variants representing 14-joint-only keypoint reprojection and 3D reconstruction losses, respectively. From the results, we draw the following conclusions: i) all loss terms contribute to the final performance; ii) the losses used for dense render-and-compare provide richer information than those from sparse joints and greatly reduce impossible 3D body configurations; iii) when task-oriented loss terms (i.e., the 3D reconstruction and parameter regression losses) are given, the contribution of the dense render-and-compare scheme appears suppressed, yet such finer supervision helps DenseRaC reach a much better local optimum.

Empirical Studies. We present qualitative results and comparisons to better understand the merits of our method. As shown in Fig. 6, DenseRaC outperforms the other competitive methods and reconstructs more plausible and natural 3D human bodies. Notably, HMR, which relies on sparse landmarks, sometimes reconstructs a plausible 3D human body appearance but confuses the body front and back. Both NBF and BodyNet are sensitive to occlusions and heavy clothing. When fitting SMPL to such erroneously reconstructed volumes, BodyNet tends to produce highly non-human body shapes (we use results from 3D skeleton fitting for BodyNet, as volume fitting usually performs much worse). For all three methods, the estimated human bodies are arguably of an average body shape and insensitive to gender. We also searched for failure cases on the validation set, as shown in Fig. 7. DenseRaC suffers from errors in IUV estimation (e.g., occlusions, long-tail data), and is limited by the orthographic projection assumption and the SMPL-based human body representation.

We also explored virtual dressing, namely draping virtual clothing on the 3D human body, using our estimation of the body beneath clothing. As shown in Fig. 1 (top right) and Fig. 8, a cascaded framework that adds physical simulation of clothing is feasible [10, 25] and more visually acceptable than the end-to-end volumetric reconstruction of BodyNet.

Figure 7: Current limitations: heavy occlusions (first row), incorrect IUV estimations (second row) and under-represented body shapes like children (third row). Each triplet shows the original image, IUV from [11] (our model input), and our model output.
Figure 8: Comparisons for cascaded and end-to-end frameworks on the application of virtual dressing.

6 Conclusion

We propose DenseRaC, a new end-to-end framework for reconstructing the 3D human body from monocular RGB images in the wild. DenseRaC utilizes the pixel-to-surface correspondence map as a proxy representation and incorporates a dense render-and-compare scheme to minimize the gap between rendered outputs and inputs. We further boost model training with large-scale synthetic data (MOCA), mitigating the problem of unpaired training data. The proposed framework obtains superior performance. As future work, we will explore handling occlusion and interaction (e.g., by multi-view fusion [45] and temporal smoothing [43]).

Acknowledgements. We would like to thank Tengyu Liu and Elan Markowitz for helping with data collection, Tuur Jan M Stuyck and Aaron Ferguson for cloth simulation, Natalia Neverova and colleagues at FRL, FAIR and UCLA for their support and advice.

References

  • [1] A. Agarwal and B. Triggs (2006) Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (1), pp. 44–58. Cited by: §2.
  • [2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.1.
  • [3] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it SMPL: automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, Cited by: §2, §2, §3.5, §5.4, Table 2.
  • [4] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [5] P. Dayan, G. Hinton, R. Neal, and R. Zemel (1995) The helmholtz machine. Neural Computing. Cited by: §3.3.
  • [6] E. Dibra, H. Jain, C. Oztireli, R. Ziegler, and M. Gross (2016) HS-nets: estimating human body shape from silhouettes with convolutional neural networks. In International Conference on 3D Vision, Cited by: §2.
  • [7] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) RMPE: regional multi-person pose estimation. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [8] H. Fang, Y. Xu, W. Wang, X. Liu, and S. Zhu (2018) Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI Conference on Artificial Intelligence, Cited by: §2, §5.3, Table 1.
  • [9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Annual Conference on Neural Information Processing Systems, Cited by: §3.4.
  • [10] P. Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black (2012) DRAPE: dressing any person. In ACM SIGGRAPH, Cited by: §5.5.
  • [11] R. A. Guler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.2, §3.3, §3.4, §3.4, §3, Figure 7, §5.1.
  • [12] R. A. Guler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos (2017) DenseReg: fully convolutional dense shape regression in-the-wild. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [13] H. Hattori, V. Naresh Boddeti, K. M. Kitani, and T. Kanade (2015) Learning scene-specific pedestrian detectors without real data. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [15] M. R. I. Hossain and J. J. Little (2018) Exploiting temporal information for 3d human pose estimation. In European Conference on Computer Vision, Cited by: §2, §5.3, Table 1.
  • [16] Images and videos available at youtube.com, onlinedoctor.superdrug.com, shutterstock.com. Cited by: Figure 1.
  • [17] C. Ionescu, F. Li, and C. Sminchisescu (2011) Latent structured models for human pose estimation. In IEEE International Conference on Computer Vision, Cited by: §5.1.
  • [18] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3. 6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §2, §4, §5.1, §5.3.
  • [19] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz (2018) Hand pose estimation via latent 2.5d heatmap regression. In European Conference on Computer Vision, Cited by: §2.
  • [20] Z. Ji, X. Qi, Y. Wang, G. Xu, P. Du, and Q. Wu (2018) Shape-from-mask: a deep learning based human body shape reconstruction from binary mask images. arXiv preprint arXiv:1806.08485. Cited by: §2.
  • [21] S. Johnson and M. Everingham (2011) Learning effective human pose estimation from inaccurate annotation. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.1.
  • [22] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, §3.1, §3.4, §3.4, Figure 6, §5.3, §5.4, Table 1, Table 2, Table 3.
  • [23] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3, §3.3.
  • [24] A. Kundu, Y. Li, and J. Rehg (2018) 3D-rcnn: instance-level 3d object reconstruction via render-and-compare. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3.
  • [25] Z. Laehner, D. Cremers, and T. Tung (2018) DeepWrinkles: accurate and realistic clothing modeling. In European Conference on Computer Vision, Cited by: §5.5.
  • [26] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler (2017) Unite the people: closing the loop between 3d and 2d human representations. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §2, §4, §5.3, §5.4, Table 1, Table 2.
  • [27] S. Li, W. Zhang, and A. B. Chan (2015) Maximum-margin structured learning with deep networks for 3d human pose estimation. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [28] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, Cited by: §5.1.
  • [29] M. M. Loper and M. J. Black (2014) OpenDR: an approximate differentiable renderer. In European Conference on Computer Vision, Cited by: §3.3, §3.3.
  • [30] M. Loper, N. Mahmood, and M. Black (2014) MoSh: motion and shape capture from sparse markers. In SIGGRAPH Asia, Cited by: §3.4, §5.5.
  • [31] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics 34 (6), pp. 248. Cited by: §2, §3.5.
  • [32] D. C. Luvizon, D. Picard, and H. Tabia (2018) 2D/3d pose estimation and action recognition using multitask deep learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [33] S. Marschner and P. Shirley (2015) Fundamentals of computer graphics. In CRC Press, Cited by: §3.3.
  • [34] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In IEEE International Conference on Computer Vision, Cited by: §2, §5.3, Table 1.
  • [35] T. Matsuyama, S. Nobuhara, T. Takai, and T. Tung (2012) 3D video and its applications. In Springer, Cited by: §2.
  • [36] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In International Conference on 3D Vision, Cited by: §4, §5.1, §5.3, §5.3, Table 1.
  • [37] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. In ACM Transactions on Graphics, Cited by: §5.3, Table 1.
  • [38] B. X. Nie, P. Wei, and S. Zhu (2017) Monocular 3d human pose estimation by predicting depth on joints. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [39] M. Omran, C. Lassner, G. Pons-Moll, P. V. Gehler, and B. Schiele (2018) Neural body fitting: unifying deep learning and model-based human pose and shape estimation. In International Conference on 3D Vision, Cited by: §1, §1, §2, §3.2, §3.4, Figure 6, §5.3, Table 1.
  • [40] G. S. Paul, P. Viola, and T. Darrell (2003) Fast pose estimation with parameter-sensitive hashing. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [41] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [42] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis (2018) Learning to estimate 3d human pose and shape from a single color image. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.4, §5.3, Table 1.
  • [43] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.
  • [44] L. Pishchulin, A. Jain, M. Andriluka, T. Thormahlen, and B. Schiele (2012) Articulated people detection and pose estimation: reshaping the future. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [45] H. Qi, Y. Xu, T. Yuan, T. Wu, and S. Zhu (2018) Scene-centric joint parsing of cross-view videos. In AAAI Conference on Artificial Intelligence, Cited by: §6.
  • [46] H. Rahmani and A. Mian (2016) 3D action recognition from novel viewpoints. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [47] H. Rhodin, M. Salzmann, and P. Fua (2018) Unsupervised geometry-aware representation for 3d human pose estimation. In European Conference on Computer Vision, Cited by: §2.
  • [48] H. Rhodin, J. Sporri, I. Katircioglu, V. Constantin, F. Meyer, E. Muller, M. Salzmann, and P. Fua (2018) Learning monocular 3d human pose estimation from multi-view images. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.3, Table 1.
  • [49] G. Rogez and C. Schmid (2016) MoCap-guided data augmentation for 3d pose estimation in the wild. In Annual Conference on Neural Information Processing Systems, Cited by: §2.
  • [50] X. Sun, J. Shang, S. Liang, and Y. Wei (2017) Compositional human pose regression. In IEEE International Conference on Computer Vision, Cited by: §2, §5.3.
  • [51] V. Tan, I. Budvytis, and R. Cipolla (2017) Indirect deep structured learning for 3d human body shape and pose prediction. In British Machine Vision Conference, Cited by: §2.
  • [52] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua (2016) Structured prediction of 3D human pose with deep neural networks. In British Machine Vision Conference, Cited by: §2.
  • [53] H. Tung, H. Tung, E. Yumer, and K. Fragkiadaki (2017) Self-supervised learning of motion capture. In NIPS, Cited by: §2, §3.3.
  • [54] T. Tung, S. Nobuhara, and T. Matsuyama (2009) Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [55] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid (2018) BodyNet: volumetric inference of 3d human body shapes. In European Conference on Computer Vision, Cited by: §1, §2, §3.4, §4, Figure 6, §5.4, Table 2.
  • [56] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §4.
  • [57] W. Wang, Y. Xu, J. Shen, and S. Zhu (2018) Attentive fashion grammar network for fashion landmark detection and clothing category classification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [58] Y. Xu, L. Lin, W. Zheng, and X. Liu (2013) Human re-identification by matching compositional template with cluster sampling. In IEEE International Conference on Computer Vision, Cited by: §2.
  • [59] Y. Xu, L. Qin, X. Liu, J. Xie, and S. Zhu (2018) A causal and-or graph model for visibility fluent reasoning in tracking interacting objects. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [60] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang (2018) 3D human pose estimation in the wild by adversarial learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.3, Table 1.
  • [61] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall (2016) A dual-source approach for 3d pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [62] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis (2016) Sparseness meets deepness: 3d human pose estimation from monocular video. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [63] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In IEEE International Conference on Computer Vision, Cited by: §2, §5.3.
  • [64] S. Zuffi and M. J. Black (2015) The stitched puppet: a graphical model of 3d human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.